Law of total variance

$\mathrm{Var}(Y) = \mathrm{E}(\mathrm{Var}(Y|X)) + \mathrm{Var}(\mathrm{E}(Y|X)).$

Informally speaking, this can be viewed as the sum of the "unexplained" portion of $$\mathrm{Var}(Y)$$ and the portion of $$\mathrm{Var}(Y)$$ that is "explained" by $$X$$.
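As a sanity check, the decomposition can be verified on an empirical distribution (a minimal sketch; the two-group mixture below is an assumed toy model):

```python
import random

random.seed(0)

# Assumed toy model: X uniform on {0, 1}; Y | X = x ~ Normal(mu_x, sigma_x).
params = {0: (0.0, 1.0), 1: (3.0, 2.0)}   # group -> (mean, sd)
n = 10_000
xs = [random.randint(0, 1) for _ in range(n)]
ys = [random.gauss(*params[x]) for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

groups = {g: [y for x, y in zip(xs, ys) if x == g] for g in params}
p = {g: len(groups[g]) / n for g in params}
mu = {g: mean(groups[g]) for g in params}

within = sum(p[g] * var(groups[g]) for g in params)   # E(Var(Y|X))
between = var([mu[x] for x in xs])                    # Var(E(Y|X))

print(within + between, var(ys))   # the two sides agree
```

For the empirical distribution the identity holds exactly (up to floating point), since the within-group and between-group pieces partition the total sum of squares.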

The delta method

Let $$X_n$$ be a sequence of random variables and $$x$$ a constant such that
$\sqrt{n}(X_n - x) \xrightarrow{d} N(0, \sigma^2),$
where $$\xrightarrow{d}$$ denotes convergence in distribution.

The Taylor approximation
$g(X_n) - g(x) \approx g'(x) (X_n - x)$
suggests that, approximately,
$\sqrt{n}(g(X_n) - g(x)) \sim N(0, [g'(x)]^2 \sigma^2).$

Indeed it turns out that
$$\sqrt{n}(g(X_n) - g(x)) \xrightarrow{d} N(0, [g'(x)]^2 \sigma^2).$$
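A quick simulation sketch (assumed example: $$X_n$$ the mean of $$n$$ i.i.d. Exp(1) draws, so $$x = 1$$, $$\sigma^2 = 1$$, and $$g(x) = x^2$$ gives limit variance $$[g'(1)]^2 \sigma^2 = 4$$):

```python
import random
import statistics

random.seed(1)

n, reps = 400, 5_000
g = lambda x: x * x              # g'(1) = 2 -> limit variance 2**2 * 1 = 4

vals = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n   # X_n
    vals.append(n ** 0.5 * (g(xbar) - g(1.0)))

print(statistics.variance(vals))   # close to 4
```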

Perceptron [cont.] - history

Originally the perceptron was intended to be a physical machine rather than software -- i.e., the elements $$x_i$$ of the input vector are actual photocells, and all weights $$w_i$$ and the bias $$b$$ are actual potentiometers.

In 1958 The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
(en.wikipedia.org/wiki/Perceptr)

The perceptron is a binary classification algorithm. After the weights $$\mathbf{w} \in \mathbb{R}^m$$ and bias $$b$$ have been chosen via some optimization routine, the perceptron makes its prediction for any new input $$\mathbf{x} \in \mathbb{R}^m$$ via the formula
$\mathbf{x} \mapsto \begin{cases}0,\,\text{if}\,\mathbf{w}^T \mathbf{x} + b \leq 0,\\ 1,\,\text{if}\,\mathbf{w}^T \mathbf{x} + b > 0.\end{cases}$

The perceptron can be regarded as the simplest neural network.
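The decision rule is one line of code (a minimal sketch; the weight and bias values below are made up for illustration):

```python
def predict(w, b, x):
    """Perceptron decision rule: 1 if w^T x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b = [2.0, -1.0], -0.5                  # assumed example parameters
print(predict(w, b, [1.0, 1.0]))          # 2 - 1 - 0.5 = 0.5 > 0  -> 1
print(predict(w, b, [0.0, 0.0]))          # -0.5 <= 0              -> 0
```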

What I don't know is:
How do you define a (Lebesgue) measure on the space of $$p \times p$$ positive definite matrices to carry out the above integration?

Gamma function

For complex numbers with a positive real part the gamma function is defined as
$\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \mathrm{d}x.$
It extends the factorial function -- if $$n$$ is a positive integer then
$\Gamma(n) = (n-1)!$

The _multivariate_ gamma function $$\Gamma_p$$ further generalizes $$\Gamma$$:
$\Gamma_p(a) = \int_{S>0} e^{-\mathrm{tr}(S)} |S|^{a-(p+1)/2} \mathrm{d}S,$
where integration is over the space of $$p\times p$$ positive-definite real matrices.
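Both can be checked numerically; the sketch below uses the standard product formula $$\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^p \Gamma(a - (j-1)/2)$$, a known closed-form identity not derived in this post:

```python
import math

# Gamma extends the factorial: Gamma(n) = (n - 1)!
for n in range(1, 8):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

def multigamma(a, p):
    # Product form of the multivariate gamma function.
    return math.pi ** (p * (p - 1) / 4) * math.prod(
        math.gamma(a - (j - 1) / 2) for j in range(1, p + 1)
    )

print(multigamma(3.0, 1), math.gamma(3.0))   # p = 1 recovers Gamma
```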

Last toot says that if $$X$$ and $$Y$$ are real-valued random variables with the joint probability density function (p.d.f.) $$f_{X, Y}$$, then the p.d.f. of $$Z = Y - X$$ is given by
$f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z+x) \mathrm{d}x.$

Now let $$W = |X - Y|$$. Then the p.d.f. of $$W$$ is given by
$f_W(w) = f_Z(w) + f_Z(-w),$
if $$w > 0$$, and $$f_W(w) = 0$$ if $$w \leq 0$$.

[Proof by observing that $$P(W \leq w)$$ is $$P(0 \leq Z \leq w) + P(-w \leq Z \leq 0)$$.]
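A numerical check, under the assumption that $$X$$ and $$Y$$ are i.i.d. Exp(1): then $$Z$$ is Laplace with $$f_Z(z) = e^{-|z|}/2$$, and hence $$f_W(w) = e^{-w}$$ for $$w > 0$$.

```python
import math

# Joint p.d.f. of two i.i.d. Exp(1) variables.
def f_joint(x, y):
    return math.exp(-x - y) if x >= 0 and y >= 0 else 0.0

def f_Z(z, h=1e-3, upper=40.0):
    # f_Z(z) = integral of f_{X,Y}(x, z + x) dx, as a Riemann sum.
    return h * sum(f_joint(i * h, z + i * h) for i in range(int(upper / h)))

def f_W(w):
    return f_Z(w) + f_Z(-w) if w > 0 else 0.0

print(f_W(1.0), math.exp(-1.0))   # the two values agree closely
```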

Let $$X$$ and $$Y$$ be two independent real-valued random variables with probability density functions (p.d.f.) $$f_X$$ and $$f_Y$$ respectively. Let $$Z = Y - X$$. Then the p.d.f. of the random variable $$Z$$ is given by
$f_Z(z) = \int_{-\infty}^\infty f_X(x) f_Y(z + x) \mathrm{d}x.$

In the field of signal processing this is referred to as the cross-correlation of the p.d.f.s $$f_X$$ and $$f_Y$$.

[proof by writing $$P(Z \leq z)$$ as a double integral + Tonelli's thm + Fundamental Thm of Calc]
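A numerical sketch with an assumed Gaussian example, $$X \sim N(0,1)$$ and $$Y \sim N(1,1)$$ independent, so that $$Z = Y - X \sim N(1, 2)$$:

```python
import math

def phi(t, mu=0.0, sd=1.0):
    # Normal density.
    return math.exp(-((t - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

def f_Z(z, h=1e-2, lo=-12.0, hi=12.0):
    # Cross-correlation: integral of f_X(x) f_Y(z + x) dx (Riemann sum).
    return h * sum(phi(lo + i * h) * phi(z + lo + i * h, mu=1.0)
                   for i in range(int((hi - lo) / h)))

print(f_Z(1.0), phi(1.0, mu=1.0, sd=math.sqrt(2)))   # both ~0.282
```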

Let $$X$$ and $$Y$$ be real-valued random variables with the joint probability density function (p.d.f.) $$f_{X,Y}$$. Let $$Z = X + Y$$. Then the p.d.f. of the random variable $$Z$$ is given by
$f_Z(z) = \int_{-\infty}^\infty f_{X, Y}(x, z - x) \mathrm{d}x.$

An important special case: If $$X$$ and $$Y$$ are stochastically independent with p.d.f.s $$f_X$$ and $$f_Y$$, then $$f_Z$$ is the convolution of $$f_X$$ and $$f_Y$$.
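A numerical check of the special case, assuming $$X, Y \sim \mathrm{Uniform}(0,1)$$ independent, where the convolution is the triangular density $$f_Z(z) = z$$ on $$[0,1]$$ and $$2-z$$ on $$[1,2]$$:

```python
def f_U(t):
    # Uniform(0, 1) density.
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def f_Z(z, h=1e-4):
    # Convolution integral of f_X(x) f_Y(z - x) dx over x in [0, 1].
    return h * sum(f_U(i * h) * f_U(z - i * h) for i in range(int(1 / h) + 1))

print(f_Z(0.5), f_Z(1.0), f_Z(1.5))   # ~0.5, ~1.0, ~0.5
```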

t-SNE [$$(2+\varepsilon) / 2$$]

What t-SNE plots look like: distill.pub/2016/misread-tsne/

t-SNE [2/2]

The t-SNE algorithm comprises two main steps:
(1) t-SNE constructs a probability distribution over pairs of high-dim. data points such that pairs of similar data points have a high probability of being picked. Similarity is typically evaluated using a Gaussian kernel.
(2) t-SNE defines a probability distribution over the points in the low-dim. map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
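Step (1) can be sketched in a few lines (a simplification: fixed kernel bandwidth $$\sigma$$ and made-up toy data, whereas real t-SNE tunes a per-point $$\sigma_i$$ to match a target perplexity):

```python
import math
import random

random.seed(0)

# Assumed toy data: 5 points in 10 dimensions.
points = [[random.gauss(0, 1) for _ in range(10)] for _ in range(5)]
sigma = 1.0
n = len(points)

def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Conditional similarities p_{j|i} from a Gaussian kernel.
P = []
for i in range(n):
    w = [0.0 if j == i else
         math.exp(-sqdist(points[i], points[j]) / (2 * sigma ** 2))
         for j in range(n)]
    s = sum(w)
    P.append([wj / s for wj in w])

# Symmetrized joint probabilities, as t-SNE uses them.
P_joint = [[(P[i][j] + P[j][i]) / (2 * n) for j in range(n)] for i in range(n)]
print(sum(map(sum, P_joint)))   # a valid distribution: sums to 1
```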

t-SNE [1/2]

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for visualization. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The number of failures per unit of time is called the _failure rate_. Let $$T$$ be the time to first failure. The failure rate need not be constant across time, so over a small time interval of length $$dt$$ it is
$\lambda(t) = \frac{P(t\leq T< t+dt \vert T\geq t)}{dt}.$
Taking the limit $$dt \to 0$$ we obtain the instantaneous failure rate
$h(t) = \lim_{dt \to 0} \frac{P(t\leq T < t+dt \vert T\geq t)}{dt},$
also known as the _hazard function_.
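For an exponential lifetime the hazard is constant (memorylessness); a quick check of the definition, with an assumed rate $$\lambda = 0.5$$:

```python
import math

lam = 0.5                                  # assumed failure rate
def F(t):
    # C.d.f. of an Exp(lam) lifetime T.
    return 1 - math.exp(-lam * t)

def hazard(t, dt=1e-6):
    # P(t <= T < t + dt | T >= t) / dt for small dt.
    return (F(t + dt) - F(t)) / ((1 - F(t)) * dt)

print(hazard(0.3), hazard(2.0))   # both ~ lam = 0.5
```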

Consider a matrix $$A\in\mathbb{R}^{m\times n}$$. Then $$x \mapsto Ax$$ is a linear mapping from $$\mathbb{R}^n$$ to $$\mathbb{R}^m$$. There is an $$r\leq n$$ (the rank of $$A$$), such that:

1. $$A$$ maps an $$r$$-dim. subspace of $$\mathbb{R}^n$$ onto an $$r$$-dim. subspace of $$\mathbb{R}^m$$ (the column space or image of $$A$$).

2. The orthogonal complement of that subspace, the $$(n-r)$$-dimensional null space of $$A$$, is mapped to $$0$$.

Figure from Strang (1993) "The Fundamental Theorem of Linear Algebra".
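A numerical illustration (a sketch; the rank-2 matrix below is an assumed random example). Building $$A$$ with rank $$r = 2$$, the SVD exposes both subspaces:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank r

U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))            # recovers r = 2

# The last n - r rows of Vt span the null space: A sends them to ~0.
null_basis = Vt[rank:]
print(rank, np.max(np.abs(A @ null_basis.T)))   # 2, ~0
```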

If we know $$\mathrm{P}^{(Y|X)}$$, the conditional distribution of $$Y$$ for given $$X$$, and $$\mathrm{P}^X$$, the marginal distribution of $$X$$, then we can generate samples $$y_1, y_2, \dots, y_n$$ from $$\mathrm{P}^Y$$, the marginal distribution of $$Y$$: for each $$i=1, 2, \dots, n$$, sample a value $$x_i$$ from $$\mathrm{P}^X$$ and use it to sample $$y_i$$ from $$\mathrm{P}^{(Y|X=x_i)}$$.
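A minimal sketch with an assumed model $$X \sim N(0,1)$$ and $$Y\,|\,X=x \sim N(x,1)$$, so that marginally $$Y \sim N(0,2)$$:

```python
import random
import statistics

random.seed(0)

n = 100_000
ys = []
for _ in range(n):
    x = random.gauss(0, 1)           # x_i from P^X
    ys.append(random.gauss(x, 1))    # y_i from P^(Y|X=x_i)

print(statistics.mean(ys), statistics.variance(ys))   # ~0 and ~2
```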

Let $$\{P_\theta | \theta\in\Theta\}$$ be a family of probability distributions. Let $$X = (X_1, \dots, X_n)$$ be a sample from $$P_\theta$$. Say our goal is to estimate the value of $$\theta$$ based on $$X$$. A statistic $$T(X)$$ is called _sufficient_ for $$\theta$$ if $$P_\theta(X \in A|T=t)$$ does not depend on $$\theta$$ for any (measurable) set $$A$$.

Practically this means that for any decision procedure that is based on $$X$$ there is a decision procedure based on $$T(X)$$ that is equally good or better.
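A Monte Carlo illustration with Bernoulli samples, where $$T = \sum_i X_i$$ is sufficient: conditionally on $$T = t$$, $$P(X_1 = 1 | T = t) = t/n$$ whatever $$\theta$$ is (the particular $$\theta$$ values below are arbitrary):

```python
import random

random.seed(0)

def cond_prob_x1(theta, n=5, t=2, reps=100_000):
    # Estimate P(X_1 = 1 | sum(X) = t) for a Bernoulli(theta) sample.
    hits = first_one = 0
    for _ in range(reps):
        xs = [1 if random.random() < theta else 0 for _ in range(n)]
        if sum(xs) == t:
            hits += 1
            first_one += xs[0]
    return first_one / hits

p_a, p_b = cond_prob_x1(0.3), cond_prob_x1(0.7)
print(p_a, p_b)   # both ~ t/n = 0.4, independent of theta
```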

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 159:

"In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change."

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 159:

"By a conditional probability distribution of $$Y$$ for given $$X$$ is meant a function $$q$$ of two variables, a point $$x$$ and a set $$B$$, such that

(i) for a fixed set $$B$$

$q(X, B) = P\{ Y \in B | X \}$

is a conditional probability of the event $$\{Y \in B\}$$ for given $$X$$.

(ii) $$q$$ is for each $$x$$ a probability distribution."

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 157:

"By [a conditional probability of the event $$Y \in B$$ for given $$X$$] is meant a function $$q(X, B)$$ such that for every set $$A \in \mathbb{R}$$

$P\{X \in A, Y \in B\} = \int_A q(x, B) \mu(dx)$

where $$\mu$$ is the marginal distribution of $$X$$."

For a function $$f:\mathcal{X}\to\mathbb{R}$$ the convex conjugate function $$f^\ast$$ is defined as

$f^\ast(x^\ast) := \sup_{x\in\mathcal{X}} \{\langle x^\ast, x \rangle - f(x)\}.$

If $$f$$ is strictly convex then the convex conjugate has a simple geometric interpretation: the boundary of the epigraph $$\Omega := \{(x, z) | x\in\mathcal{X}, z \geq f(x)\}$$ can be equivalently represented/encoded via the values

$g(y) := \sup_{x\in\mathcal{X}} \{ \langle x, y \rangle - f(x) \}$

for all slopes $$y$$ -- and $$g$$ is exactly $$f^\ast$$.
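A numerical sketch for the assumed example $$f(x) = x^2/2$$, which is self-conjugate ($$f^\ast(y) = y^2/2$$):

```python
def conjugate(f, y, lo=-10.0, hi=10.0, steps=200_001):
    # f*(y) = sup_x { y*x - f(x) }, approximated over a grid.
    h = (hi - lo) / (steps - 1)
    return max(y * (lo + i * h) - f(lo + i * h) for i in range(steps))

f = lambda x: x * x / 2
print(conjugate(f, 3.0), conjugate(f, -2.0))   # ~4.5 and ~2.0
```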