Law of total variance in #statistics and #probability

\[\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X)).\]

Informally speaking, this decomposes \(\mathrm{Var}(Y)\) into an "unexplained" portion (the expected within-group variance, the first term) and a portion "explained" by \(X\) (the variance of the group means, the second term).
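
A quick Monte Carlo sanity check (a minimal sketch; the two-group mixture below is my own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X picks one of two groups; Y is Gaussian with group-dependent mean and scale.
x = rng.integers(0, 2, size=n)
y = np.where(x == 0, rng.normal(0.0, 1.0, n), rng.normal(3.0, 2.0, n))

p = np.array([np.mean(x == k) for k in (0, 1)])    # empirical P(X = k)
mu = np.array([y[x == k].mean() for k in (0, 1)])  # E(Y | X = k)
v = np.array([y[x == k].var() for k in (0, 1)])    # Var(Y | X = k)

within = p @ v                      # E(Var(Y|X)), the "unexplained" part
between = p @ mu**2 - (p @ mu)**2   # Var(E(Y|X)), the "explained" part
print(y.var(), within + between)    # identical up to floating-point error
```

With empirical group weights this decomposition holds exactly (it is the ANOVA identity), not just in the limit.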

The delta method in #statistics and #probability

Let \(X_n\) be a sequence of random variables and \(x\) a constant such that

\[\sqrt{n}(X_n - x) \xrightarrow{d} N(0, \sigma^2), \]

where \(\xrightarrow{d}\) denotes convergence in distribution.

For a function \(g\) that is differentiable at \(x\), the first-order Taylor approximation

\[g(X_n) - g(x) \approx g'(x) (X_n - x)\]

suggests something like

\[\sqrt{n}(g(X_n) - g(x)) \overset{\text{approx.}}{\sim} N(0, [g'(x)]^2 \sigma^2).\]

Indeed it turns out that

\(\sqrt{n}(g(X_n) - g(x)) \xrightarrow{d} N(0, [g'(x)]^2 \sigma^2).\)
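
A minimal simulation sketch (my own illustrative choice of \(g(t) = t^2\) and Exp(1) data):

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 5_000, 1_000

# X_n = sample mean of n Exp(1) draws: sqrt(n)(X_n - 1) -> N(0, 1), so sigma^2 = 1.
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

# With g(t) = t^2 and g'(1) = 2, the delta method predicts
# sqrt(n)(g(X_n) - g(1)) -> N(0, 4), i.e. standard deviation 2.
z = np.sqrt(n) * (xbar**2 - 1.0)
print(z.mean(), z.std())  # should be close to 0 and 2
```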

Perceptron [cont.] - history

Originally the perceptron was intended to be a physical machine rather than software -- i.e., the elements \(x_i\) of the input vector were actual photocells, and the weights \(w_i\) and bias \(b\) were actual potentiometers.

In 1958 The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

(https://en.wikipedia.org/wiki/Perceptron#History)

The perceptron is a binary classification algorithm. After the weights \(\mathbf{w} \in \mathbb{R}^m\) and bias \(b\) have been chosen via some optimization routine, the perceptron makes its prediction for any new input \(\mathbf{x} \in \mathbb{R}^m\) via the formula

\[\mathbf{x} \mapsto \begin{cases}0,\,\text{if}\,\mathbf{w}^T \mathbf{x} + b \leq 0,\\ 1,\,\text{if}\,\mathbf{w}^T \mathbf{x} + b > 0.\end{cases}\]

The perceptron can be regarded as the simplest neural network.
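
A minimal sketch of the prediction rule in Python (names are mine):

```python
import numpy as np

def perceptron_predict(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """Return 1 if w^T x + b > 0, else 0."""
    return int(w @ x + b > 0)

# Example: weights implementing logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_predict(w, b, np.array(x, dtype=float)))
```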

What I don't know is:

How do you define a (Lebesgue) measure on the space of \(p \times p\) positive definite matrices to carry out the above integration?

Gamma function

For complex numbers with a positive real part, the gamma function is defined as

\[\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \mathrm{d}x.\]

It extends the factorial function -- if \(n\) is a positive integer then

\[\Gamma(n) = (n-1)!\]
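
A quick check of the factorial relation (standard library only):

```python
import math

# Gamma extends the factorial: Gamma(n) == (n-1)! for positive integers n.
for n in range(1, 10):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))
    print(n, math.gamma(n))
```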

The _multivariate_ gamma function \(\Gamma_p\) further generalizes \(\Gamma\):

\[\Gamma_p(a) = \int_{S>0} e^{-\mathrm{tr}(S)} |S|^{a-(p+1)/2} \mathrm{d}S,\]

where integration is over the space of \(p\times p\) positive-definite real matrices.
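
For reference (a standard identity, not from the original toot), \(\Gamma_p\) reduces to a product of ordinary gamma functions, which is also how it is usually computed (e.g. scipy.special.multigammaln works with its logarithm):

\[\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^p \Gamma\!\left(a - \frac{j-1}{2}\right).\]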

Last toot says that if \(X\) and \(Y\) are real-valued random variables with the joint probability density function (p.d.f.) \(f_{X, Y}\), then the p.d.f. of \(Z = Y - X\) is given by

\[f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z+x) \mathrm{d}x.\]

Now let \(W = |X - Y|\). Then the p.d.f. of \(W\) is given by

\[f_W(w) = f_Z(w) + f_Z(-w),\]

if \(w > 0\), and \(f_W(w) = 0\) if \(w \leq 0\).

[Proof by observing that \(P(W \leq w)\) is \(P(0 \leq Z \leq w) + P(-w \leq Z \leq 0)\).]
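
A Monte Carlo sanity check (my own choice of independent standard normal marginals, so that \(Z = Y - X \sim N(0, 2)\)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 1_000_000))  # independent N(0,1), so Z = Y - X ~ N(0, 2)
w = np.abs(x - y)

# Estimate f_W by a histogram and compare with f_Z(w) + f_Z(-w).
hist, edges = np.histogram(w, bins=50, range=(0.0, 5.0), density=True)
mid = (edges[:-1] + edges[1:]) / 2
f_w = norm.pdf(mid, scale=np.sqrt(2)) + norm.pdf(-mid, scale=np.sqrt(2))
print(np.max(np.abs(hist - f_w)))  # small, up to Monte Carlo noise
```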

Let \(X\) and \(Y\) be two independent real-valued random variables with probability density functions (p.d.f.) \(f_X\) and \(f_Y\) respectively. Let \(Z = Y - X\). Then the p.d.f. of the random variable \(Z\) is given by

\[f_Z(z) = \int_{-\infty}^\infty f_X(x) f_Y(z + x) \mathrm{d}x.\]

In the field of signal processing this is referred to as cross-correlation of the p.d.f.s \(f_X\) and \(f_Y\).

[proof by writing \(P(Z \leq z)\) as a double integral + Tonelli's thm + Fundamental Thm of Calc]
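
A numerical sanity check (my own choice of Exp(1) marginals, for which \(Z = Y - X\) is Laplace-distributed):

```python
import numpy as np

# X, Y iid Exp(1): the cross-correlation formula should yield the Laplace
# density f_Z(z) = exp(-|z|)/2. Check by numerical integration.
f = lambda t: np.where(t >= 0, np.exp(-np.clip(t, 0, None)), 0.0)  # Exp(1) p.d.f.
xs = np.linspace(0.0, 50.0, 500_001)
dx = xs[1] - xs[0]

for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    integral = np.sum(f(xs) * f(z + xs)) * dx  # Riemann sum of f_X(x) f_Y(z + x)
    print(z, integral, np.exp(-abs(z)) / 2)    # the two columns should match
```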

Let \(X\) and \(Y\) be real-valued random variables with the joint probability density function (p.d.f.) \(f_{X,Y}\). Let \(Z = X + Y\). Then the p.d.f. of the random variable \(Z\) is given by

\[f_Z(z) = \int_{-\infty}^\infty f_{X, Y}(x, z - x) \mathrm{d}x.\]

An important special case: If \(X\) and \(Y\) are stochastically independent with p.d.f.s \(f_X\) and \(f_Y\), then \(f_Z\) is the convolution of \(f_X\) and \(f_Y\).
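
A discrete sanity check of the convolution special case (uniform marginals, my own choice):

```python
import numpy as np

# X, Y iid Uniform(0,1): f_Z should be the triangular density on [0, 2].
dx = 0.001
xs = np.arange(0.0, 1.0, dx)
f_x = f_y = np.ones_like(xs)        # Uniform(0,1) p.d.f. on its support

f_z = np.convolve(f_x, f_y) * dx    # discretized convolution integral
zs = np.arange(len(f_z)) * dx
triangle = np.where(zs <= 1, zs, 2 - zs)
print(np.max(np.abs(f_z - triangle)))  # O(dx) discretization error
```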

t-SNE [\((2+\varepsilon) / 2\)]

What t-SNE plots look like: https://distill.pub/2016/misread-tsne/

t-SNE [2/2]

The t-SNE algorithm comprises two main steps:

(1) t-SNE constructs a probability distribution over pairs of high-dim. data points such that pairs of similar data points have a high probability of being picked. Similarity is typically evaluated using a Gaussian kernel.

(2) t-SNE defines a probability distribution over the points in the low-dim. map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
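
A minimal usage sketch with scikit-learn's implementation (dataset and parameters are arbitrary choices of mine):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 1797 points in 64 dimensions
emb = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
print(emb.shape)                     # (1797, 2): the low-dim. map coordinates
```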

t-SNE [1/2]

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for visualization. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The average failure rate over the interval \([t, t+dt)\), given survival up to time \(t\), is

\[\frac{P(t\leq T< t+dt \vert T\geq t)}{dt}.\]

Taking the limit \(dt \to 0\) we obtain the instantaneous failure rate

\[h(t) = \lim_{dt \to 0} \frac{P(t\leq T < t+dt \vert T\geq t)}{dt},\]

a.k.a. the _hazard_ function.
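
In terms of the density \(f\) and the survival function \(S(t) = P(T > t)\) this is \(h(t) = f(t)/S(t)\) (a standard identity, not from the toot). For example, for \(T \sim \mathrm{Exp}(\lambda)\),

\[h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda,\]

a constant hazard, reflecting the memorylessness of the exponential distribution.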

Consider a matrix \(A\in\mathbb{R}^{m\times n}\). Then \(x \mapsto Ax\) is a linear mapping from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). There is an \(r \leq \min(m, n)\), the rank of \(A\), such that:

1. \(A\) maps an \(r\)-dim. subspace of \(\mathbb{R}^n\) (the row space of \(A\)) onto an \(r\)-dim. subspace of \(\mathbb{R}^m\) (the column space or image of \(A\)).

2. The orthogonal complement of the row space, an \((n-r)\)-dimensional subspace of \(\mathbb{R}^n\) called the null space of \(A\), is mapped to 0.

Figure from Strang (1993) "The Fundamental Theorem of Linear Algebra".
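
A small numerical illustration (the matrix is an arbitrary example of mine):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # m = 2, n = 3; second row is twice the first

r = np.linalg.matrix_rank(A)
N = null_space(A)               # orthonormal basis of the null space
print(r, N.shape[1])            # r = 1 and n - r = 2
print(np.max(np.abs(A @ N)))    # ~0: the null space is mapped to 0
```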

Let \(\{P_\theta | \theta\in\Theta\}\) be a family of probability distributions. Let \(X = (X_1, \dots, X_n)\) be a sample from \(P_\theta\). Say our goal is to estimate the value of \(\theta\) based on \(X\). A statistic \(T(X)\) is called _sufficient_ for \(\theta\) if \(P_\theta(X \in A|T=t)\) does not depend on \(\theta\) for any \(A\).

Practically this means that for any decision procedure that is based on \(X\) there is a decision procedure based on \(T(X)\) that is equally good or better.
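
A standard example (mine, not from the toot): for \(X_1, \dots, X_n\) i.i.d. Bernoulli(\(\theta\)) the statistic \(T(X) = \sum_i X_i\) is sufficient, since for any sequence \(x\) with \(\sum_i x_i = t\)

\[P_\theta(X = x | T = t) = \frac{\theta^t (1-\theta)^{n-t}}{\binom{n}{t} \theta^t (1-\theta)^{n-t}} = \binom{n}{t}^{-1},\]

which does not depend on \(\theta\).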

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 159:

"In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change."

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 159:

"By a conditional probability distribution of \(Y\) for given \(X\) is meant a function \(q\) of two variables, a point \(x\) and a set \(B\), such that

(i) for a fixed set \(B\)

\[q(X, B) = P\{ Y \in B | X \}\]

is a conditional probability of the event \(\{Y \in B\}\) for given \(X\).

(ii) \(q\) is for each \(x\) a probability distribution."

Feller Volume II ("An Introduction to Probability Theory and its Applications"), p. 157:

"By [a conditional probability of the event \(Y \in B\) for given \(X\)] is meant a function \(q(X, B)\) such that for every set \(A \in \mathbb{R}\)

\[P\{X \in A, Y \in B\} = \int_A q(x, B) \mu(dx)\]

where \(\mu\) is the marginal distribution of \(X\)."

For a function \(f:\mathcal{X}\to\mathbb{R}\) the convex conjugate function \(f^\ast\) is defined as

\[f^\ast(x^\ast) := \sup_{x\in\mathcal{X}} \{\langle x^\ast, x \rangle - f(x)\}.\]

If \(f\) is strictly convex then the convex conjugate has a simple geometric interpretation: The boundary of the epigraph \(\Omega := \{(x, z) | x\in\mathcal{X}, z \geq f(x)\}\) can be equivalently represented/encoded via the values

\[g(y) := \sup_{x\in\mathcal{X}} \{ \langle x, y \rangle - f(x) \}\]

for all \(y\); note that \(g\) is exactly \(f^\ast\), and \(-g(y)\) is the intercept of the supporting hyperplane of \(\Omega\) with slope \(y\).
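
A numerical sketch (my own example \(f(x) = e^x\), whose conjugate is known in closed form):

```python
import numpy as np

# Grid approximation of the conjugate of f(x) = exp(x), whose closed form
# is f*(y) = y*log(y) - y for y > 0.
xs = np.linspace(-20.0, 5.0, 200_001)

def conjugate(y: float) -> float:
    return np.max(y * xs - np.exp(xs))  # sup over the grid (approximate)

for y in (0.5, 1.0, 2.0, 5.0):
    print(y, conjugate(y), y * np.log(y) - y)  # columns should nearly match
```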

Toots of random math/stats/ML trivia that I find interesting.

Joined Aug 2018