EM algorithm [2/2]

Here is the formal general statement.

Let \(X\) be the observed data, and let \(Z\) be the unobserved data. Let \(l(\theta; X, Z)\) be the log-likelihood of the complete data \((X, Z)\) where \(\theta\) is the parameter vector of interest.

With initial guess \(\hat{\theta}^0\) repeat until convergence:

1. E-step: Compute

\[Q(\theta, \hat{\theta}^j) = E(l(\theta; X, Z) | X, \hat{\theta}^j).\]

2. M-step:

\[\hat{\theta}^{j+1} = \arg\!\max_\theta Q(\theta, \hat{\theta}^j).\]

EM algorithm [1/2]

Let's say we want to estimate parameters θ based on data X according to some statistical model f. But assume that actually

f is f(X, Z; θ)

i.e., there are some unobserved variables Z which influence X.

The Expectation-Maximization (EM) algorithm roughly repeats the following steps:

1. (E-step) Based on the current estimate of θ compute the expected value of Z.

2. (M-step) Obtain a new estimate of θ based on f where Z is replaced with the expected value from step 1.

Let's say the human's overall classification accuracy is 0.8, and on a given dataset of n=100 cases the AI agrees on m=74 of those with the human.

What is α, the classification accuracy of the AI?

Assume that whether AI makes a correct classification decision is stochastically independent of whether the human's correct. Then

74 = m = 0.8nα + 0.2n(1-α)

and thus

α = 0.9.

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happaning. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].

More along these lines: https://stats.stackexchange.com/a/47782

Some of the many faces of the Jensen's inequality

For a real convex function \(\phi\):

\(

\phi(\sum x_i / n) \leq \sum \phi(x_i) / n

\)

\(

\phi\left( \frac{1}{b-a} \int_a^b g(x) dx \right) \leq

\)

\(\leq \frac{1}{b-a} \int_a^b \phi(g(x)) dx

\)

\(

\phi(\mathrm{E}(X)) \leq \mathrm{E}(\phi(X))

\)

\(

\phi(\mathrm{E}(X | G)) \leq \mathrm{E}(\phi(X) | G)

\)

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra

Let A be an n × m matrix with n > m that has linearly independent columns.

Consider the eq. Ax = b, where b is *not* in the column space. Then Ax = b cannot be solved. Instead we can aim at minimizing the error (b - Ax).

The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.

Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².

Fig. from Strang (1993)

Consider an (unfair) coin with probability of heads P(H) = p.

Consider the events:

A = "(r+s) coin tosses result in r or more heads"

B = "tossing the coin repeatedly, until a total of r heads appear, results in a total s or fewer tails"

It holds that

P(A) = P(B).

Or in "math":

If X~Bin(s+r, p) and Y~NBin(r, p) then P(X ≥ r) = P(Y ≤ s)

Proof:

P(A) = P(H appears ≥ r times in s+r tosses)

= P(H appears r times in ≤ s+r tosses)

= P(T appears ≤ s times before H appears r times)

=P(B)

Bias-Variance Decomposition

Suppose that Y = f(X) + ε with noise term ε having Var(ε) = σ².

Let g(X;θ) be a trained/fitted model used to predict Y based on X (i.e., ideally g ≈ f), where θ represents the vector of trainable model parameters.

Consider the expected squared prediction error for a new input point x, and denote y = (Y|X=x). Then

Sq.Err. = E((y - g(x;θ)²)

= σ² + [E(y) - E(g(x;θ)]² + E([g(x;θ) - E(g(x;θ))]²)

= "irreducible error" + Bias²(g(x;θ)) + Variance(g(x;θ))

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) why Random Search is often superior to Grid Search for purposes of parameter choice -- Random Search gives by far the better approximations to the important univariate parameter distributions.

Turns out an ancient paper(*) has the answer.

If z = u₁ + iv₁ and w = u₂ + iv₂, where u₁, u₂, v₁, v₂ ~ N(0,1) (and independent), then the probability density of

r := |wz|

is given by

rK₀(r),

where K₀ denotes the modified Bessel function of the second kind with order 0.

(*) Wells, Anderson, Cell (1962) "The Distribution of the Product of Two Central or Non-Central Chi-Square Variates"

z = u₁ + iv₁ and

w = u₂ + iv₂,

where u₁, v₁, u₂, v₂ are independent standard normal random variables (N(0,1)).

Then what is the probability distribution of the absolute value of the product |zw|?

Some empirical investigation (simulation) shows that the distribution looks like this:

DNNs off the top of my head [3/3]

"Backpropagation":

\( \delta_r = dC / d z_r \in \mathbb{R}^c \)

\( \delta_{r-1} = dC / d z_{r-1} = \delta_r \cdot d z_r / d z_{r-1} \in \mathbb{R}^{h_{r-1}} \)

\( \vdots \)

\( \delta_1 = dC / d z_1 = \delta_2 \cdot d z_2 / d z_1 \in \mathbb{R}^{h_1} \)

(the δs get reused top to bottom - hence "backpropagation").

Finally,

\( dC / d W_k = f_{k-1}(z_{k-1}) \cdot \delta_k^T \in \mathbb{R}^{h_k \times h_{k-1}} \)

\( dC / d b_k = \delta_k \in \mathbb{R}^{h_k} \)

Deep neural networks (DNN) off the top of my head [2/3]

Suppose we have a "cost" function \( C(y_{\text{true}}, y) \), which quantifies the prediction accuracy / error btwn true and predicted values. One typically uses stochastic gradient descent to find the "best" weights \(W_1, \dots, W_r\) and biases \(b_1, \dots, b_r\). To do gradient descent one needs to differentiate C w.r.t. all weights and biases. The "backpropagation" algorithm (aka the Chain rule) is used to obtain these derivatives.

Deep neural networks (DNN) off the top of my head [1/3]

A DNN is basically a fnct \(\mathbb{R}^p \to \mathbb{R}^c : x \mapsto y\) that is evaluated as follows.

Input: \( x \in \mathbb{R}^p \)

\( z_1 = W_1 x + b_1 \in \mathbb{R}^{h_1} \)

\( z_2 = W_2 f_1(z_1) + b_1 \in \mathbb{R}^{h_2} \)

\( \vdots \)

\( z_r = W_r f_{r-1}(z_{r-1}) + b_r \in \mathbb{R}^{c} \)

Output: \( y = f_r(z_r) \in \mathbb{R}^c \)

where we need to optimize the weights \(W_1, \dots, W_r\) and the biases \(b_1, \dots, b_r\).

Binomial(n, p) is the probability distribution of the number of "successes" in n independent trials whereby each trial has a binary outcome ("success" or "failure") with the "success" probability p.

If n is large and p is close to 0, then Binomial(n, p) can be approximated by a Poisson(λ) distribution with λ=np.

If n is large and p is far from 0 or 1, then Binomial(n, p) can be approximated by a Gaussian distribution with mean np and variance np(1-p).

Some of the many faces of the Cauchy-Schwarz inequality:

\[

|\langle u,v \rangle| \leq \|u\| \|v\|

\]

\[

\left| \sum_{i=1}^n u_i \bar{v}_i \right|^2 \leq \sum_{j=1}^n |u_j|^2 \sum_{k=1}^n |v_k|^2

\]

\(\left| \int_{\mathbb{R}^n} f(x)\overline{g(x)} dx \right|^2\)

\(\leq\int_{\mathbb{R}^n} |f(x)|^2 dx \int_{\mathbb{R}^n} |g(x)|^2 dx\)

\[

|E(XY)|^2 \leq E(X^2) E(Y^2)

\]

\[

\mathrm{Cov}(X,Y)^2 \leq \mathrm{Var}(X) \mathrm{Var}(Y)

\]

\[

|\varphi(g^* f)|^2 \leq \varphi(f^* f) \varphi(g^* g)

\]

A generalization is the law of total covariance:

Cov(X, Y) = E(Cov(X, Y | Z) + Cov(E(X | Z), E(Y | Z)).

(the law of total variance corresponds to the case X=Y)

Toots of random math/stats/ML trivia that I find interesting.

Joined Aug 2018