
EM algorithm [2/2]

Here is the formal general statement.

Let \(X\) be the observed data, and let \(Z\) be the unobserved data. Let \(l(\theta; X, Z)\) be the log-likelihood of the complete data \((X, Z)\) where \(\theta\) is the parameter vector of interest.

With initial guess \(\hat{\theta}^0\) repeat until convergence:

1. E-step: Compute
\[Q(\theta, \hat{\theta}^j) = E(l(\theta; X, Z) | X, \hat{\theta}^j).\]
2. M-step:
\[\hat{\theta}^{j+1} = \arg\!\max_\theta Q(\theta, \hat{\theta}^j).\]
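
A minimal Python sketch of these two steps for a toy model: an equal-weight two-component Gaussian mixture with unit variances and unknown means, where Z is the unobserved component label (all modelling choices here are assumptions for illustration). For this model the E-step reduces to computing responsibilities (the conditional expectations of Z given X and the current θ), and the M-step has a closed form:

import numpy as np

rng = np.random.default_rng(0)
# toy observed data X: two unit-variance Gaussian components
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

mu = np.array([-1.0, 1.0])              # initial guess theta^0 = (mu_1, mu_2)
for _ in range(100):
    # E-step: responsibility r_i = P(Z_i = 2 | x_i, mu), the conditional
    # expectation of the component label given the observed data
    d1 = np.exp(-0.5 * (x - mu[0])**2)
    d2 = np.exp(-0.5 * (x - mu[1])**2)
    r = d2 / (d1 + d2)
    # M-step: maximizing Q gives responsibility-weighted means
    mu_new = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                       np.sum(r * x) / np.sum(r)])
    if np.max(np.abs(mu_new - mu)) < 1e-8:   # convergence check
        break
    mu = mu_new

print(mu)   # should end up close to (-2, 3)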

EM algorithm [1/2]

Let's say we want to estimate parameters θ based on data X according to some statistical model f. But assume that actually
f is f(X, Z; θ)
i.e., there are some unobserved variables Z which influence X.

The Expectation-Maximization (EM) algorithm roughly repeats the following steps:

1. (E-step) Based on the current estimate of θ, compute the expected value of Z.
2. (M-step) Obtain a new estimate of θ based on f where Z is replaced with the expected value from step 1.

However, the assumption of independence seems highly questionable here...
(the cases/objects that a human will have trouble classifying correctly are likely the same cases/objects that the AI classification algorithm will have trouble classifying correctly as well).

Suppose we want to compare an AI classifier to a human, but we know that an average human's classification accuracy is imperfect too.
Let's say the human's overall classification accuracy is 0.8, and on a given dataset of n=100 cases the AI agrees with the human on m=74 of them.
What is α, the classification accuracy of the AI?
Assume that whether the AI makes a correct classification decision is stochastically independent of whether the human is correct. Counting as agreements the cases where both are correct plus (the task being binary) the cases where both are wrong,
74 = m = 0.8nα + 0.2n(1-α) = 20 + 60α
and thus
α = 0.9.
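
A quick simulation to sanity-check the algebra, under the same independence assumption and a binary task (numpy assumed):

import numpy as np

rng = np.random.default_rng(1)
n, h, a = 10**6, 0.8, 0.9        # trials, human accuracy, AI accuracy
human_ok = rng.random(n) < h     # human correct, independent of the AI
ai_ok = rng.random(n) < a        # AI correct
agree = human_ok == ai_ok        # binary task: agree iff both right or both wrong
print(agree.mean())              # ~0.8*0.9 + 0.2*0.1 = 0.74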

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happening. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].

More along these lines: stats.stackexchange.com/a/4778
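
A quick check of these numbers with scipy:

from scipy.stats import beta

p = beta(20, 80)
print(p.mean())          # 0.2
print(p.interval(0.99))  # central 99% interval, roughly (0.10, 0.31)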

Some of the many faces of Jensen's inequality

For a real convex function \(\phi\):

\(
\phi(\sum x_i / n) \leq \sum \phi(x_i) / n
\)

\(
\phi\left( \frac{1}{b-a} \int_a^b g(x)\,dx \right) \leq \frac{1}{b-a} \int_a^b \phi(g(x))\,dx
\)

\(
\phi(\mathrm{E}(X)) \leq \mathrm{E}(\phi(X))
\)

\(
\phi(\mathrm{E}(X | G)) \leq \mathrm{E}(\phi(X) | G)
\)
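
A quick numerical illustration of the probabilistic form with the convex function φ(x) = x², for which it reads E(X)² ≤ E(X²) (numpy assumed):

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 100_000)    # any distribution will do
phi = np.square                      # a convex function
print(phi(x.mean()), phi(x).mean())  # phi(E X) ~ 1  <=  E phi(X) ~ 2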

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra

Let A be an n × m matrix with n > m that has linearly independent columns.
Consider the equation Ax = b, where b is *not* in the column space of A. Then Ax = b cannot be solved exactly. Instead we can aim to minimize the error b - Ax.
The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.
Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².

Fig. from Strang (1993)
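
A minimal numpy sketch of this: the normal equations AᵀAx̂ = Aᵀb yield the least-squares solution, p = Ax̂ is the projection of b onto the column space, and the error e = b − p satisfies Aᵀe = 0:

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 2))        # n = 5 > m = 2, full column rank
b = rng.normal(size=5)             # generically not in the column space

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
p = A @ x_hat                      # component of b in col(A)
e = b - p                          # error component
print(A.T @ e)                     # ~0: e lies in the nullspace of A^T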

Consider an (unfair) coin with probability of heads P(H) = p.
Consider the events:
A = "(r+s) coin tosses result in r or more heads"
B = "tossing the coin repeatedly, until a total of r heads appear, results in a total s or fewer tails"

It holds that
P(A) = P(B).

Or in "math":
If X~Bin(s+r, p) and Y~NBin(r, p) then P(X ≥ r) = P(Y ≤ s)

Proof:
P(A) = P(H appears ≥ r times in s+r tosses)
= P(the r-th H appears within the first s+r tosses)
= P(T appears ≤ s times before the r-th H appears)
= P(B)
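
The identity is easy to verify numerically with scipy (nbinom counts the number of failures, i.e. tails, before the r-th success):

from scipy.stats import binom, nbinom

p, r, s = 0.3, 4, 10
print(binom.sf(r - 1, s + r, p))   # P(X >= r) = 1 - P(X <= r-1)
print(nbinom.cdf(s, r, p))         # P(Y <= s); the two values agree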

Bias-Variance Decomposition

Suppose that Y = f(X) + ε with noise term ε having E(ε) = 0 and Var(ε) = σ².
Let g(X;θ) be a trained/fitted model used to predict Y based on X (i.e., ideally g ≈ f), where θ represents the vector of trainable model parameters.
Consider the expected squared prediction error for a new input point x, and denote y = (Y|X=x). Then

Sq.Err. = E((y - g(x;θ))²)
= σ² + [E(y) - E(g(x;θ))]² + E([g(x;θ) - E(g(x;θ))]²)
= "irreducible error" + Bias²(g(x;θ)) + Variance(g(x;θ))

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) of why Random Search is often superior to Grid Search for the purpose of hyperparameter choice -- Random Search gives by far the better approximations to the important univariate parameter distributions.
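
The point of the figure in toy code form (hypothetical score function where only one of the two parameters matters): with a budget of 9 trials, a 3×3 grid tries only 3 distinct values of the important parameter, while 9 random points try 9.

import numpy as np

rng = np.random.default_rng(5)
score = lambda a, b: -(a - 0.37)**2     # only parameter a matters

g = np.linspace(0, 1, 3)                # grid: 3 distinct values per axis
grid_best = max(score(a, b) for a in g for b in g)

pts = rng.random((9, 2))                # random: 9 distinct values per axis
rand_best = max(score(a, b) for a, b in pts)

print(grid_best, rand_best)             # random search is usually closer to 0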

Turns out an ancient paper(*) has the answer.
If z = u₁ + iv₁ and w = u₂ + iv₂, where u₁, u₂, v₁, v₂ ~ N(0,1) (and independent), then the probability density of
r := |wz|
is given by
rK₀(r),
where K₀ denotes the modified Bessel function of the second kind with order 0.

(*) Wells, Anderson, Cell (1962) "The Distribution of the Product of Two Central or Non-Central Chi-Square Variates"
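
A simulation check of the claimed density against r·K₀(r), using scipy's modified Bessel function:

import numpy as np
from scipy.special import k0

rng = np.random.default_rng(6)
n = 10**6
z = rng.normal(size=n) + 1j * rng.normal(size=n)
w = rng.normal(size=n) + 1j * rng.normal(size=n)
r = np.abs(w * z)

hist, edges = np.histogram(r, bins=100, range=(0, 8), density=True)
mid = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - mid * k0(mid))))   # small -> good agreement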

Consider two random complex numbers
z = u₁ + iv₁ and
w = u₂ + iv₂,
where u₁, v₁, u₂, v₂ are independent standard normal random variables (N(0,1)).
Then what is the probability distribution of the absolute value of the product |zw|?
Some empirical investigation (simulation) shows that the distribution looks like this:

DNNs off the top of my head [3/3]

"Backpropagation":

\( \delta_r = dC / d z_r \in \mathbb{R}^c \)
\( \delta_{r-1} = dC / d z_{r-1} = \delta_r \cdot d z_r / d z_{r-1} \in \mathbb{R}^{h_{r-1}} \)
\( \vdots \)
\( \delta_1 = dC / d z_1 = \delta_2 \cdot d z_2 / d z_1 \in \mathbb{R}^{h_1} \)

(the δs get reused top to bottom - hence "backpropagation").
Finally,

\( dC / d W_k = \delta_k \, f_{k-1}(z_{k-1})^T \in \mathbb{R}^{h_k \times h_{k-1}} \) (where \( f_0(z_0) := x \))
\( dC / d b_k = \delta_k \in \mathbb{R}^{h_k} \)
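
A numpy sketch of these recursions for a small two-layer version of the network from part [1/3], assuming (my choices, not from the post) ReLU hidden activation, identity output, and squared-error cost C = ||y_true − y||²/2, with a finite-difference check of one gradient entry:

import numpy as np

rng = np.random.default_rng(8)
relu = lambda z: np.maximum(z, 0)

p, h1, c = 3, 5, 2
W1, b1 = rng.normal(size=(h1, p)), np.zeros(h1)
W2, b2 = rng.normal(size=(c, h1)), np.zeros(c)
x, y_true = rng.normal(size=p), rng.normal(size=c)

def cost(W1):
    z1 = W1 @ x + b1
    y = W2 @ relu(z1) + b2                 # f_2 = identity at the output
    return 0.5 * np.sum((y_true - y)**2)

# forward pass
z1 = W1 @ x + b1
y = W2 @ relu(z1) + b2
# backward pass: delta_2 = dC/dz_2, then delta_1 = dC/dz_1
delta2 = y - y_true
delta1 = (W2.T @ delta2) * (z1 > 0)        # chain rule through the ReLU
dW2 = np.outer(delta2, relu(z1))           # delta_2 f_1(z_1)^T
dW1 = np.outer(delta1, x)                  # delta_1 f_0(z_0)^T with f_0(z_0) = x
db2, db1 = delta2, delta1

# finite-difference check of one entry of dC/dW_1
E = np.zeros_like(W1); E[0, 0] = 1e-6
print(dW1[0, 0], (cost(W1 + E) - cost(W1 - E)) / 2e-6)   # should match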

Deep neural networks (DNN) off the top of my head [2/3]

Suppose we have a "cost" function \( C(y_{\text{true}}, y) \), which quantifies the prediction error between true and predicted values. One typically uses stochastic gradient descent to find the "best" weights \(W_1, \dots, W_r\) and biases \(b_1, \dots, b_r\). To do gradient descent one needs to differentiate C w.r.t. all weights and biases. The "backpropagation" algorithm (a.k.a. the chain rule) is used to obtain these derivatives.

Deep neural networks (DNN) off the top of my head [1/3]

A DNN is basically a function \(\mathbb{R}^p \to \mathbb{R}^c : x \mapsto y\) that is evaluated as follows.

Input: \( x \in \mathbb{R}^p \)
\( z_1 = W_1 x + b_1 \in \mathbb{R}^{h_1} \)
\( z_2 = W_2 f_1(z_1) + b_2 \in \mathbb{R}^{h_2} \)
\( \vdots \)
\( z_r = W_r f_{r-1}(z_{r-1}) + b_r \in \mathbb{R}^{c} \)
Output: \( y = f_r(z_r) \in \mathbb{R}^c \)

where we need to optimize the weights \(W_1, \dots, W_r\) and the biases \(b_1, \dots, b_r\).
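
A minimal numpy sketch of this evaluation, assuming ReLU for the hidden activations f₁, f₂ and the identity for the output activation f_r (choices not from the post):

import numpy as np

rng = np.random.default_rng(7)
p, h1, h2, c = 4, 8, 8, 3
relu = lambda z: np.maximum(z, 0)

# randomly initialized weights and biases (the objects to be optimized)
W = [rng.normal(size=s) for s in [(h1, p), (h2, h1), (c, h2)]]
b = [np.zeros(h1), np.zeros(h2), np.zeros(c)]

x = rng.normal(size=p)          # input
z1 = W[0] @ x + b[0]
z2 = W[1] @ relu(z1) + b[1]
z3 = W[2] @ relu(z2) + b[2]
y = z3                          # output (identity f_r)
print(y)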

Binomial(n, p) is the probability distribution of the number of "successes" in n independent trials, where each trial has a binary outcome ("success" or "failure") with "success" probability p.

If n is large and p is close to 0, then Binomial(n, p) can be approximated by a Poisson(λ) distribution with λ=np.

If n is large and p is far from 0 or 1, then Binomial(n, p) can be approximated by a Gaussian distribution with mean np and variance np(1-p).
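
Both approximations are easy to eyeball with scipy (a sketch; continuity correction used for the Gaussian case):

import numpy as np
from scipy.stats import binom, poisson, norm

# Poisson regime: n large, p close to 0
n, p = 1000, 0.003
k = np.arange(10)
print(np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, n * p))))  # small

# Gaussian regime: n large, p far from 0 and 1
n, p = 1000, 0.4
k = np.arange(350, 451)
approx = norm.cdf(k + 0.5, n * p, np.sqrt(n * p * (1 - p)))
print(np.max(np.abs(binom.cdf(k, n, p) - approx)))                 # small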

Some of the many faces of the Cauchy-Schwarz inequality:

\[
|\langle u,v \rangle| \leq \|u\| \|v\|
\]
\[
\left| \sum_{i=1}^n u_i \bar{v}_i \right|^2 \leq \sum_{j=1}^n |u_j|^2 \sum_{k=1}^n |v_k|^2
\]
\[
\left| \int_{\mathbb{R}^n} f(x)\overline{g(x)}\, dx \right|^2 \leq \int_{\mathbb{R}^n} |f(x)|^2\, dx \int_{\mathbb{R}^n} |g(x)|^2\, dx
\]
\[
|E(XY)|^2 \leq E(X^2) E(Y^2)
\]
\[
\mathrm{Cov}(X,Y)^2 \leq \mathrm{Var}(X) \mathrm{Var}(Y)
\]
\[
|\varphi(g^* f)|^2 \leq \varphi(f^* f) \varphi(g^* g)
\]
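
A quick numerical check of, e.g., the covariance form (numpy assumed):

import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)   # correlated with x
lhs = np.cov(x, y)[0, 1]**2
rhs = np.var(x, ddof=1) * np.var(y, ddof=1)
print(lhs <= rhs)                       # True (exactly, for sample moments)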

A generalization is the law of total covariance:

Cov(X, Y) = E(Cov(X, Y | Z)) + Cov(E(X | Z), E(Y | Z)).

(the law of total variance corresponds to the case X=Y)
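
A simulation check of the law of total covariance, with Z a fair coin shifting both X and Y (so that, X and Y being conditionally independent given Z, the whole covariance comes from the second term):

import numpy as np

rng = np.random.default_rng(10)
n = 10**6
z = rng.integers(0, 2, n)            # Z ~ Bernoulli(1/2)
x = z + rng.normal(size=n)           # X | Z ~ N(Z, 1)
y = 2 * z + rng.normal(size=n)       # Y | Z ~ N(2Z, 1), indep. of X given Z

# E(Cov(X,Y|Z)) = 0 here, and Cov(E(X|Z), E(Y|Z)) = Cov(Z, 2Z) = 2 Var(Z) = 0.5
print(np.cov(x, y)[0, 1])            # ~0.5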
