[cont.] Kullback–Leibler divergence.

Following that line of thought, the Kullback-Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p):

D(p||q) =

= ∑ᵢ pᵢ log₂(pᵢ / qᵢ)

= ∑ᵢ pᵢ log₂(1/qᵢ) - ∑ᵢ pᵢ log₂(1/pᵢ)

= H(p, q) - H(p)

[cont.] Cross-entropy.

By that interpretation, in contrast to entropy, cross-entropy is the expected number of bits per transmission under a potentially suboptimal encoding {log₂(1/q₁), log₂(1/q₂), ...}, which is based on a potentially inaccurate distribution {q₁, q₂, ...} for the symbols. That is, mathematically cross-entropy is given by:

H(p, q) = ∑ᵢ pᵢ log₂(1/qᵢ).

Entropy

(my understanding of information theory is very rudimentary)

Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of i'th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i'th symbol (afaik; don't know if fully correct, or how to prove...).

The entropy is just the expected number of bits per transmission under this optimal encoding:

∑ᵢ pᵢ log₂(1/pᵢ).

Law of total probability/expectation

The probability of an event can be written as a weighted sum of conditional probabilities.

The expected value of a random variable can be written as a weighted sum of conditional expected values.

If {Aᵢ}ᵢ is a finite or countably infinite partition of the sample space, then

P(B) = ∑ᵢ P(B | Aᵢ) P(Aᵢ)

E(X) = ∑ᵢ E(X | Aᵢ) P(Aᵢ)

...An equivalent relationship holds in terms of random variables.

Relationship between simple linear regression and moments of random variables.

Regression equation:

yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n

(where εᵢ represents random noise)

If xᵢ is a random variable that is independently and identically distributed (i.i.d.) for all i=1,2,...,n, and εᵢ (also i.i.d.) has mean 0 and is independent of xᵢ. Then:

β = Cov(xᵢ, yᵢ) / Var(xᵢ)

= Cor(xᵢ, yᵢ) * Sd(yᵢ) / Sd(xᵢ),

α = E(y) - βE(x),

Relationship between simple linear regression and sample mean, correlation, standard deviation.

Regression equation:

yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n

(where εᵢ represents random noise)

Then the least squares estimator of β is

b = Cor(x, y) * Sd(y) / Sd(x),

and the estimator of α is

a = Mean(y) - b * Mean(x)

(where Mean, Sd, Cor and the sample mean, sample standard deviation, and the sample Pearson correlation coefficient respectively)

EM algorithm [2/2]

Here is the formal general statement.

Let \(X\) be the observed data, and let \(Z\) be the unobserved data. Let \(l(\theta; X, Z)\) be the log-likelihood of the complete data \((X, Z)\) where \(\theta\) is the parameter vector of interest.

With initial guess \(\hat{\theta}^0\) repeat until convergence:

1. E-step: Compute

\[Q(\theta, \hat{\theta}^j) = E(l(\theta; X, Z) | X, \hat{\theta}^j).\]

2. M-step:

\[\hat{\theta}^{j+1} = \arg\!\max_\theta Q(\theta, \hat{\theta}^j).\]

EM algorithm [1/2]

Let's say we want to estimate parameters θ based on data X according to some statistical model f. But assume that actually

f is f(X, Z; θ)

i.e., there are some unobserved variables Z which influence X.

The Expectation-Maximization (EM) algorithm roughly repeats the following steps:

1. (E-step) Based on the current estimate of θ compute the expected value of Z.

2. (M-step) Obtain a new estimate of θ based on f where Z is replaced with the expected value from step 1.

Suppose we want to compare an AI classifier to a human but we know that an avg human's classification accuracy is imperfect too.

Let's say the human's overall classification accuracy is 0.8, and on a given dataset of n=100 cases the AI agrees on m=74 of those with the human.

What is α, the classification accuracy of the AI?

Assume that whether AI makes a correct classification decision is stochastically independent of whether the human's correct. Then

74 = m = 0.8nα + 0.2n(1-α)

and thus

α = 0.9.

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happaning. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].

More along these lines: https://stats.stackexchange.com/a/47782

Some of the many faces of the Jensen's inequality

For a real convex function \(\phi\):

\(

\phi(\sum x_i / n) \leq \sum \phi(x_i) / n

\)

\(

\phi\left( \frac{1}{b-a} \int_a^b g(x) dx \right) \leq

\)

\(\leq \frac{1}{b-a} \int_a^b \phi(g(x)) dx

\)

\(

\phi(\mathrm{E}(X)) \leq \mathrm{E}(\phi(X))

\)

\(

\phi(\mathrm{E}(X | G)) \leq \mathrm{E}(\phi(X) | G)

\)

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra

Let A be an n × m matrix with n > m that has linearly independent columns.

Consider the eq. Ax = b, where b is *not* in the column space. Then Ax = b cannot be solved. Instead we can aim at minimizing the error (b - Ax).

The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.

Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².

Fig. from Strang (1993)

Consider an (unfair) coin with probability of heads P(H) = p.

Consider the events:

A = "(r+s) coin tosses result in r or more heads"

B = "tossing the coin repeatedly, until a total of r heads appear, results in a total s or fewer tails"

It holds that

P(A) = P(B).

Or in "math":

If X~Bin(s+r, p) and Y~NBin(r, p) then P(X ≥ r) = P(Y ≤ s)

Proof:

P(A) = P(H appears ≥ r times in s+r tosses)

= P(H appears r times in ≤ s+r tosses)

= P(T appears ≤ s times before H appears r times)

=P(B)

Bias-Variance Decomposition

Suppose that Y = f(X) + ε with noise term ε having Var(ε) = σ².

Let g(X;θ) be a trained/fitted model used to predict Y based on X (i.e., ideally g ≈ f), where θ represents the vector of trainable model parameters.

Consider the expected squared prediction error for a new input point x, and denote y = (Y|X=x). Then

Sq.Err. = E((y - g(x;θ)²)

= σ² + [E(y) - E(g(x;θ)]² + E([g(x;θ) - E(g(x;θ))]²)

= "irreducible error" + Bias²(g(x;θ)) + Variance(g(x;θ))

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) why Random Search is often superior to Grid Search for purposes of parameter choice -- Random Search gives by far the better approximations to the important univariate parameter distributions.

Turns out an ancient paper(*) has the answer.

If z = u₁ + iv₁ and w = u₂ + iv₂, where u₁, u₂, v₁, v₂ ~ N(0,1) (and independent), then the probability density of

r := |wz|

is given by

rK₀(r),

where K₀ denotes the modified Bessel function of the second kind with order 0.

(*) Wells, Anderson, Cell (1962) "The Distribution of the Product of Two Central or Non-Central Chi-Square Variates"

Consider two random complex numbers

z = u₁ + iv₁ and

w = u₂ + iv₂,

where u₁, v₁, u₂, v₂ are independent standard normal random variables (N(0,1)).

Then what is the probability distribution of the absolute value of the product |zw|?

Some empirical investigation (simulation) shows that the distribution looks like this:

Toots of random math/stats/ML trivia that I find interesting.

Joined Aug 2018