Var(X) = E((X - E(X))²)
= E(X²) - E(2XE(X)) + E(X)²
= E(X²) - E(X)².

Probability distribution trivia:

If X ~ Beta(α, β), then
E(Xᵏ) = (α + k - 1) / (α + β + k - 1) E(Xᵏ⁻¹).

[cont.] Kullback–Leibler divergence.

Following that line of thought, the Kullback-Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p):

D(p||q) =
= ∑ᵢ pᵢ log₂(pᵢ / qᵢ)
= ∑ᵢ pᵢ log₂(1/qᵢ) - ∑ᵢ pᵢ log₂(1/pᵢ)
= H(p, q) - H(p)

[cont.] Cross-entropy.

By that interpretation, in contrast to entropy, cross-entropy is the expected number of bits per transmission under a potentially suboptimal encoding {log₂(1/q₁), log₂(1/q₂), ...}, which is based on a potentially inaccurate distribution {q₁, q₂, ...} for the symbols. That is, mathematically cross-entropy is given by:

H(p, q) = ∑ᵢ pᵢ log₂(1/qᵢ).

(my understanding of information theory is very rudimentary)

Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of i'th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i'th symbol (afaik; don't know if fully correct, or how to prove...).
The entropy is just the expected number of bits per transmission under this optimal encoding:
∑ᵢ pᵢ log₂(1/pᵢ).

Law of total probability/expectation

The probability of an event can be written as a weighted sum of conditional probabilities.

The expected value of a random variable can be written as a weighted sum of conditional expected values.

If {Aᵢ}ᵢ is a finite or countably infinite partition of the sample space, then

P(B) = ∑ᵢ P(B | Aᵢ) P(Aᵢ)

E(X) = ∑ᵢ E(X | Aᵢ) P(Aᵢ)

...An equivalent relationship holds in terms of random variables.

Relationship between simple linear regression and moments of random variables.

Regression equation:
yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n
(where εᵢ represents random noise)

If xᵢ is a random variable that is independently and identically distributed (i.i.d.) for all i=1,2,...,n, and εᵢ (also i.i.d.) has mean 0 and is independent of xᵢ. Then:

β = Cov(xᵢ, yᵢ) / Var(xᵢ)
= Cor(xᵢ, yᵢ) * Sd(yᵢ) / Sd(xᵢ),
α = E(y) - βE(x),

Relationship between simple linear regression and sample mean, correlation, standard deviation.

Regression equation:
yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n
(where εᵢ represents random noise)

Then the least squares estimator of β is
b = Cor(x, y) * Sd(y) / Sd(x),
and the estimator of α is
a = Mean(y) - b * Mean(x)

(where Mean, Sd, Cor and the sample mean, sample standard deviation, and the sample Pearson correlation coefficient respectively)

EM algorithm [2/2]

Here is the formal general statement.

Let \(X\) be the observed data, and let \(Z\) be the unobserved data. Let \(l(\theta; X, Z)\) be the log-likelihood of the complete data \((X, Z)\) where \(\theta\) is the parameter vector of interest.

With initial guess \(\hat{\theta}^0\) repeat until convergence:

1. E-step: Compute
\[Q(\theta, \hat{\theta}^j) = E(l(\theta; X, Z) | X, \hat{\theta}^j).\]
2. M-step:
\[\hat{\theta}^{j+1} = \arg\!\max_\theta Q(\theta, \hat{\theta}^j).\]

EM algorithm [1/2]

Let's say we want to estimate parameters θ based on data X according to some statistical model f. But assume that actually
f is f(X, Z; θ)
i.e., there are some unobserved variables Z which influence X.

The Expectation-Maximization (EM) algorithm roughly repeats the following steps:

1. (E-step) Based on the current estimate of θ compute the expected value of Z.
2. (M-step) Obtain a new estimate of θ based on f where Z is replaced with the expected value from step 1.

However the assumption of independence seems highly questionable here...
(the cases/objects that a human will have trouble classifying correctly are likely the same cases/objects that the AI classification algorithm will have trouble classifying correctly as well).

Suppose we want to compare an AI classifier to a human but we know that an avg human's classification accuracy is imperfect too.
Let's say the human's overall classification accuracy is 0.8, and on a given dataset of n=100 cases the AI agrees on m=74 of those with the human.
What is α, the classification accuracy of the AI?
Assume that whether AI makes a correct classification decision is stochastically independent of whether the human's correct. Then
74 = m = 0.8nα + 0.2n(1-α)
and thus
α = 0.9.

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happaning. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].

More along these lines:

Some of the many faces of the Jensen's inequality

For a real convex function \(\phi\):

\phi(\sum x_i / n) \leq \sum \phi(x_i) / n

\phi\left( \frac{1}{b-a} \int_a^b g(x) dx \right) \leq
\(\leq \frac{1}{b-a} \int_a^b \phi(g(x)) dx

\phi(\mathrm{E}(X)) \leq \mathrm{E}(\phi(X))

\phi(\mathrm{E}(X | G)) \leq \mathrm{E}(\phi(X) | G)

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra

Let A be an n × m matrix with n > m that has linearly independent columns.
Consider the eq. Ax = b, where b is *not* in the column space. Then Ax = b cannot be solved. Instead we can aim at minimizing the error (b - Ax).
The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.
Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².

Fig. from Strang (1993)

Consider an (unfair) coin with probability of heads P(H) = p.
Consider the events:
A = "(r+s) coin tosses result in r or more heads"
B = "tossing the coin repeatedly, until a total of r heads appear, results in a total s or fewer tails"

It holds that
P(A) = P(B).

Or in "math":
If X~Bin(s+r, p) and Y~NBin(r, p) then P(X ≥ r) = P(Y ≤ s)

P(A) = P(H appears ≥ r times in s+r tosses)
= P(H appears r times in ≤ s+r tosses)
= P(T appears ≤ s times before H appears r times)

Bias-Variance Decomposition

Suppose that Y = f(X) + ε with noise term ε having Var(ε) = σ².
Let g(X;θ) be a trained/fitted model used to predict Y based on X (i.e., ideally g ≈ f), where θ represents the vector of trainable model parameters.
Consider the expected squared prediction error for a new input point x, and denote y = (Y|X=x). Then

Sq.Err. = E((y - g(x;θ)²)
= σ² + [E(y) - E(g(x;θ)]² + E([g(x;θ) - E(g(x;θ))]²)
= "irreducible error" + Bias²(g(x;θ)) + Variance(g(x;θ))

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) why Random Search is often superior to Grid Search for purposes of parameter choice -- Random Search gives by far the better approximations to the important univariate parameter distributions.

Show more

A Mastodon instance for maths people. The kind of people who make \(\pi z^2 \times a\) jokes.

Use \( and \) for inline LaTeX, and \[ and \] for display mode.