The law of total covariance,
a.k.a. the covariance decomposition formula,
a.k.a. the conditional covariance formula

Cov(Y,Z) = E[Cov(Y,Z|X)] + Cov[E(Y|X), E(Z|X)]

The law of total variance,
a.k.a. variance decomposition formula,
a.k.a. conditional variance formulas,
a.k.a. the law of iterated variances,
a.k.a. Eve's law

Var(Y) = E(Var(Y|X)) + Var(E(Y|X))
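Both decompositions can be checked exactly on a small discrete example. The sketch below is only an illustration: the joint pmf of (X, Y, Z) is made up, and the helper names (`expect`, `cov`, `conditional`, `total_cov_rhs`) are hypothetical. Exact fractions are used so the two sides can be compared with `==`.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y, Z); the numbers are made up for illustration.
pmf = {
    (0, 1, 2): F(1, 8),
    (0, 3, 1): F(3, 8),
    (1, 2, 2): F(1, 4),
    (1, 5, 0): F(1, 4),
}

def expect(g, dist):
    """E[g(X, Y, Z)] under a pmf given as {(x, y, z): prob}."""
    return sum(p * g(*pt) for pt, p in dist.items())

def cov(g, h, dist):
    """Cov(g, h) = E[g h] - E[g] E[h]."""
    return expect(lambda *a: g(*a) * h(*a), dist) - expect(g, dist) * expect(h, dist)

def conditional(dist, x0):
    """Return P(X = x0) and the conditional pmf given X = x0."""
    px = sum(p for pt, p in dist.items() if pt[0] == x0)
    return px, {pt: p / px for pt, p in dist.items() if pt[0] == x0}

def total_cov_rhs(g, h, dist):
    """E[Cov(g, h | X)] + Cov(E[g | X], E[h | X])."""
    parts = [conditional(dist, x0) for x0 in sorted({pt[0] for pt in dist})]
    within = sum(px * cov(g, h, cd) for px, cd in parts)
    eg = [expect(g, cd) for _, cd in parts]   # E[g | X = x0]
    eh = [expect(h, cd) for _, cd in parts]   # E[h | X = x0]
    pxs = [px for px, _ in parts]
    mean_g = sum(p * a for p, a in zip(pxs, eg))
    mean_h = sum(p * b for p, b in zip(pxs, eh))
    between = sum(p * a * b for p, a, b in zip(pxs, eg, eh)) - mean_g * mean_h
    return within + between

def Y(x, y, z):
    return y

def Z(x, y, z):
    return z

lhs_cov, rhs_cov = cov(Y, Z, pmf), total_cov_rhs(Y, Z, pmf)
lhs_var, rhs_var = cov(Y, Y, pmf), total_cov_rhs(Y, Y, pmf)
print(lhs_cov == rhs_cov, lhs_var == rhs_var)
```

Taking Z = Y turns the covariance identity into the variance one, which is why a single helper covers both.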

If X ~ Laplace(0, b) with scale b > 0,
then for t ≥ 0 it holds that
P(|X| ≥ tb) = exp(-t).
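A short derivation, assuming the usual Laplace(0, b) density f(x) = exp(-|x|/b) / (2b) and using its symmetry around 0:

```latex
\[
\mathbb{P}(|X| \geq tb)
= 2 \int_{tb}^{\infty} \frac{1}{2b}\, e^{-x/b} \,\mathrm{d}x
= \left[ -e^{-x/b} \right]_{tb}^{\infty}
= e^{-t}.
\]
```

Equivalently, |X|/b follows a standard exponential distribution.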

Multilayer (vanilla) RNN.

hₜˡ = tanh[ Wˡ ⋅ (hₜˡ⁻¹, hₜ₋₁ˡ)ᵀ ].
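A minimal pure-Python sketch of one time step of this recursion. The assumptions are mine, not from the post: the input xₜ plays the role of the layer-0 state hₜ⁰, there are no bias terms, the initial hidden states are zero, and the toy weights are made up. `rnn_step` is a hypothetical helper name.

```python
import math

def rnn_step(weights, x_t, h_prev):
    """One time step of a stacked vanilla RNN (no bias terms).

    weights[l] : matrix (list of rows) of shape n_l x (n_{l-1} + n_l)
    x_t        : input vector at time t, treated as the layer-0 state
    h_prev     : hidden vectors at time t-1, one per layer
    """
    below = x_t
    h_new = []
    for W, h_same in zip(weights, h_prev):
        stacked = list(below) + list(h_same)   # (h_t^{l-1}, h_{t-1}^l)
        h = [math.tanh(sum(w * v for w, v in zip(row, stacked))) for row in W]
        h_new.append(h)
        below = h                               # input to the next layer up
    return h_new

# Toy example: 2 layers, input dimension 2, hidden dimension 2 (made-up weights).
weights = [
    [[0.5, -0.5, 0.1, 0.0], [0.0, 0.3, 0.0, 0.2]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
]
h = [[0.0, 0.0], [0.0, 0.0]]
for x_t in ([1.0, 0.0], [0.0, 1.0]):
    h = rnn_step(weights, x_t, h)
print(h)
```

Each layer reads the fresh state of the layer below and its own state from the previous time step, which is exactly the pair stacked inside the tanh.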

Assume that (Y₁, Y₂) follows a bivariate distribution with mean vector (μ₁, μ₂) and covariance matrix with entries σ₁², σ₂², σ₁₂, and let ρ=Cor(Y₁,Y₂).

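If we collect i.i.d. observations (Yᵢ₁, Yᵢ₂) from this bivariate distribution, then the expected squared difference E[(Y₁-Y₂)²], which is twice the expected squared perpendicular distance of such a point from the 45-degree line in the 2D plane (the perpendicular distance is |Y₁-Y₂|/√2), is given by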

E[(Y₁-Y₂)²] =
E[((Y₁-μ₁) - (Y₂-μ₂) + μ₁-μ₂)²] =
E[((Y₁-μ₁) - (Y₂-μ₂))²] + (μ₁-μ₂)² =
(μ₁-μ₂)²+σ₁²+σ₂²-2σ₁₂ =
(μ₁-μ₂)²+σ₁²+σ₂²-2ρσ₁σ₂.

Another elegant bound that's stronger for small x is

P(|X| > x) ≤ exp(-x² / 2)

for X ~ N(0, 1) and x > 0.

(but I'm not aware of a proof quite as short as the previous one)
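A quick numerical sanity check, not a proof. It assumes the exact two-sided tail can be written as P(|X| > x) = erfc(x/√2), and compares it with exp(-x²/2) and with twice the one-sided bound exp(-x²/2) / (x√(2π)); the doubling by symmetry is my addition. The function names are hypothetical.

```python
import math

def two_sided_tail(x):
    """P(|X| > x) for X ~ N(0, 1), via the complementary error function."""
    return math.erfc(x / math.sqrt(2))

def exp_bound(x):
    """The bound exp(-x^2 / 2)."""
    return math.exp(-x * x / 2)

def mills_bound(x):
    """The bound exp(-x^2 / 2) / (x sqrt(2 pi)) on the one-sided tail P(X > x)."""
    return math.exp(-x * x / 2) / (x * math.sqrt(2 * math.pi))

xs = [0.1, 0.5, 1.0, 2.0, 4.0]
rows = [(x, two_sided_tail(x), exp_bound(x), 2 * mills_bound(x)) for x in xs]
for x, tail, eb, mb in rows:
    print(f"x={x:4.1f}  P(|X|>x)={tail:.3e}  exp bound={eb:.3e}  2*Mills={mb:.3e}")
```

The two bounds cross at x = √(2/π) ≈ 0.8: below that the exponential bound is the smaller one, above it the 1/x factor wins.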

Let \(X \sim \mathcal{N}(0, 1)\) and \(x > 0\).

\[
\mathbb{P}(X > x)
= \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-t^2/2} \,\mathrm{d}t
\leq \frac{1}{\sqrt{2\pi}} \int_x^\infty \frac{t}{x} e^{-t^2/2} \,\mathrm{d}t
= \frac{e^{-x^2/2}}{x \sqrt{2\pi}}.
\]

Quite elegant!


This one is interesting too:

∀ a>0, b>1: O(log_b(n)) ⊂ O(nᵃ).
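A short argument for the inclusion, sketched under the assumption that the elementary inequality ln y ≤ y - 1 (for y > 0) is taken as known, applied with y = nᵃ:

```latex
\[
a \ln n = \ln(n^a) \leq n^a - 1 < n^a
\quad\Longrightarrow\quad
\log_b n = \frac{\ln n}{\ln b} < \frac{1}{a \ln b}\, n^a
\qquad (n \geq 1,\ a > 0,\ b > 1).
\]
```

The inclusion is strict: applying the same bound with a/2 in place of a gives nᵃ / log_b(n) > (a ln(b) / 2) nᵃᐟ² for n > 1, which is unbounded, so nᵃ is not in O(log_b(n)).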

O(1)
< O(log(n))
< O(sqrt(n))
< O(n)
< O(n log(n))
< O(n²)
< O(2ⁿ)
< O(n!)

Differential privacy [5/n]

A crucial property of differential privacy (d.p.) is that it is preserved under post-processing and adaptive composition in the following sense:

If M₁, M₂, ..., Mₘ are (ε₁, δ₁)-d.p., (ε₂, δ₂)-d.p., ..., (εₘ, δₘ)-d.p., respectively, then any function of (M₁, M₂, ..., Mₘ) is (∑ εᵢ, ∑ δᵢ)-d.p.

Here the choice of Mᵢ₊₁ may depend on the outputs of M₁, ..., Mᵢ; hence *adaptive* composition.
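In practice this turns into simple budget bookkeeping: the (ε, δ) pairs of all queries answered on the same dataset are added up. A minimal sketch, where `compose` is a hypothetical helper and the budgets are made up:

```python
def compose(budgets):
    """Basic composition: add up the (eps, delta) pairs of the mechanisms."""
    total_eps = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_eps, total_delta

# Hypothetical budgets for three queries answered on the same dataset.
budgets = [(0.5, 0.0), (0.25, 1e-6), (0.25, 1e-6)]
total_eps, total_delta = compose(budgets)
print(total_eps, total_delta)
```

This is the basic (worst-case) bound; tighter "advanced composition" bounds exist but are not sketched here.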

Differential privacy [4/n]

What does a differentially private data access mechanism look like?

The simplest example is the Laplace Mechanism:

Let Zᵐ be a space of datasets with m rows. Let Lap(b) denote the Laplace distribution with scale parameter b>0.
Assume that we want to evaluate a function f: Zᵐ → R on a dataset x ∈ Zᵐ.
The Laplace Mechanism, given by

M(x) = f(x) + Lap(b),

is (ε, 0)-differentially private (or simply ε-d.p.) if
b ≥ Δf / ε, where Δf := sup { |f(x) - f(x')| : d(x, x') = 1 } is the sensitivity of f.
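A minimal sketch of the mechanism for a counting query. Assumptions to flag: the dataset and the query `count_over_40` are made up; the sensitivity of a count is taken to be 1, since changing one row changes the count by at most 1; the scale is set to the boundary value b = sensitivity / ε, which standard treatments also allow; and Laplace noise is drawn as the difference of two i.i.d. exponentials with mean b.

```python
import random

def laplace_noise(b, rng=random):
    """Sample Lap(b) as the difference of two i.i.d. exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def laplace_mechanism(f, x, sensitivity, eps, rng=random):
    """Return f(x) + Lap(b) with b = sensitivity / eps."""
    b = sensitivity / eps
    return f(x) + laplace_noise(b, rng)

# Hypothetical dataset: one age per row; the query counts rows with age >= 40.
ages = [23, 35, 41, 52, 67, 38, 44]

def count_over_40(rows):
    return sum(1 for a in rows if a >= 40)

rng = random.Random(0)  # seeded only to make the sketch reproducible
noisy = laplace_mechanism(count_over_40, ages, sensitivity=1, eps=0.5, rng=rng)
print(count_over_40(ages), round(noisy, 2))
```

A smaller ε means a larger noise scale b, so stronger privacy is paid for with less accurate answers.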

Differential privacy [3/n]

Another condition for differential privacy involves the tail probabilities of the "privacy loss":

Privacy loss := log( P(M(x) = ξ) / P(M(x') = ξ) ),

where x, x' ∈ Zᵐ have d(x, x') = 1, and ξ follows the distribution of M(x).

This holds:

M is (ε, δ)-differentially private if
P(privacy loss ≤ ε) ≥ 1 - δ.

However I find it hard to grasp the intuition for this result... need to think more about it...

[notice that E(privacy loss) = KL-divergence btwn M(x) and M(x')]
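One way to see why the tail condition suffices, sketched under the assumption that M(x) and M(x') have densities (the symbols p_x, p_{x'} and B are introduced here): call B the "bad" set of outputs where the privacy loss exceeds ε. Outside B the densities satisfy p_x ≤ exp(ε) p_{x'} pointwise, and B itself has probability at most δ under M(x).

```latex
\[
B := \Bigl\{ \xi \in Y : \log \frac{p_x(\xi)}{p_{x'}(\xi)} > \varepsilon \Bigr\},
\qquad \mathbb{P}(M(x) \in B) \leq \delta .
\]
\[
\mathbb{P}(M(x) \in S)
= \mathbb{P}(M(x) \in S \setminus B) + \mathbb{P}(M(x) \in S \cap B)
\leq e^{\varepsilon}\, \mathbb{P}(M(x') \in S \setminus B) + \delta
\leq e^{\varepsilon}\, \mathbb{P}(M(x') \in S) + \delta .
\]
```

So δ can be read as the probability of the rare outputs on which the ε guarantee is allowed to fail.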

Differential privacy [2/n]

The defining inequality
P(M(x) ∈ S) ≤ exp(ε) P(M(x') ∈ S) + δ
(∀ x, x' ∈ Zᵐ with d(x, x') = 1, ∀ measurable S ⊆ Y)
has an intuitive meaning.

The results returned by M on x and x' are indistinguishable up to a degree given by ε and δ.
Practically, it means that an adversary should not be able to reconstruct/deanonymize/match any individual's data record by querying multiple dataset-level summary statistics through the query mechanism M and combining them in some way.

Differential privacy [1/n]

Differential privacy is a mathematically rigorous definition of "data privacy" (the work that introduced it received the Gödel Prize in 2017).


Let Zᵐ be a space of datasets with m rows.
Let d(x, x') be the number of rows at which x, x' ∈ Zᵐ differ.
Let M be a mechanism that takes a dataset x ∈ Zᵐ as input and outputs a random variable M(x) ∈ Y.
M is called (ε, δ)-differentially private if

P(M(x) ∈ S) ≤ exp(ε) P(M(x') ∈ S) + δ,
∀ x, x' ∈ Zᵐ with d(x, x') = 1, ∀ measurable S ⊆ Y.

Convergence in distribution

A seq. X₁, X₂, ... of R-valued random vars. converges in distribution (or converges weakly, or in law) to a random var. X
⇔ lim Fₙ(x) = F(x) as n→∞ for all continuity points x of F (where Fₙ and F denote the c.d.f.s of Xₙ and X)
⇔ E(f(Xₙ))→E(f(X)) for all bounded Lipschitz f
⇔ liminf E(f(Xₙ))≥E(f(X)) for all non-negative continuous f
⇔ liminf P(Xₙ∈G)≥P(X∈G) for every open set G
⇔ limsup P(Xₙ∈C)≤P(X∈C) for every closed set C
⇔ P(Xₙ∈B)→P(X∈B) for all continuity sets B of X (i.e., P(X∈∂B)=0)

How many can you prove?

Var(X) = E((X - E(X))²)
= E(X²) - E(2XE(X)) + E(X)²
= E(X²) - E(X)².

Probability distribution trivia:

If X ~ Beta(α, β), then
E(Xᵏ) = (α + k - 1) / (α + β + k - 1) E(Xᵏ⁻¹).
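Unrolling the recursion from E(X⁰) = 1 gives every moment as a product. A small sketch with exact fractions, assuming integer (or `Fraction`) parameters; `beta_moment` is a hypothetical helper name:

```python
from fractions import Fraction

def beta_moment(alpha, beta, k):
    """E(X^k) for X ~ Beta(alpha, beta), via the recursion started at E(X^0) = 1."""
    m = Fraction(1)
    for j in range(1, k + 1):
        m *= Fraction(alpha + j - 1, alpha + beta + j - 1)
    return m

mean = beta_moment(2, 3, 1)
second = beta_moment(2, 3, 2)
variance = second - mean ** 2
print(mean, second, variance)
```

For Beta(2, 3) the variance obtained this way agrees with the closed form αβ / ((α + β)² (α + β + 1)).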

[cont.] Kullback–Leibler divergence.

Following that line of thought, the Kullback-Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p):

D(p||q)
= ∑ᵢ pᵢ log₂(pᵢ / qᵢ)
= ∑ᵢ pᵢ log₂(1/qᵢ) - ∑ᵢ pᵢ log₂(1/pᵢ)
= H(p, q) - H(p)
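A small numerical check of this identity. The two distributions are made up, and terms with pᵢ = 0 are skipped by the usual convention 0 · log(0) = 0 (an assumption of this sketch; it also assumes qᵢ > 0 wherever pᵢ > 0):

```python
import math

def entropy(p):
    """H(p) = sum_i p_i log2(1 / p_i), in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i log2(1 / q_i), in bits."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_i p_i log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" symbol frequencies (made up)
q = [0.25, 0.25, 0.5]   # mismatched model used to build the code (made up)

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

The divergence is the number of extra bits per symbol paid for encoding with q instead of p, and it is 0 when q = p.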

[cont.] Cross-entropy.

By that interpretation, in contrast to entropy, cross-entropy is the expected number of bits per transmission under a potentially suboptimal encoding {log₂(1/q₁), log₂(1/q₂), ...}, which is based on a potentially inaccurate distribution {q₁, q₂, ...} for the symbols. That is, mathematically cross-entropy is given by:

H(p, q) = ∑ᵢ pᵢ log₂(1/qᵢ).

(my understanding of information theory is very rudimentary)

Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of i'th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i'th symbol (afaik; don't know if fully correct, or how to prove...).
The entropy is just the expected number of bits per transmission under this optimal encoding:
∑ᵢ pᵢ log₂(1/pᵢ).
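A worked example where the lengths log₂(1/pᵢ) are whole numbers, so a prefix-free code achieves them exactly. The distribution and the codewords are made up for illustration:

```latex
\[
p = \left(\tfrac12, \tfrac14, \tfrac18, \tfrac18\right), \qquad
\text{codewords } 0,\ 10,\ 110,\ 111,
\]
\[
\text{expected length}
= \tfrac12 \cdot 1 + \tfrac14 \cdot 2 + \tfrac18 \cdot 3 + \tfrac18 \cdot 3
= 1.75 \text{ bits}
= \sum_i p_i \log_2(1/p_i).
\]
```

When log₂(1/pᵢ) is not an integer, the lengths have to be rounded, so a symbol-by-symbol code can only approach the entropy.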
