Differential privacy [5/n]

A crucial property of differential privacy (d.p.) is that it is preserved under post-processing and adaptive composition in the following sense:

If M₁, M₂, ..., Mₘ are (ε₁, δ₁)-d.p., (ε₂, δ₂)-d.p., ..., (εₘ, δₘ)-d.p., respectively, then any function of (M₁, M₂, ..., Mₘ) is (∑ εᵢ, ∑ δᵢ)-d.p.

Here the choice of Mᵢ₊₁ may depend on the outputs of M₁, ..., Mᵢ, hence *adaptive* composition.
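A minimal sketch of the bookkeeping this implies: the total privacy cost of several releases is just the sum of the individual (ε, δ) pairs. The function name `compose` and the example budgets are made up for illustration.

```python
def compose(budgets):
    """Sum the (epsilon, delta) pairs of the individual mechanisms."""
    total_eps = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_eps, total_delta


# Hypothetical budgets of three mechanisms run one after another.
budgets = [(0.5, 1e-6), (0.25, 0.0), (0.25, 1e-6)]
total_eps, total_delta = compose(budgets)
print(total_eps, total_delta)
```

Any post-processing of the three outputs keeps the same total (ε, δ).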

Differential privacy [4/n]

What does a differentially private data access mechanism look like?

The simplest example is the Laplace Mechanism:

Let Zᵐ be a space of datasets with m rows. Let Lap(b) denote the Laplace distribution with scale parameter b>0.

Assume that we want to evaluate a function f: Zᵐ → R on a dataset x ∈ Zᵐ.

The Laplace Mechanism, given by

M(x) = f(x) + Lap(b),

is (ε, 0)-differentially private (or simply ε-d.p.) if

b ≥ sup { |f(x) - f(x')| : x, x' ∈ Zᵐ, d(x, x') = 1 } / ε.
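A small sketch of the mechanism using only the Python standard library. The data, the bound of 100 on each value, and the function names are assumptions for illustration. The scale is set to b = sensitivity / ε, the usual choice; any larger b also works.

```python
import random


def laplace_noise(b, rng):
    # The difference of two independent Exponential(rate=1/b) draws
    # follows the Laplace distribution with scale b.
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    # Scale chosen as sensitivity / epsilon.
    b = sensitivity / epsilon
    return value + laplace_noise(b, rng)


rng = random.Random(0)
ages = [34, 45, 29, 61, 50]      # hypothetical rows, each assumed to lie in [0, 100]
true_mean = sum(ages) / len(ages)
sensitivity = 100 / len(ages)    # changing one row moves the mean by at most 100/m
noisy_mean = laplace_mechanism(true_mean, sensitivity, epsilon=1.0, rng=rng)
print(true_mean, noisy_mean)
```

The sensitivity here is the sup in the condition above, worked out by hand for the mean of bounded values.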

Differential privacy [3/n]

Another condition for differential privacy involves the tail probabilities of the "privacy loss":

Privacy loss := log( P(M(x) = ξ) / P(M(x') = ξ) ),

where x, x' ∈ Zᵐ have d(x, x') = 1, and ξ follows the distribution of M(x).

The following holds:

M is (ε, δ)-differentially private if, for all such x, x',

P(privacy loss ≤ ε) ≥ 1 - δ.

However I find it hard to grasp the intuition for this result... need to think more about it...

[notice that E(privacy loss) = KL-divergence btwn M(x) and M(x')]
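To make the privacy loss concrete, here is a sketch that simulates it for a Laplace mechanism. Assumptions: densities are used in place of point probabilities (the output is continuous), and the query values 10 and 11 on two neighboring datasets are made up, with sensitivity 1.

```python
import math
import random


def laplace_log_density(y, loc, b):
    return -abs(y - loc) / b - math.log(2 * b)


def privacy_loss(xi, fx, fx_prime, b):
    # log( density of M(x) at xi / density of M(x') at xi )
    return laplace_log_density(xi, fx, b) - laplace_log_density(xi, fx_prime, b)


epsilon = 0.5
fx, fx_prime = 10.0, 11.0   # hypothetical f(x), f(x') with |f(x) - f(x')| = 1
b = 1.0 / epsilon           # Laplace scale = sensitivity / epsilon

rng = random.Random(1)
losses = []
for _ in range(10000):
    # xi is drawn from the distribution of M(x) = f(x) + Lap(b)
    xi = fx + rng.expovariate(1 / b) - rng.expovariate(1 / b)
    losses.append(privacy_loss(xi, fx, fx_prime, b))

max_loss = max(losses)
mean_loss = sum(losses) / len(losses)
print(max_loss, mean_loss)
```

In this example the loss never exceeds ε, so the tail condition holds with δ = 0, and the average loss is a Monte Carlo estimate of the KL divergence mentioned above.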

Differential privacy [2/n]

The defining inequality

P(M(x) ∈ S) ≤ exp(ε) P(M(x') ∈ S) + δ

(∀ x, x' ∈ Zᵐ with d(x, x') = 1, ∀ measurable S ⊆ Y)

has an intuitive meaning.

The results returned by M on x and x' are indistinguishable up to a degree given by ε and δ.

Practically, it means that an adversary should not be able to reconstruct/deanonymize/match any individual's data record by querying multiple dataset-level summary statistics through the query mechanism M and combining them in some way.

Differential privacy [1/n]

Differential privacy is a mathematically rigorous definition of "data privacy" (the paper that introduced it received the Gödel Prize in 2017).

Definition:

Let Zᵐ be a space of datasets with m rows.

Let d(x, x') be the number of rows at which x, x' ∈ Zᵐ differ.

Let M be a mechanism that takes a dataset x ∈ Zᵐ as input and outputs a random variable M(x) ∈ Y.

M is called (ε, δ)-differentially private if

P(M(x) ∈ S) ≤ exp(ε) P(M(x') ∈ S) + δ,

∀ x, x' ∈ Zᵐ with d(x, x') = 1, ∀ measurable S ⊆ Y.
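For a finite output space the definition can be checked by brute force over all sets S. The sketch below does this for randomized response on a single bit, which is my own illustrative choice and not a mechanism from this thread: the true bit is reported with probability exp(ε) / (1 + exp(ε)) and flipped otherwise.

```python
import math
from itertools import combinations


def randomized_response_dist(bit, epsilon):
    """Output distribution of randomized response for a one-row dataset."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return {bit: p_truth, 1 - bit: 1 - p_truth}


def satisfies_dp(dist_x, dist_x_prime, epsilon, delta=0.0):
    """Check P(M(x) in S) <= exp(eps) * P(M(x') in S) + delta for every S."""
    outcomes = sorted(dist_x)
    for r in range(len(outcomes) + 1):
        for subset in combinations(outcomes, r):
            p = sum(dist_x[o] for o in subset)
            p_prime = sum(dist_x_prime[o] for o in subset)
            # Small tolerance because the bound is attained with equality.
            if p > math.exp(epsilon) * p_prime + delta + 1e-12:
                return False
    return True


d0 = randomized_response_dist(0, epsilon=1.0)
d1 = randomized_response_dist(1, epsilon=1.0)
print(satisfies_dp(d0, d1, 1.0), satisfies_dp(d1, d0, 1.0))
print(satisfies_dp(d0, d1, 0.5))  # the same mechanism fails a stricter epsilon
```

The two neighboring datasets here are the two possible values of the single row, so both orderings have to be checked.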

Convergence in distribution

A seq. X₁, X₂, ... of R-valued random vars. converges in distribution (or converges weakly, or in law) to a random var. X

⇔ lim Fₙ(x) = F(x) as n→∞ for all continuity points x of F (Fₙ, F = CDFs of Xₙ, X)

⇔ E(f(Xₙ))→E(f(X)) for all bounded Lipschitz f

⇔ liminf E(f(Xₙ))≥E(f(X)) for all non-negative continuous f

⇔ liminf P(Xₙ∈G)≥P(X∈G) for every open set G

⇔ limsup P(Xₙ∈F)≤P(X∈F) for every closed set F

⇔ P(Xₙ∈B)→P(X∈B) for all continuity sets B of X (i.e., P(X∈∂B)=0)
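A numerical illustration of two of these conditions, using an example of my own choosing: Xₙ uniform on {1/n, 2/n, ..., 1} converges in distribution to X uniform on [0, 1]. The sketch compares the CDFs on a grid and the expectations of the bounded Lipschitz function cos.

```python
import math


def cdf_discrete_uniform(x, n):
    """CDF of X_n, uniform on {1/n, 2/n, ..., 1}."""
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return math.floor(n * x) / n


def cdf_uniform(x):
    """CDF of X, uniform on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def mean_f_discrete(f, n):
    """E(f(X_n)) computed exactly as a finite average."""
    return sum(f(k / n) for k in range(1, n + 1)) / n


grid = [k / 200 for k in range(201)]
limit = math.sin(1.0)  # E(cos(X)) = integral of cos over [0, 1]

cdf_gaps = {}
mean_gaps = {}
for n in (10, 100, 1000):
    cdf_gaps[n] = max(abs(cdf_discrete_uniform(x, n) - cdf_uniform(x)) for x in grid)
    mean_gaps[n] = abs(mean_f_discrete(math.cos, n) - limit)
    print(n, cdf_gaps[n], mean_gaps[n])
```

Both gaps shrink roughly like 1/n here. This is a sanity check on one example, not a proof of any of the equivalences.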

How many can you prove?

[cont.] Kullback–Leibler divergence.

Following that line of thought, the Kullback-Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p):

D(p||q) = ∑ᵢ pᵢ log₂(pᵢ / qᵢ)

= ∑ᵢ pᵢ log₂(1/qᵢ) - ∑ᵢ pᵢ log₂(1/pᵢ)

= H(p, q) - H(p)
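A quick numerical check of this identity on two made-up distributions over three symbols (the values of `p` and `q` are assumptions chosen so the logarithms come out as whole numbers):

```python
import math


def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))                # 0.25
print(cross_entropy(p, q) - entropy(p))   # 1.75 - 1.5 = 0.25
```

So encoding with the wrong distribution q costs 0.25 extra bits per symbol on average in this example.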

[cont.] Cross-entropy.

By that interpretation, in contrast to entropy, cross-entropy is the expected number of bits per transmission under a potentially suboptimal encoding {log₂(1/q₁), log₂(1/q₂), ...}, which is based on a potentially inaccurate distribution {q₁, q₂, ...} for the symbols. That is, mathematically cross-entropy is given by:

H(p, q) = ∑ᵢ pᵢ log₂(1/qᵢ).
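As a small illustration, take a prefix code built for a guessed distribution q and use it to send symbols that actually follow p. The symbols, distributions and code words below are made up so that the code lengths equal log₂(1/qᵢ) exactly.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true symbol frequencies (hypothetical)
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # guessed frequencies used to design the code

# Prefix code with word lengths log2(1/q): 2, 2 and 1 bits.
code_for_q = {"a": "00", "b": "01", "c": "1"}

average_bits = sum(p[s] * len(code_for_q[s]) for s in p)
cross_entropy = sum(p[s] * math.log2(1 / q[s]) for s in p)

print(average_bits, cross_entropy)  # both 1.75
```

The average number of bits per symbol under this mismatched code equals H(p, q).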

Entropy

(my understanding of information theory is very rudimentary)

Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of the i'th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i'th symbol (afaik; don't know if fully correct, or how to prove...).

The entropy is just the expected number of bits per transmission under this optimal encoding:

∑ᵢ pᵢ log₂(1/pᵢ).
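A tiny example where the recipe works out exactly, because every frequency is a power of 1/2. The symbols and the prefix code are made up for illustration.

```python
import math

freqs = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical symbol frequencies
code = {"a": "0", "b": "10", "c": "11"}    # prefix code with lengths log2(1/p)

average_bits = sum(freqs[s] * len(code[s]) for s in freqs)
entropy = sum(p * math.log2(1 / p) for p in freqs.values())

print(average_bits, entropy)  # both 1.5
```

When the frequencies are not powers of 1/2, log₂(1/pᵢ) is not a whole number, so a real code can only approximate these lengths.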

Law of total probability/expectation

The probability of an event can be written as a weighted sum of conditional probabilities.

The expected value of a random variable can be written as a weighted sum of conditional expected values.

If {Aᵢ}ᵢ is a finite or countably infinite partition of the sample space, then

P(B) = ∑ᵢ P(B | Aᵢ) P(Aᵢ)

E(X) = ∑ᵢ E(X | Aᵢ) P(Aᵢ)

An analogous relationship holds when conditioning on a random variable Y instead of a partition, e.g. E(X) = E(E(X | Y)).
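A worked example with exact fractions. The scenario is made up: two machines A₁ and A₂ produce 60% and 40% of all items, B is the event that an item is defective, and X is the repair cost of an item.

```python
from fractions import Fraction

# Partition of the sample space: which machine produced the item.
p_machine = [Fraction(3, 5), Fraction(2, 5)]

# Conditional probability of a defect, and conditional expected cost, per machine.
p_defect_given_machine = [Fraction(1, 100), Fraction(1, 20)]
expected_cost_given_machine = [Fraction(10), Fraction(20)]

p_defect = sum(pb * pa for pb, pa in zip(p_defect_given_machine, p_machine))
expected_cost = sum(ex * pa for ex, pa in zip(expected_cost_given_machine, p_machine))

print(p_defect)       # 13/500
print(expected_cost)  # 14
```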

Relationship between simple linear regression and moments of random variables.

Regression equation:

yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n

(where εᵢ represents random noise)

Assume that the xᵢ are i.i.d. random variables with finite, non-zero variance, and that the εᵢ (also i.i.d.) have mean 0 and are independent of the xᵢ. Then:

β = Cov(xᵢ, yᵢ) / Var(xᵢ)

= Cor(xᵢ, yᵢ) * Sd(yᵢ) / Sd(xᵢ),

α = E(yᵢ) - βE(xᵢ).
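A short sketch of why these identities hold, using only the regression equation and the assumptions above (independence gives Cov(xᵢ, εᵢ) = 0):

\[\mathrm{Cov}(x_i, y_i) = \mathrm{Cov}(x_i, \alpha + \beta x_i + \varepsilon_i) = \beta \mathrm{Var}(x_i) + \mathrm{Cov}(x_i, \varepsilon_i) = \beta \mathrm{Var}(x_i)\]

\[\mathrm{E}(y_i) = \alpha + \beta \mathrm{E}(x_i) + \mathrm{E}(\varepsilon_i) = \alpha + \beta \mathrm{E}(x_i)\]

Dividing the first line by Var(xᵢ) gives β, and rearranging the second gives α. The correlation form follows from Cov(xᵢ, yᵢ) = Cor(xᵢ, yᵢ) Sd(xᵢ) Sd(yᵢ).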

Relationship between simple linear regression and sample mean, correlation, standard deviation.

Regression equation:

yᵢ = α + βxᵢ + εᵢ, i = 1, 2, ..., n

(where εᵢ represents random noise)

Then the least squares estimator of β is

b = Cor(x, y) * Sd(y) / Sd(x),

and the estimator of α is

a = Mean(y) - b * Mean(x)

(where Mean, Sd, and Cor are the sample mean, the sample standard deviation, and the sample Pearson correlation coefficient, respectively)
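A quick check on made-up data that these formulas agree with the usual least squares slope Sxy / Sxx. The data values are hypothetical, and the correlation is computed by hand to keep the sketch to basic standard-library calls.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations
n = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)   # sample standard deviations

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

cor = sxy / ((n - 1) * sd_x * sd_y)   # sample Pearson correlation

b = cor * sd_y / sd_x                 # slope via correlation and standard deviations
a = mean_y - b * mean_x               # intercept
b_direct = sxy / sxx                  # textbook least squares slope

print(b, b_direct, a)
```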

EM algorithm [2/2]

Here is the formal general statement.

Let \(X\) be the observed data, and let \(Z\) be the unobserved data. Let \(l(\theta; X, Z)\) be the log-likelihood of the complete data \((X, Z)\) where \(\theta\) is the parameter vector of interest.

With initial guess \(\hat{\theta}^0\), repeat for \(j = 0, 1, 2, \dots\) until convergence:

1. E-step: Compute

\[Q(\theta, \hat{\theta}^j) = E(l(\theta; X, Z) | X, \hat{\theta}^j).\]

2. M-step:

\[\hat{\theta}^{j+1} = \arg\!\max_\theta Q(\theta, \hat{\theta}^j).\]

EM algorithm [1/2]

Let's say we want to estimate parameters θ based on data X according to some statistical model f. But assume that actually

f is f(X, Z; θ)

i.e., there are some unobserved variables Z which influence X.

The Expectation-Maximization (EM) algorithm roughly repeats the following steps:

1. (E-step) Based on the current estimate of θ compute the expected value of Z.

2. (M-step) Obtain a new estimate of θ based on f where Z is replaced with the expected value from step 1.
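The two steps above can be sketched for one concrete model, chosen here for illustration: a mixture of two Gaussians with unit variance and equal weights, where θ is the pair of unknown means and Z is the hidden component label of each point. The simulated data, starting values and fixed iteration count are all assumptions.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


def em_two_means(data, mu0, mu1, n_iter=50):
    for _ in range(n_iter):
        # E-step: expected value of each label Z_i given x_i and the current
        # means, i.e. the probability that point i belongs to component 1.
        resp = []
        for x in data:
            w0, w1 = normal_pdf(x, mu0), normal_pdf(x, mu1)
            resp.append(w1 / (w0 + w1))

        # M-step: re-estimate the means with Z replaced by its expectation,
        # which gives responsibility-weighted averages.
        total1 = sum(resp)
        total0 = len(data) - total1
        mu1 = sum(r * x for r, x in zip(resp, data)) / total1
        mu0 = sum((1 - r) * x for r, x in zip(resp, data)) / total0
    return mu0, mu1


rng = random.Random(0)
data = [rng.gauss(-2, 1) for _ in range(500)] + [rng.gauss(3, 1) for _ in range(500)]
mu0_hat, mu1_hat = em_two_means(data, mu0=-1.0, mu1=1.0)
print(mu0_hat, mu1_hat)
```

In this model the complete-data log-likelihood is linear in the labels, which is why plugging in the expected labels is exactly what the E-step asks for. In general the E-step takes the expectation of the whole log-likelihood.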

Suppose we want to compare an AI classifier to a human but we know that an avg human's classification accuracy is imperfect too.

Let's say the human's overall classification accuracy is 0.8, and on a given dataset of n=100 cases the AI agrees on m=74 of those with the human.

What is α, the classification accuracy of the AI?

Assume the task is binary (so the AI and the human agree exactly when both are correct or both are wrong), and that whether the AI classifies a case correctly is stochastically independent of whether the human does. Then

74 = m = 0.8nα + 0.2n(1-α)

and thus

α = 0.9.
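The same arithmetic as a small function, solving m/n = hα + (1 - h)(1 - α) for α, where h is the human's accuracy. The function name is made up, and the formula needs h ≠ 0.5.

```python
def ai_accuracy(agreements, n, human_accuracy):
    """Solve agreements/n = h*alpha + (1 - h)*(1 - alpha) for alpha."""
    h = human_accuracy
    if h == 0.5:
        raise ValueError("alpha is not identifiable when human accuracy is 0.5")
    return (agreements / n - (1 - h)) / (2 * h - 1)


alpha = ai_accuracy(agreements=74, n=100, human_accuracy=0.8)
print(round(alpha, 6))  # 0.9
```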

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happening. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].

More along these lines: https://stats.stackexchange.com/a/47782
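A quick Monte Carlo check of the two claims about Beta(20, 80), using only the standard library. The sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(0)
samples = [rng.betavariate(20, 80) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
share_inside = sum(0.1 <= s <= 0.35 for s in samples) / len(samples)

print(sample_mean)    # close to 0.2
print(share_inside)   # share of draws that fall within [0.1, 0.35]
```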

Some of the many faces of Jensen's inequality

For a real convex function \(\phi\):

\(

\phi(\sum x_i / n) \leq \sum \phi(x_i) / n

\)

\(

\phi\left( \frac{1}{b-a} \int_a^b g(x) dx \right) \leq \frac{1}{b-a} \int_a^b \phi(g(x)) dx

\)

\(

\phi(\mathrm{E}(X)) \leq \mathrm{E}(\phi(X))

\)

\(

\phi(\mathrm{E}(X | G)) \leq \mathrm{E}(\phi(X) | G)

\)
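A numerical check of the first (finite average) form for two convex functions, on made-up numbers:

```python
import math


def jensen_sides(phi, xs):
    """Return (phi(mean of xs), mean of phi(xs))."""
    mean_x = sum(xs) / len(xs)
    return phi(mean_x), sum(phi(x) for x in xs) / len(xs)


xs = [1.0, 2.0, 6.0]   # hypothetical sample

square_lhs, square_rhs = jensen_sides(lambda x: x * x, xs)
exp_lhs, exp_rhs = jensen_sides(math.exp, xs)

print(square_lhs, square_rhs)   # 9.0 and 41/3
print(exp_lhs, exp_rhs)
```

For the square function the gap between the two sides is exactly the (population) variance of the numbers, here 14/3.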

Toots of random math/stats/ML trivia that I find interesting.
