(my understanding of information theory is very rudimentary)

Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of the i-th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i-th symbol (afaik this is essentially Shannon's source coding theorem: these lengths are exactly optimal when the log₂(1/pᵢ) are integers, and within one bit of optimal on average otherwise).
The entropy is just the expected number of bits per transmission under this optimal encoding:
∑ᵢ pᵢ log₂(1/pᵢ).
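The formula above can be checked numerically; here's a minimal sketch in Python (the function name `entropy` is mine, just for illustration):

```python
import math

def entropy(p):
    # Expected bits per symbol under the optimal code for p:
    # each symbol with frequency p_i gets log2(1/p_i) bits.
    # Terms with p_i = 0 contribute nothing, so skip them.
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

# A fair coin needs 1 bit per toss on average:
print(entropy([0.5, 0.5]))              # → 1.0
# Four equally likely symbols need 2 bits each:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # → 2.0
```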


[cont.] Cross-entropy.

By that interpretation, in contrast to entropy, cross-entropy is the expected number of bits per transmission under a potentially suboptimal encoding {log₂(1/q₁), log₂(1/q₂), ...}, which is based on a potentially inaccurate distribution {q₁, q₂, ...} for the symbols. That is, mathematically cross-entropy is given by:

H(p, q) = ∑ᵢ pᵢ log₂(1/qᵢ).
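Same idea in code, continuing the sketch (again, the function name is just for illustration): symbols actually occur with frequencies p, but the code lengths were chosen as if the distribution were q.

```python
import math

def cross_entropy(p, q):
    # Expected bits per symbol when symbols occur with frequencies p
    # but code lengths log2(1/q_i) were chosen for an assumed
    # distribution q. Terms with p_i = 0 contribute nothing.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]    # true frequencies: a fair coin
q = [0.25, 0.75]  # mistaken assumption about the coin
# The mismatched code costs more than the 1 bit the optimal code needs:
print(cross_entropy(p, q))  # ≈ 1.2075 bits per transmission
```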

[cont.] Kullback–Leibler divergence.

Following that line of thought, the Kullback–Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p), i.e. the expected number of extra bits per transmission you pay for using the code based on q instead of the optimal one:

D(p||q) = ∑ᵢ pᵢ log₂(pᵢ / qᵢ)
        = ∑ᵢ pᵢ log₂(1/qᵢ) − ∑ᵢ pᵢ log₂(1/pᵢ)
        = H(p, q) − H(p).
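Putting the three quantities together in one self-contained sketch (all names mine), we can check the identity on a concrete pair of distributions:

```python
import math

def entropy(p):
    # Expected bits per symbol under the optimal code for p.
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Expected bits per symbol using code lengths chosen for q.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # Extra bits per symbol paid for assuming q instead of p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
# D(p||q) = H(p, q) - H(p) holds term by term:
print(kl_divergence(p, q))                        # → 0.25
print(cross_entropy(p, q) - entropy(p))           # → 0.25
```

Note D(p||q) is never negative: a suboptimal code can't beat the optimal one on average (Gibbs' inequality).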
