(my understanding of information theory is very rudimentary)
Let's say you want to send symbols as messages through a binary channel (0s and 1s). Let pᵢ denote the relative frequency of i'th symbol. Then, to use the smallest number of bits per transmission on average, you should assign log₂(1/pᵢ) bits to the i'th symbol (afaik; don't know if fully correct, or how to prove...).
The entropy is just the expected number of bits per transmission under this optimal encoding:
∑ᵢ pᵢ log₂(1/pᵢ).
[cont.] Kullback–Leibler divergence.
Following that line of thought, the Kullback-Leibler divergence between p and q is simply the difference between cross-entropy(p, q) and entropy(p):
= ∑ᵢ pᵢ log₂(pᵢ / qᵢ)
= ∑ᵢ pᵢ log₂(1/qᵢ) - ∑ᵢ pᵢ log₂(1/pᵢ)
= H(p, q) - H(p)
The social network of the future: No ads, no corporate surveillance, ethical design, and decentralization! Own your data with Mastodon!