# Kullback-Leibler divergence

## 1. As information gain

KL divergence can be interpreted as the expected number of *additional* bits needed to encode a sample from a distribution \(P\), using a code that has been optimized for \(Q\):

\[D_{KL}(P || Q) = \sum_{x\in \mathcal{X}} P(x) \left[ \log P(x) - \log Q(x) \right]\]

Recall that \(-\sum_{x\in \mathcal{X}} P(x) \log P(x)\) is the entropy of \(P\), and that \(-\log P(x)\) is the optimal number of bits to assign to \(x\) in a code optimized for \(P\).
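This reading is easy to check numerically. A minimal sketch in NumPy (the two distributions are arbitrary illustrative values; base-2 logs so the units are bits):

```python
import numpy as np

# Two distributions over the same 4-symbol alphabet (illustrative values).
P = np.array([0.5, 0.25, 0.125, 0.125])
Q = np.array([0.25, 0.25, 0.25, 0.25])

# Expected code length (in bits) when encoding samples from P...
cost_with_P_code = -np.sum(P * np.log2(P))  # ...with a code optimized for P (= entropy of P)
cost_with_Q_code = -np.sum(P * np.log2(Q))  # ...with a code optimized for Q (= cross-entropy)

# The gap between the two is exactly the KL divergence.
kl = np.sum(P * np.log2(P / Q))
print(cost_with_P_code, cost_with_Q_code, kl)  # 1.75, 2.0, 0.25 bits
```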

## 2. In words

We only weight the regions where \(P(x) > 0\). In the worst case, \(Q\) allocates no mass to some point in the support of \(P\), and then \(D_{KL}=\infty\). In fact, as long as \(Q(x) \ll P(x)\) for some \(x\), we will incur a big cost.

What if we weighted everything by \(Q(x)\) instead? Then \(Q\) would not be penalized as harshly for mass it puts outside the support of \(P\). In this sense, \(D_{KL}(P \| Q)\) tries to push \(Q\) under \(P\) as much as possible.
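A small numeric sketch of this behavior (the distributions are arbitrary; the `kl` helper is just a direct transcription of the definition):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; only the support of p matters (0 log 0 := 0)."""
    mask = p > 0
    with np.errstate(divide="ignore"):  # allow log(inf) when q is 0 on p's support
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.5, 0.5])
print(kl(P, np.array([0.5, 0.5])))    # 0.0: the distributions agree
print(kl(P, np.array([0.99, 0.01])))  # large: Q(x) << P(x) at the second point
print(kl(P, np.array([1.0, 0.0])))    # inf: Q misses part of P's support
```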

## 3. Properties

### 3.1. \(D (P || Q) \geq 0\)

#### 3.1.1. Proof

\[\begin{align*} D(P || Q) &= \sum_{i=1}^{n} P(x_i) \log \left(\frac{P(x_i)}{Q(x_i)}\right)\\ &= -\sum_{i=1}^{n} P(x_i) \log \left(\frac{Q(x_i)}{P(x_i)}\right)\\ &\geq -\log \left(\sum_{i=1}^{n} P(x_i)\frac{Q(x_i)}{P(x_i)}\right)\\ &= -\log \left(1\right)\\ &= 0 \end{align*}\] The third line comes from Jensen's inequality and the fact that \(-\log(x)\) is a convex function.

Rough intuition: we want to pile as much of \(Q\)'s mass under \(P\) as possible. If at any point \(x\) we overfill \(P\), that point contributes a negative amount to \(D_{KL}\). But because our probability mass is finite, overfilling at one point means under-filling at some other point, and the two contributions don't cancel. We take a log of the ratio, so the gain from overfilling is modest, while under-filling drives \(Q(x)/P(x)\) toward zero, and \(-\log\) of a small ratio is very large.
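A quick empirical sanity check of the inequality (Dirichlet draws, an arbitrary choice here, keep every coordinate strictly positive so the log ratio is well defined):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # Direct transcription of the definition, in nats.
    return np.sum(p * np.log(p / q))

# Random strictly-positive distributions on 10 points: KL is never negative,
# and it is exactly 0 when the two distributions coincide.
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    assert kl(p, q) >= 0
assert kl(p, p) == 0.0
```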

## 4. Motivation for using KL divergence

- here's a motivation for using KL in maximum likelihood estimation, from Philippe Rigollet
- if we could, we would use total variation, and our recipe for finding the best model would be:
- We want to find the true parameter \(\theta^*\). Construct a function that maps each \(\theta\) to an estimate of \(TV(\theta, \theta^*)\); that is, for each \(\theta\) we have an estimator. Then just choose the \(\theta\) with the smallest estimated TV.
- problem: it is hard to construct an estimator of TV

- But the KL divergence is easy to construct an estimator for, because it is an expectation (with respect to \(P_{\theta^*}\)): as we collect more samples, we can build an average that approaches this expectation
- And importantly, the KL divergence is minimized exactly when the parameters are correct.
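The recipe above can be sketched for a Bernoulli model (the true parameter, grid, and sample size here are arbitrary choices for the demo). Minimizing the estimated cross-entropy over \(\theta\) is the same as minimizing the estimated KL, since the entropy term does not depend on \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli model: theta is P(X = 1). True parameter (arbitrary for the demo):
theta_star = 0.3
x = rng.random(100_000) < theta_star  # samples from P_{theta*}

def avg_neg_log_lik(theta):
    # Monte-Carlo estimate of E_{P_{theta*}}[-log p_theta(X)], the cross-entropy.
    return -np.mean(np.where(x, np.log(theta), np.log(1 - theta)))

# Grid-search theta; the minimizer lands near theta_star.
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmin([avg_neg_log_lik(t) for t in thetas])]
print(best)  # close to theta_star
```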

## 5. See also

- Cross Entropy
- why is it not symmetric?
- nice youtube video
- another nice video with an intuition: KL loss can be derived from the ratio

\[ \frac{P(\text{Observations from P} \mid P)}{P(\text{Observations from P} \mid Q)} \]
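The per-observation log of this ratio is the sample average of \(\log(P(x)/Q(x))\), which converges to \(D_{KL}(P \| Q)\). A quick check (the two distributions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.3, 0.5])
kl = np.sum(P * np.log(P / Q))  # D_KL(P || Q) in nats

# Draw n observations from P; compare their log-likelihood under P vs under Q.
n = 200_000
obs = rng.choice(3, size=n, p=P)
log_ratio = np.sum(np.log(P[obs])) - np.sum(np.log(Q[obs]))

print(log_ratio / n, kl)  # the per-sample log ratio approaches D_KL(P || Q)
```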