mutual information
The mutual information between random variables \(X\) and \(Y\) is the Kullback-Leibler divergence between \(P(X,Y)\) and \(P(X)P(Y)\): \[ I(X;Y) = D_{KL}\big(P(X,Y) \,\|\, P(X)P(Y)\big) = \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)} \]
Expressed in terms of entropy: \[ \begin{aligned} I(X;Y) &= H(X) - H(X\mid Y)\\ &= H(Y) - H(Y\mid X) \end{aligned} \]
In the first line, you can think of:
- \(H(X)\) as the uncertainty in \(X\)
- the conditional entropy \(H(X\mid Y)\) as the uncertainty remaining in \(X\) after \(Y\) is known
- \(I(X;Y)\) as the amount that knowing \(Y\) reduces the uncertainty in \(X\)
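To make these identities concrete, here is a minimal numerical sketch in NumPy. The joint distribution `p_xy` is made up purely for illustration; the point is that the KL-divergence definition and both entropy decompositions give the same number.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over a 2x3 grid (values made up
# for illustration); rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.10, 0.20]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y), shape (1, 3)

# Definition: KL divergence between the joint and the product of the marginals.
mi_def = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Entropy form: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).
h_x = -np.sum(p_x * np.log(p_x))
h_y = -np.sum(p_y * np.log(p_y))
h_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))   # H(X|Y)
h_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x))   # H(Y|X)

print(mi_def, h_x - h_x_given_y, h_y - h_y_given_x)   # all three agree
```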
You can also think of \(I(X;Y)\) as a measure of the information shared by \(X\) and \(Y\): how much does knowing one variable reduce uncertainty about the other? If \(X\) and \(Y\) are independent, then \(I(X;Y)\) is zero, because knowing the value of \(Y\) doesn't change the distribution of \(X\) at all. At the other extreme, if \(X\) is a deterministic function of \(Y\), then \(H(X \mid Y) = 0\), since no uncertainty about \(X\) remains after observing \(Y\), and so \(I(X;Y) = H(X) - H(X\mid Y) = H(X)\). That is, observing \(Y\) removes exactly all the uncertainty that \(X\) had to begin with.
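The two extremes can be checked numerically. In this sketch, the helper `mutual_information` and both joint tables are illustrative choices: an independent pair built as an outer product of marginals, and a deterministic pair where \(X = Y\).

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) for a discrete joint distribution given as a 2-D array."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    ratio = np.where(p_xy > 0, p_xy / (p_x * p_y), 1.0)  # treat 0 log 0 as 0
    return np.sum(p_xy * np.log(ratio))

# Independent: the joint is the product of its marginals, so I(X;Y) = 0.
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(p_indep))        # 0.0

# Deterministic: X is a function of Y (here X = Y), so I(X;Y) = H(X).
p_det = np.diag([0.2, 0.3, 0.5])
p_x = p_det.sum(axis=1)
h_x = -np.sum(p_x * np.log(p_x))
print(mutual_information(p_det), h_x)     # both ≈ 1.0297 nats
```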
Looking more closely at the definition: just like entropy, mutual information is an average over a distribution. Here we average over the joint distribution, and at each point we measure how far the pair \((x, y)\) is from independence.
It turns out that \(I(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent.
Note that mutual information is symmetric: \(I(X;Y) = I(Y;X)\).
Notice that the term inside the sum, \(\log p(x,y) - \log\big(p(x)p(y)\big) = \log\frac{p(x,y)}{p(x)p(y)}\), can be thought of as a pointwise distance from independence: it is zero exactly when \(p(x,y) = p(x)p(y)\).
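As a small illustration (reusing the same made-up joint table from the earlier sketch), the pointwise term can be tabulated directly, and its average under \(p(x,y)\) is exactly the mutual information:

```python
import numpy as np

# Same hypothetical joint distribution as above.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.10, 0.20]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

# Pointwise term log p(x,y) - log(p(x)p(y)): positive where x and y occur
# together more often than independence would predict, negative where less
# often, and zero under exact independence.
pointwise = np.log(p_xy) - np.log(p_x * p_y)
print(pointwise)

# Averaging the pointwise terms over the joint distribution recovers I(X;Y).
print(np.sum(p_xy * pointwise))
```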