Cross Entropy
1. Statement
The cross entropy between two distributions \(p\) and \(q\) over the same outcome space \(\mathcal{X}\) is: \[H(p,q) = -\sum_{x\in\mathcal{X}} p(x)\log(q(x))\]
Cross entropy can be interpreted as the average number of bits required to encode samples from the true distribution \(p\), using a code that has been optimized for a distribution \(q\). Here, \(-\log(q(x))\) is the number of bits needed to encode the outcome \(x\) in a code optimized for \(q\) (see entropy).
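To make the formula concrete, here is a minimal sketch (the distributions below are made up for illustration) that computes \(H(p,q)\) in bits:

```python
import numpy as np

# Cross entropy H(p, q) = -sum_x p(x) * log2(q(x)); log base 2 gives bits.
def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.25, 0.5, 0.25])   # distribution the code was optimized for

print(cross_entropy(p, p))  # 1.5  bits: code matched to p
print(cross_entropy(p, q))  # 1.75 bits: the mismatched code costs extra bits
```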
This interpretation comes from the Kraft-McMillan inequality, which says that the encoded lengths of the source symbols define an implicit distribution \(q\). For example, if we use a binary code to encode \(x\), and the code word has length \(l_x\), then \(q(x) = \left(\frac{1}{2}\right)^{l_x}\).
To see this interpretation a little differently, suppose we have already been given a code for \(X\) with code lengths \(l_x\). Then this code is optimized for the distribution \(q\) defined above.
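A small sketch of this view, with an illustrative set of code word lengths (not taken from the note): the implied distribution is \(q(x) = 2^{-l_x}\), and the average code length under \(p\) comes out to exactly \(H(p,q)\).

```python
import numpy as np

# A binary code with lengths l_x implies q(x) = 2^{-l_x} (Kraft-McMillan),
# and the expected code length under the true p equals the cross entropy H(p, q).
lengths = np.array([1, 2, 3, 3])             # code word lengths l_x
q = 0.5 ** lengths                           # implied distribution (sums to 1 here)
p = np.array([0.4, 0.3, 0.2, 0.1])           # true distribution of the source

expected_length = np.sum(p * lengths)        # average bits per symbol with this code
cross_entropy   = -np.sum(p * np.log2(q))    # H(p, q) in bits

print(expected_length, cross_entropy)        # both 1.9 bits (up to floating point)
```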
2. Relation to KL-Divergence
Recall that the KL divergence is the additional number of bits needed on average to encode samples from the true distribution \(p\) when using a code optimized for \(q\). The cross entropy can be written as: \[ H(p, q) = H(p) + D_{KL}(p \| q) \] Recall that \(H(p)\) is interpreted as the average number of bits needed to encode a sample from \(p\) using a code optimized for \(p\).
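A quick numerical check of this identity, again with made-up distributions:

```python
import numpy as np

# Verify H(p, q) = H(p) + D_KL(p || q) on a small example (bits, so log base 2).
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])

H_pq = -np.sum(p * np.log2(q))        # cross entropy
H_p  = -np.sum(p * np.log2(p))        # entropy of p
kl   =  np.sum(p * np.log2(p / q))    # KL divergence D_KL(p || q)

print(H_pq, H_p + kl)                 # 1.75 and 1.75
```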
If we are trying to find the best \(q\), then \(p\) is fixed, so \(H(p)\) is a constant and minimizing the cross entropy is equivalent to minimizing the KL divergence.
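As an illustrative sketch of this (the parameterization \(q = \mathrm{softmax}(z)\), the learning rate, and the target \(p\) are assumptions for the example, not from the note), gradient descent on the cross entropy drives \(q\) towards \(p\), exactly as minimizing the KL divergence would:

```python
import numpy as np

# Minimize H(p, softmax(z)) over the logits z. Since H(p) does not depend on z,
# this is the same as minimizing D_KL(p || q), and the minimizer is q = p.
# The gradient of the softmax-cross-entropy w.r.t. the logits is q - p.
p = np.array([0.7, 0.2, 0.1])
z = np.zeros(3)                          # logits, initialized to a uniform q

for _ in range(500):
    q = np.exp(z) / np.sum(np.exp(z))    # softmax
    z -= 0.5 * (q - p)                   # gradient step

q = np.exp(z) / np.sum(np.exp(z))
print(np.round(q, 3))                    # approximately [0.7, 0.2, 0.1]
```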
This Stack Overflow answer claims that cross entropy is more robust when training with mini-batches. (Why?)
3. Relation to maximum likelihood
3.1. Example
Let's say we are working on a classification problem and our model outputs a distribution \(q\), say via a softmax. If there are \(N\) examples in total and class \(i\) accounts for a fraction \(p_i\) of them, then the likelihood of our parameters is: \[\prod_i q_i^{Np_i}\]
The log likelihood is: \[\sum_i Np_i\log(q_i) = -N\cdot H(p, q)\]
Here \(N\) is a constant, and we see that maximizing the log likelihood is equivalent to minimizing the cross-entropy.
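A quick numerical check with invented counts and model outputs: the log likelihood computed directly from the product matches \(-N\cdot H(p,q)\) (natural logs here, so the cross entropy is in nats rather than bits).

```python
import numpy as np

# N examples, class i appears N*p_i times, and the model assigns probability q_i
# to class i. The log likelihood equals -N * H(p, q).
N = 100
p = np.array([0.5, 0.3, 0.2])            # empirical class fractions
q = np.array([0.4, 0.4, 0.2])            # model's predicted distribution

log_likelihood = np.sum(N * p * np.log(q))
cross_entropy  = -np.sum(p * np.log(q))

print(log_likelihood, -N * cross_entropy)   # the two agree (up to floating point)
```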
4. Machine Learning
4.1. One hot encoding
Often in NLP we have a network whose parameters give a distribution \(q\). Our target is a sequence of words drawn from a distribution \(p\), where each target is one-hot encoded. So the sum of the cross-entropies over a sequence of words \(w\) is simply \(-\sum_i \log(q(w_i))\).
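A minimal sketch with a toy vocabulary and random logits (all assumptions for illustration), showing that the one-hot cross entropy over a sequence reduces to summing \(-\log(q(w_i))\) at the target indices:

```python
import numpy as np

np.random.seed(0)
vocab_size = 5
targets = np.array([2, 0, 3])                        # word indices in the sequence
logits  = np.random.randn(len(targets), vocab_size)  # network outputs, one row per word

q = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax per position

# With one-hot p, the full formula collapses to picking out the target probabilities.
loss = -np.sum(np.log(q[np.arange(len(targets)), targets]))
print(loss)
```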
5. Notation
Unfortunately, the notation for cross entropy, \(H(p,q)\), is often the same as the notation for joint entropy, so which is meant has to be inferred from context.