Cross Entropy
1. Statement
The cross entropy between two distributions \(p\) and \(q\) over the same outcome space \(\mathcal{X}\) is: \[H(p,q) = -\sum_{x\in\mathcal{X}} p(x)\log(q(x))\]
Cross entropy can be interpreted as the number of bits required on average to encode samples from the true distribution \(p\), using a code that has been optimized for a distribution \(q\). Here, \(-\log(q(x))\) is the number of bits needed to encode the outcome \(x\) in a code optimized for \(q\) (see entropy).
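A minimal sketch of the definition in Python, assuming base-2 logarithms so the result is in bits; the distributions \(p\) and \(q\) here are made up for illustration:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)), in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (made up)
q = [0.25, 0.25, 0.5]   # distribution the code is optimized for (made up)

print(cross_entropy(p, q))  # 1.75 bits per sample on average
print(cross_entropy(p, p))  # 1.5 bits: coding p with its own optimal code gives H(p)
```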
This interpretation comes from the Kraft-McMillan inequality, which implies that the code word lengths of the source symbols define an implicit distribution \(q\). For example, if we use a binary code to encode \(x\), and its code word has length \(l_x\), then \(q(x) = (\frac{1}{2})^{l_x}\).
To see this interpretation a little differently, suppose we have already been given a code for \(X\) with code word lengths \(l_x\). Then this code is optimized for the distribution \(q\) described above.
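A small sketch of this view with made-up code lengths: the lengths define \(q(x) = (\frac{1}{2})^{l_x}\), and the average code length under \(p\) comes out equal to \(H(p, q)\).

```python
import math

# Made-up binary code: a -> 0, b -> 10, c -> 11, so the code word lengths l_x are:
code_lengths = {"a": 1, "b": 2, "c": 2}

# Implicit distribution q(x) = (1/2)^{l_x} defined by the code lengths.
q = {x: 0.5 ** l for x, l in code_lengths.items()}

# True distribution p that we actually sample from (also made up).
p = {"a": 0.25, "b": 0.25, "c": 0.5}

# Average code length when encoding samples from p with this code...
avg_code_length = sum(p[x] * code_lengths[x] for x in p)
# ...equals the cross entropy H(p, q) in bits.
cross_ent = -sum(p[x] * math.log2(q[x]) for x in p)

print(avg_code_length, cross_ent)  # both 1.75
```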
2. Relation to KL-Divergence
Recall that the KL divergence is the additional number of bits needed on average to encode samples from the true distribution \(p\) using a code optimized for \(q\), compared to using a code optimized for \(p\) itself. The cross entropy can be written as: \[ H(p, q) = H(p) + D_{KL}(p || q) \] Recall that \(H(p)\) is interpreted as: "the average number of bits needed to encode a sample from \(p\) using a code optimized for \(p\)".
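To see where this identity comes from, expand the KL divergence: \[ D_{KL}(p || q) = \sum_{x} p(x)\log\left(\frac{p(x)}{q(x)}\right) = \sum_{x} p(x)\log(p(x)) - \sum_{x} p(x)\log(q(x)) = -H(p) + H(p, q) \]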
This Stack Overflow answer claims that cross entropy is more robust when training with mini-batches. (Why?)
3. Relation to maximum likelihood
3.1. Example
Let's say we are working on a classification problem. Our model outputs a distribution \(q\), say via a softmax (for simplicity, assume it predicts the same \(q\) for every example). If there are \(N\) examples in total, and class \(i\) accounts for a fraction \(p_i\) of the examples, then the likelihood of our parameters is: \[\prod_i q_i^{Np_i}\]
The log likelihood is: \[\sum_i Np_i\log(q_i) = -N\cdot H(p, q)\]
Here \(N\) is a constant, and we see that maximizing the log likelihood is equivalent to minimizing the cross-entropy.
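A small numerical check of this equivalence; the class proportions \(p_i\) and the model output \(q\) below are hypothetical, and natural logs are used to match the likelihood:

```python
import math

N = 100              # total number of examples (hypothetical)
p = [0.7, 0.2, 0.1]  # fraction of examples in each class (hypothetical)
q = [0.6, 0.3, 0.1]  # model's predicted distribution, e.g. a softmax output (hypothetical)

# Log likelihood of the parameters: sum_i N * p_i * log(q_i)
log_likelihood = sum(N * pi * math.log(qi) for pi, qi in zip(p, q))

# Cross entropy H(p, q) in nats (natural log, to match the likelihood)
cross_ent = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

print(log_likelihood)  # equals -N * H(p, q) ...
print(-N * cross_ent)  # ... so maximizing one minimizes the other
```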
4. Machine Learning
4.1. One hot encoding
Often in NLP, we have a network whose output gives a distribution \(q\) over the vocabulary. Our target is a sequence of words drawn from a distribution \(p\), where each target is one-hot encoded. With one-hot targets, each cross entropy term collapses to a single summand, so the sum of the cross-entropies over a sequence of words \(w\) is simply \(-\sum_i \log(q(w_i))\).
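A sketch in Python, with a hypothetical three-word vocabulary and made-up per-position softmax outputs:

```python
import math

# Hypothetical per-position model distributions q (e.g. softmax outputs)
# over a tiny vocabulary, and the target words w_i.
q_per_position = [
    {"the": 0.8, "cat": 0.1, "sat": 0.1},
    {"the": 0.2, "cat": 0.7, "sat": 0.1},
    {"the": 0.1, "cat": 0.2, "sat": 0.7},
]
targets = ["the", "cat", "sat"]

# With one-hot targets, each cross entropy term reduces to -log q(w_i),
# so the total loss over the sequence is:
loss = -sum(math.log(q[w]) for q, w in zip(q_per_position, targets))
print(loss)  # divide by len(targets) for the usual per-token loss
```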
5. Notation
Unfortunately, the notation for cross entropy \(H(p,q)\) is often the same as the notation for joint entropy, so which one is meant has to be inferred from context.