# Cross Entropy

## 1. Statement

The cross entropy between two distributions \(p\) and \(q\) over the same outcome space \(\mathcal{X}\) is: \[H(p,q) = -\sum_{x\in\mathcal{X}} p(x)\log(q(x))\]

Cross entropy can be interpreted as the average number of bits required to encode samples from the true distribution \(p\), using a code that has been optimized for a distribution \(q\). Here, \(-\log(q(x))\) is the number of bits needed to encode the outcome \(x\) in a code optimized for \(q\) (see entropy).

This interpretation comes from the Kraft-McMillan inequality, which says that the encoded lengths of the source symbols define an implicit distribution \(q\). For example, if we use a binary code to encode \(x\), and the code word has length \(l_x\), then \(q(x) = (\frac{1}{2})^{l_x}\).

To see this interpretation a little differently, you can assume that we have already been given a coding for \(X\), where the code lengths are \(l_x\). Then, this code is optimized for a distribution \(q\), as described above.
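This interpretation can be checked numerically. The sketch below (pure Python; the distributions \(p\) and \(q\) are made-up examples, not from the text) computes the code lengths implied by \(q\) via Kraft-McMillan and confirms that the average encoded length of samples from \(p\) equals \(H(p,q)\):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)), in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # true distribution (example)
q = [0.25, 0.25, 0.5]   # distribution the code is optimized for (example)

# Code lengths implied by q: l_x = -log2 q(x), so q(x) = (1/2)^{l_x}
lengths = [-math.log2(qx) for qx in q]

# Average encoded length of samples drawn from p, under the q-optimal code
avg_len = sum(px * lx for px, lx in zip(p, lengths))
assert abs(avg_len - cross_entropy(p, q)) < 1e-12
```

For these particular distributions the code lengths are \((2, 2, 1)\) bits and both quantities come out to 1.75 bits.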

## 2. Relation to KL-Divergence

Recall that the KL divergence is the *additional* bits needed on average to encode samples from the true distribution \(p\), using a code optimized for \(q\). The cross entropy can be written as:
\[
H(p, q) = H(p) + D_{KL}(p || q)
\]
Recall that \(H(p)\) is interpreted as the average number of bits needed to encode a sample from \(p\) using a code optimized for \(p\) itself.
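The decomposition is easy to verify numerically. A minimal sketch (pure Python, base-2 logs; \(p\) and \(q\) are hypothetical example distributions):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2(p(x))."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x))."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# H(p, q) = H(p) + D_KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```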

A Stack Overflow answer says that cross entropy is more robust when doing mini-batch training. (Why?)

## 3. Relation to maximum likelihood

### 3.1. Example

Let's say we are working on a classification problem. Our model outputs a distribution \(q\), say using the softmax. If there are \(N\) examples in total, and class \(i\) has a \(p_i\) fraction of the examples, then the likelihood of our parameters is: \[\prod_i q_i^{Np_i}\]

The log likelihood is: \[\sum_i Np_i\log(q_i) = -N\cdot H(p, q)\]

Here \(N\) is a constant, and we see that maximizing the log likelihood is equivalent to minimizing the cross-entropy.
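The identity \(\sum_i Np_i\log(q_i) = -N\cdot H(p,q)\) can be checked directly. A small sketch (pure Python, natural logs; \(N\), \(p\), and \(q\) are made-up example values):

```python
import math

N = 10                   # total number of examples (example value)
p = [0.5, 0.3, 0.2]      # p_i = fraction of examples in class i
q = [0.6, 0.3, 0.1]      # model's predicted class distribution

# Log likelihood of the parameters: sum_i N p_i log(q_i)
log_lik = sum(N * pi * math.log(qi) for pi, qi in zip(p, q))

# Cross entropy H(p, q) in nats (natural log, to match the likelihood)
H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Maximizing log_lik is the same as minimizing H(p, q), since N is constant
assert abs(log_lik - (-N * H_pq)) < 1e-9
```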

## 4. Machine Learning

### 4.1. One hot encoding

Often in NLP, we have a network whose parameters give a distribution \(q\). Our target is a sequence of words drawn from a distribution \(p\), where the targets are one-hot encoded. Since each one-hot \(p\) puts all its mass on the observed word, the sum of the cross-entropies over a sequence of words \(w\) is simply \(-\sum_i \log(q(w_i))\).
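A minimal sketch of this loss (pure Python; the logits, vocabulary size, and target indices are hypothetical, and `softmax` is a toy stand-in for a network's output layer):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution q."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One row of logits per position in the sequence (3-word vocabulary, example)
logits_per_step = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
targets = [0, 1]    # index of the observed word w_i at each position

# Total loss: -sum_i log q(w_i), since the one-hot p selects a single term
loss = -sum(math.log(softmax(row)[t])
            for row, t in zip(logits_per_step, targets))
```

Because the one-hot target zeroes out every term except the observed word's, only one \(\log q\) value per position survives the sum.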

## 5. Notation

Unfortunately, the notation for cross entropy \(H(p,q)\) is often the same as the notation for joint entropy, so the intended meaning must be inferred from context.