Entropy
The entropy of a categorical random variable \(X\) is the average information content of \(X\): \[ H(X) = -\sum_{x\in \mathcal{X}} p(x)\log p(x) \]
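As a quick numerical sketch of the definition (the `entropy` helper and the coin distributions below are my own illustration, not part of the note):

```python
import math

def entropy(p, base=math.e):
    """H(X) = sum_x p(x) log(1/p(x)); terms with p(x) = 0 contribute 0 by convention."""
    return sum(px * math.log(1 / px, base) for px in p if px > 0)

print(entropy([0.5, 0.5], base=2))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1], base=2))   # biased coin: ~0.47 bits, less surprising on average
print(entropy([1.0, 0.0], base=2))   # degenerate: 0.0 bits, the outcome is certain
```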
1. Properties to note
- \(H(X)\geq 0\), because \(0 < p(x) \leq 1\) implies \(\log p(x) \leq 0\), so every term \(-p(x)\log p(x)\) is non-negative
- \(H(X)\) is minimized (and equal to \(0\)) when \(p(x)=1\) for some \(x\).
Entropy is maximized when the distribution is uniform. This can be proven using Jensen's Inequality:
1.1. Proof
From Jensen's inequality, since \(\log\) is concave, we know that \(H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathbb{E}\left[\frac{1}{p(X)}\right]\). And \(\mathbb{E}\left[\frac{1}{p(X)}\right] = \sum_{i=1}^{n} p(x_i)\frac{1}{p(x_i)} = n\), so \(H(X) \leq \log(n)\). The bound is attained: for a categorical random variable \(X\) with \(|\mathcal{X}| = n\) and uniform distribution, the entropy is \(H(X) = \sum_{i=1}^{n}\frac{1}{n}\log(n) = \log(n)\).
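A small numerical check of the bound, using an `entropy` helper as above and two example distributions I chose arbitrarily:

```python
import math

def entropy(p):
    """Natural-log entropy of a categorical distribution given as a list of probabilities."""
    return sum(px * math.log(1 / px) for px in p if px > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform), math.log(n))   # both ~1.386: the bound log(n) is attained by the uniform distribution
print(entropy(skewed) < math.log(n))   # True: a non-uniform distribution falls strictly below log(n)
```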
2. Questions
2.1. Why does it make sense to talk about low-probability events as having high information?
Imagine playing a game of Guess Who. Saying "my character is a man" carries little information, since it picks out roughly half of the characters. Saying "my character is a doctor" carries more information, since it picks out far fewer characters. The lower the probability of a statement being true of a random character, the more it narrows down the possibilities.
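In terms of the formula, the information content (surprisal) of a single statement is \(\log \frac{1}{p(x)}\). A rough sketch, where the counts (half the board, 2 of 24 characters) are made-up numbers for illustration:

```python
import math

def surprisal(p):
    """Information content log2(1/p(x)) of a single outcome, in bits."""
    return math.log2(1 / p)

print(surprisal(0.5))     # "is a man": ~half the characters qualify -> 1.0 bit
print(surprisal(2 / 24))  # "is a doctor": say 2 of 24 characters qualify -> ~3.58 bits
```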