# source coding theorem

The task of compression: Convert sequence of symbols to sequence of encoded symbols. It's desirable for the encoded sequence to be as short as possible.

## 1. informal statement

Given a random variable \(X\) with entropy \(H(X)\), any encoding of \(X\) will not have a smaller rate than \(H(X)\). Conversely, any rate smaller than \(H(X)\) will make loss very likely. However, you can get the rate arbitrarily close to \(H(X)\) and get arbitrarily small error in the limit as the sequence length \(n\rightarrow \infty\).

### 1.1. intuitive statement

I am trying to find an encoding scheme that will compress the input information from \(X\). The *rate* is the number of coding bits needed per input bit. This theorem is telling us that we can't compress below a certain rate, namely \(H(X)\), without incurring the penalty of likely loss.

## 2. making the rate arbitrarily close to \(H(X)\)

Shannon's source coding theorem says that: Given a random variable \(X\) with entropy \(H(X)\), define a notation for \(n\) i.i.d repetitions of \(X\): call it \(X^{1:n} = X_1,...,X_n\) For any \(\epsilon > 0\), for a rate of \(H(X)+\epsilon\), there exists:

- large enough \(n\) and
- a code that will permit any \(X^{1:n}\) to be encoded in \(n(H(X)+\epsilon)\) bits such that the original \(X^{1:n}\) can be recovered with probability of at least \(1-\epsilon\)

Here, *rate* means the average number of bits per symbol used by the encoder.

A personal confusion I have with the above is why it's phrased so as to get better error bounds as you decrease \(\epsilon\). Wouldn't I expect the theorem to tell me I get better error bounds as I increase the rate above \(H(X)\)?

## 3. relationship with machine learning

If we had a perfect language model, then we could obtain a perfect code: just assign each string to a codeword that has a length equal to its surprisal value.

If we had a perfect coding scheme, we could recover the language model by using the lengths to back solve for probabilities.