# noisy channel coding theorem

Recall that the source coding theorem tells us that, for a random variable \(X\), we cannot compress a stream \(X^{1:n}\) of inputs below a *coding rate* of \(H(X)\) without making loss very likely. Here, *rate* means the number of coding bits used per input symbol. So a lower rate is better, since it results in a more compressed message.

The noisy channel coding theorem is similar. It tells us that a noisy channel must allow a certain *channel rate* for communication to be possible. Here, *rate* means the amount of information transmitted per channel use. So a higher rate is better, since it results in more efficient use of the channel.

The mathematical model for a noisy channel is: \[ W \longrightarrow \boxed{\text{encoder}} \overset{X^{n}}{\longrightarrow} \boxed{\text{channel}} \overset{Y^{n}}{\longrightarrow} \boxed{\text{decoder}} \longrightarrow \hat{W} \] where

- \(W\) is the message
- \(X^{n}\) is the encoded message
- \(Y^{n}\) is the received sequence, corrupted by noise
- \(\hat{W}\) is the decoded message

For a single instance of the random variable \(X\), the channel outputs \(Y\). The noise in the channel is characterized by the conditional distribution \(p(y \mid x)\).

The channel fixes \(p(y\mid x)\); choosing \(p(x)\) then completely determines the joint distribution \(p(x,y) = p(x)p(y\mid x)\). This induces a mutual information \(I(X;Y)\).
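To make this concrete, here is a minimal sketch of the computation: given a choice of \(p(x)\) and a channel matrix \(p(y\mid x)\), form the joint \(p(x,y)\) and compute \(I(X;Y)\). The function name `mutual_information` and the example channel (a binary symmetric channel with flip probability 0.1) are my own illustrative choices, not from the text.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits, given p(x) (shape [k]) and channel matrix p(y|x) (shape [k, m])."""
    p_xy = p_x[:, None] * p_y_given_x            # joint p(x, y) = p(x) p(y|x)
    p_y = p_xy.sum(axis=0)                       # output marginal p(y)
    mask = p_xy > 0                              # avoid log(0) on impossible pairs
    ratio = p_xy / (p_x[:, None] * p_y[None, :])
    return float((p_xy[mask] * np.log2(ratio[mask])).sum())

# Binary symmetric channel with flip probability 0.1, uniform input
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])
print(mutual_information(p_x, p_y_given_x))      # ≈ 0.531 bits
```

For this channel and uniform input, the value agrees with the closed form \(1 - H_2(0.1) \approx 0.531\) bits.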

The *information capacity* is the highest \(I(X;Y)\) achievable with a choice of \(p(x)\):
$$
C = \max_{p(x)} I(X;Y) \tag{1} \label{cap}
$$

It turns out that the *operational capacity* of a channel is exactly the *information capacity*. This is the noisy channel coding theorem.
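The maximization in the capacity definition can be carried out numerically for a simple channel. Below is a sketch, assuming a binary symmetric channel with flip probability \(f\), that grid-searches over input distributions \([q, 1-q]\) and compares against the known closed form \(C = 1 - H_2(f)\); the helper names are my own.

```python
import numpy as np

def binary_entropy(p):
    """H_2(p) in bits, clipped away from 0 and 1 for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity_numeric(f, grid=10001):
    """Maximize I(X;Y) over input distributions [q, 1-q] for a BSC(f)."""
    q = np.linspace(1e-6, 1 - 1e-6, grid)
    p_y1 = q * (1 - f) + (1 - q) * f             # output marginal P(Y = 1)
    # I(X;Y) = H(Y) - H(Y|X), and H(Y|X) = H_2(f) for every input distribution
    mutual_info = binary_entropy(p_y1) - binary_entropy(f)
    return mutual_info.max()

f = 0.1
print(bsc_capacity_numeric(f))                   # ≈ 0.531
print(1 - binary_entropy(f))                     # closed form 1 - H_2(f), same value
```

The maximum is attained at the uniform input \(q = 1/2\), which is what the symmetry of the channel suggests.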

## 1. Where does this bound come from?

A personal confusion I always had was: "how does choosing \(p(x)\) have anything to do with choosing a code?" Part of the answer is that the two computations are separate: computing the information capacity involves no codes at all, whereas the *operational capacity* is the maximum rate achieved by a coding scheme.

However, what I said in the first paragraph isn't the whole story. The act of choosing \(p(x)\) has *something* to do with codes. Here's how it works. Imagine that the sender is equally likely to convey one of the \(|\mathcal{C}|\) messages in the code \(\mathcal{C} = \{\mathbf{x}_1,...,\mathbf{x}_{|\mathcal{C}|}\}\), where \(\mathbf{x}_i = [x^1_i,...,x^N_i]\). The received sequence is also an \(N\)-dimensional vector \(\mathbf{y}\).

Then, the conditional entropy of \(\mathbf{x}\) given \(\mathbf{y}\) should be close to zero, if we want to be able to recover the original message at the decoder: \[ H(\mathbf{x} \mid \mathbf{y}) = H(\mathbf{x}) - I(\mathbf{x}; \mathbf{y}) \]

Then, the information transmitted by each message is \(\log |\mathcal{C}| = RN\), where \(N\) is the block size and \(R\) is the rate. Why? There are \(|\mathcal{C}|\) equally likely messages, so each block carries \(\log |\mathcal{C}|\) bits of information.

If \(H(\mathbf{x} \mid \mathbf{y}) \approx 0\), then \(H(\mathbf{x}) \approx I(\mathbf{x}; \mathbf{y})\); since \(H(\mathbf{x}) = \log|\mathcal{C}| = NR\), we get \[ R \approx \frac{1}{N} I(\mathbf{x}; \mathbf{y}) \]

Then, the rate approximately depends on the mutual information between \(\mathbf{x}\) and \(\mathbf{y}\), which depends on the coding scheme \(\mathcal{C}\). So optimizing over all possible coding schemes gives us the capacity: \[ C = \max_{\mathcal{C}} \frac{1}{N}I(\mathbf{x}; \mathbf{y}) \]

As it turns out, with some work, we see that we can instead examine the quantity from \(\ref{cap}\). See this textbook for details.

## 2. Operational capacity

Now, we define the *operational capacity*:

An \((M,n)\) code for a channel \((\mathcal{X}, p(y\mid x), \mathcal{Y})\) consists of:

- an index set of messages \(\{1,...,M\}\)
- an encoding function \(\{1,...,M\}\rightarrow \mathcal{X}^n\)
- a decoding function \(\mathcal{Y}^n \rightarrow \{1,...,M\}\)

Then, the *rate* of an \((M,n)\) code is:
\[
R = \frac{\log_2{M}}{n}
\]
This is the information content of your message, \(\log_2{M}\) bits, divided by the number of times you have to use the channel.
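The rate formula is simple enough to sketch directly. The `rate` helper below is my own name; the two example codes (a length-5 binary repetition code and a \((16, 7)\) Hamming-style code) are standard illustrations, not from the text.

```python
from math import log2

def rate(M, n):
    """Rate of an (M, n) code: log2(M) information bits per channel use."""
    return log2(M) / n

# A length-5 binary repetition code carries one of M = 2 messages in 5 channel uses
print(rate(2, 5))     # 0.2

# A (16, 7) code carries one of 16 messages in 7 channel uses
print(rate(16, 7))    # 4/7 ≈ 0.571
```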

A rate \(R\) is *achievable* if there exists a sequence of \((2^{nR}, n)\) codes such that the maximum probability of error over all messages goes to 0 as \(n\rightarrow \infty\). Why would we expect this to be possible? As the block size \(n\) increases, a code with rate \(R\) has \(2^{nR}\) messages to index, but the encoder can choose codewords from among the \(2^{n}\) possible binary blocks. That leaves \(2^{n(1-R)}\) blocks per message. As that buffer grows, we can hope that the chance of error goes down.
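One can watch error probability vanish with block length in the simplest possible case: an \(n\)-fold repetition code over a binary symmetric channel with majority-vote decoding. This is only an illustration of "more redundancy, fewer errors"; repetition codes do *not* achieve capacity, since their rate \(1/n\) also goes to zero, whereas the theorem promises vanishing error at any fixed rate below \(C\). The simulation below, with helper names of my own choosing, assumes a BSC with flip probability 0.1.

```python
import random

def repetition_error_prob(n, f, trials=20000, seed=0):
    """Empirical decoding-error probability of an n-fold repetition code
    over a BSC(f), using majority-vote decoding (n should be odd)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        flips = sum(rng.random() < f for _ in range(n))
        if flips > n // 2:        # a majority of the n copies were corrupted
            errors += 1
    return errors / trials

# Error probability falls as the block length grows (so does the rate 1/n;
# the theorem's achievability claim is much stronger than this toy example)
for n in (1, 3, 11, 31):
    print(n, repetition_error_prob(n, f=0.1))
```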

Finally, we can define the *operational capacity* as the supremum of all achievable rates on a channel.