noisy channel coding theorem
Recall that the source coding theorem tells us that, for an r.v. \(X\), we cannot compress a stream \(X^{1:n}\) of inputs below a coding rate of \(H(X)\) without making loss very likely. Here, rate means the number of coding bits used per source symbol. So a lower rate is better, since it results in a more compressed message.
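As a quick numeric illustration (my own example, not part of the theorem's statement): an i.i.d. Bernoulli(0.1) source has entropy \[ H(X) = -0.1\log_2 0.1 - 0.9\log_2 0.9 \approx 0.469 \text{ bits}, \] so no lossless scheme can use fewer than about 0.469 coding bits per source symbol on average.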
The noisy channel coding theorem is similar. It tells us that every noisy channel has a maximum rate, its capacity, below which reliable communication is possible and above which it is not. Here, rate means the amount of information transmitted per channel use. So a higher rate is better, since it results in more efficient use of the channel.
The mathematical model for a noisy channel is: \[ W \longrightarrow \boxed{\text{encoder}} \overset{X^{n}}{\longrightarrow} \boxed{\text{channel}} \overset{Y^{n}}{\longrightarrow} \boxed{\text{decoder}} \longrightarrow \hat{W} \] where
- \(W\) is the message
- \(X^{n}\) is the encoded message
- \(Y^{n}\) is the received sequence, corrupted by noise
- \(\hat{W}\) is the decoded message
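To make the diagram concrete, here is a minimal sketch of the pipeline, assuming a binary symmetric channel and a length-\(n\) repetition code with majority-vote decoding (both are my own illustrative choices, not part of the general model):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(w, n):
    """Encoder: map a one-bit message W to a codeword X^n (repetition code)."""
    return np.full(n, w, dtype=int)

def channel(x, f):
    """Binary symmetric channel: flip each bit of X^n independently with probability f."""
    flips = (rng.random(x.shape) < f).astype(int)
    return x ^ flips

def decode(y):
    """Decoder: majority vote over the received sequence Y^n."""
    return int(y.sum() * 2 > len(y))

# One pass through W -> encoder -> channel -> decoder.
w = 1
x = encode(w, n=11)          # X^n
y = channel(x, f=0.1)        # Y^n, corrupted by noise
w_hat = decode(y)            # hopefully equal to w
print(w, w_hat)
```

This repetition code has rate \(1/n\), which shrinks as \(n\) grows; the point of the theorem below is that we can do far better, keeping the rate fixed while still driving the error probability down.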
For a single instance of the random variable \(X\), the channel outputs \(Y\). The noise in the channel is characterized by the conditional distribution \(p(y \mid x)\).
The channel fixes \(p(y\mid x)\); so the choice of \(p(x)\) completely determines the joint distribution \(p(x,y) = p(x)\,p(y\mid x)\). This induces a mutual information \(I(X;Y)\).
The information capacity is the highest \(I(X;Y)\) achievable with a choice of \(p(x)\): $$
\begin{equation} C = \sup_{p(x)} I(X;Y) \label{cap} \end{equation}$$
It turns out that the operational capacity of a channel is exactly the information capacity. This is the noisy channel coding theorem.
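As a sanity check of the definition, here is a small sketch that computes \(I(X;Y)\) for a binary symmetric channel as a function of \(p(x)\) and takes the maximum by brute-force search; the channel and crossover probability are my own illustrative choices. For the BSC the capacity is known in closed form, \(C = 1 - H_2(f)\), and the search recovers it (attained at uniform \(p(x)\)).

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits, given p(x) and the channel matrix p_y_given_x[x, y] = p(y|x)."""
    p_xy = p_x[:, None] * p_y_given_x             # p(x, y) = p(x) p(y|x)
    p_y = p_xy.sum(axis=0)                        # p(y)
    ratio = p_xy / (p_x[:, None] * p_y[None, :])  # p(x,y) / (p(x) p(y))
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(ratio[mask])))

# Binary symmetric channel with crossover probability f.
f = 0.1
bsc = np.array([[1 - f, f],
                [f, 1 - f]])

# Brute-force the supremum over input distributions p(x) = (q, 1 - q).
qs = np.linspace(1e-6, 1 - 1e-6, 1001)
C = max(mutual_information(np.array([q, 1 - q]), bsc) for q in qs)

H2 = -f * np.log2(f) - (1 - f) * np.log2(1 - f)
print(C, 1 - H2)   # both ~= 0.531 bits per channel use
```

Swapping in any other channel matrix lets you explore channels without a closed-form capacity in the same way.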
1. Where does this bound come from?
A personal confusion I always had was: "how does choosing \(p(x)\) have anything to do with choosing a code?" The first part of the answer is realizing that the two capacities are defined differently: computing the information capacity involves no codes at all, whereas the operational capacity is the maximum rate achievable by any coding scheme.
However, that isn't the whole story: the act of choosing \(p(x)\) does have something to do with codes. Here's how it works. Imagine that the sender is equally likely to convey any one of the \(|\mathcal{C}|\) messages in the code \(\mathcal{C} = \{\mathbf{x}_1,...,\mathbf{x}_{|\mathcal{C}|}\}\), where \(\mathbf{x}_i = [x^1_i,...,x^N_i]\). The received sequence is also an \(N\)-dimensional vector, \(\mathbf{y}\).
If we want to be able to recover the original message at the decoder, the conditional entropy of \(\mathbf{x}\) given \(\mathbf{y}\) should be close to zero. Recall that \[ H(\mathbf{x} \mid \mathbf{y}) = H(\mathbf{x}) - I(\mathbf{x}; \mathbf{y}) \]
Then, the information transmitted by each message is \(\log |\mathcal{C}| = RN\), where \(N\) is the block size and \(R\) is the rate. Why? There are \(|\mathcal{C}|\) equally likely messages, so each block carries \(\log |\mathcal{C}|\) bits of information.
Since the messages are equally likely, \(H(\mathbf{x}) = \log |\mathcal{C}| = NR\). So if \(H(\mathbf{x} \mid \mathbf{y}) \approx 0\), then \(NR \approx I(\mathbf{x}; \mathbf{y})\), and \[ R \approx \frac{1}{N} I(\mathbf{x}; \mathbf{y}) \]
So the rate is approximately the mutual information between \(\mathbf{x}\) and \(\mathbf{y}\), normalized by the block size, and this mutual information depends on the coding scheme \(\mathcal{C}\). Optimizing over all possible coding schemes gives us the capacity: \[ C = \max_{\mathcal{C}} \frac{1}{N}I(\mathbf{x}; \mathbf{y}) \]
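Collecting the steps above into one chain: \[ NR = \log|\mathcal{C}| = H(\mathbf{x}) = I(\mathbf{x};\mathbf{y}) + H(\mathbf{x}\mid\mathbf{y}) \approx I(\mathbf{x};\mathbf{y}), \] with the approximation holding precisely when the decoder's residual uncertainty \(H(\mathbf{x}\mid\mathbf{y})\) is negligible.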
As it turns out, with some work, we can instead examine the quantity from \(\ref{cap}\). See this textbook for details.
2. Operational capacity
Now, we define the operational capacity:
An \((M,n)\) code for a channel \((X, Y, p(y\mid x))\) consists of:
- an index set of messages \(\{1,...,M\}\)
- an encoding function \(\{1,...,M\}\rightarrow X^n\), mapping each message to a codeword
- a decoding scheme: \(Y^n \rightarrow \{1,...,M\}\)
Then, the rate of an \((M,n)\) code is: \[ R = \frac{\log_2{M}}{n} \] This is the information content of your message, \(\log_2{M}\) bits, divided by the number of times you have to use the channel.
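For example (a standard code, used here only to make the formula concrete), the Hamming \((7,4)\) code encodes one of \(M = 2^4 = 16\) messages into \(n = 7\) channel uses, giving \[ R = \frac{\log_2 16}{7} = \frac{4}{7} \approx 0.57 \text{ bits per channel use.} \]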
A rate \(R\) is achievable if there exists a sequence of \((2^{nR}, n)\) codes such that the maximum probability of error over all messages goes to 0 as \(n\rightarrow \infty\). Why would we expect this to be possible? As the block size \(n\) increases, a code with rate \(R\) has \(2^{nR}\) messages to encode. However, for a binary input alphabet the encoder can choose its codewords from among \(2^{n}\) possible blocks. That leaves \(2^{n(1-R)}\) blocks per message. As that buffer grows, we can hope that the chance of confusing codewords goes down.
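For instance (numbers chosen purely for illustration), with a binary input alphabet, \(n = 1000\), and \(R = \tfrac{1}{2}\), there are \(2^{500}\) codewords to place among \(2^{1000}\) possible blocks, i.e. about \(2^{500}\) blocks per codeword available to keep the codewords distinguishable under noise.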
Finally, we can define the operational capacity as the supremum of all achievable rates on a channel.