5. Information Theory: The Limits of the Possible // Digital Communications // bhaswanth

We have analysed individual modulation schemes. Now we ask: across all possible schemes, all possible coders, all possible receivers, what is the best you could ever do on a given channel? Claude Shannon answered that in 1948, and his result is the rare engineering theorem that genuinely deserves the word "fundamental."

5.1 Information measured in bits

If an event $E$ has probability $p$ , its self-information is

$I(E) = -\log_2 p \text{ bits.}$

A certain event ( $p = 1$ ) carries zero information. An evenly weighted coin flip ( $p = 1/2$ ) carries 1 bit. A roll of a fair six-sided die ( $p = 1/6$ ) carries 2.585 bits. The improbable carries more information than the expected; that is exactly the right intuition, because a surprising message tells you more.

Surprise analogy. A friend texts "the sun rose today." Zero surprise; zero information. They text "I won the lottery." Massive surprise; lots of information. Information theory makes that intuition quantitative.

5.2 Entropy

For a source emitting symbols $x_i$ with probabilities $p_i$ , the entropy is

$H(X) = -\sum_i p_i \log_2 p_i \text{ bits per symbol.}$

Entropy is the average self-information per symbol, equivalently the minimum average number of bits per symbol needed to encode the source losslessly. For a binary source with probability $p$ of 1, $H = -p \log_2 p - (1-p) \log_2 (1-p)$ , peaking at $H = 1$ when $p = 1/2$ (most uncertain) and dropping to 0 at $p = 0$ or $p = 1$ (certain).

Shannon's source coding theorem makes this precise: the best lossless encoder uses on average $H$ bits per symbol, and no encoder can do better than $H - \epsilon$ for any $\epsilon > 0$ . Run-length encoding, Huffman, arithmetic coding, Lempel-Ziv, all are practical attempts to reach that bound.

5.3 Mutual information

For two random variables $X$ and $Y$ , the mutual information $I(X; Y) = H(X) - H(X | Y)$ is how much knowing $Y$ reduces uncertainty about $X$ . It is symmetric: $I(X;Y) = I(Y;X)$ . For a communication channel with input $X$ and output $Y$ , $I(X; Y)$ is the rate at which information passes through. Maximizing over input distributions gives the channel capacity:

$C = \max_{p_X} I(X; Y) \text{ bits per channel use.}$

5.4 Shannon-Hartley capacity, derived

For a continuous-time channel of bandwidth $B$ Hz with AWGN of power spectral density $N_0$ , where the transmitter has average power $P$ , what is the capacity?

Sample at the Nyquist rate $2B$ . Each sample is a real number passed through an additive Gaussian noise channel with input power $P / (2B)$ per sample (over signal degrees of freedom $2B$ per second) and noise variance $N_0 B / (2B) = N_0 / 2$ per sample. The capacity per real sample of an AWGN channel is

$C_\text{sample} = \tfrac{1}{2} \log_2 \left( 1 + \frac{P_\text{sample}}{\sigma^2} \right) \text{ bits per sample.}$

This in turn is a classical result that follows from maximizing $I(X;Y) = H(Y) - H(Y|X)$ over input distributions, with the maximum achieved by Gaussian inputs (because Gaussian outputs have the highest entropy at given power). Plug in $P_\text{sample} / \sigma^2 = P / (N_0 B)$ and multiply by $2B$ samples per second:

$C = 2B \cdot \tfrac{1}{2} \log_2 \left( 1 + \frac{P}{N_0 B} \right) = B \log_2 \left( 1 + \text{SNR} \right) \text{ bits per second.}$

That is the Shannon-Hartley theorem, the cornerstone of communications theory. Below this rate you can in principle communicate with arbitrarily low error probability if you use a good enough code. Above it, you cannot, no matter how clever your coding.

Some examples:

A 4 kHz telephone line at 30 dB SNR: $C \approx 4000 \cdot \log_2(1001) \approx 40$ kbps. V.34 modems hit 33.6 kbps in 1994. Practical operation came within 1.5 dB of Shannon.
An 80 MHz Wi-Fi 6 channel at 30 dB SNR: $C \approx 80 \cdot 10^6 \cdot 9.97 \approx 800$ Mbps per spatial stream. With MIMO, real Wi-Fi 6 reaches multiple gigabits.
A deep-space probe with 100 Hz of bandwidth and $-10$ dB SNR: $C \approx 100 \cdot \log_2(1.1) \approx 14$ bps. Voyager's actual rate of 160 bps reflects rather more bandwidth (a few kHz) and very strong codes.

5.5 The Shannon limit on $E_b / N_0$

Taking $B \to \infty$ in the Shannon-Hartley formula and substituting $C = R$ (info bits per second) and $P = R E_b$ :

$\frac{E_b}{N_0} \to \ln 2 \approx -1.59 \text{ dB.}$

No system can communicate reliably below $-1.59$ dB $E_b/N_0$ , no matter the bandwidth, no matter the code. Every BER curve has this vertical wall at the left edge. Modern turbo, LDPC, and polar codes operate within a tenth of a dB of the wall in well-designed systems.

5.6 The bandwidth-power tradeoff

Shannon-Hartley implies: for a target rate $R$ , you can use a wide bandwidth at low SNR (spread spectrum, deep space) or a narrow bandwidth at high SNR (cable modem). Any combination on the curve $R = B \log_2 (1 + P / N_0 B)$ is in principle possible. The choice depends on what the channel and transmitter let you do. Spread spectrum, which we discuss next, is one of the most striking applications of this freedom.

5.7 Source coding: Huffman with a worked example

The source coding theorem says we should be able to encode a source at its entropy. Huffman coding does so for known probability distributions, optimally for fixed-length symbol input.

Symbols A, B, C, D with probabilities 0.5, 0.25, 0.125, 0.125. Entropy $H = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75$ bits/symbol. Huffman algorithm: pair up the two least-probable symbols, replace them by their combined probability, repeat. We get a tree:

plaintext

            ┌─ A (0.5) ──── 0
   root ────┤
            └── ┌─ B (0.25) ──── 10
                │
                └── ┌─ C (0.125) ─── 110
                    │
                    └─ D (0.125) ─── 111

Average length $= 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75$ bits/symbol = $H$ . Optimal. Huffman is in every ZIP file, JPEG file, MP3 file, and more or less every compression standard from 1990 onward, often as one stage in a deeper pipeline.

Run-length encoding (RLE) compresses long runs of repeated values. "AAAAAABBBC" becomes "(A,6)(B,3)(C,1)". Trivial but devastatingly effective for fax, simple bitmaps, and some transform-coded video coefficient streams.

Lempel-Ziv (LZ77, LZ78, LZW) builds a dictionary on the fly and emits references to previous matches. No advance knowledge of the source distribution needed. Every gzip, zlib, PNG, and DEFLATE file uses LZ77 under the hood; LZW lives on in GIF.

5.8 Why entropy bounds always show up

In any scheme that converts data to bits, you cannot beat the entropy of the source. In any scheme that converts bits to a noisy channel, you cannot beat the channel capacity. These are the two pillars: the floor on compression and the ceiling on transmission. The job of every modern codec is to live as close to both as possible.