>
section 5 of 146 min read

5. Information Theory: The Limits of the Possible

We have analysed individual modulation schemes. Now we ask: across all possible schemes, all possible coders, all possible receivers, what is the best you could ever do on a given channel? Claude Shannon answered that in 1948, and his result is the rare engineering theorem that genuinely deserves the word "fundamental."

5.1 Information measured in bits

If an event EE has probability pp, its self-information is

I(E)=log2p bits.I(E) = -\log_2 p \text{ bits.}

A certain event (p=1p = 1) carries zero information. An evenly weighted coin flip (p=1/2p = 1/2) carries 1 bit. A roll of a fair six-sided die (p=1/6p = 1/6) carries 2.585 bits. The improbable carries more information than the expected; that is exactly the right intuition, because a surprising message tells you more.

Surprise analogy. A friend texts "the sun rose today." Zero surprise; zero information. They text "I won the lottery." Massive surprise; lots of information. Information theory makes that intuition quantitative.

5.2 Entropy

For a source emitting symbols xix_i with probabilities pip_i, the entropy is

H(X)=ipilog2pi bits per symbol.H(X) = -\sum_i p_i \log_2 p_i \text{ bits per symbol.}

Entropy is the average self-information per symbol, equivalently the minimum average number of bits per symbol needed to encode the source losslessly. For a binary source with probability pp of 1, H=plog2p(1p)log2(1p)H = -p \log_2 p - (1-p) \log_2 (1-p), peaking at H=1H = 1 when p=1/2p = 1/2 (most uncertain) and dropping to 0 at p=0p = 0 or p=1p = 1 (certain).

Shannon's source coding theorem makes this precise: the best lossless encoder uses on average HH bits per symbol, and no encoder can do better than HϵH - \epsilon for any ϵ>0\epsilon > 0. Run-length encoding, Huffman, arithmetic coding, Lempel-Ziv, all are practical attempts to reach that bound.

5.3 Mutual information

For two random variables XX and YY, the mutual information I(X;Y)=H(X)H(XY)I(X; Y) = H(X) - H(X | Y) is how much knowing YY reduces uncertainty about XX. It is symmetric: I(X;Y)=I(Y;X)I(X;Y) = I(Y;X). For a communication channel with input XX and output YY, I(X;Y)I(X; Y) is the rate at which information passes through. Maximizing over input distributions gives the channel capacity:

C=maxpXI(X;Y) bits per channel use.C = \max_{p_X} I(X; Y) \text{ bits per channel use.}

5.4 Shannon-Hartley capacity, derived

For a continuous-time channel of bandwidth BB Hz with AWGN of power spectral density N0N_0, where the transmitter has average power PP, what is the capacity?

Sample at the Nyquist rate 2B2B. Each sample is a real number passed through an additive Gaussian noise channel with input power P/(2B)P / (2B) per sample (over signal degrees of freedom 2B2B per second) and noise variance N0B/(2B)=N0/2N_0 B / (2B) = N_0 / 2 per sample. The capacity per real sample of an AWGN channel is

Csample=12log2(1+Psampleσ2) bits per sample.C_\text{sample} = \tfrac{1}{2} \log_2 \left( 1 + \frac{P_\text{sample}}{\sigma^2} \right) \text{ bits per sample.}

This in turn is a classical result that follows from maximizing I(X;Y)=H(Y)H(YX)I(X;Y) = H(Y) - H(Y|X) over input distributions, with the maximum achieved by Gaussian inputs (because Gaussian outputs have the highest entropy at given power). Plug in Psample/σ2=P/(N0B)P_\text{sample} / \sigma^2 = P / (N_0 B) and multiply by 2B2B samples per second:

C=2B12log2(1+PN0B)=Blog2(1+SNR) bits per second.C = 2B \cdot \tfrac{1}{2} \log_2 \left( 1 + \frac{P}{N_0 B} \right) = B \log_2 \left( 1 + \text{SNR} \right) \text{ bits per second.}

That is the Shannon-Hartley theorem, the cornerstone of communications theory. Below this rate you can in principle communicate with arbitrarily low error probability if you use a good enough code. Above it, you cannot, no matter how clever your coding.

Some examples:

  • A 4 kHz telephone line at 30 dB SNR: C4000log2(1001)40C \approx 4000 \cdot \log_2(1001) \approx 40 kbps. V.34 modems hit 33.6 kbps in 1994. Practical operation came within 1.5 dB of Shannon.
  • An 80 MHz Wi-Fi 6 channel at 30 dB SNR: C801069.97800C \approx 80 \cdot 10^6 \cdot 9.97 \approx 800 Mbps per spatial stream. With MIMO, real Wi-Fi 6 reaches multiple gigabits.
  • A deep-space probe with 100 Hz of bandwidth and 10-10 dB SNR: C100log2(1.1)14C \approx 100 \cdot \log_2(1.1) \approx 14 bps. Voyager's actual rate of 160 bps reflects rather more bandwidth (a few kHz) and very strong codes.

5.5 The Shannon limit on Eb/N0E_b / N_0

Taking BB \to \infty in the Shannon-Hartley formula and substituting C=RC = R (info bits per second) and P=REbP = R E_b:

EbN0ln21.59 dB.\frac{E_b}{N_0} \to \ln 2 \approx -1.59 \text{ dB.}

No system can communicate reliably below 1.59-1.59 dB Eb/N0E_b/N_0, no matter the bandwidth, no matter the code. Every BER curve has this vertical wall at the left edge. Modern turbo, LDPC, and polar codes operate within a tenth of a dB of the wall in well-designed systems.

5.6 The bandwidth-power tradeoff

Shannon-Hartley implies: for a target rate RR, you can use a wide bandwidth at low SNR (spread spectrum, deep space) or a narrow bandwidth at high SNR (cable modem). Any combination on the curve R=Blog2(1+P/N0B)R = B \log_2 (1 + P / N_0 B) is in principle possible. The choice depends on what the channel and transmitter let you do. Spread spectrum, which we discuss next, is one of the most striking applications of this freedom.

5.7 Source coding: Huffman with a worked example

The source coding theorem says we should be able to encode a source at its entropy. Huffman coding does so for known probability distributions, optimally for fixed-length symbol input.

Symbols A, B, C, D with probabilities 0.5, 0.25, 0.125, 0.125. Entropy H=0.51+0.252+0.1253+0.1253=1.75H = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 bits/symbol. Huffman algorithm: pair up the two least-probable symbols, replace them by their combined probability, repeat. We get a tree:

plaintext
            ┌─ A (0.5) ──── 0
   root ────┤
            └── ┌─ B (0.25) ──── 10

                └── ┌─ C (0.125) ─── 110

                    └─ D (0.125) ─── 111

Average length =0.51+0.252+0.1253+0.1253=1.75= 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 bits/symbol = HH. Optimal. Huffman is in every ZIP file, JPEG file, MP3 file, and more or less every compression standard from 1990 onward, often as one stage in a deeper pipeline.

Run-length encoding (RLE) compresses long runs of repeated values. "AAAAAABBBC" becomes "(A,6)(B,3)(C,1)". Trivial but devastatingly effective for fax, simple bitmaps, and some transform-coded video coefficient streams.

Lempel-Ziv (LZ77, LZ78, LZW) builds a dictionary on the fly and emits references to previous matches. No advance knowledge of the source distribution needed. Every gzip, zlib, PNG, and DEFLATE file uses LZ77 under the hood; LZW lives on in GIF.

5.8 Why entropy bounds always show up

In any scheme that converts data to bits, you cannot beat the entropy of the source. In any scheme that converts bits to a noisy channel, you cannot beat the channel capacity. These are the two pillars: the floor on compression and the ceiling on transmission. The job of every modern codec is to live as close to both as possible.