We have analysed individual modulation schemes. Now we ask: across all possible schemes, all possible coders, all possible receivers, what is the best you could ever do on a given channel? Claude Shannon answered that in 1948, and his result is the rare engineering theorem that genuinely deserves the word "fundamental."
5.1 Information measured in bits
If an event has probability , its self-information is
A certain event () carries zero information. An evenly weighted coin flip () carries 1 bit. A roll of a fair six-sided die () carries 2.585 bits. The improbable carries more information than the expected; that is exactly the right intuition, because a surprising message tells you more.
Surprise analogy. A friend texts "the sun rose today." Zero surprise; zero information. They text "I won the lottery." Massive surprise; lots of information. Information theory makes that intuition quantitative.
5.2 Entropy
For a source emitting symbols with probabilities , the entropy is
Entropy is the average self-information per symbol, equivalently the minimum average number of bits per symbol needed to encode the source losslessly. For a binary source with probability of 1, , peaking at when (most uncertain) and dropping to 0 at or (certain).
Shannon's source coding theorem makes this precise: the best lossless encoder uses on average bits per symbol, and no encoder can do better than for any . Run-length encoding, Huffman, arithmetic coding, Lempel-Ziv, all are practical attempts to reach that bound.
5.3 Mutual information
For two random variables and , the mutual information is how much knowing reduces uncertainty about . It is symmetric: . For a communication channel with input and output , is the rate at which information passes through. Maximizing over input distributions gives the channel capacity:
5.4 Shannon-Hartley capacity, derived
For a continuous-time channel of bandwidth Hz with AWGN of power spectral density , where the transmitter has average power , what is the capacity?
Sample at the Nyquist rate . Each sample is a real number passed through an additive Gaussian noise channel with input power per sample (over signal degrees of freedom per second) and noise variance per sample. The capacity per real sample of an AWGN channel is
This in turn is a classical result that follows from maximizing over input distributions, with the maximum achieved by Gaussian inputs (because Gaussian outputs have the highest entropy at given power). Plug in and multiply by samples per second:
That is the Shannon-Hartley theorem, the cornerstone of communications theory. Below this rate you can in principle communicate with arbitrarily low error probability if you use a good enough code. Above it, you cannot, no matter how clever your coding.
Some examples:
- A 4 kHz telephone line at 30 dB SNR: kbps. V.34 modems hit 33.6 kbps in 1994. Practical operation came within 1.5 dB of Shannon.
- An 80 MHz Wi-Fi 6 channel at 30 dB SNR: Mbps per spatial stream. With MIMO, real Wi-Fi 6 reaches multiple gigabits.
- A deep-space probe with 100 Hz of bandwidth and dB SNR: bps. Voyager's actual rate of 160 bps reflects rather more bandwidth (a few kHz) and very strong codes.
5.5 The Shannon limit on
Taking in the Shannon-Hartley formula and substituting (info bits per second) and :
No system can communicate reliably below dB , no matter the bandwidth, no matter the code. Every BER curve has this vertical wall at the left edge. Modern turbo, LDPC, and polar codes operate within a tenth of a dB of the wall in well-designed systems.
5.6 The bandwidth-power tradeoff
Shannon-Hartley implies: for a target rate , you can use a wide bandwidth at low SNR (spread spectrum, deep space) or a narrow bandwidth at high SNR (cable modem). Any combination on the curve is in principle possible. The choice depends on what the channel and transmitter let you do. Spread spectrum, which we discuss next, is one of the most striking applications of this freedom.
5.7 Source coding: Huffman with a worked example
The source coding theorem says we should be able to encode a source at its entropy. Huffman coding does so for known probability distributions, optimally for fixed-length symbol input.
Symbols A, B, C, D with probabilities 0.5, 0.25, 0.125, 0.125. Entropy bits/symbol. Huffman algorithm: pair up the two least-probable symbols, replace them by their combined probability, repeat. We get a tree:
┌─ A (0.5) ──── 0
root ────┤
└── ┌─ B (0.25) ──── 10
│
└── ┌─ C (0.125) ─── 110
│
└─ D (0.125) ─── 111Average length bits/symbol = . Optimal. Huffman is in every ZIP file, JPEG file, MP3 file, and more or less every compression standard from 1990 onward, often as one stage in a deeper pipeline.
Run-length encoding (RLE) compresses long runs of repeated values. "AAAAAABBBC" becomes "(A,6)(B,3)(C,1)". Trivial but devastatingly effective for fax, simple bitmaps, and some transform-coded video coefficient streams.
Lempel-Ziv (LZ77, LZ78, LZW) builds a dictionary on the fly and emits references to previous matches. No advance knowledge of the source distribution needed. Every gzip, zlib, PNG, and DEFLATE file uses LZ77 under the hood; LZW lives on in GIF.
5.8 Why entropy bounds always show up
In any scheme that converts data to bits, you cannot beat the entropy of the source. In any scheme that converts bits to a noisy channel, you cannot beat the channel capacity. These are the two pillars: the floor on compression and the ceiling on transmission. The job of every modern codec is to live as close to both as possible.