1. Side-Channel Attacks: Reading the Whisper // Hardware Security // bhaswanth

A chip executing AES is doing exactly what it is supposed to do. The output is a ciphertext, indistinguishable from random to anyone without the key. By the abstract definition of cryptographic security, the chip has revealed nothing.

The chip is also drawing current from a power rail. Its substrate is heating slightly. Its switching transistors are radiating electromagnetic fields in nanosecond pulses. Its package is emitting an acoustic signature in the kilohertz range as ceramic capacitors flex piezoelectrically. Each of these emissions is a function of what the chip is doing, and what it is doing is a function of the secret key. The cryptographic abstraction never claimed to hide any of this. It only claimed that the ciphertext bits would be unpredictable, and they are. Everything else, every emission, every delay, every fluctuation, is what we call a side channel.

Burglar at the door analogy. A safecracker in a movie holds a stethoscope to the dial of an old combination safe. The lock manufacturer never claimed the dial would be silent. They only claimed the bolt would not retract without the right number. The safecracker is not breaking the lock. They are listening to it. The lock is unwillingly broadcasting its internal state. Side-channel attacks are the safecracker, and the chip is the dial, and the secret key is the combination. The attacker is not breaking AES. They are listening to AES.

1.1 Why side channels are unavoidable

In Chapter 0 we drew a CMOS inverter. Two transistors, one PMOS and one NMOS, sharing a common gate input and a common drain output. When the input rises from 0 to 1, the NMOS turns on, the PMOS turns off, and the output capacitance discharges to ground through the NMOS. When the input falls from 1 to 0, the opposite happens, and the output capacitance charges from VDD through the PMOS. In both transitions, charge moves. Charge moving per unit time is current. The energy dissipated for one transition is approximately:

$E_\text{switch} = \tfrac{1}{2}\, C_L\, V_{DD}^2$

In static state, when no input is changing, the gate draws only the leakage current of off-transistors, which is many orders of magnitude smaller than switching current. So the instantaneous power consumption of a CMOS chip is dominated by the number of gates that are switching at any given clock edge, multiplied by the load capacitance each of those gates is driving.

Now, the attacker's leverage. The number of gates that switch when a 32-bit register loads a new value depends on the Hamming distance between the old value and the new value: how many bits flipped. If three bits flipped, three groups of gates switched. If twenty bits flipped, twenty groups switched. The current spike on VDD is roughly proportional to the bits-flipped count.
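As a rough sketch of that proportionality, the snippet below counts the flipped bits for a register update and scales the count by the per-transition energy from the formula above. The capacitance and supply voltage are assumed, illustrative values, not figures for any particular process, and the model ignores everything except the register bits themselves.

```python
def hamming_distance(old: int, new: int) -> int:
    """Number of bit positions that differ between two register values."""
    return bin(old ^ new).count("1")

# Assumed illustrative values (not taken from any datasheet).
C_LOAD = 10e-15   # farads switched per flipped bit
V_DD = 1.2        # volts

def switching_energy(old: int, new: int) -> float:
    """Approximate energy in joules: one (1/2) * C * V^2 per flipped bit."""
    return hamming_distance(old, new) * 0.5 * C_LOAD * V_DD ** 2

few = switching_energy(0x00000000, 0x00000007)    # 3 bits flip
many = switching_energy(0x00000000, 0x000FFFFF)   # 20 bits flip
print(f"3 flips: {few:.2e} J, 20 flips: {many:.2e} J")
```

Under this toy model the two energies differ by exactly the ratio of flipped bits, which is the data dependence the attacker measures.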

This is the entire mechanism. The data dependence of switching capacitance is the leakage. It is built into the physics of the device. You cannot wish it away. You can balance it, mask it, smear it in time, drown it in noise, but the physical channel is open the moment power reaches the chip. The reader who absorbed Chapter 1 already understands the worst news: every transistor in every chip ever made is a side channel.

1.2 Simple Power Analysis (SPA)

The first and easiest attack. Capture a single power trace of the chip executing the cryptographic operation, plot it on the oscilloscope from Chapter 22, and read the operation off with your eyes.

SPA was originally demonstrated against textbook RSA implementations. RSA decryption is modular exponentiation: $m = c^d \bmod N$ where $d$ is the secret key. The standard square-and-multiply algorithm processes $d$ bit by bit:

plaintext

for each bit b of d, MSB to LSB:
    m = m * m mod N         # square
    if b == 1:
        m = m * c mod N     # multiply

A 0 bit costs one big multiplication (the square). A 1 bit costs two (the square, then the multiply), so that iteration takes roughly twice as long and draws roughly twice the cumulative current. On a power trace you literally see the bit pattern of $d$: wide humps mean 1, narrow humps mean 0. A Princeton group did this on the original implementation in OpenSSL and recovered keys directly off a trace.
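The leak can be sketched without any hardware. The code below runs the same left-to-right square-and-multiply loop, logs each operation as "S" or "M" (standing in for the humps an attacker would see on a trace), and then reads the exponent back from the log alone. The toy RSA numbers and the function names are assumptions chosen for illustration.

```python
def square_and_multiply_ops(c, d, n):
    """Left-to-right square-and-multiply; returns (result, operation log)."""
    m = 1
    ops = []
    for bit in bin(d)[2:]:            # bits of d, MSB to LSB
        m = (m * m) % n               # square, always performed
        ops.append("S")
        if bit == "1":
            m = (m * c) % n           # multiply, only for 1 bits
            ops.append("M")
    return m, ops

def read_key_from_ops(ops):
    """Recover d from the operation sequence: 'SM' means 1, a lone 'S' means 0."""
    bits = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i + 1] == "M":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)

# Toy parameters (far too small for real use): N = 61 * 53, e = 17, d = 2753.
N_MOD, D_SECRET, C_TEXT = 3233, 2753, 2790
m, ops = square_and_multiply_ops(C_TEXT, D_SECRET, N_MOD)
print("".join(ops))
print("recovered d =", read_key_from_ops(ops))
```

A real attacker gets the "S"/"M" sequence from hump widths rather than a log, but the decoding step is the same.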

The same trick reveals AES table-based implementations: each of the sixteen S-box lookups in the first round produces a memory access pattern visible in the trace, and on chips with cache, the cache miss/hit pattern is visible too. SPA also reveals control flow: any conditional branch is a tell. If your PIN-checking code returns early on the first wrong digit, an SPA trace shows exactly which digit was wrong.

plaintext

       Power
       |
   1.0 |       ____         _______         _____         ___
       |      /    \       /       \       /     \       /
   0.5 |  ___/      \_____/         \_____/       \_____/
       |
       +-----------------------------------------------> time
              SQ           SQ + MUL          SQ        SQ + MUL
              0              1               0            1

A trace like that, where every wide bump is a 1 and every narrow bump is a 0, was the start of the entire field. Defense against SPA is straightforward: make every code path identical, regardless of data. Constant-time programming. We will revisit constant-time in section 1.7.

1.3 Differential Power Analysis (DPA): the watershed

In 1998 Paul Kocher, Joshua Jaffe, and Benjamin Jun announced differential power analysis, and in 1999 they published the paper that hardware security people still cite as the founding moment of the modern field. They showed that even when SPA fails, even when the operations look identical on a single trace, the data dependence of power can still be extracted by collecting thousands of traces and treating power as a noisy random variable.

The leverage comes from a leakage model. The simplest model says: the instantaneous power at time $t$ during the chip's execution is:

$P(t) = \alpha\, \text{HW}(v(t)) + \beta\, R(t) + \gamma$

where $v(t)$ is the value sitting on the data bus at time $t$, $\text{HW}$ is Hamming weight (the count of 1 bits), $R(t)$ is independent noise, and $\alpha, \beta, \gamma$ are constants of the technology. For wires and registers that get overwritten, the more accurate model uses Hamming distance, the count of bits that flipped between old value and new value:

$P(t) = \alpha\, \text{HD}(v_\text{old}, v_\text{new}) + \beta\, R(t) + \gamma$

Pick the model that matches what you think the hardware is actually doing. For pre-charged buses, Hamming weight is fine. For registers and SRAM cells, Hamming distance matches reality more closely.

Now the attack. Consider the first round of AES-128. The state byte at position $i$ before the S-box is computed as $x_i = p_i \oplus k_i$, where $p_i$ is the plaintext byte and $k_i$ is the key byte. After the S-box, the state byte is $y_i = S[x_i]$. That value is loaded into the register that holds the post-SubBytes state, and under the Hamming-weight model its weight $\text{HW}(y_i)$ is what we expect to dominate the power consumption at the moment that register loads.

For a chosen target byte position $i$ , here is the full DPA procedure:

Trigger the chip to execute AES on $N$ random plaintexts $p^{(1)}, \dots, p^{(N)}$ that you control. Capture a power trace $T^{(j)}(t)$ for each. Each trace has perhaps 10,000 sample points.
For each candidate key byte $\hat k \in \{0, 1, \dots, 255\}$:
- For each trace $j$, compute the predicted intermediate value $\hat y^{(j)}_{\hat k} = S[p^{(j)}_i \oplus \hat k]$.
- Compute the predicted power $H^{(j)}_{\hat k} = \text{HW}(\hat y^{(j)}_{\hat k})$.
Now you have, for each key guess, a vector of $N$ predictions $H_{\hat k}$.
Compare the predictions with the measurements. The original DPA was even simpler than correlation. It used Kocher's difference of means: split the traces into two groups based on whether some predicted bit is 0 or 1, and compute the difference of the mean traces. For the right key guess, that difference shows a clear spike at the moment the chip is processing that bit; for a wrong key guess, the difference is flat noise. We will compute the modern correlation form below.
Whichever $\hat k$ produces the largest spike is the recovered key byte.
Repeat independently for every byte position $i \in \{0, \dots, 15\}$. The byte attacks are independent: the secret has been split into sixteen 8-bit problems, each requiring a search over 256 candidates. AES-128 has been reduced from $2^{128}$ to $16 \times 2^8 = 4096$ guesses checked against thousands of traces.
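The difference-of-means step can be sketched on synthetic data. The block below is a minimal simulation, not a reconstruction of the original paper's experiment: it assumes a pure Hamming-weight leak at one sample plus Gaussian noise, targets the least significant bit of the S-box output, and rebuilds the AES S-box from its definition so the example is self-contained. The key byte, trace count, and noise level are arbitrary assumptions.

```python
import numpy as np

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def make_sbox():
    """AES S-box: field inverse (computed as x^254) followed by the affine map."""
    box = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        s = 0x63
        for shift in range(5):            # inv XOR its left-rotations by 1..4
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        box.append(s)
    return np.array(box, dtype=np.uint8)

SBOX = make_sbox()
rng = np.random.default_rng(1)
SECRET = 0x3C                              # assumed key byte to recover
N, T, LEAK_T = 5000, 50, 20                # traces, samples per trace, leak position

pts = rng.integers(0, 256, size=N, dtype=np.uint8)
leak = np.unpackbits(SBOX[pts ^ SECRET][:, None], axis=1).sum(axis=1).astype(float)
traces = rng.standard_normal((N, T))       # unit-variance Gaussian noise
traces[:, LEAK_T] += leak                  # Hamming-weight leak at one sample

spikes = np.zeros(256)
for k in range(256):
    bit = SBOX[pts ^ k] & 1                # predicted target bit for this guess
    diff = traces[bit == 1].mean(axis=0) - traces[bit == 0].mean(axis=0)
    spikes[k] = np.abs(diff).max()         # height of the largest spike

recovered = int(np.argmax(spikes))
print(f"recovered 0x{recovered:02X}, true 0x{SECRET:02X}")
```

Only one of the eight leaking bits is used to partition the traces, which is the information the correlation form recovers by predicting the full Hamming weight.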

That is the watershed. A 128-bit cipher with no algorithmic flaw, broken in a few minutes of capture and a few seconds of compute, because the key-dependent intermediate value sits in a register that draws data-dependent current.

1.4 Correlation Power Analysis (CPA): cleaner statistics

DPA in its original difference-of-means form throws away information. Modern attackers use Pearson correlation, the same statistic from Chapter 3 cross-correlation, computed pointwise across the trace. For a key guess $\hat k$ and a single time index $t$ :

$\rho_{\hat k}(t) = \frac{\sum_j (H^{(j)}_{\hat k} - \bar H_{\hat k})(T^{(j)}(t) - \bar T(t))}{\sqrt{\sum_j (H^{(j)}_{\hat k} - \bar H_{\hat k})^2}\;\sqrt{\sum_j (T^{(j)}(t) - \bar T(t))^2}}$

The numerator is proportional to the sample covariance between predicted Hamming weight and measured power at time $t$ (the common $1/N$ factors cancel between numerator and denominator). The denominator normalizes by the spread of each, giving a dimensionless number in $[-1, +1]$.

For the correct key guess at the correct time, $\rho$ approaches its maximum (often 0.1 to 0.5 in practice, depending on noise). For wrong key guesses, $\rho$ stays near zero with statistical fluctuation. The attacker plots $\rho_{\hat k}(t)$ for all 256 key guesses and looks for the curve that spikes above the noise floor. That curve names the byte.

CPA is more robust than DPA because it uses the full Hamming weight prediction (0 through 8) instead of a single bit. It's the attack form that almost every modern side-channel paper uses, and it's what tools like ChipWhisperer ship by default.

Below is a complete CPA simulation in Python. Run it on a laptop. It generates synthetic traces with a Hamming-weight leakage model plus Gaussian noise, then recovers the key by correlation across guesses.

python

import numpy as np
 
# AES S-box
SBOX = np.array([
    0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
    0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
    0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
    0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
    0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
    0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
    0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
    0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
    0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
    0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
    0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
    0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
    0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
    0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
    0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
    0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16,
], dtype=np.uint8)
 
def hw(x):
    """Hamming weight of bytes."""
    x = np.asarray(x, dtype=np.uint8)
    return np.unpackbits(x[:, None], axis=1).sum(axis=1)
 
# --- Simulate a chip with a hidden key byte and measure traces ---
np.random.seed(42)
SECRET_KEY_BYTE = 0xA5      # the value the attacker must recover
N_TRACES = 5000
TRACE_LEN = 100             # each trace is 100 samples
LEAK_TIME = 35              # sample where the leak occurs
NOISE_SIGMA = 1.5           # Gaussian noise std
 
plaintexts = np.random.randint(0, 256, size=N_TRACES, dtype=np.uint8)
intermediate = SBOX[plaintexts ^ SECRET_KEY_BYTE]
leak = hw(intermediate)
 
traces = NOISE_SIGMA * np.random.randn(N_TRACES, TRACE_LEN)
traces[:, LEAK_TIME] += leak.astype(float)
 
# --- CPA attack: try all 256 candidate keys ---
correlations = np.zeros((256, TRACE_LEN))
for k_guess in range(256):
    pred = hw(SBOX[plaintexts ^ k_guess]).astype(float)
    pred -= pred.mean()
    for t in range(TRACE_LEN):
        col = traces[:, t] - traces[:, t].mean()
        denom = np.sqrt((pred ** 2).sum() * (col ** 2).sum()) + 1e-12
        correlations[k_guess, t] = (pred * col).sum() / denom
 
best = np.unravel_index(np.argmax(np.abs(correlations)), correlations.shape)
print(f"Recovered key byte: 0x{best[0]:02X}, true key: 0x{SECRET_KEY_BYTE:02X}")
print(f"Peak correlation = {correlations[best]:.3f} at sample {best[1]}")

The recovered byte will match SECRET_KEY_BYTE essentially every time. With $N = 5000$ traces and $\sigma = 1.5$, the correlation peak for the right key sits comfortably above the noise floor. If you increase the noise, increase the number of traces to compensate. The full key is sixteen of these byte attacks, each independent.
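To see how the peak shrinks as noise grows, here is a small sketch under the same assumptions as the simulation above: a leak of $\alpha\,\text{HW}$ with $\alpha = 1$ and Gaussian noise of standard deviation $\sigma$. The Hamming weight of a uniform byte has variance $8 \times 0.25 = 2$, so the ideal correlation for the correct key is $\sqrt{2 / (2 + \sigma^2)}$. The helper name `expected_correlation` is introduced here for illustration.

```python
import numpy as np

VAR_HW = 8 * 0.25   # variance of the Hamming weight of a uniformly random byte

def expected_correlation(sigma, alpha=1.0):
    """Ideal correct-key correlation for P = alpha * HW + N(0, sigma^2)."""
    signal = alpha ** 2 * VAR_HW
    return float(np.sqrt(signal / (signal + sigma ** 2)))

rng = np.random.default_rng(0)
values = rng.integers(0, 256, size=200_000, dtype=np.uint8)
hw = np.unpackbits(values[:, None], axis=1).sum(axis=1).astype(float)

for sigma in (0.5, 1.5, 5.0):
    power = hw + sigma * rng.standard_normal(hw.size)
    measured = float(np.corrcoef(hw, power)[0, 1])
    print(f"sigma={sigma}: predicted {expected_correlation(sigma):.3f}, "
          f"simulated {measured:.3f}")
```

As a rule of thumb, the number of traces needed to separate a correlation $\rho$ from zero grows roughly as $1/\rho^2$, so in the noisy regime it grows roughly with $\sigma^2$.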

1.5 The DPA hardware setup

You need surprisingly little:

plaintext

                   target chip on dev board
                          |
                       +--+--+
                       | DUT |
                       +--+--+
                          | V_DD pin
                  +-------+-------+
                  |               |
                  R = 1 ohm   high-bandwidth
                  |               oscilloscope
                  |              (>200 MHz, 8-bit, segmented)
                  GND             ^
                                  |
                          differential probe
                          across R measures
                          instantaneous current
 
    +--------+   USB / Ethernet   +-----------+
    |  PC    | <----------------> | ChipWhisp |
    | Python |   trigger + capture| er or sco |
    +--------+                    +-----------+

A shunt resistor (a few hundred milliohms to a few ohms) in the VDD path, a probe with enough bandwidth to see nanosecond features, a trigger from the chip's UART or GPIO marking the start of encryption, and a PC running numpy. ChipWhisperer rolls all of that into a $300 platform aimed at students. NewAE Phywhisperer, Riscure Inspector, and FOBOS are higher-end alternatives. The barrier to entry is low. **A patient hobbyist with a $400 budget can break unprotected AES on a microcontroller.**

1.6 Defenses against power analysis: hiding and masking

The threat model is symmetric: defenders know exactly what attackers do, and they design countermeasures targeted at each step of the attack. The two big families are hiding and masking.

Hiding tries to flatten the leakage. It accepts that the chip will draw current, but tries to make the current independent of data, or so noisy that the SNR of the leakage becomes useless.

Dual-rail logic. Every bit is represented by two wires that are always complementary. When one rises the other falls, so the total switching activity per bit is constant regardless of data. Used in secure smartcards, for example from vendors like Inside Secure and in products evaluated by labs such as Riscure. Costs roughly 2x area and 2x power.
Balanced gates. WDDL (wave dynamic differential logic) and SABL (sense-amplifier-based logic) extend the dual-rail principle to combinational logic. Each gate dissipates the same energy regardless of inputs.
Noise injection. Add a parallel noise generator on chip whose current draw is uncorrelated with the secret. Increases the number of traces required for an attack, but does not make it impossible.
Decoupling. Big on-die capacitors smooth the VDD rail, attenuating fast features. Limited; large-scale leakage still gets through.
Random clock skew. Operations don't happen at the same time relative to a trigger across runs. Forces the attacker to align traces. We saw alignment in Chapter 17; the attacker can use cross-correlation alignment to undo this.
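The alignment step mentioned for random clock skew can be sketched as follows. This is a minimal illustration on synthetic traces, assuming each trace is a jittered copy of one reference pulse plus a little noise; real traces need a carefully chosen reference window. It uses `numpy.correlate` in "full" mode and picks the lag with the largest cross-correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
ref = np.exp(-0.5 * ((t - 100) / 3.0) ** 2)       # reference feature: one pulse

# Each captured trace is the reference shifted by a random jitter, plus noise.
true_shifts = rng.integers(-10, 11, size=20)
traces = np.stack([np.roll(ref, s) + 0.02 * rng.standard_normal(n)
                   for s in true_shifts])

def estimate_shift(trace, reference):
    """Lag (in samples) at which the trace best matches the reference."""
    xc = np.correlate(trace - trace.mean(), reference - reference.mean(),
                      mode="full")
    return int(np.argmax(xc)) - (len(reference) - 1)

estimated = np.array([estimate_shift(tr, ref) for tr in traces])
aligned = np.stack([np.roll(tr, -lag) for tr, lag in zip(traces, estimated)])

print("true shifts:     ", true_shifts.tolist())
print("estimated shifts:", estimated.tolist())
```

Once the traces are shifted back, the leaking sample lines up across runs again and the statistics from the earlier sections apply unchanged.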

Masking is more elegant. Instead of trying to hide the leakage of a sensitive value, you split that value into shares whose joint distribution carries the secret but whose marginal distributions are uniform. No single share leaks anything. The attacker would need to combine power leakage from multiple shares simultaneously, which raises the order of the attack.

Boolean masking for AES looks like this. Pick a random byte $r$. Replace $x$ everywhere by the pair $(x', r)$ where $x' = x \oplus r$. Linear operations like ShiftRows and MixColumns are easy to apply to each share. The S-box is the headache, because $S$ is non-linear: $S(x \oplus r) \neq S(x) \oplus S(r)$. Solutions include precomputing a masked S-box table per execution, or using composite-field masking that performs the GF inversion on shares directly.
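The table-recomputation approach can be sketched in a few lines. This is a first-order toy under stated assumptions: one byte, fresh input and output masks per call, and the AES S-box rebuilt from its definition so the block stands alone. The function names are introduced here for illustration, and a Python sketch says nothing about what a real device leaks.

```python
import secrets

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def make_sbox():
    """AES S-box: field inverse (computed as x^254) followed by the affine map."""
    box = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        s = 0x63
        for shift in range(5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        box.append(s)
    return box

SBOX = make_sbox()

def masked_sbox_table(r_in, r_out):
    """Table T with T[x ^ r_in] = SBOX[x] ^ r_out; loops over public x only."""
    table = [0] * 256
    for x in range(256):
        table[x ^ r_in] = SBOX[x] ^ r_out
    return table

def masked_first_round_byte(p, k):
    """Return (masked S-box output, output mask) without ever forming p ^ k."""
    r_in = secrets.randbelow(256)
    r_out = secrets.randbelow(256)
    table = masked_sbox_table(r_in, r_out)
    x_masked = (p ^ r_in) ^ k        # mask the plaintext first, then add the key
    y_masked = table[x_masked]       # equals SBOX[p ^ k] ^ r_out
    return y_masked, r_out

y_masked, r_out = masked_first_round_byte(0x32, 0xA5)
print(hex(y_masked ^ r_out), hex(SBOX[0x32 ^ 0xA5]))   # unmasking only to check
```

The order of operations in `x_masked` matters: XORing the key into an already masked plaintext keeps the unmasked $x$ out of every intermediate value.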

Voting district analogy. A masked secret is like a vote split between two ballot boxes by XOR. Open one box and you see uniformly random ballots. Open the other and you see uniformly random ballots. Only when you XOR the two together do you recover the actual vote. The attacker who only watches one box learns nothing. First-order DPA, which looks at one wire's leakage at a time, is defeated. To defeat $d$-th order masking, the attacker must combine leakage from $d+1$ wires simultaneously, which takes a number of traces that grows exponentially in $d$ and is far more sensitive to noise.

A subtlety: higher-order masking only works if the implementation never combines shares without re-randomizing. If a "secure" implementation accidentally computes $x' \oplus r$ in some intermediate register, that register holds the unmasked $x$ and the entire masking scheme collapses. Real masked AES implementations require formal proofs (the "non-interference" or NI property) to certify that no such accidental combination ever occurs. Implementing high-order masking correctly is genuinely hard, and implementations have repeatedly been broken by missed combinations.

1.7 Constant-time code: the software discipline

Cryptographic code is supposed to take the same time on every input. Sounds obvious. In practice the language gives you a hundred ways to violate constant-time without realizing it. Compare:

#include <stdint.h>

// NAIVE PIN check - leaks via timing
int pin_check_bad(const char *guess, const char *real, int len) {
    for (int i = 0; i < len; i++) {
        if (guess[i] != real[i]) return 0;   // early exit on first mismatch
    }
    return 1;
}
 
// CONSTANT TIME PIN check - takes same time always
int pin_check_good(const char *guess, const char *real, int len) {
    volatile uint8_t diff = 0;               // accumulates any differing bits
    for (int i = 0; i < len; i++) {
        diff |= (uint8_t)(guess[i] ^ real[i]);
    }
    return diff == 0;   // returns 1 only if all bytes equal
}

The bad version returns the moment the first mismatched byte is seen. The attacker brute-forces digit by digit. Each correct prefix runs marginally longer than each incorrect prefix, and over enough trials the timing difference becomes statistically significant. The constant-time version XORs every byte, ORs the result, and only checks at the end. No data-dependent branches, no early exit.
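The digit-by-digit leak is easy to make visible in Python by counting how many bytes an early-exit comparison examines; that count stands in for elapsed time. The instrumented `leaky_compare` is a hypothetical helper written for this sketch, while `hmac.compare_digest` is the standard library's comparison designed to resist timing analysis.

```python
import hmac

def leaky_compare(guess: bytes, real: bytes):
    """Early-exit comparison; also returns how many bytes were examined."""
    steps = 0
    for g, r in zip(guess, real):
        steps += 1
        if g != r:
            return False, steps          # exits at the first mismatch
    return len(guess) == len(real), steps

def constant_time_compare(guess: bytes, real: bytes) -> bool:
    """Delegates to the standard library's timing-safe comparison."""
    return hmac.compare_digest(guess, real)

real = b"4821"
for guess in (b"0000", b"4000", b"4800", b"4820", b"4821"):
    ok, steps = leaky_compare(guess, real)
    print(guess, ok, "bytes examined:", steps,
          "| safe compare:", constant_time_compare(guess, real))
```

The examined-byte count rises by one for each additional correct leading digit, which is what lets an attacker confirm a PIN one position at a time.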

Constant time alone is not enough. Even if the control flow is data independent, the power profile still depends on the data, and that is what DPA reads. Constant-time defeats timing attacks. It does not defeat power analysis. You need constant time plus masking, and often hiding as well, for high-assurance crypto.

Compilers also have a habit of "helpfully" inserting branches that break your hard-won constant-time property. The volatile qualifier above is one defense. Another is to write the inner loop in inline assembly, with explicit instructions. Crypto libraries like libsodium, BoringSSL, and OpenSSL maintain extensive lists of compiler-introduced timing leaks they have had to pin down.

1.8 Electromagnetic analysis: the contactless sibling

Power analysis requires access to the VDD pin, sometimes a desolder operation. EM analysis bypasses that. Hold a near-field magnetic loop probe (a tiny coil, often hand-wound 5-10 mm in diameter) close to the package or even directly on the decapped die, feed the induced voltage into a low-noise amplifier, and you have a signal that carries the same Hamming-weight leakage as the power line, often with better localization. By moving the probe across the chip's surface you can target specific functional blocks: AES coprocessor here, RSA engine there, leaving everything else as noise.

Stethoscope on a wall analogy. Power analysis is hearing the room through the floorboards: everything is mixed. EM is putting your ear next to the heating duct that connects to the boss's office: localized, less mixed, more revealing.

EM has cracked smart cards through epoxy, FPGAs through their PCB, and even fully assembled phones in lab conditions. The 2015 work by Genkin and Shamir extracted RSA keys from a smartphone using a $2 magnetic loop and a $300 SDR. The contactless nature is what makes EM scary in supply-chain contexts: a trojan probe inside an air-vent, an antenna hidden in a hotel desk lamp, can in principle exfiltrate keys from a target machine without ever being electrically connected.

1.9 Timing attacks across networks: Boneh and Brumley 2003

Side channels do not require physical proximity. Boneh and Brumley showed in 2003 that they could recover an OpenSSL RSA private key from a remote web server by measuring TLS handshake completion times across a local network. The attack exploits how Montgomery multiplication conditionally performs an extra subtraction when an intermediate value exceeds the modulus, and how OpenSSL switched between Karatsuba and ordinary multiplication depending on operand sizes. Each of these introduces a tiny input-dependent timing variation that aggregates over thousands of handshakes into a measurable signal across the network's millisecond-scale jitter.

This was a sledgehammer to the assumption that side channels needed lab equipment. Network attackers, with no physical access and no insider help, could extract keys from production servers given enough handshakes. OpenSSL's response was a hardening of Montgomery reduction to be constant-time, plus blinding for RSA. The general lesson: any operation whose duration depends on a secret will eventually leak that secret over a noisy channel, given enough samples.
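The "given enough samples" claim is just averaging. The sketch below assumes a 1 microsecond secret-dependent gap buried under 50 microseconds of Gaussian jitter; both numbers and the Gaussian shape are illustrative assumptions, since real network jitter is heavier-tailed and attackers usually filter it (for example with medians) rather than take plain means.

```python
import numpy as np

rng = np.random.default_rng(7)
BASE_US = 2000.0        # assumed baseline handshake time, microseconds
TRUE_GAP_US = 1.0       # assumed secret-dependent extra time
JITTER_US = 50.0        # assumed jitter standard deviation

def sample_latency(extra_us, n):
    """n simulated response times with the given secret-dependent extra."""
    return BASE_US + extra_us + JITTER_US * rng.standard_normal(n)

for n in (100, 10_000, 1_000_000):
    fast = sample_latency(0.0, n)
    slow = sample_latency(TRUE_GAP_US, n)
    estimate = float(slow.mean() - fast.mean())
    stderr = JITTER_US * np.sqrt(2.0 / n)      # standard error of the difference
    print(f"n={n:>9}: estimated gap {estimate:+8.3f} us (+/- {stderr:.3f})")
```

The standard error falls as $1/\sqrt{n}$, so a gap fifty times smaller than the jitter becomes clearly visible once the sample count reaches the millions.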

1.10 Cache side channels: when the leak is microarchitectural

Modern CPUs are full of state that is invisible to the architectural specification but observable through clever measurement. The cache is the prime example. In Chapter 14 we saw L1, L2, and L3 caches with line sizes of 64 bytes and access latencies that differ by an order of magnitude depending on whether a line is hot, warm, or in DRAM.

That latency difference is a side channel. The attacker process and the victim process share a cache. The attacker fills the cache with their own data, waits while the victim runs, then measures access times to their own data. Lines that are still fast were not evicted, lines that are now slow were evicted by the victim's accesses. By choosing which addresses to fill, the attacker can effectively read out which cache lines the victim touched, with cache-line granularity.

The standard variants:

Flush+Reload. Requires shared memory between attacker and victim (e.g., a shared library mapped into both processes). The attacker clflushes a target line, waits for the victim to run, then reloads and times the access. Fast access means the victim brought the line in; slow access means they didn't. Used to recover AES keys from co-resident processes on shared cores.
Prime+Probe. No shared memory required. The attacker fills a cache set with their own data, waits, then re-accesses each of their own lines. Lines that became slow were evicted by the victim. Works across VMs on the same physical core, which is the original cross-tenant cloud attack.
Evict+Reload. Like Flush+Reload but uses cache pressure instead of clflush, which works on platforms without an unprivileged flush instruction.
Flush+Flush. Measures the timing of clflush itself, which differs based on whether the line is in cache. Quieter (no actual access).

A cache attack on T-table-based AES recovers the key in seconds: the table indices are functions of plaintext XOR key, so observing which cache lines the victim's encryption touched directly leaks the key bytes. AES-NI on Intel was designed in part to kill this attack class: the hardware AES instructions don't use software lookup tables, so there are no key-dependent cache accesses to observe. This is one of the rare cases where ISA-level features were added specifically as side-channel hardening.
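A Prime+Probe round against a table lookup can be modeled with a toy cache. Everything here is an assumption made for illustration: a 64-set, 4-way LRU cache addressed by line number, a victim table occupying 16 lines (16 entries per line), and a known plaintext byte. With that layout the touched line reveals the top four bits of $p \oplus k$.

```python
from collections import OrderedDict

class ToyCache:
    """Set-associative cache with LRU replacement; addresses are line numbers."""
    def __init__(self, num_sets=64, ways=4):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, line):
        """Touch a line; return True on a hit, False on a miss."""
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)              # evict the least recently used line
        s[line] = None
        return False

ATTACKER_BASE = 0x10000   # hypothetical line numbers, both multiples of 64
TABLE_BASE = 0x400        # victim lookup table: 16 lines mapping to sets 0..15

def attacker_lines(cache, set_index):
    """The attacker's own lines that map to the given cache set."""
    return [ATTACKER_BASE + set_index + w * cache.num_sets
            for w in range(cache.ways)]

def victim_lookup(cache, plaintext, key):
    """Victim touches the table line selected by the top 4 bits of p ^ k."""
    cache.access(TABLE_BASE + ((plaintext ^ key) >> 4))

def prime_probe(plaintext, key):
    cache = ToyCache()
    for s in range(16):                                  # prime: fill the sets
        for line in attacker_lines(cache, s):
            cache.access(line)
    victim_lookup(cache, plaintext, key)                 # victim runs once
    for s in range(16):                                  # probe: look for misses
        misses = sum(not cache.access(l) for l in attacker_lines(cache, s))
        if misses:
            return s ^ (plaintext >> 4)                  # high nibble of the key
    return None

print(hex(prime_probe(0x3C, 0xA5)))
```

In this model one observation gives half the key byte; the low four bits are hidden inside the cache line, which is why real attacks combine first-round observations with later rounds.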

1.11 Speculative execution: Spectre, Meltdown, and the cataclysm of 2018

In January 2018 the world learned that essentially every high-performance CPU shipped in the prior twenty years contained a microarchitectural class of vulnerabilities that allowed unprivileged code to read kernel memory through cache side channels. The collective name is transient execution attacks, and the specifics are Spectre v1 (Bounds Check Bypass), Spectre v2 (Branch Target Injection), Meltdown (Rogue Data Cache Load), Foreshadow (L1TF, attacks SGX enclaves), MDS (Microarchitectural Data Sampling), Zombieload, RIDL, Fallout, CrossTalk, ÆPIC Leak, and a continuing series. The mechanism, in every case, is the same: the CPU speculatively executes instructions past a security boundary, the speculative results leave traces in microarchitectural state (cache, store buffers, line-fill buffers, register file), the architectural rollback only erases the architectural state (registers, retired instructions), and the surviving microarchitectural traces are read out by a cache side channel.

Spectre v1 in pseudocode:

// kernel function, called via syscall
if (user_index < array_size) {        // bounds check, speculatively bypassed
    char x = array1[user_index];      // out-of-bounds read into kernel memory
    char y = array2[x * 4096];        // load tied to stolen byte
}

The branch predictor has been pre-trained to predict the bounds check as true. The CPU speculatively executes the array1 read with an out-of-bounds index, getting kernel memory byte $x$. It then loads array2[x * 4096], which brings the cache line at page $x$ of array2 into the cache. The bounds check eventually resolves, the CPU realizes the speculation was wrong, and it rolls back the architectural state, but the cache line is still there. The attacker, who flushed array2 from the cache beforehand, now reloads each of its 256 page-strided lines, finds the one that is fast, and that line index is the byte they just read from kernel memory.
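The sequence above can be mimicked with a deliberately crude model. This is not how a CPU works internally: the "cache" is a Python set of array2 page indices, the predictor is assumed to always say "in bounds", and the secret is assumed to sit directly after array1 in a flat byte string. It only shows the shape of the leak, where the architectural result is discarded while the cache change survives.

```python
ARRAY1 = bytes([10, 20, 30, 40])
SECRET = b"KEY"
MEMORY = ARRAY1 + SECRET          # toy layout: secret sits right after array1
ARRAY1_SIZE = len(ARRAY1)

cached_pages = set()              # which 4096-byte pages of array2 are cached

def victim(index):
    """Bounds-checked read; the predictor is assumed trained to say 'in bounds'."""
    # Transient phase: the body runs before the bounds check resolves.
    x = MEMORY[index]
    cached_pages.add(x)           # models the load of array2[x * 4096]
    # Resolution: a failed check discards the architectural result,
    # but the cache change above is not undone.
    return x if index < ARRAY1_SIZE else None

def leak_byte(offset):
    cached_pages.clear()                       # flush array2
    result = victim(ARRAY1_SIZE + offset)      # out-of-bounds request
    assert result is None                      # nothing is returned architecturally
    # Reload step: a page found in the set stands in for a fast access.
    hot = [page for page in range(256) if page in cached_pages]
    return hot[0]

leaked = bytes(leak_byte(i) for i in range(len(SECRET)))
print(leaked)
```

On real hardware the reload step is a timing measurement over 256 candidate lines, with the noise handling that implies.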

Wrong speculation analogy. The CPU is a chef who, while waiting for the order, starts cooking what they think the customer will want. When the order arrives, if the chef guessed wrong, they throw the food away. Spectre says: even though the food was thrown away, you can smell it on the chef's apron. The cache is the apron.

Meltdown was even worse on Intel. The privilege check on a memory load happened late in the pipeline, late enough that the load's value was already forwarded to dependent instructions speculatively. So on vulnerable Intel chips an unprivileged process could just read kernel memory directly, no branch-prediction tricks required. AMD chips were largely immune to Meltdown because they performed the privilege check earlier; Spectre, in contrast, hit everyone.

Mitigations:

Microcode updates. Vendor-supplied microcode patches that change the speculation behavior, sometimes adding instructions like IBPB (Indirect Branch Predictor Barrier) and STIBP.
KPTI (Kernel Page Table Isolation). Removes almost all kernel mappings from the page tables in use while userspace runs, so even speculative loads cannot reach kernel addresses. Costs 5-30% on syscall-heavy workloads.
Retpolines. A compiler trick that replaces indirect branches with a return-trampoline that the predictor cannot easily train, defeating Spectre v2.
LFENCE / serialize. Compiler-inserted serializing instructions in sensitive code paths.
Hardware fixes. Intel's Cascade Lake and later, AMD Zen 2 and later, redesigned silicon to perform privilege checks earlier and to flush speculation traces on context switches.

The cost of the software mitigations is substantial. KPTI's overhead lands on every kernel entry and exit, and retpolines cost a few percent in branch-heavy code. Together they ate years of microarchitectural progress. Spectre's deepest lesson is architectural: the entire industry assumed the boundary between architectural and microarchitectural state was inviolable, and it was not. Every CPU shipping today carries silicon mitigations that did not exist in 2017, and we are not done finding new variants.

1.12 Acoustic, thermal, photonic

The frontier of side channels keeps moving outward. Genkin et al. (2014) recorded the acoustic emissions of a laptop during RSA decryption (the chokes and capacitors emit high-pitched noise that's modulated by load) and recovered keys using a phone microphone placed close to the machine, or a more sensitive microphone four meters away. Thermal sensors built into modern CPUs (which the OS exposes) leak coarse-grained activity patterns; attackers have used them in cross-VM attacks. Photonic emission analysis on a decapped chip uses a near-IR camera to literally watch transistors switch (silicon emits a few near-IR photons during avalanche conduction) and read out a register's contents bit by bit.

These channels have lower bandwidth and require more careful setups, but they exist, and they remind us that any physical signal correlated with computation is potentially a side channel. The cryptographer's instinct (assume the channel is just bits) and the engineer's reality (everything is physics) collide here, and the engineer wins.