>
section 6 of 139 min read

6. Pipelining: The Assembly Line in Silicon

Single-cycle and multi-cycle execute one instruction at a time. Pipelining lets multiple instructions occupy the CPU at once, each in a different stage. This is the single biggest performance trick in modern CPUs.

6.1 The assembly-line analogy

Imagine washing dishes by hand. One dish at a time: rinse, soap, scrub, rinse, dry. Five steps, say 20 seconds each, for 100 seconds per dish. To wash 10 dishes: 1000 seconds.

Now hire five people, one per step. Person 1 rinses dish 1, hands it to person 2 to soap while she starts on dish 2. After the pipeline fills (5 dishes deep), one finished dish comes off the line every 20 seconds. To wash 10 dishes: about 280 seconds (100 to fill plus 9 more output cycles). Throughput up by 5x; latency per dish is the same.

That is exactly what a pipelined CPU does. Each instruction still takes 5 stages (5 cycles of latency from start to finish), but a different instruction is in each stage at every cycle, so one instruction retires per cycle.

6.2 The classic 5-stage MIPS pipeline

plaintext
   Cycle:  1    2    3    4    5    6    7    8
 
   I1:    IF | ID | EX |MEM | WB
   I2:         IF | ID | EX |MEM | WB
   I3:              IF | ID | EX |MEM | WB
   I4:                   IF | ID | EX |MEM | WB

Stages:

  1. IF (Instruction Fetch). PC indexes I-cache; instruction is read; PC += 4.
  2. ID (Instruction Decode / Register Read). Decode opcode, read source registers from the register file.
  3. EX (Execute). ALU operation (compute result, or compute memory address for load/store, or evaluate branch condition).
  4. MEM (Memory Access). Load or store hits the D-cache. Other instructions pass through.
  5. WB (Writeback). Write ALU result or loaded data back to the register file.

Latches between stages hold the partial results. The pipeline registers between IF and ID, ID and EX, etc., capture the values each stage computed and feed them to the next stage on the next cycle.

Throughput in steady state is one instruction per cycle, even though the latency per instruction is 5 cycles.

6.3 Why speedup is never N

The naive expectation is that an N-stage pipeline gives Nx speedup. It does not. Three reasons:

  1. Pipeline fill / drain. The first N1N - 1 cycles are partial; you cannot retire until the pipeline is full. For a long-running program this is a one-time cost, but for short kernels it matters.
  2. Stage imbalance. The slowest stage sets the clock period. If MEM takes 800 ps and the others take 500 ps, the pipeline runs at 800 ps. Faster stages waste their time.
  3. Hazards. Sometimes the next instruction cannot enter the pipeline yet, because it needs a result that has not been computed. This is the meat of the next subsection.

The actual speedup is

Speedup=N1+stall fraction\text{Speedup} = \frac{N}{1 + \text{stall fraction}}

where the stall fraction depends on workload. Real CPUs achieve maybe 0.7N to 0.9N of the ideal.

6.4 Hazards: the spanner in the works

A hazard is a situation where the pipeline cannot proceed at full rate because instructions have dependencies that the simple "one stage per cycle" plan does not respect.

Structural hazards

Two instructions need the same hardware resource at the same time. Example: a unified memory used for both instruction fetch (IF) and data access (MEM). On cycle 4, instruction I4 is in IF and instruction I1 is in MEM; both want the memory bus. Solution: duplicate the resource (split caches into L1I and L1D so IF and MEM use different memories) or stall.

Data hazards

Instruction depends on the result of an earlier instruction not yet written back. The classic case:

asm
ADD R1, R2, R3      ; R1 = R2 + R3
SUB R4, R1, R5      ; R4 = R1 - R5    ; needs R1!

The ADD writes R1 in WB (cycle 5). The SUB wants to read R1 in ID (cycle 3, only one cycle after ADD's ID). The SUB would read the stale R1.

Solutions:

  • Stall (bubble). Insert no-ops until the result is ready. Costs cycles.
  • Forwarding (bypassing). Route the ALU output back into the ALU input directly, before WB writes the register file. Hardware multiplexers detect the dependency and route. The standard solution; cuts most data hazards to zero stall.
  • Compiler reordering. The compiler arranges independent instructions between the producer and consumer:
    asm
    ADD R1, R2, R3
    ADD R8, R9, R10   ; independent, fills the gap
    SUB R4, R1, R5    ; now R1 is ready

A particularly nasty case is the load-use hazard: a load produces its result at the end of MEM (cycle 4), but the next instruction wants it in EX (cycle 4 also). Even with forwarding, this requires one bubble. Compilers try to insert one independent instruction after every load.

plaintext
   Cycle:  1    2    3    4    5    6    7
   LDR R1, [R2]:   IF | ID | EX |MEM | WB
   ADD R3, R1, ...:     IF | ID | -- | EX |MEM | WB
                                  (stall: R1 forwarded
                                   from MEM/WB latch
                                   into EX input)

Control hazards

A branch is decided at EX (cycle 3, in our 5-stage pipeline). By then, two instructions after the branch have already entered IF and ID. If the branch was taken, those two instructions are wrong and must be flushed.

Solutions:

  • Stall until branch resolves (slow, two bubbles).
  • Branch prediction. Guess which way the branch goes; speculatively fetch and decode. If right: free, no stall. If wrong: flush and pay the cost.
  • Delayed branches. Earlier MIPS used a delay slot: the instruction immediately after the branch always executes, regardless of branch outcome. Compilers fill delay slots with useful work (often the loop's increment). Clever, but ugly to debug; modern ISAs dropped it.

6.5 Branch prediction in depth

Modern CPUs predict branches with terrifying accuracy.

Static prediction

Compile-time guess. Heuristics:

  • Backward branches (loops) usually taken; predict taken.
  • Forward branches (if/then) often not taken; predict not taken.
  • Compiler hints (__builtin_expect in GCC).

Achieves maybe 70-80% accuracy.

Dynamic prediction

Hardware learns from history.

  • One-bit predictor. Remember the last outcome of each branch; predict it again. Achieves ~85%.
  • Two-bit saturating counter. Remember strength of prediction; one wrong prediction does not flip the choice. ~90%.
  • Two-level adaptive (Yeh-Patt). Use a global history register (last 10 branch outcomes) to index a table of two-bit counters. Captures correlations between nearby branches. ~95%.
  • Tournament predictors. Run multiple predictors in parallel; learn which one is best for each branch. ~97%.
  • Perceptron predictors. Use a tiny neural network. AMD Zen, recent Intel. ~98%.

The Branch Target Buffer (BTB) caches the target address of each predicted-taken branch, so the fetch unit can begin fetching the target on the very next cycle without waiting for the decoder.

Hardware-security tie-in. Branch prediction is the heart of Spectre v1. The CPU speculatively executes down a mispredicted path, touching memory along the way. Even though the speculation is later rolled back architecturally, the cache state is not rolled back. An attacker that primes the predictor to mispredict in a controlled way can cause a victim to speculatively load secret data into cache, then time the cache to recover the secret. Modern mitigations include: retpolines (replacing indirect branches with safer constructs), IBPB / IBRS / STIBP (microcode flushing of predictor state), and architectural changes like Intel's "speculation barrier" instructions. We unpack all of this in Chapter 24.

6.6 Superscalar: more than one instruction per cycle

If pipelining gives you 1 IPC (instructions per cycle), why not 2? Or 4? Superscalar designs duplicate the pipeline: multiple IF, ID, EX, MEM, WB stages running in parallel.

The Pentium (1993) was 2-way superscalar. Modern Apple M1 is 8-wide. Intel Sapphire Rapids is 6-wide.

The challenges multiply:

  • The fetch unit must deliver multiple instructions per cycle. This is easier with fixed-length encoding (RISC) and harder with variable-length (x86 needs a length pre-decoder and a µop cache to hit wide IPC).
  • The dispatch unit must find independent instructions to issue in parallel.
  • The register file must support multiple reads and writes per cycle.

6.7 Out-of-order execution

If instruction I3 is waiting for a slow load and I4 is independent, why not run I4 first? Out-of-order (OoO) execution lets the CPU dispatch instructions as their inputs become ready, regardless of program order, then commit them in program order at the end.

The architecture (Tomasulo-style):

  1. Fetch and decode in program order.
  2. Rename registers. Architectural registers (R0-R31 in ARM) are mapped to a larger pool of physical registers (often 200+). This breaks false dependencies caused by register reuse (write-after-write, write-after-read).
  3. Dispatch to reservation stations. Each instruction sits in a station, watching for its inputs.
  4. Execute when ready, on whichever functional unit is free.
  5. Reorder buffer (ROB). Holds completed instructions in program order until they can be retired.
  6. Retire (commit) in program order. Architectural state changes only on retirement.

This is what "out of order" actually means. Internally, instructions execute in dependency order. Externally (architecturally), they look in-order because retirement is in-order.

rendering diagram...

OoO is the defining feature of modern high-performance CPUs since Pentium Pro (1995). Apple M1 has a ~630-entry ROB. Intel Lion Cove has 576. Massive structures; massive transistor budgets.

Hardware-security tie-in. OoO + speculation = Spectre/Meltdown family. A misspeculated load executes anyway, in the shadows. Its result never commits, but its cache footprint persists. The fact that architectural rollback does not roll back microarchitectural state is the whole bug class. Mitigations either prevent misspeculation (slowing things down), prevent the speculative result from reaching cache (LSU changes), or randomize the cache so attacker measurements are noisy.