
5. The Datapath and Control Unit

We have talked about what a CPU does. Now we draw how the gates are arranged to do it.

5.1 The single-cycle datapath

The simplest CPU executes each instruction in one (long) clock cycle. Imagine drawing the route an instruction takes through the chip:

         ┌───────────────────────────────────────────────────────────────┐
         │                                                                │
         │  ┌─────┐    ┌──────┐   ┌──────┐                                │
         │  │     │    │      │   │      │     ┌─────┐                    │
PC ─────▶│  │ Inst│───▶│Decode│──▶│ Reg  │──┐  │     │                    │
         │  │ Mem │    │      │   │ File │  ├─▶│ ALU │──┬───▶┌──────┐    │
         │  │     │    │      │   │      │  │  │     │  │    │ Data │    │
         │  └─────┘    └──────┘   └──────┘  │  └─────┘  │    │ Mem  │──┐ │
         │                                   │           │    └──────┘  │ │
         │                                   └────mux────┘              │ │
         │                                                              │ │
         └──────────────────────────────────────────────────────────────┘ │

                                                                  Writeback to Reg

The five logical stages all happen in one cycle:

  1. Instruction Fetch (IF). PC indexes into instruction memory; the instruction is read out.
  2. Instruction Decode (ID). Opcode bits feed the control unit; register specifiers read source operands from the register file.
  3. Execute (EX). The ALU does the math, or computes a memory address for load/store.
  4. Memory (MEM). If load, read data memory at the computed address; if store, write to it.
  5. Writeback (WB). Write the ALU result (or the loaded data) back into the register file.
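The five stages can be sketched as a single function that does all of them for one instruction. This is only a toy model: the tuple instruction format `(op, rd, rs1, x)`, the four opcodes, and the names `step`, `regs`, `imem`, and `dmem` are assumptions made for illustration, not part of any real ISA.

```python
# Toy single-cycle datapath: each call to step() performs all five stages.
# The instruction format (op, rd, rs1, x) is invented for this sketch;
# x is a register number for "add" and an immediate for everything else.

def step(pc, regs, imem, dmem):
    """Execute one instruction and return the next PC."""
    # IF: the PC indexes instruction memory.
    op, rd, rs1, x = imem[pc]

    # ID: decode the fields and read source operands from the register file.
    a = regs[rs1]
    b = regs[x] if op == "add" else x
    if op not in ("add", "addi", "load", "store"):
        raise ValueError(f"unknown opcode {op!r}")

    # EX: the ALU does arithmetic, or computes base + offset for load/store.
    alu = a + b

    # MEM: only loads and stores touch data memory.
    loaded = None
    if op == "load":
        loaded = dmem[alu]
    elif op == "store":
        dmem[alu] = regs[rd]  # for a store, rd names the register to write out

    # WB: write the ALU result or the loaded value into the register file.
    if op in ("add", "addi"):
        regs[rd] = alu
    elif op == "load":
        regs[rd] = loaded

    return pc + 1


regs = [0] * 4
dmem = [0] * 8
imem = [
    ("addi", 1, 0, 5),    # r1 = r0 + 5
    ("addi", 2, 0, 7),    # r2 = r0 + 7
    ("add", 3, 1, 2),     # r3 = r1 + r2
    ("store", 3, 0, 4),   # dmem[r0 + 4] = r3
    ("load", 1, 0, 4),    # r1 = dmem[r0 + 4]
]

pc = 0
while pc < len(imem):
    pc = step(pc, regs, imem, dmem)

print(regs, dmem[4])  # [0, 12, 7, 12] 12
```

Note that every instruction walks through the MEM and WB code even when it has nothing to do there, which mirrors how a single-cycle design reserves time for every stage.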

The single-cycle design is conceptually clean but slow. The clock period must accommodate the slowest path through all stages combined: instruction memory + register read + ALU + data memory + register write. That path can take 5-10 ns, which caps the clock at roughly 100-200 MHz. Instructions that skip a stage (an ALU instruction never touches data memory, for example) still pay for the full period, so the time reserved for the unused stages is wasted.
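The arithmetic behind that limit is short enough to write out. The stage delays below are assumed round numbers chosen so the total lands at the top of the 5-10 ns range; they are not measurements of any real chip.

```python
# Assumed, perfectly balanced stage delays in nanoseconds (illustrative only).
stage_ns = {"IF": 2.0, "ID": 2.0, "EX": 2.0, "MEM": 2.0, "WB": 2.0}

# Single-cycle: the clock period must cover all five stages back to back.
period_ns = sum(stage_ns.values())
max_freq_mhz = 1000.0 / period_ns  # a period of 1 ns corresponds to 1000 MHz

# Stages each instruction class actually needs.
needs = {
    "load": ["IF", "ID", "EX", "MEM", "WB"],
    "alu": ["IF", "ID", "EX", "WB"],
    "branch": ["IF", "ID", "EX"],
}

# Part of every clock period that each class leaves unused.
idle_ns = {k: period_ns - sum(stage_ns[s] for s in v) for k, v in needs.items()}

print(period_ns, max_freq_mhz)  # 10.0 100.0
print(idle_ns)                  # {'load': 0.0, 'alu': 2.0, 'branch': 4.0}
```

Under these assumptions only the load uses the whole period; a branch idles for 4 of every 10 ns.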

5.2 Multi-cycle datapath

Break each instruction into stages, each fitting in one shorter cycle. Different instructions take different numbers of cycles:

  • ALU instructions: 4 cycles (IF, ID, EX, WB).
  • Load: 5 cycles (IF, ID, EX, MEM, WB).
  • Store: 4 cycles (IF, ID, EX, MEM).
  • Branch: 3 cycles (IF, ID, EX with branch decision).

Hardware reuse: one ALU is shared among address computation, branch comparison, and arithmetic. One memory is shared between instructions and data. This single-memory arrangement is called Princeton-style, as opposed to Harvard, where instruction memory and data memory are separate and each has its own bus.

CPI is now variable: each instruction class takes its own number of cycles, and the average is weighted by how often each class occurs. A program with 50% ALU + 30% load + 20% branch has $\text{CPI} = 0.5 \cdot 4 + 0.3 \cdot 5 + 0.2 \cdot 3 = 4.1$ on a multi-cycle design. That is higher than the single-cycle CPI of 1, but the clock is much faster (each stage is short), so total execution time can still come out lower.
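A quick way to check the trade-off is to multiply CPI by the clock period for both designs. The cycle counts and instruction mix come from the text; the two clock periods (10 ns single-cycle, 2 ns multi-cycle) are assumptions that correspond to five equally long stages.

```python
# Cycle counts per instruction class in the multi-cycle design.
cycles = {"alu": 4, "load": 5, "branch": 3}

# Instruction mix from the example: 50% ALU, 30% load, 20% branch.
mix = {"alu": 0.5, "load": 0.3, "branch": 0.2}

# Average CPI is the frequency-weighted sum of the per-class cycle counts.
cpi = sum(mix[k] * cycles[k] for k in mix)

# Assumed clock periods (hypothetical): five balanced 2 ns stages.
single_cycle_period_ns = 10.0
multi_cycle_period_ns = 2.0

# Average time per instruction = CPI x clock period.
single_ns = 1 * single_cycle_period_ns
multi_ns = cpi * multi_cycle_period_ns
speedup = single_ns / multi_ns

print(round(cpi, 2), round(multi_ns, 2), round(speedup, 2))  # 4.1 8.2 1.22
```

The gain comes entirely from instructions that skip stages. If the stages were unbalanced, the multi-cycle clock would be set by the slowest stage and the advantage would shrink.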

5.3 Von Neumann vs Harvard

Two ways to wire memory:

  • Von Neumann (Princeton). Instructions and data share one memory bus. Simpler hardware, but instruction fetch and data access compete for the bus. Used in classic CPUs.
  • Harvard. Separate instruction memory and data memory, each with its own bus. Instruction fetch and data access happen in parallel. Used in DSPs (TI C6000, Analog Devices SHARC), small microcontrollers (PIC, AVR), and inside modern CPUs at the cache level (L1 split into L1I instruction cache and L1D data cache).

Modern CPUs are von Neumann at the main-memory level (one DRAM holds everything) but Harvard inside the L1 cache. Best of both.
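The bus contention can be made concrete with a deliberately crude count of memory-bus cycles. The model assumes every instruction fetch and every data access takes exactly one bus cycle, and that a Harvard design overlaps the two perfectly; `memory_cycles` and the sample program are hypothetical.

```python
def memory_cycles(program, harvard):
    """Count memory-bus cycles for a list of ops ('alu', 'load', 'store')."""
    fetches = len(program)  # one instruction fetch per instruction
    data_accesses = sum(1 for op in program if op in ("load", "store"))
    if harvard:
        # Separate buses: data accesses overlap with instruction fetches.
        return max(fetches, data_accesses)
    # Shared bus: every fetch and every data access takes its own turn.
    return fetches + data_accesses


program = ["load", "alu", "store", "alu", "load"]
print(memory_cycles(program, harvard=False))  # 8
print(memory_cycles(program, harvard=True))   # 5
```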

5.4 The control unit: hardwired vs microprogrammed

The control unit produces the dozens of control signals (ALU op, register-write enable, MUX selectors, memory read/write) that orchestrate the datapath each cycle. Two implementation styles:

Hardwired control

A direct combinational + sequential FSM (Chapter 4 territory). Inputs are the opcode plus the FSM state; outputs are the control signals. Designed by writing out the state-transition table and synthesizing.

  • Fast (only a few gate delays from opcode to control signal).
  • Inflexible (changing the ISA requires re-synthesizing the chip).
  • Used in RISC CPUs, small microcontrollers.
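A hardwired controller for the multi-cycle design of section 5.2 can be sketched as a next-state function plus an output function. The string state names, the four opcode classes, and the three control signals shown are simplifications assumed for this sketch; real hardware encodes them as bits and synthesizes the logic.

```python
# Hardwired control as an FSM: inputs are (state, opcode), outputs are
# control signals. All names here are illustrative.

def next_state(state, opcode):
    """State-transition table written as straight-line logic."""
    if state == "IF":
        return "ID"
    if state == "ID":
        return "EX"
    if state == "EX":
        if opcode == "branch":
            return "IF"  # branch decision is made in EX
        return "MEM" if opcode in ("load", "store") else "WB"
    if state == "MEM":
        return "WB" if opcode == "load" else "IF"  # a store ends after MEM
    if state == "WB":
        return "IF"
    raise ValueError(f"unknown state {state!r}")


def control_signals(state, opcode):
    """Output logic: which signals are asserted in this state."""
    return {
        "mem_read": state == "IF" or (state == "MEM" and opcode == "load"),
        "mem_write": state == "MEM" and opcode == "store",
        "reg_write": state == "WB",
    }


def cycles_for(opcode):
    """Count the cycles from IF until the FSM returns to IF."""
    state, n = "IF", 0
    while True:
        n += 1
        state = next_state(state, opcode)
        if state == "IF":
            return n


for op in ("alu", "load", "store", "branch"):
    print(op, cycles_for(op))  # alu 4, load 5, store 4, branch 3
```

Changing the instruction set means editing `next_state` and `control_signals` themselves, which is the software analogue of re-synthesizing the chip.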

Microprogrammed control

Each "complex" instruction is implemented by a sequence of micro-instructions stored in a control ROM. Reading the ROM produces the control signals for one cycle; a "next address" field in the micro-instruction selects the next ROM entry.

  • Flexible (rewrite the ROM and the ISA changes without touching the rest of the hardware).
  • Slower (extra ROM access in the loop).
  • Used in CISC CPUs, mainframes (IBM System/360 was the canonical example).

    Opcode ──▶ Micro-PC ──▶ Control ROM ──▶ Control signals
                  ▲              │
                  │              │
                  └──────────────┘
                  next-address field
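The loop between the micro-PC and the control ROM can be sketched in a few lines. The ROM contents, the signal names, and the `DISPATCH` table that maps an opcode to its first ROM address are all invented for illustration; a real control store is much wider and has more sequencing options.

```python
# Toy microsequencer. Each ROM entry is (signals asserted this cycle,
# next address). A next address of None means the instruction is finished.
DISPATCH = {"add": 0, "load": 2}  # opcode -> entry point in the ROM

CONTROL_ROM = [
    # "add": execute, then write back
    ({"alu_add"}, 1),                      # address 0
    ({"reg_write"}, None),                 # address 1
    # "load": compute address, read memory, write back
    ({"alu_add", "alu_src_imm"}, 3),       # address 2
    ({"mem_read"}, 4),                     # address 3
    ({"reg_write", "mem_to_reg"}, None),   # address 4
]


def run_microprogram(opcode):
    """Return the control signals asserted on each cycle of one instruction."""
    micro_pc = DISPATCH[opcode]  # the opcode selects the starting address
    trace = []
    while micro_pc is not None:
        signals, next_addr = CONTROL_ROM[micro_pc]  # one ROM read per cycle
        trace.append(sorted(signals))
        micro_pc = next_addr  # the next-address field feeds the micro-PC
    return trace


for cycle, signals in enumerate(run_microprogram("load")):
    print(cycle, signals)
# 0 ['alu_add', 'alu_src_imm']
# 1 ['mem_read']
# 2 ['mem_to_reg', 'reg_write']
```

Adding or changing an instruction here only means editing `CONTROL_ROM` and `DISPATCH`; `run_microprogram` stays the same, which is the flexibility argument in miniature.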

Modern reality: both

Modern x86 chips use hardwired decoding for the fast path (the most common instructions) and microcode for the slow path (rep movsb, cpuid, complex string instructions, transcendental math, virtualization helpers). Microcode is also a patch mechanism for security: after Spectre and Meltdown were disclosed in 2018, Intel and AMD shipped microcode updates that changed speculative-execution behavior, for example by adding controls over indirect branch prediction. Loading microcode is a privileged operation; the BIOS or the OS kernel loads the new microcode at boot.

Hardware-security tie-in. Microcode is signed by the vendor. The signing keys are themselves a critical asset. Supply-chain attacks on microcode signing would be catastrophic; researchers have shown that some old AMD chips had microcode keys recoverable via fault injection. Modern chips include hardware verification of microcode signatures. The whole topic is in Chapter 24.