Before any datapaths or cache lines, we need a clean mental model of what a computer is and is not. The model that won is John von Neumann's 1945 sketch in the First Draft of a Report on the EDVAC. Almost every machine you will ever touch follows this sketch with optimizations layered on top.
1.1 The von Neumann model in plain English
Von Neumann's idea was that you could build a single, general-purpose machine instead of a different gadget per task. Three insights, all radical at the time:
- Stored program. Instructions and data live together in the same memory. Earlier computers like the ENIAC were programmed by physically rewiring patch panels. Switching tasks took days. Von Neumann said: keep instructions in the same memory you use for data, and you can change what the machine does by writing new numbers, not new wires.
- Functional units. Separate the parts conceptually: a CPU that executes, a memory that stores, an I/O system that talks to the world. Buses carry information between them.
- Sequential execution with branches. The machine fetches one instruction, executes it, moves to the next. A branch instruction can jump to a different address based on a condition, which gives you all of computation (Turing-completeness).
That is the entire mental model. Everything else in this chapter, every cache hierarchy and pipeline trick, is an optimization. The fetch-decode-execute loop has not changed, only the throughput.
Restaurant analogy. Picture a single chef in a tiny kitchen. The CPU is the chef. Memory is the recipe binder bolted to the back wall and the pantry behind that. The chef's daily life is a loop: walk to the binder, read the next instruction ("dice an onion"), grab ingredients from the pantry (read data from memory), do the work at the cutting board (execute), put the result in a bowl on the counter (write back), then walk to the binder to read the next step. The chef does one thing at a time. The pantry is far away. Most of the chef's day is spent walking. Most of computer architecture is figuring out how to keep the chef cutting instead of walking.
1.2 Functional units in detail
Every computer, from a Cortex-M0 microcontroller in a smart light bulb to a Xeon in a data center, has the same five conceptual blocks:
- Input unit. Keyboards, mice, sensors, cameras, network packets, microphones. The world sending data inward.
- Output unit. Displays, speakers, LEDs, motors, network packets going outward.
- Memory. A giant array of cells. Modern systems have many tiers: registers (~1 KB total inside the CPU, instant access), caches (KB to tens of MB, a few cycles), main memory (DRAM, gigabytes, hundreds of cycles), storage (SSD/HDD, terabytes, milliseconds), archival (tape, glacial). Each tier is bigger and slower than the one above.
- Control unit. A finite state machine (Chapter 4) that orchestrates everything. It pulses control signals onto buses, tells the ALU which operation to perform, decides when to read or write memory, sequences instruction fetch.
- Datapath. The combinational and sequential logic that actually moves bits around: register file, ALU, multiplexers, latches between pipeline stages.
Together, the control unit and datapath form the CPU. The datapath is the kitchen counter, knife block, and stove. The control unit is the chef's brain telling the hands what to do next. Separating control from data is one of the cleanest abstractions in engineering, and we will see it again when we draw the single-cycle datapath.
1.3 Bus structures: the wiring between the units
The functional units sit on shared wires called buses. A bus carries three kinds of information:
- Address bus. "I want to talk about memory location ." Width determines how much memory you can address. A 32-bit address bus can name billion bytes (4 GiB), which is why 32-bit operating systems hit a memory wall. A 64-bit address bus names exabytes, more than any current machine has physical memory for.
- Data bus. "And the value is..." Width determines how many bits move per transaction. A 64-bit data bus moves 8 bytes per cycle.
- Control bus. "Is this a read or a write? Is this an interrupt? Is the bus master switching?" A handful of single-bit signals: , , , bus-grant lines, and so on.
CPU Memory I/O Device
│ │ │
├──── address bus (32 or 64 bits) ───┤ │
├──── data bus (32, 64, or 128) ─────┤───────────────────────┤
├──── control bus (RD, WR, IRQ...) ──┤───────────────────────┤
│ │ │
└──────────────── shared backplane ───┴───────────────────────┘In a tiny microcontroller these buses are physical wires running between modules on one die. In a big multi-socket server they are layered protocols on top of high-speed differential pairs (PCIe, UPI, CXL).
Bus arbitration. When more than one device wants to drive the bus at the same time, somebody loses. The arbiter decides who. Schemes:
- Daisy-chain. Bus-grant signal flows from highest-priority device down. Cheap, fixed priority, slow.
- Centralized parallel. Each device has a bus-request line into a central arbiter, which asserts the matching grant. Fast, requires a chip dedicated to arbitrating.
- Distributed self-selection. Devices each drive a priority code; the highest priority wins (a bus-resolution circuit). Used in older Multibus systems.
- Token passing. The right to drive the bus passes around like a token in a ring. Used in some real-time systems where bounded latency matters more than peak throughput.
Modern PCIe-style buses skip the shared-medium model entirely; they are point-to-point packetized links with a switch fabric. But the concept of "two masters wanting the bus, one has to wait" lives on inside the switch.
Master 0 requests, gets the grant (BG0), drives the address bus. When 0 releases, master 1 (which had been waiting) gets the grant.
1.4 Performance: how to count "fast"
There is a small zoo of metrics for CPU speed, and you have to keep them straight to read benchmarks honestly.
- Clock rate. Hertz. How often the central oscillator pulses. A 3 GHz CPU's clock period is picoseconds. Higher clock means more cycles per second, but a cycle is not necessarily an instruction.
- CPI (cycles per instruction). How many clock cycles, on average, an instruction takes from start to finish. Simple RISC CPUs aim for CPI close to 1. Pipelined superscalar CPUs can have effective CPI below 1 by retiring multiple instructions per cycle.
- MIPS (millions of instructions per second). . A flawed metric because instructions vary wildly in what they do (a MIPS multiply does the same job as a 50-cycle x86 trig instruction in different clothing), but useful for comparing within an architecture.
- FLOPS (floating-point operations per second). Same idea but for floating-point math, which dominates scientific computing and machine learning. Modern GPUs hit petaflops. The H100 GPU reaches ~67 teraflops of FP64 with thousands of SIMD lanes.
Deriving the Iron Law
The fundamental relationship between all of these is what Hennessy and Patterson call the Iron Law of Performance. It comes from putting units together carefully.
Suppose a program has instructions. Each instruction takes on average cycles. Each cycle has length seconds. Then the total wall-clock time to run the program is
Why is this iron? Because the three factors are independent levers, and you can attack any of them:
- Reduce : better algorithms (asymptotic complexity), better compilers (loop unrolling, vectorization), better instruction sets (one CISC instruction does what three RISC instructions do).
- Reduce : better microarchitecture (pipelining, superscalar, out-of-order, branch prediction).
- Reduce : faster transistors (smaller process node, better materials, FinFET, GAAFET), shorter pipeline stages.
The catch is that the levers fight each other. Pipelining shortens by chopping logic into smaller stages, but the bookkeeping (latches between stages, hazard logic) adds to or to . CISC reduces by packing more work per instruction, but each instruction has higher because the hardware has to do more. Modern designs are a careful balance, picked for the workload they target.
Worked example: the Iron Law in practice
A program has 1 billion instructions. CPI = 1.5. Clock = 2 GHz. Then
If a new compiler reduces the instruction count to (30% fewer), s. If a new microarchitecture reduces CPI to 1.0, s. If both, s. Multiplicative.
Amdahl's Law: the limit of optimization
Sometimes you optimize one part of a program. Amdahl noticed that the speedup from speeding up part of a program is bounded by how much of the program that part is.
If a fraction of the program is sped up by a factor , and the rest stays the same, the overall speedup is
In the limit (perfect speedup of the targeted part), the overall speedup approaches . If 90% of the program is parallelizable (), maximum speedup is , no matter how many cores you throw at it. The remaining 10% serial part dominates.
Restaurant analogy. Hire 100 chefs to chop vegetables for a soup recipe. If the chopping was 70% of the prep time, you cut chopping to nearly zero; the other 30% (boiling broth, simmering, plating) does not parallelize. The total speedup tops out at , no matter how many extra chefs. This is why we cannot just keep adding cores forever. The serial part eats the win.
Amdahl is also why specialized accelerators (GPUs for ML, NPUs for inference, hardware AES) matter so much. They go after the parallel-friendly part and speed it up enormously, leaving the serial part as the new bottleneck, and so on. We will see this fractal pattern repeat.
1.5 A tiny dose of history
| Generation | Era | Tech | Examples |
|---|---|---|---|
| 1st | 1940s-50s | Vacuum tubes | ENIAC, EDVAC |
| 2nd | 1950s-60s | Discrete transistors | IBM 7090, PDP-1 |
| 3rd | 1960s-70s | Small/medium-scale ICs | IBM System/360, PDP-11 |
| 4th | 1970s-now | Microprocessors | Intel 4004 (1971) -> modern x86 / ARM / RISC-V |
The 4004 in 1971 had 2,300 transistors. Apple's M3 has on the order of 25 billion. Moore's Law, the empirical observation that transistor count per chip doubles roughly every two years, drove this. It is now slowing as we hit physical limits (gate oxide leakage, atomic-scale features). Every modern advance is squeezing more performance from architecture rather than raw transistor count, which is why this chapter keeps mattering.