>
section 1 of 1310 min read

1. The Big Picture: What a Computer Actually Is

Before any datapaths or cache lines, we need a clean mental model of what a computer is and is not. The model that won is John von Neumann's 1945 sketch in the First Draft of a Report on the EDVAC. Almost every machine you will ever touch follows this sketch with optimizations layered on top.

1.1 The von Neumann model in plain English

Von Neumann's idea was that you could build a single, general-purpose machine instead of a different gadget per task. Three insights, all radical at the time:

  1. Stored program. Instructions and data live together in the same memory. Earlier computers like the ENIAC were programmed by physically rewiring patch panels. Switching tasks took days. Von Neumann said: keep instructions in the same memory you use for data, and you can change what the machine does by writing new numbers, not new wires.
  2. Functional units. Separate the parts conceptually: a CPU that executes, a memory that stores, an I/O system that talks to the world. Buses carry information between them.
  3. Sequential execution with branches. The machine fetches one instruction, executes it, moves to the next. A branch instruction can jump to a different address based on a condition, which gives you all of computation (Turing-completeness).

That is the entire mental model. Everything else in this chapter, every cache hierarchy and pipeline trick, is an optimization. The fetch-decode-execute loop has not changed, only the throughput.

Restaurant analogy. Picture a single chef in a tiny kitchen. The CPU is the chef. Memory is the recipe binder bolted to the back wall and the pantry behind that. The chef's daily life is a loop: walk to the binder, read the next instruction ("dice an onion"), grab ingredients from the pantry (read data from memory), do the work at the cutting board (execute), put the result in a bowl on the counter (write back), then walk to the binder to read the next step. The chef does one thing at a time. The pantry is far away. Most of the chef's day is spent walking. Most of computer architecture is figuring out how to keep the chef cutting instead of walking.

1.2 Functional units in detail

Every computer, from a Cortex-M0 microcontroller in a smart light bulb to a Xeon in a data center, has the same five conceptual blocks:

  • Input unit. Keyboards, mice, sensors, cameras, network packets, microphones. The world sending data inward.
  • Output unit. Displays, speakers, LEDs, motors, network packets going outward.
  • Memory. A giant array of cells. Modern systems have many tiers: registers (~1 KB total inside the CPU, instant access), caches (KB to tens of MB, a few cycles), main memory (DRAM, gigabytes, hundreds of cycles), storage (SSD/HDD, terabytes, milliseconds), archival (tape, glacial). Each tier is bigger and slower than the one above.
  • Control unit. A finite state machine (Chapter 4) that orchestrates everything. It pulses control signals onto buses, tells the ALU which operation to perform, decides when to read or write memory, sequences instruction fetch.
  • Datapath. The combinational and sequential logic that actually moves bits around: register file, ALU, multiplexers, latches between pipeline stages.

Together, the control unit and datapath form the CPU. The datapath is the kitchen counter, knife block, and stove. The control unit is the chef's brain telling the hands what to do next. Separating control from data is one of the cleanest abstractions in engineering, and we will see it again when we draw the single-cycle datapath.

1.3 Bus structures: the wiring between the units

The functional units sit on shared wires called buses. A bus carries three kinds of information:

  • Address bus. "I want to talk about memory location 0x40000x4000." Width determines how much memory you can address. A 32-bit address bus can name 23242^{32} \approx 4 billion bytes (4 GiB), which is why 32-bit operating systems hit a memory wall. A 64-bit address bus names 264162^{64} \approx 16 exabytes, more than any current machine has physical memory for.
  • Data bus. "And the value is..." Width determines how many bits move per transaction. A 64-bit data bus moves 8 bytes per cycle.
  • Control bus. "Is this a read or a write? Is this an interrupt? Is the bus master switching?" A handful of single-bit signals: RD\overline{RD}, WR\overline{WR}, IRQ\overline{IRQ}, bus-grant lines, and so on.
plaintext
   CPU                                  Memory                I/O Device
    │                                     │                       │
    ├──── address bus (32 or 64 bits) ───┤                       │
    ├──── data bus (32, 64, or 128) ─────┤───────────────────────┤
    ├──── control bus (RD, WR, IRQ...) ──┤───────────────────────┤
    │                                     │                       │
    └──────────────── shared backplane ───┴───────────────────────┘

In a tiny microcontroller these buses are physical wires running between modules on one die. In a big multi-socket server they are layered protocols on top of high-speed differential pairs (PCIe, UPI, CXL).

Bus arbitration. When more than one device wants to drive the bus at the same time, somebody loses. The arbiter decides who. Schemes:

  • Daisy-chain. Bus-grant signal flows from highest-priority device down. Cheap, fixed priority, slow.
  • Centralized parallel. Each device has a bus-request line into a central arbiter, which asserts the matching grant. Fast, requires a chip dedicated to arbitrating.
  • Distributed self-selection. Devices each drive a priority code; the highest priority wins (a bus-resolution circuit). Used in older Multibus systems.
  • Token passing. The right to drive the bus passes around like a token in a ring. Used in some real-time systems where bounded latency matters more than peak throughput.

Modern PCIe-style buses skip the shared-medium model entirely; they are point-to-point packetized links with a switch fabric. But the concept of "two masters wanting the bus, one has to wait" lives on inside the switch.

Master 0 requests, gets the grant (BG0), drives the address bus. When 0 releases, master 1 (which had been waiting) gets the grant.

1.4 Performance: how to count "fast"

There is a small zoo of metrics for CPU speed, and you have to keep them straight to read benchmarks honestly.

  • Clock rate. Hertz. How often the central oscillator pulses. A 3 GHz CPU's clock period is Tclk=1/3×109=333T_{clk} = 1/3 \times 10^9 = 333 picoseconds. Higher clock means more cycles per second, but a cycle is not necessarily an instruction.
  • CPI (cycles per instruction). How many clock cycles, on average, an instruction takes from start to finish. Simple RISC CPUs aim for CPI close to 1. Pipelined superscalar CPUs can have effective CPI below 1 by retiring multiple instructions per cycle.
  • MIPS (millions of instructions per second). MIPS=(clock rate)/(CPI106)\text{MIPS} = (\text{clock rate}) / (\text{CPI} \cdot 10^6). A flawed metric because instructions vary wildly in what they do (a MIPS multiply does the same job as a 50-cycle x86 trig instruction in different clothing), but useful for comparing within an architecture.
  • FLOPS (floating-point operations per second). Same idea but for floating-point math, which dominates scientific computing and machine learning. Modern GPUs hit petaflops. The H100 GPU reaches ~67 teraflops of FP64 with thousands of SIMD lanes.

Deriving the Iron Law

The fundamental relationship between all of these is what Hennessy and Patterson call the Iron Law of Performance. It comes from putting units together carefully.

Suppose a program has II instructions. Each instruction takes on average CPI\text{CPI} cycles. Each cycle has length TclkT_{clk} seconds. Then the total wall-clock time to run the program is

T=ICPITclk.T = I \cdot \text{CPI} \cdot T_{clk}.

Why is this iron? Because the three factors are independent levers, and you can attack any of them:

  • Reduce II: better algorithms (asymptotic complexity), better compilers (loop unrolling, vectorization), better instruction sets (one CISC instruction does what three RISC instructions do).
  • Reduce CPI\text{CPI}: better microarchitecture (pipelining, superscalar, out-of-order, branch prediction).
  • Reduce TclkT_{clk}: faster transistors (smaller process node, better materials, FinFET, GAAFET), shorter pipeline stages.

The catch is that the levers fight each other. Pipelining shortens TclkT_{clk} by chopping logic into smaller stages, but the bookkeeping (latches between stages, hazard logic) adds to II or to CPI\text{CPI}. CISC reduces II by packing more work per instruction, but each instruction has higher CPI\text{CPI} because the hardware has to do more. Modern designs are a careful balance, picked for the workload they target.

Worked example: the Iron Law in practice

A program has 1 billion instructions. CPI = 1.5. Clock = 2 GHz. Then

T=109×1.5×0.5 ns=0.75 s.T = 10^9 \times 1.5 \times 0.5\text{ ns} = 0.75\text{ s}.

If a new compiler reduces the instruction count to 0.7×1090.7 \times 10^9 (30% fewer), T=0.525T = 0.525 s. If a new microarchitecture reduces CPI to 1.0, T=0.5T = 0.5 s. If both, T=0.35T = 0.35 s. Multiplicative.

Amdahl's Law: the limit of optimization

Sometimes you optimize one part of a program. Amdahl noticed that the speedup from speeding up part of a program is bounded by how much of the program that part is.

If a fraction ff of the program is sped up by a factor ss, and the rest stays the same, the overall speedup is

Speedup=1(1f)+f/s.\text{Speedup} = \frac{1}{(1 - f) + f/s}.

In the limit ss \to \infty (perfect speedup of the targeted part), the overall speedup approaches 1/(1f)1/(1 - f). If 90% of the program is parallelizable (f=0.9f = 0.9), maximum speedup is 1/0.1=10×1/0.1 = 10\times, no matter how many cores you throw at it. The remaining 10% serial part dominates.

Restaurant analogy. Hire 100 chefs to chop vegetables for a soup recipe. If the chopping was 70% of the prep time, you cut chopping to nearly zero; the other 30% (boiling broth, simmering, plating) does not parallelize. The total speedup tops out at 1/0.3=3.33×1/0.3 = 3.33\times, no matter how many extra chefs. This is why we cannot just keep adding cores forever. The serial part eats the win.

Amdahl is also why specialized accelerators (GPUs for ML, NPUs for inference, hardware AES) matter so much. They go after the parallel-friendly part and speed it up enormously, leaving the serial part as the new bottleneck, and so on. We will see this fractal pattern repeat.

1.5 A tiny dose of history

GenerationEraTechExamples
1st1940s-50sVacuum tubesENIAC, EDVAC
2nd1950s-60sDiscrete transistorsIBM 7090, PDP-1
3rd1960s-70sSmall/medium-scale ICsIBM System/360, PDP-11
4th1970s-nowMicroprocessorsIntel 4004 (1971) -> modern x86 / ARM / RISC-V

The 4004 in 1971 had 2,300 transistors. Apple's M3 has on the order of 25 billion. Moore's Law, the empirical observation that transistor count per chip doubles roughly every two years, drove this. It is now slowing as we hit physical limits (gate oxide leakage, atomic-scale features). Every modern advance is squeezing more performance from architecture rather than raw transistor count, which is why this chapter keeps mattering.