Single-core performance growth slowed in the 2000s. The way out: more cores.
8.1 SMP: symmetric multiprocessing
Multiple identical CPU cores share one main memory and one OS. Each core runs its own thread, with scheduling handled by the OS. From software's view, threads run in parallel; from hardware's view, the cores share the L3 cache and the memory controllers.
Problems:
- Memory contention. When multiple cores hammer the same DRAM controller, their requests serialize.
- Cache coherency overhead. As discussed, MESI traffic can grow roughly quadratically with core count when invalidations are broadcast to every core.
- False sharing. Two threads update different fields that happen to sit in the same cache line, so each write invalidates the other core's copy. A painful performance bug, because nothing in the source code looks shared.
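The false-sharing case above comes down to struct layout, which can be inspected directly. The sketch below uses Python's ctypes to mirror the layout a C compiler would typically produce. It assumes 64-byte cache lines and a struct whose base address is cache-line aligned; the names Packed, Padded, and shares_line are illustrative only.

```python
import ctypes

CACHE_LINE = 64  # bytes; an assumption, typical of x86-64 and many ARM cores

class Packed(ctypes.Structure):
    # Two per-thread counters laid out back to back.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    # Same counters, with padding so that b starts on the next cache line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]

def cache_line_of(struct_type, field):
    """Index of the cache line a field starts in, relative to the struct base."""
    return getattr(struct_type, field).offset // CACHE_LINE

def shares_line(struct_type, f1, f2):
    """True if both fields start in the same cache line (false-sharing risk)."""
    return cache_line_of(struct_type, f1) == cache_line_of(struct_type, f2)

print("Packed shares a line:", shares_line(Packed, "a", "b"))
print("Padded shares a line:", shares_line(Padded, "a", "b"))
```

In the packed layout both counters land in one line, so two threads writing a and b would keep invalidating each other. The padded layout spends 56 bytes to keep them apart, which is the usual trade of memory for independence.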
SMP scales cleanly to roughly 8-16 cores. Beyond that, NUMA enters.
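The quadratic growth in coherence traffic listed among the problems above can be seen in a back-of-envelope model. The sketch assumes a snooping protocol in which every miss is broadcast to all other cores, and an arbitrary rate of 1,000 misses per core per interval. Real designs use snoop filters or directories to do better, so this is a pessimistic illustration, not a measurement.

```python
def snoop_messages(cores, misses_per_core):
    """Messages per interval if every miss is broadcast to all other cores."""
    return cores * misses_per_core * (cores - 1)

for n in (2, 4, 8, 16, 32):
    print(n, "cores:", snoop_messages(n, misses_per_core=1000), "messages")
```

Under this model, going from 16 to 32 cores raises the message count from 240,000 to 992,000, roughly a fourfold increase for a doubling of cores.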
8.2 NUMA: non-uniform memory access
Each socket (or other group of cores) has its own local memory; access to memory attached to other sockets is slower. A NUMA-aware OS places threads on cores that have their data nearby.
    Socket 0                      Socket 1
┌───────────────┐             ┌───────────────┐
│  Cores 0-15   │◀──QPI/UPI──▶│  Cores 16-31  │
│  L3 cache     │ Inter-socket│  L3 cache     │
│  Mem ctrl     │    link     │  Mem ctrl     │
└──────┬────────┘             └──────┬────────┘
       │                             │
    DRAM 0                        DRAM 1
    (local)                       (local)
Local DRAM access: 100 ns. Remote DRAM access: 200-300 ns. The penalty is real and heavily impacts database, web, and HPC workloads.
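To see how the remote-access penalty adds up, the average latency can be worked out as a weighted sum of the local and remote figures. The sketch below uses the 100 ns local figure from the text and assumes 250 ns for remote access (the midpoint of the quoted 200-300 ns range). The function name and the chosen fractions are illustrative.

```python
def average_latency_ns(local_fraction, local_ns=100.0, remote_ns=250.0):
    """Expected DRAM latency when a given fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

for f in (1.0, 0.9, 0.5, 0.0):
    print(f"{f:.0%} local -> {average_latency_ns(f):.0f} ns")
```

With half of the accesses going remote, the average rises from 100 ns to 175 ns under these assumptions, which is why NUMA-aware placement matters for memory-bound workloads.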
8.3 Message passing
Some architectures abandon shared memory entirely. Each "node" has its own private memory; communication is through explicit message-passing primitives (MPI, channels, network packets). Used by:
- Clusters and supercomputers. Thousands of nodes, only network between them.
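- Heterogeneous SoCs. Chips that combine different kinds of cores (for example ARM big.LITTLE clusters alongside DSPs or microcontroller cores) sometimes use message passing between those cores.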
- Distributed actor models (Erlang, Akka).
The trade-off: a cleaner programming model (no shared-state races) in exchange for worse worst-case latency.
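The message-passing style can be sketched in a few lines. In the example below, Python threads stand in for nodes and queue.Queue stands in for the network or channel; both are assumptions made to keep the example self-contained, since real nodes would not share an address space at all. The names worker and sum_via_messages are hypothetical.

```python
import queue
import threading

def worker(inbox, outbox):
    """A 'node': owns a private running total, reachable only via messages."""
    total = 0  # private state; no other thread reads or writes it
    while True:
        msg = inbox.get()          # explicit receive
        if msg is None:            # sentinel: report the result and shut down
            outbox.put(total)
            return
        total += msg

def sum_via_messages(values, n_workers=4):
    """Distribute values round-robin to workers and combine their replies."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(ib, outbox)) for ib in inboxes]
    for t in threads:
        t.start()
    for i, v in enumerate(values):
        inboxes[i % n_workers].put(v)   # explicit send
    for ib in inboxes:
        ib.put(None)
    for t in threads:
        t.join()
    return sum(outbox.get() for _ in range(n_workers))

print(sum_via_messages(range(1, 101)))
```

No lock appears anywhere in the worker, because nothing is shared. The cost is that every value crosses a queue, which is the latency side of the trade-off.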
8.4 GPUs: massively parallel SIMD
A different beast altogether. CPUs are 4-8 wide superscalar designs with deep pipelines, complex caches, branch predictors, and out-of-order (OoO) execution. GPUs are thousands of simple in-order ALUs, grouped so that each group runs the same instruction in lockstep on different data lanes. This is Single Instruction, Multiple Data (SIMD) on steroids; NVIDIA calls it SIMT (Single Instruction, Multiple Thread, where each "thread" is a lane).
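Lockstep execution can be modeled as one instruction stream applied to every lane before the next instruction is issued. The sketch below is a toy model, not how a GPU is programmed: the two-instruction program and the run_warp helper are hypothetical, and the 32-lane width is taken from NVIDIA's warp size.

```python
import operator

WARP_SIZE = 32  # lanes per warp, as on NVIDIA GPUs

def run_warp(program, lane_data):
    """Apply each instruction to every lane before moving to the next one."""
    regs = list(lane_data)                       # one register per lane
    for op, operand in program:                  # a single shared instruction stream
        regs = [op(r, operand) for r in regs]    # all lanes execute the same op
    return regs

# Hypothetical kernel: multiply by 2, then add 1, on lane-specific inputs.
program = [(operator.mul, 2), (operator.add, 1)]
result = run_warp(program, range(WARP_SIZE))
print(result[:4])
```

Each lane holds different data, but there is only one program counter for the whole group, which is what keeps the per-lane hardware so simple.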
A modern NVIDIA H100 has 16,896 CUDA cores grouped into Streaming Multiprocessors (SMs). Each SM schedules "warps" of 32 threads, and the threads of a warp execute in lockstep. GPU memory is GDDR6X on consumer cards or HBM on datacenter parts such as the H100, with massive bandwidth (>2 TB/s on the H100) but high latency. GPUs trade latency for throughput by oversubscribing: thousands of resident warps hide the latency of any one memory access.
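Why thousands of warps are needed follows from Little's law: the data in flight must equal bandwidth times latency. The sketch below uses the 2 TB/s figure from the text; the 500 ns memory latency and the access pattern (each of 32 threads loading 4 bytes) are assumptions chosen for illustration, not measured values.

```python
def loads_in_flight(bandwidth_bytes_per_s, latency_ns, bytes_per_load):
    """Little's law: concurrency = throughput x latency, rounded up to whole loads."""
    # Integer math: bytes that must be outstanding to keep the memory system busy.
    bytes_in_flight = bandwidth_bytes_per_s * latency_ns // 10**9
    return -(-bytes_in_flight // bytes_per_load)  # ceiling division

WARP_LOAD_BYTES = 32 * 4  # 32 threads each loading a 4-byte value (assumed)
needed = loads_in_flight(2 * 10**12, latency_ns=500, bytes_per_load=WARP_LOAD_BYTES)
print(needed, "warp-wide loads outstanding")
```

Under these assumptions about 1 MB must be in flight at any moment, or roughly 7,800 warp-wide loads. A machine with only a handful of threads could never generate that much outstanding work, which is the point of oversubscription.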
Use cases:
- Graphics (the original use).
- Scientific computing (matrix-heavy: weather, fluid dynamics, molecular dynamics).
- Machine learning (CNN/Transformer training and inference). The biggest current driver; H100s sell out.
- Crypto mining (for Bitcoin only briefly, until ASICs took over).
- Password cracking (hashcat).
Hardware-security tie-in. GPUs were once thought immune to side-channel attacks, but they are not. Memory layout, warp scheduling, and the shared L2 cache all leak information. Researchers have described side channels that target ML workloads on GPUs shared between tenants, as in cloud environments. So far this is a thread of research rather than a major real-world incident, but it is worth tracking.
8.5 Heterogeneous systems
Modern SoCs combine CPU cores of different sizes, GPUs, DSPs, neural-network accelerators, video codecs, and security coprocessors all on one die. Apple's M-series chips include:
- Performance cores (deeply pipelined, OoO, big caches).
- Efficiency cores (narrower, smaller, low power; still out-of-order on Apple's designs, whereas many ARM LITTLE cores are in-order).
- GPU.
- Neural Engine (matrix multiplies for ML).
- Secure Enclave (separate ARM core for crypto).
- Image signal processor for cameras.
- Multiple DSPs.
The OS scheduler decides which tasks run where. On a big.LITTLE phone, the lock-screen UI runs on the efficiency cores; the game engine wakes the performance cores on demand.
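The placement decision can be illustrated with a toy policy. The sketch below assumes each task carries a single estimated load between 0 and 1 and uses a fixed threshold; the Task type, the place function, and the load numbers are all hypothetical. Production schedulers weigh far more inputs (energy models, thermal state, latency hints), so this only shows the shape of the decision.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    load: float  # estimated CPU demand: 0.0 (idle) to 1.0 (saturates a core)

def place(task, threshold=0.5):
    """Pick a cluster: light tasks stay on efficiency cores, heavy ones move up."""
    return "performance" if task.load > threshold else "efficiency"

# Illustrative loads only; not measurements of real workloads.
tasks = [Task("lock-screen UI", 0.05), Task("game engine", 0.9)]
placement = {t.name: place(t) for t in tasks}
print(placement)
```

The lock-screen UI lands on the efficiency cluster and the game engine on the performance cluster, matching the behavior described above.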