7. Memory Hierarchy // Computer Architecture and Organization // bhaswanth

All the pipelining in the world cannot help if memory takes 200 cycles to respond. The solution: caches, layered from fast-and-tiny to slow-and-huge.

7.1 The pyramid

plaintext

       ▲ fastest, smallest, costliest per bit
       │
       │   Registers (KB inside CPU,    ~0 cycles)
       │   L1 cache    (32-64 KB,        2-4 cycles)
       │   L2 cache    (256 KB - 2 MB,   10-15 cycles)
       │   L3 cache    (10s of MB,       30-60 cycles)
       │   Main memory DRAM (GBs,        150-300 cycles)
       │   NVMe SSD    (TB,              ~30,000 cycles)
       │   Spinning HDD(TB,              ~10⁷ cycles)
       │   Tape archive (PB+,            seconds-to-minutes)
       │
       ▼ slowest, biggest, cheapest per bit

The genius is locality of reference: programs do not access memory uniformly. They access the same locations repeatedly (temporal locality) and access nearby locations together (spatial locality). The hierarchy exploits both.

7.2 Why locality works

Temporal locality. A loop accesses the same variable each iteration. A function uses the same parameters repeatedly. Same instruction is re-executed each iteration. So once a value is in cache, it stays useful for a while.
Spatial locality. Arrays are accessed sequentially. Struct fields sit next to each other. Code is fetched in sequence (most of the time). So fetching a block of nearby bytes (a "cache line", typically 64 bytes) is nearly free per byte and likely to be useful.

7.3 Cache structure: tag, index, offset

A cache is an array of cache lines. Each line is typically 64 bytes. To check if a memory address $A$ is in the cache:

Split $A$ $A$ into three fields:
- Offset: which byte within the line (last 6 bits for 64-byte lines).
- Index: which set within the cache (next $\log_2(\text{sets})$ bits).
- Tag: the high-order bits identifying the line.
Use the index to look up a set in the cache.
Compare the tag of each way in the set against the address's tag.
If matched, hit; return the byte at the offset.
If not, miss; go to next level.

plaintext

    Address bits (32-bit example, 64-byte lines, 256 sets, 4-way):
 
      ┌─────────── Tag (18 bits) ──────────┬─ Index (8) ─┬─ Offset (6) ─┐
      │                                     │             │              │
      ▼                                     ▼             ▼              ▼
 
   Cache:
                Set 0:  [tag][data 64 B] [tag][data] [tag][data] [tag][data]
                Set 1:  [tag][data]      [tag][data] [tag][data] [tag][data]
                ...
                Set 255: ...
 
   Lookup: index → set, then compare tag with each of the 4 ways in parallel.

7.4 Mapping policies

How does an address pick its set?

Direct-mapped. Each address maps to exactly one set (one slot). Simple. Conflict misses when two hot addresses map to the same slot.
Fully associative. Any address can go in any slot. Best hit rate. Very expensive: every slot needs a tag-comparator.
N-way set-associative. Compromise: each address maps to one of $N$ slots in one set. Comparators in parallel inside the set. Modern CPUs typically use 4-way to 16-way set-associative for L1, 8-way to 16-way for L2, 12-16-way for L3.

7.5 Replacement policies

When a cache miss arrives and the target set is full, something has to be evicted. Choices:

LRU (Least Recently Used). Evict the slot whose last access was longest ago. Best for typical workloads. Hardware tracks "age" bits per slot.
FIFO. Evict the slot that arrived earliest. Simpler.
Random. Pick a slot at random. Surprisingly competitive and the simplest hardware. ARM Cortex-A often uses random or pseudo-LRU.
Pseudo-LRU. Tree of "older than" bits, approximating LRU with $\log_2 N$ bits per set instead of $N \log_2 N$ . Used in practice.

7.6 Write policies

When the CPU writes a value:

Write-through. Update both the cache and the next level (or memory) on every write. Simple, slow writes.
Write-back. Update only the cache; mark the line dirty. Write to memory only when the line is evicted. Faster, but requires tracking dirty bits and is more complex on multi-core.

Plus a policy for what happens on a write miss:

Write-allocate. Bring the line into cache on a miss, then write. Pairs with write-back.
No-write-allocate. Skip cache on a write miss; write straight through. Pairs with write-through.

Modern caches: L1D and L2 are write-back, write-allocate. L3 is shared and write-back.

7.7 Average memory access time

Why a hierarchy? Because the average memory access time is much lower than a slow main-memory access if the hit rate is high.

Define:

$h$ = hit rate at this level.
$T_{hit}$ = access time on hit.
$T_{miss}$ = access time on miss (which includes going to next level).

Then

$\text{AMAT} = h \cdot T_{hit} + (1 - h) \cdot T_{miss}.$

For a real chip with three cache levels:

$\text{AMAT} = T_{L1} + (1 - h_{L1}) \cdot \big[T_{L2} + (1 - h_{L2}) \cdot \big[T_{L3} + (1 - h_{L3}) \cdot T_{DRAM}\big]\big].$

If $h_{L1} = 0.97$ , $h_{L2}|miss_{L1} = 0.5$ , $h_{L3}|miss_{L2} = 0.5$ , $T_{L1} = 4$ , $T_{L2} = 12$ , $T_{L3} = 40$ , $T_{DRAM} = 200$ :

$\text{AMAT} = 4 + 0.03 \cdot [12 + 0.5 \cdot [40 + 0.5 \cdot 200]] = 4 + 0.03 \cdot 82 = 6.46 \text{ cycles}.$

Without caches, every access would be 200 cycles. With caches, average is ~6.5 cycles. 30x speedup just from the hierarchy. This is why caches are not optional.

7.8 Multi-level cache structure

Modern Intel/AMD/Apple CPUs have:

L1I and L1D. Per-core, split. Fast (4 cycles), small (32-64 KB each), Harvard-style.
L2. Per-core, unified (instructions + data). Medium (12-15 cycles), 256 KB - 2 MB.
L3 (LLC, last-level cache). Shared across all cores. Larger (60 cycles), tens of MB.
Sometimes L4 as on-package cache (Intel Crystal Well had 128 MB eDRAM).

Why per-core L1/L2 and shared L3? Per-core L1 minimizes contention; shared L3 allows efficient sharing of common data across cores without round-trips to main memory.

7.9 Cache coherency

Multi-core systems share L3 and main memory but each core has its own L1/L2. If core 0 writes a value to address $A$ and the value sits in core 0's L1, what happens when core 1 reads $A$ ? Cache coherency ensures all cores see a consistent value.

The classical protocol is MESI, named for the four states each cache line can be in:

M (Modified). This core has the only valid copy, and it has been modified relative to memory (dirty).
E (Exclusive). This core has the only valid copy, unmodified (clean).
S (Shared). Multiple cores have valid clean copies.
I (Invalid). This core's copy is not valid.

Transitions are driven by snooping on the bus or by directory-based messages between cores:

rendering diagram...

When core 0 writes to a line that core 1 also has in S state, core 1's line is invalidated (I), core 0's line becomes M. When core 1 next reads that address, it misses, and the cache controller routes the read to core 0's modified line (or to memory after a writeback).

Variants:

MOESI adds an "Owned" state, useful for sharing dirty data without writeback.
MESIF adds "Forward," used in Intel chips to designate one S-line as the responder.
Directory-based for many-core systems where snooping does not scale.

Hardware-security tie-in. Cache coherency is a side channel. An attacker on core 1 can detect the latency difference between accessing a line in S vs M state (because M requires a coherency transaction). Flush+Reload and Prime+Probe attacks exploit this. They time their own cache accesses to detect what the victim accessed. Used to break AES, RSA, and many crypto implementations. Mitigations include hardware features (Intel CAT for cache partitioning) and software (constant-time crypto).

7.10 Virtual memory

A program does not see physical addresses. It sees virtual addresses translated by the Memory Management Unit (MMU) to physical addresses.

Why?

Process isolation. Each process has its own address space. Process A's address $0x1000$ and process B's address $0x1000$ map to different physical bytes. Crashes do not propagate; secrets do not leak.
Larger address space than physical RAM. Pages can live on disk and be paged in on demand.
Fragmentation control. Even if physical memory is fragmented, the virtual view is contiguous.
Memory-mapped files. mmap() makes a file look like memory. The OS pages it in lazily.
Security: ASLR. Randomize where libraries and stacks land in virtual space. Attackers don't know absolute addresses.

Paging mechanics

Memory is divided into pages (typically 4 KB) on both sides of the translation. The lowest 12 bits of the address (the page offset) are unchanged; the upper bits (the page number) are translated.

plaintext

   Virtual address (48-bit)
     ┌──────── VPN (36 bits) ────────┬── Offset (12 bits) ──┐
                       │                            │
                       │  Page table walk           │
                       ▼                            │
                   PPN (36 bits)                    │
                       │                            │
                       └─────────┬──────────────────┘
                                 ▼
   Physical address (48-bit)
     ┌──────── PPN ──────────┬── Offset (12 bits) ──┐

The translation is held in a page table, a tree-structured data structure in memory. x86-64 uses a 4-level page table (and the latest with 5-level for 57-bit virtual addresses); ARM has multiple levels too. To translate, walk the tree: top-level table gives address of next-level table, and so on, until the leaf gives the physical page number.

A single page-table walk is 4 memory accesses. That would be unbearably slow on every load. Hence:

TLB: the translation cache

The Translation Lookaside Buffer (TLB) is a small (16-512 entry) fully-associative cache of recent translations. Hit in the TLB: translation is essentially free. Miss: do the walk, install the result.

Modern CPUs have separate iTLBs and dTLBs, plus larger second-level TLBs.

Hardware-security tie-in. The TLB is a side channel. Spy on TLB hit/miss latency to detect victim memory accesses. Meltdown specifically exploits the fact that x86 chips speculatively perform loads even from kernel addresses, then check protection at retirement; the speculative result reaches cache (and TLB). The attacker uses cache timing to recover the speculatively loaded byte. Mitigation: KPTI (Kernel Page Table Isolation), which gives kernel and user different page tables so the kernel addresses are not mapped into the user's view.

7.11 Page faults and demand paging

If the page table entry says "not present" (the page lives on disk, or has not been allocated), the MMU raises a page fault. The OS catches the fault, fetches the page from disk (or zeroes it if it is a new allocation), updates the page table, and resumes the offending instruction.

This is the mechanism behind:

Swap. Use disk as overflow when RAM is full. Slow when active.
Copy-on-write. Two processes can share a physical page; on write, the OS makes a private copy. fork() in Unix uses this.
Mmap. Map a file into memory; pages fault in on first access.

7.12 Segmentation: a historical detour

Before paging, the dominant scheme was segmentation: dividing memory into variable-size segments (code, data, stack), each with its own base and limit. x86 retained segmentation through the 8086, 80286, and 80386. In 64-bit mode, segmentation is mostly turned off (segment bases are zero), but the FS and GS segment registers are still used for thread-local storage and percpu data in Linux.

Pure segmentation suffered from external fragmentation (free space in pieces too small to fit a segment). Paging won. Hybrid schemes (Multics, x86 32-bit) used segmentation atop paging, but everybody ended up with paging-only architectures.