
9. Putting It All Together: A Day in the Life of an Instruction

Let us walk through a single x86 instruction on a modern CPU. The instruction:

```asm
add eax, ebx
```

(Add the contents of register ebx to eax and store the sum in eax.) On an Intel Skylake-generation core, the instruction goes through roughly these steps:

  1. Fetch. The fetch unit reads 16 bytes from the L1 instruction cache at the address in RIP (the 64-bit program counter). The length pre-decoder then identifies the instruction boundaries within those bytes.
  2. Decode. The decoder converts the x86 instruction add eax, ebx into a single µop: "PRF[reg_eax_alias] = PRF[reg_eax_alias] + PRF[reg_ebx_alias], set flags." It takes the fast path through a simple decoder.
  3. µop cache. The decoded µop is also placed in the µop cache. On future fetches of the same address, the µop cache is checked first; a hit delivers the µop directly and skips decode.
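  4. Rename. The destination, architectural register eax, is mapped to a freshly allocated physical register (out of ~180 integer physical registers on Skylake). The physical register holding the previous value of eax is freed later, when this instruction retires. Renaming removes WAW/WAR hazards, which are false dependencies on a register name rather than on a value.
  5. Dispatch. The µop is sent to the scheduler and occupies an entry in its 97-entry reservation station.
  6. Schedule. Once both source physical registers (those holding the current values of eax and ebx) are ready, the µop is dispatched to one of the 4 integer ALU ports.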
  7. Execute. The ALU adds. ~1 cycle.
  8. Writeback. The result is written to the destination physical register.
  9. Retire. The 224-entry reorder buffer (ROB) holds the µop until all earlier instructions have completed without faults, then commits the result and frees the physical register that held the previous value of eax.
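
The rename and retire steps above can be sketched as a toy model. This is a minimal illustration, not Skylake's actual structures: the `Renamer` class, its method names, and the 8-register physical file are all hypothetical, chosen only to show how two writes to eax end up in different physical registers.

```python
from collections import deque


class Renamer:
    """Toy register renamer (hypothetical; real hardware is far more involved)."""

    def __init__(self, arch_regs, num_phys):
        # Each architectural register starts out mapped to its own physical register.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        # All remaining physical registers sit on the free list.
        self.free = deque(range(len(arch_regs), num_phys))

    def rename(self, dst, srcs):
        """Rename one µop 'dst = op(srcs)'; returns (new_dst, src_phys, old_dst)."""
        src_phys = [self.map[s] for s in srcs]  # read sources BEFORE remapping dst
        old_dst = self.map[dst]                 # previous mapping, freed at retire
        new_dst = self.free.popleft()           # freshly allocated physical register
        self.map[dst] = new_dst
        return new_dst, src_phys, old_dst

    def retire(self, old_dst):
        # Once the µop retires, nothing older can still read the previous mapping.
        self.free.append(old_dst)


r = Renamer(["eax", "ebx", "ecx"], num_phys=8)
first = r.rename("eax", ["eax", "ebx"])   # add eax, ebx
second = r.rename("eax", ["ecx"])         # mov eax, ecx (also writes eax)
print(first)    # (3, [0, 1], 0)
print(second)   # (4, [2], 3)
```

Both instructions write eax, yet they receive physical registers 3 and 4, so the second write does not have to wait for the first. The previous mapping is returned to the free list only when `retire` is called, mirroring step 9.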

End-to-end latency through the pipeline: ~14-20 cycles. Throughput: 1 add per cycle for a chain of dependent adds, and up to 4 adds per cycle when they are independent and can be dispatched to the ALU ports in parallel. Energy for that one operation: perhaps 1-10 picojoules.
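
The gap between the two throughput figures comes from data dependencies, and a toy scheduler makes that concrete. The sketch below is an assumption-laden model (the function name `cycles_to_execute`, oldest-first selection, and a flat 1-cycle latency are illustrative choices); it models only the execute stage, with 4 ALU ports.

```python
def cycles_to_execute(deps, num_ports=4, latency=1):
    """deps[i] lists the earlier µops that µop i reads from.

    Returns the cycle in which the last result becomes available.
    """
    done = {}                        # µop index -> cycle its result is available
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        # A µop is ready once every source result is available this cycle.
        ready = [i for i in pending
                 if all(d in done and done[d] <= cycle for d in deps[i])]
        for i in ready[:num_ports]:  # oldest first, at most one µop per port
            done[i] = cycle + latency
            pending.remove(i)
        cycle += 1
    return max(done.values(), default=0)


chain = [[]] + [[i] for i in range(7)]   # 8 adds, each reading the previous result
independent = [[] for _ in range(8)]     # 8 adds with no shared registers

print(cycles_to_execute(chain))          # 8
print(cycles_to_execute(independent))    # 2
```

Eight dependent adds take 8 cycles (1 per cycle), while eight independent adds finish in 2 cycles (4 per cycle), matching the two throughput numbers.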

That happens up to roughly 10 billion times per second on every core, alongside a couple of hundred other µops in flight at any moment (the 224-entry reorder buffer sets the upper limit). It is a miracle of modern engineering.
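
As a back-of-the-envelope check on these figures, the arithmetic can be written out. The 4 GHz clock and the sustained 2.5 instructions per cycle are assumptions picked only because they multiply to 10 billion; the 1-10 picojoule range is the estimate given above.

```python
clock_hz = 4.0e9   # assumed core clock (illustrative, not a measured value)
ipc = 2.5          # assumed sustained instructions per cycle (illustrative)

instr_per_sec = clock_hz * ipc
print(f"{instr_per_sec:.1e} instructions per second")   # 1.0e+10 instructions per second

# Power spent on the operations themselves at 1 pJ and at 10 pJ each.
for picojoules in (1, 10):
    watts = instr_per_sec * picojoules * 1e-12
    print(f"{picojoules} pJ per op -> {watts:.2f} W")
```

Under these assumptions the operations themselves account for about 0.01 W to 0.10 W per core.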

The same operation on ARM (AArch64, shown here with the 64-bit registers x0 and x1):

```asm
ADD x0, x0, x1
```

Different ISA, same idea. The ARM Cortex-X4 is similarly out-of-order (OoO), similarly superscalar (8-wide or wider, depending on where in the pipeline you count), and similarly cached. The encoding is fixed-width at 32 bits, so decoding is simpler, and in most cases each instruction maps 1:1 to a µop. Apple's M3 takes this further with very wide issue and a huge ROB.
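
Why a fixed-width encoding simplifies decoding can be shown by comparing how instruction boundaries are found. The sketch below is a toy: `toy_x86_length` is a hypothetical helper that knows only three opcodes (nop, add r/m32, r32 in register-direct form, and mov r32, imm32), nowhere near a real x86 length decoder. The byte values are the standard encodings: `01 D8` for add eax, ebx and the little-endian bytes of `0x8B010000` for ADD x0, x0, x1.

```python
def fixed_width_boundaries(code, width=4):
    # Every instruction start is known up front: 0, 4, 8, ...
    return list(range(0, len(code), width))


def toy_x86_length(code, pos):
    """Length of the instruction at pos, for a tiny hypothetical opcode subset."""
    opcode = code[pos]
    if opcode == 0x90:                  # nop
        return 1
    if opcode == 0x01:                  # add r/m32, r32
        modrm = code[pos + 1]
        if modrm >> 6 != 0b11:
            raise ValueError("toy decoder handles register-direct ModRM only")
        return 2
    if 0xB8 <= opcode <= 0xBF:          # mov r32, imm32
        return 5
    raise ValueError(f"opcode {opcode:#x} not in toy table")


def variable_width_boundaries(code):
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += toy_x86_length(code, pos)  # next start depends on this instruction
    return starts


# nop ; add eax, ebx ; mov eax, 42 ; add eax, ebx
x86 = bytes([0x90, 0x01, 0xD8, 0xB8, 0x2A, 0x00, 0x00, 0x00, 0x01, 0xD8])
# Three copies of ADD x0, x0, x1
arm = bytes.fromhex("0000018b" * 3)

print(variable_width_boundaries(x86))   # [0, 1, 3, 8]
print(fixed_width_boundaries(arm))      # [0, 4, 8]
```

In the variable-width case each start depends on the length of the previous instruction, which is the work the x86 length pre-decoder does in hardware. In the fixed-width case all starts are known immediately, so several decoders can begin in parallel without that step.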