>
section 2 of 138 min read

2. Instructions, Memory, and Programs

2.1 Memory locations and addresses

Memory is a giant addressable array. Each address points to one cell. The "cell" is almost always one byte (8 bits) on modern systems, but the CPU usually reads several bytes at a time.

Word. The natural register and bus width of the CPU. 32-bit machines have 32-bit (4-byte) words. 64-bit machines (most modern systems) have 64-bit (8-byte) words. A "word load" reads 4 or 8 bytes at once.

Byte. Always 8 bits. The standardization came from IBM System/360 in 1964; before that, machine word sizes were all over the map (6-bit, 12-bit, 36-bit, 48-bit). 8 bits is the universal building block now.

Alignment. Most CPUs require multi-byte loads to be aligned. A 4-byte load from address AA requires AA to be a multiple of 4. Misaligned access is either slow (the hardware does two reads and stitches) or a fault (the OS handles it, perhaps killing the process). ARM Cortex-M0 traps on misalignment; x86 just slows down.

2.2 Endianness: the byte order question

For multi-byte values like 32-bit integers, which byte goes at the lower address?

  • Big-endian. Most-significant byte at lowest address. The number 0x123456780x12345678 stored at address AA has byte 0x120x12 at AA, 0x340x34 at A+1A+1, 0x560x56 at A+2A+2, 0x780x78 at A+3A+3. "Read it like a book." Used by IBM mainframes, the original Motorola 68000, all network protocols (TCP/IP headers are big-endian, hence "network byte order").
  • Little-endian. Least-significant byte at lowest address. The same number stores as 0x78,0x56,0x34,0x120x78, 0x56, 0x34, 0x12. Used by every Intel x86, ARM by default, RISC-V.
plaintext
   Big-endian storage of 0x12345678:   0x12 0x34 0x56 0x78
   Little-endian storage:              0x78 0x56 0x34 0x12
   address →                            +0   +1   +2   +3

The two camps cannot both be right; the "correct" choice depends on your aesthetic. Little-endian wins on arithmetic (LSB at the bottom feels right when you do paper-and-pencil addition), big-endian wins on string-display (cast a multi-byte int to a byte array and the order matches a human-readable hex dump).

C programmers convert with htonl() (host to network long), ntohs() (network to host short), and friends. The kernel calls them; they boil down to a byte-swap (bswap_32 instruction on x86) on little-endian machines and a no-op on big-endian.

c
#include <arpa/inet.h>
 
uint32_t local = 0x12345678;
uint32_t wire  = htonl(local);   // becomes 0x78563412 on x86,
                                 // unchanged on PowerPC big-endian.

Mixing endianness is a perennial bug source. A device sends a sensor reading over UART big-endian; the host reads it as little-endian and gets nonsense. Network packets going through a NAT need byte swapping at every hop. The lesson: every time bytes cross between two machines or between memory and a peripheral register, ask "what endianness is this?"

2.3 Memory operations: load and store

There are exactly two ways to interact with memory:

  • Load. Read from memory address into a register. LDR R1, [R2] in ARM. mov eax, [esi] in x86.
  • Store. Write from a register to memory. STR R1, [R2] in ARM. mov [edi], eax in x86.

CISC machines (x86) blur this by allowing memory operands in arithmetic instructions: add eax, [esi] reads from memory, adds to eax, writes back to eax. Internally, the CPU still fetches from cache, runs the ALU, and writes back; it just labels the whole thing one instruction. RISC machines (ARM, RISC-V, MIPS) keep load/store strictly separate from arithmetic. The simpler load-store pattern is easier to pipeline.

2.4 Instruction sequencing and the program counter

How does the CPU know which instruction to execute next? There is a special register called the Program Counter (PC) that holds the address of the next instruction. The fetch-execute loop:

  1. Read instruction from memory at the address in PC.
  2. PC = PC + (size of instruction). On a fixed-32-bit RISC, that is PC = PC + 4.
  3. Decode the instruction.
  4. Execute it.
  5. Write back results.
  6. Goto 1.

That is the whole show. Branches and jumps modify PC directly instead of letting it auto-increment. A taken branch sets PC to the target address; the next fetch comes from there.

Restaurant analogy. The chef's recipe binder has a small magnetic clip that marks "current page." Each time the chef finishes a step, the clip slides down one line. A "jump to step 14" instruction picks up the clip and slams it on line 14 instead. A conditional branch is "if the bowl is empty, jump to step 7; else continue."

2.5 Addressing modes: how to specify operands

An instruction like add needs to know where its operands are. The addressing mode of each operand specifies that. Modes you will meet:

  • Immediate. The operand value is encoded in the instruction itself. ADDI R1, R2, #5: add the literal 5 to R2, store in R1. Used for small constants, loop bounds, offsets.
  • Register. The operand is in a register. ADD R1, R2, R3: R1 = R2 + R3. Fastest.
  • Direct (absolute). The operand is at a fixed memory address embedded in the instruction. LDR R1, [#0x4000]. Rare in modern RISC because addresses are too long to embed.
  • Register indirect. The operand is in memory, at the address held in a register. LDR R1, [R2]: load from memory at the address in R2.
  • Indexed (base + offset). Address = base register + constant offset. LDR R1, [R2, #4]. The bread and butter of accessing struct fields and array elements.
  • Base + index, scaled. Address = base + (index_register ×\times scale). mov eax, [esi + edi*4] in x86. Beautiful for arrays of 4-byte ints.
  • PC-relative. Address = PC + offset. Used for branches (so code can be relocated without rewriting addresses) and for accessing nearby read-only constants. Critical for position-independent code (used in shared libraries and in modern ASLR-mitigated executables).
  • Auto-increment / auto-decrement. Use the address, then update the register. LDR R1, [R2], #4 in ARM: load from R2, then R2 += 4. Perfect for walking arrays. The auto-decrement version powers stacks ("push" decrements SP then writes; "pop" reads then increments).

Each mode is a different way to get from "what the programmer wants to talk about" to "physical memory address," and the cost of each mode (in cycles, in encoding bits) is different. RISC ISAs typically expose only a few modes. CISC ISAs expose many; x86 has truly baroque modes that the CPU's address-generation unit handles in dedicated hardware.

2.6 Stacks and queues

Two data structures are special enough that they show up as hardware-supported abstractions.

Stack. Last-in-first-out. The hardware has a stack pointer (SP) register, and instructions push and pop (or auto-increment/decrement loads and stores) operate on it. Used for:

  • Function call return addresses. call pushes the address of the next instruction; ret pops it back into PC.
  • Local variables. The compiler allocates a stack frame on entry, deallocates on return.
  • Saved registers during interrupts and context switches.

The stack typically grows downward (toward lower addresses) on x86, ARM, RISC-V. Why downward? It started as a convention from PDP-11 days and became universal: heap grows up, stack grows down, they meet (and crash) in the middle if either runs away.

plaintext
high addresses    ┌────────────────┐
                  │  command-line  │
                  │  args, env     │
                  ├────────────────┤
                  │  stack         │
                  │     │          │
                  │     ▼ grows    │
                  │     │          │
                  │     ▼          │
                  ├────────────────┤
                  │  (free)        │
                  ├────────────────┤
                  │     ▲          │
                  │     │          │
                  │     │ grows    │
                  │  heap          │
                  ├────────────────┤
                  │ static data    │
                  │ (.data, .bss)  │
                  ├────────────────┤
                  │ code (.text)   │
low addresses     └────────────────┘

Queue. First-in-first-out. Less directly hardware-supported but ubiquitous in firmware: ring buffers between an interrupt handler and a main-loop consumer, packet queues in network stacks, command queues for GPU rendering.

Hardware-security tie-in. The stack is also where many memory-safety attacks live. A buffer overflow on a stack-allocated array can clobber the saved return address; control flow then jumps wherever the attacker wrote. Mitigations include stack canaries (a magic value placed before the saved return address; check it before returning), W^X (no page is both writable and executable), and ASLR (randomize where stack and heap live so attackers cannot guess addresses). Chapter 24 dives deep.

2.7 ALU operations: logic, shift, rotate, arithmetic

The Arithmetic Logic Unit handles the actual computation. Its menu:

  • Logic: AND, OR, XOR, NOT, NAND, NOR. Bit-by-bit on operands.
  • Shifts: logical left (LSL), logical right (LSR, fills with 0), arithmetic right (ASR, sign-extends), barrel shifts that move by an arbitrary amount in one cycle.
  • Rotates: ROL, ROR. Bits that shift off one end come back in the other end.
  • Arithmetic: ADD, SUB, possibly MUL, DIV (some CPUs do these in dedicated units).
  • Compare: SUB but only set flags, do not write the result.

The ALU is a giant combinational circuit: it takes two operand inputs and an op-select, produces one output and some status flags (zero, negative, carry, overflow). We will draw it inside the datapath in section 5.

Why shifts are fast and divides are slow. A left shift by kk multiplies by 2k2^k; a right shift divides by 2k2^k. Both are essentially "rewire which bit goes where," which is one cycle of combinational logic. A general multiplier needs an array of adders; modern CPUs do it in 3-5 cycles. A general divider needs a bit-by-bit subtraction loop; even with tricks, divide takes 20-30 cycles. This is why compilers replace x / 2 with x >> 1 whenever the divisor is a power of two and the type is unsigned.