
4. Input/Output Organization

4.1 Memory-mapped vs port-mapped I/O

How does the CPU talk to a UART chip or a network card? Two approaches:

Memory-mapped I/O. The device's control and data registers appear at specific memory addresses. To read the UART receive register, do a load from address 0x4000_4000 (or wherever the SoC vendor put it). The same LDR/STR instructions work for memory and for devices. Used by ARM, RISC-V, and modern x86 (PCIe configuration space, MMIO BARs). The dominant approach today.

Port-mapped I/O. A separate address space for I/O, accessed by special instructions (IN and OUT on x86). The I/O and memory address spaces do not overlap. Legacy on x86 (parallel port at 0x378, keyboard controller at 0x60/0x64, PIC at 0x20/0xA0). Modern x86 still supports it for backward compatibility, but new devices use MMIO.

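Memory-mapped I/O wins because it needs no special instructions: ordinary loads and stores reach devices. The MMU's protection and memory attributes also apply to device addresses. Device regions are marked non-cacheable so that every read and write actually reaches the device instead of being served from, or held in, the cache.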
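Here is a minimal sketch of what memory-mapped access looks like from C. The register layout `uart_regs_t`, the `UART_RX_READY` bit, and the helper `uart_try_read` are all hypothetical and not taken from any particular SoC. On real hardware the pointer would be cast from the vendor's documented base address; in this sketch it can point at an ordinary struct, so the code also runs on a host machine.

```c
#include <stdint.h>

/* Hypothetical register layout; real devices document their own. */
typedef struct {
    volatile uint32_t data;    /* received byte in the low 8 bits */
    volatile uint32_t status;  /* bit 0 set when a byte is waiting */
    volatile uint32_t control;
} uart_regs_t;

#define UART_RX_READY (1u << 0)

/* Returns the received byte (0..255), or -1 if nothing is waiting.
 * Each access through a volatile field compiles to an ordinary load,
 * which on real hardware the bus routes to the device instead of RAM. */
static int uart_try_read(uart_regs_t *uart)
{
    if ((uart->status & UART_RX_READY) == 0) {
        return -1;
    }
    return (int)(uart->data & 0xFFu);
}
```

The `volatile` qualifier is what stops the compiler from caching a register value in a CPU register or dropping a "redundant" access.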

4.2 Programmed I/O: the polling loop

The simplest way to get data from a device:

c
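/* uart must point to volatile registers, or the compiler may hoist the load out of the loop */
while (!(uart->status & RX_READY)) { /* spin until a byte arrives */ }
char c = uart->data;  /* read the received byte from the data register */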

The CPU asks the device "are you ready?" over and over until it answers yes. Then it reads the data. Simple, correct, and a colossal waste of CPU cycles for slow devices. A 9600-baud UART produces a byte roughly every 1 ms; a 3 GHz CPU runs about 3 million cycles in that millisecond, and the polling loop spends all of them re-reading the status register.
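The arithmetic behind that figure can be sketched as a small function. The name `cycles_per_byte` is made up for illustration, and the 10 bits per frame assumes 8N1 framing (one start bit, eight data bits, one stop bit), which the text does not specify.

```c
#include <stdint.h>

/* CPU cycles that elapse between consecutive bytes on a serial line.
 * bits_per_frame is 10 for 8N1 framing (start + 8 data + stop). */
static uint64_t cycles_per_byte(uint64_t cpu_hz, uint32_t baud,
                                uint32_t bits_per_frame)
{
    /* bytes per second = baud / bits_per_frame, so
     * cycles per byte  = cpu_hz * bits_per_frame / baud */
    return cpu_hz * bits_per_frame / baud;
}
```

With `cpu_hz = 3000000000`, `baud = 9600`, and `bits_per_frame = 10`, this gives 3,125,000 cycles per byte, which matches the rough "3 million" estimate.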

Polling is right when:

  • The device is fast and the CPU has nothing else to do.
  • The latency budget is tight and you cannot afford the overhead of entering and leaving an interrupt handler.
  • The system is simple enough that the polling loop is the entire program (a bare-metal sensor reader).

4.3 Interrupt-driven I/O

The smarter pattern: the device interrupts the CPU when it has something to say.

plaintext
Device: "I have data ready!" ──asserts IRQ──▶ CPU
CPU: "Hold on, finishing this instruction..."
CPU: pushes PC, status flags onto stack
CPU: looks up the interrupt service routine (ISR) address in the interrupt vector table and jumps to it
ISR: reads device, copies byte to a queue, signals waiting task
CPU: pops PC, status, resumes original work

The CPU does productive work between events instead of spinning. Interrupts are asynchronous (they can happen at any cycle), which means the hardware needs to:

  1. Recognize the IRQ on a clock edge.
  2. Finish the currently-executing instruction (so the saved PC points to a clean boundary).
  3. Save enough state so the ISR can run without trampling the interrupted code's data.
  4. Look up the ISR address in a vector table indexed by IRQ number.
  5. Jump to the ISR.

When the ISR finishes, it executes a "return from interrupt" instruction (RTI, RETI, RFE) that restores the saved state and resumes.
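The "copies byte to a queue" step is usually a small ring buffer shared between the ISR and the main loop. The sketch below is one common shape for it; `rx_queue_t`, `rx_queue_push`, and `rx_queue_pop` are hypothetical names. It assumes a single core with one producer (the ISR) and one consumer; multi-core or weakly ordered systems would also need memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_QUEUE_SIZE 16u  /* power of two so the index wraps with a mask */

typedef struct {
    volatile uint8_t  buf[RX_QUEUE_SIZE];
    volatile uint32_t head;  /* written only by the ISR */
    volatile uint32_t tail;  /* written only by the consumer */
} rx_queue_t;

/* Called from the ISR with the byte just read from the device.
 * Returns false (dropping the byte) if the queue is full. */
static bool rx_queue_push(rx_queue_t *q, uint8_t byte)
{
    if (q->head - q->tail == RX_QUEUE_SIZE) {
        return false;
    }
    q->buf[q->head & (RX_QUEUE_SIZE - 1u)] = byte;
    q->head = q->head + 1u;
    return true;
}

/* Called from the main loop. Returns false if no byte is queued. */
static bool rx_queue_pop(rx_queue_t *q, uint8_t *out)
{
    if (q->head == q->tail) {
        return false;
    }
    *out = q->buf[q->tail & (RX_QUEUE_SIZE - 1u)];
    q->tail = q->tail + 1u;
    return true;
}
```

Because each index has exactly one writer, the ISR and the main loop never need to disable each other to stay consistent. That keeps the ISR short.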

Vectored interrupts and priorities

Multiple devices share IRQ lines through a controller chip:

  • PIC (Programmable Interrupt Controller). Original IBM PC: 8259 chip, 8 IRQ lines, cascadable.
  • APIC (Advanced PIC). Modern x86, more lines, message-signaled interrupts.
  • GIC (Generic Interrupt Controller). ARM systems-on-chip.
  • NVIC (Nested Vectored Interrupt Controller). ARM Cortex-M; integrates priority handling.

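The controller does priority arbitration: if a higher-priority interrupt arrives during a lower-priority ISR, the lower ISR is itself interrupted (nested interrupts). In a vectored scheme, each interrupt source (or, on some architectures, each priority level) has its own vector slot, so the hardware can jump straight to the right ISR.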
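The arbitration rule can be sketched in a few lines. This is a software model, not how any specific controller is implemented. `next_irq` and `NO_IRQ` are illustrative names, and the convention that a lower IRQ number means higher priority is an assumption made for the example.

```c
#include <stdint.h>

#define NO_IRQ (-1)

/* Assumed convention: lower IRQ number = higher priority.
 * pending: bit n set means IRQ n is requesting service.
 * current: IRQ being serviced now, or NO_IRQ if the CPU is in normal code.
 * Returns the IRQ to dispatch next, or NO_IRQ if nothing may preempt. */
static int next_irq(uint32_t pending, int current)
{
    for (int irq = 0; irq < 32; irq++) {
        if (current != NO_IRQ && irq >= current) {
            break;  /* only a strictly higher priority may preempt */
        }
        if (pending & (1u << irq)) {
            return irq;
        }
    }
    return NO_IRQ;
}
```

For example, with IRQ 3 in service, a pending IRQ 1 is dispatched immediately (nesting), while a pending IRQ 5 waits until the current ISR returns.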


Hardware-security tie-in. Interrupts can leak timing. A spy process measuring its own scheduling latency can detect when the OS handles an interrupt on behalf of a victim process (a timing side channel). Some embedded and real-time systems mask interrupts during cryptographic operations to avoid leaking through this channel.

4.4 DMA: Direct Memory Access

Some devices move data quickly or continuously (a 1 Gbps network card, an NVMe SSD at 3 GB/s, a sound card streaming samples). Even with interrupts, having the CPU copy each byte is too slow or too wasteful. Direct Memory Access (DMA) lets the device move data to/from memory without CPU involvement.

The flow:

  1. CPU sets up a DMA descriptor: source address, destination address, length, direction.
  2. CPU tells the DMA controller to start.
  3. DMA controller takes the bus and moves data (one byte / word per cycle, or a burst).
  4. CPU is free to execute other instructions in the meantime.
  5. When the transfer completes, DMA fires an interrupt.
  6. CPU handles completion in the ISR.
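The steps above can be sketched in C with a stand-in for the hardware. Everything here is hypothetical: the `dma_desc_t` layout, the function names, and the use of `memcpy` to play the part of the controller. A real controller defines its own descriptor format (including a direction field, omitted here) and runs concurrently with the CPU instead of inside the call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor; real controllers define their own layout. */
typedef struct {
    const void   *src;
    void         *dst;
    size_t        length;  /* bytes to move */
    volatile bool done;    /* set by the completion handler */
} dma_desc_t;

/* Stand-in for the hardware (steps 3-4): on a real system the controller
 * moves the data while the CPU keeps running; here memcpy plays that role. */
static void fake_dma_run(dma_desc_t *d)
{
    memcpy(d->dst, d->src, d->length);
}

/* Completion ISR (step 6): record that the buffer is ready. */
static void dma_complete_isr(dma_desc_t *d)
{
    d->done = true;
}

/* Steps 1-2: fill in the descriptor and start the transfer. */
static void dma_start(dma_desc_t *d, void *dst, const void *src, size_t length)
{
    d->src = src;
    d->dst = dst;
    d->length = length;
    d->done = false;
    fake_dma_run(d);      /* in hardware this proceeds without the CPU */
    dma_complete_isr(d);  /* step 5: the controller raises the interrupt */
}
```

In real driver code, the caller would return right after starting the transfer and check `done` (or wait on a semaphore) later, once the ISR has fired.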

DMA modes

  • Burst mode. DMA grabs the bus and transfers everything in one shot. Fast but locks out the CPU.
  • Cycle stealing. DMA transfers one word per cycle, alternating with CPU bus cycles. Smoother latency for the CPU, slower DMA.
  • Transparent (hidden) mode. DMA only takes the bus when the CPU is not using it (during instruction-decode cycles, etc.). Best behavior, requires close coordination.
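A toy cost model makes the burst versus cycle-stealing trade-off concrete. It assumes one word moves per bus cycle and that the CPU wants the bus on every cycle; both are simplifications, and the names `dma_cost_t`, `burst_cost`, and `cycle_steal_cost` are invented for this sketch.

```c
#include <stdint.h>

typedef struct {
    uint32_t total_bus_cycles;   /* cycles from first to last DMA word */
    uint32_t longest_cpu_stall;  /* longest run of cycles the CPU is locked out */
} dma_cost_t;

/* Burst: the controller holds the bus for all n words back to back. */
static dma_cost_t burst_cost(uint32_t n_words)
{
    dma_cost_t c = { n_words, n_words };
    return c;
}

/* Cycle stealing: DMA and CPU alternate, one bus cycle each, so n DMA
 * cycles are separated by n - 1 CPU cycles. */
static dma_cost_t cycle_steal_cost(uint32_t n_words)
{
    dma_cost_t c = { 0, 0 };
    if (n_words > 0) {
        c.total_bus_cycles = 2 * n_words - 1;
        c.longest_cpu_stall = 1;
    }
    return c;
}
```

For a 1024-word transfer under this model, burst mode finishes in 1024 cycles but stalls the CPU for all of them. Cycle stealing takes 2047 cycles and never stalls the CPU for more than one.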

DMA is everywhere: every disk controller, network card, USB controller, GPU memory transfer, audio codec, ADC streaming.

Hardware-security tie-in. DMA bypasses normal CPU-mediated access checks. A malicious peripheral plugged into Thunderbolt or PCIe can use DMA to read all of system memory, including kernel keys. DMA attacks are a well-documented class. Mitigations:

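  • IOMMU. A second MMU between devices and memory translates and checks device DMA requests. Linux programs it through its IOMMU subsystem (for example Intel VT-d, which firmware describes in the ACPI DMAR table) and builds VFIO device passthrough on top of it; macOS and Windows have similar protections.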
  • Bus mastering is restricted to known-safe device classes by default in modern OSes.
  • Hot-plug DMA attacks (an attacker connects a hostile device or laptop to a running system) get harder when the OS programs the IOMMU correctly.
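The core of the IOMMU idea is a permission check on every device request. The sketch below models only that check with a flat list of allowed windows. Real IOMMUs use per-device, multi-level page tables and also translate addresses, so `dma_window_t` and `iommu_allows` are illustrative names, not a real API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One window of memory a device is allowed to touch (hypothetical layout). */
typedef struct {
    uint64_t base;
    uint64_t size;
    bool     writable;
} dma_window_t;

/* Returns true if the whole [addr, addr + len) request lies inside one
 * window with sufficient permission. Anything else is blocked. */
static bool iommu_allows(const dma_window_t *win, size_t n_win,
                         uint64_t addr, uint64_t len, bool is_write)
{
    for (size_t i = 0; i < n_win; i++) {
        const dma_window_t *w = &win[i];
        if (addr < w->base || len > w->size) {
            continue;
        }
        if (addr - w->base > w->size - len) {
            continue;  /* request runs past the window (overflow-safe check) */
        }
        if (is_write && !w->writable) {
            continue;
        }
        return true;
    }
    return false;
}
```

A malicious peripheral that asks for an address outside its windows, such as kernel memory, gets no match and the request is refused.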

4.5 Bus standards in the wild

A quick survey of the buses you will meet:

  • PCI / PCIe. The dominant expansion bus on PCs and servers. PCIe Gen 5 runs at 32 GT/s per lane, about 4 GB/s per direction per lane; lanes are bundled into x1, x4, x8, and x16 links.
  • USB. The friendly outside-world bus. USB 3.2 reaches 10 Gbps (Gen 2) or 20 Gbps (Gen 2x2); USB4 reaches 40 Gbps.
  • I²C. Two-wire low-speed inter-chip bus. Most sensors, EEPROMs, real-time clocks. 100 kHz or 400 kHz typical.
  • SPI. Four-wire fast bus. Flash chips, SD cards, displays. 50-100 MHz.
  • CAN. Robust automotive bus. Differential, multi-master, used in your car's engine, brakes, infotainment.
  • AXI / AHB / APB. ARM AMBA family. Internal SoC buses connecting cores, caches, memory controllers, peripherals.
  • UPI / Infinity Fabric / NVLink / CXL. High-speed coherent interconnects between CPUs, between CPU and accelerator, in modern data centers.

Each has its own electrical layer, framing, arbitration, and error detection. The architectural ideas (master/slave, point-to-point vs shared, priorities) recur.
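Bus speeds are usually quoted as raw signaling rates, so it helps to convert them to bytes. The sketch below does this for PCIe, assuming Gen 5's 32 GT/s per lane and 128b/130b line encoding. The function name `pcie_bytes_per_s` is invented here, and packet-header overhead is ignored, so real throughput is somewhat lower.

```c
/* Approximate usable bytes per second in one direction for a PCIe link.
 * gt_per_s: raw transfers per second per lane (32e9 for Gen 5).
 * Assumes 128b/130b encoding: 128 payload bits for every 130 bits sent. */
static double pcie_bytes_per_s(double gt_per_s, int lanes)
{
    return gt_per_s * (128.0 / 130.0) / 8.0 * lanes;
}
```

Under these assumptions a single Gen 5 lane carries a little under 4 GB/s each way, and an x16 link about 63 GB/s.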