The three fundamental methods for transferring data between CPU, memory, and peripheral devices: programmed I/O (polling), interrupt-driven I/O, and Direct Memory Access (DMA). How each works, when to use each, and how they connect to modern SoC peripheral design.
Every computer system must solve the same fundamental problem: how does the CPU coordinate data transfers between itself, memory, and much slower peripheral devices without wasting CPU time or losing data? Three techniques address this, each a step more sophisticated than the last:
The three I/O techniques in order of increasing CPU efficiency. Programmed I/O (polling) wastes CPU cycles waiting. Interrupt-driven I/O frees the CPU between transfers but still requires it to handle each data word. DMA delegates entire block transfers to dedicated hardware, involving the CPU only at the start and end.
In programmed I/O (also called polling or busy-wait I/O), the CPU has direct, exclusive control over every step of the I/O operation. It initiates the transfer, monitors the device status, and transfers each data word manually.
Programmed I/O read flowchart. The CPU issues a READ command, then enters a tight polling loop reading the status register. The “NO” branch of READY? loops back; this is the busy-wait. Once READY, the CPU reads the data word and writes it to memory, then repeats for the next word.
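The flowchart can be sketched as a small simulation. This is only an illustrative model: `SimulatedDevice`, its `delay_polls` parameter, and the method names are assumptions made up for the example, standing in for a real command, status, and data register.

```python
class SimulatedDevice:
    """Hypothetical device: READY only after `delay_polls` failed status reads."""

    def __init__(self, data, delay_polls=3):
        self._data = list(data)
        self._delay = delay_polls
        self._countdown = 0

    def issue_read(self):
        # Models writing a READ command to the device's command register.
        self._countdown = self._delay

    def status_ready(self):
        # Models reading the status register; each read costs CPU time.
        if self._countdown > 0:
            self._countdown -= 1
            return False
        return True

    def read_data(self):
        # Models reading one word from the data register.
        return self._data.pop(0)


def programmed_io_read(device, n_words):
    """Read n_words with busy-wait polling; return (memory, status_reads)."""
    memory = []
    polls = 0
    for _ in range(n_words):
        device.issue_read()
        while True:                 # the busy-wait: the "NO" branch loops back
            polls += 1
            if device.status_ready():
                break
        memory.append(device.read_data())   # CPU moves the word itself
    return memory, polls


memory, polls = programmed_io_read(SimulatedDevice([10, 20, 30]), 3)
print(memory, polls)   # [10, 20, 30] 12
```

In this model the CPU performs 12 status reads to move 3 words, and 9 of them return "not ready". That ratio gets worse as the device gets slower relative to the CPU.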
Interrupt-driven I/O eliminates the busy-wait by allowing the CPU to do useful work while the I/O device operates. Instead of the CPU asking “are you ready yet?”, the device says “I am ready now” via an interrupt signal.
Interrupt-driven I/O timing. While the device operates (yellow), the CPU executes other instructions (green). When the device is ready, it pulses the IRQ line (purple). The CPU runs the short ISR, reads the data word and stores it, then resumes its previous work. Device and CPU activity overlap.
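The overlap of CPU and device activity can be modeled tick by tick. The sketch below is a simplification under stated assumptions: the device needs `device_latency` ticks per word, the ISR costs exactly one tick, and all names are hypothetical.

```python
def run_interrupt_driven(words, device_latency=4):
    """Tick-level model: the CPU does useful work unless the device raises IRQ.

    Returns (memory, useful_ticks, isr_ticks).
    """
    pending = list(words)
    memory = []
    useful_ticks = 0
    isr_ticks = 0
    ticks_until_ready = device_latency

    while pending:
        ticks_until_ready -= 1
        if ticks_until_ready == 0:
            # IRQ: the ISR reads one word, stores it, and restarts the device.
            memory.append(pending.pop(0))
            isr_ticks += 1
            ticks_until_ready = device_latency
        else:
            # No IRQ: the CPU keeps executing other instructions.
            useful_ticks += 1
    return memory, useful_ticks, isr_ticks


print(run_interrupt_driven([1, 2, 3]))   # ([1, 2, 3], 9, 3)
```

With these assumed numbers, 9 of 12 ticks go to useful work. The CPU still runs one ISR per word, which is the cost that DMA removes.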
When an interrupt fires, the CPU must know which device triggered it. Four identification methods exist:
| Method | How it works | Advantage | Disadvantage |
|---|---|---|---|
| Multiple IRQ lines | Each device gets its own dedicated interrupt line. CPU examines which line is asserted. | Fastest identification (direct in hardware) | Limits number of devices to number of CPU interrupt pins |
| Software poll | All devices share one IRQ line. On interrupt, CPU polls each device’s status register in priority order. | Any number of devices can share one IRQ line | Slow: may have to poll every device |
| Daisy chain (HW poll) | Interrupt ACK propagates along a chain. First device that requested the interrupt claims it and places its vector on the data bus. | Fast hardware identification; scalable | Priority fixed by position in chain |
| Bus mastering | Device must claim the bus before raising an interrupt. Used in PCI. | Combined bus arbitration and interrupt | Complex arbitration protocol |
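The software-poll and daisy-chain rows differ mainly in who does the searching. A minimal sketch, with hypothetical device names and plain Python data structures standing in for status registers and the ACK chain:

```python
def software_poll(status_by_device, priority_order):
    """CPU reads each status register in priority order.

    Returns (device_name, registers_read); device_name is None if nothing is pending.
    """
    reads = 0
    for name in priority_order:
        reads += 1
        if status_by_device[name]:
            return name, reads
    return None, reads


def daisy_chain(chain, requesting):
    """Hardware passes the ACK down the chain; the first requester claims it.

    Returns (device_name, position_in_chain); no CPU register reads are needed.
    """
    for position, name in enumerate(chain):
        if name in requesting:
            return name, position
    return None, len(chain)


order = ["timer", "disk", "uart"]
print(software_poll({"timer": False, "disk": False, "uart": True}, order))  # ('uart', 3)
print(daisy_chain(order, {"disk", "uart"}))                                 # ('disk', 1)
```

In the daisy-chain case the disk wins over the UART purely because it sits earlier in the chain, which is the fixed-priority disadvantage listed in the table.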
Most systems use a dedicated interrupt controller, for example the classic Intel 8259A PIC or the ARM GIC, which handles up to 1020 interrupt sources with programmable priority. The interrupt controller signals the CPU on a single interrupt line and provides the interrupt vector (or interrupt ID) when the CPU acknowledges.
In a real system, many devices can interrupt simultaneously. The interrupt controller assigns priorities so the most urgent device is serviced first. Higher-priority interrupts can preempt lower-priority ISRs already running:
Setup: Three interrupt sources: Timer (priority 1, highest), Disk (priority 2), UART (priority 3, lowest).
t=0ms: UART fires IRQ (priority 3) → CPU saves context, enters UART ISR
t=1ms: Disk fires IRQ (priority 2 outranks 3) → UART ISR preempted
CPU saves UART ISR context, enters Disk ISR
t=3ms: Timer fires IRQ (priority 1 outranks 2) → Disk ISR preempted
CPU saves Disk ISR context, enters Timer ISR
t=3.1ms: Timer ISR completes → restores Disk ISR context
t=5ms: Disk ISR completes → restores UART ISR context
t=5.5ms: UART ISR completes → restores user program context
Stack at deepest nesting: [User context] [UART ISR context] [Disk ISR context], with the Timer ISR running. The stack depth is bounded by the number of priority levels.
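The timeline can be reproduced with a small event-driven model of preemptive priorities. The service times below (1500, 3900, and 100 microseconds of CPU time) are not stated in the text; they are inferred from the timeline so that the completion times match. The function and its tuple layout are assumptions for illustration.

```python
def simulate_nested(irqs):
    """Simulate nested, priority-preemptive interrupt handling.

    irqs: list of (arrival_us, name, priority, service_us); a lower priority
    number means more urgent. Returns (log, max_nesting_depth).
    """
    arrivals = sorted(irqs)   # ordered by arrival time
    pending = []              # arrived, not started: (priority, name, service_us)
    stack = []                # nested ISRs, top is running: [priority, name, remaining_us]
    log, now, max_depth = [], 0, 0

    while arrivals or pending or stack:
        # Start the most urgent pending IRQ if it outranks the running ISR.
        if pending and (not stack or min(pending)[0] < stack[-1][0]):
            prio, name, service = min(pending)
            pending.remove((prio, name, service))
            stack.append([prio, name, service])
            max_depth = max(max_depth, len(stack))
            log.append((now, f"enter {name} ISR"))
            continue

        next_arrival = arrivals[0][0] if arrivals else None

        # The running ISR finishes before (or exactly when) the next IRQ arrives.
        if stack and (next_arrival is None or now + stack[-1][2] <= next_arrival):
            now += stack[-1][2]
            finished = stack.pop()
            log.append((now, f"exit {finished[1]} ISR"))
            continue

        # Otherwise run (or idle) until the next IRQ arrives.
        if stack:
            stack[-1][2] -= next_arrival - now
        now = next_arrival
        _, name, prio, service = arrivals.pop(0)
        pending.append((prio, name, service))

    return log, max_depth


log, depth = simulate_nested([
    (0, "UART", 3, 1500),
    (1000, "Disk", 2, 3900),
    (3000, "Timer", 1, 100),
])
for time_us, event in log:
    print(f"t={time_us / 1000}ms: {event}")
print("max nesting depth:", depth)   # 3
```

The maximum depth of 3 equals the number of priority levels in use, which matches the bound on stack depth given above.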
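Both programmed I/O and interrupt-driven I/O share two fundamental drawbacks for large data transfers: the transfer rate is limited by how fast the CPU can test and service the device, and the CPU is tied up executing instructions for every word transferred.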
DMA (Direct Memory Access) introduces a dedicated hardware controller that can take over the system bus and transfer data directly between a peripheral and main memory without the CPU processing each word. The CPU is involved only at the start (to program the DMA controller) and end (to receive the completion interrupt).
To initiate a DMA transfer, the CPU programs the DMA controller with four pieces of information:
| Parameter | What the CPU tells the DMA controller |
|---|---|
| Direction | Read (device → memory) or Write (memory → device) |
| Device address | Which peripheral device (I/O port address or device identifier) |
| Memory start address | Starting address in RAM where data will be read from or written to |
| Word/byte count | How many words (or bytes) to transfer in total |
Once programmed, the CPU continues with other work. The DMA controller executes the entire transfer autonomously:
DMA controller internal structure. Three key registers: Data Register (one-word buffer), Address Register (auto-incrementing memory pointer), and Count Register (words remaining). DMA_REQ/DMA_ACK signals arbitrate for bus ownership. When Count reaches zero, INTR fires to notify the CPU.
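The three registers and the completion interrupt can be captured in a toy model. This is a sketch only: it covers the read direction (device → memory), ignores bus arbitration, and the class and method names are hypothetical rather than any real controller's programming interface.

```python
class DmaController:
    """Toy DMA controller with Data, Address, and Count registers."""

    def __init__(self, memory):
        self.memory = memory
        self.data = 0         # Data Register: one-word buffer
        self.address = 0      # Address Register: auto-incrementing pointer
        self.count = 0        # Count Register: words remaining
        self.intr = False     # completion interrupt line

    def program(self, start_address, word_count):
        # The only CPU involvement before the transfer starts.
        self.address = start_address
        self.count = word_count
        self.intr = False

    def transfer_from(self, device_words):
        # Runs without the CPU: one word per iteration until Count hits zero.
        source = iter(device_words)
        while self.count > 0:
            self.data = next(source)               # device -> Data Register
            self.memory[self.address] = self.data  # Data Register -> memory
            self.address += 1
            self.count -= 1
        self.intr = True                           # notify the CPU once


memory = [0] * 8
dma = DmaController(memory)
dma.program(start_address=2, word_count=3)
dma.transfer_from([7, 8, 9])
print(memory, dma.intr)   # [0, 0, 7, 8, 9, 0, 0, 0] True
```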
Three DMA hardware configurations exist, differing in how the DMA controller connects to the system bus:
Three DMA configurations. Config ①: DMA separate from I/O devices on the same bus; each word transfer uses the bus twice. Config ②: DMA controller integrates I/O devices; it reads the device internally and needs one bus access to write memory. Config ③: I/O devices on a dedicated I/O bus; DMA bridges to the system bus, giving the cleanest separation.
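The bus cost of each configuration is simple arithmetic. The helper below is a hypothetical illustration; the figure for configuration 3 (one system-bus access per word, because the device-side traffic stays on the I/O bus) is an assumption following the usual textbook treatment rather than something stated above.

```python
def system_bus_accesses(config, n_words):
    """System-bus accesses needed to move n_words under DMA configuration 1, 2, or 3."""
    per_word = {
        1: 2,   # device -> DMA and DMA -> memory both cross the system bus
        2: 1,   # device is read internally; only the memory write uses the bus
        3: 1,   # assumed: device traffic stays on the separate I/O bus
    }
    return per_word[config] * n_words


for config in (1, 2, 3):
    print(config, system_bus_accesses(config, 512))   # 1024, then 512, then 512
```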
CPU activity during a block I/O read. Programmed I/O (left): CPU occupies the entire transfer period. Interrupt-Driven (centre): CPU alternates between useful work and short ISRs, one per word. DMA (right): CPU runs user work uninterrupted; only one IRQ at the end of the entire block transfer.
| Feature | Programmed I/O | Interrupt-Driven I/O | DMA |
|---|---|---|---|
| CPU during transfer | Busy (polling loop) | Mostly free (one ISR per word) | Free (only start/end) |
| Interrupts per block | 0 (no interrupts) | N (one per word) | 1 (block complete) |
| Hardware needed | Minimal (none extra) | Interrupt controller | DMA controller + arbiter |
| Transfer rate | Limited by CPU speed | Better | Full bus bandwidth |
| Best for | Short transfers, dedicated MCU | Word-sized transfers, low latency | Large block transfers (disk, NIC) |
| Complexity | Lowest | Medium | Highest |
Every non-trivial SoC contains one or more DMA controllers. ARM’s PL330 (DMA-330) is a widely used AMBA DMA engine, found in Samsung Exynos and Rockchip SoCs among many others. The DMA-330 supports up to 8 channels, each with a dedicated thread executing a microcode program (the channel descriptor). Instead of simple source/destination/count registers, the DMA-330 executes a small instruction set (DMALP for loop, DMALD for load, DMAST for store, DMAEND for end) stored in a descriptor buffer in memory. This allows complex scatter-gather transfers, such as reading from 10 non-contiguous source regions and writing to a single destination, with a single CPU programming operation.
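The effect of a scatter-gather transfer can be modeled with a list of source segments. This sketch shows only the data movement that such a program achieves; it is not the DMA-330 instruction encoding, and the function name and `(start, length)` segment format are assumptions for the example.

```python
def scatter_gather_copy(memory, segments, dest):
    """Copy each (start, length) source segment, in order, into one contiguous destination.

    Returns the address just past the last byte written.
    """
    for start, length in segments:
        memory[dest:dest + length] = memory[start:start + length]
        dest += length
    return dest


# Three non-contiguous source regions ("AA", "BBB", "C") gathered at address 10.
memory = bytearray(b"AAxxBBByyC" + bytes(6))
end = scatter_gather_copy(memory, [(0, 2), (4, 3), (9, 1)], dest=10)
print(bytes(memory[10:end]))   # b'AABBBC'
```

The CPU's part corresponds to building the segment list once; the loop body is the work the DMA engine would perform on its own.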
DMA creates a subtle cache coherence problem. If the CPU writes data to a buffer in memory, and the CPU’s L1 write-back cache has not yet flushed those writes to DRAM (dirty cache lines), the DMA controller will read stale data from DRAM. The reverse also happens: after the DMA controller writes received data to DRAM, the CPU may still read old copies of those addresses from its cache. Solutions: (1) Cache flush/invalidate in software: the driver explicitly flushes dirty lines before DMA TX and invalidates cache lines after DMA RX. (2) Hardware-coherent DMA: the DMA’s AXI master port is connected to the cache-coherent interconnect (ACE or CHI), so DMA traffic automatically participates in the hardware coherence protocol (MESI or a variant). ARM’s CCI-400 and CMN-600 support hardware-coherent DMA. Choosing between these two approaches, and correctly implementing the chosen one in both RTL and driver code, is one of the most common sources of DMA-related bugs in SoC design.
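Both stale-data cases, and the software fix for each, can be demonstrated with a toy write-back cache. The model is deliberately crude and entirely hypothetical: one word per cache line, no eviction, and DMA represented as direct reads and writes of the `dram` list.

```python
class WriteBackCache:
    """Toy write-back cache in front of a DRAM list (one word per line)."""

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}   # address -> (value, dirty)

    def write(self, addr, value):
        # Write-back policy: the new value stays in the cache, DRAM is not updated.
        self.lines[addr] = (value, True)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.dram[addr], False)   # miss: fill from DRAM
        return self.lines[addr][0]

    def flush(self, addr):
        # Write a dirty line back to DRAM (needed before DMA TX).
        value, dirty = self.lines.get(addr, (None, False))
        if dirty:
            self.dram[addr] = value
            self.lines[addr] = (value, False)

    def invalidate(self, addr):
        # Drop the cached copy so the next read refills from DRAM (needed after DMA RX).
        self.lines.pop(addr, None)


dram = [0, 0]
cache = WriteBackCache(dram)

# TX: the CPU writes a buffer, then DMA reads DRAM directly.
cache.write(0, 42)
print(dram[0])        # 0   (DMA would see stale data)
cache.flush(0)
print(dram[0])        # 42  (safe to start DMA TX)

# RX: the CPU has a cached copy, then DMA writes new data into DRAM.
cache.read(1)
dram[1] = 99          # DMA write bypasses the cache
print(cache.read(1))  # 0   (CPU sees stale data)
cache.invalidate(1)
print(cache.read(1))  # 99  (CPU sees the DMA data)
```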
The three I/O techniques represent a fundamental engineering trade-off between latency and throughput. Programmed I/O has the lowest latency: the CPU reacts to data as fast as it polls, with no interrupt overhead. Interrupt-driven I/O has moderate latency with multitasking. DMA has the highest throughput for bulk transfers but adds setup latency (programming the controller, cache maintenance) that makes it inefficient for small transfers. In SoC peripheral design: a touch screen controller generating 120 events/second uses interrupt-driven I/O; a camera ISP writing 4K frames at 60 fps uses DMA; a JTAG debug interface reading one byte at a time uses programmed I/O.
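The point about DMA setup cost can be made concrete with a rough CPU-cycle model. Every number here is a made-up placeholder (cycles spent waiting per word, per-word copy cost, ISR overhead, DMA setup cost); real values depend entirely on the hardware. The model counts CPU cycles consumed, not response latency.

```python
def cpu_cycles(method, n_words, wait=200, copy=4, isr=60, dma_setup=400):
    """Rough CPU cycles consumed to move n_words (all cost parameters are hypothetical)."""
    if method == "polling":
        return n_words * (wait + copy)   # busy-wait plus copy for every word
    if method == "interrupt":
        return n_words * (isr + copy)    # one ISR plus copy for every word
    if method == "dma":
        return dma_setup + isr           # program once, one completion interrupt
    raise ValueError(f"unknown method: {method}")


for n in (1, 7, 8, 1000):
    print(n, {m: cpu_cycles(m, n) for m in ("polling", "interrupt", "dma")})
```

With these placeholder costs, interrupt-driven I/O is cheaper than DMA up to 7 words (448 versus 460 cycles) and DMA wins from 8 words onward (512 versus 460), which illustrates why small transfers are often left to the CPU.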