CA-09: I/O Techniques: Programmed I/O, Interrupt-Driven I/O, and DMA | VLSI Trainers
Computer Architecture · Article 9 of 12

CA-09: I/O Techniques

The three fundamental methods for transferring data between CPU, memory, and peripheral devices: programmed I/O (polling), interrupt-driven I/O, and Direct Memory Access (DMA). How each works, when to use each, and how they connect to modern SoC peripheral design.

🔄The Three I/O Techniques

Every computer system must solve the same fundamental problem: how does the CPU coordinate data transfers between itself, memory, and much slower peripheral devices without wasting CPU time or losing data? Three techniques address this, each a step more sophisticated than the last:

Figure 1: The three I/O techniques, with CPU involvement decreasing from left to right. Panels: ① Programmed I/O (polling), CPU involvement maximum; ② Interrupt-driven I/O (event-driven), CPU involvement moderate; ③ DMA (Direct Memory Access), CPU involvement minimal.

The three I/O techniques in order of increasing CPU efficiency. Programmed I/O (polling) wastes CPU cycles waiting. Interrupt-driven I/O frees the CPU between transfers but still requires it to handle each data word. DMA delegates entire block transfers to dedicated hardware, involving the CPU only at the start and end.

👀Programmed I/O: How It Works

In programmed I/O (also called polling or busy-wait I/O), the CPU has direct, exclusive control over every step of the I/O operation. It initiates the transfer, monitors the device status, and transfers each data word manually.

Step-by-step sequence for a read

  1. CPU issues a READ command to the I/O module (via control bus)
  2. I/O module initiates the operation with the peripheral device
  3. CPU enters a busy-wait loop: it repeatedly reads the I/O module's status register until the READY bit is set
  4. When READY is detected, the CPU reads the data word from the I/O module's data register
  5. CPU writes the word to main memory (or processes it directly)
  6. Repeat from step 1 for the next word
The fundamental waste: At step 3, the CPU spins in a tight loop doing nothing useful. For a printer running at 100 characters/second, the CPU spends more than 99.99% of the time between characters polling. At 1 GHz, that is roughly 10 million wasted cycles per character.
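The read sequence above can be sketched as a small simulation. This is a minimal model, not driver code: `SimulatedDevice`, its `READY` bit value, and the `polls_until_ready` latency are hypothetical names introduced only for illustration.

```python
class SimulatedDevice:
    """Toy peripheral: READY goes high a fixed number of status reads after a command."""

    READY = 0x01  # hypothetical position of the READY bit in the status register

    def __init__(self, data, polls_until_ready):
        self._data = list(data)
        self._latency = polls_until_ready
        self._countdown = 0
        self._current = None

    def issue_read(self):
        # Steps 1-2: the CPU issues READ and the module starts the slow device operation.
        self._current = self._data.pop(0)
        self._countdown = self._latency

    def status(self):
        # Each status read models one pass through the busy-wait loop.
        if self._countdown > 0:
            self._countdown -= 1
            return 0
        return self.READY

    def data_register(self):
        return self._current


def programmed_io_read(device, n_words):
    """Read n_words by polling; return (memory, number of wasted status polls)."""
    memory, wasted_polls = [], 0
    for _ in range(n_words):
        device.issue_read()                           # step 1
        while not (device.status() & device.READY):   # step 3: busy-wait
            wasted_polls += 1
        memory.append(device.data_register())         # steps 4-5
    return memory, wasted_polls


def cycles_per_item(clock_hz, items_per_second):
    """Clock cycles that elapse between two items delivered by the device."""
    return clock_hz // items_per_second
```

Every iteration counted in `wasted_polls` is time the CPU could have spent on other work, and `cycles_per_item(1_000_000_000, 100)` reproduces the 10 million cycles per character quoted above.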

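When programmed I/O is acceptable

Programmed I/O is a reasonable choice when the CPU has nothing else to do (a single-purpose MCU), when transfers are very short, or when reacting to data with the lowest possible latency matters more than CPU efficiency.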

📊Programmed I/O: Flowchart

Figure 2: Programmed I/O flowchart (read operation). CPU side: issue READ command, read the status register, test READY (NO: poll again; error condition: exit; YES: continue), read the data word, write the word to memory, test Done (NO: next word; YES: transfer complete). The CPU is busy throughout and cannot execute other instructions during the wait.

Programmed I/O read flowchart. The CPU issues a READ command, then enters a tight polling loop reading the status register. The "NO" branch of READY? loops back: this is the busy-wait. Once READY, the CPU reads the data word and writes it to memory, then repeats for the next word.

🔔Interrupt-Driven I/O: How It Works

Interrupt-driven I/O eliminates the busy-wait by allowing the CPU to do useful work while the I/O device operates. Instead of the CPU asking "are you ready yet?", the device says "I am ready now" via an interrupt signal.

From the CPU's perspective

  1. Issues READ command to I/O module
  2. Continues executing other instructions, with no polling
  3. At the end of each instruction cycle, checks the IRQ line
  4. When IRQ detected: saves context (PC, registers, PSW)
  5. Executes the ISR, which reads the word from the I/O module's data register and stores it in memory
  6. Restores context, resumes the interrupted program
Figure 3: Interrupt-driven I/O timing, with CPU and I/O device operating concurrently. Timeline: command issued, CPU executes other instructions while the device performs the slow (millisecond-scale) operation, IRQ fires when the data register is ready, CPU runs the ISR, then resumes its previous work.

Interrupt-driven I/O timing. While the device operates (yellow), the CPU executes other instructions (green). When the device is ready, it pulses the IRQ line (purple). The CPU runs the short ISR, reads the data word and stores it, then resumes its previous work. Device and CPU activity overlap.
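The CPU-side sequence can be sketched as a simulation in which the CPU counts user instructions and samples the IRQ line once per instruction. This is an illustrative model under simplifying assumptions: `run_interrupt_driven` and `device_latency` are hypothetical names, and the context save/restore is implicit rather than modelled.

```python
def run_interrupt_driven(words, device_latency, total_instructions):
    """Simulate a CPU doing user work while a device delivers one word per IRQ.

    device_latency: instruction cycles the device needs to produce each word
    (must be at least 1). Returns (memory, user_instructions, isr_count).
    """
    pending = list(words)
    memory = []
    user_instructions = 0
    isr_count = 0
    # Step 1: issue the first READ command; the device starts counting down.
    countdown = device_latency if pending else None

    for _ in range(total_instructions):
        user_instructions += 1          # step 2: useful work, no polling

        # The device runs concurrently with the instruction just executed.
        if countdown is not None:
            countdown -= 1

        # Step 3: sample the IRQ line at the end of the instruction cycle.
        if countdown == 0:
            # Steps 4-6: context save and restore are implicit in this model;
            # the ISR body moves exactly one word into memory.
            memory.append(pending.pop(0))
            isr_count += 1
            countdown = device_latency if pending else None
    return memory, user_instructions, isr_count
```

Unlike the polling case, every loop iteration here performs user work; the only per-word cost is the ISR, which a real CPU would pay for in context-save and restore cycles.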

Remaining limitation: Interrupt-driven I/O still requires the CPU to execute ISR code for every single data word transferred. For a disk reading a 64 KB file at 32-bit word granularity, that means 16,384 interrupts: 16,384 context saves, 16,384 ISR executions, and 16,384 context restores. This motivates DMA.

🔍Interrupt I/O: Identifying the Device

When an interrupt fires, the CPU must know which device triggered it. Four identification methods exist:

Method | How it works | Advantage | Disadvantage
Multiple IRQ lines | Each device gets its own dedicated interrupt line. CPU examines which line is asserted. | Fastest identification (direct in hardware) | Limits number of devices to number of CPU interrupt pins
Software poll | All devices share one IRQ line. On interrupt, CPU polls each device's status register in priority order. | Unlimited devices on one IRQ line | Slow: may have to poll all devices
Daisy chain (HW poll) | Interrupt ACK propagates along a chain. First device that requested the interrupt claims it and places its vector on the data bus. | Fast hardware identification; scalable | Priority fixed by position in chain
Bus mastering | Device must claim the bus before raising an interrupt. Used in PCI. | Combined bus arbitration and interrupt | Complex arbitration protocol
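Two of the methods in the table, software poll and daisy chain, can be sketched in a few lines. The function names, the tuple layouts, and the vector values used in the usage note are assumptions made for illustration only.

```python
def software_poll(devices):
    """Shared IRQ line: check each device's status flag in priority order.

    devices: list of (name, requesting) pairs, highest priority first.
    Returns (name of the device to service, number of status reads needed).
    """
    for reads, (name, requesting) in enumerate(devices, start=1):
        if requesting:
            return name, reads
    return None, len(devices)


def daisy_chain_ack(chain):
    """Hardware poll: the ACK enters the chain at the device nearest the CPU.

    chain: list of (name, requesting, vector) tuples in chain order.
    A requesting device absorbs the ACK and supplies its vector;
    a non-requesting device passes the ACK on to its neighbour.
    """
    for name, requesting, vector in chain:
        if requesting:
            return name, vector   # this device drives its vector onto the data bus
    return None, None             # nobody claimed the ACK (spurious interrupt)
```

For example, `software_poll([("timer", False), ("disk", True), ("uart", True)])` selects the disk after two status reads, and in a daisy chain the same ordering means the UART can never be served ahead of the disk, which is the fixed-priority disadvantage listed in the table.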

Modern systems use a dedicated interrupt controller (Intel 8259A PIC; ARM GIC handles up to 1020 sources with programmable priority). The interrupt controller signals the CPU on a single INTR line and provides the interrupt vector when the CPU acknowledges.

🎚️Multiple Interrupts & Priority

In a real system, many devices can interrupt simultaneously. The interrupt controller assigns priorities so the most urgent device is serviced first. Higher-priority interrupts can preempt lower-priority ISRs already running:

🔍 Worked Example: Priority Interrupt Handling

Setup: Three interrupt sources: Timer (priority 1, highest), Disk (priority 2), UART (priority 3, lowest). A lower number means a higher priority.

t=0ms: UART fires IRQ (priority 3) → CPU saves context, enters UART ISR
t=1ms: Disk fires IRQ (priority 2 outranks 3) → UART ISR preempted
CPU saves UART ISR context, enters Disk ISR
t=3ms: Timer fires IRQ (priority 1 outranks 2) → Disk ISR preempted
CPU saves Disk ISR context, enters Timer ISR
t=3.1ms: Timer ISR completes → restores Disk ISR context
t=5ms: Disk ISR completes → restores UART ISR context
t=5.5ms: UART ISR completes → restores user program context

Stack at deepest nesting: [User context] [UART ISR context] [Disk ISR context] → Timer ISR running. Because an ISR can only be preempted by a strictly higher priority, the stack depth is bounded by the number of priority levels.
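The timeline above can be reproduced with a small preemptive simulation. This is a sketch under assumptions: times are in integer microseconds, the service times (UART 1.5 ms, Disk 3.9 ms, Timer 0.1 ms) are derived from the timeline rather than stated in it, and `simulate_nested_interrupts` is a hypothetical name.

```python
import heapq


def simulate_nested_interrupts(requests):
    """requests: list of (arrival_us, name, priority, service_us); lower number wins.

    Returns (completion time by name, deepest ISR nesting observed).
    """
    arrivals = sorted(requests)
    waiting = []          # heap of (priority, arrival_us, name, service_us)
    stack = []            # [name, priority, remaining_us]; the top entry is running
    now, finished, max_depth = 0, {}, 0
    i = 0

    while i < len(arrivals) or waiting or stack:
        # Dispatch: the best waiting request starts if it outranks the running ISR.
        if waiting and (not stack or waiting[0][0] < stack[-1][1]):
            priority, _, name, service = heapq.heappop(waiting)
            stack.append([name, priority, service])   # preempted context stays below
            max_depth = max(max_depth, len(stack))
            continue

        next_arrival = arrivals[i][0] if i < len(arrivals) else None
        if stack:
            finish_at = now + stack[-1][2]
            if next_arrival is None or finish_at <= next_arrival:
                now = finish_at
                finished[stack.pop()[0]] = now        # ISR returns, context restored
                continue
            stack[-1][2] -= next_arrival - now        # run until the next IRQ fires

        # Advance to the next IRQ and queue it for the dispatch check above.
        now = next_arrival
        arrival, name, priority, service = arrivals[i]
        heapq.heappush(waiting, (priority, arrival, name, service))
        i += 1
    return finished, max_depth


example = [(0, "UART", 3, 1500), (1000, "Disk", 2, 3900), (3000, "Timer", 1, 100)]
completion, depth = simulate_nested_interrupts(example)
```

With these inputs the three ISRs nest three deep, matching the stack picture above, and a request of equal or lower priority waits instead of preempting.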

⚡DMA: Why It Is Needed

Both programmed I/O and interrupt-driven I/O share two fundamental drawbacks for large data transfers:

  1. Transfer rate limited by CPU speed: The CPU must execute instructions to move each word.
  2. CPU tied up per word: For interrupt-driven I/O, every word requires a context save + ISR + context restore, typically 50 to 200 cycles of overhead per word.
Quantified cost: A disk read of 512 KB at 32-bit word granularity is 131,072 words, and therefore 131,072 interrupts. At 100 cycles per interrupt, that is 13,107,200 cycles of overhead. At 3 GHz this is ~4.4 ms of pure overhead. DMA reduces this to a single interrupt at the end of the entire block.
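The arithmetic behind the quantified cost is easy to check. The helper below is a sketch; `interrupt_overhead` and its parameter names are introduced here for illustration.

```python
def interrupt_overhead(block_bytes, word_bytes, cycles_per_interrupt, clock_hz):
    """Return (interrupt count, overhead cycles, overhead in milliseconds)."""
    interrupts = block_bytes // word_bytes          # one interrupt per word
    cycles = interrupts * cycles_per_interrupt
    return interrupts, cycles, cycles / clock_hz * 1e3


# 512 KB block, 32-bit words, 100 cycles per interrupt, 3 GHz clock.
count, cycles, millis = interrupt_overhead(512 * 1024, 4, 100, 3_000_000_000)
```

The same function gives 16,384 interrupts for a 64 KB block, and rerunning it with 50 or 200 cycles per interrupt shows how the overhead scales across the typical range.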

DMA (Direct Memory Access) introduces a dedicated hardware controller that can take over the system bus and transfer data directly between a peripheral and main memory without the CPU processing each word. The CPU is involved only at the start (to program the DMA controller) and end (to receive the completion interrupt).

🚀DMA: How It Works

To initiate a DMA transfer, the CPU programs the DMA controller with four pieces of information:

Parameter | What the CPU tells the DMA controller
Direction | Read (device → memory) or Write (memory → device)
Device address | Which peripheral device (I/O port address or device identifier)
Memory start address | Starting address in RAM where data will be read from or written to
Word/byte count | How many words (or bytes) to transfer in total

Once programmed, the CPU continues with other work. The DMA controller executes the entire transfer autonomously:

  1. DMA controller requests the system bus (asserts DMA_REQ to the bus arbiter)
  2. Bus arbiter grants the bus (asserts DMA_ACK)
  3. DMA controller reads one word from the peripheral into its internal data register
  4. DMA controller writes that word to the next memory location (auto-increments the address register)
  5. Decrements the word count register
  6. If count ≠ 0: repeat from step 1 for the next word
  7. When count = 0: assert IRQ to notify the CPU that the block transfer is complete
Cycle stealing: The DMA controller steals one bus cycle from the CPU at a time. The CPU is suspended just before it needs the bus (for example, between instruction decode and operand fetch). This is not a context switch: the CPU saves no registers. It simply pauses for one bus cycle, then resumes. The net effect is that the CPU runs slightly slower while the DMA transfer proceeds at close to full bus bandwidth.
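The register set and the seven-step loop can be modelled directly. The class below is a minimal sketch that assumes the read direction only (device → memory) and treats the device as a simple iterable of words; `DMAController`, `program`, and `step` are hypothetical names, not any vendor's register interface.

```python
class DMAController:
    """Minimal model of the registers described above (device-to-memory only)."""

    def __init__(self, memory):
        self.memory = memory        # shared main memory (a list of words)
        self.source = iter(())      # stand-in for the peripheral's data stream
        self.address = 0            # address register, auto-increments
        self.count = 0              # count register, decrements to zero
        self.data = None            # one-word data register
        self.irq = False            # completion interrupt line
        self.stolen_cycles = 0      # bus cycles taken from the CPU

    def program(self, device_words, start_address, word_count):
        """What the CPU does once before returning to other work."""
        self.source = iter(device_words)
        self.address, self.count, self.irq = start_address, word_count, False

    def step(self):
        """Move one word in one stolen bus cycle; return False once the block is done."""
        if self.count == 0:
            return False
        self.stolen_cycles += 1                 # steps 1-2: request and win the bus
        self.data = next(self.source)           # step 3: peripheral -> data register
        self.memory[self.address] = self.data   # step 4: data register -> memory
        self.address += 1
        self.count -= 1                         # step 5
        if self.count == 0:
            self.irq = True                     # step 7: single completion interrupt
        return self.count != 0                  # step 6: more words to move?


memory = [0] * 8
dma = DMAController(memory)
dma.program(device_words=[11, 22, 33], start_address=2, word_count=3)
while dma.step():
    pass  # in a real system the CPU runs unrelated code between stolen cycles
```

After the loop, the three words sit at addresses 2 to 4, exactly three bus cycles were stolen, and the IRQ line has been raised once for the whole block.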

🔧DMA Block Diagram

Figure 4: DMA controller internal registers and system bus connection. Blocks: CPU, Memory, DMA controller (Data Register, Address Register, Count Register, Control Logic), and the I/O module with its peripheral, all attached to the system bus (data, address, and control lines). Signals: DMA_REQ, DMA_ACK, INTR (block complete). Data flow under cycle stealing: ① DMA steals the bus, ② reads from the I/O device, ③ writes the word to memory, ④ decrements the count and repeats.

DMA controller internal structure. Three key registers: Data Register (one-word buffer), Address Register (auto-incrementing memory pointer), and Count Register (words remaining). DMA_REQ/DMA_ACK signals arbitrate for bus ownership. When Count reaches zero, INTR fires to notify the CPU.

🔀DMA Configurations

Three DMA hardware configurations exist, differing in how the DMA controller connects to the system bus:

Figure 5: Three DMA configurations. ① Single bus, detached DMA: each transfer uses the bus twice (I/O → DMA, then DMA → memory), so the CPU is suspended twice. ② Single bus, integrated DMA: each transfer uses the bus once. ③ Separate I/O bus: I/O devices sit on a dedicated bus bridged by the DMA module, and the system bus is used once per transfer.

Three DMA configurations. Config ①: DMA separate from I/O devices on the same bus, so each word transfer uses the bus twice. Config ②: DMA controller integrates I/O devices, so it reads the device internally and needs one bus access to write memory. Config ③: I/O devices on a dedicated I/O bus, with the DMA module bridging to the system bus, which gives the cleanest separation.
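The practical difference between the configurations is how many system-bus cycles each word costs. The lookup below is a small sketch of that bookkeeping; the dictionary keys and the function name are labels chosen here, not standard terms.

```python
# System-bus cycles consumed per transferred word in each configuration.
SYSTEM_BUS_CYCLES_PER_WORD = {
    "detached": 2,     # I/O -> DMA, then DMA -> memory, both over the system bus
    "integrated": 1,   # device-to-DMA hop is internal; only the memory write uses the bus
    "io_bus": 1,       # device-to-DMA hop travels over the separate I/O bus
}


def system_bus_cycles(configuration, words):
    """Total system-bus cycles (and hence CPU suspensions) for a block of words."""
    return SYSTEM_BUS_CYCLES_PER_WORD[configuration] * words
```

For a 1,024-word block, the detached layout suspends the CPU 2,048 times while the other two suspend it 1,024 times.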

⚖️Technique Comparison

Figure 6: CPU activity across the three I/O techniques while reading a block. Programmed I/O: CPU busy for the entire transfer, user program blocked. Interrupt-driven I/O: user work interleaved with one ISR per word. DMA: user work uninterrupted, with one IRQ at the end of the block.

CPU activity during a block I/O read. Programmed I/O (left): CPU occupies the entire transfer period. Interrupt-Driven (centre): CPU alternates between useful work and short ISRs, one per word. DMA (right): CPU runs user work uninterrupted; only one IRQ at the end of the entire block transfer.

Feature | Programmed I/O | Interrupt-Driven I/O | DMA
CPU during transfer | Busy (polling loop) | Mostly free (ISR per word) | Free (only start/end)
Interrupts per block | 0 (no interrupts) | N (one per word) | 1 (block complete)
Hardware needed | Minimal (none extra) | Interrupt controller | DMA controller + arbiter
Transfer rate | Limited by CPU speed | Still CPU-limited (ISR per word) | Up to full bus bandwidth
Best for | Short transfers, dedicated MCU | Word-sized transfers, low latency | Large block transfers (disk, NIC)
Complexity | Lowest | Medium | Highest

🔬VLSI Connections

🔬 DMA controllers in SoC: AMBA DMA-330 and channel descriptors

Every non-trivial SoC contains one or more DMA controllers. ARM's PL330 (DMA-330) is a widely used AMBA DMA engine found in Rockchip, Samsung Exynos, and many other SoCs. The DMA-330 supports up to 8 channels, each with a dedicated thread executing a small program that plays the role of the channel descriptor. Instead of simple source/destination/count registers, the DMA-330 executes a small instruction set (DMALP for loop, DMALD for load, DMAST for store, DMAEND for end) fetched from memory. This allows complex scatter-gather transfers, such as reading from 10 non-contiguous source regions and writing to a single destination, with a single CPU programming operation.
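The scatter-gather idea can be illustrated with a generic descriptor list. This sketch is deliberately not DMA-330 microcode: the `(source_address, length)` descriptor format and the `scatter_gather_copy` function are assumptions chosen to show the concept of one programming operation driving several non-contiguous copies.

```python
def scatter_gather_copy(memory, descriptors, destination):
    """Gather several source regions into one contiguous destination.

    memory: list of words standing in for RAM.
    descriptors: list of (source_address, length) pairs, processed in order;
    every region is assumed to lie inside memory.
    Returns the number of words written.
    """
    write_at = destination
    for source, length in descriptors:
        # One descriptor is one contiguous burst from source to destination.
        memory[write_at:write_at + length] = memory[source:source + length]
        write_at += length
    return write_at - destination


ram = list(range(20)) + [0] * 6
written = scatter_gather_copy(ram, [(2, 2), (10, 3), (5, 1)], destination=20)
```

The CPU builds the descriptor list once; the engine then walks it without further CPU involvement, which is what keeps the cost at a single programming operation.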

🔬 Cache coherence and DMA: a critical SoC design problem

DMA creates a subtle cache coherence problem. If the CPU writes data to a buffer and its L1 write-back cache has not yet written those dirty lines back to DRAM, the DMA controller reads stale data from DRAM. The reverse also happens: after a DMA write into memory, the CPU may still hit an old cached copy. Solutions: (1) Cache maintenance in software: the driver explicitly flushes dirty lines before DMA TX and invalidates cache lines after DMA RX. (2) Hardware-coherent DMA: the DMA's AXI master port is connected to the cache-coherent interconnect (ACE or CHI), so DMA traffic automatically participates in the hardware coherence protocol. ARM's CCI-400 and CMN-600 support hardware-coherent DMA. Mismatches between the chosen approach and its implementation in RTL and driver code are among the most common sources of DMA-related bugs in SoC design.
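The stale-data hazard and the software fix can be shown with a toy model. This is a sketch only: `WriteBackCache`, `dma_read`, and `dma_write` are hypothetical names, one address stands for one cache line, and a Python dict stands in for DRAM.

```python
class WriteBackCache:
    """Toy write-back cache in front of a DRAM dict; the DMA engine bypasses it."""

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}   # address -> (value, dirty)

    def cpu_write(self, address, value):
        self.lines[address] = (value, True)    # held in cache, DRAM not updated yet

    def cpu_read(self, address):
        if address not in self.lines:
            self.lines[address] = (self.dram[address], False)
        return self.lines[address][0]

    def flush(self, address):
        """Write a dirty line back to DRAM (needed before the DMA reads: TX)."""
        value, dirty = self.lines.get(address, (None, False))
        if dirty:
            self.dram[address] = value
            self.lines[address] = (value, False)

    def invalidate(self, address):
        """Drop the cached copy (needed after the DMA writes: RX)."""
        self.lines.pop(address, None)


def dma_read(dram, address):
    return dram[address]      # a non-coherent DMA engine sees DRAM only


def dma_write(dram, address, value):
    dram[address] = value
```

Writing 42 through the cache and then calling `dma_read` returns the old DRAM value until `flush` runs; likewise, after `dma_write` the CPU keeps reading its cached value until `invalidate` runs. A hardware-coherent interconnect removes the need for both calls.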

🔬 Interrupt latency vs throughput: the engineering trade-off

The three I/O techniques represent a fundamental engineering trade-off between latency and throughput. Programmed I/O has the lowest latency: the CPU reacts to data as fast as it polls, with no interrupt overhead. Interrupt-driven I/O has moderate latency with multitasking. DMA has the highest throughput for bulk transfers but adds setup latency (programming the controller, cache maintenance) that makes it inefficient for small transfers. In SoC peripheral design: a touch screen controller generating 120 events/second uses interrupt-driven I/O; a camera ISP writing 4K frames at 60 fps uses DMA; a JTAG debug interface reading one byte at a time uses programmed I/O.
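The point about DMA being inefficient for small transfers can be made concrete with a break-even estimate of CPU cycles. This is a rough sketch: the cost model ignores bus contention, and all cycle counts in the usage note are hypothetical values, not measurements.

```python
import math


def pio_cpu_cycles(words, cycles_per_word):
    """CPU cycles spent moving a block word by word under programmed I/O."""
    return words * cycles_per_word


def dma_cpu_cycles(setup_cycles, completion_irq_cycles):
    """CPU cycles spent on a DMA transfer: fixed cost, independent of block size."""
    return setup_cycles + completion_irq_cycles


def break_even_words(setup_cycles, completion_irq_cycles, cycles_per_word):
    """Smallest block size at which DMA costs the CPU no more than programmed I/O."""
    fixed = dma_cpu_cycles(setup_cycles, completion_irq_cycles)
    return math.ceil(fixed / cycles_per_word)
```

Assuming 400 cycles of setup (including cache maintenance), 100 cycles for the completion interrupt, and 20 cycles per word under programmed I/O, `break_even_words(400, 100, 20)` gives 25 words; below that size the fixed DMA cost dominates.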

Summary of CA-09 key points: Three I/O techniques with decreasing CPU involvement. Programmed I/O (polling): CPU controls every step and polls the status register in a busy-wait loop, which wastes CPU time; best for single-purpose MCUs or very short transfers. Interrupt-Driven I/O: CPU issues a command and does other work; device asserts IRQ when ready; CPU runs an ISR for each word, which eliminates polling waste but still has the CPU handling every word; four device identification methods (dedicated line, software poll, daisy chain, bus mastering). DMA: CPU programs the DMA controller (direction, device address, memory start, word count); DMA transfers the entire block via cycle stealing; single IRQ on completion, leaving the CPU free throughout. DMA creates cache coherence challenges requiring software cache maintenance or hardware-coherent bus connections.