What a bus is, how the system bus connects CPU, memory, and I/O, the three bus line types, and the five key elements of bus design — type, arbitration, timing, width, and data transfer. Synchronous vs asynchronous timing diagrams explained.
A bus is a shared communication pathway connecting two or more devices. The word “bus” comes from the Latin omnibus — “for all” — reflecting that the pathway is available to all connected devices, not just a point-to-point link between two.
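Three key properties define a bus: it is a shared medium, every signal is broadcast to all attached devices, and only one device can transmit at a time.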
The CPU is transmitting — its signal is broadcast to all connected devices simultaneously. Memory, I/O Module 1, I/O Module 2, and the DMA controller all see the signal at the same time. The address on the address bus identifies which device should respond. Only the addressed device acts on the data.
A system bus typically consists of 50 to 100 separate lines grouped into three functional sets. Each line is a single-bit channel; together they form the full bus:
The three system bus line groups. Address lines (blue) run unidirectionally from CPU to all devices — they specify which memory location or I/O port is being accessed. Data lines (orange) are bidirectional — the CPU writes data to memory on a write, or reads data from memory back to the CPU. Control lines (purple) carry command and timing signals in both directions.
| Bus group | Direction | Width | What it carries | Impact on system |
|---|---|---|---|---|
| Address Bus | CPU → all (unidirectional) | 32 or 64 lines | Memory address or I/O port number being accessed | Width determines max addressable memory: 32-bit → 4 GB, 64-bit → 16 EB |
| Data Bus | Bidirectional | 8, 16, 32, or 64 lines | The actual data value being read or written | Width determines transfer bandwidth: 64-bit bus moves 8 bytes per cycle vs 1 byte for 8-bit |
| Control Bus | Bidirectional | ~10–20 lines | Read/write commands, IRQ, clock, ACK, bus grant/request | Determines bus protocol richness and arbitration capability |
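To make the broadcast-and-decode idea concrete, here is a minimal Python sketch. It is an illustration only: the class names `Device` and `SystemBus`, the address ranges, and the values are hypothetical and not taken from any real system. Every device sees each address, but only the one whose range matches responds.

```python
class Device:
    """A bus slave that responds only to addresses in its own range."""

    def __init__(self, name, base, size):
        self.name = name
        self.base = base
        self.size = size
        self.cells = [0] * size

    def selected(self, address):
        # Address decode: every device sees the address, at most one matches.
        return self.base <= address < self.base + self.size

    def write(self, address, value):
        self.cells[address - self.base] = value

    def read(self, address):
        return self.cells[address - self.base]


class SystemBus:
    """Broadcasts each transaction to all attached devices."""

    def __init__(self, devices):
        self.devices = devices

    def _target(self, address):
        hits = [d for d in self.devices if d.selected(address)]
        if len(hits) != 1:
            raise ValueError(f"address {address:#x} selects {len(hits)} devices")
        return hits[0]

    def write(self, address, value):
        # Models the address lines, data lines, and a "write" control line.
        self._target(address).write(address, value)

    def read(self, address):
        # Models the address lines and a "read" control line; data flows back.
        return self._target(address).read(address)


ram = Device("RAM", base=0x0000, size=0x1000)   # hypothetical memory map
io1 = Device("IO1", base=0x8000, size=0x10)
bus = SystemBus([ram, io1])
bus.write(0x8004, 0xAB)                         # only IO1 decodes this address
```

Reading `0x8004` back through `bus.read` returns the stored value, while RAM is untouched; an address that no device decodes raises an error in this sketch, whereas real hardware typically signals a bus error or returns undefined data.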
Scenario: A CPU has an 8-bit data bus. Each instruction is 16 bits (2 bytes). How many memory accesses are needed per instruction fetch?
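Answer: 2 memory accesses. The 8-bit bus delivers one byte per access, so the CPU fetches the instruction as two separate bytes (which byte comes first depends on the architecture's byte ordering). Compared with a 16-bit bus, this halves effective instruction-fetch throughput.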
Solution: Widen the data bus to 16 bits → 1 access per instruction fetch. This doubles throughput with no change to clock frequency. Most modern CPUs use 64-bit data buses and fetch multiple instructions per cycle.
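The arithmetic in this scenario generalises to a ceiling division. The helper below is a small sketch; the function name `accesses_per_fetch` is hypothetical, and it assumes each bus access moves exactly one bus-width of data.

```python
def accesses_per_fetch(instruction_bits, data_bus_bits):
    """Number of bus accesses needed to fetch one instruction.

    Assumes every access transfers exactly data_bus_bits bits.
    """
    # Ceiling division using integers only.
    return -(-instruction_bits // data_bus_bits)


narrow = accesses_per_fetch(16, 8)    # the scenario above: 8-bit bus
wide = accesses_per_fetch(16, 16)     # after widening the bus
```

With the scenario's numbers, `narrow` is 2 and `wide` is 1.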
Address bus example: The Intel 8080 (1974) had a 16-bit address bus → 2¹⁶ = 64 KB maximum memory. The Intel 8086 (1978) had a 20-bit address bus → 1 MB. The 80386 (1985) had a 32-bit address bus → 4 GB. Modern 64-bit processors do not implement all 64 address bits; a 48-bit address, for example, reaches 2⁴⁸ = 256 TB, and the physical address width varies by implementation.
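The capacities quoted above all follow from one rule: n address lines select 2ⁿ byte locations. A short sketch, assuming byte-addressable memory and binary units (1 KB = 1024 bytes); the function names are hypothetical:

```python
def addressable_bytes(address_lines):
    """Maximum byte-addressable memory for a given address bus width."""
    return 1 << address_lines


def format_bytes(n):
    """Render a power-of-two byte count using binary units (1 KB = 1024 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    index = 0
    while n >= 1024 and index < len(units) - 1:
        n //= 1024
        index += 1
    return f"{n} {units[index]}"


for width in (16, 20, 32, 48, 64):
    print(width, "lines ->", format_bytes(addressable_bytes(width)))
```

Each extra address line doubles the reachable memory, which is why going from 16 to 20 lines moves the limit from 64 KB to 1 MB.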
The control bus carries all the signalling needed to orchestrate bus transactions. Each line has a specific function:
| Control signal | Direction | Meaning |
|---|---|---|
| Memory Write | CPU → Memory | Data on the data bus should be written to the address on the address bus |
| Memory Read | CPU → Memory | Data at the addressed memory location should be placed on the data bus |
| I/O Write | CPU → I/O | Data on the data bus should be output to the addressed I/O port |
| I/O Read | CPU → I/O | Data from the addressed I/O port should be placed on the data bus |
| Transfer ACK | Slave → Master | Data has been accepted from or placed on the bus (handshake acknowledgement) |
| Bus Request | Device → Arbiter | A module (e.g. DMA) needs to gain control of the bus |
| Bus Grant | Arbiter → Device | The requesting module has been granted control of the bus — it may now drive address and data |
| Interrupt Request | Device → CPU | An interrupt is pending — the device needs CPU attention |
| Interrupt ACK | CPU → Device | The pending interrupt has been recognised — the CPU is about to handle it |
| Clock | Clock source → all | Synchronises all bus operations to a common clock edge (synchronous buses) |
| Reset | CPU/System → all | Initialises all modules to a known power-on state |
A single system bus connecting all components is the simplest design, but it creates a bottleneck as the system grows. Adding more devices increases contention — only one device can transmit at a time, so traffic from many devices queues up.
Single bus (left): all devices share one path — fast CPU is delayed by slow disk or USB traffic. Multi-bus hierarchy (right): high-speed bus handles CPU-cache-RAM traffic; a bridge connects to a slower expansion bus for I/O devices. The bridge isolates the two domains so slow I/O does not stall fast CPU-memory operations.
Five elements must be specified when designing a bus. Together they determine the bus’s performance, cost, and complexity:
| Type | How it works | Advantage | Disadvantage |
|---|---|---|---|
| Dedicated | Separate physical lines permanently assigned to address, separate lines for data. Both sets active simultaneously. | Highest throughput — address and data transfer can overlap. Simpler control logic. Lower latency. | More physical wires needed — higher pin count, larger connectors, more PCB traces. |
| Multiplexed | Address and data share the same lines. An “address valid” control signal indicates when the lines carry an address; “data valid” indicates when they carry data. | Fewer lines → lower pin count → cheaper packaging, smaller connectors, simpler PCB layout. Critical for ICs with limited pins. | More complex control logic. Address and data cannot transfer simultaneously — adds latency. Lower peak throughput. |
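The trade-off in the table can be put into rough numbers. The sketch below uses a deliberately idealised model: a dedicated bus presents address and data in the same cycle, while a multiplexed bus needs one cycle for the address and one for the data. Real buses may add turnaround or wait cycles, and the function name `bus_cost` is hypothetical.

```python
def bus_cost(address_bits, data_bits, multiplexed):
    """Return (signal_lines, cycles_per_write) under an idealised model.

    Assumption: one bus cycle per phase, with no wait states or turnaround.
    """
    if multiplexed:
        # Shared lines must be wide enough for whichever is larger;
        # address phase and data phase happen one after the other.
        return max(address_bits, data_bits), 2
    # Separate line sets; address and data are driven in the same cycle.
    return address_bits + data_bits, 1


dedicated = bus_cost(32, 32, multiplexed=False)
shared = bus_cost(32, 32, multiplexed=True)
```

For a 32-bit address and 32-bit data, this model gives 64 lines and 1 cycle for the dedicated bus against 32 lines and 2 cycles for the multiplexed one: half the pins for twice the time per write.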
When more than one device wants bus access simultaneously (CPU, DMA controller, and I/O module all need the bus), arbitration determines which device gets control. Only one device may be the bus master at any time — others must wait.
Centralised arbitration (left): one Bus Arbiter receives Bus Requests from all devices and sends Bus Grants. Simple but a single point of failure. Distributed arbitration (right): each device contains its own priority logic and negotiates access via shared arbitration wires — more resilient but more complex per device.
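As an illustration of the centralised case, the sketch below models a fixed-priority arbiter. Fixed priority is only one possible policy, chosen here for brevity; the device names and the priority order are hypothetical.

```python
def arbitrate(requests, priority):
    """Grant the bus to the highest-priority requesting device.

    requests: set of device names currently asserting Bus Request.
    priority: list of device names, highest priority first.
    Returns the granted device, or None if nobody is requesting.
    """
    for device in priority:
        if device in requests:
            return device
    return None


priority_order = ["DMA", "IO", "CPU"]           # assumed policy, not a standard
grant = arbitrate({"CPU", "DMA"}, priority_order)
```

Here `grant` is `"DMA"`, so the CPU waits. A fixed order is simple but can starve low-priority devices under heavy load, which is why practical arbiters often rotate priority or enforce fairness limits.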
Timing defines how bus events are coordinated between devices. The two approaches are fundamentally different in their clock relationship:
Synchronous (top): all events happen at predetermined clock edges. This is simple and fast, but all devices must run at the same speed. Asynchronous (bottom): the MSYN/SSYN handshake signals accommodate a slave of any speed. The master waits until the slave signals that the data is ready, however long that takes.
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Clock | Shared clock line — all events on clock edges | No clock — handshake signals only |
| Speed | Faster — no handshake overhead | Slower per transfer — handshake adds latency |
| Device speed | All devices must run at the same bus speed | Works with any-speed device — slave dictates its own readiness |
| Complexity | Simple — count clock edges | More complex — handshake state machine needed |
| Max cable length | Limited by clock skew | Longer cables possible — no skew problem |
| Examples | PCI, DDR SDRAM, SPI | Original Unibus, VMEbus |
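The asynchronous column can be sketched as a sequence of handshake events. The trace below follows the conventional four-phase (full) handshake using the MSYN and SSYN signal names; the function name, the `slave_delay` parameter, and the dictionary standing in for memory are assumptions made for illustration.

```python
def async_read(memory, address, slave_delay):
    """Trace a four-phase MSYN/SSYN read. Returns (data, trace).

    slave_delay is the number of arbitrary time steps the slave needs;
    the master does not know or care about it in advance.
    """
    trace = []
    trace.append(("MSYN", 1))            # 1. master: address is valid
    for _ in range(slave_delay):
        trace.append(("wait", None))     #    slave works at its own pace
    data = memory[address]
    trace.append(("SSYN", 1))            # 2. slave: data is on the bus
    latched = data                       #    master latches the data
    trace.append(("MSYN", 0))            # 3. master: done, releases MSYN
    trace.append(("SSYN", 0))            # 4. slave: releases SSYN, bus idle
    return latched, trace


data, trace = async_read({0x10: 0x5A}, 0x10, slave_delay=3)
```

Changing `slave_delay` lengthens or shortens the trace but never changes the order of the four signal events, which is exactly the property that lets fast and slow devices share one asynchronous bus.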
Bus width affects two separate system properties — address bus width affects memory capacity; data bus width affects bandwidth:
| Bus property | Effect of more bits | Impact | Examples |
|---|---|---|---|
| Address bus width | Larger address space | 32-bit → 2³² = 4 GB max memory. 64-bit → 2⁶⁴ = 16 EB max memory | Intel 8080: 16-bit → 64 KB. x86-64: 48-bit addresses → 256 TB (physical width varies by implementation) |
| Data bus width | More data per cycle | 8-bit bus: 1 byte/cycle. 64-bit bus: 8 bytes/cycle, i.e. 8× the bandwidth at the same clock rate | 8088: 8-bit external. 8086: 16-bit. Pentium: 64-bit. Modern memory channels: 64 data bits (DDR5 splits these into two 32-bit subchannels; ECC modules add check bits) |
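The data-width row reduces to a single multiplication: bytes per transfer times transfers per second. A minimal sketch follows; the 100 MHz transfer rate is an arbitrary illustrative figure, and the function name is hypothetical.

```python
def peak_bandwidth(data_bus_bits, transfers_per_second):
    """Peak bandwidth in bytes per second, assuming one transfer per cycle
    and no idle cycles, wait states, or protocol overhead."""
    return (data_bus_bits // 8) * transfers_per_second


RATE = 100_000_000                       # assumed 100 million transfers/s
narrow_bw = peak_bandwidth(8, RATE)      # 8-bit data bus
wide_bw = peak_bandwidth(64, RATE)       # 64-bit data bus
```

At the same rate the 64-bit bus moves eight times as many bytes as the 8-bit bus. This is a peak figure; sustained bandwidth is lower once arbitration and wait cycles are counted.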
A bus protocol defines which types of data transfers it supports. All buses support basic read and write; richer protocols add compound and burst operations:
| Transfer type | Description | Use case |
|---|---|---|
| Read | Slave → Master: master requests data at an address; slave places it on the data bus | CPU reads from memory or I/O port |
| Write | Master → Slave: master sends address + data; slave stores it | CPU writes to memory or I/O port |
| Read-modify-write | Atomic read then write to the same address without releasing the bus between operations | Semaphore operations, test-and-set, compare-and-swap — critical for multiprocessor synchronisation |
| Read-after-write | Write followed immediately by a read from the same address to verify the write succeeded | Verifying writes to I/O-mapped hardware registers |
| Block transfer | Multiple consecutive words transferred in a burst — one address phase, multiple data phases | Cache line fill (typically 64 bytes = 8 × 64-bit words), DMA transfers, disk sector reads |
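The benefit of a block transfer is easiest to see by counting cycles. The sketch below assumes a simplified model in which every address phase and every data phase takes exactly one cycle; the function names are hypothetical.

```python
def single_transfer_cycles(words):
    """Cycles for `words` independent transfers: each pays its own
    address phase plus one data phase (idealised, no wait states)."""
    return 2 * words


def burst_cycles(words):
    """Cycles for one burst: a single address phase, then one data
    phase per word (idealised, no wait states)."""
    return 1 + words


line_words = 8                           # 64-byte cache line, 64-bit words
separate = single_transfer_cycles(line_words)
burst = burst_cycles(line_words)
```

Under these assumptions a 64-byte cache line fill costs 16 cycles as separate reads but 9 cycles as one burst, and the saving grows with burst length.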
Every concept in this article maps directly to the AMBA bus family used in ARM-based SoCs. The three bus line types (address, data, control) become the AXI4 channel signals: AWADDR/ARADDR (address), WDATA/RDATA (data), and AWVALID/AWREADY/WVALID/WREADY/BVALID/BREADY (control/handshake). AXI4’s separate read and write channels are the “dedicated bus” approach applied at the channel level. Synchronous timing is used — all signals are sampled on the rising clock edge. Bus arbitration is implemented in the AXI interconnect fabric (crossbar or bus matrix) — the functional equivalent of the centralised arbiter. The AMBA specification is essentially a formal, standardised version of the bus design elements you just learned.
AXI4’s VALID/READY handshake is conceptually close to the MSYN/SSYN asynchronous handshake in Figure 5, but implemented on a synchronous (clocked) bus. The master asserts VALID when it has a valid address or data; the slave asserts READY when it can accept. A transfer occurs only when both VALID and READY are HIGH on the same rising clock edge. This gives the same “any-speed slave” flexibility as asynchronous buses while retaining the simplicity of synchronous clocking. Understanding the asynchronous handshake concept from this article is the key to understanding why AXI4 works the way it does, and why a slave can hold READY low to apply back-pressure: once asserted, VALID must stay HIGH until the handshake completes, so the master simply waits. Note that VALID/READY provides flow control within a single clock domain; crossing clock domains additionally requires a synchronising bridge or asynchronous FIFO.
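A cycle-by-cycle model makes the VALID/READY rule tangible. This is a behavioural sketch of a single channel, not RTL and not an implementation of the AMBA specification; the function name and the `ready_pattern` input are assumptions for illustration.

```python
def valid_ready_transfer(items, ready_pattern):
    """Simulate one VALID/READY channel, one loop iteration per clock edge.

    items: data beats the master wants to send, in order.
    ready_pattern: the slave's READY value (0 or 1) on each clock edge.
    Returns a list of (cycle, item) pairs for beats that transferred.
    """
    transfers = []
    pending = list(items)
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)            # master keeps VALID high while it has data
        if valid and ready:              # transfer only when both are high
            transfers.append((cycle, pending.pop(0)))
        # If READY is low, the beat stays pending and VALID stays high.
    return transfers


result = valid_ready_transfer(["a", "b", "c"], [0, 1, 0, 0, 1, 1, 1])
```

With this READY pattern the three beats transfer on cycles 1, 4, and 5; the cycles where READY is low are the slave applying back-pressure, and no data is lost because the master holds each beat until it is accepted.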
The bus width trade-off is central to memory interface design. LPDDR5 uses a 16-bit data bus per channel to minimise power and pin count on mobile SoCs: width is traded for lower power. HBM (High Bandwidth Memory), used in GPUs and AI accelerators, achieves massive bandwidth via a 1024-bit-wide bus per stack across a short interposer: width is maximised at the cost of physical complexity. A DDR5 DIMM carries 64 data bits, organised as two independent 32-bit subchannels, with extra check bits on ECC modules. Every memory PHY you implement or verify expresses bus width as the number of DQ (data) pins. Understanding bus width from first principles helps you reason about why HBM-based systems deliver on the order of 900 GB/s while LPDDR5 systems deliver on the order of 68 GB/s: the gap comes mainly from bus width rather than per-pin speed.