CA-04: System Bus & Bus Design — VLSI Trainers
Computer Architecture · Article 4 of 12

What a bus is, how the system bus connects CPU, memory, and I/O, the three bus line types, and the five key elements of bus design — type, arbitration, timing, width, and data transfer. Synchronous vs asynchronous timing diagrams explained.

🚌 What Is a Bus?

A bus is a shared communication pathway connecting two or more devices. The word “bus” comes from the Latin omnibus — “for all” — reflecting that the pathway is available to all connected devices, not just a point-to-point link between two.

Three key properties define a bus:

Bus organisation: A bus is usually a group of parallel wires bundled together, each carrying one bit. A 32-bit data bus is 32 separate single-bit lines. In physical hardware, buses appear as: parallel copper traces on a PCB, ribbon cables, or edge-connector slots (like PCI/PCIe slots on a motherboard).
Figure 1 — Bus as shared medium: one transmits, all receive
[Figure 1 diagram: the CPU transmits on a shared bus while Memory, I/O Module 1, I/O Module 2, and the DMA controller all receive. Only one device may transmit at a time. If two devices drive the bus simultaneously, their signals overlap and the data is corrupted (bus contention). Bus arbitration, a protocol that grants exclusive bus ownership to one device at a time, prevents this.]

The CPU is transmitting — its signal is broadcast to all connected devices simultaneously. Memory, I/O Module 1, I/O Module 2, and the DMA controller all see the signal at the same time. The address on the address bus identifies which device should respond. Only the addressed device acts on the data.
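The broadcast-and-decode behaviour described above can be sketched in a few lines of Python. This is a minimal illustrative model, not a description of any real bus: the class names, the address ranges, and the `BusContention` exception are all assumptions made for the example.

```python
class BusContention(Exception):
    """Raised when more than one device drives the bus in the same cycle."""


class Device:
    def __init__(self, name, base, size):
        self.name = name
        self.base = base        # first address this device answers to (hypothetical)
        self.size = size        # number of addresses it occupies
        self.seen = []          # every (address, data) pair observed on the bus
        self.accepted = []      # only the pairs addressed to this device

    def observe(self, address, data):
        self.seen.append((address, data))            # broadcast: every device sees it
        if self.base <= address < self.base + self.size:
            self.accepted.append((address, data))    # only the addressed device acts


class SharedBus:
    def __init__(self, devices):
        self.devices = devices

    def cycle(self, drivers):
        """drivers: list of (sender_name, address, data) for one bus cycle."""
        if len(drivers) > 1:
            names = ", ".join(d[0] for d in drivers)
            raise BusContention(f"multiple drivers in one cycle: {names}")
        for _, address, data in drivers:
            for dev in self.devices:
                dev.observe(address, data)


memory = Device("memory", base=0x0000, size=0x8000)
io1 = Device("io1", base=0x8000, size=0x0100)
bus = SharedBus([memory, io1])
bus.cycle([("cpu", 0x8004, 0x5A)])   # both devices see it; only io1 accepts it
```

Passing two drivers to `cycle` raises `BusContention`, which is the situation arbitration exists to prevent.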

🔌 Bus Structure — Three Line Types

A system bus typically consists of 50 to 100 separate lines grouped into three functional sets. Each line is a single-bit channel; together they form the full bus:

Figure 2 — System bus: address, data, and control line groups
[Figure 2 diagram: in the CPU, the MAR drives the address bus, the MBR connects to the data bus, and the control unit (CU) drives the control bus. Memory and an I/O module attach to all three groups. In memory, the address selects the word location and data is read or written there; in the I/O module, the higher address bits select the I/O device and data moves to or from the peripheral. Address bus: 32–64 lines, unidirectional from the CPU. Data bus: 8, 16, 32, or 64 lines, bidirectional. Control bus: command, timing, and interrupt lines.]

The three system bus line groups. Address lines (blue) run unidirectionally from CPU to all devices — they specify which memory location or I/O port is being accessed. Data lines (orange) are bidirectional — the CPU writes data to memory on a write, or reads data from memory back to the CPU. Control lines (purple) carry command and timing signals in both directions.

| Bus group | Direction | Width | What it carries | Impact on system |
| --- | --- | --- | --- | --- |
| Address bus | CPU → all (unidirectional) | 32 or 64 lines | Memory address or I/O port number being accessed | Width determines maximum addressable memory: 32-bit → 4 GB, 64-bit → 16 EB |
| Data bus | Bidirectional | 8, 16, 32, or 64 lines | The actual data value being read or written | Width determines transfer bandwidth: a 64-bit bus moves 8 bytes per cycle, an 8-bit bus moves 1 |
| Control bus | Bidirectional | ~10–20 lines | Read/write commands, IRQ, clock, ACK, bus request/grant | Determines bus protocol richness and arbitration capability |

🔍 Worked Example — Data bus width and instruction fetch efficiency

Scenario: A CPU has an 8-bit data bus. Each instruction is 16 bits (2 bytes). How many memory accesses are needed per instruction fetch?

Answer: 2 memory accesses. The CPU must fetch the instruction one byte at a time, so each instruction fetch takes twice as many bus accesses as it would over a 16-bit bus.

Solution: Widen the data bus to 16 bits → 1 access per instruction fetch. This doubles throughput with no change to clock frequency. Most modern CPUs use 64-bit data buses and fetch multiple instructions per cycle.
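The arithmetic in this worked example is a ceiling division, which can be checked with a short Python sketch. The function name is hypothetical, and the model assumes every access uses the full bus width and ignores alignment.

```python
def accesses_per_fetch(instruction_bits, data_bus_bits):
    """Bus accesses needed to fetch one instruction (ceiling division)."""
    return -(-instruction_bits // data_bus_bits)


narrow = accesses_per_fetch(16, 8)    # 8-bit bus, 16-bit instruction
wide = accesses_per_fetch(16, 16)     # 16-bit bus, 16-bit instruction
print(narrow, wide)
```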

Address bus example: The Intel 8080 (1974) had a 16-bit address bus → 2¹⁶ = 64 KB maximum memory. The Intel 8086 (1978) had a 20-bit address bus → 1 MB. The 80386 (1985) had a 32-bit address bus → 4 GB. Modern 64-bit processors do not bring out all 64 address bits; an implementation with 48 address bits, for example, can address 2⁴⁸ = 256 TB.
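The relationship between address width and address space is simply 2 raised to the number of address lines. The sketch below reproduces the figures quoted above; the helper names are hypothetical, and the units are binary (1 KB = 1024 bytes), matching the article's usage.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with this many address lines."""
    return 1 << address_bits


def human(n_bytes):
    """Format a power-of-two byte count using binary units (1 KB = 1024 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    i = 0
    while n_bytes >= 1024 and i < len(units) - 1:
        n_bytes //= 1024
        i += 1
    return f"{n_bytes} {units[i]}"


for bits in (16, 20, 32, 48, 64):
    print(f"{bits}-bit address bus: {human(addressable_bytes(bits))}")
```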

🎛️ Control Lines Reference

The control bus carries all the signalling needed to orchestrate bus transactions. Each line has a specific function:

| Control signal | Direction | Meaning |
| --- | --- | --- |
| Memory Write | CPU → Memory | Data on the data bus should be written to the address on the address bus |
| Memory Read | CPU → Memory | Data at the addressed memory location should be placed on the data bus |
| I/O Write | CPU → I/O | Data on the data bus should be output to the addressed I/O port |
| I/O Read | CPU → I/O | Data from the addressed I/O port should be placed on the data bus |
| Transfer ACK | Slave → Master | Data has been accepted from or placed on the bus (handshake acknowledgement) |
| Bus Request | Device → Arbiter | A module (e.g. DMA) needs to gain control of the bus |
| Bus Grant | Arbiter → Device | The requesting module has been granted control of the bus; it may now drive address and data |
| Interrupt Request | Device → CPU | An interrupt is pending; the device needs CPU attention |
| Interrupt ACK | CPU → Device | The pending interrupt has been recognised; the CPU is about to handle it |
| Clock | Clock source → all | Synchronises all bus operations to a common clock edge (synchronous buses) |
| Reset | CPU/System → all | Initialises all modules to a known power-on state |
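
To make the command lines concrete, here is a small Python sketch of how the four read/write commands and Transfer ACK could interact during one transfer. It is an illustrative model only: the `Command` enum, the `bus_transaction` function, and the dictionary-based memory and port maps are assumptions for the example, not part of any real bus specification.

```python
from enum import Enum, auto


class Command(Enum):
    """The four command lines from the control-signal table."""
    MEMORY_READ = auto()
    MEMORY_WRITE = auto()
    IO_READ = auto()
    IO_WRITE = auto()


def bus_transaction(command, address, data, memory, io_ports):
    """Perform one transfer; return (value_on_data_bus, transfer_ack)."""
    # Memory commands target the memory map, I/O commands the port map.
    is_memory = command in (Command.MEMORY_READ, Command.MEMORY_WRITE)
    target = memory if is_memory else io_ports
    if address not in target:
        return None, False              # nobody decoded the address: no ACK
    if command in (Command.MEMORY_WRITE, Command.IO_WRITE):
        target[address] = data          # slave accepts the value on the data bus
        return data, True
    return target[address], True        # slave places the value on the data bus


memory = {0x100: 0}         # hypothetical memory with one populated location
io_ports = {0x60: 0x1C}     # hypothetical I/O port holding a status byte
```

A write followed by a read of the same location returns the written value with the ACK flag set, while an access to an unmapped address returns no ACK.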

🔗 Single vs Multiple Bus

A single system bus connecting all components is the simplest design, but it creates a bottleneck as the system grows. Adding more devices increases contention — only one device can transmit at a time, so traffic from many devices queues up.

Single bus problems: (1) Propagation delay increases as more devices load the bus, so signals take longer to settle. (2) As aggregate transfer demand approaches the bus bandwidth, the bus becomes a bottleneck: a fast CPU on a slow shared bus spends much of its time idle, waiting for memory. Solution: multiple buses at different speed tiers.
Figure 3 — Single bus vs multi-bus hierarchy
[Figure 3 diagram. Left, single bus (bottleneck): CPU, cache, RAM, I/O 1, I/O 2, disk, and USB all share one system bus. Right, multi-bus hierarchy (modern): CPU, cache, and RAM sit on a high-speed bus (e.g. FSB / HyperTransport); a bridge connects it to an expansion bus (PCI / USB) carrying I/O 1, I/O 2, and disk.]

Single bus (left): all devices share one path — fast CPU is delayed by slow disk or USB traffic. Multi-bus hierarchy (right): high-speed bus handles CPU-cache-RAM traffic; a bridge connects to a slower expansion bus for I/O devices. The bridge isolates the two domains so slow I/O does not stall fast CPU-memory operations.
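A rough back-of-the-envelope model shows why isolating slow devices helps. The sketch below assumes, as a simplification, that on a single bus the slow device holds the bus for its entire transfer, while in the hierarchy the bridge buffers the data and forwards it at the fast bus's rate. The rates and block size are hypothetical round numbers, not measurements.

```python
def bus_busy_ms(bytes_moved, rate_bytes_per_s):
    """Time in milliseconds the bus is held while bytes_moved cross it at the given rate."""
    return 1000 * bytes_moved / rate_bytes_per_s


DISK_RATE = 10e6        # hypothetical slow device: 10 MB/s
FAST_BUS_RATE = 1e9     # hypothetical CPU-memory bus: 1 GB/s
BLOCK = 1_000_000       # 1 MB moved from disk to memory

# Single bus: the disk paces the transfer, so the shared bus is tied up at disk speed.
single_bus_ms = bus_busy_ms(BLOCK, DISK_RATE)

# Hierarchy: the bridge buffers the block and forwards it at fast-bus speed.
hierarchy_ms = bus_busy_ms(BLOCK, FAST_BUS_RATE)

print(f"fast bus occupied: {single_bus_ms:.0f} ms (single bus) vs {hierarchy_ms:.0f} ms (via bridge)")
```

Under these assumptions the CPU-memory path is blocked for a small fraction of the time it would be on a single shared bus.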

🧩 Elements of Bus Design — Overview

Five elements must be specified when designing a bus. Together they determine the bus’s performance, cost, and complexity:

① BUS TYPE — Dedicated vs Multiplexed
Are address and data lines separate wires, or do they share the same lines at different times?
② ARBITRATION — Centralised vs Distributed
When multiple devices want the bus simultaneously, which device decides who gets it?
③ TIMING — Synchronous vs Asynchronous
Are bus events locked to a clock, or do they use handshake signals to self-time?
④ BUS WIDTH — Address & Data widths
How many address bits (determines max memory)? How many data bits (determines bandwidth)?
⑤ DATA TRANSFER TYPE — Read, Write, Block…
What kinds of transfers does the bus protocol support — single word, burst, read-modify-write?

Element 1 — Bus Type: Dedicated vs Multiplexed

| Type | How it works | Advantage | Disadvantage |
| --- | --- | --- | --- |
| Dedicated | Separate physical lines permanently assigned to address, separate lines for data. Both sets active simultaneously. | Highest throughput: address and data transfer can overlap. Simpler control logic. Lower latency. | More physical wires needed: higher pin count, larger connectors, more PCB traces. |
| Multiplexed | Address and data share the same lines. An “address valid” control signal indicates when the lines carry an address; “data valid” indicates when they carry data. | Fewer lines → lower pin count → cheaper packaging, smaller connectors, simpler PCB layout. Critical for ICs with limited pins. | More complex control logic. Address and data cannot transfer simultaneously, which adds latency. Lower peak throughput. |

Real example — Intel 8086: The original 8086 used a multiplexed address/data bus (AD0–AD15). The same 16 pins carried the address in the first clock cycle and data in subsequent cycles. The ALE (Address Latch Enable) signal told external logic when to latch the address. This saved 16 pins but required an external 8282/8283 latch chip — exactly the trade-off between cost and complexity described above.
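
The role of ALE on a multiplexed bus can be illustrated with a small Python model of a transparent latch. This is a behavioural sketch under simplifying assumptions (one address phase, one data phase, no timing), and the class and function names are hypothetical rather than taken from any Intel documentation.

```python
class AddressLatch:
    """Transparent latch: follows its input while ALE is high, holds it when ALE is low."""

    def __init__(self):
        self.value = 0

    def update(self, ad_lines, ale):
        if ale:
            self.value = ad_lines
        return self.value


def multiplexed_write(memory, latch, address, data):
    """Write one word over shared address/data lines in two phases."""
    # Phase 1: the shared AD lines carry the address; ALE high captures it.
    latch.update(ad_lines=address, ale=True)
    # Phase 2: ALE low; the same AD lines now carry data, and the latch
    # keeps presenting the address to memory.
    held_address = latch.update(ad_lines=data, ale=False)
    memory[held_address] = data


memory = {}
multiplexed_write(memory, AddressLatch(), address=0x1234, data=0xBEEF)
```

Without the latch, the address would be lost as soon as the shared lines switch to carrying data, which is why the external latch chip is needed.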

Element 2 — Arbitration: Centralised vs Distributed

When more than one device wants bus access simultaneously (CPU, DMA controller, and I/O module all need the bus), arbitration determines which device gets control. Only one device may be the bus master at any time — others must wait.

Figure 4 — Bus arbitration: centralised (left) vs distributed (right)
[Figure 4 diagram. Left, centralised arbitration: CPU, DMA, and I/O each send a Bus Request (BR) to a single bus arbiter, which returns a Bus Grant (BG) to one requester. Right, distributed arbitration: CPU, DMA, and I/O each contain their own arbitration logic and priority ID, and negotiate over shared priority lines on the system bus.]

Centralised arbitration (left): one Bus Arbiter receives Bus Requests from all devices and sends Bus Grants. Simple but a single point of failure. Distributed arbitration (right): each device contains its own priority logic and negotiates access via shared arbitration wires — more resilient but more complex per device.
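
Both approaches can be sketched briefly in Python. The centralised version below uses a fixed priority order, and the distributed version uses bit-by-bit comparison of priority IDs on wired-OR lines. Both are assumptions chosen for illustration, since the article does not specify a particular arbitration policy, and the function names are hypothetical.

```python
def centralised_grant(requests, priority_order):
    """Central arbiter: grant the bus to the highest-priority requester, if any."""
    for device in priority_order:
        if device in requests:
            return device
    return None


def distributed_winner(priority_ids, width=4):
    """Distributed scheme: devices compare IDs bit by bit on shared wired-OR lines."""
    contenders = set(priority_ids)
    for bit in reversed(range(width)):
        # The shared line reads 1 if any remaining contender drives a 1.
        line = any((dev_id >> bit) & 1 for dev_id in contenders)
        if line:
            # Devices driving 0 on this bit see a 1 on the line and withdraw.
            contenders = {d for d in contenders if (d >> bit) & 1}
    return contenders.pop() if contenders else None


granted = centralised_grant({"CPU", "I/O"}, priority_order=["DMA", "I/O", "CPU"])
winner = distributed_winner([3, 9, 6])
print(granted, winner)
```

In the distributed version no single component makes the decision: every device applies the same rule locally, and with unique IDs exactly one contender remains.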

Element 3 — Timing: Synchronous vs Asynchronous

Timing defines how bus events are coordinated between devices. The two approaches are fundamentally different in their clock relationship:

Figure 5 — Synchronous vs asynchronous bus timing for a read operation
[Figure 5 diagram. Top, synchronous read: all events are locked to a shared clock (cycles T1–T5); the master places the address on ADDR at a known clock edge and the slave's data is valid on DATA a fixed number of cycles later, so no handshake is needed. Bottom, asynchronous read: there is no clock, and the MSYN and SSYN signals coordinate the transfer in four steps: (1) the master puts the address on the bus and asserts MSYN; (2) the slave puts the data on the bus and asserts SSYN; (3) the master reads the data and deasserts MSYN; (4) the slave deasserts SSYN. The address stays stable while MSYN is asserted.]

Synchronous (top): all events happen at predetermined clock edges — simple, fast, but all devices must run at the same speed. Asynchronous (bottom): MSYN/SSYN handshake signals allow any-speed slave — the master waits until the slave signals data ready, regardless of how many cycles that takes.

| Aspect | Synchronous | Asynchronous |
| --- | --- | --- |
| Clock | Shared clock line; all events occur on clock edges | No clock; handshake signals only |
| Speed | Faster: no handshake overhead | Slower per transfer: the handshake adds latency |
| Device speed | All devices must run at the same bus speed | Works with any-speed device; the slave dictates its own readiness |
| Complexity | Simple: count clock edges | More complex: a handshake state machine is needed |
| Max cable length | Limited by clock skew | Longer cables possible, since there is no clock to skew |
| Examples | PCI, DDR SDRAM, SPI | Original Unibus, VMEbus |
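
The four-step MSYN/SSYN handshake described for Figure 5 can be traced with a short Python sketch. It is a sequential model under simplifying assumptions: time advances in arbitrary steps, the slave's latency is a parameter, and the function name and trace format are hypothetical.

```python
def async_read(memory, address, slave_latency):
    """Trace a four-phase MSYN/SSYN read.

    slave_latency is how many time steps the slave needs before its data is ready.
    Returns (data, trace), where each trace row is (time, event, msyn, ssyn).
    """
    trace = []
    t = 0

    # Step 1: master drives the address and asserts MSYN.
    msyn, ssyn = True, False
    trace.append((t, "master asserts MSYN", msyn, ssyn))

    # Step 2: slave responds whenever it is ready, then asserts SSYN.
    t += slave_latency
    data = memory[address]
    ssyn = True
    trace.append((t, "slave asserts SSYN", msyn, ssyn))

    # Step 3: master latches the data and deasserts MSYN.
    t += 1
    msyn = False
    trace.append((t, "master deasserts MSYN", msyn, ssyn))

    # Step 4: slave sees MSYN low and deasserts SSYN; the bus is idle again.
    t += 1
    ssyn = False
    trace.append((t, "slave deasserts SSYN", msyn, ssyn))

    return data, trace


fast_data, fast_trace = async_read({0x40: 7}, 0x40, slave_latency=1)
slow_data, slow_trace = async_read({0x40: 7}, 0x40, slave_latency=20)
```

Both calls return the same data; only the completion time differs, which is the sense in which the slave dictates its own readiness.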

Element 4 — Bus Width

Bus width affects two separate system properties — address bus width affects memory capacity; data bus width affects bandwidth:

| Bus | Effect of more bits | Impact | Example |
| --- | --- | --- | --- |
| Address bus width | Larger address space | 32-bit → 2³² = 4 GB max memory. 64-bit → 2⁶⁴ = 16 EB max memory | Intel 8080: 16-bit → 64 KB. An x86-64 part implementing 48 address bits → 256 TB |
| Data bus width | More data per cycle | 8-bit bus: 1 byte/cycle. 64-bit bus: 8 bytes/cycle, i.e. 8× the bandwidth at the same clock rate | 8088: 8-bit external. 8086: 16-bit. Pentium: 64-bit. DDR5 DIMM: 64 data bits as two 32-bit subchannels, plus ECC bits on ECC modules |

Key insight: address and data widths are independent. The Intel 8088 and 8086 both had a 20-bit address bus and 16-bit internal data paths, but the 8088 exposed only an 8-bit external data bus so that it could be paired with cheaper 8-bit memory and support chips. As a result it needed two memory accesses per 16-bit word. The 8086, with its 16-bit external data bus, was faster at the same clock rate. Address width sets how much memory can be reached; data bus width directly determines instruction fetch and data transfer throughput.

Element 5 — Data Transfer Types

A bus protocol defines which types of data transfers it supports. All buses support basic read and write; richer protocols add compound and burst operations:

| Transfer type | Description | Use case |
| --- | --- | --- |
| Read | Slave → Master: master requests data at an address; slave places it on the data bus | CPU reads from memory or I/O port |
| Write | Master → Slave: master sends address + data; slave stores it | CPU writes to memory or I/O port |
| Read-modify-write | Atomic read then write to the same address without releasing the bus between operations | Semaphore operations, test-and-set, compare-and-swap; critical for multiprocessor synchronisation |
| Read-after-write | Write followed immediately by a read from the same address to verify the write succeeded | Verifying writes to I/O-mapped hardware registers |
| Block transfer | Multiple consecutive words transferred in a burst: one address phase, multiple data phases | Cache line fill (typically 64 bytes = 8 × 64-bit words), DMA transfers, disk sector reads |

Why block transfer matters: A cache line fill requires 64 bytes. With a 64-bit data bus at 100 MHz, fetching 64 bytes one word at a time with full address-data-ACK cycles = 8 × 3 cycles = 24 cycles minimum. With a burst transfer — one address phase + 8 back-to-back data phases = 9 cycles. Block transfers are the key reason modern memory systems can sustain high bandwidth despite long initial latency.
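
The cycle counts in this comparison can be reproduced with a few lines of Python. The sketch uses the article's own figures (3 cycles per single-word transfer, 1 address cycle per burst, 100 MHz clock); the function names are hypothetical, and the model ignores wait states and initial memory latency.

```python
def single_word_cycles(words, cycles_per_word=3):
    """Each word pays for its own address, data, and ACK cycle."""
    return words * cycles_per_word


def burst_cycles(words, address_cycles=1):
    """One address phase followed by back-to-back data phases."""
    return address_cycles + words


def cycles_to_ns(cycles, clock_hz):
    return cycles * 1e9 / clock_hz


LINE_BYTES = 64                    # cache line size
BUS_BYTES = 8                      # 64-bit data bus
words = LINE_BYTES // BUS_BYTES    # 8 words per cache line

single = single_word_cycles(words)
burst = burst_cycles(words)
print(f"single-word: {single} cycles, {cycles_to_ns(single, 100e6):.0f} ns")
print(f"burst:       {burst} cycles, {cycles_to_ns(burst, 100e6):.0f} ns")
```

At 100 MHz each cycle is 10 ns, so the two cases take 240 ns and 90 ns respectively.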

🔬 VLSI Connections

🔬 System bus → AMBA (AXI / AHB / APB) in modern SoCs

Every concept in this article maps directly to the AMBA bus family used in ARM-based SoCs. The three bus line types (address, data, control) become the AXI4 channel signals: AWADDR/ARADDR (address), WDATA/RDATA (data), and AWVALID/AWREADY/WVALID/WREADY/BVALID/BREADY (control/handshake). AXI4’s separate read and write channels are the “dedicated bus” approach applied at the channel level. Synchronous timing is used — all signals are sampled on the rising clock edge. Bus arbitration is implemented in the AXI interconnect fabric (crossbar or bus matrix) — the functional equivalent of the centralised arbiter. The AMBA specification is essentially a formal, standardised version of the bus design elements you just learned.

🔬 AXI4 handshake = asynchronous-style protocol over a synchronous bus

AXI4’s VALID/READY handshake follows the same idea as the MSYN/SSYN asynchronous handshake in Figure 5, but it is implemented on a synchronous (clocked) bus. The source asserts VALID when it has a valid address or data; the destination asserts READY when it can accept. A transfer occurs only on a rising clock edge at which both VALID and READY are HIGH. This gives the same “any-speed slave” flexibility as an asynchronous bus while retaining the simplicity of synchronous clocking. The two signals are not symmetric, though: once VALID is asserted it must stay asserted until the handshake completes, whereas READY may be deasserted at any time. Holding READY low is the back-pressure mechanism AXI4 uses for flow control. Understanding the asynchronous handshake concept from this article is the key to understanding why AXI4 works the way it does.
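
A cycle-by-cycle Python sketch makes the VALID/READY rule concrete. It models a single channel under simplifying assumptions: the source always has the next payload available, it holds VALID high until the payload is accepted, and the destination's READY behaviour is supplied as a list of booleans. The function name and log format are hypothetical and are not part of the AMBA specification.

```python
def valid_ready_transfer(payloads, ready_pattern):
    """Simulate one VALID/READY channel.

    ready_pattern gives the destination's READY value for each clock cycle.
    Returns (received, log), where each log row is (cycle, valid, ready).
    """
    pending = list(payloads)
    received = []
    log = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)                  # source holds VALID until accepted
        if valid and ready:                    # transfer only when both are high
            received.append(pending.pop(0))
        log.append((cycle, valid, ready))
    return received, log


received, log = valid_ready_transfer(
    ["A", "B"],
    ready_pattern=[False, True, False, False, True, True],
)
```

Here "A" transfers in cycle 1 and "B" in cycle 4; in cycles 0, 2, and 3 the destination applies back-pressure by keeping READY low while VALID stays high.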

🔬 Bus width → memory interface design (LPDDR5, HBM)

The bus width trade-off is central to memory interface design. LPDDR5 uses a 16-bit data bus per channel to minimise power and pin count on mobile SoCs: width is traded for lower power. HBM (High Bandwidth Memory), used in GPUs and AI accelerators, achieves very high bandwidth via a 1024-bit-wide bus per stack across a short interposer: width is maximised at the cost of physical complexity. A DDR5 DIMM carries 64 data bits as two independent 32-bit subchannels, with extra ECC bits on ECC modules. Every memory PHY you implement or verify expresses bus width as the number of DQ (data) pins. Understanding bus width from first principles helps you reason about why an HBM-based GPU can reach around 900 GB/s while an LPDDR5 system delivers around 68 GB/s: the difference comes mainly from the number of data pins, not from a faster per-pin rate.
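
Peak interface bandwidth is the number of data pins multiplied by the per-pin data rate. The Python sketch below applies that formula; the per-pin rates (2.0 Gb/s for an HBM2-class stack, 6.4 Gb/s for LPDDR5) are illustrative assumptions rather than figures from this article, and real devices vary by generation and speed grade.

```python
def peak_bandwidth_gb_s(data_pins, gbit_per_pin):
    """Peak bandwidth in GB/s: pins x per-pin rate in Gb/s, divided by 8 bits per byte."""
    return data_pins * gbit_per_pin / 8


# Assumed, illustrative per-pin rates; check the relevant datasheet for real values.
hbm_stack = peak_bandwidth_gb_s(data_pins=1024, gbit_per_pin=2.0)
lpddr5_channel = peak_bandwidth_gb_s(data_pins=16, gbit_per_pin=6.4)

print(f"1024-bit stack at 2.0 Gb/s per pin: {hbm_stack:.1f} GB/s")
print(f"16-bit channel at 6.4 Gb/s per pin: {lpddr5_channel:.1f} GB/s")
```

Even with a lower per-pin rate, the wide interface comes out far ahead, and systems multiply this further by using several stacks or channels in parallel.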

Summary — CA-04 key points: A bus is a shared broadcast medium — only one device transmits at a time; contention causes corruption. Three bus line types: Address (specifies location, unidirectional), Data (carries value, bidirectional), Control (command + timing + interrupt signals, bidirectional). Five bus design elements: (1) Type — dedicated (separate lines, higher performance) vs multiplexed (shared lines, fewer pins); (2) Arbitration — centralised (single arbiter, simpler) vs distributed (per-device logic, more resilient); (3) Timing — synchronous (clock-based, faster, fixed speed) vs asynchronous (handshake-based, any-speed slave); (4) Width — address width determines max memory; data width determines transfer bandwidth; (5) Data transfer types — read, write, read-modify-write, read-after-write, block transfer. Multiple bus hierarchies solve the single-bus bottleneck by separating fast and slow traffic.