PCIe Series — PCIe-01: Introduction to PCI Express — VLSI Trainers
Introduction to PCI Express
Why the parallel shared bus model hit a wall, how a serial point-to-point link sidesteps all three fundamental limits, the topology every PCIe system shares, the three-layer architecture, and how bandwidth has scaled from 0.5 GB/s on a Gen 1 x1 link to 256 GB/s on a Gen 6 x16 link (both directions combined).
⛔ Why PCI Hit a Wall
PCI was a genuine step forward when it launched in the early 1990s. It gave every device a standardised configuration space, a plug-and-play environment, and 133 MB/s of shared bandwidth. For a decade that was enough. Then GPU cards, SCSI controllers, Gigabit Ethernet, and eventually storage devices started fighting over those same 133 MB/s — and the system became a bottleneck.
PCI-X pushed the frequency to 133 MHz and improved bus efficiency, reaching 1 GB/s on a 64-bit bus. But it was still a parallel shared bus. Every device plugged into the same wires. When one device was talking, no other device could. And going any faster ran straight into three physical problems that the bus model itself could not escape.
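The headline figures for PCI and PCI-X follow directly from bus width times clock rate. A minimal sketch of that arithmetic, assuming one transfer per clock and ignoring arbitration and protocol overhead (the function name is illustrative, not taken from any specification):

```python
def parallel_bus_peak_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth of a parallel bus in MB/s, assuming one transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz  # bytes per transfer x million transfers/s

pci = parallel_bus_peak_mb_s(32, 33.33)     # classic 32-bit, 33 MHz PCI
pci_x = parallel_bus_peak_mb_s(64, 133.33)  # 64-bit, 133 MHz PCI-X

print(f"PCI:   {pci:.0f} MB/s (shared by every device on the bus)")
print(f"PCI-X: {pci_x:.0f} MB/s (shared by every device on the bus)")
```

Because the bus is shared, these are totals for all devices combined, not per-device figures.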
The ceiling wasn’t engineering — it was physics. Every attempt to make PCI or PCI-X faster eventually hit the same three walls: flight time, clock skew, and signal skew. These aren’t implementation bugs you can fix with better chips. They are consequences of the parallel shared bus model itself.
📋 The Three Fundamental Limits of a Parallel Bus
Figure 1 — The three physical limits that cap any parallel shared bus. All three are intrinsic to the model — you cannot engineer around them, only manage them until the system becomes impractical.
PCI-X 2.0, which kept the 133 MHz clock but moved to 266 and 533 MT/s data rates, was effectively forced to become point-to-point because signal timing demanded it. Once you’re point-to-point anyway, you’ve lost the main advantage of a shared bus, and the model stops making sense. The PCI-SIG recognised this and designed something fundamentally different.
⚡ How Serial Solves All Three at Once
PCIe is a serial, point-to-point, differential interconnect. Each connection is between exactly two PCIe ports. This single change removes all three limits simultaneously:
Parallel Shared Bus (PCI / PCI-X)
Shared wires — all devices contend for the same bus
One transfer at a time, one direction at a time
External common clock — clock skew unavoidable
Flight time must be less than clock period
32 or 64 bits in parallel — signal skew grows with width
Practical ceiling: ~133 MHz, ~1 GB/s
Serial Point-to-Point (PCIe)
Dedicated wires per link — zero contention
Send and receive simultaneously (full duplex)
Clock embedded in data stream — no external clock needed
Flight time irrelevant — clock travels with data
1 bit per lane at a time — signal skew eliminated within a lane
Gen 6: 64 GT/s, 256 GB/s on x16 link
📋 Differential Signalling — Noise Immunity at Multi-Gigabit Speeds
Every PCIe lane uses differential signalling. Instead of one wire carrying the signal referenced to ground, two complementary wires carry the same signal — one positive (D+), one inverted (D−). The receiver subtracts D− from D+ to determine the bit value.
Figure 2 — Differential signalling. Any noise coupling onto the trace affects D+ and D− equally. Since the receiver looks at the difference, the noise cancels out — giving excellent noise immunity even at 64 GT/s (Gen 6) voltage swings.
In Gen 6, the differential voltage swing is around 800 mV peak-to-peak — small enough that a single-ended signal at that level would be swamped by typical system noise. But because differential pairs are routed together with matched impedance, noise affects both wires identically and cancels in the subtraction. This is why PCIe can operate reliably at 64 GT/s on real PCB traces.
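The cancellation argument can be checked with a small simulation. This is a sketch under idealised assumptions: the noise couples onto both wires identically, the voltages are illustrative (a 0.2 V per-wire amplitude gives a 0.8 V peak-to-peak differential swing), and all function names are hypothetical.

```python
import random

def transmit_differential(bits, swing=0.2):
    """Return (d_plus, d_minus) voltage lists; swing is the per-wire amplitude in volts."""
    d_plus = [swing if b else -swing for b in bits]
    d_minus = [-v for v in d_plus]
    return d_plus, d_minus

def add_common_mode_noise(d_plus, d_minus, amplitude, rng):
    """Add the same noise sample to both wires (idealised tightly coupled pair)."""
    noise = [rng.uniform(-amplitude, amplitude) for _ in d_plus]
    return ([p + n for p, n in zip(d_plus, noise)],
            [m + n for m, n in zip(d_minus, noise)])

def receive_differential(d_plus, d_minus):
    """Decide each bit from the difference D+ minus D-."""
    return [1 if (p - m) > 0 else 0 for p, m in zip(d_plus, d_minus)]

def receive_single_ended(d_plus):
    """Decide each bit from D+ alone, referenced to ground."""
    return [1 if p > 0 else 0 for p in d_plus]

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(1000)]
d_plus, d_minus = transmit_differential(bits)
# Noise amplitude is five times larger than the per-wire signal amplitude.
noisy_plus, noisy_minus = add_common_mode_noise(d_plus, d_minus, amplitude=1.0, rng=rng)

diff_errors = sum(a != b for a, b in zip(bits, receive_differential(noisy_plus, noisy_minus)))
single_errors = sum(a != b for a, b in zip(bits, receive_single_ended(noisy_plus)))
print("differential bit errors:", diff_errors)
print("single-ended bit errors:", single_errors)
```

Real channels also see noise that does not couple equally onto both wires, so the cancellation is never this perfect in practice.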
📋 No Common Clock — Clock Embedded in the Data Stream
There is no forwarded clock signal in a PCIe link. Instead, the transmitter embeds the clock into the data stream through encoding, and the receiver contains a CDR (Clock and Data Recovery) circuit — a PLL that extracts the clock from the incoming bit transitions.
Figure 3 — The CDR circuit locks its PLL to the transitions in the incoming bit stream. The recovered clock precisely matches the transmitter’s clock regardless of how long the signal takes to travel. This is why PCIe has no clock skew problem.
The encoding scheme is what keeps the PLL happy. The PLL needs frequent transitions to stay locked. Too long without a transition and it drifts. This is why 8b/10b (Gen 1/2) guarantees no more than 5 consecutive identical bits, and why 128b/130b (Gen 3–5) uses scrambling to randomise the bit stream. In Gen 6, PAM4 with FEC provides a different but equally robust approach.
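To see why scrambling helps the receiver stay locked, the sketch below measures the longest run of identical bits before and after XOR-ing the data with a pseudo-random sequence. The LFSR here is a textbook 16-bit polynomial chosen for illustration; it is an assumption and not the polynomial defined in the PCIe specification, and the function names are hypothetical.

```python
def max_run_length(bits):
    """Length of the longest run of identical consecutive bits."""
    longest = current = 0
    previous = None
    for b in bits:
        current = current + 1 if b == previous else 1
        longest = max(longest, current)
        previous = b
    return longest

def lfsr_sequence(n, state=0xFFFF):
    """Generate n bits from a textbook 16-bit Fibonacci LFSR
    (x^16 + x^14 + x^13 + x^11 + 1). Illustrative only, not the PCIe polynomial."""
    out = []
    for _ in range(n):
        out.append(state & 1)
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (feedback << 15)
    return out

def scramble(bits, seed=0xFFFF):
    """XOR the data with the LFSR sequence. Applying it twice restores the data."""
    return [b ^ k for b, k in zip(bits, lfsr_sequence(len(bits), seed))]

idle = [0] * 256                      # worst case: no transitions at all
scrambled = scramble(idle)
print("longest run before scrambling:", max_run_length(idle))
print("longest run after scrambling: ", max_run_length(scrambled))
assert scramble(scrambled) == idle    # the receiver descrambles with the same sequence
```

The receiver runs the same LFSR from the same seed, so descrambling is just a second XOR.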
🎯 PCIe Key Design Goals
Software backward compatibility with PCI — Memory, I/O, and configuration address spaces are unchanged. A PCI device driver written in 1998 runs on a PCIe system today with zero modification.
Point-to-point dedicated bandwidth — Every link has its own bandwidth. No contention between devices. An NVMe SSD doesn’t compete with a GPU for the same wires.
Scalable link width and speed — x1, x2, x4, x8, x12, x16, x32. Gen 1 through Gen 6. Match the bandwidth to the device cost and requirements.
Low pin count — A x1 PCIe link needs just 4 signal wires (one differential pair per direction). A 64-bit PCI-X slot needed over 100 pins. The difference matters enormously for connectors, PCB routing, and device packaging.
Packet-based protocol — No sideband control signals. Everything is in the packet header — the command, address, length, priority, and routing information.
Hardware-guaranteed reliability — The Data Link Layer’s ACK/NAK mechanism catches and retransmits any corrupted packet automatically, with no software involvement.
Quality of Service — Traffic Classes and Virtual Channels allow time-sensitive traffic (real-time video, AI inference) to get guaranteed bandwidth alongside bulk data traffic.
📋 Topology — Root Complex, Switch, Endpoint
A PCIe system is a tree. No loops are allowed — this preserves backward compatibility with PCI’s configuration software, which uses a simple recursive bus-number enumeration algorithm that assumes a tree structure.
Figure 4 — Example PCIe topology. Every line is a dedicated point-to-point link. The Root Complex sits at bus 0; Switches fan out downstream ports; Endpoints are the leaf devices. Software sees Switches as collections of PCI-to-PCI bridges.
Root Complex
The Root Complex (RC) is the interface between the CPU/memory system and the PCIe fabric. In modern systems it is partly inside the CPU die (memory controller, PCIe root ports) and partly in a separate I/O hub. It generates PCIe transactions on behalf of software and acts as the completer for memory requests from devices. To configuration software it appears as Bus 0 with a collection of virtual PCI-to-PCI bridges as its PCIe root ports.
Switch
A Switch is a packet router. It has one upstream port (facing the Root) and one or more downstream ports (facing endpoints or further switches). Internally it looks to software like a set of PCI-to-PCI bridges sharing a common internal bus. Every switch port implements all three PCIe layers — even though the switch is just forwarding packets, it has to look inside the TLP header to determine the correct egress port, which means parsing at the Transaction Layer.
Endpoint
Endpoints are the leaf devices — the GPUs, NVMe SSDs, network cards, and AI accelerators. A Native PCIe Endpoint was designed specifically for PCIe — it is memory-mapped only, uses no I/O space, and does not use locked transactions. A Legacy PCIe Endpoint is an older PCI-X device with a PCIe interface — it may still use I/O space and locked transactions for backward compatibility.
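The tree-shaped topology and the recursive bus numbering it enables can be sketched in a few lines. This is a simplified model, not real configuration software: the `Bridge` class, its field names, and the example devices are all hypothetical, and each bridge stands for one virtual PCI-to-PCI bridge (a root port or a switch port).

```python
from dataclasses import dataclass, field

@dataclass
class Bridge:
    name: str
    children: list = field(default_factory=list)  # Bridge objects or endpoint names (str)
    primary: int = -1      # bus the bridge sits on
    secondary: int = -1    # bus directly below the bridge
    subordinate: int = -1  # highest bus number reachable below the bridge

def enumerate_buses(bridge, primary, next_bus):
    """Depth-first bus numbering. Returns the highest bus number assigned so far."""
    bridge.primary = primary
    bridge.secondary = next_bus
    highest = next_bus
    for child in bridge.children:
        if isinstance(child, Bridge):
            highest = enumerate_buses(child, bridge.secondary, highest + 1)
    bridge.subordinate = highest
    return highest

# Root port 1 leads to a switch with two downstream ports; root port 2 leads to one endpoint.
dp1 = Bridge("switch downstream port 1", ["GPU"])
dp2 = Bridge("switch downstream port 2", ["NVMe SSD"])
up = Bridge("switch upstream port", [dp1, dp2])
rp1 = Bridge("root port 1", [up])
rp2 = Bridge("root port 2", ["NIC"])

highest = 0  # the Root Complex itself is bus 0
for root_port in (rp1, rp2):
    highest = enumerate_buses(root_port, primary=0, next_bus=highest + 1)

for b in (rp1, up, dp1, dp2, rp2):
    print(f"{b.name}: primary={b.primary} secondary={b.secondary} subordinate={b.subordinate}")
```

The recursion only terminates cleanly because there are no loops: every bus is reached through exactly one bridge.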
🔂 Lanes and Links
A Lane is the fundamental unit of a PCIe connection: one differential TX pair plus one differential RX pair. Four wires. Both directions active simultaneously — the spec calls this “dual-simplex” (two independent simplex paths).
A Link is the physical connection between two PCIe ports and consists of one or more lanes. Supported link widths: x1, x2, x4, x8, x12, x16, x32. More lanes = more bandwidth but also more board area, more power, and higher connector cost.
Figure 5 — Left: a x1 link (4 wires). Right: a x4 link (16 wires). On multi-lane links, bytes are striped across lanes — Lane 0 carries byte 0, Lane 1 carries byte 1, etc. — multiplying the bandwidth proportionally.
Byte striping on multi-lane links. When a x4 link sends a packet, it doesn’t send the whole packet down Lane 0 first. It stripes bytes across all four lanes simultaneously — byte 0 on Lane 0, byte 1 on Lane 1, byte 2 on Lane 2, byte 3 on Lane 3, then byte 4 on Lane 0 again, and so on. The receiver reassembles the bytes in order. This is why a x4 link carries exactly 4× the bandwidth of a x1 link at the same speed.
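The round-robin byte order described above can be sketched with slicing. This is a minimal illustration of the ordering only; it ignores framing, encoding, and lane-to-lane deskew, and the function names are hypothetical.

```python
def stripe(data: bytes, lanes: int) -> list[bytes]:
    """Distribute bytes round-robin: byte i goes to lane i % lanes."""
    return [data[lane::lanes] for lane in range(lanes)]

def unstripe(lane_data: list[bytes]) -> bytes:
    """Reassemble the original byte order from the per-lane streams."""
    lanes = len(lane_data)
    out = bytearray(sum(len(d) for d in lane_data))
    for lane, d in enumerate(lane_data):
        out[lane::lanes] = d
    return bytes(out)

packet = bytes(range(8))          # bytes 0..7
per_lane = stripe(packet, 4)
for lane, d in enumerate(per_lane):
    print(f"Lane {lane}: {list(d)}")
assert unstripe(per_lane) == packet
```

With eight bytes on a x4 link, each lane carries two bytes, so the packet takes a quarter of the byte times it would need on a x1 link.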
📋 The Three-Layer Architecture
Every PCIe port — whether it’s in a Root Complex, a Switch, or an Endpoint — implements the same three layers. The layers define what each part of the system is responsible for, not how the silicon must be partitioned internally.
Figure 6 — The three PCIe layers in both transmit and receive directions. Each layer adds fields when transmitting and strips them when receiving. The Physical Layer changes the most between generations (8b/10b → 128b/130b → PAM4+FEC). The upper layers are largely unchanged from Gen 1 to Gen 6.
One important point: the upper two layers (Transaction and Data Link) have been largely stable since Gen 1. The Transaction Layer protocol (TLP types, routing, ordering rules) is essentially the same from Gen 1 through Gen 6. The Data Link Layer ACK/NAK mechanism is conceptually unchanged. What changes between generations is almost entirely the Physical Layer: encoding, signal levels, equalization, and framing. This is intentional: keeping the upper layers stable means software and firmware written for Gen 1 work on Gen 6 hardware.
▶ A Packet’s Journey — Complete Walk-Through
Let’s trace a memory read issued by an NVMe SSD to system RAM as it passes through the layers, step by step.
Figure 7 — A TLP being built layer by layer (left to right), sent across the link, and received. Each layer adds its fields on transmit and strips them on receive. The device core at the far end sees only the original header and payload.
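The add-on-transmit, strip-on-receive pattern can be sketched as nested wrapping. Several things here are assumptions made for illustration: `zlib.crc32` stands in for the LCRC calculation, the one-byte `STP` and `END` markers stand in for the real framing symbols, the 12-byte header is a zero-filled placeholder rather than a valid TLP header, and all function names are hypothetical.

```python
import struct
import zlib

STP, END = b"\xFB", b"\xFD"  # stand-in framing markers (illustrative values)

def transaction_layer_tx(header: bytes, payload: bytes) -> bytes:
    """Transaction Layer: build the TLP from header and payload."""
    return header + payload

def data_link_layer_tx(tlp: bytes, seq: int) -> bytes:
    """Data Link Layer: prepend a 12-bit sequence number, append a 32-bit CRC."""
    seq_field = struct.pack(">H", seq & 0x0FFF)
    lcrc = struct.pack(">I", zlib.crc32(seq_field + tlp))  # stand-in for the LCRC
    return seq_field + tlp + lcrc

def physical_layer_tx(dl_packet: bytes) -> bytes:
    """Physical Layer: add start and end framing."""
    return STP + dl_packet + END

def physical_layer_rx(frame: bytes) -> bytes:
    if frame[:1] != STP or frame[-1:] != END:
        raise ValueError("framing error")
    return frame[1:-1]

def data_link_layer_rx(dl_packet: bytes):
    seq_field, tlp, lcrc = dl_packet[:2], dl_packet[2:-4], dl_packet[-4:]
    if struct.pack(">I", zlib.crc32(seq_field + tlp)) != lcrc:
        raise ValueError("CRC mismatch: a real receiver would respond with a NAK")
    return struct.unpack(">H", seq_field)[0], tlp

header = bytes(12)                 # placeholder 3 DW header
payload = b"\xDE\xAD\xBE\xEF"      # 1 DW of data
frame = physical_layer_tx(data_link_layer_tx(transaction_layer_tx(header, payload), seq=5))

seq, tlp = data_link_layer_rx(physical_layer_rx(frame))
print("sequence number:", seq, "| TLP recovered intact:", tlp == header + payload)
```

Flipping any single bit between the framing markers makes `data_link_layer_rx` raise, which is the point at which the hardware retry would start.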
📋 TLPs vs DLLPs
There are two completely different kinds of packet in PCIe, with very different roles:
TLP — Transaction Layer Packet
Carries user data and commands — memory reads/writes, config accesses, completions, messages
Created at the Transaction Layer of one device, consumed at the Transaction Layer of another (often non-adjacent)
Routed through the topology by address or Requester ID
Variable length: 3–4 DW header + optional data payload (up to 4096 bytes)
Protected by LCRC (per hop, mandatory) and ECRC (end-to-end, optional)
Gen 6: carried inside flits; max payload still 4096 bytes
DLLP — Data Link Layer Packet
Carries link management — ACK, NAK, flow control credits, power management
Created at the Data Link Layer of one device, consumed at the Data Link Layer of the immediately adjacent device — never forwarded or routed
Always deliverable — not subject to flow control credits (prevents deadlock)
Invisible to the Transaction Layer and software
Why two packet types? If ACK/NAK were carried in TLPs, a full receive buffer could prevent ACKs from getting through — causing a deadlock where the sender can’t send because the buffer is full, and the buffer can’t be released because ACKs aren’t arriving. DLLPs sidestep this completely because they bypass the TLP flow control mechanism entirely.
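The deadlock argument can be illustrated with a toy model of one receiving port. This is a deliberately simplified sketch: in real hardware the transmitter tracks the credits and simply does not send, whereas here the check is folded into the receiver for brevity. The class and method names are hypothetical.

```python
class ReceiverPort:
    """Toy model: TLPs consume flow control credits, DLLPs do not."""

    def __init__(self, tlp_credits: int):
        self.credits = tlp_credits
        self.tlp_buffer = []
        self.dllps_seen = []

    def accept_tlp(self, tlp) -> bool:
        """A TLP is accepted only while buffer credits remain."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.tlp_buffer.append(tlp)
        return True

    def accept_dllp(self, dllp) -> bool:
        """DLLPs are consumed immediately and never wait for credits."""
        self.dllps_seen.append(dllp)
        return True

    def drain_one(self):
        """The device core consumes one TLP, which frees one credit."""
        tlp = self.tlp_buffer.pop(0)
        self.credits += 1
        return tlp

port = ReceiverPort(tlp_credits=2)
print(port.accept_tlp("MemWr A"))   # True
print(port.accept_tlp("MemWr B"))   # True
print(port.accept_tlp("MemWr C"))   # False, buffer is full
print(port.accept_dllp("ACK"))      # True, still gets through
```

Even with the TLP buffer full, the ACK is delivered, so the sender's replay buffer can be released and traffic resumes once a credit is freed.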
📈 Gen 1 Through Gen 6 — Bandwidth at Every Width
Each generation roughly doubles per-lane bandwidth. The mechanism changes — it is not just a frequency increase.
| Generation | Line Rate | Encoding / Modulation | x1 Bandwidth (each direction) | x16 Aggregate (both directions) |
|---|---|---|---|---|
| Gen 1 (1.x) | 2.5 GT/s | 8b/10b NRZ | 250 MB/s | 8 GB/s |
| Gen 2 (2.x) | 5.0 GT/s | 8b/10b NRZ | 500 MB/s | 16 GB/s |
| Gen 3 (3.x) | 8.0 GT/s | 128b/130b NRZ | ~1 GB/s | 32 GB/s |
| Gen 4 (4.0) | 16.0 GT/s | 128b/130b NRZ | ~2 GB/s | 64 GB/s |
| Gen 5 (5.0) | 32.0 GT/s | 128b/130b NRZ | ~4 GB/s | 128 GB/s |
| Gen 6 (6.0) | 64.0 GT/s | PAM4 + FEC + Flit | ~8 GB/s | 256 GB/s |
How the bandwidth is calculated
Gen 1/2 (8b/10b): 10 bits on the wire carry 8 bits of data, so every data byte costs 10 wire bits. Gen 1 at 2.5 GT/s gives 2.5 Gb/s ÷ 10 bits per byte = 250 MB/s per lane per direction.
Gen 3–5 (128b/130b): 130 bits carry 128 bits. The 2-bit overhead (1.5%) is small enough to ignore in round-number calculations. Gen 3 at 8 GT/s gives approximately 1 GB/s per lane per direction.
Gen 6 (PAM4): PAM4 encodes 2 bits per symbol instead of 1. At 32 GBaud (32 billion baud), the line rate is 64 GT/s. After FEC overhead (~1.5%), effective payload throughput is approximately 8 GB/s per lane per direction.
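The calculations above can be reproduced with a short script. This is a sketch under stated assumptions: it applies only the line-encoding efficiency, it treats Gen 6 as the raw 64 Gb/s per lane without subtracting FEC or flit overhead, and the table and function names are illustrative.

```python
# generation -> (line rate in GT/s, encoding efficiency)
GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
    6: (64.0, 1.0),  # FEC and flit overhead deliberately ignored here
}

def lane_bandwidth_gb_s(gen: int) -> float:
    """Bandwidth of one lane in one direction, in GB/s."""
    rate_gt_s, efficiency = GENERATIONS[gen]
    return rate_gt_s * efficiency / 8  # 8 bits per byte

def link_aggregate_gb_s(gen: int, lanes: int) -> float:
    """Bandwidth of a whole link, both directions combined, in GB/s."""
    return lane_bandwidth_gb_s(gen) * lanes * 2

for gen in GENERATIONS:
    print(f"Gen {gen}: x1 {lane_bandwidth_gb_s(gen):6.3f} GB/s per direction, "
          f"x16 aggregate {link_aggregate_gb_s(gen, 16):7.2f} GB/s")
```

The 128b/130b generations come out slightly under the round numbers in the table (Gen 3 is about 0.985 GB/s per lane), which is the 1.5% overhead the text describes as small enough to ignore.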
⚡ Gen 6 — What Changed and Why It Matters
Gen 6 is not just Gen 5 running twice as fast. It is a significant architectural change at the Physical Layer driven by a hard physical reality: NRZ (Non-Return-to-Zero) signalling cannot reliably run faster than 32 GT/s on typical PCB trace lengths.
The NRZ ceiling
NRZ uses two voltage levels: high for logic 1, low for logic 0. As you increase the symbol rate, channel loss and inter-symbol interference grow, the received levels smear toward each other, and the eye diagram (the open area where the receiver samples the bit) closes. At 32 GT/s, the eye is already tight. Doubling to 64 GT/s with NRZ would essentially close the eye completely on any realistically lossy channel; the signal integrity simply doesn’t support it.
PAM4 — Four levels instead of two
PCIe 6.0 switches to PAM4 (Pulse Amplitude Modulation with 4 levels). Instead of two voltage levels carrying 1 bit per symbol, PAM4 uses four voltage levels carrying 2 bits per symbol at the same baud rate. At 32 GBaud, PAM4 delivers 64 Gb/s — same baud rate as Gen 5 NRZ, double the bit rate.
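The two-bits-per-symbol idea can be sketched as a lookup from bit pairs to one of four levels. The Gray-coded mapping below (adjacent levels differ by one bit) is a common choice for PAM4 and is used here as an assumption; the exact mapping and voltage levels in the PCIe 6.0 specification are not reproduced, and the names are hypothetical.

```python
# Assumed Gray-coded mapping from a bit pair to a level index 0..3.
GRAY_MAP = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
INVERSE_MAP = {level: pair for pair, level in GRAY_MAP.items()}

def pam4_encode(bits):
    """Pack bits two at a time into PAM4 level indices."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_MAP[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def pam4_decode(symbols):
    """Unpack PAM4 level indices back into bits."""
    bits = []
    for s in symbols:
        bits.extend(INVERSE_MAP[s])
    return bits

bits = [0, 0, 0, 1, 1, 1, 1, 0]
symbols = pam4_encode(bits)
print(f"{len(bits)} bits -> {len(symbols)} symbols: {symbols}")
assert pam4_decode(symbols) == bits
```

Eight bits travel in four symbol times, which is how the bit rate doubles while the baud rate stays at the Gen 5 value.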
Figure 8 — NRZ (left): 2 voltage levels, 1 bit/symbol. PAM4 (right): 4 voltage levels, 2 bits/symbol. Gen 6 uses PAM4 at 32 GBaud, delivering 64 Gb/s p