PCIe Series — PCIe-01: Introduction to PCI Express — VLSI Trainers

Introduction to PCI Express

Why the parallel shared bus model hit a wall, how a serial point-to-point link sidesteps all three fundamental limits, the topology every PCIe system shares, the three-layer architecture, and how bandwidth has scaled from Gen 1’s 0.5 GB/s on an x1 link to Gen 6’s 256 GB/s on an x16 link (both directions combined).

Why PCI Hit a Wall

PCI was a genuine step forward when it launched in the early 1990s. It gave every device a standardised configuration space, a plug-and-play environment, and 133 MB/s of shared bandwidth. For a decade that was enough. Then GPU cards, SCSI controllers, Gigabit Ethernet, and eventually storage devices started fighting over those same 133 MB/s, and the bus became the bottleneck.

PCI-X pushed the frequency to 133 MHz and improved bus efficiency, reaching 1 GB/s on a 64-bit bus. But it was still a parallel shared bus. Every device plugged into the same wires. When one device was talking, no other device could. And going any faster ran straight into three physical problems that the bus model itself could not escape.

The ceiling wasn’t engineering — it was physics. Every attempt to make PCI or PCI-X faster eventually hit the same three walls: flight time, clock skew, and signal skew. These aren’t implementation bugs you can fix with better chips. They are consequences of the parallel shared bus model itself.

📋 The Three Fundamental Limits of a Parallel Bus

[Figure 1: Three Physical Limits That Cap the Parallel Bus]
  • ① Flight Time: the clock period must be longer than the flight time from TX to RX plus setup time and skew. A higher frequency means shorter traces, which quickly becomes impractical.
  • ② Clock Skew: the shared clock reaches Dev A at t₀ and Dev B at t₀+Δ. That skew Δt eats into the timing budget and cannot be eliminated.
  • ③ Signal Skew: TX drives D[7:0] simultaneously, but the bits arrive at RX at different times. The receiver must wait for the slowest bit before latching any of them. This worsens with bus width, and 64-bit PCI-X hit it hard.
Figure 1 — The three physical limits that cap any parallel shared bus. All three are intrinsic to the model — you cannot engineer around them, only manage them until the system becomes impractical.
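The flight-time and skew limits can be read as a simple timing budget: everything has to fit inside one clock period. The sketch below is illustrative only. The function name, the extra driver-delay term `t_valid_ns`, and all the numbers are assumptions picked to land near a 33 MHz bus, not values taken from this article.

```python
def max_bus_frequency_mhz(t_flight_ns, t_setup_ns, t_skew_ns, t_valid_ns=0.0):
    """Highest clock frequency (MHz) whose period still covers the timing budget.

    t_valid_ns (driver clock-to-output delay) is an assumed extra term that the
    article's figure does not name explicitly.
    """
    min_period_ns = t_valid_ns + t_flight_ns + t_setup_ns + t_skew_ns
    return 1000.0 / min_period_ns


# Illustrative numbers only: a long shared bus versus much shorter traces.
long_bus = max_bus_frequency_mhz(t_flight_ns=10, t_setup_ns=7, t_skew_ns=2, t_valid_ns=11)
short_bus = max_bus_frequency_mhz(t_flight_ns=2.5, t_setup_ns=7, t_skew_ns=2, t_valid_ns=11)

print(f"long traces:  {long_bus:.1f} MHz")
print(f"short traces: {short_bus:.1f} MHz")
```

Cutting the flight time by a factor of four raises the ceiling by only about a third in this example, because setup time and skew stay in the budget no matter how short the traces get.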

PCI-X 2.0 at 133 MHz was effectively forced to become point-to-point because signal timing demanded it. Once you’re point-to-point anyway, you’ve lost the main advantage of a shared bus, and the model stops making sense. The PCI-SIG recognised this and designed something fundamentally different.

How Serial Solves All Three at Once

PCIe is a serial, point-to-point, differential interconnect. Each connection is between exactly two PCIe ports. This single change invalidates all three limits simultaneously:

Parallel Shared Bus (PCI / PCI-X)

  • Shared wires — all devices contend for the same bus
  • One transfer at a time, one direction at a time
  • External common clock — clock skew unavoidable
  • Flight time must be less than clock period
  • 32 or 64 bits in parallel — signal skew grows with width
  • Practical ceiling: ~133 MHz, ~1 GB/s

Serial Point-to-Point (PCIe)

  • Dedicated wires per link — zero contention
  • Send and receive simultaneously (full duplex)
  • Clock embedded in data stream — no external clock needed
  • Flight time irrelevant — clock travels with data
  • 1 bit per lane at a time — signal skew eliminated within a lane
  • Gen 6: 64 GT/s, 256 GB/s on an x16 link (both directions combined)

📋 Differential Signalling — Noise Immunity at Multi-Gigabit Speeds

Every PCIe lane uses differential signalling. Instead of one wire carrying the signal referenced to ground, two complementary wires carry the same signal — one positive (D+), one inverted (D−). The receiver subtracts D− from D+ to determine the bit value.

[Figure 2: a TX driver sends D+ (positive signal) and D− (inverted signal) to the RX. Noise couples onto D+ and D− equally and cancels when the receiver computes D+ − D−, so a clean bit value is recovered. Annotations: Gen 6 differential voltage is about 0.8 V peak-to-peak; the noise margin is maintained by tight trace routing with matched impedance.]
Figure 2: Differential signalling. Any noise coupling onto the trace affects D+ and D− equally. Since the receiver looks at the difference, the noise cancels out, giving excellent noise immunity even at the small voltage swings used at 64 GT/s (Gen 6).

In Gen 6, the differential voltage swing is around 800 mV peak-to-peak — small enough that a single-ended signal at that level would be swamped by typical system noise. But because differential pairs are routed together with matched impedance, noise affects both wires identically and cancels in the subtraction. This is why PCIe can operate reliably at 64 GT/s on real PCB traces.
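The cancellation argument can be checked with a toy model. This is a minimal sketch, assuming idealised wires where the noise on D+ and D− is exactly equal; the function names, voltages, and noise amplitude are all made up for illustration.

```python
import random


def transmit_differential(bits, swing_v=0.2):
    """Return (d_plus, d_minus) voltages per bit; swing_v is the per-wire amplitude."""
    return [(swing_v, -swing_v) if b else (-swing_v, swing_v) for b in bits]


def add_common_mode_noise(pairs, noise_v, rng):
    """Add the SAME random noise to both wires of each pair (idealised coupling)."""
    noisy = []
    for d_plus, d_minus in pairs:
        n = rng.uniform(-noise_v, noise_v)
        noisy.append((d_plus + n, d_minus + n))
    return noisy


def receive_differential(pairs):
    """Decide each bit from the difference D+ - D-."""
    return [1 if (d_plus - d_minus) > 0 else 0 for d_plus, d_minus in pairs]


def receive_single_ended(pairs):
    """Decide each bit from D+ alone, referenced to ground."""
    return [1 if d_plus > 0 else 0 for d_plus, _ in pairs]


rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(1000)]
# Noise amplitude is ten times the per-wire signal amplitude.
noisy = add_common_mode_noise(transmit_differential(bits), noise_v=2.0, rng=rng)

diff_errors = sum(a != b for a, b in zip(bits, receive_differential(noisy)))
single_errors = sum(a != b for a, b in zip(bits, receive_single_ended(noisy)))
print("differential errors:", diff_errors)
print("single-ended errors:", single_errors)
```

The differential receiver recovers every bit, while the single-ended one gets a large fraction wrong. Real links are less tidy: noise is never perfectly common-mode, which is why the pair has to be routed tightly together.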

📋 No Common Clock — Clock Embedded in the Data Stream

There is no forwarded clock signal in a PCIe link. Instead, the transmitter embeds the clock into the data stream through encoding, and the receiver contains a CDR (Clock and Data Recovery) circuit — a PLL that extracts the clock from the incoming bit transitions.

[Figure 3: the transmitter embeds its clock in the data stream through encoding. At the receiver, a CDR/PLL (phase detector, loop filter, VCO) locks to the transitions in the stream, and the recovered clock latches the bits. Flight time is now irrelevant because the clock travels with the data at the same speed.]
Figure 3 — The CDR circuit locks its PLL to the transitions in the incoming bit stream. The recovered clock precisely matches the transmitter’s clock regardless of how long the signal takes to travel. This is why PCIe has no clock skew problem.

The encoding scheme is what keeps the PLL happy. The PLL needs frequent transitions to stay locked. Too long without a transition and it drifts. This is why 8b/10b (Gen 1/2) guarantees no more than 5 consecutive identical bits, and why 128b/130b (Gen 3–5) uses scrambling to randomise the bit stream. Gen 6 still relies on scrambling for transition density; PAM4 and FEC solve a different problem, namely doubling the bit rate and correcting the extra bit errors that four-level signalling brings.
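To see why scrambling helps the PLL, here is a minimal sketch of an additive scrambler that XORs the data with an LFSR sequence. The 16-bit polynomial and the seed are a common textbook choice used purely for illustration; they are not the polynomials the PCIe specification defines, and the helper names are hypothetical.

```python
def lfsr_bits(seed=0xACE1):
    """Yield an endless bit stream from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        yield state & 1
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (feedback << 15)


def scramble(bits, seed=0xACE1):
    """XOR data with the LFSR stream. Applying it twice restores the original."""
    return [b ^ k for b, k in zip(bits, lfsr_bits(seed))]


def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    longest = current = 0
    previous = None
    for b in bits:
        current = current + 1 if b == previous else 1
        longest = max(longest, current)
        previous = b
    return longest


idle = [0] * 1000            # worst case for a CDR: no transitions at all
on_wire = scramble(idle)
recovered = scramble(on_wire)

print("longest run before scrambling:", longest_run(idle))
print("longest run after scrambling: ", longest_run(on_wire))
```

The receiver runs the same LFSR from the same seed, so descrambling is the same XOR. A scrambler only makes long runs unlikely rather than impossible for arbitrary data, which is one reason 128b/130b also prepends a sync header to every block.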

🎯 PCIe Key Design Goals

📋 Topology — Root Complex, Switch, Endpoint

A PCIe system is a tree. No loops are allowed — this preserves backward compatibility with PCI’s configuration software, which uses a simple recursive bus-number enumeration algorithm that assumes a tree structure.

[Figure 4: example topology]
  • Root Complex (RC): CPU + Memory Controller + PCIe Root Ports; appears as Bus 0 to software
  • x16 link at 64 GT/s (Gen 6) → GPU Endpoint: native PCIe, x16, Gen 6, requester + completer, 1 upstream port only
  • x8 link → Switch: 1 upstream port, 3 downstream ports; its internal virtual bus appears to software as PCI-to-PCI bridges
  • x4 Gen 5 link → NVMe SSD: native PCIe Endpoint, x4
  • Switch downstream links (x4, x4, x1) → 10GbE NIC (Endpoint, x4), FPGA Card (Endpoint, x4), PCIe→PCI Bridge (connects legacy PCI devices)
  • Legend: Root Complex, Switch, Native PCIe Endpoint, Bridge. All connections are point-to-point links.
Figure 4 — Example PCIe topology. Every line is a dedicated point-to-point link. The Root Complex sits at bus 0; Switches fan out downstream ports; Endpoints are the leaf devices. Software sees Switches as collections of PCI-to-PCI bridges.
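The recursive bus-number enumeration mentioned above can be sketched as a depth-first walk. This is a simplified model, not real configuration-space code: the `Device` class, the helper names, and the small tree (loosely modelled on Figure 4) are all assumptions. Each bridge gets a secondary bus number (the bus directly below it) and a subordinate bus number (the highest bus anywhere below it).

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    children: list = field(default_factory=list)  # devices behind a bridge
    is_bridge: bool = False
    bus: int = -1          # bus this device sits on
    secondary: int = -1    # bridges only: bus directly below
    subordinate: int = -1  # bridges only: highest bus number below


def bridge(name, *children):
    return Device(name, list(children), is_bridge=True)


def enumerate_bus(devices, bus, highest):
    """Depth-first bus numbering; returns the highest bus number assigned so far."""
    for dev in devices:
        dev.bus = bus
        if dev.is_bridge:
            highest += 1
            dev.secondary = highest
            highest = enumerate_bus(dev.children, dev.secondary, highest)
            dev.subordinate = highest
    return highest


gpu, nic, fpga, nvme = Device("GPU"), Device("NIC"), Device("FPGA"), Device("NVMe")
switch = bridge("switch-up", bridge("switch-dn0", nic), bridge("switch-dn1", fpga))
root_ports = [bridge("rp0", gpu), bridge("rp1", switch), bridge("rp2", nvme)]

last_bus = enumerate_bus(root_ports, bus=0, highest=0)
for dev in (gpu, switch, nic, fpga, nvme):
    print(f"{dev.name:10s} on bus {dev.bus}")
```

The walk only terminates with unique bus numbers because the topology is a tree. A loop would send the recursion around the same bridges again, which is the reason the article gives for forbidding loops.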

Root Complex

The Root Complex (RC) is the interface between the CPU/memory system and the PCIe fabric. In modern systems it is partly inside the CPU die (memory controller, PCIe root ports) and partly in a separate I/O hub. It generates PCIe transactions on behalf of software and acts as the completer for memory requests from devices. To configuration software it appears as Bus 0 with a collection of virtual PCI-to-PCI bridges as its PCIe root ports.

Switch

A Switch is a packet router. It has one upstream port (facing the Root) and one or more downstream ports (facing endpoints or further switches). Internally it looks to software like a set of PCI-to-PCI bridges sharing a common internal bus. Every switch port implements all three PCIe layers — even though the switch is just forwarding packets, it has to look inside the TLP header to determine the correct egress port, which means parsing at the Transaction Layer.
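Since each switch port looks like a PCI-to-PCI bridge, the egress decision can be sketched as a range check against each downstream port's bridge registers. This is a minimal sketch for a TLP arriving on the upstream port; the class, function names, bus numbers, and address windows are hypothetical, and real switches handle more cases (peer-to-peer traffic, prefetchable windows, messages).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownstreamPort:
    name: str
    secondary: int     # first bus number behind this port
    subordinate: int   # last bus number behind this port
    mem_base: int      # start of the memory window forwarded downstream
    mem_limit: int     # end of that window (inclusive)


def route_by_id(ports, target_bus):
    """ID routing (config requests, completions): match the target bus number."""
    for port in ports:
        if port.secondary <= target_bus <= port.subordinate:
            return port.name
    return "upstream"


def route_by_address(ports, address):
    """Address routing (memory requests): match the port's memory window."""
    for port in ports:
        if port.mem_base <= address <= port.mem_limit:
            return port.name
    return "upstream"


ports = [
    DownstreamPort("dn0", 4, 4, 0xA000_0000, 0xA00F_FFFF),
    DownstreamPort("dn1", 5, 5, 0xA010_0000, 0xA01F_FFFF),
]

print(route_by_address(ports, 0xA010_2000))
print(route_by_id(ports, 4))
print(route_by_address(ports, 0x8000_0000))
```

Both checks need fields from the TLP header, which is why every switch port has to implement the Transaction Layer.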

Endpoint

Endpoints are the leaf devices — the GPUs, NVMe SSDs, network cards, and AI accelerators. A Native PCIe Endpoint was designed specifically for PCIe — it is memory-mapped only, uses no I/O space, and does not use locked transactions. A Legacy PCIe Endpoint is an older PCI-X device with a PCIe interface — it may still use I/O space and locked transactions for backward compatibility.

📋 The Three-Layer Architecture

Every PCIe port — whether it’s in a Root Complex, a Switch, or an Endpoint — implements the same three layers. The layers define what each part of the system is responsible for, not how the silicon must be partitioned internally.

[Figure 6: layer responsibilities]

Transmitting Port (top to bottom)

  • Software / Device Core: generates the request (MRd / MWr / CfgRd / CfgWr / Message)
  • Transaction Layer: builds the TLP (Header + Payload + optional ECRC); handles flow control, QoS (Traffic Classes), and ordering
  • Data Link Layer: adds Sequence Number + LCRC and stores a copy in the Replay Buffer; sends ACK/NAK DLLPs and flow control DLLPs to its neighbour
  • Physical Layer: encodes (Gen 1/2: 8b/10b + framing characters; Gen 3–5: 128b/130b + scrambling; Gen 6: PAM4 + FEC + flit), then Byte Stripe → Serialise → Differential TX on all lanes

Receiving Port (bottom to top)

  • Physical Layer: Differential RX → CDR → Deserialise → Decode → De-skew; checks framing / sync headers and strips the encoding; the LTSSM manages link state and training
  • Data Link Layer: checks LCRC + Sequence Number, sends ACK or NAK, removes Seq No + LCRC, and forwards the TLP upward
  • Transaction Layer: checks ECRC (optional), decodes the TLP header, then routes (if Switch) or delivers to the Device Core (if Endpoint)
  • Software / Device Core: receives command + address + data and acts on it
Figure 6 — The three PCIe layers in both transmit and receive directions. Each layer adds fields when transmitting and strips them when receiving. The Physical Layer changes the most between generations (8b/10b → 128b/130b → PAM4+FEC). The upper layers are largely unchanged from Gen 1 to Gen 6.

One important point: the upper two layers (Transaction and Data Link) have been stable since Gen 1. The Transaction Layer protocol (TLP types, routing, ordering rules) is essentially identical from Gen 1 through Gen 6. The Data Link Layer ACK/NAK mechanism is largely unchanged. What changes between generations is almost entirely the Physical Layer: encoding, signal levels, equalization, and framing. This is intentional: keeping the upper layers stable means software and firmware written for Gen 1 still work on Gen 6 hardware.

A Packet’s Journey — Complete Walk-Through

Let’s trace a memory read from an NVMe SSD to system RAM through the layers, step by step.

[Figure 7: steps]
  • ① Transaction Layer: builds the TLP. For an MRd this is a header with no data.
  • ② Data Link Layer: adds SeqNo in front and LCRC behind, and saves a copy in the Replay Buffer.
  • ③ Physical Layer: wraps Seq + Header + LCRC in SOP / EOP framing. In Gen 6 this is PAM4 encoding + FEC + flit framing, sent as PAM4 symbols with FEC parity blocks across the link.
  • ④ Receive: the Physical Layer does CDR → Deserialise → FEC correct → decode and strips the framing; the Data Link Layer checks the LCRC and sends an ACK; the Transaction Layer decodes the header and routes or delivers the TLP.

Legend: SOP = Start of Packet · EOP = End of Packet · Seq = Sequence Number · LCRC = Link CRC
Figure 7 — A TLP being built layer by layer (left to right), sent across the link, and received. Each layer adds its fields on transmit and strips them on receive. The device core at the far end sees only the original header and payload.
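The wrap-and-strip sequence in the walk-through can be sketched in a few lines. This is a toy model under stated assumptions: `zlib.crc32` stands in for the real LCRC computation, the one-byte `SOP` / `EOP` markers are placeholders rather than real PCIe framing symbols, and the receiver's NAK rule is simplified. All function names are hypothetical.

```python
import struct
import zlib

SOP, EOP = b"\x01", b"\x02"  # placeholder framing markers, not real PCIe symbols


def dll_wrap(tlp, seq):
    """Data Link Layer TX: prepend a 12-bit sequence number, append a 32-bit CRC."""
    body = struct.pack(">H", seq & 0x0FFF) + tlp
    return body + struct.pack(">I", zlib.crc32(body))  # stand-in for the LCRC


def phy_frame(dll_packet):
    """Physical Layer TX: add start and end markers."""
    return SOP + dll_packet + EOP


def phy_unframe(frame):
    """Physical Layer RX: check and strip the markers."""
    if frame[:1] != SOP or frame[-1:] != EOP:
        raise ValueError("bad framing")
    return frame[1:-1]


def dll_unwrap(dll_packet, expected_seq):
    """Data Link Layer RX: verify CRC and sequence number, return (status, tlp)."""
    body = dll_packet[:-4]
    (crc,) = struct.unpack(">I", dll_packet[-4:])
    (seq,) = struct.unpack(">H", body[:2])
    if zlib.crc32(body) != crc or seq != expected_seq:
        return "NAK", None
    return "ACK", body[2:]


tlp = bytes(12)  # stand-in for a 3 DW MRd header with no payload
frame = phy_frame(dll_wrap(tlp, seq=7))
status, delivered = dll_unwrap(phy_unframe(frame), expected_seq=7)

# Flip one bit inside the TLP to simulate a link error.
corrupted = frame[:5] + bytes([frame[5] ^ 0x01]) + frame[6:]
bad_status, _ = dll_unwrap(phy_unframe(corrupted), expected_seq=7)

print(status, delivered == tlp, bad_status)
```

On a NAK the sender would resend its saved copy from the Replay Buffer, which is why step ② stores one before the packet leaves.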

📋 TLPs vs DLLPs

There are two completely different kinds of packet in PCIe, with very different roles:

TLP — Transaction Layer Packet

  • Carries user data and commands — memory reads/writes, config accesses, completions, messages
  • Created at the Transaction Layer of one device, consumed at the Transaction Layer of another (often non-adjacent)
  • Routed through the topology by address or Requester ID
  • Variable length: 3–4 DW header + optional data payload (up to 4096 bytes)
  • Protected by LCRC (per hop, mandatory) and ECRC (end-to-end, optional)
  • Gen 6: carried inside flits; max payload still 4096 bytes

DLLP — Data Link Layer Packet

  • Carries link management — ACK, NAK, flow control credits, power management
  • Created at the Data Link Layer of one device, consumed at the Data Link Layer of the immediately adjacent device — never forwarded or routed
  • Fixed length: 8 bytes on the wire in non-flit mode (2 bytes of framing + 4 bytes of content + 2 bytes of CRC)
  • Always deliverable — not subject to flow control credits (prevents deadlock)
  • Invisible to the Transaction Layer and software
Why two packet types? If ACK/NAK were carried in TLPs, a full receive buffer could prevent ACKs from getting through — causing a deadlock where the sender can’t send because the buffer is full, and the buffer can’t be released because ACKs aren’t arriving. DLLPs sidestep this completely because they bypass the TLP flow control mechanism entirely.
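The difference in how the two packet types are gated can be shown with a toy transmitter. This is a minimal sketch, assuming a single credit pool and one credit per TLP; the class and method names are hypothetical, and real PCIe tracks separate header and data credits per traffic type.

```python
class LinkTransmitter:
    """Toy model: TLPs consume flow control credits, DLLPs never do."""

    def __init__(self, credits):
        self.credits = credits
        self.sent = []

    def send_tlp(self, name, cost=1):
        """Send a TLP only if the receiver has advertised enough credits."""
        if cost > self.credits:
            return False  # blocked until credits are returned
        self.credits -= cost
        self.sent.append(("TLP", name))
        return True

    def send_dllp(self, name):
        """DLLPs (ACK, NAK, UpdateFC) are not gated by credits."""
        self.sent.append(("DLLP", name))
        return True

    def receive_update_fc(self, returned):
        """The neighbour freed buffer space and returned credits."""
        self.credits += returned


tx = LinkTransmitter(credits=1)
first = tx.send_tlp("MWr #0")      # uses the only credit
blocked = tx.send_tlp("MWr #1")    # no credits left
ack = tx.send_dllp("ACK")          # still goes out
tx.receive_update_fc(1)            # credit returned by the neighbour
retry = tx.send_tlp("MWr #1")      # now succeeds

print(first, blocked, ack, retry)
```

Even with zero TLP credits, the ACK still goes out and the credit update still arrives, so the link can always make progress.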

📈 Gen 1 Through Gen 6 — Bandwidth at Every Width

Each generation roughly doubles per-lane bandwidth. The mechanism changes — it is not just a frequency increase.

Generation  | Line Rate | Encoding / Modulation | x1 BW (each direction) | x16 Aggregate (both directions)
Gen 1 (1.x) | 2.5 GT/s  | 8b/10b NRZ            | 250 MB/s               | 8 GB/s
Gen 2 (2.x) | 5.0 GT/s  | 8b/10b NRZ            | 500 MB/s               | 16 GB/s
Gen 3 (3.x) | 8.0 GT/s  | 128b/130b NRZ         | ~1 GB/s                | 32 GB/s
Gen 4 (4.0) | 16.0 GT/s | 128b/130b NRZ         | ~2 GB/s                | 64 GB/s
Gen 5 (5.0) | 32.0 GT/s | 128b/130b NRZ         | ~4 GB/s                | 128 GB/s
Gen 6 (6.0) | 64.0 GT/s | PAM4 + FEC + Flit     | ~8 GB/s                | 256 GB/s

How the bandwidth is calculated

Gen 1/2 (8b/10b): 10 bits on the wire carry 8 bits of data, so each data byte costs 10 wire bits. Gen 1 at 2.5 GT/s gives 2.5 Gb/s ÷ 10 bits per byte = 250 MB/s per lane per direction.

Gen 3–5 (128b/130b): 130 bits carry 128 bits. The 2-bit overhead (1.5%) is small enough to ignore in round-number calculations. Gen 3 at 8 GT/s gives 8 Gb/s × 128/130 ÷ 8 ≈ 985 MB/s, or approximately 1 GB/s per lane per direction.

Gen 6 (PAM4): PAM4 encodes 2 bits per symbol instead of 1. At 32 GBaud (32 billion symbols per second), the line rate is 64 GT/s, which is a raw 64 Gb/s ÷ 8 = 8 GB/s per lane per direction. FEC and CRC bytes inside each flit take a few percent of that, so usable throughput is slightly below the headline 8 GB/s.
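The arithmetic above is easy to reproduce. The sketch below recomputes the per-lane and x16 figures from line rate and encoding efficiency. It uses decimal units (1 MB = 10^6 bytes) and treats Gen 6 as its raw rate only, so flit, CRC, and FEC overhead are not modelled; the table layout and function names are this example's own.

```python
GENERATIONS = {
    # name: (line rate in GT/s, payload bits, wire bits)
    "Gen 1": (2.5, 8, 10),
    "Gen 2": (5.0, 8, 10),
    "Gen 3": (8.0, 128, 130),
    "Gen 4": (16.0, 128, 130),
    "Gen 5": (32.0, 128, 130),
    "Gen 6": (64.0, 1, 1),  # raw rate only; flit/CRC/FEC overhead not modelled
}


def lane_bandwidth_mb_s(gt_per_s, payload_bits, wire_bits):
    """Per-lane, per-direction bandwidth in MB/s."""
    return gt_per_s * 1000 * payload_bits / wire_bits / 8


def link_bandwidth_gb_s(generation, lanes=16, both_directions=True):
    """Whole-link bandwidth in GB/s for a given width."""
    per_lane = lane_bandwidth_mb_s(*GENERATIONS[generation])
    directions = 2 if both_directions else 1
    return per_lane * lanes * directions / 1000


for name, params in GENERATIONS.items():
    x1 = lane_bandwidth_mb_s(*params)
    x16 = link_bandwidth_gb_s(name)
    print(f"{name}: x1 {x1:7.1f} MB/s per direction, x16 {x16:6.1f} GB/s aggregate")
```

Run as written, this shows the Gen 3–5 rows landing slightly under the rounded table values (about 31.5 GB/s rather than 32 GB/s for Gen 3 x16) because of the 128b/130b overhead.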

Gen 6 — What Changed and Why It Matters

Gen 6 is not just Gen 5 running twice as fast. It is a significant architectural change at the Physical Layer driven by a hard physical reality: NRZ (Non-Return-to-Zero) signalling cannot reliably run faster than 32 GT/s on typical PCB trace lengths.

The NRZ ceiling

NRZ uses two voltage levels: high for logic 1, low for logic 0. As you increase the symbol rate, each bit gets a shorter time slot while channel loss and inter-symbol interference grow, so the eye diagram (the open area where the receiver samples the bit) closes. At 32 GT/s, the eye is already tight. Doubling to 64 GT/s with NRZ would essentially close the eye completely on any realistically lossy channel; the signal integrity simply doesn’t support it.

PAM4 — Four levels instead of two

PCIe 6.0 switches to PAM4 (Pulse Amplitude Modulation with 4 levels). Instead of two voltage levels carrying 1 bit per symbol, PAM4 uses four voltage levels carrying 2 bits per symbol at the same baud rate. At 32 GBaud, PAM4 delivers 64 Gb/s — same baud rate as Gen 5 NRZ, double the bit rate.
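A short sketch makes the "same baud rate, double the bits" point concrete. It uses the plain binary level mapping drawn in Figure 8 (11 → +3, 10 → +1, 01 → −1, 00 → −3) as an assumption; a real transmitter may map bit pairs to levels differently (for example with Gray coding), and the function names are hypothetical.

```python
# Bit pair -> signal level, following the mapping drawn in Figure 8.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 0): +1, (1, 1): +3}
LEVEL_TO_BITS = {level: pair for pair, level in PAM4_LEVELS.items()}


def nrz_encode(bits):
    """NRZ: one symbol per bit, two levels."""
    return [+1 if b else -1 for b in bits]


def pam4_encode(bits):
    """PAM4: one symbol per pair of bits, four levels."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam4_decode(symbols):
    """Map each level back to its two bits."""
    return [bit for s in symbols for bit in LEVEL_TO_BITS[s]]


bits = [0, 0, 1, 1, 1, 0, 0, 1]
nrz_symbols = nrz_encode(bits)
pam4_symbols = pam4_encode(bits)

print("NRZ symbols: ", len(nrz_symbols), nrz_symbols)
print("PAM4 symbols:", len(pam4_symbols), pam4_symbols)

baud = 32e9
print("NRZ bit rate: ", baud * 1 / 1e9, "Gb/s")
print("PAM4 bit rate:", baud * 2 / 1e9, "Gb/s")
```

The same eight bits need eight NRZ symbols but only four PAM4 symbols. The price is that the four levels sit closer together than the two NRZ levels, which is why Gen 6 pairs PAM4 with FEC.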

[Figure 8 panels. NRZ (Gen 1–5): 2 voltage levels (+300 mV / −300 mV), 1 bit per symbol; the eye is closing at 32 GT/s and cannot go faster. PAM4 (Gen 6): 4 voltage levels, 2 bits per symbol, with +3 = 11, +1 = 10, −1 = 01, −3 = 00; same baud rate as Gen 5, 2× the bits.]
Figure 8 — NRZ (left): 2 voltage levels, 1 bit/symbol. PAM4 (right): 4 voltage levels, 2 bits/symbol. Gen 6 uses PAM4 at 32 GBaud, delivering 64 Gb/s p