PCIe Series — PCIe-04: PCIe Generations Gen 1 to Gen 6 — VLSI Trainers
PCIe Generations — Gen 1 to Gen 6
The bandwidth maths behind every generation, why each speed step chose its encoding scheme, what physical constraints drove each decision, and a deep dive into Gen 6’s PAM4 and flit architecture with worked numbers.
📈 All Six Generations at a Glance
Figure 1 — Per-lane bandwidth by generation (each direction). The doubling mechanism changes: Gen 1→2 doubles the baud rate, Gen 2→3 changes encoding, Gen 3–5 double the baud rate again, Gen 5→6 changes modulation to PAM4. The bar chart is not to scale above Gen 5 — Gen 6 would be twice the height of Gen 5.
📋 How Bandwidth Is Calculated
The raw line rate (GT/s = Giga-Transfers per second) is the number of symbol transitions on the wire per second. That is not the same as data throughput — the encoding overhead must be subtracted.
Figure 2 — Bandwidth calculation for 8b/10b (Gen 1/2) and 128b/130b (Gen 3–5). For Gen 6, PAM4 contributes 2 bits per symbol at 32 GBaud, giving 64 GT/s effective, with FEC parity overhead reducing the actual payload throughput to approximately 8 GB/s per lane.
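The calculation for the two block encodings can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `lane_throughput_mb_s` is made up for this example, and 1 MB is taken as 10⁶ bytes.

```python
def lane_throughput_mb_s(line_rate_gt_s: float, data_bits: int, block_bits: int) -> float:
    """Per-lane, per-direction payload throughput in MB/s (1 MB = 10**6 bytes)."""
    efficiency = data_bits / block_bits          # e.g. 8/10 or 128/130
    payload_bits_per_s = line_rate_gt_s * 1e9 * efficiency
    return payload_bits_per_s / 8 / 1e6          # bits -> bytes -> MB


if __name__ == "__main__":
    generations = [
        ("Gen 1", 2.5, 8, 10),
        ("Gen 2", 5.0, 8, 10),
        ("Gen 3", 8.0, 128, 130),
        ("Gen 4", 16.0, 128, 130),
        ("Gen 5", 32.0, 128, 130),
    ]
    for name, rate, data_bits, block_bits in generations:
        print(f"{name}: {lane_throughput_mb_s(rate, data_bits, block_bits):.1f} MB/s per lane")
```

Multiplying the result by the link width gives the per-direction figure for x4, x8, or x16 links.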
📋 Gen 1 — 2.5 GT/s and 8b/10b
The first PCIe generation launched with the spec in 2003. The target was simple: be software-compatible with PCI while beating its bandwidth on a modern serial link.
Why 2.5 GT/s?
The 2.5 GHz bit clock was chosen as the highest frequency achievable with 2001-era CMOS process technology and standard PCB materials (FR4) at acceptable signal integrity margins. The transmitter spec required a differential swing of approximately 800 mV peak-to-peak, which is achievable without exotic processes.
Why 8b/10b?
8b/10b encoding was already proven in other serial protocols (Fibre Channel, Gigabit Ethernet, SATA). It delivers DC balance (essential for AC-coupled links), guaranteed transition density for CDR, and clear packet framing using K-characters. The 20% overhead was an acceptable cost in 2003 when the alternative was a parallel bus with 20× less bandwidth.
Special symbols (K-characters)
8b/10b reserves specific 10-bit patterns as control symbols with no 8-bit equivalent. PCIe uses these for framing and link management:
| Symbol | Name | Purpose |
| --- | --- | --- |
| K28.5 | COM | Start of an Ordered Set. Tells the receiver to start the sync/framing process. Used in TS1, TS2, and SKIP ordered sets. |
| K27.7 | STP | Start of TLP. Marks the beginning of a Transaction Layer Packet on the wire. |
| K28.2 | SDP | Start of DLLP. Marks the beginning of a Data Link Layer Packet. |
| K29.7 | END | End of packet. Marks the end of a TLP or DLLP. |
| K30.7 | EDB | End Bad. Marks the end of a packet that was intentionally nullified by the transmitter. |
| K28.0 | SKP | Skip. Used in the SKIP Ordered Set for elastic buffer clock compensation. |
The SKIP Ordered Set (K28.5 · K28.0 · K28.0 · K28.0) is transmitted every 1180–1538 symbol times. The transmitter and receiver run from independent clocks, each within ±300 ppm of the target frequency, so the two can differ by up to 600 ppm. The receiver’s elastic buffer compensates by adding or removing SKP symbols as needed to prevent buffer overflow or underflow.
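A quick check shows why the upper bound of 1538 symbol times is comfortable. The sketch below computes how many symbols it takes for two clocks to drift apart by one full symbol; the function name is hypothetical and the model ignores everything except the static frequency offset.

```python
def symbols_per_one_symbol_drift(tx_ppm: float, rx_ppm: float) -> float:
    """Symbols transmitted before TX and RX clocks slip by one whole symbol.

    Worst case is when the two offsets have opposite signs, so the
    magnitudes add (for example +300 ppm and -300 ppm give 600 ppm).
    """
    worst_case_ppm = abs(tx_ppm) + abs(rx_ppm)
    return 1e6 / worst_case_ppm


if __name__ == "__main__":
    drift_period = symbols_per_one_symbol_drift(+300, -300)
    max_skip_interval = 1538  # upper bound quoted in the text
    print(f"one symbol of drift every {drift_period:.0f} symbols")
    print(f"SKIP sets arrive at least every {max_skip_interval} symbols")
```

Because a SKIP Ordered Set always arrives before a full symbol of drift has accumulated, removing or adding a single SKP per ordered set is enough to keep up.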
Gen 1 numbers
| Parameter | Value |
| --- | --- |
| Line rate | 2.5 GT/s |
| Encoding | 8b/10b (10-bit symbols, 8 data bits) |
| Encoding efficiency | 80% |
| x1 lane throughput (each dir) | 250 MB/s |
| x16 aggregate (both dirs) | 8 GB/s |
| Symbol time | 4 ns |
| SKIP ordered set interval | 1180–1538 symbol times |
| Clock tolerance | ±300 ppm (max 600 ppm between TX and RX) |
📋 Gen 2 — 5.0 GT/s, Same Encoding
Gen 2 (PCIe 2.0, 2007) doubled the line rate from 2.5 to 5.0 GT/s while keeping 8b/10b encoding. This was the simplest possible speed step — if the physical layer can clock twice as fast, you get twice the bandwidth. No encoding change required.
Why keep 8b/10b at 5 GT/s?
At 5 GT/s on FR4 PCB material, the channel loss is moderate but manageable with simple pre-emphasis at the transmitter. 8b/10b still works. The encoded symbols are shorter in time (2 ns each instead of 4 ns), but the same framing and K-characters apply without modification, so a Gen 2 device can run at either speed with no change to its encoding logic.
Backward compatibility rule. A Gen 2 transmitter must fall back to Gen 1 speed when connected to a Gen 1 receiver — the LTSSM negotiates the highest common speed during link training. This works cleanly because the upper layers (TL, DLL) and software model are completely unchanged between Gen 1 and Gen 2.
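The fallback rule amounts to picking the highest data rate both partners support. The sketch below models only that outcome; it is not the LTSSM itself, and `negotiate_speed` is a hypothetical helper that takes each side's supported rates as a set of GT/s values.

```python
def negotiate_speed(a_supported: set, b_supported: set) -> float:
    """Return the highest line rate (GT/s) supported by both link partners."""
    common = set(a_supported) & set(b_supported)
    if not common:
        raise ValueError("link partners share no common data rate")
    return max(common)


if __name__ == "__main__":
    gen2_device = {2.5, 5.0}
    gen1_device = {2.5}
    print(negotiate_speed(gen2_device, gen1_device))   # Gen 2 talking to Gen 1
    print(negotiate_speed(gen2_device, gen2_device))   # Gen 2 talking to Gen 2
```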
What actually changed in Gen 2
Higher frequency — pre-emphasis increased to compensate for greater channel loss at 5 GT/s
Receiver equalization (CTLE — Continuous Time Linear Equalizer) added to boost high-frequency content at the input
Otherwise identical to Gen 1 at the protocol level — same TLPs, same DLLPs, same framing
📋 Gen 3 — Why 8 GT/s, Not 10
This is the question everyone asks. If Gen 1 is 2.5 GT/s and Gen 2 is 5.0 GT/s — why is Gen 3 at 8 GT/s instead of 10 GT/s? The answer is in the encoding change.
Figure 3 — Why Gen 3 chose 8 GT/s with 128b/130b instead of 10 GT/s with 8b/10b. Both deliver approximately 1 GB/s per lane, but the 8 GT/s option operates at a 20% lower Nyquist frequency, giving 3–5 dB better channel loss margin — enough to maintain FR4 compatibility.
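The comparison can be checked numerically. The sketch below evaluates both candidate designs using the same throughput formula as earlier sections and takes the Nyquist frequency of an NRZ link as half the line rate; the function name `nrz_option` is an assumption of this example.

```python
def nrz_option(line_rate_gt_s: float, data_bits: int, block_bits: int) -> dict:
    """Throughput and Nyquist frequency for an NRZ link with a block code."""
    return {
        "throughput_gb_s": line_rate_gt_s * data_bits / block_bits / 8,
        "nyquist_ghz": line_rate_gt_s / 2,   # NRZ: one bit per symbol
    }


if __name__ == "__main__":
    hypothetical = nrz_option(10.0, 8, 10)     # 10 GT/s with 8b/10b
    actual = nrz_option(8.0, 128, 130)         # 8 GT/s with 128b/130b
    saving = 1 - actual["nyquist_ghz"] / hypothetical["nyquist_ghz"]
    print(hypothetical)
    print(actual)
    print(f"Nyquist frequency reduced by {saving:.0%}")
```

The two options differ by less than 2% in throughput, while the 8 GT/s design runs at a Nyquist frequency of 4 GHz instead of 5 GHz.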
128b/130b — what it buys
The transition from 8b/10b to 128b/130b in Gen 3 changes three things:
Overhead drops from 20% to 1.54% — 128 data bits per 130-bit block instead of 8 per 10
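No more K-characters: each 130-bit block starts with a 2-bit sync header (10b = Data Block, 01b = Ordered Set Block), and packet boundaries inside Data Blocks are marked by framing tokens rather than control symbols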
Scrambling becomes mandatory and stronger — 128b/130b has no inherent DC balance, so a 23-bit LFSR scrambles all data to guarantee transition density for CDR lock
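The effect of scrambling on transition density can be illustrated with a small additive scrambler. This is a sketch under stated assumptions, not a bit-accurate model of the PCIe scrambler: the tap positions correspond to the polynomial x²³ + x²¹ + x¹⁶ + x⁸ + x⁵ + x² + 1 as I understand it for Gen 3, while the seed, bit ordering, and reset behaviour are chosen only for illustration.

```python
MASK23 = (1 << 23) - 1
TAPS = (23, 21, 16, 8, 5, 2)   # assumed polynomial; see the lead-in
SEED = 0x1DBFBC                # arbitrary non-zero seed for this demo


def keystream(seed: int, nbits: int) -> list:
    """Generate nbits of pseudo-random output from a 23-bit Fibonacci LFSR."""
    state = seed & MASK23
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(nbits):
        bits.append((state >> 22) & 1)            # output the top bit
        feedback = 0
        for tap in TAPS:
            feedback ^= (state >> (tap - 1)) & 1  # XOR of the tapped bits
        state = ((state << 1) | feedback) & MASK23
    return bits


def scramble(bits: list, seed: int) -> list:
    """XOR data with the keystream; applying it twice restores the data."""
    return [b ^ k for b, k in zip(bits, keystream(seed, len(bits)))]


def transitions(bits: list) -> int:
    """Count 0->1 and 1->0 changes between consecutive bits."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


if __name__ == "__main__":
    payload = [0] * 128                 # worst case: no transitions at all
    on_wire = scramble(payload, SEED)
    recovered = scramble(on_wire, SEED)
    print("transitions before:", transitions(payload))
    print("transitions after: ", transitions(on_wire))
    print("round trip ok:     ", recovered == payload)
```

An all-zeros block has no edges for the CDR to lock onto, whereas the scrambled version does, and the receiver recovers the original data by running the same LFSR from the same starting state.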
Gen 3 also added link equalisation
At 8 GT/s, the channel is lossy enough that simple pre-emphasis no longer suffices. Gen 3 introduced a formal link equalisation phase in the LTSSM — during link training, the two devices negotiate multi-tap FIR (Finite Impulse Response) filter coefficients. The transmitter and receiver exchange proposals and the best filter settings are selected before the link enters L0. This is entirely absent in Gen 1/2.
📋 Gen 4 — 16 GT/s, Pushing NRZ
Gen 4 (PCIe 4.0, 2017) doubled Gen 3’s line rate from 8 to 16 GT/s while keeping 128b/130b encoding. This is the same kind of step as Gen 1 to Gen 2: no encoding change, just double the baud rate.
The Nyquist frequency jumps from 4 GHz to 8 GHz. This is still achievable on FR4 with aggressive equalization — multi-tap TX FIR coefficients, more powerful RX CTLE and DFE (Decision Feedback Equalizer). Channel trace lengths need to be shorter or use better PCB laminates.
x1 throughput per direction: ~2 GB/s
x16 aggregate: ~64 GB/s
First generation where AI/ML workloads started driving the demand (GPU↔CPU at ~32 GB/s per direction on a x16 link)
First generation where PCIe retimers became common on server boards for longer traces
📋 Gen 5 — 32 GT/s, the NRZ Ceiling
Gen 5 (PCIe 5.0, 2019) doubled again to 32 GT/s, still using 128b/130b. Nyquist frequency is 16 GHz. At this speed, NRZ signalling on any realistic PCB trace length is at its practical limit.
Figure 4 — Eye diagram comparison. Gen 1/2 NRZ has one wide-open eye. Gen 5 NRZ has a closing eye requiring heavy equalization. Gen 6 PAM4 has three stacked eyes, each with only 1/3 the voltage margin of NRZ — which is why FEC is mandatory in Gen 6.
Gen 5 pushed equalisation to its limits. Retimers (active repeaters) are effectively mandatory on server PCB designs for runs longer than 4–6 inches. Despite this, the protocol is unchanged — same 128b/130b, same TLPs and DLLPs, same software model.
x1 throughput per direction: ~4 GB/s
x16 aggregate: ~128 GB/s
CXL 1.0/2.0 both run on PCIe 5.0 PHY
First PCIe generation where retimers are practically mandatory in most designs
⚡ Gen 6 — 64 GT/s and Why Everything Changed
Gen 6 (PCIe 6.0, 2022) faces a hard constraint: NRZ signalling at 64 GT/s is physically impractical. Doubling to 64 GT/s with NRZ would mean a Nyquist frequency of 32 GHz, twice that of Gen 5. Channel insertion loss rises steeply with frequency, so an FR4 trace that is only just workable at Gen 5 would be unrecoverable at 32 GHz without exotic and expensive channel materials.
The solution is a different modulation — PAM4 — combined with FEC and a new framing model. Gen 6 achieves 64 GT/s effective bit rate at only 32 GBaud (the same baud rate as Gen 5 at 32 GT/s NRZ), because PAM4 carries 2 bits per symbol.
Figure 5 — PAM4 achieves 64 GT/s effective bit rate at only 32 GBaud, keeping the Nyquist frequency at 16 GHz — identical to Gen 5. A hypothetical 64 GT/s NRZ would require 32 GHz Nyquist, which is impractical on standard PCB materials.
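The relationship between bit rate, bits per symbol, and Nyquist frequency is simple enough to verify directly. The helper below is a minimal sketch with a hypothetical name; it treats the Nyquist frequency as half the symbol (baud) rate.

```python
def nyquist_ghz(bit_rate_gt_s: float, bits_per_symbol: int) -> float:
    """Nyquist frequency in GHz for a given bit rate and modulation order."""
    baud_rate = bit_rate_gt_s / bits_per_symbol   # symbols per second, in GBaud
    return baud_rate / 2


if __name__ == "__main__":
    print("Gen 5, NRZ, 32 GT/s:        ", nyquist_ghz(32, 1), "GHz")
    print("Gen 6, PAM4, 64 GT/s:       ", nyquist_ghz(64, 2), "GHz")
    print("Hypothetical NRZ at 64 GT/s:", nyquist_ghz(64, 1), "GHz")
```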
The tradeoff Gen 6 makes. PAM4 gets twice the bits per symbol at the same baud rate, but at a significant cost: the voltage gap between adjacent levels is only 1/3 of NRZ. If NRZ has a 600 mV peak-to-peak eye, PAM4’s three eyes are each only about 200 mV, which leaves far less margin against noise. At the same signal quality, the raw bit error rate (BER) of PAM4 is orders of magnitude worse than NRZ. The spec target for PCIe has always been BER < 10⁻¹² (1 error per trillion bits), and PAM4 without FEC cannot meet it. That is why FEC is mandatory in Gen 6: it is architecturally required, not an optional feature.
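The 1/3 figure follows from dividing the same swing into three gaps instead of one. The sketch below works through that arithmetic and expresses the loss of amplitude margin in decibels; it assumes equally spaced levels and an idealised eye, and the function names are made up for this example.

```python
import math


def eye_height_mv(swing_mv: float, levels: int) -> float:
    """Height of each eye when the swing is split across equally spaced levels."""
    return swing_mv / (levels - 1)


def amplitude_penalty_db(levels: int) -> float:
    """Amplitude margin lost relative to two-level (NRZ) signalling, in dB."""
    return 20 * math.log10(levels - 1)


if __name__ == "__main__":
    swing = 600.0  # mV peak-to-peak, the example value used in the text
    print(f"NRZ eye:  {eye_height_mv(swing, 2):.0f} mV")
    print(f"PAM4 eye: {eye_height_mv(swing, 4):.0f} mV")
    print(f"PAM4 amplitude penalty: {amplitude_penalty_db(4):.2f} dB")
```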
📋 PAM4 Deep Dive — Symbols, Eyes, and BER
PAM4 (Pulse Amplitude Modulation with 4 levels) encodes 2 bits into each symbol by using four distinct voltage levels. The mapping is Gray-coded, so adjacent levels differ in only one bit, which minimises the bit errors caused when a level is mistaken for its neighbour:
Figure 6 — PAM4 symbol mapping, waveform example, and BER impact. Gray coding ensures adjacent-level errors cost only 1 bit instead of 2. Without FEC, PAM4’s raw BER is 10⁶–10⁹× worse than NRZ. Reed-Solomon FEC corrects this back to the spec target of < 10⁻¹².
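The benefit of Gray coding can be shown by counting the bit errors caused when a level is mistaken for its neighbour. The level ordering below (00, 01, 11, 10 from lowest to highest) is the standard 2-bit Gray sequence and is used here as an illustrative assumption, not as a statement of the exact voltage assignment in the spec.

```python
GRAY_LEVELS = ["00", "01", "11", "10"]     # index = level 0 (lowest) to 3 (highest)
BINARY_LEVELS = ["00", "01", "10", "11"]   # plain binary counting, for comparison


def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def adjacent_bit_errors(mapping: list) -> list:
    """Bit errors caused by confusing each level with the next one up."""
    return [hamming(mapping[i], mapping[i + 1]) for i in range(len(mapping) - 1)]


if __name__ == "__main__":
    print("Gray coded:  ", adjacent_bit_errors(GRAY_LEVELS))
    print("Plain binary:", adjacent_bit_errors(BINARY_LEVELS))
```

With plain binary counting, confusing the two middle levels flips both bits at once; with Gray coding every adjacent-level mistake costs exactly one bit.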
📋 FEC Deep Dive — RS Codes and Correction Capability
Gen 6 uses a Reed-Solomon FEC code. The specific code parameters chosen by the PCIe 6.0 spec are RS(544, 514) — meaning 544 total symbols per codeword, 514 of which carry data and 30 carry parity. Each RS symbol is 10 bits wide.
Figure 7 — RS(544,514) FEC codeword. 514 data symbols + 30 parity symbols = 544 total. The code can correct up to 15 symbol errors per codeword. The FEC decoder operates in the Physical Layer — the Data Link Layer sees only the corrected bitstream and runs the same ACK/NAK protocol as Gen 1–5.
Why FEC is at the Physical Layer, not the Data Link Layer. Adding FEC to the DLL would mean the DLL had to know about flit boundaries, PAM4 codewords, and symbol-level error correction — it would break the clean layer separation. Putting FEC in the Physical Layer keeps the DLL identical across all generations. The DLL just gets clean bits, same as always. This is the right architecture choice.
📋 Flit Deep Dive — 256-byte Format and Protocol Impact
Gen 6 replaces start/end framing symbols with a flit-based framing model. A flit (flow control unit) is 256 bytes. Every flit is sent as a fixed-size container regardless of how many TLPs or DLLPs it carries.
Figure 8 — Gen 6 flit structure. The 256-byte container carries one or more TLPs, optional DLLPs, padding to fill the flit, and an FEC parity block. TLPs can span flit boundaries. ACK/NAK replay operates at flit granularity. Compared to start/end framing in Gen 1–5, flit-based framing reduces per-TLP overhead and enables clean FEC block boundaries.
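The fixed-container idea can be sketched as a packing calculation. The text above does not give the size of the usable payload area inside a flit, so `payload_bytes_per_flit` is a caller-supplied parameter, and the value 200 in the demo is chosen purely for illustration. Packets are allowed to span flit boundaries, as described for TLPs.

```python
import math


def pack_into_flits(packet_sizes: list, payload_bytes_per_flit: int) -> tuple:
    """Return (flits needed, padding bytes) for a back-to-back packet stream.

    Packets may straddle flit boundaries, so only the total byte count
    matters; the last flit is padded up to the fixed container size.
    """
    total = sum(packet_sizes)
    flits = math.ceil(total / payload_bytes_per_flit) if total else 0
    padding = flits * payload_bytes_per_flit - total
    return flits, padding


if __name__ == "__main__":
    tlp_sizes = [64, 128, 300]             # hypothetical packet sizes in bytes
    flits, padding = pack_into_flits(tlp_sizes, payload_bytes_per_flit=200)
    print(f"{flits} flits, {padding} bytes of padding")
```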
📋 Clock Recovery Across Generations
Every PCIe generation embeds the clock in the data stream. There is no forwarded clock signal. The receiver uses a CDR (Clock and Data Recovery) circuit — typically a PLL — to lock onto the incoming bit transitions and extract the transmitter’s clock from them.
Figure 9 — Three CDR clock architecture options supported by PCIe. All three require the receiver to achieve and maintain bit lock on the incoming stream. In Gen 6, the CDR must lock on PAM4 symbols — a more complex process than NRZ, requiring DSP-based CDR in most implementations.
Every PCIe receiver has an elastic buffer between the CDR and the Data Link Layer. Its purpose is to handle the small clock frequency difference between the transmitter and the receiver — even though both must be within ±300 ppm of the target frequency, the worst case is 600 ppm apart. That is 1 symbol difference every 1,666 symbols.
The elastic buffer absorbs this difference by adding or removing SKP symbols (Gen 1/2) or equivalent padding in Gen 3+ from the periodic SKIP ordered sets that arrive. Symbols are clocked into the buffer using the recovered clock (same rate as the transmitter) and clocked out using the local clock (which may be slightly faster or slower). Adding or removing a SKP symbol when the buffer level approaches overflow or underflow keeps the buffer within its safe operating range.
| Clock scenario | Buffer behaviour | Correction |
| --- | --- | --- |
| TX clock faster than RX local clock | Buffer filling up → overflow risk | Remove (discard) a SKP symbol from the SKIP ordered set to drain the buffer |
| RX local clock faster than TX clock | Buffer emptying → underflow risk | Insert an extra SKP symbol into the SKIP ordered set to fill the buffer |
| Clocks matched within tolerance | Buffer level stable | No modification to SKIP ordered sets needed |
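The behaviour in the table can be reproduced with a toy simulation. This is a simplified sketch, not a model of any real receiver: the buffer depth of 8 symbols, the half-symbol thresholds, and the use of a fractional fill level are all assumptions made for illustration.

```python
def simulate_elastic_buffer(tx_ppm: float, rx_ppm: float, n_symbols: int,
                            skip_interval: int = 1538, depth: float = 8.0,
                            correct: bool = True) -> tuple:
    """Track buffer fill level; return (lowest, highest) level observed."""
    write_rate = 1 + tx_ppm * 1e-6             # recovered (transmitter) clock
    read_rate = 1 + rx_ppm * 1e-6              # receiver's local clock
    drift_per_symbol = 1 - read_rate / write_rate
    mid = depth / 2
    level = lowest = highest = mid             # start half full
    for n in range(1, n_symbols + 1):
        level += drift_per_symbol              # net fill change per symbol written
        lowest, highest = min(lowest, level), max(highest, level)
        if correct and n % skip_interval == 0: # a SKIP ordered set arrives
            if level > mid + 0.5:
                level -= 1                     # drop one SKP to drain the buffer
            elif level < mid - 0.5:
                level += 1                     # add one SKP to refill the buffer
    return lowest, highest


if __name__ == "__main__":
    for correct in (True, False):
        lo, hi = simulate_elastic_buffer(+300, -300, 100_000, correct=correct)
        print(f"correction={correct}: level stayed between {lo:.2f} and {hi:.2f}")
```

With correction enabled the fill level stays close to the midpoint; with it disabled, the same 600 ppm offset walks the level straight past the buffer depth.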
📋 Equalization — Gen 3 Through Gen 6
At Gen 1/2 speeds, a simple fixed pre-emphasis on the TX side is sufficient to compensate for PCB trace loss. From Gen 3 onwards, the channel loss is too severe and too variable for a fixed setting — an adaptive equalization process is built into the LTSSM link training sequence.
Figure 10 — Equalization complexity by generation. Gen 1/2 uses fixed pre-emphasis. Gen 3+ introduces adaptive FIR negotiation during link training. Gen 6 requires DSP-based equalisation plus FEC. Active retimers become critical from Gen 4 onward.
📋 Full Bandwidth Comparison — Every Width
| Generation | Line Rate | Encoding | x1 per dir | x4 per dir | x8 per dir | x16 per dir | x16 aggr. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gen 1 | 2.5 GT/s | 8b/10b | 250 MB/s | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s |
| Gen 2 | 5.0 GT/s | 8b/10b | 500 MB/s | 2 GB/s | 4 GB/s | 8 GB/s | 16 GB/s |
| Gen 3 | 8.0 GT/s | 128b/130b | ~985 MB/s | ~3.9 GB/s | ~7.9 GB/s | ~15.8 GB/s | ~32 GB/s |
| Gen 4 | 16.0 GT/s | 128b/130b | ~1.97 GB/s | ~7.9 GB/s | ~15.8 GB/s | ~31.5 GB/s | ~64 GB/s |
| Gen 5 | 32.0 GT/s | 128b/130b | ~3.94 GB/s | ~15.8 GB/s | ~31.5 GB/s | ~63 GB/s | ~128 GB/s |
| Gen 6 | 64.0 GT/s | PAM4 + FEC | ~7.6 GB/s | ~30.5 GB/s | ~61 GB/s | ~122 GB/s | ~244 GB/s |
Why Gen 6 x16 is ~122 GB/s per direction, not exactly 2× Gen 5. Gen 5 uses 128b/130b (1.54% overhead). Gen 6 uses PAM4 + RS(544,514) FEC (~5.5% overhead at symbol level) plus flit header bytes (~3% additional overhead). Effective payload efficiency ≈ 91–93%. At 64 GT/s PAM4 × ~91% efficiency ÷ 8 ≈ 7.3–7.7 GB/s per lane, rounded to ~8 GB/s in spec documentation. Exact values depend on payload mix and flit fill efficiency.
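The overhead figures quoted above can be combined as follows. This sketch simply multiplies the two approximate efficiencies from the paragraph (about 5.5% FEC overhead and about 3% flit header overhead); both defaults are the article's rounded estimates rather than exact spec values, and the function name is hypothetical.

```python
def gen6_lane_estimate(fec_overhead: float = 0.055, flit_overhead: float = 0.03,
                       bit_rate_gt_s: float = 64.0) -> tuple:
    """Return (payload efficiency, GB/s per lane per direction)."""
    efficiency = (1 - fec_overhead) * (1 - flit_overhead)
    return efficiency, bit_rate_gt_s * efficiency / 8


if __name__ == "__main__":
    efficiency, gb_per_s = gen6_lane_estimate()
    print(f"efficiency: {efficiency:.1%}")
    print(f"per lane:   {gb_per_s:.2f} GB/s")
    print(f"x16:        {16 * gb_per_s:.0f} GB/s per direction")
```

Small changes in either overhead estimate move the per-lane figure within the quoted range, which is why the table values are marked as approximate.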
📋 Quick Reference
| Concept | Key Point |
| --- | --- |
| BW formula (8b/10b) | GT/s × 0.8 ÷ 8 = GB/s per lane per direction |
| BW formula (128b/130b) | GT/s × (128/130) ÷ 8 ≈ GT/s × 0.123 GB/s per lane |
| Gen 1 (2.5 GT/s) | 8b/10b · 250 MB/s · Symbol time 4 ns · SKIP every 1180–1538 symbols |
| Gen 2 (5.0 GT/s) | 8b/10b unchanged · 2× Gen 1 frequency · LTSSM negotiates fallback to Gen 1 with Gen 1 devices |
| Gen 3 (8.0 GT/s) | 128b/130b (1.54% overhead) · 8 GT/s chosen over 10 GT/s because lower Nyquist keeps FR4 compatibility · Added adaptive link equalisation |
| Gen 4 (16.0 GT/s) | 128b/130b · ~2 GB/s per lane · Retimers common |
| Elastic buffer | Absorbs ±300 ppm TX/RX clock mismatch by adding/removing SKP symbols from periodic SKIP ordered sets |
| Link equalisation Gen 3+ | LTSSM equalization phase negotiates multi-tap FIR coefficients between TX and RX · Mandatory from Gen 3 |
| Retimers | Active signal regenerators · Transparent to software (no BDF) · Mandatory for long traces at Gen 5/6 |
Coming next: PCIe-05 covers the Transaction Layer in depth — TLP header formats for every TLP type, byte enables, the Tag field, and worked packet diagrams for every request-completion pair.