PCIe Series — PCIe-15: 8b/10b Encoding — VLSI Trainers

8b/10b Encoding

Why raw binary can’t be sent directly on a PCIe link — the three problems 8b/10b solves, how a 5b/6b + 3b/4b split works, Current Running Disparity, special control characters (K-codes), the 20% overhead and why it matters, and what Gen 3+ replaced it with.

📋 Why Raw Binary Cannot Be Sent

It seems logical to just send raw binary data directly onto the PCIe differential pair — if the byte is 0xFF (all ones), drive D+ high for eight consecutive bit times. But this creates three serious problems for the Physical Layer that make reliable high-speed communication impossible without encoding.

Three Problems With Sending Raw Binary at High Speed
① DC Wander: AC coupling caps block DC. Long runs of 1s charge the cap, the signal droops, and the receiver loses its reference voltage for distinguishing 0 from 1. Fix: balanced encoding with equal numbers of 1s and 0s.
② Clock Recovery Fails: The receiver CDR locks onto bit transitions to recover the clock. If the data is 00000000 11111111 there is only one transition in 16 bit times, so the CDR PLL drifts and data is corrupted. Fix: guaranteed transition density.
③ Limited Error Detection: All 8-bit patterns are valid raw binary. A corrupted bit just becomes a different valid byte, so the Physical Layer has no way to detect the error itself. Fix: many 10-bit codes are illegal.
Figure 1 — Three problems that make raw binary transmission unreliable at multi-gigabit speeds. DC balance is essential because PCIe uses AC coupling. Clock recovery requires regular transitions. Error detection benefits from illegal code space. All three are solved simultaneously by 8b/10b encoding.
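The first two problems can be measured directly on a bitstream. The following is a minimal sketch, not part of any PCIe tooling: the helper names `bits_of`, `longest_run` and `imbalance` are made up for illustration, and the MSB-first bit order is an arbitrary choice that does not affect the result here.

```python
from itertools import groupby

def bits_of(data: bytes) -> str:
    """Render bytes as a bit string, most significant bit first (illustrative choice)."""
    return "".join(f"{b:08b}" for b in data)

def longest_run(bits: str) -> int:
    """Length of the longest run of identical bits: long runs starve the CDR of transitions."""
    return max(len(list(group)) for _, group in groupby(bits))

def imbalance(bits: str) -> int:
    """Ones minus zeros: a large magnitude means a DC offset across the AC coupling caps."""
    return bits.count("1") - bits.count("0")

raw = bits_of(bytes([0x00, 0xFF, 0xFF, 0xFF]))
print(longest_run(raw), imbalance(raw))  # 24 16
```

Raw binary places no bound on either number, which is exactly what the encoding described in the following sections is designed to fix.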

📋 Three Problems 8b/10b Solves

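8b/10b encoding was developed at IBM in the early 1980s and later adopted by many high-speed serial standards, including Fibre Channel and PCIe Gen 1 and Gen 2. It solves all three raw binary problems simultaneously through a single encoding transformation: balanced symbols remove DC wander, a bounded run length guarantees the transitions the CDR needs, and the unused 10-bit patterns make many bit errors detectable.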

📋 How 8b/10b Works

8b/10b takes an 8-bit input byte and produces a 10-bit output symbol. The transformation is not a straightforward expansion — it is a split encoding: the 8-bit input is divided into a 5-bit sub-block and a 3-bit sub-block, each encoded separately.

[Figure 2 diagram: 8b/10b Encoding, 5b/6b + 3b/4b Split]
Figure 2 — 8b/10b encoding split. The 8-bit input is split into a lower 5-bit field (EDCBA) and an upper 3-bit field (HGF). The 5-bit field is encoded to 6 bits (5b/6b); the 3-bit field is encoded to 4 bits (3b/4b). The CRD (Current Running Disparity) determines which of the two possible encodings is used. The outputs are concatenated in order abcdei·fghj to form the 10-bit symbol sent on the wire.

Many sub-blocks (5-bit and 3-bit alike) have two possible encodings: one with more ones (positive disparity) and one with more zeros (negative disparity). The rest have a balanced encoding with equal ones and zeros, which in most cases is the same regardless of CRD. Where two encodings exist, the encoder selects between them according to the CRD to maintain DC balance.

📋 Character Notation — Dxx.y and Kxx.y

8b/10b uses a shorthand notation for naming characters. The 8-bit input byte is described in a specific format rather than just its hex value:

[Figure 3 diagram: Character Notation, Worked Derivation of D10.3 (0x6A = 01101010b → 011 | 01010 → xx = 10, y = 3 → D10.3)]
Figure 3: Deriving the name D10.3 from byte 0x6A. Step ①: partition 01101010 into its upper 3-bit field (011) and lower 5-bit field (01010). Step ②: swap their positions to put the 5-bit field first. Step ③: convert each sub-block to decimal (01010 = 10 and 011 = 3), giving D10.3. The D prefix means Data character; the K prefix means Control character.

This notation uniquely identifies every 8b/10b character. There are 256 data characters (D0.0 through D31.7), one for every byte value, and all of them are legal. Only 12 Kxx.y combinations are defined as control (K) characters; every other Kxx.y name is invalid.
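The naming rule is just a bit-field split, so it can be sketched in a few lines. The function names `char_name` and `char_value` are hypothetical helpers for illustration. The `control` flag only changes the prefix letter and does not check whether the K character is one of the 12 defined ones.

```python
def char_name(byte: int, control: bool = False) -> str:
    """Return the Dxx.y (or Kxx.y) name of an 8-bit value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    xx = byte & 0x1F   # lower 5 bits (EDCBA)
    y = byte >> 5      # upper 3 bits (HGF)
    return f"{'K' if control else 'D'}{xx}.{y}"

def char_value(name: str) -> int:
    """Inverse mapping: 'D10.3' or 'K28.5' back to the 8-bit value."""
    xx, y = name[1:].split(".")
    return (int(y) << 5) | int(xx)

print(char_name(0x6A))                # D10.3
print(char_name(0xBC, control=True))  # K28.5
print(hex(char_value("K28.5")))       # 0xbc
```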

📋 Disparity and the CRD

Disparity refers to the imbalance of ones versus zeros within a 10-bit symbol. A symbol with more ones than zeros has positive disparity (+). A symbol with more zeros has negative disparity (–). A symbol with equal ones and zeros has neutral disparity.

The Current Running Disparity (CRD) is a single bit maintained by both transmitter and receiver. It tracks the ongoing balance of ones and zeros in the serial stream. The CRD starts at either + or – and flips each time a non-neutral symbol is sent.

CRD (Current Running Disparity) Tracking Example

| Character | CRD before | 10-bit symbol chosen | Symbol disparity | CRD after |
|---|---|---|---|---|
| K28.5 | – | 001111 1010 | positive (+), more 1s | + |
| K28.5 | + | 110000 0101 | negative (–), more 0s | – |
| D10.3 | – | 010101 1100 | neutral, 5 ones and 5 zeros | – |
Figure 4 — CRD tracking through three characters. K28.5 has two encodings — with CRD=–, it uses 001111 1010 (6 ones, positive disparity) and CRD flips to +. With CRD=+, K28.5 uses 110000 0101 (6 zeros, negative disparity) and CRD flips to –. D10.3 is neutral (5 ones, 5 zeros) — CRD does not change.
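The bookkeeping in Figure 4 can be reproduced with a short sketch. The names `disparity` and `next_crd` are illustrative, and the CRD is represented here as -1 or +1, which is a convenience of this example rather than anything the standard prescribes.

```python
def disparity(symbol: str) -> int:
    """Ones minus zeros in a 10-bit symbol written as a bit string (spaces ignored)."""
    bits = symbol.replace(" ", "")
    return bits.count("1") - bits.count("0")

def next_crd(crd: int, symbol: str) -> int:
    """CRD after the symbol: unchanged for neutral symbols, otherwise the sign of the symbol."""
    d = disparity(symbol)
    if d == 0:
        return crd
    return 1 if d > 0 else -1

crd = -1  # start negative, as in the first row of Figure 4
for name, symbol in [("K28.5", "001111 1010"),
                     ("K28.5", "110000 0101"),
                     ("D10.3", "010101 1100")]:
    after = next_crd(crd, symbol)
    print(f"{name}: CRD {crd:+d} -> {after:+d} (symbol disparity {disparity(symbol):+d})")
    crd = after
```

Running it walks the CRD from – to +, back to –, and leaves it at – after the neutral D10.3.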

📋 CRD Rules and Encoding Choice

The CRD drives the encoding choice. The transmitter and receiver independently track the same CRD state — they stay synchronised because they apply the same rules to the same symbol stream. If they ever disagree, a disparity error is flagged.

| CRD before symbol | Symbol disparity | CRD after symbol | Legal? |
|---|---|---|---|
| Negative (–) | Positive (+), more ones | Positive (+), flips | ✓ Legal: balances previous negative |
| Negative (–) | Neutral, equal ones and zeros | Negative (–), unchanged | ✓ Legal: doesn’t worsen balance |
| Negative (–) | Negative (–), more zeros | Would be even more negative | ✗ Illegal: disparity error |
| Positive (+) | Negative (–), more zeros | Negative (–), flips | ✓ Legal: balances previous positive |
| Positive (+) | Neutral, equal ones and zeros | Positive (+), unchanged | ✓ Legal |
| Positive (+) | Positive (+), more ones | Would be even more positive | ✗ Illegal: disparity error |

The rule is simple: the disparity of each symbol must be opposite to or neutral with respect to the current CRD. Sending a positive-disparity symbol when CRD is already positive would make the link more unbalanced — that encoding is illegal and would never be produced by a correct encoder.

Worked Encoding Example

Let’s encode the byte 0x6A (D10.3) with CRD = negative. The 5-bit sub-block EDCBA = 01010 (decimal 10) and the 3-bit sub-block HGF = 011 (decimal 3).

[Figure 5 diagram: Encoding 0x6A (D10.3) with CRD = Negative. 5-bit field 01010 → abcdei = 010101 (3 ones, 3 zeros, neutral). 3-bit field 011 → fghj = 1100 (2 ones, 2 zeros, neutral). Concatenated symbol 0101011100 (5 ones, 5 zeros) is transmitted and the CRD remains –.]
Figure 5: Encoding 0x6A as D10.3 with CRD=–. The 5b→6b and 3b→4b sub-blocks of D10.3 are both neutral. The final symbol 0101011100 has exactly 5 ones and 5 zeros, so the CRD stays negative. With CRD=+ the 6-bit part is unchanged, but the 3b/4b code for .3 becomes 0011 instead of 1100, giving 0101010011. That symbol is also neutral, so the CRD would stay positive.
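The worked example generalises to a table-driven encoder for data characters. The sketch below uses the commonly published 5b/6b and 3b/4b code tables, which are not reproduced in this article, so treat them as an assumption to check against the specification before relying on them. It covers data characters only (no K-codes), and `encode_data` is an illustrative name. The CRD is represented as -1 or +1.

```python
# (RD- form, RD+ form) of the 6-bit code abcdei, indexed by the 5-bit value EDCBA.
FIVE_SIX = [
    ("100111", "011000"), ("011101", "100010"), ("101101", "010010"), ("110001", "110001"),  # D0-D3
    ("110101", "001010"), ("101001", "101001"), ("011001", "011001"), ("111000", "000111"),  # D4-D7
    ("111001", "000110"), ("100101", "100101"), ("010101", "010101"), ("110100", "110100"),  # D8-D11
    ("001101", "001101"), ("101100", "101100"), ("011100", "011100"), ("010111", "101000"),  # D12-D15
    ("011011", "100100"), ("100011", "100011"), ("010011", "010011"), ("110010", "110010"),  # D16-D19
    ("001011", "001011"), ("101010", "101010"), ("011010", "011010"), ("111010", "000101"),  # D20-D23
    ("110011", "001100"), ("100110", "100110"), ("010110", "010110"), ("110110", "001001"),  # D24-D27
    ("001110", "001110"), ("101110", "010001"), ("011110", "100001"), ("101011", "010100"),  # D28-D31
]
# (RD- form, RD+ form) of the 4-bit code fghj, indexed by the 3-bit value HGF.
THREE_FOUR = [
    ("1011", "0100"), ("1001", "1001"), ("0101", "0101"), ("1100", "0011"),
    ("1101", "0010"), ("1010", "1010"), ("0110", "0110"), ("1110", "0001"),
]
# Alternate code for y = 7, used for certain x values to avoid a run of five.
ALT_SEVEN = ("0111", "1000")
ALT_SEVEN_X = ({17, 18, 20}, {11, 13, 14})  # x values for RD-, RD+

def _update(rd: int, bits: str) -> int:
    """Running disparity after a sub-block: unchanged if balanced, else its sign."""
    d = bits.count("1") - bits.count("0")
    return rd if d == 0 else (1 if d > 0 else -1)

def encode_data(byte: int, rd: int) -> tuple[str, int]:
    """Encode one data byte; returns (10-bit symbol 'abcdeifghj', new running disparity)."""
    x, y = byte & 0x1F, byte >> 5
    col = 0 if rd < 0 else 1
    six = FIVE_SIX[x][col]
    rd = _update(rd, six)
    col = 0 if rd < 0 else 1          # the 3b/4b choice uses the disparity after 5b/6b
    if y == 7 and x in ALT_SEVEN_X[col]:
        four = ALT_SEVEN[col]
    else:
        four = THREE_FOUR[y][col]
    return six + four, _update(rd, four)

print(encode_data(0x6A, -1))  # ('0101011100', -1)
print(encode_data(0x6A, +1))  # ('0101010011', 1)
```

The two printed symbols for D10.3 differ only in the 3b/4b part (1100 versus 0011); both are neutral, so the running disparity is unchanged in either case.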

📋 Properties of a Legal Symbol

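A 10-bit symbol produced by correct 8b/10b encoding always satisfies all of the following constraints: it contains 4, 5, or 6 ones; its disparity is opposite to or neutral with respect to the current CRD; and it never produces a run of more than 5 identical bits, even across symbol boundaries. A received symbol that is not in the code tables is an immediate code violation error.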

Of the 2¹⁰ = 1024 possible 10-bit patterns, data characters use at most 512 (up to 2 encodings per byte × 256 bytes; bytes whose symbol is identical for both CRD states use only one). Adding the K-code symbols brings the upper bound to roughly 530 valid patterns. The remaining 494 or more patterns are illegal and immediately detectable as transmission errors.

📋 Special Control Symbols (K-Codes)

Besides the 256 data characters (D-codes), 8b/10b defines 12 special control characters called K-codes. K-codes do not represent data bytes; they are used for link management, framing, and ordered set signalling. The receiver can distinguish K-codes from D-codes because every K-code encodes to 10-bit patterns that no data character uses, so the decoder flags each received symbol as control or data from its bit pattern alone.

| Symbol name | 8b/10b name | Hex byte | Purpose in PCIe |
|---|---|---|---|
| COM | K28.5 | 0xBC | First symbol of every ordered set. Used for symbol lock: the receiver detects COM to find symbol boundaries in the serial bitstream. Also resets the scrambler state. |
| STP | K27.7 | 0xFB | Start of TLP. Marks the beginning of a TLP in the serial stream. The receiving Physical Layer looks for STP to delineate incoming TLPs before passing them to the Data Link Layer. |
| SDP | K28.2 | 0x5C | Start of DLLP. Marks the beginning of a DLLP in the serial stream. |
| END | K29.7 | 0xFD | End of packet (good). Placed after the last byte of an error-free TLP or DLLP. |
| EDB | K30.7 | 0xFE | End of bad packet. Replaces END when a switch in cut-through mode detects a packet error mid-stream and must nullify the in-flight packet. |
| SKP | K28.0 | 0x1C | Skip. Part of the SKIP ordered set used for clock tolerance compensation. Receiver may add or delete SKP characters to prevent elastic buffer overflow/underflow. |
| FTS | K28.1 | 0x3C | Fast Training Sequence. Used in FTS ordered sets to exit the L0s low-power link state and return to L0. The required number of FTS ordered sets is communicated during link training. |
| IDL | K28.3 | 0x7C | Idle. Part of the Electrical Idle Ordered Set (EIOS). Signals the receiver to prepare for electrical idle. |
| PAD | K23.7 | 0xF7 | Padding. On multi-lane links, if a packet doesn’t fill all lanes and no new packet is ready, PAD fills the remaining lanes. |
| EIE | K28.7 | 0xFC | Electrical Idle Exit. Part of the EIEOS ordered set added in PCIe 2.0 to provide a reliable signal for detecting exit from electrical idle at speeds above 2.5 GT/s. |
K-codes are not scrambled. Data bytes passing through the Physical Layer are scrambled (XORed with a pseudo-random pattern) to prevent spectral peaks from repetitive data patterns. K-codes are exempt from scrambling — they must arrive at the receiver with their exact bit patterns intact so that the receiver can use COM for symbol lock and identify packet boundaries via STP/SDP/END.
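The hex byte of each K-code follows from its Kxx.y name in the same way as for data characters: y occupies the upper 3 bits and xx the lower 5. A small lookup built on that rule can be handy when reading a symbol trace. This is a sketch; `PCIE_K_CODES`, `k_byte` and `BYTE_TO_SYMBOL` are illustrative names, not part of any real tool.

```python
# PCIe symbol name -> (xx, y) taken from its Kxx.y name.
PCIE_K_CODES = {
    "COM": (28, 5), "STP": (27, 7), "SDP": (28, 2), "END": (29, 7), "EDB": (30, 7),
    "SKP": (28, 0), "FTS": (28, 1), "IDL": (28, 3), "PAD": (23, 7), "EIE": (28, 7),
}

def k_byte(xx: int, y: int) -> int:
    """8-bit value of Kxx.y: y in the upper 3 bits, xx in the lower 5 bits."""
    return (y << 5) | xx

# Reverse lookup for decoding control bytes seen in a trace.
BYTE_TO_SYMBOL = {k_byte(xx, y): name for name, (xx, y) in PCIE_K_CODES.items()}

for name, (xx, y) in PCIE_K_CODES.items():
    print(f"{name}: K{xx}.{y} = 0x{k_byte(xx, y):02X}")
```

Note that the byte value alone does not identify a control character: 0xBC is COM only when the symbol is flagged as a K character, otherwise it is the data character D28.5.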

📋 Ordered Sets

An ordered set is a specific sequence of symbols sent on all active lanes at the same time. In Gen 1/2, ordered sets always begin with the COM (K28.5) symbol on every lane, followed by additional symbols that define the specific ordered set type. They are called “ordered” because their structure is tightly defined: the receiver identifies them by their fixed symbol pattern rather than by a packet header.

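Key ordered sets in PCIe Gen 1/2 include the SKIP ordered set (clock tolerance compensation), the FTS ordered set (exit from L0s), the Electrical Idle Ordered Set (EIOS), the Electrical Idle Exit Ordered Set (EIEOS), and the TS1/TS2 training sequences used during link training.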

📋 Error Detection — Code Violations and Disparity Errors

8b/10b provides Physical Layer error detection through two complementary mechanisms. Neither is as strong as the LCRC (which covers the full TLP), but they catch errors that LCRC cannot — specifically errors within the Physical Layer itself before bits are assembled into TLPs.

Code violations

Any received 10-bit symbol that does not appear in either the D-code or K-code lookup tables is a code violation. Immediately detectable. Causes: signal corruption on the link, loss of symbol lock, bit errors that happen to create an unrecognised pattern.

Disparity errors

The receiver maintains its own CRD, which mirrors the transmitter’s. After receiving each symbol, the receiver checks that the symbol’s disparity is consistent with its current CRD value. If a received symbol has positive disparity while the receiver’s CRD is already positive, this is a disparity error: a correct encoder could not have produced it.

8b/10b error detection has limits. If two bits flip in a symbol and the result is a valid symbol with valid disparity, the error is undetected at the Physical Layer. This is why LCRC covers the full TLP at the Data Link Layer — 8b/10b error detection is a fast first-pass check at the physical bit level, not a substitute for the Data Link Layer’s 32-bit CRC.
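Both checks can be sketched as a single receiver-side pass over the symbol stream. This is a simplified illustration: `check_stream` is a hypothetical name, the set of valid symbols is passed in rather than built from the full code tables, and what the CRD should become after an error is an assumption of this sketch. A real decoder also checks that a neutral symbol is the variant belonging to the current CRD, which this version does not.

```python
def check_stream(symbols, valid_symbols, crd=-1):
    """Return a list of (index, error) for code violations and disparity errors."""
    errors = []
    for i, sym in enumerate(symbols):
        if sym not in valid_symbols:
            errors.append((i, "code violation"))
            continue                      # assumption: CRD is left unchanged after a violation
        d = sym.count("1") - sym.count("0")
        if d != 0:
            if (d > 0) == (crd > 0):      # same polarity as the current CRD
                errors.append((i, "disparity error"))
            crd = 1 if d > 0 else -1      # assumption: resynchronise to the received symbol
    return errors

# Tiny illustrative code table: both forms of K28.5 and of D10.3.
VALID = {"0011111010", "1100000101", "0101011100", "0101010011"}

stream = ["0011111010",   # K28.5, positive disparity, legal with CRD = -
          "0101011100",   # D10.3, neutral
          "0011111010",   # K28.5 positive form again while CRD = + -> disparity error
          "1111100000"]   # not in the table -> code violation
print(check_stream(stream, VALID))  # [(2, 'disparity error'), (3, 'code violation')]
```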

📋 The 20% Overhead

For every 8 bits of useful data, the transmitter sends 10 bits on the wire. The overhead is exactly 2/10 = 20%. This is significant and directly reduces the usable data throughput of the link.

20% Encoding Overhead: Bandwidth Impact by Generation

| Generation | Raw line rate | Encoding efficiency | Data rate | Per lane |
|---|---|---|---|---|
| Gen 1 (2.5 GT/s, 8b/10b) | 2.5 Gbps | 0.8 | 2.5 × 0.8 = 2.0 Gbps | 250 MB/s |
| Gen 2 (5 GT/s, 8b/10b) | 5.0 Gbps | 0.8 | 5.0 × 0.8 = 4.0 Gbps | 500 MB/s |
| Gen 3 (8 GT/s, 128b/130b) | 8.0 Gbps | 128/130 ≈ 0.985 (overhead 2/130 ≈ 1.5%) | 8.0 × 0.985 ≈ 7.88 Gbps | ≈ 984 MB/s (nearly 2× Gen 2) |

Gen 3 has only 60% more raw speed than Gen 2 but almost 100% more data throughput.
Figure 6: 8b/10b overhead impact. Gen 2 at 5 GT/s delivers 500 MB/s of useful data per lane. Gen 3 at 8 GT/s with 128b/130b encoding (about 1.5% overhead) delivers ~984 MB/s, nearly double, although the raw line rate is only 60% higher. The line-rate increase contributes a factor of 1.6 and the encoding change a further factor of about 1.23 (0.985 / 0.8).

The 20% overhead means that a Gen 1 x16 link with 2.5 GT/s per lane has a peak data throughput of 2.5 GT/s × 0.8 × 16 lanes / 8 bits per byte = 4000 MB/s = 4.0 GB/s in each direction. The 20% is the primary reason PCIe Gen 3 abandoned 8b/10b entirely in favour of the far more efficient 128b/130b encoding.
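The arithmetic above is easy to reproduce for any generation and link width. The sketch below only accounts for line-encoding overhead; packet headers, DLLPs and flow control reduce real throughput further. The function name `throughput_mb_per_s` and the `ENCODINGS` table are illustrative.

```python
# Payload bits and total bits on the wire for each line encoding.
ENCODINGS = {"8b/10b": (8, 10), "128b/130b": (128, 130)}

def throughput_mb_per_s(gt_per_s: float, encoding: str, lanes: int = 1) -> float:
    """Peak payload throughput in MB/s (1 MB = 10^6 bytes), one direction."""
    payload, total = ENCODINGS[encoding]
    data_bits_per_s = gt_per_s * 1e9 * payload / total * lanes
    return data_bits_per_s / 8 / 1e6

print(round(throughput_mb_per_s(2.5, "8b/10b"), 1))            # 250.0
print(round(throughput_mb_per_s(5.0, "8b/10b"), 1))            # 500.0
print(round(throughput_mb_per_s(8.0, "128b/130b"), 1))         # 984.6
print(round(throughput_mb_per_s(2.5, "8b/10b", lanes=16), 1))  # 4000.0
```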

📋 Scrambling — Why It Accompanies 8b/10b

Even with 8b/10b balancing ones and zeros, repetitive data patterns can produce periodic spectral peaks in the transmitted bitstream. For example, transmitting a long sequence of 0x4A bytes (D10.2, which encodes to 0101010101) would create a regular alternating 1-0-1-0 pattern: a single strong frequency component that creates EMI and may stress PLL clock recovery circuits.

The Physical Layer scrambles data bytes before 8b/10b encoding by XORing them with a pseudo-random sequence generated by a Linear Feedback Shift Register (LFSR). The scrambling sequence is deterministic — both transmitter and receiver use the same LFSR initialised to the same value — so the receiver can trivially descramble by applying the same XOR sequence again. Scrambling spreads the spectral energy evenly across frequencies, eliminating strong periodic tones.
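The XOR-with-LFSR idea can be sketched as follows. The polynomial x^16 + x^5 + x^4 + x^3 + 1 and the 0xFFFF seed are the values commonly quoted for the Gen 1/2 scrambler; they are not stated in this article, so treat them as assumptions. The sketch also leaves out the link-level rules (resetting on COM, skipping K-codes, not advancing on SKP), so it illustrates the mechanism rather than providing a bit-exact implementation.

```python
FEEDBACK = 0x0039  # the x^5 + x^4 + x^3 + 1 terms of the assumed polynomial
SEED = 0xFFFF      # assumed initial LFSR value

def keystream(n: int, state: int = SEED) -> bytes:
    """Generate n pseudo-random bytes from a 16-bit LFSR, 8 shifts per byte."""
    out = bytearray()
    for _ in range(n):
        byte = 0
        for bit in range(8):
            msb = (state >> 15) & 1
            byte |= msb << bit             # first bit shifted out becomes bit 0
            state = (state << 1) & 0xFFFF
            if msb:
                state ^= FEEDBACK
        out.append(byte)
    return bytes(out)

def scramble(data: bytes, state: int = SEED) -> bytes:
    """XOR data with the keystream; applying it twice restores the original."""
    return bytes(d ^ k for d, k in zip(data, keystream(len(data), state)))

repetitive = bytes([0x4A] * 8)
scrambled = scramble(repetitive)
print(scrambled.hex())
print(scramble(scrambled) == repetitive)  # True
```

Because scrambling is a plain XOR with a deterministic sequence, the same function descrambles, provided both ends start from the same LFSR state.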

Gen 3 Onwards — 128b/130b Replaces 8b/10b

Starting with Gen 3 (8 GT/s), PCIe abandoned 8b/10b encoding entirely. The 20% overhead was simply too costly as data rates increased. Gen 3 introduced 128b/130b encoding — a fundamentally different approach that achieves the same goals with only ~1.54% overhead.

8b/10b (Gen 1/2) vs 128b/130b (Gen 3–5) vs Flit + FEC (Gen 6)
8b/10b (Gen 1 and 2): Each byte → 10-bit symbol. Overhead: 20%. DC balance: via CRD per symbol. Transitions: guaranteed, at most 5 identical bits in a row. Special characters: K-codes (STP/END). Bandwidth limit: 80% of raw rate. Used by: Gen 1 (2.5 GT/s), Gen 2 (5 GT/s).
128b/130b (Gen 3, 4, 5): 128 data bits + 2-bit sync header. Overhead: 2/130 ≈ 1.54%. DC balance: via scrambling only. Transitions: from scrambler. Framing: sync header 01=data, 10=OS. Bandwidth: 98.5% of raw rate. Used by: Gen 3 (8 GT/s), Gen 4 (16 GT/s), Gen 5 (32 GT/s).
Flit + FEC (Gen 6): 256-byte flit including CRC and FEC bytes. No character-level encoding. DC balance: from scrambling. Error correction: lightweight FEC (6 FEC bytes per 256-byte flit) backed by CRC and replay. Framing: flit header. BER: raw ~10⁻⁶ · corrected 10⁻¹⁵. Used by: Gen 6 (64 GT/s PAM4).
Figure 7 — Evolution of Physical Layer encoding. 8b/10b carries 20% overhead but achieves DC balance and transition density through the CRD mechanism. 128b/130b replaces 8b/10b in Gen 3 with a 2-bit sync header per 128-bit block, relying on scrambling for DC balance. Gen 6 abandons character encoding entirely — PAM4 with FEC handles error correction at flit granularity.

How 128b/130b is different

Instead of encoding individual bytes, 128b/130b groups 128 data bits together as a single block and prepends a 2-bit sync header. The header encodes exactly two states: 01 meaning “this is a data block” and 10 meaning “this is an ordered set block”. DC balance is achieved entirely through scrambling; there is no CRD mechanism. K-codes and the COM character are no longer used for framing, because the sync header replaces them. SKP ordered sets are still used for clock compensation, but their encoding changes. There is no longer a single special character for symbol lock: the receiver instead acquires block alignment by searching for the EIEOS bit pattern.

This is why Gen 3 hardware is more complex than Gen 1/2: the scrambler is more sophisticated (a 23-bit LFSR instead of a 16-bit one), the framing detection is different, and equalization training becomes a required part of link bring-up at 8 GT/s. The reward is about 98.5% efficiency instead of 80%.
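The block structure itself is simple enough to sketch. The example below only shows the framing arithmetic: it uses the sync header values as quoted in the text, renders the payload MSB-first for readability, and omits scrambling and the real on-the-wire bit ordering. `frame_block` and the constant names are illustrative.

```python
SYNC_DATA = "01"         # data block, as quoted in the text
SYNC_ORDERED_SET = "10"  # ordered set block, as quoted in the text

def frame_block(payload: bytes, ordered_set: bool = False) -> str:
    """Prepend the 2-bit sync header to a 128-bit (16-byte) payload."""
    if len(payload) != 16:
        raise ValueError("a 128b/130b block carries exactly 16 bytes")
    header = SYNC_ORDERED_SET if ordered_set else SYNC_DATA
    return header + "".join(f"{b:08b}" for b in payload)

block = frame_block(bytes(range(16)))
overhead = 2 / len(block)
print(len(block), f"{overhead:.2%}")  # 130 1.54%
```

Two header bits per 130 transmitted bits is where the 1.54% figure comes from, compared with two extra bits per 10 for 8b/10b.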

8b/10b is only used at the Gen 1 and Gen 2 data rates. At 8 GT/s and above you will not encounter K-codes, CRD, or the 20% overhead. Keep in mind, however, that every link first trains at 2.5 GT/s, so even Gen 3 and later devices use 8b/10b during initial link training before switching speed. Gen 3, Gen 4 and Gen 5 all use 128b/130b; Gen 4 and Gen 5 keep the encoding and mainly tighten equalization and channel requirements. Gen 6 uses PAM4 with flit-based framing and FEC, with no 8b/10b at 64 GT/s.

📋 Quick Reference

| Item | Value / Rule |
|---|---|
| What 8b/10b does | Encodes each 8-bit input byte into a 10-bit output symbol. Simultaneously maintains DC balance, guarantees transition density, and enables code violation error detection. |
| Structure | 5b→6b encoding for lower 5 bits (EDCBA) + 3b→4b encoding for upper 3 bits (HGF). Outputs concatenated as abcdei·fghj. |
| Overhead | 20%: 10 bits transmitted per 8 bits of data. Reduces effective bandwidth to 80% of raw line rate. |
| Character notation | Dxx.y for data characters (D0.0–D31.7) · Kxx.y for control characters. xx = decimal of 5-bit field, y = decimal of 3-bit field. |
| Disparity | Positive (+): symbol has more 1s than 0s · Negative (–): more 0s · Neutral: equal. Every symbol has 4, 5, or 6 ones. |
| CRD | Current Running Disparity: a single bit tracking the DC balance of the serial stream. Maintained identically by both transmitter and receiver. |
| CRD rule | A new symbol must have disparity opposite to or neutral with the current CRD. A symbol whose disparity has the same polarity as the CRD is a disparity error. |
| Two encodings per character | Most non-neutral characters have two 10-bit encodings: one for CRD=+ and one for CRD=–. The encoder chooses based on the current CRD to maintain balance. |
| Max run length | No more than 5 consecutive bits of the same polarity in the serial stream, guaranteed by the encoding even across symbol boundaries. |
| Legal symbol space | At most ~530 valid patterns out of 1024 possible 10-bit patterns. The other 494+ are detectable code violations. |
| COM (K28.5) | The unique ordered set start character. Not scrambled. Used by receiver for symbol lock. Resets scrambler LFSR. Always recognisable by its 0011111010 or 1100000101 pattern. |
| STP (K27.7) | Start of TLP marker. Lets the receiver find TLP boundaries in the symbol stream. |
| SDP (K28.2) | Start of DLLP marker. |
| END (K29.7) | End of good packet. |
| EDB (K30.7) | End of nullified (bad) packet, preceded by an inverted LCRC. Used by switches in cut-through mode. |
| SKP (K28.0) | Skip character, part of the SKIP ordered set for clock tolerance compensation. Sent every 1180–1538 symbol times. Receiver may add/delete SKP characters. |
| K-codes not scrambled | Control characters pass through the scrambler unchanged. Data characters are scrambled before 8b/10b encoding. |
| Error detection | Code violations: illegal 10-bit pattern. Disparity errors: symbol disparity conflicts with CRD. Both reported as Receiver Error to DLL. |
| Generations using 8b/10b | Gen 1 (2.5 GT/s) and Gen 2 (5 GT/s) only. |
| Gen 3+ replacement | 128b/130b encoding: 2-bit sync header per 128-bit block. Only 1.54% overhead. No CRD, no K-codes for framing. Scrambling provides DC balance. |
| Gen 6 | No character encoding at all. PAM4 at 32 Gbaud. Lightweight FEC (6 bytes) plus CRC in each 256-byte flit. Flit header for framing. No 8b/10b at 64 GT/s. |