PCIe Series · PCIe-03 – Your VLSI Journey Starts Here

📋 Layer Responsibilities — the Full Picture

Every PCIe port implements exactly three layers. They stack the same way regardless of whether the port is in a Root Complex, a Switch, or an Endpoint. The layers define what each piece of hardware is responsible for. Those responsibilities are not optional, but the partitioning is logical: a design does not have to be physically split this way to be spec-compliant.

Figure 1 — All three PCIe layers, transmit and receive sides. The Physical Layer is the only one that changes between generations. The Transaction and Data Link layers produce and consume the same packets whether the link is Gen 1 at 2.5 GT/s or Gen 6 at 64 GT/s.

📋 TLP Common Header — Every Field

Every TLP starts with the same first Doubleword (DW 0). It tells the receiver everything about what kind of packet is coming and how to parse the rest of the header.

Figure 2 — TLP first Doubleword. All TLP types share this same first DW. Colour coding: blue = transaction type/size info, green = QoS, orange = attributes and AT, purple = ECRC flag, red = poisoned data flag. Reserved bits always read as 0.
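
The field layout in Figure 2 can be sketched as a small decoder. This is a minimal illustration, assuming the standard (non-flit) header bit positions; the function name and dictionary keys are hypothetical, not taken from any library.

```python
def decode_tlp_dw0(dw0: int) -> dict:
    """Split the first TLP Doubleword into its fields (non-flit header layout)."""
    length = dw0 & 0x3FF                      # bits 9:0, payload length in DW
    return {
        "fmt": (dw0 >> 29) & 0x7,             # bits 31:29, header size and data present
        "type": (dw0 >> 24) & 0x1F,           # bits 28:24, transaction type
        "tc": (dw0 >> 20) & 0x7,              # bits 22:20, Traffic Class
        "td": (dw0 >> 15) & 0x1,              # bit 15, ECRC (TLP digest) present
        "ep": (dw0 >> 14) & 0x1,              # bit 14, poisoned data
        "attr": (dw0 >> 12) & 0x3,            # bits 13:12, Relaxed Ordering and No Snoop
        "at": (dw0 >> 10) & 0x3,              # bits 11:10, Address Type
        "length_dw": 1024 if length == 0 else length,  # a Length of 0 encodes 1024 DW
    }


# 0x40000001 is a 3 DW header with data (Fmt=010), Type=00000 (memory write), Length=1 DW
mwr = decode_tlp_dw0(0x40000001)
print(mwr["fmt"], mwr["type"], mwr["length_dw"])  # prints: 2 0 1
```

Because every TLP type shares this Doubleword, a receiver can run one decoder like this before it knows anything else about the packet.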

📋 Complete TLP on the Wire

By the time a TLP reaches the physical link it has grown from what the Transaction Layer built. Each layer below adds its own fields:

Figure 3 — TLP assembly across all three layers. Row 1 = what the Transaction Layer builds. Row 2 = after the Data Link Layer adds its SeqNo (12-bit) and LCRC (32-bit CRC). Row 3 = after the Physical Layer adds framing (STP/END for Gen 1/2, sync headers for Gen 3–5, flit packing for Gen 6). The receiver strips these fields in reverse order.
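
To put numbers on the three rows of Figure 3, here is a rough size calculation for Gen 1/2 framing. It assumes the 12-bit sequence number travels in a 2-byte field, the LCRC is 4 bytes, and STP and END are one symbol each; the function name is hypothetical.

```python
def tlp_wire_bytes_gen12(header_dw: int, payload_bytes: int, ecrc: bool = False) -> int:
    """Bytes on a Gen 1/2 link for one TLP (before 8b/10b encoding)."""
    if header_dw not in (3, 4):
        raise ValueError("TLP headers are 3 or 4 DW")
    if payload_bytes % 4:
        raise ValueError("payload is a whole number of DW")
    transaction = header_dw * 4 + payload_bytes + (4 if ecrc else 0)  # row 1
    data_link = 2 + transaction + 4     # row 2: sequence-number field + LCRC
    physical = 1 + data_link + 1        # row 3: STP + END framing symbols
    return physical


total = tlp_wire_bytes_gen12(header_dw=3, payload_bytes=64)
print(total, round(64 / total, 3))  # prints: 84 0.762
```

A 64-byte memory write therefore spends about a quarter of its link time on header and per-layer overhead, which is why larger payloads use the link more efficiently.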

LCRC vs ECRC — the key difference. LCRC (Link CRC) is calculated fresh at every hop — when a Switch receives a TLP, it checks the LCRC. When it retransmits the TLP, it calculates a new LCRC for that outgoing link. ECRC (End-to-End CRC, controlled by the TD bit) is calculated by the original sender and only checked by the final destination. This means ECRC can detect errors that happen inside a Switch’s internal forwarding path — which LCRC cannot, because LCRC is stripped and recalculated at the switch boundary.
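
The difference can be demonstrated with a toy two-hop path. This sketch uses Python's `zlib.crc32` as a stand-in for both CRCs; real PCIe hardware uses its own bit ordering and treats some header bits as variant when computing ECRC, so the byte values here are illustrative only. All function names are hypothetical.

```python
import zlib


def crc32(data: bytes) -> bytes:
    """Stand-in 32-bit CRC (zlib's CRC-32, not PCIe's exact bit ordering)."""
    return zlib.crc32(data).to_bytes(4, "big")


def add_ecrc(tlp: bytes) -> bytes:
    """Original sender appends the end-to-end CRC once."""
    return tlp + crc32(tlp)


def ecrc_ok(tlp: bytes) -> bool:
    """Final destination checks the end-to-end CRC."""
    return crc32(tlp[:-4]) == tlp[-4:]


def link_send(tlp: bytes) -> bytes:
    """Each transmitter appends a fresh LCRC for its own link."""
    return tlp + crc32(tlp)


def link_receive(frame: bytes) -> bytes:
    """Each receiver checks and strips the LCRC of its own link."""
    tlp, lcrc = frame[:-4], frame[-4:]
    if crc32(tlp) != lcrc:
        raise ValueError("LCRC mismatch: receiver would NAK")
    return tlp


original = add_ecrc(b"header+payload")
at_switch = link_receive(link_send(original))             # hop 1: LCRC passes
corrupted = bytes([at_switch[0] ^ 0x01]) + at_switch[1:]  # bit flip inside the Switch
at_destination = link_receive(link_send(corrupted))       # hop 2: new LCRC also passes
print(ecrc_ok(original), ecrc_ok(at_destination))         # prints: True False
```

Both links report a clean LCRC because the second LCRC was computed over the already-corrupted bytes; only the ECRC, untouched since the original sender, exposes the error.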

📋 Flow Control — Credit-Based

PCIe uses a credit-based flow control scheme. The receiver advertises its available buffer space to the transmitter as credits. The transmitter tracks those credits and only sends a TLP when enough credits are available. As the receiver processes TLPs and frees buffer space, it returns credits to the transmitter via FC Update DLLPs.

This is not a stop-and-wait scheme: the transmitter can send continuously as long as credits allow. There is no separate back-pressure signal; the credit count itself is the back-pressure. If credits run out, the transmitter stalls until an FC Update DLLP arrives.

Figure 4 — Flow control credit loop. Credits flow in the opposite direction to TLPs: TLPs consume credits, FC Update DLLPs return them. The transmitter’s counters track what the receiver has available — it never sends more than the receiver can hold.
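
The credit loop can be sketched from the transmitter's side as two running totals per pool: credits granted and credits consumed. This is a simplified model for Posted traffic only; real hardware uses fixed-width counters with modular arithmetic, which is omitted here, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CreditCounter:
    limit: int          # total credits the receiver has granted so far
    consumed: int = 0   # total credits the transmitter has used so far

    def available(self) -> int:
        return self.limit - self.consumed


class PostedTransmitter:
    def __init__(self, ph_credits: int, pd_credits: int):
        self.ph = CreditCounter(ph_credits)
        self.pd = CreditCounter(pd_credits)

    def try_send(self, data_credits: int) -> bool:
        """Send one posted TLP only if header AND data credits are available."""
        if self.ph.available() < 1 or self.pd.available() < data_credits:
            return False                      # stall until an FC Update arrives
        self.ph.consumed += 1
        self.pd.consumed += data_credits
        return True

    def on_update_fc(self, ph_limit: int, pd_limit: int) -> None:
        """An FC Update DLLP carries new running limits from the receiver."""
        self.ph.limit, self.pd.limit = ph_limit, pd_limit


tx = PostedTransmitter(ph_credits=2, pd_credits=8)
results = [tx.try_send(4), tx.try_send(4), tx.try_send(1)]
tx.on_update_fc(ph_limit=4, pd_limit=16)
results.append(tx.try_send(4))
print(results)  # prints: [True, True, False, True]
```

The third send stalls because both pools are exhausted, and it is the arrival of the update, not any explicit "go" signal, that lets traffic resume.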

📋 The Six Credit Types

Credits are tracked separately for three TLP categories, and within each, header and data are counted independently — six pools total.

Abbrev.	Covers	1 credit =	Infinite allowed?	Why?
PH	Posted Header (MWr, Msg headers)	1 TLP header	No	Buffer must have space for the header before accepting the TLP
PD	Posted Data (MWr payload, MsgD payload)	16 bytes (4 DW)	No	Buffer must have space for the payload bytes
NPH	Non-Posted Header (MRd, IORd/Wr, CfgRd/Wr headers)	1 TLP header	No	Non-posted requests are held in the NP buffer until a completion returns
NPD	Non-Posted Data (IOWr, CfgWr, AtomicOp payloads)	16 bytes (4 DW)	No	Small: few request types carry NP data
CPLH	Completion Header (Cpl, CplD headers)	1 TLP header	Yes: must be ∞ at Endpoints	Deadlock prevention: a device that sent a read request must always be able to receive its completion
CPLD	Completion Data (CplD payload)	16 bytes (4 DW)	Yes: must be ∞ at Endpoints	Same reason: completion data must always have space in the requester’s receive buffer
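
As a worked example of the six pools, the sketch below computes which credits one TLP consumes. It assumes the base specification's data credit unit of 4 DW (16 bytes), with partial units rounded up; the function and dictionary names are hypothetical.

```python
DATA_CREDIT_BYTES = 16  # one data credit covers 4 DW of payload (assumed from the base spec)

POOLS = {
    "P": ("PH", "PD"),        # posted requests
    "NP": ("NPH", "NPD"),     # non-posted requests
    "CPL": ("CPLH", "CPLD"),  # completions
}


def credits_needed(category: str, payload_bytes: int = 0) -> dict:
    """Return the header and data credits one TLP of this category consumes."""
    header_pool, data_pool = POOLS[category]
    data_credits = -(-payload_bytes // DATA_CREDIT_BYTES)  # ceiling division
    return {header_pool: 1, data_pool: data_credits}


print(credits_needed("P", 256))   # prints: {'PH': 1, 'PD': 16}
print(credits_needed("NP"))       # prints: {'NPH': 1, 'NPD': 0}
print(credits_needed("CPL", 20))  # prints: {'CPLH': 1, 'CPLD': 2}
```

A memory read costs one NPH credit and no data credits on the way out; its payload is charged against the requester's CPLD pool when the completion comes back.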

The deadlock scenario — why CPLH/CPLD must be infinite. Suppose an NVMe SSD has sent 64 read requests. The RC now holds 64 completions waiting to be sent back. But the SSD’s receive buffer is full of other posted writes. Those writes are waiting because the SSD’s core is busy. The SSD’s core is busy waiting for its read completions. If the RC needed completion credits to send CplD TLPs, the whole system would freeze — no one can move. PCIe prevents this by mandating that endpoints always advertise infinite CPLH/CPLD, meaning the RC can always send completions back regardless of the endpoint’s state.

📋 Credit Initialisation — The DLCMSM

Credits are exchanged before any TLP can flow. This happens automatically in hardware after Physical Layer link training completes. The Data Link Control and Management State Machine (DLCMSM) runs the process.

Figure 5 — DLCMSM states and FC initialisation. The DLCMSM starts in DL_Inactive after reset. Once the Physical Layer’s LTSSM raises LinkUp, it enters FC_Init1. Both sides simultaneously advertise their buffer sizes using InitFC1 DLLPs. After receiving the other side’s values, each side sends InitFC2 and transitions to DL_Active — at which point TLPs can flow for the first time.

InitFC1 DLLPs are sent repeatedly, not just once, because a single DLLP could get corrupted during the noisy early moments after link training. The spec requires continuing to send InitFC1 DLLPs until the partner’s values have been received reliably. Only then does the link transition to FC_Init2.
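
The initialisation sequence described above can be sketched as a small state machine. This is a simplified model for a single Virtual Channel that follows the four states named in this article; the specification's exact exit conditions are more detailed, and the class, method, and event names here are hypothetical.

```python
from enum import Enum, auto


class DlState(Enum):
    DL_INACTIVE = auto()
    FC_INIT1 = auto()
    FC_INIT2 = auto()
    DL_ACTIVE = auto()


class Dlcmsm:
    KINDS = {"P", "NP", "Cpl"}

    def __init__(self):
        self.state = DlState.DL_INACTIVE
        self.seen_init1 = set()

    def on_link_up(self) -> None:
        if self.state is DlState.DL_INACTIVE:
            self.state = DlState.FC_INIT1

    def on_init_fc1(self, kind: str) -> None:
        """Record the partner's advertisement; advance once all three kinds arrived."""
        if self.state is DlState.FC_INIT1 and kind in self.KINDS:
            self.seen_init1.add(kind)
            if self.seen_init1 == self.KINDS:
                self.state = DlState.FC_INIT2

    def on_init_fc2(self) -> None:
        if self.state is DlState.FC_INIT2:
            self.state = DlState.DL_ACTIVE

    def dllps_to_send(self) -> list:
        """What this side keeps transmitting while it waits in each state."""
        if self.state is DlState.FC_INIT1:
            return ["InitFC1_P", "InitFC1_NP", "InitFC1_Cpl"]
        if self.state is DlState.FC_INIT2:
            return ["InitFC2_P", "InitFC2_NP", "InitFC2_Cpl"]
        return []


sm = Dlcmsm()
sm.on_link_up()
for kind in ["P", "P", "NP", "Cpl"]:   # repeated InitFC1 DLLPs are harmless
    sm.on_init_fc1(kind)
sm.on_init_fc2()
print(sm.state.name)  # prints: DL_ACTIVE
```

Because `dllps_to_send` returns the same list on every call while the state is unchanged, the repetition the spec asks for falls out naturally: the sender simply keeps emitting until the state advances.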

📋 Virtual Channels and Traffic Classes

The Traffic Class (TC) field in the TLP header is a 3-bit priority selector. The hardware maps TCs to Virtual Channels (VCs) — physical buffer partitions — and uses an arbiter to decide which VC’s packets are sent next.

Figure 6 — TC→VC mapping. The TC field in the TLP header selects a priority level. Hardware maps TCs to VC buffers according to a software-programmed TC-VC map. The VC arbiter selects which VC sends next — higher VCs can be given strict priority to guarantee bandwidth for time-sensitive traffic.
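
A minimal sketch of the mapping and arbitration in Figure 6 follows. The two-VC map and the strict-priority policy are assumptions chosen for illustration (real hardware may also offer round-robin or weighted schemes), and all names are hypothetical.

```python
from collections import deque

# Hypothetical software-programmed map: TC0-3 share VC0, TC4-7 share VC1.
# TC0 must always map to VC0.
TC_TO_VC = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}


class StrictPriorityArbiter:
    def __init__(self, num_vcs: int):
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, tc: int, tlp: str) -> None:
        """Place the TLP in the VC buffer selected by its Traffic Class."""
        self.queues[TC_TO_VC[tc]].append(tlp)

    def next_tlp(self):
        """Serve the highest-numbered non-empty VC first; FIFO within a VC."""
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None


arb = StrictPriorityArbiter(num_vcs=2)
arb.enqueue(0, "bulk-0")
arb.enqueue(7, "realtime-0")
arb.enqueue(1, "bulk-1")
order = [arb.next_tlp() for _ in range(4)]
print(order)  # prints: ['realtime-0', 'bulk-0', 'bulk-1', None]
```

The TC 7 packet overtakes the bulk traffic even though it was queued second, while the two TLPs that share VC0 keep their relative order.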

📋 Transaction Ordering Rules

Within a Virtual Channel, TLPs normally exit in order. The ordering rules define the exceptions — when a TLP waiting behind another may “pass” it. These rules exist to prevent deadlocks and to match the ordering guarantees PCI software already depends on.

New TLP (waiting) \ TLP ahead in queue	Posted Write	Non-Posted Read	Completion	Non-Posted Write
Posted Write	No: must not pass	Yes: must be able to pass	Yes: may pass	Yes: must be able to pass
Non-Posted Read	No: must not pass	Permitted: no ordering requirement	Permitted: no ordering requirement	Permitted: no ordering requirement
Completion	No: must not pass	Yes: must be able to pass	No for the same Transaction ID; otherwise permitted	Yes: must be able to pass
Non-Posted Write	No: must not pass	Permitted: no ordering requirement	Permitted: no ordering requirement	Permitted: no ordering requirement

The single most important rule: Posted writes may not pass posted writes. Memory write ordering is the foundation of DMA correctness — if a CPU writes a “data ready” flag after writing the data, the flag write must arrive after the data write.

The Relaxed Ordering (RO) attribute bit in the TLP header relaxes these rules when set: a TLP with RO=1 is allowed to bypass posted writes ahead of it. GPUs and high-bandwidth DMA engines use this to increase throughput when out-of-order delivery is safe for that particular transfer type.
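
The Posted Write row of the table, combined with the RO bit, reduces to a very small predicate. This sketch covers only that row; it ignores ID-based ordering and the other rows, and the function name and string labels are hypothetical.

```python
def posted_write_may_pass(ahead: str, relaxed_ordering: bool = False) -> bool:
    """May a waiting posted write be sent before the TLP queued ahead of it?"""
    if ahead == "posted_write":
        # The core rule: writes stay in order unless the waiting TLP has RO=1.
        return relaxed_ordering
    if ahead in ("non_posted_read", "non_posted_write", "completion"):
        # Posted writes are allowed past these, which helps avoid deadlock.
        return True
    raise ValueError(f"unknown TLP kind: {ahead}")


print(posted_write_may_pass("posted_write"))                         # prints: False
print(posted_write_may_pass("posted_write", relaxed_ordering=True))  # prints: True
print(posted_write_may_pass("non_posted_read"))                      # prints: True
```

In the "data ready" flag example, the flag write would be sent with RO=0 so that it can never overtake the data writes ahead of it, even if those data writes themselves used RO=1.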

📋 Data Link Layer — ACK/NAK State Machine

The Data Link Layer’s core job is to make the unreliable physical link look reliable to the Transaction Layer above. It does this with Sequence Numbers, LCRC, and the ACK/NAK retry protocol. The whole thing is hardware — no software is involved, and the Transaction Layer never knows it happened.

Figure 7 — ACK/NAK transmit state machine. Every sent TLP awaits an ACK DLLP. ACK flushes the replay buffer up to that sequence number (cumulative). NAK or timeout triggers replay of all unacknowledged TLPs. After 4 failed replays the error escalates to the Physical Layer — the LTSSM may reset the link.

📋 Replay Buffer Mechanics

The replay buffer is the insurance policy of the Data Link Layer. Every TLP that leaves the transmitter’s Data Link Layer is copied into the replay buffer before being handed to the Physical Layer. It stays there until an ACK DLLP arrives confirming safe delivery at the neighbour.

Figure 8 — Replay buffer contents. Sequence numbers 0 and 1 have been ACKed and can be freed. Numbers 2 and 3 are still in the buffer awaiting acknowledgement. ACK(N) means “flush everything up to and including N” — it takes only one ACK DLLP to free multiple TLPs.
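
The state shown in Figure 8 can be reproduced with a short model of the buffer. This is a sketch only: it assumes a 12-bit sequence number that wraps at 4096, and it leaves out the replay timer and the replay counter. The class and method names are hypothetical.

```python
from collections import OrderedDict

SEQ_MODULO = 4096  # 12-bit sequence numbers wrap around


class ReplayBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = OrderedDict()  # seq -> TLP copy, oldest first

    def send(self, tlp: str) -> int:
        """Assign the next sequence number and keep a copy until it is ACKed."""
        seq = self.next_seq
        self.pending[seq] = tlp
        self.next_seq = (seq + 1) % SEQ_MODULO
        return seq

    def ack(self, seq: int) -> None:
        """Cumulative ACK: free every pending TLP up to and including seq."""
        if seq not in self.pending:
            return                     # duplicate or stale ACK, nothing to free
        while self.pending:
            oldest, _ = self.pending.popitem(last=False)
            if oldest == seq:
                break

    def replay(self) -> list:
        """On NAK or timeout: resend everything still unacknowledged, in order."""
        return list(self.pending.items())


buf = ReplayBuffer()
for name in ["TLP-A", "TLP-B", "TLP-C", "TLP-D"]:
    buf.send(name)
buf.ack(1)                 # one ACK frees sequence numbers 0 and 1
print(buf.replay())        # prints: [(2, 'TLP-C'), (3, 'TLP-D')]
```

Keeping the entries in send order means the cumulative flush never has to compare sequence numbers numerically, so wraparound from 4095 to 0 needs no special case in this model.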

📋 DLLP Types

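DLLPs are fixed-size: 4 bytes of content + a 16-bit CRC, 6 bytes in total at the Data Link Layer. Gen 1/2 framing (SDP and END symbols) adds 2 more, for 8 bytes on the wire. They are created and consumed only within the Data Link Layer: never seen by the Transaction Layer, never routed by Switches.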

DLLP Type	Direction	Purpose
ACK	RX port → TX port	Cumulative acknowledgement up to SeqNum N — transmitter may flush replay buffer ≤ N
NAK	RX port → TX port	Error signal: SeqNum N is the last TLP received correctly, so the transmitter flushes ≤ N and replays from N+1 onwards
UpdateFC_P	RX → TX	Returns Posted (PH/PD) flow control credits as the receiver drains its buffers
UpdateFC_NP	RX → TX	Returns Non-Posted (NPH/NPD) flow control credits
UpdateFC_Cpl	RX → TX	Returns Completion (CPLH/CPLD) flow control credits; not needed when the receiver advertised infinite completion credits
InitFC1_P/NP/Cpl	Both (simultaneous)	Advertise initial buffer sizes during FC initialisation (sent in order: P → NP → Cpl)
InitFC2_P/NP/Cpl	Both (simultaneous)	Confirm receipt of partner’s InitFC1 values — transitions link to DL_Active
PM_Enter_L1	Downstream → Upstream	Request entry into the L1 power state (software-directed PCI-PM; ASPM L1 uses PM_Active_State_Request_L1)
PM_Enter_L23	Downstream → Upstream	Request entry into the L2/L3 Ready power state
PM_Request_Ack	Upstream → Downstream	Acknowledge the power management request from downstream device

📋 Physical Layer Gen 1/2 — 8b/10b Logical Sub-block

Gen 1 (2.5 GT/s) and Gen 2 (5.0 GT/s) use 8b/10b encoding. Every 8-bit byte is sent as a 10-bit symbol, with the transmitter choosing between up to two encodings according to the running disparity. That is 20% overhead, but it delivers three critical properties the link needs.

Figure 9 — Gen 1/2 transmit pipeline. The 20% encoding overhead is real cost, but 8b/10b provides guaranteed DC balance and sufficient transition density for CDR lock — both essential properties for a high-speed serial link. K-characters enable unambiguous packet framing.

📋 Physical Layer Gen 3–5 — 128b/130b Logical Sub-block

Gen 3 (8 GT/s), Gen 4 (16 GT/s), and Gen 5 (32 GT/s) all use 128b/130b encoding. The headline improvement: only about 1.5% overhead instead of 20%. This is why Gen 3 could double Gen 2's effective throughput at 8 GT/s rather than 10 GT/s: 8 GT/s with 128b/130b carries about 7.9 Gb/s per lane, almost the same as the 8 Gb/s that a 10 GT/s link with 8b/10b would have delivered.

Figure 10 — 128b/130b block structure. 128 bits of payload, 2-bit sync header, total 130 bits. The sync header bits (01 or 10) tell the receiver whether this block carries data or control information — replacing the K-character framing of 8b/10b entirely. Scrambling is mandatory to ensure adequate CDR transitions since 128b/130b has no inherent DC balance mechanism.
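
The encoding overheads translate directly into per-lane throughput. The short calculation below compares the two line codes; it counts only encoding overhead (not TLP headers, DLLPs, or ordered sets), and the function name is hypothetical.

```python
def effective_gbps(gt_per_s: float, payload_bits: int, total_bits: int) -> float:
    """Per-lane data rate after line-code overhead, in Gb/s."""
    return gt_per_s * payload_bits / total_bits


gen1 = effective_gbps(2.5, 8, 10)                  # 8b/10b
gen2 = effective_gbps(5.0, 8, 10)                  # 8b/10b
gen3 = effective_gbps(8.0, 128, 130)               # 128b/130b
hypothetical_10gt = effective_gbps(10.0, 8, 10)    # 10 GT/s with 8b/10b, never built

print(gen1, gen2, round(gen3, 3), hypothetical_10gt)  # prints: 2.0 4.0 7.877 8.0
print(round(2 / 130 * 100, 2))                        # prints: 1.54 (percent overhead)
```

Gen 3 lands within about 1.5% of a clean doubling of Gen 2 while raising the signalling rate by only 60%, which is the trade the new encoding bought.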

📋 Physical Layer Gen 6 — PAM4 + FEC + Flit

Gen 6 (64 GT/s) is the biggest Physical Layer change since 8b/10b → 128b/130b in Gen 3. Three mechanisms work together — none of them can be used without the others at this speed.

Figure 11 — Gen 6’s three interdependent mechanisms. PAM4 doubles bandwidth at the same baud rate, but reduces voltage margins. FEC corrects the increased error rate in hardware, restoring effective BER to spec levels. Flit framing provides fixed-size blocks that make FEC encoding efficient and clean — you can’t have one without the others at this speed.
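
The "2 bits per symbol" arithmetic can be made concrete with a toy PAM4 mapper. The Gray-coded level assignment below is an assumption for illustration, and the level numbers 0 to 3 stand in for four voltages; the names are hypothetical.

```python
# Assumed Gray-coded mapping: adjacent levels differ in exactly one bit,
# so mistaking a level for its neighbour corrupts only one of the two bits.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits: list) -> list:
    """Pack pairs of bits into one of four signal levels per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM4 carries bits in pairs")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def bit_rate_gbps(gbaud: float, bits_per_symbol: int) -> float:
    return gbaud * bits_per_symbol


print(pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))  # prints: [0, 1, 2, 3]
print(bit_rate_gbps(32, 1))                   # NRZ at 32 GBaud, prints: 32
print(bit_rate_gbps(32, 2))                   # PAM4 at 32 GBaud, prints: 64
```

Eight bits become four symbols, so the channel sees the same symbol rate as Gen 5 while carrying twice the data; the cost is that the four levels sit closer together, which is where FEC comes in.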

What stays the same in Gen 6 from upper layers’ perspective. From the Transaction Layer’s point of view, nothing has changed. It still builds the same TLP headers with the same fields. It still uses the same six flow control credit types. The same ACK/NAK protocol runs in the Data Link Layer. The same ordering rules apply. A device driver or firmware written for Gen 3 hardware works on Gen 6 without any modification. The entire change in Gen 6 is contained within the Physical Layer.

📋 What Changes Between Generations

Layer / Feature	Gen 1/2	Gen 3–5	Gen 6
Transaction Layer	Unchanged — same TLP formats, FC credit types, VC/TC model, ordering rules
Data Link Layer	ACK/NAK, LCRC, replay buffer, DLLP types	Unchanged from Gen 1/2	Same protocol; replay granularity adapted to flit level
PL — Encoding	8b/10b (20% overhead)	128b/130b (1.5% overhead)	1b/1b in flit mode (no line-code overhead)
PL — Framing	STP / SDP / END K-characters	Sync header bits (01 or 10)	256-byte flits
PL — Error correction	Detection only (LCRC)	Detection only (LCRC)	Mandatory FEC per flit + LCRC
PL — Equalization	Simple TX pre-emphasis	Explicit FIR coefficient negotiation (LTSSM)	Advanced multi-tap FIR + DSP
PL — Modulation	NRZ — 2 voltage levels	NRZ — 2 voltage levels	PAM4 — 4 voltage levels
Software view	Completely unchanged — same BDF, config registers, memory map, driver model across all generations

📋 Quick Reference

Concept	Key Point
Transaction Layer role	Build and decode TLPs; check FC credits; enforce ordering; manage VC/TC priority
TLP DW0	Fmt (3/4 DW, data or not) + Type (MRd/MWr/Cpl…) + TC (priority 0–7) + TD (ECRC present) + EP (poisoned) + Attr (RO/NS) + AT + Length
ECRC vs LCRC	LCRC: per-hop, mandatory, recalculated at each Switch. ECRC: end-to-end, optional, survives routing, detects intra-Switch errors.
Six credit types	PH · PD · NPH · NPD (all finite) + CPLH · CPLD (must be infinite — deadlock prevention)
1 data credit	= 16 bytes (4 DW) of payload space in the receiver’s VC buffer
FC initialisation	DLCMSM: DL_Inactive → FC_Init1 (both sides advertise) → FC_Init2 (both confirm) → DL_Active → TLPs can flow
TC/VC	TC is priority label in TLP. VC is physical buffer. TC 0 always maps to VC0. Ordering rules apply within a VC.
Ordering rule #1	Posted writes must not pass posted writes — guarantees DMA write ordering
Relaxed Ordering	Attr bit in TLP — allows bypass of posted writes. Used by GPU scatter-gather and high-BW DMA.
Data Link Layer role	Add 12-bit SeqNo + 32-bit LCRC; copy to replay buffer; run ACK/NAK; generate FC and PM DLLPs
ACK DLLP	Cumulative — “all TLPs ≤ SeqN received correctly, flush them from replay buffer”
NAK DLLP	“Error after SeqN: flush TLPs ≤ N, then replay from N+1 onwards (all unACKed TLPs from that point)”
Replay limit	4 consecutive failed replays → Physical Layer escalation → link may reset via LTSSM Recovery
Gen 1/2 encoding	8b/10b: 20% overhead, K-characters for framing, DC balance guaranteed, max 5 identical bits in a row
Gen 3–5 encoding	128b/130b: 1.5% overhead, sync header (01/10) for framing, mandatory scrambling, explicit equalization
Gen 6 PAM4	4 voltage levels, 2 bits/symbol, 32 GBaud = 64 GT/s, eye margin 1/3 of NRZ → raw BER rises
Gen 6 FEC	Mandatory Reed-Solomon per flit, corrects errors in hardware before DLL, restores BER to < 10⁻¹⁵
Gen 6 Flit	256-byte fixed frame: 236 B of TLP payload + 6 B DLP + 8 B CRC + 6 B FEC. Multiple TLPs per flit. Replay at flit granularity.

Coming next: PCIe-04 covers PCIe Generations Gen 1 to Gen 6 — the detailed bandwidth maths behind each generation, why Gen 3 chose 8 GT/s instead of 10 GT/s, what drove each design decision from the spec, and a deep dive into the Gen 6 flit format and FEC structure with worked numbers.