PCIe Series — PCIe-03: The Three-Layer Model in Detail — VLSI Trainers
The Three-Layer Model in Detail
How all three PCIe layers fit together — TLP header fields, flow control credits, virtual channels, the ACK/NAK state machine, the replay buffer, and how the Physical Layer changes completely from Gen 1 through Gen 6 while the upper two layers stay the same.
📋 Layer Responsibilities — the Full Picture
Every PCIe port implements exactly three layers. They stack the same way regardless of whether the port is in a Root Complex, a Switch, or an Endpoint. The layers define what each piece of hardware is responsible for. Those responsibilities are not optional, but a design does not have to be physically partitioned this way to be spec-compliant.
Figure 1 — All three PCIe layers, transmit and receive sides. The Physical Layer is the only one that changes between generations. The Transaction and Data Link layers produce and consume the same packets whether the link is Gen 1 at 2.5 GT/s or Gen 6 at 64 GT/s.
📋 TLP Common Header — Every Field
Every TLP starts with the same first Doubleword (DW 0). It tells the receiver everything about what kind of packet is coming and how to parse the rest of the header.
Figure 2 — TLP first Doubleword. All TLP types share this same first DW. Colour coding: blue = transaction type/size info, green = QoS, orange = attributes and AT, purple = ECRC flag, red = poisoned data flag. Reserved bits always read as 0.
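As a rough illustration of how a receiver might pull these fields apart, here is a minimal sketch. It assumes DW 0 is held as a 32-bit integer with header byte 0 in bits 31:24, and that the fields sit at the bit positions used by recent PCIe Base Specification revisions (Fmt 31:29, Type 28:24, TC 22:20, Attr[2] 18, TD 15, EP 14, Attr[1:0] 13:12, AT 11:10, Length 9:0). The names `TlpDw0` and `parse_dw0` are made up for this example.

```python
from typing import NamedTuple


class TlpDw0(NamedTuple):
    fmt: int      # 3 bits: header size (3 DW or 4 DW) and whether data follows
    type: int     # 5 bits: transaction type within that format
    tc: int       # 3 bits: Traffic Class, 0-7
    td: int       # 1 bit: TLP Digest (ECRC) present
    ep: int       # 1 bit: payload is poisoned
    attr: int     # 3 bits: Attr[2], Attr[1] (Relaxed Ordering), Attr[0] (No Snoop)
    at: int       # 2 bits: Address Type
    length: int   # payload length in DW; the encoded value 0 means 1024 DW


def parse_dw0(dw0: int) -> TlpDw0:
    """Decode DW 0 given as a 32-bit integer (header byte 0 in bits 31:24)."""
    # Attr is split: Attr[2] lives in byte 1, Attr[1:0] in byte 2.
    attr = (((dw0 >> 18) & 0x1) << 2) | ((dw0 >> 12) & 0x3)
    length = dw0 & 0x3FF
    return TlpDw0(
        fmt=(dw0 >> 29) & 0x7,
        type=(dw0 >> 24) & 0x1F,
        tc=(dw0 >> 20) & 0x7,
        td=(dw0 >> 15) & 0x1,
        ep=(dw0 >> 14) & 0x1,
        attr=attr,
        at=(dw0 >> 10) & 0x3,
        length=length if length else 1024,
    )


# A 32-bit-address memory write (Fmt=010b, Type=00000b) carrying 1 DW.
print(parse_dw0(0x40000001))
```

The split `attr` field is the one detail that is easy to get wrong: the three attribute bits are not contiguous in the header.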
📋 Complete TLP on the Wire
By the time a TLP reaches the physical link it has grown from what the Transaction Layer built. Each layer below adds its own fields:
Figure 3 — TLP assembly across all three layers. Row 1 = what the Transaction Layer builds. Row 2 = after the Data Link Layer adds its SeqNo (12-bit) and LCRC (32-bit CRC). Row 3 = after the Physical Layer adds framing (STP/END for Gen 1/2, sync headers for Gen 3–5, flit packing for Gen 6). The receiver strips these fields in reverse order.
LCRC vs ECRC — the key difference. LCRC (Link CRC) is calculated fresh at every hop — when a Switch receives a TLP, it checks the LCRC. When it retransmits the TLP, it calculates a new LCRC for that outgoing link. ECRC (End-to-End CRC, controlled by the TD bit) is calculated by the original sender and only checked by the final destination. This means ECRC can detect errors that happen inside a Switch’s internal forwarding path — which LCRC cannot, because LCRC is stripped and recalculated at the switch boundary.
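The difference can be seen in a toy model. This is only a sketch: it uses `zlib.crc32` as a stand-in checksum (the real LCRC and ECRC use their own bit ordering and field rules), and the helper names `send_over_link`, `receive_from_link`, and `ecrc_ok` are invented for the example.

```python
import zlib


def crc(data: bytes) -> bytes:
    # Stand-in CRC-32 from zlib; NOT the exact LCRC/ECRC computation of the spec.
    return zlib.crc32(data).to_bytes(4, "big")


def send_over_link(tlp: bytes) -> bytes:
    """Transmit side of one hop: append a freshly computed LCRC."""
    return tlp + crc(tlp)


def receive_from_link(frame: bytes) -> bytes:
    """Receive side of one hop: check the LCRC, then strip it."""
    tlp, lcrc = frame[:-4], frame[-4:]
    if crc(tlp) != lcrc:
        raise ValueError("LCRC mismatch: NAK and replay")
    return tlp


def ecrc_ok(tlp_with_ecrc: bytes) -> bool:
    """Final destination: check the ECRC appended by the original sender."""
    body, ecrc = tlp_with_ecrc[:-4], tlp_with_ecrc[-4:]
    return crc(body) == ecrc


# The original sender appends the ECRC once (TD = 1).
body = b"hdr+payload"
tlp = body + crc(body)

# Hop 1: sender -> Switch. The LCRC check passes.
inside_switch = bytearray(receive_from_link(send_over_link(tlp)))

# A bit flips inside the Switch's internal forwarding path.
inside_switch[0] ^= 0x01

# Hop 2: Switch -> destination. The Switch computes a new LCRC over the
# already-corrupted bytes, so the link-level check still passes.
delivered = receive_from_link(send_over_link(bytes(inside_switch)))

print(ecrc_ok(tlp))        # True
print(ecrc_ok(delivered))  # False: only the ECRC notices the damage
```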
📋 Flow Control — Credit-Based
PCIe uses a credit-based flow control scheme. The receiver advertises its available buffer space to the transmitter as credits. The transmitter tracks those credits and only sends a TLP when enough credits are available. As the receiver processes TLPs and frees buffer space, it returns credits to the transmitter via FC Update DLLPs.
This is not a stop-and-wait scheme: the transmitter can send continuously as long as credits allow. There is no separate back-pressure signal. If credits run out, the transmitter simply stalls until an FC Update DLLP arrives.
Figure 4 — Flow control credit loop. Credits flow in the opposite direction to TLPs: TLPs consume credits, FC Update DLLPs return them. The transmitter’s counters track what the receiver has available — it never sends more than the receiver can hold.
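The credit loop can be sketched in a few lines. This is a simplified model with a hypothetical `CreditGate` class: it keeps plain "credits available" counters, whereas real hardware tracks running consumed and limit counters with modulo arithmetic.

```python
class CreditGate:
    """Transmitter-side view of one header pool and one data pool."""

    def __init__(self, hdr_credits: int, data_credits: int) -> None:
        # Initial values come from the receiver's advertisement.
        self.hdr = hdr_credits
        self.data = data_credits

    def try_send(self, hdr_needed: int, data_needed: int) -> bool:
        """Send only if both pools have enough credits; otherwise stall."""
        if hdr_needed > self.hdr or data_needed > self.data:
            return False
        self.hdr -= hdr_needed
        self.data -= data_needed
        return True

    def on_update_fc(self, hdr_returned: int, data_returned: int) -> None:
        """An FC Update DLLP arrived: the receiver has freed buffer space."""
        self.hdr += hdr_returned
        self.data += data_returned


gate = CreditGate(hdr_credits=2, data_credits=8)
print(gate.try_send(1, 4))   # True
print(gate.try_send(1, 4))   # True
print(gate.try_send(1, 4))   # False: out of credits, transmitter stalls
gate.on_update_fc(1, 4)
print(gate.try_send(1, 4))   # True: credits returned, sending resumes
```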
📋 The Six Credit Types
Credits are tracked separately for three TLP categories, and within each, header and data are counted independently — six pools total.
| Abbrev. | Covers | 1 unit = ? | Infinite allowed? | Why? |
|---|---|---|---|---|
| PH | Posted Header (MWr, Msg headers) | 1 TLP header | No | Buffer must have space for the header before accepting the TLP |
| PD | Posted Data (MWr, Msg payloads) | 4 DW (16 bytes) | No | Buffer must have space for the payload before accepting the TLP |
| NPH | Non-Posted Header (MRd, IORd, CfgRd, IOWr, CfgWr headers) | 1 TLP header | No | Non-posted requests are held in the NP buffer until a completion returns |
| NPD | Non-Posted Data (IOWr, CfgWr payloads only) | 4 DW (16 bytes) | No | Small: only IOWr and CfgWr carry NP data |
| CPLH | Completion Header (Cpl, CplD headers) | 1 TLP header | Yes (must be ∞) | Deadlock prevention: a device that sent a read request must always be able to receive its completion |
| CPLD | Completion Data (CplD payload) | 4 DW (16 bytes) | Yes (must be ∞) | Same reason: completion data must always have space in the requester’s receive buffer |
The deadlock scenario — why CPLH/CPLD must be infinite. Suppose an NVMe SSD has sent 64 read requests. The RC now holds 64 completions waiting to be sent back. But the SSD’s receive buffer is full of other posted writes. Those writes are waiting because the SSD’s core is busy. The SSD’s core is busy waiting for its read completions. If the RC needed completion credits to send CplD TLPs, the whole system would freeze — no one can move. PCIe prevents this by mandating that endpoints always advertise infinite CPLH/CPLD, meaning the RC can always send completions back regardless of the endpoint’s state.
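A small sketch shows how the six pools and the infinite completion credits interact. The `POOLS` table, the `can_send` helper, and the credit values for the endpoint are all hypothetical, chosen only to mirror the scenario described here.

```python
import math

# TLP type -> (header pool, data pool or None when the TLP has no payload)
POOLS = {
    "MWr":   ("PH", "PD"),
    "MRd":   ("NPH", None),
    "CfgWr": ("NPH", "NPD"),
    "Cpl":   ("CPLH", None),
    "CplD":  ("CPLH", "CPLD"),
}


def can_send(tlp_type: str, data_credits: int, available: dict) -> bool:
    """True if the link partner has advertised enough credits for this TLP."""
    hdr_pool, data_pool = POOLS[tlp_type]
    if available[hdr_pool] < 1:
        return False
    if data_pool is not None and available[data_pool] < data_credits:
        return False
    return True


# Hypothetical endpoint whose posted buffers are completely full,
# but which advertised infinite completion credits at initialisation.
endpoint = {
    "PH": 0, "PD": 0,
    "NPH": 4, "NPD": 1,
    "CPLH": math.inf, "CPLD": math.inf,
}

print(can_send("MWr", 4, endpoint))    # False: posted writes must wait
print(can_send("CplD", 4, endpoint))   # True: completions always get through
```

Because the completion pools never run dry, the read completions reach the endpoint even while its posted buffers are full, which is what breaks the circular wait.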
📋 Credit Initialisation — The DLCMSM
Credits are exchanged before any TLP can flow. This happens automatically in hardware after Physical Layer link training completes. The Data Link Control and Management State Machine (DLCMSM) runs the process.
Figure 5 — DLCMSM states and FC initialisation. The DLCMSM starts in DL_Inactive after reset. Once the Physical Layer’s LTSSM raises LinkUp, it enters FC_Init1. Both sides simultaneously advertise their buffer sizes using InitFC1 DLLPs. After receiving the other side’s values, each side sends InitFC2 and transitions to DL_Active — at which point TLPs can flow for the first time.
InitFC1 DLLPs are sent repeatedly, not just once, because a single DLLP could get corrupted during the noisy early moments after link training. The spec requires continuing to send InitFC1 DLLPs until the partner’s values have been received reliably. Only then does the state machine move on to FC_Init2.
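The sequence can be sketched as a small state machine. This is a simplified model using a hypothetical `Dlcmsm` class: it follows the four states named above and ignores timers and the other DLLP or TLP types that the real state machine also accepts as exit conditions.

```python
from enum import Enum, auto


class State(Enum):
    DL_INACTIVE = auto()
    FC_INIT1 = auto()
    FC_INIT2 = auto()
    DL_ACTIVE = auto()


class Dlcmsm:
    def __init__(self) -> None:
        self.state = State.DL_INACTIVE
        self.seen_init1 = set()   # which InitFC1 types have arrived intact

    def on_link_up(self) -> None:
        """The Physical Layer reports LinkUp after link training."""
        if self.state is State.DL_INACTIVE:
            self.state = State.FC_INIT1

    def dllps_to_send(self) -> list:
        """DLLPs this side keeps transmitting while in the current state."""
        if self.state is State.FC_INIT1:
            return ["InitFC1_P", "InitFC1_NP", "InitFC1_Cpl"]
        if self.state is State.FC_INIT2:
            return ["InitFC2_P", "InitFC2_NP", "InitFC2_Cpl"]
        return []

    def on_dllp(self, name: str, crc_ok: bool = True) -> None:
        if not crc_ok:
            return  # corrupted DLLPs are dropped; the partner keeps repeating
        if self.state is State.FC_INIT1 and name.startswith("InitFC1_"):
            self.seen_init1.add(name)
            if len(self.seen_init1) == 3:      # P, NP and Cpl all received
                self.state = State.FC_INIT2
        elif self.state is State.FC_INIT2 and name.startswith("InitFC2_"):
            self.state = State.DL_ACTIVE       # TLPs may now flow


sm = Dlcmsm()
sm.on_link_up()
sm.on_dllp("InitFC1_P", crc_ok=False)          # lost to noise, no progress
for name in ("InitFC1_P", "InitFC1_NP", "InitFC1_Cpl"):
    sm.on_dllp(name)
sm.on_dllp("InitFC2_P")
print(sm.state)   # State.DL_ACTIVE
```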
📋 Virtual Channels and Traffic Classes
The Traffic Class (TC) field in the TLP header is a 3-bit priority selector. The hardware maps TCs to Virtual Channels (VCs) — physical buffer partitions — and uses an arbiter to decide which VC’s packets are sent next.
Figure 6 — TC→VC mapping. The TC field in the TLP header selects a priority level. Hardware maps TCs to VC buffers according to a software-programmed TC-VC map. The VC arbiter selects which VC sends next — higher VCs can be given strict priority to guarantee bandwidth for time-sensitive traffic.
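A minimal sketch of the mapping and a strict-priority arbiter follows. The `TC_TO_VC` map and the `VcArbiter` class are hypothetical; real hardware can also use round-robin or weighted schemes, and the only fixed part of the map is that TC0 goes to VC0.

```python
from collections import deque

# Hypothetical software-programmed map: TC 0-5 -> VC0, TC 6-7 -> VC1.
TC_TO_VC = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 1}


class VcArbiter:
    def __init__(self, num_vcs: int) -> None:
        # One independent FIFO per Virtual Channel.
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, tlp_name: str, tc: int) -> None:
        """Place a TLP in the VC buffer selected by its Traffic Class."""
        self.queues[TC_TO_VC[tc]].append(tlp_name)

    def next_tlp(self):
        """Strict priority: the highest-numbered non-empty VC goes first."""
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None


arb = VcArbiter(num_vcs=2)
arb.enqueue("bulk0", tc=0)
arb.enqueue("audio0", tc=7)
arb.enqueue("bulk1", tc=1)
print([arb.next_tlp() for _ in range(4)])
# ['audio0', 'bulk0', 'bulk1', None]
```

Within each VC the TLPs still leave in arrival order; only the choice between VCs is prioritised.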
📋 Transaction Ordering Rules
Within a Virtual Channel, TLPs normally exit in order. The ordering rules define the exceptions — when a TLP waiting behind another may “pass” it. These rules exist to prevent deadlocks and to match the ordering guarantees PCI software already depends on.
| New TLP (waiting) \ TLP ahead in queue | Posted Write | Non-Posted Read | Completion | Non-Posted Write |
|---|---|---|---|---|
| Posted Write | No (must not pass) | Yes (may pass) | Yes (may pass) | Yes (may pass) |
| Non-Posted Read | No (must not pass) | Yes (may pass) | Yes (may pass) | Yes (may pass) |
| Completion | No (must not pass) | Yes (may pass) | No for the same request (must stay in order) | Yes (may pass) |
| Non-Posted Write | No (must not pass) | Yes (may pass) | Yes (may pass) | Yes (may pass) |
The single most important rule: Posted writes may not pass posted writes. Memory write ordering is the foundation of DMA correctness — if a CPU writes a “data ready” flag after writing the data, the flag write must arrive after the data write.
The Relaxed Ordering (RO) attribute bit in the TLP header overrides these rules when set — a TLP with RO=1 is allowed to bypass posted writes ahead of it. GPUs and high-bandwidth DMA engines use this to increase throughput when out-of-order delivery is safe for that particular transfer type.
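The effect of the RO bit on posted writes can be sketched as follows. This is an illustration only: `PostedWrite` and `next_sendable` are made-up names, and the model covers just a queue of posted writes whose oldest entry is stalled (for example, waiting for credits).

```python
from dataclasses import dataclass


@dataclass
class PostedWrite:
    name: str
    ro: bool = False   # Relaxed Ordering attribute bit


def next_sendable(queue: list, head_blocked: bool):
    """Pick the next posted write to send from a queue ordered oldest first."""
    if not queue:
        return None
    if not head_blocked:
        return queue[0]            # normal case: strict arrival order
    for write in queue[1:]:
        if write.ro:
            return write           # RO=1 may bypass the posted writes ahead
    return None                    # everything else waits behind the head


queue = [
    PostedWrite("data"),             # stalled at the head of the queue
    PostedWrite("flag"),             # must arrive after "data"
    PostedWrite("bulk", ro=True),    # order does not matter for this one
]
print(next_sendable(queue, head_blocked=True).name)   # bulk
print(next_sendable(queue[:2], head_blocked=True))    # None
```

The "flag" write stays behind "data", which is exactly the producer/consumer guarantee described above, while the RO write is free to use the idle link.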
📋 Data Link Layer — ACK/NAK State Machine
The Data Link Layer’s core job is to make the unreliable physical link look reliable to the Transaction Layer above. It does this with Sequence Numbers, LCRC, and the ACK/NAK retry protocol. The whole thing is hardware — no software is involved, and the Transaction Layer never knows it happened.
Figure 7 — ACK/NAK transmit state machine. Every sent TLP awaits an ACK DLLP. ACK flushes the replay buffer up to that sequence number (cumulative). NAK or timeout triggers replay of all unacknowledged TLPs. After 4 failed replays the error escalates to the Physical Layer — the LTSSM may reset the link.
📋 Replay Buffer Mechanics
The replay buffer is the insurance policy of the Data Link Layer. Every TLP that leaves the transmitter’s Data Link Layer is copied into the replay buffer before being handed to the Physical Layer. It stays there until an ACK DLLP arrives confirming safe delivery at the neighbour.
Figure 8 — Replay buffer contents. Sequence numbers 0 and 1 have been ACKed and can be freed. Numbers 2 and 3 are still in the buffer awaiting acknowledgement. ACK(N) means “flush everything up to and including N” — it takes only one ACK DLLP to free multiple TLPs.
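The cumulative flush can be sketched with a small queue. This is a simplified model with a hypothetical `ReplayBuffer` class: it assumes 12-bit sequence numbers that wrap at 4096 and fewer than 2048 TLPs outstanding, and it leaves out the replay timer and retry counter.

```python
from collections import deque

SEQ_MOD = 4096   # 12-bit sequence numbers wrap around at 4096


class ReplayBuffer:
    def __init__(self) -> None:
        self.next_seq = 0
        self.pending = deque()   # (seq, tlp) pairs, oldest first

    def send(self, tlp: bytes) -> int:
        """Assign a sequence number and keep a copy until it is ACKed."""
        seq = self.next_seq
        self.pending.append((seq, tlp))
        self.next_seq = (seq + 1) % SEQ_MOD
        return seq

    def on_ack(self, n: int) -> None:
        """ACK(n) is cumulative: free every entry up to and including n."""
        while self.pending and (n - self.pending[0][0]) % SEQ_MOD < SEQ_MOD // 2:
            self.pending.popleft()

    def replay(self) -> list:
        """On NAK or timeout, resend everything still unacknowledged."""
        return list(self.pending)


buf = ReplayBuffer()
for payload in (b"tlp0", b"tlp1", b"tlp2", b"tlp3"):
    buf.send(payload)

buf.on_ack(1)                               # one DLLP frees sequence 0 and 1
print([seq for seq, _ in buf.replay()])     # [2, 3]
```

The modular comparison in `on_ack` is what lets the same logic keep working after the sequence counter wraps from 4095 back to 0.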
📋 DLLP Types
A DLLP is always 6 bytes at the Data Link Layer: 4 bytes of content + a 16-bit (2-byte) CRC. Physical Layer framing brings it to 8 bytes on the wire. DLLPs are created and consumed only within the Data Link Layer; they are never seen by the Transaction Layer and never routed by Switches.
| DLLP Type | Direction | Purpose |
|---|---|---|
| ACK | RX port → TX port | Cumulative acknowledgement up to SeqNum N; transmitter may flush replay buffer ≤ N |
| NAK | RX port → TX port | Error signal carrying the last good SeqNum N; transmitter flushes ≤ N and replays from N+1 |
| UpdateFC_P | RX → TX | Returns Posted (PH/PD) flow control credits as the receiver drains its buffers |
| UpdateFC_NP | RX → TX | Returns Non-Posted (NPH/NPD) flow control credits |
| UpdateFC_Cpl | RX → TX | Returns Completion (CPLH/CPLD) flow control credits; typically infinite |
| InitFC1_P/NP/Cpl | Both (simultaneous) | Advertise initial buffer sizes during FC initialisation (sent in order: P → NP → Cpl) |
| InitFC2_P/NP/Cpl | Both (simultaneous) | Confirm receipt of partner’s InitFC1 values; transitions link to DL_Active |
| PM_Enter_L1 | Downstream → Upstream | Request entry into the L1 power state (software-directed; ASPM L1 uses PM_Active_State_Request_L1) |
| PM_Enter_L23 | Downstream → Upstream | Request entry into the L2/L3 Ready state |
| PM_Request_Ack | Upstream → Downstream | Acknowledge the power management request from the downstream device |
📋 Physical Layer Gen 1/2 — 8b/10b Logical Sub-block
Gen 1 (2.5 GT/s) and Gen 2 (5.0 GT/s) use 8b/10b encoding. Every 8-bit byte maps to a 10-bit symbol on the wire, picked from two possible encodings according to the running disparity. That costs 20% overhead, but it delivers three critical properties the link needs.
Figure 9 — Gen 1/2 transmit pipeline. The 20% encoding overhead is real cost, but 8b/10b provides guaranteed DC balance and sufficient transition density for CDR lock — both essential properties for a high-speed serial link. K-characters enable unambiguous packet framing.
📋 Physical Layer Gen 3–5 — 128b/130b Logical Sub-block
Gen 3 (8 GT/s), Gen 4 (16 GT/s), and Gen 5 (32 GT/s) all use 128b/130b encoding. The headline improvement: only about 1.5% overhead instead of 20%. This is why Gen 3 at 8 GT/s delivers about the same effective throughput (roughly 7.9 Gb/s per lane) as an 8b/10b link at 10 GT/s would have (8 Gb/s). That is nearly double Gen 2, without paying the encoding tax.
Figure 10 — 128b/130b block structure. 128 bits of payload, 2-bit sync header, total 130 bits. The sync header bits (01 or 10) tell the receiver whether this block carries data or control information — replacing the K-character framing of 8b/10b entirely. Scrambling is mandatory to ensure adequate CDR transitions since 128b/130b has no inherent DC balance mechanism.
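The encoding overhead is easy to check with a little arithmetic. The sketch below uses a hypothetical helper, `effective_gbps`, and computes per-lane, per-direction throughput after line encoding only, ignoring packet framing and protocol overhead.

```python
def effective_gbps(rate_gt_s: float, payload_bits: int, block_bits: int) -> float:
    """Per-lane throughput in Gb/s after line-encoding overhead."""
    return rate_gt_s * payload_bits / block_bits


print(effective_gbps(2.5, 8, 10))                # 2.0   Gen 1, 8b/10b
print(effective_gbps(5.0, 8, 10))                # 4.0   Gen 2, 8b/10b
print(round(effective_gbps(8.0, 128, 130), 3))   # 7.877 Gen 3, 128b/130b
print(effective_gbps(10.0, 8, 10))               # 8.0   hypothetical 10 GT/s with 8b/10b
```

The last two lines are the comparison that matters: 8 GT/s with 128b/130b lands within about 1.5% of what 10 GT/s with 8b/10b would have delivered.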
📋 Physical Layer Gen 6 — PAM4 + FEC + Flit
Gen 6 (64 GT/s) is the biggest Physical Layer change since 8b/10b → 128b/130b in Gen 3. Three mechanisms work together — none of them can be used without the others at this speed.
Figure 11 — Gen 6’s three interdependent mechanisms. PAM4 doubles bandwidth at the same baud rate, but reduces voltage margins. FEC corrects the increased error rate in hardware, restoring effective BER to spec levels. Flit framing provides fixed-size blocks that make FEC encoding efficient and clean — you can’t have one without the others at this speed.
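To make the "2 bits per symbol" point concrete, here is a minimal sketch of a PAM4 mapper. It assumes a Gray-coded mapping of bit pairs to the four levels, so that neighbouring levels differ in only one bit; the `GRAY_PAM4` table and `pam4_encode` function are illustrative names, not taken from the spec.

```python
# Bit pair -> voltage level index (0 = lowest, 3 = highest), Gray-coded
# so that mistaking a level for its neighbour corrupts only one bit.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits: list) -> list:
    """Map a bit sequence to PAM4 level indices, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


print(pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))   # [0, 1, 2, 3]

# Same symbol rate as Gen 5, twice the bits per symbol.
baud = 32e9
bits_per_symbol = 2
print(baud * bits_per_symbol / 1e9)            # 64.0 (Gb/s raw, per lane)
```

With four levels packed into the same voltage swing, a small amount of noise is enough to push a symbol into its neighbour, which is why FEC becomes necessary at this speed.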
What stays the same in Gen 6 from upper layers’ perspective. From the Transaction Layer’s point of view, nothing has changed. It still builds the same TLP headers with the same fields. It still uses the same six flow control credit types. The same ACK/NAK protocol runs in the Data Link Layer. The same ordering rules apply. A device driver or firmware written for Gen 3 hardware works on Gen 6 without any modification. The entire change in Gen 6 is contained within the Physical Layer.
📋 What Changes Between Generations
| Layer / Feature | Gen 1/2 | Gen 3–5 | Gen 6 |
|---|---|---|---|
| Transaction Layer | Unchanged: same TLP formats, FC credit types, VC/TC model, ordering rules | Unchanged | Unchanged |
| Data Link Layer | Unchanged: same ACK/NAK, LCRC, replay buffer, DLLP types | Unchanged | Same, with replay granularity adapted to flit level |
| PL: Encoding | 8b/10b (20% overhead) | 128b/130b (1.5% overhead) | PAM4 (2 bits/symbol) |
| PL: Framing | STP / SDP / END K-characters | Sync header bits (01 or 10) | 256-byte flits |
| PL: Error correction | Detection only (LCRC) | Detection only (LCRC) | Mandatory FEC per flit + LCRC |
| PL: Equalization | Simple TX pre-emphasis | Explicit FIR coefficient negotiation (LTSSM) | Advanced multi-tap FIR + DSP |
| PL: Modulation | NRZ, 2 voltage levels | NRZ, 2 voltage levels | PAM4, 4 voltage levels |
| Software view | Completely unchanged: same BDF, config registers, memory map, driver model | Unchanged | Unchanged |
📋 Quick Reference
| Concept | Key Point |
|---|---|
| Transaction Layer role | Build and decode TLPs; check FC credits; enforce ordering; manage VC/TC priority |
| TLP DW0 | Fmt (3/4 DW, data or not) + Type (MRd/MWr/Cpl…) + TC (priority 0–7) + TD (ECRC present) + EP (poisoned) + Attr (RO/NS) + AT + Length |
| ECRC vs LCRC | LCRC: per-hop, mandatory, recalculated at each Switch. ECRC: end-to-end, optional, survives routing, detects intra-Switch errors. |
| Gen 6 PAM4 | 4 voltage levels, 2 bits/symbol, 32 GBaud = 64 GT/s, eye margin 1/3 of NRZ → raw BER rises |
| Gen 6 FEC | Mandatory Reed-Solomon per flit, corrects errors in hardware before DLL, restores BER to < 10⁻¹⁵ |
| Gen 6 Flit | 256-byte fixed frame: 236 B TLP payload + 20 B overhead (6 B DLP + 8 B CRC + 6 B FEC). Multiple TLPs/DLLPs per flit. Replay at flit granularity. |
Coming next: PCIe-04 covers PCIe Generations Gen 1 to Gen 6 — the detailed bandwidth maths behind each generation, why Gen 3 chose 8 GT/s instead of 10 GT/s, what drove each design decision from the spec, and a deep dive into the Gen 6 flit format and FEC structure with worked numbers.