PCIe Series — PCIe-05: Transaction Layer in Depth — VLSI Trainers
Transaction Layer in Depth
Every TLP type explained with full header diagrams — Memory Read/Write, Completions, Config, IO, Messages, and AtomicOps. Address routing vs ID routing vs implicit routing. Byte enables, the Tag field, split transactions, and how routing decisions are made at every node. Gen 6 TLP considerations throughout.
📋 TLP Structure Overview
A TLP (Transaction Layer Packet) is the fundamental unit of communication in PCIe. It carries commands and data between the software stacks of two devices — one the requester, one the completer. Every TLP starts with a header of 3 or 4 Doublewords (12 or 16 bytes), optionally followed by a data payload, and optionally ending with an ECRC.
Figure 1 — TLP anatomy. The header is mandatory; payload and ECRC are conditional. Data Link Layer adds SeqNo and LCRC wrapping the entire TLP. In Gen 6, one or more wrapped TLPs are packed into a 256-byte flit with FEC parity appended.
📋 Fmt and Type Field Encoding
The first two fields in DW0 tell the receiver everything about the TLP’s structure before it parses any other field. Fmt sets the header size and whether a payload follows; Type selects the TLP variety.
Figure 2 — Fmt encoding (left) and Type[4:0] key values (right). Each row in the Type table is its own TLP group with a clean Type field value, name, and routing method. The receiver parses Fmt first to know header size, then Type to know what to do with it.
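The Fmt-then-Type parsing order can be sketched in a few lines of Python. This is a minimal illustration for the classic (non-flit) header layout, assuming Fmt sits in bits 31:29 of DW0, Type in bits 28:24, and Length in bits 9:0. The function name `decode_dw0` and the small name table are hypothetical helpers, and only the Type values discussed in this article are covered.

```python
# Fmt value -> (header size in DW, payload present)
FMT = {
    0b000: (3, False),
    0b001: (4, False),
    0b010: (3, True),
    0b011: (4, True),
}

# Type value -> (name without data, name with data); partial table for illustration
TYPE_NAMES = {
    0b00000: ("MRd", "MWr"),
    0b00010: ("IORd", "IOWr"),
    0b00100: ("CfgRd0", "CfgWr0"),
    0b00101: ("CfgRd1", "CfgWr1"),
    0b01010: ("Cpl", "CplD"),
}

def decode_dw0(dw0):
    """Decode Fmt, Type and Length from the first header DW (hypothetical helper)."""
    fmt = (dw0 >> 29) & 0x7
    typ = (dw0 >> 24) & 0x1F
    if fmt not in FMT:
        raise ValueError("Fmt value not handled in this sketch")
    header_dw, has_data = FMT[fmt]
    if typ >> 3 == 0b10:                      # Type = 1_0rrr marks a Message
        name = "MsgD" if has_data else "Msg"
    else:
        names = TYPE_NAMES.get(typ)
        name = names[has_data] if names else "other"
    # A Length of 0 encodes 1024 DW; the field is only meaningful for
    # TLPs that carry data or request it.
    length_dw = (dw0 & 0x3FF) or 1024
    return {"name": name, "header_dw": header_dw,
            "has_data": has_data, "length_dw": length_dw}
```

For example, `decode_dw0(0x40000001)` describes a 3DW MWr carrying one DW of data.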
📋 Memory Read (MRd) — 3DW and 4DW Headers
A Memory Read request (MRd) asks the completer to return a block of data from a memory-mapped address. It is non-posted — a completion with data (CplD) must come back. The requester uses a Tag to match the completion to its request.
Figure 3 — MRd header layouts. Left: 3DW (12-byte) for 32-bit addresses below 4 GB. Right: 4DW (16-byte) for 64-bit addresses. The Requester ID identifies who sent the request; the Tag distinguishes this read from up to 255 other simultaneous reads from the same function. Byte Enables indicate which bytes in the first and last DW are valid.
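To make the two layouts concrete, here is a hedged Python sketch that packs an MRd header and picks the 3DW or 4DW form from the address. It assumes an 8-bit Tag and leaves TC, attributes, and other DW0 fields at zero; `build_mrd_header` is a hypothetical name, not an existing API.

```python
import struct

def build_mrd_header(requester_id, tag, address, length_dw,
                     first_be=0xF, last_be=None):
    """Return a 12- or 16-byte MRd header as big-endian bytes (illustrative only)."""
    if address % 4:
        raise ValueError("address must be DW-aligned; byte enables handle sub-DW access")
    if not 1 <= length_dw <= 1024:
        raise ValueError("Length must be 1..1024 DW")
    if last_be is None:
        # Last DW BE must be 0000 for a single-DW request
        last_be = 0x0 if length_dw == 1 else 0xF
    use_4dw = address >= (1 << 32)
    fmt = 0b001 if use_4dw else 0b000          # no payload in either case
    dw0 = (fmt << 29) | (length_dw & 0x3FF)    # Type = 0_0000; 1024 DW encodes as 0
    dw1 = (requester_id << 16) | ((tag & 0xFF) << 8) | (last_be << 4) | first_be
    if use_4dw:
        return struct.pack(">4I", dw0, dw1, address >> 32, address & 0xFFFFFFFF)
    return struct.pack(">3I", dw0, dw1, address)
```

Calling `build_mrd_header(0x0100, 5, 0x1000, 4)` yields a 12-byte header, while any address at or above 4 GB produces the 16-byte form.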
📋 Memory Write (MWr)
A Memory Write (MWr) posts data to a memory address. It is posted — no completion comes back. The requester sends it and immediately continues. This is what makes DMA writes fast: the CPU or DMA engine fires the write and moves on without waiting for confirmation.
Figure 4 — MWr header with payload. The header is identical to MRd but with Fmt’s data bit set, and the payload immediately follows the last header DW. Since MWr is posted, a completion TLP never comes back — but the Data Link Layer’s per-hop ACK DLLP still confirms delivery to the adjacent neighbour.
Why MWr is posted but MRd is not. With a write, the data is in the TLP itself — the requester has everything it needs to continue its work. With a read, the data is at the target — the requester must wait for the completion to arrive before it can proceed. Posting memory writes and returning completions only for reads is the PCIe version of the “fire and forget” principle that makes DMA engines so efficient.
📋 Byte Enables — First DW and Last DW
Every MRd and MWr (and IORd/IOWr) header carries two 4-bit Byte Enable fields: one for the first DW of the transfer and one for the last DW. Each bit selects one byte within its DW.
Figure 5 — Byte Enable fields. First DW BE controls validity of bytes in the first DW of the transfer; Last DW BE controls the final DW. When Length=1, Last BE must be 0000. Together they allow sub-DW granularity at both ends of any transfer.
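The relationship between a byte-granular transfer and the two Byte Enable fields can be worked through in code. The sketch below assumes bit 0 of each field corresponds to the lowest-addressed byte of its DW; `byte_enables` is a hypothetical helper name.

```python
def byte_enables(byte_addr, byte_len):
    """Return (length_dw, first_dw_be, last_dw_be) for a contiguous byte range."""
    if byte_len <= 0:
        raise ValueError("byte_len must be positive")
    last_byte = byte_addr + byte_len - 1
    start = byte_addr & 0x3           # offset of the first byte within its DW
    end = last_byte & 0x3             # offset of the last byte within its DW
    length_dw = (last_byte >> 2) - (byte_addr >> 2) + 1
    if length_dw == 1:
        # Single DW: First DW BE covers start..end, Last DW BE must be 0000
        first_be = sum(1 << i for i in range(start, end + 1))
        last_be = 0
    else:
        first_be = (0xF << start) & 0xF   # bytes from 'start' to the top of the first DW
        last_be = 0xF >> (3 - end)        # bytes from 0 to 'end' in the last DW
    return length_dw, first_be, last_be
```

For instance, a 2-byte access at address 0x1001 fits in one DW with First DW BE = 0110 and Last DW BE = 0000, while an 8-byte access at 0x1002 spans three DWs with First DW BE = 1100 and Last DW BE = 0011.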
📋 Tag Field — Multiple Outstanding Requests
The Tag field allows a single function to have multiple non-posted requests outstanding simultaneously without confusing the completions when they arrive back. Each outstanding request gets a unique Tag; the completer echoes it in the completion header; the requester uses it to match the completion to the original request.
Figure 6 — Tag field enables multiple outstanding requests. The NVMe sends three simultaneous MRd TLPs, each with a different Tag (5, 6, 7). Completions return in any order — Tag=6 first, Tag=7 second, Tag=5 last. The NVMe’s Transaction Layer uses the Tag to deliver each completion to the correct waiting context.
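| Tag capability | Tag bits | Max simultaneous requests per function | Enabled by |
|---|---|---|---|
| Default Tag field | 5 bits | 32 | Default (Extended Tag Field Enable clear) |
| Extended Tag | 8 bits | 256 | Extended Tag Field Enable bit in the Device Control register |
| 10-Bit Tag (PCIe 4.0 and later) | 10 bits | 768 usable of 1024 encodings (Tag[9:8] = 00b is not generated by a 10-Bit Tag requester) | 10-Bit Tag Requester Enable bit in the Device Control 2 register |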
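The bookkeeping a requester does with Tags can be sketched as a small allocator. This is a simplified model, not a description of any particular controller: `TagTracker` and its methods are hypothetical names, and the tag pool is assumed to be every value the configured tag width allows.

```python
from collections import deque

class TagTracker:
    """Tracks outstanding non-posted requests by Tag (illustrative model)."""

    def __init__(self, tag_bits=8):
        self.free = deque(range(1 << tag_bits))
        self.pending = {}                 # tag -> context waiting for completion

    def issue(self, context):
        if not self.free:
            raise RuntimeError("all tags in use; requester must stall")
        tag = self.free.popleft()
        self.pending[tag] = context
        return tag

    def complete(self, tag):
        # A KeyError here corresponds to an unexpected completion
        context = self.pending.pop(tag)
        self.free.append(tag)             # tag may be reused once the request is done
        return context
```

Completions may arrive in any order: calling `complete()` with whichever Tag comes back returns the matching context, and the Tag goes back into the free pool.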
📋 Completion (Cpl and CplD)
A Completion is the response to any non-posted request. It routes back to the requester using the Requester ID (BDF) embedded in the original request. It carries the Tag from the request to match back to the original transaction, a Completion Status code, and — for reads — the requested data payload.
Figure 7 — CplD (Completion with Data) header. DW1 carries the Completer ID (who is sending this completion), Status code, and Byte Count. DW2 carries the Requester ID (who gets the completion — used for routing), the echoed Tag, and Lower Address. The data payload follows immediately after DW2.
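As a rough companion to the header layout, the following Python sketch unpacks the fields of a 3DW completion header. It assumes the classic layout (Completer ID, Status, and Byte Count in DW1; Requester ID, Tag, and Lower Address in DW2) with an 8-bit Tag. `parse_completion` and `bdf_str` are hypothetical helper names.

```python
import struct

STATUS = {0b000: "SC", 0b001: "UR", 0b010: "CRS", 0b100: "CA"}

def bdf_str(rid):
    """Format a 16-bit ID as bus:device.function."""
    return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"

def parse_completion(header):
    """Decode a 12-byte Cpl/CplD header given as big-endian bytes."""
    dw0, dw1, dw2 = struct.unpack(">3I", header[:12])
    return {
        "has_data": bool((dw0 >> 30) & 1),        # Fmt data bit: CplD vs Cpl
        "completer_id": bdf_str(dw1 >> 16),
        "status": STATUS.get((dw1 >> 13) & 0x7, "reserved"),
        "byte_count": (dw1 & 0xFFF) or 4096,      # 0 encodes 4096 bytes
        "requester_id": bdf_str(dw2 >> 16),       # used for ID routing
        "tag": (dw2 >> 8) & 0xFF,                 # echoed from the request
        "lower_address": dw2 & 0x7F,
    }
```

The requester would use `tag` to find the waiting context and `status` to decide whether the data is usable.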
▶ Split Transaction — End-to-End Walk-Through
PCIe uses a split transaction model for all non-posted requests. The request and its response are two separate TLPs, and the link is free to carry other traffic between them. A single read may also return as multiple completion TLPs, for example when the requested range crosses a Read Completion Boundary (RCB) or exceeds Max_Payload_Size.
Figure 8 — Split transaction flow. The MRd travels upstream (GPU → Switch → RC). The RC fetches from RAM and returns a CplD with the same Tag (42) and the requester’s BDF (01:00.0) for routing. The Switch routes the CplD downstream based on the Requester ID’s bus number (bus 01). The GPU’s Transaction Layer matches Tag 42 and delivers the data.
Split completion — when one read returns multiple CplD TLPs
A completer may return fewer bytes than requested in a single CplD, so several CplD TLPs can satisfy one MRd. Each CplD payload is capped by Max_Payload_Size, and every CplD except the last must end on a Read Completion Boundary (RCB, 64 or 128 bytes). The requester reassembles them using the Byte Count field (the bytes still outstanding, including this completion's payload) and the Lower Address field (the low 7 bits of the address of the first byte in this completion).
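How Byte Count and Lower Address evolve across a split completion can be sketched as follows. This is a simplified model under stated assumptions: DW-aligned requests, a 64-byte Read Completion Boundary, a 128-byte Max_Payload_Size, and a completer that always sends the largest chunk it is allowed to. `split_completions` is a hypothetical name.

```python
def split_completions(addr, nbytes, rcb=64, mps=128):
    """Model the CplD sequence a completer might return for one MRd."""
    if mps < rcb:
        raise ValueError("this sketch assumes mps >= rcb")
    completions = []
    remaining = nbytes
    while remaining > 0:
        end = addr + min(remaining, mps)
        if end < addr + remaining:
            # More data follows, so this chunk must stop on an RCB boundary
            end -= end % rcb
        size = end - addr
        completions.append({
            "byte_count": remaining,          # bytes left, including this chunk
            "lower_address": addr & 0x7F,     # low 7 bits of the chunk's start
            "payload_bytes": size,
        })
        addr = end
        remaining -= size
    return completions

def is_final(completion):
    """The last CplD is the one whose payload covers all remaining bytes."""
    return completion["byte_count"] == completion["payload_bytes"]
```

A 256-byte read starting at 0x1010 comes back here as three completions of 112, 128, and 16 bytes, with Byte Count stepping through 256, 144, and 16.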
📋 Configuration TLPs — Type 0 and Type 1
Configuration TLPs access the 4 KB configuration space of PCIe functions. They are non-posted — a completion always comes back. Only the Root Complex may generate configuration requests (no peer-to-peer configuration is allowed).
Figure 9 — Configuration TLP header. DW2 carries the target Bus/Device/Function and the Register Number (DW offset within the 4 KB config space). Type0 (Type=0_0100) targets a device on the Secondary Bus; Type1 (Type=0_0101) is forwarded further downstream until a bridge converts it to Type0 at the target bus.
| TLP | Type field | When used | Completer action |
|---|---|---|---|
| CfgRd0 | 0_0100 | Target device is on the Secondary Bus of the forwarding bridge, so it sees Type 0 directly | Reads config register, returns CplD |
| CfgRd1 | 0_0101 | Target device is further downstream; bridges forward it until a bridge's Secondary Bus matches the target bus, then convert it to CfgRd0 | Forwarded until Type1→Type0 conversion |
| CfgWr0 | 0_0100 | Write to local bus device | Writes register, returns Cpl (no data) |
| CfgWr1 | 0_0101 | Write to downstream device | Forwarded, converted, then Cpl returns |
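The Type 1 to Type 0 conversion rule can be expressed as a small decision function. This is a sketch of the bridge's choice only, assuming the bridge compares the target bus in the request against its own Secondary and Subordinate Bus Number registers; `handle_type1_config` is a hypothetical name.

```python
def handle_type1_config(target_bus, secondary, subordinate):
    """Decide what a bridge does with an incoming Type 1 configuration request."""
    if target_bus == secondary:
        # The target sits on the bus directly below this bridge
        return "convert to Type 0"
    if secondary < target_bus <= subordinate:
        # The target is deeper in the hierarchy below this bridge
        return "forward as Type 1"
    return "not claimed by this bridge"
```

With Secondary = 3 and Subordinate = 5, a request for bus 3 is converted to Type 0, a request for bus 4 is forwarded unchanged as Type 1, and a request for bus 7 is not claimed.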
📋 IO Read and Write (IORd / IOWr)
IO TLPs target the legacy IO address space (16-bit on x86 systems). IO addresses are at most 32 bits wide, so these TLPs always use 3DW headers. Both are non-posted: IOWr must return a Cpl to confirm the write landed, which is essential because legacy device drivers often depend on write-ordering guarantees in IO space.
The PCIe spec discourages IO address space and indicates it may be deprecated in a future revision. Only Legacy PCIe Endpoints (older PCI/PCI-X devices with a PCIe interface) should use IO space. Native PCIe Endpoints use MMIO only.
📋 Message TLPs
Message TLPs replaced the sideband signals of PCI — interrupt pins, error pins, power management signals — with in-band packets. They always use a 4DW header. They are posted (no completion). Their routing is controlled by the lower 3 bits of the Type field.
| Message | Routing | Purpose |
|---|---|---|
|  |  | Device wakeup request: device has data and wants the link powered up |
| ERR_COR, ERR_NONFATAL, ERR_FATAL | Implicit → Root | Error reporting to Root Complex for AER handling |
| Unlock | Implicit → broadcast down | Terminates locked transaction sequence (legacy) |
| Set_Slot_Power_Limit | Implicit → local (terminates at the receiving port) | The downstream port above the slot informs the card of the physical slot power budget |
| Vendor-Defined Type 0/1 | Address or ID | Vendor-specific message routed to a specific address or BDF |
| Attention Button Pressed | Implicit → Root | Hot-plug slot attention button event |
| Presence Detect Changed | Implicit → Root | Hot-plug card insertion/removal event |
Why implicit routing “to Root”? The Root Complex is always at the top of the tree. A message routed “to Root” just travels upstream at every hop — no address or BDF needed. Any Switch receiving a “to Root” message on a downstream port forwards it upstream. The message terminates at the Root Complex, which is always upstream of everything.
📋 AtomicOp TLPs — Read-Modify-Write in Hardware
AtomicOp TLPs (introduced in PCIe 2.1) allow a requester to perform an atomic read-modify-write operation on a memory location at the completer, without software locks. The operation is performed atomically — no other requester can access that location between the read and the write. A completion returns the original value of the location before the operation.
| AtomicOp | Type[4:0] | Payload | Operation |
|---|---|---|---|
| FetchAdd | 0_1100 | 1 or 2 DW (operand) | Reads current value, adds operand, writes back. Returns old value. |
| Swap | 0_1101 | 1 or 2 DW (new value) | Reads current value, writes new value. Returns old value. |
| CAS | 0_1110 | 2, 4, or 8 DW (compare value + swap value) | Compares current value with the first half of the payload. If they match, writes the second half. Always returns the old value. |
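All AtomicOps are non-posted (the completion returns the old value). FetchAdd and Swap operate on 32-bit or 64-bit operands (1 DW or 2 DW); CAS additionally supports 128-bit operands. A completer advertises its AtomicOp Completer support, and switches and root ports advertise AtomicOp Routing support, in the Device Capabilities 2 register of the PCI Express Capability structure.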
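The semantics of the three operations can be modelled in a few lines. The sketch below only illustrates what the completer does to the target location and what value the completion carries back; memory is a plain dictionary, atomicity is implied by each function running as a single step, and all names are hypothetical.

```python
def fetch_add(mem, addr, operand, bits=32):
    """Add operand to the location (wrapping at the operand width); return the old value."""
    old = mem[addr]
    mem[addr] = (old + operand) & ((1 << bits) - 1)
    return old

def swap(mem, addr, new_value):
    """Unconditionally write new_value; return the old value."""
    old = mem[addr]
    mem[addr] = new_value
    return old

def compare_and_swap(mem, addr, compare, new_value):
    """Write new_value only if the location equals compare; always return the old value."""
    old = mem[addr]
    if old == compare:
        mem[addr] = new_value
    return old
```

A requester can tell whether a CAS succeeded by checking that the returned value equals the compare value it sent.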
📋 TLP Routing — Three Methods
Every TLP that arrives at a port is inspected by the Transaction Layer to determine if it should be consumed locally or forwarded to another port. The routing method is determined by the TLP Type field.
| Routing method | TLP types | How routing is decided |
|---|---|---|
| Address Routing | MRd, MWr, IORd, IOWr, AtomicOp | The address in the TLP header is compared against the port's BAR values and Base/Limit registers in the Type 1 header |
| ID Routing | CfgRd/Wr, Cpl/CplD, some Msg | The Bus/Device/Function number in the header is compared against the port's BDF and its Secondary/Subordinate range |
| Implicit Routing | Most Msg TLPs | The routing sub-field (Type[2:0]) specifies "toward Root", "broadcast downstream", or "terminate here"; no address or ID comparison is needed |
📋 Address Routing — How a Switch Makes Its Decision
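When a TLP using address routing arrives at a Switch port, the Switch evaluates three outcomes in order: the address falls within one of its own BARs (consume the TLP), the address falls within a downstream port's Base/Limit window (forward it out that port), or neither matches (Unsupported Request).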
Figure 10 — Address routing decision tree at a Switch port. Step 1: check own BARs. Step 2: check Base/Limit register ranges for downstream ports. If neither matches: Unsupported Request. For upstream-traveling TLPs, the same logic applies in reverse — step 2 would check if the address should go further upstream.
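The decision tree can be sketched as a short function. This models only a TLP arriving on the upstream port and heading downstream; the BAR and Base/Limit values in the usage note are made-up illustration values, and `route_by_address` is a hypothetical name.

```python
def route_by_address(addr, own_bars, downstream_windows):
    """Return the routing decision for an address-routed TLP.

    own_bars: list of (base, size) tuples for the switch's own BARs.
    downstream_windows: list of (port_name, base, limit) tuples, limit inclusive.
    """
    # Step 1: does the address hit one of our own BARs?
    for base, size in own_bars:
        if base <= addr < base + size:
            return "consume"
    # Step 2: does it fall inside a downstream port's Base/Limit window?
    for port_name, base, limit in downstream_windows:
        if base <= addr <= limit:
            return port_name
    # Neither matched
    return "unsupported request"
```

With a 4 KB BAR at 0xF000_0000 and a window of 0xE000_0000 to 0xE0FF_FFFF on a port named "Port 1", an access to 0xE000_1000 is forwarded to "Port 1" and an access to 0xD000_0000 is rejected as an Unsupported Request.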
📋 ID Routing — Completions and Configuration
Completions (Cpl/CplD) use ID routing to get back to the requester. The Requester ID field in the completion header contains the BDF of the original requester. Every Switch compares the target bus number against its Secondary/Subordinate range to decide whether to forward downstream — the same mechanism used for Type 1 configuration packets.
Figure 11 — ID routing for a completion. The Switch extracts the target bus number (03) from the Requester ID field. It checks which downstream port’s Secondary/Subordinate range contains bus 03 — matches Port 2 (Secondary=3, Subordinate=5). Forwards on that port. NVMe SSD (03:00.0) receives it and matches Tag=14.
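The same lookup can be written as code. The sketch below reuses the numbers from Figure 11 (Port 2 with Secondary = 3 and Subordinate = 5, NVMe SSD at 03:00.0); the range for Port 1 and the switch's own BDF are assumptions added for illustration, and both function names are hypothetical.

```python
def make_bdf(bus, dev, fn):
    """Pack Bus/Device/Function into the 16-bit ID used in TLP headers."""
    return (bus << 8) | (dev << 3) | fn

def route_by_id(target_bdf, own_bdf, downstream_ports):
    """Route an ID-routed TLP that arrived on the upstream port.

    downstream_ports: list of (port_name, secondary_bus, subordinate_bus).
    """
    if target_bdf == own_bdf:
        return "consume"
    bus = target_bdf >> 8
    for port_name, secondary, subordinate in downstream_ports:
        if secondary <= bus <= subordinate:
            return port_name
    return "unsupported request"

# Port 2 matches Figure 11; Port 1's range is an assumed value
PORTS = [("Port 1", 2, 2), ("Port 2", 3, 5)]
```

`route_by_id(make_bdf(3, 0, 0), make_bdf(1, 0, 0), PORTS)` selects "Port 2", matching the figure.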
📋 Implicit Routing — Messages
Message TLPs use implicit routing — a 3-bit code in Type[2:0] tells every Switch how to route it without needing an address or BDF lookup.
| Type[2:0] | Routing | Behaviour | Example messages |
|---|---|---|---|
| 000 | → Root Complex | Every Switch forwards upstream. Root Complex terminates. | PM_PME, ERR_* |
| 001 | By Address | Normal address lookup in Base/Limit registers | Vendor-defined |
| 010 | By ID | Normal BDF lookup | Vendor-defined |
| 011 | Broadcast downstream | Switch duplicates message to all downstream ports | Unlock, PME_Turn_Off |
| 100 | Local, terminate here | Message is consumed by the receiving port, not forwarded | INTx (regenerated hop by hop), Set_Slot_Power_Limit |
| 101 | Gather → Root | Forwarded upstream; switch may combine with others | PME_TO_Ack |
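Decoding the routing sub-field is a simple table lookup. The sketch below assumes the Message Type encoding 1_0rrr, where rrr is Type[2:0]; `message_routing` is a hypothetical helper name.

```python
ROUTING = {
    0b000: "route to Root Complex",
    0b001: "route by address",
    0b010: "route by ID",
    0b011: "broadcast downstream",
    0b100: "local, terminate at receiver",
    0b101: "gather and route to Root Complex",
}

def message_routing(type_field):
    """Return the routing method encoded in a Message TLP's 5-bit Type field."""
    if (type_field >> 3) != 0b10:
        raise ValueError("not a Message TLP (Type is not 1_0rrr)")
    return ROUTING.get(type_field & 0x7, "reserved")
```

A Switch needs only this 3-bit value to pick a forwarding action for implicitly routed messages; the address and BDF lookups are used just for the 001 and 010 cases.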
⚡ Gen 6 — TLP in Flit Context
In Gen 6, TLPs are carried inside 256-byte flits. The TLP format itself — header fields, payload structure, routing information — is unchanged from Gen 1. What changes is how those TLPs are physically transported across the link.
TLP headers are identical. A Gen 6 endpoint builds the same Fmt/Type/TC/Length header it always has. A Gen 6 driver reads the same config registers it always has. The upper layers see no change.
Flit packing. One or more TLPs (plus any DLLPs scheduled at the same time) are packed into a 256-byte flit. If a TLP is larger than the remaining space in a flit, it spans into the next flit. The flit header records where each TLP starts and ends within the flit.
ACK/NAK granularity. In Gen 1–5, each TLP has its own sequence number. In Gen 6, sequence numbers are assigned at the flit level — a NAK replays the entire flit containing the bad TLP. A flit may contain multiple TLPs; all are replayed together.
FEC applies to the flit. FEC parity is appended per flit, not per TLP. Error correction is at the flit boundary. From the Data Link Layer’s perspective, this is transparent — it sees corrected bits, same as Gen 1–5.
Maximum payload unchanged. The 10-bit Length field still encodes 1–1024 DW (4–4096 bytes), and Max_Payload_Size still caps each TLP's payload at the configured value (128 to 4096 bytes). This does not change in Gen 6.
The bottom line for TLP authors. If you are writing RTL, firmware, or drivers that generate or parse TLPs, you do not need to change anything for Gen 6. The TLP format, byte enables, tags, routing fields, and completion matching are all unchanged. Gen 6 differences sit below the Transaction Layer, in the Physical and Data Link Layers (flit framing, FEC, and flit-level replay); the TLP you send looks the same at both ends.
Request and completion are separate TLPs · one read may return multiple CplD TLPs · Byte Count tracks remaining bytes
CfgRd0/CfgWr0: Type=0_0100 · targets device on Secondary Bus of receiving bridge · always 1 DW access
CfgRd1/CfgWr1: Type=0_0101 · forwarded downstream until bridge converts it to Type0 at target bus
Address routing: Compare target address against BAR (consume), then Base/Limit (forward), else UR
ID routing: Compare target BDF against own BDF (consume), then Secondary/Subordinate range (forward downstream)
Implicit routing: Type[2:0]=000 → Root · 011 → broadcast · 100 → local · no address comparison needed
Gen 6 TLP impact: Zero; TLP format unchanged · flit packing is transparent to TL · same Max_Payload_Size
Coming next: PCIe-06 covers the TLP Ordering Rules in depth — the full 12-rule ordering table, why posted writes must not pass posted writes, the Relaxed Ordering and No Snoop attributes, and how ordering interacts with Virtual Channels and Traffic Classes.