PCIe Series — PCIe-07: Completion TLPs — VLSI Trainers
Completion TLPs
The full Cpl and CplD header (Completer ID, Status codes, Byte Count, Lower Address, Requester ID, and Tag), plus split-completion mechanics, Completion Timeout, CplLk/CplDLk for locked transactions, and why none of this changes in Gen 6.
📋 What a Completion Is
Every non-posted request — MRd, MRdLk, IOWr, IORd, CfgRd, CfgWr, AtomicOp — must eventually receive a completion TLP in return. The completion is how the completer tells the requester: “I have processed your request. Here is the result.” Without the completion, the requester has no way to know whether its request succeeded, failed, or was even received.
Figure 1 — Split-transaction model. The MRd travels upstream (① blue arrow). While it is being processed, the requester continues other work. The CplD returns downstream (② orange arrow) with the echoed Tag=42 to identify which request it satisfies. These two TLPs are independent packets — they may take different paths if the topology allows.
Only non-posted requests generate completions. Posted requests (MWr, Msg) do not, by design. To prevent deadlock, a requester must always be able to accept the completions it asked for, so it advertises infinite completion credits (covered in PCIe-13 on Flow Control).
📋 Cpl, CplD, CplLk, CplDLk
There are four completion TLP types, all sharing the same 3DW header layout. The Fmt field distinguishes whether data is present; the Type field distinguishes ordinary completions (0_1010) from the locked variants (0_1011).
| Name | Fmt | Type | Data? | Used for |
|---|---|---|---|---|
| Cpl | 000 | 0_1010 | No | Status-only response to IOWr, CfgWr, and failed reads (UR/CA) |
| CplD | 010 | 0_1010 | Yes | Response to MRd, IORd, CfgRd, AtomicOp; carries the requested data |
| CplLk | 000 | 0_1011 | No | Status-only response to MRdLk when no data (error case) |
| CplDLk | 010 | 0_1011 | Yes | Response to MRdLk; carries data, used in legacy locked transactions |
How to tell Cpl from CplD at a glance. Look at Fmt bit[1] (DW0 bit 30; Fmt occupies DW0 bits 31:29). If it is 0 (Fmt=000), this is a Cpl: no data follows the header. If it is 1 (Fmt=010), this is a CplD: a data payload immediately follows the 12-byte header. The receiver determines this before reading any further into the packet.
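The Fmt/Type table above can be turned into a small lookup. The following is a minimal sketch, not a reference decoder: it assumes DW0 is available as a 32-bit integer with byte 0 in the most significant position, so Fmt sits in bits 31:29 and Type in bits 28:24. The names `COMPLETION_TYPES`, `classify_completion`, and `has_data` are hypothetical.

```python
# Hypothetical helpers; (Fmt, Type) pairs are taken from the table above.
COMPLETION_TYPES = {
    (0b000, 0b01010): "Cpl",
    (0b010, 0b01010): "CplD",
    (0b000, 0b01011): "CplLk",
    (0b010, 0b01011): "CplDLk",
}


def classify_completion(dw0: int):
    """Return the completion name for DW0, or None if it is not a completion."""
    fmt = (dw0 >> 29) & 0x7        # Fmt[2:0]  = DW0 bits 31:29
    tlp_type = (dw0 >> 24) & 0x1F  # Type[4:0] = DW0 bits 28:24
    return COMPLETION_TYPES.get((fmt, tlp_type))


def has_data(dw0: int) -> bool:
    """Fmt bit[1] (DW0 bit 30) says whether a payload follows the header."""
    return bool((dw0 >> 30) & 1)
```

For example, `classify_completion(0x4A000010)` returns `"CplD"`, and `classify_completion(0x00000001)` (a 3DW MRd) returns `None`.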
📋 Completion Header — All 12 Bytes
Every completion uses a fixed 3DW (12-byte) header. There is no 4DW variant — completions are always 3DW regardless of whether the original request used a 64-bit address. The address is not echoed in the completion; routing uses the Requester ID instead.
Figure 2 — Completion header, all 12 bytes. DW0 carries Fmt+Type+TC+Attr+Length. DW1 carries Completer ID + Status[2:0] + BCM + Byte Count[11:0]. DW2 carries Requester ID (for routing) + Tag (for matching) + Lower Address[6:0]. The data payload follows DW2 for CplD; nothing follows for Cpl.
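The field layout in Figure 2 can be sketched as a parser. This is an illustrative model only, assuming the 12 header bytes arrive in byte 0 to byte 11 order; `CompletionHeader` and `parse_completion_header` are hypothetical names, and the Attr bits are left out for brevity.

```python
import struct
from dataclasses import dataclass


@dataclass
class CompletionHeader:
    fmt: int
    tlp_type: int
    tc: int
    length_dw: int      # payload length in DWs (raw field value)
    completer_id: int   # 16-bit BDF of the sender
    status: int
    bcm: int
    byte_count: int     # raw 12-bit field value
    requester_id: int   # 16-bit BDF used for routing
    tag: int
    lower_address: int


def parse_completion_header(raw: bytes) -> CompletionHeader:
    """Split a 3DW completion header into its fields."""
    if len(raw) < 12:
        raise ValueError("a completion header is 12 bytes")
    # Byte 0 is the most significant byte of DW0, so unpack big-endian.
    dw0, dw1, dw2 = struct.unpack(">III", raw[:12])
    return CompletionHeader(
        fmt=(dw0 >> 29) & 0x7,
        tlp_type=(dw0 >> 24) & 0x1F,
        tc=(dw0 >> 20) & 0x7,
        length_dw=dw0 & 0x3FF,
        completer_id=(dw1 >> 16) & 0xFFFF,  # bytes 4-5
        status=(dw1 >> 13) & 0x7,           # byte 6 bits 7:5
        bcm=(dw1 >> 12) & 0x1,              # byte 6 bit 4
        byte_count=dw1 & 0xFFF,             # byte 6 bits 3:0 + byte 7
        requester_id=(dw2 >> 16) & 0xFFFF,  # bytes 8-9
        tag=(dw2 >> 8) & 0xFF,              # byte 10
        lower_address=dw2 & 0x7F,           # byte 11 bits 6:0
    )
```

Parsing `bytes.fromhex("4A0000100100008003002A00")` yields a CplD of 16 DWs from completer 0x0100 to requester 0x0300 with Tag 42 and Byte Count 128.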
📋 Completer ID (DW1, Bytes 4–5)
The Completer ID is the 16-bit BDF (Bus:Device:Function) of the device that is sending this completion. It is placed in DW1 bytes 4–5 — the same position that would be the “Requester ID” in a request TLP, but in a completion it identifies the completer rather than the requester.
During normal successful operation the Completer ID is largely informational: it is useful for debug when tracking which device returned data for a given request. The Requester ID in DW2 is what actually routes the completion back to its destination. The Completer ID is nonetheless valuable for error diagnosis: if a completion arrives with an unexpected status code, it tells you exactly which device failed.
Completer ID vs Requester ID — easy to confuse. In DW1 (Bytes 4–5) lives the Completer ID — who is sending this completion. In DW2 (Bytes 8–9) lives the Requester ID — who originally sent the request and who should receive this completion. Switches use the Requester ID to route the completion downstream. The Completer ID is not used for routing.
📋 Completion Status Codes
The 3-bit Status field in DW1 byte 6 bits [7:5] tells the requester whether its request was serviced correctly and, if not, what went wrong.
Figure 3 — The four completion status codes. Codes 011 and 101–111 are reserved — a completion with a reserved status code is treated as if it were UR. A status code other than SC terminates the transaction — no further completions are expected for that request, and any data already received should be discarded.
A non-SC status terminates the transaction immediately. Suppose a completion with Status=CA arrives as the second of three expected split completions. It is a Cpl with no data payload, because only a Successful Completion returns data. The requester must discard all data received so far and consider the request failed. No further completions will come for this Tag. The Tag is freed, and the error is reported to the device driver.
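The status handling described above can be sketched as a small decoder. The encodings assumed here are the standard ones (SC=000, UR=001, CRS=010, CA=100); `CplStatus` and `decode_status` are hypothetical names. Reserved encodings are folded into UR, matching the rule stated in the Figure 3 caption.

```python
from enum import Enum


class CplStatus(Enum):
    SC = 0b000   # Successful Completion
    UR = 0b001   # Unsupported Request
    CRS = 0b010  # Configuration Request Retry Status
    CA = 0b100   # Completer Abort


def decode_status(bits: int) -> CplStatus:
    """Map the 3-bit Status field to a status, treating reserved codes as UR."""
    try:
        return CplStatus(bits & 0x7)
    except ValueError:
        return CplStatus.UR  # 011 and 101-111 are reserved


def terminates_request(status: CplStatus) -> bool:
    """Any status other than SC ends the transaction for that Tag."""
    return status is not CplStatus.SC
```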
📋 BCM — Byte Count Modified
BCM is a 1-bit field at DW1 byte 6 bit [4]. It is a legacy compatibility flag used only by PCI-X completers that may exist behind a PCIe-to-PCI-X bridge. In a pure PCIe system, BCM is always 0.
When BCM=1 (PCI-X bridges only): the Byte Count field in this first completion reports the size of this completion’s payload rather than the total remaining bytes for the original request. Subsequent completions reset BCM to 0 and Byte Count reverts to its normal meaning (remaining total). This distinction matters because a requester uses Byte Count to know when all splits of a request have arrived — BCM=1 signals “do not use this Byte Count to determine completion”.
For all native PCIe completers, BCM is always 0 and can be ignored by the receiver.
📋 Byte Count Field (DW1, 12 bits)
The 12-bit Byte Count field at DW1 byte 6 bits [3:0] + byte 7 bits [7:0] carries the number of bytes remaining to satisfy the original request, including the bytes in this completion. It counts down with each successive completion TLP.
Figure 4 — Byte Count counting down. The original request asks for 128 bytes. The first CplD returns 64 bytes and has Byte Count=128 (128 total remaining, including this completion’s 64 bytes). The second CplD returns the remaining 64 bytes with Byte Count=64. When Byte Count equals Length×4, the requester knows this is the final completion.
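On the requester side, the countdown in Figure 4 can be checked with a few lines. This is a simplified sketch with a hypothetical function name (`request_satisfied`): each completion is modeled as a (Byte Count, payload bytes) pair, BCM is assumed to be 0, and the field's special encoding for 4096 bytes is ignored.

```python
def request_satisfied(completions, requested_bytes: int) -> bool:
    """Walk the completions for one Tag and report whether the request is done.

    completions: iterable of (byte_count, payload_bytes) in arrival order.
    Raises ValueError if a Byte Count does not match the bytes still owed.
    """
    remaining = requested_bytes
    for byte_count, payload_bytes in completions:
        if byte_count != remaining:
            raise ValueError(
                f"expected Byte Count {remaining}, got {byte_count}"
            )
        remaining -= payload_bytes
    # The last completion is the one whose Byte Count equals its own payload.
    return remaining == 0
```

With the values from Figure 4, `request_satisfied([(128, 64), (64, 64)], 128)` returns `True`, while `request_satisfied([(128, 64)], 128)` returns `False` because 64 bytes are still outstanding.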
📋 Requester ID and Tag (DW2)
DW2 carries the two fields that tie a completion back to its original request: the Requester ID and the Tag.
| Field | Location | Value | Purpose |
|---|---|---|---|
| Requester ID | Bytes 8–9 | Copied from the original request’s Requester ID field | Used for routing: switches compare the bus number against their Secondary/Subordinate ranges to forward the completion downstream to the correct port |
| Tag | Byte 10 | Copied exactly from the original request’s Tag field | Used for matching: the requester’s Transaction Layer matches incoming CplD tags against its table of outstanding requests to deliver data to the correct waiting context |
The Requester ID routes; the Tag matches. These are distinct jobs. A switch that receives a completion looks at the bus number in the Requester ID (byte 8, the upper half of bytes 8–9) to decide which downstream port to forward it to. It does not look at the Tag at all. The Tag is only examined by the final destination, the original requester, which uses it to find the pending request and deliver the data.
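The matching step at the requester can be pictured as a table keyed by Tag. The class below is a toy model under simple assumptions (8-bit Tags, one table per function); `OutstandingRequests` and its method names are hypothetical and not taken from any real driver or RTL.

```python
class OutstandingRequests:
    """Toy model of a requester's table of pending non-posted requests."""

    def __init__(self, my_id: int):
        self.my_id = my_id   # this function's 16-bit Requester ID
        self.pending = {}    # Tag -> context waiting for the data

    def issue(self, tag: int, context) -> None:
        """Record a new request; a Tag cannot be reused while it is pending."""
        if tag in self.pending:
            raise ValueError(f"Tag {tag} is already in flight")
        self.pending[tag] = context

    def match(self, requester_id: int, tag: int):
        """Return the waiting context, or None for an unexpected completion."""
        if requester_id != self.my_id:
            return None
        return self.pending.get(tag)

    def retire(self, tag: int) -> None:
        """Free the Tag after the final (or a failed) completion."""
        self.pending.pop(tag, None)
```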
📋 Lower Address Field (DW2 Byte 11 bits [6:0])
Lower Address is a 7-bit field at byte 11 bits [6:0] of DW2. It carries the byte address of the first valid data byte being returned in this completion. It is not the full address — it is the low 7 bits of the byte-level start address of the data in this completion.
Its primary purpose is to help the requester calculate how many bytes remain before hitting the next Read Completion Boundary (RCB). RCB is either 64 or 128 bytes depending on a Root Complex configuration register. Completions that are entirely within one RCB must be returned in a single CplD. The Lower Address field tells the requester the exact byte position within the current RCB so it can do this calculation.
For AtomicOp completions, Lower Address is reserved (set to zero). For every completion other than a memory read completion (I/O and Configuration completions, and Cpl with no data), Lower Address is also zero.
📋 Calculating the Lower Address Field
The completer calculates Lower Address from the DW-aligned start address and the First DW Byte Enable pattern of the original request. The result is the byte offset within the first DW at which the first valid data byte lives.
Figure 5 — Lower Address calculation. The First DW BE from the original request determines the byte offset within the starting DW. Lower Address is the low 7 bits of (DW-aligned start address + offset from First DW BE). The requester uses it to calculate how many bytes reach the next RCB boundary.
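The calculation in Figure 5 can be sketched as follows. This assumes the usual reading of the First DW Byte Enable field, where the lowest set bit marks the first valid byte and an all-zero pattern contributes an offset of 0; the function names are hypothetical.

```python
def first_be_offset(first_dw_be: int) -> int:
    """Byte offset (0-3) of the first enabled byte in the first DW."""
    be = first_dw_be & 0xF
    if be == 0:
        return 0  # no bytes enabled: offset is taken as 0
    offset = 0
    while not (be >> offset) & 1:
        offset += 1
    return offset


def lower_address(dw_aligned_addr: int, first_dw_be: int) -> int:
    """Low 7 bits of (DW-aligned start address + First DW BE offset)."""
    start = (dw_aligned_addr & ~0x3) + first_be_offset(first_dw_be)
    return start & 0x7F
```

For a read starting at DW address 0x1038 with First DW BE = 1100b, the first valid byte is byte 2 of that DW, so `lower_address(0x1038, 0b1100)` gives 0x3A.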
📋 How Completions Are Routed
Completions use ID routing, not address routing. Every switch that receives a completion extracts the bus number from the Requester ID (byte 8, the upper byte of the 16-bit ID in DW2 bytes 8–9) and compares it against the Secondary and Subordinate bus numbers of its downstream ports to determine which port to forward the completion to.
Figure 6 — Completion routing. The switch extracts the Requester ID bus number (03) from DW2 bytes 8–9 and compares it against each downstream port’s Secondary/Subordinate bus range. Port 2 has Secondary=3, Subordinate=5 — bus 03 is in this range, so the completion is forwarded there. At the NVMe SSD, Tag=42 matches the pending MRd and the data is delivered.
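The routing decision in Figure 6 reduces to a range check on the bus number. The sketch below is illustrative only: `route_completion` is a hypothetical name, and the bus range given for "port1" in the usage note is made up for the example (only Port 2's range comes from the figure).

```python
def route_completion(requester_id: int, downstream_ports: dict):
    """Pick the downstream port whose bus range contains the requester's bus.

    downstream_ports maps a port name to (secondary_bus, subordinate_bus).
    Returns None when no downstream range matches, meaning the destination
    is not below this switch.
    """
    bus = (requester_id >> 8) & 0xFF  # bus number = upper byte of the BDF
    for port, (secondary, subordinate) in downstream_ports.items():
        if secondary <= bus <= subordinate:
            return port
    return None
```

With `ports = {"port1": (2, 2), "port2": (3, 5)}`, a completion whose Requester ID is 0x0300 (bus 03) is routed to `"port2"`.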
📋 Split Completions and the Read Completion Boundary
A completer is allowed to satisfy a single read request with multiple completion TLPs. This is called a split completion. The spec defines when this is legal and how the completions must be structured.
Read Completion Boundary (RCB)
The RCB is a naturally-aligned boundary in the memory address space, at either 64 or 128 bytes. The Root Complex’s RCB size is readable from a configuration register. Bridges and endpoints may also implement an RCB configuration bit.
The rules from the spec:
A completion that is entirely within a single RCB-aligned block must be returned in one CplD — it cannot be split at a point within an RCB.
A completion spanning an RCB boundary may be split at the boundary. The first CplD covers up to (but not including) the next RCB boundary; subsequent CplDs cover one or more full RCB blocks; the final CplD covers the remaining bytes.
Multiple completions for one request must be returned in strictly increasing address order.
A single completion can only satisfy one request — it cannot partially satisfy two requests.
For IO and Config reads, the completer always returns exactly 1 DW (4 bytes) in a single completion — never split.
Figure 7 — Split completion at RCB=64 bytes. The original MRd covers bytes 0x38–0x97 (96 bytes across three segments). CplD #1 covers up to the first RCB boundary (8 bytes, Byte Count=96). CplD #2 covers the full middle RCB block (64 bytes, Byte Count=88). CplD #3 covers the tail (24 bytes, Byte Count=24 — equals its own Length×4, signalling last completion).
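The numbers in Figure 7 can be reproduced with a short routine. This is a sketch of one legal split pattern only (a new CplD at every RCB boundary); a real completer may merge several RCB blocks into one completion, up to its payload limits. `split_at_rcb` is a hypothetical name.

```python
def split_at_rcb(start_addr: int, nbytes: int, rcb: int = 64):
    """Split a read into completions that each end at an RCB boundary.

    Returns a list of (payload_bytes, byte_count, lower_address) tuples in
    increasing address order.
    """
    completions = []
    addr = start_addr
    remaining = nbytes
    while remaining > 0:
        to_boundary = rcb - (addr % rcb)   # bytes left in this RCB block
        size = min(remaining, to_boundary)
        # Byte Count includes the bytes carried by this completion.
        completions.append((size, remaining, addr & 0x7F))
        addr += size
        remaining -= size
    return completions
```

For the request in Figure 7, `split_at_rcb(0x38, 96, rcb=64)` returns `[(8, 96, 0x38), (64, 88, 0x40), (24, 24, 0x00)]`: payloads of 8, 64, and 24 bytes with Byte Counts 96, 88, and 24.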
📋 Completion Timeout
A requester that sends a non-posted request and never receives a completion is in trouble — its Tag is stuck, its receive buffer may be held, and dependent operations cannot proceed. The PCIe spec defines a Completion Timeout mechanism to recover from this situation.
| Property | Value / Description |
|---|---|
| Where enabled | Device Control 2 register: Completion Timeout Value field and Completion Timeout Disable bit |
| Default timeout range | 50 µs to 50 ms (default), configurable to ranges such as 16 ms–55 ms, 65 ms–210 ms, 260 ms–900 ms, or 1 s–3.5 s |
| What happens on timeout | Device reports a Completion Timeout error to software through AER (an uncorrectable error, non-fatal by default). The pending Tag is released. The outstanding request is considered failed. |
| Disable option | Software can set the Completion Timeout Disable bit to prevent timeouts. Only appropriate for specific test scenarios, because disabled timeouts risk hanging the device indefinitely on a lost completion. |
| CRS exception | CRS (Configuration Request Retry Status) completions do not start the Completion Timeout for the RC, because the RC is expected to retry. But software must eventually stop retrying if CRS continues beyond 1 second after reset. |
| Gen 6 consideration | At Gen 6 speeds, a small timeout value may trigger spuriously if the completer is processing a large AtomicOp or slow memory operation. Firmware should configure the timeout range appropriately for the workload. |
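The software controls listed above (the programmable Completion Timeout Value and the Completion Timeout Disable bit in Device Control 2) date from PCIe 2.0. For PCIe 1.x devices, software had no standard register for choosing the timeout range or turning the timeout off, so firmware that needed tighter hang detection had to poll the device itself. From PCIe 2.0 onward the timeout can be tuned per device, making system error recovery far more reliable.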
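To make the Device Control 2 fields concrete, here is a small decoder. Treat it as a sketch to be checked against the PCIe Base Specification revision you target: it assumes the Completion Timeout Value field is bits [3:0] and Completion Timeout Disable is bit 4, and the encoding table is written from the commonly documented ranges. `CTO_RANGES_US` and `decode_devctl2` are hypothetical names.

```python
# Completion Timeout Value encodings -> (min, max) in microseconds.
# Assumption: values as commonly documented; verify against your spec revision.
CTO_RANGES_US = {
    0b0000: (50, 50_000),              # default: 50 us to 50 ms
    0b0001: (50, 100),                 # 50 us to 100 us
    0b0010: (1_000, 10_000),           # 1 ms to 10 ms
    0b0101: (16_000, 55_000),          # 16 ms to 55 ms
    0b0110: (65_000, 210_000),         # 65 ms to 210 ms
    0b1001: (260_000, 900_000),        # 260 ms to 900 ms
    0b1010: (1_000_000, 3_500_000),    # 1 s to 3.5 s
    0b1101: (4_000_000, 13_000_000),   # 4 s to 13 s
    0b1110: (17_000_000, 64_000_000),  # 17 s to 64 s
}


def decode_devctl2(value: int):
    """Return (timeout_disabled, (min_us, max_us) or None if reserved)."""
    disabled = bool(value & 0x10)            # bit 4
    timeout_range = CTO_RANGES_US.get(value & 0xF)  # bits 3:0
    return disabled, timeout_range
```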
⚡ Completions in Gen 6
The completion TLP header format — every byte, every field, every bit position — is identical in Gen 6 compared to Gen 1 through Gen 5. The same 3DW, 12-byte header. The same Completer ID in DW1. The same Requester ID and Tag in DW2. The same four Status codes. The same Byte Count and Lower Address fields.
Gen 6 flit packing of completions
Cpl (no data) is very small. A status-only completion is 12 bytes of header — it fits comfortably alongside other TLPs in a single Gen 6 flit. Multiple Cpl TLPs for write confirmations can pack into one flit.
CplD with large payloads spans flits. A CplD returning 4096 bytes of data is 12-byte header + 4096-byte payload = 4108 bytes total, spanning roughly 18 flits. Each flit carries its own FEC parity. A single corrupted flit triggers replay of only that flit — not the entire large CplD.
Infinite completion credits are unchanged. The requirement that completion receive buffers must always be infinite (CPLH/CPLD FC credits = infinite) is unchanged in Gen 6. The flit-based FEC layer is Physical Layer — it is transparent to the flow control system.
Completion Timeout values may need adjustment. At 64 GT/s x16, a fast completer returns completions in under a microsecond, but large AtomicOp operations or high-latency DRAM accesses may still take microseconds. The shortest programmable Completion Timeout range (50 µs to 100 µs) leaves headroom for that, but the range should still be chosen based on the specific workload and memory system latency.
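The flit count quoted above for a 4096-byte CplD can be checked with simple arithmetic. The sketch assumes that each 256-byte Gen 6 flit carries 236 bytes of TLP content (the rest being DLP, CRC, and FEC bytes); that figure is not stated in this article and should be treated as an assumption. It also ignores where in a flit the TLP starts, so the result is a minimum.

```python
FLIT_TLP_BYTES = 236  # assumption: TLP bytes available per 256-byte flit
CPL_HEADER_BYTES = 12


def min_flits_spanned(payload_bytes: int) -> int:
    """Minimum number of flits needed to carry one completion TLP."""
    total = CPL_HEADER_BYTES + payload_bytes
    return -(-total // FLIT_TLP_BYTES)  # ceiling division


def cpl_headers_per_flit() -> int:
    """How many status-only Cpl headers fit in one flit's TLP bytes."""
    return FLIT_TLP_BYTES // CPL_HEADER_BYTES
```

Here `min_flits_spanned(4096)` is 18 (4108 bytes at 236 bytes per flit), and `min_flits_spanned(0)` is 1 for a status-only Cpl.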
The practical rule for Gen 6 completion handling. If you are writing RTL that builds or parses completion TLPs, no changes are needed for Gen 6. The header fields, Byte Count logic, Lower Address calculation, RCB rules, and Tag matching all work identically. The Physical Layer handles flit packing transparently and the Transaction Layer sees the same clean TLP stream it has always seen.
📋 Quick Reference
| Item | Value / Rule |
|---|---|
| Cpl: Fmt, Type | Fmt=000 · Type=0_1010 · no data · 12 bytes |
| CplD: Fmt, Type | Fmt=010 · Type=0_1010 · data follows · 12 bytes header |
| CplLk / CplDLk | Type=0_1011 · locked transaction legacy variants |
| Header size | Always 3DW (12 bytes) · no 4DW variant for completions |
| DW0 | Fmt + Type + TC (must match request) + Attr (must match request) + Length (payload DWs) |
| DW1: Completer ID | BDF of the device sending this completion · bytes 4–5 · for debug only, not used for routing |
Coming next — PCIe-08: Configuration TLPs — CfgRd0, CfgRd1, CfgWr0, CfgWr1 — Type 0 vs Type 1, how Bus/Device/Function addressing works in config space, and how Type 1 is converted to Type 0 at the target bus.