PCIe Series — PCIe-06: TLP Ordering Rules — VLSI Trainers
PCIe Series · PCIe-06
TLP Ordering Rules
Why the Producer/Consumer model demands strict write ordering, the complete ordering table explained row by row, how Relaxed Ordering (RO), No Snoop (NS) and ID-based Ordering (IDO) work, deadlock scenarios and how they are prevented, and what ordering means in Gen 6 flit mode.
📋 Why Ordering Rules Exist
PCIe is a packet-based fabric. Packets are buffered at every hop. Buffers stall and drain independently. Without ordering rules, two packets sent in sequence by one device can arrive at the destination in the wrong order — and software that depends on the order will silently compute wrong results.
The ordering rules solve four distinct problems:
Producer/Consumer correctness: ensures that data written to memory arrives before the flag that signals it is ready
Deadlock prevention: ensures that posted writes and completions can always make forward progress, even when non-posted requests queued ahead of them are stalled
PCI/PCI-X legacy compatibility: preserves the ordering model of shared-bus PCI, so software written for PCI works unchanged on PCIe
Performance optimisation: Relaxed Ordering and ID-based Ordering allow selective reordering where it is safe
Ordering applies within a single Traffic Class / Virtual Channel. TLPs in different TCs have no ordering relationship with each other. A TC 7 packet and a TC 0 packet can overtake each other freely. The ordering table below applies to TLPs of the same TC moving through the same VC.
📋 Three TLP Categories for Ordering
The ordering rules work on TLP categories, not individual TLP types. There are three: Posted (P), Non-Posted (NP), and Completion (CPL).
Figure 1 — Three TLP categories for ordering purposes. Each VC buffer is split into three independent sub-buffers (P, NP, CPL) with separate flow control credits. This separation is what makes the ordering rules tractable — each category can be managed independently.
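As a minimal sketch of the category idea, the mapping below assigns a handful of common TLP types to the three ordering categories. The names `Category`, `TLP_CATEGORY` and `category_of` are illustrative, and the type list is a subset chosen for the example rather than a complete enumeration.

```python
from enum import Enum


class Category(Enum):
    POSTED = "P"
    NON_POSTED = "NP"
    COMPLETION = "CPL"


# Illustrative subset of TLP types mapped to their ordering category.
TLP_CATEGORY = {
    "MWr": Category.POSTED,       # memory write: no completion returned
    "Msg": Category.POSTED,
    "MsgD": Category.POSTED,
    "MRd": Category.NON_POSTED,   # memory read: expects a completion
    "IORd": Category.NON_POSTED,
    "IOWr": Category.NON_POSTED,
    "CfgRd0": Category.NON_POSTED,
    "CfgWr0": Category.NON_POSTED,
    "Cpl": Category.COMPLETION,
    "CplD": Category.COMPLETION,
}


def category_of(tlp_type: str) -> Category:
    """Return the ordering category for a TLP type mnemonic."""
    return TLP_CATEGORY[tlp_type]


print(category_of("MWr").value)   # P
print(category_of("CplD").value)  # CPL
```

Ordering logic then only ever compares categories, which is why three sub-buffers per VC are enough.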
▶ The Producer/Consumer Model
The ordering rules are motivated by the Producer/Consumer programming model — a common pattern in DMA-based systems where one device writes data to memory and then signals another device to process it.
Figure 2 — Producer/Consumer sequence. The Producer writes data first (①), then sets a flag (②). The Consumer polls the flag (③) and only reads the data (④) when the flag is 1. The ordering rule “posted writes must not pass posted writes” guarantees ① arrives before ②. Without it, the Consumer could read the flag before the data is in memory.
The sequence relies on one absolutely critical guarantee: the data write (①) must be visible at the memory target before the flag write (②) is visible. Both are posted (MWr). If the flag somehow arrived first — which could happen if a buffer along the path forwarded the lighter flag packet ahead of the heavier data packet — the Consumer would read Flag=1, immediately fetch the data, and get whatever stale bytes were in memory before the Producer’s write landed. No error flag is raised. Software has no way to detect this. The result is silent data corruption.
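The failure mode can be reproduced with a toy model. The sketch below is not a PCIe simulation: it only applies two writes to a dictionary in a given arrival order and records what a Consumer would read the moment it sees the flag set. All names are hypothetical.

```python
def consumer_view(arrival_order):
    """Apply writes in arrival order; the Consumer reads the data
    the first time it observes flag == 1."""
    memory = {"data": "stale", "flag": 0}
    seen = None
    for address, value in arrival_order:
        memory[address] = value
        if memory["flag"] == 1 and seen is None:
            seen = memory["data"]
    return seen


data_write = ("data", "fresh")  # step 1 in the sequence above
flag_write = ("flag", 1)        # step 2 in the sequence above

print(consumer_view([data_write, flag_write]))  # fresh
print(consumer_view([flag_write, data_write]))  # stale
```

The second call raises no error at any point, which is exactly why the corruption is silent.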
📋 What Goes Wrong Without Ordering — Concrete Example
Figure 3 — Ordering failure without rules. The Producer’s data MWr is stuck in a full Posted buffer at the Switch. The flag MWr flows through the NP→CPL path and arrives at memory first. The Consumer reads Flag=1, then reads data — but gets stale bytes because the actual data write hasn’t landed yet. The ordering rule “Posted must not pass Posted” prevents this.
📋 The Simplified Ordering Table
The table is read as Row Pass Column. Each cell answers: “May the TLP in the Row pass (overtake) the TLP already waiting in the Column?”
No = the row TLP must not pass the column TLP — strict ordering enforced. Yes = the row TLP MUST be allowed to pass — required to prevent deadlock. Yes/No = permitted to pass, but not required — implementation choice.
The hard rules (No entries) are the cells Row A Col A, Row C Col A, Row D Col A, Row C Col D, and Row D Col C. These cannot be relaxed without risking data corruption or deadlock. Everything else is either permitted (Yes/No) or required for deadlock prevention (Yes).
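One way to hold such a table in code is a nested lookup keyed by category. The sketch below is an assumption-laden condensation: it uses the three categories (P, NP, CPL) as keys rather than the row and column letters of the table above, it assumes default attributes (RO=0, IDO=0), and it omits special cases such as bridge-specific rules and completions that share a transaction ID. The cell values reflect the default entries of the PCIe Base Specification ordering table as condensed for this example.

```python
NO, YES, EITHER = "No", "Yes", "Yes/No"

# MAY_PASS[row][col]: may a later `row` TLP pass an earlier `col` TLP?
# Condensed to three categories, default attributes only (assumption).
MAY_PASS = {
    "P":   {"P": NO, "NP": YES,    "CPL": EITHER},
    "NP":  {"P": NO, "NP": EITHER, "CPL": EITHER},
    "CPL": {"P": NO, "NP": YES,    "CPL": EITHER},
}


def may_pass(row: str, col: str) -> str:
    """Look up the Row-pass-Column answer for two categories."""
    return MAY_PASS[row][col]


print(may_pass("P", "P"))     # No
print(may_pass("CPL", "NP"))  # Yes
print(may_pass("NP", "CPL"))  # Yes/No
```

A real implementation would layer the RO and IDO exceptions on top of this default lookup instead of editing the table.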
📋 Rule-by-Rule Explanation
Figure 4 — Ordering rules explained. Left: the four hard rules (No/Yes mandatory). Right: the Yes/No rules that give implementations flexibility. The hard rules exist for correctness; the Yes/No rules exist for performance tuning.
📋 Relaxed Ordering (RO) — The Attribute Bit
Relaxed Ordering is a single bit in the TLP header (Attr[1], bit 13 of DW0). When set to 1, the requester is telling the fabric: “I guarantee this packet has no ordering dependency on anything that came before it. You are free to let it move ahead of earlier posted writes if that helps performance.”
Figure 5: Relaxed Ordering bit position (DW0 bit 13, Attr[1]). The device sets the bit per packet, and only when software has set Enable Relaxed Ordering in the Device Control register. Switches see this bit and MAY allow the packet to bypass earlier posted writes. The bit is forwarded unchanged: a switch must not modify it.
Key implementation detail: the RO bit is advisory, not mandatory. When a switch sees RO=1 on a packet, it is permitted to let it pass earlier posted writes — but the switch is not required to do so. A switch that ignores RO and always enforces strict ordering is compliant with the spec.
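A short sketch of testing and setting the bit, assuming DW0 is held as a 32-bit integer with bit 31 as the most significant bit and using the bit position given above. The example DW0 value is a hypothetical 1-DW memory write in the non-flit header format.

```python
RO_BIT = 13  # Attr[1] in DW0, per the text above


def has_relaxed_ordering(dw0: int) -> bool:
    """True if the Relaxed Ordering attribute is set in header DW0."""
    return bool(dw0 & (1 << RO_BIT))


def set_relaxed_ordering(dw0: int) -> int:
    """Return a copy of DW0 with the Relaxed Ordering attribute set."""
    return dw0 | (1 << RO_BIT)


dw0 = 0x40000001  # hypothetical MWr header DW0: Fmt/Type byte 0x40, Length 1
print(has_relaxed_ordering(dw0))        # False
print(hex(set_relaxed_ordering(dw0)))   # 0x40002001
```

Only the originating device would call something like `set_relaxed_ordering`; a switch reads the bit and forwards it untouched.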
📋 RO Effects on Each TLP Type
| TLP type | RO=0 behaviour | RO=1 behaviour at a Switch |
| --- | --- | --- |
| Posted Write (MWr) | Must not pass earlier posted writes or messages. Strict ordering. | MAY pass earlier posted writes or messages. Order not guaranteed for this packet. |
| Message (Msg/MsgD) | Must not pass earlier posted writes. | MAY pass earlier posted writes. Same as MWr with RO=1. |
| Read Request (MRd) | Must not pass posted writes ahead of it. Because the read stays behind them, it pushes those earlier writes to the destination before it arrives. | Still forwarded in order (RO does not relax the read-forward constraint). The completer copies the RO attribute into the completions it returns. |
| Completion (CplD) | Must not pass earlier posted writes travelling in the same direction. | MAY pass earlier posted writes. GPU read-completions get back faster when they don’t wait for posted writes to drain. |
Where RO is most useful in practice
GPU frame buffer reads — completions returning rendered-frame data can bypass DMA writes in the same switch because the frame reads have no dependency on those writes
High-bandwidth scatter-gather DMA — NVMe or network card posting many independent writes to non-overlapping memory regions can mark them RO so they pipeline past each other in the Switch
Not safe for — any write that sets a flag or semaphore that another device will read. The flag write must arrive strictly after the data write.
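The RO table above reduces to a small decision function. This is a sketch of that table only, with hypothetical names. It answers whether a switch is permitted to let a TLP pass an earlier posted write, never whether it must.

```python
def may_pass_earlier_posted(tlp_type: str, ro: bool) -> bool:
    """True if a switch is PERMITTED (not required) to let this TLP
    pass an earlier posted write, following the RO table above."""
    if tlp_type == "MRd":
        # RO on a read request does not relax the read itself;
        # it only carries over into the returned completions.
        return False
    if tlp_type in ("MWr", "Msg", "MsgD", "CplD"):
        return ro
    raise ValueError(f"TLP type not covered by this sketch: {tlp_type}")


print(may_pass_earlier_posted("MWr", ro=True))   # True
print(may_pass_earlier_posted("MWr", ro=False))  # False
print(may_pass_earlier_posted("MRd", ro=True))   # False
```

A switch that always returns `False` here is still compliant, since RO is advisory.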
📋 No Snoop (NS) — Cache Coherency Hint
The No Snoop bit (Attr[0], DW0 bit 12) is separate from ordering. It tells the Root Complex and memory controller: “Do not perform a CPU cache snoop for this memory access. I (the software driver) guarantee there is no cached copy of this data that needs to be invalidated.”
Figure 6 — No Snoop effect. With NS=0, the Root Complex must snoop CPU caches before completing the memory access. With NS=1 the snoop is skipped. NS is used for PCIe-connected devices accessing DMA buffers allocated with uncacheable or write-combining memory types — the CPU will never cache these regions, so a snoop is a wasted round-trip.
NS=1 is appropriate for DMA transfer buffers, GPU frame buffers, and NVMe data queues when the CPU maps those regions as uncacheable or write-combining (in Linux, for example, a buffer obtained from dma_alloc_wc(), or from dma_alloc_coherent() on platforms where coherent memory is mapped uncached). NS=1 is not safe for any memory region that the CPU might have cached: using it there risks reading stale data from memory while a valid modified copy sits in a CPU L1 cache.
📋 ID-Based Ordering (IDO)
ID-Based Ordering (IDO) was introduced in PCIe 2.1. The insight behind it: if two packets have different Requester IDs, they are almost certainly from unrelated software threads and have no dependency on each other. Strictly enforcing ordering between them sacrifices performance for no safety benefit.
Figure 7 — IDO effect. Without IDO, a blocked TLP from Dev-A stalls all subsequent TLPs in the same VC queue even from unrelated devices. With IDO enabled, the Switch recognises different Requester IDs as independent TLP streams and allows Dev-B and Dev-C to pass the blocked Dev-A packet.
How IDO is enabled
IDO has separate enable bits in the Device Control 2 register: one for Requests, one for Completions
The IDO attribute bit is Attr[2] (DW0 bit 18) — set per-TLP by the device when IDO is enabled
A switch that sees IDO=1 on two TLPs with different Requester IDs may reorder them relative to each other
Completions may use IDO independently of whether the original request used it — the completer decides
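With all three attribute bits introduced, a decoder for them can be sketched as follows. It assumes the non-flit header bit positions quoted in this article (NS at DW0 bit 12, RO at bit 13, IDO at bit 18) and a DW0 held as a 32-bit integer; the function names are illustrative.

```python
NS_BIT, RO_BIT, IDO_BIT = 12, 13, 18  # DW0 bit positions quoted in the text


def decode_attr(dw0: int) -> dict:
    """Extract the three attribute flags from header DW0."""
    return {
        "no_snoop": bool((dw0 >> NS_BIT) & 1),
        "relaxed_ordering": bool((dw0 >> RO_BIT) & 1),
        "ido": bool((dw0 >> IDO_BIT) & 1),
    }


def attr_value(dw0: int) -> int:
    """Reassemble the flags into the 3-bit Attr[2:0] value."""
    flags = decode_attr(dw0)
    return (flags["ido"] << 2) | (flags["relaxed_ordering"] << 1) | flags["no_snoop"]


dw0 = (1 << IDO_BIT) | (1 << NS_BIT)  # hypothetical DW0 with IDO and NS set
print(decode_attr(dw0))
print(attr_value(dw0))  # 5
```

Note that Attr[2] is not adjacent to Attr[1:0] in this layout, which is why the decoder reads three separate bit positions.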
When NOT to use IDO: If Device A writes data to a shared memory buffer and then sends a peer-to-peer write to a flag in Device B, and Device B then writes data to the same shared buffer — those two DMA streams interact. Marking Device B’s writes as IDO would allow them to reorder relative to Device A’s writes, potentially corrupting the shared buffer. IDO is only safe when two devices genuinely have no shared state.
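The IDO condition itself is a two-part check, sketched below under the assumption that the attribute on the later (passing) TLP is what grants permission. The `Tlp` class and the Requester ID values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tlp:
    requester_id: int   # 16-bit Bus/Device/Function identifier
    ido: bool = False   # IDO attribute on this TLP


def ido_permits_pass(later: Tlp, earlier: Tlp) -> bool:
    """True if IDO permits `later` to pass `earlier`:
    the passing TLP has IDO set and the Requester IDs differ."""
    return later.ido and later.requester_id != earlier.requester_id


dev_a = Tlp(requester_id=0x0100)
dev_b = Tlp(requester_id=0x0200, ido=True)
print(ido_permits_pass(later=dev_b, earlier=dev_a))  # True
print(ido_permits_pass(later=Tlp(0x0100, ido=True), earlier=dev_a))  # False
```

The check knows nothing about shared buffers, which is why the safety of setting IDO rests entirely with the device and its driver.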
📋 Deadlock Avoidance
The ordering table has several cells marked Yes (mandatory pass). These are not for performance — they exist purely to prevent deadlock. The most important deadlock scenario:
Figure 8: Deadlock scenario and its solution. A circular wait forms when completions and posted writes are stuck behind non-posted requests that cannot drain until those same completions arrive. It is broken by the mandatory Yes rules: posted writes and completions must be able to pass stalled non-posted requests. Separate P/NP/CPL queues per VC, each with its own flow-control credits, are the implementation that makes this work.
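A toy credit model shows why separate queues matter. The sketch below assumes default attributes, treats each TLP as costing one credit of its own category, and applies two rules: nothing passes an earlier stalled posted write, while posted writes and completions may pass stalled non-posted requests. The function names and credit values are hypothetical.

```python
from collections import deque


def drain_single_fifo(tlps, credits):
    """One shared FIFO: a head-of-line TLP without credits blocks all behind it."""
    credits = dict(credits)
    queue, sent = deque(tlps), []
    while queue and credits.get(queue[0], 0) > 0:
        category = queue.popleft()
        credits[category] -= 1
        sent.append(category)
    return sent


def drain_split_queues(tlps, credits):
    """Per-category credits: TLPs may pass stalled NP or CPL traffic,
    but nothing may pass an earlier posted write that is stalled."""
    credits = dict(credits)
    sent, posted_stalled = [], False
    for category in tlps:
        if posted_stalled or credits.get(category, 0) == 0:
            if category == "P":
                posted_stalled = True
            continue
        credits[category] -= 1
        sent.append(category)
    return sent


arrivals = ["NP", "CPL", "P"]
credits = {"P": 1, "NP": 0, "CPL": 1}  # non-posted path has no credits
print(drain_single_fifo(arrivals, credits))   # []
print(drain_split_queues(arrivals, credits))  # ['CPL', 'P']
```

With a single FIFO the stalled non-posted request holds back the very completion that could unblock the system; with split queues the completion and the posted write get through.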
📋 Ordering and Traffic Classes / Virtual Channels
Ordering rules apply within a single Traffic Class. Two packets with different TC values have no ordering relationship — a TC 7 packet and a TC 0 packet can freely overtake each other.
When multiple TCs are mapped to the same Virtual Channel (which is the common case — most systems use only VC0 for all TC values), it becomes simpler to implement ordering across the whole VC rather than tracking per-TC ordering within it. The spec allows this — applying ordering rules to all traffic within a VC, even when multiple TCs share it, is compliant.
| Scenario | Ordering applies? | Reason |
| --- | --- | --- |
| Two MWrs with same TC, same VC | Yes, strictly | Producer/Consumer correctness |
| Two MWrs with different TCs, same VC | Implementation choice | Spec doesn’t require ordering across TCs; implementation may choose to apply VC-level ordering for simplicity |
| Two MWrs with different TCs, different VCs | No, unordered | Different VCs are independent; no ordering relationship |
| MWr (TC 0) and MRd (TC 7) | No | Different TCs have no ordering relationship regardless of VC mapping |
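The first three rows of the table can be sketched as a function of the two TC values and the TC-to-VC map. The function name, return strings and example mapping are hypothetical. The "implementation choice" result means ordering is not required but may be applied across the shared VC.

```python
def ordering_scope(tc_a: int, tc_b: int, tc_to_vc: dict) -> str:
    """Classify the ordering relationship between two TLPs by TC and VC."""
    if tc_a == tc_b:
        return "required"               # same TC: ordering table applies
    if tc_to_vc[tc_a] == tc_to_vc[tc_b]:
        return "implementation choice"  # different TCs sharing one VC
    return "not required"               # different TCs on different VCs


# Hypothetical mapping: TC0..TC6 share VC0, TC7 has its own VC1.
tc_to_vc = {tc: 0 for tc in range(8)}
tc_to_vc[7] = 1

print(ordering_scope(0, 0, tc_to_vc))  # required
print(ordering_scope(0, 1, tc_to_vc))  # implementation choice
print(ordering_scope(0, 7, tc_to_vc))  # not required
```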
⚡ Ordering in Gen 6 — Flit Mode
The transaction ordering rules are completely unchanged in Gen 6. They are a property of the Transaction Layer, not the Physical Layer. Flit packing is transparent to ordering.
A few specific points worth clarifying about Gen 6 and ordering:
Flit packing does not reorder TLPs. Multiple TLPs inside one flit are packed in the same order they would have been sent individually. The flit header records each TLP’s position. The receiver unpacks them in order.
Flit-level replay does not violate ordering. When a flit is replayed (due to FEC failure or link error), all TLPs in that flit are replayed together, in their original order. No TLP in a subsequent flit can overtake a TLP in the replayed flit.
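RO, NS, IDO attributes are unchanged in meaning. They are still carried in the TLP header, and a Gen 6 switch interprets them the same way earlier generations do. Flit mode does use a reorganised TLP header, so confirm exact bit positions against the Gen 6 header definition rather than assuming the non-flit layout.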
CXL 3.0 ordering. CXL 3.0 runs on the PCIe 6.0 PHY and adds its own coherency ordering model on top of PCIe’s ordering rules — but that is a CXL-layer protocol, not a PCIe ordering change.
The practical takeaway for Gen 6 design. If you are designing a Gen 6 endpoint or switch, your Transaction Layer ordering logic is identical to what you would write for Gen 5, Gen 4, or Gen 3. The ordering table, the RO/NS/IDO attributes, the separate P/NP/CPL buffers: all exactly the same. What changes between Gen 5 and Gen 6 is how TLPs are framed and carried, meaning flit packing, flit-level replay, and the PHY beneath them.
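The claim that flit packing is transparent to ordering can be illustrated with a toy round trip. This sketch is not the Gen 6 flit format: it assumes 236 TLP bytes per flit, uses a made-up 2-byte length prefix so the receiver can find TLP boundaries, and ignores padding, CRC and FEC. It only shows that slicing a byte stream into fixed-size units and rejoining them in sequence preserves TLP order, including for TLPs that straddle a flit boundary.

```python
FLIT_TLP_BYTES = 236  # assumed TLP-carrying bytes per flit (toy value)


def pack(tlps: list[bytes]) -> list[bytes]:
    """Concatenate TLPs in send order and slice the stream into flit payloads."""
    # The 2-byte length prefix is a toy framing device, NOT the real format.
    stream = b"".join(len(t).to_bytes(2, "big") + t for t in tlps)
    return [stream[i:i + FLIT_TLP_BYTES]
            for i in range(0, len(stream), FLIT_TLP_BYTES)]


def unpack(flits: list[bytes]) -> list[bytes]:
    """Rejoin flit payloads in sequence and split the stream back into TLPs."""
    stream = b"".join(flits)
    tlps, i = [], 0
    while i < len(stream):
        n = int.from_bytes(stream[i:i + 2], "big")
        tlps.append(stream[i + 2:i + 2 + n])
        i += 2 + n
    return tlps


tlps = [bytes([i]) * n for i, n in enumerate((16, 300, 64, 500))]
flits = pack(tlps)
print(len(flits))             # 4
print(unpack(flits) == tlps)  # True
```

Because the receiver consumes flits strictly in sequence, a replayed flit simply delays everything behind it rather than letting later TLPs move ahead.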
📋 Quick Reference
| Concept | Key Point |
| --- | --- |
| Ordering scope | Applies within a single Traffic Class / Virtual Channel. Different TCs have no ordering relationship. |
Coming next: PCIe-07 covers the LTSSM — Link Training and Status State Machine — how a PCIe link goes from power-on to L0 (active), every state in the LTSSM, equalization phases in Gen 3+, the role of TS1/TS2 Ordered Sets, and how recovery and error handling work across all generations including Gen 6.