PCIe Series — PCIe-06: TLP Ordering Rules — VLSI Trainers

TLP Ordering Rules

Why the Producer/Consumer model demands strict write ordering, the complete ordering table explained row by row, how Relaxed Ordering (RO), No Snoop (NS) and ID-based Ordering (IDO) work, deadlock scenarios and how they are prevented, and what ordering means in Gen 6 flit mode.

📋 Why Ordering Rules Exist

PCIe is a packet-based fabric. Packets are buffered at every hop. Buffers stall and drain independently. Without ordering rules, two packets sent in sequence by one device can arrive at the destination in the wrong order — and software that depends on the order will silently compute wrong results.

The ordering rules solve four distinct problems:

  1. Producer/Consumer correctness: ensures that data written to memory arrives before the flag that signals it is ready
  2. Deadlock prevention: ensures that posted writes and completions can always make forward progress, even when non-posted requests ahead of them are stalled
  3. PCI/PCI-X legacy compatibility: preserves the ordering model of shared-bus PCI, so software written for PCI works on PCIe
  4. Performance optimisation: Relaxed Ordering and ID-based Ordering allow selective reordering where safe

Ordering applies within a single Traffic Class / Virtual Channel. TLPs in different TCs have no ordering relationship with each other. A TC 7 packet and a TC 0 packet can overtake each other freely. The ordering table below applies to TLPs of the same TC moving through the same VC.

📋 Three TLP Categories for Ordering

The ordering rules work on TLP categories, not individual TLP types. There are three:

Three TLP Categories Used in Ordering Rules

  Posted (P): MWr (Memory Write), Msg / MsgD (Messages). Fire-and-forget; no completion returns. Buffered in the Posted VC sub-buffer.
  Non-Posted (NP): MRd, MRdLk, IORd, IOWr, CfgRd0/1, CfgWr0/1, AtomicOp. A completion always returns. Buffered in the NP VC sub-buffer.
  Completion (CPL): Cpl, CplD (without or with data); CplLk, CplDLk (locked variants). Response to a Non-Posted request; Endpoints and Root Complexes advertise infinite completion credits. Buffered in the CPL VC sub-buffer.
Figure 1 — Three TLP categories for ordering purposes. Each VC buffer is split into three independent sub-buffers (P, NP, CPL) with separate flow control credits. This separation is what makes the ordering rules tractable — each category can be managed independently.

The Producer/Consumer Model

The ordering rules are motivated by the Producer/Consumer programming model — a common pattern in DMA-based systems where one device writes data to memory and then signals another device to process it.

Producer/Consumer Pattern: 4-Step Sequence

Participants: Producer (e.g. NIC DMA engine), Memory (RAM / DRAM), Consumer (e.g. CPU, processes data).

  ① Producer → Memory: MWr, data payload. Posted; data written to memory.
  ② Producer → Memory: MWr, Flag = 1. MUST arrive AFTER step ①.
  ③ Consumer → Memory: MRd, read Flag. Non-posted; a completion returns.
  ④ Consumer → Memory: MRd, read Data (Flag was 1).

① must arrive before ②: this is the ordering rule. If ② arrived before ①, the Consumer would read Flag=1 and then read memory before the data had arrived, getting stale or garbage data: silent corruption.
Figure 2 — Producer/Consumer sequence. The Producer writes data first (①), then sets a flag (②). The Consumer polls the flag (③) and only reads the data (④) when the flag is 1. The ordering rule “posted writes must not pass posted writes” guarantees ① arrives before ②. Without it, the Consumer could read the flag before the data is in memory.

The sequence relies on one absolutely critical guarantee: the data write (①) must be visible at the memory target before the flag write (②) is visible. Both are posted (MWr). If the flag somehow arrived first — which could happen if a buffer along the path forwarded the lighter flag packet ahead of the heavier data packet — the Consumer would read Flag=1, immediately fetch the data, and get whatever stale bytes were in memory before the Producer’s write landed. No error flag is raised. Software has no way to detect this. The result is silent data corruption.
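The failure mode can be reproduced with a toy model. The sketch below is not PCIe code: `run`, the two-entry `memory` dictionary and the `"stale"`/`"fresh"` values are all hypothetical names chosen for illustration. It applies the two posted writes to memory in a given delivery order and lets a Consumer read the data as soon as it observes the flag.

```python
def run(delivery_order):
    """Apply posted writes in the given order; the consumer polls after each one."""
    memory = {"data": "stale", "flag": 0}
    seen = None
    for addr, value in delivery_order:
        memory[addr] = value
        # The consumer polls the flag and reads the data as soon as the flag is set.
        if seen is None and memory["flag"] == 1:
            seen = memory["data"]
    return seen


data_write = ("data", "fresh")  # step 1: MWr carrying the payload
flag_write = ("flag", 1)        # step 2: MWr setting the flag

in_order = run([data_write, flag_write])
reordered = run([flag_write, data_write])
print(in_order)   # fresh
print(reordered)  # stale
```

In the reordered run nothing raises an error; the Consumer simply returns the old contents, which is the silent corruption described above.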

📋 What Goes Wrong Without Ordering — Concrete Example

Ordering Failure Scenario: Switch Buffers Cause Out-of-Order Delivery

  Producer (NIC DMA): issues the data MWr, then the flag MWr.
  Switch (upstream port): Posted buffer FULL, data MWr stuck here. NP buffer OK, the flag read can flow. CPL buffer OK, the Flag=1 completion flows.
  Memory: data buffer STALE (write not arrived yet); Flag = 1 already written.
  Consumer (CPU / RC): reads Flag → 1, reads the data buffer, gets STALE data. No error raised: silent corruption.

Result without ordering rules: the flag write overtook the data write, which is still stuck in the full Posted buffer, while the Consumer's flag read and its Flag=1 completion flowed through the NP and CPL paths unblocked. Two rules close this hole: "Posted writes must not pass Posted writes", and its companion "a read Completion must not pass an earlier Posted write travelling in the same direction" (unless RO is set).
Figure 3 — Ordering failure without rules. The Producer’s data MWr is stuck in a full Posted buffer at the Switch. The flag MWr is forwarded ahead of it and arrives at memory first. The Consumer reads Flag=1, then reads data, but gets stale bytes because the actual data write hasn’t landed yet. The ordering rule “Posted must not pass Posted” prevents this.

📋 The Simplified Ordering Table

The table is read as Row Pass Column. Each cell answers: “May the TLP in the Row pass (overtake) the TLP already waiting in the Column?”

No = the row TLP must not pass the column TLP — strict ordering enforced.
Yes = the row TLP MUST be allowed to pass — required to prevent deadlock.
Yes/No = permitted to pass, but not required — implementation choice.

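| Row (newer TLP) may pass ↓ \ Column (older TLP waiting) → | A: Posted Write (MWr/Msg) | B: Posted Write with RO=1 | C: Non-Posted Read (MRd) | D: Non-Posted Write (IOWr/CfgWr) | E: Completion (Cpl/CplD) |
|---|---|---|---|---|---|
| A: Posted Write (MWr/Msg) | No | No | Yes | Yes | Yes/No |
| B: Posted Write with RO=1 | Yes/No | Yes/No | Yes | Yes | Yes/No |
| C: Non-Posted Read (MRd) | No | No | Yes/No | Yes/No | Yes/No |
| D: Non-Posted Write (IOWr/CfgWr) | No | No | Yes/No | Yes/No | Yes/No |
| E: Completion (Cpl/CplD) | No | No | Yes | Yes | Yes/No |

Only the RO bit of the passing (row) TLP matters, so columns A and B carry the same answers. The hard rules (No entries) are Row A Col A, Row C Col A, Row D Col A and Row E Col A: nothing may overtake an earlier Posted write unless the passing TLP is explicitly relaxed. A Completion with RO=1 is treated like row B against Posted writes, so its No becomes Yes/No. The mandatory Yes entries are Posted writes and Completions passing Non-Posted requests (Rows A, B and E against Cols C and D); they exist purely to prevent deadlock. Everything else is permitted but not required (Yes/No). One refinement the simplified table hides: Completions that belong to the same request must stay in order relative to each other.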
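One way to make the table concrete is to encode it as a lookup function. The sketch below follows the summary ordering rules of the PCIe Base Specification as I understand them: nothing passes an earlier Posted write unless the passing TLP has RO=1, Posted writes and Completions must be able to pass Non-Posted requests, and every other combination is optional. It collapses the two Non-Posted rows into one category. The names `Cat` and `may_pass` are hypothetical, and honouring RO only for Posted writes and Completions is a simplifying assumption.

```python
from enum import Enum


class Cat(Enum):
    POSTED = "P"
    NON_POSTED = "NP"
    COMPLETION = "CPL"


NO, MAY, MUST = "No", "Yes/No", "Yes"


def may_pass(newer: Cat, older: Cat, relaxed: bool = False) -> str:
    """Return the table cell for `newer` trying to overtake `older`.

    `relaxed` is the RO bit of the newer TLP. Assumption: RO is honoured
    only for Posted writes and Completions in this simplified model.
    """
    if older is Cat.POSTED:
        if relaxed and newer is not Cat.NON_POSTED:
            return MAY   # RO=1: permitted, never required
        return NO        # nothing overtakes an earlier Posted write
    if older is Cat.NON_POSTED:
        if newer is Cat.NON_POSTED:
            return MAY
        return MUST      # deadlock avoidance
    return MAY           # older TLP is a Completion


print(may_pass(Cat.POSTED, Cat.POSTED))                # No
print(may_pass(Cat.POSTED, Cat.POSTED, relaxed=True))  # Yes/No
print(may_pass(Cat.COMPLETION, Cat.NON_POSTED))        # Yes
```

A scheduler built on such a function would treat `"Yes"` as an obligation (the bypass path must exist in hardware) and `"Yes/No"` as a freedom it may or may not exploit.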

📋 Rule-by-Rule Explanation

Ordering Rules: Explanation of Each Cell

Rule A-A (NO): Posted MUST NOT pass Posted. A memory write cannot overtake an earlier memory write. This is the Producer/Consumer rule: data must arrive before its flag. Applies to MWr and Msg TLPs. Absolute for writes sent with RO=0.

Rule E-A (NO): Completion must not pass Posted. A read completion must not overtake an earlier posted write travelling in the same direction. If it did, a Consumer could receive Flag=1 in a completion while the data write it depends on is still in flight. A completion with RO=1 is exempt (Yes/No).

Rule C-A / D-A (NO): NP must not pass Posted. A read request cannot bypass an earlier write. If it did, the read could return data from before the write landed. This preserves the read-after-write ordering guarantee software relies on.

Rule A-C / A-D and E-C / E-D (YES): Posted writes and Completions MUST be able to pass Non-Posted requests. If a completion or a posted write is stuck behind a Non-Posted request that cannot make progress, a circular wait can form and deadlock results. These bypass paths must always exist.

Rule B-A (YES/NO): Posted (RO=1) MAY pass Posted. When RO is set, software guarantees the write has no dependency on earlier writes. A switch MAY allow it to pass, but is not required to. RO is a hint, not a command. Used by GPU scatter-gather and high-BW DMA engines.

Yes/No rules in context:

  A may pass E (Yes/No): a posted write MAY pass completions. Not required, but allowed if the implementation wants to prioritise writes.
  C may pass E (Yes/No): a read request may pass a completion. Generally safe since they target different resources.
  D may pass E (Yes/No): an NP write may pass a completion. Similar reasoning.
  C/D may pass C/D (Yes/No): NP may pass NP. Mostly implementation-defined. IDO makes this more useful.
  E may pass E (Yes/No): completions for different requests may pass each other. IDO makes this explicit.
Figure 4 — Ordering rules explained. The hard rules (mandatory No or Yes) exist for correctness and deadlock avoidance; the Yes/No rules give implementations flexibility for performance tuning.

📋 Relaxed Ordering (RO) — The Attribute Bit

Relaxed Ordering is a single bit in the TLP header (Attr[1], bit 13 of DW0). When set to 1, the requester is telling the fabric: “I guarantee this packet has no ordering dependency on anything that came before it. You are free to let it move ahead of earlier posted writes if that helps performance.”

Relaxed Ordering Bit Position in TLP Header DW0

Fmt [31:29] | Type [28:24] | R [23] | TC [22:20] | R [19] | Attr[2], IDO [18] | LN [17] | TH [16] | TD [15] | EP [14] | Attr[1], RO (Relaxed Ordering) [13] | Attr[0], NS (No Snoop) [12] | AT [11:10] | Length [9:0], payload in DW

RO is set when the packet has no ordering dependency.
Figure 5 — Relaxed Ordering bit position (DW0 bit 13, Attr[1]). Enabled per-packet by the device driver. Switches see this bit and MAY allow the packet to bypass earlier posted writes. The bit is forwarded unchanged — a switch must not modify it.

Key implementation detail: the RO bit is advisory, not mandatory. When a switch sees RO=1 on a packet, it is permitted to let it pass earlier posted writes — but the switch is not required to do so. A switch that ignores RO and always enforces strict ordering is compliant with the spec.
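The bit positions are easy to check with a small decoder. This is a minimal sketch that assumes the classic (non-flit) header layout, with DW0 held as a 32-bit integer whose bit 31 is the first bit of Fmt; `decode_dw0` is a hypothetical helper name, not part of any library.

```python
def decode_dw0(dw0: int) -> dict:
    """Extract the ordering-related fields from TLP header DW0 (non-flit layout)."""
    return {
        "fmt": (dw0 >> 29) & 0x7,     # bits 31:29
        "type": (dw0 >> 24) & 0x1F,   # bits 28:24
        "tc": (dw0 >> 20) & 0x7,      # bits 22:20, Traffic Class
        "ido": (dw0 >> 18) & 0x1,     # bit 18, Attr[2], ID-based Ordering
        "ro": (dw0 >> 13) & 0x1,      # bit 13, Attr[1], Relaxed Ordering
        "ns": (dw0 >> 12) & 0x1,      # bit 12, Attr[0], No Snoop
        "length_dw": dw0 & 0x3FF,     # bits 9:0, payload length in DW
    }


# A 3DW-header memory write (Fmt=0b010, Type=0b00000), 4 DW of payload, RO set.
dw0 = (0b010 << 29) | (1 << 13) | 4
fields = decode_dw0(dw0)
print(hex(dw0))                                  # 0x40002004
print(fields["ro"], fields["ns"], fields["ido"])  # 1 0 0
```

Because a switch forwards these attribute bits unchanged, the same decode applies at every hop along the path.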

📋 RO Effects on Each TLP Type

| TLP type | RO=0 behaviour | RO=1 behaviour at a Switch |
|---|---|---|
| Posted Write (MWr) | Must not pass earlier posted writes or messages. Strict ordering. | MAY pass earlier posted writes or messages. Order not guaranteed for this packet. |
| Message (Msg/MsgD) | Must not pass earlier posted writes. | MAY pass earlier posted writes. Same as MWr with RO=1. |
| Read Request (MRd) | Must not pass posted writes ahead of it. The read is forwarded in order, which pushes earlier writes ahead of it. | Read request is still forwarded in order (RO does not relax the read-forward constraint). The completer copies RO into its completions. |
| Completion (CplD) | Must not pass earlier posted writes travelling in the same direction. | MAY pass earlier posted writes. GPU read completions get back faster when they don’t wait for posted writes to drain. |

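Where RO is most useful in practice

RO pays off for GPU read completions, high-bandwidth scatter-gather DMA, and independent writes to non-overlapping memory regions. It is not safe for a write that sets a flag or semaphore that software checks before reading the data: the flag MWr must keep its ordering relative to the data MWrs.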

📋 No Snoop (NS) — Cache Coherency Hint

The No Snoop bit (Attr[0], DW0 bit 12) is separate from ordering. It tells the Root Complex and memory controller: “Do not perform a CPU cache snoop for this memory access. I (the software driver) guarantee there is no cached copy of this data that needs to be invalidated.”

No Snoop: What Happens With and Without It

NS = 0: Normal (with snoop)
  ① MWr or MRd arrives at Root Complex
  ② RC sends snoop request to all CPU cores: “Does anyone have a modified copy of this cache line?”
  ③ CPU responds (could take 50–200 ns depending on load)
  ④ If a dirty cache line is found: flush to memory first, then proceed
  Latency: adds a snoop round-trip to every access

NS = 1: No Snoop
  ① MWr or MRd arrives at Root Complex
  ② RC skips the snoop entirely and goes directly to memory
  ③ Driver must guarantee: the target memory region is marked uncacheable OR CPU caches have been explicitly flushed before the PCIe device access
  Latency: saves 50–200 ns per access, significant for isochronous traffic
Figure 6 — No Snoop effect. With NS=0, the Root Complex must snoop CPU caches before completing the memory access. With NS=1 the snoop is skipped. NS is used for PCIe-connected devices accessing DMA buffers allocated with uncacheable or write-combining memory types — the CPU will never cache these regions, so a snoop is a wasted round-trip.

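NS=1 is appropriate for DMA transfer buffers, GPU frame buffers and NVMe data queues, provided the CPU cannot hold cached copies of them: memory that the driver has mapped uncached or write-combining, or whose cache lines it has explicitly flushed before the device access. In Linux, note that dma_alloc_coherent() only yields uncached memory on platforms that are not hardware cache-coherent, and ioremap_wc() maps device MMIO rather than system RAM, so neither call by itself makes NS=1 safe. NS=1 is not safe for any memory region that the CPU might have cached: using it there risks reading stale data from memory while a valid modified copy sits in a CPU L1 cache.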
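The stale-read risk can be shown with a toy coherence model. This is a sketch only: `device_read`, the `memory` and `cpu_cache` dictionaries and the address `0x1000` are hypothetical, and a real Root Complex snoops at cache-line granularity through the platform's coherence protocol.

```python
def device_read(addr, memory, cpu_cache, no_snoop):
    """Model a device MRd arriving at the Root Complex."""
    if not no_snoop and addr in cpu_cache:
        # Snoop hit on a modified line: write it back before serving the read.
        memory[addr] = cpu_cache.pop(addr)
    return memory[addr]


# The CPU holds a modified copy that has not been written back yet.
snooped = device_read(0x1000, {0x1000: "old"}, {0x1000: "new"}, no_snoop=False)
unsnooped = device_read(0x1000, {0x1000: "old"}, {0x1000: "new"}, no_snoop=True)
print(snooped)    # new
print(unsnooped)  # old
```

With `no_snoop=True` the read is faster in real hardware, but it returns the old value whenever a dirty copy exists, which is why the driver has to guarantee that no such copy can exist.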

📋 ID-Based Ordering (IDO)

ID-Based Ordering (IDO) was introduced in PCIe 2.1. The insight behind it: if two packets have different Requester IDs, they are almost certainly from unrelated software threads and have no dependency on each other. Strictly enforcing ordering between them sacrifices performance for no safety benefit.

ID-Based Ordering: Different Requester IDs Have No Ordering Relationship

Without IDO: all packets ordered together
  Switch egress buffer: TLP from Dev-A (BLOCKED) · TLP from Dev-B (waits) · TLP from Dev-C (waits)
  Dev-B and Dev-C are blocked even though they have no dependency on Dev-A.

With IDO: different IDs may reorder freely
  Switch egress buffer: TLP from Dev-A (BLOCKED) · TLP from Dev-B → PASSES · TLP from Dev-C → PASSES
  Different Requester IDs are independently ordered; the Dev-A blockage doesn’t spread.
Figure 7 — IDO effect. Without IDO, a blocked TLP from Dev-A stalls all subsequent TLPs in the same VC queue even from unrelated devices. With IDO enabled, the Switch recognises different Requester IDs as independent TLP streams and allows Dev-B and Dev-C to pass the blocked Dev-A packet.
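The egress behaviour in Figure 7 can be sketched as a queue scan. This is a simplified model, not switch RTL: `Tlp`, its `blocked` field and `forwardable` are hypothetical names, all TLPs are assumed to be in the same ordering category, and a TLP may overtake a waiting one only if its own IDO bit is set and the two Requester IDs differ.

```python
from collections import namedtuple

# requester_id: who issued the TLP; ido: Attr[2] of this TLP;
# blocked: True if this TLP cannot be sent right now (hypothetical model input).
Tlp = namedtuple("Tlp", "requester_id ido blocked")


def forwardable(queue):
    """Return the TLPs that may leave now, scanning the queue oldest first."""
    waiting = []  # earlier TLPs that are still stuck in the queue
    sent = []
    for tlp in queue:
        can_pass_all = all(
            tlp.ido and tlp.requester_id != w.requester_id for w in waiting
        )
        if not tlp.blocked and can_pass_all:
            sent.append(tlp)
        else:
            waiting.append(tlp)  # it now holds back later TLPs too
    return sent


without_ido = [Tlp("A", False, True), Tlp("B", False, False), Tlp("C", False, False)]
with_ido = [Tlp("A", True, True), Tlp("B", True, False), Tlp("C", True, False)]
print([t.requester_id for t in forwardable(without_ido)])  # []
print([t.requester_id for t in forwardable(with_ido)])     # ['B', 'C']
```

Note that a later TLP from Dev-A itself would still wait behind the blocked one even with IDO set, because the Requester IDs match.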

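How IDO is enabled

IDO is enabled in the Device Control 2 register, which has separate enable bits for requests (IDO Request Enable) and completions (IDO Completion Enable). Once enabled, the device may set Attr[2] (DW0 bit 18) on the TLPs it originates.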
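As a sketch of the register update involved: the Device Control 2 register sits at offset 0x28 of the PCI Express Capability, with IDO Request Enable at bit 8 and IDO Completion Enable at bit 9. The constant names below mirror those in Linux's `pci_regs.h`; `enable_ido` is a hypothetical helper that only computes the new 16-bit register value and leaves the actual config-space read and write to the caller.

```python
PCI_EXP_DEVCTL2 = 0x28                # offset within the PCI Express Capability
PCI_EXP_DEVCTL2_IDO_REQ_EN = 0x0100   # bit 8: IDO Request Enable
PCI_EXP_DEVCTL2_IDO_CMP_EN = 0x0200   # bit 9: IDO Completion Enable


def enable_ido(devctl2: int, requests: bool = True, completions: bool = True) -> int:
    """Return the Device Control 2 value with the chosen IDO enables set."""
    if requests:
        devctl2 |= PCI_EXP_DEVCTL2_IDO_REQ_EN
    if completions:
        devctl2 |= PCI_EXP_DEVCTL2_IDO_CMP_EN
    return devctl2 & 0xFFFF  # the register is 16 bits wide


print(hex(enable_ido(0x0000)))                     # 0x300
print(hex(enable_ido(0x000F, completions=False)))  # 0x10f
```

On a live Linux system the same read-modify-write is commonly done with `setpci` or from a driver; either way the other bits of the register must be preserved, which is why the helper ORs into the existing value.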

When NOT to use IDO: If Device A writes data to a shared memory buffer and then sends a peer-to-peer write to a flag in Device B, and Device B then writes data to the same shared buffer — those two DMA streams interact. Marking Device B’s writes as IDO would allow them to reorder relative to Device A’s writes, potentially corrupting the shared buffer. IDO is only safe when two devices genuinely have no shared state.

📋 Deadlock Avoidance

The ordering table has several cells marked Yes (mandatory pass). These are not for performance — they exist purely to prevent deadlock. The most important deadlock scenario:

Deadlock Scenario: Why Completions MUST Pass Non-Posted Requests

Deadlock state (no ordering rules):
  ① Endpoint A has sent an MRd and is waiting for the CplD. Until it arrives, A cannot service new incoming Non-Posted requests, so it returns no NP credits.
  ② An MRd addressed to A stalls at the head of the Switch egress port, waiting for NP credits.
  ③ The CplD that A is waiting for is queued behind that stalled MRd.
  ④ A will not free NP credits until the CplD arrives, and the CplD cannot arrive until the MRd moves.
  → Circular wait: nothing can move. DEADLOCK.

Solution (Rules E-C / E-D and A-C / A-D): Completions and Posted writes MUST be able to pass stalled Non-Posted requests at the same egress port.

Implementation: Switches maintain separate egress queues and flow-control credits for P, NP, and CPL traffic. A stalled NP queue cannot block the Posted or Completion queues from advancing.
  → CplD always flows past stalled requests
  → Deadlock never forms

This is also why Endpoints and Root Complexes advertise infinite Completion credits: a requester must always be able to accept the completions it asked for.
Figure 8 — Deadlock scenario and its solution. The circular wait (CplD queued behind a stalled NP request → NP request waiting for credits from Endpoint A → Endpoint A waiting for the CplD) is broken by the mandatory rule that completions and posted writes must be able to pass Non-Posted requests. Separate P/NP/CPL queues per VC is the implementation that makes this work.
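The value of separate per-category queues can be seen in a few lines. The sketch below is a toy credit model with hypothetical function names: each queue entry is just a category label, and `credits` holds how many TLPs of each category the link partner can currently accept. A single shared FIFO stops at the first TLP whose category has no credits, whereas separate queues let every category with credits keep moving.

```python
from collections import deque


def drain_shared(fifo, credits):
    """One FIFO for all categories: stop at the first TLP that has no credits."""
    q = deque(fifo)
    credits = dict(credits)
    sent = []
    while q and credits[q[0]] > 0:
        cat = q.popleft()
        credits[cat] -= 1
        sent.append(cat)
    return sent


def drain_separate(fifo, credits):
    """Separate P/NP/CPL queues: each category advances on its own credits."""
    credits = dict(credits)
    sent = []
    for cat in fifo:  # order within a category is preserved
        if credits[cat] > 0:
            credits[cat] -= 1
            sent.append(cat)
    return sent


arrivals = ["NP", "CPL", "P"]             # a stalled NP request sits at the head
credits = {"P": 1, "NP": 0, "CPL": 1}
print(drain_shared(arrivals, credits))    # []
print(drain_separate(arrivals, credits))  # ['CPL', 'P']
```

Separate queues only provide the bypass paths; a real design must still enforce the No cells of the ordering table across those queues, which this toy model does not attempt.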

📋 Ordering and Traffic Classes / Virtual Channels

Ordering rules apply within a single Traffic Class. Two packets with different TC values have no ordering relationship — a TC 7 packet and a TC 0 packet can freely overtake each other.

When multiple TCs are mapped to the same Virtual Channel (which is the common case — most systems use only VC0 for all TC values), it becomes simpler to implement ordering across the whole VC rather than tracking per-TC ordering within it. The spec allows this — applying ordering rules to all traffic within a VC, even when multiple TCs share it, is compliant.

| Scenario | Ordering applies? | Reason |
|---|---|---|
| Two MWrs with same TC, same VC | Yes, strictly | Producer/Consumer correctness |
| Two MWrs with different TCs, same VC | Implementation choice | Spec doesn’t require ordering across TCs; implementation may choose to apply VC-level ordering for simplicity |
| Two MWrs with different TCs, different VCs | No, unordered | Different VCs are independent; no ordering relationship |
| MWr (TC 0) and MRd (TC 7) | No | Different TCs have no ordering relationship regardless of VC mapping |

Ordering in Gen 6 — Flit Mode

The transaction ordering rules are completely unchanged in Gen 6. They are a property of the Transaction Layer, not the Physical Layer. Flit packing is transparent to ordering.

Two specific points worth clarifying about Gen 6 and ordering:

  1. Flit packing preserves TLP order.
  2. Flit replay preserves TLP order.

The practical takeaway for Gen 6 design. If you are designing a Gen 6 endpoint or switch, your Transaction Layer ordering logic is identical to what you would write for Gen 5, Gen 4, or Gen 3. The ordering table, the RO/NS/IDO bits, the separate P/NP/CPL buffers: all exactly the same. The only thing that changes between Gen 5 and Gen 6 is what happens below the Data Link Layer.

📋 Quick Reference

| Concept | Key Point |
|---|---|
| Ordering scope | Applies within a single Traffic Class / Virtual Channel. Different TCs have no ordering relationship. |
| Three categories | Posted (MWr, Msg) · Non-Posted (MRd, IOWr, CfgWr…) · Completion (Cpl, CplD) |
| Producer/Consumer rule | Posted must not pass Posted. Data MWr must arrive before Flag MWr, enforced by the hard No in cell A-A. A read Completion must not pass an earlier Posted write either (cell E-A) unless RO is set. |
| Deadlock rule | Posted writes and Completions MUST be able to pass Non-Posted requests (mandatory Yes in cells A-C, A-D, E-C, E-D). Prevents a circular wait behind a stalled Non-Posted request. |
| NP must not pass Posted | A read request cannot bypass an earlier write, or read-after-write ordering would break. Hard No in cells C-A, D-A. |
| Yes/No rules | Most other combinations are implementation choice. Implementations may pass or block; both are compliant. |
| Relaxed Ordering (RO) | Attr[1] in TLP DW0 bit 13. Software guarantee: “no dependency on earlier writes, you may bypass them.” Switch MAY (not must) honour it. |
| RO use cases | GPU read completions · high-BW scatter-gather DMA · independent writes to non-overlapping memory regions |
| RO not safe for | Any write that sets a flag/semaphore that software reads after checking. The flag MWr must preserve ordering with data MWrs. |
| No Snoop (NS) | Attr[0] bit 12. Tells RC: skip CPU cache snoop. Safe only for uncacheable DMA buffer memory. Saves 50–200 ns per access. |
| IDO | Attr[2] bit 18. Packets with different Requester IDs may reorder past each other. Enabled in Device Control 2 register. |
| IDO not safe for | Any scenario where two devices share state via a common memory buffer: IDO could allow their writes to arrive out of order. |
| Gen 6 ordering | Completely unchanged. Flit packing preserves TLP order. Flit replay preserves TLP order. RO/NS/IDO bits unchanged. |

Coming next: PCIe-07 covers the LTSSM — Link Training and Status State Machine — how a PCIe link goes from power-on to L0 (active), every state in the LTSSM, equalization phases in Gen 3+, the role of TS1/TS2 Ordered Sets, and how recovery and error handling work across all generations including Gen 6.