PCIe Transaction Ordering Explained in Detail

📋 Why Ordering Rules Exist

PCIe is a packet-switched fabric. Every switch port has independent buffers that can stall or drain at different rates. Without ordering rules a small, lightweight packet can slide past a large stalled packet and arrive at the destination first — even though it was sent second. Software that depends on arrival order gets wrong results with no error flag raised anywhere.

The ordering rules solve three problems at once:

Data correctness — written data must arrive before the flag that says it is ready
Deadlock prevention — completions must always be able to make forward progress even when write queues are backed up
PCI compatibility — the rules exactly match the PCI/PCI-X ordering model so legacy software works unchanged on PCIe

Ordering applies within one Traffic Class. Two TLPs with different TC values have no ordering relationship — they may freely overtake each other. Everything in this post applies to TLPs sharing the same TC moving through the same Virtual Channel.

📋 Three TLP Categories

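The ordering table groups TLPs into three buckets: Posted (MWr, Msg, MsgD), Non-Posted (MRd, IORd, IOWr, CfgRd, CfgWr, AtomicOp) and Completion (Cpl, CplD). Every TLP in the system belongs to exactly one of them: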

Figure 1 — Three TLP categories with separate VC sub-buffers and independent flow control credit pools. Separating them is what makes the ordering rules implementable — a full Posted buffer cannot block the Completion buffer by design.
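As a rough illustration of the three buckets, the sketch below maps TLP mnemonics to a category. The names `Category` and `categorize` are made up for this example, and the mapping covers only the mnemonics used in this post.

```python
from enum import Enum


class Category(Enum):
    POSTED = "P"
    NON_POSTED = "NP"
    COMPLETION = "CPL"


# Mnemonics as used in this post; illustrative, not an exhaustive list.
_CATEGORY_BY_TYPE = {
    "MWr": Category.POSTED,
    "Msg": Category.POSTED,
    "MsgD": Category.POSTED,
    "MRd": Category.NON_POSTED,
    "IORd": Category.NON_POSTED,
    "IOWr": Category.NON_POSTED,
    "CfgRd": Category.NON_POSTED,
    "CfgWr": Category.NON_POSTED,
    "AtomicOp": Category.NON_POSTED,
    "Cpl": Category.COMPLETION,
    "CplD": Category.COMPLETION,
}


def categorize(tlp_type: str) -> Category:
    """Return the ordering/flow-control category for a TLP mnemonic."""
    return _CATEGORY_BY_TYPE[tlp_type]
```

A lookup table is enough here because the category depends only on the TLP type, never on its address or payload.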

▶ The Producer/Consumer Model

Most ordering rules exist to protect one specific programming pattern: a Producer writes data to memory and then sets a flag; a Consumer polls the flag and reads the data once the flag is 1. This pattern is everywhere: NIC DMA, GPU command queues, NVMe submission rings.

Figure 2 — Producer/Consumer pattern. Steps ① and ② are both posted MWr TLPs. If ② arrives before ①, the Consumer reads Flag=1 and fetches the data buffer — but finds whatever stale bytes were there before ① landed. No error is raised. The rule “Posted must not pass Posted” is the only thing preventing this.
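The failure mode can be reproduced with a toy model. This is only a sketch: memory is a dictionary, each write is an `(address, value)` pair, and `consumer_view` is a hypothetical helper that returns whatever the Consumer would read the moment it sees the flag set.

```python
def consumer_view(arrival_order):
    """Apply writes in arrival order; return the data seen when flag first reads 1."""
    memory = {"data": "stale", "flag": 0}
    for address, value in arrival_order:
        memory[address] = value
        if memory["flag"] == 1:
            # The Consumer polls the flag, sees 1, and immediately reads the data.
            return memory["data"]
    return None


data_write = ("data", "fresh")  # step 1: posted MWr carrying the payload
flag_write = ("flag", 1)        # step 2: posted MWr setting the flag

in_order = consumer_view([data_write, flag_write])
reordered = consumer_view([flag_write, data_write])
```

With in-order delivery the Consumer reads `"fresh"`; if the flag write arrives first it reads `"stale"`, and nothing in the model signals an error.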

📋 What Breaks Without Ordering

Figure 3 — What would happen without ordering rules. The flag write (②) is small and slips past the data write (①), which is stalled behind a full buffer. Flag=1 appears in memory first. The Consumer reads Flag=1 and fetches the data, but the data has not arrived yet. The ordering rule “Posted must not pass Posted” prevents ② from ever overtaking ①.

📋 How to Read the Ordering Table

The table is structured as Row passes Column. Columns represent TLPs already waiting in an egress queue. Rows represent a newly arrived TLP that wants to go out the same port. Each cell answers one question: may the row TLP jump ahead of the column TLP?

Figure 4 — Cell meanings. “No” entries protect data correctness. “Yes” entries prevent deadlock. “Yes/No” entries give implementers freedom to optimise or to keep logic simple — neither choice is wrong.

📋 The Ordering Table

Read each row as: “This newly arrived TLP — may it pass a TLP that is already queued?”

Newly arrived TLP ↓ / Already queued TLP →	Posted (MWr · Msg · MsgD)	Non-Posted Read (MRd · IORd · CfgRd)	Non-Posted Write (IOWr · CfgWr · AtomicOp)	Completion (Cpl · CplD)
Posted Write / Message (MWr · Msg · MsgD)	No (core ordering rule)	Yes (deadlock prevention)	Yes (deadlock prevention)	Yes/No (impl. choice)
Non-Posted Read (MRd · IORd · CfgRd)	No (write-before-read)	Yes/No	Yes/No	Yes/No
Non-Posted Write (IOWr · CfgWr0/1 · AtomicOp)	No (write-before-read)	Yes/No	Yes/No	Yes/No
Completion (Cpl · CplD)	No for read completions without RO; Yes/No if RO set	Yes (deadlock prevention)	Yes (deadlock prevention)	Different IDs: Yes/No; same ID: No

The hard rules in this table are the No cells and the four mandatory Yes cells. Everything else is implementation choice. The sections below explain each group in plain English.
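One way to make the table concrete is to encode it as a lookup, sketched below. The function `may_pass` and its string results (`"no"`, `"yes"`, `"impl"`) are invented for this example. The sketch assumes the PCIe base specification's treatment of the Completion-versus-Posted cell, where a read completion without Relaxed Ordering may not pass an earlier posted write, and it ignores IDO and the special case of I/O and Configuration write completions.

```python
P, NPR, NPW, CPL = "Posted", "NP Read", "NP Write", "Completion"

# Base cells, keyed by (newly arrived row, already queued column).
_BASE = {
    (P, P): "no",   (P, NPR): "yes",    (P, NPW): "yes",    (P, CPL): "impl",
    (NPR, P): "no", (NPR, NPR): "impl", (NPR, NPW): "impl", (NPR, CPL): "impl",
    (NPW, P): "no", (NPW, NPR): "impl", (NPW, NPW): "impl", (NPW, CPL): "impl",
    (CPL, P): "no", (CPL, NPR): "yes",  (CPL, NPW): "yes",  (CPL, CPL): "impl",
}


def may_pass(row, col, ro=False, same_transaction=False):
    """Return 'no', 'yes' (mandatory) or 'impl' (implementation choice)."""
    # Split completions of one request must stay in order.
    if (row, col) == (CPL, CPL) and same_transaction:
        return "no"
    # RO on a posted write or a completion relaxes the Posted column to a choice.
    if col == P and row in (P, CPL) and ro:
        return "impl"
    return _BASE[(row, col)]
```

For example, `may_pass(P, NPR)` gives `"yes"`, while `may_pass(CPL, P)` gives `"no"` unless `ro=True` is passed.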

📋 Posted vs Posted — The Core Rule

A Posted Write must not pass a Posted Write that arrived earlier.

This single rule protects the Producer/Consumer pattern. The data write (①) and the flag write (②) are both MWr TLPs. Without this rule a lightweight flag write can bypass a heavy data write inside a switch buffer and arrive first. The Consumer then reads Flag=1 before the data has landed — silent data corruption with no error flag raised anywhere in the system.

Figure 5 — Posted writes leave the egress port in the same order they entered it. ① always exits before ②. The switch must not allow a newer Posted to bypass an older Posted regardless of size difference.

There is one exception for advanced use: when ID-Based Ordering (IDO) is enabled and two Posted packets come from devices with different Requester IDs, they may be allowed to reorder — because packets from different devices are almost certainly unrelated. IDO is explained later in this post.

📋 Posted vs Non-Posted — Mandatory Pass

A Posted Write must be allowed to pass a queued Non-Posted request. This is not optional — it is mandatory to prevent deadlock.

The scenario that demands this rule: a read request (MRd) is stuck at an egress port because the NP buffer at the next hop is full. If an MWr is not allowed to bypass that stuck MRd, the MWr is also stuck. If the MWr carries data that the read’s target needs to return in its completion, and the completion cannot return until the MWr lands — nothing moves. The switch, the requester, and the target all wait on each other forever.

Allowing the MWr to pass the stuck MRd breaks this circle. The data lands. The target reads it. The completion flows back. The MRd resolves. Everything drains.

This applies equally to Non-Posted writes (IOWr, CfgWr). A Posted Write that arrives behind a stalled IOWr or CfgWr must also be allowed to bypass it. The deadlock scenario is identical. Blocking MWr behind any stuck Non-Posted request is prohibited.
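The bypass can be sketched as a tiny egress arbiter. Everything here is a simplification for illustration: `pick_next` is a hypothetical function, TLPs are `(sequence_number, name)` tuples, and credits are plain counters rather than real flow-control units.

```python
from collections import deque


def pick_next(posted, non_posted, credits):
    """Pop and return the next (seq, name) TLP to transmit, or None if stalled."""
    p_head = posted[0] if posted else None
    np_head = non_posted[0] if non_posted else None

    # A Non-Posted TLP may go only if no older Posted TLP is queued and it has credits.
    np_is_older = np_head is not None and (p_head is None or np_head[0] < p_head[0])
    if np_is_older and credits["NP"] > 0:
        credits["NP"] -= 1
        return non_posted.popleft()

    # A Posted TLP needs only its own credits; a stalled Non-Posted never blocks it.
    if p_head is not None and credits["P"] > 0:
        credits["P"] -= 1
        return posted.popleft()
    return None


posted = deque([(2, "MWr")])
non_posted = deque([(1, "MRd")])
credits = {"P": 1, "NP": 0}  # the next hop has no Non-Posted credits left

first = pick_next(posted, non_posted, credits)   # the newer MWr bypasses the stalled MRd
second = pick_next(posted, non_posted, credits)  # the MRd still waits for credits
```

The same function also enforces the opposite rule: a Non-Posted request queued behind an older Posted write is never selected ahead of it.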

📋 Posted vs Completion — Implementation Choice

A newly arrived Posted Write may optionally pass a queued Completion going in the same direction. This is a Yes/No cell — both choices are compliant. Neither data corruption nor deadlock results from either decision.

The reverse direction is a different cell: a read Completion may pass a queued MWr only when Relaxed Ordering (RO) is set on the Completion. That is the most useful case in practice, for example a GPU read completion bypassing queued DMA writes to deliver frame data back to the CPU faster.

There is one bridge-specific exception: in a PCIe-to-PCI/PCI-X bridge translating traffic from PCIe into PCI, a Posted Write must be able to pass a Completion or a deadlock can form due to PCI’s legacy delayed-transaction model. For all native PCIe-to-PCIe paths this does not apply.

📋 Non-Posted Rules

Non-Posted must not pass Posted

A read request or non-posted write must never bypass an earlier Posted Write. This enforces write-before-read ordering. If a read bypassed an earlier write and returned data, the data it returned could be the pre-write value — old data, delivered to software as if it were current. No error flag.

This is the read-side mirror of the core Posted rule. Together they ensure that all writes a device has issued are visible at the target before any subsequent read from that same device can return.

Non-Posted may pass Non-Posted — Yes/No

Two non-posted requests from different contexts may optionally reorder relative to each other. If an MRd is stalled because the NP buffer at the next hop is full, a subsequent MRd from an unrelated context may be allowed to bypass it. No correctness risk — the two reads target different addresses and have no dependency between them. This is called weak ordering and exists to prevent head-of-line blocking from spreading across unrelated traffic.

Non-Posted may pass Completion — Yes/No

A non-posted request may optionally bypass a queued completion. Again, purely implementation choice. The read request and the completion are almost certainly unrelated.

📋 Completion Rules

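Completion may pass Posted — only if RO set (then Yes/No)

A read Completion going toward the original requester must normally stay behind queued Posted Writes heading in the same direction. Without Relaxed Ordering it must not pass them. This is what makes a status read safe: when the CPU reads a status register on a device, the completion pushes the device's earlier DMA writes ahead of it, so the data is in memory by the time the status value arrives. With RO=1 on the completion, switches are permitted (not required) to let it pass, which improves read latency because completions no longer wait behind write queues. Completions for I/O and Configuration writes may pass Posted Writes as an implementation choice.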

Completion must pass Non-Posted — mandatory Yes

A Completion must always be allowed to pass a queued Non-Posted request. This is the second mandatory rule and it exists for the same reason as the Posted-passes-NP rule: deadlock.

The scenario: a requester holds Non-Posted flow-control credits while waiting for a completion. The completion is stuck behind a queued MRd at an intermediate switch. The MRd is stuck because the NP buffer downstream is full. The NP buffer is full because the requester’s own NP buffer is backed up waiting for… the completion. If the completion cannot bypass the stuck MRd, nothing moves. Allowing it to pass breaks the deadlock.
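The circular wait described above can be modelled as a wait-for graph and checked for a cycle. This is a generic depth-first search, not anything specific to PCIe hardware, and the node names are just labels chosen for this example.

```python
def has_cycle(waits_on):
    """Return True if the wait-for graph (node -> list of nodes) contains a cycle."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in waits_on.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(waits_on))


blocked = {
    "requester": ["completion"],
    "completion": ["stalled MRd"],
    "stalled MRd": ["downstream NP buffer"],
    "downstream NP buffer": ["requester"],
}
# Letting the completion pass the MRd removes its dependency on the NP path.
with_bypass = {**blocked, "completion": []}

deadlocked = has_cycle(blocked)
resolved = not has_cycle(with_bypass)
```

Removing a single edge is enough to break the loop, which is why the rule only needs to be enforced at the point where the completion meets the stalled request.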

Completions with different IDs may pass each other — Yes/No

Two completions belonging to different requests (different Transaction IDs, that is, different Requester ID + Tag combinations) may optionally reorder relative to each other. They deliver data to different waiting contexts, and neither depends on the order in which the other arrives.

Completions for the same request must not reorder — hard No

When a single large read is satisfied by multiple CplD TLPs (a split completion), those partial completions must arrive at the requester in ascending address order. CplD #2 must not arrive before CplD #1. If it did, the requester would assemble the pieces in the wrong order — corrupted data, no error flag.
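A receiver-side check for this rule might look like the sketch below. It assumes each completion is reduced to a `(transaction_id, start_address)` pair; `completions_in_order` is a hypothetical helper, not part of any real driver or verification library.

```python
def completions_in_order(arrivals):
    """Check that completions sharing a transaction ID arrive in ascending address order."""
    last_address = {}
    for transaction_id, address in arrivals:
        previous = last_address.get(transaction_id)
        if previous is not None and address <= previous:
            return False  # a later piece of the same read arrived out of order
        last_address[transaction_id] = address
    return True


# Completions for different requests may interleave freely.
ok = completions_in_order([("A", 0x1000), ("B", 0x8000), ("A", 0x1040), ("B", 0x8040)])
# Two pieces of the same request swapped: this violates the rule.
bad = completions_in_order([("A", 0x1040), ("A", 0x1000)])
```

Tracking the last address per transaction ID keeps the check independent of how completions for other requests are interleaved.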

📋 Deadlock — Why Some Cells Are Mandatory

Four cells in the table are mandatory — not optional performance hints but hard requirements without which the fabric can permanently stall. The two “Posted must pass Non-Posted” entries and the two “Completion must pass Non-Posted” entries all exist to prevent circular waits.

Figure 6 — Deadlock circle. The Endpoint holds NP credits waiting for its CplD. The CplD is stuck at the Switch behind a stalled Non-Posted request. The Non-Posted queue cannot drain because the Endpoint’s receive buffer is full. The Endpoint’s receive buffer is full because it is waiting for the CplD. The mandatory rule “Completion must pass Non-Posted” cuts this circle at the Switch.

📋 Relaxed Ordering (RO)

Relaxed Ordering is a single bit in the TLP header (DW0 bit 13, Attr[1]). A requester may set it only when software has set Enable Relaxed Ordering in the function's Device Control register. Setting it to 1 is a declaration: “I guarantee this packet has no ordering dependency on earlier posted writes. You may let it pass them.”

A switch that sees RO=1 is permitted — but not required — to reorder that packet ahead of earlier posted writes. This makes it an advisory hint rather than a command.
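Reading or setting the bit is a one-line mask operation. The sketch below assumes DW0 is held as a 32-bit integer with header byte 0 in the most significant byte, which matches the bit 13 position given above; the helper names are invented for this example.

```python
RO_BIT = 13  # Attr[1] in DW0, using the bit numbering described above


def has_relaxed_ordering(dw0: int) -> bool:
    """Return True if the Relaxed Ordering attribute is set in header DW0."""
    return bool((dw0 >> RO_BIT) & 1)


def with_relaxed_ordering(dw0: int, enable: bool) -> int:
    """Return DW0 with the Relaxed Ordering attribute set or cleared."""
    mask = 1 << RO_BIT
    return (dw0 | mask) if enable else (dw0 & ~mask)


# Example DW0 for a 3DW memory write with a length of one DW (Fmt/Type byte 0x40).
plain = 0x40000001
relaxed = with_relaxed_ordering(plain, True)
```

Here `relaxed` differs from `plain` only in bit 13, and clearing the bit again restores the original value.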

Where RO helps

GPU read completions — returning rendered frame data can bypass queued DMA writes because the two streams are completely unrelated
Scatter-gather DMA — many independent writes to non-overlapping memory regions can mark themselves RO so they pipeline past each other inside switch buffers
Read requests — when an MRd carries RO=1, the Completer echoes RO=1 in the returned CplD, allowing that completion to bypass write queues on the way back

Where RO is unsafe

Never set RO on a flag write that follows a data write. The flag write depends on the data write having arrived. Marking the flag write as RO allows it to bypass the data write — reintroducing the exact Producer/Consumer corruption the ordering rules exist to prevent. RO is only safe when software can genuinely guarantee the marked packet has zero dependency on anything that came before it.
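The hazard is easy to demonstrate with a deliberately pessimistic model of a switch. In the sketch below, `egress_order` is a hypothetical function that always takes the reordering option it is allowed: any posted write marked RO is sent ahead of everything already queued. Real switches are not required to behave this way, but software has to assume they might.

```python
def egress_order(queue):
    """Return the send order for (name, ro) posted writes given in arrival order.

    Worst-case model: a write with RO set always jumps to the front.
    """
    out = []
    for name, ro in queue:
        if ro:
            out.insert(0, name)  # RO set: allowed to pass everything queued earlier
        else:
            out.append(name)     # RO clear: must stay behind earlier posted writes
    return out


safe = egress_order([("data", False), ("flag", False)])
unsafe = egress_order([("data", False), ("flag", True)])
```

With RO clear on the flag, the order stays data then flag. With RO set on the flag, this model sends the flag first, which is exactly the Producer/Consumer failure described earlier.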

TLP with RO=1	Can pass…
Posted Write	Earlier Posted Writes and Messages
Message	Earlier Posted Writes and Messages
Read Completion (CplD)	Earlier Posted Writes and Messages

📋 ID-Based Ordering (IDO)

ID-Based Ordering (IDO, added in PCIe 2.1) is a performance enhancement based on a simple observation: packets from different Requester IDs almost certainly have no ordering relationship with each other. A write from Device A and a write from Device B are almost always independent — they come from different software contexts, targeting different memory regions.

IDO allows a switch to reorder two TLPs that would normally be kept in order, as long as they have different Requester IDs. It effectively says: “treat each device’s traffic as its own independent ordered stream — don’t let one device’s blockage stall another device’s unrelated traffic.”

Figure 7 — IDO effect. Without IDO, a stuck MWr from Device A blocks all subsequent packets regardless of source. With IDO enabled, packets from Device B and Device C are recognised as independent streams and bypass the stuck Dev-A packet freely.

Enabling and using IDO safely

Enabled in the Device Control 2 register — separate enable bits for Requests and Completions
Per-TLP IDO flag is Attr[2], which lives in DW0 bit 18 (header byte 1, bit 2), apart from Attr[1:0] in byte 2
Completions may use IDO even if the original request did not — the completer decides independently
Safe when each device communicates only with the Root Complex and never shares state through a common memory buffer with another device
Not safe when Device A writes data, and Device B reads from the same memory area — IDO could allow their writes to arrive out of order at that shared region
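The core IDO decision for two posted writes reduces to a comparison of Requester IDs. The sketch below uses a made-up `Tlp` record and `ido_allows_pass` function, and it covers only the request case: the newer TLP must carry the IDO attribute and come from a different requester than the one it wants to pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tlp:
    requester_id: int  # Bus/Device/Function of the requester, packed as an integer
    ido: bool          # IDO attribute (Attr[2]) carried by this TLP


def ido_allows_pass(newer: Tlp, older: Tlp) -> bool:
    """Return True if IDO permits `newer` to pass `older`."""
    return newer.ido and newer.requester_id != older.requester_id


stuck_from_a = Tlp(requester_id=0x0100, ido=False)
from_b = Tlp(requester_id=0x0200, ido=True)
later_from_a = Tlp(requester_id=0x0100, ido=True)
```

In this example `from_b` may pass `stuck_from_a`, while `later_from_a` may not, because traffic from a single requester stays in order even with IDO set.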

📋 Ordering and Virtual Channels

Every Virtual Channel buffer is split into three independent sub-buffers: Posted (P), Non-Posted (NP), and Completion (CPL). Each sub-buffer has its own flow-control credit pool. This physical separation is what makes the mandatory “Yes” rules implementable — the CPL queue can always drain past a full P queue because they are different hardware structures with independent credits.

Ordering rules apply strictly within a single VC. Two TLPs travelling in different VCs have no ordering relationship whatsoever — a TC 7 packet in VC 1 can freely overtake a TC 0 packet in VC 0 without any restriction.

When multiple Traffic Classes are mapped to the same VC (the common case — most systems use only VC 0 for all TC values), the implementation may choose to apply ordering rules across all traffic within that VC for simplicity. The rules only require enforcement within a single TC; applying them across a full VC is a valid superset.
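The scoping rule can be written as a small predicate. This is a sketch under stated assumptions: `ordering_applies` is a hypothetical function, `tc_to_vc` is a plain dictionary standing in for the TC/VC mapping, and `per_vc` models the simpler implementation that enforces ordering across a whole VC.

```python
def ordering_applies(tc_a, tc_b, tc_to_vc, per_vc=False):
    """Return True if the ordering rules must be applied between two TLPs."""
    if tc_a == tc_b:
        return True  # same Traffic Class: ordering rules always apply
    # Different TCs: ordered only if the implementation orders per VC
    # and both TCs are mapped to the same VC.
    return per_vc and tc_to_vc[tc_a] == tc_to_vc[tc_b]


all_on_vc0 = {tc: 0 for tc in range(8)}
tc7_on_vc1 = {**all_on_vc0, 7: 1}
```

With every TC mapped to VC 0, a per-VC implementation orders TC 0 against TC 7 even though the rules do not require it; once TC 7 moves to VC 1, the two are independent in either mode.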

⚡ Ordering in Gen 6

The transaction ordering rules are unchanged in Gen 6. They live in the Transaction Layer. Gen 6 changes how TLPs are carried on the link (flit packing and FEC), and those mechanisms are transparent to ordering logic.

Flit packing preserves TLP order. Multiple TLPs inside a single 256-byte Gen 6 flit are packed in transmission order. The flit header records each TLP’s position. The receiver unpacks them in the same sequence they were inserted. No ordering rule is violated by the packing process.
Flit replay preserves TLP order. When a flit is replayed after FEC failure, all TLPs in that flit replay together in their original order. A TLP from a replayed flit cannot overtake a TLP from a later flit that was received successfully — the Data Link Layer’s sequence numbering prevents this, exactly as in Gen 1 through Gen 5.
RO and IDO semantics are unchanged. The Attr bits keep the same meaning in Gen 6. Flit Mode does use a reorganised TLP header, so header-parsing logic should take the Attr field position from the Flit Mode header definition rather than assume the non-flit layout, but the ordering decisions made from those bits are identical.

Zero impact on ordering logic for Gen 6. Every rule in this post applies identically to a Gen 6 link. If you are writing switch RTL or a PCIe controller, your ordering enforcement code is the same for Gen 1 through Gen 6.

📋 Quick Reference

Rule	Cell	Plain-English Meaning
Posted must not pass Posted	No — hard	Data writes must arrive before flag writes. The foundation of Producer/Consumer correctness. The only exceptions are opt-in: RO set on the passing write, or IDO with different Requester IDs.
Posted must pass Non-Posted Read	Yes — mandatory	A write must be able to bypass a stuck read request. Required to break circular deadlocks. Cannot be blocked.
Posted must pass Non-Posted Write	Yes — mandatory	Same deadlock reason as above, for IOWr/CfgWr stuck in the NP queue.
Posted may pass Completion	Yes/No	Implementation choice. No correctness or deadlock risk either way.
Non-Posted must not pass Posted	No — hard	A read must not bypass a write that preceded it. Would break write-before-read ordering.
Non-Posted may pass Non-Posted	Yes/No	Weak ordering — independent reads from unrelated contexts may reorder. No correctness risk.
Non-Posted may pass Completion	Yes/No	Implementation choice.
Completion may pass Posted	No without RO; Yes/No with RO	A read completion must not pass earlier posted writes unless its Relaxed Ordering bit is set. With RO set it may bypass write queues for faster read completion.
Completion must pass Non-Posted Read	Yes — mandatory	A completion must bypass a stuck read request. Mandatory deadlock prevention.
Completion must pass Non-Posted Write	Yes — mandatory	Same as above, for IOWr/CfgWr.
Completion may pass Completion (diff IDs)	Yes/No	Two completions for different requests (different Transaction IDs) may reorder. Each goes to a different context.
Completion must not pass Completion (same ID)	No — hard	Split completions for the same read must arrive in ascending address order. Out-of-order assembly = corruption.
Relaxed Ordering (RO)	DW0 bit 13	Driver declares no dependency on prior writes. Switch may (not must) let the TLP pass earlier posted writes. Never safe for flag writes.
ID-Based Ordering (IDO)	DW0 bit 18	Packets from different Requester IDs may reorder. Safe only when devices share no state through common memory. Enabled in Device Control 2.
Ordering scope	Within TC/VC	Different TCs have no ordering relationship. Different VCs have no ordering relationship.
Gen 6 impact	None	All rules, RO, IDO, P/NP/CPL buffer separation — identical in Gen 6. Flit packing is transparent.