PCIe Series · PCIe-06 – Your VLSI Journey Starts Here

📋 Memory TLPs — the Three Types

Memory transactions are the workhorse of PCIe. Almost every DMA transfer, GPU frame render, and NVMe command queue operation is a memory read or write. There are three memory TLP types:

Figure 1 — Three memory TLP types. MRd and MWr differ only in Fmt (no-data vs with-data) and in being non-posted vs posted. MRdLk shares the MRd header layout but uses Type=0_0001 and triggers legacy lock behaviour in switches.

The key distinction between MRd and MWr is the Fmt field: MRd has Fmt bit[1] = 0 (no payload), MWr has Fmt bit[1] = 1 (payload follows header). The Type field is the same — 0_0000 — for both. The receiver tells them apart purely by the Fmt field before reading any further.

📋 MRd — 3DW Header (32-bit address)

Use the 3DW MRd when the target address is below 4 GB (fits in 32 bits). The header is 12 bytes — three Doublewords.

Figure 2 — MRd 3DW header (Fmt=000). DW0 carries Fmt+Type+TC+flags+Length. DW1 carries Requester ID (16-bit BDF) + Tag (8-bit) + Last DW BE (4-bit) + First DW BE (4-bit). DW2 carries the 30-bit address (bits [31:2]), with bits [1:0] always 00 — the address is always DW-aligned. No payload follows.

📋 MRd — 4DW Header (64-bit address)

Use the 4DW MRd when the target address is ≥ 4 GB (requires more than 32 bits). The header grows to 16 bytes — four Doublewords. DW0 and DW1 are identical to the 3DW variant. DW2 and DW3 together carry the full 64-bit address.

Figure 3 — MRd 4DW header (Fmt=001). DW0 and DW1 are identical to the 3DW variant. DW2 carries Address[63:32] — the upper 32 bits. DW3 carries Address[31:2] plus two reserved zero bits. The total header is 16 bytes. No data payload follows in either the 3DW or 4DW MRd.

The spec requires the 32-bit (3DW) format for any address below 4 GB, and it leaves receiver behaviour unspecified if a 4DW request arrives with the upper 32 bits of the address all zero. In other words, if your address fits in 32 bits, always use the 3DW variant. Using 4DW for a sub-4 GB address wastes 4 bytes of header and may confuse some receivers.
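As a minimal sketch of the 3DW vs 4DW choice and the field positions described above, the following packs an MRd header into bytes. The function name `build_mrd_header` and its parameters are hypothetical, and the TD, EP, Attr and AT bits are simply left at zero for brevity.

```python
import struct

def build_mrd_header(addr, length_dw, requester_id, tag, first_be, last_be, tc=0):
    """Pack an MRd header (hypothetical helper, not a library API)."""
    if addr & 0x3:
        raise ValueError("address must be DW-aligned; use Byte Enables for byte offsets")
    if not 1 <= length_dw <= 1024:
        raise ValueError("Length must be 1..1024 DW")

    use_4dw = addr > 0xFFFF_FFFF          # 3DW is required when the address fits in 32 bits
    fmt = 0b001 if use_4dw else 0b000     # no-data formats: 000 = 3DW, 001 = 4DW
    tlp_type = 0b00000                    # MRd

    # DW0: Fmt[31:29], Type[28:24], TC[22:20], Length[9:0] (1024 DW is encoded as 0)
    dw0 = (fmt << 29) | (tlp_type << 24) | ((tc & 0x7) << 20) | (length_dw & 0x3FF)
    # DW1: Requester ID[31:16], Tag[15:8], Last DW BE[7:4], First DW BE[3:0]
    dw1 = ((requester_id & 0xFFFF) << 16) | ((tag & 0xFF) << 8) \
        | ((last_be & 0xF) << 4) | (first_be & 0xF)

    if use_4dw:
        dws = [dw0, dw1, addr >> 32, addr & 0xFFFF_FFFC]
    else:
        dws = [dw0, dw1, addr & 0xFFFF_FFFC]
    # TLP headers are transmitted most significant byte first
    return struct.pack(f">{len(dws)}I", *dws)

hdr32 = build_mrd_header(0x1000, 1, requester_id=0x0100, tag=5, first_be=0xF, last_be=0x0)
hdr64 = build_mrd_header(0x1_0000_0000, 1, requester_id=0x0100, tag=5, first_be=0xF, last_be=0x0)
print(len(hdr32), len(hdr64))  # 12 16
```

The address test is the only thing that selects the header size here, which mirrors the rule above: the caller never chooses 4DW for an address that fits in 32 bits.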

📋 Requester ID and Tag Field

DW1 of every memory request header carries two fields that together form the Transaction ID — the unique identity of this request within the PCIe fabric:

Figure 4 — Transaction ID composition. Requester ID is the BDF of whoever sent the request. Tag is a value the requester assigns to each request (often a simple counter); it must be unique among that function's outstanding non-posted requests. Together they uniquely identify every outstanding non-posted transaction. The completer echoes both in the CplD so routing and matching work correctly.
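The matching described above can be sketched as a small lookup keyed by (Requester ID, Tag). The names `requester_id` and `OutstandingReads` are hypothetical, and a real requester would implement this in hardware rather than with a Python dictionary.

```python
def requester_id(bus, device, function):
    """Pack Bus[15:8], Device[7:3], Function[2:0] into a 16-bit Requester ID."""
    if not (0 <= bus <= 255 and 0 <= device <= 31 and 0 <= function <= 7):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function

class OutstandingReads:
    """Tracks in-flight reads for one function (illustrative only)."""

    def __init__(self, req_id, num_tags=256):
        self.req_id = req_id
        self.free_tags = list(range(num_tags))
        self.pending = {}                  # (Requester ID, Tag) -> caller context

    def issue(self, context):
        """Reserve a tag for a new read; returns None when all tags are in flight."""
        if not self.free_tags:
            return None
        tag = self.free_tags.pop(0)
        self.pending[(self.req_id, tag)] = context
        return tag

    def complete(self, cpl_req_id, cpl_tag):
        """Match a completion by its echoed Requester ID and Tag, then free the tag."""
        context = self.pending.pop((cpl_req_id, cpl_tag))
        self.free_tags.append(cpl_tag)
        return context

rid = requester_id(bus=1, device=0, function=0)
reads = OutstandingReads(rid, num_tags=2)
t0 = reads.issue("descriptor A")
t1 = reads.issue("descriptor B")
print(hex(rid), t0, t1, reads.issue("descriptor C"))  # 0x100 0 1 None
```

The `None` return is the stall case: with every tag in flight, the requester has to wait for a completion before it can send another read.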

Tag field modes

Mode	Bits used	Max outstanding requests per function	How to enable
8-bit Tag	8 bits (Tag[7:0])	256	Extended Tag Field Enable bit in the Device Control register (with it clear, only 5-bit tags, i.e. 32 requests, may be used)
10-bit Tag	10 bits (Tag[9:0])	768 (a 10-bit requester may not use Tag[9:8] = 00b, so 1024 − 256 values)	10-Bit Tag Requester Enable bit in the Device Control 2 register · the completer must support 10-bit tags

Why does Gen 6 benefit from larger tags? At 64 GT/s per lane, a x16 Gen 6 link can transfer ~122 GB/s. An NVMe SSD reading 4 KB blocks would need roughly 30 million requests per second to saturate that link, and each read holds its tag for the full round trip until its last completion returns. The number of tags needed is therefore roughly bandwidth × round-trip latency ÷ request size. With small read requests or long latencies this can exceed 256, and the sender then has to wait for completions before sending new requests. 10-bit tags are therefore strongly recommended for high-bandwidth Gen 6 devices.
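The tag-pressure argument is a direct application of Little's law: requests in flight = request rate × round-trip time. The sketch below works that out for the ~122 GB/s figure quoted above. The 1 µs and 2 µs round-trip latencies and the 512-byte read request size are illustrative assumptions, not values from the spec or from any particular device.

```python
import math

def tags_needed(bandwidth_bytes_per_s, round_trip_s, read_request_bytes):
    """Outstanding reads needed to keep the link busy (Little's law)."""
    return math.ceil(bandwidth_bytes_per_s * round_trip_s / read_request_bytes)

LINK_BW = 122e9  # bytes/s, the x16 Gen 6 figure used in the text

# Request rate when every read is 4 KB
print(round(LINK_BW / 4096 / 1e6))        # 30 (million requests per second)

# Assumed 1 us round trip, 512-byte read requests
print(tags_needed(LINK_BW, 1e-6, 512))    # 239

# Assumed 2 us round trip, same request size
print(tags_needed(LINK_BW, 2e-6, 512))    # 477
```

Under these assumptions a 1 µs round trip already sits close to the 256-tag ceiling, and doubling the latency pushes the requirement well past it.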

📋 Byte Enables — First DW and Last DW

Byte 7 of every memory request header carries two 4-bit Byte Enable fields. They select which individual bytes within the first and last Doublewords of the transfer are active. This is how PCIe supports sub-DW-granular transfers without needing a separate length-in-bytes counter.

Figure 5 — Byte Enable fields in Byte 7 of DW1. The lower nibble [3:0] applies to the first DW of the transfer; the upper nibble [7:4] applies to the last DW. Bit=1 means valid. When Length=1 (single DW transfer), Last BE must be 0x0 — only First BE is meaningful.

📋 Byte Enable Rules from the Spec

These rules follow Table 5-5 and the byte enable section of Chapter 5 in MindShare's PCI Express Technology book, which summarises the corresponding rules in the PCIe Base Specification:

#	Rule	Why it matters
1	Byte enable bits are high-true. A bit = 1 means that byte is valid; bit = 0 means ignore that byte.	Receivers must check this before writing to registers or memory. A 0 bit means “leave that byte unchanged.”
2	If Length = 1 DW, the Last DW BE must be 0x0 (all bits zero). Only First DW BE is used.	There is only one DW — it is both the first and the last. Last BE would be redundant and must be cleared.
3	If Length ≥ 2 DW, neither the First DW BE nor the Last DW BE may be 0x0. Each must have at least one bit set.	A transfer of 2+ DWs with no valid bytes in its first or last DW would be nonsensical, because a shorter Length would describe the same access.
4	If Length ≥ 3 DW, the enabled bytes must be contiguous with the middle DWs: First DW BE is one of 1111, 1110, 1100, 1000 and Last DW BE is one of 1111, 0111, 0011, 0001.	The middle DWs are always fully valid. Only the first and last DWs can have partial bytes. If Length ≥ 3, the BEs only define the start and end byte offsets and cannot select non-contiguous bytes.
5	Discontinuous BE patterns are allowed only if Length = 1, or if Length = 2 and the address is QW-aligned (8-byte aligned).	For a single-DW transfer you can have bits like 0101 (Bytes 0 and 2 but not 1 and 3). This is valid for narrow register accesses.
6	A write with Length = 1 and no BEs set (all zero) is legal but has no effect on the completer.	The completer accepts the TLP and writes nothing. Because MWr is posted, no completion comes back, so a zero-length write cannot act as a flush on its own. The zero-length read in rule 7 is the flush mechanism.
7	A read with Length = 1 and no BEs set causes the completer to return 1 DW of undefined data.	This is the flush mechanism. A read cannot pass earlier posted writes travelling in the same direction, so by the time its completion returns, all earlier posted writes from the requester have reached the completer. The data is meaningless and should be discarded.
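The byte enable rules lend themselves to a small checker. The sketch below is one way to express them; `check_byte_enables` is a hypothetical helper, and it follows the Base Specification's contiguity rule, under which a QW-aligned 2 DW request may also use non-contiguous enables.

```python
# Patterns whose enabled bytes touch the neighbouring (middle) DW with no holes
FIRST_CONTIGUOUS = {0b1111, 0b1110, 0b1100, 0b1000}
LAST_CONTIGUOUS = {0b1111, 0b0111, 0b0011, 0b0001}

def check_byte_enables(length_dw, first_be, last_be, addr=0):
    """Return a list of rule violations; an empty list means the BEs are legal."""
    errors = []
    if length_dw == 1:
        # Single DW: any First BE pattern is legal (including 0000), Last BE must be 0
        if last_be != 0:
            errors.append("Length=1 requires Last DW BE = 0000b")
        return errors

    # Non-contiguous enables are tolerated only for a QW-aligned 2 DW request
    qw_aligned_2dw = length_dw == 2 and addr % 8 == 0

    if first_be == 0:
        errors.append("Length>1 requires a non-zero First DW BE")
    elif not qw_aligned_2dw and first_be not in FIRST_CONTIGUOUS:
        errors.append("First DW BE must be contiguous up to byte 3")

    if last_be == 0:
        errors.append("Length>1 requires a non-zero Last DW BE")
    elif not qw_aligned_2dw and last_be not in LAST_CONTIGUOUS:
        errors.append("Last DW BE must be contiguous from byte 0")

    return errors

print(check_byte_enables(1, 0b0101, 0b0000))   # [] (sparse single-DW access is fine)
print(check_byte_enables(3, 0b1110, 0b0011))   # [] (contiguous multi-DW access)
print(check_byte_enables(3, 0b0101, 0b1111))   # one error: hole in First DW BE
```

Returning a list of messages rather than a boolean makes it easier to see which rule a malformed request breaks, which is usually what a testbench checker wants.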

▶ Byte Enable Worked Examples

These examples show how Byte Enables select the valid bytes for transfers of different sizes and alignments.

Figure 6 — Byte Enable examples showing which bytes are active (coloured) vs skipped (grey). Example A: full 4-byte DW read. B: single byte at offset 2. C: 5-byte unaligned read spanning two DWs, where First BE=0xE skips byte 0 (3 bytes) and Last BE=0x3 takes only bytes 0–1 (2 bytes). D: 3-DW aligned read with all bytes valid; the middle DW has no BEs (always fully valid).
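Going the other way, from a byte address and byte count to the header fields, can be sketched as follows. The helper name `byte_enables` is hypothetical, and the sketch deliberately ignores the 4 KB boundary and maximum-length checks that a real requester would also apply.

```python
def byte_enables(byte_addr, byte_count):
    """Derive (DW-aligned address, Length in DW, First DW BE, Last DW BE)."""
    if byte_count < 1:
        raise ValueError("byte_count must be at least 1")

    start = byte_addr % 4                          # offset of first byte in its DW
    end = (byte_addr + byte_count - 1) % 4         # offset of last byte in its DW
    dw_addr = byte_addr & ~0x3                     # header address, bits [1:0] = 00
    length_dw = (start + byte_count + 3) // 4      # DWs touched by the access

    if length_dw == 1:
        # Everything lives in one DW: First BE covers it, Last BE must be 0
        first_be = ((1 << byte_count) - 1) << start
        last_be = 0
    else:
        first_be = (0xF << start) & 0xF            # from 'start' up to byte 3
        last_be = (1 << (end + 1)) - 1             # from byte 0 up to 'end'
    return dw_addr, length_dw, first_be, last_be

# The four cases from the figure, using 0x1000 as an arbitrary base address
print(byte_enables(0x1000, 4))    # (4096, 1, 15, 0)   full DW
print(byte_enables(0x1002, 1))    # (4096, 1, 4, 0)    single byte at offset 2
print(byte_enables(0x1001, 5))    # (4096, 2, 14, 3)   First BE=0xE, Last BE=0x3
print(byte_enables(0x1000, 12))   # (4096, 3, 15, 15)  three aligned DWs
```

The third call shows how an unaligned start and a short tail turn into the 0xE / 0x3 pair, with the header address rounded down to the DW boundary.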

📋 MWr — Memory Write with Payload

MWr is the posted counterpart to MRd. The header layout is identical to MRd (same DW1 with Requester ID, Tag, BEs; same DW2/DW3 address structure). There are two differences: in DW0, Fmt has the data bit set (Fmt bit[1]=1), and the data payload immediately follows the last header DW.

Figure 7 — MWr 4DW header + payload. The header is identical to MRd except Fmt=011 (data present). The payload follows DW3 immediately — no gap. Since MWr is posted, there is no completion TLP returned. The Data Link Layer’s per-hop ACK DLLP still confirms delivery to the adjacent device, but that ACK is invisible to the Transaction Layer.

MWr vs MRd — the two differences

Property	MRd	MWr
Fmt bit[1] (data present)	0 — no payload	1 — payload follows
Transaction type	Non-posted	Posted
Completion returned?	Yes — CplD (data) from completer	No — fire and forget
Tag purpose	Matches incoming CplD to outstanding read	Present in header but not used by completer for completion matching
Ordering rule	Cannot pass earlier posted writes	Cannot pass earlier posted writes (strict ordering unless RO=1)
Flow control buffer	Non-Posted (NPH credit only, since a read request carries no data)	Posted (PH + PD credits)
Type[4:0] encoding	0_0000	0_0000 (same — Fmt distinguishes)

📋 MRdLk — Memory Read Locked

MRdLk is a legacy transaction inherited from PCI’s locked transaction protocol. The header is identical to MRd except for the Type field: Type = 0_0001 instead of 0_0000. The Fmt values are the same (000 for 3DW, 001 for 4DW).

MRdLk is a legacy feature — avoid in new designs. When a switch sees an MRdLk it locks VC0 for all requesters except the one that issued the lock, preventing any other Posted writes through VC0 until an Unlock Message TLP arrives. This is a significant performance hazard in modern multi-device systems. The PCIe spec strongly discourages it — native PCIe Endpoints must not use MRdLk. It exists only for bridges to PCI/PCI-X legacy devices.

Field	MRd	MRdLk
Fmt[2:0]	000 (3DW) or 001 (4DW)	000 (3DW) or 001 (4DW) — identical
Type[4:0]	0_0000	0_0001
Header layout	Requester ID, Tag, BEs, Address	Identical — same byte positions
Completion type	CplD — completion with data	CplDLk — locked completion with data
Side effect at switches	None	Locks VC0 until Unlock Message received
Allowed on	All endpoint types	Legacy Endpoints and PCI/PCI-X bridges only
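Since the three memory request types differ only in Fmt and Type, a receiver can classify them from the first header byte alone. The sketch below shows that decode for the encodings listed in this post; the function name `decode_memory_tlp` and the returned strings are hypothetical.

```python
# (Fmt[2:0], Type[4:0]) -> description, for the memory requests covered in this post
MEMORY_TLPS = {
    (0b000, 0b00000): "MRd, 3DW header",
    (0b001, 0b00000): "MRd, 4DW header",
    (0b010, 0b00000): "MWr, 3DW header + data",
    (0b011, 0b00000): "MWr, 4DW header + data",
    (0b000, 0b00001): "MRdLk, 3DW header",
    (0b001, 0b00001): "MRdLk, 4DW header",
}

def decode_memory_tlp(byte0):
    """Classify a TLP from header byte 0: Fmt in bits [7:5], Type in bits [4:0]."""
    fmt = (byte0 >> 5) & 0x7
    tlp_type = byte0 & 0x1F
    return MEMORY_TLPS.get((fmt, tlp_type), "not a memory request covered here")

for b in (0x00, 0x20, 0x40, 0x60, 0x01, 0x21):
    print(f"0x{b:02X}: {decode_memory_tlp(b)}")
```

Fmt bit[1] (data present) and Fmt bit[0] (4DW header) can be read independently, but a lookup table keeps the MRd vs MRdLk distinction, which lives in Type, in the same place.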

📋 Payload Rules

These rules apply to all TLPs that carry a data payload (MWr, CplD, IOWr, CfgWr). They come directly from Chapter 5 of the spec:

Figure 8 — Four payload rules from the spec. These apply to MWr and any other TLP with a data payload. The Length-in-DWs requirement and the Max_Payload_Size limit are the two most commonly encountered in practice.

📋 Address Rules — Alignment and the 4 KB Boundary

The address carried in any memory TLP header must follow two hard rules:

Figure 9 — The two hard address rules. DW alignment is enforced by the reserved bits [1:0] in the header: the sender cannot express a non-DW-aligned address, and byte granularity is achieved via Byte Enables. The 4 KB no-crossing rule means a DMA engine must split any transfer that would cross a 4 KB boundary into separate TLPs, one on each side of every boundary it crosses.
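A DMA engine's splitting logic can be sketched as a loop that cuts each chunk at whichever limit comes first: the end of the transfer, the payload limit, or the next 4 KB boundary. The function name `split_transfer` and the 256-byte default payload limit are assumptions for illustration; real engines read the limit from Max_Payload_Size (or Max_Read_Request_Size for reads).

```python
def split_transfer(byte_addr, byte_count, max_payload=256):
    """Split a byte range into (address, size) chunks that are legal as single TLPs."""
    if max_payload < 4:
        raise ValueError("max_payload must be at least one DW")
    chunks = []
    while byte_count > 0:
        # Bytes left before the next 4 KB boundary
        to_boundary = 0x1000 - (byte_addr % 0x1000)
        # An unaligned start occupies part of the first DW, so fewer bytes fit
        # in a payload whose Length is counted in whole DWs
        dw_room = max_payload - (byte_addr % 4)
        size = min(byte_count, dw_room, to_boundary)
        chunks.append((byte_addr, size))
        byte_addr += size
        byte_count -= size
    return chunks

# 64 bytes starting 16 bytes below a 4 KB boundary: split exactly at 0x1000
print([(hex(a), n) for a, n in split_transfer(0x0FF0, 0x40)])
# [('0xff0', 16), ('0x1000', 48)]

# 600 bytes with a 256-byte payload limit
print([(hex(a), n) for a, n in split_transfer(0x2000, 600)])
# [('0x2000', 256), ('0x2100', 256), ('0x2200', 88)]
```

Each resulting chunk can then be turned into a DW-aligned address, Length, and Byte Enable pair before it is sent.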

📋 When to Use 3DW vs 4DW

Situation	Use	Reason
Target address ≤ 0xFFFF_FFFF (below 4 GB)	3DW (Fmt=000/010)	Shorter header saves 4 bytes per TLP. Spec says to use 3DW when address fits in 32 bits.
Target address > 0xFFFF_FFFF (≥ 4 GB)	4DW (Fmt=001/011)	The upper 32 bits of the address must be carried in DW2.
Target address is below 4 GB, but system has > 4 GB RAM	3DW	If address fits in 32 bits, use 3DW regardless of system RAM size.
Using 4DW for a <4GB address (upper DW = 0)	Undefined per spec	Spec says receiver behaviour is undefined. Some devices accept it; avoid this in practice.
MRd targeting PCIe device registers (BAR)	Depends on BAR address	32-bit BARs are always below 4 GB → 3DW. 64-bit BARs may be above 4 GB → 4DW required.

⚡ Memory TLPs in Gen 6

The semantics of memory read and write requests are unchanged in Gen 6: posted vs non-posted behaviour, the byte enable rules, the address rules, and the payload rules described in this post all still apply. One caveat on bit positions: a link running at 64 GT/s operates in Flit Mode, and PCIe 6.0 defines a revised TLP header encoding for Flit Mode. The header layouts shown in the figures above are the non-flit encoding, which is the byte pattern used on any link that is not in Flit Mode, from Gen 1 at 2.5 GT/s upward.

What flit packing means for memory TLPs

Small MRd TLPs pack efficiently. A 3DW MRd is only 12 bytes of header with no payload, so many of them fit in the 236-byte TLP area of a single 256-byte Gen 6 flit, and the flit's CRC and FEC overhead is shared across all of them.
Large MWr TLPs span flits. A MWr with 4096-byte payload (the maximum) = 16 bytes header + 4096 bytes data = 4112 bytes. At 236 TLP bytes per flit that is 4112 ÷ 236 ≈ 17.4, so the TLP spans at least 18 flits. Each flit carries its own CRC and FEC parity, so errors are detected and retried at flit granularity rather than per TLP.
Byte Enable rules are unchanged. The First and Last DW BE rules apply the same way whether the TLP travels over a Gen 1 link or is packed into a Gen 6 flit. The flit boundary is transparent to the Transaction Layer.
Max_Payload_Size is unchanged. The 4096-byte maximum payload limit is a Transaction Layer constraint. It does not change with Gen 6.
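The flit arithmetic in the points above can be checked with a few lines. This is a best-case sketch: it assumes the TLP starts at the beginning of a flit, that nothing else shares those flits, and that each flit offers 236 bytes for TLP content. The helper name `flits_spanned` is hypothetical.

```python
FLIT_TLP_BYTES = 236   # TLP bytes per 256-byte flit, as used in the text

def flits_spanned(header_bytes, payload_bytes):
    """Minimum number of flits a TLP occupies (ceiling division)."""
    total = header_bytes + payload_bytes
    return -(-total // FLIT_TLP_BYTES)

# 4DW MWr with the maximum 4096-byte payload
print(16 + 4096)                  # 4112 bytes of TLP
print(flits_spanned(16, 4096))    # 18

# 64-byte cache-line write with a 4DW header fits in one flit
print(flits_spanned(16, 64))      # 1

# Upper bound on 12-byte read requests that fit in one flit's TLP area
print(FLIT_TLP_BYTES // 12)       # 19
```

A TLP that does not start on a flit boundary can touch one more flit than this minimum, so the figures are lower bounds.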

Gen 6 performance tip for memory writes. At Gen 6 x16 speeds (~122 GB/s per direction), a single large MWr TLP with 4096-byte payload approaches the theoretical flit utilisation maximum. But many smaller MWr TLPs (e.g. 64-byte cache-line writes) pack multiple TLPs per flit, and the Gen 6 flit header amortises cleanly. The PCIe 6.0 spec’s flit-based framing was specifically designed to be efficient for both large streaming writes and small cacheline-sized transactions.

📋 Quick Reference

Item	Value / Rule
MRd Type field	0_0000 · Fmt = 000 (3DW no-data) or 001 (4DW no-data)
MWr Type field	0_0000 · Fmt = 010 (3DW with-data) or 011 (4DW with-data)
MRdLk Type field	0_0001 · same Fmt options as MRd · legacy only
3DW header size	12 bytes (DW0 + DW1 + DW2) · address in DW2 bits [31:2]
4DW header size	16 bytes (DW0 + DW1 + DW2 + DW3) · address in DW2[63:32] + DW3[31:2]
Use 3DW when	Target address fits in 32 bits (below 4 GB). Spec requires 3DW for sub-4GB addresses.
DW1 layout	Requester ID [31:16] · Tag [15:8] · Last DW BE [7:4] · First DW BE [3:0]
Requester ID	16-bit BDF of sender (Bus[15:8] · Device[7:3] · Function[2:0])
Tag	8-bit (256 values) with Extended Tag Field Enable · 10-bit with 10-Bit Tag Requester Enable · echoed in CplD · unique per in-flight request
First DW BE	Byte 7 bits [3:0] · bit=1 means byte valid · selects which bytes in the first DW are active
Last DW BE	Byte 7 bits [7:4] · must be 0x0 when Length=1 · bit=1 means byte valid in last DW
BE discontinuous	Allowed when Length=1, and for QW-aligned requests with Length=2. For Length≥3, First and Last BEs must be contiguous bit patterns.
Address alignment	Always DW-aligned — bits [1:0] of address are reserved (always 00). Byte offset via Byte Enables.
4 KB rule	No single TLP may cross a 4096-byte (0x1000) boundary. Split at the page boundary if needed.
Max payload	1 to 1024 DW (4 to 4096 bytes) · limited by Max_Payload_Size in Device Control register
MWr posted?	Yes — no completion returns. DLL ACK DLLP confirms per-hop delivery only.
MRd posted?	No — CplD returns with requested data and the echoed Tag value.
Gen 6 impact	Request types, BE rules, address rules, and payload rules are unchanged. Flits repackage TLPs for transport, and links in Flit Mode use the PCIe 6.0 Flit Mode header encoding for the same fields.

Coming next — PCIe-07: Completion TLPs — the full Cpl and CplD header, Completer ID, Completion Status codes (SC/UR/CRS/CA), Byte Count, Lower Address, split-completion reassembly, and Completion Timeout.