PCIe Series — PCIe-02: Architecture, Topology and Components — VLSI Trainers
Architecture — Topology and Components
Root Complex internals and how they appear to software, the Switch’s internal virtual bus structure, BDF addressing, how every port sees the same three layers, transaction types, and a complete step-by-step bus number enumeration walk-through.
📋 The Tree Rule and Why It Exists
PCIe topologies must be strict trees — no loops, no rings, no meshes. Every node connects to exactly one upstream parent and zero or more downstream children. This is a deliberate constraint inherited from PCI, and it exists for one reason: software backward compatibility.
PCI’s configuration software uses a simple recursive depth-first algorithm to enumerate buses. The algorithm assigns bus numbers in the order it discovers bridges. It works perfectly on trees. It breaks completely on graphs with loops. Rather than require a new OS to be written for PCIe, the PCISIG preserved the tree constraint — and because the constraint is preserved, every PCI configuration driver ever written works unmodified on a PCIe system.
No loops is a hard protocol requirement, not a convention. Two devices connected by two links (a loop of length 2) would give a packet more than one path to its destination, so transaction routing becomes ambiguous and a forwarded packet could circulate forever. The tree constraint is enforced at the protocol level: Switches simply have no routing logic to handle loop topologies.
📋 Bus, Device, Function — BDF Addressing
Every PCIe Function is uniquely identified by a 16-bit BDF address: 8 bits of Bus Number, 5 bits of Device Number, and 3 bits of Function Number.
Figure 1 — BDF (Bus/Device/Function) address structure. The 16-bit register is split 8+5+3. Each cell in the top row is one bit. The worked example decodes BDF 03:1C.5 step by step. In practice, almost every PCIe endpoint is at Bus N / Device 0 / Function 0 because each PCIe link connects exactly one device.
Bus Number (8 bits) — up to 256 buses in one domain (0–255). Bus 0 is always the Root Complex’s internal virtual bus. Configuration software assigns all other bus numbers during enumeration.
Device Number (5 bits) — up to 32 device slots per bus. Because PCIe links are point-to-point, a PCIe endpoint always ends up as Device 0 on its bus. The Root Complex and Switches have virtual buses that can have multiple device slots for their embedded ports.
Function Number (3 bits) — up to 8 functions per device. Most devices have only Function 0. Multi-function devices (a network card with a storage controller on the same chip, for example) can expose multiple functions, each with its own configuration space.
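The BDF is called the Requester ID when it identifies the function that issued a request, and the Completer ID when it identifies the function that answers it. A request TLP carries the Requester ID so the completer knows where to send the completion. A completion TLP carries both the Completer ID and the original Requester ID, and the requester matches it to its outstanding request using the Requester ID together with the Tag. Tags, a separate 8-bit or 10-bit field, distinguish multiple in-flight requests from the same function.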
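The 8+5+3 split is easy to check with a few shifts and masks. The following is a minimal sketch; the function names are illustrative and not part of any real PCIe library.

```python
def encode_bdf(bus: int, device: int, function: int) -> int:
    """Pack Bus/Device/Function into the 16-bit ID (8 + 5 + 3 bits)."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function


def decode_bdf(bdf: int):
    """Split a 16-bit ID back into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


def format_bdf(bdf: int) -> str:
    """Render the ID in the usual bus:device.function notation."""
    bus, device, function = decode_bdf(bdf)
    return f"{bus:02X}:{device:02X}.{function}"


# The worked example from Figure 1: bus 0x03, device 0x1C, function 5.
example = encode_bdf(0x03, 0x1C, 5)
print(hex(example), format_bdf(example))
```

The same 16-bit value is what travels in a TLP header as the Requester ID or Completer ID.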
📋 Root Complex Internals
From the outside, the Root Complex looks like a single block at the top of the PCIe tree. From the inside — and from software’s perspective — it is a collection of bridges sharing a virtual internal bus.
Figure 2 — Root Complex internals as software sees them. The CPU die, memory controller, and PCH may all be part of the Root Complex. The internal virtual Bus 0 hosts virtual PCI-to-PCI bridges (Root Ports) — each one opens up a new downstream bus.
What the spec says about the RC
The PCIe spec deliberately does not fully define the Root Complex internals. It gives a list of required and optional behaviours but leaves the implementation to the vendor. In broad terms the RC must:
Present a virtual PCI Bus 0 to configuration software, with Root Ports appearing as PCI-to-PCI bridges
Generate and accept PCIe transactions on behalf of the CPU and system memory
Support the full TLP routing and flow-control mechanisms
Where peer-to-peer isolation is needed, implement ACS (Access Control Services), which the specification defines as an optional capability
In modern systems the RC is split across the CPU package (memory controller, high-bandwidth root ports for GPU) and the Platform Controller Hub or I/O Hub (USB, SATA, lower-bandwidth PCIe slots). The CPU ↔ PCH link (Intel’s DMI, AMD’s equivalent) is itself based on PCIe electricals. Both sides together are the RC from software’s point of view.
📋 Switch Internal Structure
A Switch looks deceptively simple from the outside. From the inside it is a collection of PCIe ports connected by an internal virtual bus — and that internal bus is what software enumerates as PCI bridges.
Figure 3 — Switch internal structure. The upstream port connects to the RC or parent switch. Downstream ports each open a new bus. The internal virtual Bus 2 is what software enumerates — it sees multiple P2P bridges at Device 0/1/2/3 on Bus 2.
How a Switch routes a packet
When a TLP arrives at the Switch’s upstream port, the switch does the following:
Memory/IO TLP: checks the target address against each downstream port’s Base/Limit registers (programmed during enumeration). If the address falls within a port’s range, the packet is forwarded out that port.
Config TLP (Type 1): the TLP header contains a Bus Number. The switch checks if that bus is downstream of one of its ports and forwards it on. When the target bus is the secondary bus of a downstream bridge, the Type 1 config TLP becomes a Type 0 config TLP on exit.
Completion TLP: routed by Requester ID. The bus number in the Requester ID field tells the switch which port leads back toward the original requester.
Message TLP: routed based on the routing field in the message header — some go up (to Root), some go down (broadcast), some go to a specific address.
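The first two routing rules above can be sketched in a few lines. This is a simplified model rather than real switch logic: the `DownstreamPort` fields, the port names, and the address windows are all hypothetical, and only traffic arriving from upstream is handled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DownstreamPort:
    name: str
    mem_base: int      # inclusive start of the memory window (Base register)
    mem_limit: int     # inclusive end of the memory window (Limit register)
    secondary: int     # bus number directly below this port
    subordinate: int   # highest bus number reachable below this port


def route_memory(ports, address: int) -> Optional[str]:
    """Address routing: pick the port whose Base/Limit window holds the address."""
    for port in ports:
        if port.mem_base <= address <= port.mem_limit:
            return port.name
    return None  # no downstream port claims the address


def route_config(ports, target_bus: int) -> Optional[Tuple[str, int]]:
    """Route a Type 1 config request; returns (port name, config type on exit)."""
    for port in ports:
        if port.secondary <= target_bus <= port.subordinate:
            # Type 1 becomes Type 0 only when the target is the port's own secondary bus.
            cfg_type = 0 if target_bus == port.secondary else 1
            return port.name, cfg_type
    return None


# Hypothetical windows and bus ranges, as enumeration might have programmed them.
PORTS = [
    DownstreamPort("DP0", 0xA0000000, 0xA00FFFFF, secondary=3, subordinate=3),
    DownstreamPort("DP1", 0xA0100000, 0xA01FFFFF, secondary=4, subordinate=4),
    DownstreamPort("DP2", 0xA0200000, 0xA02FFFFF, secondary=5, subordinate=7),
]

print(route_memory(PORTS, 0xA0100040))
print(route_config(PORTS, 4), route_config(PORTS, 6))
```

A request for bus 6 leaves DP2 still as Type 1, because another bridge further down has to finish the job.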
Why does a Switch need all three PCIe layers? A common question. You might think a Switch just needs to look at the header and forward. But “looking at the header” means decoding a TLP — which is a Transaction Layer operation. The Data Link Layer and Physical Layer are needed to reliably receive and retransmit the packet on each link. So every switch port implements the full three-layer stack.
📋 Endpoint Types
| Type | Uses I/O Space? | Locked Requests? | Typical Devices |
| --- | --- | --- | --- |
| Native PCIe Endpoint | No (MMIO only) | No | GPU, NVMe SSD, modern NIC, AI accelerator, FPGA |
| Legacy PCIe Endpoint | Yes (for backward compat) | Yes (for legacy support) | Older PCI-X devices with PCIe interface bolted on |
| Root Complex Integrated Endpoint | Optional | No | USB controller, SATA controller, audio integrated into RC |
The endpoint type is declared in the Device/Port Type field of the PCI Express Capabilities register, which is part of the PCIe Capability structure in configuration space. Software reads this field to understand what the device is and what it can do. Modern devices designed specifically for PCIe (Gen 3 and later) are almost always Native PCIe Endpoints and use only memory-mapped I/O.
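Reading that field comes down to a shift and a mask. The sketch below assumes the layout of the PCI Express Capabilities register as recalled from the PCIe Base Specification (Device/Port Type in bits 7:4); the encodings are not given in this article and should be checked against the spec before relying on them.

```python
# Device/Port Type encodings (bits 7:4 of the PCI Express Capabilities register).
# Assumed from the PCIe Base Specification; verify before use.
DEVICE_PORT_TYPES = {
    0x0: "PCI Express Endpoint",
    0x1: "Legacy PCI Express Endpoint",
    0x4: "Root Port",
    0x5: "Switch Upstream Port",
    0x6: "Switch Downstream Port",
    0x7: "PCI Express to PCI/PCI-X Bridge",
    0x8: "PCI/PCI-X to PCI Express Bridge",
    0x9: "Root Complex Integrated Endpoint",
    0xA: "Root Complex Event Collector",
}


def device_port_type(pcie_caps: int) -> str:
    """Decode the Device/Port Type field from a 16-bit register value."""
    code = (pcie_caps >> 4) & 0xF
    return DEVICE_PORT_TYPES.get(code, f"Reserved ({code:#x})")


# Example register values (hypothetical): capability version 2 in bits 3:0.
for value in (0x0002, 0x0012, 0x0092, 0x0162):
    print(f"{value:#06x}: {device_port_type(value)}")
```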
📋 Every Port Implements All Three Layers
This applies without exception — Root Ports, Switch upstream ports, Switch downstream ports, and Endpoint ports all implement the Transaction Layer, Data Link Layer, and Physical Layer. The logic is the same regardless of the device type; what varies is what the Transaction Layer does with the decoded TLP (route it, deliver it to the device core, or generate a completion).
Figure 4 — Every PCIe port type implements all three layers. The layer logic is identical; what changes is what the Transaction Layer does with a decoded TLP — route it, deliver it, or generate a completion from it.
📋 Transaction Types — Posted vs Non-Posted
Every PCIe transaction is either posted or non-posted. The distinction drives fundamental protocol differences.
| Transaction Type | Posted or Non-Posted? | Completion returned? | Why? |
| --- | --- | --- | --- |
| MRd (Memory Read) | Non-Posted | Yes: CplD with data | Requester needs the data back |
| MWr (Memory Write) | Posted | No | Fire-and-forget: data is delivered eventually. No need to wait. |
| IORd (I/O Read) | Non-Posted | Yes: CplD with data | Legacy: I/O space only on Legacy Endpoints |
| IOWr (I/O Write) | Non-Posted | Yes: Cpl (no data) | Must confirm write reached target before next I/O step |
| CfgRd (Config Read) | Non-Posted | Yes: CplD with data | BIOS/OS needs the register value back |
| CfgWr (Config Write) | Non-Posted | Yes: Cpl (no data) | Must confirm write before programming next register |
| Msg (Message) | Posted | No | INTx/PME/error signalling: informational, no ack needed |
Why are Config and I/O writes non-posted? During boot, the BIOS programs configuration registers in strict sequence — write BAR, then write Command register, then enable device. If writes were posted (fire-and-forget), the BIOS could not know when each write had actually landed and the ordering guarantees collapse. Non-posted writes return a completion confirming the write reached the target — only then does software proceed to the next step.
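The table and the boot-time sequencing argument can be expressed as a small lookup plus a loop that refuses to move on until each write is confirmed. This is an illustrative sketch only: `send_cfg_write` is a hypothetical stand-in for whatever mechanism actually issues the request and waits for its completion.

```python
# (posted?, completion TLP expected) per request type, transcribed from the table above.
TRANSACTIONS = {
    "MRd":   (False, "CplD"),
    "MWr":   (True,  None),
    "IORd":  (False, "CplD"),
    "IOWr":  (False, "Cpl"),
    "CfgRd": (False, "CplD"),
    "CfgWr": (False, "Cpl"),
    "Msg":   (True,  None),
}


def expected_completion(tlp_type: str):
    """Return the completion a requester should wait for, or None if posted."""
    posted, completion = TRANSACTIONS[tlp_type]
    return None if posted else completion


def run_config_sequence(writes, send_cfg_write):
    """Issue config writes one at a time, proceeding only after each Cpl arrives."""
    confirmed = []
    for register, value in writes:
        completion = send_cfg_write(register, value)  # returns once the completion is back
        if completion != expected_completion("CfgWr"):
            raise RuntimeError(f"write to {register} was not confirmed")
        confirmed.append(register)
    return confirmed


# A fake device that always completes successfully, for demonstration.
print(run_config_sequence([("BAR0", 0xA0000000), ("Command", 0x0006)],
                          lambda register, value: "Cpl"))
```

With a posted write there would be nothing for the loop to wait on, which is exactly why configuration writes are not posted.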
📋 All TLP Types
| Abbreviation | Full Name | Header Size | Notes |
| --- | --- | --- | --- |
| MRd | Memory Read Request | 3 DW (32-bit) / 4 DW (64-bit) | Most common request; non-posted |
| MRdLk | Memory Read Locked | 3 DW / 4 DW | Legacy atomic read-modify-write; RC only |
| MWr | Memory Write Request | 3 DW / 4 DW + payload | Posted; no completion |
| IORd | I/O Read | 3 DW | Legacy endpoints only |
| IOWr | I/O Write | 3 DW + payload | Non-posted; legacy endpoints only |
| CfgRd0 | Config Read Type 0 | 3 DW | For device on the target bus (endpoint/bridge) |
| CfgRd1 | Config Read Type 1 | 3 DW | For device on a bus further downstream; routed by bus number |
| CfgWr0 | Config Write Type 0 | 3 DW + payload | Non-posted; write to local device config registers |
| CfgWr1 | Config Write Type 1 | 3 DW + payload | Forwarded downstream toward the target bus |
| Msg | Message (no data) | 4 DW | INTx, PME, slot events, vendor messages |
| MsgD | Message with Data | 4 DW + payload | Vendor-defined messages with data payload |
| Cpl | Completion (no data) | 3 DW | Response to config/IO writes; status only |
| CplD | Completion with Data | 3 DW + payload | Response to reads; carries requested data |
| CplLk | Locked Completion (no data) | 3 DW | Response to MRdLk when no data returned |
| CplDLk | Locked Completion with Data | 3 DW + payload | Response to MRdLk with data |
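The header size and payload columns above are not arbitrary: both can be read straight out of the first byte of the TLP header. The sketch below assumes the Fmt[2:0] and Type[4:0] encodings as recalled from the PCIe Base Specification; they are not listed in this article, so treat the constants as assumptions to verify.

```python
def header_shape(fmt: int):
    """Return (header length in DW, has payload) for Fmt values 0b000 to 0b011."""
    if fmt > 0b011:
        raise ValueError("Fmt 0b100 marks a TLP Prefix; higher values are reserved")
    header_dw = 4 if fmt & 0b001 else 3   # bit 0 selects the 4 DW header
    has_data = bool(fmt & 0b010)          # bit 1 says a payload follows
    return header_dw, has_data


# (has payload, Type field) -> abbreviation. Assumed encodings; check the spec.
TYPE_NAMES = {
    (False, 0b00000): "MRd",    (True, 0b00000): "MWr",
    (False, 0b00001): "MRdLk",
    (False, 0b00010): "IORd",   (True, 0b00010): "IOWr",
    (False, 0b00100): "CfgRd0", (True, 0b00100): "CfgWr0",
    (False, 0b00101): "CfgRd1", (True, 0b00101): "CfgWr1",
    (False, 0b01010): "Cpl",    (True, 0b01010): "CplD",
    (False, 0b01011): "CplLk",  (True, 0b01011): "CplDLk",
}


def classify(first_byte: int):
    """Decode byte 0 of a TLP header into (name, header DW, has payload)."""
    fmt, tlp_type = first_byte >> 5, first_byte & 0x1F
    header_dw, has_data = header_shape(fmt)
    if tlp_type >> 3 == 0b10:             # 0b10rrr: message, rrr is the routing field
        name = "MsgD" if has_data else "Msg"
    else:
        name = TYPE_NAMES.get((has_data, tlp_type), "unknown")
    return name, header_dw, has_data


for byte in (0x00, 0x60, 0x04, 0x4A, 0x30):
    print(f"{byte:#04x}: {classify(byte)}")
```

Note how MRd and MWr share the same Type value and differ only in the Fmt bit that signals a payload.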
▶ Memory Read — End-to-End Example
Let’s walk through an NVMe SSD reading data from system memory. The SSD is behind a Switch. This example shows every hop and every packet.
Figure 5 — Memory read flow. The MRd travels upstream hop by hop; each hop sends its own ACK DLLP back (dashed). The RC fetches data from RAM and returns a CplD downstream; each hop again ACKs independently. The Tag (5) links the completion to the original request.
The Tag is what prevents confusion with multiple outstanding requests. With 8-bit tags the NVMe SSD can have up to 256 reads in flight simultaneously, and on the order of a thousand when 10-bit tags are enabled. Each one has a different Tag value. When CplD packets come back in any order, the Tag tells the SSD exactly which original request each completion is answering.
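A requester's bookkeeping for this can be pictured as a table keyed by Tag. The sketch below is a simplified model with hypothetical names; it assumes one completion per request, whereas a real read may be answered by several CplD packets.

```python
class TagTracker:
    """Tracks outstanding non-posted requests for one function, keyed by Tag."""

    def __init__(self, tag_bits: int = 8):
        self.free_tags = list(range(2 ** tag_bits))
        self.outstanding = {}              # tag -> description of the request

    def issue(self, description: str) -> int:
        """Allocate a free Tag for a new request."""
        if not self.free_tags:
            raise RuntimeError("all tags in flight; the requester must wait")
        tag = self.free_tags.pop(0)
        self.outstanding[tag] = description
        return tag

    def complete(self, tag: int) -> str:
        """Match a returning completion to its request and recycle the Tag."""
        description = self.outstanding.pop(tag)
        self.free_tags.append(tag)
        return description


tracker = TagTracker()
tag_a = tracker.issue("read block A")
tag_b = tracker.issue("read block B")
tag_c = tracker.issue("read block C")

# Completions arrive out of order; the Tag still identifies each one.
print(tracker.complete(tag_b))
print(tracker.complete(tag_a))
```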
▶ Memory Write — Posted Example
A GPU DMA engine writing rendered frame data to system memory. Posted — no completion comes back.
Figure 6 — Posted memory write. The GPU fires the MWr and immediately continues. The link-level ACK DLLP (dashed) confirms link delivery but is not a TLP completion — the GPU’s Transaction Layer never receives feedback. This is intentional — the performance benefit outweighs the error-reporting limitation.
📋 ACK/NAK Protocol — Per-Hop Reliability
PCIe’s Data Link Layer guarantees delivery of every TLP between adjacent devices using ACK and NAK DLLPs. This is important to understand: it is per-hop, not end-to-end.
Figure 7: ACK/NAK protocol. The transmitter keeps a copy in its replay buffer until ACK arrives. ACK means “flush this and all older packets”. NAK means “everything up to this sequence number arrived intact; replay whatever follows”. Each PCIe link runs this protocol independently.
The Sequence Number is 12 bits (0–4095), wrapping around. Transmitters use it to identify which TLPs the receiver is acknowledging or rejecting.
An ACK for sequence number N means “I have received everything up to and including N — you can discard all of those from your replay buffer.”
A NAK for sequence number N means “everything up to and including N arrived intact, but a later TLP was bad: please resend from N+1 onwards.”
If no ACK or NAK arrives within a timeout period, the transmitter replays all unacknowledged TLPs.
In Gen 6, the replay mechanism is modified to work within the flit-based framing model — but the fundamental ACK/NAK logic is preserved.
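The replay buffer behaviour can be modelled in a few lines. This is a minimal sketch with hypothetical names, not the Data Link Layer state machine itself. It assumes a NAK carries the sequence number of the last TLP received correctly (the convention in the PCIe Base Specification), so replay starts with the TLP after it.

```python
from collections import deque

SEQ_MOD = 4096  # 12-bit sequence numbers wrap from 4095 back to 0


class ReplayBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = deque()             # (sequence number, TLP), oldest first

    def transmit(self, tlp):
        """Send a TLP and keep a copy until it is acknowledged."""
        seq = self.next_seq
        self.pending.append((seq, tlp))
        self.next_seq = (seq + 1) % SEQ_MOD
        return seq

    def ack(self, seq):
        """Discard every stored TLP up to and including `seq`."""
        while self.pending and self._covered(self.pending[0][0], seq):
            self.pending.popleft()

    def nak(self, seq):
        """Purge up to the last good TLP `seq`, then return the TLPs to replay."""
        self.ack(seq)
        return [tlp for _, tlp in self.pending]

    def timeout(self):
        """No ACK or NAK arrived in time: replay everything still unacknowledged."""
        return [tlp for _, tlp in self.pending]

    @staticmethod
    def _covered(entry_seq, acked_seq):
        # Modular comparison so the check still works across the wrap.
        return (acked_seq - entry_seq) % SEQ_MOD < SEQ_MOD // 2


link = ReplayBuffer()
for name in ("A", "B", "C", "D"):
    link.transmit(name)
link.ack(1)            # A and B are safely across the link
print(link.nak(2))     # C was fine, so only D is replayed
```

Because each link keeps its own buffer and sequence counter, a Switch in the path runs two independent copies of this logic, one per port.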
📋 Bus Enumeration — Step-by-Step Walk-Through
Enumeration is the process by which BIOS/OS firmware discovers the PCIe topology and assigns bus numbers to every bus segment. It uses a depth-first search — it always goes as deep as possible before backtracking.
Figure 8 — Result of depth-first enumeration. Numbers are assigned in the order the algorithm digs: RC → Port A → Switch → Bus 2 → NVMe (Bus 3) → NIC (Bus 4) → Bridge (Bus 5) → backtrack → RC Port B → GPU (Bus 6). Subordinate numbers are set after all downstream buses are known.
1. Assign Bus 0 to the Root Complex
Hardware hardcodes Bus 0. Software writes 0 into the Root Port’s Primary Bus Number register, then scans Bus 0 by sending CfgRd0 requests to every Device/Function combination.
2. Find Port A (Dev 1), a bridge → set Secondary Bus = 1, Subordinate = 255 (placeholder)
Software writes Primary=0, Secondary=1, Subordinate=255 to the bridge. Subordinate=255 is a temporary maximum that allows configuration traffic to flow downstream during discovery.
3. Scan Bus 1: find Switch upstream port → assign Bus 2 (virtual internal bus)
The switch’s upstream port presents itself as a bridge at Bus 1/Dev 0. Software assigns bus number 2 to the switch’s internal bus. The switch upstream bridge gets Primary=1, Secondary=2, Subordinate=255.
4. Scan Bus 2: find Switch downstream ports (Dev 0, Dev 1, Dev 2) and go deep into each
Downstream port 0 gets Secondary=3 (NVMe). NVMe is an endpoint, so there are no further buses. Downstream port 1 gets Secondary=4 (NIC). NIC is an endpoint. Downstream port 2 gets Secondary=5 (PCIe→PCI bridge), and PCI bus 5 is enumerated separately.
5. Backtrack: set Subordinate numbers correctly now that all buses are known
Switch’s upstream bridge: Subordinate=5 (highest bus under it). Port A bridge on Bus 0: Subordinate=5. Then scan Port B (Dev 2) → GPU endpoint at Bus 6. Port B bridge: Primary=0, Secondary=6, Subordinate=6.
Subordinate = 255 is the key to depth-first search. Setting Subordinate=255 temporarily lets configuration TLPs flow all the way downstream during discovery. Once all downstream buses are known, the correct Subordinate value is written. A bridge forwards a configuration TLP downstream only if the target bus number lies within its [Secondary, Subordinate] range; anything outside that range is not forwarded, and this is how bus-number routing works.
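The walk-through above maps onto a short recursive function. The sketch below is a toy model, not firmware: the `Node` class and helper names are hypothetical, the topology mirrors Figure 8, and the PCIe→PCI bridge is treated as a leaf, as in the walk-through.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    is_bridge: bool = False
    children: list = field(default_factory=list)   # devices on the secondary bus
    primary: Optional[int] = None
    secondary: Optional[int] = None
    subordinate: Optional[int] = None


def bridge(name, *children):
    return Node(name, is_bridge=True, children=list(children))


def enumerate_bus(devices, bus, next_free):
    """Scan `bus` depth-first and return the next unused bus number."""
    for dev in devices:
        if not dev.is_bridge:
            continue                       # endpoint: no bridge, stops there
        dev.primary = bus
        dev.secondary = next_free
        dev.subordinate = 255              # placeholder so config traffic can flow down
        next_free = enumerate_bus(dev.children, dev.secondary, next_free + 1)
        dev.subordinate = next_free - 1    # highest bus actually found below
    return next_free


dp0 = bridge("DP0", Node("NVMe"))
dp1 = bridge("DP1", Node("NIC"))
dp2 = bridge("DP2", Node("PCIe-to-PCI bridge"))
switch_up = bridge("Switch upstream", dp0, dp1, dp2)
port_a = bridge("Port A", switch_up)
port_b = bridge("Port B", Node("GPU"))

enumerate_bus([port_a, port_b], bus=0, next_free=1)
for b in (port_a, switch_up, dp0, dp1, dp2, port_b):
    print(f"{b.name}: Primary={b.primary} Secondary={b.secondary} Subordinate={b.subordinate}")
```

The Subordinate value is corrected as each recursive call returns, which is the backtracking of step 5.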
⚡ Gen 6 Topology Considerations
The topology model — tree, BDF addressing, root/switch/endpoint roles, enumeration algorithm — is unchanged in Gen 6. What Gen 6 changes is the physical link speed and framing, not the topology or software model.
However, Gen 6’s higher bandwidth and flit-based framing introduce some practical topology considerations:
Retimer requirements. At 64 GT/s PAM4, channel loss limits trace length to approximately 4–6 inches between a CPU root port and a device. Longer channels require PCIe Retimers: active devices that recover the incoming data and retransmit a regenerated signal. Retimers are transparent to software (they have no BDF) but add latency. In data centre and AI systems, 1–2 retimers per link are common.
CXL and Gen 6. CXL (Compute Express Link) 3.0 runs on the PCIe 6.0 PHY. The topology management for CXL 3.0 includes multi-headed and fabric topologies that go beyond PCIe’s strict tree — but the PCIe Gen 6 electrical layer is the same.
No topology changes for software. A Gen 6 GPU still appears in the same BDF at Bus N/Dev 0/Function 0 as it always has. The enumeration algorithm is identical. Existing drivers work without change.
Flit-mode impact on Switches. A Gen 6 Switch internally operates in flit mode — packing and unpacking flits as TLPs are routed between ports. The TLP routing decisions are identical, but the data is physically carried in 256-byte flit containers with FEC protection on each link.
📋 Quick Reference
| Concept | Key Point |
| --- | --- |
| Tree topology | Mandatory: no loops. Preserves software compatibility with PCI’s simple depth-first enumeration algorithm. |
| BDF | 8-bit Bus + 5-bit Device + 3-bit Function = 16-bit unique address. Always Device 0 on real PCIe links; multiple devices only on virtual buses. |
| Requester ID | BDF of the sender embedded in the TLP header. Completions use it to route back to the originator. |
| Tag | 8-bit or 10-bit field distinguishing multiple in-flight non-posted requests from the same function. |
| Root Complex (software view) | Bus 0 with P2P bridges (Root Ports) at Device 1, 2, 3… Each Root Port opens a new downstream bus. |
| Switch (software view) | Collection of P2P bridges sharing a virtual internal bus. The upstream port is a bridge whose secondary side is the internal bus; each downstream port is a bridge whose primary side is the internal bus and whose secondary side opens a new bus. |
| Posted transaction | MWr, Msg: no TLP completion returned. Link-level ACK DLLP still sent per hop. |
| Non-Posted transaction | MRd, IORd, IOWr, CfgRd, CfgWr: completion TLP returned from target to requester. |
| ACK/NAK | Per-hop, not end-to-end. Each link runs independently. Replay buffer holds copies until ACK received. |
| Enumeration | Depth-first search from Bus 0. Set Subordinate=255 going down; correct it on the way back up. Endpoint = no bridge, stops there. |
| Gen 6 topology impact | No software changes. Retimers may be needed on long channels. Flit mode is handled internally per link. |
Coming next: PCIe-03 covers The Three-Layer Model in Detail — a deep dive into the Transaction Layer’s virtual channel management and flow control credit types, the Data Link Layer’s ACK/NAK state machine, and the Physical Layer’s logical sub-block responsibilities across Gen 1 through Gen 6.