PCIe Series — PCIe-02: Architecture, Topology and Components — VLSI Trainers

Architecture — Topology and Components

Root Complex internals and how they appear to software, the Switch’s internal virtual bus structure, BDF addressing, how every port sees the same three layers, transaction types, and a complete step-by-step bus number enumeration walk-through.

📋 The Tree Rule and Why It Exists

PCIe topologies must be strict trees — no loops, no rings, no meshes. Every node connects to exactly one upstream parent and zero or more downstream children. This is a deliberate constraint inherited from PCI, and it exists for one reason: software backward compatibility.

PCI’s configuration software uses a simple recursive depth-first algorithm to enumerate buses. The algorithm assigns bus numbers in the order it discovers bridges. It works perfectly on trees. It breaks completely on graphs with loops. Rather than require a new OS to be written for PCIe, the PCI-SIG preserved the tree constraint, and because the constraint is preserved, existing PCI configuration software works unmodified on a PCIe system.

No loops is a hard protocol requirement, not a convention. Two devices connected by two links (a loop of length 2) would give a packet more than one path to its destination, so transaction routing would become ambiguous and a misrouted packet could circulate indefinitely. The tree constraint is enforced at the protocol level: Switches simply have no routing logic to handle loop topologies.

📋 Bus, Device, Function — BDF Addressing

Every PCIe Function is uniquely identified by a 16-bit BDF address: 8 bits of Bus Number, 5 bits of Device Number, and 3 bits of Function Number.

[Figure 1 diagram: 16-bit BDF layout, Bus:Device.Function. Bus = bits [15:8] (8 bits, values 0x00–0xFF, 256 buses; Bus 0 is the Root Complex internal bus, hardwired). Device = bits [7:3] (5 bits, values 0–31, 32 device slots; always 0 on real point-to-point PCIe links). Function = bits [2:0] (3 bits, values 0–7, up to 8 functions per device). Worked example: BDF 03:1C.5 → Bus 3 (0000 0011), Device 28 (1 1100), Function 5 (101). Most endpoints are 01:00.0, that is Bus 1, Device 0, Function 0.]
Figure 1 — BDF (Bus/Device/Function) address structure. The 16-bit register is split 8+5+3. Each cell in the top row is one bit. The worked example decodes BDF 03:1C.5 step by step. In practice, almost every PCIe endpoint is at Bus N / Device 0 / Function 0 because each PCIe link connects exactly one device.

The BDF is called the Requester ID when it is embedded in a request TLP, so the completer knows where to send the completion. A completion TLP carries that same Requester ID, which routes it back to the originator, plus a Completer ID, which is the BDF of the function that responded. Tags, a separate 8-bit or 10-bit field, distinguish multiple in-flight requests from the same function; the requester matches each completion to its outstanding request by Tag.
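The 8+5+3 bit split can be sketched in a few lines of Python. The function names here (`encode_bdf`, `decode_bdf`, `format_bdf`) are illustrative helpers, not part of any PCIe library; the sketch only packs and unpacks the bit fields described above and reproduces the 03:1C.5 worked example.

```python
from typing import Tuple


def encode_bdf(bus: int, device: int, function: int) -> int:
    """Pack Bus/Device/Function into the 16-bit layout [15:8] / [7:3] / [2:0]."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus is 8 bits, device is 5 bits, function is 3 bits")
    return (bus << 8) | (device << 3) | function


def decode_bdf(bdf: int) -> Tuple[int, int, int]:
    """Split a 16-bit BDF back into (bus, device, function)."""
    if not 0 <= bdf <= 0xFFFF:
        raise ValueError("a BDF is a 16-bit value")
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


def format_bdf(bdf: int) -> str:
    """Render a BDF in the conventional bus:device.function notation."""
    bus, device, function = decode_bdf(bdf)
    return f"{bus:02x}:{device:02x}.{function}"


# Worked example from Figure 1: Bus 3, Device 28 (0x1C), Function 5.
example = encode_bdf(3, 0x1C, 5)
print(hex(example), format_bdf(example))
```

The same 16-bit value is what a request TLP carries as its Requester ID.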

📋 Root Complex Internals

From the outside, the Root Complex looks like a single block at the top of the PCIe tree. From the inside — and from software’s perspective — it is a collection of bridges sharing a virtual internal bus.

[Figure 2 diagram: the Root Complex boundary may span the CPU and PCH/IOH chips. The CPU die holds the CPU cores, the memory controller (attached to DDR/LPDDR memory) and the PCIe Root Ports (x16 GPU). Internal virtual Bus 0, which appears as PCI Bus 0 to software, carries the Host Bridge and three P2P bridges: Root Port A at Dev 1 (x16 link to a GPU endpoint at Bus 1, Dev 0, Fn 0), Root Port B at Dev 2 (x8 link to a Switch, upstream on Bus 2), and the DMI/PCH port at Dev 3 (DMI/PCIe link to the PCH/IOH, which provides USB, SATA and PCIe slots). Software sees Bus 0 with three P2P bridges at Dev 1/2/3, scans each, and discovers Bus 1 (GPU), Bus 2 (Switch) and Bus N (PCH).]
Figure 2 — Root Complex internals as software sees them. The CPU die, memory controller, and PCH may all be part of the Root Complex. The internal virtual Bus 0 hosts virtual PCI-to-PCI bridges (Root Ports) — each one opens up a new downstream bus.

What the spec says about the RC

The PCIe spec deliberately does not fully define the Root Complex internals. It gives a list of required and optional behaviours but leaves the implementation to the vendor. In broad terms the RC must:

In modern systems the RC is split across the CPU package (memory controller, high-bandwidth root ports for GPU) and the Platform Controller Hub or I/O Hub (USB, SATA, lower-bandwidth PCIe slots). The CPU ↔ PCH link (Intel’s DMI, AMD’s equivalent) is itself based on PCIe electricals. Both sides together are the RC from software’s point of view.

📋 Switch Internal Structure

A Switch looks deceptively simple from the outside. From the inside it is a collection of PCIe ports connected by an internal virtual bus — and that internal bus is what software enumerates as PCI bridges.

[Figure 3 diagram: the upstream port faces the Root Complex over an x8 link (Bus 1) and acts as a P2P bridge whose secondary side is the Switch's internal virtual Bus 2. That bus is invisible to external PCIe and is seen only by configuration software. Four downstream ports sit on Bus 2 as P2P bridges at Dev 0 to Dev 3: Downstream Port 0 → x4 link, Bus 3, NVMe SSD (Bus 3, Dev 0, Fn 0); Downstream Port 1 → x4 link, Bus 4, 10GbE NIC (Bus 4, Dev 0, Fn 0); Downstream Port 2 → x4 link, Bus 5, FPGA card (Bus 5, Dev 0, Fn 0); Downstream Port 3 → x1 link, Bus 6, PCIe→PCI bridge (Bus 6, Dev 0, Fn 0). The Switch routes TLPs by examining the target address or Requester ID and selecting the correct downstream port.]
Figure 3 — Switch internal structure. The upstream port connects to the RC or parent switch. Downstream ports each open a new bus. The internal virtual Bus 2 is what software enumerates — it sees multiple P2P bridges at Device 0/1/2/3 on Bus 2.

How a Switch routes a packet

When a TLP arrives at the Switch’s upstream port, the switch does the following:

Why does a Switch need all three PCIe layers? A common question. You might think a Switch just needs to look at the header and forward. But “looking at the header” means decoding a TLP — which is a Transaction Layer operation. The Data Link Layer and Physical Layer are needed to reliably receive and retransmit the packet on each link. So every switch port implements the full three-layer stack.
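As a rough illustration of the ID-routed case (completions and configuration requests), the downstream-port choice can be sketched as a range check on each port's bus numbers. The `DownstreamPort` class and `route_by_id` function are hypothetical names for this sketch, the bus numbers are taken from Figure 3, and address-routed TLPs, which are matched against address windows instead, are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DownstreamPort:
    """One downstream P2P bridge of the Switch (illustrative model only)."""
    name: str
    secondary: int    # first bus number below this port
    subordinate: int  # highest bus number below this port

    def claims(self, target_bus: int) -> bool:
        # A bridge forwards an ID-routed TLP downstream only if the target
        # bus lies inside its [secondary, subordinate] range.
        return self.secondary <= target_bus <= self.subordinate


def route_by_id(ports: List[DownstreamPort], target_bus: int) -> Optional[str]:
    """Return the name of the port that claims target_bus, or None."""
    for port in ports:
        if port.claims(target_bus):
            return port.name
    return None  # no downstream port claims it


# Bus numbers as drawn in Figure 3: each port leads to a single endpoint bus.
FIGURE_3_PORTS = [
    DownstreamPort("Downstream Port 0", 3, 3),  # NVMe SSD
    DownstreamPort("Downstream Port 1", 4, 4),  # NIC
    DownstreamPort("Downstream Port 2", 5, 5),  # FPGA card
    DownstreamPort("Downstream Port 3", 6, 6),  # PCIe-to-PCI bridge
]

print(route_by_id(FIGURE_3_PORTS, 4))
```

A completion whose Requester ID has bus 03 would be sent out of Downstream Port 0 under this model.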

📋 Endpoint Types

| Type | Uses I/O Space? | Locked Requests? | Typical Devices |
| --- | --- | --- | --- |
| Native PCIe Endpoint | No (MMIO only) | No | GPU, NVMe SSD, modern NIC, AI accelerator, FPGA |
| Legacy PCIe Endpoint | Yes (for backward compatibility) | Yes (for legacy support) | Older PCI-X devices with a PCIe interface bolted on |
| Root Complex Integrated Endpoint | Optional | No | USB controller, SATA controller, audio integrated into the RC |

The endpoint type is declared in the Device/Port Type field of the PCI Express Capabilities register, which sits inside the PCIe Capability structure in configuration space. Software reads this field to understand what the device is and what it can do. All modern devices designed specifically for PCIe (Gen 3 and later) are Native PCIe Endpoints and use only memory-mapped I/O.
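A minimal sketch of how software might decode that field, assuming the 16-bit PCI Express Capabilities register has already been read from configuration space. The field position (bits 7:4) and the encodings in the table are the values understood to be defined by the PCIe Base Specification and should be checked against the revision in use; the names `DEVICE_PORT_TYPES` and `device_port_type` are illustrative.

```python
# Device/Port Type encodings (bits 7:4 of the PCI Express Capabilities
# register). Assumed from the PCIe Base Specification; verify before relying
# on them.
DEVICE_PORT_TYPES = {
    0b0000: "PCI Express Endpoint",
    0b0001: "Legacy PCI Express Endpoint",
    0b0100: "Root Port of PCI Express Root Complex",
    0b0101: "Upstream Port of PCI Express Switch",
    0b0110: "Downstream Port of PCI Express Switch",
    0b0111: "PCI Express to PCI/PCI-X Bridge",
    0b1000: "PCI/PCI-X to PCI Express Bridge",
    0b1001: "Root Complex Integrated Endpoint",
    0b1010: "Root Complex Event Collector",
}


def device_port_type(pcie_caps_register: int) -> str:
    """Decode the Device/Port Type field from the 16-bit capabilities register."""
    code = (pcie_caps_register >> 4) & 0xF
    return DEVICE_PORT_TYPES.get(code, f"reserved (0x{code:x})")


# 0x0002: capability version 2 in bits 3:0, Device/Port Type 0b0000.
print(device_port_type(0x0002))
```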

📋 Every Port Implements All Three Layers

This applies without exception — Root Ports, Switch upstream ports, Switch downstream ports, and Endpoint ports all implement the Transaction Layer, Data Link Layer, and Physical Layer. The logic is the same regardless of the device type; what varies is what the Transaction Layer does with the decoded TLP (route it, deliver it to the device core, or generate a completion).

[Figure 4 diagram: four port stacks, each with a Transaction Layer, Data Link Layer and Physical Layer. Root Port: Transaction Layer → CPU/memory transactions. Switch Upstream Port: Transaction Layer → route to the correct downstream port. Switch Downstream Port: Transaction Layer → forward toward the endpoint. Endpoint Port: Transaction Layer → deliver to the device core. All four port types implement identical layer logic; only the device core action differs.]
Figure 4 — Every PCIe port type implements all three layers. The layer logic is identical; what changes is what the Transaction Layer does with a decoded TLP — route it, deliver it, or generate a completion from it.

📋 Transaction Types — Posted vs Non-Posted

Every PCIe transaction is either posted or non-posted. The distinction drives fundamental protocol differences.

| Transaction Type | Posted or Non-Posted? | Completion returned? | Why? |
| --- | --- | --- | --- |
| MRd (Memory Read) | Non-Posted | Yes, CplD with data | Requester needs the data back |
| MWr (Memory Write) | Posted | No | Fire-and-forget: data is delivered eventually, so there is no need to wait |
| IORd (I/O Read) | Non-Posted | Yes, CplD with data | Legacy; I/O space only on Legacy Endpoints |
| IOWr (I/O Write) | Non-Posted | Yes, Cpl (no data) | Must confirm the write reached the target before the next I/O step |
| CfgRd (Config Read) | Non-Posted | Yes, CplD with data | BIOS/OS needs the register value back |
| CfgWr (Config Write) | Non-Posted | Yes, Cpl (no data) | Must confirm the write before programming the next register |
| Msg (Message) | Posted | No | INTx/PME/error signalling; informational, no ack needed |

Why are Config and I/O writes non-posted? During boot, the BIOS programs configuration registers in strict sequence — write BAR, then write Command register, then enable device. If writes were posted (fire-and-forget), the BIOS could not know when each write had actually landed and the ordering guarantees collapse. Non-posted writes return a completion confirming the write reached the target — only then does software proceed to the next step.
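The table reduces to a small lookup: given a request type, which completion, if any, should the requester wait for? The sketch below only restates the table; `POSTED`, `COMPLETION_FOR` and `expected_completion` are illustrative names.

```python
from typing import Optional

# Posted requests: no completion TLP is ever returned.
POSTED = {"MWr", "Msg"}

# Non-posted requests and the completion type each one expects.
COMPLETION_FOR = {
    "MRd": "CplD",
    "IORd": "CplD",
    "CfgRd": "CplD",
    "IOWr": "Cpl",
    "CfgWr": "Cpl",
}


def expected_completion(request: str) -> Optional[str]:
    """Return the completion a requester waits for, or None if posted."""
    if request in POSTED:
        return None
    return COMPLETION_FOR[request]  # KeyError for a type not in the table


for kind in ("MWr", "CfgWr", "MRd"):
    print(kind, "->", expected_completion(kind))
```

In this model a configuration sequence such as "write BAR, then write Command register" would block on each `Cpl` before issuing the next write, which is the ordering guarantee described above.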

📋 All TLP Types

| Abbreviation | Full Name | Header Size | Notes |
| --- | --- | --- | --- |
| MRd | Memory Read Request | 3 DW (32-bit) / 4 DW (64-bit) | Most common request; non-posted |
| MRdLk | Memory Read Locked | 3 DW / 4 DW | Legacy atomic read-modify-write; RC only |
| MWr | Memory Write Request | 3 DW / 4 DW + payload | Posted; no completion |
| IORd | I/O Read | 3 DW | Legacy endpoints only |
| IOWr | I/O Write | 3 DW + payload | Non-posted; legacy endpoints only |
| CfgRd0 | Config Read Type 0 | 3 DW | For a device on the target bus (endpoint/bridge) |
| CfgRd1 | Config Read Type 1 | 3 DW | For a device on a bus further downstream; routed by bus number |
| CfgWr0 | Config Write Type 0 | 3 DW + payload | Non-posted; write to local device config registers |
| CfgWr1 | Config Write Type 1 | 3 DW + payload | Forwarded downstream toward the target bus |
| Msg | Message (no data) | 4 DW | INTx, PME, slot events, vendor messages |
| MsgD | Message with Data | 4 DW + payload | Vendor-defined messages with data payload |
| Cpl | Completion (no data) | 3 DW | Response to config/IO writes; status only |
| CplD | Completion with Data | 3 DW + payload | Response to reads; carries requested data |
| CplLk | Locked Completion (no data) | 3 DW | Response to MRdLk when no data is returned |
| CplDLk | Locked Completion with Data | 3 DW + payload | Response to MRdLk with data |
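The "Header Size" column follows from the Fmt field in the first header byte. The sketch below assumes the non-flit TLP header layout, with Fmt in bits 7:5 and Type in bits 4:0 of byte 0, and the Type codes as they are understood to appear in the PCIe Base Specification; treat the encodings as assumptions to verify. `decode_byte0` and `TYPE_NAMES` are illustrative names.

```python
from typing import Tuple

# Type[4:0] codes for the TLPs in the table above (assumed from the spec).
# Msg/MsgD use 0b10rrr, where rrr selects the message routing, so they are
# matched separately below.
TYPE_NAMES = {
    0b00000: "MRd/MWr",
    0b00001: "MRdLk",
    0b00010: "IORd/IOWr",
    0b00100: "CfgRd0/CfgWr0",
    0b00101: "CfgRd1/CfgWr1",
    0b01010: "Cpl/CplD",
    0b01011: "CplLk/CplDLk",
}


def decode_byte0(byte0: int) -> Tuple[int, bool, str]:
    """Return (header size in DW, carries payload?, type family) for byte 0."""
    fmt = (byte0 >> 5) & 0b111
    tlp_type = byte0 & 0b11111
    if fmt & 0b100:
        raise ValueError("Fmt 1xx (TLP Prefix) is outside this sketch")
    header_dw = 4 if fmt & 0b001 else 3  # Fmt bit 0: 4 DW header
    has_payload = bool(fmt & 0b010)      # Fmt bit 1: data payload follows
    if tlp_type >> 3 == 0b10:
        family = "Msg/MsgD"
    else:
        family = TYPE_NAMES.get(tlp_type, "other")
    return header_dw, has_payload, family


print(decode_byte0(0x4A))  # completion with data
print(decode_byte0(0x60))  # memory write with a 64-bit address
```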

Memory Read — End-to-End Example

Let’s walk through an NVMe SSD reading data from system memory. The SSD is behind a Switch. This example shows every hop and every packet.

[Figure 5 diagram: NVMe SSD (requester, BDF 03:00.0) → Switch (routes by address and by Requester ID) → Root Complex (completer, fetches from RAM) → System RAM (holds the requested data). ① The MRd TLP (addr=0x1000, len=64 B, tag=5, ReqID=03:00.0) travels upstream; the Switch forwards the same TLP with a new LCRC on each link, and each hop returns its own ACK DLLP. The RC fetches the data. ② The CplD TLP (CplID=00:00.0, ReqID=03:00.0, tag=5, 64 B of data) travels downstream; the Switch routes it by ReqID bus=03, and each hop again returns an ACK DLLP. The NVMe SSD receives the CplD, matches tag=5 to its outstanding MRd, and delivers 64 bytes to its DMA engine.]
Figure 5 — Memory read flow. The MRd travels upstream hop by hop; each hop sends its own ACK DLLP back (dashed). The RC fetches data from RAM and returns a CplD downstream; each hop again ACKs independently. The Tag (5) links the completion to the original request.
The Tag is what prevents confusion with multiple outstanding requests. The NVMe SSD can have up to 256 reads in flight simultaneously with 8-bit Tags, and more when 10-bit Tags are enabled. Each one has a different Tag value. When CplD packets come back in any order, the Tag tells the SSD exactly which original request each completion is answering.
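A requester's bookkeeping for this can be sketched as a pool of free Tags plus a table of outstanding requests. The `TagTracker` class is a hypothetical illustration, and it simplifies by assuming each read is answered by exactly one CplD, whereas a real read may be answered by several completions.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class TagTracker:
    """Track outstanding non-posted reads by Tag (illustrative sketch)."""

    def __init__(self, tag_bits: int = 8) -> None:
        self._free: Deque[int] = deque(range(2 ** tag_bits))
        self._outstanding: Dict[int, Tuple[int, int]] = {}

    def issue_read(self, address: int, length: int) -> int:
        """Allocate a Tag for a new MRd and remember what it was for."""
        if not self._free:
            raise RuntimeError("no free Tags: the requester must stall")
        tag = self._free.popleft()
        self._outstanding[tag] = (address, length)
        return tag

    def complete(self, tag: int) -> Tuple[int, int]:
        """Match a CplD to its request and release the Tag for reuse."""
        request = self._outstanding.pop(tag)  # KeyError: unexpected completion
        self._free.append(tag)
        return request


tracker = TagTracker()
first = tracker.issue_read(0x1000, 64)
second = tracker.issue_read(0x2000, 64)
# Completions may arrive out of order; the Tag still identifies each request.
print(tracker.complete(second), tracker.complete(first))
```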

Memory Write — Posted Example

A GPU DMA engine writing rendered frame data to system memory. Posted — no completion comes back.

[Figure 6 diagram: GPU (requester, BDF 01:00.0) → Root Complex (routes the MWr to system memory) → System RAM (frame buffer written). ① The posted MWr TLP (addr=0x8000_0000, 4 KB payload) needs no completion. The Root Complex returns only a per-hop ACK DLLP; no TLP completion goes back to the GPU. The data is written to RAM and the transaction is done, with no CplD. Note: the GPU gets no TLP confirmation. If the write fails at the memory, an AER error message is sent to the Root, but the GPU's Transaction Layer never hears about it.]
Figure 6 — Posted memory write. The GPU fires the MWr and immediately continues. The link-level ACK DLLP (dashed) confirms link delivery but is not a TLP completion — the GPU’s Transaction Layer never receives feedback. This is intentional — the performance benefit outweighs the error-reporting limitation.

📋 ACK/NAK Protocol — Per-Hop Reliability

PCIe’s Data Link Layer guarantees delivery of every TLP between adjacent devices using ACK and NAK DLLPs. This is important to understand: it is per-hop, not end-to-end.

[Figure 7 diagram: a Transmitter Port and a Receiver Port joined by one link. The transmitter's replay buffer holds TLP#4, TLP#5 and TLP#6; each copy is saved until an ACK is received. On ACK(6) it flushes every TLP up to and including #6, on a NAK it replays the unacknowledged TLPs, and on a timeout it replays everything in the buffer. Each TLP crosses the link with a sequence number and LCRC. The receiver ① checks the LCRC and ② checks the sequence number. With no error it sends an ACK DLLP and the Data Link Layer passes the TLP up; on an error it sends a NAK DLLP and drops the TLP. ACK/NAK operates on every link independently: a packet crossing 3 links is ACKed 3 times, once per hop, not end-to-end.]
Figure 7 — ACK/NAK protocol. The transmitter keeps a copy in its replay buffer until ACK arrives. ACK means “flush this and all older packets”. NAK carries the sequence number of the last TLP received correctly, so it means “flush up to this sequence number and replay everything after it”. Each PCIe link runs this protocol independently.
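The transmitter side of one link can be sketched as a queue of unacknowledged TLPs. This is a simplified model, not the specification's state machine: the `ReplayBuffer` class is hypothetical, sequence numbers grow without the wraparound a real link has, and a NAK is assumed to carry the sequence number of the last TLP the receiver accepted.

```python
from collections import deque
from typing import Deque, List, Tuple


class ReplayBuffer:
    """Per-link transmitter replay buffer (illustrative sketch)."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next_seq = first_seq
        self._pending: Deque[Tuple[int, str]] = deque()  # (sequence number, TLP)

    def send(self, tlp: str) -> int:
        """Transmit a TLP and keep a copy until it is acknowledged."""
        seq = self._next_seq
        self._next_seq += 1
        self._pending.append((seq, tlp))
        return seq

    def _purge(self, seq: int) -> None:
        # Drop every stored TLP with a sequence number <= seq.
        while self._pending and self._pending[0][0] <= seq:
            self._pending.popleft()

    def on_ack(self, seq: int) -> None:
        """ACK(seq): this TLP and all older ones arrived intact."""
        self._purge(seq)

    def on_nak(self, seq: int) -> List[str]:
        """NAK(seq): seq was the last good TLP; replay everything after it."""
        self._purge(seq)
        return [tlp for _, tlp in self._pending]

    def on_timeout(self) -> List[str]:
        """No ACK or NAK arrived in time: replay the whole buffer."""
        return [tlp for _, tlp in self._pending]


link = ReplayBuffer(first_seq=4)
for name in ("TLP#4", "TLP#5", "TLP#6"):
    link.send(name)
print(link.on_nak(4))  # TLP#4 was good, so TLP#5 and TLP#6 are replayed
link.on_ack(6)         # everything up to #6 is now flushed
```

Each port of a Switch would hold its own instance of such a buffer, which is what makes the protocol per-hop.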

📋 Bus Enumeration — Step-by-Step Walk-Through

Enumeration is the process by which BIOS/OS firmware discovers the PCIe topology and assigns bus numbers to every bus segment. It uses a depth-first search — it always goes as deep as possible before backtracking.

[Figure 8 diagram: enumeration result, with bus numbers assigned by depth-first search. The Root Complex's internal virtual Bus 0 hosts Port A (Dev 1) and Port B (Dev 2). Port A leads over Bus 1 to the Switch upstream port, whose internal virtual bus is Bus 2. Below the Switch: NVMe on Bus 3, NIC on Bus 4, and the PCIe→PCI bridge on Bus 5. Port B leads to the GPU endpoint at Bus 6, Dev 0; it is an endpoint with no bridge, so the search stops there. The bridge at Bus 0/Dev 1 gets Primary=0, Secondary=1, Subordinate=5, because the deepest bus under it is 5.]
Figure 8 — Result of depth-first enumeration. Numbers are assigned in the order the algorithm digs: RC → Port A → Switch → Bus 2 → NVMe (Bus 3) → NIC (Bus 4) → Bridge (Bus 5) → backtrack → RC Port B → GPU (Bus 6). Subordinate numbers are set after all downstream buses are known.
1. Assign Bus 0 to the Root Complex
   Hardware hardcodes Bus 0. Software writes 0 into the Root Port’s Primary Bus Number register, then scans Bus 0 by sending CfgRd0 packets to every Device/Function combination.

2. Find Port A (Dev 1), a bridge → set Secondary Bus = 1, Subordinate = 255 (placeholder)
   Software writes Primary=0, Secondary=1, Subordinate=255 to the bridge. It sets Subordinate=255 as a temporary maximum to allow configuration traffic to flow downstream during discovery.

3. Scan Bus 1, find the Switch upstream port → assign Bus 2 (virtual internal bus)
   The Switch’s upstream port presents itself as a bridge at Bus 1/Dev 0. Software assigns bus number 2 to the Switch’s internal bus. The Switch upstream bridge gets Primary=1, Secondary=2, Subordinate=255.

4. Scan Bus 2, find the Switch downstream ports (Dev 0, Dev 1, Dev 2) and go deep into each
   Downstream port 0 gets Secondary=3 (NVMe). NVMe is an endpoint, so there are no further buses. Downstream port 1 gets Secondary=4 (NIC). NIC is an endpoint. Downstream port 2 gets Secondary=5 (PCIe→PCI bridge), and PCI bus 5 is enumerated separately.

5. Backtrack and set the Subordinate numbers correctly now that all buses are known
   Switch’s upstream bridge: Subordinate=5 (highest bus under it). Port A bridge on Bus 0: Subordinate=5. Then scan Port B (Dev 2) → GPU endpoint at Bus 6. Port B bridge: Primary=0, Secondary=6, Subordinate=6.

Subordinate = 255 is the key to depth-first search. Setting Subordinate=255 temporarily lets configuration TLPs flow all the way downstream during discovery. Once all downstream buses are known, the correct Subordinate value is written. A bridge forwards a TLP routed by bus number downstream only when the target bus falls inside its [Secondary, Subordinate] range; anything outside that range is not passed through that bridge. This is how ID-based routing works.
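The five steps can be sketched as a short recursive function over a model of the Figure 8 topology. The `Device` and `Bridge` classes and `enumerate_bus` are hypothetical names for this sketch; real firmware discovers devices with configuration reads rather than walking a prebuilt Python list, and the PCIe→PCI bridge is treated as a leaf on Bus 5, as in the walk-through.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Device:
    """An endpoint (or any leaf) found during a bus scan."""
    name: str
    bus: Optional[int] = None


@dataclass
class Bridge(Device):
    """A P2P bridge: Root Port, Switch upstream port or Switch downstream port."""
    downstream: List[Device] = field(default_factory=list)
    primary: Optional[int] = None
    secondary: Optional[int] = None
    subordinate: Optional[int] = None


def enumerate_bus(bus: int, devices: List[Device]) -> int:
    """Scan one bus depth-first; return the highest bus number assigned."""
    highest = bus
    for dev in devices:
        dev.bus = bus
        if isinstance(dev, Bridge):
            dev.primary = bus
            dev.secondary = highest + 1  # next unused bus number
            dev.subordinate = 0xFF       # placeholder while descending
            highest = enumerate_bus(dev.secondary, dev.downstream)
            dev.subordinate = highest    # corrected on the way back up
    return highest


# Topology from Figure 8.
nvme, nic, gpu = Device("NVMe"), Device("NIC"), Device("GPU")
pci_bridge = Device("PCIe-to-PCI bridge")
dp0 = Bridge("Switch DP0", downstream=[nvme])
dp1 = Bridge("Switch DP1", downstream=[nic])
dp2 = Bridge("Switch DP2", downstream=[pci_bridge])
switch_up = Bridge("Switch upstream port", downstream=[dp0, dp1, dp2])
port_a = Bridge("Root Port A", downstream=[switch_up])
port_b = Bridge("Root Port B", downstream=[gpu])

highest_bus = enumerate_bus(0, [port_a, port_b])
for b in (port_a, switch_up, dp0, dp1, dp2, port_b):
    print(f"{b.name}: primary={b.primary} secondary={b.secondary} "
          f"subordinate={b.subordinate}")
```

The recursion terminates because every bridge is visited exactly once, which only holds when the topology is a tree.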

Gen 6 Topology Considerations

The topology model — tree, BDF addressing, root/switch/endpoint roles, enumeration algorithm — is unchanged in Gen 6. What Gen 6 changes is the physical link speed and framing, not the topology or software model.

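However, Gen 6’s higher bandwidth and flit-based framing introduce some practical topology considerations: retimers may be needed on long channels, and flit mode is handled internally on each link, so neither requires software changes.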

📋 Quick Reference

| Concept | Key Point |
| --- | --- |
| Tree topology | Mandatory, no loops. Preserves software compatibility with PCI’s simple depth-first enumeration algorithm. |
| BDF | 8-bit Bus + 5-bit Device + 3-bit Function = 16-bit unique address. Always Device 0 on real PCIe links; multiple devices only on virtual buses. |
| Requester ID | BDF of the sender embedded in the TLP header. Completions use it to route back to the originator. |
| Tag | 8-bit or 10-bit field distinguishing multiple in-flight non-posted requests from the same function. |
| Root Complex, software view | Bus 0 with P2P bridges (Root Ports) at Device 1, 2, 3… Each Root Port opens a new downstream bus. |
| Switch, software view | Collection of P2P bridges sharing a virtual internal bus. The upstream port is a bridge whose secondary side is the virtual bus; each downstream port is a bridge whose primary side is the virtual bus and whose secondary side opens a new bus. |
| Posted transaction | MWr, Msg. No TLP completion returned. Link-level ACK DLLP still sent per hop. |
| Non-Posted transaction | MRd, IORd, IOWr, CfgRd, CfgWr. Completion TLP returned from target to requester. |
| ACK/NAK | Per-hop, not end-to-end. Each link runs independently. Replay buffer holds copies until ACK received. |
| Enumeration | Depth-first search from Bus 0. Set Subordinate=255 going down; correct it on the way back up. Endpoint = no bridge, stops there. |
| Gen 6 topology impact | No software changes. Retimers may be needed on long channels. Flit mode is handled internally per link. |

Coming next: PCIe-03 covers The Three-Layer Model in Detail — a deep dive into the Transaction Layer’s virtual channel management and flow control credit types, the Data Link Layer’s ACK/NAK state machine, and the Physical Layer’s logical sub-block responsibilities across Gen 1 through Gen 6.