This section covers the PCI capability linked list, how software walks it, and the four core capability structures of a PCIe device (Power Management, PCIe Capability, MSI, and MSI-X), together with their register fields, configuration sequences, and how they evolve in Gen 6.
The 64-byte PCI-compatible header (Type 0 or Type 1) covers identification, command/status, BARs, and interrupt routing. Everything beyond basic device identification and address assignment lives in capability structures — self-describing register blocks that extend the configuration space in a standardised, forward-compatible way.
Capability structures occupy the 192-byte device-specific region of PCI-compatible configuration space (offsets 40h–FFh), chained together as a linked list. Each structure begins with two bytes: a Capability ID that identifies the type, and a Next Pointer byte that gives the offset of the next structure in the chain. The chain ends when Next Pointer = 00h.
This design means software does not need prior knowledge of a device’s capabilities. It reads the Capabilities Pointer from offset 34h, follows the list, and discovers whatever is present. A device can implement any subset of capabilities and they appear in the list in whatever order the designer chose.
The Capabilities Pointer at offset 34h in the Type 0 or Type 1 header holds the offset of the first capability structure. Software follows the chain until it reaches a structure whose Next Pointer is 00h.
The algorithm for walking the capability linked list:
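A minimal Python sketch of that walk, assuming the function's 256-byte PCI-compatible configuration space is already available as a byte image. The function name `walk_capabilities` and the sample image are illustrative, not taken from any real device or library.

```python
def walk_capabilities(cfg: bytes) -> list[tuple[int, int]]:
    """Return (offset, capability_id) pairs from a 256-byte config-space image."""
    STATUS = 0x06        # Status register (16 bits)
    CAP_LIST = 1 << 4    # Status bit 4: a capability list is present
    CAP_PTR = 0x34       # Capabilities Pointer

    status = int.from_bytes(cfg[STATUS:STATUS + 2], "little")
    if not status & CAP_LIST:
        return []

    found = []
    seen = set()                      # guards against a malformed, looping list
    offset = cfg[CAP_PTR] & 0xFC      # low two bits are reserved, mask them off
    while offset >= 0x40 and offset not in seen:
        seen.add(offset)
        found.append((offset, cfg[offset]))   # byte 0 = Capability ID
        offset = cfg[offset + 1] & 0xFC       # byte 1 = Next Pointer, 00h ends the list
    return found


# Made-up image: PM (01h) at 40h -> MSI (05h) at 50h -> PCIe (10h) at 60h
cfg = bytearray(256)
cfg[0x06] = 0x10                       # Status: Capabilities List bit set
cfg[0x34] = 0x40
cfg[0x40:0x42] = bytes([0x01, 0x50])
cfg[0x50:0x52] = bytes([0x05, 0x60])
cfg[0x60:0x62] = bytes([0x10, 0x00])
print([(hex(off), hex(cap_id)) for off, cap_id in walk_capabilities(cfg)])
```

The `seen` set and the `offset >= 0x40` check are defensive additions; a well-formed list terminates on its own when Next Pointer reads 00h.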
The Power Management Capability structure is mandatory for all PCIe devices. It provides the standard software interface for transitioning a device between its power states (D0 through D3). Software reads the Power Management Capabilities (PMC) register to learn which states and features the device supports, then uses the PM Control/Status Register (PMCSR) to change the state.
| PMC Bits | Field | Access | Meaning |
|---|---|---|---|
| [2:0] | Version | RO | Power Management spec version. Must be 011b (PCI PM 1.2) for functions compliant with the PCIe Base Specification. |
| [3] | PME Clock | RO | Unused in PCIe (0). PCI legacy field — clock needed for PME signalling on PCI bus. |
| [4] | Immediate Readiness on Return to D0 | RO | When 1: the function is guaranteed to complete valid accesses immediately after being returned to D0, so software can skip the usual recovery delay. When 0: software must wait the required recovery time before accessing the function. |
| [9] | D1 Support | RO | When 1: device supports the optional D1 power state. |
| [10] | D2 Support | RO | When 1: device supports the optional D2 power state. |
| [15:11] | PME Support | RO | Bitmask of power states from which the device can generate a PME (Power Management Event) message. Bit 11=D0, 12=D1, 13=D2, 14=D3hot, 15=D3cold. |
D1 and D2 have no standard definition beyond “less power than D0, more than D3.” Their specific behaviour is device-class specific. In practice most PCIe devices only implement D0 and D3hot — D1 and D2 are rarely used. The device advertises which states it supports in the PMC register, and software must not attempt to transition to an unsupported state.
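As an illustration, the PMC fields can be unpacked with a few shifts and masks. This is a sketch only: `decode_pmc` is a hypothetical helper, the sample value is invented, and the bit positions used are Immediate Readiness at bit 4, D1/D2 Support at bits 9 and 10, and PME Support at bits 15:11.

```python
def decode_pmc(pmc: int) -> dict:
    """Decode the 16-bit Power Management Capabilities register."""
    pme_states = ["D0", "D1", "D2", "D3hot", "D3cold"]   # bits 11..15
    return {
        "version": pmc & 0x7,
        "immediate_readiness": bool(pmc & (1 << 4)),
        "d1_support": bool(pmc & (1 << 9)),
        "d2_support": bool(pmc & (1 << 10)),
        "pme_from": [s for i, s in enumerate(pme_states) if pmc & (1 << (11 + i))],
    }


# Invented value: version 011b, PME possible from D0, D3hot and D3cold
print(decode_pmc(0xC803))
```

Software would consult `d1_support` and `d2_support` before ever requesting D1 or D2.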
The PM Control/Status Register (PMCSR) at DWORD offset 1 within the PM Capability structure is the primary runtime control register. Software writes to it to change the device’s power state and reads it to monitor PME events.
| PMCSR Bits | Field | Access | Meaning |
|---|---|---|---|
| [1:0] | Power State | RW | Current/requested power state. 00b=D0, 01b=D1, 10b=D2, 11b=D3hot. Software writes here to transition states. Hardware resets to 00b (D0). |
| [3] | No Soft Reset | RO | When 1: device preserves register context across D3hot→D0 transitions. When 0: software must treat the device as reset after D3hot→D0 and re-initialise it. |
| [8] | PME Enable | RW | When 1: device is allowed to assert PME to request wake-up from a low-power state. PME signalling uses an in-band TLP message in PCIe (not a pin). |
| [12:9] | Data Select | RW | Selects what data the Data register reports (power consumption, heat dissipation, etc.). Legacy PCI field — rarely used in PCIe. |
| [14:13] | Data Scale | RO | Scale factor for the Data register value. Legacy PCI field. |
| [15] | PME Status | RW1C | When 1: device has generated a PME message (or is requesting PME). Software clears by writing 1. Sticky — stays set until explicitly cleared. |
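
A power-state change is a read-modify-write of PMCSR. The sketch below computes the value to write; it assumes the caller has already read the current PMCSR, and the helper name `pmcsr_for_state` is hypothetical. The one subtlety it shows is that PME Status is RW1C, so writing back a 1 read from bit 15 would clear a pending event by accident.

```python
PM_STATE_MASK = 0x0003      # PMCSR bits [1:0]
PME_STATUS = 1 << 15        # RW1C: writing 1 clears it

D_STATES = {"D0": 0b00, "D1": 0b01, "D2": 0b10, "D3hot": 0b11}


def pmcsr_for_state(current: int, target: str) -> int:
    """Return the PMCSR value that requests `target` without touching PME Status."""
    value = current & ~PME_STATUS & 0xFFFF          # write 0 to the RW1C bit
    return (value & ~PM_STATE_MASK) | D_STATES[target]


# Current value: D0, PME Enable set, PME Status pending
print(hex(pmcsr_for_state(0x8100, "D3hot")))
```

Whether D1 or D2 may be requested at all depends on the support bits in PMC; this helper does not check them.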
The PCIe Capability structure (Capability ID 10h) is the most important capability for PCIe-specific features. It is mandatory for all PCIe devices and is the structure that distinguishes a native PCIe device from a legacy PCI device. It contains Device Capabilities/Control/Status registers and Link Capabilities/Control/Status registers.
| Bits | Field | Values |
|---|---|---|
| [3:0] | Capability Version | Must be 2h for PCIe 2.0 and later |
| [7:4] | Device/Port Type | 0000b=Endpoint · 0001b=Legacy Endpoint · 0100b=Root Port · 0101b=Upstream Switch Port · 0110b=Downstream Switch Port · 1001b=Root Complex Integrated Endpoint · 1010b=Root Complex Event Collector |
| [8] | Slot Implemented | 1=this port has a slot connector (hot-plug capable ports) |
| [13:9] | Interrupt Message Number | MSI/MSI-X vector number used by this port for PCIe events (hot-plug, power management, etc.) |
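
For illustration, the PCIe Capabilities register (the 16 bits at offset 02h of the structure) can be decoded as follows. `decode_pcie_caps` and the sample value are assumptions made for this sketch, and the lookup table only names the common port types.

```python
PORT_TYPES = {
    0b0000: "Endpoint",
    0b0001: "Legacy Endpoint",
    0b0100: "Root Port",
    0b0101: "Upstream Switch Port",
    0b0110: "Downstream Switch Port",
}


def decode_pcie_caps(reg: int) -> dict:
    """Decode the 16-bit PCIe Capabilities register."""
    return {
        "version": reg & 0xF,
        "port_type": PORT_TYPES.get((reg >> 4) & 0xF, "other"),
        "slot_implemented": bool(reg & (1 << 8)),
        "interrupt_message_number": (reg >> 9) & 0x1F,
    }


# Invented value: version 2, Root Port, slot implemented
print(decode_pcie_caps(0x0142))
```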
The Device Control register (DW2 bits [15:0]) controls per-function behaviour. The Device Status register (DW2 bits [31:16]) reports sticky error and capability flags.
| Dev Control Bit(s) | Field | Access | Purpose |
|---|---|---|---|
| 0 | Correctable Error Reporting Enable | RW | Enables ERR_COR messages for correctable errors. Must be set to make AER correctable errors visible to the Root Complex. |
| 1 | Non-Fatal Error Reporting Enable | RW | Enables ERR_NONFATAL messages. |
| 2 | Fatal Error Reporting Enable | RW | Enables ERR_FATAL messages. |
| 3 | Unsupported Request Reporting Enable | RW | Enables URs to be reported as Non-Fatal errors. If 0, URs are silently ignored (no message sent). |
| 4 | Relaxed Ordering Enable | RW | Enables device to set the RO bit in TLPs it generates. Default typically 1 (enabled). |
| [7:5] | Max Payload Size | RW | Sets the Maximum Payload Size for TLPs from this device. Must not exceed the value in Device Capabilities Max Payload Size Supported. 000b=128B · 001b=256B · 010b=512B · 011b=1KB · 100b=2KB · 101b=4KB. |
| 8 | Extended Tag Field Enable | RW | Enables use of 8-bit tags (allowing 256 outstanding transactions) vs 5-bit tags (32 outstanding). Requires both sides to support extended tags. |
| 9 | Phantom Functions Enable | RW | Enables use of phantom function numbers in the tag field to increase outstanding transaction count. |
| 10 | Auxiliary Power PM Enable | RW | Enables auxiliary power to remain powered for PME generation from D3cold. |
| 11 | No Snoop Enable | RW | Enables device to set the NS (No Snoop) bit in TLPs — allows CPU cache snooping to be bypassed for DMA buffers that software manages explicitly. |
| [14:12] | Max Read Request Size | RW | Maximum size of read requests from this device. 000b=128B · 001b=256B · 010b=512B · 011b=1KB · 100b=2KB · 101b=4KB. May exceed Max Payload Size; the completer then returns the data as multiple completions, each within the payload limit. |
| 15 | Initiate FLR | RW | Function Level Reset — writing 1 initiates a self-reset of this function only (not the entire device). Completes within 100ms. Only valid if Device Capabilities FLR Capable bit is set. |
| Dev Status Bit(s) | Field | Access | Meaning |
|---|---|---|---|
| 0 | Correctable Error Detected | RW1C | Set when a correctable error was detected. Sticky — clear by writing 1. |
| 1 | Non-Fatal Error Detected | RW1C | Set when a non-fatal uncorrectable error was detected. |
| 2 | Fatal Error Detected | RW1C | Set when a fatal uncorrectable error was detected. |
| 3 | Unsupported Request Detected | RW1C | Set when this function received an Unsupported Request. |
| 4 | AUX Power Detected | RO | Hardware sets this when auxiliary power (Vaux) is present. Read-only snapshot. |
| 5 | Transactions Pending | RO | When 1: function has non-posted requests with completions pending. Software should wait for this to clear before removing power or initiating FLR. |
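
The Max Payload Size and Max Read Request Size fields share one encoding: the size in bytes is 128 shifted left by the field value. A small sketch of decoding both fields and of changing MPS while respecting the supported limit follows; the helper names and sample register values are invented for the example.

```python
SIZE_CODES = {128 << code: code for code in range(6)}   # 128B..4KB -> 000b..101b


def decode_devctl_sizes(devctl: int) -> dict:
    """Extract MPS [7:5] and MRRS [14:12] in bytes (reserved codes not checked)."""
    return {
        "max_payload": 128 << ((devctl >> 5) & 0x7),
        "max_read_request": 128 << ((devctl >> 12) & 0x7),
    }


def set_max_payload(devctl: int, size: int, supported_code: int) -> int:
    """Return a Device Control value with MPS set to `size` bytes.

    `supported_code` is the Max Payload Size Supported field from
    Device Capabilities, which the new setting must not exceed.
    """
    code = SIZE_CODES[size]      # KeyError if the size has no encoding
    if code > supported_code:
        raise ValueError("size exceeds Max Payload Size Supported")
    return (devctl & ~(0x7 << 5) & 0xFFFF) | (code << 5)


print(decode_devctl_sizes(0x2830))
print(hex(set_max_payload(0x2830, 128, supported_code=1)))
```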
Link Control (DW4 bits [15:0]) and Link Status (DW4 bits [31:16]) are link-level registers, relevant at both endpoints of a link but typically read/written by the downstream device and the upstream port’s driver.
| Link Control Bit(s) | Field | Access | Purpose |
|---|---|---|---|
| [1:0] | ASPM Control | RW | Controls Active State Power Management. 00b=disabled · 01b=L0s enabled · 10b=L1 enabled · 11b=L0s+L1 enabled. Both endpoints must agree — typically configured by BIOS/OS PM driver. |
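| 3 | Read Completion Boundary | RO/RW | 0=64-byte boundary · 1=128-byte boundary. Indicates the address boundary at which read completions may be split. Read-only on Root Ports, where it reports the port's own RCB; on Endpoints software sets it to match the Root Port. |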
| 4 | Link Disable | RW | Writing 1 disables the link — the LTSSM enters Disabled state. Link re-enables when cleared. Downstream Port only. |
| 5 | Retrain Link | RW | Writing 1 initiates link retraining (enters Recovery state). Used to request a speed change or width change. Bit self-clears when retraining begins. |
| 6 | Common Clock Configuration | RW | Both sides share the same reference clock source. Must match the topology — BIOS sets this correctly. |
| 7 | Extended Synch | RW | Forces 4096 FTS symbols when exiting L0s (instead of the N_FTS-negotiated count). Used by test equipment that needs more time to achieve lock. |
| 8 | Enable Clock Power Management | RW | Enables the downstream device to request removal of the reference clock during L1 to save power. |
| Link Status Bit(s) | Field | Access | Meaning |
|---|---|---|---|
| [3:0] | Current Link Speed | RO | Active link speed. 0001b=2.5 GT/s · 0010b=5 GT/s · 0011b=8 GT/s · 0100b=16 GT/s · 0101b=32 GT/s · 0110b=64 GT/s (Gen 6) |
| [9:4] | Negotiated Link Width | RO | Active link width. 000001b=x1 · 000010b=x2 · 000100b=x4 · 001000b=x8 · 010000b=x16 · 100000b=x32 |
| 11 | Link Training | RO | When 1: link training or retraining is in progress. When 0: link is operational or disabled. |
| 12 | Slot Clock Configuration | RO | When 1: device uses the reference clock from the slot connector. Hardware-set. |
| 13 | Data Link Layer Link Active | RO | When 1: DLL is in the Active state, so TLPs and DLLPs can flow. This is the “link is really up” flag. |
| 14 | Link Bandwidth Management Status | RW1C | Set when the link has been retrained after a Retrain Link request, or when hardware changed speed or width to work around unreliable operation. Sticky. Autonomous changes made for other reasons are reported in Link Autonomous Bandwidth Status (bit 15). |
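
As a worked example, a Link Status value can be decoded like this. The sketch assumes the standard bit positions (Link Training at bit 11, Slot Clock Configuration at bit 12, Data Link Layer Link Active at bit 13); `decode_link_status` and the sample value are illustrative.

```python
SPEEDS = {1: "2.5 GT/s", 2: "5 GT/s", 3: "8 GT/s",
          4: "16 GT/s", 5: "32 GT/s", 6: "64 GT/s"}


def decode_link_status(lnksta: int) -> dict:
    """Decode the 16-bit Link Status register."""
    return {
        "speed": SPEEDS.get(lnksta & 0xF, "reserved"),
        "width": (lnksta >> 4) & 0x3F,            # the encoding equals the lane count
        "training": bool(lnksta & (1 << 11)),
        "slot_clock": bool(lnksta & (1 << 12)),
        "dll_active": bool(lnksta & (1 << 13)),
    }


# Invented value: x16 at 16 GT/s, slot clock in use, data link layer active
print(decode_link_status(0x3104))
```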
Message Signaled Interrupts replace the legacy INTx interrupt pin mechanism with in-band Memory Write TLPs. A device signals an interrupt by writing a specific data value to a specific memory address — both programmed by software during configuration. The Root Complex or IOAPIC detects this write and delivers the interrupt to the appropriate CPU core.
| Bit(s) | Field | Access | Meaning |
|---|---|---|---|
| 0 | MSI Enable | RW | When 1: device uses MSI for interrupts. INTx and MSI-X are automatically disabled. Software sets this after programming Message Address and Message Data. |
| [3:1] | Multiple Message Capable | RO | How many interrupt vectors the device wants. 000b=1, 001b=2, 010b=4, 011b=8, 100b=16, 101b=32. Always a power of two. |
| [6:4] | Multiple Message Enable | RW | How many vectors software actually allocated. Same encoding as Capable field. Must be ≤ Capable. Device varies the lower N bits of Message Data to generate different vectors. |
| 7 | 64-bit Address Capable | RO | When 1: Message Upper Address register is present. Device can be assigned a 64-bit interrupt address. All native PCIe endpoints must set this. |
| 8 | Per-Vector Masking Capable | RO | When 1: Mask Bits and Pending Bits registers are present, enabling individual interrupt vector masking. |
MSI can deliver up to 32 interrupt vectors per function. When more than one vector is allocated (Multiple Message Enable ≥ 1), the device signals different events by modifying the lower N bits of the Message Data value before writing. If 4 messages are allocated (Enable = 010b), bits [1:0] of Message Data are variable — the device sends Data, Data+1, Data+2, or Data+3 for its four events.
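The arithmetic can be sketched in a few lines. `msi_messages` is a hypothetical helper that lists the Message Data values a device may send for a given base value and Multiple Message Enable setting; it assumes software programmed a base whose low N bits are zero, and masks them off to show which bits the device owns.

```python
def msi_messages(base_data: int, mme: int) -> list[int]:
    """List the Message Data values a device may write for a given MME field."""
    if not 0 <= mme <= 5:
        raise ValueError("Multiple Message Enable must be 000b..101b")
    count = 1 << mme                 # 1, 2, 4, 8, 16 or 32 vectors
    base = base_data & ~(count - 1)  # low log2(count) bits belong to the device
    return [base | n for n in range(count)]


# Four vectors allocated (Enable = 010b) with an invented base value
print([hex(d) for d in msi_messages(0x4040, 2)])
```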
The complete sequence for enabling MSI on a device:
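One way to express that sequence, sketched against an in-memory configuration-space image rather than real hardware. The helpers `rd16`, `wr16`, `wr32` and `enable_msi` are invented for this example, as are the address and data values; the order follows the quick-reference rule of setting Interrupt Disable, programming address and data, setting Multiple Message Enable, then setting MSI Enable.

```python
import struct


def rd16(cfg, off):
    return struct.unpack_from("<H", cfg, off)[0]


def wr16(cfg, off, val):
    struct.pack_into("<H", cfg, off, val & 0xFFFF)


def wr32(cfg, off, val):
    struct.pack_into("<I", cfg, off, val & 0xFFFFFFFF)


def enable_msi(cfg, cap, address, data, want_log2):
    """Enable MSI on the capability at offset `cap`; return the vector count granted."""
    ctrl = rd16(cfg, cap + 2)                       # Message Control
    is64 = bool(ctrl & (1 << 7))                    # 64-bit Address Capable
    if address >> 32 and not is64:
        raise ValueError("function only accepts a 32-bit message address")
    mme = min(want_log2, (ctrl >> 1) & 0x7)         # never exceed Multiple Message Capable

    wr16(cfg, 0x04, rd16(cfg, 0x04) | (1 << 10))    # 1. Command: Interrupt Disable (INTx off)
    wr32(cfg, cap + 4, address & 0xFFFFFFFC)        # 2. Message Address (DWORD aligned)
    if is64:
        wr32(cfg, cap + 8, address >> 32)           #    Message Upper Address
    wr16(cfg, cap + (0x0C if is64 else 0x08), data) #    Message Data
    ctrl = (ctrl & ~(0x7 << 4)) | (mme << 4)        # 3. Multiple Message Enable
    wr16(cfg, cap + 2, ctrl)
    wr16(cfg, cap + 2, ctrl | 1)                    # 4. MSI Enable, last
    return 1 << mme


# Invented device: MSI capability at 50h, 64-bit capable, asks for 8 vectors
cfg = bytearray(256)
cfg[0x50] = 0x05
wr16(cfg, 0x52, 0x0086)
print(enable_msi(cfg, 0x50, 0xFEE00000, 0x4040, want_log2=2))
```

On real hardware the read-only bits of Message Control ignore writes; the image here simply stores whatever is written, which is why the helper writes back the value it read.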
MSI-X overcomes the three key limitations of MSI: it supports up to 2048 vectors per function (vs 32 for MSI), each vector can target a different CPU/APIC address (enabling optimal interrupt distribution), and vectors do not need to be contiguous. The interrupt vector table is stored in device MMIO space (pointed to by a BAR) rather than in configuration space, making it easily extensible.
| Bit(s) | Field | Access | Meaning |
|---|---|---|---|
| [10:0] | Table Size | RO | N–1 encoding of total number of vectors supported. A value of 7 means 8 vectors. Maximum is 2047 (meaning 2048 vectors). Hardware sets this. |
| [13:11] | Reserved | RO | Always 0. |
| 14 | Function Mask | RW | Global mask — when 1, all interrupt vectors from this function are masked regardless of individual per-vector mask bits. Allows atomic masking of all interrupts during driver updates. |
| 15 | MSI-X Enable | RW | When 1: MSI-X is enabled. MSI and INTx are disabled. Software sets this after programming all table entries. |
The MSI-X Table lives in the device’s MMIO space (in the BAR identified by Table BIR). It contains one 128-bit entry per supported vector. Each entry has its own Address, Data, and Vector Control registers — enabling fully independent configuration of each interrupt.
Each MSI-X table entry has a Vector Control register. Bit 0 is the Mask bit. When 1, the associated interrupt vector is masked: the device cannot send the corresponding MSI-X write, and if the event fires the corresponding PBA bit is set instead. When the mask is cleared, if the PBA bit is set the device must send the interrupt immediately. In MSI-X this per-vector masking is always available, whereas MSI offers it only when the function sets Per-Vector Masking Capable.
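The sizes and offsets involved can be sketched as follows. The helpers are hypothetical and the register values invented; the layout assumed is the standard one, with the Table Offset/BIR register carrying the BIR in bits [2:0] and each 16-byte entry holding Message Address, Message Upper Address, Message Data, and Vector Control in that order.

```python
import struct

ENTRY_SIZE = 16   # one 128-bit entry per vector


def msix_table_size(msg_ctrl: int) -> int:
    """Number of vectors from Message Control bits [10:0] (N-1 encoded)."""
    return (msg_ctrl & 0x7FF) + 1


def split_table_reg(reg: int) -> tuple[int, int]:
    """Split the Table Offset/BIR register into (BIR, byte offset within the BAR)."""
    return reg & 0x7, reg & 0xFFFFFFF8


def entry_offset(table_offset: int, vector: int) -> int:
    """Byte offset of a vector's entry within the BAR."""
    return table_offset + ENTRY_SIZE * vector


def pack_entry(address: int, data: int, masked: bool) -> bytes:
    """Build one table entry: address low, address high, data, vector control."""
    return struct.pack("<IIII", address & 0xFFFFFFFF, address >> 32,
                       data, 1 if masked else 0)


bir, table_off = split_table_reg(0x00002003)      # invented register value
print(msix_table_size(0x8007), bir, hex(entry_offset(table_off, 5)))
```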
| Property | MSI | MSI-X |
|---|---|---|
| Maximum vectors per function | 32 | 2048 |
| Vector addresses | All share one address; data lower bits vary | Each vector has its own independent address |
| CPU targeting | All vectors go to the same CPU | Each vector can target a different CPU (ideal for multi-core IRQ affinity) |
| Vector numbering | Must be contiguous (base + offset) | Fully independent — any vector number in any order |
| Per-vector masking | Optional (Per-Vector Masking Capable bit) | Always present (bit 0 of each table entry’s Vector Control) |
| Configuration space footprint | 3–6 DWs in config space, depending on 64-bit addressing and per-vector masking | 3 DWs in config space; table in MMIO BAR space |
| Typical use | Simpler devices (NVMe with few queues, USB, audio) | High-performance devices (GPUs, 100GbE NICs, AI accelerators, NVMe with many queues) |
| Required to implement | Yes, unless MSI-X is implemented (an interrupt-generating function must support MSI, MSI-X, or both) | Optional (but strongly preferred for high-queue-count devices) |
INTx (legacy interrupt pin signalling) is emulated in PCIe via two in-band TLPs: Assert_INTx and Deassert_INTx. These are Message TLPs (Type = 10100b with specific message codes). PCIe does not have physical interrupt wires — the message TLPs mimic the edge/level behaviour of legacy PCI interrupt pins.
The Interrupt Pin register at offset 3Ch [15:8] declares which legacy pin (INTA# through INTD#) the function emulates. The Interrupt Disable bit in the Command register (bit 10) globally enables or disables INTx signalling for the function.
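A short sketch of reading both pieces of INTx state from a header image. `intx_state` is a hypothetical helper, and the pin encoding assumed is 00h for no INTx pin and 01h to 04h for INTA# to INTD#.

```python
INT_PINS = {0: None, 1: "INTA#", 2: "INTB#", 3: "INTC#", 4: "INTD#"}


def intx_state(cfg: bytes) -> dict:
    """Report the emulated INTx pin and whether INTx is disabled."""
    command = int.from_bytes(cfg[0x04:0x06], "little")
    pin = cfg[0x3D]    # Interrupt Pin: bits [15:8] of the DWORD at 3Ch
    return {
        "pin": INT_PINS.get(pin, "reserved"),
        "intx_disabled": bool(command & (1 << 10)),   # Command bit 10
    }


hdr = bytearray(64)
hdr[0x3D] = 0x01       # invented header: function uses INTA#
hdr[0x05] = 0x04       # Command bit 10 set
print(intx_state(hdr))
```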
All four capability structures, PM (01h), PCIe Capability (10h), MSI (05h), and MSI-X (11h), are unchanged in Gen 6. Their existing register layouts and software interfaces carry over from earlier generations. This is the whole point of the capability mechanism: new features are added as new capability IDs in the extended config space (offset 100h+), not by changing the existing structures.
What Gen 6 adds in the context of these structures:
| Item | Value / Rule |
|---|---|
| Capability structure header | Byte 0 = Capability ID · Byte 1 = Next Pointer (DWORD-aligned, 00h = end) |
| List start | Check Status register bit 4 (Capabilities List; bit 20 of the DWORD at offset 04h) first. Then read offset 34h bits [7:0] and mask off bits [1:0]. |
| PM Capability ID | 01h — mandatory for all PCIe functions |
| PMC Version | Must be 011b (PCI PM 1.2) for functions compliant with the PCIe Base Specification |
| PMC D1/D2 Support bits | Bits 9 and 10. Optional states, rarely used in PCIe |
| PMC PME Support [15:11] | Bitmask of D-states from which PME can be sent (D0/D1/D2/D3hot/D3cold) |
| PMCSR Power State [1:0] | Software writes: 00b=D0 · 01b=D1 · 10b=D2 · 11b=D3hot |
| PMCSR PME Enable [8] | Write 1 to allow device to send PME messages when in low-power state |
| PMCSR PME Status [15] | RW1C — sticky flag that device sent a PME. Write 1 to clear. |
| PMCSR No Soft Reset [3] | When 0: software must re-initialise device after D3hot→D0. When 1: context preserved. |
| PCIe Capability ID | 10h — mandatory for all PCIe functions |
| Device/Port Type [7:4] | 0000b=EP · 0001b=Legacy EP · 0100b=Root Port · 0101b=USP · 0110b=DSP |
| Device Control MPS [7:5] | Max Payload Size. Must be ≤ Device Capabilities MPS Supported. 000b=128B … 101b=4KB. |
| Device Control FLR [15] | Writing 1 triggers Function Level Reset if Device Capabilities FLR Capable is set. Completes ≤100ms. |
| Link Control ASPM [1:0] | 00b=off · 01b=L0s · 10b=L1 · 11b=L0s+L1. Both endpoints must agree. |
| Link Status Current Link Speed [3:0] | 0001b=2.5GT/s · 0010b=5GT/s · 0011b=8GT/s · 0100b=16GT/s · 0101b=32GT/s · 0110b=64GT/s |
| Link Status DL Link Active [13] | 1 = DLL active, TLPs can flow. The definitive “link is up” indicator. |
| MSI Capability ID | 05h. Required unless MSI-X is implemented. Max 32 vectors per function, contiguous, one shared address. |
| MSI Enable sequence | Set Interrupt Disable → program address+data → set MME → set MSI Enable |
| MSI-X Capability ID | 11h — optional. Max 2048 vectors, independent address/data/mask per vector. |
| MSI-X Table BIR | 3-bit field identifying which BAR (0–5) holds the MSI-X table in MMIO |
| MSI-X per-vector mask | Bit 0 of Vector Control in each entry. PBA records pending masked interrupts. |
| INTx Interrupt Disable | Command bit 10. Must be 1 before enabling MSI or MSI-X. |
| Gen 6 additions | Link Speed encoding 0110b=64GT/s in Link Status · 64GT/s in Link Capabilities 2 · no format changes to any of the four mandatory structures. |