PCIe Series — PCIe-26: Device Power States — D0 to D3 — VLSI Trainers
Device Power States — D0 to D3
The four PCIe device power states in full detail — D0 Uninitialized vs D0 Active, the optional D1 and D2 intermediate states, D3hot with its No Soft Reset rule, D3cold with auxiliary power, PME context preservation, wake signalling, state transition delays, and how D-states interact with link L-states in Gen 6 systems.
📋 D-States and L-States — Two Separate Systems
PCIe power management operates at two independent levels that are tightly coupled but not identical:
Device power states (D-states) — control the function’s internal logic: its registers, internal clocks, device-specific power rails, and operating capability. D-states are set by software via PMCSR writes. They apply per function, not per link.
Link power states (L-states) — control the PCIe physical link: whether transmitters are active, whether the reference clock runs, whether PLLs are locked. L-states are set by hardware (ASPM) or triggered by D-state transitions. They apply per link segment, not per function.
The coupling between them: when software places a device in D1, D2, or D3hot, the device autonomously triggers an L1 transition on its link, and hardware handles this without further software involvement. When software returns a device to D0, the link exits L1 automatically: the first configuration write directed at the device takes the link through Recovery back to L0. This coupling recurs throughout this post.
D-states originate from the ACPI specification (D0–D3) and the PCI Bus Power Management Interface Specification. PCIe inherits them fully and makes the PM Capability structure (ID 01h) mandatory for all functions — unlike PCI where it was optional.
📋 Device Power State Map
Figure 1 — Device power state overview. D0 and D3 are mandatory for all PCIe functions. D1 and D2 are optional; a device declares support in PMC register bit 9 (D1 Support) and bit 10 (D2 Support). D3cold is entered by hardware (Vcc removal), not software. The PMCSR Power State field controls D0–D3hot transitions. D1, D2, and D3hot each force the link to L1, where it stays until the next configuration access.
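As a rough illustration of how software might read the optional-state flags, the sketch below decodes a 16-bit PMC value. It assumes the standard PM Capability layout (D1 Support at bit 9, D2 Support at bit 10); the function name and constants are invented for this example.

```python
# Sketch: derive the D-states a function offers from its PMC register value.
# Bit positions assume the standard PM Capability layout; names are illustrative.
PMC_D1_SUPPORT = 1 << 9    # D1 Support flag
PMC_D2_SUPPORT = 1 << 10   # D2 Support flag


def supported_d_states(pmc: int) -> list[str]:
    """Return the D-states available for a given 16-bit PMC value."""
    states = ["D0"]                    # mandatory
    if pmc & PMC_D1_SUPPORT:
        states.append("D1")            # optional
    if pmc & PMC_D2_SUPPORT:
        states.append("D2")            # optional
    states += ["D3hot", "D3cold"]      # mandatory
    return states
```

For a PMC value with neither flag set, the function returns only D0, D3hot, and D3cold, which matches the most common device profile described above.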
📋 D0 — Full On
D0 is the only fully operational state. The function can originate any PCIe transaction type, respond to any request, generate interrupts, and perform DMA. All PCIe functions must implement D0 — it is the state in which drivers load, initialise, and operate the device.
D0 is the only state where ASPM may operate on the link. In all other D-states, the link is forced to L1 (or lower) because the device cannot respond quickly enough to the exit latency requirements of L0s. When a device is in D0, the LTSSM is free to enter L0s and L1 ASPM based on traffic patterns.
Technically, D0 has two sub-states: D0 Uninitialized and D0 Active. These are not directly controlled by the PMCSR register — D0 Uninitialized is entered automatically after reset, and D0 Active is entered when the driver finishes configuration.
📋 D0 Uninitialized vs D0 Active
Figure 2 — D0 sub-states. D0 Uninitialized is the state after any reset or after a D3hot→D0 transition where context was not retained. The device only accepts configuration reads and writes. D0 Active is the normal operating state — the driver has programmed BARs, enabled the Command register, and the device is fully functional. The transition from Uninitialized to Active is entirely handled by software (driver init sequence).
📋 Dynamic Power Allocation — D0 Substates
The PCIe 2.1 revision added Dynamic Power Allocation (DPA), an optional extended capability (located in extended configuration space, which begins at offset 100h) that defines up to 32 numbered substates within D0. The goal is to allow software to negotiate fine-grained power reduction with a device that remains technically in D0 (and therefore does not go offline) but operates at reduced internal performance.
Unlike D1/D2/D3 which take the device partially or fully offline, DPA substates keep the device fully in D0 Active — software can still issue any transaction type. The device internally reduces power (fewer active processing units, reduced clocks on internal logic, power-gated subsystems) while still accepting all requests. Substate 0 always represents the highest power/performance level. Higher substate numbers represent progressively lower power.
DPA is particularly useful for GPUs and AI accelerators where internal compute engines can be clock-gated between workloads without needing a full D3→D0 cycle (which would require driver re-initialisation).
📋 D1 — Light Sleep
D1 is an optional, lightly defined power state. The spec intentionally leaves most of its behaviour device-class-specific; the only power guarantee is that D1 consumes less than D0 and more than D2. In practice, D1 is rarely used in modern PCIe deployments, and most devices implement only D0 and D3hot. The rules the spec does fix are:
Link state: the link is forced to L1 when the device enters D1. This happens automatically via PM_Enter_L1 DLLP without software intervention.
Incoming requests: only configuration transactions and Message TLPs are accepted. All other request types must be returned with Unsupported Request (UR) status. Completions for outstanding requests received during D1 may optionally be treated as Unexpected Completions.
Outgoing requests: the device may not initiate any transactions except a PME message if PME is supported and enabled in this state. PME generation from D1 requires the link to temporarily exit L1 to deliver the message.
Error reporting: error messages triggered by incoming requests may be sent in D1. Errors from other causes (e.g. Completion timeout) are deferred until D0 is restored.
Context: the device may or may not lose its internal context. If PME is supported, the device must retain at least its PME context (see below) regardless.
Recovery: D1 → D0 has zero mandated delay. The function enters D0 Uninitialized immediately after the PMCSR write. Software is still responsible for re-initialising if context was lost.
Software must drain outstanding completions before entering D1. Before writing D1 to PMCSR, software must poll the Transactions Pending bit (Device Status register bit 5 in the PCIe Capability structure) until it reads 0. Only then is it safe to request the state change. If software skips this check, a pending completion returning from upstream will arrive when the device is in D1 and may be treated as an Unexpected Completion — corrupting the outstanding transaction.
📋 D2 — Deep Sleep
D2 is the second optional intermediate state, deeper than D1 but less aggressive than D3hot. Like D1, most of its characteristics are device-class-specific. In practice D2 is even less commonly implemented than D1, and most PCIe device designers skip directly from D0 to D3hot. Its defined rules are:
Link state: link forced to L1 on D2 entry, same as D1.
Incoming requests: identical to D1 — configuration and Messages only, all others UR.
Outgoing requests: identical to D1 — only PME messages permitted.
Context: may be lost, same rule as D1 — PME context must be retained if PME is supported.
Recovery delay: D2 → D0 requires a 200 µs delay after the PMCSR write before the first access (including configuration accesses). This reflects that deeper power savings in D2 may require longer wake time.
The same pre-transition requirement applies: drain all non-posted requests (poll Transactions Pending = 0) before entering D2.
| Property | D1 | D2 |
|---|---|---|
| Mandatory | No | No |
| Link state forced | L1 | L1 |
| Accepted requests | Config + Messages | Config + Messages |
| Can send PME | Yes (if supported) | Yes (if supported) |
| Context retention | May be lost | May be lost |
| D→D0 delay | 0 µs | 200 µs minimum |
| Practical usage | Rare; mostly legacy devices | Very rare; almost never implemented |
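The request-handling rules for D1 and D2 can be sketched as a small filter. This is a simplified model under stated assumptions: the `Tlp` categories and the `handle_request` name are invented for illustration, and D0 Active is included only as the contrasting case in which every request type is serviced.

```python
from enum import Enum


class Tlp(Enum):
    CONFIG = "config"
    MESSAGE = "message"
    MEM_READ = "mem_read"
    MEM_WRITE = "mem_write"
    IO = "io"


# Request types a function still services in each state, per the rules above.
ACCEPTED = {
    "D0_ACTIVE": set(Tlp),
    "D1": {Tlp.CONFIG, Tlp.MESSAGE},
    "D2": {Tlp.CONFIG, Tlp.MESSAGE},
}


def handle_request(d_state: str, tlp: Tlp) -> str:
    """Return 'ACCEPT', or 'UR' when the request is an Unsupported Request."""
    return "ACCEPT" if tlp in ACCEPTED[d_state] else "UR"
```

A memory read arriving in D2 yields `"UR"`, while a configuration read in the same state is accepted.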
📋 D3hot — Full Off (Power On)
D3hot is the mandatory deepest software-accessible power state. Software writes PMCSR Power State bits [1:0] = 11b to enter it. The device is maximally powered down while main power (Vcc) remains applied — the device retains just enough logic to respond to configuration accesses and maintain PME capability. Unlike D1/D2 which are lightly defined, D3hot has precise rules:
Figure 3 — D3hot capabilities and constraints. The device is largely inactive but configuration space remains accessible. The minimum required capability is responding to configuration reads/writes and the PME_Turn_Off Message. All other request types return UR. The link enters L1 on D3hot entry. A 10 ms delay is required after the PMCSR write before any access (including configuration reads).
Link state: link forced to L1 on D3hot entry via PM_Enter_L1 DLLP. Exits L1 only when a configuration access is directed to the device.
Recovery delay: D3hot → D0 requires a mandatory 10 ms delay after the PMCSR write before the first access. This reflects that the device may need significant time to power up internal logic and re-establish internal clocks.
Context retention: the spec does not guarantee context retention in D3hot. However, the No Soft Reset bit (PMCSR bit 3) can change this; see the next section.
Pre-transition requirement: same as D1/D2 — software must poll Transactions Pending = 0 before writing D3hot to PMCSR.
📋 No Soft Reset — Context Retention
In early PCIe versions, a D3hot → D0 transition always implied a soft reset — the device re-initialised all registers to their default values, and the driver had to re-program everything from scratch. The 1.2 revision of the PCI PM spec added the No Soft Reset bit in PMCSR to change this:
Figure 4 — No Soft Reset bit (PMCSR bit 3). When 0 (legacy behaviour), D3hot→D0 is equivalent to a hardware reset: all PCI configuration registers return to their defaults, BARs are zeroed, and the driver must do a full re-initialisation. When 1, the device promises to retain its PCI configuration space context across D3hot→D0. Note: device-specific registers (internal hardware state) may still be lost even when No Soft Reset = 1.
Check No Soft Reset before assuming fast D3→D0. Software should always read PMCSR bit 3 after the device enumerates. If No Soft Reset = 0, every D3hot→D0 cycle requires re-programming BARs and re-initialising the Command register, even if the driver remembers the previous values, because the hardware may have cleared them. If No Soft Reset = 1, the driver can safely skip BAR re-programming and directly re-enable the device. Modern power-managed drivers (Linux power management framework) check this bit during D3→D0 resume.
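A minimal sketch of that decision follows. It assumes the No Soft Reset flag sits at PMCSR bit 3 (mask 0x0008, the value Linux uses for `PCI_PM_CTRL_NO_SOFT_RESET`); the function name and the step descriptions are hypothetical and stand in for real driver calls.

```python
PMCSR_NO_SOFT_RESET = 1 << 3   # No Soft Reset flag (mask 0x0008)


def d3hot_resume_plan(pmcsr: int) -> list[str]:
    """List the driver steps needed after writing D0 from D3hot."""
    plan = ["wait 10 ms before the first access"]
    if pmcsr & PMCSR_NO_SOFT_RESET:
        # Configuration context was retained: BARs and Command are still valid.
        plan.append("re-enable the device")
    else:
        # Treated like a reset: configuration registers are back at defaults.
        plan += ["reprogram BARs", "rewrite Command register", "re-enable the device"]
    # Device-specific internal state may be lost in either case.
    plan.append("restore device-specific state")
    return plan
```

The only branch is on the No Soft Reset flag; the 10 ms wait and the restoration of device-specific state apply on both paths.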
📋 D3cold — Full Off (Power Removed)
D3cold is entered when main power (Vcc) is physically removed from the device. This is a hardware event, not a software write — it happens after the L2/L3 Ready handshake completes and the OS/BIOS triggers actual power removal on the platform. All PCIe functions are required to implement D3cold (the specification is that every function must tolerate Vcc removal).
D3cold has distinct characteristics from D3hot:
Communication: the device has no communication capability whatsoever. The PCIe link is in L2 (if Vaux present) or L3 (if no Vaux). No TLPs, DLLPs, or ordered sets can flow.
Configuration registers: completely inaccessible — PMCSR cannot be read or written while in D3cold.
Context: all context is assumed lost. When power returns, the device undergoes a Fundamental Reset and enters D0 Uninitialized.
Wake signalling: uses hardware signals (Beacon on the PCIe lane or the out-of-band WAKE# pin), not in-band TLPs, since the link is not functional. Beacon is a low-frequency signal that the device drives on its transmit lanes even without a reference clock.
Recovery: requires power restoration, a Fundamental Reset, re-enumeration, and driver re-initialisation. There is no software-defined delay — the timing is controlled by the platform hardware and firmware power sequencing.
D3cold ≠ D3hot. These are often confused. D3hot: device still has Vcc, software writes PMCSR to enter and exit, link is in L1, 10 ms recovery delay after PMCSR write to D0. D3cold: Vcc removed, entered by hardware power removal (not PMCSR write), link is in L2 or L3, recovery requires fundamental reset and full re-enumeration. A driver that handles D3hot resume (10 ms wait then re-enable) cannot use the same code path for D3cold resume (full re-init including BAR re-sizing).
📋 Auxiliary Power (Vaux)
Vaux is a secondary 3.3V standby power supply that remains active even when the main power rail (Vcc/+12V/+3.3V main) is removed. Its presence determines whether the device enters L2 (Vaux present) or L3 (no Vaux) when the system powers down to D3cold.
| Condition | Link state | Device capability in this state |
|---|---|---|
| Vcc present, device in D3hot | L1 | Can respond to config accesses. Can send PME if enabled and powered. |
| Vcc removed, Vaux present | L2 | Can monitor for external events. Can assert Beacon or WAKE# to request power restore. |
| Vcc removed, no Vaux | L3 | No capability. Device is completely powerless. |
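The D-state to link-state coupling described in this post can be written as a small lookup. This is only a sketch: the function name and string labels are made up for illustration, and ASPM behaviour within D0 is noted in a comment rather than modelled.

```python
def link_state(d_state: str, vaux_present: bool = False) -> str:
    """Map a device power state to the resulting link state."""
    if d_state == "D0":
        return "L0"                        # ASPM may still move the link to L0s/L1
    if d_state in ("D1", "D2", "D3hot"):
        return "L1"                        # forced on entry to the D-state
    if d_state == "D3cold":
        return "L2" if vaux_present else "L3"
    raise ValueError(f"unknown D-state: {d_state}")
```

Only D3cold depends on auxiliary power: with Vaux the link rests in L2, without it in L3.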
Vaux-powered devices are commonly found in:
Wake-on-LAN NICs — stay powered to monitor the network for a magic packet and wake the system when one arrives.
Wireless LAN cards — may monitor for connection wake events in standby modes (modern WLAN wake patterns).
Storage controllers — enterprise SSDs and NVMe controllers may use Vaux to preserve wear-levelling metadata across power cycles.
Thunderbolt/USB4 controllers — maintain discovery and authentication state in low-power modes.
Whether a device supports Vaux operation is declared in the PMC register’s PME Support bits [15:11]. Specifically, bit 15 = PME from D3cold support. If this bit is set, the device can send a PME (via Beacon or WAKE# signal) from D3cold — implying it must have Vaux capability.
📋 PME Context — What Must Be Retained
PME context is the minimal set of state that a device must preserve in a low-power state if it supports PME (Power Management Events). Without PME context, the device cannot detect the event that requires waking, cannot generate the PME message, and cannot correctly re-initialise after the wake-up.
PME context includes:
The PME Enable bit in PMCSR (bit 8) — the device must remember it was enabled to send PME
PME Status bit in PMCSR (bit 15) — must retain whether a PME event occurred
Device-specific wake event configuration registers — the specific events that should trigger a PME (e.g. which network packet patterns, which USB device insertion events, which SATA drive insertion signals)
Device-specific event detection logic powered by Vaux (in D3cold)
The requirement by state:
| D-state | PME context requirement |
|---|---|
| D0 | Full context always maintained (device is fully powered) |
| D1 | Must retain PME context if PME is supported in D1 (PMC bit 12 = 1) |
| D2 | Must retain PME context if PME is supported in D2 (PMC bit 13 = 1) |
| D3hot | Must retain PME context if PME is supported in D3hot (PMC bit 14 = 1) |
| D3cold | Must retain PME context on Vaux if PME is supported in D3cold (PMC bit 15 = 1). Requires Vaux-powered logic. |
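To make the per-state requirement concrete, the sketch below decodes the PME Support field. It assumes the standard PMC layout, in which bits [15:11] carry one flag each for D0, D1, D2, D3hot, and D3cold in that order; the helper names are illustrative.

```python
# PME Support occupies PMC bits [15:11], one flag per D-state (assumed layout).
PME_SUPPORT_BIT = {"D0": 11, "D1": 12, "D2": 13, "D3hot": 14, "D3cold": 15}


def pme_capable_states(pmc: int) -> list[str]:
    """Return the D-states from which the function can signal PME."""
    return [state for state, bit in PME_SUPPORT_BIT.items() if pmc & (1 << bit)]


def must_retain_pme_context(pmc: int, d_state: str) -> bool:
    """PME context must be kept in a state only if PME is supported there."""
    return bool(pmc & (1 << PME_SUPPORT_BIT[d_state]))
```

A function advertising PME from D0, D3hot, and D3cold (PMC value 0xC800 in the upper bits) has no PME context obligation in D1 or D2, but does in D3cold, where the retained logic must run from Vaux.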
📋 PME Message — Wake Signalling
When a device in a low-power state detects an event that requires the system to restore its power (a wake event), it signals this by sending a PME message TLP to the Root Complex. The PME message is a standard PCIe Message TLP routed to the Root Complex. It carries the Requester ID (Bus:Device:Function) of the device that generated the event, allowing PM software to identify exactly which device needs service.
Figure 5 — PME wake flow. The device detects a wake event while in a low-power state. If the link is in L1, the device first exits L1 (initiating Recovery to L0). Once in L0, it sends the PME Message TLP upstream. The message routes to the Root Complex which records PME Status and generates an interrupt. The PM interrupt service routine reads the Requester ID, identifies the source device, and writes D0 to its PMCSR to restore it to full operation.
PME message constraints
PME messages can only be delivered when the link is in L0. If the device is in L1 when a wake event occurs, it must first exit L1 (exit electrical idle, go through Recovery) before the PME TLP can be sent.
From D3cold, the PME cannot be an in-band TLP because the link is in L2/L3. Instead, the device uses an out-of-band Beacon (a low-frequency signal on PCIe lanes) or the platform-specific WAKE# sideband signal to request power restoration. Once power is restored and the link trains, a PME TLP may be sent.
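The function that raised the event records it in its own PMCSR PME Status bit (bit 15). This is a sticky RW1C bit that remains set until PM software explicitly clears it by writing 1. The Root Port separately latches the PME Requester ID and its own PME Status flag in the Root Status register of its PCI Express Capability.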
Switches propagate PME messages upstream transparently (Route to Root Complex routing). The Root Complex sees the Requester ID of the originating function, not the switch’s ID.
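The write-1-to-clear behaviour of PME Status is easy to get wrong in read-modify-write code, so here is a toy software model of it. The `Pmcsr` class is purely illustrative (it is not a real driver API) and assumes the writable fields are Power State [1:0], PME Enable [8], and Data Select [12:9].

```python
class Pmcsr:
    """Toy model of PMCSR write semantics (illustrative only)."""

    RW_MASK = 0x1F03        # Power State [1:0], PME Enable [8], Data Select [12:9]
    PME_STATUS = 1 << 15    # RW1C: writing 1 clears, writing 0 has no effect

    def __init__(self, value: int = 0) -> None:
        self.value = value & 0xFFFF

    def write(self, data: int) -> None:
        # Read-only and RW1C bits keep their value; RW bits take the new data.
        new = (self.value & ~self.RW_MASK) | (data & self.RW_MASK)
        if data & self.PME_STATUS:
            new &= ~self.PME_STATUS     # a written 1 clears the status bit
        self.value = new & 0xFFFF
```

Starting from 0x8100 (PME Status and PME Enable set), writing 0x0100 leaves the status bit untouched, while writing 0x8100 clears it and leaves 0x0100. A naive read-modify-write therefore clears a pending PME Status as a side effect unless bit 15 is masked to 0 before the write.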
📋 State Transitions and Delays
Figure 6 — Full D-state transition diagram. Arrows show allowed transitions. D0 can go to D1, D2, or D3hot directly (software PMCSR writes). D1 can go to D2 or D3hot. D2 can go to D3hot. D3hot becomes D3cold when Vcc is removed (hardware event). Recovery from D3hot/D3cold always lands in D0 Uninitialized, not D0 Active — driver must re-initialise. Note: D1 → D0 is not directly shown but is allowed (PMCSR=00b, zero delay).
D3cold → D0 is a hardware path (Vcc restore → Fundamental Reset → enumeration) with a platform-specific minimum delay.
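The software-driven transitions and their minimum waits can be captured in one table. The sketch below lists only the transitions named in this post and treats anything else as an error; the dictionary and function names are hypothetical.

```python
# Minimum wait in microseconds after the PMCSR write, keyed by (from, to).
# Only the software-driven transitions described in this post are listed.
TRANSITION_DELAY_US = {
    ("D0", "D1"): 0,
    ("D0", "D2"): 200,
    ("D0", "D3hot"): 10_000,
    ("D1", "D2"): 200,
    ("D1", "D3hot"): 10_000,
    ("D2", "D3hot"): 10_000,
    ("D1", "D0"): 0,
    ("D2", "D0"): 200,
    ("D3hot", "D0"): 10_000,
}


def min_delay_us(src: str, dst: str) -> int:
    """Return the mandatory wait for a transition, or raise if it is not listed."""
    try:
        return TRANSITION_DELAY_US[(src, dst)]
    except KeyError:
        raise ValueError(f"transition {src} -> {dst} is not listed") from None
```

D3cold does not appear because its entry and exit are hardware events whose timing is set by the platform rather than by a PMCSR write.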
📋 Pre-Transition Software Requirements
Before writing any power state below D0 to the PMCSR, software must ensure no outstanding non-posted requests are pending. Entering D1/D2/D3hot while completions are in flight leaves orphan transactions that will never receive their completions, potentially hanging the device driver.
Quiesce the driver — stop issuing new DMA requests and new MMIO reads.
Read Device Status register (PCIe Capability DW2 bits [31:16]) bit 5 — Transactions Pending.
If Transactions Pending = 1, wait and re-poll. Allow sufficient time for all outstanding completions to return (this time depends on the device’s completion timeout setting — up to 50 ms by default).
Only when Transactions Pending = 0: write the desired power state to PMCSR[1:0].
Wait the mandatory delay (0 µs, 200 µs, or 10 ms depending on target state) before attempting any access.
Skipping the Transactions Pending check is a common driver bug. Linux PM drivers have had multiple incidents where the D3 path omitted this check. The result is that the device enters D3hot while a DMA completion is in flight: the completion returns after D3hot entry, is treated as an Unexpected Completion (or ignored), and the DMA buffer is never properly released. The kernel’s pci_set_power_state() function takes care of the PMCSR write and the mandatory delay, but quiescing DMA and draining outstanding requests beforehand remains the driver’s responsibility in its suspend path.
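The polling and write steps above can be sketched as follows. This is a model under stated assumptions, not driver code: the three register accessors are hypothetical callbacks supplied by the caller, the timeout default simply mirrors the 50 ms figure mentioned above, and quiescing the driver (the first step) is assumed to have happened before the call.

```python
import time
from typing import Callable

DEVSTA_TRANSACTIONS_PENDING = 1 << 5    # Device Status bit 5
PMCSR_POWER_STATE_MASK = 0x0003         # PMCSR[1:0]
PMCSR_PME_STATUS = 1 << 15              # RW1C: writing 1 would clear it
POWER_STATE_BITS = {"D1": 0b01, "D2": 0b10, "D3hot": 0b11}
ENTRY_DELAY_S = {"D1": 0.0, "D2": 200e-6, "D3hot": 10e-3}


def enter_low_power(
    read_devsta: Callable[[], int],
    read_pmcsr: Callable[[], int],
    write_pmcsr: Callable[[int], None],
    target: str,
    timeout_s: float = 0.05,
    poll_interval_s: float = 0.001,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain outstanding requests, then request `target`. Returns False on timeout."""
    state_bits = POWER_STATE_BITS[target]    # KeyError for unknown targets
    waited = 0.0
    while read_devsta() & DEVSTA_TRANSACTIONS_PENDING:
        if waited >= timeout_s:
            return False                     # still pending: leave the state alone
        sleep(poll_interval_s)
        waited += poll_interval_s
    # Keep the other fields; write 0 to PME Status so it is not cleared by accident.
    keep = read_pmcsr() & ~(PMCSR_POWER_STATE_MASK | PMCSR_PME_STATUS) & 0xFFFF
    write_pmcsr(keep | state_bits)
    sleep(ENTRY_DELAY_S[target])             # mandatory wait before the next access
    return True
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a caller that gets `False` back should treat the device as still in D0 and decide whether to retry or abort the suspend.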
📋 PMCSR — The Runtime Control Register
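The PM Control/Status Register (PMCSR) is the primary runtime register for device power management. It is the 16-bit register at byte offset 04h of the PM Capability structure (Cap ID 01h), that is, the low half of DW1. The capability's own base address varies by device and is found by walking the capability list from the Capabilities Pointer at configuration offset 34h.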
| Bit(s) | Field | Access | Purpose |
|---|---|---|---|
| [1:0] | Power State | RW | Current/requested D-state. 00b = D0 · 01b = D1 · 10b = D2 · 11b = D3hot. Write here to change power state. Hardware transitions begin immediately on write, so respect the mandatory delays before accessing the device again. |
| [2] | Reserved | - | Must return 0 when read. Must not write non-zero values. |
| [3] | No Soft Reset | RO | When 1: device retains PCI configuration context across D3hot→D0. When 0: registers reset to defaults on D3hot→D0. Hardware-set, not writable by software. |
| [7:4] | Reserved | - | Must return 0 when read. Must not write non-zero values. |
| [8] | PME Enable | RW | When 1: device is enabled to generate PME messages from the current power state (if PMC says PME is supported in this state). Set this before entering a low-power state if wake is desired. |
| [12:9] | Data Select | RW | Selects which metric the Data register (bits [31:24] of the same DW) reports: power consumption or heat dissipation data, indexed per power state. Legacy PCI field rarely used in PCIe. |
| [14:13] | Data Scale | RO | Scale factor (0.1W per unit or similar) for the Data register reading. Legacy field. |
| [15] | PME Status | RW1C | Set when the device has sent (or wants to send) a PME message. Sticky: persists until software writes 1 to clear it. PM software clears this bit after servicing the wake event. |
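A decoder for the fields above might look like the sketch below. It assumes the standard PMCSR layout with No Soft Reset at bit 3; the `PmcsrFields` dataclass and `decode_pmcsr` function are names invented for this example.

```python
from dataclasses import dataclass

D_STATE_NAMES = ("D0", "D1", "D2", "D3hot")   # indexed by PMCSR[1:0]


@dataclass(frozen=True)
class PmcsrFields:
    power_state: str
    no_soft_reset: bool
    pme_enable: bool
    data_select: int
    data_scale: int
    pme_status: bool


def decode_pmcsr(value: int) -> PmcsrFields:
    """Split a 16-bit PMCSR value into its named fields."""
    return PmcsrFields(
        power_state=D_STATE_NAMES[value & 0x3],
        no_soft_reset=bool(value & (1 << 3)),
        pme_enable=bool(value & (1 << 8)),
        data_select=(value >> 9) & 0xF,
        data_scale=(value >> 13) & 0x3,
        pme_status=bool(value & (1 << 15)),
    )
```

Decoding 0x810B, for instance, gives a function in D3hot with No Soft Reset, PME Enable, and PME Status all set.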
⚡ D-States in Gen 6
The D-state model — D0, D0 Uninitialized, D0 Active, D1, D2, D3hot, D3cold — and all their characteristics (context retention, transition delays, PME support, No Soft Reset behaviour, Transactions Pending requirement) are completely unchanged in Gen 6. The PMCSR register layout, PMC register fields, and all PMCSR field definitions are identical across all PCIe generations.
What changes in Gen 6 device power management practice:
| Aspect | Gen 6 impact |
|---|---|
| D-state register format | Unchanged: same PMC and PMCSR layout, same bit definitions, same delays |
| D3hot→D0 after No Soft Reset | Unchanged: same 10 ms delay, same context retention rules |
| PME message format | Unchanged: same TLP format, same Message Code, same routing |
| Transactions Pending check | More critical at Gen 6: higher throughput means more in-flight requests; the Transactions Pending window may be longer for AI accelerators with many outstanding DMA completions |
| D3hot and L1 coupling | Gen 6 adds L0p within L0 (PCIe-25), but L0p is not used during D3hot; the link remains in L1 when any function on the device is in D3hot |
| D0 Active and L0p | L0p (the Gen 6 in-band bandwidth reduction) only operates in D0 Active, because it requires the device to be fully operational to negotiate bandwidth reduction with the link partner |
| DPA substates in Gen 6 | More relevant for Gen 6 AI accelerators: allows fine-grained power control of compute engines (SM clusters, HBM memory controllers) without entering D3hot and losing driver state |
| PME from D3cold via WAKE# | Unchanged: WAKE# sideband signalling works the same at all generations. CXL.mem devices may have additional protocol-level wake mechanisms but these are in addition to, not replacing, the standard PME path. |
The D-state model is the stable foundation for all PCIe power management. From PCIe Gen 1 to Gen 6, the same PMCSR register, the same four states, the same Transactions Pending check, and the same 10 ms D3hot→D0 delay have been consistent. Drivers written for Gen 3 power management work correctly on Gen 6 hardware without modification to the D-state code paths.
📋 Quick Reference
| Item | Value / Rule |
|---|---|
| Mandatory D-states | D0 and D3 (D3hot + D3cold). D1 and D2 are optional. |
| D1/D2 support declaration | PMC register bit 9 = D1 Supported · bit 10 = D2 Supported. Read-only, set by designer. |
| Pre-transition check | Poll Device Status bit 5 (Transactions Pending) = 0 before any PMCSR state write below D0. |
| D0→D1 delay | 0 µs (immediate) |
| D0/D1→D2 delay | 200 µs minimum before first access after PMCSR write |
| D0/D1/D2→D3hot delay | 10 ms minimum before first access after PMCSR write |
| D1→D0 delay | 0 µs (immediate) |
| D2→D0 delay | 200 µs minimum |
| D3hot→D0 delay | 10 ms minimum |
| No Soft Reset (PMCSR bit 3) | 0 = D3hot→D0 resets all PCI config registers (driver must re-init). 1 = PCI config registers retained across D3hot→D0. |
| D3hot entry | Software writes PMCSR[1:0]=11b. Device sends PM_Enter_L1 DLLP → link enters L1. PMCSR remains accessible. Only config + PME accepted. |
| D3cold entry | Hardware event: Vcc removed (after L2/L3 Ready handshake). PMCSR inaccessible. Link enters L2 (Vaux) or L3 (no Vaux). |
| D3cold exit | Vcc restore → Fundamental Reset → D0 Uninitialized. Full re-enumeration and driver init required. |
| D0 Uninitialized | After any reset or D3hot→D0 with No Soft Reset=0. Config only. Command register enables cleared. BARs zeroed. |
| D0 Active | After driver configures BARs and sets Command register. All transaction types enabled. ASPM may operate. |
| Link state vs D-state | D0→L0 (ASPM free). D1/D2/D3hot→L1 (mandatory, via PM_Enter_L1 DLLP). D3cold→L2 or L3. |
| PME Enable (PMCSR bit 8) | Must be set by software before entering low-power state if wake is desired. Controls whether device may send PME TLP. |
| PME Status (PMCSR bit 15) | RW1C. Set when PME sent. PM software clears by writing 1 after handling wake event. |
| PME from D3cold | PMC bit 15. Uses Beacon or WAKE# sideband (in-band TLP unavailable when link is in L2/L3). Requires Vaux. |
| PME context | Minimum state device must retain to detect and signal a wake event. Required in any state where PME is supported. |
| Vaux present → L2 | Device can monitor events, signal WAKE# or Beacon. No PCIe communication possible. |
| No Vaux → L3 | Device has no power. Cannot detect or signal anything. Wake only by physical power restore. |
| Gen 6 changes | D-state formats, delays, and protocols unchanged. L0p only available in D0 Active. DPA substates more relevant for AI accelerators. Transactions Pending window may be longer at high throughput. |