Everything about PCIe error detection — baseline vs AER capability, correctable vs uncorrectable error taxonomy, all error bit positions, error masking and severity control, the First Error Pointer, 128-bit Header Log, ECRC end-to-end data integrity, error forwarding (data poisoning), advisory non-fatal mechanism, and how AER integrates in Gen 6 systems.
The basic PCI Status register has only a handful of error bits — detected parity error, signalled system error, received master/target abort. These tell you something went wrong but almost nothing about what. PCI error reporting was designed for a parallel shared bus where all devices could observe each other’s signals; a single SERR# or PERR# pin carried the entire error notification.
PCIe replaces physical error pins with in-band error messages and adds a rich standardised logging structure through the Advanced Error Reporting (AER) capability. AER enables software to determine: which specific error type fired (from a taxonomy of 20+ error categories), which transaction caused the error (via the 128-bit Header Log capturing the guilty TLP’s full header), which error fired first when multiple errors accumulate (First Error Pointer), and whether the error is correctable by hardware or requires software intervention.
AER is implemented as an Extended Capability (Cap ID 0001h) in the extended configuration space (offsets 100h+), accessible only via ECAM. All native PCIe endpoints and Root Ports are strongly recommended to implement AER.
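PCIe defines two tiers of error reporting: baseline error reporting, which every PCIe function implements through the Device Control and Device Status registers of the PCI Express Capability structure, and Advanced Error Reporting (AER), an optional extended capability that adds per-error status, masking, severity control, and logging. The baseline tier is controlled by four enable bits in the Device Control register: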
| Bit | Field | Effect when 1 |
|---|---|---|
| 0 | Correctable Error Reporting Enable | Device sends ERR_COR message when a correctable error is detected |
| 1 | Non-Fatal Error Reporting Enable | Device sends ERR_NONFATAL message for non-fatal uncorrectable errors |
| 2 | Fatal Error Reporting Enable | Device sends ERR_FATAL message for fatal uncorrectable errors |
| 3 | Unsupported Request Reporting Enable | UR errors are reported as non-fatal errors. When 0: UR errors are silently ignored (no message sent). Must remain 0 during enumeration. |
PCIe error messages are Message TLPs with a 4-DW header and no data payload, routed to the Root Complex using Route-to-Root-Complex routing. The detecting device sends the appropriate error message when an error occurs and error reporting is enabled. The message carries the Requester ID of the detecting device so the Root Complex knows the source.
| Message | Code | Severity | Link still functional? | Software response |
|---|---|---|---|---|
| ERR_COR | 30h | Correctable | Yes | Optional — monitor frequency. Too many correctable errors may indicate hardware degradation. |
| ERR_NONFATAL | 31h | Non-fatal uncorrectable | Yes | Driver must handle. Read AER Uncorrectable Error Status + Header Log. Retry or abort the failed transaction. |
| ERR_FATAL | 33h | Fatal uncorrectable | No | Mandatory reset. Read AER registers before reset. Reset affected device or link segment. Re-enumerate. |
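
The two tables above combine into a simple decision: the severity class selects the message, and the matching Device Control enable bit decides whether it is sent at all. The following is a minimal Python sketch of that decision. The names `ERR_MESSAGES` and `message_for` are illustrative and not taken from any driver or library, and the sketch models only the Device Control gating (it ignores AER masks and the Unsupported Request enable bit).

```python
# Message codes and Device Control enable bits, taken from the tables above.
# Each entry: (message name, message code, Device Control enable bit).
ERR_MESSAGES = {
    "correctable": ("ERR_COR", 0x30, 1 << 0),
    "nonfatal": ("ERR_NONFATAL", 0x31, 1 << 1),
    "fatal": ("ERR_FATAL", 0x33, 1 << 2),
}


def message_for(severity, device_control):
    """Return (name, code) of the message to send, or None if reporting is disabled."""
    name, code, enable_bit = ERR_MESSAGES[severity]
    if device_control & enable_bit:
        return name, code
    return None


# Hypothetical Device Control value: only Non-Fatal and Fatal reporting enabled.
devctl = 0b0110
print(message_for("fatal", devctl))        # ('ERR_FATAL', 51)
print(message_for("correctable", devctl))  # None
```

With correctable reporting disabled, the error is still detected by hardware, but no ERR_COR message reaches the Root Complex.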
Each bit corresponds to one specific uncorrectable error type. Hardware sets the bit when the error is detected, regardless of masking or severity settings. Software clears bits by writing 1 to them (RW1C). Multiple bits can be set simultaneously.
| Bit | Error Name | Default Severity | Description |
|---|---|---|---|
| 4 | Data Link Protocol Error | Fatal | DLL received ACK/NAK with a sequence number that doesn’t match any unacknowledged TLP or the ACKD_SEQ number. Indicates protocol violation at Data Link Layer. |
| 5 | Surprise Down Error | Fatal | Physical Layer reports LinkUp = 0 unexpectedly — link communication failed. Only valid for downstream ports. Fatal because the link is no longer communicating. |
| 12 | Poisoned TLP Received | Non-Fatal | Received a TLP with the EP (Error Poisoned) bit set. Data in the TLP is known to be corrupt. Default Non-Fatal because some devices can handle poisoned data (e.g. audio stream). |
| 13 | Flow Control Protocol Error | Fatal | Flow control credits exceeded or invalid FC DLLP received. Fatal because FC violations indicate the device is not maintaining correct buffer accounting. |
| 14 | Completion Timeout | Non-Fatal | A non-posted request was sent but no completion arrived within the configured timeout period (default 50 µs–50 ms). Non-fatal because a retry is often possible. |
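| 15 | Completer Abort | Non-Fatal | The completer rejected a request and returned a Completion with Completer Abort (CA) status, because the request violated the completer's programming model or hit an internal error. Logged by the completer, together with the header of the aborted request. |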
| 16 | Unexpected Completion | Non-Fatal | Received a completion that does not match any outstanding request tag. May be a mis-routed completion. Advisory non-fatal (handled as ERR_COR) in some scenarios. |
| 17 | Receiver Overflow | Fatal | More TLPs arrived than the receive buffer could hold — buffer overflow. Fatal because data was lost. |
| 18 | Malformed TLP | Fatal | TLP header violated formatting rules — bad length, mismatched byte enables, payload exceeds Max Payload Size, illegal type field, etc. Fatal because it indicates a serious protocol violation. |
| 19 | ECRC Error | Non-Fatal | ECRC check failed on a received TLP — data was corrupted end-to-end. Only set if ECRC checking is enabled. Non-fatal as a retry may succeed. |
| 20 | Unsupported Request | Non-Fatal | Completer could not handle the request type. Request was correctly formed but unsupported — e.g. wrong request type for this device. |
| 21 | ACS Violation | Non-Fatal | TLP violated Access Control Services policy at a switch port — e.g. peer-to-peer DMA when ACS source validation rejected the requester ID. |
| 22 | Uncorrectable Internal Error | Fatal | Internal device error that could not be corrected by hardware. Device-specific: what constitutes an internal error is implementation-defined. |
| 23 | MC Blocked TLP | Non-Fatal | A multicast TLP was blocked by an egress port configured to deny forwarding multicast to untranslated addresses. |
| 24 | AtomicOp Egress Blocked | Non-Fatal | An AtomicOp TLP was blocked at an egress port that does not allow AtomicOps to flow to the downstream device. |
| 25 | TLP Prefix Blocked Error | Non-Fatal | A TLP containing an End-to-End TLP Prefix was blocked at an egress port configured to not forward such TLPs. |
The Severity register has the same bit positions as the Uncorrectable Error Status register. Each bit controls whether the corresponding error type is treated as Fatal (1) or Non-Fatal (0). The default values reflect the PCIe specification’s judgment of how serious each error is, but software can change them based on application requirements.
The Mask register prevents an error message from being sent for the corresponding error type. A masked error still sets its Status bit, but it is not recorded in the Header Log and does not update the First Error Pointer. Masking is useful when a particular error type is expected in a specific context and sending an error message would be misleading.
Common masking use cases: masking Completion Timeout during hot-plug removal (completions are expected not to return), masking Unsupported Request during enumeration probing (UR completions are expected when configuration reads target functions that do not exist, and the reads return all 1s), and masking advisory errors that generate noise in specific platform configurations.
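How the Status, Mask, and Severity registers interact can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the function name `classify_uncorrectable` and the example register values are made up, and the sketch ignores the Device Control enables and the advisory non-fatal rules.

```python
# Bit positions from the Uncorrectable Error Status table above.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
}


def classify_uncorrectable(status, mask, severity):
    """Return [(error name, outcome)] for every set bit in `status`."""
    results = []
    for bit, name in sorted(UNCORRECTABLE_BITS.items()):
        flag = 1 << bit
        if not status & flag:
            continue
        if mask & flag:
            outcome = "masked (status only)"   # no message is sent
        elif severity & flag:
            outcome = "ERR_FATAL"              # severity bit = 1
        else:
            outcome = "ERR_NONFATAL"           # severity bit = 0
        results.append((name, outcome))
    return results


# Made-up register values: Completion Timeout and Malformed TLP are pending,
# Completion Timeout is masked, Malformed TLP is configured as fatal.
status = (1 << 14) | (1 << 18)
mask = 1 << 14
severity = 1 << 18
for name, outcome in classify_uncorrectable(status, mask, severity):
    print(f"{name}: {outcome}")
# Completion Timeout: masked (status only)
# Malformed TLP: ERR_FATAL
```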
Correctable errors are automatically fixed by hardware — the bit is set purely for logging purposes. All correctable errors are reported with ERR_COR messages (if enabled) regardless of severity — there is no correctable error severity register because by definition all correctable errors use ERR_COR.
| Bit | Error Name | Description |
|---|---|---|
| 0 | Receiver Error | Physical Layer detected an error in an incoming packet — 8b/10b code violation, disparity error, or 128b/130b sync header error. Packet discarded. Link Layer informed. Buffer space released. |
| 6 | Bad TLP | Data Link Layer received a TLP with a bad LCRC, an out-of-sequence Sequence Number, or an incorrectly nullified packet. Packet discarded. NAK DLLP sent to trigger retransmission from the Replay Buffer. |
| 7 | Bad DLLP | Data Link Layer received a DLLP with a CRC failure. DLLP dropped. A subsequent DLLP of the same type is expected to carry the same information. |
| 8 | REPLAY_NUM Rollover | The retry counter has rolled over — a set of TLPs has been transmitted four consecutive times without receiving an ACK, and the counter has returned to zero. Hardware automatically retrains the link. |
| 12 | Replay Timer Timeout | Transmitted TLPs did not receive an ACK or NAK within the allowed timeout. Hardware replays all unacknowledged TLPs from the Replay Buffer. |
| 13 | Advisory Non-Fatal Error | An uncorrectable error was downgraded to a correctable advisory notification. The corresponding uncorrectable error bit is also set. An ERR_COR is sent here instead of ERR_NONFATAL to avoid confusing the error source identification. |
| 14 | Corrected Internal Error | An internal device error was detected and corrected by hardware without any data loss or improper behaviour. Device-specific — e.g. ECC correction on internal SRAM. |
| 15 | Header Log Overflow | The Header Log register capacity has been reached — a subsequent error’s header could not be captured. Only relevant when Multiple Header Recording is enabled. |
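The Correctable Error Mask register has the same per-bit structure as the Correctable Error Status register. When a mask bit is set, the corresponding correctable error does not generate an ERR_COR message but still sets its Status bit. Most mask bits default to clear, so those errors generate ERR_COR if enabled; the Advisory Non-Fatal Error, Corrected Internal Error, and Header Log Overflow mask bits default to set.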
A common practical use: masking Replay Timer Timeout and Bad TLP in systems with slightly marginal signal integrity — these correctable errors may occur at low frequency during normal operation, and sending ERR_COR messages for every LCRC retry would generate unnecessary interrupt overhead without requiring any corrective action.
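A small Python sketch of the masking scenario just described: it splits the set bits of a Correctable Error Status value into errors that would raise ERR_COR and errors that are only logged. The function name `decode_correctable` and the register values are illustrative assumptions.

```python
# Bit positions from the Correctable Error Status table above.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status, mask):
    """Split set status bits into (reported, masked) lists of error names."""
    reported, masked = [], []
    for bit, name in sorted(CORRECTABLE_BITS.items()):
        if status & (1 << bit):
            (masked if mask & (1 << bit) else reported).append(name)
    return reported, masked


# Made-up example: Bad TLP and Replay Timer Timeout are masked on a marginal
# link, and a Receiver Error has also been logged.
status = (1 << 0) | (1 << 6) | (1 << 12)
mask = (1 << 6) | (1 << 12)
reported, masked = decode_correctable(status, mask)
print(reported)  # ['Receiver Error']
print(masked)    # ['Bad TLP', 'Replay Timer Timeout']
```

The masked errors remain visible to software that polls the Status register, so their frequency can still be tracked without interrupt overhead.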
The Advanced Error Capabilities and Control register (AECR) contains the First Error Pointer and the ECRC control fields, which makes it the starting point for diagnosing an uncorrectable error.
| Bit(s) | Field | Access | Description |
|---|---|---|---|
| [4:0] | First Error Pointer | ROS | The bit position in the Uncorrectable Error Status register of the first unmasked uncorrectable error that was logged. The pointer is valid only while the Status bit it points to is set. When software clears that Status bit by writing 1, the pointer becomes invalid until the next error is logged; if Multiple Header Recording is enabled, it instead advances to the next-oldest recorded error. Value of 0–31 maps directly to a bit position. |
| 5 | ECRC Generation Capable | RO | Hardware supports generating ECRC on outgoing TLPs (sets TD bit in header and appends ECRC DW). Set by designer. |
| 6 | ECRC Generation Enable | RW | When 1: device generates ECRC on all outgoing TLPs. The TD bit in the TLP header is set and a 32-bit CRC DW is appended to the TLP after the data payload. |
| 7 | ECRC Check Capable | RO | Hardware supports verifying ECRC on incoming TLPs. Set by designer. |
| 8 | ECRC Check Enable | RW | When 1: device checks ECRC on incoming TLPs with TD bit set. ECRC failures set the ECRC Error Status bit and (if not masked and if error reporting enabled) generate ERR_NONFATAL. |
| 9 | Multiple Header Recording Capable | RO | Hardware supports recording more than one error header. Set by designer. |
| 10 | Multiple Header Recording Enable | RW | When 1: device records headers for multiple uncorrectable errors (up to a device-specific count). When 0: only the first error’s header is logged, and subsequent errors set the Header Log Overflow bit. |
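
The First Error Pointer and ECRC fields can be extracted with plain bit operations. The sketch below is illustrative only: `decode_aecr` and `enable_ecrc` are hypothetical helper names, the register value is made up, and only bits [8:0] are decoded.

```python
def decode_aecr(value):
    """Decode bits [8:0] of the Advanced Error Capabilities and Control register."""
    return {
        "first_error_pointer": value & 0x1F,        # bits [4:0]
        "ecrc_gen_capable": bool(value & (1 << 5)),
        "ecrc_gen_enable": bool(value & (1 << 6)),
        "ecrc_check_capable": bool(value & (1 << 7)),
        "ecrc_check_enable": bool(value & (1 << 8)),
    }


def enable_ecrc(value):
    """Set each ECRC enable bit only where the matching capable bit is set."""
    if value & (1 << 5):
        value |= 1 << 6   # ECRC Generation Enable
    if value & (1 << 7):
        value |= 1 << 8   # ECRC Check Enable
    return value


# Made-up register value: First Error Pointer = 18 (Malformed TLP),
# ECRC generation and checking supported but not yet enabled.
aecr = 0xB2
print(decode_aecr(aecr)["first_error_pointer"])  # 18
print(hex(enable_ecrc(aecr)))                    # 0x1f2
```

Checking the capable bits before setting the enable bits mirrors the RO/RW split in the table: the enables only have meaning where the hardware reports the capability.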
The Header Log register (offsets 11Ch–128h) captures the full 128-bit (4-DW) header of the TLP that caused the first uncorrectable error. Not all error types record a header — only those where the TLP itself was the cause and where the header is meaningful for diagnosis.
| Error type | Is header logged? | What the header tells you |
|---|---|---|
| Poisoned TLP Received | Yes | Address of the poisoned write, Requester ID of the sender, whether it’s a DMA or completion |
| Malformed TLP | Yes | Which TLP type violated the formatting rules and what the offending fields were |
| ECRC Error | Yes | Identifies the TLP whose ECRC failed — address, requester, type |
| Unsupported Request | Yes | What transaction type was not supported and what address/BDF it targeted |
| Unexpected Completion | Yes | The Completion’s Completer ID, Requester ID, and Tag that didn’t match any outstanding request |
| Completion Timeout | No (no TLP to capture) | No header — the error is the absence of a completion, not a received TLP |
| Data Link Protocol Error | No | DLL error — no TLP header involved |
| Flow Control Protocol Error | No | FC error — no TLP header involved |
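
Decoding a logged header is mostly bit slicing. The Python sketch below assumes the classic (non-Flit-Mode) TLP header layout, with header byte 0 in bits [31:24] of the first Header Log DW, and decodes the requester and address only for memory requests with an 8-bit Tag. The names `parse_header_log` and `format_bdf` and the sample values are illustrative assumptions.

```python
def format_bdf(requester_id):
    """Format a 16-bit Requester ID as bus:device.function (non-ARI layout)."""
    bus = (requester_id >> 8) & 0xFF
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


def parse_header_log(dws):
    """Decode a logged non-Flit-Mode TLP header given as four 32-bit DWs."""
    dw0, dw1, dw2, dw3 = dws
    fmt = (dw0 >> 29) & 0x7
    info = {
        "fmt": fmt,
        "type": (dw0 >> 24) & 0x1F,
        "has_data": bool(fmt & 0b010),
        "four_dw_header": bool(fmt & 0b001),
        "td": bool(dw0 & (1 << 15)),            # TLP Digest (ECRC) present
        "ep": bool(dw0 & (1 << 14)),            # Error Poisoned
        "length_dw": (dw0 & 0x3FF) or 1024,     # Length 0 encodes 1024 DW
    }
    if info["type"] == 0b00000:                 # Memory Read/Write request
        info["requester_bdf"] = format_bdf((dw1 >> 16) & 0xFFFF)
        info["tag"] = (dw1 >> 8) & 0xFF
        if info["four_dw_header"]:
            info["address"] = (dw2 << 32) | (dw3 & 0xFFFFFFFC)
        else:
            info["address"] = dw2 & 0xFFFFFFFC
    return info


# Made-up log contents: a poisoned 1-DW Memory Write from 01:00.0 to 0xFEDC0000.
info = parse_header_log((0x40004001, 0x0100120F, 0xFEDC0000, 0x00000000))
print(info["ep"], info["requester_bdf"], hex(info["address"]))
# True 01:00.0 0xfedc0000
```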
Root Ports (and Root Complex Event Collectors) have three additional AER registers that endpoints and switch ports do not have. These are the final error collection point for the entire fabric:
| Register | Offset (AER capability at 100h) | Key fields |
|---|---|---|
| Root Error Command | 12Ch | 3 enable bits: Correctable Error Reporting Enable [0], Non-Fatal Error Reporting Enable [1], Fatal Error Reporting Enable [2]. When set, the Root Complex generates an MSI/MSI-X interrupt when the corresponding error type is received from downstream. |
| Root Error Status | 130h | ERR_COR Received [0], Multiple ERR_COR Received [1], ERR_FATAL/NONFATAL Received [2], Multiple ERR_FATAL/NONFATAL Received [3], First Uncorrectable Fatal [4], Non-Fatal Error Messages Received [5], Fatal Error Messages Received [6]. Advanced Error Interrupt Message Number [31:27] — MSI/MSI-X vector used for error interrupts. |
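| Error Source Identification | 134h | ERR_COR Source ID [15:0]: Requester ID (BDF) of the source of the first ERR_COR received. ERR_FATAL/NONFATAL Source ID [31:16]: Requester ID (BDF) of the source of the first ERR_FATAL or ERR_NONFATAL received. Both fields are read-only sticky (ROS): software cannot write them, and each holds its value until the matching Received bit in Root Error Status is cleared and a new error message arrives. |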
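
As a rough illustration of how a Root Port's view might be decoded, the Python sketch below turns a Root Error Status value and an Error Source Identification value into a small report. The helper names and register values are made up, and the BDF formatting assumes the non-ARI Requester ID layout (8-bit bus, 5-bit device, 3-bit function).

```python
def format_bdf(requester_id):
    """Format a 16-bit Requester ID as bus:device.function (non-ARI layout)."""
    bus = (requester_id >> 8) & 0xFF
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


def decode_root_error(root_status, source_id):
    """Summarise Root Error Status and Error Source Identification."""
    report = {"interrupt_message_number": (root_status >> 27) & 0x1F}
    if root_status & (1 << 0):                       # ERR_COR Received
        report["correctable_source"] = format_bdf(source_id & 0xFFFF)
        report["multiple_correctable"] = bool(root_status & (1 << 1))
    if root_status & (1 << 2):                       # ERR_FATAL/NONFATAL Received
        report["uncorrectable_source"] = format_bdf((source_id >> 16) & 0xFFFF)
        report["multiple_uncorrectable"] = bool(root_status & (1 << 3))
        report["first_uncorrectable_is_fatal"] = bool(root_status & (1 << 4))
    return report


# Made-up values: one fatal error reported by the function at 03:00.0.
report = decode_root_error(0x00000054, 0x03000000)
print(report["uncorrectable_source"])           # 03:00.0
print(report["first_uncorrectable_is_fatal"])   # True
```

A Source ID field is only decoded when the matching Received bit is set, because the field is not meaningful otherwise.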
LCRC (Link CRC, added by the Data Link Layer) protects each TLP on a single link segment — it is stripped and re-computed at every switch hop. If data is corrupted inside a switch (internal memory error), the new LCRC covers the corrupted data and the receiver will accept the corrupted TLP without knowing it is wrong.
ECRC (End-to-End CRC) is computed by the device that originates the TLP (a Requester for requests, a Completer for completions) over the TLP header and data payload, and survives all the way to the final destination. Intermediate devices (switches) forward the ECRC unchanged. The ultimate receiver checks the ECRC and reports an error if it fails. ECRC therefore catches corruption that occurs inside switches, which LCRC cannot do.
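The difference between per-link and end-to-end protection can be demonstrated with a toy two-hop simulation in Python. This is a conceptual sketch, not a bit-exact model: `zlib.crc32` stands in for both the LCRC and ECRC calculations, the CRC covers only a few payload bytes, and the helper names are made up.

```python
import zlib


def crc(data):
    """Stand-in CRC (zlib's CRC-32), used here for both LCRC and ECRC."""
    return zlib.crc32(data)


def send_over_link(tlp_bytes):
    """Transmit side of one link: attach a freshly computed link CRC."""
    return tlp_bytes, crc(tlp_bytes)


def link_check_passes(tlp_bytes, lcrc):
    """Receive side of one link: verify the link CRC."""
    return crc(tlp_bytes) == lcrc


# The originator computes the end-to-end CRC once.
payload = b"\x11\x22\x33\x44"
ecrc = crc(payload)

# Hop 1: originator to switch. The link check passes.
tlp, lcrc = send_over_link(payload)
hop1_ok = link_check_passes(tlp, lcrc)

# A bit flips inside the switch (for example in its internal buffer memory).
corrupted = bytes([tlp[0] ^ 0x01]) + tlp[1:]

# Hop 2: the switch recomputes the link CRC over the already-corrupted bytes.
tlp2, lcrc2 = send_over_link(corrupted)
hop2_ok = link_check_passes(tlp2, lcrc2)

# The destination compares the received bytes against the original ECRC.
ecrc_ok = crc(tlp2) == ecrc

print(hop1_ok, hop2_ok, ecrc_ok)  # True True False
```

Both link checks pass because each LCRC is computed over whatever bytes the transmitter holds at that moment; only the end-to-end value still describes the original data.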
Data poisoning (also called error forwarding) is a mechanism for a device to indicate that a TLP’s data payload is known to be corrupted, without discarding the TLP. The device sets the EP (Error Poisoned) bit in the TLP header’s DW0. The TLP is then forwarded normally — any device that receives a TLP with EP=1 knows the data is bad.
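In terms of header bits, poisoning is a single-bit operation on DW0. The Python sketch below assumes the classic (non-Flit-Mode) header layout, where EP is bit 14 of DW0 and the Fmt field occupies bits [31:29]; the helper names are illustrative.

```python
EP_BIT = 1 << 14  # Error Poisoned bit in header DW0 (non-Flit-Mode layout)


def has_payload(dw0):
    """True when the Fmt field indicates a TLP with data (3-DW or 4-DW header)."""
    return ((dw0 >> 29) & 0x7) in (0b010, 0b011)


def is_poisoned(dw0):
    return bool(dw0 & EP_BIT)


def poison(dw0):
    """Mark a TLP as poisoned; only meaningful for TLPs that carry data."""
    if not has_payload(dw0):
        raise ValueError("EP applies only to TLPs with a data payload")
    return dw0 | EP_BIT


# Made-up DW0 of a 1-DW Memory Write (Fmt = 010b, Type = 00000b, Length = 1).
mwr = 0x40000001
print(hex(poison(mwr)))          # 0x40004001
print(is_poisoned(poison(mwr)))  # True
```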
Some uncorrectable errors should be reported as ERR_COR (correctable) rather than ERR_NONFATAL to avoid confusion about the source. The rationale: when multiple devices detect the same underlying error event, only the most appropriate device should send the “real” ERR_NONFATAL. Other detectors send ERR_COR as an advisory notification.
PCIe 1.1 introduced Role-Based Error Reporting — devices compliant with 1.1 or later set the Role-Based Error Reporting bit in Device Capabilities register. These devices follow the advisory non-fatal rules. Older 1.0 devices do not.
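The five advisory non-fatal cases where ERR_COR is sent instead of ERR_NONFATAL are: (1) a Completer that returns a Completion with UR or CA status; (2) an intermediate Receiver, such as a switch, that detects a Poisoned TLP or an ECRC error on a TLP it forwards; (3) the ultimate Receiver of a Poisoned TLP that is able to handle the poisoned data; (4) a Requester that detects a Completion Timeout and will retry the request; and (5) the Receiver of an Unexpected Completion.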
The AER capability structure — Cap ID 0001h, all register offsets, all error bit positions, ECRC mechanism, Header Log format, advisory non-fatal rules, error message codes — is completely unchanged in Gen 6. AER is defined at the Transaction Layer, and PCIe 6.0 is a Physical Layer change. The same AER driver code that works on Gen 3 works identically on Gen 6 hardware.
What changes in Gen 6 AER practice:
| Aspect | Gen 6 change or new consideration |
|---|---|
| AER register layout | Unchanged — same offsets, same bit definitions, same error codes |
| New physical layer error types | Gen 6 adds FEC (forward error correction) at the flit level. FEC corrects bit errors silently — correctly operating FEC does not generate AER events. FEC decode failures that result in corrupted flits will appear as Malformed TLP or ECRC errors in the AER Uncorrectable Status register. |
| ECRC and flit mode | ECRC is computed at the Transaction Layer before flit encapsulation and checked after flit de-encapsulation. Flit-mode framing is transparent to ECRC — it covers the same TLP header and data payload fields as before Gen 6. |
| Receiver Overflow | At 64 GT/s, the rate of TLP arrival is much higher — receive buffer overflow errors are more likely if the device has insufficient buffer depth. Ensuring adequate buffer sizing is critical for Gen 6 designs. |
| REPLAY_NUM Rollover | More likely at high data rates with long links (e.g. retimers add latency) — ACK latency may be proportionally longer relative to the retransmission window. Increasing the completion timeout and replay timer settings may be needed for Gen 6 links with multiple retimers. |
| IDE (Integrity and Data Encryption) | PCIe 6.0 adds the IDE extended capability (Cap ID 0030h). When IDE is active, TLPs are encrypted. AER errors on encrypted TLPs may not have useful Header Log content (the header fields will be ciphertext). Systems using IDE must factor this into their error investigation procedures. |
| Error investigation tooling | Standard AER tools (Linux aer-inject, Windows AER testing, PCIe Gen 6 compliance test suites) apply without modification. The AER error API is identical. |
| Item | Value / Rule |
|---|---|
| AER Cap ID | 0001h — Extended Capability in 100h+ space, accessible only via ECAM |
| Correctable errors | Fixed by hardware. Status bit set. ERR_COR sent if enabled. No software action needed. Examples: Bad TLP, Bad DLLP, Replay Timer Timeout. |
| Non-fatal uncorrectable | Not hardware-correctable. Software attention required. Link still functional. ERR_NONFATAL sent. Examples: Completion Timeout, Poisoned TLP, UR, ECRC Error. |
| Fatal uncorrectable | Link integrity compromised. Reset required. ERR_FATAL sent. Examples: DL Protocol Error, Surprise Down, Receiver Overflow, Malformed TLP. |
| Status registers | All RW1C (write 1 to clear). Hardware sets on error detection. Software clears after handling. |
| Mask registers | 1=suppress error message for this error type. Status bit still set. For uncorrectable errors, a masked error does not update the First Error Pointer or the Header Log. |
| Severity register | 1=Fatal, 0=Non-Fatal per error type. Default: see table above. Software can escalate severity. |
| First Error Pointer [4:0] | In AECR (offset 118h). Bit position of first uncorrectable error. Valid while that Status bit is set; advances to the next recorded error on clear only when Multiple Header Recording is enabled. |
| Header Log | 128 bits (4 DWs at 11Ch–128h). Captures TLP header of first uncorrectable error. Not all error types produce a logged header. |
| ECRC Generation Enable | AECR bit 6. When 1: device appends 32-bit ECRC DW and sets TD bit in all outgoing TLPs. |
| ECRC Check Enable | AECR bit 8. When 1: device verifies ECRC on incoming TLPs with TD=1. Failure → ECRC Error Status bit + ERR_NONFATAL. |
| ECRC variant bits | Type bit 0 and EP bit treated as 1 during ECRC generation/checking — both can legally change in transit. |
| Data poisoning (EP=1) | Indicates TLP data is known corrupt. Only legal on TLPs with data payload. Poisoned control writes must be discarded by receiver. |
| Advisory Non-Fatal | Uncorrectable error sent as ERR_COR (not ERR_NONFATAL). Both correctable and uncorrectable Status bits set. Role-Based Error Reporting (Device Cap bit) must be set. |
| Root Error Command | Offset 12Ch. Three enable bits: ERR_COR [0], ERR_NONFATAL [1], ERR_FATAL [2] interrupt generation. Root Complex only. |
| Error Source ID | Offset 134h. BDF of first ERR_COR source [15:0] and first ERR_FATAL/NONFATAL source [31:16]. ROS — read-only sticky. Root Complex only. |
| Error investigation sequence | Read Root Error Status → read Error Source ID → go to source BDF → read Uncorrectable Error Status → read First Error Pointer → read Header Log → decode guilty TLP → service error → clear status bits. |
| Gen 6 changes | AER format unchanged. FEC failures appear as Malformed TLP or ECRC errors. IDE-encrypted links may have encrypted Header Log content. Higher bandwidth increases risk of Receiver Overflow if buffers insufficient. |
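
The error investigation sequence from the table can be sketched end to end against a simulated configuration space. Everything here is an illustrative assumption: the `FakeFunction` class stands in for real configuration accesses, the AER capability is assumed to sit at offset 100h as in the text, and the register contents are made up.

```python
AER = 0x100  # AER capability base, assumed to be at 100h as in the text
UNCOR_STATUS, AECR, HEADER_LOG = AER + 0x04, AER + 0x18, AER + 0x1C
ROOT_STATUS, SOURCE_ID = AER + 0x30, AER + 0x34
RW1C_REGS = {UNCOR_STATUS, ROOT_STATUS}


class FakeFunction:
    """Stand-in for one function's config space (offset -> 32-bit value)."""

    def __init__(self, regs):
        self.regs = dict(regs)

    def read32(self, offset):
        return self.regs.get(offset, 0)

    def write32(self, offset, value):
        if offset in RW1C_REGS:
            self.regs[offset] = self.read32(offset) & ~value  # write 1 to clear
        else:
            self.regs[offset] = value


def investigate(root_port, functions):
    """Follow the sequence: root status, source ID, device status, FEP, header."""
    root_status = root_port.read32(ROOT_STATUS)
    if not root_status & (1 << 2):                 # ERR_FATAL/NONFATAL Received
        return None
    source = root_port.read32(SOURCE_ID) >> 16     # BDF of the reporting function
    dev = functions[source]
    status = dev.read32(UNCOR_STATUS)
    first = dev.read32(AECR) & 0x1F                # First Error Pointer
    header = [dev.read32(HEADER_LOG + 4 * i) for i in range(4)]
    dev.write32(UNCOR_STATUS, status)              # clear handled bits
    root_port.write32(ROOT_STATUS, root_status)
    return {"source": source, "status": status,
            "first_error_bit": first, "header": header}


# Made-up scenario: the function at 01:00.0 logged a Poisoned TLP (bit 12).
root = FakeFunction({ROOT_STATUS: (1 << 2) | (1 << 5), SOURCE_ID: 0x0100 << 16})
endpoint = FakeFunction({
    UNCOR_STATUS: 1 << 12, AECR: 12,
    HEADER_LOG: 0x40004001, HEADER_LOG + 4: 0x0000120F, HEADER_LOG + 8: 0xFEDC0000,
})
report = investigate(root, {0x0100: endpoint})
print(report["first_error_bit"], hex(report["header"][0]))  # 12 0x40004001
print(endpoint.read32(UNCOR_STATUS))                        # 0
```

Reading the First Error Pointer and Header Log before clearing the status bits matters, because clearing the pointed-to bit invalidates the logged information.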