PCIe Series — PCIe-27: Advanced Error Reporting (AER) — VLSI Trainers

Advanced Error Reporting (AER)

Everything about PCIe error detection — baseline vs AER capability, correctable vs uncorrectable error taxonomy, all error bit positions, error masking and severity control, the First Error Pointer, 128-bit Header Log, ECRC end-to-end data integrity, error forwarding (data poisoning), advisory non-fatal mechanism, and how AER integrates in Gen 6 systems.

📋 Why AER Exists

The basic PCI Status register has only a handful of error bits — detected parity error, signalled system error, received master/target abort. These tell you something went wrong but almost nothing about what. PCI error reporting was designed for a parallel shared bus where all devices could observe each other’s signals; a single SERR# or PERR# pin carried the entire error notification.

PCIe replaces physical error pins with in-band error messages and adds a rich standardised logging structure through the Advanced Error Reporting (AER) capability. AER enables software to determine: which specific error type fired (from a taxonomy of 20+ error categories), which transaction caused the error (via the 128-bit Header Log capturing the guilty TLP’s full header), which error fired first when multiple errors accumulate (First Error Pointer), and whether the error is correctable by hardware or requires software intervention.

AER is implemented as an Extended Capability (Cap ID 0001h) in the extended configuration space (offsets 100h+), accessible only via ECAM. All native PCIe endpoints and Root Ports are strongly recommended to implement AER.
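
As a rough illustration of how software might locate the capability, the sketch below walks the extended capability list in a raw 4 KiB configuration-space image. It assumes the standard extended capability header layout (capability ID in bits [15:0], version in [19:16], next offset in [31:20]). The function name `find_ext_capability` and the sample image are hypothetical, not part of any real driver.

```python
import struct
from typing import Optional

AER_CAP_ID = 0x0001

def find_ext_capability(cfg: bytes, cap_id: int) -> Optional[int]:
    """Walk the extended capability list; return the offset of cap_id, or None."""
    offset = 0x100                      # extended capabilities start at 100h
    visited = set()                     # guards against a malformed, looping list
    while offset >= 0x100 and offset + 4 <= len(cfg) and offset not in visited:
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):   # empty list or absent device
            return None
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next capability offset, DW-aligned
    return None

# Hypothetical image: some other capability (ID 0003h) at 100h, AER at 148h.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0x148 << 20) | (1 << 16) | 0x0003)
struct.pack_into("<I", cfg, 0x148, (0x000 << 20) | (2 << 16) | AER_CAP_ID)
aer_offset = find_ext_capability(bytes(cfg), AER_CAP_ID)
print(hex(aer_offset))
```

The sample deliberately places AER at 148h rather than 100h: the capability can sit anywhere in the extended list, so the absolute offsets quoted in this article assume it is the first entry.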

📋 Baseline vs AER Error Reporting

PCIe defines two tiers of error reporting:

Two Tiers of PCIe Error Reporting

Baseline Error Reporting (Mandatory)
- PCI-compatible Status/Command register bits
- PCIe Capability Device Control error enable bits [3:0]
- ERR_COR / ERR_NONFATAL / ERR_FATAL messages enabled here
- No error type detail: only a “something went wrong” category
- No Header Log: cannot identify the guilty TLP

AER Capability (Strongly Recommended)
- Extended Capability Cap ID 0001h @ offset 100h+
- 20+ specific error type bits with per-bit status, mask, severity
- First Error Pointer: identifies which error fired first
- Header Log: captures the full 128-bit TLP header of the guilty TLP
- ECRC generation and checking capability
Figure 1 — Baseline vs AER. Baseline error reporting (mandatory for all PCIe devices) provides coarse error categorisation through the PCIe Capability Device Control register and the PCI-compatible Command/Status register. AER (Extended Capability 0001h) adds fine-grained per-error-type status, masking, severity control, first-error tracking, and the 128-bit Header Log. Without AER, error investigation is essentially impossible in complex systems.

Baseline Device Control error enable bits (PCIe Capability DW2 bits [3:0])

| Bit | Field | Effect when 1 |
| --- | --- | --- |
| 0 | Correctable Error Reporting Enable | Device sends ERR_COR message when a correctable error is detected |
| 1 | Non-Fatal Error Reporting Enable | Device sends ERR_NONFATAL message for non-fatal uncorrectable errors |
| 2 | Fatal Error Reporting Enable | Device sends ERR_FATAL message for fatal uncorrectable errors |
| 3 | Unsupported Request Reporting Enable | UR errors are reported as non-fatal errors. When 0: UR errors are silently ignored (no message sent). Must remain 0 during enumeration. |
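
A minimal sketch of how the four enable bits might be decoded and set, assuming the caller has already read the 16-bit Device Control value. The helper names and the sample register values are hypothetical.

```python
DEVCTL_ERROR_ENABLES = {
    0: "Correctable Error Reporting Enable",
    1: "Non-Fatal Error Reporting Enable",
    2: "Fatal Error Reporting Enable",
    3: "Unsupported Request Reporting Enable",
}

def decode_devctl_error_enables(devctl: int) -> dict:
    """Map each enable name to True/False for a Device Control value."""
    return {name: bool(devctl >> bit & 1)
            for bit, name in DEVCTL_ERROR_ENABLES.items()}

def enable_all_error_reporting(devctl: int) -> int:
    """Set bits [3:0] and leave every other Device Control field untouched."""
    return devctl | 0x000F

enables = decode_devctl_error_enables(0x0005)   # bits 0 and 2 set
updated = enable_all_error_reporting(0x2810)    # arbitrary sample value
print(enables, hex(updated))
```

A real driver would read-modify-write the register so that unrelated fields such as Max Payload Size are preserved, which is why the helper only ORs in the low four bits.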

📋 Error Taxonomy — Three Severities

PCIe Error Severity Taxonomy

Correctable
- Automatically fixed by hardware
- No software intervention needed
- Link still fully reliable
- Logged in AER Correctable Status
- Reported with ERR_COR message
- Examples: Bad TLP, LCRC retry

Non-Fatal Uncorrectable
- Not auto-correctable by hardware
- Software attention required
- Link still functional (TLPs still route)
- Logged in AER Uncorrectable Status
- Reported with ERR_NONFATAL message
- Examples: Completion Timeout, UR

Fatal Uncorrectable
- Not auto-correctable by hardware
- Software intervention mandatory
- Link likely non-functional
- Logged in AER Uncorrectable Status
- Reported with ERR_FATAL message
- Examples: DL Protocol, Malformed TLP
Figure 2 — Three error severities. Correctable errors are handled silently by hardware (LCRC retry, DLLP retransmit) — software may receive ERR_COR messages to track frequency trends but no action is required. Non-fatal uncorrectable errors require software attention but the link remains usable. Fatal uncorrectable errors indicate the link may be non-functional and require a reset. Software controls severity promotion: a non-fatal error can be escalated to fatal via the AER Severity register.

📋 Error Messages — ERR_COR, ERR_NONFATAL, ERR_FATAL

PCIe error messages are Message TLPs with no data payload. Like all Message requests, they carry a 4-DW header. They are routed to the Root Complex using Route-to-Root-Complex routing. The detecting device sends the appropriate error message when an error occurs and error reporting is enabled. The message carries the Requester ID of the detecting device so the Root Complex knows the source.

| Message | Code | Severity | Link still functional? | Software response |
| --- | --- | --- | --- | --- |
| ERR_COR | 30h | Correctable | Yes | Optional: monitor frequency. Too many correctable errors may indicate hardware degradation. |
| ERR_NONFATAL | 31h | Non-fatal uncorrectable | Yes | Driver must handle. Read AER Uncorrectable Error Status + Header Log. Retry or abort the failed transaction. |
| ERR_FATAL | 33h | Fatal uncorrectable | No | Mandatory reset. Read AER registers before reset. Reset affected device or link segment. Re-enumerate. |

Error messages must be enabled at every bridge in the path. The SERR# Enable bit (Command register bit 8) on each bridge between the error source and the Root Complex must be set to 1 for the error message to propagate upstream. If any bridge in the path has SERR# Enable = 0, the error message is silently dropped at that bridge. The error reporting enables on the source device (Device Control bits [2:0], which belong to the baseline PCIe Capability rather than to AER) must also be set.

📋 AER Capability Structure Layout

AER Extended Capability Structure (Cap ID 0001h): Register Map

| Offset | Register | Notes |
| --- | --- | --- |
| 100h | Extended Capability Header | Cap ID = 0001h, Version, Next Offset |
| 104h | Uncorrectable Error Status [31:0] | RW1C, sticky |
| 108h | Uncorrectable Error Mask [31:0] | 1 = masked (no error message sent) |
| 10Ch | Uncorrectable Error Severity [31:0] | 1 = Fatal, 0 = Non-Fatal per error type |
| 110h | Correctable Error Status [31:0] | RW1C, sticky |
| 114h | Correctable Error Mask [31:0] | 1 = masked (no ERR_COR sent) |
| 118h | Advanced Error Capability and Control (AECR) | ECRC enable, First Error Pointer |
| 11Ch, 120h, 124h, 128h | Header Log DW0, DW1, DW2, DW3 | 4 × DW = 128 bits |

Figure 3: AER Capability structure register map. Five core register groups: Extended Cap Header, Uncorrectable Status/Mask/Severity (3 registers), Correctable Status/Mask (2 registers), AECR (1 register), Header Log (4 DWs). For Root Ports and Root Complex Event Collectors only, three additional registers follow at 12Ch–134h: Root Error Command, Root Error Status, Error Source ID. All Status registers are RW1C (write 1 to clear). All Mask and Severity registers are RWS (read/write sticky).
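
The register map can be turned into a small parser. This is only a sketch: it assumes a little-endian configuration-space image and takes the capability base as a parameter (defaulting to 100h, as in the figure). `AerRegs` and `read_aer` are hypothetical names.

```python
import struct
from typing import NamedTuple

class AerRegs(NamedTuple):
    uncor_status: int     # base + 04h
    uncor_mask: int       # base + 08h
    uncor_severity: int   # base + 0Ch
    cor_status: int       # base + 10h
    cor_mask: int         # base + 14h
    cap_control: int      # base + 18h (AECR)
    header_log: tuple     # base + 1Ch .. base + 28h, four DWs

def read_aer(cfg: bytes, base: int = 0x100) -> AerRegs:
    """Unpack the ten DWs that follow the extended capability header."""
    regs = struct.unpack_from("<10I", cfg, base + 0x04)
    return AerRegs(*regs[:6], header_log=regs[6:10])

# Hypothetical image with Completion Timeout (bit 14) logged as first error.
image = bytearray(4096)
struct.pack_into("<I", image, 0x104, 1 << 14)   # Uncorrectable Error Status
struct.pack_into("<I", image, 0x118, 14)        # AECR: First Error Pointer = 14
regs = read_aer(bytes(image))
print(hex(regs.uncor_status), regs.cap_control & 0x1F)
```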

📋 Uncorrectable Error Status Register (offset 104h)

Each bit corresponds to one specific uncorrectable error type. Hardware sets the bit when the error is detected, regardless of masking or severity settings. Software clears bits by writing 1 to them (RW1C). Multiple bits can be set simultaneously.

| Bit | Error Name | Default Severity | Description |
| --- | --- | --- | --- |
| 4 | Data Link Protocol Error | Fatal | DLL received ACK/NAK with a sequence number that doesn’t match any unacknowledged TLP or the ACKD_SEQ number. Indicates protocol violation at Data Link Layer. |
| 5 | Surprise Down Error | Fatal | Physical Layer reports LinkUp = 0 unexpectedly: link communication failed. Only valid for downstream ports. Fatal because the link is no longer communicating. |
| 12 | Poisoned TLP Received | Non-Fatal | Received a TLP with the EP (Error Poisoned) bit set. Data in the TLP is known to be corrupt. Default Non-Fatal because some devices can handle poisoned data (e.g. audio stream). |
| 13 | Flow Control Protocol Error | Fatal | Flow control credits exceeded or invalid FC DLLP received. Fatal because FC violations indicate the device is not maintaining correct buffer accounting. |
| 14 | Completion Timeout | Non-Fatal | A non-posted request was sent but no completion arrived within the configured timeout period (default 50 µs–50 ms). Non-fatal because a retry is often possible. |
| 15 | Completer Abort | Non-Fatal | Received a completion with Completer Abort (CA) status: the completer had a programming violation or internal error and could not complete the request. |
| 16 | Unexpected Completion | Non-Fatal | Received a completion that does not match any outstanding request tag. May be a mis-routed completion. Advisory non-fatal (handled as ERR_COR) in some scenarios. |
| 17 | Receiver Overflow | Fatal | More TLPs arrived than the receive buffer could hold. Fatal because data was lost. |
| 18 | Malformed TLP | Fatal | TLP header violated formatting rules: bad length, mismatched byte enables, payload exceeds Max Payload Size, illegal type field, etc. Fatal because it indicates a serious protocol violation. |
| 19 | ECRC Error | Non-Fatal | ECRC check failed on a received TLP: data was corrupted end-to-end. Only set if ECRC checking is enabled. Non-fatal as a retry may succeed. |
| 20 | Unsupported Request | Non-Fatal | Completer could not handle the request type. Request was correctly formed but unsupported, e.g. wrong request type for this device. |
| 21 | ACS Violation | Non-Fatal | TLP violated Access Control Services policy at a switch port, e.g. peer-to-peer DMA when ACS source validation rejected the requester ID. |
| 22 | Uncorrectable Internal Error | Fatal | Internal device error that could not be corrected by hardware. Device-specific: what constitutes an internal error is implementation-defined. |
| 23 | MC Blocked TLP | Non-Fatal | A multicast TLP was blocked by an egress port configured to deny forwarding multicast to untranslated addresses. |
| 24 | AtomicOp Egress Blocked | Non-Fatal | An AtomicOp TLP was blocked at an egress port that does not allow AtomicOps to flow to the downstream device. |
| 25 | TLP Prefix Blocked Error | Non-Fatal | A TLP containing an End-to-End TLP Prefix was blocked at an egress port configured to not forward such TLPs. |
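
The bit assignments in the table lend themselves to a lookup-based decoder. The sketch below is illustrative only; the dictionary mirrors the table above, and `decode_uncorrectable` is a hypothetical helper name.

```python
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
}

def decode_uncorrectable(status: int) -> list:
    """Return the names of all error types whose Status bit is set."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if status >> bit & 1]

status = 0x00104000                 # sample value: bits 14 and 20 set
names = decode_uncorrectable(status)
print(names)
```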

📋 Uncorrectable Error Severity Register (offset 10Ch)

The Severity register has the same bit positions as the Uncorrectable Error Status register. Each bit controls whether the corresponding error type is treated as Fatal (1) or Non-Fatal (0). The default values reflect the PCIe specification’s judgment of how serious each error is, but software can change them based on application requirements.

Uncorrectable Error Severity: Default Values (1 = Fatal, 0 = Non-Fatal)

Fatal by Default (Severity bit = 1)
- Data Link Protocol Error (bit 4)
- Surprise Down Error (bit 5)
- Flow Control Protocol Error (bit 13)
- Receiver Overflow (bit 17)
- Malformed TLP (bit 18)
- Uncorrectable Internal Error (bit 22)

Non-Fatal by Default (Severity bit = 0)
- Poisoned TLP (12), Completion Timeout (14)
- Completer Abort (15), Unexpected Completion (16)
- ECRC Error (19), Unsupported Request (20)
- ACS Violation (21)
Figure 4 — Default severity values. Fatal errors are those that fundamentally break the link protocol or overflow buffers. Non-fatal errors are those where the link remains functional and the error is recoverable by software retry or device driver intervention. Software can override any severity bit — for example, escalating Completion Timeout to Fatal in a mission-critical storage controller that must not silently lose I/O requests.
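
Escalating one error type is a single-bit read-modify-write on the Severity register. The following is a sketch under the assumption that the current register value has already been read. `set_severity` and the sample starting value (bits 4, 5, 13, 17, and 18 set) are illustrative.

```python
COMPLETION_TIMEOUT_BIT = 14

def set_severity(severity: int, bit: int, fatal: bool) -> int:
    """Return a new Severity value with one error type set to Fatal or Non-Fatal."""
    if fatal:
        return severity | (1 << bit)
    return severity & ~(1 << bit) & 0xFFFFFFFF

current = 0x00062030                                   # sample register value
escalated = set_severity(current, COMPLETION_TIMEOUT_BIT, fatal=True)
restored = set_severity(escalated, COMPLETION_TIMEOUT_BIT, fatal=False)
print(hex(escalated), hex(restored))
```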

📋 Uncorrectable Error Mask Register (offset 108h)

A set bit in the Mask register prevents an error message from being sent for the corresponding error type. The error still sets its Status bit, but a masked error is not recorded in the Header Log or the First Error Pointer. Masking is useful when a particular error type is expected in a specific context and sending an error message would be misleading.

Common masking use cases: masking Completion Timeout during hot-plug removal (completions are expected not to return), masking Unsupported Request during enumeration probing (UR completions are expected when configuration reads target functions that do not exist, and such reads return all 1s, 0xFFFFFFFF), masking advisory errors that generate noise in specific platform configurations.
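
How Mask, Severity, and the Device Control enables combine can be sketched as a small decision function. This is a simplified model, not the full reporting flow: it ignores the advisory non-fatal rules and the legacy SERR# path, and the function name is hypothetical.

```python
ERR_NONFATAL = 0x31
ERR_FATAL = 0x33
DEVCTL_NONFATAL_EN = 1 << 1
DEVCTL_FATAL_EN = 1 << 2

def uncorrectable_message(bit: int, mask: int, severity: int, devctl: int):
    """Return the message code sent for the error at `bit`, or None if suppressed."""
    if mask >> bit & 1:                  # masked: Status bit is set, no message
        return None
    if severity >> bit & 1:              # Severity bit = 1 means Fatal
        return ERR_FATAL if devctl & DEVCTL_FATAL_EN else None
    return ERR_NONFATAL if devctl & DEVCTL_NONFATAL_EN else None

COMPLETION_TIMEOUT, MALFORMED_TLP = 14, 18
severity = 1 << MALFORMED_TLP            # sample: only Malformed TLP marked Fatal
devctl = 0x000F                          # all baseline reporting enables set

normal = uncorrectable_message(COMPLETION_TIMEOUT, 0, severity, devctl)
fatal = uncorrectable_message(MALFORMED_TLP, 0, severity, devctl)
masked = uncorrectable_message(COMPLETION_TIMEOUT, 1 << COMPLETION_TIMEOUT,
                               severity, devctl)
print(normal, fatal, masked)
```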

📋 Correctable Error Status Register (offset 110h)

Correctable errors are automatically fixed by hardware — the bit is set purely for logging purposes. All correctable errors are reported with ERR_COR messages (if enabled) regardless of severity — there is no correctable error severity register because by definition all correctable errors use ERR_COR.

| Bit | Error Name | Description |
| --- | --- | --- |
| 0 | Receiver Error | Physical Layer detected an error in an incoming packet: 8b/10b code violation, disparity error, or 128b/130b sync header error. Packet discarded. Link Layer informed. Buffer space released. |
| 6 | Bad TLP | Data Link Layer received a TLP with a bad LCRC, an out-of-sequence Sequence Number, or an incorrectly nullified packet. Packet discarded. NAK DLLP sent to trigger retransmission from the Replay Buffer. |
| 7 | Bad DLLP | Data Link Layer received a DLLP with a CRC failure. DLLP dropped. A subsequent DLLP of the same type is expected to carry the same information. |
| 8 | REPLAY_NUM Rollover | The retry counter has rolled over: a set of TLPs has been transmitted four consecutive times without receiving an ACK, and the counter has returned to zero. Hardware automatically retrains the link. |
| 12 | Replay Timer Timeout | Transmitted TLPs did not receive an ACK or NAK within the allowed timeout. Hardware replays all unacknowledged TLPs from the Replay Buffer. |
| 13 | Advisory Non-Fatal Error | An uncorrectable error was downgraded to a correctable advisory notification. The corresponding uncorrectable error bit is also set. An ERR_COR is sent here instead of ERR_NONFATAL to avoid confusing the error source identification. |
| 14 | Corrected Internal Error | An internal device error was detected and corrected by hardware without any data loss or improper behaviour. Device-specific, e.g. ECC correction on internal SRAM. |
| 15 | Header Log Overflow | The Header Log register capacity has been reached, so a subsequent error’s header could not be captured. Only relevant when Multiple Header Recording is enabled. |

📋 Correctable Error Mask Register (offset 114h)

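Same per-bit structure as the Correctable Error Status register. When a mask bit is set, the corresponding correctable error does not generate an ERR_COR message but still sets its Status bit. Most mask bits default to clear, so those correctable errors generate ERR_COR if enabled. The Advisory Non-Fatal Error mask bit is the notable exception: it defaults to set, for compatibility with software that predates Role-Based Error Reporting.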

A common practical use: masking Replay Timer Timeout and Bad TLP in systems with slightly marginal signal integrity — these correctable errors may occur at low frequency during normal operation, and sending ERR_COR messages for every LCRC retry would generate unnecessary interrupt overhead without requiring any corrective action.
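
A mask value for this use case can be built from the bit positions in the Correctable Error Status table. The sketch below is illustrative; `mask_for` and `reported` are hypothetical helpers, and the status value is made up.

```python
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}
BIT_BY_NAME = {name: bit for bit, name in CORRECTABLE_BITS.items()}

def mask_for(*names: str) -> int:
    """Build a Correctable Error Mask value that suppresses the named errors."""
    mask = 0
    for name in names:
        mask |= 1 << BIT_BY_NAME[name]
    return mask

def reported(status: int, mask: int) -> list:
    """Names of set Status bits that are not masked (candidates for ERR_COR)."""
    unmasked = status & ~mask
    return [CORRECTABLE_BITS[b] for b in sorted(CORRECTABLE_BITS)
            if unmasked >> b & 1]

quiet = mask_for("Bad TLP", "Replay Timer Timeout")
still_reported = reported(0x1041, quiet)   # sample status: bits 0, 6, 12 set
print(hex(quiet), still_reported)
```

Because masked errors still set their Status bits, software can keep polling the Status register for trend data even while the messages are suppressed.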

📋 Advanced Error Capability and Control Register (AECR — offset 118h)

The AECR contains the First Error Pointer and ECRC control fields. It is the most operationally important register in the AER structure for error diagnosis.

| Bit(s) | Field | Access | Description |
| --- | --- | --- | --- |
| [4:0] | First Error Pointer | ROS | The bit position in the Uncorrectable Error Status register of the first unmasked uncorrectable error recorded. The pointer is valid only while the Status bit it points to is set. With Multiple Header Recording enabled, clearing that Status bit (by writing 1) advances the pointer to the next-oldest recorded error. A value of 0–31 maps directly to a bit position. |
| 5 | ECRC Generation Capable | RO | Hardware supports generating ECRC on outgoing TLPs (sets TD bit in header and appends ECRC DW). Set by designer. |
| 6 | ECRC Generation Enable | RW | When 1: device generates ECRC on all outgoing TLPs. The TD bit in the TLP header is set and a 32-bit CRC DW is appended to the TLP after the data payload. |
| 7 | ECRC Check Capable | RO | Hardware supports verifying ECRC on incoming TLPs. Set by designer. |
| 8 | ECRC Check Enable | RW | When 1: device checks ECRC on incoming TLPs with TD bit set. ECRC failures set the ECRC Error Status bit and (if not masked and if error reporting enabled) generate ERR_NONFATAL. |
| 9 | Multiple Header Recording Capable | RO | Hardware supports recording headers for more than one uncorrectable error. Set by designer. |
| 10 | Multiple Header Recording Enable | RW | When 1: device records headers for multiple uncorrectable errors (up to a device-specific count). When 0: only the first error’s header is logged, and subsequent errors set the Header Log Overflow bit. |

Using the First Error Pointer. When multiple uncorrectable errors are pending, all their Status bits are set. The First Error Pointer identifies which bit position fired first, and that error’s header is the one in the Header Log. Error handling software should: (1) read First Error Pointer to know which error to investigate first, (2) read Header Log to identify the guilty TLP, (3) service that error, (4) write 1 to the corresponding Status bit to clear it. With Multiple Header Recording enabled, this causes the First Error Pointer and Header Log to advance to the next-oldest recorded error; otherwise the pointer is no longer valid until a new error is captured.
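
The field layout and the clear-after-service step can be sketched as follows. This models register values only, not hardware behaviour. The helper names and the sample AECR value are hypothetical, and only bits [8:0] are decoded.

```python
def decode_aecr(aecr: int) -> dict:
    """Split the low nine bits of the AECR into named fields."""
    return {
        "first_error_pointer": aecr & 0x1F,
        "ecrc_gen_capable": bool(aecr >> 5 & 1),
        "ecrc_gen_enable": bool(aecr >> 6 & 1),
        "ecrc_check_capable": bool(aecr >> 7 & 1),
        "ecrc_check_enable": bool(aecr >> 8 & 1),
    }

def first_error(aecr: int, uncor_status: int):
    """Return the First Error Pointer if it points at a set Status bit, else None."""
    fep = aecr & 0x1F
    return fep if uncor_status >> fep & 1 else None

def rw1c_clear(status: int, bit: int) -> int:
    """Model the effect of writing 1 to a single RW1C bit: only that bit clears."""
    return status & ~(1 << bit)

aecr = 0x000001F2                       # sample: FEP = 18, ECRC gen/check on
status = (1 << 18) | (1 << 14)          # Malformed TLP and Completion Timeout
fields = decode_aecr(aecr)
fep = first_error(aecr, status)         # investigate this error first
remaining = rw1c_clear(status, fep)     # after servicing, clear only that bit
print(fields["first_error_pointer"], hex(remaining))
```

Writing back only the serviced bit matters with RW1C registers: writing the whole value that was read would also clear errors that have not been examined yet.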

📋 Header Log — 128-bit TLP Capture

The Header Log register (offsets 11Ch–128h) captures the full 128-bit (4-DW) header of the TLP that caused the first uncorrectable error. Not all error types record a header — only those where the TLP itself was the cause and where the header is meaningful for diagnosis.

Header Log Register: 128-bit TLP Header Capture
- Header Log DW0 [31:0], offset 11Ch: TLP header DW0 (Fmt, Type, TC, TD, EP, Attr, AT, Length)
- Header Log DW1 [31:0], offset 120h: TLP header DW1 (requests: Requester ID, Tag, Last/First BE; completions: Completer ID, Completion Status, Byte Count)
- Header Log DW2 [31:0], offset 124h: TLP header DW2 (requests: Address [63:32] with 64-bit addressing, or Address [31:2] with 32-bit addressing; completions: Requester ID, Tag, Lower Address)
- Header Log DW3 [31:0], offset 128h: TLP header DW3 (Address [31:2] with 64-bit addressing; unused for 3-DW headers)

Figure 5: Header Log captures the complete TLP header of the first uncorrectable error. DW0 always holds TLP Fmt, Type, TC, TD, EP, Attr, AT, Length. For requests, DW1 holds the Requester ID (Bus/Device/Function of the originator), Tag, and Byte Enables, and DW2 and DW3 hold the address: a 64-bit address uses both, a 32-bit address uses DW2 only. For completions, which have 3-DW headers, DW1 holds the Completer ID, Completion Status, and Byte Count, and DW2 holds the Requester ID, Tag, and Lower Address.

Errors that log headers in the Header Log

| Error type | Is header logged? | What the header tells you |
| --- | --- | --- |
| Poisoned TLP Received | Yes | Address of the poisoned write, Requester ID of the sender, whether it’s a DMA or completion |
| Malformed TLP | Yes | Which TLP type violated the formatting rules and what the offending fields were |
| ECRC Error | Yes | Identifies the TLP whose ECRC failed: address, requester, type |
| Unsupported Request | Yes | What transaction type was not supported and what address/BDF it targeted |
| Unexpected Completion | Yes | The Completion’s Completer ID, Requester ID, and Tag that didn’t match any outstanding request |
| Completion Timeout | No (no TLP to capture) | No header: the error is the absence of a completion, not a received TLP |
| Data Link Protocol Error | No | DLL error, no TLP header involved |
| Flow Control Protocol Error | No | FC error, no TLP header involved |
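
Decoding a logged header is mostly bit slicing. The sketch below assumes the non-flit TLP header layout with header byte 0 in bits [31:24] of Header Log DW0, an 8-bit Tag, and a request (not completion) in DW1. It also ignores Attr[2]. All function names and the sample values are hypothetical.

```python
def format_bdf(rid: int) -> str:
    """Render a 16-bit Requester ID as bus:device.function."""
    return f"{rid >> 8:02x}:{rid >> 3 & 0x1F:02x}.{rid & 0x7}"

def decode_header_dw0(dw0: int) -> dict:
    """Extract the common first-DW fields of a TLP header."""
    return {
        "fmt": dw0 >> 29 & 0x7,
        "type": dw0 >> 24 & 0x1F,
        "tc": dw0 >> 20 & 0x7,
        "td": dw0 >> 15 & 1,          # TLP Digest (ECRC) present
        "ep": dw0 >> 14 & 1,          # Error Poisoned
        "attr": dw0 >> 12 & 0x3,
        "at": dw0 >> 10 & 0x3,
        "length_dw": dw0 & 0x3FF,
    }

def decode_request_dw1(dw1: int) -> dict:
    """Extract Requester ID, Tag, and byte enables from a request's DW1."""
    return {
        "requester": format_bdf(dw1 >> 16),
        "tag": dw1 >> 8 & 0xFF,
        "last_be": dw1 >> 4 & 0xF,
        "first_be": dw1 & 0xF,
    }

# Sample: poisoned 1-DW memory write (Fmt=010b, Type=00000b) from 01:00.0.
dw0 = decode_header_dw0(0x40004001)
dw1 = decode_request_dw1(0x0100120F)
print(dw0, dw1)
```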

📋 Root Complex AER Registers

Root Ports (and Root Complex Event Collectors) have three additional AER registers that endpoints and switch ports do not have. These are the final error collection point for the entire fabric:

| Register | Offset from cap start | Key fields |
| --- | --- | --- |
| Root Error Command | 12Ch | 3 enable bits: Correctable Error Reporting Enable [0], Non-Fatal Error Reporting Enable [1], Fatal Error Reporting Enable [2]. When set, the Root Complex generates an MSI/MSI-X interrupt when the corresponding error type is received from downstream. |
| Root Error Status | 130h | ERR_COR Received [0], Multiple ERR_COR Received [1], ERR_FATAL/NONFATAL Received [2], Multiple ERR_FATAL/NONFATAL Received [3], First Uncorrectable Fatal [4], Non-Fatal Error Messages Received [5], Fatal Error Messages Received [6]. Advanced Error Interrupt Message Number [31:27]: MSI/MSI-X vector used for error interrupts. |
| Error Source Identification | 134h | ERR_COR Source ID [15:0]: BDF of the first device that sent ERR_COR. ERR_FATAL/NONFATAL Source ID [31:16]: BDF of the first device that sent ERR_FATAL or ERR_NONFATAL. Read-only sticky (ROS): software cannot write it. Each field is captured while the matching Received bit in Root Error Status is clear, so it holds the first source until software clears that Status bit. |

Error Source ID is the key to error triage. When an error interrupt fires at the Root Complex, the Error Source ID register immediately tells software the BDF of the originating device — no need to poll all devices. Software reads the BDF from Error Source ID, walks to that device’s AER registers, reads the Uncorrectable Error Status to find the error type, reads the Header Log to identify the guilty TLP, and then decides on the response (retry, reset, or declare the device failed).
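
The first two steps of that triage, reading Root Error Status and turning Error Source ID into a BDF, might look like the sketch below. The flag names follow the table above; `triage` and the sample register values are hypothetical.

```python
ROOT_STATUS_FLAGS = {
    0: "ERR_COR Received",
    1: "Multiple ERR_COR Received",
    2: "ERR_FATAL/NONFATAL Received",
    3: "Multiple ERR_FATAL/NONFATAL Received",
    4: "First Uncorrectable Fatal",
    5: "Non-Fatal Error Messages Received",
    6: "Fatal Error Messages Received",
}

def format_bdf(rid: int) -> str:
    """Render a 16-bit ID as bus:device.function."""
    return f"{rid >> 8:02x}:{rid >> 3 & 0x1F:02x}.{rid & 0x7}"

def triage(root_status: int, source_id: int) -> dict:
    """Summarise which messages arrived and which device sent the first of each."""
    result = {"flags": [name for bit, name in ROOT_STATUS_FLAGS.items()
                        if root_status >> bit & 1]}
    if root_status & 0x1:   # ERR_COR Source ID is meaningful
        result["correctable_source"] = format_bdf(source_id & 0xFFFF)
    if root_status & 0x4:   # ERR_FATAL/NONFATAL Source ID is meaningful
        result["uncorrectable_source"] = format_bdf(source_id >> 16)
    return result

report = triage(root_status=0x24, source_id=0x03000100)
print(report)
```

The source fields are only interpreted when the matching Received bit is set, since a stale ID from an earlier, already-cleared error would otherwise send the handler to the wrong device.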

📋 ECRC — End-to-End CRC

LCRC (Link CRC, added by the Data Link Layer) protects each TLP on a single link segment — it is stripped and re-computed at every switch hop. If data is corrupted inside a switch (internal memory error), the new LCRC covers the corrupted data and the receiver will accept the corrupted TLP without knowing it is wrong.

ECRC (End-to-End CRC) is computed by the originating Requester over the TLP header and data payload, and survives all the way to the final destination. Intermediate devices (switches) forward the ECRC unchanged. The final Completer checks the ECRC and reports an error if it fails. ECRC catches corruption that occurs inside switches — something LCRC cannot do.

ECRC vs LCRC: Coverage Comparison
- Endpoint: ECRC origin
- Link 1: covered by LCRC-A
- Switch: strips and regenerates LCRC
- Link 2: covered by LCRC-B
- Root Complex: ECRC checked here
- ECRC covers Endpoint → Switch internals → Root Complex (end-to-end)
- Corruption inside the switch is undetected by LCRC-B but caught by ECRC
- ECRC is optional: TD bit set in header, ECRC DW appended
Figure 6 — ECRC vs LCRC coverage. LCRC-A protects Link 1 only — stripped and recomputed by the switch. LCRC-B protects Link 2 only. If the switch’s internal memory corrupts the TLP between Link 1 and Link 2, LCRC-B covers the corrupt data and the Root Complex accepts the corrupted TLP. ECRC covers the entire path from originating Endpoint through the switch to the Root Complex — internal switch corruption is detected.
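
The coverage difference can be demonstrated with a toy model. This is only a sketch: `zlib.crc32` stands in for both CRCs and does not reproduce PCIe's actual bit ordering, sequence numbers, or variant-bit handling. All function names are hypothetical.

```python
import zlib

def crc(data: bytes) -> bytes:
    """Stand-in 32-bit CRC (not the exact PCIe LCRC/ECRC computation)."""
    return zlib.crc32(data).to_bytes(4, "little")

def originate(tlp: bytes) -> bytes:
    """Originator appends an end-to-end CRC once."""
    return tlp + crc(tlp)

def send_link(packet: bytes) -> bytes:
    """Transmitter on one link appends a per-link CRC."""
    return packet + crc(packet)

def receive_link(frame: bytes):
    """Receiver checks and strips the per-link CRC; None means rejected."""
    packet, lcrc = frame[:-4], frame[-4:]
    return packet if crc(packet) == lcrc else None

def ecrc_ok(packet: bytes) -> bool:
    """Final destination checks the end-to-end CRC."""
    return crc(packet[:-4]) == packet[-4:]

packet = originate(b"header+payload")
at_switch = receive_link(send_link(packet))                 # Link 1 passes
corrupted = bytes([at_switch[0] ^ 0x01]) + at_switch[1:]    # bit flip inside switch
at_root = receive_link(send_link(corrupted))                # Link 2 CRC covers bad data

delivered = at_root is not None
ecrc_passes = ecrc_ok(at_root)
print(delivered, ecrc_passes)
```

The second link accepts the frame because its CRC was computed after the corruption. Only the end-to-end check, computed before the switch touched the data, notices the flipped bit.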

ECRC format details

📋 Error Forwarding — Data Poisoning

Data poisoning (also called error forwarding) is a mechanism for a device to indicate that a TLP’s data payload is known to be corrupted, without discarding the TLP. The device sets the EP (Error Poisoned) bit in the TLP header’s DW0. The TLP is then forwarded normally — any device that receives a TLP with EP=1 knows the data is bad.

Data Poisoning: When and Why to Poison a TLP

Scenario 1: Poisoned Completion
- Device reads data from ECC memory and an ECC failure is detected
- Instead of dropping the completion (causing Completion Timeout), the device sends the completion with EP=1 (data known corrupt)
- Requester sees the EP bit and knows the data is bad

Scenario 2: Switch Internal Error
- TLP passes through a switch, and the switch detects internal data corruption
- Switch sets EP=1 on the forwarded TLP (error forwarding)
- TLP is still delivered to its final destination, not dropped
- Final receiver sees EP=1 and can take appropriate action
Figure 7 — Data poisoning use cases. Sending a completion with EP=1 is preferable to no completion at all because the Requester can immediately identify that the round-trip path worked (so the problem is at the Completer or inside a switch) rather than seeing a vague Completion Timeout. For streaming data like audio, the receiver may choose to accept poisoned data with a glitch rather than stalling for error recovery.

Rules for data poisoning

📋 Advisory Non-Fatal Errors

Some uncorrectable errors should be reported as ERR_COR (correctable) rather than ERR_NONFATAL to avoid confusion about the source. The rationale: when multiple devices detect the same underlying error event, only the most appropriate device should send the “real” ERR_NONFATAL. Other detectors send ERR_COR as an advisory notification.

PCIe 1.1 introduced Role-Based Error Reporting. Devices compliant with 1.1 or later set the Role-Based Error Reporting bit in the Device Capabilities register. These devices follow the advisory non-fatal rules. Older 1.0 devices do not.

The five advisory non-fatal cases where ERR_COR is sent instead of ERR_NONFATAL:

  1. Completer sent UR or CA completion. The completer sends ERR_COR (not ERR_NONFATAL) because the Requester is better positioned to report the uncorrectable error when it receives the UR/CA completion.
  2. Intermediate device detected poisoned TLP. A switch forwarding a poisoned TLP sends ERR_COR. The final destination is the right device to send ERR_NONFATAL if it cannot handle the data.
  3. Destination received poisoned TLP but can handle it. An audio device receiving a poisoned audio packet accepts the data (glitch better than stall) and sends only ERR_COR.
  4. Requester experienced Completion Timeout but can retry. If the requester retries and expects to succeed, it sends ERR_COR for the timeout.
  5. Unexpected Completion received. Always advisory: the real Requester will eventually time out and send the appropriate error message.
Advisory errors still set the Uncorrectable Status bit. An advisory non-fatal error sets both: the corresponding Uncorrectable Error Status bit (in the AER Uncorrectable register) and the Advisory Non-Fatal Error Status bit (in the AER Correctable register). The uncorrectable bit is set for tracking but no ERR_NONFATAL message is sent — only ERR_COR. Software can distinguish advisory errors by noting that a correctable error message arrived despite an uncorrectable error bit being set.
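
On the software side, the check described above reduces to testing two registers together. A minimal sketch, assuming both Status registers have been read from the device that sent ERR_COR; the function names are hypothetical.

```python
ADVISORY_NON_FATAL = 1 << 13    # bit 13 of Correctable Error Status

def is_advisory(cor_status: int, uncor_status: int) -> bool:
    """True when an ERR_COR was raised on behalf of an uncorrectable error."""
    return bool(cor_status & ADVISORY_NON_FATAL) and uncor_status != 0

def advisory_error_bits(cor_status: int, uncor_status: int) -> list:
    """Bit positions of the uncorrectable errors behind an advisory report."""
    if not is_advisory(cor_status, uncor_status):
        return []
    return [bit for bit in range(32) if uncor_status >> bit & 1]

advisory = is_advisory(0x2000, 1 << 20)           # ANF + Unsupported Request
plain = is_advisory(0x0040, 0)                    # ordinary Bad TLP
bits = advisory_error_bits(0x2000, 1 << 20)
print(advisory, plain, bits)
```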

📋 AER in Gen 6

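The core AER capability structure (Cap ID 0001h, the register offsets and error bit positions described above, the ECRC mechanism, the Header Log, the advisory non-fatal rules, and the error message codes) carries forward into Gen 6. AER is defined at the Transaction Layer, while the headline PCIe 6.0 changes (PAM4 signalling, FEC, flit mode) sit in the Physical and Data Link Layers. PCIe 6.0 does add fields to the capability, for example to log flit-mode TLPs, but existing fields keep their positions, so AER driver code written for Gen 3 still reads the same registers on Gen 6 hardware.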

What changes in Gen 6 AER practice:

| Aspect | Gen 6 change or new consideration |
| --- | --- |
| AER register layout | Unchanged: same offsets, same bit definitions, same error codes |
| New physical layer error types | Gen 6 adds FEC (forward error correction) at the flit level. FEC corrects bit errors silently, so correctly operating FEC does not generate AER events. FEC decode failures that result in corrupted flits will appear as Malformed TLP or ECRC errors in the AER Uncorrectable Status register. |
| ECRC and flit mode | ECRC is computed at the Transaction Layer before flit encapsulation and checked after flit de-encapsulation. Flit-mode framing is transparent to ECRC: it covers the same TLP header and data payload fields as before Gen 6. |
| Receiver Overflow | At 64 GT/s, the rate of TLP arrival is much higher, so receive buffer overflow errors are more likely if the device has insufficient buffer depth. Ensuring adequate buffer sizing is critical for Gen 6 designs. |
| REPLAY_NUM Rollover | More likely at high data rates with long links (e.g. retimers add latency), because ACK latency may be proportionally longer relative to the retransmission window. Increasing the completion timeout and replay timer settings may be needed for Gen 6 links with multiple retimers. |
| IDE (Integrity and Data Encryption) | PCIe 6.0 includes the IDE extended capability (Cap ID 0030h). When IDE is active, TLP data payloads are encrypted and the TLP is integrity-protected. Headers are not encrypted, so the Header Log still shows readable header fields, but any captured payload data is ciphertext. Systems using IDE must factor this into their error investigation procedures. |
| Error investigation tooling | Standard AER tools (Linux aer-inject, Windows AER testing, PCIe Gen 6 compliance test suites) apply without modification. The AER error API is identical. |

📋 Quick Reference

| Item | Value / Rule |
| --- | --- |
| AER Cap ID | 0001h. Extended Capability in 100h+ space, accessible only via ECAM |
| Correctable errors | Fixed by hardware. Status bit set. ERR_COR sent if enabled. No software action needed. Examples: Bad TLP, Bad DLLP, Replay Timer Timeout. |
| Non-fatal uncorrectable | Not hardware-correctable. Software attention required. Link still functional. ERR_NONFATAL sent. Examples: Completion Timeout, Poisoned TLP, UR, ECRC Error. |
| Fatal uncorrectable | Link integrity compromised. Reset required. ERR_FATAL sent. Examples: DL Protocol Error, Surprise Down, Receiver Overflow, Malformed TLP. |
| Status registers | All RW1C (write 1 to clear). Hardware sets on error detection. Software clears after handling. |
| Mask registers | 1 = suppress error message for this error type. Status bit still set, but a masked error does not update the First Error Pointer or Header Log. |
| Severity register | 1 = Fatal, 0 = Non-Fatal per error type. Default: see table above. Software can escalate severity. |
| First Error Pointer [4:0] | In AECR (offset 118h). Bit position of first uncorrectable error. Valid while that Status bit is set; advances on clear when Multiple Header Recording is enabled. |
| Header Log | 128 bits (4 DWs at 11Ch–128h). Captures TLP header of first uncorrectable error. Not all error types produce a logged header. |
| ECRC Generation Enable | AECR bit 6. When 1: device appends 32-bit ECRC DW and sets TD bit in all outgoing TLPs. |
| ECRC Check Enable | AECR bit 8. When 1: device verifies ECRC on incoming TLPs with TD=1. Failure → ECRC Error Status bit + ERR_NONFATAL. |
| ECRC variant bits | Type bit 0 and EP bit treated as 1 during ECRC generation/checking, because both can legally change in transit. |
| Data poisoning (EP=1) | Indicates TLP data is known corrupt. Only legal on TLPs with data payload. Poisoned control writes must be discarded by receiver. |
| Advisory Non-Fatal | Uncorrectable error sent as ERR_COR (not ERR_NONFATAL). Both correctable and uncorrectable Status bits set. Role-Based Error Reporting (Device Cap bit) must be set. |
| Root Error Command | Offset 12Ch. Three enable bits: ERR_COR [0], ERR_NONFATAL [1], ERR_FATAL [2] interrupt generation. Root Complex only. |
| Error Source ID | Offset 134h. BDF of first ERR_COR source [15:0] and first ERR_FATAL/NONFATAL source [31:16]. ROS (read-only sticky). Root Complex only. |
| Error investigation sequence | Read Root Error Status → read Error Source ID → go to source BDF → read Uncorrectable Error Status → read First Error Pointer → read Header Log → decode guilty TLP → service error → clear status bits. |
| Gen 6 changes | Existing AER register layout unchanged. FEC failures appear as Malformed TLP or ECRC errors. IDE encrypts TLP payloads but not headers, so Header Log header fields stay readable. Higher bandwidth increases risk of Receiver Overflow if buffers insufficient. |