S 0.0: PCI Express Physical Layer Architecture and Operation

1.0 PCIe Layered Architecture and the Physical Layer

The PCI Express architecture is conceptually partitioned into three layers: the Transaction Layer, the Data Link Layer, and the Physical Layer. This layered design allows for modularity, enabling the Physical Layer to be adapted for higher data rates with minimal impact on the upper layers.

1.1 Role of the Physical Layer

The Physical Layer is the lowest layer in the PCIe stack, responsible for the actual transmission and reception of data across the serial link. Its functions are managed by the Link Training and Status State Machine (LTSSM). The PCIe specification is purposefully vague on implementation details to allow vendors to innovate, but the functionality is clearly defined.

1.2 Physical Layer Sub-Blocks

The Physical Layer is divided into two main sub-blocks:

Logical Sub-Block: Handles digital logic functions such as packet buffering, scrambling/de-scrambling, encoding/decoding, and byte striping/un-striping.
Electrical Sub-Block: Manages the analog characteristics of the link, including the differential transmitters and receivers, timing recovery, and impedance matching.

2.0 Gen1 and Gen2 Logical Physical Layer (8b/10b Encoding)

For data rates of 2.5 GT/s (Gen1) and 5.0 GT/s (Gen2), PCIe uses an 8b/10b encoding scheme. This scheme converts 8-bit data bytes into 10-bit symbols, which ensures DC balance and provides a sufficient number of signal transitions for reliable clock recovery at the receiver.

2.1 Transmit Logic

The transmit logic prepares packets from the Data Link Layer for serial transmission through a multi-stage process:

Buffering: Packets arriving from the Data Link Layer are first placed in a transmit buffer. This allows for the insertion of Physical Layer-specific packets, such as Ordered Sets, into the data stream.
Multiplexing and Framing: A multiplexer injects control characters (K-characters) and Ordered Sets.
- K-characters (Control): Special characters used for packet framing (e.g., STP for Start of TLP, END for End of Packet) and other control functions. A D/K# bit differentiates them from data characters.
- Ordered Sets: Sequences of characters used for link management, such as link training (TS1, TS2), clock tolerance compensation (SKIP), and power state transitions (FTS, EIOS). Every Ordered Set begins with a COM (K28.5) character.
- Logical Idle: When no packets or Ordered Sets are being sent, Logical Idle characters (Data 00h) are transmitted to maintain the receiver PLL’s lock.
Byte Striping: The byte stream is striped, or distributed, across all active lanes of the link. For a multi-lane link, one byte is transferred per lane simultaneously, increasing the effective data throughput.
Scrambling: The data is XORed with a pseudo-random pattern generated by a Linear-Feedback Shift Register (LFSR). This prevents repetitive data patterns, which reduces EMI and crosstalk between adjacent lanes. The scrambler is reset by the COM character to ensure synchronization between the transmitter and receiver. K-characters, including the SKP characters in SKIP Ordered Sets, are not scrambled, and the LFSR does not advance on SKP characters.
8b/10b Encoding: Each 8-bit character is converted into a 10-bit symbol. The encoder maintains a “running disparity” (the difference between the number of 1s and 0s transmitted) to ensure the signal remains DC-balanced.
Serialization: The parallel 10-bit symbols are converted into a serial bit stream for transmission on each lane by the electrical sub-block.
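The scrambling step can be illustrated with a short sketch. It assumes the 16-bit polynomial X^16 + X^5 + X^4 + X^3 + 1 with a seed of FFFFh, which is how the Gen1/2 scrambler is commonly described, and it assumes K-characters bypass the XOR while still advancing the LFSR. The class name, the (kind, value) symbol representation, and the bit ordering are illustrative choices, not the spec's reference implementation.

```python
# Simplified Gen1/2 scrambler sketch (not guaranteed bit-exact with hardware).
COM = ("K", 0xBC)  # K28.5: resets the LFSR
SKP = ("K", 0x1C)  # K28.0: LFSR does not advance

class Scrambler:
    SEED = 0xFFFF

    def __init__(self):
        self.lfsr = self.SEED

    def _next_mask(self):
        """Shift the LFSR eight times and collect one mask byte."""
        mask = 0
        for i in range(8):
            msb = (self.lfsr >> 15) & 1
            mask |= msb << i
            self.lfsr = (self.lfsr << 1) & 0xFFFF
            if msb:
                self.lfsr ^= 0x0039  # taps for x^5 + x^4 + x^3 + 1
        return mask

    def process(self, kind, value):
        """Scramble or de-scramble one symbol; XOR makes both identical."""
        if (kind, value) == COM:
            self.lfsr = self.SEED
            return kind, value
        if (kind, value) == SKP:
            return kind, value
        mask = self._next_mask()
        if kind == "K":
            return kind, value  # assumed: K-characters are never scrambled
        return kind, value ^ mask

stream = [COM, ("D", 0x00), ("D", 0x4A), SKP, ("K", 0xFB), ("D", 0xFF)]
tx, rx = Scrambler(), Scrambler()
scrambled = [tx.process(k, v) for k, v in stream]
recovered = [rx.process(k, v) for k, v in scrambled]
```

Because scrambling is a plain XOR against the LFSR output, running the same logic at the receiver with an LFSR that was reset by the same COM recovers the original stream.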

2.2 Receive Logic

The receive logic reverses the transmit process to reconstruct the original packet data:

Differential Receiver: The electrical sub-block’s differential receiver senses the voltage difference on the wire to determine if a logical 1 or 0 was sent.
Clock and Data Recovery (CDR): The receiver recovers a clock signal from the transitions in the incoming serial bit stream, a process known as achieving Bit Lock. This recovered clock is used to sample the data.
Deserialization and Symbol Lock: The serial bit stream is converted back into 10-bit parallel symbols. The receiver searches for the unique pattern of the COM character (K28.5) to identify the 10-bit symbol boundaries, a process called achieving Symbol Lock.
Elastic Buffer and Clock Compensation: Received symbols are clocked into an elastic buffer using the recovered clock and clocked out using the receiver’s local clock. Because these two clocks can have slight frequency differences (up to 600 ppm total), the transmitter periodically inserts SKIP Ordered Sets. The receiver’s elastic buffer logic can add or delete SKP characters from these sets to prevent buffer overflow or underflow, compensating for the clock frequency mismatch.
8b/10b Decoding: The 10-bit symbols are decoded back into 8-bit characters plus a D/K# signal. The decoder checks for code violations and disparity errors.
De-Scrambling: The original data is recovered by applying the same LFSR algorithm used by the transmitter. The de-scrambler LFSR is also reset by the COM character and does not advance on SKP characters.
Byte Un-Striping and Filtering: The byte streams from all lanes are combined. Control characters, Ordered Sets, and Logical Idle sequences are filtered out and processed by the Physical Layer, while the remaining Transaction Layer Packets (TLPs) and Data Link Layer Packets (DLLPs) are passed up to the Data Link Layer.
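The Symbol Lock step described above can be sketched as a search for the COM pattern in the raw bit stream. The two K28.5 bit patterns below are taken from the standard 8b/10b code tables (one per running disparity); the function names and the string-of-bits representation are assumptions made for illustration, and a real receiver does this in hardware with additional checks.

```python
from typing import List, Optional

# K28.5 (COM) as 10-bit symbols for negative and positive running disparity.
K28_5_NEG = "0011111010"
K28_5_POS = "1100000101"

def find_symbol_phase(bits: str) -> Optional[int]:
    """Return the bit offset (0-9) of symbol boundaries, or None if no COM is seen."""
    for offset in range(len(bits) - 9):
        if bits[offset:offset + 10] in (K28_5_NEG, K28_5_POS):
            return offset % 10
    return None

def deserialize(bits: str, phase: int) -> List[str]:
    """Cut the bit stream into 10-bit symbols starting at the locked phase."""
    return [bits[i:i + 10] for i in range(phase, len(bits) - 9, 10)]

# Three stray bits, then D21.5, COM, D10.2 (illustrative test pattern).
rx_bits = "101" + "1010101010" + K28_5_NEG + "0101010101"
phase = find_symbol_phase(rx_bits)
symbols = deserialize(rx_bits, phase)
```

Once the phase is known, every later group of ten bits is treated as one symbol until lock is lost.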

3.0 Gen3 Logical Physical Layer (128b/130b Encoding)

To achieve the 8.0 GT/s data rate for Gen3, PCIe transitioned to a 128b/130b encoding scheme. Doubling Gen2 bandwidth while keeping 8b/10b would have required a 10 GT/s line rate, with the associated increase in power consumption and signal degradation on standard circuit board materials. Removing most of the 20% overhead of 8b/10b encoding allows 8.0 GT/s to deliver nearly double the Gen2 bandwidth.

3.1 Key Differences from Gen1/2

Encoding Efficiency: 128b/130b encoding has an overhead of only ~1.5% (2 sync bits for every 128 data bits), a significant improvement over the 20% overhead of 8b/10b.
Framing Mechanism: Instead of K-characters like STP and END, Gen3 uses short “Tokens” (1 to 4 bytes, depending on the token type) embedded within the data stream to frame packets.
Block-Based Structure: Data is transmitted in “Blocks,” each containing 128 bits (16 bytes) of payload preceded by a 2-bit Sync Header.
Synchronization: Block alignment is achieved using the Electrical Idle Exit Ordered Set (EIEOS), which has a recognizable pattern of alternating 00h and FFh bytes.
Scrambling: A more sophisticated 23-bit LFSR is used for scrambling. To reduce crosstalk, different scrambling values can be used for neighboring lanes, either through multiple LFSRs or by using different tap equations from a single LFSR.
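The effect of the encoding change on usable bandwidth can be worked out directly from the line rates and encoding ratios given above. The following is a small arithmetic sketch; the function and table names are illustrative, and the figures cover a single lane in one direction before any packet-level overhead.

```python
# generation: (line rate in GT/s, payload bits, total bits per encoded unit)
ENCODINGS = {
    "Gen1": (2.5, 8, 10),
    "Gen2": (5.0, 8, 10),
    "Gen3": (8.0, 128, 130),
}

def lane_payload_gbps(generation: str) -> float:
    """Payload bandwidth of one lane in Gb/s after encoding overhead."""
    rate, payload_bits, total_bits = ENCODINGS[generation]
    return rate * payload_bits / total_bits

def encoding_overhead(generation: str) -> float:
    """Fraction of transmitted bits that carry no payload."""
    _, payload_bits, total_bits = ENCODINGS[generation]
    return 1 - payload_bits / total_bits

for gen in ENCODINGS:
    gbps = lane_payload_gbps(gen)
    print(f"{gen}: {gbps:.3f} Gb/s ({gbps * 1000 / 8:.1f} MB/s), "
          f"overhead {encoding_overhead(gen):.2%}")
```

Gen2 yields 4.0 Gb/s per lane and Gen3 about 7.88 Gb/s, which is why 8.0 GT/s is treated as a doubling even though the line rate rose by only 60%.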

3.2 Gen3 Framing and Tokens

Data is organized into a Data Stream that begins after a Start of Data Stream (SDS) Ordered Set and contains Data Blocks.

Blocks: The fundamental unit of transmission. A 2-bit Sync Header identifies the block type:
- 10b: Data Block
- 01b: Ordered Set Block
Tokens: Special indicators of 1 to 4 bytes within a Data Block used for framing and control. Key tokens include:
- STP (Start of TLP): Marks the beginning of a TLP and includes a length field in dwords. No END token is needed as the receiver simply counts the dwords.
- SDP (Start of DLLP): Marks the beginning of a DLLP, which always has a fixed length of 8 bytes (2-byte token + 4-byte payload + 2-byte CRC).
- EDB (End of Bad TLP): Appended to a nullified TLP. A TLP is considered nullified if an error is detected during switch cut-through forwarding. The TLP’s LCRC is also inverted.
- EDS (End of Data Stream): Indicates the Data Stream is pausing (for an SOS) or ending.
- IDL (Logical Idle): Filler data (zero bytes) used when the link is idle.
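The block structure described above (a 2-bit Sync Header in front of 16 payload bytes) can be sketched as follows. The Sync Header values come from the list above; packing the block into a single 130-bit integer with the header in the top two bits is an illustrative simplification and does not reflect the bit ordering used on the wire.

```python
from typing import Tuple

SYNC_DATA = 0b10         # Data Block
SYNC_ORDERED_SET = 0b01  # Ordered Set Block
PAYLOAD_BYTES = 16

def build_block(sync: int, payload: bytes) -> int:
    """Pack a Sync Header and 16 payload bytes into one 130-bit integer."""
    if sync not in (SYNC_DATA, SYNC_ORDERED_SET):
        raise ValueError("invalid sync header")
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be exactly 16 bytes")
    return (sync << 128) | int.from_bytes(payload, "big")

def parse_block(block: int) -> Tuple[str, bytes]:
    """Split a 130-bit block back into its type and payload."""
    sync = block >> 128
    payload = (block & ((1 << 128) - 1)).to_bytes(PAYLOAD_BYTES, "big")
    if sync == SYNC_DATA:
        return "data", payload
    if sync == SYNC_ORDERED_SET:
        return "ordered_set", payload
    # 00b and 11b are not valid headers; a receiver would treat this as an error.
    raise ValueError("invalid sync header")

idle_block = build_block(SYNC_DATA, bytes(16))  # sixteen IDL (00h) bytes
kind, payload = parse_block(idle_block)
```

Because the two valid headers differ in both bits, a single bit error in the header always produces an invalid value rather than silently changing the block type.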

3.3 Gen3 Ordered Sets

Ordered Sets in Gen3 are transmitted in their own blocks. The SKP Ordered Set (SOS) is used for clock compensation, similar to the SKIP set in Gen1/2. An SOS sent within a Data Stream also carries the current 23-bit state of the scrambler LFSR, which helps maintain synchronization.

4.0 Electrical Physical Layer

The electrical sub-block forms the physical interface to the link, using low-voltage differential signaling for high-speed data transfer.

4.1 Signaling and Clocking

Differential Signaling: Uses two complementary signals (D+ and D-) to transmit data. This provides high common-mode noise rejection, as noise typically affects both signals equally, leaving the voltage difference between them largely unchanged.
Spread Spectrum Clocking (SSC): An optional feature that modulates the clock frequency over a small range (e.g., 0.5%). This spreads the signal’s radiated energy over a wider frequency band, reducing peak EMI and helping systems meet regulatory emissions standards. SSC is not supported in separate Refclk architectures.
Reference Clock (Refclk) Architecture:
- Common Refclk: Both link partners share the same 100 MHz reference clock.
- Separate Refclks: Each partner uses its own reference clock. This makes timing closure more difficult and requires SSC to be disabled.

4.2 Transmitter (Tx) Characteristics

Receiver Detection: Before training, the transmitter performs a detection sequence to determine if a receiver is present. It applies a voltage step and measures the charge time; a long charge time indicates a receiver’s termination impedance is present. This is done only at 2.5 GT/s.
Impedance: Transmitters must match the 100 Ω differential impedance (80-120 Ω) of the channel to minimize signal reflections.
Signal Conditioning: To counteract signal degradation caused by the transmission channel (e.g., circuit board traces, connectors), transmitters employ equalization techniques.
- Gen1/2 (De-emphasis): After the first bit in a sequence of identical bits (e.g., 11111), subsequent bits are transmitted at a reduced voltage level (-3.5 dB for Gen1, selectable -3.5 dB or -6.0 dB for Gen2). This pre-compensates for the channel’s low-pass filter effect, resulting in a cleaner signal at the receiver.
- Gen3 (3-Tap Equalizer): A more advanced Finite Impulse Response (FIR) filter is used. It adjusts the transmitted signal voltage based on the current bit (cursor), the previous bit (post-cursor), and the next bit (pre-cursor). The weights, or coefficients (C-1, C0, C+1), for these three taps are dynamically tuned during link equalization to optimize the signal for the specific channel. The spec defines 11 starting “presets” for this process.
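The 3-tap equalizer can be sketched as a small FIR filter over the bit sequence. The sketch assumes the usual normalization in which C0 is positive, C-1 and C+1 are zero or negative, and the absolute values sum to 1; the function name and the edge handling (the first and last bits are treated as repeated) are illustrative choices.

```python
import math
from typing import List, Sequence

def tx_fir(bits: Sequence[int], c_pre: float, c_main: float, c_post: float) -> List[float]:
    """Return normalized output levels for a bit sequence.

    c_pre weights the next bit (pre-cursor, C-1), c_main the current bit
    (cursor, C0), and c_post the previous bit (post-cursor, C+1).
    """
    s = [1.0 if b else -1.0 for b in bits]
    out = []
    for n in range(len(s)):
        prev_sym = s[n - 1] if n > 0 else s[n]
        next_sym = s[n + 1] if n + 1 < len(s) else s[n]
        out.append(c_pre * next_sym + c_main * s[n] + c_post * prev_sym)
    return out

# Post-cursor only: behaves like Gen2-style de-emphasis.
levels = tx_fir([0, 1, 1, 1, 1, 0], c_pre=0.0, c_main=0.75, c_post=-0.25)
# levels[1] is the first 1 after a transition, levels[2] a repeated 1.
de_emphasis_db = 20 * math.log10(levels[2] / levels[1])
```

With C+1 = -0.25 the repeated bits are driven at half the amplitude of the transition bit, about -6 dB, matching the larger of the two Gen2 de-emphasis settings. A nonzero C-1 would additionally boost the bit just before a transition.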

4.3 Receiver (Rx) Characteristics

Lane-to-Lane De-skew: On multi-lane links, receivers must compensate for differences in signal arrival times between lanes, which can be up to 20 ns for Gen1, 8 ns for Gen2, and 6 ns for Gen3.
Receiver Equalization (Gen3): In addition to transmitter equalization, Gen3 receivers employ their own equalization to further clean up the signal.
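- CTLE (Continuous-Time Linear Equalizer): A linear filter with a high-pass characteristic that attenuates low-frequency signal content relative to high-frequency content, offsetting the channel’s low-pass response. Because it acts on noise as well as signal, the amount of boost is limited in practice.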
- DFE (Decision Feedback Equalization): A feedback loop where the decision made on a previous bit is used to cancel out the inter-symbol interference it causes on the current bit.
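The DFE idea can be shown with a one-tap sketch. It assumes a toy channel in which each received sample is the current symbol plus a fraction h1 of the previous symbol, and it assumes the receiver already knows h1; a real DFE adapts its tap weights, so the names and model here are purely illustrative.

```python
from typing import List, Sequence

def channel(bits: Sequence[int], h1: float) -> List[float]:
    """Toy channel: current symbol plus h1 times the previous symbol (ISI)."""
    s = [1.0 if b else -1.0 for b in bits]
    return [s[n] + (h1 * s[n - 1] if n > 0 else 0.0) for n in range(len(s))]

def plain_slicer(samples: Sequence[float]) -> List[int]:
    """Decide each bit from the sign of the raw sample, with no equalization."""
    return [1 if x > 0 else 0 for x in samples]

def dfe(samples: Sequence[float], h1: float) -> List[int]:
    """One-tap DFE: subtract the ISI predicted from the previous decision."""
    decisions = []
    prev = 0.0  # no previous decision before the first sample
    for x in samples:
        y = x - h1 * prev
        prev = 1.0 if y > 0 else -1.0
        decisions.append(1 if prev > 0 else 0)
    return decisions

tx_bits = [1, 0, 0, 1, 1, 0, 1, 0]
rx_samples = channel(tx_bits, h1=1.2)  # ISI larger than the cursor itself
```

In this example the plain slicer gets every bit after a transition wrong, because the previous symbol dominates the sample, while the DFE recovers the sequence as long as its earlier decisions were correct.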

5.0 Link Training and Status State Machine (LTSSM)

The LTSSM is a hardware state machine in the Physical Layer that manages the entire link initialization and training process. It progresses through a series of states to establish a fully operational link.

5.1 Key Training Processes

During training, devices exchange TS1 and TS2 Ordered Sets to negotiate the following:

Bit/Symbol/Block Lock: The receiver synchronizes its clock and framing to the incoming data stream.
Link Width: Devices determine the maximum number of mutually supported lanes (e.g., x1, x4, x8, x16).
Lane Numbering: The downstream port assigns lane numbers.
Lane Reversal: An optional feature to simplify board layout by allowing the logical lane numbering to be reversed if one device supports it.
Polarity Inversion: The receiver can automatically correct for swapped D+ and D- signals.
Data Rate: Devices advertise their supported speeds and negotiate the highest common rate.
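The outcome of width and data rate negotiation can be summarized in a few lines. This is only a sketch of the result: the real exchange happens through fields in the TS1 and TS2 Ordered Sets, and the function and parameter names below are hypothetical.

```python
from typing import Iterable, Tuple

def negotiate_link(local_widths: Iterable[int], partner_widths: Iterable[int],
                   local_rates: Iterable[float], partner_rates: Iterable[float]
                   ) -> Tuple[int, float]:
    """Return the widest common lane count and the highest common rate in GT/s."""
    common_widths = set(local_widths) & set(partner_widths)
    common_rates = set(local_rates) & set(partner_rates)
    if not common_widths or not common_rates:
        raise ValueError("no common link configuration")
    return max(common_widths), max(common_rates)

# A x16 Gen3 port connected to a x8 Gen2 device.
width, rate = negotiate_link({1, 4, 8, 16}, {1, 4, 8},
                             {2.5, 5.0, 8.0}, {2.5, 5.0})
```

Here the link would settle at x8 and 5.0 GT/s, the best configuration both sides support.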

5.2 Overview of LTSSM States

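The LTSSM consists of 11 primary states. The table below also lists L3 for completeness; L3 is a link power state in which power is removed, not an LTSSM state: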

State	Description
Detect	The initial state after reset. The transmitter electrically checks for a receiver’s presence.
Polling	Devices exchange TS1s at 2.5 GT/s to achieve bit/symbol lock and detect each other. The link may enter a compliance test mode from this state.
Configuration	The link width and lane numbers are negotiated through the exchange of TS1s and TS2s.
L0	The normal, fully operational state. TLP and DLLP traffic can be exchanged.
Recovery	A retraining state used to change link speed/width, recover from errors, or exit from the L1 power state. Includes the Gen3 equalization process.
L0s	A low-power Active State Power Management (ASPM) state with a short exit latency. Entered independently for each link direction.
L1	A lower-power ASPM state with a longer exit latency. Both link directions must enter L1 together. Exiting L1 requires going through Recovery.
L2	A deep power-saving state where main power may be off, but an auxiliary voltage (Vaux) is present. Wake-up is initiated via a low-frequency Beacon signal.
L3	The deepest power-off state with no power.
Hot Reset	An in-band reset mechanism triggered by software.
Loopback	A test/debug state where the slave device loops back all received data to the master.
Disabled	The link is logically turned off and lanes are electrically idle.
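The states in the table can be modeled as a small transition map. The map below is a simplified, partial reading of the top-level LTSSM transitions and should be treated as an assumption rather than a complete statement of the specification; substates, timeouts, and the conditions that trigger each transition are omitted.

```python
from enum import Enum, auto
from typing import Sequence

class State(Enum):
    DETECT = auto()
    POLLING = auto()
    CONFIGURATION = auto()
    L0 = auto()
    RECOVERY = auto()
    L0S = auto()
    L1 = auto()
    L2 = auto()
    HOT_RESET = auto()
    LOOPBACK = auto()
    DISABLED = auto()

S = State
# Assumed, simplified top-level transitions (not exhaustive).
TRANSITIONS = {
    S.DETECT: {S.POLLING},
    S.POLLING: {S.CONFIGURATION, S.DETECT},
    S.CONFIGURATION: {S.L0, S.RECOVERY, S.LOOPBACK, S.DISABLED, S.DETECT},
    S.L0: {S.RECOVERY, S.L0S, S.L1, S.L2},
    S.RECOVERY: {S.L0, S.CONFIGURATION, S.HOT_RESET, S.LOOPBACK,
                 S.DISABLED, S.DETECT},
    S.L0S: {S.L0, S.RECOVERY},
    S.L1: {S.RECOVERY},
    S.L2: {S.DETECT},
    S.HOT_RESET: {S.DETECT},
    S.LOOPBACK: {S.DETECT},
    S.DISABLED: {S.DETECT},
}

def is_valid_path(path: Sequence[State]) -> bool:
    """Check that every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))

initial_training = [S.DETECT, S.POLLING, S.CONFIGURATION, S.L0]
l1_exit = [S.L1, S.RECOVERY, S.L0]
```

The map makes two points from the table explicit: initial training always passes through Detect, Polling, and Configuration before reaching L0, and L1 has no direct path back to L0.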

5.3 Gen3 Link Equalization Process

When training to 8.0 GT/s for the first time, the LTSSM enters the Recovery.Equalization state to perform a multi-phase handshake to tune the transmitter equalization settings.

Phase 0: The upstream port transmits using an initial preset value provided by the downstream port.
Phase 1: Both ports transmit at 8.0 GT/s using their initial presets and exchange TS1s that advertise their transmitters’ full-swing (FS) and low-frequency (LF) values, which bound the coefficients the link partner may request in later phases.
Phase 2: The upstream port’s receiver requests specific coefficient or preset changes from the downstream port’s transmitter. The downstream port applies these settings, and the upstream port evaluates the result. This iterative process continues until the signal quality meets requirements (BER < 10⁻¹²).
Phase 3: The roles are reversed. The downstream port’s receiver evaluates and requests changes to the upstream port’s transmitter settings.

Once complete, the final equalization coefficients are stored by the devices for future use at 8.0 GT/s, avoiding the need to retrain unless the electrical environment changes.
