R 4.4: The Watchdogs of the Link: Understanding REPLAY_TIMER and REPLAY_NUM

In the PCI Express (PCIe) Data Link Layer, the Ack/Nak protocol is incredibly robust. By keeping copies of unacknowledged Transaction Layer Packets (TLPs) in the Replay Buffer, a transmitter can easily rescue and re-send data if it receives a “Nak” indicating a transmission error.

But what happens if the Ack or Nak itself is lost in transit, or the receiving device encounters an error and stops responding altogether? Without a fallback mechanism, the transmitter could end up waiting indefinitely, permanently stalling the system.

To prevent this fatal freeze, the PCIe Data Link Layer employs two critical internal trackers: the REPLAY_TIMER and the REPLAY_NUM counter. Here is how these internal watchdogs ensure your link either recovers automatically or takes drastic measures to fix itself.

1. The REPLAY_TIMER: The Ultimate Watchdog

The REPLAY_TIMER is exactly what it sounds like: a watchdog timer designed to make sure the transmitter is receiving timely Ack or Nak packets for the TLPs it has sent.

Here is how it monitors the link’s health:

  • Starting the Clock: The timer begins running the moment the last symbol of any outbound TLP is transmitted across the link, unless it is already running for an earlier TLP.
  • Resetting on Success: If a valid Ack arrives confirming forward progress, the timer resets to zero. It starts counting again immediately if unacknowledged TLPs are still left in the Replay Buffer; otherwise it stays stopped until the next TLP goes out.
  • The Timeout Trigger: If the timer expires, the transmitter knows that an Ack or Nak should have been received by now, meaning a packet was almost certainly lost. To fix this, the transmitter halts new traffic, re-transmits every single TLP currently stored in the Replay Buffer, and restarts the timer.
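
The three behaviours above can be sketched as a small state machine. This is a toy model rather than real hardware logic: the class name, the method names, and the use of an integer tick count are all assumptions made for illustration.

```python
class ReplayTimer:
    """Toy model of the transmitter-side REPLAY_TIMER (hypothetical API)."""

    def __init__(self, limit):
        self.limit = limit      # expiration value, in arbitrary ticks
        self.running = False
        self.elapsed = 0

    def on_tlp_sent(self):
        # Start counting when a TLP's last symbol goes out,
        # unless the timer is already running for an earlier TLP.
        if not self.running:
            self.running = True
            self.elapsed = 0

    def on_ack(self, unacked_remaining):
        # An Ack that makes forward progress resets the count. The timer
        # keeps running only if TLPs still wait in the Replay Buffer.
        self.elapsed = 0
        self.running = unacked_remaining > 0

    def tick(self, ticks=1):
        # Advance time; return True when the timer expires and a replay is due.
        if not self.running:
            return False
        self.elapsed += ticks
        if self.elapsed >= self.limit:
            self.elapsed = 0    # restart the count for the replayed TLPs
            return True
        return False


timer = ReplayTimer(limit=5)
timer.on_tlp_sent()
expired = [timer.tick() for _ in range(5)]   # only the last tick reports expiry
```

A caller would treat a True result from tick() as the cue to halt new traffic and replay the buffer.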

To keep the timeout period balanced, neither so short that it causes unnecessary replays nor so long that it stalls the system, the REPLAY_TIMER’s expiration value is set to roughly three times the limit of the receiver’s AckNak_LATENCY_TIMER.
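
To make the three-times relationship concrete, here is a minimal sketch of the arithmetic. It assumes the commonly cited shape of the specification formula, with limits measured in symbol times and the L0s adjustment terms left out. The function names are hypothetical and the input numbers are illustrative only, so check the tables in the spec revision you are targeting.

```python
def acknak_latency_limit(max_payload, tlp_overhead, ack_factor,
                         link_width, internal_delay):
    """AckNak latency limit in symbol times (L0s adjustment omitted)."""
    return (max_payload + tlp_overhead) * ack_factor / link_width + internal_delay


def replay_timer_limit(max_payload, tlp_overhead, ack_factor,
                       link_width, internal_delay):
    """REPLAY_TIMER limit: three times the AckNak latency limit."""
    return 3 * acknak_latency_limit(max_payload, tlp_overhead, ack_factor,
                                    link_width, internal_delay)


# Illustrative inputs (assumptions): 128-byte payload on a x1 link.
limit = replay_timer_limit(max_payload=128, tlp_overhead=28, ack_factor=1.4,
                           link_width=1, internal_delay=19)
print(round(limit))  # roughly 712 symbol times for these inputs
```

Widening the link shrinks the first term, which is why wider links can use much shorter timeouts for the same payload size.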

2. The REPLAY_NUM Counter: Knowing When to Quit

While the REPLAY_TIMER is great for recovering from a lost packet, a transmitter cannot simply replay a failing packet infinitely. If a severe electrical issue or physical interference is preventing data from crossing the link, endlessly replaying the buffer will accomplish nothing.

To prevent infinite replay loops, the transmitter utilizes a dedicated 2-bit counter known as REPLAY_NUM.

  • Tracking Attempts: Every time the transmitter is forced to replay the buffer, whether triggered by receiving a Nak or by the REPLAY_TIMER expiring, the REPLAY_NUM counter increments by one. An Ack or Nak that shows forward progress clears it back to 00b, which is why only consecutive failures count.
  • The Four-Strike Rule: Because it is a 2-bit counter, it can only count from 0 (00b) to 3 (11b). If a fourth consecutive replay is required, the counter “rolls over” from 11b back to 00b.
  • Forcing Link Retraining: This rollover event is the ultimate distress signal. It indicates four consecutive failed attempts to deliver the exact same set of TLPs. The Data Link Layer immediately assumes the physical connection is severely compromised and automatically forces the Physical Layer to completely retrain the Link (transitioning the Link Training and Status State Machine into the Recovery state).

Once the physical hardware successfully finishes retraining the link, the transmitter will automatically resume the replay process, hoping the reset has cleared the physical blockage and the TLPs can finally be delivered successfully.
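
The counter logic and the retrain-then-resume sequence described above can be sketched as follows. Everything here is a hypothetical model: ReplayNum, start_replay, and the two callbacks are names invented for illustration and do not correspond to a real driver or hardware interface.

```python
class ReplayNum:
    """Toy model of the 2-bit REPLAY_NUM counter (hypothetical API)."""

    def __init__(self):
        self.value = 0b00

    def on_replay(self):
        # Count one replay. Return True if the counter rolled over
        # from 11b to 00b, which is the cue to retrain the link.
        self.value = (self.value + 1) & 0b11
        return self.value == 0b00

    def on_forward_progress(self):
        # An Ack or Nak that acknowledges new TLPs clears the count.
        self.value = 0b00


def start_replay(counter, request_retrain, resend_buffer):
    # On rollover, ask the Physical Layer to retrain first; the replay
    # then goes ahead once retraining has finished (modelled here as
    # the callback simply returning).
    if counter.on_replay():
        request_retrain()
    resend_buffer()


events = []
counter = ReplayNum()
for _ in range(4):
    start_replay(counter,
                 request_retrain=lambda: events.append("retrain"),
                 resend_buffer=lambda: events.append("replay"))
# events: three plain replays, then a retrain followed by a fourth replay
```

Masking with 0b11 keeps the value to two bits, so the rollover falls out of the arithmetic rather than needing a separate comparison against a limit.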

Summary

The REPLAY_TIMER and REPLAY_NUM counters work hand-in-hand to ensure the transmission pipeline never permanently clogs. By guaranteeing the transmitter never waits indefinitely for a response, and strictly limiting failure loops to four attempts before escalating the issue to the Physical Layer, these watchdogs make the PCIe link highly autonomous and self-healing.
