How PCIe devices initiate Direct Memory Access, why unrestricted DMA is a critical security and reliability risk, how the IOMMU translates device-visible IOVA addresses to physical memory, the structure of IOMMU remapping tables, ATS-based translation caching, and how DMA isolation scales in Gen 6 AI and virtualised infrastructure.
Transferring data between a PCIe device and system memory is fundamental to every I/O operation. The simplest approach, Programmed I/O (PIO), has the CPU read device registers into its internal registers and then write those values to system memory, one word at a time. This works but is deeply inefficient: every word transferred costs two bus transactions (a read from the device and a write to memory), and, more importantly, the CPU is occupied with data movement rather than computation.
Direct Memory Access (DMA) removes the CPU from the data path. The PCIe device’s internal DMA engine initiates memory transactions directly, reading data from system memory into the device’s transmit buffers or writing data from the device’s receive buffers into system memory, without CPU involvement per transfer. The CPU only needs to set up the DMA descriptor (source address, destination address, byte count) and is then free to do other work. When the DMA completes, the device sends an interrupt to notify the CPU.
DMA is the dominant data transfer model for all high-performance PCIe devices: network adapters (receiving packets directly to kernel socket buffers), NVMe SSDs (writing read data directly to application memory), GPUs (copying buffers between VRAM and system RAM), and AI accelerators (streaming model weights from DRAM to on-chip SRAM).
In PCIe, a device that can initiate DMA is called a Bus Master — a term inherited from PCI where arbitration for shared bus ownership was required. In PCIe, there is no shared bus and no arbitration — every link is point-to-point and full-duplex. Any device may send a TLP at any time subject to flow control credit availability. But the term Bus Master is retained because the same configuration register bit controls DMA capability.
A PCIe device can only initiate DMA (Memory Read or Memory Write TLPs as Requester) when the Bus Master Enable bit in its Command register (offset 04h, bit 2) is set to 1. While Bus Master Enable is 0, the Function must not issue Memory or I/O Requests, which also blocks MSI/MSI-X interrupts because they are Memory Writes; it can still return Completions and send Messages. This gives the OS a hardware-enforced way to prevent a device from performing DMA before the OS is ready to manage it.
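The bit test described above can be sketched in a few lines. This is an illustrative model only: the helper names are hypothetical, and the assumption is that `cfg` holds the first bytes of a function's configuration space (on Linux these can be read from the `config` file under `/sys/bus/pci/devices/<BDF>/`).

```python
# Bit positions in the 16-bit PCI Command register (configuration offset 04h).
CMD_IO_SPACE = 1 << 0
CMD_MEMORY_SPACE = 1 << 1
CMD_BUS_MASTER = 1 << 2  # Bus Master Enable


def command_from_config(cfg: bytes) -> int:
    """Extract the little-endian Command register from raw config space bytes."""
    return int.from_bytes(cfg[4:6], "little")


def bus_master_enabled(command: int) -> bool:
    """Return True if the Bus Master Enable bit (bit 2) is set."""
    return bool(command & CMD_BUS_MASTER)


def set_bus_master(command: int, enable: bool) -> int:
    """Return a new Command register value with bit 2 set or cleared."""
    if enable:
        return (command | CMD_BUS_MASTER) & 0xFFFF
    return command & ~CMD_BUS_MASTER & 0xFFFF
```

Writing the modified value back to the device is deliberately left out; in a real driver that step belongs to the OS (Linux drivers call `pci_set_master()`).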
A PCIe DMA operation is a sequence of standard Memory Read (MRd) and Memory Write (MWr) TLPs, with the device acting as Requester and system memory (via the Root Complex) acting as Completer. The CPU’s only role is setup and notification: it programs the descriptors before the transfer and handles the completion interrupt afterwards.
DMA reads are non-posted: the device sends Memory Read Request TLPs and must wait for Completion TLPs carrying the data. The device uses the Tag field (5 bits by default, 8 bits with Extended Tags, 10 bits with 10-Bit Tags, 14 bits in PCIe 6.0 Flit Mode) to match each incoming completion to the outstanding read request. Multiple reads can be outstanding simultaneously (up to the number of tag values). The Root Complex acts as Completer, reading DRAM and sending Completion TLPs back to the device. Each completion may carry up to Max Payload Size bytes, so a large read is typically returned as several completions.
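As a rough illustration of how a DMA engine turns one large read into several MRd requests, the sketch below splits a transfer so that no request exceeds Max_Read_Request_Size or crosses a 4 KB boundary (both are PCIe rules for memory requests). The function name and the 512-byte default are assumptions for the example; each returned request would take its own Tag.

```python
PAGE_SIZE = 4096  # memory requests must not cross a 4 KB boundary


def split_read(addr: int, length: int, mrrs: int = 512) -> list[tuple[int, int]]:
    """Split a DMA read into (address, size) requests.

    mrrs is the Max_Read_Request_Size in bytes (assumed 512 here).
    """
    requests = []
    while length > 0:
        to_boundary = PAGE_SIZE - (addr % PAGE_SIZE)
        size = min(length, mrrs, to_boundary)
        requests.append((addr, size))
        addr += size
        length -= size
    return requests
```

For example, `split_read(0x1F00, 1024)` yields three requests: 256 bytes up to the page boundary at 0x2000, then 512 bytes, then the remaining 256 bytes.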
| DMA type | TLPs used | Posted? | Typical use |
|---|---|---|---|
| DMA Write | MWr (Memory Write) | Yes — no completion | Device → System memory. NIC receiving packets to kernel buffer. NVMe writing read data to application buffer. |
| DMA Read | MRd (Memory Read) + CplD completions | No — completion required | System memory → Device. NIC reading socket data to transmit. GPU loading model weights. |
| Scatter-Gather DMA | MWr + MRd to descriptor chain | Mixed | Transfer to/from physically non-contiguous memory pages described by a driver-built descriptor ring. Most modern DMA engines support this natively. |
| Peer-to-Peer DMA | MWr to another device’s BAR address | Yes | GPU-to-GPU transfers over PCIe. One device’s DMA writes directly to another device’s MMIO BAR. IOMMU and ACS policy controls whether this is allowed. |
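
The scatter-gather row can be made concrete with a small sketch of how a driver might build a descriptor list for a buffer whose pages are not physically contiguous. This is a simplified model, not any particular device's descriptor format: `build_sg_list` and its tuple layout are hypothetical, and pages are assumed to be 4 KB.

```python
PAGE = 4096


def build_sg_list(pages: list[int], offset: int, length: int) -> list[tuple[int, int]]:
    """Build (dma_address, byte_count) descriptors for a buffer.

    pages  -- DMA addresses of the 4 KB pages backing the buffer, in order
    offset -- byte offset of the buffer start within the first page
    length -- total buffer length in bytes
    Adjacent pages that are contiguous are merged into one descriptor.
    """
    descs: list[tuple[int, int]] = []
    for page in pages:
        if length <= 0:
            break
        chunk = min(PAGE - offset, length)
        start = page + offset
        if descs and descs[-1][0] + descs[-1][1] == start:
            descs[-1] = (descs[-1][0], descs[-1][1] + chunk)  # extend previous
        else:
            descs.append((start, chunk))
        length -= chunk
        offset = 0  # only the first page has a non-zero offset
    if length > 0:
        raise ValueError("buffer is longer than the supplied pages")
    return descs
```

With pages at 0x10000, 0x11000 and 0x40000, an offset of 0x800 and a length of 0x2000, the first two pages merge into one 6144-byte descriptor and the third page gets its own 2048-byte descriptor.
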
A PCIe device with Bus Master Enable = 1 and no IOMMU can DMA to any physical address in the system. The address it puts in an MWr or MRd TLP header is used verbatim as the physical memory address. No hardware enforces which addresses are “its” memory vs. the OS kernel vs. another application vs. hypervisor memory.
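The IOMMU (Input-Output Memory Management Unit) is a hardware unit in the chipset or Root Complex that sits between the PCIe fabric and the DRAM controller. Every DMA transaction (every MRd and MWr TLP from every PCIe device) passes through the IOMMU. For each transaction, the IOMMU performs two operations: it translates the address in the TLP header to a physical address, and it checks that the device is permitted to perform that read or write. Accesses that fail either step are blocked and reported as faults.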
Each device (or IOMMU domain, see below) has its own I/O Virtual Address (IOVA) space. An IOVA is the address that the device puts in its TLP header. It is completely independent of the CPU virtual addresses used by processes and of the physical addresses of DRAM. The OS driver controls the mapping between IOVAs and physical pages.
The OS allocates IOVAs from a device’s IOVA space using an IOVA allocator (Linux: alloc_iova_fast() in drivers/iommu/iova.c, normally reached through DMA API calls such as dma_map_single()). For each allocation, the OS creates an IOMMU page table entry mapping that IOVA to the designated physical page. The driver then programs the device’s DMA engine with the IOVA, and the device uses that IOVA in all its DMA TLPs.
The beauty of this design: the OS can assign completely different physical pages to the same IOVA in different devices’ domain page tables. Device A’s IOVA 0x0000_1000 might map to physical page 0x8000_1000. Device B’s IOVA 0x0000_1000 (same IOVA value) might map to physical page 0x9200_0000. Each device is entirely isolated — they cannot reach each other’s pages even if both use the same IOVA values.
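The isolation property can be shown with a toy model in which each domain is just a dictionary from IOVA page number to physical page base. The class and method names are hypothetical and the model ignores permissions and multi-level tables; it only demonstrates that the same IOVA resolves differently per domain.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class IovaDomain:
    """Toy per-device IOVA space: IOVA page number -> physical page base."""

    def __init__(self) -> None:
        self._pages: dict[int, int] = {}

    def map_page(self, iova: int, phys: int) -> None:
        if iova & PAGE_MASK or phys & PAGE_MASK:
            raise ValueError("addresses must be 4 KB aligned")
        self._pages[iova >> PAGE_SHIFT] = phys

    def translate(self, iova: int) -> int:
        # A KeyError here models an access to an unmapped IOVA.
        return self._pages[iova >> PAGE_SHIFT] | (iova & PAGE_MASK)


device_a = IovaDomain()
device_b = IovaDomain()
device_a.map_page(0x0000_1000, 0x8000_1000)
device_b.map_page(0x0000_1000, 0x9200_0000)
```

Here `device_a.translate(0x1004)` gives 0x8000_1004 while `device_b.translate(0x1004)` gives 0x9200_0004, and neither domain can name the other's page.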
The IOMMU translates IOVAs using a hierarchy of page tables, similar in concept to the CPU’s virtual memory page tables but operating on DMA addresses. The exact structure varies by platform (Intel VT-d, ARM SMMU, AMD IOMMU) but the principles are the same.
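For each DMA TLP, the IOMMU performs a page table walk starting from the Context Entry’s page table root. For a 48-bit IOVA with 4-level tables and 4 KB pages, the walk consumes 9 IOVA bits per level: bits 47:39 index the top-level table, bits 38:30, 29:21 and 20:12 index the next three levels, and bits 11:0 are the byte offset within the page. The leaf entry holds the physical page base and the read/write permission bits.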
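A minimal sketch of that walk, assuming a 48-bit IOVA, four levels of 9 index bits each, and 4 KB pages. Nested dictionaries stand in for the in-memory tables, and the function names are hypothetical.

```python
def split_iova(iova: int) -> tuple[list[int], int]:
    """Return ([level4, level3, level2, level1] indices, page offset)."""
    if not 0 <= iova < (1 << 48):
        raise ValueError("IOVA must fit in 48 bits")
    indices = [(iova >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, iova & 0xFFF


def walk(root: dict, iova: int) -> int:
    """Walk nested dict 'tables'; the leaf value is the physical page base."""
    indices, offset = split_iova(iova)
    node = root
    for index in indices:
        node = node[index]  # KeyError models a not-present entry
    return node + offset


example_iova = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
example_root = {1: {2: {3: {4: 0x8000_1000}}}}
```

`split_iova(example_iova)` returns `([1, 2, 3, 4], 5)`, and `walk(example_root, example_iova)` returns 0x8000_1005. Each level of the loop corresponds to one memory read in hardware, which is why a full walk costs up to four reads.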
The Context Table entry for each PCIe function contains a Domain ID. The Domain ID determines which IOMMU page table is used for that device’s DMA translations. Multiple devices can share the same Domain ID (and thus the same page tables) — they form an IOMMU domain and can DMA to the same physical pages.
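The lookup that leads to a Domain ID can be sketched as follows, assuming the VT-d-style layout in which the Root Table is indexed by bus number and each Context Table by the 8-bit device/function field of the 16-bit Requester ID. Dictionaries stand in for the hardware tables and the names are hypothetical.

```python
def decode_requester_id(rid: int) -> tuple[int, int, int]:
    """Split a 16-bit Requester ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF
    device = (rid >> 3) & 0x1F
    function = rid & 0x7
    return bus, device, function


def lookup_domain(root_table: dict, rid: int):
    """Return the context entry (domain_id, page_table_root) for a Requester ID."""
    bus = (rid >> 8) & 0xFF
    devfn = rid & 0xFF
    context_table = root_table[bus]  # Root Table: one entry per bus
    return context_table[devfn]      # Context Table: one entry per device+function


# Two functions of the same device placed in the same domain (ID 5).
root_table = {0x3A: {0x08: (5, "pt_root_5"), 0x09: (5, "pt_root_5")}}
```

Because both context entries carry Domain ID 5 and the same page table root, the two functions share one IOVA space.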
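When the IOMMU blocks a DMA access, it records a fault in its fault status registers and optionally generates an interrupt. The IOMMU records the faulting address, the Requester ID of the device, and the fault reason (unmapped IOVA, or read or write permission denied).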
The OS IOMMU driver handles the fault interrupt, reads the fault records, and decides the response. Common responses: log the fault (for observability), terminate the DMA (by resetting the device), or in virtualised environments, deliver a fault notification to the VM’s guest kernel so it can handle the DMA fault at the VM level.
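The permission check and fault record can be modelled in a few lines. This is a sketch under simplifying assumptions: a flat dictionary replaces the multi-level page table, and `FaultRecord` and `check_access` are hypothetical names, not a real driver interface.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    requester_id: int  # bus/device/function of the offending device
    iova: int          # faulting address
    reason: str


def check_access(page_table: dict, requester_id: int, iova: int, write: bool):
    """Return the physical address, or a FaultRecord if the access is blocked.

    page_table maps IOVA page number -> (physical page base, readable, writable).
    """
    entry = page_table.get(iova >> 12)
    if entry is None:
        return FaultRecord(requester_id, iova, "not present")
    phys_base, readable, writable = entry
    if write and not writable:
        return FaultRecord(requester_id, iova, "write not permitted")
    if not write and not readable:
        return FaultRecord(requester_id, iova, "read not permitted")
    return phys_base | (iova & 0xFFF)


read_only_table = {0x1: (0x8000_1000, True, False)}
```

A read of IOVA 0x1234 through `read_only_table` returns 0x8000_1234, while a write to the same address returns a `FaultRecord` with reason "write not permitted" that an OS handler could log or act on.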
The IOMMU IOTLB caches translations at the IOMMU hardware. ATS (Address Translation Services — PCIe Extended Capability 000Fh, covered in PCIe-22) pushes this further: the device itself can cache translations, eliminating even the IOTLB lookup for subsequent DMA using the same IOVA.
ATS is critical for latency-sensitive DMA: RDMA network adapters performing sub-microsecond DMA, NVMe completion queues, and AI accelerators streaming weights. Without ATS, every DMA that misses the IOTLB requires an IOMMU page table walk (up to 4 dependent memory reads), which can add hundreds of nanoseconds or more of latency. With ATS, the device requests a translation once, caches it in its Address Translation Cache (ATC), and issues later DMA to that page with pre-translated addresses at full link speed.
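The caching behaviour can be illustrated with a toy device-side translation cache. The class is hypothetical and stands in for the device's ATC; `translate_fn` is an assumed callback representing a Translation Request answered by the IOMMU, and it must return a 4 KB-aligned physical page base.

```python
class DeviceATC:
    """Toy model of a device-side Address Translation Cache."""

    def __init__(self, translate_fn) -> None:
        self.translate_fn = translate_fn  # models a Translation Request
        self.cache: dict[int, int] = {}   # IOVA page number -> physical page base
        self.requests_sent = 0

    def dma_address(self, iova: int) -> int:
        page = iova >> 12
        if page not in self.cache:
            self.requests_sent += 1
            self.cache[page] = self.translate_fn(page << 12)
        return self.cache[page] | (iova & 0xFFF)

    def invalidate(self, iova: int) -> None:
        """Drop a cached entry, as the device must when told to invalidate."""
        self.cache.pop(iova >> 12, None)


atc = DeviceATC(lambda iova_page: 0x8000_0000 + iova_page)
```

Two DMAs to 0x1010 and 0x1020 cost a single translation request because they fall in the same page; after `atc.invalidate(0x1000)` the next access to that page has to request the translation again.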
Standard IOMMU domains isolate one device from another. PASID (Process Address Space ID — PCIe Extended Capability 001Bh, covered in PCIe-22) extends isolation to per-process within a single VM or physical machine. Each DMA TLP carries a 20-bit PASID value in a TLP Prefix, and the IOMMU uses that PASID to select which page table to use for the translation.
Without PASID: all DMA from a shared GPU goes through one page table (one VM’s or one user process’s address space). If two processes share the GPU, their DMA must share the IOMMU address space — isolation is at the device level only.
With PASID: each compute kernel running on the GPU can have its own PASID and its own IOMMU page table pointing to that process’s virtual memory. The GPU’s DMA engine tags each DMA request with the kernel’s PASID. The IOMMU translates using the PASID-indexed page table. Multiple processes share the GPU hardware with IOMMU-enforced per-process memory isolation.
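In model form, PASID simply widens the key used to pick a page table from the Requester ID alone to the pair (Requester ID, PASID). The sketch below uses hypothetical names and string placeholders for the page table roots.

```python
PASID_BITS = 20


def select_page_table(tables: dict, requester_id: int, pasid: int):
    """Pick the page table for a DMA request tagged with a PASID."""
    if not 0 <= pasid < (1 << PASID_BITS):
        raise ValueError("PASID must fit in 20 bits")
    return tables[(requester_id, pasid)]  # KeyError models an unknown PASID


# One GPU (Requester ID 0x3A00) shared by two processes.
pasid_tables = {
    (0x3A00, 1): "page_table_of_process_a",
    (0x3A00, 2): "page_table_of_process_b",
}
```

Requests from the same device resolve to different page tables depending on the PASID they carry, which is what gives each process its own isolated view of memory.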
| Platform | IOMMU name | Specification | Linux driver | Key features |
|---|---|---|---|---|
| Intel x86 | VT-d (Virtualization Technology for Directed I/O) | Intel VT-d Architecture Specification | iommu/intel | Root/Context Table, 4-level IOMMU page tables, Interrupt Remapping, ATS, PRS (Page Request Service), PASID, Posted Interrupts, IOMMU for Intel Integrated GPUs |
| AMD x86 | AMD-Vi (AMD Virtualization for I/O) | AMD I/O Virtualization Technology Specification | iommu/amd | Device Table (combined Root+Context), 4-level page tables, Guest Virtual Address translation, IOMMU Event Log for fault recording |
| ARM/AArch64 | SMMU (System Memory Management Unit) | ARM System Memory Management Unit Architecture v3 | iommu/arm-smmu-v3 | Stream Table (per-device context), 4-level page tables shared with CPU MMU format, CMDQ for IOTLB invalidation, PRIQ for page request reporting, PASID/SSID |
| RISC-V | RISC-V IOMMU | RISC-V IOMMU Specification (ratified 2023) | iommu/riscv | Device Directory Table, RISC-V page table format (Sv48/Sv57), MSI remapping, PASID |
All four IOMMU implementations serve the same purpose (IOVA-to-PA translation and permission checking for PCIe DMA) but use different table formats and hardware interfaces. The OS IOMMU abstraction layer (Linux: drivers/iommu/) presents a unified API (iommu_map(), iommu_unmap(), and iommu_domain_alloc(), replaced by iommu_paging_domain_alloc() in recent kernels) regardless of the underlying hardware.
VFIO (Virtual Function I/O) is the Linux kernel framework that enables safe PCIe device passthrough to virtual machines and containers by using the IOMMU to enforce isolation. It exposes IOMMU group-level device assignment to userspace and virtual machine managers.
The key preconditions for safe device assignment are:
| Check VFIO performs before device assignment | Why |
|---|---|
| ACS enabled on all switches in the path to the RC | Without ACS, the device can peer-DMA to other devices bypassing the IOMMU |
| IOMMU enabled in hardware (VT-d / AMD-Vi / SMMU active) | Without IOMMU active, DMA is unrestricted regardless of VFIO setup |
| Device in its own IOMMU group (no shared groups) | If device shares a group with another device, that other device must also be assigned to the same VM |
| Interrupt remapping supported and enabled | Without interrupt remapping, device can inject arbitrary MSI vectors into any CPU |
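
The group check can be approximated from userspace by reading the IOMMU group layout that Linux exposes under `/sys/kernel/iommu_groups`, where each group directory contains a `devices` directory with one entry per PCI address. The sketch below is a convenience script rather than what VFIO itself runs, and the function names are hypothetical.

```python
from pathlib import Path


def iommu_groups(sysfs_root: str = "/sys/kernel/iommu_groups") -> dict[str, list[str]]:
    """Return {group number: sorted device addresses}; empty if no IOMMU groups exist."""
    groups: dict[str, list[str]] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return groups
    for group_dir in root.iterdir():
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[group_dir.name] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def group_peers(groups: dict[str, list[str]], bdf: str) -> list[str]:
    """Return the other devices that share an IOMMU group with bdf."""
    for members in groups.values():
        if bdf in members:
            return [m for m in members if m != bdf]
    raise KeyError(bdf)
```

An empty result from `group_peers` means the device sits alone in its group and can be assigned on its own; a non-empty result lists the devices that would have to be assigned along with it.
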
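The PCIe DMA mechanism (MRd/MWr requests, Bus Master Enable, Requester ID, the AT field for ATS, PASID tagging) is semantically unchanged in Gen 6. In Flit Mode the TLP header encoding is reorganised and PASID is carried in Orthogonal Header Content (OHC) rather than in a TLP Prefix, but the IOMMU sees the same information. The IOMMU interacts with the Transaction Layer, not the Physical Layer, so Gen 6’s PAM4 signalling is invisible to it.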
What changes in Gen 6 DMA and IOMMU practice:
| Aspect | Gen 6 change or new consideration |
|---|---|
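| DMA TLP format | Request semantics unchanged: same MRd/MWr requests, AT field and PASID. In Flit Mode the header encoding changes and PASID travels in Orthogonal Header Content (OHC) instead of a TLP Prefix |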
| IOMMU table format | Unchanged — VT-d, SMMU, AMD-Vi page table formats are independent of PCIe generation |
| DMA bandwidth | Gen 6 at 64 GT/s × 16 lanes = 1,024 Gb/s ≈ 128 GB/s raw per direction (≈ 256 GB/s bidirectional). IOTLB capacity and page-table-walk throughput must scale accordingly so that the IOMMU does not become the bottleneck at these rates. |
| ATS importance | At ≈ 128 GB/s per direction on a x16 link, translating every TLP at the IOMMU becomes a bottleneck unless the device caches translations with ATS. Gen 6 AI accelerators doing high-frequency small DMA (inference token streaming) depend on ATS to avoid IOMMU latency. |
| PASID and multi-tenant AI | Gen 6 AI accelerators (100,000+ concurrent inference sessions per GPU cluster) depend on PASID for per-tenant IOMMU isolation. 20-bit PASID = 1M concurrent address spaces — sufficient for large-scale multi-tenant cloud GPU deployments. |
| IDE and DMA security | IDE (Integrity and Data Encryption, Extended Capability ID 0030h), introduced as an ECN to PCIe 5.0 and included in the 6.0 base specification, provides TLP-level encryption and integrity protection. DMA TLPs can be encrypted between the two ends of an IDE stream, so an attacker who taps the PCIe trace cannot read or silently modify DMA data. Critical for confidential computing: model weights and activation data remain encrypted in transit over PCIe. |
| CXL.mem DMA | CXL (Compute Express Link; CXL 3.x runs on the PCIe 6.0 PHY, earlier versions on the PCIe 5.0 PHY) adds CXL.mem, a protocol through which the host accesses device-attached memory with load/store semantics. CXL.mem accesses originate at the host and do not pass through the PCIe DMA/IOMMU path; device-initiated accesses use CXL.io or CXL.cache. Firmware describes CXL memory to the OS through the ACPI SRAT and HMAT tables. Extending isolation to CXL-attached memory is emerging work in the Linux kernel. |
| Flit mode and IOMMU | Gen 6 Flit Mode (fixed 256-byte flits) is transparent to the IOMMU. The IOMMU operates on Transaction Layer requests; flit boundaries are handled below that layer. No IOMMU changes are needed for Flit Mode. |
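
The bandwidth figures can be checked with simple arithmetic. The sketch below computes raw per-direction bandwidth, deliberately ignoring flit, FEC and CRC overhead, and then estimates how many translations per second are needed if, as a worst-case assumption, every transfer of a given size touches a page that is not cached. The function names are hypothetical.

```python
def raw_bandwidth_gbytes(gt_per_s: float = 64, lanes: int = 16) -> float:
    """Raw per-direction bandwidth in GB/s (1 bit per transfer per lane)."""
    return gt_per_s * lanes / 8


def translations_per_second(bandwidth_gbytes: float, bytes_per_translation: int) -> float:
    """Translations needed per second if each one covers bytes_per_translation."""
    return bandwidth_gbytes * 1e9 / bytes_per_translation


gen6_x16 = raw_bandwidth_gbytes()                           # 128.0 GB/s per direction
per_page = translations_per_second(gen6_x16, 4096)          # one per 4 KB page
per_small_dma = translations_per_second(gen6_x16, 256)      # one per 256-byte DMA
```

Streaming through 4 KB pages needs about 31 million translations per second, while scattered 256-byte DMAs would need 500 million, which is the regime where device-side caching matters most.
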
| Item | Value / Rule |
|---|---|
| DMA definition | Device-initiated memory transactions (MRd/MWr TLPs) without CPU involvement per transfer |
| Bus Master Enable | Command register bit 2. Must = 1 for device to initiate DMA. Reset default = 0. OS sets to 1 only after IOMMU domain is configured. |
| DMA write TLP | Posted MWr — no completion returned. Device can send next TLP immediately. |
| DMA read TLP | Non-posted MRd — completion with data (CplD) returned per read. Device tracks outstanding reads by Tag. |
| IOVA | I/O Virtual Address — address device puts in TLP header. Translated by IOMMU to physical address. Completely separate from CPU virtual addresses. |
| IOMMU position | Between PCIe Root Complex and DRAM controller. Intercepts every DMA TLP. |
| IOMMU function | 1) Translate IOVA → Physical Address. 2) Check read/write permissions. 3) Block and fault on invalid accesses. |
| IOTLB | IOMMU Translation Lookaside Buffer — caches recent IOVA→PA translations. Must be invalidated when OS unmaps DMA buffers. |
| Root Table | Indexed by PCIe Bus number (256 entries). Each entry → Context Table base. |
| Context Table | Indexed by Device+Function (256 entries per bus). Each entry → Domain ID + IOMMU page table root. |
| IOMMU Domain | A set of devices sharing the same IOMMU page tables. Each VM typically gets one domain. Domain ID identifies which page table to use for a device’s DMA. |
| IOMMU page table walk | Up to 4 levels (48-bit IOVA). Each level indexed by 9 IOVA bits → physical address of next-level table. Leaf PTE contains physical page base + R/W permission bits. |
| IOMMU fault | Triggered when IOVA is unmapped, write permission denied, or read permission denied. IOMMU records fault address + Requester ID + reason. Generates interrupt to OS IOMMU driver. |
| ATS | Address Translation Services (PCIe Cap 000Fh). Device sends Translation Request TLP → IOMMU returns physical address. Device caches in ATC. Subsequent DMA uses AT=10b to bypass IOMMU re-translation. |
| IOTLB invalidation | OS must invalidate ATC and IOTLB entries when unmapping DMA buffers. The IOMMU sends an ATS Invalidate Request Message to the device to flush ATC entries and waits for the Invalidate Completion. |
| PASID | Process Address Space ID (PCIe Cap 001Bh). 20-bit tag in TLP Prefix. IOMMU selects per-PASID page table for translation. Enables per-process DMA isolation within a device. |
| IOMMU Group | Minimum set of devices that must be assigned together. Determined by ACS capability and PCIe topology. VFIO enforces group-based assignment. |
| ACS requirement | ACS Source Validation and P2P Request Redirect (Linux additionally checks P2P Completion Redirect and Upstream Forwarding) must be enabled on all switches for devices to get individual IOMMU groups and single-device assignment to VMs. |
| x86 Intel | VT-d (Virtualization Technology for Directed I/O). Linux driver: iommu/intel. |
| x86 AMD | AMD-Vi. Linux driver: iommu/amd. |
| ARM | SMMU v3 (System Memory Management Unit). Linux driver: iommu/arm-smmu-v3. |
| Gen 6 changes | DMA/IOMMU mechanism unchanged. ATS critical at ≈ 128 GB/s per direction (x16). PASID essential for multi-tenant AI. IDE adds TLP-level DMA encryption. CXL.mem requires IOMMU extension beyond standard PCIe domain. |