PCIe DMA and IOMMU Explained

📋 DMA — What It Is and Why It Exists

Transferring data between a PCIe device and system memory is fundamental to every I/O operation. The simplest approach, Programmed I/O (PIO), has the CPU read device registers into its internal registers and then write those values to system memory, one word at a time. This works but is deeply inefficient: every word transferred costs the CPU two bus transactions (a device read and a memory write), and more importantly, the CPU is occupied with data movement rather than computation.

Direct Memory Access (DMA) removes the CPU from the data path. The PCIe device’s internal DMA engine initiates memory transactions directly, reading data from system memory into the device’s transmit buffers or writing data from the device’s receive buffers into system memory, without CPU involvement per transfer. The CPU only needs to set up the DMA descriptor (source address, destination address, byte count) and is then free to do other work. When the DMA completes, the device sends an interrupt to notify the CPU.
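A descriptor of the kind described above can be sketched as a small packed structure. The layout below (64-bit addresses, 32-bit byte count, 32-bit flags, little-endian) and the names `DmaDescriptor` and `FLAG_IRQ_ON_COMPLETE` are assumptions for illustration only; every real device defines its own descriptor format.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: 64-bit source, 64-bit destination, 32-bit byte count,
# 32-bit flags, little-endian. Real devices define their own formats.
_DESC_FORMAT = "<QQII"
FLAG_IRQ_ON_COMPLETE = 0x1  # assumed flag: raise an interrupt when done


@dataclass(frozen=True)
class DmaDescriptor:
    src_addr: int
    dst_addr: int
    byte_count: int
    flags: int = FLAG_IRQ_ON_COMPLETE

    def pack(self) -> bytes:
        """Serialise the descriptor as the device would read it from memory."""
        return struct.pack(_DESC_FORMAT, self.src_addr, self.dst_addr,
                           self.byte_count, self.flags)

    @classmethod
    def unpack(cls, raw: bytes) -> "DmaDescriptor":
        """Rebuild a descriptor from its packed bytes."""
        return cls(*struct.unpack(_DESC_FORMAT, raw))


desc = DmaDescriptor(src_addr=0x1000, dst_addr=0x2000, byte_count=4096)
raw = desc.pack()
```

The driver would write `raw` into a descriptor ring in memory and then tell the device where the ring lives; from that point the CPU is out of the data path until the completion interrupt arrives.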

DMA is the dominant data transfer model for all high-performance PCIe devices: network adapters (receiving packets directly to kernel socket buffers), NVMe SSDs (writing read data directly to application memory), GPUs (copying buffers between VRAM and system RAM), and AI accelerators (streaming model weights from DRAM to on-chip SRAM).

📋 Bus Master Enable and TLP Initiator Role

In PCIe, a device that can initiate DMA is called a Bus Master — a term inherited from PCI where arbitration for shared bus ownership was required. In PCIe, there is no shared bus and no arbitration — every link is point-to-point and full-duplex. Any device may send a TLP at any time subject to flow control credit availability. But the term Bus Master is retained because the same configuration register bit controls DMA capability.

A PCIe device can only initiate DMA (Memory Read or Memory Write TLPs as Requester) when the Bus Master Enable bit in its Command register (offset 04h, bit 2) is set to 1. While Bus Master Enable = 0, the device must not issue Memory or I/O Requests. Because MSI and MSI-X interrupts are themselves Memory Writes, they are blocked as well; the device can still return Completions and send Message TLPs. This gives the OS a hardware-enforced way to prevent a device from performing DMA before the OS is ready to manage it.

Figure 1 — Bus Master Enable is the hardware interlock for DMA. Zero (reset default) prevents the device from initiating any memory transactions. One allows full DMA capability. The OS must configure the IOMMU before setting Bus Master Enable = 1; otherwise the device can DMA to any physical address from the moment the bit is set. In Linux, the IOMMU core attaches a device to a domain when the device is added, before its driver binds and enables bus mastering (typically via pci_set_master()). This sequencing is a security requirement for safe device operation.
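The Command register bit described above can be tested and updated with plain bit operations. This is a minimal sketch that works on a 16-bit Command register value that has already been read from configuration space; the helper names are illustrative, not an existing API.

```python
PCI_COMMAND_OFFSET = 0x04     # Command register offset in config space
PCI_COMMAND_MEMORY = 1 << 1   # Memory Space Enable (bit 1)
PCI_COMMAND_MASTER = 1 << 2   # Bus Master Enable (bit 2)


def bus_master_enabled(command: int) -> bool:
    """Return True if the Bus Master Enable bit is set in a Command value."""
    return bool(command & PCI_COMMAND_MASTER)


def set_bus_master(command: int, enable: bool) -> int:
    """Return the Command value with Bus Master Enable set or cleared."""
    if enable:
        return (command | PCI_COMMAND_MASTER) & 0xFFFF
    return command & ~PCI_COMMAND_MASTER & 0xFFFF


command_after_reset = 0x0000
command_ready = set_bus_master(command_after_reset | PCI_COMMAND_MEMORY, True)
```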

📋 DMA Flow in PCIe

A PCIe DMA operation is a sequence of standard Memory Read (MRd) and Memory Write (MWr) TLPs, with the device acting as Requester and system memory (via the Root Complex) acting as Completer. The CPU’s only role is setup and notification:

Figure 2 — DMA write flow. Memory Write TLPs are posted: no completion is returned for any individual write TLP. The device sends all data TLPs and then sends the MSI, which is itself a posted Memory Write. Because PCIe ordering rules keep a posted write from passing earlier posted writes from the same Requester, the MSI reaches the Root Complex after the data, so the CPU is guaranteed that all DMA data is in memory when the interrupt fires. The IOMMU (step ④) is the crucial security layer: all DMA must pass through it.

DMA Read flow (non-posted)

DMA reads are non-posted: the device sends Memory Read Request TLPs and must wait for Completion TLPs carrying the data. The device uses the Tag field (5 bits by default, 8 bits with Extended Tags, 10 bits with 10-Bit Tags, and 14 bits in PCIe 6.0 Flit Mode) to match each incoming completion to its outstanding read request. Multiple reads can be outstanding simultaneously, up to the number of available tag values. The Root Complex acts as Completer, reading DRAM and sending Completion TLPs back to the device. A single read may be answered by several completions, each carrying at most Max Payload Size bytes.
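The tag-matching behaviour can be modelled in a few lines. This is a simplified sketch, not a model of any particular device: `ReadTagTracker` is a hypothetical name, and it only counts bytes per tag rather than tracking addresses or completion status codes.

```python
from typing import Dict, List, Optional


class ReadTagTracker:
    """Tracks outstanding Memory Read requests by Tag (simplified model)."""

    def __init__(self, tag_bits: int = 8) -> None:
        self._free: List[int] = list(range(1 << tag_bits))
        self._remaining: Dict[int, int] = {}  # tag -> bytes still expected

    def issue(self, length: int) -> Optional[int]:
        """Allocate a tag for a read of `length` bytes, or None if none is free."""
        if not self._free:
            return None
        tag = self._free.pop(0)
        self._remaining[tag] = length
        return tag

    def on_completion(self, tag: int, payload_len: int) -> bool:
        """Account for one CplD; return True once the read is fully satisfied."""
        if tag not in self._remaining:
            raise KeyError(f"unexpected completion for tag {tag}")
        if payload_len > self._remaining[tag]:
            raise ValueError("completion carries more data than requested")
        self._remaining[tag] -= payload_len
        if self._remaining[tag] == 0:
            del self._remaining[tag]
            self._free.append(tag)  # the tag may now be reused
            return True
        return False

    @property
    def outstanding(self) -> int:
        """Number of reads still waiting for data."""
        return len(self._remaining)
```

When `issue()` returns `None`, the DMA engine has to stall until a completion frees a tag, which is why the tag width bounds read throughput on high-latency links.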

📋 DMA Types — Write, Read, and Scatter-Gather

DMA type	TLPs used	Posted?	Typical use
DMA Write	MWr (Memory Write)	Yes — no completion	Device → System memory. NIC receiving packets to kernel buffer. NVMe writing read data to application buffer.
DMA Read	MRd (Memory Read) + CplD completions	No — completion required	System memory → Device. NIC reading socket data to transmit. GPU loading model weights.
Scatter-Gather DMA	MWr + MRd to descriptor chain	Mixed	Transfer to/from physically non-contiguous memory pages described by a driver-built descriptor ring. Most modern DMA engines support this natively.
Peer-to-Peer DMA	MWr to another device’s BAR address	Yes	GPU-to-GPU transfers over PCIe. One device’s DMA writes directly to another device’s MMIO BAR. IOMMU and ACS policy controls whether this is allowed.

Scatter-Gather enables DMA to/from virtual memory. OS virtual memory can map a contiguous virtual address range to non-contiguous physical pages. A driver builds a scatter-gather list, an array of (physical_address, length) pairs, and programs it into the device’s DMA descriptor ring. The device’s DMA engine steps through the list, issuing one or more TLPs for each contiguous segment (a segment larger than Max Payload Size or Max Read Request Size is split across several TLPs). This allows DMA directly to application memory without copying to a physically contiguous kernel buffer first.
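Building a scatter-gather list can be sketched as follows. The `page_map` dictionary is a hypothetical stand-in for the OS page tables (virtual page number to physical page base), and 4 KB pages are assumed; a real driver would obtain these addresses through the kernel DMA API instead.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4096


def build_sg_list(virt_addr: int, length: int,
                  page_map: Dict[int, int]) -> List[Tuple[int, int]]:
    """Split a virtually contiguous buffer into (physical_address, length) pairs.

    Physically adjacent pages are merged into a single segment.
    """
    segments: List[Tuple[int, int]] = []
    offset, remaining = virt_addr, length
    while remaining > 0:
        vpn, page_off = divmod(offset, PAGE_SIZE)
        phys = page_map[vpn] + page_off
        chunk = min(PAGE_SIZE - page_off, remaining)
        if segments and segments[-1][0] + segments[-1][1] == phys:
            # This chunk continues the previous segment: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((phys, chunk))
        offset += chunk
        remaining -= chunk
    return segments


# Virtual pages 0 and 1 are physically adjacent; page 2 lives elsewhere.
example_map = {0: 0x10000, 1: 0x11000, 2: 0x50000}
example_sg = build_sg_list(0x800, 0x2000, example_map)
```

Merging adjacent pages keeps the descriptor ring short, which matters because each descriptor costs the device at least one extra memory read.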

📋 The DMA Threat — Why Unrestricted DMA is Dangerous

A PCIe device with Bus Master Enable = 1 and no IOMMU can DMA to any physical address in the system. The address it puts in an MWr or MRd TLP header is used verbatim as the physical memory address. No hardware enforces which addresses are “its” memory vs. the OS kernel vs. another application vs. hypervisor memory.

Figure 3 — Three DMA attack classes without IOMMU. Left: a physical attacker plugs in a malicious Thunderbolt/USB4 device and overwrites kernel memory to achieve code execution (the class of attack automated by DMA tools such as PCILeech). Centre: in a multi-tenant cloud without IOMMU, a compromised VF driver in one VM can directly read/write another VM’s physical memory. Right: a supply-chain-compromised device continuously exfiltrates encryption keys from kernel memory. All three are defeated by an IOMMU that restricts each device to its own IOMMU domain pages.

📋 IOMMU — The Solution

The IOMMU (Input-Output Memory Management Unit) is a hardware unit in the chipset or Root Complex that sits between the PCIe fabric and the DRAM controller. Every DMA transaction — every MRd and MWr TLP from every PCIe device — passes through the IOMMU. For each transaction, the IOMMU performs two operations:

Address translation: the address in the TLP (called the I/O Virtual Address, IOVA) is translated to the actual physical memory address using per-device page tables. The device never sees or needs to know physical addresses — it only sees IOVAs assigned by the OS.
Permission checking: the IOMMU verifies that the device is permitted to read or write the translated physical address. If the device has no mapping for this address (or is trying to write a read-only page), the IOMMU blocks the access and generates a fault.

Figure 4 — IOMMU intercepts every DMA TLP. The device uses its assigned IOVA. The IOMMU translates IOVA to physical address using the device’s domain page table and checks read/write permissions. Valid accesses proceed to DRAM. Invalid accesses (unmapped IOVA, wrong permissions, write to read-only page) are blocked and generate a fault reported as an interrupt to the OS. The device never learns physical addresses of other devices or the OS kernel.

📋 IOVA — I/O Virtual Address Space

Each device (or IOMMU domain, see below) has its own I/O Virtual Address (IOVA) space. An IOVA is the address that the device puts in its TLP header. It is completely independent of the CPU virtual addresses used by processes and of the physical addresses of DRAM. The OS driver controls the mapping between IOVAs and physical pages.

The OS allocates IOVAs from a device’s IOVA space using an IOVA allocator (in Linux, alloc_iova_fast() in drivers/iommu/iova.c, normally reached through DMA API calls such as dma_map_single() and dma_map_sg()). For each allocation, the OS creates an IOMMU page table entry mapping that IOVA to the designated physical page. The driver then programs the device’s DMA engine with the IOVA, and the device uses that IOVA in all its DMA TLPs.

The beauty of this design: the OS can assign completely different physical pages to the same IOVA in different devices’ domain page tables. Device A’s IOVA 0x0000_1000 might map to physical page 0x8000_1000. Device B’s IOVA 0x0000_1000 (same IOVA value) might map to physical page 0x9200_0000. Each device is entirely isolated — they cannot reach each other’s pages even if both use the same IOVA values.
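The isolation property can be demonstrated with a toy model in which each domain holds its own IOVA-to-physical mapping with permissions. `IommuDomain` and `IommuFault` are hypothetical names, and a flat dictionary stands in for the real multi-level page tables.

```python
PAGE_SIZE = 4096
READ, WRITE = 0x1, 0x2


class IommuFault(Exception):
    """Raised when a DMA access is blocked."""


class IommuDomain:
    """One IOVA space with its own mappings (flat-dictionary model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._pages = {}  # IOVA page number -> (physical page base, perms)

    def map(self, iova: int, phys: int, perms: int) -> None:
        if iova % PAGE_SIZE or phys % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        self._pages[iova // PAGE_SIZE] = (phys, perms)

    def unmap(self, iova: int) -> None:
        self._pages.pop(iova // PAGE_SIZE, None)

    def translate(self, iova: int, access: int) -> int:
        """Translate an IOVA, checking that `access` is permitted."""
        entry = self._pages.get(iova // PAGE_SIZE)
        if entry is None:
            raise IommuFault(f"{self.name}: IOVA {iova:#x} not present")
        phys, perms = entry
        if access & ~perms:
            raise IommuFault(f"{self.name}: permission denied at {iova:#x}")
        return phys + iova % PAGE_SIZE


# Same IOVA value, different physical pages, as in the text above.
dev_a, dev_b = IommuDomain("device A"), IommuDomain("device B")
dev_a.map(0x0000_1000, 0x8000_1000, READ | WRITE)
dev_b.map(0x0000_1000, 0x9200_0000, READ)
```

Neither domain contains any entry that resolves to the other's physical page, so no IOVA a device can emit will reach it.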

📋 IOMMU Remapping Tables

The IOMMU translates IOVAs using a hierarchy of page tables, similar in concept to the CPU’s virtual memory page tables but operating on DMA addresses. The exact structure varies by platform (Intel VT-d, ARM SMMU, AMD IOMMU) but the principles are the same.

Figure 5 — IOMMU table hierarchy (Intel VT-d model). The Root Table is indexed by PCIe Bus number (256 entries). Each Root Entry points to a Context Table indexed by Device+Function (256 entries per bus). Each Context Entry contains a Domain ID and pointer to that device’s I/O page table root. The page table is walked with IOVA bits using up to 4 levels (9 bits each) giving 48-bit IOVA space. The leaf Page Table Entry (PTE) contains the physical page base address and permission bits.
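The Root Table and Context Table indices come straight from the 16-bit Requester ID (Bus in bits 15:8, Device in bits 7:3, Function in bits 2:0). A small sketch of that decoding, with illustrative function names:

```python
from typing import Tuple


def split_requester_id(rid: int) -> Tuple[int, int, int]:
    """Decode a 16-bit Requester ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF
    device = (rid >> 3) & 0x1F
    function = rid & 0x7
    return bus, device, function


def table_indices(rid: int) -> Tuple[int, int]:
    """Return (root_index, context_index) for the VT-d style lookup.

    The Root Table is indexed by bus number; the Context Table by the
    combined 8-bit Device+Function value.
    """
    return (rid >> 8) & 0xFF, rid & 0xFF


# Requester ID for device 03:02.0
example_rid = (0x03 << 8) | (0x02 << 3) | 0x0
```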

📋 Page Table Walk — How IOMMU Translates

For each DMA TLP, the IOMMU performs a page table walk starting from the Context Entry’s page table root. For a 48-bit IOVA address with 4-level tables and 4 KB pages:

Extract IOVA bits [47:39] → index into Level 4 table → get Level 3 table base
Extract IOVA bits [38:30] → index into Level 3 table → get Level 2 table base
Extract IOVA bits [29:21] → index into Level 2 table → get Level 1 table base
Extract IOVA bits [20:12] → index into Level 1 table → get Page Table Entry (PTE)
PTE contains physical page base [51:12] + permission bits. Add IOVA bits [11:0] (page offset) → final physical address.
Check permission bits against TLP type: Read permission for MRd, Write permission for MWr. If permission denied: block TLP, generate fault.
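The steps above can be sketched with nested dictionaries standing in for the four table levels. This is a simplified model assuming 48-bit IOVAs and 4 KB pages; real page table entries are 64-bit words in memory, and the names used here are illustrative.

```python
PERM_R, PERM_W = 0x1, 0x2
LEVEL_SHIFTS = (39, 30, 21, 12)  # Level 4 down to Level 1
INDEX_MASK = 0x1FF               # 9 bits per level
OFFSET_MASK = 0xFFF              # 12-bit page offset


class TranslationFault(Exception):
    """Raised when the walk finds no mapping or the access is not permitted."""


def iova_indices(iova: int):
    """Return the four 9-bit table indices for an IOVA."""
    return tuple((iova >> shift) & INDEX_MASK for shift in LEVEL_SHIFTS)


def map_page(root: dict, iova: int, phys_base: int, perms: int) -> None:
    """Install a leaf entry, creating intermediate tables as needed."""
    *upper, leaf = iova_indices(iova)
    table = root
    for idx in upper:
        table = table.setdefault(idx, {})
    table[leaf] = (phys_base & ~OFFSET_MASK, perms)


def walk(root: dict, iova: int, access: int) -> int:
    """Walk the tables and return the physical address, or raise a fault."""
    *upper, leaf = iova_indices(iova)
    table = root
    for idx in upper:
        table = table.get(idx)
        if table is None:
            raise TranslationFault("not present")
    pte = table.get(leaf)
    if pte is None:
        raise TranslationFault("not present")
    phys_base, perms = pte
    if access & ~perms:
        raise TranslationFault("permission denied")
    return phys_base | (iova & OFFSET_MASK)
```

Each loop iteration in `walk` corresponds to one dependent memory read in hardware, which is what makes an uncached translation expensive.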

IOTLB caches translations to avoid page walks per TLP. A 4-level page walk requires up to 4 dependent memory reads per translation, which is unacceptable for high-bandwidth DMA. The IOMMU maintains an IOTLB (I/O Translation Lookaside Buffer) that caches recently used translations. On an IOTLB hit, the IOMMU translates without touching the page tables in memory. IOTLB entries are tagged with the Domain ID so that identical IOVAs from different domains don’t collide. When the OS modifies a page table entry (e.g. when unmapping a DMA buffer), it must issue IOTLB invalidation commands to flush stale cached entries.
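A domain-tagged translation cache with explicit invalidation might look like the sketch below. The LRU replacement policy, the capacity, and the `walk` callback signature are assumptions made for illustration; hardware IOTLBs differ in organisation and replacement.

```python
from collections import OrderedDict
from typing import Callable


class Iotlb:
    """LRU cache of (domain_id, IOVA page) -> physical page base."""

    def __init__(self, capacity: int = 64) -> None:
        self._capacity = capacity
        self._entries: "OrderedDict[tuple, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, domain_id: int, iova: int,
               walk: Callable[[int, int], int]) -> int:
        """Translate via the cache, calling walk(domain_id, page) on a miss."""
        key = (domain_id, iova >> 12)
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)
        else:
            self.misses += 1
            self._entries[key] = walk(domain_id, iova & ~0xFFF)
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return self._entries[key] | (iova & 0xFFF)

    def invalidate_page(self, domain_id: int, iova: int) -> None:
        """Drop one cached translation, e.g. after the OS unmaps a buffer."""
        self._entries.pop((domain_id, iova >> 12), None)

    def invalidate_domain(self, domain_id: int) -> None:
        """Drop every cached translation belonging to one domain."""
        for key in [k for k in self._entries if k[0] == domain_id]:
            del self._entries[key]
```

Skipping `invalidate_page` after changing a mapping would leave the old physical address reachable through the cache, which is exactly the stale-entry hazard described above.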

📋 Context Table and Domain Isolation

The Context Table entry for each PCIe function contains a Domain ID. The Domain ID determines which IOMMU page table is used for that device’s DMA translations. Multiple devices can share the same Domain ID (and thus the same page tables) — they form an IOMMU domain and can DMA to the same physical pages.

Figure 6 — IOMMU domain isolation. Domain 1 contains all devices assigned to VM A. Their page tables only map IOVAs to VM A’s physical memory range (0x4000_0000–0x7FFF_FFFF). Domain 2 contains all devices assigned to VM B with completely separate page tables mapping only to VM B’s physical pages. Even if a device in Domain 1 issues a DMA targeting VM B’s physical address, the IOMMU finds no valid mapping for that address in Domain 1’s page tables and blocks the access with a fault.

📋 IOMMU Fault Handling

When the IOMMU blocks a DMA access, it records a fault in its fault status registers and optionally generates an interrupt. The IOMMU records:

Fault address: the IOVA that caused the fault
Requester ID: the BDF of the device that sent the TLP
Fault reason: write to read-only page, read from write-only page, address not present (no mapping exists), access type mismatch
Fault type: page fault (missing PTE) or access violation (permission denied)

The OS IOMMU driver handles the fault interrupt, reads the fault records, and decides the response. Common responses: log the fault (for observability), terminate the DMA (by resetting the device), or in virtualised environments, deliver a fault notification to the VM’s guest kernel so it can handle the DMA fault at the VM level.
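A fault record and a simple response policy can be sketched as below. The record fields follow the list above; the policy itself (log every fault, request a device reset after a threshold number of faults from the same Requester ID) is a hypothetical example, not the behaviour of any particular IOMMU driver.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FaultReason(Enum):
    NOT_PRESENT = "address not present"
    WRITE_DENIED = "write to read-only page"
    READ_DENIED = "read from write-only page"


@dataclass(frozen=True)
class FaultRecord:
    iova: int
    requester_id: int
    reason: FaultReason

    @property
    def bdf(self) -> str:
        """Format the Requester ID as bus:device.function."""
        bus = (self.requester_id >> 8) & 0xFF
        dev = (self.requester_id >> 3) & 0x1F
        fn = self.requester_id & 0x7
        return f"{bus:02x}:{dev:02x}.{fn}"


class FaultHandler:
    """Logs every fault; asks for a device reset after repeated faults."""

    def __init__(self, reset_threshold: int = 3) -> None:
        self._threshold = reset_threshold
        self._counts: Counter = Counter()
        self.log = []

    def handle(self, rec: FaultRecord) -> str:
        self.log.append(
            f"DMA fault from {rec.bdf} at IOVA {rec.iova:#x}: {rec.reason.value}")
        self._counts[rec.requester_id] += 1
        if self._counts[rec.requester_id] >= self._threshold:
            return "reset-device"
        return "log-only"
```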

IOMMU faults during normal operation indicate a driver bug or attack. A well-written driver never causes IOMMU faults — it always maps DMA buffers before programming the device’s DMA engine and unmaps them only after the DMA completes. An IOMMU fault in production means either: the driver freed a DMA buffer while the device was still DMAing to it (use-after-free), the device firmware is buggy, or an active attack is in progress. All three cases warrant immediate investigation.

📋 ATS — IOMMU Translation Caching in the Device

The IOMMU IOTLB caches translations at the IOMMU hardware. ATS (Address Translation Services — PCIe Extended Capability 000Fh, covered in PCIe-22) pushes this further: the device itself can cache translations, eliminating even the IOTLB lookup for subsequent DMA using the same IOVA.

Figure 7 — ATS eliminates per-TLP IOMMU translation for cached entries. The device requests a translation once (Translation Request TLP), receives the physical address back from the IOMMU (Translation Completion), caches it internally (Address Translation Cache, ATC), and then uses AT=10b in all subsequent DMA TLPs to the same address. The IOMMU trusts AT=10b TLPs without re-translating — the device already has the correct physical address. IOMMU can invalidate cached translations (Invalidation Request TLP) when mappings change.

ATS is critical for latency-sensitive DMA: RDMA network adapters performing sub-microsecond DMA, NVMe completion queues, and AI accelerators streaming weights. Without ATS, every DMA that misses the IOTLB stalls on a page walk (up to 4 dependent memory reads), which can add hundreds of nanoseconds or more per miss. With ATS, the device requests a translation once, caches it, and then issues pre-translated DMA at full link speed for as long as the entry stays valid.
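The device-side behaviour can be modelled as a small cache in front of a translation callback. The AT field encodings are from the PCIe specification; the `DeviceAtc` class and its callback, which stands in for the Translation Request and Translation Completion exchange, are illustrative assumptions.

```python
from typing import Callable, Dict, Tuple

# AT field encodings carried in Memory Request TLP headers.
AT_UNTRANSLATED = 0b00
AT_TRANSLATION_REQUEST = 0b01
AT_TRANSLATED = 0b10


class DeviceAtc:
    """Device-side Address Translation Cache (simplified model)."""

    def __init__(self, translation_request: Callable[[int], int]) -> None:
        # The callback maps an IOVA page base to a physical page base.
        self._request = translation_request
        self._cache: Dict[int, int] = {}
        self.requests_sent = 0

    def tlp_for(self, iova: int) -> Tuple[int, int]:
        """Return the (AT field, address) to place in the DMA TLP header."""
        page = iova & ~0xFFF
        if page not in self._cache:
            self.requests_sent += 1  # one Translation Request per new page
            self._cache[page] = self._request(page)
        return AT_TRANSLATED, self._cache[page] | (iova & 0xFFF)

    def on_invalidation_request(self, iova: int) -> None:
        """Drop a cached entry when the IOMMU invalidates it."""
        self._cache.pop(iova & ~0xFFF, None)
```

Only the first access to a page pays for a round trip to the IOMMU; later accesses to the same page are served from the ATC until an Invalidation Request removes the entry.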

📋 PASID — Per-Process IOMMU Contexts

Standard IOMMU domains isolate one device from another. PASID (Process Address Space ID — PCIe Extended Capability 001Bh, covered in PCIe-22) extends isolation to per-process within a single VM or physical machine. Each DMA TLP carries a 20-bit PASID value in a TLP Prefix, and the IOMMU uses that PASID to select which page table to use for the translation.

Without PASID: all DMA from a shared GPU goes through one page table (one VM’s or one user process’s address space). If two processes share the GPU, their DMA must share the IOMMU address space — isolation is at the device level only.

With PASID: each compute kernel running on the GPU can have its own PASID and its own IOMMU page table pointing to that process’s virtual memory. The GPU’s DMA engine tags each DMA request with the kernel’s PASID. The IOMMU translates using the PASID-indexed page table. Multiple processes share the GPU hardware with IOMMU-enforced per-process memory isolation.
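PASID-based selection adds one lookup before the page table walk: the PASID picks the page table, then the IOVA is translated within it. In this sketch a dictionary per PASID stands in for a real page table, and `PasidContext` and `PasidFault` are hypothetical names.

```python
PASID_BITS = 20
MAX_PASID = (1 << PASID_BITS) - 1
PAGE_MASK = 0xFFF


class PasidFault(Exception):
    """Raised when a PASID is unbound or the address has no mapping."""


class PasidContext:
    """Per-device table that selects a page table by PASID (simplified)."""

    def __init__(self) -> None:
        self._tables = {}  # pasid -> {IOVA page base: physical page base}

    def bind(self, pasid: int, page_table: dict) -> None:
        """Associate a process address space with a PASID."""
        if not 0 <= pasid <= MAX_PASID:
            raise ValueError("PASID must fit in 20 bits")
        self._tables[pasid] = page_table

    def unbind(self, pasid: int) -> None:
        self._tables.pop(pasid, None)

    def translate(self, pasid: int, address: int) -> int:
        """Translate an address using the page table selected by the PASID."""
        table = self._tables.get(pasid)
        if table is None:
            raise PasidFault(f"PASID {pasid} is not bound")
        base = table.get(address & ~PAGE_MASK)
        if base is None:
            raise PasidFault(f"no mapping for {address:#x} in PASID {pasid}")
        return base | (address & PAGE_MASK)


# Two processes sharing one GPU, each with its own address space.
gpu = PasidContext()
gpu.bind(1, {0x1000: 0x4000_0000})
gpu.bind(2, {0x1000: 0x5000_0000})
```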

PASID enables GPU virtualisation without hypervisor intervention in the data path. An AI cloud provider running 100 concurrent LLM inference sessions on one GPU uses PASID to give each session its own IOMMU address space. Session A’s DMA cannot reach Session B’s model weights because the IOMMU enforces the boundary. At Gen 6 PCIe speeds (64 GT/s × 16 lanes, about 128 GB/s per direction), enforcing this isolation at line rate in the IOMMU is essential for secure multi-tenant AI inference.

📋 IOMMU on x86 and ARM

Platform	IOMMU name	Specification	Linux driver	Key features
Intel x86	VT-d (Virtualization Technology for Directed I/O)	Intel VT-d Architecture Specification	`iommu/intel`	Root/Context Table, 4-level IOMMU page tables, Interrupt Remapping, ATS, PRS (Page Request Service), PASID, Posted Interrupts, IOMMU for Intel Integrated GPUs
AMD x86	AMD-Vi (AMD Virtualization for I/O)	AMD I/O Virtualization Technology Specification	`iommu/amd`	Device Table (combined Root+Context), 4-level page tables, Guest Virtual Address translation, IOMMU Event Log for fault recording
ARM/AArch64	SMMU (System Memory Management Unit)	ARM System Memory Management Unit Architecture v3	`iommu/arm-smmu-v3`	Stream Table (per-device context), 4-level page tables shared with CPU MMU format, CMDQ for IOTLB invalidation, PRIQ for page request reporting, PASID/SSID
RISC-V	RISC-V IOMMU	RISC-V IOMMU Specification (ratified 2023)	`iommu/riscv`	Device Directory Table, RISC-V page table format (Sv48/Sv57), MSI remapping, PASID

All four IOMMU implementations serve the same purpose (IOVA-to-PA translation and permission checking for PCIe DMA) but use different table formats and hardware interfaces. The OS IOMMU abstraction layer (Linux: drivers/iommu/) presents a unified API (iommu_map(), iommu_unmap(), and a domain allocation call, which is iommu_domain_alloc() in older kernels and iommu_paging_domain_alloc() in newer ones) regardless of the underlying hardware.

📋 VFIO — Safe Device Passthrough via IOMMU

VFIO (Virtual Function I/O) is the Linux kernel framework that enables safe PCIe device passthrough to virtual machines and containers by using the IOMMU to enforce isolation. It exposes IOMMU group-level device assignment to userspace and virtual machine managers.

The key concepts:

IOMMU Group: the minimum set of devices that must be isolated together. All devices in a group share IOMMU domain membership — if any device in the group can see another device’s DMA without IOMMU enforcement (e.g. they’re behind the same PCIe switch without ACS), they must be in the same group and assigned together to the same VM.
IOMMU Group ↔ ACS: when ACS is enabled on all switch ports (see PCIe-28), each device gets its own IOMMU group (no cross-device peer DMA). When ACS is absent, multiple devices may form one group — all must be assigned to the same VM.
VFIO container: a handle for one or more IOMMU groups sharing an address space. The VM monitor (QEMU, Cloud Hypervisor, etc.) creates a VFIO container, assigns devices to it, and maps guest physical memory ranges into the container’s IOMMU domain.
Interrupt remapping: VFIO also uses IOMMU interrupt remapping tables to prevent assigned devices from injecting arbitrary interrupts into the host or other VMs. Only remapped interrupt vectors permitted by the IOMMU’s interrupt remapping tables are allowed.

Check VFIO performs before device assignment	Why
ACS enabled on all switches in the path to the RC	Without ACS, the device can peer-DMA to other devices bypassing the IOMMU
IOMMU enabled in hardware (VT-d / AMD-Vi / SMMU active)	Without IOMMU active, DMA is unrestricted regardless of VFIO setup
Device in its own IOMMU group (no shared groups)	If device shares a group with another device, that other device must also be assigned to the same VM
Interrupt remapping supported and enabled	Without interrupt remapping, device can inject arbitrary MSI vectors into any CPU
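On Linux, IOMMU group membership is exposed under `/sys/kernel/iommu_groups/<group>/devices/`. The sketch below lists the groups and picks out those that contain more than one device, which are the ones that cannot be split across VMs. The function names are illustrative, and the `root` parameter exists only so the code can be pointed at a different directory.

```python
import os
from typing import Dict, List


def list_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> Dict[int, List[str]]:
    """Return {group number: sorted device addresses} read from sysfs.

    Returns an empty dict if the directory does not exist (for example,
    when the IOMMU is disabled or the platform is not Linux).
    """
    groups: Dict[int, List[str]] = {}
    if not os.path.isdir(root):
        return groups
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


def shared_groups(groups: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Return only the groups with more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}


if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```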

⚡ DMA and IOMMU in Gen 6

The PCIe DMA mechanism — MRd/MWr TLPs, Bus Master Enable, Requester ID in TLP header, AT field for ATS, PASID TLP prefix — is completely unchanged in Gen 6. The IOMMU interacts with the Transaction Layer, not the Physical Layer. Gen 6’s PAM4 signalling is invisible to the IOMMU.

What changes in Gen 6 DMA and IOMMU practice:

Aspect	Gen 6 change or new consideration
DMA TLP format	Unchanged — same MRd/MWr format, same AT field, same PASID TLP prefix
IOMMU table format	Unchanged — VT-d, SMMU, AMD-Vi page table formats are independent of PCIe generation
DMA bandwidth	Gen 6 at 64 GT/s × 16 lanes = 1,024 Gb/s, or about 128 GB/s raw per direction (about 256 GB/s bidirectional). IOTLB capacity and page-walk throughput must scale with this rate; PCIe 6.0 systems need large IOTLBs and fast page table walkers to keep the IOMMU from becoming a bottleneck.
ATS importance	At roughly 128 GB/s per direction, walking page tables for every TLP is impractical, so translation caching is essential. Gen 6 AI accelerators doing high-frequency small DMA (inference token streaming) rely on ATS to keep IOMMU latency off the critical path.
PASID and multi-tenant AI	Gen 6 AI accelerators (100,000+ concurrent inference sessions per GPU cluster) depend on PASID for per-tenant IOMMU isolation. 20-bit PASID = 1M concurrent address spaces — sufficient for large-scale multi-tenant cloud GPU deployments.
IDE and DMA security	IDE (Integrity and Data Encryption, Extended Capability ID 0030h), introduced as an ECN to PCIe 5.0 and included in the PCIe 6.0 base specification, provides TLP-level encryption and integrity protection. DMA TLPs can be encrypted end-to-end, so an attacker who taps the PCIe trace cannot read DMA data. Critical for confidential computing: model weights and activation data remain encrypted in transit over PCIe between trusted endpoints.
CXL.mem DMA	CXL (Compute Express Link) reuses the PCIe PHY: CXL 1.x and 2.0 run on the PCIe 5.0 PHY, CXL 3.x on the PCIe 6.0 PHY. CXL.mem carries host load/store accesses to device-attached, host-managed memory. These are CPU-initiated accesses rather than device-initiated DMA, so they do not traverse the PCIe IOMMU path; the memory is described to the OS through ACPI tables such as SRAT and HMAT. Device-initiated traffic over CXL.io still follows the PCIe DMA and IOMMU rules described here. Linux support in this area is still evolving.
Flit mode and IOMMU	Gen 6 flit mode (fixed 256-byte flits) is fully transparent to the IOMMU. The IOMMU operates at the Transaction Layer, and flit boundaries below it are invisible. No IOMMU changes are needed for flit mode.
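As a sanity check on the bandwidth figures, raw link bandwidth follows from the transfer rate and the lane count: each transfer carries one bit per lane, so 64 GT/s × 16 lanes is 1,024 Gb/s, or 128 GB/s in each direction before FEC, CRC, and flit overhead. The second helper estimates how many 4 KB page translations per second a fully loaded link would touch if every page were distinct; both function names are illustrative.

```python
def raw_link_bandwidth_gbytes(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s (one bit per transfer per lane).

    Ignores FEC, CRC, and flit or encoding overhead.
    """
    return gt_per_s * lanes / 8


def page_translations_per_second(bandwidth_gbytes: float,
                                 page_size: int = 4096) -> float:
    """Distinct pages touched per second if DMA streams at full bandwidth."""
    return bandwidth_gbytes * 1e9 / page_size


gen6_x16 = raw_link_bandwidth_gbytes(64, 16)          # one direction
gen6_x16_bidirectional = 2 * gen6_x16
gen6_pages = page_translations_per_second(gen6_x16)
```

Tens of millions of page translations per second per direction is the scale at which IOTLB hit rate and ATS start to dominate IOMMU performance.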

At Gen 6, the IOMMU is no longer optional. At roughly 128 GB/s of DMA bandwidth per direction on a x16 link, with hundreds of VFs across multiple tenants, a single misdirected DMA engine can overwrite gigabytes of another tenant’s memory in tens of milliseconds. Gen 6 deployments (cloud AI, confidential computing, multi-tenant storage) universally require: IOMMU enabled, ATS for performance, PASID for per-process isolation, IDE for encryption, and ACS on all switches. These form the complete Gen 6 DMA security stack.

📋 Quick Reference

Item	Value / Rule
DMA definition	Device-initiated memory transactions (MRd/MWr TLPs) without CPU involvement per transfer
Bus Master Enable	Command register bit 2. Must = 1 for device to initiate DMA. Reset default = 0. OS sets to 1 only after IOMMU domain is configured.
DMA write TLP	Posted MWr — no completion returned. Device can send next TLP immediately.
DMA read TLP	Non-posted MRd — completion with data (CplD) returned per read. Device tracks outstanding reads by Tag.
IOVA	I/O Virtual Address — address device puts in TLP header. Translated by IOMMU to physical address. Completely separate from CPU virtual addresses.
IOMMU position	Between PCIe Root Complex and DRAM controller. Intercepts every DMA TLP.
IOMMU function	1) Translate IOVA → Physical Address. 2) Check read/write permissions. 3) Block and fault on invalid accesses.
IOTLB	IOMMU Translation Lookaside Buffer — caches recent IOVA→PA translations. Must be invalidated when OS unmaps DMA buffers.
Root Table	Indexed by PCIe Bus number (256 entries). Each entry → Context Table base.
Context Table	Indexed by Device+Function (256 entries per bus). Each entry → Domain ID + IOMMU page table root.
IOMMU Domain	A set of devices sharing the same IOMMU page tables. Each VM typically gets one domain. Domain ID identifies which page table to use for a device’s DMA.
IOMMU page table walk	Up to 4 levels (48-bit IOVA). Each level indexed by 9 IOVA bits → physical address of next-level table. Leaf PTE contains physical page base + R/W permission bits.
IOMMU fault	Triggered when IOVA is unmapped, write permission denied, or read permission denied. IOMMU records fault address + Requester ID + reason. Generates interrupt to OS IOMMU driver.
ATS	Address Translation Services (PCIe Cap 000Fh). Device sends Translation Request TLP → IOMMU returns physical address. Device caches in ATC. Subsequent DMA uses AT=10b to bypass IOMMU re-translation.
IOTLB invalidation	OS must invalidate ATC and IOTLB entries when unmapping DMA buffers. PCIe Invalidation Request TLP sent by IOMMU to device to flush ATC entries.
PASID	Process Address Space ID (PCIe Cap 001Bh). 20-bit tag in TLP Prefix. IOMMU selects per-PASID page table for translation. Enables per-process DMA isolation within a device.
IOMMU Group	Minimum set of devices that must be assigned together. Determined by ACS capability and PCIe topology. VFIO enforces group-based assignment.
ACS requirement	ACS Source Validation + P2P Request Redirect must be enabled on all switches for devices to get individual IOMMU groups and single-device assignment to VMs.
x86 Intel	VT-d (Virtualization Technology for Directed I/O). Linux driver: `iommu/intel`.
x86 AMD	AMD-Vi. Linux driver: `iommu/amd`.
ARM	SMMU v3 (System Memory Management Unit). Linux driver: `iommu/arm-smmu-v3`.
Gen 6 changes	DMA/IOMMU mechanism unchanged. ATS critical at about 128 GB/s per direction (x16). PASID essential for multi-tenant AI. IDE adds TLP-level DMA encryption. CXL.mem host accesses fall outside the standard PCIe IOMMU path.