PCIe Series — PCIe-29: DMA and IOMMU — VLSI Trainers

DMA and IOMMU

How PCIe devices initiate Direct Memory Access, why unrestricted DMA is a critical security and reliability risk, how the IOMMU translates device-visible IOVA addresses to physical memory, the structure of IOMMU remapping tables, ATS-based translation caching, and how DMA isolation scales in Gen 6 AI and virtualised infrastructure.

📋 DMA — What It Is and Why It Exists

Transferring data between a PCIe device and system memory is fundamental to every I/O operation. The simplest approach, Programmed I/O (PIO), has the CPU read device registers into its internal registers and then write those values to system memory, one word at a time. This works but is deeply inefficient: for every word transferred, the CPU performs two bus transactions (one device read, one memory write), and more importantly, the CPU is occupied with data movement rather than computation.

Direct Memory Access (DMA) eliminates the CPU from the data path. The PCIe device's internal DMA engine initiates memory transactions directly, reading data from system memory into the device's internal buffers or writing data from those buffers into system memory, without CPU involvement per transfer. The CPU only needs to set up the DMA descriptor (source address, destination address, byte count) and is then free to do other work. When the DMA completes, the device sends an interrupt to notify the CPU.

DMA is the dominant data transfer model for all high-performance PCIe devices: network adapters (receiving packets directly to kernel socket buffers), NVMe SSDs (writing read data directly to application memory), GPUs (copying buffers between VRAM and system RAM), and AI accelerators (streaming model weights from DRAM to on-chip SRAM).

📋 Bus Master Enable and TLP Initiator Role

In PCIe, a device that can initiate DMA is called a Bus Master — a term inherited from PCI where arbitration for shared bus ownership was required. In PCIe, there is no shared bus and no arbitration — every link is point-to-point and full-duplex. Any device may send a TLP at any time subject to flow control credit availability. But the term Bus Master is retained because the same configuration register bit controls DMA capability.

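A PCIe device can only initiate DMA (Memory Read or Memory Write TLPs as Requester) when the Bus Master Enable bit in its Command register (offset 04h bit 2) is set to 1. Without Bus Master Enable = 1, the device must not originate Memory or I/O Requests, which also rules out MSI and MSI-X because they are Memory Writes. It can still return Completions for requests that target it and send Message TLPs. Endpoints never originate Configuration Requests; those come only from the Root Complex. This gives the OS a hardware-enforced way to prevent a device from performing DMA before the OS is ready to manage it.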

Bus Master Enable: the gateway to device-initiated DMA

Bus Master Enable = 0: The device must not issue Memory or I/O Requests; it may only return Completions and send Message TLPs. An MRd or MWr TLP initiated by the device in this state is a protocol violation. The bit is 0 after reset, which is a safe default for enumeration. No IOMMU configuration is needed yet because the device cannot DMA.

Bus Master Enable = 1: The device can originate MRd and MWr TLPs, including MSI/MSI-X writes. It can DMA to or from any address it can form in its TLP header. The IOMMU must be configured before the OS sets this bit; the OS driver sets it after the IOMMU domain and page tables are ready.
Figure 1 — Bus Master Enable is the hardware interlock for DMA. Zero (reset default) prevents the device from initiating any memory transactions. One allows full DMA capability. The OS must configure the IOMMU before setting Bus Master Enable = 1 — otherwise the device can DMA to any physical address from the moment the bit is set. This sequencing is enforced by the Linux PCI framework and is a security requirement for safe device operation.
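
As a minimal sketch of the bit manipulation involved, the Python below models the 16-bit Command register as a plain integer. The helper names and the sample register value are illustrative assumptions; a real driver would use the OS's configuration-space accessors (on Linux, pci_set_master()) rather than edit the bits by hand.

```python
# Command register layout from the PCI configuration header.
PCI_COMMAND = 0x04            # register offset in configuration space
PCI_COMMAND_MEMORY = 1 << 1   # Memory Space Enable
PCI_COMMAND_MASTER = 1 << 2   # Bus Master Enable


def bus_master_enabled(command: int) -> bool:
    """Return True if the device is allowed to initiate DMA."""
    return bool(command & PCI_COMMAND_MASTER)


def set_bus_master(command: int, enable: bool) -> int:
    """Return the Command register value with Bus Master Enable set or cleared."""
    if enable:
        return (command | PCI_COMMAND_MASTER) & 0xFFFF
    return command & ~PCI_COMMAND_MASTER & 0xFFFF


# Hypothetical register value: Memory Space enabled, Bus Master still off.
cmd = 0x0002
print(bus_master_enabled(cmd))   # False
cmd = set_bus_master(cmd, True)
print(hex(cmd))                  # 0x6
```

The read-modify-write shape matters: the other Command bits (Memory Space Enable here) must survive the update.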

📋 DMA Flow in PCIe

A PCIe DMA operation is a sequence of standard Memory Read (MRd) and Memory Write (MWr) TLPs, with the device acting as Requester and system memory (via the Root Complex) acting as Completer. The CPU’s only role is setup and notification:

PCIe DMA write flow: device writes data to system memory

  1. CPU (driver): Programs the DMA descriptor with the destination address (for example 0xC000_0000, which is a physical address without an IOMMU and an IOVA with one) and size 64 KB. Sets Bus Master Enable. Returns to other work.
  2. Device: The internal DMA engine reads the descriptor, splits the data into MWr TLPs (each ≤ Max Payload Size bytes), and forms TLP headers with the destination address.
  3. PCIe fabric: MWr TLPs flow upstream through switches to the Root Complex. Each TLP carries the Requester ID (device BDF), the target address, and a data payload (at most 4 KB, the largest Max Payload Size the specification allows).
  4. IOMMU (in Root Complex/chipset): Intercepts each MWr, translates the device address to a physical address, checks permissions, and forwards the write to the DRAM controller.
  5. DRAM: Data is written to physical memory. MWr is posted (no completion is returned to the device), so the device continues sending the next TLP immediately.
  6. Device: After all MWr TLPs are sent, sends an MSI-X MWr TLP to the APIC. The CPU receives the interrupt; the driver reads the DMA completion status and processes the received data.
Figure 2 — DMA write flow. Memory Write TLPs are posted — no completion is returned for each individual write TLP. The device sends all data TLPs and then sends the MSI interrupt as an ordering fence. Because MWr TLPs use PCIe ordering rules (posted writes are ordered before MSI writes), the CPU is guaranteed that all DMA data is in memory when the interrupt fires. The IOMMU (step ④) is the crucial security layer — all DMA must pass through it.
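
The splitting in step 2 can be sketched as follows. This is an illustrative model, not a device implementation: the function name is hypothetical and the 256-byte Max Payload Size is an assumed example value. It also applies the PCIe rule that a single memory request must not cross a 4 KB address boundary.

```python
def split_into_mwr(addr: int, length: int, max_payload: int = 256):
    """Split one DMA write into (address, byte_count) pairs, one per MWr TLP.

    Each chunk is limited by Max Payload Size and by the distance to the
    next 4 KB boundary, which a single request must not cross.
    """
    tlps = []
    while length > 0:
        to_boundary = 4096 - (addr % 4096)
        chunk = min(length, max_payload, to_boundary)
        tlps.append((addr, chunk))
        addr += chunk
        length -= chunk
    return tlps


# The 64 KB transfer from the flow above, with an assumed MPS of 256 bytes.
tlps = split_into_mwr(0xC000_0000, 64 * 1024)
print(len(tlps))                      # 256

# An unaligned buffer that straddles a 4 KB boundary is split at the boundary.
print(split_into_mwr(0x1F80, 0x100))  # [(8064, 128), (8192, 128)]
```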

DMA Read flow (non-posted)

DMA reads are non-posted: the device sends Memory Read Request TLPs and must wait for Completion TLPs carrying the data. The device uses the Tag field to match each incoming completion to the outstanding read request. The Tag is 5 bits without Extended Tags, 8 bits with Extended Tags, 10 bits with 10-Bit Tags (PCIe 4.0 and later), and 14 bits in Gen 6 Flit Mode. Multiple reads can be outstanding simultaneously (up to the number of tag values). The Root Complex acts as Completer, reading DRAM and sending Completion TLPs back to the device. Each completion may carry up to Max Payload Size bytes, so one large read can be answered by several completions.
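
A minimal sketch of tag-based tracking is shown below. The class and method names are hypothetical, and the model is simplified: it treats each read as answered by a single completion, whereas real hardware must also count bytes across split completions before releasing a tag.

```python
class TagPool:
    """Tracks outstanding Memory Read requests by Tag (simplified model)."""

    def __init__(self, tag_bits: int = 8):
        self.free = list(range(1 << tag_bits))
        self.outstanding = {}          # tag -> (address, length)

    def issue_read(self, addr: int, length: int):
        """Allocate a tag for a new MRd; return None if all tags are in use."""
        if not self.free:
            return None                # the device must stall until a completion arrives
        tag = self.free.pop(0)
        self.outstanding[tag] = (addr, length)
        return tag

    def complete(self, tag: int):
        """Match a completion to its request and release the tag."""
        request = self.outstanding.pop(tag)
        self.free.append(tag)
        return request


pool = TagPool(tag_bits=2)             # tiny pool so exhaustion is easy to see
tags = [pool.issue_read(0x1000 * i, 512) for i in range(4)]
print(tags)                            # [0, 1, 2, 3]
print(pool.issue_read(0x9000, 512))    # None
print(pool.complete(1))                # (4096, 512)
print(pool.issue_read(0x9000, 512))    # 1
```

The number of tags bounds how many reads can be in flight, which is why wider tags matter as link bandwidth and round-trip latency grow.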

📋 DMA Types — Write, Read, and Scatter-Gather

| DMA type | TLPs used | Posted? | Typical use |
|---|---|---|---|
| DMA Write | MWr (Memory Write) | Yes (no completion) | Device → system memory. NIC receiving packets to kernel buffer. NVMe writing read data to application buffer. |
| DMA Read | MRd (Memory Read) + CplD completions | No (completion required) | System memory → device. NIC reading socket data to transmit. GPU loading model weights. |
| Scatter-Gather DMA | MWr + MRd to descriptor chain | Mixed | Transfer to/from physically non-contiguous memory pages described by a driver-built descriptor ring. Most modern DMA engines support this natively. |
| Peer-to-Peer DMA | MWr to another device's BAR address | Yes | GPU-to-GPU transfers over PCIe. One device's DMA writes directly to another device's MMIO BAR. IOMMU and ACS policy control whether this is allowed. |

Scatter-Gather enables DMA to/from virtual memory. OS virtual memory can map a contiguous virtual address range to non-contiguous physical pages. A driver builds a scatter-gather list, an array of (address, length) pairs, and programs it into the device's DMA descriptor ring. The addresses are physical addresses without an IOMMU and IOVAs with one. The device's DMA engine steps through the list, issuing one or more TLPs for each contiguous segment. This allows DMA directly to application memory without copying to a physically contiguous kernel buffer first.
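
The list-building step can be sketched as below. The function name and the page_map dictionary (virtual page number to physical page base) are hypothetical stand-ins for what an OS would obtain by pinning and looking up the buffer's pages; the sketch also merges physically adjacent pages into one segment.

```python
PAGE_SIZE = 4096


def build_sg_list(vaddr: int, length: int, page_map: dict):
    """Turn a virtually contiguous buffer into (physical_address, length) pairs."""
    sg = []
    while length > 0:
        offset = vaddr % PAGE_SIZE
        phys = page_map[vaddr // PAGE_SIZE] + offset
        chunk = min(length, PAGE_SIZE - offset)
        if sg and sg[-1][0] + sg[-1][1] == phys:
            # Physically adjacent to the previous segment: extend it.
            sg[-1] = (sg[-1][0], sg[-1][1] + chunk)
        else:
            sg.append((phys, chunk))
        vaddr += chunk
        length -= chunk
    return sg


# Three virtual pages: the first two happen to be physically adjacent.
page_map = {0x10: 0x8000_0000, 0x11: 0x8000_1000, 0x12: 0x9200_0000}
for phys, size in build_sg_list(0x10800, 0x2000, page_map):
    print(hex(phys), hex(size))
# 0x80000800 0x1800
# 0x92000000 0x800
```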

📋 The DMA Threat — Why Unrestricted DMA is Dangerous

A PCIe device with Bus Master Enable = 1 and no IOMMU can DMA to any physical address in the system. The address it puts in an MWr or MRd TLP header is used verbatim as the physical memory address. No hardware enforces which addresses are “its” memory vs. the OS kernel vs. another application vs. hypervisor memory.

DMA attack scenarios without IOMMU protection

Thunderbolt DMA attack: A malicious device is plugged into a Thunderbolt/USB4 port. It issues an MWr to a kernel code page, overwrites a kernel function pointer, and achieves kernel code execution. Published research such as Thunderclap has demonstrated attacks of this class.

VM DMA escape: A compromised VM's VF driver issues DMA to another VM's guest physical address. Without an IOMMU, the DMA lands in the other VM's physical pages, giving full memory read/write access to the victim. Critical in SR-IOV deployments.

Rogue device firmware: A signed device (NIC, SSD, GPU) receives a malicious firmware update. The device continuously DMA-reads encryption keys from kernel memory and exfiltrates the secrets via the network. An IOMMU blocks these cross-region reads.

Figure 3 — Three DMA attack classes without IOMMU. Left: a physical attacker plugs in a malicious Thunderbolt/USB4 device and overwrites kernel memory to achieve code execution. Centre: in a multi-tenant cloud without IOMMU, a compromised VF driver in one VM can directly read/write another VM's physical memory. Right: a supply-chain-compromised device continuously exfiltrates encryption keys from kernel memory. All three are blocked when an IOMMU restricts each device to the pages mapped in its own IOMMU domain.

📋 IOMMU — The Solution

The IOMMU (Input-Output Memory Management Unit) is a hardware unit in the chipset or Root Complex that sits between the PCIe fabric and the DRAM controller. Every DMA transaction — every MRd and MWr TLP from every PCIe device — passes through the IOMMU. For each transaction, the IOMMU performs two operations:

  1. Address translation: the address in the TLP (called the I/O Virtual Address, IOVA) is translated to the actual physical memory address using per-device page tables. The device never sees or needs to know physical addresses — it only sees IOVAs assigned by the OS.
  2. Permission checking: the IOMMU verifies that the device is permitted to read or write the translated physical address. If the device has no mapping for this address (or is trying to write a read-only page), the IOMMU blocks the access and generates a fault.

IOMMU position in the system: every DMA passes through it

PCIe Device (DMA engine uses an IOVA, e.g. 0x0000_1000) → MWr (IOVA) → IOMMU (IOVA → PA translation, permission check, Context + Domain table) → PA 0x8000_1000 → DRAM (the device's memory pages only; kernel and other VMs blocked).

DMA to an unmapped or wrong-permission address: the IOMMU blocks it and generates a fault interrupt.
Figure 4 — IOMMU intercepts every DMA TLP. The device uses its assigned IOVA. The IOMMU translates IOVA to physical address using the device’s domain page table and checks read/write permissions. Valid accesses proceed to DRAM. Invalid accesses (unmapped IOVA, wrong permissions, write to read-only page) are blocked and generate a fault reported as an interrupt to the OS. The device never learns physical addresses of other devices or the OS kernel.

📋 IOVA — I/O Virtual Address Space

Each device (or IOMMU domain, see below) has its own I/O Virtual Address (IOVA) space. An IOVA is the address that the device puts in its TLP header. It is completely independent of the CPU virtual addresses used by processes and of the physical addresses of DRAM. The OS driver controls the mapping between IOVAs and physical pages.

The OS allocates IOVAs from a device's IOVA space using an IOVA allocator. On Linux, drivers reach it through DMA API calls such as dma_map_single() and dma_map_sg(), which use alloc_iova_fast() internally. For each allocation, the OS creates an IOMMU page table entry mapping that IOVA to the designated physical page. The driver then programs the device's DMA engine with the IOVA, and the device uses that IOVA in all its DMA TLPs.

The beauty of this design: the OS can assign completely different physical pages to the same IOVA in different devices’ domain page tables. Device A’s IOVA 0x0000_1000 might map to physical page 0x8000_1000. Device B’s IOVA 0x0000_1000 (same IOVA value) might map to physical page 0x9200_0000. Each device is entirely isolated — they cannot reach each other’s pages even if both use the same IOVA values.
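
A toy model makes the point concrete. The IovaDomain class below is a hypothetical flat page map, not a real IOMMU table format; it only shows that the same IOVA resolves differently per device and that unmapped or read-only pages are refused.

```python
PAGE_SIZE = 4096


class IovaDomain:
    """One device's IOVA space, modelled as a flat page map."""

    def __init__(self):
        self.pages = {}   # IOVA page number -> (physical page base, writable)

    def map_page(self, iova: int, phys: int, writable: bool = True):
        assert iova % PAGE_SIZE == 0 and phys % PAGE_SIZE == 0
        self.pages[iova // PAGE_SIZE] = (phys, writable)

    def translate(self, iova: int, is_write: bool = False) -> int:
        entry = self.pages.get(iova // PAGE_SIZE)
        if entry is None:
            raise PermissionError(f"unmapped IOVA {iova:#x}")
        phys, writable = entry
        if is_write and not writable:
            raise PermissionError(f"write to read-only IOVA {iova:#x}")
        return phys + iova % PAGE_SIZE


device_a, device_b = IovaDomain(), IovaDomain()
device_a.map_page(0x0000_1000, 0x8000_1000)
device_b.map_page(0x0000_1000, 0x9200_0000)

print(hex(device_a.translate(0x1040)))   # 0x80001040
print(hex(device_b.translate(0x1040)))   # 0x92000040
```

Neither device has any IOVA that resolves to the other's page, so there is no address either could put in a TLP to reach it.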

📋 IOMMU Remapping Tables

The IOMMU translates IOVAs using a hierarchy of page tables, similar in concept to the CPU’s virtual memory page tables but operating on DMA addresses. The exact structure varies by platform (Intel VT-d, ARM SMMU, AMD IOMMU) but the principles are the same.

IOMMU table hierarchy: Root, Context, and Page Tables

Root Table: indexed by Bus number; 256 entries (one per bus); each entry points to a Context Table.
Context Table: indexed by Device+Function number; 256 entries per bus; each entry holds a Domain ID and the page table root.
Page Table (L4): IOVA bits [47:39] select the L3 page table (up to 4 levels depending on IOVA width).
PTE (Page Table Entry): Physical Page Base [51:12]; Read [0] and Write [1] permission bits (an entry with both clear is treated as not present); Superpage [7].

IOVA → physical address translation:

  1. Root Table[Bus] → Context Table base.
  2. Context Table[Dev×8+Fn] → Domain ID + page table root.
  3. Walk the page table with IOVA bits (4 levels × 9 bits + 12-bit page offset).
  4. The PTE gives the physical page base.
  5. The page offset is added → physical address.
Figure 5 — IOMMU table hierarchy (Intel VT-d model). The Root Table is indexed by PCIe Bus number (256 entries). Each Root Entry points to a Context Table indexed by Device+Function (256 entries per bus). Each Context Entry contains a Domain ID and pointer to that device’s I/O page table root. The page table is walked with IOVA bits using up to 4 levels (9 bits each) giving 48-bit IOVA space. The leaf Page Table Entry (PTE) contains the physical page base address and permission bits.
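
The two table indices come straight from the 16-bit Requester ID. The sketch below assumes the conventional layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and uses hypothetical function names.

```python
def split_requester_id(rid: int):
    """Decode a 16-bit Requester ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return bus, dev, fn


def table_indices(rid: int):
    """Return (root_table_index, context_table_index) for a Requester ID."""
    bus, dev, fn = split_requester_id(rid)
    return bus, dev * 8 + fn


print(table_indices(0x0102))   # BDF 01:00.2 -> (1, 2)
print(table_indices(0x02FF))   # BDF 02:1f.7 -> (2, 255)
```

Because Dev×8+Fn is simply the low byte of the Requester ID, the context index always falls in 0 to 255, matching the 256 entries per bus.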

📋 Page Table Walk — How IOMMU Translates

For each DMA TLP, the IOMMU performs a page table walk starting from the Context Entry’s page table root. For a 48-bit IOVA address with 4-level tables and 4 KB pages:

  1. Extract IOVA bits [47:39] → index into Level 4 table → get Level 3 table base
  2. Extract IOVA bits [38:30] → index into Level 3 table → get Level 2 table base
  3. Extract IOVA bits [29:21] → index into Level 2 table → get Level 1 table base
  4. Extract IOVA bits [20:12] → index into Level 1 table → get Page Table Entry (PTE)
  5. PTE contains physical page base [51:12] + permission bits. Add IOVA bits [11:0] (page offset) → final physical address.
  6. Check permission bits against TLP type: Read permission for MRd, Write permission for MWr. If permission denied: block TLP, generate fault.
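
The steps above can be sketched with nested dictionaries standing in for the in-memory tables. The names and the tuple used as a PTE are illustrative assumptions, not a hardware format.

```python
class IommuFault(Exception):
    """Raised when the walk finds no mapping or the permission check fails."""


def iova_indices(iova: int):
    """Return the four 9-bit table indices for levels 4, 3, 2 and 1."""
    return [(iova >> shift) & 0x1FF for shift in (39, 30, 21, 12)]


def walk(root: dict, iova: int, is_write: bool) -> int:
    """Translate an IOVA using a 4-level table; leaf PTE = (page_base, read, write)."""
    l4, l3, l2, l1 = iova_indices(iova)
    table = root
    for index in (l4, l3, l2):
        table = table.get(index)
        if table is None:
            raise IommuFault(f"no mapping for IOVA {iova:#x}")
    pte = table.get(l1)
    if pte is None:
        raise IommuFault(f"no mapping for IOVA {iova:#x}")
    page_base, readable, writable = pte
    if is_write and not writable:
        raise IommuFault("write permission denied")
    if not is_write and not readable:
        raise IommuFault("read permission denied")
    return page_base | (iova & 0xFFF)


# One read-only page: IOVA page 0x1 -> physical page base 0x8000_1000.
root = {0: {0: {0: {1: (0x8000_1000, True, False)}}}}
print(iova_indices(0x1234))               # [0, 0, 0, 1]
print(hex(walk(root, 0x1234, False)))     # 0x80001234
```

A write to the same IOVA raises IommuFault, which corresponds to the blocked TLP in step 6.
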

IOTLB caches translations to avoid page walks per TLP. A 4-level page walk requires 4 memory reads per translation, which is too slow to repeat for every TLP of a high-bandwidth DMA stream. The IOMMU maintains an IOTLB (I/O Translation Lookaside Buffer) that caches recently used translations. When a translation is cached, the IOMMU translates without reading the in-memory tables. The IOTLB is tagged with Domain ID so translations from different domains don't collide. When the OS modifies a page table entry (e.g. when unmapping a DMA buffer), it must issue IOTLB invalidation commands to flush stale cached entries.
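
The caching and the need for invalidation can be shown with a small model. The Iotlb class and the walker callback are hypothetical; the walker stands in for a full page table walk.

```python
class Iotlb:
    """Translation cache keyed by (domain_id, IOVA page number)."""

    def __init__(self, walker):
        self.walker = walker   # callable(domain_id, iova_page) -> physical page number
        self.cache = {}
        self.walks = 0         # how many page walks were actually needed

    def translate(self, domain_id: int, iova: int) -> int:
        key = (domain_id, iova >> 12)
        if key not in self.cache:
            self.walks += 1
            self.cache[key] = self.walker(domain_id, iova >> 12)
        return (self.cache[key] << 12) | (iova & 0xFFF)

    def invalidate(self, domain_id: int, iova: int):
        self.cache.pop((domain_id, iova >> 12), None)


tables = {1: {0x1: 0x80001}, 2: {0x1: 0x92000}}
tlb = Iotlb(lambda domain, page: tables[domain][page])

print(hex(tlb.translate(1, 0x1010)), tlb.walks)   # 0x80001010 1
print(hex(tlb.translate(1, 0x1020)), tlb.walks)   # 0x80001020 1  (cache hit)
print(hex(tlb.translate(2, 0x1010)), tlb.walks)   # 0x92000010 2  (other domain)

tables[1][0x1] = 0x80005                          # the OS remaps the page...
print(hex(tlb.translate(1, 0x1010)))              # 0x80001010    (stale entry)
tlb.invalidate(1, 0x1000)                         # ...and must invalidate
print(hex(tlb.translate(1, 0x1010)))              # 0x80005010
```

The stale result before the invalidate call is exactly the hazard the invalidation commands exist to prevent.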

📋 Context Table and Domain Isolation

The Context Table entry for each PCIe function contains a Domain ID. The Domain ID determines which IOMMU page table is used for that device’s DMA translations. Multiple devices can share the same Domain ID (and thus the same page tables) — they form an IOMMU domain and can DMA to the same physical pages.

IOMMU domains: isolation between virtual machines

Domain 1 (VM A): Devices NIC VF1 (BDF 01:00.2) and NVMe VF1 (BDF 02:00.6). Page tables map IOVAs to VM A's guest physical pages only. DMA target: PA 0x4000_0000 – 0x7FFF_FFFF (1 GB for VM A). Cannot reach Domain 2 pages; the IOMMU enforces a hard boundary.

Domain 2 (VM B): Devices NIC VF2 (BDF 01:00.4) and NVMe VF2 (BDF 02:00.8). Page tables map IOVAs to VM B's guest physical pages only. DMA target: PA 0x8000_0000 – 0xBFFF_FFFF (1 GB for VM B). Cannot reach Domain 1 pages; completely isolated.
Figure 6 — IOMMU domain isolation. Domain 1 contains all devices assigned to VM A. Their page tables only map IOVAs to VM A's physical memory range (0x4000_0000–0x7FFF_FFFF). Domain 2 contains all devices assigned to VM B with completely separate page tables mapping only to VM B's physical pages. If a device in Domain 1 puts VM B's physical address in a TLP, the IOMMU treats that value as an IOVA in Domain 1's page tables: it either has no valid mapping, so the access is blocked with a fault, or it resolves to one of VM A's own pages. In neither case does the DMA reach VM B's memory.

📋 IOMMU Fault Handling

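When the IOMMU blocks a DMA access, it records a fault in its fault status registers and optionally generates an interrupt. The IOMMU records the faulting address, the Requester ID (BDF) of the device that issued the request, and the fault reason (for example an unmapped IOVA, or read or write permission denied).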

The OS IOMMU driver handles the fault interrupt, reads the fault records, and decides the response. Common responses: log the fault (for observability), terminate the DMA (by resetting the device), or in virtualised environments, deliver a fault notification to the VM’s guest kernel so it can handle the DMA fault at the VM level.

IOMMU faults during normal operation indicate a driver bug or attack. A well-written driver never causes IOMMU faults — it always maps DMA buffers before programming the device’s DMA engine and unmaps them only after the DMA completes. An IOMMU fault in production means either: the driver freed a DMA buffer while the device was still DMAing to it (use-after-free), the device firmware is buggy, or an active attack is in progress. All three cases warrant immediate investigation.
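
As a rough illustration of what a fault handler has to work with, the sketch below models a fault record with the three fields named above. The FaultRecord class, the check_dma function, and the mapping format are hypothetical and do not correspond to any hardware register layout.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12


@dataclass
class FaultRecord:
    requester_id: int   # 16-bit Bus/Device/Function of the faulting device
    address: int        # IOVA carried in the blocked TLP
    reason: str

    def bdf(self) -> str:
        rid = self.requester_id
        return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"


def check_dma(mappings, requester_id, iova, is_write, fault_log):
    """Return the physical address, or log a fault and return None."""
    entry = mappings.get(iova >> PAGE_SHIFT)
    if entry is None:
        fault_log.append(FaultRecord(requester_id, iova, "unmapped IOVA"))
        return None
    phys_page, writable = entry
    if is_write and not writable:
        fault_log.append(FaultRecord(requester_id, iova, "write permission denied"))
        return None
    return (phys_page << PAGE_SHIFT) | (iova & 0xFFF)


# The driver unmapped its buffer, but the device is still writing to it.
mappings = {}
faults = []
check_dma(mappings, 0x0102, 0x5000, True, faults)
for f in faults:
    print(f.bdf(), hex(f.address), f.reason)   # 01:00.2 0x5000 unmapped IOVA
```

The Requester ID is what lets the OS attribute the fault to a specific function and decide whether to log it, reset the device, or notify a guest.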

📋 ATS — IOMMU Translation Caching in the Device

The IOMMU IOTLB caches translations at the IOMMU hardware. ATS (Address Translation Services — PCIe Extended Capability 000Fh, covered in PCIe-22) pushes this further: the device itself can cache translations, eliminating even the IOTLB lookup for subsequent DMA using the same IOVA.

ATS: device-side translation caching for high-performance DMA

Without ATS (AT=00b): Every DMA TLP carries an IOVA in its header (AT=00b). The IOMMU must translate every single TLP (IOTLB hit or walk). With high-frequency small DMA, the IOMMU becomes the bottleneck. Suitable for low-frequency, large-buffer DMA.

With ATS (AT=10b): The device sends a Translation Request TLP (TR) with the IOVA. The IOMMU responds with a Translation Completion carrying the physical address. The device caches IOVA→PA in its internal ATC and uses AT=10b in DMA TLPs. The IOMMU passes AT=10b TLPs without re-translating (it trusts the cache). Critical for RDMA, GPU compute, and 100G+ NIC performance.
Figure 7 — ATS eliminates per-TLP IOMMU translation for cached entries. The device requests a translation once (Translation Request TLP), receives the physical address back from the IOMMU (Translation Completion), caches it internally (Address Translation Cache, ATC), and then uses AT=10b in all subsequent DMA TLPs to the same address. The IOMMU trusts AT=10b TLPs without re-translating — the device already has the correct physical address. IOMMU can invalidate cached translations (Invalidation Request TLP) when mappings change.

ATS is critical for latency-sensitive DMA: RDMA network adapters performing sub-microsecond DMA, NVMe completion queues, and AI accelerators streaming weights. Without ATS, every DMA that misses the IOTLB requires an IOMMU walk (4 memory reads), and that latency lands in the DMA path. With ATS, the device requests the translation once, ahead of time if it chooses, and then issues DMA with pre-translated addresses at full link speed.
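
The request-once, reuse-many pattern can be modelled as below. The AtsDevice class and the iommu_lookup callback are hypothetical; only the AT field encodings (00b untranslated, 01b translation request, 10b translated) come from the PCIe specification.

```python
AT_UNTRANSLATED = 0b00
AT_TRANSLATION_REQUEST = 0b01
AT_TRANSLATED = 0b10


class AtsDevice:
    """Device-side Address Translation Cache (ATC), simplified."""

    def __init__(self, iommu_lookup):
        self.iommu_lookup = iommu_lookup   # callable(iova_page) -> physical page number
        self.atc = {}
        self.translation_requests = 0

    def dma_header(self, iova: int):
        """Return the (AT field, address) the device places in a DMA TLP."""
        page = iova >> 12
        if page not in self.atc:
            # ATC miss: one Translation Request (AT=01b) round trip to the IOMMU.
            self.translation_requests += 1
            self.atc[page] = self.iommu_lookup(page)
        return AT_TRANSLATED, (self.atc[page] << 12) | (iova & 0xFFF)

    def invalidate(self, iova: int):
        """Handle an Invalidation Request from the IOMMU for one page."""
        self.atc.pop(iova >> 12, None)


page_table = {0x1: 0x80001}
nic = AtsDevice(lambda page: page_table[page])

for offset in (0x000, 0x040, 0x080):
    at, addr = nic.dma_header(0x1000 + offset)
    print(bin(at), hex(addr))
print(nic.translation_requests)   # 1
```

Three DMA TLPs to the same page cost a single translation request, which is the saving ATS provides.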

📋 PASID — Per-Process IOMMU Contexts

Standard IOMMU domains isolate one device from another. PASID (Process Address Space ID — PCIe Extended Capability 001Bh, covered in PCIe-22) extends isolation to per-process within a single VM or physical machine. Each DMA TLP carries a 20-bit PASID value in a TLP Prefix, and the IOMMU uses that PASID to select which page table to use for the translation.

Without PASID: all DMA from a shared GPU goes through one page table (one VM’s or one user process’s address space). If two processes share the GPU, their DMA must share the IOMMU address space — isolation is at the device level only.

With PASID: each compute kernel running on the GPU can have its own PASID and its own IOMMU page table pointing to that process’s virtual memory. The GPU’s DMA engine tags each DMA request with the kernel’s PASID. The IOMMU translates using the PASID-indexed page table. Multiple processes share the GPU hardware with IOMMU-enforced per-process memory isolation.
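
A toy model of PASID-indexed table selection is sketched below. The PasidIommu class and its flat page maps are hypothetical; the only detail taken from the text is that the PASID is a 20-bit value that selects the page table.

```python
PASID_BITS = 20


class PasidIommu:
    """Selects a page table by (Requester ID, PASID), then translates."""

    def __init__(self):
        self.tables = {}   # (requester_id, pasid) -> {page number: physical page number}

    def bind(self, requester_id: int, pasid: int, page_table: dict):
        if not 0 <= pasid < (1 << PASID_BITS):
            raise ValueError("PASID must fit in 20 bits")
        self.tables[(requester_id, pasid)] = page_table

    def translate(self, requester_id: int, pasid: int, addr: int) -> int:
        table = self.tables.get((requester_id, pasid))
        if table is None or (addr >> 12) not in table:
            raise LookupError(f"no mapping for PASID {pasid} address {addr:#x}")
        return (table[addr >> 12] << 12) | (addr & 0xFFF)


iommu = PasidIommu()
gpu = 0x0300                                # hypothetical GPU at BDF 03:00.0
iommu.bind(gpu, 1, {0x7F000: 0x40000})      # process A's address space
iommu.bind(gpu, 2, {0x7F000: 0x80000})      # process B's address space

# Same device, same virtual address, different process: different physical page.
print(hex(iommu.translate(gpu, 1, 0x7F00_0010)))   # 0x40000010
print(hex(iommu.translate(gpu, 2, 0x7F00_0010)))   # 0x80000010
```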

PASID enables GPU virtualisation without hypervisor intervention in the data path. An AI cloud provider running 100 concurrent LLM inference sessions on one GPU uses PASID to give each session its own IOMMU address space. Session A's DMA cannot reach Session B's model weights because the IOMMU enforces the boundary. At Gen 6 PCIe speeds (128 GB/s raw per direction on a x16 link), the IOMMU has to enforce this isolation at line rate for secure multi-tenant AI inference.

📋 IOMMU on x86 and ARM

| Platform | IOMMU name | Specification | Linux driver | Key features |
|---|---|---|---|---|
| Intel x86 | VT-d (Virtualization Technology for Directed I/O) | Intel VT-d Architecture Specification | drivers/iommu/intel/ | Root/Context Table, 4-level IOMMU page tables, Interrupt Remapping, ATS, PRS (Page Request Service), PASID, Posted Interrupts, IOMMU for Intel Integrated GPUs |
| AMD x86 | AMD-Vi (AMD Virtualization for I/O) | AMD I/O Virtualization Technology Specification | drivers/iommu/amd/ | Device Table (combined Root+Context), 4-level page tables, Guest Virtual Address translation, IOMMU Event Log for fault recording |
| ARM/AArch64 | SMMU (System Memory Management Unit) | ARM System Memory Management Unit Architecture v3 | drivers/iommu/arm/arm-smmu-v3/ | Stream Table (per-device context), 4-level page tables shared with CPU MMU format, CMDQ for IOTLB invalidation, PRIQ for page request reporting, PASID/SSID |
| RISC-V | RISC-V IOMMU | RISC-V IOMMU Specification (ratified 2023) | drivers/iommu/riscv/ | Device Directory Table, RISC-V page table format (Sv48/Sv57), MSI remapping, PASID |

All four IOMMU implementations serve the same purpose (IOVA-to-PA translation and permission checking for PCIe DMA) but use different table formats and hardware interfaces. The OS IOMMU abstraction layer (Linux: drivers/iommu/) presents a unified API regardless of the underlying hardware: iommu_map() and iommu_unmap() for mappings, and a domain allocation call, which is iommu_domain_alloc() in older kernels and iommu_paging_domain_alloc() in recent ones.

📋 VFIO — Safe Device Passthrough via IOMMU

VFIO (Virtual Function I/O) is the Linux kernel framework that enables safe PCIe device passthrough to virtual machines and containers by using the IOMMU to enforce isolation. It exposes IOMMU group-level device assignment to userspace and virtual machine managers.

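The key concept is the IOMMU group: the minimum set of devices that must be assigned together, determined by ACS capability and PCIe topology. VFIO enforces group-based assignment, and the following conditions must hold before a device can be assigned safely: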

| Check VFIO performs before device assignment | Why |
|---|---|
| ACS enabled on all switches in the path to the RC | Without ACS, the device can peer-DMA to other devices, bypassing the IOMMU |
| IOMMU enabled in hardware (VT-d / AMD-Vi / SMMU active) | Without an active IOMMU, DMA is unrestricted regardless of VFIO setup |
| Device in its own IOMMU group (no shared groups) | If the device shares a group with another device, that other device must also be assigned to the same VM |
| Interrupt remapping supported and enabled | Without interrupt remapping, the device can inject arbitrary MSI vectors into any CPU |
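
On Linux, group membership is visible in sysfs under /sys/kernel/iommu_groups, where each group directory has a devices subdirectory. The sketch below lists it; the function name is hypothetical, and on a machine with the IOMMU disabled or on a non-Linux system it simply returns an empty result.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Return {group_name: [device addresses]} read from sysfs, or {} if absent."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group in base.iterdir():
        devices = group / "devices"
        if devices.is_dir():
            groups[group.name] = sorted(d.name for d in devices.iterdir())
    return groups


if __name__ == "__main__":
    for name, devices in sorted(iommu_groups().items()):
        # A group with more than one device must be assigned as a unit.
        note = "" if len(devices) == 1 else "  (shared group)"
        print(f"group {name}: {', '.join(devices)}{note}")
```

A device that appears alone in its group is a candidate for single-device passthrough; a shared group means the other members have to go to the same VM.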

📋 DMA and IOMMU in Gen 6

The PCIe DMA mechanism — MRd/MWr TLPs, Bus Master Enable, Requester ID in TLP header, AT field for ATS, PASID TLP prefix — is completely unchanged in Gen 6. The IOMMU interacts with the Transaction Layer, not the Physical Layer. Gen 6’s PAM4 signalling is invisible to the IOMMU.

What changes in Gen 6 DMA and IOMMU practice:

| Aspect | Gen 6 change or new consideration |
|---|---|
| DMA TLP format | Unchanged: same MRd/MWr format, same AT field, same PASID TLP prefix |
| IOMMU table format | Unchanged: VT-d, SMMU, and AMD-Vi page table formats are independent of PCIe generation |
| DMA bandwidth | Gen 6 at 64 GT/s × 16 lanes = 128 GB/s raw per direction (256 GB/s bidirectional). IOTLB capacity and page table walker throughput must scale so that the IOMMU does not become the bottleneck at these rates. |
| ATS importance | At 128 GB/s per direction, translating every TLP inside the IOMMU is hard to sustain whenever the IOTLB misses frequently. Gen 6 AI accelerators doing high-frequency small DMA (inference token streaming) rely on ATS to keep IOMMU latency out of the data path. |
| PASID and multi-tenant AI | Gen 6 AI accelerators (100,000+ concurrent inference sessions per GPU cluster) depend on PASID for per-tenant IOMMU isolation. A 20-bit PASID gives 1,048,576 (about 1M) address spaces, sufficient for large-scale multi-tenant cloud GPU deployments. |
| IDE and DMA security | IDE (Integrity and Data Encryption, Extended Capability ID 0030h) was introduced as an ECN to PCIe 5.0 and is part of the PCIe 6.0 base specification. It provides TLP-level encryption and integrity protection, so DMA TLPs can be encrypted end-to-end: an attacker who taps the PCIe trace cannot read DMA data. Critical for confidential computing, where model weights and activation data stay encrypted in transit over PCIe. |
| CXL.mem DMA | CXL (Compute Express Link; CXL 1.x/2.0 use the PCIe 5.0 PHY and CXL 3.x uses the PCIe 6.0 PHY) adds CXL.mem, a protocol through which the host accesses memory attached to a device. CXL.mem accesses are host loads and stores rather than device DMA, so they do not take the standard PCIe IOMMU path; device-initiated DMA over CXL.io still does. The platform describes CXL-attached memory to the OS through ACPI tables such as SRAT and HMAT. Support for this memory is emerging work in the Linux kernel. |
| Flit mode and IOMMU | Gen 6 flit mode (fixed 256-byte flits) is fully transparent to the IOMMU. The IOMMU operates at the Transaction Layer, so flit boundaries below it are invisible. No IOMMU changes are needed for flit mode. |

At Gen 6, the IOMMU is no longer optional. At 128 GB/s of DMA bandwidth per direction with hundreds of VFs across multiple tenants, an unprotected DMA stream can corrupt large regions of another tenant's memory within milliseconds. Gen 6 deployments (cloud AI, confidential computing, multi-tenant storage) typically require: IOMMU enabled, ATS for performance, PASID for per-process isolation, IDE for encryption, and ACS on all switches. These form the complete Gen 6 DMA security stack.
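
The raw link rate follows directly from the per-lane transfer rate and the lane count. The short calculation below is a back-of-envelope sketch: it ignores flit, header, and FEC overhead, and the 256-byte payload is an assumed example size.

```python
def raw_bandwidth_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s (one bit per transfer per lane)."""
    return gt_per_s * lanes / 8


gen6_x16 = raw_bandwidth_gb_per_s(64, 16)
print(gen6_x16)            # 128.0  GB/s per direction
print(gen6_x16 * 2)        # 256.0  GB/s both directions combined

# Translation rate the IOMMU would need if every TLP carried 256 bytes.
payload_bytes = 256
tlps_per_second = gen6_x16 * 1e9 / payload_bytes
ns_per_tlp = 1e9 / tlps_per_second
print(tlps_per_second)     # 500000000.0
print(ns_per_tlp)          # 2.0
```

A budget of about 2 ns per translation is far below DRAM access latency, which is why IOTLB hit rate and device-side caching through ATS dominate at this speed.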

📋 Quick Reference

| Item | Value / Rule |
|---|---|
| DMA definition | Device-initiated memory transactions (MRd/MWr TLPs) without CPU involvement per transfer |
| Bus Master Enable | Command register bit 2. Must = 1 for device to initiate DMA. Reset default = 0. OS sets to 1 only after IOMMU domain is configured. |
| DMA write TLP | Posted MWr: no completion returned. Device can send next TLP immediately. |
| DMA read TLP | Non-posted MRd: completion with data (CplD) returned per read. Device tracks outstanding reads by Tag. |
| IOVA | I/O Virtual Address: the address the device puts in the TLP header. Translated by IOMMU to physical address. Completely separate from CPU virtual addresses. |
| IOMMU position | Between PCIe Root Complex and DRAM controller. Intercepts every DMA TLP. |
| IOMMU function | 1) Translate IOVA → Physical Address. 2) Check read/write permissions. 3) Block and fault on invalid accesses. |
| IOTLB | IOMMU Translation Lookaside Buffer: caches recent IOVA→PA translations. Must be invalidated when OS unmaps DMA buffers. |
| Root Table | Indexed by PCIe Bus number (256 entries). Each entry → Context Table base. |
| Context Table | Indexed by Device+Function (256 entries per bus). Each entry → Domain ID + IOMMU page table root. |
| IOMMU Domain | A set of devices sharing the same IOMMU page tables. Each VM typically gets one domain. Domain ID identifies which page table to use for a device's DMA. |
| IOMMU page table walk | Up to 4 levels (48-bit IOVA). Each level indexed by 9 IOVA bits → physical address of next-level table. Leaf PTE contains physical page base + R/W permission bits. |
| IOMMU fault | Triggered when IOVA is unmapped, write permission denied, or read permission denied. IOMMU records fault address + Requester ID + reason. Generates interrupt to OS IOMMU driver. |
| ATS | Address Translation Services (PCIe Cap 000Fh). Device sends Translation Request TLP → IOMMU returns physical address. Device caches in ATC. Subsequent DMA uses AT=10b to bypass IOMMU re-translation. |
| IOTLB invalidation | OS must invalidate ATC and IOTLB entries when unmapping DMA buffers. PCIe Invalidation Request sent by IOMMU to device to flush ATC entries. |
| PASID | Process Address Space ID (PCIe Cap 001Bh). 20-bit tag in TLP Prefix. IOMMU selects per-PASID page table for translation. Enables per-process DMA isolation within a device. |
| IOMMU Group | Minimum set of devices that must be assigned together. Determined by ACS capability and PCIe topology. VFIO enforces group-based assignment. |
| ACS requirement | ACS Source Validation + P2P Request Redirect must be enabled on all switches for devices to get individual IOMMU groups and single-device assignment to VMs. |
| x86 Intel | VT-d (Virtualization Technology for Directed I/O). Linux driver: drivers/iommu/intel/. |
| x86 AMD | AMD-Vi. Linux driver: drivers/iommu/amd/. |
| ARM | SMMU v3 (System Memory Management Unit). Linux driver: drivers/iommu/arm/arm-smmu-v3/. |
| Gen 6 changes | DMA/IOMMU mechanism unchanged. ATS critical at 128 GB/s per direction (x16). PASID essential for multi-tenant AI. IDE adds TLP-level DMA encryption. CXL.mem host accesses fall outside the standard PCIe IOMMU path. |
