PCIe SR-IOV Explained — Single Root I/O Virtualization

📋 Why SR-IOV Exists

In a virtualised server running 100 virtual machines, each VM needs network access, storage access, and potentially GPU or accelerator access. The naive approach — one physical NIC per VM — is completely impractical: 100 NICs per server would require 100 PCIe slots, 100 drivers, 100 separate configurations, and hundreds of watts of power.

Software virtualisation (the hypervisor emulates a virtual NIC for each VM) works but has a serious overhead: every network packet sent by a VM must be intercepted by the hypervisor, translated, and forwarded to the physical NIC — adding latency and consuming CPU cycles that should be running VM workloads. At 100 GbE speeds, software emulation becomes a bottleneck.

SR-IOV solves this by moving the virtualisation into the hardware of the PCIe device itself. A single SR-IOV-capable NIC (or storage controller, or GPU) can present dozens or even hundreds of independent virtual devices to the system, each called a Virtual Function (VF). The actual limit is whatever the device reports in its TotalVFs field. Each VM is assigned one VF directly, bypassing the hypervisor for data-plane operations. The hypervisor retains control over resource allocation and policy through the Physical Function (PF). The result: near-native I/O performance in virtualised environments with a single physical PCIe card.

📋 Physical Functions and Virtual Functions

Figure 1 — Physical Function vs Virtual Function. The PF is the real PCIe device with full configuration space and the SR-IOV Extended Capability. The PF driver (running in the hypervisor or privileged domain) creates and destroys VFs. Each VF appears as an independent PCIe endpoint with its own BDF, but its configuration space is minimal: a reduced Type 0 header and a small set of mandatory capabilities. The BAR fields in a VF’s Type 0 header are not implemented; VF MMIO is described by the VF BAR registers in the PF’s SR-IOV capability. A VF’s identity likewise comes from the PF’s Vendor ID and the VF Device ID field of that capability.

📋 SR-IOV Topology in the PCIe Fabric

From the PCIe fabric’s perspective, SR-IOV VFs are ordinary PCIe functions. They have BDF addresses, they send and receive TLPs, and they respond to memory accesses. The fabric routes TLPs to them exactly as to any other function. The only special handling is inside the physical device, which decodes each VF’s MMIO region and delivers data to the appropriate internal hardware queue.

Figure 2 — SR-IOV system layout. One physical NIC presents as a PF (managed by the hypervisor’s PF driver) and many VFs. Each VM is assigned one VF and its VF driver communicates directly with the device hardware queue via MMIO — no hypervisor involvement in the data path. The PCIe fabric routes TLPs to both the PF and all VFs by their BDF addresses. All VFs share the same physical PCIe link as the PF.

📋 ARI — Alternative Routing-ID Interpretation

Standard PCIe addressing allows at most 8 functions per device (3-bit Function field in BDF). For a single SR-IOV device to present many VFs, 8 functions is not enough. ARI (Alternative Routing-ID Interpretation) solves this by repurposing the 5-bit Device Number field in the BDF to extend the Function Number to 8 bits — giving 256 functions per bus.

Figure 3 — ARI extends the Function Number from 3 bits to 8 bits by fixing Device = 0 and merging the 5 Device bits into the Function field. This gives 256 functions per bus. ARI must be enabled in both the port directly above the device (ARI Forwarding Enable in Device Control 2) and in the PF’s SR-IOV Control register (ARI Capable Hierarchy). Without ARI, only Function numbers 0–7 are usable on each bus, and the PF itself occupies at least one of them.
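The two interpretations of the same 16-bit Routing ID can be compared with a short sketch. The function name `decode_routing_id` and the example value are illustrative assumptions, not part of any library.

```python
def decode_routing_id(rid: int, ari: bool = False) -> dict:
    """Split a 16-bit Routing ID into bus, device and function fields."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("Routing ID must fit in 16 bits")
    bus = (rid >> 8) & 0xFF
    if ari:
        # ARI: Device is implicitly 0 and the whole low byte is the Function Number.
        return {"bus": bus, "device": 0, "function": rid & 0xFF}
    # Conventional: 5-bit Device Number, 3-bit Function Number.
    return {"bus": bus, "device": (rid >> 3) & 0x1F, "function": rid & 0x7}


rid = 0x0342  # hypothetical Routing ID: bus 0x03, low byte 0x42
print(decode_routing_id(rid))            # conventional view
print(decode_routing_id(rid, ari=True))  # ARI view
```

The same bits read as device 8, function 2 in the conventional view and as function 66 of device 0 under ARI.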

📋 SR-IOV Extended Capability Structure

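The SR-IOV Extended Capability (Cap ID 0010h) lives in the PF’s extended configuration space (offset 100h and above). VFs do not have this structure; only the PF does. The structure contains the complete control and status registers for creating and managing VFs. Its position is found by walking the extended capability list, so the absolute offsets in the tables below (104h, 108h, and so on) assume the capability sits at 100h. On a real device, add each register’s relative offset (04h, 08h, and so on) to the discovered base.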

Figure 4 — SR-IOV Extended Capability register map. Only the PF has this structure. The three most critical fields are: TotalVFs (hardware max — read-only), NumVFs (currently active — software writes this to create/destroy VFs), and First VF Offset + VF Stride (the formula for computing each VF’s BDF from the PF’s BDF). VF BARs define the MMIO aperture for each VF — each VF maps to an equal slice of the total VF BAR space.
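Locating the capability means following the extended capability chain, where each 32-bit header carries the Capability ID in bits [15:0] and the next pointer in bits [31:20]. The following is a minimal sketch over a raw 4 KB configuration-space image; `find_ext_capability` and the synthetic image are assumptions made for illustration.

```python
import struct

SRIOV_CAP_ID = 0x0010


def find_ext_capability(cfg: bytes, cap_id: int):
    """Return the offset of an extended capability, or None if it is absent."""
    offset = 0x100
    visited = set()  # guards against a malformed, looping chain
    while offset >= 0x100 and offset + 4 <= len(cfg) and offset not in visited:
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # no extended capabilities, or the device did not respond
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next pointer, dword aligned
    return None


# Synthetic image: capability 0001h at 100h links to SR-IOV at 160h.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, 0x0001 | (1 << 16) | (0x160 << 20))
struct.pack_into("<I", cfg, 0x160, SRIOV_CAP_ID | (1 << 16))
print(hex(find_ext_capability(cfg, SRIOV_CAP_ID)))
```

On Linux, a comparable image can usually be read from the device's `config` file under `/sys/bus/pci/devices/`, although reading beyond the first 64 bytes normally requires root.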

📋 SR-IOV Capability Registers

Register / Field	Offset	Access	Description
VF Migration Capable	104h bit 0	RO	When 1: device supports live migration of VFs between physical hosts. Requires VF Migration State Array in PF MMIO space.
VF Migration Interrupt Msg#	104h [31:21]	RO	MSI/MSI-X vector number for VF migration state change notifications. Allows hypervisor to receive interrupt when a VF’s migration state changes.
TotalVFs	10Ch [31:16]	RO	Maximum number of VFs the device can expose simultaneously. Hardware limit. Software may not write NumVFs greater than TotalVFs. Typical values: 16, 32, 64, 128, 256 depending on device.
InitialVFs	10Ch [15:0]	RO	Number of VFs initially associated with the PF. On a device that does not use VF migration, this equals TotalVFs. It does not mean that VFs exist before VF Enable is set; no VF is visible until software enables it.
NumVFs	110h [15:0]	RW	The number of VFs software wants active. Software writes this before setting VF Enable and may change it only while VF Enable is 0. After VF Enable, VFs with numbers 1 through NumVFs are present and visible to the OS.
First VF Offset	114h [15:0]	RO	The routing ID offset from the PF to VF 1. Used in the VF BDF calculation formula.
VF Stride	114h [31:16]	RO	The routing ID increment between successive VFs. VF n+1 has BDF = VF n BDF + VF Stride.
VF Device ID	118h [31:16]	RO	The Device ID that all VFs present. The Vendor ID is the same as the PF’s Vendor ID. Drivers match VFs using this Device ID.
Supported Page Sizes	11Ch	RO	Bitmask of memory page sizes the device can use for VF BAR base address alignment. Bit 0 = 4 KB, bit 1 = 8 KB, bit 2 = 16 KB, etc.
System Page Size	120h	RW	Software writes the page size it will use for VF BAR alignment. Must be one of the supported page sizes. Determines alignment granularity of VF BAR segments.
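The table can be turned into a small decoder. This is a sketch under stated assumptions: offsets are relative to the capability base, NumVFs is read from the low 16 bits of the dword at 10h (the layout in the PCI-SIG specification as understood here), and the sample register values are invented for the demonstration.

```python
import struct


def page_sizes(mask: int) -> list:
    """Expand a page-size bitmask: bit n set means 2**(n + 12) bytes."""
    return [4096 << bit for bit in range(32) if mask & (1 << bit)]


def parse_sriov_cap(cfg: bytes, base: int) -> dict:
    """Decode selected SR-IOV capability fields; base is the capability offset."""
    initial_vfs, total_vfs = struct.unpack_from("<HH", cfg, base + 0x0C)
    (num_vfs,) = struct.unpack_from("<H", cfg, base + 0x10)
    first_vf_offset, vf_stride = struct.unpack_from("<HH", cfg, base + 0x14)
    (vf_device_id,) = struct.unpack_from("<H", cfg, base + 0x1A)
    supported, system = struct.unpack_from("<II", cfg, base + 0x1C)
    return {
        "initial_vfs": initial_vfs,
        "total_vfs": total_vfs,
        "num_vfs": num_vfs,
        "first_vf_offset": first_vf_offset,
        "vf_stride": vf_stride,
        "vf_device_id": vf_device_id,
        "supported_page_sizes": page_sizes(supported),
        "system_page_size": page_sizes(system),
    }


# Hypothetical register contents, chosen only to exercise the decoder.
cfg = bytearray(4096)
base = 0x160
struct.pack_into("<HH", cfg, base + 0x0C, 64, 64)      # InitialVFs, TotalVFs
struct.pack_into("<HH", cfg, base + 0x14, 0x80, 2)     # First VF Offset, VF Stride
struct.pack_into("<H", cfg, base + 0x1A, 0x10ED)       # VF Device ID
struct.pack_into("<II", cfg, base + 0x1C, 0x553, 0x1)  # Supported / System Page Size
info = parse_sriov_cap(cfg, base)
print(info["total_vfs"], hex(info["vf_device_id"]), info["supported_page_sizes"])
```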

📋 SR-IOV Control and Status Registers

SR-IOV Control Bit	Offset	Access	Function
VF Enable	108h bit 0	RW	The master VF enable bit. When set to 1, the VFs numbered 1 through NumVFs become visible as PCIe functions. Setting to 0 disables all VFs. Must write NumVFs before setting this bit. After setting, software must wait before attempting to access VF configuration space — device needs time to initialise VF hardware.
VF Migration Enable	108h bit 1	RW	Enables live VF migration. Only settable if VF Migration Capable = 1.
VF Migration Interrupt Enable	108h bit 2	RW	Enables generation of MSI/MSI-X interrupt when VF migration state changes.
VF MSE (Memory Space Enable)	108h bit 3	RW	When 1: VFs respond to memory TLPs targeting their BAR address ranges. Equivalent to the Memory Space Enable bit in the Command register, but applies to all VFs collectively. Must be set in addition to VF Enable for VFs to respond to MMIO accesses; the two bits can be written together.
ARI Capable Hierarchy	108h bit 4	RW	Indicates that ARI is being used for VF function numbering. Must be set before VF Enable if ARI is required (which is almost always). Both the switch port above and this bit must be configured for ARI to work.

📋 VF BDF — How VFs are Addressed

Each VF has a unique BDF (Routing ID) computed from the PF’s BDF using the First VF Offset and VF Stride fields. This lets the hardware designer place VFs at any function number positions and with any spacing, as long as the pattern follows: VF n Routing ID = PF Routing ID + First VF Offset + (n − 1) × VF Stride, with n counting from 1.

Figure 5 — VF Routing ID formula. The PF’s Routing ID plus the First VF Offset gives VF1’s Routing ID. Each subsequent VF adds VF Stride to the previous VF’s Routing ID. Without ARI, Function numbers 0–7 are the only valid values — any VF whose Routing ID produces a Function number ≥ 8 requires ARI to be enabled. Most SR-IOV deployments with more than a handful of VFs use ARI.

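Routing ID arithmetic is done on 16-bit integers, so a carry out of the Function number propagates upward. With ARI, a result past Fn 255 carries into the Bus Number, which places the VF on a higher bus than the PF; that bus must fall within the range the port above the device forwards. Without ARI, a result past Fn 7 carries into the Device Number field, which a port does not normally forward to an endpoint below it, so such a VF is unreachable. Software must verify that all VF BDFs fall within valid ranges before enabling VFs.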
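The formula and the bus rollover can be worked through numerically. This is a sketch with hypothetical values (PF at 03:00.0, First VF Offset 80h, VF Stride 2); the helper names are illustrative.

```python
def vf_routing_id(pf_rid: int, first_vf_offset: int, vf_stride: int, n: int) -> int:
    """Routing ID of VF n (n counts from 1), using 16-bit arithmetic."""
    if n < 1:
        raise ValueError("VF numbers start at 1")
    return (pf_rid + first_vf_offset + (n - 1) * vf_stride) & 0xFFFF


def format_ari(rid: int) -> str:
    """Render a Routing ID as bus plus 8-bit ARI function number."""
    return f"bus {rid >> 8:02x}, function {rid & 0xFF}"


pf_rid = 0x0300  # hypothetical PF at bus 03, device 0, function 0
for n in (1, 2, 64, 65):
    rid = vf_routing_id(pf_rid, 0x80, 2, n)
    print(f"VF{n}: {rid:#06x} ({format_ari(rid)})")
```

With these values VF64 is the last VF on bus 03 and VF65 lands on bus 04, which is the case software has to anticipate when reserving bus numbers.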

📋 VF BARs — Separate MMIO per VF

VFs expose MMIO space to their assigned VMs, but the BAR fields in their own Type 0 configuration header are not used. Instead, the PF’s SR-IOV capability has six VF BAR registers (VF BAR0 through VF BAR5). Each one reports the size of that BAR for a single VF and holds the base address of a contiguous aperture covering all VFs. The aperture is divided into equal per-VF slices, each aligned to at least the System Page Size.

Figure 6 — VF BAR layout. The PF has its own separate BAR address space for management registers. The VF BAR covers the MMIO space for all VFs combined — sized by VF BAR register × NumVFs. Each VF gets an equal slice of this space. Software programs VF BAR0 in the SR-IOV capability (same sizing procedure as a regular BAR — write 0xFFFFFFFF, read back), then programs the base address. Each VF’s MMIO base = VF_BAR_base + (VF_number – 1) × per_VF_size.

The VF BAR sizing procedure in the PF’s SR-IOV capability is identical to regular BAR sizing: write 0xFFFFFFFF, read back, mask off the low attribute bits, and find the lowest 1-bit to determine the per-VF size. The total aperture that must be allocated is per_VF_size × NumVFs, and its base must be aligned to per_VF_size. Software allocates this total region and writes the base address to the VF BAR register in the SR-IOV Extended Capability. The hypervisor then maps each VF’s slice of the aperture into the guest-physical address space of the VM that owns it; the IOMMU separately handles the VF’s DMA.
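The sizing and slicing arithmetic can be sketched as follows. It covers only a 32-bit memory BAR (a 64-bit BAR would combine two registers), and the readback value and base address are hypothetical.

```python
def bar_size_from_readback(readback: int) -> int:
    """Size of a 32-bit memory BAR from the value read after writing 0xFFFFFFFF."""
    mask = readback & 0xFFFFFFF0  # drop the four low attribute bits
    if mask == 0:
        return 0                  # BAR not implemented
    return mask & -mask           # lowest set bit is the size


def vf_mmio_base(vf_bar_base: int, per_vf_size: int, n: int) -> int:
    """MMIO base of VF n (n counts from 1) inside the shared aperture."""
    if n < 1:
        raise ValueError("VF numbers start at 1")
    return vf_bar_base + (n - 1) * per_vf_size


per_vf = bar_size_from_readback(0xFFFFC000)  # hypothetical readback value
num_vfs = 8
print(f"per-VF size {per_vf:#x}, total aperture {per_vf * num_vfs:#x}")
print(f"VF3 base {vf_mmio_base(0xF0000000, per_vf, 3):#x}")
```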

▶ VF Enable Sequence

Discover SR-IOV capability: walk the extended capability list from offset 100h. Find Cap ID = 0010h.
Read TotalVFs to learn the hardware maximum. Decide how many VFs to create (≤ TotalVFs).
Enable ARI in the port directly above the device (a Root Port or switch Downstream Port): set the ARI Forwarding Enable bit in that port’s PCIe Capability Device Control 2 register. Also set ARI Capable Hierarchy in SR-IOV Control [bit 4].
Write NumVFs (offset 110h [15:0] in the SR-IOV capability) with the desired VF count.
Set System Page Size to the page size used by the host and the IOMMU. Do this before sizing the VF BARs, because it determines their alignment.
Size and allocate VF BARs: write 0xFFFFFFFF to each VF BAR register, read back, compute per-VF size, allocate total aperture (per-VF-size × NumVFs) from MMIO address space, write base address to VF BAR register.
Set VF MSE (SR-IOV Control bit 3 = 1) to enable VF memory space decode.
Set VF Enable (SR-IOV Control bit 0 = 1). Device creates all VFs internally. A minimum 100ms delay must be observed before accessing VF configuration space.
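Enumerate VFs: use the First VF Offset + VF Stride formula to compute each VF’s BDF, then probe configuration space at that BDF to confirm the VF responds. The Vendor ID and Device ID fields in a VF’s own header read FFFFh by design, so software takes the Vendor ID from the PF and the Device ID from the VF Device ID register.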
Configure IOMMU: create IOMMU page table entries mapping each VM’s memory to the appropriate VF’s DMA address space. Set ACS on the switch port above the PF.
Assign VFs to VMs: pass each VF’s BDF to the appropriate VM (via VFIO on Linux, PCI Passthrough on Xen/Windows). The VM’s VF driver loads and configures the VF for direct operation.

Note: wait at least 100 ms after setting VF Enable before touching VF configuration space. The device needs this time to initialise all VF hardware instances, and the SR-IOV specification requires software to observe the delay. Config reads issued too early may return stale or undefined data.
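The register-level core of the sequence above can be sketched against a simulated configuration space. `FakeConfigSpace` and `enable_vfs` are hypothetical stand-ins for real config accessors; ARI setup, VF BAR allocation and IOMMU programming are deliberately left out, and register offsets are relative to the capability base.

```python
import struct
import time

SRIOV_CTRL, SRIOV_TOTAL_VFS, SRIOV_NUM_VFS = 0x08, 0x0E, 0x10
SRIOV_VF_OFFSET, SRIOV_VF_STRIDE = 0x14, 0x16
CTRL_VFE, CTRL_MSE = 1 << 0, 1 << 3


class FakeConfigSpace:
    """In-memory stand-in for a PF's 4 KB configuration space."""

    def __init__(self):
        self.data = bytearray(4096)

    def read16(self, off):
        return struct.unpack_from("<H", self.data, off)[0]

    def write16(self, off, val):
        struct.pack_into("<H", self.data, off, val & 0xFFFF)


def enable_vfs(cfg, cap, pf_rid, num_vfs, sleep=time.sleep):
    """Write NumVFs, set VF Enable and VF MSE, wait, then list VF Routing IDs."""
    total = cfg.read16(cap + SRIOV_TOTAL_VFS)
    if not 1 <= num_vfs <= total:
        raise ValueError(f"NumVFs must be between 1 and TotalVFs ({total})")
    ctrl = cfg.read16(cap + SRIOV_CTRL)
    if ctrl & CTRL_VFE:
        raise RuntimeError("clear VF Enable before changing NumVFs")
    cfg.write16(cap + SRIOV_NUM_VFS, num_vfs)
    cfg.write16(cap + SRIOV_CTRL, ctrl | CTRL_VFE | CTRL_MSE)
    sleep(0.1)  # the 100 ms settle time before any VF config access
    # Offset and stride are read after NumVFs is written, since they may depend on it.
    offset = cfg.read16(cap + SRIOV_VF_OFFSET)
    stride = cfg.read16(cap + SRIOV_VF_STRIDE)
    return [(pf_rid + offset + i * stride) & 0xFFFF for i in range(num_vfs)]


pf = FakeConfigSpace()
cap = 0x160  # hypothetical capability base
pf.write16(cap + SRIOV_TOTAL_VFS, 8)
pf.write16(cap + SRIOV_VF_OFFSET, 0x80)
pf.write16(cap + SRIOV_VF_STRIDE, 2)
print([hex(rid) for rid in enable_vfs(pf, cap, 0x0300, 4)])
```

On a Linux host the PF driver normally performs these steps itself when a VF count is written to the device's `sriov_numvfs` sysfs attribute, so direct register access like this is mainly of interest to firmware and driver authors.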

📋 IOMMU Isolation — Why ACS is Critical

The most critical security requirement for SR-IOV deployments is preventing VMs from accessing each other’s memory. When a VM performs DMA via its assigned VF, the DMA request carries the VF’s Requester ID. The IOMMU uses this Requester ID to look up which address space translation to apply — only addresses in the VM’s allocated IOMMU pages are accessible.

Without ACS (Access Control Services), a PCIe switch between the SR-IOV device and the Root Complex could route peer-to-peer requests between two functions below it directly, so the request never reaches the IOMMU. A compromised VM could then use its VF to read or write another function’s MMIO space, and through it another VM’s data, with no translation or permission check applied.

Figure 7 — ACS and SR-IOV isolation. Without ACS on the switch port, a malicious VF can send DMA requests targeting another VF’s address range and the switch forwards them directly, bypassing the IOMMU. With ACS P2P Request Redirect enabled, all peer-to-peer TLPs are redirected upstream through the Root Complex and IOMMU, where isolation is enforced. Linux reflects this in its IOMMU groups: functions that ACS does not isolate from each other are placed in the same group, and VFIO (the kernel framework for safe device passthrough) only allows a group to be assigned as a whole.

The required ACS features for SR-IOV are: ACS Source Validation (rejects spoofed Requester IDs), ACS Translation Blocking (prevents AT ≠ 00b TLPs from bypassing IOMMU), and ACS P2P Request Redirect (forces all peer-to-peer DMA upstream). All three must be enabled on every switch port in the path from the VF to the Root Complex.
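A check for the three features named above can be sketched by testing bits in a port's ACS Control register value. The bit positions follow the ACS capability layout as understood here (bit 0 Source Validation, bit 1 Translation Blocking, bit 2 P2P Request Redirect); `missing_acs` and the sample values are illustrative assumptions.

```python
from enum import IntFlag


class AcsCtrl(IntFlag):
    SOURCE_VALIDATION = 1 << 0
    TRANSLATION_BLOCKING = 1 << 1
    P2P_REQUEST_REDIRECT = 1 << 2
    P2P_COMPLETION_REDIRECT = 1 << 3
    UPSTREAM_FORWARDING = 1 << 4


# The three features the text lists as required for SR-IOV isolation.
REQUIRED = (
    AcsCtrl.SOURCE_VALIDATION,
    AcsCtrl.TRANSLATION_BLOCKING,
    AcsCtrl.P2P_REQUEST_REDIRECT,
)


def missing_acs(ctrl_value: int) -> list:
    """Names of required ACS features not enabled in a control register value."""
    return [flag.name for flag in REQUIRED if not ctrl_value & flag]


print(missing_acs(0x1D))  # hypothetical port with Translation Blocking off
print(missing_acs(0x07))  # hypothetical port with all three enabled
```

In a real audit this check would be repeated for every port between the VF and the Root Complex, since one permissive port is enough to break isolation.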

📋 VF Migration

VF Migration allows a running virtual machine (with its VF assigned) to be migrated from one physical server to another without stopping the VM. This is much harder than software-only VM migration because the VF’s hardware state (DMA queues, receive buffers, hardware ring buffer pointers) must be saved and restored on the destination server.

SR-IOV defines a VF Migration State Array in the PF’s MMIO space, located through the VF Migration State Array Offset register in the SR-IOV capability, which names a PF BAR and an offset within it. Each VF has a migration state entry that the hypervisor monitors. During migration, the hypervisor quiesces the VF (stops new DMA), checkpoints the hardware state, transfers it to the destination host, and restores it there. The VF Migration Enable bit in SR-IOV Control turns the feature on, and the separate VF Migration Interrupt Enable bit enables the interrupt that signals a migration state change.

In practice, VF migration is rarely implemented in hardware — the complexity of saving all device-specific hardware state makes it device-class specific. Most SR-IOV deployments use “cold migration” (VM suspended, VF released, VM re-started on destination server with a new VF) rather than live migration.

📋 PF Driver vs VF Driver

Property	PF Driver (hypervisor)	VF Driver (VM)
Runs in	Privileged domain (hypervisor, Dom0, host kernel)	Guest VM (unprivileged)
Controls	VF creation/destruction (NumVFs, VF Enable), hardware resource allocation, queue configuration	Only its own VF’s data plane — Tx/Rx queues, interrupts
Sees	Full PF configuration space including SR-IOV Extended Capability, all VF status	Only the VF’s own configuration space — Vendor ID, Device ID, VF BARs, minimal capabilities
DMA address space	Uses the PF’s own Requester ID, translated through the host’s IOMMU domain	Only the IOMMU pages assigned to this VF’s VM, with hardware-enforced isolation
MSI-X vectors	PF’s own MSI-X table for management events	VF’s own separate MSI-X table — independent vectors targeting VM’s vCPU
Example (NIC)	Intel PF driver: manages switchdev, traffic shaping, VF MAC assignment, VLAN configuration	Intel VF driver (iavf): sends/receives packets on its allocated hardware queue
Example (GPU)	NVIDIA MIG: partitions GPU compute resources per VF	CUDA driver in VM: uses its partition without seeing other partitions

📋 Real-World Use Cases

Device	SR-IOV usage	VF count	Per-VM benefit
100 GbE NIC (Mellanox ConnectX, Intel E810)	Each VM gets a dedicated hardware Tx/Rx queue pair — direct DMA to VM memory without hypervisor	64–128 VFs	Near wire-speed networking, sub-microsecond latency in HFT/HPC
NVMe SSD with SR-IOV (Samsung PM9A3)	Each VM gets its own NVMe queue pair — isolated namespace access, hardware-enforced I/O isolation	8–64 VFs	Dedicated I/O bandwidth, no shared queue contention
GPU (NVIDIA A100 MIG, AMD MI300X)	GPU is partitioned into slices (Multi-Instance GPU) — each slice presented as a VF	2–7 MIG instances	GPU compute isolation — each VM gets a fraction of the silicon
SmartNIC/DPU (NVIDIA BlueField)	Host NIC ports virtualised as VFs for tenants; DPU runs control plane	64+ VFs	Zero-touch network offload — encryption, flow steering in hardware
SR-IOV in Kubernetes	Each pod gets a VF allocated by the SR-IOV Network Device Plugin	1 VF per pod	Container networking at NIC line rate — replaces software overlay

⚡ SR-IOV in Gen 6

The SR-IOV Extended Capability structure (Cap ID 0010h, all register offsets, the VF Offset + Stride formula, the VF BAR sizing procedure, ARI, and the enable sequence) is completely unchanged in Gen 6. SR-IOV is a configuration-space and function-level feature that sits above the Transaction Layer. The main Gen 6 changes (PAM4 signalling, FEC, and Flit Mode) affect how TLPs are carried on the link, not how functions are enumerated or virtualised.

What changes in Gen 6 SR-IOV deployments:

Aspect	Gen 6 impact
SR-IOV register layout	Unchanged — same Extended Capability, same VF BDF formula, same enable sequence
VF count limits	Unchanged — TotalVFs max is 65535, ARI supports 256 functions per bus. The limiting factor is now the device hardware (queue count) not the PCIe spec.
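Bandwidth per VF	Gen 6 at 64 GT/s × 16 lanes = 1024 Gb/s, or about 128 GB/s raw in each direction (about 256 GB/s counting both directions). With 64 VFs sharing the link evenly, each VF averages about 2 GB/s per direction, comparable to a dedicated Gen 4 x1 link. A single VF can burst well above that average when the other VFs are idle.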
AI accelerator SR-IOV	Gen 6 AI accelerators (next-generation H200 successors, AMD MI400 class) use SR-IOV or MIG-style partitioning to share GPU silicon across tenants. At Gen 6 speeds, VF DMA can sustain the compute density required for LLM inference without PCIe bottleneck.
PASID + SR-IOV	Gen 6 workloads increasingly combine SR-IOV with PASID (PCIe-11, PCIe-22) — each compute kernel within a VF can have its own IOMMU address space via PASID, enabling per-process isolation within a VM that itself has a VF assigned.
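ACS and IDE	IDE (Integrity and Data Encryption, Cap ID 0030h) was introduced as an ECN to PCIe 5.0 and is part of the 6.0 base specification. It encrypts and integrity-protects TLPs, and can be applied per VF to protect VF DMA traffic from physical eavesdropping on the PCIe traces or cables. Critical for multi-tenant cloud confidential computing.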
VF MSI-X vectors	2048 MSI-X vectors per VF (PCIe spec max) may be reached by AI accelerator VFs with many compute queues. Gen 6 AI workloads may push this limit.
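Raw link bandwidth is easy to miscount between bits and bytes, and between one direction and both, so the sketch below works the per-direction figure through explicitly. It assumes one bit per transfer per lane and ignores Flit, FEC and protocol overhead, so real throughput is lower; the function name is illustrative.

```python
def raw_link_gbytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw per-direction bandwidth in GB/s, before any protocol overhead."""
    return gt_per_s * lanes / 8  # 8 bits per byte


gen6_x16 = raw_link_gbytes_per_s(64, 16)  # Gen 6, 16 lanes
per_vf = gen6_x16 / 64                    # even share among 64 VFs
gen4_x1 = raw_link_gbytes_per_s(16, 1)    # Gen 4, single lane, for comparison

print(f"Gen 6 x16: {gen6_x16:.0f} GB/s per direction")
print(f"Per VF (64 VFs): {per_vf:.0f} GB/s, Gen 4 x1: {gen4_x1:.0f} GB/s")
```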

SR-IOV is fundamental to Gen 6 cloud AI infrastructure. At 64 GT/s, a single PCIe 6.0 x16 slot gives an AI accelerator roughly 128 GB/s of raw link bandwidth in each direction. SR-IOV partitions this accelerator safely among multiple tenants without hypervisor mediation of the data path, enabling multi-tenant GPU/AI-as-a-Service at cloud scale. The SR-IOV mechanism defined in 2007 scales to Gen 6 hardware without protocol changes.

📋 Quick Reference

Item	Value / Rule
SR-IOV Extended Capability ID	0010h — in PF extended config space (100h+), accessible via ECAM
Physical Function (PF)	Standard PCIe endpoint. Contains SR-IOV capability. Managed by hypervisor. One per SR-IOV capable function.
Virtual Function (VF)	Lightweight endpoint. Own BDF address. Own VF BARs. No SR-IOV capability of its own. Assigned to VM.
TotalVFs	Hardware maximum VF count. Read-only. Software may not set NumVFs > TotalVFs.
NumVFs	Write before VF Enable. Sets how many VFs are created. Can only be changed when VF Enable = 0.
VF Enable (bit 0)	Write 1 to create VFs. Wait ≥100ms before VF config space access. Write 0 to destroy all VFs.
VF MSE (bit 3)	Must be set for VFs to respond to memory TLPs. Independent of Command register MSE in PF.
ARI Capable Hierarchy (bit 4)	Set before VF Enable when using ARI. Also requires ARI Forwarding Enable in upstream switch port Device Control 2.
First VF Offset	Routing ID offset from PF to VF1. Read-only, set by hardware.
VF Stride	Routing ID increment between consecutive VFs. Read-only, set by hardware.
VF BDF formula	VF_n = PF_Routing_ID + First_VF_Offset + (n−1) × VF_Stride. n counts from 1.
ARI	Extends Function number to 8 bits (256 functions per bus) by setting Device = 0. Required whenever a VF needs a Function number above 7.
VF BAR sizing	Same write-0xFFFFFFFF / read-back procedure as regular BARs. VF BAR total size = per-VF-size × NumVFs.
VF MMIO address formula	VF_n base = VF_BAR_base + (n−1) × per_VF_size
VF Device ID	All VFs have the same VF Device ID (from SR-IOV Cap register). Vendor ID = PF’s Vendor ID.
ACS requirement	ACS Source Validation + Translation Blocking + P2P Request Redirect must be enabled on all switch ports between SR-IOV device and Root Complex. Linux reflects ACS isolation in its IOMMU groups, which VFIO assigns as a whole.
IOMMU isolation	Each VF’s DMA is isolated by IOMMU per-VF address space. PASID extends this to per-process within the VM.
VF config space	Minimal — Vendor ID, VF Device ID, Subsystem IDs, a few mandatory capabilities. No BARs in Type 0 header (BARs are in PF’s SR-IOV Extended Capability).
Gen 6 changes	SR-IOV mechanism unchanged. Per-VF bandwidth dramatically higher at 64 GT/s. IDE adds per-VF TLP encryption. PASID + SR-IOV for per-process IOMMU isolation in AI workloads.