How cache works — direct-mapped, fully associative, and set-associative mapping — replacement policies, write strategies, and how virtual memory extends the address space beyond physical RAM using page tables, TLBs, and the MMU.
The fundamental problem: a modern CPU can execute billions of instructions per second, but main DRAM takes 50–100 ns per access — hundreds of CPU cycles per memory operation. Without mitigation, the CPU would spend most of its time idle, waiting for memory.
The solution is cache memory — a small, fast SRAM buffer placed between the CPU and main DRAM. Cache holds copies of recently and frequently used memory locations:
Cache works because of locality of reference: programs access the same locations repeatedly (temporal locality) and access nearby locations in sequence (spatial locality). A well-designed cache achieves hit rates of 90–99%.
A cache is organised into cache lines (also called cache blocks). Each line holds a copy of a contiguous chunk of main memory — typically 64 bytes. Each cache line also stores metadata:
| Field | Width | Purpose |
|---|---|---|
| Valid bit | 1 bit | Indicates whether this cache line holds a valid copy (0 = empty/invalid, 1 = valid) |
| Tag | Address bits − index bits − offset bits | Identifies which main memory block this line holds — used to check if a cache hit occurred |
| Dirty bit | 1 bit (write-back only) | Marks a line that has been written but not yet propagated to DRAM — must be written back on eviction |
| Data | 64 bytes (512 bits) | The actual cached copy of the memory block |
A 32-bit address is split into three fields. The Index selects which cache row to inspect. The Tag is compared against the stored tag to confirm a hit. The Block Offset selects the specific byte within the 64-byte cache line.
In a direct-mapped cache, each main memory block maps to exactly one cache line — determined by the block address modulo the number of cache lines:
Cache line = Memory block number MOD (Number of cache lines)
Direct-mapped cache with 4 lines. Each DRAM block maps to exactly one cache line (block mod 4). Blocks 0, 4, 8, 12 all map to Line 0. If the program alternates between Block 4 and Block 8, every access is a miss — called thrashing.
Cache: 64 lines, 64-byte blocks. Address: 32-bit.
Address decomposition: Offset = log₂(64) = 6 bits Index = log₂(64 lines) = 6 bits Tag = 32 − 6 − 6 = 20 bits
Access address 0x00001A40:
Binary: 0000 0000 0000 0000 0001 1010 0100 0000
Offset [5:0] = 00 0000 = 0 (byte 0 of the line)
Index [11:6] = 10 1001 = 41 (look in cache line 41)
Tag [31:12]= 0000 0000 0000 0000 0001 = 0x00001
Hit check: Cache line 41: Valid=1, stored tag=0x00001 → HIT. Return byte 0 of line 41’s data.
Miss scenario: If stored tag ≠ 0x00001, or Valid=0 → MISS. Fetch 64 bytes from DRAM address 0x00001A00 (block aligned), write into line 41, update tag to 0x00001, set Valid=1, serve the request.
In a fully associative cache, any main memory block can be placed in any cache line. There is no index field — the entire address (minus offset) is the tag. On every access, all stored tags are compared simultaneously using CAM hardware.
The set-associative cache is the practical compromise: the cache is divided into sets, and each set contains N ways (N cache lines). A memory block maps to exactly one set (using the index field), but can occupy any of the N ways within that set. This is called an N-way set-associative cache.
2-way set-associative cache with 4 sets. Index selects the set (row). Both ways in that set are checked simultaneously. A block that maps to Set 0 can occupy either Way 0 or Way 1 — eliminating the conflict that would occur in direct-mapped.
| Mapping type | Conflict misses | Hardware cost | Common use |
|---|---|---|---|
| Direct-mapped (1-way) | High | Lowest — one comparator | Simple, low-power L1 in some embedded CPUs |
| 2-way set-associative | Moderate | 2 comparators per set | Common for small L1 caches |
| 4-way set-associative | Low | 4 comparators per set | Typical L1 cache (ARM Cortex-A, x86) |
| 8-way set-associative | Very low | 8 comparators per set | L2 / L3 caches |
| Fully associative | None | Highest — N comparators | TLB, victim cache |
When a cache miss occurs and the selected set is full, a victim line must be evicted. Three main replacement policies:
| Policy | Rule | Advantage | Disadvantage |
|---|---|---|---|
| LRU Least Recently Used | Evict the line that was accessed least recently | Best hit rate in practice — exploits temporal locality optimally | Requires tracking access time of each line (LRU counter bits) |
| FIFO First In, First Out | Evict the line that has been in cache the longest | Simple — just a queue, no per-access tracking | May evict frequently-used old lines (Bélády’s anomaly) |
| Random | Randomly select a victim line | Trivially simple hardware — no state tracking needed | Unpredictable — occasionally evicts the most-needed line |
When the CPU writes to a cached location, when should the write propagate to main DRAM? Two strategies:
Write miss policy: Write-allocate (fetch the block into cache, then write — pairs naturally with write-back) or no-write-allocate (write directly to DRAM without loading cache — pairs with write-through).
System: L1 cache hit time = 4 cycles. L2 cache hit time = 12 cycles. DRAM access = 200 cycles. L1 miss rate = 5%. L2 miss rate (given L1 miss) = 20%.
AMAT calculation (hierarchical):
L2 AMAT = L2 hit time + L2 miss rate × DRAM penalty
= 12 + 0.20 × 200 = 12 + 40 = 52 cycles
Overall AMAT = L1 hit time + L1 miss rate × L2 AMAT
= 4 + 0.05 × 52 = 4 + 2.6 = 6.6 cycles
Without any cache: Every access = 200 cycles. Cache provides a 30× speedup for this workload.
Physical DRAM is finite. Several problems arise without virtual memory:
Virtual memory creates the illusion that every process has its own large, private address space — typically the full 32-bit or 64-bit range. The OS and hardware transparently map virtual addresses to physical RAM locations (or to disk when RAM is full).
The most common virtual memory implementation divides both the virtual address space and physical memory into fixed-size chunks called pages (typically 4 KB). The hardware unit that translates virtual to physical addresses is the MMU (Memory Management Unit).
Virtual-to-physical address translation. The MMU uses the Virtual Page Number to index the page table (maintained by the OS). The page table entry contains the Physical Page Number plus permission bits (Valid, Read, Write, Execute). The 12-bit page offset is concatenated unchanged to form the physical address. If Valid=0, a page fault exception is raised.
A problem: every memory access requires one page table lookup (which itself is a memory access). This doubles the effective memory access time. The solution is the TLB — a small, fully-associative cache of recently used page table entries, stored inside the MMU.
TLB lookup flow. On a TLB hit (the common case, 90%+ of accesses), the PPN is available in ~1 ns with no additional memory access. On a TLB miss, the hardware page table walker reads the page table from memory, loads the entry into the TLB, and retries the access.
| Parameter | Typical TLB | Notes |
|---|---|---|
| Entries | 32–128 | Fully associative — any VPN→PPN pair in any entry |
| Access time | 1–2 ns | Runs in parallel with L1 cache tag lookup (virtually indexed) |
| Hit rate | >99% | Most programs have working sets of a few hundred pages |
| On a miss | Hardware page table walk (x86, ARM: PTWA hardware walker) or software TLB miss handler (MIPS). Walk may require 2–4 memory accesses for multi-level page tables. | |
| TLB flush | On context switch | Each process has its own page table → TLB must be flushed on process switch (or use ASIDs — Address Space IDs — to avoid full flush) |
A physical L1 cache in RTL consists of two SRAM macros (tag array and data array) plus combinational logic. The tag array is addressed by the index field; its output (tag bits + valid bit + dirty bit) is compared against the incoming address tag using XOR gates — one comparator per way. The result drives a hit/miss signal. On a hit, the data array is read using the same index, and the block offset selects bytes. The LRU update logic (pseudo-LRU tree) updates after every access. This entire RTL block — typically 1,000–5,000 lines of SystemVerilog — is one of the most performance-critical pieces of any CPU design and requires thorough verification including miss/hit boundary conditions, dirty eviction, simultaneous read-write conflicts, and coherence protocol integration.
Every ARM Cortex-A CPU core contains an MMU with dedicated instruction and data TLBs (iTLB and dTLB). In a modern SoC, DMA masters (GPU, video codec, display controller, network accelerator) also generate memory accesses that must be address-translated and permission-checked. ARM’s SMMU (System Memory Management Unit) is a standalone MMU placed on the system bus that translates virtual addresses for non-CPU masters. When you integrate a video codec IP into a SoC, you connect its master port through the SMMU, configure SMMU stream mappings, and verify that the codec can only access its allocated memory regions.
In a multicore SoC, each core has its own private L1/L2 cache. If Core 0 modifies a cache line and Core 1 has a copy of the same line in its L1, Core 1’s copy is now stale — a cache coherence problem. The MESI protocol (Modified, Exclusive, Shared, Invalid) is the standard solution: every cache line is tagged with one of four states, and cores communicate via a coherence directory or snooping bus to keep copies consistent. ARM uses the AMBA CHI protocol to implement MESI coherence across all L1/L2 caches and a shared L3 in a DynamIQ cluster. Verifying cache coherence requires directed random testing and often formal methods, and is one of the hardest verification problems in CPU design.