CA-07: Cache Memory & Virtual Memory — VLSI Trainers
VLSI Trainers CA Series · 7 / 12
Computer Architecture · Article 7 of 12

CA-07: Cache Memory & Virtual Memory

How cache works — direct-mapped, fully associative, and set-associative mapping — replacement policies, write strategies, and how virtual memory extends the address space beyond physical RAM using page tables, TLBs, and the MMU.

The Cache Concept

The fundamental problem: a modern CPU can execute billions of instructions per second, but main DRAM takes 50–100 ns per access — hundreds of CPU cycles per memory operation. Without mitigation, the CPU would spend most of its time idle, waiting for memory.

The solution is cache memory — a small, fast SRAM buffer placed between the CPU and main DRAM. Cache holds copies of recently and frequently used memory locations:

Cache works because of locality of reference: programs access the same locations repeatedly (temporal locality) and access nearby locations in sequence (spatial locality). A well-designed cache achieves hit rates of 90–99%.

Average memory access time (AMAT): AMAT = Hit time + Miss rate × Miss penalty. Example: L1 hit time = 4 cycles, miss rate = 5%, DRAM miss penalty = 200 cycles. AMAT = 4 + 0.05 × 200 = 4 + 10 = 14 cycles — far better than 200 cycles every time.

🗂️Cache Structure — Tags, Lines, and Sets

A cache is organised into cache lines (also called cache blocks). Each line holds a copy of a contiguous chunk of main memory — typically 64 bytes. Each cache line also stores metadata:

FieldWidthPurpose
Valid bit1 bitIndicates whether this cache line holds a valid copy (0 = empty/invalid, 1 = valid)
TagAddress bits − index bits − offset bitsIdentifies which main memory block this line holds — used to check if a cache hit occurred
Dirty bit1 bit (write-back only)Marks a line that has been written but not yet propagated to DRAM — must be written back on eviction
Data64 bytes (512 bits)The actual cached copy of the memory block

Address decomposition

Figure 1 — Memory address decomposed into Tag | Index | Block Offset
Bit 31 Bit 0 TAG (identifies which DRAM block — checked against stored tag) INDEX (selects cache set / row) BLOCK OFFSET (byte within the 64-byte line) bits [31:12] — 20 tag bits bits [11:6] — 6 index bits bits [5:0] — 6 offset bits Example: 32-bit address · 64 sets · 64-byte lines (6 offset bits = log₂64, 6 index bits = log₂64) Steps: (1) use INDEX to locate set → (2) compare TAG against stored tags → (3) on hit, use OFFSET to select byte vlsitrainers.com

A 32-bit address is split into three fields. The Index selects which cache row to inspect. The Tag is compared against the stored tag to confirm a hit. The Block Offset selects the specific byte within the 64-byte cache line.

1️⃣Direct-Mapped Cache

In a direct-mapped cache, each main memory block maps to exactly one cache line — determined by the block address modulo the number of cache lines:

Cache line = Memory block number MOD (Number of cache lines)

Figure 2 — Direct-mapped cache: each DRAM block maps to exactly one cache line
Main Memory (DRAM) 16 blocks shown (0–15) Block 0 Block 1 Block 2 Block 3 Block 4 ← maps to line 0 Block 5 → line 1 Block 6 → line 2 Block 7 → line 3 Block 8 ← also maps to line 0! Cache (4 lines) Valid | Tag | Data (64 bytes) V TAG Line 0 — DATA 0 Line 1 — DATA 1 Line 2 — DATA 2 Line 3 — DATA 3 ⚠ Conflict miss: Blocks 0, 4, 8, 12 all compete for Line 0 Accessing Block 4 evicts Block 0; accessing Block 8 evicts Block 4 vlsitrainers.com

Direct-mapped cache with 4 lines. Each DRAM block maps to exactly one cache line (block mod 4). Blocks 0, 4, 8, 12 all map to Line 0. If the program alternates between Block 4 and Block 8, every access is a miss — called thrashing.

🔍 Worked Example — Direct-Mapped Cache Lookup

Cache: 64 lines, 64-byte blocks. Address: 32-bit.

Address decomposition: Offset = log₂(64) = 6 bits   Index = log₂(64 lines) = 6 bits   Tag = 32 − 6 − 6 = 20 bits

Access address 0x00001A40:

Binary: 0000 0000 0000 0000 0001 1010 0100 0000
Offset [5:0] = 00 0000 = 0 (byte 0 of the line)
Index [11:6] = 10 1001 = 41 (look in cache line 41)
Tag [31:12]= 0000 0000 0000 0000 0001 = 0x00001

Hit check: Cache line 41: Valid=1, stored tag=0x00001 → HIT. Return byte 0 of line 41’s data.

Miss scenario: If stored tag ≠ 0x00001, or Valid=0 → MISS. Fetch 64 bytes from DRAM address 0x00001A00 (block aligned), write into line 41, update tag to 0x00001, set Valid=1, serve the request.

🔓Fully Associative Cache

In a fully associative cache, any main memory block can be placed in any cache line. There is no index field — the entire address (minus offset) is the tag. On every access, all stored tags are compared simultaneously using CAM hardware.

When fully associative is used: TLBs are typically fully associative because they are small (32–128 entries) and any virtual page must be able to map to any TLB entry. L1/L2/L3 caches are too large for fully associative organisation — a 32 KB L1 with 64-byte lines has 512 lines, requiring 512 simultaneous tag comparisons per access.

🔢Set-Associative Cache

The set-associative cache is the practical compromise: the cache is divided into sets, and each set contains N ways (N cache lines). A memory block maps to exactly one set (using the index field), but can occupy any of the N ways within that set. This is called an N-way set-associative cache.

Figure 3 — 2-way set-associative cache: 4 sets, 2 ways per set
TAG [31:4] — 28 bits INDEX [3:2] — 2 bits OFFSET [1:0] ↑ 4 sets = 2-bit index Set Way 0 Way 1 V Tag Data (64B) V Tag Data (64B) 0 1 0x1A3 Block 0x68C data… 1 0x2B7 Block 0xADC data… 1 1 0x004 Loop body code… 0 (invalid) 2 1 0x3FF Stack frame… 1 0x001 Global vars… 3 1 0x080 Heap block… 0 (invalid) Each set has an LRU bit indicating which way was least recently used — the next victim on a miss. Set 0 has both ways valid → replacement needed when a third block maps to Set 0. vlsitrainers.com

2-way set-associative cache with 4 sets. Index selects the set (row). Both ways in that set are checked simultaneously. A block that maps to Set 0 can occupy either Way 0 or Way 1 — eliminating the conflict that would occur in direct-mapped.

Mapping typeConflict missesHardware costCommon use
Direct-mapped (1-way)HighLowest — one comparatorSimple, low-power L1 in some embedded CPUs
2-way set-associativeModerate2 comparators per setCommon for small L1 caches
4-way set-associativeLow4 comparators per setTypical L1 cache (ARM Cortex-A, x86)
8-way set-associativeVery low8 comparators per setL2 / L3 caches
Fully associativeNoneHighest — N comparatorsTLB, victim cache

🗑️Replacement Policies

When a cache miss occurs and the selected set is full, a victim line must be evicted. Three main replacement policies:

PolicyRuleAdvantageDisadvantage
LRU
Least Recently Used
Evict the line that was accessed least recentlyBest hit rate in practice — exploits temporal locality optimallyRequires tracking access time of each line (LRU counter bits)
FIFO
First In, First Out
Evict the line that has been in cache the longestSimple — just a queue, no per-access trackingMay evict frequently-used old lines (Bélády’s anomaly)
RandomRandomly select a victim lineTrivially simple hardware — no state tracking neededUnpredictable — occasionally evicts the most-needed line
LRU approximation in real hardware: True LRU for N-way caches requires log₂(N!) bits per set. For an 8-way L1, that is log₂(8!) = 15 bits per set — expensive. Real CPUs use pseudo-LRU (PLRU) — a binary tree of 1-bit comparisons that approximates LRU with only N−1 bits per set. ARM Cortex-A53 uses pseudo-random replacement for L1; Cortex-A77 uses PLRU for L1/L2.

✍️Write Policy

When the CPU writes to a cached location, when should the write propagate to main DRAM? Two strategies:

WRITE-THROUGH
  • Every cache write immediately writes to DRAM
  • Cache and memory always consistent
  • Simpler — no dirty bit needed
  • More bus traffic — every write hits DRAM
  • Common in L1 with write buffer (coalesces writes)
  • Used where memory consistency is critical (DMA, multicore)
WRITE-BACK
  • Write only updates the cache line — sets dirty bit
  • DRAM updated only when dirty line is evicted
  • Less bus traffic — multiple writes to same line = one DRAM write
  • More complex — dirty bit per line, writeback on eviction
  • Higher performance for write-intensive workloads
  • Requires cache coherence protocol in multicore (MESI)

Write miss policy: Write-allocate (fetch the block into cache, then write — pairs naturally with write-back) or no-write-allocate (write directly to DRAM without loading cache — pairs with write-through).

📊Cache Performance

🔍 Worked Example — Effective Memory Access Time

System: L1 cache hit time = 4 cycles. L2 cache hit time = 12 cycles. DRAM access = 200 cycles. L1 miss rate = 5%. L2 miss rate (given L1 miss) = 20%.

AMAT calculation (hierarchical):

L2 AMAT = L2 hit time + L2 miss rate × DRAM penalty
= 12 + 0.20 × 200 = 12 + 40 = 52 cycles

Overall AMAT = L1 hit time + L1 miss rate × L2 AMAT
= 4 + 0.05 × 52 = 4 + 2.6 = 6.6 cycles

Without any cache: Every access = 200 cycles. Cache provides a 30× speedup for this workload.

🗺️Virtual Memory — The Problem

Physical DRAM is finite. Several problems arise without virtual memory:

Virtual memory creates the illusion that every process has its own large, private address space — typically the full 32-bit or 64-bit range. The OS and hardware transparently map virtual addresses to physical RAM locations (or to disk when RAM is full).

📄Paging & Page Tables

The most common virtual memory implementation divides both the virtual address space and physical memory into fixed-size chunks called pages (typically 4 KB). The hardware unit that translates virtual to physical addresses is the MMU (Memory Management Unit).

Figure 4 — Virtual address → physical address translation via page table
VIRTUAL ADDRESS (32-bit: 4KB pages → 20-bit VPN + 12-bit offset) Virtual Page Number (VPN) — 20 bits Page Offset — 12 bits Page Table (in OS kernel memory) VPN 0 → PPN 0x1A3, V=1, R=1, W=0 VPN 1 → PPN 0x007, V=1, R=1, W=1 VPN 2 → V=0 (PAGE FAULT!) VPN 3 → PPN 0x2BC, V=1, R=1, W=1 … 2²⁰ = 1M entries total … MMU uses VPN to index PHYSICAL ADDRESS = PPN (from page table) concatenated with Page Offset (unchanged) Physical Page Number (PPN) — from table Page Offset (same) VPN lookup PPN → physical addr offset unchanged vlsitrainers.com

Virtual-to-physical address translation. The MMU uses the Virtual Page Number to index the page table (maintained by the OS). The page table entry contains the Physical Page Number plus permission bits (Valid, Read, Write, Execute). The 12-bit page offset is concatenated unchanged to form the physical address. If Valid=0, a page fault exception is raised.

Page fault handling

  1. CPU accesses virtual address → MMU checks page table → Valid bit = 0
  2. MMU raises a page fault exception
  3. OS page fault handler runs: finds a free physical frame (or evicts one)
  4. OS reads the required page from disk (swap space) into the free frame
  5. OS updates page table entry: sets Valid=1, stores new PPN
  6. OS returns from exception → CPU retries the faulting instruction → now hits (Valid=1)

🔍TLB — Translation Lookaside Buffer

A problem: every memory access requires one page table lookup (which itself is a memory access). This doubles the effective memory access time. The solution is the TLB — a small, fully-associative cache of recently used page table entries, stored inside the MMU.

Figure 5 — TLB lookup: fast path (hit) vs slow path (miss → page table walk)
CPU VA TLB 32–128 entries fully associative VPN → PPN lookup ~1–2 ns TLB HIT (90%+) Cache / DRAM data → CPU TLB MISS (<10%) Page Table Walk HW or SW walks page table loads entry into TLB → retry refill TLB, retry TLB hit path: ~1 extra cycle (TLB lookup parallel with cache) TLB miss path: 10–100+ extra cycles (page table walk through memory) vlsitrainers.com

TLB lookup flow. On a TLB hit (the common case, 90%+ of accesses), the PPN is available in ~1 ns with no additional memory access. On a TLB miss, the hardware page table walker reads the page table from memory, loads the entry into the TLB, and retries the access.

ParameterTypical TLBNotes
Entries32–128Fully associative — any VPN→PPN pair in any entry
Access time1–2 nsRuns in parallel with L1 cache tag lookup (virtually indexed)
Hit rate>99%Most programs have working sets of a few hundred pages
On a missHardware page table walk (x86, ARM: PTWA hardware walker) or software TLB miss handler (MIPS). Walk may require 2–4 memory accesses for multi-level page tables.
TLB flushOn context switchEach process has its own page table → TLB must be flushed on process switch (or use ASIDs — Address Space IDs — to avoid full flush)

🔬VLSI Connections

🔬 Cache microarchitecture in RTL — tag SRAM, data SRAM, LRU logic

A physical L1 cache in RTL consists of two SRAM macros (tag array and data array) plus combinational logic. The tag array is addressed by the index field; its output (tag bits + valid bit + dirty bit) is compared against the incoming address tag using XOR gates — one comparator per way. The result drives a hit/miss signal. On a hit, the data array is read using the same index, and the block offset selects bytes. The LRU update logic (pseudo-LRU tree) updates after every access. This entire RTL block — typically 1,000–5,000 lines of SystemVerilog — is one of the most performance-critical pieces of any CPU design and requires thorough verification including miss/hit boundary conditions, dirty eviction, simultaneous read-write conflicts, and coherence protocol integration.

🔬 MMU and TLB in SoC design — SMMU for DMA masters

Every ARM Cortex-A CPU core contains an MMU with dedicated instruction and data TLBs (iTLB and dTLB). In a modern SoC, DMA masters (GPU, video codec, display controller, network accelerator) also generate memory accesses that must be address-translated and permission-checked. ARM’s SMMU (System Memory Management Unit) is a standalone MMU placed on the system bus that translates virtual addresses for non-CPU masters. When you integrate a video codec IP into a SoC, you connect its master port through the SMMU, configure SMMU stream mappings, and verify that the codec can only access its allocated memory regions.

🔬 Cache coherence — MESI protocol in multicore SoCs

In a multicore SoC, each core has its own private L1/L2 cache. If Core 0 modifies a cache line and Core 1 has a copy of the same line in its L1, Core 1’s copy is now stale — a cache coherence problem. The MESI protocol (Modified, Exclusive, Shared, Invalid) is the standard solution: every cache line is tagged with one of four states, and cores communicate via a coherence directory or snooping bus to keep copies consistent. ARM uses the AMBA CHI protocol to implement MESI coherence across all L1/L2 caches and a shared L3 in a DynamIQ cluster. Verifying cache coherence requires directed random testing and often formal methods, and is one of the hardest verification problems in CPU design.

Summary — CA-07 key points: Cache exploits locality of reference to bridge the CPU-DRAM speed gap. AMAT = Hit time + Miss rate × Miss penalty. Three mapping strategies: direct-mapped (one possible location, fast, conflict-prone), fully-associative (any location, no conflicts, expensive), set-associative (N ways per set — the practical compromise). Address = Tag | Index | Offset. Replacement policies: LRU (best hit rate), FIFO (simple), Random (trivial hardware). Write policy: write-through (always updates DRAM, simpler) vs write-back (dirty bit, only writes DRAM on eviction, better performance). Virtual memory uses paging — VPN indexes the page table to get PPN; offset is unchanged. TLB caches recent VPN→PPN translations; on a TLB miss a page table walk fetches the entry. Cache coherence (MESI) is required in multicore designs.
Internal Memory ☰ CA Series Index I/O Overview
Scroll to Top