Cache Memory and Virtual Memory for VLSI and Processor Design Engineers

⚡The Cache Concept

The fundamental problem: a modern CPU can execute billions of instructions per second, but main DRAM takes 50–100 ns per access — hundreds of CPU cycles per memory operation. Without mitigation, the CPU would spend most of its time idle, waiting for memory.

The solution is cache memory — a small, fast SRAM buffer placed between the CPU and main DRAM. Cache holds copies of recently and frequently used memory locations:

Cache hit: Data found in cache → served in 1–10 ns (no DRAM access needed)
Cache miss: Data not in cache → fetch from DRAM (50–100 ns), store in cache, serve to CPU

Cache works because of locality of reference: programs access the same locations repeatedly (temporal locality) and access nearby locations in sequence (spatial locality). A well-designed cache achieves hit rates of 90–99%.

Average memory access time (AMAT): AMAT = Hit time + Miss rate × Miss penalty. Example: L1 hit time = 4 cycles, miss rate = 5%, DRAM miss penalty = 200 cycles. AMAT = 4 + 0.05 × 200 = 4 + 10 = 14 cycles — far better than 200 cycles every time.
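The AMAT formula is easy to experiment with in a few lines of code. The sketch below is only an illustration; the function name `amat` and its parameter names are assumptions, not part of any standard library.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Figures from the example above: 4-cycle hit, 5% miss rate, 200-cycle penalty.
print(round(amat(4, 0.05, 200), 2))

# Halving the miss rate shows how sensitive AMAT is to that one term.
print(round(amat(4, 0.025, 200), 2))
```

Because the miss penalty is so large, small changes in miss rate move AMAT far more than small changes in hit time.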

🗂️Cache Structure — Tags, Lines, and Sets

A cache is organised into cache lines (also called cache blocks). Each line holds a copy of a contiguous chunk of main memory — typically 64 bytes. Each cache line also stores metadata:

Field	Width	Purpose
Valid bit	1 bit	Indicates whether this cache line holds a valid copy (0 = empty/invalid, 1 = valid)
Tag	Address bits − index bits − offset bits	Identifies which main memory block this line holds — used to check if a cache hit occurred
Dirty bit	1 bit (write-back only)	Marks a line that has been written but not yet propagated to DRAM — must be written back on eviction
Data	64 bytes (512 bits)	The actual cached copy of the memory block

Address decomposition

Figure 1 — Memory address decomposed into Tag | Index | Block Offset

A 32-bit address is split into three fields. The Index selects which cache row to inspect. The Tag is compared against the stored tag to confirm a hit. The Block Offset selects the specific byte within the 64-byte cache line.

1️⃣Direct-Mapped Cache

In a direct-mapped cache, each main memory block maps to exactly one cache line — determined by the block address modulo the number of cache lines:

Cache line = Memory block number MOD (Number of cache lines)

Figure 2 — Direct-mapped cache: each DRAM block maps to exactly one cache line

Direct-mapped cache with 4 lines. Each DRAM block maps to exactly one cache line (block mod 4). Blocks 0, 4, 8, 12 all map to Line 0. If the program alternates between Block 4 and Block 8, every access is a miss — called thrashing.
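The thrashing behaviour described above can be reproduced with a minimal model. This is a sketch that tracks block numbers only (no data, no tags beyond the block number itself); the function name `simulate_direct_mapped` is hypothetical.

```python
def simulate_direct_mapped(block_trace, num_lines=4):
    """Return 'hit' or 'miss' for each block number in the trace."""
    lines = [None] * num_lines           # block held by each line; None means invalid
    results = []
    for block in block_trace:
        line = block % num_lines         # the single line this block may occupy
        if lines[line] == block:
            results.append("hit")
        else:
            results.append("miss")
            lines[line] = block          # replace whatever was there
    return results


# Blocks 4 and 8 both map to line 0, so they keep evicting each other.
print(simulate_direct_mapped([4, 8, 4, 8, 4, 8]))

# Blocks 4 and 5 map to different lines, so only the first touches miss.
print(simulate_direct_mapped([4, 5, 4, 5, 4, 5]))
```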

🔍 Worked Example — Direct-Mapped Cache Lookup

Cache: 64 lines, 64-byte blocks. Address: 32-bit.

Address decomposition:
Offset = log₂(64 bytes) = 6 bits
Index = log₂(64 lines) = 6 bits
Tag = 32 − 6 − 6 = 20 bits

Access address 0x00001A40:

Binary: 0000 0000 0000 0000 0001 1010 0100 0000
Offset [5:0] = 00 0000 = 0 (byte 0 of the line)
Index [11:6] = 10 1001 = 41 (look in cache line 41)
Tag [31:12]= 0000 0000 0000 0000 0001 = 0x00001

Hit check: Cache line 41: Valid=1, stored tag=0x00001 → HIT. Return byte 0 of line 41’s data.

Miss scenario: If stored tag ≠ 0x00001, or Valid=0 → MISS. Fetch 64 bytes from DRAM address 0x00001A40 (the block-aligned address; it equals the requested address here because the offset bits are all zero), write into line 41, update tag to 0x00001, set Valid=1, serve the request.
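The lookup in this worked example can be checked with a short model. The sketch below assumes the same geometry (64 lines, 64-byte blocks, 32-bit addresses); the names `decode`, `block_base` and `DirectMappedCache` are illustrative only, and the model keeps valid bits and tags but no data.

```python
OFFSET_BITS = 6    # 64-byte blocks
INDEX_BITS = 6     # 64 lines


def decode(addr):
    """Split an address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def block_base(addr):
    """Address of the first byte of the block containing addr."""
    return addr & ~((1 << OFFSET_BITS) - 1)


class DirectMappedCache:
    def __init__(self):
        num_lines = 1 << INDEX_BITS
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def access(self, addr):
        tag, index, _ = decode(addr)
        if self.valid[index] and self.tags[index] == tag:
            return "hit"
        # Miss: model the line fill by recording the new tag.
        self.valid[index] = True
        self.tags[index] = tag
        return "miss"


tag, index, offset = decode(0x00001A40)
print(hex(tag), index, offset)
print(hex(block_base(0x00001A40)))

cache = DirectMappedCache()
print(cache.access(0x00001A40))   # cold cache, so the first access misses
print(cache.access(0x00001A44))   # same line, different byte offset
```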

🔓Fully Associative Cache

In a fully associative cache, any main memory block can be placed in any cache line. There is no index field: the entire address (minus offset) is the tag. On every access, all stored tags are compared simultaneously using content-addressable memory (CAM) hardware.

Advantage: No conflict misses — any block goes anywhere. Maximum flexibility.
Disadvantage: Expensive — requires parallel hardware comparators for every cache line. Practical only for very small caches (TLBs typically have 32–128 fully-associative entries).

When fully associative is used: TLBs are typically fully associative because they are small (32–128 entries) and any virtual page must be able to map to any TLB entry. L1/L2/L3 caches are too large for fully associative organisation — a 32 KB L1 with 64-byte lines has 512 lines, requiring 512 simultaneous tag comparisons per access.

🔢Set-Associative Cache

The set-associative cache is the practical compromise: the cache is divided into sets, and each set contains N ways (N cache lines). A memory block maps to exactly one set (using the index field), but can occupy any of the N ways within that set. This is called an N-way set-associative cache.

Figure 3 — 2-way set-associative cache: 4 sets, 2 ways per set

2-way set-associative cache with 4 sets. Index selects the set (row). Both ways in that set are checked simultaneously. A block that maps to Set 0 can occupy either Way 0 or Way 1 — eliminating the conflict that would occur in direct-mapped.
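To see how the second way removes the conflict, the sketch below models a set-associative cache by block number. It assumes LRU replacement within each set (the replacement policy is a separate design choice), and the class name `SetAssociativeCache` is hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; the oldest entry is the least recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block):
        cache_set = self.sets[block % self.num_sets]   # index selects the set
        if block in cache_set:
            cache_set.move_to_end(block)               # mark as most recently used
            return "hit"
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)              # evict least recently used way
        cache_set[block] = True
        return "miss"


trace = [4, 8, 4, 8, 4, 8]        # both blocks map to set 0

two_way = SetAssociativeCache(num_sets=4, ways=2)
print([two_way.access(b) for b in trace])

one_way = SetAssociativeCache(num_sets=4, ways=1)      # behaves as direct-mapped
print([one_way.access(b) for b in trace])
```

With two ways, only the two compulsory misses remain; with one way, the same trace misses on every access.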

Mapping type	Conflict misses	Hardware cost	Common use
Direct-mapped (1-way)	High	Lowest — one comparator	Simple, low-power L1 in some embedded CPUs
2-way set-associative	Moderate	2 comparators (one per way)	Common for small L1 caches
4-way set-associative	Low	4 comparators (one per way)	Typical L1 cache (many ARM Cortex-A cores)
8-way set-associative	Very low	8 comparators (one per way)	L2 / L3 caches; L1 caches on many x86 cores
Fully associative	None	Highest — N comparators	TLB, victim cache

🗑️Replacement Policies

When a cache miss occurs and the selected set is full, a victim line must be evicted. Three main replacement policies:

Policy	Rule	Advantage	Disadvantage
LRU (Least Recently Used)	Evict the line that was accessed least recently	Usually the best hit rate of the three, because it exploits temporal locality directly	Requires tracking the access order of every line in the set (LRU state bits)
FIFO (First In, First Out)	Evict the line that has been in the cache the longest	Simple: just a queue, no per-access tracking	May evict old lines that are still frequently used; also subject to Bélády’s anomaly (a larger cache can produce more misses)
Random	Randomly select a victim line	Trivially simple hardware — no state tracking needed	Unpredictable — occasionally evicts the most-needed line
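The difference between LRU and FIFO in the table above can be observed on a short trace. The sketch below models a single fully associative set and counts misses; the function name `count_misses` and the trace are illustrative choices (the trace is the classic textbook sequence used to show Bélády's anomaly).

```python
from collections import OrderedDict


def count_misses(trace, capacity, policy):
    """Count misses for one fully associative set. policy is 'lru' or 'fifo'."""
    if policy not in ("lru", "fifo"):
        raise ValueError(policy)
    resident = OrderedDict()              # front of the dict is the next victim
    misses = 0
    for block in trace:
        if block in resident:
            if policy == "lru":
                resident.move_to_end(block)   # LRU reorders on every hit
            continue                          # FIFO ignores hits entirely
        misses += 1
        if len(resident) >= capacity:
            resident.popitem(last=False)
        resident[block] = True
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
for policy in ("fifo", "lru"):
    for capacity in (3, 4):
        print(policy, capacity, count_misses(trace, capacity, policy))
```

On this trace FIFO misses more often with four lines than with three, while LRU never gets worse as capacity grows.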

LRU approximation in real hardware: True LRU for an N-way cache requires ⌈log₂(N!)⌉ bits per set. For an 8-way L1, that is ⌈log₂(40,320)⌉ = ⌈15.3⌉ = 16 bits per set, plus update logic on every access, which is expensive. Real CPUs use pseudo-LRU (PLRU): a binary tree of 1-bit direction flags that approximates LRU with only N−1 bits per set (7 bits for 8 ways). ARM Cortex-A53 uses pseudo-random replacement for L1; Cortex-A77 uses PLRU for L1/L2.
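A tree-based PLRU is compact enough to sketch directly. The code below is one common formulation, not any specific CPU's implementation: each tree node holds one bit that points toward the half of the set to search for a victim, and every access flips the bits on its path to point away from the line just used. The names `true_lru_bits` and `TreePLRU` are hypothetical.

```python
import math


def true_lru_bits(ways):
    """Minimum whole bits needed to encode every ordering of `ways` lines."""
    return math.ceil(math.log2(math.factorial(ways)))


class TreePLRU:
    def __init__(self, ways):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two, at least 2")
        self.ways = ways
        self.bits = [0] * (ways - 1)    # heap layout: node i has children 2i+1, 2i+2

    def touch(self, way):
        """Record an access by pointing each node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1     # used the left half, so the victim is on the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0     # used the right half, so the victim is on the left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the direction bits from the root to a leaf."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


print(true_lru_bits(8), "bits for true LRU vs", len(TreePLRU(8).bits), "for tree PLRU")

plru = TreePLRU(4)
for way in (0, 2, 1, 3):
    plru.touch(way)
print(plru.victim())    # way 0 was touched longest ago in this sequence
```

PLRU does not always pick the true least recently used line, but it never picks the most recently used one, which is the property that matters most in practice.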

✍️Write Policy

When the CPU writes to a cached location, when should the write propagate to main DRAM? Two strategies:

WRITE-THROUGH

Every cache write immediately writes to DRAM
Cache and memory always consistent
Simpler — no dirty bit needed
More bus traffic — every write hits DRAM
Common in L1 with write buffer (coalesces writes)
Used where memory consistency is critical (DMA, multicore)

WRITE-BACK

Write only updates the cache line — sets dirty bit
DRAM updated only when dirty line is evicted
Less bus traffic — multiple writes to same line = one DRAM write
More complex — dirty bit per line, writeback on eviction
Higher performance for write-intensive workloads
Requires cache coherence protocol in multicore (MESI)

Write miss policy: Write-allocate (fetch the block into cache, then write — pairs naturally with write-back) or no-write-allocate (write directly to DRAM without loading cache — pairs with write-through).
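The bus-traffic difference between the two policies can be estimated by counting DRAM writes for a stream of stores. The sketch below is a simplified model under stated assumptions: a direct-mapped cache tracked by block number, write-allocate for write-back, and a final flush of dirty lines counted at the end. The function name `dram_writes` is hypothetical.

```python
def dram_writes(write_trace, policy, num_lines=4):
    """Count DRAM writes caused by a trace of stores (given as block numbers)."""
    if policy == "write-through":
        return len(write_trace)           # every store is also sent to DRAM
    if policy != "write-back":
        raise ValueError(policy)

    lines = [None] * num_lines            # block held by each line
    dirty = [False] * num_lines
    writes = 0
    for block in write_trace:
        i = block % num_lines
        if lines[i] != block:
            if dirty[i]:
                writes += 1               # evicting a dirty line forces a writeback
            lines[i] = block              # write-allocate: bring the block in
        dirty[i] = True                   # the store only updates the cache
    return writes + sum(dirty)            # flush whatever is still dirty at the end


repeated = [0, 0, 0, 0, 1, 1, 0]          # many stores to the same two blocks
print(dram_writes(repeated, "write-through"), dram_writes(repeated, "write-back"))

conflicting = [0, 4, 0, 4]                # blocks that evict each other every time
print(dram_writes(conflicting, "write-through"), dram_writes(conflicting, "write-back"))
```

Write-back wins when stores cluster on a few lines; when every store evicts a dirty line, the advantage disappears.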

📊Cache Performance

🔍 Worked Example — Effective Memory Access Time

System: L1 cache hit time = 4 cycles. L2 cache hit time = 12 cycles. DRAM access = 200 cycles. L1 miss rate = 5%. L2 miss rate (given L1 miss) = 20%.

AMAT calculation (hierarchical):

L2 AMAT = L2 hit time + L2 miss rate × DRAM penalty
= 12 + 0.20 × 200 = 12 + 40 = 52 cycles

Overall AMAT = L1 hit time + L1 miss rate × L2 AMAT
= 4 + 0.05 × 52 = 4 + 2.6 = 6.6 cycles

Without any cache: Every access = 200 cycles. Cache provides a speedup of about 30× (200 / 6.6 ≈ 30.3) for this workload.
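The hierarchical calculation generalises to any number of cache levels by working from the outermost level inward. The sketch below assumes each level is described by its hit time and its local miss rate (misses as a fraction of the accesses that reach that level); the function name `amat_hierarchy` is hypothetical.

```python
def amat_hierarchy(levels, memory_latency):
    """levels: list of (hit_time, local_miss_rate) ordered from L1 outward."""
    penalty = memory_latency
    # Start at the last cache level: its miss penalty is the DRAM latency.
    # Each inner level then uses the outer level's AMAT as its own miss penalty.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


levels = [(4, 0.05), (12, 0.20)]          # L1 and L2 from the example above
overall = amat_hierarchy(levels, 200)
print(round(overall, 2))                  # overall AMAT in cycles
print(round(200 / overall, 1))            # speedup versus no cache
```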

🗺️Virtual Memory — The Problem

Physical DRAM is finite. Several problems arise without virtual memory:

Programs larger than physical RAM cannot run at all
Multiple programs share RAM — one program can read/overwrite another’s data (no isolation)
Programmers must manually manage which parts of a program fit in RAM
No protection between OS and user programs

Virtual memory creates the illusion that every process has its own large, private address space — typically the full 32-bit or 64-bit range. The OS and hardware transparently map virtual addresses to physical RAM locations (or to disk when RAM is full).

📄Paging & Page Tables

The most common virtual memory implementation divides both the virtual address space and physical memory into fixed-size chunks called pages (typically 4 KB). The hardware unit that translates virtual to physical addresses is the MMU (Memory Management Unit).

Figure 4 — Virtual address → physical address translation via page table

Virtual-to-physical address translation. The MMU uses the Virtual Page Number to index the page table (maintained by the OS). The page table entry contains the Physical Page Number plus permission bits (Valid, Read, Write, Execute). The 12-bit page offset is concatenated unchanged to form the physical address. If Valid=0, a page fault exception is raised.
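The translation step can be written out as a short function. This is a sketch assuming 4 KB pages and a single-level page table held in a Python dict; the entry layout (`valid`, `ppn`, `perms`) and the names `translate` and `PageFault` are illustrative, not a real MMU format.

```python
OFFSET_BITS = 12                   # 4 KB pages
PAGE_SIZE = 1 << OFFSET_BITS


class PageFault(Exception):
    """Raised when the page table entry is missing or has Valid=0."""


def translate(vaddr, page_table, access="r"):
    vpn = vaddr >> OFFSET_BITS             # virtual page number indexes the table
    offset = vaddr & (PAGE_SIZE - 1)       # page offset passes through unchanged
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(hex(vaddr))
    if access not in entry["perms"]:
        raise PermissionError(f"{access!r} access not allowed at {hex(vaddr)}")
    return (entry["ppn"] << OFFSET_BITS) | offset


# Hypothetical mapping: virtual page 0x12345 lives in physical page 0x00ABC.
page_table = {0x12345: {"valid": True, "ppn": 0x00ABC, "perms": "rw"}}
print(hex(translate(0x12345678, page_table)))
```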

Page fault handling

CPU accesses virtual address → MMU checks page table → Valid bit = 0
MMU raises a page fault exception
OS page fault handler runs: finds a free physical frame (or evicts one)
OS reads the required page from disk (swap space) into the free frame
OS updates page table entry: sets Valid=1, stores new PPN
OS returns from exception → CPU retries the faulting instruction → now hits (Valid=1)
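The fault-and-retry sequence above can be modelled in a few lines. The sketch below is a toy, not an OS implementation: it tracks page numbers only, uses FIFO to choose a victim frame (an assumption; the steps above do not specify a policy), and leaves the disk read as a comment. The class name `TinyPager` is hypothetical.

```python
class PageFault(Exception):
    pass


class TinyPager:
    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.page_table = {}     # vpn -> ppn, holding only pages with Valid=1
        self.load_order = []     # resident pages, oldest first (FIFO eviction)
        self.faults = 0

    def mmu_lookup(self, vpn):
        if vpn not in self.page_table:       # equivalent to Valid=0
            raise PageFault(vpn)
        return self.page_table[vpn]

    def handle_fault(self, vpn):
        self.faults += 1
        if self.free_frames:
            frame = self.free_frames.pop(0)
        else:
            victim = self.load_order.pop(0)
            frame = self.page_table.pop(victim)   # victim's entry becomes invalid
        # A real handler would read the page from swap into `frame` here.
        self.page_table[vpn] = frame              # set Valid=1 and store the new PPN
        self.load_order.append(vpn)

    def access(self, vpn):
        try:
            return self.mmu_lookup(vpn)
        except PageFault:
            self.handle_fault(vpn)
            return self.mmu_lookup(vpn)           # the retry now succeeds


pager = TinyPager(num_frames=2)
print([pager.access(vpn) for vpn in (10, 11, 10, 12)], pager.faults)
```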

🔍TLB — Translation Lookaside Buffer

A problem: every memory access requires at least one page table lookup, which is itself a memory access. Without help, this at least doubles the effective memory access time (and multi-level page tables make it worse). The solution is the TLB, a small, fully-associative cache of recently used page table entries, stored inside the MMU.

Figure 5 — TLB lookup: fast path (hit) vs slow path (miss → page table walk)

TLB lookup flow. On a TLB hit (the common case, typically over 99% of accesses), the PPN is available in ~1 ns with no additional memory access. On a TLB miss, the hardware page table walker reads the entry from the page table in memory, loads it into the TLB, and the access is retried.

Parameter	Typical TLB	Notes
Entries	32–128	Fully associative — any VPN→PPN pair in any entry
Access time	1–2 ns	Can run in parallel with the L1 cache set lookup when the L1 is virtually indexed, physically tagged (VIPT)
Hit rate	>99%	Most programs have working sets of a few hundred pages
On a miss	Page table walk	Hardware page table walker (x86, ARM) or software TLB miss handler (MIPS). The walk may require 2–4 memory accesses for multi-level page tables.
TLB flush	On context switch	Each process has its own page table → TLB must be flushed on process switch (or use ASIDs — Address Space IDs — to avoid full flush)
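The behaviour summarised in the table can be modelled with a small fully associative TLB. The sketch below makes several assumptions: LRU replacement, entries tagged with an ASID so that a context switch does not require a flush, and a plain dict standing in for the page table walk. The class name `TLB` and its methods are hypothetical.

```python
from collections import OrderedDict


class TLB:
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()     # (asid, vpn) -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, asid, vpn, page_table):
        key = (asid, vpn)
        if key in self.map:
            self.hits += 1
            self.map.move_to_end(key)
            return self.map[key]
        self.misses += 1
        ppn = page_table[vpn]            # stands in for the page table walk
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)
        self.map[key] = ppn
        return ppn

    def flush(self):
        """Full flush, as needed on a context switch when ASIDs are not used."""
        self.map.clear()


tlb = TLB(entries=4)
process_a = {1: 100, 2: 200}
process_b = {1: 900}

for vpn in (1, 2, 1, 2, 1):
    tlb.lookup(0, vpn, process_a)        # ASID 0
tlb.lookup(1, 1, process_b)              # ASID 1 reuses VPN 1 without a flush
print(tlb.hits, tlb.misses)
print(tlb.lookup(0, 1, process_a))       # process A's mapping is still cached
```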

🔬VLSI Connections

🔬 Cache microarchitecture in RTL — tag SRAM, data SRAM, LRU logic

A physical L1 cache in RTL consists of two SRAM macros (tag array and data array) plus combinational logic. The tag array is addressed by the index field. For each way, the stored tag is compared against the incoming address tag by an equality comparator (XOR gates feeding a NOR reduction), and the result is qualified by that way's valid bit to produce the hit/miss signal; the dirty bit is read alongside but only matters on eviction. On a hit, the data array is read using the same index, and the block offset selects bytes. The LRU update logic (pseudo-LRU tree) updates after every access. This entire RTL block, typically 1,000–5,000 lines of SystemVerilog, is one of the most performance-critical pieces of any CPU design and requires thorough verification, including miss/hit boundary conditions, dirty eviction, simultaneous read-write conflicts, and coherence protocol integration.

🔬 MMU and TLB in SoC design — SMMU for DMA masters

Every ARM Cortex-A CPU core contains an MMU with dedicated instruction and data TLBs (iTLB and dTLB). In a modern SoC, DMA masters (GPU, video codec, display controller, network accelerator) also generate memory accesses that must be address-translated and permission-checked. ARM’s SMMU (System Memory Management Unit) is a standalone MMU placed on the system bus that translates virtual addresses for non-CPU masters. When you integrate a video codec IP into a SoC, you connect its master port through the SMMU, configure SMMU stream mappings, and verify that the codec can only access its allocated memory regions.

🔬 Cache coherence — MESI protocol in multicore SoCs

In a multicore SoC, each core has its own private L1/L2 cache. If Core 0 modifies a cache line and Core 1 has a copy of the same line in its L1, Core 1’s copy is now stale: a cache coherence problem. The MESI protocol (Modified, Exclusive, Shared, Invalid) is the standard solution: every cache line is tagged with one of four states, and cores communicate via a coherence directory or snooping bus to keep copies consistent. ARM uses the AMBA CHI protocol to implement MESI-style coherence across all L1/L2 caches and a shared L3 in a DynamIQ cluster. Verifying cache coherence requires directed and constrained-random testing, often supplemented by formal methods, and is one of the hardest verification problems in CPU design.
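The four MESI states can be traced for a single cache line with a very small model. The sketch below assumes a snooping bus where every core sees every request, and it omits data movement and writebacks; it is a simplified illustration, not the AMBA CHI protocol. The function names `read` and `write` are hypothetical.

```python
def read(states, core):
    """Core reads the line. states holds one of 'M', 'E', 'S', 'I' per core."""
    if states[core] != "I":
        return                                   # M, E and S all satisfy a read locally
    others_have_copy = any(s != "I" for i, s in enumerate(states) if i != core)
    for i, s in enumerate(states):
        if i != core and s in ("M", "E"):
            states[i] = "S"                      # an M owner would also supply the data
    states[core] = "S" if others_have_copy else "E"


def write(states, core):
    """Core writes the line: every other copy is invalidated first."""
    for i in range(len(states)):
        if i != core:
            states[i] = "I"                      # an M owner would write back first
    states[core] = "M"


line = ["I", "I"]                                # one cache line, two cores
read(line, 0)
print(line)      # core 0 is the only holder
read(line, 1)
print(line)      # both cores now share the line
write(line, 0)
print(line)      # core 1's stale copy has been invalidated
read(line, 1)
print(line)      # core 0 downgrades and the line is shared again
```

The invariant worth checking in any such model is that a line in M or E in one cache is Invalid in every other cache.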

Summary — CA-07 key points: Cache exploits locality of reference to bridge the CPU-DRAM speed gap. AMAT = Hit time + Miss rate × Miss penalty. Three mapping strategies: direct-mapped (one possible location, fast, conflict-prone), fully-associative (any location, no conflicts, expensive), set-associative (N ways per set — the practical compromise). Address = Tag | Index | Offset. Replacement policies: LRU (best hit rate), FIFO (simple), Random (trivial hardware). Write policy: write-through (always updates DRAM, simpler) vs write-back (dirty bit, only writes DRAM on eviction, better performance). Virtual memory uses paging — VPN indexes the page table to get PPN; offset is unchanged. TLB caches recent VPN→PPN translations; on a TLB miss a page table walk fetches the entry. Cache coherence (MESI) is required in multicore designs.