
Advanced Caches

Prerequisites: 01-Introduction-and-Metrics
Learning Goals: Understand the AMAT equation and the three levers to improve it: hit time, miss rate, and miss penalty. Learn specific hardware and compiler techniques for each lever.


The AMAT Framework

\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
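
For instance, with illustrative numbers (1-cycle hit time, 10% miss rate, 50-cycle miss penalty):

\text{AMAT} = 1 + 0.10 \times 50 = 6 \text{ cycles}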

Three ways to improve AMAT:

| Lever | Techniques |
|---|---|
| Reduce Hit Time | Pipelined caches, VIPT caches, way prediction, fast replacement policy |
| Reduce Miss Rate | Larger cache, higher associativity, larger blocks, prefetching, loop interchange |
| Reduce Miss Penalty | Non-blocking caches, MSHRs, cache hierarchies |

Reducing Hit Time

1. Pipelined Caches

Pipelining the cache = overlapping one cache access with the next, so a new access can start before the previous one finishes. This attacks the wait-time term of:

\text{Hit Time} = \text{Actual Hit} + \text{Wait Time}

Three pipeline stages for a cache:

  1. Read the index → find the set
  2. Determine hit/miss; begin data read
  3. Finish data read

2. Overlap Cache Hit with TLB Hit (VIPT Cache)

A PIPT cache (Physically Indexed, Physically Tagged) uses the physical address → must wait for TLB translation before looking up the cache.

A VIPT cache (Virtually Indexed, Physically Tagged) uses the virtual address for indexing but physical address for tag comparison → TLB translation and cache lookup happen in parallel.

VIPT Design:

Constraint for no aliasing: \text{Cache Size} \leq \text{Associativity} \times \text{Page Size}

If all index bits come from the page offset portion of the address (bits below the page boundary), there is no aliasing.
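
Worked example with illustrative numbers: with 4 KB pages and an 8-way set-associative cache, the largest alias-free VIPT cache is

\text{Cache Size} \leq 8 \times 4\,\text{KB} = 32\,\text{KB}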

Virtually Addressed (VIVT) Cache: the full virtual address is used for both index and tag. Lookup needs no TLB at all, but the cache suffers from aliasing (two virtual addresses can map to the same physical block) and typically must be flushed on a context switch.

3. Way Prediction

The processor predicts which way of a set-associative cache will hit and checks that way's tag first. A correct prediction gives close to direct-mapped hit time; a misprediction costs extra cycles to check the remaining ways.
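
A minimal sketch of the lookup flow, assuming a 2-way set-associative cache; the structures and names (`CacheSet`, `way_pred`, `lookup`) are illustrative, not from the source:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256
#define WAYS     2

typedef struct {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
} CacheSet;

static CacheSet sets[NUM_SETS];
static int way_pred[NUM_SETS];   /* predicted way per set (illustrative) */

/* Returns the hit way, or -1 on a miss. A correct prediction checks only
 * one tag (fast path); a misprediction falls back to the remaining ways. */
int lookup(uint64_t tag, int set)
{
    int p = way_pred[set];
    if (sets[set].valid[p] && sets[set].tag[p] == tag)
        return p;                       /* fast hit: one tag comparison */
    for (int w = 0; w < WAYS; w++) {    /* slow path: extra cycle(s) */
        if (w == p) continue;
        if (sets[set].valid[w] && sets[set].tag[w] == tag) {
            way_pred[set] = w;          /* retrain the predictor */
            return w;
        }
    }
    return -1;                          /* miss */
}
```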

Way Prediction Performance:

| Cache Type | Way Prediction Benefit |
|---|---|
| Fully associative | Yes |
| 8-way set associative | Yes |
| 2-way set associative | Yes |
| Direct mapped | No (already only one way) |

4. Replacement Policy and Hit Time

| Policy | Hit Time | Miss Rate |
|---|---|---|
| Random | Fast (no bookkeeping) | Worse |
| LRU | Slower (update counters on every hit) | Best |
| NMRU (Not-Most-Recently-Used) | Fast (only track the MRU block) | Good |
| PLRU (Pseudo-LRU) | Medium | Good (compromise) |

NMRU: Only track the most recently used block; replace any other block.

PLRU: Keep one bit per block. Set a block's bit to 1 on access; when all bits would become 1, reset the others to 0, leaving only the just-accessed block's bit set. On a miss, evict any block whose bit is 0.
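
A minimal sketch of this bit-based PLRU for a single 4-way set; the names (`lru_bit`, `touch`, `victim`) are illustrative:

```c
#include <stdbool.h>

#define WAYS 4

static bool lru_bit[WAYS];   /* one PLRU bit per block in the set */

/* On an access, set the block's bit; if every bit is now 1,
 * clear all of them and keep only the accessed block's bit set. */
void touch(int way)
{
    lru_bit[way] = true;
    bool all_set = true;
    for (int w = 0; w < WAYS; w++)
        if (!lru_bit[w]) { all_set = false; break; }
    if (all_set) {
        for (int w = 0; w < WAYS; w++)
            lru_bit[w] = false;
        lru_bit[way] = true;
    }
}

/* Victim selection: any block whose bit is 0 (here, the first one). */
int victim(void)
{
    for (int w = 0; w < WAYS; w++)
        if (!lru_bit[w])
            return w;
    return 0;  /* unreachable: touch() always leaves at least one bit clear */
}
```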


Reducing Miss Rate

The 3 Cs of Cache Misses

| Miss Type | Cause | Fix |
|---|---|---|
| Compulsory | First access to a block | Prefetching, larger blocks |
| Capacity | Working set exceeds cache size | Larger cache |
| Conflict | Limited associativity (too many blocks map to the same set) | Higher associativity, better replacement policy |

Larger Cache Blocks

Larger blocks bring in more data per miss (exploiting spatial locality), so miss rate initially drops as block size grows. Past a point, miss rate climbs again: with a fixed cache size there are fewer blocks, and much of each large block may go unused.

Prefetching

Prefetching = loading a block into cache before it is needed.

Compiler prefetching: Insert prefetch instructions ahead of the loads that will miss. The hard part is the prefetch distance: prefetch too late and the data has not arrived when needed; too early and it may be evicted before use.
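
A sketch in C using GCC/Clang's `__builtin_prefetch`; the prefetch distance `PF_DIST` is an arbitrary illustrative value, and tuning it is exactly the difficulty noted above:

```c
/* Sum an array while prefetching PF_DIST elements ahead.
 * PF_DIST must be tuned: far enough ahead to hide memory latency,
 * near enough that the line is still cached when it is used. */
#define PF_DIST 16

long sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```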

Hardware prefetching: Hardware guesses what will be needed soon:

| Type | Strategy |
|---|---|
| Stream buffer | Prefetch the next sequential block |
| Stride prefetcher | Prefetch the block at a fixed distance d from the current access |
| Correlating prefetcher | If block A is fetched, prefetch B (learned correlation) |
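
A minimal software model of a stride prefetcher's training table, indexed by the load instruction's PC; the table size, confidence threshold, and names are all illustrative assumptions:

```c
#include <stdint.h>

#define TABLE_SIZE 64

typedef struct {
    uint64_t pc;         /* load instruction address */
    uint64_t last_addr;  /* last address this load touched */
    int64_t  stride;     /* last observed stride */
    int      confidence; /* consecutive repeats of the same stride */
} StrideEntry;

static StrideEntry table[TABLE_SIZE];

/* Returns the predicted next address to prefetch, or 0 if not confident. */
uint64_t train(uint64_t pc, uint64_t addr)
{
    StrideEntry *e = &table[pc % TABLE_SIZE];
    if (e->pc != pc) {                 /* new load: reset the entry */
        e->pc = pc; e->last_addr = addr;
        e->stride = 0; e->confidence = 0;
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confidence = (stride == e->stride) ? e->confidence + 1 : 0;
    e->stride = stride;
    e->last_addr = addr;
    return (e->confidence >= 2) ? addr + stride : 0;
}
```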

Loop Interchange

Loop interchange: swap inner and outer loops to access arrays in row-major (sequential) order rather than column-major.
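
A standard C illustration (C stores two-dimensional arrays in row-major order):

```c
#define N 1024
double a[N][N];

/* Before: inner loop walks a column, striding N*8 bytes per access
 * (C arrays are row-major), so nearly every access misses. */
void col_major(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

/* After interchange: inner loop walks a row sequentially, so one miss
 * brings in a block whose remaining words all hit (spatial locality). */
void row_major(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}
```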


Reducing Miss Penalty

Non-Blocking Cache

Blocking cache: Only one outstanding miss at a time — no progress while waiting.

Non-blocking cache allows:

  - Hit under miss: keep serving cache hits while an earlier miss is outstanding
  - Miss under miss: keep multiple misses outstanding at once, overlapping their latencies

Non-blocking caches can reduce the effective miss penalty by ~50% (memory-level parallelism).

Miss Status Handling Registers (MSHR)

MSHRs track outstanding misses:

| Scenario | Action |
|---|---|
| New miss (not yet requested) | Allocate an MSHR; record the waiting instruction |
| Half-miss (block already requested) | Add the instruction to the existing MSHR |
| Data arrives from memory | Use the MSHR to notify waiting instructions; free the MSHR |

Typical MSHR count: 16–32 registers.
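
A minimal sketch of the MSHR allocation logic; the structure layout and limits (`NUM_MSHRS`, `MAX_WAITERS`) are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS   16
#define MAX_WAITERS 4

typedef struct {
    bool     valid;
    uint64_t block_addr;             /* block being fetched */
    int      waiters[MAX_WAITERS];   /* IDs of waiting instructions */
    int      num_waiters;
} MSHR;

static MSHR mshr[NUM_MSHRS];

/* On a miss: a half-miss joins an existing MSHR, a new miss allocates
 * one. Returns false if all MSHRs are busy (the cache must stall). */
bool handle_miss(uint64_t block_addr, int instr_id)
{
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            if (mshr[i].num_waiters < MAX_WAITERS)
                mshr[i].waiters[mshr[i].num_waiters++] = instr_id;
            return true;               /* half-miss: request already sent */
        }
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshr[i].valid) {
            mshr[i].valid = true;      /* new miss: request goes to memory */
            mshr[i].block_addr = block_addr;
            mshr[i].waiters[0] = instr_id;
            mshr[i].num_waiters = 1;
            return true;
        }
    return false;                      /* out of MSHRs: stall */
}

/* On data return: wake the waiting instructions, then free the MSHR. */
void handle_fill(uint64_t block_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            /* ...wake mshr[i].waiters[0 .. num_waiters-1]... */
            mshr[i].valid = false;
            return;
        }
}
```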


Cache Hierarchies

Multiple cache levels reduce AMAT by catching misses at a closer level:

\text{AMAT} = T_{L1} + m_{L1} \cdot (T_{L2} + m_{L2} \cdot (T_{L3} + m_{L3} \cdot T_{mem}))
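
Worked example with illustrative values (T_{L1} = 1, m_{L1} = 0.10, T_{L2} = 10, m_{L2} = 0.50, T_{mem} = 100 cycles, no L3):

\text{AMAT} = 1 + 0.10 \cdot (10 + 0.50 \cdot 100) = 7 \text{ cycles}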

| Property | Relationship |
|---|---|
| Capacity | L1 < L2 < L3 |
| Latency | L1 < L2 < L3 |
| Global miss rate | L1 > L2 > L3 (typically; each level filters the one below) |

The final level before main memory is the Last Level Cache (LLC).

Local vs. Global Hit Rate

| Metric | Formula |
|---|---|
| Local Hit Rate | Hits in this cache / Accesses to this cache |
| Global Miss Rate | Misses in this cache / Total memory accesses |
| Global Hit Rate | 1 − Global Miss Rate |
| MPKI | Misses per 1,000 instructions |
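
For example, if L1's local miss rate is 10% and L2's local miss rate is 50% (illustrative numbers), then

\text{Global Miss Rate}_{L2} = 0.10 \times 0.50 = 0.05

so only 5 of every 100 memory accesses go past L2, even though L2's local miss rate looks poor.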

Inclusion Property

Three possibilities for the same block in L1 and L2:

| Policy | Rule |
|---|---|
| Default (neither) | A block in L1 may or may not be in L2 |
| Inclusion | A block in L1 must also be in L2 |
| Exclusion | A block in L1 must not also be in L2 |

Enforcing inclusion requires extra bookkeeping, e.g. an inclusion bit per L2 block recording whether the block is also in L1; when L2 evicts a block whose bit is set, it must invalidate that block in L1 as well.


Summary

Key Takeaways:

  - AMAT = Hit Time + Miss Rate × Miss Penalty; every technique in this section attacks one of the three terms.
  - Hit time: pipelined caches, VIPT indexing, way prediction, cheap replacement policies (NMRU, PLRU).
  - Miss rate: larger caches, higher associativity, larger blocks, prefetching, loop interchange; the 3 Cs identify which fix applies.
  - Miss penalty: non-blocking caches with MSHRs, and multi-level hierarchies measured by local vs. global hit rates.

Common Exam Topics:

See Also: 01-Introduction-and-Metrics, 09-Cache-Coherence
Next: 09-Cache-Coherence