09 Cache Coherence

Cache Coherence

Prerequisites: 08-Advanced-Caches

Learning Goals: Understand the cache coherence problem in multicore processors, the two fundamental approaches (update vs. invalidate), major coherence protocols (MSI, MOSI, MOESI), and directory-based coherence for scalable systems.


The Cache Coherence Problem

In a multicore system, each core has its own private cache. Several cores can cache the same memory location at once, and when one of them writes it, the cached copies diverge.

Incoherent: Each cache copy behaves as an independent copy instead of as the same shared memory location.

Programmer expectation: shared memory — any write by one core should be visible to all cores.
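
The divergence described above can be reproduced with a toy model. This is a minimal sketch, not a model of real hardware: `PrivateCache`, the address `0x100`, and the write-through policy are all assumptions made for illustration.

```python
class PrivateCache:
    """A private cache with no coherence protocol (hypothetical toy model)."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store: dict of addr -> value
        self.lines = {}        # this core's private copies

    def read(self, addr):
        # On a miss, fetch from memory; on a hit, return the private copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through: update the private copy and memory, but notify no one.
        self.lines[addr] = value
        self.memory[addr] = value


memory = {0x100: 0}
core0, core1 = PrivateCache(memory), PrivateCache(memory)

core1.read(0x100)            # core 1 caches the old value 0
core0.write(0x100, 42)       # core 0 updates its own copy and memory
stale = core1.read(0x100)    # core 1 hits on its stale private copy
print(stale, memory[0x100])  # prints: 0 42
```

Memory holds the new value, yet core 1 keeps reading the old one because nothing ever tells its cache that the block changed.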


Coherence Requirements

Three requirements for a coherent memory system:

  1. A core reading a memory location receives the value written by the last valid write
  2. If core A writes and then core B reads the same location, B should see A’s value
  3. All cores agree on the order of writes to any given memory location

Approaches to Coherence (Non-Solutions First)

| Approach | Problem |
| --- | --- |
| No caches | Correct, but terrible performance |
| All cores share one L1 cache | Correct, but terrible performance |
| Private write-through caches (no protocol) | Not coherent |

Maintaining Coherence Property 2: Update vs. Invalidate

| Strategy | Mechanism | When Better |
| --- | --- | --- |
| Write-Update | On a write, broadcast the new value to all caches holding that block | One core writes, many cores read frequently |
| Write-Invalidate | On a write, invalidate all other copies | Burst writes to one address; writes to different words in same block; thread migration |

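Most modern processors use write-invalidate, in part because it handles thread migration better: once a thread moves to another core, the stale copies left on the old core are invalidated once instead of being updated on every write.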
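
The "When Better" column can be checked with a rough message count. The sketch below tracks a single block and assumes every miss, update broadcast, or invalidation costs exactly one bus message; `bus_messages` and both traces are hypothetical and ignore data sizes and timing.

```python
def bus_messages(trace, policy):
    """Count bus messages for one block under a toy cost model.

    trace:  list of (core, op) pairs with op in {"R", "W"}.
    policy: "update" or "invalidate".
    """
    holders = set()   # cores that currently hold a valid copy
    messages = 0
    for core, op in trace:
        miss = core not in holders
        others = holders - {core}
        if op == "R":
            if miss:
                messages += 1      # read miss: fetch the block
            holders.add(core)
        else:
            if miss or others:
                messages += 1      # fetch the block and/or notify other holders
            if policy == "update":
                holders.add(core)  # other holders keep fresh copies
            else:
                holders = {core}   # other copies are now invalid
    return messages


# Core 1 reads once, then core 0 writes four times in a row.
burst = [(1, "R")] + [(0, "W")] * 4
# Core 0 writes, cores 1 to 3 read; repeated three times.
producer_consumer = [(0, "W"), (1, "R"), (2, "R"), (3, "R")] * 3

print(bus_messages(burst, "update"), bus_messages(burst, "invalidate"))  # 5 2
print(bus_messages(producer_consumer, "update"),
      bus_messages(producer_consumer, "invalidate"))                      # 6 12
```

Under these assumptions invalidation wins on the write burst, while update wins when one writer feeds many frequent readers.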


Maintaining Coherence Property 3: Ordering

| Mechanism | Description |
| --- | --- |
| Snooping | All writes go on a shared bus; all cores monitor (snoop) the bus |
| Directory | Each block has a directory entry tracking its state; no broadcast needed |

Write-Update Optimization: Dirty Bit

Problem: Broadcasting all writes to memory creates a bandwidth bottleneck.

Solution: Delay writes to memory using a dirty bit per cache block.

Benefits:


Write-Invalidate Snooping

Disadvantage: Every reader gets a miss when a core writes.

Advantage: Multiple consecutive writes to the same block are fast: no need to broadcast after the first write (no other valid copies).


MSI Protocol

An invalidation-based snooping protocol with 3 states:

| State | Meaning |
| --- | --- |
| I (Invalid) | This cache does not have a valid copy |
| S (Shared) | This cache has a clean (read-only) copy; other caches may also hold copies |
| M (Modified) | This cache has the only valid (dirty) copy |

MSI State Transitions (Summary)

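| Current State | Event | Action | Next State |
| --- | --- | --- | --- |
| I | Local read | Put Read on bus; get data | S |
| I | Local write | Put Write+Invalidate on bus; get data | M |
| S | Local read | None (cache hit) | S |
| S | Local write | Put Invalidation on bus | M |
| S | Snoop write on bus | Invalidate | I |
| M | Local read/write | None (cache hit) | M |
| M | Snoop read on bus | Write back; supply data | S |
| M | Snoop write on bus | Write back; supply data | I |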
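
The transition table above maps directly onto a lookup table. The sketch below tracks one block across several caches; the bus-action labels (`BusRead`, `BusWrite`, `BusInvalidate`, `WriteBack`) and the `access` helper are names chosen for illustration, and any (state, event) pair the table does not list is assumed to be a no-op.

```python
# (state, event) -> (bus action or None, next state), transcribed from the table.
MSI = {
    ("I", "local_read"):  ("BusRead", "S"),
    ("I", "local_write"): ("BusWrite", "M"),
    ("S", "local_read"):  (None, "S"),
    ("S", "local_write"): ("BusInvalidate", "M"),
    ("S", "snoop_write"): (None, "I"),          # drop the copy, no bus traffic
    ("M", "local_read"):  (None, "M"),
    ("M", "local_write"): (None, "M"),
    ("M", "snoop_read"):  ("WriteBack", "S"),
    ("M", "snoop_write"): ("WriteBack", "I"),
}


def access(states, core, op):
    """Apply one read ("R") or write ("W") by `core` to a single block.

    states: list with one MSI state per core, updated in place.
    Returns the list of bus actions the access triggered.
    """
    event = "local_read" if op == "R" else "local_write"
    action, next_state = MSI[(states[core], event)]
    actions = [action] if action else []
    if action:  # the request is on the bus, so every other cache snoops it
        snoop = "snoop_read" if action == "BusRead" else "snoop_write"
        for other in range(len(states)):
            if other == core:
                continue
            # Pairs missing from the table (e.g. S snooping a read) are no-ops.
            reaction, snooped = MSI.get((states[other], snoop),
                                        (None, states[other]))
            states[other] = snooped
            if reaction:
                actions.append(reaction)
    states[core] = next_state
    return actions


states = ["I", "I"]
log = [access(states, 0, "R"), access(states, 1, "R"),
       access(states, 0, "W"), access(states, 1, "R")]
print(states)  # ['S', 'S']
print(log)     # [['BusRead'], ['BusRead'], ['BusInvalidate'], ['BusRead', 'WriteBack']]
```

The last access shows the costly case: core 1 reads a block that core 0 holds in M, so core 0 has to write it back before both end up in S.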

Cache-to-Cache Transfers

When cache C1 has a block in the M state and C2 requests it:

| Method | Description | Cost |
| --- | --- | --- |
| Abort and Retry | C1 aborts C2’s request; C2 retries after writeback | 2× memory latency |
| Intervention | C1 tells memory it will respond directly to C2 | 1× memory latency (better) |

Modern processors use Intervention.

Intervention requires an extra signal on the bus; hardware is more complex but faster.


MOSI Protocol

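Problem with MSI: When another core reads a block held in Modified, the holder must write it back to memory before both copies become Shared. That memory write is wasteful if the block is about to be modified again, and later read misses are served by slow memory even though a cache already holds the data.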

O (Owner) state: A core modified the data and shared it — it is responsible for:

  1. Responding to read requests from other cores
  2. Writing back to memory when the block is evicted

| State | Meaning |
| --- | --- |
| M | Core has modified; only valid copy |
| O | Core has modified; has shared with ≥1 other core (owner responsible for writeback) |
| S | ≥1 core has clean copy |
| I | Invalid |
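
Because the owner keeps the dirty block and writes it back only on eviction, repeated write-then-share rounds cost fewer memory writes than in MSI. The sketch below counts memory writebacks for one core's copy of a block; `snoop_read`, `evict`, and `writebacks` are hypothetical helpers, and the model ignores the sharers' own state.

```python
def snoop_read(state, protocol):
    """React to another core's read. Returns (next_state, memory_writebacks)."""
    if state == "M":
        # MSI must clean the block before sharing; MOSI keeps it dirty in O.
        return ("S", 1) if protocol == "MSI" else ("O", 0)
    return (state, 0)   # O keeps supplying data; S and I do nothing here


def evict(state):
    """Evict the block. Only dirty states (M, O) write back to memory."""
    return 1 if state in ("M", "O") else 0


def writebacks(protocol, rounds):
    """One core alternates local write / remote read `rounds` times, then evicts."""
    state, total = "I", 0
    for _ in range(rounds):
        state = "M"   # local write; any other copies are invalidated
        state, wb = snoop_read(state, protocol)
        total += wb
    return total + evict(state)


print(writebacks("MSI", 3), writebacks("MOSI", 3))  # 3 1
```

In this model MSI pays one memory write per sharing round, while MOSI pays a single write when the owner finally evicts the block.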

MOESI Protocol

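Problem with MOSI: A core that reads a block no other cache holds still receives it in S, so a later write (S → M) must broadcast an invalidation even though there are no other copies to invalidate.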

E (Exclusive) state: Core is the only core with a clean copy.

When a block is in the E state:

| State | Meaning |
| --- | --- |
| M | Modified; only valid copy |
| O | Modified; shared; owner does writeback |
| E | Exclusive clean copy; can write silently |
| S | Shared clean copy |
| I | Invalid |
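
The benefit of a silent E → M upgrade shows up in the common read-then-write pattern on private data. A minimal sketch, assuming one bus transaction per miss and per invalidation; `read_then_write` is a hypothetical helper, not part of any protocol specification.

```python
def read_then_write(protocol, other_sharers):
    """Bus transactions for one core that reads, then writes, a block it does not hold."""
    bus = 1   # the initial read miss always goes on the bus
    # Only MOESI can install the block in E, and only if no other cache has it.
    state = "E" if protocol == "MOESI" and not other_sharers else "S"
    if state == "S":
        bus += 1   # S -> M must broadcast an invalidation
    # E -> M needs no bus transaction: the write is silent.
    return bus


print(read_then_write("MOSI", other_sharers=False))   # 2
print(read_then_write("MOESI", other_sharers=False))  # 1
print(read_then_write("MOESI", other_sharers=True))   # 2
```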

Directory-Based Coherence

Snooping limitation: Requires a shared bus — only scales to ~16 processors.

Directory approach: Each memory block has a directory entry; no broadcast needed.

Directory Structure

Directory Entry Fields

| Field | Meaning |
| --- | --- |
| Dirty bit | Is any cache’s copy dirty? |
| Present bits (1 per cache) | Is this block in a valid state in each cache? |

For an 8-core system: 8 present bits per directory entry.

Communication: After a request, the directory sends a command to the relevant caches; caches send an acknowledgement back.
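
A directory entry with a dirty bit and present bits can be sketched as a small class with a bitmask. `DirectoryEntry` and its methods are hypothetical, and the request handling is deliberately simplified: it returns only which caches the directory would have to contact and wait on for acknowledgements.

```python
class DirectoryEntry:
    """Directory entry for one memory block: a dirty bit plus one present bit per cache."""

    def __init__(self, num_caches=8):
        self.num_caches = num_caches
        self.dirty = False
        self.present = 0   # bit i set => cache i holds a valid copy

    def sharers(self):
        return [i for i in range(self.num_caches) if (self.present >> i) & 1]

    def read(self, cache):
        """Handle a read miss; returns the caches the directory must contact."""
        # If a dirty copy exists, its holder must supply the data and clean it.
        contact = self.sharers() if self.dirty else []
        self.dirty = False
        self.present |= 1 << cache
        return contact

    def write(self, cache):
        """Handle a write request; returns the caches that must invalidate and acknowledge."""
        contact = [i for i in self.sharers() if i != cache]
        self.present = 1 << cache   # the writer is now the only holder
        self.dirty = True
        return contact


entry = DirectoryEntry()
entry.read(0)
entry.read(3)
print(bin(entry.present))   # 0b1001
print(entry.write(5))       # [0, 3]
print(entry.read(1))        # [5]
print(entry.sharers())      # [1, 5]
```

Only the caches named in the present bits are contacted, which is what removes the need for a broadcast.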


Cache Misses with Coherence: 4 Cs

The classic “3 Cs” become 4:

| Miss Type | Cause |
| --- | --- |
| Compulsory | First access to a block |
| Capacity | Cache too small |
| Conflict | Limited associativity |
| Coherence | Another core invalidated/updated the block |

Two Types of Coherence Misses

| Type | Description |
| --- | --- |
| True Sharing | Different cores genuinely access the same data (expected coherence cost) |
| False Sharing | Different cores access different data that happens to be in the same cache block; coherence treats them as the same |

False sharing can be reduced by padding data structures so independently-accessed fields land in different cache blocks.
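
The effect of padding can be illustrated by computing which cache block each field lands in. The sketch assumes a 64-byte block, a block-aligned structure, and two 8-byte fields, each written by a different core; `layout` and `block_of` are hypothetical helpers.

```python
BLOCK_SIZE = 64   # bytes; an assumed block size, not taken from the notes


def block_of(offset):
    """Index of the cache block that contains this byte offset."""
    return offset // BLOCK_SIZE


def layout(field_sizes, pad_to_block=False):
    """Return the starting offset of each field in a struct-like layout."""
    offsets, offset = [], 0
    for size in field_sizes:
        offsets.append(offset)
        offset += size
        if pad_to_block:
            # Round up so the next field starts on a block boundary.
            offset = -(-offset // BLOCK_SIZE) * BLOCK_SIZE
    return offsets


packed = layout([8, 8])                      # [0, 8]
padded = layout([8, 8], pad_to_block=True)   # [0, 64]

print(block_of(packed[0]) == block_of(packed[1]))   # True: same block, false sharing possible
print(block_of(padded[0]) == block_of(padded[1]))   # False: separate blocks
```

Padding trades memory for fewer coherence misses: each padded field occupies a whole block in this layout.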


Summary

Key Takeaways:

Common Exam Topics:

See Also: 08-Advanced-Caches