Cache Coherence

Prerequisites: 08-Advanced-Caches

Learning Goals: Understand the cache coherence problem in multicore processors, the two fundamental approaches (update vs. invalidate), major coherence protocols (MSI, MOSI, MOESI), and directory-based coherence for scalable systems.

The Cache Coherence Problem

In a multicore system, each core has its own private cache. If several cores cache the same memory location and one of them writes to it, their copies can diverge.

Incoherent: Each cache copy behaves as an independent copy instead of as the same shared memory location.

Programmer expectation: shared memory — any write by one core should be visible to all cores.
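The divergence described above can be reproduced with a toy model. This is a minimal sketch, assuming two cores with private write-back caches and no coherence protocol at all; the class name PrivateCache and the address 0x100 are hypothetical and chosen only for illustration.

```python
# Toy model: two cores, private write-back caches, NO coherence protocol.
memory = {0x100: 0}              # shared main memory: address -> value


class PrivateCache:
    def __init__(self):
        self.lines = {}          # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]              # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value             # only the local copy changes


core_a, core_b = PrivateCache(), PrivateCache()
core_b.read(0x100)               # B caches the old value 0
core_a.write(0x100, 42)          # A updates its own copy only
stale = core_b.read(0x100)       # B hits in its own cache
print(stale, core_a.read(0x100)) # prints: 0 42
```

Core B keeps reading 0 even though core A has written 42, which is exactly the behaviour a shared-memory programmer does not expect.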

Coherence Requirements

Three requirements for a coherent memory system:

1. A core reading a memory location receives the value written by the last valid write
2. If core A writes and core B later reads the same location, with no other write in between, B sees A’s value
3. All cores agree on the order of writes to any given memory location

Approaches to Coherence (Non-Solutions First)

Approach	Problem
No caches	Correct, but terrible performance
All cores share one L1 cache	Correct, but terrible performance
Private write-through caches (no protocol)	Not coherent

Maintaining Coherence Property 2: Update vs. Invalidate

Strategy	Mechanism	When Better
Write-Update	On a write, broadcast the new value to all caches holding that block	One core writes, many cores read frequently
Write-Invalidate	On a write, invalidate all other copies	Burst writes to one address; writes to different words in same block; thread migration

Most modern processors use write-invalidate: it handles thread migration better and avoids repeated broadcasts when one core writes the same block many times.
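The "When Better" column can be checked by counting bus transactions on small access traces. The following is a rough sketch under simplifying assumptions: a single block, no evictions, one transaction per miss, and a write miss under write-invalidate that invalidates other copies as part of its fetch. The function name bus_transactions and both traces are made up for this illustration.

```python
def bus_transactions(trace, policy):
    """Count bus transactions for accesses to ONE block.

    trace  : list of (core, op) pairs, op is "R" or "W"
    policy : "update" or "invalidate"
    """
    holders = set()                  # cores that hold a valid copy
    count = 0
    for core, op in trace:
        miss = core not in holders
        others = holders - {core}    # other cores with a valid copy
        if miss:
            count += 1               # fetch the block over the bus
            holders.add(core)
        if op == "W" and others:
            if policy == "invalidate":
                if not miss:
                    count += 1       # separate invalidation message
                holders = {core}     # all other copies become invalid
            else:
                count += 1           # broadcast the new value to the others
    return count


# Core 1 reads once, then core 0 writes the block ten times in a row.
burst = [(1, "R")] + [(0, "W")] * 10
# Core 0 writes, core 1 reads, repeated ten times.
producer_consumer = [(0, "W"), (1, "R")] * 10

for name, trace in (("burst", burst), ("producer/consumer", producer_consumer)):
    print(name,
          "update:", bus_transactions(trace, "update"),
          "invalidate:", bus_transactions(trace, "invalidate"))
```

In this model the burst trace costs 12 transactions under write-update but only 2 under write-invalidate, while the producer/consumer trace costs 11 under write-update and 20 under write-invalidate, matching the table's guidance.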

Maintaining Coherence Property 3: Ordering

Mechanism	Description
Snooping	All writes go on a shared bus; all cores monitor (snoop) the bus
Directory	Each block has a directory entry tracking its state; no broadcast needed

Write-Update Optimization: Dirty Bit

Problem: Broadcasting all writes to memory creates a bandwidth bottleneck.

Solution: Delay writes to memory using a dirty bit per cache block.

When a core writes → broadcast to update other caches, set dirty bit
Write to main memory deferred until the dirty block is evicted

Benefits:

Greatly reduces writes to memory
Greatly reduces reads from memory (dirty copy is served from cache)

Write-Invalidate Snooping

A write causes all other copies to be invalidated
The writing cache becomes the only valid copy
Other cores that read next will miss and request the data
A shared bit indicates whether multiple caches have clean copies

Disadvantage: Every reader gets a miss when a core writes.

Advantage: Multiple consecutive writes to the same block are fast, because there is no need to broadcast after the first write (no other valid copies remain).

MSI Protocol

An invalidation-based snooping protocol with 3 states:

State	Meaning
I (Invalid)	This cache does not have a valid copy
S (Shared)	This cache has a clean (read-only) copy; other caches may also hold one
M (Modified)	This cache has the only valid (dirty) copy

MSI State Transitions (Summary)

Current State	Event	Action	Next State
I	Local read	Put Read on bus; get data	S
I	Local write	Put Write+Invalidate on bus; get data	M
S	Local read	—	S
S	Local write	Put Invalidation on bus	M
S	Snoop write on bus	Invalidate	I
M	Local read/write	—	M
M	Snoop read on bus	Write back; supply data	S
M	Snoop write on bus	Write back; supply data	I
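The transition table above can be transcribed into a lookup table and stepped through for a multi-core access sequence. This is a sketch, not a full protocol model: it tracks one block, ignores data values and evictions, and adds one row the summary table omits (a Shared copy that snoops a read stays Shared), which is an assumption based on standard MSI behaviour. The names MSI and run are hypothetical.

```python
# (state, event) -> (bus action, next state)
MSI = {
    ("I", "local_read"):  ("Read", "S"),
    ("I", "local_write"): ("Write+Invalidate", "M"),
    ("S", "local_read"):  (None, "S"),
    ("S", "local_write"): ("Invalidate", "M"),
    ("S", "snoop_read"):  (None, "S"),        # assumed row, not in the table
    ("S", "snoop_write"): (None, "I"),
    ("M", "local_read"):  (None, "M"),
    ("M", "local_write"): (None, "M"),
    ("M", "snoop_read"):  ("WriteBack", "S"),
    ("M", "snoop_write"): ("WriteBack", "I"),
}


def run(trace, n_cores=2):
    """trace: list of (core, op), op is "read" or "write".

    Returns the per-core states after each access, e.g. "MI".
    """
    states = ["I"] * n_cores
    log = []
    for core, op in trace:
        action, next_state = MSI[(states[core], "local_" + op)]
        if action is not None:               # request is visible on the bus
            for other in range(n_cores):
                if other != core and states[other] != "I":
                    states[other] = MSI[(states[other], "snoop_" + op)][1]
        states[core] = next_state
        log.append("".join(states))
    return log


trace = [(0, "read"), (1, "read"), (0, "write"), (1, "read"), (1, "write")]
print(run(trace))   # ['SI', 'SS', 'MI', 'SS', 'IM']
```

Each string lists the state of core 0 followed by core 1, so 'MI' means core 0 holds the block Modified and core 1's copy is Invalid.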

Cache-to-Cache Transfers

When cache C1 has a block in the M state and C2 requests it:

Method	Description	Cost
Abort and Retry	C1 aborts C2’s request; C2 retries after writeback	2× memory latency
Intervention	C1 tells memory it will respond directly to C2	1× memory latency (better)

Modern processors use Intervention.

Intervention requires an extra signal on the bus; hardware is more complex but faster.

MOSI Protocol

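Problem with MSI: When another core reads a block held in Modified, the block must be written back to memory before it becomes Shared, even if it will be written again soon (wasteful memory traffic).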

O (Owner) state: A core modified the data and shared it — it is responsible for:

Responding to read requests from other cores
Writing back to memory when the block is evicted

State	Meaning
M	Core has modified; only valid copy
O	Core has modified; has shared with ≥1 other core (owner responsible for writeback)
S	≥1 core has a read-only copy (memory may be stale while an owner exists)
I	Invalid
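The saving from the Owner state shows up when counting writes to main memory. The sketch below assumes a simple pattern (core 0 writes a block, core 1 then reads it, repeated, followed by one eviction from core 0) and a hypothetical helper named memory_writes; it tracks only core 0's state.

```python
def memory_writes(protocol, rounds):
    """Count main-memory writes for: core 0 writes, core 1 reads, repeated.

    protocol : "MSI" or "MOSI"
    After the last round, core 0 evicts the block.
    """
    state0 = "I"
    writes = 0
    for _ in range(rounds):
        state0 = "M"              # core 0 writes (other copies invalidated)
        if protocol == "MSI":
            writes += 1           # snooped read: M -> S forces a write-back
            state0 = "S"
        else:
            state0 = "O"          # snooped read: M -> O, data goes cache-to-cache
    if state0 in ("M", "O"):
        writes += 1               # dirty block is written back on eviction
    return writes


print(memory_writes("MSI", 10), memory_writes("MOSI", 10))   # 10 1
```

Under these assumptions MSI writes memory once per round, while MOSI defers everything to a single write-back when the owner evicts the block.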

MOESI Protocol

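Problem with MOSI: A core that reads and then writes data no other core uses still goes I → S → M, and the S → M upgrade puts an invalidation on the bus even though no other cache holds the block.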

E (Exclusive) state: Core is the only core with a clean copy.

When a block is in the E state:

No other core has a copy
This core can move directly to M on a write (no bus transaction needed!)

State	Meaning
M	Modified; only valid copy
O	Modified; shared; owner does writeback
E	Exclusive clean copy; can write silently
S	Shared clean copy
I	Invalid
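The effect of the E state on a read followed by a write can be sketched with two small helpers. Both function names are hypothetical, and the model only answers two questions: which state a read miss installs, and whether a later write hit needs the bus.

```python
def read_miss_state(has_e_state, other_copies):
    """State installed by a read miss when no cache holds the block dirty."""
    if has_e_state and other_copies == 0:
        return "E"                # sole clean copy
    return "S"                    # without E, a read miss always installs S


def write_hit(state):
    """Return (bus_transaction_needed, next_state) for a local write hit."""
    if state in ("M", "E"):
        return (False, "M")       # sole copy: upgrade silently
    if state in ("S", "O"):
        return (True, "M")        # other copies may exist: invalidate them
    raise ValueError("a write hit is impossible in state I")


# Private data: read, then write, with no other core holding the block.
for has_e in (False, True):
    state = read_miss_state(has_e, other_copies=0)
    print(has_e, state, write_hit(state))
```

Without E the write after a private read still needs a bus transaction; with E it does not.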

Directory-Based Coherence

Snooping limitation: Requires a shared bus — only scales to ~16 processors.

Directory approach: Each memory block has a directory entry; no broadcast needed.

Directory Structure

Distributed across all cores — each core has a slice of the directory
Each slice manages a set of memory blocks

Directory Entry Fields

Field	Meaning
Dirty bit	Is any cache’s copy dirty?
Present bits (1 per cache)	Is this block in a valid state in each cache?

For an 8-core system: 8 present bits per directory entry.

Communication: After a request, the directory sends a command to the relevant caches; caches send an acknowledgement back.
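A directory entry with a dirty bit and one present bit per cache can be sketched as a small bitmask structure. Everything here is illustrative: the class DirectoryEntry, the block-interleaved home_slice mapping, the 64-byte block size, and the choice that a dirty owner keeps a clean copy after servicing a read are all assumptions, not details given in these notes.

```python
def home_slice(addr, block_size=64, n_slices=8):
    """Assumed mapping: consecutive blocks are interleaved across slices."""
    return (addr // block_size) % n_slices


class DirectoryEntry:
    def __init__(self, n_caches=8):
        self.n_caches = n_caches
        self.dirty = False
        self.present = 0          # bit i set -> cache i holds a valid copy

    def sharers(self):
        return [i for i in range(self.n_caches) if (self.present >> i) & 1]

    def read_miss(self, cache):
        """Return the caches the directory must contact for this read."""
        targets = self.sharers() if self.dirty else []   # ask the dirty owner
        self.dirty = False        # assumption: owner writes back, keeps a copy
        self.present |= 1 << cache
        return targets

    def write_miss(self, cache):
        """Return the caches that must be told to invalidate."""
        targets = [i for i in self.sharers() if i != cache]
        self.present = 1 << cache # requester becomes the only holder
        self.dirty = True
        return targets


entry = DirectoryEntry()
entry.read_miss(0)
entry.read_miss(3)
print(entry.write_miss(5))   # [0, 3]: only these two caches get a command
print(entry.read_miss(1))    # [5]: the dirty owner supplies the data
```

Only caches whose present bit is set receive a message, which is what removes the need for a broadcast.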

Cache Misses with Coherence: 4 Cs

The classic “3 Cs” become 4:

Miss Type	Cause
Compulsory	First access to a block
Capacity	Cache too small
Conflict	Limited associativity
Coherence	Another core invalidated/updated the block

Two Types of Coherence Misses

Type	Description
True Sharing	Different cores genuinely access the same data (expected coherence cost)
False Sharing	Different cores access different data that happens to be in the same cache block — coherence treats them as the same

False sharing can be reduced by padding data structures so independently-accessed fields land in different cache blocks.
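The effect of padding can be illustrated by counting invalidations at block granularity. This is a simplified model, assuming a 64-byte block, write-only traffic, and two per-core counters; the names coherence_invalidations, packed, and padded are made up for the example.

```python
BLOCK = 64   # assumed cache block size in bytes


def coherence_invalidations(accesses, block=BLOCK):
    """accesses: list of (core, byte_address) writes.

    Counts how often a write hits a block last written by another core,
    which forces that core's copy to be invalidated.
    """
    last_writer = {}              # block number -> core that last wrote it
    invalidations = 0
    for core, addr in accesses:
        b = addr // block
        if b in last_writer and last_writer[b] != core:
            invalidations += 1
        last_writer[b] = core
    return invalidations


# Cores 0 and 1 alternately update their own counter, 100 writes in total.
packed = [(i % 2, 8 * (i % 2)) for i in range(100)]       # addresses 0 and 8
padded = [(i % 2, BLOCK * (i % 2)) for i in range(100)]   # addresses 0 and 64

print(coherence_invalidations(packed))   # 99
print(coherence_invalidations(padded))   # 0
```

The two cores never touch the same counter, yet the packed layout invalidates on nearly every write because both counters share one block; moving the second counter to its own block removes the traffic.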

Summary

Key Takeaways:

Coherence = every core sees a consistent view of memory
Write-invalidate dominates modern processors (better for thread migration, burst writes)
Snooping: simple, scales to ~16 cores; requires shared bus
Directory: scales to many cores; no bus needed; more latency per operation
MSI → add O state for owner writeback avoidance (MOSI) → add E state for silent upgrades (MOESI)
Coherence misses: true sharing (unavoidable) and false sharing (avoidable by padding)

Common Exam Topics:

Draw state transitions for MSI/MOESI given a sequence of read/write operations from multiple cores
Identify true vs. false sharing scenarios
Compare snooping vs. directory for scalability

See Also: 08-Advanced-Caches