HPCA Study Guide — CS 6290 Georgia Tech
Based on all 10 Problem Sets + Solutions
How to Use This Guide
Each section follows this structure:
- Core Concepts — what you must know cold
- Key Formulas — the equations that appear on every problem
- Problem Patterns — the types of questions asked, with worked strategy
- Practice Questions — new questions modeled after the problem sets
- Common Traps — mistakes the solutions explicitly call out
Module 1: Metrics and Evaluation
Core Concepts
- Iron Law:
Time = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
- Amdahl’s Law: overall speedup is limited by the fraction of execution time that cannot be enhanced
- CPI (Cycles Per Instruction): weighted average across instruction types
- The “unenhanced” portion is the ceiling on speedup — no matter how fast the rest gets
Key Formulas
Amdahl’s Law (single enhancement):
S = 1 / [(1 - F_e) + F_e/S_e]
where F_e = fraction of execution time the enhancement applies to, and S_e = speedup of the enhanced portion
Generalized Amdahl’s Law (multiple non-overlapping enhancements):
S = 1 / [(1 - Σf_i) + Σ(f_i / S_i)]
Iron Law speedup ratio (when instruction count doesn’t change):
S = (CPI_old × CycleTime_old) / (CPI_new × CycleTime_new)
Weighted CPI:
CPI = Σ (fraction_i × CPI_i)
Cache miss penalty CPI adjustment:
CPI_new = CPI_base + (miss_rate × miss_penalty × fraction_memory_instructions)
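To make the arithmetic mechanical, here is a minimal calculator sketch of these formulas in Python (function names are my own, not from the problem sets; fractions are of original execution time):

```python
def amdahl(f_e, s_e):
    """Overall speedup: one enhancement covering fraction f_e at speedup s_e."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

def amdahl_general(fractions, speedups):
    """Generalized Amdahl for non-overlapping enhancements."""
    residual = 1.0 - sum(fractions)
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (residual + enhanced)

def weighted_cpi(mix):
    """mix: iterable of (instruction_fraction, cpi) pairs."""
    return sum(f * c for f, c in mix)

print(amdahl(0.4, 10))                                 # ≈ 1.5625 (Pattern 1 example)
print(amdahl_general([0.15, 0.25, 0.40], [20, 8, 4]))  # P-1.3 numbers
print(weighted_cpi([(0.3, 6), (0.7, 1.2)]))            # P-1.2 baseline CPI: 2.64
```

The same helpers cover the single-enhancement, multi-enhancement, and weighted-CPI patterns below.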
Problem Patterns
Pattern 1 — Single enhancement (P1, P3):
- Identify F_e and S_e, then plug into Amdahl’s Law
- Example: 40% of time sped up 10x → S = 1/(0.6 + 0.04) ≈ 1.56
Pattern 2 — Compare two design choices (P2, P4):
- Compute CPI or execution time for each design separately, then compare
- For P4 “Triple RISC”: use generalized Amdahl with 1/3 at 3x, 1/3 at 2x, 1/3 at 1x
- For “RISCily Fast”: use Iron Law, compute new CPI by adding miss penalty × rate
Pattern 3 — Four-part Amdahl (P5):
- Treat parts with S=1 as “unenhanced residual” in denominator
- Key insight: a 20x speedup on 23% matters less than a 1.6x speedup on 48%
Pattern 4 — Nested enhancements (P6 — L1 + L2 cache):
- Do NOT multiply individual speedups — that is wrong
- Use generalized Amdahl: L1 covers 80%×30% = 24%, L2 covers ½×20%×30% = 3%
- Both applied simultaneously in the formula
Pattern 5 — Solve for unknown fraction (P9):
- Set Amdahl’s Law equal to the target speedup and solve for F_e algebraically
Pattern 6 — Fraction of enhanced time with no enhancement active (P10):
F_NE_enhanced = (1 - Σf_i) × S_total
This equals the unenhanced time divided by the new total time.
Pattern 7 — Best single / best pair enhancement (P11):
- Compute S for each individually, pick highest
- For pairs: compute all three combinations; note that the best pair usually includes the two individually best enhancements
Practice Questions
P-1.1: An enhancement speeds up 60% of execution time by 8x. What is the overall speedup?
P-1.2: A processor has 30% FP instructions (CPI=6) and 70% integer (CPI=1.2). Two options:
- Option A: Reduce FP CPI to 3
- Option B: Reduce integer CPI to 1.0
Which option gives better overall CPI?
P-1.3: Three enhancements affect 15%, 25%, and 40% of execution time with speedups 20, 8, and 4 respectively. Assume no overlap. What is the overall speedup?
P-1.4: Given the same three enhancements above, what fraction of the reduced execution time has no enhancement active?
P-1.5 (challenge): Enhancement A affects 30% (speedup=50) and Enhancement B has speedup=10. What fraction of time must Enhancement B be used to achieve an overall speedup of 5?
Common Traps
- Overlap trap (P8): If the problem doesn’t say enhancements affect different parts, you cannot use generalized Amdahl’s Law — the answer may be indeterminate
- L1+L2 trap (P6): Never multiply speedups sequentially. After L1 is applied, the remaining miss time is no longer 20% of original — it’s larger as a fraction of new time
- Clock frequency vs CPI: When clock frequency changes, use Iron Law — don’t just compare CPI alone (P4 “RISCily Fast”)
- Amdahl’s ceiling: speedup can never exceed 1/(1 - F_e), regardless of how large S_e gets
Module 2: Pipelining and Hazards
Core Concepts
- 5 stages: IF → ID → EX → MEM → WB
- Hazard types: Structural, Data (RAW/WAW/WAR), Control
- Dependency vs Hazard: Dependency is a property of the code; hazard is a runtime problem caused by pipelining
- True dependency: RAW — actual data flows from producer to consumer
- Name dependencies: WAR and WAW — only about register names, not actual data
- Forwarding: Passes results directly to later stages, avoiding stalls for most RAW hazards
- Load-use hazard: Even with forwarding, a load followed immediately by a use of the loaded value requires 1 stall cycle
Key Formulas
Pipelining speedup (ideal):
S = (CPI_unpipelined × T_cycle_unpipelined) / (CPI_pipelined × T_cycle_pipelined)
CPI with stalls:
CPI_actual = CPI_ideal + stalls_per_instruction
Pipeline clock cycle time:
T_cycle = max(stage_latencies) + latch_overhead
Unpipelined execution time per instruction:
T_avg = T_cycle × weighted_average_CPI
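A quick sketch of this arithmetic using the P-2.2 numbers (helper names are mine; it assumes the unpipelined single-cycle design has no latch overhead):

```python
def pipeline_cycle_time(stage_latencies_ps, latch_overhead_ps):
    """Pipelined cycle time: slowest stage plus one latch delay."""
    return max(stage_latencies_ps) + latch_overhead_ps

def pipeline_speedup(stage_latencies_ps, latch_overhead_ps, cpi_pipelined):
    """Speedup over a single-cycle (CPI=1) design whose cycle covers all stages."""
    t_unpipelined = sum(stage_latencies_ps)  # one long cycle, no latches assumed
    t_pipelined = pipeline_cycle_time(stage_latencies_ps, latch_overhead_ps)
    return (1.0 * t_unpipelined) / (cpi_pipelined * t_pipelined)

stages = [150, 80, 200, 220, 70]                        # IF, ID, EX, MEM, WB
print(pipeline_cycle_time(stages, 15))                  # 235 ps
print(pipeline_speedup(stages, 15, cpi_pipelined=1.3))  # ≈ 2.36
```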
Problem Patterns
Pattern 1 — Identify dependencies (P1, P3):
- Go through each instruction pair; for every register that is written by instruction A and read/written by instruction B (B after A), note the dependency type
- RAW: A writes, B reads same register
- WAR: A reads, B writes same register (B would overwrite before A reads, if reordered)
- WAW: A writes, B writes same register
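The scan above is mechanical enough to code. A sketch, assuming each instruction is encoded as a (destination, [sources]) pair (my convention, not the problem sets'):

```python
def classify(instrs):
    """Check every later instruction B against every earlier instruction A."""
    deps = []
    for i, (a_dst, a_srcs) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            b_dst, b_srcs = instrs[j]
            if a_dst in b_srcs:
                deps.append((f"I{i+1}", f"I{j+1}", "RAW", a_dst))
            if b_dst in a_srcs:
                deps.append((f"I{i+1}", f"I{j+1}", "WAR", b_dst))
            if a_dst == b_dst:
                deps.append((f"I{i+1}", f"I{j+1}", "WAW", a_dst))
    return deps

# P-2.1 code: ADD R1,R2,R3 / SUB R4,R1,R5 / MUL R2,R4,R6 / DIV R1,R2,R3
code = [("R1", ["R2", "R3"]),
        ("R4", ["R1", "R5"]),
        ("R2", ["R4", "R6"]),
        ("R1", ["R2", "R3"])]
for dep in classify(code):
    print(dep)
```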
Pattern 2 — Pipeline timing diagrams (P4, P5):
- Without forwarding: consumer must wait until WB stage of producer (can read in same cycle as write)
- With forwarding: result available at end of EX, can be forwarded to start of next EX
- Mark stall cycles as “S” in the pipeline diagram
- Count total cycles from IF of first instruction to end of EX of last instruction
Pattern 3 — Delay slots (P5A):
- The instruction in the delay slot always executes after the branch
- Move an independent instruction into the slot (e.g., the SD whose address is now -4(R2) because R2 was already incremented)
Pattern 4 — Clock cycle time and 3-stage / 6-stage pipelines (P6):
- T_cycle = max(combined_stage_latency) + latch_overhead per intermediate latch
- To minimize clock cycle time: balance stages as evenly as possible
- To split for 6 stages: split the bottleneck stage (longest latency)
Pattern 5 — Out-of-order execution table (P12):
- Instructions stall in DI if operand not ready (no forwarding from WB)
- WB stage can only handle one instruction at a time
- WAW hazard: name-dependent instruction must not complete WB before the earlier instruction
Practice Questions
P-2.1: For the following code, identify all RAW, WAW, and WAR dependencies:
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: MUL R2, R4, R6
I4: DIV R1, R2, R3
P-2.2: A 5-stage pipeline has stage latencies IF=150ps, ID=80ps, EX=200ps, MEM=220ps, WB=70ps. Each latch adds 15ps. (a) What is the clock cycle time? (b) If CPI=1 unpipelined (single-cycle) and CPI=1.3 pipelined, what is the speedup?
P-2.3: For the following loop with forwarding, show the pipeline timing and count the cycles for one iteration:
Loop: LD R1, 0(R2)
ADD R3, R1, R4
SW 0(R2), R3
ADDI R2, R2, 4
BNE R2, R5, Loop
P-2.4: What is the difference between a structural hazard and a data hazard? Give one example of each.
Common Traps
- Dependency ≠ hazard: A WAR dependency between I1 and I5 may not be a hazard if they are far enough apart in the pipeline
- Same-cycle WB/ID: If the problem says “registers can be read and written in the same cycle during WB,” then an instruction in ID that needs a value can get it from WB without stalling
- Delay slot: The instruction in the delay slot always executes — even if the branch is not taken
Module 3: Branch Prediction
Core Concepts
- Always-taken predictor: predict every branch as taken; mispredicts on not-taken branches and on the last iteration of loops
- 1-bit predictor: remembers last outcome; poor for loops (mispredicts twice per loop — exit and first re-entry)
- 2-bit saturating counter: states 0(SN), 1(WN), 2(WT), 3(ST); changes prediction only after two consecutive wrong predictions; much better for loops
- Local history predictor: per-branch history register selects which 2-bit counter to use; captures patterns within a single branch’s behavior
- Aliasing: two different branches hashing to the same predictor entry corrupt each other’s state
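A minimal simulator for the 2-bit saturating counter described above (state encoding 0=SN through 3=ST as listed; the trace and starting state below are illustrative assumptions):

```python
def simulate_2bit(outcomes, state=0):
    """outcomes: actual branch results, True = taken. Predict taken when state >= 2."""
    mispredicts = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction != taken:
            mispredicts += 1
        # saturate at 0 (strongly NT) and 3 (strongly T)
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredicts

# A loop taken 9 times then exiting (not taken), run twice, starting at SN:
trace = ([True] * 9 + [False]) * 2
print(simulate_2bit(trace))  # 4: two warm-up misses, then one per loop exit
```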
Key Formulas
Misprediction count for always-taken:
- Always-taken branches: 0 mispredictions per execution
- Always-not-taken branches: N mispredictions (one per execution)
- Loop-exit branches: 1 misprediction per loop execution (the exit)
- Unconditional branches (BEQ R0,R0,…): always taken, 0 mispredictions
CPI with branch penalty (from ROB PS Problem 1):
CPI = CPI_base + f_branch × [f_taken × (f_miss×2 + (1-f_miss)×1) + (1-f_taken) × (f_miss×2 + (1-f_miss)×0)]
Problem Patterns
Pattern 1 — Count mispredictions for always-taken (P1):
- Trace the actual execution of the nested loop
- For each branch instruction, determine how many times it is taken vs not-taken
- Always-taken mispredicts when branch is actually NOT taken
Pattern 2 — 1-bit predictor with 8 entries (P2):
- Map each branch address to a table entry using bits [4:2] of the PC (after stripping 2 LSBs)
- Track aliasing: if two branches share an entry, they interfere
- Follow state transitions: 0→taken prediction wrong → state flips
Pattern 3 — Local history predictor (P3):
- Each entry has a 2-bit history register and 4 2-bit saturating counters (one per history pattern)
- On each branch: use current history bits to index counter → predict → update counter → shift history
Pattern 4 — Branch resolution location and data hazards (P3A):
- Resolving branches in ID reduces control hazard stalls (only 1 cycle penalty instead of 2)
- But ID runs before EX, so it needs register values earlier → creates more data hazard stalls for branches that read registers being computed by recent instructions
Practice Questions
P-3.1: For a loop that executes 100 times, how many mispredictions occur with: (a) always-taken, (b) always-not-taken, (c) 1-bit predictor initialized to “not taken”?
P-3.2: Two branches at addresses 0x200 and 0x220 share the same entry in a 4-entry 1-bit predictor (initialized to 0). Branch at 0x200 is always taken. Branch at 0x220 alternates T/NT/T/NT… How many mispredictions occur in 8 executions of each?
P-3.3: Why does increasing branch predictor size beyond a point yield diminishing returns?
Common Traps
- Address indexing: strip the 2 LSBs first, then take the N least-significant bits of the result to index into an N-entry table
- Unconditional branches (BEQ R0,R0): always predicted taken = always correct with always-taken predictor
- 1-bit aliasing: even if the predictor has enough entries, aliasing can occur if two branches happen to hash to the same entry
Module 4: ILP and Register Renaming
Core Concepts
- ILP (Instruction-Level Parallelism): number of instructions that can execute simultaneously
- True ILP limit: only RAW dependencies are true; WAR/WAW can be eliminated by renaming
- Register renaming: assigns physical registers to eliminate false (WAR/WAW) dependencies
- Practical ILP limits: branch mispredictions, cache misses, limited rename registers, limited issue width, load/store ordering
Key Formulas
Minimum cycles with N-way issue:
Cycles = max(critical_path_length, ceil(num_instructions / N))
Where critical path = longest chain of RAW-dependent instructions
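A one-liner sketch of this bound (assumes all latencies are 1, as in P-4.2; names are mine):

```python
import math

def min_cycles(num_instructions, issue_width, critical_path_length):
    """Lower bound from the RAW chain and from issue bandwidth."""
    return max(critical_path_length, math.ceil(num_instructions / issue_width))

# P-4.2: 6 instructions, 3-way issue; the longest RAW chain is
# ADD R1 -> MUL R4 -> SUB R6 (length 3)
print(min_cycles(6, 3, 3))  # 3
```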
Problem Patterns
Pattern 1 — Register renaming (P1):
- Scan instructions in order
- When an instruction WRITES a register: assign a new physical location (L4, L5, …)
- When an instruction READS a register: use whatever location that register currently maps to
- Example: with the initial mapping R1=L1, R2=L2, R3=L3, the instruction MUL R1, R2, R3 becomes MUL L4, L2, L3: the destination R1 gets the new name L4, while the source reads use the current mappings L2 and L3 (see the sketch below)
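A sketch of that scan (the L1/L2/L3 initial mapping and L4-onward naming follow the text; the instruction encoding is my own):

```python
def rename(instrs, rat, next_loc=4):
    """instrs: list of (dest, src1, src2) architectural register names."""
    out = []
    for dst, s1, s2 in instrs:
        # reads use the CURRENT mapping, before the destination is remapped
        p1, p2 = rat[s1], rat[s2]
        rat[dst] = f"L{next_loc}"  # the writer gets a fresh physical name
        next_loc += 1
        out.append((rat[dst], p1, p2))
    return out

rat = {"R1": "L1", "R2": "L2", "R3": "L3"}
code = [("R1", "R2", "R3"),  # MUL R1, R2, R3 -> (L4, L2, L3)
        ("R2", "R1", "R3"),  # ADD R2, R1, R3 -> (L5, L4, L3)
        ("R1", "R2", "R3"),  # SUB R1, R2, R3 -> (L6, L5, L3)
        ("R3", "R1", "R2")]  # MUL R3, R1, R2 -> (L7, L6, L5)
for renamed in rename(code, rat):
    print(renamed)
```

This mirrors P-4.1: after renaming, only the RAW chains remain.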
Pattern 2 — Cycle count with 2-way issue (P2):
- Draw dependency graph
- Group independent instructions together for parallel issue
- Count cycles along the critical path
Pattern 3 — Cycle count with 4-way issue (P3):
- Same as above, but up to 4 instructions per cycle
- More parallelism available; the critical path (RAW chains) still limits
Practice Questions
P-4.1: Use register renaming to eliminate false dependencies. Assume 8 physical locations L1-L8; initially R1=L1, R2=L2, R3=L3:
MUL R1, R2, R3
ADD R2, R1, R3
SUB R1, R2, R3
MUL R3, R1, R2
P-4.2: How many cycles does the following take with 3-way issue (all instructions latency=1)?
ADD R1, R2, R3
MUL R4, R1, R5
SUB R6, R4, R7
ADD R8, R2, R9
LD R10, (R11)
MUL R12, R10, R4
P-4.3: List three factors that prevent a real processor from achieving the theoretical maximum ILP shown in benchmarks.
Common Traps
- Renaming only eliminates WAR and WAW: RAW dependencies remain and constrain execution order
- Critical path determines minimum cycles: even with infinite issue width, you can’t execute faster than the longest dependency chain
Module 5: Tomasulo’s Algorithm and Out-of-Order Execution
Core Concepts
- Reservation stations (RS): hold instructions waiting for operands; tag unavailable operands with the RS/ROB entry that will produce them
- Common Data Bus (CDB): broadcasts result + tag; all RS entries watching for that tag latch the value
- Issue: send instruction to RS (in-order); read available operands; tag unavailable ones
- Execute: when all operands ready, send to functional unit
- Write result: broadcast on CDB; free RS
- RAW resolution: when the producing instruction writes its result to CDB, all waiting RS entries automatically receive the value
Key Rules
- Issue is in-order; execution and write-result can be out-of-order
- An instruction cannot begin executing until all its operands are available
- The result is written in the LAST cycle of execution (not the cycle after)
- A dependent instruction can begin the cycle AFTER the result is written
Problem Patterns
Pattern 1 — Fill in Tomasulo scheduling table (Instruction Scheduling PS P1, P4):
- Issue cycle: first cycle where an RS and (for MUL/DIV) the execution unit is free
- First execution cycle: cycle after all operands are available (either from register file or CDB write)
- Write result cycle: first execution cycle + latency - 1 (result written in last exec cycle)
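The timing rules above reduce to a small helper (a sketch under the PS conventions quoted in Key Rules; parameter names are mine):

```python
def exec_and_write(operands_ready_cycle, unit_free_cycle, latency):
    """First exec = cycle after operands are ready, once the unit is free."""
    first_exec = max(operands_ready_cycle + 1, unit_free_cycle)
    write_result = first_exec + latency - 1  # result written in the LAST exec cycle
    return first_exec, write_result

# Operands broadcast on the CDB in cycle 2, MUL unit free from cycle 3,
# latency 4 -> executes cycles 3-6, result on the CDB in cycle 6
print(exec_and_write(operands_ready_cycle=2, unit_free_cycle=3, latency=4))  # (3, 6)
```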
Pattern 2 — Tomasulo with ROB (Interrupts PS P4):
- Same as above, but RS is freed when result is written (not at commit)
- Commit happens in-order; an instruction commits the cycle after it writes its result (earliest)
- ROB entry is freed at commit
Practice Questions
P-5.1: Schedule the following using Tomasulo’s algorithm. 1 MUL unit (latency 4, 2 RS), 1 ADD unit (latency 2, 3 RS). Issue starts at cycle 1:
I1: MUL F1, F2, F3
I2: ADD F4, F1, F5
I3: MUL F6, F4, F2
I4: ADD F1, F6, F3
P-5.2: Explain the role of the Register Alias Table (RAT) in Tomasulo’s algorithm. What does it store, and when is it updated?
P-5.3: Why can Tomasulo’s algorithm eliminate WAR and WAW hazards but not RAW hazards?
Common Traps
- Latency timing: if latency is 4 and first execution is cycle 3, write result is cycle 6 (cycles 3,4,5,6)
- Structural hazard: if the execution unit is already busy, the new instruction must wait even if all operands are ready
- Reservation station capacity: if all RS are full, no new instruction can issue — even if operands are ready
Module 6: ReOrder Buffer (ROB) and Exceptions
Core Concepts
- ROB purpose: enables in-order commit of out-of-order execution; supports precise exceptions; supports branch misprediction recovery
- ROB fields per entry: instruction, destination register, value, ready bit, exception bit
- Commit: instructions commit in-order from the head of the ROB when the ready bit is set
- Exception handling with ROB: set exception bit when exception detected; wait until that instruction reaches the head and would normally commit; then flush the ROB, RS, and reset the RAT
ROB Cleanup on Exception
- (a) Reservation stations: flush all instructions (they are speculative / after the excepting instruction)
- (b) ROB: flush all entries after the excepting instruction; the excepting instruction’s own entry is used to restore state and is then freed
- (c) RAT: reset all entries to point to the register file (architectural state is now correct after in-order commit up to the exception)
Key Formulas
CPI with branch penalties:
CPI = CPI_base + f_branch × [f_taken × (f_miss×P_T + (1-f_miss)×P_T_correct) + (1-f_taken) × (f_miss×P_NT + (1-f_miss)×0)]
Where P_T = penalty for a mispredicted taken branch, P_T_correct = penalty for a correctly predicted taken branch, and P_NT = penalty for a mispredicted not-taken branch
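Transcribing the formula into code (a sketch; the call below assumes a single misprediction penalty and no penalty for correctly predicted branches, which is how P-6.1 is posed):

```python
def cpi_with_branches(cpi_base, f_branch, f_taken, f_miss,
                      p_mispredict, p_taken_correct=0.0):
    taken_term = f_taken * (f_miss * p_mispredict
                            + (1 - f_miss) * p_taken_correct)
    not_taken_term = (1 - f_taken) * (f_miss * p_mispredict)
    return cpi_base + f_branch * (taken_term + not_taken_term)

# P-6.1: 20% branches, 70% taken, 90% accuracy, 3-cycle penalty, base CPI 1.0.
# With no penalty for correct predictions this reduces to 1.0 + 0.2*0.1*3 = 1.06.
print(cpi_with_branches(1.0, 0.2, 0.7, 0.1, 3))
```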
Problem Patterns
Pattern 1 — CPI calculation with branch predictor (P1, P2):
- Substitute all given fractions and penalties into the CPI formula
- P2: also account for frequency scaling (compare time = CPI × cycle_time)
Pattern 2 — Exception handling questions (P3, P5):
- P3: Without ROB, you cannot get correct register values because some instructions before I5 may not have committed yet, and some after may have already written to registers
- P5: The ROB allows in-order commitment up to (but not including) the faulting instruction, giving the correct architectural state
Pattern 3 — 2-bit branch prediction explanation (P4 Q1):
- The 2-bit counter starts in state “strongly not taken” (00); requires two wrong predictions before changing the prediction
- Two (or more) branches share the global/local history table
Practice Questions
P-6.1: A 5-stage pipeline has 20% branch instructions, 70% taken. The predictor is correct 90% of the time. Misprediction penalty is 3 cycles. Base CPI = 1.0. What is the actual CPI?
P-6.2: Explain why a processor without an ROB cannot guarantee a precise exception when instructions execute out of order.
P-6.3: After an exception is triggered and caught, what is the correct state in the register file? How does the ROB ensure this?
Module 7: Advanced Caches
Core Concepts
- VIPT cache (Virtually Indexed, Physically Tagged): uses virtual address bits for index, physical tag for comparison
- Aliasing occurs when two virtual addresses map to the same index but different physical pages
- Safe if index bits come only from the page offset (bits below the page boundary)
- Max page size for no aliasing: Cache_size / Associativity
- Cache geometry: Sets = Cache_size / (Block_size × Associativity)
- Tag bits = Physical_address_bits − Index_bits − Block_offset_bits
- PIPT cache (Physically Indexed, Physically Tagged): no aliasing, but requires TLB before cache lookup
- LRU replacement: LRU counter of 0 = least recently used (evict this one)
- Write-back + dirty bit: dirty bit set on writes; eviction of dirty block requires write-back to memory
- Prefetching: issue cache request N iterations ahead to hide miss latency
- Optimal prefetch distance ≈ miss_latency / iteration_time iterations ahead
Key Formulas
Index bits = log2(Sets)
Block offset bits = log2(Block_size)
Tag bits (PIPT) = Physical_address_bits − Index_bits − Block_offset_bits
Total tag storage = tag_bits × num_lines
For VIPT aliasing limit:
Max page size = Cache_size / Associativity
(The index bits must not extend above the page offset bits)
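These formulas as a quick calculator (a sketch; sizes in bytes, names mine):

```python
import math

def cache_geometry(cache_size, block_size, ways, pa_bits):
    sets = cache_size // (block_size * ways)
    offset_bits = int(math.log2(block_size))
    index_bits = int(math.log2(sets))
    tag_bits = pa_bits - index_bits - offset_bits
    max_vipt_page = cache_size // ways  # no-aliasing limit
    return sets, offset_bits, index_bits, tag_bits, max_vipt_page

# P-7.1: 4KB, 4-way, 64B blocks, 32-bit physical addresses
print(cache_geometry(4096, 64, 4, 32))  # (16, 6, 4, 22, 1024)
```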
Problem Patterns
Pattern 1 — Cache geometry (P1-P4):
- Compute sets, tag bits, and overhead bits systematically
Pattern 2 — Cache state table (P5-P14): For each access:
- Decode address: determine which set and what tag
- Check if any valid line in that set has matching tag → hit or miss
- On hit: update LRU; if write, set dirty bit
- On miss: choose victim (LRU=0); if victim is dirty → write-back; load new block; set tag, valid, LRU
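A sketch of this access procedure for one 2-way set (line fields and the LRU=0-means-evict convention follow the text; the example state mirrors P-7.2's Set 0, but the tag accessed is illustrative):

```python
def access(set_lines, tag, is_write):
    """set_lines: list of dicts with keys V, D, Tag, LRU. Returns an event string."""
    for line in set_lines:
        if line["V"] and line["Tag"] == tag:       # hit: update LRU, maybe dirty
            touch(set_lines, line)
            if is_write:
                line["D"] = 1
            return "hit"
    # miss: prefer an invalid line, else the LRU=0 line
    victim = min(set_lines, key=lambda l: (l["V"], l["LRU"]))
    writeback = bool(victim["V"] and victim["D"])  # dirty eviction needs write-back
    victim.update(V=1, D=1 if is_write else 0, Tag=tag)
    touch(set_lines, victim)
    return "miss + write-back" if writeback else "miss"

def touch(set_lines, used):
    """Mark `used` most recently used (2-way set: the other line drops to 0)."""
    for line in set_lines:
        line["LRU"] = 1 if line is used else 0

set0 = [{"V": 1, "D": 0, "Tag": 0x1F, "LRU": 1},
        {"V": 1, "D": 1, "Tag": 0x2F, "LRU": 0}]
print(access(set0, tag=0x07, is_write=True))  # evicts the dirty LRU line -> write-back
```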
Pattern 3 — Prefetching analysis (P15-P21):
- P15: Count how many unique cache blocks are accessed → compulsory misses drive total count
- P16: Analyze which cache change (size, associativity, block size) eliminates the most misses for this access pattern
- P17-P21: Track when the prefetch completes vs. when the load executes; a prefetch at offset K targets the load K/4 iterations later, and the block arrives 49 cycles after the prefetch issues
Practice Questions
P-7.1: A 4KB 4-way set-associative cache has 64-byte blocks. Physical address = 32 bits. (a) How many sets? (b) How many tag bits? (c) What is the max page size for VIPT with no aliasing?
P-7.2: A 2-set, 2-way PIPT write-back LRU 32-byte-block cache with 10-bit physical addresses. Initial state: Set 0: [V=1,D=0,Tag=1F,LRU=1] [V=1,D=1,Tag=2F,LRU=0]. Set 1: [V=0] [V=0]. Perform access: WR address 0x1E0. Show state after access. Is there a write-back?
P-7.3: The following array traversal runs on a 256-byte direct-mapped cache with 16-byte blocks. The array has 64 elements (4 bytes each). How many cache misses occur on the first pass?
Common Traps
- LRU=0 means LEAST recently used (the one to evict), not most recently used
- Write-allocate: on a write miss, the block is first brought into the cache, then written
- Dirty bit is set only on writes, not on reads
- Prefetch at offset 0 is useless: it fetches the same block that the immediately following load needs — only 1 cycle of latency hiding
Module 8: Memory Hierarchy (Virtual Memory, TLB, Cache Systems)
Core Concepts
- TLB (Translation Lookaside Buffer): caches recent virtual-to-physical translations
- TLB miss penalty: extra memory accesses to walk the page table
- Effective CPI with cache:
CPI_eff = CPI_base + f_mem × miss_rate × miss_penalty_cycles
- Miss penalty for write-back cache:
- On a clean eviction: just fetch new block from memory
- On a dirty eviction: write back dirty block, then fetch new block
- AMAT (Average Memory Access Time):
AMAT = Hit_time + Miss_rate × Miss_penalty
Key Formula — CPI with dirty write-back:
Miss_penalty = (block_size / bus_width) × transfer_cycles + memory_latency
CPI_eff = CPI_base + f_mem × miss_rate × (miss_penalty + dirty_fraction × writeback_penalty)
On average, each miss pays an extra dirty_fraction × writeback_penalty to evict a dirty victim (e.g., dirty_fraction = 0.5 if half of evicted blocks are dirty)
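The basic effective-CPI arithmetic (without the dirty-fraction term) as code, using the P-8.1/P-8.2 numbers (a sketch; it assumes the TLB penalty applies to the same f_mem accesses):

```python
def effective_cpi(cpi_base, f_mem, miss_rate, miss_penalty,
                  tlb_miss_rate=0.0, tlb_penalty=0):
    cache_stalls = f_mem * miss_rate * miss_penalty
    tlb_stalls = f_mem * tlb_miss_rate * tlb_penalty
    return cpi_base + cache_stalls + tlb_stalls

print(effective_cpi(2.0, 0.25, 0.03, 50))            # P-8.1: 2.375
print(effective_cpi(2.0, 0.25, 0.03, 50, 0.01, 15))  # P-8.2 adds TLB stalls
```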
Problem Patterns
Pattern 1 — Compare three cache configurations (P1):
- Compute effective CPI for each: CPI_eff = CPI × (cycle_length_factor) + miss_penalty_stalls
- Miss penalty stalls = f_mem × miss_rate × penalty_cycles
- Add TLB penalty: + f_mem × TLB_miss_rate × TLB_penalty
Pattern 2 — Optimal block size (P2, P3):
- AMAT = 1 + miss_rate × miss_penalty(block_size)
- Miss penalty increases with block size (more transfers); miss rate decreases
- Find the minimum AMAT
Practice Questions
P-8.1: CPI_base = 2.0, 25% of instructions are memory ops, cache miss rate = 3%, miss penalty = 50 cycles. What is effective CPI?
P-8.2: A system has a TLB with 1% miss rate and 15-cycle penalty. Cache is physically addressed, so TLB must be checked on every memory access. How does this affect CPI from P-8.1?
Module 9: Cache Coherence (MESI Protocol)
Core Concepts
- MESI states: Modified, Exclusive, Shared, Invalid
- M (Modified): only copy; cache has latest value; memory is stale; must write-back before any other cache can access
- E (Exclusive): only copy; cache matches memory; can be silently upgraded to M on write
- S (Shared): multiple caches may have this block; must invalidate all others before writing
- I (Invalid): cache does not have valid copy
- Bus snooping: every cache watches all bus transactions and updates its state accordingly
State Transition Rules
| Action | Current State | New State | Bus Transaction |
|---|---|---|---|
| Read miss | I | S or E | BusRd |
| Write miss | I | M | BusRdX |
| Write hit | S | M | BusUpgr (invalidates others) |
| Write hit | E | M | Silent (no bus) |
| Write hit | M | M | Silent (no bus) |
| Read hit | S/E/M | unchanged | none |
| Snoop BusRd | M | S | write-back to memory |
| Snoop BusRdX | M/S/E | I | write-back if M |
Data source on a miss:
- If another cache has it in M state → that cache supplies data (and transitions to S or I)
- If all other caches have I → data comes from memory
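A sketch of the snoop-side transitions from the table and key above (encoding mine; note that in this guide's protocol only an M-state cache supplies data, though other MESI variants let E or S supply it too):

```python
def snoop(state, bus_op):
    """Snooping cache's reaction: returns (new_state, writeback_needed, supplies_data)."""
    if bus_op == "BusRd":                # another core wants to read
        if state == "M":
            return "S", True, True       # write back, supply data, demote to S
        if state == "E":
            return "S", False, False     # silent demotion; memory supplies data
        return state, False, False       # S stays S, I stays I
    if bus_op in ("BusRdX", "BusUpgr"):  # another core wants to write
        return "I", state == "M", state == "M"
    return state, False, False

print(snoop("M", "BusRd"))    # ('S', True, True)
print(snoop("S", "BusUpgr"))  # ('I', False, False)
```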
Problem Patterns
Pattern 1 — Single access trace (Multiprocessing PS P1-P10): Step by step:
- Decode address to determine which cache set/tag is accessed
- Check the requesting processor’s cache: hit or miss?
- If miss: broadcast on bus; check other caches for M/E/S copies
- Determine write-back needed (if any M state elsewhere)
- Determine data source
- Determine final state in requestor’s cache
Pattern 2 — False sharing (P11-P13):
- When two cores write to different words in the same cache block, the block constantly ping-pongs between M states on different cores
- Solution: pad data so each core’s data occupies a separate cache block
Pattern 3 — Compiler optimization vs memory consistency (P14-P16):
- Loop-invariant code motion can move a variable read outside a loop
- If another thread modifies that variable, the hoisted read sees a stale value
- This is a classic memory consistency problem requiring volatile or memory barriers
Practice Questions
P-9.1: With the same initial MESI state table from the PS (P0:I/M/I/M, P1:S/E/I/E, P2:S/M/E/S, P3:S/E/I/I for sets 0-3), what happens when P0 reads address 0x01xx (set 1)?
P-9.2: Explain what “false sharing” is and how it causes performance degradation even when two threads never access the same data.
P-9.3: Why does the MSI protocol need fewer states than MESI, and what performance does MESI gain by adding the E state?
Common Traps
- Address decoding: with 4 blocks per cache and 64-byte blocks, the set is bits [7:6] of the address; the tag is bits [15:8]
- Each problem is independent: the PS explicitly says each question starts from the same initial state, not the state left by the previous question
- M state always requires write-back when another processor requests the block (whether for read or write)
- E state → M on write: no bus transaction needed (already the only copy)
Module 10: Interrupts and Exceptions
Core Concepts
- Precise exception: at the point of exception, all instructions before the faulting instruction have completed, and no instructions after have modified architectural state
- Problem without ROB: out-of-order execution means later instructions may have already written to registers while earlier ones haven’t — impossible to reconstruct correct state
- Solution: ROB maintains in-order commit; architectural registers only updated at commit
Problem Patterns
Pattern 1 — Correct register values at exception (P1):
- Trace execution in program order (not execution order)
- The “correct” values are those that would result from executing I1 through I4 sequentially (everything before I5)
- I5 causes exception → I5 and after produce no architectural changes
Pattern 2 — Fill Tomasulo table with ROB (P4):
- Same scheduling rules as Tomasulo
- Commit column: instruction commits in-order, one per cycle, starting the cycle after write-result
- Must wait for all previous instructions to commit first
Pattern 3 — Exception clean-up steps (P5):
- Reservation stations: flush all (no longer valid)
- ROB: flush all entries after the faulting instruction’s entry (they are speculative); faulting instruction’s entry is used to restore and then freed
- RAT: reset all entries to point to the register file (which now has the correct architectural state after all prior commits)
Practice Questions
P-10.1: Explain in one sentence why it is impossible to get precise exception behavior in an OOO processor without a ROB.
P-10.2: In a processor with a ROB, what is the difference between “write result” and “commit”? Why is this distinction important for exceptions?
Quick Reference: Formula Sheet
Iron Law: Time = IC × CPI × T_cycle
Amdahl's: S = 1 / [(1-Fe) + Fe/Se]
Generalized: S = 1 / [(1-Σfi) + Σ(fi/Si)]
CPI weighted: CPI = Σ(f_i × CPI_i)
CPI miss penalty: CPI_eff = CPI_base + f_mem × miss_rate × penalty
CPI branch penalty: CPI_eff = CPI_base + f_branch × penalty_contribution
Pipeline speedup: S = T_old / T_new = (CPI_old × T_old) / (CPI_new × T_new)
Pipeline T_cycle: max(stage_latency) + latch_overhead
Cache sets: Sets = Size / (Block_size × Ways)
Tag bits: PA_bits − log2(Sets) − log2(Block_size)
VIPT max page: Cache_size / Associativity
AMAT: Hit_time + Miss_rate × Miss_penalty
Prefetch distance: miss_latency / cycles_per_iteration (in iterations)
Study Strategy
- Amdahl’s Law problems → practice until the formula is automatic; the tricky part is always identifying what f_i actually is (fraction of original execution time, not instruction count)
- Pipeline timing diagrams → draw these out physically; count stall cycles explicitly; the answer is always “count the columns”
- Branch prediction → trace the actual execution by hand; track predictor state after each branch
- Tomasulo / ROB scheduling → fill in tables cycle by cycle; always ask “is the execution unit free?” and “are all operands available?”
- MESI coherence → memorize the state transition table; for each access, work through all five questions (hit/miss? write-back? bus broadcast? other caches? data source?)
- Cache arithmetic → always start with: offset bits → index bits → tag bits; decode the address and compare with stored tags