

HPCA Study Guide — CS 6290 Georgia Tech

Based on all 10 Problem Sets + Solutions


How to Use This Guide

Each section follows this structure:

  1. Core Concepts — what you must know cold
  2. Key Formulas — the equations that appear on every problem
  3. Problem Patterns — the types of questions asked, with worked strategy
  4. Practice Questions — new questions modeled after the problem sets
  5. Common Traps — mistakes the solutions explicitly call out

Module 1: Metrics and Evaluation

Core Concepts

Key Formulas

Amdahl’s Law (single enhancement):

S = 1 / [(1 - F_e) + F_e/S_e]

Generalized Amdahl’s Law (multiple non-overlapping enhancements):

S = 1 / [(1 - Σf_i) + Σ(f_i / S_i)]

Iron Law speedup ratio (when instruction count doesn’t change):

S = (CPI_old × CycleTime_old) / (CPI_new × CycleTime_new)

Weighted CPI:

CPI = Σ (fraction_i × CPI_i)

Cache miss penalty CPI adjustment:

CPI_new = CPI_base + (miss_rate × miss_penalty × fraction_memory_instructions)
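The formulas above can be sketched as small helper functions. This is an illustrative sketch, not code from the problem sets; the function names and the example inputs are made up.

```python
def amdahl(f_e, s_e):
    """Overall speedup when fraction f_e of execution time is sped up by s_e."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

def amdahl_multi(fractions, speedups):
    """Generalized Amdahl's Law for non-overlapping enhancements."""
    leftover = 1.0 - sum(fractions)                     # unenhanced fraction
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (leftover + enhanced)

def weighted_cpi(fracs_and_cpis):
    """CPI = sum over instruction classes of fraction_i * CPI_i."""
    return sum(f * c for f, c in fracs_and_cpis)

def cpi_with_misses(cpi_base, miss_rate, miss_penalty, f_mem):
    """Cache miss penalty CPI adjustment."""
    return cpi_base + miss_rate * miss_penalty * f_mem

# Example: 60% of time sped up 8x
print(round(amdahl(0.6, 8), 3))   # -> 2.105
```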

Problem Patterns

Pattern 1 — Single enhancement (P1, P3):

Pattern 2 — Compare two design choices (P2, P4):

Pattern 3 — Four-part Amdahl (P5):

Pattern 4 — Nested enhancements (P6 — L1 + L2 cache):

Pattern 5 — Solve for unknown fraction (P9):

Pattern 6 — Fraction of enhanced time with no enhancement active (P10):

F_NE_enhanced = (1 - Σf_i) × S_total

This equals the unenhanced time divided by the new total time.

Pattern 7 — Best single / best pair enhancement (P11):

Practice Questions

P-1.1: An enhancement speeds up 60% of execution time by 8x. What is the overall speedup?

P-1.2: A processor has 30% FP instructions (CPI=6) and 70% integer (CPI=1.2). Two options:

P-1.3: Three enhancements affect 15%, 25%, and 40% of execution time with speedups 20, 8, and 4 respectively. Assume no overlap. What is the overall speedup?

P-1.4: Given the same three enhancements above, what fraction of the reduced execution time has no enhancement active?

P-1.5 (challenge): Enhancement A affects 30% (speedup=50) and Enhancement B has speedup=10. What fraction of time must Enhancement B be used to achieve an overall speedup of 5?

Common Traps


Module 2: Pipelining and Hazards

Core Concepts

Key Formulas

Pipelining speedup (ideal):

S = (CPI_unpipelined × T_cycle_old) / (CPI_pipelined × T_cycle_new)

CPI with stalls:

CPI_actual = CPI_ideal + stalls_per_instruction

Pipeline clock cycle time:

T_cycle = max(stage_latencies) + latch_overhead

Unpipelined execution time per instruction:

T_avg = T_cycle × weighted_average_CPI
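As a sketch, the cycle-time and speedup formulas look like this in code. The stage latencies are the ones from practice question P-2.2, used here only as example inputs, and the sketch assumes the single-cycle machine's clock is the sum of the stage latencies with no latch overhead (conventions differ).

```python
def pipeline_cycle_time(stage_latencies, latch_overhead):
    """T_cycle = max(stage latency) + latch overhead."""
    return max(stage_latencies) + latch_overhead

def pipeline_speedup(cpi_old, t_cycle_old, cpi_new, t_cycle_new):
    """S = (CPI_old * T_cycle_old) / (CPI_new * T_cycle_new)."""
    return (cpi_old * t_cycle_old) / (cpi_new * t_cycle_new)

stages = [150, 80, 200, 220, 70]          # ps, per stage
t_new = pipeline_cycle_time(stages, 15)   # slowest stage + latch = 235 ps
t_old = sum(stages)                       # single-cycle assumption: 720 ps
print(t_new, round(pipeline_speedup(1.0, t_old, 1.3, t_new), 2))  # -> 235 2.36
```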

Problem Patterns

Pattern 1 — Identify dependencies (P1, P3):

Pattern 2 — Pipeline timing diagrams (P4, P5):

Pattern 3 — Delay slots (P5A):

Pattern 4 — Clock cycle time and 3-stage / 6-stage pipelines (P6):

Pattern 5 — Out-of-order execution table (P12):

Practice Questions

P-2.1: For the following code, identify all RAW, WAW, and WAR dependencies:

I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: MUL R2, R4, R6
I4: DIV R1, R2, R3

P-2.2: A 5-stage pipeline has stage latencies IF=150ps, ID=80ps, EX=200ps, MEM=220ps, WB=70ps. Each latch adds 15ps. (a) What is the clock cycle time? (b) If CPI=1 unpipelined (single-cycle) and CPI=1.3 pipelined, what is the speedup?

P-2.3: For the following loop with forwarding, show the pipeline timing and count the cycles for one iteration:

Loop: LD R1, 0(R2)
      ADD R3, R1, R4
      SW 0(R2), R3
      ADDI R2, R2, 4
      BNE R2, R5, Loop

P-2.4: What is the difference between a structural hazard and a data hazard? Give one example of each.

Common Traps


Module 3: Branch Prediction

Core Concepts

Key Formulas

Misprediction count for always-taken (loop of N iterations):

Mispredictions = 1 per execution of the loop (only the final, not-taken exit branch mispredicts)

CPI with branch penalty (from ROB PS Problem 1):

CPI = CPI_base + f_branch × [f_taken × (f_miss×2 + (1-f_miss)×1)
      + (1-f_taken) × (f_miss×2 + (1-f_miss)×0)]

Problem Patterns

Pattern 1 — Count mispredictions for always-taken (P1):

Pattern 2 — 1-bit predictor with 8 entries (P2):

Pattern 3 — Local history predictor (P3):

Pattern 4 — Branch resolution location and data hazards (P3A):
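Patterns 1 and 2 above can be checked with a tiny hand-rolled predictor model. This is a hedged sketch: the table size, the simple `pc % entries` indexing, and the traced branch sequence are all illustrative choices, not the conventions of any specific problem set.

```python
def simulate_1bit(trace, entries=4, init=0):
    """trace: list of (branch_index, outcome), outcome 1 = taken, 0 = not taken.
    Returns the total misprediction count. Indexing is simplified to a
    modulo over the table size (real predictors index on PC bits)."""
    table = [init] * entries
    mispred = 0
    for pc, outcome in trace:
        idx = pc % entries
        if table[idx] != outcome:
            mispred += 1
        table[idx] = outcome      # 1-bit: always remember the last outcome
    return mispred

# A loop branch taken 3 times then not taken, run twice,
# with the predictor entry initialized to "not taken":
trace = [(0, 1), (0, 1), (0, 1), (0, 0)] * 2
print(simulate_1bit(trace))   # -> 4  (first taken + each exit + first re-entry)
```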

Practice Questions

P-3.1: For a loop that executes 100 times, how many mispredictions occur with: (a) always-taken, (b) always-not-taken, (c) 1-bit predictor initialized to “not taken”?

P-3.2: Two branches at addresses 0x200 and 0x220 share the same entry in a 4-entry 1-bit predictor (initialized to 0). Branch at 0x200 is always taken. Branch at 0x220 alternates T/NT/T/NT… How many mispredictions occur in 8 executions of each?

P-3.3: Why does increasing branch predictor size beyond a point yield diminishing returns?

Common Traps


Module 4: ILP and Register Renaming

Core Concepts

Key Formulas

Minimum cycles with N-way issue:

Cycles = max(critical_path_length, ceil(num_instructions / N))

Where critical path = longest chain of RAW-dependent instructions
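The bound above can be computed mechanically. A minimal sketch, assuming unit latency and that each entry in `deps` only references earlier instructions; the dependence map in the example is made up.

```python
from math import ceil

def min_cycles(num_instructions, deps, issue_width):
    """deps[i] lists the instruction indices that i RAW-depends on."""
    # depth[i] = length of the longest RAW chain ending at i (latency 1 each)
    depth = [1] * num_instructions
    for i in range(num_instructions):
        for d in deps.get(i, []):
            depth[i] = max(depth[i], depth[d] + 1)
    critical_path = max(depth)
    return max(critical_path, ceil(num_instructions / issue_width))

# Chain I0 -> I1 -> I2 plus three independent instructions, 3-wide issue:
deps = {1: [0], 2: [1]}
print(min_cycles(6, deps, 3))   # -> 3 (critical path 3 beats ceil(6/3) = 2)
```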

Problem Patterns

Pattern 1 — Register renaming (P1):

Pattern 2 — Cycle count with 2-way issue (P2):

Pattern 3 — Cycle count with 4-way issue (P3):

Practice Questions

P-4.1: Use register renaming to eliminate false dependencies. Assume 8 physical locations L1-L8; initially R1=L1, R2=L2, R3=L3:

MUL R1, R2, R3
ADD R2, R1, R3
SUB R1, R2, R3
MUL R3, R1, R2

P-4.2: How many cycles does the following take with 3-way issue (all instructions latency=1)?

ADD R1, R2, R3
MUL R4, R1, R5
SUB R6, R4, R7
ADD R8, R2, R9
LD  R10, (R11)
MUL R12, R10, R4

P-4.3: List three factors that prevent a real processor from achieving the theoretical maximum ILP shown in benchmarks.

Common Traps


Module 5: Tomasulo’s Algorithm and Out-of-Order Execution

Core Concepts

Key Rules

Problem Patterns

Pattern 1 — Fill in Tomasulo scheduling table (Instruction Scheduling PS P1, P4):

  1. Issue cycle: first cycle where an RS and (for MUL/DIV) the execution unit is free
  2. First execution cycle: cycle after all operands are available (either from register file or CDB write)
  3. Write result cycle: first execution cycle + latency - 1 (result written in last exec cycle)
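The three rules above can be sketched as a toy scheduler. This is a deliberately simplified model: it assumes one instruction issues per cycle, reservation stations are plentiful, one execution unit per type that stays busy through its write cycle, and a result is captured from the CDB in the cycle it is written (course conventions vary, so treat the exact cycle numbers as illustrative).

```python
def schedule(instrs, unit_latency):
    """instrs: list of (unit_type, [producer instruction indices]).
    Returns (issue, first_exec, write) per instruction."""
    unit_free = {u: 1 for u in unit_latency}   # next cycle each unit is free
    result = []
    for i, (unit, producers) in enumerate(instrs):
        issue = i + 1                          # rule 1, simplified: 1/cycle
        ready = max([result[p][2] for p in producers], default=0)
        first_exec = max(issue + 1, ready + 1, unit_free[unit])  # rule 2
        write = first_exec + unit_latency[unit] - 1              # rule 3
        unit_free[unit] = write + 1            # unit busy through write
        result.append((issue, first_exec, write))
    return result

lat = {"MUL": 4, "ADD": 2}
# I1: MUL (no deps), I2: ADD depends on I1, I3: MUL depends on I2
for row in schedule([("MUL", []), ("ADD", [0]), ("MUL", [1])], lat):
    print(row)   # -> (1, 2, 5), then (2, 6, 7), then (3, 8, 11)
```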

Pattern 2 — Tomasulo with ROB (Interrupts PS P4):

Practice Questions

P-5.1: Schedule the following using Tomasulo’s algorithm. 1 MUL unit (latency 4, 2 RS), 1 ADD unit (latency 2, 3 RS). Issue starts at cycle 1:

I1: MUL F1, F2, F3
I2: ADD F4, F1, F5
I3: MUL F6, F4, F2
I4: ADD F1, F6, F3

P-5.2: Explain the role of the Register Alias Table (RAT) in Tomasulo’s algorithm. What does it store, and when is it updated?

P-5.3: Why can Tomasulo’s algorithm eliminate WAR and WAW hazards but not RAW hazards?

Common Traps


Module 6: ReOrder Buffer (ROB) and Exceptions

Core Concepts

ROB Cleanup on Exception

Key Formulas

CPI with branch penalties:

CPI = CPI_base + f_branch × [f_taken × (f_miss×P_T + (1-f_miss)×P_T_correct)
      + (1-f_taken) × (f_miss×P_NT + (1-f_miss)×0)]

Where P_T = penalty for a mispredicted taken branch, P_T_correct = penalty for a correctly predicted taken branch, P_NT = penalty for a mispredicted not-taken branch
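The formula translates directly into code. A sketch with the penalties left as parameters, since they depend on where branches resolve in the pipeline; the example numbers are made up.

```python
def cpi_with_branches(cpi_base, f_branch, f_taken, f_miss,
                      p_taken_miss, p_taken_correct, p_nt_miss):
    """Branch-penalty CPI: correctly predicted not-taken branches cost 0."""
    taken = f_taken * (f_miss * p_taken_miss + (1 - f_miss) * p_taken_correct)
    not_taken = (1 - f_taken) * (f_miss * p_nt_miss)
    return cpi_base + f_branch * (taken + not_taken)

# Example: 20% branches, 60% taken, 10% mispredicted,
# penalties 2 (any mispredict) and 1 (correctly predicted taken):
print(round(cpi_with_branches(1.0, 0.2, 0.6, 0.1, 2, 1, 2), 3))   # -> 1.148
```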

Problem Patterns

Pattern 1 — CPI calculation with branch predictor (P1, P2):

Pattern 2 — Exception handling questions (P3, P5):

Pattern 3 — 2-bit branch prediction explanation (P4 Q1):

Practice Questions

P-6.1: A 5-stage pipeline has 20% branch instructions, 70% taken. The predictor is correct 90% of the time. Misprediction penalty is 3 cycles. Base CPI = 1.0. What is the actual CPI?

P-6.2: Explain why a processor without an ROB cannot guarantee a precise exception when instructions execute out of order.

P-6.3: After an exception is triggered and caught, what is the correct state in the register file? How does the ROB ensure this?


Module 7: Advanced Caches

Core Concepts

Key Formulas

Index bits = log2(Sets)
Block offset bits = log2(Block_size)
Tag bits (PIPT) = Physical_address_bits − Index_bits − Block_offset_bits
Total tag storage = tag_bits × num_lines

For VIPT aliasing limit:

Max page size = Cache_size / Associativity
(The index bits must not extend above the page offset bits)
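A quick sketch of the geometry arithmetic; the 4 KB / 4-way / 64 B configuration below is just an example input.

```python
from math import log2

def cache_geometry(size, block_size, ways, pa_bits):
    """Returns (sets, index_bits, offset_bits, tag_bits, vipt_max_page)."""
    sets = size // (block_size * ways)
    index_bits = int(log2(sets))
    offset_bits = int(log2(block_size))
    tag_bits = pa_bits - index_bits - offset_bits
    vipt_max_page = size // ways   # index + offset must fit in the page offset
    return sets, index_bits, offset_bits, tag_bits, vipt_max_page

print(cache_geometry(4096, 64, 4, 32))   # -> (16, 4, 6, 22, 1024)
```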

Problem Patterns

Pattern 1 — Cache geometry (P1-P4):

Pattern 2 — Cache state table (P5-P14): For each access:

  1. Decode address: determine which set and what tag
  2. Check if any valid line in that set has matching tag → hit or miss
  3. On hit: update LRU; if write, set dirty bit
  4. On miss: choose victim (LRU=0); if victim is dirty → write-back; load new block; set tag, valid, LRU
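Steps 1-4 above can be sketched as a minimal write-back LRU cache model. This is illustrative code, not the problem sets' notation: the geometry and access sequence are made up, and LRU is tracked as exact recency order rather than the single LRU bit used in 2-way problems.

```python
class Cache:
    def __init__(self, sets, ways, block_size):
        self.sets, self.ways, self.block = sets, ways, block_size
        # each set: list of [valid, dirty, tag], most recently used last
        self.lines = [[] for _ in range(sets)]

    def access(self, addr, write=False):
        """Returns (hit, writeback_of_dirty_victim)."""
        block_no = addr // self.block                 # step 1: decode address
        idx, tag = block_no % self.sets, block_no // self.sets
        s = self.lines[idx]
        for line in s:
            if line[0] and line[2] == tag:            # step 2: tag match -> hit
                s.remove(line); s.append(line)        # step 3: update LRU
                if write:
                    line[1] = True                    # set dirty bit
                return True, False
        writeback = False
        if len(s) == self.ways:                       # step 4: evict LRU victim
            victim = s.pop(0)
            writeback = victim[1]                     # dirty victim -> write-back
        s.append([True, write, tag])                  # load new block
        return False, writeback

c = Cache(sets=2, ways=2, block_size=32)
for a, w in ((0x000, True), (0x040, False), (0x080, False)):
    print(c.access(a, write=w))   # third access evicts the dirty block
```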

Pattern 3 — Prefetching analysis (P15-P21):

Practice Questions

P-7.1: A 4KB 4-way set-associative cache has 64-byte blocks. Physical address = 32 bits. (a) How many sets? (b) How many tag bits? (c) What is the max page size for VIPT with no aliasing?

P-7.2: A 2-set, 2-way PIPT write-back LRU 32-byte-block cache with 10-bit physical addresses. Initial state: Set 0: [V=1,D=0,Tag=1F,LRU=1] [V=1,D=1,Tag=2F,LRU=0]. Set 1: [V=0] [V=0]. Perform access: WR address 0x1E0. Show state after access. Is there a write-back?

P-7.3: The following array traversal runs on a 256-byte direct-mapped cache with 16-byte blocks. The array has 64 elements (4 bytes each). How many cache misses occur on the first pass?

Common Traps


Module 8: Memory Hierarchy (Virtual Memory, TLB, Cache Systems)

Core Concepts

CPI_eff = CPI_base + f_mem × miss_rate × miss_penalty_cycles
AMAT = Hit_time + Miss_rate × Miss_penalty

Key Formula — CPI with dirty write-back:

Miss_penalty = (block_size / bus_width) × transfer_cycles + memory_latency
CPI_eff = CPI_base + f_mem × miss_rate × (miss_penalty + dirty_fraction × writeback_penalty)

Where dirty_fraction is the fraction of evicted victims that are dirty; each dirty victim adds a write-back on top of the normal miss penalty.
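The hierarchy formulas as runnable one-liners; the numbers below are example inputs only (the CPI example reuses the P-8.1 values).

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time."""
    return hit_time + miss_rate * miss_penalty

def cpi_eff(cpi_base, f_mem, miss_rate, miss_penalty):
    """Effective CPI with a per-memory-access miss rate."""
    return cpi_base + f_mem * miss_rate * miss_penalty

def miss_penalty(block_size, bus_width, transfer_cycles, memory_latency):
    """Cycles to fetch one block over the memory bus."""
    return (block_size // bus_width) * transfer_cycles + memory_latency

print(round(amat(1, 0.03, 50), 3))              # -> 2.5
print(round(cpi_eff(2.0, 0.25, 0.03, 50), 3))   # -> 2.375
```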

Problem Patterns

Pattern 1 — Compare three cache configurations (P1):

Pattern 2 — Optimal block size (P2, P3):

Practice Questions

P-8.1: CPI_base = 2.0, 25% of instructions are memory ops, cache miss rate = 3%, miss penalty = 50 cycles. What is effective CPI?

P-8.2: A system has a TLB with 1% miss rate and 15-cycle penalty. Cache is physically addressed, so TLB must be checked on every memory access. How does this affect CPI from P-8.1?


Module 9: Cache Coherence (MESI Protocol)

Core Concepts

State Transition Rules

Action         Current State   New State    Bus Transaction
Read miss      I               S or E       BusRd
Write miss     I               M            BusRdX
Write hit      S               M            BusUpgr (invalidates others)
Write hit      E               M            silent (no bus)
Write hit      M               M            silent (no bus)
Read hit       S/E/M           unchanged    none
Snoop BusRd    M               S            write-back to memory
Snoop BusRdX   M/S/E           I            write-back if M
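The transition table can be encoded as a lookup for checking traces by hand. A sketch only: the requester-side and snoop-side transitions are split into two dicts, and the snoop dict includes a couple of standard MESI transitions (E on BusRd, S on BusUpgr) beyond the rows listed above.

```python
MESI = {
    # (current state, local action) -> (new state, bus transaction)
    ("I", "read"):  ("S_or_E", "BusRd"),   # E only if no other cache has a copy
    ("I", "write"): ("M", "BusRdX"),
    ("S", "write"): ("M", "BusUpgr"),
    ("E", "write"): ("M", None),           # silent upgrade, no bus traffic
    ("M", "write"): ("M", None),
    ("S", "read"):  ("S", None),
    ("E", "read"):  ("E", None),
    ("M", "read"):  ("M", None),
}

SNOOP = {
    # (current state, observed bus transaction) -> new state
    ("M", "BusRd"):   "S",   # plus write-back to memory
    ("M", "BusRdX"):  "I",   # plus write-back to memory
    ("E", "BusRd"):   "S",
    ("E", "BusRdX"):  "I",
    ("S", "BusRdX"):  "I",
    ("S", "BusUpgr"): "I",
}

print(MESI[("S", "write")])   # -> ('M', 'BusUpgr')
```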

Data source on a miss: if another cache holds the block in M, that cache supplies the data (with a write-back to memory); otherwise the block comes from memory.

Problem Patterns

Pattern 1 — Single access trace (Multiprocessing PS P1-P10): Step by step:

  1. Decode address to determine which cache set/tag is accessed
  2. Check the requesting processor’s cache: hit or miss?
  3. If miss: broadcast on bus; check other caches for M/E/S copies
  4. Determine write-back needed (if any M state elsewhere)
  5. Determine data source
  6. Determine final state in requestor’s cache

Pattern 2 — False sharing (P11-P13):

Pattern 3 — Compiler optimization vs memory consistency (P14-P16):

Practice Questions

P-9.1: With the same initial MESI state table from the PS (P0:I/M/I/M, P1:S/E/I/E, P2:S/M/E/S, P3:S/E/I/I for sets 0-3), what happens when P0 reads address 0x01xx (set 1)?

P-9.2: Explain what “false sharing” is and how it causes performance degradation even when two threads never access the same data.

P-9.3: Why does the MSI protocol need fewer states than MESI, and what performance does MESI gain by adding the E state?

Common Traps


Module 10: Interrupts and Exceptions

Core Concepts

Problem Patterns

Pattern 1 — Correct register values at exception (P1):

Pattern 2 — Fill Tomasulo table with ROB (P4):

Pattern 3 — Exception clean-up steps (P5):

Practice Questions

P-10.1: Explain in one sentence why it is impossible to get precise exception behavior in an OOO processor without a ROB.

P-10.2: In a processor with a ROB, what is the difference between “write result” and “commit”? Why is this distinction important for exceptions?


Quick Reference: Formula Sheet

Iron Law:           Time = IC × CPI × T_cycle
Amdahl's:           S = 1 / [(1-Fe) + Fe/Se]
Generalized:        S = 1 / [(1-Σfi) + Σ(fi/Si)]
CPI weighted:       CPI = Σ(f_i × CPI_i)
CPI miss penalty:   CPI_eff = CPI_base + f_mem × miss_rate × penalty
CPI branch penalty: CPI_eff = CPI_base + f_branch × penalty_contribution
Pipeline speedup:   S = Time_old / Time_new = (CPI_old × T_cycle_old) / (CPI_new × T_cycle_new)
Pipeline T_cycle:   max(stage_latency) + latch_overhead
Cache sets:         Sets = Size / (Block_size × Ways)
Tag bits:           PA_bits − log2(Sets) − log2(Block_size)
VIPT max page:      Cache_size / Associativity
AMAT:               Hit_time + Miss_rate × Miss_penalty
Prefetch distance:  miss_latency / cycles_per_iteration  (in iterations)

Study Strategy

  1. Amdahl’s Law problems → practice until the formula is automatic; the tricky part is always identifying what f_i actually is (fraction of original execution time, not instruction count)

  2. Pipeline timing diagrams → draw these out physically; count stall cycles explicitly; the answer is always “count the columns”

  3. Branch prediction → trace the actual execution by hand; track predictor state after each branch

  4. Tomasulo / ROB scheduling → fill in tables cycle by cycle; always ask “is the execution unit free?” and “are all operands available?”

  5. MESI coherence → memorize the state transition table; for each access, work through all five questions (hit/miss? write-back? bus broadcast? other caches? data source?)

  6. Cache arithmetic → always start with: offset bits → index bits → tag bits; decode the address and compare with stored tags