source

05 Tomasulo And Ooo Execution

Tomasulo’s Algorithm and OOO Execution

Prerequisites: 04-ILP-and-Register-Renaming Learning Goals: Understand how Tomasulo’s Algorithm implements hardware register renaming and out-of-order execution via Reservation Stations, and the three-phase Issue/Dispatch/Broadcast cycle.


Improving IPC: The Full Picture

To achieve high IPC, a processor must address all dependency types:

Dependency TypeSolution
ControlBranch Prediction
WAR / WAW (false)Register Renaming
RAW (true)Out-of-Order Execution
StructuralWider Issue (more execution units)

ILP should be at least 4 instructions per cycle to justify wide-issue hardware.


Tomasulo’s Algorithm

Originally designed for floating-point units; modern processors extend it to all instructions.

Modern Extensions over Original Tomasulo

  1. All instructions use the algorithm (not just FP)
  2. Hundreds of instructions considered simultaneously (not just a few)
  3. Supports exception handling (original did not)

Data Flow Paths

Data Manipulation Path

Instruction Queue → Reservation Stations (RS)
                  → Execution Units (Adder / Multiplier)
                  → Broadcast on Bus

Load/Store Path

Instruction Queue → Address Adder (PC + offset)
                  → Load/Store Buffer
                  → Memory
                  → Broadcast on Bus

Key Terminology

TermMeaning
IssueInstruction exits IQ; goes to RS or address adder
DispatchInstruction exits RS; goes to execution unit (Adder or Multiplier)
Broadcast / Write ResultResult exits execution unit; placed on shared bus

Phase 1: Issue

Steps when issuing an instruction:

  1. Take next instruction from IQ (in program order)
  2. Look up source registers in the RAT
    • If RAT has an entry → use the RS/tag it points to
    • If RAT has no entry → read from register file
  3. Find a free RS of the correct type (adder RS or multiplier RS)
    • If no free RS → stall and wait
  4. Place instruction (with operand values or tags) in the RS
  5. Tag the destination register in the RAT → points to this RS entry

Phase 2: Dispatch

Steps when an instruction is ready to execute:

  1. RS monitors broadcast bus — match broadcast tags to pending operand tags
  2. When a match is found, capture the value into the RS entry
  3. When an RS has all inputs ready → eligible to dispatch
  4. Select which eligible RS to dispatch (strategies):
    • Oldest first
    • Most dependencies first (most waiting instructions) — hard to implement
    • Random
  5. Send instruction to the Adder or Multiplier
  6. Free the RS once dispatched

Phase 3: Broadcast

Steps when execution completes:

  1. Put the tag and result on the bus
  2. Write result to the register file
  3. Update the RAT — clear the entry (empty RAT entry = value in register file)
  4. Free the RS (clear valid bit)

Multiple Broadcasts Ready?

Options:

  1. Separate bus per arithmetic unit (more hardware)
  2. Prioritize the slower unit — multiply/divide takes more cycles, so it likely has more downstream dependencies waiting

Stale Results

A result is stale if the RAT no longer points to its RS (a newer write has been issued to the same architectural register).

⚠ Exam note: Stale results are broadcast but don’t update the RAT.


Cycle-Level Parallelism

In each clock cycle, the processor can simultaneously:

Timing Rules


Memory Dependencies in Tomasulo

Load/Store instructions can have data dependencies through memory:

DependencyScenario
RAWStore to address A, then Load from A
WARLoad from A, then Store to A
WAWTwo stores to the same address A

Tomasulo’s Solution

Tomasulo’s original algorithm handles memory by doing loads and stores in-order. Modern processors identify and reorder memory operations more aggressively (covered in 06-ROB-and-Memory-Ordering).


Summary

Key Takeaways:

Common Exam Task: Given an instruction sequence, trace the RS state, RAT state, and cycle numbers for Issue/Dispatch/Broadcast of each instruction.

See Also: 04-ILP-and-Register-Renaming, 06-ROB-and-Memory-Ordering Next: 06-ROB-and-Memory-Ordering