ReOrder Buffer and Memory Ordering

Prerequisites: 05-Tomasulo-and-OOO-Execution Learning Goals: Understand how the ROB enables correct OOO execution through in-order commit, how exceptions and branch mispredictions are handled, and how the Load/Store Queue manages memory dependencies.

The Problem with Tomasulo’s Algorithm

Major drawback: Tomasulo’s algorithm writes results directly to registers out-of-order. This causes two problems:

OOO Exceptions: Instructions after an exception-causing instruction may already have written results → incorrect state
Phantom Exceptions: After a branch misprediction, instructions from the wrong path may have executed and caused exceptions → these exceptions should never have occurred

Exception handling should not occur until the processor is certain the exception is not a phantom.

Correct OOO Execution with ROB

The solution:

Execute out-of-order ✓
Broadcast out-of-order ✓
Write to registers in-order via Commit ✓

The ReOrder Buffer (ROB) holds results until it is safe to commit them in program order.

A ReOrder Buffer remembers program order and keeps results until it is safe to write them to architectural registers.

ROB Structure

Fields per ROB Entry

Field	Purpose
Value	The result data held temporarily
Done bit	Whether this instruction has finished executing
Destination register	Which architectural register will receive this value

ROB Pointers

Pointer	Points To
Issue pointer (head)	Next empty slot — where the next issued instruction is placed
Commit pointer (tail)	Oldest in-flight instruction — next to be committed

Entries between Issue and Commit are currently in-flight (executing or waiting).

ROB Phase: Issue

Steps when issuing with ROB:

Take instruction from IQ
Get a free RS of the correct type
Get the next ROB entry (at Issue pointer)
Look up source registers in RAT
Update RAT for the destination register → point to the ROB entry (not the RS)

Key difference from vanilla Tomasulo: RAT points to ROB entries, not RS entries.

ROB Phase: Dispatch

Steps for dispatch with ROB:

Find waiting RS entries where broadcast tags match
Only dispatch instructions where all inputs are ready
Pick one instruction per functional unit
Free the RS immediately (broadcast uses ROB tag, not RS tag → RS freed sooner)

ROB Phase: Broadcast

Steps for broadcast with ROB:

Capture result in waiting RS entries (tag matching)
Write result to the ROB entry (not directly to register file)

ROB Phase: Commit

Steps for commit:

Look at the oldest instruction in the ROB (at Commit pointer)
Wait for its Done bit to be set
Copy result from ROB → register file (this is the only place register writes happen)
Update RAT: if RAT entry still points to this ROB entry → clear it (value now in register file)
Free the ROB entry; advance Commit pointer

Commit is always in-order. This guarantees architectural state is always correct.

RAT Updates on Commit

If RAT still points to this ROB entry → update RAT to point to the register file
If RAT has been updated to a newer ROB entry (a later write to the same register) → leave RAT alone

Branch Misprediction Recovery

Branches are not committed until resolved. All instructions after an unresolved branch are also not committed.

Recovery steps:

Move Issue pointer back to Commit pointer position (flush ROB entries after the branch)
Reset RAT entries to point to the correct registers
Fetch correct instructions from the correct address

Committed instructions cannot be “uncommitted” — they are permanent.

ROB and Exception Handling

Two exception scenarios solved by ROB:

Problem	ROB Solution
OOO instructions with exceptions	ROB hasn’t committed yet → flush everything, load exception handler
Phantom exceptions (wrong-path)	After branch flush, wrong-path instructions (and their exceptions) are discarded

Rule: Treat an exception like any other result → delay handling until Commit.

Unified Reservation Stations

Unified RS: One large array of RS shared across all functional units (adder + multiplier).

Advantage: RS can be used by whichever unit needs it — no waiting for a specific RS type
Disadvantage: More complex hardware (must route each RS entry to the correct unit)

Superscalar

A superscalar processor can handle multiple instructions per cycle in every stage:

Fetch > 1 instruction/cycle
Decode > 1 instruction/cycle
Issue > 1 instruction/cycle
Dispatch > 1 instruction/cycle
Broadcast > 1 instruction/cycle
Commit > 1 instruction/cycle

Weakest link: the processor is limited by the slowest stage. If all stages handle 4 instructions/cycle but Decode only handles 2, the whole processor is limited to 2/cycle.

What is Really Out of Order?

Stage	Order
Fetch	In-order
Decode	In-order
Issue	In-order
Execute	Out-of-order
Write Result	Out-of-order
Commit	In-order

Memory Ordering: Load/Store Queue (LSQ)

When Do Memory Writes Happen?

Memory writes occur at Commit. Writing before commit would make the write irrevocable — it could not be undone on an exception or misprediction.

Memory reads (loads) should complete as early as possible for performance.

LSQ Fields

Field	Purpose
Load/Store bit	Which type of operation
Address	Target memory address
Data value	Value to be stored (for stores)
Completion bit	Whether address/value is computed

Loads and stores are placed in the LSQ in program order.

Store-to-Load Forwarding

When a load is issued, its address is compared against all previous stores in the LSQ.

If a match is found → use the value from the LSQ (no need to go to memory)
This is Store-to-Load Forwarding

What if the Store Doesn’t Have its Address Yet?

Strategy	Description	Performance
1. Wait for the store	In-order: stall load until all prior stores have addresses	Slow
2. Wait for all prior stores’ addresses	Check for address match; only proceed if no conflict	Medium
3. Speculative load (aggressive)	Load from memory immediately; recover if a prior store had same address	Best performance, requires recovery

Modern processors use Strategy 3 (speculative load) because it yields the best performance.

Recovery from stale load: When the store gets its address and discovers a matching load already executed with stale data → re-execute the load (and all dependent instructions).

OOO vs. In-Order Load/Store

Mode	Rule	Performance
OOO Load/Store	Load goes to memory as soon as address is ready	Best, but may need recovery
In-Order Load/Store	Load cannot execute until all prior stores have addresses	Safe, but slow

LSQ, ROB, RS Interaction

Instruction Type	Resources Required at Issue
Load/Store	ROB entry + LSQ entry
Non-Load/Store	ROB entry + RS entry

Executing a Load/Store

Compute address
Produce the value (load from memory or from LSQ forwarding)
Loads: write result → broadcast

Committing a Load/Store

Free ROB entry
Free LSQ entry
For stores: write value to memory (at commit, not at execute)

Summary

Key Takeaways:

ROB enables OOO execution with in-order commit → correct architectural state
Commit is always in-order; register writes only happen at commit
Branch misprediction: flush ROB; exception: handle at commit
Phantom exceptions: safely discarded when branch misprediction flushes the ROB
Memory writes only happen at commit; loads are as early as possible
Store-to-load forwarding: LSQ provides value before memory is written
Modern processors use speculative loads with recovery

Common Exam Task: Trace ROB/RS/RAT state through Issue/Dispatch/Broadcast/Commit for a sequence of instructions.