
Compiler ILP and VLIW

Prerequisites: 04-ILP-and-Register-Renaming
Learning Goals: Understand how compilers extract ILP through scheduling and loop transformations, and how VLIW architecture shifts scheduling responsibility to the compiler.


Compiler ILP Goals

Two basic goals:

  1. Improve instruction scheduling — reduce stalls by placing independent instructions between dependent ones
  2. Reduce the number of instructions — eliminate overhead

Tree Height Reduction

Tree height = the length of the longest dependency chain in a computation.

Tree Height Reduction: Re-group calculations to shorten dependency chains.


Techniques to Expose Independent Instructions

1. Instruction Scheduling

When there is a dependency between two instructions, the processor would normally stall. The compiler can move an independent instruction into the stall slot.
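The stall-filling idea can be sketched at the source level (assuming a load with multi-cycle latency; the function names are illustrative):

```c
/* Before scheduling: 'r = v + 1' must wait for the load of a[i],
   and the independent multiply sits uselessly after the stall. */
int before(const int *a, int i, int x) {
    int v = a[i];        /* load: multi-cycle latency */
    int r = v + 1;       /* stalls waiting for v */
    int y = x * 2;       /* independent work, done too late */
    return r + y;
}

/* After scheduling: the independent multiply fills the stall slot. */
int after(const int *a, int i, int x) {
    int v = a[i];        /* load issues */
    int y = x * 2;       /* independent: executes while the load completes */
    int r = v + 1;       /* load result is ready (or closer to ready) now */
    return r + y;
}
```

Both versions compute the same value; only the instruction order changes.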

Modifications that may be needed when moving an instruction:

  1. Rename registers — so the moved instruction does not overwrite a value that is still needed at its new position
  2. Adjust immediates — e.g., fix a load/store offset if the instruction is moved past an update of its address register

2. Scheduling and If Conversion

If conversion (predication) helps instruction scheduling in two ways:

  1. Reduces branches — both sides execute using predication
  2. More scheduling flexibility — without branches, instructions can be moved more freely, reducing stalls
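A small sketch of if-conversion in C (illustrative function names): both sides of the branch are computed unconditionally, and the result is selected without a control-flow transfer — compilers typically lower the select to a conditional move or a predicated instruction.

```c
/* Branching form: instructions cannot easily move across the branch. */
int with_branch(int a, int b) {
    int x;
    if (a > b) x = a - b;
    else       x = b - a;
    return x;
}

/* If-converted form: both sides execute, then the predicate selects.
   No branch remains, so the scheduler can reorder freely. */
int if_converted(int a, int b) {
    int t1 = a - b;        /* "then" side */
    int t2 = b - a;        /* "else" side */
    int p  = (a > b);      /* predicate */
    return p ? t1 : t2;    /* select: typically a cmov, not a branch */
}
```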

A loop's backward branch cannot be if-converted (predicating it would change how many times the body executes), but the loop can instead be improved with loop unrolling.

3. Loop Unrolling

Loop unrolling: expand a loop so each iteration does the work of N original iterations.

Original (N=1):   loop body once per iteration
Unrolled (N=4):   loop body 4x per iteration, loop runs ¼ as many times
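A sketch of a sum loop unrolled by N=4 (the accumulator-splitting and the remainder loop are standard companions to unrolling, not part of the original notes):

```c
/* Original loop: one element per iteration (N = 1). */
int sum1(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unrolled by 4: loop overhead (compare, branch, index update) runs
   1/4 as often, and the four adds use separate accumulators so they
   are independent and schedulable in parallel. */
int sum4(const int *a, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    int s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += a[i];   /* remainder: n not a multiple of 4 */
    return s;
}
```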

Benefits:

  1. Fewer overhead instructions — the compare, branch, and index update execute N× less often
  2. A larger loop body — more independent instructions for the scheduler to work with

Downsides:

4. Function Call Inlining

Inlining: replace a function call with the function body directly at the call site.
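A minimal C sketch of inlining (illustrative names; here the transformation is written out by hand to show what the compiler produces):

```c
/* Small helper a compiler would likely inline. */
static int square(int x) { return x * x; }

/* Call form: call/return overhead, and scheduling stops at each call. */
int sum_of_squares_call(int a, int b) {
    return square(a) + square(b);
}

/* Inlined form: the body replaces the calls, so the two multiplies
   are visible to the scheduler as independent instructions. */
int sum_of_squares_inlined(int a, int b) {
    return (a * a) + (b * b);
}
```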

Benefits:

  1. Eliminates call overhead — no call, return, or parameter-passing instructions
  2. Better scheduling — the compiler can optimize across what was the call boundary

Downside:


Other Compiler ILP Techniques (not covered in detail)

Examples include software pipelining (overlapping instructions from different loop iterations) and trace scheduling (scheduling along the most likely path across branches).


VLIW Architecture

Processors that Can Issue > 1 Instruction/Cycle

| Processor Type | Scheduling | Cost | Notes |
|---|---|---|---|
| OOO Superscalar | Hardware (RS, RAT) | Very high | Looks at many instructions; compiler helps |
| In-Order Superscalar | Partial hardware | Medium | Fewer instructions visible; needs compiler help |
| VLIW | Compiler only | Low | Executes 1 large instruction per cycle |

VLIW: How It Works

A VLIW processor executes one very long instruction per cycle. Each VLIW instruction contains multiple operation slots (e.g., one ALU op, one FP op, one load/store op).

| Step | Superscalar | VLIW |
|---|---|---|
| 1 | Hardware fetches multiple instructions | Compiler packs independent ops into one word |
| 2 | Hardware checks dependencies at runtime | Compiler checks dependencies at compile time |
| 3 | Hardware schedules for parallel execution | Hardware just executes the next large instruction word |

If operations are dependent, the compiler must place them in separate instruction words. This can cause code bloat (many NOPs in unused operation slots).


VLIW: The Good and the Bad

Advantages

  1. Simpler, cheaper, lower-power hardware — no dependency-checking or dynamic-scheduling logic
  2. Works well on regular, loop-heavy code where the compiler can schedule effectively

Disadvantages

  1. Latencies are not all known at compile time (e.g., cache misses), so the compiler must schedule conservatively
  2. Irregular code (pointer chasing, hard-to-predict branches) is difficult to schedule statically
  3. Code bloat from NOPs in unused operation slots


VLIW Instruction Features

| Feature | Purpose |
|---|---|
| Standard ISA opcodes | All usual operations |
| Full predication support | Compiler can eliminate branches |
| Many registers | Needed due to scheduling optimizations (loop unrolling exposes more live values) |
| Branch hints | Compiler tells hardware its branch predictions |
| Instruction compaction | Replace NOP slots with “stop” markers → reduces code bloat |

VLIW Examples

| Processor | Notes |
|---|---|
| Itanium (Intel IA-64) | Too complicated; poor performance on irregular code |
| DSP Processors | Excellent performance, energy efficient — regular workloads |

Summary

Key Takeaways:

  1. Compilers extract ILP two ways: better scheduling (fill stall slots with independent instructions) and fewer instructions (eliminate overhead)
  2. Tree height reduction, if-conversion, loop unrolling, and inlining all expose more independent instructions to schedule
  3. VLIW shifts scheduling from hardware to the compiler — cheaper, lower-power hardware, but poor results on irregular code
  4. VLIW succeeded in DSPs (regular workloads) but struggled in general-purpose computing (Itanium)

See Also: 04-ILP-and-Register-Renaming, 03-Branch-Prediction
Next: 08-Advanced-Caches