CPU Scheduling

Foundation & Mechanism

Core Definition

The Scheduler decides which task (thread/process) runs on the CPU and when.

The Loop:

  1. Ready Queue: Set of tasks ready to run.
  2. Trigger: The scheduler runs when the current task finishes, blocks on I/O, or its timeslice expires.
  3. Context Switch: Scheduler picks a new task and swaps context.

Scheduling Metrics

Metric           | Definition                                        | Goal
Throughput       | Jobs completed per unit time                      | Maximize
Wait Time        | Time spent waiting in the ready queue             | Minimize
Completion Time  | Total time from arrival to finish                 | Minimize
CPU Utilization  | Fraction of time the CPU is doing work (not idle) | Maximize
Responsiveness   | How quickly the system responds to user input     | Maximize

Scheduling Algorithms

Non-Preemptive Scheduling

Assumption: Once a task starts, it runs until it finishes or blocks voluntarily.

FCFS (First Come First Serve)

Tasks run in order of arrival; no preemption.


Runqueue implementation: Simple FIFO queue

Problem: A big job arriving early can make shorter jobs wait a lot → poor average completion/wait times. This is called the convoy effect.
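The convoy effect can be seen in a tiny simulation (job names and run times are made up): one 100-unit job arriving first forces two 5-unit jobs to wait nearly its entire length.

```python
from collections import deque

# Hypothetical jobs: (name, arrival, burst). The long job "A" arrives
# first and triggers the convoy effect for B and C.
jobs = deque([("A", 0, 100), ("B", 1, 5), ("C", 2, 5)])

results, clock = {}, 0
for name, arrival, burst in jobs:   # FCFS: strict arrival order
    clock = max(clock, arrival)     # CPU idles until the job arrives
    wait = clock - arrival          # time spent in the ready queue
    clock += burst                  # run to completion, no preemption
    results[name] = {"wait": wait, "completion": clock - arrival}

print(results)
# B waits 99 units and C waits 103 — almost entirely A's runtime.
```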

SJF (Shortest Job First)

Idea: Always run the task with the shortest remaining execution time (for run-to-completion).

Data structures: sorted list, min-heap, or balanced tree

Problems:

  1. Requires knowing (or predicting) each job's run time in advance
  2. Long jobs can starve if short jobs keep arriving

Mechanics (Preemptive variant - SRTF):

On each arrival, compare the new task's run time with the running task's remaining time; if the newcomer is shorter, preempt and switch to it.

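A minimal non-preemptive SJF sketch using a min-heap, as mentioned under data structures (the job tuples are invented, and a real scheduler would have to predict burst lengths rather than read them):

```python
import heapq

def sjf(jobs):
    """jobs: list of (arrival, burst, name); returns completion order."""
    jobs = sorted(jobs)                      # admit in arrival order
    heap, clock, order, i = [], 0, [], 0
    while i < len(jobs) or heap:
        while i < len(jobs) and jobs[i][0] <= clock:
            arrival, burst, name = jobs[i]   # everything that has arrived
            heapq.heappush(heap, (burst, arrival, name))
            i += 1
        if not heap:                         # CPU idle: jump ahead
            clock = jobs[i][0]
            continue
        burst, arrival, name = heapq.heappop(heap)
        clock += burst                       # run shortest job to the end
        order.append(name)
    return order

# Three jobs arriving together: the short ones go first.
print(sjf([(0, 100, "A"), (0, 5, "B"), (0, 5, "C")]))  # ['B', 'C', 'A']
```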

Preemptive Scheduling

Modern Standard: Tasks are forced to stop when their time is up.

Round Robin Scheduling (RR)

Use when tasks have equal priority.

Basic (non-preemptive) rule:

When the running task yields or blocks, move it to the back of the queue and run the task at the front.

Combine with timeslicing:

Preempt any task that runs past its timeslice and move it to the back of the queue, so every task gets the CPU regularly.

Benefits:

  1. Simple and starvation-free
  2. Bounded wait before every task gets CPU time → good responsiveness

Trade-offs:

  1. Timeslice too small → context-switch overhead dominates
  2. Timeslice too large → behaves like FCFS and responsiveness suffers
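The timeslicing rule can be sketched in a few lines (task names and burst lengths are arbitrary):

```python
from collections import deque

def round_robin(tasks, quantum):
    """tasks: list of (name, burst); returns the order of CPU turns."""
    queue, schedule = deque(tasks), []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)            # run for up to one quantum
        remaining -= quantum
        if remaining > 0:                # not finished: back of the queue
            queue.append((name, remaining))
    return schedule

print(round_robin([("A", 3), ("B", 5), ("C", 2)], quantum=2))
# ['A', 'B', 'C', 'A', 'B', 'B']
```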


Priority Scheduling

We assign each task a priority.

Rules:

  1. Always run the highest-priority ready task
  2. Among tasks of equal priority, use round robin

Real systems:

Priorities typically encode classes of work (kernel vs. user, interactive vs. background), e.g. Linux's 140 priority levels described below.

Implementation:

One runqueue (FIFO list) per priority level.

Scheduler:

Pick the first task from the highest-priority non-empty queue.

Starvation and Priority Aging:

Problem: Low-priority tasks can starve if high-priority tasks keep arriving.

Solution: Priority aging

Effective priority = function(actual priority, waiting time)

The longer a task waits, the more its effective priority is boosted. Eventually it bubbles up high enough to get CPU time.
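One possible shape for the aging function (the boost rate is an arbitrary tuning knob, and lower numbers mean higher priority here):

```python
BOOST_PER_TICK = 0.1   # assumed aging rate; real systems tune this

def effective_priority(base, waited_ticks):
    # Lower value = higher priority, so waiting subtracts from the number.
    return base - BOOST_PER_TICK * waited_ticks

# A task with a poor base priority (20) eventually overtakes a base-10
# task once it has waited long enough:
print(effective_priority(20, waited_ticks=0))    # 20.0 -> still behind
print(effective_priority(20, waited_ticks=150))  # 5.0  -> now ahead of 10
```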


Priority Inversion

Classic subtle bug.

Setup:

  1. T3 (low priority) acquires a mutex
  2. T1 (high priority) wakes up, needs the same mutex, and blocks
  3. T2 (medium priority) preempts T3, so T3 never runs and never releases the mutex

Result: A lower-priority thread is effectively blocking a higher-priority one → priority inversion

Fix: Priority Inheritance

When a high-priority task is blocked on a lock held by a low-priority task:

  1. Temporarily boost T3’s priority to that of T1
  2. The scheduler now sees T3 as high priority and runs it
  3. T3 runs and releases the mutex
  4. T3’s priority is dropped back down
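The four steps could be sketched like this (the Task fields and the T1/T3 priority values are illustrative; lower number = higher priority):

```python
class Task:
    def __init__(self, name, prio):
        self.name = name
        self.base = prio         # assigned priority
        self.effective = prio    # what the scheduler actually uses

def block_on_lock(holder, waiter):
    # Step 1: the blocked waiter donates its priority to the lock holder.
    holder.effective = min(holder.effective, waiter.effective)

def release_lock(holder):
    # Step 4: the holder drops back to its own priority after unlocking.
    holder.effective = holder.base

t1, t3 = Task("T1", prio=1), Task("T3", prio=30)
block_on_lock(holder=t3, waiter=t1)
print(t3.effective)   # 1  -> T3 now outranks any medium-priority task
release_lock(t3)
print(t3.effective)   # 30 -> back to normal
```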

Multilevel Feedback Queue (MLFQ)

Give different tasks different timeslices and priorities automatically, based on their behavior.

Core idea: Maintain multiple queues, each with its own priority level and timeslice length.

Top queue:

Highest priority, shortest timeslice (suits interactive, I/O-bound tasks).

Bottom queue:

Lowest priority, longest timeslice (suits CPU-bound batch tasks).

Heuristic rules:

  1. New tasks start in the topmost queue
  2. When a task runs:
    • If it yields before its timeslice ends → likely I/O-bound or interactive → keep it in a high-priority queue
    • If it must be preempted because it used up its timeslice → likely CPU-bound → demote it to a lower queue with a longer quantum
    • If a task in a lower queue starts yielding often, you can boost it back up

The scheduler learns from how tasks behave and adjusts their queue accordingly. It adapts priorities dynamically.
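A skeleton of those heuristics (the queue count and quanta are invented; a real MLFQ also needs a periodic global boost to prevent starvation):

```python
from collections import deque

class MLFQ:
    def __init__(self, quanta=(10, 20, 40)):      # ms per level (assumed)
        self.quanta = quanta                      # top queue: shortest slice
        self.queues = [deque() for _ in quanta]

    def admit(self, task):
        self.queues[0].append(task)               # rule 1: start on top

    def on_quantum_expired(self, task, level):
        # Used its whole slice -> likely CPU-bound -> demote.
        self.queues[min(level + 1, len(self.queues) - 1)].append(task)

    def on_yield(self, task, level):
        # Yielded early -> likely interactive -> boost back up.
        self.queues[max(level - 1, 0)].append(task)

    def pick(self):
        for level, q in enumerate(self.queues):   # highest non-empty queue
            if q:
                return q.popleft(), level
        return None, None

mlfq = MLFQ()
mlfq.admit("compiler")
task, level = mlfq.pick()
mlfq.on_quantum_expired(task, level)  # burned its slice: demoted to queue 1
print(mlfq.pick())                    # ('compiler', 1)
```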

Benefits:

Challenges:


Real-World Linux Schedulers

Linux O(1) Scheduler

Selecting the next task and adding a task to the runqueue both happen in constant time, independent of the number of tasks.

Priority Levels

140 priority levels total:

  1. 0-99: real-time tasks (highest priority)
  2. 100-139: timesharing tasks, mapped from nice values -20 to +19

Nice values: “Nice” means you let others go first. “Not Nice” (negative) means you are selfish and want the CPU now.

Timeslice and Feedback

Higher-priority tasks receive longer timeslices. Feedback comes from sleep time: tasks that sleep a lot are treated as interactive and get a priority boost, while tasks that always use up their full timeslice are treated as CPU-bound and penalized.
Runqueue Structure

Two arrays per CPU:

  1. active: tasks that still have timeslice left
  2. expired: tasks that have used up their timeslice

Each array:

  1. 140 FIFO lists, one per priority level
  2. A bitmask with one bit per level, marking which lists are non-empty

Behavior:

  1. Scheduler always chooses from active:

    • Use “find first set bit” CPU instruction on the bitmask to find highest-priority non-empty queue in O(1)
    • Take first task from that list
  2. When a running task:

    • Blocks or is preempted before using full timeslice → put back in active
    • Uses entire timeslice → move it to expired
  3. When active is empty:

    • Swap active and expired
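The bitmask trick can be imitated in a few lines (Python's arbitrary-size ints stand in for the fixed-size mask, and `(x & -x).bit_length() - 1` plays the role of the find-first-set instruction):

```python
NPRIO = 140                       # 0 = highest priority, 139 = lowest
queues = [[] for _ in range(NPRIO)]
bitmask = 0                       # bit p set <=> queues[p] is non-empty

def enqueue(task, prio):
    global bitmask
    queues[prio].append(task)
    bitmask |= 1 << prio

def pick_next():
    global bitmask
    if not bitmask:
        return None
    prio = (bitmask & -bitmask).bit_length() - 1   # lowest set bit = best prio
    task = queues[prio].pop(0)
    if not queues[prio]:
        bitmask &= ~(1 << prio)   # list emptied: clear its bit
    return task

enqueue("backup", 139)
enqueue("editor", 100)
print(pick_next())   # 'editor' — highest-priority non-empty queue wins
```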

Why small timeslices for low-priority tasks?

Low-priority tasks are typically CPU-bound background work; keeping their timeslices short limits how long any one of them occupies the CPU at a stretch, so the system stays responsive.


Linux CFS (Completely Fair Scheduler)

Goal: Provide fair sharing of CPU time among tasks, proportional to their priority (weight).

Key Concept: Virtual Runtime (vruntime)

Each task's vruntime tracks how much CPU time it has received, scaled by its weight: vruntime advances slowly for high-priority (heavily weighted) tasks and quickly for low-priority ones. High-priority tasks can therefore run longer before being considered "unfair"; low-priority tasks are more easily preempted.

Runqueue Data Structure

A red-black tree ordered by vruntime: the leftmost node is the task with the smallest vruntime (next to run), and insert/remove cost O(log n).

How It Runs

  1. Pick leftmost node (min vruntime)
  2. Run that task
  3. While it runs:
    • Its vruntime increases (fast or slow, depending on priority)
  4. Periodically compare its vruntime to leftmost node’s vruntime:
    • If still smallest → keep running it
    • If it has “caught up” or exceeded another task → preempt and put it back in the tree
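That loop can be imitated with a heap standing in for the red-black tree (the weights roughly mirror Linux's nice-to-weight table, but treat them as illustrative):

```python
import heapq

BASE_WEIGHT = 1024                # weight of a nice-0 task

def run_one_tick(runqueue, tick_ns=1_000_000):
    # Pick the leftmost node (minimum vruntime) and charge it one tick,
    # scaled inversely by its weight: heavy tasks age slowly.
    vruntime, name, weight = heapq.heappop(runqueue)
    vruntime += tick_ns * BASE_WEIGHT // weight
    heapq.heappush(runqueue, (vruntime, name, weight))
    return name

rq = [(0, "nice0", 1024), (0, "nice5", 335)]   # (vruntime, name, weight)
heapq.heapify(rq)
history = [run_one_tick(rq) for _ in range(4)]
print(history)   # ['nice0', 'nice5', 'nice0', 'nice0'] — heavy task runs more
```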

Effects

  1. Each task receives CPU time in proportion to its weight
  2. Tasks that sleep accumulate little vruntime, so they run soon after waking → good interactivity

Downside: For huge numbers of tasks, the log n factor of insert/remove might become a bottleneck.


Multiprocessor Scheduling

We have multiple CPUs/cores.

Architectures

  1. Shared memory multiprocessor:

    • Multiple CPUs
    • Each CPU: private L1/L2; maybe shared last-level cache
    • Shared main memory (DRAM)
  2. Multicore CPU:

    • One socket with multiple cores inside
    • Each core: private L1/L2
    • Shared last-level cache on chip
    • DRAM shared

From OS viewpoint: sees multiple logical CPUs (each core or hardware thread is something schedulable).

Cache Affinity

Reasoning:

A task that recently ran on a CPU still has its working set hot in that CPU's caches; migrating it to another CPU means starting over from cold misses.

Hence:

Schedulers try to keep each task on the CPU where it last ran (soft affinity).

Load Balancing

We usually give each CPU its own runqueue and scheduler, plus a load balancer:

Trade-off:

Migrating a task balances load but sacrifices its cache affinity, so balancers move tasks only when the imbalance is large enough to be worth it.


Hardware Considerations

Hyperthreading/SMT

Hyperthreading (Simultaneous Multithreading) treats one physical core as two logical CPUs.

Key Concept: Hiding memory stalls by treating one core as two.

Strategy: Mix CPU-bound and memory-bound threads.

From OS perspective:

Each hardware thread appears as a separate logical CPU, but co-scheduled threads share the core's execution units and caches, so which tasks get paired on a core matters.


NUMA (Non-Uniform Memory Access)

Some systems have multiple memory nodes; each CPU/socket is closer (physically) to some memory banks.

Goal: NUMA-aware scheduling

Implementation:

Keep each task on (or near) the node that holds its memory: schedule it on that node's CPUs and allocate its new memory from the local node.


Summary

Scheduling Algorithm Evolution

  1. FCFS: Simple but suffers from convoy effect
  2. SJF/SRTF: Optimal average wait time, but needs job length prediction
  3. Round Robin: Fair timesharing, good for interactive systems
  4. Priority: Supports different classes of work, needs anti-starvation mechanisms
  5. MLFQ: Adaptive, learns task behavior automatically
  6. O(1) Scheduler: Constant-time operations, priority-based with feedback
  7. CFS: Proportional fairness, better for modern workloads

Key Trade-offs