CPU Scheduling
Foundation & Mechanism
Core Definition
The scheduler decides which task (thread/process) runs on the CPU, and when.
The Loop:
- Ready Queue: Set of tasks ready to run.
- Trigger: a scheduling decision is needed (the running task finishes, blocks on I/O, or its timeslice expires).
- Context Switch: Scheduler picks a new task and swaps context.
Scheduling Metrics
| Metric | Definition | Goal |
|---|---|---|
| Throughput | Jobs completed per unit time | Maximize |
| Wait Time | Time spent waiting in the ready queue | Minimize |
| Completion Time | Total time from arrival to finish (also called turnaround time) | Minimize |
| CPU Utilization | Fraction of time CPU is doing work (not idle) | Maximize |
| Responsiveness | How quickly the system responds to user input | Maximize |
Scheduling Algorithms
Non-Preemptive Scheduling
Assumption: Once a task starts, it runs until it finishes or blocks voluntarily.
FCFS (First Come First Serve)
Tasks run in order of arrival; no preemption.
Example:
- T1: 1s, T2: 10s, T3: 1s
- Arrival order: T1, T2, T3
Timeline:
- 0–1: T1 (finishes at 1)
- 1–11: T2 (finishes at 11)
- 11–12: T3 (finishes at 12)
Metrics:
- Throughput = 3 / 12 = 0.25 tasks/s
- Completion times = [1, 11, 12], average = (1 + 11 + 12) / 3 = 8
- Wait times:
- T1: 0
- T2: waited 1s before starting
- T3: waited 11s
- Average = (0 + 1 + 11) / 3 = 4
Runqueue implementation: Simple FIFO queue
Problem: A big job arriving early can make shorter jobs wait a lot → poor average completion/wait times. This is called the convoy effect.
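A minimal sketch (assumptions: all three tasks arrive at t=0 in the order shown above) that replays the example timeline on a FIFO runqueue and reproduces the same metrics:

```python
from collections import deque

# Replay the example: T1 (1s), T2 (10s), T3 (1s), arriving in that order
runqueue = deque([("T1", 1), ("T2", 10), ("T3", 1)])  # (name, burst in seconds)

t, completion, wait = 0, {}, {}
while runqueue:
    name, burst = runqueue.popleft()  # simple FIFO: run strictly in arrival order
    wait[name] = t                    # time spent in the ready queue before running
    t += burst                        # run to completion, no preemption
    completion[name] = t

print(len(completion) / t)                         # throughput: 3/12 = 0.25 tasks/s
print(sum(completion.values()) / len(completion))  # avg completion: (1+11+12)/3 = 8
print(sum(wait.values()) / len(wait))              # avg wait: (0+1+11)/3 = 4
```

Swapping the deque order so T2 arrives last makes the convoy effect visible: the averages drop sharply.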
SJF (Shortest Job First)
Idea: Always run the task with the shortest expected execution time; in the pure (non-preemptive) version, the chosen task then runs to completion.
- Improves average completion and wait times compared with FCFS
- Needs a runqueue where you can efficiently find the shortest job
Data structures: sorted list, min-heap, or balanced tree
Problems:
- SJF assumes you know how long jobs will take, which is unrealistic. In practice, we approximate based on history.
- Possible starvation of long jobs if short ones keep arriving.
Mechanics (preemptive variant, SRTF = Shortest Remaining Time First):
- A task arrives → insert it into the runqueue
- The scheduler checks: “Is the new job shorter than the remaining time of the current job?”
- If yes → preempt current job and run new one
- If no → continue running current job
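A minimal sketch of SRTF, assuming burst times are known up front (exactly the unrealistic part noted above). A min-heap keyed on remaining time serves as the runqueue; pausing at each arrival and reinserting the current task implements the "is the newcomer shorter?" check:

```python
import heapq

def srtf(jobs):
    """jobs: list of (arrival_time, burst, name) -> completion time per task."""
    jobs = sorted(jobs)
    ready, finished, t, i = [], {}, 0, 0
    while i < len(jobs) or ready:
        if not ready:                          # CPU idle: jump to next arrival
            t = max(t, jobs[i][0])
        while i < len(jobs) and jobs[i][0] <= t:
            arrival, burst, name = jobs[i]     # new task -> insert into runqueue
            heapq.heappush(ready, (burst, name))
            i += 1
        remaining, name = heapq.heappop(ready)      # shortest remaining time wins
        next_arrival = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(remaining, next_arrival - t)      # run until done or next arrival
        t += run
        if run < remaining:                    # newcomer might be shorter: requeue
            heapq.heappush(ready, (remaining - run, name))
        else:
            finished[name] = t
    return finished

print(srtf([(0, 10, "long"), (1, 2, "short")]))  # {'short': 3, 'long': 12}
```

In the demo, "short" arrives at t=1, preempts "long" (2s remaining beats 9s), and finishes at t=3.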
Preemptive Scheduling
Modern Standard: Tasks are forcibly stopped (preempted) when their time is up, typically via a timer interrupt.
Round Robin Scheduling (RR)
Use when tasks have equal priority.
Basic (non-preemptive) rule, which on its own is just FCFS:
- Put all ready tasks in a queue
- Run T1 until completion, then T2, then T3, etc.
Combine with timeslicing:
- Each task gets a fixed time quantum (timeslice)
- After its quantum expires, it is preempted and moved to the back of the queue
Benefits:
- Good responsiveness for interactive tasks
- No starvation
- Fair sharing of CPU
Trade-offs:
- Too short timeslice → high context switch overhead
- Too long timeslice → poor responsiveness
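A minimal sketch of RR with timeslicing, assuming all tasks are ready at t=0; the quantum and burst values are made up for illustration:

```python
from collections import deque

def round_robin(tasks, quantum):
    """tasks: list of (name, remaining burst); all ready at t=0."""
    queue = deque(tasks)
    t = 0
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)          # run for at most one quantum
        t += run
        if remaining - run > 0:
            queue.append((name, remaining - run))  # quantum expired: back of queue
        else:
            print(f"{name} done at t={t}")

round_robin([("T1", 3), ("T2", 5), ("T3", 2)], quantum=2)  # T3@6, T1@7, T2@10
```

Shrinking the quantum to 1 roughly doubles the number of context switches here, which is the overhead trade-off in miniature.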
Priority Scheduling
We assign each task a priority.
Rules:
- Always run the highest-priority runnable task
- Must support preemption: If a higher priority task becomes ready, it preempts any lower-priority task
Real systems:
- Kernel tasks → high priority
- User apps → lower priority
Implementation:
- Per-priority runqueues: one queue per priority level
Scheduler:
- Find highest non-empty priority queue
- Dequeue head of that queue
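A minimal sketch of per-priority runqueues (the 4-level setup and task names are made up; the linear scan is the naive form of the bitmask trick the O(1) scheduler uses, shown later):

```python
from collections import deque

runqueues = [deque() for _ in range(4)]  # one FIFO per priority level; 0 = highest

def pick_next():
    for prio, queue in enumerate(runqueues):  # first non-empty = highest priority
        if queue:
            return queue.popleft()
    return None                               # nothing runnable -> idle

runqueues[0].append("kernel_worker")
runqueues[2].append("user_app")
print(pick_next())   # kernel_worker: the higher-priority task always wins
```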
Starvation and Priority Aging:
Problem: Low-priority tasks can starve if high-priority tasks keep arriving.
Solution: Priority aging
Effective priority = function(actual priority, waiting time)
The longer a task waits, the more its effective priority is boosted. Eventually it bubbles up high enough to get CPU time.
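One possible shape of the aging function (the linear form and the rate constant are assumptions; real schedulers vary):

```python
import time

AGING_RATE = 2.0   # illustrative: priority points gained per second of waiting

class Task:
    def __init__(self, base_priority):
        self.base_priority = base_priority
        self.enqueue_time = time.time()

def effective_priority(task, now=None):
    """Base priority plus a boost that grows with time spent waiting."""
    waited = (now if now is not None else time.time()) - task.enqueue_time
    return task.base_priority + AGING_RATE * waited

t = Task(base_priority=1)
print(effective_priority(t, now=t.enqueue_time + 10))  # 1 + 2*10 = 21.0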
Priority Inversion
Classic subtle bug.
Setup:
- Priority: P1 (high), P2 (medium), P3 (low) where P3 < P2 < P1
- T3 (low) is running and locks a mutex (e.g., shared data)
- T2 (medium) arrives → preempts T3 (because P2 > P3)
- T1 (high) arrives → preempts T2
- T1 now tries to lock the mutex held by T3 → T1 blocks, waiting for T3
Result: A lower-priority thread is effectively blocking a higher-priority one → priority inversion
Fix: Priority Inheritance
When a high-priority task is blocked on a lock held by a low-priority task:
- Temporarily boost T3’s priority to that of T1
- The scheduler now sees T3 as high priority and runs it
- T3 runs and releases the mutex
- T3’s priority is dropped back down
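A toy sketch of the inheritance step only (no real blocking or scheduling; the class names are made up, and higher number = higher priority):

```python
class Task:
    def __init__(self, name, priority):
        self.name, self.priority, self.base_priority = name, priority, priority

class PiMutex:
    """Toy mutex showing the priority-inheritance boost."""
    def __init__(self):
        self.owner = None

    def lock(self, task):
        if self.owner is None:
            self.owner = task                  # uncontended: just take it
        elif task.priority > self.owner.priority:
            # T1 blocks on T3's lock: donate T1's priority to T3 so the
            # scheduler runs T3 past T2 and the lock is released sooner
            self.owner.priority = task.priority

    def unlock(self):
        self.owner.priority = self.owner.base_priority  # drop the boost
        self.owner = None

m = PiMutex()
t3, t1 = Task("T3", priority=1), Task("T1", priority=3)
m.lock(t3); m.lock(t1)
print(t3.priority)   # 3: T3 now outranks the medium-priority T2
```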
Multilevel Feedback Queue (MLFQ)
Give different tasks different timeslices and priorities automatically, based on their behavior.
Core idea: Maintain multiple queues, each with:
- A different priority
- A different timeslice length
Top queue:
- Highest priority, shortest timeslice (good for I/O bound/interactive)
Bottom queue:
- Lowest priority, longest timeslice (good for CPU-bound)
Heuristic rules:
- New tasks start in the topmost queue
- When a task runs:
- If it yields before its timeslice ends → likely I/O-bound or interactive → keep it in a high-priority queue
- If it must be preempted because it used up its timeslice → likely CPU-bound → demote it to a lower queue with a longer quantum
- If a task in a lower queue starts yielding often, you can boost it back up
The scheduler learns from how tasks behave and adjusts their queue accordingly. It adapts priorities dynamically.
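A minimal sketch of that feedback rule (the level count, quanta, and boost-on-yield policy are simplified assumptions; real MLFQs also add periodic global boosts to fight starvation):

```python
from collections import deque

QUANTUM = [10, 20, 40]                # ms per level; illustrative values
queues = [deque() for _ in QUANTUM]   # queues[0] = highest priority

def requeue(task, level, used_full_slice):
    """MLFQ feedback: demote slice-burners, boost tasks that yield early."""
    if used_full_slice:
        level = min(level + 1, len(queues) - 1)  # looks CPU-bound -> demote
    else:
        level = max(level - 1, 0)                # looks interactive -> boost
    queues[level].append(task)
    return level

level = requeue("compiler", level=0, used_full_slice=True)
print(level, QUANTUM[level])   # 1 20: demoted, now gets a longer quantum
```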
Benefits:
- Automatically balances interactive and CPU-bound tasks
- No need for users to manually set priorities
Challenges:
- Gaming the scheduler: tasks could intentionally yield just before timeslice ends to stay in high priority
- Starvation of CPU-bound tasks if too many interactive tasks arrive
Real-World Linux Schedulers
Linux O(1) Scheduler
Selecting the next task and adding a task to the runqueue both happen in constant time, independent of the number of tasks.
Priority Levels
140 priority levels total:
- 0-99: real-time tasks
- 100-139: normal timesharing tasks
- Default user process priority is 120 (nice 0)
Nice values: “Nice” means you let others go first. “Not Nice” (negative) means you are selfish and want the CPU now.
- The Range: -20 (Most Selfish) to +19 (Most Generous)
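The nice range maps onto the 100-139 band by a fixed offset (Linux computes static priority as 120 + nice); a one-line sketch:

```python
def static_priority(nice):
    """Map a nice value (-20..+19) onto the 100..139 timesharing band."""
    assert -20 <= nice <= 19
    return 120 + nice   # nice 0 -> 120, the default noted above

print(static_priority(-20), static_priority(0), static_priority(19))  # 100 120 139
```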
Timeslice and Feedback
- Each priority has its own timeslice length
- Uses behavior feedback:
- Tasks that sleep a lot (I/O-bound) → treated as interactive → priority boosted
- Tasks that run continuously (don’t sleep) → CPU-bound → priority lowered
Runqueue Structure
Two arrays per CPU:
- active array
- expired array
Each array:
- Has one linked list (queue) per priority level
- Has a bitmask with bits set for non-empty lists
Behavior:
- Scheduler always chooses from active:
  - Use “find first set bit” CPU instruction on the bitmask to find the highest-priority non-empty queue in O(1)
  - Take the first task from that list
- When a running task:
  - Blocks or is preempted before using its full timeslice → put it back in active
  - Uses its entire timeslice → move it to expired
- When active is empty:
  - Swap active and expired
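A sketch of the two-array design (the kernel does this in C with per-CPU structs and a hardware find-first-set instruction; here Python bit tricks stand in, and lower index = higher priority as in Linux):

```python
from collections import deque

NUM_PRIOS = 140                      # matches the 0-139 levels described above

class PrioArray:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]
        self.bitmap = 0              # bit p is set <=> queues[p] is non-empty

    def enqueue(self, prio, task):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None              # nothing runnable
        prio = (self.bitmap & -self.bitmap).bit_length() - 1  # find first set bit
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)   # clear the bit when the list drains
        return task

active, expired = PrioArray(), PrioArray()
active.enqueue(120, "user_app")
print(active.pick_next())            # user_app
# when active.bitmap == 0: active, expired = expired, active  (the O(1) swap)
```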
Why small timeslices for low-priority tasks?
- Low-priority tasks only run once all high-priority tasks are done
- If they had huge timeslices, they’d hog CPU when they finally get it, making the system feel sluggish
- With smaller quanta, they get short bursts of CPU and then give it back quickly
Linux CFS (Completely Fair Scheduler)
Goal: Provide fair sharing of CPU time among tasks, proportional to their priority (weight).
Key Concept: Virtual Runtime (vruntime)
- For each task, CFS tracks how much CPU time it has effectively consumed
- “Virtual” because it’s scaled by priority:
- High-priority tasks’ vruntime increases more slowly
- Low-priority tasks’ vruntime increases more quickly
High-priority tasks can run longer before being considered “unfair”; low-priority tasks are more easily preempted.
Runqueue Data Structure
- Runqueue is a red-black tree
- Balanced BST → insert/search/remove in O(log n)
- Tree is ordered by vruntime
- Leftmost node → smallest vruntime → task that has gotten the least CPU so far
- Scheduler always chooses the leftmost node
How It Runs
- Pick leftmost node (min vruntime)
- Run that task
- While it runs:
- Its vruntime increases (fast or slow, depending on priority)
- Periodically compare its vruntime to leftmost node’s vruntime:
- If still smallest → keep running it
- If its vruntime has overtaken the leftmost task’s → preempt it and reinsert it into the tree
Effects
- Task selection: O(1) (cached leftmost pointer)
- Insertion/removal: O(log n)
- Smoother fairness: tasks share CPU in proportion to priority
- Better interactivity: interactive tasks that frequently sleep tend to have lower vruntime, so when they wake up, they appear on the left and get CPU quickly
Downside: For huge numbers of tasks, the log n factor of insert/remove might become a bottleneck.
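A minimal sketch of the vruntime bookkeeping, with a linear min() standing in for the red-black tree; the 1.25x-per-nice-step weight curve only approximates the kernel's prio_to_weight table:

```python
NICE_0_LOAD = 1024                    # weight of a nice-0 task

def weight(nice):
    # roughly 1.25x more weight per nice step below 0, less above
    return int(NICE_0_LOAD / (1.25 ** nice))

class Task:
    def __init__(self, name, nice):
        self.name, self.weight, self.vruntime = name, weight(nice), 0.0

def run_one_tick(tasks, tick_ms=1.0):
    task = min(tasks, key=lambda t: t.vruntime)           # the "leftmost" task
    task.vruntime += tick_ms * NICE_0_LOAD / task.weight  # slower for heavy tasks
    return task.name

tasks = [Task("hi", nice=-2), Task("lo", nice=2)]
timeline = [run_one_tick(tasks) for _ in range(12)]
print(timeline.count("hi"), timeline.count("lo"))  # 8 4: tracks the ~2.4:1 weights
```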
Multiprocessor Scheduling
We have multiple CPUs/cores.
Architectures
- Shared memory multiprocessor:
  - Multiple CPUs
  - Each CPU: private L1/L2; maybe a shared last-level cache
  - Shared main memory (DRAM)
- Multicore CPU:
  - One socket with multiple cores inside
  - Each core: private L1/L2
  - Shared last-level cache on chip
  - DRAM shared
From the OS viewpoint: it sees multiple logical CPUs (each core or hardware thread is an independently schedulable unit).
Cache Affinity
Reasoning:
- CPU caches matter a lot for performance
- If a thread runs on CPU 0, its working set ends up in CPU 0’s cache
- If later we run it on CPU 1, we lose that cache state → more cache misses
Hence:
- Try to keep tasks on the same CPU they ran on last time → cache affinity
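Affinity is normally a soft preference the scheduler applies on its own, but it can be forced; a Linux-only sketch using Python's wrapper for sched_setaffinity(2):

```python
import os

os.sched_setaffinity(0, {0})     # pin this process (pid 0 = self) to CPU 0
print(os.sched_getaffinity(0))   # -> {0}: the scheduler will not migrate us
```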
Load Balancing
We usually give each CPU its own runqueue and scheduler, plus a load balancer:
- Each CPU mostly schedules from its own queue
- Periodically:
- Check if some CPUs are overloaded vs. others
- Move tasks from busy CPU’s runqueue to idle/less busy CPU’s runqueue
Trade-off:
- Load balancing improves fairness and utilization
- But it conflicts with cache affinity (moving tasks loses cache)
- Schedulers must balance these competing goals
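A minimal sketch of a periodic balancing pass (the per-CPU deques and the one-task-per-pass policy are simplifying assumptions):

```python
from collections import deque

runqueues = {0: deque(["a", "b", "c", "d"]), 1: deque()}  # per-CPU ready queues

def balance(runqueues):
    """Migrate one task from the busiest CPU to the least busy one."""
    busiest = max(runqueues, key=lambda cpu: len(runqueues[cpu]))
    idlest = min(runqueues, key=lambda cpu: len(runqueues[cpu]))
    if len(runqueues[busiest]) - len(runqueues[idlest]) > 1:
        # steal from the head: it has waited longest, so its cache state on
        # the busy CPU is likely coldest (cheapest task to migrate)
        runqueues[idlest].append(runqueues[busiest].popleft())

balance(runqueues)
print(runqueues)   # {0: deque(['b', 'c', 'd']), 1: deque(['a'])}
```

Stealing the coldest task is one way to soften the affinity/balancing conflict described above.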
Hardware Considerations
Hyperthreading/SMT
Hyperthreading (Simultaneous Multithreading) treats one physical core as two logical CPUs.
Key Concept: Hiding memory stalls by treating one core as two.
Strategy: Mix CPU-bound and memory-bound threads.
- When one thread stalls waiting for memory, the other can use the execution units
From OS perspective:
- Sees two CPUs per core
- Must be aware that they share resources (execution units, caches)
- Scheduling decisions should consider this sharing
NUMA (Non-Uniform Memory Access)
Some systems have multiple memory nodes; each CPU/socket is closer (physically) to some memory banks.
- Local memory access: fast
- Remote memory access: slower
Goal: NUMA-aware scheduling
- Keep threads on CPUs close to the memory where their data resides
- Avoid migrating tasks across NUMA nodes unless necessary
Implementation:
- OS tracks which memory pages belong to which NUMA node
- Scheduler tries to keep threads on the node where their memory is allocated
- Page migration policies can move pages closer to where they’re accessed
Summary
Scheduling Algorithm Evolution
- FCFS: Simple but suffers from convoy effect
- SJF/SRTF: Optimal average wait time, but needs job length prediction
- Round Robin: Fair timesharing, good for interactive systems
- Priority: Supports different classes of work, needs anti-starvation mechanisms
- MLFQ: Adaptive, learns task behavior automatically
- O(1) Scheduler: Constant-time operations, priority-based with feedback
- CFS: Proportional fairness, better for modern workloads
Key Trade-offs
- Throughput vs. Responsiveness: Long timeslices improve throughput but hurt responsiveness
- Fairness vs. Efficiency: Perfect fairness requires frequent context switches
- Cache Affinity vs. Load Balancing: Keeping tasks on same CPU helps cache, but may create imbalance
- Complexity vs. Performance: More sophisticated algorithms can perform better but are harder to implement and debug
Modern Trends
- Move toward fair scheduling (CFS)
- NUMA awareness for large systems
- Energy-aware scheduling for mobile/embedded
- Real-time guarantees for critical tasks
- Container and cgroup-aware scheduling