Advanced Routing and Quality of Service

Prerequisites: 02-Network-Layer-and-Routing Learning Goals: After reading this, you will understand router architecture internals, packet classification algorithms, Head-of-Line blocking, QoS scheduling algorithms (Fair Queuing, DRR), and traffic shaping mechanisms (Token/Leaky Bucket).

Introduction

Moving beyond basic destination-based forwarding, modern routers must handle complex tasks: classifying traffic based on multiple criteria (firewall rules, QoS policies), switching packets at line rate, and managing traffic to prevent congestion. This file covers the hardware and algorithmic challenges of high-speed routers, from lookup optimizations to sophisticated scheduling and traffic shaping.

Key Challenges Addressed:

How to classify packets based on multiple header fields (not just destination IP)
How to switch packets at multi-Gbps speeds without blocking
How to fairly share bandwidth among competing flows
How to shape traffic to meet service level agreements

Router Architecture Internals

The Two Planes Revisited

Control Plane (The Brain):

Implemented in software (Routing Processor - general-purpose CPU)
Function: Runs routing protocols (BGP, OSPF) to build the routing table
Speed: Slow (milliseconds to seconds)
Location: Centralized processor

Forwarding Plane (The Muscle):

Implemented in hardware (ASICs, FPGAs on line cards)
Function: Moves packets from input port to output port using forwarding table (FIB)
Speed: Fast (nanoseconds - line rate)
Location: Distributed across input/output ports

Key Insight: Separating these planes allows innovation in control logic (SDN) without changing hardware.

Key Router Components

1. Input Ports:

Physical termination: Convert optical/electrical signals to bits
Data link processing: Decapsulate frame, extract packet
Lookup: Match destination IP against forwarding table (FIB) - critical bottleneck
Queueing: Buffer packets if switching fabric is busy

2. Switching Fabric:

Function: Interconnect that moves packets from input to output
Types:
- Memory-based: Packets copied to/from shared memory (oldest, slowest - limited by memory bandwidth)
- Bus-based: Packets traverse shared bus (limited by bus bandwidth)
- Crossbar-based: N×N parallel paths (fastest, most common in modern routers)

3. Output Ports:

Queueing: Buffer packets waiting for transmission (critical for QoS)
Scheduling: Determine which packet to send next (FIFO, priority, fair queueing)
Data link processing: Encapsulate packet in frame
Physical transmission: Convert bits to signals

4. Routing Processor:

Runs control plane software
Builds routing table and compiles into forwarding table
Distributes FIB to line cards

Lookup Algorithms Revisited

Challenge: IP forwarding requires Longest Prefix Match (LPM) on every packet

Why Simple Solutions Fail:

Linear search: O(n) - acceptable for 100 routes, prohibitive for 800,000+ routes
Caching: High hit rates (80-90%) but misses still require full lookup; Internet traffic flows are diverse and short-lived

Trie-Based Solutions:

Unibit Trie:

Check 1 bit at a time
Advantages: Memory efficient, simple
Disadvantages: Slow (32 memory accesses for IPv4, 128 for IPv6)

Multibit Trie:

Check k bits at a time (stride = k)
Controlled Prefix Expansion: Expand prefixes to match stride boundaries
- Example: Prefix “1*” (length 1) expanded to “10*” and “11*” (length 2)
Trade-off: More memory for fewer lookups
Performance: Stride = 8-16 bits → 2-4 memory accesses per lookup
Result: Line-rate forwarding at multi-Gbps speeds

Example:

Stride = 2 (check 2 bits at once)

Original Prefixes:
  1*    → Port A
  11*   → Port B
  101*  → Port C

After Expansion (to align with stride 2):
  10*   → Port A  (expanded from 1*)
  11*   → Port B  (already length 2)

Level 1: Check bits [0:1]
  00, 01, 10*, 11*

Level 2 (for 10*): Check bits [2:3]
  100*, 101*, 110*, 111*

Lookup for 10110...:
  Step 1: Check first 2 bits = 10 → match 10*
  Step 2: Check next 2 bits = 11 → match 101*
  Result: Forward to Port C

Packet Classification

Motivation

Problem: Longest prefix matching only considers destination IP

Need: Route/filter based on multiple fields simultaneously:

Firewall rules: Block traffic from specific source IPs to specific destination ports
QoS policies: Prioritize video traffic (TCP port 443 from specific sources)
NAT/VPN: Identify flows for translation/tunneling

Classification Criteria:

Source IP prefix
Destination IP prefix
Source port range
Destination port range
Protocol (TCP, UDP, ICMP)
TCP flags (SYN, ACK, etc.)

Simple Classification Approaches

1. Linear Search:

Check each rule sequentially until a match
Complexity: O(n) where n = number of rules
Acceptable: < 100 rules
Prohibitive: Thousands of rules (modern firewalls have 10,000+ rules)

2. Caching:

Store recently matched flow → rule mappings
Hit rate: 80-90% in practice
Problem: Cache misses still require full linear search (or worse)
Challenge: Internet flows are short (many new flows) and diverse

3. MPLS (Multi-Protocol Label Switching):

Assign a label at the network edge based on full classification
Intermediate routers only look at the label (fast lookup)
Advantage: Avoids re-classification at every router
Disadvantage: Requires label distribution infrastructure

2D Classification Algorithms

Problem: Classify based on two dimensions (e.g., source IP and destination IP)

Representation: Rules as rectangles in 2D space

Destination IP Prefix
^
|  [Rule 2]
|     +------+
|     |      |
|  [Rule 1]  |
|  +------+  |
|  |      +--+
|  +------+
+----------------> Source IP Prefix

1. Set-Pruning Tries:

Structure:

Build a destination trie
At each destination prefix node, attach a source trie containing all source prefixes compatible with that destination

Lookup:

Traverse destination trie
At each node, check source trie for source IP
Record best match seen so far
Continue to more specific destination nodes

Advantages:

Worst-case lookup time: O(W) where W = address width (32 bits)

Disadvantages:

Memory explosion: Source prefixes replicated across multiple destination nodes
Example: If 1000 destination prefixes exist, each source prefix might be replicated 1000 times

Memory Formula: O(N × W) where N = number of rules, W = address width

2. Backtracking:

Structure:

Build a destination trie with pointers to source tries (instead of duplicating)
Destination nodes point to the source trie for that destination prefix

Lookup:

Traverse destination trie to find longest match
Check pointed source trie for source IP
If no match, backtrack to parent destination node and try its source trie
Repeat until match found

Advantages:

Memory efficient: No replication of source tries
Memory: O(N × W)

Disadvantages:

Time cost: Backtracking steps can be expensive
Worst-case: O(W²) if backtracking to root multiple times

Example:

Destination Trie:
  Root
    ├─ 0* → Source Trie A
    └─ 1* → Source Trie B
         └─ 11* → Source Trie C

Lookup for (src=101*, dst=110*):
  Step 1: Traverse to 11* (longest dst match)
  Step 2: Check Source Trie C for 101* → No match
  Step 3: Backtrack to 1*
  Step 4: Check Source Trie B for 101* → Match found

3. Grid of Tries:

Structure:

Hybrid approach combining Set-Pruning and Backtracking
Use switch pointers to jump directly to the next relevant source trie (instead of backtracking)
Precompute which source trie to check next if current fails

Lookup:

Traverse destination trie
Check source trie at current destination node
If no match, follow switch pointer to next source trie (no backtracking up the tree)
Continue until match found

Advantages:

Eliminates backtracking: O(W) worst-case time
Memory reasonable: More than backtracking, less than set-pruning

Disadvantages:

More complex to build (precompute switch pointers)
Higher memory than pure backtracking

Example:

Destination nodes with switch pointers:

  11* → Source Trie C --switch--> Source Trie B --switch--> Source Trie A
   |
  1* → Source Trie B --switch--> Source Trie A
   |
  Root → Source Trie A

Lookup for (src=101*, dst=110*):
  Step 1: Check Source Trie C → No match
  Step 2: Follow switch pointer to Source Trie B → Match found
  (No backtracking needed!)

Switching and Scheduling

Crossbar Switches

Structure: N inputs × N outputs with N² crosspoints

How It Works:

Each input can connect to any output
Multiple connections can happen simultaneously (if no conflicts)

Example (3×3 crossbar):

Input 1 ---+---+---+
           |   |   |
Input 2 ---+---+---+
           |   |   |
Input 3 ---+---+---+
           |   |   |
         Out1 Out2 Out3

Advantages:

Parallel switching: Multiple packets can be switched simultaneously
High throughput: N times faster than single-path switches

Challenge: Head-of-Line (HOL) Blocking

Head-of-Line (HOL) Blocking

Problem: In an input-queued crossbar switch, packets at the head of the queue can block packets behind them

Example:

Input Queue 1: [Packet A → Output 1] [Packet B → Output 2]
Input Queue 2: [Packet C → Output 1] [Packet D → Output 3]

Time Slot 1:
  Packet A and Packet C both want Output 1 (conflict!)
  Arbitration: Packet A wins
  Packet C waits (blocked)

  BUT: Packet D wants Output 3 (free!)
  Problem: Packet D is stuck behind Packet C
  Packet D cannot be switched even though Output 3 is available

Result: Throughput limited to ~58% of capacity (even with infinite buffers)

Why 58%?

Mathematical analysis shows random traffic patterns with HOL blocking limits throughput to approximately 58.6% of maximum

Solutions to HOL Blocking

1. Output Queuing (Knockout Scheme):

Idea: Move all buffering to output ports instead of input ports

How It Works:

Switching fabric runs k times faster than input link speed
Example: If input links are 10 Gbps, fabric runs at k × 10 Gbps
All packets destined for an output are delivered to output queue immediately

Advantages:

Eliminates HOL blocking: 100% throughput possible
Simple scheduling (no input coordination needed)

Disadvantages:

Expensive: Requires very fast switching fabric (k × line rate)
Scalability: Difficult for high-speed routers (100+ Gbps links)

Typical k value: 4-8× speedup

2. Virtual Output Queues (VOQs) with Parallel Iterative Matching (PIM):

Idea: Eliminate HOL blocking by maintaining separate queues at each input for each output

Structure:

Input port has N queues (one per output port)
Packet destined for output j goes to queue j at input

Example (3×3 switch):

Input 1: [Queue for Out1] [Queue for Out2] [Queue for Out3]
Input 2: [Queue for Out1] [Queue for Out2] [Queue for Out3]
Input 3: [Queue for Out1] [Queue for Out2] [Queue for Out3]

No More HOL Blocking:

Packet to Output 1 doesn’t block packet to Output 3 (different queues)

New Problem: How to schedule which input-output pairs to connect each time slot?

Parallel Iterative Matching (PIM) Algorithm:

Goal: Find a maximal matching (connect as many input-output pairs as possible without conflicts)

Three Phases per Iteration:

1. Request:

Each input sends requests to all outputs for which it has queued packets
Example: Input 1 has packets for Output 1 and 3 → sends requests to both

2. Grant:

Each output randomly selects one requesting input and grants permission
Example: Output 1 receives requests from Input 1 and 2 → randomly picks Input 1

3. Accept:

Each input receives grants from multiple outputs → randomly accepts one
Example: Input 1 receives grants from Output 1 and 3 → randomly accepts Output 1

Iterate: Repeat 3-4 times to find better matchings

Example:

Initial state:
  Input 1: has packets for Output 1, 3
  Input 2: has packets for Output 1, 2
  Input 3: has packets for Output 2, 3

Iteration 1:
  Request: I1→[O1,O3], I2→[O1,O2], I3→[O2,O3]
  Grant: O1→I1 (random), O2→I2 (random), O3→I3 (random)
  Accept: I1→O1, I2→O2, I3→O3

Result: Schedule transfers I1→O1, I2→O2, I3→O3 (perfect matching!)

Advantages:

High throughput: Approaches 100% with iterations
Distributed: No central scheduler bottleneck
Fast: Simple random selection, no complex computation

Disadvantages:

Randomness: Not guaranteed optimal (but good in practice)
Iterations needed: Typically 3-4 iterations for good performance

Quality of Service (QoS)

Motivation

Problem: All traffic treated equally (best-effort)

Issues:

Video calls dropped due to bulk file download
Critical financial transaction delayed by email
Real-time gaming laggy due to software updates

Solution: Quality of Service - Differentiate traffic and provide guarantees

QoS Dimensions:

Bandwidth: Minimum/maximum rate guarantees
Delay: Upper bound on latency
Jitter: Variation in delay
Loss: Packet drop rate

FIFO with Tail Drop

Simplest Approach: First-In-First-Out queue with drop when full

How It Works:

Packets arrive → Queue (FIFO) → Transmit
If queue full → Drop incoming packet (tail drop)

Advantages:

Simple to implement (O(1) per packet)
Low overhead

Disadvantages:

No fairness: One flow can monopolize the queue
No priority: Critical traffic treated same as bulk traffic
Tail drop issues: Drops affect all flows simultaneously (TCP synchronization)

Example Problem:

Flow A: Sending video (delay-sensitive)
Flow B: Sending file download (bulk, not delay-sensitive)

Flow B fills the queue → Flow A's packets dropped → Video freezes

Fair Queuing (Bit-by-Bit Round Robin)

Goal: Allocate bandwidth fairly among competing flows

Ideal Algorithm: Bit-by-Bit Round Robin

How It Works (Conceptual):

Identify all active flows (flows with queued packets)
In each “round”, serve 1 bit from each flow
Repeat until all queues empty

Example:

Flow A: 3000 bits queued
Flow B: 2000 bits queued
Flow C: 1000 bits queued

Round 1: Serve 1 bit from A, B, C (3 bits total)
Round 2: Serve 1 bit from A, B, C
...
Round 1000: C finishes
Round 1001-2000: Serve 1 bit from A, B
Round 2001-3000: Serve 1 bit from A

Finish times:
  C: Round 1000
  B: Round 2000
  A: Round 3000

Finish Time Formula:

Finish_time(packet) = max(Current_round, Finish_time(previous_packet_in_flow)) + Packet_size

Fairness Property:

Each flow receives equal share of bandwidth
If n flows, each gets 1/n of link capacity

Problem: Cannot send 1 bit at a time in practice (packets are indivisible)

Packet-by-Packet Fair Queuing (FQ)

Practical Implementation: Simulate bit-by-bit round robin at packet granularity

How It Works:

Calculate “finish time” for each packet (as if sending bit-by-bit)
Always transmit the packet with the smallest finish time
Update virtual time (simulated round number)

Data Structures:

Per-flow queues
Priority queue (min-heap) of packet finish times

Algorithm:

On packet arrival (flow i, size P):
  finish_time = max(virtual_time, finish_time_prev[i]) + P
  Insert packet into flow i's queue
  Insert finish_time into priority queue

On link available:
  packet = extract_min(priority_queue)
  Transmit packet
  virtual_time = packet.finish_time

Example:

Flow A: Packet 1 arrives (size 1000) at time 0
  finish_time = max(0, 0) + 1000 = 1000

Flow B: Packet 1 arrives (size 500) at time 0
  finish_time = max(0, 0) + 500 = 500

Priority Queue: [B:500, A:1000]

Link available:
  Transmit B's packet (smallest finish time = 500)
  virtual_time = 500

Flow B: Packet 2 arrives (size 500) at time 0
  finish_time = max(500, 500) + 500 = 1000

Priority Queue: [A:1000, B:1000]

Link available:
  Transmit A's packet (finish_time = 1000, tie broken arbitrarily)

Complexity: O(log n) per packet (priority queue operations)

Advantages:

Perfect fairness (in the long run)
Isolation: Misbehaving flow doesn’t affect others

Disadvantages:

High complexity: O(log n) per packet
Per-flow state: Memory overhead for large n

Deficit Round Robin (DRR)

Goal: Approximate fair queuing with O(1) complexity

Key Idea: Use “credits” (quantum) to allow flows to send multiple packets per round

Data Structures:

Quantum (Q): Fixed amount of “credit” given per round (e.g., 1500 bytes)
Deficit Counter (DC[i]): Unused credit from previous round for flow i
Active List: Flows with queued packets

Algorithm:

Each round:
  For each flow i in active list:
    DC[i] += Q  (add quantum)

    While (DC[i] >= size of head packet in flow i):
      Transmit packet
      DC[i] -= size of transmitted packet

    If flow i has no more packets:
      DC[i] = 0  (reset deficit)
      Remove from active list

Example:

Q = 1000 bytes

Round 1:
  Flow A: DC=0, Packet size=1500
    DC = 0 + 1000 = 1000 < 1500 → Cannot send
    DC[A] = 1000 (carry deficit forward)

  Flow B: DC=0, Packet size=500
    DC = 0 + 1000 = 1000 ≥ 500 → Send packet
    DC = 1000 - 500 = 500 (carry forward)

Round 2:
  Flow A: DC=1000
    DC = 1000 + 1000 = 2000 ≥ 1500 → Send packet
    DC = 2000 - 1500 = 500

  Flow B: DC=500
    DC = 500 + 1000 = 1500 ≥ 500 → Send packet
    DC = 1500 - 500 = 1000

Why “Deficit”?

If a flow cannot send in one round (deficit), it accumulates credit for next round
Eventually, accumulated credit allows sending large packets

Fairness:

Each flow gets approximately Q bytes per round
Long-term bandwidth share ≈ equal (like fair queuing)

Complexity: O(1) per packet (no priority queue, no sorting)

Advantages:

Constant-time: Fast, scalable
Simple: Easy to implement in hardware
Fair: Good approximation of fair queuing

Disadvantages:

Not perfect fairness: Short-term variations
Delay: Large packets might wait one full round

Typical Deployment: High-speed routers (where O(log n) is too expensive)

Traffic Shaping and Policing

Motivation

Problem: Bursty traffic can overwhelm the network

Example:

User’s camera uploads burst of photos (10 MB in 1 second)
Network link is 1 Mbps average
Burst causes congestion, packet loss for other users

Solution: Traffic Shaping - Smooth out bursts to match contracted rate

Related Concept: Traffic Policing - Drop/mark packets exceeding rate (enforced by network)

Leaky Bucket

Analogy: Bucket with a hole at the bottom

How It Works:

Packets arrive at variable rate (fill the bucket)
Packets transmitted at constant rate (leak out the hole)
If bucket overflows → drop packets

Model:

         Packets arrive (variable rate)
                |
                v
         +-------------+
         |   Bucket    |  ← Buffer (size B)
         |             |
         +-------------+
                |
                v
        Constant rate R (leak)

Parameters:

Bucket size (B): Maximum burst size (buffer capacity)
Leak rate (R): Output rate (constant)

Algorithm:

On packet arrival:
  If bucket_level + packet_size <= B:
    Add packet to bucket
  Else:
    Drop packet

Continuously (every time unit):
  If bucket not empty:
    Transmit R bits
    bucket_level -= R

Characteristics:

Output: Perfectly smooth (constant rate R)
Burst handling: Bursts absorbed in bucket (up to size B)
Overflow: Packets dropped if burst exceeds B

Example:

B = 5000 bytes, R = 1000 bytes/sec

Time 0: Burst of 3000 bytes arrives → bucket_level = 3000
Time 1: Transmit 1000 bytes → bucket_level = 2000
Time 1.5: Burst of 4000 bytes arrives → bucket_level = 6000 > B
  → Accept 3000 bytes (bucket now full at 5000), drop 1000 bytes
Time 2: Transmit 1000 bytes → bucket_level = 4000
...

Use Case: Smoothing traffic for constant bitrate (CBR) services (e.g., voice)

Token Bucket

Analogy: Bucket accumulates tokens at a fixed rate; need token to send a packet

How It Works:

Tokens arrive at constant rate R
Bucket holds up to B tokens
To transmit a packet of size P, consume P tokens
If not enough tokens → packet waits (or dropped in policing mode)

Model:

     Tokens arrive at rate R
                |
                v
         +-------------+
         | Token Bucket|  ← Capacity B tokens
         |             |
         +-------------+
                ↑
                | Consume tokens to send packet
         Packets arrive

Parameters:

Bucket size (B): Maximum burst size (in bytes/tokens)
Token rate (R): Average rate

Algorithm:

Continuously (every time unit):
  If tokens < B:
    tokens += R × time_elapsed
    tokens = min(tokens, B)  (cap at B)

On packet arrival (size P):
  If tokens >= P:
    Transmit packet
    tokens -= P
  Else:
    Queue packet (or drop if policing)

Characteristics:

Allows bursts: Can send burst of size B immediately (if tokens available)
Average rate: Long-term rate limited to R
Flexible output: Not constant rate (bursty output allowed)

Example:

B = 5000 bytes, R = 1000 bytes/sec

Time 0: tokens = 5000 (bucket full)
  Burst of 4000 bytes arrives → Send immediately
  tokens = 5000 - 4000 = 1000

Time 1: tokens = 1000 + 1000 = 2000
  Packet of 1500 bytes arrives → Send
  tokens = 2000 - 1500 = 500

Time 2: tokens = 500 + 1000 = 1500
  Burst of 3000 bytes arrives → Can only send 1500 bytes
  Send 1500 bytes, queue 1500 bytes
  tokens = 0

Key Difference from Leaky Bucket:

Leaky Bucket: Output is constant rate (smooth)
Token Bucket: Output can be bursty (up to B bytes burst) but average rate R

Use Case: Rate limiting with burst tolerance (e.g., SLAs allowing bursts up to B)

Token Bucket vs Leaky Bucket Summary

Aspect	Token Bucket	Leaky Bucket
Output Rate	Variable (allows bursts)	Constant
Burst Handling	Allows burst up to B immediately	Smooths burst over time
Average Rate	R (long-term)	R (always)
Use Case	Rate limiting with burst tolerance	Traffic smoothing for CBR
Example	ISP SLA: 10 Mbps average, 50 MB burst	VoIP codec: constant 64 Kbps

Analogy Comparison:

Token Bucket: Credit card with limit B, replenishes at rate R → Can spend up to limit in burst
Leaky Bucket: Water tank with hole → Water drips out at constant rate regardless of input

Summary

Key Takeaways

Router Architecture:
- Control plane (software) computes routes, forwarding plane (hardware) moves packets
- Multibit tries achieve line-rate lookups (2-4 memory accesses for IPv4)
Packet Classification:
- Required for firewalls, QoS, NAT (multi-field matching)
- Grid of Tries: O(W) time, eliminates backtracking with switch pointers
- Trade-off: Memory vs. time complexity
Head-of-Line Blocking:
- Limits input-queued switches to 58% throughput
- Solutions: Output queuing (expensive) or Virtual Output Queues + PIM (practical)
QoS Scheduling:
- FIFO: Simple but unfair
- Fair Queuing: Perfect fairness but O(log n) complexity
- Deficit Round Robin: O(1) approximation of fair queuing, scalable
Traffic Shaping:
- Leaky Bucket: Smooths traffic to constant rate (CBR services)
- Token Bucket: Allows bursts up to B, average rate R (SLAs)

Common Patterns

Router Design Trade-offs:

Memory vs. Speed (multibit tries)
Centralized vs. Distributed (output queuing vs. PIM)
Fairness vs. Complexity (Fair Queuing vs. DRR)

QoS Mechanisms:

Classification → Scheduling → Shaping
Proactive (shaping at source) vs. Reactive (policing at network)

Algorithm Selection:

Few rules → Linear search
Exact match → Hashing
Prefix match → Tries
Multi-field → Grid of Tries or decision trees

04 Advanced Routing And Qos

Advanced Routing and Quality of Service

Introduction

Router Architecture Internals

The Two Planes Revisited

Key Router Components

Lookup Algorithms Revisited

Packet Classification

Motivation

Simple Classification Approaches

2D Classification Algorithms

Switching and Scheduling

Crossbar Switches

Head-of-Line (HOL) Blocking

Solutions to HOL Blocking

Quality of Service (QoS)

Motivation

FIFO with Tail Drop

Fair Queuing (Bit-by-Bit Round Robin)

Packet-by-Packet Fair Queuing (FQ)

Deficit Round Robin (DRR)

Traffic Shaping and Policing

Motivation

Leaky Bucket

Token Bucket

Token Bucket vs Leaky Bucket Summary

Summary

Key Takeaways

Common Patterns

See Also