
07 I/O and Filesystems

Why I/O stuff matters

A real program doesn’t just talk to the CPU and RAM; it also has to use I/O devices. That raises four questions:

  1. How does the CPU talk to a device?
  2. How does a user program talk to a device safely and portably?
  3. How can this be fast, even though many devices are slow?
  4. How do we keep data consistent and recover from crashes?

I/O devices:

Most I/O devices have a similar structure, even if they do wildly different things.

Registers: a device exposes registers that the CPU can read/write:

CPU–Device interconnect: how devices are plugged in

Devices are connected to the CPU complex via controllers and buses.

Device drivers: the OS’s side of the handshake

A device driver is just kernel code (a software component) that knows:

Drivers are responsible for device access, management, and control. They are often provided by the device manufacturer per OS version, and each OS standardizes driver interfaces to achieve device independence while supporting device diversity.

Drivers are device-specific, but OSes provide a standard driver framework:

Types of devices

OSes classify devices mostly by how they behave:

  1. Block devices
    • Think: disks, SSDs.
    • Data is read/written in fixed-size blocks (e.g., 512B, 4KB).
    • You can directly access block i without reading 0…i-1 first (random access).
  2. Character devices
    • Think: keyboard, serial ports.
    • Provide a byte/character stream.
    • You just read/write bytes in order.
  3. Network devices
    • A bit in-between: you get variable-sized packets/frames (chunks of data).

The OS defines standard operations per device type:

Pseudo devices: these files do not correspond to physical hardware. They are implemented entirely by the operating system kernel to provide a special software function or service. Examples: /dev/null, /dev/zero, /dev/random, /dev/tty.
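As a tiny illustration, pseudo devices are used through the ordinary file API. A minimal sketch (standard POSIX calls only) that reads a few random bytes from /dev/urandom and discards data into /dev/null:

    /* Pseudo devices behave like ordinary files: open/read/write/close just work. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        unsigned char buf[16];
        int rnd  = open("/dev/urandom", O_RDONLY);
        int sink = open("/dev/null", O_WRONLY);
        if (rnd < 0 || sink < 0) { perror("open"); return 1; }

        ssize_t n = read(rnd, buf, sizeof buf);    /* kernel generates the bytes    */
        if (n > 0)
            write(sink, buf, (size_t)n);           /* kernel silently discards them */

        printf("read %zd random bytes, wrote them to /dev/null\n", n);
        close(rnd);
        close(sink);
        return 0;
    }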

Devices as files: /dev

UNIX-like systems represent devices as special files under /dev:

How the CPU talks to devices

  1. Memory-mapped I/O (MMIO): device registers are mapped into the physical address space. When the CPU does a store or load to those addresses, the bus logic routes it to the device, not to normal RAM. Which addresses are used is configured via Base Address Registers (BARs), part of PCI configuration. (A user-space sketch follows this list.)
  2. I/O port model: x86 has special in and out instructions and a separate I/O port space. The CPU says out port-number, value, which sends data to that device port. This is less common on non-x86 architectures nowadays; MMIO is more universal.
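For illustration, a minimal user-space MMIO sketch. The physical address 0xFE000000 and the register layout (status at offset 0x0, command at offset 0x4) are invented assumptions; real code would take the address from the device’s BAR, and mapping /dev/mem requires root and a kernel that allows it:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map one page of the device's register window into our address space. */
        void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0xFE000000);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        volatile uint32_t *regs = (volatile uint32_t *)map;

        uint32_t status = regs[0];   /* load  -> routed to the device, not to RAM   */
        regs[1] = 0x1;               /* store -> hits the device's command register */
        printf("status register = 0x%08x\n", (unsigned)status);

        munmap(map, 4096);
        close(fd);
        return 0;
    }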

Polling vs. interrupts

Once a device does something (e.g., disk read is done, packet arrived), the CPU needs to know.

Polling

Once command and addresses are set, actual data must move between device and memory.

Programmed I/O (PIO)

With just basic PCI support, a system can request an operation from a device using a method called programmed I/O (PIO). This requires no additional hardware support. The CPU issues instructions by writing to the device’s command registers and transfers data via the data registers:

num_instructions = (instructions for command) + (instructions for data)

Example: sending a 1500 B network packet using 8-byte data registers:
  • Each access transfers 8 bytes.
  • 1500 / 8 = 187.5 → round up to 188 transfers.
  • Plus 1 write to the command register.
  • = 189 CPU accesses total.

Workflow: Command → Wait/Poll → Data Write → Repeat.
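A sketch of that workflow in C, assuming a hypothetical NIC with a 32-bit command register, a 32-bit status register with a “ready” bit, and an 8-byte data register (all names, offsets, and bit meanings are invented; the fake in-memory “registers” in main() exist only so the sketch runs without hardware):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CMD_TRANSMIT 0x1
    #define STATUS_READY 0x1

    /* Send one packet by programmed I/O: command, then poll + data, repeated. */
    static void pio_send(volatile uint32_t *cmd_reg, volatile uint32_t *status_reg,
                         volatile uint64_t *data_reg,
                         const uint8_t *pkt, size_t len)
    {
        *cmd_reg = CMD_TRANSMIT;                     /* 1. write the command register     */
        for (size_t off = 0; off < len; off += 8) {
            while ((*status_reg & STATUS_READY) == 0)
                ;                                    /* 2. poll until the device is ready */
            uint64_t chunk = 0;
            size_t n = (len - off < 8) ? len - off : 8;
            memcpy(&chunk, pkt + off, n);            /* 3. next <=8 bytes of the packet   */
            *data_reg = chunk;                       /*    one CPU write per 8 bytes      */
        }
    }
    /* For a 1500-byte packet: 188 data-register writes + 1 command write = 189 accesses. */

    int main(void) {
        static uint32_t fake_cmd, fake_status = STATUS_READY;  /* stand-ins for MMIO regs */
        static uint64_t fake_data;
        uint8_t pkt[1500] = {0};
        pio_send(&fake_cmd, &fake_status, &fake_data, pkt, sizeof pkt);
        return 0;
    }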

Direct Memory Access (DMA)

We still write commands to the device via its command registers, but data movement is handled by a DMA controller that the CPU configures. Flow:

  1. CPU writes command register: “Transmit a packet.”
  2. CPU configures DMA with:
    • physical memory address of buffer,
    • size of buffer, direction, etc.
  3. DMA engine takes over:
    • Reads packet from RAM,
    • Transfers to device or vice versa.
  4. Device signals completion (typically via interrupt).

Advantages:

OS Bypass: user-level access to devices

Normal path: user process → system call → kernel → driver → device.

OS bypass lets a user process talk (almost) directly to the device:

Synchronous vs asynchronous I/O

Suppose your thread issues an I/O request (read from disk, send a packet, etc.).

Synchronous I/O
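A minimal sketch of the difference using a pipe: the first read() blocks until data is available (synchronous), while a descriptor marked O_NONBLOCK makes read() return immediately with EAGAIN when nothing is ready, which is one simple way to get asynchronous-style behavior (richer async interfaces such as POSIX aio or io_uring exist but are not shown here):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int  fds[2];
        char buf[64];
        pipe(fds);                          /* fds[0] = read end, fds[1] = write end */

        /* Synchronous (blocking): read() would sleep until data arrives, so we
         * put some data in first to let it complete immediately. */
        write(fds[1], "hello", 5);
        ssize_t n = read(fds[0], buf, sizeof buf);
        printf("blocking read returned %zd bytes\n", n);

        /* Asynchronous-style (non-blocking): with O_NONBLOCK and no data pending,
         * read() returns at once with EAGAIN instead of putting the thread to sleep. */
        fcntl(fds[0], F_SETFL, O_NONBLOCK);
        n = read(fds[0], buf, sizeof buf);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("non-blocking read: no data yet; do other work and retry later\n");

        close(fds[0]);
        close(fds[1]);
        return 0;
    }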

Block Device Stack: from files to blocks

Applications care about files, not sectors. Layers:

  1. User process
    • Uses a file abstraction (open, read, write).
  2. Kernel Filesystem (FS)
    • Interprets file paths, permissions,
    • Maps file offsets to block numbers on some block device.
  3. Generic block layer
    • OS-provided uniform interface for all block devices on that OS.
    • Hides differences between, say, SATA disk vs NVMe SSD vs RAM disk.
  4. Device driver
    • Talks the specific protocol of that device type and model.
  5. Block device (disk/SSD)

Virtual File System (VFS): multiple filesystems as one tree

Problems VFS solves:

Key VFS objects:

  1. File: the object on which VFS operations act.
  2. File descriptor: OS representation of file
    • Small integer returned by open().
    • Process-local handle to a file instance (with offset, flags, etc).
  3. inode (index node)
    • Persistent structure describing one file:
      • Owner, permissions
      • File size
      • Timestamps
      • Pointers to the blocks holding the file’s data.
    • Important because file data blocks can be scattered across the disk.
  4. dentry (directory entry)
    • Short-lived in-memory object representing one path component.
    • Example: accessing /users/ada → dentries for /, users, ada.
    • Cached in a dentry cache so subsequent path lookups don’t re-read directories from disk.
    • Not persisted to disk; purely an in-memory lookup accelerator.
  5. superblock
    • Describes overall layout of a filesystem instance:
      • Where inodes live,
      • Where data blocks live,
      • Which blocks are free/used.
    • Each mounted filesystem has one superblock (in memory) representing its state. (A small user-space sketch of these objects follows this list.)

Mapping:
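To see some of these objects from user space, a minimal sketch: open() hands back a process-local file descriptor, and fstat() reports metadata that the filesystem keeps in the file’s inode (“notes.txt” is just an example path):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("notes.txt", O_RDONLY);      /* fd: small, process-local integer */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);                            /* filled in from the file's inode  */
        printf("inode number : %lu\n", (unsigned long)st.st_ino);
        printf("size (bytes) : %lld\n", (long long)st.st_size);
        printf("link count   : %lu\n", (unsigned long)st.st_nlink);
        printf("mode (octal) : %o\n", (unsigned)(st.st_mode & 0777));

        close(fd);
        return 0;
    }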

EXT2 on-disk layout

A disk partition formatted as ext2:

  1. Boot block (first block)
    • Often used for bootloader.
    • Linux ext2 itself doesn’t use it directly for files.
  2. Block groups
    • The rest of the partition is split into multiple block groups.
    • Group size is a logical design choice, not necessarily tied to physical disk geometry.

Each block group contains:

Bitmaps let the filesystem quickly find free space and free inodes.
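A toy sketch of the idea (sizes are simplified and this is not real ext2 code): one bit per block, scan for a clear bit, set it to mark the block allocated:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCKS_PER_GROUP 8192                 /* one 1 KB bitmap block = 8192 bits */

    static uint8_t block_bitmap[BLOCKS_PER_GROUP / 8];   /* 1 = in use, 0 = free */

    /* Returns the index of a newly allocated block, or -1 if the group is full. */
    static int find_free_block(void) {
        for (int i = 0; i < BLOCKS_PER_GROUP; i++) {
            if ((block_bitmap[i / 8] & (1 << (i % 8))) == 0) {
                block_bitmap[i / 8] |= (uint8_t)(1 << (i % 8));   /* mark as used */
                return i;
            }
        }
        return -1;
    }

    int main(void) {
        printf("allocated block %d in this group\n", find_free_block());
        printf("allocated block %d in this group\n", find_free_block());
        return 0;
    }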


Inodes and indirect pointers

Each file is uniquely identified by an inode, which contains the file’s metadata and the list of blocks holding its data.

Basic inode with direct pointers

Imagine an inode that is 128 bytes and stores 4-byte block pointers. Even if every byte held a pointer, that is at most 128 / 4 = 32 direct pointers; with 1 KB blocks (as assumed below), the file could span only 32 × 1 KB = 32 KB.

So a file larger than 32 KB wouldn’t fit. This is too small in practice.

Fix: indirect pointers

Inodes use a mix of:

  1. Single indirect pointer

    • It doesn’t point to data directly.
    • It points to a block of pointers.
    • That block (1 KB) is filled with 4-byte pointers:
      • 1 KB / 4 B = 1024 / 4 = 256 pointers.
    • Each of those points to a data block. So one single-indirect pointer can reference:
    • 256 data blocks × 1 KB per block = 256 KB of data.
  2. Double indirect pointer

    • Pointer in inode → block of single-indirect pointers.
    • Each of those → block of data pointers.
    • Calculation:
      • First level: 1 KB / 4 B = 256 pointers.
      • Each of those points to another 1 KB block of 256 pointers.
      • Total data blocks = 256 × 256 = 65,536 blocks. Each data block = 1 KB:
      • Total bytes = 65,536 × 1 KB.
      • 1 KB = 1024 bytes.
      • 65,536 × 1024 bytes = 67,108,864 bytes.
      • Divide by (1024 × 1024) to get MB:
        • 67,108,864 / 1,048,576 = 64. → 64 MB of data referenced by one double-indirect pointer.

You can also have triple indirect pointers for truly huge files (same pattern: another level of indirection).

max_file_size = (direct + single_indirect + double_indirect + triple_indirect) × block_size

Trade-offs:
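As a concrete check of the formula, a small sketch using classic ext2-style parameters: 12 direct pointers plus one single-, double-, and triple-indirect pointer, 1 KB blocks, and 4-byte block pointers (the exact pointer counts vary by filesystem, so treat them as assumptions):

    #include <stdio.h>

    int main(void) {
        const long block_size   = 1024;               /* 1 KB blocks (as above)       */
        const long ptrs_per_blk = block_size / 4;     /* 4-byte pointers -> 256       */

        long direct   = 12;                           /* assumed direct pointer count */
        long single   = ptrs_per_blk;                 /* 256 blocks = 256 KB          */
        long double_i = ptrs_per_blk * ptrs_per_blk;  /* 65,536 blocks = 64 MB        */
        long triple_i = double_i * ptrs_per_blk;      /* 16,777,216 blocks            */

        long long max_bytes =
            (long long)(direct + single + double_i + triple_i) * block_size;

        printf("max file size = %lld bytes (about %.1f GB)\n",
               max_bytes, max_bytes / (1024.0 * 1024 * 1024));
        return 0;
    }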

Disk access optimizations

Disks (especially spinning ones) are slow compared to RAM/CPU, mainly due to:

fsync() exists so applications can force dirty data to be committed to disk (important for databases, etc.).
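A minimal sketch (the filename is just an example): without the fsync(), the freshly written bytes may sit in the kernel’s write-back cache for some time before reaching the disk:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *rec = "balance=42\n";
        write(fd, rec, strlen(rec));   /* lands in the page cache (write-back)      */

        if (fsync(fd) != 0) {          /* block until the data is on stable storage */
            perror("fsync");
            return 1;
        }
        close(fd);
        return 0;
    }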

I/O scheduling

Multiple pending requests to disk can be reordered to reduce head movement. Example:
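As a stand-in example, a toy sketch of the core idea: sort the pending requests by sector number so the head sweeps across the disk in one direction instead of bouncing between distant tracks (the request queue and sector numbers are invented):

    #include <stdio.h>
    #include <stdlib.h>

    /* A pending disk request, identified here only by its starting sector. */
    struct request { long sector; };

    static int by_sector(const void *a, const void *b) {
        long sa = ((const struct request *)a)->sector;
        long sb = ((const struct request *)b)->sector;
        return (sa > sb) - (sa < sb);
    }

    int main(void) {
        /* Arrival order forces long seeks: 5000 -> 100 -> 7000 -> 200 -> 6000. */
        struct request queue[] = { {5000}, {100}, {7000}, {200}, {6000} };
        size_t n = sizeof queue / sizeof queue[0];

        qsort(queue, n, sizeof queue[0], by_sector);   /* one-directional sweep order */

        for (size_t i = 0; i < n; i++)
            printf("service sector %ld\n", queue[i].sector);
        return 0;
    }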

Prefetching (read-ahead)

Assumption: programs often access data with spatial locality (e.g., reading sequentially). Example:
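One way an application can cooperate with the kernel’s read-ahead is posix_fadvise(); a minimal sketch (filename is an example) that declares a sequential access pattern so the kernel can prefetch more aggressively:

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("big.log", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Hint: we will read the file front to back; prefetching ahead pays off. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            ;   /* sequential reads; later ones often hit already-prefetched pages */

        close(fd);
        return 0;
    }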

Journaling

Problem: write-back caching + reordering means:

Journaling (write-ahead logging) fixes this by:

  1. Writing intended changes to a sequential log (journal) on disk:

    • “I plan to change inode X, block Y, etc.”
    • Appends are sequential, so fast even on spinning disks.
  2. Marking the journal transaction as committed.

  3. Later, applying those recorded changes to their final locations on disk.

On crash:
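A toy in-memory sketch of the whole cycle, including what recovery does after a crash. This is generic write-ahead-logging logic for illustration, not the real ext3/ext4 journal format: only records marked committed are replayed, so a crash before commit leaves the old contents untouched:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 8

    struct record { int block_no; char data[16]; bool committed; };

    static char disk[NBLOCKS][16];           /* final (scattered) on-disk locations */
    static struct record journal[4];         /* sequential log of intended writes   */
    static int journal_len = 0;

    /* Steps 1-2: append the intended change to the journal, then mark it committed. */
    static void journal_write(int block_no, const char *data, bool committed) {
        struct record *r = &journal[journal_len++];
        r->block_no  = block_no;
        strncpy(r->data, data, sizeof r->data - 1);
        r->committed = committed;
    }

    /* Step 3 and crash recovery: replay committed records, skip uncommitted ones. */
    static void apply_journal(void) {
        for (int i = 0; i < journal_len; i++)
            if (journal[i].committed)
                memcpy(disk[journal[i].block_no], journal[i].data, 16);
    }

    int main(void) {
        strcpy(disk[1], "old");
        journal_write(1, "new", true);    /* committed before the "crash"            */
        journal_write(2, "half", false);  /* crash hit before this one was committed */
        apply_journal();                  /* recovery after reboot                   */
        printf("block 1 = %s, block 2 = %s\n",
               disk[1], disk[2][0] ? disk[2] : "(unchanged)");
        return 0;
    }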