File System Designs

Log-structured File System (LFS)

Key Idea: Writing Sequentially
Improves writes due to growing system memories allowing more data caching, making disk traffic consist more of writes.
Aims to bridge the large performance gap between random I/O and sequential I/O on disks.
Leverages increased disk transfer bandwidth from packing more bits into the disk surface.
Seeks to mitigate the slowly decreasing costs of seek and rotational delays.
Addresses the poor performance of existing file systems like FFS, which incurs many short seeks and rotational delays.
Writes everything, including data blocks and inodes, sequentially to disk.
Buffers all writes in an in-memory segment and commits the segment to disk as a single large write, known as write buffering.
It is possible to buffer writes to different files in a segment.

Writing Sequentially, and Effectively!

Writing sequentially is not enough alone to guarantee efficient writes.
The Log-structured File System (LFS) first buffers all writes in an in-memory segment; when the segment is large enough, LFS commits the segment to disk as a single large write.
This technique is well known as write buffering.
It is possible to buffer writes to different files in a segment.

Issue #1: How Much to Buffer?

Variables:
- $T{position}$ : Time to position the disk head ( $T{rotation} + T_{seek}$ ).
- $R_{peak}$ : Disk transfer rate.
- $D$ : Amount of data to buffer.
Derivations:
- Time to write the data: $T{write} = T{position} + \frac{D}{R_{peak}}$
- Effective write rate: $R{effective} = \frac{D}{T{write}} = \frac{D}{T{position} + \frac{D}{R{peak}}}$
Goal: Get the effective rate close to the peak rate.
Set the effective rate as a fraction $F$ of the peak rate: $R{effective} = F \times R{peak}$
Solve for $D$ : $D = \frac{F \times R{peak} \times T{position}}{1 - F}$

Indirect Mapping and Checkpoint Region

Issue #2: How to Find Inodes?

In LFS, inodes are scattered throughout the disk.
Solution through Indirection: The Inode Map (imap).
- Maps from an inode-number to the disk-address of the most recent version of the inode.
- Implemented as an array of 4 bytes (disk pointer) per entry.
- Updated whenever an inode is written to disk.
LFS places the imap right next to where it is writing.
The pieces of imap are also spread across the disk.
Complete Solution: The Checkpoint Region (CR) records disk pointers to all latest pieces of imap. Flushed to disk periodically (e.g., every 30 seconds).

File Reading Process

Step 1: LFS needs to read the checkpoint region to find the latest imap.
Step 2: LFS needs to read the latest imap to have the disk location of the inode.
Step 3: LFS needs to read the most recent version of the inode ( $I[k]$ ).
Step 4: LFS needs to read data blocks using direct/indirect pointer as usual.
To perform the same number of I/Os as UNIX FS, LFS must cache the checkpoint region (CR) and the entire imap in the system memory.

Issue #3: What about Directories?

The directory structure of LFS is identical to UNIX FS.
The directory is a collection of (name, inode-num) entries.
When creating a file, LFS writes the data and the new inode, the directory and its inode, and the latest imap.
LFS will do so sequentially on the disk.
When reading a file in the directory, LFS looks up imap (often cached in memory).

Issue #4: Garbage Collection

LFS never overwrites but writes to free locations.
Multiple versions of data may co-exist across the disk.
The old version(s) of data are usually called garbage.
LFS keeps only the latest live versions of data and periodically cleans old dead versions of data.
The process of cleaning is called garbage collection (GC).
LFS adopts a segment-based cleaning as follows:
- Step 1: Reads in $M$ partially-used segments.
- Step 2: Determines which blocks are live within these segments.
- Step 3: Compacts only live contents into $N$ new segments (N < M).
- Step 4: Writes out $N$ segments to disk in new locations.
- Step 5: Frees old M segments for subsequent writing.
- Problem 1: How to determine if a block is live (or dead)?
- Problem 2: How often, and which segments to clean?
LFS adds extra information, at the head of each segment, called the segment summary block (SS).
- It records, for each data block $D$ in the segment, its inode number $N$ and its offset $T$ (e.g., $(k, 0)$ ).
- The liveness for a block $D$ of address $A$ can be determined through segment summary and imap.
  - $(N, T) = SegmentSummary[A]$ ;
  - $inode = Read(imap[N])$ ;
  - if ( $inode[T] == A$ ) // block $D$ is alive.
  - else // block $D$ is garbage.
Optimization: Keeping a version number in both imap and SS, extra reads of inodes can be further avoided.
- The version number should be incremented whenever the file is truncated or deleted.
When to clean?
- Either periodically, during idle time, or when the disk is full.
Which segments are worth cleaning?
- LFS tries to segregate hot and cold segments.
  - A hot segment consists of frequently-over-written blocks.
  - A cold segment may only have a few over-written (dead) blocks.
- LFS cleans cold segments sooner and hot segments later.
- As time goes by, more and more blocks in the hot segment may get over-written (in new segments).

Issue #5: Crash Recovery

Crashes when writing to the checkpoint region:
- Solution: Keeps two CRs (e.g., one at the head and one at the end) and writes to them alternately.
- It first writes a header (with a timestamp), then the body of CR, and then an end marker (with a timestamp).
- Inconsistent pair of timestamps implies an error.
Crashes when writing to a segment:
- Roll Forwarding: Starts with the last checkpoint region and rebuilds all “non-checkpointed” but “committed” segments.

Recall: Metadata Journaling

The sequence of metadata journaling:
- Data Write: Write data to final location
- Journal Metadata Write: Write the begin block ( $T x B$ ) and metadata ( $I[v2], B[v2]$ ) to log
- Journal Commit: Write the transaction commit block ( $T x E$ )
- Checkpoint Metadata: Write the contents of metadata update to their final locations within the file system
- Free: Mark the transaction free in the journal superblock

File Implementation: Block Allocation

Block Allocation: How to allocate disk space to files
It is a typical way to classify file system designs:
- Indexed Allocation: an index block keeps block pointers
  - Examples: UNIX FS, FFS, ext2, LFS
- Linked Allocation: each file is of linked blocks
  - Examples: FAT
- Contiguous Allocation: each file is of contiguous blocks
  - Examples: ext4

Indexed Allocation

Each file has its own index block, which keeps track of all block pointers/locations of a file.
- The $i^{th}$ entry in the index block points to the $i^{th}$ block.
Potential Issues:
- The index block could be far away from data blocks.
- Data blocks are scattered across the disk.

Recall: UNIX FS and its Variants

UNIX file system (and its variants FFS, ext, ext2, etc.) are typical representatives of indexed allocation.
- Metadata Region: tracks data and file system information.
- Data Region: stores user data and occupies most space.

Recall: Multi-Level Index

Multi-level index supports files of big sizes.
Each ext2 inode 15 disk pointers:
- 12 direct pointers;
- 1 indirect pointer;
- 1 double indirect pointer;
- 1 triple indirect pointer

Recall: Log-structured File System

LFS can be also considered as indexed allocation, in which the indirection is further introduced:
- The Checkpoint Region (CR):
  - Records disk pointers to all latest pieces of imap.
  - Flushed to disk periodically (e.g., every 30 seconds).
- The Inode Map (imap)
  - Maps from an inode-number to the disk-address of the most recent version of the inode.
  - Updated whenever an inode is written to disk.
  - Placed right next to where data block (D) and inode ( $I[k]$ ) reside.

Linked Allocation

Each file is a linked list of disk blocks, which may be scattered anywhere on the disk.
- The directory maintains the first and last blocks of the file; every block contains a pointer to the next block.
Each 512-byte block is of 508-byte user data and 4-byte pointer.
- A file can easily continue to grow if there are free blocks.
Potential Issues:
- It can be used effectively only for sequential-access files.
  - It is inefficient to arbitrarily access the $i^{th}$ block of a file.
- It costs 0.78% (4 B / 512 B) of the disk space for pointers.
  - One solution is to collect multiple blocks into a cluster.
- Any lost or damaged pointer makes a big mess.
- Data blocks may be scattered across the disk.

File Allocation Table (FAT)

File Allocation Table (FAT):
- A variation on linked allocation (used by MS-DOS and OS/2).
- A table indexed by block number (i.e., one entry per block).
  - The directory entry contains the block number of the first block.
  - Each FAT entry indicates the block number of the next block.
  - There is no need to maintain the 4B block pointer in each data block.
Problem: The in-disk FAT could be far away from blocks.

Contiguous Allocation

Each file occupies a set of contiguous blocks.
- Block addresses define a linear ordering on the disk.
- Every allocation is defined by the start address and length.
It is efficient for both sequential and direct access.
The difficulties are to:
- determine how much space is need, and
- find contiguous space for a file.

Extent

To avoid over-or-under allocation, some file systems (e.g., ext4) adopt a modified contiguous allocation.
- A chunk of contiguous and variable-sized space, extent, is allocated whenever the allocated space is insufficient.
Four extents can be kept in ext4 inode (not shown).
- An extent tree is used to store the extents map for a big file.

Dynamic Allocation Problem

How to satisfy a request of size $n$ from a list of holes?
Common Solutions: best-fit, worst-fit, and first-fit.
It is also a common problem of memory management.