File System Designs

Log-structured File System (LFS)

  • Key Idea: Writing Sequentially

  • Improves writes due to growing system memories allowing more data caching, making disk traffic consist more of writes.

  • Aims to bridge the large performance gap between random I/O and sequential I/O on disks.

  • Leverages increased disk transfer bandwidth from packing more bits into the disk surface.

  • Seeks to mitigate the slowly decreasing costs of seek and rotational delays.

  • Addresses the poor performance of existing file systems like FFS, which incurs many short seeks and rotational delays.

  • Writes everything, including data blocks and inodes, sequentially to disk.

  • Buffers all writes in an in-memory segment and commits the segment to disk as a single large write, known as write buffering.

  • It is possible to buffer writes to different files in a segment.

Writing Sequentially, and Effectively!

  • Writing sequentially is not enough alone to guarantee efficient writes.

  • The Log-structured File System (LFS) first buffers all writes in an in-memory segment; when the segment is large enough, LFS commits the segment to disk as a single large write.

  • This technique is well known as write buffering.

  • It is possible to buffer writes to different files in a segment.

Issue #1: How Much to Buffer?
  • Variables:

    • T<em>positionT<em>{position}: Time to position the disk head (T</em>rotation+TseekT</em>{rotation} + T_{seek}).

    • RpeakR_{peak}: Disk transfer rate.

    • DD: Amount of data to buffer.

  • Derivations:

    • Time to write the data: T<em>write=T</em>position+DRpeakT<em>{write} = T</em>{position} + \frac{D}{R_{peak}}

    • Effective write rate: R<em>effective=DT</em>write=DT<em>position+DR</em>peakR<em>{effective} = \frac{D}{T</em>{write}} = \frac{D}{T<em>{position} + \frac{D}{R</em>{peak}}}

  • Goal: Get the effective rate close to the peak rate.

  • Set the effective rate as a fraction FF of the peak rate: R<em>effective=F×R</em>peakR<em>{effective} = F \times R</em>{peak}

  • Solve for DD: D=F×R<em>peak×T</em>position1FD = \frac{F \times R<em>{peak} \times T</em>{position}}{1 - F}

Indirect Mapping and Checkpoint Region

Issue #2: How to Find Inodes?
  • In LFS, inodes are scattered throughout the disk.

  • Solution through Indirection: The Inode Map (imap).

    • Maps from an inode-number to the disk-address of the most recent version of the inode.

    • Implemented as an array of 4 bytes (disk pointer) per entry.

    • Updated whenever an inode is written to disk.

  • LFS places the imap right next to where it is writing.

  • The pieces of imap are also spread across the disk.

  • Complete Solution: The Checkpoint Region (CR) records disk pointers to all latest pieces of imap. Flushed to disk periodically (e.g., every 30 seconds).

File Reading Process
  • Step 1: LFS needs to read the checkpoint region to find the latest imap.

  • Step 2: LFS needs to read the latest imap to have the disk location of the inode.

  • Step 3: LFS needs to read the most recent version of the inode (I[k]I[k]).

  • Step 4: LFS needs to read data blocks using direct/indirect pointer as usual.

  • To perform the same number of I/Os as UNIX FS, LFS must cache the checkpoint region (CR) and the entire imap in the system memory.

Issue #3: What about Directories?
  • The directory structure of LFS is identical to UNIX FS.

  • The directory is a collection of (name, inode-num) entries.

  • When creating a file, LFS writes the data and the new inode, the directory and its inode, and the latest imap.

  • LFS will do so sequentially on the disk.

  • When reading a file in the directory, LFS looks up imap (often cached in memory).

Issue #4: Garbage Collection
  • LFS never overwrites but writes to free locations.

  • Multiple versions of data may co-exist across the disk.

  • The old version(s) of data are usually called garbage.

  • LFS keeps only the latest live versions of data and periodically cleans old dead versions of data.

  • The process of cleaning is called garbage collection (GC).

  • LFS adopts a segment-based cleaning as follows:

    • Step 1: Reads in MM partially-used segments.

    • Step 2: Determines which blocks are live within these segments.

    • Step 3: Compacts only live contents into NN new segments (N < M).

    • Step 4: Writes out NN segments to disk in new locations.

    • Step 5: Frees old M segments for subsequent writing.

    • Problem 1: How to determine if a block is live (or dead)?

    • Problem 2: How often, and which segments to clean?

  • LFS adds extra information, at the head of each segment, called the segment summary block (SS).

    • It records, for each data block DD in the segment, its inode number NN and its offset TT (e.g., (k,0)(k, 0)).

    • The liveness for a block DD of address AA can be determined through segment summary and imap.

      • (N,T)=SegmentSummary[A](N, T) = SegmentSummary[A];

      • inode=Read(imap[N])inode = Read(imap[N]);

      • if (inode[T]==Ainode[T] == A) // block DD is alive.

      • else // block DD is garbage.

  • Optimization: Keeping a version number in both imap and SS, extra reads of inodes can be further avoided.

    • The version number should be incremented whenever the file is truncated or deleted.

  • When to clean?

    • Either periodically, during idle time, or when the disk is full.

  • Which segments are worth cleaning?

    • LFS tries to segregate hot and cold segments.

      • A hot segment consists of frequently-over-written blocks.

      • A cold segment may only have a few over-written (dead) blocks.

    • LFS cleans cold segments sooner and hot segments later.

    • As time goes by, more and more blocks in the hot segment may get over-written (in new segments).

Issue #5: Crash Recovery
  • Crashes when writing to the checkpoint region:

    • Solution: Keeps two CRs (e.g., one at the head and one at the end) and writes to them alternately.

    • It first writes a header (with a timestamp), then the body of CR, and then an end marker (with a timestamp).

    • Inconsistent pair of timestamps implies an error.

  • Crashes when writing to a segment:

    • Roll Forwarding: Starts with the last checkpoint region and rebuilds all “non-checkpointed” but “committed” segments.

Recall: Metadata Journaling
  • The sequence of metadata journaling:

    • Data Write: Write data to final location

    • Journal Metadata Write: Write the begin block (TxBT x B) and metadata (I[v2],B[v2]I[v2], B[v2]) to log

    • Journal Commit: Write the transaction commit block (TxET x E)

    • Checkpoint Metadata: Write the contents of metadata update to their final locations within the file system

    • Free: Mark the transaction free in the journal superblock

File Implementation: Block Allocation

  • Block Allocation: How to allocate disk space to files

  • It is a typical way to classify file system designs:

    • Indexed Allocation: an index block keeps block pointers

      • Examples: UNIX FS, FFS, ext2, LFS

    • Linked Allocation: each file is of linked blocks

      • Examples: FAT

    • Contiguous Allocation: each file is of contiguous blocks

      • Examples: ext4

Indexed Allocation
  • Each file has its own index block, which keeps track of all block pointers/locations of a file.

    • The ithi^{th} entry in the index block points to the ithi^{th} block.

  • Potential Issues:

    • The index block could be far away from data blocks.

    • Data blocks are scattered across the disk.

Recall: UNIX FS and its Variants
  • UNIX file system (and its variants FFS, ext, ext2, etc.) are typical representatives of indexed allocation.

    • Metadata Region: tracks data and file system information.

    • Data Region: stores user data and occupies most space.

Recall: Multi-Level Index
  • Multi-level index supports files of big sizes.

  • Each ext2 inode 15 disk pointers:

    • 12 direct pointers;

    • 1 indirect pointer;

    • 1 double indirect pointer;

    • 1 triple indirect pointer

Recall: Log-structured File System
  • LFS can be also considered as indexed allocation, in which the indirection is further introduced:

    • The Checkpoint Region (CR):

      • Records disk pointers to all latest pieces of imap.

      • Flushed to disk periodically (e.g., every 30 seconds).

    • The Inode Map (imap)

      • Maps from an inode-number to the disk-address of the most recent version of the inode.

      • Updated whenever an inode is written to disk.

      • Placed right next to where data block (D) and inode (I[k]I[k]) reside.

Linked Allocation
  • Each file is a linked list of disk blocks, which may be scattered anywhere on the disk.

    • The directory maintains the first and last blocks of the file; every block contains a pointer to the next block.

  • Each 512-byte block is of 508-byte user data and 4-byte pointer.

    • A file can easily continue to grow if there are free blocks.

  • Potential Issues:

    • It can be used effectively only for sequential-access files.

      • It is inefficient to arbitrarily access the ithi^{th} block of a file.

    • It costs 0.78% (4 B / 512 B) of the disk space for pointers.

      • One solution is to collect multiple blocks into a cluster.

    • Any lost or damaged pointer makes a big mess.

    • Data blocks may be scattered across the disk.

File Allocation Table (FAT)
  • File Allocation Table (FAT):

    • A variation on linked allocation (used by MS-DOS and OS/2).

    • A table indexed by block number (i.e., one entry per block).

      • The directory entry contains the block number of the first block.

      • Each FAT entry indicates the block number of the next block.

      • There is no need to maintain the 4B block pointer in each data block.

  • Problem: The in-disk FAT could be far away from blocks.

Contiguous Allocation
  • Each file occupies a set of contiguous blocks.

    • Block addresses define a linear ordering on the disk.

    • Every allocation is defined by the start address and length.

  • It is efficient for both sequential and direct access.

  • The difficulties are to:

    • determine how much space is need, and

    • find contiguous space for a file.

Extent
  • To avoid over-or-under allocation, some file systems (e.g., ext4) adopt a modified contiguous allocation.

    • A chunk of contiguous and variable-sized space, extent, is allocated whenever the allocated space is insufficient.

  • Four extents can be kept in ext4 inode (not shown).

    • An extent tree is used to store the extents map for a big file.

Dynamic Allocation Problem
  • How to satisfy a request of size nn from a list of holes?

  • Common Solutions: best-fit, worst-fit, and first-fit.

  • It is also a common problem of memory management.