File Storage and Systems

Primacy of Files

  • The course emphasizes the importance of file management and manipulation across various computing fields.
  • Skills in handling files are crucial for data scientists, network and security specialists, developers, and other computing professionals.
  • Slides contain more information than the summaries provided in the recordings.
  • Students should watch videos and demos before participating in live online practical sessions.
  • Utilize the commenting feature in the Panopto video player.
  • The module supports course learning outcomes.

Storage Devices

Basics of File Storage (Topic 2.1)

  • Overview of hardware involved in file storage.
  • File systems.
  • POSIX philosophy: "Everything is a file."
  • File attributes and storage methods in a file system.

Historical Perspective

  • Digital tape devices: Sequential access only, reading from start to end of the tape.
  • New file information is added at the end of the tape.
  • Terminology still in use today originates from tape devices.
  • Tape devices are still used as tertiary (archival or backup) storage in large organizations.
  • Revolution: Shift from sequential access to direct access enabled by hard disks.

File Systems

  • File systems are required to store and retrieve files from direct access devices.
  • File system structure on the disk stores information about the location of separate files.
  • How files are stored depends on the physical properties of the disks.

Hard Disks

  • Consist of rapidly spinning platters (disks), each surface served by a read-write head.
  • The head moves over the spinning platter without touching it.
  • Each side of a platter has its own read-write head.
  • The head can be positioned over any part of the surface for reading or writing.
  • Data I/O is performed in chunks.
  • Minimum storage and transfer unit of the disk: Sector (traditionally 512 bytes; newer disks commonly use 4096 bytes).
  • The file system moves data in blocks, each made up of one or more sectors.
  • Minimum unit addressable by the file system: Block.
  • The size relationship between blocks and sectors has changed over time.
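
The sector and block relationship can be made concrete with a little arithmetic. The sketch below assumes a 512-byte sector and a 4096-byte file system block; both sizes and all function names are illustrative choices, not values taken from the slides.

```python
import math

SECTOR_SIZE = 512    # traditional sector size in bytes
BLOCK_SIZE = 4096    # assumed file system block size in bytes


def sectors_per_block(block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Number of sectors that make up one file system block."""
    return block_size // sector_size


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Whole blocks occupied by a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def slack_bytes(file_size, block_size=BLOCK_SIZE):
    """Allocated but unused bytes in the file's last block."""
    return blocks_needed(file_size, block_size) * block_size - file_size


if __name__ == "__main__":
    size = 10_000  # a hypothetical 10,000-byte file
    print(sectors_per_block(), blocks_needed(size), slack_bytes(size))
```

Because space is handed out in whole blocks, even a one-byte file occupies a full block under these assumptions.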

Disk Terminology

  • Track: A circular path on the disk's surface (indicated in red on the slide diagram).
  • Disk Sector: The entire pie-shaped wedge of the disk.
  • Track Sector: The part of a track that lies within one disk sector.
  • Cluster: A group of adjacent track sectors.

File Allocation

  • File systems try to allocate files to contiguous blocks or clusters to minimize head movement and speed up data transfer.
  • File data scattered all over the disk can result in slow I/O.
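
One rough way to see why contiguous allocation helps is to count how many separate runs of consecutive blocks a file occupies. The sketch below is a simplified model, not how any particular file system measures fragmentation; the function name and the block numbers are made up for illustration.

```python
def count_extents(block_numbers):
    """Count runs of consecutive block numbers in a file's block list.

    In this simplified model, each new run stands for one extra head
    movement on a hard disk.
    """
    extents = 0
    previous = None
    for block in block_numbers:
        if previous is None or block != previous + 1:
            extents += 1
        previous = block
    return extents


if __name__ == "__main__":
    contiguous = [100, 101, 102, 103]   # one run of blocks
    scattered = [100, 250, 251, 7]      # three separate runs
    print(count_extents(contiguous), count_extents(scattered))
```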

Solid State Drives (SSDs)

  • Even though SSDs lack spinning disks, they still move data in chunks or sectors.

Comparative Speeds

  • Disk I/O is slow compared to memory (RAM) access. Approximate transfer rates:
  • RAM: 20 GB/s.
  • SSD: 250 MB/s.
  • HDD: 100 MB/s.
  • Disk I/O is still fast relative to other kinds of I/O, such as network traffic.
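
To get a feel for these figures, the sketch below estimates how long a transfer would take at each quoted rate. It assumes 1 GB = 1000 MB and ignores seek time and caching, so the results are only ballpark numbers; the names are illustrative.

```python
# Transfer rates from the list above, converted to MB/s (1 GB = 1000 MB).
RATES_MB_PER_S = {"RAM": 20_000, "SSD": 250, "HDD": 100}


def transfer_seconds(size_mb, rate_mb_per_s):
    """Idealized time to move size_mb megabytes at a constant rate."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    size_mb = 1000  # a hypothetical 1 GB file
    for device, rate in RATES_MB_PER_S.items():
        print(f"{device}: {transfer_seconds(size_mb, rate):.2f} s")
```

Under these assumptions the same 1 GB file takes about 0.05 s from RAM, 4 s from the SSD, and 10 s from the HDD.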

Data Handling

  • Reading larger blocks makes processing appear faster, because most access is sequential (especially for text files).
  • The operating system kernel reads more data than initially requested (read-ahead), anticipating the application's next request.
  • Data to be written can be buffered in memory before being written to disk.
  • Do not remove a thumb drive right after writing without ejecting (unmounting) it first: the data may still be in memory and not yet written to the drive.
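
Write buffering can be observed from a program. The sketch below, assuming a POSIX-like system and Python's standard library, shows the two steps that push buffered data toward the device; the function name `write_durably` and the file name are illustrative.

```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes to path and ask the OS to push them to the device."""
    with open(path, "wb") as f:
        f.write(data)          # data lands in Python's user-space buffer
        f.flush()              # hand the data over to the kernel's cache
        os.fsync(f.fileno())   # ask the kernel to write it to the device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "notes.txt")
        write_durably(target, b"saved\n")
        print(os.path.getsize(target))
```

Without the `flush` and `fsync` calls the write would normally still complete, just at a time chosen by the runtime and the kernel, which is why removing a drive too early can lose data.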

Files as Logical Storage Units

  • From a user's perspective, a file is the smallest unit of logical nonvolatile storage.
  • Nonvolatile storage persists even when the computer is shut down (unlike RAM).
  • Data must be encapsulated in a file to be written to nonvolatile storage.
  • Files can represent programs (source and compiled code) and data.

Data Files

  • Numeric, alphabetic, alphanumeric (text files), or binary.
  • The distinction between text and binary files depends on how the file is interpreted.
  • Tools inspect a file's contents and encoding to judge whether it is text or binary; the kernel itself stores both as plain sequences of bytes.
  • Text files can be free-form or rigorously formatted.
  • Binary files typically have rigid formatting for applications to extract information.
  • A file is a sequence of bits, typically grouped in bytes (8 bits), lines, or records.
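
The point that text versus binary is a matter of interpretation can be shown by reading one file two ways. This is a minimal sketch that assumes UTF-8 encoding; the function name and the sample content are illustrative.

```python
import os
import tempfile


def read_both_ways(path):
    """Return the same file's contents as raw bytes and as decoded text."""
    with open(path, "rb") as f:
        raw = f.read()
    # newline="" turns off newline translation so the text mirrors the bytes.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    return raw, text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sample.txt")
        with open(path, "wb") as f:
            f.write("café\n".encode("utf-8"))
        raw, text = read_both_ways(path)
        # The byte count and the character count differ: "é" takes two bytes.
        print(len(raw), len(text))
```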

Long-Term Data Storage

  • Computers use various devices for long-term data storage, including:
    • Solid state drives (SSDs).
    • Hard disk drives (HDDs).
    • Optical disks (CDs and DVDs).
    • Magnetic tape.
  • These devices have different I/O properties, especially data transfer speeds.

Operating System Abstraction

  • The operating system presents a uniform interface to the user.
  • It abstracts physical devices and presents them as a single type of data storage.
  • This gives rise to logical storage units (files).
  • Files are mapped onto physical devices based on their characteristics.

POSIX Philosophy: "Everything is a File"

  • A core principle in UNIX, Linux, and related operating systems.
  • Files encapsulate many kinds of resources and expose them through the same read and write operations.
  • Operating system structures, variables, and more can be represented as files.
  • This simplifies things for both applications and end users.
  • Network communication can be presented as simple data reads and writes to special files.
  • Hardware (e.g., mouse, webcam) can be represented as files.
  • File permissions can extend to devices.
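
Device files can be inspected and read with the ordinary file API. The sketch below assumes a Linux or macOS system where `/dev/null` and `/dev/urandom` exist; the function names are illustrative.

```python
import os
import stat


def describe(path):
    """Classify a path by the file type recorded in its mode bits."""
    mode = os.stat(path).st_mode
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def read_random_bytes(count):
    """Read bytes from the kernel's random source as if it were a file."""
    with open("/dev/urandom", "rb") as f:
        return f.read(count)


if __name__ == "__main__":
    print(describe("/dev/null"))
    print(read_random_bytes(8).hex())
```

The same `open` and `read` calls used for regular files work here, and the usual owner and permission checks apply to the device file.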

File Properties

  • File names are a convenience for the end user, not the operating system.
  • Users/applications access files by name, but the OS uses the name as a key to look up the file identifier.
  • The identifier is used by the kernel for operations on the file.
  • Files have properties (attributes in POSIX).
  • Attributes vary by OS but typically include:
    • Name
    • Unique Identifier
    • Location: Pointers to physical places on the disk where the data is stored.
    • Size
    • Time Stamps: Creation, modification, etc.
    • Access Control: Owner and permissions (read, write, execute).
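
Most of these attributes can be read with a single system call. The sketch below uses Python's `os.stat` on a POSIX-like system; the dictionary keys and the function name are illustrative labels, not official attribute names.

```python
import os
import stat
import tempfile
import time


def file_attributes(path):
    """Collect a few common attributes of a file into a dictionary."""
    st = os.stat(path)
    return {
        "identifier": st.st_ino,                      # unique within one device
        "size": st.st_size,                           # in bytes
        "modified": time.ctime(st.st_mtime),          # last modification time
        "owner_uid": st.st_uid,                       # numeric owner id
        "permissions": stat.filemode(st.st_mode),     # e.g. "-rw-r-----"
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "example.txt")
        with open(path, "w") as f:
            f.write("hello")
        os.chmod(path, 0o640)
        for key, value in file_attributes(path).items():
            print(f"{key}: {value}")
```

Note that the file name is not among the values returned by `os.stat`; it is only the key used to find them.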

File Attribute Block

  • Stores the attributes or properties of files.
  • Information is kept in a directory structure on the same device as the files.
  • The OS can easily access file attribute blocks for file information.

Inodes

  • In POSIX systems, the file attribute block is called an inode (index node).
  • It is a data structure storing properties and attributes for files on disk.
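
That a name is only a key to an inode can be demonstrated with a hard link: two names, one inode. This sketch assumes a file system that supports hard links (typical on Linux and macOS); the function and file names are illustrative.

```python
import os
import tempfile


def same_inode(path_a, path_b):
    """True if both names refer to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.txt")
        alias = os.path.join(d, "report-link.txt")
        with open(original, "w") as f:
            f.write("data")
        os.link(original, alias)  # second name for the same inode
        print(same_inode(original, alias), os.stat(original).st_nlink)
```

Both names report the same inode number, and the inode's link count records how many names point to it.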

References

  • Wikipedia pages.
  • POSIX commands (learn via man pages).