Chapter 16: File Structures and Hashing

16.4 Placing File Records on Disk

Files, Fixed-Length Records, and Variable-Length Records

  • A file is defined as a sequence of records.

  • Record Types:

    • All records in a file may be of the same type.

    • If every record has exactly the same size (in bytes), the file is composed of fixed-length records.

    • If records have different sizes, the file consists of variable-length records.

Structure of File Records

  • Records are typically composed of a collection of related data values or items, where each value can consist of one or more bytes, corresponding to specific fields of the record.

  • Records generally describe entities along with their attributes.

Example of a Record

  • An example is an EMPLOYEE record.

    • Fields may include: Name, Birth_date, Salary, Supervisor.

  • A collection of field names and their respective data types forms a record type or record format definition.

  • Each field has a data type that specifies the kind of values it can contain.

Data Types and Size

  • Common data types involved include:

    • Numeric

    • String

    • Boolean

    • Sometimes Date and/or Time

  • The byte size for each data type is fixed based on the computer's system and can vary depending on:

    • Software platform

    • Hardware platform

16.5 Operations on Files
  • OPEN: Prepares the file for access and associates a pointer to a current file record.

  • FIND: Searches for the first record meeting a specific condition and sets it as the current record.

  • FINDNEXT: Searches for another record satisfying the condition and updates the current record pointer.

  • READ: Retrieves the current record into a program variable.

  • INSERT: Adds a new record into the file, updating the pointer to this new record.

  • DELETE: Marks the current record as invalid without physically removing it immediately.

  • MODIFY: Updates the values in some fields of the current record.

  • CLOSE: Ends the file's access session.

  • REORGANIZE: Physically reorganizes records, possibly removing those marked deleted or reorganizing the order they are stored in.

  • FIND_ORDERED: Retrieves all records in a specified sequence.

  • READ_ORDERED: Reads file blocks according to a specific ordering field.

16.6 Files of Unsorted Records (Heap Files)
  • Known as heap files:

    • New records get appended at the end.

    • Requires a linear search for record retrieval, averaging half the file blocks.

    • Efficient record insertion but inefficient for ordered reads as sorting is required afterward.

16.7 Files of Sorted Records (Sorted Files)
  • Referred to as sequential files:

    • Records remain sorted by values from a specified field.

  • Insertion becomes complex as new records demand positioning within existing order, often necessitating an overflow file for better performance in new record handling.

  • Binary searches can locate records efficiently, reading in log2 file blocks on average.

Memory Block Example of a Sorted File

  • The organized display of employee records as structured into various blocks.

16.8 Hashing

Primary File Organization via Hashing

  • A file organization type where quick record access under equality conditions occurs using a hash function.

  • The field searched against is termed the hash field, commonly a key field, or hash key.

  • Hash functions do not randomize; they yield direct disk block addresses based on hash field values.

Internal Hashing

  • Implemented via a hash table, organized as an array of records.

  • Array indexes range from 0 to M-1, indicating M slots corresponding to array indices.

  • A hash function transforms a hash key value into an integer between 0 and M-1 using:

    h(K)=KmodMh(K) = K \mod M

Alternate Hash Functions

  • Variations include:

    • Folding: Arithmetic or logical functions applied over portions of the hash field.

    • Example: For an address range from 0 to 999 storing 1,000 keys, folding 235469 to yield (235+469)mod1000=704(235+469) \mod 1000 = 704.

    • Selective digit extraction from hash field values to formulate hash addresses.

Handling Non-Integer Hash Fields

  • Non-integer hash fields can leverage numeric (ASCII) values through transformations such as multiplication.

Collision Resolution in Hashing

  • A collision occurs when a hash value points to an address already occupied.

  • Methods for conflict resolution include:

    • Open Addressing: Spring from the original address until an unused one is found.

    • Chaining: Maintain overflow locations for records through pointers in a linked list format.

    • Multiple Hashing: Apply a secondary hash function in the event of a collision to find an empty location.

Limitations of Hash Functions

  • Distinct values in hash fields can collide due to large hash field spaces relative to address spaces, requiring robust design considerations.

Dynamic and Extendible Hashing Techniques

  • These approaches allow for flexible adjustment in file record numbers, enhancing efficiency alongside avoiding directory overhead.

Dynamics of Hashing

  • Operates around binary tree structures and makes adjustments to directory depths (

  • Depth can vary based on local or global requirements) leading to efficient record retrieval strategies.

Extendible Hashing

  • Utilizes a directory where specific addresses can represent multiple local records, thus promoting efficient growth management.

16.10 RAID

RAID: Redundant Arrays of Inexpensive Disks

  • Essential for improving secondary storage performance and reliability in line with processor advancements.

High-Level RAID Strategies

  • Includes Redundancy, Reliability, Data Striping, Performance, and Error Detection mechanisms.

RAID Configurations and Characteristics

  • Variants exist with differing focuses on redundancy and performance, such as:

    • RAID 0: Performance-focused (striping without redundancy).

    • RAID 1: Mirroring for enhanced reliability and data availability.

    • RAID 2: Deprecated, bit-level striping with Hamming code.

    • RAID 3 & 4: Block-level striping with parity disks, also deprecated in usage.

    • RAID 5: Block-level striping with distributed parity for performance and redundancy.

    • RAID 6: Further extends RAID 5 with an additional parity block.

Parity Recovery Mechanism

  • Utilizes an Exclusive OR (XOR) operation to derive redundancy across written blocks. This mechanism encompasses:

    • Associativity properties of XOR for scalable error correction across operations.

Variability in RAID Levels

  • Each RAID configuration must consider design decisions based on the operational scale and redundancy needs, leading to optimized performance in varied application contexts.