Chapter 16: File Structures and Hashing

Files, Fixed-Length Records, and Variable-Length Records

A file is defined as a sequence of records.
Record Types:
- All records in a file may be of the same type.
- If every record has exactly the same size (in bytes), the file is composed of fixed-length records.
- If records have different sizes, the file consists of variable-length records.

Structure of File Records

Records are typically composed of a collection of related data values or items, where each value can consist of one or more bytes, corresponding to specific fields of the record.
Records generally describe entities along with their attributes.

Example of a Record

An example is an EMPLOYEE record.
- Fields may include: Name, Birth_date, Salary, Supervisor.
A collection of field names and their respective data types forms a record type or record format definition.
Each field has a data type that specifies the kind of values it can contain.

Data Types and Size

Common data types involved include:
- Numeric
- String
- Boolean
- Sometimes Date and/or Time
The byte size for each data type is fixed based on the computer's system and can vary depending on:
- Software platform
- Hardware platform

OPEN: Prepares the file for access and associates a pointer to a current file record.
FIND: Searches for the first record meeting a specific condition and sets it as the current record.
FINDNEXT: Searches for another record satisfying the condition and updates the current record pointer.
READ: Retrieves the current record into a program variable.
INSERT: Adds a new record into the file, updating the pointer to this new record.
DELETE: Marks the current record as invalid without physically removing it immediately.
MODIFY: Updates the values in some fields of the current record.
CLOSE: Ends the file's access session.
REORGANIZE: Physically reorganizes records, possibly removing those marked deleted or reorganizing the order they are stored in.
FIND_ORDERED: Retrieves all records in a specified sequence.
READ_ORDERED: Reads file blocks according to a specific ordering field.

Known as heap files:
- New records get appended at the end.
- Requires a linear search for record retrieval, averaging half the file blocks.
- Efficient record insertion but inefficient for ordered reads as sorting is required afterward.

Referred to as sequential files:
- Records remain sorted by values from a specified field.
Insertion becomes complex as new records demand positioning within existing order, often necessitating an overflow file for better performance in new record handling.
Binary searches can locate records efficiently, reading in log2 file blocks on average.

Memory Block Example of a Sorted File

Primary File Organization via Hashing

A file organization type where quick record access under equality conditions occurs using a hash function.
The field searched against is termed the hash field, commonly a key field, or hash key.
Hash functions do not randomize; they yield direct disk block addresses based on hash field values.

Internal Hashing

Implemented via a hash table, organized as an array of records.
Array indexes range from 0 to M-1, indicating M slots corresponding to array indices.
A hash function transforms a hash key value into an integer between 0 and M-1 using:
$h(K) = K \mod M$

Alternate Hash Functions

Variations include:
- Folding: Arithmetic or logical functions applied over portions of the hash field.
- Example: For an address range from 0 to 999 storing 1,000 keys, folding 235469 to yield $(235+469) \mod 1000 = 704$ .
- Selective digit extraction from hash field values to formulate hash addresses.

Handling Non-Integer Hash Fields

Non-integer hash fields can leverage numeric (ASCII) values through transformations such as multiplication.

Collision Resolution in Hashing

A collision occurs when a hash value points to an address already occupied.
Methods for conflict resolution include:
- Open Addressing: Spring from the original address until an unused one is found.
- Chaining: Maintain overflow locations for records through pointers in a linked list format.
- Multiple Hashing: Apply a secondary hash function in the event of a collision to find an empty location.

Limitations of Hash Functions

Distinct values in hash fields can collide due to large hash field spaces relative to address spaces, requiring robust design considerations.

Dynamic and Extendible Hashing Techniques

These approaches allow for flexible adjustment in file record numbers, enhancing efficiency alongside avoiding directory overhead.

Dynamics of Hashing

Operates around binary tree structures and makes adjustments to directory depths (
Depth can vary based on local or global requirements) leading to efficient record retrieval strategies.

Extendible Hashing

Utilizes a directory where specific addresses can represent multiple local records, thus promoting efficient growth management.

RAID: Redundant Arrays of Inexpensive Disks

Essential for improving secondary storage performance and reliability in line with processor advancements.

High-Level RAID Strategies

Includes Redundancy, Reliability, Data Striping, Performance, and Error Detection mechanisms.

RAID Configurations and Characteristics

Parity Recovery Mechanism

Utilizes an Exclusive OR (XOR) operation to derive redundancy across written blocks. This mechanism encompasses:
- Associativity properties of XOR for scalable error correction across operations.

Variability in RAID Levels

Each RAID configuration must consider design decisions based on the operational scale and redundancy needs, leading to optimized performance in varied application contexts.