Chapter 16: File Structures and Hashing
16.4 Placing File Records on Disk
Files, Fixed-Length Records, and Variable-Length Records
A file is defined as a sequence of records.
Record Types:
All records in a file may be of the same type.
If every record has exactly the same size (in bytes), the file is composed of fixed-length records.
If records have different sizes, the file consists of variable-length records.
Structure of File Records
Records are typically composed of a collection of related data values or items, where each value can consist of one or more bytes, corresponding to specific fields of the record.
Records generally describe entities along with their attributes.
Example of a Record
An example is an EMPLOYEE record.
Fields may include: Name, Birth_date, Salary, Supervisor.
A collection of field names and their respective data types forms a record type or record format definition.
Each field has a data type that specifies the kind of values it can contain.
Data Types and Size
Common data types involved include:
Numeric
String
Boolean
Sometimes Date and/or Time
The byte size for each data type is fixed based on the computer's system and can vary depending on:
Software platform
Hardware platform
16.5 Operations on Files
OPEN: Prepares the file for access and associates a pointer to a current file record.
FIND: Searches for the first record meeting a specific condition and sets it as the current record.
FINDNEXT: Searches for another record satisfying the condition and updates the current record pointer.
READ: Retrieves the current record into a program variable.
INSERT: Adds a new record into the file, updating the pointer to this new record.
DELETE: Marks the current record as invalid without physically removing it immediately.
MODIFY: Updates the values in some fields of the current record.
CLOSE: Ends the file's access session.
REORGANIZE: Physically reorganizes records, possibly removing those marked deleted or reorganizing the order they are stored in.
FIND_ORDERED: Retrieves all records in a specified sequence.
READ_ORDERED: Reads file blocks according to a specific ordering field.
16.6 Files of Unsorted Records (Heap Files)
Known as heap files:
New records get appended at the end.
Requires a linear search for record retrieval, averaging half the file blocks.
Efficient record insertion but inefficient for ordered reads as sorting is required afterward.
16.7 Files of Sorted Records (Sorted Files)
Referred to as sequential files:
Records remain sorted by values from a specified field.
Insertion becomes complex as new records demand positioning within existing order, often necessitating an overflow file for better performance in new record handling.
Binary searches can locate records efficiently, reading in log2 file blocks on average.
Memory Block Example of a Sorted File
The organized display of employee records as structured into various blocks.
16.8 Hashing
Primary File Organization via Hashing
A file organization type where quick record access under equality conditions occurs using a hash function.
The field searched against is termed the hash field, commonly a key field, or hash key.
Hash functions do not randomize; they yield direct disk block addresses based on hash field values.
Internal Hashing
Implemented via a hash table, organized as an array of records.
Array indexes range from 0 to M-1, indicating M slots corresponding to array indices.
A hash function transforms a hash key value into an integer between 0 and M-1 using:
Alternate Hash Functions
Variations include:
Folding: Arithmetic or logical functions applied over portions of the hash field.
Example: For an address range from 0 to 999 storing 1,000 keys, folding 235469 to yield .
Selective digit extraction from hash field values to formulate hash addresses.
Handling Non-Integer Hash Fields
Non-integer hash fields can leverage numeric (ASCII) values through transformations such as multiplication.
Collision Resolution in Hashing
A collision occurs when a hash value points to an address already occupied.
Methods for conflict resolution include:
Open Addressing: Spring from the original address until an unused one is found.
Chaining: Maintain overflow locations for records through pointers in a linked list format.
Multiple Hashing: Apply a secondary hash function in the event of a collision to find an empty location.
Limitations of Hash Functions
Distinct values in hash fields can collide due to large hash field spaces relative to address spaces, requiring robust design considerations.
Dynamic and Extendible Hashing Techniques
These approaches allow for flexible adjustment in file record numbers, enhancing efficiency alongside avoiding directory overhead.
Dynamics of Hashing
Operates around binary tree structures and makes adjustments to directory depths (
Depth can vary based on local or global requirements) leading to efficient record retrieval strategies.
Extendible Hashing
Utilizes a directory where specific addresses can represent multiple local records, thus promoting efficient growth management.
16.10 RAID
RAID: Redundant Arrays of Inexpensive Disks
Essential for improving secondary storage performance and reliability in line with processor advancements.
High-Level RAID Strategies
Includes Redundancy, Reliability, Data Striping, Performance, and Error Detection mechanisms.
RAID Configurations and Characteristics
Variants exist with differing focuses on redundancy and performance, such as:
RAID 0: Performance-focused (striping without redundancy).
RAID 1: Mirroring for enhanced reliability and data availability.
RAID 2: Deprecated, bit-level striping with Hamming code.
RAID 3 & 4: Block-level striping with parity disks, also deprecated in usage.
RAID 5: Block-level striping with distributed parity for performance and redundancy.
RAID 6: Further extends RAID 5 with an additional parity block.
Parity Recovery Mechanism
Utilizes an Exclusive OR (XOR) operation to derive redundancy across written blocks. This mechanism encompasses:
Associativity properties of XOR for scalable error correction across operations.
Variability in RAID Levels
Each RAID configuration must consider design decisions based on the operational scale and redundancy needs, leading to optimized performance in varied application contexts.