Indexing and Hashing in Database Systems

Indexing - Hash Tables

Basic Concepts

Indexing mechanisms are used to speed up access to data (e.g., author catalog in a library). A Search Key is an attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form:

search-key | pointer

Index files are typically much smaller than the original file. There are two basic kinds of indices:

  • Ordered indices: Search keys are stored in sorted order.

  • Hash indices: Search keys are distributed uniformly across “buckets” using a “hash function.”

Index Evaluation Metrics

Important metrics for evaluating indices include:

  • Access types supported efficiently:

    • Records with a specified value in the attribute.

    • Records with an attribute value falling in a specified range of values.

  • Access (i.e., search) time

  • Insertion time

  • Deletion time

  • Space overhead

Ordered Indices

In an ordered index, index entries are stored sorted on the search key value.

  • Clustering index: In a sequentially ordered file, the index whose search key specifies the sequential order of the file.

    • Also called primary index.

    • The search key of a primary index is usually, but not necessarily, the primary key.

  • Secondary index: An index whose search key specifies an order different from the sequential order of the file. Also called nonclustering index.

  • Index-sequential file: Sequential file ordered on a search key, with a clustering index on the search key.

Dense Index Files

Dense index: Index record appears for every search-key value in the file (e.g., index on ID attribute of instructor relation).

Sparse Index Files

Sparse Index: Contains index records for only some search-key values.

  • Applicable when records are sequentially ordered on search-key

  • To locate a record with search-key value K:

    • Find the index record with the largest search-key value ≤ K.

    • Search file sequentially starting at the record to which the index record points.

  • Compared to dense indices:

    • Less space and less maintenance overhead for insertions and deletions.

    • Generally slower than dense index for locating records.

  • Good tradeoff:

    • For clustered index: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

    • For unclustered index: sparse index on top of dense index (multilevel index).
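The sparse-index lookup procedure above can be sketched in a few lines of Python. This is an illustrative model, not from the original text: the file is a list of blocks, each block a sorted list of (key, record) pairs, and the sparse index holds one entry per block (the block's least search-key value), as in the "good tradeoff" for a clustered index.

```python
import bisect

def build_sparse_index(blocks):
    """One index entry (least search-key value, block number) per block."""
    return [(block[0][0], i) for i, block in enumerate(blocks)]

def lookup(blocks, index, k):
    """Return the record with search-key value k, or None."""
    # Find the index entry with the largest search-key value <= k.
    keys = [entry[0] for entry in index]
    pos = bisect.bisect_right(keys, k) - 1
    if pos < 0:
        return None
    # Search the file sequentially starting at the block the entry points to.
    for block in blocks[index[pos][1]:]:
        for key, record in block:
            if key == k:
                return record
            if key > k:        # file is sorted: we have gone past k
                return None
    return None

blocks = [[(10, "A"), (20, "B")], [(30, "C"), (40, "D")], [(50, "E")]]
index = build_sparse_index(blocks)   # [(10, 0), (30, 1), (50, 2)]
```

Because the file is sequentially ordered on the search key, the scan can stop as soon as it sees a key greater than k.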

Secondary Indices Example

Secondary index on salary field of instructor. Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. Secondary indices have to be dense.

Clustering vs Nonclustering Indices

Indices offer substantial benefits when searching for records, but impose overhead on database modification.

  • When a record is inserted or deleted, every index on the relation must be updated.

  • When a record is updated, any index on an updated attribute must be updated.

  • Sequential scan using clustering index is efficient, but a sequential scan using a secondary (nonclustering) index is expensive on magnetic disk.

    • Each record access may fetch a new block from disk.

    • Each block fetch on magnetic disk requires about 5 to 10 milliseconds.

Multilevel Index

If the index does not fit in memory, access becomes expensive. Solution: treat index kept on disk as a sequential file and construct a sparse index on it.

  • outer index – a sparse index of the basic index

  • inner index – the basic index file

  • If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.

  • Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion

Single-level index entry deletion:

  • Dense indices – deletion of search-key is similar to file record deletion.

  • Sparse indices

    • If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).

    • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

    • If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.

Index Update: Insertion

Single-level index insertion:

  • Perform a lookup using the search-key value of the record to be inserted.

  • Dense indices – if the search-key value does not appear in the index, insert it.

    • Indices are maintained as sequential files

    • Need to create space for new entry; overflow blocks may be required.

  • Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.

    • If a new block is created, the first search-key value appearing in the new block is inserted into the index.

  • Multilevel insertion and deletion: algorithms are simple extensions of the single-level algorithms.
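The single-level dense-index maintenance described above can be sketched with the index kept as a sorted list of (search-key, pointer) entries. The class and pointer values below are illustrative, not from the original text; a real DBMS would keep the entries in sequential blocks on disk rather than an in-memory list.

```python
import bisect

class DenseIndex:
    """Minimal sketch of a single-level dense index: one sorted
    (search-key, pointer) entry per search-key value in the file."""

    def __init__(self):
        self.entries = []        # kept sorted on search-key value

    def lookup(self, key):
        # (key,) compares less than any (key, pointer) pair, so
        # bisect_left finds the first entry with this search key.
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]
        return None

    def insert(self, key, pointer):
        # First perform a lookup; insert only if the key is absent.
        i = bisect.bisect_left(self.entries, (key,))
        if i == len(self.entries) or self.entries[i][0] != key:
            self.entries.insert(i, (key, pointer))

    def delete(self, key):
        # Deleting a search key mirrors deleting the file record.
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            self.entries.pop(i)
```

The `list.insert` call is where a disk-based index would instead need to shift entries within a block or allocate an overflow block to make room.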

Indices on Multiple Keys

Composite search key:

  • E.g., index on instructor relation on attributes (name, ID)

  • Values are sorted lexicographically

  • E.g. (John, 12121) < (John, 13514) and (John, 13514) < (Peter, 11223)

  • Can query on just name, or on (name, ID)
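The lexicographic ordering of composite search keys behaves exactly like Python tuple comparison, which makes the examples above easy to check directly (the names and IDs are the ones from the text):

```python
# Composite keys compare lexicographically: the first attribute decides,
# and equal first attributes fall through to the second.
assert ("John", 12121) < ("John", 13514)   # equal name: compare ID
assert ("John", 13514) < ("Peter", 11223)  # name decides

# A query on just `name` matches every entry whose first component equals it:
entries = [("John", 12121), ("John", 13514), ("Peter", 11223)]
johns = [e for e in entries if e[0] == "John"]
```

Because the entries are sorted on (name, ID), all matches for a given name are adjacent, which is what makes prefix queries on a composite index efficient.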

Introduction – HASH TABLES

A hash table implements an unordered associative array that maps keys to values. It uses a hash function to compute, for a given key, an offset into an underlying array of slots, from which the desired value can be found.

  • Space Complexity: O(n)

  • Time Complexity:

    • Average: O(1)

    • Worst: O(n)

“Vanilla” Hash Table

Allocate a ‘giant’ array with one slot for every element you need to store. To find an entry, mod the key by the number of elements to find the offset in the array.

Unrealistic Assumptions!

  • #1: Number of elements is known ahead of time and fixed.

  • #2: Each key is unique.

  • #3: Perfect hash function guarantees no collisions.

    • If key1 ≠ key2, then hash(key1) ≠ hash(key2)
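A "vanilla" hash table under the three assumptions above can be sketched in a few lines. This is an illustrative model only: it works exactly when the element count is fixed, keys are unique integers, and `key % size` happens not to collide for the keys stored.

```python
class VanillaHashTable:
    """One slot per element; key % size gives the slot directly."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size

    def put(self, key, value):
        # Assumption #3 (no collisions) means we may simply overwrite.
        self.slots[key % self.size] = (key, value)

    def get(self, key):
        entry = self.slots[key % self.size]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None
```

Storing the key alongside the value lets `get` detect when assumption #3 was violated and a different key occupies the slot, instead of silently returning the wrong value.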

Hash Table – Design Decisions

  • Design Decision #1: Hash Function

    • How to map a large key space into a smaller domain.

    • Trade-off between being fast vs. collision rate.

  • Design Decision #2: Hashing Scheme

    • How to handle key collisions after hashing.

    • Trade-off between allocating a large hash table vs. additional instructions to get/put keys.

Hash Function

For any input key, return an integer representation of that key. The hash function maps search-keys to bucket addresses (data of variable length to data of fixed length, e.g., names to numbers [0..15]). We do not want to use a cryptographic hash function for DBMS hash tables (e.g., SHA-2); we want something that is fast and has a low collision rate.

Hash Functions

Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file. An ideal hash function is uniform (i.e., each bucket is assigned the same number of search-key values from the set of all possible values) and random (so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file).

Typical hash functions perform some computation on the internal binary representation of the search-key. For example, for a string search-key, the binary representations of all the characters in the string could be added, and the sum modulo the number of buckets could be returned.
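The string hash just described can be sketched directly. Here the "binary representation" of each character is taken to be its UTF-8 byte value; the lecture's worked example assigns characters different integer codes, so the exact bucket numbers differ, but the structure (sum the characters, take the result modulo the number of buckets) is the same.

```python
def string_hash(key: str, num_buckets: int) -> int:
    """Sum the byte values of the characters, modulo the bucket count."""
    return sum(key.encode("utf-8")) % num_buckets

# Every search key deterministically maps to a bucket in [0, num_buckets):
bucket = string_hash("Physics", 10)
assert 0 <= bucket < 10
```

Note that such a sum-of-characters hash is easy to compute but distributes poorly (anagrams like "listen" and "silent" always collide), which is why real systems prefer functions like the ones listed below.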

Hash Functions Examples

  • CRC-64 (1975): Used in networking for error detection.

  • MurmurHash (2008): Designed as a fast, general-purpose hash function.

  • Google CityHash (2011): Designed to be faster for short keys (< 64 bytes).

  • Facebook XXHash (2012): From the creator of zstd compression.

  • Google FarmHash (2014): Newer version of CityHash with better collision rates.

General definitions

A bucket is a unit of storage containing one or more entries (a bucket is typically a disk block). We obtain the bucket of an entry from its search-key value using a hash function. Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. h is used to locate entries for access, insertion, as well as deletion.

Entries with different search-key values may be mapped to the same bucket; thus, the entire bucket has to be searched sequentially to locate an entry. In a hash index, buckets store entries with pointers to records. In a hash file-organization, buckets store records.

Use of hashing in DBMSs

  1. Hashing is used for the organization of a file (as an alternative to heap or sequential organization).

  2. Hashing is used for the creation of indexes (on attributes that are not keys).

    • In this case, it is called a hash index and organizes the values of attributes (together with the respective pointers) in a hash file.

Example of Hash File Organization

There are 10 buckets. The binary representation of the i-th character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10 (e.g., h(Music) = 1, h(History) = 2, h(Physics) = 3, h(Elec. Eng.) = 3). Hash file organization of instructor file, using dept_name as key.

Handling of Bucket Overflows

Bucket overflow can occur because of:

  • Insufficient buckets

  • Skew in distribution of records. This can occur due to two reasons:

    • Multiple records have the same search-key value

    • Chosen hash function produces non-uniform distribution of key values

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Handling of Bucket Overflows (Cont.)

Overflow chaining: The overflow buckets of a given bucket are chained together in a linked list. This scheme is called closed addressing (also called closed hashing or open hashing, depending on the book you use). An alternative, called open addressing (also called open hashing or closed hashing), which does not use overflow buckets, is not suitable for database applications.
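Overflow chaining can be sketched as follows. The bucket capacity and class names are illustrative; in a real DBMS a bucket would be a disk block holding many entries, and the chain pointer would be a block address.

```python
BUCKET_CAPACITY = 2          # illustrative; a disk block holds many entries

class Bucket:
    def __init__(self):
        self.entries = []    # up to BUCKET_CAPACITY (key, value) pairs
        self.overflow = None # chained overflow bucket, if any

class ChainedHashTable:
    def __init__(self, num_buckets):
        self.buckets = [Bucket() for _ in range(num_buckets)]
        self.num_buckets = num_buckets

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.num_buckets]
        # Walk the overflow chain to the first bucket with free space,
        # allocating a new overflow bucket at the end if needed.
        while len(bucket.entries) == BUCKET_CAPACITY:
            if bucket.overflow is None:
                bucket.overflow = Bucket()
            bucket = bucket.overflow
        bucket.entries.append((key, value))

    def lookup(self, key):
        bucket = self.buckets[hash(key) % self.num_buckets]
        # The entire bucket, including its overflow chain, must be
        # searched sequentially to locate an entry.
        while bucket is not None:
            for k, v in bucket.entries:
                if k == key:
                    return v
            bucket = bucket.overflow
        return None
```

The sequential walk in `lookup` is exactly the cost the "general definitions" section mentions: entries with different search-key values can land in the same bucket, so the whole bucket (and its chain) may have to be scanned.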

Hashing Schemes

  • Static Hashing: Fixed set of bucket addresses

    • Approach #1: Linear Probe Hashing

    • Approach #2: Cuckoo Hashing

    • There are several other schemes: Robin Hood Hashing, Hopscotch Hashing, Swiss Tables, …

  • Dynamic Hashing: Incrementally resize as needed

    • Chained Hashing

    • Extendible Hashing

    • Linear Hashing

Linear Probe Hashing

Most basic + fastest scheme. Uses a circular buffer of array slots. Keys are mapped to slots using a hash function. Collisions are resolved by linearly searching adjacent slots until an open one is found. Lookups involve checking the slot the key hashes to and searching linearly until the desired entry is found or an empty slot is reached. Deletions can be handled by using