Fundamentals of Programming III: Data Structures
Fundamentals of Programming III - Data Structures Lecture 06: Hash Tables
Introduction to Hash Tables
Definition: A hash table is a data structure that maps keys to values. It can also be referred to as associative arrays.
Keys:
Should be unique.
Must support consistent translation into an integer through a hash function.
Values:
Can be of any data type.
Terminology
Bucket: An element location within the hash table.
Hashing: The process of converting a key into a valid index for storage in the hash table.
Common operation: hash=key\%n where n is the size of the hash table.
Hash Table Abstract Data Type (ADT)
Supported Operations:
insert(key, value): Inserts a new element into the hash table.
remove(key): Removes an element associated with the given key.
search(key)/get(key): Retrieves the value corresponding to the specified key.
update(key, value): Updates the value for a given key.
count: Returns the number of elements in the hash table.
isEmpty: Checks if the hash table contains any elements.
Simple Hashing
Main Operation: The most common hashing operation is modulo.
Procedure:
Convert the key into an integer k utilizing a hash function.
If two keys are equal, they must produce the same hash value k.
Compute: hash = k mod n
Resulting hash is guaranteed to fall within the range of [0, n-1].
Non-Reversible Nature of Hashing
Example: If table size n = 12 and key = 14, then:
14 mod 12 = 2
It is impossible to revert to the original key as many keys (e.g., 2, 14, 26,…) can hash to the same value 2.
Naïve Hash Table Operations
Insert:
Let hf() be a hash function.
Steps:
Calculate hash value: \text{hash value} = hf(\text{key})
Store value in: \text{table}[\text{hash value}] = \text{value}
Remove:
If search(key) is not null: set \text{table}[\text{hash value}] = \text{null}
Search:
Steps:
Calculate hash value: hash value=hf(key) f table[hash ext{ value}] is not null: return table[hash ext{ value}]
Else return null.
Update:
Steps:
Calculate hash value: hash\:{value}=hf(key)
If search(key) is not null: update the value at table[hash ext{ value}]
Else, insert (key, value).
Collisions
Perfect Hash Function: A hash function that maps each key uniquely.
Collision: Occurs when two keys hash to the same bucket, which is dependent on:
Hash function chosen.
Table size.
Load factor.
Collision Handling Strategies
Various methods to handle collisions include:
Chaining: Each bucket holds a linked list of entries.
Open Addressing: Inserts the item in another slot in the same array via probing.
Linear Probing: Sequentially searches through the array to find an open slot.
Quadratic Probing: Increases the step size quadratically while probing.
Double Hashing: Uses a second hash function to determine the probing sequence.
Chaining
Each bucket in a hash table can hold a linked list of entries:
Insert: Adds new nodes to the linked list in the respective bucket.
Search/Remove: Traverses the linked list for the specific key.
Open Addressing Overview
All elements are stored directly in the array. Upon collision, an open bucket is searched using a specific probing sequence.
Distinguishing Status:
ESS: Empty Since Start - No element has ever occupied this space.
EAR: Empty After Removal - Previously contained an element but is currently empty.
Linear Probing
Methodology: Steps through buckets one by one.
If it encounters a filled bucket, it checks the next one until it finds an empty slot.
Issues: Can cause primary clustering, leading to groups of filled buckets which can degrade performance.
Hash Table Operations with Linear Probing
Insert:
Use: hash = hf(item.key)
While probing, if an empty bucket is found, insert the item there.
If probing exceeds the size of the hash table, return false.
Remove:
While probing, if an item matches, label as EAR.
Search:
Probes until either the correct item is found or an ESS is encountered.
Clustering in Hash Tables
Definition: Clustering occurs when multiple keys form groups of occupied slots, making future insertions inefficient.
Probing methods exacerbate the issue since they cause clustered keys to aggregate near each other.
Clustering Effects on Performance
Increased size of clusters leads to longer search times for empty buckets.
Higher likelihood of collisions as more keys are inserted within an existing cluster, thus degrading performance with respect to the load factor.
Mitigating Clustering
Quadratic Probing: Reduces clustering by increasing step sizes quadratically.
Double Hashing: Applies a secondary hash function for selecting the step size, helping to evenly distribute entries.
Rehashing: When clusters severely affect performance, resizing the table with a better hash method can help.
Quadratic Probing Sequence
The probing sequence is determined as: (hf(key)+c_1\cdot i+c_2\cdot i^2)\bmod n where i is the iteration count, and c1 and c2 are typically set to 1.
These tend to be smaller and more spread out.
Hash Table Operations with Quadratic Probing
Operations for insertion, removal, and searching are similar to linear probing but use the quadratic probing sequence.
Double Hashing
Probing Sequence: Provided as (hf1(key) + i imes hf2(key)) mod n where both hf1 and hf2 are distinct hash functions. All underlying procedures (insert, remove, search) remain the same except for the probing sequence used.
Hash Table Size and Resizing
Conditions for Resizing: When the table is full or the load factor exceeds certain thresholds.
Best Size: Recommended to keep hash table size as a prime number to minimize clustering effects.
Resize method involves creating a new larger array and re-inserting all current elements into this new array.
Load Factor and Thresholds
Load Factor (α) = number of items / number of buckets.
Typical Thresholds:
Load factor threshold: Recommended to remain below 0.75.
Collision threshold: Action taken if the count of collisions exceeds a set integer value.
Linked-list threshold: Monitors the length of linked lists in chains.
When to Resize
Triggers include:
Full table condition.
Load factor exceeding threshold.
The number of collisions surpassing collision thresholds.
Common Hash Functions
Division Method: h(k) = k mod m.
Key is divided by the table size m, and the remainder is used as the hash index.
Simple and fast but can cluster if m is poorly chosen.
Multiplication Method: h(k) = m imes (kA mod 1) where A is a constant between 0 and 1.
avoids poor table size choices
Universal Hashing: Uses random hash functions to improve performance against worst-case scenarios.
Folding Method: Combines parts of the key by summation or operations.
Mid-Square Method: Squares the key and extracts the middle bits to derive the hash.
Cryptographic Hash Functions: (e.g., SHA, MD5) secure but usually slower; best for cryptographic applications.
Informal Runtime Analysis of Hash Table Operations
Expected runtime for operations in ideal conditions is O(1).
Factors affecting performance include the type of collision resolution strategy used, load factor, and the quality of the hash function.
Runtime Analysis with Chaining
Insertion: Generally O(1) when inserted at the list head. Worst-case O(m). Expected O(1 + α).
Search: Best-case O(1), but worst-case O(m). Expected O(1 + α).
Removal: Same runtime as search, dependent on finding the node first.
Runtime Analysis with Open Addressing
Insertion can range from O(1) best-case to O(n) worst-case.
Search and removal follow similar patterns affected by collisions and cluster size.
Critical to keep the load factor below 0.75 for optimal performance.
Amortized vs Worst Case Operation Analysis
Amortized for operations in chaining and open addressing are expected to be O(1 + α) with acceptable load factors, while the worst-case can degrade to O(n).
Maintaining a low load factor is essential to ensure consistent performance.