Fundamentals of Programming III: Data Structures

Fundamentals of Programming III - Data Structures Lecture 06: Hash Tables

Introduction to Hash Tables

Definition: A hash table is a data structure that maps keys to values. It can also be referred to as associative arrays.
- Keys:
- Should be unique.
- Must support consistent translation into an integer through a hash function.
- Values:
- Can be of any data type.

Terminology

Bucket: An element location within the hash table.
Hashing: The process of converting a key into a valid index for storage in the hash table.
- Common operation: hash=key\%n where n is the size of the hash table.

Hash Table Abstract Data Type (ADT)

Supported Operations:
- insert(key, value): Inserts a new element into the hash table.
- remove(key): Removes an element associated with the given key.
- search(key)/get(key): Retrieves the value corresponding to the specified key.
- update(key, value): Updates the value for a given key.
- count: Returns the number of elements in the hash table.
- isEmpty: Checks if the hash table contains any elements.

Simple Hashing

Main Operation: The most common hashing operation is modulo.
- Procedure:
1. Convert the key into an integer k utilizing a hash function.
2. If two keys are equal, they must produce the same hash value k.
3. Compute: hash = k mod n
4. Resulting hash is guaranteed to fall within the range of [0, n-1].

Non-Reversible Nature of Hashing

Example: If table size n = 12 and key = 14, then:
- 14 mod 12 = 2
- It is impossible to revert to the original key as many keys (e.g., 2, 14, 26,…) can hash to the same value 2.

Naïve Hash Table Operations

Insert:
- Let hf() be a hash function.
- Steps:
1. Calculate hash value: \text{hash value} = hf(\text{key})
2. Store value in: \text{table}[\text{hash value}] = \text{value}
- Remove:
- If search(key) is not null: set \text{table}[\text{hash value}] = \text{null}
Search:
- Steps:
1. Calculate hash value: hash value=hf(key) f table[hash ext{ value}] is not null: return table[hash ext{ value}]
2. Else return null.
Update:
- Steps:
1. Calculate hash value: hash\:{value}=hf(key)
2. If search(key) is not null: update the value at table[hash ext{ value}]
3. Else, insert (key, value).

Collisions

Perfect Hash Function: A hash function that maps each key uniquely.
Collision: Occurs when two keys hash to the same bucket, which is dependent on:
- Hash function chosen.
- Table size.
- Load factor.

Collision Handling Strategies

Various methods to handle collisions include:
- Chaining: Each bucket holds a linked list of entries.
- Open Addressing: Inserts the item in another slot in the same array via probing.
  - Linear Probing: Sequentially searches through the array to find an open slot.
  - Quadratic Probing: Increases the step size quadratically while probing.
  - Double Hashing: Uses a second hash function to determine the probing sequence.

Chaining

Each bucket in a hash table can hold a linked list of entries:
- Insert: Adds new nodes to the linked list in the respective bucket.
- Search/Remove: Traverses the linked list for the specific key.

Open Addressing Overview

All elements are stored directly in the array. Upon collision, an open bucket is searched using a specific probing sequence.
Distinguishing Status:
- ESS: Empty Since Start - No element has ever occupied this space.
- EAR: Empty After Removal - Previously contained an element but is currently empty.

Linear Probing

Methodology: Steps through buckets one by one.
- If it encounters a filled bucket, it checks the next one until it finds an empty slot.
- Issues: Can cause primary clustering, leading to groups of filled buckets which can degrade performance.

Hash Table Operations with Linear Probing

Insert:
- Use: hash = hf(item.key)
- While probing, if an empty bucket is found, insert the item there.
- If probing exceeds the size of the hash table, return false.
Remove:
- While probing, if an item matches, label as EAR.
Search:
- Probes until either the correct item is found or an ESS is encountered.

Clustering in Hash Tables

Definition: Clustering occurs when multiple keys form groups of occupied slots, making future insertions inefficient.
- Probing methods exacerbate the issue since they cause clustered keys to aggregate near each other.

Clustering Effects on Performance

Increased size of clusters leads to longer search times for empty buckets.
Higher likelihood of collisions as more keys are inserted within an existing cluster, thus degrading performance with respect to the load factor.

Mitigating Clustering

Quadratic Probing: Reduces clustering by increasing step sizes quadratically.
Double Hashing: Applies a secondary hash function for selecting the step size, helping to evenly distribute entries.
Rehashing: When clusters severely affect performance, resizing the table with a better hash method can help.

Quadratic Probing Sequence

The probing sequence is determined as: (hf(key)+c_1\cdot i+c_2\cdot i^2)\bmod n where i is the iteration count, and c1 and c2 are typically set to 1.
These tend to be smaller and more spread out.

Hash Table Operations with Quadratic Probing

Operations for insertion, removal, and searching are similar to linear probing but use the quadratic probing sequence.

Double Hashing

Probing Sequence: Provided as (hf1(key) + i imes hf2(key)) mod n where both hf1 and hf2 are distinct hash functions. All underlying procedures (insert, remove, search) remain the same except for the probing sequence used.

Hash Table Size and Resizing

Conditions for Resizing: When the table is full or the load factor exceeds certain thresholds.
Best Size: Recommended to keep hash table size as a prime number to minimize clustering effects.
- Resize method involves creating a new larger array and re-inserting all current elements into this new array.

Load Factor and Thresholds

Load Factor (α) = number of items / number of buckets.
Typical Thresholds:
- Load factor threshold: Recommended to remain below 0.75.
- Collision threshold: Action taken if the count of collisions exceeds a set integer value.
- Linked-list threshold: Monitors the length of linked lists in chains.

When to Resize

Triggers include:
- Full table condition.
- Load factor exceeding threshold.
- The number of collisions surpassing collision thresholds.

Common Hash Functions

Division Method: h(k) = k mod m.
- Key is divided by the table size m, and the remainder is used as the hash index.
- Simple and fast but can cluster if m is poorly chosen.
Multiplication Method: h(k) = m imes (kA mod 1) where A is a constant between 0 and 1.
- avoids poor table size choices
Universal Hashing: Uses random hash functions to improve performance against worst-case scenarios.
Folding Method: Combines parts of the key by summation or operations.
Mid-Square Method: Squares the key and extracts the middle bits to derive the hash.
Cryptographic Hash Functions: (e.g., SHA, MD5) secure but usually slower; best for cryptographic applications.

Informal Runtime Analysis of Hash Table Operations

Expected runtime for operations in ideal conditions is O(1).
- Factors affecting performance include the type of collision resolution strategy used, load factor, and the quality of the hash function.

Runtime Analysis with Chaining

Insertion: Generally O(1) when inserted at the list head. Worst-case O(m). Expected O(1 + α).
Search: Best-case O(1), but worst-case O(m). Expected O(1 + α).
Removal: Same runtime as search, dependent on finding the node first.

Runtime Analysis with Open Addressing

Insertion can range from O(1) best-case to O(n) worst-case.
Search and removal follow similar patterns affected by collisions and cluster size.
Critical to keep the load factor below 0.75 for optimal performance.

Amortized vs Worst Case Operation Analysis

Amortized for operations in chaining and open addressing are expected to be O(1 + α) with acceptable load factors, while the worst-case can degrade to O(n).
Maintaining a low load factor is essential to ensure consistent performance.