Design & Analysis of Algorithm: Hashing
Analysis of Algorithms
Hashing
The Search Problem
Objective: Find items with keys matching a given search key.
Problem Definition: Given an array A, containing n keys, and a search key x, find the index i such that x=A[i].
Note: A key could be part of a larger record.
Applications of Hashing
Banking: Keeping track of customer account information.
Searching through records for checking balances and performing transactions.
Flight Reservations: Keeping track of seat availability.
Searching to find empty seats and modify or cancel reservations.
Search Engines: Looking for all documents containing a specified word.
Special Case: Dictionaries
Definition: A dictionary is a data structure supporting mainly two fundamental operations: inserting a new item and retrieving an item with a given key.
Dictionary Operations
Queries for information about the set S:
Search(S, k)
Minimum(S)
Maximum(S)
Successor(S, x)
Predecessor(S, x)
Modifying Operations (which change the set):
Insert(S, k)
Delete(S, k) (less frequent)
Direct Addressing
Assumptions
Key values are distinct.
Each key is drawn from a universe U = {0, 1, …, m - 1}.
Concept
Store items in an array indexed by keys.
Direct-address table representation: An array T[0 … m - 1].
Each slot in T corresponds to a key in U.
For an element x with key k, a pointer to x (or x itself) is placed in location T[k].
If there are no elements with key k in the set, T[k] is represented by NIL.
Visualization of Direct Addressing
Example: For a universe U and actual keys K:
Every slot T[j] in an array can either contain a record or NIL.
Insert/Delete operations can be performed in O(1) time.
Operations Algorithms for Direct Addressing
Algorithms:
DIRECT-ADDRESS-SEARCH(T, k): return T[k]
DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x
DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL
Running time for these operations: O(1)
Comparing Different Implementations
Implementing dictionaries using different structures:
Direct addressing: Insert O(1), Search O(1)
Ordered array: Insert O(N), Search O(lgN)
Unordered array: Insert O(1), Search O(N)
Ordered linked list: Insert O(N), Search O(N)
Unordered linked list: Insert O(1), Search O(N)
Examples Using Direct Addressing
Example 1
Scenario: Keys are integers from 1 to 100 with about 100 records.
Create an array of 100 items and store the record whose key equals i in A[i].
Example 2
Scenario: Keys are nine-digit social security numbers.
Using the same strategy would require an array of 1 billion items, which is inefficient due to the large universe U compared to |K|.
Hash Tables
Definition
When K (the number of actual keys) is much smaller than U (the universe), a hash table requires significantly less space than a direct-address table.
Storage can be reduced to |K| while aiming for O(1) average search time (note: this is not guaranteed in the worst case).
Idea of Hash Tables
Use a function h to compute the slot for each key.
Store the element in slot h(k).
Hash function: A function h transforms a key into an index in a hash table T[0…m-1], described as h: U → {0, 1, …, m - 1}.
We say that k hashes to slot h(k).
Advantages:
Reduce the range of array indices handled: m instead of |U|.
Reduced required storage.
Example of Hash Tables
Illustrates U (universe of keys) and K (actual keys) mapped to slots through hash functions h(k).
Handling Collisions in Hashing
Problems
Definition of Collision: Two or more keys that hash to the same slot.
Given a set K of keys: if |K| ≤ m, collisions may or may not happen depending on the hash function; if |K| > m, collisions will definitely occur.
Avoiding collisions completely can be challenging even with a good hash function.
Collision Handling Methods
Methods discussed:
Chaining
Open addressing (Linear probing, Quadratic probing, Double hashing)
Handling Collisions Using Chaining
Concept
Store all elements that hash to the same slot in a linked list.
Slot j contains a pointer to the head of the list of all elements that hash to j.
Collision with Chaining - Discussion
Choosing the size of the table: Needs to be small enough to avoid wastage but large enough to keep the linked lists short (typically 1/5 or 1/10 of the total number of elements).
List ordering: Lists should be unordered to facilitate fast insertion and removal.
Insertion in Hash Tables
Algorithm: CHAINED-HASH-INSERT(T, x): inserts x at the head of list T[h(key[x])].
Worst-case running time: O(1) assuming the element is not already present.
Deletion in Hash Tables
Algorithm: CHAINED-HASH-DELETE(T, x): delete x from the list at T[h(key[x])].
Search time required to find the element being deleted.
Searching in Hash Tables
Algorithm: CHAINED-HASH-SEARCH(T, k): searches for an element with key k in list T[h(k)].
Running time: Proportional to the length of the list of elements in slot h(k).
Analysis of Hashing with Chaining
Worst Case
Worst-case search time: If all n keys hash to the same slot, resulting in time complexity of plus hash function time.
Average Case
Average case depends on the efficiency of the hash function in distributing n keys among m slots.
Assumption: Simple uniform hashing (any element is equally likely to hash into any of the m slots) leads to average list length for j slots being .
Load Factor of a Hash Table
Definition: Load factor, , where n = number of elements in the table, m = number of slots.
Interpretation: Encodes average number of elements stored in a chain.
Cases of Load Factor
Unsuccessful Search: Theorem states that an unsuccessful search takes expected time + under the simple uniform hashing assumption.
Successful Search: Average case analysis offers expected time of , where elements are split across a list.
Hash Functions
Characteristics of Good Hash Functions
Easy to compute.
Must approximate random distribution for every key (simple uniform hashing).
Difficulties arise as it is hard to satisfy simple uniform hashing generally.
Good Approaches to Designing Hash Functions
Minimize the chance that closely related keys hash to the same slot.
Derive a hash value that is independent of any patterns in the key distribution.
The Division Method for Hashing
Overview
Concept: Map key k into slots by taking the remainder of k divided by m.
.
Advantages: Fast computation (only one operation required).
Disadvantages: Poor choices for m (e.g.: powers of 2, non-prime numbers) that can make outputs less uniform.
Example of the Division Method
If , then the hash function will be based only on the least significant bits of k.
Recommendation: Choose m to be a prime number, not close to a power of 2.
The Multiplication Method for Hashing
Overview
Concept: Multiply key k by a constant A (0 < A < 1) and provide a hash based on the fractional part of this product.
.
Advantages: M value is not critical (e.g., should be a power of 2).
Disadvantages: Slower than division method.
Example of the Multiplication Method
Shows the process of taking a fractional part and shifting for the final hash based on chosen m.
Universal Hashing
Concept
Recognizes that key distributions are often non-uniform in practice.
Select a hash function at random from a designed class of functions during the execution to reduce collision probability.
Definition of Universal Hash Functions
A set of functions H is universal if for any two keys x and y, the probability that they collide (i.e., h(x) = h(y)) is at most .
Importance of Universal Hashing
Guarantees reduced collision probability, often providing near random table indices irrespective of the keys themselves.
Designing a Universal Class of Hash Functions
Procedure: Choose a large prime number p ensuring every possible key is within [0, p-1].
Define the hash function where each hash is a valid pair of prime choices:
, with a and b being chosen randomly at execution start.
Example of Universal Hash Functions
Provides a direct application of this definition using specific prime numbers and functional calculations.
Advantages of Universal Hashing
Provides consistent average results regardless of the key input.
Prevents worst-case behaviors from occurring during typical operations due to random hash function selection.
Open Addressing
Concept
Stores keys directly in the table (when m > N) without linked lists.
Insertion and searching are handled through probing for the next available slot if collisions occur.
Generalized Hash Function Notation
Displays that a hash function now contains two arguments: key and probe number.
Probing sequence: Must be a permutation of all slots possible (0 to m-1).
Common Open Addressing Methods
Types of Open Addressing:
Linear probing
Quadratic probing
Double hashing
Linear Probing
Insertion: If a collision occurs, check the next available position in the table with probe formula:
where i = 0, 1, 2…
Linear Probing: Searching for a Key
Cases:
Key matches the current position.
Slot is empty (key not found).
Different key present, continue probing.
Linear Probing: Deleting a Key
Challenges: Cannot mark slots as empty as it may obstruct searches; instead, use a sentinel value 'DELETED' to allow for structures to be reused.
Primary Clustering Problem in Linear Probing
As elements get clustered together, search times can increase due to lengths of chains becoming predictable.
Quadratic Probing
Formula of inquiry is now of quadratic nature to mitigate clustering issues:
.
Still a potential clustering problem exists (secondary clustering).
Double Hashing
Mechanism: Use one hash function to find the initial slot and another for determining the increments in the probing sequence:
.
Provides advantages of avoiding clustering, but makes deletions harder due to probe dependencies.
Analysis of Open Addressing
Probe Performance Analysis
Assumes uniformly distributed probe sequences.
Characterizes likelihood of success and failure across retrieval attempts, leading to expected number of steps based on load factor alpha (α) (average number of elements in the table).
Examples offer specific working conditions under defined probabilities to understand expected retrieval times based on load factors.