hashing
Hashing Searching / Indexing Algorithm-3
The Search Problem
Find items with keys matching a given search key.
Given an array A containing n keys and a search key x, find the index i such that x = A[i].
A key could be part of a large record.
Applications of Searching
Keeping track of customer account information at a bank:
Searching through records to check balances and perform transactions.
Flight reservations tracking:
Searching to find empty seats or to cancel/modify reservations.
Search engines:
Searching for all documents containing a given word.
Special Case: Dictionaries
Dictionary: A data structure supporting mainly two operations:
Insert a new item.
Return an item with a given key.
Queries to return information about a set S:
Search(S, k)
Minimum(S)
Maximum(S)
Successor(S, x)
Predecessor(S, x)
Modifying operations:
Insert(S, k)
Delete(S, k) (not performed very often).
Direct Addressing
Assumptions:
Key values are distinct.
Each key is drawn from a universe U = {0, 1, …, m - 1}.
Idea:
Store items in an array indexed by keys.
Use a direct-address table representation:
Array T[0 … m - 1], corresponding to keys in U.
For element x with key k, store pointer to x in T[k].
If no elements exist with key k, T[k] is represented by NIL.
Direct Addressing (continued)
Example Representation:
T 0 key satellite data, U 1, 2, 3 etc.
Each slot corresponds to a key in the universe.
Insertion and deletion operations in O(1) time.
Operations Algorithms
DIRECT-ADDRESS-SEARCH(T, k): return T[k]
DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x
DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL
Running time for these operations: O(1)
Comparing Different Implementations
Implementing Dictionaries Using:
Direct Addressing: O(1) search
Ordered Arrays & Lists: O(N) for insertion/search
Unordered Arrays & Lists: O(N) for insertion/search
Examples Using Direct Addressing
Example 1:
Keys are integers from 1 to 100; create an array A of 100 items.
Example 2:
Keys are nine-digit social security numbers; inefficient to create an array of one billion items to store 100 records.
Hash Tables
When K < U, hash tables save space compared to direct-address tables.
Hash tables aim to achieve average O(1) search time.
Hash Tables Explained
Idea: Use a function h to compute the slot for each key, storing the element in slot h(k).
Hash function:
Transforms a key into an index in T[0…m-1] via mapping, reducing array index range.
Example of Hash Tables
U: Universe of keys, K: Actual keys with hash function results.
Revisit Example 2
If the keys are social security numbers, a possible hash function is: h(ssn) = SSS mod 100.
Collisions
Two or more keys hashing to the same slot.
If |K| > m, collisions definitely occur.
Even with a good hash function, avoiding collisions entirely is challenging.
Handling Collisions
Methods to handle collisions include:
Chaining
Open Addressing
Linear Probing
Quadratic Probing
Double Hashing
Chaining discussed first.
Handling Collisions with Chaining
Idea:
Place all elements that hash to the same slot in a linked list.
Slot j contains a pointer to the start of the list of all elements hashing to j.
Insertion in Hash Tables
Algorithm:
CHAINED-HASH-INSERT(T, x): Insert x at the head of list T[h(key[x])].
Worst-case running time: O(1), assuming element isn’t already in the list.
Deletion in Hash Tables
Algorithm:
CHAINED-HASH-DELETE(T, x): Delete x from the list T[h(key[x])].
Worst-case running time: Depends on searching the corresponding list.
Searching in Hash Tables
Algorithm:
CHAINED-HASH-SEARCH(T, k): Search for an element with key k in list T[h(k)].
Running time: Proportional to the size of the list in slot h(k).
Analysis of Hashing with Chaining: Worst Case
Worst-case scenario occurs when all n keys hash to the same slot.
Searching can then take O(n) time.
Analysis of Hashing with Chaining: Average Case
Average case depends on hash function performance distributing n keys across m slots.
Simple uniform hashing assumption: Each element equally likely to hash to any of m slots.
Load Factor of Hash Table
Load factor (α): a = n/m (n: number of elements, m: number of slots).
Encodes average chain length; a can be <, =, or > 1.
Case 1: Unsuccessful Search
Theorem: An unsuccessful search takes expected time α under simple uniform hashing.
Total time required is: O(1) (for the hash function) + α.
Case 2: Successful Search
Expected time can be analyzed as O(1 + α/2) based on average scenario.
Analysis of Search in Hash Tables
If m (number of slots) is proportional to n (number of elements):
Searching takes average O(1) time.
Hash Functions
Good hash functions:
Easy to compute.
Approximate randomness (regardless of input).
Good Approaches for Hash Functions
Minimize likelihood of related keys hashing to the same slot.
The Division Method of Hashing
Idea: Map key k to m slots using k mod m.
Advantages: Fast computation; Disadvantages: Poor choice of m (like power of 2) can lead to poor distribution.
Example - The Division Method
If m = 2^p, h(k) becomes least significant p bits of k.
Choose m to be prime and not close to powers of 2.
The Multiplication Method
Multiply key k by a constant A (0 < A < 1) and extract fractional part of kA.
Advantages: m’s value isn’t critical, Disadvantage: Slower than division method.
Universal Hashing
Designing hash functions to produce random table indices, independent of the keys.
Definition of Universal Hash Functions
A universal set of hash functions H={h(k): U (0, 1, …, m - 1)}; H is universal if for x ≠ y, probability EiH: h(x)=h(y) ≤ 1/m.
Universal Hashing – Main Result
Chance of collision during hashing is no more than 1/m if locations are randomly chosen.
Designing a Universal Class of Hash Functions
Choose a prime number p for every key range and define hash function ha,b(k) as:
ha,b(k) = ((ak + b) mod p) mod m.
Family of all such hash functions is universal.
Example: Universal Hash Functions
Using p = 17 and m = 6, calculate:
h3,4(8) = ((3×8 + 4) mod 17) mod 6 = 5.
Advantages of Universal Hashing
Provides good average results, independent of stored keys.
Guarantees that worst-case behavior is rare.
Open Addressing
Storing keys directly in the table without using linked lists when there’s enough memory.
Insert if a slot is full, find the next available one.
Generalized Hash Function Notation
A hash function takes key value and probe number: h(k, p).
Common Open Addressing Methods
Methods: Linear probing, Quadratic probing, Double hashing.
None can generate more than m^2 different sequences.
Linear Probing: Inserting a Key
Idea: On collision, check the next available position: h(k,i) = (h1(k) + i) mod m.
Linear Probing: Searching for a Key
Three cases:
Found the key.
Slot empty, proceed.
Different element found, continue probing.
Linear Probing: Deleting a Key
Challenge of marking a slot as empty is addressed by using a sentinel value DELETED.
Primary Clustering Problem
Certain slots become favored, increasing search time due to long clusters of occupied slots.
Quadratic Probing
Formula: h(k,i) = (h1(k) + i^2) mod N; lessening clustering problem but still present.
Double Hashing
Utilizes first hash function for the initial probe and a second for subsequent increments in the probing process: h(k,i) = (h1(k) + i*h2(k)) mod m.
Analysis of Open Addressing
Ignore clustering to analyze probe success rates.
Analysis of Open Addressing (continued)
Unsuccessful retrieval results based on load factor a.
Variations in expected number of steps based on load factor (0.5 vs 0.9).