Hashing & Hash Table Study Notes
- Reference case table comparing several data structures
- Sequential (unordered) list
- search=insert=delete=O(n)
- Avg. probes: search-hit ≈2n, insert ≈n, delete ≈2n
- No ordered operations
- Ordered array with binary search
- search=O(logn); insert & delete O(n) due to shifting
- Ordered ops supported
- Binary Search Tree (unbalanced)
- Worst O(n), average ≈1.39logn for search, insert, delete
- Red-Black BST
- Guaranteed O(logn) for all three ops, ordered ops supported
- Hashing (subject of this lecture)
- Worst O(n), but expected 3!–!5 probes for search/insert/delete under uniform hashing
- Ordered ops NOT supported
Learning Objectives
- Implement symbol-table methods with hash tables
- Design practical hash codes for multiple key types
- Apply/open-addressing strategies: linear probing, quadratic probing, double hashing
- Apply separate chaining and evaluate each approach’s performance
Direct Addressing & Key-Indexed Tables
- Scenario: parking tags numbered 1,2,3,… → array index = key ⇒ O(1) time
- General case: keys are integers 0…M−1, N<M
- Allocate array of length M & store value at index=key
- Called Direct Addressing / key-indexed table
- Limitations
- Keys may not be small non-negative ints
- M≫N wastes memory
Hashing: Basic Plan
- Store in array by applying a hash function h(key)→[0,M−1]
- Key issues
- Computing h efficiently & deterministically
- Equality test to break ties
- Collision resolution (open addressing vs chaining)
- Classic space–time trade-off
- Infinite memory ⇒ identity function possible
- Infinite time ⇒ sequential search resolves collisions
- Real world ⇒ choose balanced hash function + collision strategy
Two-Part Hash Function Paradigm
- Hash Code h1 (key → 32-/64-bit integer)
- Compression Function h2 (integer → [0,M−1])
- Standard: h2(x)=xmodM using only positive bits: (x&0x7fffffff)%M in Java
- Advantage: h1 independent of table size so re-hash only requires recompressing
Designing Hash Codes
- Goals: exploit entire key, deterministic, fast, spread keys uniformly
- Java conventions
- Every object inherits int hashCode()
- Contract: x.equals(y)⇒x.hashCode()=y.hashCode()
- Desirable: unequal ⇒ different codes (not guaranteed)
- Primitive wrappers
- Integer: return stored int
- Boolean: return 1231 or 1237
- Double: convert to IEEE bits; xor high & low 32-bit halves (beware +0.0=−0.0)
- Strings (library implementation)
- Treat as base-31 polynomial
- h=s[0]31L−1+s[1]31L−2+⋯+s[L−1]
- Horner’s rule yields L multiplies/additions
- General recipe for user-defined types
- For each significant field combine via hash=31×hash+fieldHash
- Recursively apply to arrays (Arrays.deepHashCode)
- Use 0 for null fields
- Polynomial Accumulation (abstract view)
- Given components x<em>0,…,x</em>n−1 and constant a=1
- h=x<em>0an−1+x</em>1an−2+⋯+x<em>n−2a+x</em>n−1
Desirable Compression Functions
- Aim: Pr[h<em>2(h</em>1(k<em>1))=h</em>2(h<em>1(k</em>2))]=1/M for distinct keys
- Cheap: mod a prime or power-of-two size; choose M well (often prime, or power of 2 when using bit-mask)
Good Hash Function Checklist
- Low collision probability
- O(1) time to compute
- Uniform distribution independent of data skew
- Uses all key information
Collision Resolution – Open Addressing
Linear Probing Algorithm
- Insert
- i=h(key)
- While keys[i] occupied & ≠ key: i=(i+1)modM (wrap)
- Assign at first empty/AVAIL slot
- Search
- Same probe sequence until null encountered or key found
- Delete
- Cannot simply null-out; must
- Mark slot AVAIL or
- Remove then re-hash subsequent cluster elements (Java code uses rehash)
- Implementation snippets (Java)
- hash: ((key.hashCode()&0x7fffffff)%M)
- put / get loops shown in slides 88–89
- Table must satisfy M > N (cannot fill completely)
Clustering Phenomena
- Primary clustering: successive keys form long contiguous runs
- As load factor L=N/M approaches 0.5 clusters explode
- Alternative probes
- Quadratic probing: h(x,i)=(h′(x)+c<em>1i+c</em>2i2)modM
- Double hashing: use second hash d(x), probe h′(x)+id(x)
- Search hit / successful search
costhit=21(1+1−L1) - Search miss / insertion
costmiss=21(1+(1−L)21) - Example: L=0.5⇒hit=1.5 probes, miss=2.0 probes
Cluster-specific analysis
- For a single cluster of size n/2 (with M=n)
- Avg. probes for hit ≈n/4
- General miss cost with lists of clusters t<em>i size
1+∑</em>i=1ℓ2Mt<em>i(t</em>i+1)
Collision Resolution – Separate Chaining
- Table of M buckets; each bucket stores a linked list (or dynamic array)
- Operations
- Insert at front of list if key absent
- Search only in bucket list h(key)
- Delete: remove from its list (easy)
- Expected list length L=N/M
- Under uniform hashing, distribution highly concentrated around L
- Average probe counts (pointer dereferences)
- Search hit: 1+2L (check table cell + half list)
- Search miss & insert: L
- Resizing strategy
- Maintain L bounded: typically double M when L≥10; halve when L≤2; rehash all keys
- Implementation (Java) uses Node inner class array st[]
Open Addressing vs Separate Chaining – Comparative Table
| Load factor L | Linear Probing probes | Chaining probes |
|---|
| 0.25 | 1.17 | 1.13 |
| 0.50 | 1.50 | 1.25 |
| 0.75 | 2.50 | 1.38 |
| 0.85 | 3.83 | 1.43 |
| 0.90 | 5.50 | 1.45 |
| 0.95 | 10.50 | 1.48 |
- Trade-offs
- Linear probing: better cache locality; sensitive to L; deletion harder (tombstones/rehash)
- Chaining: graceful degradation; supports N unrestricted; extra pointer memory; cache unfriendly
Practical Examples & Scenarios
- SSN keys: use last 3 digits for quick demo, but better employ full 9-digit polynomial hash
- Phone numbers, student IDs, URLs, LocalTime objects – each needs proper hashCode()
- Parking tag lookup – direct addressing vs hashing
Implementation Highlights (Java)
- LinearProbingHashST
- Array keys[], vals[]; resize() re-inserts all keys because positions depend on M
- SeparateChainingHashST
- Array of Node chains; generic types via Object casting due to Java arrays
- Load factor L=N/M is dominant parameter
- Lower L ⇒ fewer collisions ⇒ closer to O(1) ideal
- Space–time trade-off: memory rises as 1/L
Overall Takeaways
- Hash tables offer expected constant-time get / put independent of table size
- Requires good hash function; poor hashing ruins guarantees
- Ordered symbol-table operations (min, max, range search, inorder traversal) are impossible without extra structure
- Choice between open addressing and chaining depends on
- Memory footprint
- Expected load factor
- Cache behaviour
- Simplicity vs deletion complexity
- Designing robust hash codes is non-trivial; rely on language libraries when possible or follow 31x+y recipe using entire key.