Hashing & Hash Table Study Notes

Implement symbol-table methods with hash tables
Design practical hash codes for multiple key types
Apply open-addressing strategies: linear probing, quadratic probing, double hashing
Apply separate chaining and evaluate each approach’s performance

Scenario: parking tags numbered $1,2,3,\dots$ → array index = key ⇒ $O(1)$ time
General case: keys are integers $0\ldots M-1$, with $N<M$ keys in use
- Allocate array of length $M$ & store value at index=key
- Called Direct Addressing / key-indexed table
Limitations
- Keys may not be small non-negative ints
- $M \gg N$ wastes memory
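A minimal sketch of the key-indexed table described above, assuming int keys in $[0,M-1]$ and String values. The class name DirectAddressTable and its methods are illustrative, not taken from the course code.

```java
// Key-indexed (direct-address) table: the key itself is the array index.
// Assumes every key is an int in [0, M-1]; names here are illustrative only.
class DirectAddressTable {
    private final String[] vals;   // vals[k] holds the value for key k, or null

    DirectAddressTable(int m) {
        vals = new String[m];      // memory is proportional to M, not to N
    }

    void put(int key, String val) { vals[key] = val; }

    String get(int key) { return vals[key]; }   // null means "absent"

    void delete(int key) { vals[key] = null; }
}
```

Every operation is a single array access, which is where the $O(1)$ bound comes from. The array has $M$ cells regardless of how many keys are stored.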

Store in array by applying a hash function $h(\text{key}) → [0,M-1]$
Key issues
- Computing $h$ efficiently & deterministically
- Equality test to tell apart distinct keys that hash to the same index
- Collision resolution (open addressing vs chaining)
Classic space–time trade-off
- Infinite memory ⇒ identity function possible
- Infinite time ⇒ sequential search resolves collisions
- Real world ⇒ choose balanced hash function + collision strategy

Hash Code $h_1$ (key → 32-/64-bit integer)
Compression Function $h_2$ (integer → $[0,M-1]$ )
- Standard: $h_2(x)=x \bmod M$; in Java, clear the sign bit first so the index is non-negative: $(x\;\&\;0x7fffffff) \% M$
Advantage: $h_1$ independent of table size so re-hash only requires recompressing
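The two-stage split can be sketched as follows. The class name Compress and the method names are assumptions made for illustration; only the masking expression comes from the notes.

```java
class Compress {
    // h2: map any 32-bit hash code into [0, m-1].
    // Masking with 0x7fffffff clears the sign bit, so % never returns a negative index.
    static int compress(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }

    // h = h2(h1(key)), where h1 is the key's own hashCode().
    static int hash(Object key, int m) {
        return compress(key.hashCode(), m);
    }
}
```

Because hashCode() never sees $M$, resizing the table only changes the second argument of compress. Masking is used rather than Math.abs because Math.abs(Integer.MIN_VALUE) is still negative.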

Goals: exploit entire key, deterministic, fast, spread keys uniformly
Java conventions
- Every object inherits int hashCode()
- Contract: $x.equals(y) ⇒ x.hashCode() = y.hashCode()$
- Desirable: unequal ⇒ different codes (not guaranteed)
Primitive wrappers
- Integer: return stored int
- Boolean: return 1231 or 1237
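- Double: convert to IEEE 754 bits; xor high & low 32-bit halves (beware: $+0.0$ and $-0.0$ have different bits, so they hash differently and Double.equals treats them as unequal, although the primitives compare equal with ==)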
Strings (library implementation)
- Treat as base-31 polynomial
- $h = s[0]\,31^{L-1} + s[1]31^{L-2}+⋯+s[L-1]$
- Horner’s rule yields $L$ multiplies/additions
General recipe for user-defined types
- For each significant field combine via $\text{hash}=31\times \text{hash}+\text{fieldHash}$
- Recursively apply to arrays (Arrays.deepHashCode)
- Use 0 for null fields
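As an illustration of the recipe, here is a sketch for a hypothetical Transaction type. The type, its fields, and the seed value 17 are assumptions chosen for the example. equals is overridden alongside hashCode so that the contract above holds.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical value type, used only to illustrate the 31 * hash + fieldHash recipe.
final class Transaction {
    private final String who;      // may be null
    private final int day;
    private final double amount;
    private final int[] tags;      // may be null

    Transaction(String who, int day, double amount, int[] tags) {
        this.who = who;
        this.day = day;
        this.amount = amount;
        this.tags = tags;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Transaction)) return false;
        Transaction t = (Transaction) o;
        return Objects.equals(who, t.who)
            && day == t.day
            && Double.compare(amount, t.amount) == 0
            && Arrays.equals(tags, t.tags);
    }

    @Override
    public int hashCode() {
        int hash = 17;                                          // arbitrary nonzero seed
        hash = 31 * hash + (who == null ? 0 : who.hashCode());  // 0 for a null field
        hash = 31 * hash + Integer.hashCode(day);
        hash = 31 * hash + Double.hashCode(amount);
        hash = 31 * hash + Arrays.hashCode(tags);               // content-based; 0 for null
        return hash;
    }
}
```

Every field compared in equals also feeds hashCode, so equal objects always produce equal codes.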
Polynomial Accumulation (abstract view)
- Given components $x_0,\dots,x_{n-1}$ and constant $a\neq1$
- $h=x_0 a^{n-1}+x_1 a^{n-2}+⋯+x_{n-2} a + x_{n-1}$
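The polynomial can be evaluated left to right with Horner's rule, one multiply and one add per component. The sketch below applies it to the characters of a string. The names PolyHash and polyHash are illustrative.

```java
class PolyHash {
    // Evaluates s[0]*a^(L-1) + s[1]*a^(L-2) + ... + s[L-1] by Horner's rule.
    // Overflow wraps around in 32-bit int arithmetic, as it does in String.hashCode().
    static int polyHash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);   // shift previous terms up one power, add next char
        }
        return h;
    }
}
```

With $a=31$ this agrees with the documented definition of String.hashCode(); for example, polyHash("ab", 31) is $97 \cdot 31 + 98 = 3105$.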

Aim: $\Pr[ h_2(h_1(k_1)) = h_2(h_1(k_2)) ] = 1/M$ for distinct keys $k_1 \neq k_2$
Cheap: mod a prime or power-of-two size; choose $M$ well (often prime, or power of 2 when using bit-mask)

Insert
- $i = h(key)$
- While keys[i] occupied & ≠ key: $i = (i+1) \bmod M$ (wrap)
- Assign at first empty/AVAIL slot
Search
- Same probe sequence until null encountered or key found
Delete
- Cannot simply null-out; must
1. Mark slot AVAIL or
2. Remove then re-hash subsequent cluster elements (Java code uses rehash)
Implementation snippets (Java)
- hash: $((key.hashCode() \& 0x7fffffff) \% M)$
- put / get loops shown in slides 88–89
Table must satisfy M > N (cannot fill completely)
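A compact sketch of the three operations, assuming String keys, Integer values, and a fixed capacity with no resizing. It follows the second deletion option (re-insert the rest of the cluster) and is not the slide code; the class name LinearProbingSketch is made up for this example.

```java
// Minimal linear-probing table. Fixed capacity M for brevity; put() refuses to fill
// the last empty slot so that N < M always holds and every probe loop terminates.
class LinearProbingSketch {
    private final int m;
    private int n;
    private final String[] keys;
    private final Integer[] vals;

    LinearProbingSketch(int m) {
        this.m = m;
        keys = new String[m];
        vals = new Integer[m];
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    int size() { return n; }

    void put(String key, Integer val) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) { vals[i] = val; return; }  // key present: overwrite
            i = (i + 1) % m;                                     // wrap around
        }
        if (n == m - 1) throw new IllegalStateException("table full");
        keys[i] = key;
        vals[i] = val;
        n++;
    }

    Integer get(String key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % m) {
            if (keys[i].equals(key)) return vals[i];
        }
        return null;                                             // hit an empty slot: miss
    }

    void delete(String key) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key)) i = (i + 1) % m;
        if (keys[i] == null) return;                             // key absent
        keys[i] = null;
        vals[i] = null;
        n--;
        // Re-insert the remainder of the cluster so the new gap cannot cut a search short.
        i = (i + 1) % m;
        while (keys[i] != null) {
            String k = keys[i];
            Integer v = vals[i];
            keys[i] = null;
            vals[i] = null;
            n--;
            put(k, v);
            i = (i + 1) % m;
        }
    }
}
```

With $M=7$, the keys "a", "h" and "o" all hash to slot 6, so deleting "a" exercises the re-insertion path.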

Primary clustering: successive keys form long contiguous runs
As load factor $L=N/M$ approaches $1$ clusters explode; keeping $L \le 0.5$ keeps average probe counts small
Alternative probes
- Quadratic probing: $h(x,i) = (h'(x)+c_1 i + c_2 i^2) \bmod M$
- Double hashing: use second hash $d(x) \neq 0$, probe $h(x,i) = (h'(x)+ i\,d(x)) \bmod M$
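The three probe sequences differ only in how the $i$-th offset is computed. A sketch, assuming small non-negative arguments so that int arithmetic does not overflow; the method and parameter names are illustrative.

```java
class ProbeSequences {
    // Each method returns the i-th probe position (i = 0, 1, 2, ...)
    // for a key whose home slot is h in a table of m slots.

    static int linear(int h, int i, int m) {
        return (h + i) % m;
    }

    static int quadratic(int h, int i, int c1, int c2, int m) {
        return (h + c1 * i + c2 * i * i) % m;
    }

    // d is the key's second hash value. It must be nonzero, and it must share
    // no common factor with m if the sequence is to reach every slot.
    static int doubleHash(int h, int d, int i, int m) {
        return (h + i * d) % m;
    }
}
```

Choosing a prime $M$ makes every step $d$ in $1 \ldots M-1$ coprime with $M$, so a double-hashing sequence visits all $M$ slots before repeating.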

Search hit / successful search
$\text{cost}_{hit}=\frac{1}{2}\Bigl(1+\frac{1}{1-L}\Bigr)$
Search miss / insertion
$\text{cost}_{miss}=\frac{1}{2}\Bigl(1+\frac{1}{(1-L)^2}\Bigr)$
Example: $L=0.5 ⇒ \text{hit}=1.5$ probes, $\text{miss}=2.5$ probes
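To see how quickly the costs grow, the two formulas can be evaluated directly. This is a small sketch; the class and method names are made up for this example.

```java
class ProbeCost {
    // Expected probes for a search hit under linear probing at load factor `load` (< 1).
    static double hit(double load) {
        return 0.5 * (1 + 1 / (1 - load));
    }

    // Expected probes for a search miss or an insertion at load factor `load` (< 1).
    static double miss(double load) {
        return 0.5 * (1 + 1 / ((1 - load) * (1 - load)));
    }
}
```

At $L=0.5$ these give 1.5 and 2.5 probes. At $L=0.9$ they give about 5.5 and 50.5, which is why linear-probing tables are resized well before they fill up.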

For a single cluster of size $n/2$ (with $M=n$ )
- Avg. probes for hit $≈ n/4$
General miss cost with clusters of sizes $t_1,\dots,t_\ell$: $1 + \sum_{i=1}^{\ell} \frac{t_i(t_i+1)}{2M}$
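The cluster formula can be checked on small cases. The sketch below assumes the probe start is uniform over the $M$ slots; the names ClusterCost and missCost are illustrative.

```java
class ClusterCost {
    // Expected number of probes for a search miss in a linear-probing table of m slots
    // whose occupied cells form clusters of the given sizes.
    static double missCost(int[] clusterSizes, int m) {
        double sum = 0;
        for (int t : clusterSizes) {
            sum += (double) t * (t + 1) / (2.0 * m);   // contribution of one cluster
        }
        return 1 + sum;
    }
}
```

For $M=8$ and $N=4$, a single cluster of size 4 costs $1 + 20/16 = 2.25$ probes, while two clusters of size 2 cost $1 + 2\cdot 6/16 = 1.75$. The load factor is the same in both cases; the single long cluster is what makes the miss more expensive.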

Table of $M$ buckets; each bucket stores a linked list (or dynamic array)
Operations
- Insert at front of list if key absent
- Search only in bucket list $h(key)$
- Delete: remove from its list (easy)
Expected list length $L=N/M$
- Under uniform hashing, distribution highly concentrated around $L$
Average probe counts (pointer dereferences)
- Search hit: $1+\frac{L}{2}$ (check table cell + half list)
- Search miss & insert: $L$ (traverse the whole list; $1+L$ if the table cell is counted as above)
Resizing strategy
- Maintain $L$ bounded: typically double $M$ when $L≥10$ ; halve when $L≤2$ ; rehash all keys
Implementation (Java): array st[] of linked Node chains, with Node as a private nested class
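A sketch of such a table, assuming String keys and Integer values. It doubles $M$ when $N/M$ reaches 10, as in the resizing rule above. Halving is omitted for brevity, and the class name SeparateChainingSketch is made up for this example.

```java
class SeparateChainingSketch {
    // One node of a bucket's singly linked list.
    private static class Node {
        final String key;
        Integer val;
        Node next;
        Node(String key, Integer val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    private int n;        // number of key-value pairs
    private int m;        // number of buckets
    private Node[] st;    // st[i] is the head of bucket i's list

    SeparateChainingSketch(int m) {
        this.m = m;
        st = new Node[m];
    }

    private int hash(String key) { return (key.hashCode() & 0x7fffffff) % m; }

    int size() { return n; }
    int buckets() { return m; }

    Integer get(String key) {
        for (Node x = st[hash(key)]; x != null; x = x.next)
            if (x.key.equals(key)) return x.val;
        return null;
    }

    void put(String key, Integer val) {
        if (n >= 10 * m) resize(2 * m);            // keep average list length bounded
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (x.key.equals(key)) { x.val = val; return; }
        st[i] = new Node(key, val, st[i]);         // key absent: insert at front
        n++;
    }

    void delete(String key) {
        int i = hash(key);
        Node prev = null;
        for (Node x = st[i]; x != null; prev = x, x = x.next) {
            if (x.key.equals(key)) {
                if (prev == null) st[i] = x.next; else prev.next = x.next;
                n--;
                return;
            }
        }
    }

    // Re-links every node into a new array, since bucket indices depend on m.
    private void resize(int newM) {
        Node[] old = st;
        st = new Node[newM];
        m = newM;
        for (Node head : old) {
            for (Node x = head; x != null; ) {
                Node next = x.next;
                int i = hash(x.key);
                x.next = st[i];
                st[i] = x;
                x = next;
            }
        }
    }
}
```

Deletion is a plain linked-list unlink, with no tombstones and no cluster re-hashing.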

Trade-offs
- Linear probing: better cache locality; sensitive to $L$ ; deletion harder (tombstones/rehash)
- Chaining: graceful degradation; $N$ may exceed $M$; extra pointer memory; cache unfriendly

SSN keys: use last 3 digits for quick demo, but better employ full 9-digit polynomial hash
Phone numbers, student IDs, URLs, LocalTime objects – each needs proper hashCode()
Parking tag lookup – direct addressing vs hashing

LinearProbingHashST
- Array keys[], vals[]; resize() re-inserts all keys because positions depend on $M$
SeparateChainingHashST
- Array of Node chains; generic types require an unchecked cast when creating the array, because Java does not allow generic array creation

Hash tables offer expected constant-time $\text{get}$ / $\text{put}$, independent of the number of keys, as long as the load factor stays bounded
Requires good hash function; poor hashing ruins guarantees
Ordered symbol-table operations (min, max, range search, in-order traversal) are not supported efficiently: keys sit in hash order, so these need a full scan or an extra ordered structure
Choice between open addressing and chaining depends on
- Memory footprint
- Expected load factor
- Cache behaviour
- Simplicity vs deletion complexity
Designing robust hash codes is non-trivial; rely on language libraries when possible, or follow the $31x+y$ recipe using the entire key.