Hashing & Hash Table Study Notes

Symbol-Table Performance Landscape

  • Reference case table comparing several data structures
    • Sequential (unordered) list
    • search=insert=delete=O(n)\text{search} = \text{insert} = \text{delete} = O(n)
    • Avg. probes: search-hit n2\approx \frac{n}{2}, insert n\approx n, delete n2\approx \frac{n}{2}
    • No ordered operations
    • Ordered array with binary search
    • search=O(logn)\text{search} = O(\log n); insert & delete O(n)O(n) due to shifting
    • Ordered ops supported
    • Binary Search Tree (unbalanced)
    • Worst O(n)O(n), average 1.39logn\approx 1.39 \log n for search, insert, delete
    • Red-Black BST
    • Guaranteed O(logn)O(\log n) for all three ops, ordered ops supported
    • Hashing (subject of this lecture)
    • Worst O(n)O(n), but expected 3!!53!\text{–}!5 probes for search/insert/delete under uniform hashing
    • Ordered ops NOT supported

Learning Objectives

  • Implement symbol-table methods with hash tables
  • Design practical hash codes for multiple key types
  • Apply/open-addressing strategies: linear probing, quadratic probing, double hashing
  • Apply separate chaining and evaluate each approach’s performance

Direct Addressing & Key-Indexed Tables

  • Scenario: parking tags numbered 1,2,3,1,2,3,\dots → array index = key ⇒ O(1)O(1) time
  • General case: keys are integers 0M10\ldots M-1, N<M
    • Allocate array of length MM & store value at index=key
    • Called Direct Addressing / key-indexed table
  • Limitations
    • Keys may not be small non-negative ints
    • MNM \gg N wastes memory

Hashing: Basic Plan

  • Store in array by applying a hash function h(key)[0,M1]h(\text{key}) → [0,M-1]
  • Key issues
    • Computing hh efficiently & deterministically
    • Equality test to break ties
    • Collision resolution (open addressing vs chaining)
  • Classic space–time trade-off
    • Infinite memory ⇒ identity function possible
    • Infinite time ⇒ sequential search resolves collisions
    • Real world ⇒ choose balanced hash function + collision strategy

Two-Part Hash Function Paradigm

  • Hash Code h1h_1 (key → 32-/64-bit integer)
  • Compression Function h2h_2 (integer → [0,M1][0,M-1])
    • Standard: h2(x)=xmodMh_2(x)=x \bmod M using only positive bits: (x  &amp;  0x7fffffff)%M(x\;\&amp;\;0x7fffffff) \% M in Java
  • Advantage: h1h_1 independent of table size so re-hash only requires recompressing

Designing Hash Codes

  • Goals: exploit entire key, deterministic, fast, spread keys uniformly
  • Java conventions
    • Every object inherits int hashCode()
    • Contract: x.equals(y)x.hashCode()=y.hashCode()x.equals(y) ⇒ x.hashCode() = y.hashCode()
    • Desirable: unequal ⇒ different codes (not guaranteed)
  • Primitive wrappers
    • Integer: return stored int
    • Boolean: return 1231 or 1237
    • Double: convert to IEEE bits; xor high & low 32-bit halves (beware +0.00.0+0.0 \neq -0.0)
  • Strings (library implementation)
    • Treat as base-31 polynomial
    • h=s[0]31L1+s[1]31L2++s[L1]h = s[0]\,31^{L-1} + s[1]31^{L-2}+⋯+s[L-1]
    • Horner’s rule yields LL multiplies/additions
  • General recipe for user-defined types
    • For each significant field combine via hash=31×hash+fieldHash\text{hash}=31\times \text{hash}+\text{fieldHash}
    • Recursively apply to arrays (Arrays.deepHashCode)
    • Use 0 for null fields
  • Polynomial Accumulation (abstract view)
    • Given components x<em>0,,x</em>n1x<em>0,\dots,x</em>{n-1} and constant a1a\neq1
    • h=x<em>0an1+x</em>1an2++x<em>n2a+x</em>n1h=x<em>0 a^{n-1}+x</em>1 a^{n-2}+⋯+x<em>{n-2} a + x</em>{n-1}

Desirable Compression Functions

  • Aim: Pr[h<em>2(h</em>1(k<em>1))=h</em>2(h<em>1(k</em>2))]=1/M\Pr[ h<em>2(h</em>1(k<em>1)) = h</em>2(h<em>1(k</em>2)) ] = 1/M for distinct keys
  • Cheap: mod a prime or power-of-two size; choose MM well (often prime, or power of 2 when using bit-mask)

Good Hash Function Checklist

  • Low collision probability
  • O(1) time to compute
  • Uniform distribution independent of data skew
  • Uses all key information

Collision Resolution – Open Addressing

Linear Probing Algorithm

  • Insert
    • i=h(key)i = h(key)
    • While keys[i] occupied & ≠ key: i=(i+1)modMi = (i+1) \bmod M (wrap)
    • Assign at first empty/AVAIL slot
  • Search
    • Same probe sequence until null encountered or key found
  • Delete
    • Cannot simply null-out; must
    1. Mark slot AVAIL or
    2. Remove then re-hash subsequent cluster elements (Java code uses rehash)
  • Implementation snippets (Java)
    • hash: ((key.hashCode()&amp;0x7fffffff)%M)((key.hashCode() \&amp; 0x7fffffff) \% M)
    • put / get loops shown in slides 88–89
  • Table must satisfy M > N (cannot fill completely)

Clustering Phenomena

  • Primary clustering: successive keys form long contiguous runs
  • As load factor L=N/ML=N/M approaches 0.50.5 clusters explode
  • Alternative probes
    • Quadratic probing: h(x,i)=(h(x)+c<em>1i+c</em>2i2)modMh(x,i) = (h'(x)+c<em>1 i + c</em>2 i^2) \bmod M
    • Double hashing: use second hash d(x)d(x), probe h(x)+id(x)h'(x)+ i\,d(x)

Cost Formulas under Uniform Hashing (Linear Probing)

  • Search hit / successful search
    costhit=12(1+11L)\text{cost}_{hit}=\frac{1}{2}\Bigl(1+\frac{1}{1-L}\Bigr)
  • Search miss / insertion
    costmiss=12(1+1(1L)2)\text{cost}_{miss}=\frac{1}{2}\Bigl(1+\frac{1}{(1-L)^2}\Bigr)
  • Example: L=0.5hit=1.5L=0.5 ⇒ \text{hit}=1.5 probes, miss=2.0 probes

Cluster-specific analysis

  • For a single cluster of size n/2n/2 (with M=nM=n)
    • Avg. probes for hit n/4≈ n/4
  • General miss cost with lists of clusters t<em>it<em>i size 1+</em>i=1t<em>i(t</em>i+1)2M1 + \sum</em>{i=1}^\ell \frac{t<em>i(t</em>i+1)}{2M}

Collision Resolution – Separate Chaining

  • Table of MM buckets; each bucket stores a linked list (or dynamic array)
  • Operations
    • Insert at front of list if key absent
    • Search only in bucket list h(key)h(key)
    • Delete: remove from its list (easy)
  • Expected list length L=N/ML=N/M
    • Under uniform hashing, distribution highly concentrated around LL
  • Average probe counts (pointer dereferences)
    • Search hit: 1+L21+\frac{L}{2} (check table cell + half list)
    • Search miss & insert: LL
  • Resizing strategy
    • Maintain LL bounded: typically double MM when L10L≥10; halve when L2L≤2; rehash all keys
  • Implementation (Java) uses Node inner class array st[]

Open Addressing vs Separate Chaining – Comparative Table

Load factor LLLinear Probing probesChaining probes
0.251.171.171.131.13
0.501.501.501.251.25
0.752.502.501.381.38
0.853.833.831.431.43
0.905.505.501.451.45
0.9510.5010.501.481.48
  • Trade-offs
    • Linear probing: better cache locality; sensitive to LL; deletion harder (tombstones/rehash)
    • Chaining: graceful degradation; supports NN unrestricted; extra pointer memory; cache unfriendly

Practical Examples & Scenarios

  • SSN keys: use last 3 digits for quick demo, but better employ full 9-digit polynomial hash
  • Phone numbers, student IDs, URLs, LocalTime objects – each needs proper hashCode()
  • Parking tag lookup – direct addressing vs hashing

Implementation Highlights (Java)

  • LinearProbingHashST
    • Array keys[], vals[]; resize() re-inserts all keys because positions depend on MM
  • SeparateChainingHashST
    • Array of Node chains; generic types via Object casting due to Java arrays

Load Factor & Performance Summary

  • Load factor L=N/ML=N/M is dominant parameter
  • Lower LL ⇒ fewer collisions ⇒ closer to O(1)O(1) ideal
  • Space–time trade-off: memory rises as 1/L1/L

Overall Takeaways

  • Hash tables offer expected constant-time get\text{get} / put\text{put} independent of table size
  • Requires good hash function; poor hashing ruins guarantees
  • Ordered symbol-table operations (min, max, range search, inorder traversal) are impossible without extra structure
  • Choice between open addressing and chaining depends on
    • Memory footprint
    • Expected load factor
    • Cache behaviour
    • Simplicity vs deletion complexity
  • Designing robust hash codes is non-trivial; rely on language libraries when possible or follow 31x+y recipe using entire key.