Hash Tables Lecture Notes

Importance of Efficiency

Premature optimization is the root of all evil.
Focus on critical parts (3%) of the code.

Symbol Table Implementations Summary

Comparison of different symbol table implementations based on their guarantees and average case performance.
Different implementations use different key interfaces such as equals() and compareTo().
The table provides a comparison of search, insert, and delete operations for each.
Different data structures have different performance characteristics for ordered operations.

Implementation	Search Guarantee	Insert Guarantee	Delete Guarantee	Search Hit (Avg)	Insert (Avg)	Delete (Avg)	Key Interface
Sequential Search (Unordered List)	$n$	$n$	$n$	$n$	$n$	$n$	`equals()`
Binary Search (Ordered Array)	$log n$	$n$	$n$	$log n$	$n$	$n$	`compareTo()`
BST	$n$	$n$	$n$	$log n$	$log n$	$\sqrt{n}$	`compareTo()`
Red-Black BST	$log n$	$log n$	$log n$	$log n$	$log n$	$log n$	`compareTo()`
Separate Chaining	$n$	$n$	$n$	$1^*$	$1^*$	$1^*$	`equals()`, `hashCode()`

Simple Hashing Idea

Use an array to implement a symbol table when keys are small integers.
Interpret the key as an array index.
Issue: Table size becomes too large for 4-byte integers ( $2^{32}$ ).
Another Issue: Two keys cannot hash to the same location, which leads to collision.

Hashing: Basic Plan

Save items in a key-indexed table, where the index is a function of the key.
Hash function: Method for computing array index from key.
Equality test: Method for checking whether two keys are equal.
Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index.
Classic space-time tradeoff.
- No space limitation: Trivial hash function with key as index.
- No time limitation: Trivial collision resolution with sequential search.
- Space and time limitations: Hashing (the real world).

Computing the Hash Function

Idealistic goal: Scramble the keys uniformly to produce a table index.
- Deterministic: Equal keys must produce the same value.
- Efficiently computable.
- Each table index equally likely for each key.
Practical challenge: Need different approach for each key type.
Thoroughly researched problem, still problematic in practical applications.

Java's Hash Code Conventions

All Java classes inherit a method hashCode(), which returns a 32-bit int.
Requirement: If x.equals(y), then (x.hashCode() == y.hashCode()).
Highly desirable: If !x.equals(y), then (x.hashCode() != y.hashCode()).
Default implementation: Memory address of x.
Legal (but poor) implementation: Always return 17.
Customized implementations: Integer, Double, String, URL, LocalTime, …
User-defined types: Users are on their own.

Implementing Hash Code: Integers, Booleans, and Doubles

Java library implementations for Integer, Boolean, and Double.
Integer:
java public int hashCode() { return value; }
Boolean:
java public int hashCode() { if (value) return 1231; else return 1237; }
Double: java public int hashCode() { long bits = doubleToLongBits(value); return (int) (bits ^ (bits >>> 32)); }
- Convert to IEEE 64-bit representation; XOR most significant 32-bits with least significant 32-bits
- Warning: -0.0 and +0.0 have different hash codes

Implementing Hash Code: Strings

Treat string of length L as L-digit, base-31 number:
$h = s[0] \cdot 31^{L-1} + \dots + s[L-3] \cdot 31^2 + s[L-2] \cdot 31^1 + s[L-1] \cdot 31^0$
Horner's method: only L multiplies/adds to hash string of length L.
Example:
java String s = "call"; s.hashCode(); // 3045982 = 99*31^3 + 97*31^2 + 108*31^1 + 108*31^0
Java library implementation:
java public int hashCode() { int hash = 0; for (int i = 0; i < length(); i++) hash = s[i] + (31 * hash); return hash; }

Implementing Hash Code: Strings - Performance Optimization

Cache the hash value in an instance variable.
Return cached value.
Example:
java public final class String { private int hash = 0; private final char[] s; ... public int hashCode() { int h = hash; if (h != 0) return h; for (int i = 0; i < length(); i++) h = s[i] + (31 * h); hash = h; return h; } }

Implementing Hash Code: User-Defined Types

Typically a small prime nonzero constant.
For primitive types, use hashCode() of wrapper type.
For reference types, use hashCode().

Example:

public class Transaction {
    private final String who;
    private final Date when;
    private final double amount;public int hashCode() {
    int hash = 17;
    hash = 31*hash + who.hashCode();
    hash = 31*hash + when.hashCode();
    hash = 31*hash + ((Double) amount).hashCode();
    return hash;
}
}

Hash Code Design

"Standard" recipe for user-defined types.
- Combine each significant field using the 31x + y rule.
- If field is a primitive type, use wrapper type hashCode().
- If field is null, use 0.
- If field is a reference type, use hashCode().
- If field is an array, apply to each entry.
In practice: Recipe above works reasonably well; used in Java libraries.
In theory: Keys are bitstring; "universal" family of hash functions exist.
Basic rule: Need to use the whole key to compute hash code; consult an expert for state-of-the-art hash codes.

Modular Hashing

Hash code: An int between $-2^{31}$ and $2^{31} - 1$ .
Hash function: An int between 0 and $m - 1$ (for use as array index).
Modular hashing: m is typically a prime or power of 2; m is the hash table size.

private int hash(Key key) {
    return (key.hashCode() & 0x7fffffff) % m;
}

Uniform Hashing Assumption

Each key is equally likely to hash to an integer between 0 and $m - 1$ .
Bins and balls: Throw balls uniformly at random into m bins.
Birthday problem: Expect two balls in the same bin after ~ $\sqrt{\pi m / 2}$ tosses.
Coupon collector: Expect every bin has $\ge 1$ ball after ~ $m \ln m$ tosses.
Load balancing: After m tosses, expect most loaded bin has ~ $\ln m / \ln \ln m$ balls.

Collisions

Collision: Two distinct keys hashing to same index.
Birthday problem ⇒ can’t avoid collisions.
Coupon collector ⇒ not too much wasted space.
Load balancing ⇒ no index gets too many collisions.
Load factor ⇒ a ratio of number of keys : table size (keep the load factor a reasonable size).

Separate-Chaining Symbol Table

Use an array of m linked lists. [H. P. Luhn, IBM 1953]
- Hash: map key to integer i between 0 and m - 1
- Insert: put at front of ith chain (if not already in chain)
- Search: sequential search in ith chain
- Example:
  - put(L, 11) then hash(L) = 3, separate-chaining hash table (m = 4)

public class SeparateChainingHashST<Key, Value> {
 private int m = 97; // number of chains
 private Node[] st = new Node[m];  // array of chains
 private static class Node {
 private Object key;
 private Object val;
 private Node next;
 ...
 }
 private int hash(Key key) {
 return (key.hashCode() & 0x7fffffff) % m;
 }
 public Value get(Key key) {
 int i = hash(key);
 for (Node x = st[i]; x != null; x = x.next)
 if (key.equals(x.key)) return (Value) x.val;
 return null;
 }

* No generic array creation (declare key and value of type Object)
* Array doubling and halving code omitted

Inserting Item: Java implementation

public void put(Key key, Value val) {
 int i = hash(key);
 for (Node x = st[i]; x != null; x = x.next)
 if (key.equals(x.key)) { x.val = val; return; }
 st[i] = new Node(key, val, st[i]);
}

Proposition: Under uniform hashing assumption, probability that the number of keys in a list is within a constant factor of $n / m$ is extremely close to 1.
- Pf sketch. Distribution of list size obeys a binomial distribution.
- Consequence. Number of probes for search/insert is proportional to $n / m$ .
  - $m$ too large ⇒ too many empty chains.
  - $m$ too small ⇒ chains too long.
  - Typical choice: $m \sim \frac{1}{4} n$ ⇒ constant-time ops.
  - $m$ times faster than sequential search
  - equals() and hashCode()
Resizing in a separate-chaining hash table.
- Goal. Average length of list $n / m = constant$ .
- Double length of array $m$ when $n / m \ge 8$ ; halve length of array $m$ when $n / m \le 2$ .
  - Note: need to rehash all keys when resizing.
Deletion in a separate-chaining hash table.
- How to delete a key (and its associated value)?
  - Easy: need to consider only chain containing key.

Linear Probing

Open addressing. [Amdahl–Boehme–Rocherster–Samuel, IBM 1953]
Maintain keys and values in two parallel arrays.
When a new key collides, find next empty slot, and put it there.
Hash: Map key to integer $i$ between $0$ and $m-1$
Insert: Put at table index $i$ if free; if not try $i+1, i+2$ , etc.
Search: Search table index $i$ ; if occupied but no match,try $i+1$ , $i+2$ , etc.
Note: Array length $m$ must be greater than number of key-value pairs $n$ .
Java Implementation

public class LinearProbingHashST<Key, Value> {
 private int m = 30001;
 private Value[] vals = (Value[]) new Object[m];
 private Key[] keys = (Key[]) new Object[m];
 private int hash(Key key)  { /* as before */ }
 private void put(Key key, Value val) { /* next slide */ }
}

public Value get(Key key) {
 for (int i = hash(key); keys[i] != null; i = (i+1) % m)
 if (key.equals(keys[i])) return vals[i];
 return null;
}

public void put(Key key, Value val) {
 int i;
 for (i = hash(key); keys[i] != null; i = (i+1) % m)
 if (keys[i].equals(key)) break;
 keys[i] = key;
 vals[i] = val;
}

* Array doubling and halving code omitted

Clustering. A contiguous block of items
- Observation. New keys likely to hash into the middle of big clusters
  Knuth's parking problem
- Model. Cars arrive at one-way street with m parking spaces. Each desires a random space $i$ : if space $i$ is taken, try $i+1$ , $i+2$ , etc
- Q. What is mean displacement of a car?
  - Half-full. With $m/2$ cars, mean displacement is $\sim 5/2$
  - Full. With $m$ cars, mean displacement is $\sim \pi m / 8$
- Key insight. Cannot afford to let linear-probing hash table get too full.
Analysis of linear probing
- Proposition. Under uniform hashing assumption, the average # of probes in a linear probing hash table of size $m$ that contains $n = \alpha m$ keys is: # probes for search hit is about $3/2$ and # probes for search miss is about $5/2$
- Pf. Parameters. : $m$ too large ⇒ too many empty array entries. $m$ too small ⇒ search time blows up. Typical choice: $\alpha = n/m \sim \frac{1}{2}$
Resizing in a linear-probing hash table Goal. Average length of list n/m < \frac{1}{2}
- Double length of array $m$ when $n/m \ge \frac{1}{2}$ . Halve length of array $m$ when $n/m \le \frac{1}{8}$
- Need to rehash all keys when resizing.
  Deletion in a linear-probing hash table. Requires some care: can't just delete array entries

Algorithmic Complexity Attacks

Is the uniform hashing assumption important in practice?
- A. Obvious situations: aircraft control, nuclear reactor, pacemaker, HFT, …
- A. Surprising situations: denial-of-service attacks.
  - Real-world exploits. [Crosby–Wallach 2003] Linux 2.4.20 kernel: save files with carefully chosen names. Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dial-up modem. malicious adversary learns your hash function (e.g., by reading Java API) and causes a big pile-up in single slot that grinds performance to a halt.
Diversion: one-way hash functions
- One-way hash function. “Hard” to find a key that will hash to a desired value (or two keys that hash to the same value).
  - Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, SHA-256, WHIRLPOOL, ….
- Applications. Crypto, message digests, passwords, Bitcoin, ….
- Caveat. Too expensive for use in ST implementations.

Summary: Separate Chaining vs. Linear Probing

Separate chaining.
- Performance degrades gracefully.
- Clustering less sensitive to poorly-designed hash function.
Linear probing.
- Less wasted space.
- Better cache performance.

Hashing: Variations on the Theme

Many improved versions have been studied.
Two-probe hashing. [ separate-chaining variant ]
- Hash to two positions, insert key in shorter of the two chains.
- Reduces expected length of the longest chain to ~ $\lg \ln n$ .
Double hashing. [ linear-probing variant ]
- Use linear probing, but skip a variable amount, not just 1 each time.
- Effectively eliminates clustering.
- Can allow table to become nearly full.
- More difficult to implement delete.
Cuckoo hashing. [ linear-probing variant ]
- Hash key to two positions; insert key into either position; if occupied, reinsert displaced key into its alternative position (and recur).
- Constant worst-case time for search.

Hash Tables vs. Balanced Search Trees

Hash tables.
- Simpler to code.
- No effective alternative for unordered keys.
- Faster for simple keys (a few arithmetic ops versus log n compares).
- Better system support in Java for String (e.g., cached hash code).
Balanced search trees.
- Stronger performance guarantee.
- Support for ordered ST operations.
- Easier to implement compareTo() than hashCode().
Java system includes both.
- Balanced search trees: java.util.TreeMap, java.util.TreeSet.
- Hash tables: java.util.HashMap, java.util.IdentityHashMap.