Hash Tables Lecture Notes
Importance of Efficiency
- Premature optimization is the root of all evil.
- Focus on critical parts (3%) of the code.
Symbol Table Implementations Summary
- Comparison of different symbol table implementations based on their guarantees and average case performance.
- Different implementations use different key interfaces such as
equals()andcompareTo(). - The table provides a comparison of
search,insert, anddeleteoperations for each. - Different data structures have different performance characteristics for ordered operations.
| Implementation | Search Guarantee | Insert Guarantee | Delete Guarantee | Search Hit (Avg) | Insert (Avg) | Delete (Avg) | Key Interface |
|---|---|---|---|---|---|---|---|
| Sequential Search (Unordered List) | equals() | ||||||
| Binary Search (Ordered Array) | compareTo() | ||||||
| BST | compareTo() | ||||||
| Red-Black BST | compareTo() | ||||||
| Separate Chaining | equals(), hashCode() |
Simple Hashing Idea
- Use an array to implement a symbol table when keys are small integers.
- Interpret the key as an array index.
- Issue: Table size becomes too large for 4-byte integers ().
- Another Issue: Two keys cannot hash to the same location, which leads to collision.
Hashing: Basic Plan
- Save items in a key-indexed table, where the index is a function of the key.
- Hash function: Method for computing array index from key.
- Equality test: Method for checking whether two keys are equal.
- Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index.
- Classic space-time tradeoff.
- No space limitation: Trivial hash function with key as index.
- No time limitation: Trivial collision resolution with sequential search.
- Space and time limitations: Hashing (the real world).
Computing the Hash Function
- Idealistic goal: Scramble the keys uniformly to produce a table index.
- Deterministic: Equal keys must produce the same value.
- Efficiently computable.
- Each table index equally likely for each key.
- Practical challenge: Need different approach for each key type.
- Thoroughly researched problem, still problematic in practical applications.
Java's Hash Code Conventions
- All Java classes inherit a method
hashCode(), which returns a 32-bit int. - Requirement: If
x.equals(y), then(x.hashCode() == y.hashCode()). - Highly desirable: If
!x.equals(y), then(x.hashCode() != y.hashCode()). - Default implementation: Memory address of
x. - Legal (but poor) implementation: Always return 17.
- Customized implementations: Integer, Double, String, URL, LocalTime, …
- User-defined types: Users are on their own.
Implementing Hash Code: Integers, Booleans, and Doubles
- Java library implementations for
Integer,Boolean, andDouble. - Integer:
java public int hashCode() { return value; } - Boolean:
java public int hashCode() { if (value) return 1231; else return 1237; } - Double:
java public int hashCode() { long bits = doubleToLongBits(value); return (int) (bits ^ (bits >>> 32)); }- Convert to IEEE 64-bit representation; XOR most significant 32-bits with least significant 32-bits
- Warning:
-0.0and+0.0have different hash codes
Implementing Hash Code: Strings
- Treat string of length L as L-digit, base-31 number:
- Horner's method: only L multiplies/adds to hash string of length L.
- Example:
java String s = "call"; s.hashCode(); // 3045982 = 99*31^3 + 97*31^2 + 108*31^1 + 108*31^0 - Java library implementation:
java public int hashCode() { int hash = 0; for (int i = 0; i < length(); i++) hash = s[i] + (31 * hash); return hash; }
Implementing Hash Code: Strings - Performance Optimization
- Cache the hash value in an instance variable.
- Return cached value.
- Example:
java public final class String { private int hash = 0; private final char[] s; ... public int hashCode() { int h = hash; if (h != 0) return h; for (int i = 0; i < length(); i++) h = s[i] + (31 * h); hash = h; return h; } }
Implementing Hash Code: User-Defined Types
Typically a small prime nonzero constant.
For primitive types, use
hashCode()of wrapper type.For reference types, use
hashCode().Example:
public class Transaction { private final String who; private final Date when; private final double amount;
}public int hashCode() { int hash = 17; hash = 31*hash + who.hashCode(); hash = 31*hash + when.hashCode(); hash = 31*hash + ((Double) amount).hashCode(); return hash; }
Hash Code Design
- "Standard" recipe for user-defined types.
- Combine each significant field using the 31x + y rule.
- If field is a primitive type, use wrapper type
hashCode(). - If field is
null, use 0. - If field is a reference type, use
hashCode(). - If field is an array, apply to each entry.
- In practice: Recipe above works reasonably well; used in Java libraries.
- In theory: Keys are bitstring; "universal" family of hash functions exist.
- Basic rule: Need to use the whole key to compute hash code; consult an expert for state-of-the-art hash codes.
Modular Hashing
- Hash code: An int between and .
- Hash function: An int between 0 and (for use as array index).
- Modular hashing:
mis typically a prime or power of 2;mis the hash table size.
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % m;
}
Uniform Hashing Assumption
- Each key is equally likely to hash to an integer between 0 and .
- Bins and balls: Throw balls uniformly at random into
mbins. - Birthday problem: Expect two balls in the same bin after ~ tosses.
- Coupon collector: Expect every bin has ball after ~ tosses.
- Load balancing: After
mtosses, expect most loaded bin has ~ balls.
Collisions
- Collision: Two distinct keys hashing to same index.
- Birthday problem ⇒ can’t avoid collisions.
- Coupon collector ⇒ not too much wasted space.
- Load balancing ⇒ no index gets too many collisions.
- Load factor ⇒ a ratio of number of keys : table size (keep the load factor a reasonable size).
Separate-Chaining Symbol Table
- Use an array of
mlinked lists. [H. P. Luhn, IBM 1953]- Hash: map key to integer
ibetween 0 andm - 1 - Insert: put at front of
ith chain (if not already in chain) - Search: sequential search in
ith chain - Example:
- put(L, 11) then hash(L) = 3, separate-chaining hash table (m = 4)
- Hash: map key to integer
public class SeparateChainingHashST<Key, Value> {
private int m = 97; // number of chains
private Node[] st = new Node[m]; // array of chains
private static class Node {
private Object key;
private Object val;
private Node next;
...
}
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % m;
}
public Value get(Key key) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next)
if (key.equals(x.key)) return (Value) x.val;
return null;
}
* No generic array creation (declare key and value of type Object)
* Array doubling and halving code omitted
- Inserting Item: Java implementation
public void put(Key key, Value val) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next)
if (key.equals(x.key)) { x.val = val; return; }
st[i] = new Node(key, val, st[i]);
}
- Proposition: Under uniform hashing assumption, probability that the number of keys in a list is within a constant factor of is extremely close to 1.
- Pf sketch. Distribution of list size obeys a binomial distribution.
- Consequence. Number of probes for search/insert is proportional to .
- too large ⇒ too many empty chains.
- too small ⇒ chains too long.
- Typical choice: ⇒ constant-time ops.
- times faster than sequential search
- equals() and hashCode()
- Resizing in a separate-chaining hash table.
- Goal. Average length of list .
- Double length of array when ; halve length of array when .
- Note: need to rehash all keys when resizing.
- Deletion in a separate-chaining hash table.
- How to delete a key (and its associated value)?
- Easy: need to consider only chain containing key.
- How to delete a key (and its associated value)?
Linear Probing
- Open addressing. [Amdahl–Boehme–Rocherster–Samuel, IBM 1953]
- Maintain keys and values in two parallel arrays.
- When a new key collides, find next empty slot, and put it there.
- Hash: Map key to integer between and
- Insert: Put at table index if free; if not try , etc.
- Search: Search table index ; if occupied but no match,try , , etc.
- Note: Array length must be greater than number of key-value pairs .
- Java Implementation
public class LinearProbingHashST<Key, Value> {
private int m = 30001;
private Value[] vals = (Value[]) new Object[m];
private Key[] keys = (Key[]) new Object[m];
private int hash(Key key) { /* as before */ }
private void put(Key key, Value val) { /* next slide */ }
}
public Value get(Key key) {
for (int i = hash(key); keys[i] != null; i = (i+1) % m)
if (key.equals(keys[i])) return vals[i];
return null;
}
public void put(Key key, Value val) {
int i;
for (i = hash(key); keys[i] != null; i = (i+1) % m)
if (keys[i].equals(key)) break;
keys[i] = key;
vals[i] = val;
}
* Array doubling and halving code omitted
- Clustering. A contiguous block of items
- Observation. New keys likely to hash into the middle of big clusters
Knuth's parking problem - Model. Cars arrive at one-way street with m parking spaces. Each desires a random space : if space is taken, try , , etc
- Q. What is mean displacement of a car?
- Half-full. With cars, mean displacement is
- Full. With cars, mean displacement is
- Key insight. Cannot afford to let linear-probing hash table get too full.
- Observation. New keys likely to hash into the middle of big clusters
- Analysis of linear probing
- Proposition. Under uniform hashing assumption, the average # of probes in a linear probing hash table of size that contains keys is: # probes for search hit is about and # probes for search miss is about
- Pf. Parameters. : too large ⇒ too many empty array entries. too small ⇒ search time blows up. Typical choice:
- Resizing in a linear-probing hash table Goal. Average length of list n/m < \frac{1}{2}
- Double length of array when . Halve length of array when
- Need to rehash all keys when resizing.
Deletion in a linear-probing hash table. Requires some care: can't just delete array entries
Algorithmic Complexity Attacks
- Is the uniform hashing assumption important in practice?
- A. Obvious situations: aircraft control, nuclear reactor, pacemaker, HFT, …
- A. Surprising situations: denial-of-service attacks.
- Real-world exploits. [Crosby–Wallach 2003] Linux 2.4.20 kernel: save files with carefully chosen names. Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dial-up modem. malicious adversary learns your hash function (e.g., by reading Java API) and causes a big pile-up in single slot that grinds performance to a halt.
- Diversion: one-way hash functions
- One-way hash function. “Hard” to find a key that will hash to a desired value (or two keys that hash to the same value).
- Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, SHA-256, WHIRLPOOL, ….
- Applications. Crypto, message digests, passwords, Bitcoin, ….
- Caveat. Too expensive for use in ST implementations.
- One-way hash function. “Hard” to find a key that will hash to a desired value (or two keys that hash to the same value).
Summary: Separate Chaining vs. Linear Probing
- Separate chaining.
- Performance degrades gracefully.
- Clustering less sensitive to poorly-designed hash function.
- Linear probing.
- Less wasted space.
- Better cache performance.
Hashing: Variations on the Theme
- Many improved versions have been studied.
- Two-probe hashing. [ separate-chaining variant ]
- Hash to two positions, insert key in shorter of the two chains.
- Reduces expected length of the longest chain to ~ .
- Double hashing. [ linear-probing variant ]
- Use linear probing, but skip a variable amount, not just 1 each time.
- Effectively eliminates clustering.
- Can allow table to become nearly full.
- More difficult to implement delete.
- Cuckoo hashing. [ linear-probing variant ]
- Hash key to two positions; insert key into either position; if occupied, reinsert displaced key into its alternative position (and recur).
- Constant worst-case time for search.
Hash Tables vs. Balanced Search Trees
- Hash tables.
- Simpler to code.
- No effective alternative for unordered keys.
- Faster for simple keys (a few arithmetic ops versus log n compares).
- Better system support in Java for String (e.g., cached hash code).
- Balanced search trees.
- Stronger performance guarantee.
- Support for ordered ST operations.
- Easier to implement
compareTo()thanhashCode().
- Java system includes both.
- Balanced search trees:
java.util.TreeMap,java.util.TreeSet. - Hash tables:
java.util.HashMap,java.util.IdentityHashMap.
- Balanced search trees: