hashing

Hashing Searching / Indexing Algorithm-3

The Search Problem

Find items with keys matching a given search key.
Given an array A containing n keys and a search key x, find the index i such that x = A[i].
A key could be part of a large record.

Applications of Searching

Keeping track of customer account information at a bank:
- Searching through records to check balances and perform transactions.
Flight reservations tracking:
- Searching to find empty seats or to cancel/modify reservations.
Search engines:
- Searching for all documents containing a given word.

Special Case: Dictionaries

Dictionary: A data structure supporting mainly two operations:
- Insert a new item.
- Return an item with a given key.
Queries to return information about a set S:
- Search(S, k)
- Minimum(S)
- Maximum(S)
- Successor(S, x)
- Predecessor(S, x)
Modifying operations:
- Insert(S, k)
- Delete(S, k) (not performed very often).

Direct Addressing

Assumptions:
- Key values are distinct.
- Each key is drawn from a universe U = {0, 1, …, m - 1}.
Idea:
- Store items in an array indexed by keys.
- Use a direct-address table representation:
  - Array T[0 … m - 1], corresponding to keys in U.
  - For element x with key k, store pointer to x in T[k].
  - If no elements exist with key k, T[k] is represented by NIL.

Direct Addressing (continued)

Example Representation:
- T 0 key satellite data, U 1, 2, 3 etc.
- Each slot corresponds to a key in the universe.
- Insertion and deletion operations in O(1) time.

Operations Algorithms

DIRECT-ADDRESS-SEARCH(T, k): return T[k]
DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x
DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL
Running time for these operations: O(1)

Comparing Different Implementations

Implementing Dictionaries Using:

Direct Addressing: O(1) search
Ordered Arrays & Lists: O(N) for insertion/search
Unordered Arrays & Lists: O(N) for insertion/search

Examples Using Direct Addressing

Example 1:
- Keys are integers from 1 to 100; create an array A of 100 items.
Example 2:
- Keys are nine-digit social security numbers; inefficient to create an array of one billion items to store 100 records.

Hash Tables

When K < U, hash tables save space compared to direct-address tables.
Hash tables aim to achieve average O(1) search time.

Hash Tables Explained

Idea: Use a function h to compute the slot for each key, storing the element in slot h(k).
Hash function:
- Transforms a key into an index in T[0…m-1] via mapping, reducing array index range.

Example of Hash Tables

U: Universe of keys, K: Actual keys with hash function results.

Revisit Example 2

If the keys are social security numbers, a possible hash function is: h(ssn) = SSS mod 100.

Collisions

Two or more keys hashing to the same slot.
If |K| > m, collisions definitely occur.
Even with a good hash function, avoiding collisions entirely is challenging.

Handling Collisions

Methods to handle collisions include:
- Chaining
- Open Addressing
- Linear Probing
- Quadratic Probing
- Double Hashing
Chaining discussed first.

Handling Collisions with Chaining

Idea:
- Place all elements that hash to the same slot in a linked list.
- Slot j contains a pointer to the start of the list of all elements hashing to j.

Insertion in Hash Tables

Algorithm:
- CHAINED-HASH-INSERT(T, x): Insert x at the head of list T[h(key[x])].
Worst-case running time: O(1), assuming element isn’t already in the list.

Deletion in Hash Tables

Algorithm:
- CHAINED-HASH-DELETE(T, x): Delete x from the list T[h(key[x])].
Worst-case running time: Depends on searching the corresponding list.

Searching in Hash Tables

Algorithm:
- CHAINED-HASH-SEARCH(T, k): Search for an element with key k in list T[h(k)].
Running time: Proportional to the size of the list in slot h(k).

Analysis of Hashing with Chaining: Worst Case

Worst-case scenario occurs when all n keys hash to the same slot.
Searching can then take O(n) time.

Analysis of Hashing with Chaining: Average Case

Average case depends on hash function performance distributing n keys across m slots.
Simple uniform hashing assumption: Each element equally likely to hash to any of m slots.

Load Factor of Hash Table

Load factor (α): a = n/m (n: number of elements, m: number of slots).
Encodes average chain length; a can be <, =, or > 1.

Case 1: Unsuccessful Search

Theorem: An unsuccessful search takes expected time α under simple uniform hashing.
Total time required is: O(1) (for the hash function) + α.

Case 2: Successful Search

Expected time can be analyzed as O(1 + α/2) based on average scenario.

Analysis of Search in Hash Tables

If m (number of slots) is proportional to n (number of elements):
- Searching takes average O(1) time.

Hash Functions

Good hash functions:
- Easy to compute.
- Approximate randomness (regardless of input).

Good Approaches for Hash Functions

Minimize likelihood of related keys hashing to the same slot.

The Division Method of Hashing

Idea: Map key k to m slots using k mod m.
Advantages: Fast computation; Disadvantages: Poor choice of m (like power of 2) can lead to poor distribution.

Example - The Division Method

If m = 2^p, h(k) becomes least significant p bits of k.
Choose m to be prime and not close to powers of 2.

The Multiplication Method

Multiply key k by a constant A (0 < A < 1) and extract fractional part of kA.
Advantages: m’s value isn’t critical, Disadvantage: Slower than division method.

Universal Hashing

Designing hash functions to produce random table indices, independent of the keys.

Definition of Universal Hash Functions

A universal set of hash functions H={h(k): U (0, 1, …, m - 1)}; H is universal if for x ≠ y, probability EiH: h(x)=h(y) ≤ 1/m.

Universal Hashing – Main Result

Chance of collision during hashing is no more than 1/m if locations are randomly chosen.

Designing a Universal Class of Hash Functions

Choose a prime number p for every key range and define hash function ha,b(k) as:
- ha,b(k) = ((ak + b) mod p) mod m.
Family of all such hash functions is universal.

Example: Universal Hash Functions

Using p = 17 and m = 6, calculate:
- h3,4(8) = ((3×8 + 4) mod 17) mod 6 = 5.

Advantages of Universal Hashing

Provides good average results, independent of stored keys.
Guarantees that worst-case behavior is rare.

Open Addressing

Storing keys directly in the table without using linked lists when there’s enough memory.
Insert if a slot is full, find the next available one.

Generalized Hash Function Notation

A hash function takes key value and probe number: h(k, p).

Common Open Addressing Methods

Methods: Linear probing, Quadratic probing, Double hashing.
None can generate more than m^2 different sequences.

Linear Probing: Inserting a Key

Idea: On collision, check the next available position: h(k,i) = (h1(k) + i) mod m.

Linear Probing: Searching for a Key

Three cases:
1. Found the key.
2. Slot empty, proceed.
3. Different element found, continue probing.

Linear Probing: Deleting a Key

Challenge of marking a slot as empty is addressed by using a sentinel value DELETED.

Primary Clustering Problem

Certain slots become favored, increasing search time due to long clusters of occupied slots.

Quadratic Probing

Formula: h(k,i) = (h1(k) + i^2) mod N; lessening clustering problem but still present.

Double Hashing

Utilizes first hash function for the initial probe and a second for subsequent increments in the probing process: h(k,i) = (h1(k) + i*h2(k)) mod m.

Analysis of Open Addressing

Ignore clustering to analyze probe success rates.

Analysis of Open Addressing (continued)

Unsuccessful retrieval results based on load factor a.
Variations in expected number of steps based on load factor (0.5 vs 0.9).