Design & Analysis of Algorithm: Hashing

Analysis of Algorithms

Hashing

The Search Problem

Objective: Find items with keys matching a given search key.
- Problem Definition: Given an array A, containing n keys, and a search key x, find the index i such that x=A[i].
- Note: A key could be part of a larger record.

Applications of Hashing

Banking: Keeping track of customer account information.
- Searching through records for checking balances and performing transactions.
Flight Reservations: Keeping track of seat availability.
- Searching to find empty seats and modify or cancel reservations.
Search Engines: Looking for all documents containing a specified word.

Special Case: Dictionaries

Definition: A dictionary is a data structure supporting mainly two fundamental operations: inserting a new item and retrieving an item with a given key.

Dictionary Operations

Queries for information about the set S:
- Search(S, k)
- Minimum(S)
- Maximum(S)
- Successor(S, x)
- Predecessor(S, x)
Modifying Operations (which change the set):
- Insert(S, k)
- Delete(S, k) (less frequent)

Direct Addressing

Assumptions

Key values are distinct.
Each key is drawn from a universe U = {0, 1, …, m - 1}.

Concept

Store items in an array indexed by keys.
Direct-address table representation: An array T[0 … m - 1].
- Each slot in T corresponds to a key in U.
- For an element x with key k, a pointer to x (or x itself) is placed in location T[k].
- If there are no elements with key k in the set, T[k] is represented by NIL.

Visualization of Direct Addressing

Example: For a universe U and actual keys K:
- Every slot T[j] in an array can either contain a record or NIL.
- Insert/Delete operations can be performed in O(1) time.

Operations Algorithms for Direct Addressing

Algorithms:
- DIRECT-ADDRESS-SEARCH(T, k): return T[k]
- DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x
- DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL
Running time for these operations: O(1)

Comparing Different Implementations

Implementing dictionaries using different structures:
- Direct addressing: Insert O(1), Search O(1)
- Ordered array: Insert O(N), Search O(lgN)
- Unordered array: Insert O(1), Search O(N)
- Ordered linked list: Insert O(N), Search O(N)
- Unordered linked list: Insert O(1), Search O(N)

Examples Using Direct Addressing

Example 1

Scenario: Keys are integers from 1 to 100 with about 100 records.
- Create an array of 100 items and store the record whose key equals i in A[i].

Example 2

Scenario: Keys are nine-digit social security numbers.
- Using the same strategy would require an array of 1 billion items, which is inefficient due to the large universe U compared to |K|.

Hash Tables

Definition

When K (the number of actual keys) is much smaller than U (the universe), a hash table requires significantly less space than a direct-address table.
- Storage can be reduced to |K| while aiming for O(1) average search time (note: this is not guaranteed in the worst case).

Idea of Hash Tables

Use a function h to compute the slot for each key.
Store the element in slot h(k).
Hash function: A function h transforms a key into an index in a hash table T[0…m-1], described as h: U → {0, 1, …, m - 1}.
- We say that k hashes to slot h(k).
Advantages:
- Reduce the range of array indices handled: m instead of |U|.
- Reduced required storage.

Example of Hash Tables

Illustrates U (universe of keys) and K (actual keys) mapped to slots through hash functions h(k).

Handling Collisions in Hashing

Problems

Definition of Collision: Two or more keys that hash to the same slot.
- Given a set K of keys: if |K| ≤ m, collisions may or may not happen depending on the hash function; if |K| > m, collisions will definitely occur.
Avoiding collisions completely can be challenging even with a good hash function.

Collision Handling Methods

Methods discussed:
- Chaining
- Open addressing (Linear probing, Quadratic probing, Double hashing)

Handling Collisions Using Chaining

Concept

Store all elements that hash to the same slot in a linked list.
- Slot j contains a pointer to the head of the list of all elements that hash to j.

Collision with Chaining - Discussion

Choosing the size of the table: Needs to be small enough to avoid wastage but large enough to keep the linked lists short (typically 1/5 or 1/10 of the total number of elements).
List ordering: Lists should be unordered to facilitate fast insertion and removal.

Insertion in Hash Tables

Algorithm: CHAINED-HASH-INSERT(T, x): inserts x at the head of list T[h(key[x])].
Worst-case running time: O(1) assuming the element is not already present.

Deletion in Hash Tables

Algorithm: CHAINED-HASH-DELETE(T, x): delete x from the list at T[h(key[x])].
Search time required to find the element being deleted.

Searching in Hash Tables

Algorithm: CHAINED-HASH-SEARCH(T, k): searches for an element with key k in list T[h(k)].
Running time: Proportional to the length of the list of elements in slot h(k).

Analysis of Hashing with Chaining

Worst Case

Worst-case search time: If all n keys hash to the same slot, resulting in time complexity of $heta(n)$ plus hash function time.

Average Case

Average case depends on the efficiency of the hash function in distributing n keys among m slots.
- Assumption: Simple uniform hashing (any element is equally likely to hash into any of the m slots) leads to average list length for j slots being $E[n_j] = rac{n}{m}$ .

Load Factor of a Hash Table

Definition: Load factor, $\beta = rac{n}{m}$ , where n = number of elements in the table, m = number of slots.
Interpretation: Encodes average number of elements stored in a chain.

Cases of Load Factor

Unsuccessful Search: Theorem states that an unsuccessful search takes expected time $O(1)$ + $\beta$ under the simple uniform hashing assumption.
Successful Search: Average case analysis offers expected time of $O(1 + rac{n}{2})$ , where elements are split across a list.

Hash Functions

Characteristics of Good Hash Functions

Easy to compute.
Must approximate random distribution for every key (simple uniform hashing).
- Difficulties arise as it is hard to satisfy simple uniform hashing generally.

Good Approaches to Designing Hash Functions

Minimize the chance that closely related keys hash to the same slot.
Derive a hash value that is independent of any patterns in the key distribution.

The Division Method for Hashing

Overview

Concept: Map key k into slots by taking the remainder of k divided by m.
- $h(k) = k ext{ mod } m$ .
Advantages: Fast computation (only one operation required).
Disadvantages: Poor choices for m (e.g.: powers of 2, non-prime numbers) that can make outputs less uniform.

Example of the Division Method

If $m = 2^p$ , then the hash function will be based only on the least significant bits of k.
Recommendation: Choose m to be a prime number, not close to a power of 2.

The Multiplication Method for Hashing

Overview

Concept: Multiply key k by a constant A (0 < A < 1) and provide a hash based on the fractional part of this product.
- $h(k) = ext{floor}(m imes (k imes A ext{ mod } 1))$ .
Advantages: M value is not critical (e.g., should be a power of 2).
Disadvantages: Slower than division method.

Example of the Multiplication Method

Shows the process of taking a fractional part and shifting for the final hash based on chosen m.

Universal Hashing

Concept

Recognizes that key distributions are often non-uniform in practice.
Select a hash function at random from a designed class of functions during the execution to reduce collision probability.

Definition of Universal Hash Functions

A set of functions H is universal if for any two keys x and y, the probability that they collide (i.e., h(x) = h(y)) is at most $rac{|H|}{m}$ .

Importance of Universal Hashing

Guarantees reduced collision probability, often providing near random table indices irrespective of the keys themselves.

Designing a Universal Class of Hash Functions

Procedure: Choose a large prime number p ensuring every possible key is within [0, p-1].
Define the hash function where each hash is a valid pair of prime choices:
- $h_{a,b}(k) = ((ak + b) ext{ mod } p) ext{ mod } m$ , with a and b being chosen randomly at execution start.

Example of Universal Hash Functions

Provides a direct application of this definition using specific prime numbers and functional calculations.

Advantages of Universal Hashing

Provides consistent average results regardless of the key input.
Prevents worst-case behaviors from occurring during typical operations due to random hash function selection.

Open Addressing

Concept

Stores keys directly in the table (when m > N) without linked lists.
Insertion and searching are handled through probing for the next available slot if collisions occur.

Generalized Hash Function Notation

Displays that a hash function now contains two arguments: key and probe number.
Probing sequence: Must be a permutation of all slots possible (0 to m-1).

Common Open Addressing Methods

Types of Open Addressing:
- Linear probing
- Quadratic probing
- Double hashing

Linear Probing

Insertion: If a collision occurs, check the next available position in the table with probe formula:
- $h(k,i) = (h_1(k) + i) ext{ mod } m$ where i = 0, 1, 2…

Linear Probing: Searching for a Key

Cases:
1. Key matches the current position.
2. Slot is empty (key not found).
3. Different key present, continue probing.

Linear Probing: Deleting a Key

Challenges: Cannot mark slots as empty as it may obstruct searches; instead, use a sentinel value 'DELETED' to allow for structures to be reused.

Primary Clustering Problem in Linear Probing

As elements get clustered together, search times can increase due to lengths of chains becoming predictable.

Quadratic Probing

Formula of inquiry is now of quadratic nature to mitigate clustering issues:
- $h(k, i) = (h'(k) + c<em>1i + c</em>2i^2) ext{ mod } m$ .
Still a potential clustering problem exists (secondary clustering).

Double Hashing

Mechanism: Use one hash function to find the initial slot and another for determining the increments in the probing sequence:
- $h(k, i) = (h<em>1(k) + i h</em>2(k)) ext{ mod } m$ .
Provides advantages of avoiding clustering, but makes deletions harder due to probe dependencies.

Analysis of Open Addressing

Probe Performance Analysis

Assumes uniformly distributed probe sequences.
Characterizes likelihood of success and failure across retrieval attempts, leading to expected number of steps based on load factor alpha (α) (average number of elements in the table).
Examples offer specific working conditions under defined probabilities to understand expected retrieval times based on load factors.