Design & Analysis of Algorithm: Hashing

Analysis of Algorithms

Hashing

The Search Problem

  • Objective: Find items with keys matching a given search key.

    • Problem Definition: Given an array A, containing n keys, and a search key x, find the index i such that x=A[i].

    • Note: A key could be part of a larger record.

Applications of Hashing

  • Banking: Keeping track of customer account information.

    • Searching through records for checking balances and performing transactions.

  • Flight Reservations: Keeping track of seat availability.

    • Searching to find empty seats and modify or cancel reservations.

  • Search Engines: Looking for all documents containing a specified word.

Special Case: Dictionaries

  • Definition: A dictionary is a data structure supporting mainly two fundamental operations: inserting a new item and retrieving an item with a given key.

Dictionary Operations
  • Queries for information about the set S:

    • Search(S, k)

    • Minimum(S)

    • Maximum(S)

    • Successor(S, x)

    • Predecessor(S, x)

  • Modifying Operations (which change the set):

    • Insert(S, k)

    • Delete(S, k) (less frequent)

Direct Addressing

Assumptions
  • Key values are distinct.

  • Each key is drawn from a universe U = {0, 1, …, m - 1}.

Concept
  • Store items in an array indexed by keys.

  • Direct-address table representation: An array T[0 … m - 1].

    • Each slot in T corresponds to a key in U.

    • For an element x with key k, a pointer to x (or x itself) is placed in location T[k].

    • If there are no elements with key k in the set, T[k] is represented by NIL.

Visualization of Direct Addressing
  • Example: For a universe U and actual keys K:

    • Every slot T[j] in an array can either contain a record or NIL.

    • Insert/Delete operations can be performed in O(1) time.

Operations Algorithms for Direct Addressing

  • Algorithms:

    • DIRECT-ADDRESS-SEARCH(T, k): return T[k]

    • DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x

    • DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL

  • Running time for these operations: O(1)

Comparing Different Implementations

  • Implementing dictionaries using different structures:

    • Direct addressing: Insert O(1), Search O(1)

    • Ordered array: Insert O(N), Search O(lgN)

    • Unordered array: Insert O(1), Search O(N)

    • Ordered linked list: Insert O(N), Search O(N)

    • Unordered linked list: Insert O(1), Search O(N)

Examples Using Direct Addressing

Example 1
  • Scenario: Keys are integers from 1 to 100 with about 100 records.

    • Create an array of 100 items and store the record whose key equals i in A[i].

Example 2
  • Scenario: Keys are nine-digit social security numbers.

    • Using the same strategy would require an array of 1 billion items, which is inefficient due to the large universe U compared to |K|.

Hash Tables

Definition
  • When K (the number of actual keys) is much smaller than U (the universe), a hash table requires significantly less space than a direct-address table.

    • Storage can be reduced to |K| while aiming for O(1) average search time (note: this is not guaranteed in the worst case).

Idea of Hash Tables
  • Use a function h to compute the slot for each key.

  • Store the element in slot h(k).

  • Hash function: A function h transforms a key into an index in a hash table T[0…m-1], described as h: U → {0, 1, …, m - 1}.

    • We say that k hashes to slot h(k).

  • Advantages:

    • Reduce the range of array indices handled: m instead of |U|.

    • Reduced required storage.

Example of Hash Tables
  • Illustrates U (universe of keys) and K (actual keys) mapped to slots through hash functions h(k).

Handling Collisions in Hashing

Problems
  • Definition of Collision: Two or more keys that hash to the same slot.

    • Given a set K of keys: if |K| ≤ m, collisions may or may not happen depending on the hash function; if |K| > m, collisions will definitely occur.

  • Avoiding collisions completely can be challenging even with a good hash function.

Collision Handling Methods
  • Methods discussed:

    • Chaining

    • Open addressing (Linear probing, Quadratic probing, Double hashing)

Handling Collisions Using Chaining

Concept
  • Store all elements that hash to the same slot in a linked list.

    • Slot j contains a pointer to the head of the list of all elements that hash to j.

Collision with Chaining - Discussion
  • Choosing the size of the table: Needs to be small enough to avoid wastage but large enough to keep the linked lists short (typically 1/5 or 1/10 of the total number of elements).

  • List ordering: Lists should be unordered to facilitate fast insertion and removal.

Insertion in Hash Tables
  • Algorithm: CHAINED-HASH-INSERT(T, x): inserts x at the head of list T[h(key[x])].

  • Worst-case running time: O(1) assuming the element is not already present.

Deletion in Hash Tables
  • Algorithm: CHAINED-HASH-DELETE(T, x): delete x from the list at T[h(key[x])].

  • Search time required to find the element being deleted.

Searching in Hash Tables
  • Algorithm: CHAINED-HASH-SEARCH(T, k): searches for an element with key k in list T[h(k)].

  • Running time: Proportional to the length of the list of elements in slot h(k).

Analysis of Hashing with Chaining

Worst Case
  • Worst-case search time: If all n keys hash to the same slot, resulting in time complexity of heta(n)heta(n) plus hash function time.

Average Case
  • Average case depends on the efficiency of the hash function in distributing n keys among m slots.

    • Assumption: Simple uniform hashing (any element is equally likely to hash into any of the m slots) leads to average list length for j slots being E[nj]=racnmE[n_j] = rac{n}{m}.

Load Factor of a Hash Table

  • Definition: Load factor, β=racnm\beta = rac{n}{m}, where n = number of elements in the table, m = number of slots.

  • Interpretation: Encodes average number of elements stored in a chain.

Cases of Load Factor
  1. Unsuccessful Search: Theorem states that an unsuccessful search takes expected time O(1)O(1) + β\beta under the simple uniform hashing assumption.

  2. Successful Search: Average case analysis offers expected time of O(1+racn2)O(1 + rac{n}{2}), where elements are split across a list.

Hash Functions

Characteristics of Good Hash Functions
  1. Easy to compute.

  2. Must approximate random distribution for every key (simple uniform hashing).

    • Difficulties arise as it is hard to satisfy simple uniform hashing generally.

Good Approaches to Designing Hash Functions
  • Minimize the chance that closely related keys hash to the same slot.

  • Derive a hash value that is independent of any patterns in the key distribution.

The Division Method for Hashing

Overview
  • Concept: Map key k into slots by taking the remainder of k divided by m.

    • h(k)=kextmodmh(k) = k ext{ mod } m.

  • Advantages: Fast computation (only one operation required).

  • Disadvantages: Poor choices for m (e.g.: powers of 2, non-prime numbers) that can make outputs less uniform.

Example of the Division Method
  • If m=2pm = 2^p, then the hash function will be based only on the least significant bits of k.

  • Recommendation: Choose m to be a prime number, not close to a power of 2.

The Multiplication Method for Hashing

Overview
  • Concept: Multiply key k by a constant A (0 < A < 1) and provide a hash based on the fractional part of this product.

    • h(k)=extfloor(mimes(kimesAextmod1))h(k) = ext{floor}(m imes (k imes A ext{ mod } 1)).

  • Advantages: M value is not critical (e.g., should be a power of 2).

  • Disadvantages: Slower than division method.

Example of the Multiplication Method
  • Shows the process of taking a fractional part and shifting for the final hash based on chosen m.

Universal Hashing

Concept
  • Recognizes that key distributions are often non-uniform in practice.

  • Select a hash function at random from a designed class of functions during the execution to reduce collision probability.

Definition of Universal Hash Functions
  • A set of functions H is universal if for any two keys x and y, the probability that they collide (i.e., h(x) = h(y)) is at most racHmrac{|H|}{m}.

Importance of Universal Hashing
  • Guarantees reduced collision probability, often providing near random table indices irrespective of the keys themselves.

Designing a Universal Class of Hash Functions

  • Procedure: Choose a large prime number p ensuring every possible key is within [0, p-1].

  • Define the hash function where each hash is a valid pair of prime choices:

    • ha,b(k)=((ak+b)extmodp)extmodmh_{a,b}(k) = ((ak + b) ext{ mod } p) ext{ mod } m, with a and b being chosen randomly at execution start.

Example of Universal Hash Functions
  • Provides a direct application of this definition using specific prime numbers and functional calculations.

Advantages of Universal Hashing

  • Provides consistent average results regardless of the key input.

  • Prevents worst-case behaviors from occurring during typical operations due to random hash function selection.

Open Addressing

Concept
  • Stores keys directly in the table (when m > N) without linked lists.

  • Insertion and searching are handled through probing for the next available slot if collisions occur.

Generalized Hash Function Notation
  • Displays that a hash function now contains two arguments: key and probe number.

  • Probing sequence: Must be a permutation of all slots possible (0 to m-1).

Common Open Addressing Methods

  • Types of Open Addressing:

    • Linear probing

    • Quadratic probing

    • Double hashing

Linear Probing
  • Insertion: If a collision occurs, check the next available position in the table with probe formula:

    • h(k,i)=(h1(k)+i)extmodmh(k,i) = (h_1(k) + i) ext{ mod } m where i = 0, 1, 2…

Linear Probing: Searching for a Key
  • Cases:

    1. Key matches the current position.

    2. Slot is empty (key not found).

    3. Different key present, continue probing.

Linear Probing: Deleting a Key
  • Challenges: Cannot mark slots as empty as it may obstruct searches; instead, use a sentinel value 'DELETED' to allow for structures to be reused.

Primary Clustering Problem in Linear Probing

  • As elements get clustered together, search times can increase due to lengths of chains becoming predictable.

Quadratic Probing
  • Formula of inquiry is now of quadratic nature to mitigate clustering issues:

    • h(k,i)=(h(k)+c<em>1i+c</em>2i2)extmodmh(k, i) = (h'(k) + c<em>1i + c</em>2i^2) ext{ mod } m.

  • Still a potential clustering problem exists (secondary clustering).

Double Hashing
  • Mechanism: Use one hash function to find the initial slot and another for determining the increments in the probing sequence:

    • h(k,i)=(h<em>1(k)+ih</em>2(k))extmodmh(k, i) = (h<em>1(k) + i h</em>2(k)) ext{ mod } m.

  • Provides advantages of avoiding clustering, but makes deletions harder due to probe dependencies.

Analysis of Open Addressing

Probe Performance Analysis
  • Assumes uniformly distributed probe sequences.

  • Characterizes likelihood of success and failure across retrieval attempts, leading to expected number of steps based on load factor alpha (α) (average number of elements in the table).

  • Examples offer specific working conditions under defined probabilities to understand expected retrieval times based on load factors.