hashing

Hashing Searching / Indexing Algorithm-3


The Search Problem

  • Find items with keys matching a given search key.

  • Given an array A containing n keys and a search key x, find the index i such that x = A[i].

  • A key could be part of a large record.


Applications of Searching

  • Keeping track of customer account information at a bank:

    • Searching through records to check balances and perform transactions.

  • Flight reservations tracking:

    • Searching to find empty seats or to cancel/modify reservations.

  • Search engines:

    • Searching for all documents containing a given word.


Special Case: Dictionaries

  • Dictionary: A data structure supporting mainly two operations:

    • Insert a new item.

    • Return an item with a given key.

  • Queries to return information about a set S:

    • Search(S, k)

    • Minimum(S)

    • Maximum(S)

    • Successor(S, x)

    • Predecessor(S, x)

  • Modifying operations:

    • Insert(S, k)

    • Delete(S, k) (not performed very often).


Direct Addressing

  • Assumptions:

    • Key values are distinct.

    • Each key is drawn from a universe U = {0, 1, …, m - 1}.

  • Idea:

    • Store items in an array indexed by keys.

    • Use a direct-address table representation:

      • Array T[0 … m - 1], corresponding to keys in U.

      • For element x with key k, store pointer to x in T[k].

      • If no elements exist with key k, T[k] is represented by NIL.


Direct Addressing (continued)

  • Example Representation:

    • T 0 key satellite data, U 1, 2, 3 etc.

    • Each slot corresponds to a key in the universe.

    • Insertion and deletion operations in O(1) time.


Operations Algorithms

  • DIRECT-ADDRESS-SEARCH(T, k): return T[k]

  • DIRECT-ADDRESS-INSERT(T, x): T[key[x]] ← x

  • DIRECT-ADDRESS-DELETE(T, x): T[key[x]] ← NIL

  • Running time for these operations: O(1)


Comparing Different Implementations

Implementing Dictionaries Using:

  • Direct Addressing: O(1) search

  • Ordered Arrays & Lists: O(N) for insertion/search

  • Unordered Arrays & Lists: O(N) for insertion/search


Examples Using Direct Addressing

  • Example 1:

    • Keys are integers from 1 to 100; create an array A of 100 items.

  • Example 2:

    • Keys are nine-digit social security numbers; inefficient to create an array of one billion items to store 100 records.


Hash Tables

  • When K < U, hash tables save space compared to direct-address tables.

  • Hash tables aim to achieve average O(1) search time.


Hash Tables Explained

  • Idea: Use a function h to compute the slot for each key, storing the element in slot h(k).

  • Hash function:

    • Transforms a key into an index in T[0…m-1] via mapping, reducing array index range.


Example of Hash Tables

  • U: Universe of keys, K: Actual keys with hash function results.


Revisit Example 2

  • If the keys are social security numbers, a possible hash function is: h(ssn) = SSS mod 100.


Collisions

  • Two or more keys hashing to the same slot.

  • If |K| > m, collisions definitely occur.

  • Even with a good hash function, avoiding collisions entirely is challenging.


Handling Collisions

  • Methods to handle collisions include:

    • Chaining

    • Open Addressing

    • Linear Probing

    • Quadratic Probing

    • Double Hashing

  • Chaining discussed first.


Handling Collisions with Chaining

  • Idea:

    • Place all elements that hash to the same slot in a linked list.

    • Slot j contains a pointer to the start of the list of all elements hashing to j.


Insertion in Hash Tables

  • Algorithm:

    • CHAINED-HASH-INSERT(T, x): Insert x at the head of list T[h(key[x])].

  • Worst-case running time: O(1), assuming element isn’t already in the list.


Deletion in Hash Tables

  • Algorithm:

    • CHAINED-HASH-DELETE(T, x): Delete x from the list T[h(key[x])].

  • Worst-case running time: Depends on searching the corresponding list.


Searching in Hash Tables

  • Algorithm:

    • CHAINED-HASH-SEARCH(T, k): Search for an element with key k in list T[h(k)].

  • Running time: Proportional to the size of the list in slot h(k).


Analysis of Hashing with Chaining: Worst Case

  • Worst-case scenario occurs when all n keys hash to the same slot.

  • Searching can then take O(n) time.


Analysis of Hashing with Chaining: Average Case

  • Average case depends on hash function performance distributing n keys across m slots.

  • Simple uniform hashing assumption: Each element equally likely to hash to any of m slots.


Load Factor of Hash Table

  • Load factor (α): a = n/m (n: number of elements, m: number of slots).

  • Encodes average chain length; a can be <, =, or > 1.


Case 1: Unsuccessful Search

  • Theorem: An unsuccessful search takes expected time α under simple uniform hashing.

  • Total time required is: O(1) (for the hash function) + α.


Case 2: Successful Search

  • Expected time can be analyzed as O(1 + α/2) based on average scenario.


Analysis of Search in Hash Tables

  • If m (number of slots) is proportional to n (number of elements):

    • Searching takes average O(1) time.


Hash Functions

  • Good hash functions:

    • Easy to compute.

    • Approximate randomness (regardless of input).


Good Approaches for Hash Functions

  • Minimize likelihood of related keys hashing to the same slot.


The Division Method of Hashing

  • Idea: Map key k to m slots using k mod m.

  • Advantages: Fast computation; Disadvantages: Poor choice of m (like power of 2) can lead to poor distribution.


Example - The Division Method

  • If m = 2^p, h(k) becomes least significant p bits of k.

  • Choose m to be prime and not close to powers of 2.


The Multiplication Method

  • Multiply key k by a constant A (0 < A < 1) and extract fractional part of kA.

  • Advantages: m’s value isn’t critical, Disadvantage: Slower than division method.


Universal Hashing

  • Designing hash functions to produce random table indices, independent of the keys.


Definition of Universal Hash Functions

  • A universal set of hash functions H={h(k): U (0, 1, …, m - 1)}; H is universal if for x ≠ y, probability EiH: h(x)=h(y) ≤ 1/m.


Universal Hashing – Main Result

  • Chance of collision during hashing is no more than 1/m if locations are randomly chosen.


Designing a Universal Class of Hash Functions

  • Choose a prime number p for every key range and define hash function ha,b(k) as:

    • ha,b(k) = ((ak + b) mod p) mod m.

  • Family of all such hash functions is universal.


Example: Universal Hash Functions

  • Using p = 17 and m = 6, calculate:

    • h3,4(8) = ((3×8 + 4) mod 17) mod 6 = 5.


Advantages of Universal Hashing

  • Provides good average results, independent of stored keys.

  • Guarantees that worst-case behavior is rare.


Open Addressing

  • Storing keys directly in the table without using linked lists when there’s enough memory.

  • Insert if a slot is full, find the next available one.


Generalized Hash Function Notation

  • A hash function takes key value and probe number: h(k, p).


Common Open Addressing Methods

  • Methods: Linear probing, Quadratic probing, Double hashing.

  • None can generate more than m^2 different sequences.


Linear Probing: Inserting a Key

  • Idea: On collision, check the next available position: h(k,i) = (h1(k) + i) mod m.


Linear Probing: Searching for a Key

  • Three cases:

    1. Found the key.

    2. Slot empty, proceed.

    3. Different element found, continue probing.


Linear Probing: Deleting a Key

  • Challenge of marking a slot as empty is addressed by using a sentinel value DELETED.


Primary Clustering Problem

  • Certain slots become favored, increasing search time due to long clusters of occupied slots.


Quadratic Probing

  • Formula: h(k,i) = (h1(k) + i^2) mod N; lessening clustering problem but still present.


Double Hashing

  • Utilizes first hash function for the initial probe and a second for subsequent increments in the probing process: h(k,i) = (h1(k) + i*h2(k)) mod m.


Analysis of Open Addressing

  • Ignore clustering to analyze probe success rates.


Analysis of Open Addressing (continued)

  • Unsuccessful retrieval results based on load factor a.

  • Variations in expected number of steps based on load factor (0.5 vs 0.9).