S

Fundamentals of Programming III: Data Structures

Fundamentals of Programming III - Data Structures Lecture 06: Hash Tables

Introduction to Hash Tables

  • Definition: A hash table is a data structure that maps keys to values. It can also be referred to as associative arrays.

    • Keys:

    • Should be unique.

    • Must support consistent translation into an integer through a hash function.

    • Values:

    • Can be of any data type.

Terminology

  • Bucket: An element location within the hash table.

  • Hashing: The process of converting a key into a valid index for storage in the hash table.

    • Common operation: hash=key\%n where n is the size of the hash table.

Hash Table Abstract Data Type (ADT)

  • Supported Operations:

    • insert(key, value): Inserts a new element into the hash table.

    • remove(key): Removes an element associated with the given key.

    • search(key)/get(key): Retrieves the value corresponding to the specified key.

    • update(key, value): Updates the value for a given key.

    • count: Returns the number of elements in the hash table.

    • isEmpty: Checks if the hash table contains any elements.

Simple Hashing

  • Main Operation: The most common hashing operation is modulo.

    • Procedure:

    1. Convert the key into an integer k utilizing a hash function.

    2. If two keys are equal, they must produce the same hash value k.

    3. Compute: hash = k mod n

    4. Resulting hash is guaranteed to fall within the range of [0, n-1].

Non-Reversible Nature of Hashing

  • Example: If table size n = 12 and key = 14, then:

    • 14 mod 12 = 2

    • It is impossible to revert to the original key as many keys (e.g., 2, 14, 26,…) can hash to the same value 2.

Naïve Hash Table Operations

  • Insert:

    • Let hf() be a hash function.

    • Steps:

    1. Calculate hash value: \text{hash value} = hf(\text{key})

    2. Store value in: \text{table}[\text{hash value}] = \text{value}

    • Remove:

    • If search(key) is not null: set \text{table}[\text{hash value}] = \text{null}

  • Search:

    • Steps:

    1. Calculate hash value: hash value=hf(key) f table[hash ext{ value}] is not null: return table[hash ext{ value}]

    2. Else return null.

  • Update:

    • Steps:

    1. Calculate hash value: hash\:{value}=hf(key)

    2. If search(key) is not null: update the value at table[hash ext{ value}]

    3. Else, insert (key, value).

Collisions

  • Perfect Hash Function: A hash function that maps each key uniquely.

  • Collision: Occurs when two keys hash to the same bucket, which is dependent on:

    • Hash function chosen.

    • Table size.

    • Load factor.

Collision Handling Strategies

  • Various methods to handle collisions include:

    • Chaining: Each bucket holds a linked list of entries.

    • Open Addressing: Inserts the item in another slot in the same array via probing.

      • Linear Probing: Sequentially searches through the array to find an open slot.

      • Quadratic Probing: Increases the step size quadratically while probing.

      • Double Hashing: Uses a second hash function to determine the probing sequence.

Chaining

  • Each bucket in a hash table can hold a linked list of entries:

    • Insert: Adds new nodes to the linked list in the respective bucket.

    • Search/Remove: Traverses the linked list for the specific key.

Open Addressing Overview

  • All elements are stored directly in the array. Upon collision, an open bucket is searched using a specific probing sequence.

  • Distinguishing Status:

    • ESS: Empty Since Start - No element has ever occupied this space.

    • EAR: Empty After Removal - Previously contained an element but is currently empty.

Linear Probing

  • Methodology: Steps through buckets one by one.

    • If it encounters a filled bucket, it checks the next one until it finds an empty slot.

    • Issues: Can cause primary clustering, leading to groups of filled buckets which can degrade performance.

Hash Table Operations with Linear Probing

  • Insert:

    • Use: hash = hf(item.key)

    • While probing, if an empty bucket is found, insert the item there.

    • If probing exceeds the size of the hash table, return false.

  • Remove:

    • While probing, if an item matches, label as EAR.

  • Search:

    • Probes until either the correct item is found or an ESS is encountered.

Clustering in Hash Tables

  • Definition: Clustering occurs when multiple keys form groups of occupied slots, making future insertions inefficient.

    • Probing methods exacerbate the issue since they cause clustered keys to aggregate near each other.

Clustering Effects on Performance

  • Increased size of clusters leads to longer search times for empty buckets.

  • Higher likelihood of collisions as more keys are inserted within an existing cluster, thus degrading performance with respect to the load factor.

Mitigating Clustering

  • Quadratic Probing: Reduces clustering by increasing step sizes quadratically.

  • Double Hashing: Applies a secondary hash function for selecting the step size, helping to evenly distribute entries.

  • Rehashing: When clusters severely affect performance, resizing the table with a better hash method can help.

Quadratic Probing Sequence

  • The probing sequence is determined as: (hf(key)+c_1\cdot i+c_2\cdot i^2)\bmod n where i is the iteration count, and c1 and c2 are typically set to 1.

  • These tend to be smaller and more spread out.

Hash Table Operations with Quadratic Probing

  • Operations for insertion, removal, and searching are similar to linear probing but use the quadratic probing sequence.

Double Hashing

  • Probing Sequence: Provided as (hf1(key) + i imes hf2(key)) mod n where both hf1 and hf2 are distinct hash functions. All underlying procedures (insert, remove, search) remain the same except for the probing sequence used.

Hash Table Size and Resizing

  • Conditions for Resizing: When the table is full or the load factor exceeds certain thresholds.

  • Best Size: Recommended to keep hash table size as a prime number to minimize clustering effects.

    • Resize method involves creating a new larger array and re-inserting all current elements into this new array.

Load Factor and Thresholds

  • Load Factor (α) = number of items / number of buckets.

  • Typical Thresholds:

    • Load factor threshold: Recommended to remain below 0.75.

    • Collision threshold: Action taken if the count of collisions exceeds a set integer value.

    • Linked-list threshold: Monitors the length of linked lists in chains.

When to Resize

  • Triggers include:

    • Full table condition.

    • Load factor exceeding threshold.

    • The number of collisions surpassing collision thresholds.

Common Hash Functions

  • Division Method: h(k) = k mod m.

    • Key is divided by the table size m, and the remainder is used as the hash index.

    • Simple and fast but can cluster if m is poorly chosen.

  • Multiplication Method: h(k) = m imes (kA mod 1) where A is a constant between 0 and 1.

    • avoids poor table size choices

  • Universal Hashing: Uses random hash functions to improve performance against worst-case scenarios.

  • Folding Method: Combines parts of the key by summation or operations.

  • Mid-Square Method: Squares the key and extracts the middle bits to derive the hash.

  • Cryptographic Hash Functions: (e.g., SHA, MD5) secure but usually slower; best for cryptographic applications.

Informal Runtime Analysis of Hash Table Operations

  • Expected runtime for operations in ideal conditions is O(1).

    • Factors affecting performance include the type of collision resolution strategy used, load factor, and the quality of the hash function.

Runtime Analysis with Chaining

  • Insertion: Generally O(1) when inserted at the list head. Worst-case O(m). Expected O(1 + α).

  • Search: Best-case O(1), but worst-case O(m). Expected O(1 + α).

  • Removal: Same runtime as search, dependent on finding the node first.

Runtime Analysis with Open Addressing

  • Insertion can range from O(1) best-case to O(n) worst-case.

  • Search and removal follow similar patterns affected by collisions and cluster size.

  • Critical to keep the load factor below 0.75 for optimal performance.

Amortized vs Worst Case Operation Analysis

  • Amortized for operations in chaining and open addressing are expected to be O(1 + α) with acceptable load factors, while the worst-case can degrade to O(n).

  • Maintaining a low load factor is essential to ensure consistent performance.