Hashing Techniques and Applications

Hashing

Hashing is a technique used in data structures t efficiently store and retrieve data, allowing for quick access.
It involves mapping data to a specific index in a hash table using a hash function.
Hashing can achieve search, insert, and delete operations in $O(1)$ time on average.
It is mainly used to implement sets of dissimilar items and dictionaries (key-value pairs).

Components of Hashing

Key: Input to the hash function to determine the storage position within a data structure.
Hash function: Accepts a key as input and produces the index of a component within a hash table.
Hash Table: A data model that links keys to values via a hash function, holding data associatively in an array.

Types of Hash Functions

Division Method
Mid-Square Method
Folding Method
Multiplication Method

1. Division Method

The hash function divides the value K by M and uses the remainder obtained.
Formula: $h(k) = K \mod M$
- $K$ = Key
- $M$ = Size of the hash table
It is best suited that $M$ is a prime number to ensure keys are more uniformly distributed.
Example:
- $K = 56, M = 10, h(56) = 56 \mod 10 = 6$
- The key element 56 is stored at the 6th index position of the hash table.
- $K = 1276, M = 11, h(1276) = 1276 \mod 11 = 0$
- The key element 1276 is stored at the 0th index position.
Advantages:
- Good for any value of M.
- Very fast since it requires only a single division operation.
Disadvantages:
- Leads to poor performance since consecutive keys map to the same hash index in the hash table.
- Extra care should be taken to choose the value of M.

2. Mid-Square Method

Involves two steps to compute the hash value:
- Square the value of the key $(K)$ : $K^2$ .
- Extract the middle digits of the squared number.
  - The number of digits depends on the size of the hash table.
Formula:
- $Start index = (L - n) // 2$
  - Integer-flow division (e.g., $1.5 = 1$ , $2.5 = 2$ )
  - $L$ = Total number of digits in the squared number
  - $n$ = Number of digits you want for the hash
Example 1:
- Key = 123
- Square it: $(123)^2 = 15129$
- Middle digits (two digits): 51
- Hash value: 51
- If your hash table has 100 slots (size 100), 51 is a valid index.
Example 2:
- Key = 67
- Square it: $(67)^2 = 4489$
- $L = 4$
- We want 2 digits.
- $Start index = (4 - 2) // 2 = 1$
- Extract from index 1 to index 2: 48
Advantages:
- Simple to implement.
- Tends to spread keys well if the key values are not too clustered.
Disadvantages:
- Not as robust as other hash functions for large-scale applications.
- Poor distribution if not carefully designed.

3. Folding Method

A hashing technique used to generate hash values by breaking a key into parts and combining those parts.

Divide the key into equal parts.
Add these parts together to produce the final hash value.
If needed, apply a modulo operation with the table size to get the final hash value.

Formula:
- $H(x) = S \mod M$
 - $S = x1 + x2 + x3 + … + xn$
 - Where:
 - $H(x)$ is the final hash value
 - $M$ is the size of the hash table for the range you want the hash value to fit into.
Example: $key = 123456$
- Chunks: $x1 = 12, x2 = 34, x_3 = 56$
- Sum of chunks: $12 + 34 + 56 = 102$
- If the hash table size $M$ is 10, then
 - $H(123456) = 102 \mod 10 = 2$
- Thus, the hash value for the input 123456 would be 2.
Advantages:
- Simple and easy to implement.
- Good distribution.
- Reduces collisions.
Disadvantages:
- Limited security.
- Not suitable for all data types.
- May not handle large inputs well.

4. Multiplication Method

A technique used to compute the hash function in data structures, especially for implementing hash tables.
Formula:
- $h(k) = \lfloor M * ((K * A) \mod 1) \rfloor$
  - Where:
    - K is the key
    - A is a constant
    - M is the size of the hash table
    - A commonly recommended im,constant is $A = \frac{\sqrt{5} - 1}{2} \approx 0.6180339887$
Example:
- key = 12345
- table size M = 10
- Constant A ≈ 0.6180339887
- $h(12345) = \lfloor 10 * ((12345 * 0.6180339887) \mod 1) \rfloor$
- $= \lfloor 10 * (7617.501038 \mod 1) \rfloor$
- $= \lfloor 10 * 0.50103 \rfloor$
- $= \lfloor 5.0103 \rfloor$
- $= 5$
- So, the hash value = 5
Advantages:
- It generally distributes keys evenly across the hash table, reducing clustering.
- Works well for both numerical and string-based keys.
Disadvantages:
- You need to carefully choose the constant (A) to avoid poor performance.

Collision Resolution Techniques

In hashing, hash functions generate hash values, used to create an index for keys in the hash table.
The hash function may return the same hash value for two or more keys. (Collision)
When two or more keys have the same hash value, a collision happens.
Collision resolution techniques are used to handle these collisions.
Main Techniques:
- Separate Chaining (Open Hashing)
- Open Addressing (Closed Hashing)
  - Linear Probing
  - Quadratic Probing
  - Double Hashing

1. Separate Chaining (Open Hashing)

The idea behind separate chaining is to make each cell of the hash table point to a linked list of records that have the same hash function value.
Simple but requires additional memory outside the table.

2. Open Addressing (Closed Hashing)

In open addressing, all elements are stored in the hash table itself.
Each table entry contains a record or $NILL$ .
When searching for an element, we examine the table slots one by one until the element is found or it is clear that the element is not in the table.

a. Linear Probing

In linear probing, the hash table is searched sequentially, starting from the original hash location.
If the location is already occupied, we check for the next location.
Example:
- Hash function: $h(key) = key \mod 10$
- Sequence of keys to be inserted: 22, 11, 32, 71, 46

b. Quadratic Probing

Quadratic probing is an open addressing scheme for resolving hash collisions in hash tables.
Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.
- $h(key) = key \mod 10$
- $h'(key) = (key + i^2) \mod 10; i = 1 \text{ to } 10$
This method is also known as the mid-square method because we look for $i^2 \mod \text{table size}$ in the $i^{th}$ iteration, and the values of $i = 1, 2, 3, …, 9, 10$ . We always start from the original hash location.
Example: Table size = 10, hash function $h(key) = key \mod 10$ , collision resolution strategy $h'(i) = i^2$
- Insert 19, 18, 39, 29, 8

c. Double Hashing

Double hashing is a collision-resolving technique in an open-addressed hash table.
Double hashing makes use of two hash functions:
- $h(key) = key \mod 10$
- $h'(key) = M - (key \mod 7)$
  - Where $M$ is a prime number smaller than the hash table size.
Example:
- Insert the keys 1, 2, 3, 11, 12, 13 into the hash table of size 10.
- First hash function: $h(key) = key \mod 10$
- Second hash function: $h'(key) = M - (key \mod M)$

Applications of Hashing

Hashing plays a crucial role in generating unique identifiers and improving caching efficiency.

1. Unique Identifier Generation

Hashing algorithms, like SHA-256 or MD5, can be used to create unique identifiers for objects, data, or any other entity.
The hash value is a one-way function of the original data, meaning it's difficult to reverse and highly unlikely to generate the same hash for different inputs.
Applications:
- File systems: Hashing can be used to generate unique identifiers for files, allowing for efficient file management and storage.
- Databases: Hashing can be used to create primary keys or unique constraints in relational databases, ensuring data integrity and preventing duplicates.
- Network identification: Hashing can be used to create unique IDs for network devices or connections, facilitating network management and identification.

2. Caching

Hash tables are data structures that utilize hashing to provide fast lookups.
They store data in key-value pairs, where the key is hashed and used to determine the storage location of the corresponding value.
Applications:
- Web caching: Hashing is used in web browsers and server-side caching to store frequently accessed data, such as images, CSS, and JavaScript files. This reduces bandwidth consumption and improves website loading times.
- Application caching: Hashing can be used in applications to store frequently accessed data, such as user profiles or database queries. This improves application performance by reducing the need to access slow storage devices.
- Database caching: Hashing can be used to implement caching mechanisms within databases, such as query result caching. This improves database performance by reducing the need to repeatedly execute the same queries.