Hash Tables

Introduction

  • Hash tables offer fast searching capabilities but may have some memory management inefficiencies and implementation complexity.

  • Gaining speed often means losing efficiency in other areas, like memory management.

  • Hash tables are based on arrays, making them relatively easy to implement.

  • They are widely used in language processing and automatic translation systems because of the need for high speed in searching for particular words.

  • Principles of hash table organization, quadratic probing, separate chaining, and hash function organization will be covered.

Advantages and Disadvantages

  • Advantages:

    • Fast access if the key is known.

    • Based on arrays, which allows direct access using indexes.

  • Disadvantages:

    • Slow deletion of elements.

    • Slow access if the key is unknown, requiring a search through all elements.

    • Memory inefficiency: extra memory must be reserved to keep searches fast.

Hashing in Cybersecurity

  • Hashing is used in cybersecurity to produce a hash (digest) of a document, used for integrity checks and digital signatures.

  • In data structures, hashing is used to squeeze a large range of keys into a smaller range.

Big O Notation and Speed

  • Hash tables have an access time close to O(1), the fastest possible.

  • Trees have O(log n) time complexity, which is slower than hash tables.

  • Hash tables are relatively easy to program because they are based on arrays.

Memory Management

  • Dynamic memory is used for storing data in arrays to improve memory management.

  • Arrays are static, so without dynamic objects, a lot of memory needs to be reserved.

Limitations

  • It is difficult to expand a hash table if it becomes overloaded.

  • Reaching the size limit requires creating a new, larger hash table and reinserting all elements, which can be time-consuming and require additional memory.

  • Hash tables do not provide a convenient way to visit elements in sorted order; trees are better for sorting.

Best Use Cases

  • Hash tables are best when the size of the data is known and will not change dramatically.

Examples

  • If storing employee data with identifiers from 1 to 1,000, a simple array can be used instead of a hash table because the keys are numbers in order.

  • For organizing a dictionary, words need to be converted into numbers to create indexes for a hash table, as words are not inherently numbers.

  • Hash tables are suitable for dictionaries because the size of words in a language is relatively fixed and stable.

Using Hash Function

Converting Words to Numbers

  • One approach is to use the ASCII code table to extract codes for each letter.

  • A simpler approach is to assign numbers to letters (e.g., a=1, z=26).

  • If the letter codes of a ten-letter word are simply added, the sums fall between 10 and 260, far too small a range to distinguish many words, so collisions would be constant.

Multiplying by Powers

  • Each letter is multiplied by a power of 10 based on its position in the word.

  • This results in a very large range of indexes, making it impractical to create such a large array.

  • Hash tables are used to squeeze this huge number into a smaller range.
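The letters-times-powers scheme above can be sketched in Python. This is a hypothetical helper following the notes' convention (a=1 through z=26, each letter weighted by a power of 10 by position):

```python
def word_to_number(word):
    """Convert a word to a number: a=1, ..., z=26, with each
    letter weighted by a power of 10 based on its position."""
    total = 0
    for i, ch in enumerate(reversed(word.lower())):
        code = ord(ch) - ord('a') + 1  # a=1 ... z=26
        total += code * 10 ** i        # weight by position in the word
    return total

def hash_word(word, table_size):
    # Squeeze the potentially huge number into the array range.
    return word_to_number(word) % table_size
```

For example, "cat" becomes 3·100 + 1·10 + 20 = 330, and with a table of 100 slots it lands at index 30.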

Hash Functions

  • Hash functions are used to find array indexes for storing data in hash tables.

  • A hash function squeezes a huge number into the array size.

  • For example, if storing 50,000 words, a hash table of 100,000 is used to provide free space.

Modular Arithmetic

  • The modulo operator (%) is used to take the remainder, squeezing a large range of keys into a small range of indexes.

  • For example, with a range of 10 elements, 10 % 10 = 0, 12 % 10 = 2.
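A minimal illustration of squeezing keys with the modulo operator, matching the 10-element example:

```python
table_size = 10
keys = (10, 12, 4317)
# The remainder maps any key into 0..9: 10 -> 0, 12 -> 2, 4317 -> 7
indexes = [key % table_size for key in keys]
```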

Collisions

  • Collisions occur when different words take the same position in the array.

  • Open addressing and separate chaining are two approaches to solve collisions.

Open Addressing

  • If a position is taken, search for the next free space.

  • Linear probing is an open addressing algorithm where the next free space is found by moving one step ahead.

  • Clusters can form in the data structure, impacting the speed of the search.

  • Sufficient size of the hash table is needed to avoid large clusters.

Finding Elements

  • To find an element, use the hash function to extract the index and search by index.

  • If the element is not found, continue probing until the element is found or an empty place is reached.

Linear Probing and Deletion in Hash Tables

Linear Probing

  • In linear probing, after a collision, look for following free space by moving one step ahead.

  • Clusters start to form when consecutive places are taken. These clusters impact search speed.

Example

  • Hash table of seven elements (indexes 0-6).

    • Insert 76, hash to index 6, store element (one probe).

    • Insert 16, hash to index 2, store element (one probe).

    • Insert 40, hash to index 5, store element (one probe).

    • Insert 47, hash to index 5 (collision). Search for a free place.

      • Probe 1: Index 6 (taken).

      • Probe 2: Index 0 (free). Store 47 at index 0 (three probes).

    • Insert 10, hash to index 3, store element (one probe).

    • Insert 55, hash to index 6 (collision). Search for a free place.

      • Probe 1: Index 0 (taken).

      • Probe 2: Index 1 (free). Store 55 at index 1 (three probes).
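The insertion sequence above can be reproduced with a small linear-probing sketch (a seven-slot table and the same keys):

```python
def insert_linear(table, key):
    """Insert key using linear probing; return the number of probes."""
    size = len(table)
    index = key % size
    probes = 1
    while table[index] is not None:
        index = (index + 1) % size  # move one step ahead, wrap around
        probes += 1
    table[index] = key
    return probes

table = [None] * 7
probe_counts = {key: insert_linear(table, key)
                for key in (76, 16, 40, 47, 10, 55)}
```

Running this gives 3 probes for 47 and 3 probes for 55, matching the walkthrough; the final layout is 47 and 55 at indexes 0 and 1, wrapped around from the cluster at 5 and 6.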

Cluster Analysis

  • After these insertions, one cluster of occupied slots (indexes 5, 6, 0, 1) wraps around the end of the array.

  • The more clusters and probes, the less efficient the data structure becomes.

Exam Questions

  • Possible exam tasks include determining the number of clusters, probes, or collisions after a series of insertions.

Deletion

  • Deleting elements from hash tables is complicated and can impact data structure integrity.

  • The easiest approach is to mark elements as deleted without freeing the memory to avoid breaking probe sequences for other elements.

  • Deleting 76 (at index 6) could cause issues for element 47 (originally hashed to 5, but stored at 0) if index 6 is freed.
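A sketch of this "mark, don't free" approach: 76 is replaced by a tombstone rather than an empty slot, so the probe chain that leads to 47 stays intact.

```python
DELETED = object()  # tombstone: slot was used, then deleted

def find_linear(table, key):
    """Probe from the home slot; a truly empty slot (None) ends
    the search, but a tombstone does not."""
    size = len(table)
    index = key % size
    for _ in range(size):
        slot = table[index]
        if slot is None:
            return -1          # empty slot: the key is absent
        if slot == key:        # tombstones never match a key
            return index
        index = (index + 1) % size
    return -1

# Table from the linear-probing example; 47 hashed to 5 but lives at 0.
table = [47, 55, 16, 10, None, 40, 76]
table[6] = DELETED             # delete 76 without breaking the chain
```

Had index 6 been set back to None instead, the search for 47 would stop at 6 and wrongly report it missing.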

Duplicates

  • Duplicates should be forbidden to avoid needing to search through the entire cluster.

Dynamic Data Element

Hash Table Load

  • A hash table should be no more than one-half to two-thirds full to maintain efficiency.

Data Element Structure

  • Each element contains a data key used for extracting array indexes.

  • Use a simple (non-dynamic) array or vector of dynamic data elements.

  • Initialize the array by setting all elements to null.

  • Mark a deleted element with a special key value of -1.

Hash Function

  • A simple modulo operation (key % tableSize) serves as the hash function.

Search Function

  • Apply the hash function to the key to get the hash value; this value is the array index where the search begins.

  • Loop through the array until the element is found.

  • If the end of the hash table is reached, wrap around to the beginning using p = p % hashtablesize.

Insert Function

  • Hash the key to get the starting index. The insertion logic is:

  • Move until finding an empty place or a place where an element is deleted.

  • If the end is reached, move to the beginning of the array.

  • Store the new element.

Remove Function

  • Hash the key to get the array index and search for the element.

  • If the element is found, remember it (ptemp), replace it with the deleted-element placeholder, and return ptemp.

  • If the end of the array is reached, wrap around to the beginning.

Print Function

  • A while loop is unnecessary because the array size is known; a for loop from zero to the end works.
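The search, insert, remove, and print logic described above can be sketched as one small class. This is a minimal Python rendering of the notes' design (None for a never-used slot, -1 as the deleted placeholder, ptemp kept as a local variable):

```python
class HashTable:
    """Open addressing with linear probing. None marks a never-used
    slot; -1 is the placeholder for a deleted element."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size   # initialize all elements to null

    def _hash(self, key):
        return key % self.size       # simple modulo hash function

    def find(self, key):
        p = self._hash(key)
        for _ in range(self.size):   # at most one full pass
            if self.slots[p] is None:
                return -1            # empty place ends the search
            if self.slots[p] == key:
                return p
            p = (p + 1) % self.size  # wrap around at the end
        return -1

    def insert(self, key):
        p = self._hash(key)
        # Move until an empty place or a deleted (-1) place is found.
        while self.slots[p] is not None and self.slots[p] != -1:
            p = (p + 1) % self.size
        self.slots[p] = key

    def remove(self, key):
        p = self.find(key)
        if p == -1:
            return None
        temp = self.slots[p]         # remember the element (ptemp)
        self.slots[p] = -1           # placeholder of a deleted element
        return temp

    def display(self):
        # The array size is known, so a for loop suffices.
        for i in range(self.size):
            print(i, self.slots[i])
```

With the earlier example keys, removing 76 leaves a -1 at index 6, and 47 (stored at index 0 after its collision) is still found.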

Quadratic Probing

  • In quadratic probing, the probe distance grows with each step to avoid primary clusters.

  • step = i^2 where i is the probe number.

  • The distance from the original index grows with each probe: the first probe moves 1 slot, the second 2² = 4 slots, the third 3² = 9 slots, and so on.

  • Secondary clustering (virtual clusters) can still occur, because every key that hashes to the same index follows the same probe sequence.
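A sketch of the quadratic probe sequence, assuming probe positions of the form (home + i²) % size:

```python
def quadratic_probes(key, table_size, max_probes=5):
    """Indexes visited under quadratic probing: distances
    0, 1, 4, 9, 16, ... from the home slot."""
    home = key % table_size
    return [(home + i * i) % table_size for i in range(max_probes)]
```

For key 47 in a 7-slot table (home index 5), the first probes land at 5, 6, 2, 0, 0; any other key hashing to 5 visits exactly the same slots, which is the secondary-clustering problem noted above.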

Double Hashing

  • Get the step size per probe by applying a second hash function.

  • The step size depends on the key itself, so different keys follow different probe sequences, which avoids secondary clustering.

  • A typical secondary hash function is stepSize = 5 - (key % 5).

  • The purpose of subtracting from the constant (5 in the example) is to ensure a non-zero step size.
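The secondary hash from the notes, sketched with the constant 5:

```python
def step_size(key, const=5):
    """Double-hashing step: const - (key % const).
    Since key % const is at most const - 1, the result is never zero."""
    return const - key % const
```

For key 47 this gives 5 - 2 = 3, the step size used in the example below.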

Example

  • 76 is inserted, hashing to its position.

  • 93 is inserted at index 3.

  • Inserting 47 causes a collision at index 5; its step size is 5 - (47 % 5) = 3.

  • The search hops 3 slots at a time until a free slot is found, where 47 is stored.

  • With double hashing, the step size differs from key to key.

Implementation Details

  • Double hashing requires that the hash table size is a prime number.

  • A prime table size prevents the probe sequence from cycling through only a subset of slots, so the search can eventually visit every index.
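Why a prime size matters can be sketched by listing the slots a fixed step visits before the sequence repeats (a hypothetical helper):

```python
def slots_visited(start, step, size):
    """Slots a probe with a fixed step reaches before cycling."""
    seen = []
    p = start
    while p not in seen:
        seen.append(p)
        p = (p + step) % size
    return seen
```

With size 10 and step 2, only the five even slots are reachable, so the probe can loop forever over a full-looking table; with the prime size 11, the same step visits all eleven slots.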

Open Addressing Techniques

  • Collision-resolution techniques such as linear probing, quadratic probing, and double hashing each reduce clustering to a different degree.

Load Factor and Performance

Linear Probing

  • 50% load results in a sufficient speed.

  • 75% load is manageable, though speed is slightly affected.

  • 95% load is very slow for unsuccessful searches.

Quadratic Probing and Double Hashing

  • The situation is better than with linear probing.

  • 90% load is still manageable, but once the table approaches 100% full, no probing strategy helps.

Expansion

  • If the hash table runs out of space, the array must be expanded.

  • Extending the array involves rehashing and reinserting all elements, which is time-consuming.

  • Doubling the array size changes the modulus of the hash function, so every element must be rehashed and reinserted.

Separate Chaining

  • In the separate chaining approach, linked lists are used.

  • Each array slot holds a linked list of the elements that hash to it.

  • It avoids the strict memory constraints of open addressing and makes duplicates and deletion easier to handle.
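A minimal separate-chaining sketch, using ordinary Python lists as the chains:

```python
class ChainedHashTable:
    """Each slot holds a list of the keys that hash there."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def insert(self, key):
        # Duplicates are unproblematic: they simply join the chain.
        self.buckets[key % self.size].append(key)

    def find(self, key):
        return key in self.buckets[key % self.size]

    def remove(self, key):
        bucket = self.buckets[key % self.size]
        if key in bucket:
            bucket.remove(key)  # deletion is easy: no tombstones needed
            return True
        return False
```

Note how remove simply deletes from the list; no deleted-element placeholder is required, unlike in open addressing.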

Table Properties

  • Does not require the table size to be a prime number.

  • The load factor is much less of an issue.

Efficiency

  • Even with a load factor of 5 (five elements per slot on average), speed remains acceptable.

When to use

  • Open addressing is simpler, but separate chaining is the safer choice when the number of elements is unknown, since its performance degrades gracefully.

Hash Functions

  • They should be simple and quickly computed.

  • Ideally, keys are randomly distributed; a good hash function spreads even non-random keys evenly across the table.

  • Use a prime number for the table size (the modulus) to prevent probe sequences from looping over only part of the table.

Conclusion

  • Hash tables are ideal for high-speed searches, but memory utilization is often only around 50%, since the table must be kept partly empty.

  • Separate chaining with linked lists provides the most flexibility, whereas open addressing with a simple table is easier for small, stable data sets.

  • Each probing algorithm offers its own advantages in speed, but every choice comes with trade-offs.