hridyesh bisht for AWS Community Builders

Posted on Jan 20, 2021 • Edited on Mar 23, 2021

Intro into #Hashing ?

Hello, this blog contains the basic concept of hashing, hashing terminologies, various hashing functions, various collision resolution strategies, chaining with and without replacement, chaining using linked list and extendible hashing.

At the end it contains link to a GitHub repository, along with my another blog on encryption and decryption.

Concept of Hashing:

To search for a specific element, we will look at just a specific position. The element will be found there if it was kept there in the first place. Hence, to place an element, we will first calculate its location, place it there so that it can be retrieved from that position.

In hashing, the address or location of an identifier X is obtained by using some function f(X) which gives the address of X in a table.

Hashing Terminology:

Hash Function: A function that transforms a key X into a table index is called a hash function.
Hash address: The address of X computed by the hash function is called the hash address or home address of X.
Hash Table: The hash table is the sequential memory used to store identifiers.
Bucket: Each hash table is partitioned into ‘b’ buckets ht[0]……ht[b-1].
1. Each bucket is capable of holding ‘s’ records. Thus, a bucket consists of ‘s’ slots. When s = 1, each bucket can hold 1 record.
2. The function f(X) maps an identifier X into one of the ‘b’ buckets i.e from 0 to b-1.
Synonyms: Usually, the total number of possible values for identifier X is much larger than the hash table size. Therefore the hash function f(X) must map several identifiers into the same bucket.
1. Two identifiers I₁ and I₂ are synonyms if f(I₁) = f(I₂)
Collision: When two non identical identifiers are mapped into the same bucket, a collision is said to occur, i.e. f (I₁) = f(I₂). Hence all synonyms occupy the same bucket.
Overflow: An overflow is said to occur when an identifier gets mapped onto a full bucket.
1. When s = 1, i.e., a bucket contains only one record, collision and overflow occur simultaneously. Such a situation is called a Hash clash.
Load factor: If n is the total number of identifiers in the table and t is the table size, the load factor is :
1. lf = n/t It represents the fraction of the table that is occupied.
Perfect Hash Function : Given a set of keys k1, k2, … kn, a perfect hash function, f, is such that f(ki) != f(kj) for all distinct i and j.

For an example, Let us consider a hash table with b= 26 buckets (from 1 to 26) each having 2 slots, i.e. s = 2. Assume that there are 10 distinct identifiers each beginning with a letter.

Hash function f(X) = Number corresponding to the first letter of identifier i.e. if X = ‘A’, then f(X) = 1.

If the identifiers are GA, D, A, G, L, A2, A1, A3, A4, E . Their hash addresses will be 7,4,1,7,12,1,1,1,1,5 respectively. GA and G are :

synonyms. A, A2, A1, A3 and A4 are synonyms. When G gets hashed into bucket 7, a collision is said to occur. When A1 gets hashed into bucket 1, which already contains A and A2, an overflow occurs.

Hash Function:

Desirable Characteristics of a Hashing Function

It should be easily computable.
It should minimize the number of collisions.
The hash function should compute the address, which depends on all or most of the characters in the identifier.
It should yield uniform bucket addresses for random inputs. Such a function is called a uniform hash function.

1.Mid Square hash function:

This is a very widely used function in symbol table applications. It is the “middle of square” function, which is computed by squaring the identifiers and using an appropriate number of bits from the middle to obtain the bucket address.

Example: (the size of the hash table = 100), If X = 225, X² = 050625 then hash address = 06

For more information,

https://research.cs.vt.edu/AVresearch/hashing/midsquare.php

2.Division hash function

The modulus operator can be used as a hash function. The identifier X is divided by some number M and the remainder is used as the hash address of X.

f(x) = X mod M.

The best choice for M is that it should be a prime number.

Example: If X = 134 and M = 31 then f(X)= X mod M = 134 mod 31 =10

3.Folding hash function:

In this method, the identifier X is partitioned into several parts all of the same length except the last. These parts are added to obtain the hash address.

Addition is done in two ways.

1.Shift Folding: All parts except the last are shifted so that their least significant bits correspond to each other.

For example, the hash function could form groups of three from the key 12320324111220. To determine the index to which the key is mapped the hash function “shifts” each set of digits under the other and adds them

2.Folding on the boundaries: The identifier is folded at the part boundaries and the bits falling together are added.

Group the digits in the search key as in shift folding but the middle numbers are folded on the boundary between the first group and the middle group and they are thus reversed.

COLLISION RESOLUTION STRATEGIES

1.Open Addressing:

In an open addressing hashing system, if a collision occurs, alternative locations are tried until an empty location is found. Three techniques are commonly used for open addressing.

Features :

All the identifiers are stored in the hash table itself
Each slot contains an identifier or it is empty
Requires a bigger table for open addressing

1. Linear Probing or Linear Open Addressing :

In this method, the hash address of identifier X is obtained. If the slot in its bucket is empty, the identifier is placed there. If an overflow occurs the identifier is placed in the next empty slot after its bucket.

Advantage

It is a simple method of overflow handling.
The method can be easily implemented using simple data structures like arrays.

Disadvantage

An identifier whose bucket is full, will occupy the valid hash address of another identifier. Thus, one overflow will lead to many more.
If an identifier is not found at its computed address, a series of comparisons will have to be made in order to retrieve it.
If an identifier is not present in the bucket, the search will be terminated only the entire bucket is searched.
It creates a cluster of identifiers i.e. identifiers are concentrated in some parts of the table.
There is a possibility of the table becoming full.

2.Quadratic Probing:

In Quadratic Probing a quadratic function of i is used as the increment. when the identifier is not found in its bucket , we search in successive buckets using an increment i.

The hash function is quadratic: this method checks the buckets number computed using a quadratic equation. This ensures that the identifiers are fairly spread out in the table.

Advantages :

Avoids the clustering problem in linear probing to some extent
Faster insertion and deletion

Disadvantages:

There is no guarantee of finding an empty slot once the table becomes more than half full if the table size is not prime.
Although this method eliminates primary clustering, it does not eliminate another phenomenon called “secondary clustering” in which different keys that hash to the same value follow the same rehash path.

3.Rehashing:

In this method, if an overflow occurs, a new address is computed by using another hash function.

A series of hash functions f₁, f₂, …..f_n are used. To place an identifier X, f₁(X) , f₂(X) …..f_m (X) are computed till an empty slot is found to place the identifier.

To search for identifier X, they are used in the same order to search the identifier. If all the hash functions have been used and the identifier is not found, it means that the identifier is not there in the table.

Advantages

Does not lead to clustering

Disadvantages

The process may loop forever. This may happen when some rehash function results in the same hash address for an identifier.
It is difficult to find a series of hash functions yielding different addresses for the same identifier.
Rehashing is costly as each rehash function takes a finite amount of time.

Chaining:

1.Chaining without Replacement:

The hash address of an identifier X is computed.

If this position is vacant, it is placed there.
It its position is occupied by another identifier Y, the identifier X is put in the next vacant position and a chain is formed to the new position i.e. from Y to X.

Example: Identifier is 11, 32, 41, 54, 33.

Hash function: f(x) = X mod 10 and Table size = 10

Identifier 11 and 32 get hashed into positions 1 and 2 respectively.
The hash address of 41 is 1 but its position is occupied, hence it is put at position 3 and a chain is formed.
54 stored at position 4
The hash address of 33 is 3 but it is occupied. So 3 is placed at position 5 and a chain is formed from position 3 to 5 as shown.

Advantage:

Since identifiers are chained, we have to only use the chains to locate an identifier.
It is more efficient as compared to the previous methods.

Disadvantage:

The main idea is to chain all identifiers having same hash address (synonyms). However, when an overflow occurs, an identifier occupies the position of another identifier. Hence, even non-synonyms get chained together thereby increasing complexity.

2.Chaining with Replacement:

This is an improvement over the above method. As we saw above, even non synonyms get chained. To avoid this from happening, if we find that another identifier Y which is a non-synonym, is occupying the position of an identifier X, X replaces Y and then Y is relocated to a new position.

It will not make a difference to Y since Y was anyway not in its own bucket. But by placing X in its own bucket, we are improving the efficiency.

Let us consider the same example as earlier. Here, 11, 32, 41 and 54 get placed in the same way as before.

However, when 33 has to be placed, its position is occupied by 41. Thus, 33 replaces 41 and 41 is now put into the next empty slot i.e 5 and the chain from element 11 at position 1 is modified.

11,41===1
33===3
We will swap 41 & 33

Advantages:

Most of the identifiers occupy their valid positions.
Searching becomes easier since only the synonyms are chained.

Disadvantages:

Insertions and deletions take more time.

3.Chaining using linked lists:

All of the above methods suffer from a variety of problems.

Limited amount of space, due to which the table can become full.
The complexity of chaining which results in poor efficiency.

Obviously the best method would be to have unlimited amount of space for each bucket so that we could put in as many identifiers as we want without any overflow taking place.

Advantages

Most efficient method to resolve collision.
There is no limit on the number of identifiers.
Searching, Insertion and Deletions are done efficiently.
The hash table will never be ‘full’.

Disadvantages

If many identifiers are put into a single list, searching time within a bucket will increase.
Handling of the pointer array and linked lists is more complex as compared to simple arrays.

EXTENDIBLE HASHING

Reason for Extendible hashing :

Most records have to be stored in disk
Disk read/write operations much more expensive than operations in main memory
Regular hash tables need to examine several disk blocks when collisions occur
We want to limit number of disk accesses for find/insert operations

Concept :

Retrieval to be performed in two disk accesses and Insertion also in few accesses.
This can be achieved by using a tree structure.
The tree is an M-ary tree (degree M) which allows M-way branching.
This means that as the branching increases, depth of the tree decreases.
A complete Binary tree has height (log₂N) whereas a complete M-ary tree has height of (log _M/2 N). As M increases, the depth decreases.