limjy

Posted on Mar 10, 2021 • Updated on Apr 7, 2021

Data Structure - Hash Table

Summary of Leetcode explore card.

organizes data using hash functions in order to support quick insertion and search

Two different kinds:

Hash set
set data structure - no repeated values.
Hash map
map data structure - store (key,value) pairs

Hash Function

maps a big number or string to a small integer that can be used as index in hash table.

Concepts

Further reading on concepts here & for java specific info

Load Factor

number of of elements stored in table / number of buckets

average number of key-value pairs per bucket.

It offers tradeoff between time and space complexity

Load factor of Java Hashmap is 0.75

Capacity / Initial Capacity

number of buckets

initial capaity of Java HashMap is 16

Threshold

Current Capacity * Load Factor

Capacity of Hashmap is doubled each time it reaches threshold

In java, Hashmap created with initial capacity 16 & load factor of 0.75. Therefore threshold is 16 * 0.75 = 12.
So, capacity of HashMap increased from 16 to 32 after 12th element added to HashMap.

Rehashing

Hashing again. (eg. when capacity is increased)

When new HashMap object with new capacity is created, all old elements (key-value pairs) are placed into new object after recalculating their hashcode.

Initial capacity and load factor must be chosen such that they minimize the number of rehashing operations.

Example
Load factor is 1.0f. (Rehashing take place after filling 100% of current capacity)

saves space
increases retrieval time of existing elements

Load factor is 0.5f.

rehashing after 50% of current capacity filled.
increase number of rehashing operations.

Hash Table

use hash function to map keys to buckets

insert new key
- hash function decides which bucket key is assigned
- key is stored in corresponding bucket
search for key
- hash table uses hash function to find bucket
- search in bucket for key

eg. y = x % 5 is the hashing function
1987 & 2 are in same bucket. (bucket 2)
use hashing function, search bucket 2 for the key.

Two basic operations

Insert
Search

Designing a Hash Table

1. Hash Function

assign key to bucket as uniformly as possible

Ideally, key and bucket is a one-to-one mapping. However, there is always a trade off between amount of buckets & capacity of buckets.

Good Hash function must:

efficiently computable
uniformly distribute keys

2. Collision Reduction

collision is when > 1 key is assigned to 1 bucket

Ideally, hash function is one-to-one mapping for key & bucket.

Considerations

How to organize values in bucket?
What is too many values are assigned to one bucket?
How to search for target value in bucket?

Must consider:

capacity of bucket
number of keys that might be mapped to the same bucket

given N =number of keys in bucket

N: small & constant --> array to store keys O(N)
N: variable / large --> height balanced binary search tree to store keys O(log(n))

Handling Collisions

situation where newly inserted key maps to an already occupied slot in hash table.

Since a hash function produces a small number for a key which is a big integer or string. There is a possibility 2 keys result in same value. (eg. Birthday Paradox, need 23 people for 50% chance of sharing same birthday)

1. Separate Chaining

each cell of hash table point to a linked list of records that have same hash function value.

requires additional memory outside table.

Data Structures for Separate Chaining

Linked List
ArrayList
Trees

Separate Chaining - Complexity

Complexity depends on complexity of underlying data structure supporting Separate Chaining

Average

Average assumes uniform hashing

more details here

lf = load factor
n = number of elements
m = number of buckets

α = n / m

unsuccessful

O(α)
For unsuccessful find, linked list in separate chaining must be exhaustively searched. Average length of linked list in uniform hashing is load factor

successful

O(1+(α/2))
In successful find, linked list will definitely contain target key. On average, target key can be found in first half of linked list.

Worst

O(N)

Best

O(1)

2. Open Addressing

all elements stored in the hash table itself.

Since all elements stored in hash table, at any point size of table must be >= total number of keys.
Each table entry contains either a record or NIL. To search for element, examine table slots one by one.

Insert(k)
Keep probing until an empty slot is found & insert key

Search(k)
Keep probing until slot’s key == k or an empty slot is reached.

Delete(k)
Simply deleting a key may result in failures in search.
Slots of deleted keys are marked specially as “deleted”. Items can be inserted into deleted slot but search cannot stop at deleted stop.

Open Addressing - Implementations

1. Linear Probing

probe for the next spot

If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S

Problem - Primary clustering problem

tendency for a collision resolution scheme to create long runs of filled slots near the hash position of keys.

blocks of occupied cells can form. The cluster can grow bigger and reduce performance of hashmap. (must iterate through entire cluster to find value)

2. Quadratic Probing

Look for the next i²th slot in the ith iteration

If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S

Problem - Secondary Clustering

tendency to create long runs of filled slots away from hash position of keys

Still more efficient than primary clustering. It aims to break up primary clusters and probe more widely separated cells.

Primary and Secondary Clustering explained in more detail here

3. Double Hashing

Use another hash function hash2(x) and look for i*hash2(x) slot in i^th rotation.

If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S

Requirements

second hash function can never evaluate to 0
Ensure all cells can be probed

Open Addressing - Complexity

More details here

m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table

Load factor α = n/m  ( < 1 )

Average - Unsuccessful

average analysis assume uniform hashing
Unsuccessful Search : O(1/(1 - α))
insert : O(1/(1-α))
Derivation

probability of successful search = p

Probability 1st probe successful  = (m-n)/m = 1-α
Probability 2nd probe successful = (m-n)/(m-1) > 1-α

Therefore at every probe, probability is at least 1-α. So in worst case, for unsuccessful search minimum number of executions expected is 1/p = 1/(1-α)

Average - Successful

O((1/α) * log (1/(1-α))

Best Case

O(1)

Worst Case

O(N)

Open Addressing VS Chaining

Open Addressing: better cache performance

Chaining: less sensitive to functions

Hash Set

Hashset is an implementation of set data structures.
set data structures stores no repeated values

no repeated values

Under the hood, Hashset is backed by java HashMap that maps all its keys to a single constant object

Typical usage

check if value has appeared or not.

Set<Type> hashset = new HashSet<>();
for (Type key: keys) {
  if (hashset.contains(key)) {
    // there is duplicate in set
    return true;
  }
  // no duplicat add key
  hashset.add(key);
}

Hash Table

hash functions are used to link key and values

implementation of map data structure which stores data in (key,value) pairs

build mapping relation between key & information

used when user needs more information than key.

Under the Hood

Internally, it maintains an array of buckets.
When put(key,value) / get(key)

hascode() method of key object is called to find the bucket index.
If hashcode() > number of buckets, it is converted to a number <= number of buckets with formula index = hashCode(key) & (n-1). Bitwise AND operator is used so only lower bits of hashCode are considered.

Collision Resolution

Java HashMap resolves collisions uses separate chaining (LinkedList)

Once hashCode of key object is calculated. Iterate through Linked List of bucket to find the value / put the value at end of linked list.

More detailed readings can be found on GeeksforGeeks (contains step by steps process / case studies.) For more brief summaries: DZone & javarevisited.

Usages

// 1. initialize a hash map
Map<Integer, Integer> hashmap = new HashMap<>();
// 2. insert a new (key, value) pair
hashmap.putIfAbsent(0, 0);
hashmap.putIfAbsent(2, 3);
// 3. insert a new (key, value) pair or update the value of existed key
hashmap.put(1, 1);
hashmap.put(1, 2);
// 4. get the value of specific key
System.out.println("The value of key 1 is: " + hashmap.get(1));
// 5. delete a key
hashmap.remove(2);
// 6. check if a key is in the hash map
if (!hashmap.containsKey(2)) {
    System.out.println("Key 2 is not in the hash map.");
}
// 7. get the size of the hash map
System.out.println("The size of hash map is: " + hashmap.size()); 
// 8. iterate the hash map
for (Map.Entry<Integer, Integer> entry : hashmap.entrySet()) {
    System.out.print("(" + entry.getKey() + "," + entry.getValue() + ") ");
}
System.out.println("are in the hash map.");
// 9. clear the hash map
hashmap.clear();
// 10. check if the hash map is empty
if (hashmap.isEmpty()) {
    System.out.println("hash map is empty now!");
}

Get more information (other than key)

From array of indices, return indices of two numbers such that they add up to specific target

In this case, hashamp is needed, need to store the values and check that target - current_value can be found in the hashmap.

Common problems

array of integers, return list of indices where numbers add up to given target
check if two strings are isomorphic. (replace 1 character with another can get from first string input to second string input)

Aggregate by key

aggregate all information by key

The key to solving this category of problems is deciding strategy when existing key is encountered (eg. increment hashmap value by 1, replace original value with new value?)

Common problem

find first non-repeating character in string and return its index, else return -1

Deign Hash Key

designing the hash key is about building a mapping relationship between original information & key used by hashmap.

deciding what the hash key in hashmap should be

Hash Key needs to guarantee

All values belong to the same group will be mapped in the same group.

Values which needed to be separated into different groups will not be mapped into the same group.

hash function fulfills criteria 1 but NOT 2 Mapping function should satisfy both.

Example

Anagrams HashMap to groups anagrams (eg. 'eat','ate' same group. 'act' different group) Hash Key mapping could be to use sorted string. 'eat' & 'ate' will be mapped to key 'aet'.

Hash Function

Concepts

Load Factor

Capacity / Initial Capacity

Threshold

Rehashing

Hash Table

Designing a Hash Table

1. Hash Function

2. Collision Reduction

Handling Collisions

1. Separate Chaining

Data Structures for Separate Chaining

Separate Chaining - Complexity

Average

unsuccessful

successful

Worst

Best

2. Open Addressing

Open Addressing - Implementations

1. Linear Probing

Problem - Primary clustering problem

2. Quadratic Probing

Problem - Secondary Clustering

3. Double Hashing

Requirements

Open Addressing - Complexity

Average - Unsuccessful

Average - Successful

Best Case

Worst Case

Open Addressing VS Chaining

Hash Set

Typical usage

Hash Table

Under the Hood

Usages

Get more information (other than key)

Common problems

Aggregate by key

Common problem

Deign Hash Key

Example

more readings

Read next

Docker Commands for Beginners and Pros

AWS Security Group vs NACL: What’s the Difference?

How to Interpret Code Plagiarism Checker Results: A Beginner’s Guide

Dangerous Linux Configurations You Must Avoid for Secure Systems (Deep Dive)