DEV Community

limjy
limjy

Posted on • Updated on

Data Structure - Hash Table

Summary of Leetcode explore card.

organizes data using hash functions in order to support quick insertion and search

Two different kinds:

  1. Hash set
    set data structure - no repeated values.

  2. Hash map
    map data structure - store (key,value) pairs

Hash Function

maps a big number or string to a small integer that can be used as index in hash table.

Concepts

Further reading on concepts here & for java specific info

Load Factor

number of of elements stored in table / number of buckets

average number of key-value pairs per bucket.

alt text

It offers tradeoff between time and space complexity

alt text

Load factor of Java Hashmap is 0.75

Capacity / Initial Capacity

number of buckets

initial capaity of Java HashMap is 16

Threshold

Current Capacity * Load Factor

Capacity of Hashmap is doubled each time it reaches threshold

In java, Hashmap created with initial capacity 16 & load factor of 0.75. Therefore threshold is 16 * 0.75 = 12.
So, capacity of HashMap increased from 16 to 32 after 12th element added to HashMap.

Rehashing

Hashing again. (eg. when capacity is increased)

When new HashMap object with new capacity is created, all old elements (key-value pairs) are placed into new object after recalculating their hashcode.

Initial capacity and load factor must be chosen such that they minimize the number of rehashing operations.

Example
Load factor is 1.0f. (Rehashing take place after filling 100% of current capacity)

  • saves space
  • increases retrieval time of existing elements

Load factor is 0.5f.

  • rehashing after 50% of current capacity filled.
  • increase number of rehashing operations.

Hash Table

use hash function to map keys to buckets

  • insert new key
    • hash function decides which bucket key is assigned
    • key is stored in corresponding bucket
  • search for key
    • hash table uses hash function to find bucket
    • search in bucket for key

alt text
eg. y = x % 5 is the hashing function
1987 & 2 are in same bucket. (bucket 2)
use hashing function, search bucket 2 for the key.

Two basic operations

  1. Insert
  2. Search

Designing a Hash Table

1. Hash Function

assign key to bucket as uniformly as possible

Ideally, key and bucket is a one-to-one mapping. However, there is always a trade off between amount of buckets & capacity of buckets.

Good Hash function must:

  1. efficiently computable
  2. uniformly distribute keys

2. Collision Reduction

collision is when > 1 key is assigned to 1 bucket

Ideally, hash function is one-to-one mapping for key & bucket.

Considerations

  1. How to organize values in bucket?
  2. What is too many values are assigned to one bucket?
  3. How to search for target value in bucket?

Must consider:

  1. capacity of bucket
  2. number of keys that might be mapped to the same bucket

given N =number of keys in bucket

  • N: small & constant --> array to store keys O(N)
  • N: variable / large --> height balanced binary search tree to store keys O(log(n))

Handling Collisions

situation where newly inserted key maps to an already occupied slot in hash table.

alt text

Since a hash function produces a small number for a key which is a big integer or string. There is a possibility 2 keys result in same value. (eg. Birthday Paradox, need 23 people for 50% chance of sharing same birthday)

1. Separate Chaining

each cell of hash table point to a linked list of records that have same hash function value.

alt text

requires additional memory outside table.

Data Structures for Separate Chaining

  1. Linked List
  2. ArrayList
  3. Trees

Separate Chaining - Complexity

Complexity depends on complexity of underlying data structure supporting Separate Chaining

Average

Average assumes uniform hashing

more details here

lf = load factor
n = number of elements
m = number of buckets

α = n / m
Enter fullscreen mode Exit fullscreen mode

unsuccessful

O(α)
For unsuccessful find, linked list in separate chaining must be exhaustively searched. Average length of linked list in uniform hashing is load factor

successful

O(1+(α/2))
In successful find, linked list will definitely contain target key. On average, target key can be found in first half of linked list.

Worst

O(N)

Best

O(1)

2. Open Addressing

all elements stored in the hash table itself.

Since all elements stored in hash table, at any point size of table must be >= total number of keys.
Each table entry contains either a record or NIL. To search for element, examine table slots one by one.

Insert(k)
Keep probing until an empty slot is found & insert key

Search(k)
Keep probing until slot’s key == k or an empty slot is reached.

Delete(k)
Simply deleting a key may result in failures in search.
Slots of deleted keys are marked specially as “deleted”. Items can be inserted into deleted slot but search cannot stop at deleted stop.

Open Addressing - Implementations

1. Linear Probing

probe for the next spot

alt text

If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S 
Enter fullscreen mode Exit fullscreen mode

Problem - Primary clustering problem

tendency for a collision resolution scheme to create long runs of filled slots near the hash position of keys.

blocks of occupied cells can form. The cluster can grow bigger and reduce performance of hashmap. (must iterate through entire cluster to find value)
alt text

2. Quadratic Probing

Look for the next i2th slot in the ith iteration

If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Enter fullscreen mode Exit fullscreen mode

Problem - Secondary Clustering

tendency to create long runs of filled slots away from hash position of keys
alt text

Still more efficient than primary clustering. It aims to break up primary clusters and probe more widely separated cells.

Primary and Secondary Clustering explained in more detail here

3. Double Hashing

Use another hash function hash2(x) and look for i*hash2(x) slot in ith rotation.

If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
Enter fullscreen mode Exit fullscreen mode

Requirements

  1. second hash function can never evaluate to 0
  2. Ensure all cells can be probed

Open Addressing - Complexity

More details here

m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table

Load factor α = n/m  ( < 1 )
Enter fullscreen mode Exit fullscreen mode

Average - Unsuccessful

average analysis assume uniform hashing
Unsuccessful Search : O(1/(1 - α))
insert : O(1/(1-α))
Derivation

probability of successful search = p

Probability 1st probe successful  = (m-n)/m = 1-α
Probability 2nd probe successful = (m-n)/(m-1) > 1-α
Enter fullscreen mode Exit fullscreen mode

Therefore at every probe, probability is at least 1-α. So in worst case, for unsuccessful search minimum number of executions expected is 1/p = 1/(1-α)

Average - Successful

O((1/α) * log (1/(1-α))

Best Case

O(1)

Worst Case

O(N)

Open Addressing VS Chaining

Open Addressing: better cache performance

Chaining: less sensitive to functions

Hash Set

Hashset is an implementation of set data structures.
set data structures stores no repeated values

no repeated values

Under the hood, Hashset is backed by java HashMap that maps all its keys to a single constant object

Typical usage

check if value has appeared or not.

Set<Type> hashset = new HashSet<>();
for (Type key: keys) {
  if (hashset.contains(key)) {
    // there is duplicate in set
    return true;
  }
  // no duplicat add key
  hashset.add(key);
}
Enter fullscreen mode Exit fullscreen mode

Hash Table

hash functions are used to link key and values

implementation of map data structure which stores data in (key,value) pairs

build mapping relation between key & information

used when user needs more information than key.

Under the Hood

Internally, it maintains an array of buckets.
When put(key,value) / get(key)

  • hascode() method of key object is called to find the bucket index.
  • If hashcode() > number of buckets, it is converted to a number <= number of buckets with formula index = hashCode(key) & (n-1). Bitwise AND operator is used so only lower bits of hashCode are considered.

Collision Resolution

Java HashMap resolves collisions uses separate chaining (LinkedList)

Once hashCode of key object is calculated. Iterate through Linked List of bucket to find the value / put the value at end of linked list.

More detailed readings can be found on GeeksforGeeks (contains step by steps process / case studies.) For more brief summaries: DZone & javarevisited.

Usages

// 1. initialize a hash map
Map<Integer, Integer> hashmap = new HashMap<>();
// 2. insert a new (key, value) pair
hashmap.putIfAbsent(0, 0);
hashmap.putIfAbsent(2, 3);
// 3. insert a new (key, value) pair or update the value of existed key
hashmap.put(1, 1);
hashmap.put(1, 2);
// 4. get the value of specific key
System.out.println("The value of key 1 is: " + hashmap.get(1));
// 5. delete a key
hashmap.remove(2);
// 6. check if a key is in the hash map
if (!hashmap.containsKey(2)) {
    System.out.println("Key 2 is not in the hash map.");
}
// 7. get the size of the hash map
System.out.println("The size of hash map is: " + hashmap.size()); 
// 8. iterate the hash map
for (Map.Entry<Integer, Integer> entry : hashmap.entrySet()) {
    System.out.print("(" + entry.getKey() + "," + entry.getValue() + ") ");
}
System.out.println("are in the hash map.");
// 9. clear the hash map
hashmap.clear();
// 10. check if the hash map is empty
if (hashmap.isEmpty()) {
    System.out.println("hash map is empty now!");
}
Enter fullscreen mode Exit fullscreen mode

Get more information (other than key)

From array of indices, return indices of two numbers such that they add up to specific target

In this case, hashamp is needed, need to store the values and check that target - current_value can be found in the hashmap.

Common problems

  1. array of integers, return list of indices where numbers add up to given target
  2. check if two strings are isomorphic. (replace 1 character with another can get from first string input to second string input)

Aggregate by key

aggregate all information by key

The key to solving this category of problems is deciding strategy when existing key is encountered (eg. increment hashmap value by 1, replace original value with new value?)

Common problem

  1. find first non-repeating character in string and return its index, else return -1

Deign Hash Key

designing the hash key is about building a mapping relationship between original information & key used by hashmap.

deciding what the hash key in hashmap should be

Hash Key needs to guarantee

  1. All values belong to the same group will be mapped in the same group.
  2. Values which needed to be separated into different groups will not be mapped into the same group.

hash function fulfills criteria 1 but NOT 2 Mapping function should satisfy both.

Example

  1. Anagrams HashMap to groups anagrams (eg. 'eat','ate' same group. 'act' different group) Hash Key mapping could be to use sorted string. 'eat' & 'ate' will be mapped to key 'aet'.

more readings

https://home.cse.ust.hk/~dekai/2012H/lectures/l32_hash.pdf
http://courses.csail.mit.edu/6.006/fall11/lectures/lecture10.pdf
http://www.cs.toronto.edu/~ylzhang/csc263w15/slides/lec05-hashing.pdf

Latest comments (0)