Summary of Leetcode explore card.
organizes data using hash functions in order to support quick insertion and search
Two different kinds:
Hash set
set data structure - no repeated values.Hash map
map data structure - store (key,value) pairs
Hash Function
maps a big number or string to a small integer that can be used as index in hash table.
Concepts
Further reading on concepts here & for java specific info
Load Factor
number of of elements stored in table / number of buckets
average number of key-value pairs per bucket.
It offers tradeoff between time and space complexity
Load factor of Java Hashmap
is 0.75
Capacity / Initial Capacity
number of buckets
initial capaity of Java HashMap
is 16
Threshold
Current Capacity * Load Factor
Capacity of Hashmap is doubled each time it reaches threshold
In java, Hashmap created with initial capacity 16 & load factor of 0.75. Therefore threshold is 16 * 0.75 = 12
.
So, capacity of HashMap increased from 16 to 32 after 12th element added to HashMap.
Rehashing
Hashing again. (eg. when capacity is increased)
When new HashMap object with new capacity is created, all old elements (key-value pairs) are placed into new object after recalculating their hashcode.
Initial capacity and load factor must be chosen such that they minimize the number of rehashing operations.
Example
Load factor is 1.0f. (Rehashing take place after filling 100% of current capacity)
- saves space
- increases retrieval time of existing elements
Load factor is 0.5f.
- rehashing after 50% of current capacity filled.
- increase number of rehashing operations.
Hash Table
use hash function to map keys to buckets
- insert new key
- hash function decides which bucket key is assigned
- key is stored in corresponding bucket
- search for key
- hash table uses hash function to find bucket
- search in bucket for key
eg. y = x % 5
is the hashing function
1987 & 2 are in same bucket. (bucket 2)
use hashing function, search bucket 2 for the key.
Two basic operations
- Insert
- Search
Designing a Hash Table
1. Hash Function
assign key to bucket
as uniformly as possible
Ideally, key and bucket is a one-to-one mapping. However, there is always a trade off between amount of buckets
& capacity of buckets
.
Good Hash function must:
- efficiently computable
- uniformly distribute keys
2. Collision Reduction
collision is when > 1 key is assigned to 1 bucket
Ideally, hash function is one-to-one mapping for key & bucket.
Considerations
- How to organize values in bucket?
- What is too many values are assigned to one bucket?
- How to search for target value in bucket?
Must consider:
- capacity of bucket
- number of keys that might be mapped to the same bucket
given N
=number of keys in bucket
- N: small & constant -->
array
to store keysO(N)
- N: variable / large -->
height balanced binary search tree
to store keysO(log(n))
Handling Collisions
situation where newly inserted key maps to an already occupied slot in hash table.
Since a hash function produces a small number for a key which is a big integer or string. There is a possibility 2 keys result in same value. (eg. Birthday Paradox, need 23 people for 50% chance of sharing same birthday)
1. Separate Chaining
each cell of hash table point to a linked list of records that have same hash function value.
requires additional memory outside table.
Data Structures for Separate Chaining
- Linked List
- ArrayList
- Trees
Separate Chaining - Complexity
Complexity depends on complexity of underlying data structure supporting Separate Chaining
Average
Average assumes uniform hashing
more details here
lf = load factor
n = number of elements
m = number of buckets
α = n / m
unsuccessful
O(α)
For unsuccessful find, linked list in separate chaining must be exhaustively searched. Average length of linked list in uniform hashing is load factor
successful
O(1+(α/2))
In successful find, linked list will definitely contain target key. On average, target key can be found in first half of linked list.
Worst
O(N)
Best
O(1)
2. Open Addressing
all elements stored in the hash table itself.
Since all elements stored in hash table, at any point size of table must be >= total number of keys.
Each table entry contains either a record or NIL. To search for element, examine table slots one by one.
Insert(k)
Keep probing until an empty slot is found & insert key
Search(k)
Keep probing until slot’s key == k or an empty slot is reached.
Delete(k)
Simply deleting a key may result in failures in search.
Slots of deleted keys are marked specially as “deleted”. Items can be inserted into deleted slot but search cannot stop at deleted stop.
Open Addressing - Implementations
1. Linear Probing
probe for the next spot
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
Problem - Primary clustering problem
tendency for a collision resolution scheme to create long runs of filled slots near the hash position of keys.
blocks of occupied cells can form. The cluster can grow bigger and reduce performance of hashmap. (must iterate through entire cluster to find value)
2. Quadratic Probing
Look for the next i2th slot in the ith iteration
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Problem - Secondary Clustering
tendency to create long runs of filled slots away from hash position of keys
Still more efficient than primary clustering. It aims to break up primary clusters and probe more widely separated cells.
Primary and Secondary Clustering explained in more detail here
3. Double Hashing
Use another hash function hash2(x) and look for i*hash2(x) slot in ith rotation.
If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
Requirements
- second hash function can never evaluate to 0
- Ensure all cells can be probed
Open Addressing - Complexity
More details here
m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table
Load factor α = n/m ( < 1 )
Average - Unsuccessful
average analysis assume uniform hashing
Unsuccessful Search : O(1/(1 - α))
insert : O(1/(1-α))
Derivation
probability of successful search = p
Probability 1st probe successful = (m-n)/m = 1-α
Probability 2nd probe successful = (m-n)/(m-1) > 1-α
Therefore at every probe, probability is at least 1-α. So in worst case, for unsuccessful search minimum number of executions expected is 1/p = 1/(1-α)
Average - Successful
O((1/α) * log (1/(1-α))
Best Case
O(1)
Worst Case
O(N)
Open Addressing VS Chaining
Open Addressing: better cache performance
Chaining: less sensitive to functions
Hash Set
Hashset is an implementation of set
data structures.
set
data structures stores no repeated values
no repeated values
Under the hood, Hashset is backed by java HashMap that maps all its keys to a single constant object
Typical usage
check if value has appeared or not.
Set<Type> hashset = new HashSet<>();
for (Type key: keys) {
if (hashset.contains(key)) {
// there is duplicate in set
return true;
}
// no duplicat add key
hashset.add(key);
}
Hash Table
hash functions are used to link key and values
implementation of map
data structure which stores data in (key,value)
pairs
build mapping relation between key & information
used when user needs more information than key.
Under the Hood
Internally, it maintains an array of buckets.
When put(key,value) / get(key)
- hascode() method of key object is called to find the bucket index.
- If hashcode() > number of buckets, it is converted to a number <= number of buckets with formula
index = hashCode(key) & (n-1).
Bitwise AND operator is used so only lower bits of hashCode are considered.
Collision Resolution
Java HashMap resolves collisions uses separate chaining (LinkedList)
Once hashCode of key object is calculated. Iterate through Linked List of bucket to find the value / put the value at end of linked list.
More detailed readings can be found on GeeksforGeeks (contains step by steps process / case studies.) For more brief summaries: DZone & javarevisited.
Usages
// 1. initialize a hash map
Map<Integer, Integer> hashmap = new HashMap<>();
// 2. insert a new (key, value) pair
hashmap.putIfAbsent(0, 0);
hashmap.putIfAbsent(2, 3);
// 3. insert a new (key, value) pair or update the value of existed key
hashmap.put(1, 1);
hashmap.put(1, 2);
// 4. get the value of specific key
System.out.println("The value of key 1 is: " + hashmap.get(1));
// 5. delete a key
hashmap.remove(2);
// 6. check if a key is in the hash map
if (!hashmap.containsKey(2)) {
System.out.println("Key 2 is not in the hash map.");
}
// 7. get the size of the hash map
System.out.println("The size of hash map is: " + hashmap.size());
// 8. iterate the hash map
for (Map.Entry<Integer, Integer> entry : hashmap.entrySet()) {
System.out.print("(" + entry.getKey() + "," + entry.getValue() + ") ");
}
System.out.println("are in the hash map.");
// 9. clear the hash map
hashmap.clear();
// 10. check if the hash map is empty
if (hashmap.isEmpty()) {
System.out.println("hash map is empty now!");
}
Get more information (other than key)
From array of indices, return indices of two numbers such that they add up to specific target
In this case, hashamp is needed, need to store the values and check that target - current_value
can be found in the hashmap.
Common problems
- array of integers, return list of indices where numbers add up to given target
- check if two strings are isomorphic. (replace 1 character with another can get from first string input to second string input)
Aggregate by key
aggregate all information by key
The key to solving this category of problems is deciding strategy when existing key is encountered (eg. increment hashmap value by 1, replace original value with new value?)
Common problem
- find first non-repeating character in string and return its index, else return -1
Deign Hash Key
designing the hash key is about building a mapping relationship
between original information & key used by hashmap.
deciding what the hash key in hashmap should be
Hash Key needs to guarantee
- All values belong to the same group will be mapped in the same group.
- Values which needed to be separated into different groups will not be mapped into the same group.
hash function fulfills criteria 1 but NOT 2
Mapping function should satisfy both.
Example
- Anagrams HashMap to groups anagrams (eg. 'eat','ate' same group. 'act' different group) Hash Key mapping could be to use sorted string. 'eat' & 'ate' will be mapped to key 'aet'.
more readings
https://home.cse.ust.hk/~dekai/2012H/lectures/l32_hash.pdf
http://courses.csail.mit.edu/6.006/fall11/lectures/lecture10.pdf
http://www.cs.toronto.edu/~ylzhang/csc263w15/slides/lec05-hashing.pdf
Top comments (0)