Posted on Nov 11, 2019 • Edited on Jun 11, 2020

Hash Tables

#computerscience #algorithms #datastructures #csharp

Resources

Hash table overview
Hashing explanation
Hash table explanation
Hash table overview + implementation
In depth explanation - warning this one is a long read (~60 mins). But there is too much good, approachable, info in there to not mention it.

Takeaways:

A hash table is essentially an array of buckets.
Each bucket contains a key value pair (KVP).
When inserting a KVP into a hash table, the key is hashed and modulo'd by the max size/capacity of the array. The resulting integer is the index of the array the KVP should be inserted at.
Hashing of keys can be done using a variety of hashing functions*.
No hashing function is perfect and collisions do happen.
Collisions are when we try to insert a KVP at a bucket that is not empty. This can happen when the same hash is generated for different keys. When the generated hash is the same, the resulting indexes are the same.
There are two ways to perform collision resolution: Separate Chaining and Open Addressing (also known as Closed Hashing).
Separate Chaining means that each bucket in our array is a Linked List. When a collision occurs the KVP gets added to the Linked List. This means multiple KVPs can reside at the same index.
Open Addressing uses a technique called probing to find the next available index to insert a KVP at. The order you search for an available bucket is the probe sequence. One type of probe sequence is linear probing - where we just increment the index by 1 until we find a free bucket in the array.
One thing that determines how likely collisions are is the Load Factor of a hash table's underlying array. Load factor is: Number of KVPs/Number of Buckets.
More robust hash table implementations will resize the underlying array when the Load Factor has reached a certain point. In my implementations I resize the arrays when the Load Factor reaches 0.75.
Searching, Inserting, and Deleting are all worst case O(n) (linear), but for each the amortized time is O(1) (constant). This is because collisions should be rare (if you use a good hashing function and keep the Load Factor below a certain threshold).
Space for hash tables is O(n).

*The most common hashing functions I came across were: DJB2, FNV-1a, and Murmur. FNV-1a is pretty simple, as is DJB2. As it turns out FNV-1a and Murmur are probably the best hashing functions to rely on. I personally like FNV-1a for how simple, yet effective, it is.

Below is the finished implementation of two Hash Tables (with test code). One is implemented using separate chaining and the other open addressing:

As always, if you found any errors in this post please let me know!