Intro to Hash Tables (JS Objects Under the Hood)

#hash #beginners #dsa #javascript

In the world of data structures and algorithms Hash Tables are extremely prevalent. As someone who primarily is working in JavaScript- I haven't really had to deal with them-- because like so many other things- Javascript abstracts them away (spoiler: they are Objects). However in the interest of learning DSA material I spent some time this weekend looking into them and hoped to share what I've learned to help demystify this common data structure-- and to give a better look into how an HOW an Object stores it's data and then goes and retrieves a value when you give it a key.

To understand the inner-workings of a Hash Table let's run through an imaginary problem checking whether or not an array includes a value.

We have an array of [1, 3, 4]. How can we check whether this array includes the number 5? The easiest solution is to just iterate thru the array-- checking each value and seeing whether or not it equals 5- and ultimately returning false since the above array doesn't have a 5. This is fine but this solution is done in O(n) time- that is, the time it takes to solve this problem depends on the size of the array. If we had an array of length 10k and we wanted to check if it included a specific value it would be very time consuming- in the worst case we would have to check ALL 10k indices before we could answer that question. So with this in mind how can we solve this problem in O(1) or constant time. How can we instantly go and get the answer to whether or not our array contains a specific value- regardless of it's length?

Let's take another approach--we could use an array of booleans to represent whether or not the value of that index is contained within our original set of values -(ie a true at index 1 means that the number 1 is contained)- this would look something like:

Values:     1     3  4
 Index:  0  1  2  3  4
   Arr:[ F, T, F, T, T ]

With this we can check whether the values contains a value in O(1) time- since all we need to do it visit that index and check for T/F.

Now that we have a super simple example set-up a problem becomes clear-- what if the values contained a large number (ie 100)? We would have to fill the array with 90+ more values or F before we could indicate T at index 100. Obviously this is completely inefficient- so in order to get around this we need to come up with a way that the length of our array can better correspond to the actual number of values it represents. A common example of how we could manipulate our values to fit within a smaller array is to take their modulo ten and use THAT as the index in which the T/F will be stored.

Our new set of values contains : 1, 3, 4, 77 , and 100
77%10=7 and 100%10=0 so those indices will now contain T

Values: 100    1     3  4        77        
   Arr:[ T, T, F, T, T, F, F, F, T, F, F ]

Now that we've seen this- lets make our array a bit more complex, and actually store Key/Value Pairs within it to better reflect the actual value of the whatever is contained at a given index- just seeing that 0/7 are T doesn't do a good job of reflecting that the underlying values they represent are 100 and 77.

Since this is an under the hood look of how an Object is implemented- we can't just use an Object for this-- instead we will use another array where the first index is the key and the second is the value

Our new collection contains : 1, 3, 4, 77 , and 100

 Arr:[ 
    [100,T], 
    [1, T], 
    F, 
    [3, T], 
    [4, T], 
    F, 
    F, 
    F, 
    [77, T], 
    F, 
    F ]

Now let's add a 17 so we can see another problem: COLLISIONS. With our current system we decide where something is stored based on it's modulo 10-- so now we have two conflicting values that both want to be stored at index 7 (7 AND 77). Instead of overwriting the 77 we can just add another Key/Value pair array to index 7. Storing multiple values at one location like this is called SEPARATE CHAINING- and is just one of many ways to handle collisions.

Value at index 7
    [77, T] ------> [ [77,T], [17,T] ]

This is cool- but it's awfully convenient that our values are numbers-- what would happen if we wanted to do something like this but with strings? In comes actual HASHING- the process of taking a value and converting it to some sort of numeric code that represents it. In reality Hashing is done via some very complex math that you can look into on your own but ultimately it is just the process of converting something into a numeric code.

Now let's pretend our values contain the strings "Dog" and "Cat" with dog's value being a 5 and cat's being a 3. An example of a fake hashing function would be to use the combined ASCII value of each character in the string to determine its hash code. I'm feeling lazy so we will PRETEND that the combined ASCII value of 'Dog' is 31 and 'Cat' is 23.

Cool- now we would just make another array and store the values at the proper index. Once again we'll use %10 so as to keep our array down to only ~10 length-but now we will be using the actual hash code to determine where to place our animal strings-- Dog will go to index 1 and Cat to Index 3

 Arr:[ 
    F, 
    ['Dog', 5], 
    F, 
    ['Cat', 3], 
    F, 
    F, 
    F, 
    F, 
    F, 
    F, 
    F ]

The big thing here is that via an actual hash function we can turn ANY type of data into a numeric code- and then use that code to place it within our array. We can then access the data in 0(1) time using the proper index (though it can take more if we have multiple values stacking up in one location due to separate chaining)- which is far more efficient than traditional looping.

One last concept to look at is what is called Load Factor (represented w/ a lambda). What would happen if we had a collection of 1000 strings to store? We already know that we want to keep the length of our array in check-- but what will end up happening is that we will end up with a bunch of values within each index due to separate chaining-- and if we allow THAT to happen then we'll have slowed our hash table down which defeats the whole point. Load Factor is the idea of maintaining this balance and is calculated via:

Load Factor = (number of Key/Value pairs) / (length of array)

When utilizing separate chaining we always want a Load Factor of 1 or below (that is the length of the array is always greater than or equal to the number of pairs it stores). Utilizing this concept we can resize our array whenever this balance is our of proportion.

...And that's it- a super brief overview of the inner-workings of a hash table.
The takeaway from all of this is that instead of just storing things in an Array/List and looping through it over and over- we can go the extra mile by hashing our data and placing it into a specific index. This bit of extra work pays off when we can quickly go and find our data down the line.

To boil all of this down into a sentence-- a hash table is just an array of key/value pairs that uses complicated math to determine WHERE/HOW to store that data so that it can be quickly accessed later.

Like so many things in the world of coding--it's basically just an array- but hopefully this post has helped a bit to demystify what a hash table is AND why it's used.

Thanks for reading and please leave any questions/comments!

Thanks-

Logan