Cover photo by Dan Deaner via Unsplash
There are many advantages to arrays as a data structure that make them ideal for certain scenarios, but make them quite unflatteringly slow when using their built-in methods in others. In this article, we'll have a look at some alternative data structures that makes the job much more efficient.
The right tool for the right job
In the previous article we explored Big-O notation so that we can make better decisions about how to analyze our algorithms to achieve better performance. We know that arrays are great when you're accessing an element by index (O(1)), and are great to use when mutations occur at the end of the array (O(1)), so if we can avoid mutations at the beginning of long arrays, our code will perform best. We can even improve the performance of shift()
and unshift()
by slicing up the array into multiple parts, with the overhead of having to keep track of indices of elements in each slice separately. Laborious, but depending on the choice of slicing, it might be quite fast.
There's one operation that seems to be unavoidably slow for arrays, though: indexOf()
, and its related functions, find()
, findIndex()
and includes()
. The latter three functions are just convenience functions that use indexOf()
internally, so their performance is identical, if we ignore the cost of the function passed as the parameter.
The O(n) performance means that an array twice as big will take twice as long to search. We can do better. Much better.
Values, variables and references
You may be aware that JavaScript has two types of values: primitives and objects. Variables can refer to these primitives and objects by assigning those values to a name. When a variable references a value, we say that it contains a reference to the value.
const a = 3; // variable `a` points to the primitive value `3`
const b = {}; // variable `b` points to an object instance
The difference between primitives (like null
, undefined
, booleans, strings and numbers) and all the other objects is that primitives are immutable -- only one copy ever exists at any given time within the same environment, and they cannot be changed after they're created. No matter how many times you create the same string or number, the result will be the same:
const a = 3; // we create the integer `3` and assign it to `a`
const b = 3; // `3` already exists, so `b` points to the same number
const c = 'hello'; // we create the string 'hello' and assign to `c`
const d = 'hello'; // 'hello' exists, so `d` points to the same string
When we say that we compare references, we mean using strict equality (===
), which compares two values to see if they are pointing to (referencing) the same thing. Given the above, we should expect the following to all be true:
Still with me? Here's where it gets interesting: whenever you create an object (i.e. not a primitive value), JavaScript allocates new memory for the object, regardless of what it contains, and returns a reference to it. A reference is a kind of unique address for that value, a way for the runtime to know where to look for a value when needed.
And yes, arrays are objects too, so the same rules apply. Let's put it to the test:
Even if the objects contain the same primitive values with the same keys, they will have unique references.
There are two data structures that take advantage of this property to great effect: Set
and Map
.
Keeping track of references using Set
Conceptually, references are numbers that JavaScript uses to find the values in memory for a particular value. Those numbers are hidden inside the internals of the JavaScript engine, but some built-in objects have access to them and this enabled them to provide some unique capabilities.
With arrays, checking that a value is present in it requires searching the elements one by one and seeing if any of the references match the one we're searching for. Set
, on the other hand, uses references as numbers to search for a number using binary search trees.
Imagine you have a huge pile of manuscript pages on your desk. You know the pile is ordered, but some of the pages are missing, so you don't have a good idea of where exactly a particular page is, if it is in the pile at all.
You can peek at the top and bottom pages and see that they range between 1 and 1000. Someone asks you to check if page 314 is in the pile. How would you search?
Going from top to bottom would mean it would take you up to 314 steps, so that's not quite efficient. But what if we just pick the middle of the pile to see how close we are?
Let's split the pile roughly in the middle and look at the top page of the bottom half. We discover it's page 562:
|1.....................562.....................1000|
^
Hm, that means it must be in the top part. Let's split the top part again:
|1.........193.........562|
^
OK, too far, it's in the lower half now:
|193...397...562|
^
Close! At this point, would you just flip through the pages to try to find the elusive 314 or continue to split the pile? How do you know when to stop? Which approach would be faster, assuming that splitting the pile takes as much time as flipping a single page? How many steps would you need to finish the task by only splitting the pile?
Let's test this out in code and see how well it performs against a page-by-page search:
Depending on the random number list, the list might or might not contain the number 314
. You'll notice, however, that there's a stark difference in the amount of steps needed to find (or not find) the value in the random number array.
This approach is called the binary search algorithm. It belongs to a whole family of related algorithms that have different speed and memory tradeoffs that can be applied to specific cases for maximum effect. The expected complexity of the binary search algorithm is O(log2 n). In contrast, includes()
uses a linear search algorithm, which has a complexity of O(n).
The Set
is a data structure that uses those internal IDs within the JavaScript engine to be able to quickly search through the pile for a given reference and determine if it is in the pile or not.
So how does that compare to Array::includes
? Here's a benchmark result on my laptop that compares the runtime performance of using either method on an array of 100k integers:
The higher the op/s (operations per second), the better. In this example on Chrome 73, using a Set
to determine whether the chosen number is in the list of numbers is more than 1000 times faster! Here's a link to the benchmark so you can test it out yourself.
Of course, this won't always mean that one method is 1000 times faster; it just means that on the scale of 100k elements, Set
ends up being 1000 times faster in this specific example. It will depend on the number of elements you have, and the smaller the set, the less noticeable the difference will be. In most cases involving more than, say, a hundred elements, you should see an improvement of orders of magnitude.
When to use Set
If the problem you're solving requires testing whether a given value is part of a set of values, then this is the data structure for you. Here are a couple of examples:
Creating associations between values with Map
If Set
lets you easily look up references in a set, Map
lets you associate that reference with another, essentially mapping one value to another. Before we get into it, let's try to model this behaviour using an array.
To do this we'll start with an array containing a pair of values, or a tuple. A tuple is an ordered list of values, and in our case, our tuples will contain a key and a value.
Note: for this example to work correctly, the keys in the tuples must be unique across the list, otherwise duplicates will terminate the search on the first occurence.
// we can use any type of reference as the key, so let's create an object
const three = { value: 3 };
// construct the list as an array of arrays
const list = [
['one', 'eins'],
[2, 'zwei'],
[three, 'drei']
];
Next, we need a lookup function. This will take a list and a key, and return the associated value, or undefined
if not found.
const get = (list, key) => {
const pair = list.find(
(pair) => pair[0] === key
);
return pair !== undefined ? pair[1] : undefined;
};
Let's test it out:
Since find()
is a linear search, its complexity is O(n), which is far from ideal. And this is where Map
can really bring in the big guns.
Just as with Set
, it contains a has(key)
method which returns a true
or false
based on reference equality. It also has a get(key)
method, which allows us to get the associated value by key.
Now you might be thinking, wait, couldn't we just use objects for this? The answer is yes, so long as all of your keys are strings, otherwise you're setting yourself up for failure. If you wanted to have a lookup by string, a plain old object would do just fine:
const germanNumbers = {
one: 'eins',
two: 'zwei',
three: 'drei'
};
const key = 'one';
germanNumbers[key]; // 'eins'
But this strategy falls flat if you try to assign a key that isn't a string, since all object property lookups get converted to a string first. You wouldn't be able to look up a value given an object reference, since objects are cast to strings, resulting in "[Object object]"
by default. And you can't differentiate between 2
(a number) and "2"
(a string).
This is the reason we had to implement the list as an array of key, value pairs and use ===
to compare the values. Map
works by letting you assign any reference as the key, not just strings.
Additionally, it enjoys the same speed benefits as Set
does, so looking up values in the map also has a complexity of O(log2 n). How about a quick race to see just how fast?
When to use Map
Map
can be used to preserve references in cases where you cannot convert a key to a string, or want to avoid casting other primitive values to strings. Its performance is a little worse than object property or array index access (O(log2 n) instead of O(1)).
The most common use case is when you want to create associations between objects. There are generally two ways you can do this:
- you can assign the associated value to a property on the object; or
- you can generate unique IDs and use those to look up the values.
The first method can create cyclic references, which makes it impossible to convert those objects to JSON strings. The second requires a lot of bookkeeping for each value being referenced, and can often be impractical and slow to implement.
This is where a Map
offers a way out:
We can create many different associations by just creating more maps, and we never have to modify the objects.
Caveats to consider when dealing with JSON data
While this means that the values being mapped can still be converted to JSON strings, the Map
s themselves cannot, since there is no way to serialize references. In this case, generating unique keys is a necessity, but keeping track of which objects need to have their IDs generated can be handled by another Map
instance and used in the replacer
function of JSON.stringify()
. Similarly, a reviver
function can recreate the maps. I wrote an article on this that you might find useful:
Conclusion
If your data requires you to iterate over a collection in order to check the presence of a key or to look up a value, you might consider using Set
and Map
to use as a data structure instead of arrays. They offer a fast and safe way to look up values, and you can iterate over them or convert them back into strings if needed.
Next time, we'll have a look at their weakly-referenced siblings, WeakSet
and WeakMap
!
Top comments (0)