Using the big O notation in the wild

#computerscience

I see a lot of people getting scared by the big O notation, and I get it. It is a math thing, and math is scary for a lot of people.

I have a masters degree in Computer Science, and truth be told, the first year or so the big O notation was very scary for me too. But for each year of professors talking about it, it got less scary.

In the "wild", that knowledge has turned out to be useless in the vast majority of the day-to-day work. But there is one pattern that I've seen repeated by a lot of people, which has the potential to make server response times and load worse than it they had to be if they had known a simple trick tied to the big O notation.

The scenario

Imagine the follow scenario; you have two different services you are integrating with. They both return a list of people. One only has data about which vocation the person has, and one only has the name of the person. You want to combine these to a single list. In this scenario, they share an id, but the list is not sorted.

The pattern that I've observed is to loop through one of the lists, and in that loop try to .find the corresponding item in the second list. It works, and makes a lot of sense when you read it. It also has a worse time complexity (the thing big O "measures") than it could have had.

Analyzing the time complexity

What big O notation means is when the number of people in the list increases, how does that affect the running time? (or memory, but lets focus on time). It also is only concerned with the worst possible case, that is maximum bad luck with how the list is sorted.

In our case, for every person in the list, we will have to potentially look at every other person in the other list to pair them together. Lets call the number of people in the list n, which is the (stupid?) norm. In order to create one complete person we would have to potentially look at n items in a list by using .find.

This work has to be repeated for all n persons. That means the time complexity will be O(n*n) or O(n^2). We can do a lot better.

The easy fix

The thing we can make faster is finding the corresponding person in the other list. In javascript, we can use a simple object {} for this. In other languages you should look for something called Map or dict or similar.

What you want to do is some prework to create a lookup object where you can get the name object if you have the id. It would look something like this:

const lookup = {
  "13": {
    name: "Robert",
    id: "13",
  },
  "52": {
    name: "Julia",
    id: "52",
  },
}

That way, instead of using .find to get the name, you can do lookup[id]. The benefit is that getting a value from an object is, through the magic of computers, super fast. In fact, when we talk about it in big O notation we don't even take note of it. It is said to take constant time. What it means is that is does not matter if the object has 100 or 10000 items in it, it takes about the same time to fetch the item.

If you don't take my word for it, I made a codepen for you to look at.

The time complexity of the fix

Since the lookup[id] takes constant time, we don't count it in big O notation. That means for we still have to look at each person one time, and for that person we only do constant work. That means we are down to O(n). That is a great time complexity for most problems, and it is the best we can do in this case.

However, we have the added work of the prework. We actually loop through the list twice. That is O(2*n), but a weird thing about the big O notation is that we simply remove any constant numbers. So that means that O(2*n) = O(n).

That weirdness is due to the fact that the big O notation does not actually translate to how long an algorithm takes. It only translates to the order of magnitude of the time increase when the input size increases.

Time complexity is rarely important in day-to-day work, but it does come up once in a while. What you can take with you from this article is that as soon as you see a loop inside of another loop, you might be looking at code with bad time complexity. Think through if it matters, and if you can use the trick above to fix it.

Top comments (9)

Nested Software • Jun 16 '19 • Edited

It can be surprising how effective this kind of technique can be. I suppose it can be overdone, but in general the wisdom of modern computing seems to be that trading lower cpu usage for increased memory usage is often a good idea. I was just playing around with a minimax algorithm for tic-tac-toe, and it was a similar story. I hadn't really thought things through and just dove into the implementation. My (admittedly naive) python code took 1.5 minutes to play a game, which I did not expect. Simply adding caching for positions in a dict took that number down to 0.3 seconds though!

James McPherson • Jun 27 '19

A minor niggle - I had to re-read your comment about the prework adding an extra loop iteration, because of the way you expressed it.

You had O(n*2), which my brain read as O(n^2), so seeing the comment about dropping constants was jarring. If you'd expressed it as O(2*n) or O(2n) it would have been much more mathematical and the comment would have flowed.

As to whether we need to take account of time complexity in day to day work - that depends on what your problem space is :-)

Peter Nycander • Jun 28 '19

Valid point with the 2*n n*2 thing, I changed it to 2*n. Thanks for the feedback.

Yeah, maybe I'll work with something that requires thinking more about time complexity in the future. But I think most people are like me, and work with things that rarely creates the opportunity to utilize any knowledge of time complexity.

tiborveres-ired • Jun 16 '19

lookup[id] is not constant but O(log n)

Peter Nycander • Jun 17 '19

Interesting! I had no idea, thanks! Which data structure is actually being used? I thought it was a hash table...

tiborveres-ired • Jun 17 '19

well. it could be a hash table, but then it's even worse (for big n-s)
as with hash table and big n-s there will be hash collisions, so it needs to do a find in the collision list
best case would be a balanced tree

Peter Nycander • Jun 17 '19

Hmm, I wonder at what n you start to get enough collisions for it to matter in practice. I guess it is up to the implementation, and up to the browsers to create a good heuristic.

When I have used SpiderMonkey (and node.js to some extent) solve problems on kattis, I have never run into a problem of object lookup being a bottleneck. Maybe in some special case?