Pato Z

Posted on Sep 28, 2021

A Strange Request

#database #performance #optimization #broccoli

I'm planning a fancy dinner party for our friends, and you being my friend, decide to help me with the arrangements. Everything is going well until I ask you...

A Strange Request

Hey, I need you to run to the Library, you know that building full of books we used to visit in ancient times, before the Internet?
Yes, I know what a Library is, thankyouverymuch.
I need you to find me the title of the first five books about broccoli ever published, it's a matter of life and death.

Being the brains behind this operation, you first worry about my mental sanity, but then dubbing me a bit eccentric, you decide to humor me. Before heading out to the Library, you take out a piece of paper and jot down my request:

SELECT book titles
FROM books in the Library
WHERE the book topic is about 'broccoli'
ORDER BY publication year
LIMIT 5

(I sneak a peek at your note and think to myself "What a strange fella").

Road trip

Clearly you don't live in the Library so to get there you'll need to take some means of transportation. And today your transport of choice is a TCP Socket.

Never mind what happens in the Library, a lot can go wrong on the way there and back: you could get lost, take too long, decide to stop for coffee on your way back, even get to the Library only to find out that it's temporarily closed due to a bug infestation.

I, being the inquisitive chef, start preparing for the worst, including network errors and socket timeouts.

Luckily the roads are clear, so a couple of milliseconds later you arrive at the Library.

The only sensible way

Fresh through the door, you see an impressive building stuffed from top to bottom with rows and rows of books.

Eager to get your broccoli quest out of the way you do what any sensible human being would do in this situation: you pick the first book in the first shelf and read it from cover to cover to see if it's about broccoli or not, you make a note of its broccoliness and its publication year and continue with the second one, and so on until you've read the whole library. Then you find the five earliest publications and head back home.

Unfortunately years have passed, the dinner party was a sham and we are no longer on speaking terms. This strategy was too inefficient. It could've worked if there were only a couple of books in the Library but this was not the case.

What you just did is known as a table scan (also known as a collection scan) and in most cases you'll lose friends and ruin parties because of it.

We need a better approach, luckily librarians have solved this problem before.

A bunch of cards

Libraries since forever had this great tool called The Catalog. Library catalogs, before computers, were big pieces of furniture, chests of drawers filled with small cardboard cards. Each card contained information about where to find books according to some criteria. An alternative name for these "catalog furniture things" is indexes (or "indices" if you're feeling particularly fancy).

Armed with this knowledge you venture into the mysterious Room of Catalogs. In there you find lots of catalogs with book reference cards for many different types of queries. You have "books by size", "books by weight", "books by nutritional value", ... etc. Browsing through them you find two indexes that catch your attention: "books by topic" and "books by publication year".

These two should come in handy but which one to choose?

A plan comes together

So you have a choice to make, should you go check the topic index or the publication index? The answer you can already foresee: "it depends".

If the library had only books on green veggies, then the chances of finding a broccoli book in there would be pretty high. In other words the "broccoli" card of the topic index would be very very long. If this was the case, maybe going to the publication index and starting from the oldest known year, searching those books for broccoliness until you find five would make sense.

In non-veggie only libraries though this strategy would be disastrous, so you decide you'll go for the topic index.

But before jumping to it, you realize picking the right index is no trivial task, pick the right index and you might be done in an instant, pick the wrong one and you might have to stay at the library overnight.

Just like you have people that specialize in finding the perfect wine to go with a meal, the task of finding the right index for your query is best left to query planners (sometimes called query optimizers). You can think of these as kind of query sommeliers.

In the process of choosing an index you just learned an important nugget of wisdom:

index efficiency depends on many things, including not only the question you are trying to answer, but the data that was indexed and the index stats.

All over the place

So you approach the catalog titled "books by topic", open the drawer for "b" and rummage through the cards until you find a dusty old card labeled "broccoli". The future's looking bright, you'll be outta here in no time you think.

Unfortunately you look down at the card and to your dismay, there are 100 book locations listed in the card, and even worse, these locations are all over the place, some in the upper floors, others in the basement, etc.

Apparently books in the library are not physically ordered by the type of veggie they discuss, go figure.

The physical order of books is often called the cluster index and a popular one for books would be "alphabetical by author". An alternative would be "just put it in the next empty shelf", something like ordered by purchase date.

Anyway back to your current predicament, you can foresee lots of stairs and walking around in the library. You'll need to go check those 100 locations, make a note of the publication year, then find the earliest five.

You're not looking forward to it, if only there was a way for you to avoid those trips up and down the library...

Then suddenly two ideas pop into your head.

The first one: what if you take this list of 100 locations and use the "books by publication year" index to find the first five there, then you'll only need to do five trips around the library.

Like any query planning activity, the efficiency of this strategy depends on many factors. If the broccoli card had only, say 6 entries, then probably doing six trips around the library would be faster than rummaging through the publication index (that contains many many books that are not about broccoli for some reason).

On the other hand, if the library was so big that you needed to take a train from one room to the other, then whatever you can do to avoid fetching books would be the right call.

Given the complexity of this type of decision, some query planners might not even bother with trying to consider index intersection strategies like this one and just go fetch the 100 books.

But even if we were to consider this strategy, the potential inefficiencies are plain to see.

The second idea seems more promising...

A compound solution to a compound problem

Wouldn't it be nice if the people that created these two indexes had created one containing both publication and topic information at the same time? Some sort of compound index?

You start pondering about the endless possibilities and quickly realize there are two paths forward: we could either (a) add topic information to the publication index or (b) add publication information to the topic index.

What a mouthful, let's see that in action!

A card in the publication index looks like this:

1985
====
  - 1
  - 2
  - 5
  - 6
  - 10
...

If we add topic information to this index (that would be option a above), we'd get something like this:

1985
====
- broccoli
  - 1
  - 5
  - 10
- cars
  - 2
  - 6
...

We are sorting all the books in the card by topic and adding topic titles, so the card for publication year "1985" under the title "broccoli" would list locations 1, 5, 10.

We could do the same for the topic index. Right now cards in that index look like this one:

broccoli
========
- 1
- 5
- 10
- 234
- 567
...

To add publication information to this index (that would be option b above), we'd get something like this:

broccoli
========
- 1985
  - 1
  - 5
  - 10
...

For the task at hand (b) looks like the most promising solution, but knowing what you know about strange book requests, surely there are other questions that are best answered by an (a) type of index.

Thanks to this compound index of "books by topic and publication year (in that order)" you can just pick the five first locations and go fetch those books. A huge improvement!

You take this time to admire the craft of these ancient index makers. When making compound indexes the order in which indexes are combined is very significant.

Just by looking at (a) and (b) you can tell the difference.

You can almost see the emerging pattern here:

when building compound indexes, put filter fields first (the ones over which you specify conditions) and sorting fields later (the ones you'll order by).

Not the wrong princess, again!

“Five books, five trips, things are looking pretty good” you think, putting on a smug face and nodding to yourself like a psycho.

This is a huge milestone because, to be honest, all prior approaches had that frustrating feeling of climbing two flights of stairs to find a book only to realize it wasn’t the book you wanted.

You felt a bit like an Italian descent plumber constantly trying to save a princess from some kind of dragon thing only to realize you saved the wrong princess, again.

So, happy with this huge efficiency improvement, you go fetch the first book, write the title down and just before leaving for the second one you glance at the shelves and your heart sinks.

Wouldn’t it be nice if all the books about broccoli were on the same shelf? All that broccoli knowledge neatly clustered together. One trip surely beats five. How hard would it be to convince the local librarians to change the way they sort books on the shelves? Pretty hard, probably.

This is exhausting, so you decide to take a break...

A coffee interlude at the local bookshop

You sit and nurse a hot coffee while pondering about why every time you propose changing the cluster index to a librarian they go red with anger and blue with sadness. There must be a reason.

(right about this time, I’m at home chopping onions and wondering how could you possibly have a decent latency with all this philosophical pondering and meandering)

Before coming up with an explanation you think of an alternative approach: let’s talk about bookshops.

Bookshops are ancient brick and mortar places where you can go to buy books. Most bookshops organize books by topic, and within each topic, alphabetically by author. Pretty close to what we need.

So you wonder why do bookshop people subject themselves to this book sorting torture that brings librarians to tears?

Of the top of your mind, one explanation is that bookshops aim to answer a very limited number of questions, namely “where’s this book?”, “where are all the books by this author?”, etc. And bookshops want to make it easier for people to find those books by walking around the store without any help. Also, bookstores usually have much fewer books than libraries.

Explanation aside, what would it take to sort books that way? Well, you should probably ask a bookstore clerk next time you visit one. In the meantime, clearly we need to deal with overflow. When an author publishes a new book and we don’t have enough space to place it on the shelf, we need to shift everything to the right.

If we were to do this in the library, we’d also need to update every card in every index because the location for this book changed to the next shelf. This feels pretty annoying and requires us to be able to answer a meta question: where are all the index references to this book?

Your brain starts to hurt, so you decide to park that thought for later.

Deep down you keep wondering about how convenient it would be to cluster books by topic.

The Asymptote of Awesome

Back at the library you decide to recap. So far you did a single index lookup in the "books by topic and publication year (in that order)". You got 5 locations, but because of your convincing powers, the library now clusters by this index, so those 5 locations are actually consecutive, so you really need a single trip through the library.

You started this journey by looking at every single book in the library, then you narrowed the search down to all 100 books about broccoli. From there to just the first five broccoli books. And, in the ultimate optimization, a single lookup for five consecutive books!

You feel accomplished, you ascended the Peak of Efficiency, you must have reached the Asymptote of Awesome.

(you quickly take out a piece of paper and scribble down “Asymptote of Awesome” in case you decide to start a rock band in the future)

...unless…

One trip around the library is much better than a hundred, but you know what’s better than one trip around the library?

You’re right: no trips around the library!

Is there such a thing as being too lazy?

What if we could include the book titles in the topic index. Maybe scribble the title right next to the location? Something like this:

broccoli
========
- 1985
  - [location 1]: Broccoli in the flux capacitor
  - [location 5]: Green veggies of the future, a look at 2015 agriculture
  - [location 10]: Little edible trees
...

With this new expanded index, you can now answer the question directly from the index. No trips, no walking around the library at all.

You realize there is something deeply meaningful in this: some queries can be answered directly from an index. These queries are known as covered queries (because an index fully covers the query without having to go fetch the book).

Covered queries are the ultimate query, the holy grail of laziness, and therefore of performance.

You can hear a manic evil laughter suddenly bursting from your own mouth. “Every query should be a covered query, every query should have its own index, in fact let's index every property, what we need is more indexes, more catalogs, more cardboard…”

You pick up a piece of paper and start trying to solve the traveling salesman problem to optimize hitting every stationary in town so that you can score precious cardboard to start indexing ASAP, and then it dawns on you: somebody must have thought about this before, how come librarians are not doing this already?

A publisher's worst nightmare

Books are mostly immutable. Once published they stay published as is, they don’t really change much. But what if they did? What would happen if books kept changing? What would that mean to the humble librarian?

Clearly there are all sorts of wrong things that can happen. The easiest is when the thing that changes is not a part of any index, then you just take a permanent marker, find the book and change it.

But what if the property that changed was indexed? There was a typ0 in the book and the publication year was wrong. We’d need to find the card for the publication index and fix it. If we had a compound index like "books by topic and publication year (in that order)" we’d need to find the card for the book topic, then move the book under the right publication year. The more indexes we have the more changes we’d need to make.

Probably the worst type of change is the cluster index change. You suddenly recall your coffee break and how painful it was to shift every book to the right. Not to mention answering the question of: where are all the index references to this book?

And, if we happen to be inserting a book at the beginning of the cluster index we might need to shift every single book in the library and update every single card of every single index.

There’s also the problem of space. If we take this to the extreme, every index is a cluster index, every index includes every property in the book, so for every index you’d need a whole new library building.

All of this is kind of depressing and, to be honest, you already have what you came for, it’s getting late so you better head back.

The return

So you are finally back, you hand me over the list with the five titles, expecting a hero’s welcome.

Instead, I look at the list, shrug and say “just as I suspected, there are at least 5 books about broccoli”, and keep cooking.

You start wondering why you subject yourself to friends like me.

Upon careful examination you start thinking about what could I have done to avoid so much waste. So far we’ve been so focused on finding the right books that we never stopped to think about what we were using the books for.

I asked for the title of the first five books about broccoli. What if I had asked “how many of the first five books about broccoli have a green cover?” or maybe “how many times is the word ‘broccoli’ mentioned in the first five books about broccoli?”

All of these questions have something in common (a couple of things actually). First and foremost they are pretty weird. But much more interesting is the fact that all of them could be answered using the same five books!

In truth there were two parts of my strange request. Find me a couple of books, and from those books I only want the titles.

All queries have the same two parts: a criteria for selecting books (often called a filter) and some instructions of what to do with those books once you find them (usually called a projection).

My original query had “get the first five books about broccoli” as the filter and “from those I want the title” as the projection.

We’ve been focusing mostly on the filter so far, trying our best to avoid any kind of physical activity, but let’s talk about the projection for a moment.

We’ll start by altering the filter to make matters much much worse…

Frenemies

Let’s say that instead of the first five books about broccoli, I asked for the first thousand.

If you had the right index that covers this query, you could get an answer very quickly and efficiently, without a single trip around the library.

If, on the other hand, you needed to use a compound index, you’d still need to go fetch 1000 books. This is usually the most typical scenario given that a cluster index at this point is probably out of the question.

That being said, you could easily fit 1000 book titles in a small pocket-sized notebook. What if I had asked for 1000 full books instead?

So you diligently fetch and carry 1000 books to your car only to realize you cannot fit 1000 books in the trunk of a compact.

You hire a truck, load it and drive it back to my house.

Once there, you see me give the books a cursory glance, check the titles and throw the rest in the trash.

At this point you scream bloody murder.

And now we learned a new lesson, if you want to keep your friends, ask only for what you truly need. Don’t be lazy and ask for everything just because it’s easier than to explain what you want.

So with this final frustration out of the way, it’s time to wrap up with this party.

Cue nostalgic music

In the end the party was a big success. Whether so much broccoli wisdom had anything to do with it or not, only time will tell.

Before everyone heads home, let’s recap what we’ve learned:

Table scans ruin parties and friendships.
Always use an index for your queries.
You can check how good a particular index is for your query by comparing the number of books fetched with the number of books actually included in the result. You want these numbers to be as close as possible.
Fancy people call indexes “indices”.
When creating compound indexes, the order of the properties being indexed matters a lot. You want “filter” properties first, “sorting” properties later.
Sometimes you can answer queries directly from the index, these are covered queries. Having indexes that cover your quieres has a cost so try to balance coverage obsession with pragmatic index building. Only cover the most critical queries.
If you want to check which indexes apply to a particular query, query planners are always very eager to explain their train of thought.
The physical ordering of books is the cluster index. Cluster indexes can be very efficient, but changes to the cluster index can be painful.
Projections are very important, always ask exactly what you need.