<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Jones</title>
    <description>The latest articles on DEV Community by Adam Jones (@awj).</description>
    <link>https://dev.to/awj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F89750%2Fe2d3cd79-87d4-4b92-b397-08c55cbf61e6.jpeg</url>
      <title>DEV Community: Adam Jones</title>
      <link>https://dev.to/awj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/awj"/>
    <language>en</language>
    <item>
      <title>Understanding database indices by (poorly) implementing one</title>
      <dc:creator>Adam Jones</dc:creator>
      <pubDate>Sat, 20 May 2023 20:59:12 +0000</pubDate>
      <link>https://dev.to/awj/understanding-database-indices-by-poorly-implementing-one-436f</link>
      <guid>https://dev.to/awj/understanding-database-indices-by-poorly-implementing-one-436f</guid>
      <description>&lt;p&gt;There’s a lot of misconceptions about database indices. These exist, in part, because people are missing the context needed to imagine how a database uses them. There’s &lt;em&gt;a lot&lt;/em&gt; to learn to establish that context. Too much for one blog post. But, we can try to bootstrap off what’s already familiar to help develop a better understanding.&lt;/p&gt;

&lt;p&gt;To do that, we’re going to implement a fake database index in Ruby. It will be &lt;em&gt;woefully&lt;/em&gt; incomplete, but still should be enough to give an idea of what’s happening.&lt;/p&gt;

&lt;h2&gt;Warning&lt;/h2&gt;

&lt;p&gt;What you’ll see here is not, &lt;em&gt;actually&lt;/em&gt;, how database indices work. It’s an extremely crude approximation. I try to call out where and how that approximation isn’t valid. If you encounter anything in an actual database that doesn’t match up with what you see here, I encourage you to take that as an opportunity to dive in and learn more.&lt;/p&gt;

&lt;h1&gt;Making things &lt;em&gt;very&lt;/em&gt; simple&lt;/h1&gt;

&lt;p&gt;We’re going to build our fake index out of the humble Ruby &lt;code&gt;Hash&lt;/code&gt;. Those are pretty familiar, right? Store data by key and value, then you can later retrieve the value by providing the key. If you don’t have a key, you’re basically just working with a more expensive variant of an &lt;code&gt;Array&lt;/code&gt;. Ironically, under the hood a Ruby &lt;code&gt;Hash&lt;/code&gt; uses a lot of the same concepts and data structures as database indices. Anyways, this will be our substitute for writing actual data structure code.&lt;/p&gt;
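&lt;p&gt;As a quick refresher on why keyed lookup matters (the names and numbers below are illustrative, not from the census data we’ll use later), compare a &lt;code&gt;Hash&lt;/code&gt; lookup with scanning an &lt;code&gt;Array&lt;/code&gt; for the same record:&lt;/p&gt;

```ruby
# A Hash can jump straight to the value stored under a key.
populations = { "Boise" => 235_684 }
populations["Boise"] # one hashed lookup, no matter how many keys exist

# Without a key, an Array forces us to examine entries until one matches.
rows = [
  { "city" => "Nampa", "population" => 106_186 },
  { "city" => "Boise", "population" => 235_684 },
]
rows.find { |row| row["city"] == "Boise" } # scans rows one by one
```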

&lt;p&gt;We’ll only support &lt;em&gt;unique&lt;/em&gt; indices. It’s possible, but messy, for us to support non-unique ones. I just don’t think it would teach much beyond what you’ll already learn here. We &lt;em&gt;will&lt;/em&gt; support composite indices, and will get into covering queries that only use some of the index columns.&lt;/p&gt;

&lt;p&gt;Probably the biggest query-time thing separating our index from a real one will be the lack of support for range queries. So no &lt;code&gt;WHERE X &amp;gt; 0&lt;/code&gt; style queries for our index. We’re ignoring this because hashes don’t make it easy to do efficiently, and I don’t think implementing it will tell you much that direct value lookups don’t. Real database indices &lt;em&gt;absolutely&lt;/em&gt; are able to handle these for many different data types.&lt;/p&gt;

&lt;h1&gt;The Index class&lt;/h1&gt;

&lt;p&gt;We’ll start with a class named &lt;code&gt;Index&lt;/code&gt;, which will be the core of our code here. We will “implement” different SQL queries as Ruby code written in terms of this &lt;code&gt;Index&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;Index.declare&lt;/code&gt; to create an (empty) index on a list of columns. Then we can add data to it by looping through the data and calling &lt;code&gt;Index#add&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Allow us to efficiently answer questions about a large amount of data based on&lt;/span&gt;
&lt;span class="c1"&gt;# specific column(s) in it.&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Index&lt;/span&gt;
  &lt;span class="c1"&gt;# The column this index is handling.&lt;/span&gt;
  &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:column&lt;/span&gt;

  &lt;span class="c1"&gt;# The columns that come *after* this one in the index.&lt;/span&gt;
  &lt;span class="c1"&gt;#&lt;/span&gt;
  &lt;span class="c1"&gt;# If this list is empty, we're at the "end" of the index column list and&lt;/span&gt;
  &lt;span class="c1"&gt;# should store row ids as our Hash values.&lt;/span&gt;
  &lt;span class="c1"&gt;#&lt;/span&gt;
  &lt;span class="c1"&gt;# If it is *not* empty, we make an `Index` class that deals with those&lt;/span&gt;
  &lt;span class="c1"&gt;# columns and use it as our Hash value.&lt;/span&gt;
  &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:subsequent_columns&lt;/span&gt;

  &lt;span class="c1"&gt;# The Hash that represents actual index content. I'm avoiding calling this&lt;/span&gt;
  &lt;span class="c1"&gt;# `data` because it's *not* the actual data we're indexing. Confusing&lt;/span&gt;
  &lt;span class="c1"&gt;# terminology.&lt;/span&gt;
  &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:content&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subsequent_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="vi"&gt;@column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;
    &lt;span class="vi"&gt;@subsequent_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subsequent_columns&lt;/span&gt;
    &lt;span class="vi"&gt;@content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="c1"&gt;# Are we the final column of the index? If so, our answers should be data id&lt;/span&gt;
  &lt;span class="c1"&gt;# values instead of another `Index`&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;leaf?&lt;/span&gt;
    &lt;span class="vi"&gt;@subsequent_columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty?&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="c1"&gt;# "Index" a piece of data. It's assumed that this data is functionally a Hash&lt;/span&gt;
  &lt;span class="c1"&gt;# that contains at least `:id` and whatever value we hvae for `column`.&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;leaf?&lt;/span&gt;
      &lt;span class="vi"&gt;@content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
      &lt;span class="c1"&gt;# If we are *not* the final column, create a new Index to represent the&lt;/span&gt;
      &lt;span class="c1"&gt;# slice of data that all shares the same value for our `column`. This&lt;/span&gt;
      &lt;span class="c1"&gt;# index should use the *next* subsequent column, and needs to know about&lt;/span&gt;
      &lt;span class="c1"&gt;# the *rest* of the subsequent columns in case it too is not the final&lt;/span&gt;
      &lt;span class="c1"&gt;# one.&lt;/span&gt;
      &lt;span class="vi"&gt;@content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||=&lt;/span&gt; &lt;span class="no"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subsequent_columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;subsequent_columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="vi"&gt;@content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;What we can learn about database indices&lt;/h2&gt;

&lt;p&gt;Surprisingly, even at this early stage we can draw an important and useful inference about working with indices. The “natural flow” of accessing this data is going to be along the path dictated by the columns. Our index also can’t answer questions involving columns that weren’t indexed.&lt;/p&gt;

&lt;p&gt;It’s easy to imagine navigating this in column order, but &lt;em&gt;other&lt;/em&gt; orders seem like a bigger challenge. Databases are full of clever optimizations that can &lt;em&gt;sometimes&lt;/em&gt; make out-of-order usage possible, but generally speaking you want things to happen in-order.&lt;/p&gt;

&lt;h1&gt;Sample data&lt;/h1&gt;

&lt;p&gt;To play with this, we’ll work on sample data taken from the &lt;a href="https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html#v2022"&gt;US Census Bureau City and Town Population Totals&lt;/a&gt;. This is a list of ~20k cities in the US with their estimated populations.&lt;/p&gt;

&lt;p&gt;For the purposes of this post, I have &lt;a href="https://awj.dev/static/city_populations_2022.csv"&gt;cleaned it up&lt;/a&gt; into a CSV, with state names extracted.&lt;/p&gt;

&lt;p&gt;We’re going to assume here that the combination of the &lt;code&gt;city&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; columns makes a record unique. That isn’t &lt;em&gt;strictly&lt;/em&gt; true for this data, but again it makes it easier to work with.&lt;/p&gt;

&lt;h1&gt;Harness code&lt;/h1&gt;

&lt;p&gt;The following code is enough to get us started in an IRB session. It assumes the above code snippet is available locally as &lt;code&gt;./index.rb&lt;/code&gt;, and the csv can be found at &lt;code&gt;./city_populations_2022.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"csv"&lt;/span&gt;

&lt;span class="nb"&gt;load&lt;/span&gt; &lt;span class="s2"&gt;"index.rb"&lt;/span&gt;

&lt;span class="c1"&gt;# Load the CSV, converting integer values as we go&lt;/span&gt;
&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"./city_populations_2022.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;headers: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;converters: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:integer&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Store our CSV in an Array where the values are hashes of the row&lt;/span&gt;
&lt;span class="c1"&gt;# data. This will simulate the actual database table.&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:to_h&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;span class="c1"&gt;# Declare an index on state and city, in that order&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;declare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we were to discard the index class and &lt;em&gt;just&lt;/em&gt; look at things as nested hashes, our index would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"California"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Los Angeles"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1444&lt;/span&gt; &lt;span class="c1"&gt;# 1444 is the row id for this city&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;Finding a row by state and city&lt;/h1&gt;

&lt;p&gt;We’ll start out simple: given a city and state, look up the row. We’ll try it out with Los Angeles, California. In SQL, this would be: &lt;code&gt;SELECT * FROM populations WHERE state = 'California' AND city = 'Los Angeles'&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"California"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Los Angeles"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Our `id` values don't exactly correspond to Array offsets, so we have to do this.&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;What we can learn about database indices&lt;/h2&gt;

&lt;h3&gt;Index ordering&lt;/h3&gt;

&lt;p&gt;Notice how we’re starting the lookup with the &lt;code&gt;state&lt;/code&gt;? That’s because it’s the “beginning” of the index.&lt;/p&gt;

&lt;p&gt;Imagine if we tried to start with the &lt;code&gt;city&lt;/code&gt; first. What would that code look like? It would have to dig through &lt;em&gt;every value&lt;/em&gt; in the &lt;code&gt;state&lt;/code&gt; index to get at cities, then work its way backwards.&lt;/p&gt;

&lt;p&gt;Often, your database effectively can’t do this. There’s too much data involved, and simply keeping track of everything you’ve looked at could cause problems. Plus “examine the entire index” isn’t going to be a fast operation. It might pursue this strategy if you give it no better option, but you &lt;em&gt;really&lt;/em&gt; want to give it better options.&lt;/p&gt;
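&lt;p&gt;To make that concrete, here’s a sketch of what a city-first lookup forces on us. The nested Hash stands in for our &lt;code&gt;(state, city)&lt;/code&gt; index content, and the row ids are invented for illustration:&lt;/p&gt;

```ruby
# A stand-in for index.content: state => city => row id.
# Row ids here are made up for the example.
index_content = {
  "California" => { "Los Angeles" => 1444, "San Diego" => 1501 },
  "Idaho"      => { "Boise" => 7035 },
}

# Looking up by city *without* a state means crawling every state's
# sub-index -- the entire outer level of the index.
matches = []
index_content.each do |state, cities|
  cities.each do |city, row_id|
    matches.push([state, row_id]) if city == "Boise"
  end
end

matches # => [["Idaho", 7035]]
```

&lt;p&gt;With two states this is trivial, but with fifty (or millions of keys) it becomes exactly the “examine the entire index” operation described above.&lt;/p&gt;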

&lt;h3&gt;Row lookup&lt;/h3&gt;

&lt;p&gt;Notice how, to return the data, we had to go to our “table” that is stored in &lt;code&gt;data&lt;/code&gt;? That’s called a “row lookup”. Real databases almost certainly store the index data and row data separately, so row lookups have additional overhead that we want to be careful with.&lt;/p&gt;

&lt;p&gt;Often, optimizing SQL queries is a process of trying to avoid any more row lookups than strictly necessary.&lt;/p&gt;

&lt;h1&gt;Finding the total population of a state&lt;/h1&gt;

&lt;p&gt;Ok, now let’s try another likely task: finding the total population of a state. We’ll go with Idaho this time. In SQL this would look like &lt;code&gt;SELECT sum(population) FROM populations WHERE state = 'Idaho'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first glance it might not look like our index is helpful here, but it still is. Here’s code to get this &lt;em&gt;without&lt;/em&gt; the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Let's keep track of how many times we had to go fetch a row. This is&lt;/span&gt;
&lt;span class="c1"&gt;# important, because row lookups are expensive.&lt;/span&gt;
&lt;span class="n"&gt;rows_examined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Notice: we are visiting *every* row in the data. If we had millions or&lt;/span&gt;
&lt;span class="c1"&gt;# billions of rows, this would be really bad.&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;rows_examined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;next&lt;/span&gt; &lt;span class="k"&gt;unless&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Idaho"&lt;/span&gt;

  &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"population"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_examined&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# =&amp;gt; [1302154, 19692]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we got our sum, and it &lt;em&gt;probably&lt;/em&gt; was fast on your computer (reminder: this is a tiny amount of data), but we had to look at every single row in the data. Usually, “we have to look at every row in the entire table” is one of the absolute &lt;em&gt;worst&lt;/em&gt; things you can see your database doing.&lt;/p&gt;

&lt;p&gt;So how can we use our index? We don’t have a ready list of “the names of every city in Idaho”, so we can’t just plug that in as keys once we get to the &lt;code&gt;Idaho&lt;/code&gt; index. But, we &lt;em&gt;do&lt;/em&gt; have the ability to traverse a &lt;code&gt;Hash&lt;/code&gt; by &lt;em&gt;values&lt;/em&gt;. So we can still use our index to help us get to the state of Idaho, then crawl through its contents to find the total population.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Again, we're tracking rows&lt;/span&gt;
&lt;span class="n"&gt;rows_examined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Idaho'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;row_id&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;rows_examined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"population"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_examined&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# =&amp;gt; [1302154, 199]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now we have the &lt;em&gt;same&lt;/em&gt; sum, but we looked at roughly 1% of the rows (199 of 19,692). That’s a &lt;em&gt;huge&lt;/em&gt; win.&lt;/p&gt;

&lt;h2&gt;What we can learn about database indices&lt;/h2&gt;

&lt;p&gt;Databases don’t just use indices for cases where they have every single relevant key. An index is a data structure they can dig through, and that alone can help significantly.&lt;/p&gt;

&lt;p&gt;Sometimes they do this by “skipping over” intermediate keys to get to the final rows, like what we did here. It’s worth noticing that this was only possible because our index was defined as &lt;code&gt;(state, city)&lt;/code&gt;. If it had been &lt;code&gt;(city, state)&lt;/code&gt;, we would have had to examine every single city name to see if it was in the state. That’s usually still &lt;em&gt;better&lt;/em&gt; than crawling every row of the data, but it’s nowhere near as good as what we just experienced.&lt;/p&gt;

&lt;p&gt;When you’re defining a composite index, it’s &lt;em&gt;really&lt;/em&gt; important to think about the cases where you might end up querying only some of those columns. Getting the column order right will maximize the value you get out of the database’s work in maintaining the index.&lt;/p&gt;

&lt;h1&gt;A new index for even faster population totals&lt;/h1&gt;

&lt;p&gt;Let’s say this kind of population query is extremely important, and we’ve found the above approach (which still touches about 1% of the rows) to be too slow for our needs. What can an index do for us?&lt;/p&gt;

&lt;p&gt;We’ve done more or less everything we can with our existing index. If our system supported non-unique indices, we could make an index on just &lt;code&gt;state&lt;/code&gt; that would allow us to directly jump into rows, but it wouldn’t change the number of rows we’re looking at.&lt;/p&gt;
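&lt;p&gt;For a sense of what that would look like, here’s a sketch of a non-unique index on just &lt;code&gt;state&lt;/code&gt;. The leaf values become Arrays of row ids, and the sample rows and ids are invented for illustration:&lt;/p&gt;

```ruby
# In a non-unique index, one key can map to many rows, so the leaves
# hold Arrays of row ids instead of a single id.
state_index = Hash.new { |hash, key| hash[key] = [] }

# Invented sample rows standing in for the census data.
sample_rows = [
  { "id" => 7035, "state" => "Idaho" },
  { "id" => 7036, "state" => "Idaho" },
  { "id" => 1444, "state" => "California" },
]

sample_rows.each do |row|
  state_index[row["state"]].push(row["id"])
end

state_index["Idaho"] # => [7035, 7036]
```

&lt;p&gt;We’d still have to fetch each of those rows to read its population, which is why this wouldn’t reduce the row lookups.&lt;/p&gt;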

&lt;p&gt;Let’s build &lt;em&gt;another&lt;/em&gt; index, one that extends our previous index with population values. So it would look like &lt;code&gt;(state, city, population)&lt;/code&gt;. Here’s how:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;population_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;declare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"population"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;population_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;code&gt;state+city&lt;/code&gt; was already unique, &lt;code&gt;state+city+population&lt;/code&gt; is also going to be unique.&lt;/p&gt;

&lt;p&gt;Here’s a sketch of it as a Hash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"California"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Los Angeles"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;# NOTE: This "population" Hash will always be a single key (the&lt;/span&gt;
      &lt;span class="c1"&gt;# population) pointing to the row id.&lt;/span&gt;
      &lt;span class="mi"&gt;3898767&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1444&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index can give us our population total &lt;em&gt;without touching a single row&lt;/em&gt;!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;population_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Idaho"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="c1"&gt;# =&amp;gt; 1302154&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how &lt;code&gt;data&lt;/code&gt; is not even mentioned in this code. We’re answering queries &lt;em&gt;just&lt;/em&gt; from the index content!&lt;/p&gt;

&lt;h2&gt;What we can learn about database indices&lt;/h2&gt;

&lt;p&gt;Since our index reflects the underlying data, we can use the index contents &lt;em&gt;in place of&lt;/em&gt; the actual data. Databases use this trick &lt;em&gt;a lot&lt;/em&gt;, and it’s an incredibly effective optimization.&lt;/p&gt;

&lt;p&gt;It’s generally safe to assume that your data on disk isn’t organized in a way that makes any particular lookup efficient. Earlier, when we read 199 rows to get our data, it’s safe to assume that none of those rows lived next to each other in a way that allowed the operating system to avoid doing 199 disk reads.&lt;/p&gt;

&lt;p&gt;By comparison, even when the index is serialized to disk, all of the relevant bits of information live closer together. It’s very likely that reading the disk block that gave us one relevant &lt;code&gt;city&lt;/code&gt; &lt;em&gt;also&lt;/em&gt; happened to load and cache other cities we needed. Plus our index data is a lot smaller/denser than the actual row data. So even digging everything up off the disk involved fewer disk reads.&lt;/p&gt;

&lt;p&gt;When trying to look up actual city records, the same “skip over a column” trick that we did in the last section can work here. So it’s possible to go from &lt;code&gt;(state, city, population)&lt;/code&gt; to the city record even with just &lt;code&gt;state&lt;/code&gt; and &lt;code&gt;city&lt;/code&gt;. This index could handily serve every query we’ve seen so far.&lt;/p&gt;
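&lt;p&gt;Here’s a sketch of that “skip over a column” lookup. The nested Hash stands in for the &lt;code&gt;(state, city, population)&lt;/code&gt; index content, with illustrative values:&lt;/p&gt;

```ruby
# Stand-in for population_index.content. The innermost Hash always has
# exactly one key (the population) pointing at the row id.
population_index_content = {
  "California" => {
    "Los Angeles" => { 3_898_767 => 1444 },
  },
}

# With just state and city we land on that single-entry population Hash.
# "Skipping over" the population column means taking whatever entry is
# there to reach the row id.
city_level = population_index_content["California"]["Los Angeles"]
row_id = city_level.values.first

row_id # => 1444
```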

&lt;h1&gt;Finding the total population of EVERY state&lt;/h1&gt;

&lt;p&gt;Now we’re going to try to handle this query: &lt;code&gt;SELECT state, sum(population) FROM populations GROUP BY state&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;STOP! Before you read further, I want you to think about how you’d solve this. You have three options now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through all the data rows&lt;/li&gt;
&lt;li&gt;Try to use the original index&lt;/li&gt;
&lt;li&gt;Try to use the population index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That act of “deciding how to get at the data” is called &lt;em&gt;Query Planning&lt;/em&gt;. It’s an important part of how databases work. Get deep enough into database performance and you’re going to have to become intimately familiar with your database’s query plan explanations. Examining that output is a key way to help debug slow queries and figure out what changes need to happen to make them not-slow queries.&lt;/p&gt;

&lt;p&gt;In this case we have only three options, and it’s (probably) relatively easy to pick which one will be “the best”. But, let’s think them through in a rough approximation of how a query planner might look at this.&lt;/p&gt;

&lt;p&gt;If we assign a “cost” to data and index reads, we can weigh our options by “total cost”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk all the rows: 20,000 data reads + 0 index reads&lt;/li&gt;
&lt;li&gt;Use the original index: 20,000 data reads + 20,000 index reads&lt;/li&gt;
&lt;li&gt;Use the population index: 0 data reads + 20,000 index reads (reminder: all needed data is in the index)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s generally accurate to assume that index reads are cheaper than data reads. So we’d want to “weigh” data reads higher. The actual process inside a real database is much more complicated, but here we’ll just assume data reads are 5x as expensive.&lt;/p&gt;

&lt;p&gt;That gives us total costs of 100,000, 120,000, and 20,000 respectively, which means we should go with the last option.&lt;/p&gt;
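&lt;p&gt;That arithmetic is easy to do by hand, but here’s the same weighing as a tiny Ruby sketch (the 5x weight and these plan names are this post’s made-up numbers, not anything a real planner uses):&lt;/p&gt;

```ruby
# Weigh each candidate plan by total cost, assuming a data read costs
# 5x as much as an index read (our made-up weighting from above).
DATA_READ_COST = 5
INDEX_READ_COST = 1

plans = {
  "walk all rows"    => { data_reads: 20_000, index_reads: 0 },
  "original index"   => { data_reads: 20_000, index_reads: 20_000 },
  "population index" => { data_reads: 0,      index_reads: 20_000 }
}

costs = plans.transform_values do |plan|
  plan[:data_reads] * DATA_READ_COST + plan[:index_reads] * INDEX_READ_COST
end

# Pick the cheapest plan, exactly as a (very naive) query planner would.
best = costs.min_by { |_name, cost| cost }.first
```

&lt;p&gt;Running this gives the 100,000 / 120,000 / 20,000 totals from above and picks the population index.&lt;/p&gt;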

&lt;p&gt;&lt;em&gt;Sidenote: to make it accessible, this cost calculation is wildly naive. Real databases track a lot more information than “how many rows are there”, and have more detailed insights about both the characteristics of the data and the specifics of how it is stored. Imagine you spent years refining this concept to fix every case where your cost predictions were wrong, and you’re closer to how databases actually work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, how do we find per-state populations? That one again comes out kind of straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;population_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What we can learn about database indices
&lt;/h2&gt;

&lt;p&gt;Again, we’re using the index as a &lt;em&gt;source&lt;/em&gt; of information rather than just a way to &lt;em&gt;get to&lt;/em&gt; information. It’s worth reiterating this because it comes up so often in real world scenarios.&lt;/p&gt;

&lt;p&gt;We also, in our simulated query planning, saw a case where using an index was &lt;em&gt;slower&lt;/em&gt; than reading the entire table. In practice, these scenarios are rare, but they can happen. Sometimes you’ll look at a query plan and wonder why the database is ignoring an index, only to find you were missing a detail that makes the index &lt;em&gt;slower&lt;/em&gt;. In this case that detail was “we’re asking to read the entire table”, but it’s definitely not the only one.&lt;/p&gt;

&lt;p&gt;We’re also, in our &lt;code&gt;result&lt;/code&gt; hash, just getting a glimpse into result buffering. Again it’s worth imagining what we would do in a scenario where there was so much data that just storing this &lt;code&gt;result&lt;/code&gt; in memory wouldn’t work.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping up
&lt;/h1&gt;

&lt;p&gt;Hopefully this helps a little to demystify database indices. As I mentioned at the start, the &lt;em&gt;actual&lt;/em&gt; data structures vary significantly from what we’re using here, but I’ve tried to keep the reasoning and thought processes consistent.&lt;/p&gt;

&lt;p&gt;Despite the inaccuracies, you can get pretty far using this “wandering through nested hashes” view of how database indices work. Perhaps the best extension to your mental model would be imagining a hash that also includes the ability to do inequality comparisons on keys. Like if a &lt;code&gt;Hash#lookup&lt;/code&gt; method existed that took a Ruby &lt;code&gt;Range&lt;/code&gt; as an argument and could efficiently give you the values where keys were inside of that range.&lt;/p&gt;
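&lt;p&gt;As a rough sketch of that imagined capability (the &lt;code&gt;RangeIndex&lt;/code&gt; class and its &lt;code&gt;lookup&lt;/code&gt; method are hypothetical, not part of Ruby), keeping the keys sorted lets a range query binary-search for its starting point instead of examining every entry:&lt;/p&gt;

```ruby
# A toy ordered map: keys are kept sorted so a Range lookup can
# binary-search for its lower bound, much like a B-tree serves
# inequality predicates.
class RangeIndex
  def initialize(pairs)
    @keys = pairs.keys.sort
    @values = pairs
  end

  # Hypothetical Hash#lookup analogue: values whose keys fall in range.
  def lookup(range)
    first = @keys.bsearch_index { |k| k >= range.begin } || @keys.size
    @keys[first..].take_while { |k| range.cover?(k) }.map { |k| @values[k] }
  end
end

index = RangeIndex.new({ 10 => "a", 20 => "b", 30 => "c", 40 => "d" })
```

&lt;p&gt;Here &lt;code&gt;index.lookup(15..35)&lt;/code&gt; returns &lt;code&gt;["b", "c"]&lt;/code&gt; without scanning keys past the end of the range.&lt;/p&gt;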

&lt;p&gt;If all of this has you interested in what the internals of an index &lt;em&gt;actually&lt;/em&gt; look like, you can start by studying &lt;a href="https://en.wikipedia.org/wiki/B-tree"&gt;B-trees&lt;/a&gt;. They’re probably the most commonly used data structure for this purpose. Many databases support alternative index types based around different data structures, which is where you really start getting deep into the benefits and drawbacks of each one.&lt;/p&gt;

&lt;p&gt;If you’d like to know more about query plans and how databases handle the topic of picking an algorithm to look up the data, that unfortunately gets pretty specific to the database involved. If you’re using &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/execution-plan-information.html"&gt;MySQL&lt;/a&gt; or &lt;a href="https://www.postgresql.org/docs/current/using-explain.html"&gt;PostgreSQL&lt;/a&gt; I’ve linked to the relevant sections of their documentation. Because databases are attempting to generate the best possible plan out of (potentially) a huge number of choices in a tiny fraction of time, query planning gets kind of hairy and detailed fast.&lt;/p&gt;

&lt;p&gt;If this has piqued your interest in how to effectively use indices, &lt;a href="https://use-the-index-luke.com/"&gt;Use the index, Luke&lt;/a&gt; is a fantastic resource. It even includes an introduction to B-trees and resources tailored to multiple database types.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>Understanding MySQL Multiversion Concurrency Control</title>
      <dc:creator>Adam Jones</dc:creator>
      <pubDate>Sun, 16 Sep 2018 18:02:55 +0000</pubDate>
      <link>https://dev.to/awj/understanding-mysql-multiversion-concurrency-control-1n5f</link>
      <guid>https://dev.to/awj/understanding-mysql-multiversion-concurrency-control-1n5f</guid>
      <description>&lt;p&gt;MySQL, under the InnoDB storage engine, allows writes and reads of the same row to not interfere with each other. This is one of those features that we use so often it kind of gets taken for granted, but if you think about how you would build such a thing it’s a lot more detailed than it seems. Here, I am going to talk through how that is implemented, as well as some ramifications of the design.&lt;/p&gt;

&lt;h1&gt;
  
  
  Allowing Concurrent Change
&lt;/h1&gt;

&lt;p&gt;Unsurprisingly, given the title of this post, MySQL’s mechanism for allowing you to simultaneously read and write from the same row is called “Multiversion Concurrency Control”. They (of course) &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html"&gt;have documentation&lt;/a&gt; on it, but that dives into internal technical details pretty fast.&lt;/p&gt;

&lt;p&gt;Instead, let’s talk about it at a little higher level. This concept has been around for a long time (the best I can do hunting down an origin is a &lt;a href="https://dspace.mit.edu/handle/1721.1/16279"&gt;thesis&lt;/a&gt; from 1979). The overall answer for allowing concurrent reads and writes is pretty simple: writes create new versions of rows, reads see the version that was current when they started.&lt;/p&gt;

&lt;h1&gt;
  
  
  Version tracking
&lt;/h1&gt;

&lt;p&gt;Obviously if we’re going to keep track of versions, we need something to differentiate them. This tool needs to distinguish one version from another, but ideally it would also make it easy to decide which version a read operation should see.&lt;/p&gt;

&lt;p&gt;In MySQL, this “version enabling thing” is a transaction id. Every transaction gets one. Even your one-shot update queries in the console get one. These ids are incremented in a way that allows MySQL to determine that one transaction started before another. Every table under InnoDB essentially has a “hidden column” that stores the transaction id of the last write operation to change the row. So, in addition to the columns you may have updated, a write operation &lt;em&gt;also&lt;/em&gt; marks the row with its transaction id. This allows read operations to know if they can use the row data, or if it has been changed and they need to consult an older version.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading older versions
&lt;/h1&gt;

&lt;p&gt;For the cases where your read operation hits on rows that have been changed, you’ll need an older version of the data. The transaction id comes into play here too, but there’s more info needed. Every time MySQL writes data into a row, it &lt;em&gt;also&lt;/em&gt; writes an entry into the rollback segment. This is a data structure that stores “undo logs” used to restore the row to its previous state. It’s called the “rollback segment” because it is the tool used to handle rolling back transactions.&lt;/p&gt;

&lt;p&gt;The rollback segment stores undo logs for each row in the database. Every row has &lt;em&gt;another&lt;/em&gt; hidden column that stores the location of the latest undo log entry, which would restore the row to its state prior to the last write. When these entries are created, they are marked with the &lt;em&gt;outgoing&lt;/em&gt; transaction id. By walking the undo log for a row and finding the latest transaction &lt;em&gt;before&lt;/em&gt; a read transaction, the database can identify the correct data to present to that transaction.&lt;/p&gt;
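&lt;p&gt;That read path can be sketched with made-up Ruby structures (nothing like InnoDB’s actual formats): the row carries its last writer’s transaction id, and each undo entry is stamped with the transaction id whose state it restores:&lt;/p&gt;

```ruby
# Toy MVCC row: current data plus a chain of undo entries, newest first.
# Each undo entry records the txn id that produced the state it restores.
Row = Struct.new(:data, :txn_id, :undo_log)

def visible_version(row, reader_txn_id)
  # Current data is visible if its writer started before the reader.
  return row.data if row.txn_id <= reader_txn_id

  # Otherwise walk the undo log for the newest version old enough to see.
  entry = row.undo_log.find { |e| e[:txn_id] <= reader_txn_id }
  entry && entry[:data]
end

row = Row.new(
  { counter: 3 }, 30,                      # last written by txn 30
  [{ txn_id: 20, data: { counter: 2 } },   # undo entries, newest first
   { txn_id: 10, data: { counter: 1 } }]
)
```

&lt;p&gt;A reader from transaction 25 skips the current row (written by 30) and lands on the version from transaction 20, while a reader from transaction 35 uses the row data directly.&lt;/p&gt;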

&lt;h1&gt;
  
  
  Handling deletes
&lt;/h1&gt;

&lt;p&gt;Deletion is handled by a marker in the row to indicate a record was deleted. Delete operations &lt;em&gt;also&lt;/em&gt; set the row’s transaction id to their transaction id, so the process above can present a pre-delete version of the row to read operations that started before the delete.&lt;/p&gt;

&lt;h2&gt;
  
  
  When are versions deleted
&lt;/h2&gt;

&lt;p&gt;MySQL obviously cannot keep a record of every change that happens in the database for all time. It doesn’t need to, though. Undo logs can be removed as soon as the last transaction that could possibly want them completes.&lt;/p&gt;

&lt;p&gt;Similarly, rows that have been marked as deleted can be outright abandoned once the oldest active transaction is one that started after the deletion. These rows and undo logs are physically removed to reclaim their disk space by a “purge” operation that happens in its own thread in the background.&lt;/p&gt;
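&lt;p&gt;In the same toy terms as before, the purge rule might look like this (a naive sketch; real purge bookkeeping is far more involved):&lt;/p&gt;

```ruby
# Toy purge: with undo entries newest first, the oldest active reader
# stops at the first entry it can see; everything past that point only
# serves transactions that no longer exist and can be discarded.
def purge_undo(undo_log, active_txn_ids)
  return [] if active_txn_ids.empty?

  oldest_active = active_txn_ids.min
  keep_through = undo_log.index { |e| e[:txn_id] <= oldest_active }
  keep_through ? undo_log[0..keep_through] : undo_log
end

undo_log = [
  { txn_id: 20, data: { counter: 2 } },
  { txn_id: 10, data: { counter: 1 } },
  { txn_id: 5,  data: { counter: 0 } }
]
```

&lt;p&gt;With active transactions 12 and 30, the entry from transaction 5 can be reclaimed: transaction 12 stops at the entry from transaction 10, and nothing older is running.&lt;/p&gt;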

&lt;h1&gt;
  
  
  What about indexes
&lt;/h1&gt;

&lt;p&gt;So, to recap, MySQL handles versions by keeping the row constantly up to date and storing diffs for as long as currently running queries need them. That’s only half the story, though: indexes need to support consistent reads as well. Primary key indexes work much like the above description for actual database rows. Secondary indexes are a little different.&lt;/p&gt;

&lt;p&gt;MySQL handles this in two ways: pages of index entries are marked by the last transaction id to write in them, and individual index entries have delete markers. When an update changes an indexed column, three things happen: the current index entry is delete marked, a new entry for the updated value is written, and that new entry’s index page is marked with the transaction id.&lt;/p&gt;

&lt;p&gt;Read operations that find non-delete-marked entries in pages that predate their transaction id use that index data directly. If the operation finds either a delete marker or a newer transaction id, it looks up the row and traverses the undo log to find the appropriate value to use.&lt;/p&gt;

&lt;p&gt;Similar to the purging of deleted rows from expired transactions, delete-marked index entries are also eventually reclaimed. Because there is always a fresh new entry to work with &lt;em&gt;somewhere&lt;/em&gt; in the index, MySQL can be a little more aggressive at cleaning those up.&lt;/p&gt;

&lt;h1&gt;
  
  
  What do I do with this information?
&lt;/h1&gt;

&lt;p&gt;So, given the above, what can we take home to make our lives better? A few things. Keep in mind with all of this that database performance can be very difficult to analyze. Each point below is just one potential piece of the story of what could be happening with your data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Big transactions are painful: Long running transactions don’t just tie up a connection, they force the database to preserve history for longer. If that transaction is reading through a large swath of the database, subsequent writes will force it to read the rollback log, which may be in a different page of memory or even on disk.&lt;/li&gt;
&lt;li&gt;Multi-statement transactions need to commit quickly: This is another variety of “big transactions are painful”, but it’s worth calling out. MySQL does not “kill” active transactions. If you open a transaction, query out data, then spend two hours in application code before committing, MySQL will faithfully preserve undo history for two hours. Every moment of an open transaction forces more undo history. Commit as quickly as you can and do your processing outside of transactions whenever possible.&lt;/li&gt;
&lt;li&gt;Writes make index scans less useful. The whole point of an index is to answer questions about your data without actually looking at your data. Delete markers on index entries, and transaction stamps on index pages, force the database to read your data. Think carefully about using composite indexes with columns you aren’t querying. Your queries will pay the price for updates to those columns anyway.&lt;/li&gt;
&lt;li&gt;Rapid fire writes magnify the penalties for reads. If you have a lot of data to write, especially to the &lt;em&gt;same row&lt;/em&gt;, write it in chunks instead of one-by-one. Each write generates a transaction id, relevant undo logs, and makes a mess of secondary indexes. Chunking writes together increases the chances that reads will find valid index data, and lowers the size of undo logs they have to wade through. There’s an opposite extreme where one big write might have too much data, so it’s important to look for the happy middle ground here.&lt;/li&gt;
&lt;li&gt;“Hot” rows are hot for all columns, not just updated ones. A row that stores a frequently updated counter forces more row transaction id updates and undo log entries. Queries that start before the counter is incremented, even if they don’t use the counter, still have to traverse undo logs for the row state when they started. That same logic applies to extremely frequently updated timestamps that aggregate change times across the relations to a row. If possible, batch those updates beforehand or consider storing them in a separate table you can join to when needed.&lt;/li&gt;
&lt;li&gt;Consider separating reporting from direct/application use reads. Reporting queries tend to scan large sections of the database. They take a long time and thus force the preservation and consumption of more undo history. Most application behavior is more direct: it knows specific records to retrieve and goes straight for them. If you’re already using read replicas, consider dedicating one to reporting so that your application queries don’t pay the undo storage penalties of reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Final thoughts
&lt;/h1&gt;

&lt;p&gt;It’s worth noting that this is not the only way to implement MVCC. PostgreSQL handles this task by storing the minimum and maximum transaction ids where a row should be visible. Under that scheme, updating a row sets the maximum transaction id on it and creates an entirely new row entry by copying the original and performing the update in that copy. This avoids the need for undo logs, but at the cost of copying all row data for each update.&lt;/p&gt;

&lt;p&gt;The point here is that, for cases where you are really trying to push database performance, understanding a few of the bigger details of the internals can pay off. Many of the takeaways I listed are only going to be applicable in extreme use cases, but in those cases knowing how the database goes about versioning data can make understanding performance problems easier.&lt;/p&gt;

</description>
      <category>mysql</category>
    </item>
    <item>
      <title>Understanding the Elasticsearch Percolator</title>
      <dc:creator>Adam Jones</dc:creator>
      <pubDate>Tue, 24 Apr 2018 18:02:55 +0000</pubDate>
      <link>https://dev.to/awj/understanding-the-elasticsearch-percolator-2j5l</link>
      <guid>https://dev.to/awj/understanding-the-elasticsearch-percolator-2j5l</guid>
      <description>&lt;p&gt;Elasticsearch is a powerful, feature-packed tool. Their &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html"&gt;documentation&lt;/a&gt; is great, but some pieces are a bit … out there. Beyond that, some of the functionality has changed significantly over the years, so third-party explanations might no longer be accurate.&lt;/p&gt;

&lt;p&gt;One fantastic feature that is both unusual and has changed a lot is percolation. I’m going to try to explain that feature, in the context of its current implementation (version 6.2.4 as of this writing). You’ll need a basic understanding of Elasticsearch, specifically &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping.html"&gt;mappings&lt;/a&gt; and &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-body.html"&gt;search&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Concept
&lt;/h1&gt;

&lt;p&gt;The normal workflow for Elasticsearch is to store documents (as JSON data) in an index, and execute searches (also JSON data) to ask the index about those documents.&lt;/p&gt;

&lt;p&gt;Succinctly, percolation reverses that. You store searches and use documents to ask the index about those searches. That’s true, but it’s not particularly actionable information. How percolators are structured has evolved over the years, to the point where we can give a more useful explanation.&lt;/p&gt;

&lt;p&gt;Now, percolation revolves around the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/percolator.html"&gt;percolator&lt;/a&gt; mapping field type. This is like any other field type, except that it expects you to assign a search document as the value. When you store data, the index processes this search document into an executable form and saves it for later.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-percolate-query.html"&gt;Percolate Query&lt;/a&gt; takes one or more documents and limits results to ones whose stored searches match at least one document. When searching, the percolate query works like any other query element.&lt;/p&gt;
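&lt;p&gt;The reversal is easier to picture with a toy model, here in plain Ruby (just the concept, nothing like Elasticsearch’s actual machinery): the “index” stores predicates under ids, and percolating a document returns the ids of the stored searches it matches:&lt;/p&gt;

```ruby
# A toy percolator: the "index" maps ids to stored search predicates,
# and percolating a document returns the ids whose predicate it satisfies.
stored_searches = {
  1 => ->(doc) { doc[:description].include?("switch") && doc[:price] <= 300.0 },
  2 => ->(doc) { doc[:description].include?("lego") }
}

def percolate(searches, doc)
  searches.select { |_id, query| query.call(doc) }.keys
end

doc = { description: "nintendo switch", price: 250.0 }
```

&lt;p&gt;Percolating &lt;code&gt;doc&lt;/code&gt; returns &lt;code&gt;[1]&lt;/code&gt;: the document, not a query, is the input, and stored searches are the results.&lt;/p&gt;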

&lt;h1&gt;
  
  
  In Depth
&lt;/h1&gt;

&lt;p&gt;Under the hood, this is implemented in about the way you would expect: indexes with percolate fields keep a hidden (in memory) index. Documents listed in your percolate queries are first put in that index, then a normal query is executed against that index to see if the original percolate-field-bearing document matches.&lt;/p&gt;

&lt;p&gt;An important point to remember is that this hidden index gets its mappings from the original percolator index. So indexes used for percolate queries need to have mappings appropriate for the original data and the query document data.&lt;/p&gt;

&lt;p&gt;This introduces a bit of a management problem, in that your index data and the percolate query documents could use the same field in different ways. A simple answer to that is to use the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/object.html"&gt;object type&lt;/a&gt; to isolate the percolate-relevant mappings from normal document mappings.&lt;/p&gt;

&lt;p&gt;Assuming the queries you are using were originally written for another index of actual documents, it makes the most sense to isolate the data going directly into the percolate index and give the root level over to mapping definitions for percolate query documents.&lt;/p&gt;

&lt;p&gt;Also, because percolate fields are parsed into searches and saved at index time, you likely will need to reindex percolate documents after upgrading to take advantage of any optimizations to the system.&lt;/p&gt;

&lt;h1&gt;
  
  
  An Example
&lt;/h1&gt;

&lt;p&gt;In my opinion, percolator examples are one of the prime contributors to making the tool hard to understand. They tend to be too simple, to the point where it’s hard to distinguish the parts.&lt;/p&gt;

&lt;p&gt;In this example, we’re going to build out an index of saved term and price searches for toys. The idea behind it is that users should be able to put in a search term and a max price, then get notified as soon as something matching that term goes below this price. Users should also be able to turn these notifications on and off.&lt;/p&gt;

&lt;p&gt;The mapping below implements a percolate index to support this feature. Fields related to the saved search itself are in the &lt;code&gt;search&lt;/code&gt; object, while fields related to the original toys live at the root level of the mappings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"percolator"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"float"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a document that represents a stored search would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
              &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nintendo switch"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;300.00&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are only storing data inside the &lt;code&gt;search&lt;/code&gt; object field. The mappings for &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; are just there to support percolate queries.&lt;/p&gt;

&lt;p&gt;At query time, we want to use both the plain object fields and the “special” percolator field. This query would check, inside a user’s searches, to see which currently-enabled searches match the document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"percolate"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search.query"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"document"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nintendo Switch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;250.00&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"search.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"search.user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that it combines percolate matching (testing the document against the queries stored in &lt;code&gt;search.query&lt;/code&gt;) with regular &lt;code&gt;term&lt;/code&gt; queries that restrict evaluation to searches that are enabled and belong to the right user.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some Additional Thoughts
&lt;/h1&gt;

&lt;p&gt;Because resolving a percolate filter means actually running the stored queries, you might need to pay extra attention to shards/replicas for a percolate index. Each additional shard reduces the number of search-bearing documents, and therefore the number of queries, any one machine has to evaluate.&lt;/p&gt;
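
&lt;p&gt;As a rough sketch of what that might look like (the index name and counts here are invented for illustration, not taken from any real deployment), you could raise the primary shard count when creating the percolate index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 2
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;number_of_shards&lt;/code&gt; is fixed at index creation, so it pays to think about percolate load up front, while replicas can be adjusted later to spread read traffic.&lt;/p&gt;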

&lt;p&gt;Percolate queries also have an option to fetch the document to test from another index inside the cluster. This takes the form of a literal GET request, so there’s not much benefit in trying to keep shards from the two indices on the same nodes.&lt;/p&gt;
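
&lt;p&gt;That option replaces the inline &lt;code&gt;document&lt;/code&gt; with a reference to a stored one. A minimal sketch (the &lt;code&gt;products&lt;/code&gt; index and document id are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "query": {
    "percolate": {
      "field": "search.query",
      "index": "products",
      "id": "product-123"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;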

</description>
      <category>elasticsearch</category>
    </item>
  </channel>
</rss>
