There's often a huge fuss about making data-driven decisions, leveraging data analytics, using data science and data-centred thinking. From a technological point of view, data is usually stored and accessed using databases. Databases are sophisticated systems that abstract away the fairly complex logic and storage engine behind keeping data on disk. Several databases exist in today's tech landscape, but this article will focus on something common to how nearly all of them store and retrieve data.
Drum roll 🥁🥁🥁🥁🥁🥁🥁. We will be discussing indexes. The topic gives it away anyway.
Log-based Data Structures
My approach takes a slightly historical perspective on database indexes. To begin, we look at the simplest form of a database: a file. Think of this file as a log file. Every time we need to store something, we append the data to the log file. To retrieve the data, we traverse each entry until we find the information we want.
The above illustration sounds straightforward and is very efficient for storing data (write operations), because all the database needs to do is append. However, it introduces a huge challenge for retrieving data (read operations) as the data grows in volume, because the program has to scan through every entry on each lookup. In computer science, the big-O notation for this sort of operation is O(n), n being the number of records.
For clarity, we can think of each entry as having a key and a value. The indexes in this article will be built on the key of each entry.
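To make this concrete, here is a minimal sketch in Python of such a log-based key-value store. The function names and the comma-separated line format are made up purely for illustration:

```python
# A minimal sketch (not production code) of the append-only log idea:
# db_set appends; db_get scans the whole file, which is O(n) per lookup.

def db_set(path, key, value):
    # Writes are fast: just append one line to the end of the file.
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(path, key):
    # Reads are slow: scan every entry, keeping the last match so that
    # the most recent write for a key wins.
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("data.log", "user:42", "Ada")
db_set("data.log", "user:42", "Ada Lovelace")  # an update is just another append
print(db_get("data.log", "user:42"))           # -> Ada Lovelace
```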
This is where indexes come in. Imagine if we had a separate data structure that tells us where the information is, something like the index at the back of a textbook. An index in this context is a data structure, derived from the primary data, that helps retrieve information quickly. There are multiple variations of indexes, and we will get into them in the next sections.
Hash indexes
A hash index is represented using a data structure similar to the dictionary data type in Python or a HashMap in Java.
Let's refer back to our simple example database, where we have multiple entries appended to a log-based file. A hash index would be akin to keeping an in-memory key-value map from every key in the appended data to its byte offset in the data file. In so doing, when we look up some data, we search the index for the key and find a pointer to the location of the actual values in the data file. Also, every time a value is added to the database, the hash index is updated to include that new entry under its key.
Before going into the pros and cons of hash indexes, I will build on the illustration of this database. Since the existing database is log-based (append-only), when an existing value needs to be updated, the database does not seek out the old data and modify the values in place. Rather, it appends an entirely new entry, and read operations are designed to return the latest value for any given key. This bolsters the need for an index, because the hash map is also updated to point to the most recent location of that entry.
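A minimal sketch of that idea, building on the illustrative functions above: the in-memory dict maps each key to the byte offset of its latest entry, and an update simply points the key at the newer offset:

```python
# Sketch of a hash index over the append-only log: `index` maps each
# key to the byte offset of that key's latest entry in the data file.
index = {}

def db_set(path, key, value):
    with open(path, "ab") as f:
        offset = f.tell()                     # where this entry will begin
        f.write(f"{key},{value}\n".encode())
    index[key] = offset                       # point at the newest entry

def db_get(path, key):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])                    # jump straight to the entry
        line = f.readline().decode()
    return line.rstrip("\n").partition(",")[2]

db_set("data.log", "user:42", "Ada")
db_set("data.log", "user:42", "Ada Lovelace")  # update: index now points here
print(db_get("data.log", "user:42"))           # -> Ada Lovelace, no full scan
```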
Assume we have a service that is transactional in nature. This implies that new entries will be added and existing entries will be changed frequently. For entirely new entries, this is no problem. However, every change to an existing entry appends a new entry and makes the older values redundant. Even if hash indexes solve the problem of long lookup times, reclaiming the disk space used by those redundant values is still a significant challenge.
But how do we prevent the disk from running out of space? A simple solution is to split the log into smaller files. Each split can be referred to as a segment. After splitting the files into segments, a compaction process can be run in the background. This compaction takes one or more segments and merges them: only the latest value for each key is kept and written into a new segment, while the older values are discarded. Afterwards, read and write operations are redirected from the older segments to the newly compacted segments. Note that the database is still split into files, but compaction reduces the number of redundant entries in the database.
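A sketch of what such a background compaction could look like, assuming each segment file holds one key,value pair per line and segments are processed oldest first:

```python
# Sketch of compaction: merge segment files, keep only the latest value
# per key, and write the survivors out as a new, smaller segment.

def compact(segment_paths, output_path):
    latest = {}
    # Read segments oldest first, so newer entries overwrite older ones.
    for path in segment_paths:
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value
    with open(output_path, "w") as out:
        for key, value in latest.items():
            out.write(f"{key},{value}\n")
    # Reads are then redirected to output_path, and the old segment
    # files can be deleted to reclaim disk space.
```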
In relation to hash indexes, each segment will have its own in-memory hash map, and these are also updated after merging and compaction. When a lookup is done, it first checks the hash map of the most recent segment, then traverses backwards to the next most recent, and so on.
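As a rough sketch, assuming each segment carries its own key-to-offset hash map:

```python
# Lookups check the newest segment's hash map first, then fall back
# to older segments until the key is found.

def lookup(key, segments):
    # `segments` is a list of (index_dict, data_path) pairs, newest first.
    for index, path in segments:
        if key in index:
            with open(path, "rb") as f:
                f.seek(index[key])
                line = f.readline().decode()
            return line.rstrip("\n").partition(",")[2]
    return None  # the key does not exist in any segment
```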
Limitations of hash indexes
In practice, log-based databases and hash indexes are very efficient but still have limitations. Some of the core limitations of the example above: concurrency control and crash recovery are hard to get right, writes can be left partially complete, there is no natural support for delete operations (real systems handle this with special tombstone records), and range queries are inefficient. Because the index lives in memory (RAM), if the server is restarted, all the hash maps are lost. Additionally, the entire hash index must fit in memory. These limitations do not fit the requirements of how we interact with databases and the deluge of data we work with today.
To address these limitations, enhancements and changes are made to the existing data structure housing the per-segment hash map indexes. Referring back to the current state of the database, we recall that our data is now split into different segments, and those segments undergo compaction.
SSTables
We make a fairly simple change to how the data is stored in these segment files: we sort the data (key-value pairs) by the key. By doing this, the data is stored on disk in a format sorted by key. This is referred to as a Sorted String Table (SSTable). The obvious limitation this solves, compared to plain hash maps, is that we can now fully support range queries.
The term was coined in Google's Bigtable paper, along with the term memtable.
In comparison to log-based storage with hash indexes, SSTables introduce an additional step to merging and compaction: an algorithm very similar to the merge step of the popular mergesort algorithm is used to maintain the sorted order of all entries in the SSTable.
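As an illustration, Python's standard library can sketch this k-way merge: heapq.merge lazily merges already-sorted inputs, and keeping the last entry seen for each key lets newer segments win:

```python
import heapq
from itertools import groupby

# Sketch: merging two sorted segments, mergesort-style. Segments listed
# later are newer, so their values win when a key appears in both.
old = [("apple", "1"), ("banana", "2"), ("pear", "9")]
new = [("banana", "7"), ("cherry", "3")]

merged = heapq.merge(old, new, key=lambda kv: kv[0])  # one sorted stream
compacted = [list(group)[-1]                          # keep newest per key
             for _, group in groupby(merged, key=lambda kv: kv[0])]

print(compacted)
# -> [('apple', '1'), ('banana', '7'), ('cherry', '3'), ('pear', '9')]
```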
It is worth noting that sometimes SSTables are not referred to as indexes but as a data structure in themselves, which seems to be a better description of what an SSTable is. In that case, the accompanying index can be referred to as an SSIndex, an SSTable index, or a memtable. However, for the purpose of this article, SSTable will refer to the combination of both the data file (sorted key-value pairs on disk) and its corresponding in-memory index containing the keys and their byte offsets.
SSTables vs Hash Indexes
All the advantages of hash indexes are preserved in SSTables. That is,
- It is still efficient for write operations since it is log-based (append-only)
- The in-memory index still acts as a pointer to the actual location of the data on disk
- The background compaction process keeps it efficient from a storage perspective
Key Advantages
The additional advantages compared to hash indexes are twofold. First, we can now query ranges, because our data is sorted by the key. Second, the index can be sparse. To explain the second point, suppose we have keys from A0 to A1000. The index does not need to hold every one of those keys; it can hold every other key (A0, A2, A4, A6 and so on) with their corresponding pointers to their locations on disk. When retrieving the value for key A5, even though it is not in the index, we know that it lies between A4 and A6, so we can begin our search from A4's offset. Thus, the index can be sparse without trading off read performance.
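A sketch of that sparse lookup using binary search (the offsets are made-up numbers):

```python
import bisect

# Sparse index: only every other key is kept in memory, each with the
# byte offset of its entry in the sorted data file.
sparse_keys    = ["A0", "A2", "A4", "A6"]  # sorted, sampled keys
sparse_offsets = [0, 130, 270, 400]        # illustrative byte offsets

def scan_start(key):
    # Find the rightmost indexed key <= the target, binary-search style.
    i = max(bisect.bisect_right(sparse_keys, key) - 1, 0)
    return sparse_keys[i], sparse_offsets[i]

# "A5" is not in the index, but it must lie between A4 and A6, so the
# scan of the sorted file can start at A4's offset, not at the beginning.
print(scan_start("A5"))  # -> ('A4', 270)
```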
Drawbacks of SSTables
However, it is not without its limitations. The index still needs to fit within the memory of the server, and if the server crashes, the in-memory index is lost. In a very busy transactional database, that is a lot of work required to keep the SSTable and its index up to date. In the next section, we continue to build on our knowledge of log-based databases and indexes with LSM Trees.
Log-Structured Merge (LSM) Trees
We have established that log-based approaches to data storage can be very efficient. Just like SSTables, LSM Trees are also log-based in the way they store data on disk, and they have an in-memory data structure akin to the memtable in SSTables. In fact, LSM Trees make use of SSTables.
LSM Trees are layered collections of memtables and SSTables. The first layer is the memtable, stored in memory. The following layers are cascaded SSTables, and these layers are stored on disk. Its major characteristic can be observed in how it handles write operations: entries are initially added to the memtable and are then flushed to SSTables after an interval or when the memtable reaches a certain size. This mechanism makes writes very fast but can slow down reads.
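A sketch of that write path, with a deliberately tiny flush threshold so the behaviour is visible:

```python
# LSM write path sketch: writes land in an in-memory memtable and are
# flushed to a new sorted segment (SSTable) once a size limit is hit.

MEMTABLE_LIMIT = 4  # absurdly small, just for illustration
memtable = {}
segment_count = 0

def put(key, value):
    global segment_count
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        with open(f"sstable_{segment_count}.seg", "w") as f:
            for k in sorted(memtable):      # sorted by key -> an SSTable
                f.write(f"{k},{memtable[k]}\n")
        memtable.clear()
        segment_count += 1

for i in range(10):
    put(f"key{i}", str(i))
# -> sstable_0.seg and sstable_1.seg on disk, two keys still in memory
```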
Implication on read and write operations
Reads have to look up the key in the memtable first for the most recent entries, then traverse through the layers of SSTables. Therefore, it is a painful operation to look up a key that does not exist in the LSM Tree, because it ends up searching through all the layers of data available. Bloom filters help mitigate this by quickly determining whether a key might exist in an SSTable; if the filter says the key is definitely absent, that layer can be skipped entirely.
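A toy Bloom filter to illustrate the idea: it can answer "definitely not present" or "possibly present", but it never gives a false negative:

```python
import hashlib

SIZE = 1024            # size of the bit array
bits = [False] * SIZE  # the Bloom filter's bits

def _positions(key):
    # Derive three bit positions from independently salted hashes.
    for salt in (b"a", b"b", b"c"):
        digest = hashlib.sha256(salt + key.encode()).digest()
        yield int.from_bytes(digest[:4], "big") % SIZE

def add(key):
    for pos in _positions(key):
        bits[pos] = True

def might_contain(key):
    # If any bit is unset, the key was definitely never added.
    return all(bits[pos] for pos in _positions(key))

add("user:42")
print(might_contain("user:42"))   # True: possibly present (it is)
print(might_contain("user:999"))  # almost certainly False: skip this SSTable
```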
Overall, LSM Trees seem to provide superior performance for workloads that involve a lot of write operations, while a plain SSTable setup can be preferable when quicker reads are essential.
Databases like Cassandra and LevelDB use LSM Trees.
To address the limitation of losing the index when the server crashes, it sounds intuitive to store the index on disk, since it is small; being small, it should be easy to read and update. However, that is not the case, because of how storage disks (SSDs and HDDs) are designed: random in-place updates are far more expensive than sequential appends. In practice, the index is snapshotted to a file that is read back into memory when the database server is up and running again. Furthermore, for every write operation, the key is added to the memtable (remember, the memtable is the in-memory data structure) and to a Write-Ahead Log (WAL) which is persisted on disk. Recall that append-style write operations are very efficient. The essence of having a persistent WAL is that, in the event of a server crash, the WAL has everything required to rebuild the in-memory index. This WAL technique can be applied to SSTables, LSM Trees and other indexing strategies.
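A sketch of the WAL mechanism: append to the log (and force it to disk) before touching the memtable, then replay the log on restart:

```python
import os

memtable = {}

def put(key, value):
    # Append to the WAL *before* updating the in-memory state.
    with open("wal.log", "a") as wal:
        wal.write(f"{key},{value}\n")
        wal.flush()
        os.fsync(wal.fileno())  # make sure the entry really hit the disk
    memtable[key] = value

def recover():
    # After a crash, replaying the WAL rebuilds the in-memory state.
    if os.path.exists("wal.log"):
        with open("wal.log") as wal:
            for line in wal:
                key, _, value = line.rstrip("\n").partition(",")
                memtable[key] = value

put("user:42", "Ada")
memtable.clear()  # simulate a crash wiping RAM
recover()
print(memtable)   # -> {'user:42': 'Ada'}
```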
B-Trees
All the indexes and data structures discussed above have something in common: they are all log-based. Here, I discuss a very popular and mature data structure that is widely used in the most popular databases today. B-Trees are data structures that keep and maintain sorted data such that they allow searches and sequential access to the data in logarithmic time. They are self-balancing and, if you are familiar with programming, very similar in nature to a binary search tree.
They are tree-like and essentially break down the database into fixed-size blocks referred to as pages. These pages are commonly 4KB in size by default, although many RDBMS systems offer the option to change the page size. In comparison to the log-based approaches, which use append-only segments, B-Trees allow us to access and manipulate the data in place on disk using references to those pages.
Because it is tree-like, it has one root plus inner and leaf nodes. One node is designated as the root of the B-Tree and holds pointers to child nodes. Every lookup by key starts there and traverses down the levels of the hierarchy to the key being looked for, guaranteeing access in logarithmic time.
Remember, we made the assumption that our data entries are key-value pairs. Therefore, it is worth noting that of the three types of nodes (root, inner and leaf), only the leaf nodes contain the actual information (values); the others only contain references to other nodes, and so on, until they point to the corresponding leaf node. Additionally, within the tree hierarchy, the keys in each node act as boundaries for the key ranges that its child references cover.
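A small sketch of that lookup over a hand-built two-level tree (the node layout is simplified; real B-Trees pack many keys into each page):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within this node
        self.children = children  # child nodes (inner nodes only)
        self.values = values      # actual values (leaf nodes only)

def search(node, key):
    if node.children is None:     # leaf: the values live here
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    # Inner node: keys are range boundaries; descend into the child
    # whose range covers the key.
    return search(node.children[bisect.bisect_right(node.keys, key)], key)

leaf1 = Node(keys=["A1", "A3"], values=["v1", "v3"])
leaf2 = Node(keys=["A5", "A7"], values=["v5", "v7"])
root = Node(keys=["A5"], children=[leaf1, leaf2])  # keys >= A5 go right

print(search(root, "A7"))  # -> 'v7', reached in two hops
```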
When a write operation (an insert or update, in SQL terminology) is done, the idea is to locate the page on disk where the value should live and write the value to that page. Therefore, we must consider what happens when a page becomes full. In this case, a split operation occurs: the page is split into two, and the nodes above must be updated to reflect the change, a process that can cascade up the tree.
B-Trees have depth and width that are inversely proportional to one another; that is, the deeper the B-Tree, the slimmer it is. The technical term for the width is the branching factor, defined as the number of references to child nodes within a single node. Linking back to a write operation that may cause a page to split: this will, in turn, require several updates if the B-Tree has a lot of depth. Additionally, careful measures must be put in place to protect the tree's structure during splits and concurrent operations; to achieve this, internal locks (latches) are placed on the affected pages for the time they are being updated.
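To get a feel for how shallow these trees stay in practice: capacity grows exponentially with depth. With an illustrative branching factor of 500, four levels are already enough for tens of billions of keys (Kleppmann gives the figure of a four-level tree of 4KB pages with a branching factor of 500 storing up to 256TB). A quick back-of-the-envelope check:

```python
branching_factor = 500  # child references per page (illustrative)
depth = 4

# Capacity grows as branching_factor ** depth, so depth stays tiny.
print(f"{branching_factor ** depth:,}")  # -> 62,500,000,000 reachable entries
```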
B-Trees are not log-based, but they also use Write-Ahead Logs to recover from crashes and for redundancy.
B-Trees in comparison to LSM Trees
Writes are generally slower in B-Trees than in LSM Trees; on the other hand, reads are generally faster with B-Trees. It is, however, important to experiment and test extensively for any use case. Benchmarking is essential when choosing the database and indexing strategy that would best support your workload.
TL;DR
Each data structure supporting different indexes has its strong points as well as areas of weakness. In this article, we discussed four main data structures that power indexes: hash indexes, SSTables, LSM Trees and B-Trees. The first three are log-based, that is, append-only. We covered their respective limitations and how some address the limitations of others; for instance, SSTables support range queries because the data is sorted. We also looked at some general optimizations, such as the use of a Write-Ahead Log for crash recovery, compaction and merging for saving disk space, and latches for concurrency control. Lastly, we briefly compared the performance of different pairs of data structures at the tail end of each section.
This article is strongly inspired by my current read: Designing Data-Intensive Applications by Martin Kleppmann. I wholeheartedly recommend it if you want to broaden your understanding of data systems. Please share in the comments section interesting books, articles, and posts that have inspired and helped you understand a concept better.