Writing a SQLite clone from scratch in Rust
Alright, slowly we are getting closer and closer to our objective: a working, simple relational database modeled after SQLite, meaning, among other things, one embedded in a single file. This time, on top of all the research I already mentioned in the previous articles of the series, I spent a lot of time researching everything SQLite related: documentation, code base and talks.
I also spent much of my time since the last chapter of the series understanding the data structures used in relational databases at their fundamental level, how they are used in SQLite and what the best way would be to approach them in the SQLRite project.
Here is Linus Torvalds on the subject:

I’m a huge proponent of designing your code around the data, rather than the other way around, and I think it’s one of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
If you need more, here is one from Rob Pike in 1989:
Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
I think those guys know what they are talking about when it comes to software, don’t you agree? Anyway, my take on what they mean is that good data structures make code easy to design and maintain, whereas the best code in the world cannot make up for poor data structures. I think this also ties into what I said in Part 1: “If you spend enough time on planning, coding is easy”.
So on this chapter of the series we will do a deep dive into the main data structures used on database design and hopefully we will get a better understanding on why they are used.
The first thing I would like to explore is a data structure called B-Tree which is a key piece of modern database design.
Traditionally, sorted maps have been the domain of Binary Search Trees (BSTs). There is no shortage of literature, implementations and promotion of BSTs in the educational system. They are great to think about, have great theoretical and practical applications, and come in about a million different variations to satisfy your needs.
The basic idea of a BST is that every element in the tree gets a single node. Each node has two pointers or children, a left child node and a right child node. Nodes in the sub-tree to the left must contain elements that are smaller than the parent, and nodes to the right must contain elements that are larger. This makes search fairly straightforward: start at the “root” of the tree and compare the element in that node to your search key. Then recursively search either the left or right tree accordingly (or stop if you find an exact match).
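To make that concrete, here is a minimal BST sketch in Rust (illustrative only, not production code): one element per node, two optional children, and a recursive search.

```rust
// A minimal BST: each node holds a single element and two optional children.
struct Node {
    key: i32,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

impl Node {
    fn leaf(key: i32) -> Node {
        Node { key, left: None, right: None }
    }

    // Recursive search: smaller keys go left, larger keys go right.
    fn contains(&self, key: i32) -> bool {
        if key == self.key {
            true
        } else if key < self.key {
            self.left.as_deref().map_or(false, |n| n.contains(key))
        } else {
            self.right.as_deref().map_or(false, |n| n.contains(key))
        }
    }
}

fn main() {
    let root = Node {
        key: 10,
        left: Some(Box::new(Node::leaf(5))),
        right: Some(Box::new(Node::leaf(20))),
    };
    println!("{} {}", root.contains(5), root.contains(7)); // true false
}
```

Note that every node is a separate `Box` allocation, which is exactly the property that will come back to bite us below.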
But BSTs have some serious practical issues when it comes to how computers actually work. A lot of applications, like databases and file systems, are not bound by how fast the CPU can execute instructions, but by how fast it can access the data on disk. The bottom line is that modern CPUs are so fast that the data they are working with has to be really close to the CPU to actually be used. Light only travels so fast! Another thing to take into account is the hierarchy of caches available to the CPU, which ranges from “very small and very fast” to “very large and very slow (relatively)”.
Caches usually work based on a time-space locality assumption. Ideally you would like the data you need to be as close together as possible. For example, if you’re working with some data at location A right now, ideally you would want the next piece of data to be near location A. A great example of this assumption is something like looping over an array: every piece of data you want next is literally right next to the last one. So we can start to see how choosing the right data structure can make all the difference.
This assumption is usually implemented roughly like this: when location A in memory is requested, the CPU will check if it is in the fastest cache. If it is, great (cache hit)! If it’s not, too bad (cache miss), check the next (slower, bigger) level of cache. In the worst case this bottoms out into your computer’s RAM (or worse, the disk!). When you do find the data, it gets added to all the previous (smaller, faster) levels of cache, along with some of the data surrounding it. This operation then evicts some data that is determined to be unlikely to be needed. How exactly that works is out of the scope of this post, but the point is: cache hits are fast, so we want to access data in a space-time-local way. To get an idea of scale, check out these numbers.
We started talking about BSTs and ended up talking about CPUs, cache hits and cache misses. But how do BSTs access data? Basically, randomly. Each node is allocated separately from every other node in the tree, so unlike an array, each node sits at a separate location in memory, and a search amounts to a series of random accesses. As a rough estimate, every time you follow a pointer you can expect a cache miss. Niet goed! (As the Dutch would say!)
BSTs are also very memory inefficient. Each node has two pointers for every single entry in the tree, so on a 64-bit system that means a 16-byte overhead for every element. To make it worse, half of those pointers are null, and that’s the best-case scenario. When you factor in issues like padding and any extra metadata that nodes need to store (such as the color bit in a red-black tree), they are just terrible.
To make things worse for BSTs, note that every insertion triggers an allocation. Allocations are generally slow, so if we can avoid them, we should!
B-Trees to the rescue!
A B-Tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time. But wait, isn’t that exactly what a BST is? So why do I need a B-Tree?
A B-Tree takes the idea of a BST and, roughly speaking, says: “how about we put some arrays in there? Arrays are easy, and computers love arrays, so why not?” So instead of each node consisting of a single element with two children, a B-Tree node has an array of elements and an array of children. All ordered!
Basically, for some fixed constant B, each node contains between B − 1 and 2B − 1 elements in sorted order (the root can have as few as one element). An internal node (one that has children) with k elements has k + 1 children. In this way each element still has a “left” and a “right” child; for example, the 2nd child contains elements strictly between the first and second elements.
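Here is how that node layout might look in Rust, together with a check of the invariants just described. B = 3 is an arbitrary illustrative choice, and this is a sketch, not a full B-Tree implementation.

```rust
// Sketch of a B-Tree node for a fixed constant B: keys live in a sorted
// array, and an internal node with k keys has k + 1 children.
const B: usize = 3;

struct BTreeNode {
    keys: Vec<i32>,           // between B - 1 and 2B - 1 keys (root: at least 1)
    children: Vec<BTreeNode>, // empty for a leaf, keys.len() + 1 otherwise
}

impl BTreeNode {
    // Check the structural invariants described above.
    fn is_valid(&self, is_root: bool) -> bool {
        let k = self.keys.len();
        let sorted = self.keys.windows(2).all(|w| w[0] < w[1]);
        let count_ok = if is_root { k >= 1 } else { (B - 1..=2 * B - 1).contains(&k) };
        let children_ok = self.children.is_empty() || self.children.len() == k + 1;
        sorted && count_ok && children_ok
            && self.children.iter().all(|c| c.is_valid(false))
    }
}

fn main() {
    let root = BTreeNode {
        keys: vec![10, 20],
        children: vec![
            BTreeNode { keys: vec![1, 2, 5], children: vec![] },
            BTreeNode { keys: vec![12, 15], children: vec![] },
            BTreeNode { keys: vec![25, 30], children: vec![] },
        ],
    };
    println!("{}", root.is_valid(true)); // true
}
```

Notice how a whole batch of keys now lives in one contiguous array: that is the cache-friendly part.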
According to Knuth’s definition, a B-tree of order M is a tree which satisfies the following properties:
- Every node has at most M children.
- Every non-leaf node (except root) has at least ⌈M/2⌉ child nodes.
- The root has at least two children if it is not a leaf node.
- A non-leaf node with k children contains k − 1 keys.
- All leaves appear on the same level and carry no information.
Each internal node’s keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.
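The “which subtree do I descend into?” decision can be written as a tiny function over the node’s sorted separator keys. A sketch (`child_index` is my own name for it, not anything from SQLite):

```rust
// Which child subtree should we descend into? Child i holds keys strictly
// between keys[i - 1] and keys[i], with the ends open at the edges.
fn child_index(keys: &[i64], target: i64) -> usize {
    keys.iter().position(|&k| target < k).unwrap_or(keys.len())
}

fn main() {
    let keys = [10, 20]; // a1 = 10, a2 = 20
    println!(
        "{} {} {}",
        child_index(&keys, 5),  // 0: leftmost subtree (< a1)
        child_index(&keys, 15), // 1: middle subtree (between a1 and a2)
        child_index(&keys, 25), // 2: rightmost subtree (> a2)
    );
}
```

A real implementation would binary-search the key array instead of scanning it linearly, but the mapping from keys to children is the same.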
Historically, ever since their invention in 1970, B-Trees have been very popular as an on-disk data structure. This is because accessing disk can be a super slow operation, but with a B-Tree you get big chunks back at once, in contrast to a BST. It takes far less CPU time to search through your data than it does to read the data into memory from disk. So if you take, for example, B = 1000, you get a thousand entries of the tree at once and can process them in RAM relatively fast. Your tree will also be really shallow (not deep), especially in comparison to a BST, meaning each search will follow a pointer and maybe hit the disk only a couple of times. Sound familiar? This is the cache-hit problem we were talking about just above.
Just to give you an idea, here are some real numbers:
A 100,000-row SQLite database has a B-Tree of depth 3. So to fetch a record we only need to read 3 pages, or jump 3 times between nodes. If we had a balanced BST, we would have needed log(100,000) / log(2) ≈ 17 seeks! That’s more than five times as many. We definitely do not want that!
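The back-of-the-envelope math behind those numbers is just a change of logarithm base. A quick sketch (idealized: a perfectly packed tree, so a fanout of 1000 gives an even shallower depth than SQLite’s real-world 3):

```rust
// Idealized depth estimate: ceil(log_fanout(n)) levels for a tree with the
// given branching factor. Real trees (like SQLite's) are less tightly
// packed, so their actual depth is a bit larger.
fn depth(n: f64, fanout: f64) -> u32 {
    n.log(fanout).ceil() as u32
}

fn main() {
    let n = 100_000.0;
    println!("balanced BST:       {} hops", depth(n, 2.0));    // 17
    println!("B-Tree, fanout 1000: {} hops", depth(n, 1000.0)); // 2
}
```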
That is all nice and very interesting, but enough theory about B-Trees; let’s take a look at how SQLite uses them. If you would like to go more in depth on B-Trees and their implementation in different programming languages, I highly recommend taking a look at Open Data Structures. For another great source of information that gives a lot of background on B-Trees, see Knuth, The Art of Computer Programming, Volume 3, “Sorting and Searching”, pages 471–479. Nice light reading…
SQLite uses B-Trees to represent both tables and indexes, but it actually uses two variants. A traditional B-Tree is used to store indexes, and is referred to as an “Index B-Tree”. To store tables, SQLite uses a variation called a B+Tree, referred to as a “Table B-Tree”.
The main difference is that a “Table B-Tree”, or B+Tree, uses a 64-bit signed integer key, referred to as the “ROWID”, to reference the data in the internal nodes, while all the data is actually stored in the leaf nodes. Additionally, a leaf node may include a pointer to the next leaf node to speed up sequential access. An “Index B-Tree”, or just B-Tree, uses arbitrary keys and stores no data.
Here are some basic differences:
+-------------------------------+----------------+------------------+
| #                             | B-tree         | B+ tree          |
+-------------------------------+----------------+------------------+
| Pronounced                    | “Bee Tree”     | “Bee Plus Tree”  |
| Used to store                 | Indexes        | Tables           |
| Internal nodes store keys     | Yes            | Yes              |
| Internal nodes store values   | Yes            | No               |
| Number of children per node   | Fewer          | More             |
| Internal nodes vs. leaf nodes | Same struct    | Different struct |
+-------------------------------+----------------+------------------+
Another important piece of information is that SQLite keeps one “Table B-Tree” in the database for each table in the database schema, including system tables such as sqlite_schema. SQLite also has a type of table called WITHOUT_ROWID, but I won’t get into this right now. SQLite also keeps one “Index B-Tree” in the database for each index in the schema, including implied indexes created by UNIQUE constraints.
Let’s try to visualize it!
Imagine that you have the following table stored in the database:
+-------+------+-------+-----+
| ROWID | Name | Marks | Age |
+-------+------+-------+-----+
|     6 | Jone |     5 |  28 |
|    15 | Alex |    32 |  45 |
|    12 | Tom  |    37 |  23 |
|    53 | Ron  |    87 |  13 |
|    24 | Mark |    20 |  48 |
|    25 | Bob  |    89 |  32 |
+-------+------+-------+-----+
First, the database creates a unique auto-incrementing index (the ROWID, or it uses the primary key) for each given record, and converts the relevant rows into a byte stream. Then it stores each key and record byte stream in a B+Tree, using the ROWID as the key for indexing. The key and record byte stream together are known as the payload. The resulting B+Tree could be represented as follows.
As you can see, all records are stored in the leaf nodes of the B+Tree, or “Table B-Tree”, and the index, or ROWID, is used as the key for building the B+Tree. No records are stored in the internal nodes, and each leaf node has a reference to the next record in the tree. This way the database can perform a binary search using the index, or a sequential search by visiting every element, traveling only through the leaf nodes.
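The “ROWID → byte-stream payload” idea can be sketched in a few lines of Rust. To be clear, this is NOT SQLite’s actual record format (which uses varints and serial types); it is just a hypothetical length-prefixed encoding, with std’s `BTreeMap` standing in for the on-disk tree.

```rust
use std::collections::BTreeMap;

// Hypothetical row serialization: turn a row into a byte-stream payload.
fn encode_row(name: &str, marks: u8, age: u8) -> Vec<u8> {
    let mut payload = Vec::new();
    payload.push(name.len() as u8); // length-prefixed name
    payload.extend_from_slice(name.as_bytes());
    payload.push(marks);
    payload.push(age);
    payload
}

fn main() {
    // BTreeMap stands in for the "Table B-Tree": ROWID -> payload.
    let mut table: BTreeMap<i64, Vec<u8>> = BTreeMap::new();
    table.insert(6, encode_row("Jone", 5, 28));
    table.insert(15, encode_row("Alex", 32, 45));
    println!("{:?}", table.get(&6)); // Some([4, 74, 111, 110, 101, 5, 28])
}
```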
If no indexing is used, the database will read each of these records to find a given record. When indexing is enabled, the database creates a B-Tree, or “Index B-Tree”, for each of the indexed columns in the table, as follows. In this example we are indexing the “Name” column, and the column value is used as the key in the B-Tree. The index entry holds a reference to the actual data record in the “Table B-Tree”.
When an index is available, the database engine first searches for the given key in the corresponding Index B-Tree and gets the ROWID in logarithmic time. It then performs another search in the Table B-Tree, or B+Tree, using the ROWID it just found, also in logarithmic time, and returns the record.
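That two-step lookup can be sketched with std’s `BTreeMap` standing in for the on-disk trees. The names and record shape here are illustrative, not SQLite’s API:

```rust
use std::collections::BTreeMap;

// Two-step lookup: the "Index B-Tree" maps a Name to a ROWID, and the
// "Table B-Tree" maps that ROWID to the record (Name, Marks, Age).
fn find_record<'a>(
    name_index: &BTreeMap<String, i64>,
    table: &'a BTreeMap<i64, (String, u32, u32)>,
    name: &str,
) -> Option<&'a (String, u32, u32)> {
    let rowid = name_index.get(name)?; // first logarithmic search
    table.get(rowid)                   // second logarithmic search
}

fn main() {
    let mut table = BTreeMap::new();
    table.insert(6, ("Jone".to_string(), 5, 28));
    table.insert(15, ("Alex".to_string(), 32, 45));

    let mut name_index = BTreeMap::new();
    name_index.insert("Jone".to_string(), 6);
    name_index.insert("Alex".to_string(), 15);

    println!("{:?}", find_record(&name_index, &table, "Alex"));
}
```

Both `get` calls are O(log n), which is exactly the cost model described above.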
Since we didn’t really do any coding this time, here is just a taste of how a database is represented in the SQLite codebase, which by the way is really well commented and documented.
There it is, our beloved Btree, representing the beginning of our database. If we dig into these properties we will start to see a lot of what we have been talking about in this post. I won’t do it here, but I highly recommend taking a look at the SQLite code if you are curious.
I am a very big fan of the First Principles approach. When applying First Principles, you first identify the problem you are trying to solve and define some constraints and assumptions. The good thing about defining constraints at the start is that you are basically defining what you cannot do, and if you define what you cannot do, what you can do becomes a lot clearer. After this first step, you break the problem down into smaller pieces, or its fundamental principles. And last but not least, you create a new solution from scratch.
This time we took a very important step towards understanding the problem, defined some constraints by exploring the limitations of some data structures, and even broke the problem down into smaller pieces, like how to deal with tables and indexes.
Now we are ready to decide how we are going to build a solution to the problem from scratch, meaning, how are we actually going to build our data structures in Rust.
If you wanna follow this track don’t forget to follow me here on Medium and also give a couple of claps!