Building a B+ Tree Storage Engine (Part 1): Why Order Stops Making Sense on Disk

#database #algorithms #learning #programming

I had been doing web dev for most of my coding career. I'd take breaks from it now and then, building games, some small project that looked interesting, or just learning something new. But this time I wanted to go a little deeper.

So I looked around for something interesting and built a simple in-memory key-value store in C. But it felt too simple. It was nothing like the technologies I actually used in web dev, like Redis and databases such as MySQL and Postgres. So I searched around and found the B-tree. I implemented a bit of an in-memory B-tree, then found out a B-tree isn't really what those technologies use either, and found the B+ tree. I finished an in-memory B+ tree. It was messy, but the concept itself was what real databases actually use. So I went looking for how real databases use B+ trees, and that's where I ran into pages, WAL, and so on.

Then I thought about how to add disk saving to it. In the key-value store I had done disk persistence before, just with a CSV file. This time I realized I had to write the disk version from scratch. That wasn't really a problem, I had the general idea of how things worked, I just had to replace the memory pointers with bytes and pages.

At first I just copied the layout from the memory tree. I looked up how people usually save these bytes and found things like magic, version, and so on. I defined an order macro at the top and used fixed size records, same as the memory tree.

// Inital implementation of the tree header where order was present
struct Header
{
    uint8_t magic;
    uint8_t version;
    uint16_t root_page_id;
    uint16_t order;
    uint16_t page_count;
    uint16_t page_size;
};

But then I thought, this doesn't feel real. I mean, it's not like I was building software for people to use, but I at least wanted to implement what real databases used. Fixed size records weren't it. So I scrapped them and decided to do variable sized records.

So I started trying to store variable sized records, and right away I hit a problem: how do I even find a record now? With fixed size records it was easy, they were all the same length, so I could jump straight to the one I wanted. But once records could be any size, I had no idea where one ended and the next one started.

My first idea was to put some kind of delimiter between them. That fell apart pretty fast, because a key is just bytes, so the delimiter could show up inside a key and I'd have no way to tell it apart. And even then, I'd still have to read through the node from the start to find anything.

So then I thought, what if I put a little metadata in front of each key, like its length? At least then I'd know where each key ended. That helped a bit, I could skip past the keys that didn't match by jumping over their bytes. But it still felt off. To actually find keys fast you want them sorted, so you can binary search instead of scanning the whole node. And with the metadata sitting inline next to each record, keeping the node sorted meant moving those variable sized records around every time I inserted one, which got convoluted fast. And if I didn't keep them sorted, I was back to walking the node one key at a time, worst case all the way to the end, and with small records that was a lot of jumps.

I asked some chatbots about it, and that's where I found the slot design. Instead of keeping that metadata inline with each record, you put it in a little directory at the front of the page and store the actual records from the back. The slots are all the same size, so keeping them sorted just meant moving the slots around instead of the actual records. And sorted, same sized slots are exactly what a binary search needs. I'm still scanning them for now, but the setup is there to do it properly later. Ok, that part was done.

struct LeafSlot
{
    uint16_t pl_offset;
    uint16_t pl_size;
};

But once the slots were working, I ran into a bigger problem. What about order? In the memory tree, every node except the root had a minimum and a maximum number of keys, and that's what order gave me: order 4 meant at most 4. Splitting and merging were based on that count. When there were order-1 keys the node was full and I split it, and when it dropped below (order/2)-1 it was too empty and I merged it.

But with variable sized records, that count stopped making sense. If I had a bunch of small records, order would cap me at 4 of them and the rest of the page would just go to waste, even though there was tons of room left. And the other way around, a page could be under (order/2)-1 records and still be basically full, if the records were big. The number of keys wasn't telling me anything about how full the page actually was anymore.

This bugged me for days. I kept feeling like order and page size were doing the same job, but I couldn't quite put it together. I searched for it and didn't really find anything, most of what came up was about the in-memory tree with fixed size records. So I explained my idea to a chatbot again, and it told me my thinking was right: order was basically an abstraction for page size. It was there to make the algorithm easier to explain and simpler to implement, without changing the actual idea underneath.

So I swapped the counts for sizes. For the max, instead of a max number of keys it's a max number of bytes, and when the next record doesn't fit, it splits. The min side would be the same idea in reverse, a min number of bytes instead of a min number of keys, and merge when the page drops under it. Same two rules, just measured in bytes instead of keys.

// in bt_insert: does the new record still fit in the page?
if (total_pl_size + sizeof(LeafSlot) + ph->cur_size > t->header->page_size)
{
    bt__leaf_handle_overflow(t, ph, key, value, k_len, v_len);
    return;
}

When I look back, the thing I actually figured out wasn't really about order, it was about the page. In the memory tree a node is just a bag of keys, and I never thought about where it lived. On disk a node is a page, a fixed block of bytes, and once it's a physical thing like that, everything else bends around it. The pointers between nodes become page ids. Variable sized records need slots. Full and empty stop being counts and become bytes. And order, the first thing you get taught, is really just the memory tree's way of never having to mention the page at all.

But none of that actually changed the tree. It's the same tree, the same algorithm as the one I wrote in memory. You still split a node when it's full and push the middle key up to the parent. You still keep the keys sorted. You still start at the root and walk down to find where something goes. The core concept didn't change at all. What changed was only how it's stored, memory turned into disk, pointers turned into page ids, and a node turned into a page. The concept stayed exactly the same, only the implementation moved underneath it.

So that's what a B+ tree is under the hood. Now, I have just deletion and searching left to implement.

Originally published on samir-adhikari.com.np.