DEV Community: Tudor Stanila

insertion in heap pages

Tudor Stanila — Mon, 19 Jan 2026 16:44:52 +0000

Intro

Insertion in heap pages is, in theory, straightforward: you find some free space in the page, place the row there and create a slot pointing to that location. There are, however, multiple small details that make things a bit more complicated (mostly edge cases caused by fragmentation).

The implementation in TRDB is very similar to the one in Java - I mostly covered some scenarios I had missed there & cleaned it up a bit, making it more granular and easier to test.

I will mostly cover the overall logic and reasoning behind it, limitations & possible improvements.

High level

From a high level, the insertion is done in two separate steps:

plan the insertion
perform the actual insertion

The planning phase looks for a location for the new row, along with metadata for the insertion algorithm. No data is changed during this phase, it is a read-only operation.

The insertion phase uses the metadata provided in the plan to actually place the row at the target position and to update the header & slot array

Planning phase

This phase attempts to answer several questions about where the row is going to end up and what other changes are needed to the page from a maintenance point of view:

what slot number will the row use?
where is it going to be physically placed?
is compaction needed for this insert?

Slot number

The first step of the planning is to go through the slot array and look for any unused slots. As a reminder: when a record is deleted its slot is only being invalidated, but kept in the array. The insertion algorithm will reuse these empty slots, to avoid increasing the footprint of the slot array unnecessarily.

The scan starts from the beginning of the array (rightmost element) and stops at the first unused slot it encounters. If no such slot is found, it flags that a new one will be needed.

Physical location

In the second step of the planning, we look for a place to physically place the row itself in the page.

This is done by checking the following conditions, in this order:

Can the row fit in the free space? (free_end - free_start)
Can the row fit between two existing rows?
Can the row fit between the last row and free_start?

The options are probed one by one, halting at the first positive answer. If none are successful, the algorithm flags that a compaction is needed for this insertion.

Notes

Prior to this phase, an additional check is done to ensure that the row can physically fit in the page - meaning that there are enough free bytes in the page, regardless of fragmentation; if there are not, the entire flow is aborted. This allows us to have the fallback of compaction (we can be sure that if we could not find a large enough spot right now, there will be one once we compact)
All checks are done with the information from the first phase in mind - if a new slot is needed, 4 extra bytes are taken into account when determining the necessary overall space in the page.

Implementation

The actual result of the planning phase is represented as an enum comprised of two other enums (code):

/// Defines the offset at which a new record should be inserted in an unsorted heap page.  
#[derive(Debug)]  
pub enum InsertionOffset {  
    /// Record should be inserted at the start of free space after compacting the page.  
    AfterCompactionFreeStart,  
    /// Record should be inserted at an exact offset.  
    Exact(usize),  
}  

/// Defines whether a new slot should be created for the record or an existing slot can be reused when inserting into an unsorted heap page.  
#[derive(Debug)]  
pub enum InsertionSlot {  
    /// A new slot should be created for the record.  
    New,  
    /// An existing slot can be reused for the record.  
    Reuse(usize),  
}  

/// Represents a plan for inserting a new record into an unsorted heap page.  
#[derive(Debug)]  
pub struct InsertionPlan {  
    /// The slot information for the insertion.  
    pub slot: InsertionSlot,  
    /// The offset information for the insertion.  
    pub offset: InsertionOffset,  
}

Insertion phase

This is pretty straight forward: based on the insertion plan received from the previous step, it will know exactly whether it needs to compact the page beforehand, at what offset to place the row and how to handle the slot.

Besides inserting the actual row data and updating the slot array, the following header values are also updated:

SLOT_COUNT - incremented by 1 only if we are creating a new slot
FREE_END - decremented by SLOT_SIZE only if we are creating a new slot
FREE_START - incremented by row.size only if we are inserting at this location (meaning we are not using a gap between two rows)
FREE_SPACE - always updated. Decremented by row.size if we are reusing a slot, or by row.size + SLOT_SIZE if we are also creating a new slot.

Reasoning

The initial implementation in Java was a single step insert - it worked ok, despite it becoming a bit too bloated. However, I had to split it into these two phase once I started working on write-ahead logging (WAL).

As the name suggests, in WAL you need to create a log entry with the operation that is going to happen, meaning you need to have all the information about the change you are about to make before you perform the change. This was problematic because one of the fields that had to be included in the log entry was the slot number of the row being inserted - value which was only known during/after the insertion, not before.

Since it is not acceptable to write an incomplete WAL entry and update it afterward, the only choice was to extract the logic that determined the physical location of the row outside the actual insertion, allowing me to (1) generate the insertion plan, (2) log the operation and (3) perform the actual insert.

I could have, of course, updated the insert method to perform the log internally, right before it placed the row in the page. However, I preferred to have the Page implementation unaware of the WAL and not couple the two in this manner.

Possible improvements

During the search for a physical location in the planning phase, because the checks being performed are short-circuiting, their order is relevant and up for debate:

on one hand the current implementation favors fragmentation, because it will only attempt to fill in gaps if the free_end - free_start area is not large enough. However, it avoids scanning the existing slots if it is not necessary, thus providing a quicker response.
on the other hand, switching those two would reduce overall fragmentation, but would also incur useless scans of the slot array for scenarios where the fragmentation is low (which is also kept low by this algorithm)
don't think there is a right or wrong, although implementing them both and doing some benchmarking would be the only way of telling for sure
- it is also highly influenced by access patterns: tables with low deletions would benefit from the current implementation, while tables with lots of inserts & deletes could benefit from the second one

Another area of improvement is also during the planning phase: as of right now, the two steps in the planning phase scan over the slot array twice: once for the slot # and once for the physical location. It can be done in a single pass, but for now I chose clarity and testability over efficiency. Again, only some benchmarks could give us a performance penalty for this implementation.

maintaining stable row identifiers during page maintenance

Tudor Stanila — Mon, 05 Jan 2026 17:55:36 +0000

intro

Christmas, New Year's, a 1-week ski trip, time with family, the rabbit hole of a full re-organization of my Obsidian vault into a Zettelkasten slip box after reading How to take smart notes - all of these resulted in me making far less progress on this project than I would have expected, especially during my 3-week winter vacation. Nevertheless, there has been some progress.

I am still within the storage system, working on the slotted pages for heaps and there have been a few interesting things I have encountered around the way a database engine will handle certain page maintenance operations.

row IDs

Most systems (though here I am mainly knowledgeable about how SQL Server handles things) will use a row ID to internally identify a row.

A row ID can be considered any identifier that allows the system to uniquely identify a single row. There can be multiple places where the engine would need to store this identifier, but maybe the best example is a non-clustered (or secondary) index. In here, each index entry will contain a pointer to the full row it belongs to. When retrieving rows based on that non-clustered index, the row identifier is used to obtain the full row (if needed).

how SQL Server does it

In the case of SQL Server, row IDs used in non-clustered indexes depend on the source table, and it is usually referred to as the 'row bookmark value' (Delaney et al., 2013, p. 308):

if the source table is clustered, the non-clustered indexes will use the clustering key as the row identifier
if the source table is a heap, the non-clustered index will use a physical row locator to identify the row

Today we are only focusing on heaps, as clustered tables are a whole other beast. That physical row locator (called RID by Microsoft) is an 8 byte value comprised of 3 different segments FileID:PageID:SlotId and can be broken down into 2 bytes for the FileId, 4 bytes for the PageId and 2 bytes for the SlotId (Delaney et al., 2013, p. 311).

The relevant point here: a row ID will contain the slot number of the row and this row ID will appear in all non-clustered indexes. Which leads us to the concept behind this post.

stable slot numbers

Because indexes will use the slot ID as part of the row's identifier, the system needs to guarantee that rows maintain their slot number during internal page maintenance operations. If rows were to be assigned to other slots, all non-clustered indexes would also require an update, which would be pretty bad performance wise.

The most direct implication of this rule is for the process of page compaction but, as I've noticed, the journey to guarantee this rule starts with deletions.

row deletions

One of the advantages abstracting the access to rows through the slot array is that certain operations such as deletions become trivial. Similar to how a file system works, deletion from a slotted page does not require us to touch the actual row data. Instead, it is enough to invalidate the slot pointing to that row, thus making the area of bytes corresponding to the row completely unreachable.

The process is quite simple, but what does it mean to invalidate a slot? In my case, it is just a matter of setting both its offset and length fields to 0, while keeping the slot count from the header untouched:

initial: [(96, 100), (196, 50), (246, 50)]
slot_count = 3
// delete slot index 1 (196, 50)
after: [(96, 100), (0, 0), (246, 50)] 
slot_count = 3

This implementation adheres to the stable slot numbers rule:

if, instead of invalidating the slot, we would have deleted it from the array completely (and decremented slot_count), we would have had to shift the next slots one index down, thus blatantly breaking the rule
if we set it to (0,0) as we are doing now, but decremented the slot_count, we would have made the last valid slot unreachable.

Turns out this rule logically simplified the implementation by quite a lot. The only more complex part of deletion is a small optimization I've added: if the row being deleted is physically the last on the page (so right at the end of the data section), we can actually avoid leaving the page fragmented by moving the free_start pointer to the end of the previous row (physically). If this is the happy case, the can_compact flag is also not set in the header to indicate fragmentation has started.

I had actually missed doing this in the Rust implementation (despite me having done it in the Java db which was open on the other display) and I only realized it when I was trying to do the setup for some insertion tests - I initially marked the scenarios as 'technically possible, but not valid', until I gave it some more thought and discovered that they are in reality impossible only if this optimization is implemented (you can still see the comment where I marked them as impossible)

page compaction

The process of internally reorganizing the records in the page, in order to remove fragmentation. While all pages start off as tightly packed arrays where all rows are in a contiguous chunk of bytes, deletions and updates can lead to small fragments of bytes appearing in between rows. As a result, a page's free space counter might be high, but without any contiguous area of space large enough to fit a new row.
Compaction is done by moving all the existing rows next to each other, thus removing any fragments.
It is usually a lazy process, performed when an insertion cannot find any large enough fragment to fit the new row.

My implementation for it might not be the most optimal, but it seems to work. I allocate a new byte array the size of the data segment (free_start - header_size), go through the slot array and for each valid slot, copy the row in the new byte array, packing them one after the other & updating the slot offset value. Then I just overwrite the data segment with the new byte array (which will have some trailing 0s, as many as there were bytes from fragments).

I'm sure there is some fancier in-place approach where you don't need to temporarily allocate a new array of up to 4k bytes, but that is more of a nice-to-have, later-on-if-i-ever-get-to-it optimization. Especially since this implementation checks the most important box: it does not break the rule of stable slot numbers. Funnily enough, the initial implementation in Java was to actually allocate a whole new page and only copy the valid slots, then replace the current page's byte array (a hacky way of leveraging the insertion algorithm without having to think too much). It was only while discussing the whole process with a friend almost a year later that we realized how the approach violated this rule. So maybe that's why I'm writing about it.

outro

That is mostly it for this post, which turned out to be way longer than I expected anyways. As I said in the beginning, things have been moving pretty slowly during the last few weeks, but they should pick up now that the holidays are over.

I've also been working on is a nicer infrastructure for testing - my goal for this project is to try and add tests for everything as I am writing the code, to avoid having to cover tens of scenarios later on. I've been looking into ways in which I can leverage some of the features of Rust to create helpers for this module that will make the setup of any test scenario a breeze - but I think I will cover that in another post.

One other thing that has kept me busy has been the insertion algorithm - the Java implementation was quite chunky, so I've been splitting it into smaller pieces, cleaning them up (and trying to cover them as much as possible).

As next steps, I will hopefully finish the things related to heap pages pretty soon. Then I need to decide whether I want to stay in the storage engine and work on things like index pages (and the monstrous B+trees), or to expand into the other modules.
Most likely I am going to go with the second. This whole process is quite slow, not due to the differences between the two languages, but because at every step I am questioning many of the implementation details from the Java db, trying to improve things or to solve shortcomings. Because of this, going through all the layers with a small working prototype might make decisions easier when I revisit them to complete the MVP. And on a subjective side of things, I'd like to have a working prototype as soon as possible.

references

Delaney, K., Beauchemin, B., Cunningham, C., Kehayias, J., Freeman, C., Nevarez, B., & Randal, P. S. (2013). Microsoft SQL Server 2012 Internals

Paging the data

Tudor Stanila — Mon, 22 Dec 2025 05:21:21 +0000

Intro

At the foundation of any database engine is the storage layer - responsible for handling the actual persistence and retrieval of data on disk. In this post, I will go over database pages, the universally adopted abstraction for how the engine interacts with the disk. The theoretical part of the article does not propose any new design, but rather summarizes how databases typically structure and use pages. I then walk through my own implementation of database pages, which, while custom-built, follows closely the structure and behavior described in the theoretical overview.

Why pages

A database page (also referred to as a block) is the fundamental unit of storage and I/O inside most database engines. Instead of reading or writing arbitrary bytes, the database organizes its files into fixed-size blocks - generally 4 KB to 16 KB - that contain a header, records (rows) or index entries, and free space.

Pages are the basis for how data is stored on disk, cached in memory, and manipulated by internal components such as B-trees, buffer pools, and write-ahead logging.

Databases use pages because they provide deterministic control over I/O. Fixed-size blocks allow the engine to align with disk hardware, optimize sequential reads, minimize fragmentation and avoid expensive random access. This also makes caching predictable: a buffer pool can simply hold X frames, each corresponding to exactly one page, allowing clearer decisions about eviction, tracking dirty pages, and pinning.

Pages also decouple the database from operating system memory management. Relying on the OS pages or the OS cache would make eviction, flushing, and write ordering unpredictable or outside the control of the engine. Databases need to know exactly when a block has been written, how to track its state relative to the write-ahead log, and how to restore it after a crash. Having their own page format and buffer management system gives them full control over these constraints.

The page abstraction is also essential for data structures like B-trees and for space management within tables. Each index node fits within a single page, making lookups and modifications more efficient. More on B-tree backed indexes in a later post.

As with many other areas and modules, the OS is the enemy of the engine, and the database will choose to implement its own mechanisms, rather than relying on the OS provided abstractions in order to be able to optimize and define its behavior exactly as it wants it.

Slotted pages

The most common implementation for this is called a slotted page. The reasoning behind this concept is explained in the Why slotted pages section - but first I will quickly go over the structure and the behavior of this kind of page by using my design as an example.

Structure and behavior

I have configured a page to be 4KB in size - I have chosen this size mostly because it is more manageable to look through while debugging and because I do most of my work on a Macbook with an SSD that has a 4KB block size as well, so that matches nicely.

Pages are basically a contiguous chunk of 4096 bytes structured as a 96-byte header, immediately followed by a data region growing right (or down) and ending with a slot array growing left (or up):

Header

The header represents the first 96 bytes of the page and stores metadata about the page itself: the total number of slots that are currently in use, the amount of free space remaining on the page, the type of the page (simple heap data page, index page, etc), various flags (e.g., fragmentation), pointers to the sibling pages (in the case of index pages).

These are stored as different data types: for example, a 16 bit integer is enough to store the total number of rows in the page - we can never expect to have a large number of rows. Flags require even fewer bits, while other fields like sibling pointer might require more.

Slots

The idea behind the slotted page is pretty simple: whenever you insert a row, you place it wherever there is some room in the page, then you create a new entry in the slot array - this entry will store the offset at which the row starts and its length. The row is now identified by its slot number.

Reading

Because the slots are fixed size (4 bytes in my case - 2 for the offset, 2 for the length), it is trivial to read directly the i-th slot - e.g. reading slot at index 3 simply means positioning your read-head at PAGE_SIZE - (slot_number + 1) * SLOT_SIZE = 4096 - (3 + 1) * 4 = 4080, then reading the first 4 bytes (in the image above, that would give us the slot (140, 12)).

Iterating over all the rows in the page is equally simple - you just iterate over the slot array, dereference each slot and output the corresponding row.

Updating

As said, inserting a new row means finding some free space in the page and creating a new slot.

Deletion, on the other hand, is straightforward - you can just delete (or invalidate) the slot - deleting the actual bytes of the row is optional and can be postponed, since that segment of the page is no longer accessible and will eventually be overwritten. A process called compaction will rearrange all the existing rows in a contiguous segment, removing any fragmentation that may have occurred after deletions.

As for updating, if the new row has the same size (or smaller), the update can be done in place. If it's greater, the updated value will need to be placed somewhere else in the page and the slot's offset will be updated - the slot number, however, will remain unchanged.

Why slotted pages

The way I see it, slotted pages solve two main problems:

How to handle variable length rows
How to encapsulate storage details from outside objects

Variable length rows

Say we didn't have slotted pages, but wanted to store rows that vary in length. How would we do that? The most direct approach would be to store them as [ row_length | row_bytes]- e.g. [ 5 | 'hello']. This allows you to know exactly how many bytes you need to read to obtain a single row.

But what about random access? What happens if you need to read the 3-rd row? You have no idea where it starts exactly, so you need to: read the length of row 1, move the cursor at current + row_1_length, read the length of row 2, move the cursor at current + row_2_length, then finally read the row you need.

This can become cumbersome and inefficient.

With slots on the other hand, because they are fixed size, you can determine exactly where the 3rd slot is stored, which will point you to where the row itself is stored. In our case, the 3rd slot (index = 2) will be placed at PAGE_SIZE - (index + 1) * SLOT_SIZE = 4096 - 3 * 4 = 4084. By computing this offset and setting the read head at that location, we are able to directly read the two shorts that compose the slot.

Encapsulating the details

The other advantage of slots is that it allows you to hide from the outside world the way the page organizes itself underneath.

Again, say we didn't have the slot array. If you look at the drawing of the page from the start of the post, you might notice that row 3 is missing because it was deleted. How would we handle the page operations now, say, for reading row #4? We would read row 1, then row 2 - and then? We would need some sort of pointer to indicate that we need to skip some bytes to be able to get to what is labelled as row 4, but in reality, is the now 3rd row. Or we would need to compact (remove fragmentation) on each delete.

But if we were to compact on each delete - this means that row 4 now becomes row 3. This can become quite problematic. What if there is an index out there for this table? We would need to not only remove the old row from it, but also update rows 4, 5, 6, ... to become rows 3, 4, 5, ...

The problem is that the logical ordering of the rows is dependent on the physical ordering and organization on disk.

This is very limiting and causes dependencies left and right. Even if external objects don't know the exact offset of the row, they are dependent on the place where it is stored and rows themselves are dependent on one another.

Slots allow us to hide all of these details from the outside world.

By only exposing the slot numbers to external objects, we are free to organize the rows however we want - fragmented, compacted, in increasing or decreasing order of their size, all even and all odd, does not matter because the slot array stays constant. Deleting a row and compacting the page means moving around the only the row data - the slots stay unchanged. If you consult the drawing again, you will notice that the 3rd slot is (0,0) - that is because the row was deleted and the slot invalidated, but not removed from the array. This means that after a compaction, the space that was occupied by the row will be overwritten, but the (0,0) slot will remain in the array.

As a result, row 4 will remain row 4, regardless of where it is stored in the page, so external objects like indexes will not be affected by any internal reorganization of the page.

Implementation

Java

The original Java implementation defined a class called SlottedPage which contained a ByteBuffer of size 4096 with the actual content from disk. It also exposed numerous methods for interacting with the page - reading and writing flags from the header, retrieving, inserting or updating rows, etc. Additionally, a SortedPage class extended the SlottedPage and offered overrides for B-tree index nodes.

This grew a bit cumbersome for several reasons:

Inheritance sometimes sucks - had a bug I spent several days fixing, only because I was expecting an object to be a SortedPage, but was in reality an unsorted SlottedPage, which broke some B-tree invariants. This could have been prevented by a better implementation or by paying more attention, but I feel my point still stands.
The classes grew quite big - exposing a different method for every header value, every behavior - this led to large objects.
(A me problem) Not a lot of attention paid to data duplication - there were flows where the main byte array of the page was duplicated and copied around as it was passed from method to method.

Rust

The new implementation is pretty similar conceptually to the Java one. However, in a Rust-centric approach I tried to focus more on the ownership side of things and ensure that the data is not duplicated in memory for no good reason. I also leveraged the fact that you can have multiple impl blocks across different files for the same struct - this allowed me to group methods in different files based on their functionality.

The next question was how to actually store the page bytes in memory and how it would play along with the buffer. The way I saw it, I had two main possible directions:

The page owns the byte array, while the buffer cache owns the pages via some data structure.
The buffer manager allocates and owns a pool of frames where each page is placed -> then the page is only a pointer to that memory region.

Based on my research, the second option seems more popular or idiomatic. I chose, however, to go with the first one because it felt easier to avoid doing all that pointer arithmetic on the buffer pool & dealing with lifetimes. In a way it also felt more modern & more Rust (might be wrong here tho).

The base of it all is the Page struct:

pub struct Page {  
    page_id: PageId,   
    data: Box<[u8; PAGE_SIZE]>
}

The page_id is just an in-memory page identifier comprised of filename_hash::page_number_within_file. Only the page number within the file is stored in the page header - the filename itself will be known through the use of system catalogs. The data field stores the main byte array of the page in a Box on the heap. I chose an actual array over something like a Vec as the size will never change so the overhead of the Vec seemed unnecessary.

HeaderRef and HeaderMut

The first order of business was to provide access to the header of the page. Instead of going with a set of getters and setters on the main page, I decided to create two new structs, HeaderRef and HeaderMut:

pub(crate) struct HeaderRef<'a> {  
    bytes: &'a [u8; HEADER_SIZE] 
}

pub(crate) struct HeaderMut<'a> {  
    bytes: &'a mut [u8; HEADER_SIZE] 
}

These structures borrow from the main page byte array a slice the size of the header - namely the first 96 bytes. HeaderRef offers an immutable view into the header of the page via getters, while HeaderMut offers a mutable view via getters and setters. The word view here is relevant to me - that is because these structs are zero-copy - they do not duplicate the underlying byte array in any way. Because their slices are also limited to the header size, they cannot affect the rest of the page, should there be any bug. I also decided to split them in mutable and immutable to keep the trend of Rust - if you want to modify something, you should opt in for the mutable one.

The getters and setters are implemented via a macro that takes as param the flag/value name and data type, to avoid having to write a large number of methods manually.

SlotArrayRef and SlotArrayMut

Similarly to the header, I also provide access to the slot array via the following structs:

pub(crate) struct SlotArrayRef<'a> {   
    bytes: &'a [u8],  
}
pub(crate) struct SlotArrayMut<'a> {    
    bytes: &'a mut [u8],  
}

The reasoning here is the same: slice into the main array and encapsulate behavior in a more organized and testable way. These structs are instantiated with a slice the size of the slot array at the time of its creation: if the page contains 3 slots, the slice will be 3*4 bytes, or more precisely the last 12 bytes of the page.

They offer methods for retrieving slots, and in the case of SlotArrayMut, insert/update slots. They also expose the slot structs - a zero-copy view of a single slot:

pub(crate) struct SlotRef<'a> {  
    slot_index: u32,  
    bytes: &'a [u8; SLOT_SIZE]
}

pub(crate) struct SlotMut<'a> {  
    slot_index: u32,  
    bytes: &'a mut [u8; SLOT_SIZE]
}

Page API

With the exception of the main Page struct, all behavior and structs are internal to the page crate. The page exposes a set of public methods that are accessible outside of the crate - for creation of the page itself and for retrieval/insertion/deletion/update of rows. Underneath, these public methods leverage this structs and encapsulate the behavior of the page. The main impl blocks are split into separate files: one for reading rows, one for inserting, etc. All of these internal methods can be tested in place where needed.

Error handling

I tried to create an error enum for every main flow: InsertError, ReadRowError, etc. Underneath, these are derivable for more concrete errors: all header operations can return HeaderError, all slot operations can return SlotError, etc. All of these errors can be converted to a PageOpError, which represents the crate's catch-all error type:

                       HeaderError
                      /
PageError - PageOpError - SlotError
                      \
                       \-ReadRowError
                        \
                         InsertError
                         ....

Additionally, the PageError is the crate's public error type and the only one exposed to the outside world. It is a struct containing a PageOpError and an additional PageId. A trait to more easily attach the page id is also implemented:

pub struct PageError {  
    pub(crate) page_id: PageId,  
    pub(crate) source: PageOpError,  
}

pub(crate) trait WithPageId<T> {  
    fn with_page_id(self, page_id: PageId) -> PageResult<T>;  
}  

impl<T> WithPageId<T> for Result<T, PageOpError> {  
    fn with_page_id(self, page_id: PageId) -> PageResult<T> {  
        self.map_err(|source| PageError { page_id, source })  
    }  
}

Remaining work

At the time of writing this, only the insertion and retrieval of rows has been implemented (besides the header & slot array accessors). I started by taking the Java implementation as-is and translated it to Rust and right now I am trying to improve it by (1) splitting it into more granular and testable chunks, (2) optimizing and improving the overall logic. After that, I will apply the same principle to compaction, deletion & updates.

Worth noting tho, this only covers heap unsorted pages. The B-tree index pages are a whole other thing that will be tackled separately, but that might be pushed a bit further down the priority list.

Resources

These resources helped inform my understanding of database page architecture:

Microsoft SQL Server Documentation — Pages and Extents Architecture Guide
Kalen Delaney — Microsoft SQL Server 2012 Internals
CMU Intro to databases

So I decided to write a database system

Tudor Stanila — Mon, 22 Dec 2025 05:18:59 +0000

About a year ago I started a personal project to implement a relational database system from scratch, mostly as a result of (1) my interest in the field and (2) a desire to better understand how things work under the hood.

Professionally, I’ve been working for the past 7.5 years as a .NET backend engineer. A lot of my recent tasks involved attempting to improve the response time of a SQL Server database, which led me to spend quite some time looking at queries, execution plans, and schema design, trying to find the culprit behind the performance issue. I kept reading articles and blogs about random operators and how they work to try and grasp what was happening to my query.

The idea of writing some sort of database & query engine kept bouncing around in my head more like a conceptual thing - I wonder how difficult it would be to implement a JOIN - until one day last fall when I was telling a friend about this over FaceTime, and we decided to make a go of it. We chose Java over something like C++ mostly because (a) it was a language we are both familiar with, C# being quite similar in many aspects, and (b) writing a database engine is hard enough - we wanted to focus on the design and functionality, not on also getting familiar with a low-level language at the same time.

After one year of active development, I started feeling a bit stuck. Some of the design choices I made turned out not to be the best, which led me to create several tasks tagged with refactor and XXL. Looking at the impact these refactors would bring, it felt easier to cut my losses, learn from my mistakes, and re-work everything from scratch rather than make the changes in place.

That decision also meant choosing a new direction for the implementation itself - not just a redesign, but a complete rewrite. I still didn't particularly want to go the C++ route, but I did want something closer to how actual database systems are implemented, so I chose to learn Rust for this new version.

Since I’m redoing the whole thing anyway, I figured I might as well make use of everything I learned the first time around. One of the lessons was that I didn’t do a great job documenting my thought process or why I made certain choices. This time I want to change that, and having a place to write things down from the very beginning might help.

This blog will follow my journey of learning Rust by re-implementing my initial database server. I’ll cover design choices and approaches I’ve taken for the engine, along with Rust-specific things I’ve used to implement them (or that I find cool). I know I’ll make plenty of mistakes - both Rust-specific and DB engine ones - but I guess that’s the whole point: I’ve been programming for long enough to know that you rarely get the right solution on the first attempt. I don’t regret any of the wrong choices from the Java project - by having made them, I’ve learned a lot more. Same goes for the way I’ve implemented things in the few hundred lines of Rust code I’ve already written for this - I’m sure many of them will come back and bite me a few months from now, but that is part of the process I’ve learned to accept and actually enjoy.

I use various resources when designing the system: lectures I find on YouTube from various universities, several books, documentation, and (very little) open source code like PostgreSQL or SQLite. Many of the design choices are influenced by SQL Server (through observation & Microsoft’s public documentation) mostly because it’s the system I’m most familiar with and enjoy the most. However, I try to use these resources only to guide me in the right direction and have the actual implementation be as much my own as possible.

This database project will never see the light of day as an actual product to be used in real life - I don’t see the point in trying to create a competitor for existing solutions with decades of experience on the market (especially as a side project done by one person after work). It’s designed for my own growth and passion.

I do welcome any kind of feedback from anyone reading this - whether I’ve made a stupid choice regarding the design/behavior of the database engine, or if I’ve written some junk piece of Rust code that can be improved. Appreciation of anything that looks right is also more than welcome :D

The repo with the Rust code can be found here. cargo generated docs can be found here