João Henrique Machado Silva

Posted on Feb 16, 2021 • Edited on Feb 23, 2021

What would SQLite look like if written in Rust? — Part 1

#database #computerscience #rust #programming

What would SQLite look like if written in Rust? — Part 1

Writing a SQLite clone from scratch in Rust

← Part 0 - Overview

Part 2 — SQL Statement and Meta Commands Parser + Error Handling →

I think there isn't a developer out there that hasn't used at least one of the relational databases available out there, but do they really know how they work. I bet not. To be honest, a lot of what goes on behind the curtains with databases is still a black box to me, even after almost 2 decades working as a software developer.

Like I said in the previous post, here are some of the questions I want to be able to answer:

What format is data saved in? In our case, in memory and on disk.
When does it move from memory to disk? How would that work exactly?
Why we only have one primary key per table? Do you know?
How does rolling back a transaction work?
How are indexes formatted? And how do they work?
When and how does a full table scan happen? Does it happen every time we issue a query? Do the indexes help? How?
What format is a prepared statement saved in?
What extra features can we add to our DB, if any?

Basically, how a database really works?

To get there I want to write a database from scratch, in Rust and I will model it after the existing and popular SQLite. Why SQLite? Well, because I just don't have the time write something like MySQL or PostgreSQL, and possibly not the expertise either. Also, I want to develop a small database, contained in a single file, with the main goal of improving my understanding of it, and for that we don't need so many features as big distributed database.

Also I think it is important to say that even SQLite it self has a lot of features! You can see that even by typing sqlite3 --help . My intention here, at least at first, is not to deliver a full on replacement with all the features available.

The goal is to have a simple relational database contained in a single file and modeled after SQLite.

And for that my plan is to both, one take my time and second take one step at a time, with each post I plan to add one or two features and let's see how long this will take. I am not in a hurry, are you?

A former colleague of mine used to say something that really stuck to me and that was that:

If you spend enough time on planning, coding is easy.

As software developers I think it is in our nature to want to get our hands dirty and see things working fast. I know, it's fun to build things our of nothing and see them working. But a lot of times if we rush the process we probably will have to go back and redo the entire thing. Anyway, I am saying this because I do plan to take my time on each part and really dig in to the works of how SQLite was designed so I can try to replicate. If I really understand how it works and I have a step by step and a clear goal than the coding part will be piece of cake.

So, let's go!

Undestanding SQLite

Thankfully the documentation on their website is quite good, Architecture of SQLite. Plus I’ve got a copy of SQLite Database System: Design and Implementation, that was a great help understanding some concepts.

SQLite works by compiling SQL text into bytecode, then running that bytecode using a virtual machine.

The SQLite interfaces act as a compiler by converting SQL text into bytecode. This bytecode is basically a sqlite3_stmt object that implements a single SQL statement. The statement that has been compiled into binary form (bytecode) is ready to be evaluated and is passed on to the virtual machine, which in turn runs the program until it either completes, or forms a row of results to be returned, or hits a fatal error, or is interrupted.

Digging a little deeper…

A query on SQLite goes through a specific chain of components in order to interact with actual data.

SQLite is designed with a very modular architecture. The architecture consists of seven major components, partitioned into two parts.

Frontend Parsing System
Backend Engine

The frontend compiles each SQL statement and the backend executes the compiled statement.

The Frontend:

As I mentioned the frontend pre-processes SQL statements as well as SQLite commands that are then sent as inputs. It parses the statements (and commands), optimizes them, and as I mentioned before, generates equivalent SQLite internal bytecode so that the backend can execute.

Tokenizer: Is responsible for splitting the inputed SQL statements into separate tokens, by doing a lexical analysis on the input.
Parser: Is responsible for analyzing the structure of the SQL statement by analyzing the tokens produced by the Tokenizer, and generates a parse tree from the tokens.
Code Generator: It traverses the parse tree and generates the already mentioned so many times bytecode, that when executed by the backend, should be able to produce the result from SQL statement.

The Backend:

The backend is the engine that takes the bytecode generated by the frontend and executes it. This engine does actual thedatabase processing work and it is composed of 4 modules: Virtual Machine, B-Tree, Pager and Operating System Interface.

Virtual Machine: The virtual machine takes the bytecode generated by the frontend and it executes. This is the ultimate manipulator of data in the database. It can see a database as a collection of tables and indexes that are actually stored as a data structure called B-Tree. The VM is essentially a big switch statement on the type of bytecode instruction. Or is it? We will find out.
B-Tree: Is responsible for actually organizing the data into an ordered tree data structure. Each table and indexes get their own B-Tree's. The use of this structure helps the VM to search, insert, delete and update the data into trees. It also helps the VM to create or delete new tree if needed.
Pager: The B-Tree module requests information from the disk in fixed-size pages. The default page_size is 4096 bytes but can be any power of two between 512 and 65536 bytes. The Pager Module is responsible for reading, writing, and caching these pages. The page cache also provides the rollback and atomic commit abstraction and takes care of locking of the database file, implementing transactional ACID properties. The B-Tree driver requests particular pages from the Pager and notifies the Pager when it wants to modify pages or commit or rollback changes. The Pager handles all the messy details of making sure the requests are handled quickly, safely, and efficiently. Definitely a challenge this one.
Operating System Interface: This module is responsible for providing a uniform interface to different operating systems. It implements routines for file I/O, thread mutex, sleep, time, random number generation, etc. For the purposes of this project, I am going to support different platforms. At least not at first. But PR's are more than welcome.

A journey of a thousand miles begins with a single step.

I don't really know who said that, but it is true. So let's start with something simple. Let start by making a simple CLI executable with a REPL program that responds to basic commands and can exit gracefully.

Making a Simple CLI with a REPL in Rust

Well, since we are modeling after SQLite, we might as well try to mirror what SQLite looks like, to make it easier on users. When you run sqlite3 , SQLite starts a read-execute-print loop, aka REPL.

$ sqlite3  
SQLite version 3.30.0 2019-10-04 15:03:17  
Enter ".help" for usage hints.  
Connected to a transient in-memory database.  
Use ".open FILENAME" to reopen on a persistent database.  
sqlite> .databases  
main:  
sqlite> .blah  
Error: unknown command or invalid arguments:  "blah". Enter ".help" for help  
sqlite> .exit

In order to do that I had to make some design choices here. Since the focus of this project is to study and build a database, I decided to let most of my efforts focus on that, meaning I do not want to spend most of my time re-inventing the wheel and writing CLI interpreters or REPL logic for example. So for those I decided to make use of already developed and somewhat mature third party libraries to get it done. And who knows, maybe in the future if I have some spare time and I realize that the use of those third party libraries are really impacting the overall performance of the application, I can always come back and replace them.

The REPL logic is pretty straight forward, basically we will need an infinite loop that prints a prompt, gets an input line, validates and then processes that line. I decided to go with the crate rustyline, which is already pretty mature, memory efficient and already solved a lot of the issues we would have to deal with, even from the user experience side, for example, providing hints and auto-completion in real time, which is a great feature.

So before I jump into my code, which you can already find it on Github by the way, I am going to quickly explain, using a quick code snippet and some comments, how Rustyline works with a simple example.

First we need to add the dependency in your cargo.yaml file:

[dependencies]  
rustyline = "7.1.0"

And in your main.rs you can add this (maybe without my comments):

use rustyline::error::ReadlineError;
use rustyline::Editor;

fn main() {
    // This line creates an Editor with the default configuration options.
    let mut repl = Editor::<()>::new();
    // This if statement loads a file with the history of commands
    // If the file does not exists, it creates one.
    if repl.load_history("history.txt").is_err() {
        println!("No previous history.");
    }
    // This is our infinite loop. We will be here until the user terminates the program.
    loop {
        // This line asks the user to input a command. You can add whatever you want in here as a prefix.
        let readline = repl.readline(">> ");

        // The readline method returns an Result. Which we now use a match statement to filter the result.
        match readline {
            Ok(line) => {
                repl.add_history_entry(line.as_str());
                println!("Line: {}", line);
            },
            Err(ReadlineError::Interrupted) => {
                println!("CTRL-C");
                break
            },
            Err(ReadlineError::Eof) => {
                println!("CTRL-D");
                break
            },
            Err(err) => {
                println!("Error: {:?}", err);
                break
            }
        }
    }
    // Here we are saving the commands into the file. Until now they are stored in memory.
    repl.save_history("history.txt").unwrap();
}

And that is it. With that you would have a basic REPL program up and running. I hope my comments made it clear for you. If not, feel free to message me and ask.

Now let's get back to our Rust-SQLite program!

I assume that if you are trying to follow this and writing some code along with me you can manage to create an empty Rust project on your own. Just to be clear this is what I did to start: cargo new rust_sqlite --bin . But again, you can find all the code on Github.

You will notice that my implementation of rustyline is a bit different, but that is because I took advantage of the features like hints and auto-completion to improve the user experience from the start. But let's go.

First thing you need to do is to add the dependency to your Cargo.toml file.

[dependencies]
rustyline = "7.1.0"
log = "0.4.14"
env_logger = "0.8.3"
rustyline-derive = "0.4.0"
clap = "2.33.3"

Well, you may have noticed that I have a couple of extra dependencies there then you might have been expecting. Yes, that is true. Remember I said I was going to take advantage of some features that rustyline has? So yeah, that is why we have the rustyline-derive there, that crate is actually just some procedural macros that help us to implement the traits necessary to add things like validation, hints and completion to our REPL program with rustyline. In fact, rustyline-derive even lives within the rustyline repository on Github.

The clap crate is a command line parser also pretty mature and that takes care of a lot of the boilerplate code we would have to write to create our own CLI parser. So I decided to go with that. For now, is not actually doing much, and only showing some help information if we type for example rust-sqlite --help . But for now that is all we need.

The other two dependencies log and env_logger are there to make logging easier throughout our application, like for example, manage different log levels for us.

This is the state of our main.rs so far. I tried to add as many comments as possible but I'll still go through some lines and explain.

extern crate clap;
mod repl;

use repl::{REPLHelper, get_config};

use rustyline::error::ReadlineError;
use rustyline::{Editor};

use clap::{App, crate_version};

fn main() -> rustyline::Result<()> {
    env_logger::init();

    let _matches = App::new("Rust-SQLite")
                          .version("0.0.1")
                          .author("João Henrique Machado Silva <joaoh82@gmail.com>")
                          .about("Light version of SQLite developed with Rust")
                          .get_matches();

    // Starting Rustyline with a default configuration
    let config = get_config();

    // Getting a new Rustyline Helper
    let helper = REPLHelper::new();

    // Initiatlizing Rustyline Editor with set config and setting helper
    let mut repl = Editor::with_config(config);
    repl.set_helper(Some(helper));

    // This method loads history file into memory
    // If it doesn't exist, creates one
    // TODO: Check history file size and if too big, clean it.
    if repl.load_history("history").is_err() {
        println!("No previous history.");
    }
    // Counter is set to improve user experience and show user how many 
    // commands he has ran.
    let mut count = 1;
    loop {
        if count == 1 {
            // Friendly intro message for the user
            println!("{}{}{}{}{}",
            format!("Rust-SQLite - {}\n", crate_version!()),
            "Enter .exit to quit.\n",
            "Enter .help for usage hints.\n",
            "Connected to a transient in-memory database.\n",
            "Use '.open FILENAME' to reopen on a persistent database.");
            //TODO: Get info about application name and version dinamically.
        }

        let p = format!("rust-sqlite | {}> ", count);
        repl.helper_mut()
            .expect("No helper found")
            .colored_prompt = format!("\x1b[1;32m{}\x1b[0m", p);
        // Source for ANSI Color information: http://www.perpetualpc.net/6429_colors.html#color_list
        // http://bixense.com/clicolors/

        let readline = repl.readline(&p);
        match readline {
            Ok(command) => {
                repl.add_history_entry(command.as_str());
                // println!("Command: {}", line);
                if command.eq(".exit") {
                    break;
                }else{
                    println!("Error: unknown command or invalid arguments: '{}'. Enter '.help'", &command);
                }
            }
            Err(ReadlineError::Interrupted) => {
                break;
            }
            Err(ReadlineError::Eof) => {
                break;
            }
            Err(err) => {
                println!("Error: {:?}", err);
                break;
            }
        }
        count += 1;
    }
    repl.append_history("history").unwrap();

    Ok(())
}

For sake of keeping the code clean and readable, I like to keep my main file as clean as possible. And one thing you may have noticed is that I have a mod repl; being declared on line 2 . That is because I moved most of the business logic related to our REPL part of the program to a different module. This way our main.rs can remain clean and easy to read.

So you notice that on line 4 I am importing a struct and a function

use repl::{REPLHelper, get\_config};

that later on line 24 and line 27 I am using them to construct our helper that will be used on line 30 to construct our Editor .

The helper struct on rustyline is responsible to implementing all the behaviors we would like to have on our REPL program, for example we are making use of the Validator trait that can validate the input as the user types in real time. We are also making use of the Hinter trait, responsible for providing hints while the user is typing commands.

Maybe I got a bit carried away with those, but I think they will give a better feel to the user when using the program. Plus, sqlite3 has them, so why won't we?!

Like I mentioned all the code related to the implementation and behavior of the rustyline REPL is a separate module called repl . You can find the code below, which is pretty well commented and self explanatory I guess. To actually explain this code I would have to get into the works of Rust which not exactly the point of this post, but feel free to ask or comment anything.

use std::borrow::Cow::{self, Borrowed, Owned};

use rustyline_derive::{Helper, Completer};
use rustyline::error::ReadlineError;
use rustyline::config::OutputStreamType;
use rustyline::{CompletionType, Config, Context, EditMode};
use rustyline::validate::{MatchingBracketValidator, Validator};
use rustyline::validate::{ValidationContext, ValidationResult};
use rustyline::hint::{Hinter, HistoryHinter};
use rustyline::highlight::{Highlighter, MatchingBracketHighlighter};

// REPL Helper Struct with all functionalities
#[derive(Helper, Completer)]
pub struct REPLHelper {
    pub validator: MatchingBracketValidator,
    pub colored_prompt: String,
    pub hinter: HistoryHinter,
    pub highlighter: MatchingBracketHighlighter,
}

impl REPLHelper {
    // Default constructor
    pub fn new() -> Self {
        REPLHelper {
            // completer: FilenameCompleter::new(),
            highlighter: MatchingBracketHighlighter::new(),
            hinter: HistoryHinter {},
            colored_prompt: "".to_owned(),
            validator: MatchingBracketValidator::new(),
        }
    }
}

// Implementing trait responsible for providing hints
impl Hinter for REPLHelper {
    type Hint = String;

    // Takes the currently edited line with the cursor position and returns the string that should be 
    // displayed or None if no hint is available for the text the user currently typed
    fn hint(&self, line: &str, pos: usize, ctx: &Context<'_>) -> Option<String> {
        self.hinter.hint(line, pos, ctx)
    }
}

// Implementing trait responsible for determining whether the current input buffer is valid.
// Rustyline uses the method provided by this trait to decide whether hitting the enter key 
// will end the current editing session and return the current line buffer to the caller of 
// Editor::readline or variants.
impl Validator for REPLHelper {

    // Takes the currently edited input and returns a ValidationResult indicating whether it 
    // is valid or not along with an option message to display about the result.
    fn validate(&self, ctx: &mut ValidationContext) -> Result<ValidationResult, ReadlineError> {
        use ValidationResult::{Incomplete, /*Invalid,*/ Valid};
        let input = ctx.input();
        // let result = if !input.starts_with("SELECT") {
        //     Invalid(Some(" --< Expect: SELECT stmt".to_owned()))
        // } else 
        let result = if input.eq(".exit") {
            Valid(None)
        } else if !input.ends_with(';') {
            Incomplete
        } else {
            Valid(None)
        };
        Ok(result)
    }

    // Configure whether validation is performed while typing or only when user presses the Enter key.
    fn validate_while_typing(&self) -> bool {
        self.validator.validate_while_typing()
    }
}

// Implementing syntax highlighter with ANSI color.
impl Highlighter for REPLHelper {
    // Takes the prompt and returns the highlighted version (with ANSI color).
    fn highlight_prompt<'b, 's: 'b, 'p: 'b>(&'s self, prompt: &'p str, default: bool,) -> Cow<'b, str> {
        if default {
            Borrowed(&self.colored_prompt)
        } else {
            Borrowed(prompt)
        }
    }

    // Takes the hint and returns the highlighted version (with ANSI color).
    fn highlight_hint<'h>(&self, hint: &'h str) -> Cow<'h, str> {
        Owned("\x1b[1m".to_owned() + hint + "\x1b[m")
    }

    // Takes the currently edited line with the cursor position and returns the highlighted version (with ANSI color).
    fn highlight<'l>(&self, line: &'l str, pos: usize) -> Cow<'l, str> {
        self.highlighter.highlight(line, pos)
    }

    // Tells if line needs to be highlighted when a specific char is typed or when cursor is moved under a specific char.
    // Used to optimize refresh when a character is inserted or the cursor is moved.
    fn highlight_char(&self, line: &str, pos: usize) -> bool {
        self.highlighter.highlight_char(line, pos)
    }
}

// Returns a Config::builder with basic Editor configuration
pub fn get_config() -> Config {
    Config::builder()
        .history_ignore_space(true)
        .completion_type(CompletionType::List)
        .edit_mode(EditMode::Emacs)
        .output_stream(OutputStreamType::Stdout)
        .build()
}

Let's give it a try!!

First the CLI part.

$ ./rust\_sqlite --help  
Rust-SQLite 0.1.0  
João Henrique Machado Silva <[joaoh82@gmail.com](mailto:joaoh82@gmail.com)\>  
Light version of SQLite developed with RustUSAGE:  
    rust\_sqliteFLAGS:  
    -h, --help       Prints help information  
    -V, --version    Prints version information

Now our REPL part.

$ ./rust_sqlite
Rust-SQLite - 0.1.0
Enter .exit to quit.
Enter .help for usage hints.
Connected to a transient in-memory database.
Use '.open FILENAME' to reopen on a persistent database.
rust-sqlite | 1> random command;
Error: unknown command or invalid arguments: 'random command;'. Enter '.help'
rust-sqlite | 2> .exit
$

Alright, we set out to have a working simple CLI application with a working REPL that responds to simple commands and exits gracefully and we made it!

Next we see if we can make our app even better by start developing our command language and working with actual statements.

View on Github (pull requests are more then welcome)

If you wanna follow this track don’t forget to follow me here on Dev.to and also give some love!

Fast, Lean, and Fully Extensible

CKEditor 5 is built for developers who value flexibility and speed. Pick the features that matter, drop the ones that don’t and enjoy a high-performance WYSIWYG that fits into your workflow

Start now

Top comments (2)

Matt Davies • Jul 1 '21

structopt is a great crate for command line parsing. It sits on top of clap and parses the command line arguments directly into structs and enums. For icing on the cake, consider pawn that allows you to use a structopt struct as an argument to main.

Kim Seon Woo • Aug 21 '24

To see rust implementation for sqlite3(and some additional features such as async), check out the limbo project

github.com/penberg/limbo

Check out Episode 1: How a Hackathon Project Became a Web3 Startup 🚀

Ever wondered what it takes to build a web3 startup from scratch? In the Stellar Dev Diaries series, we follow the journey of a team of developers building on the Stellar Network as they go from hackathon win to getting funded and launching on mainnet.

DEV Community

What would SQLite look like if written in Rust? — Part 1

What would SQLite look like if written in Rust? — Part 1

Basically, how a database really works?

So, let's go!

Undestanding SQLite

Digging a little deeper…

The Frontend:

The Backend:

Making a Simple CLI with a REPL in Rust

Now let's get back to our Rust-SQLite program!

Let's give it a try!!

Fast, Lean, and Fully Extensible

Top comments (2)

Check out Episode 1: How a Hackathon Project Became a Web3 Startup 🚀

Okay