Building a SQL Database in Rust: Reducing Memory Usage with Spans and String Interning

Musab Khan — Sat, 13 Jun 2026 15:16:04 +0000

I have been working on a PostgreSQL-compatible database system in Rust and recently made a couple of changes that significantly reduced unnecessary memory allocations.

The first change was in the lexer.

Originally, identifiers were stored as String values inside token variants such as Ident(String) and QuotedIdent(String). This meant every identifier required its own allocation even though the original SQL query already contained that text.

I switched to storing spans instead. A span contains the start and end byte offsets of the token along with its line and column. Whenever I need the actual identifier text, I can slice it directly out of the source query using the span.

Besides reducing allocations, this also improved diagnostics because I always know exactly where a token came from.
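
To make the diagnostics point concrete, here is a minimal sketch of how a span could be turned into a caret-style error message. The field types of `Span` and the `render_error` helper are assumptions for illustration; the post does not show how the project formats its errors.

```rust
/// Hypothetical span layout; the post lists these fields but not their types.
#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: usize,  // byte offset where the token begins
    pub end: usize,    // byte offset one past the token's last byte
    pub line: usize,   // 1-based line number
    pub column: usize, // 1-based column of the token's first character
}

/// Renders a message followed by the offending source line and a caret underline.
/// Assumes the span sits on a single ASCII line, so byte length equals display width.
pub fn render_error(source: &str, span: Span, message: &str) -> String {
    // Fetch the line the token is on; fall back to an empty line if out of range.
    let line_text = source.lines().nth(span.line - 1).unwrap_or("");
    let padding = " ".repeat(span.column - 1);
    let carets = "^".repeat((span.end - span.start).max(1));
    format!(
        "error: {message} at line {}, column {}\n{line_text}\n{padding}{carets}",
        span.line, span.column
    )
}
```

Because the token never owned its text, the same span serves both purposes: slicing out the identifier and pointing at it in an error.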

The second change was introducing a string interner.

As I started building larger parts of the parser and AST, I noticed that the names of tables, columns, views, and aliases could appear many times throughout a query. Storing the same string repeatedly felt wasteful.

I implemented a simple interner:

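use std::collections::HashMap;

pub struct Interner {
    // Looks up the symbol for a string that has already been interned.
    map: HashMap<&'static str, Symbol>,
    // Holds each interned string once, so a symbol can be resolved back to text.
    strings: Vec<&'static str>,
}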

Now identifiers are stored as compact symbols instead of duplicated strings. The actual text is stored only once and can be resolved when needed.
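
The struct above only shows the fields, so here is a minimal sketch of how `intern` and `resolve` might be built on top of it. Two things are assumptions, not taken from the post: `Symbol` is modeled as a `u32` newtype, and the `&'static str` values are obtained by leaking one heap copy of each new string with `Box::leak`. The real project may do either differently.

```rust
use std::collections::HashMap;

/// Assumed representation: the post does not show how `Symbol` is defined.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Symbol(u32);

#[derive(Default)]
pub struct Interner {
    map: HashMap<&'static str, Symbol>,
    strings: Vec<&'static str>,
}

impl Interner {
    /// Returns the existing symbol for `text`, or stores the text and returns a new one.
    pub fn intern(&mut self, text: &str) -> Symbol {
        if let Some(&symbol) = self.map.get(text) {
            return symbol;
        }
        // Assumption: leak one heap copy so the reference is valid for `'static`.
        // The memory is never freed, which only suits a long-lived interner.
        let stored: &'static str = Box::leak(text.to_owned().into_boxed_str());
        let symbol = Symbol(self.strings.len() as u32);
        self.strings.push(stored);
        self.map.insert(stored, symbol);
        symbol
    }

    /// Looks up the text behind a symbol produced by this interner.
    pub fn resolve(&self, symbol: Symbol) -> &'static str {
        self.strings[symbol.0 as usize]
    }
}
```

Interning "users" twice yields the same `Symbol`, so comparing two identifiers becomes a single integer comparison. If leaking is not acceptable, an arena or a `Vec<String>` with index-based lookup are common alternatives.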

Some benefits of this approach:

No duplicate allocations for repeated identifiers
Faster identifier comparisons using integer equality
Smaller AST nodes
Better cache locality, because smaller nodes pack more densely in memory

The project currently includes a lexer, parser, binder, query planner, optimizer, catalog, and storage engine.

I am still exploring ways to improve performance and memory efficiency, so I would be interested to hear how others have approached similar problems in compilers, interpreters, or database systems.

Repository: https://github.com/musab05/osirisdb

Building a SQL Lexer in Rust: Why I Replaced `Vec<char>` with `&str` and `Ident(String)` with Spans

Musab Khan — Sat, 06 Jun 2026 15:36:36 +0000

I've been building a database engine from scratch in Rust, and I recently finished the lexer.

The lexer itself wasn't the most interesting part.

What I found more valuable was how my design evolved as I learned more about Rust and how compilers and database systems are typically implemented.

My First Approach

When I started, I stored the input as a Vec<char>.

It felt straightforward because I could access characters directly without worrying about UTF-8 boundaries.

I also represented identifiers like this:

Ident(String)

At first glance, this seems perfectly reasonable.

Every identifier token carries its own text, making it easy for the parser to consume.

The Problem

As the lexer grew, I started asking myself a simple question:

The identifier already exists in the original SQL query.

Why am I allocating another string and copying the same data into every token?

For a query like:

SELECT username, email FROM users;

the source text already contains:

username
email
users

Creating separate String allocations for each identifier means duplicating data that already exists.

I also learned an important detail about Rust enums.

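A Rust enum has to be large enough to hold its largest variant, plus a discriminant when the compiler cannot tuck it into unused bits.

A String alone is three machine words (24 bytes on a 64-bit target), so once any variant carries one, every token instance, even a comma, occupies at least that much space.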
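
The effect is easy to measure with `std::mem::size_of`. The two enums below are cut-down stand-ins written for this comparison, not the project's real token types.

```rust
use std::mem::size_of;

/// Token that owns its identifier text, as in the first design.
pub enum OwnedToken {
    Select,
    Comma,
    Ident(String),
    QuotedIdent(String),
}

/// Kind-only token; the text stays in the source query.
pub enum TokenKind {
    Select,
    Comma,
    Ident,
    QuotedIdent,
}

/// Returns (size of OwnedToken, size of TokenKind) in bytes.
pub fn token_sizes() -> (usize, usize) {
    (size_of::<OwnedToken>(), size_of::<TokenKind>())
}
```

On a 64-bit target the owning enum is at least 24 bytes for every variant, while the fieldless one fits in a single byte. A span stored next to the kind adds some of that back, but it is a fixed cost and involves no heap allocation.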

Moving to a Span-Based Design

Instead of storing identifier text directly inside tokens, I switched to storing only the token kind:

Ident

along with source location information:

Span {
    start,
    end,
    line,
    column,
}

Now the token only answers two questions:

What is this token?
Where did it come from?

If the parser needs the actual identifier text, it can recover it directly from the original SQL source using the stored byte range.
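
A minimal sketch of that recovery step, assuming `usize` fields and a hypothetical `text` helper on `Span` (neither is shown in the post):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Select,
    Ident,
    Comma,
}

#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: usize,  // byte offset of the first byte
    pub end: usize,    // byte offset one past the last byte
    pub line: usize,   // 1-based
    pub column: usize, // 1-based
}

#[derive(Debug, Clone, Copy)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Span {
    /// Borrows the token's text from the source without allocating.
    /// Panics if the range does not fall on UTF-8 character boundaries.
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.start..self.end]
    }
}
```

For `SELECT username, email FROM users;`, an `Ident` token with the span `7..15` resolves to `username`. Where a panic on a bad range is undesirable, `source.get(start..end)` returns an `Option` instead.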

Replacing `Vec<char>` with `&str`

The second design change was moving away from:

Vec<char>

and operating directly on:

&str

with a lifetime parameter that ties the lexer to the source text it reads.

Instead of creating another collection containing the entire input (a Vec<char> spends 4 bytes on every character, even for ASCII), the lexer now walks over borrowed source text.

This means:

No duplicated input buffer
Less memory usage
Fewer allocations
A single source of truth

The lexer doesn't own the SQL string.

It only borrows it.
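
Here is a small sketch of what borrowing looks like in practice. The `Lexer` struct and its `next_word` method are hypothetical and far simpler than a real SQL lexer: they only pull out identifier-like words, to show that the slices handed back point into the caller's string.

```rust
/// Hypothetical lexer skeleton that borrows the SQL text for `'src`.
pub struct Lexer<'src> {
    source: &'src str,
    pos: usize, // current byte offset into `source`
}

impl<'src> Lexer<'src> {
    pub fn new(source: &'src str) -> Self {
        Lexer { source, pos: 0 }
    }

    /// Returns the next identifier-like word as a slice of the source,
    /// skipping everything else. Real token rules are omitted.
    pub fn next_word(&mut self) -> Option<&'src str> {
        let source: &'src str = self.source;
        let rest = &source[self.pos..];
        // `find` returns byte indices on character boundaries, so slicing is safe.
        let start = rest.find(|c: char| c.is_alphanumeric() || c == '_')?;
        let word = &rest[start..];
        let len = word
            .find(|c: char| !(c.is_alphanumeric() || c == '_'))
            .unwrap_or(word.len());
        self.pos += start + len;
        Some(&word[..len])
    }
}
```

The returned slices carry the `'src` lifetime rather than the lifetime of the `&mut self` borrow, so they stay usable while the lexer keeps advancing, and the compiler rejects any attempt to drop the SQL string while they are still in use.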

Current Output

For the query:

SELECT name, age FROM users WHERE age > 18;

the lexer produces:

Select @ line 1, col 1, bytes 0..6
Ident @ line 1, col 8, bytes 7..11
Comma @ line 1, col 12, bytes 11..12
Ident @ line 1, col 14, bytes 13..16
From @ line 1, col 18, bytes 17..21
Ident @ line 1, col 23, bytes 22..27
Where @ line 1, col 29, bytes 28..33
Ident @ line 1, col 35, bytes 34..37
Gt @ line 1, col 39, bytes 38..39
IntLit(18) @ line 1, col 41, bytes 40..42
Semicolon @ line 1, col 43, bytes 42..43
Eof @ line 1, col 44, bytes 43..43
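
The line and column values in this output can be reproduced from the byte offsets alone. The helper below is a sketch, not the project's code; it assumes columns count characters rather than bytes, which the output above cannot confirm because the query is pure ASCII.

```rust
/// Computes a 1-based (line, column) pair for a byte offset into `source`.
/// Assumption: columns count characters, not bytes.
/// Panics if `offset` is past the end or not on a character boundary.
pub fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    // Each newline before the offset moves us down one line.
    let line = before.matches('\n').count() + 1;
    // The current line starts just after the last newline, or at 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    (line, column)
}
```

For the query above, offset 7 maps to line 1, column 8, matching the first `Ident`. A lexer would normally track both counters incrementally while scanning instead of rescanning the prefix for every token.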

What I Learned

I started this project to understand how databases work internally.

What surprised me most so far wasn't SQL.

It was seeing how a few seemingly small design decisions around ownership, borrowing, and data representation can significantly change the memory characteristics of a system.

The lexer is complete.

Next stop: building the parser and AST.

If you've built a compiler, interpreter, database, or parser before, I'd be interested to hear what design decisions ended up changing your implementation the most.

DEV Community: Musab Khan