I've been building a database engine from scratch in Rust, and I recently finished the lexer.
The lexer itself wasn't the most interesting part.
What I found more valuable was how my design evolved as I learned more about Rust and how compilers and database systems are typically implemented.
My First Approach
When I started, I stored the input as a Vec<char>.
It felt straightforward because I could access characters directly without worrying about UTF-8 boundaries.
I also represented identifiers like this:
Ident(String)
At first glance, this seems perfectly reasonable.
Every identifier token carries its own text, making it easy for the parser to consume.
The Problem
As the lexer grew, I started asking myself a simple question:
The identifier already exists in the original SQL query.
Why am I allocating another string and copying the same data into every token?
For a query like:
SELECT username, email FROM users;
the source text already contains:
username
email
users
Creating separate String allocations for each identifier means duplicating data that already exists.
I also learned an important detail about Rust enums.
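A Rust enum is at least as large as its largest variant, because every value has to reserve enough room for whichever variant it might hold.
Once one variant carries a String, every token instance, including a comma or a keyword, takes up at least as much space as a String.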
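To make the size difference concrete, here is a minimal sketch that compares two hypothetical enums: one whose identifier variant owns a String, and one with no payloads at all. The names OwnedToken, TokenKind, and token_sizes are made up for illustration and are not the lexer's real types. The exact numbers depend on the target and compiler version, so the sketch only relies on the relationship between the two sizes.

```rust
use std::mem::size_of;

// Hypothetical token type in the style of the first approach:
// the identifier variant owns a copy of its text.
#[allow(dead_code)]
enum OwnedToken {
    Select,
    From,
    Comma,
    Ident(String),
}

// The same kinds with no payload attached.
#[allow(dead_code)]
enum TokenKind {
    Select,
    From,
    Comma,
    Ident,
}

/// Returns (size of the payload-carrying enum, size of the fieldless enum) in bytes.
fn token_sizes() -> (usize, usize) {
    (size_of::<OwnedToken>(), size_of::<TokenKind>())
}
```

On a typical 64-bit target a String is three machine words (pointer, capacity, length), so even an OwnedToken::Comma reserves that much space, while the fieldless enum usually fits in a single byte.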
Moving to a Span-Based Design
Instead of storing identifier text directly inside tokens, I switched to storing only the token kind:
Ident
along with source location information:
Span {
    start,
    end,
    line,
    column,
}
Now the token only answers two questions:
- What is this token?
- Where did it come from?
If the parser needs the actual identifier text, it can recover it directly from the original SQL source using the stored byte range.
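Recovering the text could look something like the sketch below. The field types, the Token struct, and the text helper are assumptions for illustration; the post only names the Span fields. The offsets are treated as a half-open byte range, which matches the `bytes 7..11` style of the lexer output.

```rust
// Field types are an assumption; only the field names come from the post.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize, // byte offset of the token's first byte
    end: usize,   // byte offset one past the token's last byte
    line: u32,
    column: u32,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Ident,
    Comma,
}

// Hypothetical token layout: a kind plus where it came from.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Token {
    kind: TokenKind,
    span: Span,
}

/// Hypothetical helper: borrow the token's text from the original source.
/// No allocation happens here; the result is a view into `source`.
fn text<'a>(source: &'a str, token: &Token) -> &'a str {
    &source[token.span.start..token.span.end]
}
```

Slicing a &str panics if an offset does not fall on a UTF-8 character boundary. Spans produced by the lexer always do, but code that handles spans from elsewhere could use `source.get(start..end)`, which returns an Option instead.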
Replacing Vec<char> with &str
The second design change was moving away from:
Vec<char>
and operating directly on:
&str
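with a lifetime parameter that ties the lexer to the source text it borrows.
Instead of building a second collection that holds the entire input (at four bytes per character, since a Rust char is a 32-bit Unicode scalar value), the lexer now walks over the borrowed source text.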
This means:
- No duplicated input buffer
- Less memory usage
- Fewer allocations
- A single source of truth
The lexer doesn't own the SQL string.
It only borrows it.
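A borrowing lexer can be sketched roughly as follows. This is a cut-down illustration and not the real implementation: it handles only identifiers, two keywords, commas, and semicolons, it skips line and column tracking, and every name in it (Lexer, next_token, tokenize, the Unknown kind) is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Select,
    From,
    Ident,
    Comma,
    Semicolon,
    Unknown,
    Eof,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token {
    kind: TokenKind,
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
}

struct Lexer<'a> {
    src: &'a str, // borrowed source text, never copied
    pos: usize,   // current byte offset into `src`
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { src, pos: 0 }
    }

    fn next_token(&mut self) -> Token {
        let src = self.src; // copies the reference, not the text
        // Skip leading whitespace by comparing lengths before and after trimming.
        let rest = &src[self.pos..];
        let trimmed = rest.trim_start();
        self.pos += rest.len() - trimmed.len();
        let start = self.pos;

        let kind = match trimmed.chars().next() {
            None => TokenKind::Eof,
            Some(',') => {
                self.pos += 1;
                TokenKind::Comma
            }
            Some(';') => {
                self.pos += 1;
                TokenKind::Semicolon
            }
            Some(c) if c.is_alphabetic() || c == '_' => {
                // The word ends at the first character that cannot continue it.
                let len = trimmed
                    .find(|ch: char| !(ch.is_alphanumeric() || ch == '_'))
                    .unwrap_or(trimmed.len());
                self.pos += len;
                let word = &trimmed[..len];
                if word.eq_ignore_ascii_case("select") {
                    TokenKind::Select
                } else if word.eq_ignore_ascii_case("from") {
                    TokenKind::From
                } else {
                    TokenKind::Ident
                }
            }
            Some(c) => {
                // Advance by the character's full UTF-8 width to stay on a boundary.
                self.pos += c.len_utf8();
                TokenKind::Unknown
            }
        };
        Token { kind, start, end: self.pos }
    }
}

/// Collects tokens up to and including Eof.
fn tokenize(src: &str) -> Vec<Token> {
    let mut lexer = Lexer::new(src);
    let mut tokens = Vec::new();
    loop {
        let token = lexer.next_token();
        tokens.push(token);
        if token.kind == TokenKind::Eof {
            break;
        }
    }
    tokens
}
```

The only heap allocation in this sketch is the output Vec of tokens. The lifetime 'a also means the compiler rejects any attempt to keep using the lexer after the source string has been dropped.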
Current Output
For the query:
SELECT name, age FROM users WHERE age > 18;
the lexer produces:
Select @ line 1, col 1, bytes 0..6
Ident @ line 1, col 8, bytes 7..11
Comma @ line 1, col 12, bytes 11..12
Ident @ line 1, col 14, bytes 13..16
From @ line 1, col 18, bytes 17..21
Ident @ line 1, col 23, bytes 22..27
Where @ line 1, col 29, bytes 28..33
Ident @ line 1, col 35, bytes 34..37
Gt @ line 1, col 39, bytes 38..39
IntLit(18) @ line 1, col 41, bytes 40..42
Semicolon @ line 1, col 43, bytes 42..43
Eof @ line 1, col 44, bytes 43..43
What I Learned
I started this project to understand how databases work internally.
What surprised me most so far wasn't SQL.
It was seeing how a few seemingly small design decisions around ownership, borrowing, and data representation can significantly change the memory characteristics of a system.
The lexer is complete.
Next stop: building the parser and AST.
If you've built a compiler, interpreter, database, or parser before, I'd be interested to hear what design decisions ended up changing your implementation the most.