Athreya aka Maneshwar

Posted on Feb 27

Internal Datatypes and Record Format in SQLite

#webdev #programming #database #architecture

Hello, I'm Maneshwar. I'm working on git-lrc: a Git hook for Checking AI generated code.

Internal Datatypes and Record Format

So far, we’ve seen how the VM executes instructions, manages cursors, and moves through bytecode programs. Today we zoom into something even more fundamental:

What exactly is a value inside the VM?

Before a value is written to disk, before it is compared in a WHERE clause, before it is returned through sqlite3_step(), it exists inside the VM as a memory object.

Understanding that representation is key to understanding SQLite’s flexibility.

Internal Datatypes

The VM uses an arbitrary number of numbered memory locations (registers). Each register holds exactly one value at a time.

Every value processed by the VM must belong to one of five primitive storage types:

INTEGER – a signed integer
REAL – a signed floating-point number
TEXT – a character string
BLOB – a byte image
NULL – the SQL NULL

There are no other primitive types inside SQLite’s execution engine.

Everything whether it originated from disk, from an expression, or from a bound parameter becomes one of these five.

The type determines physical representation:

INTEGER → stored in integer form
REAL → stored in IEEE floating format
TEXT → stored with encoding metadata
BLOB → raw bytes
NULL → special marker

Interestingly, some values can have multiple representations cached at once.

For example, the value 123 may simultaneously exist as:

INTEGER 123
REAL 123.0
TEXT "123"

The VM keeps these representations synchronized when necessary. This caching avoids repeated conversions during expression evaluation.

Two types cannot have alternative representations:

BLOB
NULL

These remain as they are.

If a conversion is needed for example, comparing TEXT with INTEGER the VM performs implicit type conversion on demand. SQLite’s dynamic typing model relies heavily on this flexibility.

You can inspect storage types using:

SELECT a, typeof(a) FROM t1;

At the C API level, sqlite3_column_type() reveals the storage class of a value returned by sqlite3_step().

The Mem Object — The VM’s Value Container

Internally, almost every value inside the VM is represented as a Mem object.

Each element of the Vdbe.aMem array is one such structure.

Here is its conceptual layout:

A Mem object can contain:

The actual value
Flags describing its current storage type
Metadata about encoding
Cached alternate representations

The critical invariant is this:

A value always has exactly one storage class, even if multiple representations are cached.

For example, a Mem may cache both integer and string forms, but its storage class flag will indicate which one is canonical.

This distinction between storage class and cached representation is subtle but extremely important.

It allows SQLite to implement:

Manifest typing
On-the-fly type coercion
Efficient comparison semantics

All without rigid schema enforcement at runtime.

From Values to Records

Individual values in registers are not what get stored in B-trees. The VM must first compose them into records.

A record is a logically contiguous byte string that contains:

A key
An optional payload (value portion)

Although the tree module may physically split a record across pages or overflow areas, the VM sees it as one continuous byte sequence.

The VM alone is responsible for formatting records. The tree module does not interpret field structure it just stores and retrieves byte strings.

Record Formats — Fixed vs Variable Length

There are two conceptual ways to format records:

Fixed-Length Format

Each column occupies a predetermined amount of space.
Record size is constant.
Padding may be required.
Column sizes are known at table creation time.

This design simplifies offset calculation but wastes space when values are small.

Variable-Length Format

Column sizes vary per record.
No padding.
Record size depends on actual data.
Offsets must be computed dynamically.

SQLite uses a variant of variable-length formatting.

Why?

Because it yields several advantages:

Smaller database files (no wasted padding).
Faster I/O (fewer bytes moved between disk and memory).
Supports dynamic, manifest typing.

This final point is especially important.

If SQLite enforced static typing, fixed-length storage might be practical. But SQLite allows:

An INTEGER column to store TEXT.
A TEXT column to store numbers.
Any column to store NULL.

Variable-length formatting makes this possible without structural constraints.

Manifest Typing — A Preview

Traditional databases enforce type at schema level. SQLite enforces type at value level.

This is called manifest typing.

Each value carries its own storage class, independent of column declaration.

That’s why the Mem object stores type flags with the value itself. The type is not determined solely by the table schema — it is determined by the value inserted.

This flexibility is what allows:

INSERT INTO t1(c1) VALUES(123);
INSERT INTO t1(c1) VALUES('abc');

to coexist even if c1 was declared as TEXT.

In the next sections, we will explore how this manifest typing model interacts with:

Record headers
Serial types
On-disk encoding

Because once you understand how SQLite encodes records at the byte level, the design of the VM, the Mem object, and the B-tree layer will all snap into alignment.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It’s online, source-available, and ready for anyone to use.
⭐ Star it on GitHub:

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Commit

git-lrc

Free, Micro AI Code Reviews That Run on Commit

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

See It In Action

See git-lrc catch serious security issues such as leaked credentials, expensive cloud operations, and sensitive material in log statements

git-lrc-intro-60s.mp4

Why

🤖 AI agents silently break things. Code removed. Logic changed. Edge cases gone. You won't notice until production.
🔍 Catch it before it ships. AI-powered inline comments show you exactly what changed and what looks wrong.
…

View on GitHub