Building a Vectorized SQL Engine from Scratch in C++ — No Dependencies
I built an embedded analytical database engine (https://github.com/SouravRoy-ETL/slothdb) entirely from scratch in C++20 — no external libraries, not even for SQL parsing or
Parquet file reading. Here's how each subsystem works under the hood.
The Execution Model: Why Vectorized Beats Row-at-a-Time
Traditional databases process one row at a time. Each row travels through every operator (scan → filter → project → aggregate) individually. The overhead per row is massive:
virtual function calls, branch mispredictions, poor cache utilization.
Vectorized execution processes batches of 2,048 values at once. Instead of calling filter(row) 10 million times, you call filter(vector_of_2048) about 5,000 times. Each call
processes a tight loop over a flat array — perfect for CPU caches and SIMD auto-vectorization.
// Row-at-a-time (slow)
for (each row) {
    if (row.salary > 100000) emit(row); // virtual call, branch, cache miss
}

// Vectorized (fast)
auto *salaries = vector.GetData();      // flat array, cache-friendly
auto *output = result.GetData();
for (idx_t i = 0; i < 2048; i++) {
    output[i] = salaries[i] > 100000;   // tight loop, auto-vectorizes
}
The Vector class is the core data structure. It stores a flat data_ptr_t buffer, a ValidityMask (bitvector for NULLs — 1 bit per value, 256 bytes for 2,048 values), and
supports four modes: FLAT, CONSTANT, DICTIONARY, and SEQUENCE. The CONSTANT mode is key — if an entire vector has the same value (common in WHERE x = 5), we store it once and
skip the loop entirely.
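To make the CONSTANT short-circuit concrete, here is a minimal sketch. The `Vector`, `VectorType`, and `CountGreaterThan` names are simplified stand-ins for illustration, not the engine's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class VectorType { FLAT, CONSTANT };

struct Vector {
    VectorType type;
    std::vector<int64_t> data;   // FLAT: one value per row; CONSTANT: data[0] only
};

// Count how many of n logical rows exceed the threshold.
int64_t CountGreaterThan(const Vector &v, int64_t n, int64_t threshold) {
    if (v.type == VectorType::CONSTANT) {
        // One comparison decides the whole batch: all rows match or none do.
        return v.data[0] > threshold ? n : 0;
    }
    int64_t count = 0;
    for (int64_t i = 0; i < n; i++) {
        count += v.data[i] > threshold;   // tight, auto-vectorizable loop
    }
    return count;
}
```

For a constant vector the cost is one comparison regardless of batch size; the flat path pays the full loop.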
The String Problem: 16-Byte Inline Strings
String comparison is the #1 bottleneck in analytical queries. Most engines store strings as heap-allocated std::string objects — each comparison requires chasing a pointer,
causing cache misses.
I use a 16-byte inline string representation:
struct string_t {               // exactly 16 bytes, fits in one cache line
    uint32_t length;
    union {
        char inlined[12];       // strings ≤ 12 bytes stored entirely inline
        struct {
            char prefix[4];     // first 4 bytes of the string
            char *ptr;          // 8-byte pointer to the full data
        };
    };
};
Strings of 12 characters or fewer (which covers ~80% of real-world data: country codes, status flags, short names) never touch the heap. For longer strings, the 4-byte prefix enables fast inequality comparisons without dereferencing the pointer: if the prefixes differ, you already know the result.
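The prefix trick looks roughly like this. This is a hedged sketch using a simplified layout (prefix always populated, long-string pointer always set), not the engine's exact struct:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified stand-in for the 16-byte string described above.
struct string_t {
    uint32_t length;
    char prefix[4];      // first bytes of the string, zero-padded
    const char *ptr;     // full character data (must outlive this struct)
};

string_t Make(const std::string &s) {
    string_t r{static_cast<uint32_t>(s.size()), {0, 0, 0, 0}, s.data()};
    std::memcpy(r.prefix, s.data(), s.size() < 4 ? s.size() : 4);
    return r;
}

bool Equals(const string_t &a, const string_t &b) {
    if (a.length != b.length) return false;                    // cheap reject
    if (std::memcmp(a.prefix, b.prefix, 4) != 0) return false; // no pointer chase
    return std::memcmp(a.ptr, b.ptr, a.length) == 0;           // only now dereference
}
```

Most non-equal pairs are rejected by the length or prefix check, so the cache-missing full comparison runs only for likely matches.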
Writing a SQL Parser with Zero Dependencies
Most databases use parser generators (Yacc, Bison, ANTLR). I wrote a hand-written recursive descent parser instead. Why? Zero dependencies, full control over error messages,
and better performance (no generated code bloat).
The tokenizer converts SQL text into tokens in a single pass. Keywords are looked up in a hash map (std::unordered_map with ~100 entries). The parser is a
set of mutually recursive functions following the grammar precedence:
ParseExpression → ParseOr → ParseAnd → ParseNot → ParseComparison
→ ParseAddSub → ParseMulDiv → ParseUnary → ParsePrimary
Each level handles one precedence level. ParsePrimary handles atoms: constants, column references, function calls, parenthesized expressions, CASE WHEN, CAST, EXISTS
subqueries, and window functions.
The trickiest part was LEFT and RIGHT — they're both SQL keywords (for joins) AND function names (LEFT(s, 3)). The parser checks if the keyword is followed by ( and routes
accordingly.
A depth counter prevents stack overflow from deeply nested expressions (limit: 256 levels).
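The recursive descent pattern is easy to see in miniature. Here is a toy two-level parser over integer literals with + and * (illustrative only, none of the engine's real code), showing how each function handles one precedence level and delegates to the next tighter one:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Tiny recursive descent chain: ParseAddSub → ParseMulDiv → ParsePrimary.
struct Parser {
    const std::string &src;
    size_t pos = 0;

    char Peek() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) pos++;
        return pos < src.size() ? src[pos] : '\0';
    }

    long ParsePrimary() {               // atoms: integer literals only
        Peek();                         // skip whitespace
        long v = 0;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos])))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }

    long ParseMulDiv() {                // binds tighter than +
        long left = ParsePrimary();
        while (Peek() == '*') { pos++; left *= ParsePrimary(); }
        return left;
    }

    long ParseAddSub() {                // entry point, loosest precedence
        long left = ParseMulDiv();
        while (Peek() == '+') { pos++; left += ParseMulDiv(); }
        return left;
    }
};

long Eval(const std::string &s) { Parser p{s}; return p.ParseAddSub(); }
```

Because ParseAddSub only ever sees complete ParseMulDiv results, `1 + 2 * 3` groups as `1 + (2 * 3)` with no explicit precedence table.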
Implementing Parquet Without Apache Arrow
This was the hardest part. Parquet is a complex columnar file format:
- File starts and ends with magic bytes PAR1
- Footer at the end contains schema, row group metadata, column statistics
- Data is organized in row groups → column chunks → pages
- Each page contains encoded data (PLAIN, RLE, DICTIONARY, etc.)
Most implementations use Apache's C++ Parquet library, which pulls in Arrow, Thrift, Boost, and 50+ transitive dependencies. I implemented the format from scratch.
The reader:
- Reads the 4-byte footer size from file_end - 8
- Seeks back and reads the footer (binary metadata with column names, types, offsets)
- For each column chunk, reads data at the specified offset
- Decodes values based on the physical type (INT32, INT64, DOUBLE, BYTE_ARRAY)
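The first step, locating the footer, is simple enough to show. A hedged sketch of the trailer parse, operating on an in-memory buffer for clarity (the real reader then decodes the footer's binary metadata):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Parquet trailer layout: ... | footer bytes | 4-byte LE footer length | "PAR1"
// Returns the footer length, or -1 if the magic bytes or size are invalid.
int64_t ReadFooterLength(const std::vector<uint8_t> &file) {
    if (file.size() < 12) return -1;        // header PAR1 + length + trailing PAR1
    const uint8_t *end = file.data() + file.size();
    if (std::memcmp(end - 4, "PAR1", 4) != 0) return -1;
    uint32_t len;
    std::memcpy(&len, end - 8, 4);          // little-endian on x86/ARM hosts
    if (len > file.size() - 12) return -1;  // bounds check: this value is attacker-controlled
    return static_cast<int64_t>(len);
}
```

Note the bounds check before trusting `len`: the length field comes straight from the file, which is exactly the class of input the security section below is about.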
Row group statistics (min/max per column) enable predicate pushdown — if you query WHERE salary > 200000 and a row group's max salary is 150000, we skip the entire row group
without reading it.
bool ParquetReader::RowGroupMightMatch(idx_t rg_idx, idx_t col_idx,
                                       const std::string &op, const Value &val) {
    auto &col = meta_.row_groups[rg_idx].columns[col_idx];
    if (!col.has_stats) return true;              // no stats: can't skip, read it
    if (op == ">") return !(col.max_value < val); // max < target → skip
    if (op == "<") return !(val < col.min_value); // min > target → skip
    // ...
    return true;                                  // unknown operator: be conservative
}
Dictionary-Encoded Execution: The 10x String Speedup
When a column has low cardinality (few unique values — think department, country, status), dictionary encoding replaces strings with integer codes:
Original: ["Engineering", "Sales", "Engineering", "Sales", "Marketing"]
Dictionary: {0: "Engineering", 1: "Sales", 2: "Marketing"}
Codes: [0, 1, 0, 1, 2]
The key insight: all operations (GROUP BY, JOIN, filter, hash) work on the integer codes, not the strings. Comparing two 4-byte integers is ~10x faster than comparing two
variable-length strings. Hashing an integer is O(1); hashing a string is O(n).
The DictionaryVector class automatically encodes columns when cardinality < 50% of row count. GROUP BY on a dictionary-encoded column becomes a simple integer-to-bucket
mapping.
GPU Acceleration Architecture
The GPU engine has a clean abstraction:
class GPUEngine {
public:
    virtual ~GPUEngine() = default;
    virtual int64_t SumInt32(const int32_t *data, idx_t count) = 0;
    virtual std::vector<int32_t> Filter(const int32_t *data, idx_t count,
                                        const std::string &op, int32_t value) = 0;
    virtual AggResult HashAggregate(const int32_t *keys,
                                    const int32_t *values, idx_t count) = 0;
};
Three backends: CUDAEngine (NVIDIA), MetalEngine (Apple Silicon), CPUFallbackEngine. The engine auto-detects available GPU at startup. Operations below 100K rows use CPU (PCIe
transfer overhead makes GPU slower for small data).
The CUDA kernels use parallel reduction for SUM (shared memory + warp shuffle), atomic operations for hash aggregation, and radix sort for ORDER BY.
Apple Metal has a unique advantage: unified memory. No PCIe transfer needed — the GPU reads the same physical memory as the CPU. This makes GPU acceleration viable even for
smaller datasets on Apple Silicon.
Security: What I Got Wrong the First Time
A security audit found 22 vulnerabilities in my initial implementation:
- Parquet footer parsing: reading field lengths from the file without bounds checking → buffer overflow from malicious files. Fixed with a safe_read wrapper.
- GENERATE_SERIES(0, 9223372036854775807): attempted to allocate trillions of rows. Fixed with a 10M row limit.
- Extension loading: LOAD '/etc/malicious.so' → arbitrary code execution. Fixed by restricting to relative paths.
- Regex DoS: REGEXP_MATCHES(col, '(a+)+$') → catastrophic backtracking in std::regex. Noted for future RE2 migration.
- CSV formula injection: cells starting with = written unquoted → Excel command execution. Fixed by auto-quoting.
The lesson: every value read from a file is attacker-controlled input. Every memcpy from file data needs a bounds check.
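A bounds-checked read wrapper makes that discipline mechanical. A hypothetical sketch of the idea; the engine's actual safe_read may differ in shape:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Bounds-checked read from an untrusted file buffer. Both offset and len may
// come from the file itself, so each is validated before the memcpy.
bool SafeRead(const std::vector<uint8_t> &file, size_t offset, size_t len,
              void *out) {
    // Checked as "len > size - offset" (not "offset + len > size")
    // so a huge offset cannot wrap around and pass the check.
    if (offset > file.size() || len > file.size() - offset) return false;
    std::memcpy(out, file.data() + offset, len);
    return true;
}
```

Every raw `memcpy` in the reader gets replaced by a call whose failure path is a clean parse error instead of a buffer overflow.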
Numbers
- ~33,000 lines of C++20
- 325 tests, 131,000+ assertions
- 130+ SQL features
- 7 file format readers (all from scratch)
- Single binary: ~700KB stripped
- Compiles with MSVC, GCC, and Clang