Building a Vectorized SQL Engine from Scratch in C++ — No Dependencies
I built an embedded analytical database engine (https://github.com/SouravRoy-ETL/slothdb) entirely from scratch in C++20 — no external libraries, not even for SQL parsing or
Parquet file reading. Here's how each subsystem works under the hood.
The Execution Model: Why Vectorized Beats Row-at-a-Time
Traditional databases process one row at a time. Each row travels through every operator (scan → filter → project → aggregate) individually. The overhead per row is massive:
virtual function calls, branch mispredictions, poor cache utilization.
Vectorized execution processes batches of 2,048 values at once. Instead of calling filter(row) 10 million times, you call filter(vector_of_2048) about 5,000 times. Each call
processes a tight loop over a flat array — perfect for CPU caches and SIMD auto-vectorization.
// Row-at-a-time (slow)
for (each row) {
    if (row.salary > 100000) emit(row); // virtual call, branch, cache miss
}

// Vectorized (fast)
auto *salaries = vector.GetData();      // flat array, cache-friendly
auto *output = result.GetData();
for (idx_t i = 0; i < 2048; i++) {
    output[i] = salaries[i] > 100000;   // tight loop, auto-vectorizes
}
The Vector class is the core data structure. It stores a flat data_ptr_t buffer, a ValidityMask (bitvector for NULLs — 1 bit per value, 256 bytes for 2,048 values), and
supports four modes: FLAT, CONSTANT, DICTIONARY, and SEQUENCE. The CONSTANT mode is key — if an entire vector has the same value (common in WHERE x = 5), we store it once and
skip the loop entirely.
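To make the CONSTANT short-circuit concrete, here is a minimal sketch. The `Vector`, `VectorType`, and `CountGreaterThan` names are simplified stand-ins for illustration, not the engine's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class VectorType { FLAT, CONSTANT };

struct Vector {
    VectorType type;
    std::vector<int64_t> data;   // FLAT: one value per row; CONSTANT: data[0] only
};

// Count how many of n logical rows exceed the threshold.
int64_t CountGreaterThan(const Vector &v, int64_t n, int64_t threshold) {
    if (v.type == VectorType::CONSTANT) {
        // One comparison decides the whole batch: all rows match or none do.
        return v.data[0] > threshold ? n : 0;
    }
    int64_t count = 0;
    for (int64_t i = 0; i < n; i++) {
        count += v.data[i] > threshold;   // tight, auto-vectorizable loop
    }
    return count;
}
```

For a constant vector the cost is one comparison regardless of batch size; the flat path pays the full loop.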
The String Problem: 16-Byte Inline Strings
String comparison is the #1 bottleneck in analytical queries. Most engines store strings as heap-allocated std::string objects — each comparison requires chasing a pointer,
causing cache misses.
I use a 16-byte inline string representation:
struct string_t {               // exactly 16 bytes, fits in one cache line
    uint32_t length;
    union {
        char inlined[12];       // strings ≤ 12 bytes stored entirely inline
        struct {
            char prefix[4];     // first 4 bytes of the string
            char *ptr;          // 8-byte pointer to the full data
        };
    };
};
Strings of 12 characters or fewer (which covers ~80% of real-world data: country codes, status flags, short names) never touch the heap. For longer strings, the 4-byte prefix enables fast inequality comparisons without dereferencing the pointer: if the prefixes differ, you already know the result.
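The prefix trick looks roughly like this. This is a hedged sketch using a simplified layout (prefix always populated, long-string pointer always set), not the engine's exact struct:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified stand-in for the 16-byte string described above.
struct string_t {
    uint32_t length;
    char prefix[4];      // first bytes of the string, zero-padded
    const char *ptr;     // full character data (must outlive this struct)
};

string_t Make(const std::string &s) {
    string_t r{static_cast<uint32_t>(s.size()), {0, 0, 0, 0}, s.data()};
    std::memcpy(r.prefix, s.data(), s.size() < 4 ? s.size() : 4);
    return r;
}

bool Equals(const string_t &a, const string_t &b) {
    if (a.length != b.length) return false;                    // cheap reject
    if (std::memcmp(a.prefix, b.prefix, 4) != 0) return false; // no pointer chase
    return std::memcmp(a.ptr, b.ptr, a.length) == 0;           // only now dereference
}
```

Most non-equal pairs are rejected by the length or prefix check, so the cache-missing full comparison runs only for likely matches.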
Writing a SQL Parser with Zero Dependencies
Most databases use parser generators (Yacc, Bison, ANTLR). I wrote a hand-written recursive descent parser instead. Why? Zero dependencies, full control over error messages,
and better performance (no generated code bloat).
The tokenizer converts SQL text into tokens in a single pass. Keywords are looked up in a hash map (std::unordered_map with ~100 entries). The parser is a
set of mutually recursive functions following the grammar precedence:
ParseExpression → ParseOr → ParseAnd → ParseNot → ParseComparison
→ ParseAddSub → ParseMulDiv → ParseUnary → ParsePrimary
Each level handles one precedence level. ParsePrimary handles atoms: constants, column references, function calls, parenthesized expressions, CASE WHEN, CAST, EXISTS
subqueries, and window functions.
The trickiest part was LEFT and RIGHT — they're both SQL keywords (for joins) AND function names (LEFT(s, 3)). The parser checks if the keyword is followed by ( and routes
accordingly.
A depth counter prevents stack overflow from deeply nested expressions (limit: 256 levels).
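The recursive descent pattern is easy to see in miniature. Here is a toy two-level parser over integer literals with + and * (illustrative only, none of the engine's real code), showing how each function handles one precedence level and delegates to the next tighter one:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Tiny recursive descent chain: ParseAddSub → ParseMulDiv → ParsePrimary.
struct Parser {
    const std::string &src;
    size_t pos = 0;

    char Peek() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) pos++;
        return pos < src.size() ? src[pos] : '\0';
    }

    long ParsePrimary() {               // atoms: integer literals only
        Peek();                         // skip whitespace
        long v = 0;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos])))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }

    long ParseMulDiv() {                // binds tighter than +
        long left = ParsePrimary();
        while (Peek() == '*') { pos++; left *= ParsePrimary(); }
        return left;
    }

    long ParseAddSub() {                // entry point, loosest precedence
        long left = ParseMulDiv();
        while (Peek() == '+') { pos++; left += ParseMulDiv(); }
        return left;
    }
};

long Eval(const std::string &s) { Parser p{s}; return p.ParseAddSub(); }
```

Because ParseAddSub only ever sees complete ParseMulDiv results, `1 + 2 * 3` groups as `1 + (2 * 3)` with no explicit precedence table.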
Implementing Parquet Without Apache Arrow
This was the hardest part. Parquet is a complex columnar file format:
- File starts and ends with magic bytes PAR1
- Footer at the end contains schema, row group metadata, column statistics
- Data is organized in row groups → column chunks → pages
- Each page contains encoded data (PLAIN, RLE, DICTIONARY, etc.)
Most implementations use Apache's C++ Parquet library, which pulls in Arrow, Thrift, Boost, and 50+ transitive dependencies. I implemented the format from scratch.
The reader:
- Reads the 4-byte footer size from file_end - 8
- Seeks back and reads the footer (binary metadata with column names, types, offsets)
- For each column chunk, reads data at the specified offset
- Decodes values based on the physical type (INT32, INT64, DOUBLE, BYTE_ARRAY)
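The first step, locating the footer, is simple enough to show. A hedged sketch of the trailer parse, operating on an in-memory buffer for clarity (the real reader then decodes the footer's binary metadata):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Parquet trailer layout: ... | footer bytes | 4-byte LE footer length | "PAR1"
// Returns the footer length, or -1 if the magic bytes or size are invalid.
int64_t ReadFooterLength(const std::vector<uint8_t> &file) {
    if (file.size() < 12) return -1;        // header PAR1 + length + trailing PAR1
    const uint8_t *end = file.data() + file.size();
    if (std::memcmp(end - 4, "PAR1", 4) != 0) return -1;
    uint32_t len;
    std::memcpy(&len, end - 8, 4);          // little-endian on x86/ARM hosts
    if (len > file.size() - 12) return -1;  // bounds check: this value is attacker-controlled
    return static_cast<int64_t>(len);
}
```

Note the bounds check before trusting `len`: the length field comes straight from the file, which is exactly the class of input the security section below is about.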
Row group statistics (min/max per column) enable predicate pushdown — if you query WHERE salary > 200000 and a row group's max salary is 150000, we skip the entire row group
without reading it.
bool ParquetReader::RowGroupMightMatch(idx_t rg_idx, idx_t col_idx,
                                       const std::string &op, const Value &val) {
    auto &col = meta_.row_groups[rg_idx].columns[col_idx];
    if (!col.has_stats) return true;              // no stats: can't skip, read it
    if (op == ">") return !(col.max_value < val); // max < target → skip
    if (op == "<") return !(val < col.min_value); // min > target → skip
    // ...
    return true;                                  // unknown operator: be conservative
}
Dictionary-Encoded Execution: The 10x String Speedup
When a column has low cardinality (few unique values — think department, country, status), dictionary encoding replaces strings with integer codes:
Original: ["Engineering", "Sales", "Engineering", "Sales", "Marketing"]
Dictionary: {0: "Engineering", 1: "Sales", 2: "Marketing"}
Codes: [0, 1, 0, 1, 2]
The key insight: all operations (GROUP BY, JOIN, filter, hash) work on the integer codes, not the strings. Comparing two 4-byte integers is ~10x faster than comparing two
variable-length strings. Hashing an integer is O(1); hashing a string is O(n).
The DictionaryVector class automatically encodes columns when cardinality < 50% of row count. GROUP BY on a dictionary-encoded column becomes a simple integer-to-bucket
mapping.
GPU Acceleration Architecture
The GPU engine has a clean abstraction:
class GPUEngine {
public:
    virtual ~GPUEngine() = default;
    virtual int64_t SumInt32(const int32_t *data, idx_t count) = 0;
    virtual std::vector<int32_t> Filter(const int32_t *data, idx_t count,
                                        const std::string &op, int32_t value) = 0;
    virtual AggResult HashAggregate(const int32_t *keys,
                                    const int32_t *values, idx_t count) = 0;
};
Three backends: CUDAEngine (NVIDIA), MetalEngine (Apple Silicon), CPUFallbackEngine. The engine auto-detects available GPU at startup. Operations below 100K rows use CPU (PCIe
transfer overhead makes GPU slower for small data).
The CUDA kernels use parallel reduction for SUM (shared memory + warp shuffle), atomic operations for hash aggregation, and radix sort for ORDER BY.
Apple Metal has a unique advantage: unified memory. No PCIe transfer needed — the GPU reads the same physical memory as the CPU. This makes GPU acceleration viable even for
smaller datasets on Apple Silicon.
Security: What I Got Wrong the First Time
A security audit found 22 vulnerabilities in my initial implementation:
- Parquet footer parsing: reading field lengths from the file without bounds checking → buffer overflow from malicious files. Fixed with a safe_read wrapper.
- GENERATE_SERIES(0, 9223372036854775807): attempted to allocate trillions of rows. Fixed with a 10M row limit.
- Extension loading: LOAD '/etc/malicious.so' → arbitrary code execution. Fixed by restricting to relative paths.
- Regex DoS: REGEXP_MATCHES(col, '(a+)+$') → catastrophic backtracking in std::regex. Noted for future RE2 migration.
- CSV formula injection: cells starting with = written unquoted → Excel command execution. Fixed by auto-quoting.
The lesson: every value read from a file is attacker-controlled input. Every memcpy from file data needs a bounds check.
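A bounds-checked read wrapper makes that discipline mechanical. A hypothetical sketch of the idea; the engine's actual safe_read may differ in shape:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Bounds-checked read from an untrusted file buffer. Both offset and len may
// come from the file itself, so each is validated before the memcpy.
bool SafeRead(const std::vector<uint8_t> &file, size_t offset, size_t len,
              void *out) {
    // Checked as "len > size - offset" (not "offset + len > size")
    // so a huge offset cannot wrap around and pass the check.
    if (offset > file.size() || len > file.size() - offset) return false;
    std::memcpy(out, file.data() + offset, len);
    return true;
}
```

Every raw `memcpy` in the reader gets replaced by a call whose failure path is a clean parse error instead of a buffer overflow.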
Numbers
- ~33,000 lines of C++20
- 325 tests, 131,000+ assertions
- 130+ SQL features
- 7 file format readers (all from scratch)
- Single binary: ~700KB stripped
- Compiles with MSVC, GCC, and Clang