How I Built a Database Engine from Scratch in C++
Most developers use databases every day — MySQL, PostgreSQL, SQLite — without ever thinking about what's happening underneath. I wanted to change that. So I built one from scratch.
This is the story of 4mulaQuery — a custom database engine written in C++, exposed through a Java Spring Boot REST API, containerized with Docker, and deployed live on the internet.
Why Build a Database?
I was studying data structures and kept asking myself — how does a real database actually store data? How does it find a record instantly among millions?
Books explain B+ Trees theoretically. But I wanted to feel it. So I built it.
The goal was simple: understand databases from the ground up, not just use them.
The Architecture
Browser
│
▼
Java Spring Boot API
│
▼ (stdin/stdout)
C++ Database Engine
│
▼
4mulaQuery.db (Binary File)
Three completely independent layers — each doing one job.
Layer 1 — C++ Storage Engine
This is the heart of the project. No libraries. No shortcuts.
Phase 1: Binary File Storage
First, I defined a fixed-width row schema:
```cpp
struct Row {
    uint32_t id;           // 4 bytes
    char username[32];     // 32 bytes
    char email[255];       // 255 bytes
};
// Total: 291 bytes per row on disk
// (note: sizeof(Row) is 292 with default alignment, so the row is
// serialized field-by-field or with a packed struct)
```
Fixed-width means I can calculate any row's position instantly:
position = row_number × 291 bytes
A Pager class handles all disk I/O — reading and writing raw bytes.
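The offset arithmetic and the pager around it can be sketched like this (the Pager name comes from the article; the exact API and the use of C stdio are my assumptions):

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint32_t ROW_SIZE = 291;  // 4 + 32 + 255 bytes, packed

// Byte offset of a row in the data file: row_number * ROW_SIZE.
uint64_t row_offset(uint32_t row_number) {
    return static_cast<uint64_t>(row_number) * ROW_SIZE;
}

// Minimal pager: seek to the computed offset, then read or write
// exactly ROW_SIZE raw bytes.
struct Pager {
    FILE* file;

    explicit Pager(const char* path) {
        file = std::fopen(path, "r+b");
        if (!file) file = std::fopen(path, "w+b");  // create if missing
    }
    ~Pager() { if (file) std::fclose(file); }

    void write_row(uint32_t row_number, const uint8_t* row) {
        std::fseek(file, static_cast<long>(row_offset(row_number)), SEEK_SET);
        std::fwrite(row, ROW_SIZE, 1, file);
        std::fflush(file);  // make the row durable before returning
    }

    void read_row(uint32_t row_number, uint8_t* out) {
        std::fseek(file, static_cast<long>(row_offset(row_number)), SEEK_SET);
        std::fread(out, ROW_SIZE, 1, file);
    }
};
```

Because every row has the same width, no index is needed just to locate a row: row 3 always starts at byte 873.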
Phase 2: B+ Tree Indexing
The first version used linear search — O(n). Fine for 100 records. Terrible for 100,000.
I upgraded to a B+ Tree — the same data structure PostgreSQL and MySQL use internally.
Why B+ Tree?
- O(log n) insert, search, delete
- Leaf nodes form a linked list — full table scans stay fast
- All data lives in leaf nodes — internal nodes are just routing
My implementation:
Page Layout (4096 bytes per page):
Leaf Node:
[node_type | is_root | parent | num_cells | next_leaf | cells...]
Each cell = key(4 bytes) + Row(291 bytes) = 295 bytes
Max cells per leaf = 13
Internal Node:
[node_type | is_root | parent | num_keys | right_child | cells...]
Each cell = child(4 bytes) + key(4 bytes) = 8 bytes
Max keys = 509
Capacity (with 509 children per internal node):
- Depth 1: 13 records
- Depth 2: 509 × 13 = 6,617 records
- Depth 3: 509² × 13 = 3,368,053 records
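The capacities above fall straight out of the page arithmetic. Here is a sketch in compile-time constants, assuming every header field is stored as 4 bytes for a 20-byte header (the article doesn't spell out the header field sizes, but this assumption reproduces its numbers):

```cpp
#include <cstdint>

constexpr uint32_t PAGE_SIZE = 4096;
// Assumption: node_type, is_root, parent, num_cells/num_keys, and
// next_leaf/right_child are each stored as 4 bytes: a 20-byte header.
constexpr uint32_t NODE_HEADER_SIZE = 5 * 4;

constexpr uint32_t ROW_SIZE = 291;                 // fixed-width row
constexpr uint32_t LEAF_CELL_SIZE = 4 + ROW_SIZE;  // key + row = 295
constexpr uint32_t LEAF_MAX_CELLS =
    (PAGE_SIZE - NODE_HEADER_SIZE) / LEAF_CELL_SIZE;      // 13

constexpr uint32_t INTERNAL_CELL_SIZE = 4 + 4;     // child + key = 8
constexpr uint32_t INTERNAL_MAX_KEYS =
    (PAGE_SIZE - NODE_HEADER_SIZE) / INTERNAL_CELL_SIZE;  // 509

static_assert(LEAF_MAX_CELLS == 13, "13 rows per leaf page");
static_assert(INTERNAL_MAX_KEYS == 509, "509 keys per internal page");

// Capacity grows geometrically with depth:
// depth 1: 13, depth 2: 509 * 13 = 6,617, depth 3: 509^2 * 13 = 3,368,053
```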
The hardest part was the leaf split — when a leaf node fills up, you have to:
- Create a new leaf
- Redistribute records
- Update the linked list
- Inform the parent node
- Create a new root if needed
Getting this right took me 2 days of debugging.
Layer 2 — Java Spring Boot API
C++ handles storage. But browsers speak HTTP, not binary.
I built a Java Spring Boot REST API as the bridge — it spawns a C++ process for every query, sends the command via stdin, and reads the response from stdout.
```java
// EngineService.java
Process process = processManager.startProcess();
streamHandler.writeCommand(process, command);
String result = streamHandler.readOutput(process);
```
The API follows Single Responsibility Principle:
EngineService → orchestrates everything
ProcessManager → spawns/kills C++ process
StreamHandler → stdin/stdout I/O
QueryLogger → logs every query for ML
CommandType → enum: INSERT | SEARCH | DELETE | ALL
Endpoints:
GET /api/insert?id=1&name=Abdul&email=a@b.com
GET /api/search?id=1
GET /api/delete?id=1
GET /api/all
GET /api/logs → analytics data
POST /api/auth/login
POST /api/auth/register
Layer 3 — Frontend Dashboard
A single-page app with:
- Login / Signup / Forgot Password
- Dashboard with real-time DB operations
- Data Explorer
- Query Console (raw commands)
- Analytics Dashboard — live charts using Chart.js
The analytics dashboard fetches /api/logs and renders:
- Query distribution (bar chart)
- Average execution time per query type
- Success rate (pie chart)
- Timeline of last 20 queries
ML Query Logging
Every query is automatically logged:
timestamp,type,execution_ms,success,command
2026-04-01 10:00:01,INSERT,6,true,"insert,1,Abdul,abdul@test.com"
2026-04-01 10:00:02,SEARCH,2,true,"search,1"
I built a Python analytics script (analyze.py) that:
- Reads query_logs.csv
- Generates performance graphs
- Identifies slow queries and patterns
This data will eventually train an ML model to predict and optimize query execution.
Docker Deployment
The whole system runs in a single Docker container:
# Multi-stage build
# Stage 1: Compile C++ engine
# Stage 2: Build Java JAR
# Stage 3: Runtime with both binaries
Deployed live on Render.com:
👉 fourmulaquery.onrender.com
What I Learned
1. Databases are just files
At the lowest level, every database is reading and writing bytes to disk. The magic is in how efficiently you do it.
2. B+ Trees are beautiful
Once I understood why leaf nodes form a linked list, everything clicked. Range scans become trivial. Full table scans stay fast.
3. Inter-process communication is tricky
Getting Java to talk to C++ via stdin/stdout required careful stream handling — especially flushing buffers at the right time.
4. Build the simple version first
I started with linear search. It worked. Then I upgraded to B+ Tree. This made debugging so much easier — I always had a working baseline.
What's Next
- [ ] SQL Parser — SELECT * FROM users WHERE id > 5
- [ ] Tree rebalancing on delete
- [ ] Distributed version
Try It
🌐 Live Demo: fourmulaquery.onrender.com
💻 GitHub: github.com/4mulaMind/4mulaQuery
If you're learning data structures, build something real with them. You'll understand them 10x better.
