How I Built a Database Engine from Scratch in C++
Most developers use databases every day — MySQL, PostgreSQL, SQLite — without ever thinking about what's happening underneath. I wanted to change that. So I built one from scratch.
This is the story of 4mulaQuery — a custom database engine written in C++, exposed through a Java Spring Boot REST API, containerized with Docker, and deployed live on the internet.
Why Build a Database?
I was studying data structures and kept asking myself — how does a real database actually store data? How does it find a record instantly among millions?
Books explain B+ Trees theoretically. But I wanted to feel it. So I built it.
The goal was simple: understand databases from the ground up, not just use them.
The Architecture
Browser
│
▼
Java Spring Boot API
│
▼ (stdin/stdout)
C++ Database Engine
│
▼
4mulaQuery.db (Binary File)
Three completely independent layers — each doing one job.
Layer 1 — C++ Storage Engine
This is the heart of the project. No libraries. No shortcuts.
Phase 1: Binary File Storage
First, I defined a fixed-width row schema:
```cpp
struct Row {
    uint32_t id;           // 4 bytes
    char username[32];     // 32 bytes
    char email[255];       // 255 bytes
};
// Total: 291 bytes per row on disk
// (note: sizeof(Row) is 292 with default alignment, so the row is
// serialized field-by-field or with a packed struct)
```
Fixed-width means I can calculate any row's position instantly:
position = row_number × 291 bytes
A Pager class handles all disk I/O — reading and writing raw bytes.
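The offset arithmetic and the pager around it can be sketched like this (the Pager name comes from the article; the exact API and the use of C stdio are my assumptions):

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint32_t ROW_SIZE = 291;  // 4 + 32 + 255 bytes, packed

// Byte offset of a row in the data file: row_number * ROW_SIZE.
uint64_t row_offset(uint32_t row_number) {
    return static_cast<uint64_t>(row_number) * ROW_SIZE;
}

// Minimal pager: seek to the computed offset, then read or write
// exactly ROW_SIZE raw bytes.
struct Pager {
    FILE* file;

    explicit Pager(const char* path) {
        file = std::fopen(path, "r+b");
        if (!file) file = std::fopen(path, "w+b");  // create if missing
    }
    ~Pager() { if (file) std::fclose(file); }

    void write_row(uint32_t row_number, const uint8_t* row) {
        std::fseek(file, static_cast<long>(row_offset(row_number)), SEEK_SET);
        std::fwrite(row, ROW_SIZE, 1, file);
        std::fflush(file);  // make the row durable before returning
    }

    void read_row(uint32_t row_number, uint8_t* out) {
        std::fseek(file, static_cast<long>(row_offset(row_number)), SEEK_SET);
        std::fread(out, ROW_SIZE, 1, file);
    }
};
```

Because every row has the same width, no index is needed just to locate a row: row 3 always starts at byte 873.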
Phase 2: B+ Tree Indexing
The first version used linear search — O(n). Fine for 100 records. Terrible for 100,000.
I upgraded to a B+ Tree — the same data structure PostgreSQL and MySQL use internally.
Why B+ Tree?
- O(log n) insert, search, delete
- Leaf nodes form a linked list — full table scans stay fast
- All data lives in leaf nodes — internal nodes are just routing
My implementation:
Page Layout (4096 bytes per page):
Leaf Node:
[node_type | is_root | parent | num_cells | next_leaf | cells...]
Each cell = key(4 bytes) + Row(291 bytes) = 295 bytes
Max cells per leaf = 13
Internal Node:
[node_type | is_root | parent | num_keys | right_child | cells...]
Each cell = child(4 bytes) + key(4 bytes) = 8 bytes
Max keys = 509
Capacity (with 509 children per internal node):
- Depth 1: 13 records
- Depth 2: 509 × 13 = 6,617 records
- Depth 3: 509² × 13 = 3,368,053 records
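The capacities above fall straight out of the page arithmetic. Here is a sketch in compile-time constants, assuming every header field is stored as 4 bytes for a 20-byte header (the article doesn't spell out the header field sizes, but this assumption reproduces its numbers):

```cpp
#include <cstdint>

constexpr uint32_t PAGE_SIZE = 4096;
// Assumption: node_type, is_root, parent, num_cells/num_keys, and
// next_leaf/right_child are each stored as 4 bytes: a 20-byte header.
constexpr uint32_t NODE_HEADER_SIZE = 5 * 4;

constexpr uint32_t ROW_SIZE = 291;                 // fixed-width row
constexpr uint32_t LEAF_CELL_SIZE = 4 + ROW_SIZE;  // key + row = 295
constexpr uint32_t LEAF_MAX_CELLS =
    (PAGE_SIZE - NODE_HEADER_SIZE) / LEAF_CELL_SIZE;      // 13

constexpr uint32_t INTERNAL_CELL_SIZE = 4 + 4;     // child + key = 8
constexpr uint32_t INTERNAL_MAX_KEYS =
    (PAGE_SIZE - NODE_HEADER_SIZE) / INTERNAL_CELL_SIZE;  // 509

static_assert(LEAF_MAX_CELLS == 13, "13 rows per leaf page");
static_assert(INTERNAL_MAX_KEYS == 509, "509 keys per internal page");

// Capacity grows geometrically with depth:
// depth 1: 13, depth 2: 509 * 13 = 6,617, depth 3: 509^2 * 13 = 3,368,053
```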
The hardest part was the leaf split — when a leaf node fills up, you have to:
- Create a new leaf
- Redistribute records
- Update the linked list
- Inform the parent node
- Create a new root if needed
Getting this right took me 2 days of debugging.
Layer 2 — Java Spring Boot API
C++ handles storage. But browsers speak HTTP, not binary.
I built a Java Spring Boot REST API as the bridge — it spawns a C++ process for every query, sends the command via stdin, and reads the response from stdout.
```java
// EngineService.java
Process process = processManager.startProcess();
streamHandler.writeCommand(process, command);
String result = streamHandler.readOutput(process);
```
The API follows Single Responsibility Principle:
EngineService → orchestrates everything
ProcessManager → spawns/kills C++ process
StreamHandler → stdin/stdout I/O
QueryLogger → logs every query for ML
CommandType → enum: INSERT | SEARCH | DELETE | ALL
Endpoints:
GET /api/insert?id=1&name=Abdul&email=a@b.com
GET /api/search?id=1
GET /api/delete?id=1
GET /api/all
GET /api/logs → analytics data
POST /api/auth/login
POST /api/auth/register
Layer 3 — Frontend Dashboard
A single-page app with:
- Login / Signup / Forgot Password
- Dashboard with real-time DB operations
- Data Explorer
- Query Console (raw commands)
- Analytics Dashboard — live charts using Chart.js
The analytics dashboard fetches /api/logs and renders:
- Query distribution (bar chart)
- Average execution time per query type
- Success rate (pie chart)
- Timeline of last 20 queries
ML Query Logging
Every query is automatically logged:
timestamp,type,execution_ms,success,command
2026-04-01 10:00:01,INSERT,6,true,"insert,1,Abdul,abdul@test.com"
2026-04-01 10:00:02,SEARCH,2,true,"search,1"
I built a Python analytics script (analyze.py) that:
- Reads query_logs.csv
- Generates performance graphs
- Identifies slow queries and patterns
This data will eventually train an ML model to predict and optimize query execution.
Docker Deployment
The whole system runs in a single Docker container:
# Multi-stage build
# Stage 1: Compile C++ engine
# Stage 2: Build Java JAR
# Stage 3: Runtime with both binaries
Deployed live on Render.com:
👉 fourmulaquery.onrender.com
What I Learned
1. Databases are just files
At the lowest level, every database is reading and writing bytes to disk. The magic is in how efficiently you do it.
2. B+ Trees are beautiful
Once I understood why leaf nodes form a linked list, everything clicked. Range scans become trivial. Full table scans stay fast.
3. Inter-process communication is tricky
Getting Java to talk to C++ via stdin/stdout required careful stream handling — especially flushing buffers at the right time.
4. Build the simple version first
I started with linear search. It worked. Then I upgraded to B+ Tree. This made debugging so much easier — I always had a working baseline.
What's Next
- [ ] SQL Parser — SELECT * FROM users WHERE id > 5
- [ ] Tree rebalancing on delete
- [ ] Distributed version
Try It
🌐 Live Demo: fourmulaquery.onrender.com
💻 GitHub: github.com/4mulaMind/4mulaQuery
If you're learning data structures, build something real with them. You'll understand them 10x better.
