Most developers use databases every day.
Few actually understand what happens under the hood.
We say things like “the database stores data” or “the query is slow”—but inside a real database engine, there is a complex, carefully engineered system built for performance, durability, and correctness.
This article explains how real databases work internally, step by step, without marketing language or oversimplified myths.
1. A Database Is Not a Spreadsheet
A real database is not:
- A giant table in memory
- A collection of CSV files
- A simple key-value map
A real database is a storage engine + execution engine + transaction system, tightly integrated.
At a high level, every serious database has:
- Parser
- Planner / Optimizer
- Execution Engine
- Storage Engine
- Buffer Cache
- Transaction Manager
- Recovery System
Let’s walk through what actually happens.
2. What Happens When You Send a Query?
When you run:
SELECT name FROM users WHERE age > 30;
The database does not immediately scan a table.
Instead, it follows a strict pipeline.
Step 1: Parsing
The SQL text is converted into an Abstract Syntax Tree (AST).
The database checks:
- Syntax correctness
- Valid table and column names
- User permissions
At this stage, it does zero execution.
Step 2: Query Planning & Optimization
This is where databases become serious engineering.
The query optimizer decides:
- Which indexes to use
- Join order
- Scan method (index scan vs sequential scan)
- Cost estimation based on statistics
Example decisions:
- Is it cheaper to scan 1 million rows sequentially?
- Or to use an index with random I/O?
Modern databases use:
- Cost-based optimizers
- Statistics (histograms, cardinality)
- Rule-based rewrites
This step determines performance more than hardware.
3. The Execution Engine
After planning, the database produces an execution plan.
The execution engine:
- Pulls rows through operators (scan → filter → project)
- Uses iterators or vectorized execution
- Streams results instead of loading everything into memory
Important concept:
Databases process data in pipelines, not all at once.
4. How Data Is Actually Stored on Disk
Databases do not store rows directly on disk.
Pages (Blocks)
Data is stored in fixed-size pages (commonly 4KB–16KB).
Each page contains:
- Page header
- Row slots
- Metadata
Pages are the minimum unit of I/O.
Row Storage vs Column Storage
Row-oriented (PostgreSQL, MySQL):
- Best for OLTP
- Fast inserts and point queries
Column-oriented (ClickHouse, Redshift):
- Best for analytics
- Excellent compression
- Vectorized scans
5. Indexes Are Not Magic
Indexes are separate data structures, usually:
- B-Trees (most common)
- Hash indexes
- LSM Trees (RocksDB, Cassandra)
A B-Tree:
- Keeps data sorted
- Minimizes disk seeks
- Balances read/write costs
Important truth:
Indexes speed up reads but slow down writes.
Every insert/update must update:
- Table data
- All related indexes
6. Memory Is a Cache, Not the Source of Truth
Databases never trust memory.
Buffer Pool (Cache)
- Frequently used pages are cached in RAM
- Dirty pages are written back later
- Replacement strategies (LRU variants)
If power fails:
- Memory is lost
- Disk must still be consistent
Which brings us to the most critical system.
7. Transactions and ACID
Real databases guarantee ACID:
- Atomicity
- Consistency
- Isolation
- Durability
This is achieved through:
Write-Ahead Logging (WAL)
Before modifying data:
- Changes are written to a log
- Log is flushed to disk
- Only then is memory updated
If the database crashes:
- WAL is replayed
- Data is recovered
Logs are more important than data files.
8. Concurrency Control (Why Locks Exist)
Databases support thousands of concurrent users.
They use:
- Locks (row, page, table)
- MVCC (Multi-Version Concurrency Control)
With MVCC:
- Readers don’t block writers
- Writers create new versions
- Old versions cleaned by vacuum/GC
This is why:
- PostgreSQL can read without locking
- Oracle and PostgreSQL scale well under load
9. Recovery and Crash Safety
When a database restarts after a crash:
- Read last checkpoint
- Replay WAL records
- Undo incomplete transactions
- Restore consistency
This process is deterministic and repeatable.
No guessing. No heuristics.
10. Why This Knowledge Matters
Understanding internals helps you:
- Write faster queries
- Design better schemas
- Choose correct indexes
- Avoid dangerous assumptions
- Debug performance issues
Most “slow database” problems are:
- Bad query plans
- Wrong indexes
- Misunderstanding internals
Not hardware.
Final Thought
A real database is closer to an operating system than a library.
It manages:
- Memory
- Storage
- Concurrency
- Recovery
- Scheduling
If you treat it like a black box, it will punish you.
If you understand it, it becomes one of the most powerful tools in software engineering.
Top comments (0)