Farhad Rahimi Klie

Posted on Jan 31

How Real Databases Work Internally ?

#database #data #sql #nosql

Most developers use databases every day.
Few actually understand what happens under the hood.

We say things like “the database stores data” or “the query is slow”—but inside a real database engine, there is a complex, carefully engineered system built for performance, durability, and correctness.

This article explains how real databases work internally, step by step, without marketing language or oversimplified myths.

1. A Database Is Not a Spreadsheet

A real database is not:

A giant table in memory
A collection of CSV files
A simple key-value map

A real database is a storage engine + execution engine + transaction system, tightly integrated.

At a high level, every serious database has:

Parser
Planner / Optimizer
Execution Engine
Storage Engine
Buffer Cache
Transaction Manager
Recovery System

Let’s walk through what actually happens.

2. What Happens When You Send a Query?

When you run:

SELECT name FROM users WHERE age > 30;

The database does not immediately scan a table.

Instead, it follows a strict pipeline.

Step 1: Parsing

The SQL text is converted into an Abstract Syntax Tree (AST).

The database checks:

Syntax correctness
Valid table and column names
User permissions

At this stage, it does zero execution.

Step 2: Query Planning & Optimization

This is where databases become serious engineering.

The query optimizer decides:

Which indexes to use
Join order
Scan method (index scan vs sequential scan)
Cost estimation based on statistics

Example decisions:

Is it cheaper to scan 1 million rows sequentially?
Or to use an index with random I/O?

Modern databases use:

Cost-based optimizers
Statistics (histograms, cardinality)
Rule-based rewrites

This step determines performance more than hardware.

3. The Execution Engine

After planning, the database produces an execution plan.

The execution engine:

Pulls rows through operators (scan → filter → project)
Uses iterators or vectorized execution
Streams results instead of loading everything into memory

Important concept:

Databases process data in pipelines, not all at once.

4. How Data Is Actually Stored on Disk

Databases do not store rows directly on disk.

Pages (Blocks)

Data is stored in fixed-size pages (commonly 4KB–16KB).

Each page contains:

Page header
Row slots
Metadata

Pages are the minimum unit of I/O.

Row Storage vs Column Storage

Row-oriented (PostgreSQL, MySQL):

Best for OLTP
Fast inserts and point queries

Column-oriented (ClickHouse, Redshift):

Best for analytics
Excellent compression
Vectorized scans

5. Indexes Are Not Magic

Indexes are separate data structures, usually:

B-Trees (most common)
Hash indexes
LSM Trees (RocksDB, Cassandra)

A B-Tree:

Keeps data sorted
Minimizes disk seeks
Balances read/write costs

Important truth:

Indexes speed up reads but slow down writes.

Every insert/update must update:

Table data
All related indexes

6. Memory Is a Cache, Not the Source of Truth

Databases never trust memory.

Buffer Pool (Cache)

Frequently used pages are cached in RAM
Dirty pages are written back later
Replacement strategies (LRU variants)

If power fails:

Memory is lost
Disk must still be consistent

Which brings us to the most critical system.

7. Transactions and ACID

Real databases guarantee ACID:

Atomicity
Consistency
Isolation
Durability

This is achieved through:

Write-Ahead Logging (WAL)

Before modifying data:

Changes are written to a log
Log is flushed to disk
Only then is memory updated

If the database crashes:

WAL is replayed
Data is recovered

Logs are more important than data files.

8. Concurrency Control (Why Locks Exist)

Databases support thousands of concurrent users.

They use:

Locks (row, page, table)
MVCC (Multi-Version Concurrency Control)

With MVCC:

Readers don’t block writers
Writers create new versions
Old versions cleaned by vacuum/GC

This is why:

PostgreSQL can read without locking
Oracle and PostgreSQL scale well under load

9. Recovery and Crash Safety

When a database restarts after a crash:

Read last checkpoint
Replay WAL records
Undo incomplete transactions
Restore consistency

This process is deterministic and repeatable.

No guessing. No heuristics.

10. Why This Knowledge Matters

Understanding internals helps you:

Write faster queries
Design better schemas
Choose correct indexes
Avoid dangerous assumptions
Debug performance issues

Most “slow database” problems are:

Bad query plans
Wrong indexes
Misunderstanding internals

Not hardware.

Final Thought

A real database is closer to an operating system than a library.

It manages:

Memory
Storage
Concurrency
Recovery
Scheduling

If you treat it like a black box, it will punish you.

If you understand it, it becomes one of the most powerful tools in software engineering.

DEV Community