Farhad Rahimi Klie

How Real Databases Work Internally?

Most developers use databases every day.
Few actually understand what happens under the hood.

We say things like “the database stores data” or “the query is slow”—but inside a real database engine, there is a complex, carefully engineered system built for performance, durability, and correctness.

This article explains how real databases work internally, step by step, without marketing language or oversimplified myths.


1. A Database Is Not a Spreadsheet

A real database is not:

  • A giant table in memory
  • A collection of CSV files
  • A simple key-value map

A real database is a storage engine + execution engine + transaction system, tightly integrated.

At a high level, every serious database has:

  • Parser
  • Planner / Optimizer
  • Execution Engine
  • Storage Engine
  • Buffer Cache
  • Transaction Manager
  • Recovery System

Let’s walk through what actually happens.


2. What Happens When You Send a Query?

When you run:

SELECT name FROM users WHERE age > 30;

The database does not immediately scan a table.

Instead, it follows a strict pipeline.

Step 1: Parsing & Validation

The SQL text is converted into an Abstract Syntax Tree (AST), which is then validated against the system catalog.

The database checks:

  • Syntax correctness
  • Valid table and column names
  • User permissions

At this stage, it does zero execution.
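
As a quick illustration (PostgreSQL shown; the users table is just the running example), a query that fails parsing or validation is rejected before a single row is read:

-- Misspelled keyword: the parser rejects the statement outright
SELECT * FORM users WHERE age > 30;
-- ERROR:  syntax error at or near "FORM"

-- Valid syntax, but an unknown column: rejected during validation
SELECT nmae FROM users WHERE age > 30;
-- ERROR:  column "nmae" does not exist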


Step 2: Query Planning & Optimization

This is where the serious engineering happens.

The query optimizer decides:

  • Which indexes to use
  • Join order
  • Scan method (index scan vs sequential scan)
  • Cost estimation based on statistics

Example decisions:

  • Is it cheaper to scan 1 million rows sequentially?
  • Or to use an index with random I/O?

Modern databases use:

  • Cost-based optimizers
  • Statistics (histograms, cardinality)
  • Rule-based rewrites

This step determines performance more than hardware.
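
You can watch the optimizer make these decisions with EXPLAIN (PostgreSQL shown; the plan and numbers are illustrative, assuming a users table with no useful index):

EXPLAIN SELECT name FROM users WHERE age > 30;

--  Seq Scan on users  (cost=0.00..1943.00 rows=33210 width=10)
--    Filter: (age > 30)
--
-- The cost and row estimates come from statistics gathered by ANALYZE,
-- not from actually running the query.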


3. The Execution Engine

After planning, the database produces an execution plan.

The execution engine:

  • Pulls rows through operators (scan → filter → project)
  • Uses iterators or vectorized execution
  • Streams results instead of loading everything into memory

Important concept:

Databases process data in pipelines, not all at once.
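
One way to observe this streaming behavior from the client side is a cursor, which pulls rows through the operator pipeline in batches instead of materializing the full result (standard SQL, PostgreSQL syntax shown):

BEGIN;

-- The query is planned, but no result set is built up front
DECLARE user_cursor CURSOR FOR
    SELECT name FROM users WHERE age > 30;

-- Each FETCH pulls the next batch through scan → filter → project
FETCH 100 FROM user_cursor;
FETCH 100 FROM user_cursor;

CLOSE user_cursor;
COMMIT;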


4. How Data Is Actually Stored on Disk

Databases do not write individual rows straight to disk.

Pages (Blocks)

Data is stored in fixed-size pages (commonly 4KB–16KB).

Each page contains:

  • Page header
  • Row slots
  • Metadata

Pages are the minimum unit of I/O.
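
The exact size depends on the engine (PostgreSQL defaults to 8KB, InnoDB to 16KB). In PostgreSQL you can check it directly:

SHOW block_size;
--  block_size
-- ------------
--  8192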


Row Storage vs Column Storage

Row-oriented (PostgreSQL, MySQL):

  • Best for OLTP
  • Fast inserts and point queries

Column-oriented (ClickHouse, Redshift):

  • Best for analytics
  • Excellent compression
  • Vectorized scans
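
A rough illustration using a hypothetical orders table (PostgreSQL syntax); row stores favor the first query, column stores the second:

-- OLTP-style point lookup: one row, all of its columns.
-- A row store reads this from a single page.
SELECT * FROM orders WHERE order_id = 42;

-- OLAP-style aggregate: millions of rows, but only two columns.
-- A column store reads just order_date and amount, well compressed,
-- and scans them with vectorized code.
SELECT date_trunc('month', order_date), sum(amount)
FROM orders
GROUP BY 1;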

5. Indexes Are Not Magic

Indexes are separate data structures, usually:

  • B-Trees (most common)
  • Hash indexes
  • LSM Trees (RocksDB, Cassandra)

A B-Tree:

  • Keeps data sorted
  • Minimizes disk seeks
  • Balances read/write costs

Important truth:

Indexes speed up reads but slow down writes.

Every insert/update must update:

  • Table data
  • All related indexes
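
A minimal sketch of this trade-off, reusing the users table from earlier (PostgreSQL creates B-Tree indexes by default):

-- A B-Tree index on age
CREATE INDEX idx_users_age ON users (age);

-- Selective predicates can now use it (depending on statistics)
EXPLAIN SELECT name FROM users WHERE age > 95;
--  Index Scan using idx_users_age on users ...

-- ...but every write now maintains the table AND idx_users_age
INSERT INTO users (name, age) VALUES ('Ada', 36);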

6. Memory Is a Cache, Not the Source of Truth

Databases never trust memory to survive a failure.

Buffer Pool (Cache)

  • Frequently used pages are cached in RAM
  • Dirty pages are written back later
  • Replacement strategies (LRU variants)
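
In PostgreSQL you can watch the buffer pool at work with the BUFFERS option of EXPLAIN ANALYZE (numbers illustrative):

EXPLAIN (ANALYZE, BUFFERS) SELECT name FROM users WHERE age > 30;

--  Seq Scan on users (actual time=0.020..41.7 rows=33210 loops=1)
--    Filter: (age > 30)
--    Buffers: shared hit=520 read=653
--
-- shared hit  = pages served from the buffer pool in RAM
-- shared read = pages that had to come from disk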

If power fails:

  • Memory is lost
  • Disk must still be consistent

Which brings us to the most critical system.


7. Transactions and ACID

Real databases guarantee ACID:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

This is achieved through:

Write-Ahead Logging (WAL)

The rule: the log reaches disk before the data it describes.

  1. Every change is first recorded as a WAL entry
  2. The WAL entry is flushed to disk before the transaction is acknowledged as committed
  3. The modified data pages are written back to disk later, in the background
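
From SQL, all of this hides behind COMMIT. In PostgreSQL with the default synchronous_commit = on, COMMIT does not return until the transaction's WAL records are flushed to disk (the accounts table here is hypothetical):

BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- COMMIT returns only after the WAL records for both updates are durable.
-- The modified data pages may be written back much later, in the background.
COMMIT;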

If the database crashes:

  • WAL is replayed
  • Data is recovered

In this sense, the log is more authoritative than the data files: data files can be repaired from the log, but not the other way around.


8. Concurrency Control (Why Locks Exist)

Databases support thousands of concurrent users.

They use:

  • Locks (row, page, table)
  • MVCC (Multi-Version Concurrency Control)

With MVCC:

  • Readers don’t block writers
  • Writers create new versions
  • Old versions cleaned by vacuum/GC

This is why:

  • In PostgreSQL, plain SELECTs don't take locks that block writers
  • MVCC engines such as Oracle and PostgreSQL stay responsive when reads and writes mix
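
A minimal two-session sketch (PostgreSQL, default READ COMMITTED isolation, same hypothetical accounts table as above):

-- Session A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- not yet committed: this creates a new row version

-- Session B, at the same time
SELECT balance FROM accounts WHERE id = 1;
-- returns the old value immediately; it is not blocked by session A

-- Session A
COMMIT;
-- new reads in session B now see the new version;
-- the old version is eventually reclaimed by VACUUM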

9. Recovery and Crash Safety

When a database restarts after a crash:

  1. Read last checkpoint
  2. Replay WAL records
  3. Undo incomplete transactions
  4. Restore consistency

This process is deterministic and repeatable.

No guessing. No heuristics.
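
Checkpoints are what keep step 2 bounded: only WAL written since the last checkpoint has to be replayed. In PostgreSQL this is visible and tunable (value shown is the usual default):

-- How often the server checkpoints on its own
SHOW checkpoint_timeout;   -- 5min

-- Force a checkpoint now (a privileged command); handy before a planned
-- restart, since it shortens any recovery that follows
CHECKPOINT;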


10. Why This Knowledge Matters

Understanding internals helps you:

  • Write faster queries
  • Design better schemas
  • Choose correct indexes
  • Avoid dangerous assumptions
  • Debug performance issues

Most “slow database” problems are:

  • Bad query plans
  • Wrong indexes
  • Misunderstanding internals

Not hardware.


Final Thought

A real database is closer to an operating system than a library.

It manages:

  • Memory
  • Storage
  • Concurrency
  • Recovery
  • Scheduling

If you treat it like a black box, it will punish you.

If you understand it, it becomes one of the most powerful tools in software engineering.
