Abhinav

Posted on Sep 2

A Technical Deep Dive: The Internals of Database Storage 🔬

#database #dbms #programming #learning

Databases power the modern world, but behind the scenes lies a world of ingenious design choices and optimizations. Beyond the basic tables and queries we interact with, databases rely on sophisticated algorithms and data structures to deliver speed, consistency, and reliability at scale. In this deep dive, we’ll peel back the layers and explore the science of database storage.

1. Storage Pages: The Physical Reality 💾

At the lowest level, databases manage data in pages, which are fixed-size I/O units (commonly 8KB). This choice aligns with how operating systems and disk drives interact with data.

Why Pages?

A disk is a physical device with spinning platters and a read/write head. To fetch data, the head must:

Move to the correct cylinder (seek time).
Wait for the right sector to spin into position (rotational latency).

These mechanical delays are orders of magnitude slower than CPU operations. To minimize them, databases read or write entire pages at once, rather than fetching individual rows.

Page Layout

A database page has structure:

Header: metadata (page type, free space, transaction IDs).
Data section: where rows are stored.
Slot array (in slotted pages): pointers to rows at the end of the page, enabling efficient row movement or deletion without shifting the entire block.

📊 Illustration: Page Layout

+---------------------+
| Page Header         |
+---------------------+
| Row Data (variable) |
| Row Data (variable) |
| Row Data (variable) |
+---------------------+
| Slot Array (pointers)|
+---------------------+

2. Indexes: The B-Tree Workhorse 🌳

Indexes are the backbone of query performance, and the most common structure is the B+ Tree.

Structure

Unlike binary trees, B-Trees have high fanout (many children per node). Each node corresponds to a disk page. This keeps the tree short and wide, reducing the number of I/O operations needed.

Internal nodes: contain keys and pointers (navigation).
Leaf nodes: contain actual row pointers and are linked in a sorted doubly linked list.

📊 Illustration: B+ Tree

           [50]
         /      \
   [20,30]      [60,80]
   /  |  \      /   |   \
[10] [25] [35] [55] [65] [90]

Leaf nodes are linked -> [10] <-> [25] <-> [35] <-> [55] <-> [65] <-> [90]

Why B+ Trees Are Powerful

Point Lookups: Traversing from root to leaf typically requires just 3–4 disk reads, even in massive tables.
Range Scans: Once the starting leaf is found, the database follows the linked list across leaf nodes for fast sequential access.

This flexibility is why B+ Trees are the default index type in most relational databases, supporting queries like =, <, >, BETWEEN, and prefix-based LIKE.

3. Write-Ahead Logging (WAL): The ACID Foundation ⚛️

Ensuring Atomicity and Durability of transactions requires careful design. That’s where Write-Ahead Logging (WAL) comes in.

How WAL Works

Before modifying a data page, the database writes a redo log record to the WAL.
Once the WAL record is safely on disk, the transaction is considered committed.

📊 Illustration: WAL Flow

Transaction -> WAL (log record) -> Commit Success
            -> Data Page (later written)

Crash Recovery

Redo Phase: Reapply all committed changes from the WAL since the last checkpoint.
Undo Phase: Roll back any incomplete transactions.

This decoupling of transaction commit from physical data writes allows high throughput without sacrificing safety.

4. Concurrency with MVCC: The Snapshot Approach 📸

Concurrency is a major challenge in databases. Traditional locks can lead to contention and deadlocks. Multi-Version Concurrency Control (MVCC) offers a smarter approach.

How MVCC Works

Each row stores xmin (creating transaction ID) and xmax (deleting/updating transaction ID).
A transaction sees only rows where:
- xmin < transaction_id
- xmax is unset or greater than the transaction’s ID.

📊 Illustration: Row Versions

Row A (xmin=100, xmax=200)
Row B (xmin=150, xmax=∞)

Transaction 180 sees Row A
Transaction 250 sees Row B

This means readers never block writers, and writers don’t block readers. Each transaction gets a snapshot of the database as of its start time.

Benefits of MVCC

Readers don’t wait for writers: Queries can run without being blocked by updates.
Writers don’t block readers: Updates and deletes proceed while readers still see a consistent snapshot.
High concurrency: Makes it easier to scale workloads with many parallel transactions.

Challenges of MVCC

Storage bloat: Multiple row versions can increase disk usage.
Vacuuming / Garbage Collection: Old row versions must eventually be cleaned up.
Complex transaction management: Determining visibility rules requires careful bookkeeping of transaction IDs.

Real-World Use

PostgreSQL: Implements MVCC with transaction IDs (xmin, xmax) and requires VACUUM to reclaim space.
Oracle: Uses UNDO segments to maintain old row versions.
MySQL InnoDB: Maintains undo logs and hidden system columns to support MVCC.

MVCC provides a balance between consistency and performance, making it the preferred model in modern relational databases.

5. Row vs. Column Storage: The OLTP vs. OLAP Trade-off 📊

The choice between row and column storage determines a database’s strengths.

Row-Oriented (OLTP)

Stores all values of a row contiguously.
Example: (CustomerID: 123, Name: John Smith, City: NYC)
Optimized for frequent inserts, updates, and whole-row reads.
Ideal for transactional systems like banking or e-commerce.

Column-Oriented (OLAP)

Stores all values of a column contiguously.
Example: (CustomerID: 123, 124, 125...), (City: NYC, LA, SF...)
Optimized for analytical queries scanning few columns across many rows.
Enables compression (similar values compress well).
Perfect for data warehouses and reporting systems.

📊 Illustration: Row vs Column

Row Store:
[123 | John Smith | NYC]
[124 | Jane Doe   | LA ]

Column Store:
CustomerID: [123,124]
Name:       [John, Jane]
City:       [NYC, LA]

Closing Thoughts ✨

What feels like “magic” when running a fast query or recovering from a crash is really the result of decades of careful engineering:

Pages minimize slow disk I/O.
B+ Trees balance speed and flexibility.
WAL ensures durability without blocking.
MVCC delivers smooth concurrency with snapshots.
Row vs. Column storage tailors systems to different workloads.

The next time we query a database, we’ll know the science humming beneath the surface — turning spinning platters and raw bytes into reliable, high-performance systems.

DEV Community