This is a cross-post from my Medium article.
I wanted to learn about the internals of database indexes. The first step was understanding how Disk I/O works — so I got Claude/Gemini to curate a reading list, which led me to Database Pages — A Deep Dive by Hussein Nasser.
There were things I hadn't understood, so I wrote this simplified version for my own clarity. For complete understanding, do read the original post by Hussein Nasser.
Here it goes.
1. Database Layer
You run:
SELECT NAME FROM STUDENTS WHERE ID = 1008
- DB parses the query → looks up `STUDENTS` in `pg_class` (an internal catalog, also stored on disk) → finds OID (Object Identifier) `24601`
- DB knows the file lives at `PGDATA/base/<db_oid>/24601` on the filesystem
- DB asks the OS to open that file — the OS hands back a temporary integer called a file descriptor (`fd`), say `fd = 7`. This is a short-lived handle, valid only for the session. The `fd` is never stored on disk.
No index on ID, so DB scans pages one by one. For each page it:
- Checks its buffer pool first — if the page is already in memory, no disk read needed
- If not found, issues a `read()` to the OS for that page
read(fd, 0, 8192) → page 0: bytes 0–8191
read(fd, 8192, 8192) → page 1: bytes 8192–16383
The OS → SSD journey below happens once per page. We trace it for page 0.
Note: The exact syscall used by databases may differ — Postgres uses `pread()`, which takes an explicit offset. The intent here is to show what information is passed, not the exact function signature.
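The page-number → byte-offset arithmetic in those calls, plus the buffer-pool check that precedes them, can be sketched like this (a simplified model, not actual Postgres internals — `read_page`, `fake_disk`, and the dict-based pool are made up for illustration):

```python
PAGE_SIZE = 8192  # Postgres default page size

buffer_pool: dict[int, bytes] = {}  # page_no -> cached page bytes

def read_page(disk_read, page_no: int) -> bytes:
    """Check the buffer pool first; fall back to a pread()-style call."""
    if page_no in buffer_pool:           # hit: no disk I/O needed
        return buffer_pool[page_no]
    offset = page_no * PAGE_SIZE         # page 0 -> 0, page 1 -> 8192, ...
    data = disk_read(offset, PAGE_SIZE)  # stand-in for pread(fd, buf, 8192, offset)
    buffer_pool[page_no] = data
    return data

# Fake "disk" for demonstration: returns zero-filled pages.
fake_disk = lambda offset, length: bytes(length)
read_page(fake_disk, 0)  # miss: reads bytes 0-8191
read_page(fake_disk, 0)  # hit: served from the buffer pool
```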
2. File System / OS Layer
- OS looks up the inode of file `24601` → finds block mapping
inode (index node): a data structure the Linux filesystem maintains for every file on disk.
bytes 0–4095 → LBA 100
bytes 4096–8191 → LBA 101
- OS checks its page cache → blocks not found
- OS sends a read command to the NVMe driver with LBA 100 and 101
NVMe (Non-Volatile Memory Express): a communication protocol designed specifically for SSDs.
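The inode's byte-range → LBA translation can be modeled as a lookup over the block map above (a toy sketch — real filesystems use block pointers or extent trees, and `lbas_for_range` is a name invented here):

```python
BLOCK_SIZE = 4096  # filesystem block size

# Toy inode block map: file block index -> LBA (numbers from the example above)
block_map = {0: 100, 1: 101}  # bytes 0-4095 -> LBA 100, bytes 4096-8191 -> LBA 101

def lbas_for_range(offset: int, length: int) -> list[int]:
    """Translate a byte range within the file into the LBAs to read."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [block_map[b] for b in range(first, last + 1)]

print(lbas_for_range(0, 8192))  # [100, 101] -> what the OS asks the NVMe driver for
```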
3. LBA — The Bridge Between OS and SSD
LBA (Logical Block Address) is a sequential numbering system for blocks on a storage device.
The OS doesn't know or care about physical locations on the SSD — it just says:
"Give me LBA 100 and 101."
The NVMe controller receives this and translates internally:
LBA 100 → Physical page 99, offset 0x0000
LBA 101 → Physical page 99, offset 0x1000
This translation is managed by the SSD's Flash Translation Layer (FTL).
The reason this layer exists: the SSD can move data around internally (for wear leveling, bad block management, etc.) without the OS ever knowing.
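A minimal model of the FTL is just a lookup table that only the controller can rewrite (all numbers here are illustrative):

```python
# Toy FTL mapping: LBA -> (physical NAND page, byte offset within it)
ftl = {100: (99, 0x0000), 101: (99, 0x1000)}

def translate(lba: int) -> tuple[int, int]:
    """The controller's internal translation; the OS never sees this."""
    return ftl[lba]

# Wear leveling: the SSD relocates the data and updates only its own table.
# The OS keeps asking for LBA 100 and never notices the move.
ftl[100] = (512, 0x2000)
```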
4. SSD Layer
- NVMe controller checks its DRAM cache — page 99 not found
- Fetches the entire NAND page 99 (16KB) into DRAM cache
- Extracts just the requested 8KB (LBA 100 + 101) and returns it to the OS
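The fetch-then-extract step can be sketched as follows (sizes taken from the example; the DRAM cache is modeled as a dict, and `read_nand_page` is a stand-in for the real flash read):

```python
NAND_PAGE_SIZE = 16 * 1024  # 16KB physical NAND page
LBA_SIZE = 4096             # 4KB logical block

dram_cache: dict[int, bytes] = {}  # NAND page number -> page contents

def read_nand_page(page_no: int) -> bytes:
    """NAND can only be read a whole page at a time."""
    return bytes(NAND_PAGE_SIZE)  # stand-in for the actual flash read

def read_lbas(physical_locations: list[tuple[int, int]]) -> bytes:
    """physical_locations: (nand_page, offset) pairs produced by the FTL."""
    out = b""
    for page_no, offset in physical_locations:
        if page_no not in dram_cache:                        # cache miss
            dram_cache[page_no] = read_nand_page(page_no)    # fetch full 16KB page
        out += dram_cache[page_no][offset:offset + LBA_SIZE]  # extract one 4KB block
    return out

data = read_lbas([(99, 0x0000), (99, 0x1000)])
print(len(data))  # 8192 -> the 8KB handed back to the OS
```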
5. Back Up the Stack
SSD returns 8KB
↓
OS stores blocks 100, 101 in PAGE CACHE (RAM)
↓
OS returns 8KB to DB
↓
DB stores page 0 in BUFFER POOL (RAM)
↓
DB scans page 0 — rows 1–1000, row 1008 not found
↓
entire journey repeats for page 1
↓
DB stores page 1 in BUFFER POOL (RAM)
↓
DB scans page 1 — finds row 1008, returns to user ✓
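The scan step at the top of the stack can be sketched too — a sequential scan checks every row in a page because there is no index to jump by (row layout here is invented for illustration):

```python
def scan_page(rows: list[dict], target_id: int):
    """Sequential scan within one page: check every row for the ID."""
    for row in rows:
        if row["id"] == target_id:
            return row["name"]
    return None  # not in this page -> the whole I/O journey repeats

page0 = [{"id": i, "name": f"student-{i}"} for i in range(1, 1001)]     # rows 1-1000
page1 = [{"id": i, "name": f"student-{i}"} for i in range(1001, 2001)]  # rows 1001-2000

scan_page(page0, 1008)  # None: page 0 doesn't have it
scan_page(page1, 1008)  # 'student-1008': found, returned to the user
```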
Layered Abstraction Summary
Each layer only knows its own abstraction and talks to the layer directly below it.
| Layer | Abstraction it uses |
|---|---|
| Database | File + offset (pages) |
| OS | Inodes + LBAs |
| NVMe Controller | LBA → physical page (via FTL) |
| NAND Flash | Physical pages and cells |
LBA is the common language between the OS and the SSD — the key handoff point where the OS's logical world meets the SSD's physical world. And the FTL is what keeps the physical complexity invisible to everyone above it.