This is a cross-post from my Medium article.
I wanted to learn about the internals of database indexes. The first step was understanding how Disk I/O works — so I got Claude/Gemini to curate a reading list, which led me to Database Pages — A Deep Dive by Hussein Nasser.
There were things I hadn't understood, so I wrote this simplified version for my own clarity. For complete understanding, do read the original post by Hussein Nasser.
Here it goes.
1. Database Layer
You run:
SELECT NAME FROM STUDENTS WHERE ID = 1008
- DB parses the query → looks up `STUDENTS` in `pg_class` (an internal catalog, also stored on disk) → finds OID (Object Identifier) `24601`
- DB knows the file lives at `PGDATA/base/<db_oid>/24601` on the filesystem
- DB asks the OS to open that file — the OS hands back a temporary integer called a file descriptor (`fd`), say `fd = 7`. This is a short-lived handle, valid only for the session. The `fd` is never stored on disk.
No index on ID, so DB scans pages one by one. For each page it:
- Checks its buffer pool first — if the page is already in memory, no disk read needed
- If not found, issues a `read()` to the OS for that page
read(fd, 0, 8192) → page 0: bytes 0–8191
read(fd, 8192, 8192) → page 1: bytes 8192–16383
The OS → SSD journey below happens once per page. We trace it for page 0.
Note: The exact syscall used by databases may differ — Postgres uses `pread()`, which takes an explicit offset. The intent here is to show what information is passed, not the exact function signature.
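The page-number → byte-offset arithmetic in those calls, plus the buffer-pool check that precedes them, can be sketched like this (a simplified model, not actual Postgres internals — `read_page`, `fake_disk`, and the dict-based pool are made up for illustration):

```python
PAGE_SIZE = 8192  # Postgres default page size

buffer_pool: dict[int, bytes] = {}  # page_no -> cached page bytes

def read_page(disk_read, page_no: int) -> bytes:
    """Check the buffer pool first; fall back to a pread()-style call."""
    if page_no in buffer_pool:           # hit: no disk I/O needed
        return buffer_pool[page_no]
    offset = page_no * PAGE_SIZE         # page 0 -> 0, page 1 -> 8192, ...
    data = disk_read(offset, PAGE_SIZE)  # stand-in for pread(fd, buf, 8192, offset)
    buffer_pool[page_no] = data
    return data

# Fake "disk" for demonstration: returns zero-filled pages.
fake_disk = lambda offset, length: bytes(length)
read_page(fake_disk, 0)  # miss: reads bytes 0-8191
read_page(fake_disk, 0)  # hit: served from the buffer pool
```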
2. File System / OS Layer
- OS looks up the inode of file `24601` → finds block mapping
inode (index node): a data structure the Linux filesystem maintains for every file on disk.
bytes 0–4095 → LBA 100
bytes 4096–8191 → LBA 101
- OS checks its page cache → blocks not found
- OS sends a read command to the NVMe driver with LBA 100 and 101
NVMe (Non-Volatile Memory Express): a communication protocol designed specifically for SSDs.
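The inode's byte-range → LBA translation can be modeled as a lookup over the block map above (a toy sketch — real filesystems use block pointers or extent trees, and `lbas_for_range` is a name invented here):

```python
BLOCK_SIZE = 4096  # filesystem block size

# Toy inode block map: file block index -> LBA (numbers from the example above)
block_map = {0: 100, 1: 101}  # bytes 0-4095 -> LBA 100, bytes 4096-8191 -> LBA 101

def lbas_for_range(offset: int, length: int) -> list[int]:
    """Translate a byte range within the file into the LBAs to read."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [block_map[b] for b in range(first, last + 1)]

print(lbas_for_range(0, 8192))  # [100, 101] -> what the OS asks the NVMe driver for
```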
3. LBA — The Bridge Between OS and SSD
LBA (Logical Block Address) is a sequential numbering system for blocks on a storage device.
The OS doesn't know or care about physical locations on the SSD — it just says:
"Give me LBA 100 and 101."
The NVMe controller receives this and translates internally:
LBA 100 → Physical page 99, offset 0x0000
LBA 101 → Physical page 99, offset 0x1000
This translation is managed by the SSD's Flash Translation Layer (FTL).
The reason this layer exists: the SSD can move data around internally (for wear leveling, bad block management, etc.) without the OS ever knowing.
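A minimal model of the FTL is just a lookup table that only the controller can rewrite (all numbers here are illustrative):

```python
# Toy FTL mapping: LBA -> (physical NAND page, byte offset within it)
ftl = {100: (99, 0x0000), 101: (99, 0x1000)}

def translate(lba: int) -> tuple[int, int]:
    """The controller's internal translation; the OS never sees this."""
    return ftl[lba]

# Wear leveling: the SSD relocates the data and updates only its own table.
# The OS keeps asking for LBA 100 and never notices the move.
ftl[100] = (512, 0x2000)
```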
4. SSD Layer
- NVMe controller checks its DRAM cache — page 99 not found
- Fetches the entire NAND page 99 (16KB) into DRAM cache
- Extracts just the requested 8KB (LBA 100 + 101) and returns it to the OS
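The fetch-then-extract step can be sketched as follows (sizes taken from the example; the DRAM cache is modeled as a dict, and `read_nand_page` is a stand-in for the real flash read):

```python
NAND_PAGE_SIZE = 16 * 1024  # 16KB physical NAND page
LBA_SIZE = 4096             # 4KB logical block

dram_cache: dict[int, bytes] = {}  # NAND page number -> page contents

def read_nand_page(page_no: int) -> bytes:
    """NAND can only be read a whole page at a time."""
    return bytes(NAND_PAGE_SIZE)  # stand-in for the actual flash read

def read_lbas(physical_locations: list[tuple[int, int]]) -> bytes:
    """physical_locations: (nand_page, offset) pairs produced by the FTL."""
    out = b""
    for page_no, offset in physical_locations:
        if page_no not in dram_cache:                        # cache miss
            dram_cache[page_no] = read_nand_page(page_no)    # fetch full 16KB page
        out += dram_cache[page_no][offset:offset + LBA_SIZE]  # extract one 4KB block
    return out

data = read_lbas([(99, 0x0000), (99, 0x1000)])
print(len(data))  # 8192 -> the 8KB handed back to the OS
```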
5. Back Up the Stack
SSD returns 8KB
↓
OS stores blocks 100, 101 in PAGE CACHE (RAM)
↓
OS returns 8KB to DB
↓
DB stores page 0 in BUFFER POOL (RAM)
↓
DB scans page 0 — rows 1–1000, row 1008 not found
↓
entire journey repeats for page 1
↓
DB stores page 1 in BUFFER POOL (RAM)
↓
DB scans page 1 — finds row 1008, returns to user ✓
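The scan step at the top of the stack can be sketched too — a sequential scan checks every row in a page because there is no index to jump by (row layout here is invented for illustration):

```python
def scan_page(rows: list[dict], target_id: int):
    """Sequential scan within one page: check every row for the ID."""
    for row in rows:
        if row["id"] == target_id:
            return row["name"]
    return None  # not in this page -> the whole I/O journey repeats

page0 = [{"id": i, "name": f"student-{i}"} for i in range(1, 1001)]     # rows 1-1000
page1 = [{"id": i, "name": f"student-{i}"} for i in range(1001, 2001)]  # rows 1001-2000

scan_page(page0, 1008)  # None: page 0 doesn't have it
scan_page(page1, 1008)  # 'student-1008': found, returned to the user
```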
Layered Abstraction Summary
Each layer only knows its own abstraction and talks to the layer directly below it.
| Layer | Abstraction it uses |
|---|---|
| Database | File + offset (pages) |
| OS | Inodes + LBAs |
| NVMe Controller | LBA → physical page (via FTL) |
| NAND Flash | Physical pages and cells |
LBA is the common language between the OS and the SSD — the key handoff point where the OS's logical world meets the SSD's physical world. And the FTL is what keeps the physical complexity invisible to everyone above it.