Athreya aka Maneshwar

B-Trees, Pages, and the Hidden Problems Behind Traditional Indexing

Hello, I'm Maneshwar. I'm currently building FreeDevTools online, **one place for all dev tools, cheat codes, and TLDRs**: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.

When a database needs to search, insert, or update records quickly, it relies on index structures.

The B-Tree has been the standard index used by storage engines such as InnoDB (MySQL), SQLite, PostgreSQL (for many workloads), and countless embedded systems.

How B-Trees Work

A B-Tree organizes records in sorted order and splits data into pages, which are fixed-size blocks of storage.

A page is the smallest unit of reading and writing between memory and disk.

Typical page sizes: 4 KB to 16 KB (for example, 4 KB in SQLite, 8 KB in PostgreSQL, and 16 KB in InnoDB), aligned with SSD and filesystem block sizes

When you query a B-Tree:

  • The database reads one page from disk at a time.
  • Each page contains many records.
  • Following pointers from page to page narrows the search until the exact record is found.

This design minimizes disk IO, which historically has been the slowest part of database access.
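To make the page-by-page descent concrete, here is a minimal Python sketch of a B-Tree lookup. The `Page` class, the in-memory `PAGES` dictionary, and `read_page` are illustrative stand-ins for real disk pages, not any engine's actual structures.

```python
import bisect

# Illustrative sketch, not a real storage engine: pages hold sorted keys,
# and a lookup reads one whole page per level of the tree.

PAGE_SIZE = 4096  # bytes; every read_page() call stands in for one 4 KB disk read

class Page:
    def __init__(self, is_leaf, keys, payload):
        self.is_leaf = is_leaf
        self.keys = keys        # sorted separator keys inside this page
        self.payload = payload  # child page ids (internal) or {key: record} (leaf)

def read_page(page_id):
    """Stand-in for a disk read: the engine always fetches the whole page."""
    return PAGES[page_id]

def btree_search(root_id, key):
    page = read_page(root_id)                      # one page read per tree level
    while not page.is_leaf:
        idx = bisect.bisect_right(page.keys, key)  # binary search within the page
        page = read_page(page.payload[idx])        # follow the pointer: another page read
    return page.payload.get(key)                   # final lookup inside the leaf page

# Tiny example: a root page pointing at two leaf pages.
PAGES = {
    "root":  Page(False, [100], ["leaf0", "leaf1"]),
    "leaf0": Page(True,  [],    {10: "alice", 50: "bob"}),
    "leaf1": Page(True,  [],    {100: "carol", 150: "dave"}),
}

print(btree_search("root", 50))   # -> "bob", found after reading 2 pages
```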

Why Pages Matter

Even if only a small portion of data is accessed, the database must fetch or rewrite the entire page.

Example:

  • Page size: 4 KB
  • One record change: 50 bytes
  • Disk must rewrite: 4 KB

This leads to a performance issue known as write amplification.
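A quick back-of-the-envelope calculation for the example above, assuming the entire 4 KB page is rewritten for a 50-byte change:

```python
# Rough write amplification for the example above (assumed sizes).
page_size = 4 * 1024        # 4 KB page rewritten on disk
record_change = 50          # bytes actually modified

write_amplification = page_size / record_change
print(f"{write_amplification:.0f}x")   # ~82x: for each useful byte, ~82 bytes hit the disk
```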

Problem #1 — High Write Amplification

Because B-Trees update data in place, modifying a small value forces the system to rewrite the full page.

Effects:

  • Disk bandwidth is wasted
  • SSD wear increases
  • Throughput drops under heavy write loads

This becomes painful in workloads like:

  • Logging
  • High-frequency inserts and updates
  • Secondary indexes with very small keys and values

Problem #2 — Inefficient Caching

B-Trees cache data in memory at page granularity.

If only 5 records in a page are frequently accessed (hot), the system still caches the entire page:

  • Memory is wasted holding cold data
  • Cache hit rate suffers
  • Hot records compete with irrelevant ones

For large databases where RAM is the real performance bottleneck, page-level caching becomes inefficient.
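A rough illustration of the waste, using assumed record and page sizes rather than measured numbers:

```python
# Illustrative only: how much of a cached page holds cold data.
page_size = 4 * 1024      # bytes cached per page
record_size = 100         # bytes per record (assumed)
hot_records_per_page = 5  # records actually accessed

useful_bytes = hot_records_per_page * record_size   # 500 B we care about
wasted_bytes = page_size - useful_bytes             # ~3.6 KB of cold data
print(f"{wasted_bytes / page_size:.0%} of cached memory holds cold records")  # ~88%
```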

Problem #3 — Secondary Index Pain

In a secondary index, keys and values are tiny.
But because the B-Tree still works page-by-page:

  • Writes rewrite many more bytes than needed
  • Cache fills with mostly unused data
  • Reads and writes both struggle as scale grows

Existing Attempts to Fix B-Trees

Many systems tried to reduce write amplification:

LSM Trees (used in RocksDB, LevelDB, Cassandra)

They avoid in-place updates by appending to logs.

  • Pros: much better write throughput
  • Cons: slower reads, expensive compaction, degraded scan performance
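As a rough sketch of the append-style idea (not RocksDB's or LevelDB's actual API), here is a toy LSM with an in-memory memtable and immutable sorted runs. Writes never rewrite a page in place, but reads may have to probe several runs, which hints at why compaction and scans get expensive:

```python
# Illustrative toy LSM, not a real library API.
class TinyLSM:
    def __init__(self):
        self.memtable = {}     # recent writes, held in memory
        self.runs = []         # flushed, sorted, immutable runs on "disk"

    def put(self, key, value):
        self.memtable[key] = value          # append-style: no page rewrite
        if len(self.memtable) >= 4:         # tiny flush threshold for the demo
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs):     # reads may probe several runs,
            for k, v in run:                # which is why reads and compaction cost more
                if k == key:
                    return v
        return None
```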

Bw-Tree & Bε-Tree

Use delta records instead of rewriting whole pages.

  • Pros: fewer full page rewrites
  • Cons: reads follow long chains; scans become slow
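A toy sketch of the delta idea (illustrative only, not the actual Bw-Tree algorithm): an update prepends a small delta record instead of rewriting the page, and a read has to walk the chain back to the base page:

```python
# Illustrative delta-record chain, not a real Bw-Tree implementation.
class BasePage:
    def __init__(self, records):
        self.records = dict(records)

class Delta:
    def __init__(self, key, value, next_node):
        self.key, self.value, self.next = key, value, next_node

def update(node, key, value):
    # Cheap write: just prepend a new delta, no page rewrite.
    return Delta(key, value, node)

def lookup(node, key):
    # Reads follow the delta chain; long chains slow down reads and scans.
    while isinstance(node, Delta):
        if node.key == key:
            return node.value
        node = node.next
    return node.records.get(key)

page = BasePage({1: "a", 2: "b"})
page = update(page, 2, "b2")
page = update(page, 3, "c")
print(lookup(page, 2))   # "b2", found in the delta chain
print(lookup(page, 1))   # "a", found in the base page
```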

Record caching systems (Anti-Caching, Siberia, Tree-line, TwoTree)

Place hot records in a separate in-memory structure.

  • Pros: improves point lookups
  • Cons: poor scan performance, complicated tuning, fragmented design
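And a minimal sketch of record-level caching under assumed sizes: individual hot records sit in an LRU cache, and a miss falls back to the on-disk B-Tree via a loader you supply. Point lookups improve, but range scans still have to read pages from disk in key order:

```python
# Illustrative hot-record cache, not any specific system's design.
from collections import OrderedDict

class HotRecordCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.records = OrderedDict()        # key -> record bytes, kept in LRU order

    def get(self, key, load_from_disk):
        if key in self.records:
            self.records.move_to_end(key)   # hit: no page fetched at all
            return self.records[key]
        record = load_from_disk(key)        # miss: fall back to the B-Tree on disk
        self._insert(key, record)
        return record

    def _insert(self, key, record):
        self.used += len(record)
        self.records[key] = record
        while self.used > self.capacity:    # evict the coldest records, not whole pages
            _, evicted = self.records.popitem(last=False)
            self.used -= len(evicted)
```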

Every improvement solves one problem but hurts another.

What’s Next

A new approach called Bf-Tree proposes separating in-memory cache pages from on-disk pages and using smaller, variable-size “mini-pages”, so it can avoid rewriting full pages and avoid caching irrelevant data.

Early results show significant performance gains in both reads and writes.

But that’s a deep topic.

Stay tuned

In the next article, we’ll dive into:

  • How Bf-Trees work internally
  • What mini-pages are
  • Why they provide better caching & lower write amplification
  • Performance comparisons against B-Trees and LSM Trees

FreeDevTools

👉 Check out: FreeDevTools

Feedback and contributions are welcome!

It’s online, open-source, and ready for anyone to use.

⭐ Star it on GitHub: freedevtools
