Bf-Trees: Breaking the Page Barrier

Hello, I'm Maneshwar. I'm currently building FreeDevTools online, **one place for all dev tools, cheat codes, and TLDRs**: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.

In the previous article, we saw how traditional B-Trees suffer because pages are treated as inseparable units, both on disk and in memory.

That coupling is the root cause behind write amplification, cache inefficiency, and performance collapse under modern workloads.

Today’s insight is simple but powerful:

Records shouldn’t be forced to travel in 4 KB blocks when each record is tiny.

The Real Problem: Pages Are Too Big

A database page is usually 4 KB or 8 KB.
But most records are tiny, tens or hundreds of bytes.

So every update or read pulls in a lot of irrelevant junk:

  • A 50-byte record update rewrites a 4 KB page
  • Caching 5 hot records also caches 200 cold ones
  • Range scans and point lookups compete for the same page-level cache
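To put a number on the first point: rewriting a 4 KB (4,096-byte) page to change a 50-byte record is roughly 4096 / 50 ≈ 80× write amplification for that single update.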

As a thought experiment:
If record size = page size, these problems disappear instantly.
But that’s not the world we live in.


The Core Idea Behind Bf-Trees

Bf-Trees break the assumption that cache pages must mirror disk pages.

Two independent structures

Disk page  !=  Cache page

Instead, memory holds mini-pages: variable-length chunks that store only what matters:

  • Individual hot records
  • A small key range
  • Buffered recent writes
  • A full page when needed for range scans

Mini-pages grow and shrink dynamically based on workload.
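As a rough mental model, you can picture a mini-page as a variable-length unit whose contents depend on the workload. The sketch below is my own illustration in Rust with hypothetical type names, not Bf-Tree's actual layout:

```rust
// Hypothetical sketch of a mini-page: a variable-length in-memory unit that
// holds only the hot part of a disk page. Type names are illustrative.
#[derive(Debug)]
enum MiniPage {
    // A handful of individually cached hot records as (key, value) pairs.
    HotRecords(Vec<(Vec<u8>, Vec<u8>)>),
    // Recent writes buffered before being merged into the disk page.
    WriteBuffer(Vec<(Vec<u8>, Vec<u8>)>),
    // The entire disk page, materialized for range scans.
    FullPage(Vec<u8>),
}

impl MiniPage {
    // Approximate memory footprint: mini-pages are sized by their content,
    // not by a fixed 4 KB block.
    fn size(&self) -> usize {
        match self {
            MiniPage::HotRecords(recs) | MiniPage::WriteBuffer(recs) => {
                recs.iter().map(|(k, v)| k.len() + v.len()).sum()
            }
            MiniPage::FullPage(bytes) => bytes.len(),
        }
    }
}

fn main() {
    // One hot record costs ~12 bytes of payload here, not a full 4 KB page.
    let mp = MiniPage::HotRecords(vec![(b"user:42".to_vec(), b"alice".to_vec())]);
    println!("mini-page holds {} bytes", mp.size());
}
```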

This changes everything

| Traditional B-Tree | Bf-Tree |
| --- | --- |
| Updates rewrite whole pages | Updates rewrite just the mini-page |
| Cache holds cold data | Cache holds only hot data |
| Fixed 4 KB units everywhere | Variable-sized, fine-grained blocks |
| Record/cache split is bolted on | Mini-pages unify cache + buffer + data |

Variable-Length Buffer Pool

To manage mini-pages efficiently, Bf-Tree uses a circular buffer in memory.

  • Fixed total size
  • Mini-pages allocated by advancing the tail
  • Free space tracked in a free list
  • Growth handled by copy-on-write (RCU)
  • When full → evict mini-pages near the head back to disk

This design enables concurrent allocation, resizing, and eviction without fragmentation hell.
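Here's a minimal sketch of the allocate-at-tail / evict-from-head behaviour, assuming mini-pages are tracked as (id, size) entries in arrival order. It leaves out the byte arena, free list, copy-on-write growth, and all concurrency, and the names are mine, not Bf-Tree's actual implementation:

```rust
use std::collections::VecDeque;

// Toy model of the variable-length buffer pool: fixed total capacity,
// new mini-pages appended at the tail, oldest mini-pages evicted from
// the head (written back to disk) when space runs out.
struct BufferPool {
    capacity: usize,
    used: usize,
    entries: VecDeque<(u64, usize)>, // (mini-page id, size in bytes)
}

impl BufferPool {
    fn new(capacity: usize) -> Self {
        Self { capacity, used: 0, entries: VecDeque::new() }
    }

    // Insert a mini-page of `size` bytes, evicting the oldest mini-pages
    // until it fits.
    fn insert(&mut self, id: u64, size: usize) {
        assert!(size <= self.capacity, "mini-page larger than the pool");
        while self.used + size > self.capacity {
            let (evicted, freed) = self.entries.pop_front().expect("pool not empty");
            self.used -= freed;
            println!("evict mini-page {evicted} ({freed} B) back to disk");
        }
        self.entries.push_back((id, size));
        self.used += size;
    }
}

fn main() {
    let mut pool = BufferPool::new(4096);
    pool.insert(1, 1500);
    pool.insert(2, 2000);
    pool.insert(3, 1200); // forces eviction of mini-page 1
    println!("resident mini-pages: {:?}", pool.entries);
}
```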

What This Enables

Point lookups

Fast, because mini-pages cache only what’s hot
→ No more sifting through cold records

Range scans

If needed, mini-pages expand to full pages
→ Preserve locality without permanent cost

Writes

Only the mini-page changes
→ No more rewrite-4-KB-for-50-bytes stupidity
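To make the read path concrete, here is a hedged sketch of the cache-first lookup this enables: check the mini-pages first, and on a miss cache only the record that was actually requested. `Store` and `read_record_from_disk_page` are illustrative stand-ins, not real Bf-Tree APIs.

```rust
use std::collections::HashMap;

// Toy store: hot records cached individually instead of as whole 4 KB pages.
struct Store {
    minipages: HashMap<Vec<u8>, Vec<u8>>,
}

impl Store {
    fn get(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        // 1. Hot path: the record is already in a mini-page.
        if let Some(v) = self.minipages.get(key) {
            return Some(v.clone());
        }
        // 2. Miss: read from the disk page, but cache only this record,
        //    not the cold neighbours that happen to share the page.
        let value = read_record_from_disk_page(key)?;
        self.minipages.insert(key.to_vec(), value.clone());
        Some(value)
    }
}

// Stand-in for the real disk-page read; assumed for the sketch.
fn read_record_from_disk_page(key: &[u8]) -> Option<Vec<u8>> {
    Some(format!("value-for-{}", String::from_utf8_lossy(key)).into_bytes())
}

fn main() {
    let mut store = Store { minipages: HashMap::new() };
    let v = store.get(b"user:42").map(|v| String::from_utf8_lossy(&v).into_owned());
    println!("{:?}", v);
}
```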

The Numbers

Evaluations show:

  • 6× faster writes vs B-Tree
  • 2× faster point lookups vs B-Trees & LSMs
  • 2.5× faster scans vs RocksDB (LSM-Tree)

In short:

Bf-Tree beats both B-Trees and LSM-Trees across these workloads.

This is rare. Historically, if you optimized for reads, writes suffered.
If you optimized for writes, scans suffered.
Bf-Tree breaks that trade-off.

Why This Matters

Modern systems push datasets well beyond RAM.
Indexes need to cross the memory/disk boundary without falling apart.

Bf-Tree shows that the future isn't the B-Tree vs. LSM-Tree debate.
It’s about rethinking pages.

Next Article

Coming up next, I’ll dive into three areas:

  • Understanding LSM-Tree internals and where they shine or break
  • How modern NVMe SSD behavior changes the cost model of writes
  • The architectural details behind Bf-Tree and mini-pages

This will set the foundation before exploring RCU growth/shrink, eviction strategy, and concurrency guarantees.

Stay tuned.

FreeDevTools

👉 Check out: FreeDevTools

Feedback and contributions are welcome!

It’s online, open-source, and ready for anyone to use.

⭐ Star it on GitHub: freedevtools
