DEV Community

Chameera De Silva


Inside Databases: BSTs, B-Trees, & LSM Trees

Database Internals: BSTs, B-Trees, and LSM Trees

TL;DR: After gaining experience in data engineering and software development, I chose to expand my knowledge of how data management works behind the scenes in databases and data warehouses. My advice: get comfortable with data structures and algorithms (DSA) and with hardware specifics, because discussions of database internals are often sprinkled with hardware terminology.

In late 2020, when I started my data engineering/software development journey as an intern in the database space, I felt like a fish out of water. I wasn't a computer science major or a math specialist; all I had was a stubborn curiosity and a job that demanded I figure it out. Terms like "BST," "B-tree," and "LSM tree" hit me like a wall of jargon, leaving me wondering if I'd ever crack the code. But here's the thing: once I got my hands dirty with these concepts, they went from intimidating to downright fascinating. First I solved some LeetCode problems in my go-to language, Python, which gave me a little confidence, but deep down I wanted to apply these ideas to real-world use cases in the work I was doing. After all, these structures are the backbone of everything from banks' backend databases to the day-to-day queue services we use in decoupled systems.

In this article, I'll walk through three data structures at the heart of database design: Binary Search Trees (BSTs), B-Trees, and Log-Structured Merge Trees (LSM Trees). I'll explain each one with relatable analogies (bookshelves and mailrooms), simple Python snippets, and real-world tie-ins to systems like PostgreSQL and Cassandra. Whether you're a seasoned engineer or just leveling up, this is your backstage pass to how databases really work.


Binary Search Trees (BSTs): The Wobbly Bookshelf

Diagram: a Binary Search Tree as a wobbly bookshelf

Remember that moment when you first understood how a perfectly organized library could make finding any book a breeze? That intuitive A-to-Z system, where you can quickly narrow down your search? That's the core idea behind a Binary Search Tree (BST). Think of it as a digital bookshelf where each new book (or piece of data) gets placed precisely: titles that come earlier in the alphabet go to the left, and those that come later go to the right. It feels elegant, almost magical. In the best-case scenario, finding the book you need takes logarithmic time, O(log n), a remarkably efficient way to search.

Let's see a quick peek at how this looks in Python:

BST Code

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:  # duplicates fall into the right subtree
        root.right = bst_insert(root.right, key)
    return root

def bst_search(root, key):
    # Returns the matching node, or None if the key is absent
    if root is None or root.key == key:
        return root
    if key < root.key:
        return bst_search(root.left, key)
    else:
        return bst_search(root.right, key)

# Build a BST and search for 17:
root = None
for k in [15, 10, 20, 8, 12, 17, 25]:
    root = bst_insert(root, k)
node = bst_search(root, 17)
print("Found" if node else "Not found")  # Expected: Found


Why it’s cool: It’s fast for in-memory tasks like sorting a list or building a quick lookup table. But here’s where I hit a wall when I first played with it: if you add books in order (say, 1, 2, 3…), your shelf turns into the Leaning Tower of Pisa. Suddenly, finding anything takes O(n) time, like rifling through a stack of unsorted papers. That’s why databases don’t lean on plain BSTs for disk-based work: those random disk hops are a performance killer.
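You can watch this degeneration happen with a few lines of Python. This is a minimal sketch (iterative insert, so a 1000-node chain doesn't blow the recursion limit) comparing tree height for sorted versus shuffled inserts:

```python
import random

# Minimal sketch: the same BST, fed sorted vs. shuffled keys.
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # Iterative insert: safe even on a degenerate, chain-shaped tree
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                return root
            cur = cur.right

def height(root):
    # Level-by-level walk; returns the number of levels
    h, level = 0, [root] if root else []
    while level:
        h += 1
        level = [c for n in level for c in (n.left, n.right) if c]
    return h

keys = list(range(1000))
sorted_root = None
for k in keys:                    # inserting in sorted order
    sorted_root = insert(sorted_root, k)

random.shuffle(keys)
shuffled_root = None
for k in keys:                    # inserting in random order
    shuffled_root = insert(shuffled_root, k)

print(height(sorted_root))    # 1000: a linked list in disguise
print(height(shuffled_root))  # typically in the 20s: O(log n), not O(n)
```

Same data, same code; only the insertion order changed, and one tree is 1000 levels deep while the other stays shallow.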

Takeaway: BSTs are a great starting point for learning, but they’re too fragile for the heavy lifting databases demand.


B-Trees: The Library That Scales

Diagram: a B-tree as a library that scales

Now, imagine you’re not just managing a bookshelf but a full-blown library. You’ve got floors, sections, and shelves galore, each holding dozens of books, neatly organized. That’s a B-tree: a beefed-up tree where every "shelf" (node) can store tons of keys, keeping the whole structure short and balanced. It’s like a librarian’s dream: no matter how many books you add, finding one never takes more than a few steps.

In databases like PostgreSQL, B-trees rule the indexing game. Why? They’re built for the disk. Each node aligns with a disk page (think 4KB or 8KB), so a single read grabs a bunch of keys at once, slashing the number of trips to storage.
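Some back-of-envelope math shows why that fan-out matters. The numbers below are assumptions for illustration (8 KB happens to be PostgreSQL's default page size, but the fan-out of 200 keys per page depends entirely on key size):

```python
import math

# Assumed, illustrative numbers: an 8 KB page holding ~200 key/pointer pairs
page_size = 8192          # bytes per node/page
fanout = 200              # keys per page (depends on key width; assumed here)
n_rows = 100_000_000      # rows to index

# Each level multiplies reach by `fanout`, so height ~= log_fanout(n)
levels = math.ceil(math.log(n_rows, fanout))
print(levels)  # 4: four page reads reach any of 100M keys
```

Compare that with a binary tree, where reaching one of 100M keys takes log2(100M) ≈ 27 hops. On disk, 4 reads versus 27 is the whole ballgame.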

Here’s a stripped-down B-tree in Python:

B-Tree Code

class BTreeNode:
    def __init__(self, t, leaf=False):
        self.t = t  # minimum degree (here t=2)
        self.keys = []
        self.children = []
        self.leaf = leaf

    def split_child(self, index):
        child = self.children[index]
        new_node = BTreeNode(self.t, leaf=child.leaf)
        mid = self.t - 1
        median = child.keys[mid]
        # Move keys after median to new node
        new_node.keys = child.keys[mid+1:]
        child.keys = child.keys[:mid]
        if not child.leaf:
            new_node.children = child.children[mid+1:]
            child.children = child.children[:mid+1]
        self.children.insert(index+1, new_node)
        self.keys.insert(index, median)

    def insert_non_full(self, key):
        i = len(self.keys) - 1
        if self.leaf:
            self.keys.append(None)
            while i >= 0 and key < self.keys[i]:
                self.keys[i+1] = self.keys[i]
                i -= 1
            self.keys[i+1] = key
        else:
            while i >= 0 and key < self.keys[i]:
                i -= 1
            i += 1
            if len(self.children[i].keys) == 2*self.t - 1:
                self.split_child(i)
                if key > self.keys[i]:
                    i += 1
            self.children[i].insert_non_full(key)

class BTree:
    def __init__(self, t):
        self.root = BTreeNode(t, leaf=True)
        self.t = t

    def insert(self, key):
        r = self.root
        if len(r.keys) == 2*self.t - 1:
            s = BTreeNode(self.t, leaf=False)
            s.children.append(r)
            s.split_child(0)
            self.root = s
        self.root.insert_non_full(key)

    def traverse(self):
        def _traverse(node):
            res = []
            for i, k in enumerate(node.keys):
                if not node.leaf:
                    res.extend(_traverse(node.children[i]))
                res.append(k)
            if not node.leaf:
                res.extend(_traverse(node.children[len(node.keys)]))
            return res
        return _traverse(self.root)

# Example usage:
bt = BTree(t=2)
for x in [10, 20, 5, 6, 12, 30, 7, 17]:
    bt.insert(x)
print(bt.traverse())  # Should print [5, 6, 7, 10, 12, 17, 20, 30]


Why it’s a champ: It guarantees O(log n) performance, even when the library grows massive. Splits happen when a shelf overflows, but the tree stays balanced: perfect for OLTP workloads like processing payments or updating user profiles.

The catch: Those splits can mean random disk writes, which old-school hard drives hate. Still, for most database tasks, B-trees are the unsung heroes keeping things snappy.

Real-world vibe: In PostgreSQL, B-tree leaf nodes (over 99% of the index) point straight to your data rows. It’s like a library where the top floors guide you, but the ground floor’s where the action is.


LSM Trees: The Mailroom Hustle

Diagram: an LSM tree as a mailroom

Now, let’s switch gears. You’re running a mailroom, and letters are flooding in faster than you can file them. Instead of sorting each one on the spot, you toss them into a bin (memory), sort them quickly, and then batch-file them into cabinets later. That’s an LSM Tree: it’s all about handling a deluge of writes without breaking a sweat.

Used in databases like Cassandra and RocksDB, LSM trees batch new data in memory (a "memtable"), then flush it to disk as sorted chunks (SSTables). Background merging keeps it tidy over time.

Python snippet showing an LSM-like process of buffering and merging

# Simulate an LSM tree with flush and compaction
threshold = 3
memtable = []
sstables = []

def flush():
    sstables.append(sorted(memtable))
    memtable.clear()

def insert(key, value):
    memtable.append((key, value))
    memtable.sort(key=lambda x: x[0])
    if len(memtable) >= threshold:
        flush()

def compact():
    if len(sstables) >= 2:
        a = sstables.pop(0)
        b = sstables.pop(0)
        merged = sorted(a + b, key=lambda x: x[0])
        sstables.insert(0, merged)

# Example usage:
for k in [10, 5, 20, 1, 15, 12]:
    insert(k, f"val{k}")
# After inserts, flushes produce sorted SSTables
print("SSTables before compact:", sstables)
compact()  # merge the two oldest SSTables
print("SSTables after compact:", sstables)


Output:

SSTables before compact: [[(5, 'val5'), (10, 'val10'), (20, 'val20')], [(1, 'val1'), (12, 'val12'), (15, 'val15')]]
SSTables after compact: [[(1, 'val1'), (5, 'val5'), (10, 'val10'), (12, 'val12'), (15, 'val15'), (20, 'val20')]]
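
The write path is only half the story. A point lookup has to check the freshest data first: the memtable, then each SSTable from newest to oldest. Here's a sketch of that read path in the same toy model (state is hard-coded so the snippet runs on its own; the key 7 appears twice to show shadowing):

```python
# Toy LSM read path: newest data wins, so search the memtable first,
# then SSTables from most recently flushed to oldest.
memtable = [(7, "val7-new")]                        # in-memory, most recent
sstables = [
    [(5, "val5"), (10, "val10"), (20, "val20")],    # older flush
    [(1, "val1"), (7, "val7-old"), (15, "val15")],  # newer flush
]

def get(key):
    # 1) The memtable holds the freshest writes
    for k, v in reversed(memtable):
        if k == key:
            return v
    # 2) Then scan SSTables newest-first; the first hit wins
    for table in reversed(sstables):
        for k, v in table:  # real systems binary-search and Bloom-filter here
            if k == key:
                return v
    return None

print(get(7))   # "val7-new": the memtable shadows the older SSTable entry
print(get(10))  # "val10"
print(get(99))  # None
```

This is exactly why LSM reads get slower as SSTables pile up, and why compaction (merging files and discarding shadowed entries) matters so much.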


Why it’s clutch: Writes are sequential, append-only, so they’re blazing fast, even on spinning disks. It’s a dream for write-heavy setups like logging or streaming data.

The trade-off: Reads might need to peek at multiple spots (memory, disk files), slowing them down. Bloom filters help, but LSM trees still favor writes over reads.
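A Bloom filter lets a read skip an SSTable entirely when the key definitely isn't in it. Here's a minimal sketch of the idea; the sizes (a 1024-bit array, 3 hashes derived from one SHA-256 digest) are toy values I picked for illustration, whereas real systems tune them to a target false-positive rate:

```python
import hashlib

M, K = 1024, 3          # bit-array size and hash count (toy values)
bits = [0] * M

def _hashes(key):
    # Derive K indices from one digest: a common simplification
    digest = hashlib.sha256(str(key).encode()).digest()
    return [int.from_bytes(digest[i*4:(i+1)*4], "big") % M for i in range(K)]

def add(key):
    for h in _hashes(key):
        bits[h] = 1

def might_contain(key):
    # False means definitely absent; True means "maybe" (go check the SSTable)
    return all(bits[h] for h in _hashes(key))

for k in [10, 5, 20]:   # keys present in one SSTable
    add(k)
print(might_contain(10))   # True
print(might_contain(999))  # a miss lets us skip that SSTable read entirely
```

The filter can say "maybe" for a key that was never added (a false positive), but it can never say "no" for a key that was, which is exactly the guarantee a read path needs.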

Takeaway: If your app’s drowning in writes, LSM trees are your lifeline.


Hardware: The Silent Game-Changer

Here’s a nugget I learned the hard way: your hardware calls the shots. On old HDDs, random writes (like B-tree splits) are a nightmare: think 10 ms per hop. LSM trees sidestep this with sequential writes, making them HDD darlings. SSDs, though? They handle random access better, so B-trees can flex their muscles, especially for reads.

Quick tip: Match your data structure to your storage. HDDs love LSM; SSDs play nice with both.
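The gap is easy to underestimate, so here's the arithmetic. All the numbers below are assumed, order-of-magnitude figures (10 ms per seek, 100 MB/s sequential throughput), not benchmarks:

```python
# Back-of-envelope HDD math with assumed, order-of-magnitude numbers
SEEK_MS = 10.0            # one random HDD seek
SEQ_MB_PER_S = 100.0      # sustained sequential HDD throughput
record_bytes = 100
n_records = 100_000

# Random writes: roughly one seek per record
random_s = n_records * SEEK_MS / 1000
# Sequential writes: one big append at full throughput
sequential_s = (n_records * record_bytes) / (SEQ_MB_PER_S * 1024 * 1024)

print(f"random: {random_s:.0f} s")          # 1000 s: roughly 17 minutes
print(f"sequential: {sequential_s:.2f} s")  # ~0.10 s
```

Four orders of magnitude from the same disk, just by changing the access pattern. That's the entire bet an LSM tree makes.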


OLTP vs. OLAP: The Battlefield

Not all workloads are created equal:

  • OLTP (e.g., e-commerce): Small, fast transactions, think adding an item to your cart. B-trees shine here with quick lookups.
  • OLAP (e.g., analytics): Big, sweeping queries, like monthly sales reports. Columnar stores (often with LSM-inspired merging) crush it by scanning fast.
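Here's a tiny illustration of why columnar layout suits OLAP: an aggregate only needs to touch one column's contiguous array instead of every field of every row. This is toy Python, not a real columnar engine, but the access pattern is the point:

```python
# Row layout: each record is a dict; an aggregate walks every whole record
rows = [{"id": i, "region": "EU", "amount": i * 1.0} for i in range(1000)]
row_total = sum(r["amount"] for r in rows)

# Columnar layout: one contiguous list per column; the scan reads only `amount`
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
col_total = sum(columns["amount"])

print(row_total == col_total)  # True: same answer, far less data touched
```

Same answer either way, but the columnar scan skips `id` and `region` entirely, and contiguous same-typed values also compress far better on disk.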

Real-world twist: ClickHouse’s MergeTree blends LSM-style merging with columnar storage, making analytics a breeze.


In the Wild: PostgreSQL and Beyond

PostgreSQL leans hard on B-trees; its default index is a masterclass in balance. But for write-crazy workloads, LSM-based systems like Cassandra steal the show. And in analytics land, columnar stores like Parquet or ClickHouse use sorted chunks (LSM vibes) to zip through queries.

Pro move: Peek under your database’s hood. It’ll guide you to the right tool for the job.


Conclusion: Own the Gears

From shaky BSTs to sturdy B-trees to write-happy LSM trees, these structures are the heartbeat of databases. Here’s your cheat sheet:

  • BSTs: Fun to learn, but don’t bet the farm on them.
  • B-trees: Your go-to for balanced performance.
  • LSM trees: Write fast, worry about reads later.

Next time you’re wrestling with a slow query or a data pipeline, think about what’s spinning the wheels. Tinker with the code, test the trade-offs, and make it yours. That’s the engineer’s edge.

