DEV Community: Chameera De Silva

Inside Databases: BSTs, B-Trees, & LSM Trees

Chameera De Silva — Thu, 15 May 2025 07:11:59 +0000

TL;DR: I choose to expand my knowledge of how data management works behind the scenes in databases and data warehouses after gaining experience in data engineering and software development. I recommend everyone to be cautious of data structures and algorithms (DSA) and hardware specifics, often sprinkled with hardware terminology.

In late 2020, When I started my data engineering/software development journey as an intern into the space of databases, I felt like a fish out of water. I was neither a specific computer science major nor a math specialist, just a stubborn curiosity and a job that demanded I figure it out. Terms like "BST," "B-tree," and "LSM tree" hit me like a wall of jargon and confusion, leaving me wondering if I’d ever crack the code. But here’s the thing: once I got my hands dirty with these concepts, they went from intimidating to downright fascinating. First I solved some leetcode problems with my go-to language Python and felt little confidence about these concepts except deep in my soul I wanted to apply real world use cases directly with the work that I was doing. But I knew that these are the backbone powering banks backend databases to day-to-day queue services which we use in decoupled systems.

In this article, I’ll try to explain three main parts in database design: Binary Search Trees (BSTs), B-Trees, and Log-Structured Merge Trees (LSM Trees). I’ll walk you through each one with relatable analogies (famous bookshelves and mailrooms example), simple Python snippets, and real-world tie-ins to systems like PostgreSQL and Cassandra. Whether you’re an engineer or just leveling up, this is your backstage pass to how databases really work.

Binary Search Trees (BSTs): The Wobbly Bookshelf

Remember that moment when you first understood how a perfectly organized library could make finding any book a breeze? That intuitive A-to-Z system, where you can quickly narrow down your search? That's the core idea behind a Binary Search Tree (BST). Think of it as a digital bookshelf where each new book (or piece of data) gets placed precisely: titles that come earlier in the alphabet go to the left, and those that come later go to the right. It feels elegant, almost magical. In the best-case scenario, finding the book you need takes logarithmic time, O(log n), a remarkably efficient way to search.

Let's see a quick peek at how this looks in Python:

BST Code

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root

def bst_search(root, key):
    if root is None or root.key == key:
        return root
    if key < root.key:
        return bst_search(root.left, key)
    else:
        return bst_search(root.right, key)

# Build a BST and search for 17:
root = None
for k in [15, 10, 20, 8, 12, 17, 25]:
    root = bst_insert(root, k)
node = bst_search(root, 17)
print("Found" if node else "Not found")  # Expected: Found

Why it’s cool: It’s fast for in-memory tasks like sorting a list or building a quick lookup table. But here’s where I hit a wall when I first played with it: if you add books in order (say, 1, 2, 3…), your shelf turns into a leaning tower of Pisa. Suddenly, finding anything takes O(n) time, like rifling through a stack of unsorted papers. That’s why databases don’t lean on plain BSTs for disk-based work, those random disk hops are a performance killer.

Takeaway: BSTs are a great starting point for learning, but they’re too fragile for the heavy lifting databases demand.

B-Trees: The Library That Scales

Now, imagine you’re not just managing a bookshelf but a full-blown library. You’ve got floors, sections, and shelves galore, each holding dozens of books, neatly organized. That’s a B-tree: a beefed-up tree where every "shelf" (node) can store tons of keys, keeping the whole structure short and balanced. It’s like a librarian’s dream: no matter how many books you add, finding one never takes more than a few steps.

In databases like PostgreSQL, B-trees rule the indexing game. Why? They’re built for the disk. Each node aligns with a disk page (think 4KB or 8KB), so a single read grabs a bunch of keys at once, slashing the number of trips to storage.

Here’s a stripped-down B-tree in Python:

b-tree code

class BTreeNode:
    def __init__(self, t, leaf=False):
        self.t = t  # minimum degree (here t=2)
        self.keys = []
        self.children = []
        self.leaf = leaf

    def split_child(self, index):
        child = self.children[index]
        new_node = BTreeNode(self.t, leaf=child.leaf)
        mid = self.t - 1
        median = child.keys[mid]
        # Move keys after median to new node
        new_node.keys = child.keys[mid+1:]
        child.keys = child.keys[:mid]
        if not child.leaf:
            new_node.children = child.children[mid+1:]
            child.children = child.children[:mid+1]
        self.children.insert(index+1, new_node)
        self.keys.insert(index, median)

    def insert_non_full(self, key):
        i = len(self.keys) - 1
        if self.leaf:
            self.keys.append(None)
            while i >= 0 and key < self.keys[i]:
                self.keys[i+1] = self.keys[i]
                i -= 1
            self.keys[i+1] = key
        else:
            while i >= 0 and key < self.keys[i]:
                i -= 1
            i += 1
            if len(self.children[i].keys) == 2*self.t - 1:
                self.split_child(i)
                if key > self.keys[i]:
                    i += 1
            self.children[i].insert_non_full(key)

class BTree:
    def __init__(self, t):
        self.root = BTreeNode(t, leaf=True)
        self.t = t

    def insert(self, key):
        r = self.root
        if len(r.keys) == 2*self.t - 1:
            s = BTreeNode(self.t, leaf=False)
            s.children.append(r)
            s.split_child(0)
            self.root = s
        self.root.insert_non_full(key)

    def traverse(self):
        def _traverse(node):
            res = []
            for i, k in enumerate(node.keys):
                if not node.leaf:
                    res.extend(_traverse(node.children[i]))
                res.append(k)
            if not node.leaf:
                res.extend(_traverse(node.children[len(node.keys)]))
            return res
        return _traverse(self.root)

# Example usage:
bt = BTree(t=2)
for x in [10, 20, 5, 6, 12, 30, 7, 17]:
    bt.insert(x)
print(bt.traverse())  # Should print [5, 6, 7, 10, 12, 17, 20, 30]

Why it’s a champ: It guarantees O(log n) performance, even when the library grows massive. Splits happen when a shelf overflows, but the tree stays balanced, perfect for OLTP workloads like processing payments or updating user profiles.

The catch: Those splits can mean random disk writes, which old-school hard drives hate. Still, for most database tasks, B-trees are the unsung heroes keeping things snappy.

Real-world vibe: In PostgreSQL, B-tree leaf nodes (over 99% of the index) point straight to your data rows. It’s like a library where the top floors guide you, but the ground floor’s where the action is.

LSM Trees: The Mailroom Hustle

Now, let’s switch gears. You’re running a mailroom, and letters are flooding in faster than you can file them. Instead of sorting each one on the spot, you toss them into a bin (memory), sort them quickly, and then batch-file them into cabinets later. That’s an LSM Tree: it’s all about handling a deluge of writes without breaking a sweat.

Used in databases like Cassandra and RocksDB, LSM trees batch new data in memory (a "memtable"), then flush it to disk as sorted chunks (SSTables). Background merging keeps it tidy over time.

Python snippet showing an LSM-like process of buffering and merging

# Simulate an LSM tree with flush and compaction
threshold = 3
memtable = []
sstables = []

def flush():
    sstables.append(sorted(memtable))
    memtable.clear()

def insert(key, value):
    memtable.append((key, value))
    memtable.sort(key=lambda x: x[0])
    if len(memtable) >= threshold:
        flush()

def compact():
    if len(sstables) >= 2:
        a = sstables.pop(0)
        b = sstables.pop(0)
        merged = sorted(a + b, key=lambda x: x[0])
        sstables.insert(0, merged)

# Example usage:
for k in [10, 5, 20, 1, 15, 12]:
    insert(k, f"val{k}")
# After inserts, flushes produce sorted SSTables
print("SSTables before compact:", sstables)
compact()  # merge the two oldest SSTables
print("SSTables after compact:", sstables)

Output might be:

SSTables before compact: [[(5, 'val5'), (10, 'val10'), (20, 'val20')], [(1, 'val1'), (12, 'val12'), (15, 'val15')]]
SSTables after compact: [[(1, 'val1'), (5, 'val5'), (10, 'val10'), (12, 'val12'), (15, 'val15'), (20, 'val20')]]

Why it’s clutch: Writes are sequential, append-only, so they’re blazing fast, even on spinning disks. It’s a dream for write-heavy setups like logging or streaming data.

The trade-off: Reads might need to peek at multiple spots (memory, disk files), slowing them down. Bloom filters help, but LSM trees still favor writes over reads.

Takeaway: If your app’s drowning in writes, LSM trees are your lifeline.

Hardware: The Silent Game-Changer

Here’s a nugget I learned the hard way: your hardware calls the shots. On old HDDs, random writes (like B-tree splits) are a nightmare, think 10ms per hop. LSM trees sidestep this with sequential writes, making them HDD darlings. SSDs, though? They handle random access better, so B-trees can flex their muscles, especially for reads.

Quick tip: Match your data structure to your storage. HDDs love LSM; SSDs play nice with both.

OLTP vs. OLAP: Battlefield

Not all workloads are created equal:

OLTP (e.g., e-commerce): Small, fast transactions, think adding an item to your cart. B-trees shine here with quick lookups.
OLAP (e.g., analytics): Big, sweeping queries, like monthly sales reports. Columnar stores (often with LSM-inspired merging) crush it by scanning fast.

Real-world twist: ClickHouse’s MergeTree blends LSM-style merging with columnar storage, making analytics a breeze.

In the Wild: PostgreSQL and Beyond

PostgreSQL leans hard on B-trees. its default index is a masterclass in balance. But for write-crazy workloads, LSM-based systems like Cassandra steal the show. And in analytics land, columnar stores like Parquet or ClickHouse use sorted chunks (LSM vibes) to zip through queries.

Pro move: Peek under your database’s hood. It’ll guide you to the right tool for the job.

Conclusion: Own the Gears

From shaky BSTs to sturdy B-trees to write-happy LSM trees, these structures are the heartbeat of databases. Here’s your cheat sheet:

BSTs: Fun to learn, but don’t bet the farm on them.
B-trees: Your go-to for balanced performance.
LSM trees: Write fast, worry about reads later.

Next time you’re wrestling with a slow query or a data pipeline, think about what’s spinning the wheels. Tinker with the code, test the trade-offs, and make it yours. That’s the engineer’s edge.

Bridging the Gap: Building a Tea Plantation Assistant for the Sri Lankan Tea plantation

Chameera De Silva — Wed, 12 Mar 2025 14:15:38 +0000

As someone passionate about problem-solving, I noticed a significant gap in the Sri Lankan tea industry:

Tea estate owners struggle to access and manage essential documents and information.

So, I built an AI-powered Tea Plantation Assistant—a chatbot that simplifies information retrieval for tea industry stakeholders. 🚀

In this blog, I'll walk you through how I went from analyzing raw data to a fully functional AI chatbot.

Whether you're a developer or tea industry professional, this step-by-step guide will help you understand the impact of AI in agriculture.

🌱 Step 1: Data Collection & Preprocessing

The foundation of this assistant is structured data from official tea industry documents.

To make the chatbot useful, I needed a clean dataset. I built a pipeline that:

✅ Scrapes and processes relevant documents

✅ Cleans and structures the text for efficient retrieval

Check out the GitHub repo:

🔗 ai-assistant-tri (Data Preprocessing)

Even if you're not a coder, imagine this as organizing thousands of tea reports into a well-labeled library. 📚

🎨 Step 2: Designing the Chatbot UI with Gradio

Once the data was ready, I built a user-friendly chatbot UI using Gradio.

Why Gradio?

✅ Quick to set up

✅ Clean, minimalistic interface

✅ Great for AI-powered applications

I customized the UI to align with a tea plantation theme, making it accessible to non-tech users.

🔗 Chatbot UI Repo (Gradio)

🧠 Step 3: Adding AI & Knowledge Retrieval

The core of the assistant is retrieval-augmented generation (RAG) using:

LangChain for intelligent query processing
Pinecone for storing and retrieving relevant documents

This enables the chatbot to:

✅ Understand tea-related questions using Google Generative AI

✅ Retrieve the most relevant tea industry information

✅ Generate clear and useful responses

For developers: This is a LangChain-powered RetrievalQA chain optimized for real-world tea industry applications. 🍃

🚀 Step 4: Deploying the Working Demo

The final step was making this assistant publicly available for testing.

I deployed the chatbot on HuggingFace Spaces:

🔗 Live Demo: Sri Lankan Tea Chatbot

Now, anyone can ask industry-related questions and receive instant AI-generated answers!

🎯 Conclusion

This project started with a real-world problem:

Stakeholders in the Sri Lankan tea industry struggle to access critical information.

By leveraging AI, LangChain, and Gradio, I built a chatbot that:

✅ Makes information instantly accessible

✅ Bridges the gap between technology & traditional industries

✅ Showcases how AI can revolutionize agriculture 🌍

If you're curious about AI in agriculture, explore the code and try the demo!

Let’s innovate together. 🚀

📌 Resources & Links

🗂 Data Collection & Preprocessing: GitHub Repo
🎨 Gradio Chatbot UI: GitHub Repo
🤖 Live Chatbot Demo: HuggingFace

💡 What do you think about AI in agriculture? Drop your thoughts in the comments!

Survivorship Bias in Data Engineering

Chameera De Silva — Sat, 08 Apr 2023 21:23:06 +0000

Are you tired of seeing the context of data engineering stories and how they lead to success? Well, be ready to dig deeper into the world of "Survivorship bias in data engineering".

So the term "Survivorship bias" was initiated by the statistician "Abraham Wald" at Columbia University, who started the analysis of the research and development on World War II combat aircraft and specifically recommended places for reinforcement against enemy attacks.

First, let's see the major problem here: data getting generated every millisecond, customer touchpoints, and whatever happens worldwide. Every time, it leaves a digital footprint, as you might think right now. Oh yes, data engineering has helped to gain a massive amount of data from a technical perspective, called "data pipelines," leading to remarkable findings and business success. But haven't you heard about numerous failed small-scale and large-scale projects that never make it to the spotlight? They do not get talked about because, well, they are failures.

However, there will be more! As we focus on the term Survivorship bias here, the data ecosystem that exists and how we used to train machine learning models come into the limelight. Mostly, we focus only on the data that leads to successful models, neglecting the rest of the data to cut the mustard. So, the machine learning models will basically learn from a biased dataset.

Now, I know what you are thinking: "Is there a better way" or "What can we do other than the usual process". Well, first, stop thwarting the projects that are so-called "failed" and give a little attention to where they have gone wrong and try to learn from them. Again, it's better to perceive than to do what we usually do because divergent thinking can be more mindful and involves actively seeking out examples of failed projects.

Currently, we are living in the hype cycle of AI; it's essential to keep an eye on where we are heading, and in terms of solution bias, we can welcome the chaos and unpredictable data engineering. It's hard to depend on only the success stories, and you must be prepared to divert and cope when things go wrong. And hey, maybe we can even find some humor in the failures along the way.

Folks, I hope you got some sense of the not-so-glamorous side of data engineering. But do not discourage you. Always welcome failure; try to intercept it and learn from it. No one knows your failed project will lead to something amazing (or at least a good laugh).