Onion Lee

JavaLSM: An LSM Tree Based Key-Value Storage in Java

In this post, I’ll share how I built my own LSM-based storage engine from scratch in Java. I focused on implementing the full data pipeline correctly, while using existing solutions for low-level details such as balanced search trees and probabilistic data structures.

GitHub: seungwonlee003 / JavaLSM

An LSM Tree based key-value storage engine written in Java

JavaLSM: An LSM Tree Based Key-Value Storage in Java

Introduction

Blog Post

JavaLSM is an LSM tree based storage engine written in Java.

It offers a simple key-value interface and is designed to be used as an embedded storage engine.

Features

  1. Simple key-value interface: put(key, value), get(key), delete(key)
  2. LSM Tree Architecture: In-memory buffer + disk-based SSTables
  3. Fast Reads: Powered by Bloom filters and block indexes
  4. Concurrent Reads: Fine-grained locking allows reads during compaction
  5. Automatic Compaction: Full-level leveled compaction strategy optimizes storage
  6. Crash Recovery: Write-Ahead Log (WAL) ensures durability of the in-memory write buffer

Usage

Opening a DB instance

Create a DB instance by calling its constructor. No arguments are required. By default, Write-Ahead Logging (WAL) is enabled, and both the memtable and SSTable sizes are set to 16GB.

DB db = new DB();

Writing to the DB

Use the put method to insert or update entries in the DB. It accepts…

Basics

LSM tree-based storage is a write-optimized design that uses an in-memory buffer to store initial writes and immutable sorted string tables (SSTables) for persistent storage on disk. The in-memory buffer, called a memtable, typically uses a balanced search tree as its underlying data structure. This keeps entries sorted, enabling efficient in-memory writes and reads with O(log n) complexity.

An SSTable is an immutable, sorted, and append-only structure stored on disk. Since the memtable is already sorted, flushing it to disk to create a new SSTable is a relatively cheap O(n) operation. To optimize space and remove old or duplicate entries, background compaction runs periodically, merging SSTables and discarding obsolete data.

Memtable

With the basics in place, I implemented the memtable using Java's built-in TreeMap, which is backed by a red-black tree. I kept both memtables and SSTables at a fixed size, so flushing a memtable to disk never requires splitting the result across SSTables, which simplified the flush path.
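
To make that concrete, here is a minimal sketch of what a TreeMap-backed memtable can look like. The class name, the tombstone marker, and the rough character-based size accounting are illustrative and not taken from the repository:

import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of a TreeMap-backed memtable; names and thresholds are illustrative.
class Memtable {
    // A tombstone marks a deleted key until compaction physically removes it.
    static final String TOMBSTONE = "__TOMBSTONE__";

    private final TreeMap<String, String> map = new TreeMap<>();
    private long sizeEstimate = 0;

    void put(String key, String value) {
        String old = map.put(key, value);   // O(log n) insert into the red-black tree
        if (old == null) {
            sizeEstimate += key.length() + value.length();   // rough size estimate in chars
        } else {
            sizeEstimate += value.length() - old.length();
        }
    }

    String get(String key) {
        return map.get(key);                // O(log n) lookup; may return TOMBSTONE
    }

    void delete(String key) {
        put(key, TOMBSTONE);                // deletes are just writes of a marker value
    }

    boolean shouldFlush(long thresholdBytes) {
        // Caller flushes to a new SSTable and swaps in a fresh memtable when this returns true.
        return sizeEstimate >= thresholdBytes;
    }

    // Sorted view handed to the SSTable writer during a flush.
    Iterable<Map.Entry<String, String>> entries() {
        return map.entrySet();
    }
}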

Block Index

In LevelDB, data is read and written in ~4KB blocks to minimize costly disk I/O. I adopted a similar approach by maintaining an in-memory index with one entry per block, storing each block's first key, offset, and size. The index itself lives in a Java TreeMap. Assuming each entry fits within a block (i.e., <4KB) and that the block indexes fit in memory, this reduces lookup time from O(N) to O(log N) per SSTable by avoiding full table scans. See below for the SSTable’s get method:


public String get(String key) {
    // Key-range check: skip this SSTable entirely if the key falls outside [minKey, maxKey].
    if (key.compareTo(minKey) < 0 || key.compareTo(maxKey) > 0) {
        return null;
    }

    // Bloom filter check: a negative answer means the key is definitely not in this table.
    if (!bloomFilterUtil.mightContain(key)) {
        return null;
    }

    // Block index lookup: find the block whose first key is the greatest key <= the target.
    Map.Entry<String, BlockInfo> indexEntry = index.floorEntry(key);
    if (indexEntry == null) {
        return null;
    }

    BlockInfo blockInfo = indexEntry.getValue();

    // Read just that one block from disk instead of scanning the whole file.
    try (RandomAccessFile file = new RandomAccessFile(filePath, "r")) {
        file.seek(blockInfo.offset);
        byte[] blockData = new byte[(int) blockInfo.length];
        file.readFully(blockData);

        try (ByteArrayInputStream bais = new ByteArrayInputStream(blockData);
             DataInputStream dis = new DataInputStream(bais)) {
            // Entries are length-prefixed and stored in sorted order within the block.
            while (dis.available() > 0) {
                int keyLength = dis.readInt();
                byte[] keyBytes = new byte[keyLength];
                dis.readFully(keyBytes);
                String currentKey = new String(keyBytes, StandardCharsets.UTF_8);

                int valueLength = dis.readInt();
                byte[] valueBytes = new byte[valueLength];
                dis.readFully(valueBytes);

                if (currentKey.equals(key)) {
                    return IOUtils.deserializeValue(valueBytes);
                }
                // Entries are sorted, so once we pass the target key it cannot appear later.
                if (currentKey.compareTo(key) > 0) {
                    return null;
                }
            }
        }
    } catch (IOException e) {
        throw new RuntimeException("Failed to read SSTable: " + filePath, e);
    }

    return null;
}

Compaction

With efficient reads and writes in place, I implemented a simple compaction to keep the ever-growing set of SSTables under control. I used a k-way merge algorithm with a min-heap to combine multiple SSTables into new, smaller SSTables without duplicates. Here's how it works (a sketch of the merge loop follows the steps below):

  1. A compaction thread acquires a read lock and merges data between level n and level n+1. It combines all SSTables from level n with selected SSTables from level n+1, maintaining a list ordered by recency.
  2. An iterator is created for each SSTable, and their first entries are inserted into a min-heap. The heap is ordered by key, breaking ties by favoring entries from more recent SSTables (which reflect newer writes).
  3. As elements are popped from the heap, they're added to the new SSTable only if they aren't marked as deleted (tombstoned) and haven't already been written. Duplicates are skipped, iterators advance, and the heap holds at most k elements at a time.
  4. If the current SSTable exceeds its size threshold during compaction, a new SSTable file is created.
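
Below is a rough sketch of that merge loop. The types (KV, MergeEntry) and the in-memory result list are illustrative; the real engine writes entries into SSTable files and rolls over to a new file when the size threshold is hit. The sketch assumes the iterators are passed in newest-first, so recency rank 0 is the most recent SSTable:

import java.util.*;

// Sketch of a k-way merge over already-sorted SSTable iterators using a min-heap.
class Compactor {
    record KV(String key, String value) {}

    // One heap entry: an iterator's current key/value plus that SSTable's recency rank
    // (0 = newest), so ties on equal keys prefer the most recent write.
    record MergeEntry(KV kv, int recency, Iterator<KV> source) {}

    static List<KV> merge(List<Iterator<KV>> iterators, String tombstone) {
        PriorityQueue<MergeEntry> heap = new PriorityQueue<>(
            Comparator.comparing((MergeEntry e) -> e.kv().key())
                      .thenComparingInt(MergeEntry::recency));

        // Seed the heap with the first entry of each iterator (at most k elements live in the heap).
        for (int i = 0; i < iterators.size(); i++) {
            Iterator<KV> it = iterators.get(i);
            if (it.hasNext()) heap.add(new MergeEntry(it.next(), i, it));
        }

        List<KV> merged = new ArrayList<>();
        String lastKey = null;
        while (!heap.isEmpty()) {
            MergeEntry e = heap.poll();
            // The first occurrence of a key is the most recent version; later ones are duplicates.
            if (!e.kv().key().equals(lastKey)) {
                lastKey = e.kv().key();
                // Drop tombstoned entries instead of copying them into the new SSTable.
                if (!tombstone.equals(e.kv().value())) {
                    merged.add(e.kv());
                }
            }
            // Advance the iterator this entry came from and refill the heap.
            if (e.source().hasNext()) {
                heap.add(new MergeEntry(e.source().next(), e.recency(), e.source()));
            }
        }
        return merged;
    }
}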

Full-level Leveled Compaction

This strategy is known as full-level leveled compaction. It simplifies comparator logic and merging while maintaining a time complexity of O(N log k), where k is the number of input SSTables and N is the total number of entries. I implemented fine-grained locking so that reads can continue on both memtables and SSTables even during this CPU-intensive compaction process. A write lock is only briefly acquired on the manifest file when swapping out old SSTables for newly compacted ones.
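As a rough illustration of that locking pattern, the snippet below shows one way the SSTable list can be snapshotted under a read lock and swapped under a write lock; the class and method names are mine, not the project's:

import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of fine-grained locking around the SSTable manifest.
class SSTableManager {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> sstablePaths;   // current manifest of SSTable files, newest first

    SSTableManager(List<String> initial) {
        this.sstablePaths = List.copyOf(initial);
    }

    // Readers hold the read lock just long enough to snapshot the manifest,
    // then search the SSTables without any lock held.
    List<String> snapshot() {
        lock.readLock().lock();
        try {
            return sstablePaths;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Compaction merges files with no lock held, then briefly takes the write lock
    // to swap the old SSTables for the newly compacted ones.
    void swap(List<String> newManifest) {
        lock.writeLock().lock();
        try {
            this.sstablePaths = List.copyOf(newManifest);
        } finally {
            lock.writeLock().unlock();
        }
    }
}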

In production systems, a more advanced approach called partial compaction compacts only subsets of SSTables over time, reducing write amplification. However, I didn’t implement this, as compaction wasn’t the bottleneck in my case.

Bloom filter

I added an optimization for lookups of keys that do not exist. Without it, the database would have to check every SSTable at each level for such keys, which is expensive. I experimented with Guava's Bloom filter to measure its impact on reducing unnecessary disk lookups. In the benchmark for negative gets (where the key is absent), performance improved by a factor of 10 compared to the random-access get benchmark.
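
For reference, this is roughly how a per-SSTable filter can be built with Guava; the expected key count and false-positive rate below are placeholders, not the values used in JavaLSM:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

class BloomFilterExample {
    public static void main(String[] args) {
        // Illustrative sizing: expected number of keys in one SSTable and the target
        // false-positive rate determine how many bits and hash functions Guava allocates.
        BloomFilter<CharSequence> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                100_000,   // expected number of keys
                0.01);     // ~1% false-positive rate

        // During a flush, every key written to the SSTable is also added to its filter.
        filter.put("user:42");

        // A negative answer is definitive, so the SSTable's blocks are never read from disk.
        System.out.println(filter.mightContain("user:42"));   // true
        System.out.println(filter.mightContain("user:43"));   // almost certainly false
    }
}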

Write-ahead Log

To ensure durability of writes buffered in the memtable, I implemented a write-ahead log (WAL), an append-only log that guarantees persistence even if the system crashes before flushing data to disk.
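
Here is a minimal sketch of what such a log can look like. The record format (length-prefixed key and value) and the class name are illustrative rather than JavaLSM's actual on-disk layout:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Sketch of an append-only write-ahead log.
class WriteAheadLog implements Closeable {
    private final DataOutputStream out;

    WriteAheadLog(String path) throws IOException {
        // Open in append mode so existing records survive restarts.
        this.out = new DataOutputStream(new FileOutputStream(path, true));
    }

    // Every put is logged before it is applied to the memtable.
    void appendPut(String key, String value) throws IOException {
        writeString(key);
        writeString(value);
        out.flush();   // a real engine would fsync here for stronger durability guarantees
    }

    // On startup, the log is replayed into a fresh memtable to recover unflushed writes.
    static void replay(String path, TreeMap<String, String> memtable) throws IOException {
        File file = new File(path);
        if (!file.exists()) return;
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            while (in.available() > 0) {
                String key = readString(in);
                String value = readString(in);
                memtable.put(key, value);
            }
        }
    }

    private void writeString(String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private static String readString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}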

Finally, here’s an overview diagram of my LSM storage engine architecture:

Overall Architecture

Initially, I thought the concept behind LSM storage engines was simple, but building a complete pipeline was much more challenging. There are far more subtle details and possible optimizations than I expected. Nonetheless, through this project, I gained a much deeper understanding of how LSM-tree-based storage engines work—and I hope this knowledge will help me grow into a better engineer.
