Wormhole4j v0.3.0: Supports Multi-threaded Concurrency

#java #map #performance #datastructure

Introduction

Wormhole4j is a Java implementation of the Wormhole index, an ordered in-memory data structure from the EuroSys '19 paper, "Wormhole: A Fast Ordered Index for In-memory Data." By using the strengths of hash tables, prefix trees, and B+ trees, it achieves a worst-case lookup complexity of O(log L), where L is the length of the key. This makes it very fast for both point lookups and range scans.

While earlier versions of the library were fast, they only worked in single-threaded environments. Most high-performance systems today need thread safety and concurrency. Version 0.3.0 fixes this. The concurrency model follows the design described in the original paper. This post describes how we implemented it in Java, what challenges we encountered, and what the benchmarks show.

Why Concurrent Ordered Indexes Are Hard

Designing a concurrent ordered index is much harder than a concurrent hash map. In a hash map, keys are independent. In an ordered index, keys are linked in a specific sequence to allow range scans.

Using one "big lock" is the easiest way to make code thread-safe, but it stops multiple threads from working at the same time, which ruins performance. To avoid this, Wormhole4j uses the strategy from the original paper, which splits operations into three groups:

Read-only (GET, SCAN): These should be lock-free and not block each other.
Leaf-only writes (PUT/DELETE that stay within one leaf node): These only affect one leaf node and only need a lock on that specific node.
Structural writes (PUT/DELETE that cause a split/merge): These are the most complex because they change both the leaf list and the internal metadata.

The goal is to make sure that a structural change—like a node split—doesn't cause a reader to see incorrect data.

The Design: Lock-Free Reads via QSBR RCU

To keep reads fast, Wormhole4j uses QSBR RCU (Quiescent State-Based Reclamation Read-Copy-Update) for metadata. The index keeps two copies of the metadata: an active table and an inactive table.

When a structural change happens, the writer updates the inactive table and then flips a pointer to make it the active one. To safely update the "old" table later, the system needs to know when all readers are finished. Instead of using expensive reference counting, Wormhole4j uses QSBR RCU, where threads signal a "quiescent state" (a quiet moment) when they aren't using the index.

Because the QSBR mechanism needs to know which threads are active to work correctly, you must register and unregister each thread:

// Enrollment is required for the QSBR RCU mechanism
wormhole.registerThread();
try {
    String value = wormhole.get("some_key");
    wormhole.put("another_key", 42);
    ...
} finally {
    // Signaling a quiescent state upon completion
    wormhole.unregisterThread();
}

We chose to do it this way on purpose. While it requires an extra step for the user, it removes the heavy overhead usually needed to track concurrent readers, which makes the library much faster.

Fine-Grained Locking on Leaf Nodes

For local changes, Wormhole4j uses fine-grained locking. Each leaf node has its own lock, so writers only compete with other threads trying to access the exact same data.

To keep readers from being blocked by these locks, we use version numbers. Readers check a version timestamp before and after reading a node. If the versions don't match, it means a change happened, and the reader simply tries again.

Testing Correctness with Lincheck

Concurrency bugs are very hard to find with standard tests because they only happen when threads interact in specific, rare ways. To make sure v0.3.0 is truly thread-safe, we used Lincheck, a tool for testing concurrent data structures on the JVM.

Lincheck checks different thread orders to verify linearizability: the rule that every concurrent execution must act like a valid sequential one. By using a standard TreeMap as our guide, we verified that the concurrent code works correctly.

During testing, we used a small key range and a tiny leaf node size (set to 4). This forced the index to perform splits and merges constantly, which helped us find several real bugs that would have stayed hidden in normal tests. One example: a reader could follow stale leaf node metadata immediately after a split, landing on the wrong node and returning an incorrect result. Without Lincheck systematically exploring thread interleavings, this kind of bug is nearly impossible to reproduce reliably.

Benchmark Results

We compared Wormhole4j v0.3.0 against ConcurrentSkipListMap (CSLM) using Java 21 with different thread counts (from 1+1 up to 8+8).

PUT + GET: Wormhole4j was faster than CSLM across all thread counts. It performed best with String keys, where its lookup logic is much more efficient than the string comparisons used by a skip list.
PUT + SCAN: Wormhole4j has a strong advantage in PUT speed. However, for numeric key scans under heavy write pressure, the frequent node splits can make performance closer to CSLM.
Scalability: Both structures scaled well up to about 4+4 threads. Wormhole4j kept its performance lead as more threads were added.

Lessons Learned

Developing v0.3.0 showed that testing concurrent code is often more difficult than implementing it. While the plan for QSBR RCU and dual tables was important, strict linearizability testing was what actually made the implementation reliable. Lincheck is now part of the automated tests to make sure future updates do not break thread safety.

Closing

Wormhole4j v0.3.0 brings high-performance, verified concurrent indexing to Java. By combining QSBR RCU, dual-table metadata, and leaf-level locking, it offers a fast and reliable alternative to standard concurrent collections.