DEV Community

mote

I Spent 3 Hours Watching My Benchmark Hang, Then 6 Seconds to Fix It


Three hours. That's how long bench_column_index ran before I realized it wasn't going anywhere.

I was preparing for moteDB v0.2.0 and running the usual performance suite. Twelve DB instances in parallel, each doing SELECT WHERE col = ? queries while a background thread built indexes. Queries that should take milliseconds started taking minutes. Then hours. Then nothing.

The culprit was a single RwLock<GenericBTree> protecting every read and write to the column index. When the background thread grabbed the write lock to bulk-insert, every query blocked. Simple as that. Twelve threads fighting over one lock.

Here's what I did about it — and how I got that 3-hour hang down to 6.6 seconds.

The Architecture That Was Killing Us

v0.1.7 had a straightforward design: one B-Tree, one RwLock. Clean. Wrong.

```
SELECT WHERE col = ?    →  acquire read lock   →  traverse B-Tree  →  return
Background index build  →  acquire write lock  →  bulk insert     →  release
```

When those two paths hit the same lock simultaneously, queries queued behind the writer. With twelve instances, the queue grew faster than it drained. The system looked alive — threads were running, memory was allocated — but nothing was making progress.
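To make the contention concrete, here's a minimal sketch of that single-lock shape. The names are stand-ins (a plain `BTreeMap` plays the role of GenericBTree), not moteDB's actual code:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical sketch of the v0.1.7 layout: every reader and the
// background builder contend on the same RwLock.
struct ColumnIndex {
    tree: RwLock<BTreeMap<String, u64>>, // stand-in for GenericBTree
}

impl ColumnIndex {
    fn get(&self, key: &str) -> Option<u64> {
        // SELECT WHERE col = ? : blocks for the entire time a writer holds the lock
        self.tree.read().unwrap().get(key).copied()
    }

    fn bulk_insert(&self, rows: Vec<(String, u64)>) {
        // Background build: one long write-lock hold stalls every reader
        let mut tree = self.tree.write().unwrap();
        for (k, v) in rows {
            tree.insert(k, v);
        }
    }
}
```

With one instance the read path merely stutters; with twelve instances hammering `get()` while `bulk_insert()` runs, the read queue grows faster than it drains.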

I needed a different model. Here's what I landed on:

Two-layer architecture, RocksDB style:

  1. IndexMemBuffer — an in-memory BTreeMap with parking_lot::RwLock (nanosecond-level contention). Writes go here first.
  2. GenericBTree — the on-disk B-Tree. Reads through here. Writes only happen during background drain.
  3. drain_lock (Mutex) — serializes the buffer-to-B-Tree migration using try_lock so writers never block.
  4. tombstones (HashSet) — tracks deleted keys so drained buffers don't resurrect data.

When the memory buffer exceeds a threshold, it atomically flips to an immutable snapshot. The drain thread picks it up and builds the B-Tree without blocking readers. New writes hit the new active buffer.
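A minimal sketch of that write path, with std locks standing in for parking_lot and hypothetical names throughout:

```rust
use std::collections::{BTreeMap, HashSet};
use std::sync::{Mutex, RwLock};

// Illustrative two-layer index: writes land in the mem buffer, and a drain
// is attempted with try_lock so the writing thread never blocks on it.
struct TwoLayerIndex {
    mem: RwLock<BTreeMap<String, u64>>,  // IndexMemBuffer stand-in
    disk: RwLock<BTreeMap<String, u64>>, // GenericBTree stand-in
    tombstones: RwLock<HashSet<String>>, // deleted keys: drains must not resurrect them
    drain_lock: Mutex<()>,               // serializes the buffer-to-B-Tree migration
    threshold: usize,
}

impl TwoLayerIndex {
    fn insert(&self, key: String, val: u64) {
        let should_drain = {
            let mut mem = self.mem.write().unwrap();
            mem.insert(key, val);
            mem.len() >= self.threshold
        };
        if should_drain {
            // try_lock: if another thread is already draining, skip instead of blocking
            if let Ok(_guard) = self.drain_lock.try_lock() {
                // Atomically swap the full buffer for a fresh empty one
                let snapshot = std::mem::take(&mut *self.mem.write().unwrap());
                let dead = self.tombstones.read().unwrap();
                let mut disk = self.disk.write().unwrap();
                for (k, v) in snapshot {
                    if !dead.contains(&k) {
                        disk.insert(k, v);
                    }
                }
            }
        }
    }

    fn get(&self, key: &str) -> Option<u64> {
        if self.tombstones.read().unwrap().contains(key) {
            return None;
        }
        // Newest data first: mem buffer, then the drained B-Tree
        self.mem.read().unwrap().get(key).copied()
            .or_else(|| self.disk.read().unwrap().get(key).copied())
    }
}
```

In moteDB the drain runs on a background thread rather than inline, but the key property is the same: a writer that loses the `try_lock` race just moves on, and readers never wait behind a bulk insert.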

There's also a TOCTOU fix: get() holds the same lock through both the tombstone filter and the LRU cache write, eliminating the race window where a key could be deleted between the two operations.
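A sketch of the idea behind that fix, with hypothetical names: one guard spans both the tombstone check and the cache write, so no delete can land between them.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Illustrative read path: tombstone filter and cache insert share one lock.
struct ReadPath {
    // (tombstones, cache) guarded together; HashMap stands in for the LRU
    state: Mutex<(HashSet<String>, HashMap<String, u64>)>,
}

impl ReadPath {
    fn get(&self, key: &str, fetch: impl Fn() -> u64) -> Option<u64> {
        // One guard covers the filter AND the cache write: no TOCTOU window
        let mut guard = self.state.lock().unwrap();
        let (tombstones, cache) = &mut *guard;
        if tombstones.contains(key) {
            return None; // deleted: never cache a resurrected value
        }
        Some(*cache.entry(key.to_string()).or_insert_with(fetch))
    }

    fn delete(&self, key: &str) {
        let mut guard = self.state.lock().unwrap();
        guard.0.insert(key.to_string());
        guard.1.remove(key);
    }
}
```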

The result speaks for itself:

```
bench_column_index runtime: 3+ hours → 6.6 seconds
```

Three Phases of Performance Work

Beyond the core lock contention, I spent the release cycle on three performance phases.

Phase 1: Memory Layout

  • Arc<DataEntry> eliminated full-row memcpy on every get()
  • Non-vector tables got their own BTreeMap instead of the generic wrapper, saving 24 bytes per row

At 100K rows, that's roughly 10MB of memory saved without touching any query logic.
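The Arc change can be illustrated like this (hypothetical names): `get()` returns a refcount bump, roughly the cost of one atomic increment, instead of copying the row payload.

```rust
use std::sync::Arc;

// Hypothetical row type; in moteDB this is the DataEntry struct.
struct DataEntry {
    cols: Vec<String>,
}

struct Table {
    rows: Vec<Arc<DataEntry>>,
}

impl Table {
    fn get(&self, i: usize) -> Option<Arc<DataEntry>> {
        // Arc::clone copies a pointer and bumps a refcount;
        // the row payload itself is never memcpy'd.
        self.rows.get(i).map(Arc::clone)
    }
}
```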

Phase 2: Syscall Reduction

  • DiskANN insertion path optimized
  • SQ8Vectors persistent file handle reuse

Every cache miss was triggering 2 syscalls. This phase eliminated that overhead at the I/O layer.
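A sketch of the handle-reuse pattern (hypothetical names): open the file once, then serve each cache miss with a positioned read. On Unix, `read_exact_at` maps to pread, so there's no open/close pair and no separate seek per miss.

```rust
use std::fs::File;
use std::os::unix::fs::FileExt; // positioned reads (pread) on an open handle

// Illustrative persistent-handle wrapper for a vector file.
struct VectorFile {
    handle: File, // opened once, kept for the lifetime of the index
}

impl VectorFile {
    fn read_at(&self, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let mut buf = vec![0u8; len];
        // pread takes &self, so concurrent readers need no lock
        // and no shared cursor to fight over.
        self.handle.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    }
}
```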

Phase 3: Space Index and FTS

  • i-Octree now uses Morton codes for batch loading, with leaf nodes filled in order to minimize tree splitting
  • LSM scan_range() switched to streaming scan instead of materializing everything
  • FTS switched to append-only sharded writes, with delayed merge triggering when a shard hits 5 segments
  • Columnar predicate pushdown: decode the timestamp column first to locate rows, then decode target columns on demand — avoids decoding columns that were already filtered out
  • Spatial query row cache + removing per-row HashMap allocation: 8000x speedup on spatial range queries
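For illustration, here's Morton (Z-order) encoding in 2-D as a batch-load sort key; the octree uses the 3-D analogue. Interleaving the coordinate bits makes spatially close points adjacent in sort order, so leaf nodes fill sequentially instead of splitting all over the tree.

```rust
// 2-D Morton code: interleave the bits of x and y into one sortable u64.
fn morton2(x: u32, y: u32) -> u64 {
    // Spread the 32 bits of v out to the even bit positions of a u64.
    fn spread(v: u32) -> u64 {
        let mut v = v as u64;
        v = (v | (v << 16)) & 0x0000_FFFF_0000_FFFF;
        v = (v | (v << 8))  & 0x00FF_00FF_00FF_00FF;
        v = (v | (v << 4))  & 0x0F0F_0F0F_0F0F_0F0F;
        v = (v | (v << 2))  & 0x3333_3333_3333_3333;
        v = (v | (v << 1))  & 0x5555_5555_5555_5555;
        v
    }
    spread(x) | (spread(y) << 1) // x on even bits, y on odd bits
}
```

Sorting points by `morton2` before bulk insert is what lets the loader fill each leaf in order.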

The Audit That Found 28 Problems

I ran three rounds of adversarial auditing before this release. I'm glad I did.

The findings were... extensive:

  • B-Tree: split leaf index out-of-bounds panic
  • Async index pipeline: double-insert causing text index panic
  • WAL compression + DiskANN: 3 separate deadlocks
  • close(): not notifying background threads before checkpoint
  • Column/text index: querying before async pipeline finished building — no fallback
  • SUM precision loss: switched from floating-point accumulation to a two-pass compensation algorithm
  • BTreeMap scan: materializing all results at once causing memory spikes
  • Primary key query: index missing after restart, no fallback to scan
  • glibc arena: concurrent crash on explicit db.close()

The glibc arena one was particularly fun. Under heavy concurrent close() calls, the arena allocator would crash: the teardown path simply wasn't safe to run from multiple threads the way the code was using it. Fixed by not calling close() from multiple threads simultaneously. Obvious in hindsight.

Edge Devices Finally Get Love

moteDB targets embedded and edge hardware. v0.2.0 has dedicated optimizations for that:

  • EdgeIndexConfig: DiskANN now has bounded memory index configuration, limiting graph memory footprint
  • FTS bounded shard counter + VersionStore eviction: prevents memory growth during long-running operations
  • Dead code cleanup: removed ~2200 lines, reducing binary size
  • Zero clippy warnings: everything compiles clean

Testing at Scale

The new test infrastructure handles the concurrency edge cases:

  • wait_for_indexes_ready() — polls pending_index_batches atomic counter for deterministic index readiness
  • CI adaptive data scaling — detects CI environment and automatically reduces test data volume
  • 749 new test cases, running under 4-thread concurrency, completing in ~3 minutes with zero hangs
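A sketch of what a helper like wait_for_indexes_ready() can look like (the signature here is hypothetical): poll the atomic counter against a deadline, so a stuck pipeline fails the test quickly instead of hanging it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Poll a pending-batch counter until it reaches zero or the deadline passes.
// Returns true if the index pipeline drained in time.
fn wait_for_indexes_ready(pending: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while pending.load(Ordering::Acquire) > 0 {
        if Instant::now() >= deadline {
            return false; // deterministic failure instead of a 3-hour hang
        }
        thread::sleep(Duration::from_millis(1)); // yield between polls
    }
    true
}
```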

The Numbers

Here's the full picture:

  • 35 commits, 89 source files changed
  • 28,118 lines added, 14,815 deleted
  • 11 performance optimizations
  • 21 bug fixes
  • 3 new features

The release is on crates.io: run cargo add motedb, or add motedb = "0.2.0" to your Cargo.toml.

If you're running moteDB on edge hardware or need a database that won't stall your queries while building indexes in the background, this one's worth upgrading to.

The benchmark suite no longer hangs.
