200MB for 1M documents is respectable for a Node.js solution. The file-based approach makes sense for a lot of use cases. One thing worth thinking about as you scale: how does the WAL handle concurrent writers?
In my experience building embedded databases (I work on a Rust-based one for edge/robotics), the WAL replay time at startup becomes the real bottleneck once you cross 10M+ operations. Batching WAL writes and using a checksum-based recovery instead of replaying every entry can cut startup from seconds to milliseconds.
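To make that concrete, here's roughly what I mean in TypeScript terms (every name here is invented for illustration — none of this is LioranDB's actual WAL format): buffer records in memory and flush them as one checksummed batch per fsync, so recovery can verify a batch at a time instead of walking every record.

```typescript
// Illustrative only: buffer WAL records and flush them as one checksummed
// batch per fsync, instead of one write + fsync per record.
import { promises as fs } from "node:fs";
import { createHash } from "node:crypto";

interface WalRecord {
  txId: number;
  op: string;
  payload: unknown;
}

class BatchedWal {
  private pending: WalRecord[] = [];

  constructor(private path: string, private maxBatch = 128) {}

  async append(rec: WalRecord): Promise<void> {
    this.pending.push(rec);
    if (this.pending.length >= this.maxBatch) await this.flush();
  }

  async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const body = Buffer.from(JSON.stringify(this.pending));
    // Checksum over the whole batch; a truncated SHA-256 stands in for CRC32.
    const sum = createHash("sha256").update(body).digest().subarray(0, 4);
    const header = Buffer.alloc(8);
    header.writeUInt32LE(body.length, 0);
    sum.copy(header, 4);

    // One write + one fsync per batch; recovery re-hashes each batch body and
    // stops at the first mismatch instead of validating every record.
    const fh = await fs.open(this.path, "a");
    try {
      await fh.write(Buffer.concat([header, body]));
      await fh.sync();
    } finally {
      await fh.close();
    }
    this.pending = [];
  }
}
```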
Also curious: does LioranDB support range queries or just point lookups? That's usually the first feature request that forces you to rethink your storage format from flat files to something like an SSTable or B+tree.
Impressive work for 5 months though, especially at 17. The embedded DB space could use more options. Keep going!
Thanks for the thoughtful reply and the kind words — coming from someone who's built embedded DBs in Rust, that means a lot!
On WAL & concurrency
You're spot on about WAL replay becoming painful at scale. Right now, LioranDB uses a simple append-only WAL with per-tx "op → commit → applied" records + CRC checks. Recovery replays only un-applied committed txs from the last checkpoint (dual A/B checkpoints for crash safety). It works well for the current target (sub-10M ops, quick startup), but I agree batching writes and optimizing the replay path (e.g., more aggressive checkpointing or a compact "applied up to LSN" marker) will be important soon.
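To sketch the idea (heavily simplified, with invented names — this isn't the actual recovery code):

```typescript
// Heavily simplified sketch: collect per-tx ops since the last checkpoint,
// then replay only transactions that committed but were never marked applied.
// A failed CRC means a torn tail, so replay stops at that point.
type WalEntry =
  | { kind: "op"; txId: number; crc: number; payload: unknown }
  | { kind: "commit"; txId: number }
  | { kind: "applied"; txId: number };

function planRecovery(entries: WalEntry[], checkpointLsn: number): Map<number, unknown[]> {
  const ops = new Map<number, unknown[]>();
  const committed = new Set<number>();
  const applied = new Set<number>();

  // The array index stands in for the LSN in this sketch.
  for (const [lsn, e] of entries.entries()) {
    if (lsn < checkpointLsn) continue; // already durable via the checkpoint
    if (e.kind === "op") {
      if (!crcOk(e)) break; // torn/corrupt tail: nothing after this is trusted
      const list = ops.get(e.txId) ?? [];
      list.push(e.payload);
      ops.set(e.txId, list);
    } else if (e.kind === "commit") {
      committed.add(e.txId);
    } else {
      applied.add(e.txId);
    }
  }

  // Only committed-but-not-yet-applied transactions need replaying.
  for (const txId of [...ops.keys()]) {
    if (!committed.has(txId) || applied.has(txId)) ops.delete(txId);
  }
  return ops;
}

// Stand-in: the real check recomputes a CRC over the serialized payload.
function crcOk(_e: { crc: number; payload: unknown }): boolean {
  return true;
}
```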
Concurrent writers are serialized through a dedicated writer queue (with optional backpressure: "wait" or "reject" mode + memory-pressure gating based on RSS/heap ratio). That keeps the WAL corruption-free, but it does cap write throughput at a single writer. For true multi-writer without full serialization, a segmented WAL or MVCC-like approach is on the roadmap.
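Conceptually the queue is just a promise chain with a pressure gate in front. Something like this simplified sketch (made-up names, and the heap-ratio check stands in for the real RSS/heap gating):

```typescript
// Simplified sketch (made-up names): all writes funnel through one promise
// chain; under pressure, new writes either wait in line or get rejected.
type BackpressureMode = "wait" | "reject";

class WriterQueue {
  private tail: Promise<unknown> = Promise.resolve();
  private depth = 0;

  constructor(private mode: BackpressureMode = "wait", private maxDepth = 1000) {}

  private underPressure(): boolean {
    const { heapUsed, heapTotal } = process.memoryUsage();
    // Hypothetical gate standing in for the real RSS/heap-ratio check.
    return this.depth >= this.maxDepth || heapUsed / heapTotal > 0.9;
  }

  enqueue<T>(write: () => Promise<T>): Promise<T> {
    if (this.underPressure() && this.mode === "reject") {
      return Promise.reject(new Error("write rejected: backpressure"));
    }
    this.depth++;
    const result = this.tail.then(write).finally(() => {
      this.depth--;
    });
    // Keep the chain alive even if an individual write fails.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```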
Range queries & storage format
Currently:
- Point lookups and simple equality / $in use secondary indexes (B-tree style on LevelDB-backed per-field indexes, with proper encoding for numbers, strings, dates, etc.).
- Range support ($gt, $gte, $lt, $lte) is implemented in the index layer and query planner for indexed fields (rough key-encoding sketch below).
- Queries on non-indexed fields fall back gracefully to full scans, and explain plans show index usage vs. full scan.
No full SSTables or LSM yet — storage is ClassicLevel (LevelDB) per collection + separate index DBs + WAL. It's simple and "good enough" for embedded use cases, but you're right that richer range + sorting workloads will push toward rethinking the on-disk format eventually.
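For anyone curious, the general trick behind the indexed range support looks roughly like this (a toy sketch over classic-level, not LioranDB's actual key layout or value encoding): index keys are built so that byte order matches value order, which turns $gt/$lt into plain iterator bounds.

```typescript
// Toy sketch only: index keys are "<field>\x00<encoded value>\x00<docId>",
// with values encoded so lexicographic order matches numeric order.
// $gt / $lt then map directly onto LevelDB iterator bounds.
import { ClassicLevel } from "classic-level";

// Toy encoding: integers only, offset so negatives still sort correctly.
// Real encodings are more compact and cover floats, strings, dates, etc.
function encodeNumber(n: number): string {
  return String(n + 1e15).padStart(16, "0");
}

// How an index entry's key would be built at write time.
function indexKey(field: string, value: number, docId: string): string {
  return `${field}\x00${encodeNumber(value)}\x00${docId}`;
}

// All docIds with gt < field value < lt, via a single range iteration.
async function rangeScan(
  db: ClassicLevel<string, string>,
  field: string,
  gt: number,
  lt: number
): Promise<string[]> {
  const ids: string[] = [];
  for await (const key of db.keys({
    // "\xff" after the encoded lower bound skips every docId entry for that
    // exact value, making the bound exclusive; the upper bound is exclusive
    // already because "<field>\x00<enc(lt)>" sorts before all of lt's entries.
    gt: `${field}\x00${encodeNumber(gt)}\xff`,
    lt: `${field}\x00${encodeNumber(lt)}`,
  })) {
    ids.push(key.split("\x00")[2]);
  }
  return ids;
}
```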
The goal so far has been MongoDB-like ergonomics + low overhead + encryption by default rather than competing on massive scale out of the gate. 200 MB RAM / ~300 MB disk for 1M docs (including indexes + WAL) felt like a solid starting point.
Appreciate the real-world perspective from the edge/robotics side — those constraints are brutal and super informative. If you have any war stories or specific patterns that helped with WAL replay or range-heavy workloads, I'd love to hear them (here or on the Discord).
Thanks again for checking it out and for the encouragement.