Veltrix Treasure Hunt at 100k Users: When the Embedded Query Engine Exploded

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The treasure hunt tables were skinny: hunt_id, user_id, event_time, status, score, a few JSON blobs. Veltrixs embedded engine uses BoltDB under the hood, an append-only B+tree on disk. At 100k users we had 80 million rows, 14 GB of data, and BoltDB was spending 400ms per read traversing 6 levels of B-trees just to find the leaf node. Concurrent writes during user submissions were blocking the compaction process, so the tree height kept growing. The Veltrix docs warned about this, but only for queries with wide rows > 1 MB. We had none of those. We had narrow rows and high concurrency. The real problem was the embedded engines single-writer lock per shard, not row size.

What We Tried First (And Why It Failed)

We first tried tuning the BoltDB open timeout and the mmap_size flag to 256 MB as the docs suggested. The latency dropped to 2800 ms. Still too high. Then we tried adding more BoltDB workers by increasing bolt_workers to 8. The latency flattened at 2400 ms, but P99 tail latencies spiked to 12 seconds during compaction. The compaction process itself became a write storm that blocked every query for 1.8 seconds every 3 minutes. We monitored with bolt_compact_duration_seconds_sum and saw the trend: compaction time scaled linearly with data size. At 150 GB it would be 30 seconds. We needed a different approach.

The second attempt was vertical scaling: beefier SSD disks, 16 vCPUs, 128 GB RAM. Cost tripled for a 2x latency improvement. The CEO asked why we couldnt just add more nodes. Veltrixs embedded engine is not designed to shard; it stores a single file per shard. Horizontal scaling would require splitting hunt_ids across clusters, which meant rewriting the entire treasure service and changing the data boundary. Thats when I realized the embedded engine was the wrong abstraction for our scale.

The Architecture Decision

We tore the embedded engine out and replaced it with a two-tier system: an in-memory write-through cache for active hunts, backed by a sharded PostgreSQL cluster for cold data. The cache tier used Redis Cluster with 12 shards, each 2 GB max memory, and a 60-second TTL for hunt metadata. The cold tier used PostgreSQL 15 with 6 read replicas and 3 standbys. We kept the hunt_id hash as the shard key, so each hunt lived entirely on one shard. This preserved data locality for each hunts timeline of events.

The migration tool veltrix_migrate_hunt ran offline during low-traffic hours. It read from BoltDB, batched writes into PostgreSQL COPY commands, and verified checksums with SHA-256. The first run took 45 minutes for 80 million rows. We replayed the last hour of live writes with logical replication using Debezium, which added about 6 MB/s throughput during peak. The cutover script flipped the DNS from the embedded engine to the new tier in 90 seconds with zero unavailability.

We accepted three tradeoffs:

Strong consistency: PostgreSQL serializable isolation for score updates, but higher latency for writes (40 ms vs 1 ms).
Operational complexity: monitoring two systems, Redis Cluster health, PostgreSQL replication lag.
Cost: Redis Cluster $2.4k/month, PostgreSQL $1.8k/month, plus monitoring stack.

The alternative was to stay on the embedded engine and limit hunt lifetimes to 7 days, purging old data nightly. That would have saved cost but killed the product feature that users loved: 30-day leaderboards.

What The Numbers Said After

After two weeks, the P99 query latency for treasure hunt queries dropped from 5123 ms to 89 ms. The 95th percentile stayed under 45 ms. Redis hit rate stabilized at 98.2 % for active hunts. PostgreSQL replication lag never exceeded 120 ms. We dropped the veltrix_treasure_query_latency_bucket alert threshold from 5000 ms to 150 ms.

Cost went from $4.2k/month for a single EC2 r6g.2xlarge with EBS gp3 to $4.3k/month for the new stack. The difference was offset by reduced CPU usage on the app layer: no more BoltDB contention.

The most surprising metric was hunt completion rate. With the embedded engine, 3.2 % of hunts failed to record the final score because the 12-second compaction window truncated the last write. With the new system, the failure rate dropped to 0.08 %. That directly translated to higher revenue from treasure completions.

What I Would Do Differently

I should have measured BoltDB compaction behavior earlier. We only started tracking bolt_compact_duration_seconds two weeks before the outage. If we had collected that metric from day one, we would have seen the slope: 0.03 ms per 1000 rows for the first month, then 0.15 ms per 1000 rows after 50 million rows. Thats the inflection point we missed.

I would also have pushed back harder on the embedded engine choice during the initial architecture review. The Veltrix docs say it scales to millions of users. Thats only true if you define scale as total users,