When Your Treasure Hunt Engine Becomes a Memory Leak Factory

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Veltrix is a real-time event platform that powers in-game treasure hunts for 8 million concurrent users. Each hunt runs for exactly 30 minutes, then immediately releases 20,000 loot crates that disappear in under 2 seconds. The hunt duration, spawn rate, loot table ID, and geofence polygons are all configurable per-campaign, and we built a YAML-driven engine to let marketers upload new campaigns without code deploys.

What we didnt model was the operator. Product managers would paste a 300-line YAML file at 2 a.m., forget to set the timeout-grace-period flag, and the Cassandra write queue would fill faster than we could compact. Our TTL was 3 days, but tombstones piled up in minutes because hundreds of intermediate writes with the same primary key landed at millisecond offsets. The nodetool tablestats output showed 82 million tombstones in the hunt_events table; read latency climbed to 3.2 seconds while compaction backed up for 47 minutes.

What We Tried First (And Why It Fail)

We started with a single campaign.yaml file checked into Git, then added an S3 bucket for campaign packs. The first campaign with 12 dynamic zones loaded fine: 100k writes/minute, tombstones under 2k. Then marketing wanted to A/B test three different spawn curves. We appended three new YAML files to the same bucket and pointed the engine to the folder. The operator set the bucket prefix via environment variable.

Thats when we learned that the Java drivers CqlSession caches prepared statements per keyspace and per file path. When the prefix changed, the driver issued two full prepare rounds: one to invalidate the old cached statements, another to prepare the new ones. Each prepare round took 4.3 seconds on a 12-node DSE cluster with 30 tables. Our 30-minute hunt now had a 4-second cold path before the first spawn. After three campaigns, the prepared-statement cache grew to 48k entries, and heap climbed to 2.8 GB per instance. GC STW events hit 300 ms every 45 seconds.

We tried mitigating with a single prepared statement template and dynamic bind variables, but the ORM layer (we were using DataStaxs legacy mapper) insisted on binding column names, not values. The error message MapperParsingException: Unknown property 'spawnCurveA' became our nightly wake-up call.

The Architecture Decision

We ripped out the YAML folder entirely and replaced it with a versioned API.

Each campaign became a single JSON document stored in PostgreSQL. The document schema enforced FKs to loot tables, geofence polygons, and spawn curves, all pre-validated.
We added an in-memory, versioned cache with a 10-second invalidation window. Each campaign version carries a monotonically increasing versionId; the engine always reads the latest committed version unless the operator explicitly pins an older one.
We turned off prepared-statement caching at the driver level (prepareOnAllNodes=false) and switched to the newer CqlSessionBuilder.withPreparedStatements(false) so the driver only prepares once per statement template, not once per file.
We built a small CLI: veltrix config deploy --campaign-id 12345 --dry-run which runs the schema validator against a shadow database before pushing to production. The dry run caught SQL injection attempts and geofence polygons covering the Pacific Ocean.

The trade-off was no longer file-based diffs; instead, we gained a single source of truth with strict schema and transactional writes. The cost was one extra hop to PostgreSQL, but we already had a 3-node Patroni cluster with <1 ms local reads, so the latency budget stayed under 5 ms for 99.9 % of requests.

What The Numbers Said After

After the cutover, we ran the same 20,000-loot drop test.

Tombstone count in hunt_events dropped from 82 million to 12k over 7 days.
Write latency p99 stayed at 28 ms, with a 95th percentile of 42 ms.
Prepared-statement cache size stabilized at 182 entries.
Heap usage fell from 2.8 GB to 512 MB; minor GC pauses dropped to 5 ms.
Campaign deployment latency (from CLI to observable traffic) went from 14 minutes (manual YAML + cache warm) to 38 seconds (DB commit + cache invalidation).
The worst-case operator mistake was now a foreign-key violation, which PostgreSQL rejected in 12 ms before any writes landed.

What I Would Do Differently

I would never let an operator edit raw YAML in production again. The abstraction of a configuration file is seductive until a misplaced colon destroys your cluster. A database with schema, versioning, and dry-run validation is a small operational tax that pays for itself the first time an operator accidentally sets spawnRate to zero and crashes the hunt.

We also over-optimised the cache too early. Our first cut used a 5-second window, which still caused drift when two campaigns deployed within the window. After two rollbacks in staging, we stretched it to 10 seconds—enough to absorb operator latency but short enough to prevent stale reads. Measure cache staleness, not just hit rate.

Finally, I regret the ORM layer. The DataStax mapper is deprecated, and it fought us every step of the way. If I rebuilt today, Id use raw CQL via the drivers builder pattern and hand-roll the bind maps. The extra compile-time safety isnt worth runtime GC pressure and the occasional MapperParsingException at 2 a.m.