Lillian Dube


The Day the Treasure Hunt Engine Buried Itself Alive

The Problem We Were Actually Solving

Our first production spike came during a Black Friday weekend when the hunt feed exceeded 180k concurrent sessions. The Rails process memory ballooned to 4.2 GB, the P99 latency for /hunt/start hit 5.2 seconds, and the New Relic trace showed 72% of that time inside Veltrix's TemplateResolver#evaluate. Every 90 seconds the error log spewed Psych::SyntaxError: (<unknown>): found character that cannot start any token while parsing a block mapping at line 23 column 10. I traced that to a YAML merge key (<<: *defaults) that Veltrix expanded into 4 MB of embedded ERB templates at runtime. The merge keys were undocumented, so when marketing duplicated the defaults block in every hunt definition to change one variable, the resolver exploded.
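For readers who have not met YAML merge keys, here is a minimal Ruby sketch of how <<: *defaults pulls a whole defaults block into a hunt definition. Veltrix's real schema is not shown in this post, so the key names (defaults, black_friday, reward, template) are made up for illustration.

```ruby
require "yaml"

# Hypothetical hunt file; the keys are illustrative, not Veltrix's schema.
HUNT_YAML = <<~YAML
  defaults: &defaults
    reward: 100
    template: "<h1><%= title %></h1>"
  black_friday:
    <<: *defaults
    reward: 250
YAML

# safe_load rejects anchors and aliases unless aliases: true is passed.
doc = YAML.safe_load(HUNT_YAML, aliases: true)
hunt = doc["black_friday"]

# The hunt now carries every key from defaults, with reward overridden.
puts hunt.inspect
```

Every definition that merges the defaults carries the full set of embedded templates, which is why a large defaults block multiplies quickly once it is copied into each hunt.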

What We Tried First (And Why It Failed)

I replaced the YAML parser with SafeYAML, set safe_load: true, and wrapped every template in a literal block. P99 dropped to 800 ms, but the heap still grew 200 MB per hunt instance because SafeYAML couldn't garbage-collect the expanded ERB trees. Next, I tried caching evaluated templates in Redis with a TTL of 30 minutes. The cache key looked like vx:template:sha256(erb_string), but the SHA calculation itself took 12 ms, more than the original YAML parse. The Ruby profiler (ruby-prof -p stack) showed the bottleneck was in OpenSSL's digest for every hunt variant. Finally, I rewrote the resolver in Go and used text/template with a precompiled map of functions. Memory flatlined at 180 MB, but the Go side introduced a 50 ms network hop because we still had to deserialize the hunt definition in Ruby.
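The Redis attempt can be sketched roughly as follows. This is an approximation under stated assumptions, not the code we ran: a plain Hash stands in for Redis (so the 30-minute TTL is not modeled), Digest::SHA256 stands in for whichever digest call the app used, and the helper names template_cache_key and fetch_rendered are hypothetical.

```ruby
require "digest"
require "erb"

# Builds the key shape described above: a prefix plus the SHA-256 of the ERB source.
def template_cache_key(erb_string)
  "vx:template:#{Digest::SHA256.hexdigest(erb_string)}"
end

# store is any Hash-like object; a plain Hash stands in for Redis here.
def fetch_rendered(store, erb_string)
  key = template_cache_key(erb_string) # hashed on every call, hit or miss
  store[key] ||= yield
end

store = {}
source = "Welcome, <%= name %>!"
2.times do
  # The block only runs on a cache miss, but the digest runs both times.
  fetch_rendered(store, source) { ERB.new(source).result_with_hash(name: "hunter") }
end
puts store.size
```

The point of the sketch is the comment on the key line: the whole template string is hashed on every lookup, so a multi-megabyte ERB source pays the digest cost even when the cache hits. A key built only from the template source also ignores the render context, so output that varies per request would need the context in the key as well.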

The Architecture Decision

We removed Veltrix entirely. Instead, we moved hunt metadata into two Postgres tables: hunts and hunt_variants. The variants table stored the ERB string and a compiled_hash column precomputed by a background job. At runtime the Rails controller executed HuntService.render_variant(variant_id, context), which called a small Rust extension via FFI (libhunt). The Rust layer cached compiled templates in a HashMap<String, Template> protected by a single Mutex. The lookup went from 180 lines of YAML merge madness to a 20-line CTE that joined hunts and hunt_variants in 3 ms. The FFI cost 0.8 ms per call, but it was cheaper than any network hop and kept the GC pause under 10 ms.
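The real cache lives in Rust, but the pattern is easy to show in Ruby: compile each template once, keep it in a map, and guard the map with one lock. The sketch below is an analogue of that design, not the libhunt code; the class name TemplateCache and the key "variant-1" are hypothetical, and ERB stands in for the Rust Template type.

```ruby
require "erb"

# Ruby analogue of a HashMap<String, Template> behind a single Mutex.
class TemplateCache
  def initialize
    @lock = Mutex.new
    @compiled = {}
  end

  # Compiles the ERB source at most once per key and returns the cached template.
  def fetch(key, source)
    @lock.synchronize { @compiled[key] ||= ERB.new(source) }
  end

  # Number of compiled templates currently held.
  def size
    @lock.synchronize { @compiled.size }
  end
end

cache = TemplateCache.new
template = cache.fetch("variant-1", "Clue <%= n %> of <%= total %>")

# Rendering happens outside the lock, so only compilation is serialized.
puts template.result_with_hash(n: 1, total: 5)
```

A single lock is the simplest option and was enough for us; a cache under heavier write contention might prefer a read-write lock or sharded maps.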

Trade-offs were clear: we lost the dynamic YAML hot-reload, but gained predictable memory, a single code path for all hunt variants, and the ability to version control the templates in Git. The Rust FFI added a 4 MB shared library to each Docker image, but our final image size dropped from 340 MB to 130 MB because we stripped SafeYAML and the merge-key parser.

What The Numbers Said After

After the cutover we ran a controlled load test with Locust simulating 10k concurrent hunts. The P99 for /hunt/start dropped to 115 ms, the peak RSS per pod stayed below 400 MB, and the garbage collector ran once every 30 seconds instead of on every request. The error rate for malformed templates fell to zero; a compile-time check in the background job caught 47 syntax errors that would have crashed the Ruby version. The cost per million requests in our AWS Cost Explorer went from $0.47 to $0.12 because we halved the pod count during peak hours.
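A compile-time check of this kind can be approximated in pure Ruby. The sketch below is not our background job; it assumes CRuby (RubyVM::InstructionSequence is MRI-specific), and the helper name erb_syntax_error is hypothetical. It compiles the Ruby code that ERB generates without running it, so a broken template is reported before it ever reaches a request.

```ruby
require "erb"

# Returns nil when the template compiles, or the error message when it does not.
# The generated code is compiled but never executed.
def erb_syntax_error(source)
  RubyVM::InstructionSequence.compile(ERB.new(source).src)
  nil
rescue SyntaxError => e
  e.message
end

good = "<% if show %>You found it!<% end %>"
bad  = "<% if show %>You found it!" # missing <% end %>

puts erb_syntax_error(good).nil?
puts erb_syntax_error(bad).nil?
```

Note the explicit rescue of SyntaxError: it is not a StandardError, so a bare rescue would let it escape.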

What I Would Do Differently

I would not have let Veltrix live for six months. Its merge keys and embedded ERB were a ticking memory bomb. I also would have pushed harder to move the entire template evaluation into Rust from day one instead of using it only for the hot path. The FFI boundary added just enough latency to make me second-guess every micro-optimization, and that distraction cost us two sprints.

If I had to do it again, I'd write a tiny Go microservice for template rendering from the start, use go:embed to bundle the templates at compile time, and expose a gRPC endpoint with a 100-byte protobuf. Then I would delete Veltrix's Git repository entirely.



