The Problem We Were Actually Solving
In 2023 Veltrix added a live treasure-hunt mode so Hytale players could dig up cosmetics while the game world evolved. The first version shipped with a single Node.js microservice called hunt-engine that accepted player positions via Redis pubsub, ran a Lua sandbox against a Postgres table called loot_tiles, and returned the result to the game client in under 100 ms. At launch the service handled 300 req/s on one c6g.large instance with a P99 latency of 82 ms. That looked healthy until the first global event, the Starlight Rush. Within 90 seconds hunt-engine's CPU was pinned at 100 %, the connection pool exhausted itself with "Too many connections" errors, and Postgres began to emit log messages like:
WARNING: worker took too long to start (waited 30027 ms)
The service never recovered. We rolled back the event and spent the next two weeks staring at CloudWatch graphs that told us the obvious: stateless compute and stateful storage don't mix when your traffic shape is a tsunami. The real problem was the operator experience. Every pager alert said CPU, but nobody could see that the bottleneck was really the 495 ms round-trip between hunt-engine and Postgres for every single loot lookup.
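A back-of-the-envelope check shows why round-trip time matters as much as CPU here. By Little's law, the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below applies that to the 300 req/s launch rate and the 495 ms round-trip. Pairing those two figures is an assumption for illustration (the request rate during the event is not recorded above), and inFlight is a hypothetical helper.

```go
package main

import "fmt"

// inFlight applies Little's law (L = rate * time in system): the average
// number of requests in the system, and so roughly the number of database
// connections held open at once when each request holds one.
func inFlight(reqPerSec, roundTripSec float64) float64 {
	return reqPerSec * roundTripSec
}

func main() {
	// Assumption: launch-day rate combined with the 495 ms round-trip.
	fmt.Printf("%.1f connections busy on average\n", inFlight(300, 0.495))

	// Hypothetical tenfold spike at the same round-trip time.
	fmt.Printf("%.1f connections busy on average\n", inFlight(3000, 0.495))
}
```

Under these assumptions roughly 150 connections are busy at the launch rate, and the figure grows linearly with both traffic and latency, so a slow lookup inflates connection demand even when no extra CPU work is being done.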
What We Tried First (And Why It Failed)
Our first instinct was vertical scaling: we moved the EC2 instance up to c6g.2xlarge and the Postgres instance up to db.r6g.2xlarge. The P99 latency dropped to 68 ms under synthetic load, but the bill tripled, and the next day we hit the same wall when a TikTok clip drove concurrent players from 3 k to 28 k in six minutes. A deeper look at pg_stat_activity showed 28 k idle-in-transaction connections, because hunt-engine opened a new connection for every request and never closed it. We tried connection pooling with PgBouncer, but the Lua sandbox inside hunt-engine used a Postgres driver that didn't support prepared statements, so every query was re-planned on the Postgres side, leading to statement timeout errors such as:
ERROR: canceling statement due to statement timeout
That version lasted 17 hours before we reverted.
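The underlying failure was unbounded concurrency against the database: every request took a connection and nothing capped how many were held. A minimal sketch of the general remedy, a counting semaphore that caps in-flight work and sheds load once the cap is reached, is below. It is illustrative only and not what the team shipped: hunt-engine was a Node.js service, and Limiter, Do, and ErrSaturated are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSaturated is returned when every slot is already in use.
var ErrSaturated = errors.New("limiter saturated")

// Limiter caps concurrent work using a buffered channel as a semaphore.
type Limiter struct{ slots chan struct{} }

// NewLimiter returns a Limiter that allows at most max concurrent calls.
func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// Do runs fn while holding a slot and releases the slot when fn returns.
// If no slot is free it fails fast instead of queueing.
func (l *Limiter) Do(fn func() error) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		return fn()
	default:
		return ErrSaturated
	}
}

func main() {
	lim := NewLimiter(1)

	// While the outer call holds the only slot, a second call is rejected.
	var inner error
	outer := lim.Do(func() error {
		inner = lim.Do(func() error { return nil })
		return nil
	})
	fmt.Println("outer:", outer, "inner:", inner)

	// The slot is released afterwards, so the next call succeeds.
	fmt.Println("after:", lim.Do(func() error { return nil }))
}
```

Failing fast keeps the number of open database calls bounded by the cap rather than by the number of connected players; a production version would more likely wait with a deadline than reject immediately.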
The Architecture Decision
We stopped trying to co-locate compute and storage. The new plan was three layers:
Stateless compute tier: Go service hunt-orchestrator running on Kubernetes (10 pods, horizontal pod autoscaler scaling to 50 pods on CPU >60 %). The orchestrator receives player positions via NATS.io subjects, not Redis pubsub, because NATS gives us backpressure via write deadlines.
In-memory tile cache: Redis Cluster (7 nodes, each r6g.xlarge) holding the last 100 k loot tiles, TTL 30 seconds. We chose Redis over Memcached because we needed partial-key lookups (tile coordinates x,y) and Lua scripting to keep the cache-hit path under 1 ms.
Persistence tier: Aurora PostgreSQL Serverless v2 with Data API enabled. We turned off the connection pool because the Data API gives us HTTP endpoints—no open connections to leak. We sharded the loot_tiles table by (world_id, tile_id) so each hunt-orchestrator instance only talks to the shard that owns the tile. Shard key was chosen by running a 24-hour traffic replay: choosing (world_id, tile_id) reduced cross-shard queries from 18 % to 0.8 %.
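To make the shard routing in the persistence tier concrete, here is a minimal sketch of mapping a (world_id, tile_id) pair to a shard by hashing the composite key. The post does not say whether the real routing is hash-based or range-based, nor how many shards exist, so shardFor, the FNV-1a hash, and the shard count of 8 are all assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// shardFor maps a (world_id, tile_id) pair to one of n shards by hashing
// the 16-byte composite key with FNV-1a. n must be greater than zero.
func shardFor(worldID, tileID uint64, n uint32) uint32 {
	var key [16]byte
	binary.BigEndian.PutUint64(key[:8], worldID)
	binary.BigEndian.PutUint64(key[8:], tileID)

	h := fnv.New32a()
	h.Write(key[:]) // Write on a hash.Hash never returns an error
	return h.Sum32() % n
}

func main() {
	const shards = 8 // hypothetical shard count
	for tile := uint64(0); tile < 4; tile++ {
		fmt.Printf("world=7 tile=%d -> shard %d\n", tile, shardFor(7, tile, shards))
	}
}
```

Because the function is deterministic, any orchestrator pod can compute the owning shard locally without a lookup table.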
The critical tradeoff was cache invalidation: when a tile is updated in Aurora, a DMS event fires a NATS message that flushes the tile from Redis. That adds 150 ms of latency on writes but keeps read P99 under 1 ms when the tile is hot. We accepted the extra write latency because players care more about lag in discovery than lag in loot distribution.
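The read path and the invalidation path can be sketched with an in-process map standing in for Redis. This is a simplified illustration, not the production code: TileCache, TileKey, and the load callback (a stand-in for the Aurora lookup) are hypothetical, and only the 30-second TTL comes from the design above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const tileTTL = 30 * time.Second // TTL from the cache tier description

// TileKey identifies a tile by world and tile ID.
type TileKey struct{ WorldID, TileID uint64 }

type entry struct {
	loot    string
	expires time.Time
}

// TileCache is an in-process stand-in for the Redis tier.
type TileCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	load  func(TileKey) string // stand-in for the database lookup
	items map[TileKey]entry
}

func NewTileCache(ttl time.Duration, load func(TileKey) string) *TileCache {
	return &TileCache{ttl: ttl, load: load, items: map[TileKey]entry{}}
}

// Get returns the cached value while it is fresh; otherwise it loads the
// tile and caches it. The lock is held during the load for simplicity.
func (c *TileCache) Get(k TileKey) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.items[k]; ok && time.Now().Before(e.expires) {
		return e.loot
	}
	loot := c.load(k)
	c.items[k] = entry{loot: loot, expires: time.Now().Add(c.ttl)}
	return loot
}

// Invalidate drops a tile, as the change-event consumer would on an update.
func (c *TileCache) Invalidate(k TileKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, k)
}

func main() {
	k := TileKey{WorldID: 1, TileID: 2}
	db := map[TileKey]string{k: "bronze hat"}
	cache := NewTileCache(tileTTL, func(key TileKey) string { return db[key] })

	fmt.Println(cache.Get(k)) // miss: loaded from db and cached
	db[k] = "gold hat"        // the write lands in the database first
	fmt.Println(cache.Get(k)) // still the cached value until invalidated
	cache.Invalidate(k)       // the change event arrives
	fmt.Println(cache.Get(k)) // reloaded from db
}
```

The window between the database write and the Invalidate call is where a reader can still see the old tile; the TTL bounds that window if an invalidation message is ever lost.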
What The Numbers Said After
After the rewrite we ran the same Starlight Rush load test:
- hunt-orchestrator CPU utilization never exceeded 48 %
- NATS message latency P99 stayed at 8 ms even at 120 k req/s
- Redis hit rate stabilized at 99.2 %
- Aurora Data API P99 latency was 28 ms, down from 495 ms
- The bill per million requests dropped from $1.87 to $0.63 because we right-sized the Aurora cluster from 2 ACUs to 0.5 ACUs during idle hours.
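For context on the 48 % CPU figure: it sits below the 60 % autoscaler target from the compute tier. Kubernetes documents the core HPA rule as desired = ceil(current × currentMetric / targetMetric). The sketch below works that arithmetic with the 10 and 50 pod bounds; treating 10 as the configured minimum is an assumption, desiredReplicas is a hypothetical helper, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA formula and clamps the result to
// the configured bounds. Tolerance and stabilization are not modelled.
func desiredReplicas(current int, cpuPct, targetPct float64, minPods, maxPods int) int {
	want := int(math.Ceil(float64(current) * cpuPct / targetPct))
	if want < minPods {
		return minPods
	}
	if want > maxPods {
		return maxPods
	}
	return want
}

func main() {
	fmt.Println(desiredReplicas(10, 48, 60, 10, 50))  // 10: below target, held at the minimum
	fmt.Println(desiredReplicas(10, 90, 60, 10, 50))  // 15: scale out
	fmt.Println(desiredReplicas(40, 100, 60, 10, 50)) // 50: capped at the maximum
}
```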
The only scary moment was when one pod in hunt-orchestrator hit a Go runtime bug that leaked 2 GB of memory in 90 seconds. Kubernetes restarted the pod automatically (liveness probe 5 s / readiness probe 2 s), and the NATS consumer resumed at offset 176032 without losing any messages. That failure mode was cheaper than waiting for a human to spot the memory leak.
What I Would Do Differently
I would not introduce the Lua sandbox at all. The first version used Lua so designers could hot-patch loot rules without a deploy. In practice that flexibility caused more downtime than it saved: Lua panics crashed hunt-engine, and hot-patching required a full rollback because the sandbox wasn't memory-safe. Today we embed the rules in Go structs and push them via GitOps; the designer workflow is slower, but the service is rock-solid.
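As a rough illustration of rules held in Go structs, the sketch below defines a rule type, a matcher, and a validation pass that could run in CI before a rules change is merged. The post does not show the real structs, so LootRule, its fields, pick, and validate are all hypothetical.

```go
package main

import "fmt"

// LootRule is a hypothetical rule shape for illustration only.
type LootRule struct {
	Cosmetic string
	MinDepth int      // shallowest dig depth at which the rule applies
	WorldIDs []uint64 // empty means the rule applies in every world
}

// applies reports whether the rule matches a dig in the given world and depth.
func (r LootRule) applies(worldID uint64, depth int) bool {
	if depth < r.MinDepth {
		return false
	}
	if len(r.WorldIDs) == 0 {
		return true
	}
	for _, id := range r.WorldIDs {
		if id == worldID {
			return true
		}
	}
	return false
}

// pick returns the cosmetic of the first matching rule, in declaration order.
func pick(rules []LootRule, worldID uint64, depth int) (string, bool) {
	for _, r := range rules {
		if r.applies(worldID, depth) {
			return r.Cosmetic, true
		}
	}
	return "", false
}

// validate rejects malformed rules before they can reach production.
func validate(rules []LootRule) error {
	for i, r := range rules {
		if r.Cosmetic == "" || r.MinDepth < 0 {
			return fmt.Errorf("rule %d is invalid: %+v", i, r)
		}
	}
	return nil
}

func main() {
	rules := []LootRule{
		{Cosmetic: "starlight cape", MinDepth: 10, WorldIDs: []uint64{7}},
		{Cosmetic: "copper ring", MinDepth: 0},
	}
	if err := validate(rules); err != nil {
		panic(err)
	}
	fmt.Println(pick(rules, 7, 12))
	fmt.Println(pick(rules, 3, 12))
}
```

A malformed rule then fails a build or a review step instead of crashing the running service, which is the trade the slower designer workflow buys.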
Second, I would not have chosen Aurora Serverless v2 for this workload. At steady state the database was always on the minimum capacity (0.5 ACUs), but the cold-start penalty added 300 ms to the first hunt request after an idle period. We mitigated it by running a small Aurora Provisioned instance ($72 / month) reserved for the first 90 seconds of every event. If I had to do it again, I would have stayed on provisioned Aurora and used the savings to buy a larger Redis node for better hit rates.
Finally, I would instrument the hunt-orchestrator with eBPF flame graphs from day one. The initial Go build didn't include frame pointers, so when we hit 50 pods the stack traces were useless. Adding GOFLAGS="-gcflags=all=-N -l" to the Dockerfile (these flags turn off compiler optimizations and inlining, so every function keeps its own stack frame) and shipping the symbols to Grafana Phlare cost us one deploy cycle but saved weeks of debugging.