The Problem We Were Actually Solving
The hunt engine ran on Veltrix 1.6, a LuaJIT micro-framework we had bolted together in three weeks so the art team could script events. Every hunter spawned a coroutine, every coroutine did an EVALSHA against Redis to atomically award loot, then wrote the result to a single hunt_session table using Postgres 12 with fsync=on.
At 300 hunters the coroutine scheduler was still fine, but the Redis call grew from 0.4 ms to 42 ms once the connection pool had 20 active slots. We watched RESP_PROTOCOL_ERROR spike to exactly 413 occurrences in sixty seconds. Postgres autovacuum started running at 60 s intervals because the loot table churned two million rows per day, and each freeze added 400–600 ms to INSERT latency. The engine's P99 latency climbed to 1.8 s, then clients started timing out.
What we needed was a service boundary that could absorb a 100× traffic spike without re-architecting the whole hunt script engine.
What We Tried First (And Why It Failed)
First, we upgraded Redis to 7.0 and enabled pipelining inside the Lua script. That dropped EVALSHA to 6 ms, but now the coroutine scheduler itself became the bottleneck—LuaJIT coroutines are cheap, but 3 400 of them suspending on a 6 ms Redis call created 20 k context switches per second. The kernel showed run queue 22 with 85 % steal time on the bare-metal box.
Next, we split the hunt session into two tables: hunt_session_metadata and hunt_session_loot. We added an index on (hunter_id, hunt_id) and turned fsync off for the metadata table. Autovacuum still ran, but the freeze time fell to 80 ms. The P99 latency dropped to 550 ms—good, but still above the 200 ms SLA we promised streamers.
Then we tried a managed Postgres with PgBouncer in transaction mode. For 300 hunters the latency looked perfect, but when traffic climbed to 3 000 hunters PgBouncer hit max_client_conn=100 and started rejecting connections. The error message was: "pgbouncer 1.17.0, ERROR rejecting connection because server b1 has 101 active connections".
We realized we had optimized for the wrong layer: the bottleneck wasn't the database, it was the LuaJIT engine treating each hunter as a persistent coroutine. The engine assumed state would fit in RAM, but at 3 000 hunters the RSS grew to 8 GB and the allocator started stalling.
The Architecture Decision
We drew an explicit service boundary around the LuaJIT engine.
Every hunt became a stateless, short-lived process called hunt_worker. Instead of spawning a coroutine per hunter, we fork-exec'd one hunt_worker per hunt with the hunter_id as its only argument. hunt_worker ran a single LuaJIT VM, executed the treasure script in <50 ms, and exited. No coroutine context switches and no connection pooling inside the worker.
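The one-process-per-hunt pattern can be sketched in a few lines. This is only an illustration, not our launcher: WORKER_CMD is a hypothetical stand-in for the real hunt_worker binary, and the timeout is deliberately generous because a Python child starts far slower than a LuaJIT one.

```python
import subprocess
import sys

# Hypothetical stand-in for the hunt_worker binary: a one-off process that
# prints a line for the hunter_id it received and then exits.
WORKER_CMD = [sys.executable, "-c", "import sys; print('loot for ' + sys.argv[1])"]


def run_hunt(hunter_id: str, timeout_s: float = 5.0) -> str:
    """Spawn one short-lived worker for a single hunt and return its stdout."""
    result = subprocess.run(
        WORKER_CMD + [hunter_id],  # hunter_id is the only argument
        capture_output=True,
        text=True,
        timeout=timeout_s,         # kill the worker if it overruns
        check=True,                # a non-zero exit raises CalledProcessError
    )
    return result.stdout.strip()
```

Because the worker holds no state between hunts, a crash or timeout costs exactly one hunt and leaves nothing to clean up.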
The hunt_worker image itself is a Docker multi-stage build with LuaJIT 2.1, the compiled hunt bytecode, and a stripped-down musl libc. We push it to our private ECR repo tagged with the bytecode hash. The worker starts in 8 ms and dies in 50 ms—perfect for scaling to zero when traffic drops.
We placed hunt_worker behind an Envoy proxy that uses consistent hashing on hunter_id to route sessions to the same worker pod. If a pod dies, Envoy retries on another pod; the proxy guarantees at-least-once delivery, so hunt_worker must be idempotent.
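To show why consistent hashing keeps a hunter on the same pod, here is a minimal hash ring. It is a sketch of the general technique, not Envoy's implementation; the HashRing class, the pod names, and the vnodes count are all made up for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, pods, vnodes=100):
        # Each pod owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, hunter_id: str) -> str:
        # Pick the first ring point clockwise from the key, wrapping around.
        idx = bisect.bisect(self._points, _hash(hunter_id)) % len(self._ring)
        return self._ring[idx][1]
```

Removing a pod only remaps the hunters that pod owned; everyone else keeps their pod, which is the property that matters when a pod dies mid-hunt.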
On the persistence side we moved loot writes out of the hunt session table entirely. hunt_worker emits a single row to a firehose-style event table loot_events (hunter_id, hunt_id, loot_id, timestamp) via a fire-and-forget HTTP POST to an internal Kafka REST proxy. A separate aggregator service reads the stream and materializes the hunt_session table every fifteen minutes. This gives us eventual consistency on hunt progress while keeping the P99 write latency to 12 ms.
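The aggregator's job can be pictured as a fold over loot_events. The sketch below is not the production service; it assumes events arrive as dictionaries with the four columns listed above, and that a given loot_id is awarded at most once per session, so a repeated event can be treated as a redelivery and dropped.

```python
from collections import defaultdict


def materialize(events):
    """Fold loot events into one loot list per (hunter_id, hunt_id) session.

    Assumption: a loot_id is awarded at most once per session, so a repeated
    event is a redelivery and safe to ignore.
    """
    sessions = defaultdict(set)
    for event in events:
        key = (event["hunter_id"], event["hunt_id"])
        sessions[key].add(event["loot_id"])
    # Sort for a deterministic result regardless of arrival order.
    return {key: sorted(loot) for key, loot in sessions.items()}
```

Folding into a set is what makes replaying the same batch harmless, at the cost of not being able to represent the same loot dropping twice in one hunt.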
We chose Kafka REST proxy (Confluent 7.5) over Kafka binary protocol because our ingress tier already speaks HTTP. The REST proxy buffers writes for 10 ms before flushing to the broker; at 3 000 hunters we observed 2.3 MB/s ingress with zero broker-side backpressure.
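For readers who have not used the REST proxy, this is roughly what one loot event looks like as an HTTP request. It is a sketch that assumes the v2 JSON embedded format and keys records by hunter_id; the base URL is a placeholder, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_loot_request(base_url, hunter_id, hunt_id, loot_id, timestamp):
    """Build (but do not send) a POST that produces one loot_events record."""
    body = {
        "records": [
            {
                # Keying by hunter_id is an assumption, chosen so that one
                # hunter's events land on the same partition.
                "key": hunter_id,
                "value": {
                    "hunter_id": hunter_id,
                    "hunt_id": hunt_id,
                    "loot_id": loot_id,
                    "timestamp": timestamp,
                },
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/topics/loot_events",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen with a short timeout would send it; a true fire-and-forget caller would also ignore the response body.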
The entire worker layer autoscales via KEDA using the envoy_hunter_requests_per_second metric exported by Envoy. We set the scale target to 500 RPS per pod, and the HPA checks every fifteen seconds. When traffic drops to zero the pods scale to zero in 42 s.
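The scaling decision boils down to a ceiling division of total traffic by the per-pod target. The helper below is a simplified sketch of that arithmetic: it ignores the HPA's tolerance band and stabilization windows, and max_pods is a hypothetical cap rather than a value from our config.

```python
import math


def desired_pods(total_rps: float,
                 target_rps_per_pod: float = 500.0,
                 max_pods: int = 2000) -> int:
    """Return the pod count needed to keep each pod at or below the target."""
    if total_rps <= 0:
        return 0  # no traffic: the scale-to-zero case
    return min(max_pods, math.ceil(total_rps / target_rps_per_pod))
```

For example, 1 250 RPS against a 500 RPS target rounds up to 3 pods. In the real system KEDA handles the step between zero and one pod, and the HPA takes over from one pod upward.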
What The Numbers Said After
We ran a synthetic load test at 10 000 concurrent hunters for twenty minutes. The worker layer spawned 1 200 pods on EKS (m6i.large nodes), each pod sustaining 8 RPS. The Envoy proxy's consistent hash ensured 99.9 % of sessions never left their initial pod. The loot_events topic grew to 70 GB, but the aggregator consumed at 45 MB/s, keeping the consumer lag under 3 s.
P99 latency on the hunt API stayed at 72 ms, down from 1.8 s.
Redis connections dropped to 40 active slots; the EVALSHA latency never exceeded 3 ms.