How We Blew Up Our Event Pipeline at 3 AM Because the Treasure Hunt Engine Had No Clear Operator Bounds

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We had built the Treasure Hunt Engine to power in-game treasure hunts that reward players for exploring content. At 100 000 active players it felt fast. At 500 000 it started to stutter. The operator documentation told teams to set max_reindexing_concurrency = 2 and warned against full table scans. No one listened when the game grew faster than the docs.

The real problem wasn't the engine's speed; it was the lack of an explicit operator boundary. We had implicit rules (don't reindex during peak hours, don't run two jobs on the same shard) but no enforced policy. When the shard leader's disk filled up at 02:17, Prometheus fired an alert, and the on-call operator panicked and set max_reindexing_concurrency = 4 to speed up cleanup. That single command turned an idling indexer into a four-headed hydra that vacuumed every row in the events table.

What We Tried First (And Why It Failed)

Our first attempt was a simple feature flag: reindexing_enabled = false. We shipped it behind LaunchDarkly and let game masters flip the switch during maintenance windows. In practice, people forgot. One GM set it to false, recompiled the engine config, and redeployed, only to discover that the flag was not evaluated until a full 60 seconds after the pod came up. During that window the old config was still active, so the system reindexed anyway and locked the table for a full 9 minutes while the new pod waited to read the new flag value.

Next we tried a runtime mutex: a Redis lock around the reindexing code. The mutex worked for a month until Redis Cluster failed over mid-operation. The old master stepped down, but the lock key hadn't been replicated to the new master in time, so two pods each assumed they were the only reindexer running. We saw two concurrent reindex jobs, each scanning 1.2 billion rows. The disk I/O latency recorded by Datadog climbed to 120 ms p95; the Postgres autovacuum process couldn't keep up and the database bloat grew by 42 GB in 10 minutes.
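The failure mode is easier to see in a toy model. The sketch below is not Redis; it is an in-memory stand-in where `node`, `tryLock`, `replicateTo`, and the key name `reindex:events` are all made up for illustration. It only shows that if the lock key does not reach the replica before promotion, both pods acquire the lock.

```go
package main

import "fmt"

// node is a toy stand-in for one Redis node: just a map of held lock keys.
type node struct{ locks map[string]string }

func newNode() *node { return &node{locks: map[string]string{}} }

// tryLock mimics a set-if-absent lock: it succeeds only if the key is not held.
func (n *node) tryLock(key, owner string) bool {
	if _, held := n.locks[key]; held {
		return false
	}
	n.locks[key] = owner
	return true
}

// replicateTo copies all keys to the replica. In a real deployment this step
// is asynchronous, so it may not have happened yet when a failover occurs.
func (n *node) replicateTo(r *node) {
	for k, v := range n.locks {
		r.locks[k] = v
	}
}

// simulate returns how many pods believe they hold the lock after a failover.
// replicatedInTime controls whether the key reached the replica beforehand.
func simulate(replicatedInTime bool) int {
	primary, replica := newNode(), newNode()
	holders := 0
	if primary.tryLock("reindex:events", "pod-a") {
		holders++
	}
	if replicatedInTime {
		primary.replicateTo(replica)
	}
	// Failover: the replica is promoted and serves the next lock attempt.
	if replica.tryLock("reindex:events", "pod-b") {
		holders++
	}
	return holders
}

func main() {
	fmt.Println("replicated before failover:", simulate(true))   // 1
	fmt.Println("failover before replication:", simulate(false)) // 2
}
```

With replication in time, pod-b is refused. Without it, both pods count themselves as the sole holder, which matches the two concurrent jobs we saw.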

The Architecture Decision

We ripped out the implicit rules and built an explicit operator boundary we called the Reindexing Governor.

Policy engine: a single gRPC service running as a sidecar that evaluates admission requests against a set of operator-defined policies. The policies are stored in an S3-backed YAML file that is loaded at startup; we refresh it every 30 seconds via an ETag check.
Governance state: every reindexing request must carry a governance token signed by the Governor. The token encodes shard ID, max rows to scan, max concurrency, and the operator's name. If the token is missing or invalid, the indexer rejects the request immediately with an HTTP 403 and the specific reason:

ERROR: invalid governance token: policy violation – concurrent reindexing limit exceeded for shard-7

Hard limits in code: we embedded the token schema into the proto contract so the compiler enforces the token's shape at build time. The token contains a 32-byte random nonce; each time a request is accepted we record the nonce in Redis with a 24-hour TTL. If an incoming nonce is already recorded, we know someone replayed a token and we force a re-authentication flow.
Safety valves: an operator can still run an emergency full reindex by presenting a hardware-backed SSH key and a signed override token, but the system logs every override to an append-only Kafka topic we call governor-audit. The token itself is deleted from Redis immediately after use, so it cannot be replayed.
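The admission decision described in the list above can be sketched roughly as follows. This is a minimal illustration, not the Governor's actual code: the `Policy`, `Request`, and `Governor` types, their field names, and the row-limit rule are assumptions, and the real service sits behind gRPC and loads its policies from YAML.

```go
package main

import (
	"fmt"
	"sync"
)

// Policy mirrors the kind of limits the post describes; field names are assumed.
type Policy struct {
	MaxConcurrencyPerShard int
	MaxRowsPerRequest      int64
}

// Request is a hypothetical admission request.
type Request struct {
	ShardID  string
	Rows     int64
	Operator string
}

// Governor tracks running reindex jobs per shard under a mutex.
type Governor struct {
	mu     sync.Mutex
	policy Policy
	active map[string]int
}

func NewGovernor(p Policy) *Governor {
	return &Governor{policy: p, active: map[string]int{}}
}

// Admit reserves a slot and returns nil, or returns an error naming the rule
// that was violated. Nothing is reserved when the request is rejected.
func (g *Governor) Admit(r Request) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if r.Rows > g.policy.MaxRowsPerRequest {
		return fmt.Errorf("policy violation: row limit exceeded for %s", r.ShardID)
	}
	if g.active[r.ShardID] >= g.policy.MaxConcurrencyPerShard {
		return fmt.Errorf("policy violation: concurrent reindexing limit exceeded for %s", r.ShardID)
	}
	g.active[r.ShardID]++
	return nil
}

// Release frees a slot when a job finishes.
func (g *Governor) Release(shard string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.active[shard] > 0 {
		g.active[shard]--
	}
}

func main() {
	g := NewGovernor(Policy{MaxConcurrencyPerShard: 2, MaxRowsPerRequest: 1000000})
	for i := 1; i <= 3; i++ {
		err := g.Admit(Request{ShardID: "shard-7", Rows: 500000, Operator: "alice"})
		fmt.Printf("request %d: %v\n", i, err)
	}
}
```

The third request on shard-7 is refused because two slots are already taken. The point of the design is that the limit lives in one place and is checked on every request, rather than in a config value an operator can raise at 3 AM.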

The whole thing weighs about 3 MB of binary and runs in its own 256 MB container. We chose Go for the Governor because we needed sub-millisecond token validation and wanted to avoid JVM tuning rituals at 3 AM.
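For a sense of what token validation involves, here is a minimal sketch using only the Go standard library. Several things are assumptions: HMAC-SHA256 as the signing scheme, JSON instead of the proto encoding, the field names, and an in-memory map standing in for the Redis nonce store with its 24-hour TTL.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Token holds the fields the post lists; the Go and JSON names are assumed.
type Token struct {
	ShardID        string `json:"shard_id"`
	MaxRows        int64  `json:"max_rows"`
	MaxConcurrency int    `json:"max_concurrency"`
	Operator       string `json:"operator"`
	Nonce          []byte `json:"nonce"` // 32 random bytes
}

var (
	ErrBadSignature = errors.New("invalid governance token: bad signature")
	ErrReplay       = errors.New("invalid governance token: nonce already used")
)

// mac computes HMAC-SHA256 over the payload (the signing scheme is assumed).
func mac(key, payload []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(payload)
	return m.Sum(nil)
}

// Sign draws a fresh nonce, serializes the token, and returns payload plus tag.
func Sign(key []byte, t Token) (payload, sig []byte, err error) {
	t.Nonce = make([]byte, 32)
	if _, err = rand.Read(t.Nonce); err != nil {
		return nil, nil, err
	}
	if payload, err = json.Marshal(t); err != nil {
		return nil, nil, err
	}
	return payload, mac(key, payload), nil
}

// nonceStore is an in-memory stand-in for Redis keys with a TTL.
// It is not safe for concurrent use; a real store would be.
type nonceStore struct {
	expiry map[string]time.Time
	ttl    time.Duration
}

// claim records the nonce, or reports false if it is present and unexpired.
func (s *nonceStore) claim(nonce []byte, now time.Time) bool {
	k := string(nonce)
	if exp, ok := s.expiry[k]; ok && now.Before(exp) {
		return false
	}
	s.expiry[k] = now.Add(s.ttl)
	return true
}

// Verify checks the signature before trusting any field, then the nonce.
func Verify(key, payload, sig []byte, s *nonceStore, now time.Time) (Token, error) {
	if !hmac.Equal(sig, mac(key, payload)) {
		return Token{}, ErrBadSignature
	}
	var t Token
	if err := json.Unmarshal(payload, &t); err != nil {
		return Token{}, err
	}
	if !s.claim(t.Nonce, now) {
		return Token{}, ErrReplay
	}
	return t, nil
}

// demo returns the results of a first use, a replay, and a tampered payload.
func demo() (first, replay, tampered error) {
	key := []byte("demo-key-not-for-production")
	s := &nonceStore{expiry: map[string]time.Time{}, ttl: 24 * time.Hour}
	payload, sig, err := Sign(key, Token{ShardID: "shard-7", MaxRows: 1000000, MaxConcurrency: 2, Operator: "alice"})
	if err != nil {
		return err, err, err
	}
	now := time.Now()
	_, first = Verify(key, payload, sig, s, now)
	_, replay = Verify(key, payload, sig, s, now)
	_, tampered = Verify(key, append([]byte("x"), payload...), sig, s, now)
	return first, replay, tampered
}

func main() {
	first, replay, tampered := demo()
	fmt.Println("first use:", first)
	fmt.Println("replay:", replay)
	fmt.Println("tampered:", tampered)
}
```

The first use passes, the second is rejected as a replay, and the altered payload fails the signature check before any field is read. Validation here is one hash and one map lookup; in production the lookup becomes a network round trip to Redis.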

What The Numbers Said After

Three weeks after pushing the Governor to production we measured a 99.9 % reduction in reindexing collisions. The longest lock time on the events table dropped from 9 minutes to 42 seconds. P95 ingestion latency went from 1.2 s down to 320 ms, and p99 fell below 800 ms consistently.

We also tracked the false-positive rate of the policy engine. Out of 12 847 reindexing requests, only 23 were rejected by the Governor, mostly because someone forgot to refresh their token after rotating their SSH key. The remaining 12 824 requests completed without incident. Even if we count all 23 rejections as false positives, the rate is 23 / 12 847 ≈ 0.18 %. The Governor itself consumed 1.4 % CPU on a 2 vCPU pod and added a median 12 ms to the end-to-end request path.

We deployed the Governor across all shards last quarter. The disk bloat stopped growing and our nightly vacuum jobs now finish in under 7 minutes instead of 45.

What I Would Do Differently

Use a simpler admission control model next time. The Governor works, but it's still a bespoke service we have to maintain and monitor. If I were starting today I'd evaluate Open Policy Agent running as an Envoy external authorization filter. OPA gives us a mature policy language, a growing ecosystem of policies, and a proven data-plane model that scales horizontally. We could keep the same token schema and admission API; the only change would be swapping the Governor for an OPA sidecar. We ran a spike with OPA last month and the median admission time dropped from 12 ms to 3 ms, good enough to skip the custom Governor entirely.