I Broke Our Event System With 300ms Config Lookups And Learned This About State Machines

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our treasure hunt engine needed to validate whether a player's action triggered a prize drop. The prize rules could change daily, sometimes hourly, based on marketing whims. Initially, we stored these rules in S3 as JSON files keyed by floor and drop percentage. A service called PrizeValidator would download the entire directory on startup, parse the files with Jackson, and build an in-memory map. On a hot reload, it would flush the entire map and reload from scratch.
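The flush-and-reload pattern can be sketched in a few lines. This is an illustrative Python stand-in, not the original Java/Jackson code: a local directory plays the role of the S3 download, and the JSON field names `floor` and `percentage` are assumptions.

```python
import json
from pathlib import Path


def load_rules(rules_dir):
    """Parse every *.json file in rules_dir into a dict keyed by (floor, percentage)."""
    rules = {}
    for path in sorted(Path(rules_dir).glob("*.json")):
        rule = json.loads(path.read_text())
        # Field names are hypothetical; the post only says rules are keyed
        # by floor and drop percentage.
        rules[(rule["floor"], rule["percentage"])] = rule
    return rules


class PrizeRuleCache:
    """In-memory rule map that is rebuilt from scratch on every reload."""

    def __init__(self, rules_dir):
        self.rules_dir = rules_dir
        self.rules = load_rules(rules_dir)

    def hot_reload(self):
        # Full flush and reload: the cost scales with the total number of
        # variants, not with the size of the change that triggered it.
        self.rules = load_rules(self.rules_dir)
```

The weakness is visible in `hot_reload`: adding one variant still re-parses every file.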

That worked fine until marketing introduced 1,200 variants. The in-memory map grew to 320MB. Every time a new variant was added, the validator service restarted, adding 8–10 seconds of cold-start time. During peak traffic at 47,000 requests per second, GC pauses spiked to 800ms. The service would report 503s with error code CassandraReadTimeout/1100 because the JVM heap was fighting for memory with Cassandra's memtable flushes. Our on-call rotation learned to hate the phrase "the rules changed again."

What We Tried First (And Why It Failed)

We tried two things before accepting that the configuration wasn't static data.

First, we punted to a database. We moved the JSON into a PostgreSQL table with a GIN index on (floor, percentage, delay). The query looked like:

SELECT * FROM prize_rules WHERE floor = :floor AND percentage >= :percentage ORDER BY delay LIMIT 1;
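To make the selection rule concrete, here is the same query run against an in-memory SQLite table. SQLite is only a stand-in for PostgreSQL, and the column types and sample rows are invented for illustration.

```python
import sqlite3

QUERY = """
SELECT * FROM prize_rules
WHERE floor = :floor AND percentage >= :percentage
ORDER BY delay LIMIT 1
"""


def find_rule(conn, floor, percentage):
    """Return the matching rule with the smallest delay, or None if nothing matches."""
    params = {"floor": floor, "percentage": percentage}
    return conn.execute(QUERY, params).fetchone()


# Sample data (hypothetical): three rules on floor 3, one on floor 4.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prize_rules (floor INTEGER, percentage INTEGER, delay INTEGER)"
)
conn.executemany(
    "INSERT INTO prize_rules VALUES (?, ?, ?)",
    [(3, 10, 500), (3, 25, 200), (3, 50, 900), (4, 25, 100)],
)

print(find_rule(conn, 3, 20))  # (3, 25, 200)
print(find_rule(conn, 3, 60))  # None
```

Of the floor-3 rules with a percentage of at least 20, the one with the lowest delay wins. A request with no qualifying rule gets nothing back.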

P95 latency dropped to 12ms, but we introduced a new problem: every hunt trigger required a round trip to the database. At 47k rps, that's 47k additional queries per second, each one needing a connection. pgBouncer ran out of pool slots at 2,048, and we started seeing "no more connections allowed (max_client_conn)" errors. We bumped pgBouncer to 8,192, but the PostgreSQL server itself began OOMing because shared_buffers couldn't keep up with thousands of active connections scanning GIN indexes.

Second, we tried Redis with a hash per floor, serializing each variant into a protobuf blob. It looked fast, with p95 at 3ms, but we forgot about persistence. The first outage after a Redis restart cost us 4.5 minutes of hunt triggers while it rehydrated from RDB. That outage cost us $18k in missed prize redemptions because the validator returned nil for every request during the restart. Sentry lit up with prize_validation_missed alerts.

The Architecture Decision

We stopped treating the prize rules as configuration and started treating them as state in a finite state machine. We modeled the rules as a state table:

floor_id INT
percentage INT
delay_ms INT
active BOOLEAN

We moved the table into CockroachDB for multi-region consistency and deployed a sidecar called RuleEngine that precomputed all active variants into a local RocksDB instance on each PrizeValidator pod. The sidecar watched the CockroachDB changefeed and applied upserts and deletes to the local RocksDB in under 150ms.
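The sidecar's apply loop can be sketched roughly as follows. This is a simplified Python illustration, not the RuleEngine code: a plain dict stands in for RocksDB, the event shape (a `key` plus an `after` row that is `None` on delete) is a simplified assumption about the changefeed payload, and treating (floor_id, percentage) as the primary key is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrizeRule:
    floor_id: int
    percentage: int
    delay_ms: int
    active: bool


def apply_change(store, event):
    """Apply one change event to the local store (a dict standing in for RocksDB).

    Assumed event shape: {"key": {...}, "after": {...} or None}.
    A missing "after" row means the row was deleted upstream.
    """
    key = (event["key"]["floor_id"], event["key"]["percentage"])
    after = event.get("after")
    if after is None or not after["active"]:
        # Deleted or deactivated rules are dropped so the local store holds
        # only active variants.
        store.pop(key, None)
    else:
        store[key] = PrizeRule(**after)
```

Because each event touches a single key, the cost of a rule change is proportional to the change itself rather than to the total number of variants.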

At startup, the PrizeValidator memory-mapped its local RocksDB files and relied on the OS page cache instead of copying the rules into the JVM heap. Cold starts dropped to 200ms. Memory usage per pod stayed under 45MB because RocksDB compressed the data with zstd and shared the page cache with the JVM.

We added a leader election on the RuleEngine sidecar so only one pod per region performed the changefeed scan, reducing CPU pressure on CockroachDB. The changefeed itself ran at 1,200 rows per second with zero backpressure because CockroachDB's CDC feed is built on the same Raft log used for transactions.

We also introduced a consistency switch: for high-stakes hunts, we could flip a flag to enable synchronous writes to CockroachDB before the prize drops, ensuring that no rule changes could slip through during a hunt. That flag added 8ms latency to the critical path but cut the chance of a missed validation to zero.
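The post does not show how the flag is wired in, so the sketch below is only one possible shape. It assumes the strict path makes a blocking round trip to the authoritative store before approving a drop, while the default path trusts the local copy. The names `validate_trigger`, `fetch_authoritative`, and `strict` are hypothetical.

```python
def validate_trigger(key, local_rules, fetch_authoritative, strict=False):
    """Return True if an active rule exists for key.

    local_rules: dict kept up to date by the sidecar (eventually consistent).
    fetch_authoritative: callable doing a blocking lookup against the
        source of truth (hypothetical stand-in for the CockroachDB call).
    strict: the consistency switch; when True, pay the extra round trip.
    """
    rule = fetch_authoritative(key) if strict else local_rules.get(key)
    return rule is not None and rule["active"]
```

The trade-off is explicit at the call site: the default path never leaves the pod, and the strict path buys freshness with latency.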

What The Numbers Said After

After two weeks, Prometheus showed p95 latency at 18ms, down from 4.2s. We dropped the PostgreSQL bill by 40% because we no longer needed 8,192 pgBouncer slots. CockroachDB CPU stayed flat at 23% per node, and the CDC feed never lagged more than 50ms behind writes.

We measured memory: each PrizeValidator pod used 45MB of heap and 72MB of off-heap memory for RocksDB, or 117MB per pod. With three pods per region, that's 351MB total, versus the previous 320MB per pod that held the full map in JVM heap. The GC pause p99 dropped to 120ms, and the OOM rate fell to zero.

Sentry errors for prize_validation_missed went to zero. During the Black Friday hunt that followed, we processed 1.3 million triggers with zero missed validations and zero latency SLO breaches.

What I Would Do Differently

I would not have started with a database. That was premature optimization in the wrong direction. We needed a state machine and a changefeed, not a table index.

I would have modeled the state table earlier. The initial mistake was conflating configuration with data. Configuration is a story about state transitions, not static files.

I would have limited the RocksDB mmap to a fixed size. We ran into a production incident where a misrouted CDC event produced 70,000 variants in one floor, overflowing the local RocksDB and causing mmap corruption. We added a max_rows limit