The Veltrix Engine Was 16ms From Catastrophe and No One Noticed

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At Veltrix we run a city-wide treasure hunt every quarter. Up to 20,000 players move simultaneously through 300 physical beacons that emit BLE signals. Each player's phone streams position fixes to a Go service we call huntd. It fans out each fix to 12 stateless worker pods that compute which beacon the player is closest to, then checks that beacon's rules engine to award points, unlock achievements, or trigger real-time events.
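To make the "closest beacon" step concrete, here is a minimal sketch in Go. It assumes each fix carries a list of RSSI readings and that the strongest signal marks the nearest beacon; the Reading type, the beacon IDs, and the nearestBeacon function are hypothetical names for illustration, not the actual worker code.

```go
package main

import "fmt"

// Reading is one BLE observation inside a position fix (hypothetical shape).
type Reading struct {
	BeaconID string
	RSSI     int // signal strength in dBm; values closer to 0 are stronger
}

// nearestBeacon returns the beacon with the strongest RSSI.
// ok is false when the fix carries no readings at all.
func nearestBeacon(readings []Reading) (id string, ok bool) {
	if len(readings) == 0 {
		return "", false
	}
	best := readings[0]
	for _, r := range readings[1:] {
		if r.RSSI > best.RSSI {
			best = r
		}
	}
	return best.BeaconID, true
}

func main() {
	// Made-up readings for one fix.
	fix := []Reading{{"b-017", -82}, {"b-204", -61}, {"b-090", -75}}
	if id, ok := nearestBeacon(fix); ok {
		fmt.Println("nearest beacon:", id)
	}
}
```

Raw RSSI is noisy in practice, so a real worker would probably smooth readings over several fixes before picking a winner.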

The real problem wasn't latency; it was unbounded state. Every player connection creates a WebSocket that lives for hours. When we scaled to 20,000 concurrent sockets, the kernel's TCP stack allocated ~2.3 MB per connection for reassembly buffers alone. Connection bursts overflowed the accept queue, which is capped by the default somaxconn of 4,096, and new connections started getting rejected pre-emptively with RST packets. Error message in logs: accept4: resource temporarily unavailable (errno 24).
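The scale of that buffer cost is easy to underestimate, so here is the back-of-the-envelope arithmetic as a small Go sketch. The function name is made up, and it uses decimal units (1 GB = 1000 MB); the inputs are the figures quoted above.

```go
package main

import "fmt"

// bufferMemoryGB estimates total kernel buffer memory in decimal GB
// for a number of connections and a per-connection cost in MB.
func bufferMemoryGB(conns int, perConnMB float64) float64 {
	return float64(conns) * perConnMB / 1000
}

func main() {
	// 20,000 sockets at ~2.3 MB each, the figures from the text.
	fmt.Printf("%.0f GB\n", bufferMemoryGB(20000, 2.3))
}
```

At roughly 46 GB for buffers alone, the cost grows linearly with every socket that stays open, which is why tuning around the socket lifecycle could not fix it.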

What We Tried First (And Why It Failed)

We tried two quick fixes without touching the socket lifecycle.

First, we enabled SO_REUSEPORT so the kernel could spread the 20,000 sockets across all 64 threads. This dropped p99 latency from 112 ms to 87 ms, but the memory graph kept climbing at 42 MB per thousand active connections because the WebSockets themselves persisted. We profiled with go tool pprof and saw 38% of the heap still held by conn structs waiting for the next ping.

Second, we set SO_KEEPALIVE with tcp_keepalive_time=60. That closed dead sockets after one minute of idle time, but it broke the game: a phone in a subway tunnel would have its connection killed while the player was still moving, and the state machine would think they had quit. Player support tickets spiked with phrases like "I didn't quit, my phone lost signal" and "Why did I lose 450 points?"

The Architecture Decision

We tore the WebSocket layer apart and introduced a two-tier split:

Tier 1: Ephemeral WebSocket shim written in Rust using tokio-tungstenite. It forwards BLE pings as CloudEvents to a Kafka topic called ble-raw and closes the socket within 5 seconds of the last ping. Memory footprint per connection dropped to ~8 KB.
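The shim itself is Rust, but its two jobs can be sketched in Go to stay consistent with huntd. This is an illustration under assumptions, not the production code: the source and type values are placeholders, the function names are invented, timestamps are Unix milliseconds, and the Kafka publish step is left out entirely. The envelope uses the CloudEvents 1.0 required attributes (specversion, id, source, type) plus time and data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// idleTimeout mirrors the 5-second close window described above.
const idleTimeout = 5 * time.Second

// cloudEvent holds the CloudEvents 1.0 required attributes plus time and data.
type cloudEvent struct {
	SpecVersion string          `json:"specversion"`
	ID          string          `json:"id"`
	Source      string          `json:"source"`
	Type        string          `json:"type"`
	Time        string          `json:"time"`
	Data        json.RawMessage `json:"data"`
}

// wrapPing builds the JSON envelope for one raw BLE ping payload.
// payload must already be valid JSON. Source and Type are placeholders.
func wrapPing(id string, tsMillis int64, payload []byte) ([]byte, error) {
	ev := cloudEvent{
		SpecVersion: "1.0",
		ID:          id,
		Source:      "/shim/example",
		Type:        "com.example.ble.ping",
		Time:        time.UnixMilli(tsMillis).UTC().Format(time.RFC3339Nano),
		Data:        payload,
	}
	return json.Marshal(ev)
}

// shouldClose reports whether the socket has been idle for the full timeout.
func shouldClose(lastPingMillis, nowMillis int64) bool {
	return nowMillis-lastPingMillis >= idleTimeout.Milliseconds()
}

func main() {
	out, err := wrapPing("evt-1", time.Now().UnixMilli(), []byte(`{"rssi":-61}`))
	if err != nil {
		fmt.Println("wrap failed:", err)
		return
	}
	fmt.Println(string(out))
	fmt.Println("close after 6s idle:", shouldClose(0, 6000))
}
```

Because the shim holds nothing beyond the time of the last ping, closing the socket loses no game state; everything durable already lives in the topic.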

Tier 2: Dedicated hunt-worker pods that read ble-raw via Kafka Streams in exactly-once mode. They maintain a compacted RocksDB state store on NVMe with the TTL set to 30 minutes. If a player disappears for more than 30 minutes, the state evaporates and the player's session is re-created on their next beacon hit.

We chose the 30-minute TTL window because we measured 99.8% of all player moves within 5 minutes, and 100% within 25 minutes. The compaction filter deletes keys older than 30 minutes, so the database stayed under 2.4 GB on an i3.large even at 20,000 active sessions. The tradeoff? We gave up Redis's O(1) lookups for iterator-based range scans, but the throughput hit was negligible: Kafka Streams fetch latency stayed below 8 ms p99.
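The TTL behavior can be sketched with an in-memory map standing in for RocksDB. Everything here is an assumption made for illustration: the type and method names are invented, timestamps are Unix milliseconds, and sweep only plays the role that the compaction filter plays in the real store. The sketch is also not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// sessionTTL is the 30-minute window described above.
const sessionTTL = 30 * time.Minute

type session struct {
	Beacon     string
	LastSeenMs int64
}

// sessionStore is an in-memory stand-in for the RocksDB state store.
type sessionStore struct {
	m map[string]session
}

func newSessionStore() *sessionStore {
	return &sessionStore{m: make(map[string]session)}
}

func expired(s session, nowMs int64) bool {
	return nowMs-s.LastSeenMs > sessionTTL.Milliseconds()
}

// touch records a beacon hit. created is true on first contact or when
// the previous session had already outlived the TTL.
func (st *sessionStore) touch(player, beacon string, nowMs int64) (created bool) {
	old, ok := st.m[player]
	created = !ok || expired(old, nowMs)
	st.m[player] = session{Beacon: beacon, LastSeenMs: nowMs}
	return created
}

// sweep drops expired sessions and returns how many were removed.
func (st *sessionStore) sweep(nowMs int64) (removed int) {
	for player, s := range st.m {
		if expired(s, nowMs) {
			delete(st.m, player)
			removed++
		}
	}
	return removed
}

func main() {
	st := newSessionStore()
	minute := time.Minute.Milliseconds()
	fmt.Println("new session:", st.touch("p1", "b-017", 0))
	fmt.Println("new session:", st.touch("p1", "b-204", 10*minute))
	fmt.Println("new session:", st.touch("p1", "b-090", 45*minute))
}
```

Note that touch treats an expired entry as a fresh session even if sweep has not run yet, so correctness does not depend on when cleanup happens.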

What The Numbers Said After

After the cutover on April 3, we watched Prometheus for seven days.

WebSocket shim CPU utilization: 0.4 cores across the fleet.
Kafka consumer lag on ble-raw: consistently 0 ms.
hunt-worker heap usage: 150 MB per pod, down from 1.8 GB.
Player-reported disconnect incidents: 0.03% of total sessions, well inside our SLO of 0.1%.

The only lingering pain was the RocksDB iterator overhead during leader election. When a pod restarted, it took 4.7 seconds to rebuild the RocksDB memtable cache from the WAL, causing brief spikes in compute credit burn on AWS. We mitigated that by pinning the cache to 256 MB and pre-warming it on startup with a dummy range scan.

What I Would Do Differently

I would have separated the beacon-state TTL from the player-session TTL from day one.

We baked both into the same RocksDB key: beacon::player:. After the shim layer, we only need the player's current beacon; everything else can be reconstructed from the BLE pings. By splitting the stores, one for live sessions (30 min TTL) and one for audit logs (90 days), we could have run the audit store on S3 via RocksDB's SstFileManager and avoided the 4.7-second cache rebuild altogether.
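To show what "reconstructed from the BLE pings" could look like, here is a small Go sketch that folds a ping log into each player's current beacon. The ping type and currentBeacons function are hypothetical, timestamps are Unix milliseconds, and the log is a plain slice rather than a Kafka topic.

```go
package main

import "fmt"

// ping is one replayed BLE event (hypothetical shape).
type ping struct {
	Player string
	Beacon string
	TsMs   int64
}

// currentBeacons folds a ping log into each player's latest beacon.
// Pings may arrive out of order, so the newest timestamp wins.
func currentBeacons(log []ping) map[string]string {
	latest := make(map[string]ping)
	for _, p := range log {
		if prev, ok := latest[p.Player]; !ok || p.TsMs >= prev.TsMs {
			latest[p.Player] = p
		}
	}
	out := make(map[string]string, len(latest))
	for player, p := range latest {
		out[player] = p.Beacon
	}
	return out
}

func main() {
	// Made-up log with one out-of-order ping for p1.
	log := []ping{
		{"p1", "b-017", 100},
		{"p1", "b-204", 300},
		{"p1", "b-090", 200},
		{"p2", "b-017", 50},
	}
	for player, beacon := range currentBeacons(log) {
		fmt.Println(player, "is at", beacon)
	}
}
```

If the live store only ever holds the result of a fold like this, losing it costs a replay rather than any data.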

The bigger lesson: never let a real-time protocol own long-lived state that can be derived. The WebSocket shim was the right call, but the RocksDB consolidation was premature. We spent three engineer-weeks tuning compaction and cache sizes that could have been avoided if we had drawn the boundary at the beacon hit itself.