The Moment We Realized the Default Config Was a Lie

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We weren't building a cache. We were running a treasure hunt engine where 100,000 concurrent users refresh GPS positions every 200ms, and every refresh triggers a write to aggregate player scores. The data isn't temporal; it's mutable. The catch is that the moment a player crosses a checkpoint, all previous state for that player becomes invalid. Redis handled the writes, but its fork-based persistence model worked against us: every background save forks the process, and under our constant invalidation traffic, copy-on-write ended up duplicating most of the dataset's pages. Our memory usage grew from 12GB to 48GB in 12 hours. We hit the OOM killer at 14:33.
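For a sense of scale, the write rate implied by those figures can be worked out directly. This is a small sketch that assumes exactly one score write per refresh; the function name is illustrative.

```rust
/// Writes per second implied by `users` clients that each refresh every
/// `interval_ms` milliseconds. Assumes one aggregate-score write per refresh.
fn writes_per_second(users: u64, interval_ms: u64) -> u64 {
    users * 1000 / interval_ms
}
```

With 100,000 users and a 200ms interval, that comes to 500,000 writes per second, which is why nearly every memory page gets touched while a background save is in flight.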

What We Tried First (And Why It Failed)

We tried tuning Redis. We set maxmemory 32GB and maxmemory-policy allkeys-lru, and switched to AOF with appendfsync always. The OOM still happened, but this time the kernel killed only the redis-server process rather than taking the whole host down. The latency spikes from forking and from fsync on every write added 800ms to our p99, and our p50 went from 2ms to 34ms. The Redis cluster mode documentation hinted at resharding pain, and we didn't want to shard a stateful mutable dataset. We considered removing state entirely and pushing invalidations to an SQS queue, but then we'd lose atomicity per player: two concurrent invalidations could overwrite each other.
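The lost-update risk is easier to see in code. Below is a minimal in-memory sketch of a version-checked write, the kind of per-player guarantee a plain queue would not give us. The `PlayerState`, `StaleWrite`, and `invalidate` names are hypothetical and not taken from our codebase or any library.

```rust
use std::collections::HashMap;

/// Per-player state tagged with a version counter (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
struct PlayerState {
    version: u64,
    checkpoint: u32,
}

/// Returned when the caller's view of the player is out of date.
#[derive(Debug, PartialEq)]
struct StaleWrite {
    current_version: u64,
}

/// Applies an invalidation only if `expected_version` matches what is stored.
/// A second writer holding an old version gets an error instead of silently
/// overwriting the first writer's update.
fn invalidate(
    store: &mut HashMap<u64, PlayerState>,
    player_id: u64,
    expected_version: u64,
    new_checkpoint: u32,
) -> Result<u64, StaleWrite> {
    let state = store
        .entry(player_id)
        .or_insert(PlayerState { version: 0, checkpoint: 0 });
    if state.version != expected_version {
        return Err(StaleWrite { current_version: state.version });
    }
    state.version += 1;
    state.checkpoint = new_checkpoint;
    Ok(state.version)
}
```

If two invalidations for the same player both read version 0, only the first one succeeds; the second sees `StaleWrite` and has to re-read before retrying.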

The Architecture Decision

We ported the state store to TiKV. Not because we loved it (TiKV's Rust client was immature then), but because it gave us per-key atomic writes and snapshot isolation. The decision wasn't about language; it was about the consistency model. TiKV's MVCC let us:

Write a checkpoint event with a new timestamp
Overwrite previous checkpoints in a single transaction
Keep the old version readable for 5 seconds, so late GPS packets still hit valid state
Run on Kubernetes with 3 replicas, tolerating one AZ failure
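To make the versioning idea concrete, here is a toy in-memory model of timestamped checkpoint versions with a 5-second grace window. It is only an illustration of the concept: it is not TiKV's API or storage layout, and `VersionedCheckpoints`, `write`, and `read_at` are made-up names.

```rust
use std::collections::BTreeMap;

/// How long a superseded version stays readable, in milliseconds.
const GRACE_MS: u64 = 5_000;

/// Toy multi-version store for one player: commit timestamp -> checkpoint id.
#[derive(Default)]
struct VersionedCheckpoints {
    versions: BTreeMap<u64, u32>,
}

impl VersionedCheckpoints {
    /// Records a checkpoint at `commit_ts`, then drops every version that was
    /// superseded more than GRACE_MS before `commit_ts`.
    fn write(&mut self, commit_ts: u64, checkpoint: u32) {
        self.versions.insert(commit_ts, checkpoint);
        let keys: Vec<u64> = self.versions.keys().copied().collect();
        for pair in keys.windows(2) {
            let (old, superseded_at) = (pair[0], pair[1]);
            if commit_ts.saturating_sub(superseded_at) > GRACE_MS {
                self.versions.remove(&old);
            }
        }
    }

    /// Snapshot read: the newest version committed at or before `read_ts`.
    fn read_at(&self, read_ts: u64) -> Option<u32> {
        self.versions.range(..=read_ts).next_back().map(|(_, c)| *c)
    }
}
```

A late GPS packet stamped before the newest checkpoint still resolves to the version that was valid at its timestamp, as long as that version is inside the grace window.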

We wrote a custom Rust crate, kvs-raft, that wrapped the tikv-client crate and added a bloom filter for checkpoint lookups. The bloom filter was 8MB per TiKV node and cut unnecessary reads by 40%. We ran a chaos test: killed one TiKV pod every 30 seconds for 15 minutes. Our p99 stayed under 12ms, and total memory stabilized at 24GB across three pods.
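For readers unfamiliar with the technique, a bloom filter of that kind can be sketched with nothing but the standard library. This is not the kvs-raft implementation; the hash count, the double-hashing scheme, and the key format in the usage note are assumptions made for illustration. `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, so a filter like this should live in memory only and not be persisted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal bloom filter: no false negatives, tunable false-positive rate.
struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: u64, num_hashes: u32) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        BloomFilter { bits: vec![0; words], num_bits, num_hashes }
    }

    /// i-th bit index via double hashing: (h1 + i * h2) mod num_bits.
    fn index<T: Hash>(&self, item: &T, i: u32) -> u64 {
        let mut first = DefaultHasher::new();
        item.hash(&mut first);
        let h1 = first.finish();
        // Second hash: same item plus a fixed salt; forced odd so it is never zero.
        let mut second = DefaultHasher::new();
        (item, 0x9e37_79b9_u32).hash(&mut second);
        let h2 = second.finish() | 1;
        h1.wrapping_add(u64::from(i).wrapping_mul(h2)) % self.num_bits
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.num_hashes {
            let idx = self.index(item, i);
            self.bits[(idx / 64) as usize] |= 1u64 << (idx % 64);
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain<T: Hash>(&self, item: &T) -> bool {
        (0..self.num_hashes).all(|i| {
            let idx = self.index(item, i);
            (self.bits[(idx / 64) as usize] & (1u64 << (idx % 64))) != 0
        })
    }
}
```

An 8MB filter corresponds to `BloomFilter::new(8 * 1024 * 1024 * 8, k)` for some hash count `k`. A lookup would call `might_contain` on a key such as `"player:42:checkpoint:7"` and skip the TiKV read whenever it returns `false`.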

What The Numbers Said After

Here's the profiler output from the Redis failure day vs. the TiKV stable day, both on c5.4xlarge:

Redis failure (day 3):

RSS: 48GB
p99 latency: 2.1s
p50 latency: 42ms
Allocations/sec: 180k fork buffers
Evictions: 0 (noeviction)

TiKV stable day:

RSS: 24GB
p99 latency: 12ms
p50 latency: 3.4ms
Allocations/sec: 12k
Raft log size: 4.2GB

The Rust client added 3ms to p99 on the cold path, but the deterministic, GC-free allocations cut tail latency swings from 400ms to 12ms.

What I Would Do Differently

I wouldn't have trusted the default Redis config for mutable state. The README's QPS numbers are for cache workloads, not mutable, frequently invalidated datasets. I would have benchmarked fork syscalls before deploying. Also, TiKV's Rust client panicked on network partitions under load. The fix was adding a custom backoff strategy in our kvs-raft wrapper, but we should have tested partition tolerance first. Finally, we over-provisioned memory. The rustc-compiled binary was 12MB, but our container image was 600MB due to vendored dependencies. We switched to a distroless image and cut cold-start time from 4s to 800ms.
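As for the backoff strategy mentioned above, a capped exponential retry loop gives the general shape. This is a blocking, std-only sketch under assumed parameters, not the kvs-raft code: the real tikv-client is async, so the wrapper would use an async sleep, and `backoff_delay` and `retry_with_backoff` are illustrative names.

```rust
use std::time::Duration;

/// Capped exponential backoff: base * 2^attempt, never exceeding `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with capped exponential backoff between failures.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Returning the error after the final attempt, rather than panicking, lets the caller decide how to handle a partition. Adding random jitter to each delay is a common refinement and is left out here to keep the sketch deterministic.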
