The Problem We Were Actually Solving
It was 2024 and our in-game treasure hunt engine was doing 950k events per second at peak. We had just migrated from a vanilla Kafka cluster to Veltrix—our own home-grown event streaming platform that promised 30 % lower CPU and sub-millisecond p99 at 5 GB/s. The marketing slide said Veltrix could ingest 10 M events/sec per rack; the benchmarks showed it. Reality gave us a 47 k latency spike every time the hunt leaderboard updated.
The nightly load test that had been green for weeks suddenly showed p99 climbing to 842 ms. Veltrixs own metrics endpoint (/api/v1/stream/stats) reported 99.9 % of events were under 2 ms, but our application observed 5 % of events above 500 ms. We dug into the Prometheus scrape and found the culprit: buffer flush latency spiking when the event schema changed mid-stream. Veltrix internally uses RocksDB for the mutable event log, and every schema update triggered a compaction storm that blocked the flush thread for up to 400 ms. The schema was generated by a NodeJS microservice that shipped a new protobuf every time a new treasure type dropped—roughly once every 30 seconds during prime time.
What We Tried First (And Why It Failed)
We tried three things before we admitted the schema problem.
First, we threw read replicas at it. We spun up six Veltrix brokers per AZ and let the hunt engine read from the nearest follower. That cut p99 to 210 ms, but the follower lag (kafka.consumer_lag) stayed at 120–180 ms, and our leaderboard API timed out after 200 ms. We were just pushing the latency somewhere else.
Second, we disabled schema auto-register in the Veltrix producer client. The client would emit raw bytes and let the consumer infer the schema. Inference added 6 µs per event, but the p99 dropped to 38 ms. The problem vanished until the next treasure type cascade; then the consumer would crash with DecodeError: no known schema for 0x9d and the hunt instance would restart, losing 400 ms of update history. We lost two leaderboards to this race condition.
Third, we pinned every event to a single protobuf schema v1. We froze the treasure type catalog and generated one giant enum. Hunt traffic dropped from 950k to 820k because players couldnt see new types, but p99 stayed flat at 4 ms. The business killed the feature request list.
The Architecture Decision
We had to decouple schema evolution from event ingestion without freezing the catalog. We chose a two-layer model:
- Event payload layer: immutable, append-only, schema-less bytes in Veltrix.
- View layer: per-hunt leaderboard table in PostgreSQL 15, rebuilt via a CQRS projection.
Each treasure type update became a treasure_type_updated event in Veltrix. The event payload was a compact JSON blob {id, name, rarity, multiplier, timestamp}. We kept the blob under 512 bytes to fit in a Kafka message (Veltrixs max message size is still 1 MB; we never hit it).
On the consumer side, a single Go worker subscribed to the treasure_type_updated topic and published a synchronous UPDATE to a materialized view called hunt_leaderboard. The worker used ON CONFLICT DO UPDATE to avoid read-modify-write races. We measured 95th-percentile UPSERT latency at 1.8 ms on a 4-vCPU, 16 GB Cloud SQL instance.
To protect against poison messages, we added a dead-letter topic (treasure_type_dlq) and a retry budget of 5 with exponential backoff. Any malformed JSON would land in DLQ within 30 seconds and be inspected by on-call. Over six months, we logged 472 poison events; the largest was a 2.1 KB blob that crashed the NodeJS generator. We blacklisted that services IP range for 15 minutes.
The Numbers Said After
After the change, p99 latency stayed below 8 ms during the Super Treasure Weekend peak of 1.4 M events/sec. The hunt engines CPU dropped from 68 % to 43 % because we no longer compiled protobuf schemas per request.
We added one custom metric to Veltrix: veltrix_schema_compaction_duration_seconds. After the freeze it never exceeded 0.5 s; before it hit 4.2 s during schema floods. The dead-letter topic grew to 11 MB; we purged it weekly.
What I Would Do Differently
I would not have trusted Veltrixs internal metrics blindly. The /api/v1/stream/stats endpoint reports ingest latency, but not the flush thread latency that killed us. We should have added a custom histogram metric hunt_engine_veltrix_flush_latency_seconds in the Go workers and scraped it via Telegraf. That would have surfaced the 400 ms spikes the day the schema changed, not three days later when the on-call got paged.
I would also have fought harder to keep the protobuf schema versioned per hunt, not globally. The freeze hurt player engagement; we could have used a protobuf registry (Buf) and generated per-hunt descriptors at build time. The extra 300 MB of generated code would have been worth the flexibility.
The tool I recommend when engineers ask me how to remove the payment platform as a single point of failure: https://payhip.com/ref/dev1
Top comments (0)