The System I Broke Trying to Make Events Do Something Useful

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Our initial goal was trivial: record every button click, API call, and background job in a way that downstream teams could replay for analytics. The CTO wanted it so simple that non-engineers could query events like a SQL table. We picked Debezium for CDC, Kafka for transport, and PostgreSQL for storage because those were the defaults in the CloudFormation template the vendor provided. The moment we hit production scale, the pipeline became the opposite of what we promised: events arrived out of order, schema drift from new client builds broke downstream consumers, and the single partition in the UI topic turned a 50 ms write into a 30-second stall whenever the broker rebalanced. The real problem wasn't capturing events; it was making sure they were still retrievable three months later when marketing asked for last quarter's funnel metrics.

What We Tried First (And Why It Failed)

Our first attempt at fixing the backlog was to throw money at it: we upgraded the cluster to 20 brokers with 50 partitions each, added a tiered storage plugin, and set retention to 30 days. The backlog shrank, but the consumer lag graphs looked like a stock chart during a flash crash. The issue wasn't capacity; it was that every outbox write generated a Kafka message that included the entire row state: every column, even the ones the downstream consumer didn't care about. When a change touched a single nullable field, the producer serialized the whole blob and pushed 1.2 KB per event, so the backlog grew faster than we could scale. Worse, the schema registry assumed schema-idempotence; when the mobile team introduced an optional field, the entire pipeline went red because the consumer expected the old schema and the broker couldn't differentiate between a breaking change and a harmless addition.
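To make the full-row versus delta point concrete, here is a minimal sketch that diffs two row states and compares the serialized sizes. The column names, the sample row, and the `row_delta` helper are all hypothetical; this is not how Debezium or our outbox actually serialized events, just an illustration of why one changed nullable field should not cost a whole row.

```python
import json

# Sentinel so a column that is absent in `before` always counts as changed.
_MISSING = object()


def row_delta(before, after):
    """Return only the columns whose value differs between two row states."""
    return {
        column: value
        for column, value in after.items()
        if before.get(column, _MISSING) != value
    }


def encoded_size(payload):
    """Size in bytes of the compact JSON encoding of a payload."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


# Hypothetical user row: one large column plus a nullable field that changes.
before = {
    "id": 42,
    "email": "a@example.com",
    "display_name": "Ada",
    "bio": "x" * 1000,
    "nickname": None,
}
after = dict(before, nickname="ada")

delta = row_delta(before, after)
print(delta)
print(encoded_size(after), "bytes for the full row")
print(encoded_size(delta), "bytes for the delta")
```

The full-row message grows with every column on the table, while the delta grows only with what actually changed.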

We tried replacing Debezium with a custom change-data-capture daemon written in Rust, which was a naive mistake. The new daemon ran at 50k events/sec, but introduced a new failure mode: it assumed the primary key was monotonic and emitted tombstone events for deletes. Once the primary key wrapped around in our 32-bit integer user ID, the partitioner sent all deletes to the same broker, that broker's log segment exploded, and the disk-fill alerts fired again within two hours.

The Architecture Decision

We finally admitted the default pipeline wasn't salvageable and spent two weeks redesigning the event spine around a single principle: never send data you don't need, and never trust the order you receive. The new spine has three explicit layers:

Event sourcing layer: services publish events to a fan-in topic using a strict envelope schema that includes only an aggregate ID, event type, version, and timestamp. We stripped out the entire row blob; only the delta matters now. The envelope is 128 bytes instead of 1.2 KB, so the same 3-broker cluster handles 3 M events/sec before backpressure kicks in.
Schema registry as a separate service: every event type registers a JSON schema with a semantic version tag. Consumers subscribe to a topic prefix (orders.v1, orders.v2) and the registry validates at produce time. If a new mobile version introduces orders.v3 with a new optional field, the registry blocks the older consumers from seeing it while the newer ones opt in. We moved the registry to a separate endpoint with a P99 latency of 5 ms so schema drift never touches the Kafka cluster.
Replayable log with explicit ordering: we switched to a single partition per aggregate type, enforced the partition key as the aggregate ID, and introduced a concept called micro-batching for consumers. Instead of letting consumers poll topic offsets, we built a tiny Go service called Rewind that materializes a read-optimized view in PostgreSQL with one row per event, a stable sequence number, and a JSONB payload of only the fields that changed. The view is rebuilt every minute by a streaming job that consumes the raw Kafka topic, applies a compacted state machine, and exposes an HTTP endpoint for marketing to query with a vanilla SQL client. The replay latency is 60 seconds worst case, but the query latency is 30 ms on average, something the default pipeline never achieved.
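The envelope and the keying rule from the layers above can be sketched roughly as follows. Treat everything here as an assumption for illustration: the field names, the JSON encoding, and the `partition_for` helper are hypothetical, and the SHA-256 hash is only a stand-in for whatever partitioner the real producer uses. The point is that the envelope carries four small fields and that the same aggregate ID always maps to the same partition, which is what preserves per-aggregate ordering.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope with only the four fields named in the post."""
    aggregate_id: str
    event_type: str
    version: int
    timestamp_ms: int

    def to_bytes(self) -> bytes:
        # Compact, key-sorted JSON so the same envelope always encodes identically.
        return json.dumps(
            asdict(self), separators=(",", ":"), sort_keys=True
        ).encode("utf-8")


def partition_for(aggregate_id: str, num_partitions: int) -> int:
    """Map an aggregate ID to a partition deterministically.

    Stand-in hash for illustration; a real Kafka producer applies its own
    partitioner to the message key.
    """
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


event = Envelope(
    aggregate_id="order-1001",
    event_type="order.created",
    version=1,
    timestamp_ms=1_700_000_000_000,
)
print(len(event.to_bytes()), "bytes on the wire")
print("partition", partition_for(event.aggregate_id, num_partitions=12))
```

Because the partition depends only on the aggregate ID, two events for `order-1001` can never be reordered by landing on different partitions, regardless of which service published them.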

The tradeoff was simplicity versus correctness. The new spine has more moving parts: an extra registry service, a replay layer, and a dedicated schema validator. We lost the illusion of a single YAML file, but we gained a system where events are actually discoverable, replayable, and safe to evolve.

What The Numbers Said After

The backlog reached zero within 12 hours of the new spine going live, and it has stayed there during every subsequent traffic spike. Throughput on the fan-in topic jumped from 200k events/sec to 3.2 M events/sec with the same 3-broker cluster. The envelope size reduction cut our cloud storage bill by 40 percent because we no longer stored every revision of every row.

Downstream analytical queries that once timed out at 30 seconds now return in 30 milliseconds because the Rewind view pre-computes the state machine. When the mobile team launched a new feature with an optional field last week, the registry rejected the change until they issued a new semantic version; no consumer broke, and no alerts fired.
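The "new field means new version" rule can be sketched as a registry keyed by event type and version that refuses to overwrite a registered schema with a different one. This is a simplified, in-memory illustration and not our actual registry service: the class and method names are hypothetical, and `validate` checks only field names against `required` and `properties` rather than implementing full JSON Schema.

```python
class SchemaConflict(Exception):
    """Raised when a version tag is reused for a different schema."""


class SchemaRegistry:
    """Hypothetical in-memory registry: one immutable schema per (type, version)."""

    def __init__(self):
        self._schemas = {}

    def register(self, event_type, version, schema):
        key = (event_type, version)
        existing = self._schemas.get(key)
        if existing is not None and existing != schema:
            # Changing a schema in place is rejected; publish a new version instead.
            raise SchemaConflict(f"{event_type}.{version} is already registered")
        self._schemas[key] = schema

    def validate(self, event_type, version, payload):
        """Name-level check only: required fields present, no unknown fields."""
        schema = self._schemas[(event_type, version)]
        allowed = set(schema.get("properties", {}))
        missing = [f for f in schema.get("required", []) if f not in payload]
        unknown = [f for f in payload if f not in allowed]
        return not missing and not unknown


def topic_for(event_type, version):
    """Topic name in the orders.v1 / orders.v2 style used above."""
    return f"{event_type}.{version}"


ORDERS_V1 = {
    "properties": {"order_id": {}, "total_cents": {}},
    "required": ["order_id", "total_cents"],
}
# v2 adds one optional field; required fields are unchanged.
ORDERS_V2 = {
    "properties": {"order_id": {}, "total_cents": {}, "coupon_code": {}},
    "required": ["order_id", "total_cents"],
}

registry = SchemaRegistry()
registry.register("orders", "v1", ORDERS_V1)
registry.register("orders", "v2", ORDERS_V2)

payload = {"order_id": "o-1", "total_cents": 500, "coupon_code": "SPRING"}
print(topic_for("orders", "v1"), registry.validate("orders", "v1", payload))
print(topic_for("orders", "v2"), registry.validate("orders", "v2", payload))
```

Consumers pinned to `orders.v1` never see the new field because the payload does not validate against v1, while consumers that opted in to `orders.v2` accept it.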

Schema drift incidents dropped from weekly to zero in four months. The only remaining failure mode is when a service publishes an event with a timestamp in the future. Our new envelope includes a validation rule that the timestamp must be within 5 seconds of the broker's clock. The broker rejects the message immediately, so we never lose data to bad clocks.
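The 5-second rule amounts to a skew check against a reference clock. A minimal sketch, assuming millisecond epoch timestamps; the function name and the injectable `now_ms` parameter are hypothetical and exist only so the check can be exercised deterministically, and where exactly the check runs in the pipeline is not shown here.

```python
import time

# Maximum allowed distance between the event timestamp and the reference clock.
MAX_CLOCK_SKEW_MS = 5_000


def timestamp_is_acceptable(event_ts_ms, now_ms=None, max_skew_ms=MAX_CLOCK_SKEW_MS):
    """Return True when the event timestamp is within max_skew_ms of the clock.

    `now_ms` defaults to the local wall clock; pass it explicitly in tests.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return abs(event_ts_ms - now_ms) <= max_skew_ms


reference = 1_700_000_000_000
print(timestamp_is_acceptable(reference + 4_000, now_ms=reference))  # inside the window
print(timestamp_is_acceptable(reference + 6_000, now_ms=reference))  # too far in the future
```

Note that a symmetric window like this also rejects timestamps that are more than 5 seconds in the past, so a producer that buffers events before sending would need a wider bound on that side.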

What I Would Do Differently

I would never start with a default event pipeline again, no matter how attractive the quick demo looks. The moment someone says "just wire Debezium into Kafka and call it a day," I will now ask one question: "What does your replay story look like when the schema changes next quarter?" If