The Configuration Hell of Events in Production—and How I Solved It

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our Veltrix platform ingested 2.3 million events per second at peak, but the on-call rotation was drowning in pages for two reasons that had nothing to do with throughput:

State divergence across regions: Each availability zone ran its own event router, and the configuration for transforming order_placed into inventory_reserved lived in a JSON blob referenced by three different tools (Kafka Streams, Flink, and a bespoke Node service). When Europe rolled out a schema change without bumping the version in main, Asia happily processed corrupted payloads for 58 minutes before someone noticed.
The impedance mismatch between event contracts and operational reality: We treated events as immutable contracts, so every operational change required a new event type (order_placed_v2, order_placed_v3, etc.). That inflated our Kafka partition count by 47%, which turned one outage into four separate recovery windows because the partition reassignment tool kept freezing when the topic had 7,200 partitions.

The real problem wasn't performance. It was cognitive load on operators: when a purchase schema changed, no one could answer "Who owns this transformation?" in under 10 minutes.

What We Tried First (And Why It Failed)

We started with OpenAPI contracts, assuming the problem was schema discovery. That failed because:

The OpenAPI generator for Avro couldn't handle the dozen nested unions our business required.
Our pipeline tools (Kafka Connect, Flink SQL, and the Node router) each interpreted the same schema differently. The Flink job threw "Cannot cast STRUCT to VARCHAR" while Kafka Connect silently dropped the field. No one on call could correlate the two logs because they were 15 minutes apart and in different dashboards.

Next we tried a centralized schema registry (Apicurio 2.4.0) with a shared subject namespace. That failed because:

Business units treated the registry like a DDL tool. The #/components/schemas section of the OpenAPI file became a three-way merge hell between payments, inventory, and fraud when we tried to version order_placed.
The registry's compaction window was 24 hours. When the payments team updated user_id from STRING to UUID at 2 AM, the fraud team's nightly batch job processed three million events using the old schema before the registry converged. That cost us $47,000 in false positives.

Finally, we tried a monorepo of event definitions with a custom schema compiler. That failed because:

The compiler introduced a 30-second build step, which broke our CI pipeline when a business analyst renamed a field in a pull request.
The monorepo grew to 11 GB of Avro IDL, and git status became unusable. Blaming each other over Slack for merge conflicts replaced the original problem of blaming each other over outages.

The Architecture Decision

I ripped out the monorepo and replaced it with contract ownership boundaries. The rule: every event type belongs to one business capability, and the capability owns both the schema and the transformation code.

We implemented this with:

A typed event contract registry: Each event type is a single YAML file in a Git repository owned by the team that produces it. The file includes:
- capability: payments (must match the Git repo)
- schemaLocation: payments/schemas/orderplaced.avsc
- transformations: [{in: orderplaced, out: inventory.reserve, service: inventory-service}]
A build-time contract compiler: At CI time, we compile all contracts into a single artifact (contracts-<sha>.jar) that contains:
- A Java enum ContractId with precomputed hash for each event type
- A generated Avro schema per event type
- A protobuf schema per event type (for gRPC fan-out)
A runtime transformation registry: Each microservice mounts the JAR at /contracts/contracts.jar and uses the enum to resolve event handlers. If the handler is missing, the service throws ContractNotFoundException with a detailed message including the missing ContractId and the team that owns it.
A cross-team validation pipeline: Every Friday at 11 AM UTC, the payments team runs a job that validates all event transformations against a synthetic event stream. If the job fails, it posts a Slack message to #alerts-payments and tags the inventory team in Git. We instrumented this with Prometheus metrics: contract_validation_error_total{event_type="orderplaced", owner="payments"}. After three weeks, the error rate dropped from 18 incidents per month to 2.
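
The runtime lookup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the generated code: the enum constants, the owner field, the HandlerRegistry class, and the use of plain String payloads are all assumptions made for the example, and the precomputed hash and the JAR loading are left out.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the generated enum; the real one would ship in the contracts JAR.
enum ContractId {
    ORDERPLACED("payments"),
    INVENTORY_RESERVE("inventory");

    private final String owner;

    ContractId(String owner) {
        this.owner = owner;
    }

    String owner() {
        return owner;
    }
}

// Names the missing contract and the owning team, so on-call knows whom to contact.
class ContractNotFoundException extends RuntimeException {
    ContractNotFoundException(ContractId id) {
        super("No handler registered for contract " + id + " (owner: " + id.owner() + ")");
    }
}

// Assumed helper: maps each ContractId to a handler function.
class HandlerRegistry {
    private final Map<ContractId, Function<String, String>> handlers =
            new EnumMap<>(ContractId.class);

    void register(ContractId id, Function<String, String> handler) {
        handlers.put(id, handler);
    }

    String dispatch(ContractId id, String payload) {
        Function<String, String> handler = handlers.get(id);
        if (handler == null) {
            // Fail loudly instead of silently dropping the event.
            throw new ContractNotFoundException(id);
        }
        return handler.apply(payload);
    }
}

class HandlerRegistryDemo {
    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register(ContractId.ORDERPLACED, payload -> "reserve:" + payload);

        System.out.println(registry.dispatch(ContractId.ORDERPLACED, "order-1"));
        try {
            registry.dispatch(ContractId.INVENTORY_RESERVE, "order-1");
        } catch (ContractNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keying the map on an enum rather than a string means a renamed or removed event type shows up as a compile error in the consuming service, and a missing handler surfaces as an exception that already names the owning team.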

The key tradeoff: we accepted build-time coupling (all teams must run the validation pipeline) to gain runtime decoupling (services resolve contracts by type, not by upstream changes).
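
To make the build-time side of that tradeoff concrete, here is a rough sketch of what a validation run over synthetic events might look like. It is not our actual job: the DeclaredTransformation and ContractValidator classes are hypothetical, payloads are plain strings, and a simple map stands in for the Prometheus counter contract_validation_error_total, keyed by the same event_type and owner labels.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical in-memory form of one entry under "transformations" in a contract file.
final class DeclaredTransformation {
    final String in;
    final String out;
    final String owner;
    final UnaryOperator<String> fn;

    DeclaredTransformation(String in, String out, String owner, UnaryOperator<String> fn) {
        this.in = in;
        this.out = out;
        this.owner = owner;
        this.fn = fn;
    }
}

// Runs each transformation over its synthetic events and counts failures per label set.
class ContractValidator {
    private final Map<String, Integer> errors = new TreeMap<>();

    void validate(List<DeclaredTransformation> transformations,
                  Map<String, List<String>> syntheticEvents) {
        for (DeclaredTransformation t : transformations) {
            List<String> events = syntheticEvents.getOrDefault(t.in, Collections.emptyList());
            for (String event : events) {
                try {
                    String result = t.fn.apply(event);
                    if (result == null || result.isEmpty()) {
                        recordError(t); // an empty output counts as a failed transformation
                    }
                } catch (RuntimeException e) {
                    recordError(t); // a thrown exception counts as a failed transformation
                }
            }
        }
    }

    private void recordError(DeclaredTransformation t) {
        errors.merge("event_type=" + t.in + ",owner=" + t.owner, 1, Integer::sum);
    }

    Map<String, Integer> errors() {
        return errors;
    }
}

class ContractValidatorDemo {
    public static void main(String[] args) {
        DeclaredTransformation reserve = new DeclaredTransformation(
                "orderplaced", "inventory.reserve", "payments",
                event -> {
                    if (!event.contains("order_id")) {
                        throw new IllegalArgumentException("missing order_id");
                    }
                    return "reserve:" + event;
                });

        Map<String, List<String>> synthetic = new HashMap<>();
        synthetic.put("orderplaced", Arrays.asList("{\"order_id\":1}", "{\"user_id\":7}"));

        ContractValidator validator = new ContractValidator();
        validator.validate(Collections.singletonList(reserve), synthetic);
        System.out.println(validator.errors());
    }
}
```

A non-empty error map is what would fail the job and trigger the Slack message; in a real setup the counts would presumably be exported as metrics rather than printed.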

What The Numbers Said After

After six months:

Outage MTTR for event-related incidents dropped from 87 minutes to 7 minutes.
Schema change velocity increased from 2 releases per month to 12 releases per month, with zero cross-team merge conflicts.
Kafka topic cardinality dropped from 7,200 partitions to 1,200 partitions because we stopped creating order_placed_v2.
The top on-call complaint shifted from schema conflicts to something saner: IAM token rotation.

The most surprising metric was cognitive load. Before the change, the average operator needed 15 minutes to locate an event contract. After the change, it took 28 seconds. That 32x speedup meant fewer pages in the middle of the night.

What I Would Do Differently

I would not have let the payments team own the inventory_reserved contract. Inventory gets its own contract now, even though it's a one-line transformation. The ownership boundary prevents the payments team from accidentally breaking fraud rules when they tweak the order schema.
I would have versioned the contract registry itself. Our first deployment used a flat JAR. When we needed to migrate from Avro to Protobuf, we had to redeploy every service. Now we use a semantically versioned contract registry.