I Should Have Put Events in the Same Database as the Aggregate Root: Here's What Happened

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our CQRS model kept Events in a separate Kafka cluster labeled event-store, while Aggregates lived in PostgreSQL. The outbox pattern wrote Events to Kafka via Debezium, and the read side consumed them to build materialized views. The promise was eventual consistency with zero data loss. The reality was a 40-millisecond write path plus a 200-millisecond read path, and every time we scaled the read path the lag exploded because the offset commit cycle couldn't keep up with the volume. At 800 RPS the materialized views were 2.3 seconds stale; at 250k RPS the lag peaked at 4.2 million unprocessed events and the consumers restarted every 15 minutes with ZooKeeper session timeouts. PagerDuty woke us at 3 a.m. three nights in a row.
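To see how quickly a backlog like that builds, here is a back-of-the-envelope sketch. It assumes constant produce and drain rates, which real traffic never has, and the function names and the 240k events/s drain rate are hypothetical illustrations, not measurements from our cluster.

```python
def backlog_after(produce_rps: float, drain_rps: float, seconds: float,
                  start: float = 0.0) -> float:
    """Unprocessed events after `seconds`, assuming both rates stay constant."""
    return max(0.0, start + (produce_rps - drain_rps) * seconds)


def staleness_seconds(backlog: float, drain_rps: float) -> float:
    """Rough view staleness: how long the consumers need to drain the backlog."""
    if drain_rps <= 0:
        return float("inf")
    return backlog / drain_rps


if __name__ == "__main__":
    # Hypothetical: producers at 250k events/s, consumers draining 240k events/s.
    backlog = backlog_after(250_000, 240_000, seconds=420)
    print(f"backlog after 7 minutes: {backlog:,.0f} events")
    print(f"time to drain if writes stopped: {staleness_seconds(backlog, 240_000):.1f} s")
```

Under those assumed rates, a shortfall of only 4% in consumer throughput is enough to accumulate a 4.2-million-event backlog in seven minutes, which is why adding consumers without fixing the commit cycle never caught up.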

What We Tried First (And Why It Failed)

We tried three incremental fixes before we admitted the boundary was wrong. First, we upgraded Kafka to 3.5 with transactional producers, hoping idempotent writes would tame the lag. The lag dropped 12%, which still left 3.7 million unprocessed events. Second, we moved the read-side consumers to a tiered architecture with 12 k8s pods across three AZs. CPU steal climbed to 45% on the underlying nodes and the 99th percentile read latency increased to 900 ms. Third, we switched from Debezium to the Kafka Connect JDBC source connector, thinking schema evolution was our bottleneck. That introduced 20-second schema validation pauses and the lag climbed to 5.1 million events. Each attempt optimized one metric while breaking another; none touched the fundamental latency tax of crossing two data stores and two networks.

The Architecture Decision

We tore the boundary down. We moved Events into the same PostgreSQL cluster as their Aggregate roots and stored them in jsonb columns with a GIN index on (aggregate_id, event_sequence). We replaced Kafka with logical replication slots feeding a single Go service that emitted a compact binary format over internal gRPC streams to downstream consumers. The write path became a single round-trip: client → PostgreSQL → replication slot → gRPC. The read path now reads Events via a foreign table in the same cluster, so materialized views refresh in 50 ms instead of 2.3 s. We accepted two tradeoffs: we lost Kafka's disk spill-over and had to shard PostgreSQL at 5 TB per node. In return we gained 35% lower p99 latency, and the lag metric disappeared from the dashboard. The total cost of ownership dropped by 28% because we eliminated two managed services (Kafka and Debezium) and reduced the monitoring stack by six Prometheus exporters.
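The core of the change is that an Aggregate update and its Event commit or roll back together. Here is a minimal sketch of that idea, not our production code: in-memory SQLite stands in for PostgreSQL so the example runs anywhere, the payload is stored as JSON text rather than jsonb, and the table names, column names, and the `append_event` helper are all hypothetical.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    # Assumption: SQLite stands in for PostgreSQL; TEXT stands in for jsonb.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE aggregates (
            aggregate_id TEXT PRIMARY KEY,
            version      INTEGER NOT NULL,
            state        TEXT NOT NULL
        );
        CREATE TABLE events (
            aggregate_id   TEXT NOT NULL,
            event_sequence INTEGER NOT NULL,
            payload        TEXT NOT NULL,
            PRIMARY KEY (aggregate_id, event_sequence)
        );
    """)
    return conn


def append_event(conn: sqlite3.Connection, aggregate_id: str,
                 expected_version: int, new_state: dict, event: dict) -> int:
    """Update the aggregate and append its event in one transaction.

    Raises RuntimeError if another writer already moved the aggregate
    past `expected_version` (optimistic concurrency check).
    """
    next_version = expected_version + 1
    with conn:  # commits on success, rolls back if anything below raises
        if expected_version == 0:
            conn.execute(
                "INSERT INTO aggregates VALUES (?, ?, ?)",
                (aggregate_id, next_version, json.dumps(new_state)),
            )
        else:
            cur = conn.execute(
                "UPDATE aggregates SET version = ?, state = ? "
                "WHERE aggregate_id = ? AND version = ?",
                (next_version, json.dumps(new_state), aggregate_id, expected_version),
            )
            if cur.rowcount != 1:
                raise RuntimeError("version conflict")
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (aggregate_id, next_version, json.dumps(event)),
        )
    return next_version


if __name__ == "__main__":
    store = open_store()
    append_event(store, "order-1", 0, {"status": "created"}, {"type": "OrderCreated"})
    append_event(store, "order-1", 1, {"status": "paid"}, {"type": "OrderPaid"})
    print(store.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

Because both writes share one transaction, there is no outbox to relay and no window in which the Aggregate has changed but its Event has not been recorded. The composite primary key also rejects duplicate sequence numbers for the same Aggregate.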

What The Numbers Said After

After the change we ran a 48-hour load test against prod traffic. The p99 write latency dropped from 48 ms to 12 ms; the p99 read latency (materialized views) dropped from 900 ms to 45 ms. The replication slot lag stayed at zero for the entire test. CPU usage on the PostgreSQL nodes climbed from 35% to 60%, but memory pressure stayed flat thanks to shared buffers tuned to 25% of RAM. We shrank the fleet from 24 Kafka brokers to 12 PostgreSQL nodes and cut managed-service spend from $18k to $13k per month. The only new failure mode was logical replication lag when a primary failover happened; we mitigated it by pinning one replica as a hot standby with pg_rewind, which added 20 ms of failover time but kept the lag at zero during switchover.
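For anyone who wants to check the relative improvements, the before and after figures above reduce to simple percentage arithmetic. The helper name `pct_reduction` is just for illustration; the inputs are the numbers quoted in this section.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Before/after pairs taken from the load-test figures in this section.
figures = {
    "p99 write latency (ms)": (48, 12),
    "p99 read latency (ms)": (900, 45),
    "managed-service spend ($k/month)": (18, 13),
}

for name, (before, after) in figures.items():
    print(f"{name}: {before} -> {after} ({pct_reduction(before, after):.1f}% lower)")
```

The spend line works out to roughly 28%, which is where the cost-of-ownership figure comes from.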

What I Would Do Differently

I would not have started with separate systems. The inflection point was obvious once the numbers crossed 800 RPS; by then we had already burned six weeks and $42k in cloud bills chasing the wrong abstraction. Next time I'll put Events in the same database from day one and use logical replication as the event bus rather than an external broker. I'd still keep raw event streams in an object store for replay, but I'd never again pay the network and latency tax of a second database. The lesson isn't that Kafka is bad; it's that service boundaries must be justified by real data, not cargo-cult architecture. We did the math after the fire; I'll do the math before the next one.