The Day We Broke the Events System (And How We Fixed It)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In early 2024, we were running the Veltrix event pipeline at 80K events per second with a default PostgreSQL configuration for the audit log table. That throughput sounds healthy until you realize the table had 227 million rows and the weekly Opsgenie wake-up was triggered by P99 latencies hitting 4.2 seconds on SELECTs. Not because of the volume—PostgreSQL could handle that—but because the events table had no partitioning strategy, only an IDENTITY column primary key, and a GIN index on event_metadata that the planner kept ignoring. During the Black Friday sale, the nightly batch job that archived events to S3 would lock the table for 45 minutes, causing downstream users to see duplicate events in their dashboards. The finance team nearly filed a ticket to my manager when the revenue report for Cyber Monday was off by $1.2 million due to missing events. The default configuration assumed events were ephemeral; we used it like a warehouse.

What We Tried First (And Why It Failed)

Our first attempt was a classic over-engineering trap: we moved the audit logs to Amazon Timestream, a managed time-series database, thinking write-heavy event ingestion would benefit from its high-throughput ingestion API. The throughput jumped to 120K events/sec on paper, but the timeline() function in Timestream required sorting keys, and our event payloads were heterogeneous—some had user IDs, others had order IDs, others had device fingerprints. Querying across all types of events became a full table scan disguised as a query, and the cost exploded from $180/month to $2,300/month almost overnight. Worse, the cold-start latency for analytics dashboards spiked to 11 seconds because Timestreams memory store only held 24 hours of data. When the marketing team asked for a retroactive funnel analysis on last months traffic, we had to export 47 million events to S3, re-ingest them into a temporary Snowflake cluster, and wait three hours for a result that should have taken minutes. The CFO started asking about cloud cost discipline, and I had to explain why we were using a $14K/month database to store clickstream events that cost $0.50 each to reprocess in batch.

The Architecture Decision

We ripped out the audit table entirely and replaced it with an event sourcing backbone built on Apache Kafka with Tiered Storage enabled. Heres the brutal tradeoff we accepted: we gave up strict ACID semantics for eventual consistency in exchange for durability and horizontal scalability. Kafkas compacted topics became our source of truth, and we built a lightweight CQRS layer on top using Flink for real-time aggregations and materialized views. We chose Kafkas idempotent producer with exactly-once semantics turned on, which cost us 15% more CPU per broker but eliminated duplicate events. For the archive pipeline, we used KIP-405 Tiered Storage to offload older segments to S3, reducing broker disk pressure by 68%. On the query side, we implemented a sidecar cache using Dragonfly, a Redis fork optimized for large values, to serve recent events with sub-millisecond latency. The cache invalidates using Kafkas event sourcing model: when an event is produced, its key is hashed and pushed to a Redis stream, triggering cache eviction. Its not perfect—cache misses still happen during spikes—but the P99 for event retrieval dropped from 4.2 seconds to 8 milliseconds once the system stabilized.

What The Numbers Said After

After six months, the event pipeline handled 1.1 million events per second at peak during Prime Day, with P99 latency for event ingestion at 12 milliseconds and retrieval at 22 milliseconds. The storage cost for Kafka Tiered Storage on S3 settled at $4,200/month—cheaper than the original PostgreSQL instance once we factored in the cost of developer time debugging locks and archiving jobs. The duplicate event rate dropped to 0.002%, which saved the finance team from another audit fire drill. The Dragonfly cache served 94% of requests from memory, and the Flink job reprocessed events in under a minute when we had to replay a partition due to a broker misconfiguration. Prometheus alert veltrix_events_high_tail_latency fired only twice in six months, and both times it was due to a consumer group lagging on a misconfigured RocksDB state backend. The worst-case scenario was a 3.4-second spike during a rolling restart when Flinks checkpointing coincided with a Kafka broker rolling upgrade—we fixed it by increasing the checkpoint interval from 30 seconds to 60 seconds and pre-warming the cache.

What I Would Do Differently

If I could go back, I would not have used Kafkas compacted topics for the raw event store. Compaction is a lie in high-volume event systems because it silently drops late-arriving events when the retention policy kicks in, and we spent two weeks debugging why user sessions were truncating in the middle of a funnel analysis. Next time, I would use Kafkas non-compacted topics with a tombstone-based retention policy and handle deduplication in Flink using event IDs and a RocksDB state store. I would also avoid Dragonfly for caching. Its forked architecture means it doesnt support Redis modules, so we had to write our own serialization layer for event blobs, which introduced subtle bugs when event schemas changed. We are migrating to Valkeys cluster mode with module support this quarter—its more stable and supports Lua scripting for cache invalidation. Lastly, I would invest in a proper event replay tool from day one instead of relying on Flinks savepoints. During the Prime Day load test, a misconfigured consumer group corrupted its state store, and it took us 90 minutes to restore from a backup S3 bucket. A simple event replay tool that reads from a specific offset would have saved us that pain.