The Day Our Event Pipeline Collapsed Under 50k Writes/sec—And How We Fixed It

#webdev #programming #architecture #systems

We thought we had events figured out.

It was 2024, and the Veltrix platform was growing fast—not just in users, but in sheer event volume. A single user action could fire off five or six events, each with multiple attributes: click, impression, purchase, session_start, session_end, user_updated, plan_changed. We were logging everything, all the time. Our initial architecture treated events like streams of log lines—append them to a Kafka topic, sink to S3, process in batch. Simple. Scalable.

Then, at 2:17 PM on a Thursday, our ingestion service—event-collector-v1—started throwing TooManyRequests errors from DynamoDB. Again. Wed seen this before. Not under load. Just… randomly. CloudWatch showed throttling spikes even when total throughput was only 12k writes/sec across 1,200 shards. We had set RCU=1,000 per shard. That should handle 10k writes/sec with room to spare. But the error said ThrottlingException: User: ... is exceeding capacity.... Why?

We dug into the metrics: 80% of our writes were conditional—PutItem with ConditionExpression to avoid duplicates. Every duplicate event meant a retry. And every retry hit the same throttled shard key: the user ID. Hot keys. Our shard strategy was event_type + user_id_hash, which seemed smart—distribute by user. But in practice, active users clustered around a few high-value segments. So 30% of our traffic landed on 3% of shards. Those shards were melting.

We tried a few things.

First, we switched to event-collector-v2 with a write-through cache in front of DynamoDB. We used Redis Cluster with 64 shards and a TTL of 30 seconds. We thought: cache hot writes, reduce pressure on DynamoDB. It worked for a week. Then, during a Black Friday-like surge, Redis memory usage hit 98% in 12 minutes. Our monitoring tool, redis-exporter, started returning OOM command not allowed when used memory > memory limit. We had to restart nodes. After two restarts, we lost 1.8 million events in the time it took for the cache to warm back up. Not acceptable.

Next, we tried changing the shard key entirely. We moved to a time-based partition: event_type + hashed_user_id + hour_bucket. This spread writes more evenly. Throttling dropped to near zero. But latency spiked. A user action that triggered three events would now see them land in different partitions, meaning we couldnt enforce idempotency in real time. We had to rely on eventual de-duplication in downstream batch jobs. That introduced a 30-second window where duplicate purchases could be counted. Finance noticed. Twice.

Then came the real fix: stop treating events like logs.

We scrapped the log pipeline for a bounded context: Event Bus as a Service.

We modeled events as domain events with strict schemas. We used Amazon EventBridge Pipes with Kafka as the source and a new DynamoDB table as the sink—this time partitioned by event_type + event_id_hash (a UUID-based hash). No conditionals. No retries. Just PutItem with a 100ms TTL on the conditional write. We moved all de-duplication to a background job that ran every 15 minutes, using a RocksDB-backed deduplication store in Flink. We lost real-time idempotency, but gained throughput and stability.

The numbers after?

We ran a load test simulating 50k writes/sec across 20 event types. DynamoDB throughput: 4.5k RCU, 15k WCU. We used on-demand mode with auto-scaling enabled. Max item size: 4KB. Average latency: 8.7ms for ingestion, 2.1ms for processing. CPU utilization on the Kafka brokers stayed below 65%. We used kafka-lag-exporter to monitor consumer lag. We set alerting at 100ms lag for the top 5 lagging partitions. No alerts fired.

The error in the old system was clear: hot keys and conditional writes. The Redis cache failed because we treated events as transient data, but they werent. They were critical telemetry. The new system accepts eventual consistency for idempotency—something we avoided until we realized the cost of doing it in real time was higher than the cost of fixing it later.

What I would do differently?

I never would have used conditional writes for ingestion. The Toggle Pattern: write unconditionally, then filter in downstream services. Thats the discipline we eventually adopted. We also built a schema registry early—using AWS Glue Schema Registry—so event schemas were versioned and validated before ingestion. That saved us during a breaking change in the purchase event schema that would have poisoned our pipelines.

We also should have invested earlier in observability around event flows. We added a metadata field to every event: ingest_timestamp, event_version, source_service. Then we used OpenTelemetry spans in a custom interceptor to track end-to-end latency. Before that, when an event was delivered, we had no idea if it was 2 seconds or 20 seconds old.

Our mistake wasnt in the tools—it was in the mental model. We thought events were logs. Theyre not. Theyre the nervous system of the product. Treat them like that, and you build pipelines that dont collapse under their own weight.

DEV Community

The Day Our Event Pipeline Collapsed Under 50k Writes/sec—And How We Fixed It

Top comments (0)