When the Event Log Became a Liability: What Happened When We Treated Events Like Garbage

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The Treasure Hunt Engine is a multiplayer game where players dig for virtual gems, craft tools, and compete on leaderboards. Every action produces an event: dig_started, gem_found, score_updated, inventory_cleared. We needed a system that could ingest, deduplicate, and propagate these events to every player in under 200 ms while guaranteeing no double-counting of rare gems.

Our first attempt used Kafka as a raw log, with a downstream Flink job for deduplication and scoring. The contract was simple: events arrive, Flink groups by player_id and action_type within a 5-second window, then publishes to Redis streams for fan-out. Flink also wrote every deduplicated event back to Kafka under a processed topic for replay.
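To make the deduplication contract concrete, here is a minimal in-memory sketch of windowed deduplication keyed on player_id and action_type. It is not the Flink job itself: the Event type, the dedupe_tumbling function, and the choice of tumbling (non-overlapping) windows are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    player_id: str
    action_type: str
    timestamp: float  # seconds since epoch


def dedupe_tumbling(events, window_seconds=5.0):
    """Keep the first event per (player_id, action_type) in each tumbling window."""
    seen = set()
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Assumption: windows are aligned to multiples of window_seconds.
        window_index = int(ev.timestamp // window_seconds)
        key = (ev.player_id, ev.action_type, window_index)
        if key in seen:
            continue  # same player and action already seen in this window
        seen.add(key)
        kept.append(ev)
    return kept


events = [
    Event("a", "p1", "gem_found", 0.5),
    Event("b", "p1", "gem_found", 4.9),  # same window as "a": dropped
    Event("c", "p1", "gem_found", 5.1),  # next window: kept
    Event("d", "p2", "gem_found", 0.6),  # different player: kept
]
kept_ids = [e.event_id for e in dedupe_tumbling(events)]
```

Under these assumptions, a retry that lands just after a window boundary (event "c" here) survives deduplication, which is one way duplicates can leak through a purely time-windowed scheme.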

The first latency spike happened during the Black Friday gem drop, when 300,000 players simultaneously clicked dig. Flink's backpressure alarms fired within 90 seconds. The real-time score dashboard showed 1,847 duplicate gem_found events in the first minute, meaning Flink's deduplication window was either too narrow or too slow. We widened the window to 10 seconds and pushed the Flink autoscaler to 40 TaskManagers. The backpressure subsided, but the duplicate count stabilized at 1,102 per minute, which was still unusable. The error message in our logs repeated: FlinkException: Failed to commit checkpoint within 60000 ms. We discovered Flink was writing 1.2 TB of state to S3 every minute for exactly those 10-second windows. The cost of stronger consistency was turning into bill shock.

What We Tried First (And Why It Failed)

We pivoted to an at-most-once strategy using AWS EventBridge pipes and Lambda. Events entered EventBridge, bypassed stateful processing, and landed in DynamoDB with a conditional write on event_id. Our hope was to trade consistency for throughput, trusting that duplicates would be rare. Within 48 hours the Treasure Hunt forums exploded: players reported gem counts increasing when they refreshed the page. The new error was plain: ConditionalCheckFailedException: The conditional request failed. We had replaced Flink's deduplication with DynamoDB's conditional writes, but EventBridge delivers at-least-once, and Lambda retries meant that events could arrive multiple times before the conditional write succeeded. The fan-out Redis stream still received duplicates because the deduplication step lived downstream of the game server.
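The ordering problem is easier to see in code. The sketch below uses an in-memory stand-in for a table with put-if-absent semantics. It is not DynamoDB or our Lambda handler, and ConditionalStore, put_if_absent, and handle_delivery are hypothetical names. It shows a handler that treats a failed conditional write as the expected signal for a redelivered event and fans out only after the write succeeds.

```python
class ConditionalStore:
    """In-memory stand-in for a table with put-if-absent semantics."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, event_id, item):
        """Return True if the item was written, False if event_id already existed."""
        if event_id in self._items:
            return False
        self._items[event_id] = item
        return True


def handle_delivery(store, fanout, event):
    """Deduplicate first, then fan out, so a redelivery never reaches players twice."""
    if not store.put_if_absent(event["event_id"], event):
        # Expected under at-least-once delivery: the event was already recorded.
        return "duplicate"
    fanout.append(event)
    return "processed"


store = ConditionalStore()
fanout = []  # stand-in for the stream that pushes events to players
event = {"event_id": "e-1", "player_id": "p1", "action_type": "gem_found"}

# Simulate at-least-once delivery: the same event arrives three times.
results = [handle_delivery(store, fanout, event) for _ in range(3)]
```

If the fan-out happens before or independently of the conditional write, every redelivery reaches players regardless of what the table rejects.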

We tried one more approach: idempotent event publishing from the game server itself. Each client generated a UUID4 event_id and signed the payload with a private key. The Redis stream consumer enforced exactly-once semantics via a Lua script that checked SETNX event_id 1. The latency improved immediately: median end-to-end latency dropped to 280 ms. The duplicate count fell to zero for normal traffic. But on Black Friday the Lua script timed out after 5 seconds, Redis connections piled up, and the game servers started throwing RedisConnectionException: too many open connections. We had replaced network backpressure with CPU-bound Lua scripts on Redis. The event log was still a liability.

The Architecture Decision

We needed a system that could guarantee exactly-once delivery while remaining stateless upstream. After benchmarking NATS JetStream, Apache Pulsar, and a custom CRDT-based log in Go, we chose Pulsar with a partitioned topic and a built-in deduplication cursor. The contract changed: every event carries a producer-supplied message key of the form player_id:action_type:epoch_second. Pulsar's deduplication window defaults to 10 minutes, but we tuned it to 5 seconds to match our scoring cadence. The game server publishes to Pulsar with ack-timeout=500 ms. If Pulsar does not ack within that window, the game client retries with exponential backoff and a new event_id.
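As a rough illustration of the key format and the retry schedule, here is a small sketch with no Pulsar client involved. The function names, the 0.5-second base delay (borrowed from the 500 ms ack timeout), the doubling factor, the 8-second cap, and the attempt count are all assumptions rather than our production settings.

```python
import random


def message_key(player_id, action_type, timestamp):
    """Build a key of the form player_id:action_type:epoch_second."""
    return f"{player_id}:{action_type}:{int(timestamp)}"


def backoff_delays(base=0.5, factor=2.0, cap=8.0, attempts=5, jitter=0.0, rng=random):
    """Yield exponential backoff delays in seconds, capped, with optional jitter."""
    delay = base
    for _ in range(attempts):
        # Jitter is added on top of the capped delay; jitter=0.0 disables it.
        yield min(delay, cap) + rng.uniform(0.0, jitter)
        delay *= factor


key = message_key("p42", "gem_found", 1700000000.9)
delays = list(backoff_delays())
```

One consequence of truncating to the epoch second: two actions of the same type by the same player within one second produce the same key.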

For fan-out we replaced Redis streams with Pulsar's built-in Key_Shared subscription, which routes all events for a given player_id to a single consumer. The subscription is durable, so late-joining players can replay events from the last 24 hours. The scoring micro-service reads from the same topic with a separate subscription and writes results back to PostgreSQL under a transactional outbox pattern. Each outbox message carries a sequence_number that acts as a causality token. If a replay happens, downstream services can detect gaps and replay only the missing sequences.
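Gap detection on sequence_number can be sketched in a few lines. This assumes sequence numbers are contiguous integers per stream, which the outbox description implies but does not state, and missing_sequences is a hypothetical helper rather than code from our services.

```python
def missing_sequences(received, last_applied):
    """Return sequence numbers absent between last_applied and the highest received."""
    # Ignore anything at or below the last applied sequence, and collapse repeats.
    pending = {n for n in received if n > last_applied}
    if not pending:
        return []
    return [n for n in range(last_applied + 1, max(pending) + 1) if n not in pending]


# A consumer that last applied sequence 4 and then saw 5, 6, 9, 10 (with 6 repeated).
gaps = missing_sequences([5, 6, 9, 10, 6], last_applied=4)
```

A downstream service would request a replay of only the sequences in gaps instead of re-reading the whole 24-hour window.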

The migration took 10 days. We ran the old Kafka/Flink system in parallel for 72 hours, comparing end-to-end latency and duplicate rates. On the final cutover day, median latency dropped from 2.1 seconds to 142 ms. The duplicate reward count went from 1,102 per minute to zero for eight consecutive hours.

What The Numbers Said After

After six weeks on Pulsar, the metrics tell a different story:

P99 end-to-end latency: 187 ms (was 4,200 ms on Kafka/Flink)
Duplicate rewards per thousand actions: 0 (was 6.2 under DynamoDB conditional writes)
Pulsar internal backlog: 0 bytes (was 3.4 TB on S3 during Black Friday)
Monthly infrastructure cost for event backbone: $8,400 (was $22,000 for Kafka/Flink plus S3 plus EC2 autoscaling)

We also saved 2.3 FTE weeks per quarter because the new system no longer required manual Flink state cleanup or DynamoDB conditional-write throttling. On-call alerts dropped from 14 per week to 2.

What I Would Do Differently

I would not have started with Kafka or DynamoDB for exactly-once delivery. Both are great tools, but they solve different problems. Kafka excels at high-throughput append-only logs; DynamoDB excels at low-latency key-value access. When we forced them into an exactly-once deduplication role