The Kafka Lesson Nobody Teaches You Until a Customer Gets Charged Twice

#php #tutorial #kafka #architecture

TL;DR: Kafka guarantees at-least-once delivery which means your consumer will sometimes receive the same event twice. If processing that event has side effects like invoicing or sending email you must make the consumer idempotent. Here is the pattern that fixed it for us.

I want to start with the bug because the bug is the whole point.

A customer opened a support ticket saying they had two identical invoices for the same order. Same amount. Same line items. Created four seconds apart. Nobody on the team had touched that order manually.

We traced it back to one of our Rust microservices. It had restarted in the middle of processing a batch of events. When it came back Kafka redelivered the events it was not sure had finished. Our billing consumer processed those events a second time. Two invoices.

Nothing was broken in the way we usually mean broken. No exception. No failed job. The system did exactly what we told it to do. We just told it the wrong thing.

At-Least-Once Is Not a Bug

Here is the thing most Kafka tutorials skip. They teach you the producer publishes and the consumer consumes and everyone is happy. They do not sit you down and explain delivery guarantees.

Kafka offers at-least-once delivery by default. Read that word again. At least once. Not exactly once. The broker would rather send you an event twice than risk you never seeing it.

So when does a duplicate happen? A consumer processes an event and then crashes before it commits the offset back to Kafka. When it restarts Kafka sees an uncommitted offset and replays. A consumer group rebalances because a new instance joined. Partitions move and some events near the boundary get reprocessed. An offset commit itself fails on a network blip.

None of these are rare. They are normal operational events. Which means duplicate delivery is not an edge case you might hit. It is a certainty you will hit.

Why This Only Hurts Sometimes

If your consumer just updates a value the duplicate is harmless. Setting an order status to complete twice leaves it complete. That operation is naturally idempotent.

The pain starts when the consumer has a side effect that accumulates. Creating an invoice. Charging a card. Sending an email. Incrementing a counter. Run those twice and you get two of something that should be one.

So the rule is simple. Any consumer whose work is not naturally idempotent must be made idempotent by you.

The Pattern

The fix is to give every event a unique id at publish time and to record processed ids on the consumer side before doing any real work.

class InvoiceConsumer
{
    public function handle(array $event): void
    {
        // firstOrCreate is atomic at the database level.
        // wasRecentlyCreated tells us if THIS call inserted it.
        $record = ProcessedEvent::firstOrCreate(
            ['event_id' => $event['event_id']]
        );

        if (! $record->wasRecentlyCreated) {
            // We have seen this event before. Replay. Skip safely.
            return;
        }

        Invoice::create([
            'order_id'  => $event['order_id'],
            'tenant_id' => $event['tenant_id'],
            'amount'    => $event['total'],
        ]);
    }
}

The detail that matters is atomicity. The check and the insert are one operation. If two copies of the same event are processed at the exact same moment by two workers only one insert succeeds. The other sees wasRecentlyCreated as false and returns. A naive "select then insert if missing" has a race window between the select and the insert. firstOrCreate does not.

The producer side is the easy half.

KafkaProducer::publish('order.completed', [
    'event_id'  => (string) Str::uuid(),
    'order_id'  => $order->id,
    'tenant_id' => $order->tenant_id,
    'total'     => $order->total,
]);

The Mistake I Made

My first version of this stored processed ids in Redis with a time to live of one hour. It felt lighter than a database table.

It worked until a consumer fell behind during a traffic spike and did not get to some events until ninety minutes after they were published. The Redis keys had already expired. The consumer treated old events as new and we were back to duplicates.

Processed event records must live at least as long as the longest possible delay between publish and consume. For us that meant a real database table with a scheduled cleanup that only removes records older than thirty days. Not a short lived cache.

Keeping the Table Healthy

A processed events table grows forever if you let it. We run a nightly job that deletes rows older than thirty days. Thirty days is far longer than any realistic redelivery window so it is safe. The table stays small and the lookup stays fast because event_id is indexed and unique.

What I Would Do Differently

I would make idempotency the default for every new consumer from the first line of code. We added it reactively after the double invoice. It should have been a base consumer class that every real consumer extends so a developer can not write a non-idempotent consumer by accident.

I would also put the event_id in our logs from the start. Tracing the double invoice took longer than it should have because the duplicate events were not easy to correlate.

My Honest Take

Exactly-once processing is often described as something Kafka gives you. It is not really. What you get is at-least-once delivery plus the tools to make processing idempotent. Exactly-once is something you build on top, not something you receive.

Once that clicked, event-driven systems stopped feeling fragile to me. Duplicate delivery is not the system failing. It is the system protecting you from lost data. Your job is just to make the second copy harmless.

How do your consumers handle an event arriving twice?