
Gabriel Anhaia

The Idempotency Token Pattern Every Event-Driven System Forgets Until 3 AM


Picture the runbook entry every team eventually writes. It is 3 AM. PagerDuty fires. A consumer that ingests order.paid events from Kafka has just replayed the last several hours of a partition because a rebalance dropped it back to a stale committed offset. The Stripe calls downstream of that consumer fire again. A few hundred customers get billed twice. The on-call dev opens the runbook and finds a section called "Idempotency" with three bullet points and no code.

This is the post that section should have been.

The pattern is small. Maybe sixty lines if you count the table. Teams skip it because Kafka says "exactly-once semantics" on the box and engineers believe the box. Then partition rebalances, redelivery on consumer crash, retried producer batches, and dual-write outboxes all conspire to deliver the same event twice. The consumer dutifully processes it twice because nobody wrote down the contract.

microservices.io's idempotent-consumer pattern is explicit about it: assume at-least-once delivery even when the broker claims otherwise. Kafka, RabbitMQ, SQS, and EventBridge are no different: any of them can hand you the same message twice under failure, and the consumer is the one that decides whether that becomes a duplicate charge.

The contract, in four questions

Every idempotency token implementation answers four questions. Get any of them wrong and the pattern silently leaks duplicates.

1. Who generates the token. The producer of the logical operation. Not the broker, not the consumer. The producer is the only party that knows whether two messages represent the same intent or two distinct intents that happen to look the same. Stripe makes this explicit in their idempotency docs: the client sends Idempotency-Key, the server stores the result against it. For event-driven systems the "client" is whatever service emits the event. UUIDv4 is the boring correct answer; some teams hash a deterministic tuple of (aggregate_id, command_type, command_payload) when they want producer crashes to recover the same key on retry. A sketch of that derivation follows question 4.

2. Where the token is stored. Inside the same database transaction as the side effect. This is the load-bearing detail that 90% of half-baked implementations miss. If you check the token in Redis and then write to Postgres, two concurrent replays can both pass the Redis check before either writes the Postgres row. The token row and the business row must commit together or not at all.

3. How long it lives. TTL depends on retry windows, not memory budgets. Stripe keeps keys for at least 24 hours on v1 and up to 30 days on v2. For Kafka consumers the practical floor is "longer than the longest possible redelivery window," which means longer than your max consumer-lag SLO, longer than your DLQ retry backoff, longer than your manual-replay tooling's worst case. 7 days is a reasonable default. 1 hour is wrong and you will find out which hour.

4. What you return on a duplicate. The original result. Not "ok" with no payload. Not 200 and silence. The same response body the original request produced. Stripe stores the response code and body keyed by the idempotency key for exactly this reason: a retried request gets the original outcome, even if the original was a 500. For event consumers the equivalent is: complete the message (ack it) without re-running side effects, and surface the original processed_at timestamp on any audit trail.
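
For the deterministic-tuple option in question 1, a producer-side sketch. The field names and the pipe delimiter are illustrative, not part of the pattern; a real implementation should delimit fields unambiguously:

package producer

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// DeterministicKey yields the same token for the same logical command, so a
// producer that crashes mid-send and retries re-emits an identical key.
func DeterministicKey(aggregateID, commandType string, payload []byte) string {
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%s", aggregateID, commandType, payload)))
    return hex.EncodeToString(sum[:])
}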

The schema

Single table. Two indexes. No ORM ceremony. The primary key is (key, consumer), so two different consumers can each record the same event key without colliding.

CREATE TABLE idempotency_keys (
    key            TEXT        NOT NULL,
    consumer       TEXT        NOT NULL,
    request_hash   TEXT        NOT NULL,
    response_body  JSONB,
    response_code  INT,
    status         TEXT        NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at   TIMESTAMPTZ,
    PRIMARY KEY (key, consumer)
);

CREATE INDEX idx_idem_consumer_created
    ON idempotency_keys (consumer, created_at);

CREATE INDEX idx_idem_status
    ON idempotency_keys (status)
    WHERE status = 'in_flight';

request_hash is a SHA-256 of the canonicalized event payload. It catches the worst class of producer bug: same idempotency key, different payload. Stripe rejects this with an error. For event consumers, log loudly and refuse to process. Two events with the same key and different bodies means somebody upstream is reusing keys, and you do not want to be the system that papered over it.
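
"Canonicalized" can be as simple as a round-trip through encoding/json so key order and whitespace stop affecting the hash. A sketch; this is not RFC 8785 canonicalization, and numbers pass through float64 on the way, which is usually acceptable for a reuse check:

// canonicalHash normalizes object key order and whitespace before hashing,
// so two byte-for-byte different encodings of the same payload still match.
func canonicalHash(payload []byte) (string, error) {
    var v interface{}
    if err := json.Unmarshal(payload, &v); err != nil {
        return "", err
    }
    canon, err := json.Marshal(v) // encoding/json writes map keys in sorted order
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256(canon)
    return hex.EncodeToString(sum[:]), nil
}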

status has three values: in_flight, succeeded, failed. The in_flight state is what stops two concurrent replays from both running side effects when they hit the consumer at the same moment.
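
created_at is what the TTL from question 3 hangs off. A minimal sweep, meant to run from cron or a ticker goroutine; the 7-day window matches the default suggested above, and sparing in_flight rows is a choice that keeps stuck attempts visible:

// sweepExpiredKeys deletes completed rows older than the retry window.
func sweepExpiredKeys(ctx context.Context, db *sql.DB) (int64, error) {
    res, err := db.ExecContext(ctx, `
        DELETE FROM idempotency_keys
        WHERE created_at < NOW() - INTERVAL '7 days'
          AND status <> 'in_flight'
    `)
    if err != nil {
        return 0, err
    }
    return res.RowsAffected()
}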

The handler wrapper

Go, because the failure modes are easier to read when error handling is in the foreground.

package idempotency

import (
    "context"
    "crypto/sha256"
    "database/sql"
    "encoding/hex"
    "encoding/json"
    "errors"
)

var ErrKeyMismatch = errors.New("idempotency key reused with different payload")

type Result struct {
    Code int
    Body json.RawMessage
}

type Handler func(ctx context.Context, tx *sql.Tx) (Result, error)

func Wrap(
    ctx context.Context,
    db *sql.DB,
    consumer, key string,
    payload []byte,
    handler Handler,
) (Result, error) {
    hash := hashPayload(payload)

    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return Result{}, err
    }
    defer tx.Rollback()

    var existingHash, status string
    var body []byte
    var code sql.NullInt64
    err = tx.QueryRowContext(ctx, `
        SELECT request_hash, status, response_body, response_code
        FROM idempotency_keys
        WHERE key = $1 AND consumer = $2
        FOR UPDATE
    `, key, consumer).Scan(&existingHash, &status, &body, &code)

    switch {
    case errors.Is(err, sql.ErrNoRows):
        // First time. Claim the key.
        _, err = tx.ExecContext(ctx, `
            INSERT INTO idempotency_keys
                (key, consumer, request_hash, status)
            VALUES ($1, $2, $3, 'in_flight')
        `, key, consumer, hash)
        if err != nil {
            return Result{}, err
        }
    case err != nil:
        return Result{}, err
    default:
        if existingHash != hash {
            return Result{}, ErrKeyMismatch
        }
        if status == "succeeded" {
            return Result{Code: int(code.Int64), Body: body}, tx.Commit()
        }
        // 'in_flight' from a crashed prior attempt: re-run inside this tx.
    }

    res, handlerErr := handler(ctx, tx)
    if handlerErr != nil {
        return Result{}, handlerErr
    }

    _, err = tx.ExecContext(ctx, `
        UPDATE idempotency_keys
        SET status = 'succeeded',
            response_body = $1,
            response_code = $2,
            completed_at = NOW()
        WHERE key = $3 AND consumer = $4
    `, res.Body, res.Code, key, consumer)
    if err != nil {
        return Result{}, err
    }
    return res, tx.Commit()
}

func hashPayload(p []byte) string {
    sum := sha256.Sum256(p)
    return hex.EncodeToString(sum[:])
}

What it gets right that the naive version misses:

  • The SELECT ... FOR UPDATE row lock means two replays of the same message arriving on two consumer pods cannot both pass the existence check once the row exists: one waits, then sees succeeded. On the very first delivery the primary key does the same job: the second insert blocks until the first commits, fails with a unique violation, and the consumer's retry lands on the cached result.
  • The handler runs inside the same transaction as the idempotency row update. Side effects on the business tables, status flip on the idempotency row, both commit atomically.
  • A crashed prior attempt that left an in_flight row is re-run rather than being treated as a duplicate. If your handler is itself idempotent at the data-layer level (upserts, conditional writes), this is the right behavior. If it is not, change the branch to treat in_flight as "another worker is processing, return retry." A sketch of that variant follows this list.
  • The hash check catches reused keys with mutated payloads before any side effect fires.
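
If the handler is not safe to re-run, the default case changes shape: a committed in_flight row means "someone else owns this, back off." A minimal variant of that branch, with ErrInFlight introduced here for the sketch rather than coming from any library:

var ErrInFlight = errors.New("idempotency key is already being processed")

    // Inside Wrap's switch, replacing the re-run comment in the default case:
    default:
        if existingHash != hash {
            return Result{}, ErrKeyMismatch
        }
        if status == "succeeded" {
            return Result{Code: int(code.Int64), Body: body}, tx.Commit()
        }
        // Still in_flight: leave side effects alone and surface a retryable
        // error so the consumer nacks and the broker redelivers later.
        return Result{}, ErrInFlight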

The wiring at the consumer

func handleOrderPaid(ctx context.Context, msg KafkaMsg) error {
    var evt OrderPaidEvent
    if err := json.Unmarshal(msg.Value, &evt); err != nil {
        return err
    }

    _, err := idempotency.Wrap(
        ctx, db, "billing-charge-consumer",
        evt.IdempotencyKey, msg.Value,
        func(ctx context.Context, tx *sql.Tx) (idempotency.Result, error) {
            charge, err := stripeClient.Charge(
                // Forward the event's key so Stripe's own idempotency layer
                // covers a crash between this call and the commit below.
                ctx, evt.Amount, evt.CustomerID, evt.IdempotencyKey,
            )
            if err != nil {
                return idempotency.Result{}, err
            }
            _, err = tx.ExecContext(ctx, `
                INSERT INTO charges (id, order_id, stripe_id)
                VALUES ($1, $2, $3)
            `, charge.ID, evt.OrderID, charge.StripeID)
            if err != nil {
                return idempotency.Result{}, err
            }
            body, _ := json.Marshal(charge)
            return idempotency.Result{Code: 201, Body: body}, nil
        },
    )
    if errors.Is(err, idempotency.ErrKeyMismatch) {
        // Producer bug. Send to DLQ.
        return sendToDLQ(msg, err)
    }
    return err
}

Note one thing: stripeClient.Charge itself takes an idempotency key. Stripe accepts Idempotency-Key on every POST and replays the original response on retry. The event's IdempotencyKey should be the same value passed to Stripe. That way, even if the consumer crashes between the Stripe call and the database commit, the next run hits Stripe with the same key, gets the original charge back, and reconciles into the database.
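
If that wrapper sits on the official stripe-go client, threading the key through looks roughly like this. A sketch, not the post's stripeClient: the import version, PaymentIntent vs. Charge, the hard-coded currency, and the assumption that Amount is an int64 in minor units are all choices to adapt.

package billing

import (
    "github.com/stripe/stripe-go/v76"
    "github.com/stripe/stripe-go/v76/paymentintent"
)

// createCharge reuses the event's idempotency key as Stripe's Idempotency-Key,
// so a crash-and-replay gets the original payment back instead of a second one.
// stripe.Key is assumed to be configured elsewhere.
func createCharge(evt OrderPaidEvent) (*stripe.PaymentIntent, error) {
    params := &stripe.PaymentIntentParams{
        Amount:   stripe.Int64(evt.Amount),
        Currency: stripe.String(string(stripe.CurrencyUSD)),
        Customer: stripe.String(evt.CustomerID),
    }
    params.SetIdempotencyKey(evt.IdempotencyKey)
    return paymentintent.New(params)
}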

The idempotency table prevents duplicate Stripe calls in the happy path; Stripe's own idempotency layer is the belt-and-braces for the unhappy one. Brandur Leach's writeup on idempotency keys goes into the recovery state machine in detail.

The 3 AM failures this prevents

The consumer rebalance that replays a partition's last several hours? Each event arrives with the same key, hits the succeeded row, returns the cached response, acks. No duplicate charges. Logs show "idempotent replay" instead of a runaway charge spike.

The retried Kafka producer batch that emits the same event twice in 50 ms? Both consumers race to the row lock. One wins, runs the handler, commits. The second wakes, sees succeeded, returns the cached body.

The exactly-once feature you turned on? It still does not save you. The CockroachDB writeup on idempotency in event-driven systems argues the same thing: at-least-once is the floor every consumer should code for, and exactly-once is a property you build on top of that floor rather than a guarantee you inherit from the broker.

Where teams cut corners and regret it

A short list, in descending order of "how much money this has cost real teams I have read about."

  • Storing the key in Redis, side effects in Postgres, no two-phase reconciliation. Race-condition replay window: roughly the network RTT.
  • TTL of one hour because "Kafka redelivery happens fast." The dead-letter queue replay tooling does not happen fast. It happens on Tuesday at 2 PM after the on-call wrote a Jira.
  • No request_hash. Producer reuses keys after a deploy. Every duplicate looks like a happy-path replay. You find out three weeks later from finance.
  • Acking the message before the idempotency row commits. Crash between ack and commit means the next replay sees no row, runs side effects again, and the original ack hides the duplicate from your offset lag dashboard. The fix is to commit first and ack second, as sketched below.
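
For that last bullet, the safe ordering sketched with segmentio/kafka-go; the library is an assumption, and KafkaMsg is the post's own wrapper type, but any consumer client with manual offset commit reads the same way: fetch, process, commit.

// import "github.com/segmentio/kafka-go"
r := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"localhost:9092"},
    GroupID: "billing-charge-consumer",
    Topic:   "order.paid",
})
for {
    msg, err := r.FetchMessage(ctx) // read without committing the offset
    if err != nil {
        break
    }
    if err := handleOrderPaid(ctx, KafkaMsg{Value: msg.Value}); err != nil {
        continue // offset stays put; the broker redelivers and the table absorbs it
    }
    _ = r.CommitMessages(ctx, msg) // ack only after the DB transaction committed
}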

The pattern is not glamorous and the table is not interesting. The bug it stops, on the other hand, is the kind that wakes you up.

If this was useful

The idempotent-consumer pattern is one of about a dozen that decide whether your event-driven system stays consistent under partial failure. The Event-Driven Architecture Pocket Guide walks through outbox, saga, CQRS, and the failure modes each one masks until production exposes them. Chapter 4 is the long version of this post, including the in_flight recovery state machine and the dead-letter strategy for ErrKeyMismatch.
