Gabriel Anhaia

Posted on May 23

Event Replay Will Take Down Production. Here's How to Tag Replay-Safe vs Replay-Toxic Events at the Schema Level.

#architecture #eventdriven #eventsourcing #backend

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
Me: xgabriel.com | GitHub

Someone on your team triggered a replay of the last 24 hours of orders to recover from a corrupted projection. Two hundred customers got duplicate shipping notifications, and one got their card charged a third time. The replay ran successfully. That was the problem.

This post is about the lie inside the word "successfully."

The conference-talk version of replay vs the 3am version

The conference talk goes like this. Your read model is corrupted. No problem: you have the event log. You rewind, replay every event back through the projector, and arrive at a freshly-built, correct projection. Greg Young waves his hand and everyone nods. Replay is the superpower of event sourcing.

The 3am version is different. Your read model is corrupted. You rewind. You replay. Halfway through the replay, the projector emits a SendShippingNotification command. Then a ChargeStripe command. Then a ReportToTaxJurisdiction command. None of those side effects know they're happening inside a replay. The downstream systems don't know either. They process the commands like any other Tuesday.

By the time you notice, you've shipped 200 duplicate notifications, double-charged a handful of cards, and the tax service has a new line item for an order that was finalised six months ago. The replay ran successfully. The system did exactly what you told it to do. You told it the wrong thing.

The fix isn't "be more careful." Nobody is careful at 3am. The fix is that replay behavior has to be a property of the event type: declared at the schema level, enforced by tooling, and audited automatically.

Three categories of event, by replay behavior

Every event in an event-sourced system falls into one of three buckets. Most teams have all three mixed together in the same topic with no tag distinguishing them.

Replay-safe. Events whose only consequence is updating an internal projection. Replaying them rebuilds state and nothing else. ProductPriceUpdated, OrderLineItemAdded, ShippingAddressChanged. The projection consumes them, writes to its own store, and there is no observable side effect outside the read model. You can replay these all day.

Replay-restricted. Events that normally trigger an external side effect, but where the side effect can be suppressed during replay without losing correctness. OrderConfirmed might normally send an email and bump an analytics counter. During replay, you want the projection updated and the analytics counter incremented (because the dashboard rebuilds from the event log), but the email stays in its outbox. The recipient was already emailed when the original event fired six months ago. Sending again is a regression.

Replay-toxic. Events whose consequences live outside your boundary and cannot be undone, reversed, or even safely re-fired. PaymentCaptured (Stripe charged the card, your replay doesn't get to un-charge it). RefundIssued. TaxReported. ContractSigned (DocuSign sent the email, the signer signed it). InventoryReserved (a third-party warehouse decremented stock). Replaying a toxic event isn't a clever rebuild. It's a new, real-world action with consequences.

The mistake the literature makes is treating these as the same kind of object. They aren't. The taxonomy goes on the schema.

"The projection should be idempotent" isn't the full answer

The usual answer to the replay problem is "make your projections idempotent." That's correct as far as it goes. If your projection writes UPDATE orders SET status='confirmed' WHERE id=?, replaying it ten times leaves the row the same.

But idempotency lives inside your code. The Stripe API doesn't care that you have an idempotency key on your projection. The shipping carrier's webhook doesn't care. The tax authority's filing system definitely doesn't care. The moment your projection's apply() function calls out to a third party (directly, or by emitting a command that another service picks up), your replay has left the building.

You can wrap external calls in your own idempotency keys, and you should. But that turns every replay into an audit of every key, and most teams discover in the postmortem that their idempotency-key strategy was only partial. Take a PaymentCaptured event that triggered a Stripe charge six months ago. The idempotency key it used has long since aged out of Stripe's retention window (Stripe's documentation says keys become eligible for removal once they are at least 24 hours old). The replay sends the request again, whether with the same key or a fresh one, Stripe no longer recognises it, and Stripe charges again. Idempotency keys have TTLs; replay windows don't.
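To make the TTL problem concrete, here is a minimal simulation. FakeProcessor is a hypothetical stand-in for a payment API that dedupes on idempotency keys, and the 24-hour retention window is an assumption for the sketch, not a statement about any provider's exact behaviour.

```python
from dataclasses import dataclass, field

DAY = 24 * 60 * 60
KEY_TTL_SECONDS = DAY  # assumed retention window for this sketch


@dataclass
class FakeProcessor:
    """Hypothetical payment API that forgets idempotency keys after a TTL."""
    seen: dict = field(default_factory=dict)     # key -> time first seen
    charges: list = field(default_factory=list)  # every real charge made

    def charge(self, key: str, amount_cents: int, now: float) -> str:
        first_seen = self.seen.get(key)
        if first_seen is not None and now - first_seen < KEY_TTL_SECONDS:
            return "deduplicated"  # key still remembered: no new charge
        self.seen[key] = now
        self.charges.append(amount_cents)
        return "charged"


processor = FakeProcessor()
key = "order-123:capture"

first = processor.charge(key, 4999, now=0)           # original live event
retry = processor.charge(key, 4999, now=60)          # retry a minute later
replay = processor.charge(key, 4999, now=180 * DAY)  # replay six months on

print(first, retry, replay, len(processor.charges))
```

The retry is absorbed because the key is still inside the window. The replay uses the very same key and still produces a second charge, which is why key discipline alone can't make a payment event replayable.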

Tagging the event at the schema is the only durable answer. It's the one place the producer, the consumer, the replay tool, and the human auditing a postmortem can all agree on what this event is.

Annotating replay policy at the schema

Pick one place. The schema is the only one all the systems read. Below are the three serialization formats most teams hit, with the annotation pattern that survives schema-registry round-trips and CI.

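Avro. The Avro specification permits extra attributes on a schema as metadata, as long as they don't affect the serialized data. Convention: prefix your custom attributes (x- here) so they can't clash with names the spec or your registry already uses.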

{
  "type": "record",
  "name": "PaymentCaptured",
  "namespace": "shop.payments.v3",
  "x-replay-policy": "toxic",
  "x-replay-reason": "Stripe charge is irreversible; reconcile instead",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "amountCents", "type": "long"},
    {"name": "stripeChargeId", "type": "string"},
    {"name": "capturedAt",
     "type": {"type": "long", "logicalType": "timestamp-micros"}}
  ]
}

Confluent Schema Registry preserves unknown top-level attributes like these on most versions; verify on yours by registering a schema, fetching it back, and checking that the x- attributes survived.
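Because an .avsc file is plain JSON, a replay tool can read the policy without an Avro library. The sketch below is one way to do it; avro_replay_policy and VALID_POLICIES are names made up for illustration, and it assumes one record schema per file.

```python
import json

VALID_POLICIES = {"safe", "restricted", "toxic"}


def avro_replay_policy(avsc_text: str) -> tuple:
    """Return (full_name, policy) for one Avro record schema given as JSON text."""
    schema = json.loads(avsc_text)
    name = schema["name"]
    namespace = schema.get("namespace")
    full_name = f"{namespace}.{name}" if namespace else name

    policy = schema.get("x-replay-policy")
    if policy not in VALID_POLICIES:
        # Fail loudly: an untagged or mistyped schema must not be guessed at.
        raise ValueError(
            f"{full_name}: missing or invalid x-replay-policy: {policy!r}"
        )
    return full_name, policy


example = """
{
  "type": "record",
  "name": "PaymentCaptured",
  "namespace": "shop.payments.v3",
  "x-replay-policy": "toxic",
  "fields": [{"name": "orderId", "type": "string"}]
}
"""

print(avro_replay_policy(example))
# ('shop.payments.v3.PaymentCaptured', 'toxic')
```

Keying the result on the fully qualified name keeps two teams' PaymentCaptured records from shadowing each other in a shared lookup table.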

Protobuf. Protobuf has first-class custom options. Register one in a shared replay_options.proto and use it on every event.

// replay_options.proto
syntax = "proto3";

import "google/protobuf/descriptor.proto";

extend google.protobuf.MessageOptions {
  optional ReplayPolicy replay_policy = 50001;
}

enum ReplayPolicy {
  REPLAY_POLICY_UNSPECIFIED = 0;
  REPLAY_SAFE = 1;
  REPLAY_RESTRICTED = 2;
  REPLAY_TOXIC = 3;
}

// payments_v3.proto
syntax = "proto3";

import "google/protobuf/timestamp.proto";
import "replay_options.proto";

message PaymentCaptured {
  option (replay_policy) = REPLAY_TOXIC;

  string order_id = 1;
  int64 amount_cents = 2;
  string stripe_charge_id = 3;
  google.protobuf.Timestamp captured_at = 4;
}

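Reflection at runtime gives you the policy back. In Go with google.golang.org/protobuf, that is proto.GetExtension(msg.ProtoReflect().Descriptor().Options(), replaypb.E_ReplayPolicy), type-asserted to replaypb.ReplayPolicy, where replaypb stands for whatever package you generate replay_options.proto into. Other languages have an equivalent.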

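JSON Schema. Use a custom keyword with an x- prefix. Many validators ignore unknown keywords by default, which is what you want; strict ones (Ajv in strict mode, for example) reject them until you register the keyword or relax the setting.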

{
  "$id": "https://schemas.shop.example/orders/v3/PaymentCaptured.json",
  "type": "object",
  "x-replay-policy": "toxic",
  "x-replay-reason": "Stripe is the source of truth post-capture",
  "required": ["orderId", "amountCents", "stripeChargeId"],
  "properties": {
    "orderId": {"type": "string"},
    "amountCents": {"type": "integer", "minimum": 1},
    "stripeChargeId": {"type": "string"}
  }
}

A CI check that fails the build if any event schema lacks x-replay-policy is about 25 lines of Python and the single best return-on-effort guardrail you'll ship this quarter. This version covers JSON files under schemas/:

# tools/check_replay_policy.py
import json
import sys
from pathlib import Path

VALID = {"safe", "restricted", "toxic"}
missing = []
invalid = []

for path in Path("schemas").rglob("*.json"):
    with path.open() as f:
        schema = json.load(f)
    policy = schema.get("x-replay-policy")
    if policy is None:
        missing.append(path)
    elif policy not in VALID:
        invalid.append((path, policy))

if missing or invalid:
    for p in missing:
        print(f"MISSING x-replay-policy: {p}")
    for p, v in invalid:
        print(f"INVALID x-replay-policy={v!r}: {p}")
    sys.exit(1)

Run it in CI on every PR that touches the schemas/ tree. New event types without a policy fail the build.

The replay gate

The other half of the schema annotation is the runtime enforcer. Every replay tool (backfill scripts, projection rebuilders, dev-environment hydrators) runs every event through one function before passing it to the consumer.

// replay/gate.go
package replay

import (
    "context"
    "fmt"
)

type Policy int

const (
    PolicyUnknown Policy = iota
    PolicySafe
    PolicyRestricted
    PolicyToxic
)

type Event interface {
    Type() string
    Payload() []byte
}

type SchemaLookup func(eventType string) (Policy, error)

type Consumer interface {
    // Apply processes ev. When replayMode is true, the consumer
    // must suppress external side effects.
    Apply(ctx context.Context, ev Event, replayMode bool) error
}

func Gate(
    ctx context.Context,
    ev Event,
    lookup SchemaLookup,
    consumer Consumer,
) error {
    policy, err := lookup(ev.Type())
    if err != nil {
        return fmt.Errorf("lookup %s: %w", ev.Type(), err)
    }

    switch policy {
    case PolicySafe:
        return consumer.Apply(ctx, ev, true)

    case PolicyRestricted:
        // projection updates; side-effect emitters check the flag
        return consumer.Apply(ctx, ev, true)

    case PolicyToxic:
        // hard stop. these events do not replay.
        return fmt.Errorf(
            "refusing to replay toxic event %s "+
                "(use reconciliation instead)",
            ev.Type(),
        )

    default:
        // unknown policy means the schema lacks the tag.
        // fail closed.
        return fmt.Errorf(
            "no replay policy for %s; tag the schema",
            ev.Type(),
        )
    }
}

Three properties matter here. First, the gate fails closed. An event with no policy stops the replay. You don't get to forget the annotation and have the gate guess. Second, PolicyRestricted doesn't drop the event; it passes a replayMode flag through to the consumer, which checks it at the point a side effect would fire:

func (h *OrderConfirmedHandler) Apply(
    ctx context.Context,
    ev Event,
    replayMode bool,
) error {
    payload := decodeOrderConfirmed(ev.Payload())

    // projection always updates
    if err := h.projection.MarkConfirmed(ctx, payload); err != nil {
        return err
    }

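    // analytics is rebuilt from the event log, so it records
    // on replay as well as on live ingestion
    h.analytics.RecordConversion(ctx, payload)

    // external side effects only on live ingestion
    if !replayMode {
        h.emailer.SendConfirmation(ctx, payload.OrderID)
    }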
    return nil
}

Third, toxic events refuse to enter the replay pipeline at all. They go through reconciliation, which is a different machine entirely.
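Since the gate stops hard on the first toxic or untagged event, it can help to find out what a replay window contains before starting it. The following is a sketch of such a preflight pass, not part of the gate above; preflight and the string policy values are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def preflight(
    event_types: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> tuple:
    """Split event types into (replayable, reconcile_only).

    Raises LookupError on the first type with no recognised policy,
    mirroring the gate's fail-closed behaviour.
    """
    replayable, reconcile_only = [], []
    for event_type in event_types:
        policy = lookup(event_type)
        if policy in ("safe", "restricted"):
            replayable.append(event_type)
        elif policy == "toxic":
            reconcile_only.append(event_type)
        else:
            raise LookupError(
                f"no replay policy for {event_type}; tag the schema"
            )
    return replayable, reconcile_only


# Hypothetical policy table; in practice this comes from the schemas.
policies = {
    "OrderLineItemAdded": "safe",
    "OrderConfirmed": "restricted",
    "PaymentCaptured": "toxic",
}

replayable, reconcile_only = preflight(
    ["OrderLineItemAdded", "OrderConfirmed", "PaymentCaptured"],
    policies.get,
)
print(replayable)      # ['OrderLineItemAdded', 'OrderConfirmed']
print(reconcile_only)  # ['PaymentCaptured']
```

The second list is the input to the reconciliation job; the first is what the replay tool is allowed to touch.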

Reconciliation for replay-toxic events

The right model for PaymentCaptured, RefundIssued, and InventoryReserved isn't replay. It's reconciliation: pull the current state from the external source of truth, compare to your projection, emit compensating events to close the gap.

# reconcile/payments.py
from datetime import datetime, timezone

# projection, stripe_client and event_bus are dependencies wired
# up elsewhere; stripe_client is assumed to be an async wrapper
# around the Stripe API.


async def reconcile_payments(window_start: datetime,
                             window_end: datetime):
    """
    Walk every order finalised in [start, end], pull Stripe's
    record, emit a PaymentAmountReconciled event if our
    projection disagrees. NEVER re-emits PaymentCaptured.
    """
    orders = await projection.orders_finalised_between(
        window_start, window_end
    )

    for order in orders:
        stripe_charge = await stripe_client.charges.retrieve(
            order.stripe_charge_id
        )

        if stripe_charge.amount != order.amount_cents:
            # source of truth disagrees with our state.
            # emit a compensating event the projection knows
            # how to apply.
            await event_bus.publish(
                "PaymentAmountReconciled",
                {
                    "order_id": order.id,
                    "previous_amount_cents": order.amount_cents,
                    "reconciled_amount_cents": stripe_charge.amount,
                    # timezone-aware UTC timestamp
                    "reconciled_at": datetime.now(
                        timezone.utc
                    ).isoformat(),
                    "source": "stripe",
                    "stripe_charge_id": order.stripe_charge_id,
                }
            )

A PaymentAmountReconciled event is itself replay-safe. It carries the reconciled value, and re-applying it produces the same projection state. The toxic action (the original charge) stays in the past. The correction is forward-only.
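What makes it replay-safe is that the handler assigns the reconciled value instead of applying a delta. A minimal sketch, with the projection modelled as a plain dict and apply_payment_amount_reconciled as a hypothetical handler name:

```python
import copy


def apply_payment_amount_reconciled(projection: dict, event: dict) -> dict:
    """Set the order's amount to the absolute value carried by the event."""
    order = projection.setdefault(event["order_id"], {})
    order["amount_cents"] = event["reconciled_amount_cents"]
    order["reconciled_at"] = event["reconciled_at"]
    return projection


event = {
    "order_id": "order-123",
    "previous_amount_cents": 4999,
    "reconciled_amount_cents": 4499,
    "reconciled_at": "2024-01-01T00:00:00+00:00",
    "source": "stripe",
}

# Apply once, then apply the same event again to a copy of the result.
once = apply_payment_amount_reconciled(
    {"order-123": {"amount_cents": 4999}}, event
)
twice = apply_payment_amount_reconciled(copy.deepcopy(once), event)

print(once == twice)  # True
```

Had the event carried "subtract 500 cents" instead of "the amount is 4499", the second application would have drifted, and the correction itself would have needed a restricted or toxic tag.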

This is the inversion that takes a while to internalise. In a system with toxic events, replay isn't how you fix broken state. Replay is how you rebuild interpretation of state. Fixing state requires going to the external source of truth and writing forward-only corrections.

A drift detector for misclassified events

The taxonomy works only if events are classified correctly. Misclassification is silent. A RefundIssued accidentally tagged safe will happily replay and refund the customer twice. You need a detector that catches this without waiting for the postmortem.

The pattern is a nightly job that rebuilds the projection from the event log against a sandbox, compares to the live projection row-by-row, and alerts on drift:

# tools/drift_detector.py
import asyncio
from datetime import date

async def nightly_drift_check(target_date: date):
    sandbox = await projection.fresh_sandbox()

    async for ev in event_store.stream_until(
        end=target_date,
        sandbox_mode=True,  # all side-effect emitters muted
    ):
        await sandbox.apply(ev, replay_mode=True)

    diffs = []
    async for live_row in live_projection.scan():
        sandbox_row = await sandbox.get(live_row.id)
        if not rows_match(live_row, sandbox_row):
            diffs.append((live_row.id, live_row, sandbox_row))

    if diffs:
        await alerter.page(
            f"Projection drift: {len(diffs)} rows differ",
            details=diffs[:50],
        )

When the detector fires, three causes account for almost everything you'll see:

1. A toxic event was marked safe, replayed in a previous backfill, and double-counted.
2. A restricted event's handler doesn't check replayMode and fires its side effect, leaving the live projection ahead of the rebuilt one.
3. An external system corrected itself (Stripe refunded a chargeback, the warehouse fixed inventory) without an event reaching your bus.

Cause #3 is what tells you you're missing a reconciliation job. Causes #1 and #2 are bugs the detector catches before a real replay ships them to production.

A team I talked to last month runs this detector against a 7-day window every night. The first month it caught two misclassified events and one missing reconciliation. After that it's mostly quiet, which is the point. The quiet is the signal that the taxonomy is holding.
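The detector leans on a rows_match helper that the snippet leaves undefined. One possible shape is sketched below, assuming rows can be turned into dicts and that a couple of bookkeeping columns legitimately differ between a rebuild and the live store. The field names in VOLATILE_FIELDS are made up for illustration.

```python
# Columns assumed to differ between a rebuild and the live store
# without indicating drift. Adjust to your projection.
VOLATILE_FIELDS = frozenset({"updated_at", "projection_version"})

_MISSING = object()  # distinguishes "key absent" from "value is None"


def rows_match(live_row, sandbox_row, ignore=VOLATILE_FIELDS) -> bool:
    """Compare two projection rows (as dicts), ignoring volatile fields."""
    if live_row is None or sandbox_row is None:
        # A row that exists on only one side is drift.
        return live_row is sandbox_row
    keys = (set(live_row) | set(sandbox_row)) - set(ignore)
    return all(
        live_row.get(k, _MISSING) == sandbox_row.get(k, _MISSING)
        for k in keys
    )


live = {"id": 1, "status": "confirmed", "updated_at": "2024-05-01"}
rebuilt = {"id": 1, "status": "confirmed", "updated_at": "2024-05-23"}
refunded = {"id": 1, "status": "refunded", "updated_at": "2024-05-23"}

print(rows_match(live, rebuilt))   # True
print(rows_match(live, refunded))  # False
print(rows_match(live, None))      # False
```

Keep the ignore list short. Every field added to it is a field the detector can no longer vouch for.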

How to start

You don't refactor the entire event catalog in one PR. The path that works:

1. Run an audit script over your top 10 event types by volume. For each, ask the producing team: "What happens downstream when this fires?" If any answer mentions an external system, that event is at least restricted, likely toxic.
2. Add the x-replay-policy annotation to those 10 schemas.
3. Ship the replay gate behind a feature flag, fail-closed for tagged events, fail-open for untagged. This lets the existing 200 untagged events keep flowing while you classify them.
4. Add the CI check that requires the annotation on every new schema from day one.
5. Build the drift detector against the 10 tagged events. Run it nightly.
6. Burn down the untagged list one team at a time.
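The audit in the first step can start from nothing more than a sample of event type names pulled from the log. A rough sketch, assuming you can export that sample as a list of strings; top_event_types is a hypothetical helper:

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def top_event_types(
    event_types: Iterable[str],
    policies: Mapping[str, Optional[str]],
    top_n: int = 10,
) -> list:
    """Return (event_type, count, current_policy) for the busiest types.

    current_policy is None for types that have no tag yet.
    """
    counts = Counter(event_types)
    return [
        (event_type, count, policies.get(event_type))
        for event_type, count in counts.most_common(top_n)
    ]


# Hypothetical sample of type names read from the event log.
sample = (
    ["OrderLineItemAdded"] * 5
    + ["OrderConfirmed"] * 3
    + ["PaymentCaptured"] * 2
)
already_tagged = {"OrderLineItemAdded": "safe"}

for row in top_event_types(sample, already_tagged, top_n=3):
    print(row)
```

The rows with None in the last column are the list you take to the producing teams.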

Six months in, every event in the system has a policy, the replay gate fails closed for everything, and your next "rebuild the projection" runbook doesn't quietly charge anyone's card.
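The rollout flag for the gate, and the eventual flip to fail-closed, can be a single boolean. This is a sketch of the decision logic only; replay_decision, the strict flag, and the returned action names are assumptions, not the Go gate's actual API.

```python
from typing import Optional


def replay_decision(policy: Optional[str], strict: bool) -> str:
    """Decide what the gate does with one event during the rollout.

    policy: "safe", "restricted", "toxic", or None for an untagged schema.
    strict: False while migrating (untagged events pass through),
            True once every schema is tagged (untagged events stop the replay).
    """
    if policy == "toxic":
        return "refuse"          # tagged events always fail closed
    if policy in ("safe", "restricted"):
        return "apply_muted"     # replay with side effects suppressed
    if policy is None and not strict:
        return "apply_untagged"  # fail open; log it for the burn-down list
    return "refuse"              # untagged under strict, or unrecognised value


print(replay_decision("toxic", strict=False))  # refuse
print(replay_decision(None, strict=False))     # apply_untagged
print(replay_decision(None, strict=True))      # refuse
```

Note that a typo such as "toxik" is refused even while the flag is off. Only a genuinely missing tag gets the fail-open path, so the migration window doesn't also become a window for bad values.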

What's the worst replay-side-effect story your team has lived through? Drop it in the comments. I want to know which third-party API everyone's accidentally double-fired in production.

If this was useful

The event-replay traps in this post are one of the patterns the Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About goes deep on. The chapter on event-sourcing pitfalls covers the replay-vs-reconciliation distinction, why projections aren't enough, and the compensating-event patterns that keep external state from drifting from your log. If you're building or maintaining an event-driven system and want the rest of the traps mapped out before the postmortem teaches them to you, the book's for you.