Mehmet TURAÇ

Posted on May 28

Architecture of Chaos Part 3 — Event Sourcing Saved Our Audit Trail, Then a Fiber Cable Broke

#systemdesign #eventsourcing #architecture #backend

This is Part 3 of the Architecture of Chaos series. Start from Part 1 | Part 2

⚠️ Names, companies, and specific details are composite/fictional. Patterns and code are drawn from real production experience.

Chapter 5: Why CRDTs Fail for Financial Ledgers (Event Sourcing)

The Critical Meeting

Fifth month. CFO Masato called me in. Compliance team plus two external auditors present. The question was simple:

"Selim, how do we guarantee balances are correct in the new architecture? Auditors need to trace every transaction. They need to see where every cent came from and where it went. Can this 'CRDT' thing provide that?"

Internally I said "no." Externally I said "CRDT alone can't, but combined with Event Sourcing, we can provide a level of transparency auditors can't even dream of."

Why CRDTs Are Insufficient

The problem: a CRDT guarantees that replicas converge on the same state, not that anyone can explain that state. It tells you "current balance is X" but can't answer "how did we get to X?"

Example:

Alice starts with $1,000
Alice spends $200 → $800
Alice earns $500 → $1,300
Alice spends $300 → $1,000

The CRDT's final state says "$1,000." But when the auditor asks "Why $1,000?", you can't answer. Worse, if Tokyo and Virginia update Alice's balance simultaneously and a merge goes wrong, you end up with a wrong balance and no history to check it against.

In financial systems, this is unacceptable.
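To make the gap concrete, here is a minimal sketch of a PN-counter, one common CRDT for numeric values. It is an illustration, not the system described in this series: the replica names and the `PNCounter` shape are assumptions. Notice what the merged state contains: one running total per replica, and nothing about individual transactions.

```typescript
// A minimal PN-counter: each replica tracks only its own running totals.
type ReplicaId = string;

interface PNCounter {
  credits: Record<ReplicaId, bigint>; // total credited per replica, in cents
  debits: Record<ReplicaId, bigint>;  // total debited per replica, in cents
}

function emptyCounter(): PNCounter {
  return { credits: {}, debits: {} };
}

function credit(c: PNCounter, replica: ReplicaId, cents: bigint): void {
  c.credits[replica] = (c.credits[replica] ?? 0n) + cents;
}

function debit(c: PNCounter, replica: ReplicaId, cents: bigint): void {
  c.debits[replica] = (c.debits[replica] ?? 0n) + cents;
}

// Merging takes the per-replica maximum, which is what makes replicas converge.
function mergeMax(
  a: Record<ReplicaId, bigint>,
  b: Record<ReplicaId, bigint>,
): Record<ReplicaId, bigint> {
  const out: Record<ReplicaId, bigint> = { ...a };
  for (const [replica, total] of Object.entries(b)) {
    const current = out[replica] ?? 0n;
    out[replica] = total > current ? total : current;
  }
  return out;
}

function merge(a: PNCounter, b: PNCounter): PNCounter {
  return {
    credits: mergeMax(a.credits, b.credits),
    debits: mergeMax(a.debits, b.debits),
  };
}

function balance(c: PNCounter): bigint {
  const sum = (m: Record<ReplicaId, bigint>): bigint =>
    Object.values(m).reduce((acc, v) => acc + v, 0n);
  return sum(c.credits) - sum(c.debits);
}

// Alice's history, split across two replicas (hypothetical split).
const tokyo = emptyCounter();
credit(tokyo, "tokyo", 100000n); // opening balance $1,000
debit(tokyo, "tokyo", 20000n);   // spends $200

const virginia = emptyCounter();
credit(virginia, "virginia", 50000n); // earns $500
debit(virginia, "virginia", 30000n);  // spends $300

const merged = merge(tokyo, virginia);
const mergedBalance = balance(merged); // 100000n cents, i.e. $1,000

// The merged state holds four numbers. The four transactions are gone.
console.log(mergedBalance, merged);
```

The balance is right, but an auditor asking which transaction produced which cent has nothing to look at.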

Event Sourcing: Single Source of Truth

Event Sourcing stores not the state, but the events that produce the state. State is a function of events:

State(t) = Reduce(all_events_up_to_t, initial_state)

Benefits:

Full audit trail: Every change — who, when, why — all recorded
Time travel: Reproduce state at any point in history
Replay: Reprocess events to fix corrupted state
Determinism: Same events in same order always produce same state
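The formula and the "time travel" benefit can be sketched in a few lines. This is a simplified illustration using Alice's example from earlier: the event type names and timestamps are assumptions, and it relies on all timestamps being UTC ISO-8601 strings in the same format so that string comparison matches chronological order.

```typescript
// Events are facts; state is whatever you get by folding over them.
type LedgerEvent =
  | { type: "ACCOUNT_OPENED"; cents: bigint; at: string }
  | { type: "FUNDS_SPENT"; cents: bigint; at: string }
  | { type: "FUNDS_EARNED"; cents: bigint; at: string };

// The reducer is pure: same balance and event in, same balance out.
function apply(balance: bigint, event: LedgerEvent): bigint {
  switch (event.type) {
    case "ACCOUNT_OPENED":
    case "FUNDS_EARNED":
      return balance + event.cents;
    case "FUNDS_SPENT":
      return balance - event.cents;
  }
}

// State(t) = Reduce(all_events_up_to_t, initial_state)
function stateAt(events: LedgerEvent[], upTo: string, initial: bigint = 0n): bigint {
  return events.filter((e) => e.at <= upTo).reduce(apply, initial);
}

// Alice's history (hypothetical timestamps, amounts in cents).
const aliceEvents: LedgerEvent[] = [
  { type: "ACCOUNT_OPENED", cents: 100000n, at: "2024-05-01T09:00:00Z" },
  { type: "FUNDS_SPENT", cents: 20000n, at: "2024-05-02T10:00:00Z" },
  { type: "FUNDS_EARNED", cents: 50000n, at: "2024-05-03T11:00:00Z" },
  { type: "FUNDS_SPENT", cents: 30000n, at: "2024-05-04T12:00:00Z" },
];

console.log(stateAt(aliceEvents, "2024-05-02T23:59:59Z")); // balance after the first spend
console.log(stateAt(aliceEvents, "2024-05-04T23:59:59Z")); // current balance
```

Replaying to fix corrupted state is the same call with a later cutoff: the projection is thrown away and rebuilt from the log.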

The Event Store: Append-Only Log

Events live in an append-only log. Never updated, never deleted. Only new events can be added.

-- migrations/001_event_store.sql
CREATE TABLE event_store (
    event_id        UUID PRIMARY KEY,
    event_type      VARCHAR(64) NOT NULL,
    aggregate_id    UUID NOT NULL,
    stream_version  BIGINT NOT NULL,
    causation_id    UUID,
    correlation_id  UUID NOT NULL,
    hlc_physical    BIGINT NOT NULL,
    hlc_logical     INTEGER NOT NULL,
    vector_clock    JSONB NOT NULL,
    payload         JSONB NOT NULL,
    metadata        JSONB NOT NULL,
    user_id         UUID,
    recorded_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (aggregate_id, stream_version)
);

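-- Append-only protection: UPDATE and DELETE are silently discarded (no error is raised).
-- Rules do not cover TRUNCATE, so that privilege has to be revoked separately.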
CREATE RULE no_update AS ON UPDATE TO event_store DO INSTEAD NOTHING;
CREATE RULE no_delete AS ON DELETE TO event_store DO INSTEAD NOTHING;
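The `UNIQUE (aggregate_id, stream_version)` constraint does more than prevent duplicates: it is what lets two concurrent writers detect each other. A minimal in-memory sketch of that behavior follows. The class and field names are assumptions for illustration, and `streamVersion` is a `number` here for brevity even though the column is `BIGINT`.

```typescript
interface StoredEvent {
  eventId: string;
  aggregateId: string;
  streamVersion: number;
  eventType: string;
  payload: unknown;
}

class ConcurrencyError extends Error {}

class InMemoryEventStore {
  private readonly log: StoredEvent[] = [];
  // Mirrors UNIQUE (aggregate_id, stream_version) from the migration.
  private readonly seen = new Set<string>();

  // The only write operation: there is deliberately no update or delete.
  append(event: StoredEvent): void {
    const key = `${event.aggregateId}:${event.streamVersion}`;
    if (this.seen.has(key)) {
      throw new ConcurrencyError(
        `version ${event.streamVersion} already exists for ${event.aggregateId}`,
      );
    }
    this.seen.add(key);
    this.log.push(Object.freeze({ ...event }));
  }

  stream(aggregateId: string): readonly StoredEvent[] {
    return this.log.filter((e) => e.aggregateId === aggregateId);
  }
}

// Two writers both read the stream when it was empty and both try to write version 1.
const store = new InMemoryEventStore();
store.append({
  eventId: "e1", aggregateId: "auction-1", streamVersion: 1,
  eventType: "BID_PLACED", payload: { bidder: "alice" },
});

let secondWriterRejected = false;
try {
  store.append({
    eventId: "e2", aggregateId: "auction-1", streamVersion: 1,
    eventType: "BID_PLACED", payload: { bidder: "bob" },
  });
} catch {
  secondWriterRejected = true; // the loser reloads the stream and retries
}
console.log(secondWriterRejected, store.stream("auction-1").length);
```

In PostgreSQL the same collision surfaces as a unique-violation error on insert, which the writer can treat as "someone else got there first".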

Command Handler: Producing Events

When a user places a bid, state isn't directly updated. A command is produced first. The command handler validates business rules, then generates events:

// commands/place_bid.ts
class PlaceBidHandler {
  async handle(command: PlaceBidCommand): Promise<CommandResult> {
    // 1. Idempotency check
    const existing = await this.eventStore.findByCorrelationId(command.commandId);
    if (existing.length > 0) {
      return { status: 'already_processed', events: existing };
    }

    // 2. Load current state from read-side projections (these can lag the event store)
    const userBalance = await this.userBalanceProjection.get(command.bidderId);
    const auction = await this.auctionProjection.get(command.auctionId);

    // 3. Business rules (invariants)
    if (auction.status !== 'ACTIVE')
      return { status: 'rejected', reason: 'Auction not active' };
    if (command.amount <= auction.currentHighestBid)
      return { status: 'rejected', reason: 'Bid too low' };
    if (userBalance.availableBalance < command.amount)
      return { status: 'rejected', reason: 'Insufficient funds' };

    // 4. Produce events atomically
    const events = [
      { eventType: 'BID_PLACED', aggregateId: command.auctionId, ... },
      { eventType: 'FUNDS_RESERVED', aggregateId: command.bidderId, ... },
    ];

    await this.eventStore.append(events);
    await this.eventBus.publish(events);
    return { status: 'accepted', events };
  }
}

Note: Monetary amounts use bigint in cents. Never float or number. The 0.1 + 0.2 = 0.30000000000000004 problem is unacceptable in financial systems.
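A quick sketch of the difference. The `formatCents` helper is hypothetical and only there to show that display formatting can stay in integer arithmetic too.

```typescript
// Floating point: the sum is not exactly 0.3.
const floatSum = 0.1 + 0.2;
const floatIsExact = floatSum === 0.3;

// Integer cents as bigint: addition is exact.
const centsSum = 10n + 20n;
const centsIsExact = centsSum === 30n;

// Hypothetical display helper: format integer cents without touching floats.
function formatCents(cents: bigint): string {
  const sign = cents < 0n ? "-" : "";
  const abs = cents < 0n ? -cents : cents;
  const dollars = abs / 100n; // bigint division truncates
  const remainder = (abs % 100n).toString().padStart(2, "0");
  return `${sign}$${dollars}.${remainder}`;
}

console.log(floatIsExact, centsIsExact, formatCents(130000n));
```

One practical wrinkle: `JSON.stringify` throws on `bigint` values, so amounts usually travel as strings in JSON payloads and are parsed back with `BigInt(...)`.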

The Auditors' Joy

When we presented to the CFO, the auditors' eyes lit up. Because now:

Every cent is traceable: Event store chains every money movement via causation_id
Time travel: "What was Alice's balance 3 days ago at 14:23?" — instant answer
Reconciliation: Event store vs projections auto-compared, inconsistency triggers alarm
Immutable log: Events can't be updated or deleted through normal database access, so silent manipulation is far harder
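The reconciliation item can be sketched as "rebuild balances from events, compare with the projection, report differences". This is a simplified illustration: the `FUNDS_CREDITED` and `FUNDS_DEBITED` event names and the map-based projection are assumptions, not the production schema.

```typescript
interface BalanceEvent {
  aggregateId: string;
  eventType: "FUNDS_CREDITED" | "FUNDS_DEBITED";
  cents: bigint;
}

interface Mismatch {
  aggregateId: string;
  expected: bigint;           // what the event log says
  actual: bigint | undefined; // what the projection says
}

// Source of truth: fold the event log into a balance per account.
function rebuildBalances(events: BalanceEvent[]): Map<string, bigint> {
  const balances = new Map<string, bigint>();
  for (const e of events) {
    const current = balances.get(e.aggregateId) ?? 0n;
    const delta = e.eventType === "FUNDS_CREDITED" ? e.cents : -e.cents;
    balances.set(e.aggregateId, current + delta);
  }
  return balances;
}

// Any account whose projected balance differs from the replayed one is reported.
function reconcile(events: BalanceEvent[], projection: Map<string, bigint>): Mismatch[] {
  const mismatches: Mismatch[] = [];
  rebuildBalances(events).forEach((expected, aggregateId) => {
    const actual = projection.get(aggregateId);
    if (actual !== expected) mismatches.push({ aggregateId, expected, actual });
  });
  return mismatches;
}

const ledger: BalanceEvent[] = [
  { aggregateId: "alice", eventType: "FUNDS_CREDITED", cents: 100000n },
  { aggregateId: "alice", eventType: "FUNDS_DEBITED", cents: 20000n },
  { aggregateId: "bob", eventType: "FUNDS_CREDITED", cents: 50000n },
];

// Bob's projection has drifted by $100.
const projection = new Map<string, bigint>([
  ["alice", 80000n],
  ["bob", 40000n],
]);

const mismatches = reconcile(ledger, projection);
console.log(mismatches); // this is where the alarm would fire
```

Because the log is the source of truth, the fix for a mismatch is to rebuild the projection, never to edit the events.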

Masato told me afterward: "Selim, I've been through audits for 10 years. This is the first time auditors thanked us."

Battle Scar #6

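Lesson: In financial systems, store events, not state. State is a function of events. This isn't just a technical choice; it's often what makes compliance achievable. SOX, PCI-DSS, and MiFID II all impose audit-trail or record-keeping obligations, and Event Sourcing is one of the most elegant ways to satisfy them. GDPR cuts the other way: its erasure rights mean personal data needs careful handling in an immutable log.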

Chapter 6: Distributed Sagas and the "Rollback" Nightmare

That Night, 02:47 — The Transatlantic Fiber Broke

Sixth month. Third major incident. This time the SRE director called directly:

"Selim, the transatlantic fiber is cut. Zero traffic between US-East and EU-West. And right now, 47 'auction-win' workflows are stuck mid-transaction. What do we do?"

The auction-win workflow consisted of these steps:

1. Reserve funds from winner's balance (US-East service)
2. Transfer to escrow account (EU-West service)
3. Credit seller's balance as "pending" (US-East service)
4. Update asset ownership (EU-West service)
5. Send email notifications (Email service)

If the network dies after step 2, the money is reserved and moved to escrow but never credited to the seller. Money is in limbo.

The Saga Pattern: Long-Lived Transactions

We used Temporal.io for saga orchestration — "durable execution" that survives coordinator crashes:

// workflows/auction_win.go (Temporal workflow)
package workflows // package name assumed from the file path

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func AuctionWinWorkflow(ctx workflow.Context, input AuctionWinInput) (err error) {
    // Note: 5 attempts at 1s with 2x backoff run out within a few minutes.
    // Riding out a longer partition needs a higher MaximumAttempts (0 = unlimited).
    retryPolicy := &temporal.RetryPolicy{
        InitialInterval:    time.Second,
        BackoffCoefficient: 2.0,
        MaximumAttempts:    5,
    }

    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy:         retryPolicy,
    })

    // Compensations for completed steps; on failure they run in reverse order.
    var compensations []func(workflow.Context) error
    defer func() {
        if err == nil {
            return
        }
        // Disconnected context so compensations still run if the workflow is cancelled.
        cctx, cancel := workflow.NewDisconnectedContext(ctx)
        defer cancel()
        for i := len(compensations) - 1; i >= 0; i-- {
            if cerr := compensations[i](cctx); cerr != nil {
                // A failed compensation leaves money in limbo: surface it, don't discard it.
                workflow.GetLogger(ctx).Error("compensation failed", "error", cerr)
            }
        }
    }()

    // STEP 1: Reserve funds
    var reservationID string
    if err = workflow.ExecuteActivity(ctx, ReserveFunds, input).Get(ctx, &reservationID); err != nil {
        return err
    }
    compensations = append(compensations, func(c workflow.Context) error {
        return workflow.ExecuteActivity(c, ReleaseReservation, reservationID).Get(c, nil)
    })

    // STEP 2: Transfer to escrow
    var escrowID string
    if err = workflow.ExecuteActivity(ctx, TransferToEscrow, input).Get(ctx, &escrowID); err != nil {
        return err
    }
    compensations = append(compensations, func(c workflow.Context) error {
        return workflow.ExecuteActivity(c, RefundFromEscrow, escrowID).Get(c, nil)
    })

    // STEP 3: Credit seller (pending)
    if err = workflow.ExecuteActivity(ctx, CreditSellerPending, input).Get(ctx, nil); err != nil {
        return err
    }
    compensations = append(compensations, func(c workflow.Context) error {
        return workflow.ExecuteActivity(c, DebitSellerPending, input).Get(c, nil)
    })

    // STEP 4: Transfer ownership (the last step that can fail the workflow)
    if err = workflow.ExecuteActivity(ctx, TransferOwnership, input).Get(ctx, nil); err != nil {
        return err
    }

    // STEP 5: Email (best-effort, failure won't fail the workflow)
    _ = workflow.ExecuteActivity(ctx, SendWinnerEmail, input).Get(ctx, nil)

    return nil
}
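Stripped of Temporal, the compensation logic in the workflow above is a small pattern: remember which steps completed, and on failure undo them in reverse order. A framework-free sketch, kept synchronous for brevity, with no-op actions standing in for real service calls (the step names mirror the workflow; everything else is an assumption):

```typescript
interface SagaStep {
  name: string;
  action: () => void;      // throws on failure
  compensate?: () => void; // undo for this step, if it needs one
}

// Runs steps in order. On the first failure, compensates completed steps in reverse.
function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      log.push(`failed:${step.name}`);
      for (const done of completed.reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`compensated:${done.name}`);
        }
      }
      return false;
    }
  }
  return true;
}

// Simulate the link dying while crediting the seller (step 3).
const sagaLog: string[] = [];
const sagaSucceeded = runSaga(
  [
    { name: "reserveFunds", action: () => {}, compensate: () => {} },
    { name: "transferToEscrow", action: () => {}, compensate: () => {} },
    {
      name: "creditSellerPending",
      action: () => {
        throw new Error("network partition");
      },
    },
  ],
  sagaLog,
);
console.log(sagaSucceeded, sagaLog);
```

What this toy version lacks is exactly what an orchestrator provides: if the process running `runSaga` crashes halfway, the `completed` list is gone, and nobody knows what to undo.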

The Golden Rule: Every Activity Must Be Idempotent

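Temporal runs activities with at-least-once semantics when retries are enabled: a retry after a timeout, worker crash, or network partition can re-run an activity whose first attempt actually succeeded. Every activity must therefore be idempotent.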

func ReserveFunds(ctx context.Context, input ReserveFundsInput) (string, error) {
    // Idempotency key from input hash
    key := generateIdempotencyKey(input.UserID, input.Amount, input.CorrelationID)

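    // Assumes the lookup returns (nil, nil) when no reservation exists yet.
    existing, err := db.GetReservationByIdempotencyKey(ctx, key)
    if err != nil {
        return "", err // lookup failed: let Temporal retry rather than risk a double reservation
    }
    if existing != nil {
        return existing.ReservationID, nil // Already processed
    }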

    // Create new reservation in a transaction
    reservationID := ulid.Make().String()
    err = db.WithTransaction(ctx, func(tx *sql.Tx) error {
        result, err := tx.ExecContext(ctx, `
            UPDATE user_balances
            SET available_balance = available_balance - $1,
                reserved_balance = reserved_balance + $1
            WHERE user_id = $2 AND available_balance >= $1
        `, input.Amount, input.UserID)
        if err != nil { return err }
        if rows, _ := result.RowsAffected(); rows == 0 {
            return ErrInsufficientFunds
        }
        _, err = tx.ExecContext(ctx, `
            INSERT INTO reservations (reservation_id, user_id, amount, idempotency_key, status)
            VALUES ($1, $2, $3, $4, 'ACTIVE')
        `, reservationID, input.UserID, input.Amount, key)
        return err
    })
    if err != nil {
        return "", err // don't return an ID for a reservation that was rolled back
    }
    return reservationID, nil
}
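The effect of the idempotency key is easiest to see by calling the same operation twice. A minimal in-memory sketch of the same idea (the class, the key format, and the `res-N` IDs are assumptions for illustration, not the production code):

```typescript
interface Reservation {
  reservationId: string;
  userId: string;
  cents: bigint;
}

class ReservationService {
  private readonly byKey = new Map<string, Reservation>();
  private readonly available: Map<string, bigint>;
  private nextId = 1;

  constructor(available: Map<string, bigint>) {
    this.available = available;
  }

  reserveFunds(userId: string, cents: bigint, correlationId: string): string {
    // Same inputs produce the same key, so a retry finds the earlier result.
    const key = `${userId}:${cents}:${correlationId}`;
    const existing = this.byKey.get(key);
    if (existing) return existing.reservationId;

    const balance = this.available.get(userId) ?? 0n;
    if (balance < cents) throw new Error("insufficient funds");

    this.available.set(userId, balance - cents);
    const reservation: Reservation = {
      reservationId: `res-${this.nextId++}`,
      userId,
      cents,
    };
    this.byKey.set(key, reservation);
    return reservation.reservationId;
  }
}

const balances = new Map<string, bigint>([["alice", 100000n]]);
const service = new ReservationService(balances);

// The orchestrator retries the same activity after a timeout.
const firstAttempt = service.reserveFunds("alice", 20000n, "auction-42");
const retryAttempt = service.reserveFunds("alice", 20000n, "auction-42");

console.log(firstAttempt === retryAttempt, balances.get("alice"));
```

In the database version, a unique constraint on `idempotency_key` would play the role of the map lookup under concurrency: if two attempts race past the check, the second insert fails and its transaction rolls back.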

What Happened That Night?

The fiber outage left 47 auction-win workflows stranded. But zero data was lost. Because:

Temporal persisted every workflow's state to PostgreSQL
When fiber came back (27 minutes later), Temporal auto-resumed all paused workflows
Every activity was idempotent — "already done" steps weren't repeated
44 of 47 completed successfully
3 entered compensation path (auction timers had expired) and rolled back cleanly

No user noticed a thing. Just a #incidents Slack notification and a morning post-mortem.

Battle Scar #7

Lesson: In distributed systems, "rollback" isn't a simple COMMIT/ROLLBACK — it must be designed as its own distributed system. Every compensation must be idempotent, retryable, and timeout-aware. And don't build it yourself — use battle-tested frameworks like Temporal, Cadence, or AWS Step Functions.

Next: Chapter 7 takes us into Cell-Based Architecture and Sharding (when GDPR threatens $400K/day fines), and Chapter 8 introduces Hybrid Logical Clocks — the poor man's TrueTime that saved us $2M/year.
