This is Part 3 of the Architecture of Chaos series. Start from Part 1 | Part 2
⚠️ Names, companies, and specific details are composite/fictional. Patterns and code are drawn from real production experience.
Chapter 5: Why CRDTs Fail for Financial Ledgers (Event Sourcing)
The Critical Meeting
Fifth month. CFO Masato called me in. Compliance team plus two external auditors present. The question was simple:
"Selim, how do we guarantee balances are correct in the new architecture? Auditors need to trace every transaction. They need to see where every cent came from and where it went. Can this 'CRDT' thing provide that?"
Internally I said "no." Externally I said "CRDT alone can't, but combined with Event Sourcing, we can provide a level of transparency auditors can't even dream of."
Why CRDTs Are Insufficient
The problem: a CRDT converges on a state. It tells you "current balance is X" but can't answer "how did we get to X?"
Example:
- Alice starts with $1,000
- Alice spends $200 → $800
- Alice earns $500 → $1,300
- Alice spends $300 → $1,000
The CRDT's final state says "$1,000." But when the auditor asks "Why $1,000?", you can't answer. Worse, if Tokyo and Virginia update Alice's balance concurrently and the merge is wrong for the data (say, last-writer-wins applied to a balance), you end up with a wrong balance and no history to detect it from.
In financial systems, this is unacceptable.
Event Sourcing: Single Source of Truth
Event Sourcing stores not the state, but the events that produce the state. State is a function of events:
State(t) = Reduce(all_events_up_to_t, initial_state)
Benefits:
- Full audit trail: Every change — who, when, why — all recorded
- Time travel: Reproduce state at any point in history
- Replay: Reprocess events to fix corrupted state
- Determinism: Same events in same order always produce same state
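The formula above can be sketched as a pure reducer over Alice's history from the earlier example. This is a minimal illustration, not our production code: the event type names and the `at` logical timestamp are assumptions made for the example.

```typescript
// Amounts are bigint cents. Event names and `at` are hypothetical.
type LedgerEvent =
  | { type: 'FUNDS_DEPOSITED'; amountCents: bigint; at: number }
  | { type: 'FUNDS_SPENT'; amountCents: bigint; at: number };

// Pure reducer: the same events in the same order always give the same balance.
function applyEvent(balanceCents: bigint, event: LedgerEvent): bigint {
  switch (event.type) {
    case 'FUNDS_DEPOSITED':
      return balanceCents + event.amountCents;
    case 'FUNDS_SPENT':
      return balanceCents - event.amountCents;
  }
}

// State(t) = Reduce(all_events_up_to_t, initial_state)
function balanceAt(events: LedgerEvent[], t: number, initialCents: bigint): bigint {
  return events
    .filter((e) => e.at <= t)
    .reduce((balance, e) => applyEvent(balance, e), initialCents);
}

const aliceEvents: LedgerEvent[] = [
  { type: 'FUNDS_SPENT', amountCents: 20000n, at: 1 },     // spends $200
  { type: 'FUNDS_DEPOSITED', amountCents: 50000n, at: 2 }, // earns $500
  { type: 'FUNDS_SPENT', amountCents: 30000n, at: 3 },     // spends $300
];

const initialCents = 100000n;                                    // $1,000.00
const currentCents = balanceAt(aliceEvents, 3, initialCents);    // 100000n
const afterEarningCents = balanceAt(aliceEvents, 2, initialCents); // 130000n
```

The "time travel" benefit is just the `t` parameter: replaying up to an earlier timestamp gives the balance as it stood then.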
The Event Store: Append-Only Log
Events live in an append-only log. Never updated, never deleted. Only new events can be added.
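-- migrations/001_event_store.sql
CREATE TABLE event_store (
    event_id        UUID PRIMARY KEY,
    event_type      VARCHAR(64) NOT NULL,
    aggregate_id    UUID NOT NULL,
    stream_version  BIGINT NOT NULL,        -- per-aggregate sequence number
    causation_id    UUID,                   -- what caused this event (chains money movements)
    correlation_id  UUID NOT NULL,          -- ID of the originating command
    hlc_physical    BIGINT NOT NULL,        -- hybrid logical clock: physical part
    hlc_logical     INTEGER NOT NULL,       -- hybrid logical clock: logical counter
    vector_clock    JSONB NOT NULL,
    payload         JSONB NOT NULL,
    metadata        JSONB NOT NULL,
    user_id         UUID,
    recorded_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- Two writers cannot both append the same version to one stream
    UNIQUE (aggregate_id, stream_version)
);
-- Append-only protection: UPDATE and DELETE become no-ops.
-- Note: DO INSTEAD NOTHING discards the statement silently instead of raising
-- an error, and rules do not cover TRUNCATE.
CREATE RULE no_update AS ON UPDATE TO event_store DO INSTEAD NOTHING;
CREATE RULE no_delete AS ON DELETE TO event_store DO INSTEAD NOTHING;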
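To make the `UNIQUE (aggregate_id, stream_version)` constraint concrete, here is a small in-memory stand-in for the table. It is a sketch under assumptions: `InMemoryEventStore`, `ConcurrencyError`, and the `expectedVersion` parameter are hypothetical names for this example, and a real store would rely on the database constraint rather than a check in application code.

```typescript
interface StoredEvent {
  aggregateId: string;
  streamVersion: number;
  eventType: string;
  payload: unknown;
}

class ConcurrencyError extends Error {}

// Hypothetical in-memory stand-in for the event_store table.
class InMemoryEventStore {
  private readonly log: StoredEvent[] = [];

  // Appends only if the stream is still at expectedVersion. This mirrors the
  // UNIQUE constraint rejecting a second event with the same stream_version.
  append(aggregateId: string, expectedVersion: number, eventType: string, payload: unknown): StoredEvent {
    const current = this.currentVersion(aggregateId);
    if (current !== expectedVersion) {
      throw new ConcurrencyError(`expected version ${expectedVersion}, stream is at ${current}`);
    }
    const event: StoredEvent = { aggregateId, streamVersion: current + 1, eventType, payload };
    this.log.push(Object.freeze(event)); // stored events are never mutated
    return event;
  }

  currentVersion(aggregateId: string): number {
    return this.log.filter((e) => e.aggregateId === aggregateId).length;
  }

  read(aggregateId: string): readonly StoredEvent[] {
    return this.log.filter((e) => e.aggregateId === aggregateId);
  }
}

const store = new InMemoryEventStore();
store.append('auction-1', 0, 'BID_PLACED', { amountCents: '5000' });

// A second writer that also read version 0 is rejected instead of overwriting.
let staleWriteRejected = false;
try {
  store.append('auction-1', 0, 'BID_PLACED', { amountCents: '6000' });
} catch (err) {
  staleWriteRejected = true;
}
```

The loser of the race reloads the stream, re-checks its business rules against the new state, and tries again with the new version.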
Command Handler: Producing Events
When a user places a bid, state isn't directly updated. A command is produced first. The command handler validates business rules, then generates events:
// commands/place_bid.ts
class PlaceBidHandler {
  async handle(command: PlaceBidCommand): Promise<CommandResult> {
    // 1. Idempotency check: the command ID is stored as the events' correlation ID
    const existing = await this.eventStore.findByCorrelationId(command.commandId);
    if (existing.length > 0) {
      return { status: 'already_processed', events: existing };
    }
    // 2. Load current state (read from projections, which can lag the event store)
    const userBalance = await this.userBalanceProjection.get(command.bidderId);
    const auction = await this.auctionProjection.get(command.auctionId);
    // 3. Business rules (invariants); amounts are bigint cents
    if (auction.status !== 'ACTIVE')
      return { status: 'rejected', reason: 'Auction not active' };
    if (command.amount <= auction.currentHighestBid)
      return { status: 'rejected', reason: 'Bid too low' };
    if (userBalance.availableBalance < command.amount)
      return { status: 'rejected', reason: 'Insufficient funds' };
    // 4. Produce events: both are appended in one call so they are stored together
    const events = [
      { eventType: 'BID_PLACED', aggregateId: command.auctionId, ... },
      { eventType: 'FUNDS_RESERVED', aggregateId: command.bidderId, ... },
    ];
    await this.eventStore.append(events);
    // Publishing is a second write: if the process dies between these two lines,
    // the events are stored but unpublished, so consumers must be able to catch
    // up from the store.
    await this.eventBus.publish(events);
    return { status: 'accepted', events };
  }
}
Note: Monetary amounts use bigint in cents. Never float or number. The 0.1 + 0.2 = 0.30000000000000004 problem is unacceptable in financial systems.
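A short sketch of why the note matters. `parseCents` and `formatCents` are hypothetical helpers written for this example; they assume non-negative decimal input with at most two fractional digits.

```typescript
// Floating point cannot represent most decimal fractions exactly.
const floatSum = 0.1 + 0.2;
const floatIsExact = floatSum === 0.3; // false

// The same sum in integer cents is exact.
const centsSum = 10n + 20n; // 30n

// Hypothetical helper: parse a decimal string such as "12.34" into cents.
function parseCents(text: string): bigint {
  const match = /^(\d+)(?:\.(\d{1,2}))?$/.exec(text);
  if (!match) throw new Error(`invalid amount: ${text}`);
  const whole = BigInt(match[1]);
  const fraction = BigInt((match[2] ?? '').padEnd(2, '0'));
  return whole * 100n + fraction;
}

// Hypothetical helper: format cents back into a decimal string.
function formatCents(cents: bigint): string {
  const sign = cents < 0n ? '-' : '';
  const abs = cents < 0n ? -cents : cents;
  const fraction = (abs % 100n).toString().padStart(2, '0');
  return `${sign}${abs / 100n}.${fraction}`;
}

const total = formatCents(parseCents('0.10') + parseCents('0.20')); // "0.30"
```

Parsing from strings matters as much as the arithmetic: converting through `number` first would reintroduce the rounding error before the value ever reaches a `bigint`.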
The Auditors' Joy
When we presented to the CFO, the auditors' eyes lit up. Because now:
- Every cent is traceable: Event store chains every money movement via causation_id
- Time travel: "What was Alice's balance 3 days ago at 14:23?" — instant answer
- Reconciliation: Event store vs projections auto-compared, inconsistency triggers alarm
- Immutable log: Events can't be modified through the application's database access, so after-the-fact manipulation has no normal path
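The reconciliation check in the list above can be sketched as a replay compared against the projection. This is a simplified model, not our actual job: `BalanceEvent`, `reconcile`, and the signed `deltaCents` field are assumptions made for the example.

```typescript
interface BalanceEvent {
  userId: string;
  deltaCents: bigint; // positive = credit, negative = debit
}

interface Mismatch {
  userId: string;
  expectedCents: bigint;
  projectedCents: bigint | undefined;
}

// Rebuild balances from the log; the log is the source of truth.
function replayBalances(events: BalanceEvent[]): Map<string, bigint> {
  const balances = new Map<string, bigint>();
  for (const e of events) {
    balances.set(e.userId, (balances.get(e.userId) ?? 0n) + e.deltaCents);
  }
  return balances;
}

// Compare the read-side projection against the replayed balances.
function reconcile(events: BalanceEvent[], projection: Map<string, bigint>): Mismatch[] {
  const expected = replayBalances(events);
  const mismatches: Mismatch[] = [];
  for (const [userId, expectedCents] of expected) {
    const projectedCents = projection.get(userId);
    if (projectedCents !== expectedCents) {
      mismatches.push({ userId, expectedCents, projectedCents });
    }
  }
  // Balances that exist in the projection without any supporting events.
  for (const [userId, projectedCents] of projection) {
    if (!expected.has(userId)) {
      mismatches.push({ userId, expectedCents: 0n, projectedCents });
    }
  }
  return mismatches;
}

const ledgerEvents: BalanceEvent[] = [
  { userId: 'alice', deltaCents: 100000n },
  { userId: 'alice', deltaCents: -20000n },
  { userId: 'bob', deltaCents: 5000n },
];
const projection = new Map<string, bigint>([
  ['alice', 80000n],
  ['bob', 4000n], // drifted: the log says 5000n
]);
const mismatches = reconcile(ledgerEvents, projection);
```

Any non-empty result is what triggers the alarm. Because the log is authoritative, the fix is to rebuild the projection, never to edit the events.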
Masato told me afterward: "Selim, I've been through audits for 10 years. This is the first time auditors thanked us."
Battle Scar #6
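Lesson: In financial systems, store events, not state. State is a function of events. This isn't just a technical choice; it is how you meet audit obligations. SOX, PCI-DSS, and MiFID II all carry audit-trail or record-keeping requirements, and Event Sourcing is the most elegant way I know to satisfy them. GDPR needs separate care: its right to erasure means personal data should stay out of immutable event payloads, or be stored in a form that can still be erased.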
Chapter 6: Distributed Sagas and the "Rollback" Nightmare
That Night, 02:47 — The Transatlantic Fiber Broke
Sixth month. Third major incident. This time the SRE director called directly:
"Selim, the transatlantic fiber is cut. Zero traffic between US-East and EU-West. And right now, 47 'auction-win' workflows are stuck mid-transaction. What do we do?"
The auction-win workflow consisted of these steps:
1. Reserve funds from winner's balance (US-East service)
2. Transfer to escrow account (EU-West service)
3. Credit seller's balance as "pending" (US-East service)
4. Update asset ownership (EU-West service)
5. Send email notifications (Email service)
If the network dies after the escrow transfer (the second step): money is reserved, moved to escrow, but never credited to the seller. Money is in limbo.
The Saga Pattern: Long-Lived Transactions
We used Temporal.io for saga orchestration — "durable execution" that survives coordinator crashes:
// workflows/auction_win.go — Temporal Workflow
func AuctionWinWorkflow(ctx workflow.Context, input AuctionWinInput) error {
	// MaximumAttempts counts total attempts (0 would mean unlimited).
	// Five attempts back off 1s, 2s, 4s, 8s, so size this against the
	// longest outage an activity should ride out before compensating.
	retryPolicy := &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumAttempts:    5,
	}
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy:         retryPolicy,
	})
	// STEP 1: Reserve funds
	var reservationID string
	err := workflow.ExecuteActivity(ctx, ReserveFunds, input).Get(ctx, &reservationID)
	if err != nil {
		return err
	}
	// STEP 2: Transfer to escrow
	var escrowID string
	err = workflow.ExecuteActivity(ctx, TransferToEscrow, input).Get(ctx, &escrowID)
	if err != nil {
		// COMPENSATE: Release reservation.
		// Compensation errors are discarded in this listing; real code should
		// log or escalate them.
		workflow.ExecuteActivity(ctx, ReleaseReservation, reservationID).Get(ctx, nil)
		return err
	}
	// STEP 3: Credit seller (pending)
	err = workflow.ExecuteActivity(ctx, CreditSellerPending, input).Get(ctx, nil)
	if err != nil {
		// COMPENSATE: Refund escrow + release reservation
		workflow.ExecuteActivity(ctx, RefundFromEscrow, escrowID).Get(ctx, nil)
		workflow.ExecuteActivity(ctx, ReleaseReservation, reservationID).Get(ctx, nil)
		return err
	}
	// STEP 4: Transfer ownership
	err = workflow.ExecuteActivity(ctx, TransferOwnership, input).Get(ctx, nil)
	if err != nil {
		// COMPENSATE: All steps in reverse
		workflow.ExecuteActivity(ctx, DebitSellerPending, input).Get(ctx, nil)
		workflow.ExecuteActivity(ctx, RefundFromEscrow, escrowID).Get(ctx, nil)
		workflow.ExecuteActivity(ctx, ReleaseReservation, reservationID).Get(ctx, nil)
		return err
	}
	// STEP 5: Email (best-effort, failure won't fail the workflow)
	_ = workflow.ExecuteActivity(ctx, SendWinnerEmail, input).Get(ctx, nil)
	return nil
}
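The compensation branches in the workflow above all follow one rule: undo the completed steps in reverse order. A minimal sketch of that rule, independent of Temporal, is below. `SagaStep` and `runSaga` are hypothetical names, and the steps are synchronous only to keep the example short; real activities are asynchronous and retried.

```typescript
interface SagaStep {
  name: string;
  run: () => void;         // forward action; throws on failure
  compensate?: () => void; // undo action; omitted for best-effort steps
}

// Runs steps in order. On failure, runs the compensations of the steps that
// already succeeded, newest first, then rethrows the original error.
function runSaga(steps: SagaStep[]): void {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        done.compensate?.();
      }
      throw err;
    }
  }
}

// Simulate the link dying while crediting the seller.
const trace: string[] = [];
const auctionWinSteps: SagaStep[] = [
  {
    name: 'ReserveFunds',
    run: () => { trace.push('reserve'); },
    compensate: () => { trace.push('release-reservation'); },
  },
  {
    name: 'TransferToEscrow',
    run: () => { trace.push('escrow'); },
    compensate: () => { trace.push('refund-escrow'); },
  },
  {
    name: 'CreditSellerPending',
    run: () => { throw new Error('network partition'); },
    compensate: () => { trace.push('debit-seller'); },
  },
];

let sagaFailed = false;
try {
  runSaga(auctionWinSteps);
} catch (err) {
  sagaFailed = true;
}
// trace is now: reserve, escrow, refund-escrow, release-reservation
```

The failed step's own compensation never runs, because the step never completed. That is only safe if a failed step leaves nothing behind, which is one more reason each step has to be idempotent and atomic on its own service.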
The Golden Rule: Every Activity Must Be Idempotent
Every Temporal activity must be idempotent. Temporal may run an activity multiple times (retry, partition recovery, etc.).
func ReserveFunds(ctx context.Context, input ReserveFundsInput) (string, error) {
	// Idempotency key derived from the input
	key := generateIdempotencyKey(input.UserID, input.Amount, input.CorrelationID)
	existing, err := db.GetReservationByIdempotencyKey(ctx, key)
	if err == nil && existing != nil {
		return existing.ReservationID, nil // Already processed
	}
	// Create new reservation in a transaction.
	// The lookup above does not stop two concurrent attempts. This assumes a
	// UNIQUE constraint on reservations.idempotency_key, so the loser's INSERT
	// fails and its balance update rolls back with it.
	reservationID := ulid.Make().String()
	err = db.WithTransaction(ctx, func(tx *sql.Tx) error {
		result, err := tx.ExecContext(ctx, `
			UPDATE user_balances
			SET available_balance = available_balance - $1,
			    reserved_balance = reserved_balance + $1
			WHERE user_id = $2 AND available_balance >= $1
		`, input.Amount, input.UserID)
		if err != nil {
			return err
		}
		rows, err := result.RowsAffected()
		if err != nil {
			return err
		}
		if rows == 0 {
			return ErrInsufficientFunds
		}
		_, err = tx.ExecContext(ctx, `
			INSERT INTO reservations (reservation_id, user_id, amount, idempotency_key, status)
			VALUES ($1, $2, $3, $4, 'ACTIVE')
		`, reservationID, input.UserID, input.Amount, key)
		return err
	})
	if err != nil {
		return "", err // don't hand back an ID for a reservation that was never created
	}
	return reservationID, nil
}
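The property the activity above relies on can be shown without a database. In this sketch, `ReservationLedger` is a hypothetical in-memory model, not our service: calling `reserve` twice with the same key moves money once and returns the same reservation both times.

```typescript
interface Reservation {
  reservationId: string;
  userId: string;
  amountCents: bigint;
}

// Hypothetical in-memory model of an idempotent reservation service.
class ReservationLedger {
  private readonly byKey = new Map<string, Reservation>();
  private readonly available = new Map<string, bigint>();
  private nextId = 1;

  constructor(openingBalances: Record<string, bigint>) {
    for (const [userId, cents] of Object.entries(openingBalances)) {
      this.available.set(userId, cents);
    }
  }

  // Safe to call repeatedly with the same key: only the first call moves money.
  reserve(idempotencyKey: string, userId: string, amountCents: bigint): Reservation {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing; // a retry gets the original result back
    const balance = this.available.get(userId) ?? 0n;
    if (balance < amountCents) throw new Error('insufficient funds');
    this.available.set(userId, balance - amountCents);
    const reservation: Reservation = {
      reservationId: `res-${this.nextId++}`,
      userId,
      amountCents,
    };
    this.byKey.set(idempotencyKey, reservation);
    return reservation;
  }

  availableBalance(userId: string): bigint {
    return this.available.get(userId) ?? 0n;
  }
}

const ledger = new ReservationLedger({ alice: 100000n });
const firstAttempt = ledger.reserve('win-42:alice', 'alice', 30000n);
const retriedAttempt = ledger.reserve('win-42:alice', 'alice', 30000n); // retry after a timeout
const aliceAvailable = ledger.availableBalance('alice'); // 70000n, not 40000n
```

The key has to be derived from the logical operation (here a made-up `win-42:alice`), not generated per attempt, otherwise every retry looks like a new request.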
What Happened That Night?
The fiber outage left 47 auction-win workflows stranded. But zero data was lost. Because:
- Temporal persisted every workflow's state to PostgreSQL
- When fiber came back (27 minutes later), Temporal auto-resumed all paused workflows
- Every activity was idempotent — "already done" steps weren't repeated
- 44 of 47 completed successfully
- 3 entered compensation path (auction timers had expired) and rolled back cleanly
No user noticed a thing. Just a #incidents Slack notification and a morning post-mortem.
Battle Scar #7
Lesson: In distributed systems, "rollback" isn't a simple COMMIT/ROLLBACK; it must be designed as its own distributed system. Every compensation must be idempotent, retryable, and timeout-aware. And don't build it yourself: use battle-tested frameworks like Temporal, Cadence, or AWS Step Functions.
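"Retryable and timeout-aware" is easier to reason about with numbers. The sketch below is a simplified model of exponential backoff, not Temporal's implementation (it ignores any cap on the maximum interval). It estimates how long a policy keeps trying if every attempt times out, using the values from the workflow's retry policy; `RetryPolicy`, `backoffDelaysMs`, and `worstCaseWindowMs` are names invented for this example.

```typescript
interface RetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumAttempts: number; // total attempts, including the first
}

// Wait before each retry: initial * coefficient^(retry index).
function backoffDelaysMs(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    delays.push(policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry));
  }
  return delays;
}

// Upper bound on how long the policy keeps trying if every attempt times out.
function worstCaseWindowMs(policy: RetryPolicy, attemptTimeoutMs: number): number {
  const waits = backoffDelaysMs(policy).reduce((sum, d) => sum + d, 0);
  return waits + policy.maximumAttempts * attemptTimeoutMs;
}

const activityPolicy: RetryPolicy = {
  initialIntervalMs: 1000,
  backoffCoefficient: 2.0,
  maximumAttempts: 5,
};
const delays = backoffDelaysMs(activityPolicy);           // [1000, 2000, 4000, 8000]
const windowMs = worstCaseWindowMs(activityPolicy, 30000); // 165000 ms = 2 min 45 s
```

An outage that outlasts this window exhausts the retries and pushes the workflow into compensation, which then needs its own retry budget. The attempt budget is worth sizing against the outages you actually expect.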
Next: Chapter 7 takes us into Cell-Based Architecture and Sharding (when GDPR threatens $400K/day fines), and Chapter 8 introduces Hybrid Logical Clocks — the poor man's TrueTime that saved us $2M/year.