Exactly-Once Delivery Is a Lie: How Systems Approximate It Anyway

#distributed #systems #architecture #backend

Reliability in distributed systems does not mean eliminating failures; it means ensuring failures do not corrupt state.

What We're Building

We are implementing a financial order service that processes thousands of transactions per second across multiple availability zones. The requirement is strict consistency wherever money is involved. We cannot charge a customer twice, even if the client or the network retries a request. We must handle network timeouts without creating ghost orders or losing revenue. This architecture assumes the network is unreliable: requests can be delayed, dropped, or delivered more than once. It focuses on making retries safe rather than pretending they never happen.

Step 1 — The Idempotency Key

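Every incoming request must carry a unique identifier chosen by the client: the idempotency key. When the service receives a payment instruction, it checks whether that key has already been processed. If a duplicate arrives, the service skips the business logic and returns the outcome of the original request instead of executing it again. This prevents double-charging when a client resends a request whose response was lost. The key should be hashed and stored atomically with the order, ideally in the same transaction and backed by a unique constraint; otherwise two concurrent retries can both pass the check and both charge. The snippet below leaves out the hashing step.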

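func (s *Service) CreateOrder(ctx context.Context, req *OrderRequest) (*Order, error) {
    key := req.IdempotencyKey

    // Look up the key first; a hit means this request is a retry.
    // (GetProcessedKey is assumed to return the stored order, or nil if unseen.)
    existing, err := s.db.GetProcessedKey(ctx, key)
    if err != nil {
        return nil, err
    }
    if existing != nil {
        return existing, nil // Return previously created order, don't process again
    }
    // ... process order (builds `order` from req)

    // InsertOrder is assumed to write the key and the order in one transaction,
    // with a unique constraint on the key so concurrent retries cannot both succeed.
    if err := s.db.InsertOrder(ctx, key, order); err != nil {
        return nil, err
    }
    return order, nil
}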
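
To make the check-and-record step concrete, here is a minimal in-memory sketch. KeyStore and Do are hypothetical names, not part of the service above, and the single mutex stands in for the unique constraint a real database table would provide. It is an illustration under those assumptions, not a production store.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyStore is a hypothetical in-memory stand-in for the deduplication store.
// A real service would more likely use a table with a unique constraint on the key.
type KeyStore struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> order ID
}

func NewKeyStore() *KeyStore {
	return &KeyStore{results: make(map[string]string)}
}

// Do runs create at most once per key. On a replay it returns the stored
// order ID and true without calling create again.
func (s *KeyStore) Do(key string, create func() (string, error)) (string, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.results[key]; ok {
		return id, true, nil
	}
	id, err := create()
	if err != nil {
		return "", false, err // nothing is recorded, so the client may retry
	}
	s.results[key] = id
	return id, false, nil
}

func main() {
	store := NewKeyStore()
	charges := 0
	create := func() (string, error) {
		charges++
		return fmt.Sprintf("order-%d", charges), nil
	}
	first, firstReplay, _ := store.Do("key-abc", create)
	second, secondReplay, _ := store.Do("key-abc", create)
	fmt.Println(first, firstReplay, second, secondReplay, charges)
}
```

Holding one lock across create serializes every request, which keeps the sketch short but would not scale; the point is only that the lookup and the write happen as one atomic step.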

Step 2 — The Outbox Pattern

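Writing to the database and publishing to a message queue are two separate operations, and either can fail without the other: the commit may succeed while the publish is lost, or a message may go out for a transaction that later rolls back. We avoid this dual write by inserting the event into a local outbox table in the same transaction as the order record. A separate worker process reads this table and sends each message to the external queue. This guarantees the order event is durably logged before it is sent, without requiring two-phase commit across services. Delivery is at-least-once: if the worker crashes after publishing but before marking the entry as sent, the message goes out again, so consumers must deduplicate.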

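// Inside a transaction; both writes commit together or not at all
tx := db.Begin()
defer tx.Rollback() // has no effect after a successful Commit
if err := db.InsertOrderRecord(tx, order); err != nil { // Write data
    return err
}
if err := db.InsertOutboxEntry(tx, id); err != nil { // Write event log
    return err
}
if err := tx.Commit(); err != nil {
    return err
}
// Worker reads 'id' and sends to MQ asynchronously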
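
The worker side is not shown above, so here is a rough sketch of one relay pass. OutboxEntry, Publisher, RelayOnce, and flakyQueue are all hypothetical names, and the outbox is a slice rather than a table. The property it illustrates is that an entry is marked as sent only after the queue accepts it.

```go
package main

import (
	"errors"
	"fmt"
)

// OutboxEntry is a hypothetical row in the outbox table.
type OutboxEntry struct {
	ID      int
	Payload string
	Sent    bool
}

// Publisher is a hypothetical interface over the message queue client.
type Publisher interface {
	Publish(payload string) error
}

// RelayOnce publishes every unsent entry in order and marks it sent only after
// the publish succeeds. It stops at the first failure, so ordering is kept and
// the failed entry is retried on the next pass.
func RelayOnce(outbox []OutboxEntry, pub Publisher) (int, error) {
	sent := 0
	for i := range outbox {
		if outbox[i].Sent {
			continue
		}
		if err := pub.Publish(outbox[i].Payload); err != nil {
			return sent, fmt.Errorf("publish entry %d: %w", outbox[i].ID, err)
		}
		outbox[i].Sent = true
		sent++
	}
	return sent, nil
}

// flakyQueue fails the first failCount publishes, then records deliveries.
type flakyQueue struct {
	failCount int
	delivered []string
}

func (q *flakyQueue) Publish(payload string) error {
	if q.failCount > 0 {
		q.failCount--
		return errors.New("queue unavailable")
	}
	q.delivered = append(q.delivered, payload)
	return nil
}

func main() {
	outbox := []OutboxEntry{
		{ID: 1, Payload: "order-1 created"},
		{ID: 2, Payload: "order-2 created"},
	}
	q := &flakyQueue{failCount: 1}
	n, err := RelayOnce(outbox, q) // first pass hits the outage
	fmt.Println(n, err)
	n, err = RelayOnce(outbox, q) // second pass drains the outbox
	fmt.Println(n, err)
}
```

If the process died between Publish and setting Sent, the next pass would publish the same entry again. That is the at-least-once behavior the consumer has to tolerate.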

Step 3 — State Machine Validation

We track the lifecycle of each order as a finite state machine with the states Created, Paid, and Completed, and we check every requested transition against it at each request boundary. The transition from Created to Paid is idempotent: a retry that arrives for an order already in Paid is ignored. If the database is unavailable, the transition is not recorded and the order stays in Created. Once the database recovers, we replay the log, and because transitions are idempotent, replaying an entry that was already applied changes nothing. The state machine rejects moving backward or skipping steps, either of which would violate accounting principles.
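
One way to express these rules is a transition table. The sketch below is illustrative only: the State type, the allowed map, and Transition are hypothetical names, the string values for Paid and Completed are assumed spellings, and the Cancelled state is borrowed from Step 4.

```go
package main

import "fmt"

// State is a hypothetical type for the order lifecycle.
type State string

const (
	Created   State = "CREATED"
	Paid      State = "PAID"
	Completed State = "COMPLETED"
	Cancelled State = "CANCELLED"
)

// allowed lists the forward transitions; anything absent is rejected.
var allowed = map[State][]State{
	Created: {Paid, Cancelled},
	Paid:    {Completed},
}

// Transition returns the state after a request to move from cur to next.
// Asking for the current state again is treated as a retry and is a no-op.
func Transition(cur, next State) (State, error) {
	if cur == next {
		return cur, nil
	}
	for _, s := range allowed[cur] {
		if s == next {
			return next, nil
		}
	}
	return cur, fmt.Errorf("invalid transition %s -> %s", cur, next)
}

func main() {
	s, err := Transition(Created, Paid) // normal forward step
	fmt.Println(s, err)
	s, err = Transition(Paid, Paid) // retry on a Paid order is ignored
	fmt.Println(s, err)
	s, err = Transition(Paid, Created) // moving backward is rejected
	fmt.Println(s, err)
}
```

Keeping the rules in data rather than scattered if statements makes it easier to audit which moves are legal.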

Step 4 — Compensating Transactions

If a payment fails after an order is created, we must undo the side effects with a compensating transaction. We record a cancellation event, deduplicated with the same idempotency-key logic as the creation request. When a cancellation arrives, we check the current state. If the state is still Created, we update it to Cancelled and refund the account; in any other state the cancellation is a no-op, so a retried cancellation cannot refund twice. This ensures we never hold funds indefinitely without a valid order. The worker processes cancellations with the same priority as creations. The snippet below shows only the state change, not the refund.

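func (s *Service) CancelOrder(ctx context.Context, key string) error {
    order, err := s.db.FindByOrderKey(ctx, key)
    if err != nil {
        return err // Lookup failed; the caller may retry
    }
    if order == nil {
        return errors.New("not found")
    }
    if order.State != "CREATED" {
        return nil // Already cancelled or past Created; nothing to undo
    }
    // Only the state change is shown; the refund is issued separately.
    return s.db.UpdateOrderState(ctx, order.ID, "CANCELLED")
}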
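
Reading the state and then updating it leaves a small window in which two concurrent cancellations could both see CREATED. A common way to close that window is a conditional update, sketched here in memory. Orders, CompareAndSetState, and Cancel are hypothetical names; in SQL the same idea would be an UPDATE with the expected state in the WHERE clause.

```go
package main

import (
	"fmt"
	"sync"
)

// Orders is a hypothetical in-memory stand-in for the orders table.
type Orders struct {
	mu     sync.Mutex
	states map[string]string // order ID -> state
}

// CompareAndSetState changes the state only if it currently equals from,
// and reports whether the change happened.
func (o *Orders) CompareAndSetState(id, from, to string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.states[id] != from {
		return false
	}
	o.states[id] = to
	return true
}

// Cancel issues the refund only for the caller that wins the state change,
// so a retried or concurrent cancellation cannot refund twice.
func Cancel(o *Orders, id string, refund func(id string)) bool {
	if !o.CompareAndSetState(id, "CREATED", "CANCELLED") {
		return false
	}
	refund(id)
	return true
}

func main() {
	orders := &Orders{states: map[string]string{"order-1": "CREATED"}}
	refunds := 0
	refund := func(id string) { refunds++ }
	fmt.Println(Cancel(orders, "order-1", refund)) // first request wins the state change
	fmt.Println(Cancel(orders, "order-1", refund)) // retry finds CANCELLED and does nothing
	fmt.Println(refunds)
}
```

In a real service the refund would probably be recorded through the outbox in the same transaction as the state change, rather than called directly as it is here.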

Key Takeaways

Idempotency keys protect against duplicate requests but require client support. The Outbox pattern ensures messages are not lost when the queue is unavailable or the service crashes between commit and publish. State machines enforce valid lifecycle paths and reject invalid updates. Compensating transactions clean up partial failures and maintain financial integrity. Together, these patterns approximate exactly-once semantics by acknowledging network unreliability and designing for eventual consistency.

What's Next?

The final step is monitoring. We track the ratio of retries to successes. If the retry rate spikes, we check the queue depth. We also monitor the deduplication store for memory usage. High cardinality of keys can cause performance degradation. We plan to shard the state storage to handle large volumes of concurrent requests. Next, we will explore distributed tracing to observe these flows across service boundaries.

Architecture Patterns Series

Part of the Architecture Patterns series.
