Rajkiran

Posted on Jun 12

System Design - 15. The Saga Pattern: How Uber Books a Trip Without a Single Database Transaction

#architecture #distributedsystems #microservices #systemdesign

Covers: Two-Phase Commit, Saga Pattern, Choreography vs Orchestration Sagas, Compensating Transactions, Idempotency

The Question That Breaks Most Microservices Designs

You're designing Uber's trip booking flow. A single trip booking touches multiple services:

1. Trip Service:    create trip record
2. Driver Service:  assign a driver, mark as unavailable
3. Payment Service: authorize payment method
4. Pricing Service: lock in the fare estimate

In a monolith with one database, this would be a single transaction:

BEGIN TRANSACTION;
  INSERT INTO trips (...);
  UPDATE drivers SET status = 'on_trip' WHERE id = ?;
  INSERT INTO payment_authorizations (...);
  INSERT INTO fare_locks (...);
COMMIT;  -- all or nothing

If anything fails, ROLLBACK undoes everything. Clean. Simple. Guaranteed.

But in microservices, each of these lives in a different service with its own database. There is no single COMMIT that spans all four. So what happens if:

Trip is created ✓
Driver is assigned ✓
Payment authorization fails ✗

You now have a trip with an assigned driver but no valid payment. The driver is marked unavailable for a trip that can't proceed. How do you "roll back" across four independent databases?

This is the central problem the Saga pattern solves.

Two-Phase Commit (2PC): The Tempting Wrong Answer

2PC is the "obvious" distributed transaction protocol — and almost universally considered an anti-pattern for microservices. Understanding why is important.

How 2PC Works

Phase 1 (Prepare):
  Coordinator → asks all participants: "Can you commit this?"
  Trip Service:     "Yes, I can commit" (locks resources, doesn't commit yet)
  Driver Service:   "Yes, I can commit" (locks resources, doesn't commit yet)
  Payment Service:  "Yes, I can commit" (locks resources, doesn't commit yet)
  Pricing Service:  "Yes, I can commit" (locks resources, doesn't commit yet)

Phase 2 (Commit):
  All said yes → Coordinator tells everyone: "COMMIT"
  All services commit and release locks

  OR if anyone said no:
  Coordinator tells everyone: "ROLLBACK"
  All services roll back and release locks
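
The two phases above can be sketched as a toy, in-memory coordinator. This is only an illustration under simplifying assumptions: the `Participant` class and its `prepare`/`commit`/`rollback` methods are hypothetical names, and a real implementation would also need durable logs, timeouts, and recovery.

```python
class Participant:
    """Hypothetical in-memory participant; real ones hold database locks."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"  # idle -> prepared -> committed / rolled_back

    def prepare(self):
        # Phase 1: vote. A "yes" vote means resources stay locked
        # until the coordinator announces its decision.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only on a unanimous yes, otherwise roll back everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

Note that every participant sits in the `prepared` state between the two phases; that window is where the problems described next come from.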

Why 2PC Is an Anti-Pattern for Microservices

1. Blocking and locks held across services
Once a participant votes "yes" in Phase 1, it holds locks on its resources while it waits for the coordinator's decision. If the coordinator crashes between Phase 1 and Phase 2, participants cannot safely commit or roll back on their own, so they keep holding those locks until the coordinator recovers. This is the "blocking" state.

2. Tight coupling and availability cascade
If the Payment Service is slow or down, the entire transaction blocks — Trip Service and Driver Service hold their locks waiting. One service's unavailability brings down the whole operation. This is exactly the cascading failure problem from Day 2.

3. Doesn't scale
2PC requires synchronous coordination across all participants for every transaction. At Uber's scale (millions of trips per day across dozens of services), this creates massive contention.

4. Poor fit for NoSQL
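Many NoSQL databases (Cassandra, DynamoDB) don't expose the prepare/commit participant interface that 2PC relies on, so they can't be enrolled in a transaction that spans several services' databases. Whatever transactional features they offer are scoped to the database itself.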

The rule: If you're designing microservices and reach for 2PC, stop. There's almost always a better pattern — usually the Saga.

The Saga Pattern: Local Transactions + Compensation

A Saga breaks a distributed transaction into a sequence of local transactions, each in a single service. If any step fails, previously completed steps are undone using compensating transactions.

Saga: Book Trip
  Step 1: Trip Service     → create trip            (local transaction)
  Step 2: Driver Service   → assign driver           (local transaction)
  Step 3: Payment Service  → authorize payment       (local transaction)
  Step 4: Pricing Service  → lock fare                (local transaction)

If Step 3 fails:
  Compensate Step 2: Driver Service → release driver
  Compensate Step 1: Trip Service   → cancel trip
  (Compensations run in reverse order of the completed steps)

Each step is its own ACID transaction within its own service's database. There's no global lock, no blocking coordinator. If something fails partway through, you run compensating actions to undo the completed steps — not a database rollback, but a business-level undo operation.

The crucial insight: A compensating transaction isn't "undo" in the database sense — it's a new operation that semantically reverses the effect of the original. "Cancel the trip" isn't the same as "delete the trip row" — it might mean marking it cancelled, notifying the user, logging the cancellation reason, and releasing the driver.
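
As a minimal sketch of the idea, assuming each step can be modelled as a pair of plain Python callables, a saga runner might look like the following. The names `run_saga` and `SagaFailed` are hypothetical, and the step bodies only append to a log instead of calling real services.

```python
class SagaFailed(Exception):
    """Raised after a step fails and completed steps have been compensated."""


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, compensations for the steps that already
    completed are run in reverse order, then SagaFailed is raised.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for _done_name, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step '{name}' failed") from exc
        completed.append((name, compensation))


# Usage: payment authorization fails after the trip and driver steps succeed.
log = []


def authorize_payment():
    raise RuntimeError("card declined")


book_trip = [
    ("create_trip", lambda: log.append("trip created"), lambda: log.append("trip cancelled")),
    ("assign_driver", lambda: log.append("driver assigned"), lambda: log.append("driver released")),
    ("authorize_payment", authorize_payment, lambda: log.append("authorization voided")),
    ("lock_fare", lambda: log.append("fare locked"), lambda: log.append("fare lock released")),
]

try:
    run_saga(book_trip)
except SagaFailed as err:
    log.append(str(err))
```

The failed step itself is not compensated, since it never completed; only the driver assignment and the trip are undone, in that order.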

Choreography-Based Saga

Each service publishes events; other services react. No central coordinator.

1. Order Service: creates order (PENDING)
   → publishes OrderCreated event

2. Payment Service: (listens for OrderCreated)
   → charges card
   → publishes PaymentCompleted (success) OR PaymentFailed (failure)

3a. If PaymentCompleted:
    Inventory Service: (listens for PaymentCompleted)
    → reserves stock
    → publishes StockReserved

3b. If PaymentFailed:
    Order Service: (listens for PaymentFailed)
    → marks order as CANCELLED (compensating action)

Diagram of the happy path and failure path:

Happy path:
OrderCreated → PaymentCompleted → StockReserved → OrderConfirmed

Failure path (payment fails):
OrderCreated → PaymentFailed → OrderCancelled
(cancelling the order is the only compensation needed; nothing else happened yet)

Failure path (stock unavailable, AFTER payment succeeded):
OrderCreated → PaymentCompleted → StockUnavailable
            → Payment Service listens for StockUnavailable
            → refunds payment (compensating action)
            → Order Service listens for PaymentRefunded
            → marks order as CANCELLED
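
The flows above can be simulated with a toy, synchronous event bus. This is a sketch under stated assumptions: `EventBus`, `wire_services`, and `place_order` are hypothetical names, payment always succeeds here, and a real system would use a message broker with asynchronous, at-least-once delivery.

```python
from collections import defaultdict


class EventBus:
    """Toy synchronous pub/sub; a stand-in for a real message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every event published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus, orders, stock_available):
    """Register each service's handlers; no handler calls another service directly."""

    def charge_card(order):  # Payment Service
        bus.publish("PaymentCompleted", order)

    def reserve_stock(order):  # Inventory Service
        bus.publish("StockReserved" if stock_available else "StockUnavailable", order)

    def refund_payment(order):  # Payment Service (compensating action)
        bus.publish("PaymentRefunded", order)

    def confirm_order(order):  # Order Service
        orders[order["id"]] = "CONFIRMED"
        bus.publish("OrderConfirmed", order)

    def cancel_order(order):  # Order Service (compensating action)
        orders[order["id"]] = "CANCELLED"
        bus.publish("OrderCancelled", order)

    bus.subscribe("OrderCreated", charge_card)
    bus.subscribe("PaymentCompleted", reserve_stock)
    bus.subscribe("StockReserved", confirm_order)
    bus.subscribe("StockUnavailable", refund_payment)
    bus.subscribe("PaymentRefunded", cancel_order)


def place_order(order_id, stock_available):
    """Create a PENDING order, publish OrderCreated, return (status, events)."""
    bus, orders = EventBus(), {}
    wire_services(bus, orders, stock_available)
    orders[order_id] = "PENDING"
    bus.publish("OrderCreated", {"id": order_id})
    return orders[order_id], bus.history
```

Calling `place_order("order_1", stock_available=False)` walks the stock-unavailable path: the refund and the cancellation are each triggered by an event, and no function in the sketch describes the whole flow.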

Advantages:

Loosely coupled: services depend only on event contracts, not on each other's APIs
Easy to add steps (just subscribe to relevant events)

Disadvantages:

The "saga" — the overall flow — exists only implicitly, scattered across event handlers in multiple services
Hard to answer "what's the current state of order #123?" without tracing through events across services
Cyclic dependencies are easy to accidentally create

Best for: Sagas with 2-4 steps and simple compensation logic.

Orchestration-Based Saga

A central Saga Orchestrator explicitly calls each service in sequence and handles compensation.

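import logging

logger = logging.getLogger(__name__)

# PaymentFailedException, StockUnavailableException and ShippingException are
# assumed to be defined by the respective service clients.


class SagaFailedException(Exception):
    """Raised once a step has failed and its compensations have run."""


class OrderSagaOrchestrator:
    def __init__(self, order_service, payment_service, inventory_service, shipping_service):
        # Service clients are injected so the orchestrator can be tested in isolation.
        self.order_service = order_service
        self.payment_service = payment_service
        self.inventory_service = inventory_service
        self.shipping_service = shipping_service

    def execute(self, order_data):
        try:
            # Step 1: create the order
            order = self.order_service.create_order(order_data)

            # Step 2: charge the customer
            try:
                payment = self.payment_service.charge(order.total)
            except PaymentFailedException as exc:
                self.order_service.cancel_order(order.id)  # compensate step 1
                raise SagaFailedException("Payment failed") from exc

            # Step 3: reserve stock
            try:
                self.inventory_service.reserve_stock(order.items)
            except StockUnavailableException as exc:
                self.payment_service.refund(payment.id)     # compensate step 2
                self.order_service.cancel_order(order.id)   # compensate step 1
                raise SagaFailedException("Stock unavailable") from exc

            # Step 4: schedule delivery
            try:
                self.shipping_service.schedule_delivery(order)
            except ShippingException as exc:
                self.inventory_service.release_stock(order.items)  # compensate step 3
                self.payment_service.refund(payment.id)            # compensate step 2
                self.order_service.cancel_order(order.id)          # compensate step 1
                raise SagaFailedException("Shipping unavailable") from exc

            # All steps succeeded
            self.order_service.confirm_order(order.id)
            return order

        except SagaFailedException:
            # Compensations have already run; record the failure and re-raise.
            logger.exception("Saga failed for order data %r", order_data)
            raise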

The orchestrator maintains saga state — typically persisted so it can resume after a crash:

Saga State Table:
  saga_id | order_id | current_step  | status
  saga_1  | order_42 | 3 (inventory) | IN_PROGRESS

If orchestrator crashes after step 3 completes:
  On restart, read saga state → resume from step 4
  (or run compensations for steps 1-3 if step 4 can't proceed)
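
A minimal sketch of crash recovery, assuming the saga state lives in a SQL table, could look like this. The schema, the `run_or_resume` function, and the `crash_after` parameter (used only to simulate a crash) are hypothetical. The sketch stores a count of completed steps rather than a step name, and uses an in-memory SQLite database where a real orchestrator would need durable storage.

```python
import sqlite3

STEPS = ["create_order", "charge_payment", "reserve_inventory", "schedule_shipping"]


def open_store():
    db = sqlite3.connect(":memory:")  # a real orchestrator needs a durable database
    db.execute(
        "CREATE TABLE saga_state ("
        "saga_id TEXT PRIMARY KEY, completed_steps INTEGER, status TEXT)"
    )
    return db


def run_or_resume(db, saga_id, handlers, crash_after=None):
    """Run the saga from the first step not yet recorded as complete."""
    row = db.execute(
        "SELECT completed_steps FROM saga_state WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    if row is None:
        db.execute("INSERT INTO saga_state VALUES (?, 0, 'IN_PROGRESS')", (saga_id,))
        db.commit()
        done = 0
    else:
        done = row[0]

    for index in range(done, len(STEPS)):
        handlers[STEPS[index]]()
        # Record progress only after the step has succeeded.
        db.execute(
            "UPDATE saga_state SET completed_steps = ? WHERE saga_id = ?",
            (index + 1, saga_id),
        )
        db.commit()
        if crash_after == STEPS[index]:
            raise RuntimeError("simulated orchestrator crash")

    db.execute("UPDATE saga_state SET status = 'COMPLETED' WHERE saga_id = ?", (saga_id,))
    db.commit()


# Usage: crash after step 3, then restart and resume from step 4.
calls = []
handlers = {step: (lambda s=step: calls.append(s)) for step in STEPS}
db = open_store()
try:
    run_or_resume(db, "saga_1", handlers, crash_after="reserve_inventory")
except RuntimeError:
    pass  # the orchestrator process "died" here
run_or_resume(db, "saga_1", handlers)  # restart: only step 4 runs
```

One gap remains in this sketch: a crash after a step runs but before its progress is recorded causes that step to run again on restart, which is why the steps themselves still need to be idempotent.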

Advantages:

The entire workflow is visible in one place — easy to understand, modify, debug
Centralized error handling and retry logic
Saga state can be persisted and resumed after crashes

Disadvantages:

Orchestrator becomes a critical component — must be highly available
The orchestrator is coupled to every participant's API contract, and business logic tends to accumulate in it

Best for: Complex multi-step workflows with non-trivial compensation logic. Order and booking flows of this kind are commonly built as orchestration-based sagas, often on a dedicated workflow engine such as Temporal, AWS Step Functions, or Camunda.

Idempotency: The Non-Negotiable Requirement

Sagas involve retries — networks fail, services restart, messages get redelivered. Every step (and every compensation) must be idempotent: running it multiple times produces the same result as running it once.

Non-idempotent (dangerous):

def charge_card(card_id, amount):
    payment_gateway.charge(card_id, amount)  # Retrying this charges TWICE

Idempotent (safe):

def charge_card(card_id, amount, idempotency_key):
    # The gateway uses the idempotency key to deduplicate retried requests
    payment_gateway.charge(card_id, amount, idempotency_key=idempotency_key)

Stripe's idempotency key pattern (widely copied across payment APIs):

import stripe  # assumes stripe.api_key is already configured

order_id = 42  # example value
idempotency_key = f"order_{order_id}_payment"  # deterministic, same every retry

stripe.PaymentIntent.create(
    amount=5000,
    currency="usd",
    idempotency_key=idempotency_key,  # Stripe deduplicates if seen before
)

If this request is sent twice (due to a retry), Stripe recognizes the idempotency key and returns the original result without creating a second payment. This single technique prevents one of the most costly bugs in distributed payment flows: double charges.
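
To make the deduplication concrete, here is a sketch of what the receiving side might do with the key. `FakePaymentGateway` is a hypothetical in-memory stand-in, not Stripe's implementation; a real gateway would persist the keys and also handle concurrent requests that carry the same key.

```python
class FakePaymentGateway:
    """Hypothetical gateway showing how idempotency keys deduplicate retries."""

    def __init__(self):
        self.responses = {}  # idempotency_key -> response from the first call
        self.charges = []    # charges actually performed

    def charge(self, card_id, amount, idempotency_key):
        # A key seen before returns the stored response and performs no new charge.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((card_id, amount))
        response = {"charge_id": charge_id, "amount": amount, "status": "succeeded"}
        self.responses[idempotency_key] = response
        return response


gateway = FakePaymentGateway()
key = "order_42_payment"
first = gateway.charge("card_1", 5000, idempotency_key=key)
retry = gateway.charge("card_1", 5000, idempotency_key=key)  # e.g. after a timeout
```

Both calls return the same response, and only one entry ends up in `gateway.charges`.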

Compensating transactions must also be idempotent. If "release driver" is sent twice (retry), the second call should be a safe no-op — not an error, and definitely not "release a different driver."
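
One way to get that behavior, sketched here with a hypothetical in-memory `DriverService`, is to make the compensation conditional on the trip it belongs to. The method names and the dictionary-based storage are assumptions for illustration.

```python
class DriverService:
    """Hypothetical in-memory driver service."""

    def __init__(self):
        self.assignments = {}  # driver_id -> trip_id currently assigned

    def assign_driver(self, driver_id, trip_id):
        self.assignments[driver_id] = trip_id

    def release_driver(self, driver_id, trip_id):
        """Release the driver only if still assigned to this trip.

        Returns True if a release happened, False if it was a no-op.
        """
        if self.assignments.get(driver_id) != trip_id:
            return False  # already released, or reassigned to another trip
        del self.assignments[driver_id]
        return True


drivers = DriverService()
drivers.assign_driver("driver_7", "trip_1")
first = drivers.release_driver("driver_7", "trip_1")  # releases the driver
drivers.assign_driver("driver_7", "trip_2")           # driver takes a new trip
retry = drivers.release_driver("driver_7", "trip_1")  # late retry: no-op
```

Because the release is keyed on both the driver and the trip, a late retry for `trip_1` leaves the newer `trip_2` assignment untouched.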

Real-World Example: Amazon Order Fulfillment

A simplified, illustrative order fulfillment saga in the style of Amazon's (not a description of Amazon's actual internals) looks like this:

1. Order Service: Create order (status: PLACED)
2. Payment Service: Authorize payment (hold funds, don't capture yet)
3. Inventory Service: Reserve items across warehouses
4. Fulfillment Service: Generate pick/pack/ship instructions
5. Payment Service: Capture payment (now actually charge)
6. Shipping Service: Hand off to carrier
7. Order Service: Update status to SHIPPED

Compensation scenarios:
- If inventory unavailable after payment auth → release auth (no charge happened yet)
- If fulfillment fails after capture → refund + cancel order
- If item damaged before shipping → refund + restock + notify customer

Notice step 2 (authorize) vs step 5 (capture) — this is a deliberate design choice. Authorization holds funds without charging. This gives the saga a "soft" compensation option (release the hold) for the early failure scenarios, and only "hard" compensation (refund) is needed for failures after the actual charge.

This pattern — separating authorization from capture — is one of the most important saga design techniques for payment flows. It buys you a cheap, reversible step before the expensive, harder-to-reverse step.
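
The soft versus hard compensation choice can be sketched as a small state machine. `PaymentAuthorization` and its state names are hypothetical and no real payment API is called; the point is only that the right compensation depends on how far the payment has progressed.

```python
class PaymentAuthorization:
    """Hypothetical payment hold with explicit state transitions."""

    def __init__(self, amount):
        self.amount = amount
        self.state = "AUTHORIZED"  # funds held, nothing charged yet

    def capture(self):
        if self.state != "AUTHORIZED":
            raise ValueError(f"cannot capture from state {self.state}")
        self.state = "CAPTURED"

    def compensate(self):
        """Undo the payment using the cheapest action the current state allows."""
        if self.state == "AUTHORIZED":
            self.state = "VOIDED"    # soft: release the hold
            return "void"
        if self.state == "CAPTURED":
            self.state = "REFUNDED"  # hard: money moves back
            return "refund"
        return "noop"                # already voided or refunded


# Early failure (e.g. inventory unavailable): only the hold is released.
early = PaymentAuthorization(5000)
early_action = early.compensate()

# Late failure (after capture): a refund is required.
late = PaymentAuthorization(5000)
late.capture()
late_action = late.compensate()
```

Returning `"noop"` for an already voided or refunded payment also keeps the compensation idempotent under retries.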

Interview Scenario: "Handle Payment Spanning 3 Microservices"

Q: A user purchase involves Order Service, Payment Service, and Inventory Service. How do you ensure consistency?

"I'd implement this as a Saga rather than attempting a distributed transaction. Given the complexity — three services, multiple failure scenarios — I'd lean toward an orchestration-based saga rather than choreography, so the workflow logic lives in one place and is easy to reason about.

The sequence would be: create the order in PENDING state, authorize (not capture) payment, reserve inventory, then capture payment and confirm the order. I'd use authorization-before-capture so early failures (inventory unavailable) only require releasing the auth hold — no refund needed.

Every step and compensation needs an idempotency key, because the orchestrator will retry on failures, and I need to guarantee a retried 'charge card' doesn't double-charge.

I'd persist the saga state after each step so that if the orchestrator crashes, it can resume from where it left off rather than restarting the whole flow — which could cause duplicate charges or duplicate inventory reservations if not handled carefully."

This answer demonstrates: knowledge of the pattern, a clear architectural choice with justification, awareness of idempotency, and crash-recovery thinking — exactly what the "Top 1%" checklist from our syllabus describes.

Key Takeaways

2PC is an anti-pattern for microservices — it creates blocking locks across services and cascading availability failures.
Saga pattern: break distributed transactions into local transactions per service, with compensating transactions for rollback.
Choreography sagas: event-driven, decoupled, best for simple 2-4 step flows.
Orchestration sagas: centralized coordinator, explicit workflow, best for complex flows with non-trivial compensation; a common choice in production systems.
Idempotency is mandatory: every step and compensation must handle retries safely. Use idempotency keys (Stripe's pattern is a widely used reference).
Authorize-then-capture for payments gives you a cheap, reversible early step before the costly, harder-to-reverse final step.
Persist saga state so orchestrators can recover from crashes without duplicating side effects.

You've now covered the entire async communication layer: Message Queues (how services talk without blocking), Event-Driven Architecture (how systems react to "things that happened"), and Sagas (how distributed transactions actually work in microservices). Together, these three topics explain how every large-scale system coordinates work across dozens of independent services.

Next, we move into Microservices Infrastructure: Monolith vs Microservices trade-offs, Service Discovery, and Fault Tolerance Patterns like Circuit Breakers and Bulkheads. In short, how to actually run hundreds of services in production without them taking each other down.
