Amit Kamble

Posted on Mar 20

Mastering the Saga Pattern: Achieving Data Consistency in Microservices

#microservices #saga #consistency #pattern

A transaction represents a unit of work, which can include multiple
operations. In a monolith, placing an order is easy: you wrap your
database calls in one @Transactional block. It's "all or nothing."

But in microservices, the Inventory Service, Payment Service,
and Shipping Service each have their own databases. You can't use a
single transaction across network boundaries.

If the Shipping Service fails after the Payment Service has already
taken the customer's money, how do you fix it? This is where the Saga
design pattern is useful.

1 What is a Saga?

The Saga design pattern helps maintain data consistency in distributed
systems by coordinating transactions across multiple services. A
saga is a sequence of local transactions where each service performs
its operation and initiates the next step through events or messages.

If a step in the sequence fails, the saga runs compensating
transactions to undo the steps that have already completed, so the
system returns to a consistent state.

The Saga Pattern is your distributed "safety net."

There are two ways to coordinate sagas:

Choreography - each local transaction publishes domain events that
trigger local transactions in other services
Orchestration - an orchestrator (object) tells the participants
what local transactions to execute

2 Core Concepts

2.1 Choreography (The Event-Driven Dance)

There is no central "boss." Services communicate by broadcasting
events.

Choreography relies on a decentralized "chain reaction": services
publish and listen to events via a message broker like Kafka. Each
service knows exactly which event triggers its local transaction and
which event to emit upon completion. This approach keeps services
loosely coupled and avoids a central coordinator that could fail, but it
can become difficult to monitor or debug as the number of services and
"spaghetti events" increases.

The Flow: Order Service emits Order_Created. Payment Service hears it, charges the card, and emits Payment_Successful.

Success Chain:

Order Service: Create Order -> Order_Created

Payment Service: Charge Card -> Payment_Successful

Inventory Service: Deduct Stock -> Stock_Reserved

Shipping Service: Ship Package -> Order_Shipped

Failure Chain (at Shipping):

Shipping Service: Error -> Shipping_Failed

Inventory Service: Shipping_Failed heard -> Undo: Restock ->
Stock_Released

Payment Service: Stock_Released heard -> Undo: Refund ->
Payment_Refunded

Order Service: Payment_Refunded heard -> Final Status: CANCELLED

Pros: Truly decoupled; no single point of failure.
Cons: Hard to "see" the process. It can turn into "Spaghetti
Events" where tracking a single order's journey requires complex
distributed tracing.
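The success and failure chains above can be sketched in a few lines. This is a minimal, single-process illustration, not a production design: the `EventBus` class is a hypothetical in-memory stand-in for a broker like Kafka, each "service" is reduced to one lambda, and the `COMPLETE` status for the happy path is an assumption (the chains above only name `CANCELLED`).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class ChoreographySketch {

    // Hypothetical in-memory publish/subscribe bus; a real system would use a broker.
    static class EventBus {
        private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
        final List<String> log = new ArrayList<>(); // every published event, in order

        void subscribe(String event, Consumer<String> handler) {
            handlers.computeIfAbsent(event, k -> new ArrayList<>()).add(handler);
        }

        void publish(String event, String orderId) {
            log.add(event);
            for (Consumer<String> handler : handlers.getOrDefault(event, Collections.emptyList())) {
                handler.accept(orderId);
            }
        }
    }

    // Wires the services together, starts one saga, and returns the final order status.
    static String run(EventBus bus, boolean shippingFails) {
        String[] status = {"PENDING"};

        // Forward path: each service reacts to the previous service's event.
        bus.subscribe("Order_Created", id -> bus.publish("Payment_Successful", id));  // Payment: charge card
        bus.subscribe("Payment_Successful", id -> bus.publish("Stock_Reserved", id)); // Inventory: deduct stock
        bus.subscribe("Stock_Reserved", id ->
                bus.publish(shippingFails ? "Shipping_Failed" : "Order_Shipped", id)); // Shipping
        bus.subscribe("Order_Shipped", id -> status[0] = "COMPLETE");                 // Order: done

        // Compensation path: each service undoes its own work and passes the baton back.
        bus.subscribe("Shipping_Failed", id -> bus.publish("Stock_Released", id));    // Inventory: restock
        bus.subscribe("Stock_Released", id -> bus.publish("Payment_Refunded", id));   // Payment: refund
        bus.subscribe("Payment_Refunded", id -> status[0] = "CANCELLED");             // Order: cancel

        bus.publish("Order_Created", "order-1");
        return status[0];
    }
}
```

Notice that no single class contains the whole flow. To understand one order's journey you have to read every subscription, which is exactly the observability cost listed under Cons.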

2.2 Orchestration (The Centralized Conductor)

A central Order Orchestrator acts as the "Conductor."

Orchestration uses a central "brain" (typically the Order Service) to
explicitly direct the flow of the entire business process by sending
commands to participant services. The orchestrator manages the state of
the saga, tracks which steps have succeeded, and remains responsible for
triggering specific compensating transactions if a failure occurs. This
pattern is ideal for complex workflows with many steps, as it provides a
single point of visibility and control for the entire transaction life
cycle.

The Flow: The Orchestrator tells the Payment Service: "Charge the card." It waits for a "Success" response before telling the Inventory Service: "Reserve the item."

Success Chain:

Orchestrator -> Charge_Card -> Payment Service (Success)

Orchestrator -> Reserve_Stock -> Inventory Service (Success)

Orchestrator -> Ship_Items -> Shipping Service (Success)

Orchestrator: Mark COMPLETE

Failure Chain (at Inventory):

Orchestrator -> Reserve_Stock -> Inventory Service (FAIL)

Orchestrator -> Refund_Money -> Payment Service (Undo)

Orchestrator: Mark CANCELLED

Pros: Single source of truth. Easy to manage complex business
logic.
Cons: The Orchestrator can become a "God Service" if not
designed with clear boundaries.
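Here is a rough sketch of that orchestrator loop, assuming each participant call can be reduced to a function that returns true on success. The `Step` and `Orchestrator` classes are hypothetical names for illustration, and a real orchestrator would also persist its state so it can resume after a crash.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BooleanSupplier;

class OrchestrationSketch {

    // One saga step: a forward command plus the command that semantically undoes it.
    static class Step {
        final String command;
        final String compensation;
        final BooleanSupplier action; // stand-in for a remote call; true means success

        Step(String command, String compensation, BooleanSupplier action) {
            this.command = command;
            this.compensation = compensation;
            this.action = action;
        }
    }

    static class Orchestrator {
        final List<String> sent = new ArrayList<>(); // every command sent, in order

        // Runs the steps in order; on the first failure, compensates completed steps in reverse.
        String execute(List<Step> steps) {
            Deque<Step> completed = new ArrayDeque<>();
            for (Step step : steps) {
                sent.add(step.command);
                if (step.action.getAsBoolean()) {
                    completed.push(step); // most recent step ends up on top
                } else {
                    while (!completed.isEmpty()) {
                        sent.add(completed.pop().compensation);
                    }
                    return "CANCELLED";
                }
            }
            return "COMPLETE";
        }
    }
}
```

With three steps (`Charge_Card`, `Reserve_Stock`, `Ship_Items`) and a failing `Reserve_Stock`, the `sent` list reads `Charge_Card`, `Reserve_Stock`, `Refund_Money`, which mirrors the failure chain above. The whole process is visible in one place, which is the main selling point of orchestration.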

2.3 The "Undo" Button: Compensating Transactions

In a distributed system, you cannot "un-commit" a database change that
another service has already committed. You must perform a Semantic Undo.
Compensating transactions are "undo" operations designed to restore a
system to a consistent state after a partial failure in a Saga. Unlike a
traditional database rollback, which simply discards uncommitted
changes, a compensating transaction is a new, separate transaction that
semantically reverses a previously committed action, such as issuing a
refund for a captured payment or restocking an item that was previously
deducted.

Service	Forward Action	Compensating Action
Payment	Capture $100	Refund $100 to customer
Inventory	Stock -1	Stock +1 (Restock)
Shipping	Create Shipping Label	Void/Cancel Shipping Label

2.4 The Pivot Transaction

Every Saga has a "Point of No Return," known as the pivot transaction.
In e-commerce, this is usually Payment. Once the money is taken, the
business is committed. If a failure occurs after the pivot (e.g., the
shipping printer jams), we don't refund the user immediately. Instead,
we retry the shipping step until it succeeds.
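A retry loop for a post-pivot step might look roughly like this. It is a sketch under a few assumptions that are not in the text above: the step is idempotent (safe to call twice), the wait doubles after each failed attempt, and there is a cap on attempts so a human can be alerted instead of retrying forever. The `retry` helper and its parameters are hypothetical.

```java
import java.util.function.BooleanSupplier;

class PivotRetrySketch {

    // Retries a post-pivot step with exponential backoff.
    // Returns the attempt number that succeeded, or -1 if all attempts failed
    // (or the thread was interrupted), at which point the saga needs manual attention.
    static int retry(BooleanSupplier step, int maxAttempts, long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (step.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for callers
                    return -1;
                }
                backoff *= 2; // wait twice as long before the next attempt
            }
        }
        return -1;
    }
}
```

The key design point is what is missing: there is no compensation branch. After the pivot, the saga only moves forward.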

3 Important Decisions

3.1 Is the Saga pattern suitable?

Before you implement a Saga, ask: "Can I just merge these two
microservices?" If two services constantly require a Saga to stay
consistent, they likely belong to the same Bounded Context. Sagas
add significant operational overhead; use them only when domain
separation is strictly required.

If you do not have a clear "Undo" path for every action, do not
implement a Saga.

If a longer gap before the data becomes consistent is acceptable,
consider Periodic Reconciliation instead of a Saga. While a Saga
proactively pushes for consistency within milliseconds to seconds using
a chain of real-time events, Periodic Reconciliation reactively achieves
it by using a scheduled background job to "sweep" the database and fix
discrepancies. It trades immediate synchronization for a simpler,
self-healing model where time, rather than complex coordination, ensures
all services eventually align.

Feature	Saga Pattern (Push-Based)	Periodic Reconciliation (Pull-Based)
Trigger Mechanism	Proactive: Each service "pushes" the next one via real-time events (Kafka/RabbitMQ).	Reactive: A central job "pulls" records from the DB to find and fix mismatches.
Consistency Speed	Near Real-Time: Syncs in milliseconds to seconds.	Delayed: Syncs based on job frequency (e.g., every 5, 10, or 60 mins).
Implementation	High Complexity: Requires event brokers, idempotency, and undo logic.	Low Complexity: Requires a simple scheduler (Spring @Scheduled) and a status-check loop.
Failure Handling	Fragile: If an event is lost or a service is down, the flow breaks.	Self-Healing: If a service is down, the job simply retries on the next run.
When to Use	High-Scale/High-Velocity: When users need an immediate "Confirmed" screen (e.g., Ride-sharing, seat booking).	Low-to-Medium Scale: When a 5-minute delay is acceptable for the business (e.g., Invoice generation, shipping updates).
When to Avoid	Small Teams: If you don't have the dev-ops resources to manage distributed tracing and message brokers.	Inventory Scarcity: Avoid if a delay allows "overselling" (e.g., selling the same ticket twice during the 10-minute gap).
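To make the pull-based column concrete, here is a minimal sketch of one reconciliation sweep. The data model is an assumption for illustration: a map of order statuses stands in for the Order Service's table, a set of order IDs stands in for the Shipping Service's records, and `PAID` is a hypothetical status name. In a Spring application the sweep would typically be called from a method annotated with `@Scheduled`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReconciliationSketch {

    // One sweep: find paid orders that have no shipment record and repair them.
    // Returns the IDs that were fixed in this run.
    static List<String> reconcileOnce(Map<String, String> orderStatusById, Set<String> shipmentOrderIds) {
        List<String> repaired = new ArrayList<>();
        for (Map.Entry<String, String> order : orderStatusById.entrySet()) {
            boolean paid = "PAID".equals(order.getValue());
            boolean missingShipment = !shipmentOrderIds.contains(order.getKey());
            if (paid && missingShipment) {
                shipmentOrderIds.add(order.getKey()); // stand-in for "create the missing shipment"
                repaired.add(order.getKey());
            }
        }
        return repaired;
    }
}
```

Running the sweep twice in a row is harmless: the second run finds nothing to fix. That is the "self-healing" property from the table, since a failed or skipped run is simply picked up by the next one.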

3.2 Orchestration vs. Choreography

Choosing your coordination style is one of the most critical
architectural decisions in a Saga.

Feature	Choreography (Event-Driven)	Orchestration (Command-Driven)
Complexity	Low (initially), High (at scale)	High (initially), Low (at scale)
Coupling	Loosely coupled	Orchestrator knows all participants
Best For	Simple flows (2-3 services)	Complex flows (5+ services)
Observability	Difficult (need distributed tracing)	Easy (check the Orchestrator log)

3.3 Where does the Saga live?

Which Microservice "Owns" the Saga?

Choosing the host for your Orchestrator follows the "Initiator
Rule":

The Intent Owner: The service that receives the initial business
intent from the user should host the Saga. In E-commerce, this is the
Order Service.
The Outcome Stakeholder: If the "success" of the process is
primarily measured by one domain (e.g., "Was the order placed?"),
that domain should coordinate the steps.

Backend vs. Frontend: Who coordinates?

A common mistake is trying to manage a Saga from the Frontend
(Angular). Don't do this.

The Backend (Spring Boot) MUST do the Saga. If the user closes
their browser or loses internet mid-Saga, an Angular-led process would
die, leaving your data in an inconsistent "zombie" state.
The Backend ensures the process continues even if the user is
offline.

3.4 Role of frontend

While the backend does the heavy lifting, the frontend must handle the
Asynchronous UX.

The Request: Angular sends the order and immediately receives a
202 Accepted with a UUID.
The Stream: The Angular app connects to a WebSocket (via
RxJS) or uses Server-Sent Events (SSE) to listen for the Saga's
progress.
The UI: Update the UI dynamically: "Payment Confirmed" ->
"Stock Reserved" -> "Success!"
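The backend half of that handshake can be sketched as follows. This is only an illustration of the contract, with hypothetical names throughout: `submitOrder` stands in for the HTTP endpoint that answers 202 Accepted, `subscribe` stands in for opening a WebSocket or SSE stream, and `reportProgress` is what the saga would call as each step completes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Consumer;

class AsyncOrderApiSketch {

    // What the client gets back immediately: a status code and an ID to follow.
    static class Accepted {
        final int httpStatus = 202;
        final String sagaId;

        Accepted(String sagaId) {
            this.sagaId = sagaId;
        }
    }

    private final Map<String, List<Consumer<String>>> listeners = new HashMap<>();

    // Registers a new saga and returns right away; the saga itself runs in the background.
    Accepted submitOrder() {
        String sagaId = UUID.randomUUID().toString();
        listeners.put(sagaId, new ArrayList<>());
        return new Accepted(sagaId);
    }

    // Stand-in for a WebSocket/SSE subscription; returns false for an unknown saga ID.
    boolean subscribe(String sagaId, Consumer<String> listener) {
        List<Consumer<String>> subscribers = listeners.get(sagaId);
        if (subscribers == null) {
            return false;
        }
        subscribers.add(listener);
        return true;
    }

    // Called by the saga whenever a step finishes; pushes the message to every subscriber.
    void reportProgress(String sagaId, String message) {
        for (Consumer<String> listener : listeners.getOrDefault(sagaId, Collections.emptyList())) {
            listener.accept(message);
        }
    }
}
```

Because the saga ID is the only thing the client holds, a user who refreshes the page can reconnect with the same ID and keep following progress, while the saga itself never depended on the browser staying open.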

4 Further Bullet-proofing

4.1 The Semantic Lock Pattern

The Semantic Lock Pattern is a strategy used to handle the "I"
(Isolation) deficiency in the Saga Pattern. Since Sagas are distributed
and take time to complete, they lack the immediate isolation provided by
a traditional ACID database.

Without this pattern, a system is vulnerable to "Dirty Reads" (one saga
reads data that another saga has changed but may still roll back) and
"Lost Updates" (one saga overwrites a change made by a concurrent saga
without ever seeing it).

4.1.1 The Problem: The "Lack of Isolation"

Imagine an E-commerce store with 1 pair of sneakers left in stock.

User A starts a Saga. The Inventory Service successfully deducts
the last pair.
User B arrives a second later. The Inventory Service sees
quantity: 0 and correctly tells User B it's out of stock.
User A's Payment Service then fails.
The Saga triggers a Compensating Transaction to "Undo" the
stock deduction.
Suddenly, the sneakers are back in stock. User B was turned away
for an item that technically became available again, or worse, a
different User C might have seen the "Available" status while
User A was still "failing."

4.1.2 The Solution: How Semantic Locking Works

Instead of a hard database lock (which would kill performance in
microservices), you use an application-level status to signal that a
record is "busy."

The "Pending" State

When a service performs its local transaction, it doesn't just update
the final value. It marks the record with a Pending/Locked status.

Inventory: Status becomes RESERVED_PENDING_PAYMENT.
Payment: Status becomes AUTHORIZATION_HOLD.

Conflict Handling

While a record has a "Semantic Lock," other transactions must follow
specific rules:

Read-Only: Other users can see the item, but perhaps with a "Low
Stock/In Carts" warning.
Write-Blocked: If another Saga tries to buy that same specific
item, the system rejects the request or puts it in a queue because the
record is "Locked."

The Success Path:

Inventory Service: Sets status to PENDING_COMMIT.
Saga Finishes: Orchestrator sends a "Finalize" command.
Inventory Service: Clears the lock and sets status to SOLD.

The Failure Path (The Rollback):

Shipping Service: Fails.
Orchestrator: Sends a "Compensate" command to Inventory.
Inventory Service: Sees the PENDING_COMMIT lock and simply changes
it back to AVAILABLE.

Clarity Note: This prevents "Dirty Reads" because any other
service looking at that item can tell it isn't really gone yet. It is
just "Semantically Locked."
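The success and failure paths above boil down to a small state machine. The sketch below assumes one item record and the three statuses named in this section (`AVAILABLE`, `PENDING_COMMIT`, `SOLD`). The `InventoryItem` class and the idea of recording which saga holds the lock are illustrative assumptions, and a real service would keep this status in a database column rather than in memory.

```java
class SemanticLockSketch {

    enum Status { AVAILABLE, PENDING_COMMIT, SOLD }

    static class InventoryItem {
        private Status status = Status.AVAILABLE;
        private String lockedBy; // ID of the saga holding the semantic lock, if any

        // Local transaction: take the semantic lock. Other sagas are write-blocked.
        synchronized boolean reserve(String sagaId) {
            if (status != Status.AVAILABLE) {
                return false; // reject (or queue) the competing request
            }
            status = Status.PENDING_COMMIT;
            lockedBy = sagaId;
            return true;
        }

        // "Finalize" command: only the saga that holds the lock may complete the sale.
        synchronized boolean commit(String sagaId) {
            if (status != Status.PENDING_COMMIT || !sagaId.equals(lockedBy)) {
                return false;
            }
            status = Status.SOLD;
            lockedBy = null;
            return true;
        }

        // "Compensate" command: release the lock and make the item available again.
        synchronized boolean compensate(String sagaId) {
            if (status != Status.PENDING_COMMIT || !sagaId.equals(lockedBy)) {
                return false;
            }
            status = Status.AVAILABLE;
            lockedBy = null;
            return true;
        }

        // Readers can always see the status, including the "busy" state.
        synchronized Status status() {
            return status;
        }
    }
}
```

Since readers see `PENDING_COMMIT` rather than a plain "out of stock," the UI can show a "Low Stock/In Carts" hint, and a competing saga gets a clear rejection instead of acting on a value that might be rolled back.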

4.2 Combining with Transactional Outbox Pattern

What if your service updates the DB but crashes before it can send the
Kafka event? Or what if the Kafka event is sent but the transaction is
then rolled back?

To solve this, use the Transactional Outbox Pattern.

Instead of sending an event directly to Kafka, save the event into
an OUTBOX table in the same transaction as your business data.
A background process polls the outbox and pushes messages to the
broker.
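The two moving parts can be sketched like this. Everything here is a simplified assumption: the `Database` class is an in-memory stand-in whose `createOrder` method plays the role of a single local transaction, the event is a plain string, and the `broker` is any callback. A real implementation would use an actual OUTBOX table and a Kafka producer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class OutboxSketch {

    // In-memory stand-in for one service's database.
    static class Database {
        final Map<String, String> orders = new HashMap<>();
        final Deque<String> outbox = new ArrayDeque<>();

        // Plays the role of ONE local transaction: the business row and the
        // outbox row are written together, or not at all.
        synchronized void createOrder(String orderId, boolean crashBeforeCommit) {
            String outboxRow = "Order_Created:" + orderId;
            if (crashBeforeCommit) {
                // Nothing has been written yet, so nothing needs to be undone.
                throw new IllegalStateException("simulated crash, transaction rolled back");
            }
            orders.put(orderId, "CREATED");
            outbox.addLast(outboxRow);
        }
    }

    // Background relay: publish each pending row, and delete it only after publishing.
    // Returns how many events were published in this poll.
    static int relay(Database db, Consumer<String> broker) {
        int published = 0;
        synchronized (db) {
            while (!db.outbox.isEmpty()) {
                broker.accept(db.outbox.peekFirst()); // if this throws, the row stays for the next poll
                db.outbox.removeFirst();
                published++;
            }
        }
        return published;
    }
}
```

One consequence worth noting: if the relay crashes after publishing but before deleting the row, the same event goes out again on the next poll. The outbox gives you at-least-once delivery, which is why the consumers still need to be idempotent.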

5 Final Checklist for Production

Idempotency: Can your services handle the same event twice?
Observability: Do you have a correlation-id passing through
all services?
Compensations: Does every "Do" action have a corresponding
"Undo" action?
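For the idempotency item, one common approach is to remember which event IDs have already been processed and skip repeats. The sketch below assumes every event carries a unique ID. The `PaymentHandler` class and its in-memory set are hypothetical, and in a real service the processed-ID record would usually be stored in the same local transaction as the business change.

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentConsumerSketch {

    static class PaymentHandler {
        private final Set<String> processedEventIds = new HashSet<>();
        private int balanceCents;

        PaymentHandler(int initialBalanceCents) {
            this.balanceCents = initialBalanceCents;
        }

        // Applies the charge once per event ID.
        // Returns true if the charge was applied, false if the event was a duplicate.
        synchronized boolean onChargeRequested(String eventId, int amountCents) {
            if (!processedEventIds.add(eventId)) {
                return false; // already seen: do nothing, so redelivery is harmless
            }
            balanceCents -= amountCents;
            return true;
        }

        synchronized int balanceCents() {
            return balanceCents;
        }
    }
}
```

If the broker delivers the same `Charge_Card` event twice, the customer is still charged only once.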

6 Summary

Sagas are about Resilience. By running an Orchestrator in the Backend,
implementing Semantic Locks, adding a Transactional Outbox, and keeping
the Frontend reactive, you build a system that can gracefully handle the
chaos of distributed networks.