This article breaks down the three core forces behind designing distributed systems (communication, coordination and consistency) and shows how they combine into eight saga patterns. You’ll see how each pattern works, where it fits, and what trade-offs come with it. Whether designing a new workflow or improving old ones, this guide helps you reason through the options and make informed design decisions.
Throughout this article, we’ll explain things using an order checkout flow example.
- The Three Forces of Service Interaction
- Communication
- Coordination
- Consistency
- Saga Patterns
- Wrap-Up
- Further Reading
The Three Forces of Service Interaction
Software has evolved from monoliths (one deployable, one database) to SOA (multiple deployables, often one shared database) and finally to microservices (each service owns its data and deploys on its own).
Splitting a system into separate services with the right modularity and granularity is hard, but getting those services to work together is even harder. Business requests like placing an order often span multiple services (Order, Inventory, Payment, Shipping) requiring coordination and introducing new design decisions and trade-offs.
To make sense of those trade-offs, Mark Richards and Neal Ford offer a useful way to think about service interactions in their book Software Architecture: The Hard Parts. They identify three forces that show up every time services need to work together:
- Communication - How does one service talk to another?
  - Synchronous (like REST or gRPC): Caller waits for a response.
  - Asynchronous (messaging or events): Caller sends a message and moves on.
- Coordination - Who drives the workflow?
  - Orchestrator: Central service tells each service what to do.
  - Choreography: Services listen and react to events independently.
- Consistency - When must the data be correct?
  - Atomic: All-or-nothing, like a traditional transaction.
  - Eventual: Some inconsistency is fine, resolved over time.
These forces trade off against each other. Atomic consistency leans on sync calls and orchestration. Async flows favor eventual consistency and choreography. Most systems mix styles, like orchestration for payments, choreography for notifications.
Next, we'll explore each of these forces in more detail, then show how they come together in eight saga patterns, practical approaches to handling distributed transactions.
Communication
When two services need to coordinate a task, how they communicate is just as critical as what they exchange. This choice directly impacts system responsiveness, fault tolerance, scalability, and the degree of coupling between services.
The fundamental communication styles are synchronous and asynchronous.
Synchronous Communication
In synchronous communication, one service sends a request to another service and waits for the response before continuing. This is a blocking interaction: the caller is stalled until it hears back. This pattern is common in protocols like HTTP/REST and gRPC.
- The frontend sends a `POST /checkout` to Order Service.
- Order Service calls Payment Service and waits for it to confirm the charge.
- Once payment is confirmed, it calls Inventory Service to reserve stock.
- Inventory Service calls Shipping Service to arrange delivery after successful reservation.
- Only once all steps succeed, Order Service returns "Order confirmed." to the user.
We now have tight temporal coupling: all services must be online, responsive, and agree in real-time, or the whole system stalls.
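The blocking chain above can be sketched with plain function calls. This is only an illustration: the function names, the `shipping_up` flag, and the `ServiceUnavailable` exception are assumptions standing in for real HTTP or gRPC calls.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error standing in for a timeout or 5xx response."""


def arrange_shipping(order_id: str, shipping_up: bool = True) -> str:
    if not shipping_up:
        raise ServiceUnavailable("shipping")
    return "shipment-scheduled"


def reserve_stock(order_id: str, shipping_up: bool = True) -> str:
    # Inventory reserves stock, then blocks on Shipping.
    arrange_shipping(order_id, shipping_up)
    return "stock-reserved"


def charge_card(order_id: str) -> str:
    return "payment-confirmed"


def checkout(order_id: str, shipping_up: bool = True) -> str:
    """Order Service: each call blocks until the downstream service answers."""
    try:
        charge_card(order_id)
        reserve_stock(order_id, shipping_up)
    except ServiceUnavailable as exc:
        # One unavailable dependency fails the whole request.
        return f"Order failed: {exc} unavailable"
    return "Order confirmed."
```

Calling `checkout("o-1", shipping_up=False)` fails the whole request even though payment and stock reservation worked, which is the temporal coupling described above.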
Trade-offs
Upsides | Downsides |
---|---|
Immediate, deterministic feedback to the caller | Lower availability, one service failure breaks the chain |
Simple control flow and debugging | Tight coupling between services |
Fits user actions that must finish now (login, payment) | Requires resilience mechanisms (retries, timeouts, circuit breakers) |
Asynchronous Communication
In asynchronous communication, one service places a message on a queue and moves on without waiting for a response. This is a non-blocking interaction. The other service picks up the message when ready, often using a message broker like Kafka or RabbitMQ. This decouples services in time and allows for more parallelism.
- The frontend sends a `POST /checkout` to Order Service.
- Order Service saves the order and emits an `OrderPlaced` event.
- Order Service immediately responds to the user: "Your order is being processed."
- Payment Service listens to that event, charges the card, then emits `PaymentCaptured`.
- Inventory Service sees `PaymentCaptured`, reserves the stock, and emits `StockReserved`.
- Shipping Service sees `StockReserved`, ships the item, and emits `OrderShipped`.
- Email Service sees `OrderShipped` and sends the confirmation email.
No service blocks another, and messages queue safely while any service is down, but this also introduces eventual consistency. We will talk about consistency in a later section.
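Here is a minimal sketch of that event flow, assuming a toy in-memory broker in place of Kafka or RabbitMQ. `InMemoryBroker`, `drain`, and the lambda handlers are hypothetical names used only for illustration; the event names come from the steps above.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: messages wait in a queue until drained."""

    def __init__(self):
        self.queue = deque()
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order_id):
        self.queue.append((event_type, order_id))  # returns immediately

    def drain(self):
        """Deliver queued messages; handlers may publish follow-up events."""
        delivered = []
        while self.queue:
            event_type, order_id = self.queue.popleft()
            delivered.append(event_type)
            for handler in self.handlers[event_type]:
                handler(order_id)
        return delivered


broker = InMemoryBroker()
broker.subscribe("OrderPlaced", lambda oid: broker.publish("PaymentCaptured", oid))
broker.subscribe("PaymentCaptured", lambda oid: broker.publish("StockReserved", oid))
broker.subscribe("StockReserved", lambda oid: broker.publish("OrderShipped", oid))
sent_emails = []
broker.subscribe("OrderShipped", sent_emails.append)


def checkout(order_id):
    broker.publish("OrderPlaced", order_id)
    return "Your order is being processed."  # the caller does not wait
```

`checkout` returns before any downstream work has happened; the rest of the chain only runs when the broker delivers the queued messages (`broker.drain()` in this sketch).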
Trade-offs
Upsides | Downsides |
---|---|
High availability: If the receiver is down, messages queue and are processed once it recovers | No immediate feedback |
Loose temporal coupling, highly resilient | Eventual consistency, caller sees only "accepted" |
High parallelism and scalability | Requires extra infrastructure (brokers, tracing) |
Choosing Between Synchronous and Asynchronous
The choice depends on the trade-offs you're willing to make between responsiveness, reliability, and coupling.
Use synchronous communication when:
- The caller needs an immediate result (e.g. credit-card charge, login).
- The service's response directly controls what happens next.
- Dependencies are reliable and low-latency.
Use asynchronous communication when:
- Loose coupling and resilience matter more than speed.
- The task can be done later or retried (e.g., sending emails, logging, bulk imports).
- You need high throughput or resilience. Services need to keep working even if others are down.
- Services are independently deployable or might be temporarily unavailable.
Coordination
When a business request spans multiple services, those services need to work in sync to get the job done. But who drives the workflow? Should one service take charge, or should each one act on its own? That's what coordination is all about.
The coordination style you choose shapes everything, from how you handle errors to where state lives to how complex things get. There are two main patterns: orchestration and choreography.
Orchestration Pattern
A dedicated service (orchestrator) is in charge. It drives the flow by calling each participating service, waiting for their responses, and deciding what happens next. It also owns the workflow state, often storing it in a local table or event log (`CREATED`, `PAID`, `SHIPPED`, etc.). This makes it easy to know exactly where a request stands.
Happy Path
- The frontend sends a `POST /checkout` to the Orchestrator.
- The orchestrator calls Order Service (sync) to create the order.
- Then it calls Payment Service (sync) to charge the card.
- Then it calls Inventory Service (sync) to reserve the stock.
- Then it notifies Shipping Service (async) to ship the item.
- Then it notifies Email Service (async) to send confirmation.
- Finally, it responds to the user with "Order confirmed."
- In each step the orchestrator updates the workflow state.
Failure Path
- Payment Service says "declined".
- The orchestrator updates workflow state to `FAILED_PAYMENT`.
- Then it asks Order Service to undo its changes. This is known as a compensating action.
- Then it asks Email Service to notify the user.
- Then it responds to the user with "Payment has failed".
- No extra communication links are needed: the orchestrator already talks to every service.
These examples illustrate the Fairy Tale Saga. We will talk about sagas later.
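The happy and failure paths above can be sketched as a single class that owns the workflow state. This is a simplified illustration, not a production orchestrator: calls to other services are recorded in a list instead of being sent over the network, and `CheckoutOrchestrator`, `PaymentDeclined`, `always_decline`, and the final `CONFIRMED` status are assumed names.

```python
class PaymentDeclined(Exception):
    pass


def always_decline(order_id):
    """Hypothetical Payment Service call that always declines."""
    raise PaymentDeclined(order_id)


class CheckoutOrchestrator:
    """Owns the workflow state and decides what runs next."""

    def __init__(self, charge_card):
        self.charge_card = charge_card  # injected Payment Service call
        self.state = {}                 # order_id -> workflow status
        self.log = []                   # calls made to other services

    def checkout(self, order_id):
        self.log.append("order.create")
        self.state[order_id] = "CREATED"
        try:
            self.charge_card(order_id)
        except PaymentDeclined:
            self.state[order_id] = "FAILED_PAYMENT"
            self.log.append("order.cancel")  # compensating action
            self.log.append("email.payment_failed")
            return "Payment has failed"
        self.state[order_id] = "PAID"
        self.log.append("inventory.reserve")
        self.log.append("shipping.ship")       # async notification in the flow above
        self.log.append("email.confirmation")  # async notification in the flow above
        self.state[order_id] = "CONFIRMED"     # assumed status name
        return "Order confirmed."
```

Because every decision runs through `checkout`, the current status of any order is one dictionary lookup away, and the compensation logic sits next to the step that can fail.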
Trade-offs
Upsides | Downsides |
---|---|
Single source of truth for progress and errors | Extra network hops add latency |
Central place for timeouts, retries, compensations | Orchestrator can bottleneck or fail |
Easier to reason about and unit-test complex flows | Limits parallelism, steps are often serialized |
| | Tighter coupling between orchestrator and services |
Choreography Pattern
Choreography works without a central service. Each service reacts to events and publishes its own events. Together, these event-driven reactions form the workflow. Since there's no orchestrator, managing state is trickier. Here are common options:
- Front Controller: The first service in the chain (e.g. Order Service) tracks the state. Others report back. Easy to query, but adds responsibilities and coupling.
- Stateless: No service tracks workflow state. To know what happened, you query each service and reconstruct the state on the fly. Loose coupling, but lots of network chatter.
- Stamp Coupling: Instead of storing state, pass it along. Each service adds its progress to the shared message or event as it moves through the workflow. No extra queries, but messages get heavier.
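As a rough illustration of the stamp coupling option, the event itself can carry the workflow history. The `CheckoutEvent` class, its fields, and the `advance` helper are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class CheckoutEvent:
    """Event that carries the workflow's progress with it (stamp coupling)."""
    order_id: str
    type: str
    history: list = field(default_factory=list)

    def advance(self, new_type: str, stamped_by: str) -> "CheckoutEvent":
        # Each service appends its own step instead of writing to a shared store.
        return CheckoutEvent(
            self.order_id, new_type, self.history + [f"{stamped_by}:{new_type}"]
        )


placed = CheckoutEvent("o-1", "OrderPlaced", ["order:OrderPlaced"])
captured = placed.advance("PaymentCaptured", "payment")
reserved = captured.advance("StockReserved", "inventory")
```

Any consumer of `reserved` can read the full progress from `reserved.history` without querying other services, at the cost of a payload that grows with every hop.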
Happy Path
- The frontend sends a `POST /checkout` to Order Service.
- Order Service saves the order, emits `OrderPlaced`.
- Order Service returns immediately to the user "Your order is being processed".
- Payment Service listens, charges the card, emits `PaymentCaptured`.
- Inventory Service listens, reserves the stock, emits `StockReserved`.
- Shipping Service hears `StockReserved`, ships the item, emits `OrderShipped`.
- Email Service listens for `OrderShipped` and sends confirmation to the user.
Failure Path
- Shipping Service emits `OutOfStock`.
- Payment, Inventory and Order services listen to `OutOfStock` to undo their changes.
- Email Service listens to `OutOfStock` and notifies the user.
- New communication links are added each time you discover a new error path.
These examples illustrate the Anthology Saga.
Trade-offs
Upsides | Downsides |
---|---|
High parallelism, steps run in parallel | Debugging involves multiple logs and topics |
Loose coupling, services scale independently | No built-in global state, must design your own approach |
Better fault-isolation, no single point of failure | Error handling scatters across services |
Choosing Between Orchestration and Choreography
Start with the workflow's priorities, then pick the style that matches.
- Complex logic or many ways to fail? Orchestration wins. A single component tracks steps, rolls back work, and hides complexity from others.
- Need fast responses and high parallelism? Choreography fits. Each service does its job and moves on, letting the rest catch up through events.
- Want an easy way to track the workflow status? Orchestration gives a single source of truth. With choreography, you'll need to reconstruct state from events.
- Worried about a single point of failure? Choreography removes the central brain at the cost of more scattered error handling.
Most production systems mix the two. Keep orchestration for high-risk, money-moving steps such as payment and refunds, where clear control and fast rollback matter. Use choreography for high-volume, low-risk steps like sending emails, updating analytics, or syncing inventory, where speed and autonomy pay off.
Consistency
Consistency is the guarantee (strong or weak) that when one service updates data, all other services will immediately or eventually see the same result.
In a distributed system, as soon as a business request involves more than one service, you have to decide how much inconsistency you can tolerate between them, and for how long. Whether you aim for strict, all-or-nothing guarantees (atomic consistency) or let things settle over time (eventual consistency), your consistency strategy shapes how reliable, responsive, and maintainable your system really is.
There are two consistency styles: atomic and eventual. Before exploring them, let's look at how consistency works in the monolith world.
ACID vs BASE
Inside a single service with a single database the "order checkout" workflow is simple. A request starts and triggers a single transaction: insert the order row, reserve stock, charge the card, mark the order ready to ship. If the card step fails, the database rolls everything back. That comes from the four ACID guarantees for transactions:
- Atomicity: All-or-nothing. All updates commit or none do.
- Consistency: Business rules and constraints stay valid throughout the transaction.
- Isolation: During a transaction, other requests can't see its uncommitted changes.
- Durability: Once committed, it's permanent, a crash can't erase the data.
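To make the single-database rollback concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, the `book` SKU, and the `card_ok` flag are assumptions for illustration; the point is only that one local transaction undoes every write when a later step fails.

```python
import sqlite3


def checkout(conn, order_id, card_ok):
    """Run all checkout writes in one local transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO orders(id, status) VALUES (?, 'READY')", (order_id,)
            )
            conn.execute("UPDATE stock SET qty = qty - 1 WHERE sku = 'book'")
            if not card_ok:
                raise RuntimeError("card declined")
        return True
    except RuntimeError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE stock(sku TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES ('book', 5)")
conn.commit()
```

When `card_ok` is false, neither the order row nor the stock decrement survives. No compensation code is needed because the database handles the rollback.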
Move the same workflow into four microservices (Order, Inventory, Payment and Shipping), each with its own database, and ACID breaks. Order and Inventory commit, Payment times out, and there is no global rollback: constraints drift and partial updates leak to users. ACID guarantees hold only within a single database transaction.
You could try a global XA transaction using 2PC, but it means extra network round-trips and long-held locks. The single coordinator can stall the system and kill availability, and every datastore must support the same XA protocol. Most modern teams decide the cost is too high.
Instead, you swap ACID for BASE:
- Basic availability: Services respond quickly, even if data is temporarily inconsistent.
- Soft state: State may temporarily be incorrect or incomplete.
- Eventual consistency: Given retries, compensations or human help, the data will line up.
BASE is a promise to converge, not a guarantee of instant correctness.
Atomic Transactions
If you want an ACID-like experience across services, you typically introduce a central service (orchestrator) that drives the whole workflow. It synchronously invokes each service, lets each one commit locally, and, if something fails, triggers compensating transactions that undo all work as if it never happened. A response is returned to the caller once all steps succeed or the rollback completes.
Happy path
- The frontend sends a `POST /checkout` to the Orchestrator.
- The orchestrator calls Order, Payment, Inventory and Shipping services in sequence.
- Each service commits to its local database immediately with no failures.
- The orchestrator returns "Order confirmed." to the user.
Failure path
- Order, Payment, and Inventory services have already committed.
- Shipping Service times out.
- The orchestrator immediately issues three compensating transactions to undo the earlier steps.
- The orchestrator returns "Unable to ship" to the user once every compensation succeeds.
Points to watch
- This gives you ACD but no Isolation: other requests can see intermediate states before compensation finishes, so dirty reads can happen and other requests might overwrite in-progress changes.
- Compensation itself might fail (e.g. refund gateway offline), so you need retries or a manual-repair dashboard.
- Side-effects (email, analytics) already triggered may not be reversible.
This is the Epic Saga, one way to handle atomic transactions.
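The failure path and the first point to watch can be sketched as a small saga runner that undoes completed steps in reverse order and retries each compensation a few times. `run_saga`, the step tuples, and the `manual_repair` list are hypothetical names. A real implementation would also persist progress so that a crashed orchestrator can resume.

```python
def run_saga(steps, max_compensation_attempts=3):
    """Run (name, action, compensate) steps in order; undo completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            needs_manual_repair = []
            for done_name, undo in reversed(completed):
                for _attempt in range(max_compensation_attempts):
                    try:
                        undo()
                        break
                    except Exception:
                        continue
                else:
                    # Every attempt failed: flag the step for a human.
                    needs_manual_repair.append(done_name)
            return {
                "status": "ROLLED_BACK",
                "failed_step": name,
                "manual_repair": needs_manual_repair,
            }
        completed.append((name, compensate))
    return {"status": "COMPLETED", "failed_step": None, "manual_repair": []}


calls = []


def record(label):
    return lambda: calls.append(label)


def shipping_times_out():
    raise TimeoutError("shipping")


steps = [
    ("order", record("create order"), record("cancel order")),
    ("payment", record("charge card"), record("refund card")),
    ("inventory", record("reserve stock"), record("release stock")),
    ("shipping", shipping_times_out, record("cancel shipment")),
]
result = run_saga(steps)
```

In this run the three committed steps are compensated in reverse order (release stock, refund card, cancel order). Note that between the first commit and the last compensation, other requests could still observe the half-done state.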
Trade-offs
Upsides | Downsides |
---|---|
Data consistency and invariants are restored immediately once compensations finish | Lower availability, response time grows with each hop and compensation |
User sees one clear success/failure result | Orchestrator is a coordination hot-spot and potential bottleneck |
Deterministic rollback logic lives in one place | Isolation is gone, other requests may see half-done state until compensation finishes |
Eventual Transactions
The more scalable alternative is to let each service act independently. Services commit changes locally, publish asynchronous events, return immediately, and rely on other services to react to these events in their own time. Instead of undoing work immediately, failures are handled through retries, fallback states, or human intervention.
Happy Path
- The frontend sends a `POST /checkout` to Order Service.
- Order Service saves and commits the order, emits `OrderCreated` event.
- Order Service responds to the user immediately "Your order is being processed".
- Payment Service processes `OrderCreated`, charges card and emits `PaymentCaptured`.
- Inventory Service processes `PaymentCaptured`, reserves stock and emits `StockReserved`.
- Shipping Service hears `StockReserved`, ships the item and emits `OrderShipped`.
- Email Service hears `OrderShipped` and notifies the user.
- Order Service hears `OrderShipped` and marks the order as `FULFILLED`.
Failure Path
- Payment Service declines the charge and emits `PaymentFailed`.
- Order Service hears `PaymentFailed`, marks order as `PAYMENT_FAILED`.
- From here, we have several recovery paths:
  - Retry Policy: Payment Service retries the charge and emits `PaymentCaptured` or `PaymentFailed` again.
  - Human Intervention: A support dashboard highlights stuck orders with `PAYMENT_FAILED` for a human to manually fix or retry.
  - Fallback State: System gives up and issues compensating transactions to clean up. Here Order Service hears `PaymentFailed`, marks order as `CANCELLED` and emails users about this issue. Similar to the example in Choreography - Failure Path.
This is the Anthology Saga, one way to handle eventual transactions.
Points to Watch
- Decide where the status lives (a column on the order row, a side-car table, or an event stream). Splitting state across multiple places invites race conditions.
- Idempotency is crucial. Every step may be retried. Services must handle duplicate events without breaking state.
- For every non-terminal failure state (e.g. `PAYMENT_FAILED`), identify who's responsible for fixing it and how (automatic retry, human help, or another event).
- Failures that can't recover should be moved to a holding queue or flagged for investigation.
Trade-offs
Upsides | Downsides |
---|---|
High availability | Short windows of data drift. Dashboards, users, and code must tolerate it |
Services scale and deploy independently | Requires retry logic, compensating transactions, or human help to clean-up |
High throughput, no tight transaction boundaries | Debugging spans multiple event hops |
Choosing Between Atomic and Eventual Transactions
The choice depends on the trade-offs you're willing to make between responsiveness, level of consistency, and effort to recover from failure. Ask yourself a few questions:
- How strict is consistency? If any mismatch causes serious issues (money, security), atomic wins. If delay is fine, eventual scales better.
- Can you undo steps? Atomic needs safe rollbacks. If not possible, prefer retries or manual repair.
- Do users need fast responses? Atomic blocks until all steps finish. Eventual responds fast, even if some parts run later.
- What's your fault tolerance? Atomic isolates failure but can reduce availability. Eventual keeps moving, but errors may surface later.
- How autonomous are your services? Atomic often requires orchestration. Eventual keeps services decoupled and event-driven.
Most production systems combine atomic transactions for local operations with distributed, asynchronous messaging across services. Some steps might use synchronous calls for strong feedback, while others rely on eventual consistency and retries.
Saga Patterns
We've already explored different ways to handle business workflows that span multiple services. These are known as sagas. A saga breaks the workflow into local transactions, each owned by one service. After each step commits, the next is triggered via a call or an event, depending on the communication style. If any step fails, the saga issues compensations or moves into an error‐handling path, depending on the consistency and coordination model.
There are eight saga patterns. They're simply every possible combination of the three forces we've been using throughout the article. Mark Richards and Neal Ford gave these sagas memorable names:
Pattern name | Communication | Consistency | Coordination |
---|---|---|---|
Epic Saga | synchronous | atomic | orchestrated |
Phone Tag Saga | synchronous | atomic | choreographed |
Fairy Tale Saga | synchronous | eventual | orchestrated |
Time Travel Saga | synchronous | eventual | choreographed |
Fantasy Fiction Saga | asynchronous | atomic | orchestrated |
Horror Story Saga | asynchronous | atomic | choreographed |
Parallel Saga | asynchronous | eventual | orchestrated |
Anthology Saga | asynchronous | eventual | choreographed |
In the diagrams for each pattern, dotted boxes show atomic consistency. No box means eventual consistency.
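Since the eight patterns are just the product of three two-valued choices, the table can be reproduced in a few lines. The `saga_for` helper is a hypothetical lookup written for this sketch; the names follow the same order as the table rows.

```python
from itertools import product

COMMUNICATION = ("synchronous", "asynchronous")
CONSISTENCY = ("atomic", "eventual")
COORDINATION = ("orchestrated", "choreographed")

# Names in the same order as the table rows.
NAMES = (
    "Epic", "Phone Tag", "Fairy Tale", "Time Travel",
    "Fantasy Fiction", "Horror Story", "Parallel", "Anthology",
)

# 2 x 2 x 2 = 8 combinations, one per saga pattern.
SAGAS = dict(zip(product(COMMUNICATION, CONSISTENCY, COORDINATION), NAMES))


def saga_for(communication, consistency, coordination):
    return SAGAS[(communication, consistency, coordination)] + " Saga"
```

For example, `saga_for("asynchronous", "eventual", "orchestrated")` looks up the Parallel Saga.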
Epic Saga
Synchronous • Atomic • Orchestrated
This pattern enforces all-or-nothing behavior via an orchestrator that makes blocking, synchronous calls and triggers compensating actions on failure. This makes the system behave like a monolith.
- The orchestrator receives the request and manages the workflow.
- It calls each service one after the other, waiting for each to respond.
- If all services succeed, the saga completes successfully.
- If any step fails, the orchestrator triggers compensating actions in reverse order.
- Guarantees atomicity but suffers from bottlenecks and tight coupling.
Choose Epic Saga when you need all-or-nothing behavior and the workflow is relatively short-lived. It’s a familiar approach, but should be avoided for long chains or highly distributed systems.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | Very High | Sync calls, atomicity, and an orchestrator maximize coupling between services. |
Complexity | Low | Sync calls and rollback logic are centralised in the orchestrator. |
Availability | Low | One service failure aborts the whole flow. All-or-nothing behavior will affect responsiveness. |
Scale | Very Low | Orchestrator and atomicity coupling create bottlenecks and limit scaling. |
Phone Tag Saga
Synchronous • Atomic • Choreographed
A fully choreographed version of the Epic Saga where services call each other in a strict order and handle their own rollback logic.
- The initiating service starts the chain and calls the next service synchronously.
- Each service commits locally and calls the next service.
- If any step fails, services must independently send compensating messages upstream.
- No orchestrator exists, each service has coordination and rollback logic which increases complexity.
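A rough sketch of that chain, with each service wrapping its downstream call so it can undo its own work when something further along fails. `make_service`, `StepFailed`, and the log format are assumptions for illustration only.

```python
class StepFailed(Exception):
    pass


def make_service(name, log, next_service=None, fails=False):
    """Build a service that commits locally, then calls the next one synchronously."""

    def handle(order_id):
        if fails:
            raise StepFailed(name)
        log.append(f"{name}: commit")
        if next_service is None:
            return
        try:
            next_service(order_id)
        except StepFailed:
            # No orchestrator: this service undoes its own work, then
            # passes the failure upstream so the caller can do the same.
            log.append(f"{name}: compensate")
            raise

    return handle


log = []
shipping = make_service("shipping", log, fails=True)
inventory = make_service("inventory", log, next_service=shipping)
payment = make_service("payment", log, next_service=inventory)
order = make_service("order", log, next_service=payment)

try:
    order("o-1")
    outcome = "confirmed"
except StepFailed as exc:
    outcome = f"failed at {exc}"
```

Every service now contains both business logic and rollback plumbing, which is where the extra complexity of this pattern comes from.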
This pattern suits only simple, linear workflows that rarely fail. Many error-handling paths and conditional flows make the code unmanageable, so it is best treated as a transitional or legacy-friendly model.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | High | Atomicity and sync calls cause high coupling, but distributed coordination makes it less coupled than Epic Saga. |
Complexity | High | Each service has coordination and rollback logic. |
Availability | Low | Error handling without an orchestrator requires callbacks and multiple round-trips. |
Scale | Low | Sync calls and atomicity prevent parallelism. |
Fairy Tale Saga
Synchronous • Eventual • Orchestrated
Orchestration with synchronous calls, but each service manages its own commit. Consistency is achieved eventually, not atomically.
- The orchestrator sends synchronous calls to services in sequence.
- Each service commits its changes independently.
- The orchestrator listens for success or failure after each step.
- If any step fails, nothing is rolled back as a single unit. The data lines up eventually through retries or compensations.
- The orchestrator can still trigger compensating actions, but they won't be part of an active transaction.
Ideal for business processes where a central controller is valuable and consistency can be delayed. Think of checkout, signup, or account setup flows that need visibility and control but don’t require strict atomicity, which makes this saga popular and common with many microservices architectures.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | High | Uses an orchestrator and sync calls, but avoids global transactions. |
Complexity | Very Low | Sync calls and rollback logic are centralised in the orchestrator, and consistency is relaxed. |
Availability | Medium | Still blocks on each call, but allows for eventual consistency. |
Scale | High | Better scalability due to lack of transactional coupling. |
Time Travel Saga
Synchronous • Eventual • Choreographed
Fully decentralized version of the Fairy Tale Saga. Services call each other in sequence and own all workflow logic, including failures.
- A service begins and completes its local transaction.
- It then calls the next service synchronously and passes control forward.
- Each service continues this chain until the workflow ends.
- If an error occurs, each service must handle its own compensations.
Best for throughput-focused, one-way and linear flows, such as ETL pipelines and simple chains where each step progresses naturally, independently and in-order.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | Medium | No orchestrator and no atomicity reduce coupling, but sync calls retain some coupling. |
Complexity | Low | No transactional logic, services handle only local logic. |
Availability | Medium | Still blocks on each call, but no central bottleneck means fewer hops. |
Scale | High | Choreographed flows with local commits scale well. |
Fantasy Fiction Saga
Asynchronous • Atomic • Orchestrated
An orchestrated saga that attempts atomic coordination over asynchronous calls, introducing heavy complexity in managing order and state.
- The orchestrator sends asynchronous commands to each participating service.
- Services perform local transactions and respond, possibly out of order.
- The orchestrator tracks progress and handles pending state.
- On failure, it issues compensating commands asynchronously.
- Coordination logic must handle race conditions and retries.
Only consider this pattern when atomic guarantees are a must and you need some parallelism or better performance. It is hard to get right because managing transactional consistency asynchronously is challenging, and it requires advanced orchestration and observability tooling.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | High | Atomic guarantees demand coordination, async makes timing harder. |
Complexity | High | Orchestrator must manage out-of-order events, rollbacks, retries, and partial states. |
Availability | Low | Async compensations mean long recovery paths, and one service failure affects the whole flow. |
Scale | Low | High scale is still challenging with atomic services, async alone can't offset coordination bottlenecks. |
Horror Story Saga
Asynchronous • Atomic • Choreographed
The most difficult model that tries to achieve atomic consistency with no orchestrator and only async messaging (the two loosest coupling factors). All services must coordinate rollbacks without global state.
- Services exchange messages asynchronously and commit locally.
- No orchestrator so each service must track workflow state and handle compensation.
- Compensation logic must handle failures across out-of-order, possibly incomplete message chains.
- High risk of race conditions, cascading failures, and coordination errors.
Avoid this pattern where you can. It's considered a red flag, signaling accidental complexity or under-designed coordination. Consider it only if you truly require atomicity but cannot introduce orchestration due to organizational boundaries.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | Medium | No orchestrator helps loosen structure, but atomicity still enforces shared state constraints. |
Complexity | Very High | Services must coordinate rollbacks asynchronously, tracking transaction state and order. |
Availability | Low | Async chatter to achieve atomicity hurts responsiveness. |
Scale | Medium | Parallelism is possible with async calls. No orchestrator helps as well. |
Parallel Saga
Asynchronous • Eventual • Orchestrated
A scalable and resilient pattern where the orchestrator coordinates async service calls with eventual consistency, enabling high throughput.
- The orchestrator sends async requests to all participating services.
- Services execute independently and manage their own commits.
- Results are returned asynchronously to the orchestrator.
- If the orchestrator receives a failure, it sends async messages to services to compensate for this failed change.
- Enables parallel execution and graceful recovery at scale.
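The steps above can be sketched with `asyncio`, where the orchestrator fans out requests, collects the results, and compensates only the services that actually committed. The service names, the `failing` parameter, and the log format are assumptions. A real system would send messages through a broker instead of awaiting coroutines in one process.

```python
import asyncio


async def call_service(name, log, fail=False):
    await asyncio.sleep(0)  # stands in for an async message round trip
    if fail:
        raise RuntimeError(name)
    log.append(f"{name}: committed")
    return name


async def parallel_saga(failing=()):
    log = []
    services = ["payment", "inventory", "shipping"]
    # Fan out to all services at once; failures come back as exception objects.
    results = await asyncio.gather(
        *(call_service(s, log, fail=s in failing) for s in services),
        return_exceptions=True,
    )
    succeeded = [r for r in results if not isinstance(r, Exception)]
    if len(succeeded) < len(services):
        # Compensate only the services that actually committed.
        for name in succeeded:
            log.append(f"{name}: compensated")
        return "FAILED", log
    return "COMPLETED", log


status, log = asyncio.run(parallel_saga(failing={"inventory"}))
```

Here the inventory failure does not block payment or shipping from running. The orchestrator cleans up afterwards, which is the eventual-consistency trade-off this pattern accepts.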
Perfect for high-volume complex business flows, e.g., onboarding, order processing, subscription handling, where speed and observability matter more than atomic guarantees. Great balance of control, resilience, and performance.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | Low | No global transaction, services react to events, the orchestrator only sequences steps. |
Complexity | Low | The orchestrator's logic is simple due to low coupling. |
Availability | High | Fast responses, non-blocking flows. |
Scale | High | No atomicity guarantee, services scale at their own pace. |
Anthology Saga
Asynchronous • Eventual • Choreographed
The most decoupled pattern: services communicate via events without orchestration, each maintaining its own state and reacting to changes.
- Services emit events upon completion of local work.
- Other services listen and react to those events asynchronously.
- Each service is responsible for its own transaction scope and compensation.
- No orchestrator or synchronous links, state is emergent from event flow.
- Maximizes scalability and autonomy at the cost of visibility and control.
Choose it when scale and service independence are the priority. Ideal for data ingestion, analytics pipelines, or any process tolerant of loose consistency. Expect reduced observability, but maximum throughput and fault isolation. It's common in many microservices architectures.
Trade-offs
Characteristic | Value | Description |
---|---|---|
Coupling | Very Low | No orchestrator, no global transaction, and fully decoupled via events. |
Complexity | High | Error handling and state reconstruction are tricky. |
Availability | High | Services operate independently, queues absorb load spikes. |
Scale | Very High | No coupling factors. Ideal for massive scale. |
Wrap-Up
There’s no one-size-fits-all saga. Each pattern involves trade-offs across key characteristics like consistency, availability, scalability and performance. You can't maximize them all at once. Strong control often limits scalability, while loose coupling increases flexibility but demands stronger coordination and observability.
In practice, many systems adopt multiple saga patterns. For example, you might use the Epic Saga for critical, atomic flows like payments, and the Parallel Saga for scalable tasks that don't require immediate consistency, like order fulfillment. The key is to choose the right trade-offs for each workflow, guided by the characteristics your business values most and can’t afford to sacrifice.
Further Reading
Most of the material here is taken from the book Software Architecture: The Hard Parts by Mark Richards and Neal Ford.