So you finally migrated your monolith to microservices. Congratulations! Now each service has its own database, everything is decoupled, and you're confident everything is under control, until that seemingly simple requirement arrives:
"We need to process an order that involves payment, inventory AND customer notification"
There you go. Welcome to distributed transaction hell.
The problem that every developer has faced or will face at some point
Remember when everything was simple in the monolith? You started an ACID transaction, performed your operations, and if something went wrong, a ROLLBACK solved everything.
But now you have:
- An orders and inventory service with its own relational database
- A payment service with MongoDB
- A notifications service
And they all need to work together in a coordinated way. If payment fails after you've already decremented inventory, what do you do? If the notification doesn't send, do you need to revert everything?
Using BEGIN TRANSACTION won't work here.
That's where the Saga Pattern comes in: a way to manage transactions that span multiple services. As Chris Richardson (microservices.io) puts it in his article:
"A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule, the saga executes a series of compensating transactions that undo the changes made by the preceding local transactions."
The idea is to break that "big transaction" into several smaller transactions, each in its own service. And here's the interesting part:
if something goes wrong, you don't roll back; you execute compensating transactions.
What are Compensating Transactions? It's like the "Ctrl+Z" of Microservices
A compensating transaction is basically the opposite of the original transaction. Booked a flight? The compensation is to cancel the booking. Charged the card? The compensation is to issue a refund, and so on.
Think of the classic travel example (every Saga article uses it, but it's so good I'll use it too). Imagine an application where:
- You book a flight
- Then make a hotel room reservation
- And finally rent a car
But if the last step, the car rental, fails, the application needs to:
First: Cancel the hotel reservation
Second: Cancel the flight booking
These are your compensating transactions. Simple, right?
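The travel flow above can be sketched in a few lines of plain PHP. This is a minimal in-memory illustration, not a real implementation: the `runSaga` helper and the empty closures are hypothetical stand-ins for calls to the flight, hotel, and car services, and the car step is hard-coded to fail so the compensations run.

```php
<?php
// Minimal in-memory sketch: run steps in order and, on failure, undo the
// completed ones in reverse. Step names and closures are hypothetical.

function runSaga(array $steps): array
{
    $completed = []; // names of steps that succeeded, in execution order
    $trace = [];     // human-readable record of what happened

    foreach ($steps as $name => $step) {
        try {
            $step['action']();
            $completed[] = $name;
            $trace[] = "done:{$name}";
        } catch (Exception $e) {
            $trace[] = "failed:{$name}";
            // Compensate only what actually completed, newest first.
            foreach (array_reverse($completed) as $doneName) {
                $steps[$doneName]['compensate']();
                $trace[] = "undone:{$doneName}";
            }
            return $trace;
        }
    }

    return $trace;
}

$steps = [
    'flight' => ['action' => function () {}, 'compensate' => function () {}],
    'hotel'  => ['action' => function () {}, 'compensate' => function () {}],
    'car'    => [
        'action' => function () { throw new Exception('No cars available'); },
        'compensate' => function () {},
    ],
];

$trace = runSaga($steps);
echo implode("\n", $trace), "\n";
```

Note that the failed step itself is not compensated: it never completed, so there is nothing to undo.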
The Two Faces of Saga: Orchestration vs Choreography
There are two main ways to implement the Saga Pattern, and each has its trade-offs.
1. Orchestration
In orchestration, you have a central orchestrator that coordinates everything. It knows the order of operations, when to call each service, and when to execute compensations.
Client → Orchestrator → Service A → Orchestrator → Service B → Orchestrator → Service C ...
Advantages:
- Easier to debug (you know exactly where you are)
- Centralized logic
- Easier to visualize the flow
Disadvantages:
- Single point of failure (if the orchestrator goes down, it's over)
- Can become a "God Object" pretty quickly
- Coupling with the orchestrator
AWS has a great example using Step Functions in its documentation (see AWS Prescriptive Guidance in the references), where each step has its own success and failure handlers.
2. Choreography
In choreography, each service knows what to do when it receives an event and publishes new events for the next one in the chain.
Service A → Event: "A_COMPLETED" → Service B → Event: "B_COMPLETED" → Service C
Advantages:
- Fully decoupled
- No single point of failure
- More "microservice-like"
Disadvantages:
- Hard to debug (logic is spread out)
- Can become "event hell" pretty fast
- Understanding the complete flow requires looking at N services
As the folks at Baeldung mention, choreography is better for simple flows, and orchestration for more complex ones.
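To make the choreography flow more concrete, here is a small sketch that replaces the message broker with an in-process `EventBus` class. Everything in it is an assumption for illustration: the bus, the `C_FAILED` event, and the `*_COMPENSATED` events are hypothetical names, and a real system would publish through RabbitMQ or a similar broker. It shows how the failure path also becomes a chain of events, with no central coordinator.

```php
<?php
// In-process stand-in for a message broker (hypothetical, for illustration only).
class EventBus
{
    private $handlers = [];
    public $published = [];

    public function subscribe(string $event, callable $handler): void
    {
        $this->handlers[$event][] = $handler;
    }

    public function publish(string $event, array $payload): void
    {
        $this->published[] = $event;
        foreach ($this->handlers[$event] ?? [] as $handler) {
            $handler($payload);
        }
    }
}

$bus = new EventBus();

// Service B reacts to A's event and announces its own completion.
$bus->subscribe('A_COMPLETED', function (array $p) use ($bus) {
    $bus->publish('B_COMPLETED', $p);
});

// Service C fails in this scenario, so it publishes a failure event instead.
$bus->subscribe('B_COMPLETED', function (array $p) use ($bus) {
    $bus->publish('C_FAILED', $p);
});

// Each upstream service listens for the failure behind it and compensates.
$bus->subscribe('C_FAILED', function (array $p) use ($bus) {
    $bus->publish('B_COMPENSATED', $p);
});
$bus->subscribe('B_COMPENSATED', function (array $p) use ($bus) {
    $bus->publish('A_COMPENSATED', $p);
});

$bus->publish('A_COMPLETED', ['saga_id' => 'trip_1']);
echo implode(' -> ', $bus->published), "\n";
```

Even in this toy version, the complete flow is only visible by reading every subscription, which is the debugging cost listed above.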
Implementing in Practice
I'll show a conceptual example in PHP using RabbitMQ with orchestration (because it's easier to understand).
I'll assume you've already set up the project with the necessary libraries, so I'll focus only on what matters.
We then create the TravelBookingSagaOrchestrator class that will orchestrate the Saga, with the following methods:
public function bookTrip(array $tripData): array
{
    $sagaId = uniqid('trip_', true);
    $this->sagaLog = [];

    try {
        $flight = $this->executeStep('flight', 'reserve', [
            'from' => $tripData['from'],
            'to' => $tripData['to'],
            'date' => $tripData['departure_date'],
            'passengers' => $tripData['passengers']
        ], $sagaId);
        $this->logStep('flight_reserved', $flight);

        $hotel = $this->executeStep('hotel', 'reserve', [
            'city' => $tripData['to'],
            'checkin' => $tripData['checkin_date'],
            'checkout' => $tripData['checkout_date'],
            'guests' => $tripData['passengers']
        ], $sagaId);
        $this->logStep('hotel_reserved', $hotel);

        $car = $this->executeStep('car', 'reserve', [
            'city' => $tripData['to'],
            'pickup_date' => $tripData['checkin_date'],
            'return_date' => $tripData['checkout_date']
        ], $sagaId);
        $this->logStep('car_reserved', $car);

        return [
            'success' => true,
            'booking' => [
                'saga_id' => $sagaId,
                'flight' => $flight,
                'hotel' => $hotel,
                'car' => $car
            ]
        ];
    } catch (Exception $e) {
        echo "Booking error: {$e->getMessage()}\n";
        echo "Starting compensations...\n";
        $this->compensate($sagaId);
        throw new Exception("Trip booking failed: " . $e->getMessage());
    }
}

private function executeStep(string $service, string $action, array $data, string $sagaId): array
{
    $command = [
        'saga_id' => $sagaId,
        'action' => $action,
        // The worker routes on this flag, so it must be set for compensating actions
        'is_compensation' => strpos($action, 'compensate_') === 0,
        'data' => $data,
        'timestamp' => time()
    ];

    $message = new AMQPMessage(
        json_encode($command),
        ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]
    );
    $this->channel->basic_publish($message, '', "{$service}_commands");

    // Wait for response with timeout (in seconds)
    $response = $this->waitForResponse($sagaId, $service, 30);

    if ($response['status'] === 'error') {
        throw new Exception("Step {$service}/{$action} failed: " . $response['data']);
    }

    return $response['data'];
}

private function compensate(string $sagaId): void
{
    // Execute compensations in reverse order
    $reversedLog = array_reverse($this->sagaLog);

    foreach ($reversedLog as $step) {
        $compensation = $this->getCompensation($step);
        if (!$compensation) {
            continue;
        }

        $retries = 0;
        $maxRetries = 5;
        $compensated = false;

        while ($retries < $maxRetries) {
            try {
                $this->executeStep(
                    $compensation['service'],
                    $compensation['action'],
                    $compensation['data'],
                    $sagaId
                );
                echo "Compensated: {$step['type']}\n";
                $compensated = true;
                break;
            } catch (Exception $e) {
                $retries++;
                echo "Compensation retry {$retries}/{$maxRetries}: {$e->getMessage()}\n";
                if ($retries < $maxRetries) {
                    sleep(2 ** $retries); // Exponential backoff: 2, 4, 8, 16 seconds
                }
            }
        }

        if (!$compensated) {
            // Retries exhausted: don't fail silently, flag the step for manual intervention
            error_log("Saga {$sagaId}: compensation for {$step['type']} failed after {$maxRetries} attempts");
        }
    }
}

private function getCompensation(array $step): ?array
{
    $compensations = [
        'flight_reserved' => [
            'service' => 'flight',
            'action' => 'compensate_cancel',
            'data' => ['id' => $step['data']['id'], 'pnr' => $step['data']['pnr']]
        ],
        'hotel_reserved' => [
            'service' => 'hotel',
            'action' => 'compensate_cancel',
            'data' => ['id' => $step['data']['id']]
        ],
        'car_reserved' => [
            'service' => 'car',
            'action' => 'compensate_cancel',
            'data' => ['id' => $step['data']['id']]
        ]
    ];

    return $compensations[$step['type']] ?? null;
}

private function logStep(string $type, array $data): void
{
    $this->sagaLog[] = ['type' => $type, 'data' => $data, 'timestamp' => time()];
}
To keep things short, I'll show only one of the services (FlightService) as a simplified example:
class FlightServiceWorker
{
    private $channel;

    public function __construct()
    {
        $this->channel = RabbitMQConnection::getChannel();
        $this->channel->queue_declare('flight_commands', false, true, false, false);
    }

    public function start(): void
    {
        $callback = function ($msg) {
            $command = json_decode($msg->body, true);
            try {
                if ($command['is_compensation'] ?? false) {
                    $result = $this->handleCompensation($command);
                } else {
                    $result = $this->handleCommand($command);
                }
                $this->sendResponse($command['saga_id'], 'success', $result);
                $msg->ack();
            } catch (Exception $e) {
                $this->sendResponse($command['saga_id'], 'error', $e->getMessage());
                // The orchestrator has already been told about the failure, so don't requeue.
                // nack($requeue, $multiple): with requeue = false the message is dropped or,
                // if a dead letter exchange is configured, routed there
                $msg->nack(false);
            }
        };

        $this->channel->basic_qos(null, 1, null);
        $this->channel->basic_consume('flight_commands', '', false, false, false, false, $callback);

        while ($this->channel->is_consuming()) {
            $this->channel->wait();
        }
    }

    private function handleCommand(array $command): array
    {
        switch ($command['action']) {
            case 'reserve':
                // Include the saga ID so the idempotency key (a hash of this data) is
                // unique per saga, not shared by unrelated bookings with identical trip details
                return $this->reserveFlight(['saga_id' => $command['saga_id']] + $command['data']);
            default:
                throw new Exception("Unknown action: {$command['action']}");
        }
    }

    private function handleCompensation(array $command): array
    {
        switch ($command['action']) {
            case 'compensate_cancel':
                return $this->cancelFlight($command['data']);
            default:
                throw new Exception("Unknown compensation: {$command['action']}");
        }
    }

    private function reserveFlight(array $data): array
    {
        // Idempotence: a redelivered command must not create a second reservation
        $existingReservation = $this->checkExistingReservation($data);
        if ($existingReservation) {
            echo "Reservation already exists (idempotence), returning it...\n";
            return $existingReservation;
        }

        // Simulate a 10% failure rate for demonstration purposes
        if (rand(1, 100) <= 10) {
            throw new Exception("Flight unavailable for the requested date.");
        }

        $reservationId = uniqid('FLT_');
        $pnr = strtoupper(substr(md5($reservationId), 0, 6));

        DB::insert('flight_reservations', [
            //...
        ]);

        echo "Flight booked: {$pnr} - {$data['from']} → {$data['to']}\n";

        return [
            'id' => $reservationId,
            'pnr' => $pnr,
            'from' => $data['from'],
            'to' => $data['to'],
            'date' => $data['date'],
            'passengers' => $data['passengers'],
            'status' => 'confirmed',
            'price' => 450.00
        ];
    }

    private function cancelFlight(array $data): array
    {
        $reservation = DB::selectOne('flight_reservations', ['id' => $data['id']]);

        // Idempotence: nothing to undo if the reservation was never stored
        // or has already been cancelled
        if (!$reservation) {
            return ['id' => $data['id'], 'pnr' => $data['pnr'], 'status' => 'cancelled'];
        }
        if ($reservation['status'] === 'cancelled') {
            return $reservation;
        }

        DB::update('flight_reservations',
            ['status' => 'cancelled', 'cancelled_at' => date('Y-m-d H:i:s')],
            ['id' => $data['id']]
        );

        return [
            'id' => $data['id'],
            'pnr' => $data['pnr'],
            'status' => 'cancelled'
        ];
    }

    private function checkExistingReservation(array $data): ?array
    {
        // The idempotency key is a hash of the command data
        $hash = md5(json_encode($data));
        return DB::selectOne('flight_reservations', ['idempotency_key' => $hash]) ?: null;
    }

    private function sendResponse(string $sagaId, string $status, $data): void
    {
        $response = [
            'saga_id' => $sagaId,
            'status' => $status,
            'data' => $data,
            'service' => 'flight',
            'timestamp' => time()
        ];

        $message = new AMQPMessage(
            json_encode($response),
            ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]
        );
        $this->channel->basic_publish($message, '', 'saga_responses');
    }
}

$worker = new FlightServiceWorker();
$worker->start();
And to use the orchestrator (TravelBookingSagaOrchestrator):
try {
    $orchestrator = new TravelBookingSagaOrchestrator();

    $tripData = [
        'from' => 'GRU',
        'to' => 'MIA',
        'departure_date' => '2026-07-15',
        'checkin_date' => '2026-07-15',
        'checkout_date' => '2026-07-22',
        'passengers' => 2
    ];

    $result = $orchestrator->bookTrip($tripData);

    echo "\n Trip booked successfully!\n";
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n";
    echo "Flight: {$result['booking']['flight']['pnr']}\n";
    echo "Hotel: {$result['booking']['hotel']['id']}\n";
    echo "Car: {$result['booking']['car']['id']}\n";
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n";
} catch (Exception $e) {
    echo "\n ERROR: {$e->getMessage()}\n";
    echo "All reservations have been cancelled.\n";
}
Important tip: In production, you would run each worker in a separate process, and the orchestrator could be called via API or another worker consuming a "create_order" queue.
Production Configurations
Before moving on to the pitfalls, some essential configurations for production environments:
1. Durable Queues
$channel->queue_declare('flight_commands',
    false, // passive
    true,  // durable - Survives RabbitMQ restart
    false, // exclusive
    false  // auto_delete
);
2. Persistent Messages
$message = new AMQPMessage(
    json_encode($data),
    ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]
);
3. Process 1 message at a time - important to avoid overload
$channel->basic_qos(null, 1, null);
4. Dead Letter Exchange (DLX) - For messages that failed too many times
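$args = new AMQPTable([
    'x-dead-letter-exchange' => 'dlx_exchange',
    'x-dead-letter-routing-key' => 'failed_bookings'
]);
// Note: RabbitMQ rejects redeclaring an existing queue with different arguments,
// so a 'flight_commands' queue created without these arguments must be deleted first
$channel->queue_declare('flight_commands', false, true, false, false, false, $args);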
This is CRUCIAL to not lose messages that failed multiple times.
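The queue arguments above only say where dead-lettered messages should go; the exchange and a queue bound to it also have to exist, otherwise those messages are simply dropped. Here is a minimal sketch of the full topology, assuming php-amqplib and an already opened channel. The helper name `declareDeadLetterTopology` and the choice to name the parking queue `failed_bookings` (same as the routing key) are assumptions, not something the original setup prescribes.

```php
<?php
// Sketch only: requires php-amqplib and an open channel. The function name and
// the 'failed_bookings' queue name are assumptions made for this example.
use PhpAmqpLib\Channel\AMQPChannel;
use PhpAmqpLib\Wire\AMQPTable;

function declareDeadLetterTopology(AMQPChannel $channel): void
{
    // 1. The exchange that receives dead-lettered messages (direct, durable, not auto-deleted)
    $channel->exchange_declare('dlx_exchange', 'direct', false, true, false);

    // 2. A durable queue to park failed messages for inspection or replay
    $channel->queue_declare('failed_bookings', false, true, false, false);

    // 3. Bind with the same key used in x-dead-letter-routing-key
    $channel->queue_bind('failed_bookings', 'dlx_exchange', 'failed_bookings');

    // 4. The work queue, pointing at the dead letter exchange
    $args = new AMQPTable([
        'x-dead-letter-exchange' => 'dlx_exchange',
        'x-dead-letter-routing-key' => 'failed_bookings',
    ]);
    $channel->queue_declare('flight_commands', false, true, false, false, false, $args);
}
```

Keep in mind that RabbitMQ dead-letters a message only when it is rejected or nacked with requeue set to false, when it expires, or when the queue overflows. A consumer that always requeues on error will never send anything to the dead letter exchange.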
Airline, hotel, and car rental APIs are often slow or unstable. So it's important to configure timeouts:
$connection = new AMQPStreamConnection(
    'localhost', 5672, 'guest', 'guest', '/',
    false,      // insist
    'AMQPLAIN', // login method
    null,       // login response
    'en_US',    // locale
    3.0,        // connection timeout (seconds)
    3.0         // read/write timeout (seconds)
);
Important point! Retry must be implemented in compensations!
Common Pitfalls
While implementing a Saga, you'll run into some obstacles. Be careful with:
1. Lack of Isolation
Unlike ACID transactions, Sagas don't guarantee isolation. This means other processes can see intermediate states.
Practical example: A user might see that payment was processed but the order isn't confirmed yet. You need to handle this in the UI.
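One common way to handle this is to give the order an explicit in-progress status while the saga is running, so the UI can tell "still processing" apart from "final". The sketch below is an assumption-laden illustration: the status names, the `ORDER_TRANSITIONS` table, and both helper functions are hypothetical and not taken from any library.

```php
<?php
// Sketch: the order carries an explicit status while the saga is in flight.
// Status names and helpers are hypothetical.
const ORDER_TRANSITIONS = [
    'PENDING'   => ['CONFIRMED', 'CANCELLED'],
    'CONFIRMED' => [],
    'CANCELLED' => [],
];

// Returns the new status, or throws if the move is not allowed.
function transition(string $current, string $next): string
{
    if (!in_array($next, ORDER_TRANSITIONS[$current] ?? [], true)) {
        throw new InvalidArgumentException("Illegal transition {$current} -> {$next}");
    }
    return $next;
}

// What the UI shows for each status; PENDING is never presented as final.
function labelForUi(string $status): string
{
    $labels = [
        'PENDING'   => 'Processing your order...',
        'CONFIRMED' => 'Order confirmed',
        'CANCELLED' => 'Order cancelled',
    ];
    return $labels[$status] ?? 'Unknown';
}

$status = 'PENDING';                          // set when the saga starts
echo labelForUi($status), "\n";
$status = transition($status, 'CONFIRMED');   // set when the last step succeeds
echo labelForUi($status), "\n";
```

The saga sets the order to a final status only at the end (or after compensating), and other processes treat a pending order as not yet decided.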
2. Compensations Can Fail
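If a compensation fails, the saga is left in an inconsistent state that it cannot recover from on its own.
Solution: Implement idempotency and automatic retry. Your compensations need to be idempotent (executing N times = executing 1 time), and you need to keep retrying with backoff until they succeed. In practice the retries are bounded, as in the orchestrator above (5 attempts), so anything that still fails must be escalated, for example to a dead letter queue or an alert, instead of being dropped silently.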
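As a rough sketch of the retry part, here is a small helper with exponential backoff. It is bounded, like the orchestrator's `compensate()` loop, rather than truly infinite, and it reports failure so the caller can escalate. The `retryWithBackoff` name, its return shape, and the injectable `$sleep` callable are all assumptions for this example; the fake compensation fails twice and then succeeds.

```php
<?php
// Sketch: bounded retry with exponential backoff. $sleep is injectable so the
// example does not actually wait; in production you would pass 'sleep'.
function retryWithBackoff(callable $operation, int $maxAttempts, callable $sleep): array
{
    $delays = [];

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $result = $operation();
            return ['ok' => true, 'result' => $result, 'attempts' => $attempt, 'delays' => $delays];
        } catch (Exception $e) {
            if ($attempt === $maxAttempts) {
                break; // no point sleeping after the final attempt
            }
            $delay = 2 ** $attempt; // 2, 4, 8, ... seconds
            $delays[] = $delay;
            $sleep($delay);
        }
    }

    return ['ok' => false, 'result' => null, 'attempts' => $maxAttempts, 'delays' => $delays];
}

// Hypothetical compensation that fails twice, then succeeds.
$calls = 0;
$cancelHotel = function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new Exception('Hotel API timeout');
    }
    return 'cancelled';
};

$outcome = retryWithBackoff($cancelHotel, 5, function (int $seconds) {
    // no-op in this sketch
});

if (!$outcome['ok']) {
    // Escalate instead of looping forever: dead letter queue, alert, manual review.
    error_log('Compensation needs manual intervention');
}

echo json_encode($outcome), "\n";
```

Retrying like this is only safe because the compensation is idempotent: a retry after a lost response must not cancel or refund twice.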
3. Debugging is a nightmare
Especially with choreography. You'll need:
- Correlation IDs everywhere
- Structured logging
- Distributed tracing tools
- A lot of patience
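For the first two items, a minimal sketch is to emit one JSON object per log line and always include the saga ID as the correlation ID, so logs from different services can be joined later. The `sagaLogLine` helper and its field names are illustrative assumptions, not a standard format.

```php
<?php
// Sketch: one JSON object per log line, always carrying the saga's correlation ID.
// The helper and its field names are illustrative, not a standard.
function sagaLogLine(string $sagaId, string $service, string $event, array $context = []): string
{
    return json_encode([
        'ts' => gmdate('c'),           // ISO 8601 timestamp in UTC
        'correlation_id' => $sagaId,   // same value in every service for this saga
        'service' => $service,
        'event' => $event,
        'context' => $context,
    ]);
}

$line = sagaLogLine('trip_123', 'flight', 'reserve_failed', ['reason' => 'Flight unavailable']);
echo $line, "\n";
```

With every service logging the same `correlation_id`, searching for one saga ID reconstructs the whole flow, including compensations.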
4. Irreversible Transactions
Some cases can't be compensated. If you sent an email, you can't "unsend" it. If you printed a ticket, you can't "unprint" it.
In these cases, you need to think about alternative compensations (like sending a cancellation email).
Tools That Can Help
Depending on the scenario, you don't need to reinvent the wheel. There are several tools/frameworks available:
- Axon Framework - Popular in the Spring Boot world
- Eventuate - From Chris Richardson himself
- Temporal - Focused on durable execution (very good, by the way)
- AWS Step Functions - If you're on AWS
The folks at Temporal have an excellent article showing how they abstract away all the complexity of tracking and retry.
When NOT to Use Saga
Important: not everything needs to be a Saga.
Don't use Saga if:
- The transaction is local to a single service (obvious)
- Data can be eventually consistent WITHOUT coordination
- The cost of complexity > benefit
- You're just starting with microservices (seriously, start simple)
As the Azure folks put it well in their documentation: evaluate business risk. For low-risk operations, simple eventual consistency might be enough.
Lessons I Learned the Hard Way
After implementing Saga in production, here are some learnings:
- Start with orchestration - It's easier to debug and evolve
- Invest in observability - You WILL need it
- Test the compensations - Don't wait to find out in production
- Document the flow - Diagrams save lives
- Be pragmatic - Not everything needs to be transactional
Conclusion
The Saga Pattern is not a silver bullet. It adds complexity, requires discipline, and will make you think A LOT about edge cases. But when you truly need consistency across multiple services, it's practically unavoidable.
Once you understand the basic concepts and choose the right approach (orchestration vs choreography), it becomes more manageable. And there are several tools that can help.
The secret is not to try to implement everything at once. Start simple, add observability from the beginning, and evolve as needed.
References
Microservices.io - Saga Pattern - Chris Richardson
Microsoft Learn - Saga Design Pattern
Baeldung - Saga Pattern in Microservices
Temporal - Saga Pattern Made Easy
AWS Prescriptive Guidance - Saga Pattern
TheServerSide - How the Saga Design Pattern Works