Pedro Santos

Why Sagas (and Why Not Distributed Transactions)

You have 5 microservices. An order comes in. You need to validate the product, charge the customer, and reserve inventory. If any of those steps fails, you need to undo the ones that already succeeded.

The textbook answer is a distributed transaction with two-phase commit (2PC). Lock all resources across all services, do the work, then commit everything at once. The problem: 2PC doesn't scale. It requires all services to be available simultaneously. One slow database and everything blocks. In a microservices world with Kafka and independent deployments, 2PC is a non-starter.

The alternative is the Saga Pattern. Instead of one big transaction, you run a chain of local transactions. Each service does its work and publishes an event. If a step fails, you run compensating transactions to undo the previous steps. No distributed locks. No two-phase commit. Each service owns its own data and its own rollback logic.

This series walks through how I built a saga orchestrator from scratch with Spring Boot and Kafka. Real code, real failure scenarios, real rollback chains.

Choreography vs Orchestration

There are two ways to implement sagas.

Choreography means each service listens for events and decides what to do next. Order service publishes "order created." Payment service picks it up and charges the card. Inventory service picks up "payment completed" and reserves stock. No central coordinator.

The problem with choreography is that nobody owns the flow. When you have 5 services and 3 failure modes each, the event chain becomes hard to follow. Debugging a failed saga means reading logs across all services and reconstructing the sequence yourself.

Orchestration means a central service controls the flow. It tells each service what to do and when. It knows which step comes next and which service to call for rollback. The saga logic lives in one place.

I went with orchestration. The tradeoff: the orchestrator becomes a single point of coordination, but in return the saga is a clear state machine that's easy to debug and easy to extend.

The Architecture

My system has 5 services, each with its own database:

| Service | Port | Database | Role |
| --- | --- | --- | --- |
| order-service | 3000 | MongoDB | Creates orders, stores saga events |
| orchestrator | 8050 | (stateless) | Controls the saga flow |
| product-validation | 8090 | PostgreSQL | Validates product catalog |
| payment-service | 8091 | PostgreSQL | Processes payments |
| inventory-service | 8092 | PostgreSQL | Manages stock |

All communication goes through Kafka. The orchestrator publishes to service-specific topics. Each service does its work and publishes back to the orchestrator topic.

The Happy Path

When everything works, the flow looks like this:

```
Order Service → Orchestrator → Product Validation ✅ → Payment ✅ → Inventory ✅ → Finish
```
  1. A user creates an order via REST API on the order-service
  2. Order-service saves the order to MongoDB and publishes to start-saga
  3. Orchestrator picks it up and publishes to product-validation-success
  4. Product validation checks the catalog, publishes SUCCESS back to orchestrator
  5. Orchestrator publishes to payment-success
  6. Payment processes the charge, publishes SUCCESS back to orchestrator
  7. Orchestrator publishes to inventory-success
  8. Inventory reserves stock, publishes SUCCESS back to orchestrator
  9. Orchestrator publishes to finish-success and then notify-ending

Every step is a Kafka message. Every transition is logged. The order-service listens on notify-ending to update the final status.
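The routing above boils down to a lookup: the pair (source, status) determines the next topic. As a minimal sketch of that idea (the `SagaRouter` class, the `|`-keyed map, and the source names are my own illustration, not code from the repo; the topic names are the ones from the happy path):

```java
import java.util.Map;

// Hypothetical sketch: (source, status) -> next Kafka topic.
class SagaRouter {
    // Key format "source|status"; value is the topic the orchestrator publishes to next.
    private static final Map<String, String> TRANSITIONS = Map.of(
        "ORDER_SERVICE|SUCCESS",      "product-validation-success",
        "PRODUCT_VALIDATION|SUCCESS", "payment-success",
        "PAYMENT|SUCCESS",            "inventory-success",
        "INVENTORY|SUCCESS",          "finish-success"
    );

    static String nextTopic(String source, String status) {
        return TRANSITIONS.get(source + "|" + status);
    }
}
```

Because the table is data rather than branching logic, adding a step to the saga means adding a row, not rewriting a conditional.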

The Sad Path: Compensating Transactions

When payment fails (card declined, fraud blocked, amount too high), the orchestrator needs to undo product validation. When inventory fails (out of stock), it needs to undo both payment and product validation.

The rule is simple: on failure, roll back in reverse order. If step 3 fails, compensate steps 2 and 1.
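The reverse-order rule is naturally a stack: push each completed step, and on failure pop to get the compensation order. This is a sketch of the idea only (class and method names are mine, and the step names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: track completed steps and compensate last-in-first-out.
class CompensationChain {
    private final Deque<String> completed = new ArrayDeque<>();

    // Call after each forward step succeeds; newest step ends up on top.
    void markCompleted(String step) {
        completed.push(step);
    }

    // On failure: the order in which compensations should run (newest first).
    List<String> compensationOrder() {
        return new ArrayList<>(completed);
    }
}
```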

```
Payment FAIL → publish to payment-fail → Payment refunds
            → publish to product-validation-fail → Validation marks as failed
            → publish to finish-fail → Saga ends with FAIL status
```

Each service implements two operations: the forward action and the compensation. The payment-service has realizePayment() and realizeRefund(). The inventory-service has updateInventory() and rollbackInventory().
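The forward/compensation pairing can be captured in a small interface. This is my generalization of the pattern, not code from the repo; `PaymentStep` and its `charged` flag stand in for the real `realizePayment()`/`realizeRefund()` logic:

```java
// Sketch: every saga participant exposes a forward action and its undo.
interface SagaStep {
    void execute();    // forward action, e.g. realizePayment()
    void compensate(); // undo action, e.g. realizeRefund()
}

// Hypothetical participant; the boolean stands in for real payment state.
class PaymentStep implements SagaStep {
    boolean charged = false;

    @Override
    public void execute() {
        charged = true;  // stands in for realizePayment()
    }

    @Override
    public void compensate() {
        charged = false; // stands in for realizeRefund()
    }
}
```

The point of the pairing is that the orchestrator never needs to know *how* a service undoes its work, only that the undo exists and which topic triggers it.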

Creating an Order (the Starting Point)

Here's the actual REST endpoint that kicks off a saga:

```java
@PostMapping
public ResponseEntity<OrderDocument> createOrder(@Valid @RequestBody OrderRequest orderRequest) {
    var createdOrder = orderService.createOrder(orderRequest);
    return ResponseEntity.status(HttpStatus.CREATED).body(createdOrder);
}
```

The OrderService saves the order, creates an event, and publishes to Kafka:

```java
@Transactional
public OrderDocument createOrder(OrderRequest orderRequest) {
    var orderDocument = saveOrder(orderRequest);
    var eventDocument = createEventPayload(orderDocument);
    eventPublisherService.publish(eventDocument);
    return orderDocument;
}
```

The EventPublisherService serializes the event and sends it to the start-saga topic:

```java
public void publish(EventDocument eventDocument) {
    eventService.save(eventDocument);
    sagaProducer.sendEvent(serializeEvent(eventDocument));
}
```

From this point, the orchestrator takes over. The order-service doesn't know or care about product validation, payment, or inventory. It just publishes an event and waits for the final notification.

The Event Structure

Every message in the system follows the same Event structure:

```java
public class Event {
    private String eventId;
    private String transactionId;
    private String orderId;
    private Order order;
    private String source;
    private SagaStatusEnum status;        // SUCCESS, ROLLBACK, FAIL
    private List<History> eventHistory;
    private LocalDateTime createdAt;
}
```

The eventHistory list is key. Every service appends its result to this list. By the time the saga ends, you have a complete audit trail of what happened at each step, who did it, and when.

```java
public void addToHistory(History history) {
    if (eventHistory == null) {
        eventHistory = new ArrayList<>();
    }
    eventHistory.add(history);
}
```
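To make the audit trail concrete, here's a self-contained sketch of history accumulation. The `History` fields shown (source, status, message, createdAt) are my guess at a minimal shape, and `AuditedEvent` is a stand-in for the real Event class; the `addToHistory` logic mirrors the snippet above:

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal shape of one audit entry.
record History(String source, String status, String message, LocalDateTime createdAt) {}

// Stand-in for the real Event class, keeping only the history mechanics.
class AuditedEvent {
    private List<History> eventHistory;

    void addToHistory(History history) {
        if (eventHistory == null) {
            eventHistory = new ArrayList<>();
        }
        eventHistory.add(history);
    }

    List<History> history() {
        return eventHistory == null ? List.of() : eventHistory;
    }
}
```

Since each service appends in order, reading the list top to bottom replays the saga without grepping five sets of logs.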

What's Next

In the next post, I'll show the orchestrator itself: the state transition table that maps (source, status) to the next Kafka topic, the consumer that routes events, and how the whole thing stays deterministic even with concurrent sagas.

The repo is open source: github.com/pedrop3/saga-orchestration

