Testing Sagas with Real Failure Scenarios

#java #microservices #testing #springboot

In the previous post, I walked through the compensation logic in each service. The code looks clean on paper. But sagas have a lot of moving parts, and bugs tend to hide in the transitions between services, not inside a single service.

This post covers how I test the saga system: unit tests for each service, orchestrator routing tests, and the edge cases that caught me off guard.

Testing the Orchestrator Routing

The orchestrator's state transition table is the most critical piece. If it routes to the wrong topic, the entire saga breaks. I test every (source, status) combination. Three representative cases:

@Test
void shouldReturnNextTopicGivenValidSourceAndSuccessStatus() {
    setEvent(PAYMENT_SERVICE.toString(), SUCCESS);

    TopicsEnum topic = sagaExecutionController.getNextTopic(event);

    assertEquals(INVENTORY_SUCCESS, topic);
}

@Test
void shouldReturnFailTopicGivenValidSourceAndFailStatus() {
    setEvent(PAYMENT_SERVICE.toString(), FAIL);

    TopicsEnum topic = sagaExecutionController.getNextTopic(event);

    assertEquals(PRODUCT_VALIDATION_FAIL, topic);
}

@Test
void shouldReturnRollbackTopic() {
    setEvent(PRODUCT_VALIDATION_SERVICE.toString(), ROLLBACK);

    TopicsEnum topic = sagaExecutionController.getNextTopic(event);

    assertEquals(PRODUCT_VALIDATION_FAIL, topic);
}

These tests are fast and deterministic. No Kafka, no databases. Just the lookup logic. If someone adds a new service to the saga and forgets to update the table, the test for that (source, status) pair fails with a "Topic not found!" exception.
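For readers who skipped the previous post, the kind of lookup these tests exercise can be sketched in plain Java. This is a minimal stand-in, not the repo's SagaExecutionController: the `RoutingSketch` class, its nested enums, and the string-keyed map are assumptions for illustration, only the three rows implied by the tests above are filled in, and `IllegalArgumentException` replaces the project's `ValidationException` to keep the block self-contained.

```java
import java.util.Map;

// Hypothetical stand-in for the routing lookup; not the repo's SagaExecutionController.
class RoutingSketch {

    enum Source { PRODUCT_VALIDATION_SERVICE, PAYMENT_SERVICE }
    enum Status { SUCCESS, ROLLBACK, FAIL, TIMEOUT }
    enum Topic { INVENTORY_SUCCESS, PRODUCT_VALIDATION_FAIL }

    // Partial table keyed by "SOURCE:STATUS". Only the three rows implied
    // by the tests above are filled in; a real table has more.
    static final Map<String, Topic> TABLE = Map.of(
        key(Source.PAYMENT_SERVICE, Status.SUCCESS), Topic.INVENTORY_SUCCESS,
        key(Source.PAYMENT_SERVICE, Status.FAIL), Topic.PRODUCT_VALIDATION_FAIL,
        key(Source.PRODUCT_VALIDATION_SERVICE, Status.ROLLBACK), Topic.PRODUCT_VALIDATION_FAIL
    );

    static String key(Source source, Status status) {
        return source + ":" + status;
    }

    // Pure lookup: the same inputs always give the same topic, and no I/O is involved.
    static Topic nextTopic(Source source, Status status) {
        if (source == null || status == null) {
            throw new IllegalArgumentException("Source and status must be informed.");
        }
        Topic topic = TABLE.get(key(source, status));
        if (topic == null) {
            throw new IllegalArgumentException("Topic not found!");
        }
        return topic;
    }

    public static void main(String[] args) {
        System.out.println(nextTopic(Source.PAYMENT_SERVICE, Status.SUCCESS));
        System.out.println(nextTopic(Source.PAYMENT_SERVICE, Status.FAIL));
    }
}
```

Because the lookup has no collaborators, a test needs nothing beyond the inputs and the expected topic, which is why this layer runs without Kafka or a database.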

Edge Cases in Routing

Two cases caught me out early on:

@Test
void shouldThrowValidationExceptionWhenSourceIsNull() {
    setEvent(null, SUCCESS);

    ValidationException ex = assertThrows(ValidationException.class, () -> {
        sagaExecutionController.getNextTopic(event);
    });

    assertEquals("Source and status must be informed.", ex.getMessage());
}

@Test
void shouldThrowValidationExceptionWhenTopicNotFound() {
    setEvent(PAYMENT_SERVICE.toString(), TIMEOUT);

    ValidationException ex = assertThrows(ValidationException.class, () -> {
        sagaExecutionController.getNextTopic(event);
    });

    assertEquals("Topic not found!", ex.getMessage());
}

The TIMEOUT status exists in the enum but has no mapping in the saga table. If the lookup returned null instead of throwing, a timeout event would silently disappear. The exception makes it visible immediately, and this test pins that behavior down.
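One way to find gaps like (PAYMENT_SERVICE, TIMEOUT) before they show up at runtime is to enumerate every combination and list the ones with no route. The sketch below is not from the repo: `RoutingCoverageSketch`, its trimmed-down enums, and the three-entry `MAPPED` set are assumptions for illustration. In a real suite the same loop could feed a parameterized test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical coverage check over a routing table; names are illustrative only.
class RoutingCoverageSketch {

    enum Source { PRODUCT_VALIDATION_SERVICE, PAYMENT_SERVICE }
    enum Status { SUCCESS, ROLLBACK, FAIL, TIMEOUT }

    // Keys of an illustrative, partial routing table ("SOURCE:STATUS").
    static final Set<String> MAPPED = Set.of(
        "PAYMENT_SERVICE:SUCCESS",
        "PAYMENT_SERVICE:FAIL",
        "PRODUCT_VALIDATION_SERVICE:ROLLBACK"
    );

    // Walks every (source, status) combination and collects the ones with no route.
    static List<String> unmappedPairs(Set<String> mappedKeys) {
        List<String> missing = new ArrayList<>();
        for (Source source : Source.values()) {
            for (Status status : Status.values()) {
                String key = source + ":" + status;
                if (!mappedKeys.contains(key)) {
                    missing.add(key);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        unmappedPairs(MAPPED).forEach(System.out::println);
    }
}
```

Each entry in the output is either a deliberate gap that deserves its own "throws" test, or a row someone forgot to add.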

Testing the OrchestrationService

The orchestration layer adds history entries and publishes to Kafka. I mock the producer and verify the correct topic:

@Test
void shouldStartSagaSuccessfully() {
    when(sagaExecutionController.getNextTopic(event))
        .thenReturn(TopicsEnum.PRODUCT_VALIDATION_SUCCESS);

    orchestrationService.startSaga(event);

    verify(producer).sendEvent(eq("product-validation-success"), eq("{json}"));
    assertEquals("ORCHESTRATOR", event.getSource());
    assertEquals(SUCCESS, event.getStatus());
    assertTrue(event.getEventHistory().stream()
        .anyMatch(h -> h.getMessage().contains("Saga started")));
}

@Test
void shouldFinishSagaWithFailure() {
    orchestrationService.finishSagaFail(event);

    verify(producer).sendEvent(eq("notify-ending"), eq("{json}"));
    assertEquals(FAIL, event.getStatus());
    assertTrue(event.getEventHistory().stream()
        .anyMatch(h -> h.getMessage().contains("with errors")));
}

The history assertion is important. It verifies that each step leaves a trace. If a saga fails and the history is empty, debugging becomes guesswork.
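If you'd rather inspect the published payloads directly than go through Mockito's `verify`, a hand-written recording fake does the same job. The sketch below is a miniature built on assumptions, not the repo's OrchestrationService: `RecordingProducer`, `MiniEvent`, and this `finishSagaFail` are hypothetical stand-ins, and the exact history wording is invented. Only the `notify-ending` topic name and the "with errors" fragment come from the test above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the orchestration layer with a recording fake producer.
class FakeProducerSketch {

    // Records every (topic, payload) pair instead of talking to Kafka.
    static class RecordingProducer {
        final List<String[]> sent = new ArrayList<>();

        void sendEvent(String topic, String payload) {
            sent.add(new String[] {topic, payload});
        }
    }

    // Stand-in for the saga event: source, status, and a history trail.
    static class MiniEvent {
        String source;
        String status;
        final List<String> history = new ArrayList<>();
    }

    // Sketch of a "finish with failure" step: set status, leave a trace, publish.
    static void finishSagaFail(MiniEvent event, RecordingProducer producer) {
        event.source = "ORCHESTRATOR";
        event.status = "FAIL";
        event.history.add("Saga finished with errors!"); // assumed wording
        producer.sendEvent("notify-ending", "status=" + event.status);
    }

    public static void main(String[] args) {
        RecordingProducer producer = new RecordingProducer();
        MiniEvent event = new MiniEvent();
        finishSagaFail(event, producer);
        System.out.println(producer.sent.get(0)[0] + " <- " + event.history);
    }
}
```

The trade-off is the usual one: the fake is a few more lines to maintain, but a failing assertion can print exactly what was sent.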

Testing Payment: The Happy and Sad Paths

The payment-service has the most complex logic. It validates amounts, checks fraud scores, simulates gateway responses, and handles refunds. Here's how I test the main scenarios:

Payment Success

@Test
void shouldRealizePaymentSuccessfully_givenValidOrderAndAmount() {
    givenNoExistingPayment();
    givenPaymentFound();
    givenJsonSerialization();

    paymentService.realizePayment(event);

    assertEquals(SUCCESS, event.getStatus());
    assertEquals("PAYMENT_SERVICE", event.getSource());
    assertEquals(20.0, event.getOrder().getTotalAmount());
    assertHistoryContains("Payment realized successfully");
    verify(producer).sendEvent("{json}");
}

Amount Below Minimum

@Test
void shouldRollback_givenAmountIsLessThanMinimum() {
    event = buildEvent(0.0, 1);     // unit value = 0.0
    payment = buildPayment(0.0, 1);
    givenNoExistingPayment();
    givenPaymentFound();
    givenJsonSerialization();

    paymentService.realizePayment(event);

    assertEquals(ROLLBACK, event.getStatus());
    assertHistoryContains("minimal amount");
}

Duplicate Transaction

@Test
void shouldRollback_givenTransactionAlreadyExists() {
    when(paymentRepository.existsByOrderIdAndTransactionId(any(), any()))
        .thenReturn(true);
    givenJsonSerialization();

    paymentService.realizePayment(event);

    assertEquals(ROLLBACK, event.getStatus());
    assertHistoryContains("transactionId");
}
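Both rollback cases above boil down to guard checks that run before any payment is recorded. Here is a minimal sketch of that ordering, with the caveat that everything in it is assumed: `PaymentGuardSketch`, the in-memory `processed` set standing in for `existsByOrderIdAndTransactionId`, the `MIN_AMOUNT` of 0.1, and the history wording are illustrative and not taken from the repo.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the forward-path guards in a payment step.
class PaymentGuardSketch {

    static final double MIN_AMOUNT = 0.1; // assumed threshold for illustration

    // Stand-in for the repository's exists-by-order-and-transaction check.
    private final Set<String> processed = new HashSet<>();
    final List<String> history = new ArrayList<>();

    // Returns the status the service would publish for this attempt.
    String realizePayment(String orderId, String transactionId, double totalAmount) {
        String key = orderId + "/" + transactionId;
        if (processed.contains(key)) {
            history.add("Duplicate orderId and transactionId"); // assumed wording
            return "ROLLBACK";
        }
        if (totalAmount < MIN_AMOUNT) {
            history.add("Below the minimal amount of " + MIN_AMOUNT); // assumed wording
            return "ROLLBACK";
        }
        processed.add(key); // only successful payments are remembered
        history.add("Payment realized successfully");
        return "SUCCESS";
    }

    public static void main(String[] args) {
        PaymentGuardSketch service = new PaymentGuardSketch();
        System.out.println(service.realizePayment("order-1", "tx-123", 20.0));
        System.out.println(service.realizePayment("order-1", "tx-123", 20.0));
        System.out.println(service.realizePayment("order-2", "tx-456", 0.0));
    }
}
```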

Refund (Compensation)

@Test
void shouldRealizeRefund_whenPaymentExists() {
    when(paymentRepository.findByOrderIdAndTransactionId(any(), any()))
        .thenReturn(Optional.of(payment));
    givenJsonSerialization();

    paymentService.realizeRefund(event);

    assertEquals(FAIL, event.getStatus());
    assertEquals(PaymentStatus.REFUND, payment.getStatus());
    assertHistoryContains("Rollback executed for payment");
    verify(paymentRepository).save(payment);
}

Refund Failure (When the Compensation Itself Fails)

This is the tricky one. What if the refund itself fails? The payment-service still publishes FAIL so the saga can continue rolling back. It just records in the event history that the refund didn't execute:

@Test
void shouldHandleRefundFailureGracefully_whenPaymentNotFound() {
    // Simulate a failing lookup: the repository throws instead of returning a payment.
    when(paymentRepository.findByOrderIdAndTransactionId(any(), any()))
        .thenThrow(new RuntimeException("DB error"));
    givenJsonSerialization();

    paymentService.realizeRefund(event);

    assertEquals(FAIL, event.getStatus());
    assertHistoryContains("Rollback not executed for payment");
    verify(producer).sendEvent("{json}");
}

The saga doesn't get stuck. The refund failure is recorded in the history for manual intervention later.
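The behavior this test pins down amounts to a catch-all around the refund that records the failure and publishes FAIL regardless. A rough sketch follows. It assumes a hypothetical `PaymentStore` interface in place of the Spring Data repository, plain strings for status and payload, and a `published` list in place of the Kafka producer; none of these names come from the repo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a compensation step that never blocks the saga.
class RefundSketch {

    // Stand-in for the payment repository.
    interface PaymentStore {
        Optional<String> findPayment(String orderId, String transactionId);
    }

    final List<String> history = new ArrayList<>();
    final List<String> published = new ArrayList<>(); // stand-in for the producer
    String status;

    void realizeRefund(PaymentStore store, String orderId, String transactionId) {
        status = "FAIL"; // a compensation step always reports FAIL upstream
        try {
            String payment = store.findPayment(orderId, transactionId)
                .orElseThrow(() -> new IllegalStateException("Payment not found"));
            history.add("Rollback executed for payment " + payment);
        } catch (RuntimeException ex) {
            // The refund did not happen; leave a trace for manual follow-up.
            history.add("Rollback not executed for payment: " + ex.getMessage());
        }
        published.add(status); // published either way, so the saga keeps moving
    }

    public static void main(String[] args) {
        RefundSketch sketch = new RefundSketch();
        sketch.realizeRefund((o, t) -> { throw new RuntimeException("DB error"); },
            "order-1", "tx-123");
        System.out.println(sketch.status + " " + sketch.history);
    }
}
```

The publish sits outside the try block on purpose: whatever happens to the refund, the next service in the rollback chain still gets its event.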

Testing Inventory Rollback

The inventory tests follow the same pattern. The interesting case is restoring stock to its previous value:

@Test
void shouldRollbackInventorySuccessfully() {
    OrderInventory orderInventory = OrderInventory.builder()
        .inventory(inventory)
        .oldQuantity(10)
        .newQuantity(5)
        .orderId("order-1")
        .transactionId("tx-123")
        .build();

    when(orderInventoryRepository.findByOrderIdAndTransactionId("order-1", "tx-123"))
        .thenReturn(List.of(orderInventory));

    inventoryService.rollbackInventory(event);

    assertEquals(FAIL, event.getStatus());
    assertEquals(10, inventory.getAvailable());  // restored to old value
    assertHistoryContains("Rollback executed for inventory");
}

The oldQuantity was 10, the forward action reduced it to 5, and the rollback restores it to 10. Without the OrderInventory record that saves both values, this rollback would be impossible.
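The mechanics fit in a few lines. The classes below are simplified stand-ins (hypothetical `Stock` and `StockChange`, with a `reserve` method I made up for the forward step), not the repo's entities. They only show that the forward step stores the pre-change quantity, so the rollback has something to restore.

```java
// Hypothetical sketch of snapshot-based inventory rollback.
class InventoryRollbackSketch {

    static class Stock {
        int available;

        Stock(int available) {
            this.available = available;
        }
    }

    // Snapshot written by the forward step; mirrors oldQuantity/newQuantity.
    static class StockChange {
        final Stock stock;
        final int oldQuantity;
        final int newQuantity;

        StockChange(Stock stock, int oldQuantity, int newQuantity) {
            this.stock = stock;
            this.oldQuantity = oldQuantity;
            this.newQuantity = newQuantity;
        }
    }

    // Forward action: record both values, then apply the new one.
    static StockChange reserve(Stock stock, int quantity) {
        StockChange change =
            new StockChange(stock, stock.available, stock.available - quantity);
        stock.available = change.newQuantity;
        return change;
    }

    // Compensation: put back exactly what the snapshot says was there before.
    static void rollback(StockChange change) {
        change.stock.available = change.oldQuantity;
    }

    public static void main(String[] args) {
        Stock stock = new Stock(10);
        StockChange change = reserve(stock, 5);
        System.out.println("after reserve: " + stock.available);
        rollback(change);
        System.out.println("after rollback: " + stock.available);
    }
}
```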

A Helper That Saves Time

I use the same assertion helper across all service tests:

private void assertHistoryContains(String expectedMessage) {
    assertTrue(event.getEventHistory().stream()
        .anyMatch(h -> h.getMessage().toLowerCase()
            .contains(expectedMessage.toLowerCase())),
        "Expected message not found in history: " + expectedMessage);
}

This checks that the service added the right message to the event history. Every service test verifies both the status AND the history. The status controls the saga flow. The history tells you why.

What I'd Do Differently

Looking back, there are a few things I'd add:

Integration tests with embedded Kafka. The unit tests mock the producer, so they don't catch serialization bugs or topic misconfiguration. An embedded Kafka setup would let me publish a real event and verify the full chain.

Testcontainers for the databases. The unit tests mock the repositories. A Testcontainers setup with real PostgreSQL and MongoDB would catch schema issues and migration bugs.

Chaos testing. Kill a service mid-saga and verify recovery. Introduce network delays between services. These are the scenarios that break sagas in production, and they're hard to test with mocks alone.

These are on the roadmap. For now, the unit tests cover the routing logic and compensation flows well enough to catch regressions.

Wrapping Up

The saga orchestrator pattern works because each piece is testable in isolation. The state transition lookup is a pure function. Each service's forward and compensation logic can be tested with mocked dependencies. The event history gives you a built-in audit trail.

The full test suite runs in seconds because nothing touches real infrastructure. That's the payoff of keeping the orchestrator stateless and the services decoupled.

The repo (with all tests): github.com/pedrop3/saga-orchestration