pponali
Circuit Breakers Under Stress: Anatomy of a Payment Cascade

A flash sale hit us at 10x baseline RPS. Within four minutes, our Payment Service circuit breaker tripped to OPEN, error rate climbed to 92%, and p99 latency on the payment path went from 200ms to 14.2 seconds. Here's the part nobody tells you on the conference circuit: the circuit breaker didn't fail. It worked exactly as designed. The failure was everywhere else.

This is a postmortem of what we saw, why Resilience4j's defaults weren't enough, and the four changes that made the next sale boring.

The setup

Standard Java microservices stack. Spring Cloud Gateway in front, JWT auth via Keycloak, Resilience4j wrapping every outbound call. Payment Service synchronously calls Stripe. Order Service synchronously calls Payment. PostgreSQL for orders, Redis for circuit breaker state, Kafka for the dead-letter queue.

Six services. Five circuit breakers. One very stressed thread pool.

What 10x RPS actually does

Baseline was around 1,000 RPS. The flash sale pushed us to 10,243. The edge layer absorbed it fine — NGINX did its job, the rate limiter degraded gracefully, the CDN cached anything cacheable. Spring Cloud Gateway routed cleanly.

The wheels came off at the Payment Service. Stripe's p99 latency under load climbed from a healthy 800ms to 14.2 seconds. That doesn't sound catastrophic until you do the math: every Payment thread now holds for ~14s instead of <1s. With a fixed thread pool, throughput collapses long before the breaker notices.
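The collapse is just Little's law: in-flight requests ≈ arrival rate × latency, so a fixed pool's throughput ceiling is pool size ÷ latency. A back-of-envelope sketch (the 200-thread pool size here is illustrative, not our production value):

```java
// Little's law applied to a fixed thread pool: if every call holds a thread
// for latencySec, the pool can sustain at most poolSize / latencySec requests/sec.
public class PoolMath {
    public static double maxThroughput(int poolSize, double latencySec) {
        return poolSize / latencySec;
    }

    public static void main(String[] args) {
        // Healthy downstream: 200 threads at 0.8s p99 -> ~250 RPS of headroom.
        System.out.println(maxThroughput(200, 0.8));
        // Degraded downstream: same pool at 14.2s p99 -> ~14 RPS. Everything else queues.
        System.out.println(maxThroughput(200, 14.2));
    }
}
```

At 14.2 seconds per call, a 200-thread pool tops out around 14 RPS, against 10,000 coming in. The breaker's threshold never even enters the equation; the pool is the bottleneck.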

# What we had — Resilience4j defaults, lightly tuned
resilience4j.circuitbreaker:
  instances:
    paymentService:
      failureRateThreshold: 50
      slidingWindowSize: 100
      slidingWindowType: COUNT_BASED
      waitDurationInOpenState: 30s
      permittedNumberOfCallsInHalfOpenState: 10

A 50% failure threshold over a 100-call window means the breaker needs the window full and at least 50 of the last 100 calls failing before it trips. At 10x load with timeouts, that's roughly four minutes of users staring at spinners. By the time the breaker opened, the thread pool was already 98% saturated.
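You can put a number on that wait. Setting aside minimumNumberOfCalls and window-fill time, the failures a COUNT_BASED breaker must record before it can open is just window size × threshold:

```java
// Minimum recorded failures before a COUNT_BASED circuit breaker can open,
// assuming the sliding window is already full. A simplification of the real
// evaluation, but it shows why window size dominates reaction time.
public class TripMath {
    public static int failuresToTrip(int windowSize, double failureRateThresholdPct) {
        return (int) Math.ceil(windowSize * failureRateThresholdPct / 100.0);
    }

    public static void main(String[] args) {
        System.out.println(failuresToTrip(100, 50)); // the config above: 50 failures
        System.out.println(failuresToTrip(20, 30));  // the smaller window we moved to: 6
    }
}
```

Each of those failures costs a full client timeout to register, which is where the minutes come from. The shrunken window in the fix below reacts an order of magnitude faster.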

The cascade, step by step

The order matters:

  1. Flash-sale spike hits the gateway at 10x RPS.
  2. Order Service synchronously calls Payment for every checkout.
  3. Stripe's p99 spikes to 14s under provider-side load.
  4. Payment Service threads block on those timeouts.
  5. failureRateThreshold=50% breached → Payment CB transitions to OPEN.
  6. Subsequent calls fail-fast → fallback handler enqueues "deferred order" responses to Kafka.
  7. Order Service's own CB opens, then transitions to HALF-OPEN after its wait duration, probing with limited concurrency.
  8. Bulkhead isolation prevents the cascade from reaching Inventory, Notifications, or User services.

Step 8 is the only reason this incident wasn't a full-platform outage. Without per-endpoint bulkheads, a slow Stripe would have eaten every thread in the gateway's pool, and User Service login requests would have queued behind dead Payment calls.

The state machine, practically

If you've only read the docs, the circuit breaker looks like a tidy three-state diagram. In production it's noisier:

// Resilience4j state transitions, simplified
CircuitBreaker cb = CircuitBreaker.of("paymentService", config);

cb.getEventPublisher()
  .onStateTransition(event -> {
      log.warn("CB {} : {} -> {}",
          event.getCircuitBreakerName(),
          event.getStateTransition().getFromState(),
          event.getStateTransition().getToState());
      meterRegistry.counter("cb.transition",
          "name", event.getCircuitBreakerName(),
          "to", event.getStateTransition().getToState().name()
      ).increment();
  });

That listener saved us during the postmortem. We could replay exactly when each breaker tripped, when probing started, and which trial calls failed. If you don't emit metrics on every state transition, you're flying blind.

The HALF-OPEN state is the dangerous one. Resilience4j permits a small number of trial calls; if the failure rate across those trials breaches the threshold, you slam back to OPEN for another waitDuration. Set the trial pool too low and a single unlucky probe keeps you cycling back to OPEN; set it too high and you'll hammer a still-broken downstream.
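The three states and the probe behavior fit in a toy model. This is a teaching sketch of the state machine, not Resilience4j's implementation (one deliberate simplification: it reopens on the first failed trial call, whereas Resilience4j evaluates the failure rate across all permitted trial calls):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy circuit breaker illustrating CLOSED -> OPEN -> HALF_OPEN transitions.
public class ToyBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50.0
    private final int permittedTrialCalls;
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private int trialCalls;
    private State state = State.CLOSED;

    public ToyBreaker(int windowSize, double failureRateThreshold, int permittedTrialCalls) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.permittedTrialCalls = permittedTrialCalls;
    }

    public State state() { return state; }

    // Called once the wait duration in OPEN has elapsed.
    public void waitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
            trialCalls = 0;
        }
    }

    public void record(boolean failure) {
        if (state == State.OPEN) return; // calls fail fast; nothing is recorded
        if (state == State.HALF_OPEN) {
            if (failure) { state = State.OPEN; return; } // slam back open
            if (++trialCalls >= permittedTrialCalls) {   // all trials passed
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        // CLOSED: sliding COUNT_BASED window
        window.addLast(failure);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        if (window.size() == windowSize
                && failures * 100.0 / windowSize >= failureRateThreshold) {
            state = State.OPEN;
        }
    }

    public static void main(String[] args) {
        ToyBreaker cb = new ToyBreaker(4, 50.0, 2);
        cb.record(true); cb.record(true); cb.record(true); cb.record(true);
        System.out.println(cb.state()); // OPEN
    }
}
```

Even the toy version makes the tuning tension visible: permittedTrialCalls is the only knob between "never recovers" and "re-floods a sick downstream."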

Four changes that fixed it

1. Tighter, faster breakers

We dropped the threshold and shrunk the window:

resilience4j.circuitbreaker:
  instances:
    paymentService:
      failureRateThreshold: 30          # was 50
      slowCallRateThreshold: 50          # NEW — slow calls also count
      slowCallDurationThreshold: 2s      # NEW
      slidingWindowSize: 20              # was 100
      minimumNumberOfCalls: 10
      waitDurationInOpenState: 15s       # was 30s
      permittedNumberOfCallsInHalfOpenState: 5

Two non-obvious knobs matter here. slowCallRateThreshold lets you trip on latency, not just errors — critical when a downstream is dying slowly rather than 500-ing. And the smaller window means the breaker reacts in seconds, not minutes.
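Back-of-envelope, the new config's reaction time is bounded by two things: gathering minimumNumberOfCalls samples, and waiting slowCallDurationThreshold before a call can be classified as slow. A rough lower bound (the per-instance RPS below is illustrative):

```java
// Rough lower bound on how fast a breaker can open: it needs minCalls samples,
// and a "slow" call only counts once it has run for slowCallSec. A heuristic,
// not Resilience4j's exact evaluation.
public class ReactionTime {
    public static double secondsToOpen(int minCalls, double perInstanceRps, double slowCallSec) {
        return Math.max(minCalls / perInstanceRps, slowCallSec);
    }

    public static void main(String[] args) {
        // New config: 10 samples at an assumed 50 RPS/instance, 2s slow-call cutoff.
        System.out.println(secondsToOpen(10, 50, 2.0)); // seconds, not minutes
    }
}
```

Under the new config the breaker can open in roughly two to three seconds, versus the four minutes of the original, and it trips on a slow-but-200 downstream that the old config would never have noticed.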

2. Per-endpoint bulkheads

A single thread pool for "Payment Service" is too coarse. Split by downstream:

@Bean
public ThreadPoolBulkhead stripeBulkhead() {
    ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(20)
        .coreThreadPoolSize(10)
        .queueCapacity(50)
        .keepAliveDuration(Duration.ofMillis(500))
        .build();
    return ThreadPoolBulkhead.of("stripe", config);
}

@Bean
public ThreadPoolBulkhead fraudBulkhead() {
    // Smaller — fraud is allowed to be slow, not allowed to starve payment
    return ThreadPoolBulkhead.of("fraud",
        ThreadPoolBulkheadConfig.custom()
            .maxThreadPoolSize(8)
            .coreThreadPoolSize(4)
            .build());
}

Now a slow fraud engine can't drain Stripe's threads, and vice versa. Bulkhead-per-dependency is more YAML, but it's the only way to guarantee isolation when one downstream misbehaves.
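The bounded queue is doing quiet work here: it converts unbounded pile-up into a bounded wait plus immediate rejection. A rough check with the Stripe bulkhead's numbers, treating every queued call as worst-case slow:

```java
// Approximate worst wait for a request that makes it into the bulkhead queue:
// queueCapacity items ahead of it, drained maxThreads at a time, each taking
// worstCallSec. Anything beyond the queue is rejected instantly instead of queuing.
public class BulkheadMath {
    public static double worstQueueWaitSec(int maxThreads, int queueCapacity, double worstCallSec) {
        return (double) queueCapacity / maxThreads * worstCallSec;
    }

    public static void main(String[] args) {
        // Stripe bulkhead above: 20 threads, queue of 50, 2s slow-call cutoff.
        System.out.println(worstQueueWaitSec(20, 50, 2.0)); // bounded, in seconds
    }
}
```

A bounded five-second worst case (under these assumptions) versus an unbounded queue is the whole argument for ThreadPoolBulkhead over a shared executor.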

3. Async outbox + Kafka retry

The synchronous Order → Payment → Stripe chain was the real sin. We moved Payment to an outbox pattern: orders write a payment intent to Postgres in the same transaction, a relay publishes to Kafka, and a worker calls Stripe asynchronously. The user gets an immediate "order placed" response; the charge happens within seconds, with retries handled by the consumer.

@Transactional
public Order placeOrder(OrderRequest req) throws JsonProcessingException {
    Order order = orderRepo.save(Order.from(req));
    outboxRepo.save(new OutboxEvent(
        "payment.charge.requested",
        order.getId(),
        // writeValueAsString throws a checked JsonProcessingException, so declare it
        objectMapper.writeValueAsString(req.payment())
    ));
    return order;  // returns in <50ms regardless of Stripe latency
}

Decoupling time-of-order from time-of-charge means a 14-second Stripe doesn't translate to a 14-second user experience. It also gives us natural retry and dead-lettering through Kafka, instead of bolting retry logic onto every caller.
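The relay loop itself is simple enough to model in a few lines. This is a toy in-memory sketch of the pattern, not our production relay (the real one reads unsent rows from Postgres and publishes to Kafka), but the publish-then-mark-sent ordering is the part that matters:

```java
import java.util.ArrayList;
import java.util.List;

// Toy outbox relay: events are written alongside the order, and only marked
// sent after a successful publish, so a crash between the two steps means a
// redelivery, never a lost charge.
public class OutboxRelay {
    public static final class OutboxEvent {
        final String type;
        final long orderId;
        boolean sent;
        public OutboxEvent(String type, long orderId) {
            this.type = type;
            this.orderId = orderId;
        }
    }

    // Drains unsent events to the broker; returns how many were published.
    public static int drain(List<OutboxEvent> outbox, List<String> broker) {
        int published = 0;
        for (OutboxEvent e : outbox) {
            if (e.sent) continue;                 // already delivered on a prior pass
            broker.add(e.type + ":" + e.orderId); // publish first...
            e.sent = true;                        // ...then mark sent
            published++;
        }
        return published;
    }

    public static void main(String[] args) {
        List<OutboxEvent> outbox = new ArrayList<>();
        List<String> broker = new ArrayList<>();
        outbox.add(new OutboxEvent("payment.charge.requested", 42));
        System.out.println(drain(outbox, broker)); // 1
        System.out.println(drain(outbox, broker)); // 0, idempotent on re-run
    }
}
```

Publish-then-mark gives at-least-once delivery: a crash between the two steps causes a duplicate, never a loss, which is why the Stripe worker on the consuming side must be idempotent (Stripe's idempotency keys cover this).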

4. HPA on RPS and queue depth

The Payment Service was scaled on CPU, which is useless when threads are blocked on I/O. We swapped to a custom Prometheus metric — RPS plus Kafka consumer lag — and let the HPA add pods when the queue grew faster than it drained. CPU never crossed 40% during the incident; if we'd been watching the right signal, we'd have scaled out three minutes earlier.
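The HPA's core rule makes that failure mode concrete: desiredReplicas = ceil(currentReplicas × currentMetric ÷ targetMetric). Plugging in illustrative targets shows why the CPU signal was worse than useless:

```java
// The Kubernetes HPA scaling rule. Target values below are illustrative.
public class HpaMath {
    public static int desiredReplicas(int currentReplicas, double currentMetric, double targetMetric) {
        return (int) Math.ceil(currentReplicas * currentMetric / targetMetric);
    }

    public static void main(String[] args) {
        // CPU signal: I/O-blocked pods at 40% against an 80% target -> pressure to HALVE the fleet.
        System.out.println(desiredReplicas(10, 40, 80));
        // RPS signal: ~10,240 RPS against an assumed 100-RPS-per-pod target -> scale out hard.
        System.out.println(desiredReplicas(10, 1024, 100));
    }
}
```

Same incident, two signals: one tells the autoscaler to shrink the fleet while users time out, the other scales it tenfold. Pick the signal that moves with load, not with CPU.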

What I'd tell past me

The circuit breaker is a fire alarm, not a fire suppression system. By the time it trips, you've already had a fire for a while. The real defenses are the things that stop the fire from starting: bulkhead isolation per downstream, slow-call detection, async boundaries on anything you don't fully control, and autoscaling on signals that actually correlate with load.

Resilience4j is excellent. The defaults are not your friend in production.

Takeaways

If you take three things from this:

  • Trip on latency, not just errors. slowCallRateThreshold is the most underused knob in Resilience4j.
  • One bulkhead per downstream, always. Coarse pools will betray you the moment two dependencies fail differently.
  • Synchronous chains across third-party APIs are tech debt. An outbox + queue is more code, but it's the difference between a postmortem and an incident report.

The next flash sale ran 12x baseline. Payment p99 stayed under 600ms. Nobody paged.
