Why is REST a good starting point?
When starting with microservices, REST is often the go-to approach, and for good reason.
- Simplicity: Teams can set up communication between services quickly, without standing up complex infrastructure.
- Faster debugging: REST is widely understood, making it easy for developers to jump in, read each other's code, and debug using familiar tools like Postman.
- Faster feature delivery: For smaller teams, this means shipping more features without the operational overhead of maintaining messaging systems or brokers.
The hidden dangers of naive REST communication
At first glance, REST-based communication between services feels straightforward. But as the system grows in complexity and traffic, a naive implementation can quickly become a bottleneck.
- Synchronous dependency chains: When services rely on each other through direct REST calls, a single request can trigger a long chain of downstream calls. This increases the overall response time and tightly couples services together.
- Failure propagation: If one service becomes slow or unavailable, every service depending on it may also start failing or slowing down, leading to cascading failures throughout the system.
- Long-tail latency: With multiple request hops between services, even a small delay in one service can accumulate and cause significant end-to-end latency, degrading the user experience.
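To make these risks concrete, here is a minimal sketch of a naive synchronous chain. The service names and URLs are hypothetical; the point is that each downstream call blocks the calling thread, so latencies add up and any failure propagates upward.

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class OrderService {

    private final RestTemplate rest = new RestTemplate();

    public String placeOrder(String orderId) {
        // Each call is synchronous: the next hop starts only after the previous one returns.
        String stock = rest.getForObject("http://inventory-service/reserve/" + orderId, String.class);
        String payment = rest.getForObject("http://payment-service/charge/" + orderId, String.class);
        String shipping = rest.getForObject("http://shipping-service/schedule/" + orderId, String.class);
        // If any hop is slow or throws, the whole request is slow or fails with it.
        return String.join(", ", stock, payment, shipping);
    }
}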
Introducing Resilience Patterns
Introducing resilience patterns can make your system more fault-tolerant and stable.
🔁 Retry
In distributed systems, transient failures are common: a service might briefly fail due to network issues, load spikes, or resource contention. A retry mechanism allows the calling service to attempt the failed request again after a short delay, increasing the chances of eventual success.
- Exponential backoff: Rather than retrying immediately or at fixed intervals, exponential backoff gradually increases the wait time between attempts. This reduces pressure on overloaded services and gives them a chance to recover.
- Jitter: Adding randomness (jitter) to retry intervals helps avoid the thundering herd problem, where multiple services retry at the same time and cause another spike in traffic. (A configuration sketch for both follows the example below.)
import io.github.resilience4j.retry.*;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.concurrent.Callable;

@Service
public class PaymentService {

    // Retry up to 3 attempts (including the first call), waiting 500 ms
    // between attempts, and only for RuntimeExceptions (treated as transient).
    private final Retry retry = Retry.of("paymentRetry",
        RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .retryExceptions(RuntimeException.class)
            .build()
    );

    public String processPayment() {
        Callable<String> decorated = Retry
            .decorateCallable(retry, this::callPaymentGateway);
        try {
            return decorated.call();
        } catch (Exception ex) {
            return "❌ All retries failed!";
        }
    }

    private String callPaymentGateway() {
        // Simulate a flaky gateway that fails 80% of the time
        if (Math.random() < 0.8) throw new RuntimeException("Gateway timeout");
        return "✅ Payment processed";
    }
}
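The configuration above retries at a fixed 500 ms interval. To apply the exponential backoff and jitter described earlier, Resilience4j provides IntervalFunction. A minimal sketch, assuming the resilience4j-core module is on the classpath (the attempt count and intervals are illustrative):

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

// Waits grow roughly 500 ms -> 1 s -> 2 s, each randomized by up to ±50% (jitter),
// so retrying callers don't all hit the recovering service at the same moment.
RetryConfig backoffConfig = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500), // initial wait
        2.0,                    // multiplier per attempt
        0.5))                   // randomization factor (jitter)
    .retryExceptions(RuntimeException.class)
    .build();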
⚡️ Circuit Breaker
What if the downstream service isn’t recovering quickly? Constant retries could make things worse. A circuit breaker monitors the failure rate of requests and temporarily stops calls to a failing service once a threshold is breached.
- Open state: When failures cross a defined threshold, the circuit “opens,” blocking further calls and failing fast. This prevents overloading the unhealthy service.
- Half-open state: After a timeout, the circuit breaker allows a few test requests through to check whether the service has recovered.
- Closed state: If the test requests succeed, the circuit closes and normal traffic resumes.
Failing fast this way keeps callers from waiting on a service that is already known to be unhealthy.
import io.github.resilience4j.circuitbreaker.*;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.concurrent.Callable;

@Service
public class ExternalApiService {

    // Open the circuit when 50% of the last 10 calls fail,
    // then wait 5 seconds before allowing test requests (half-open).
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("externalApiCB",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50) // % of failures to open the circuit
            .waitDurationInOpenState(Duration.ofSeconds(5))
            .slidingWindowSize(10)
            .build()
    );

    public String fetchData() {
        Callable<String> decorated = CircuitBreaker
            .decorateCallable(circuitBreaker, this::callExternalApi);
        try {
            return decorated.call();
        } catch (CallNotPermittedException ex) {
            return "⛔ Circuit is open. Using fallback!";
        } catch (Exception ex) {
            return "❌ API call failed!";
        }
    }

    private String callExternalApi() {
        // Simulate a flaky external service that fails 70% of the time
        if (Math.random() < 0.7) throw new RuntimeException("External service error");
        return "✅ Data from external API";
    }
}
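To watch the state machine in action, Resilience4j exposes an event publisher on each breaker. A minimal sketch, which you could place, for example, in the constructor of ExternalApiService above (the log output is illustrative):

// Log every CLOSED -> OPEN -> HALF_OPEN -> CLOSED transition.
circuitBreaker.getEventPublisher()
    .onStateTransition(event ->
        System.out.println("externalApiCB: " + event.getStateTransition()));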
🧵 Bulkhead
As systems scale, a failure in one part of your application can unexpectedly affect other unrelated areas. The bulkhead pattern helps isolate such failures so they don’t take down the whole system.
What is it? Inspired by ship design, where compartments are isolated to prevent flooding, the bulkhead pattern in software creates isolated resource pools, such as separate thread pools, for different parts of the system.
Why it’s needed: Without bulkheads, a spike in a heavy task (like report generation) can exhaust shared resources like threads or memory, causing even lightweight, critical requests (like health checks or profile lookups) to fail or slow down.
Example:
- /generate-report uses a lot of CPU and threads
- /get-profile is lightweight and fast
- Without bulkheads, if /generate-report overloads the system, it can delay or block /get-profile
- With bulkheads, they run in separate thread pools, so one won’t affect the other
Types of isolation:
- Thread pool isolation: Allocate different thread pools per endpoint or functionality (see the ThreadPoolBulkhead sketch after the code example below)
- Queue isolation: Use separate queues for different request types
- Instance isolation: Scale different parts of the service independently if needed
Benefit:
The system degrades gracefully: even if one part is under pressure, the rest continues to work.
import io.github.resilience4j.bulkhead.*;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.concurrent.Callable;

@Service
public class ReportService {

    // Allow at most 2 concurrent report generations; callers wait
    // up to 500 ms for a free slot before being rejected.
    private final Bulkhead bulkhead = Bulkhead.of("generateReportBulkhead",
        BulkheadConfig.custom()
            .maxConcurrentCalls(2)
            .maxWaitDuration(Duration.ofMillis(500))
            .build()
    );

    public String generateReport() {
        Callable<String> decorated = Bulkhead
            .decorateCallable(bulkhead, () -> {
                simulateHeavyComputation();
                return "✅ Report generated!";
            });
        try {
            return decorated.call();
        } catch (BulkheadFullException ex) {
            return "🚫 System busy. Try again later.";
        } catch (Exception e) {
            return "❌ Unexpected error.";
        }
    }

    private void simulateHeavyComputation() throws InterruptedException {
        Thread.sleep(2000); // simulate heavy work
    }
}
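The example above uses Resilience4j's semaphore-based bulkhead, which simply caps concurrent calls on the caller's own threads. For the thread pool isolation described earlier, the library also offers ThreadPoolBulkhead, which runs each workload on its own dedicated pool. A minimal sketch (the pool sizes are illustrative):

import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.util.concurrent.CompletionStage;

// Reports run on their own small pool; exhausting it cannot starve
// the threads serving lightweight endpoints like /get-profile.
ThreadPoolBulkhead reportPool = ThreadPoolBulkhead.of("reportPool",
    ThreadPoolBulkheadConfig.custom()
        .coreThreadPoolSize(2)
        .maxThreadPoolSize(4)
        .queueCapacity(10) // overflow requests are rejected instead of queueing forever
        .build()
);

// The task executes asynchronously on the bulkhead's dedicated pool.
CompletionStage<String> result = reportPool.executeSupplier(() -> "✅ Report generated!");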
Trade-offs & When to Move On
While resilience patterns like retry, circuit breaker, and bulkhead significantly improve fault tolerance, they don’t address one fundamental issue: tight coupling between services.
- REST still couples services tightly: Each service needs to know exactly who to call and when. This direct dependency makes changes harder and increases coordination overhead between teams.
- Error recovery is limited: You can retry, back off, or fail fast, but you're still reacting to failures. There’s no natural way to defer processing or recover asynchronously.
- Still vulnerable to partial failures and latency: A chain of REST calls means that if one link slows down, the entire request suffers. You’re still bound by the slowest service in the chain.
Eventually, as the number of services and their interactions grow, the limitations of synchronous REST become more obvious. That’s when teams start looking for asynchronous, event-driven alternatives.
Up Next: Decoupling with Choreography
What if services didn’t need to know about each other directly?
What if failures didn’t ripple through the system immediately?
What if we could decouple communication and build more autonomous services?
In Part 2, we’ll explore choreography, where services react to events instead of calling each other directly, unlocking a new level of flexibility, resilience, and scalability.