Building microservices is a bit like organizing a large, busy kitchen. You have different stations for prep, cooking, baking, and plating. Normally, this works beautifully. But what happens when the grill overheats and catches fire? Without proper safeguards, that fire could spread to the entire kitchen, burning down your whole operation. In software terms, that's a cascading failure: one sick service takes down the entire system.
The solution is not to prevent all fires—that's impossible—but to build smart fire doors and isolated compartments. In our microservices kitchen, we use two main tools for this: the circuit breaker and the bulkhead pattern. Let me show you how to build these in Go, step by step, with practical code you can use.
I'll write this as if we're working together on a project. We'll start simple, then add layers of protection. Think of it as building a safety net, one strand at a time.
First, let's understand the core problem. A microservice calls another microservice. If the second service is slow or broken, the first one might wait forever, using up a valuable thread or connection. Soon, all its resources are stuck waiting, it can't handle new requests, and it becomes sick too. The sickness spreads.
A circuit breaker is our first line of defense. It's not a complex idea. It watches the calls to a service. If too many calls fail, it "trips" and stops all new calls for a while. It gives the sick service time to recover, and it prevents our service from wasting resources. After a timeout, it cautiously lets a few test calls through. If they succeed, it closes the circuit again and business resumes. If they fail, the timeout resets.
Here is a basic version of this concept in Go. We'll create a CircuitBreaker struct. It can be in one of three states: Closed (everything's fine, calls go through), Open (the circuit is tripped, calls are blocked immediately), or HalfOpen (we're testing to see if things are fixed).
type CircuitBreaker struct {
	mu               sync.RWMutex
	state            CircuitState
	failureCount     int32
	threshold        int32
	resetTimeout     time.Duration
	lastFailure      time.Time
	testRequestCount int32 // probes allowed so far while half-open
}

type CircuitState int

const (
	Closed CircuitState = iota
	Open
	HalfOpen
)
The Execute method is where the logic lives. It first asks, "Am I allowed to make a request?" based on my current state. If not, it instantly returns an error. This is the "fast failure" that protects our system. If allowed, it runs the actual operation (like an HTTP call) and records whether it succeeded or failed.
func (cb *CircuitBreaker) Execute(fn func() error) error {
	if !cb.allowRequest() {
		return errors.New("circuit breaker is open")
	}
	err := fn()
	cb.recordResult(err)
	return err
}
The allowRequest method contains the state machine logic. If the state is Closed, it always allows the request. If it's Open, it checks how long it's been open. If enough time has passed (the resetTimeout), it moves to HalfOpen and allows a request—this is our test probe. In HalfOpen state, we might limit the number of test probes to, say, five.
func (cb *CircuitBreaker) allowRequest() bool {
	cb.mu.Lock() // full lock: we may change state below
	defer cb.mu.Unlock()
	switch cb.state {
	case Closed:
		return true
	case Open:
		if time.Since(cb.lastFailure) > cb.resetTimeout {
			// Enough quiet time has passed: move to HalfOpen and let a probe through.
			cb.state = HalfOpen
			cb.testRequestCount = 1
			return true
		}
		return false
	case HalfOpen:
		// Only allow a few test requests (five here, as discussed above).
		if cb.testRequestCount < 5 {
			cb.testRequestCount++
			return true
		}
		return false
	}
	return false
}
The recordResult method updates the counts. A successful call in the HalfOpen state might be enough to reset the breaker back to Closed. A failure in the HalfOpen state immediately sends it back to Open. A failure in the Closed state increments the failure count; if we pass a threshold (like 5 failures in a row), we trip the breaker to Open.
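Here's a minimal sketch of that bookkeeping, matching the struct above. Treating a single successful probe as enough to close the breaker is one reasonable choice; you could just as well require several.

func (cb *CircuitBreaker) recordResult(err error) {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	if err != nil {
		cb.failureCount++
		cb.lastFailure = time.Now()
		// A failed probe, or too many failures in a row, trips the breaker.
		if cb.state == HalfOpen || cb.failureCount >= cb.threshold {
			cb.state = Open
		}
		return
	}

	// Success: a successful probe closes the circuit; any success clears the failure streak.
	if cb.state == HalfOpen {
		cb.state = Closed
	}
	cb.failureCount = 0
	cb.testRequestCount = 0
}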
This simple mechanism is incredibly powerful. I've seen it turn a total system outage into a minor, isolated blip. The failing service gets the quiet it needs to restart or recover, and the rest of the system hums along, perhaps with slightly degraded function, but it's alive.
Now, let's talk about the second pattern: the bulkhead. If the circuit breaker is a fire door, the bulkhead is a ship's compartmentalization. On a ship, if one compartment floods, the watertight doors seal, and the ship stays afloat. In our code, we isolate different resources—like database connection pools, thread pools, or external service clients—so a failure in one doesn't drain all resources.
Imagine your service connects to a database and a Redis cache. If the database becomes very slow, you don't want all your goroutines getting stuck waiting for it, because then you also won't have any left to handle the fast Redis calls. You need separate, isolated pools.
Here's a Bulkhead struct that controls concurrency for a specific resource. It has a maximum number of concurrent calls it will allow. Requests that come in beyond that limit wait in a queue. If the queue is full, or if the wait times out, the request is immediately rejected to protect the system.
type Bulkhead struct {
	name          string
	maxConcurrent int32
	current       int32         // Track current in-flight requests
	queue         chan struct{} // bounded wait queue for excess requests
	freed         chan struct{} // best-effort signal that a slot was released
	timeout       time.Duration
}

func NewBulkhead(name string, maxConcurrent int32, queueSize int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{
		name:          name,
		maxConcurrent: maxConcurrent,
		queue:         make(chan struct{}, queueSize),
		freed:         make(chan struct{}, queueSize),
		timeout:       timeout,
	}
}
The Execute method tries to acquire a "slot." It first checks if we're under the concurrency limit. If so, it increments the counter and proceeds. If not, it tries to put a ticket into the queue channel. If the queue is full, it rejects immediately. If it gets in the queue, it then waits for a slot to free up, but only for the duration of the timeout.
func (bh *Bulkhead) Execute(ctx context.Context, fn func() error) error {
	if !bh.acquireSlot(ctx) {
		return errors.New("bulkhead rejected: too many concurrent requests")
	}
	defer bh.releaseSlot()
	return fn()
}
func (bh *Bulkhead) acquireSlot(ctx context.Context) bool {
	// Fast path: take a slot immediately if we're under the concurrency limit.
	if atomic.AddInt32(&bh.current, 1) <= bh.maxConcurrent {
		return true
	}
	atomic.AddInt32(&bh.current, -1) // we went over, undo

	// Try to join the wait queue; if it's full, reject immediately.
	select {
	case bh.queue <- struct{}{}:
	default:
		return false
	}
	defer func() { <-bh.queue }() // leave the queue on every exit path

	// Wait for a slot to free up, but only for the duration of the timeout.
	timer := time.NewTimer(bh.timeout)
	defer timer.Stop()
	for {
		select {
		case <-bh.freed: // releaseSlot signalled that a slot may be free
			if atomic.AddInt32(&bh.current, 1) <= bh.maxConcurrent {
				return true
			}
			atomic.AddInt32(&bh.current, -1) // lost the race, keep waiting
		case <-ctx.Done():
			return false
		case <-timer.C:
			return false
		}
	}
}

func (bh *Bulkhead) releaseSlot() {
	atomic.AddInt32(&bh.current, -1)
	select {
	case bh.freed <- struct{}{}: // best-effort wakeup for one queued waiter
	default:
	}
}
In production, you'd likely reach for a ready-made coordinator such as a weighted semaphore (for example, golang.org/x/sync/semaphore) instead of hand-rolling the slot notification, but the concept stays the same: limit, queue, and timeout. You would create separate bulkheads for your database client, your payment service client, and your email service client, as sketched below. A storm in one area leaves the others dry.
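For example, each dependency might get its own compartment. The names and sizes here are purely illustrative, and chargeCustomer stands in for whatever operation you're protecting:

// Each dependency gets its own compartment, sized to its latency profile and importance.
dbBulkhead := NewBulkhead("postgres", 50, 100, 2*time.Second)
paymentBulkhead := NewBulkhead("payment-gateway", 10, 20, 5*time.Second)
emailBulkhead := NewBulkhead("email-service", 5, 50, 10*time.Second)

// A slow payment gateway can exhaust only its own ten slots; database and
// email traffic keep flowing through their own pools.
if err := paymentBulkhead.Execute(ctx, chargeCustomer); err != nil {
	// Degrade gracefully: the rest of the kitchen keeps cooking.
}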
The real magic happens when we combine these patterns with retries and fallbacks. This creates a resilient client. Let me show you what a ResilientClient struct might look like. It wraps a business operation.
type ResilientClient struct {
	circuit  *CircuitBreaker
	bulkhead *Bulkhead
	retries  int
	fallback func() error // provides a default response when everything else fails
}
func (rc *ResilientClient) Call(ctx context.Context, operation func() error) error {
	// We'll attempt the operation within the bulkhead and circuit breaker
	for i := 0; i < rc.retries; i++ {
		err := rc.bulkhead.Execute(ctx, func() error {
			return rc.circuit.Execute(operation)
		})
		if err == nil {
			return nil // Success!
		}
		// If it's not a retryable error (e.g., a 4xx client error), break
		if !isRetryableError(err) {
			return err
		}
		// Wait before retrying, with exponential backoff
		select {
		case <-time.After(time.Duration(math.Pow(2, float64(i))) * 100 * time.Millisecond):
			continue
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	// All retries failed. Execute a fallback function.
	return rc.fallback()
}
The flow is: Call -> Bulkhead.Execute -> CircuitBreaker.Execute -> operation. The bulkhead ensures we don't overload our own resources. The circuit breaker stops us from calling a dead service. The retry logic handles temporary glitches. If everything fails, the fallback provides a default response—maybe stale cached data, a friendly message, or a default value.
Let's write a concrete example. Suppose we have a function that calls a weather API.
func getWeather(city string) (string, error) {
	// Simulate an HTTP call
	time.Sleep(50 * time.Millisecond)
	if rand.Intn(10) == 0 { // Simulate 10% failure rate
		return "", errors.New("weather service unreachable")
	}
	return "Sunny", nil
}
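Before main can run, we also need a NewResilientClient constructor and the two small helpers that Call relies on. Here's a minimal sketch; the parameter order, the queue size, the fallback message, and the always-retry policy are assumptions made just for this demo:

// NewResilientClient wires a circuit breaker and a bulkhead together.
// Parameters: name, failure threshold, reset timeout, max concurrent calls, call timeout, retries.
func NewResilientClient(name string, failureThreshold int32, resetTimeout time.Duration,
	maxConcurrent int32, callTimeout time.Duration, retries int) *ResilientClient {
	return &ResilientClient{
		circuit: &CircuitBreaker{
			threshold:    failureThreshold,
			resetTimeout: resetTimeout,
		},
		bulkhead: NewBulkhead(name, maxConcurrent, int(maxConcurrent)*2, callTimeout),
		retries:  retries,
		fallback: func() error {
			return errors.New("weather temporarily unavailable (fallback)")
		},
	}
}

// isRetryableError is a stand-in: here every error is considered transient.
// In real code you'd inspect the error (or HTTP status) and skip retries
// for client-side errors such as 4xx responses.
func isRetryableError(err error) bool {
	return err != nil
}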
func main() {
	// Create a resilient client for our weather service:
	// name, failure threshold, reset timeout, max concurrent calls, call timeout, retries.
	client := NewResilientClient("weather-api", 5, 30*time.Second, 10, 5*time.Second, 3)
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(reqNum int) {
			defer wg.Done()
			var weather string
			err := client.Call(context.Background(), func() error {
				w, err := getWeather("London")
				weather = w
				return err
			})
			if err != nil {
				fmt.Printf("Request %d failed: %v\n", reqNum, err)
			} else {
				fmt.Printf("Request %d: Weather is %s\n", reqNum, weather)
			}
		}(i)
	}
	wg.Wait()
}
In this simulation, some calls will randomly fail. The circuit breaker will trip after a few failures, blocking new calls. You'll see "circuit breaker is open" messages. After 30 seconds, it will move to half-open, allow a test request, and if it succeeds, close again. Meanwhile, the bulkhead ensures no more than 10 goroutines are trying to execute this logic concurrently.
When you run systems like this, monitoring is crucial. You need to know how often your circuits are opening, how many requests your bulkheads are rejecting, and what your retry success rate is. Expose these metrics as Prometheus gauges and counters, or send them to your observability platform. A dashboard showing circuit states is worth a thousand logs.
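Here's a rough sketch of that instrumentation with the standard Go client (github.com/prometheus/client_golang). The metric names are my own choice, and you'd call these helpers from the breaker's state transitions and the bulkhead's rejection path:

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// circuitState reports the current state per breaker: 0=closed, 1=open, 2=half-open.
var circuitState = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "circuit_breaker_state",
	Help: "Current circuit breaker state (0=closed, 1=open, 2=half-open).",
}, []string{"name"})

// bulkheadRejections counts requests turned away because a bulkhead was full.
var bulkheadRejections = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "bulkhead_rejections_total",
	Help: "Requests rejected because the bulkhead was at capacity.",
}, []string{"name"})

func recordCircuitState(name string, state CircuitState) {
	circuitState.WithLabelValues(name).Set(float64(state))
}

func recordBulkheadRejection(name string) {
	bulkheadRejections.WithLabelValues(name).Inc()
}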
A final piece of advice: start simple. You don't need to wrap every single external call immediately. Apply these patterns first to the most critical and most flaky dependencies—payment gateways, core data stores, third-party APIs with shaky SLAs. Use a library like github.com/sony/gobreaker for a production-ready circuit breaker and go.uber.org/ratelimit for rate limiting, which is a cousin of the bulkhead pattern.
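If you go the library route, wrapping the same weather call with gobreaker looks roughly like this. The settings are illustrative; the library offers more knobs (half-open request limits, state-change callbacks) than I'm showing here:

import (
	"time"

	"github.com/sony/gobreaker"
)

// A production-ready breaker around the same weather call.
var weatherBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:    "weather-api",
	Timeout: 30 * time.Second, // how long to stay open before allowing a probe
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 5 // trip after five consecutive failures
	},
})

func getWeatherWithBreaker(city string) (string, error) {
	result, err := weatherBreaker.Execute(func() (interface{}, error) {
		return getWeather(city)
	})
	if err != nil {
		return "", err // includes gobreaker.ErrOpenState while the circuit is open
	}
	return result.(string), nil
}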
The goal is not to write perfect, unbreakable software. The goal is to write software that breaks well—in predictable, isolated, and manageable ways. When your payment service has a bad day, your users should still be able to browse products, add them to their cart, and read reviews. They might see a message saying "Payment processing is temporarily delayed," but the store remains open. That's resilience. And in a distributed system, it's not a luxury; it's the foundation.