
Onur Cinar


Stop the Domino Effect: Bulkhead Isolation in Go

In a distributed system, failure is inevitable. But a failure in one part of your system shouldn't bring down everything else.

Imagine your Go service depends on three different downstream APIs: Payments, Inventory, and Recommendations. Suddenly, the Recommendations API starts taking 30 seconds to respond. If your service doesn't have isolation, goroutines pile up waiting for Recommendations, each one holding memory and an open connection. Eventually the process exhausts memory, file descriptors, or connection-pool slots, and even the critical Payments API calls start failing because there are no resources left to handle them.

This is the Domino Effect, and the Bulkhead Pattern is how you stop it.


The Problem: Resource Exhaustion

When one dependency slows down, it consumes resources:

  • Goroutines: Blocked waiting for a response.
  • Memory: Each blocked goroutine carries a stack.
  • File Descriptors/Sockets: Open connections to the slow service.

Without a bulkhead, a single slow dependency can "starve" the rest of your application, leading to a total system collapse.
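A rough way to see this pile-up using only the standard library: the sketch below parks n goroutines on a channel that stands in for the stalled dependency, then reports how many extra goroutines exist while it is stalled. The name pileUp and the channel-as-dependency setup are illustrative assumptions, not part of any library.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// pileUp simulates n requests that each block on a slow dependency and
// returns how many extra goroutines were parked while it was stalled.
func pileUp(n int) int {
	before := runtime.NumGoroutine()
	release := make(chan struct{}) // closed when the "dependency" finally answers

	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // blocked here, holding a stack (and a socket in real code)
		}()
	}

	started.Wait() // every simulated request is now running or parked
	parked := runtime.NumGoroutine() - before

	close(release) // let everything drain so the sketch does not leak
	done.Wait()
	return parked
}

func main() {
	fmt.Println("goroutines parked on the slow dependency:", pileUp(1000))
}
```

Nothing in this code bounds n, which is the point: without a limit, the number of parked goroutines tracks incoming traffic, not what the process can afford.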


The Solution: The Bulkhead Pattern

Named after the partitioned sections of a ship's hull, a Bulkhead isolates failures. If one compartment floods, the others stay dry and the ship stays afloat. In software, we achieve this by limiting the number of concurrent executions allowed for a specific resource or dependency.

Implementing with Resile:

Resile makes it trivial to add bulkhead isolation to any operation.

// Allow only 10 concurrent calls to this specific operation.
// If an 11th call comes in, it fails fast with resile.ErrBulkheadFull.
err := resile.DoErr(ctx, action, 
    resile.WithBulkhead(10),
)

Using a Shared Bulkhead

Often, you want to limit concurrency across multiple different call sites that hit the same downstream service. You can create a shared Bulkhead instance for this:

// Create a shared bulkhead for the "Inventory Service"
inventoryBulkhead := resile.NewBulkhead(20)

// Call Site A
resile.DoErr(ctx, fetchItem, resile.WithBulkheadInstance(inventoryBulkhead))

// Call Site B
resile.DoErr(ctx, updateStock, resile.WithBulkheadInstance(inventoryBulkhead))

By sharing the instance, you ensure that the total concurrency hitting the Inventory Service never exceeds 20, regardless of which part of your code is making the call.
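The effect of sharing can be checked with a small standard-library experiment. In this sketch, two hypothetical call sites (fetchItem and updateStock, named after the ones above but implemented here as stand-ins) draw from one semaphore, and the code records the peak number of calls in flight. The limiter type, hammer function, and errFull sentinel are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var errFull = errors.New("bulkhead full") // hypothetical sentinel for this sketch

// limiter is a shared semaphore: every call site holding it draws from one pool.
type limiter chan struct{}

func (l limiter) run(fn func()) error {
	select {
	case l <- struct{}{}:
		defer func() { <-l }()
		fn()
		return nil
	default:
		return errFull
	}
}

// hammer fires n concurrent calls, alternating between two call sites, and
// reports the peak number in flight and how many calls were rejected.
func hammer(shared limiter, n int) (peak, rejected int64) {
	var inFlight int64
	work := func() {
		cur := atomic.AddInt64(&inFlight, 1)
		for { // record a new peak if this call raised it
			p := atomic.LoadInt64(&peak)
			if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
				break
			}
		}
		time.Sleep(10 * time.Millisecond) // simulated downstream latency
		atomic.AddInt64(&inFlight, -1)
	}
	fetchItem := func() error { return shared.run(work) }   // call site A
	updateStock := func() error { return shared.run(work) } // call site B

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			call := fetchItem
			if i%2 == 1 {
				call = updateStock
			}
			if call() != nil {
				atomic.AddInt64(&rejected, 1)
			}
		}(i)
	}
	wg.Wait()
	return peak, rejected
}

func main() {
	peak, rejected := hammer(make(limiter, 20), 100)
	fmt.Printf("peak in flight: %d (limit 20), rejected: %d\n", peak, rejected)
}
```

The exact counts vary from run to run with scheduling, but the peak cannot exceed the capacity of the shared channel. Giving each call site its own limiter of 20 would instead allow up to 40 calls in flight against the same service.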


Why "Fail-Fast" Matters

When a bulkhead is full, Resile immediately returns resile.ErrBulkheadFull.

This is much better than waiting for a timeout. By failing fast, you:

  1. Preserve Resources: The rejected call returns at once instead of tying up a goroutine and a connection while it waits.
  2. Provide Immediate Feedback: Your upstream callers get an error instantly and can decide how to handle it (e.g., show a cached result or a "service busy" message).

Observability: Monitoring the Walls

You need to know when your bulkheads are rejecting calls. If a bulkhead is frequently full, either your downstream service is struggling or your limit is too low for normal traffic.

If you use Resile's telemetry integrations (like slog or OpenTelemetry), bulkhead saturation is reported automatically. The OnBulkheadFull event is triggered every time a request is rejected due to capacity limits, so you can build alerts on it.


Conclusion

Bulkheads are a fundamental building block of resilient systems. By isolating your dependencies, you ensure that a local fire doesn't become a global conflagration.

Resile provides a clean, "Go-native" way to implement bulkheads without complex boilerplate, allowing you to focus on your business logic while keeping your system stable.

Explore Resile on GitHub: github.com/cinar/resile
