A service you depend on starts answering in 10 seconds instead of 50 milliseconds. So now your service answers in 10 seconds too. Goroutines pile up waiting on it, your connection pool drains, and the timeouts cascade upward until callers of your service start falling over. One slow dependency, and the whole chain goes down with it.
A circuit breaker is the small piece that stops the spread. When a dependency fails enough, the breaker "trips" and starts rejecting calls to it instantly — your code gets an immediate error instead of hanging on a doomed request. After a cooldown it lets one call through to test the waters; if that works, it closes again and traffic resumes. It's the same idea as the breaker in your wall: better to cut the circuit than burn the house down.
We'll build a working one in about 100 lines of Go, then look at why you'll eventually reach for github.com/sony/gobreaker instead.
The three states
A circuit breaker is a tiny state machine with three states:
- Closed — normal operation. Every call passes through to the dependency. The breaker counts failures. If failures cross a threshold, it trips to Open.
- Open — every call is rejected immediately with an error; the dependency gets a rest. After a timeout, the breaker moves to Half-Open.
- Half-Open — the breaker lets a single probe call through. If it succeeds, the dependency looks healthy and the breaker goes back to Closed. If it fails, back to Open for another timeout.
            failures ≥ threshold
┌────────┐ ──────────────────────▶ ┌──────┐
│ Closed │                         │ Open │ ◀──────────────┐
└────────┘                         └──────┘                │
    ▲                                 │                    │
    │ probe succeeds                  │ timeout elapsed    │ probe fails
    │                                 ▼                    │
    │                           ┌───────────┐              │
    └────────────────────────── │ Half-Open │ ─────────────┘
                                └───────────┘
Building it: about 100 lines of Go
Here's a complete, concurrency-safe breaker. One mutex guards the state; the only real subtlety is making sure that in Half-Open we let exactly one probe through, not every goroutine that happens to arrive.
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned when the breaker is open and rejecting calls.
var ErrOpen = errors.New("circuit breaker is open")

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateOpen:
		return "open"
	default:
		return "half-open"
	}
}

type Breaker struct {
	mu        sync.Mutex
	state     State
	failures  int           // consecutive failures while closed
	threshold int           // trip after this many
	timeout   time.Duration // how long to stay open
	openedAt  time.Time
	probing   bool // a half-open probe is already in flight
}

func New(threshold int, timeout time.Duration) *Breaker {
	return &Breaker{state: StateClosed, threshold: threshold, timeout: timeout}
}

// Do runs fn through the breaker. If the breaker is open, it returns ErrOpen
// immediately without calling fn at all.
func (b *Breaker) Do(fn func() error) error {
	if err := b.beforeCall(); err != nil {
		return err
	}
	err := fn()
	b.afterCall(err)
	return err
}

func (b *Breaker) beforeCall() error {
	b.mu.Lock()
	defer b.mu.Unlock()

	// Open long enough? Move to half-open and allow a probe.
	if b.state == StateOpen && time.Since(b.openedAt) > b.timeout {
		b.state = StateHalfOpen
		b.probing = false
	}

	switch b.state {
	case StateOpen:
		return ErrOpen
	case StateHalfOpen:
		if b.probing {
			return ErrOpen // someone else is already probing
		}
		b.probing = true // claim the single probe slot
	}
	return nil
}

func (b *Breaker) afterCall(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if err != nil {
		b.failures++
		switch b.state {
		case StateClosed:
			if b.failures >= b.threshold {
				b.trip()
			}
		case StateHalfOpen:
			b.trip() // probe failed: reopen
		}
		return
	}

	switch b.state { // success
	case StateClosed:
		b.failures = 0
	case StateHalfOpen:
		b.state = StateClosed // probe passed: close up
		b.failures = 0
		b.probing = false
	}
}

func (b *Breaker) trip() {
	b.state = StateOpen
	b.openedAt = time.Now()
	b.failures = 0
	b.probing = false
}
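That's the whole thing. beforeCall decides whether to allow the call and, in Half-Open, hands out exactly one probe slot. afterCall records the outcome and flips state. trip() is the one place that opens the circuit, so there's a single, obvious path into Open. One caveat: afterCall judges a result by the state the breaker is in when the call returns, so a slow call that started while Closed and finishes during Half-Open is treated as if it were the probe. Bounding every call with a timeout shorter than the breaker's cooldown avoids that.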
Using it
Wrap any call that can fail or hang. The pattern that makes a breaker worth having is the fallback: when it's open, you serve something else instead of an error.
cb := breaker.New(5, 30*time.Second)

err := cb.Do(func() error {
	resp, err := http.Get("https://api.example.com/data")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("upstream returned %d", resp.StatusCode)
	}
	return nil
})
if errors.Is(err, breaker.ErrOpen) {
	return serveFromCache() // breaker is open: don't even try the network
}
Five failures in a row and the breaker opens. For the next 30 seconds every Do returns ErrOpen instantly — no hung goroutines, no drained pool. After 30 seconds one probe goes out; if it succeeds, normal traffic resumes.
The transitions at a glance
| From | Event | To |
|---|---|---|
| Closed | a call succeeds | Closed (failure count reset to 0) |
| Closed | failures reach the threshold | Open |
| Open | a call arrives before the timeout | rejected with ErrOpen, stays Open |
| Open | the timeout has elapsed | Half-Open (one probe allowed) |
| Half-Open | the probe succeeds | Closed |
| Half-Open | the probe fails | Open (timeout restarts) |
| Half-Open | another call arrives mid-probe | rejected with ErrOpen |
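The same table can be written as a pure function, which is handy for table-driven tests that need no clock or mutex. This is a sketch, not the breaker's actual code path: the Event type and its constant names are invented for illustration.

```go
package main

// State mirrors the three breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Event is a hypothetical enumeration of the rows in the table above.
type Event int

const (
	CallSucceeded Event = iota
	ThresholdReached
	CallRejected
	TimeoutElapsed
	ProbeSucceeded
	ProbeFailed
)

// Next returns the state that follows s when e happens.
// Any pair not listed in the table leaves the state unchanged.
func Next(s State, e Event) State {
	switch {
	case s == Closed && e == ThresholdReached:
		return Open
	case s == Open && e == TimeoutElapsed:
		return HalfOpen
	case s == HalfOpen && e == ProbeSucceeded:
		return Closed
	case s == HalfOpen && e == ProbeFailed:
		return Open
	}
	return s
}
```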
Where 100 lines runs out
This breaker works, and for a lot of services it's genuinely enough. But put it under real traffic and you'll hit its edges:
- It only counts consecutive failures. Five failures in a row trips it — but a service that fails 30% of the time, never twice in a row, will sail right past. Real systems often want to trip on a rate: "more than half of the last 100 calls failed." That needs rolling counts, which my version doesn't keep.
- One probe is noisy. A single half-open probe decides everything. If that one call happens to time out by bad luck, the breaker reopens even though the service had recovered. A handful of probes gives a steadier verdict.
- No visibility. When did it trip? How often? My breaker can't tell you. In production you want a hook that fires on every state change so you can emit a metric.
- 4xx shouldn't trip it. If the dependency answers fast with 404s, it's working; your input is just wrong. A good breaker lets you decide which errors count as failures.
You can bolt each of these onto the 100-line version, but at that point you're reimplementing a library that already exists.
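To give a feel for what the first of those bolt-ons involves, here is a minimal sketch of rolling counts over the last N calls. Window, Record, and ShouldTrip are hypothetical names, and the type is deliberately not concurrency-safe: it assumes you call it while holding the breaker's mutex.

```go
package main

// Window is a fixed-size ring buffer of recent call outcomes.
// It is not safe for concurrent use on its own.
type Window struct {
	outcomes []bool // true means the call failed
	next     int    // slot that the next Record will overwrite
	filled   int    // how many slots hold real outcomes
	failures int    // failures currently inside the window
}

// NewWindow returns a window that remembers the last size calls.
// A size below 1 is treated as 1.
func NewWindow(size int) *Window {
	if size < 1 {
		size = 1
	}
	return &Window{outcomes: make([]bool, size)}
}

// Record stores one outcome, evicting the oldest once the window is full.
func (w *Window) Record(failed bool) {
	if w.filled == len(w.outcomes) {
		if w.outcomes[w.next] {
			w.failures-- // the evicted outcome was a failure
		}
	} else {
		w.filled++
	}
	w.outcomes[w.next] = failed
	if failed {
		w.failures++
	}
	w.next = (w.next + 1) % len(w.outcomes)
}

// ShouldTrip reports whether at least minCalls outcomes are recorded and
// the share of failures among them is at least rate (0.5 means half).
func (w *Window) ShouldTrip(minCalls int, rate float64) bool {
	if w.filled == 0 || w.filled < minCalls {
		return false
	}
	return float64(w.failures)/float64(w.filled) >= rate
}
```

In afterCall you would call Record(err != nil) and trip when ShouldTrip(100, 0.5) is true, in place of the consecutive-failure check. The minCalls guard matters: without it, a single failure on a cold window would count as a 100% failure rate.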
Reaching for gobreaker
github.com/sony/gobreaker is the well-worn Go implementation, and its current major version is v2 (import github.com/sony/gobreaker/v2), which uses generics. It covers all four gaps above with a small Settings struct:
import "github.com/sony/gobreaker/v2"

cb := gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
	Name:        "data-api",
	MaxRequests: 3,                // probes allowed in half-open
	Interval:    60 * time.Second, // window for clearing the counts
	Timeout:     30 * time.Second, // how long to stay open
	ReadyToTrip: func(c gobreaker.Counts) bool {
		// trip on a failure rate, not just a streak
		return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		log.Printf("breaker %s: %s -> %s", name, from, to) // emit a metric here
	},
	IsSuccessful: func(err error) bool {
		// treat a 404 as success so it doesn't trip the breaker
		return !errors.Is(err, errNotFound)
	},
})

body, err := cb.Execute(func() ([]byte, error) {
	return fetchData()
})
Each field maps straight onto a limitation we just hit: MaxRequests replaces the single noisy probe, ReadyToTrip with Counts replaces consecutive-only tripping (the default ReadyToTrip trips after more than five consecutive failures, the same kind of rule as ours, just overridable), OnStateChange gives you metrics, and IsSuccessful keeps 4xx from opening the circuit. There's also a two-step variant, TwoStepCircuitBreaker, whose Allow() returns a done callback to invoke when the work finishes, for when the call doesn't fit inside a single function: opening a stream now and closing it later.
The honest split: build the 100-line version to understand the machine, and run it for a simple internal service. Reach for gobreaker the moment you want rate-based tripping, real metrics, or more than one probe — which is most production services.
Give each dependency its own breaker
One breaker should guard one dependency — not your whole service. If you share a single breaker across calls to the payments API and the search API, a payments outage will open the circuit for search too, and you'll start rejecting perfectly healthy calls. Keep one per downstream:
var (
	paymentsCB = breaker.New(5, 30*time.Second)
	searchCB   = breaker.New(5, 30*time.Second)
)
Tune them separately, too. A critical, usually-fast dependency might trip after 3 failures with a short 10-second cooldown, so you fall back quickly. A flaky best-effort one might tolerate 10 failures and a longer timeout before you bother backing off. A breaker guards one relationship, and each relationship has its own tolerance.
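Once there are more than a handful of downstreams, package-level variables get unwieldy. One possible shape is a small registry that creates a breaker per dependency name on first use. Registry, NewRegistry, and Get are hypothetical names; the type parameter B stands for whichever breaker type you use, such as *breaker.Breaker from this article.

```go
package main

import "sync"

// Registry hands out one breaker per dependency name and creates it lazily.
type Registry[B any] struct {
	mu       sync.Mutex
	breakers map[string]B
	newFn    func(name string) B // builds the breaker for a given dependency
}

// NewRegistry returns an empty registry that uses newFn to build breakers.
func NewRegistry[B any](newFn func(name string) B) *Registry[B] {
	return &Registry[B]{breakers: make(map[string]B), newFn: newFn}
}

// Get returns the breaker for name, creating it on the first request.
// Repeated calls with the same name always return the same breaker.
func (r *Registry[B]) Get(name string) B {
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.breakers[name]
	if !ok {
		b = r.newFn(name)
		r.breakers[name] = b
	}
	return b
}
```

The factory function is where per-dependency tuning lives: it can switch on the name and return, say, a 3-failure, 10-second breaker for payments and a 10-failure one for a best-effort service.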
How it fits with retry and timeout
A circuit breaker doesn't replace retries or timeouts; it sits with them. A solid arrangement nests all three, with the breaker outermost and the timeout innermost:
err := cb.Do(func() error {
	return retry.Do( // github.com/avast/retry-go
		func() error {
			ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
			defer cancel()
			return callService(ctx)
		},
		retry.Attempts(3),
		retry.Delay(100*time.Millisecond),
	)
})
Each layer has a job. The timeout bounds a single attempt so it can't hang forever. Retry smooths over the occasional blip — a dropped packet, a one-off 503. The circuit breaker sits outside both, so it trips only when whole retry sequences keep failing — i.e. the dependency is genuinely down, not just flaky. Drop the breaker and your retries will hammer a dying service until every attempt times out. Drop the retries and the breaker overreacts to single transient blips. Drop the timeout and any one attempt can hang the whole stack. (You can also place the breaker inside retry if you want it to react to individual attempts — pick based on whether "a failure" means one attempt or one whole sequence.)
When you don't need one
A breaker isn't free, and it isn't always the answer:
- No fallback, no point. If you have a single dependency, it's down, and you have nothing to serve instead — no cache, no default, no second instance — the breaker just turns a slow error into a fast one. Sometimes that alone is worth it (fast failure beats a hung request), but don't expect it to save the request.
- 4xx isn't a circuit problem. If the dependency responds quickly with client errors, it's healthy; tripping the breaker would be wrong. Only count failures that mean overload or outage, not bad input.
A breaker pays off when there's a real alternative to fall back to, and when the failures are the dependency buckling under load rather than something your own requests caused.
What to take away
- A circuit breaker is a three-state machine — Closed, Open, Half-Open — that fails fast when a dependency is down, so one slow service doesn't drag yours down with it.
- You can write a correct one in ~100 lines of Go. The only real subtlety is gating Half-Open to a single probe.
- The pattern only earns its keep with a fallback (cache, default, another instance) to serve while the circuit is open.
- A hand-rolled breaker trips on consecutive failures and allows one probe. When you need rate-based tripping, multiple probes, metrics, or 4xx exclusion, use github.com/sony/gobreaker/v2; every one of its Settings fields maps to a limitation of the small version.
- It lives alongside retry and timeout, not instead of them: timeout bounds an attempt, retry smooths blips, the breaker reacts to sustained failure.
Build the small one once to really see the machine. Then let the library carry it in production.