Arthur

Posted on Jul 3 • Originally published at pickles.news

A Circuit Breaker in Go: Build One in 100 Lines, Then Reach for gobreaker

#go #circuitbreaker #resilience #microservices

A service you depend on starts answering in 10 seconds instead of 50 milliseconds. So now your service answers in 10 seconds too. Goroutines pile up waiting on it, your connection pool drains, and the timeouts cascade upward until callers of your service start falling over. One slow dependency, and the whole chain goes down with it.

A circuit breaker is the small piece that stops the spread. When a dependency fails enough, the breaker "trips" and starts rejecting calls to it instantly — your code gets an immediate error instead of hanging on a doomed request. After a cooldown it lets one call through to test the waters; if that works, it closes again and traffic resumes. It's the same idea as the breaker in your wall: better to cut the circuit than burn the house down.

We'll build a working one in about 100 lines of Go, then look at why you'll eventually reach for github.com/sony/gobreaker instead.

The three states

A circuit breaker is a tiny state machine with three states:

Closed — normal operation. Every call passes through to the dependency. The breaker counts failures. If failures cross a threshold, it trips to Open.
Open — every call is rejected immediately with an error; the dependency gets a rest. After a timeout, the breaker moves to Half-Open.
Half-Open — the breaker lets a single probe call through. If it succeeds, the dependency looks healthy and the breaker goes back to Closed. If it fails, back to Open for another timeout.

              failures ≥ threshold
  ┌────────┐ ─────────────────────▶ ┌──────┐
  │ Closed │                        │ Open │ ◀───────────┐
  └────────┘                        └──────┘             │
       ▲                               │                 │
       │                               │ timeout elapsed │ probe fails
       │                               ▼                 │
       │    probe succeeds       ┌───────────┐           │
       └──────────────────────── │ Half-Open │ ──────────┘
                                 └───────────┘

Building it: about 100 lines of Go

Here's a complete, concurrency-safe breaker. One mutex guards the state; the only real subtlety is making sure that in Half-Open we let exactly one probe through, not every goroutine that happens to arrive.

package breaker

import (
    "errors"
    "sync"
    "time"
)

// ErrOpen is returned when the breaker is open and rejecting calls.
var ErrOpen = errors.New("circuit breaker is open")

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

func (s State) String() string {
    switch s {
    case StateClosed:
        return "closed"
    case StateOpen:
        return "open"
    default:
        return "half-open"
    }
}

type Breaker struct {
    mu        sync.Mutex
    state     State
    failures  int           // consecutive failures while closed
    threshold int           // trip after this many
    timeout   time.Duration // how long to stay open
    openedAt  time.Time
    probing   bool // a half-open probe is already in flight
}

func New(threshold int, timeout time.Duration) *Breaker {
    return &Breaker{state: StateClosed, threshold: threshold, timeout: timeout}
}

// Do runs fn through the breaker. If the breaker is open, it returns ErrOpen
// immediately without calling fn at all.
func (b *Breaker) Do(fn func() error) error {
    if err := b.beforeCall(); err != nil {
        return err
    }
    err := fn()
    b.afterCall(err)
    return err
}

func (b *Breaker) beforeCall() error {
    b.mu.Lock()
    defer b.mu.Unlock()

    // Open long enough? Move to half-open and allow a probe.
    if b.state == StateOpen && time.Since(b.openedAt) > b.timeout {
        b.state = StateHalfOpen
        b.probing = false
    }

    switch b.state {
    case StateOpen:
        return ErrOpen
    case StateHalfOpen:
        if b.probing {
            return ErrOpen // someone else is already probing
        }
        b.probing = true // claim the single probe slot
    }
    return nil
}

func (b *Breaker) afterCall(err error) {
    b.mu.Lock()
    defer b.mu.Unlock()

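    if err != nil {
        switch b.state {
        case StateClosed:
            b.failures++ // count failures only while closed
            if b.failures >= b.threshold {
                b.trip()
            }
        case StateHalfOpen:
            b.trip() // probe failed: reopen
        }
        // StateOpen: a call admitted before the trip finished late; ignore it.
        return
    }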

    switch b.state { // success
    case StateClosed:
        b.failures = 0
    case StateHalfOpen:
        b.state = StateClosed // probe passed — close up
        b.failures = 0
        b.probing = false
    }
}

func (b *Breaker) trip() {
    b.state = StateOpen
    b.openedAt = time.Now()
    b.failures = 0
    b.probing = false
}

That's the whole thing. beforeCall decides whether to allow the call and, in Half-Open, hands out exactly one probe slot. afterCall records the outcome and flips state. trip() is the one place that opens the circuit, so there's a single, obvious path into Open.

Using it

Wrap any call that can fail or hang. What makes a breaker worth having is the fallback. When the circuit is open, you serve something else instead of an error:

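cb := breaker.New(5, 30*time.Second)
// The default client behind http.Get has no timeout; bound each call so a hung
// request turns into an error the breaker can count.
client := &http.Client{Timeout: 2 * time.Second}

err := cb.Do(func() error {
    resp, err := client.Get("https://api.example.com/data")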
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 500 {
        return fmt.Errorf("upstream returned %d", resp.StatusCode)
    }
    return nil
})

if errors.Is(err, breaker.ErrOpen) {
    return serveFromCache() // breaker is open — don't even try the network
}

Five failures in a row and the breaker opens. For the next 30 seconds every Do returns ErrOpen instantly: no hung goroutines, no drained pool. Once the 30 seconds have passed, the next call to arrive goes through as the probe; if it succeeds, normal traffic resumes.

The transitions at a glance

From	Event	To
Closed	a call succeeds	Closed (failure count reset to 0)
Closed	failures reach the threshold	Open
Open	a call arrives before the timeout	rejected with `ErrOpen`, stays Open
Open	the timeout has elapsed	Half-Open (one probe allowed)
Half-Open	the probe succeeds	Closed
Half-Open	the probe fails	Open (timeout restarts)
Half-Open	another call arrives mid-probe	rejected with `ErrOpen`
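
The same table can be written as a pure function, which makes the state machine easy to unit-test apart from locking and timing. This is only a sketch and not part of the breaker above: the Event type, its constants, and next are names made up for illustration.

```go
package main

import "fmt"

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Event is a hypothetical name for the things that can move the breaker.
type Event int

const (
	CallSucceeded    Event = iota // a call (or the probe) returned nil
	ThresholdReached              // consecutive failures hit the threshold
	TimeoutElapsed                // the open period is over
	ProbeFailed                   // the half-open probe returned an error
)

// next returns the state that follows s when e happens.
// Pairs that are not in the table leave the state unchanged.
func next(s State, e Event) State {
	switch {
	case s == Closed && e == ThresholdReached:
		return Open
	case s == Open && e == TimeoutElapsed:
		return HalfOpen
	case s == HalfOpen && e == CallSucceeded:
		return Closed
	case s == HalfOpen && e == ProbeFailed:
		return Open
	}
	return s
}

func main() {
	// Trip, wait out the timeout, fail the probe, wait again, pass the probe.
	s := Closed
	for _, e := range []Event{ThresholdReached, TimeoutElapsed, ProbeFailed, TimeoutElapsed, CallSucceeded} {
		s = next(s, e)
		fmt.Printf("%d ", s)
	}
	fmt.Println()
}
```

Rejections (a call arriving while Open, or mid-probe in Half-Open) don't appear here because they don't change the state; they only decide what the caller gets back.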

Where 100 lines runs out

This breaker works, and for a lot of services it's genuinely enough. But put it under real traffic and you'll hit its edges:

It only counts consecutive failures. Five failures in a row trips it — but a service that fails 30% of the time, never twice in a row, will sail right past. Real systems often want to trip on a rate: "more than half of the last 100 calls failed." That needs rolling counts, which my version doesn't keep.
One probe is noisy. A single half-open probe decides everything. If that one call happens to time out by bad luck, the breaker reopens even though the service had recovered. A handful of probes gives a steadier verdict.
No visibility. When did it trip? How often? My breaker can't tell you. In production you want a hook that fires on every state change so you can emit a metric.
4xx shouldn't trip it. If the dependency answers fast with 404s, it's working — your input is just wrong. A good breaker lets you decide which errors count as failures.

You can bolt each of these onto the 100-line version, but at that point you're reimplementing a library that already exists.
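
To give a sense of what the first bolt-on involves, here is a minimal sketch of the rolling count that rate-based tripping needs: a ring buffer over the last N outcomes. The names window, record, and shouldTrip are invented for this example, and the type isn't safe for concurrent use on its own; it assumes you call it while holding the breaker's mutex.

```go
package main

import "fmt"

// window remembers the outcome of the last len(failed) calls in a ring buffer.
type window struct {
	failed   []bool // failed[i] is true if that call failed
	next     int    // slot the next outcome overwrites
	count    int    // outcomes recorded so far, capped at len(failed)
	failures int    // number of true entries currently in the buffer
}

// newWindow creates a window over the last size calls; size must be positive.
func newWindow(size int) *window {
	return &window{failed: make([]bool, size)}
}

// record stores one outcome, evicting the oldest once the buffer is full.
func (w *window) record(callFailed bool) {
	if w.count == len(w.failed) {
		if w.failed[w.next] {
			w.failures-- // the evicted outcome was a failure
		}
	} else {
		w.count++
	}
	w.failed[w.next] = callFailed
	if callFailed {
		w.failures++
	}
	w.next = (w.next + 1) % len(w.failed)
}

// shouldTrip reports whether at least minCalls outcomes are recorded and
// more than rate of them were failures.
func (w *window) shouldTrip(minCalls int, rate float64) bool {
	if w.count < minCalls {
		return false
	}
	return float64(w.failures)/float64(w.count) > rate
}

func main() {
	w := newWindow(100)
	for i := 0; i < 100; i++ {
		w.record(i%3 != 2) // fail, fail, succeed: never five failures in a row
	}
	fmt.Println(w.shouldTrip(20, 0.5)) // 67 of the last 100 calls failed
}
```

A consecutive-failure breaker with a threshold of 5 never trips on that pattern, while the window does. The minCalls guard keeps a single early failure from counting as a 100% failure rate.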

Reaching for gobreaker

github.com/sony/gobreaker is the well-worn Go implementation, and its current major version is v2 (import github.com/sony/gobreaker/v2), which uses generics. It covers all four gaps above with a small Settings struct:

import "github.com/sony/gobreaker/v2"

cb := gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
    Name:        "data-api",
    MaxRequests: 3,                // probes allowed in half-open
    Interval:    60 * time.Second, // window for clearing the counts
    Timeout:     30 * time.Second, // how long to stay open
    ReadyToTrip: func(c gobreaker.Counts) bool {
        // trip on a failure rate, not just a streak
        return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        log.Printf("breaker %s: %s -> %s", name, from, to) // emit a metric here
    },
    IsSuccessful: func(err error) bool {
        // count nil and not-found as success so a 404 doesn't trip the breaker;
        // every other error still counts as a failure
        return err == nil || errors.Is(err, errNotFound)
    },
})

body, err := cb.Execute(func() ([]byte, error) {
    return fetchData()
})

Each of those fields answers a limitation we just hit: MaxRequests replaces the single noisy probe, ReadyToTrip with Counts replaces consecutive-only tripping (the default ReadyToTrip trips after more than five consecutive failures, the same kind of rule as ours, just overridable), OnStateChange gives you metrics, and IsSuccessful keeps 4xx from opening the circuit. There's also a TwoStepCircuitBreaker, whose Allow() returns a done callback you invoke later, for when the call doesn't fit inside a single function: opening a stream now and closing it later, for example.

The honest split: build the 100-line version to understand the machine, and run it for a simple internal service. Reach for gobreaker the moment you want rate-based tripping, real metrics, or more than one probe — which is most production services.

Give each dependency its own breaker

One breaker should guard one dependency — not your whole service. If you share a single breaker across calls to the payments API and the search API, a payments outage will open the circuit for search too, and you'll start rejecting perfectly healthy calls. Keep one per downstream:

var (
    paymentsCB = breaker.New(5, 30*time.Second)
    searchCB   = breaker.New(5, 30*time.Second)
)

Tune them separately, too. A critical, usually-fast dependency might trip after 3 failures with a short 10-second cooldown, so you fall back quickly. A flaky best-effort one might tolerate 10 failures and a longer timeout before you bother backing off. A breaker guards one relationship, and each relationship has its own tolerance.

How it fits with retry and timeout

A circuit breaker doesn't replace retries or timeouts; it works alongside them. A solid arrangement nests them, with the timeout innermost and the breaker outermost:

err := cb.Do(func() error {
    return retry.Do( // github.com/avast/retry-go
        func() error {
            ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
            defer cancel()
            return callService(ctx)
        },
        retry.Attempts(3),
        retry.Delay(100*time.Millisecond),
    )
})

Each layer has a job. The timeout bounds a single attempt so it can't hang forever. Retry smooths over the occasional blip — a dropped packet, a one-off 503. The circuit breaker sits outside both, so it trips only when whole retry sequences keep failing — i.e. the dependency is genuinely down, not just flaky. Drop the breaker and your retries will hammer a dying service until every attempt times out. Drop the retries and the breaker overreacts to single transient blips. Drop the timeout and any one attempt can hang the whole stack. (You can also place the breaker inside retry if you want it to react to individual attempts — pick based on whether "a failure" means one attempt or one whole sequence.)

When you don't need one

A breaker isn't free, and it isn't always the answer:

No fallback, no point. If you have a single dependency, it's down, and you have nothing to serve instead — no cache, no default, no second instance — the breaker just turns a slow error into a fast one. Sometimes that alone is worth it (fast failure beats a hung request), but don't expect it to save the request.
4xx isn't a circuit problem. If the dependency responds quickly with client errors, it's healthy; tripping the breaker would be wrong. Only count failures that mean overload or outage, not bad input.

A breaker pays off when there's a real alternative to fall back to, and when the failures are the dependency buckling under load rather than something your own requests caused.

What to take away

A circuit breaker is a three-state machine — Closed, Open, Half-Open — that fails fast when a dependency is down, so one slow service doesn't drag yours down with it.
You can write a correct one in ~100 lines of Go. The only real subtlety is gating Half-Open to a single probe.
The pattern only earns its keep with a fallback (cache, default, another instance) to serve while the circuit is open.
A hand-rolled breaker trips on consecutive failures and allows one probe. When you need rate-based tripping, multiple probes, metrics, or 4xx exclusion, use github.com/sony/gobreaker/v2: its Settings struct has a field for each of those limitations of the small version.
It lives alongside retry and timeout, not instead of them: timeout bounds an attempt, retry smooths blips, the breaker reacts to sustained failure.

Build the small one once to really see the machine. Then let the library carry it in production.

Top comments (1)

Roman Kotenko • Jul 7

The per-dependency point is the one people learn the hard way. We poll ~30 state DOT feeds on independent schedules, and early on a single shared breaker meant one slow feed would trip the circuit for all of them — healthy sources got starved behind the dead one. One breaker per source fixed it.

One wrinkle for a polling aggregator vs request/response: the fallback isn't a cache you serve to a caller, it's "don't clobber last-known-good." When a source's breaker is open we keep serving its last successful snapshot with a staleness timestamp instead of nulling the data out.

And +1 on IsSuccessful — a couple of our feeds answer 200 with an empty body when they're actually broken, so "failure" has to be defined semantically, not by HTTP status.