DEV Community

Gabriel Anhaia

Build a Circuit Breaker in TypeScript in 80 Lines


Imagine a downstream payment service starts misbehaving in the middle of the afternoon. Latency on a single endpoint goes from 80ms to 6 seconds. The provider has not declared an outage. Your service keeps calling it on every checkout, because every call still completes — slowly, with a mix of timeouts and 502s.

A few minutes later your event loop is saturated. The queue of pending requests on your Node process is climbing past 4,000. Healthy endpoints in the same service are now timing out, because their handlers cannot get scheduled. Your p99 for every route on the service is over five seconds. Your dashboard is green where you have synthetics, red on customer flows, and the on-call engineer is tabbing between the payment provider's status page and your own logs trying to decide who to page.

The thing missing here is a circuit breaker. Without one, a slow dependency multiplies your latency and floods your queues. With one, the slow dependency gets walled off in seconds: requests fail fast with a clear error, the queue drains, and you only spend a single test request every so often to find out when the dependency comes back. The breaker is one of the smallest pieces of code that changes how the whole service behaves under stress.

Three states, an error-rate window, an exponential cooldown, a half-open probe. No dependencies, runnable as a single file.

The state machine

A circuit breaker is a small state machine that sits in front of one risky call. Three states, four transitions:

  • CLOSED — the normal path. Calls go through. The breaker watches outcomes (success / failure) over a sliding window. If the failure rate over the window crosses a threshold, transition to OPEN.
  • OPEN — the dependency is presumed broken. Calls fail fast without touching the network. Stay here for a cooldown duration. When the cooldown expires, transition to HALF-OPEN.
  • HALF-OPEN — the dependency is presumed maybe-recovered. Allow exactly one probe call through. If it succeeds, transition back to CLOSED and reset everything. If it fails, transition back to OPEN and double the cooldown.

That last detail matters. The exponential cooldown is what makes a breaker behave well during a real outage. If the dependency is going to be down for an hour, you do not want to send a probe every 5 seconds for an hour. You want the probe interval to grow until it matches the actual recovery timescale of whatever you are calling.
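The doubling-with-a-cap behaviour is easy to sketch on its own. The helper below is hypothetical (it is not part of the breaker built later in this post); the numbers assume a 1-second initial cooldown capped at 1 minute:

```typescript
// Hypothetical helper showing the OPEN-state cooldown ramp:
// start at the initial cooldown, double after every failed probe,
// cap at the maximum.
function cooldownRamp(
  initialMs: number,
  maxMs: number,
  failedProbes: number,
): number[] {
  const ramp: number[] = [];
  let cooldown = initialMs;
  for (let i = 0; i < failedProbes; i++) {
    ramp.push(cooldown);
    cooldown = Math.min(cooldown * 2, maxMs);
  }
  return ramp;
}

// 1s initial, 60s cap: 1s, 2s, 4s, 8s, 16s, 32s, then pinned at 60s.
cooldownRamp(1_000, 60_000, 8);
```

Eight failed probes in a row cost you eight requests total, spread over roughly two minutes, instead of hammering the dependency every cooldown tick.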

The whole design fits on a napkin. The challenge is the implementation details: what counts as a failure, how the window slides, how state transitions interact with concurrent calls, what the breaker returns when it short-circuits.

The breaker exposes one method, call, which takes any async function:

type Breaker = {
  call: <T>(fn: () => Promise<T>) => Promise<T>;
};

You hand it any async function. It either runs that function and returns the result, or it short-circuits and throws a CircuitOpenError without running it. That is the whole API. Everything else lives behind that one method: when to open, when to probe, how to count failures.

The breaker has three states and a small bag of knobs:

type State = "CLOSED" | "OPEN" | "HALF_OPEN";

type Outcome = { ok: boolean; at: number };

type BreakerConfig = {
  failureThreshold: number; // 0..1, e.g. 0.5
  windowMs: number;         // sliding window for outcomes
  minSamples: number;       // don't trip on 2 calls
  cooldownMs: number;       // initial OPEN duration
  maxCooldownMs: number;    // cap for the exponential ramp
};

class CircuitOpenError extends Error {
  constructor(public readonly retryAt: number) {
    super("circuit open");
    this.name = "CircuitOpenError";
  }
}

failureThreshold is a fraction, not a count. A fixed count of failures is a bad signal: 5 failures in 10 calls and 5 failures in 10,000 calls mean very different things. The window plus the threshold asks the question you actually want answered: did at least 50% of calls fail over the last 10 seconds?

minSamples keeps the breaker from tripping on a tiny sample size. Two failures in two calls is a 100% failure rate, but it is also two calls. Wait until you have a real sample.

cooldownMs and maxCooldownMs define the exponential ramp. Start at the initial cooldown; double on every failed probe; cap at maxCooldownMs so the breaker never gets stuck waiting hours when the dependency comes back.

The sliding window

The window is the trickiest part of a breaker, and the place most naive implementations get it wrong.

A common shortcut is "ratio over the last N calls". That works when traffic is steady, and lies when traffic spikes. You really want "ratio over the last T milliseconds". Calls outside the window do not count.
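To make the difference concrete, here is a hypothetical comparison of the two window definitions on the same outcome stream: a minute of healthy low traffic, then a hard failure in the last half-second. The count-based ratio stays under a 0.5 threshold while the time-based one trips.

```typescript
type Outcome = { ok: boolean; at: number };

// "Ratio over the last N calls" (the shortcut):
function lastNFailureRate(outcomes: Outcome[], n: number): number {
  const recent = outcomes.slice(-n);
  if (recent.length === 0) return 0;
  return recent.filter((o) => !o.ok).length / recent.length;
}

// "Ratio over the last T milliseconds" (what you actually want):
function lastTFailureRate(
  outcomes: Outcome[],
  now: number,
  windowMs: number,
): number {
  const recent = outcomes.filter((o) => o.at >= now - windowMs);
  if (recent.length === 0) return 0;
  return recent.filter((o) => !o.ok).length / recent.length;
}

// A minute of healthy traffic at 1 rps, then the dependency dies and
// 40 calls fail in the last half-second.
const outcomes: Outcome[] = [];
for (let i = 0; i < 60; i++) outcomes.push({ ok: true, at: i * 1_000 });
for (let i = 0; i < 40; i++) outcomes.push({ ok: false, at: 59_500 + i * 10 });

lastNFailureRate(outcomes, 100);            // 0.4: under a 0.5 threshold, no trip
lastTFailureRate(outcomes, 60_000, 10_000); // 0.8: trips immediately
```

The dependency is 100% down right now, but the count-based window dilutes that with stale successes from the quiet period.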

The minimal correct data structure is an array of Outcome objects. Push on each call, drop entries older than windowMs on every read. That is O(n) per check, but n is bounded by your traffic in the window, and the array stays small in practice (a couple of hundred entries even on a hot path).

function pruneOldOutcomes(
  outcomes: Outcome[],
  now: number,
  windowMs: number,
): Outcome[] {
  const cutoff = now - windowMs;
  let i = 0;
  while (i < outcomes.length && outcomes[i].at < cutoff) i++;
  return i === 0 ? outcomes : outcomes.slice(i);
}

function failureRate(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  const failures = outcomes.reduce((n, o) => n + (o.ok ? 0 : 1), 0);
  return failures / outcomes.length;
}

Outcomes are pushed in time order, so pruning is a single while from the front. No date comparisons inside reduce, no allocations on the hot path beyond the slice when there is something to drop.

If you have a service that handles tens of thousands of requests per second through a single breaker, swap the array for a small ring buffer. For everything else (and most things are everything else), the array is fine.
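For completeness, a minimal sketch of that swap (hypothetical and untuned): a fixed-capacity ring that overwrites the oldest outcome on push and still filters by timestamp on read, so the "last T milliseconds" semantics survive.

```typescript
type Outcome = { ok: boolean; at: number };

// Hypothetical fixed-capacity ring of outcomes for very hot paths.
// Push is O(1) with no allocation; reads still honour the time window
// by skipping entries older than the cutoff.
class OutcomeRing {
  private buf: (Outcome | undefined)[];
  private head = 0;
  private size = 0;

  constructor(private capacity: number) {
    this.buf = new Array(capacity);
  }

  push(o: Outcome): void {
    this.buf[this.head] = o; // overwrite the oldest slot
    this.head = (this.head + 1) % this.capacity;
    if (this.size < this.capacity) this.size++;
  }

  failureRate(now: number, windowMs: number): number {
    const cutoff = now - windowMs;
    let total = 0;
    let fails = 0;
    for (let i = 0; i < this.size; i++) {
      const o = this.buf[i]!;
      if (o.at >= cutoff) {
        total++;
        if (!o.ok) fails++;
      }
    }
    return total === 0 ? 0 : fails / total;
  }
}
```

The trade-off: once the ring wraps, outcomes older than the last `capacity` calls are gone even if they are inside the time window, so size the capacity to comfortably exceed your window's worst-case call count.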

The half-open probe

The half-open state is where breakers earn their keep, and where they most often have bugs.

The rule is: while in HALF-OPEN, allow exactly one call through. Not "one per second". Not "a few in parallel". One. If it succeeds, close the circuit. If it fails, open it again with a longer cooldown.

The reason for the rule: when you are in HALF-OPEN, you are gambling that the dependency has recovered. If you let ten parallel probes through and the dependency is still broken, you have just sent ten failing requests when one was enough to learn the same thing. Worse, if the dependency is recovering (say, a database is replaying its WAL), those ten probes might be the thing that pushes it back over.

Implementing "exactly one probe" in JavaScript is one boolean. The call method checks it in the synchronous prelude, before the first await:

if (state === "HALF_OPEN") {
  if (probeInFlight) {
    throw new CircuitOpenError(nextRetryAt);
  }
  probeInFlight = true;
}

JavaScript is single-threaded for synchronous code, so two call invocations cannot both pass that gate. The probeInFlight = true write happens before either ever yields to the event loop. The second one sees the flag and short-circuits.

This is one of those small places where the JS event-loop model makes things cleaner than they look in pseudocode. In a multi-threaded language you would need a lock or an atomic CAS. In a single Node process, you do not.
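A stripped-down sketch makes the guarantee visible. The names here are illustrative, not the breaker's real internals: two invocations run their synchronous check-and-set back to back, before either function ever awaits, so only the first acquires the probe.

```typescript
// Illustrative stand-in for the breaker's half-open gate. Both calls
// below execute their synchronous prelude before either yields to the
// event loop, so the outcome is deterministic in a single Node process.
let probeInFlight = false;
const order: string[] = [];

function enterHalfOpenGate(id: string): boolean {
  if (probeInFlight) {
    order.push(`${id}: short-circuited`); // would throw CircuitOpenError
    return false;
  }
  probeInFlight = true; // this write happens before any await
  order.push(`${id}: probe acquired`);
  return true;
}

// Two "concurrent" calls arriving while the circuit is half-open:
const first = enterHalfOpenGate("call-1");  // true: this one probes
const second = enterHalfOpenGate("call-2"); // false: fails fast
```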

The 80-line implementation

The whole breaker, runnable as a single file with no dependencies:

type State = "CLOSED" | "OPEN" | "HALF_OPEN";
type Outcome = { ok: boolean; at: number };

type BreakerConfig = {
  failureThreshold: number;
  windowMs: number;
  minSamples: number;
  cooldownMs: number;
  maxCooldownMs: number;
};

export class CircuitOpenError extends Error {
  constructor(public readonly retryAt: number) {
    super("circuit open");
    this.name = "CircuitOpenError";
  }
}

export function createBreaker(cfg: BreakerConfig) {
  let state: State = "CLOSED";
  let outcomes: Outcome[] = [];
  let cooldown = cfg.cooldownMs;
  let nextRetryAt = 0;
  let probeInFlight = false;

  const now = () => Date.now();

  const recordOutcome = (ok: boolean) => {
    outcomes.push({ ok, at: now() });
    const cutoff = now() - cfg.windowMs;
    let i = 0;
    while (i < outcomes.length && outcomes[i].at < cutoff) i++;
    if (i > 0) outcomes = outcomes.slice(i);
  };

  const shouldTrip = () => {
    if (outcomes.length < cfg.minSamples) return false;
    const fails = outcomes.reduce((n, o) => n + (o.ok ? 0 : 1), 0);
    return fails / outcomes.length >= cfg.failureThreshold;
  };

  const open = () => {
    state = "OPEN";
    nextRetryAt = now() + cooldown;
    cooldown = Math.min(cooldown * 2, cfg.maxCooldownMs);
  };

  const close = () => {
    state = "CLOSED";
    outcomes = [];
    cooldown = cfg.cooldownMs;
    probeInFlight = false;
  };

  return {
    call: async <T>(fn: () => Promise<T>): Promise<T> => {
      if (state === "OPEN") {
        if (now() < nextRetryAt) throw new CircuitOpenError(nextRetryAt);
        state = "HALF_OPEN";
        probeInFlight = false;
      }
      if (state === "HALF_OPEN") {
        if (probeInFlight) throw new CircuitOpenError(nextRetryAt);
        probeInFlight = true;
      }

      try {
        const result = await fn();
        if (state === "HALF_OPEN") close();
        else recordOutcome(true);
        return result;
      } catch (err) {
        if (state === "HALF_OPEN") {
          probeInFlight = false;
          open();
        } else {
          recordOutcome(false);
          if (shouldTrip()) open();
        }
        throw err;
      }
    },
    state: () => state,
  };
}

Eighty lines, one class for the error, one factory function. No timers, no setInterval, no background worker. The state transitions all happen on the call path, which means the breaker has zero overhead when nothing is calling it. That matters when you have a hundred breakers across a service, most of them idle.

The state === "OPEN" block does a lazy transition to HALF-OPEN when the cooldown has expired. No timer fires the transition; the next call observes that now() is past nextRetryAt and flips the state inline. Less moving machinery, less to test.

The half-open success path calls close(), which resets cooldown back to the initial value. A successful probe is the signal that the dependency is healthy again, and you want the next failure to start at the short cooldown, not at whatever the ramp had reached.

The half-open failure path calls open(), which doubles cooldown up to the cap. If the dependency was actually broken, the next probe is twice as far away. After enough failed probes, cooldown saturates at maxCooldownMs (typically around 5 minutes) and stays there until something works.

The probeInFlight = false on the OPEN→HALF_OPEN transition is the one subtle line. The failure path already clears the flag before reopening the circuit, but resetting it again on entry keeps the invariant local: every half-open window starts with no probe in flight, however the previous one ended. A flag stuck at true would make the half-open window reject everything indefinitely, so reset it at the state boundary rather than trusting every exit path to have done it.

Wiring it into a real service

The breaker takes any function. The interesting question is what fn looks like at the call site.

For a fetch to a downstream HTTP service:

const paymentBreaker = createBreaker({
  failureThreshold: 0.5,
  windowMs: 10_000,
  minSamples: 20,
  cooldownMs: 1_000,
  maxCooldownMs: 60_000,
});

async function chargeCard(req: ChargeRequest): Promise<ChargeResult> {
  return paymentBreaker.call(async () => {
    const res = await fetch(`${PAYMENT_URL}/charge`, {
      method: "POST",
      body: JSON.stringify(req),
      signal: AbortSignal.timeout(2_000),
    });
    if (!res.ok) throw new Error(`payment ${res.status}`);
    return res.json() as Promise<ChargeResult>;
  });
}

The AbortSignal.timeout(2_000) is doing real work here. A breaker without per-call timeouts is half-built — if the underlying call never returns, the breaker never sees a failure, and the queue floods anyway. Combine the two: short timeout per call, breaker on top of the timeout. The breaker catches the timeouts as failures, and after enough of them, the breaker stops issuing the calls altogether.
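AbortSignal.timeout only helps when the callee accepts a signal, as fetch does. For an arbitrary promise-returning call, a small wrapper gives you the same deadline. This is a hypothetical sketch, with one caveat worth stating in the code: a race-style timeout rejects the caller but does not cancel the underlying work.

```typescript
// Hypothetical deadline wrapper for calls that do not take an AbortSignal.
// The returned promise rejects after `ms`, but the underlying work is NOT
// cancelled; it keeps running and its eventual result is discarded.
function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    fn().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Composed with the breaker (client.query is an illustrative name):
// breaker.call(() => withTimeout(() => client.query(sql), 500));
```

The breaker then counts the timeout rejection as a failure, exactly like a 502.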

The same shape works for anything that returns a Promise — a Drizzle query against a read replica, a gRPC or tRPC client, any thenable. One breaker per downstream, not one per route. The breaker is modelling the health of the dependency, not the health of one of your endpoints. Sharing a breaker across multiple downstream targets means a single bad target poisons unrelated traffic.
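One way to enforce "one breaker per downstream" is a small registry keyed by dependency name. This is a hypothetical sketch: makeBreakerRegistry and the Breaker type are illustrative names, and in practice the factory would close over createBreaker and a per-dependency config.

```typescript
// The call signature from this post, as a structural type.
type Breaker = { call: <T>(fn: () => Promise<T>) => Promise<T> };

// Hypothetical registry: breakers are created lazily and cached, so
// every route that talks to the same downstream shares one breaker.
function makeBreakerRegistry(factory: () => Breaker) {
  const breakers = new Map<string, Breaker>();
  return (dependency: string): Breaker => {
    let breaker = breakers.get(dependency);
    if (!breaker) {
      breaker = factory(); // one breaker per downstream
      breakers.set(dependency, breaker);
    }
    return breaker;
  };
}

// Usage (illustrative):
// const breakerFor = makeBreakerRegistry(() => createBreaker(defaultConfig));
// breakerFor("payments").call(() => fetch(...));
```

Keying by dependency rather than by route is the point: two routes that both hit the payment service should trip, and recover, together.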

Production knobs

The five config values are a small surface, but each one matters.

failureThreshold is usually 0.4–0.6 for user-facing dependencies. Lower than 0.4 and you trip on transient noise. Higher than 0.6 and you absorb the latency hit longer than you need to. For non-critical dependencies (analytics writes, log shipping) you can go higher; you are willing to tolerate more pain to avoid losing data.

windowMs depends on your traffic shape. 10–30 seconds is the right ballpark for a service that sees more than a few requests per second. For a low-traffic dependency, the minSamples floor kicks in before the window does, so the window can be longer.

minSamples prevents the breaker from tripping on outlier sequences. If your hot path sees 100 calls per second through this breaker, set minSamples around 20. If it sees 1 call per second, you may need to drop it to 5 — otherwise the breaker is useless during the actual incident, because you never accumulate enough samples to trip.

cooldownMs is the initial OPEN duration. Start short, 500ms to 2 seconds, so a brief blip costs almost nothing. The exponential ramp will grow it for you if the outage is real.

maxCooldownMs is the cap. 1–5 minutes is a sensible range. Past that, you are no longer running a breaker; you are running a static gate. Cap it where a human operator would step in anyway.

The biggest mistake I see is starting with cooldownMs: 30_000 because "a downstream outage is usually a minute or so". You do not want a 30-second penalty on a 200ms blip. Start short, let the ramp do its job, cap it where you stop caring about the difference.
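A sanity check worth doing when picking minSamples: how long the breaker takes to trip after the dependency goes hard-down, assuming every call now fails. A hypothetical back-of-envelope helper:

```typescript
// With a 100% failure rate, the threshold is satisfied as soon as
// minSamples outcomes accumulate, so worst-case time-to-trip is driven
// entirely by traffic through the breaker. (Hypothetical helper.)
function worstCaseTimeToTripMs(
  requestsPerSecond: number,
  minSamples: number,
): number {
  return (minSamples / requestsPerSecond) * 1_000;
}

// 100 rps with minSamples 20: trips in about 200ms.
// 1 rps with minSamples 20: trips only after 20 seconds of failures.
```

If that second number is longer than your upstream timeout budget, drop minSamples until it is not.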

Where libraries help and where they hurt

Cockatiel and opossum are the two well-known Node circuit-breaker libraries. They are good. They give you metrics emission, composition with retries and bulkheads, configurable failure predicates, and consistent telemetry across breakers. If you are running breakers across dozens of dependencies, those are real wins.

The 80-line breaker exists in this post because it is the version you can read end-to-end in five minutes and reason about under stress. Write the small breaker, and once you have three of them, refactor toward something library-shaped. Most services get more out of three breakers tuned to specific dependencies than out of one breaker library configured generically.

Forward motion

The breaker is the smallest 80 lines of code with the biggest behavioural payoff in a service. It does not prevent the downstream outage. It makes sure the outage stays downstream, instead of turning your latency graph into a step function while your queue floods and unrelated routes start failing.

Once the breaker is in, the next pieces are obvious: per-call timeouts (do this on the same day; the breaker is half-effective without them), retries with backoff (handle the transient noise without involving the breaker), bulkheads (cap the concurrency a single dependency can consume from your worker pool), metrics export (you want to see breaker state transitions in your dashboard alongside latency).
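Of those, retries with backoff is the piece most people hand-roll next. A hypothetical sketch with exponential backoff and full jitter; whether it wraps the breaker or sits inside it decides whether retried failures count toward the breaker's window.

```typescript
// Hypothetical retry wrapper: up to `attempts` tries, with exponential
// backoff and full jitter between failed attempts.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts: number,
  baseDelayMs: number,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // Full jitter: a random delay in [0, base * 2^i).
        const delay = Math.random() * baseDelayMs * 2 ** i;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastErr;
}
```

Putting the retry outside the breaker means a short-circuited call is retried against a closed gate and fails fast each time; putting it inside means each retry is a fresh outcome in the window. Both are defensible; pick one deliberately.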

That sequence is the resilience kit for a single dependency: timeout, breaker, retry, bulkhead, metric. The breaker is the smallest piece of it and the one with the biggest behavioural change. Ship it on a Tuesday, see one downstream blip in a real outage, and you will never run a service without one again.


If this was useful

A circuit breaker is a tiny piece of code that changes how a whole service behaves under stress. The same is true of most of TypeScript's production layer — tsconfig flags, build choices, monorepo wiring, library-author defaults. TypeScript in Production is the book for that layer.

The full TypeScript Library, five books:

  • TypeScript Essentials — daily-driver TS across Node, Bun, Deno, and the browser: Amazon
  • The TypeScript Type System — generics, mapped/conditional types, template literals, branded types: Amazon
  • Kotlin and Java to TypeScript — bridge for JVM developers: Amazon
  • PHP to TypeScript — bridge for modern PHP 8+ developers: Amazon
  • TypeScript in Production — tsconfig, build, monorepos, library authoring: Amazon

Books 1 and 2 are the core path. Books 3 and 4 substitute for readers coming from JVM or PHP. Book 5 is for shipping TS at work.

All five books ship in ebook, paperback, and hardcover.

The TypeScript Library — the 5-book collection
