Wilson Xu

Building Resilient Microservices with Circuit Breakers in Node.js

Microservices fail. Not if — when. A database goes down, a third-party API rate-limits you, a downstream service redeploys and spends 30 seconds restarting. What separates resilient systems from fragile ones is not preventing these failures but containing them. The circuit breaker pattern is the most battle-tested tool for that job.

In this article you'll implement circuit breakers in Node.js using the opossum library — the de facto standard in the Node ecosystem. You'll cover the three states of a circuit breaker, timeout and threshold configuration, fallback strategies, integration with metrics and alerting, and a testing strategy that validates failure behaviour without relying on actual network outages.

All examples target Node.js 20 LTS and use TypeScript-friendly patterns. Treat the code as a production-oriented starting point and tune the thresholds against your own traffic.


Why Circuit Breakers Exist

Imagine a payments service that calls an upstream fraud-detection API. The fraud API starts returning errors. Without a circuit breaker, every payment request waits the full HTTP timeout (say, 10 seconds) before failing. With 200 RPS hitting your payments service, pending requests pile up faster than they drain, and you rapidly exhaust sockets, connection pools, and memory. The payments service itself goes down, not because of its own bug, but because of a cascade from a dependency.
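A rough estimate shows how quickly the backlog grows. The sketch below applies Little's law (average in-flight requests = arrival rate × time per request) to the numbers above; the 50 ms healthy latency is an assumed figure for comparison, not a measurement.

```javascript
// Little's law: average in-flight requests = arrival rate x time per request.
function inFlightRequests(requestsPerSecond, latencyMs) {
  return (requestsPerSecond * latencyMs) / 1000;
}

// Healthy dependency: 200 RPS at an assumed 50 ms per call
const healthy = inFlightRequests(200, 50);
// Failing dependency: 200 RPS, each call waiting out the full 10 s timeout
const degraded = inFlightRequests(200, 10000);

console.log({ healthy, degraded }); // { healthy: 10, degraded: 2000 }
```

Going from about ten concurrent calls to about two thousand is what drains connection pools and memory, even though the incoming traffic has not changed.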

The circuit breaker stops this cascade. Once failures cross a configurable threshold, it "opens" the circuit and immediately rejects new requests to the failing dependency: no waiting, no resource exhaustion. After a cooldown period, it allows a probe request through. If that succeeds, it "closes" again.

The three states:

  • Closed: Everything works. Calls flow through normally. Failures are counted.
  • Open: Too many failures. Calls are short-circuited immediately with an error (or fallback). No real requests are made to the failing service.
  • Half-Open: After the reset timeout, one test request is allowed. Success closes the circuit; failure re-opens it.
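The three states can be sketched as a toy state machine. This is an illustration only, not opossum's implementation: the class name, the option names, and the simple consecutive-failure counter are all assumptions made for the example (opossum uses a rolling error percentage instead).

```javascript
// Toy circuit breaker that illustrates the three states.
// Not opossum's implementation; every name here is illustrative.
class TinyBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, handy for tests
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(fn) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('circuit open'); // short-circuit: fn is never called
      }
      this.state = 'half-open'; // cooldown elapsed: let a probe through
    }
    try {
      const result = await fn();
      this.state = 'closed'; // any success, including a probe, closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open'; // trip, or re-open after a failed probe
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A real implementation also has to limit concurrent probes while half-open and expire old failures, which is what a library like opossum handles for you.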

Installing Opossum

npm install opossum
# Optional but recommended: Prometheus metrics
npm install opossum-prometheus prom-client

Opossum is maintained by the Nodeshift community, which is backed by Red Hat, and has been in production at Red Hat and IBM for years. It has zero mandatory dependencies and works with any function that returns a Promise.


Your First Circuit Breaker

The simplest case: wrap an HTTP call.

import CircuitBreaker from 'opossum';
import axios from 'axios';

async function callFraudAPI(transactionId) {
  const response = await axios.get(
    `https://fraud-api.internal/check/${transactionId}`,
    { timeout: 3000 }
  );
  return response.data;
}

const breaker = new CircuitBreaker(callFraudAPI, {
  timeout: 3000,          // If the function takes longer, it's a failure
  errorThresholdPercentage: 50,  // Open after 50% of requests fail
  resetTimeout: 30000,    // Try again after 30 seconds in open state
  volumeThreshold: 5,     // Minimum requests before % threshold applies
});

// Usage is identical to calling the function directly
try {
  const result = await breaker.fire('txn-abc-123');
  console.log('Fraud check passed:', result);
} catch (err) {
  // Could be a real error OR a circuit-open rejection.
  // opossum marks its own open-circuit rejections with err.code === 'EOPENBREAKER'.
  if (err.code === 'EOPENBREAKER') {
    console.error('Circuit open — fraud API is down');
  } else {
    console.error('Fraud check failed:', err.message);
  }
}

breaker.fire() is a drop-in for callFraudAPI(). It accepts the same arguments, returns the same promise, and throws the same errors — plus the additional circuit-open error when the breaker trips. Your callers don't need to know about the circuit breaker at all.


Configuration Deep Dive

Opossum's defaults (a 10-second timeout, a 50% error threshold, a 30-second reset, and no minimum request volume) rarely fit a real system unchanged. Tune them:

const breaker = new CircuitBreaker(callFraudAPI, {
  // How long to wait for the protected function (ms)
  timeout: 3000,

  // % of failures in the rolling window to open the circuit
  errorThresholdPercentage: 50,

  // Minimum number of requests in the window before % applies
  // Prevents a single failure from opening a fresh circuit
  volumeThreshold: 10,

  // How long to stay open before trying half-open (ms)
  resetTimeout: 30000,

  // Rolling window size (ms) — how far back failures are counted
  rollingCountTimeout: 10000,

  // Number of buckets in the rolling window
  // (rollingCountTimeout / rollingCountBuckets = bucket duration)
  rollingCountBuckets: 10,

  // Errors that should NOT count as failures
  // Useful for 4xx errors that aren't the dependency's fault
  // errorFilter returns true for errors the breaker should ignore
  errorFilter: (err) => {
    // Don't trip the breaker on client errors
    const status = err.response?.status;
    return status >= 400 && status < 500;
  },
});

The errorFilter hook is underused and critically important. A 404 from your fraud API probably means the transaction ID is invalid, which is not a service failure. A 503 is. A filtered error still rejects the promise returned by fire(), so callers see it, but it is left out of the failure statistics. Without errorFilter, a flood of bad client requests can trip your circuit breaker and make healthy services appear broken.
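To make the interaction between the window, volume, and percentage options concrete, here is a simplified model of the decision to open. It is a sketch under stated assumptions: the function name and the bucket shape are invented for illustration, and opossum's internal bookkeeping differs in detail.

```javascript
// Simplified model of a bucketed rolling window. Illustrative only:
// `shouldOpen` and the { fires, failures } bucket shape are assumptions.
function shouldOpen(buckets, { volumeThreshold, errorThresholdPercentage }) {
  const fires = buckets.reduce((sum, b) => sum + b.fires, 0);
  const failures = buckets.reduce((sum, b) => sum + b.failures, 0);
  // Too little traffic in the window to judge the dependency
  if (fires === 0 || fires < volumeThreshold) return false;
  const errorRate = (failures / fires) * 100;
  return errorRate > errorThresholdPercentage;
}

// rollingCountTimeout 10000 ms / rollingCountBuckets 10 = one bucket per second.
const quiet = [{ fires: 2, failures: 2 }]; // 100% errors, but only 2 calls
const busy = [
  { fires: 8, failures: 3 },
  { fires: 6, failures: 5 },
]; // 8 of 14 calls failed, about 57%

const options = { volumeThreshold: 10, errorThresholdPercentage: 50 };
console.log(shouldOpen(quiet, options)); // false
console.log(shouldOpen(busy, options)); // true
```

The quiet window shows why volumeThreshold matters: two failed calls during a lull are a 100% error rate, yet they say very little about the health of the dependency.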


Fallback Strategies

An open circuit breaker should degrade gracefully, not just throw errors at your users. Opossum's .fallback() method registers a function that runs whenever the circuit is open or the primary function fails:

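// The strategies below are alternatives: each breaker.fallback() call
// replaces the previously registered fallback, so pick one per breaker.

// Strategy 1: Return a safe default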
breaker.fallback(() => ({ approved: true, confidence: 0, cached: true }));

// Strategy 2: Read from cache
breaker.fallback(async (transactionId) => {
  const cached = await redis.get(`fraud:${transactionId}`);
  if (cached) return JSON.parse(cached);
  // If no cache, fail-open (allow the transaction)
  return { approved: true, confidence: 0, source: 'fallback' };
});

// Strategy 3: Try an alternative endpoint
const backupBreaker = new CircuitBreaker(callBackupFraudAPI, { timeout: 2000 });
breaker.fallback((transactionId) => backupBreaker.fire(transactionId));

The fallback receives the same arguments as the original function, followed by the error that triggered it. Strategy 3, chaining circuit breakers, is powerful but dangerous: if your backup is also unhealthy, you've just added latency. Use it only when the backup is independently reliable (different provider, different region).
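Because opossum passes the triggering error to the fallback as its final argument, a fallback can record why it ran instead of assuming the circuit was open. The following is a minimal sketch: the function name fraudFallback and the reason field are hypothetical, while EOPENBREAKER and ETIMEDOUT are the codes opossum sets on its own open-circuit and timeout errors.

```javascript
// Hypothetical fallback for the fraud check. opossum calls the fallback with
// the original arguments followed by the triggering error.
function fraudFallback(transactionId, err) {
  let reason = 'upstream-error';
  if (err && err.code === 'EOPENBREAKER') reason = 'circuit-open';
  else if (err && err.code === 'ETIMEDOUT') reason = 'timeout';

  // Fail open with a safe default, and keep the reason for later analysis
  return { transactionId, approved: true, confidence: 0, source: 'fallback', reason };
}
```

Register it with breaker.fallback(fraudFallback). Keeping the reason in the result makes it possible to tell short-circuited calls apart from individual upstream failures in logs.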

For critical payment flows, fail-open (allow the transaction without a fraud check) is often preferable to blocking legitimate customers. Flag the transaction for async review instead:

breaker.fallback(async (transactionId) => {
  await queue.publish('fraud-review-needed', { transactionId, reason: 'circuit-open' });
  return { approved: true, requiresReview: true };
});

Event-Driven Monitoring

Opossum is an EventEmitter. Every state transition, every timeout, every rejection fires an event. Tap into these for logging and alerting:

breaker.on('open', () => {
  logger.error({ service: 'fraud-api', state: 'open' }, 'Circuit breaker opened');
  alerts.trigger('circuit_breaker_opened', { service: 'fraud-api' });
});

breaker.on('halfOpen', () => {
  logger.warn({ service: 'fraud-api', state: 'half-open' }, 'Circuit breaker probing');
});

breaker.on('close', () => {
  logger.info({ service: 'fraud-api', state: 'closed' }, 'Circuit breaker recovered');
  alerts.resolve('circuit_breaker_opened', { service: 'fraud-api' });
});

breaker.on('fallback', (result) => {
  logger.warn({ service: 'fraud-api', result }, 'Fallback activated');
});

breaker.on('timeout', () => {
  metrics.increment('circuit_breaker.timeout', { service: 'fraud-api' });
});

breaker.on('reject', () => {
  metrics.increment('circuit_breaker.rejected', { service: 'fraud-api' });
});

breaker.on('success', (result, latency) => {
  metrics.histogram('circuit_breaker.latency', latency, { service: 'fraud-api' });
});

The success event includes the call latency in milliseconds. Use it to track P99 latency in Prometheus or Datadog during normal, closed-circuit operation, so you can spot a dependency slowing down before it starts failing.


Prometheus Integration

A companion package, opossum-prometheus, exports circuit breaker metrics to a prom-client registry:

import CircuitBreaker from 'opossum';
import PrometheusMetrics from 'opossum-prometheus'; // the package exports the class directly
import { Registry } from 'prom-client';
import express from 'express';

const registry = new Registry();

const breaker = new CircuitBreaker(callFraudAPI, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
});

// Register the breaker with Prometheus — this adds standard circuit breaker metrics
new PrometheusMetrics({ circuits: [breaker], registry });

const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

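This exports the breaker's activity as labelled metrics. In recent opossum-prometheus releases the names are as follows (they have changed between versions, so confirm against your own /metrics output):

  • circuit — counter with name and event labels; event takes opossum event names such as success, failure, reject, timeout, fallback, open, halfOpen, and close
  • circuit_perf — latency summary with the same labels, exposed when performance metrics are enabled

Pair this with a Grafana alert that fires when the rate of circuit{event="reject"} for a breaker stays above zero for more than 60 seconds, which means the circuit has been short-circuiting calls for that long.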


Wrapping Multiple Services

In a real microservices architecture, you'll have breakers for every external dependency. Create a factory to keep configuration consistent:

// src/breaker-factory.js
import CircuitBreaker from 'opossum';
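import PrometheusMetrics from 'opossum-prometheus'; // the package exports the class directly
import { registry } from './metrics.js';
import { logger } from './logger.js'; // assumed: your application's logger module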

const DEFAULT_OPTIONS = {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  volumeThreshold: 10,
  rollingCountTimeout: 10000,
  // errorFilter returns true for errors that should NOT count as failures (4xx)
  errorFilter: (err) => err.response?.status >= 400 && err.response?.status < 500,
};

const breakers = [];

export function createBreaker(fn, options = {}) {
  // Name the breaker explicitly so logs and metrics can identify it
  const name = options.name || fn.name || 'anonymous';
  const breaker = new CircuitBreaker(fn, { ...DEFAULT_OPTIONS, ...options, name });

  // Structured logging for all state transitions
  breaker.on('open',     () => logger.error(`[CB:${name}] OPENED`));
  breaker.on('halfOpen', () => logger.warn(`[CB:${name}] HALF-OPEN`));
  breaker.on('close',    () => logger.info(`[CB:${name}] CLOSED`));

  breakers.push(breaker);
  return breaker;
}

// Register all breakers with Prometheus at once
export function registerMetrics() {
  new PrometheusMetrics({ circuits: breakers, registry });
}

Usage:

import { createBreaker } from './breaker-factory.js';

export const fraudBreaker    = createBreaker(callFraudAPI,    { timeout: 3000 });
export const inventoryBreaker = createBreaker(callInventoryAPI, { timeout: 2000 });
export const emailBreaker    = createBreaker(sendEmail,       { timeout: 10000, resetTimeout: 60000 });

Testing Circuit Breaker Behaviour

You should never rely on actual network failures in tests. Test the breaker logic directly:

// tests/circuit-breaker.test.js
import CircuitBreaker from 'opossum';

describe('fraud API circuit breaker', () => {
  let callCount;
  let shouldFail;
  let breaker;

  const mockFraudAPI = async () => {
    callCount++;
    if (shouldFail) throw new Error('Service unavailable');
    return { approved: true };
  };

  // Fire enough failing calls to pass volumeThreshold and open the circuit.
  // The fallback resolves every call, so fire() never rejects here.
  const tripBreaker = async () => {
    shouldFail = true;
    for (let i = 0; i < 4; i++) {
      await breaker.fire();
    }
  };

  beforeEach(() => {
    callCount = 0;
    shouldFail = false;
    // Fresh breaker per test so state never leaks between tests
    breaker = new CircuitBreaker(mockFraudAPI, {
      errorThresholdPercentage: 50,
      volumeThreshold: 3,
      resetTimeout: 100, // Fast reset for tests
    });
    breaker.fallback(() => ({ approved: true, fallback: true }));
  });

  afterEach(() => {
    breaker.shutdown(); // Clear the breaker's internal timers
  });

  test('calls the function normally when closed', async () => {
    const result = await breaker.fire();
    expect(result.approved).toBe(true);
    expect(result.fallback).toBeUndefined();
  });

  test('opens after threshold failures', async () => {
    await tripBreaker();
    expect(breaker.opened).toBe(true);
  });

  test('uses fallback when open', async () => {
    await tripBreaker();

    shouldFail = false; // Fix the upstream
    // Circuit is still open, so the fallback runs and the function is not called
    const previousCallCount = callCount;
    const result = await breaker.fire();

    expect(result.fallback).toBe(true);
    expect(callCount).toBe(previousCallCount); // No new calls made
  });

  test('recovers after resetTimeout', async () => {
    await tripBreaker();
    shouldFail = false;

    // Wait for half-open
    await new Promise(r => setTimeout(r, 150));

    const result = await breaker.fire();
    expect(result.approved).toBe(true);
    expect(result.fallback).toBeUndefined();
    expect(breaker.closed).toBe(true);
  });
});

These tests give you confidence in the circuit breaker's state machine without any network calls. Run them in CI on every commit.


Common Pitfalls

Setting volumeThreshold too low. A threshold of 1 means a single failed request during a quiet period opens the circuit. This causes false positives. Start at 10 and tune from real traffic.

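Not configuring errorFilter. Client errors (4xx) should not trip circuit breakers protecting server-side dependencies. Opossum's errorFilter option receives each error and returns true for those that should not count as failures; set it for every HTTP-backed breaker.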

Forgetting to name your breakers. Prometheus metrics are labelled with the circuit breaker name. When no name is given, opossum falls back to the wrapped function's name, and for anonymous functions to a generated identifier that is useless in dashboards. Set name in options: new CircuitBreaker(fn, { name: 'fraud-api' }).

Opening and closing too aggressively. Flapping breakers — opening and closing repeatedly — cause more harm than a stable failure. Tune resetTimeout to be longer than your service's expected restart time. 30–60 seconds is a reasonable starting point.


Key Takeaways

Circuit breakers are the difference between a single-service outage and a full-system cascade. With opossum in Node.js:

  • Wrap every network call or external dependency in a CircuitBreaker instance.
  • Configure errorFilter to exclude 4xx errors from the failure count.
  • Always register a fallback — fail gracefully, not loudly.
  • Emit structured logs and Prometheus metrics on every state transition.
  • Test the state machine with mocked functions in CI, not production traffic.

The pattern costs about 20 lines of setup per service boundary. The payoff is measured in incidents-that-never-happened.
