Distributed systems fail. That's not pessimism — it's physics. A network glitch, a slow database, a third-party API hiccup: any one of these can cascade into a full-blown outage if your API doesn't know how to absorb the shock. In 2026, with most production Node.js architectures spanning multiple microservices, AI model endpoints, and external data providers, resilience patterns are no longer optional.
This guide covers three battle-tested patterns — Circuit Breaker, Bulkhead, and Retry with Exponential Backoff — and shows you how to implement them in Node.js using Opossum 9.0, the most widely-used circuit breaker library for Node.js.
Why Your API Needs Resilience Patterns in 2026
Modern APIs rarely operate in isolation. A typical production service in 2026 makes calls to:
- External AI APIs (OpenAI, Anthropic, Gemini) that can spike in latency under load
- Third-party payment gateways with SLA-defined uptime of 99.9% (meaning ~8.7 hours downtime/year)
- Internal microservices with their own dependency chains
- Databases and caches that can become overwhelmed during traffic surges
Without resilience patterns, a single slow dependency can exhaust your thread pool, queue, or connection pool — causing a cascading failure that takes down your entire service. The three patterns we'll cover are your first line of defense.
Pattern 1: Circuit Breaker
The Circuit Breaker pattern (popularized by Michael Nygard's Release It!) works exactly like an electrical circuit breaker: when failures exceed a threshold, it "trips open" and stops sending requests to the failing service, returning an immediate fallback instead of waiting for a timeout.
Three States of a Circuit Breaker
CLOSED ──(failures > threshold)──► OPEN ──(reset timeout)──► HALF-OPEN ──(success)──► CLOSED
                                    ▲                             │
                                    └──────────(failure)──────────┘
- CLOSED: Normal operation. Requests flow through. Failures are counted.
- OPEN: The circuit has tripped. All requests immediately return a fallback. No traffic hits the failing service.
- HALF-OPEN: After a reset timeout, one probe request is allowed through. Success closes the circuit; failure reopens it.
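To make the transitions concrete, here is a deliberately minimal state machine sketching the same logic. This is an illustration only, not Opossum's implementation — real libraries add rolling windows, volume thresholds, and per-call timeouts — and the class and option names are made up:

```javascript
// Minimal circuit-breaker state machine (illustrative only).
// The clock is injectable so the reset timeout can be tested deterministically.
class TinyBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Should a request be allowed through right now?
  allowRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    return this.state !== 'OPEN';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure() {
    this.failures++;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'; // trip, or re-trip after a failed probe
      this.openedAt = this.now();
    }
  }
}
```

Opossum layers a rolling failure-rate window on top of this core idea, but the three states and their transitions are exactly these.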
Installing Opossum 9.x
npm install opossum@9
# TypeScript types
npm install -D @types/opossum
Note: Opossum 9.0.0 (released June 2025) dropped support for Node.js < 20 and aligns fully with the WHATWG Fetch API. If you're on Node.js 22 or 24, you're good to go.
Basic Circuit Breaker Implementation
// lib/circuit-breaker.js
import CircuitBreaker from 'opossum';

/**
 * Wraps an async function with a circuit breaker.
 * @param {Function} fn - The async function to protect
 * @param {Object} options - Opossum options
 */
export function createBreaker(fn, options = {}) {
  const defaults = {
    timeout: 5000,                // Fail if the call takes > 5s
    errorThresholdPercentage: 50, // Trip at a 50% failure rate
    resetTimeout: 30000,          // Try again after 30s
    volumeThreshold: 5,           // Min 5 requests before tripping
    rollingCountTimeout: 10000,   // Measure over a 10s window
  };
  const breaker = new CircuitBreaker(fn, { ...defaults, ...options });

  // Observability hooks
  breaker.on('open', () => console.warn(`[CircuitBreaker] OPEN — ${fn.name}`));
  breaker.on('halfOpen', () => console.info(`[CircuitBreaker] HALF-OPEN — ${fn.name}`));
  breaker.on('close', () => console.info(`[CircuitBreaker] CLOSED — ${fn.name}`));
  breaker.on('fallback', (result) => console.warn(`[CircuitBreaker] FALLBACK — ${fn.name}:`, result));

  return breaker;
}
Protecting an External API Call
// services/payment.service.js
import { createBreaker } from '../lib/circuit-breaker.js';

async function chargeCard(payload) {
  const response = await fetch('https://api.payment-provider.com/v1/charge', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(4500), // Hard abort at 4.5s
  });
  if (!response.ok) {
    throw new Error(`Payment API error: ${response.status}`);
  }
  return response.json();
}

// Fallback: queue for async retry
function paymentFallback(payload) {
  console.warn('Payment service unavailable — queuing for retry:', payload.orderId);
  return { queued: true, orderId: payload.orderId, retryAt: Date.now() + 60_000 };
}

const paymentBreaker = createBreaker(chargeCard, {
  timeout: 4000,                // Breaker times out before the 4.5s hard abort
  errorThresholdPercentage: 30, // Payment APIs: trip faster
  resetTimeout: 60000,          // Try again after 60s
});
paymentBreaker.fallback(paymentFallback);

export { paymentBreaker };
Using the Breaker in Your Route Handler
// routes/orders.js
import { paymentBreaker } from '../services/payment.service.js';

app.post('/orders', async (req, res) => {
  try {
    const result = await paymentBreaker.fire({
      orderId: req.body.orderId,
      amount: req.body.amount,
      currency: req.body.currency,
    });
    if (result.queued) {
      return res.status(202).json({
        message: 'Order accepted, payment will process shortly',
        orderId: result.orderId,
      });
    }
    res.json({ success: true, transactionId: result.transactionId });
  } catch (err) {
    // Reached only when no fallback is configured (or the fallback itself throws)
    res.status(503).json({ error: 'Service temporarily unavailable' });
  }
});
Exposing Circuit Breaker Stats
// GET /health/circuit-breakers
app.get('/health/circuit-breakers', (req, res) => {
res.json({
payment: {
state: paymentBreaker.opened ? 'OPEN' : paymentBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
stats: paymentBreaker.stats,
},
});
});
The stats object includes fires, successes, failures, rejects, timeouts, fallbacks, and percentiles for latency — perfect for Prometheus scraping or your OpenTelemetry pipeline.
Pattern 2: Retry with Exponential Backoff
Not every failure is catastrophic — sometimes a service hiccups for 200ms. The Retry pattern automatically re-attempts failed requests, while Exponential Backoff prevents thundering herds by spacing retries with increasing delays.
Key Rules for Retry in 2026
- Only retry idempotent operations — GET, PUT, DELETE are generally safe; POST requires care
- Add jitter — randomize backoff to prevent synchronized retry storms
- Set a maximum retry count — don't retry forever
- Respect Retry-After headers — external APIs will tell you when to retry
Manual Retry with Exponential Backoff
// lib/retry.js

/**
 * Retry an async function with exponential backoff + jitter.
 * @param {Function} fn - Async function to retry
 * @param {Object} opts - Retry configuration
 */
export async function withRetry(fn, opts = {}) {
  const {
    maxAttempts = 3,
    baseDelayMs = 200,
    maxDelayMs = 10_000,
    jitter = true,
    shouldRetry = () => true, // Custom predicate
  } = opts;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;

      // Don't retry on the last attempt
      if (attempt === maxAttempts) break;

      // Check if this error is retryable
      if (!shouldRetry(err)) throw err;

      // Respect Retry-After if the thrown error exposes the response headers (HTTP 429 / 503)
      const retryAfter = err.headers?.get?.('Retry-After');
      if (retryAfter) {
        const waitMs = parseInt(retryAfter, 10) * 1000;
        console.warn(`Rate limited. Waiting ${waitMs}ms before retry...`);
        await sleep(waitMs);
        continue;
      }

      // Exponential backoff: 200ms, 400ms, 800ms, ...
      const expDelay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
      const delay = jitter
        ? expDelay * (0.5 + Math.random() * 0.5) // 50–100% of exp delay
        : expDelay;

      console.info(`Retry ${attempt}/${maxAttempts - 1} in ${Math.round(delay)}ms...`);
      await sleep(delay);
    }
  }
  throw lastError;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
Combining Retry with Circuit Breaker
The real power comes from composing patterns. Here's the correct order:
Request → [Retry] → [Circuit Breaker] → External Service
The circuit breaker sits between retries and the service so it can track failure rates accurately:
// services/ai.service.js
import { createBreaker } from '../lib/circuit-breaker.js';
import { withRetry } from '../lib/retry.js';

// `fetchAICompletion` stands in for your actual model-calling function
const aiBreaker = createBreaker(fetchAICompletion, { timeout: 15_000 });

async function callAIModel(prompt) {
  return withRetry(
    () => aiBreaker.fire(prompt),
    {
      maxAttempts: 3,
      baseDelayMs: 500,
      // Don't retry when the circuit is open — the rejection is immediate
      shouldRetry: (err) => err.message !== 'Breaker is open',
    }
  );
}
Pattern 3: Bulkhead
Named after the watertight compartments in ships that prevent a single breach from sinking the whole vessel, the Bulkhead pattern isolates resource pools so that a surge in one area can't exhaust resources for everything else.
In Node.js APIs, the most common implementation is concurrency limiting per service — capping how many simultaneous calls can be in-flight to a given dependency.
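At its core a bulkhead is just a counting semaphore. Before reaching for a library, here is a dependency-free sketch of the idea (illustrative only — the p-limit version below is what the rest of this guide uses; the function name is made up):

```javascript
// Minimal bulkhead: at most `concurrency` calls run at once; anything
// beyond that is rejected immediately instead of being queued.
function createTinyBulkhead(concurrency) {
  let active = 0;
  return async function run(fn) {
    if (active >= concurrency) {
      throw new Error('Bulkhead full');
    }
    active++;
    try {
      return await fn();
    } finally {
      active--; // always release the slot, even on failure
    }
  };
}
```

Rejecting instead of queuing keeps latency bounded under overload; a queued variant (which p-limit gives you) trades those rejections for wait time.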
Simple Bulkhead with p-limit
npm install p-limit
// lib/bulkhead.js
import pLimit from 'p-limit';

/**
 * Creates a bulkhead: a concurrency-limited wrapper around a service.
 * @param {number} concurrency - Max simultaneous calls allowed
 */
export function createBulkhead(concurrency) {
  const limit = pLimit(concurrency);
  let rejectedCount = 0;

  return {
    /**
     * Execute a function within the bulkhead.
     * Allows at most one queued call; anything beyond that is rejected
     * immediately rather than left waiting.
     */
    async execute(fn) {
      if (limit.activeCount >= concurrency && limit.pendingCount > 0) {
        rejectedCount++;
        throw new Error(`Bulkhead full: ${concurrency} concurrent calls already in-flight`);
      }
      return limit(fn);
    },
    stats() {
      return {
        active: limit.activeCount,
        pending: limit.pendingCount,
        rejected: rejectedCount,
        concurrency,
      };
    },
  };
}
Using Bulkheads per Dependency
// services/dependencies.js
import { createBulkhead } from '../lib/bulkhead.js';

// Each external dependency gets its own resource pool
export const bulkheads = {
  database: createBulkhead(20),    // Max 20 DB queries in-flight
  aiModel: createBulkhead(5),      // AI calls are expensive — limit tightly
  emailSender: createBulkhead(10), // Email service concurrency
  paymentGw: createBulkhead(8),    // Payment gateway
};

// Usage in a service
import { bulkheads } from './dependencies.js';

async function generateAIResponse(prompt) {
  try {
    return await bulkheads.aiModel.execute(() =>
      fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.OPENAI_KEY}`,
        },
        body: JSON.stringify({ model: 'gpt-4o', messages: [{ role: 'user', content: prompt }] }),
      }).then((r) => r.json())
    );
  } catch (err) {
    if (err.message.startsWith('Bulkhead full')) {
      throw Object.assign(new Error('AI service at capacity'), { statusCode: 503 });
    }
    throw err;
  }
}
Putting It All Together: A Production-Ready Resilient Service
Here's a complete, production-ready service combining all three patterns:
// services/resilient-client.js
import CircuitBreaker from 'opossum';
import pLimit from 'p-limit';
import { withRetry } from '../lib/retry.js';

export class ResilientClient {
  constructor(name, fetchFn, options = {}) {
    this.name = name;

    // 1. Bulkhead: limit concurrency
    this.limit = pLimit(options.concurrency ?? 10);

    // 2. Circuit breaker: detect and isolate failures
    this.breaker = new CircuitBreaker(
      (args) => this.limit(() => fetchFn(args)),
      {
        name,
        timeout: options.timeout ?? 5000,
        errorThresholdPercentage: options.errorThreshold ?? 50,
        resetTimeout: options.resetTimeout ?? 30_000,
        volumeThreshold: options.volumeThreshold ?? 5,
      }
    );

    // Fallback
    if (options.fallback) {
      this.breaker.fallback(options.fallback);
    }

    // Observability
    ['open', 'halfOpen', 'close', 'fallback', 'timeout'].forEach((event) => {
      this.breaker.on(event, (...args) => {
        console.warn(`[${name}] circuit:${event}`, ...args);
      });
    });

    // 3. Retry config
    this.retryOptions = {
      maxAttempts: options.maxAttempts ?? 3,
      baseDelayMs: options.baseDelayMs ?? 300,
      shouldRetry: options.shouldRetry ?? ((err) => !err.message.includes('Breaker is open')),
    };
  }

  async call(args) {
    return withRetry(() => this.breaker.fire(args), this.retryOptions);
  }

  health() {
    return {
      name: this.name,
      state: this.breaker.opened ? 'OPEN' : this.breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
      concurrency: { active: this.limit.activeCount, pending: this.limit.pendingCount },
      stats: this.breaker.stats,
    };
  }
}

// Wire it up at startup
import { ResilientClient } from './services/resilient-client.js';

export const paymentClient = new ResilientClient(
  'payment-gateway',
  async ({ orderId, amount }) => {
    const res = await fetch('https://api.stripe.com/v1/charges', { /* ... */ });
    if (!res.ok) throw new Error(`Stripe ${res.status}`);
    return res.json();
  },
  {
    concurrency: 8,
    timeout: 4000,
    errorThreshold: 30,
    resetTimeout: 60_000,
    maxAttempts: 2, // Payment: only retry once
    fallback: ({ orderId }) => ({ queued: true, orderId }),
  }
);

export const aiClient = new ResilientClient(
  'ai-inference',
  async ({ prompt }) => callAIEndpoint(prompt),
  {
    concurrency: 5,
    timeout: 15_000, // AI can be slow
    errorThreshold: 40,
    resetTimeout: 20_000,
    maxAttempts: 3,
    fallback: () => ({ text: 'AI temporarily unavailable. Please try again shortly.' }),
  }
);
Health Check Endpoint
Surface all resilience state in your /health endpoint:
app.get('/health', (req, res) => {
  const clients = [paymentClient, aiClient];
  const health = clients.map((c) => c.health());
  const degraded = health.some((h) => h.state !== 'CLOSED');

  res.status(degraded ? 207 : 200).json({
    status: degraded ? 'degraded' : 'healthy',
    timestamp: new Date().toISOString(),
    services: health,
  });
});
A 207 Multi-Status response tells your load balancer and monitoring stack that the service is up but partially degraded — which is more honest than a binary healthy/unhealthy.
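The healthy/degraded decision in that handler is easy to pull out into a pure, testable helper. A small sketch (the function name is illustrative; it treats any non-CLOSED circuit as degraded, matching the endpoint above):

```javascript
// Map a list of per-client circuit states to an overall status + HTTP code.
function summarizeHealth(states) {
  const degraded = states.some((s) => s !== 'CLOSED');
  return {
    status: degraded ? 'degraded' : 'healthy',
    httpStatus: degraded ? 207 : 200,
  };
}
```

Keeping this logic pure makes it trivial to unit-test without spinning up the HTTP server.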
Common Mistakes to Avoid
❌ Retrying Non-Idempotent Mutations
Never blindly retry POST /orders — you'll create duplicate orders. Either:
- Use idempotency keys: Idempotency-Key: <uuid>
- Convert to a two-phase commit (create intent first, then confirm)
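To show what idempotency keys buy you, here is a minimal in-memory sketch: the first call for a key does the work, and replays with the same key return the stored result instead of charging again. This is illustrative only — a production version would persist keys in Redis or the database and handle concurrent duplicates — and all names are made up:

```javascript
// In-memory idempotency cache keyed by Idempotency-Key.
function createIdempotencyStore() {
  const results = new Map();
  return {
    async execute(key, fn) {
      if (results.has(key)) {
        // Duplicate request: return the original result, don't redo the work
        return { ...results.get(key), replayed: true };
      }
      const result = await fn();
      results.set(key, result);
      return { ...result, replayed: false };
    },
  };
}
```

With this in place, a retried POST with the same Idempotency-Key becomes safe: the side effect happens at most once.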
❌ Same Circuit Breaker for All Routes
Don't share one breaker across unrelated operations. A slow GET /reports shouldn't trip the circuit for POST /orders. Create per-operation or per-service breakers.
❌ Retry Loops Without Jitter
Pure exponential backoff (200ms, 400ms, 800ms) across thousands of clients synchronizes perfectly into a thundering herd. Always add ±50% random jitter.
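The difference is easy to see numerically with a small delay helper (purely illustrative; the `random` parameter is injectable so the jitter range can be checked deterministically):

```javascript
// Exponential backoff delay for a 1-based attempt number, with optional
// "equal jitter": pick uniformly between 50% and 100% of the exponential delay.
function backoffDelay(attempt, { baseMs = 200, maxMs = 10_000, jitter = true, random = Math.random } = {}) {
  const exp = Math.min(baseMs * 2 ** (attempt - 1), maxMs);
  return jitter ? exp * (0.5 + random() * 0.5) : exp;
}
```

Without jitter every client computes exactly 200, 400, 800, …; with it, each client lands somewhere in [100, 200], [200, 400], [400, 800], which decorrelates their retries and spreads the load.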
❌ Not Monitoring Circuit State
A circuit in OPEN state silently returning fallbacks is not always obvious. Export breaker.stats to your metrics pipeline (Prometheus, Datadog, OpenTelemetry) and alert when any circuit is OPEN for > 60 seconds.
Monitoring and Alerting
Wire Opossum stats into your existing observability stack. Here's a Prometheus example:
import { Gauge, Counter } from 'prom-client';

// Create the metrics once at module load — registering a second metric with
// the same name would throw — and distinguish breakers via the `name` label.
const state = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Circuit breaker state (0=CLOSED, 1=HALF_OPEN, 2=OPEN)',
  labelNames: ['name'],
});
const fires = new Counter({
  name: 'circuit_breaker_fires_total',
  help: 'Total circuit breaker fires',
  labelNames: ['name'],
});

export function registerBreakerMetrics(breaker) {
  breaker.on('fire', () => fires.labels(breaker.name).inc());
  breaker.on('open', () => state.labels(breaker.name).set(2));
  breaker.on('halfOpen', () => state.labels(breaker.name).set(1));
  breaker.on('close', () => state.labels(breaker.name).set(0));
}
Summary: When to Use Each Pattern
| Pattern | Problem It Solves | Use When |
|---|---|---|
| Circuit Breaker | Cascading failures from a stuck dependency | Any outbound call to external services |
| Retry + Backoff | Transient failures (network blips, rate limits) | Idempotent operations against retryable errors |
| Bulkhead | Resource exhaustion from one slow dependency | High-concurrency APIs with multiple upstream calls |
These three patterns work best together. A production-grade Node.js API in 2026 should have all three in play for any external dependency — not as a sign of distrust in those dependencies, but as a commitment to your own SLAs.
Build APIs That Stay Up
Resilience isn't about expecting failures — it's about designing systems that handle them gracefully when they inevitably arrive. With Opossum 9.x, p-limit, and a clean retry utility, you can protect every outbound call in your Node.js service with minimal overhead.
Want a production-ready API template with these patterns pre-wired? Check out 1xAPI on RapidAPI — we ship APIs built on exactly these principles.
Published on 1xAPI Blog · March 17, 2026