Distributed systems fail. That's not pessimism — it's physics. A network glitch, a slow database, a third-party API hiccup: any one of these can cascade into a full-blown outage if your API doesn't know how to absorb the shock. In 2026, with most production Node.js architectures spanning multiple microservices, AI model endpoints, and external data providers, resilience patterns are no longer optional.
This guide covers three battle-tested patterns — Circuit Breaker, Bulkhead, and Retry with Exponential Backoff — and shows you how to implement them in Node.js using Opossum 9.0, the most widely-used circuit breaker library for Node.js.
Why Your API Needs Resilience Patterns in 2026
Modern APIs rarely operate in isolation. A typical production service in 2026 makes calls to:
- External AI APIs (OpenAI, Anthropic, Gemini) that can spike in latency under load
- Third-party payment gateways with SLA-defined uptime of 99.9% (meaning ~8.7 hours downtime/year)
- Internal microservices with their own dependency chains
- Databases and caches that can become overwhelmed during traffic surges
Without resilience patterns, a single slow dependency can exhaust your thread pool, queue, or connection pool — causing a cascading failure that takes down your entire service. The three patterns we'll cover are your first line of defense.
Pattern 1: Circuit Breaker
The Circuit Breaker pattern (popularized by Michael Nygard's Release It!) works exactly like an electrical circuit breaker: when failures exceed a threshold, it "trips open" and stops sending requests to the failing service, returning an immediate fallback instead of waiting for a timeout.
Three States of a Circuit Breaker
CLOSED ──(failures > threshold)──► OPEN ──(reset timeout)──► HALF-OPEN ──(success)──► CLOSED
                                    ▲                             │
                                    └──────────(failure)──────────┘
- CLOSED: Normal operation. Requests flow through. Failures are counted.
- OPEN: The circuit has tripped. All requests immediately return a fallback. No traffic hits the failing service.
- HALF-OPEN: After a reset timeout, one probe request is allowed through. Success closes the circuit; failure reopens it.
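To make the transitions concrete, here is a deliberately minimal state machine sketching the same logic. This is an illustration only, not Opossum's implementation — real libraries add rolling windows, volume thresholds, and per-call timeouts — and the class and option names are made up:

```javascript
// Minimal circuit-breaker state machine (illustrative only).
// The clock is injectable so the reset timeout can be tested deterministically.
class TinyBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Should a request be allowed through right now?
  allowRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    return this.state !== 'OPEN';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure() {
    this.failures++;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'; // trip, or re-trip after a failed probe
      this.openedAt = this.now();
    }
  }
}
```

Opossum layers a rolling failure-rate window on top of this core idea, but the three states and their transitions are exactly these.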
Installing Opossum 9.x
npm install opossum@9
# TypeScript types
npm install -D @types/opossum
Note: Opossum 9.0.0 (released June 2025) dropped support for Node.js < 20 and aligns fully with the WHATWG Fetch API. If you're on Node.js 22 or 24, you're good to go.
Basic Circuit Breaker Implementation
// lib/circuit-breaker.js
import CircuitBreaker from 'opossum';

/**
 * Wraps an async function with a circuit breaker.
 * @param {Function} fn - The async function to protect
 * @param {Object} options - Opossum options
 */
export function createBreaker(fn, options = {}) {
  const defaults = {
    timeout: 5000,                // Fail if the call takes > 5s
    errorThresholdPercentage: 50, // Trip at a 50% failure rate
    resetTimeout: 30000,          // Try again after 30s
    volumeThreshold: 5,           // Min 5 requests before tripping
    rollingCountTimeout: 10000,   // Measure over a 10s window
  };
  const breaker = new CircuitBreaker(fn, { ...defaults, ...options });

  // Observability hooks
  breaker.on('open', () => console.warn(`[CircuitBreaker] OPEN — ${fn.name}`));
  breaker.on('halfOpen', () => console.info(`[CircuitBreaker] HALF-OPEN — ${fn.name}`));
  breaker.on('close', () => console.info(`[CircuitBreaker] CLOSED — ${fn.name}`));
  breaker.on('fallback', (result) => console.warn(`[CircuitBreaker] FALLBACK — ${fn.name}:`, result));

  return breaker;
}
Protecting an External API Call
// services/payment.service.js
import { createBreaker } from '../lib/circuit-breaker.js';

async function chargeCard(payload) {
  const response = await fetch('https://api.payment-provider.com/v1/charge', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(4500), // Hard abort at 4.5s
  });
  if (!response.ok) {
    throw new Error(`Payment API error: ${response.status}`);
  }
  return response.json();
}

// Fallback: queue for async retry
function paymentFallback(payload) {
  console.warn('Payment service unavailable — queuing for retry:', payload.orderId);
  return { queued: true, orderId: payload.orderId, retryAt: Date.now() + 60_000 };
}

const paymentBreaker = createBreaker(chargeCard, {
  timeout: 4000,                // Breaker times out before the 4.5s hard abort
  errorThresholdPercentage: 30, // Payment APIs: trip faster
  resetTimeout: 60000,          // Try again after 60s
});
paymentBreaker.fallback(paymentFallback);

export { paymentBreaker };
Using the Breaker in Your Route Handler
// routes/orders.js
import { paymentBreaker } from '../services/payment.service.js';

app.post('/orders', async (req, res) => {
  try {
    const result = await paymentBreaker.fire({
      orderId: req.body.orderId,
      amount: req.body.amount,
      currency: req.body.currency,
    });
    if (result.queued) {
      return res.status(202).json({
        message: 'Order accepted, payment will process shortly',
        orderId: result.orderId,
      });
    }
    res.json({ success: true, transactionId: result.transactionId });
  } catch (err) {
    // Reached only when no fallback is configured (or the fallback itself throws)
    res.status(503).json({ error: 'Service temporarily unavailable' });
  }
});
Exposing Circuit Breaker Stats
// GET /health/circuit-breakers
app.get('/health/circuit-breakers', (req, res) => {
res.json({
payment: {
state: paymentBreaker.opened ? 'OPEN' : paymentBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
stats: paymentBreaker.stats,
},
});
});
The stats object includes fires, successes, failures, rejects, timeouts, fallbacks, and percentiles for latency — perfect for Prometheus scraping or your OpenTelemetry pipeline.
Pattern 2: Retry with Exponential Backoff
Not every failure is catastrophic — sometimes a service hiccups for 200ms. The Retry pattern automatically re-attempts failed requests, while Exponential Backoff prevents thundering herds by spacing retries with increasing delays.
Key Rules for Retry in 2026
- Only retry idempotent operations — GET, PUT, DELETE are generally safe; POST requires care
- Add jitter — randomize backoff to prevent synchronized retry storms
- Set a maximum retry count — don't retry forever
- Respect Retry-After headers — external APIs will tell you when to retry
Manual Retry with Exponential Backoff
// lib/retry.js

/**
 * Retry an async function with exponential backoff + jitter.
 * @param {Function} fn - Async function to retry
 * @param {Object} opts - Retry configuration
 */
export async function withRetry(fn, opts = {}) {
  const {
    maxAttempts = 3,
    baseDelayMs = 200,
    maxDelayMs = 10_000,
    jitter = true,
    shouldRetry = () => true, // Custom predicate
  } = opts;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;

      // Don't retry on the last attempt
      if (attempt === maxAttempts) break;

      // Check if this error is retryable
      if (!shouldRetry(err)) throw err;

      // Respect Retry-After if the thrown error exposes the response headers (HTTP 429 / 503)
      const retryAfter = err.headers?.get?.('Retry-After');
      if (retryAfter) {
        const waitMs = parseInt(retryAfter, 10) * 1000;
        console.warn(`Rate limited. Waiting ${waitMs}ms before retry...`);
        await sleep(waitMs);
        continue;
      }

      // Exponential backoff: 200ms, 400ms, 800ms, ...
      const expDelay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
      const delay = jitter
        ? expDelay * (0.5 + Math.random() * 0.5) // 50–100% of exp delay
        : expDelay;

      console.info(`Retry ${attempt}/${maxAttempts - 1} in ${Math.round(delay)}ms...`);
      await sleep(delay);
    }
  }
  throw lastError;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
Combining Retry with Circuit Breaker
The real power comes from composing patterns. Here's the correct order:
Request → [Retry] → [Circuit Breaker] → External Service
The circuit breaker sits between retries and the service so it can track failure rates accurately:
// services/ai.service.js
import { createBreaker } from '../lib/circuit-breaker.js';
import { withRetry } from '../lib/retry.js';

// `fetchAICompletion` stands in for your actual model-calling function
const aiBreaker = createBreaker(fetchAICompletion, { timeout: 15_000 });

async function callAIModel(prompt) {
  return withRetry(
    () => aiBreaker.fire(prompt),
    {
      maxAttempts: 3,
      baseDelayMs: 500,
      // Don't retry when the circuit is open — the rejection is immediate
      shouldRetry: (err) => err.message !== 'Breaker is open',
    }
  );
}
Pattern 3: Bulkhead
Named after the watertight compartments in ships that prevent a single breach from sinking the whole vessel, the Bulkhead pattern isolates resource pools so that a surge in one area can't exhaust resources for everything else.
In Node.js APIs, the most common implementation is concurrency limiting per service — capping how many simultaneous calls can be in-flight to a given dependency.
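At its core a bulkhead is just a counting semaphore. Before reaching for a library, here is a dependency-free sketch of the idea (illustrative only — the p-limit version below is what the rest of this guide uses; the function name is made up):

```javascript
// Minimal bulkhead: at most `concurrency` calls run at once; anything
// beyond that is rejected immediately instead of being queued.
function createTinyBulkhead(concurrency) {
  let active = 0;
  return async function run(fn) {
    if (active >= concurrency) {
      throw new Error('Bulkhead full');
    }
    active++;
    try {
      return await fn();
    } finally {
      active--; // always release the slot, even on failure
    }
  };
}
```

Rejecting instead of queuing keeps latency bounded under overload; a queued variant (which p-limit gives you) trades those rejections for wait time.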
Simple Bulkhead with p-limit
npm install p-limit
// lib/bulkhead.js
import pLimit from 'p-limit';

/**
 * Creates a bulkhead: a concurrency-limited wrapper around a service.
 * @param {number} concurrency - Max simultaneous calls allowed
 */
export function createBulkhead(concurrency) {
  const limit = pLimit(concurrency);
  let rejectedCount = 0;

  return {
    /**
     * Execute a function within the bulkhead.
     * Allows at most one queued call; anything beyond that is rejected
     * immediately rather than left waiting.
     */
    async execute(fn) {
      if (limit.activeCount >= concurrency && limit.pendingCount > 0) {
        rejectedCount++;
        throw new Error(`Bulkhead full: ${concurrency} concurrent calls already in-flight`);
      }
      return limit(fn);
    },
    stats() {
      return {
        active: limit.activeCount,
        pending: limit.pendingCount,
        rejected: rejectedCount,
        concurrency,
      };
    },
  };
}
Using Bulkheads per Dependency
// services/dependencies.js
import { createBulkhead } from '../lib/bulkhead.js';

// Each external dependency gets its own resource pool
export const bulkheads = {
  database: createBulkhead(20),    // Max 20 DB queries in-flight
  aiModel: createBulkhead(5),      // AI calls are expensive — limit tightly
  emailSender: createBulkhead(10), // Email service concurrency
  paymentGw: createBulkhead(8),    // Payment gateway
};

// Usage in a service
import { bulkheads } from './dependencies.js';

async function generateAIResponse(prompt) {
  try {
    return await bulkheads.aiModel.execute(() =>
      fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.OPENAI_KEY}`,
        },
        body: JSON.stringify({ model: 'gpt-4o', messages: [{ role: 'user', content: prompt }] }),
      }).then((r) => r.json())
    );
  } catch (err) {
    if (err.message.startsWith('Bulkhead full')) {
      throw Object.assign(new Error('AI service at capacity'), { statusCode: 503 });
    }
    throw err;
  }
}
Putting It All Together: A Production-Ready Resilient Service
Here's a complete, production-ready service combining all three patterns:
// services/resilient-client.js
import CircuitBreaker from 'opossum';
import pLimit from 'p-limit';
import { withRetry } from '../lib/retry.js';

export class ResilientClient {
  constructor(name, fetchFn, options = {}) {
    this.name = name;

    // 1. Bulkhead: limit concurrency
    this.limit = pLimit(options.concurrency ?? 10);

    // 2. Circuit breaker: detect and isolate failures
    this.breaker = new CircuitBreaker(
      (args) => this.limit(() => fetchFn(args)),
      {
        name,
        timeout: options.timeout ?? 5000,
        errorThresholdPercentage: options.errorThreshold ?? 50,
        resetTimeout: options.resetTimeout ?? 30_000,
        volumeThreshold: options.volumeThreshold ?? 5,
      }
    );

    // Fallback
    if (options.fallback) {
      this.breaker.fallback(options.fallback);
    }

    // Observability
    ['open', 'halfOpen', 'close', 'fallback', 'timeout'].forEach((event) => {
      this.breaker.on(event, (...args) => {
        console.warn(`[${name}] circuit:${event}`, ...args);
      });
    });

    // 3. Retry config
    this.retryOptions = {
      maxAttempts: options.maxAttempts ?? 3,
      baseDelayMs: options.baseDelayMs ?? 300,
      shouldRetry: options.shouldRetry ?? ((err) => !err.message.includes('Breaker is open')),
    };
  }

  async call(args) {
    return withRetry(() => this.breaker.fire(args), this.retryOptions);
  }

  health() {
    return {
      name: this.name,
      state: this.breaker.opened ? 'OPEN' : this.breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
      concurrency: { active: this.limit.activeCount, pending: this.limit.pendingCount },
      stats: this.breaker.stats,
    };
  }
}

// Wire it up at startup
import { ResilientClient } from './services/resilient-client.js';

export const paymentClient = new ResilientClient(
  'payment-gateway',
  async ({ orderId, amount }) => {
    const res = await fetch('https://api.stripe.com/v1/charges', { /* ... */ });
    if (!res.ok) throw new Error(`Stripe ${res.status}`);
    return res.json();
  },
  {
    concurrency: 8,
    timeout: 4000,
    errorThreshold: 30,
    resetTimeout: 60_000,
    maxAttempts: 2, // Payment: only retry once
    fallback: ({ orderId }) => ({ queued: true, orderId }),
  }
);

export const aiClient = new ResilientClient(
  'ai-inference',
  async ({ prompt }) => callAIEndpoint(prompt),
  {
    concurrency: 5,
    timeout: 15_000, // AI can be slow
    errorThreshold: 40,
    resetTimeout: 20_000,
    maxAttempts: 3,
    fallback: () => ({ text: 'AI temporarily unavailable. Please try again shortly.' }),
  }
);
Health Check Endpoint
Surface all resilience state in your /health endpoint:
app.get('/health', (req, res) => {
  const clients = [paymentClient, aiClient];
  const health = clients.map((c) => c.health());
  const degraded = health.some((h) => h.state !== 'CLOSED');

  res.status(degraded ? 207 : 200).json({
    status: degraded ? 'degraded' : 'healthy',
    timestamp: new Date().toISOString(),
    services: health,
  });
});
A 207 Multi-Status response tells your load balancer and monitoring stack that the service is up but partially degraded — which is more honest than a binary healthy/unhealthy.
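The healthy/degraded decision in that handler is easy to pull out into a pure, testable helper. A small sketch (the function name is illustrative; it treats any non-CLOSED circuit as degraded, matching the endpoint above):

```javascript
// Map a list of per-client circuit states to an overall status + HTTP code.
function summarizeHealth(states) {
  const degraded = states.some((s) => s !== 'CLOSED');
  return {
    status: degraded ? 'degraded' : 'healthy',
    httpStatus: degraded ? 207 : 200,
  };
}
```

Keeping this logic pure makes it trivial to unit-test without spinning up the HTTP server.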
Common Mistakes to Avoid
❌ Retrying Non-Idempotent Mutations
Never blindly retry POST /orders — you'll create duplicate orders. Either:
- Use idempotency keys: Idempotency-Key: <uuid>
- Convert to a two-phase commit (create intent first, then confirm)
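To show what idempotency keys buy you, here is a minimal in-memory sketch: the first call for a key does the work, and replays with the same key return the stored result instead of charging again. This is illustrative only — a production version would persist keys in Redis or the database and handle concurrent duplicates — and all names are made up:

```javascript
// In-memory idempotency cache keyed by Idempotency-Key.
function createIdempotencyStore() {
  const results = new Map();
  return {
    async execute(key, fn) {
      if (results.has(key)) {
        // Duplicate request: return the original result, don't redo the work
        return { ...results.get(key), replayed: true };
      }
      const result = await fn();
      results.set(key, result);
      return { ...result, replayed: false };
    },
  };
}
```

With this in place, a retried POST with the same Idempotency-Key becomes safe: the side effect happens at most once.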
❌ Same Circuit Breaker for All Routes
Don't share one breaker across unrelated operations. A slow GET /reports shouldn't trip the circuit for POST /orders. Create per-operation or per-service breakers.
❌ Retry Loops Without Jitter
Pure exponential backoff (200ms, 400ms, 800ms) across thousands of clients synchronizes perfectly into a thundering herd. Always add ±50% random jitter.
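The difference is easy to see numerically with a small delay helper (purely illustrative; the `random` parameter is injectable so the jitter range can be checked deterministically):

```javascript
// Exponential backoff delay for a 1-based attempt number, with optional
// "equal jitter": pick uniformly between 50% and 100% of the exponential delay.
function backoffDelay(attempt, { baseMs = 200, maxMs = 10_000, jitter = true, random = Math.random } = {}) {
  const exp = Math.min(baseMs * 2 ** (attempt - 1), maxMs);
  return jitter ? exp * (0.5 + random() * 0.5) : exp;
}
```

Without jitter every client computes exactly 200, 400, 800, …; with it, each client lands somewhere in [100, 200], [200, 400], [400, 800], which decorrelates their retries and spreads the load.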
❌ Not Monitoring Circuit State
A circuit in OPEN state silently returning fallbacks is not always obvious. Export breaker.stats to your metrics pipeline (Prometheus, Datadog, OpenTelemetry) and alert when any circuit is OPEN for > 60 seconds.
Monitoring and Alerting
Wire Opossum stats into your existing observability stack. Here's a Prometheus example:
import { Gauge, Counter } from 'prom-client';

// Create the metrics once at module load — registering a second metric with
// the same name would throw — and distinguish breakers via the `name` label.
const state = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Circuit breaker state (0=CLOSED, 1=HALF_OPEN, 2=OPEN)',
  labelNames: ['name'],
});
const fires = new Counter({
  name: 'circuit_breaker_fires_total',
  help: 'Total circuit breaker fires',
  labelNames: ['name'],
});

export function registerBreakerMetrics(breaker) {
  breaker.on('fire', () => fires.labels(breaker.name).inc());
  breaker.on('open', () => state.labels(breaker.name).set(2));
  breaker.on('halfOpen', () => state.labels(breaker.name).set(1));
  breaker.on('close', () => state.labels(breaker.name).set(0));
}
Summary: When to Use Each Pattern
| Pattern | Problem It Solves | Use When |
|---|---|---|
| Circuit Breaker | Cascading failures from a stuck dependency | Any outbound call to external services |
| Retry + Backoff | Transient failures (network blips, rate limits) | Idempotent operations against retryable errors |
| Bulkhead | Resource exhaustion from one slow dependency | High-concurrency APIs with multiple upstream calls |
These three patterns work best together. A production-grade Node.js API in 2026 should have all three in play for any external dependency — not as a sign of distrust in those dependencies, but as a commitment to your own SLAs.
Build APIs That Stay Up
Resilience isn't about expecting failures — it's about designing systems that handle them gracefully when they inevitably arrive. With Opossum 9.x, p-limit, and a clean retry utility, you can protect every outbound call in your Node.js service with minimal overhead.
Want a production-ready API template with these patterns pre-wired? Check out 1xAPI on RapidAPI — we ship APIs built on exactly these principles.
Published on 1xAPI Blog · March 17, 2026