Young Gao
Graceful Degradation Patterns: Keep Your Backend Running When Dependencies Fail (2026)

Every backend engineer has lived through the 3 AM incident where a single Redis timeout brought down the entire checkout flow. Your API doesn't exist in isolation -- it depends on databases, caches, third-party services, and internal microservices. When any of these fail (and they will), the question isn't whether your system degrades, but whether it degrades gracefully or catastrophically.

This article walks through battle-tested patterns for keeping your API responsive when the world around it is on fire. All examples are in TypeScript/Node.js, and every pattern here has been extracted from real production incidents.

The Anatomy of a Cascading Failure

Before we build defenses, let's understand the enemy. Cascading failures follow a predictable pattern:

  1. A downstream dependency slows down (not fails -- slows down)
  2. Your connection pool fills up waiting for responses
  3. Incoming requests queue behind the blocked pool
  4. Memory pressure builds, garbage collection stalls, latency spikes
  5. Health checks start failing, load balancers pull nodes out
  6. Remaining nodes absorb more traffic, accelerating their own failure

The insidious part is step 1. Total failure is easy to detect; slowness is not. A service responding in 30 seconds instead of 30 milliseconds will kill you silently.

// This innocent-looking code is a cascading failure waiting to happen
async function getProductDetails(productId: string) {
  const product = await db.query('SELECT * FROM products WHERE id = $1', [productId]);
  const inventory = await inventoryService.check(productId);     // 5s timeout? 30s? None?
  const reviews = await reviewService.getRecent(productId);      // What if this hangs?
  const recommendations = await mlService.predict(productId);    // Is this even critical?

  return { product, inventory, reviews, recommendations };
}

Every await in that function is an opportunity for cascade. Let's fix this systematically.

Circuit Breaker: Your First Line of Defense

If you've read the earlier articles in this series, you know circuit breakers. Here's a production-grade implementation with the nuances that matter:

interface CircuitBreakerConfig {
  failureThreshold: number;
  resetTimeoutMs: number;
  halfOpenMaxAttempts: number;
  monitorWindowMs: number;
}

class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures: number[] = [];
  private halfOpenAttempts = 0;
  private lastFailureTime = 0;

  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    this.pruneOldFailures();

    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.config.resetTimeoutMs) {
        this.state = 'half-open';
        this.halfOpenAttempts = 0;
      } else {
        return fallback();
      }
    }

    if (this.state === 'half-open') {
      if (this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
        return fallback();
      }
      this.halfOpenAttempts++; // count trial requests so half-open traffic stays bounded
    }

    try {
      const result = await fn();
      if (this.state === 'half-open') {
        this.state = 'closed';
        this.failures = [];
      }
      return result;
    } catch (error) {
      this.recordFailure();
      if (this.state === 'half-open') {
        this.state = 'open';
      }
      return fallback();
    }
  }

  private recordFailure() {
    const now = Date.now();
    this.failures.push(now);
    this.lastFailureTime = now;

    if (this.failures.length >= this.config.failureThreshold) {
      this.state = 'open';
      console.warn(`[CircuitBreaker:${this.name}] OPEN after ${this.failures.length} failures`);
    }
  }

  private pruneOldFailures() {
    const cutoff = Date.now() - this.config.monitorWindowMs;
    this.failures = this.failures.filter(t => t > cutoff);
  }

  getState() { return this.state; }
}

The key detail most implementations miss: failure counting should be windowed. Five failures over five hours is normal. Five failures in ten seconds means something is wrong. The monitorWindowMs parameter handles this.
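To see the windowing in action, here is a minimal, self-contained sketch of just the failure-counting logic, with an injectable clock so both scenarios are deterministic (the class name and threshold values are illustrative, not part of the implementation above):

```typescript
// A windowed failure counter: trips only when `threshold` failures land
// inside `windowMs`. The clock is injectable purely for the demo below.
class WindowedFailureCounter {
  private failures: number[] = [];

  constructor(
    private threshold: number,
    private windowMs: number,
    private now: () => number = Date.now
  ) {}

  // Returns true when the breaker should open
  recordFailure(): boolean {
    const t = this.now();
    this.failures.push(t);
    this.failures = this.failures.filter(f => t - f <= this.windowMs); // prune old failures
    return this.failures.length >= this.threshold;
  }
}

// Five failures an hour apart: each one ages out before the next arrives
let clock = 0;
const hourly = new WindowedFailureCounter(5, 10_000, () => clock);
let slowTripped = false;
for (let i = 0; i < 5; i++) { clock += 3_600_000; slowTripped = hourly.recordFailure(); }

// Five failures within one second: all land inside the window
clock = 0;
const bursty = new WindowedFailureCounter(5, 10_000, () => clock);
let burstTripped = false;
for (let i = 0; i < 5; i++) { clock += 200; burstTripped = bursty.recordFailure(); }

console.log({ slowTripped, burstTripped }); // { slowTripped: false, burstTripped: true }
```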

The Bulkhead Pattern: Isolate the Blast Radius

In ship design, bulkheads are walls between compartments that prevent a hull breach in one area from sinking the whole vessel. The same principle applies to your API.

class Bulkhead {
  private active = 0;
  private queue: Array<{ resolve: (permit: boolean) => void }> = [];

  constructor(
    private name: string,
    private maxConcurrent: number,
    private maxQueue: number
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const permit = await this.acquirePermit();
    if (!permit) {
      throw new BulkheadRejectError(
        `Bulkhead ${this.name} full: ${this.active}/${this.maxConcurrent} active, ${this.queue.length}/${this.maxQueue} queued`
      );
    }

    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      this.releaseNext();
    }
  }

  private acquirePermit(): Promise<boolean> {
    if (this.active < this.maxConcurrent) {
      return Promise.resolve(true);
    }
    if (this.queue.length >= this.maxQueue) {
      return Promise.resolve(false);
    }
    return new Promise(resolve => {
      this.queue.push({ resolve });
    });
  }

  private releaseNext() {
    const next = this.queue.shift();
    if (next) next.resolve(true);
  }

  getMetrics() {
    return { active: this.active, queued: this.queue.length, name: this.name };
  }
}

class BulkheadRejectError extends Error {
  readonly statusCode = 503;
}

Use separate bulkheads for each dependency:

const bulkheads = {
  database: new Bulkhead('database', 50, 100),
  inventoryService: new Bulkhead('inventory', 20, 30),
  reviewService: new Bulkhead('reviews', 10, 15),
  mlService: new Bulkhead('ml-predictions', 5, 10),
};

Now when mlService hangs, it can tie up at most 5 concurrent calls inside its own bulkhead. The database's 50 slots remain untouched, and your core checkout flow keeps running.
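The isolation claim is easy to demonstrate. Below is a minimal semaphore-style bulkhead (a simplified cousin of the class above, without the queue cap; `slowDependency` is a stand-in for any downstream call) showing that ten simultaneous calls never exceed a concurrency of two:

```typescript
// Simplified bulkhead: callers beyond maxConcurrent wait for a free slot.
class SimpleBulkhead {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private maxConcurrent: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Park the caller until a slot frees up
      await new Promise<void>(resolve => this.waiters.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      this.waiters.shift()?.(); // wake one queued caller, if any
    }
  }
}

let inFlight = 0;
let peak = 0;
const slowDependency = async (): Promise<void> => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise(r => setTimeout(r, 10)); // simulate a slow downstream call
  inFlight--;
};

const bulkhead = new SimpleBulkhead(2);
// Fire 10 calls at once; the bulkhead admits at most 2 at a time.
const done = Promise.all(
  Array.from({ length: 10 }, () => bulkhead.execute(slowDependency))
).then(() => console.log(`peak concurrency: ${peak}`)); // peak concurrency: 2
```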

Fallback Strategies: The Art of Useful Failure

Not all fallbacks are equal. Here's a hierarchy from best to worst user experience:

1. Cached Response

Serve stale data when fresh data is unavailable. Users rarely notice a 60-second-old price. They absolutely notice a 500 error.

class CacheFallbackService {
  constructor(
    private cache: Map<string, { data: unknown; timestamp: number }> = new Map(),
    private staleTTLMs: number = 300_000 // serve stale data up to 5 minutes
  ) {}

  async fetchWithCacheFallback<T>(
    key: string,
    fetcher: () => Promise<T>,
    options?: { freshTTLMs?: number }
  ): Promise<{ data: T; stale: boolean }> {
    const cached = this.cache.get(key);
    const freshTTL = options?.freshTTLMs ?? 30_000;

    // Serve straight from cache while the entry is still fresh
    if (cached && Date.now() - cached.timestamp < freshTTL) {
      return { data: cached.data as T, stale: false };
    }

    // Otherwise try a fresh fetch
    try {
      const data = await fetcher();
      this.cache.set(key, { data, timestamp: Date.now() });
      return { data, stale: false };
    } catch (error) {
      // Serve stale cache if available and within tolerance
      if (cached && Date.now() - cached.timestamp < this.staleTTLMs) {
        console.warn(`Serving stale cache for ${key}, age: ${Date.now() - cached.timestamp}ms`);
        return { data: cached.data as T, stale: true };
      }
      throw error;
    }
  }
}

Production tip: Set the X-Served-Stale: true header when returning cached data. This lets clients make informed decisions and helps debugging.
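A sketch of that tip in isolation (the `CachedResult` shape mirrors what `fetchWithCacheFallback` returns; the function name is illustrative):

```typescript
// Build response headers that advertise staleness to the client.
interface CachedResult<T> { data: T; stale: boolean }

function responseHeaders(result: CachedResult<unknown>): Record<string, string> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  if (result.stale) {
    // Clients can surface a "data may be outdated" hint or re-fetch later
    headers['X-Served-Stale'] = 'true';
  }
  return headers;
}

console.log(responseHeaders({ data: { price: 999 }, stale: true })['X-Served-Stale']); // true
```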

2. Default / Static Value

When you have no cache, return a sensible default. This is especially useful for non-critical decorators on a response.

const FALLBACK_DEFAULTS: Record<string, unknown> = {
  recommendations: [],
  promotionBanner: null,
  recentReviews: [],
  estimatedDelivery: 'Contact support for delivery estimates',
  stockStatus: 'CHECK_IN_STORE',
};

function withDefault<T>(key: string, fn: () => Promise<T>): Promise<T> {
  return fn().catch((err) => {
    console.warn(`Using default fallback for ${key}: ${err.message}`);
    return FALLBACK_DEFAULTS[key] as T;
  });
}

3. Reduced Functionality

Disable non-essential features and communicate it to the client:

interface ServiceHealth {
  reviews: boolean;
  recommendations: boolean;
  inventory: boolean;
  pricing: boolean;
}

async function getProductDetails(productId: string, health: ServiceHealth) {
  const result: Record<string, unknown> = {};
  const degraded: string[] = [];

  // Core: always attempt, fail the request if this fails
  result.product = await db.query('SELECT * FROM products WHERE id = $1', [productId]);

  // Important: attempt with fallback
  if (health.inventory) {
    result.inventory = await withDefault('stockStatus',
      () => inventoryService.check(productId));
  } else {
    result.inventory = FALLBACK_DEFAULTS.stockStatus;
    degraded.push('inventory');
  }

  // Nice-to-have: skip entirely under pressure
  if (health.reviews) {
    result.reviews = await withDefault('recentReviews',
      () => reviewService.getRecent(productId));
  } else {
    result.reviews = [];
    degraded.push('reviews');
  }

  if (health.recommendations) {
    result.recommendations = await withDefault('recommendations',
      () => mlService.predict(productId));
  } else {
    result.recommendations = [];
    degraded.push('recommendations');
  }

  return {
    ...result,
    _meta: { degraded, timestamp: Date.now() }
  };
}

The _meta.degraded array tells the frontend exactly which sections are unavailable, so it can render appropriate UI instead of broken components.

Timeout Hierarchies: Budgeting Time

Individual timeouts aren't enough. You need a timeout budget for the entire request, with sub-budgets for each dependency.

class TimeoutBudget {
  private startTime: number;

  constructor(private totalBudgetMs: number) {
    this.startTime = Date.now();
  }

  remaining(): number {
    return Math.max(0, this.totalBudgetMs - (Date.now() - this.startTime));
  }

  expired(): boolean {
    return this.remaining() <= 0;
  }

  allocate(maxMs: number): number {
    // Never allocate more than remaining budget
    return Math.min(maxMs, this.remaining());
  }
}

async function handleRequest(productId: string): Promise<Response> {
  const budget = new TimeoutBudget(3000); // Total: 3 seconds for the entire request

  // Core data: allocate up to 1500ms
  const product = await withTimeout(
    db.query('SELECT * FROM products WHERE id = $1', [productId]),
    budget.allocate(1500)
  );

  // Secondary data: gets whatever time is left, capped per-call.
  // allSettled never rejects; each entry is { status, value } or { status, reason }.
  const [inventory, reviews] = await Promise.allSettled([
    withTimeout(inventoryService.check(productId), budget.allocate(800)),
    withTimeout(reviewService.getRecent(productId), budget.allocate(500)),
  ]);

  // Only attempt ML if we have >200ms remaining
  let recommendations: unknown[] = [];
  if (budget.remaining() > 200) {
    try {
      recommendations = await withTimeout(
        mlService.predict(productId),
        budget.allocate(400)
      );
    } catch { /* non-critical, swallow */ }
  }

  return buildResponse(product, inventory, reviews, recommendations);
}

function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  if (ms <= 0) return Promise.reject(new Error('No time budget remaining'));

  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    promise.then(
      (val) => { clearTimeout(timer); resolve(val); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

This guarantees your API responds within 3 seconds regardless of which dependencies are slow. Late arrivals get progressively less time and are progressively less critical.

Retry Budgets: Don't Amplify the Problem

Naive retries multiply load on an already struggling service. Use a retry budget to cap the percentage of retried requests across your entire application.

class RetryBudget {
  private requests = 0;
  private retries = 0;
  private windowStart = Date.now();

  constructor(
    private windowMs: number = 10_000,
    private maxRetryRatio: number = 0.1, // Max 10% of requests can be retries
    private minRetriesPerWindow: number = 5 // Always allow at least 5
  ) {}

  recordRequest() {
    this.maybeResetWindow();
    this.requests++;
  }

  canRetry(): boolean {
    this.maybeResetWindow();
    if (this.retries < this.minRetriesPerWindow) return true;
    return this.retries / Math.max(1, this.requests) < this.maxRetryRatio;
  }

  recordRetry() {
    this.retries++;
  }

  private maybeResetWindow() {
    if (Date.now() - this.windowStart > this.windowMs) {
      this.requests = 0;
      this.retries = 0;
      this.windowStart = Date.now();
    }
  }
}

const inventoryRetryBudget = new RetryBudget();

async function checkInventoryWithRetry(productId: string): Promise<InventoryStatus> {
  inventoryRetryBudget.recordRequest();

  try {
    return await inventoryService.check(productId);
  } catch (error) {
    if (isRetryable(error) && inventoryRetryBudget.canRetry()) {
      inventoryRetryBudget.recordRetry();
      await sleep(50 + Math.random() * 100); // Jittered backoff
      return await inventoryService.check(productId);
    }
    throw error;
  }
}

function isRetryable(error: unknown): boolean {
  if (error instanceof Error) {
    return error.message.includes('ECONNRESET')
      || error.message.includes('503')
      || error.message.includes('429');
  }
  return false;
}

At steady state (1000 req/s), this allows up to 100 retries per second. During an outage where every request fails, it caps retries at 100/s instead of doubling load to 2000 req/s. The minRetriesPerWindow ensures low-traffic endpoints can still retry.
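That arithmetic is worth verifying. The simulation below runs a trimmed copy of the budget (no time window, same ratio and floor) through 10,000 requests that all fail and want a retry:

```typescript
// Trimmed copy of RetryBudget above, plus a `granted` counter for inspection.
class SimRetryBudget {
  private requests = 0;
  private retries = 0;

  constructor(private maxRetryRatio = 0.1, private minRetriesPerWindow = 5) {}

  recordRequest() { this.requests++; }

  canRetry(): boolean {
    if (this.retries < this.minRetriesPerWindow) return true;
    return this.retries / Math.max(1, this.requests) < this.maxRetryRatio;
  }

  recordRetry() { this.retries++; }

  get granted() { return this.retries; }
}

const budget = new SimRetryBudget();
for (let i = 0; i < 10_000; i++) {
  budget.recordRequest();
  if (budget.canRetry()) budget.recordRetry(); // every request fails and asks to retry
}
console.log(budget.granted); // 1000 -- amplification capped at 10%, not doubled
```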

Load Shedding: Protect Yourself

When your system is at capacity, rejecting requests quickly is better than serving them slowly. Fast failure lets clients retry elsewhere or show cached content.

class LoadShedder {
  private activeRequests = 0;

  constructor(
    private maxConcurrent: number,
    private priorityExtractor: (req: Request) => Priority
  ) {}

  middleware() {
    return (req: Request, res: Response, next: NextFunction) => {
      const priority = this.priorityExtractor(req);
      const threshold = this.getThreshold(priority);

      if (this.activeRequests >= threshold) {
        res.status(503).json({
          error: 'Service temporarily at capacity',
          retryAfterMs: 1000 + Math.random() * 2000,
        });
        return;
      }

      this.activeRequests++;
      res.on('finish', () => this.activeRequests--);
      next();
    };
  }

  private getThreshold(priority: Priority): number {
    // Higher priority requests get access to more capacity
    switch (priority) {
      case 'critical':  return this.maxConcurrent;        // 100%
      case 'high':      return this.maxConcurrent * 0.8;  // 80%
      case 'normal':    return this.maxConcurrent * 0.5;  // 50%
      case 'low':       return this.maxConcurrent * 0.2;  // 20%
    }
  }
}

type Priority = 'critical' | 'high' | 'normal' | 'low';

function extractPriority(req: Request): Priority {
  // Payment webhooks and health checks are critical
  if (req.path.startsWith('/webhooks/payment')) return 'critical';
  if (req.path === '/health') return 'critical';

  // Authenticated checkout flows are high priority
  if (req.path.startsWith('/checkout')) return 'high';

  // Browse/search is normal
  if (req.path.startsWith('/products')) return 'normal';

  // Everything else (analytics, recommendations) is low
  return 'low';
}

When the server hits 20% capacity, low-priority recommendation requests start getting shed; browse and search follow at 50%. Checkout continues until 80%. Payment webhooks get the full capacity. Your revenue-generating paths survive longest.
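The shedding decision itself is a one-line comparison, shown here in isolation for a server with maxConcurrent = 100 (threshold table copied from getThreshold above):

```typescript
type Priority = 'critical' | 'high' | 'normal' | 'low';

// Same capacity fractions as getThreshold above
const THRESHOLDS: Record<Priority, number> = {
  critical: 1.0,
  high: 0.8,
  normal: 0.5,
  low: 0.2,
};

function shouldShed(active: number, maxConcurrent: number, priority: Priority): boolean {
  return active >= maxConcurrent * THRESHOLDS[priority];
}

// At 50 active requests out of 100:
console.log(shouldShed(50, 100, 'low'));      // true  -- recommendations shed from 20% up
console.log(shouldShed(50, 100, 'normal'));   // true  -- browse starts shedding right at 50%
console.log(shouldShed(50, 100, 'high'));     // false -- checkout continues until 80%
console.log(shouldShed(95, 100, 'critical')); // false -- payment webhooks use full capacity
```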

Health-Aware Routing

Don't just check if a dependency is up -- track how healthy it is. Route traffic accordingly.

interface DependencyHealth {
  name: string;
  latencyP50: number;
  latencyP99: number;
  errorRate: number;   // 0.0 to 1.0
  lastCheck: number;
  status: 'healthy' | 'degraded' | 'unhealthy';
}

class HealthTracker {
  private metrics = new Map<string, { latencies: number[]; errors: number; total: number }>();

  record(name: string, latencyMs: number, success: boolean) {
    let m = this.metrics.get(name);
    if (!m) {
      m = { latencies: [], errors: 0, total: 0 };
      this.metrics.set(name, m);
    }

    m.total++;
    if (!success) m.errors++;
    m.latencies.push(latencyMs);

    // Keep a sliding window of the last 100 latency observations, and decay
    // the error counters so errorRate tracks recent traffic, not lifetime totals
    if (m.latencies.length > 100) m.latencies.shift();
    if (m.total > 100) {
      m.total = Math.ceil(m.total / 2);
      m.errors = Math.ceil(m.errors / 2);
    }
  }

  getHealth(name: string): DependencyHealth {
    const m = this.metrics.get(name);
    if (!m || m.total === 0) {
      return { name, latencyP50: 0, latencyP99: 0, errorRate: 0, lastCheck: 0, status: 'healthy' };
    }

    const sorted = [...m.latencies].sort((a, b) => a - b);
    const p50 = sorted[Math.floor(sorted.length * 0.5)];
    const p99 = sorted[Math.floor(sorted.length * 0.99)];
    const errorRate = m.errors / m.total;

    let status: DependencyHealth['status'] = 'healthy';
    if (errorRate > 0.5 || p99 > 5000) status = 'unhealthy';
    else if (errorRate > 0.1 || p99 > 2000) status = 'degraded';

    return { name, latencyP50: p50, latencyP99: p99, errorRate, lastCheck: Date.now(), status };
  }

  getServiceHealth(): ServiceHealth {
    return {
      reviews: this.getHealth('reviews').status !== 'unhealthy',
      recommendations: this.getHealth('recommendations').status !== 'unhealthy',
      inventory: this.getHealth('inventory').status !== 'unhealthy',
      pricing: this.getHealth('pricing').status !== 'unhealthy',
    };
  }
}

This feeds directly into the reduced-functionality pattern from earlier. Instead of binary up/down decisions, your system continuously adapts based on real-time health signals.
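The status thresholds are the interesting part, so here they are extracted as a pure function over a window of observations (cutoff values copied from getHealth above; the examples are illustrative):

```typescript
type Status = 'healthy' | 'degraded' | 'unhealthy';

// Same cutoffs as getHealth above: p99 latency and error rate over a window
function classify(latencies: number[], errors: number, total: number): Status {
  const sorted = [...latencies].sort((a, b) => a - b);
  const p99 = sorted[Math.floor(sorted.length * 0.99)] ?? 0;
  const errorRate = total === 0 ? 0 : errors / total;

  if (errorRate > 0.5 || p99 > 5000) return 'unhealthy';
  if (errorRate > 0.1 || p99 > 2000) return 'degraded';
  return 'healthy';
}

const fast = Array.from({ length: 100 }, () => 50); // 100 calls, all at 50ms

console.log(classify(fast, 0, 100));  // healthy
console.log(classify(fast, 20, 100)); // degraded -- 20% errors despite fast responses

const withOutlier = [...fast.slice(0, 99), 6000]; // one 6-second straggler
console.log(classify(withOutlier, 0, 100)); // unhealthy -- p99 blows past 5s
```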

Wiring It All Together

Here's how these patterns compose into a resilient service layer:

class ResilientService<T> {
  private circuitBreaker: CircuitBreaker;
  private bulkhead: Bulkhead;
  private retryBudget: RetryBudget;
  private cache: CacheFallbackService;
  private healthTracker: HealthTracker;

  constructor(private name: string, config: ResilientServiceConfig) {
    this.circuitBreaker = new CircuitBreaker(name, config.circuitBreaker);
    this.bulkhead = new Bulkhead(name, config.maxConcurrent, config.maxQueue);
    this.retryBudget = new RetryBudget();
    this.cache = new CacheFallbackService();
    this.healthTracker = config.healthTracker;
  }

  async call<R>(
    key: string,
    fn: () => Promise<R>,
    fallback: () => Promise<R>,
    timeoutMs: number
  ): Promise<{ data: R; degraded: boolean }> {
    this.retryBudget.recordRequest();

    const start = Date.now();
    try {
      const result = await this.circuitBreaker.execute(
        () => this.bulkhead.execute(
          () => withTimeout(fn(), timeoutMs)
        ),
        fallback
      );

      this.healthTracker.record(this.name, Date.now() - start, true);
      return { data: result, degraded: this.circuitBreaker.getState() !== 'closed' };
    } catch (error) {
      this.healthTracker.record(this.name, Date.now() - start, false);

      // Last resort: re-run the fallback through the cache layer, which serves
      // stale data for this key if the fallback itself fails again
      try {
        const cached = await this.cache.fetchWithCacheFallback(key, fallback);
        return { data: cached.data as R, degraded: true };
      } catch {
        throw error;
      }
    }
  }
}

The execution order matters: circuit breaker wraps bulkhead wraps timeout. The circuit breaker is the outermost check (cheapest -- just a state check). The bulkhead prevents resource exhaustion. The timeout prevents individual calls from hogging their bulkhead slot.

Chaos Engineering: Verify Your Defenses

Patterns on paper are worthless if you've never tested them under failure. You don't need a full Chaos Monkey setup to start. Inject failures in your middleware:

import { readFileSync } from 'node:fs';

class ChaosMiddleware {
  private config: ChaosConfig;

  constructor(configPath: string) {
    // Load from config file or feature flag service
    this.config = this.loadConfig(configPath);
  }

  forDependency(name: string) {
    return async <T>(fn: () => Promise<T>): Promise<T> => {
      const rule = this.config.rules[name];
      if (!rule?.enabled) return fn();

      // Simulate latency injection
      if (rule.latencyMs && Math.random() < rule.probability) {
        await sleep(rule.latencyMs);
      }

      // Simulate failure injection
      if (rule.failureRate && Math.random() < rule.failureRate) {
        throw new Error(`[Chaos] Injected failure for ${name}`);
      }

      return fn();
    };
  }

  private loadConfig(path: string): ChaosConfig {
    // In production: poll a feature flag service or config store
    return JSON.parse(readFileSync(path, 'utf-8'));
  }
}

interface ChaosConfig {
  rules: Record<string, {
    enabled: boolean;
    probability: number;
    latencyMs?: number;
    failureRate?: number;
  }>;
}

// Usage in your service setup:
const chaos = new ChaosMiddleware('/etc/app/chaos.json');

// Wrap real calls during chaos experiments
const inventory = await chaos.forDependency('inventory')(
  () => inventoryService.check(productId)
);

Start small: inject 1% failures to your review service on staging. Verify that fallbacks trigger, circuit breakers open at the right threshold, and your bulkheads actually isolate the blast radius. Then gradually increase. The first time you run this, something will break in a way you didn't expect. That's the whole point.

A practical chaos gameday checklist:

  1. Pick one non-critical dependency
  2. Inject 100ms latency, observe dashboards
  3. Increase to 2000ms, verify timeouts trigger
  4. Inject 50% errors, verify circuit breaker opens
  5. Inject 100% errors, verify fallbacks serve stale data
  6. Remove injection, verify recovery (circuit breaker closes)
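For step 4, the injection decision can be made deterministic in tests by injecting the random source. A self-contained sketch (names simplified from ChaosMiddleware above; the alternating "random" source exists purely for the demonstration):

```typescript
// One chaos rule applied around a single call; `rand` is injectable.
interface ChaosRule { enabled: boolean; failureRate?: number; latencyMs?: number }

function withChaos<T>(rule: ChaosRule, rand: () => number, fn: () => Promise<T>): Promise<T> {
  if (rule.enabled && rule.failureRate && rand() < rule.failureRate) {
    return Promise.reject(new Error('[Chaos] injected failure'));
  }
  return fn();
}

const rule: ChaosRule = { enabled: true, failureRate: 0.5 }; // step 4: 50% errors

// Deterministic "random" source: alternate below/above the 0.5 cutoff
let i = 0;
const alternating = () => (i++ % 2 === 0 ? 0.25 : 0.75);

const results: string[] = [];
const runs = Array.from({ length: 4 }, () =>
  withChaos(rule, alternating, async () => 'ok')
    .then(v => results.push(v), () => results.push('injected'))
);
const done = Promise.all(runs).then(() => console.log(results));
// [ 'injected', 'ok', 'injected', 'ok' ]
```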

Key Takeaways

Design for partial failure. Your API should never be entirely down because one dependency is down. Categorize every dependency as critical (fail the request) or non-critical (degrade gracefully).

Timeouts are mandatory. Every network call needs a timeout. Every request needs a budget. No exceptions.

Fast failure beats slow failure. A 503 in 5ms is better than a 200 in 30 seconds. Load shedding and circuit breakers make this possible.

Test failure paths. If you've never seen your circuit breaker open in production, you don't know if it works. Chaos engineering isn't optional for systems that matter.

Observe everything. Every pattern here generates signals: circuit breaker state changes, bulkhead queue depths, cache hit rates, retry budget utilization. Ship these to your observability platform. You can't manage what you can't measure.

The code in this article is intentionally framework-agnostic -- these patterns work whether you're using Express, Fastify, NestJS, or raw Node.js HTTP. The underlying principles are even language-agnostic. What matters is that you think about failure modes before they happen.

Next time you write an await fetch(...), ask yourself: what happens when this takes 30 seconds? What happens when it fails 50% of the time? If you don't have answers, you now have the patterns to build them.


*This is Part 5 of the **Production Backend Patterns** series. Previous articles covered request validation, structured logging, rate limiting, and circuit breakers. Next up: distributed tracing and observability pipelines.*

