HelperX

Posted on • Originally published at helperx.app

Graceful Degradation: Circuit Breakers for External API Dependencies

When your application depends on external APIs that you don't control, failures are not a question of "if" but "when." X's API rate-limits you. Your proxy provider has an outage. The AI model endpoint returns 503s for 20 minutes.

The question is: does one failure cascade into total system failure, or does your system degrade gracefully?

We built a circuit breaker system for HelperX that keeps healthy slots running when unhealthy ones fail. Here's the implementation.

The cascade problem

Without circuit breakers, here's what happens when a proxy goes down:

  1. Slot A sends a request → proxy timeout (30 seconds)
  2. Slot A retries → another timeout (30 seconds)
  3. Slot A retries again → third timeout (30 seconds)
  4. Meanwhile, the scheduler is blocked for 90 seconds
  5. Other modules in Slot A queue up behind the blocked request
  6. If the proxy is truly dead, this loop continues indefinitely
  7. Node.js event loop gets congested with pending timeouts
  8. Other slots start experiencing delayed scheduling

One dead proxy degrades the entire system. With 200 slots, one bad proxy shouldn't affect 199 healthy ones.
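The arithmetic behind that list can be sketched as a small simulation. This is only an illustration: the function name and its parameters are hypothetical, and it adds up waiting time instead of making real network calls. The 30-second timeout and three attempts come from the steps above.

```javascript
// Hypothetical simulation of the naive retry loop described above.
// No real network calls or timers: it only adds up the time a slot
// would spend waiting on a proxy that never answers.
function simulateNaiveRetries({ timeoutMs, maxAttempts, proxyIsUp }) {
  let blockedMs = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (proxyIsUp(attempt)) {
      return { ok: true, attempts: attempt, blockedMs };
    }
    // Each failed attempt costs a full timeout before the retry starts.
    blockedMs += timeoutMs;
  }
  return { ok: false, attempts: maxAttempts, blockedMs };
}

const deadProxy = simulateNaiveRetries({
  timeoutMs: 30_000,
  maxAttempts: 3,
  proxyIsUp: () => false,
});
console.log(deadProxy); // { ok: false, attempts: 3, blockedMs: 90000 }
```

Every action scheduled against the dead proxy pays that cost again, which is why the delay compounds instead of staying at 90 seconds.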

The circuit breaker pattern

A circuit breaker sits between your application and an external dependency. It has three states:

     ┌──────────┐
     │  CLOSED  │ ← Normal operation. Requests pass through.
     └────┬─────┘
          │ failures >= threshold
          ▼
     ┌──────────┐
     │   OPEN   │ ← Requests fail immediately. No network calls.
     └────┬─────┘
          │ after resetTimeout
          ▼
    ┌───────────┐
    │ HALF-OPEN │ ← Allow one test request through.
    └─────┬─────┘
          │
    ┌─────┴──────┐
    │ success?   │
    ├─yes────────┤──► CLOSED (resume normal)
    └─no─────────┘──► OPEN (wait another resetTimeout)
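The diagram can also be read as a pure transition function. The sketch below is an illustration only: the event names ('failure', 'success', 'timeout-elapsed') and the `ctx` shape are assumptions made for this example, and `failures` is taken to include the failure being processed.

```javascript
// Pure transition function for the three-state diagram.
// Event names and the ctx shape are assumptions for this sketch.
function nextState(state, event, ctx = {}) {
  const { failures = 0, threshold = 5 } = ctx;
  switch (state) {
    case 'closed':
      // Stay closed until the failure count reaches the threshold.
      return event === 'failure' && failures >= threshold ? 'open' : 'closed';
    case 'open':
      // Only the reset timeout moves an open circuit forward.
      return event === 'timeout-elapsed' ? 'half-open' : 'open';
    case 'half-open':
      if (event === 'success') return 'closed';
      if (event === 'failure') return 'open';
      return 'half-open';
    default:
      throw new Error(`Unknown state: ${state}`);
  }
}

console.log(nextState('closed', 'failure', { failures: 5 })); // open
console.log(nextState('open', 'timeout-elapsed'));            // half-open
console.log(nextState('half-open', 'failure'));               // open
```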

Implementation

class CircuitBreaker {
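  constructor(name, options = {}) {
    this.name = name;
    this.state = 'closed';
    this.failures = 0;          // consecutive failures; reset on any success
    this.successes = 0;
    this.halfOpenAttempts = 0;  // test requests admitted in the current half-open window
    this.lastFailure = null;
    this.lastError = null;
    this.lastAttempt = null;

    this.threshold = options.threshold || 5;            // failures before the circuit opens
    this.resetTimeout = options.resetTimeout || 60_000; // ms to stay open before testing again
    this.halfOpenMax = options.halfOpenMax || 1;        // test requests allowed while half-open
    this.onStateChange = options.onStateChange || (() => {});
  }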

  async execute(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure >= this.resetTimeout) {
        this.transition('half-open');
      } else {
        throw new CircuitOpenError(
          `Circuit ${this.name} is open. ` +
          `Resets in ${this.timeUntilReset()}ms`
        );
      }
    }

    if (this.state === 'half-open') {
      // Only allow limited requests through
      if (this.halfOpenAttempts >= this.halfOpenMax) {
        throw new CircuitOpenError(
          `Circuit ${this.name} is half-open, max attempts reached`
        );
      }
      this.halfOpenAttempts++;
    }

    this.lastAttempt = Date.now();

    try {
      const result = await fn();
      this.onSuccess();
      return result;
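    } catch (err) {
      if (err?.isCircuitOpen) {
        // A nested breaker rejected the call before any request was made,
        // so don't count it against this dependency. Release the half-open
        // slot so this breaker can still run its own test request later.
        if (this.state === 'half-open') {
          this.halfOpenAttempts = Math.max(0, this.halfOpenAttempts - 1);
        }
      } else {
        this.onFailure(err);
      }
      throw err;
    }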
  }

  onSuccess() {
    this.failures = 0;
    this.successes++;
    if (this.state === 'half-open') {
      this.transition('closed');
    }
  }

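  onFailure(err) {
    this.failures++;
    this.lastFailure = Date.now();
    this.lastError = err;

    // A failed test request in half-open always re-opens the circuit, even if
    // a late success from an earlier in-flight call reset the failure count.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.transition('open');
    }
  }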

  transition(newState) {
    const oldState = this.state;
    this.state = newState;

    if (newState === 'half-open') {
      this.halfOpenAttempts = 0;
    }

    this.onStateChange({
      name: this.name,
      from: oldState,
      to: newState,
      failures: this.failures,
      lastError: this.lastError
    });
  }

  timeUntilReset() {
    if (this.state !== 'open') return 0;
    return Math.max(0,
      this.resetTimeout - (Date.now() - this.lastFailure)
    );
  }

  getStatus() {
    return {
      name: this.name,
      state: this.state,
      failures: this.failures,
      successes: this.successes,
      lastFailure: this.lastFailure,
      timeUntilReset: this.timeUntilReset()
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message) {
    super(message);
    this.name = 'CircuitOpenError';
    this.isCircuitOpen = true;
  }
}

Per-slot circuit breakers

Each slot gets its own circuit breaker for each external dependency:

class SlotDependencies {
  constructor(slotId) {
    this.slotId = slotId;

    this.proxy = new CircuitBreaker(`${slotId}:proxy`, {
      threshold: 3,
      resetTimeout: 120_000,  // 2 minutes
      onStateChange: (e) => this.logStateChange(e)
    });

    this.ai = new CircuitBreaker(`${slotId}:ai`, {
      threshold: 5,
      resetTimeout: 60_000,   // 1 minute
      onStateChange: (e) => this.logStateChange(e)
    });

    this.api = new CircuitBreaker(`${slotId}:api`, {
      threshold: 3,
      resetTimeout: 300_000,  // 5 minutes (rate limits are longer)
      onStateChange: (e) => this.logStateChange(e)
    });
  }

  logStateChange(event) {
    const db = getDb(this.slotId);
    db.prepare(`
      INSERT INTO audit_log (id, module, action, status, detail, timestamp)
      VALUES (?, 'system', 'circuit_breaker', ?, ?, datetime('now'))
    `).run(
      crypto.randomUUID(),
      event.to === 'open' ? 'warning' : 'info',
      `${event.name}: ${event.from} → ${event.to} (${event.failures} failures)`
    );
  }
}

When Slot A's proxy circuit opens, Slot A stops sending requests through that proxy. Slots B through Z continue normally — they have their own circuit breakers with their own state.
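That isolation only holds if each slot gets the same breaker instances on every call, so failure counts persist between actions. The scheduler code in this post calls `getSlotDependencies(slotId)` without showing it; a minimal sketch of such a lookup, assuming a lazily filled in-memory Map, could look like this. `createRegistry` and the stand-in breaker objects are hypothetical and exist only so the example runs on its own.

```javascript
// Hypothetical registry: one dependency bundle per slot, created lazily.
// `createDeps` stands in for `new SlotDependencies(slotId)` so the sketch
// runs without the real classes.
function createRegistry(createDeps) {
  const bySlot = new Map();
  return {
    get(slotId) {
      if (!bySlot.has(slotId)) {
        bySlot.set(slotId, createDeps(slotId));
      }
      return bySlot.get(slotId);
    },
    // Drop a slot's breakers when the slot is removed so the map doesn't grow forever.
    remove(slotId) {
      return bySlot.delete(slotId);
    },
  };
}

// Stand-in for the real breakers: just enough state to show isolation.
const registry = createRegistry((slotId) => ({
  proxy: { name: `${slotId}:proxy`, state: 'closed' },
}));

registry.get('slot-a').proxy.state = 'open';
console.log(registry.get('slot-a').proxy.state); // open
console.log(registry.get('slot-b').proxy.state); // closed
```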

Using circuit breakers in the scheduler

async function executeModuleAction(slotId, module) {
  const deps = getSlotDependencies(slotId);

  // Step 1: Find a tweet to reply to (uses proxy)
  let tweet;
  try {
    tweet = await deps.proxy.execute(() =>
      searchTweets(slotId, module.config.query)
    );
  } catch (err) {
    if (err.isCircuitOpen) {
      logAudit(slotId, module.name, 'skipped',
        `Proxy circuit open, resets in ${deps.proxy.timeUntilReset()}ms`);
      return;
    }
    throw err;
  }

  // Step 2: Generate AI reply (uses AI endpoint)
  let reply;
  try {
    reply = await deps.ai.execute(() =>
      generateReply(slotId, tweet, module.config.persona)
    );
  } catch (err) {
    if (err.isCircuitOpen) {
      logAudit(slotId, module.name, 'skipped',
        `AI circuit open, resets in ${deps.ai.timeUntilReset()}ms`);
      return;
    }
    throw err;
  }

  // Step 3: Send the reply (uses proxy + API)
  try {
    await deps.proxy.execute(() =>
      deps.api.execute(() =>
        sendReply(slotId, tweet.id, reply)
      )
    );
  } catch (err) {
    if (err.isCircuitOpen) {
      logAudit(slotId, module.name, 'skipped',
        `Circuit open: ${err.message}`);
      return;
    }
    throw err;
  }

  logAudit(slotId, module.name, 'success', reply);
}

Each step of the action is wrapped in its own circuit breaker. If the AI is down but the proxy is fine, the system skips AI-dependent modules but can still run non-AI modules (scheduled posts, reposts).
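The three try/catch blocks share one shape: run the call through a breaker, treat an open circuit as a skip, and rethrow anything else. One way to factor that out is sketched below. `runGuarded` is a hypothetical helper, not part of HelperX, and the two stand-in breakers exist only so the example runs without the real `CircuitBreaker` class.

```javascript
// Hypothetical helper: run `fn` through `breaker` and convert an open
// circuit into a "skipped" result instead of an exception.
async function runGuarded(breaker, fn, onSkip = () => {}) {
  try {
    return { skipped: false, value: await breaker.execute(fn) };
  } catch (err) {
    if (err && err.isCircuitOpen) {
      onSkip(err);
      return { skipped: true, value: undefined };
    }
    throw err; // real failures still propagate to the caller
  }
}

// Minimal stand-ins so the sketch runs without the real CircuitBreaker.
const closedBreaker = { execute: (fn) => fn() };
const openBreaker = {
  execute: async () => {
    throw Object.assign(new Error('Circuit demo is open'), { isCircuitOpen: true });
  },
};

async function demo() {
  const ok = await runGuarded(closedBreaker, async () => 'tweet-123');
  const skipped = await runGuarded(
    openBreaker,
    async () => 'never runs',
    (err) => console.log(`skipped: ${err.message}`)
  );
  return { ok, skipped };
}

demo().then((result) => console.log(result));
```

With a helper like this, each step in `executeModuleAction` would reduce to one call plus an early return when `skipped` is true.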

Monitoring circuit state

The dashboard shows circuit breaker state for each slot:

function getSystemHealth() {
  const slots = getAllActiveSlots();

  return slots.map(slot => {
    const deps = getSlotDependencies(slot.id);
    return {
      slotId: slot.id,
      proxy: deps.proxy.getStatus(),
      ai: deps.ai.getStatus(),
      api: deps.api.getStatus(),
      healthy: ['proxy', 'ai', 'api']
        .every(dep => deps[dep].state === 'closed')
    };
  });
}

An operator sees at a glance which slots are healthy, which have open circuits, and when each circuit will attempt recovery.
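For a fleet-level view, the per-slot report can be rolled up into counts plus a list of degraded slots. The sketch below assumes the input has the shape returned by `getSystemHealth()` above; `summarizeHealth` and the sample data are made up for illustration.

```javascript
// Hypothetical roll-up over the array returned by getSystemHealth().
// Each entry is { slotId, proxy, ai, api, healthy }, and each dependency
// status carries at least { state, timeUntilReset }.
function summarizeHealth(report) {
  const summary = { total: report.length, healthy: 0, degraded: [] };
  for (const slot of report) {
    if (slot.healthy) {
      summary.healthy++;
      continue;
    }
    // Keep only the circuits that are open or half-open.
    const circuits = ['proxy', 'ai', 'api']
      .filter((dep) => slot[dep].state !== 'closed')
      .map((dep) => ({
        dep,
        state: slot[dep].state,
        timeUntilReset: slot[dep].timeUntilReset,
      }));
    summary.degraded.push({ slotId: slot.slotId, circuits });
  }
  return summary;
}

// Made-up sample report with one healthy and one degraded slot.
const sample = [
  { slotId: 'slot-a', healthy: true,
    proxy: { state: 'closed', timeUntilReset: 0 },
    ai: { state: 'closed', timeUntilReset: 0 },
    api: { state: 'closed', timeUntilReset: 0 } },
  { slotId: 'slot-b', healthy: false,
    proxy: { state: 'open', timeUntilReset: 45_000 },
    ai: { state: 'closed', timeUntilReset: 0 },
    api: { state: 'closed', timeUntilReset: 0 } },
];

const summary = summarizeHealth(sample);
console.log(JSON.stringify(summary, null, 2));
```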

Tuning thresholds

Default thresholds aren't universal. We tuned ours based on failure patterns:

Dependency | Threshold  | Reset timeout | Why
Proxy      | 3 failures | 2 min         | Proxy failures are usually transient. Quick retry.
AI model   | 5 failures | 1 min         | AI endpoints recover fast. Higher threshold to absorb occasional 503s.
X API      | 3 failures | 5 min         | Rate limits last 15 min. Longer reset avoids hammering.

The key insight: reset timeout should match the expected recovery time of the dependency, not an arbitrary number.
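If past outage durations are available, one way to turn "expected recovery time" into a number is to take a percentile of them. This is only a sketch of that idea: `suggestResetTimeout`, the nearest-rank percentile, the 0.9 default, and the sample durations are all assumptions for illustration, not values measured in HelperX.

```javascript
// Hypothetical helper: pick a reset timeout from past outage durations.
// Uses the nearest-rank percentile; the 0.9 default is an arbitrary
// illustration, not a recommendation.
function suggestResetTimeout(recoveryDurationsMs, percentile = 0.9) {
  if (recoveryDurationsMs.length === 0) {
    throw new Error('Need at least one observed recovery duration');
  }
  const sorted = [...recoveryDurationsMs].sort((a, b) => a - b);
  const rank = Math.ceil(percentile * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up sample: five proxy outages, in milliseconds.
const proxyOutages = [45_000, 60_000, 90_000, 110_000, 120_000];
console.log(suggestResetTimeout(proxyOutages));      // 120000
console.log(suggestResetTimeout(proxyOutages, 0.5)); // 90000
```

A higher percentile means fewer wasted test requests against a dependency that is still down, at the cost of staying open longer after a short outage.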

What we learned

1. One circuit breaker per dependency per tenant. Global circuit breakers cause healthy tenants to suffer for unhealthy ones. Per-tenant isolation is the whole point.

2. Log state transitions. When a circuit opens, the audit log records it. This is the most valuable diagnostic information during incidents.

3. Graceful skip > hard failure. When a circuit is open, the action is skipped and logged — not retried, not errored, not queued. The scheduler moves to the next action. Queuing failures leads to thundering herds when the circuit closes.

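4. Nested circuit breakers work, with one caveat. An action that uses proxy + API goes through both breakers, and if either is open, the action is skipped. The outer breaker should ignore the inner breaker's CircuitOpenError rather than count it as a failure; otherwise an open API circuit gradually trips the proxy circuit too.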

5. Half-open state prevents oscillation. Without it, a circuit that closes as soon as the timeout expires lets a burst of requests through, which may re-trigger the failure. Half-open admits exactly one test request, which prevents the open/close/open cycle.
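The gating in point 5 can be shown in isolation. The sketch below mirrors the `halfOpenAttempts` / `halfOpenMax` check from the class in this post; `createHalfOpenGate` itself is a hypothetical stand-in, and returning `false` takes the place of throwing `CircuitOpenError`.

```javascript
// Hypothetical gate mirroring the halfOpenAttempts / halfOpenMax check.
function createHalfOpenGate(maxProbes = 1) {
  let admitted = 0;
  return {
    tryEnter() {
      if (admitted >= maxProbes) return false; // the real breaker throws here
      admitted++;
      return true;
    },
  };
}

// Ten requests arrive the moment the reset timeout expires.
const gate = createHalfOpenGate(1);
const results = Array.from({ length: 10 }, () => gate.tryEnter());
const passed = results.filter(Boolean).length;
console.log(`${passed} admitted, ${results.length - passed} rejected`);
// 1 admitted, 9 rejected
```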


HelperX uses per-slot circuit breakers to keep your accounts running independently — one bad proxy doesn't affect the rest. Free 30-day trial.
