DEV Community

zhongqiyue


When Your AI Service Goes Down: Building a Multi-Model Fallback System

I remember the exact moment my weekend project turned into a nightmare. I'd been building a chatbot for a niche developer community, and everything was running smoothly with OpenAI's GPT-4. Then, without warning, the API started returning 503 errors. My app went silent. Users flooded my inbox. And I had no backup plan.

That afternoon, I learned the hard way: relying on a single AI provider is a single point of failure.

What I Tried First (And Why It Hurt)

My immediate reaction was to slap a try-catch around the OpenAI call and fall back to a different model. Something like:

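// Naive first attempt. Assumes `openai` is an already-configured OpenAIApi client (openai v3 SDK).
async function getCompletion(prompt) {
  try {
    const res = await openai.createChatCompletion({
      model: "gpt-4",
      messages: [{ role: "user", content: prompt }],
    });
    return res.data.choices[0].message.content;
  } catch (e) {
    // fallback? but to what?
    return null;
  }
}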

This was naive. The fallback was undefined. Even if I did switch to another provider, I needed:

  • Different API endpoints and keys
  • Different request/response formats
  • Handling rate limits specific to each provider
  • Managing token limits (OpenAI vs Cohere vs Anthropic all differ)

I tried a brute-force approach: if OpenAI fails, try Anthropic. If that fails, try Cohere. But I was duplicating code everywhere, mixing up auth, and my error handling turned into spaghetti.

The dead ends taught me one thing: you need a real abstraction layer, not just a chain of try-catches.

The Approach That Actually Worked

I built a lightweight AI model router – a module that abstracts away provider differences and exposes a unified interface. The core idea: define a common response shape, implement a provider class for each AI service, and add a router that decides which provider to use (and when to fall back).

Here's the stripped-down version I ended up using in production:

1. Define a Common Response Interface

interface AIResponse {
  text: string;
  model: string;
  tokensUsed: number;
  latencyMs: number;
}

Every provider must return this exact shape. This normalizes the data for your app.

2. Build Provider Classes

Each provider implements a generate(prompt: string): Promise<AIResponse> method. Here's a simplified example for OpenAI:

import { Configuration, OpenAIApi } from "openai"; // openai v3 SDK

class OpenAIProvider {
  private client: OpenAIApi;
  private model: string;

  constructor(apiKey: string, model = "gpt-4") {
    this.client = new OpenAIApi(new Configuration({ apiKey }));
    this.model = model;
  }

  async generate(prompt: string): Promise<AIResponse> {
    const start = Date.now();
    // gpt-4 is a chat model, so it goes through the chat completions endpoint
    const response = await this.client.createChatCompletion({
      model: this.model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: 500,
    });
    return {
      text: response.data.choices[0]?.message?.content ?? "",
      model: this.model,
      tokensUsed: response.data.usage?.total_tokens ?? -1, // -1 when usage is missing
      latencyMs: Date.now() - start,
    };
  }
}

And a similar class for Anthropic (Claude):

import Anthropic from "@anthropic-ai/sdk";

class AnthropicProvider {
  private client: Anthropic;
  private model: string;

  constructor(apiKey: string, model = "claude-2") {
    this.client = new Anthropic({ apiKey });
    this.model = model;
  }

  async generate(prompt: string): Promise<AIResponse> {
    const start = Date.now();
    // The legacy text completions endpoint expects Human/Assistant turn markers
    const response = await this.client.completions.create({
      model: this.model,
      prompt: `${Anthropic.HUMAN_PROMPT} ${prompt}${Anthropic.AI_PROMPT}`,
      max_tokens_to_sample: 500,
    });
    return {
      text: response.completion,
      model: this.model,
      tokensUsed: -1, // -1 = unknown; Anthropic doesn't always expose token count
      latencyMs: Date.now() - start,
    };
  }
}

3. The Router Class

This is where the magic happens – it tries providers in order and handles errors gracefully.

// Contract every provider class satisfies (TypeScript checks it structurally)
interface AIProvider {
  generate(prompt: string): Promise<AIResponse>;
}

class AIRouter {
  private providers: { provider: AIProvider; weight: number }[];
  private currentIndex: number = 0;

  constructor(providers: { provider: AIProvider; weight?: number }[]) {
    this.providers = providers.map(p => ({
      provider: p.provider,
      weight: p.weight ?? 1,
    }));
  }

  async generate(prompt: string): Promise<AIResponse> {
    // Try each provider at most once, starting where the last call left off.
    // Note: weights are stored but not yet used for selection in this stripped-down version.
    let attempts = 0;
    while (attempts < this.providers.length) {
      const providerEntry = this.providers[this.currentIndex];
      try {
        const result = await providerEntry.provider.generate(prompt);
        // success: rotate so the next call starts with the next provider (round-robin)
        this.currentIndex = (this.currentIndex + 1) % this.providers.length;
        return result;
      } catch (error) {
        // `error` is `unknown` in strict TypeScript, so narrow it before reading .message
        const message = error instanceof Error ? error.message : String(error);
        console.warn(`${providerEntry.provider.constructor.name} failed:`, message);
        // Move to the next provider; if all fail, the error below is thrown after the loop
        this.currentIndex = (this.currentIndex + 1) % this.providers.length;
        attempts++;
      }
    }
    throw new Error("All AI providers failed");
  }
}

I also added retry logic with exponential backoff inside each provider class to absorb transient errors; the router handles total outages. You can also run periodic health checks (say, every 30 seconds) to skip providers that are known to be down.
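The post doesn't show the retry code, so here is a minimal sketch of what exponential backoff inside a provider could look like. The helper name `withRetry` and its options (`maxRetries`, `baseDelayMs`, `sleep`) are my own illustrative choices, not part of the original router.

```typescript
// Hypothetical helper: retries an async call with exponential backoff.
// All names here are illustrative assumptions, not the author's actual code.
interface RetryOptions {
  maxRetries?: number;  // retries allowed after the first attempt
  baseDelayMs?: number; // delay before the first retry
  sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waiting
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { maxRetries = 3, baseDelayMs = 250, sleep = defaultSleep } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // out of retries, give up
      // Delay doubles on every retry: base, 2 * base, 4 * base, ...
      await sleep(baseDelayMs * Math.pow(2, attempt));
    }
  }
  // Rethrow the last failure so the router can move on to the next provider
  throw lastError;
}
```

Inside a provider's generate method, the SDK call would be wrapped as `withRetry(() => sdkCall())`. In practice you would probably retry only transient failures (timeouts, 429s, 5xx responses) and let authentication errors fail immediately.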

Real Code: Putting It Together

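// Fail fast if a key is missing rather than sending unauthenticated requests
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing environment variable: ${name}`);
  return value;
}

const openaiProvider = new OpenAIProvider(requireEnv("OPENAI_API_KEY"));
const anthropicProvider = new AnthropicProvider(requireEnv("ANTHROPIC_API_KEY"));
// Add more providers (Cohere, local LLM, etc.)

const router = new AIRouter([
  { provider: openaiProvider, weight: 3 },   // preferred, higher weight
  { provider: anthropicProvider, weight: 1 },
]);

const response = await router.generate("Explain quantum computing in one sentence.");
console.log(response.text);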

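This pattern is dead simple, but it has saved me multiple times. The router doesn't just fall back; it also rotates requests across providers round-robin, which helps avoid rate limits on any single one. One caveat: the stripped-down router above stores each provider's weight but doesn't use it for selection yet.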
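If you want the weights to actually influence routing, one option is a weighted random pick for the first provider, with the rest kept as fallbacks in their original order. This is a sketch under my own assumptions; `pickWeightedIndex` and `fallbackOrder` are hypothetical helpers, not part of the router shown above.

```typescript
// Hypothetical helpers: weighted random choice of the first provider to try.
// `random` is injectable so the behavior can be tested deterministically.
function pickWeightedIndex(
  weights: number[],
  random: () => number = Math.random,
): number {
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (weights.length === 0 || total <= 0) {
    throw new Error("weights must be non-empty with a positive sum");
  }
  // Walk the cumulative weights until the random threshold is used up
  let threshold = random() * total;
  for (let i = 0; i < weights.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return i;
  }
  return weights.length - 1; // guards against floating-point edge cases
}

// Returns provider indices in the order they should be tried:
// the weighted pick first, then every other provider as a fallback.
function fallbackOrder(
  weights: number[],
  random: () => number = Math.random,
): number[] {
  const first = pickWeightedIndex(weights, random);
  const rest = weights.map((_, i) => i).filter(i => i !== first);
  return [first, ...rest];
}
```

With weights of 3 and 1, the first provider would be tried first on roughly three out of four calls, and the other one is still there as a fallback when it fails.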

Lessons Learned & Trade-offs

What I'd do differently next time:

  • Track provider latency and failure rates, and dynamically adjust weights (e.g., via Prometheus metrics).
  • Cache responses for identical prompts to reduce costs and latency.
  • Implement circuit breakers so a failing provider isn't tried repeatedly.
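As a rough illustration of the circuit-breaker idea from the list above, here is a minimal sketch. The class name, the thresholds, and the injectable `now` clock are all assumptions of mine; a production version would likely need per-error-type handling and metrics.

```typescript
// Hypothetical circuit breaker: after `failureThreshold` consecutive failures,
// the provider is skipped until `cooldownMs` has passed, then one trial call is allowed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 3,
    cooldownMs = 30000,
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when the provider may be called (closed, or cooldown has elapsed)
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A successful call closes the breaker and clears the failure count
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed call counts toward the threshold; reaching it (re)opens the breaker
  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

The router would keep one breaker per provider, skip any provider whose `canRequest()` returns false, and call `recordSuccess()` or `recordFailure()` after each attempt.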

When this approach is overkill:

  • If your app can tolerate minutes or hours of downtime, a single provider with a retry queue is fine.
  • If you only use free or cheap models, the added complexity of multiple providers may not be worth it.
  • If you need the exact same behavior across providers (e.g., specific tokenization), you'll face normalization challenges.

Trade-offs:

  • You now maintain multiple API keys and likely pay for multiple subscriptions.
  • Response quality may vary between providers – your app needs to handle different styles.
  • Latency increases as you work through the fallback chain (though you can cap it with per-provider timeouts).
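To make the timeout point concrete, here is one way a per-provider deadline could be sketched. `withTimeout` is a hypothetical helper of mine, and note that it only stops waiting; it does not cancel the underlying HTTP request.

```typescript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
// The underlying request keeps running; this only bounds how long the caller waits.
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  label = "operation",
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      value => {
        clearTimeout(timer); // settled in time, so cancel the pending timeout
        resolve(value);
      },
      error => {
        clearTimeout(timer);
        reject(error); // pass the original failure through unchanged
      },
    );
  });
}
```

In the router, the provider call could become something like `withTimeout(provider.generate(prompt), 10000, "provider call")`, so a hung provider counts as a failure and the next one gets a turn. The 10-second figure is only an example value.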

I ended up open-sourcing a minimal version of this router, and also looked at existing solutions. There are services like ai.interwestinfo.com that offer similar multi-provider routing with built-in monitoring, but I wanted full control for my project.

Final Thoughts

Building this router wasn't glamorous – it was born out of panic and frustration. But it taught me a fundamental truth about AI integrations: they will fail, and you need a graceful exit.

Now, every time I deploy an AI-powered feature, I ask myself: "What happens if this provider dies at 3 AM?" The answer should never be "my app breaks."

What's your fallback strategy when an AI API goes down? I'd love to hear how others handle this – especially for fine-tuned or other lower-level models, where you can't just switch providers without retraining.
