DEV Community

zhongqiyue

When Your AI API Budget Blew Up: Multi-Provider Routing

I remember the exact moment my heart sank. It was a Tuesday morning, and I opened the billing dashboard for our AI API provider to find a $3,200 charge staring back at me. Our previous month had been $400. A junior dev had accidentally left a loop running in production that was hammering the endpoint with redundant prompts.

That pain was real, but it forced me to solve a deeper issue: we were relying on a single AI provider, and our costs and reliability were completely out of our control.

The Real Problem

Like many teams, we'd started with one provider because it was the easiest. The API was straightforward, the documentation was decent. But as we scaled from a simple chatbot to more complex automations—parsing emails, summarizing documents, generating code reviews—the single point of failure became unbearable.

Rate limits started biting us during peak hours. Costs exploded because we had no way to route cheaper queries to a different model. And if that provider had an outage (which happened twice in three months), our product was dead in the water.

What I Tried First That Didn't Work

My first instinct was to just duplicate the calls: try provider A, and if it fails, try provider B. I slapped together a quick Python script with try/except blocks and the requests library. It worked… for about two days.

# Naive fallback (don't do this)
def query_ai(prompt):
    try:
        return provider_a_call(prompt)
    except Exception:
        try:
            return provider_b_call(prompt)
        except Exception:
            raise RuntimeError("All providers failed")

Problems: each exception added seconds of latency, I had no way to prioritize cheaper providers, and I wasn't tracking which calls actually succeeded or failed. Plus, the code quickly turned into a spaghetti mess as we added a third provider.

Then I tried a more sophisticated queue-based approach with Celery and task retries. That made things even worse—we were overloading downstream APIs, hitting stricter rate limits, and paying for compute we didn't need.

What Eventually Worked: An Adaptive Routing Layer

After a lot of trial and error, I settled on a different pattern: a routing layer that sits between your application code and your AI providers. It's not fancy—it's essentially a Python class that uses a configurable strategy to pick which provider to call, tracks performance, and handles fallbacks gracefully.

Here's the core idea in about 50 lines:

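import time
from typing import Callable, Dict, Optional

class AIRouter:
    def __init__(self, providers: Dict[str, Callable], config: Optional[dict] = None):
        self.providers = providers
        # Defaults. Keys in a caller-supplied config override these, so a
        # partial config (e.g. only 'preferred_order') still gets
        # 'max_retries' and 'timeout' instead of raising KeyError in query().
        defaults = {
            'cost_per_token': {
                'provider_a': 0.03,
                'provider_b': 0.01,
                'provider_c': 0.008,
            },
            'max_retries': 2,
            'timeout': 10,
            'preferred_order': ['provider_c', 'provider_b', 'provider_a']
        }
        self.config = {**defaults, **(config or {})}
        # Note: 'calls' counts successful calls only; failures go to 'errors'
        self.stats = {name: {'calls': 0, 'errors': 0, 'total_time': 0.0} for name in providers}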

    def query(self, prompt: str, context: dict = None) -> str:
        # Use context to optionally override order (e.g., based on user tier)
        order = self.config['preferred_order']
        if context and 'force_provider' in context:
            order = [context['force_provider']]

        last_error = None
        for provider_name in order:
            if provider_name not in self.providers:
                continue
            provider_fn = self.providers[provider_name]
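            for attempt in range(self.config['max_retries']):
                try:
                    # perf_counter is monotonic, so it is safer for timing than time.time
                    start = time.perf_counter()
                    result = provider_fn(prompt, timeout=self.config['timeout'])
                    elapsed = time.perf_counter() - start
                    self._record_success(provider_name, elapsed)
                    return result
                except Exception as e:
                    self._record_error(provider_name)
                    last_error = e
                    # Linear backoff before retrying the same provider;
                    # skip the wait when we are about to move to the next one
                    if attempt < self.config['max_retries'] - 1:
                        time.sleep(0.5 * (attempt + 1))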
        raise RuntimeError(f"All providers failed. Last error: {last_error}")

    def _record_success(self, name, elapsed):
        self.stats[name]['calls'] += 1
        self.stats[name]['total_time'] += elapsed

    def _record_error(self, name):
        self.stats[name]['errors'] += 1

This class isn't production-ready (no logging, no async, no circuit breakers), but it's the skeleton you can build on. The key insight is decoupling the "which provider" logic from the "how to call" logic. Once you have that, you can add all sorts of strategies: cheapest-first, fastest-first, based on prompt length, or based on user subscription level.
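A few of those strategies can be sketched as plain functions that return a provider order. This assumes the `stats` and `cost_per_token` shapes from the class above; the function names and the 2,000-character threshold are hypothetical, made up for illustration rather than taken from our actual router:

```python
from typing import Dict, List


def cheapest_first(cost_per_token: Dict[str, float]) -> List[str]:
    """Order provider names from lowest to highest configured cost."""
    return sorted(cost_per_token, key=lambda name: cost_per_token[name])


def fastest_first(stats: Dict[str, dict]) -> List[str]:
    """Order provider names by average latency of successful calls.

    Providers with no successful calls yet sort last, since there is
    nothing to average.
    """
    def avg_latency(name: str) -> float:
        s = stats[name]
        return s['total_time'] / s['calls'] if s['calls'] else float('inf')

    return sorted(stats, key=avg_latency)


def by_prompt_length(prompt: str, short_order: List[str],
                     long_order: List[str], threshold: int = 2000) -> List[str]:
    """Pick an order based on prompt size (the threshold is an arbitrary example)."""
    return short_order if len(prompt) < threshold else long_order
```

The returned list could be assigned to `router.config['preferred_order']` before calling `query`, which keeps the strategy separate from the calling code.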

I also added a simple cost-tracking module that estimates tokens and logs each request. That alone saved our team—we could see which endpoints were costing us the most and adjust the routing order accordingly.
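That module isn't shown here, but a minimal sketch of the idea might look like the following. The names `CostTracker` and `estimate_tokens` are hypothetical, and the "about 4 characters per token" rule is a rough assumption rather than a real tokenizer, so treat the numbers as estimates only:

```python
import logging
from collections import defaultdict
from typing import Dict, List, Tuple

logger = logging.getLogger("ai_costs")


def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


class CostTracker:
    def __init__(self, cost_per_token: Dict[str, float]):
        self.cost_per_token = cost_per_token
        self.totals = defaultdict(float)  # endpoint -> estimated spend

    def record(self, endpoint: str, provider: str, prompt: str, response: str) -> float:
        """Log one request and return its estimated cost."""
        tokens = estimate_tokens(prompt) + estimate_tokens(response)
        # Unknown providers are treated as free rather than raising
        cost = tokens * self.cost_per_token.get(provider, 0.0)
        self.totals[endpoint] += cost
        logger.info("endpoint=%s provider=%s tokens=%d est_cost=%.6f",
                    endpoint, provider, tokens, cost)
        return cost

    def most_expensive(self, n: int = 3) -> List[Tuple[str, float]]:
        """Return the n endpoints with the highest estimated spend."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `record` after each successful `router.query` and checking `most_expensive()` periodically is enough to see where the money goes.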

The Setup in Practice

To use this, you'd define provider functions that wrap API calls. For example:

from openai import OpenAI
import anthropic
from transformers import pipeline

# Create the clients once and reuse them; both read their API key
# from the environment (OPENAI_API_KEY / ANTHROPIC_API_KEY)
openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def call_openai(prompt: str, timeout=10):
    # openai>=1.0 client interface; openai.ChatCompletion was removed in 1.0
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout
    )
    return response.choices[0].message.content

def call_anthropic(prompt: str, timeout=10):
    message = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout
    )
    return message.content[0].text

# We also add a local model for cheap tasks
gen = pipeline('text2text-generation', model='google/flan-t5-small')
def call_local(prompt: str, timeout=10):
    # timeout is accepted to match the other providers but is not enforced here
    return gen(prompt)[0]['generated_text']

# Then wire it up
router = AIRouter(
    providers={
        'openai': call_openai,
        'anthropic': call_anthropic,
        'local': call_local
    },
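    config={
        'preferred_order': ['local', 'openai', 'anthropic'],
        'max_retries': 2,
        'timeout': 10,
        # NOTE: despite the key name, these look like per-1K-token list
        # prices; divide by 1000 if you need a true per-token cost
        'cost_per_token': {
            'local': 0.0,
            'openai': 0.002,  # gpt-3.5-turbo
            'anthropic': 0.00025  # claude-haiku
        }
    }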
)

# Use it in your app
result = router.query("Summarize this email: ...")

Now, when we get a simple request like "summarize an email", the router tries local first (free), and only falls back to the paid APIs if the local call raises an error. This cut our AI bill by 60% in the first month.

Lessons Learned & Trade-offs

  • Routing logic is deceptively simple. The class above is under 100 lines, but you'll spend real time tweaking the priority order and timeout values based on real traffic patterns.
  • The latency vs. cost trade-off is real. Local models are cheap but slow on CPU. We ended up moving local inference to a GPU node for better latency, which added infrastructure cost. For some use cases, it's still cheaper than API calls.
  • You need monitoring. Without stats, you're blind. We integrated with our existing observability stack to track provider performance and cost per user.
  • Not all models speak the same language. Claude and GPT may handle formatting differently. We had to add a normalisation layer for structured outputs (JSON parsing, etc.).
  • Provider API changes happen. We got burned when Anthropic deprecated their old message API. The routing layer meant we only needed to update one provider function, but it was still a scramble.
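For the normalisation point above, here is a rough sketch of what such a layer might do for JSON output. The helper name `extract_json` is hypothetical, and it only handles two common cases (a Markdown-fenced reply, or JSON surrounded by prose); it is not the layer we actually run:

```python
import json
import re
from typing import Any, Optional

TICKS = "`" * 3  # Markdown code-fence marker
_FENCED = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(raw: str) -> Optional[Any]:
    """Return the parsed JSON found in a model reply, or None if there is none."""
    text = raw.strip()
    fenced = _FENCED.search(text)
    if fenced:
        # Some models wrap JSON in a fenced block; keep only its contents
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, for replies with surrounding prose
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None
```

Returning `None` instead of raising lets the caller decide whether a malformed reply should count as a provider failure and trigger a fallback.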

When NOT to Use This Approach

This pattern adds complexity. If you have a single, stable use case with predictable load and acceptable costs, don't bother. Also, if you need strict consistency (e.g., always the same model version for reproducibility), routing is a bad idea.

What I'd Do Differently Next Time

I'd start with a simpler config-driven router from day one, rather than the ad-hoc fallback mess. I'd also add rate-limit awareness—my current router doesn't proactively slow down when a provider is throttling; it just fails and moves on. A proper circuit breaker pattern would be better.
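For reference, a bare-bones circuit breaker could look something like this. It's a sketch, not what we run: the class name, the threshold of 3 failures, and the 30-second cooldown are all placeholder assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open state)
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

In the router loop, the idea would be to keep one breaker per provider, skip any provider whose `allow()` returns False, and call `record_failure()` on errors such as rate-limit responses, so a throttled provider gets a break instead of more retries.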

And I'd definitely not leave a loop running in production. But maybe that's just me.

The whole experience taught me that the real art isn't in picking the "best" AI model—it's in building systems that gracefully handle the messiness of real-world APIs.

So, what does your setup look like? Are you using a single provider or something more distributed?
