zhongqiyue

Posted on Jun 27

How I built a simple AI router to avoid vendor lock-in and cut costs

#ai #python #webdev #tutorial

I've been working on a side project that needs AI for a few different tasks: answering user questions, generating image captions, and summarizing chat threads. At first, I just picked one provider (OpenAI) and called it a day. But after a month, two things became painfully clear: first, not every model is great at every task, and second, the bill was climbing fast because I was using GPT-4 for everything.

So I did what any reasonable developer would do: I started swapping API keys by hand. I'd comment out one import and uncomment another, deploy, test, get frustrated, rinse and repeat. That worked for about a week before I decided I needed a proper solution.

The problem

My project had three distinct AI needs:

Q&A: Needs high reasoning, can be slow and expensive (GPT-4 or Claude).
Caption generation: Fast, cheap, doesn't need deep reasoning (GPT-3.5 or Llama).
Summarization: Needs good context handling but not cutting-edge intelligence (Claude Instant or Mixtral).

I was using one provider for all three, which meant I was either overpaying for simple tasks or getting low-quality results for complex ones.

What I tried that didn't work

First, I tried a simple if-elif chain in every endpoint. That turned into spaghetti within hours. Then I tried a config file with model names, but I still had to handle different SDKs, authentication, and response formats manually. It was brittle and ugly.

I also looked at some API aggregation services. They promised unified access but often introduced latency, added cost per call, or required me to trust their infrastructure with my keys. Not ideal for a small project where I wanted full control.

What eventually worked: an AI router

I built a tiny Python class that acts as a router. It takes a task name, picks a provider and model from a config file, and handles the request. The key insight: I didn't need a full proxy — just a configurable dispatcher that I could plug into my existing code with minimal changes.

Here's the core of it. First, the config file (config/ai_router.yaml):

# config/ai_router.yaml
routing:
  qa:
    provider: openai
    model: gpt-4
    max_tokens: 500
    temperature: 0.2
  captions:
    provider: anthropic
    model: claude-3-haiku-20240307
    max_tokens: 200
    temperature: 0.7
  summarize:
    provider: openai
    model: gpt-3.5-turbo
    max_tokens: 1000
    temperature: 0.3
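
One thing a config like this invites is typos that only surface when a request comes in. Here is a minimal sketch of a startup check, assuming the parsed routing section is a plain dict. The validate_routing helper, the REQUIRED_KEYS and KNOWN_PROVIDERS sets, and the "broken" entry are hypothetical names for illustration, not part of the router shown in this post.

```python
# Hypothetical helper, not part of the original router: checks the parsed
# "routing" mapping before any request is made.
REQUIRED_KEYS = {"provider", "model", "max_tokens", "temperature"}
KNOWN_PROVIDERS = {"openai", "anthropic"}  # the two providers used in this post


def validate_routing(routing: dict) -> list[str]:
    """Return a list of readable problems; an empty list means nothing was flagged."""
    problems = []
    for task, cfg in routing.items():
        if not isinstance(cfg, dict):
            problems.append(f"{task}: expected a mapping, got {type(cfg).__name__}")
            continue
        missing = REQUIRED_KEYS - cfg.keys()
        if missing:
            problems.append(f"{task}: missing keys {sorted(missing)}")
        provider = cfg.get("provider")
        if provider is not None and provider not in KNOWN_PROVIDERS:
            problems.append(f"{task}: unknown provider {provider!r}")
        max_tokens = cfg.get("max_tokens")
        if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
            problems.append(f"{task}: max_tokens must be a positive integer")
    return problems


# Same shape that yaml.safe_load(...)["routing"] returns for the file above,
# plus one deliberately broken entry to show what gets reported.
routing = {
    "qa": {"provider": "openai", "model": "gpt-4", "max_tokens": 500, "temperature": 0.2},
    "captions": {"provider": "anthropic", "model": "claude-3-haiku-20240307",
                 "max_tokens": 200, "temperature": 0.7},
    "broken": {"provider": "opnai", "model": "gpt-4"},
}
problems = validate_routing(routing)
for line in problems:
    print(line)
```

Calling something like this right after loading the YAML and raising if the list is non-empty would turn a misspelled provider into a startup error instead of a failed request.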

Now the router class (router.py):

import os

import yaml  # third-party: PyYAML

class AIRouter:
    def __init__(self, config_path="config/ai_router.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)['routing']
        self._init_providers()

    def _init_providers(self):
        # Import and instantiate only the SDKs that at least one route uses
        self.providers = {}

        if any(cfg['provider'] == 'openai' for cfg in self.config.values()):
            from openai import OpenAI
            self.providers['openai'] = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

        if any(cfg['provider'] == 'anthropic' for cfg in self.config.values()):
            from anthropic import Anthropic
            self.providers['anthropic'] = Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])

    def complete(self, task: str, prompt: str):
        """Send prompt to the provider and model configured for task; return the reply text."""
        cfg = self.config.get(task)
        if not cfg:
            raise ValueError(f"Unknown task: {task}")

        provider = self.providers[cfg['provider']]
        model = cfg['model']

        if cfg['provider'] == 'openai':
            response = provider.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=cfg['max_tokens'],
                temperature=cfg['temperature']
            )
            return response.choices[0].message.content

        elif cfg['provider'] == 'anthropic':
            response = provider.messages.create(
                model=model,
                max_tokens=cfg['max_tokens'],
                temperature=cfg['temperature'],
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text

        else:
            raise NotImplementedError(f"Provider {cfg['provider']} not implemented")

Usage in my app is dead simple:

from router import AIRouter
router = AIRouter()

# In one endpoint:
answer = router.complete('qa', "What's the capital of France?")
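# In another (note: complete() sends plain text only; real image input would
# need the provider's multimodal message format, which this router doesn't build):
caption = router.complete('captions', "Describe this image: [base64 data]")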

Trade-offs and limitations

I'll be honest: this isn't production-grade. Error handling is minimal. If a provider is down, the whole request fails. There's no retry logic or fallback. Also, the config is static — if I want to switch models mid-request, I'd need a different approach.
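
To make the missing retry logic concrete, here is a rough sketch of exponential backoff wrapped around any callable. The with_retries name and its parameters are my own assumptions, and catching bare Exception is only for illustration; real code would catch the specific timeout and rate-limit errors each SDK raises.

```python
import random
import time


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call() up to `attempts` times, backing off exponentially between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # deliberately broad for this sketch
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # base_delay, 2x, 4x, ... plus a little jitter so clients don't sync up
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Demo with a stand-in for router.complete that fails twice, then succeeds.
calls = {"n": 0}


def flaky_complete():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("provider unavailable")
    return "Paris"


# The no-op sleep keeps the demo instant; drop that argument in real use.
result = with_retries(flaky_complete, sleep=lambda seconds: None)
print(result, "after", calls["n"], "calls")
```

With the router from this post, the call site would look something like with_retries(lambda: router.complete('qa', prompt)).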

But for my project, it solved the immediate pain: I can now route tasks to the most cost-effective model without touching code. I saved about 40% on API costs in the first month by sending captions to cheaper models.

What I'd do differently next time

I'd add a fallback mechanism. For example, if gpt-4 fails, try gpt-3.5-turbo before erroring out. Also, I'd make the router async — most providers support async now, and it would fit better in a web framework like FastAPI.
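
Both ideas can be sketched together without rewriting the router. The version below assumes a second route exists in the config (the name qa_fallback is hypothetical) and uses asyncio.to_thread (Python 3.9+) to run the blocking complete call off the event loop. That is a stopgap rather than true async: the native option would be the SDKs' own async clients (AsyncOpenAI, AsyncAnthropic).

```python
import asyncio


async def complete_with_fallback(complete, tasks, prompt):
    """Try each route name in order and return the first successful reply.

    `complete` is any blocking callable shaped like complete(task, prompt),
    for example router.complete. It runs in a worker thread so the event
    loop stays free while the provider responds.
    """
    last_error = None
    for task in tasks:
        try:
            return await asyncio.to_thread(complete, task, prompt)
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
    raise RuntimeError(f"All routes failed: {tasks}") from last_error


# Stand-in for router.complete: the primary route fails, the fallback answers.
def fake_complete(task, prompt):
    if task == "qa":
        raise TimeoutError("primary model timed out")
    return f"[{task}] answer to: {prompt}"


answer = asyncio.run(
    complete_with_fallback(fake_complete, ["qa", "qa_fallback"], "What's the capital of France?")
)
print(answer)
```

Inside a FastAPI endpoint this would be awaited directly instead of going through asyncio.run.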

Another improvement: dynamic routing based on prompt length or complexity. For instance, if a Q&A prompt is short and simple, route it to a cheaper model automatically.
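
A crude version of that heuristic might look like the sketch below. The pick_task function, the qa_cheap route name, and the 30-word threshold are all assumptions for illustration; word count and question marks are a very rough proxy for complexity and would need tuning against real prompts.

```python
def pick_task(prompt: str, default_task: str = "qa", cheap_task: str = "qa_cheap",
              max_cheap_words: int = 30) -> str:
    """Pick a route name from a rough complexity heuristic.

    Short prompts containing at most one question go to the cheaper route;
    everything else goes to the default route.
    """
    is_short = len(prompt.split()) <= max_cheap_words
    is_single_question = prompt.count("?") <= 1
    return cheap_task if is_short and is_single_question else default_task


simple = pick_task("What's the capital of France?")
long_one = pick_task("Compare these two contracts clause by clause. " * 10)
print(simple, long_one)
```

The returned name would then be passed to router.complete in place of a hard-coded task, with qa_cheap defined as its own entry in the YAML.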

Real-world tools

If you don't want to build this yourself, there are services that do something similar. For instance, ai.interwestinfo.com offers a unified API with smart routing. But for my small project, rolling my own taught me a lot about each provider's quirks. It also gave me full control over the routing logic.

Lessons learned

Don't prematurely abstract. I almost built a full plugin system. The YAML config was enough.
Lazy imports matter. Loading all SDKs at startup wasted memory for providers I rarely used.
Cost visibility is gold. Logging which provider handled each request helped me spot waste.
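
The router code in this post doesn't include that logging, so here is one way it could be bolted on from the outside. This is a sketch under assumptions: logged_complete and the log field names are hypothetical, and it records prompt size and latency rather than actual token counts or prices.

```python
import logging
import time

logger = logging.getLogger("ai_router")


def logged_complete(complete, config, task, prompt):
    """Call complete(task, prompt) and log which provider and model handled it."""
    cfg = config[task]
    start = time.perf_counter()
    try:
        return complete(task, prompt)
    finally:
        # Runs on success and on failure, so failed calls show up in the log too
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "task=%s provider=%s model=%s prompt_chars=%d latency_ms=%.0f",
            task, cfg["provider"], cfg["model"], len(prompt), elapsed_ms,
        )


# Demo with a stand-in for router.complete and a one-route config.
logging.basicConfig(level=logging.INFO)
config = {"captions": {"provider": "anthropic", "model": "claude-3-haiku-20240307"}}
caption = logged_complete(
    lambda task, prompt: "A cat on a sofa", config, "captions", "Describe a cat photo"
)
print(caption)
```

With the real router, the call would be logged_complete(router.complete, router.config, 'captions', prompt), and grouping the log lines by provider and model gives a rough picture of where the spend goes.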

I'm still iterating on this. Next up: adding streaming support and a simple latency monitor.

What does your AI infrastructure look like? Are you using a single provider or something more flexible? I'd love to hear how others handle this.

DEV Community