DEV Community

FuturMix

Building a Multi-Model AI Gateway in 30 Minutes (Python + FastAPI)

Every AI app starts the same way: you pick one model provider, hardcode the API key, and ship it.

Then reality hits. GPT-4 is too expensive for simple tasks. Claude is better at code but slower for chat. DeepSeek costs a fraction but you're not sure about reliability. And when any one provider has an outage, your entire app goes down.

The solution? A multi-model AI gateway — a single OpenAI-compatible endpoint that routes requests to different providers based on the model name, handles failover, and gives you one place to manage keys, track usage, and control costs.

In this tutorial, we'll build one from scratch using Python and FastAPI.

What We're Building

Your App  →  Gateway (:8000)  →  OpenAI (gpt-4o, gpt-4o-mini)
                               →  Anthropic (claude-sonnet-4, claude-haiku)
                               →  Google (gemini-2.5-flash)
                               →  DeepSeek (deepseek-chat, deepseek-reasoner)

The gateway exposes a single /v1/chat/completions endpoint that's compatible with the OpenAI SDK. Your app doesn't need to know which provider is behind a model — it just sends requests like normal.

Prerequisites

  • Python 3.10+
  • API keys for at least two providers (OpenAI + one other)
  • Basic familiarity with FastAPI

Step 1: Project Setup

mkdir ai-gateway && cd ai-gateway
python -m venv venv && source venv/bin/activate
pip install fastapi uvicorn httpx python-dotenv

Create a .env file with your provider keys:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
DEEPSEEK_API_KEY=sk-...
GATEWAY_API_KEY=gw-your-secret-key
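
A key that is missing from .env tends to surface only later, as an authentication error from the upstream provider. Here is a minimal sketch of a startup check, assuming the variable names from the .env file above. PROVIDER_KEYS and missing_keys are hypothetical names, not part of the tutorial's files.

```python
import os

# Assumption: these names match the .env file shown above.
PROVIDER_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "DEEPSEEK_API_KEY",
]


def missing_keys(env=None, names=PROVIDER_KEYS):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "sk-example"}))
```

Calling missing_keys() with no arguments after load_dotenv() would check the real environment, and you could log a warning for each name it returns.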

Step 2: Define Provider Routing

The core idea: map model names to their provider's base URL and API key.

Create config.py:

import os
from dotenv import load_dotenv

load_dotenv()

# Model → Provider mapping
PROVIDERS = {
    # OpenAI models
    "gpt-4o": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "upstream_model": "gpt-4o",
    },
    "gpt-4o-mini": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "upstream_model": "gpt-4o-mini",
    },
    # Anthropic models (via OpenAI-compatible endpoint)
    "claude-sonnet": {
        "base_url": "https://api.anthropic.com/v1",
        "api_key": os.getenv("ANTHROPIC_API_KEY"),
        "upstream_model": "claude-sonnet-4-20250514",
    },
    # Google Gemini
    "gemini-flash": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "upstream_model": "gemini-2.5-flash",
    },
    # DeepSeek
    "deepseek-chat": {
        "base_url": "https://api.deepseek.com/v1",
        "api_key": os.getenv("DEEPSEEK_API_KEY"),
        "upstream_model": "deepseek-chat",
    },
    "deepseek-reasoner": {
        "base_url": "https://api.deepseek.com/v1",
        "api_key": os.getenv("DEEPSEEK_API_KEY"),
        "upstream_model": "deepseek-reasoner",
    },
}

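# Clients send this key as a Bearer token. Set it in .env: the fallback value
# is a well-known placeholder and is only suitable for local testing.
GATEWAY_API_KEY = os.getenv("GATEWAY_API_KEY", "gw-default-key")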

This is the routing table. When a request comes in for gpt-4o, the gateway knows to forward it to api.openai.com. When it's deepseek-chat, it goes to api.deepseek.com. The caller doesn't need to care.
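
To make the lookup concrete, here is a small sketch of what the gateway does with a model name. It uses a trimmed two-entry table in the same shape as PROVIDERS. ROUTES and resolve are illustrative names only.

```python
# Illustrative two-entry table in the same shape as PROVIDERS in config.py.
ROUTES = {
    "gpt-4o": {
        "base_url": "https://api.openai.com/v1",
        "upstream_model": "gpt-4o",
    },
    "claude-sonnet": {
        "base_url": "https://api.anthropic.com/v1",
        "upstream_model": "claude-sonnet-4-20250514",
    },
}


def resolve(model, routes=ROUTES):
    """Return (upstream URL, upstream model ID) for a gateway model name."""
    route = routes[model]  # raises KeyError for unknown models
    return f"{route['base_url']}/chat/completions", route["upstream_model"]


url, upstream_model = resolve("claude-sonnet")
print(url, upstream_model)
```

Note that the caller-facing name (claude-sonnet) and the upstream model ID can differ, which lets you swap the upstream version without touching client code.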

Step 3: Build the Gateway

Create main.py:

import secrets
import time
import httpx
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse, JSONResponse
from config import PROVIDERS, GATEWAY_API_KEY

app = FastAPI(title="AI Gateway")

def verify_auth(request: Request):
    """Verify the gateway API key sent as a Bearer token."""
    auth = request.headers.get("authorization", "")
    # Strip only the leading scheme; str.replace would also alter later occurrences.
    token = auth[len("Bearer "):] if auth.startswith("Bearer ") else ""
    # Constant-time comparison avoids leaking key contents through timing.
    if not secrets.compare_digest(token.encode(), GATEWAY_API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API key")

def get_provider(model: str) -> dict:
    """Look up the provider config for a model."""
    if model not in PROVIDERS:
        raise HTTPException(
            status_code=400,
            detail=f"Model '{model}' not supported. Available: {list(PROVIDERS.keys())}",
        )
    return PROVIDERS[model]

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    verify_auth(request)
    body = await request.json()

    model = body.get("model", "")
    provider = get_provider(model)
    stream = body.get("stream", False)

    # Rewrite model name to upstream model
    body["model"] = provider["upstream_model"]

    # Build upstream headers
    headers = {
        "Authorization": f"Bearer {provider['api_key']}",
        "Content-Type": "application/json",
    }
    if "headers" in provider:
        headers.update(provider["headers"])

    url = f"{provider['base_url']}/chat/completions"

    if stream:
        return await _stream_response(url, headers, body)
    else:
        return await _direct_response(url, headers, body)

async def _direct_response(url: str, headers: dict, body: dict):
    """Forward a non-streaming request."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(url, headers=headers, json=body)
        if resp.status_code != 200:
            return JSONResponse(
                status_code=resp.status_code,
                content=resp.json(),
            )
        return JSONResponse(content=resp.json())

async def _stream_response(url: str, headers: dict, body: dict):
    """Forward a streaming request using Server-Sent Events."""
    async def event_generator():
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream(
                "POST", url, headers=headers, json=body
            ) as resp:
                async for line in resp.aiter_lines():
                    if line:
                        yield f"{line}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

@app.get("/v1/models")
async def list_models(request: Request):
    """List available models."""
    verify_auth(request)
    models = [
        {
            "id": name,
            "object": "model",
            "created": int(time.time()),
            "owned_by": "gateway",
        }
        for name in PROVIDERS
    ]
    return {"object": "list", "data": models}

@app.get("/health")
async def health():
    return {"status": "ok", "models": len(PROVIDERS)}

That's it — a working AI gateway in under 100 lines.
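
The streaming path is the least obvious part, so here is the re-framing logic from event_generator isolated as a plain function. This is a sketch and reframe_sse is a hypothetical name. It assumes every upstream event is a single data: line, as in OpenAI-style chat completion streams. A multi-line event would be split into separate frames.

```python
def reframe_sse(lines):
    """Yield one SSE frame per non-empty upstream line.

    Mirrors the loop in event_generator: blank separator lines from the
    upstream are dropped, and each remaining line is terminated by a blank line.
    """
    for line in lines:
        if line:
            yield f"{line}\n\n"


# Lines as an HTTP client would hand them over, with newlines already stripped.
upstream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    "data: [DONE]",
    "",
]
frames = list(reframe_sse(upstream))
print(frames)
```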

Step 4: Test It

Start the server:

uvicorn main:app --host 0.0.0.0 --port 8000

Test with curl:

# List models
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer gw-your-secret-key"

# Chat completion (routes to OpenAI)
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer gw-your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'

# Same endpoint, different model (routes to DeepSeek)
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer gw-your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Explain quicksort in one paragraph."}]
  }'

And because it's OpenAI-compatible, the OpenAI SDK (pip install openai) works directly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="gw-your-secret-key",
)

# Use any model through the same client
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
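
If you'd rather not add the SDK as a dependency, the same request can be built with the standard library. The sketch below only constructs the request so you can see the wire format. build_chat_request is a hypothetical helper, and the URL and key are the placeholder values used throughout this tutorial.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a POST request for the gateway's chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1", "gw-your-secret-key", "deepseek-chat", "Hello!"
)

# To send it against a running gateway:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```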

Step 5: Add Automatic Failover

The real power of a gateway is resilience. If one provider goes down, requests should automatically fall through to a backup.

Add this to config.py:

# Failover chains: if primary fails, try these in order
FAILOVER = {
    "gpt-4o": ["claude-sonnet", "deepseek-chat"],
    "claude-sonnet": ["gpt-4o", "deepseek-chat"],
    "deepseek-chat": ["gpt-4o-mini"],
    "gemini-flash": ["gpt-4o-mini", "deepseek-chat"],
}
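
A typo in this table is easy to miss, because the handler below silently skips names that are not in PROVIDERS. Here is a sketch of a sanity check you could run at startup. check_failover is a hypothetical helper that only needs the two dictionaries.

```python
def check_failover(providers, failover):
    """Return a list of human-readable problems found in a failover table."""
    problems = []
    for primary, backups in failover.items():
        if primary not in providers:
            problems.append(f"{primary}: primary is not a configured model")
        for backup in backups:
            if backup not in providers:
                problems.append(f"{primary}: backup '{backup}' is not a configured model")
            elif backup == primary:
                problems.append(f"{primary}: lists itself as a backup")
    return problems


# Stand-in tables for illustration; in the gateway you would pass PROVIDERS and FAILOVER.
providers = {"gpt-4o": {}, "deepseek-chat": {}}
print(check_failover(providers, {"gpt-4o": ["deepseek-chat", "claude-sonet"]}))
```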

Update the chat_completions handler in main.py:

from config import PROVIDERS, GATEWAY_API_KEY, FAILOVER

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    verify_auth(request)
    body = await request.json()

    model = body.get("model", "")
    get_provider(model)  # Keep the 400 for unknown models instead of a misleading 502
    stream = body.get("stream", False)

    # Build attempt chain: primary + failover models
    attempts = [model] + FAILOVER.get(model, [])

    last_error = None
    for attempt_model in attempts:
        provider = PROVIDERS.get(attempt_model)
        if not provider:
            continue

        try:
            body_copy = {**body, "model": provider["upstream_model"]}
            headers = {
                "Authorization": f"Bearer {provider['api_key']}",
                "Content-Type": "application/json",
            }
            if "headers" in provider:
                headers.update(provider["headers"])

            url = f"{provider['base_url']}/chat/completions"

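            if stream:
                # Caveat: the upstream connection is opened only when the response
                # body is iterated, after this return, so errors on the streaming
                # path are not caught below and streaming requests do not fail over.
                return await _stream_response(url, headers, body_copy)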
            else:
                return await _direct_response(url, headers, body_copy)

        except (httpx.HTTPStatusError, httpx.ConnectError, httpx.TimeoutException) as e:
            last_error = e
            continue  # Try next provider

    raise HTTPException(
        status_code=502,
        detail=f"All providers failed for model '{model}'. Last error: {str(last_error)}",
    )

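Now if OpenAI is unreachable, non-streaming requests for gpt-4o automatically route to Claude, then DeepSeek, and callers get a 502 only when every model in the chain fails. Upstream error statuses such as 500 or 429 trigger failover only once _direct_response raises httpx.HTTPStatusError on non-200 responses, which the Step 6 version does; the Step 3 version returns them to the caller unchanged.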
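
One refinement worth considering: once _direct_response raises on every non-200 status, a 400 caused by a malformed request is also retried on each backup and finally reported as a 502. The sketch below shows the failover loop in isolation, retrying only on statuses that another provider might plausibly handle. The RETRYABLE set, call_with_failover, and the send callable are all assumptions for illustration, with send standing in for the real upstream call.

```python
import asyncio

# Assumption: statuses where trying a different provider may help.
RETRYABLE = {429, 500, 502, 503, 504}


async def call_with_failover(chain, send):
    """Try each model in order and return (model, status, body).

    `send` is a hypothetical coroutine, send(model) -> (status, body), that may
    raise ConnectionError or TimeoutError on network failures. Returns None
    only if `chain` is empty.
    """
    last = None
    for model in chain:
        try:
            status, body = await send(model)
        except (ConnectionError, TimeoutError) as exc:
            last = (model, 502, {"error": str(exc)})
            continue
        if status in RETRYABLE:
            last = (model, status, body)
            continue
        # Success, or a client error that a different provider will not fix.
        return model, status, body
    return last


async def fake_send(model):
    # Stand-in for the upstream call: the primary is "down".
    if model == "gpt-4o":
        raise ConnectionError("connection refused")
    return 200, {"served_by": model}


result = asyncio.run(
    call_with_failover(["gpt-4o", "claude-sonnet", "deepseek-chat"], fake_send)
)
print(result)
```

In the gateway, send would wrap the httpx call and the final tuple would be turned into a JSONResponse.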

Step 6: Add Request Logging

For debugging, and as a first step toward cost tracking, log every request:

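import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("gateway")

def log_request(model: str, provider_model: str, latency_ms: float, status: int):
    """Log one upstream attempt: requested model, upstream model, latency, status."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    logger.info(
        f"[{timestamp}] "
        f"model={model} provider={provider_model} "
        f"latency={latency_ms:.0f}ms status={status}"
    )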

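Integrate this into _direct_response, and have the handler pass the requested name along, for example _direct_response(url, headers, body_copy, model=attempt_model), so the model field is not logged empty: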

async def _direct_response(url: str, headers: dict, body: dict, model: str = ""):
    start = time.time()
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(url, headers=headers, json=body)
        latency = (time.time() - start) * 1000
        log_request(model, body.get("model", ""), latency, resp.status_code)
        if resp.status_code != 200:
            raise httpx.HTTPStatusError(
                f"Upstream returned {resp.status_code}",
                request=resp.request,
                response=resp,
            )
        return JSONResponse(content=resp.json())

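Your logs now show which upstream model handled each request and how long it took. The lines below are illustrative: the code above writes the upstream model ID in the provider field and logs only requests that receive a response, so a timeout entry like the third line would need an extra log call in the handler's except block: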

[2026-06-16T09:30:00] model=gpt-4o provider=gpt-4o latency=823ms status=200
[2026-06-16T09:30:01] model=deepseek-chat provider=deepseek-chat latency=412ms status=200
[2026-06-16T09:30:05] model=gpt-4o provider=gpt-4o latency=30002ms status=timeout
[2026-06-16T09:30:05] model=gpt-4o provider=claude-sonnet latency=1203ms status=200
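
Because the format is plain key=value pairs, it is easy to aggregate later. As a rough sketch, here is how you might compute average latency per provider field from lines like the ones above. The regular expression and latency_by_provider are assumptions tied to this exact log format.

```python
import re
from collections import defaultdict

# Matches the key=value part of a log line in the format shown above.
LINE = re.compile(
    r"model=(?P<model>\S+) provider=(?P<provider>\S+) "
    r"latency=(?P<latency>\d+)ms status=(?P<status>\S+)"
)


def latency_by_provider(lines):
    """Average latency in ms per provider field, counting only status=200 lines."""
    samples = defaultdict(list)
    for line in lines:
        match = LINE.search(line)
        if match and match["status"] == "200":
            samples[match["provider"]].append(int(match["latency"]))
    return {name: sum(values) / len(values) for name, values in samples.items()}


logs = [
    "[2026-06-16T09:30:00] model=gpt-4o provider=gpt-4o latency=823ms status=200",
    "[2026-06-16T09:30:01] model=deepseek-chat provider=deepseek-chat latency=412ms status=200",
    "[2026-06-16T09:30:05] model=gpt-4o provider=gpt-4o latency=30002ms status=timeout",
    "[2026-06-16T09:30:05] model=gpt-4o provider=claude-sonnet latency=1203ms status=200",
]
print(latency_by_provider(logs))
```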

What's Missing for Production

This tutorial gateway works, but a production setup needs more:

  • Rate limiting — per-key request quotas to prevent abuse
  • Token counting — track input/output tokens for accurate billing
  • Caching — deduplicate identical requests to save costs
  • Load balancing — distribute across multiple API keys per provider
  • Authentication — multi-tenant key management with per-key model access
  • Monitoring — dashboards for latency, error rates, costs per model
  • Streaming edge cases — handle partial failures mid-stream, SSE reconnection
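
To give a sense of scale for the first item, here is a minimal in-memory sketch of per-key rate limiting with a sliding window. It is single-process only and keeps no state across restarts, so treat it as an illustration rather than something to deploy. SlidingWindowLimiter is a hypothetical class, not part of the gateway above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each API key."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window=60.0)
print([limiter.allow("gw-your-secret-key") for _ in range(3)])
```

In the gateway, you could call allow(token) inside verify_auth and raise an HTTPException with status 429 when it returns False.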

Building all of this yourself is a significant engineering investment.

The Production Shortcut

If you need multi-model routing in production without building the infrastructure, managed AI gateways handle all of the above out of the box.

For example, FuturMix provides an OpenAI-compatible endpoint with 25+ models, automatic failover, and listed discounts on selected models. Top up from $10 to get started. The migration from our tutorial gateway is a one-line change:

client = OpenAI(
    base_url="https://futurmix.ai/v1",  # ← swap this
    api_key="your-futurmix-key",
)

# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

Other options in this space include OpenRouter, LiteLLM (open source), and Portkey. Pick what fits your stack.

Wrapping Up

We built a working multi-model AI gateway in ~120 lines of Python:

  1. Unified endpoint: one /v1/chat/completions for all providers
  2. Model routing: map model names to upstream providers
  3. Automatic failover: if a provider is down, try the next one
  4. Request logging: record latency and status for each upstream call

The full source code is included above — just config.py and main.py.

Whether you self-host or use a managed gateway, the pattern is the same: decouple your app from any single AI provider. Your code talks to one endpoint, and the gateway handles the rest.


Disclosure: I work on FuturMix, an OpenAI-compatible multi-model API platform. This tutorial works with any OpenAI-compatible setup.
