zhongqiyue

Posted on Jun 14

How I stopped fighting AI API chaos with a simple proxy

#ai #api #webdev #python

I recently took on a side project that needed to tap into multiple AI models – GPT-4 for complex reasoning, Claude for creative writing, and a local Llama 2 for quick drafts. My naive plan was to just call each API directly from my Python backend. Three days later, I had a tangled mess of authentication headers, inconsistent rate limits, and error handling that looked like a love letter to try/except. I almost trashed the whole thing.

If you've ever tried to build anything beyond a single-LLM demo, you know the pain. Let me share what I tried, what failed, and the minimal approach that finally worked.

The problem that nearly broke me

My app was simple: a user sends a prompt, and I route it to the best model based on cost and context. But each provider had its own quirks:

OpenAI: uses Authorization: Bearer <key>, returns choices[0].message.content.
Anthropic: requires x-api-key header, returns content[0].text.
Replicate: expects a different payload structure with version IDs.
Ollama (local): needs only the model name, but different host/port.
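
To make the response-shape differences concrete, here is a minimal sketch of the kind of normalisation each non-streaming reply needs. The function name `extract_text` and the sample bodies are made up for illustration, and the Ollama branch assumes the `/api/chat` reply shape. Replicate is left out because its payload is not described in enough detail here.

```python
def extract_text(provider: str, response: dict) -> str:
    """Return the assistant text from a non-streaming response body."""
    if provider == "openai":
        return response["choices"][0]["message"]["content"]
    if provider == "anthropic":
        return response["content"][0]["text"]
    if provider == "ollama":
        # Assumption: Ollama's /api/chat returns a single "message" object
        return response["message"]["content"]
    raise ValueError(f"unsupported provider: {provider}")


# Trimmed sample bodies, invented for this example
openai_body = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}
anthropic_body = {"content": [{"type": "text", "text": "Hi"}]}
ollama_body = {"message": {"role": "assistant", "content": "Hi"}, "done": True}

for name, body in [("openai", openai_body), ("anthropic", anthropic_body), ("ollama", ollama_body)]:
    print(name, "->", extract_text(name, body))
```

Every provider needs its own branch like this, and streaming adds a second set of shapes on top, which is how a router ends up at 300 lines.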

When I added streaming, it got even worse. My router function grew to 300 lines of conditional logic. Every new model meant another if provider == 'claude': block. The code was fragile, untestable, and I hated opening it.

What I tried that didn’t work

First, I looked at popular libraries like langchain and llamaindex. They abstract a lot, but they also bring their own layers of complexity – vector stores, chains, agents. I didn't need any of that. I just wanted a unified way to call different models. Those libraries felt like hiring a full orchestra when all I needed was a piano.

Next, I considered using an API gateway like Kong or Tyk. Overkill for a side project. Configuring rate limiting and request transformation in YAML took longer than writing the original spaghetti code.

I even tried a hack: write a wrapper that converts everything to OpenAI's format and use that as a facade. That actually worked for Claude (Anthropic provides a compat endpoint), but not for Replicate or local models. Plus, I couldn't support streaming correctly.

What eventually worked: a tiny proxy in Python

I decided to build a dumb proxy – a single FastAPI server that translates incoming requests into provider-specific calls. The goal was to accept a standardised request format and return a standardised streaming response. The proxy doesn't make any decisions: it looks up the provider from the model name, but does no cost-based routing, no caching, no load balancing. It just translates.

Here's the core idea:

# client.py – my app only ever talks to this proxy
import requests

payload = {
    "model": "claude-3-opus",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True)
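for chunk in response.iter_lines():
    if chunk:  # skip the blank separator lines between SSE events
        print(chunk.decode("utf-8"))  # iter_lines() yields bytes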

My proxy listens on /v1/chat/completions (mimicking OpenAI's route for familiarity) and internally translates the payload to whatever Anthropic or Ollama expects.
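
On the client side, the streamed lines still have to be stitched back into text. A minimal sketch, assuming the proxy emits OpenAI-style `data: {...}` lines terminated by `data: [DONE]`; the helper name `collect_text` and the sample lines are hypothetical:

```python
import json
from typing import Iterable


def collect_text(lines: Iterable[str]) -> str:
    """Join the delta.content pieces from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Sample stream, invented for this example
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}, "index": 0}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}, "index": 0}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))  # Hello
```

With `requests`, the input could be `(raw.decode("utf-8") for raw in response.iter_lines())`, since `iter_lines()` yields bytes by default.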

Here's the simplified proxy code (I'll show just the OpenAI → Claude translation):

# proxy.py
import json
from typing import Optional

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

# Map of model names to provider details
MODEL_MAP = {
    "gpt-4": {
        "provider": "openai",
        "url": "https://api.openai.com/v1/chat/completions",
        "auth": "Bearer sk-..."
    },
    "claude-3-opus": {
        "provider": "anthropic",
        "url": "https://api.anthropic.com/v1/messages",
        "auth": "x-api-key sk-ant-..."
    },
    "llama2": {
        "provider": "ollama",
        "url": "http://localhost:11434/api/chat",
        "auth": None
    }
}

def translate_openai_to_anthropic(payload: dict) -> dict:
    """Convert OpenAI-style payload to Anthropic format."""
    messages = payload["messages"]
    system = None
    # Anthropic takes the system prompt as a top-level field, not as a message
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]
    anthropic_payload = {
        "model": payload["model"],
        "max_tokens": payload.get("max_tokens", 1024),  # required by Anthropic
        "messages": [
            {
                "role": m["role"],
                "content": m["content"]
            } for m in messages
        ],
        "stream": payload.get("stream", False)
    }
    if system:
        anthropic_payload["system"] = system
    return anthropic_payload

def translate_anthropic_to_openai(chunk: dict) -> Optional[dict]:
    """Convert Anthropic streamed chunk to OpenAI-like format (None = skip it)."""
    # Anthropic delta format: { type: 'content_block_delta', delta: { text: '...' } }
    if chunk.get("type") == "content_block_delta" and "text" in chunk.get("delta", {}):
        return {
            "choices": [{
                "delta": {"content": chunk["delta"]["text"]},
                "index": 0
            }]
        }
    return None

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    model = payload.get("model")
    if model not in MODEL_MAP:
        # FastAPI ignores Flask-style (body, status) tuples, so set the status explicitly
        return JSONResponse({"error": "unknown model"}, status_code=400)

    info = MODEL_MAP[model]
    provider = info["provider"]

    if provider == "anthropic":
        anthropic_payload = translate_openai_to_anthropic(payload)
        headers = {"x-api-key": info["auth"].split()[1], "anthropic-version": "2023-06-01"}

        if payload.get("stream"):
            # Turn Anthropic stream into OpenAI SSE format
            async def event_stream():
                # Open the upstream request inside the generator: client.post() would
                # buffer the whole reply, and the connection must outlive this handler
                async with httpx.AsyncClient(timeout=None) as client:
                    async with client.stream("POST", info["url"], json=anthropic_payload, headers=headers) as response:
                        async for line in response.aiter_lines():
                            if line.startswith("data: "):
                                data = json.loads(line[6:])
                                openai_chunk = translate_anthropic_to_openai(data)
                                if openai_chunk:
                                    yield f"data: {json.dumps(openai_chunk)}\n\n"
                yield "data: [DONE]\n\n"
            return StreamingResponse(event_stream(), media_type="text/event-stream")

        # Non-streaming: parse response
        async with httpx.AsyncClient(timeout=None) as client:
            response = await client.post(info["url"], json=anthropic_payload, headers=headers)
        result = response.json()
        content = result["content"][0]["text"]
        return {
            "choices": [{
                "message": {"role": "assistant", "content": content},
                "index": 0
            }]
        }
    # Similar blocks for OpenAI (pass-through) and Ollama ...
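
The Ollama branch is elided above. As a rough sketch of what its stream translation might look like, assuming Ollama's `/api/chat` streams newline-delimited JSON objects with a `message.content` field and a final `done: true` line (the function name `translate_ollama_to_openai` and the sample lines are made up for illustration):

```python
import json
from typing import Optional


def translate_ollama_to_openai(line: str) -> Optional[dict]:
    """Convert one NDJSON line from Ollama into an OpenAI-like chunk (None = skip)."""
    if not line.strip():
        return None
    data = json.loads(line)
    if data.get("done"):
        return None  # the final line carries stats, not text
    text = data.get("message", {}).get("content", "")
    if not text:
        return None
    return {"choices": [{"delta": {"content": text}, "index": 0}]}


# Sample lines, invented for this example
sample_lines = [
    '{"model": "llama2", "message": {"role": "assistant", "content": "Hi"}, "done": false}',
    '{"model": "llama2", "message": {"role": "assistant", "content": ""}, "done": true}',
]
for line in sample_lines:
    chunk = translate_ollama_to_openai(line)
    if chunk:
        print(f"data: {json.dumps(chunk)}")
```

Unlike Anthropic's SSE stream, there is no `data: ` prefix to strip here, so the proxy would parse each line directly and add the SSE framing itself.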

Yes, the translation functions are a bit tedious to write, but once done you can treat every model the same way from your client. I even added a small config file that lets me add new model endpoints without touching code:

{
    "models": {
        "claude-3-haiku": {
            "provider": "anthropic",
            "url": "https://api.anthropic.com/v1/messages",
            "auth_env": "ANTHROPIC_API_KEY"
        },
        "mixtral": {
            "provider": "ollama",
            "url": "http://localhost:11434/api/chat"
        }
    }
}
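
One way the proxy could consume a config like this is sketched below. It assumes `auth_env` names an environment variable holding the API key and that entries without it need no auth; the helper name `build_headers` and the placeholder key are hypothetical, and the config is inlined as a string only to keep the example self-contained.

```python
import json
import os

CONFIG_TEXT = """
{
    "models": {
        "claude-3-haiku": {
            "provider": "anthropic",
            "url": "https://api.anthropic.com/v1/messages",
            "auth_env": "ANTHROPIC_API_KEY"
        },
        "mixtral": {
            "provider": "ollama",
            "url": "http://localhost:11434/api/chat"
        }
    }
}
"""


def build_headers(entry: dict, env=os.environ) -> dict:
    """Build auth headers for one config entry; keys come from the environment."""
    auth_env = entry.get("auth_env")
    if auth_env is None:
        return {}  # e.g. a local Ollama server needs no key
    key = env.get(auth_env)
    if key is None:
        raise RuntimeError(f"environment variable {auth_env} is not set")
    if entry["provider"] == "anthropic":
        return {"x-api-key": key, "anthropic-version": "2023-06-01"}
    return {"Authorization": f"Bearer {key}"}


models = json.loads(CONFIG_TEXT)["models"]
fake_env = {"ANTHROPIC_API_KEY": "placeholder-key"}  # stand-in for os.environ
print(build_headers(models["claude-3-haiku"], fake_env))
print(build_headers(models["mixtral"], fake_env))
```

Reading keys from the environment also keeps secrets out of both the code and the config file.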

This proxy is intentionally stateless and dumb – no caching, no retries, no rate limiting. That keeps it easy to reason about and debug. I run it as a sidecar container next to my app.

Lessons learned and trade-offs

What I loved:

My main app code now only imports requests and talks to one endpoint.
Adding a new model takes 10 minutes: write the translation functions, update config.
I can test the proxy independently with a simple curl.

What I hated:

Latency is slightly higher because of the extra hop. For most use cases it's imperceptible, but if you need ultra-low latency (real-time conversational apps, for example), you might want to call the APIs directly.
No automatic retries – I have to handle that in the client or add it to the proxy.
The proxy is a single point of failure. If it goes down, you're dead. For production you'd want multiple instances with a load balancer.
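
For the missing retries, a small client-side wrapper is one option. This is a minimal sketch of retry with exponential backoff, not something the proxy ships with; `with_retries` and the simulated `flaky` call are hypothetical, and catching every `Exception` is a simplification.

```python
import time


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice, then succeeds
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated upstream failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

In the client this could wrap the request, for example `with_retries(lambda: requests.post(url, json=payload))`. Retrying is only safe before any streamed chunks have been consumed; a stream that fails halfway would otherwise replay text the user has already seen.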

When NOT to use this:

If you only ever call one model, just call it directly.
If you need complex routing (e.g., cheapest model for this prompt), push that logic to a different layer.
If you're already using langchain for chaining, you don't need another abstraction – but langchain does have its own overhead.

What I'd do differently next time

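I'd skip the manual translation functions and instead leverage the OpenAI-compatible endpoints that many providers now offer. For example, Ollama serves an OpenAI-compatible route at /v1/chat/completions out of the box (OLLAMA_ORIGINS only controls which browser origins may call the server, so a server-side proxy doesn't need it). Anthropic also documents an OpenAI SDK compatibility layer, though it is aimed at testing and comparison rather than full feature parity. That would reduce most of the translation code to just base URLs and auth headers.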
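
With OpenAI-compatible upstreams, the proxy's job could shrink to picking a URL and an auth header while forwarding the body untouched. A rough sketch under that assumption; the `UPSTREAMS` table, the `resolve_upstream` helper, and the `OPENAI_API_KEY` variable name are all illustrative choices:

```python
import os

# Hypothetical table: model name -> (OpenAI-compatible base URL, env var holding the key)
UPSTREAMS = {
    "gpt-4": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "llama2": ("http://localhost:11434/v1", None),  # local Ollama, no key needed
}


def resolve_upstream(model: str, env=os.environ):
    """Return (url, headers); the request body is forwarded unchanged."""
    base_url, key_env = UPSTREAMS[model]
    headers = {}
    if key_env is not None:
        headers["Authorization"] = f"Bearer {env[key_env]}"
    return f"{base_url}/chat/completions", headers


url, headers = resolve_upstream("llama2")
print(url, headers)
```

The trade-off is that provider-specific features not covered by the compatibility route stay out of reach.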

Also, I'd add structured logging from day one. Debugging stream translation was painful without seeing exactly what the proxy was sending and receiving.
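
As a sketch of what that logging could look like using only the standard library: one JSON object per line, with extra fields attached per event. The `JsonFormatter` class, the `fields` key, and the event names are assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "event": record.getMessage()}
        entry.update(getattr(record, "fields", {}))  # fields passed via extra=
        return json.dumps(entry)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("proxy")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated setup
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # swap for sys.stderr in a real run
log = make_logger(buffer)
log.debug("upstream_chunk", extra={"fields": {"provider": "anthropic", "type": "content_block_delta"}})
log.debug("client_chunk", extra={"fields": {"provider": "anthropic", "bytes": 42}})
print(buffer.getvalue(), end="")
```

Logging both the upstream chunk and the translated chunk makes it possible to line up what came in against what went out when a stream misbehaves.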

The takeaway

You don't need a heavy framework to work with multiple AI APIs. A thin proxy that normalizes the interface can save you from a world of conditional pain. It's not perfect, but it's simple and works.

Now I'm curious: how do you manage multiple AI backends in your projects? Do you use a similar proxy, or have you found a different pattern that scales better?

Top comments (1)

FastAnchor_io • Jun 14

Nice writeup! The consistent error handling wrapper is the key insight — most people chase the routing logic but skip normalization. One lesson from doing this: add schema validation on responses. Providers change error formats mid-version and silent failures are the worst bugs.