Last month I added Claude to a project that was already using GPT-4o. Two SDKs, two error formats, two retry strategies. By the time I finished I had wrapped both in my own abstraction — a tiny LLM gateway, badly written, that I now had to maintain.
Then I noticed something I should have noticed earlier: most of the new providers expose an OpenAI-compatible endpoint. DeepSeek, Mistral, Together, Fireworks — they all speak the same wire format. You don't need a new SDK. You need a new base_url.
This post is the 5-minute version of that realization, with the tradeoffs I learned the hard way.
The "before" code
Standard OpenAI Python:
```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this PR diff..."}],
)
print(resp.choices[0].message.content)
```
The "after" code
```python
from openai import OpenAI

client = OpenAI(
    api_key="th-...",
    base_url="https://jiatoken.com/v1",  # gateway
)

# Same call, different model
resp = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3
    messages=[{"role": "user", "content": "Summarize this PR diff..."}],
)
```
That's it. Two lines changed. The rest of your code — streaming handlers, tool calls, retry logic — keeps working because the response shape is identical.
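For example, a provider-agnostic helper like this (a hypothetical `summarize` function, not from the original project) runs the same whether `client` points at OpenAI directly or at a gateway:

```python
from openai import OpenAI


def summarize(client: OpenAI, model: str, diff: str) -> str:
    """Same code path for GPT-4o, DeepSeek, or anything else behind the
    gateway, because every backend returns the same response shape."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize this PR diff:\n{diff}"}],
    )
    return resp.choices[0].message.content
```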
Why this works
The OpenAI Python SDK is just a typed HTTP client. It POSTs JSON to {base_url}/chat/completions. Anything that responds with the same JSON shape is, from the SDK's point of view, OpenAI.
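To make that concrete, here is roughly what the SDK call above does on the wire, written as a bare `requests` POST (a sketch; the key and prompt are placeholders):

```python
import requests

# The chat.completions.create() call is, underneath, just this POST.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer sk-..."},
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Summarize this PR diff..."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```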
Providers take advantage of this to varying degrees:

- DeepSeek ships its own OpenAI-compatible endpoint at `api.deepseek.com/v1`. You can point the SDK there directly (see the sketch after this list).
- Anthropic does not: Claude has its own message format, so you need a translator in between.
- Gemini has both: a native API and a Vertex-side OpenAI shim.
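For instance, skipping the gateway entirely and pointing the stock SDK straight at DeepSeek looks like this (a sketch; you need a DeepSeek API key, not an OpenAI one):

```python
from openai import OpenAI

# Stock OpenAI SDK, aimed at DeepSeek's OpenAI-compatible endpoint.
deepseek = OpenAI(
    api_key="sk-deepseek-...",  # placeholder: your DeepSeek key
    base_url="https://api.deepseek.com/v1",
)
resp = deepseek.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this PR diff..."}],
)
print(resp.choices[0].message.content)
```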
A multi-model gateway (LiteLLM, OpenRouter, TokenHub, your own) collapses these into one endpoint. One key, one base_url, every model behind it.
What I actually save
For the workload I just migrated (~3M input tokens / 1M output per day, mostly summarization):
| Model | Input $/1M | Output $/1M | Daily cost |
|---|---|---|---|
| GPT-4o | 2.50 | 10.00 | $17.50 |
| Claude 3.5 | 3.00 | 15.00 | $24.00 |
| DeepSeek-V3 | 0.07 | 0.28 | $0.49 |
DeepSeek isn't a drop-in quality replacement for everything — GPT-4o still wins on instruction following in my evals — but for the 80% of calls that are "summarize this", "extract these fields", "rewrite in tone X", it's fine and ~35× cheaper.
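The daily figures are just volume times price, using the table's per-million rates and the 3M-in / 1M-out workload above:

```python
def daily_cost(input_millions, output_millions, in_price_per_m, out_price_per_m):
    """Daily spend given token volume (millions/day) and $/1M prices."""
    return input_millions * in_price_per_m + output_millions * out_price_per_m

print(daily_cost(3, 1, 2.50, 10.00))  # GPT-4o      -> 17.5
print(daily_cost(3, 1, 3.00, 15.00))  # Claude 3.5  -> 24.0
print(daily_cost(3, 1, 0.07, 0.28))   # DeepSeek-V3 -> 0.49
```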
The annoying parts
A few things don't carry over cleanly through OpenAI compatibility:
- Tool calling JSON shape. Most providers match it now, but older OSS models return tool calls inside the content string. Always test with your actual prompts before flipping production.
- Vision. OpenAI uses `image_url` parts; some providers want base64. A gateway should normalize this for you; verify before you assume.
- Streaming with usage stats. OpenAI added `stream_options={"include_usage": True}` to get token counts on the final SSE chunk. Not every backend forwards this (see the sketch after this list).
- Rate limits. You're now subject to the gateway's RPM, which may be lower than direct provider limits.
When NOT to use a gateway
- You only ever call one provider. Direct SDK is one less moving part.
- You need provider-specific features (Anthropic's prompt caching, OpenAI's Realtime API, Gemini's long context). Gateways usually lag behind native features by weeks.
- You're in a regulated environment that requires data plane control. Most gateways are SaaS.
For everything else — especially side projects and prototypes where the model you "want" changes every two weeks — a gateway pays for itself in saved switching cost.
TL;DR
```diff
 client = OpenAI(
     api_key="...",
+    base_url="https://your-gateway/v1",
 )
 client.chat.completions.create(
-    model="gpt-4o",
+    model="deepseek-chat",
     ...
 )
```
If you want to skip running your own LiteLLM, TokenHub hosts a pre-configured gateway with 40+ models behind one key. Otherwise, LiteLLM self-hosted is the standard answer.