When you build a product that needs to serve multiple AI models from different providers, you quickly run into a wall: every provider has a different API.
Some use SSE streaming. Some don't. Some count tokens by characters. Some by sub-words. Rate limits? Completely different formats.
Here's how I built a gateway that handles all of them under one interface.
The Problem
You want to offer: DeepSeek, Qwen, GLM-4, Kimi, and more — all through one API key. Each provider has:
- Different auth methods
- Different content types (JSON vs plain text vs multipart)
- Different error formats
- Different streaming formats (SSE vs chunked vs WebSocket)
- Different token counting
A naive approach would be spaghetti code with if/else chains. Not sustainable.
Architecture: Three Layers
Client → Gateway (rate limiter + auth) → Router (model selection) → Provider Adapter (format normalization)
Layer 1: Auth & Rate Limiting
All requests start with API key validation. Simple Redis check: GET api_key:{key}. If found, extract user_id and plan.
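A minimal sketch of that lookup, assuming the value stored under `api_key:{key}` is a JSON object with `user_id` and `plan` fields. The storage format isn't specified here, so that encoding, the `authenticate` name, and the `FakeRedis` stand-in are all illustrative:

```python
import json
from typing import Optional


class FakeRedis:
    """In-memory stand-in exposing only get(); swap in a real Redis client."""

    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data.get(key)


def authenticate(redis_client, api_key: str) -> Optional[dict]:
    """Return {'user_id': ..., 'plan': ...} for a known key, else None."""
    raw = redis_client.get(f"api_key:{api_key}")
    if raw is None:
        return None  # unknown key: the gateway should answer 401
    record = json.loads(raw)  # assumed encoding: a JSON object
    return {"user_id": record["user_id"], "plan": record["plan"]}


store = FakeRedis({"api_key:sk-demo": json.dumps({"user_id": "u1", "plan": "free"})})
print(authenticate(store, "sk-demo"))   # {'user_id': 'u1', 'plan': 'free'}
print(authenticate(store, "sk-wrong"))  # None
```

Returning `None` instead of raising keeps the decision about the HTTP response in the gateway layer.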
Rate limiting is per-user, per-plan, per-model. Three tiers:
- Free tier: 10 RPM, 100K TPM
- Standard: 60 RPM, 1M TPM
- Pro: 300 RPM, 10M TPM
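The tiers above fit naturally into a small lookup table. This is a sketch: the `PlanLimits` type, the plan key names, and the choice to fall back to the free tier for an unknown plan are assumptions, while the numbers come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimits:
    rpm: int  # requests per minute
    tpm: int  # tokens per minute


# Values copied from the tier list; the key names are illustrative.
PLAN_LIMITS = {
    "free": PlanLimits(rpm=10, tpm=100_000),
    "standard": PlanLimits(rpm=60, tpm=1_000_000),
    "pro": PlanLimits(rpm=300, tpm=10_000_000),
}


def limits_for(plan: str) -> PlanLimits:
    """Look up a plan's limits; unknown plans get the free tier (an assumption)."""
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])


print(limits_for("standard").rpm)  # 60
```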
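Implementation is a fixed-window counter in Redis, keyed on the current minute:
    import time

    def check_rate_limit(user_id, model, rpm_limit):
        # Assumes `redis` is an already-connected client.
        # One counter per user, model, and calendar minute.
        key = f"ratelimit:{user_id}:{model}:{int(time.time() // 60)}"
        count = redis.incr(key)
        redis.expire(key, 120)  # 2 min TTL so stale buckets clean themselves up
        return count <= rpm_limit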
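Because the counter above is bucketed by calendar minute, a burst that straddles a minute boundary can briefly get through at up to twice the limit. If that matters, a sliding-window log is stricter. The sketch below keeps timestamps in process memory purely for illustration; `SlidingWindowLimiter` is a hypothetical name, and a multi-worker deployment would need the same idea backed by a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` accepted hits per `window` seconds, per key."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window=60)
key = "u1:deepseek-r1"
print(limiter.allow(key, now=0))   # True
print(limiter.allow(key, now=59))  # True
print(limiter.allow(key, now=60))  # True: the hit at t=0 has expired
print(limiter.allow(key, now=61))  # False: hits at t=59 and t=60 remain
```

The `now` parameter exists only to make the behavior easy to test; real callers would omit it.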
Layer 2: Router
Each provider registers itself with supported models:
    ROUTING_TABLE = {
        "deepseek-v4-flash": "deepseek",
        "deepseek-r1": "deepseek",
        "qwen-3": "alibaba",
        "glm-4": "zhipu",
        "doubao": "byteplus",
        "kimi": "moonshot",
    }
The router takes model from the request body and maps it to the correct provider adapter. No if/else — just a dict lookup.
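Roughly, the lookup could be wrapped like this. It's a sketch with a trimmed copy of the table; `resolve_provider` and `UnknownModel` are hypothetical names, and raising a dedicated error for an unregistered model is an assumption about how the gateway reports it.

```python
ROUTING_TABLE = {
    "deepseek-r1": "deepseek",
    "qwen-3": "alibaba",
    "glm-4": "zhipu",
}


class UnknownModel(Exception):
    """Raised when the request names a model with no registered provider."""


def resolve_provider(payload: dict) -> str:
    """Map the request's `model` field to a provider name."""
    model = payload.get("model")
    try:
        return ROUTING_TABLE[model]
    except KeyError:
        raise UnknownModel(f"model not found: {model!r}") from None


print(resolve_provider({"model": "glm-4", "messages": []}))  # zhipu
```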
Layer 3: Provider Adapters
This is where the magic happens. Each adapter normalizes:
- Input format: Convert OpenAI-style messages to provider-native format
- Output format: Convert provider response back to OpenAI-compatible
- Streaming: Normalize SSE data: chunks to a unified event format
- Error codes: Map provider errors to OpenAI-style errors (401, 429, 500)
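The four responsibilities above could be captured in a base class along these lines. This is only a sketch of what `BaseAdapter` might look like: the `map_error` method, the `ERROR_MAP` table (including folding 403 into 401), and `PassthroughAdapter` are assumptions added for illustration.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class BaseAdapter(ABC):
    """Hypothetical base class: one method per normalization step."""

    # Provider HTTP status -> OpenAI-style status; subclasses may override.
    ERROR_MAP = {401: 401, 403: 401, 429: 429}

    @abstractmethod
    def to_provider(self, payload: dict) -> dict:
        """OpenAI-style request -> provider-native request."""

    @abstractmethod
    def to_openai(self, response_json: dict) -> dict:
        """Provider-native response -> OpenAI-compatible response."""

    @abstractmethod
    def stream_chunks(self, raw_lines: Iterable[str]) -> Iterator[str]:
        """Raw stream lines -> unified chunk payloads."""

    def map_error(self, status: int) -> int:
        """Collapse provider statuses to 401, 429, or 500."""
        return self.ERROR_MAP.get(status, 500)


class PassthroughAdapter(BaseAdapter):
    """Smallest possible adapter, for a provider that already speaks OpenAI."""

    def to_provider(self, payload):
        return payload

    def to_openai(self, response_json):
        return response_json

    def stream_chunks(self, raw_lines):
        yield from raw_lines


adapter = PassthroughAdapter()
print(adapter.map_error(403))  # 401
print(adapter.map_error(502))  # 500
```

Making the three format methods abstract means a half-finished adapter fails at instantiation instead of at request time.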
Example adapter for DeepSeek:
    class DeepSeekAdapter(BaseAdapter):
        def to_provider(self, payload):
            return payload  # DeepSeek already uses OpenAI format

        def to_openai(self, response_json):
            # DeepSeek returns OpenAI-compatible response
            return response_json

        def stream_chunks(self, raw_lines):
            for line in raw_lines:
                if line.startswith("data: "):
                    yield line[6:]  # Strip SSE prefix
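The `stream_chunks` method above yields the raw payload after the prefix. A consumer still has to skip non-data lines, stop at the end marker, and decode the JSON. Here's a sketch of that, assuming OpenAI-style chunks terminated by `data: [DONE]`; `parse_sse` is an illustrative name, not part of the gateway shown here.

```python
import json
from typing import Iterable, Iterator


def parse_sse(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from OpenAI-style SSE lines."""
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel in OpenAI-style streams
        yield json.loads(data)


lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(lines))
print(text)  # Hello
```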
For providers that don't use OpenAI format (like Kimi or GLM-4), the adapter does a complete transformation:
    class KimiAdapter(BaseAdapter):
        def to_provider(self, payload):
            # Kimi uses a different message format
            return {
                "model": "kimi",
                "messages": [{"role": m["role"], "content": m["content"]}
                             for m in payload["messages"]],
                "temperature": payload.get("temperature", 0.7),
            }
Cost Optimization
The real value is intelligent routing. With multiple providers, you can:
- Fallback on error: If DeepSeek returns 503, try Qwen
- Latency-based routing: Route to the fastest provider right now
- Cost-based routing: Use the cheapest model that meets quality requirements
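Of the three strategies above, latency-based routing is the one that needs live state. One way to sketch it is an exponentially weighted moving average per provider. `LatencyTracker`, its `alpha` default, and the rule that unmeasured providers are tried first are all assumptions for illustration.

```python
class LatencyTracker:
    """Exponentially weighted moving average of latency per provider."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight of the newest sample
        self._ewma = {}     # provider name -> smoothed latency in ms

    def record(self, provider: str, latency_ms: float) -> None:
        prev = self._ewma.get(provider)
        if prev is None:
            self._ewma[provider] = latency_ms
        else:
            self._ewma[provider] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def fastest(self, candidates):
        """Return the candidate with the lowest smoothed latency.

        Providers with no samples yet sort first so they get tried at least once.
        """
        return min(candidates, key=lambda p: self._ewma.get(p, 0.0))


tracker = LatencyTracker()
tracker.record("deepseek", 300)
tracker.record("alibaba", 500)
tracker.record("deepseek", 1300)  # two slow calls in a row push the
tracker.record("deepseek", 1300)  # deepseek average above alibaba's 500 ms
print(tracker.fastest(["deepseek", "alibaba"]))  # alibaba
```

Smoothing keeps a single slow response from flipping the route, while a sustained slowdown still does.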
Implementing fallback:
    async def chat_completion(request):
        providers = priority_list(request.model)
        last_error = None
        for provider in providers:
            try:
                return await provider.complete(request)
            except ProviderOverloaded as exc:
                last_error = exc  # remember the failure and try the next provider
                continue
        # Every provider failed (or none was configured for this model).
        raise ServiceUnavailable("All providers overloaded") from last_error
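To see the fallback loop run end to end, here is a self-contained sketch with stand-in providers. `FakeProvider` and `complete_with_fallback` are hypothetical names, and the exception classes are redefined locally so the snippet runs on its own.

```python
import asyncio


class ProviderOverloaded(Exception):
    pass


class ServiceUnavailable(Exception):
    pass


class FakeProvider:
    """Stand-in provider that is either healthy or permanently overloaded."""

    def __init__(self, name: str, healthy: bool):
        self.name = name
        self.healthy = healthy

    async def complete(self, request: dict) -> str:
        if not self.healthy:
            raise ProviderOverloaded(self.name)
        return f"{self.name}: ok"


async def complete_with_fallback(request: dict, providers) -> str:
    failed = []
    for provider in providers:
        try:
            return await provider.complete(request)
        except ProviderOverloaded as exc:
            failed.append(str(exc))  # remember who failed, then try the next
    raise ServiceUnavailable(f"all providers overloaded: {failed}")


chain = [FakeProvider("deepseek", healthy=False), FakeProvider("alibaba", healthy=True)]
result = asyncio.run(complete_with_fallback({"prompt": "hi"}, chain))
print(result)  # alibaba: ok
```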
Token Counting
The hardest part. Each provider counts tokens differently. Our approach:
- Default to tiktoken (OpenAI's tokenizer) for OpenAI-compatible models
- Provider-reported token counts from response headers
- Estimated: len(text) / 4 as a last-resort fallback. This heuristic fits English text; it undercounts Chinese-heavy content, where a character is ~2 tokens in most tokenizers
We store user usage as the count reported by the provider, not our estimate. This avoids disputes.
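The "provider count first, estimate only as a fallback" rule can be sketched as below. The estimator is an assumption built from the two rules of thumb above (about 4 characters per token for non-Chinese text, about 2 tokens per Chinese character). It doesn't call tiktoken, and the function names are illustrative.

```python
from typing import Optional


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~2 tokens per Chinese character, ~4 other characters per token."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return cjk * 2 + -(-other // 4)  # ceiling division for the non-CJK part


def billable_tokens(text: str, provider_reported: Optional[int] = None) -> int:
    """Prefer the provider's own count; estimate only when it is missing."""
    if provider_reported is not None:
        return provider_reported
    return estimate_tokens(text)


print(estimate_tokens("hello world!"))  # 3 (12 characters / 4)
print(estimate_tokens("你好世界"))       # 8 (4 characters * 2)
print(billable_tokens("你好世界", 5))    # 5 (provider count wins)
```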
Results
With this architecture:
- Adding a new provider takes ~100 lines of code (adapter + routing entry)
- 99.9% uptime across 45 models
- Average response time: 380ms (slightly higher than single-provider due to routing)
The full gateway serves ~100M tokens per day with 6 worker processes. No special hardware needed.
Key Takeaways
- Provider adapters are the critical abstraction — invest in a clean interface
- Rate limiting must be per-model as well as per-user, so one user's heavy traffic on a single model doesn't block their access to the others
- Fallback chain is free reliability — one provider goes down, another takes over
- Unified error handling matters more than you think — your SDK users will thank you
Built with ❤️ and Python async. Data from production serving 45+ Chinese AI models globally.