ModelHub Dev

Posted on Jun 7

Building a Multi-Provider AI Gateway: Rate Limiting, Format Normalization, and Cost Optimization

#api #architecture #backend #tutorial

When you build a product that needs to serve multiple AI models from different providers, you quickly run into a wall: every provider has a different API.

Some use SSE streaming. Some don't. Some count tokens by characters. Some by sub-words. Rate limits? Completely different formats.

Here's how I built a gateway that handles all of them under one interface.

The Problem

You want to offer: DeepSeek, Qwen, GLM-4, Kimi, and more — all through one API key. Each provider has:

Different auth methods
Different content types (JSON vs plain text vs multipart)
Different error formats
Different streaming formats (SSE vs chunked vs WebSocket)
Different token counting

A naive approach would be spaghetti code with if/else chains. Not sustainable.

Architecture: Three Layers

Client → Gateway (rate limiter + auth) → Router (model selection) → Provider Adapter (format normalization)

Layer 1: Auth & Rate Limiting

All requests start with API key validation. Simple Redis check: GET api_key:{key}. If found, extract user_id and plan.
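
As a minimal sketch of that lookup: the post does not say how the value under api_key:{key} is encoded, so the JSON record below, the authenticate helper, and the DictStore stand-in are all assumptions made for illustration. Any client with a Redis-like get() should slot in.

```python
import json
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """Anything with a Redis-like get(); a redis-py client fits this shape."""

    def get(self, name: str): ...


def authenticate(store: KeyValueStore, api_key: str) -> Optional[dict]:
    """Return {"user_id": ..., "plan": ...} for a known key, else None."""
    raw = store.get(f"api_key:{api_key}")
    if raw is None:
        return None  # unknown key: the caller should answer with 401
    if isinstance(raw, bytes):  # redis-py returns bytes unless decode_responses=True
        raw = raw.decode("utf-8")
    record = json.loads(raw)  # assumed encoding: a JSON object per key
    return {"user_id": record["user_id"], "plan": record["plan"]}


class DictStore:
    """In-memory stand-in for Redis, used only in this example."""

    def __init__(self, data):
        self._data = data

    def get(self, name):
        return self._data.get(name)


store = DictStore({"api_key:sk-test": b'{"user_id": "u1", "plan": "free"}'})
print(authenticate(store, "sk-test"))
print(authenticate(store, "sk-missing"))
```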

Rate limiting is per-user, per-plan, per-model. Three tiers:

Free tier: 10 RPM, 100K TPM
Standard: 60 RPM, 1M TPM
Pro: 300 RPM, 10M TPM

Implementation is a fixed-window counter in Redis, keyed by the current minute:

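import time

def check_rate_limit(user_id, model, rpm_limit):
    # One counter per user, model, and minute (`redis` is a connected client)
    key = f"ratelimit:{user_id}:{model}:{int(time.time() // 60)}"
    count = redis.incr(key)  # atomic increment; creates the key at 1
    redis.expire(key, 120)  # 2 min ttl so old windows clean themselves up
    return count <= rpm_limit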
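
The function above buckets requests by calendar minute, so a burst that straddles a minute boundary can briefly reach twice the limit. If that matters, a sliding window counter smooths it out by weighting the previous minute's count. The sketch below is in-memory rather than Redis-backed so it runs on its own; the class name and the injectable clock are assumptions for illustration, not the gateway's actual code.

```python
import time


class SlidingWindowCounter:
    """Approximate sliding window built from two adjacent fixed windows."""

    def __init__(self, window_seconds=60, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # (key, window_index) -> count; never evicted in this sketch

    def allow(self, key, limit):
        now = self.clock()
        index = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current window used
        previous = self.counts.get((key, index - 1), 0)
        current = self.counts.get((key, index), 0)
        # Weight the previous window by how much of it still overlaps
        # the trailing window that ends now.
        estimated = previous * (1.0 - elapsed) + current
        if estimated >= limit:
            return False
        self.counts[(key, index)] = current + 1
        return True
```

With a limit of 10, a user who spent all 10 requests in one minute gets roughly 5 more halfway through the next minute, instead of a full fresh 10 the moment the minute rolls over.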

Layer 2: Router

Each provider registers itself with supported models:

ROUTING_TABLE = {
    "deepseek-v4-flash": "deepseek",
    "deepseek-r1": "deepseek",
    "qwen-3": "alibaba",
    "glm-4": "zhipu",
    "doubao": "byteplus",
    "kimi": "moonshot",
}

The router takes model from the request body and maps it to the correct provider adapter. No if/else — just a dict lookup.
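
A sketch of that lookup, including the case the table alone does not cover: a model name with no entry. UnknownModelError and resolve_provider are hypothetical names, and the table here is a shortened copy of the one above.

```python
ROUTING_TABLE = {
    "deepseek-r1": "deepseek",
    "qwen-3": "alibaba",
    "glm-4": "zhipu",
}


class UnknownModelError(Exception):
    """Raised when the requested model has no routing entry."""


def resolve_provider(request_body: dict) -> str:
    """Map the request's model field to a provider name."""
    model = request_body.get("model")
    try:
        return ROUTING_TABLE[model]
    except KeyError:
        # Surface a clear client error instead of a bare KeyError
        raise UnknownModelError(f"unsupported model: {model!r}") from None


print(resolve_provider({"model": "glm-4"}))
```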

Layer 3: Provider Adapters

This is where the magic happens. Each adapter normalizes:

Input format: Convert OpenAI-style messages to provider-native format
Output format: Convert provider response back to OpenAI-compatible
Streaming: Normalize SSE data: chunks to a unified event format
Error codes: Map provider errors to OpenAI-style errors (401, 429, 500)
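
The adapters in this post subclass BaseAdapter, which is never shown. One plausible shape for it, as a sketch: the three method names come from the examples in this post, while ERROR_STATUS and to_openai_error are assumptions added to illustrate the error-code mapping.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class BaseAdapter(ABC):
    """Hypothetical adapter interface; one subclass per provider."""

    # Provider-specific error code -> HTTP status. Subclasses override this.
    ERROR_STATUS: dict = {}

    @abstractmethod
    def to_provider(self, payload: dict) -> dict:
        """Convert an OpenAI-style request body to the provider's format."""

    @abstractmethod
    def to_openai(self, response_json: dict) -> dict:
        """Convert a provider response to an OpenAI-compatible body."""

    @abstractmethod
    def stream_chunks(self, raw_lines: Iterable[str]) -> Iterator[str]:
        """Yield normalized chunk payloads from the provider's raw stream."""

    def to_openai_error(self, provider_code: str, message: str) -> tuple:
        """Return (http_status, error_body); unknown codes fall back to 500."""
        status = self.ERROR_STATUS.get(provider_code, 500)
        return status, {"error": {"message": message, "code": provider_code}}
```

Making the three conversions abstract means a half-written adapter fails at instantiation time rather than on the first live request.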

Example adapter for DeepSeek:

class DeepSeekAdapter(BaseAdapter):
    def to_provider(self, payload):
        return payload  # DeepSeek already uses OpenAI format

    def to_openai(self, response_json):
        # DeepSeek returns OpenAI-compatible response
        return response_json

    def stream_chunks(self, raw_lines):
        for line in raw_lines:
            if line.startswith("data: "):
                yield line[6:]  # Strip SSE prefix
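
Note that stream_chunks above also passes through the [DONE] sentinel that OpenAI-style streams send as their last data line, so something downstream has to handle it. A slightly fuller sketch that stops at the sentinel and decodes each payload; parse_sse_events is a hypothetical helper, not part of the adapter shown above.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE data lines, stopping at [DONE]."""
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        yield json.loads(data)


lines = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    ": keep-alive",
    "data: [DONE]",
]
for event in parse_sse_events(lines):
    print(event["choices"][0]["delta"]["content"])
```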

For providers whose request format differs from OpenAI's (like Kimi or GLM-4), the adapter rebuilds the payload field by field:

class KimiAdapter(BaseAdapter):
    def to_provider(self, payload):
        # Keep only role and content per message; default temperature to 0.7
        return {
            "model": "kimi",
            "messages": [{"role": m["role"], "content": m["content"]}
                         for m in payload["messages"]],
            "temperature": payload.get("temperature", 0.7),
        }

Cost Optimization

The real value is intelligent routing. With multiple providers, you can:

Fallback on error: If DeepSeek returns 503, try Qwen
Latency-based routing: Route to the fastest provider right now
Cost-based routing: Use the cheapest model that meets quality requirements

Implementing fallback:

async def chat_completion(request):
    providers = priority_list(request.model)
    last_error = None
    for provider in providers:
        try:
            return await provider.complete(request)
        except ProviderOverloaded as exc:
            last_error = exc  # remember the failure and try the next provider
            continue
    # Every provider failed, or the priority list was empty
    raise ServiceUnavailable("All providers overloaded") from last_error
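
Cost-based routing from the list above can be sketched the same way: filter to the options that are healthy and good enough, then take the cheapest. Everything here is illustrative. The ModelOption fields, the quality scores, and the cost figures are placeholders, not real pricing or benchmark data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelOption:
    name: str
    cost_per_million_tokens: float  # placeholder units, not real pricing
    quality_score: float            # hypothetical 0-1 score from your own evals
    healthy: bool = True


def cheapest_qualifying(options: List[ModelOption],
                        min_quality: float) -> Optional[ModelOption]:
    """Pick the cheapest healthy option that meets the quality bar."""
    candidates = [o for o in options
                  if o.healthy and o.quality_score >= min_quality]
    if not candidates:
        return None  # caller decides: relax the bar or return an error
    return min(candidates, key=lambda o: o.cost_per_million_tokens)


options = [
    ModelOption("model-a", cost_per_million_tokens=0.5, quality_score=0.6),
    ModelOption("model-b", cost_per_million_tokens=1.0, quality_score=0.8),
    ModelOption("model-c", cost_per_million_tokens=3.0, quality_score=0.9),
]
print(cheapest_qualifying(options, min_quality=0.75).name)
```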

Token Counting

The hardest part. Each provider counts tokens differently. Our approach:

Default to tiktoken (OpenAI's tokenizer) for OpenAI-compatible models, as an approximation when the model uses its own tokenizer
Provider-reported token counts from response headers
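Estimated: len(text) / 4 for English-heavy content; for Chinese-heavy content, count per character instead, since Chinese chars are ~2 tokens in many tokenizers and len(text) / 4 would undercount badly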

We store user usage as the count reported by the provider, not our estimate. This avoids disputes.
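
A sketch of that preference order: use the provider's numbers when present and estimate only as a fallback. It assumes the counts arrive in an OpenAI-style usage object in the response body (prompt_tokens, completion_tokens); if your provider reports them elsewhere, only the lookup changes. The one-token-per-CJK-character ratio is a deliberately rough assumption, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 chars per token for non-CJK text, and (as an
    assumption) one token per CJK ideograph."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return cjk + (other + 3) // 4  # integer ceiling of other / 4


def resolve_usage(response_json: dict, prompt: str, completion: str) -> dict:
    """Prefer provider-reported counts; fall back to the estimate."""
    usage = response_json.get("usage") or {}
    if "prompt_tokens" in usage and "completion_tokens" in usage:
        return {
            "prompt_tokens": usage["prompt_tokens"],
            "completion_tokens": usage["completion_tokens"],
            "source": "provider",
        }
    return {
        "prompt_tokens": estimate_tokens(prompt),
        "completion_tokens": estimate_tokens(completion),
        "source": "estimate",
    }
```

Recording the source alongside the counts makes it easy to audit later which bills were based on estimates.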

Results

With this architecture:

Adding a new provider takes ~100 lines of code (adapter + routing entry)
99.9% uptime across 45 models
Average response time: 380ms (slightly higher than single-provider due to routing)

The full gateway serves ~100M tokens per day with 6 worker processes. No special hardware needed.

Key Takeaways

Provider adapters are the critical abstraction — invest in a clean interface
Rate limiting must be per-user and per-model, not per-user alone: heavy traffic on one model shouldn't block a user's access to every other model
Fallback chain is free reliability — one provider goes down, another takes over
Unified error handling matters more than you think — your SDK users will thank you

Built with ❤️ and Python async. Data from production serving 45+ Chinese AI models globally.
