Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

AI API Gateway


Stop hard-coding provider-specific API calls throughout your codebase. This gateway gives you a single unified interface to OpenAI, Anthropic, Google, Mistral, and local models — with automatic fallback routing, response caching, rate limiting, and real-time usage analytics. Switch providers, manage costs, and add resilience without changing a single line of application code.

Key Features

  • Unified API Interface — One consistent request/response format across OpenAI, Anthropic, Google Gemini, Mistral, and Ollama
  • Automatic Fallback Routing — Define provider priority chains; if the primary provider fails or hits rate limits, requests route to the next provider seamlessly
  • Response Caching — Cache identical prompts with configurable TTL to slash costs on repeated queries (Redis or in-memory)
  • Rate Limiting — Per-user, per-model, and global rate limits with token bucket algorithm
  • Usage Analytics Dashboard — Track tokens, latency, cost, and error rates per provider/model/user in real time
  • Request/Response Middleware — Plug in custom transforms (PII scrubbing, logging, prompt injection detection) as middleware
  • Streaming Support — Full SSE streaming passthrough with provider-agnostic event format
  • API Key Rotation — Rotate provider API keys without downtime via hot-reload configuration

Quick Start

from ai_gateway import Gateway, Provider

# 1. Configure providers
gateway = Gateway(
    providers=[
        Provider(
            name="openai",
            api_key="YOUR_OPENAI_KEY_HERE",
            models=["gpt-4o", "gpt-4o-mini"],
            priority=1,
        ),
        Provider(
            name="anthropic",
            api_key="YOUR_ANTHROPIC_KEY_HERE",
            models=["claude-sonnet-4-20250514"],
            priority=2,  # Fallback when OpenAI is unavailable
        ),
    ],
    cache_backend="redis",
    cache_ttl=3600,
)

# 2. Make requests — same interface regardless of provider
response = gateway.chat(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}],
    max_tokens=200,
)
print(response.content)
print(f"Provider: {response.provider}, Cost: ${response.cost:.4f}")

Architecture

Client Request
      │
      ▼
┌──────────────┐
│  Rate Limiter │──── Reject (429) if over limit
└──────┬───────┘
       ▼
┌──────────────┐
│  Cache Check  │──── Return cached response if hit
└──────┬───────┘
       ▼
┌──────────────┐
│  Middleware   │──── Pre-process (PII scrub, logging, validation)
│  Pipeline     │
└──────┬───────┘
       ▼
┌──────────────┐     ┌──────────┐
│  Router      │────▶│Provider A│──── Success ──▶ Response
│              │     └──────────┘
│              │──── Failure ────▶┌──────────┐
│              │                  │Provider B│──── Fallback
│              │                  └──────────┘
└──────┬───────┘
       ▼
┌──────────────┐
│  Analytics   │──── Log tokens, latency, cost, errors
└──────────────┘
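The Router step in the diagram above boils down to a priority loop: try providers in order and fall through on failure. Here is a minimal sketch of that idea — an illustration, not the gateway's actual implementation; the `Provider` dataclass and `route` function are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    """Hypothetical stand-in for the gateway's provider handle."""
    name: str
    priority: int
    call: Callable[[str], str]  # raises on rate limit, timeout, or 5xx

def route(providers, prompt):
    """Try providers in ascending priority order, falling through on failure."""
    errors = {}
    for p in sorted(providers, key=lambda p: p.priority):
        try:
            return p.name, p.call(prompt)
        except Exception as exc:
            errors[p.name] = exc  # record the failure and try the next provider
    raise RuntimeError(f"all providers failed: {errors}")
```

If every provider in the chain fails, the gateway has nothing left to return, so surfacing the accumulated per-provider errors to the client is the sensible terminal behavior.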

Usage Examples

Fallback Routing with Cost Controls

from ai_gateway import Gateway, RoutingPolicy

gateway = Gateway(
    routing=RoutingPolicy(
        primary="openai/gpt-4o",
        fallbacks=["anthropic/claude-sonnet-4-20250514", "mistral/mistral-large"],
        fallback_on=["rate_limit", "timeout", "server_error"],
        max_cost_per_request=0.05,  # Skip expensive models if budget exceeded
    )
)

Streaming Responses

for chunk in gateway.chat_stream(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
):
    print(chunk.delta, end="", flush=True)
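The "provider-agnostic event format" behind `chunk.delta` requires normalizing each provider's SSE payload shape. A rough sketch of that mapping — the event shapes follow the publicly documented OpenAI and Anthropic streaming formats, but the function itself is illustrative, not the gateway's real internals:

```python
def normalize_chunk(provider: str, event: dict):
    """Map a provider-specific streaming event to a single text delta.

    Unknown or non-text events return None and are skipped by the caller.
    """
    if provider == "openai":
        # OpenAI chunks carry text at choices[0].delta.content
        choices = event.get("choices") or []
        return choices[0].get("delta", {}).get("content") if choices else None
    if provider == "anthropic":
        # Anthropic emits typed events; text arrives in content_block_delta
        if event.get("type") == "content_block_delta":
            return event.get("delta", {}).get("text")
    return None
```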

Custom Middleware

import re

from ai_gateway.middleware import Middleware

class PIIScrubber(Middleware):
    """Remove PII from prompts before sending to providers."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def pre_request(self, request):
        for msg in request.messages:
            msg["content"] = self.EMAIL_RE.sub("[EMAIL]", msg["content"])
            msg["content"] = self.PHONE_RE.sub("[PHONE]", msg["content"])
        return request

gateway.add_middleware(PIIScrubber())

Usage Analytics

stats = gateway.analytics.summary(period="24h")
print(f"Total requests: {stats.total_requests}")
print(f"Total cost: ${stats.total_cost:.2f}")
print(f"Avg latency: {stats.avg_latency_ms:.0f}ms")
print(f"Cache hit rate: {stats.cache_hit_rate:.1%}")
for provider in stats.by_provider:
    print(f"  {provider.name}: {provider.requests} reqs, {provider.error_rate:.1%} errors")

Configuration

# gateway_config.yaml
providers:
  openai:
    api_key: "${OPENAI_API_KEY}"
    base_url: "https://api.example.com/v1/"
    models: ["gpt-4o", "gpt-4o-mini"]
    timeout_seconds: 30
    max_retries: 2
    priority: 1

  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"
    models: ["claude-sonnet-4-20250514"]
    timeout_seconds: 60
    priority: 2

rate_limiting:
  global_rpm: 1000             # Requests per minute across all users
  per_user_rpm: 60
  per_model_rpm: 500
  algorithm: "token_bucket"

cache:
  backend: "redis"             # redis | memory | disabled
  redis_url: "redis://localhost:6379/0"
  ttl_seconds: 3600
  max_cache_size_mb: 512
  hash_strategy: "content"     # content | content+model | full_request

analytics:
  enabled: true
  storage: "sqlite"            # sqlite | postgres
  retention_days: 90
  dashboard_port: 8080

middleware:
  - pii_scrubber
  - request_logger
  - prompt_injection_detector
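The `${OPENAI_API_KEY}`-style placeholders in the config are resolved from environment variables at load time, which is what makes hot-reload key rotation possible without baking secrets into the file. A minimal sketch of that interpolation — the function name and fail-loud behavior are illustrative assumptions, not the gateway's actual API:

```python
import os
import re

_ENV_RE = re.compile(r"\$\{(\w+)\}")

def interpolate(value: str) -> str:
    """Replace ${VAR} placeholders with environment values; fail loudly if unset."""
    def _sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"config references unset environment variable {name}")
        return os.environ[name]
    return _ENV_RE.sub(_sub, value)
```

Failing loudly on an unset variable is deliberate: a silently empty API key surfaces later as confusing 401s from the provider rather than a clear config error at startup.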

Best Practices

  1. Set per-user rate limits — Prevent a single user from exhausting your entire API quota.
  2. Cache aggressively for deterministic queries — If temperature: 0, the same prompt always yields the same result. Cache it.
  3. Use the cheapest model that works — Route simple tasks to gpt-4o-mini and reserve gpt-4o for complex reasoning.
  4. Monitor error rates per provider — A sudden spike in 500s from one provider means your fallback chain is earning its keep.
  5. Rotate API keys on a schedule — Use hot-reload config to rotate keys monthly without gateway restarts.
  6. Test fallback paths — Intentionally disable your primary provider in staging to verify fallback routing works end-to-end.
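Practice 2 hinges on a stable cache key: hash only the fields that determine the output, and exclude volatile metadata so repeated prompts actually collide. A sketch of a `content+model`-style hash — the exact fields the gateway includes are an assumption here:

```python
import hashlib
import json

def cache_key(model: str, messages: list, temperature: float = 0.0) -> str:
    """Derive a cache key from output-determining fields only.

    Timestamps and request IDs are deliberately excluded; sort_keys and
    compact separators make the serialization canonical across calls.
    """
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```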

Troubleshooting

  • All requests return 429 Too Many Requests — Cause: the global rate limit is too low for your traffic volume. Fix: increase global_rpm in the config or add more provider API keys.
  • Cache never hits despite repeated prompts — Cause: message metadata (timestamps, request IDs) differs between calls. Fix: set hash_strategy: "content" to hash only message content.
  • Fallback provider returns format errors — Cause: response schemas differ between providers. Fix: ensure response_format normalization is enabled in middleware.
  • Analytics dashboard shows $0 cost — Cause: cost calculation requires a model pricing table. Fix: update pricing.yaml with current per-token rates for each model.
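The 429 rejections in the first row come from the token-bucket algorithm named in the rate-limiting config. A simplified single-bucket sketch of how it decides — the real limiter keeps separate buckets per user, per model, and globally:

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity = one minute's quota, refilled continuously."""

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.rate = rpm / 60.0  # refill rate in tokens per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds 429 Too Many Requests
```

Because unused capacity accumulates up to one minute's quota, the bucket tolerates short bursts while still enforcing the average rate — the usual reason this algorithm is chosen over a fixed window.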

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete AI API Gateway with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

