Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

AI API Gateway


Stop hard-coding provider-specific API calls throughout your codebase. This gateway gives you a single unified interface to OpenAI, Anthropic, Google, Mistral, and local models — with automatic fallback routing, response caching, rate limiting, and real-time usage analytics. Switch providers, manage costs, and add resilience without changing a single line of application code.

Key Features

  • Unified API Interface — One consistent request/response format across OpenAI, Anthropic, Google Gemini, Mistral, and Ollama
  • Automatic Fallback Routing — Define provider priority chains; if the primary provider fails or hits rate limits, requests route to the next provider seamlessly
  • Response Caching — Cache identical prompts with configurable TTL to slash costs on repeated queries (Redis or in-memory)
  • Rate Limiting — Per-user, per-model, and global rate limits with token bucket algorithm
  • Usage Analytics Dashboard — Track tokens, latency, cost, and error rates per provider/model/user in real time
  • Request/Response Middleware — Plug in custom transforms (PII scrubbing, logging, prompt injection detection) as middleware
  • Streaming Support — Full SSE streaming passthrough with provider-agnostic event format
  • API Key Rotation — Rotate provider API keys without downtime via hot-reload configuration

Quick Start

from ai_gateway import Gateway, Provider

# 1. Configure providers
gateway = Gateway(
    providers=[
        Provider(
            name="openai",
            api_key="YOUR_OPENAI_KEY_HERE",
            models=["gpt-4o", "gpt-4o-mini"],
            priority=1,
        ),
        Provider(
            name="anthropic",
            api_key="YOUR_ANTHROPIC_KEY_HERE",
            models=["claude-sonnet-4-20250514"],
            priority=2,  # Fallback when OpenAI is unavailable
        ),
    ],
    cache_backend="redis",
    cache_ttl=3600,
)

# 2. Make requests — same interface regardless of provider
response = gateway.chat(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}],
    max_tokens=200,
)
print(response.content)
print(f"Provider: {response.provider}, Cost: ${response.cost:.4f}")

Architecture

Client Request
      │
      ▼
┌──────────────┐
│  Rate Limiter │──── Reject (429) if over limit
└──────┬───────┘
       ▼
┌──────────────┐
│  Cache Check  │──── Return cached response if hit
└──────┬───────┘
       ▼
┌──────────────┐
│  Middleware   │──── Pre-process (PII scrub, logging, validation)
│  Pipeline     │
└──────┬───────┘
       ▼
┌──────────────┐     ┌──────────┐
│  Router      │────▶│Provider A│──── Success ──▶ Response
│              │     └──────────┘
│              │──── Failure ────▶┌──────────┐
│              │                  │Provider B│──── Fallback
│              │                  └──────────┘
└──────┬───────┘
       ▼
┌──────────────┐
│  Analytics   │──── Log tokens, latency, cost, errors
└──────────────┘
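The Router step in the diagram above boils down to a priority loop: try providers in order and fall through on failure. Here is a minimal sketch of that idea — an illustration, not the gateway's actual implementation; the `Provider` dataclass and `route` function are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    """Hypothetical stand-in for the gateway's provider handle."""
    name: str
    priority: int
    call: Callable[[str], str]  # raises on rate limit, timeout, or 5xx

def route(providers, prompt):
    """Try providers in ascending priority order, falling through on failure."""
    errors = {}
    for p in sorted(providers, key=lambda p: p.priority):
        try:
            return p.name, p.call(prompt)
        except Exception as exc:
            errors[p.name] = exc  # record the failure and try the next provider
    raise RuntimeError(f"all providers failed: {errors}")
```

If every provider in the chain fails, the gateway has nothing left to return, so surfacing the accumulated per-provider errors to the client is the sensible terminal behavior.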

Usage Examples

Fallback Routing with Cost Controls

from ai_gateway import Gateway, RoutingPolicy

gateway = Gateway(
    routing=RoutingPolicy(
        primary="openai/gpt-4o",
        fallbacks=["anthropic/claude-sonnet-4-20250514", "mistral/mistral-large"],
        fallback_on=["rate_limit", "timeout", "server_error"],
        max_cost_per_request=0.05,  # Skip expensive models if budget exceeded
    )
)

Streaming Responses

for chunk in gateway.chat_stream(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
):
    print(chunk.delta, end="", flush=True)
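The "provider-agnostic event format" behind `chunk.delta` requires normalizing each provider's SSE payload shape. A rough sketch of that mapping — the event shapes follow the publicly documented OpenAI and Anthropic streaming formats, but the function itself is illustrative, not the gateway's real internals:

```python
def normalize_chunk(provider: str, event: dict):
    """Map a provider-specific streaming event to a single text delta.

    Unknown or non-text events return None and are skipped by the caller.
    """
    if provider == "openai":
        # OpenAI chunks carry text at choices[0].delta.content
        choices = event.get("choices") or []
        return choices[0].get("delta", {}).get("content") if choices else None
    if provider == "anthropic":
        # Anthropic emits typed events; text arrives in content_block_delta
        if event.get("type") == "content_block_delta":
            return event.get("delta", {}).get("text")
    return None
```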

Custom Middleware

import re

from ai_gateway.middleware import Middleware

class PIIScrubber(Middleware):
    """Remove PII from prompts before sending to providers."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def pre_request(self, request):
        for msg in request.messages:
            msg["content"] = self.EMAIL_RE.sub("[EMAIL]", msg["content"])
            msg["content"] = self.PHONE_RE.sub("[PHONE]", msg["content"])
        return request

gateway.add_middleware(PIIScrubber())

Usage Analytics

stats = gateway.analytics.summary(period="24h")
print(f"Total requests: {stats.total_requests}")
print(f"Total cost: ${stats.total_cost:.2f}")
print(f"Avg latency: {stats.avg_latency_ms:.0f}ms")
print(f"Cache hit rate: {stats.cache_hit_rate:.1%}")
for provider in stats.by_provider:
    print(f"  {provider.name}: {provider.requests} reqs, {provider.error_rate:.1%} errors")

Configuration

# gateway_config.yaml
providers:
  openai:
    api_key: "${OPENAI_API_KEY}"
    base_url: "https://api.example.com/v1/"
    models: ["gpt-4o", "gpt-4o-mini"]
    timeout_seconds: 30
    max_retries: 2
    priority: 1

  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"
    models: ["claude-sonnet-4-20250514"]
    timeout_seconds: 60
    priority: 2

rate_limiting:
  global_rpm: 1000             # Requests per minute across all users
  per_user_rpm: 60
  per_model_rpm: 500
  algorithm: "token_bucket"

cache:
  backend: "redis"             # redis | memory | disabled
  redis_url: "redis://localhost:6379/0"
  ttl_seconds: 3600
  max_cache_size_mb: 512
  hash_strategy: "content"     # content | content+model | full_request

analytics:
  enabled: true
  storage: "sqlite"            # sqlite | postgres
  retention_days: 90
  dashboard_port: 8080

middleware:
  - pii_scrubber
  - request_logger
  - prompt_injection_detector
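The `${OPENAI_API_KEY}`-style placeholders in the config are resolved from environment variables at load time, which is what makes hot-reload key rotation possible without baking secrets into the file. A minimal sketch of that interpolation — the function name and fail-loud behavior are illustrative assumptions, not the gateway's actual API:

```python
import os
import re

_ENV_RE = re.compile(r"\$\{(\w+)\}")

def interpolate(value: str) -> str:
    """Replace ${VAR} placeholders with environment values; fail loudly if unset."""
    def _sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"config references unset environment variable {name}")
        return os.environ[name]
    return _ENV_RE.sub(_sub, value)
```

Failing loudly on an unset variable is deliberate: a silently empty API key surfaces later as confusing 401s from the provider rather than a clear config error at startup.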

Best Practices

  1. Set per-user rate limits — Prevent a single user from exhausting your entire API quota.
  2. Cache aggressively for deterministic queries — If temperature: 0, the same prompt always yields the same result. Cache it.
  3. Use the cheapest model that works — Route simple tasks to gpt-4o-mini and reserve gpt-4o for complex reasoning.
  4. Monitor error rates per provider — A sudden spike in 500s from one provider means your fallback chain is earning its keep.
  5. Rotate API keys on a schedule — Use hot-reload config to rotate keys monthly without gateway restarts.
  6. Test fallback paths — Intentionally disable your primary provider in staging to verify fallback routing works end-to-end.
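Practice 2 hinges on a stable cache key: hash only the fields that determine the output, and exclude volatile metadata so repeated prompts actually collide. A sketch of a `content+model`-style hash — the exact fields the gateway includes are an assumption here:

```python
import hashlib
import json

def cache_key(model: str, messages: list, temperature: float = 0.0) -> str:
    """Derive a cache key from output-determining fields only.

    Timestamps and request IDs are deliberately excluded; sort_keys and
    compact separators make the serialization canonical across calls.
    """
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```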

Troubleshooting

  • All requests return 429 Too Many Requests — Cause: the global rate limit is too low for your traffic volume. Fix: increase global_rpm in the config or add more provider API keys.
  • Cache never hits despite repeated prompts — Cause: message metadata (timestamps, request IDs) differs between calls. Fix: set hash_strategy: "content" to hash only message content.
  • Fallback provider returns format errors — Cause: response schemas differ between providers. Fix: ensure response_format normalization is enabled in middleware.
  • Analytics dashboard shows $0 cost — Cause: cost calculation requires a model pricing table. Fix: update pricing.yaml with current per-token rates for each model.
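The 429 rejections in the first row come from the token-bucket algorithm named in the rate-limiting config. A simplified single-bucket sketch of how it decides — the real limiter keeps separate buckets per user, per model, and globally:

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity = one minute's quota, refilled continuously."""

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.rate = rpm / 60.0  # refill rate in tokens per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds 429 Too Many Requests
```

Because unused capacity accumulates up to one minute's quota, the bucket tolerates short bursts while still enforcing the average rate — the usual reason this algorithm is chosen over a fixed window.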

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete AI API Gateway with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

