OpenClaw was banned from Claude while running 40,000 tools in production. No warning, no grace period — just a policy enforcement that shut down their entire inference pipeline. I watched the Hacker News thread light up with the predictable mix of schadenfreude and terror from people running similar systems.
This isn't an edge case. Anthropic, OpenAI, and every other LLM provider reserves the right to change terms, throttle capacity, or outright ban use cases. When you're handling production traffic, a single-provider dependency is a ticking time bomb. Your system needs to fail over between providers without dropping requests or requiring a deploy.
The Architecture Problem Nobody Talks About
Most teams build LLM integrations like this: a direct HTTP client to OpenAI's API, maybe with some retry logic. When that provider goes down — policy change, rate limit, regional outage — your application crashes. The "fix" is usually a frantic weekend migration to another provider, rewriting prompts to match different tokenization limits, adjusting temperature parameters, and praying the output format stays consistent.
Here's what a production multi-provider layer looks like instead. You need three components: a provider abstraction interface, a routing layer with fallback logic, and request-level observability to track which provider handled each call.
```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import anthropic
import openai


@dataclass
class LLMRequest:
    prompt: str
    max_tokens: int = 1000
    temperature: float = 0.7
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LLMResponse:
    content: str
    provider: str
    tokens_used: int
    latency_ms: float


class LLMProvider(ABC):
    @abstractmethod
    def generate(self, request: LLMRequest) -> Optional[LLMResponse]:
        ...

    @abstractmethod
    def is_available(self) -> bool:
        ...


class AnthropicProvider(LLMProvider):
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)
        self._available = True

    def generate(self, request: LLMRequest) -> Optional[LLMResponse]:
        start = time.perf_counter()
        try:
            response = self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=request.max_tokens,
                temperature=request.temperature,
                messages=[{"role": "user", "content": request.prompt}],
            )
        except anthropic.RateLimitError:
            self._available = False
            return None
        except anthropic.PermissionDeniedError:
            self._available = False  # Policy ban
            return None
        except anthropic.APIError:
            return None  # Transient error; keep provider in rotation
        latency = (time.perf_counter() - start) * 1000
        return LLMResponse(
            content=response.content[0].text,
            provider="anthropic",
            tokens_used=response.usage.input_tokens + response.usage.output_tokens,
            latency_ms=latency,
        )

    def is_available(self) -> bool:
        return self._available


class OpenAIProvider(LLMProvider):
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)
        self._available = True

    def generate(self, request: LLMRequest) -> Optional[LLMResponse]:
        start = time.perf_counter()
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                max_tokens=request.max_tokens,
                temperature=request.temperature,
                messages=[{"role": "user", "content": request.prompt}],
            )
        except openai.RateLimitError:
            self._available = False
            return None
        except openai.PermissionDeniedError:
            self._available = False  # Policy ban
            return None
        except openai.APIError:
            return None  # Transient error; keep provider in rotation
        latency = (time.perf_counter() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )

    def is_available(self) -> bool:
        return self._available
```
That abstraction costs you maybe 200 lines of code. In return, you get the ability to swap providers at runtime without touching application logic.
The Routing Layer With Fallback
The router decides which provider handles each request. Priority order, round-robin, least-latency — pick a strategy, but make sure it degrades gracefully when providers fail.
```python
import logging
from typing import List

logger = logging.getLogger(__name__)


class LLMRouter:
    def __init__(self, providers: List[LLMProvider]):
        self.providers = providers

    def route(self, request: LLMRequest) -> LLMResponse:
        """Try providers in order until one succeeds."""
        for provider in self.providers:
            if not provider.is_available():
                logger.warning(f"Skipping unavailable provider: {provider.__class__.__name__}")
                continue
            response = provider.generate(request)
            if response:
                logger.info(f"Request served by {response.provider} in {response.latency_ms:.0f}ms")
                return response
            logger.warning(f"Provider {provider.__class__.__name__} failed, trying next")
        raise RuntimeError("All LLM providers exhausted")


# Usage
router = LLMRouter([
    AnthropicProvider(api_key="sk-ant-..."),
    OpenAIProvider(api_key="sk-..."),
])

request = LLMRequest(
    prompt="Explain Kubernetes pod affinity in one sentence.",
    max_tokens=100,
)

response = router.route(request)
print(f"Response from {response.provider}: {response.content}")
```
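Priority order is only one of the strategies mentioned above. A least-latency router is a small variation: keep a smoothed latency estimate per provider and sort before each request. Here's a sketch with stand-in providers — the names, latencies, and the EWMA smoothing factor are all illustrative, not part of the original design:

```python
import time
from typing import List


class StubProvider:
    """Stand-in for an LLMProvider implementation (hypothetical latencies)."""
    def __init__(self, name: str, latency_ms: float):
        self.name = name
        self.latency_ms = latency_ms

    def is_available(self) -> bool:
        return True

    def generate(self, request: str) -> str:
        time.sleep(self.latency_ms / 1000)  # Simulate provider latency
        return f"{self.name} handled: {request}"


class LeastLatencyRouter:
    """Route each request to the provider with the lowest smoothed latency."""
    def __init__(self, providers: List[StubProvider], alpha: float = 0.2):
        self.providers = providers
        self.alpha = alpha  # EWMA smoothing factor
        self.ewma = {p.name: None for p in providers}

    def _score(self, provider: StubProvider) -> float:
        est = self.ewma[provider.name]
        # Unmeasured providers sort first so each one gets sampled once
        return est if est is not None else -1.0

    def route(self, request: str) -> str:
        for provider in sorted(self.providers, key=self._score):
            if not provider.is_available():
                continue
            start = time.perf_counter()
            response = provider.generate(request)
            observed = (time.perf_counter() - start) * 1000
            prev = self.ewma[provider.name]
            self.ewma[provider.name] = observed if prev is None else (
                self.alpha * observed + (1 - self.alpha) * prev
            )
            if response:
                return response
        raise RuntimeError("All LLM providers exhausted")
```

After each provider has been sampled once, traffic settles on whichever one is currently fastest, and the EWMA lets a provider recover its ranking if its latency improves.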
When Anthropic bans your use case at 3 PM on Friday, the router marks that provider unavailable and immediately starts sending traffic to OpenAI. No downtime, no emergency deploy. Your application logs show the provider switch, but your users see continuous service.
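One caveat: the providers above mark themselves unavailable permanently, which is right for a policy ban but wrong for a rate limit that clears in a minute. A cooldown timer fixes that — here's a minimal sketch (the 60-second default is an assumption; tune it to your provider's rate-limit windows):

```python
import time


class ProviderHealth:
    """Track availability with a cooldown so a rate-limited provider
    re-enters rotation automatically instead of staying dead forever."""
    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self._unavailable_until = 0.0

    def mark_failed(self) -> None:
        # Start (or extend) the cooldown window on each failure
        self._unavailable_until = time.monotonic() + self.cooldown_s

    def is_available(self) -> bool:
        return time.monotonic() >= self._unavailable_until
```

A provider would hold one of these instead of a bare `_available` flag, calling `mark_failed()` on rate-limit errors while still flipping a permanent flag on policy bans.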
What You Lose With Provider Switching
Consistency. Each LLM has different output characteristics — Claude tends toward verbosity, GPT-4 is more concise, open-source models vary wildly. If your application depends on exact JSON output format or specific reasoning patterns, a mid-flight provider switch will break things.
The fix is output validation and retry logic. Parse the response, check for required fields, and if the new provider's format doesn't match, either transform it or fail back to a known-good provider. This adds latency — budget an extra 50-100ms for validation — but it prevents silent corruption.
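Sketched under an assumed schema — the `summary` and `confidence` fields here are hypothetical, and the providers are plain callables for brevity — validation plus fallback looks like this:

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Hypothetical schema: adjust to whatever your application actually requires
REQUIRED_FIELDS = ("summary", "confidence")


def validate_output(raw: str) -> Optional[Dict[str, Any]]:
    """Parse a provider response as JSON and check for required fields."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(field in parsed for field in REQUIRED_FIELDS):
        return None
    return parsed


def generate_validated(
    providers: List[Callable[[str], str]], prompt: str
) -> Dict[str, Any]:
    """Try each provider in order; fall back when output fails validation."""
    for provider in providers:
        parsed = validate_output(provider(prompt))
        if parsed is not None:
            return parsed
    raise ValueError("No provider returned valid output")
```

The key point is that validation failures are treated exactly like provider failures: the request falls through to the next provider rather than handing malformed output to the application.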
Cost also changes. Claude's pricing differs from OpenAI's. If you switch from a $0.003/1K token model to a $0.03/1K model under load, your API bill will reflect that in real time. Monitor token usage per provider and set budget alerts.
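Per-provider spend tracking is a few lines. The prices below are placeholders taken from the example above — check your providers' current rate cards before using real numbers:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; substitute current provider pricing
PRICE_PER_1K = {"anthropic": 0.003, "openai": 0.03}


class CostTracker:
    """Accumulate spend per provider and flag budget overruns."""
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spend = defaultdict(float)  # provider -> dollars spent

    def record(self, provider: str, tokens_used: int) -> None:
        self.spend[provider] += (tokens_used / 1000) * PRICE_PER_1K[provider]

    def total(self) -> float:
        return sum(self.spend.values())

    def over_budget(self) -> bool:
        return self.total() > self.budget_usd
```

Call `record()` with the `tokens_used` field from each `LLMResponse`, and alert when `over_budget()` flips — that's how you notice a fallback to a 10x-more-expensive model before the invoice does.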
The Observability You Actually Need
When your system is routing between three providers, you need metrics on who's handling what. Track these per-provider:
- Request success rate
- P50/P95/P99 latency
- Token consumption
- Error rate by type (rate limit, policy, timeout)
- Availability status changes
Export these to Prometheus or your existing metrics system. Set alerts when a provider's success rate drops below 95% over a 5-minute window. That's your early warning before a total ban.
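Instrumenting the router is straightforward. In production you'd use `prometheus_client` Counters and Histograms; this sketch mirrors the same shape with plain Python so the bookkeeping is visible (the bucket boundaries and error-type labels are assumptions):

```python
import bisect
from collections import defaultdict

# Hypothetical histogram bucket boundaries, in milliseconds
LATENCY_BUCKETS = (50, 100, 250, 500, 1000, float("inf"))


class ProviderMetrics:
    """Minimal per-provider counters mirroring what a Prometheus
    Counter + Histogram pair would export in production."""
    def __init__(self):
        self.requests = defaultdict(int)
        # provider -> error_type -> count (rate limit, policy, timeout, ...)
        self.failures = defaultdict(lambda: defaultdict(int))
        # provider -> per-bucket latency counts
        self.latency = defaultdict(lambda: [0] * len(LATENCY_BUCKETS))

    def observe(self, provider: str, latency_ms: float, error: str = None) -> None:
        self.requests[provider] += 1
        if error:
            self.failures[provider][error] += 1
        else:
            idx = bisect.bisect_left(LATENCY_BUCKETS, latency_ms)
            self.latency[provider][idx] += 1

    def success_rate(self, provider: str) -> float:
        total = self.requests[provider]
        failed = sum(self.failures[provider].values())
        return 1.0 if total == 0 else (total - failed) / total
```

Hook `observe()` into the router's success and failure paths, and the `success_rate()` value is exactly what the alert rule below watches.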
```yaml
# prometheus-rules.yaml
groups:
  - name: llm_provider_health
    interval: 30s
    rules:
      - alert: LLMProviderDegraded
        # Failure ratio, not raw failures/sec, so the threshold reads as "5% of requests"
        expr: >
          rate(llm_requests_failed_total{provider="anthropic"}[5m])
          / rate(llm_requests_total{provider="anthropic"}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Anthropic provider failing >5% of requests"
      - alert: LLMProviderDown
        expr: llm_provider_available{provider="anthropic"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Anthropic provider marked unavailable"
```
The Cost of Getting This Wrong
OpenClaw had 40,000 tools depending on a single provider. When the ban hit, every one of those tools stopped working. Users couldn't complete tasks, SLA guarantees were violated, and the team spent the next week migrating to a different API while fielding support tickets.
If they'd had a multi-provider router, the impact would have been contained to the time it takes to mark Anthropic unavailable — seconds, not days. The router would have shifted traffic to the backup provider automatically.
This isn't theoretical. I've run production systems serving 2M+ LLM requests per day. Provider issues happen monthly: rate limits during usage spikes, regional capacity constraints, model deprecations, terms-of-service enforcement. The systems that survive are the ones that treat providers as interchangeable infrastructure, not trusted dependencies.
Build the abstraction layer before you need it. When your primary provider goes dark, you'll have minutes to respond, not hours to code.
This post is an excerpt from Practical AI Infrastructure Engineering — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at https://activ8ted.gumroad.com/l/ssmfkx
Originally published at fivenineslab.com