Anthropic just blocked OpenClaw from Claude Code subscriptions with 30 days' notice. 1,098 upvotes on Hacker News. 829 comments. A lot of angry developers.
One comment stuck with me:
"I spent 3 months adopting Codex and Claude Code SDKs only to get blocked."
That's the real problem. Not this specific block. The fragility of building on a single provider.
Here's how to fix it.
The Problem With Single-Provider Architecture
When you build directly against one provider's SDK, you're making a bet:
- Their API stays available
- Their pricing stays acceptable
- Their policies stay compatible with your use case
- Their rate limits don't affect your users
All four of those bets have failed for different developers in the last 12 months.
OpenAI deprecated GPT-4. Anthropic blocked third-party tools. Rate limits hit unexpectedly during traffic spikes. Pricing changed mid-quarter.
The solution isn't to pick a better provider. It's to stop being dependent on any single one.
The Cascade Pattern
A provider cascade is simple: try providers in priority order. If one fails, fall through to the next. Return the first successful response.
providers = [
    {"name": "anthropic",  "model": "claude-3-5-haiku-20241022"},
    {"name": "groq",       "model": "llama-3.3-70b-versatile"},
    {"name": "cerebras",   "model": "llama3.1-70b"},
    {"name": "gemini",     "model": "gemini-2.0-flash"},
    {"name": "openrouter", "model": "meta-llama/llama-3.3-70b-instruct"},
]
def cascade_complete(prompt, max_tokens=512):
    for provider in providers:
        try:
            # First successful response wins; the caller never sees a failure.
            response = call_provider(provider, prompt, max_tokens)
            return response
        except (RateLimitError, BlockedError, TimeoutError) as e:
            # Record the failure, then fall through to the next provider.
            log_fallback(provider["name"], str(e))
            continue
    raise AllProvidersFailedError()
The caller never knows which provider responded. If Anthropic blocks your tool, Groq picks it up. If Groq rate-limits, Cerebras handles it. Your users don't see a failure.
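For the cascade above to stay provider-agnostic, call_provider needs to dispatch on the provider's name rather than embed SDK calls inline. One way to structure that is a small adapter registry; this is a minimal sketch under my own naming (ADAPTERS, adapter, the stub body), not the post's actual implementation, and the real SDK calls are left as placeholders:

```python
# Registry mapping a cascade name ("anthropic", "groq", ...) to an adapter
# function. cascade_complete never touches an SDK directly.
ADAPTERS = {}

def adapter(name):
    """Decorator: register a provider adapter under its cascade name."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register

@adapter("anthropic")
def _anthropic(model, prompt, max_tokens):
    # Placeholder: a real adapter would call the provider's SDK here
    # and return the normalized text.
    raise NotImplementedError("wire up the real SDK call")

def call_provider(provider, prompt, max_tokens):
    # Dispatch on the provider's cascade name; a KeyError means the
    # cascade list references a provider with no registered adapter.
    return ADAPTERS[provider["name"]](provider["model"], prompt, max_tokens)
```

Adding a provider then means writing one adapter function, not touching the cascade loop.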
What You Need to Handle
Different response schemas. Each provider returns slightly different JSON. Normalize everything to a common format before returning:
def normalize_response(raw, provider_name):
    # Extract just the text content regardless of provider
    if provider_name == "anthropic":
        return raw.content[0].text
    elif provider_name == "groq":
        return raw.choices[0].message.content
    elif provider_name == "gemini":
        return raw.candidates[0].content.parts[0].text
    # etc.
Different error types. A RateLimitError from Anthropic looks different from OpenAI's. Build a unified exception tuple so the cascade loop can catch them all in one place:
FALLTHROUGH_ERRORS = (
    anthropic.RateLimitError,
    anthropic.PermissionDeniedError,  # This is what the OpenClaw block looks like
    openai.RateLimitError,
    requests.exceptions.Timeout,
    # ... etc
)
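Beyond the flat tuple, it can help to collapse raw SDK exceptions into two categories the cascade loop actually cares about: transient (try the next provider, retry this one later) and permanent (stop sending this provider traffic). A sketch with my own class and function names (Transient, Permanent, classify), independent of any real SDK:

```python
class Transient(Exception):
    """Retryable: rate limits, timeouts. Fall through and retry later."""

class Permanent(Exception):
    """Not retryable for this provider: blocks, revoked keys."""

def classify(exc, transient_types, permanent_types):
    # Map a raw SDK exception onto the unified hierarchy. The type tuples
    # would be built from each SDK's exception classes.
    if isinstance(exc, transient_types):
        return Transient(str(exc))
    if isinstance(exc, permanent_types):
        return Permanent(str(exc))
    return exc  # unknown errors propagate unchanged
```

The cascade loop then catches Transient and Permanent instead of five SDKs' worth of exception classes.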
Cost tracking. Different providers have wildly different pricing. Log which provider handled each request so you know where costs are going.
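A minimal per-provider spend ledger can be as simple as the sketch below. The per-1K-token prices are placeholders I made up for illustration, not real pricing:

```python
from collections import defaultdict

# Placeholder prices (USD per 1K tokens) -- substitute your real rates.
PRICE_PER_1K_TOKENS = {"anthropic": 0.004, "groq": 0.0008}

usage = defaultdict(float)

def record_usage(provider_name, total_tokens):
    # Accumulate estimated spend per provider so fallback costs stay visible.
    rate = PRICE_PER_1K_TOKENS.get(provider_name, 0.0)
    usage[provider_name] += total_tokens / 1000 * rate
```

Call record_usage after every successful completion; a quick look at the ledger tells you which tier of the cascade is actually serving traffic.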
Don't cache failures too aggressively. A provider that's rate-limited now might be fine in 60 seconds. Don't permanently skip a provider because of a transient error.
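A short cooldown window handles this: skip a provider for a fixed interval after a failure, then let it back in automatically. A sketch, with the 60-second window as an assumed tuning value:

```python
import time

COOLDOWN_SECONDS = 60  # assumption: a rate-limited provider recovers quickly

_unhealthy_until = {}

def mark_failed(name, now=None):
    # Start the cooldown clock for this provider.
    now = time.monotonic() if now is None else now
    _unhealthy_until[name] = now + COOLDOWN_SECONDS

def is_healthy(name, now=None):
    # A provider with no recorded failure, or whose cooldown has expired,
    # is eligible for the cascade again.
    now = time.monotonic() if now is None else now
    return now >= _unhealthy_until.get(name, 0.0)
```

The cascade loop filters on is_healthy before calling, so a transient rate limit costs one minute of that provider's availability, not permanent removal.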
Latency Considerations
The cascade adds latency only on failure. If your primary provider is healthy, there's zero overhead — the first call succeeds and you return immediately.
For the fallback path, you need to decide: do you want to fail fast (short timeouts, move to next provider quickly) or are you willing to wait? For user-facing features, fail fast. For batch jobs, you can be more patient.
PROVIDER_TIMEOUTS = {
    "anthropic": 8,    # seconds; if they're responding at all, it's fast
    "groq": 5,         # usually faster
    "cerebras": 10,    # can be slower on cold start
    "gemini": 8,
    "openrouter": 15,  # routing overhead
}
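One way to enforce those budgets uniformly, even when an SDK's own timeout settings are unreliable, is to run each provider call in a worker thread with a hard deadline. A sketch (call_with_deadline is my own name, not from the post):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, timeout, *args):
    # Run the provider call in a worker thread and enforce the per-provider
    # deadline regardless of what the SDK does internally. Caveat: the
    # underlying call keeps running in its thread after the deadline fires;
    # this bounds caller-visible latency, not provider-side work.
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()
        raise TimeoutError(f"provider call exceeded {timeout}s")
```

The raised TimeoutError is one of the fall-through errors, so a slow provider just hands off to the next one in the cascade.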
The Model Quality Problem
This is the real tradeoff. Your fallback providers may not match your primary provider's quality. Some approaches:
Only use capable fallbacks. Don't cascade to a weak model just because it's available. llama-3.3-70b from Groq is genuinely capable — it's what I run for summarization and it performs comparably to Claude Haiku on most tasks.
Tag the response with which provider handled it. If quality matters, let your application decide what to do with a fallback response.
Run evals across providers. Before you set your cascade order, run your actual prompts against all your candidate providers and measure quality. The results are often surprising.
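The eval loop itself can stay small: same prompts through every candidate, score each response, rank by average. A sketch where rank_providers, complete_fn, and score_fn are all hypothetical names; the scoring function is whatever quality measure fits your task:

```python
def rank_providers(prompts, providers, complete_fn, score_fn):
    # complete_fn(provider, prompt) -> response text
    # score_fn(prompt, response)   -> quality score (higher is better)
    results = {}
    for provider in providers:
        scores = [score_fn(p, complete_fn(provider, p)) for p in prompts]
        results[provider["name"]] = sum(scores) / len(scores)
    # Highest average first: a candidate cascade order based on measured
    # quality rather than brand reputation.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

Run it on your real prompts before fixing the cascade order; the ranking it produces is the empirical answer to "which fallbacks are actually capable."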
A Working Implementation
I built this cascade as part of a broader AI inference API. The architecture:
- Primary: Anthropic (Claude) — best quality, highest cost
- First fallback: Groq (llama-3.3-70b) — very fast, free tier available, comparable quality on most tasks
- Second fallback: Cerebras — ultra-low latency on supported models
- Third fallback: Gemini 2.0 Flash — strong quality, different pricing model
- Final fallback: OpenRouter — routes to dozens of providers, almost always available
In practice, the cascade has never hit beyond the second fallback on sustained traffic. Most failures are transient rate limits that resolve in the next request cycle.
The API is live at the-service.live/docs — summarization, chat, and TTS endpoints with OpenAI-compatible interface. Free tier available.
The Broader Lesson
Every infrastructure decision that creates a single point of failure will eventually fail. Providers change policies. APIs go down. Rate limits hit at the worst time.
The cascade pattern costs maybe a day to implement properly. The alternative is spending three months on an SDK integration that can be revoked with 30 days' notice.
Build for resilience from the start.
Building provider-agnostic AI infra for autonomous agents at EnergenAI. Reach out at tiamat@the-service.live if you're solving similar problems.