Jay

How I added LLM fallback to my OpenAI app in 10 minutes

You're running a production app on OpenAI. One Tuesday morning it goes down. Your app returns 500s. You spend an hour refreshing status.openai.com.

There's a better setup. Here's how to add provider fallback to any OpenAI-SDK app without rewriting anything.


The problem with single-provider setups

When you call OpenAI directly, you have one point of failure:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this text..."}],
)

If OpenAI returns a 500 or a 429, your user sees an error. You have no fallback, no visibility into what failed, and no easy way to route to a cheaper provider when you don't need GPT-4 quality.
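
For contrast, here's roughly what hand-rolling fallback looks like. A minimal sketch; the backup client, its base_url, and the model name are hypothetical placeholders for whatever second provider you'd wire up:

from openai import OpenAI, APIConnectionError, APIStatusError

primary = OpenAI(api_key="sk-...")
# Hypothetical second provider; any OpenAI-compatible endpoint works the same way.
backup = OpenAI(api_key="backup-key", base_url="https://backup.example.com/v1")

def complete(messages):
    try:
        return primary.chat.completions.create(model="gpt-4o-mini", messages=messages)
    except (APIStatusError, APIConnectionError):
        # 429s, 500s, timeouts: retries, backoff, logging, and cost tracking
        # are all still yours to build.
        return backup.chat.completions.create(model="backup-model", messages=messages)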


The fix: two lines and a gateway

InferBridge is an OpenAI-compatible API gateway. You point the OpenAI SDK at it instead of OpenAI directly. It handles routing, fallback, and per-request observability — without touching your application logic.

Step 1: Get an InferBridge key (run once)

# Create an account — returns your InferBridge key exactly once, save it.
curl -X POST https://api.inferbridge.dev/v1/users \
  -H 'Content-Type: application/json' \
  -d '{"email":"you@example.com"}'

# {"api_key": "ib_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", ...}

Step 2: Register your existing OpenAI key

curl -X POST https://api.inferbridge.dev/v1/keys \
  -H 'Authorization: Bearer ib_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{"provider":"openai","api_key":"sk-..."}'

Your key is Fernet-encrypted at rest. InferBridge never logs request content and never marks up inference — your key goes directly to the provider.

Step 3: Change two lines in your app

from openai import OpenAI

client = OpenAI(
    api_key="ib_xxx...",                          # ← was sk-...
    base_url="https://api.inferbridge.dev/v1",    # ← new
)

resp = client.chat.completions.create(
    model="ib/balanced",                          # ← was "gpt-4o-mini"
    messages=[{"role": "user", "content": "Summarise this text..."}],
)

That's it. Once you've registered keys for more than one provider, your app has fallback.


What the routing tiers actually do

InferBridge uses explicit routing tiers instead of magic auto-classification:

Tier          Chain                                           Use when
ib/cheap      Groq → DeepSeek → Together → Sarvam → OpenAI    High volume, cost-sensitive, quality-flexible
ib/balanced   OpenAI → Sarvam → Anthropic                     Default for most production apps
ib/premium    Anthropic → OpenAI                              Complex reasoning, quality-critical

The router intersects the tier with the provider keys you've registered. So if you only have an OpenAI key, ib/cheap routes to OpenAI. Register a Groq key (free tier available) and the same request code now hits Groq first — no code change.
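
For example, registering a Groq key reuses the Step 2 endpoint; only the provider field changes (the gsk_... key format here is illustrative):

curl -X POST https://api.inferbridge.dev/v1/keys \
  -H 'Authorization: Bearer ib_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{"provider":"groq","api_key":"gsk_..."}'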


What fallback looks like in practice

A 500 from OpenAI on ib/balanced is invisible to your app. You get a clean 200 with a normal OpenAI-shaped response. The only signal is in the inferbridge block appended to the response body:

{
  "id": "chatcmpl-...",
  "choices": [...],
  "usage": {...},
  "inferbridge": {
    "provider": "anthropic",
    "model": "claude-3-5-haiku-20241022",
    "mode": "ib/balanced",
    "cache_hit": false,
    "latency_ms": 834,
    "cost_usd": "0.000041",
    "residency_actual": "global",
    "request_id": "abc123"
  }
}

provider: "anthropic" tells you OpenAI failed and Anthropic served it. Your application code didn't change. Your user saw nothing.
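
If you want to alert on this from code, the block is readable from the SDK response. A minimal sketch, assuming the openai-python client keeps unknown response fields around in model_extra (its pydantic models allow extra fields):

meta = (resp.model_extra or {}).get("inferbridge", {})
if meta.get("provider") and meta["provider"] != "openai":
    # The primary didn't serve this request; log it to track fallback frequency.
    print(f"fallback: {meta['provider']} served via {meta.get('model')}")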

If every candidate in the chain fails, you get a clean error (see the sketch after this list):

  • All 429s → 429 rate_limit_error with a Retry-After header
  • Mixed 5xx/timeouts → 502 provider_error or 504 gateway_timeout
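
Because these surface as ordinary HTTP errors, the OpenAI SDK raises its usual exceptions and your existing handling keeps working. A minimal sketch:

import time

from openai import OpenAI, APIStatusError, RateLimitError

client = OpenAI(api_key="ib_xxx...", base_url="https://api.inferbridge.dev/v1")

try:
    resp = client.chat.completions.create(
        model="ib/balanced",
        messages=[{"role": "user", "content": "ping"}],
    )
except RateLimitError as e:
    # Every candidate returned 429; honour the Retry-After header.
    time.sleep(int(e.response.headers.get("retry-after", "5")))
except APIStatusError as e:
    # 502 provider_error or 504 gateway_timeout: the whole chain failed.
    print(f"chain exhausted: {e.status_code}")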

Observability you get for free

Every request is logged. Two endpoints give you visibility without a dashboard:

# Aggregated stats
GET /v1/stats
# → totals, cache_hit_rate, breakdown by provider/mode/status

# Paginated request log
GET /v1/logs
# → per-request: provider, model, cost_usd, latency_ms, status, request_id

status can be success, fallback_success, cache_hit, or error. Filter for fallback_success to see exactly when and how often your primary provider is failing.
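
A quick way to eyeball that from Python (the top-level "logs" key in the payload is my assumption; the per-request fields match the ones documented above):

import requests

r = requests.get(
    "https://api.inferbridge.dev/v1/logs",
    headers={"Authorization": "Bearer ib_xxx..."},
)
# Assumed payload shape: {"logs": [{"provider": ..., "status": ..., ...}]}
for row in r.json().get("logs", []):
    if row["status"] == "fallback_success":
        print(row["request_id"], row["provider"], row["latency_ms"])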


Optional: add caching for repeated prompts

For deterministic prompts (classification, extraction, templated queries) you can opt into exact-match caching with one header:

resp = client.chat.completions.create(
    model="ib/balanced",
    messages=[...],
    extra_headers={
        "X-InferBridge-Cache": "true",
        "X-InferBridge-Cache-TTL": "3600",  # seconds
    }
)

The cache key is a SHA-256 of provider + model + messages + determinism params. A cache hit returns cache_hit: true in the inferbridge block and costs zero tokens.
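
To sanity-check a hit, send the same request twice and read cache_hit from the second response (again via model_extra, assuming the SDK keeps extra fields):

msgs = [{"role": "user", "content": "Classify the sentiment: 'great product'"}]
headers = {"X-InferBridge-Cache": "true", "X-InferBridge-Cache-TTL": "3600"}

first = client.chat.completions.create(model="ib/balanced", messages=msgs, extra_headers=headers)
second = client.chat.completions.create(model="ib/balanced", messages=msgs, extra_headers=headers)

# The second, identical request should report cache_hit: true and cost nothing.
print((second.model_extra or {}).get("inferbridge", {}).get("cache_hit"))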


What's not built yet (be honest with yourself)

InferBridge is early. Before you adopt it, know the gaps:

  • No dashboard UI — observability is JSON endpoints only
  • Streaming requests bypass the cache
  • No embeddings endpoint
  • No vision inputs
  • No streaming tool use / function calling

If those are blockers for your use case, it's not the right fit yet.


Try it

Free tier is unlimited BYOK. No credit card.

If you run into anything broken or confusing, hello@inferbridge.dev goes to a real inbox.
