Jay

How I added LLM fallback to my OpenAI app in 10 minutes

You're running a production app on OpenAI. One Tuesday morning it goes down. Your app returns 500s. You spend an hour refreshing status.openai.com.

There's a better setup. Here's how to add provider fallback to any OpenAI-SDK app without rewriting anything.


The problem with single-provider setups

When you call OpenAI directly, you have one point of failure:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this text..."}],
)

If OpenAI returns a 500 or a 429, your user sees an error. You have no fallback, no visibility into what failed, and no easy way to route to a cheaper provider when you don't need GPT-4 quality.
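
For contrast, here's roughly what hand-rolling fallback looks like. A minimal sketch; the backup client, its base_url, and the model name are hypothetical placeholders for whatever second provider you'd wire up:

from openai import OpenAI, APIConnectionError, APIStatusError

primary = OpenAI(api_key="sk-...")
# Hypothetical second provider; any OpenAI-compatible endpoint works the same way.
backup = OpenAI(api_key="backup-key", base_url="https://backup.example.com/v1")

def complete(messages):
    try:
        return primary.chat.completions.create(model="gpt-4o-mini", messages=messages)
    except (APIStatusError, APIConnectionError):
        # 429s, 500s, timeouts: retries, backoff, logging, and cost tracking
        # are all still yours to build.
        return backup.chat.completions.create(model="backup-model", messages=messages)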


The fix: two lines and a gateway

InferBridge is an OpenAI-compatible API gateway. You point the OpenAI SDK at it instead of OpenAI directly. It handles routing, fallback, and per-request observability — without touching your application logic.

Step 1: Get an InferBridge key (run once)

# Create an account — returns your InferBridge key exactly once, save it.
curl -X POST https://api.inferbridge.dev/v1/users \
  -H 'Content-Type: application/json' \
  -d '{"email":"you@example.com"}'

# {"api_key": "ib_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", ...}

Step 2: Register your existing OpenAI key

curl -X POST https://api.inferbridge.dev/v1/keys \
  -H 'Authorization: Bearer ib_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{"provider":"openai","api_key":"sk-..."}'

Your key is Fernet-encrypted at rest. InferBridge never logs request content and never marks up inference — your key goes directly to the provider.

Step 3: Change two lines in your app

from openai import OpenAI

client = OpenAI(
    api_key="ib_xxx...",                          # ← was sk-...
    base_url="https://api.inferbridge.dev/v1",    # ← new
)

resp = client.chat.completions.create(
    model="ib/balanced",                          # ← was "gpt-4o-mini"
    messages=[{"role": "user", "content": "Summarise this text..."}],
)

That's it. Once you've registered keys for more than one provider, your app has fallback.


What the routing tiers actually do

InferBridge uses explicit routing tiers instead of magic auto-classification:

Tier          Chain                                           Use when
ib/cheap      Groq → DeepSeek → Together → Sarvam → OpenAI    High volume, cost-sensitive, quality-flexible
ib/balanced   OpenAI → Sarvam → Anthropic                     Default for most production apps
ib/premium    Anthropic → OpenAI                              Complex reasoning, quality-critical

The router intersects the tier with the provider keys you've registered. So if you only have an OpenAI key, ib/cheap routes to OpenAI. Register a Groq key (free tier available) and the same request code now hits Groq first — no code change.
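
For example, registering a Groq key reuses the Step 2 endpoint; only the provider field changes (the gsk_... key format here is illustrative):

curl -X POST https://api.inferbridge.dev/v1/keys \
  -H 'Authorization: Bearer ib_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{"provider":"groq","api_key":"gsk_..."}'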


What fallback looks like in practice

A 500 from OpenAI on ib/balanced is invisible to your app. You get a clean 200 with a normal OpenAI-shaped response. The only signal is in the inferbridge block appended to the response body:

{
  "id": "chatcmpl-...",
  "choices": [...],
  "usage": {...},
  "inferbridge": {
    "provider": "anthropic",
    "model": "claude-3-5-haiku-20241022",
    "mode": "ib/balanced",
    "cache_hit": false,
    "latency_ms": 834,
    "cost_usd": "0.000041",
    "residency_actual": "global",
    "request_id": "abc123"
  }
}

provider: "anthropic" tells you OpenAI failed and Anthropic served it. Your application code didn't change. Your user saw nothing.
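
If you want to alert on this from code, the block is readable from the SDK response. A minimal sketch, assuming the openai-python client keeps unknown response fields around in model_extra (its pydantic models allow extra fields):

meta = (resp.model_extra or {}).get("inferbridge", {})
if meta.get("provider") and meta["provider"] != "openai":
    # The primary didn't serve this request; log it to track fallback frequency.
    print(f"fallback: {meta['provider']} served via {meta.get('model')}")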

If every candidate in the chain fails, you get a clean error (see the sketch after this list):

  • All 429s → 429 rate_limit_error with a Retry-After header
  • Mixed 5xx/timeouts → 502 provider_error or 504 gateway_timeout
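
Because these surface as ordinary HTTP errors, the OpenAI SDK raises its usual exceptions and your existing handling keeps working. A minimal sketch:

import time

from openai import OpenAI, APIStatusError, RateLimitError

client = OpenAI(api_key="ib_xxx...", base_url="https://api.inferbridge.dev/v1")

try:
    resp = client.chat.completions.create(
        model="ib/balanced",
        messages=[{"role": "user", "content": "ping"}],
    )
except RateLimitError as e:
    # Every candidate returned 429; honour the Retry-After header.
    time.sleep(int(e.response.headers.get("retry-after", "5")))
except APIStatusError as e:
    # 502 provider_error or 504 gateway_timeout: the whole chain failed.
    print(f"chain exhausted: {e.status_code}")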

Observability you get for free

Every request is logged. Two endpoints give you visibility without a dashboard:

# Aggregated stats
GET /v1/stats
# → totals, cache_hit_rate, breakdown by provider/mode/status

# Paginated request log
GET /v1/logs
# → per-request: provider, model, cost_usd, latency_ms, status, request_id

status can be success, fallback_success, cache_hit, or error. Filter for fallback_success to see exactly when and how often your primary provider is failing.
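
A quick way to eyeball that from Python (the top-level "logs" key in the payload is my assumption; the per-request fields match the ones documented above):

import requests

r = requests.get(
    "https://api.inferbridge.dev/v1/logs",
    headers={"Authorization": "Bearer ib_xxx..."},
)
# Assumed payload shape: {"logs": [{"provider": ..., "status": ..., ...}]}
for row in r.json().get("logs", []):
    if row["status"] == "fallback_success":
        print(row["request_id"], row["provider"], row["latency_ms"])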


Optional: add caching for repeated prompts

For deterministic prompts (classification, extraction, templated queries) you can opt into exact-match caching with one header:

resp = client.chat.completions.create(
    model="ib/balanced",
    messages=[...],
    extra_headers={
        "X-InferBridge-Cache": "true",
        "X-InferBridge-Cache-TTL": "3600",  # seconds
    }
)

The cache key is a SHA-256 of provider + model + messages + determinism params. A cache hit returns cache_hit: true in the inferbridge block and costs zero tokens.
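
To sanity-check a hit, send the same request twice and read cache_hit from the second response (again via model_extra, assuming the SDK keeps extra fields):

msgs = [{"role": "user", "content": "Classify the sentiment: 'great product'"}]
headers = {"X-InferBridge-Cache": "true", "X-InferBridge-Cache-TTL": "3600"}

first = client.chat.completions.create(model="ib/balanced", messages=msgs, extra_headers=headers)
second = client.chat.completions.create(model="ib/balanced", messages=msgs, extra_headers=headers)

# The second, identical request should report cache_hit: true and cost nothing.
print((second.model_extra or {}).get("inferbridge", {}).get("cache_hit"))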


What's not built yet (be honest with yourself)

InferBridge is early. Before you adopt it, know the gaps:

  • No dashboard UI — observability is JSON endpoints only
  • Streaming requests bypass the cache
  • No embeddings endpoint
  • No vision inputs
  • No streaming tool use / function calling

If those are blockers for your use case, it's not the right fit yet.


Try it

Free tier is unlimited BYOK. No credit card.

If you run into anything broken or confusing, hello@inferbridge.dev goes to a real inbox.
