江欢(JackSoul)

AI API gateway fallback policy template for production apps

Fallback rules are where an AI API gateway becomes operationally valuable.

The goal is not to blindly retry every failed LLM call. The goal is to choose the right backup model, provider, or budget path based on the workflow, customer tier, latency target, and risk of a lower-quality answer.

A practical fallback policy should define:

  1. which failures are retryable;
  2. which workflows may downgrade models;
  3. which customers or API keys are allowed to use premium fallback routes;
  4. how budget caps change routing behavior;
  5. what metadata gets logged so the team can debug cost and quality later.

1. Classify traffic before routing

Do not write one global fallback rule for every request. Start by classifying traffic:

  • Critical user-facing: support chat, checkout assistance, customer-facing agent answers.
  • Non-critical user-facing: summaries, title generation, enrichment, recommendations.
  • Internal automation: triage, labeling, data cleanup, back-office agents.
  • Batch jobs: long-running summarization, extraction, report generation.
  • Experiments: tests, staging, evaluation, prompt tuning.

Each class should have a different fallback budget and quality floor.
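
One way to make this classification explicit is to tag each request with a traffic class and look up a per-class policy. The sketch below is a minimal illustration, not a definitive design: the class names mirror the list above, while `ClassPolicy`, `FEATURE_CLASSES`, the feature names, the tier names, and every number are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class TrafficClass(Enum):
    CRITICAL_USER_FACING = "critical_user_facing"
    NON_CRITICAL_USER_FACING = "non_critical_user_facing"
    INTERNAL_AUTOMATION = "internal_automation"
    BATCH = "batch"
    EXPERIMENT = "experiment"


@dataclass(frozen=True)
class ClassPolicy:
    fallback_budget_usd: float  # max extra spend per request on fallback routes
    quality_floor: str          # weakest model tier allowed to answer


# Placeholder numbers and tier names, for illustration only.
POLICIES = {
    TrafficClass.CRITICAL_USER_FACING: ClassPolicy(0.50, "balanced"),
    TrafficClass.NON_CRITICAL_USER_FACING: ClassPolicy(0.05, "low_cost"),
    TrafficClass.INTERNAL_AUTOMATION: ClassPolicy(0.02, "low_cost"),
    TrafficClass.BATCH: ClassPolicy(0.01, "low_cost"),
    TrafficClass.EXPERIMENT: ClassPolicy(0.0, "any"),
}

# Hypothetical feature names mapped to traffic classes.
FEATURE_CLASSES = {
    "support_chat": TrafficClass.CRITICAL_USER_FACING,
    "title_generation": TrafficClass.NON_CRITICAL_USER_FACING,
    "ticket_triage": TrafficClass.INTERNAL_AUTOMATION,
    "report_generation": TrafficClass.BATCH,
}


def classify(feature: str) -> TrafficClass:
    # Assumption: unknown features get the most restrictive class (no fallback spend).
    return FEATURE_CLASSES.get(feature, TrafficClass.EXPERIMENT)
```

Defaulting unclassified traffic to the experiment class is a design choice in this sketch: it forces new features to be classified deliberately before they can spend fallback budget.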

2. Decide what counts as a retryable failure

Good retry candidates:

  • upstream timeout;
  • 429 rate limit;
  • temporary 5xx provider error;
  • network interruption;
  • overloaded model endpoint;
  • streaming connection drop before useful output.

Poor retry candidates:

  • invalid API key;
  • malformed request payload;
  • unsupported tool-call schema;
  • content policy rejection;
  • user quota exhausted;
  • deterministic validation failure.

Retrying non-retryable failures usually burns tokens and hides product bugs.
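
The two lists above can be sketched as a small classifier. This is a minimal example under stated assumptions: the status-code sets and the error-kind strings are hypothetical labels that a gateway would assign after parsing a provider's error body, not part of any provider's API. Some providers report an exhausted quota with the same 429 status as a rate limit, so the sketch lets an explicit error kind override the status code.

```python
from typing import Optional

# Assumed mapping of HTTP statuses to "transient" failures.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

# Hypothetical error kinds assigned by the gateway after inspecting the error.
RETRYABLE_KINDS = {"timeout", "network_error", "overloaded", "stream_dropped_early"}
NON_RETRYABLE_KINDS = {
    "invalid_api_key",
    "malformed_request",
    "unsupported_tool_schema",
    "content_policy",
    "quota_exhausted",
    "validation_failed",
}


def is_retryable(status: Optional[int] = None, kind: Optional[str] = None) -> bool:
    """Return True only for failures that a retry could plausibly fix."""
    # An explicit non-retryable kind wins, even if the status looks transient.
    if kind in NON_RETRYABLE_KINDS:
        return False
    if kind in RETRYABLE_KINDS:
        return True
    # Unknown failures are treated as non-retryable (fail closed).
    return status in RETRYABLE_STATUS
```

For example, `is_retryable(status=429)` is true, while `is_retryable(status=429, kind="quota_exhausted")` is false.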

3. Example fallback policy matrix

| Traffic class | Primary route | First fallback | Second fallback | Hard stop |
| --- | --- | --- | --- | --- |
| Critical user-facing | frontier model | same-class model on second provider | cheaper model with explicit uncertainty | after 2 provider failures |
| Non-critical user-facing | balanced model | cheaper model | cached/default response | after budget cap |
| Internal automation | low-cost model | alternate low-cost provider | queue for retry | after daily budget cap |
| Batch jobs | cheapest acceptable model | pause and resume later | manual review queue | after retry budget |
| Experiments | test route | no fallback | fail fast | immediately |

The exact model names matter less than the policy shape.
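
To show the policy shape as data, the matrix can be encoded as an ordered route list per traffic class. This is a sketch only: `RoutePlan`, `next_route`, and the route labels are hypothetical names, and the hard-stop column is stored as a descriptive label rather than enforced.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RoutePlan:
    routes: Tuple[str, ...]  # ordered: primary first, then fallbacks
    hard_stop: str           # descriptive label; not enforced in this sketch


MATRIX = {
    "critical_user_facing": RoutePlan(
        ("frontier", "same_class_second_provider", "cheaper_with_uncertainty_notice"),
        "after_2_provider_failures",
    ),
    "non_critical_user_facing": RoutePlan(
        ("balanced", "cheaper", "cached_or_default_response"),
        "after_budget_cap",
    ),
    "internal_automation": RoutePlan(
        ("low_cost", "alternate_low_cost_provider", "queue_for_retry"),
        "after_daily_budget_cap",
    ),
    "batch": RoutePlan(
        ("cheapest_acceptable", "pause_and_resume_later", "manual_review_queue"),
        "after_retry_budget",
    ),
    "experiment": RoutePlan(("test_route",), "immediately"),
}


def next_route(traffic_class: str, failed_attempts: int) -> Optional[str]:
    """Return the route to try after `failed_attempts` failures, or None when exhausted."""
    routes = MATRIX[traffic_class].routes
    if failed_attempts < len(routes):
        return routes[failed_attempts]
    return None
```

Keeping the matrix as plain data means a routing change is a config change that product and finance can review, not a code change.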

4. Add budget-aware routing

Fallback should consider cost, not only uptime.

Useful rules:

  • If a tenant has used less than 70% of its monthly budget, allow normal fallback.
  • If a tenant has used more than 80%, downgrade non-critical traffic.
  • If a tenant has used more than 95%, block batch jobs and keep only critical routes.
  • If the prepaid balance is exhausted, return a clear quota response instead of silently routing to an expensive model.

This protects gross margin and avoids surprise bills from agent loops.
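
These thresholds can be sketched as a function from budget usage to a routing mode. The mode names and both function names are hypothetical. Note one assumption: the rules above do not say what happens between 70% and 80%, so this sketch treats that band as normal fallback.

```python
def budget_mode(spent: float, monthly_budget: float, prepaid_exhausted: bool = False) -> str:
    """Map a tenant's budget usage to a routing mode."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    if prepaid_exhausted:
        return "quota_exceeded"  # return a clear quota response, do not reroute
    used = spent / monthly_budget
    if used > 0.95:
        return "critical_only"
    if used > 0.80:
        return "downgrade_non_critical"
    # Assumption: the unspecified 70-80% band behaves like normal fallback.
    return "normal"


def is_allowed(mode: str, traffic_class: str) -> bool:
    """Decide whether a request may be routed at all under the given mode."""
    if mode == "quota_exceeded":
        return False
    if mode == "critical_only":
        return traffic_class == "critical_user_facing"
    return True
```

With these definitions, a tenant at 96% of budget still gets `is_allowed("critical_only", "critical_user_facing")`, while batch jobs are blocked.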

5. Preserve attribution metadata

Every fallback event should keep the original request context:

  • tenant id;
  • user id where available;
  • app, feature, workflow, or assistant id;
  • thread/session id;
  • primary provider/model;
  • fallback provider/model;
  • failure reason;
  • input and output tokens;
  • final cost;
  • latency before and after fallback.

Without this metadata, fallback behavior is almost impossible to tune.
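
As an illustration, the fields above fit in one structured log record per fallback event. The field names, units (USD, milliseconds), and the `provider/model` string convention below are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FallbackEvent:
    tenant_id: str
    user_id: Optional[str]      # may be unknown for server-to-server calls
    feature_id: str             # app, feature, workflow, or assistant id
    session_id: Optional[str]   # thread/session id
    primary_route: str          # "provider/model" of the failed route
    fallback_route: str         # "provider/model" that actually answered
    failure_reason: str
    input_tokens: int
    output_tokens: int
    final_cost_usd: float
    latency_ms_before_fallback: int
    latency_ms_after_fallback: int


def to_log_line(event: FallbackEvent) -> str:
    """Serialize the event as one JSON line with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

One JSON object per line keeps the events easy to aggregate later by tenant, feature, or failure reason.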

6. Avoid quality cliffs

A fallback model may be cheaper or more available, but it may not be safe for every task.

Be careful with downgrades for:

  • legal, medical, financial, or compliance-sensitive text;
  • code generation that will be executed automatically;
  • tool-calling agents with write permissions;
  • long-context tasks that require full recall;
  • multilingual customer support where weaker models may hallucinate.

For these routes, it is often better to fail clearly than to silently downgrade.
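
A simple guard can enforce this rule at the routing layer. In the sketch below, the tag names are hypothetical labels that a team would attach to routes matching the sensitive categories above; nothing here is a standard vocabulary.

```python
from typing import Iterable

# Hypothetical tags marking routes where a silent downgrade is unsafe.
SENSITIVE_TAGS = frozenset({
    "regulated_text",        # legal, medical, financial, compliance
    "auto_executed_code",
    "write_tools",           # tool-calling agents with write permissions
    "long_context",
    "multilingual_support",
})


def fallback_decision(route_tags: Iterable[str], fallback_is_downgrade: bool) -> str:
    """Return "fail_clearly" when a downgrade would hit a sensitive route."""
    if fallback_is_downgrade and SENSITIVE_TAGS & set(route_tags):
        return "fail_clearly"
    return "use_fallback"
```

An equivalent-quality fallback on a sensitive route still passes, because only downgrades are blocked.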

7. Recommended default policy

For most SaaS teams, a sane starting point is:

  1. retry the same provider once for transient failures;
  2. switch to an equivalent-quality provider/model for critical traffic;
  3. switch to a cheaper model only for non-critical tasks;
  4. stop fallback when tenant or key budget is exhausted;
  5. log every fallback decision with tenant, feature, model, provider, latency, and cost.
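
The five steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not a production implementation: `TransientError`, `run_with_fallback`, and all parameter names are hypothetical, `call` stands in for whatever client invokes a route, and the log entries are reduced to route and outcome where a real log would carry tenant, feature, provider, latency, and cost.

```python
from typing import Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429s, 5xx)."""


def run_with_fallback(
    call: Callable[[str], str],
    primary: str,
    equivalent: str,
    cheaper: str,
    critical: bool,
    budget_left: Callable[[], bool],
    log: List[Dict[str, str]],
) -> Optional[str]:
    # Steps 1-3: primary, one retry of the primary, then a single fallback
    # chosen by criticality (equivalent quality vs. cheaper model).
    plan = [primary, primary, equivalent if critical else cheaper]
    for route in plan:
        # Step 4: stop as soon as the tenant or key budget is exhausted.
        if not budget_left():
            log.append({"route": route, "outcome": "budget_exhausted"})
            return None
        try:
            result = call(route)
        except TransientError as exc:
            # Step 5: record every failed attempt before moving on.
            log.append({"route": route, "outcome": f"failed: {exc}"})
            continue
        log.append({"route": route, "outcome": "ok"})
        return result
    return None  # every route failed; the caller decides how to surface this
```

Only `TransientError` triggers a retry or fallback here; any other exception propagates to the caller, so non-retryable failures stay visible instead of being masked by a fallback.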

How FerryAPI fits

FerryAPI is an OpenAI-compatible AI API gateway for teams that want one control point for model access, scoped keys, usage visibility, balance controls, and lower-cost routing options without rewriting existing OpenAI SDK integrations.

A gateway-level fallback policy lets teams evolve provider choices while keeping application code stable.

Learn more: https://www.ferryapi.io/docs?utm_source=devto&utm_medium=article&utm_campaign=7day_growth

Final note

Fallback is not just an availability feature. It is a cost, quality, and risk-control feature. The best policy is explicit enough that engineering, product, and finance all understand what happens when the primary model fails or becomes too expensive.
