Nazar Boyko

Posted on Jun 19 • Originally published at nazarboyko.com

LLM Gateways: Routing, Fallbacks, And Semantic Caching

#ai #architecture #llm #gateway

Here's a line of code that's quietly running in production at a surprising number of companies:

const response = await openai.chat.completions.create({ model: "gpt-4o", messages });

It looks harmless. It's also why your AI bill is whatever it is this month, why your app goes down the moment OpenAI has a bad afternoon, and why the same question typed by ten thousand users costs you ten thousand inference calls. That one line hardcodes a vendor, a model, a pricing tier, and a single point of failure all at once.

An LLM gateway is the fix, and the idea is older than the AI hype around it. It's a proxy, the same pattern you've used in front of databases and microservices for years, except it sits between your app and every model provider you talk to. Your code calls the gateway. The gateway decides which model actually answers, what happens when that model is down, and whether it even needs to call a model at all. Three jobs: routing, fallbacks, and caching. Let's take them apart, because each one has a gotcha that the marketing pages skip.

Why A Proxy, And Not Just A Wrapper Function

The instinct is to write a helper function, function askLlm(prompt) { ... }, and call it a day. That works until the second provider shows up. Then you're threading model names, API keys, and provider-specific quirks through your call sites. OpenAI wants messages, Anthropic wants system separated out, Google wants something else again. Every place you call a model now knows too much.

A gateway collapses all of that into one surface. You speak one dialect, almost always the OpenAI chat-completions shape, because it's become the lingua franca, and the gateway translates to whatever provider it routes to. That single chokepoint is the whole point. Cross-cutting concerns want a chokepoint. Caching, retries, budget caps, rate limiting, audit logging, PII redaction: none of those belong scattered across your codebase. They belong in the one place every request already flows through.
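To make "the gateway translates" concrete, here's a minimal sketch of one direction of that translation: taking an OpenAI-style message list and reshaping it for a provider that wants the system prompt as a separate top-level field, the way Anthropic's Messages API does. The type and function names (ChatMessage, toAnthropicRequest) are made up for illustration, and a real adapter also has to deal with tool calls, images, and streaming.

```typescript
// Hypothetical types for illustration; not any SDK's actual definitions.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface AnthropicStyleRequest {
  model: string;
  max_tokens: number;
  system?: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

function toAnthropicRequest(
  model: string,
  messages: ChatMessage[],
  maxTokens = 1024,
): AnthropicStyleRequest {
  // Collect every system message into one top-level string.
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n\n");

  // Everything else stays in the conversation, in its original order.
  const rest: { role: "user" | "assistant"; content: string }[] = [];
  for (const m of messages) {
    if (m.role !== "system") {
      rest.push({ role: m.role, content: m.content });
    }
  }

  return {
    model,
    max_tokens: maxTokens,
    ...(system ? { system } : {}), // omit the field when there is no system prompt
    messages: rest,
  };
}
```

Your call sites keep sending the one shape they already know. Only this adapter has to care that the provider on the other side wants it arranged differently.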

         ┌─────────────────────────────────────────────┐
your app │  cache?  →  route  →  call  →  fallback?    │  →  provider
  ──────►│   ▲                                         │      (OpenAI,
         │   └── hit: return in <5ms, $0               │       Anthropic,
         └─────────────────────────────────────────────┘       local, ...)

You can build this yourself (it's a few hundred lines of Node or Python around an HTTP client) or use one of the open-source ones like LiteLLM (which speaks to 100+ providers behind the OpenAI API shape) or a managed edge gateway from Cloudflare or Vercel. The build-versus-buy call comes down to how much of the hard part (the caching semantics, the failover logic, the observability) you want to own. We'll come back to that. First, the three jobs.

Routing: Stop Paying Frontier Prices For "What's 2+2"

Most apps send every request to their best, most expensive model. It feels safe. It's also wildly wasteful, because most requests don't need a frontier model. Classifying a support ticket, extracting a date from a sentence, deciding whether a comment is spam: a small, cheap model nails these. You're paying Michelin-star prices to flip a burger.

Routing is the gateway deciding, per request, which model should answer. The strategies stack roughly like this:

Static rules are the floor. Route by a field you already have: this customer tier gets the big model, that internal tool gets the cheap one. No intelligence, just config. Cheap to build, easy to reason about, and honestly enough for a lot of apps.

Latency- and cost-based routing picks the model that's fastest or cheapest right now, often with a fallback chain so a rate-limited provider hands off to the next one automatically. This is bread-and-butter for gateways like LiteLLM and OpenRouter: you define an ordered list, and traffic flows to the first one that's healthy.

Model routing by difficulty is where it gets interesting. A small "router model" looks at the prompt and predicts whether a cheap model can handle it or whether you need the expensive one. This sounds like a toy until you look at the numbers. The RouteLLM work out of LMSYS reported a router that hit 95% of GPT-4's quality on the MT Bench benchmark while sending only 14% of queries to GPT-4; the other 86% went to a far cheaper model. Other published setups report hitting ~97% of GPT-4 accuracy at roughly a quarter of the cost. The savings aren't a rounding error; they're the difference between a feature that ships and one that gets killed in a budget review.

Here's the shape of a tiered router. The point isn't the exact code: it's that this logic lives in one place, not sprinkled across forty call sites:

// needsReasoning() and isCodeTask() are placeholder heuristics you supply;
// the model names are illustrative aliases, not a recommendation.
function route(prompt: string): string {
  // cheap heuristic first: no model call to decide
  if (prompt.length < 200 && !needsReasoning(prompt)) {
    return "gpt-4o-mini";          // ~15x cheaper per token
  }
  if (isCodeTask(prompt)) {
    return "claude-sonnet";
  }
  return "gpt-4o";                 // the expensive default, earned not assumed
}

Tip
Before you reach for a fancy ML router, try the dumb version: route by your own metadata. You usually already know whether a request is a high-stakes user-facing answer or a background batch job. That single boolean captures most of the savings with none of the complexity.
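Here's roughly what that dumb version looks like. This is a sketch under assumptions: the RequestMeta fields and the two model aliases are invented for the example, and your own metadata will look different.

```typescript
// Hypothetical request metadata; the field names are assumptions.
interface RequestMeta {
  userFacing: boolean;                 // is a human waiting on this answer?
  tier: "free" | "pro" | "enterprise"; // customer plan
}

// Placeholder aliases; map them to real model IDs in your gateway config.
const CHEAP_MODEL = "small-model";
const STRONG_MODEL = "frontier-model";

function routeByMetadata(meta: RequestMeta): string {
  // Background jobs never get the expensive model in this sketch.
  if (!meta.userFacing) return CHEAP_MODEL;
  // Free-tier traffic stays cheap; paying tiers get the strong model.
  return meta.tier === "free" ? CHEAP_MODEL : STRONG_MODEL;
}
```

No prompt inspection, no extra model call, and the rule is readable in a code review. If this captures most of your savings, the learned router can wait.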

The honest tradeoff: a learned router adds its own small inference cost and a chance of misrouting a hard question to a weak model. That's why the serious teams roll routing out in shadow mode first: send every request to both the router's pick and the current default, log both, return only the default to the user, and compare offline. Once the router's choices look good on real traffic, flip it live behind a feature flag at 5% and climb. You don't bet production quality on a routing table you've never seen run.
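A shadow-mode wrapper can be quite small. The sketch below assumes you pass in your own call function and logger; shadowRoute, ModelCall, and ShadowRecord are names invented here, not part of any gateway's API.

```typescript
type ModelCall = (model: string, prompt: string) => Promise<string>;

interface ShadowRecord {
  prompt: string;
  defaultModel: string;
  defaultAnswer: string;
  candidateModel: string;
  candidateAnswer: string | null; // null if the shadow call failed
}

async function shadowRoute(
  prompt: string,
  defaultModel: string,
  pickModel: (prompt: string) => string, // the router under test
  call: ModelCall,
  log: (record: ShadowRecord) => void,
): Promise<string> {
  const candidateModel = pickModel(prompt);

  // Start the shadow call first; its failure must never break the user's request.
  const shadow = call(candidateModel, prompt).catch(() => null);
  const defaultAnswer = await call(defaultModel, prompt);

  // Log once the shadow call settles, without making the user wait for it.
  void shadow
    .then((candidateAnswer) =>
      log({ prompt, defaultModel, defaultAnswer, candidateModel, candidateAnswer }),
    )
    .catch(() => undefined); // a broken logger should not crash the process

  return defaultAnswer; // the user only ever sees the current default
}
```

Two things to keep in mind: shadow mode roughly doubles your model spend while it's on, so sample a fraction of traffic if that hurts, and you can skip the second call whenever the router picks the same model as the default.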

Fallbacks: The Part Everyone Skips Until 2am

Routing decides who answers when things are fine. Fallbacks decide what happens when they're not. And things are not fine more often than the status pages admit. Providers rate-limit you, time out, return 500s, or get slow enough that your users give up. If your app has exactly one model hardcoded, every one of those becomes your outage.

A fallback chain is just an ordered list: try the primary, and on failure, transparently try the next. The user never sees the seam.

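# litellm-style fallback config (key names follow LiteLLM's proxy config; check the current docs)
# Each deployment gets its own name: entries that share a model_name are
# load-balanced, not tried in order.
model_list:
  - model_name: chat-primary
    litellm_params: { model: openai/gpt-4o }
  - model_name: chat-secondary
    litellm_params: { model: anthropic/claude-sonnet-4 }
  - model_name: chat-local
    litellm_params: { model: ollama/llama3 }   # last-resort local model
litellm_settings:
  fallbacks:
    - chat-primary: ["chat-secondary", "chat-local"]   # tried in order on error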

But naive retries make outages worse, not better. If a provider is drowning, hammering it with retries is pouring water on a grease fire. Two patterns keep you honest:

Exponential backoff spaces retries out: wait a bit, then a bit more, with a touch of random jitter so all your servers don't retry in lockstep and create a thundering herd.
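As a sketch, here's one common variant, often called "full jitter": the delay is a random value between zero and an exponentially growing ceiling. The function names and the default base, cap, and retry count are assumptions picked for the example, not tuned recommendations.

```typescript
// Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  baseMs = 200,
  capMs = 10_000,
  random: () => number = Math.random, // injectable so the function is testable
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the error
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

This version retries on any error to keep the sketch short. In a real gateway you'd retry only what's worth retrying (rate limits, timeouts, 5xx responses) and fail fast on a bad request, because no amount of waiting fixes a malformed prompt.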

Circuit breaking is the one people forget. After a provider fails enough times in a row, you stop sending it traffic entirely for a cooling-off window, fall straight through to the backup, and only probe the broken one occasionally to see if it's back. Without a breaker, every single request still pays the full timeout penalty against a dead provider before failing over. With one, you fail over instantly.

class CircuitBreaker {
  private fails = 0;
  private openUntil = 0;

  constructor(
    private threshold = 5,      // trip after 5 consecutive failures
    private cooldownMs = 30_000, // stay open for 30s
  ) {}

  allow(): boolean {
    if (this.fails >= this.threshold && Date.now() < this.openUntil) {
      return false;             // circuit open: skip this provider
    }
    return true;
  }

  record(ok: boolean): void {
    if (ok) {
      this.fails = 0;           // recovered
    } else {
      this.fails += 1;
      this.openUntil = Date.now() + this.cooldownMs;
    }
  }
}
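To show where the breaker sits, here's a sketch of a fallback loop that consults it before every call. It only assumes an object with the same allow() and record() methods as the CircuitBreaker above; the Provider shape and callWithFallback are hypothetical names for this example.

```typescript
// Any object with these two methods works, including the CircuitBreaker above.
interface Breaker {
  allow(): boolean;
  record(ok: boolean): void;
}

interface Provider {
  name: string;
  breaker: Breaker;
  call: (prompt: string) => Promise<string>;
}

async function callWithFallback(
  providers: Provider[], // ordered: primary first, last resort last
  prompt: string,
): Promise<{ provider: string; answer: string }> {
  const errors: string[] = [];

  for (const p of providers) {
    if (!p.breaker.allow()) {
      errors.push(`${p.name}: circuit open, skipped`);
      continue; // fail over immediately, no timeout paid
    }
    try {
      const answer = await p.call(prompt);
      p.breaker.record(true);
      return { provider: p.name, answer };
    } catch (err) {
      p.breaker.record(false);
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }

  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Each provider gets its own breaker, so one dead vendor doesn't poison the others. Returning the provider name alongside the answer is a small design choice that pays off later: it lets you log how often you're actually running on the backup.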

Warning
A fallback chain is only as good as your failure detection. A provider that returns a fast, confident, completely wrong 200 OK won't trip any breaker. It isn't "failing," it's just bad. Health checks catch downtime, not degradation. That's a different problem, and it's why you still need evals on the output, not just monitoring on the transport.

Semantic Caching: The Part That's Magic And The Part That Bites

Now the headline feature. Normal caching keys on exact bytes: same request in, same response out. That's useless for LLMs, because nobody types the same thing twice. "How do I reset my password?" and "I forgot my password, how do I change it?" are the same question with zero matching characters. Exact-match caching sees two different keys and calls the model twice.

Semantic caching keys on meaning instead of bytes. Here's the actual mechanism, because this is where the "under the hood" lives:

1. Convert the incoming prompt into an embedding, a vector of numbers that encodes its meaning.
2. Run a similarity search against the embeddings of everything you've cached, usually with cosine similarity.
3. If the closest match scores above a threshold, return that cached answer. Otherwise, call the model and cache the new result.

async function semanticLookup(prompt: string, threshold = 0.95) {
  const vec = await embed(prompt);                       // prompt -> vector
  const { match, score } = await vectorDb.nearest(vec);  // cosine similarity search
  if (score >= threshold) {
    return match.cachedResponse;                         // HIT: ~5ms, $0
  }
  const answer = await callModel(prompt);                // MISS: 2-5s, full token cost
  await vectorDb.insert(vec, answer);
  return answer;
}
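The vectorDb in that snippet is doing the interesting work, so here's a sketch of what it does underneath: cosine similarity plus a nearest-neighbor scan. This in-memory version is an illustration only (InMemorySemanticCache is a made-up name), and it scans every entry, which a real vector store avoids with an approximate index.

```typescript
interface CacheEntry {
  vector: number[];
  response: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  private entries: CacheEntry[] = [];

  // Linear scan over every cached vector; fine for a demo, slow at scale.
  nearest(vector: number[]): { entry: CacheEntry; score: number } | null {
    let best: { entry: CacheEntry; score: number } | null = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (best === null || score > best.score) best = { entry, score };
    }
    return best; // null when the cache is empty
  }

  lookup(vector: number[], threshold: number): string | null {
    const hit = this.nearest(vector);
    return hit !== null && hit.score >= threshold ? hit.entry.response : null;
  }

  insert(vector: number[], response: string): void {
    this.entries.push({ vector, response });
  }
}
```

Note the empty-cache case: nearest returns null rather than a fake match, so the very first request always falls through to the model.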

The payoff is real and large. A cache hit comes back in single-digit milliseconds instead of the two-to-five seconds a full inference call takes, and it costs you almost nothing: no generation tokens, no completion call, only the embedding and vector lookup you ran to find the match. Published results put cost reductions in the 40-80%+ range on workloads with repetitive queries; one widely-cited writeup measured a 73% drop in spend. Even a modest 30-40% hit rate is free money and a snappier app. For an FAQ bot or a docs assistant where users ask the same fifty things forever, this is the single highest-leverage thing a gateway does.

And now the part the glossy benchmarks bury.

The Threshold Is The Whole Ballgame

That threshold = 0.95 is the most dangerous number in your stack, and it's a slider, not a switch. Set it too high and almost nothing matches: your hit rate collapses and the cache does nothing. Set it too low and you start serving false hits: confidently returning a cached answer to a question that only looks similar.

The classic example: at an aggressive threshold around 0.85, "how to reset my password" can match "how to change my email." Topically cousins, completely different answers. The user asked to reset a password and got told how to change an email, and your logs show a cheerful cache hit. Reports commonly describe a danger zone roughly between 0.88 and 0.94, where questions are related enough to match but different enough that the answer is wrong. The exact range shifts with the embedding model you use.

Negation is even nastier. "Is it safe to run migrations on a live database?" and "Is it not safe to run migrations on a live database?" are nearly identical as vectors, one tiny word apart, but the correct answers are opposites. Embeddings are notoriously soft on negation, so a careless threshold will happily serve the wrong polarity.

Warning
Different query types need different thresholds. Reported sweet spots cluster around 0.94 for FAQ-style queries (where a wrong answer burns trust) and lower for fuzzy product search where a near-match is fine. There is no universal "correct" number: it's a precision-versus-hit-rate dial you tune per use case, and you should be watching for false positives, not just celebrating your hit rate.
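One way to tune that dial with data instead of vibes is to replay a labeled sample at several thresholds. The sketch below assumes you've collected pairs of (similarity score to the nearest cached entry, whether that cached answer was actually correct); the LabeledPair shape and sweepThresholds are names made up for this example.

```typescript
// One labeled example: the similarity score of a query against its nearest
// cached entry, plus a human judgment of whether that cached answer was right.
interface LabeledPair {
  score: number;
  sameAnswer: boolean;
}

interface SweepRow {
  threshold: number;
  hitRate: number;      // fraction of queries that would be served from cache
  falseHitRate: number; // fraction of those hits that would be wrong answers
}

function sweepThresholds(pairs: LabeledPair[], thresholds: number[]): SweepRow[] {
  return thresholds.map((threshold) => {
    const hits = pairs.filter((p) => p.score >= threshold);
    const falseHits = hits.filter((p) => !p.sameAnswer);
    return {
      threshold,
      hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
      falseHitRate: hits.length === 0 ? 0 : falseHits.length / hits.length,
    };
  });
}
```

Run it once per query type, since an FAQ sample and a product-search sample will likely land on different numbers. You're looking for the lowest threshold whose false-hit rate you can live with, not the highest hit rate.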

The practical move is to track false-positive signals: if users immediately rephrase or thumbs-down right after a cache hit, your threshold is too loose. And some things should never be cached at all: anything personalized, anything time-sensitive ("what's my order status"), anything that depends on context the prompt doesn't carry. Caching "summarize this document" across different documents is a great way to hand user A's answer to user B. Scope your cache keys by user or tenant when the answer isn't truly global.
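Those rules fit in one small gate that runs before the semantic lookup. This is a sketch, assuming your app can tell the gateway whether a request is personalized or time-sensitive; the CacheContext fields are invented for the example. The negation check is deliberately crude: it sends anything with a negating word around the semantic path, since that's exactly where embeddings tend to blur opposite questions together.

```typescript
// Hypothetical request context; field names are assumptions for this sketch.
interface CacheContext {
  tenantId: string;
  personalized: boolean;  // answer depends on who is asking
  timeSensitive: boolean; // answer depends on when it is asked
  globalAnswer: boolean;  // true only if every tenant should see the same answer
}

// Crude polarity check: a word list, not real language understanding.
const NEGATION =
  /\b(not|no|never|without|isn't|aren't|can't|cannot|don't|doesn't|won't|shouldn't)\b/i;

// Returns the namespace to search and store under, or null to skip the
// semantic cache entirely for this request.
function semanticCacheNamespace(prompt: string, ctx: CacheContext): string | null {
  if (ctx.personalized || ctx.timeSensitive) return null; // never cache these
  if (NEGATION.test(prompt)) return null;                  // bypass semantic matching
  // Search and store only within this namespace so tenants never share entries.
  return ctx.globalAnswer ? "global" : `tenant:${ctx.tenantId}`;
}
```

A null here means "go to the model, or to an exact-match cache if you have one". The word list will produce some needless bypasses, which costs you a few cache hits; that's a cheaper mistake than serving the wrong polarity.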

So Should You Build It Or Buy It?

You've now seen the three jobs and their teeth. Here's the call.

Build it if your needs are simple and you want zero new dependencies: a thin proxy with a fallback list and exact-match caching is genuinely a weekend project, and you'll understand every line. The trouble starts when you want semantic caching (now you're running a vector store and an embedding model), real circuit breaking, per-tenant budgets, and dashboards. That's a product, not a weekend.

Buy or adopt open source when you want those features without owning them. LiteLLM gives you the unified API and fallbacks across 100+ providers in a few lines. Cloudflare and Vercel offer gateways that run at the edge with caching and analytics baked in. The one cost you're accepting is a network hop: a hosted gateway adds latency (figures around 50ms get quoted for the round trip), though a self-hosted or in-process proxy can keep the overhead far smaller. For most apps, trading 50ms for automatic failover, caching, and cost control is an easy yes. For a latency-critical hot path, measure it before you commit.

The thing to internalize is that the gateway is infrastructure, not a feature. You don't bolt it on at the end. The moment you have a second model, a real bill, or a single user who'll be annoyed when OpenAI hiccups, you want that chokepoint. The line openai.chat.completions.create(...) scattered across your code is a liability the same way raw SQL strings scattered across your code were a liability. It works right up until the day it really, really doesn't.

Put the gate in front. Route the cheap stuff cheap, survive the outages your users will otherwise eat, and stop paying full price for questions you've already answered. Just keep one hand on that similarity dial. It's the one piece of this whole setup that can make you faster, cheaper, and wrong all at the same time.


Top comments (2)

Mudassir Khan • Jun 22

"a provider that returns a fast, confident, completely wrong 200 OK won't trip any breaker" is the thing that burns you in practice. transport health and output health are different problems.

we had a load balanced gateway where the circuit breaker was clean but one provider had gone quietly degraded: valid JSON, garbage quality, two weeks before anyone noticed because all the transport metrics were green.

fixed it with a lightweight judge eval on sampled responses feeding into the breaker. costs a bit, but it caught silent degradation twice in three months.

how are you handling negation in the semantic cache — flagging those prompts for exact match routing, or something else?

Nazar Boyko • Jun 26

Love this - the "two weeks of green transport metrics over a quietly degraded provider" story is exactly the failure mode that section was trying to warn about, and feeding a sampled judge eval into the breaker is the right fix. Transport health and output health really are two different problems.
On negation: yeah, I flag those out of the semantic path rather than trusting the threshold. A cheap negation/polarity check on the prompt routes anything with "not / isn't / can't / without" straight to exact-match (or just skips cache and hits the model). Embeddings are too soft on one-word polarity flips to risk it. Cheaper to detect the danger than to tune a threshold that can't actually tell those two prompts apart. Thanks for the sharp comment 🙌