Every AI Gateway Asks 'Which Provider?' — None Ask 'Is the Response Correct?'

#ai #llm #architecture #devops

There is a lively Ask HN thread right now asking "Best AI Gateway?" People are comparing LiteLLM, Portkey, OpenRouter, Cloudflare AI Gateway, and the newer self-hosted options like Olla and LunarGate.

Every answer evaluates the same dimensions: provider coverage, latency, cost, rate limiting, circuit breakers.

All of them miss the same thing.

The Gateway Blind Spot

Gateways solve a real problem: routing across providers, handling 429s, managing keys. Olla adds auto-failover with health checks. LiteLLM covers 100+ providers. Busbar implements circuit breakers that know whose fault a failure is.

These are all transport-level guarantees. The request went out, a response came back, the status code was 200. Gateway job done.

But here is the uncomfortable truth: a 200 OK with wrong content is worse than a 5xx. A 5xx is visible. Monitoring catches it. The team gets paged. A 200 with silently wrong output triggers none of that, and the agent consuming it simply marches forward on bad data.

A Production Pattern We Keep Seeing

Over the past year of building LLM reliability infrastructure, a consistent pattern emerged during cross-provider failover testing:

1. A backup model returns HTTP 200 with complete, well-formed JSON, so every API check passes.
2. The content is subtly wrong: missing key fields, hallucinated entities, contradictory reasoning.
3. The agent consuming the response has zero indication anything is wrong.

These silent failures can persist for extended periods: the arXiv:2606.14589 taxonomy found that 70% of silent failures are caught by human users, not monitoring.

Catching these failures does not have to be expensive. In our own microbenchmarks, contract validation (checking response structure, required fields, forbidden patterns) runs at ~45 µs P50, which is negligible compared to LLM latency.
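To make "contract validation" concrete, here is a minimal sketch of the three checks named above: structure, required fields, and forbidden patterns. The `Contract` class, its field names, and the `validate` function are hypothetical and invented for illustration. This is not Correctover's actual engine or API, and it makes no claim about reproducing the benchmark number.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical response contract; the field names are illustrative."""
    required_fields: set = field(default_factory=set)
    forbidden_patterns: list = field(default_factory=list)


def validate(raw_body: str, contract: Contract) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    # Structure check: the body must parse as a JSON object.
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return [f"body is not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    violations = []
    # Required-field check: every named key must be present at the top level.
    for name in sorted(contract.required_fields - set(payload)):
        violations.append(f"missing required field: {name}")
    # Forbidden-pattern check: regexes that must not match anywhere in the body.
    for pattern in contract.forbidden_patterns:
        if re.search(pattern, raw_body, flags=re.IGNORECASE):
            violations.append(f"forbidden pattern matched: {pattern}")
    return violations
```

A caller might build `Contract(required_fields={"summary", "sources"}, forbidden_patterns=[r"as an ai language model"])` and treat any non-empty result as a failed response, even though the HTTP status was 200. Checks like these catch structural drift and known bad phrasings. They do not detect hallucinated entities on their own, which would need domain-specific rules on top.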

This is the gap between failover (route to another provider) and what we call correctover (verify the response after switching).

Why This Matters Now

Three trends make this gap critical:

1. Multi-provider is becoming standard. Teams don't rely on one LLM provider anymore. But switching models mid-task means switching "brains" — same prompt, different model, different reasoning. The assumption that Model B will produce equivalent output to Model A is false more often than people realize.

2. Agentic workflows amplify silent failures. A single wrong intermediate result propagates across downstream steps. One misinterpretation becomes a cascade. The arXiv paper calls this "chained hallucination and fabrication" — the most dangerous failure class in multi-step LLM systems.

3. Every gateway vendor is racing to add the same features. Fallback chains, circuit breakers, rate limiting — these are table stakes now. The next competitive differentiator won't be another routing strategy. It will be answer correctness.

What a Validation Layer Looks Like

At the proxy level, this means adding a verification step after failover:

1. Define expected response contracts: required fields, forbidden patterns, latency budgets, schema constraints.
2. After the failover response arrives, validate it against the contract before returning it to the caller.
3. If validation fails, either retry with another provider or surface the degradation explicitly.

The validation step itself runs at ~45 µs P50 in our microbenchmarks, so the added overhead is negligible compared to LLM latency.
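The steps above can be sketched as a small control loop. Everything here is an assumption made for illustration: the `Outcome` type, the `call_with_validation` function, and the idea that a provider is a plain callable returning a dict. A real proxy would also enforce latency budgets and schema constraints, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    provider: Optional[str]   # name of the provider whose response was accepted
    response: Optional[dict]  # accepted response, or None if none was accepted
    degraded: bool            # True when no provider satisfied the contract
    rejections: dict          # provider name -> list of violations or errors


def call_with_validation(prompt, providers, validate) -> Outcome:
    """Try providers in order; accept the first response that passes `validate`.

    `providers` is a list of (name, callable) pairs and `validate` returns a
    list of contract violations (empty means valid). Both are hypothetical
    interfaces, not any real gateway's API.
    """
    rejections = {}
    for name, call in providers:
        try:
            response = call(prompt)
        except Exception as exc:
            # Transport-level failure: this is classic failover territory.
            rejections[name] = [f"transport error: {exc}"]
            continue
        violations = validate(response)
        if not violations:
            return Outcome(name, response, False, rejections)
        # The call succeeded but the contract is broken: try the next provider.
        rejections[name] = violations
    # Nothing passed: surface the degradation instead of returning bad data.
    return Outcome(None, None, True, rejections)
```

The design choice worth noting is that a contract violation is handled the same way as a transport error: it is recorded and the next provider is tried. When every provider fails, the caller receives an explicit `degraded=True` along with the reasons, rather than a plausible-looking wrong answer.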

This is not about replacing gateways. It is about adding a layer that gateways don't provide. We have been building this at Correctover.

What I Want to Know

Has anyone here encountered a production incident where your failover "worked" — the backup provider returned 200 OK — but the response was subtly wrong and it took hours or days to notice?

I think the next Ask HN will not be "Best AI Gateway?" but "Best AI Gateway + Response Validation?"

References:

arXiv:2606.14589 — "When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime"
Microbenchmark: Correctover contract validation engine, 7 fault types, mock provider setup
Correctover.com