correctover

Posted on Jun 25

LLM Failover vs Verified Failover: Why Switching APIs Is Not Enough

#devops #llm #tutorial #ai

When an LLM API provider goes down, most tools switch to a backup. That's failover. But the backup might return a broken response — and you'd never know.

Correctover (pip install correctover) introduced the concept of verified failover: validate every response from a backup provider before accepting it.

The Problem With Standard Failover

Standard failover detects a provider outage and routes to the next provider. But "outage" is just one failure mode. Consider:

Truncation: OpenAI returns 500 tokens instead of 2000. HTTP 200, but the user sees only a quarter of the response.
Schema drift: Provider A returns {"content": [...]} but Provider B returns {"text": "..."}. Your parser breaks.
Cost spike: Failover from GPT-4o ($2.50 per 1M input tokens) to Claude Opus ($15 per 1M input tokens). The request works, but the bill is 6x.
Format inconsistency: JSON output requested, but backup returns markdown. Downstream pipeline chokes.

In every case, failover "worked" — you got a response. But the response violated your contract. This is the silent failure problem.
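To make the silent failure problem concrete, here is a minimal sketch of the kind of checks that catch truncation, format inconsistency, and cost spikes on a response that arrived with HTTP 200. The function names are hypothetical and are not part of the Correctover API. The sketch assumes the provider reports why generation stopped, as OpenAI does with finish_reason "length" and Anthropic does with stop_reason "max_tokens".

```python
import json

def find_silent_failures(text, finish_reason, expect_json=False):
    """Return a list of problems in a response that looked successful (hypothetical helper)."""
    problems = []
    # "length" (OpenAI) and "max_tokens" (Anthropic) both mean the output hit the token limit.
    if finish_reason in ("length", "max_tokens"):
        problems.append("truncated")
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            problems.append("not_json")
    return problems

def cost_multiplier(primary_price_per_1m, backup_price_per_1m):
    """How many times more expensive the backup is per token (hypothetical helper)."""
    return backup_price_per_1m / primary_price_per_1m

# Markdown returned where JSON was requested
print(find_silent_failures("## Config\n- port: 80", "stop", expect_json=True))   # ['not_json']
# JSON cut off at the token limit
print(find_silent_failures('{"port": 80', "length", expect_json=True))           # ['truncated', 'not_json']
# GPT-4o to Claude Opus, using the prices quoted above
print(cost_multiplier(2.50, 15.00))                                               # 6.0
```

None of these checks look at the HTTP status code, which is why a status-based failover never triggers on them.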

What Verified Failover Does Differently

Verified failover adds a validation step between provider response and application delivery:

Standard Failover:
Provider A fails → route to Provider B → deliver response

Verified Failover (Correctover):
Provider A fails → route to Provider B → validate against 6-dimension contract → accept or roll back
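The second flow can be sketched as a plain loop. This is an illustration of the idea, not Correctover's implementation: the Provider and Validator types, the stub providers, and the choice to treat "roll back" as "discard the response and try the next provider" are all assumptions made for the example.

```python
import json
from typing import Callable, Optional, Sequence

Provider = Callable[[str], str]    # hypothetical: prompt in, response text out
Validator = Callable[[str], bool]  # hypothetical: True if the response meets the contract

def verified_failover(prompt: str, providers: Sequence[Provider],
                      validate: Validator) -> Optional[str]:
    """Try providers in order; accept only a response that passes validation."""
    for call in providers:
        try:
            response = call(prompt)
        except Exception:
            continue  # outage or transport error: standard failover to the next provider
        if validate(response):
            return response  # contract satisfied: accept
        # contract violated: discard this response and keep going
    return None  # no provider produced an acceptable response

# Stub providers standing in for real API clients
def primary(prompt):
    raise ConnectionError("provider down")

def backup_markdown(prompt):
    return "## Web server config"

def backup_json(prompt):
    return '{"port": 8080}'

def is_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

result = verified_failover("Write a JSON config", [primary, backup_markdown, backup_json], is_json)
print(result)  # {"port": 8080}
```

Standard failover would have stopped at backup_markdown, because that call succeeded. The validation step is the only thing that keeps the markdown response from being delivered.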

The 6-Dimension Contract

Dimension	Why It Matters
Schema	Prevents parser crashes from structural mismatches
Latency	Avoids swapping a fast provider for a slow one
Cost	Prevents budget blowouts from expensive backup providers
Completeness	Catches truncation and partial responses
Identity	Ensures the right provider served the response
Integrity	Detects corrupted or malformed responses
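One way to picture the six dimensions in the table above is as a single contract object that reports which dimensions a response violates. The Response and Contract classes below, their field names, and the specific rule chosen for each dimension are assumptions for illustration. Correctover's actual checks may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Response:
    """Hypothetical normalized response record."""
    provider: str
    body: dict
    latency_ms: float
    cost_usd: float
    finish_reason: str

@dataclass
class Contract:
    """Hypothetical six-dimension contract with one simple rule per dimension."""
    required_keys: tuple = ("content",)
    max_latency_ms: float = 10_000
    max_cost_usd: float = 0.05
    allowed_providers: tuple = ("openai", "anthropic")

    def violations(self, r: Response) -> List[str]:
        found = []
        if not all(key in r.body for key in self.required_keys):
            found.append("schema")        # structure the parser expects is missing
        if r.latency_ms > self.max_latency_ms:
            found.append("latency")       # backup is too slow
        if r.cost_usd > self.max_cost_usd:
            found.append("cost")          # backup is too expensive
        if r.finish_reason != "stop":
            found.append("completeness")  # generation did not end normally
        if r.provider not in self.allowed_providers:
            found.append("identity")      # served by an unexpected provider
        content = r.body.get("content")
        if not isinstance(content, str) or not content.strip():
            found.append("integrity")     # empty or malformed payload
        return found

contract = Contract()
good = Response("anthropic", {"content": "hello"}, 800, 0.01, "stop")
bad = Response("unknown", {"text": "hi"}, 12_000, 0.20, "length")
print(contract.violations(good))  # []
print(contract.violations(bad))   # all six dimensions
```

Returning the list of violated dimensions, rather than a single boolean, makes it possible to log why a backup response was rejected.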

Real-World Example

from correctover import NeuralReliabilityEngine

engine = NeuralReliabilityEngine()

# This call will automatically fail over if the primary provider fails
# But unlike standard failover, it validates the backup response first
response = engine.chat_completion(
    messages=[{"role": "user", "content": "Write a JSON config for a web server"}],
    providers=["openai", "anthropic"],
    contract={
        "require_json": True,           # Reject non-JSON responses
        "max_latency_ms": 10000,        # Reject slow responses
        "min_completion_ratio": 0.8     # Reject truncated responses
    }
)

# If Anthropic returns markdown instead of JSON,
# Correctover rejects it and falls back
print(response["choices"][0]["message"]["content"])

Why This Matters for Production

In production, LLM calls are part of pipelines — agents, data processors, customer-facing features. A silent failure from a backup provider can:

Corrupt a database with malformed data
Send customers incorrect information
Break downstream automation
Waste money on overpriced backup providers

Verified failover catches these cases before they reach your application.

Correctover vs Traditional Gateways

Aspect	Gateway Failover	Correctover Verified Failover
Detection	HTTP status code	6-dimension contract
Validation	None	Schema + Latency + Cost + Completeness + Identity + Integrity
Rollback	Not supported	Automatic on contract failure
Latency overhead	5-50ms (proxy hop)	22µs (in-process)
Deployment	Proxy server	Embedded SDK
Pricing	Per-token markup	BYOK, zero markup

Getting Started

pip install correctover

Then wrap your existing OpenAI/Anthropic calls with NeuralReliabilityEngine for automatic verified failover.

Key Takeaway

Failover switches providers. Correctover verifies the switch worked. In the era of multi-provider LLM architectures, that distinction is the difference between "the system is up" and "the system is correct."

Website: correctover.com | Documentation: correctover.com/llms.txt | PyPI: pip install correctover

DEV Community