You're mid-task in Claude Code. You hit enter. Then... nothing. 12 seconds later, either the response arrives or you're refreshing.
That lag isn't a bug. It's Opus under peak load. It happens constantly during high-traffic hours. And for a developer in an agentic workflow, it feels identical to a crash.
I got tired of it, so I built glide: a transparent proxy that sits between your AI agent and the API, and automatically switches to a faster model when yours is slow, before you ever experience the timeout.
pip install glide
glide start
export ANTHROPIC_BASE_URL=http://127.0.0.1:8743
claude # Claude Code now routes through glide
That's the entire setup.
The problem with existing approaches
Standard retry logic re-attempts the same slow endpoint, making things worse. Load balancers distribute across identical instances, but model tiers are not identical. LiteLLM does static routing and doesn't adapt to live latency.
None of them address the actual failure mode: a model that's slow right now but will recover in 10 minutes.
Core insight: TTFT as a health signal
Time-to-First-Token (TTFT) is measurable during the stream, before the full response arrives. You don't have to wait 15 seconds to know a model is slow. You know at second 4.
So glide races each request against a per-model TTFT budget. Exceed it? Connection cancelled, next model in the cascade starts immediately.
claude-opus-4-6 TTFT budget: 4s <- best quality, tried first
claude-sonnet-4-6 TTFT budget: 5s <- fast fallback
claude-haiku-4-5 TTFT budget: 3s <- fastest Anthropic model
qwen2.5:14b no limit <- local Ollama, always works
Problem: naive cascade compounds latency
If opus takes 8s to timeout and sonnet takes 5s, a naive cascade makes you wait 13s before reaching haiku. That's worse than just waiting for opus.
Solution: proactive p95 routing
glide maintains a rolling window of observed TTFT values per model (SQLite-backed, persists across restarts) and computes the p95 continuously. If a model's p95 already exceeds its budget, glide skips it without waiting.
Normal day -> opus p95=2s -> serves in ~2s
Peak load -> opus p95=11s -> skipped, sonnet serves in ~1.5s
Recovery -> opus p95=3s -> resumes automatically
No restarts. No config changes. No intervention.
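The rolling-window bookkeeping can be sketched as follows. This is an in-memory illustration; glide itself persists samples in SQLite, and the class and method names here are hypothetical.

```python
from collections import deque

class TTFTWindow:
    """Rolling window of observed TTFT samples for one model
    (in-memory sketch; glide persists samples in SQLite)."""

    def __init__(self, maxlen: int = 100):
        self.samples: deque = deque(maxlen=maxlen)

    def record(self, ttft: float) -> None:
        self.samples.append(ttft)

    def p95(self):
        if not self.samples:
            return None  # cold start: no signal yet
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]

window = TTFTWindow()
for t in [1.8, 2.0, 2.1, 2.2, 11.0]:  # one peak-load spike in the window
    window.record(t)

budget = 4.0
skip = window.p95() is not None and window.p95() > budget
print(skip)  # True: p95 over budget, so the model is skipped proactively
```

The key property: a single spike pushes p95 over budget quickly, and it drops back under budget as fresh fast samples displace old slow ones, which is what makes recovery automatic.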
Second signal: TTT for extended thinking
TTFT covers slow starts but misses a different failure: runaway extended thinking.
Claude Opus with extended reasoning emits thinking tokens before any text. A request can get a fast TTFT (thinking starts immediately) but then spend 60 seconds in the reasoning phase. The user sees nothing the whole time.
I added TTT (Time-to-Think): elapsed time from request start until the first text token after thinking completes. Budget exceeded mid-think? Abort and cascade.
# Inline SSE parser, runs during the active stream
if event_type == "content_block_start":
    if block_type == "thinking":
        ttt_start = time.monotonic()  # start TTT clock
    elif block_type == "text":
        ttt = time.monotonic() - ttt_start
        if ttt > budget:
            raise TTTTimeoutError()  # cascade to next model
        text_started = True  # stream from here
The tricky part: SSE events can span HTTP chunk boundaries, so you can't just parse per-chunk. I built a buffer that accumulates bytes, splits on \n\n, and parses complete events while yielding chunks to the client and monitoring inline.
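The buffering step can be sketched as a small accumulator: bytes go in per HTTP chunk, and only complete events (delimited by a blank line) come out. This is a simplified illustration, not glide's actual parser.

```python
def feed(buffer: bytes, chunk: bytes):
    """Accumulate raw HTTP chunks and emit only complete SSE events.
    Events end at a blank line (b"\\n\\n"); a partial event stays in
    the buffer until its terminator arrives in a later chunk."""
    buffer += chunk
    events = []
    while b"\n\n" in buffer:
        event, buffer = buffer.split(b"\n\n", 1)
        events.append(event)
    return buffer, events

# An event split across two HTTP chunks parses only once it is complete.
buf = b""
buf, events = feed(buf, b'event: content_block_start\ndata: {"type": "thin')
assert events == []  # incomplete event: held in the buffer
buf, events = feed(buf, b'king"}\n\ndata: more\n\n')
print(len(events))  # both events are now complete
```

Splitting on the event terminator rather than on chunk boundaries is what makes the monitor safe to run inline while chunks are simultaneously yielded to the client.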
Third: request hedging for borderline cases
Proactive routing handles sustained load. But when a model is trending slow, not yet over budget but elevated, you're still exposed on individual tail requests.
This is the same problem Google solved in "The Tail at Scale" (2013): send the same request to two replicas, use whichever responds first. I applied that idea across heterogeneous model tiers.
But you don't want to double your API cost on every request. So glide computes a routing decision before each request using observed p95:
| Decision | Condition | Action |
|---|---|---|
| SOLO | primary p95 < 80% of budget | Fire only primary, it's healthy |
| HEDGE | primary risky, backup healthy or cold | Fire both, race on asyncio queue, stream winner, cancel loser |
| SKIP | both risky | Skip hedge entirely, go to sequential cascade |
def _hedge_decision(hedge_models):
    primary, backup = hedge_models
    p95_1 = registry.get(primary.model).p95()
    p95_2 = registry.get(backup.model).p95()
    if p95_1 is None:
        return "hedge"  # cold start, hedge conservatively
    if p95_1 < primary.ttft_budget * 0.8:
        return "solo"  # healthy, no cost wasted
    if p95_2 is not None and p95_2 >= backup.ttft_budget * 0.8:
        return "skip"  # both slow, sequential is better
    return "hedge"  # first risky, second healthy, race them
The 80% threshold catches the trend before models actually start failing individual requests.
When a hedge fires, the losing task gets task.cancel() which propagates through httpx's async with client.stream() context manager, closing the upstream HTTP connection immediately. No resource leaks.
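The race-and-cancel flow can be sketched with asyncio.wait and FIRST_COMPLETED. In glide the tasks wrap httpx streaming requests; here they are simulated with sleeps, and the function names are illustrative.

```python
import asyncio

async def stream(model: str, ttft: float) -> str:
    try:
        await asyncio.sleep(ttft)  # stand-in for awaiting the first token
        return model
    except asyncio.CancelledError:
        # In the real proxy, cancellation propagates through httpx's
        # `async with client.stream()`, closing the upstream connection.
        raise

async def hedge(primary, backup) -> str:
    tasks = [asyncio.create_task(stream(m, t)) for m, t in (primary, backup)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # loser's request is torn down immediately
    return done.pop().result()

winner = asyncio.run(hedge(("opus", 0.05), ("sonnet", 0.01)))
print(winner)  # the faster stream wins the race
```

Using FIRST_COMPLETED means the client starts receiving the winner's stream as soon as either model produces a token, while the loser is cancelled rather than billed for a full completion.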
Provider-agnostic
All cascade providers yield Anthropic SSE internally. glide converts at the edge for each provider:
- OpenAI uses anthropic_to_openai() for the request body and stream_openai_as_anthropic() for the response
- Gemini uses anthropic_to_gemini() and stream_gemini_as_anthropic()
- Ollama already streams JSON; it's wrapped to Anthropic SSE directly
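As a rough sketch of what the request-body side of such a conversion involves (not glide's actual anthropic_to_openai implementation): Anthropic carries the system prompt in a top-level system field and requires max_tokens, while OpenAI chat expects a system message inside messages.

```python
def anthropic_to_openai_sketch(body: dict) -> dict:
    """Minimal Anthropic -> OpenAI request-body translation (illustrative only)."""
    messages = []
    if "system" in body:
        # Anthropic: top-level `system` field; OpenAI: a system-role message.
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens"),
        "stream": body.get("stream", False),
    }

converted = anthropic_to_openai_sketch({
    "model": "gpt-4o",
    "system": "You are terse.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "hi"}],
})
print(converted["messages"][0]["role"])  # system prompt becomes a message
```

The streaming direction is the harder half: each provider's delta events have to be re-emitted as Anthropic-style content_block events so the cascade logic sees one uniform format.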
Mix providers freely:
export CASCADE_JSON='[
{"provider": "anthropic", "model": "claude-opus-4-6", "ttft_budget": 4.0},
{"provider": "openai", "model": "gpt-4o", "ttft_budget": 5.0},
{"provider": "google", "model": "gemini-2.0-flash", "ttft_budget": 3.0},
{"provider": "ollama", "model": "qwen2.5:14b", "ttft_budget": null}
]'
glide start
Accepts both POST /v1/messages (Anthropic) and POST /v1/chat/completions (OpenAI). Returns the matching format automatically.
Observability
curl http://127.0.0.1:8743/metrics
glide_requests_total 42.0
glide_hedge_decision_total{decision="solo"} 30.0
glide_hedge_decision_total{decision="hedge"} 10.0
glide_hedge_decision_total{decision="skip"} 2.0
glide_hedge_winner_total{model="claude-sonnet-4-6"} 8.0
glide_ttft_p95_seconds{model="claude-opus-4-6"} 3.82
glide_ttft_p95_seconds{model="claude-sonnet-4-6"} 0.41
glide_ttft_samples_total{model="claude-opus-4-6"} 20.0
Standard Prometheus text format, no extra dependencies, formatted manually. Plug into Grafana or scrape directly.
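Formatting the exposition text by hand is just string assembly: one name{labels} value line per sample. A minimal sketch, with metric names mirroring those above:

```python
def format_metrics(counters: dict, gauges: dict) -> str:
    """Emit Prometheus text exposition format without a client library:
    plain counters, then gauges carrying a `model` label."""
    lines = []
    for name, value in counters.items():
        lines.append(f"{name} {value}")
    for (name, model), value in gauges.items():
        lines.append(f'{name}{{model="{model}"}} {value}')
    return "\n".join(lines) + "\n"

text = format_metrics(
    {"glide_requests_total": 42.0},
    {("glide_ttft_p95_seconds", "claude-opus-4-6"): 3.82},
)
print(text)
```

Because the format is line-oriented plain text, any Prometheus-compatible scraper can consume the /metrics endpoint with no negotiation or dependencies.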
The pattern
I'm calling this the LLM Request Cascade Pattern, a reliability primitive with three components:
- Budget-based streaming abort - TTFT and TTT as actionable in-stream health signals
- Proactive p95 routing - skip models whose recent observed p95 exceeds their budget
- Adaptive hedging - race models when borderline slow, not on every request
It sits alongside two existing patterns:
- Circuit breaker (binary up/down) handled by llm-circuit
- Load balancing (identical replicas) not applicable to heterogeneous model tiers
The cascade is specifically for the heterogeneous LLM ecosystem: different models with different quality/speed/cost tradeoffs, where you want to route to the best option that can actually respond in time.
Try it
pip install glide
glide start
export ANTHROPIC_BASE_URL=http://127.0.0.1:8743
Works with Claude Code, Cursor, code_puppy, or anything using the Anthropic or OpenAI API.
- GitHub: https://github.com/phanisaimunipalli/glide
- Pattern docs: https://github.com/phanisaimunipalli/glide/blob/main/docs/the-cascade-pattern.md
- HN thread: https://news.ycombinator.com/item?id=47285435
22 tests, MIT license. Would love feedback, especially on the mid-stream SSE abort implementation and the hedge trigger thresholds.
Top comments (4)
Nice build — auto-fallback routing is a real problem in production LLM setups.
One thing worth adding to the cascade logic: prompt complexity awareness. A terse fallback model will sometimes fail not because of a timeout but because the original prompt was too loosely structured for smaller models. Shorter, more ambiguous prompts get worse at degraded quality tiers.
I've been working on flompt (flompt.dev), a free prompt structuring tool that breaks prompts into semantic blocks (role, constraints, output format, etc.). If you pre-structure prompts before they enter the cascade, even the fallback models handle them much more reliably. Could pair well with what you're building here.
Smart approach to the timeout problem. The cascade proxy pattern makes sense especially for long-running tasks where you'd rather get a slightly different model's output than wait indefinitely.
One thing worth considering in the prompt-forwarding layer: prompts that are optimized for one model often underperform on the fallback. If you're switching from GPT-4o to Claude mid-cascade, Claude responds significantly better to XML-structured prompts while GPT tends to prefer markdown. A cascade proxy that also normalizes the prompt format per target model would be a strong differentiator.
Have you thought about adding prompt format adaptation as part of the cascade logic, or is that out of scope for what you're building?
Love this! Cascade proxies are underrated — automatic fallback before timeout is so much cleaner than catching exceptions client-side.
One thing that compounds nicely with this: consistent prompt structure across model switches. If your prompts are tightly structured (role, objective, constraints, output format defined explicitly), different models handle them more predictably. I built flompt (flompt.dev) for exactly this — a free visual prompt builder that compiles prompts into Claude-optimized XML. Works with Claude, GPT, Gemini. When you're routing across models, having a standardized prompt format means you don't get wildly different outputs depending on which model ends up serving the request.
Really clever project — what's your current fallback chain ordering?