<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Phani Sai Ram M</title>
    <description>The latest articles on DEV Community by Phani Sai Ram M (@iamphanisairam).</description>
    <link>https://dev.to/iamphanisairam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811156%2F7a98d958-d5d1-4d6a-a902-a62b16699340.png</url>
      <title>DEV Community: Phani Sai Ram M</title>
      <link>https://dev.to/iamphanisairam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iamphanisairam"/>
    <language>en</language>
    <item>
      <title>I built an LLM Request Cascade proxy that auto-switches models before you ever timeout</title>
      <dc:creator>Phani Sai Ram M</dc:creator>
      <pubDate>Sat, 07 Mar 2026 08:16:41 +0000</pubDate>
      <link>https://dev.to/iamphanisairam/i-built-an-llm-request-cascade-proxy-that-auto-switches-models-before-you-ever-timeout-71n</link>
      <guid>https://dev.to/iamphanisairam/i-built-an-llm-request-cascade-proxy-that-auto-switches-models-before-you-ever-timeout-71n</guid>
      <description>&lt;p&gt;You're mid-task in Claude Code. You hit enter. Then... nothing. 12 seconds later, either the response arrives or you're refreshing.&lt;/p&gt;

&lt;p&gt;That lag isn't a bug. It's Opus under peak load. It happens constantly during high-traffic hours. And for a developer in an agentic workflow, it feels identical to a crash.&lt;/p&gt;

&lt;p&gt;I got tired of it, so I built &lt;strong&gt;&lt;a href="https://github.com/phanisaimunipalli/glide" rel="noopener noreferrer"&gt;glide&lt;/a&gt;&lt;/strong&gt;: a transparent proxy that sits between your AI agent and the API, and automatically switches to a faster model when yours is slow, before you ever experience the timeout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;glide
glide start
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://127.0.0.1:8743
claude   &lt;span class="c"&gt;# Claude Code now routes through glide&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with existing approaches
&lt;/h2&gt;

&lt;p&gt;Standard retry logic re-attempts the same slow endpoint, making things worse. Load balancers distribute across identical instances, but LLM models are not identical. LiteLLM does static routing and doesn't adapt to live latency.&lt;/p&gt;

&lt;p&gt;None of them address the actual failure mode: a model that's &lt;em&gt;slow right now&lt;/em&gt; but will recover in 10 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core insight: TTFT as a health signal
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time-to-First-Token (TTFT)&lt;/strong&gt; is measurable &lt;em&gt;during&lt;/em&gt; the stream, before the full response arrives. You don't have to wait 15 seconds to know a model is slow. You know at second 4.&lt;/p&gt;
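&lt;p&gt;As a toy sketch of the idea (my own illustration, not glide's code), you can wrap a token stream and check the first token's arrival time against a budget. A plain generator can only detect the overrun after the first token lands; the real proxy enforces the budget with an async timeout and cancels the connection:&lt;/p&gt;

```python
import time

def stream_with_ttft_budget(token_iter, budget_s):
    # Yield tokens, aborting if time-to-first-token blows the budget.
    # Toy version: checks TTFT after the first token arrives, whereas a
    # real proxy would enforce the budget with an asynchronous timeout.
    start = time.monotonic()
    first = True
    for token in token_iter:
        if first:
            ttft = time.monotonic() - start
            if ttft > budget_s:
                raise TimeoutError(f"TTFT {ttft:.2f}s over budget {budget_s}s")
            first = False
        yield token

def fake_stream(first_token_delay_s):
    # Stand-in for an upstream SSE stream with a slow first token.
    time.sleep(first_token_delay_s)
    yield "hello"
    yield " world"

# Within budget: the stream passes through untouched.
tokens = list(stream_with_ttft_budget(fake_stream(0.01), budget_s=1.0))
```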

&lt;p&gt;So glide races each request against a per-model TTFT budget. Exceed it? Connection cancelled, next model in the cascade starts immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude-opus-4-6    TTFT budget: 4s   &amp;lt;- best quality, tried first
claude-sonnet-4-6  TTFT budget: 5s   &amp;lt;- fast fallback
claude-haiku-4-5   TTFT budget: 3s   &amp;lt;- fastest Anthropic model
qwen2.5:14b        no limit          &amp;lt;- local Ollama, always works
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Problem: naive cascade compounds latency
&lt;/h2&gt;

&lt;p&gt;If opus takes 8s to timeout and sonnet takes 5s, a naive cascade makes you wait 13s before reaching haiku. That's worse than just waiting for opus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: proactive p95 routing
&lt;/h3&gt;

&lt;p&gt;glide maintains a &lt;strong&gt;rolling window of observed TTFT values&lt;/strong&gt; per model (SQLite-backed, persists across restarts) and computes the p95 continuously. If a model's p95 already exceeds its budget, glide skips it without waiting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal day  -&amp;gt; opus p95=2s  -&amp;gt; serves in ~2s
Peak load   -&amp;gt; opus p95=11s -&amp;gt; skipped, sonnet serves in ~1.5s
Recovery    -&amp;gt; opus p95=3s  -&amp;gt; resumes automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No restarts. No config changes. No intervention.&lt;/p&gt;
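&lt;p&gt;A minimal version of that rolling window might look like this. The schema, class, and method names here are my illustration, not glide's actual code:&lt;/p&gt;

```python
import sqlite3
import time

class TTFTWindow:
    # Rolling window of observed TTFT samples with a p95 query,
    # backed by SQLite so samples survive restarts.
    def __init__(self, path=":memory:", max_samples=50):
        self.db = sqlite3.connect(path)
        self.max_samples = max_samples
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS ttft (model TEXT, value REAL, ts REAL)")

    def record(self, model, ttft_s):
        self.db.execute("INSERT INTO ttft VALUES (?, ?, ?)",
                        (model, ttft_s, time.time()))
        self.db.commit()

    def p95(self, model):
        # Nearest-rank p95 over the most recent samples only.
        rows = self.db.execute(
            "SELECT value FROM ttft WHERE model = ? ORDER BY ts DESC LIMIT ?",
            (model, self.max_samples)).fetchall()
        if not rows:
            return None  # cold start: no samples yet
        values = sorted(v for (v,) in rows)
        idx = min(len(values) - 1, int(0.95 * len(values)))
        return values[idx]
```

Returning `None` on cold start matters: the hedge logic treats an unknown model conservatively rather than assuming it is healthy.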




&lt;h2&gt;
  
  
  Second signal: TTT for extended thinking
&lt;/h2&gt;

&lt;p&gt;TTFT covers slow starts but misses a different failure: &lt;strong&gt;runaway extended thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude Opus with extended reasoning emits thinking tokens before any text. A request can get a fast TTFT (thinking starts immediately) but then spend 60 seconds in the reasoning phase. The user sees nothing the whole time.&lt;/p&gt;

&lt;p&gt;I added &lt;strong&gt;TTT (Time-to-Think)&lt;/strong&gt;: the elapsed time from the moment the thinking block starts until the first &lt;em&gt;text&lt;/em&gt; token arrives. Budget exceeded mid-think? Abort and cascade.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inline SSE parser, runs during the active stream
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_block_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ttt_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;     &lt;span class="c1"&gt;# start TTT clock
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;block_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ttt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ttt_start&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ttt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TTTTimeoutError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# cascade to next model
&lt;/span&gt;        &lt;span class="n"&gt;text_started&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;              &lt;span class="c1"&gt;# stream from here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tricky part: SSE events can span HTTP chunk boundaries, so you can't just parse chunk by chunk. I built a buffer that accumulates bytes, splits on &lt;code&gt;\n\n&lt;/code&gt;, and parses each complete event inline while still yielding raw chunks to the client.&lt;/p&gt;
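&lt;p&gt;The buffering boils down to a few lines (a simplified sketch, not glide's exact implementation):&lt;/p&gt;

```python
def iter_sse_events(chunks):
    # Reassemble complete SSE events from arbitrary HTTP chunk boundaries:
    # accumulate bytes, split on the blank-line event delimiter, and keep
    # any trailing partial event in the buffer for the next chunk.
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n\n" in buf:
            event, buf = buf.split(b"\n\n", 1)
            yield event.decode()

# An event split across two chunks is only yielded once it is complete.
chunks = [b"event: content_block_start\ndata: {\"ty", b"pe\": \"thinking\"}\n\n"]
events = list(iter_sse_events(chunks))
```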




&lt;h2&gt;
  
  
  Third: request hedging for borderline cases
&lt;/h2&gt;

&lt;p&gt;Proactive routing handles sustained load. But when a model is &lt;em&gt;trending&lt;/em&gt; slow, not yet over budget but elevated, you're still exposed on individual tail requests.&lt;/p&gt;

&lt;p&gt;This is the same problem Google solved in &lt;a href="https://research.google/pubs/the-tail-at-scale/" rel="noopener noreferrer"&gt;"The Tail at Scale" (2013)&lt;/a&gt;: send the same request to two replicas, use whichever responds first. I applied that idea across &lt;strong&gt;heterogeneous model tiers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But you don't want to double your API cost on every request. So glide computes a routing decision before each request using observed p95:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SOLO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;primary p95 &amp;lt; 80% of budget&lt;/td&gt;
&lt;td&gt;Fire only primary, it's healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HEDGE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;primary risky, backup healthy or cold&lt;/td&gt;
&lt;td&gt;Fire both, race on asyncio queue, stream winner, cancel loser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SKIP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;both risky&lt;/td&gt;
&lt;td&gt;Skip hedge entirely, go to sequential cascade&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_hedge_decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hedge_models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;p95_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p95_1&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hedge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# cold start, hedge conservatively
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p95_1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;budget_1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# healthy, no cost wasted
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p95_2&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;p95_2&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;budget_2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# both slow, sequential is better
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hedge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# first risky, second healthy, race them
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 80% threshold catches the trend before models actually start failing individual requests.&lt;/p&gt;

&lt;p&gt;When a hedge fires, the losing task gets &lt;code&gt;task.cancel()&lt;/code&gt;, which propagates through httpx's &lt;code&gt;async with client.stream()&lt;/code&gt; context manager and closes the upstream HTTP connection immediately. No resource leaks.&lt;/p&gt;
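&lt;p&gt;The race itself can be sketched with plain asyncio. This is a toy version that awaits whole responses; glide streams the winner's chunks through a queue instead:&lt;/p&gt;

```python
import asyncio

async def hedged_race(primary, backup):
    # Race two model calls: keep whichever finishes first, cancel the other.
    tasks = {asyncio.create_task(primary), asyncio.create_task(backup)}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # closes the loser's upstream connection
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

async def fake_model(name, delay_s):
    # Stand-in for a streaming API call with a given latency.
    await asyncio.sleep(delay_s)
    return name

winner = asyncio.run(hedged_race(fake_model("opus", 0.2),
                                 fake_model("sonnet", 0.01)))
```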




&lt;h2&gt;
  
  
  Provider-agnostic
&lt;/h2&gt;

&lt;p&gt;All cascade providers yield Anthropic SSE internally. glide converts at the edge for each provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; uses &lt;code&gt;anthropic_to_openai()&lt;/code&gt; for the request body and &lt;code&gt;stream_openai_as_anthropic()&lt;/code&gt; for the response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; uses &lt;code&gt;anthropic_to_gemini()&lt;/code&gt; and &lt;code&gt;stream_gemini_as_anthropic()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; already streams JSON, which glide wraps into Anthropic SSE&lt;/li&gt;
&lt;/ul&gt;
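&lt;p&gt;As a rough illustration of what such a converter does: &lt;code&gt;anthropic_to_openai&lt;/code&gt; is the name used above, but the field handling shown here is my simplified assumption, covering only the system prompt, messages, and token limit:&lt;/p&gt;

```python
def anthropic_to_openai(body):
    # Map an Anthropic /v1/messages body to an OpenAI /v1/chat/completions
    # body. Anthropic keeps the system prompt as a top-level field; OpenAI
    # expects it as the first message. (Illustrative sketch only.)
    messages = []
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body.get("messages", []))
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
    }

converted = anthropic_to_openai({
    "model": "gpt-4o",
    "system": "Be terse.",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,
})
```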

&lt;p&gt;Mix providers freely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CASCADE_JSON&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[
  {"provider": "anthropic", "model": "claude-opus-4-6",   "ttft_budget": 4.0},
  {"provider": "openai",    "model": "gpt-4o",            "ttft_budget": 5.0},
  {"provider": "google",    "model": "gemini-2.0-flash",  "ttft_budget": 3.0},
  {"provider": "ollama",    "model": "qwen2.5:14b",       "ttft_budget": null}
]'&lt;/span&gt;
glide start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;glide accepts both &lt;code&gt;POST /v1/messages&lt;/code&gt; (Anthropic) and &lt;code&gt;POST /v1/chat/completions&lt;/code&gt; (OpenAI), and returns the matching format automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://127.0.0.1:8743/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glide_requests_total 42.0
glide_hedge_decision_total{decision="solo"} 30.0
glide_hedge_decision_total{decision="hedge"} 10.0
glide_hedge_decision_total{decision="skip"} 2.0
glide_hedge_winner_total{model="claude-sonnet-4-6"} 8.0
glide_ttft_p95_seconds{model="claude-opus-4-6"} 3.82
glide_ttft_p95_seconds{model="claude-sonnet-4-6"} 0.41
glide_ttft_samples_total{model="claude-opus-4-6"} 20.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard Prometheus text format, formatted by hand with no extra dependencies. Plug it into Grafana or scrape it directly.&lt;/p&gt;
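&lt;p&gt;Hand-formatting that exposition format takes only a few lines (an illustrative sketch, not glide's exact code):&lt;/p&gt;

```python
def format_prometheus(metrics):
    # Render (name, labels_dict, value) entries in Prometheus text
    # exposition format, one 'name{k="v"} value' line per entry.
    lines = []
    for name, labels, value in metrics:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

text = format_prometheus([
    ("glide_requests_total", {}, 42.0),
    ("glide_hedge_decision_total", {"decision": "solo"}, 30.0),
])
```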




&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;I'm calling this the &lt;strong&gt;LLM Request Cascade Pattern&lt;/strong&gt;, a reliability primitive with three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Budget-based streaming abort&lt;/strong&gt; - TTFT and TTT as actionable in-stream health signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive p95 routing&lt;/strong&gt; - skip models whose recent observed p95 exceeds their budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive hedging&lt;/strong&gt; - race models when borderline slow, not on every request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It sits alongside two existing patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker&lt;/strong&gt; (binary up/down) handled by &lt;a href="https://github.com/phanisaimunipalli/llm-circuit" rel="noopener noreferrer"&gt;llm-circuit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; (identical replicas) not applicable to heterogeneous model tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cascade is specifically for the heterogeneous LLM ecosystem: different models with different quality/speed/cost tradeoffs, where you want to route to the best option that can actually respond in time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;glide
glide start
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://127.0.0.1:8743
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with Claude Code, Cursor, code_puppy, or anything using the Anthropic or OpenAI API.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/phanisaimunipalli/glide" rel="noopener noreferrer"&gt;https://github.com/phanisaimunipalli/glide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern docs:&lt;/strong&gt; &lt;a href="https://github.com/phanisaimunipalli/glide/blob/main/docs/the-cascade-pattern.md" rel="noopener noreferrer"&gt;https://github.com/phanisaimunipalli/glide/blob/main/docs/the-cascade-pattern.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HN thread:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=47285435" rel="noopener noreferrer"&gt;https://news.ycombinator.com/item?id=47285435&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;22 tests, MIT license. I'd love feedback, especially on the mid-stream SSE abort implementation and the hedge trigger thresholds.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
