<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ravil Minigulov</title>
    <description>The latest articles on DEV Community by Ravil Minigulov (@lokyfour).</description>
    <link>https://dev.to/lokyfour</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908658%2F1ada4c10-339f-4211-8e2d-93d8fc0e476a.jpg</url>
      <title>DEV Community: Ravil Minigulov</title>
      <link>https://dev.to/lokyfour</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lokyfour"/>
    <language>en</language>
    <item>
      <title>Hybrid LLM Routing: Ollama + Claude API Without Quality Degradation</title>
      <dc:creator>Ravil Minigulov</dc:creator>
      <pubDate>Sat, 02 May 2026 08:57:20 +0000</pubDate>
      <link>https://dev.to/lokyfour/hybrid-llm-routing-ollama-claude-api-without-quality-degradation-5e5b</link>
      <guid>https://dev.to/lokyfour/hybrid-llm-routing-ollama-claude-api-without-quality-degradation-5e5b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpici442efroqxwsr4psl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpici442efroqxwsr4psl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bill arrives at the end of the month&lt;/strong&gt;&lt;br&gt;
You ship a bot. Claude responds well, the client is happy. The first month goes by quietly. Then you open Anthropic billing: $200+ for traffic from a small café.&lt;br&gt;
You dig into the logs. 60,000 requests over a month. "Are you open on Sundays?", "What's your address?", "Is delivery free?" — thousands of times. Every single one routed through Claude Sonnet with a 400-token system prompt.&lt;br&gt;
This isn't a model cost problem. It's an architecture problem: a uniform model serving fundamentally non-uniform load.&lt;br&gt;
Request complexity in a business bot isn't normally distributed — it's bimodal. A long tail of FAQ requests where Claude's power is completely wasted, and a narrow spike of complaints, edge cases, and generation tasks where it's actually needed. If you don't split these flows, you're paying for cloud inference where a local model would have been fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why "just use Ollama" doesn't work&lt;/strong&gt;&lt;br&gt;
The obvious fix: move everything to Ollama. Models like llama3.1:8b or mistral:7b on a GPU give acceptable quality for simple tasks at zero variable cost.&lt;br&gt;
The problem is that open-source models degrade in specific scenarios: long context (&amp;gt;3K tokens), strict output format requirements, multi-step reasoning. In a bot with RAG, these come up regularly. Moving everything to Ollama means unpredictable quality exactly where the client will notice.&lt;br&gt;
The other take — "only pay Claude for complex requests" — is directionally right, but what counts as "complex"? Without a formal classifier, this turns into manually maintained conditionals in code that don't scale and break with every traffic shift.&lt;br&gt;
You need a router: a component that decides which model handles the request before it goes anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture: one interface, two tiers&lt;/strong&gt;&lt;br&gt;
The core requirement: the router must be invisible from the outside. From the FastAPI endpoint's perspective, there's a single llm_client.complete() that always returns a response. Where the request went is an implementation detail.&lt;br&gt;
There's no load balancing between Ollama and Claude — there's a hierarchy. Ollama is the first tier, Claude is escalation. Escalation happens in three cases: the router decided so, Ollama returned an invalid response, or Ollama is unavailable.&lt;/p&gt;
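
&lt;p&gt;Roughly, the contract and the escalation path look like the sketch below. The class names, the Ollama endpoint, and the Claude model string are placeholders, and the router itself is sketched further down; treat this as a shape, not a drop-in implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum

import httpx                          # assumption: httpx talks to Ollama's HTTP API
from anthropic import AsyncAnthropic  # official Anthropic SDK

class ModelTarget(Enum):
    LOCAL = "ollama"
    CLOUD = "claude"

OLLAMA_URL = "http://localhost:11434"
CLAUDE_MODEL = "claude-sonnet-4-5"    # placeholder: pin whatever Sonnet version you actually use

class LLMClient:
    """One entry point; callers never learn which tier answered."""

    def __init__(self, router):
        self.router = router            # anything with a .route(prompt, context) method
        self.claude = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

    async def complete(self, prompt: str, context: str = "") -&amp;gt; str:
        target = self.router.route(prompt, context)
        if target is ModelTarget.LOCAL:
            try:
                reply = await self._ollama(prompt, context)
                if reply:                  # empty or unusable answer: escalate
                    return reply
            except httpx.HTTPError:
                pass                       # Ollama unavailable: escalate
        return await self._claude(prompt, context)

    async def _ollama(self, prompt: str, context: str) -&amp;gt; str:
        async with httpx.AsyncClient(timeout=30) as client:
            r = await client.post(f"{OLLAMA_URL}/api/generate", json={
                "model": "llama3.1:8b",
                "prompt": f"{context}\n\n{prompt}",
                "stream": False,
            })
            r.raise_for_status()
            return r.json().get("response", "").strip()

    async def _claude(self, prompt: str, context: str) -&amp;gt; str:
        msg = await self.claude.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=context or "You are a helpful assistant.",
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;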

&lt;p&gt;&lt;strong&gt;The router: asymmetry of error cost&lt;/strong&gt;&lt;br&gt;
The router isn't a binary "simple/complex" classifier. The correct framing: minimize the expected cost of a routing error.&lt;br&gt;
Error toward Ollama for a complex request: quality degradation, retry, potentially a broken conversation. In B2B — real business consequences for the client.&lt;br&gt;
Error toward Claude for a simple request: a few cents of overspending.&lt;br&gt;
The asymmetry is obvious. It produces a concrete rule: when in doubt, go to the cloud. This isn't being conservative — it's correctly accounting for the real cost of each error type.&lt;br&gt;
Decision logic is two-layered.&lt;br&gt;
Hard rules fire first and override any scoring. Complaints, legal context, generation tasks — always Claude. A clean request for opening hours or an address — always Ollama.&lt;br&gt;
Soft scoring kicks in when hard rules don't fire. Factors: RAG context volume, format requirements, message length, the number of consecutive clarifying questions in the dialog (a rising count signals that previous answers weren't solving the problem).&lt;br&gt;
The routing threshold is deliberately shifted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ModelTarget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLOUD&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;ModelTarget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCAL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;confidence &amp;lt; 0.6 — if the router isn't confident enough in its classification, the request goes to Claude. Explicit codification of the asymmetry.&lt;/p&gt;
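
&lt;p&gt;Putting the two layers together, a compact version of the router looks roughly like this. The keyword lists, weights, and the confidence heuristic below are illustrative stand-ins to be calibrated on real traffic; only the final threshold comes from the snippet above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum

class ModelTarget(Enum):
    LOCAL = "ollama"
    CLOUD = "claude"

@dataclass
class RoutingSignal:
    score: float        # 0.0 = trivially simple, 1.0 = clearly complex
    confidence: float   # how sure the scorer is about its own estimate

# Layer 1: hard rules fire first and override any scoring
FORCE_CLOUD = ("complaint", "refund", "lawyer", "legal", "write me", "draft")
FORCE_LOCAL = ("opening hours", "address", "delivery", "how much")

def route(message: str, rag_tokens: int, needs_json: bool, clarifying_turns: int) -&amp;gt; ModelTarget:
    text = message.lower()
    if any(k in text for k in FORCE_CLOUD):
        return ModelTarget.CLOUD
    if any(k in text for k in FORCE_LOCAL) and rag_tokens &amp;lt; 500:
        return ModelTarget.LOCAL

    # Layer 2: soft scoring over cheap signals, weighted and summed
    score = 0.0
    score += 0.30 if rag_tokens &amp;gt; 2000 else 0.0       # heavy RAG context
    score += 0.25 if needs_json else 0.0                 # strict output format
    score += 0.20 if len(message) &amp;gt; 400 else 0.0       # long, involved message
    score += 0.20 * min(clarifying_turns, 3) / 3         # user keeps re-asking
    # confidence drops in the fuzzy middle of the score range
    confidence = 0.9 if score &amp;lt; 0.2 or score &amp;gt; 0.5 else 0.5

    signal = RoutingSignal(score=score, confidence=confidence)
    return (
        ModelTarget.CLOUD
        if signal.score &amp;gt; 0.35 or signal.confidence &amp;lt; 0.6
        else ModelTarget.LOCAL
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;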

&lt;p&gt;&lt;strong&gt;Three things that break in production&lt;/strong&gt;&lt;br&gt;
Ollama's formatted output. Even with an explicit instruction to return JSON, llama3.1:8b periodically wraps it in a markdown code block or adds surrounding text. In production this isn't an edge case; it's a regular scenario. Solution: parsing with multiple fallback patterns, and after two failed attempts, automatic escalation to Claude. Not three retries, not four: a second retry on Ollama is slower than a single Claude call.&lt;/p&gt;
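
&lt;p&gt;One workable shape for that fallback chain is below. The regexes are illustrative, the two-attempt limit mirrors the rule above, and ollama_call/claude_call stand in for whatever client functions already exist in your code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def extract_json(raw: str):
    """Try increasingly forgiving patterns before giving up."""
    candidates = [
        raw,                                                   # best case: bare JSON
        *re.findall(r"```(?:json)?\s*(.*?)```", raw, re.S),    # JSON inside a markdown fence
        *re.findall(r"\{.*\}", raw, re.S),                     # first-to-last brace blob in prose
    ]
    for cand in candidates:
        try:
            return json.loads(cand.strip())
        except json.JSONDecodeError:
            continue
    return None

async def local_then_escalate(ollama_call, claude_call, prompt: str):
    # two attempts on Ollama, no more: a third local retry is slower than one Claude call
    for _ in range(2):
        parsed = extract_json(await ollama_call(prompt))
        if parsed is not None:
            return parsed
    return extract_json(await claude_call(prompt))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;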
&lt;p&gt;Context window under load. Ollama allocates num_ctx on the first request to a model and doesn't adjust it dynamically within a session. If the service started with the default num_ctx=2048 and a request arrives with 3,500 tokens of RAG context, the context gets silently truncated. No error, just a response about nothing. num_ctx must be passed explicitly on every request, with headroom above the actual volume.&lt;/p&gt;
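
&lt;p&gt;To avoid the silent truncation, size num_ctx from the actual payload on every call. In the sketch below, the 4-characters-per-token estimate and the headroom factor are rough assumptions, not measured values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

def estimated_tokens(text: str) -&amp;gt; int:
    # crude heuristic: ~4 characters per token; good enough for sizing, not for billing
    return len(text) // 4

async def ollama_generate(prompt: str, rag_context: str) -&amp;gt; str:
    full_prompt = f"{rag_context}\n\n{prompt}"
    # headroom above the measured volume so the window is never silently exceeded
    num_ctx = max(4096, int(estimated_tokens(full_prompt) * 1.3) + 512)
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post("http://localhost:11434/api/generate", json={
            "model": "llama3.1:8b",
            "prompt": full_prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx},   # set per request, never rely on the default
        })
        r.raise_for_status()
        return r.json()["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;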
&lt;p&gt;Latency degradation during spikes. On a single GPU, Ollama doesn't parallelize requests; it queues them. During sudden traffic spikes, p95 latency grows linearly with queue depth, the router doesn't know this, and keeps routing locally. You need a circuit breaker on latency, not just errors: when the p95 threshold is exceeded, all traffic temporarily goes to Claude regardless of classification. This needs to be a separate component; don't add the condition into the router logic, or the breaker's state gets tangled up with classification.&lt;/p&gt;
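
&lt;p&gt;The breaker as its own small component, outside the router. The window size, the 8-second p95 limit, and the cooldown below are made-up starting points to tune against real traffic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from collections import deque

class LatencyBreaker:
    """Trips when local p95 exceeds a limit; recovers after a cooldown."""

    def __init__(self, p95_limit_s: float = 8.0, window: int = 50, cooldown_s: float = 120):
        self.p95_limit_s = p95_limit_s
        self.cooldown_s = cooldown_s
        self.samples = deque(maxlen=window)
        self.tripped_at = None

    def record(self, duration_s: float) -&amp;gt; None:
        self.samples.append(duration_s)
        if len(self.samples) &amp;gt;= 20 and self._p95() &amp;gt; self.p95_limit_s:
            self.tripped_at = time.monotonic()

    def allow_local(self) -&amp;gt; bool:
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at &amp;gt; self.cooldown_s:
            self.tripped_at = None     # half-open: let local traffic try again
            return True
        return False

    def _p95(self) -&amp;gt; float:
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

# usage: the breaker sits in front of the router's decision, not inside it
# target = router.route(...) if breaker.allow_local() else ModelTarget.CLOUD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;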

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;br&gt;
Without proper logging, the system is opaque: you see costs but don't understand what's driving them.&lt;br&gt;
The key is logging not just routed_to, but also actual_model. These fields diverge during escalation. Escalation frequency is the primary health metric for the router: if it's growing, either the traffic pattern changed, the local model degraded, or the thresholds need recalibration.&lt;br&gt;
The second important signal is a proxy quality metric. Not manual response labeling — downstream behavior: if a user asks a follow-up question within two minutes of a response, the first answer probably didn't solve the problem. Measurable with zero additional infrastructure.&lt;/p&gt;
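
&lt;p&gt;A minimal shape for that log record plus the follow-up heuristic. The field names are an assumption, not a standard; the two-minute window is the one from the paragraph above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import logging
import time

log = logging.getLogger("llm.routing")

def log_completion(request_id: str, routed_to: str, actual_model: str,
                   latency_s: float, user_id: str) -&amp;gt; None:
    # routed_to and actual_model diverge exactly when escalation happened
    log.info(json.dumps({
        "request_id": request_id,
        "routed_to": routed_to,          # what the router decided
        "actual_model": actual_model,    # what finally answered
        "escalated": routed_to != actual_model,
        "latency_s": round(latency_s, 3),
        "user_id": user_id,
        "ts": time.time(),
    }))

# proxy quality signal: a follow-up from the same user within two minutes
# suggests the previous answer didn't solve the problem
FOLLOW_UP_WINDOW_S = 120

def is_follow_up(prev_message_ts: float, now: float) -&amp;gt; bool:
    return (now - prev_message_ts) &amp;lt; FOLLOW_UP_WINDOW_S
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;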

&lt;p&gt;&lt;strong&gt;The numbers&lt;/strong&gt;&lt;br&gt;
Real case: a Telegram bot for a café, one month of observation after rolling out the router.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Request type&lt;/th&gt;&lt;th&gt;Traffic share&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;FAQ, hours, address, prices&lt;/td&gt;&lt;td&gt;61%&lt;/td&gt;&lt;td&gt;Ollama&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Menu clarifications, ingredients&lt;/td&gt;&lt;td&gt;18%&lt;/td&gt;&lt;td&gt;Ollama&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Edge cases, complaints&lt;/td&gt;&lt;td&gt;12%&lt;/td&gt;&lt;td&gt;Claude&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;RAG over documents, generation&lt;/td&gt;&lt;td&gt;9%&lt;/td&gt;&lt;td&gt;Claude&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Cost before: $234/month. After: $47/month. Quality, measured by client complaints, is unchanged: the scenarios that used to go to Claude still go to Claude.&lt;br&gt;
The 80% cost reduction isn't the goal of the architecture. It's a side effect of making request cost a function of complexity rather than a constant. The real gain: the system became legible. Now you can see what each interaction type costs and know exactly what to do about it when traffic grows.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>fastapi</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
