<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manoj Krishna Mohan</title>
    <description>The latest articles on DEV Community by Manoj Krishna Mohan (@manoj_krishna_f13c6).</description>
    <link>https://dev.to/manoj_krishna_f13c6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3998021%2F8b7cfcf9-f506-42be-bcd9-a10fd68b792f.webp</url>
      <title>DEV Community: Manoj Krishna Mohan</title>
      <link>https://dev.to/manoj_krishna_f13c6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manoj_krishna_f13c6"/>
    <language>en</language>
    <item>
      <title>I built a Rust entropy monitor to route LLM inference — here's what the benchmark showed</title>
      <dc:creator>Manoj Krishna Mohan</dc:creator>
      <pubDate>Tue, 23 Jun 2026 05:43:36 +0000</pubDate>
      <link>https://dev.to/manoj_krishna_f13c6/i-built-a-rust-entropy-monitor-to-route-llm-inference-heres-what-the-benchmark-showed-4b7d</link>
      <guid>https://dev.to/manoj_krishna_f13c6/i-built-a-rust-entropy-monitor-to-route-llm-inference-heres-what-the-benchmark-showed-4b7d</guid>
      <description>&lt;p&gt;Frontier LLM inference is expensive. I wanted to see how far a 4B local model could go before needing a cloud call — and when the cloud call actually adds value.&lt;/p&gt;

&lt;p&gt;The result is Buddy System: a tiered inference architecture where a Rust entropy monitor watches per-token uncertainty during local generation and routes to Sonnet only when the local model is genuinely stuck. (I know Anthropic has the advisor system, but this is different)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Manojython/buddy-system" rel="noopener noreferrer"&gt;https://github.com/Manojython/buddy-system&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Gemma 3 4B generates locally on Apple Silicon via MLX. A Rust &lt;code&gt;EntropyMonitor&lt;/code&gt; (compiled as a PyO3 extension) computes Shannon entropy over the full token vocabulary on every generated token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// bridge/src/entropy.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;token_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.cloned&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.fold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;NEG_INFINITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;exp_sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.exp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.exp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;exp_sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="nf"&gt;.ln&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At high-entropy clause boundaries (threshold: 0.8), spaCy NER identifies what the model is uncertain about — the specific named entity or noun chunk, not just "confidence is low":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frugal/uncertainty.py
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_span_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_entropy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;best_entropy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
        &lt;span class="n"&gt;best_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A sentence-transformers retriever finds the relevant passage chunk. Sonnet gets a targeted query: the uncertain fact + the grounding document. All cloud calls fire async after local generation completes — generation never blocks on the API.&lt;/p&gt;

&lt;p&gt;Classical tools (math, dates, units) sit between local and cloud, handling deterministic answers at zero cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results
&lt;/h2&gt;

&lt;p&gt;3 conditions, 7 HuggingFace datasets, 140 total samples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local only (Gemma 3 4B)&lt;/td&gt;
&lt;td&gt;70.7%&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buddy System (Gemma + Sonnet)&lt;/td&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advisor pattern (Haiku → Opus)&lt;/td&gt;
&lt;td&gt;62.9%&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;th&gt;Buddy&lt;/th&gt;
&lt;th&gt;Advisor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AG News&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WikiANN&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STS-B&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SST-2&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GSM8K&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQuAD v2&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HotpotQA&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The interesting finding
&lt;/h2&gt;

&lt;p&gt;The Advisor pattern (Haiku generates → Opus reviews unconditionally) dropped 30pp on SQuAD v2 and 20pp on HotpotQA compared to local-only. The mechanism: the review step receives Haiku's answer but not the source document. Opus corrects from parametric memory, not from the passage.&lt;/p&gt;

&lt;p&gt;It's not a model capability problem. It's what context the review tier receives. Pass the document to the reviewer and the result changes — which is exactly what the Buddy System does via the retrieval step.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>rust</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
