I built a Rust entropy monitor to route LLM inference — here's what the benchmark showed

Manoj Krishna Mohan — Tue, 23 Jun 2026 05:43:36 +0000

Frontier LLM inference is expensive. I wanted to see how far a 4B local model could go before needing a cloud call — and when the cloud call actually adds value.

The result is Buddy System: a tiered inference architecture where a Rust entropy monitor watches per-token uncertainty during local generation and routes to Sonnet only when the local model is genuinely stuck. (I know Anthropic has an advisor system, but this works differently.)

GitHub: https://github.com/Manojython/buddy-system

How it works

Gemma 3 4B generates locally on Apple Silicon via MLX. A Rust EntropyMonitor (compiled as a PyO3 extension) computes Shannon entropy over the full token vocabulary on every generated token:

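// bridge/src/entropy.rs
/// Shannon entropy (in nats) of the softmax distribution over `logits`.
pub fn token_entropy(&self, logits: &[f32]) -> f32 {
    // Subtract the max logit before exponentiating so exp() cannot overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum();
    logits.iter()
        .map(|&l| {
            let p = ((l - max).exp()) / exp_sum;
            // Treat 0 * ln(0) as 0 for probabilities that underflow.
            if p > 0.0 { -p * p.ln() } else { 0.0 }
        })
        .sum()
}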
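
For cross-checking the Rust function, here is a minimal Python sketch of the same calculation: a max-subtracted softmax followed by Shannon entropy with the natural log, so the result is in nats. The name `token_entropy_ref` is made up for this illustration and is not part of the repo.

```python
import math

def token_entropy_ref(logits):
    """Shannon entropy (in nats) of softmax(logits); mirrors the Rust version."""
    m = max(logits)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / total
        if p > 0.0:  # skip underflowed probabilities (0 * ln 0 is taken as 0)
            entropy -= p * math.log(p)
    return entropy

# Uniform over 4 tokens: entropy equals ln(4), the maximum for 4 outcomes.
uniform = token_entropy_ref([0.0] * 4)
# One dominant token: entropy is close to 0.
peaked = token_entropy_ref([10.0, 0.0, 0.0, 0.0])
```

A uniform distribution over N tokens gives ln N, and an even split between two candidate tokens gives ln 2 ≈ 0.69. If the 0.8 threshold is applied to this raw value (an assumption; the post does not say whether it is normalized), it fires once the model is slightly less sure than a coin flip between two tokens.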

At high-entropy clause boundaries (threshold: 0.8), spaCy NER identifies what the model is uncertain about — the specific named entity or noun chunk, not just "confidence is low":

# frugal/uncertainty.py
for ent in doc.ents:
    e = _span_entropy(ent.start_char, ent.end_char)
    if e > best_entropy:
        best_entropy = e
        best_text = ent.text
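
The excerpt relies on a `_span_entropy` helper whose body is not shown. One plausible reading, sketched below, averages the per-token entropies of the tokens that overlap a character span. The `TokenEntropy` layout, the mean aggregation, the use of the threshold as the starting bar, and the name `most_uncertain_span` are all assumptions for illustration. Plain tuples stand in for spaCy's `doc.ents` so the sketch runs without a language model installed.

```python
from typing import NamedTuple, Optional

class TokenEntropy(NamedTuple):
    start_char: int   # offset of the token's first character in the text
    end_char: int     # offset one past the token's last character
    entropy: float    # entropy recorded when this token was generated

def span_entropy(tokens, start_char, end_char):
    """Mean entropy of tokens overlapping [start_char, end_char) (assumed aggregation)."""
    hits = [t.entropy for t in tokens
            if t.start_char < end_char and t.end_char > start_char]
    return sum(hits) / len(hits) if hits else 0.0

def most_uncertain_span(tokens, spans, threshold=0.8):
    """Return the text of the highest-entropy span above threshold, else None."""
    best_text: Optional[str] = None
    best_entropy = threshold
    for start_char, end_char, text in spans:
        e = span_entropy(tokens, start_char, end_char)
        if e > best_entropy:
            best_entropy, best_text = e, text
    return best_text

# Illustrative text: "Paris hosted the 1924 Olympics", with made-up entropies
# in which the model was unsure about the year.
tokens = [
    TokenEntropy(0, 5, 0.1),    # Paris
    TokenEntropy(6, 12, 0.2),   # hosted
    TokenEntropy(13, 16, 0.1),  # the
    TokenEntropy(17, 21, 1.9),  # 1924
    TokenEntropy(22, 30, 0.3),  # Olympics
]
spans = [(0, 5, "Paris"), (17, 21, "1924")]
flagged = most_uncertain_span(tokens, spans)
```

Here `flagged` is the year rather than the city, which is the point of the design: the cloud query can ask about one specific fact instead of re-answering the whole question.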

A sentence-transformers retriever finds the relevant passage chunk. Sonnet then gets a targeted query: the uncertain fact plus the grounding document. All cloud calls fire asynchronously after local generation completes, so generation never blocks on the API.
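
The "generate first, call the cloud afterwards" ordering can be sketched with `asyncio`. Everything here is a stand-in: `ask_cloud` fakes the Sonnet request with a short sleep, and the flagged facts and passages are hard-coded. The repo's actual client code is not shown in this post.

```python
import asyncio

async def ask_cloud(uncertain_fact: str, passage: str) -> str:
    """Hypothetical stand-in for the cloud request; a real client call would go here."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"verified {uncertain_fact!r} against {len(passage)} chars of context"

async def resolve_flagged(flagged: list[tuple[str, str]]) -> list[str]:
    """Fire all cloud queries concurrently once local generation is done."""
    return await asyncio.gather(
        *(ask_cloud(fact, passage) for fact, passage in flagged)
    )

def run_pipeline() -> list[str]:
    # Step 1: local generation finishes first and only records what was uncertain.
    flagged = [
        ("1924", "Paris hosted the Summer Olympics in 1924."),
        ("Gemma", "Gemma is a family of open models."),
    ]
    # Step 2: the cloud calls run afterwards, concurrently with each other.
    return asyncio.run(resolve_flagged(flagged))

results = run_pipeline()
```

Because the queries are gathered after generation, the added latency is roughly that of the slowest single cloud call rather than the sum of all of them.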

Classical tools (math, dates, units) sit between local and cloud, handling deterministic answers at zero cost.
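
As a rough illustration of that middle tier, the sketch below answers plain arithmetic deterministically and returns `None` for anything else, so the caller can escalate. The function name `try_classical` and the choice of an `ast`-based evaluator are assumptions; the repo's actual tools for math, dates, and units may look quite different.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else falls through to the next tier.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def _eval(node):
    """Recursively evaluate a whitelisted arithmetic expression tree."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("not plain arithmetic")

def try_classical(query: str):
    """Return a deterministic answer for arithmetic queries, or None to escalate."""
    try:
        return _eval(ast.parse(query.strip(), mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None
```

Parsing into an AST and whitelisting node types avoids calling `eval` on model output, which matters because the query text comes from generated content.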

Benchmark results

Three conditions, seven Hugging Face datasets, 140 samples in total:

Condition	Accuracy	Cost
Local only (Gemma 3 4B)	70.7%	$0.00
Buddy System (Gemma + Sonnet)	71.4%	$0.21
Advisor pattern (Haiku → Opus)	62.9%	$0.44

Per-dataset:

Dataset	Local	Buddy	Advisor
AG News	75%	75%	75%
WikiANN	60%	60%	70%
STS-B	30%	30%	30%
SST-2	90%	90%	95%
GSM8K	75%	80%	55%
SQuAD v2	90%	90%	60%
HotpotQA	75%	75%	55%
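
The headline accuracies can be reproduced from the per-dataset table above. The sketch assumes the 140 samples are split evenly, 20 per dataset, which the post does not state explicitly but which makes the unweighted mean equal to overall accuracy.

```python
# Per-dataset accuracy (%) copied from the table above: (Local, Buddy, Advisor).
per_dataset = {
    "AG News":  (75, 75, 75),
    "WikiANN":  (60, 60, 70),
    "STS-B":    (30, 30, 30),
    "SST-2":    (90, 90, 95),
    "GSM8K":    (75, 80, 55),
    "SQuAD v2": (90, 90, 60),
    "HotpotQA": (75, 75, 55),
}
SAMPLES_PER_DATASET = 140 // len(per_dataset)  # assumption: even split

def column_mean(idx):
    """Unweighted mean accuracy for one condition across all datasets."""
    return sum(row[idx] for row in per_dataset.values()) / len(per_dataset)

local, buddy, advisor = (round(column_mean(i), 1) for i in range(3))

# Extra correct answers for Buddy over Local under the even-split assumption.
extra_correct = round(sum(
    (b - l) / 100 * SAMPLES_PER_DATASET for l, b, _ in per_dataset.values()
))
```

The means come out to 70.7, 71.4, and 62.9, matching the summary table. Under the even-split assumption, the 0.7pp gap between Local and Buddy corresponds to a single additional correct answer, on GSM8K.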

The interesting finding

The Advisor pattern (Haiku generates → Opus reviews unconditionally) dropped 30pp on SQuAD v2 and 20pp on HotpotQA compared to local-only. The mechanism: the review step receives Haiku's answer but not the source document. Opus corrects from parametric memory, not from the passage.

It's not a model capability problem. It's what context the review tier receives. Pass the document to the reviewer and the result changes — which is exactly what the Buddy System does via the retrieval step.
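
To make the difference concrete, here is a sketch of a reviewer prompt with and without the source passage. The function name and the prompt wording are hypothetical and are not taken from the repo or from either benchmark condition.

```python
from typing import Optional

def build_review_prompt(question: str, draft_answer: str,
                        passage: Optional[str] = None) -> str:
    """Assemble a reviewer prompt; include the source passage when one is available."""
    parts = []
    if passage:
        parts.append(f"Source passage:\n{passage}")
    parts.append(f"Question: {question}")
    parts.append(f"Draft answer: {draft_answer}")
    if passage:
        # Grounded review: the reviewer is told to rely on the passage only.
        parts.append("Check the draft against the passage only. "
                     "If the passage does not answer the question, say so.")
    else:
        # Ungrounded review: the reviewer can only fall back on what it remembers.
        parts.append("Check the draft answer.")
    return "\n\n".join(parts)

ungrounded = build_review_prompt("Who wrote the report?", "J. Smith")
grounded = build_review_prompt("Who wrote the report?", "J. Smith",
                               passage="The report was written by J. Smith.")
```

The ungrounded variant is the situation described above: with nothing but the question and the draft, the reviewer has no way to tell a passage-specific answer from a wrong one.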
