Manoj Krishna Mohan

I built a Rust entropy monitor to route LLM inference — here's what the benchmark showed

Frontier LLM inference is expensive. I wanted to see how far a 4B local model could go before needing a cloud call — and when the cloud call actually adds value.

The result is Buddy System: a tiered inference architecture where a Rust entropy monitor watches per-token uncertainty during local generation and routes to Sonnet only when the local model is genuinely stuck. (I know Anthropic has an advisor pattern, but this works differently.)

GitHub: https://github.com/Manojython/buddy-system

How it works

Gemma 3 4B generates locally on Apple Silicon via MLX. A Rust EntropyMonitor (compiled as a PyO3 extension) computes Shannon entropy over the full token vocabulary on every generated token:

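// bridge/src/entropy.rs
/// Shannon entropy (in nats) of the softmax distribution over `logits`.
pub fn token_entropy(&self, logits: &[f32]) -> f32 {
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum();
    logits.iter()
        .map(|&l| {
            let p = ((l - max).exp()) / exp_sum;
            // Skip p == 0 (underflow) so that 0 * ln(0) contributes 0 rather than NaN.
            if p > 0.0 { -p * p.ln() } else { 0.0 }
        })
        .sum()
}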
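For readers who want to sanity-check the numbers, here is a minimal pure-Python sketch of the same calculation. It is not code from the repository; it only mirrors the softmax-then-entropy steps shown above, and it assumes the 0.8 threshold is compared directly against this value in nats.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    if not logits:
        return 0.0
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / total
        if p > 0.0:  # skip underflowed probabilities
            entropy -= p * math.log(p)
    return entropy


THRESHOLD = 0.8  # value quoted in the post; unit assumed to be nats

# A uniform distribution over 4 tokens has entropy ln(4), above the threshold.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
# One dominant logit gives entropy close to 0, below the threshold.
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
```

On a real vocabulary the ceiling is ln(V) for V tokens, so 0.8 nats is a fairly low bar: roughly the uncertainty of a near-even choice between two or three candidate tokens.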

At high-entropy clause boundaries (threshold: 0.8), spaCy NER identifies what the model is uncertain about — the specific named entity or noun chunk, not just "confidence is low":

# frugal/uncertainty.py
for ent in doc.ents:
    e = _span_entropy(ent.start_char, ent.end_char)
    if e > best_entropy:
        best_entropy = e
        best_text = ent.text
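The loop above depends on names that are not shown in the excerpt (`doc`, `_span_entropy`, the initial values). As a rough, self-contained illustration of the idea, the sketch below picks the highest-entropy span from a list of character ranges. Everything here is hypothetical: `TokenSpan`, `span_entropy`, and `most_uncertain_span` are illustrative names, the spans are plain tuples standing in for spaCy entities, and averaging token entropies over a span is an assumption about what `_span_entropy` might do.

```python
from dataclasses import dataclass


@dataclass
class TokenSpan:
    start_char: int   # character offset where the generated token starts
    end_char: int     # character offset where it ends (exclusive)
    entropy: float    # per-token entropy recorded during generation


def span_entropy(tokens, start_char, end_char):
    """Mean entropy of tokens overlapping [start_char, end_char) (assumed definition)."""
    hits = [t.entropy for t in tokens
            if t.start_char < end_char and t.end_char > start_char]
    return sum(hits) / len(hits) if hits else 0.0


def most_uncertain_span(text, tokens, spans):
    """Return (span_text, entropy) for the span with the highest entropy."""
    best_entropy, best_text = 0.0, None
    for start, end in spans:
        e = span_entropy(tokens, start, end)
        if e > best_entropy:
            best_entropy, best_text = e, text[start:end]
    return best_text, best_entropy


text = "Paris is the capital of Austria"
tokens = [
    TokenSpan(0, 5, 0.2),    # "Paris"
    TokenSpan(6, 8, 0.1),    # "is"
    TokenSpan(9, 12, 0.1),   # "the"
    TokenSpan(13, 20, 0.1),  # "capital"
    TokenSpan(21, 23, 0.1),  # "of"
    TokenSpan(24, 31, 1.4),  # "Austria"
]
entity_spans = [(0, 5), (24, 31)]  # stand-ins for doc.ents
result = most_uncertain_span(text, tokens, entity_spans)
```

Here the model was confident about "Paris" and unsure about "Austria", so the second entity is the one that would be sent for verification.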

A sentence-transformers retriever finds the relevant passage chunk. Sonnet gets a targeted query: the uncertain fact + the grounding document. All cloud calls fire async after local generation completes — generation never blocks on the API.
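The "generate first, verify afterwards" ordering can be sketched with `asyncio`. This is an illustration under stated assumptions, not the project's code: `ask_cloud` is a stub standing in for the Sonnet call, and `generate_local`, `find_uncertain`, and `retrieve` are hypothetical callables representing the local model, the entropy/NER step, and the retriever.

```python
import asyncio


async def ask_cloud(query, passage):
    """Stub for a cloud verification call (a real one would hit an API)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"verified: {query}"


async def answer(prompt, generate_local, find_uncertain, retrieve):
    # 1. Local generation runs to completion; nothing here waits on the network.
    draft = generate_local(prompt)
    # 2. Only afterwards are the uncertain facts collected and sent out,
    #    each paired with its retrieved grounding passage.
    queries = find_uncertain(draft)
    tasks = [ask_cloud(q, retrieve(q)) for q in queries]
    # 3. All cloud calls run concurrently.
    checks = await asyncio.gather(*tasks)
    return draft, list(checks)


draft, checks = asyncio.run(answer(
    "Who wrote Dune?",
    generate_local=lambda p: "Dune was written by Frank Herbert.",
    find_uncertain=lambda d: ["Frank Herbert"],
    retrieve=lambda q: "Dune is a 1965 novel by Frank Herbert.",
))
```

If no span crosses the threshold, `find_uncertain` returns an empty list and the function makes no cloud calls at all.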

Classical tools (math, dates, units) sit between local and cloud, handling deterministic answers at zero cost.
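A minimal sketch of that middle tier, assuming the order is: accept the local answer when entropy is low, otherwise try a deterministic tool, and only then fall back to the cloud. The names `try_arithmetic` and `route` are hypothetical, and only plain arithmetic is covered here (dates and units are left out).

```python
import ast
import operator

# Operators the arithmetic tool is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def try_arithmetic(expr):
    """Evaluate a plain arithmetic expression, or return None if it is not one."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("not plain arithmetic")

    try:
        return ev(ast.parse(expr, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None


def route(uncertain_query, local_entropy, threshold=0.8):
    """Pick a tier: local if confident, then a free tool, then the cloud."""
    if local_entropy <= threshold:
        return ("local", None)
    result = try_arithmetic(uncertain_query)
    if result is not None:
        return ("tool", result)
    return ("cloud", None)
```

For example, `route("12 * (3 + 4)", 1.2)` is answered by the tool tier, while `route("capital of Austria", 1.2)` falls through to the cloud.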

Benchmark results

Three conditions, seven Hugging Face datasets, 140 samples in total:

| Condition | Accuracy | Cost |
| --- | --- | --- |
| Local only (Gemma 3 4B) | 70.7% | $0.00 |
| Buddy System (Gemma + Sonnet) | 71.4% | $0.21 |
| Advisor pattern (Haiku → Opus) | 62.9% | $0.44 |

Per-dataset:

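| Dataset | Local | Buddy | Advisor |
| --- | --- | --- | --- |
| AG News | 75% | 75% | 75% |
| WikiANN | 60% | 60% | 70% |
| STS-B | 30% | 30% | 30% |
| SST-2 | 90% | 90% | 95% |
| GSM8K | 75% | 80% | 55% |
| SQuAD v2 | 90% | 90% | 60% |
| HotpotQA | 75% | 75% | 55% |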
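As a quick consistency check, the headline accuracies match the unweighted mean of the per-dataset figures. The snippet below only redoes that arithmetic from the numbers reported above. That the mean is unweighted suggests an equal number of samples per dataset (140 / 7 = 20), though that is an inference and not something the post states.

```python
# Per-dataset accuracy in percent: (local, buddy, advisor), copied from the table.
per_dataset = {
    "AG News":  (75, 75, 75),
    "WikiANN":  (60, 60, 70),
    "STS-B":    (30, 30, 30),
    "SST-2":    (90, 90, 95),
    "GSM8K":    (75, 80, 55),
    "SQuAD v2": (90, 90, 60),
    "HotpotQA": (75, 75, 55),
}


def macro_average(column):
    """Unweighted mean of one column, rounded to one decimal place."""
    values = [row[column] for row in per_dataset.values()]
    return round(sum(values) / len(values), 1)


local, buddy, advisor = (macro_average(i) for i in range(3))
# local == 70.7, buddy == 71.4, advisor == 62.9
```

The 0.7-point gap between local-only and Buddy System comes entirely from GSM8K (75% to 80%); every other dataset is unchanged between those two conditions.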

The interesting finding

The Advisor pattern (Haiku generates → Opus reviews unconditionally) dropped 30pp on SQuAD v2 and 20pp on HotpotQA compared to local-only. The mechanism: the review step receives Haiku's answer but not the source document. Opus corrects from parametric memory, not from the passage.

It's not a model capability problem. It's what context the review tier receives. Pass the document to the reviewer and the result changes — which is exactly what the Buddy System does via the retrieval step.
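To make the difference concrete, here is a hypothetical sketch of the two review prompts. `build_review_prompt` and its wording are illustrative assumptions, not the prompts used in the benchmark; the only point is that one variant carries the source passage and the other does not.

```python
def build_review_prompt(question, draft_answer, passage=None):
    """Assemble a reviewer prompt, with or without the grounding passage."""
    parts = [f"Question: {question}", f"Draft answer: {draft_answer}"]
    if passage is not None:
        # Grounded review: the reviewer can check the draft against the text.
        parts.insert(0, f"Source passage:\n{passage}")
        parts.append("Correct the draft only if the passage contradicts it.")
    else:
        # Ungrounded review: the reviewer can only rely on what it remembers.
        parts.append("Correct the draft if it is wrong.")
    return "\n\n".join(parts)


ungrounded = build_review_prompt("Who wrote Dune?", "Frank Herbert")
grounded = build_review_prompt(
    "Who wrote Dune?",
    "Frank Herbert",
    passage="Dune is a 1965 novel by Frank Herbert.",
)
```

On extractive datasets such as SQuAD v2, where the expected answer is defined by the passage, the ungrounded variant gives the reviewer no way to tell a correct extraction from a plausible-sounding mistake.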
