Frontier LLM inference is expensive. I wanted to see how far a 4B local model could go before needing a cloud call — and when the cloud call actually adds value.
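The result is Buddy System: a tiered inference architecture where a Rust entropy monitor watches per-token uncertainty during local generation and routes to Sonnet only when the local model is genuinely stuck. (I know Anthropic has the advisor pattern, but this is different: escalation here is conditional on measured uncertainty, rather than applied to every answer as in the Advisor condition benchmarked below.)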
GitHub: https://github.com/Manojython/buddy-system
How it works
Gemma 3 4B generates locally on Apple Silicon via MLX. A Rust EntropyMonitor (compiled as a PyO3 extension) computes Shannon entropy over the full token vocabulary on every generated token:
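```rust
// bridge/src/entropy.rs
/// Shannon entropy (in nats) of the softmax distribution over `logits`.
pub fn token_entropy(&self, logits: &[f32]) -> f32 {
    // Subtract the max logit before exponentiating, for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum();
    logits.iter()
        .map(|&l| {
            let p = ((l - max).exp()) / exp_sum;
            // Skip zero-probability tokens: p * ln(p) tends to 0 as p approaches 0.
            if p > 0.0 { -p * p.ln() } else { 0.0 }
        })
        .sum()
}
```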
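To get a feel for what the entropy values mean, here is a pure-Python sketch of the same calculation (a re-implementation for illustration, not the code in the repo). The example logits are made up. Because the Rust method uses the natural log, the result is in nats; whether the 0.8 threshold is applied directly to this raw value is my assumption.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), mirroring the Rust method."""
    if not logits:
        return 0.0
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy

# Illustrative 8-token "vocabularies" (made-up logits):
uniform = token_entropy([0.0] * 8)                  # maximally unsure: ln(8)
peaked = token_entropy([20.0] + [0.0] * 7)          # one dominant token: near 0
two_way = token_entropy([5.0, 5.0] + [-20.0] * 6)   # coin flip between two tokens: near ln(2)
```

Under that assumption, a clean two-way tie (about 0.69 nats) would sit just below a 0.8 threshold, so the monitor would only fire when probability mass is spread a bit wider than a coin flip.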
At high-entropy clause boundaries (threshold: 0.8), spaCy NER identifies what the model is uncertain about — the specific named entity or noun chunk, not just "confidence is low":
```python
# frugal/uncertainty.py
# Keep the named entity whose span has the highest entropy.
for ent in doc.ents:
    e = _span_entropy(ent.start_char, ent.end_char)
    if e > best_entropy:
        best_entropy = e
        best_text = ent.text
```
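The excerpt above does not show `_span_entropy`, so here is one plausible way it could work, as a sketch only: average the entropies of the generated tokens whose character offsets overlap the entity span. The `(start_char, end_char, entropy)` token layout, the extra `tokens` parameter, and the helper names are hypothetical, and plain tuples stand in for spaCy entities so the example runs without a model.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (start_char, end_char, entropy) for each generated token.
TokenRecord = Tuple[int, int, float]

def span_entropy(tokens: List[TokenRecord], start_char: int, end_char: int) -> float:
    """Mean entropy of tokens overlapping [start_char, end_char); 0.0 if none overlap."""
    hits = [h for (s, e, h) in tokens if s < end_char and e > start_char]
    return sum(hits) / len(hits) if hits else 0.0

def most_uncertain_span(
    tokens: List[TokenRecord], spans: List[Tuple[str, int, int]]
) -> Tuple[Optional[str], float]:
    """spans are (text, start_char, end_char) triples, e.g. read off doc.ents."""
    best_text, best_entropy = None, 0.0
    for text, start, end in spans:
        e = span_entropy(tokens, start, end)
        if e > best_entropy:
            best_entropy, best_text = e, text
    return best_text, best_entropy

# Made-up entropies for the text "Paris hosted the 1900 Olympics".
tokens = [(0, 5, 0.2), (6, 12, 0.1), (13, 16, 0.1), (17, 21, 1.4), (22, 30, 0.3)]
spans = [("Paris", 0, 5), ("1900", 17, 21)]
best_text, best_entropy = most_uncertain_span(tokens, spans)
```

With these made-up numbers the date wins, which is the kind of targeted signal the cloud query needs: "the model is unsure about 1900", rather than "the sentence is uncertain".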
A sentence-transformers retriever finds the relevant passage chunk. Sonnet gets a targeted query: the uncertain fact + the grounding document. All cloud calls fire async after local generation completes — generation never blocks on the API.
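The "fire after generation completes" ordering can be sketched with `asyncio`. This is a toy under stated assumptions: `ask_cloud` is a hypothetical stand-in for the real API client, and the queued queries are placeholder strings rather than real retrieval output.

```python
import asyncio
from typing import List

async def ask_cloud(query: str) -> str:
    """Hypothetical stand-in for a cloud call; a real client would await an HTTP request."""
    await asyncio.sleep(0.01)
    return f"verified: {query}"

async def answer(prompt: str) -> dict:
    # 1) Local generation runs to completion; uncertain facts are only queued here.
    draft = f"local draft for {prompt!r}"
    pending: List[str] = ["fact A + passage chunk", "fact B + passage chunk"]
    # 2) Once the draft is complete, the queued queries go out concurrently.
    verdicts = await asyncio.gather(*(ask_cloud(q) for q in pending))
    return {"draft": draft, "verdicts": list(verdicts)}

result = asyncio.run(answer("When did Paris host the Olympics?"))
```

`asyncio.gather` returns results in the order the queries were queued, so each verdict can be matched back to the uncertain fact that triggered it.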
Classical tools (math, dates, units) sit between local and cloud, handling deterministic answers at zero cost.
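A rough sketch of that tier ordering, assuming a single arithmetic tool and a peak-entropy signal from the local pass. The function names, the whitelist-based evaluator, and the exact routing rule are illustrative assumptions, not the repo's implementation.

```python
import ast
import operator
from typing import Optional

# Whitelisted arithmetic operators; anything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node: ast.AST) -> float:
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("not plain arithmetic")

def math_tool(text: str) -> Optional[float]:
    """Deterministic tier: returns a number for plain arithmetic, else None."""
    try:
        return _eval(ast.parse(text.strip(), mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None

def route(query: str, peak_entropy: float, threshold: float = 0.8) -> str:
    """Local if confident, then the free deterministic tool, and cloud only as a last resort."""
    if peak_entropy <= threshold:
        return "local"
    if math_tool(query) is not None:
        return "classical"
    return "cloud"
```

For example, `route("12 * (3 + 4)", 1.2)` stays off the cloud because the tool can compute the answer, while `route("who wrote Hamlet", 1.2)` escalates.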
Benchmark results
Three conditions, seven Hugging Face datasets, 140 samples in total:
| Condition | Accuracy | Cost |
|---|---|---|
| Local only (Gemma 3 4B) | 70.7% | $0.00 |
| Buddy System (Gemma + Sonnet) | 71.4% | $0.21 |
| Advisor pattern (Haiku → Opus) | 62.9% | $0.44 |
Per-dataset:
| Dataset | Local | Buddy | Advisor |
|---|---|---|---|
| AG News | 75% | 75% | 75% |
| WikiANN | 60% | 60% | 70% |
| STS-B | 30% | 30% | 30% |
| SST-2 | 90% | 90% | 95% |
| GSM8K | 75% | 80% | 55% |
| SQuAD v2 | 90% | 90% | 60% |
| HotpotQA | 75% | 75% | 55% |
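As a quick consistency check, the headline accuracies can be recomputed from the per-dataset table. The sketch below takes an unweighted mean across the seven datasets; that weighting is my assumption, but it reproduces the reported numbers, which is consistent with each dataset contributing the same number of samples.

```python
# Per-dataset accuracy (%) copied from the table above: (local, buddy, advisor).
per_dataset = {
    "AG News":  (75, 75, 75),
    "WikiANN":  (60, 60, 70),
    "STS-B":    (30, 30, 30),
    "SST-2":    (90, 90, 95),
    "GSM8K":    (75, 80, 55),
    "SQuAD v2": (90, 90, 60),
    "HotpotQA": (75, 75, 55),
}

def mean_accuracy(column: int) -> float:
    """Unweighted mean of one column, rounded to one decimal place."""
    values = [row[column] for row in per_dataset.values()]
    return round(sum(values) / len(values), 1)

local, buddy, advisor = (mean_accuracy(i) for i in range(3))
print(local, buddy, advisor)  # 70.7 71.4 62.9

# Advisor minus local, in percentage points, per dataset.
advisor_delta = {name: row[2] - row[0] for name, row in per_dataset.items()}
```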
The interesting finding
The Advisor pattern (Haiku generates → Opus reviews unconditionally) dropped 30pp on SQuAD v2 and 20pp on HotpotQA compared to local-only. The mechanism: the review step receives Haiku's answer but not the source document. Opus corrects from parametric memory, not from the passage.
This is not a model capability problem; it is a question of what context the review tier receives. Pass the document to the reviewer and the result changes, which is what the Buddy System's retrieval step does: it hands Sonnet the relevant passage chunk along with the uncertain fact.
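To make the difference in context concrete, here is a minimal sketch of a review prompt built with and without the passage. The function name, prompt wording, and example strings are all hypothetical; the only point is that the grounded variant puts the source text in front of the reviewer.

```python
from typing import Optional

def build_review_prompt(question: str, draft: str, passage: Optional[str] = None) -> str:
    """Assemble a reviewer prompt; the passage is included only when one is supplied."""
    parts = [f"Question: {question}", f"Draft answer: {draft}"]
    if passage is not None:
        parts.insert(0, f"Passage: {passage}")
        parts.append("Check the draft against the passage only. "
                     "If the passage does not answer the question, say so.")
    else:
        # Without a passage, the reviewer can only fall back on what it remembers.
        parts.append("Check the draft answer.")
    return "\n".join(parts)

ungrounded = build_review_prompt("Where was the treaty signed?", "Vienna")
grounded = build_review_prompt(
    "Where was the treaty signed?", "Vienna",
    passage="The treaty was signed in Vienna in 1815.",
)
```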