TL;DR: I priced 8 local Ollama models by € per 1,000 correct answers — metered GPU energy ÷ correct answers, on one RTX 3090. gemma4:26b won at 96.9% accuracy for €0.013/1k-correct. The most expensive model (qwen3:8b-fp16) cost €0.239/1k and scored worse (66.7%). Reasoning tokens and full precision both cost a lot and bought nothing here. Every cost comes from real metered kWh via the open-source HomeLab Monitor.
This is the short, copy-pasteable version. The narrative writeup is on Medium.
The metric
€ per correct answer = (metered GPU energy cost over the eval window) ÷ (number of correct answers)
Tokens-per-euro flatters whichever model talks the most. Cost-per-correct only rewards being right cheaply — which is the thing you actually pay for.
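As a minimal sketch of the metric, the helper below turns one model's per-pass energy, accuracy, and tariff into € per 1,000 correct answers. The function and parameter names are illustrative and not part of any tool mentioned in this post. The inputs are the gemma4:26b figures reported in the results table (4.5 Wh per pass, 96.9% on 54 tasks) and the €0.1534/kWh day tariff.

```python
def eur_per_1k_correct(wh_per_pass: float, accuracy: float,
                       n_tasks: int, eur_per_kwh: float) -> float:
    """Energy cost of one pass divided by the correct answers in it, scaled to 1,000."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    cost_per_pass = (wh_per_pass / 1000.0) * eur_per_kwh  # Wh -> kWh -> EUR
    correct_per_pass = accuracy * n_tasks                 # average correct answers per pass
    return 1000.0 * cost_per_pass / correct_per_pass


# gemma4:26b on the day tariff (hypothetical helper, figures from the results table)
gemma4 = eur_per_1k_correct(4.5, 0.969, 54, 0.1534)
print(f"gemma4:26b: EUR {gemma4:.3f} per 1,000 correct")
```

Because the denominator is correct answers rather than tokens, a model that talks more only looks better if the extra tokens actually raise accuracy.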
The signal
| Model | VRAM | Acc | Tok/task | Tok/s | Wh/pass | €/1k correct (day) |
|---|---|---|---|---|---|---|
| gemma4:26b | 16.9 GB | 96.9% | 68 | 86 | 4.5 | €0.013 ← winner |
| gemma3:1b | 0.9 GB | 82.1% | 125 | 133 | 3.8 | €0.013 |
| gemma3:27b | 17.1 GB | 100.0% | 119 | 36 | 16.3 | €0.046 |
| qwen3:30b-a3b (MoE) | 18.4 GB | 83.3% | 555 | 186 | 14.1 | €0.048 |
| qwen3:8b (Q4_K_M) 🧠 | 5.4 GB | 64.8% | 626 | 126 | 22.7 | €0.100 |
| qwen3:8b 🧠 | 5.4 GB | 64.8% | 626 | 126 | 23.6 | €0.104 |
| qwen3:8b (Q8_0) 🧠 | 8.7 GB | 61.1% | 672 | 88 | 33.5 | €0.156 |
| qwen3:8b (fp16) 🧠 | 15.5 GB | 66.7% | 664 | 53 | 56.2 | €0.239 ← most expensive |
🧠 = reasoning/thinking mode on. Night tariff knocks ~40% off every row.
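The "~40%" night figure follows directly from the ratio of the two tariffs quoted later in this post (€0.1534 day, €0.0920 night). A small sketch, using three day-tariff values copied from the table; the variable names are illustrative:

```python
DAY_EUR_PER_KWH = 0.1534
NIGHT_EUR_PER_KWH = 0.0920

# Day-tariff EUR per 1,000 correct answers, copied from the results table.
day_cost = {
    "gemma4:26b": 0.013,
    "gemma3:27b": 0.046,
    "qwen3:8b (fp16)": 0.239,
}

# The energy per pass is the same day or night, so only the price per kWh changes.
night_factor = NIGHT_EUR_PER_KWH / DAY_EUR_PER_KWH
discount = 1.0 - night_factor
night_cost = {model: cost * night_factor for model, cost in day_cost.items()}

print(f"night discount: {discount:.1%}")
for model, cost in night_cost.items():
    print(f"{model}: EUR {cost:.3f} per 1,000 correct at night")
```

Since every row is scaled by the same factor, the ranking is identical under either tariff.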
Three things the numbers say
1. The value champion is mid-size, not max-size. gemma4:26b hit 96.9% for €0.013 per 1,000 correct: near-perfect, tied for cheapest per correct answer on the bench (gemma3:1b matches the price, but at only 82.1% accuracy), and ~18× cheaper per correct answer than qwen3:8b-fp16. gemma3:27b is the only 100% model but costs ~3.5× more (slower, 36 tok/s).
2. The thinking tax is real and didn't pay off. qwen3 reasoning models emit 555–672 tokens/task vs the gemmas' 68–125 (5–9×). Tokens are energy. On these 54 deterministic tasks that extra reasoning bought no correctness — the priciest model scored lower than one 18× cheaper. (Caveat: this suite is arithmetic / executable code / format-following. On open-ended hard problems, reasoning earns its tokens. On structured agent work, it was dead weight.)
3. The quantization paradox. Same qwen3:8b at three precisions:
    Precision   Accuracy   Energy/pass   Throughput
    Q4_K_M      64.8%      22.7 Wh       126 tok/s
    Q8_0        61.1%      33.5 Wh       88 tok/s
    fp16        66.7%      56.2 Wh       53 tok/s
                └ flat ┘   └ 2.5× ┘      └ halved ┘
Going from Q4_K_M to fp16 cost 2.5× the energy and cut throughput by more than half (126 → 53 tok/s), for accuracy that's flat-and-noisy. On a 3090, aggressive quant was the correct call, not a compromise.
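To put a rough number on "flat-and-noisy": on 54 tasks the three accuracies correspond to 35, 33, and 36 correct answers, so the whole spread is three tasks. The sketch below compares that spread to a binomial standard error. Treating the tasks as independent coin flips is my simplifying assumption, not something the benchmark itself claims.

```python
import math

N_TASKS = 54
accuracy = {"Q4_K_M": 0.648, "Q8_0": 0.611, "fp16": 0.667}

# Correct-answer counts implied by the reported percentages.
correct = {name: round(acc * N_TASKS) for name, acc in accuracy.items()}


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent pass/fail tasks."""
    return math.sqrt(p * (1.0 - p) / n)


spread = max(accuracy.values()) - min(accuracy.values())
se = binomial_se(accuracy["Q4_K_M"], N_TASKS)

print(correct)
print(f"spread between precisions: {spread:.1%}")
print(f"one standard error at n={N_TASKS}: {se:.1%}")
```

Under that assumption, the gap between the best and worst precision is smaller than one standard error, which is why the accuracy column is best read as unchanged rather than as a ranking.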
Methodology (so you can trust the ranking)
- 54 deterministic tasks, mechanically graded — no LLM judge. Reasoning 15 (GSM8K-style numeric extraction), code 12 (HumanEval-style, executed asserts in a sandbox), factual 12 (keyword), instruct 15 (format predicates). Grader selftest 11/11.
- Controls identical across all 8 models: temperature 0, seed 42, num_ctx 4096, num_predict 1024, identical prompts.
- Warm-up discarded → model-load energy excluded (pricing inference, not cold starts).
- 3 passes each, ranges reported.
- Idle baseline = 38 W, measured as a control.
- qwen3 thinking left on (realistic); thinking tokens counted for energy, stripped before grading.
- Honest determinism caveat: Ollama is not bit-exact at temp 0. gemma3:1b drifted 81–83% across passes; gemma3:27b was 100% on all three; qwen3 runs were identical. Report ranges, not point claims.
- CPU/DRAM not metered (no RAPL on this host), so true wall-plug cost is a bit higher, but the ranking holds because every model paid the same un-metered overhead.
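A minimal sketch of what the controls and the mechanical grading could look like in practice. It is not the harness used for this post. It assumes Ollama's `/api/generate` endpoint on the default port, assumes thinking output arrives inline as `<think>…</think>` blocks in the response text, and uses a simplified "last number wins" rule for the numeric tasks. All function names are hypothetical.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default; adjust for your host

# Controls from the methodology list, identical for every model.
OPTIONS = {"temperature": 0, "seed": 42, "num_ctx": 4096, "num_predict": 1024}

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def build_payload(model: str, prompt: str) -> dict:
    """Request body with the pinned sampling options."""
    return {"model": model, "prompt": prompt, "stream": False, "options": OPTIONS}


def generate(model: str, prompt: str) -> str:
    """Send one non-streaming request and return the raw response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def strip_thinking(text: str) -> str:
    """Drop inline thinking blocks so only the visible answer is graded."""
    return THINK_BLOCK.sub("", text).strip()


def grade_numeric(raw_output: str, expected: float) -> bool:
    """Pass if the last number in the visible answer equals the expected value."""
    visible = strip_thinking(raw_output).replace(",", "")
    numbers = NUMBER.findall(visible)
    return bool(numbers) and float(numbers[-1]) == expected
```

The thinking tokens still cost energy because they are generated inside the metered window; they are only removed from the text the grader sees.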
The currency gotcha (measure twice)
Costs are EUR from measured kWh × Bulgarian dual tariff (€0.1534 day / €0.0920 night). While building this I caught my own dashboard mislabeling BGN as EUR: the tariff read 0.30/0.18 EUR, but those are leva. Bulgaria joined the euro on 2026-01-01 at fixed 1 EUR = 1.95583 BGN; €0.30/kWh would be German-tier, implausible for the EU's cheapest household power. Converted: 0.30 / 1.95583 = €0.1534, 0.18 / 1.95583 = €0.0920. Lesson: don't trust the dashboard's € field — compute from physical kWh and your verified tariff.
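The conversion above is easy to check by hand. A short sketch, using only the fixed rate and the two leva tariffs quoted in this section (the helper name is illustrative):

```python
BGN_PER_EUR = 1.95583  # fixed conversion rate quoted above


def bgn_to_eur(amount_bgn: float) -> float:
    """Convert an amount in leva to euros at the fixed rate."""
    return amount_bgn / BGN_PER_EUR


day_eur = bgn_to_eur(0.30)    # day tariff, leva per kWh
night_eur = bgn_to_eur(0.18)  # night tariff, leva per kWh

print(f"day:   EUR {day_eur:.4f}/kWh")
print(f"night: EUR {night_eur:.4f}/kWh")

# Reading the leva figures as euros would inflate every cost by the same factor.
print(f"overstatement if mislabeled: {BGN_PER_EUR - 1:.0%}")
```

Left uncorrected, the mislabel would have nearly doubled every € figure in the table while leaving the ranking intact, which is exactly the kind of error that is easy to miss.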
How to reproduce the energy tracking
Every cost above came from HomeLab Monitor (MIT, one container) — its Experiments tab integrates real GPU power over a run's window into kWh and money. Bring it up:
    docker compose up -d   # port 9800
Grab the one-file homelab_run.py client, mint an ingest key, and wrap your eval — the run comes back priced:
    import homelab_run as homelab

    MODEL = "gemma4:26b"
    PASSES = 3  # three passes per model, as in the methodology

    homelab.configure(url="http://<your-host>:9800", key="hlm_…")
    with homelab.run(MODEL, tags=["llm-cost-bench"]) as r:
        for _ in range(PASSES):
            run_graded_eval(MODEL)  # your own eval harness; keep all inference inside the run
    priced = homelab.pull(r.id)  # energy_kwh, cost, avg_w, peak_util, from real power
That's the whole instrumentation. Divide the priced energy by your grader's correct count and you've got cost-per-correct for your own roster. Docs · docker pull sikamikaniko123/homelab-monitor.
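For intuition about what "integrates real GPU power over a run's window" means, here is a generic sketch of turning timestamped power readings into kWh and money with the trapezoidal rule. This is not HomeLab Monitor's code, whose internals are not shown in this post. The function name and the sample readings are hypothetical.

```python
def energy_kwh(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integral of (timestamp_seconds, watts) samples, returned in kWh."""
    if len(samples) < 2:
        return 0.0
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        joules += 0.5 * (w0 + w1) * (t1 - t0)  # average watts * seconds = joules
    return joules / 3_600_000.0                # 1 kWh = 3.6e6 J


# Hypothetical readings: a steady 300 W for 60 s, sampled every 10 s.
samples = [(float(t), 300.0) for t in range(0, 61, 10)]
kwh = energy_kwh(samples)
cost_eur = kwh * 0.1534  # day tariff from this post

print(f"{kwh * 1000:.2f} Wh, EUR {cost_eur:.6f}")
```

Only readings that fall inside the run's window should be passed in, which is why the warm-up is kept outside the metered run.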
What I deliberately did NOT do
- No LLM-as-judge — mechanical grading only.
- No cold-start energy in the numbers — warm-up discarded on purpose.
- No trusting the dashboard's € field — costs recomputed from measured kWh.
- No single-run claims — 3 passes, ranges where they exist.
- No CPU/DRAM cost claim — only the GPU is metered, and I say so.
Over to you
Bigger and full-precision lost. A 26B model did near-perfect work for a rounding error; an fp16 reasoning model charged 18× as much to be wrong more often.
So when you reach for a local model — accuracy, speed, or cost per answer it actually gets right? And have you ever measured the third one? Drop your own cost-per-correct numbers in the comments.