Arsen Apostolov

How to Rank Local LLMs by Cost per Correct Answer (Measured GPU Energy, 8 Ollama Models)

TL;DR: I priced 8 local Ollama models by € per 1,000 correct answers — metered GPU energy ÷ correct answers, on one RTX 3090. gemma4:26b won at 96.9% accuracy for €0.013/1k-correct. The most expensive model (qwen3:8b-fp16) cost €0.239/1k and scored worse (66.7%). Reasoning tokens and full precision both cost a lot and bought nothing here. Every cost comes from real metered kWh via the open-source HomeLab Monitor.

This is the short, copy-pasteable version. The narrative writeup is on Medium.

The metric

€ per correct answer = (metered GPU energy cost over the eval window) ÷ (number of correct answers)

Tokens-per-euro flatters whichever model talks the most. Cost-per-correct only rewards being right cheaply — which is the thing you actually pay for.
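As a minimal sketch of that formula, the helper below turns energy per pass, correct answers per pass, and a tariff into € per 1,000 correct answers. The function name is hypothetical, and it assumes the Wh/pass figures reported further down are the metered GPU energy for one pass over the 54 tasks; the inputs are taken from the results table and the day tariff quoted later in the post.

```python
def eur_per_1k_correct(wh_per_pass: float, correct_per_pass: float,
                       eur_per_kwh: float) -> float:
    """Cost in EUR of 1,000 correct answers (hypothetical helper).

    wh_per_pass      -- metered GPU energy for one pass, in watt-hours
    correct_per_pass -- correct answers in that pass (may be a mean over passes)
    eur_per_kwh      -- verified electricity tariff
    """
    if correct_per_pass <= 0:
        raise ValueError("no correct answers: cost per correct is undefined")
    cost_per_pass = (wh_per_pass / 1000.0) * eur_per_kwh  # Wh -> kWh -> EUR
    return 1000.0 * cost_per_pass / correct_per_pass


DAY_TARIFF = 0.1534  # EUR/kWh, the day rate used in this post

# gemma3:27b: 16.3 Wh/pass, 54 of 54 correct
gemma3_27b = eur_per_1k_correct(16.3, 54, DAY_TARIFF)

# qwen3:8b fp16: 56.2 Wh/pass, 66.7% of 54 tasks correct
qwen3_fp16 = eur_per_1k_correct(56.2, 0.667 * 54, DAY_TARIFF)

print(f"gemma3:27b    {gemma3_27b:.3f} EUR / 1k correct")
print(f"qwen3:8b fp16 {qwen3_fp16:.3f} EUR / 1k correct")
```

Rounded to three decimals, these two inputs give €0.046 and €0.239, matching the corresponding rows of the table.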

The signal

Model                  VRAM     Acc     Tok/task  Tok/s  Wh/pass  €/1k correct (day)
gemma4:26b             16.9 GB  96.9%   68        86     4.5      €0.013   ← winner
gemma3:1b              0.9 GB   82.1%   125       133    3.8      €0.013
gemma3:27b             17.1 GB  100.0%  119       36     16.3     €0.046
qwen3:30b-a3b   (MoE)  18.4 GB  83.3%   555       186    14.1     €0.048
qwen3:8b (Q4_K_M) 🧠   5.4 GB   64.8%   626       126    22.7     €0.100
qwen3:8b          🧠   5.4 GB   64.8%   626       126    23.6     €0.104
qwen3:8b (Q8_0)   🧠   8.7 GB   61.1%   672       88     33.5     €0.156
qwen3:8b (fp16)   🧠   15.5 GB  66.7%   664       53     56.2     €0.239   ← most expensive

🧠 = reasoning/thinking mode on. Night tariff knocks ~40% off every row.

Three things the numbers say

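1. The value champion is mid-size, not max-size. gemma4:26b hit 96.9% for €0.013 per 1,000 correct answers: the lowest cost per correct answer on the bench (level with gemma3:1b at this precision, but at 96.9% accuracy against 82.1%) and ~18× cheaper per correct answer than qwen3:8b-fp16. gemma3:27b is the only 100% model, but it costs ~3.5× more per correct answer because it is slower (36 tok/s vs 86) and burns 16.3 Wh per pass against 4.5.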

2. The thinking tax is real and didn't pay off. The qwen3 models emit 555–672 tokens/task vs the gemmas' 68–125 (roughly 4× to 10× more). Tokens are energy. On these 54 deterministic tasks the extra reasoning bought no correctness: the priciest model (66.7%) scored lower than one 18× cheaper (96.9%). (Caveat: this suite is arithmetic, executable code, and format-following. On open-ended hard problems, reasoning can earn its tokens; on structured agent work like this, it was dead weight.)

3. The quantization paradox. Same qwen3:8b at three precisions:

            Accuracy   Energy/pass   Throughput
Q4_K_M      64.8%      22.7 Wh       126 tok/s
Q8_0        61.1%      33.5 Wh       88  tok/s
fp16        66.7%      56.2 Wh       53  tok/s
            └ flat ┘   └ 2.5× ┘      └ halved ┘

Higher precision cost 2.5× the energy and half the throughput for accuracy that's flat-and-noisy. On a 3090, aggressive quant was the correct call, not a compromise.
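The ratios in that comparison are easy to recompute. The sketch below uses only the three rows of the table above and expresses each precision relative to Q4_K_M; the dictionary layout and function name are illustrative choices, not part of the benchmark code.

```python
# Figures copied from the qwen3:8b precision comparison above.
RUNS = {
    "Q4_K_M": {"acc_pct": 64.8, "wh_per_pass": 22.7, "tok_s": 126},
    "Q8_0":   {"acc_pct": 61.1, "wh_per_pass": 33.5, "tok_s": 88},
    "fp16":   {"acc_pct": 66.7, "wh_per_pass": 56.2, "tok_s": 53},
}


def relative_to(baseline: str, runs: dict) -> dict:
    """Express every run as a multiple of (or delta from) the baseline run."""
    base = runs[baseline]
    return {
        name: {
            "energy_x": r["wh_per_pass"] / base["wh_per_pass"],
            "throughput_x": r["tok_s"] / base["tok_s"],
            "acc_delta_pts": r["acc_pct"] - base["acc_pct"],
        }
        for name, r in runs.items()
    }


rel = relative_to("Q4_K_M", RUNS)
for name, row in rel.items():
    print(f"{name:7s} energy x{row['energy_x']:.2f}  "
          f"throughput x{row['throughput_x']:.2f}  "
          f"accuracy {row['acc_delta_pts']:+.1f} pts")
```

On a 54-task suite one task is worth about 1.9 percentage points, so the fp16 vs Q4_K_M accuracy gap is a single task, which is why the accuracy column reads as flat.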

Methodology (so you can trust the ranking)

  • 54 deterministic tasks, mechanically graded — no LLM judge. Reasoning 15 (GSM8K-style numeric extraction), code 12 (HumanEval-style, executed asserts in a sandbox), factual 12 (keyword), instruct 15 (format predicates). Grader selftest 11/11.
  • Controls identical across all 8 models: temperature 0, seed 42, num_ctx 4096, num_predict 1024, identical prompts.
  • Warm-up discarded → model-load energy excluded (pricing inference, not cold starts).
  • 3 passes each, ranges reported.
  • Idle baseline = 38 W, measured as a control.
  • qwen3 thinking left on (realistic); thinking tokens counted for energy, stripped before grading.
  • Honest determinism caveat: Ollama is not bit-exact at temp 0. gemma3:1b drifted 81–83% across passes; gemma3:27b was 100% on all three; qwen3 runs were identical. Report ranges, not point claims.
  • CPU/DRAM not metered (no RAPL on this host), so true wall-plug cost is a bit higher — but the ranking holds because every model paid the same un-metered overhead.
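A rough sketch of what the controls and mechanical grading could look like in practice. This is not the benchmark's actual harness: the function names and the grading rule (compare the last number in the answer) are assumptions for illustration, and it assumes the model's reasoning arrives inline, wrapped in `<think>…</think>` tags. The option names are the standard Ollama request options listed above.

```python
import re

# Decoding controls from the methodology, as Ollama request options.
OPTIONS = {"temperature": 0, "seed": 42, "num_ctx": 4096, "num_predict": 1024}


def build_request(model: str, prompt: str) -> dict:
    """JSON body for a non-streaming POST to Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False,
            "options": dict(OPTIONS)}


THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def strip_thinking(text: str) -> str:
    """Drop inline reasoning so only the final answer is graded."""
    return THINK_BLOCK.sub("", text).strip()


def grade_numeric(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Hypothetical GSM8K-style grader: the last number in the answer must match."""
    cleaned = strip_thinking(output).replace(",", "")  # "1,234" -> "1234"
    numbers = NUMBER.findall(cleaned)
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol


sample = "<think>3 * 4 = 12, or was it 13?</think>The answer is 12."
print(grade_numeric(sample, 12))   # reasoning is ignored, final answer graded
```

The thinking tokens still count toward energy because they were generated inside the metered window; they are only removed from the text handed to the grader.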

The currency gotcha (measure twice)

Costs are EUR from measured kWh × Bulgarian dual tariff (€0.1534 day / €0.0920 night). While building this I caught my own dashboard mislabeling BGN as EUR: the tariff read 0.30/0.18 EUR, but those are leva. Bulgaria joined the euro on 2026-01-01 at fixed 1 EUR = 1.95583 BGN; €0.30/kWh would be German-tier, implausible for the EU's cheapest household power. Converted: 0.30 / 1.95583 = €0.1534, 0.18 / 1.95583 = €0.0920. Lesson: don't trust the dashboard's € field — compute from physical kWh and your verified tariff.
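The conversion is small enough to check in a few lines. The sketch below redoes the arithmetic from the paragraph above with the fixed conversion rate; the variable names are illustrative.

```python
BGN_PER_EUR = 1.95583  # fixed conversion rate


def bgn_to_eur(amount_bgn: float) -> float:
    """Convert leva to euro at the fixed rate."""
    return amount_bgn / BGN_PER_EUR


# Tariff as the dashboard displayed it (actually leva per kWh, not euro).
day_bgn, night_bgn = 0.30, 0.18

day_eur = bgn_to_eur(day_bgn)
night_eur = bgn_to_eur(night_bgn)
night_discount = 1 - night_eur / day_eur      # share saved by running at night
overstatement = day_bgn / day_eur             # error factor if BGN is read as EUR

print(f"day   {day_eur:.4f} EUR/kWh")
print(f"night {night_eur:.4f} EUR/kWh")
print(f"night discount {night_discount:.0%}, mislabel inflates costs x{overstatement:.2f}")
```

Left uncorrected, the mislabel would have inflated every cost in the table by a factor of about 1.96, without changing the ranking.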

How to reproduce the energy tracking

Every cost above came from HomeLab Monitor (MIT, one container) — its Experiments tab integrates real GPU power over a run's window into kWh and money. Bring it up:

docker compose up -d        # port 9800

Grab the one-file homelab_run.py client, mint an ingest key, and wrap your eval — the run comes back priced:

import homelab_run as homelab

MODEL = "gemma4:26b"
PASSES = 3                              # the bench ran 3 passes per model

homelab.configure(url="http://<your-host>:9800", key="hlm_…")

with homelab.run(MODEL, tags=["llm-cost-bench"]) as r:
    for _ in range(PASSES):
        run_graded_eval(MODEL)          # your own eval harness; keep all inference inside the run
priced = homelab.pull(r.id)             # energy_kwh, cost, avg_w, peak_util, from real metered power

That's the whole instrumentation. Divide the priced energy by your grader's correct count and you've got cost-per-correct for your own roster. Docs · docker pull sikamikaniko123/homelab-monitor.
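For intuition about what "integrating GPU power over a run's window" means, here is a minimal sketch using the trapezoidal rule over timestamped power samples. It is not HomeLab Monitor's implementation; the function names are hypothetical and the samples are made up for illustration (on an NVIDIA card, real ones could be polled with `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`).

```python
def energy_kwh(samples: list[tuple[float, float]]) -> float:
    """Integrate (unix_seconds, watts) samples into kWh with the trapezoidal rule.

    Samples must be in time order; fewer than two samples yields 0.0.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)  # mean power over the interval x seconds
    return joules / 3_600_000                   # 1 kWh = 3.6 MJ


def run_cost_eur(samples: list[tuple[float, float]], eur_per_kwh: float) -> float:
    """Price a run from its power samples and a verified tariff."""
    return energy_kwh(samples) * eur_per_kwh


# Illustrative window: a steady 300 W for one hour.
window = [(0.0, 300.0), (1800.0, 300.0), (3600.0, 300.0)]
print(f"{energy_kwh(window):.3f} kWh, {run_cost_eur(window, 0.1534):.4f} EUR")
```

Sampling more often during bursty inference tightens the estimate, since the trapezoidal rule assumes power changes linearly between samples.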

What I deliberately did NOT do

  • No LLM-as-judge — mechanical grading only.
  • No cold-start energy in the numbers — warm-up discarded on purpose.
  • No trusting the dashboard's € field — costs recomputed from measured kWh.
  • No single-run claims — 3 passes, ranges where they exist.
  • No CPU/DRAM cost claim — only the GPU is metered, and I say so.

Over to you

Reasoning tokens and full precision lost. A 26B model did near-perfect work for a rounding error; an fp16 reasoning model charged 18× as much per correct answer to be wrong more often.

So when you reach for a local model — accuracy, speed, or cost per answer it actually gets right? And have you ever measured the third one? Drop your own cost-per-correct numbers in the comments.
