Roberto de la Cámara

Variance testing flipped my Ollama benchmark ranking

I ran 6 local Ollama models against strict code-gen prompts, then re-ran the most discriminating prompt 3 times each. The single-shot winner was unstable, and the actual best was a general-purpose model the single-shot run had ranked 5th.

I've been picking models for a local Ollama pool that handles small, well-scoped coding chores delegated from a main agent. Before wiring routing rules into the agent, I wanted a defensible answer to "which model for which task family." So I built a tiny benchmark. The interesting part wasn't the ranking. It was that the ranking changed after I added variance testing.

TL;DR

I ran 6 models against 3 strict, single-function prompts (auto-graded by I/O equivalence, 32 test cases). Then I ran the most discriminating prompt 3 times on every model. Findings:

  • Single-shot ranking placed qwen3.5:9b at the top and gemma4:latest 5th.
  • Post-variance, gemma4:latest was the only byte-stable perfect model. qwen3.5:9b produced byte-identical buggy code in 2 of 3 runs at temperature=0.2. Its dominant decoding mode is broken on this prompt.
  • The Qwen3 thinking variants returned empty response fields on 100% of constrained code-gen prompts until I set think:false.
  • The "obvious coder" pick (qwen2.5-coder:14b) lost to a general-purpose model (gemma4) on every code-gen prompt that didn't require Python runtime reasoning.

Methodological lesson: single-shot LLM benchmarks lie in both directions. The "winner" was unstable, and the "loser" was best-in-class for a specific task family.

Setup

Single workstation, 16 GB VRAM, Ollama on 127.0.0.1:11434. A 60-line bash wrapper POSTs each prompt with temperature=0.2, stream=false. A Python verifier strips markdown fences, exec()s the model's output, and runs valid + invalid inputs against the resulting function. All scores are automated.
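
For reference, the harness core is small enough to sketch. Helper names and the case format below are illustrative, not my exact script; the endpoint and request shape are standard Ollama `/api/generate`:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def generate(model: str, prompt: str, temperature: float = 0.2) -> str:
    """One non-streaming completion from the local Ollama server."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def score(code: str, cases) -> int:
    """exec() the model's output and count passing I/O cases.

    `cases` pairs an input with either an expected value (valid input) or an
    expected exception class (invalid input). Local experiment only: exec()
    of model output is not safe anywhere else.
    """
    code = code.strip()
    if code.startswith("```"):
        # strip an opening fence line and a trailing fence, as the verifier does
        code = code.split("\n", 1)[1].rsplit("```", 1)[0]
    ns: dict = {}
    exec(code, ns)
    fn = ns["parse_iso_duration"]  # the prompt pins the function name
    passed = 0
    for arg, expected in cases:
        try:
            result = fn(arg)
        except Exception as e:
            passed += isinstance(expected, type) and isinstance(e, expected)
        else:
            passed += result == expected
    return passed
```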

Three prompts, all forbidding markdown fences and preamble:

  • P1: a pytest test generator with a stale-reference trap (the function under test rebinds the module global, so the test must re-read it by attribute access rather than hold a local reference).
  • P2: parse_iso_duration(s) -> int for PT<H>H<M>M<S>S strings, raising ValueError on malformed input. 6 valid + 8 invalid cases (reference sketch after this list).
  • P3: flatten(d, sep=".") -> dict recursing into nested dicts but leaving lists alone. 10 cases.
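
For concreteness, here is roughly what a full-marks P2 answer has to look like. This is a hand-written reference sketch, not any model's output:

```python
import re

# Each unit is optional, but the digits and the unit letter must appear
# together; a bare "PT" with no units is rejected below.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def parse_iso_duration(s: str) -> int:
    m = _DURATION.match(s)
    if not m or not any(m.groups()):
        raise ValueError(f"malformed duration: {s!r}")
    h, mins, secs = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mins * 60 + secs

# parse_iso_duration("PT5M") -> 300; parse_iso_duration("PT") raises ValueError
```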

Variance test (P2, 3 runs per model)

Same prompt, same temperature=0.2, three independent calls:

| Model | Run 1 | Run 2 | Run 3 | Mean | Stability |
| --- | --- | --- | --- | --- | --- |
| gemma4:latest | 22/22 | 22/22 | 22/22 | 22.0 | byte-stable perfect |
| qwen2.5-coder:14b | 22/22 | 20/22 | 20/22 | 20.7 | tight cluster |
| qwen3:14b (think:false) | 17/22 | 16/22 | 17/22 | 16.7 | stable, mediocre |
| deepseek-coder-v2:16b | 16/22 | 16/22 | 12/22 | 14.7 | stable, wrong |
| qwen3.5:9b (think:false) | 9/22 | 9/22 | 21/22 | 13.0 | bimodal |
| qwen3.5:4b (think:false) | 4/22 | 19/22 | 16/22 | 13.0 | wild |

The bug qwen3.5:9b produced byte-identically in runs 1 and 2 was a regex requiring all three letters: `^(\d+)?H(\d+)?M(\d+)?S$`. So "PT5M" falsely fails because there's no H and no S literal. Subtle, plausible-looking, and it ships unless you actually run the function. The 21/22 score in single-shot was the less common sampling path.
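
The failure reproduces in two lines outside the verifier. Assuming the pattern is applied after the PT prefix is stripped (consistent with the regex above), the fix is moving each unit letter inside its optional group:

```python
import re

buggy = re.compile(r"^(\d+)?H(\d+)?M(\d+)?S$")              # digits optional, letters mandatory
fixed = re.compile(r"^(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")  # digits+letter optional as a unit

print(buggy.match("5M"))            # None -> "PT5M" falsely rejected
print(fixed.match("5M").groups())   # (None, '5', None) -> 300 seconds
# note: `fixed` still matches "", so an all-None rejection is required,
# as in the reference sketch above
```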

deepseek-coder-v2:16b is stably wrong: 0/6 valid inputs across all 3 runs. Same regex bug every time. Rerunning won't save it.

I ran a cross-prompt confirmation on the two stable models with P3, 3 runs each. gemma4 went 10/10/10; qwen2.5-coder:14b went 10/10/9. gemma4 finished 6 for 6 across both code-gen prompts, byte-stable. The point qwen2.5-coder lost came from using `if v:` (a truthy check) instead of `if v is not None`, silently dropping a None value. Idiomatic-looking, but wrong; the sketch below reconstructs the failure.
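
Here is a minimal flatten in the shape of P3 with the losing check marked. This is my reconstruction of the bug, not the model's verbatim code:

```python
def flatten(d: dict, sep: str = ".", _prefix: str = "") -> dict:
    out = {}
    for k, v in d.items():
        key = f"{_prefix}{sep}{k}" if _prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, sep, key))
        else:
            out[key] = v  # the losing run gated this on `if v:`,
                          # which silently drops None (and 0, "", [] too)
    return out

print(flatten({"a": {"b": None}, "c": [1, 2]}))
# {'a.b': None, 'c': [1, 2]} -- a truthy check would lose 'a.b' entirely
```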

The thinking-mode trap

First pass on Qwen3 with the default `think:true`: qwen3:14b returned 1 byte (`\n`) after 1174 seconds of GPU time. Twenty minutes for nothing. Ollama's `/api/generate` returns two fields for thinking-mode models: `response` and `thinking`. My script only logged `response`. When I dumped the raw JSON, the 9B's `thinking` field was 21 KB of this:

```
* Wait, I need to check if I can use `src` if `import src.main_improved` is used.
* Yes.
* So I will use `src.main_improved`.
* Wait, I need to check if I can use `src` if `import src` is used.
* Yes.
* So I will use `src.main_improved`.
[...repeats until context fills...]
```

`done_reason: "stop"` on a 21,000-character thinking trace with no committed answer. The fix was one parameter: `"think": false` in the request body. With it, all three Qwen3 sizes responded in 8 to 11 seconds and produced clean code.

If you're benchmarking thinking-capable models against strict output requirements: smoke-test with `think:false` first, and log both fields. One missing line of logging cost me 20 minutes of GPU time on what looked like crashes but was actually an infinite self-argument loop inside the thinking field.
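
The smoke test I use now, reusing the imports and OLLAMA_URL from the harness sketch above (the helper name is mine; `think` and the `thinking` field are the API surface described in this section):

```python
def smoke_test(model: str) -> None:
    """One cheap call that logs BOTH output fields before any long benchmark."""
    body = json.dumps({
        "model": model,
        "prompt": "Return exactly the word OK and nothing else.",
        "stream": False,
        "think": False,  # the one-parameter fix
        "options": {"temperature": 0.2},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # An empty `response` next to a huge `thinking` trace is a looping model,
    # not a crash -- exactly the failure this section describes.
    print(f"{model}: response={data.get('response')!r} "
          f"thinking_len={len(data.get('thinking') or '')} "
          f"done_reason={data.get('done_reason')}")
```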

Routing rules I ended up with

  • Parsers, regex, recursive transformers: gemma4:latest. Byte-stable and perfect across 6 runs of 2 different prompts at temp 0.2 (22/22 on P2, 10/10 on P3).
  • Tests, fixtures, anything needing Python module/runtime semantics: qwen2.5-coder:14b. Stable 20-22/22, the only model that handled the test-scaffolding trap correctly.
  • Mini tier (laptop, 4 GB VRAM): qwen3.5:4b with think:false, sample 5x at temp 0.7, run a verifier, keep the passer (loop sketched after this list). 3.4 GB, ~20s total. Hit rate >=18/22 was 60% in my runs.
  • Skip: qwen3:14b (stably mediocre, 16/22 mean) and deepseek-coder-v2:16b (stably wrong on valid inputs, same regex bug 3/3 runs).
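
The mini-tier loop from the third bullet, as a sketch on top of the harness's generate()/score() helpers (the threshold matches my >=18/22 acceptance bar; for Qwen3-family models the request should also carry `think: false` as above):

```python
def best_of_n(model: str, prompt: str, cases, n: int = 5,
              threshold: int = 18, temperature: float = 0.7):
    """Sample n candidates hot, verify each, keep the first one that passes."""
    for _ in range(n):
        candidate = generate(model, prompt, temperature=temperature)
        if score(candidate, cases) >= threshold:
            return candidate
    return None  # no passer: caller escalates to a bigger tier
```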

The most useful single observation: a general-purpose model beat the dedicated coder on every code-gen prompt that didn't require Python runtime reasoning. The "coder" label means trained on code, not best at every code task.

What I'd do differently next time

Run every prompt against every model 5+ times from the start; the cheap single-shot pass already cost me one wrong recommendation. Add `mypy --strict` to the verifier to catch type-hint laziness that exec() doesn't. And test phi-4-mini and granite-code:3b against qwen3.5:4b for the mini-tier slot.

If you've shipped qwen3.5:4b (or anything smaller) in a best-of-N + verifier loop in production, I'd be curious about your hit rate and N.
