<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roberto de la Cámara</title>
    <description>The latest articles on DEV Community by Roberto de la Cámara (@robertodelacamara).</description>
    <link>https://dev.to/robertodelacamara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859520%2Facc1832f-bd9c-480e-808f-610747919691.jpeg</url>
      <title>DEV Community: Roberto de la Cámara</title>
      <link>https://dev.to/robertodelacamara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robertodelacamara"/>
    <language>en</language>
    <item>
      <title>Variance testing flipped my Ollama benchmark ranking</title>
      <dc:creator>Roberto de la Cámara</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:26:30 +0000</pubDate>
      <link>https://dev.to/robertodelacamara/variance-testing-flipped-my-ollama-benchmark-ranking-3pod</link>
      <guid>https://dev.to/robertodelacamara/variance-testing-flipped-my-ollama-benchmark-ranking-3pod</guid>
      <description>&lt;p&gt;I ran 6 local Ollama models against strict code-gen prompts, then re-ran the most discriminating prompt 3 times each. The single-shot winner was unstable, and the actual best was a general-purpose model the single-shot run had ranked 5th.&lt;/p&gt;

&lt;p&gt;I've been picking models for a local Ollama pool that handles small, well-scoped coding chores delegated from a main agent. Before wiring routing rules into the agent, I wanted a defensible answer to "which model for which task family." So I built a tiny benchmark. The interesting part wasn't the ranking. It was that the ranking changed after I added variance testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I ran 6 models against 3 strict, single-function prompts (auto-graded by I/O equivalence, 32 test cases). Then I ran the most discriminating prompt 3 times on every model. Findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-shot ranking placed &lt;code&gt;qwen3.5:9b&lt;/code&gt; at the top and &lt;code&gt;gemma4:latest&lt;/code&gt; 5th.&lt;/li&gt;
&lt;li&gt;Post-variance, &lt;code&gt;gemma4:latest&lt;/code&gt; was the only byte-stable perfect model. &lt;code&gt;qwen3.5:9b&lt;/code&gt; produced byte-identical buggy code in 2 of 3 runs at &lt;code&gt;temperature=0.2&lt;/code&gt;. Its dominant decoding mode is broken on this prompt.&lt;/li&gt;
&lt;li&gt;The Qwen3 thinking variants returned empty &lt;code&gt;response&lt;/code&gt; fields on 100% of constrained code-gen prompts until I set &lt;code&gt;think:false&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The "obvious coder" pick (&lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;) lost to a general-purpose model (&lt;code&gt;gemma4&lt;/code&gt;) on every code-gen prompt that didn't require Python runtime reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Methodological lesson: single-shot LLM benchmarks lie in both directions. The "winner" was unstable, and the "loser" was best-in-class for a specific task family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Single workstation, 16 GB VRAM, Ollama on &lt;code&gt;127.0.0.1:11434&lt;/code&gt;. A 60-line bash wrapper POSTs each prompt with &lt;code&gt;temperature=0.2&lt;/code&gt;, &lt;code&gt;stream=false&lt;/code&gt;. A Python verifier strips markdown fences, &lt;code&gt;exec()&lt;/code&gt;s the model's output, and runs valid + invalid inputs against the resulting function. All scores are automated.&lt;/p&gt;
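
&lt;p&gt;A minimal sketch of that harness in pure Python (the real wrapper is bash; the helper names &lt;code&gt;ask&lt;/code&gt;/&lt;code&gt;grade&lt;/code&gt; and the test-case encoding are illustrative, not the benchmark's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def ask(model: str, prompt: str, temperature: float = 0.2) -&gt; str:
    """POST one prompt to Ollama, non-streaming."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def grade(source: str, fn_name: str, cases) -&gt; int:
    """Strip markdown fences, exec() the candidate, score by I/O equivalence."""
    source = re.sub(r"^```\w*\s*$", "", source.strip(), flags=re.M)
    ns: dict = {}
    exec(source, ns)                       # candidate must define fn_name
    fn = ns[fn_name]
    passed = 0
    for arg, expected in cases:            # expected: a value, or ValueError
        try:
            passed += fn(arg) == expected
        except ValueError:
            passed += expected is ValueError
        except Exception:
            pass                           # any other crash scores zero
    return passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;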

&lt;p&gt;Three prompts, all forbidding markdown fences and preamble:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P1&lt;/strong&gt;: a pytest test generator with a stale-reference trap (the function under test rebinds the module global, so the test must re-read by attribute, not hold a local).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P2&lt;/strong&gt;: &lt;code&gt;parse_iso_duration(s) -&amp;gt; int&lt;/code&gt; for &lt;code&gt;PT&amp;lt;H&amp;gt;H&amp;lt;M&amp;gt;M&amp;lt;S&amp;gt;S&lt;/code&gt; strings, raising &lt;code&gt;ValueError&lt;/code&gt; on malformed input. 6 valid + 8 invalid cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P3&lt;/strong&gt;: &lt;code&gt;flatten(d, sep=".") -&amp;gt; dict&lt;/code&gt; recursing into nested dicts but leaving lists alone. 10 cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Variance test (P2, 3 runs per model)
&lt;/h2&gt;

&lt;p&gt;Same prompt, same &lt;code&gt;temperature=0.2&lt;/code&gt;, three independent calls:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Stability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemma4:latest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22/22&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22/22&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22/22&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;byte-stable perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:14b&lt;/td&gt;
&lt;td&gt;22/22&lt;/td&gt;
&lt;td&gt;20/22&lt;/td&gt;
&lt;td&gt;20/22&lt;/td&gt;
&lt;td&gt;20.7&lt;/td&gt;
&lt;td&gt;tight cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:14b (&lt;code&gt;think:false&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;17/22&lt;/td&gt;
&lt;td&gt;16/22&lt;/td&gt;
&lt;td&gt;17/22&lt;/td&gt;
&lt;td&gt;16.7&lt;/td&gt;
&lt;td&gt;stable, mediocre&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deepseek-coder-v2:16b&lt;/td&gt;
&lt;td&gt;16/22&lt;/td&gt;
&lt;td&gt;16/22&lt;/td&gt;
&lt;td&gt;12/22&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;stable, wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3.5:9b (&lt;code&gt;think:false&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;9/22&lt;/td&gt;
&lt;td&gt;9/22&lt;/td&gt;
&lt;td&gt;21/22&lt;/td&gt;
&lt;td&gt;13.0&lt;/td&gt;
&lt;td&gt;bimodal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3.5:4b (&lt;code&gt;think:false&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;4/22&lt;/td&gt;
&lt;td&gt;19/22&lt;/td&gt;
&lt;td&gt;16/22&lt;/td&gt;
&lt;td&gt;13.0&lt;/td&gt;
&lt;td&gt;wild&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bug &lt;code&gt;qwen3.5:9b&lt;/code&gt; produced byte-identically in runs 1 and 2 was a regex requiring all three unit letters: &lt;code&gt;^(\d+)?H(\d+)?M(\d+)?S$&lt;/code&gt;. Only the digits are optional, so &lt;code&gt;"PT5M"&lt;/code&gt; falsely fails: the input has no &lt;code&gt;H&lt;/code&gt; or &lt;code&gt;S&lt;/code&gt; for the mandatory literals to match. Subtle, plausible-looking, and it ships unless you actually run the function. The 21/22 score in single-shot was the less common sampling path.&lt;/p&gt;
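
&lt;p&gt;The failure mode, concretely (assuming the &lt;code&gt;PT&lt;/code&gt; prefix is stripped before matching, as the quoted pattern suggests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# The buggy shape: digits optional, all three unit letters mandatory.
buggy = re.compile(r"^(\d+)?H(\d+)?M(\d+)?S$")
print(bool(buggy.match("1H30M5S")))   # True
print(bool(buggy.match("5M")))        # False -- so "PT5M" is rejected

# A correct shape makes each digits+letter pair optional as a unit
# (a real parser must still reject the all-empty match).
ok = re.compile(r"^(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
print(bool(ok.match("5M")))           # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;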

&lt;p&gt;&lt;code&gt;deepseek-coder-v2:16b&lt;/code&gt; is stably wrong: 0/6 valid inputs across all 3 runs. Same regex bug every time. Rerunning won't save it.&lt;/p&gt;

&lt;p&gt;I ran a cross-prompt confirmation on the two stable models with P3, 3 runs each. &lt;code&gt;gemma4&lt;/code&gt; 10/10/10. &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt; 10/10/9. &lt;code&gt;gemma4&lt;/code&gt; went 6 for 6 across both code-gen prompts, byte-stable. The point qwen2.5-coder lost was using &lt;code&gt;if v:&lt;/code&gt; (truthy check) instead of &lt;code&gt;if v is not None&lt;/code&gt;, silently dropping a &lt;code&gt;None&lt;/code&gt; value. Idiomatic but wrong.&lt;/p&gt;
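
&lt;p&gt;A minimal reconstruction of that failure mode (not the model's literal output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def flatten(d, sep="."):
    out = {}
    for k, v in d.items():
        if isinstance(v, dict):
            for kk, vv in flatten(v, sep).items():
                out[k + sep + kk] = vv
        elif v:                  # buggy truthy check: drops None (and 0, "", [])
            out[k] = v           # the fix is a plain `else: out[k] = v`
    return out

print(flatten({"a": {"b": None}, "c": 1}))   # {'c': 1} -- key "a.b" vanished
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;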

&lt;h2&gt;
  
  
  The thinking-mode trap
&lt;/h2&gt;

&lt;p&gt;First pass on Qwen3 with the default &lt;code&gt;think:true&lt;/code&gt;: &lt;code&gt;qwen3:14b&lt;/code&gt; returned 1 byte (&lt;code&gt;\n&lt;/code&gt;) after 1174 seconds of GPU time. Twenty minutes for nothing. Ollama's &lt;code&gt;/api/generate&lt;/code&gt; returns two fields for thinking-mode models: &lt;code&gt;response&lt;/code&gt; and &lt;code&gt;thinking&lt;/code&gt;. My script only logged &lt;code&gt;response&lt;/code&gt;. When I dumped the raw JSON, the 9B's &lt;code&gt;thinking&lt;/code&gt; field was 21 KB of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Wait, I need to check if I can use `src` if `import src.main_improved` is used.
* Yes.
* So I will use `src.main_improved`.
* Wait, I need to check if I can use `src` if `import src` is used.
* Yes.
* So I will use `src.main_improved`.
[...repeats until context fills...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;done_reason: "stop"&lt;/code&gt; on a 21,000-character thinking trace with no committed answer. The fix was one parameter: &lt;code&gt;"think": false&lt;/code&gt; in the request body. With it, all three Qwen3 sizes responded in 8 to 11 seconds and produced clean code.&lt;/p&gt;

&lt;p&gt;If you're benchmarking thinking-capable models against strict output requirements: smoke-test with &lt;code&gt;think:false&lt;/code&gt; first, and log both fields. One missing line of logging cost me 20 minutes of GPU time chasing what looked like crashes but was actually the model arguing with itself in a loop inside &lt;code&gt;thinking&lt;/code&gt;.&lt;/p&gt;
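
&lt;p&gt;A smoke-test sketch of that advice (the model name and prompt are placeholders; &lt;code&gt;response&lt;/code&gt; and &lt;code&gt;thinking&lt;/code&gt; are Ollama's field names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

body = json.dumps({
    "model": "qwen3:14b",
    "prompt": "Output only the Python function. No markdown, no preamble.",
    "stream": False,
    "think": False,                      # the one-parameter fix
    "options": {"temperature": 0.2},
}).encode()
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as r:
    data = json.loads(r.read())

# Log BOTH fields: an empty response next to a 21 KB thinking trace
# looks like a crash if you only ever print response.
print("response:", len(data.get("response", "")), "chars")
print("thinking:", len(data.get("thinking") or ""), "chars")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;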

&lt;h2&gt;
  
  
  Routing rules I ended up with
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parsers, regex, recursive transformers&lt;/strong&gt;: &lt;code&gt;gemma4:latest&lt;/code&gt;. Byte-stable perfect scores across 6 runs of 2 different prompts at temp 0.2 (22/22 on P2, 10/10 on P3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests, fixtures, anything needing Python module/runtime semantics&lt;/strong&gt;: &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;. Stable 20-22/22, the only model that handled the test-scaffolding trap correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mini tier (laptop, 4 GB VRAM)&lt;/strong&gt;: &lt;code&gt;qwen3.5:4b&lt;/code&gt; with &lt;code&gt;think:false&lt;/code&gt;, sample 5x at temp 0.7, run a verifier, keep the passer (sketched after this list). 3.4 GB, ~20s total. Hit rate &amp;gt;=18/22 was 60% in my runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip&lt;/strong&gt;: &lt;code&gt;qwen3:14b&lt;/code&gt; (stably mediocre, 16/22 mean) and &lt;code&gt;deepseek-coder-v2:16b&lt;/code&gt; (stably wrong on valid inputs, same regex bug 3/3 runs).&lt;/li&gt;
&lt;/ul&gt;
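
&lt;p&gt;A sketch of that best-of-N loop, reusing the hypothetical &lt;code&gt;ask&lt;/code&gt; and &lt;code&gt;grade&lt;/code&gt; helpers from the harness sketch above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def best_of_n(model, prompt, fn_name, cases, n=5, temperature=0.7):
    """Sample up to n candidates; return the first full pass, else the best."""
    best_code, best_score = None, -1
    for _ in range(n):
        code = ask(model, prompt, temperature=temperature)
        try:
            score = grade(code, fn_name, cases)
        except Exception:
            continue                     # unrunnable candidate scores nothing
        if score == len(cases):
            return code, score           # early exit on a full pass
        if score &gt; best_score:
            best_code, best_score = code, score
    return best_code, best_score         # fall back to the best partial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;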

&lt;p&gt;The most useful single observation: a general-purpose model beat the dedicated coder on every code-gen prompt that didn't require Python runtime reasoning. The "coder" label means trained on code, not best at every code task.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently next time
&lt;/h2&gt;

&lt;p&gt;Run every prompt against every model 5+ times from the start. The cheap single-shot run already cost me one wrong recommendation. Add &lt;code&gt;mypy --strict&lt;/code&gt; to the verifier to catch type-hint laziness that &lt;code&gt;exec()&lt;/code&gt; doesn't. And test &lt;code&gt;phi-4-mini&lt;/code&gt; and &lt;code&gt;granite-code:3b&lt;/code&gt; against &lt;code&gt;qwen3.5:4b&lt;/code&gt; for the mini-tier slot.&lt;/p&gt;

&lt;p&gt;If you've shipped &lt;code&gt;qwen3.5:4b&lt;/code&gt; (or anything smaller) in a best-of-N + verifier loop in production, I'd be curious about your hit rate and N.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Building a multi-source autonomous research agent with LangGraph, ThreadPoolExecutor and Ollama</title>
      <dc:creator>Roberto de la Cámara</dc:creator>
      <pubDate>Fri, 03 Apr 2026 13:11:08 +0000</pubDate>
      <link>https://dev.to/robertodelacamara/building-a-multi-source-autonomous-research-agent-with-langgraph-threadpoolexecutor-and-ollama-1ahk</link>
      <guid>https://dev.to/robertodelacamara/building-a-multi-source-autonomous-research-agent-with-langgraph-threadpoolexecutor-and-ollama-1ahk</guid>
      <description>&lt;p&gt;I wanted a tool that could research any topic deeply — not just one web search, but Wikipedia, arXiv, Semantic Scholar, GitHub, Hacker News, Stack Overflow, Reddit, YouTube and local documents, all at once. So I built it.&lt;/p&gt;

&lt;p&gt;This post covers the architecture decisions, the parallel execution model, the self-correction loop, and a few things that didn't work before I got it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://huggingface.co/spaces/ecerocg/research-agent" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/ecerocg/research-agent&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://github.com/RobertoDeLaCamara/Research-Agent" rel="noopener noreferrer"&gt;https://github.com/RobertoDeLaCamara/Research-Agent&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The problem with sequential research agents
&lt;/h2&gt;

&lt;p&gt;Most agent examples I found do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;search web → process → search wiki → process → search arxiv → process → synthesize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If each source takes 5–10 seconds (network + LLM processing), a 10-source agent takes 50–100 seconds minimum — before synthesis.&lt;/p&gt;

&lt;p&gt;The fix is obvious: run everything in parallel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;initialize_state
      │
plan_research  ←──────────────────────┐
      │                               │
parallel_search                    re-plan
      │                               │
consolidate ──→ evaluate ─────────────┘
      │              │
   report          END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph is implemented with LangGraph's &lt;code&gt;StateGraph&lt;/code&gt;. Each node receives the full &lt;code&gt;AgentState&lt;/code&gt; TypedDict and returns a partial update.&lt;/p&gt;
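
&lt;p&gt;For context, a plausible shape of that state (field names here are illustrative; the repo has the exact schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

class AgentState(TypedDict, total=False):
    topic: str
    persona: str                 # planning persona (see below)
    research_plan: list[str]     # source names chosen by plan_research
    web_results: str             # one results key per source, so parallel
    wiki_results: str            # writers never collide
    iteration_count: int
    next_node: str
    evaluation_report: str
    final_report: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;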




&lt;h2&gt;
  
  
  Parallel execution with ThreadPoolExecutor
&lt;/h2&gt;

&lt;p&gt;The core of the agent is &lt;code&gt;parallel_search_node&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parallel_search_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;source_functions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;search_web_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wiki&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;search_wiki_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arxiv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;search_arxiv_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scholar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;search_scholar_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;search_github_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;search_hn_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;so&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;search_so_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;search_reddit_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local_rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;local_rag_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;youtube&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;_youtube_combined_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;futures_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;source_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_functions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;futures_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_name&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures_map&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;source_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;futures_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each source function is independent and returns a partial state dict. &lt;code&gt;combined.update(result)&lt;/code&gt; merges all results. No locking is needed: each source writes to different state keys, and the merge itself runs on the main thread as futures complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YouTube is an exception&lt;/strong&gt; — search must complete before summarize can run, so it gets a sequential wrapper inside the parallel executor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_youtube_combined_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_videos_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;merged_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;summarize_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize_videos_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summarize_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This brings total research time from ~5 min sequential to ~45s on a decent connection.&lt;/p&gt;




&lt;h2&gt;
  
  
  The self-correction loop
&lt;/h2&gt;

&lt;p&gt;After parallel search, an evaluation node checks for knowledge gaps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_research_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluation_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

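    &lt;span class="c1"&gt;# prompt (built from the consolidated results) asks the LLM to list&lt;/span&gt;
    &lt;span class="c1"&gt;# remaining gaps; gaps_detected() parses that answer (both elided here)&lt;/span&gt;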
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;gaps_detected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;re_plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LangGraph conditional edge routes back to &lt;code&gt;plan_research&lt;/code&gt; or forward to &lt;code&gt;END&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluate_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;re_plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;consolidate_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On re-plan, the LLM can select different or additional sources based on what was missing. On niche topics this second pass noticeably improves coverage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dynamic research planning with personas
&lt;/h2&gt;

&lt;p&gt;Before searching, &lt;code&gt;plan_research_node&lt;/code&gt; asks the LLM which sources are relevant for the topic. This avoids wasting API calls on irrelevant sources.&lt;/p&gt;

&lt;p&gt;Five personas shape the planning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generalist&lt;/td&gt;
&lt;td&gt;Balanced across all sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Software Architect&lt;/td&gt;
&lt;td&gt;GitHub, HN, SO heavily weighted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Market Analyst&lt;/td&gt;
&lt;td&gt;Web, Reddit, HN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scientific Reviewer&lt;/td&gt;
&lt;td&gt;arXiv, Semantic Scholar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product Manager&lt;/td&gt;
&lt;td&gt;Web, Reddit, YouTube&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The persona prompt changes what the LLM considers "relevant", so the research plan — and therefore which threads run in parallel — differs per persona.&lt;/p&gt;
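
&lt;p&gt;A minimal sketch of how a persona-weighted planner can look (the prompt wording and &lt;code&gt;PERSONA_HINTS&lt;/code&gt; are illustrative, not the repo's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from langchain_core.messages import HumanMessage

PERSONA_HINTS = {
    "Generalist": "Weight all sources evenly.",
    "Software Architect": "Prefer github, hn and so.",
    "Scientific Reviewer": "Prefer arxiv and scholar.",
}

def plan_research_node(state: AgentState) -&gt; dict:
    hint = PERSONA_HINTS.get(state.get("persona", "Generalist"), "")
    prompt = (
        f"Topic: {state['topic']}\n{hint}\n"
        "From [web, wiki, arxiv, scholar, github, hn, so, reddit, "
        "local_rag, youtube], return a JSON list of the sources worth querying."
    )
    llm = get_llm(temperature=0.1)       # the factory from the section below
    response = llm.invoke([HumanMessage(content=prompt)])
    return {"research_plan": json.loads(response.content)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;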




&lt;h2&gt;
  
  
  LLM factory: local or cloud with zero config changes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_is_cloud_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ChatOllama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same agent code runs against local Ollama or Groq/Gemini/OpenAI: just swap env vars. The factory reads &lt;code&gt;os.environ&lt;/code&gt; at call time (not at import) so Streamlit sidebar overrides take effect without a restart.&lt;/p&gt;




&lt;h2&gt;
  
  
  What didn't work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;nonlocal&lt;/code&gt; in threaded callbacks&lt;/strong&gt; — I originally used &lt;code&gt;nonlocal&lt;/code&gt; to capture results from threads. Race conditions appeared under load. Fixed by switching to a mutable container pattern (&lt;code&gt;container = {"data": []}&lt;/code&gt;) and reading only after &lt;code&gt;thread.join()&lt;/code&gt;.&lt;/p&gt;
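
&lt;p&gt;The shape of that fix, schematically (&lt;code&gt;do_search&lt;/code&gt; stands in for any real source call):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import threading

def do_search(source: str) -&gt; str:      # placeholder for a real source call
    return f"results for {source}"

results: dict = {"data": []}

def fetch(container: dict, source: str) -&gt; None:
    # write into a shared container instead of rebinding a closure
    # variable via nonlocal
    container["data"].append((source, do_search(source)))

threads = [threading.Thread(target=fetch, args=(results, s))
           for s in ("web", "wiki")]
for t in threads:
    t.start()
for t in threads:
    t.join()        # read results["data"] only after every join()
print(results["data"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;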

&lt;p&gt;&lt;strong&gt;Pydantic v1 validators&lt;/strong&gt; — &lt;code&gt;@validator&lt;/code&gt; with positional &lt;code&gt;cls&lt;/code&gt; broke on Pydantic v2. Migrated to &lt;code&gt;@field_validator&lt;/code&gt; with &lt;code&gt;@classmethod&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YouTube as just another parallel source&lt;/strong&gt; — the first version treated YouTube like the other sources. Summarization needs the transcript, which needs the video URL, which needs the search. Making it a composed sequential node within the parallel executor fixed this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.env&lt;/code&gt; baked into Docker image&lt;/strong&gt; — &lt;code&gt;COPY . .&lt;/code&gt; was copying &lt;code&gt;.env&lt;/code&gt; into the image, leaking credentials. Added &lt;code&gt;.env&lt;/code&gt; to &lt;code&gt;.dockerignore&lt;/code&gt;. Docker Compose also interpolates &lt;code&gt;${OLLAMA_MODEL:-default}&lt;/code&gt; from &lt;code&gt;.env&lt;/code&gt;, which overrode the intended demo model. Hardcoded the values in &lt;code&gt;docker-compose.full.yml&lt;/code&gt; instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running it locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With local Ollama — no API keys needed:&lt;/span&gt;
git clone https://github.com/RobertoDeLaCamara/Research-Agent
&lt;span class="nb"&gt;cd &lt;/span&gt;Research-Agent
docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.full.yml up
&lt;span class="c"&gt;# pulls qwen2.5:1.5b automatically, starts at localhost:8501&lt;/span&gt;

&lt;span class="c"&gt;# With Groq free tier:&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;env.example .env
&lt;span class="c"&gt;# OPENAI_API_KEY=your_groq_key&lt;/span&gt;
&lt;span class="c"&gt;# OLLAMA_BASE_URL=https://api.groq.com/openai/v1&lt;/span&gt;
&lt;span class="c"&gt;# OLLAMA_MODEL=llama-3.1-8b-instant&lt;/span&gt;
docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Streaming output so the UI updates as each source completes&lt;/li&gt;
&lt;li&gt;Better synthesis prompts for small models (1.5b demo)&lt;/li&gt;
&lt;li&gt;Persistent research sessions with diff between iterations&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://huggingface.co/spaces/ecerocg/research-agent" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/ecerocg/research-agent&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://github.com/RobertoDeLaCamara/Research-Agent" rel="noopener noreferrer"&gt;https://github.com/RobertoDeLaCamara/Research-Agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions about any part of the architecture.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
