
Alan West


Why your local LLM aces benchmarks but fails real terminal tasks

Last month I spent an entire weekend frustrated by the same pattern. I'd download a shiny new open-weight model, see it crush MMLU and HumanEval, then watch it faceplant the second I handed it a multi-step shell task. "Find the largest log file in /var/log, grep for OOM errors, and write a summary." The model would confidently invent flags that don't exist, forget what it ran two steps ago, or get stuck in a loop running ls forever.

If you've tried running local models as terminal agents, you know the feeling. The score on the leaderboard says one thing; your actual workflow says another. With agentic benchmarks like Terminal-Bench 2.0 getting more attention (and newer MoE models like the Qwen3.6 family reportedly landing on the public board), it's worth understanding why this gap exists and what you can do about it.

The root cause: static benchmarks aren't agentic benchmarks

Most of the scores you see on Hugging Face leaderboards measure single-turn reasoning. The model gets a prompt, produces an answer, done. That tells you almost nothing about how the same model behaves when it has to:

  • Decide which tool to call
  • Parse messy stdout from a real shell
  • Remember state across 15+ turns
  • Recover when a command fails

This is the gap that benchmarks like Terminal-Bench try to close. They put the model in an actual sandbox, give it a real task, and grade it on whether the task got done — not whether the intermediate reasoning looked plausible.

The problem is that until you run an agentic eval yourself, you have no way to know if the model you're betting your stack on actually works for your use case.

Setting up a local agentic eval harness

Here's the approach I've been using to sanity-check models before committing to one. The core idea: simulate the same loop your production agent would run, but against a fixed task set you control.

First, a minimal tool-call loop. I'll use the transformers library since it works with most open-weight models out of the box.

from transformers import AutoModelForCausalLM, AutoTokenizer
import subprocess, json

MODEL_ID = "your-model-here"  # swap in whatever you're testing
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",  # let HF pick bf16/fp16 based on hardware
)

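def run_shell(cmd: str, timeout: int = 10) -> str:
    # Always use a sandbox in real evals; this is illustrative
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as an observation instead of crashing the harness
        return f"[command timed out after {timeout}s]"
    return result.stdout + result.stderr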
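The sandbox comment above deserves a concrete shape. Here is a minimal sketch, assuming Docker is installed on the host; the `debian:stable-slim` image, the 512 MB memory cap, and the helper names `docker_argv` and `run_shell_sandboxed` are illustrative choices, not something the harness requires.

```python
import subprocess

def docker_argv(cmd: str, image: str = "debian:stable-slim") -> list[str]:
    """Build the argv that runs `cmd` in a disposable container with no network."""
    return [
        "docker", "run",
        "--rm",               # delete the container when the command exits
        "--network", "none",  # no network access from inside the sandbox
        "--memory", "512m",   # cap memory so a runaway command can't starve the host
        image,                # assumption: any small image with a POSIX shell works
        "sh", "-c", cmd,
    ]

def run_shell_sandboxed(cmd: str, timeout: int = 30) -> str:
    """Drop-in variant of run_shell that executes inside the container."""
    try:
        result = subprocess.run(
            docker_argv(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"[command timed out after {timeout}s]"
    return result.stdout + result.stderr
```

Two caveats with this sketch: every call starts a fresh container, so files written in one turn are gone in the next, and a timed-out container can keep running after the client is killed. A fuller harness would likely start one named container per task, run each command with `docker exec`, and remove the container when the task ends.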

Next, the agent loop itself. The thing that surprised me when I first wrote this: most failures don't happen in the model. They happen at the boundary — bad parsing, dropped context, no recovery path.

def agent_step(history, max_new_tokens=512):
    # Apply the model's chat template — this matters a lot for instruct models
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
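    # The chat template already inserts special tokens (e.g. BOS); don't add them twice
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)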
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # deterministic for evals
    )
    # Slice off the prompt tokens so we only decode the new output
    new_tokens = out[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def run_task(task: str, max_turns: int = 20):
    history = [
        {"role": "system", "content": "You are a shell agent. Reply with a single JSON object: {\"cmd\": \"...\"} or {\"done\": \"summary\"}."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = agent_step(history)
        history.append({"role": "assistant", "content": reply})
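        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            # Parsing failures are a HUGE source of false-negative scores
            history.append({"role": "user", "content": "Reply must be valid JSON."})
            continue
        if not isinstance(action, dict) or not ("cmd" in action or "done" in action):
            # Valid JSON but the wrong shape (a list, a bare string, a missing key)
            history.append({"role": "user", "content": "Reply must be a JSON object with a \"cmd\" or \"done\" key."})
            continue
        if "done" in action:
            return action["done"]
        observation = run_shell(action["cmd"])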
        history.append({"role": "user", "content": f"<output>\n{observation}\n</output>"})
    return None  # ran out of turns

That's the skeleton. The interesting part is the failure modes you'll see.

What actually goes wrong (and how to fix it)

After running this harness against half a dozen open-weight models on the same fixed task set, here are the patterns I keep hitting:

1. The model ignores your output format

The most common failure isn't a reasoning failure. It's that the model wraps its JSON in markdown fences, or adds a chatty preamble, or hallucinates a thoughts field your parser doesn't know about. The fix isn't more prompting — it's constrained decoding.

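# Use a library like `outlines` or `lm-format-enforcer`
# to force the model to emit valid JSON matching your schema.
# The outlines API has changed between releases; check the docs for the version you install.
from outlines import models, generate

# Cover both reply shapes the harness accepts: {"cmd": ...} or {"done": ...}
schema = '{"type": "object", "properties": {"cmd": {"type": "string"}, "done": {"type": "string"}}, "additionalProperties": false}'
# Constrained output parses even from smaller models, as long as generation isn't cut off by the token limit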

This single change moved one 9B model I tested from ~30% task completion to ~55% on my local set. The model was capable; it just kept tripping the parser.
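Constrained decoding isn't always an option, for example when the model sits behind an endpoint you don't control. A tolerant parser is a weaker fallback, not a replacement for the fix above. This sketch uses only the standard library; the `extract_action` name is hypothetical. It scans the reply for the first decodable JSON object, which handles markdown fences, chatty preambles, and extra fields.

```python
import json

def extract_action(reply: str):
    """Return the first JSON object embedded in `reply`, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = reply.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores trailing text
            obj, _end = decoder.raw_decode(reply, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        pos = reply.find("{", pos + 1)  # try the next candidate brace
    return None
```

In `run_task`, this could stand in for the bare `json.loads(reply)` call, with a `None` result treated the same way as a decode error. Unknown keys such as a stray thoughts field are simply ignored, since the loop only reads `cmd` and `done`.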

2. Context collapse around turn 8–10

Long shell sessions get noisy fast. A single ls -la /usr/bin can dump thousands of tokens. By turn 10 the model has often lost track of the original task.

The practical fix: truncate or summarize old observations aggressively. Keep the original task and the last 2–3 turns verbatim; collapse everything in between.
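One way to sketch that policy, assuming the message layout from `run_task` (system prompt, task, then alternating assistant and user messages): the function below truncates old observations rather than summarizing them, which is the cheaper of the two options. The name `compact_history` and the default limits are illustrative.

```python
def compact_history(history, keep_last_turns=3, max_chars=200):
    """Return a copy of `history` with old, long observations truncated.

    The first two messages (system prompt and task) and the last
    `keep_last_turns` assistant/observation pairs are kept verbatim.
    """
    head, tail = history[:2], history[2:]
    # Each turn is two messages: the assistant reply and the observation
    cutoff = max(len(tail) - 2 * keep_last_turns, 0)
    compacted = list(head)
    for i, msg in enumerate(tail):
        content = msg["content"]
        if i < cutoff and msg["role"] == "user" and len(content) > max_chars:
            omitted = len(content) - max_chars
            content = content[:max_chars] + f"\n[... {omitted} characters omitted ...]"
        compacted.append({"role": msg["role"], "content": content})
    return compacted
```

Calling `agent_step(compact_history(history))` keeps the full log in `history` for later inspection while the model sees only the trimmed view. Truncating by characters is a rough proxy for tokens; counting with the tokenizer would be more precise.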

3. MoE models need different inference tuning

If you're testing newer mixture-of-experts releases (the "A3B" suffix in some recent Qwen releases reportedly indicates ~3B active parameters per token), the default transformers settings often leave performance on the table. For these, I've had much better latency with vllm:

pip install vllm
# --tensor-parallel-size 2 shards the model across two GPUs; omit it on a single-GPU machine
vllm serve your-model-here --tensor-parallel-size 2

Then point your harness at the OpenAI-compatible endpoint instead of running the model in-process. The throughput difference on multi-turn agent loops is noticeable: each task can involve a dozen or more generation calls, and every call processes a longer history than the last.
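A rough sketch of that swap, using only the standard library. It assumes the server is listening on vLLM's default `http://localhost:8000` and that the model name matches what was passed to `vllm serve`; the function names `build_chat_request` and `agent_step_http` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port

def build_chat_request(history, model, max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": history,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, mirroring do_sample=False
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def agent_step_http(history, model="your-model-here", timeout=120):
    """Replacement for agent_step that calls the server instead of a local model."""
    request = build_chat_request(history, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The server applies the chat template itself, so the harness sends the raw message list rather than a rendered prompt. That is also one reason scores can shift between the in-process and served setups.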

Prevention: bake the eval into your workflow

The meta-lesson from all this: don't trust leaderboards for your specific use case. They're a useful filter, but a 5-point gap on Terminal-Bench means almost nothing if the model fails on the specific commands your agent runs.

A few habits that have saved me time:

  • Keep a fixed task set of 20–30 representative jobs. Re-run them against every model you consider. Same prompts, same scoring, same sandbox.
  • Log every failed turn. Most regressions show up as parsing or format issues long before they show up as reasoning issues.
  • Test the inference stack, not just the weights. The same model on transformers vs vllm vs llama.cpp can score differently because of subtle tokenization or sampling defaults.
  • Check the official model card and benchmark source before quoting numbers. Leaderboard scores get updated; blog posts don't.
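The first two habits can be combined in a small runner. This is a sketch under simplifying assumptions: each task is graded by a check on the final answer (a stricter harness would inspect the sandbox state), and failures are logged per task rather than per turn, since `run_task` as written doesn't expose its history. The `Task` and `evaluate` names and the log file name are illustrative.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[Optional[str]], bool]  # grades the agent's final answer

def evaluate(tasks, run_task, log_path="eval_failures.jsonl"):
    """Run every task, append failures to a JSONL log, and return the pass rate."""
    passed = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for task in tasks:
            try:
                answer = run_task(task.prompt)
                ok = bool(task.check(answer))
                error = None
            except Exception as exc:  # a crash in the harness counts as a failure
                answer, ok, error = None, False, repr(exc)
            if ok:
                passed += 1
            else:
                record = {"task": task.name, "answer": answer, "error": error}
                log.write(json.dumps(record) + "\n")
    return passed / len(tasks) if tasks else 0.0
```

Passing `run_task` in as an argument means the same task list can be replayed against the in-process loop, a served endpoint, or a different model without touching the scoring code.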

The gap between "this model benchmarks well" and "this model works in my agent" is real, and it's almost always closeable with better tooling around the model rather than a bigger model. Start with the harness, find your actual bottleneck, then decide what to swap.
