DEV Community

Amariah Kamau

Atlarix vs opencode on Terminal-Bench 2.0 — same model, only the harness changes (k=1, receipts included)

Benchmarks


I build Atlarix, an agent workstation for open-weight models. The core claim behind it is that the harness — retrieval, tool surface, control loop — is what lets an open-weight model perform, not just the model's raw weights. This post is me trying to falsify that claim with a controlled run, and publishing every output file so you can check it.

Short version: on Terminal-Bench 2.0 with a single attempt per task, Atlarix resolved 42/89 and opencode resolved 39/89 on the same model. That 3-task gap is within k=1 noise, so I'm not claiming a win. What it suggests is that Atlarix isn't bottlenecking the model relative to an established harness. Details and caveats below; raw files at the end.

The experiment

The only variable is the harness. Everything else is pinned identical across both agents.

  • Benchmark: terminal-bench/terminal-bench-2 — all 89 tasks, one isolated container each, automated verifiers.
  • Model: minimax/minimax-m3, routed through OpenRouter, pinned to a single provider at fp8 — identical for both harnesses.
  • Infrastructure: Harbor on Modal (-e modal), one container per task.
  • Attempts: single attempt, -k 1.
  • Timeout: native, --timeout-multiplier 1 (same for both).
  • Retries: --max-retries 3 (same for both).
  • Tool calling: native function-calling forced, no text-tool shim.

Commands

# Atlarix harness
harbor run -d terminal-bench/terminal-bench-2 \
  -m openai/minimax/minimax-m3 \
  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
  -e modal --agent-import-path atlarix_tb:AtlarixAgent

# opencode harness (same model + provider + infra)
harbor run -d terminal-bench/terminal-bench-2 \
  -m bench/minimax/minimax-m3 \
  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
  -e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent

(-n 24 is concurrency — how many containers run in parallel — not a task count. All 89 tasks run.)
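
One way to check the "only the harness changes" claim is to diff the two commands mechanically. The following is a minimal sketch, assuming every flag is either a boolean switch or a simple `flag value` pair; the helper names `parse_flags` and `differing_flags` are made up for illustration and are not part of Harbor.

```python
import shlex

# The two commands from this post, joined onto one line each.
ATLARIX_CMD = (
    "harbor run -d terminal-bench/terminal-bench-2 "
    "-m openai/minimax/minimax-m3 "
    "-n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 "
    "-e modal --agent-import-path atlarix_tb:AtlarixAgent"
)
OPENCODE_CMD = (
    "harbor run -d terminal-bench/terminal-bench-2 "
    "-m bench/minimax/minimax-m3 "
    "-n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 "
    "-e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent"
)


def parse_flags(command):
    """Map each flag to its value, or to True for a bare switch such as -y."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-"):
            # A flag takes a value only if the next token is not another flag.
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            flags[token] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # positional tokens ("harbor", "run") are skipped
    return flags


def differing_flags(cmd_a, cmd_b):
    """Return {flag: (value_a, value_b)} for every flag whose value differs."""
    a, b = parse_flags(cmd_a), parse_flags(cmd_b)
    return {
        flag: (a.get(flag), b.get(flag))
        for flag in sorted(set(a) | set(b))
        if a.get(flag) != b.get(flag)
    }


for flag, (left, right) in differing_flags(ATLARIX_CMD, OPENCODE_CMD).items():
    print(f"{flag}: {left}  vs  {right}")
```

Under those assumptions only `-m` and `--agent-import-path` differ. The two `-m` values share the `minimax/minimax-m3` suffix; the `openai/` and `bench/` prefixes look like per-harness routing aliases, but that reading is an assumption and not something the commands alone confirm.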

Results

| Harness | Resolved | Score |
| --- | --- | --- |
| Atlarix | 42 / 89 | 47% |
| opencode | 39 / 89 | 44% |

Read this before you interpret the table

k=1 means one sample per task. The official Terminal-Bench leaderboard requires k=5 specifically to measure run-to-run variance. A 3-task difference at k=1 is inside that noise band. So this is not a leaderboard result and not a claim that Atlarix beats opencode. The honest takeaway: an open-weight model performs about as well under Atlarix as under a strong existing harness — the harness isn't holding it back.
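
For a rough sense of scale, the 3-task gap can be compared with a simple binomial standard error. This is a back-of-the-envelope sketch, not the run-to-run variance that k=5 would measure: it treats every task as an independent coin flip with the same pass probability and ignores the fact that both harnesses ran the same 89 tasks. The function names are illustrative.

```python
import math

TOTAL_TASKS = 89
RESOLVED = {"Atlarix": 42, "opencode": 39}


def binomial_std_error(resolved, total):
    """Standard error of a pass rate, treating tasks as independent coin flips."""
    p = resolved / total
    return math.sqrt(p * (1 - p) / total)


def gap_in_std_errors(resolved_a, resolved_b, total):
    """Gap between two pass rates, in units of its unpaired standard error."""
    gap = (resolved_a - resolved_b) / total
    se_gap = math.hypot(
        binomial_std_error(resolved_a, total),
        binomial_std_error(resolved_b, total),
    )
    return gap / se_gap


for name, resolved in RESOLVED.items():
    rate = resolved / TOTAL_TASKS
    se = binomial_std_error(resolved, TOTAL_TASKS)
    print(f"{name}: {rate:.1%} +/- {se:.1%} (about {se * TOTAL_TASKS:.1f} tasks)")

z = gap_in_std_errors(RESOLVED["Atlarix"], RESOLVED["opencode"], TOTAL_TASKS)
print(f"gap / standard error = {z:.2f}")
```

Under this crude model each score carries a standard error of roughly five tasks, and the gap is well under one standard error, which is consistent with reading the two results as indistinguishable at k=1.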

~25% of tasks timed out, for both harnesses. At the native timeout (×1), roughly a quarter of tasks hit AgentTimeoutError on each side and count as unresolved. So the sub-50% absolute scores aren't all capability failures; a meaningful share are wall-clock timeouts on heavy tasks. The timeout ceiling is identical for both agents, so the comparison stays fair, but it is a large part of why neither number is higher.
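
To make the "not all capability failures" point concrete, the unresolved tasks can be split into timeouts and everything else. This is illustrative arithmetic only: the 25% share is the rounded figure from the text, applied equally to both harnesses as an assumption. The exact per-harness counts live in the published result files, and `split_outcomes` is a name invented for this sketch.

```python
TOTAL_TASKS = 89
ASSUMED_TIMEOUT_SHARE = 0.25  # assumption: the post only says "roughly a quarter"
RESOLVED = {"Atlarix": 42, "opencode": 39}


def split_outcomes(resolved, total=TOTAL_TASKS, timeout_share=ASSUMED_TIMEOUT_SHARE):
    """Split one harness's tasks into resolved, timed out, and other failures."""
    timed_out = round(total * timeout_share)
    return {
        "resolved": resolved,
        "timed_out": timed_out,
        "other_failures": total - resolved - timed_out,
    }


for name, resolved in RESOLVED.items():
    print(name, split_outcomes(resolved))
```

With that assumed share, timeouts account for a little under half of the unresolved tasks on each side, so the gap to 100% is split between wall-clock limits and genuine failures.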

The one config change (full disclosure)

Atlarix's desktop app asks for human approval before every file write and command — a core safety feature. Benchmarks run unattended, so I grant that approval once via an explicit operator flag (ATLARIX_AUTONOMOUS_DANGER=1). Without it, any task needing an install or privileged command is blocked and fails.

This is not an advantage over opencode — every agent auto-approves to run an automated benchmark; it's inherent to running unattended. Stating it for full transparency. The flag is off by default; the interactive app always asks.
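
The approval behaviour described above can be pictured as a gate like the one below. This is a hypothetical sketch, not Atlarix's actual code: only the ATLARIX_AUTONOMOUS_DANGER variable name comes from this post, while the `is_approved` function and the prompt wording are invented for illustration.

```python
import os

AUTONOMY_FLAG = "ATLARIX_AUTONOMOUS_DANGER"  # name taken from the post


def is_approved(action, env=None, ask=input):
    """Return True if the action may run.

    With the operator flag set to "1", approval is granted without a prompt
    (the unattended benchmark case). Otherwise a human is asked each time.
    """
    env = os.environ if env is None else env
    if env.get(AUTONOMY_FLAG) == "1":
        return True
    answer = ask(f"Allow '{action}'? [y/N] ")
    return answer.strip().lower() == "y"


# Unattended run: the flag is set, so no prompt is shown.
print(is_approved("apt-get install -y build-essential", env={AUTONOMY_FLAG: "1"}))

# Interactive default: the flag is absent, so the (simulated) human decides.
print(is_approved("rm -rf build/", env={}, ask=lambda prompt: "n"))
```

The point of the sketch is that the flag only removes the human from the loop; it does not change which actions the agent is able to attempt.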

Reproduce it

The exact Atlarix bundle I ran is a public, Electron-free headless build: atlarix-headless-linux-amd64.tar.gz. The benchmark is the open-source Harbor framework. The raw Harbor result files — per-task pass/fail for both harnesses — are published unedited. Nothing is hand-typed.

Everything (raw result.json for both sides, summary.csv, exact bundle, full setup): atlarix.dev/benchmark

What's next

  • More open-weight models, so no claim rests on one.
  • The official Terminal-Bench (k=5) submission — on the roadmap.
  • More benchmarks beyond terminal tasks.

If you spot something wrong in the result files, that's the point — tell me.

Built in Nairobi.
