DEV Community

Cover image for How to Run Reliable Local LLM Agents on an RTX 3090: A Benchmark (5 Models, Priced in Watts)
Arsen Apostolov
Arsen Apostolov

Posted on

How to Run Reliable Local LLM Agents on an RTX 3090: A Benchmark (5 Models, Priced in Watts)

I gave GLM-4.5-Air (106B, open weights) 12 coding tasks through opencode on my RTX 3090. It scored 0% — never edited a single file.

Same model, same GPU, same tasks, but driven by a ~150-line LangGraph agent instead: 93%.

The model was never the problem. The orchestrator was. Here's the benchmark — including the part nobody else measures, the electricity cost per correct task.

opencode vs LangGraph tool-adherence

Setup

  • RTX 3090 (24 GB) + 128 GB RAM, models via ollama, Q4 quants, temp 0.2
  • 5 recent open models × 2 orchestrators (opencode vs custom LangGraph ReAct with ollama-native tool-calling)
  • 17 graded tasks (12 coding in Python/JS/C++ + 5 general-agent) with hidden unit tests
  • Every run priced in GPU watts via my open-source homelab-monitor

Results

Model tok/s opencode adh. LangGraph adh. LangGraph coding LangGraph general
Qwen3-Coder 30B-A3B 130 92% 100% 100% 100%
GLM-4.5-Air 106B 5.7 0% 100% 89% 100%
Devstral Small 24B 49 8% 53% 8% 40%
Seed-OSS 36B 9.5 0% 7% 0% 20%
DeepSeek-R1-Distill 32B 6.7 0% 0% 0% 0%

Tool-adherence = % of tasks where the model actually called a tool instead of just printing code in chat. It was the master variable. (GLM's headline "93%" is its blended score across all 17 tasks: 89% coding + 100% general.)

Three takeaways

  1. The framework can matter more than the model. opencode sends a frontier-shaped system prompt + 12 tools over its OpenAI-compat path; most local models fall back to chatting. Native tool-calling through a lean agent fixes that — GLM went 0% → 93%. (Qwen3-Coder is the exception: it's tuned for agentic tool use and aces opencode out of the box.)
  2. Acting ≠ solving. LangGraph made Devstral act (8% → 53% adherence) but not solve (coding stayed 8%). The framework decides whether a model acts; the model decides whether it's right.
  3. The wattmeter ranks honestly. Qwen solved tasks at ~0.0005 BGN each; the models that scored zero still burned 10–30× more energy for nothing. On a home rig, the cheapest model is the fast, correct one — and MoE (Qwen activates ~3B of 30B per token) wins twice.

Bonus: 128 GB RAM let me run the 106B GLM (23 GB VRAM + 27 GB spilled to RAM) — it works, at 5.7 tok/s. Great for fire-and-forget batch jobs, not interactive coding.

The recipe for reliable local agents

Pick a tool-use-tuned model (Qwen3-Coder 30B-A3B is the all-weather winner) → use native tool-calling, not an OpenAI-compat path → keep the harness lean → use RAM for reach, not speed → measure correctness per kWh.

📖 Full write-up with methodology, charts, and the deeper "why" → [https://medium.com/@arsen.apostolov/local-llm-agents-on-an-rtx-3090-i-benchmarked-5-models-2-frameworks-and-the-orchestrator-f5fd600ca221]

⭐ Every number was priced in watts by homelab-monitor — my open-source tool that turns your GPU's power draw into per-task cost. Star it if you want the same receipts for your own rig. Harness + tasks + leaderboard code are reproducible.

Top comments (0)