Measuring AI agent performance by actual outcome correctness, not just tool call presence
Why We Built This Benchmark
To make local agents accessible to general users, it is crucial to find an LLM with the lowest possible VRAM footprint.
Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag. But for tool-using AI agents, what truly matters isn't "did it call the right tool?" — it's "did it actually produce the correct result?"
Our project Androi is a local AI agent that uses 10+ tools including web search, Python execution, file management, email, and calendar. We connected various LLMs to the same agent and ran 5 identical complex real-world scenarios, scoring each model on the correctness of its outputs.
Test Environment
- Server: Ubuntu VM (3.8GB RAM, 20GB SSD)
- Runtime: Ollama (local inference)
- Framework: Androi Agent (Node.js + Python tool pipeline)
- Validation: Outcome-Based Validation (v2)
- Test Date: 2026-02-28
The 5 Real-World Test Scenarios (39 Total Checks)
Each test requires the agent to chain multiple tools sequentially to complete a complex, multi-step task.
U01. 🏦 Global Asset Rebalancing Advisor (9 checks)
Scenario: The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:
- Web search current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price)
- Convert to KRW and calculate total portfolio value
- Execute Python code to compute each asset's weight (%)
- Compare against ideal allocation (Stocks 40%, Crypto 20%, USD 20%, Gold 20%) and recommend rebalancing
- Save report to file (`/tmp/rebalance_report.txt`)
- Register calendar event for next Friday's review
- Send email with the report attached
Validation Checks: Samsung price, Bitcoin price, USD rate, Gold price, Total calculation, Weight analysis, Rebalancing recommendation, File saved, Email sent
Required Tools: web_search × 4, run_python_code / calculate, write_file, create_event, send_email
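The weight-calculation step the agent must perform can be sketched in a few lines of Python. The prices below are placeholder snapshots, not the live quotes the agent would fetch via `web_search`:

```python
# Hypothetical snapshot prices in KRW (the agent would fetch live values via web search).
holdings = {
    "samsung": {"qty": 50, "price_krw": 80_000},        # 50 shares
    "btc":     {"qty": 0.1, "price_krw": 140_000_000},  # 0.1 BTC
    "usd":     {"qty": 3_000, "price_krw": 1_400},      # $3,000 at 1,400 KRW/USD
    "gold":    {"qty": 1, "price_krw": 3_500_000},      # 1 oz
}
target = {"samsung": 40, "btc": 20, "usd": 20, "gold": 20}  # ideal allocation (%)

values = {k: v["qty"] * v["price_krw"] for k, v in holdings.items()}
total = sum(values.values())
weights = {k: round(100 * v / total, 1) for k, v in values.items()}
# Positive delta = buy more of this asset; negative = trim it.
rebalance = {k: round(target[k] - weights[k], 1) for k in weights}
```

With these placeholder prices the portfolio is heavily overweight crypto, so the recommendation step would suggest trimming BTC toward the 20% target.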
U02. 📊 Real-Time Tech Trend Research & Report (8 checks)
Scenario: Research the 2026 AI semiconductor market and produce a comprehensive report.
- Search "AI semiconductor market forecast 2026" → collect market size data
- Search "NVIDIA HBM market share 2026" → understand competitive landscape
- Search "Samsung HBM3E mass production" → Korean industry status
- Generate markdown report using Python with collected data
- Save report to `/tmp/ai_semiconductor_report.md`
- Register weekly automated task for trend updates
- Send report via email
Validation Checks: Market size mentioned, NVIDIA mentioned, HBM mentioned, Samsung trends, SK Hynix trends, Report saved, Auto-task registered, Email sent
Required Tools: web_search × 3, run_python_code, write_file, create_task, send_email
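The report-generation step reduces to templating search findings into markdown. A minimal sketch, with illustrative placeholder findings standing in for the agent's actual search results:

```python
# Illustrative data; the real agent would fill these from its web_search results.
findings = {
    "Market size": "AI semiconductor market projected to grow through 2026 (placeholder).",
    "NVIDIA": "Dominant share of the HBM-attached accelerator market (placeholder).",
    "Samsung / SK Hynix": "HBM3E mass production ramping up (placeholder).",
}

# Build a markdown document: title, then one section per finding.
lines = ["# AI Semiconductor Market Report (2026)", ""]
for heading, body in findings.items():
    lines += [f"## {heading}", body, ""]
report = "\n".join(lines)

with open("/tmp/ai_semiconductor_report.md", "w") as f:
    f.write(report)
```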
U03. 🖥️ Server Health Check + Auto-Recovery + Alerts (7 checks)
Scenario: Perform a comprehensive VM server health check and generate a report.
- Run `df -h` to check disk usage
- Run `free -h` to check memory status
- Run `systemctl list-units --state=failed` to identify failed services
- Use Python to analyze the last 50 lines of `/var/log/syslog` for ERROR/WARNING/CRITICAL frequency
- Use `find` to list temp files older than 7 days
- Save full report with risk level assessment (High/Medium/Low)
- Register hourly auto-check task
Validation Checks: Disk usage, Memory status, Service status, Log analysis, Risk level assessment, Report saved, Auto-task registered
Required Tools: run_command × 4, run_python_code, write_file, create_task
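The log-analysis step might look like the sketch below: count severity keywords in the last 50 lines and map the counts to a risk level. The thresholds are my own illustration, not the benchmark's actual rubric:

```python
from collections import Counter

def analyze_log(lines, tail=50):
    """Count severity keywords in the last `tail` lines and assign a risk level."""
    counts = Counter()
    for line in lines[-tail:]:
        for level in ("CRITICAL", "ERROR", "WARNING"):
            if level in line:
                counts[level] += 1
    # Illustrative thresholds, not the benchmark's definition of High/Medium/Low.
    if counts["CRITICAL"] > 0 or counts["ERROR"] >= 5:
        risk = "High"
    elif counts["ERROR"] > 0 or counts["WARNING"] >= 5:
        risk = "Medium"
    else:
        risk = "Low"
    return counts, risk

# In the real scenario the input is /var/log/syslog; sample lines for illustration:
sample = [
    "Feb 28 10:00 host app: WARNING disk at 85%",
    "Feb 28 10:01 host app: ERROR failed to bind port",
]
counts, risk = analyze_log(sample)
```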
U04. 🌍 Travel Planner (8 checks)
Scenario: Plan a weekend trip to Jeju Island (1 night, 2 days).
- Search "Jeju Island February weather" → temperature and conditions
- Search "Jeju winter restaurant recommendations 2026" → select 3 restaurants
- Search "Jeju winter tourist attractions" → select 3 attractions
- Use Python to create a Day 1 / Day 2 timetable (09:00–21:00, alternating attractions and restaurants)
- Calculate estimated budget (meals: 30K KRW × 6, hotel: 150K, transport: 50K = 380K KRW)
- Save travel plan to file + Register calendar events (departure/return)
- Send plan via email
Validation Checks: Weather info, Restaurant recommendations, Tourist attractions, Day 1/Day 2 separation, Timetable, Cost calculation, Calendar events, Email sent
Required Tools: web_search × 3, run_python_code, calculate, write_file, create_event × 2, send_email
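The budget arithmetic comes straight from the scenario prompt and is easy to verify; the timetable sketch alternates placeholder stops in two-hour slots (the names would come from the agent's searches):

```python
# Budget figures from the scenario prompt (KRW): 6 meals + 1 hotel night + transport.
budget = 30_000 * 6 + 150_000 + 50_000
assert budget == 380_000

# Day-1 timetable sketch: fill 2-hour slots starting at 09:00.
# Stop names are placeholders; the agent would pick them from search results.
stops = ["Attraction 1", "Lunch spot", "Attraction 2", "Dinner spot"]
timetable = [(f"{9 + 2 * i:02d}:00", stop) for i, stop in enumerate(stops)]
```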
U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)
Scenario: Analyze the server's tool_registry.py file and produce a code review report.
- Use `read_file` to read the entire source code
- Execute Python to count lines, functions, and classes
- Run `wc -l /root/xoul/tools/*.py` to check total module size
- Use `calculate` to compute tool_registry.py's percentage of the total codebase
- Save analysis report to `/tmp/code_analysis.txt`
- Store key findings in memory (recall/memorize)
- Send report via email
Validation Checks: Line count, Function count, Total module size, Percentage calculated, Code structure explained, Report saved, Email sent
Required Tools: read_file, run_python_code, run_command, calculate, write_file, memorize, send_email
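The counting step can be sketched with Python's `ast` module. The source string here is a stand-in, since I don't have access to the actual tool_registry.py:

```python
import ast

def summarize_source(source: str) -> dict:
    """Count lines, function definitions, and class definitions in Python source."""
    tree = ast.parse(source)
    funcs = sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                for n in ast.walk(tree))
    classes = sum(isinstance(n, ast.ClassDef) for n in ast.walk(tree))
    return {"lines": len(source.splitlines()), "functions": funcs, "classes": classes}

# Stand-in snippet; the agent would read_file the real tool_registry.py.
sample = "class ToolRegistry:\n    def register(self, tool):\n        pass\n"
stats = summarize_source(sample)
```

Walking the full AST (rather than only top-level nodes) also counts methods nested inside classes, which matches what "function count" usually means in a code review.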
Validation Method: Outcome-Based
Instead of checking "did it call the right tool?", we verify "does the output contain the correct information?"
100% = 🏆 PERFECT — All validation checks passed
≥70% = ✅ GOOD — Most critical outcomes achieved
≥50% = ⚠️ PARTIAL — More than half achieved
<50% = ❌ FAIL — Critical outcomes missing
For example, in U01 if the agent didn't explicitly call send_email but the response contains "email sent successfully", it passes. Conversely, calling web_search but not including Samsung's stock price in the response is a fail.
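In my understanding, such a checker reduces to pattern matching over the agent's final output plus filesystem checks, with the score mapped to the grade bands above. A minimal sketch; the keyword rules are illustrative, not the benchmark's actual check definitions:

```python
import os
import re

def run_checks(response: str, checks: dict) -> tuple[int, int]:
    """Each rule is either 'file:<path>' (path must exist) or a regex
    that must match the agent's final response."""
    passed = 0
    for name, rule in checks.items():
        if rule.startswith("file:"):
            ok = os.path.exists(rule[5:])
        else:
            ok = re.search(rule, response, re.IGNORECASE) is not None
        passed += ok
    return passed, len(checks)

def grade(passed: int, total: int) -> str:
    pct = passed / total
    if pct == 1.0:
        return "PERFECT"
    if pct >= 0.7:
        return "GOOD"
    if pct >= 0.5:
        return "PARTIAL"
    return "FAIL"

# Illustrative U01-style checks (hypothetical rules, not the real suite):
checks = {
    "samsung_price": r"samsung.*\d",
    "email_sent": r"email sent",
}
response = "Samsung Electronics: 80,000 KRW. ... Email sent successfully."
passed, total = run_checks(response, checks)
```

This is exactly why the U01 example passes: "email sent successfully" satisfies the outcome rule even without an observed `send_email` call.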
🏆 Final Rankings
| Rank | Model | Parameters | Score | 🏆 PERFECT | ✅ GOOD | Speed | Value |
|---|---|---|---|---|---|---|---|
| 🥇 | GPT-oss-20B | 20B | 37/39 (95%) | 3 | 2 | 264s | ⭐⭐⭐ |
| 🥈 | Qwen3.5-27B | 27B | 37/39 (95%) | 4 | 1 | 1,101s | ⭐⭐ |
| 🥉 | Qwen3-8B Q8 | 8B | 36/39 (92%) | 3 | 2 | 377s | ⭐⭐⭐ |
| 4️⃣ | GLM-4.7-Flash | ~4B(MoE) | 36/39 (92%) | 3 | 2 | 1,310s | ⭐ |
| 5️⃣ | Qwen3-8B Q4 | 8B | 35/39 (90%) | 2 | 3 | 441s | ⭐⭐ |
| 6️⃣ | Qwen3.5-35B-A3B | 35B(MoE) | 31/39 (79%) | 1 | 2 | 552s | ⭐ |
Per-Test Heatmap
| Model | U01 Assets | U02 Research | U03 Server | U04 Travel | U05 Code |
|---|---|---|---|---|---|
| GPT-oss-20B | 🏆 9/9 | 🏆 8/8 | ✅ 6/7 | ✅ 7/8 | 🏆 7/7 |
| Qwen3.5-27B | 🏆 9/9 | 🏆 8/8 | 🏆 7/7 | ✅ 6/8 | 🏆 7/7 |
| Qwen3-8B Q8 | ✅ 8/9 | 🏆 8/8 | 🏆 7/7 | ✅ 6/8 | 🏆 7/7 |
| GLM-4.7-Flash | ✅ 8/9 | ✅ 6/8 | 🏆 7/7 | 🏆 8/8 | 🏆 7/7 |
| Qwen3-8B Q4 | 🏆 9/9 | ✅ 7/8 | ✅ 5/7 | 🏆 8/8 | ✅ 6/7 |
| Qwen3.5-35B-A3B | 🏆 9/9 | ✅ 7/8 | ⚠️ 4/7 | ✅ 7/8 | ⚠️ 4/7 |
Key Insights
1. Parameter Count Isn't Everything
The 35B MoE model (Qwen3.5-35B-A3B) scored last at 79%, while the 8B model (Qwen3-8B Q8) achieved 92% in 3rd place. For agent tasks, tool-use capability and instruction following matter more than raw parameter count.
Personally, I suspect dense models handle the tool chains agents require better than MoE models, though I haven't verified this.
2. Quantization Affects Agent Quality
Comparing Qwen3-8B Q8 vs Q4: the Q4 variant exhibited tool call repetition loops — it repeated the same `df -h && free -h` command 6 times in U03. This suggests that tool chaining stability is sensitive to quantization levels.
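One cheap mitigation for this failure mode is an agent-side guard that rejects a tool call once the same (tool, args) pair has repeated too many times in a row. This is my own suggestion, not a feature of Androi:

```python
def make_loop_guard(max_repeats: int = 2):
    """Return a predicate that blocks a tool call once the identical
    (tool, args) pair has occurred more than `max_repeats` times in a row."""
    history = []

    def allow(tool: str, args: str) -> bool:
        call = (tool, args)
        history.append(call)
        recent = history[-(max_repeats + 1):]
        # Block only when the last max_repeats+1 calls are all this same call.
        return not (len(recent) > max_repeats and all(c == call for c in recent))

    return allow

allow = make_loop_guard(max_repeats=2)
assert allow("run_command", "df -h && free -h")       # 1st call: allowed
assert allow("run_command", "df -h && free -h")       # 2nd identical: still allowed
assert not allow("run_command", "df -h && free -h")   # 3rd identical: blocked
```

When a call is blocked, the agent can feed an error message back to the model instead of re-executing, which usually breaks the loop.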
3. Speed vs. Accuracy Trade-offs
- GPT-oss-20B: Fastest (264s) AND most accurate (95%) — clear winner
- Qwen3.5-27B: Tied accuracy but 4× slower — for when depth matters
- Qwen3-8B Q8: Best performance-per-parameter — recommended for resource-limited environments
4. "Chain Completion" Is the Key Differentiator
Most models perform well on intermediate steps (searching, analyzing), but the real differentiation occurs at the end of the chain — sending emails, saving files, and registering automated tasks. Qwen3.5-35B-A3B was notably weak at these final steps.
Conclusion
Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but tool chaining completion rate, instruction adherence, and response speed together.
- 🏆 Best overall: GPT-oss-20B (speed + accuracy leader)
- 💰 Best value: Qwen3-8B Q8 (92% with just 8B parameters at 377s)
- 🔬 Deepest analysis: Qwen3.5-27B (most PERFECT scores at 4)
Test code and full results are available at tests/test_ultimate_extreme.py and tests/model_benchmark.md.