Measuring AI agent performance by actual outcome correctness, not just tool call presence
Why We Built This Benchmark
To make local agents accessible to general users, it is crucial to find an LLM with the lowest possible VRAM footprint.
Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag. But for tool-using AI agents, what truly matters isn't "did it call the right tool?" — it's "did it actually produce the correct result?"
Our project Androi is a local AI agent that uses 10+ tools including web search, Python execution, file management, email, and calendar. We connected various LLMs to the same agent and ran 5 identical complex real-world scenarios, scoring each model on the correctness of its outputs.
Test Environment
- Server: Ubuntu VM (3.8GB RAM, 20GB SSD)
- Runtime: Ollama (local inference)
- Framework: Androi Agent (Node.js + Python tool pipeline)
- Validation: Outcome-Based Validation (v2)
- Test Date: 2026-02-28
The 5 Real-World Test Scenarios (39 Total Checks)
Each test requires the agent to chain multiple tools sequentially to complete a complex, multi-step task.
U01. 🏦 Global Asset Rebalancing Advisor (9 checks)
Scenario: The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:
- Web search current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price)
- Convert to KRW and calculate total portfolio value
- Execute Python code to compute each asset's weight (%)
- Compare against ideal allocation (Stocks 40%, Crypto 20%, USD 20%, Gold 20%) and recommend rebalancing
- Save report to file (`/tmp/rebalance_report.txt`)
- Register calendar event for next Friday's review
- Send email with the report attached
Validation Checks: Samsung price, Bitcoin price, USD rate, Gold price, Total calculation, Weight analysis, Rebalancing recommendation, File saved, Email sent
Required Tools: web_search × 4, run_python_code / calculate, write_file, create_event, send_email
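The weight-calculation step the agent must perform can be sketched in a few lines of Python. The prices below are placeholder snapshots, not the live quotes the agent would fetch via `web_search`:

```python
# Hypothetical snapshot prices in KRW (the agent would fetch live values via web search).
holdings = {
    "samsung": {"qty": 50, "price_krw": 80_000},        # 50 shares
    "btc":     {"qty": 0.1, "price_krw": 140_000_000},  # 0.1 BTC
    "usd":     {"qty": 3_000, "price_krw": 1_400},      # $3,000 at 1,400 KRW/USD
    "gold":    {"qty": 1, "price_krw": 3_500_000},      # 1 oz
}
target = {"samsung": 40, "btc": 20, "usd": 20, "gold": 20}  # ideal allocation (%)

values = {k: v["qty"] * v["price_krw"] for k, v in holdings.items()}
total = sum(values.values())
weights = {k: round(100 * v / total, 1) for k, v in values.items()}
# Positive delta = buy more of this asset; negative = trim it.
rebalance = {k: round(target[k] - weights[k], 1) for k in weights}
```

With these placeholder prices the portfolio is heavily overweight crypto, so the recommendation step would suggest trimming BTC toward the 20% target.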
U02. 📊 Real-Time Tech Trend Research & Report (8 checks)
Scenario: Research the 2026 AI semiconductor market and produce a comprehensive report.
- Search "AI semiconductor market forecast 2026" → collect market size data
- Search "NVIDIA HBM market share 2026" → understand competitive landscape
- Search "Samsung HBM3E mass production" → Korean industry status
- Generate markdown report using Python with collected data
- Save report to `/tmp/ai_semiconductor_report.md`
- Register weekly automated task for trend updates
- Send report via email
Validation Checks: Market size mentioned, NVIDIA mentioned, HBM mentioned, Samsung trends, SK Hynix trends, Report saved, Auto-task registered, Email sent
Required Tools: web_search × 3, run_python_code, write_file, create_task, send_email
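The report-generation step reduces to templating search findings into markdown. A minimal sketch, with illustrative placeholder findings standing in for the agent's actual search results:

```python
# Illustrative data; the real agent would fill these from its web_search results.
findings = {
    "Market size": "AI semiconductor market projected to grow through 2026 (placeholder).",
    "NVIDIA": "Dominant share of the HBM-attached accelerator market (placeholder).",
    "Samsung / SK Hynix": "HBM3E mass production ramping up (placeholder).",
}

# Build a markdown document: title, then one section per finding.
lines = ["# AI Semiconductor Market Report (2026)", ""]
for heading, body in findings.items():
    lines += [f"## {heading}", body, ""]
report = "\n".join(lines)

with open("/tmp/ai_semiconductor_report.md", "w") as f:
    f.write(report)
```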
U03. 🖥️ Server Health Check + Auto-Recovery + Alerts (7 checks)
Scenario: Perform a comprehensive VM server health check and generate a report.
- Run `df -h` to check disk usage
- Run `free -h` to check memory status
- Run `systemctl list-units --state=failed` to identify failed services
- Use Python to analyze the last 50 lines of `/var/log/syslog` for ERROR/WARNING/CRITICAL frequency
- Use `find` to list temp files older than 7 days
- Save full report with risk level assessment (High/Medium/Low)
- Register hourly auto-check task
Validation Checks: Disk usage, Memory status, Service status, Log analysis, Risk level assessment, Report saved, Auto-task registered
Required Tools: run_command × 4, run_python_code, write_file, create_task
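The log-analysis step might look like the sketch below: count severity keywords in the last 50 lines and map the counts to a risk level. The thresholds are my own illustration, not the benchmark's actual rubric:

```python
from collections import Counter

def analyze_log(lines, tail=50):
    """Count severity keywords in the last `tail` lines and assign a risk level."""
    counts = Counter()
    for line in lines[-tail:]:
        for level in ("CRITICAL", "ERROR", "WARNING"):
            if level in line:
                counts[level] += 1
    # Illustrative thresholds, not the benchmark's definition of High/Medium/Low.
    if counts["CRITICAL"] > 0 or counts["ERROR"] >= 5:
        risk = "High"
    elif counts["ERROR"] > 0 or counts["WARNING"] >= 5:
        risk = "Medium"
    else:
        risk = "Low"
    return counts, risk

# In the real scenario the input is /var/log/syslog; sample lines for illustration:
sample = [
    "Feb 28 10:00 host app: WARNING disk at 85%",
    "Feb 28 10:01 host app: ERROR failed to bind port",
]
counts, risk = analyze_log(sample)
```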
U04. 🌍 Travel Planner (8 checks)
Scenario: Plan a weekend trip to Jeju Island (1 night, 2 days).
- Search "Jeju Island February weather" → temperature and conditions
- Search "Jeju winter restaurant recommendations 2026" → select 3 restaurants
- Search "Jeju winter tourist attractions" → select 3 attractions
- Use Python to create a Day 1 / Day 2 timetable (09:00–21:00, alternating attractions and restaurants)
- Calculate estimated budget (meals: 30K KRW × 6, hotel: 150K, transport: 50K = 380K KRW)
- Save travel plan to file + Register calendar events (departure/return)
- Send plan via email
Validation Checks: Weather info, Restaurant recommendations, Tourist attractions, Day 1/Day 2 separation, Timetable, Cost calculation, Calendar events, Email sent
Required Tools: web_search × 3, run_python_code, calculate, write_file, create_event × 2, send_email
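The budget arithmetic comes straight from the scenario prompt and is easy to verify; the timetable sketch alternates placeholder stops in two-hour slots (the names would come from the agent's searches):

```python
# Budget figures from the scenario prompt (KRW): 6 meals + 1 hotel night + transport.
budget = 30_000 * 6 + 150_000 + 50_000
assert budget == 380_000

# Day-1 timetable sketch: fill 2-hour slots starting at 09:00.
# Stop names are placeholders; the agent would pick them from search results.
stops = ["Attraction 1", "Lunch spot", "Attraction 2", "Dinner spot"]
timetable = [(f"{9 + 2 * i:02d}:00", stop) for i, stop in enumerate(stops)]
```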
U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)
Scenario: Analyze the server's tool_registry.py file and produce a code review report.
- Use `read_file` to read the entire source code
- Execute Python to count lines, functions, and classes
- Run `wc -l /root/xoul/tools/*.py` to check total module size
- Use `calculate` to compute tool_registry.py's percentage of the total codebase
- Save analysis report to `/tmp/code_analysis.txt`
- Store key findings in memory (recall/memorize)
- Send report via email
Validation Checks: Line count, Function count, Total module size, Percentage calculated, Code structure explained, Report saved, Email sent
Required Tools: read_file, run_python_code, run_command, calculate, write_file, memorize, send_email
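The counting step can be sketched with Python's `ast` module. The source string here is a stand-in, since I don't have access to the actual tool_registry.py:

```python
import ast

def summarize_source(source: str) -> dict:
    """Count lines, function definitions, and class definitions in Python source."""
    tree = ast.parse(source)
    funcs = sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                for n in ast.walk(tree))
    classes = sum(isinstance(n, ast.ClassDef) for n in ast.walk(tree))
    return {"lines": len(source.splitlines()), "functions": funcs, "classes": classes}

# Stand-in snippet; the agent would read_file the real tool_registry.py.
sample = "class ToolRegistry:\n    def register(self, tool):\n        pass\n"
stats = summarize_source(sample)
```

Walking the full AST (rather than only top-level nodes) also counts methods nested inside classes, which matches what "function count" usually means in a code review.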
Validation Method: Outcome-Based
Instead of checking "did it call the right tool?", we verify "does the output contain the correct information?"
100% = 🏆 PERFECT — All validation checks passed
≥70% = ✅ GOOD — Most critical outcomes achieved
≥50% = ⚠️ PARTIAL — More than half achieved
<50% = ❌ FAIL — Critical outcomes missing
For example, in U01 if the agent didn't explicitly call send_email but the response contains "email sent successfully", it passes. Conversely, calling web_search but not including Samsung's stock price in the response is a fail.
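In my understanding, such a checker reduces to pattern matching over the agent's final output plus filesystem checks, with the score mapped to the grade bands above. A minimal sketch; the keyword rules are illustrative, not the benchmark's actual check definitions:

```python
import os
import re

def run_checks(response: str, checks: dict) -> tuple[int, int]:
    """Each rule is either 'file:<path>' (path must exist) or a regex
    that must match the agent's final response."""
    passed = 0
    for name, rule in checks.items():
        if rule.startswith("file:"):
            ok = os.path.exists(rule[5:])
        else:
            ok = re.search(rule, response, re.IGNORECASE) is not None
        passed += ok
    return passed, len(checks)

def grade(passed: int, total: int) -> str:
    pct = passed / total
    if pct == 1.0:
        return "PERFECT"
    if pct >= 0.7:
        return "GOOD"
    if pct >= 0.5:
        return "PARTIAL"
    return "FAIL"

# Illustrative U01-style checks (hypothetical rules, not the real suite):
checks = {
    "samsung_price": r"samsung.*\d",
    "email_sent": r"email sent",
}
response = "Samsung Electronics: 80,000 KRW. ... Email sent successfully."
passed, total = run_checks(response, checks)
```

This is exactly why the U01 example passes: "email sent successfully" satisfies the outcome rule even without an observed `send_email` call.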
🏆 Final Rankings
| Rank | Model | Parameters | Score | 🏆 PERFECT | ✅ GOOD | Speed | Value |
|---|---|---|---|---|---|---|---|
| 🥇 | GPT-oss-20B | 20B | 37/39 (95%) | 3 | 2 | 264s | ⭐⭐⭐ |
| 🥈 | Qwen3.5-27B | 27B | 37/39 (95%) | 4 | 1 | 1,101s | ⭐⭐ |
| 🥉 | Qwen3-8B Q8 | 8B | 36/39 (92%) | 3 | 2 | 377s | ⭐⭐⭐ |
| 4️⃣ | GLM-4.7-Flash | ~4B(MoE) | 36/39 (92%) | 3 | 2 | 1,310s | ⭐ |
| 5️⃣ | Qwen3-8B Q4 | 8B | 35/39 (90%) | 2 | 3 | 441s | ⭐⭐ |
| 6️⃣ | Qwen3.5-35B-A3B | 35B(MoE) | 31/39 (79%) | 1 | 2 | 552s | ⭐ |
Per-Test Heatmap
| Model | U01 Assets | U02 Research | U03 Server | U04 Travel | U05 Code |
|---|---|---|---|---|---|
| GPT-oss-20B | 🏆 9/9 | 🏆 8/8 | ✅ 6/7 | ✅ 7/8 | 🏆 7/7 |
| Qwen3.5-27B | 🏆 9/9 | 🏆 8/8 | 🏆 7/7 | ✅ 6/8 | 🏆 7/7 |
| Qwen3-8B Q8 | ✅ 8/9 | 🏆 8/8 | 🏆 7/7 | ✅ 6/8 | 🏆 7/7 |
| GLM-4.7-Flash | ✅ 8/9 | ✅ 6/8 | 🏆 7/7 | 🏆 8/8 | 🏆 7/7 |
| Qwen3-8B Q4 | 🏆 9/9 | ✅ 7/8 | ✅ 5/7 | 🏆 8/8 | ✅ 6/7 |
| Qwen3.5-35B-A3B | 🏆 9/9 | ✅ 7/8 | ⚠️ 4/7 | ✅ 7/8 | ⚠️ 4/7 |
Key Insights
1. Parameter Count Isn't Everything
The 35B MoE model (Qwen3.5-35B-A3B) scored last at 79%, while the 8B model (Qwen3-8B Q8) achieved 92% in 3rd place. For agent tasks, tool-use capability and instruction following matter more than raw parameter count.
Personally, I suspect dense models handle the tool chains agents require better than MoE models, though I haven't verified this.
2. Quantization Affects Agent Quality
Comparing Qwen3-8B Q8 vs Q4: the Q4 variant exhibited tool call repetition loops — it repeated the same `df -h && free -h` command 6 times in U03. This suggests that tool chaining stability is sensitive to quantization levels.
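One cheap mitigation for this failure mode is an agent-side guard that rejects a tool call once the same (tool, args) pair has repeated too many times in a row. This is my own suggestion, not a feature of Androi:

```python
def make_loop_guard(max_repeats: int = 2):
    """Return a predicate that blocks a tool call once the identical
    (tool, args) pair has occurred more than `max_repeats` times in a row."""
    history = []

    def allow(tool: str, args: str) -> bool:
        call = (tool, args)
        history.append(call)
        recent = history[-(max_repeats + 1):]
        # Block only when the last max_repeats+1 calls are all this same call.
        return not (len(recent) > max_repeats and all(c == call for c in recent))

    return allow

allow = make_loop_guard(max_repeats=2)
assert allow("run_command", "df -h && free -h")       # 1st call: allowed
assert allow("run_command", "df -h && free -h")       # 2nd identical: still allowed
assert not allow("run_command", "df -h && free -h")   # 3rd identical: blocked
```

When a call is blocked, the agent can feed an error message back to the model instead of re-executing, which usually breaks the loop.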
3. Speed vs. Accuracy Trade-offs
- GPT-oss-20B: Fastest (264s) AND most accurate (95%) — clear winner
- Qwen3.5-27B: Tied accuracy but 4× slower — for when depth matters
- Qwen3-8B Q8: Best performance-per-parameter — recommended for resource-limited environments
4. "Chain Completion" Is the Key Differentiator
Most models perform well on intermediate steps (searching, analyzing), but the real differentiation occurs at the end of the chain — sending emails, saving files, and registering automated tasks. Qwen3.5-35B-A3B was notably weak at these final steps.
Conclusion
Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but tool chaining completion rate, instruction adherence, and response speed together.
- 🏆 Best overall: GPT-oss-20B (speed + accuracy leader)
- 💰 Best value: Qwen3-8B Q8 (92% with just 8B parameters at 377s)
- 🔬 Deepest analysis: Qwen3.5-27B (most PERFECT scores at 4)
Test code and full results are available at tests/test_ultimate_extreme.py and tests/model_benchmark.md.