TL;DR
We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5 — a model that costs $0.02 per call — scored 78.2%, surpassing every model on the official mini-SWE-agent leaderboard, including Claude Opus 4.5 (76.8%) which costs 37x more per call. The improvement comes entirely from better context, not a better model.
Full benchmark results and interactive dashboard: xanther.ai/benchmarks
Try it free: xanther.ai
The Official Leaderboard (as of February 2026)
The SWE-bench Verified leaderboard uses mini-SWE-agent as a standardized harness to evaluate models on 500 human-verified bug instances from real open-source Python repositories. Here are the top results:
| Rank | Model | Resolve Rate | Cost/Instance |
|---|---|---|---|
| 1 | Claude 4.5 Opus (high reasoning) | 76.80% | $0.75 |
| 2 | Gemini 3 Flash (high reasoning) | 75.80% | $0.36 |
| 3 | MiniMax M2.5 (high reasoning) | 75.80% | $0.07 |
| 4 | Claude Opus 4.6 | 75.60% | $0.55 |
| 5 | GPT-5-2 Codex | 72.80% | $0.45 |
| 6 | Claude 4.5 Sonnet (high reasoning) | 71.40% | $0.66 |
| 7 | Kimi K2.5 (high reasoning) | 70.80% | $0.15 |
| 8 | DeepSeek V3.2 (high reasoning) | 70.00% | $0.45 |
Source: swebench.com, February 2026
The top score is 76.80% from Claude 4.5 Opus with high reasoning enabled, at $0.75 per instance. The cheapest competitive model is MiniMax M2.5 at 75.80% for $0.07.
Now here's what happens when you add Xanther Context Engine:
| Model | Without XCE | With XCE | Delta | Cost/Instance |
|---|---|---|---|---|
| MiniMax M2.5 | 75.80% | 78.20% | +2.4pp | $0.22 |
| Sonnet 4.0 | 66.00% | 73.40% | +7.4pp | $0.22 |
| Sonnet 4.0 (cascade hybrid) | 66.00% | 76.80% | +10.8pp | — |
MiniMax M2.5 + XCE at 78.2% would be the #1 entry on the official leaderboard — and it costs $0.22 per instance, not $0.75.
See the full results breakdown: xanther.ai/benchmarks | Raw data: github.com/Xanther-Ai/xce-benchmarks
What Is SWE-bench Verified?
SWE-bench Verified is the industry-standard benchmark for evaluating AI coding agents on real-world software engineering tasks. It consists of 500 instances, each representing a real bug from a real open-source Python repository. Each instance includes:
- A problem statement (the GitHub issue description)
- A codebase snapshot (the repository at the time the bug was reported)
- A gold patch (the actual fix that was merged)
- Test cases that verify the fix
The agent must read the problem statement, navigate the codebase, write a patch, and pass the test cases. No hints, no file locations, no guidance beyond the issue description.
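If you want to poke at the instances yourself, the dataset is published on Hugging Face. Below is a minimal sketch using the `datasets` library; the field names follow the published SWE-bench schema and the example values are illustrative:

```python
# Inspect a SWE-bench Verified instance with the Hugging Face `datasets` library.
# Field names (problem_statement, patch, FAIL_TO_PASS, ...) follow the published schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")  # 500 instances
inst = ds[0]

print(inst["instance_id"])                # e.g. "django__django-16379"
print(inst["repo"], inst["base_commit"])  # repository and the snapshot commit
print(inst["problem_statement"][:300])    # the GitHub issue text the agent sees
print(inst["patch"][:300])                # the gold patch (hidden from the agent)
print(inst["FAIL_TO_PASS"])               # tests that must flip from failing to passing
```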
The repositories span a wide range of complexity:
| Repository | Stars | Files | Lines of Code | Architecture |
|---|---|---|---|---|
| django/django | 82K | ~4,000 | ~300K | Layered MVC, ORM, middleware, admin |
| scikit-learn | 61K | ~1,200 | ~200K | Estimator inheritance chains, pipelines |
| sympy/sympy | 13K | ~1,500 | ~400K | Deep mathematical module dependencies |
| matplotlib | 20K | ~1,000 | ~150K | Complex rendering pipeline, backends |
| pytest | 12K | ~400 | ~50K | Plugin system, fixture resolution |
This is not a toy benchmark. These are production codebases with real architectural complexity.
The Context Problem
Watch what happens when a coding agent tries to fix a bug in Django without architectural context:
Bug: django__django-16379 — FileBasedCache crashes with FileNotFoundError on concurrent access
Agent behavior (without XCE):
- Searches for "FileBasedCache" — finds the class in `django/core/cache/backends/filebased.py`
- Reads the file, sees the `delete()` method
- Doesn't understand the cache backend hierarchy — misses that `FileBasedCache` inherits from `BaseCache`
- Doesn't know about the concurrent access patterns in Django's cache framework
- Writes a fix that handles the `FileNotFoundError` but breaks the cache invalidation contract
- Test fails. Tries again.
- Explores `django/core/cache/__init__.py` and `django/core/cache/backends/base.py`
- Eventually finds the right approach after 15+ file reads and 4,000+ tokens
Agent behavior (with XCE):
- Calls `xce_get_context("FileBasedCache FileNotFoundError concurrent access")`
- Gets back: the cache backend hierarchy (`BaseCache` → `FileBasedCache`), the locking mechanism, the file operations that can race, and the related test infrastructure
- Understands the architecture immediately
- Writes a fix that wraps the file operation in a try/except with proper fallback
- Test passes on first attempt. ~1,500 tokens.
The difference isn't that the model is smarter. It's that the model has a map.
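The fix itself is small. Here's a hedged sketch of the pattern described above, assuming the change lands in `FileBasedCache`'s key-lookup path; it's an illustration, not the gold patch verbatim:

```python
# Illustration only (not the gold patch verbatim): the racy pattern in
# django/core/cache/backends/filebased.py checks os.path.exists() and then opens the
# file, so a concurrent delete() in between raises FileNotFoundError. Wrapping the
# read and treating a vanished file as a cache miss closes the race.
def has_key(self, key, version=None):
    fname = self._key_to_file(key, version)
    try:
        with open(fname, "rb") as f:
            return not self._is_expired(f)
    except FileNotFoundError:
        return False
```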
Per-Repository Analysis
XCE doesn't provide a uniform boost. The improvement correlates strongly with architectural complexity:
| Repository | Sonnet 4.0 Baseline | Sonnet 4.0 + XCE | Delta | Why |
|---|---|---|---|---|
| sympy/sympy | 45% | 62% | +17pp | Deep module dependencies. A fix in sympy/core/ often requires understanding sympy/simplify/, sympy/printing/, and sympy/polys/. Without context, the agent gets lost in the dependency maze. |
| scikit-learn | 58% | 71% | +13pp | Complex estimator inheritance. BaseEstimator → ClassifierMixin → LinearClassifierMixin → LogisticRegression. Bugs often require understanding the full chain. |
| matplotlib | 52% | 65% | +13pp | Rendering pipeline with multiple backends. A bug in axes.py might require understanding figure.py, backend_agg.py, and the transform system. |
| django/django | 62% | 74% | +12pp | Layered architecture (models → views → templates → middleware). Bugs cross layers frequently. |
| pytest | 70% | 78% | +8pp | Relatively flat architecture. The plugin system is complex but most bugs are localized. Less benefit from architectural context. |
The pattern is clear: the more architectural dependencies a codebase has, the more the agent benefits from having a structural map.
Pytest, with its relatively flat architecture, sees the smallest improvement (+8pp). Sympy, where fixing a bug in one module often requires understanding five others, sees the largest (+17pp).
The Cost Analysis
Here's where it gets interesting from a business perspective.
The official leaderboard shows that reaching 76%+ on SWE-bench Verified requires expensive models:
| Score Range | Cheapest Model | Cost/Instance |
|---|---|---|
| 76%+ | Claude 4.5 Opus (high reasoning) | $0.75 |
| 75%+ | MiniMax M2.5 (high reasoning) | $0.07 |
| 72%+ | GPT-5-2 Codex | $0.45 |
| 70%+ | DeepSeek V3.2 (high reasoning) | $0.45 |
With XCE, the cost equation changes:
| Score | Setup | Cost/Instance | Savings vs. Opus |
|---|---|---|---|
| 78.2% | MiniMax M2.5 + XCE | $0.22 | 3.4x cheaper |
| 73.4% | Sonnet 4.0 + XCE | $0.22 | 3.4x cheaper |
| 76.8% | Claude 4.5 Opus (no XCE) | $0.75 | baseline |
The $0.22 includes the XCE query cost (~$0.001 per query, amortized over multiple queries per instance) plus the model inference cost. The XCE overhead is negligible — the savings come from the model needing fewer tokens to solve each problem.
Token reduction: XCE reduces token usage by approximately 20% per task. The agent makes fewer wrong turns, reads fewer irrelevant files, and arrives at the solution faster. On a 500-instance benchmark run, this translates to significant cost savings.
At scale, the math is compelling. A team running 1,000 coding agent tasks per month:
| Setup | Monthly Cost | Annual Cost |
|---|---|---|
| Claude Opus (no XCE) | $750 | $9,000 |
| MiniMax M2.5 + XCE | $220 | $2,640 |
| Savings | $530/mo | $6,360/yr |
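The table is straightforward multiplication of the per-instance costs above; a quick sketch of the arithmetic:

```python
# Monthly/annual cost arithmetic behind the table above (per-instance costs from the
# benchmark tables; task volume is the 1,000-tasks-per-month example).
TASKS_PER_MONTH = 1_000

setups = {
    "Claude Opus (no XCE)": 0.75,  # $/instance
    "MiniMax M2.5 + XCE": 0.22,    # $/instance (inference + ~$0.001/query XCE overhead)
}

for name, per_instance in setups.items():
    monthly = per_instance * TASKS_PER_MONTH
    print(f"{name}: ${monthly:,.0f}/mo, ${monthly * 12:,.0f}/yr")
# -> $750/mo ($9,000/yr) vs. $220/mo ($2,640/yr): a difference of $530/mo ($6,360/yr)
```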
And the XCE setup gets better results.
How XCE Works (High Level)
XCE indexes a codebase into a multi-level structured representation that captures both code and architecture. When an agent queries XCE, it gets back context at the right level of abstraction — not just a code snippet, but an understanding of where that code fits in the system, what depends on it, and what it depends on.
The index is built with Xanther's proprietary PRAT algorithm. The key difference from embedding-based search is that PRAT captures structural relationships between components, not just text similarity: the agent can ask "what depends on this function?" and get a real answer, something embeddings alone cannot provide.
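PRAT itself isn't public, but the distinction from embedding search is easy to illustrate. Here is a toy sketch (not XCE's algorithm) that builds a reverse-dependency map for Python source with the standard `ast` module — the kind of structural question a similarity search can't answer directly:

```python
# Toy illustration (not PRAT): a structural index can answer "what depends on this
# function?" by recording caller -> callee edges, which text similarity alone cannot.
import ast
from collections import defaultdict


def reverse_dependencies(source: str) -> dict[str, set[str]]:
    """Map each called name to the set of functions that call it."""
    tree = ast.parse(source)
    callers = defaultdict(set)
    for func in ast.walk(tree):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callers[node.func.id].add(func.name)
    return callers


src = """
def _is_expired(f): ...
def get(key): return _is_expired(open(key))
def has_key(key): return not _is_expired(open(key))
"""
print(reverse_dependencies(src)["_is_expired"])  # {'get', 'has_key'} (order may vary)
```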
The result is served via MCP, so any compatible agent gets architectural context on every tool call without any changes to the agent itself.
Reproducing These Results
All results are published and reproducible. The benchmark repository includes predictions, resolved instance IDs, and trajectory download scripts:
Repository: github.com/Xanther-Ai/xce-benchmarks
To reproduce:
```bash
# 1. Install mini-swe-agent
pip install mini-swe-agent

# 2. Get an XCE API key (free at app.xanther.ai)

# 3. Index the target repo
npx xanther-cli init --api-key xce_your_key

# 4. Run the benchmark
mini-swe-agent run \
  --model claude-sonnet-4-20250514 \
  --dataset swe-bench-verified \
  --mcp-config '{"xanther": {"url": "https://mcp.xanther.ai/sse", "headers": {"Authorization": "Bearer xce_your_key"}}}'

# 5. Evaluate
sb submit --predictions results/preds.jsonl
```
Each run's `preds.jsonl` contains one prediction per instance:

```json
{
  "instance_id": "django__django-16379",
  "model_name_or_path": "sonnet-4.0-xce",
  "model_patch": "diff --git a/...",
  "full_output": "..."
}
```
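Before submitting, it's worth sanity-checking the run. A minimal sketch that reads the predictions back and flags missing fields or empty patches (file path and keys taken from the format above):

```python
# Sanity-check a preds.jsonl run before submitting it for evaluation.
import json

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}

with open("results/preds.jsonl") as f:
    preds = [json.loads(line) for line in f if line.strip()]

bad = [p.get("instance_id", "<no id>") for p in preds
       if not REQUIRED_KEYS <= p.keys() or not p.get("model_patch", "").strip()]

print(f"{len(preds)} predictions, {len(bad)} with missing fields or empty patches")
```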
Trajectory files (100-600MB per run) are available for download from S3 for detailed analysis.
What This Means
Three takeaways:
1. Context is cheaper than compute. You don't need the most expensive model to get the best results. You need the right context. A $0.02/call model with good architectural context outperforms a $0.30/call model without it.
2. The improvement scales with complexity. Simple codebases with flat architectures see modest gains (+8pp). Complex codebases with deep dependencies see dramatic gains (+17pp). As codebases grow, the value of architectural context increases.
3. This is model-agnostic. XCE works with any MCP-compatible agent. The same context infrastructure that improves MiniMax M2.5 also improves Sonnet 4.0, and would improve any future model. Better models + better context = compounding gains.
Learn more about how XCE works: xanther.ai | See the benchmark methodology: xanther.ai/benchmarks
Try It
Xanther is in open beta. Free tier: 3 repos, 100 queries/month. No credit card.
```bash
npx xanther-cli init --api-key YOUR_KEY
```
- Website: xanther.ai
- Benchmark Dashboard: xanther.ai/benchmarks
- Dashboard: app.xanther.ai
- Benchmarks (raw data): github.com/Xanther-Ai/xce-benchmarks
- Discord: discord.gg/Y768kBRS
- npm: npmjs.com/package/xanther-cli
All benchmark results were evaluated using the official SWE-bench CLI (sb submit) against SWE-bench Verified (500 instances). The agent harness is mini-swe-agent. Predictions and resolved instance IDs are published at github.com/Xanther-Ai/xce-benchmarks.



