DEV Community: Hoyin kyoma

Context Engineering Is the Compass Your Coding Agent Needs

Hoyin kyoma — Sun, 10 May 2026 07:32:48 +0000

TL;DR

Coding agents are powerful ships, but they're sailing without a map. They can write code, run tests, and iterate, but they don't know where they are in the codebase. Context engineering is the discipline of giving agents the architectural awareness they need to navigate effectively. Without it, even the best models waste tokens exploring dead ends. With it, a cheap model can outperform an expensive one.

The Navigation Problem

Picture a ship in open water. It has a powerful engine, a skilled crew, and enough fuel to reach any destination. But it has no compass, no charts, and no GPS. What happens?

It explores. It tries directions. It backtracks when it hits land where it expected open water. Eventually, through trial and error, it might reach its destination — but it burns 3x the fuel and takes 5x the time.

This is exactly what happens when you point a coding agent at a large codebase without architectural context.

The agent has all the capabilities it needs. It can read files, write code, run tests, search for patterns. But it doesn't know the architecture. It doesn't know that django/db/models/sql/compiler.py is the heart of query generation, or that changing BaseCache.set() affects every cache backend downstream. It discovers these things through exploration — expensive, token-heavy, error-prone exploration.

What Is Context Engineering?

Context engineering is the practice of providing AI agents with structured, relevant information about the system they're working in — before they start exploring on their own.

It's not prompt engineering (crafting better instructions). It's not RAG (retrieving text snippets by similarity). It's building a structured representation of the codebase that captures architecture, relationships, and design intent — then serving it to agents at the right moment.

The key insight: agents don't need more intelligence. They need better maps.

Consider the difference:

Without context engineering:

Agent: "I need to fix the cache race condition"
→ Searches for "cache" → finds 47 files
→ Reads django/core/cache/__init__.py → not helpful
→ Reads django/core/cache/backends/filebased.py → finds the class
→ Reads django/core/cache/backends/base.py → understands inheritance
→ Searches for "thread" → finds 23 files
→ Reads django/utils/autoreload.py → wrong file
→ Reads django/core/files/locks.py → relevant but doesn't know why yet
→ Eventually pieces together the architecture after 12 file reads
Total: ~4,000 tokens, 45 seconds, 2 wrong attempts

With context engineering:

Agent: "I need to fix the cache race condition"
→ Queries XCE: "FileBasedCache race condition threading"
→ Gets back: inheritance chain, threading concerns, related utilities, test infrastructure
→ Goes directly to the right files with full architectural understanding
Total: ~1,500 tokens, 15 seconds, correct on first attempt

Same agent. Same model. Same capabilities. The only difference is the map.

The Three Levels of Context

Not all context is created equal. There's a hierarchy:

Level 1: Code Context (What exists)

This is what most tools provide today — file contents, function signatures, grep results. It answers "what code is here?" but not "why?" or "how does it connect?"

Tools at this level: file search, grep, symbol lookup, embeddings-based RAG.

Limitation: Finding a function doesn't tell you what calls it, what it depends on, or what breaks if you change it.

Level 2: Structural Context (How things connect)

This captures relationships — call graphs, inheritance chains, import dependencies, module boundaries. It answers "what depends on what?" and "what's the execution flow?"

Tools at this level: static analysis, dependency graphs, call chain extraction.

Limitation: Knowing the call graph doesn't tell you the design intent or architectural role of each component.
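
As a rough illustration of what Level 2 tooling extracts, here is a minimal sketch using Python's standard ast module to pull imports and per-function calls out of a source string. The sample source and the extract_structure helper are made up for this example; real static analysis tools resolve names across files, which this sketch does not attempt.

```python
import ast

# Hypothetical sample module, used only to demonstrate the extraction.
SOURCE = '''
import os
from pathlib import Path

def load(path):
    return Path(path).read_text()

def main():
    data = load("config.txt")
    print(os.getcwd(), len(data))
'''


def extract_structure(source: str) -> dict:
    """Return the module's imports and, per function, the names it calls."""
    tree = ast.parse(source)
    imports = []
    calls = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            imports.append(node.module or ".")
        elif isinstance(node, ast.FunctionDef):
            called = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):
                        called.add(func.id)
                    elif isinstance(func, ast.Attribute):
                        called.add(func.attr)
            calls[node.name] = sorted(called)
    return {"imports": sorted(imports), "calls": calls}


structure = extract_structure(SOURCE)
print(structure)
```

The output says that main calls load, but nothing about why load exists or what rules it must follow, which is the gap Level 3 is meant to fill.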

Level 3: Architectural Context (Why things exist)

This captures design intent — why a module exists, what role it plays in the system, what design patterns it implements, what constraints it must satisfy. It answers "what is this component's job?" and "what are the rules?"

Tools at this level: XCE's PRAT-powered structured index.

This is the level that changes agent behavior. When an agent knows that CsrfViewMiddleware must run before CacheMiddleware (and why), it doesn't accidentally break that constraint. When it knows that BaseCache defines a contract that all backends must satisfy, it doesn't write a fix that violates that contract.

Why Embeddings Alone Aren't Enough

The most common approach to giving agents codebase context is embedding-based retrieval: embed all code chunks, embed the query, return the most similar chunks. This works for simple lookups but fails for architectural questions.

Example: "How does Django's ORM compile a QuerySet into SQL?"

Embedding search returns: chunks from query.py, compiler.py, maybe expressions.py — based on text similarity. But it doesn't tell you the execution order, the inheritance chain, or which method calls which.

The agent gets fragments. It doesn't get the story.

Structured context engineering provides the story:

QuerySet.filter() creates a Query object
Query accumulates conditions via add_q()
When evaluated, SQLCompiler.as_sql() walks the Query tree
Each node (WhereNode, Col, Ref) has an as_sql() method
The compiler assembles these into a final SQL string
Backend-specific compilers override for dialect differences
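
To make that flow concrete, here is a toy sketch of the same shape: a query object accumulates conditions, and a compiler walks a tree of nodes that each know how to render themselves. The class names mirror the ones mentioned above, but this is a simplified illustration, not Django's implementation, and the Lookup class is a hypothetical stand-in for Django's lookup machinery.

```python
from dataclasses import dataclass, field


@dataclass
class Col:
    table: str
    name: str

    def as_sql(self):
        return f'"{self.table}"."{self.name}"', []


@dataclass
class Lookup:  # hypothetical: one "column <op> value" condition
    lhs: Col
    operator: str
    value: object

    def as_sql(self):
        lhs_sql, params = self.lhs.as_sql()
        return f"{lhs_sql} {self.operator} %s", params + [self.value]


@dataclass
class WhereNode:
    children: list = field(default_factory=list)
    connector: str = "AND"

    def as_sql(self):
        parts, params = [], []
        for child in self.children:
            sql, child_params = child.as_sql()
            parts.append(sql)
            params.extend(child_params)
        return f" {self.connector} ".join(parts), params


@dataclass
class Query:
    table: str
    where: WhereNode = field(default_factory=WhereNode)

    def add_q(self, condition):
        self.where.children.append(condition)


class SQLCompiler:
    def __init__(self, query):
        self.query = query

    def as_sql(self):
        sql = f'SELECT * FROM "{self.query.table}"'
        where_sql, params = self.query.where.as_sql()
        if where_sql:
            sql += f" WHERE {where_sql}"
        return sql, params


query = Query("book")
query.add_q(Lookup(Col("book", "year"), ">", 2000))
query.add_q(Lookup(Col("book", "title"), "=", "Dune"))
sql, params = SQLCompiler(query).as_sql()
print(sql, params)
```

An embedding search could surface any one of these classes; only the ordering of the calls shows how they cooperate.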

This is the difference between handing someone a box of puzzle pieces versus showing them the completed picture.

The Compass Metaphor

A compass doesn't tell you the answer. It tells you which direction to look.

Context engineering works the same way. XCE doesn't write the fix for you. It tells your agent:

Which files are relevant (and which aren't)
How those files relate to each other
What constraints must be preserved
What patterns to follow
What will break if you get it wrong

The agent still does the work. But it does the right work, in the right place, on the first try.

This is why a $0.02/call model with good context (MiniMax M2.5 + XCE at 78.2% on SWE-bench Verified) outperforms a $0.30/call model without it (Claude 4.5 Opus at 76.8%). The expensive model is a faster ship, but it's still sailing without a compass. The cheap model with XCE has the map.

Real Numbers

We tested this on SWE-bench Verified — 500 real bugs from real open-source repositories. The results:

Setup	Resolve Rate	Cost/Instance
MiniMax M2.5 + XCE	78.2%	$0.22
Claude 4.5 Opus (no context)	76.8%	$0.75
Sonnet 4.0 + XCE	73.4%	$0.22
Sonnet 4.0 (no context)	66.0%	$0.22

The improvement scales with codebase complexity:

Simple codebases (flat architecture, few dependencies): +8 percentage points
Medium codebases (some layering, moderate dependencies): +12 percentage points
Complex codebases (deep inheritance, cross-cutting concerns): +17 percentage points

The more complex the architecture, the more valuable the compass becomes. A flat codebase is like sailing in a small lake — you can see the shore from anywhere. A complex codebase is like the open ocean — without navigation, you're lost.

Context Engineering vs. Other Approaches

How does context engineering compare to other ways of helping agents?

Approach	What it provides	Limitation
Better prompts	Clearer instructions	Doesn't help with codebase navigation
Longer context windows	More code visible at once	Agent still doesn't know what's relevant
Embedding RAG	Similar code chunks	No structural relationships
File tree	Directory structure	No semantic understanding
Documentation	Design intent (if it exists)	Usually outdated, incomplete
Context engineering (XCE)	Architecture + structure + semantics	Requires indexing (one-time cost)

The key differentiator: context engineering provides relational information. Not just "here's a file" but "here's how this file connects to 5 other files, what calls it, what it calls, and what role it plays in the system."

Building Your Own Compass

If you want to apply context engineering to your codebase, here's the approach:

Option 1: Use XCE (fastest)

npm install -g xanther-cli
xanther-cli init --api-key YOUR_KEY

This indexes your repo and serves structured context via MCP. Works with any MCP-compatible agent (Claude Code, Kiro, Cursor, OpenCode, Windsurf, Cline).

Option 2: Build lightweight context yourself

If you want a DIY approach, start with these principles:

Map module boundaries: Document which directories/packages form logical modules
Capture key relationships: Which modules depend on which? What are the integration points?
Document constraints: What rules must be preserved? (e.g., "middleware ordering matters")
Provide it via MCP: Build a simple MCP server that serves this context to your agent

Even a hand-written architecture document served via MCP is better than nothing. The agent goes from "I have no idea how this codebase is organized" to "I know the major modules and their relationships."
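
A hand-written map can be as small as a dictionary plus a lookup function. The sketch below is one possible shape, assuming a made-up schema (paths, role, depends_on, constraints) and two example modules drawn from the Django paths mentioned earlier. Exposing it through an MCP server is left out here, since that depends on the SDK you choose.

```python
# Hypothetical hand-written architecture map; the schema is an assumption.
ARCHITECTURE = {
    "cache": {
        "paths": ["django/core/cache/"],
        "role": "Cache framework; backends implement the BaseCache contract.",
        "depends_on": ["files"],
        "constraints": ["Every backend must satisfy the BaseCache contract."],
    },
    "files": {
        "paths": ["django/core/files/"],
        "role": "File handling utilities, including locks.",
        "depends_on": [],
        "constraints": [],
    },
}


def context_for(path: str):
    """Return the module entry that owns a file path, or None if unmapped."""
    for name, module in ARCHITECTURE.items():
        if any(path.startswith(prefix) for prefix in module["paths"]):
            return {"module": name, **module}
    return None


def dependents_of(module_name: str) -> list:
    """Return the modules that list module_name as a dependency."""
    return sorted(
        name
        for name, module in ARCHITECTURE.items()
        if module_name in module["depends_on"]
    )


print(context_for("django/core/cache/backends/filebased.py"))
print(dependents_of("files"))
```

Even this much lets an agent answer "which module owns this file?" and "what might break if I change it?" without reading a single source file.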

Option 3: Steering files

For smaller codebases, agent steering files (like .kiro/steering/ or CLAUDE.md) can provide basic architectural context. These are static documents that get included in every agent interaction.

Limitation: they don't scale. A 500-line steering file for a 300K-line codebase can only capture the highest-level architecture. XCE provides context at every level of detail, dynamically, based on what the agent is working on.

The Future of Agent-Assisted Development

We're at an inflection point. Models are getting better every quarter. Context windows are growing. But the fundamental problem remains: agents don't understand architecture.

A 1M-token context window doesn't help if the agent doesn't know which 5,000 tokens are relevant to the current task. More compute doesn't help if the agent is exploring the wrong part of the codebase.

Context engineering is the missing layer. It sits between the codebase and the agent, providing the architectural awareness that transforms exploration into navigation.

The ships are getting faster. But speed without direction is just expensive wandering. Context engineering is the compass.

Try It

Xanther Context Engine is in open beta. Free tier: 3 repos, 100 queries/month.

npx xanther-cli init --api-key YOUR_KEY

All benchmark results from SWE-bench Verified (500 instances) using mini-swe-agent. Full data: github.com/Xanther-Ai/xce-benchmarks

Why AI Coding Agents Waste 30% of Their Tokens — And How to Fix It

Hoyin kyoma — Sat, 09 May 2026 06:27:34 +0000

The Hidden Cost of Blind Agents

Every AI coding agent has the same workflow: receive a task, search the codebase, read files, write code. The problem is step 2. The agent doesn't know the codebase. It doesn't know the architecture. So it searches.

And searches. And searches.

We analyzed token usage across 500 SWE-bench Verified instances and found that agents spend approximately 30-40% of their tokens on exploration — reading files that turn out to be irrelevant, following import chains that lead nowhere, and backtracking from wrong approaches.

This isn't a model problem. GPT-5, Claude Opus, Gemini — they all do it. The issue is structural: the agent lacks a map of the codebase.

A Real Example: Django Bug #16379

Let's trace through a real bug to see this in action.

Bug: FileBasedCache crashes with FileNotFoundError when multiple processes access the cache simultaneously.

What a human developer does:

Reads the issue — understands it's a race condition in the file cache backend
Knows (from experience) that Django's cache backends inherit from BaseCache
Opens django/core/cache/backends/filebased.py directly
Checks has_key(), which calls os.path.exists() and then opens the file, leaving a window in which another process can delete it
Writes a fix: open the file directly and catch FileNotFoundError instead of checking first
Done. ~5 minutes, ~3 files read.
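
The fix in the walkthrough above comes down to one pattern: attempt the file operation and catch FileNotFoundError, rather than checking for the file first. Here is a minimal sketch of that pattern. The helper names (read_racy, read_safe, remove_safe) are hypothetical and this is not Django's actual code.

```python
import os
import tempfile


def read_racy(path):
    """Check-then-open: another process can delete the file between the calls."""
    if os.path.exists(path):
        with open(path, "rb") as f:  # may still raise FileNotFoundError
            return f.read()
    return None


def read_safe(path):
    """Open directly and treat a missing file as a cache miss."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


def remove_safe(path):
    """Delete the file; report False if it was already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entry.cache")
    with open(path, "wb") as f:
        f.write(b"value")
    first = read_safe(path)
    removed_once = remove_safe(path)
    removed_twice = remove_safe(path)  # second delete is a no-op, not a crash
    after = read_safe(path)
```

The racy version usually works in a single process, which is why this class of bug tends to show up only under concurrent load.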

What an AI agent does (without context):

Reads the issue
Searches for "FileNotFoundError" — finds 47 matches across the codebase
Opens django/core/files/storage.py — wrong file
Opens django/core/files/base.py — wrong file
Searches for "FileBasedCache" — finds it
Opens django/core/cache/backends/filebased.py — right file
Reads the whole file but doesn't understand the inheritance from BaseCache
Writes a fix that handles the error but doesn't respect the cache contract
Test fails
Opens django/core/cache/backends/base.py to understand the base class
Opens django/core/cache/__init__.py to understand the cache framework
Rewrites the fix
Test passes. ~20 minutes, ~12 files read, ~4,000 tokens.

What an AI agent does (with XCE):

Reads the issue
Calls xce_get_context("FileBasedCache FileNotFoundError concurrent access")
Gets back: the cache backend hierarchy, the file operations in filebased.py, the locking patterns, and the test infrastructure
Understands the architecture immediately
Writes the correct fix on the first attempt
Test passes. ~3 minutes, ~3 files read, ~1,500 tokens.

The token savings compound across hundreds of tasks. On our 500-instance benchmark run, XCE reduced total token usage by approximately 20%.

Why Embeddings Aren't Enough

The obvious solution is "just use code search." Tools like Greptile, Sourcegraph Cody, and GitHub Copilot all offer some form of code search. Most use embedding-based retrieval: convert code to vectors, find the most similar vectors to the query.

This works for simple lookups. "Find the login function" → returns the login function. But it fails for architectural questions:

Question	Embedding Search	Architectural Context
"Which module owns this logic?"	Returns similar code snippets	Returns the HLD module, its role, and its boundaries
"What depends on this function?"	Returns functions with similar names	Returns the call graph and downstream consumers
"If I change this, what breaks?"	Returns similar code (not dependent code)	Returns impact analysis with affected modules
"How does this fit in the architecture?"	Returns nearby code	Returns HLD → LLD → code hierarchy

The fundamental issue: embeddings measure text similarity, not structural relationships. Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain.
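
A small sketch can show the mismatch. It uses difflib.SequenceMatcher from the standard library as a crude lexical stand-in for embedding similarity (an assumption made for the sake of a dependency-free example; real embeddings behave differently in detail) and a hand-written call graph for the structural side. The three functions are invented.

```python
from difflib import SequenceMatcher

# Invented function bodies, used only for this comparison.
FUNCTIONS = {
    "save_user": "def save_user(user):\n    db.insert('users', user)\n",
    "save_order": "def save_order(order):\n    db.insert('orders', order)\n",
    "checkout": "def checkout(cart):\n    order = build(cart)\n    save_order(order)\n",
}

# Hand-written call graph: caller -> set of callees.
CALLS = {"checkout": {"save_order"}, "save_user": set(), "save_order": set()}


def text_similarity(a: str, b: str) -> float:
    """Lexical similarity between two function bodies, from 0.0 to 1.0."""
    return SequenceMatcher(None, FUNCTIONS[a], FUNCTIONS[b]).ratio()


def coupled(a: str, b: str) -> bool:
    """True if either function calls the other directly."""
    return b in CALLS[a] or a in CALLS[b]


similar_pair = text_similarity("save_user", "save_order")
coupled_pair = text_similarity("checkout", "save_order")

print(f"save_user vs save_order: {similar_pair:.2f}, coupled={coupled('save_user', 'save_order')}")
print(f"checkout vs save_order: {coupled_pair:.2f}, coupled={coupled('checkout', 'save_order')}")
```

The textually closest pair has no call relationship, while the pair that is actually coupled scores lower on similarity. A change to save_order affects checkout, not save_user.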

The Architecture Gap Across Repositories

We measured the improvement from XCE across five major open-source repositories. The results reveal a clear pattern:

Repository	Architecture Type	Baseline	With XCE	Delta
sympy	Deep module dependencies	45%	62%	+17%
scikit-learn	Complex inheritance chains	58%	71%	+13%
matplotlib	Multi-backend rendering pipeline	52%	65%	+13%
django	Layered MVC + ORM + middleware	62%	74%	+12%
pytest	Plugin system (relatively flat)	70%	78%	+8%

sympy (+17%): The largest improvement. Sympy has deep cross-module dependencies. A bug in sympy/core/expr.py might require understanding sympy/simplify/, sympy/printing/, sympy/polys/, and sympy/series/. Without a map, the agent gets lost in the dependency maze. With XCE, it knows which modules are structurally related before it starts exploring.

scikit-learn (+13%): Complex estimator inheritance. LogisticRegression combines LinearClassifierMixin (which itself extends ClassifierMixin), SparseCoefMixin, and BaseEstimator. A bug reported against LogisticRegression might actually live in LinearClassifierMixin.decision_function() or even BaseEstimator.set_params(). The agent needs to understand the full inheritance chain to find the right place to fix.

pytest (+8%): The smallest improvement. Pytest has a plugin system that's complex, but most bugs are localized to a single file or module. The agent doesn't need as much architectural context because the architecture is relatively flat.

The correlation is strong: the more architecturally complex the codebase, the more the agent benefits from having a structural map.
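
The deltas in the table are differences in resolve rate, so they are percentage points rather than relative percentages. The short sketch below recomputes both from the table's baseline and with-XCE figures. Only the numbers come from the table; the variable names are illustrative.

```python
# (baseline %, with XCE %) per repository, copied from the table above.
RESULTS = {
    "sympy": (45, 62),
    "scikit-learn": (58, 71),
    "matplotlib": (52, 65),
    "django": (62, 74),
    "pytest": (70, 78),
}

# Absolute gain in percentage points.
deltas = {repo: after - before for repo, (before, after) in RESULTS.items()}

# Relative gain over the baseline, in percent.
relative = {
    repo: round(100 * (after - before) / before, 1)
    for repo, (before, after) in RESULTS.items()
}

ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
for repo, points in ranked:
    print(f"{repo}: +{points} points ({relative[repo]}% relative)")
```

Measured relative to baseline, the spread between sympy and pytest is wider than the point deltas alone suggest, because sympy starts from the lowest baseline.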

This has a practical implication: if your codebase is a simple CRUD app with flat architecture, XCE helps modestly. If your codebase is a complex system with deep module dependencies, layered abstractions, and cross-cutting concerns — XCE helps dramatically.

How XCE Works

XCE uses the proprietary PRAT algorithm to build a structured codebase index that captures architectural relationships — not just code text. Unlike embedding-based search, PRAT understands structural connections between components at multiple levels of abstraction.

When an agent queries XCE, it gets back a structured response that includes: what module the code belongs to, what its role is in the system, what depends on it, and what it depends on. The agent doesn't just know where the code is — it knows why it exists and how it connects to the rest of the system.

This is served via MCP, so any compatible agent gets architectural context on every tool call without modifications.

Practical Setup

XCE runs as an MCP service. Any MCP-compatible agent can connect with one config block:

# Index your repo (one command)
npx xanther-cli init --api-key YOUR_KEY

This indexes the codebase and installs a git hook that auto-syncs after every commit. Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

Works with Claude Code, Kiro, Cursor, OpenCode, Windsurf — any MCP-compatible tool.
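
If you would rather not paste the key into a config file by hand, the same block can be generated from an environment variable. This is a minimal sketch: the variable name XCE_API_KEY is an assumption, and where the resulting JSON needs to be written depends on your agent.

```python
import json
import os


def build_mcp_config(api_key: str) -> dict:
    """Build the MCP config block shown above for a given API key."""
    return {
        "mcpServers": {
            "xanther-xce": {
                "url": "https://mcp.xanther.ai/sse",
                "headers": {"Authorization": f"Bearer {api_key}"},
            }
        }
    }


# XCE_API_KEY is a hypothetical variable name; fall back to the placeholder.
api_key = os.environ.get("XCE_API_KEY", "YOUR_KEY")
config_json = json.dumps(build_mcp_config(api_key), indent=2)
print(config_json)
```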

The agent gets five tools:

xce_get_context — Full architectural context for a problem statement
xce_search — Semantic search across the codebase
xce_architecture_context — Architecture around a specific file or symbol
xce_trace — Trace relationships from code to design artifacts
xce_impact_analysis — What breaks if you change specific files

The Bottom Line

AI coding agents are getting better every quarter. But the bottleneck isn't model capability — it's context quality. A cheap model with the right context outperforms an expensive model without it.

The numbers:

78.2% on SWE-bench Verified with MiniMax M2.5 + XCE (beats every model on the official leaderboard)
20% token reduction per task (fewer wrong turns, less exploration)
$0.22 per instance (about 3.4x cheaper than Claude 4.5 Opus at $0.75)

Context is cheaper than compute. And it compounds: better models + better context = better results than either alone.

Xanther is in open beta. Free tier: 3 repos, 100 queries/month.

How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard

Hoyin kyoma — Sat, 09 May 2026 05:53:53 +0000

TL;DR

We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5, a model that costs $0.02 per call, scored 78.2%, surpassing every model on the official mini-SWE-agent leaderboard, including Claude 4.5 Opus (76.8%), which costs about 3.4x more per instance ($0.75 vs. $0.22). The improvement comes entirely from better context, not a better model.

Full benchmark results and interactive dashboard: xanther.ai/benchmarks
Try it free: xanther.ai

The Official Leaderboard (as of February 2026)

The SWE-bench Verified leaderboard uses mini-SWE-agent as a standardized harness to evaluate models on 500 human-verified bug instances from real open-source Python repositories. Here are the top results:

Rank	Model	Resolve Rate	Cost/Instance
1	Claude 4.5 Opus (high reasoning)	76.80%	$0.75
2	Gemini 3 Flash (high reasoning)	75.80%	$0.36
3	MiniMax M2.5 (high reasoning)	75.80%	$0.07
4	Claude Opus 4.6	75.60%	$0.55
5	GPT-5-2 Codex	72.80%	$0.45
6	Claude 4.5 Sonnet (high reasoning)	71.40%	$0.66
7	Kimi K2.5 (high reasoning)	70.80%	$0.15
8	DeepSeek V3.2 (high reasoning)	70.00%	$0.45

Source: swebench.com, February 2026

The top score is 76.80% from Claude 4.5 Opus with high reasoning enabled, at $0.75 per instance. The cheapest competitive model is MiniMax M2.5 at 75.80% for $0.07.

Now here's what happens when you add Xanther Context Engine:

Model	Without XCE	With XCE	Delta	Cost/Instance
MiniMax M2.5	75.80%	78.20%	+2.4pp	$0.22
Sonnet 4.0	66.00%	73.40%	+7.4pp	$0.22
Sonnet 4.0 (cascade hybrid)	66.00%	76.80%	+10.8pp	—

MiniMax M2.5 + XCE at 78.2% would be the #1 entry on the official leaderboard — and it costs $0.22 per instance, not $0.75.
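
The ranking claim can be checked directly against the leaderboard table. The sketch below copies the eight scores listed above and counts how many entries beat a given score. The rank_of helper is illustrative.

```python
# Resolve rates copied from the leaderboard table above.
LEADERBOARD = [
    ("Claude 4.5 Opus (high reasoning)", 76.8),
    ("Gemini 3 Flash (high reasoning)", 75.8),
    ("MiniMax M2.5 (high reasoning)", 75.8),
    ("Claude Opus 4.6", 75.6),
    ("GPT-5-2 Codex", 72.8),
    ("Claude 4.5 Sonnet (high reasoning)", 71.4),
    ("Kimi K2.5 (high reasoning)", 70.8),
    ("DeepSeek V3.2 (high reasoning)", 70.0),
]


def rank_of(score: float, board=LEADERBOARD) -> int:
    """Rank a new score would take: one plus the number of higher scores."""
    return 1 + sum(1 for _, existing in board if existing > score)


minimax_xce_rank = rank_of(78.2)
sonnet_xce_rank = rank_of(73.4)
print(minimax_xce_rank, sonnet_xce_rank)
```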

See the full results breakdown: xanther.ai/benchmarks | Raw data: github.com/Xanther-Ai/xce-benchmarks

What Is SWE-bench Verified?

SWE-bench Verified is the industry-standard benchmark for evaluating AI coding agents on real-world software engineering tasks. It consists of 500 instances, each representing a real bug from a real open-source Python repository. Each instance includes:

A problem statement (the GitHub issue description)
A codebase snapshot (the repository at the time the bug was reported)
A gold patch (the actual fix that was merged)
Test cases that verify the fix

The agent must read the problem statement, navigate the codebase, write a patch, and pass the test cases. No hints, no file locations, no guidance beyond the issue description.

The repositories span a wide range of complexity:

Repository	Stars	Files	Lines of Code	Architecture
django/django	82K	~4,000	~300K	Layered MVC, ORM, middleware, admin
scikit-learn	61K	~1,200	~200K	Estimator inheritance chains, pipelines
sympy/sympy	13K	~1,500	~400K	Deep mathematical module dependencies
matplotlib	20K	~1,000	~150K	Complex rendering pipeline, backends
pytest	12K	~400	~50K	Plugin system, fixture resolution

This is not a toy benchmark. These are production codebases with real architectural complexity.

The Context Problem

Watch what happens when a coding agent tries to fix a bug in Django without architectural context:

Bug: django__django-16379 — FileBasedCache crashes with FileNotFoundError on concurrent access

Agent behavior (without XCE):

Searches for "FileBasedCache" — finds the class in django/core/cache/backends/filebased.py
Reads the file, sees the delete() method
Doesn't understand the cache backend hierarchy — misses that FileBasedCache inherits from BaseCache
Doesn't know about the concurrent access patterns in Django's cache framework
Writes a fix that handles the FileNotFoundError but breaks the cache invalidation contract
Test fails. Tries again.
Explores django/core/cache/__init__.py, django/core/cache/backends/base.py
Eventually finds the right approach after 15+ file reads and 4,000+ tokens

Agent behavior (with XCE):

Calls xce_get_context("FileBasedCache FileNotFoundError concurrent access")
Gets back: the cache backend hierarchy (BaseCache → FileBasedCache), the locking mechanism, the file operations that can race, and the related test infrastructure
Understands the architecture immediately
Writes a fix that wraps the file operation in a try/except with proper fallback
Test passes on first attempt. ~1,500 tokens.

The difference isn't that the model is smarter. It's that the model has a map.

Per-Repository Analysis

XCE doesn't provide a uniform boost. The improvement correlates strongly with architectural complexity:

Repository	Sonnet 4.0 Baseline	Sonnet 4.0 + XCE	Delta	Why
sympy/sympy	45%	62%	+17%	Deep module dependencies. A fix in `sympy/core/` often requires understanding `sympy/simplify/`, `sympy/printing/`, and `sympy/polys/`. Without context, the agent gets lost in the dependency maze.
scikit-learn	58%	71%	+13%	Complex estimator inheritance. `BaseEstimator` → `ClassifierMixin` → `LinearClassifierMixin` → `LogisticRegression`. Bugs often require understanding the full chain.
matplotlib	52%	65%	+13%	Rendering pipeline with multiple backends. A bug in `axes.py` might require understanding `figure.py`, `backend_agg.py`, and the transform system.
django/django	62%	74%	+12%	Layered architecture (models → views → templates → middleware). Bugs cross layers frequently.
pytest	70%	78%	+8%	Relatively flat architecture. The plugin system is complex but most bugs are localized. Less benefit from architectural context.

The pattern is clear: the more architectural dependencies a codebase has, the more the agent benefits from having a structural map.

Pytest, with its relatively flat architecture, sees the smallest improvement (+8 percentage points, from 70% to 78%). Sympy, where fixing a bug in one module often requires understanding five others, sees the largest (+17 percentage points, from 45% to 62%).

The Cost Analysis

Here's where it gets interesting from a business perspective.

The official leaderboard shows that reaching 76%+ on SWE-bench Verified requires expensive models:

Score Range	Example Model	Cost/Instance
76%+	Claude 4.5 Opus (high reasoning)	$0.75
75%+	MiniMax M2.5 (high reasoning)	$0.07
72%+	GPT-5-2 Codex	$0.45
70%+	DeepSeek V3.2 (high reasoning)	$0.45

With XCE, the cost equation changes:

Score	Setup	Cost/Instance	Savings vs. Opus
78.2%	MiniMax M2.5 + XCE	$0.22	3.4x cheaper
73.4%	Sonnet 4.0 + XCE	$0.22	3.4x cheaper
76.8%	Claude 4.5 Opus (no XCE)	$0.75	baseline

The $0.22 includes the XCE query cost (~$0.001 per query, amortized over multiple queries per instance) plus the model inference cost. The XCE overhead is negligible — the savings come from the model needing fewer tokens to solve each problem.

Token reduction: XCE reduces token usage by approximately 20% per task. The agent makes fewer wrong turns, reads fewer irrelevant files, and arrives at the solution faster. On a 500-instance benchmark run, this translates to significant cost savings.

At scale, the math is compelling. A team running 1,000 coding agent tasks per month:

Setup	Monthly Cost	Annual Cost
Claude Opus (no XCE)	$750	$9,000
MiniMax M2.5 + XCE	$220	$2,640
Savings	$530/mo	$6,360/yr

And the XCE setup gets better results.
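
The table's arithmetic is easy to reproduce. The sketch below uses only the per-instance costs and the 1,000 tasks/month figure stated above. Rounding guards against floating-point noise.

```python
TASKS_PER_MONTH = 1000

# Cost per instance in USD, from the tables above.
COST_PER_INSTANCE = {
    "Claude 4.5 Opus (no XCE)": 0.75,
    "MiniMax M2.5 + XCE": 0.22,
}

monthly = {
    setup: round(cost * TASKS_PER_MONTH, 2)
    for setup, cost in COST_PER_INSTANCE.items()
}
annual = {setup: round(cost * 12, 2) for setup, cost in monthly.items()}

monthly_savings = round(
    monthly["Claude 4.5 Opus (no XCE)"] - monthly["MiniMax M2.5 + XCE"], 2
)
annual_savings = round(monthly_savings * 12, 2)
cost_ratio = round(
    COST_PER_INSTANCE["Claude 4.5 Opus (no XCE)"]
    / COST_PER_INSTANCE["MiniMax M2.5 + XCE"],
    1,
)

print(monthly, annual, monthly_savings, annual_savings, cost_ratio)
```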

How XCE Works (High Level)

XCE indexes a codebase into a multi-level structured representation that captures both code and architecture. When an agent queries XCE, it gets back context at the right level of abstraction — not just a code snippet, but an understanding of where that code fits in the system, what depends on it, and what it depends on.

The indexing uses the proprietary PRAT algorithm to build this structured index. The key difference from embedding-based search: PRAT captures structural relationships between components, not just text similarity. This means the agent can ask "what depends on this function?" and get a real answer — something embeddings alone cannot provide.

The result is served via MCP, so any compatible agent gets architectural context on every tool call without any changes to the agent itself.

Reproducing These Results

All results are published and reproducible. The benchmark repository includes predictions, resolved instance IDs, and trajectory download scripts:

Repository: github.com/Xanther-Ai/xce-benchmarks

To reproduce:

# 1. Install mini-swe-agent
pip install mini-swe-agent

# 2. Get an XCE API key (free at app.xanther.ai)
# 3. Index the target repo
npx xanther-cli init --api-key xce_your_key

# 4. Run the benchmark
mini-swe-agent run \
  --model claude-sonnet-4-20250514 \
  --dataset swe-bench-verified \
  --mcp-config '{"xanther": {"url": "https://mcp.xanther.ai/sse", "headers": {"Authorization": "Bearer xce_your_key"}}}'

# 5. Evaluate
sb submit --predictions results/preds.jsonl

Each run's preds.jsonl contains one prediction per instance:

{
  "instance_id": "django__django-16379",
  "model_name_or_path": "sonnet-4.0-xce",
  "model_patch": "diff --git a/...",
  "full_output": "..."
}
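
For a quick sanity check before evaluating, a predictions file in this shape can be loaded with a few lines of Python. This is a minimal sketch that assumes one JSON object per line and treats instance_id, model_name_or_path, and model_patch as required. The load_predictions helper is hypothetical, and the sample record reuses the placeholder values shown above.

```python
import io
import json

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}

# In-memory stand-in for a preds.jsonl file with a single record.
SAMPLE = io.StringIO(
    '{"instance_id": "django__django-16379", '
    '"model_name_or_path": "sonnet-4.0-xce", '
    '"model_patch": "diff --git a/...", "full_output": "..."}\n'
)


def load_predictions(stream) -> dict:
    """Parse JSON Lines predictions into a dict keyed by instance_id."""
    predictions = {}
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        predictions[record["instance_id"]] = record
    return predictions


predictions = load_predictions(SAMPLE)
print(len(predictions), "prediction(s) loaded")
```

To run it on a real file, pass an open file handle instead of SAMPLE.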

Trajectory files (100-600MB per run) are available for download from S3 for detailed analysis.

What This Means

Three takeaways:

1. Context is cheaper than compute. You don't need the most expensive model to get the best results. You need the right context. A $0.02/call model with good architectural context outperforms a $0.30/call model without it.

2. The improvement scales with complexity. Simple codebases with flat architectures see modest gains (+8 percentage points). Complex codebases with deep dependencies see larger gains (+17 percentage points). As codebases grow, the value of architectural context increases.

3. This is model-agnostic. XCE works with any MCP-compatible agent. The same context infrastructure that improves MiniMax M2.5 also improves Sonnet 4.0, and would improve any future model. Better models + better context = compounding gains.

Learn more about how XCE works: xanther.ai | See the benchmark methodology: xanther.ai/benchmarks

Try It

Xanther is in open beta. Free tier: 3 repos, 100 queries/month. No credit card.

npx xanther-cli init --api-key YOUR_KEY

Website: xanther.ai
Benchmark Dashboard: xanther.ai/benchmarks
Dashboard: app.xanther.ai
Benchmarks (raw data): github.com/Xanther-Ai/xce-benchmarks
Discord: discord.gg/Y768kBRS
npm: npmjs.com/package/xanther-cli

All benchmark results were evaluated using the official SWE-bench CLI (sb submit) against SWE-bench Verified (500 instances). The agent harness is mini-swe-agent. Predictions and resolved instance IDs are published at github.com/Xanther-Ai/xce-benchmarks.