<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hoyin kyoma</title>
    <description>The latest articles on DEV Community by Hoyin kyoma (@kyoma_1234).</description>
    <link>https://dev.to/kyoma_1234</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921114%2F8be2e4b6-ba5a-4b85-8cc0-e3c43854a551.jpg</url>
      <title>DEV Community: Hoyin kyoma</title>
      <link>https://dev.to/kyoma_1234</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kyoma_1234"/>
    <language>en</language>
    <item>
      <title>Why AI Coding Agents Waste 30% of Their Tokens — And How to Fix It</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sat, 09 May 2026 06:27:34 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/why-ai-coding-agents-waste-30-of-their-tokens-and-how-to-fix-it-42c1</link>
      <guid>https://dev.to/kyoma_1234/why-ai-coding-agents-waste-30-of-their-tokens-and-how-to-fix-it-42c1</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Cost of Blind Agents
&lt;/h2&gt;

&lt;p&gt;Every AI coding agent has the same workflow: receive a task, search the codebase, read files, write code. The problem is step 2. The agent doesn't know the codebase. It doesn't know the architecture. So it searches.&lt;/p&gt;

&lt;p&gt;And searches. And searches.&lt;/p&gt;

&lt;p&gt;We analyzed token usage across 500 SWE-bench Verified instances and found that agents spend approximately 30-40% of their tokens on &lt;strong&gt;exploration&lt;/strong&gt; — reading files that turn out to be irrelevant, following import chains that lead nowhere, and backtracking from wrong approaches.&lt;/p&gt;
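&lt;p&gt;The measurement itself is simple: given a per-task action trace, mark each step by whether it contributed to the final patch, then sum the tokens. A minimal sketch (the trace format and numbers here are hypothetical, not our actual harness):&lt;/p&gt;

```python
# Hypothetical agent trace: (action, tokens, fed_final_patch) tuples.
trace = [
    ("read django/core/files/storage.py", 900, False),   # dead end
    ("read django/core/files/base.py", 700, False),      # dead end
    ("read django/core/cache/backends/filebased.py", 1200, True),
    ("write patch", 1200, True),
]

def exploration_share(trace):
    """Fraction of tokens spent on steps that never fed the final patch."""
    total = sum(tokens for _, tokens, _ in trace)
    wasted = sum(tokens for _, tokens, useful in trace if not useful)
    return wasted / total

print(exploration_share(trace))  # 0.4 on this toy trace
```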

&lt;p&gt;This isn't a model problem. GPT-5, Claude Opus, Gemini — they all do it. The issue is structural: the agent lacks a map of the codebase.&lt;/p&gt;





&lt;h2&gt;
  
  
  A Real Example: Django Bug #16379
&lt;/h2&gt;

&lt;p&gt;Let's trace through a real bug to see this in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug:&lt;/strong&gt; &lt;code&gt;FileBasedCache&lt;/code&gt; crashes with &lt;code&gt;FileNotFoundError&lt;/code&gt; when multiple processes access the cache simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What a human developer does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue — understands it's a race condition in the file cache backend&lt;/li&gt;
&lt;li&gt;Knows (from experience) that Django's cache backends inherit from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt; directly&lt;/li&gt;
&lt;li&gt;Checks the &lt;code&gt;delete()&lt;/code&gt; and &lt;code&gt;_cull()&lt;/code&gt; methods for file operations without proper locking&lt;/li&gt;
&lt;li&gt;Writes a fix: wrap the &lt;code&gt;os.remove()&lt;/code&gt; call in a try/except for &lt;code&gt;FileNotFoundError&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Done. ~5 minutes, ~3 files read.&lt;/li&gt;
&lt;/ol&gt;
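&lt;p&gt;Step 5 is a small, defensive change. A minimal sketch of the pattern (illustrative, not the exact patch Django merged; the function name is made up):&lt;/p&gt;

```python
import os

def delete_entry(path):
    """Delete a cache file, tolerating a concurrent delete.

    Between deciding to delete and calling os.remove(), another
    process may have removed the file already. Treat that as
    "entry gone" instead of crashing with FileNotFoundError.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Lost the race: the other process did our work for us.
        return False
```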

&lt;p&gt;&lt;strong&gt;What an AI agent does (without context):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue&lt;/li&gt;
&lt;li&gt;Searches for "FileNotFoundError" — finds 47 matches across the codebase&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/files/storage.py&lt;/code&gt; — wrong file&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/files/base.py&lt;/code&gt; — wrong file&lt;/li&gt;
&lt;li&gt;Searches for "FileBasedCache" — finds it&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt; — right file&lt;/li&gt;
&lt;li&gt;Reads the whole file but doesn't understand the inheritance from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Writes a fix that handles the error but doesn't respect the cache contract&lt;/li&gt;
&lt;li&gt;Test fails&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/base.py&lt;/code&gt; to understand the base class&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/__init__.py&lt;/code&gt; to understand the cache framework&lt;/li&gt;
&lt;li&gt;Rewrites the fix&lt;/li&gt;
&lt;li&gt;Test passes. ~20 minutes, ~12 files read, ~4,000 tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What an AI agent does (with XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;xce_get_context("FileBasedCache FileNotFoundError concurrent access")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gets back: the cache backend hierarchy, the file operations in &lt;code&gt;filebased.py&lt;/code&gt;, the locking patterns, and the test infrastructure&lt;/li&gt;
&lt;li&gt;Understands the architecture immediately&lt;/li&gt;
&lt;li&gt;Writes the correct fix on the first attempt&lt;/li&gt;
&lt;li&gt;Test passes. ~3 minutes, ~3 files read, ~1,500 tokens.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The token savings compound across hundreds of tasks. On our 500-instance benchmark run, XCE reduced total token usage by approximately 20%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Embeddings Aren't Enough
&lt;/h2&gt;

&lt;p&gt;The obvious solution is "just use code search." Tools like Greptile, Sourcegraph Cody, and GitHub Copilot all offer some form of code search. Most use embedding-based retrieval: convert code to vectors, find the most similar vectors to the query.&lt;/p&gt;

&lt;p&gt;This works for simple lookups. "Find the login function" → returns the login function. But it fails for architectural questions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Embedding Search&lt;/th&gt;
&lt;th&gt;Architectural Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Which module owns this logic?"&lt;/td&gt;
&lt;td&gt;Returns similar code snippets&lt;/td&gt;
&lt;td&gt;Returns the HLD module, its role, and its boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What depends on this function?"&lt;/td&gt;
&lt;td&gt;Returns functions with similar names&lt;/td&gt;
&lt;td&gt;Returns the call graph and downstream consumers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"If I change this, what breaks?"&lt;/td&gt;
&lt;td&gt;Returns similar code (not dependent code)&lt;/td&gt;
&lt;td&gt;Returns impact analysis with affected modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"How does this fit in the architecture?"&lt;/td&gt;
&lt;td&gt;Returns nearby code&lt;/td&gt;
&lt;td&gt;Returns HLD → LLD → code hierarchy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental issue: &lt;strong&gt;embeddings measure text similarity, not structural relationships.&lt;/strong&gt; Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain.&lt;/p&gt;
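&lt;p&gt;A toy illustration makes the gap concrete. Below, token overlap stands in for embedding similarity, and a small call graph encodes the structural relationship (both the descriptions and the graph are invented for the example):&lt;/p&gt;

```python
def token_overlap(a, b):
    """Crude stand-in for embedding similarity: shared-token ratio."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta.intersection(tb)) / len(ta.union(tb))

# Two functions whose descriptions read almost identically...
desc_save_user = "save the user record to the database"
desc_save_log = "save the log record to the database"

# ...and a call graph showing they are structurally unrelated:
calls = {
    "handle_signup": ["validate_email", "save_user"],
    "rotate_logs": ["save_log"],
    "validate_email": [],
    "save_user": [],
    "save_log": [],
}

def dependents(graph, fn):
    """Who breaks if fn changes? Walk the edges, not the text."""
    return sorted(c for c, callees in graph.items() if fn in callees)
```

&lt;p&gt;Here &lt;code&gt;token_overlap(desc_save_user, desc_save_log)&lt;/code&gt; is high (about 0.71), yet &lt;code&gt;dependents(calls, "save_user")&lt;/code&gt; and &lt;code&gt;dependents(calls, "save_log")&lt;/code&gt; share nothing: precisely the distinction a structural index preserves and a vector index loses.&lt;/p&gt;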





&lt;h2&gt;
  
  
  The Architecture Gap Across Repositories
&lt;/h2&gt;

&lt;p&gt;We measured the improvement from XCE across five major open-source repositories. The results reveal a clear pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Architecture Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;With XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sympy&lt;/td&gt;
&lt;td&gt;Deep module dependencies&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;+17pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;Complex inheritance chains&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;+13pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;Multi-backend rendering pipeline&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;+13pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django&lt;/td&gt;
&lt;td&gt;Layered MVC + ORM + middleware&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;+12pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;Plugin system (relatively flat)&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;+8pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;sympy (+17pp):&lt;/strong&gt; The largest improvement. Sympy has deep cross-module dependencies. A bug in &lt;code&gt;sympy/core/expr.py&lt;/code&gt; might require understanding &lt;code&gt;sympy/simplify/&lt;/code&gt;, &lt;code&gt;sympy/printing/&lt;/code&gt;, &lt;code&gt;sympy/polys/&lt;/code&gt;, and &lt;code&gt;sympy/series/&lt;/code&gt;. Without a map, the agent gets lost in the dependency maze. With XCE, it knows which modules are structurally related before it starts exploring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scikit-learn (+13pp):&lt;/strong&gt; Complex estimator inheritance. &lt;code&gt;BaseEstimator&lt;/code&gt; → &lt;code&gt;ClassifierMixin&lt;/code&gt; → &lt;code&gt;LinearClassifierMixin&lt;/code&gt; → &lt;code&gt;LogisticRegression&lt;/code&gt;. A bug in &lt;code&gt;LogisticRegression.fit()&lt;/code&gt; might actually be in &lt;code&gt;LinearClassifierMixin._fit()&lt;/code&gt; or even &lt;code&gt;BaseEstimator.set_params()&lt;/code&gt;. The agent needs to understand the full inheritance chain to find the right place to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pytest (+8pp):&lt;/strong&gt; The smallest improvement. Pytest's plugin system is complex, but most bugs are localized to a single file or module. The agent needs less architectural context because the architecture is relatively flat.&lt;/p&gt;


&lt;p&gt;The correlation is strong: &lt;strong&gt;the more architecturally complex the codebase, the more the agent benefits from having a structural map.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This has a practical implication: if your codebase is a simple CRUD app with flat architecture, XCE helps modestly. If your codebase is a complex system with deep module dependencies, layered abstractions, and cross-cutting concerns — XCE helps dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  How XCE Works
&lt;/h2&gt;

&lt;p&gt;XCE uses the proprietary PRAT algorithm to build a structured codebase index that captures architectural relationships — not just code text. Unlike embedding-based search, PRAT understands structural connections between components at multiple levels of abstraction.&lt;/p&gt;

&lt;p&gt;When an agent queries XCE, it gets back a structured response that includes: what module the code belongs to, what its role is in the system, what depends on it, and what it depends on. The agent doesn't just know where the code is — it knows why it exists and how it connects to the rest of the system.&lt;/p&gt;
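&lt;p&gt;The response shape below is purely illustrative (the real XCE schema is not public, so the field names are assumptions); it shows the four pieces named above as an agent might consume them:&lt;/p&gt;

```python
# Illustrative only: field names are assumptions, not the real XCE schema.
context = {
    "module": "django.core.cache.backends",
    "role": "File-based cache backend implementing the BaseCache contract",
    "depends_on": ["django.core.cache.backends.base"],
    "used_by": ["django.core.cache"],
}

def to_preamble(ctx):
    """Flatten the structured context into a prompt preamble."""
    return "\n".join([
        f"Module: {ctx['module']}",
        f"Role: {ctx['role']}",
        "Depends on: " + ", ".join(ctx["depends_on"]),
        "Used by: " + ", ".join(ctx["used_by"]),
    ])
```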

&lt;p&gt;This is served via MCP, so any compatible agent gets architectural context on every tool call without modifications.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Setup
&lt;/h2&gt;

&lt;p&gt;XCE runs as an MCP service, and any MCP-compatible agent can connect. Setup takes two steps. First, index your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Index your repo (one command)&lt;/span&gt;
npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indexes the codebase and installs a git hook that auto-syncs after every commit. Then add to your agent's MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xanther-xce"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.xanther.ai/sse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer YOUR_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with Claude Code, Kiro, Cursor, OpenCode, Windsurf — any MCP-compatible tool.&lt;/p&gt;

&lt;p&gt;The agent gets five tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;xce_get_context&lt;/code&gt; — Full architectural context for a problem statement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_search&lt;/code&gt; — Semantic search across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_architecture_context&lt;/code&gt; — Architecture around a specific file or symbol&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_trace&lt;/code&gt; — Trace relationships from code to design artifacts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_impact_analysis&lt;/code&gt; — What breaks if you change specific files&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI coding agents are getting better every quarter. But the bottleneck isn't model capability — it's context quality. A cheap model with the right context outperforms an expensive model without it.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;78.2%&lt;/strong&gt; on SWE-bench Verified with MiniMax M2.5 + XCE (beats every model on the official leaderboard)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% token reduction&lt;/strong&gt; per task (fewer wrong turns, less exploration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.22&lt;/strong&gt; per instance (3.4x cheaper than Claude Opus)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Context is cheaper than compute. And it compounds: better models + better context = better results than either alone.&lt;/p&gt;




&lt;p&gt;Xanther is in open beta. Free tier: 3 repos, 100 queries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;Benchmarks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/Y768kBRS" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/xanther-cli" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sat, 09 May 2026 05:53:53 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/how-a-002call-model-scored-782-on-swe-bench-verified-beating-every-model-on-the-leaderboard-4bm3</link>
      <guid>https://dev.to/kyoma_1234/how-a-002call-model-scored-782-on-swe-bench-verified-beating-every-model-on-the-leaderboard-4bm3</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5 — a model that costs $0.02 per call — scored 78.2%, surpassing every model on the official mini-SWE-agent leaderboard, including Claude Opus 4.5 (76.8%) which costs 37x more per call. The improvement comes entirely from better context, not a better model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full benchmark results and interactive dashboard: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;&lt;br&gt;
Try it free: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Official Leaderboard (as of February 2026)
&lt;/h2&gt;

&lt;p&gt;The SWE-bench Verified leaderboard uses mini-SWE-agent as a standardized harness to evaluate models on 500 human-verified bug instances from real open-source Python repositories. Here are the top results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Resolve Rate&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (high reasoning)&lt;/td&gt;
&lt;td&gt;76.80%&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Gemini 3 Flash (high reasoning)&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;$0.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;75.60%&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;GPT-5-2 Codex&lt;/td&gt;
&lt;td&gt;72.80%&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Claude 4.5 Sonnet (high reasoning)&lt;/td&gt;
&lt;td&gt;71.40%&lt;/td&gt;
&lt;td&gt;$0.66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Kimi K2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;70.80%&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2 (high reasoning)&lt;/td&gt;
&lt;td&gt;70.00%&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;swebench.com&lt;/a&gt;, February 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The top score is 76.80% from Claude 4.5 Opus with high reasoning enabled, at $0.75 per instance. The cheapest competitive model is MiniMax M2.5 at 75.80% for $0.07.&lt;/p&gt;

&lt;p&gt;Now here's what happens when you add Xanther Context Engine:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Without XCE&lt;/th&gt;
&lt;th&gt;With XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.20%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.4pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0&lt;/td&gt;
&lt;td&gt;66.00%&lt;/td&gt;
&lt;td&gt;73.40%&lt;/td&gt;
&lt;td&gt;+7.4pp&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0 (cascade hybrid)&lt;/td&gt;
&lt;td&gt;66.00%&lt;/td&gt;
&lt;td&gt;76.80%&lt;/td&gt;
&lt;td&gt;+10.8pp&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MiniMax M2.5 + XCE at 78.2% would be the #1 entry on the official leaderboard&lt;/strong&gt; — and it costs $0.22 per instance, not $0.75.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;See the full results breakdown: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt; | Raw data: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkicyes8r8j52rjn5dss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkicyes8r8j52rjn5dss.png" width="720" height="244"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is SWE-bench Verified?
&lt;/h2&gt;

&lt;p&gt;SWE-bench Verified is the industry-standard benchmark for evaluating AI coding agents on real-world software engineering tasks. It consists of 500 instances, each representing a real bug from a real open-source Python repository. Each instance includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;problem statement&lt;/strong&gt; (the GitHub issue description)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;codebase snapshot&lt;/strong&gt; (the repository at the time the bug was reported)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;gold patch&lt;/strong&gt; (the actual fix that was merged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test cases&lt;/strong&gt; that verify the fix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent must read the problem statement, navigate the codebase, write a patch, and pass the test cases. No hints, no file locations, no guidance beyond the issue description.&lt;/p&gt;
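&lt;p&gt;Concretely, each instance bundles those artifacts as a single record. An abbreviated sketch (field names follow the public SWE-bench dataset release; values here are placeholders, not real dataset contents):&lt;/p&gt;

```python
# Abbreviated; field names follow the public SWE-bench release,
# values are placeholders rather than real dataset contents.
instance = {
    "instance_id": "django__django-16379",
    "repo": "django/django",
    "base_commit": "(SHA of the repo snapshot)",
    "problem_statement": "(the GitHub issue text)",
    "patch": "(the gold patch that was actually merged)",
    "test_patch": "(tests that fail before the fix and pass after)",
}

required = ["instance_id", "repo", "base_commit",
            "problem_statement", "patch", "test_patch"]
assert all(k in instance for k in required)
```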

&lt;p&gt;The repositories span a wide range of complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Lines of Code&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;django/django&lt;/td&gt;
&lt;td&gt;82K&lt;/td&gt;
&lt;td&gt;~4,000&lt;/td&gt;
&lt;td&gt;~300K&lt;/td&gt;
&lt;td&gt;Layered MVC, ORM, middleware, admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;61K&lt;/td&gt;
&lt;td&gt;~1,200&lt;/td&gt;
&lt;td&gt;~200K&lt;/td&gt;
&lt;td&gt;Estimator inheritance chains, pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sympy/sympy&lt;/td&gt;
&lt;td&gt;13K&lt;/td&gt;
&lt;td&gt;~1,500&lt;/td&gt;
&lt;td&gt;~400K&lt;/td&gt;
&lt;td&gt;Deep mathematical module dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;20K&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;td&gt;~150K&lt;/td&gt;
&lt;td&gt;Complex rendering pipeline, backends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;12K&lt;/td&gt;
&lt;td&gt;~400&lt;/td&gt;
&lt;td&gt;~50K&lt;/td&gt;
&lt;td&gt;Plugin system, fixture resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not a toy benchmark. These are production codebases with real architectural complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context Problem
&lt;/h2&gt;

&lt;p&gt;Watch what happens when a coding agent tries to fix a bug in Django without architectural context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug:&lt;/strong&gt; &lt;code&gt;django__django-16379&lt;/code&gt; — &lt;code&gt;FileBasedCache&lt;/code&gt; crashes with &lt;code&gt;FileNotFoundError&lt;/code&gt; on concurrent access&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent behavior (without XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches for "FileBasedCache" — finds the class in &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reads the file, sees the &lt;code&gt;delete()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Doesn't understand the cache backend hierarchy — misses that &lt;code&gt;FileBasedCache&lt;/code&gt; inherits from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Doesn't know about the concurrent access patterns in Django's cache framework&lt;/li&gt;
&lt;li&gt;Writes a fix that handles the &lt;code&gt;FileNotFoundError&lt;/code&gt; but breaks the cache invalidation contract&lt;/li&gt;
&lt;li&gt;Test fails. Tries again.&lt;/li&gt;
&lt;li&gt;Explores &lt;code&gt;django/core/cache/__init__.py&lt;/code&gt;, &lt;code&gt;django/core/cache/backends/base.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Eventually finds the right approach after 15+ file reads and 4,000+ tokens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrw4pmypy3k4r1f1lmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrw4pmypy3k4r1f1lmq.png" width="720" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent behavior (with XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calls &lt;code&gt;xce_get_context("FileBasedCache FileNotFoundError concurrent access")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gets back: the cache backend hierarchy (BaseCache → FileBasedCache), the locking mechanism, the file operations that can race, and the related test infrastructure&lt;/li&gt;
&lt;li&gt;Understands the architecture immediately&lt;/li&gt;
&lt;li&gt;Writes a fix that wraps the file operation in a try/except with proper fallback&lt;/li&gt;
&lt;li&gt;Test passes on first attempt. ~1,500 tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32vd9nal5xu1tn1ja0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32vd9nal5xu1tn1ja0d.png" width="720" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference isn't that the model is smarter. It's that the model has a map.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Repository Analysis
&lt;/h2&gt;

&lt;p&gt;XCE doesn't provide a uniform boost. The improvement correlates strongly with architectural complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Sonnet 4.0 Baseline&lt;/th&gt;
&lt;th&gt;Sonnet 4.0 + XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sympy/sympy&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+17pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep module dependencies. A fix in &lt;code&gt;sympy/core/&lt;/code&gt; often requires understanding &lt;code&gt;sympy/simplify/&lt;/code&gt;, &lt;code&gt;sympy/printing/&lt;/code&gt;, and &lt;code&gt;sympy/polys/&lt;/code&gt;. Without context, the agent gets lost in the dependency maze.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex estimator inheritance. &lt;code&gt;BaseEstimator&lt;/code&gt; → &lt;code&gt;ClassifierMixin&lt;/code&gt; → &lt;code&gt;LinearClassifierMixin&lt;/code&gt; → &lt;code&gt;LogisticRegression&lt;/code&gt;. Bugs often require understanding the full chain.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rendering pipeline with multiple backends. A bug in &lt;code&gt;axes.py&lt;/code&gt; might require understanding &lt;code&gt;figure.py&lt;/code&gt;, &lt;code&gt;backend_agg.py&lt;/code&gt;, and the transform system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django/django&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layered architecture (models → views → templates → middleware). Bugs cross layers frequently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Relatively flat architecture. The plugin system is complex but most bugs are localized. Less benefit from architectural context.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmihrvif9ue9abxiizx1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmihrvif9ue9abxiizx1w.png" width="720" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern is clear: &lt;strong&gt;the more architectural dependencies a codebase has, the more the agent benefits from having a structural map.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pytest, with its relatively flat architecture, sees the smallest improvement (+8pp). Sympy, where fixing a bug in one module often requires understanding five others, sees the largest (+17pp).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost Analysis
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting from a business perspective.&lt;/p&gt;

&lt;p&gt;The official leaderboard shows that reaching 76%+ on SWE-bench Verified requires expensive models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score Range&lt;/th&gt;
&lt;th&gt;Cheapest Model&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;76%+&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75%+&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72%+&lt;/td&gt;
&lt;td&gt;GPT-5-2 Codex&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70%+&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2 (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With XCE, the cost equation changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;th&gt;Savings vs. Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 + XCE&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4x cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;73.4%&lt;/td&gt;
&lt;td&gt;Sonnet 4.0 + XCE&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4x cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (no XCE)&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The $0.22 includes the XCE query cost (~$0.001 per query, amortized over multiple queries per instance) plus the model inference cost. The XCE overhead is negligible — the savings come from the model needing fewer tokens to solve each problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token reduction:&lt;/strong&gt; XCE reduces token usage by approximately 20% per task. The agent makes fewer wrong turns, reads fewer irrelevant files, and arrives at the solution faster. On a 500-instance benchmark run, this translates to significant cost savings.&lt;/p&gt;
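&lt;p&gt;To make the overhead claim concrete, here is an illustrative sketch of the arithmetic. The ~20% reduction is the figure from our runs; the per-token price, baseline token count, and queries-per-task are hypothetical round numbers, not any vendor's actual rates:&lt;/p&gt;

```python
# Illustrative only: why a ~20% token reduction dwarfs the ~$0.001/query
# XCE cost. Prices and token counts are hypothetical round numbers.
PRICE_PER_MTOK = 3.00        # $ per million tokens (hypothetical)
BASELINE_TOKENS = 120_000    # tokens per task without XCE (hypothetical)
REDUCTION = 0.20             # ~20% fewer tokens with XCE (from our runs)
XCE_QUERIES = 10             # XCE queries per task (hypothetical)
XCE_COST_PER_QUERY = 0.001   # ~$0.001 per query

baseline_cost = BASELINE_TOKENS / 1e6 * PRICE_PER_MTOK
xce_cost = (BASELINE_TOKENS * (1 - REDUCTION) / 1e6 * PRICE_PER_MTOK
            + XCE_QUERIES * XCE_COST_PER_QUERY)

print(f"baseline: ${baseline_cost:.3f}/task  with XCE: ${xce_cost:.3f}/task")
```

&lt;p&gt;Even with conservative numbers, the query overhead stays well below the token savings, so the net cost per task drops.&lt;/p&gt;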

&lt;p&gt;At scale, the math is compelling. Consider a team running 1,000 coding agent tasks per month:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Annual Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus (no XCE)&lt;/td&gt;
&lt;td&gt;$750&lt;/td&gt;
&lt;td&gt;$9,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5 + XCE&lt;/td&gt;
&lt;td&gt;$220&lt;/td&gt;
&lt;td&gt;$2,640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Savings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$530/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6,360/yr&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the XCE setup gets better results.&lt;/p&gt;
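&lt;p&gt;The monthly and annual figures are straight multiplication from the per-instance prices in the tables above; a quick sketch of the arithmetic:&lt;/p&gt;

```python
# Back-of-the-envelope comparison for 1,000 agent tasks/month,
# using the per-instance prices from the tables above.
TASKS_PER_MONTH = 1_000

setups = {
    "Claude Opus (no XCE)": 0.75,  # $/instance
    "MiniMax M2.5 + XCE":   0.22,  # $/instance, XCE query cost included
}

monthly = {name: price * TASKS_PER_MONTH for name, price in setups.items()}
savings_per_month = monthly["Claude Opus (no XCE)"] - monthly["MiniMax M2.5 + XCE"]

for name, cost in monthly.items():
    print(f"{name}: ${cost:,.0f}/mo (${cost * 12:,.0f}/yr)")
print(f"Savings: ${savings_per_month:,.0f}/mo (${savings_per_month * 12:,.0f}/yr)")
```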




&lt;h2&gt;
  
  
  How XCE Works (High Level)
&lt;/h2&gt;

&lt;p&gt;XCE indexes a codebase into a multi-level structured representation that captures both code and architecture. When an agent queries XCE, it gets back context at the right level of abstraction — not just a code snippet, but an understanding of where that code fits in the system, what depends on it, and what it depends on.&lt;/p&gt;

&lt;p&gt;The indexing uses the proprietary PRAT algorithm to build this structured index. The key difference from embedding-based search: PRAT captures structural relationships between components, not just text similarity. This means the agent can ask "what depends on this function?" and get a real answer — something embeddings alone cannot provide.&lt;/p&gt;
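&lt;p&gt;To illustrate the difference with a toy example (this graph is our own sketch, not the PRAT index itself): once dependency edges are stored explicitly, "what depends on this?" becomes a graph lookup rather than a text-similarity search:&lt;/p&gt;

```python
# Illustrative only: a toy dependency graph, not the PRAT index.
# Edges point from a component to the components it depends on.
from collections import defaultdict

deps = {
    "api.handlers.create_user": ["db.models.User", "auth.hash_password"],
    "api.handlers.login":       ["db.models.User", "auth.check_password"],
    "tasks.cleanup":            ["db.models.User"],
}

# Invert the edges once so "what depends on X?" is a direct lookup.
dependents = defaultdict(set)
for component, uses in deps.items():
    for used in uses:
        dependents[used].add(component)

print(sorted(dependents["db.models.User"]))
# -> ['api.handlers.create_user', 'api.handlers.login', 'tasks.cleanup']
```

&lt;p&gt;All three dependents show up regardless of whether their source text "looks similar" to the &lt;code&gt;User&lt;/code&gt; model: a structural answer, not a semantic one.&lt;/p&gt;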

&lt;p&gt;The result is served via MCP, so any compatible agent gets architectural context on every tool call without any changes to the agent itself.&lt;/p&gt;

&lt;p&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuigqmc7db16zdwp2jpp.png" width="720" height="372"&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing These Results
&lt;/h2&gt;

&lt;p&gt;All results are published and reproducible. The benchmark repository includes predictions, resolved instance IDs, and trajectory download scripts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To reproduce:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install mini-swe-agent&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mini-swe-agent

&lt;span class="c"&gt;# 2. Get an XCE API key (free at app.xanther.ai)&lt;/span&gt;
&lt;span class="c"&gt;# 3. Index the target repo&lt;/span&gt;
npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; xce_your_key

&lt;span class="c"&gt;# 4. Run the benchmark&lt;/span&gt;
mini-swe-agent run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4-20250514 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset&lt;/span&gt; swe-bench-verified &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; &lt;span class="s1"&gt;'{"xanther": {"url": "https://mcp.xanther.ai/sse", "headers": {"Authorization": "Bearer xce_your_key"}}}'&lt;/span&gt;

&lt;span class="c"&gt;# 5. Evaluate&lt;/span&gt;
sb submit &lt;span class="nt"&gt;--predictions&lt;/span&gt; results/preds.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each run's &lt;code&gt;preds.jsonl&lt;/code&gt; contains one prediction per instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instance_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"django__django-16379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_name_or_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sonnet-4.0-xce"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_patch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diff --git a/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"full_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
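&lt;p&gt;Before submitting, it can be worth sanity-checking the file. This small helper is our own illustration (not part of the SWE-bench CLI); it validates the fields shown in the record above and catches duplicate instance IDs:&lt;/p&gt;

```python
import json

# Fields every prediction record must carry (per the example record above).
REQUIRED = {"instance_id", "model_name_or_path", "model_patch"}

def check_predictions(path):
    """Validate a preds.jsonl file; return the number of unique predictions."""
    seen = set()
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            missing = REQUIRED - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            if record["instance_id"] in seen:
                raise ValueError(f"line {line_no}: duplicate {record['instance_id']}")
            seen.add(record["instance_id"])
    return len(seen)

# e.g. check_predictions("results/preds.jsonl")
```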



&lt;p&gt;Trajectory files (100-600MB per run) are available for download from S3 for detailed analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;Three takeaways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context is cheaper than compute.&lt;/strong&gt; You don't need the most expensive model to get the best results. You need the right context. A $0.02/call model with good architectural context outperforms a $0.30/call model without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The improvement scales with complexity.&lt;/strong&gt; Simple codebases with flat architectures see modest gains (+8%). Complex codebases with deep dependencies see dramatic gains (+17%). As codebases grow, the value of architectural context increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. This is model-agnostic.&lt;/strong&gt; XCE works with any MCP-compatible agent. The same context infrastructure that improves MiniMax M2.5 also improves Sonnet 4.0, and would improve any future model. Better models + better context = compounding gains.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learn more about how XCE works: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt; | See the benchmark methodology: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Xanther is in open beta. Free tier: 3 repos, 100 queries/month. No credit card.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmark Dashboard: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;App: &lt;a href="https://app.xanther.ai" rel="noopener noreferrer"&gt;app.xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmarks (raw data): &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discord: &lt;a href="https://discord.gg/Y768kBRS" rel="noopener noreferrer"&gt;discord.gg/Y768kBRS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/xanther-cli" rel="noopener noreferrer"&gt;npmjs.com/package/xanther-cli&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All benchmark results were evaluated using the official SWE-bench CLI (&lt;code&gt;sb submit&lt;/code&gt;) against SWE-bench Verified (500 instances). The agent harness is mini-swe-agent. Predictions and resolved instance IDs are published at &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>minimax</category>
    </item>
  </channel>
</rss>
