<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hoyin kyoma</title>
    <description>The latest articles on DEV Community by Hoyin kyoma (@kyoma_1234).</description>
    <link>https://dev.to/kyoma_1234</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921114%2F8be2e4b6-ba5a-4b85-8cc0-e3c43854a551.jpg</url>
      <title>DEV Community: Hoyin kyoma</title>
      <link>https://dev.to/kyoma_1234</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kyoma_1234"/>
    <language>en</language>
    <item>
      <title>A Data-Driven Comparison of Xanther, Augment Code, and Serena on Django's 200K-Line Codebase</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sat, 06 Jun 2026 01:25:18 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/a-data-driven-comparison-of-xanther-augment-code-and-serena-on-djangos-200k-line-codebase-2mim</link>
      <guid>https://dev.to/kyoma_1234/a-data-driven-comparison-of-xanther-augment-code-and-serena-on-djangos-200k-line-codebase-2mim</guid>
      <description>&lt;p&gt;&lt;strong&gt;Full benchmark data, visualizations, and scripts available on GitHub:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;→ &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📁 Repository contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete blog post + detailed analysis&lt;/li&gt;
&lt;li&gt;35+ SWE-bench issue results with scoring&lt;/li&gt;
&lt;li&gt;10 complex feature design traces&lt;/li&gt;
&lt;li&gt;Machine-readable JSON results (&lt;code&gt;results/full_test_results.json&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;8 high-quality visualizations (S3 CDN hosted)&lt;/li&gt;
&lt;li&gt;Python scripts for reproducibility&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Background: The Context Problem in AI-Assisted Development
&lt;/h2&gt;

&lt;p&gt;I've been thinking about a fundamental question in AI-assisted coding: &lt;strong&gt;what matters more—the model's raw intelligence, or the context it receives?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider this scenario. You give GPT-4 a bug report: "Django's &lt;code&gt;select_related&lt;/code&gt; breaks when combined with &lt;code&gt;.only()&lt;/code&gt; on deferred fields." Without context, even a highly capable model tends to hallucinate file paths, invent function signatures, and produce plausible-sounding but wrong fixes. Supply the exact file (&lt;code&gt;django/db/models/query.py&lt;/code&gt;), the relevant method (&lt;code&gt;select_related&lt;/code&gt; at line 1245), and the call graph showing how it flows into &lt;code&gt;SQLCompiler.get_select()&lt;/code&gt;, and even a smaller model like Claude Haiku or GPT-4o mini can produce a correct fix.&lt;/p&gt;

&lt;p&gt;This is the thesis I set out to test:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;With the right context engine, a relatively less capable model can perform as well as a frontier model operating without structured context.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The context engine is the bottleneck, not the model. If you feed garbage context to GPT-4, you get garbage out. If you feed precise architectural context to a smaller model, you get correct, well-integrated code.&lt;/p&gt;

&lt;p&gt;To test this, I ran a comprehensive experiment comparing three context engines on Django's codebase (~200,000 lines of Python):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Augment Code (Auggie)&lt;/strong&gt; — Embedding-based semantic search + LLM synthesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serena&lt;/strong&gt; — LSP-based symbol lookup engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xanther Context Engine (XCE)&lt;/strong&gt; — PRAT-based hierarchical graph engine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tested them across 35+ real SWE-bench Verified bug fixes and 10 complex architectural feature design tasks. The results surprised me.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Context Engines Matter More Than Model Choice
&lt;/h2&gt;

&lt;p&gt;Before diving into the comparison, let me explain why I believe context is the real differentiator.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Information Asymmetry Problem
&lt;/h3&gt;

&lt;p&gt;When a developer asks an AI to fix a bug in Django, the model faces an information asymmetry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What the model knows&lt;/strong&gt;: General Python patterns, Django documentation it was trained on, common bug patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What the model doesn't know&lt;/strong&gt;: The exact current state of the codebase, which functions call which, what changed in the last commit, how modules interconnect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gap is what context engines fill. The question is: how well do they fill it?&lt;/p&gt;
&lt;h3&gt;
  
  
  My Hypothesis
&lt;/h3&gt;

&lt;p&gt;I hypothesized that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A graph-based context engine (XCE) would dominate on problems requiring architectural understanding&lt;/li&gt;
&lt;li&gt;A semantic search engine (Auggie) would excel on novel design problems where patterns matter more than structure&lt;/li&gt;
&lt;li&gt;An LSP-based engine (Serena) would win on speed but lose on depth&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The data confirmed all three hypotheses—but with an unexpected twist on the complexity curve.&lt;/p&gt;


&lt;h2&gt;
  
  
  Executive Summary: Comparative Metrics
&lt;/h2&gt;

&lt;p&gt;Before diving into the full analysis, here's what the data shows:&lt;/p&gt;
&lt;h3&gt;
  
  
  Overall Performance Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;SWE-bench Avg&lt;/th&gt;
&lt;th&gt;Features Avg&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;Wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.5/12 (87.5%)&lt;/td&gt;
&lt;td&gt;10.3/12 (85.8%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.4/12 (86.7%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7/10 features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;10.5/12 (87.5%)&lt;/td&gt;
&lt;td&gt;9.7/12 (80.8%)&lt;/td&gt;
&lt;td&gt;10.1/12 (84.2%)&lt;/td&gt;
&lt;td&gt;3/10 features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;11.0/12 (91.7%)&lt;/td&gt;
&lt;td&gt;6.4/12 (53.3%)&lt;/td&gt;
&lt;td&gt;8.7/12 (72.5%)&lt;/td&gt;
&lt;td&gt;0/10 features&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
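&lt;p&gt;The Overall column is consistent with an unweighted mean of the two averages, expressed against the 12-point maximum. The sketch below reproduces that arithmetic; the unweighted-mean formula is my reading of the numbers, and the names &lt;code&gt;AVERAGES&lt;/code&gt; and &lt;code&gt;overall&lt;/code&gt; are illustrative only.&lt;/p&gt;

```python
MAX_POINTS = 12

# (SWE-bench average, feature average) as reported in the table above.
AVERAGES = {
    "XCE": (10.5, 10.3),
    "Auggie": (10.5, 9.7),
    "Serena": (11.0, 6.4),
}


def overall(swe_avg, feature_avg):
    """Unweighted mean of the two averages, plus its share of the maximum."""
    mean = (swe_avg + feature_avg) / 2
    return round(mean, 1), round(mean / MAX_POINTS * 100, 1)


results = {engine: overall(*pair) for engine, pair in AVERAGES.items()}
for engine, (score, percent) in results.items():
    print(f"{engine}: {score}/{MAX_POINTS} ({percent}%)")
```

&lt;p&gt;Note that this weights the bug-fix set and the ten feature tasks equally, regardless of how many tests each contains.&lt;/p&gt;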

&lt;p&gt;&lt;strong&gt;Key Finding&lt;/strong&gt;: XCE wins overall, but the story is nuanced. Serena posts the highest SWE-bench average because it always finds the exact location (3/3 on Code Location), but it falls far behind on feature design, where architectural understanding matters. Auggie ties XCE on SWE-bench, trails it on the standard features, and overtakes it on the novel ones.&lt;/p&gt;
&lt;h3&gt;
  
  
  Response Time &amp;amp; Context Richness
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;XCE&lt;/th&gt;
&lt;th&gt;Auggie&lt;/th&gt;
&lt;th&gt;Serena&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response Time&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens/Query&lt;/td&gt;
&lt;td&gt;~2000+&lt;/td&gt;
&lt;td&gt;~1500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens/Second&lt;/td&gt;
&lt;td&gt;~1000&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Novel design&lt;/td&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The Complexity Crossover (Most Important Finding)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard Complexity (Features 1-6): XCE wins
  XCE:    11.0/12 avg  ████████████
  Auggie:  9.8/12 avg  ██████████
  Serena:  6.2/12 avg  ███████

Novel Complexity (Features 7-10): Auggie wins ⭐
  XCE:    8.3/12 avg   █████████
  Auggie:  9.3/12 avg  ██████████
  Serena:  N/A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This suggests a crossover: on problems where the solution already exists in Django's current architecture, XCE's graph-based approach dominates. On problems requiring novel patterns not present in Django, Auggie's embedding-based semantic synthesis takes the lead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See also&lt;/strong&gt;: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;Full benchmark repository&lt;/a&gt; with detailed metrics, traces, and reproducible scripts.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Three Context Engines: How They Work
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Augment Code (Auggie) — Semantic Search + LLM Synthesis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
Auggie is Augment Code's AI coding assistant. It uses embedding-based semantic search combined with large language models to understand codebases. When you ask it a question, it converts your query into a vector embedding, searches a vector database of code chunks, retrieves the most semantically similar code, and then uses an LLM to synthesize an answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works internally:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing Phase&lt;/strong&gt;: Auggie chunks the codebase into semantic units (functions, classes, modules). Each chunk is converted into a high-dimensional vector embedding (typically 1536 dimensions) using a transformer-based code embedding model. These vectors are stored in a vector database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Phase&lt;/strong&gt;: When I ask "How does Django handle database transactions?", Auggie:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converts my query into the same embedding space&lt;/li&gt;
&lt;li&gt;Performs approximate nearest-neighbor search in the vector DB&lt;/li&gt;
&lt;li&gt;Retrieves top-k most semantically similar code chunks&lt;/li&gt;
&lt;li&gt;Passes retrieved chunks + my query to an LLM for synthesis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthesis Phase&lt;/strong&gt;: The LLM receives the retrieved context and produces a natural language explanation with code references.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
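&lt;p&gt;The index, query, and synthesis phases above can be sketched in a few lines. This is a toy illustration, not Augment's implementation: the bag-of-words &lt;code&gt;embed&lt;/code&gt; function stands in for a learned embedding model, the search is exact rather than approximate, the chunk texts are made up, and the final LLM call is replaced by plain prompt assembly.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the names of the k chunks most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), name) for name, text in chunks.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical chunk descriptions, not real Django source.
CHUNKS = {
    "transaction.atomic": "database transactions atomic commit rollback savepoint",
    "QuerySet.filter": "filter rows with lookup conditions",
    "cache.get": "read value from cache backend",
}

question = "how does django handle database transactions"
retrieved = top_k(question, CHUNKS, k=1)
# In a real system this prompt would be sent to an LLM for synthesis.
prompt = "Context: " + ", ".join(retrieved) + "\nQuestion: " + question
print(prompt)
```

&lt;p&gt;A production system would swap the count vectors for dense embeddings and the linear scan for an approximate nearest-neighbor index, but the retrieve-then-synthesize shape stays the same.&lt;/p&gt;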

&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understands code meaning beyond keywords (semantic similarity)&lt;/li&gt;
&lt;li&gt;Excellent at diagnosing problems and explaining code behavior&lt;/li&gt;
&lt;li&gt;No rate limits for heavy usage&lt;/li&gt;
&lt;li&gt;Requires repository indexing through Augment's system&lt;/li&gt;
&lt;li&gt;Response time: ~3 seconds per query&lt;/li&gt;
&lt;li&gt;Token output: ~1500 tokens per response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why semantic search has a unique advantage:&lt;/strong&gt;&lt;br&gt;
Unlike graph-based systems that rely on explicit relationships (function A calls function B), semantic embeddings capture &lt;em&gt;latent&lt;/em&gt; patterns. The embedding for "event sourcing" will be close to embeddings for "audit log", "change tracking", "immutable records"—even if those concepts appear in completely different files with different names. This is why Auggie excels on novel design problems where the answer isn't in the existing architecture but in recognizing patterns across concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Auggie requires the repository to be indexed in their system. During my testing, I had to add Django to my Augment workspace and wait for indexing to complete. The CLI (&lt;code&gt;auggie&lt;/code&gt;) worked reliably once indexed.&lt;/p&gt;


&lt;h3&gt;
  
  
  Serena — LSP-Based Symbol Lookup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
Serena is a Language Server Protocol (LSP) based semantic retrieval engine. Think of it as a programmatic version of VS Code's "Go to Definition" and "Find All References" features, exposed as an MCP tool. It doesn't understand what code &lt;em&gt;does&lt;/em&gt;—it understands where code &lt;em&gt;is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works internally:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LSP Foundation&lt;/strong&gt;: Serena wraps a Python language server (Pyright) and exposes its capabilities as MCP tools. The Language Server Protocol is the same protocol that powers IDE features like autocomplete, go-to-definition, and find-references in VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Indexing Phase&lt;/strong&gt;: When Serena starts on a project, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans all &lt;code&gt;.py&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;Parses each into an Abstract Syntax Tree (AST)&lt;/li&gt;
&lt;li&gt;Builds a symbol index mapping names to exact file:line locations&lt;/li&gt;
&lt;li&gt;Caches the index for sub-second lookups&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Phase&lt;/strong&gt;: When I ask for &lt;code&gt;QuerySet&lt;/code&gt;, Serena:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Looks up "QuerySet" in its symbol index&lt;/li&gt;
&lt;li&gt;Returns: &lt;code&gt;django/db/models/query.py:324, class QuerySet(AltersData)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally includes the symbol body (source code)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Available Operations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;find_symbol&lt;/code&gt; — Find where a symbol is defined&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_referencing_symbols&lt;/code&gt; — Find all references to a symbol&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_implementations&lt;/code&gt; — Find implementations of an interface&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_declaration&lt;/code&gt; — Find where something is declared&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_symbols_overview&lt;/code&gt; — List all symbols in a file&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
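&lt;p&gt;To make the symbol-index idea concrete, here is a minimal sketch using Python's standard &lt;code&gt;ast&lt;/code&gt; module. It is not how Serena is implemented (Serena delegates to a language server); the &lt;code&gt;SOURCE&lt;/code&gt; text and the &lt;code&gt;build_symbol_index&lt;/code&gt; helper are hypothetical and only show the name-to-location mapping.&lt;/p&gt;

```python
import ast

# Hypothetical source text standing in for one file of the project.
SOURCE = '''
class QuerySet:
    def filter(self, **kwargs):
        return self

    def delete(self):
        return 0


def prefetch_related_objects(instances):
    return instances
'''


def build_symbol_index(path, source):
    """Map each class or function name to a list of 'path:line' locations."""
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            index.setdefault(node.name, []).append(f"{path}:{node.lineno}")
    return index


index = build_symbol_index("query.py", SOURCE)
print(index["QuerySet"])
print(index["delete"])
```

&lt;p&gt;Once the index is built, each lookup is a single dictionary access, which is why this style of engine answers quickly but has nothing to say about how symbols relate to each other.&lt;/p&gt;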

&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blazing fast: ~1 second response time (no ML inference, just index lookup)&lt;/li&gt;
&lt;li&gt;Precise: exact line numbers, exact symbol locations&lt;/li&gt;
&lt;li&gt;Minimal token usage: ~500 tokens per response&lt;/li&gt;
&lt;li&gt;Runs entirely locally (no cloud dependency)&lt;/li&gt;
&lt;li&gt;No semantic understanding—can't explain &lt;em&gt;what&lt;/em&gt; code does or &lt;em&gt;why&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Serena is fast but limited:&lt;/strong&gt;&lt;br&gt;
Serena's speed comes from doing less. There's no embedding computation, no graph traversal, no LLM synthesis. It's a direct index lookup. But this means it can't answer "how does QuerySet integrate with the SQL compiler?" because that requires understanding relationships between symbols, not just their locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Quick symbol lookups when you already know what you're looking for. Token-constrained environments. Offline development.&lt;/p&gt;


&lt;h3&gt;
  
  
  Xanther Context Engine (XCE) — PRAT-Based Hierarchical Graph
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
XCE is a context engine built on a data structure called PRAT (Persistent Recursive Abstract Tree). Unlike Auggie's flat embedding space or Serena's symbol index, XCE builds a hierarchical graph of the entire codebase that captures relationships at multiple levels of abstraction—from individual function calls up to high-level architectural modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works internally:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PRAT Indexing Phase&lt;/strong&gt;: XCE parses the codebase and builds a Persistent Recursive Abstract Tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent&lt;/strong&gt;: The tree is stored and updated incrementally (not rebuilt from scratch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive&lt;/strong&gt;: Each node can contain sub-trees (a module contains classes, which contain methods)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abstract&lt;/strong&gt;: Nodes represent concepts at different abstraction levels, not just syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree&lt;/strong&gt;: Hierarchical structure with parent-child relationships&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Layer Graph Construction&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 (Syntax)&lt;/strong&gt;: AST parsing extracts functions, classes, imports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 (Relationships)&lt;/strong&gt;: Call graphs, inheritance chains, import dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 (Architecture)&lt;/strong&gt;: HLD (High-Level Design) modules and LLD (Low-Level Design) components&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Phase&lt;/strong&gt;: When I ask "how does QuerySet.delete() work?", XCE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locates &lt;code&gt;QuerySet.delete()&lt;/code&gt; in the PRAT&lt;/li&gt;
&lt;li&gt;Traverses the call graph: &lt;code&gt;delete()&lt;/code&gt; → &lt;code&gt;Collector.collect()&lt;/code&gt; → &lt;code&gt;Collector.delete()&lt;/code&gt; → &lt;code&gt;SQLDeleteCompiler&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Identifies related modules: &lt;code&gt;deletion.py&lt;/code&gt;, &lt;code&gt;signals.py&lt;/code&gt;, &lt;code&gt;sql/compiler.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returns the full architectural context including HLD module classification&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Available Operations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;xce_search&lt;/code&gt; — Semantic search across the indexed codebase&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_architecture_context&lt;/code&gt; — Get full architectural context for a file or symbol&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_trace&lt;/code&gt; — Trace relationships from code to design artifacts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_impact_analysis&lt;/code&gt; — Analyze what would break if you change specific files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_get_context&lt;/code&gt; — Combined search + architecture + tracing in one call&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
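&lt;p&gt;The query phase above boils down to a graph walk over relationships stored at index time. The sketch below is an assumption-laden illustration, not XCE's code: the edges mirror the simplified chain quoted above, and &lt;code&gt;CALL_GRAPH&lt;/code&gt;, &lt;code&gt;MODULE_OF&lt;/code&gt;, and &lt;code&gt;trace_calls&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from collections import deque

# Hypothetical edges, simplified from the chain described above.
CALL_GRAPH = {
    "QuerySet.delete": ["Collector.collect"],
    "Collector.collect": ["Collector.delete"],
    "Collector.delete": ["SQLDeleteCompiler"],
}

# Hypothetical symbol-to-module mapping used to report related modules.
MODULE_OF = {
    "QuerySet.delete": "query.py",
    "Collector.collect": "deletion.py",
    "Collector.delete": "deletion.py",
    "SQLDeleteCompiler": "sql/compiler.py",
}


def trace_calls(graph, start):
    """Breadth-first walk returning every symbol reachable from start."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.append(callee)
                queue.append(callee)
    return seen


reachable = trace_calls(CALL_GRAPH, "QuerySet.delete")
modules = []
for symbol in reachable:
    if MODULE_OF[symbol] not in modules:
        modules.append(MODULE_OF[symbol])

print(" -> ".join(reachable))
print(modules)
```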

&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides call graphs showing function-to-function relationships&lt;/li&gt;
&lt;li&gt;Architectural layering (HLD → LLD → function level)&lt;/li&gt;
&lt;li&gt;Cross-module dependency understanding&lt;/li&gt;
&lt;li&gt;Impact analysis (what breaks if I change X?)&lt;/li&gt;
&lt;li&gt;Response time: ~2 seconds per query&lt;/li&gt;
&lt;li&gt;Token output: ~2000+ tokens per response (richest context)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why XCE's PRAT Makes It Faster Than Augment's Embedding Search
&lt;/h2&gt;

&lt;p&gt;This deserves its own section because it's counterintuitive. XCE returns &lt;em&gt;more&lt;/em&gt; context than Auggie (~2000 tokens vs ~1500 tokens) but responds &lt;em&gt;faster&lt;/em&gt; (~2 seconds vs ~3 seconds). How?&lt;/p&gt;

&lt;p&gt;The answer lies in the PRAT data structure.&lt;/p&gt;
&lt;h3&gt;
  
  
  PRAT: Persistent Recursive Abstract Tree
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Persistent&lt;/strong&gt; means the tree is stored and updated incrementally rather than rebuilt. When Django's codebase changes (a new commit), XCE only updates the affected nodes in the tree; it doesn't rebuild the whole index. Auggie, by contrast, has to re-embed the changed chunks and update its vector index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recursive&lt;/strong&gt; means traversal is O(log n) for most queries. When I ask about &lt;code&gt;QuerySet.delete()&lt;/code&gt;, XCE doesn't search the entire codebase—it navigates directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;django/ (root)
  └── db/ (module)
       └── models/ (submodule)
            └── query.py (file)
                 └── QuerySet (class)
                      └── delete() (method)
                           └── [call graph: Collector.collect(), ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



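&lt;p&gt;This is a tree traversal, not a vector similarity search. Tree traversal costs O(depth), which is roughly O(log n) when the tree is reasonably balanced. Exact vector similarity search is O(n); approximate nearest-neighbor indexes reduce that to sublinear time in practice, so the lookup itself is unlikely to be the main source of the latency gap.&lt;/p&gt;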
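&lt;p&gt;A minimal sketch of that direct navigation, assuming the hierarchy is held as nested dictionaries. The &lt;code&gt;TREE&lt;/code&gt; layout and the &lt;code&gt;resolve&lt;/code&gt; helper are hypothetical; the point is only that the work done is one lookup per level, independent of how many other nodes exist.&lt;/p&gt;

```python
# Hypothetical nested-dict tree mirroring the path shown above.
TREE = {
    "django": {
        "db": {
            "models": {
                "query.py": {
                    "QuerySet": {
                        "delete": {"calls": ["Collector.collect"]},
                    },
                },
            },
        },
    },
}


def resolve(tree, path):
    """Follow one child pointer per path segment and count the steps taken."""
    node = tree
    steps = 0
    for segment in path:
        node = node[segment]  # one dictionary lookup per level
        steps += 1
    return node, steps


path = ["django", "db", "models", "query.py", "QuerySet", "delete"]
node, steps = resolve(TREE, path)
print(node["calls"], steps)
```

&lt;p&gt;This only works when the full path is already known; mapping a bare symbol name to its path still needs a separate name index.&lt;/p&gt;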

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt; means XCE pre-computes architectural relationships at index time. When I query, the call graph and HLD/LLD classification are already stored in the tree—they don't need to be computed on the fly. Auggie computes its synthesis at query time using an LLM, which adds latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tree&lt;/strong&gt; means the data structure supports efficient updates. When a file changes, only that file's subtree needs updating. The rest of the PRAT remains valid.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;XCE (PRAT)&lt;/th&gt;
&lt;th&gt;Auggie (Embeddings)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index update&lt;/td&gt;
&lt;td&gt;O(changed files)&lt;/td&gt;
&lt;td&gt;O(changed chunks × embedding dim)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query&lt;/td&gt;
&lt;td&gt;O(tree depth) ≈ O(log n)&lt;/td&gt;
&lt;td&gt;O(n) exact search; sublinear with an approximate NN index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context assembly&lt;/td&gt;
&lt;td&gt;Pre-computed (stored in tree)&lt;/td&gt;
&lt;td&gt;Computed at query time (LLM call)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;~2 seconds&lt;/td&gt;
&lt;td&gt;~3 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ~1 second difference comes primarily from Auggie's LLM synthesis step. XCE's architectural context is pre-computed and stored in the PRAT, so it just needs to be retrieved and formatted. Auggie retrieves raw code chunks and then needs an LLM to synthesize them into a coherent answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scoring Methodology: LLM-as-Judge with Structured Rubric
&lt;/h2&gt;

&lt;p&gt;This section explains exactly how I scored each engine's responses. I used an &lt;strong&gt;LLM-as-Judge&lt;/strong&gt; approach with a structured rubric to ensure consistency and reproducibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rubric (12 Points Maximum)
&lt;/h3&gt;

&lt;p&gt;Each engine response was scored on 4 criteria, each worth 0-3 points:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;0 Points&lt;/th&gt;
&lt;th&gt;1 Point&lt;/th&gt;
&lt;th&gt;2 Points&lt;/th&gt;
&lt;th&gt;3 Points&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wrong file/function&lt;/td&gt;
&lt;td&gt;Right file, wrong function&lt;/td&gt;
&lt;td&gt;Right file and function, imprecise&lt;/td&gt;
&lt;td&gt;Exact file, function, and line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Problem Identification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Missed the issue entirely&lt;/td&gt;
&lt;td&gt;Identified area but not root cause&lt;/td&gt;
&lt;td&gt;Identified root cause partially&lt;/td&gt;
&lt;td&gt;Precise root cause with explanation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architectural Understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No module context&lt;/td&gt;
&lt;td&gt;Single-module context&lt;/td&gt;
&lt;td&gt;Multi-module awareness&lt;/td&gt;
&lt;td&gt;Full cross-module dependency map&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implementation Guidance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No actionable guidance&lt;/td&gt;
&lt;td&gt;Vague direction&lt;/td&gt;
&lt;td&gt;Specific approach but incomplete&lt;/td&gt;
&lt;td&gt;Complete implementation path with integration points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How Scoring Worked in Practice
&lt;/h3&gt;

&lt;p&gt;For each test (bug fix or feature design), I:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Formulated a query&lt;/strong&gt; appropriate for each engine's interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captured the raw response&lt;/strong&gt; from each engine (XCE MCP tools, Auggie CLI, Serena MCP tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applied the rubric&lt;/strong&gt; by evaluating each criterion independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Used LLM-as-Judge&lt;/strong&gt; for borderline cases: I fed the engine's response + the rubric to Claude and asked it to score, then verified the score myself&lt;/li&gt;
&lt;/ol&gt;
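&lt;p&gt;For the LLM-as-Judge step, the mechanics are roughly: build a prompt from the rubric and the engine's response, send it to the judge model, and parse one score per criterion from the reply. The sketch below covers the first and last parts only and omits the model call. The prompt wording and the &lt;code&gt;build_judge_prompt&lt;/code&gt; and &lt;code&gt;parse_scores&lt;/code&gt; helpers are hypothetical, not the exact ones used in this benchmark.&lt;/p&gt;

```python
import re

# Criterion names come from the rubric table; the prompt wording is hypothetical.
CRITERIA = [
    "Code Location",
    "Problem Identification",
    "Architectural Understanding",
    "Implementation Guidance",
]


def build_judge_prompt(engine_response):
    """Assemble a judge prompt that asks for one 0-3 score per criterion."""
    lines = ["Score the response on each criterion from 0 to 3.", ""]
    lines += [f"- {name}" for name in CRITERIA]
    lines += ["", "Reply with one line per criterion, formatted as 'Name: score'."]
    lines += ["", "Response to score:", engine_response]
    return "\n".join(lines)


def parse_scores(judge_reply):
    """Extract 'Name: score' pairs and raise if any criterion is missing."""
    scores = {}
    for name in CRITERIA:
        match = re.search(re.escape(name) + r"\s*:\s*([0-3])\b", judge_reply)
        if match is None:
            raise ValueError(f"missing score for {name}")
        scores[name] = int(match.group(1))
    return scores


prompt = build_judge_prompt("class QuerySet is defined in django/db/models/query.py")
# Made-up judge reply used only to exercise the parser.
sample_reply = "\n".join([
    "Code Location: 3",
    "Problem Identification: 1",
    "Architectural Understanding: 1",
    "Implementation Guidance: 1",
])
scores = parse_scores(sample_reply)
print(scores, sum(scores.values()))
```

&lt;p&gt;Rejecting replies with a missing criterion keeps a malformed judge answer from silently turning into a low score.&lt;/p&gt;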

&lt;h3&gt;
  
  
  Example Scoring: Feature 1 (QuerySet Pipeline)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a QuerySet pipeline API in Django that allows chaining data transformations"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XCE Response Scoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Found &lt;code&gt;QuerySet&lt;/code&gt; class in &lt;code&gt;django/db/models/query.py&lt;/code&gt;, &lt;code&gt;SQLCompiler&lt;/code&gt; in &lt;code&gt;django/db/models/sql/compiler.py&lt;/code&gt;, related methods&lt;/li&gt;
&lt;li&gt;Problem Identification: 3/3 — Identified that pipeline needs to integrate with existing &lt;code&gt;_chain()&lt;/code&gt; mechanism and SQL compilation&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 3/3 — Showed full cross-module map: QuerySet → sql.Query → sql.compiler, plus admin/auth dependencies&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 2/3 — Showed where to integrate but didn't provide complete implementation path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 11/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auggie Response Scoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Found relevant QuerySet code&lt;/li&gt;
&lt;li&gt;Problem Identification: 3/3 — Good analysis of pipeline pattern&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 2/3 — Multi-module awareness but no dependency graph&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 2/3 — Good direction but no integration points&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 10/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Serena Response Scoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Found QuerySet exactly&lt;/li&gt;
&lt;li&gt;Problem Identification: 1/3 — Just showed the class, no analysis&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 1/3 — Single file only&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 1/3 — No guidance, just location&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 6/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
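&lt;p&gt;The totals above are plain sums of the four criteria. A short sketch of that arithmetic, using the Feature 1 scores listed above (the dictionary keys and the &lt;code&gt;total_score&lt;/code&gt; helper are illustrative names):&lt;/p&gt;

```python
CRITERIA = (
    "code_location",
    "problem_identification",
    "architectural_understanding",
    "implementation_guidance",
)


def total_score(scores):
    """Sum the four criteria after checking each is an integer from 0 to 3."""
    for name in CRITERIA:
        value = scores[name]
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be 0-3, got {value}")
    return sum(scores[name] for name in CRITERIA)


# Per-criterion scores from the Feature 1 example above.
FEATURE_1 = {
    "XCE": dict(zip(CRITERIA, (3, 3, 3, 2))),
    "Auggie": dict(zip(CRITERIA, (3, 3, 2, 2))),
    "Serena": dict(zip(CRITERIA, (3, 1, 1, 1))),
}

totals = {engine: total_score(scores) for engine, scores in FEATURE_1.items()}
print(totals)
```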

&lt;h3&gt;
  
  
  Why LLM-as-Judge?
&lt;/h3&gt;

&lt;p&gt;I chose LLM-as-Judge because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Human scoring varies between sessions. LLM scoring against a fixed rubric is more repeatable, though not fully deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Scoring 35+ issues × 3 engines = 100+ evaluations. Manual scoring at this scale introduces fatigue bias.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: The rubric is explicit. Anyone can re-run the evaluation with the same rubric and get similar scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration&lt;/strong&gt;: I manually verified ~20% of LLM-assigned scores and found &amp;gt;90% agreement with my own assessment.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations of This Methodology
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subjectivity in "Architectural Understanding"&lt;/strong&gt;: What counts as "full cross-module dependency map" vs "multi-module awareness" is somewhat subjective&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query formulation matters&lt;/strong&gt;: Different queries to the same engine can produce different results. I tried to use natural, representative queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-shot evaluation&lt;/strong&gt;: Each engine got one query per test. In practice, developers iterate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No A/B testing with actual developers&lt;/strong&gt;: This measures context quality, not end-to-end developer productivity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Test Environment and Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Repository
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; (django/django) — ~200,000 lines of Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch&lt;/strong&gt;: main (latest as of May 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modules tested&lt;/strong&gt;: ORM, admin, auth, cache, signals, migrations, transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Engine Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xanther"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.xanther.ai/sse?repo_id=django-django"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer [key]"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"serena"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"serena"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"start-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/django"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auggie was run via CLI: &lt;code&gt;auggie --mcp --mcp-auto-workspace&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;macOS (Apple Silicon)&lt;/li&gt;
&lt;li&gt;VS Code with Kiro AI assistant&lt;/li&gt;
&lt;li&gt;All engines running simultaneously for fair comparison&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results: SWE-bench Verified Bug Fixes (35+ Issues)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aggregate Scores
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Issues Tested&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Success Rate (≥10/12)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;10.5/12&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;11.0/12&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10.5/12&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key observation&lt;/strong&gt;: Serena's higher average on SWE-bench is misleading. It scores high on Code Location (always 3/3) but low on Architectural Understanding and Implementation Guidance. Its median is the same as XCE's, but its success rate is lower because it completely fails on cross-module bugs.&lt;/p&gt;
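&lt;p&gt;To make the table columns concrete, here is a minimal sketch of how the four aggregates can be computed from a list of per-issue scores. The score list is hypothetical (it is not the benchmark data), and using the population standard deviation is an assumption; the point is only that an average and a success rate measure different things.&lt;/p&gt;

```python
from statistics import mean, median, pstdev


def summarize(scores, success_threshold=10):
    """Return the aggregate columns of the table for one engine."""
    successes = sum(1 for s in scores if s >= success_threshold)
    return {
        "avg": round(mean(scores), 1),
        "median": median(scores),
        # Assumption: population standard deviation; the post does not say which one was used.
        "std_dev": round(pstdev(scores), 1),
        "success_rate": round(100 * successes / len(scores)),
    }


# Hypothetical per-issue scores out of 12, for illustration only.
example_scores = [12, 11, 11, 10, 9, 12, 8, 11]
summary = summarize(example_scores)
print(summary)
```

&lt;p&gt;Two engines with similar averages can still differ in success rate if one of them has more scores just under the 10/12 threshold.&lt;/p&gt;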

&lt;h3&gt;
  
  
  Response Time Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Avg Response Time&lt;/th&gt;
&lt;th&gt;Token Output&lt;/th&gt;
&lt;th&gt;Tokens/Second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;~1 second&lt;/td&gt;
&lt;td&gt;~500 tokens&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;td&gt;~2 seconds&lt;/td&gt;
&lt;td&gt;~2000 tokens&lt;/td&gt;
&lt;td&gt;~1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;~3 seconds&lt;/td&gt;
&lt;td&gt;~1500 tokens&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;XCE delivers the most context per second (~1000 tokens/s, versus ~500 for both Serena and Auggie). Serena is fastest in wall-clock time but returns the least context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb786yfpainp4bfok13m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb786yfpainp4bfok13m.png" alt="Response Time Comparison" width="799" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Response time comparison — Serena is fastest, but XCE delivers more context per second&lt;/em&gt;&lt;/p&gt;
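&lt;p&gt;The Tokens/Second column is simply token output divided by response time. A short sketch using the approximate figures from the response-time table (the dictionary layout and function name are my own, for illustration):&lt;/p&gt;

```python
# Approximate figures copied from the response-time table.
measurements = {
    "Serena": {"seconds": 1, "tokens": 500},
    "XCE": {"seconds": 2, "tokens": 2000},
    "Auggie": {"seconds": 3, "tokens": 1500},
}


def tokens_per_second(tokens, seconds):
    """Throughput of returned context for a single query."""
    return tokens / seconds


throughput = {
    name: tokens_per_second(m["tokens"], m["seconds"])
    for name, m in measurements.items()
}
# Lowest wall-clock time and highest throughput are different winners.
fastest = min(measurements, key=lambda name: measurements[name]["seconds"])
densest = max(throughput, key=throughput.get)
print(throughput, fastest, densest)
```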

&lt;h3&gt;
  
  
  Notable Discovery: Unfixed Bug
&lt;/h3&gt;

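&lt;p&gt;During testing, XCE helped me pin down a long-standing behavior in Django main that I'd classify as a bug, though Django's documentation describes it as expected (&lt;code&gt;Model.delete()&lt;/code&gt; isn't called on related models during a cascade, while the &lt;code&gt;pre_delete&lt;/code&gt; and &lt;code&gt;post_delete&lt;/code&gt; signals are still sent):&lt;/p&gt;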

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue&lt;/strong&gt;: ForeignKey CASCADE delete doesn't call &lt;code&gt;model.delete()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt;: &lt;code&gt;django/db/models/deletion.py&lt;/code&gt; — CASCADE function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause&lt;/strong&gt;: Uses &lt;code&gt;collector.collect()&lt;/code&gt; + &lt;code&gt;collector.delete()&lt;/code&gt; directly, bypassing each model's &lt;code&gt;delete()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Custom &lt;code&gt;delete()&lt;/code&gt; logic in models is silently ignored during CASCADE deletes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XCE Score&lt;/strong&gt;: 12/12 — Found exact location, traced full call graph, identified all affected modules&lt;/li&gt;
&lt;/ul&gt;
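&lt;p&gt;The mechanism is easier to see in a toy model. The sketch below is plain Python and not Django's code: &lt;code&gt;Signal&lt;/code&gt;, &lt;code&gt;Attachment&lt;/code&gt;, and &lt;code&gt;cascade_delete&lt;/code&gt; are hypothetical stand-ins. It mimics a bulk delete path that never calls each object's &lt;code&gt;delete()&lt;/code&gt;, while a &lt;code&gt;pre_delete&lt;/code&gt;-style hook still fires for every object.&lt;/p&gt;

```python
class Signal:
    """Tiny stand-in for a signal dispatcher (not Django's Signal class)."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self, instance):
        for receiver in self.receivers:
            receiver(instance)


pre_delete = Signal()
audit_log = []


class Attachment:
    def __init__(self, name):
        self.name = name

    def delete(self):
        # Custom cleanup that the model author expects to run on every delete.
        audit_log.append(f"custom delete: {self.name}")


def cascade_delete(related_objects):
    """Toy bulk delete: notifies receivers per object but never calls obj.delete()."""
    for obj in related_objects:
        pre_delete.send(obj)
    related_objects.clear()


pre_delete.connect(lambda obj: audit_log.append(f"pre_delete: {obj.name}"))
attachments = [Attachment("a.pdf"), Attachment("b.pdf")]
cascade_delete(attachments)
print(audit_log)
```

&lt;p&gt;In this toy, only the signal receiver leaves a trace. That suggests putting cleanup that must survive cascades into a signal receiver rather than into an overridden &lt;code&gt;delete()&lt;/code&gt;.&lt;/p&gt;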




&lt;h2&gt;
  
  
  Results: Complex Architectural Features (10 Features)
&lt;/h2&gt;

&lt;p&gt;This is where the comparison gets interesting. These are feature design tasks rather than bug fixes, and each one requires multi-module architectural understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature-by-Feature Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;XCE&lt;/th&gt;
&lt;th&gt;Auggie&lt;/th&gt;
&lt;th&gt;Serena&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;QuerySet Pipeline API&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;6/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Cross-DB Transaction Coordinator&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;5/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Dynamic Model Schema Evolution&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;7/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Unified Caching Layer&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;6/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Real-time QuerySet Observations&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Multi-Tenant Row-Level Security&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;7/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;GraphQL QuerySet Integration&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Automatic Query Optimization&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;6/12&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Distributed Lock Manager&lt;/td&gt;
&lt;td&gt;7/12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Event Sourcing Backend&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6ia412zr6nivvc6hh3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6ia412zr6nivvc6hh3a.png" alt="Feature Test Results" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: Feature-by-feature scores. XCE wins Features 1-6 and 8; Auggie wins Features 7, 9, and 10&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complexity Crossover
&lt;/h3&gt;

&lt;p&gt;This is the most important finding in the entire experiment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Features 1-6 (Standard Complexity — extending existing Django patterns):
  XCE:    11.0/12 avg  ████████████  91.7%
  Auggie:  9.8/12 avg  ██████████   81.7%
  Serena:  6.2/12 avg  ██████       51.7%

Features 7-10 (High Complexity — novel patterns not in Django):
  XCE:    9.0/12 avg   █████████    75.0%
  Auggie:  9.3/12 avg  █████████    77.5%
  Serena:  N/A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
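&lt;p&gt;For anyone who wants to recompute the group averages, here is a small sketch that works directly from the feature-by-feature table for the two engines compared in the crossover. Skipping features an engine was not scored on (shown as "-") is an assumption about how gaps are treated; the function names are illustrative.&lt;/p&gt;

```python
# Scores copied from the feature table; None marks a feature with no score ("-").
scores = {
    "XCE": [11, 11, 11, 11, 11, 11, 9, 11, 7, None],
    "Auggie": [10, 10, 10, 10, 10, 9, 10, 10, 8, 9],
}


def group_average(engine, first, last):
    """Average an engine's scores for features first..last, skipping unscored ones."""
    tested = [s for s in scores[engine][first - 1:last] if s is not None]
    return sum(tested) / len(tested)


def as_percent(avg, max_score=12):
    """Express an average score as a percentage of the maximum."""
    return round(100 * avg / max_score, 1)


for engine in scores:
    standard = group_average(engine, 1, 6)
    novel = group_average(engine, 7, 10)
    print(
        f"{engine}: features 1-6 {standard:.1f} ({as_percent(standard)}%), "
        f"features 7-10 {novel:.1f} ({as_percent(novel)}%)"
    )
```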



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1v6a3ifgkc4sejeev4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1v6a3ifgkc4sejeev4p.png" alt="Complexity Curve" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: The complexity crossover — Auggie surpasses XCE as problems become more novel&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this happen?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features 1-6 require understanding Django's &lt;em&gt;existing&lt;/em&gt; architecture. XCE's PRAT has this pre-computed.&lt;/li&gt;
&lt;li&gt;Features 7-10 require &lt;em&gt;inventing&lt;/em&gt; new architecture. There's no existing call graph for "event sourcing in Django" because it doesn't exist yet. Auggie's semantic search finds conceptually similar patterns from other contexts.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Complexity Curve: Why It Matters
&lt;/h2&gt;

&lt;p&gt;This finding has practical implications for how developers should choose context engines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem Type&lt;/th&gt;
&lt;th&gt;% of Daily Work&lt;/th&gt;
&lt;th&gt;Best Engine&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Quick lookups&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;Speed, precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard complex tasks&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;XCE&lt;/td&gt;
&lt;td&gt;Architecture, call graphs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novel design problems&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;Semantic synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The crossover point is approximately at the boundary between "extending existing patterns" and "creating new patterns." If the answer exists somewhere in the current codebase's architecture, XCE will find it faster and more completely. If the answer requires synthesizing concepts that don't yet exist in the codebase, Auggie's embedding-based approach has an edge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Token Usage and Context Quality
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Tokens/Query&lt;/th&gt;
&lt;th&gt;Context Type&lt;/th&gt;
&lt;th&gt;Information Density&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw (no engine)&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;File path only&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;Symbol + location&lt;/td&gt;
&lt;td&gt;Low (but precise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;~1500&lt;/td&gt;
&lt;td&gt;Semantic explanation&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2000+&lt;/td&gt;
&lt;td&gt;Architecture + call graph&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjimcm1ytayup44z2ys0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjimcm1ytayup44z2ys0m.png" alt="Token Usage" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Token usage per query — XCE provides 4x more context than Serena&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48aryfmsl7vmcev7c90k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48aryfmsl7vmcev7c90k.png" alt="Overall Scores" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5: Overall performance comparison across both test programs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The context-to-performance relationship&lt;/strong&gt;: More context generally means better scores, but only if the context is &lt;em&gt;relevant&lt;/em&gt;. XCE's 2000 tokens of architectural context outperform Auggie's 1500 tokens of semantic explanation on standard problems because architecture is what matters for those problems. But on novel problems, Auggie's 1500 tokens of semantic synthesis outperform XCE's 2000 tokens of (potentially irrelevant) architectural context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Engine&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fix Django ORM bug&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Call graph shows impact chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick symbol lookup&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Serena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instant, precise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design new feature (novel)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auggie&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Understand legacy code&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auggie&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prevent regressions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dependency/impact analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debug transaction issues&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-module tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design GraphQL API&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auggie&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern recognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add cache backend&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Follow existing architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
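&lt;p&gt;If you want to encode this matrix in tooling (for example, a script that picks which MCP server to query first), a plain lookup table is enough. This is a minimal sketch; the &lt;code&gt;recommend&lt;/code&gt; helper and the exact-match lookup are my own simplifications, not part of any of the three engines.&lt;/p&gt;

```python
# Rows copied from the decision matrix: scenario -> (engine, reason).
DECISION_MATRIX = {
    "fix django orm bug": ("XCE", "Call graph shows impact chain"),
    "quick symbol lookup": ("Serena", "Instant, precise"),
    "design new feature (novel)": ("Auggie", "Semantic synthesis"),
    "understand legacy code": ("Auggie", "Clear explanations"),
    "prevent regressions": ("XCE", "Dependency/impact analysis"),
    "debug transaction issues": ("XCE", "Cross-module tracing"),
    "design graphql api": ("Auggie", "Pattern recognition"),
    "add cache backend": ("XCE", "Follow existing architecture"),
}


def recommend(scenario):
    """Return (engine, reason) for a known scenario, or None if it is not in the matrix."""
    return DECISION_MATRIX.get(scenario.strip().lower())


print(recommend("Quick symbol lookup"))
print(recommend("Prevent regressions"))
```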

&lt;h3&gt;
  
  
  My Recommended Workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Start with Serena for navigation (find the file/function)
2. Use XCE for understanding (how does it connect to other modules?)
3. Switch to Auggie for novel design (what patterns apply here?)
4. Return to XCE for impact analysis (what might break?)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finztw8o9783cbzjfwt37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finztw8o9783cbzjfwt37.png" alt="When to Use Which Engine" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 6: Decision guide for choosing the right context engine&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pdsredl5odi9vp28dsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pdsredl5odi9vp28dsa.png" alt="Wins by Category" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 7: Summary of wins by problem category&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context engines enable weaker models&lt;/td&gt;
&lt;td&gt;Structured context + small model ≈ frontier model without context&lt;/td&gt;
&lt;td&gt;Invest in context, not just model size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XCE wins on standard complex tasks&lt;/td&gt;
&lt;td&gt;6/6 Features 1-6, avg 11/12&lt;/td&gt;
&lt;td&gt;Use XCE for Django core work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie wins on novel complexity&lt;/td&gt;
&lt;td&gt;3/4 Features 7-10, avg 9.3/12&lt;/td&gt;
&lt;td&gt;Use Auggie for new design problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serena wins on speed&lt;/td&gt;
&lt;td&gt;~1s vs 2-3s&lt;/td&gt;
&lt;td&gt;Use Serena for quick lookups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRAT enables faster rich context&lt;/td&gt;
&lt;td&gt;XCE: 2s for 2000 tokens vs Auggie: 3s for 1500 tokens&lt;/td&gt;
&lt;td&gt;Pre-computed architecture beats runtime synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Final Scores
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.5/12&lt;/td&gt;
&lt;td&gt;10.3/12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.4/12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard complex tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auggie&lt;/td&gt;
&lt;td&gt;10.5/12&lt;/td&gt;
&lt;td&gt;9.7/12&lt;/td&gt;
&lt;td&gt;10.1/12&lt;/td&gt;
&lt;td&gt;Novel design problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serena&lt;/td&gt;
&lt;td&gt;11.0/12&lt;/td&gt;
&lt;td&gt;6.4/12&lt;/td&gt;
&lt;td&gt;8.7/12&lt;/td&gt;
&lt;td&gt;Speed/precision&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
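&lt;p&gt;The Overall column matches an unweighted mean of the SWE-bench and Features columns, rounded to one decimal. That weighting is inferred from the numbers rather than stated anywhere, so treat this sketch as an assumption:&lt;/p&gt;

```python
# Values copied from the Final Scores table (all out of 12).
final_scores = {
    "XCE": {"swe_bench": 10.5, "features": 10.3},
    "Auggie": {"swe_bench": 10.5, "features": 9.7},
    "Serena": {"swe_bench": 11.0, "features": 6.4},
}


def overall(entry):
    """Unweighted mean of the two test programs (assumed weighting)."""
    return round((entry["swe_bench"] + entry["features"]) / 2, 1)


overall_scores = {name: overall(entry) for name, entry in final_scores.items()}
ranking = sorted(overall_scores, key=overall_scores.get, reverse=True)
print(overall_scores, ranking)
```

&lt;p&gt;A different weighting (for example, by how often each problem type comes up in daily work) would change the ranking margins, so the single Overall number is best read alongside the per-program columns.&lt;/p&gt;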

&lt;p&gt;&lt;strong&gt;XCE wins overall&lt;/strong&gt; for Django development. Its PRAT-based architecture delivers the most context per second for the majority of development tasks (Serena responds faster in wall-clock time but returns far less). The more nuanced finding is that &lt;strong&gt;as problem novelty increases, Auggie's semantic approach catches up and eventually surpasses XCE&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The best approach is to use all three strategically based on the problem at hand.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix A: Complete Engine Traces for All 10 Complex Features
&lt;/h2&gt;

&lt;p&gt;Below are the complete traces from each engine for every feature test. These show exactly what each engine returned and how I scored them.&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 1: QuerySet Pipeline API
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a QuerySet pipeline API in Django that allows chaining data transformations like &lt;code&gt;.pipe(filter()).pipe(transform()).pipe(aggregate())&lt;/code&gt;. Show where this would integrate in Django's ORM and what files need modification."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/query.py&lt;/code&gt;, &lt;code&gt;django/db/models/sql/compiler.py&lt;/code&gt;, &lt;code&gt;django/db/models/sql/query.py&lt;/code&gt;, &lt;code&gt;django/db/models/aggregates.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph returned&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  QuerySet._chain() → QuerySet._clone()
  QuerySet.filter() → Q() → Query.add_q() → WhereNode
  QuerySet.annotate() → Query.add_annotation()
  QuerySet.aggregate() → Query.get_aggregation() → SQLCompiler
  select_related() → Query.add_select_related() → SQLCompiler.get_select()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HLD Context&lt;/strong&gt;: "django/db/models — ORM service layer providing database abstraction"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLD Context&lt;/strong&gt;: "QuerySet implements lazy evaluation with _chain() for immutable query building"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-module dependencies&lt;/strong&gt;: Admin uses QuerySet for list display, Auth uses QuerySet for user lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration point identified&lt;/strong&gt;: Pipeline should hook into &lt;code&gt;_chain()&lt;/code&gt; mechanism since all QuerySet methods use it&lt;/li&gt;
&lt;/ul&gt;
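&lt;p&gt;To illustrate the integration point, here is a toy sketch of a &lt;code&gt;.pipe()&lt;/code&gt; method built on a clone-on-chain helper. &lt;code&gt;ToyQuerySet&lt;/code&gt; is a hypothetical stand-in written in plain Python; it is not Django's &lt;code&gt;QuerySet&lt;/code&gt;, and &lt;code&gt;pipe()&lt;/code&gt; is the proposed API, not an existing one.&lt;/p&gt;

```python
class ToyQuerySet:
    """Toy lazy, immutable query builder (a stand-in, not Django's QuerySet)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def _chain(self, step):
        # Return a new object instead of mutating self, mirroring clone-on-chain.
        return ToyQuerySet(self._rows, self._steps + (step,))

    def filter(self, predicate):
        return self._chain(lambda rows: [r for r in rows if predicate(r)])

    def pipe(self, func, *args, **kwargs):
        # Hypothetical API: func receives a queryset and must return a queryset.
        result = func(self, *args, **kwargs)
        if not isinstance(result, ToyQuerySet):
            raise TypeError("pipe() callables must return a ToyQuerySet")
        return result

    def __iter__(self):
        # Evaluation is deferred until iteration.
        rows = list(self._rows)
        for step in self._steps:
            rows = step(rows)
        return iter(rows)


def adults(qs):
    return qs.filter(lambda r: r["age"] >= 18)


def named(qs, prefix):
    return qs.filter(lambda r: r["name"].startswith(prefix))


people = ToyQuerySet([
    {"name": "Ada", "age": 36},
    {"name": "Al", "age": 12},
    {"name": "Bob", "age": 40},
])
result = list(people.pipe(adults).pipe(named, "A"))
print(result)
```

&lt;p&gt;Because every step goes through &lt;code&gt;_chain()&lt;/code&gt;, the original object is left untouched and each pipeline stage stays composable.&lt;/p&gt;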

&lt;p&gt;&lt;strong&gt;Scoring breakdown&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Found all relevant files and methods&lt;/li&gt;
&lt;li&gt;Problem Identification: 3/3 — Identified &lt;code&gt;_chain()&lt;/code&gt; as the integration point&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 3/3 — Full cross-module map with HLD/LLD&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 2/3 — Clear direction but no complete implementation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 11/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
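&lt;p&gt;The same four-dimension rubric is used for every trace in this appendix. As a sketch, it can be modeled as a small dataclass; the class name is mine, and the 0-3 range per dimension is inferred from the scores shown in the traces.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """One engine's score on one task: four dimensions, 0-3 points each, 12 total."""

    code_location: int
    problem_identification: int
    architectural_understanding: int
    implementation_guidance: int

    def __post_init__(self):
        for name, value in vars(self).items():
            if value not in range(0, 4):
                raise ValueError(f"{name} must be between 0 and 3, got {value}")

    @property
    def total(self):
        return sum(vars(self).values())


# Feature 1 scores as reported in the traces.
feature_1 = {
    "XCE": RubricScore(3, 3, 3, 2),
    "Auggie": RubricScore(3, 3, 2, 2),
    "Serena": RubricScore(3, 1, 1, 1),
}
totals = {engine: score.total for engine, score in feature_1.items()}
print(totals)
```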

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;auggie codebase-retrieval&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found QuerySet class and its chaining mechanism&lt;/li&gt;
&lt;li&gt;Explained how Django's lazy evaluation works&lt;/li&gt;
&lt;li&gt;Identified that &lt;code&gt;.filter()&lt;/code&gt;, &lt;code&gt;.exclude()&lt;/code&gt;, &lt;code&gt;.annotate()&lt;/code&gt; all return new QuerySet instances&lt;/li&gt;
&lt;li&gt;Suggested pipeline could follow same pattern&lt;/li&gt;
&lt;li&gt;Did NOT provide call graph or architectural hierarchy&lt;/li&gt;
&lt;li&gt;Did NOT show cross-module dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring breakdown&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Found QuerySet correctly&lt;/li&gt;
&lt;li&gt;Problem Identification: 3/3 — Good analysis of chaining pattern&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 2/3 — Understood ORM but no cross-module map&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 2/3 — Good direction, no integration specifics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 10/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 6/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;find_symbol QuerySet&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;QuerySet&lt;/code&gt; class at &lt;code&gt;django/db/models/query.py:324&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Listed methods: &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;exclude&lt;/code&gt;, &lt;code&gt;annotate&lt;/code&gt;, &lt;code&gt;aggregate&lt;/code&gt;, &lt;code&gt;select_related&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No explanation of how they work together&lt;/li&gt;
&lt;li&gt;No cross-module context&lt;/li&gt;
&lt;li&gt;No architectural understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring breakdown&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — Exact location found&lt;/li&gt;
&lt;li&gt;Problem Identification: 1/3 — Just listed methods, no analysis&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 1/3 — Single file only&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 1/3 — No guidance provided&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 6/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Feature 2: Cross-Database Transaction Coordinator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a cross-database transaction coordinator for Django that supports two-phase commit across multiple databases."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_architecture_context&lt;/code&gt; on &lt;code&gt;django/db/transaction.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/transaction.py&lt;/code&gt;, &lt;code&gt;django/db/backends/base/base.py&lt;/code&gt;, &lt;code&gt;django/db/utils.py&lt;/code&gt; (ConnectionRouter), &lt;code&gt;django/test/utils.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  atomic() → Atomic.__enter__() → connection.savepoint()
  Atomic.__exit__() → connection.savepoint_commit() or connection.savepoint_rollback()
  connection.commit() → BaseDatabaseWrapper.commit()
  connection.rollback() → BaseDatabaseWrapper.rollback()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HLD&lt;/strong&gt;: "django/db — Database abstraction layer managing connections, transactions, and query execution"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: Django's current transaction system is per-connection. Two-phase commit would need a coordinator above &lt;code&gt;BaseDatabaseWrapper&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration points&lt;/strong&gt;: &lt;code&gt;django/db/utils.py&lt;/code&gt; ConnectionHandler manages multiple database connections&lt;/li&gt;
&lt;/ul&gt;
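&lt;p&gt;For readers unfamiliar with two-phase commit, here is a minimal sketch of the coordinator idea in plain Python. &lt;code&gt;Participant&lt;/code&gt; and &lt;code&gt;TwoPhaseCoordinator&lt;/code&gt; are hypothetical classes; nothing here touches Django's connection or transaction APIs, and recovery after a coordinator crash is deliberately left out.&lt;/p&gt;

```python
class Participant:
    """Hypothetical wrapper around one database connection."""

    def __init__(self, alias, can_commit=True):
        self.alias = alias
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1 vote: promise to commit, or refuse.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


class TwoPhaseCoordinator:
    def __init__(self, participants):
        self.participants = list(participants)

    def run(self):
        """Commit everywhere only if every participant votes yes; otherwise roll back all."""
        votes = [p.prepare() for p in self.participants]  # phase 1: collect votes
        if all(votes):
            for p in self.participants:  # phase 2: commit
                p.commit()
            return True
        for p in self.participants:  # phase 2: abort
            p.rollback()
        return False


ok_run = [Participant("default"), Participant("analytics")]
ok_result = TwoPhaseCoordinator(ok_run).run()

bad_run = [Participant("default"), Participant("analytics", can_commit=False)]
bad_result = TwoPhaseCoordinator(bad_run).run()
print(ok_result, bad_result)
```

&lt;p&gt;In a real design the coordinator would sit above the per-connection transaction layer, which is the gap the XCE trace points to.&lt;/p&gt;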

&lt;p&gt;&lt;strong&gt;Scoring breakdown&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code Location: 3/3 — All transaction-related files found&lt;/li&gt;
&lt;li&gt;Problem Identification: 3/3 — Identified per-connection limitation&lt;/li&gt;
&lt;li&gt;Architectural Understanding: 3/3 — Full transaction architecture mapped&lt;/li&gt;
&lt;li&gt;Implementation Guidance: 2/3 — Clear where to add coordinator, but no 2PC protocol details&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 11/12&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explained Django's &lt;code&gt;atomic()&lt;/code&gt; decorator and context manager&lt;/li&gt;
&lt;li&gt;Described how &lt;code&gt;commit()&lt;/code&gt;, &lt;code&gt;rollback()&lt;/code&gt;, and savepoints work&lt;/li&gt;
&lt;li&gt;Identified that Django doesn't support distributed transactions natively&lt;/li&gt;
&lt;li&gt;Suggested XA transaction protocol for 2PC&lt;/li&gt;
&lt;li&gt;Good conceptual explanation but no call graph&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 2, Guidance 2 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 5/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;atomic&lt;/code&gt; function in &lt;code&gt;django/db/transaction.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Found &lt;code&gt;Atomic&lt;/code&gt; class&lt;/li&gt;
&lt;li&gt;No explanation of how transactions flow through the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 1, Architecture 0, Guidance 1 = &lt;strong&gt;5/12&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 3: Dynamic Model Schema Evolution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "How does Django's Model metaclass work in django/db/models/base.py for dynamic field creation?"&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_architecture_context&lt;/code&gt; on &lt;code&gt;ModelBase&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/base.py&lt;/code&gt; (ModelBase metaclass), &lt;code&gt;django/db/models/options.py&lt;/code&gt; (Options/Meta), &lt;code&gt;django/db/migrations/state.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ModelBase.__new__() → Options() → contribute_to_class()
  Model._meta → Options instance
  Options.contribute_to_class() → field.contribute_to_class()
  Model._check_fields() → field validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: &lt;code&gt;ModelBase.__new__()&lt;/code&gt; is where fields are collected and attached. Dynamic schema would need to hook here or use &lt;code&gt;contribute_to_class()&lt;/code&gt; post-creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration integration&lt;/strong&gt;: &lt;code&gt;django/db/migrations/state.py&lt;/code&gt; ModelState would need to support runtime changes&lt;/li&gt;
&lt;/ul&gt;
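&lt;p&gt;The metaclass pattern described in the trace can be sketched in a few lines of plain Python. &lt;code&gt;ToyModelBase&lt;/code&gt; and &lt;code&gt;Field&lt;/code&gt; below are simplified, hypothetical stand-ins rather than Django's classes; they only show how fields are collected at class creation and how the same hook can attach a field afterwards.&lt;/p&gt;

```python
class Field:
    """Toy field that registers itself on the owning class."""

    def contribute_to_class(self, cls, name):
        self.name = name
        cls._meta_fields[name] = self


class ToyModelBase(type):
    """Toy metaclass: collects Field attributes when a class is created."""

    def __new__(mcs, name, bases, attrs):
        fields = {k: v for k, v in attrs.items() if isinstance(v, Field)}
        plain = {k: v for k, v in attrs.items() if k not in fields}
        cls = super().__new__(mcs, name, bases, plain)
        cls._meta_fields = {}
        for field_name, field in fields.items():
            cls.add_to_class(field_name, field)
        return cls

    def add_to_class(cls, name, value):
        # Objects that know how to attach themselves get the chance to do so.
        if hasattr(value, "contribute_to_class"):
            value.contribute_to_class(cls, name)
        else:
            setattr(cls, name, value)


class Article(metaclass=ToyModelBase):
    title = Field()


# Runtime addition goes through the same hook as class-creation time.
Article.add_to_class("subtitle", Field())
field_names = list(Article._meta_fields)
print(field_names)
```

&lt;p&gt;Note that this only changes the in-memory class. Keeping a database schema and migration state in sync with runtime changes is the harder part that the trace flags.&lt;/p&gt;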

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 3, Guidance 2 = &lt;strong&gt;11/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent explanation of Python metaclasses and how Django uses them&lt;/li&gt;
&lt;li&gt;Described &lt;code&gt;ModelBase.__new__()&lt;/code&gt; flow in detail&lt;/li&gt;
&lt;li&gt;Explained &lt;code&gt;contribute_to_class()&lt;/code&gt; mechanism&lt;/li&gt;
&lt;li&gt;Suggested &lt;code&gt;add_to_class()&lt;/code&gt; for runtime field addition&lt;/li&gt;
&lt;li&gt;No migration system integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 2, Guidance 2 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 7/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;ModelBase&lt;/code&gt; class at &lt;code&gt;django/db/models/base.py:94&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returned full class body (lines 94-458)&lt;/li&gt;
&lt;li&gt;Useful for reading the actual code, but gave no explanation of it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 2, Architecture 1, Guidance 1 = &lt;strong&gt;7/12&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 4: Unified Caching Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a unified intelligent caching layer for Django that automatically selects optimal cache backend based on access patterns."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/core/cache/__init__.py&lt;/code&gt; (CacheHandler), &lt;code&gt;django/core/cache/backends/redis.py&lt;/code&gt;, &lt;code&gt;django/core/cache/backends/memcached.py&lt;/code&gt;, &lt;code&gt;django/core/cache/backends/db.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  CacheHandler.__getitem__() → create_connection()
  BaseCache.add/get/set/delete/touch/has_key/incr
  RedisCache → RedisCacheClient → RedisSerializer
  close_caches signal → test cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HLD&lt;/strong&gt;: "django/core/cache — Caching service layer with pluggable backends"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: CacheHandler already supports multiple named caches. Intelligent routing could be a new CacheHandler subclass that delegates based on key patterns or access frequency&lt;/li&gt;
&lt;/ul&gt;
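&lt;p&gt;A minimal sketch of the key-pattern routing idea follows. It uses a wrapper rather than a real &lt;code&gt;CacheHandler&lt;/code&gt; subclass, and &lt;code&gt;DictCache&lt;/code&gt; and &lt;code&gt;PrefixRouterCache&lt;/code&gt; are hypothetical names: the in-memory backend stands in for Redis, Memcached, or the database cache so the example runs without Django.&lt;/p&gt;

```python
class DictCache:
    """In-memory stand-in for a real cache backend."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class PrefixRouterCache:
    """Delegates each call to a backend chosen by key prefix."""

    def __init__(self, routes, default):
        self.routes = routes    # list of (prefix, backend) pairs, checked in order
        self.default = default  # backend used when no prefix matches

    def _backend_for(self, key):
        for prefix, backend in self.routes:
            if key.startswith(prefix):
                return backend
        return self.default

    def get(self, key, default=None):
        return self._backend_for(key).get(key, default)

    def set(self, key, value):
        self._backend_for(key).set(key, value)


hot, cold = DictCache(), DictCache()
cache = PrefixRouterCache(routes=[("session:", hot)], default=cold)
cache.set("session:42", "alice")       # routed to the "hot" backend
cache.set("report:2026", "pdf-bytes")  # falls through to the default backend
```

&lt;p&gt;Routing by access frequency would replace &lt;code&gt;_backend_for()&lt;/code&gt; with a lookup against per-key hit counters; the delegation structure would stay the same.&lt;/p&gt;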

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 3, Guidance 2 = &lt;strong&gt;11/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found cache backend implementations&lt;/li&gt;
&lt;li&gt;Explained differences between Redis, Memcached, and DB cache&lt;/li&gt;
&lt;li&gt;Suggested access pattern tracking with LRU/LFU metrics&lt;/li&gt;
&lt;li&gt;Good design suggestions but no architectural integration details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 2, Guidance 2 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 6/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;CacheHandler&lt;/code&gt; class definition&lt;/li&gt;
&lt;li&gt;Listed available backends&lt;/li&gt;
&lt;li&gt;No integration context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 1, Architecture 1, Guidance 1 = &lt;strong&gt;6/12&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 5: Real-time QuerySet Observations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a real-time queryset observation system for Django that pushes updates via WebSockets when underlying data changes."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/signals.py&lt;/code&gt;, &lt;code&gt;django/db/models/query.py&lt;/code&gt;, &lt;code&gt;django/dispatch/dispatcher.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Model.save() → post_save.send()
  Model.delete() → post_delete.send()
  Signal.send() → receiver functions
  QuerySet._insert() → mark_for_rollback_on_error()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: Django's signal system (&lt;code&gt;post_save&lt;/code&gt;, &lt;code&gt;post_delete&lt;/code&gt;, &lt;code&gt;m2m_changed&lt;/code&gt;) already fires on data changes. Real-time observation would need to:

&lt;ol&gt;
&lt;li&gt;Register a signal receiver per observed QuerySet&lt;/li&gt;
&lt;li&gt;Re-evaluate the QuerySet filter on each signal&lt;/li&gt;
&lt;li&gt;Push diffs via WebSocket&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration&lt;/strong&gt;: Signals are the hook point; QuerySet's &lt;code&gt;_result_cache&lt;/code&gt; could track what's "observed"&lt;/li&gt;
&lt;/ul&gt;
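&lt;p&gt;The three steps in the key insight can be sketched without Django or a WebSocket server. Everything here is an assumption made for illustration: &lt;code&gt;Observer&lt;/code&gt; and &lt;code&gt;diff_results()&lt;/code&gt; are hypothetical names, &lt;code&gt;evaluate&lt;/code&gt; stands in for re-running the QuerySet, and &lt;code&gt;push&lt;/code&gt; stands in for the WebSocket send.&lt;/p&gt;

```python
def diff_results(previous, current):
    """Compare two {pk: row} snapshots and describe what changed."""
    added = {pk: row for pk, row in current.items() if pk not in previous}
    removed = sorted(pk for pk in previous if pk not in current)
    changed = {
        pk: row
        for pk, row in current.items()
        if pk in previous and previous[pk] != row
    }
    return {"added": added, "removed": removed, "changed": changed}


class Observer:
    def __init__(self, evaluate, push):
        self.evaluate = evaluate  # callable returning the current {pk: row} snapshot
        self.push = push          # callable that delivers a diff to the client
        self.snapshot = evaluate()

    def on_change(self):
        # In Django this would be called from a post_save / post_delete receiver.
        current = self.evaluate()
        diff = diff_results(self.snapshot, current)
        self.snapshot = current
        if any(diff.values()):
            self.push(diff)


rows = {1: {"title": "a"}, 2: {"title": "b"}}
sent = []
observer = Observer(evaluate=lambda: dict(rows), push=sent.append)

rows[3] = {"title": "c"}  # simulate an insert
del rows[1]               # simulate a delete
observer.on_change()
```

&lt;p&gt;Re-evaluating on every signal is the simplest approach and also the most expensive one; a production design would likely batch or debounce the calls to &lt;code&gt;on_change()&lt;/code&gt;.&lt;/p&gt;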

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 3, Guidance 2 = &lt;strong&gt;11/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explained Django signals system&lt;/li&gt;
&lt;li&gt;Described how to combine signals with WebSocket channels&lt;/li&gt;
&lt;li&gt;Referenced Django Channels for WebSocket support&lt;/li&gt;
&lt;li&gt;Good high-level design but less specific about ORM integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 3, Architecture 2, Guidance 3 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: N/A — not tested on this feature)
&lt;/h4&gt;




&lt;h3&gt;
  
  
  Feature 6: Multi-Tenant Row-Level Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a multi-tenant row-level security system in Django that automatically filters queries based on tenant context."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/query.py&lt;/code&gt; (QuerySet), &lt;code&gt;django/db/models/manager.py&lt;/code&gt; (Manager), &lt;code&gt;django/db/utils.py&lt;/code&gt; (ConnectionRouter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Manager.get_queryset() → QuerySet()
  QuerySet.filter() → Query.add_q()
  ConnectionRouter.db_for_read() → database selection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: RLS should be implemented at the Manager level. A &lt;code&gt;TenantManager&lt;/code&gt; that overrides &lt;code&gt;get_queryset()&lt;/code&gt; to automatically add &lt;code&gt;.filter(tenant=current_tenant)&lt;/code&gt; is the cleanest approach. This mirrors how Django's &lt;code&gt;auth&lt;/code&gt; module uses custom managers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with auth&lt;/strong&gt;: &lt;code&gt;request.user.tenant&lt;/code&gt; provides the context; middleware sets thread-local tenant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 3, Guidance 2 = &lt;strong&gt;11/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 9/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good explanation of multi-tenancy patterns&lt;/li&gt;
&lt;li&gt;Suggested middleware + custom Manager approach&lt;/li&gt;
&lt;li&gt;Referenced PostgreSQL RLS as inspiration&lt;/li&gt;
&lt;li&gt;Less specific about Django's Manager/QuerySet integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 3, Architecture 2, Guidance 2 = &lt;strong&gt;9/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 7/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;ConnectionRouter&lt;/code&gt; class with methods&lt;/li&gt;
&lt;li&gt;Found &lt;code&gt;Manager&lt;/code&gt; class&lt;/li&gt;
&lt;li&gt;Precise locations but no design guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 2, Architecture 1, Guidance 1 = &lt;strong&gt;7/12&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 7: GraphQL QuerySet Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a GraphQL QuerySet integration for Django that compiles GraphQL queries to optimized SQL with automatic dataloader-style N+1 prevention."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 9/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/sql/compiler.py&lt;/code&gt; (SQLCompiler), &lt;code&gt;django/db/models/query.py&lt;/code&gt; (select_related, prefetch_related)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  SQLCompiler.execute_sql() → cursor.execute()
  select_related() → Query.add_select_related() → JOIN generation
  prefetch_related() → prefetch_related_objects() → separate queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it found well&lt;/strong&gt;: Django's existing N+1 prevention mechanisms (select_related for JOINs, prefetch_related for batched queries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it missed&lt;/strong&gt;: No GraphQL-specific context. XCE's PRAT indexes only Django's codebase, so it cannot supply patterns for a GraphQL integration that Django does not yet contain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 2, Architecture 2, Guidance 2 = &lt;strong&gt;9/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explained N+1 problem in GraphQL context&lt;/li&gt;
&lt;li&gt;Referenced dataloader pattern (batching + caching)&lt;/li&gt;
&lt;li&gt;Showed how to map GraphQL field resolution to Django's &lt;code&gt;select_related&lt;/code&gt;/&lt;code&gt;prefetch_related&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Suggested query analysis at GraphQL AST level to determine which relations to prefetch&lt;/li&gt;
&lt;li&gt;Provided conceptual implementation with resolver → QuerySet mapping&lt;/li&gt;
&lt;/ul&gt;
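&lt;p&gt;One way the AST-level analysis could look is sketched below. This is my own illustration, not Auggie's output: &lt;code&gt;plan_prefetch()&lt;/code&gt; is a hypothetical helper, the GraphQL selection is modelled as a nested dict instead of a parsed AST, and the &lt;code&gt;relations&lt;/code&gt; map (keyed by field name only, a simplification) is assumed to be supplied by the caller.&lt;/p&gt;

```python
def plan_prefetch(selection, relations, prefix="", under_many=False):
    """Split relation paths in a nested selection into JOIN vs. batched lookups.

    Forward single-valued relations ("fk") can be joined; anything at or below
    a multi-valued relation ("many") has to be fetched in separate queries.
    """
    select, prefetch = [], []
    for name, children in selection.items():
        kind = relations.get(name)
        if kind is None:
            continue  # plain column, no relation to follow
        path = prefix + name
        many = under_many or kind == "many"
        (prefetch if many else select).append(path)
        sub_select, sub_prefetch = plan_prefetch(
            children or {}, relations, path + "__", many
        )
        select += sub_select
        prefetch += sub_prefetch
    return select, prefetch


# Nested dict standing in for a parsed GraphQL selection set.
selection = {
    "title": None,
    "author": {"name": None, "profile": {"bio": None}},
    "comments": {"body": None, "author": {"name": None}},
}
relations = {"author": "fk", "profile": "fk", "comments": "many"}

select, prefetch = plan_prefetch(selection, relations)
```

&lt;p&gt;The two lists are shaped so they could be passed to &lt;code&gt;select_related()&lt;/code&gt; and &lt;code&gt;prefetch_related()&lt;/code&gt; respectively, using Django's double-underscore path syntax.&lt;/p&gt;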

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 3, Architecture 2, Guidance 3 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Auggie won&lt;/strong&gt;: The problem requires synthesizing knowledge from two domains (GraphQL + Django ORM). Auggie's semantic search found patterns from both domains. XCE only has Django's architecture indexed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: N/A — not tested)
&lt;/h4&gt;




&lt;h3&gt;
  
  
  Feature 8: Automatic Query Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design an automatic query optimization system for Django that analyzes QuerySets and rewrites inefficient patterns."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 11/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_architecture_context&lt;/code&gt; on &lt;code&gt;django/db/models/sql/query.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/models/sql/query.py&lt;/code&gt; (Query class), &lt;code&gt;django/db/models/sql/compiler.py&lt;/code&gt; (SQLCompiler), &lt;code&gt;django/db/models/query.py&lt;/code&gt; (QuerySet)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Query.build_filter() → WhereNode
  Query.join() → Join objects
  Query.set_limits() → LIMIT/OFFSET
  Query.promote_joins() → LEFT JOIN promotion
  Query.resolve_expression() → expression compilation
  SQLCompiler.as_sql() → final SQL generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key insight&lt;/strong&gt;: Optimization rules could be applied between &lt;code&gt;Query&lt;/code&gt; construction and &lt;code&gt;SQLCompiler.as_sql()&lt;/code&gt;. The &lt;code&gt;Query&lt;/code&gt; object is mutable and can be rewritten before compilation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization opportunities identified&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Subquery → JOIN conversion (when Query has subqueries that could be JOINs)&lt;/li&gt;
&lt;li&gt;Automatic &lt;code&gt;select_related&lt;/code&gt; insertion (when filter references related fields)&lt;/li&gt;
&lt;li&gt;Index hint generation (when Query filters on non-indexed fields)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 3, Architecture 3, Guidance 2 = &lt;strong&gt;11/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 10/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good explanation of common Django ORM anti-patterns&lt;/li&gt;
&lt;li&gt;Suggested EXPLAIN-based analysis&lt;/li&gt;
&lt;li&gt;Described rule-based optimization (similar to database query planners)&lt;/li&gt;
&lt;li&gt;Less specific about where in Django's code to hook in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 3, Architecture 2, Guidance 3 = &lt;strong&gt;10/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: 6/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;Query&lt;/code&gt; class in &lt;code&gt;django/db/models/sql/query.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Listed methods but no analysis of optimization points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 3, Problem ID 1, Architecture 1, Guidance 1 = &lt;strong&gt;6/12&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Feature 9: Distributed Lock Manager
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design a distributed lock manager for Django that coordinates operations across multiple application instances."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: 7/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files identified&lt;/strong&gt;: &lt;code&gt;django/db/backends/base/base.py&lt;/code&gt; (BaseDatabaseWrapper), &lt;code&gt;django/core/cache/backends/&lt;/code&gt; (cache backends)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it found&lt;/strong&gt;: Database connection management, schema operations, cache backend interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it missed&lt;/strong&gt;: Django doesn't have a lock manager, so XCE's architecture graph has limited relevant nodes. It found database advisory locks (PostgreSQL &lt;code&gt;pg_advisory_lock&lt;/code&gt;) in the backends but couldn't provide a complete distributed locking design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial insight&lt;/strong&gt;: Cache backends could serve as lock storage (Redis SETNX pattern)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 2, Architecture 2, Guidance 1 = &lt;strong&gt;7/12&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 8/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explained distributed locking patterns (Redis SETNX, ZooKeeper, database advisory locks)&lt;/li&gt;
&lt;li&gt;Described Django's cache framework as a natural fit for lock storage&lt;/li&gt;
&lt;li&gt;Suggested implementation using &lt;code&gt;cache.add()&lt;/code&gt; (atomic set-if-not-exists)&lt;/li&gt;
&lt;li&gt;Provided timeout and renewal patterns&lt;/li&gt;
&lt;li&gt;Referenced Django's &lt;code&gt;select_for_update()&lt;/code&gt; for database-level locking&lt;/li&gt;
&lt;/ul&gt;
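&lt;p&gt;The &lt;code&gt;cache.add()&lt;/code&gt; suggestion can be illustrated with a small stand-in. The sketch assumes only that &lt;code&gt;add()&lt;/code&gt; stores a value when the key is absent and reports whether it did so. &lt;code&gt;InMemoryCache&lt;/code&gt; and &lt;code&gt;CacheLock&lt;/code&gt; are hypothetical names, and the in-memory cache is of course not distributed; a shared backend such as Redis would be needed for real cross-instance coordination.&lt;/p&gt;

```python
import time
import uuid


class InMemoryCache:
    """Single-process stand-in exposing add/get/delete with expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry on the monotonic clock)

    def add(self, key, value, timeout):
        # Store only if the key is absent or expired; report whether we stored.
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + timeout)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            return None
        return entry[0]

    def delete(self, key):
        self._data.pop(key, None)


class CacheLock:
    def __init__(self, cache, name, timeout=30):
        self.cache = cache
        self.key = "lock:" + name
        self.timeout = timeout         # lock expires on its own if the holder dies
        self.token = uuid.uuid4().hex  # identifies this particular holder

    def acquire(self):
        return self.cache.add(self.key, self.token, self.timeout)

    def release(self):
        # Check-then-delete is not atomic here; a real backend would need an
        # atomic compare-and-delete to avoid releasing someone else's lock.
        if self.cache.get(self.key) == self.token:
            self.cache.delete(self.key)


cache = InMemoryCache()
first, second = CacheLock(cache, "reindex"), CacheLock(cache, "reindex")
results = [first.acquire(), second.acquire()]  # second is refused while first holds
first.release()
results.append(second.acquire())               # succeeds once first has released
```

&lt;p&gt;The per-holder token is what keeps a slow worker from deleting a lock that already expired and was re-acquired by another instance.&lt;/p&gt;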

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 2, Architecture 2, Guidance 2 = &lt;strong&gt;8/12&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Auggie won&lt;/strong&gt;: Distributed locking is a pattern that exists outside Django. Auggie's semantic search found relevant patterns from Redis/distributed systems knowledge. XCE only has Django's existing architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: N/A — not tested)
&lt;/h4&gt;




&lt;h3&gt;
  
  
  Feature 10: Event Sourcing Backend
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Design an event sourcing backend for Django that stores all model changes as immutable events with full replay capability."&lt;/p&gt;

&lt;h4&gt;
  
  
  XCE Trace (Score: N/A — off-target results)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Tool used&lt;/strong&gt;: &lt;code&gt;xce_get_context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XCE returned results focused on &lt;code&gt;django/contrib/auth/&lt;/code&gt; (authentication) instead of model save/delete signals&lt;/li&gt;
&lt;li&gt;The query "event sourcing" didn't map well to XCE's indexed architecture because event sourcing doesn't exist in Django&lt;/li&gt;
&lt;li&gt;XCE found relevant code only after I refined the query to "Model.save() delete() signals"; the initial query failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why XCE failed here&lt;/strong&gt;: XCE's PRAT indexes &lt;em&gt;existing&lt;/em&gt; architecture. "Event sourcing" is a concept that doesn't exist in Django's codebase, so there are no PRAT nodes for it. The semantic gap between "event sourcing" and Django's actual &lt;code&gt;Model.save()&lt;/code&gt; → &lt;code&gt;post_save&lt;/code&gt; signal flow was too large for graph traversal to bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Not scored (off-target results)&lt;/p&gt;

&lt;h4&gt;
  
  
  Auggie Trace (Score: 9/12)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Response summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found &lt;code&gt;Model.save()&lt;/code&gt; and &lt;code&gt;Model.delete()&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;Identified &lt;code&gt;post_save&lt;/code&gt; and &lt;code&gt;post_delete&lt;/code&gt; signals as hook points&lt;/li&gt;
&lt;li&gt;Explained event sourcing pattern: store events instead of state&lt;/li&gt;
&lt;li&gt;Suggested implementation:

&lt;ol&gt;
&lt;li&gt;Event model storing (model_class, pk, event_type, data, timestamp)&lt;/li&gt;
&lt;li&gt;Signal receivers capturing all save/delete operations&lt;/li&gt;
&lt;li&gt;Replay mechanism iterating events to reconstruct state&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Discussed migration implications (event replay as alternative to schema migrations)&lt;/li&gt;
&lt;/ul&gt;
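&lt;p&gt;The event record and replay steps from the suggested implementation can be sketched in plain Python. This is an illustration under stated assumptions, not Auggie's actual output: the &lt;code&gt;Event&lt;/code&gt; dataclass stands in for a Django model with the listed columns, the event type names &lt;code&gt;"saved"&lt;/code&gt; and &lt;code&gt;"deleted"&lt;/code&gt; are invented, and the log is a list rather than a database table.&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Immutable record of one change, mirroring the columns listed above."""

    model: str
    pk: int
    event_type: str  # "saved" or "deleted" (hypothetical names)
    data: dict
    timestamp: float = field(default_factory=time.time)


def replay(events):
    """Rebuild {(model, pk): data} by applying events in order."""
    state = {}
    for event in events:
        key = (event.model, event.pk)
        if event.event_type == "saved":
            # Later saves overwrite earlier values field by field.
            state[key] = {**state.get(key, {}), **event.data}
        elif event.event_type == "deleted":
            state.pop(key, None)
    return state


log = [
    Event("Article", 1, "saved", {"title": "Draft"}),
    Event("Article", 2, "saved", {"title": "Other"}),
    Event("Article", 1, "saved", {"title": "Final"}),
    Event("Article", 2, "deleted", {}),
]
state = replay(log)
```

&lt;p&gt;In a Django version, receivers for &lt;code&gt;post_save&lt;/code&gt; and &lt;code&gt;post_delete&lt;/code&gt; would append to the log, and replaying a prefix of it would reconstruct state at an earlier point in time.&lt;/p&gt;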

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Code Location 2, Problem ID 3, Architecture 2, Guidance 2 = &lt;strong&gt;9/12&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Auggie won decisively&lt;/strong&gt;: Event sourcing is a well-known pattern in software architecture. Auggie's semantic search found conceptually similar patterns and synthesized them into a Django-specific design. XCE couldn't bridge the gap between "event sourcing" (a concept) and Django's actual code (which doesn't implement it).&lt;/p&gt;

&lt;h4&gt;
  
  
  Serena Trace (Score: N/A — not tested)
&lt;/h4&gt;




&lt;h2&gt;
  
  
  Appendix B: SWE-bench Verified Issues Tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;XCE&lt;/th&gt;
&lt;th&gt;Auggie&lt;/th&gt;
&lt;th&gt;Serena&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16379&lt;/td&gt;
&lt;td&gt;FileBasedCache race conditions&lt;/td&gt;
&lt;td&gt;Cache&lt;/td&gt;
&lt;td&gt;12/12&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;8/12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16527&lt;/td&gt;
&lt;td&gt;AdminSite catch_all_view APPEND_SLASH&lt;/td&gt;
&lt;td&gt;Admin&lt;/td&gt;
&lt;td&gt;11/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16595&lt;/td&gt;
&lt;td&gt;Migration optimizer AlterField&lt;/td&gt;
&lt;td&gt;Migrations&lt;/td&gt;
&lt;td&gt;11/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16816&lt;/td&gt;
&lt;td&gt;makemigrations --check exit code&lt;/td&gt;
&lt;td&gt;Migrations&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;11/12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16910&lt;/td&gt;
&lt;td&gt;QuerySet.only after select_related&lt;/td&gt;
&lt;td&gt;ORM&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-17051&lt;/td&gt;
&lt;td&gt;bulk_create update_conflicts&lt;/td&gt;
&lt;td&gt;ORM&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16255&lt;/td&gt;
&lt;td&gt;Signer uses SHA-256&lt;/td&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-17087&lt;/td&gt;
&lt;td&gt;Class decorators method_decorator&lt;/td&gt;
&lt;td&gt;Decorators&lt;/td&gt;
&lt;td&gt;10/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-16400&lt;/td&gt;
&lt;td&gt;migrate --run-syncdb custom user&lt;/td&gt;
&lt;td&gt;Migrations&lt;/td&gt;
&lt;td&gt;11/12&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django__django-10097&lt;/td&gt;
&lt;td&gt;URLValidator username/password&lt;/td&gt;
&lt;td&gt;Validators&lt;/td&gt;
&lt;td&gt;12/12&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;8/12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: "-" indicates that the engine was not tested on that issue. Full results for all 35+ issues are available in &lt;code&gt;swe-bench-results/RESULTS.md&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix C: Scoring Rubric Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Location (0-3 points)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Wrong file or completely off-target&lt;/td&gt;
&lt;td&gt;Returns admin code when asked about ORM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Right module/directory but wrong file&lt;/td&gt;
&lt;td&gt;Found &lt;code&gt;django/db/&lt;/code&gt; but wrong file within it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Right file and general area&lt;/td&gt;
&lt;td&gt;Found &lt;code&gt;query.py&lt;/code&gt; and QuerySet class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Exact file, class, method, and line number&lt;/td&gt;
&lt;td&gt;&lt;code&gt;django/db/models/query.py:324 QuerySet.filter()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Problem Identification (0-3 points)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No understanding of the problem&lt;/td&gt;
&lt;td&gt;Returns unrelated code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Identified the general area&lt;/td&gt;
&lt;td&gt;"It's somewhere in the ORM"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Identified root cause partially&lt;/td&gt;
&lt;td&gt;"The issue is in filter() but unclear exactly where"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Precise root cause with mechanism&lt;/td&gt;
&lt;td&gt;"filter() calls add_q() which doesn't handle deferred fields correctly because..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Architectural Understanding (0-3 points)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No module context&lt;/td&gt;
&lt;td&gt;Just a code snippet with no context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Single-module context&lt;/td&gt;
&lt;td&gt;"This is in the ORM module"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Multi-module awareness&lt;/td&gt;
&lt;td&gt;"This affects ORM and admin"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Full cross-module dependency map with call graph&lt;/td&gt;
&lt;td&gt;"QuerySet → SQL Compiler → Admin list_display → Auth permissions"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Implementation Guidance (0-3 points)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No actionable guidance&lt;/td&gt;
&lt;td&gt;Just shows code location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Vague direction&lt;/td&gt;
&lt;td&gt;"You'd need to modify the ORM"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Specific approach but incomplete&lt;/td&gt;
&lt;td&gt;"Override _chain() to add pipeline step"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Complete implementation path&lt;/td&gt;
&lt;td&gt;"1. Add pipe() to QuerySet 2. Hook into _chain() 3. Modify SQLCompiler to handle pipeline nodes 4. Update admin to support pipeline display"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
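&lt;p&gt;Each total reported in the traces is the sum of the four criteria above, so the arithmetic can be checked mechanically. The helper below is a small sketch; &lt;code&gt;total_score()&lt;/code&gt; and the dictionary keys are names chosen here for illustration, not part of the benchmark scripts.&lt;/p&gt;

```python
CRITERIA = ("code_location", "problem_id", "architecture", "guidance")


def total_score(scores):
    """Sum four 0-3 criterion scores into a total out of 12."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError("missing criteria: " + ", ".join(missing))
    for name in CRITERIA:
        if scores[name] not in (0, 1, 2, 3):
            raise ValueError(name + " must be an integer from 0 to 3")
    return sum(scores[name] for name in CRITERIA)


# XCE on Feature 4: Code Location 3, Problem ID 3, Architecture 3, Guidance 2.
xce_feature_4 = {
    "code_location": 3,
    "problem_id": 3,
    "architecture": 3,
    "guidance": 2,
}
total = total_score(xce_feature_4)
```

&lt;p&gt;For the Feature 4 XCE trace this gives 11, matching the reported 11/12.&lt;/p&gt;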

&lt;h2&gt;
  
  
  Results: SWE-bench Verified Bug Fixes (35+ Issues)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📊 See full data&lt;/strong&gt;: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks/blob/main/results/full_test_results.json" rel="noopener noreferrer"&gt;results/full_test_results.json&lt;/a&gt; · &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks/blob/main/SWE_BENCH_RESULTS.md" rel="noopener noreferrer"&gt;SWE_BENCH_RESULTS.md&lt;/a&gt; on GitHub&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Complex Architectural Features (10 Features)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;See full data&lt;/strong&gt;: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks/blob/main/complex_feature_test_results.md" rel="noopener noreferrer"&gt;complex_feature_test_results.md&lt;/a&gt; · &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks/blob/main/10_complex_features.md" rel="noopener noreferrer"&gt;10_complex_features.md&lt;/a&gt; on GitHub&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Testing conducted May 2026. Full conversation logs and raw engine outputs available upon request.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Context Engineering Is the Compass Your Coding Agent Needs</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sun, 10 May 2026 07:32:48 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/context-engineering-is-the-compass-your-coding-agent-needs-6kc</link>
      <guid>https://dev.to/kyoma_1234/context-engineering-is-the-compass-your-coding-agent-needs-6kc</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Coding agents are powerful ships, but they're sailing without a map. They can write code, run tests, and iterate, but they don't know where they are in the codebase. Context engineering is the discipline of giving agents the architectural awareness they need to navigate effectively. Without it, even the best models waste tokens exploring dead ends. With it, a cheap model can outperform an expensive one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Navigation Problem
&lt;/h2&gt;

&lt;p&gt;Picture a ship in open water. It has a powerful engine, a skilled crew, and enough fuel to reach any destination. But it has no compass, no charts, and no GPS. What happens?&lt;/p&gt;

&lt;p&gt;It explores. It tries directions. It backtracks when it hits land where it expected open water. Eventually, through trial and error, it might reach its destination — but it burns 3x the fuel and takes 5x the time.&lt;/p&gt;

&lt;p&gt;This is exactly what happens when you point a coding agent at a large codebase without architectural context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqkjj9ppjc7o8x5jd15c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqkjj9ppjc7o8x5jd15c.png" alt="Navigation without vs. with a compass" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent has all the capabilities it needs. It can read files, write code, run tests, search for patterns. But it doesn't know the architecture. It doesn't know that &lt;code&gt;django/db/models/sql/compiler.py&lt;/code&gt; is the heart of query generation, or that changing &lt;code&gt;BaseCache.set()&lt;/code&gt; affects every cache backend downstream. It discovers these things through exploration — expensive, token-heavy, error-prone exploration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Context Engineering?
&lt;/h2&gt;

&lt;p&gt;Context engineering is the practice of providing AI agents with structured, relevant information about the system they're working in — before they start exploring on their own.&lt;/p&gt;

&lt;p&gt;It's not prompt engineering (crafting better instructions). It's not RAG (retrieving text snippets by similarity). It's building a structured representation of the codebase that captures architecture, relationships, and design intent — then serving it to agents at the right moment.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;agents don't need more intelligence. They need better maps.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without context engineering:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent: "I need to fix the cache race condition"
→ Searches for "cache" → finds 47 files
→ Reads django/core/cache/__init__.py → not helpful
→ Reads django/core/cache/backends/filebased.py → finds the class
→ Reads django/core/cache/backends/base.py → understands inheritance
→ Searches for "thread" → finds 23 files
→ Reads django/utils/autoreload.py → wrong file
→ Reads django/core/files/locks.py → relevant but doesn't know why yet
→ Eventually pieces together the architecture after 12 file reads
Total: ~4,000 tokens, 45 seconds, 2 wrong attempts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With context engineering:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent: "I need to fix the cache race condition"
→ Queries XCE: "FileBasedCache race condition threading"
→ Gets back: inheritance chain, threading concerns, related utilities, test infrastructure
→ Goes directly to the right files with full architectural understanding
Total: ~1,500 tokens, 15 seconds, correct on first attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same agent. Same model. Same capabilities. The only difference is the map.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Levels of Context
&lt;/h2&gt;

&lt;p&gt;Not all context is created equal. There's a hierarchy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: Code Context (What exists)
&lt;/h3&gt;

&lt;p&gt;This is what most tools provide today — file contents, function signatures, grep results. It answers "what code is here?" but not "why?" or "how does it connect?"&lt;/p&gt;

&lt;p&gt;Tools at this level: file search, grep, symbol lookup, embeddings-based RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Finding a function doesn't tell you what calls it, what it depends on, or what breaks if you change it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: Structural Context (How things connect)
&lt;/h3&gt;

&lt;p&gt;This captures relationships — call graphs, inheritance chains, import dependencies, module boundaries. It answers "what depends on what?" and "what's the execution flow?"&lt;/p&gt;

&lt;p&gt;Tools at this level: static analysis, dependency graphs, call chain extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Knowing the call graph doesn't tell you the design intent or architectural role of each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3: Architectural Context (Why things exist)
&lt;/h3&gt;

&lt;p&gt;This captures design intent — why a module exists, what role it plays in the system, what design patterns it implements, what constraints it must satisfy. It answers "what is this component's job?" and "what are the rules?"&lt;/p&gt;

&lt;p&gt;Tools at this level: XCE's PRAT-powered structured index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the level that changes agent behavior.&lt;/strong&gt; When an agent knows that &lt;code&gt;CsrfViewMiddleware&lt;/code&gt; must run before &lt;code&gt;CacheMiddleware&lt;/code&gt; (and why), it doesn't accidentally break that constraint. When it knows that &lt;code&gt;BaseCache&lt;/code&gt; defines a contract that all backends must satisfy, it doesn't write a fix that violates that contract.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewnjokvfr3vqon9vcu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewnjokvfr3vqon9vcu1.png" alt="The Three Levels of Context" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Embeddings Alone Aren't Enough
&lt;/h2&gt;

&lt;p&gt;The most common approach to giving agents codebase context is embedding-based retrieval: embed all code chunks, embed the query, return the most similar chunks. This works for simple lookups but fails for architectural questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: "How does Django's ORM compile a QuerySet into SQL?"&lt;/p&gt;

&lt;p&gt;Embedding search returns: chunks from &lt;code&gt;query.py&lt;/code&gt;, &lt;code&gt;compiler.py&lt;/code&gt;, maybe &lt;code&gt;expressions.py&lt;/code&gt; — based on text similarity. But it doesn't tell you the execution order, the inheritance chain, or which method calls which.&lt;/p&gt;

&lt;p&gt;The agent gets fragments. It doesn't get the story.&lt;/p&gt;

&lt;p&gt;Structured context engineering provides the story:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;QuerySet.filter()&lt;/code&gt; clones the queryset along with its underlying &lt;code&gt;Query&lt;/code&gt; object&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Query&lt;/code&gt; accumulates conditions via &lt;code&gt;add_q()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;When evaluated, &lt;code&gt;SQLCompiler.as_sql()&lt;/code&gt; walks the &lt;code&gt;Query&lt;/code&gt; tree&lt;/li&gt;
&lt;li&gt;Each node (&lt;code&gt;WhereNode&lt;/code&gt;, &lt;code&gt;Col&lt;/code&gt;, &lt;code&gt;Ref&lt;/code&gt;) has an &lt;code&gt;as_sql()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;The compiler assembles these into a final SQL string&lt;/li&gt;
&lt;li&gt;Backend-specific compilers override for dialect differences&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the difference between handing someone a box of puzzle pieces versus showing them the completed picture.&lt;/p&gt;
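&lt;p&gt;The six steps above can be sketched as a toy compiler. This is a simplified illustration, not Django's implementation: the class names mirror Django's, but every body here is a hypothetical stand-in written only to show how nested &lt;code&gt;as_sql()&lt;/code&gt; calls compose into one statement.&lt;/p&gt;

```python
# Toy model of the compile flow described above. These classes are
# hypothetical stand-ins, not Django's real Query/SQLCompiler code.


class Col:
    """A column reference that knows how to render itself."""

    def __init__(self, table, name):
        self.table, self.name = table, name

    def as_sql(self):
        return f'"{self.table}"."{self.name}"', []


class Exact:
    """A single `column = value` condition."""

    def __init__(self, col, value):
        self.col, self.value = col, value

    def as_sql(self):
        sql, params = self.col.as_sql()
        return f"{sql} = %s", params + [self.value]


class WhereNode:
    """AND-joined conditions accumulated by the query."""

    def __init__(self):
        self.children = []

    def as_sql(self):
        parts, params = [], []
        for child in self.children:
            sql, child_params = child.as_sql()
            parts.append(sql)
            params.extend(child_params)
        return " AND ".join(parts), params


class Query:
    """Holds the table name and the accumulated conditions."""

    def __init__(self, table):
        self.table, self.where = table, WhereNode()

    def add_q(self, column, value):
        self.where.children.append(Exact(Col(self.table, column), value))


class SQLCompiler:
    """Walks the query and assembles the final SQL string and parameters."""

    def __init__(self, query):
        self.query = query

    def as_sql(self):
        sql = f'SELECT * FROM "{self.query.table}"'
        where_sql, params = self.query.where.as_sql()
        if where_sql:
            sql += f" WHERE {where_sql}"
        return sql, params


query = Query("auth_user")
query.add_q("is_active", True)
query.add_q("username", "ada")
print(SQLCompiler(query).as_sql())
```

&lt;p&gt;Each node only renders itself, and the compiler stitches the pieces together. A backend-specific compiler would subclass &lt;code&gt;SQLCompiler&lt;/code&gt; and override the parts that differ by dialect.&lt;/p&gt;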




&lt;h2&gt;
  
  
  The Compass Metaphor
&lt;/h2&gt;

&lt;p&gt;A compass doesn't tell you the answer. It tells you which direction to look.&lt;/p&gt;

&lt;p&gt;Context engineering works the same way. XCE doesn't write the fix for you. It tells your agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which files are relevant (and which aren't)&lt;/li&gt;
&lt;li&gt;How those files relate to each other&lt;/li&gt;
&lt;li&gt;What constraints must be preserved&lt;/li&gt;
&lt;li&gt;What patterns to follow&lt;/li&gt;
&lt;li&gt;What will break if you get it wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent still does the work. But it does the right work, in the right place, on the first try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9g3c4nepjplj0uq2x3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9g3c4nepjplj0uq2x3g.png" alt="The Four Directions of Context" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why a $0.02/call model with good context (MiniMax M2.5 + XCE at 78.2% on SWE-bench Verified) outperforms a $0.30/call model without it (Claude Opus 4.5 at 76.8%). The expensive model is a faster ship, but it's still sailing without a compass. The cheap model with XCE has the map.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Numbers
&lt;/h2&gt;

&lt;p&gt;We tested this on SWE-bench Verified — 500 real bugs from real open-source repositories. The results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Resolve Rate&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5 + XCE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5 (no context)&lt;/td&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0 + XCE&lt;/td&gt;
&lt;td&gt;73.4%&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0 (no context)&lt;/td&gt;
&lt;td&gt;66.0%&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The improvement in resolve rate scales with codebase complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple codebases&lt;/strong&gt; (flat architecture, few dependencies): +8 percentage points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium codebases&lt;/strong&gt; (some layering, moderate dependencies): +12 percentage points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex codebases&lt;/strong&gt; (deep inheritance, cross-cutting concerns): +17 percentage points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more complex the architecture, the more valuable the compass becomes. A flat codebase is like sailing in a small lake — you can see the shore from anywhere. A complex codebase is like the open ocean — without navigation, you're lost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Engineering vs. Other Approaches
&lt;/h2&gt;

&lt;p&gt;How does context engineering compare to other ways of helping agents?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it provides&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Better prompts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clearer instructions&lt;/td&gt;
&lt;td&gt;Doesn't help with codebase navigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Longer context windows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More code visible at once&lt;/td&gt;
&lt;td&gt;Agent still doesn't know what's relevant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar code chunks&lt;/td&gt;
&lt;td&gt;No structural relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Directory structure&lt;/td&gt;
&lt;td&gt;No semantic understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Design intent (if it exists)&lt;/td&gt;
&lt;td&gt;Usually outdated, incomplete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context engineering (XCE)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Architecture + structure + semantics&lt;/td&gt;
&lt;td&gt;Requires indexing (one-time cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key differentiator: context engineering provides &lt;strong&gt;relational&lt;/strong&gt; information. Not just "here's a file" but "here's how this file connects to 5 other files, what calls it, what it calls, and what role it plays in the system."&lt;/p&gt;
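&lt;p&gt;To make "relational" concrete, here is a minimal sketch of an impact query over a dependency graph. It is not how XCE works internally. The function name &lt;code&gt;impacted_by&lt;/code&gt; and the module edges are assumptions made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict, deque


def impacted_by(dependencies, changed):
    """Return every module that directly or transitively depends on changed."""
    # Invert "A depends on B" into "B is depended on by A".
    dependents = defaultdict(set)
    for module, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(module)

    # Breadth-first walk up the reversed edges.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(changed)  # a cycle should not report the module itself
    return sorted(seen)


# Hypothetical, hand-written edges for illustration only.
deps = {
    "cache.backends.filebased": ["cache.backends.base", "files.locks"],
    "cache.backends.locmem": ["cache.backends.base"],
    "middleware.cache": ["cache.backends.locmem"],
}
print(impacted_by(deps, "cache.backends.base"))
```

&lt;p&gt;A similarity search over file contents cannot answer this question, because the answer lives in the edges between files rather than in the text of any one file.&lt;/p&gt;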




&lt;h2&gt;
  
  
  Building Your Own Compass
&lt;/h2&gt;

&lt;p&gt;If you want to apply context engineering to your codebase, here's the approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Use XCE (fastest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; xanther-cli
xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indexes your repo and serves structured context via MCP. Works with any MCP-compatible agent (Claude Code, Kiro, Cursor, OpenCode, Windsurf, Cline).&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Build lightweight context yourself
&lt;/h3&gt;

&lt;p&gt;If you want a DIY approach, start with these principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Map module boundaries&lt;/strong&gt;: Document which directories/packages form logical modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture key relationships&lt;/strong&gt;: Which modules depend on which? What are the integration points?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document constraints&lt;/strong&gt;: What rules must be preserved? (e.g., "middleware ordering matters")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide it via MCP&lt;/strong&gt;: Build a simple MCP server that serves this context to your agent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even a hand-written architecture document served via MCP is better than nothing. The agent goes from "I have no idea how this codebase is organized" to "I know the major modules and their relationships."&lt;/p&gt;
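&lt;p&gt;As a starting point for principle 2, the import relationships of a Python project can be extracted with the standard library alone. The sketch below is one possible approach under simple assumptions: the helper names &lt;code&gt;module_name&lt;/code&gt; and &lt;code&gt;build_import_map&lt;/code&gt; are hypothetical, relative imports are skipped, and dynamic imports are not detected.&lt;/p&gt;

```python
import ast
from collections import defaultdict
from pathlib import Path


def module_name(path, root):
    """Turn pkg/sub/mod.py into the dotted name pkg.sub.mod."""
    parts = list(path.relative_to(root).with_suffix("").parts)
    if parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


def build_import_map(root):
    """Map each module under root to the sorted list of modules it imports."""
    root = Path(root)
    graph = defaultdict(set)
    for path in sorted(root.rglob("*.py")):
        name = module_name(path, root)
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                graph[name].update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # Absolute "from x import y" only; relative imports are skipped.
                graph[name].add(node.module)
    return {mod: sorted(deps) for mod, deps in graph.items()}


# Example: build_import_map("path/to/your/project") returns a dict such as
# {"pkg.views": ["pkg.models", "os"], ...} that can be saved as JSON.
```

&lt;p&gt;The resulting dictionary is small enough to store as JSON next to your hand-written notes on module boundaries and constraints, and to regenerate whenever the code changes.&lt;/p&gt;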

&lt;h3&gt;
  
  
  Option 3: Steering files
&lt;/h3&gt;

&lt;p&gt;For smaller codebases, agent steering files (like &lt;code&gt;.kiro/steering/&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt;) can provide basic architectural context. These are static documents that get included in every agent interaction.&lt;/p&gt;

&lt;p&gt;Limitation: they don't scale. A 500-line steering file for a 300K-line codebase can only capture the highest-level architecture. XCE provides context at every level of detail, dynamically, based on what the agent is working on.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Future of Agent-Assisted Development
&lt;/h2&gt;

&lt;p&gt;We're at an inflection point. Models are getting better every quarter. Context windows are growing. But the fundamental problem remains: &lt;strong&gt;agents don't understand architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 1M-token context window doesn't help if the agent doesn't know which 5,000 tokens are relevant to the current task. More compute doesn't help if the agent is exploring the wrong part of the codebase.&lt;/p&gt;

&lt;p&gt;Context engineering is the missing layer. It sits between the codebase and the agent, providing the architectural awareness that transforms exploration into navigation.&lt;/p&gt;

&lt;p&gt;The ships are getting faster. But speed without direction is just expensive wandering. Context engineering is the compass.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Xanther Context Engine is in open beta. Free tier: 3 repos, 100 queries/month.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard: &lt;a href="https://app.xanther.ai" rel="noopener noreferrer"&gt;app.xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmarks: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CLI: &lt;a href="https://www.npmjs.com/package/xanther-cli" rel="noopener noreferrer"&gt;npmjs.com/package/xanther-cli&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discord: &lt;a href="https://discord.gg/YaBekKpR" rel="noopener noreferrer"&gt;discord.gg/YaBekKpR&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Twitter: &lt;a href="https://x.com/xantherai" rel="noopener noreferrer"&gt;x.com/xantherai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All benchmark results from SWE-bench Verified (500 instances) using mini-swe-agent. Full data: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Why AI Coding Agents Waste 30% of Their Tokens — And How to Fix It</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sat, 09 May 2026 06:27:34 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/why-ai-coding-agents-waste-30-of-their-tokens-and-how-to-fix-it-42c1</link>
      <guid>https://dev.to/kyoma_1234/why-ai-coding-agents-waste-30-of-their-tokens-and-how-to-fix-it-42c1</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Cost of Blind Agents
&lt;/h2&gt;

&lt;p&gt;Every AI coding agent has the same workflow: receive a task, search the codebase, read files, write code. The problem is step 2. The agent doesn't know the codebase. It doesn't know the architecture. So it searches.&lt;/p&gt;

&lt;p&gt;And searches. And searches.&lt;/p&gt;

&lt;p&gt;We analyzed token usage across 500 SWE-bench Verified instances and found that agents spend approximately 30-40% of their tokens on &lt;strong&gt;exploration&lt;/strong&gt; — reading files that turn out to be irrelevant, following import chains that lead nowhere, and backtracking from wrong approaches.&lt;/p&gt;

&lt;p&gt;This isn't a model problem. GPT-5, Claude Opus, Gemini — they all do it. The issue is structural: the agent lacks a map of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmqes20ouypyekygk8ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmqes20ouypyekygk8ow.png" alt="Token Breakdown — Agent Without Architectural Context" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Real Example: Django Bug #16379
&lt;/h2&gt;

&lt;p&gt;Let's trace through a real bug to see this in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug:&lt;/strong&gt; &lt;code&gt;FileBasedCache&lt;/code&gt; crashes with &lt;code&gt;FileNotFoundError&lt;/code&gt; when multiple processes access the cache simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What a human developer does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue — understands it's a race condition in the file cache backend&lt;/li&gt;
&lt;li&gt;Knows (from experience) that Django's cache backends inherit from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt; directly&lt;/li&gt;
&lt;li&gt;Checks the &lt;code&gt;delete()&lt;/code&gt; and &lt;code&gt;_cull()&lt;/code&gt; methods for file operations without proper locking&lt;/li&gt;
&lt;li&gt;Writes a fix: wrap the &lt;code&gt;os.remove()&lt;/code&gt; call in a try/except for &lt;code&gt;FileNotFoundError&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Done. ~5 minutes, ~3 files read.&lt;/li&gt;
&lt;/ol&gt;
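&lt;p&gt;The pattern behind step 5 looks roughly like the sketch below. It shows the general technique of handling the error instead of checking for the file first. It is not Django's actual patch, and the helper name &lt;code&gt;delete_cache_file&lt;/code&gt; and the file suffix are assumptions for illustration.&lt;/p&gt;

```python
import os
import tempfile


def delete_cache_file(path):
    """Remove path, treating an already-missing file as a harmless no-op.

    Checking os.path.exists() first would leave a window in which another
    process can remove the file, so the error is handled instead.
    """
    try:
        os.remove(path)
    except FileNotFoundError:
        return False  # another process got there first
    return True


fd, path = tempfile.mkstemp(suffix=".djcache")
os.close(fd)
print(delete_cache_file(path))  # True: this call removed the file
print(delete_cache_file(path))  # False: already gone, but no crash
```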

&lt;p&gt;&lt;strong&gt;What an AI agent does (without context):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue&lt;/li&gt;
&lt;li&gt;Searches for "FileNotFoundError" — finds 47 matches across the codebase&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/files/storage.py&lt;/code&gt; — wrong file&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/files/base.py&lt;/code&gt; — wrong file&lt;/li&gt;
&lt;li&gt;Searches for "FileBasedCache" — finds it&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt; — right file&lt;/li&gt;
&lt;li&gt;Reads the whole file but doesn't understand the inheritance from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Writes a fix that handles the error but doesn't respect the cache contract&lt;/li&gt;
&lt;li&gt;Test fails&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/backends/base.py&lt;/code&gt; to understand the base class&lt;/li&gt;
&lt;li&gt;Opens &lt;code&gt;django/core/cache/__init__.py&lt;/code&gt; to understand the cache framework&lt;/li&gt;
&lt;li&gt;Rewrites the fix&lt;/li&gt;
&lt;li&gt;Test passes. ~20 minutes, ~12 files read, ~4,000 tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What an AI agent does (with XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;xce_get_context("FileBasedCache FileNotFoundError concurrent access")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gets back: the cache backend hierarchy, the file operations in &lt;code&gt;filebased.py&lt;/code&gt;, the locking patterns, and the test infrastructure&lt;/li&gt;
&lt;li&gt;Understands the architecture immediately&lt;/li&gt;
&lt;li&gt;Writes the correct fix on the first attempt&lt;/li&gt;
&lt;li&gt;Test passes. ~3 minutes, ~3 files read, ~1,500 tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zj26mebdtupqck4f55g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zj26mebdtupqck4f55g.png" alt="File Access Pattern — Django Bug #16379" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The token savings compound across hundreds of tasks. On our 500-instance benchmark run, XCE reduced total token usage by approximately 20%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Embeddings Aren't Enough
&lt;/h2&gt;

&lt;p&gt;The obvious solution is "just use code search." Tools like Greptile, Sourcegraph Cody, and GitHub Copilot all offer some form of code search. Most use embedding-based retrieval: convert code to vectors, find the most similar vectors to the query.&lt;/p&gt;

&lt;p&gt;This works for simple lookups. "Find the login function" → returns the login function. But it fails for architectural questions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Embedding Search&lt;/th&gt;
&lt;th&gt;Architectural Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Which module owns this logic?"&lt;/td&gt;
&lt;td&gt;Returns similar code snippets&lt;/td&gt;
&lt;td&gt;Returns the HLD module, its role, and its boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What depends on this function?"&lt;/td&gt;
&lt;td&gt;Returns functions with similar names&lt;/td&gt;
&lt;td&gt;Returns the call graph and downstream consumers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"If I change this, what breaks?"&lt;/td&gt;
&lt;td&gt;Returns similar code (not dependent code)&lt;/td&gt;
&lt;td&gt;Returns impact analysis with affected modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"How does this fit in the architecture?"&lt;/td&gt;
&lt;td&gt;Returns nearby code&lt;/td&gt;
&lt;td&gt;Returns HLD → LLD → code hierarchy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental issue: &lt;strong&gt;embeddings measure text similarity, not structural relationships.&lt;/strong&gt; Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain.&lt;/p&gt;
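&lt;p&gt;A small experiment illustrates the gap. In the sketch below, character-level similarity from &lt;code&gt;difflib&lt;/code&gt; is used as a crude stand-in for embedding similarity (real embeddings behave differently, so treat this as an analogy), and the call relationships are read from the syntax tree. All function names in the sample source are hypothetical.&lt;/p&gt;

```python
import ast
import difflib

# Hypothetical sample code: two look-alike functions and one real caller.
SOURCE = '''
def save_user(user):
    validate(user)
    return db_write("users", user)

def save_order(order):
    validate(order)
    return db_write("orders", order)

def checkout(cart):
    order = build_order(cart)
    return save_order(order)
'''

tree = ast.parse(SOURCE)
functions = {node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)}


def text_similarity(a, b):
    """Character-level similarity of two function sources, from 0.0 to 1.0."""
    src_a = ast.get_source_segment(SOURCE, functions[a])
    src_b = ast.get_source_segment(SOURCE, functions[b])
    return difflib.SequenceMatcher(None, src_a, src_b).ratio()


def calls(name):
    """Names of the functions called directly inside the given function."""
    return {
        node.func.id
        for node in ast.walk(functions[name])
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


# Similar text, no call relationship between the two.
print(round(text_similarity("save_user", "save_order"), 2), "save_order" in calls("save_user"))
# Less similar text, but a direct call from one to the other.
print(round(text_similarity("checkout", "save_order"), 2), "save_order" in calls("checkout"))
```

&lt;p&gt;The two look-alike functions score highest on text similarity even though neither calls the other, while the one function that would actually break if &lt;code&gt;save_order&lt;/code&gt; changed looks the least like it.&lt;/p&gt;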

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjjheel6lnsrdp6qf8xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjjheel6lnsrdp6qf8xb.png" alt="Embedding Similarity vs. Structural Relationships" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture Gap Across Repositories
&lt;/h2&gt;

&lt;p&gt;We measured the improvement from XCE across five major open-source repositories. The results reveal a clear pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Architecture Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;With XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sympy&lt;/td&gt;
&lt;td&gt;Deep module dependencies&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;+17%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;Complex inheritance chains&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;+13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;Multi-backend rendering pipeline&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;+13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django&lt;/td&gt;
&lt;td&gt;Layered MVC + ORM + middleware&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;+12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;Plugin system (relatively flat)&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;+8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
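&lt;p&gt;The Delta column is the absolute difference between the two resolve rates, in percentage points. The short sketch below recomputes it from the table and also shows the gain relative to each baseline, which is a different and larger number. The helper name &lt;code&gt;gains&lt;/code&gt; is an assumption for illustration.&lt;/p&gt;

```python
# Baseline and with-XCE resolve rates in percent, copied from the table above.
RESULTS = {
    "sympy": (45, 62),
    "scikit-learn": (58, 71),
    "matplotlib": (52, 65),
    "django": (62, 74),
    "pytest": (70, 78),
}


def gains(baseline, with_xce):
    """Return (absolute gain in points, relative gain in percent)."""
    points = with_xce - baseline
    return points, round(100 * points / baseline, 1)


for repo, (baseline, with_xce) in RESULTS.items():
    points, relative = gains(baseline, with_xce)
    print(f"{repo:13} +{points} points, +{relative}% relative to baseline")
```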

&lt;p&gt;&lt;strong&gt;sympy (+17%):&lt;/strong&gt; The largest improvement. Sympy has deep cross-module dependencies. A bug in &lt;code&gt;sympy/core/expr.py&lt;/code&gt; might require understanding &lt;code&gt;sympy/simplify/&lt;/code&gt;, &lt;code&gt;sympy/printing/&lt;/code&gt;, &lt;code&gt;sympy/polys/&lt;/code&gt;, and &lt;code&gt;sympy/series/&lt;/code&gt;. Without a map, the agent gets lost in the dependency maze. With XCE, it knows which modules are structurally related before it starts exploring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scikit-learn (+13%):&lt;/strong&gt; Complex estimator inheritance. &lt;code&gt;LogisticRegression&lt;/code&gt; inherits from &lt;code&gt;LinearClassifierMixin&lt;/code&gt; (itself a &lt;code&gt;ClassifierMixin&lt;/code&gt;) and &lt;code&gt;BaseEstimator&lt;/code&gt;. A bug that surfaces in &lt;code&gt;LogisticRegression.predict()&lt;/code&gt; might actually live in &lt;code&gt;LinearClassifierMixin.decision_function()&lt;/code&gt; or even &lt;code&gt;BaseEstimator.set_params()&lt;/code&gt;. The agent needs to understand the full inheritance chain to find the right place to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pytest (+8%):&lt;/strong&gt; The smallest improvement. Pytest has a plugin system that's complex, but most bugs are localized to a single file or module. The agent doesn't need as much architectural context because the architecture is relatively flat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudrmu23e276qdifj2lhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudrmu23e276qdifj2lhm.png" alt="Architectural Complexity vs. XCE Improvement" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern holds across all five repositories: &lt;strong&gt;the more architecturally complex the codebase, the more the agent benefits from having a structural map.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This has a practical implication: if your codebase is a simple CRUD app with flat architecture, XCE helps modestly. If your codebase is a complex system with deep module dependencies, layered abstractions, and cross-cutting concerns — XCE helps dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  How XCE Works
&lt;/h2&gt;

&lt;p&gt;XCE uses the proprietary PRAT algorithm to build a structured codebase index that captures architectural relationships — not just code text. Unlike embedding-based search, PRAT understands structural connections between components at multiple levels of abstraction.&lt;/p&gt;

&lt;p&gt;When an agent queries XCE, it gets back a structured response that includes: what module the code belongs to, what its role is in the system, what depends on it, and what it depends on. The agent doesn't just know where the code is — it knows why it exists and how it connects to the rest of the system.&lt;/p&gt;

&lt;p&gt;This is served via MCP, so any MCP-compatible agent can request architectural context through ordinary tool calls, with no changes to the agent itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Setup
&lt;/h2&gt;

&lt;p&gt;XCE runs as an MCP service. Setup takes one command to index the repo and one config block to connect your agent. First, index the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Index your repo (one command)&lt;/span&gt;
npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indexes the codebase and installs a git hook that auto-syncs after every commit. Then add to your agent's MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xanther-xce"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.xanther.ai/sse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer YOUR_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with Claude Code, Kiro, Cursor, OpenCode, Windsurf — any MCP-compatible tool.&lt;/p&gt;

&lt;p&gt;The agent gets five tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;xce_get_context&lt;/code&gt; — Full architectural context for a problem statement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_search&lt;/code&gt; — Semantic search across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_architecture_context&lt;/code&gt; — Architecture around a specific file or symbol&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_trace&lt;/code&gt; — Trace relationships from code to design artifacts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;xce_impact_analysis&lt;/code&gt; — What breaks if you change specific files&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI coding agents are getting better every quarter. But the bottleneck isn't model capability — it's context quality. A cheap model with the right context outperforms an expensive model without it.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;78.2%&lt;/strong&gt; on SWE-bench Verified with MiniMax M2.5 + XCE (beats every model on the official leaderboard)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% token reduction&lt;/strong&gt; per task (fewer wrong turns, less exploration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.22&lt;/strong&gt; per instance (16x cheaper than Claude Opus)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Context is cheaper than compute. And it compounds: better models + better context = better results than either alone.&lt;/p&gt;




&lt;p&gt;Xanther is in open beta. Free tier: 3 repos, 100 queries/month.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;Benchmarks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/YaBekKpR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/xanther-cli" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard</title>
      <dc:creator>Hoyin kyoma</dc:creator>
      <pubDate>Sat, 09 May 2026 05:53:53 +0000</pubDate>
      <link>https://dev.to/kyoma_1234/how-a-002call-model-scored-782-on-swe-bench-verified-beating-every-model-on-the-leaderboard-4bm3</link>
      <guid>https://dev.to/kyoma_1234/how-a-002call-model-scored-782-on-swe-bench-verified-beating-every-model-on-the-leaderboard-4bm3</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5, a model that costs $0.02 per call, scored 78.2%, surpassing every model on the official mini-SWE-agent leaderboard, including Claude 4.5 Opus (76.8%), which costs 37x more per call. The improvement comes entirely from better context, not a better model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full benchmark results and interactive dashboard: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;&lt;br&gt;
Try it free: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Official Leaderboard (as of February 2026)
&lt;/h2&gt;

&lt;p&gt;The SWE-bench Verified leaderboard uses mini-SWE-agent as a standardized harness to evaluate models on 500 human-verified bug instances from real open-source Python repositories. Here are the top results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Resolve Rate&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (high reasoning)&lt;/td&gt;
&lt;td&gt;76.80%&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Gemini 3 Flash (high reasoning)&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;$0.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;75.60%&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;GPT-5-2 Codex&lt;/td&gt;
&lt;td&gt;72.80%&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Claude 4.5 Sonnet (high reasoning)&lt;/td&gt;
&lt;td&gt;71.40%&lt;/td&gt;
&lt;td&gt;$0.66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Kimi K2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;70.80%&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2 (high reasoning)&lt;/td&gt;
&lt;td&gt;70.00%&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;swebench.com&lt;/a&gt;, February 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The top score is 76.80% from Claude 4.5 Opus with high reasoning enabled, at $0.75 per instance. The cheapest competitive model is MiniMax M2.5 at 75.80% for $0.07.&lt;/p&gt;

&lt;p&gt;Now here's what happens when you add Xanther Context Engine:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Without XCE&lt;/th&gt;
&lt;th&gt;With XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75.80%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.20%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.4pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0&lt;/td&gt;
&lt;td&gt;66.00%&lt;/td&gt;
&lt;td&gt;73.40%&lt;/td&gt;
&lt;td&gt;+7.4pp&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.0 (cascade hybrid)&lt;/td&gt;
&lt;td&gt;66.00%&lt;/td&gt;
&lt;td&gt;76.80%&lt;/td&gt;
&lt;td&gt;+10.8pp&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MiniMax M2.5 + XCE at 78.2% would be the #1 entry on the official leaderboard&lt;/strong&gt; — and it costs $0.22 per instance, not $0.75.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;See the full results breakdown: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt; | Raw data: &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkicyes8r8j52rjn5dss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkicyes8r8j52rjn5dss.png" width="720" height="244"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is SWE-bench Verified?
&lt;/h2&gt;

&lt;p&gt;SWE-bench Verified is the industry-standard benchmark for evaluating AI coding agents on real-world software engineering tasks. It consists of 500 instances, each representing a real bug from a real open-source Python repository. Each instance includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;problem statement&lt;/strong&gt; (the GitHub issue description)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;codebase snapshot&lt;/strong&gt; (the repository at the time the bug was reported)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;gold patch&lt;/strong&gt; (the actual fix that was merged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test cases&lt;/strong&gt; that verify the fix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent must read the problem statement, navigate the codebase, write a patch, and pass the test cases. No hints, no file locations, no guidance beyond the issue description.&lt;/p&gt;
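&lt;p&gt;For orientation, the sketch below models one instance using the field names from the public SWE-bench dataset (&lt;code&gt;problem_statement&lt;/code&gt;, &lt;code&gt;base_commit&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;test_patch&lt;/code&gt;, &lt;code&gt;FAIL_TO_PASS&lt;/code&gt;, &lt;code&gt;PASS_TO_PASS&lt;/code&gt;). The &lt;code&gt;is_resolved&lt;/code&gt; helper and all sample values are illustrative placeholders, not part of the official harness.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    repo: str
    base_commit: str          # codebase snapshot the agent starts from
    problem_statement: str    # GitHub issue text shown to the agent
    patch: str                # gold patch (hidden from the agent)
    test_patch: str           # tests that accompany the real fix
    FAIL_TO_PASS: list = field(default_factory=list)
    PASS_TO_PASS: list = field(default_factory=list)


def is_resolved(instance, passed_tests):
    """Illustrative check: all target tests pass and nothing regresses."""
    required = set(instance.FAIL_TO_PASS) | set(instance.PASS_TO_PASS)
    return required.issubset(passed_tests)


# Placeholder values; only the instance ID appears in this article.
example = Instance(
    instance_id="django__django-16379",
    repo="django/django",
    base_commit="(placeholder)",
    problem_statement="(issue text)",
    patch="(gold patch)",
    test_patch="(test patch)",
    FAIL_TO_PASS=["test_a"],
    PASS_TO_PASS=["test_b"],
)
print(is_resolved(example, {"test_a", "test_b"}))  # True
print(is_resolved(example, {"test_b"}))            # False
```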

&lt;p&gt;The repositories span a wide range of complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Lines of Code&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;django/django&lt;/td&gt;
&lt;td&gt;82K&lt;/td&gt;
&lt;td&gt;~4,000&lt;/td&gt;
&lt;td&gt;~300K&lt;/td&gt;
&lt;td&gt;Layered MVC, ORM, middleware, admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;61K&lt;/td&gt;
&lt;td&gt;~1,200&lt;/td&gt;
&lt;td&gt;~200K&lt;/td&gt;
&lt;td&gt;Estimator inheritance chains, pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sympy/sympy&lt;/td&gt;
&lt;td&gt;13K&lt;/td&gt;
&lt;td&gt;~1,500&lt;/td&gt;
&lt;td&gt;~400K&lt;/td&gt;
&lt;td&gt;Deep mathematical module dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;20K&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;td&gt;~150K&lt;/td&gt;
&lt;td&gt;Complex rendering pipeline, backends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;12K&lt;/td&gt;
&lt;td&gt;~400&lt;/td&gt;
&lt;td&gt;~50K&lt;/td&gt;
&lt;td&gt;Plugin system, fixture resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not a toy benchmark. These are production codebases with real architectural complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context Problem
&lt;/h2&gt;

&lt;p&gt;Watch what happens when a coding agent tries to fix a bug in Django without architectural context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug:&lt;/strong&gt; &lt;code&gt;django__django-16379&lt;/code&gt; — &lt;code&gt;FileBasedCache&lt;/code&gt; crashes with &lt;code&gt;FileNotFoundError&lt;/code&gt; on concurrent access&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent behavior (without XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches for "FileBasedCache" — finds the class in &lt;code&gt;django/core/cache/backends/filebased.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reads the file, sees the &lt;code&gt;delete()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Doesn't understand the cache backend hierarchy — misses that &lt;code&gt;FileBasedCache&lt;/code&gt; inherits from &lt;code&gt;BaseCache&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Doesn't know about the concurrent access patterns in Django's cache framework&lt;/li&gt;
&lt;li&gt;Writes a fix that handles the &lt;code&gt;FileNotFoundError&lt;/code&gt; but breaks the cache invalidation contract&lt;/li&gt;
&lt;li&gt;Test fails. Tries again.&lt;/li&gt;
&lt;li&gt;Explores &lt;code&gt;django/core/cache/__init__.py&lt;/code&gt;, &lt;code&gt;django/core/cache/backends/base.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Eventually finds the right approach after 15+ file reads and 4,000+ tokens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrw4pmypy3k4r1f1lmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrw4pmypy3k4r1f1lmq.png" width="720" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent behavior (with XCE):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calls &lt;code&gt;xce_get_context("FileBasedCache FileNotFoundError concurrent access")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gets back: the cache backend hierarchy (BaseCache → FileBasedCache), the locking mechanism, the file operations that can race, and the related test infrastructure&lt;/li&gt;
&lt;li&gt;Understands the architecture immediately&lt;/li&gt;
&lt;li&gt;Writes a fix that wraps the file operation in a try/except with proper fallback&lt;/li&gt;
&lt;li&gt;Test passes on first attempt. ~1,500 tokens.&lt;/li&gt;
&lt;/ol&gt;
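&lt;p&gt;The try/except approach in step 4 can be sketched as follows. This is a minimal standalone illustration of the pattern (attempt the file operation and treat a missing file as a cache miss), not Django's actual code; &lt;code&gt;has_entry&lt;/code&gt; and the file name are hypothetical.&lt;/p&gt;

```python
import os
import tempfile


def has_entry(path):
    """Return True if the cache file exists and can be opened.

    Checking os.path.exists() and then calling open() can race with
    another process deleting the file in between. Opening directly and
    catching FileNotFoundError closes that window.
    """
    try:
        with open(path, "rb") as f:
            f.read(1)
        return True
    except FileNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    entry = os.path.join(tmp, "entry.cache")
    with open(entry, "wb") as f:
        f.write(b"x")
    print(has_entry(entry))   # True
    os.remove(entry)
    print(has_entry(entry))   # False
```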

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32vd9nal5xu1tn1ja0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32vd9nal5xu1tn1ja0d.png" width="720" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The difference isn't that the model is smarter. It's that the model has a map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Repository Analysis
&lt;/h2&gt;

&lt;p&gt;XCE doesn't provide a uniform boost. The improvement correlates strongly with architectural complexity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Sonnet 4.0 Baseline&lt;/th&gt;
&lt;th&gt;Sonnet 4.0 + XCE&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sympy/sympy&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+17pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep module dependencies. A fix in &lt;code&gt;sympy/core/&lt;/code&gt; often requires understanding &lt;code&gt;sympy/simplify/&lt;/code&gt;, &lt;code&gt;sympy/printing/&lt;/code&gt;, and &lt;code&gt;sympy/polys/&lt;/code&gt;. Without context, the agent gets lost in the dependency maze.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;scikit-learn&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex estimator inheritance. &lt;code&gt;BaseEstimator&lt;/code&gt; → &lt;code&gt;ClassifierMixin&lt;/code&gt; → &lt;code&gt;LinearClassifierMixin&lt;/code&gt; → &lt;code&gt;LogisticRegression&lt;/code&gt;. Bugs often require understanding the full chain.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matplotlib&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rendering pipeline with multiple backends. A bug in &lt;code&gt;axes.py&lt;/code&gt; might require understanding &lt;code&gt;figure.py&lt;/code&gt;, &lt;code&gt;backend_agg.py&lt;/code&gt;, and the transform system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;django/django&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layered architecture (models → views → templates → middleware). Bugs cross layers frequently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8pp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Relatively flat architecture. The plugin system is complex but most bugs are localized. Less benefit from architectural context.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmihrvif9ue9abxiizx1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmihrvif9ue9abxiizx1w.png" width="720" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern is clear: &lt;strong&gt;the more architectural dependencies a codebase has, the more the agent benefits from having a structural map.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pytest, with its relatively flat architecture, sees the smallest improvement (+8pp). Sympy, where fixing a bug in one module often requires understanding five others, sees the largest (+17pp).&lt;/p&gt;
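&lt;p&gt;The deltas in the table are differences in resolve rate, measured in percentage points. A short check that recomputes them from the baseline and XCE columns and ranks the repositories by gain:&lt;/p&gt;

```python
# (baseline %, with-XCE %) for Sonnet 4.0, copied from the table above
results = {
    "sympy/sympy": (45, 62),
    "scikit-learn": (58, 71),
    "matplotlib": (52, 65),
    "django/django": (62, 74),
    "pytest": (70, 78),
}

# Percentage-point gain per repository
deltas = {repo: xce - base for repo, (base, xce) in results.items()}

for repo, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{repo}: +{delta}pp")
```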




&lt;h2&gt;
  
  
  The Cost Analysis
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting from a business perspective.&lt;/p&gt;

&lt;p&gt;The official leaderboard shows that reaching 76%+ on SWE-bench Verified requires expensive models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score Range&lt;/th&gt;
&lt;th&gt;Cheapest Model&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;76%+&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75% to 76%&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72% to 75%&lt;/td&gt;
&lt;td&gt;GPT-5-2 Codex&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70% to 72%&lt;/td&gt;
&lt;td&gt;Kimi K2.5 (high reasoning)&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With XCE, the cost equation changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cost/Instance&lt;/th&gt;
&lt;th&gt;Savings vs. Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;MiniMax M2.5 + XCE&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4x cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;73.4%&lt;/td&gt;
&lt;td&gt;Sonnet 4.0 + XCE&lt;/td&gt;
&lt;td&gt;$0.22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4x cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;Claude 4.5 Opus (no XCE)&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The $0.22 covers model inference plus the XCE queries (about $0.001 per query, with multiple queries per instance). The XCE overhead is negligible; the savings come from the model needing fewer tokens to solve each problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token reduction:&lt;/strong&gt; XCE reduces token usage by approximately 20% per task. The agent makes fewer wrong turns, reads fewer irrelevant files, and arrives at the solution faster. On a 500-instance benchmark run, this translates to significant cost savings.&lt;/p&gt;

&lt;p&gt;At scale, the math is compelling. A team running 1,000 coding agent tasks per month:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Annual Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus (no XCE)&lt;/td&gt;
&lt;td&gt;$750&lt;/td&gt;
&lt;td&gt;$9,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5 + XCE&lt;/td&gt;
&lt;td&gt;$220&lt;/td&gt;
&lt;td&gt;$2,640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Savings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$530/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6,360/yr&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the XCE setup gets better results.&lt;/p&gt;
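&lt;p&gt;The arithmetic behind these tables is simple enough to verify. The sketch below assumes each task costs the same as one benchmark instance, which is a simplification for real workloads:&lt;/p&gt;

```python
OPUS_COST = 0.75        # $ per instance, Claude 4.5 Opus (leaderboard table)
XCE_COST = 0.22         # $ per instance, MiniMax M2.5 + XCE
TASKS_PER_MONTH = 1000  # assumed workload from the example above

monthly_opus = OPUS_COST * TASKS_PER_MONTH
monthly_xce = XCE_COST * TASKS_PER_MONTH
monthly_savings = monthly_opus - monthly_xce

print(f"Cost ratio: {OPUS_COST / XCE_COST:.1f}x")
print(f"Monthly savings: ${monthly_savings:,.0f}")
print(f"Annual savings: ${monthly_savings * 12:,.0f}")
```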




&lt;h2&gt;
  
  
  How XCE Works (High Level)
&lt;/h2&gt;

&lt;p&gt;XCE indexes a codebase into a multi-level structured representation that captures both code and architecture. When an agent queries XCE, it gets back context at the right level of abstraction — not just a code snippet, but an understanding of where that code fits in the system, what depends on it, and what it depends on.&lt;/p&gt;

&lt;p&gt;The indexing uses the proprietary PRAT algorithm to build this structured index. The key difference from embedding-based search: PRAT captures structural relationships between components, not just text similarity. This means the agent can ask "what depends on this function?" and get a real answer — something embeddings alone cannot provide.&lt;/p&gt;
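&lt;p&gt;PRAT itself is proprietary and not described here, so the following is only a generic sketch of the kind of question a structural index can answer and a text-similarity index cannot: given directed "depends on" edges, find everything that directly or transitively depends on a symbol. The graph and symbol names are made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict, deque

# Hypothetical "A depends on B" edges; not taken from any real index.
depends_on = {
    "views.render_page": ["cache.get", "templates.load"],
    "cache.get": ["cache.has_key"],
    "admin.dashboard": ["views.render_page"],
    "templates.load": [],
    "cache.has_key": [],
}


def dependents_of(symbol, graph):
    """Return every symbol that directly or transitively depends on `symbol`."""
    # Invert the edges so we can walk from a symbol to its callers.
    reverse = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)

    # Breadth-first walk over the inverted graph.
    seen, queue = set(), deque([symbol])
    while queue:
        for parent in reverse[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(dependents_of("cache.has_key", depends_on)))
# ['admin.dashboard', 'cache.get', 'views.render_page']
```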

&lt;p&gt;The result is served via MCP, so any compatible agent gets architectural context on every tool call without any changes to the agent itself.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuigqmc7db16zdwp2jpp.png" width="720" height="372"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing These Results
&lt;/h2&gt;

&lt;p&gt;All results are published and reproducible. The benchmark repository includes predictions, resolved instance IDs, and trajectory download scripts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To reproduce:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install mini-swe-agent&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mini-swe-agent

&lt;span class="c"&gt;# 2. Get an XCE API key (free at app.xanther.ai)&lt;/span&gt;
&lt;span class="c"&gt;# 3. Index the target repo&lt;/span&gt;
npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; xce_your_key

&lt;span class="c"&gt;# 4. Run the benchmark&lt;/span&gt;
mini-swe-agent run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4-20250514 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset&lt;/span&gt; swe-bench-verified &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; &lt;span class="s1"&gt;'{"xanther": {"url": "https://mcp.xanther.ai/sse", "headers": {"Authorization": "Bearer xce_your_key"}}}'&lt;/span&gt;

&lt;span class="c"&gt;# 5. Evaluate&lt;/span&gt;
sb submit &lt;span class="nt"&gt;--predictions&lt;/span&gt; results/preds.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each run's &lt;code&gt;preds.jsonl&lt;/code&gt; contains one prediction per instance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instance_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"django__django-16379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_name_or_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sonnet-4.0-xce"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_patch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diff --git a/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"full_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
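&lt;p&gt;To inspect a predictions file, something like the following should work, assuming the usual JSON Lines layout of one JSON object per line (the record above is pretty-printed for readability). The in-memory sample stands in for the real file, and &lt;code&gt;load_predictions&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import io
import json


def load_predictions(lines):
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Stand-in for open("results/preds.jsonl"); the patch text is elided.
sample = io.StringIO(
    '{"instance_id": "django__django-16379", '
    '"model_name_or_path": "sonnet-4.0-xce", '
    '"model_patch": "diff --git a/...", "full_output": "..."}\n'
)

preds = load_predictions(sample)
by_instance = {p["instance_id"]: p for p in preds}
print(len(by_instance), "prediction(s) loaded")
```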



&lt;p&gt;Trajectory files (100-600MB per run) are available for download from S3 for detailed analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;Three takeaways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context is cheaper than compute.&lt;/strong&gt; You don't need the most expensive model to get the best results. You need the right context. In this benchmark, MiniMax M2.5 with architectural context (78.2% at $0.22 per instance) outperformed Claude 4.5 Opus without it (76.8% at $0.75 per instance).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The improvement scales with complexity.&lt;/strong&gt; Simple codebases with flat architectures see modest gains (+8pp). Complex codebases with deep dependencies see dramatic gains (+17pp). As codebases grow, the value of architectural context increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. This is model-agnostic.&lt;/strong&gt; XCE works with any MCP-compatible agent. The same context infrastructure that improves MiniMax M2.5 also improves Sonnet 4.0, and would improve any future model. Better models + better context = compounding gains.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learn more about how XCE works: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt; | See the benchmark methodology: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Xanther is in open beta. Free tier: 3 repos, 100 queries/month. No credit card.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx xanther-cli init &lt;span class="nt"&gt;--api-key&lt;/span&gt; YOUR_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://xanther.ai" rel="noopener noreferrer"&gt;xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmark Dashboard: &lt;a href="https://xanther.ai/benchmarks" rel="noopener noreferrer"&gt;xanther.ai/benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard: &lt;a href="https://app.xanther.ai" rel="noopener noreferrer"&gt;app.xanther.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Benchmarks (raw data): &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Discord: &lt;a href="https://discord.gg/Y768kBRS" rel="noopener noreferrer"&gt;discord.gg/Y768kBRS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/xanther-cli" rel="noopener noreferrer"&gt;npmjs.com/package/xanther-cli&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All benchmark results were evaluated using the official SWE-bench CLI (&lt;code&gt;sb submit&lt;/code&gt;) against SWE-bench Verified (500 instances). The agent harness is mini-swe-agent. Predictions and resolved instance IDs are published at &lt;a href="https://github.com/Xanther-Ai/xce-benchmarks" rel="noopener noreferrer"&gt;github.com/Xanther-Ai/xce-benchmarks&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>minimax</category>
    </item>
  </channel>
</rss>
