Armaan Jain

LocalityLens: Why Your AI Coding Agent Gets Lost in Your Codebase

You watch Claude Code analyze your repository. Files flash by. Symbols get resolved. It's working.

But how well is it working?

Here's a thought: we measure AI coding agents by the wrong metric.

We ask: "Did it complete the task?" But the better question is: "Did it stay focused while solving it?"

An agent that bounces between unrelated files, re-reads the same code, and loses semantic context is inefficient, even if it eventually gets the answer right. And we have no way to measure that.

Until now.


The Problem: Semantic Thrashing

Imagine reading a complex codebase. You open auth.py, then login.py, then back to auth.py, then validation.py, then auth.py again.

You can feel your brain context-switching. The threads you were following get tangled.

This happens to AI agents too. It matters because:

  • Token waste: re-reading the same file consumes context budget
  • Decision degradation: reasoning quality drops without stable context
  • Inefficiency: the agent can take 2-3x longer to reach a correct answer
  • Hallucinations: thrashing patterns tend to accompany errors and confusion

Existing tools measure what agents do. LocalityLens measures how well they navigate.


What is LocalityLens?

An AST-aware observability tool that analyzes agent traces and reveals how agents navigate your codebase.
(GitHub)

The workflow:

Agent Trace (file access log)
    ↓
Parse files, extract imports & symbols via AST
    ↓
Compute 12 specialized metrics
    ↓
Visualize patterns: locality, thrashing, drift, entropy

It runs offline on your trace + your codebase. No external APIs, no LLM calls.


The Core Metric: Locality Score

Question: How often does the agent stay within semantic context?

How it works:

  1. LocalityLens tracks a working set of related files for the agent (based on imports and calls)
  2. Each file transition is classified:
    • Local: the next file is in the working set
    • Non-local: a jump to unrelated code
  3. The score combines three signals: the fraction of local transitions, context overlap, and dependency continuity
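
To make step 2 concrete, here is a minimal sketch that computes only the first ingredient, the fraction of local transitions. The function name `local_fraction`, the `window` parameter, and the rule "working set = recently visited files plus their direct import neighbors" are assumptions made for this illustration, not LocalityLens's actual implementation. The real score also folds in context overlap and dependency continuity, so its numbers will differ.

```python
def local_fraction(trace, related, window=2):
    """Fraction of transitions whose target lies in the current working set.

    trace   -- ordered list of files the agent touched
    related -- dict: file -> set of files linked to it by an import
    window  -- how many recent files seed the working set (assumed value)
    """
    local = 0
    transitions = 0
    for i in range(1, len(trace)):
        recent = trace[max(0, i - window):i]
        # Assumed working set: recent files plus their direct import neighbors.
        working = set(recent)
        for name in recent:
            working |= related.get(name, set())
        transitions += 1
        if trace[i] in working:
            local += 1
    # A trace with no transitions is treated as perfectly local.
    return local / transitions if transitions else 1.0


# Toy input: a single import edge between routes.py and user.py.
related = {"routes.py": {"user.py"}, "user.py": {"routes.py"}}
trace = ["routes.py", "user.py", "utils.py", "login.py",
         "login.py", "routes.py", "unrelated_file.py"]
score = local_fraction(trace, related)  # 2 of the 6 transitions are local
```

Widening `window` or adding more edges to `related` grows the working set and raises the fraction, which is why the choice of working-set rule matters as much as the trace itself.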

Interpretation:

0.75+  = Excellent focus (agent stays on task)
0.55–0.75 = Good (occasional context switches)
0.35–0.55 = Poor (frequent task-switching)
<0.35  = Critical (agent is lost)
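
The bands above translate directly into a lookup. This is a small sketch, not the tool's code: `locality_band` is a hypothetical helper, and since the listed ranges share their edge values, treating each boundary as belonging to the higher band is an assumption.

```python
def locality_band(score: float) -> str:
    """Map a locality score to the interpretation bands listed above.

    Boundary values (0.75, 0.55, 0.35) are assigned to the higher band,
    which is an assumption; the ranges in the text overlap at the edges.
    """
    if score >= 0.75:
        return "Excellent"
    if score >= 0.55:
        return "Good"
    if score >= 0.35:
        return "Poor"
    return "Critical"


for value in (0.78, 0.62, 0.54, 0.31):
    print(f"{value:.2f} -> {locality_band(value)}")
```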

Other Metrics That Matter

Metric               What It Detects
Oscillation          Thrashing A→B→A→B loops (stuck patterns)
Semantic Drift       Jumping between distant code regions
Context Entropy      Disorder in the context window (LLM simulation)
Churn Ratio          Files re-accessed (context loss)
Retrieval Pressure   Over-reliance on search/lookup tools
Cognitive Load       Symbol namespace complexity
Semantic Continuity  Does the agent respect the import graph?
7 more...            Anomalies, waste, concentration, etc.
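
The A→B→A→B pattern in the first row is easy to spot with a sliding window. The sketch below is one simple way to count it; `count_oscillations` is a hypothetical name, and the tool may well normalize or weight the result differently.

```python
def count_oscillations(trace):
    """Count positions where the last four accesses form A, B, A, B (A != B)."""
    count = 0
    for i in range(3, len(trace)):
        a, b, c, d = trace[i - 3:i + 1]
        if a == c and b == d and a != b:
            count += 1
    return count


looping = ["auth.py", "login.py", "auth.py", "login.py"]
wandering = ["auth.py", "login.py", "auth.py", "validation.py", "auth.py"]
print(count_oscillations(looping))    # one completed A→B→A→B window
print(count_oscillations(wandering))  # returns to auth.py, but no strict loop
```

Note that the second trace keeps coming back to auth.py without ever forming a strict two-file loop, so a re-access metric such as churn would flag it while this check would not.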

Concrete Example

Your codebase:

api/
  └─ routes.py
models/
  └─ user.py (imports api.routes)
auth/
  ├─ login.py
  └─ utils.py

Agent trace:

routes.py → user.py → utils.py → login.py → login.py 
→ routes.py → unrelated_file.py

What LocalityLens sees:

⚠️  Locality Score: 0.62 (GOOD)
    └─ Decent focus, but faltering

🔄 Thrashing Detected
    └─ login.py and routes.py each accessed twice (context loss)

📊 Semantic Drift
    └─ Sudden jump to unrelated_file.py
    └─ Suggests: Agent lost understanding

💾 Churn Ratio: 29%
    └─ 2 of 7 accesses were re-reads

Actionable insight: The agent forgot earlier analysis midway through. Try prompting it with the import graph upfront.
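
The 29% churn figure can be checked by hand if churn is taken to mean repeat accesses divided by total accesses. That definition is an assumption that happens to match the example; `churn_ratio` is a hypothetical helper, not the library's function.

```python
def churn_ratio(trace):
    """Share of accesses that hit a file the agent has already visited."""
    seen = set()
    repeats = 0
    for name in trace:
        if name in seen:
            repeats += 1
        seen.add(name)
    return repeats / len(trace) if trace else 0.0


trace = ["routes.py", "user.py", "utils.py", "login.py",
         "login.py", "routes.py", "unrelated_file.py"]
# The second visits to login.py and routes.py are the two repeats out of 7.
print(f"{churn_ratio(trace):.0%}")
```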


Why This Matters

1. Debug Agent Failures

When an agent produces wrong code, check the trace:

  • High thrashing? Agent lost context.
  • High drift? Agent jumped between unrelated files.
  • Usually, you see context loss before the error.

2. Improve Prompts

Experiment with what helps locality:

  • "Here's the import graph" → Higher locality ✅
  • "Start at the entry point" → Less oscillation ✅
  • Measure before and after.

3. Compare Agent Behaviors

Different prompts = different profiles:

  • Exploratory agents have high drift (intentional)
  • Focused agents have high locality (efficient)
  • Lost agents have high thrashing (bad)

Real Results

I ran LocalityLens on three agents solving the same task:

Agent    Locality  Thrashing  Time
Agent A  0.78      0.02       45s ✅
Agent B  0.54      0.18       72s ⚠️
Agent C  0.31      0.41       120s ❌

Observation: Across these three runs, higher locality tracked with both speed and correctness.


How It Works (Technically)

Stack:

  • Tree-sitter — Multi-language AST parsing
  • NetworkX — Import graph + BFS distance
  • Dataclasses + Protocol — Type-safe, extensible analyzers
  • Pure Python — Offline, no external APIs
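
To give a feel for the first two items, here is a stdlib-only sketch of the same idea: pull imports out of a file, then measure how many import hops separate two files. It deliberately swaps in Python's built-in `ast` module for Tree-sitter and a hand-written breadth-first search for NetworkX, so it only handles Python sources. Both function names are hypothetical.

```python
import ast
from collections import deque
from typing import Optional


def imported_modules(source: str) -> set:
    """Return the module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Bare relative imports ("from . import x") have no module name.
            modules.add(node.module)
    return modules


def hop_distance(graph: dict, start: str, goal: str) -> Optional[int]:
    """Breadth-first search over an import graph; None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


print(imported_modules("import os\nfrom api.routes import router\n"))
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(hop_distance(graph, "a", "c"))
```

A large hop distance between consecutive files in a trace is one plausible signal for the drift metric described earlier.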

Why this approach:

  • No vendor lock-in (runs locally)
  • Fast (semantic map built once, queried by 12 analyzers)
  • Extensible (new metrics as simple plugins)
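
The "metrics as plugins" idea maps naturally onto dataclasses and `typing.Protocol`. The sketch below shows the general shape only: `MetricResult`, `Analyzer`, `UniqueFileRatio`, and `run_all` are names invented for this example and are not taken from the LocalityLens source.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical result record: one named number per analyzer."""
    name: str
    value: float


class Analyzer(Protocol):
    """Anything with a name and an analyze() method counts as an analyzer."""
    name: str

    def analyze(self, trace: Sequence[str]) -> MetricResult: ...


class UniqueFileRatio:
    """Example plugin: distinct files divided by total accesses."""
    name = "unique_file_ratio"

    def analyze(self, trace: Sequence[str]) -> MetricResult:
        value = len(set(trace)) / len(trace) if trace else 0.0
        return MetricResult(self.name, value)


def run_all(analyzers: Sequence[Analyzer], trace: Sequence[str]) -> list:
    """Run every analyzer over the same trace and collect the results."""
    return [analyzer.analyze(trace) for analyzer in analyzers]


results = run_all([UniqueFileRatio()], ["a.py", "b.py", "a.py", "c.py"])
print(results)
```

Because `Protocol` relies on structural typing, a new metric does not need to inherit from anything; it only has to expose the same attribute and method.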

Using LocalityLens

Step 1: Collect a trace from your agent

[
  {"kind": "FILE_READ", "target": "auth.py", "timestamp": "..."},
  {"kind": "FILE_READ", "target": "login.py", "timestamp": "..."},
  ...
]
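
If you want to sanity-check a trace before handing it over, a few lines of standard-library Python will do. This sketch assumes only the fields shown above (`kind`, `target`, `timestamp`); `file_sequence` is a hypothetical helper, and any event kinds other than `FILE_READ` are simply skipped because the format of other events is not described here.

```python
import json


def file_sequence(trace_json: str) -> list:
    """Return the ordered list of files read, ignoring other event kinds."""
    events = json.loads(trace_json)
    return [event["target"] for event in events
            if event.get("kind") == "FILE_READ"]


sample = """[
  {"kind": "FILE_READ", "target": "auth.py", "timestamp": "..."},
  {"kind": "FILE_READ", "target": "login.py", "timestamp": "..."}
]"""
print(file_sequence(sample))
```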

Step 2: Point LocalityLens at your trace + codebase

from localitylens.pipeline import run_pipeline

report = run_pipeline(
    trace_path="agent_trace.json",
    repo_path="."  # Your codebase
)

Step 3: Get results

for metric in report.metrics:
    print(f"{metric.name}: {metric.value:.2f} ({metric.severity})")

Output: Metrics, dashboard, anomalies, transition graph.


What's Next

LocalityLens started as research into how AI agents navigate code. It's now a tool for:

  • AI researchers — Understanding agent behavior
  • Prompt engineers — Tuning agent focus
  • Teams building agents — Debugging and optimization

LocalityLens is currently in development. Follow the project for updates.


Key Takeaway

As AI agents become more capable, we need better observability. Metrics like "task completion" are a baseline. Metrics like semantic locality reveal how well agents think.

LocalityLens gives you the lens to see that.


How are you debugging your agents today? Drop a comment—I'd love to hear.
