Armaan Jain

Posted on Jun 16

LocalityLens: Why Your AI Coding Agent Gets Lost in Your Codebase

#ai #opensource #agents #software

You watch Claude Code analyze your repository. Files flash by. Symbols get resolved. It's working.

But how well is it working?

Here's a thought: we measure AI coding agents on the wrong metric.

We ask: "Did it complete the task?" But the better question is: "Did it stay focused while solving it?"

An agent that bounces between unrelated files three times, re-reads the same code, and loses semantic context is inefficient—even if it eventually gets the answer right. And we have no way to measure that.

Until now.

The Problem: Semantic Thrashing

Imagine reading a complex codebase. You open auth.py, then login.py, then back to auth.py, then validation.py, then auth.py again.

You can feel your brain context-switching. The threads you were following get tangled.

This happens to AI agents too. It matters because:

Token waste — Re-reading the same file consumes context budget
Decision degradation — Reasoning quality drops without stable context
Inefficiency — The agent takes 2-3x longer to reach correct answers
Hallucinations — Thrashing patterns correlate with errors and confusion

Existing tools measure what agents do. LocalityLens measures how well they navigate.

What is LocalityLens?

An AST-aware observability tool that analyzes agent traces and reveals how agents navigate your codebase.
(GitHub)

The workflow:

Agent Trace (file access log)
    ↓
Parse files, extract imports & symbols via AST
    ↓
Compute 12 specialized metrics
    ↓
Visualize patterns: locality, thrashing, drift, entropy

It runs offline on your trace + your codebase. No external APIs, no LLM calls.
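To make the "extract imports via AST" step concrete, here is a minimal sketch using Python's standard `ast` module. It is only a stand-in: LocalityLens is described as using Tree-sitter for multi-language parsing, and `extract_imports` is a hypothetical helper, not part of its API.

```python
import ast


def extract_imports(source: str) -> set[str]:
    """Return the module names imported by a piece of Python source.

    Hypothetical helper for illustration; relative imports with no
    module name (e.g. `from . import x`) are skipped.
    """
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


sample = "import os\nfrom api.routes import router\n"
print(sorted(extract_imports(sample)))  # ['api.routes', 'os']
```

Running a helper like this over every file gives the edges of an import graph, which is the kind of structure the later metrics need.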

The Core Metric: Locality Score

Question: How often does the agent stay within semantic context?

How it works:

The agent maintains a working set of related files (based on imports and calls)
Each file transition is classified:
- Local — Next file is in the working set
- Non-local — Jump to unrelated code
Score combines the fraction of local transitions, context overlap, and dependency continuity
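The first ingredient, the fraction of local transitions, can be sketched in a few lines. This is an illustration under assumptions, not the tool's implementation: the working-set rule (every visited file plus its direct neighbours) and the name `local_fraction` are mine, and the context-overlap and dependency-continuity terms are left out because the post does not define them.

```python
def local_fraction(trace: list[str], related: dict[str, set[str]]) -> float:
    """Fraction of file transitions that land inside the working set.

    Assumption: the working set grows to include every visited file
    and the files directly related to it.
    """
    if len(trace) < 2:
        return 1.0  # no transitions, so nothing left the working set
    working = {trace[0]} | related.get(trace[0], set())
    local = 0
    for nxt in trace[1:]:
        if nxt in working:
            local += 1
        working |= {nxt} | related.get(nxt, set())
    return local / (len(trace) - 1)


# Hypothetical relations and trace, for illustration only.
related = {"auth.py": {"login.py"}, "login.py": {"auth.py"}}
trace = ["auth.py", "login.py", "auth.py", "billing.py"]
print(round(local_fraction(trace, related), 2))  # 0.67
```

Two of the three transitions stay inside the working set; the jump to `billing.py` is the non-local one.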

Interpretation:

0.75+  = Excellent focus (agent stays on task)
0.55–0.75 = Good (occasional context switches)
0.35–0.55 = Poor (frequent task-switching)
<0.35  = Critical (agent is lost)
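The bands above map to a simple threshold check. In this sketch, `locality_band` is a hypothetical name, and sending a score of exactly 0.75 or 0.55 to the higher band is my assumption, since the listed ranges share their endpoints.

```python
def locality_band(score: float) -> str:
    """Map a locality score to the interpretation bands listed above."""
    if score >= 0.75:
        return "Excellent"
    if score >= 0.55:
        return "Good"
    if score >= 0.35:
        return "Poor"
    return "Critical"


for score in (0.78, 0.62, 0.54, 0.31):
    print(score, locality_band(score))
```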

Other Metrics That Matter

Metric	What It Detects
Oscillation Thrashing	A→B→A→B loops (stuck patterns)
Semantic Drift	Jumping between distant code regions
Context Entropy	Disorder in a simulated LLM context window
Churn Ratio	Files re-accessed (context loss)
Retrieval Pressure	Over-reliance on search/lookup tools
Cognitive Load	Symbol namespace complexity
Semantic Continuity	Does the agent respect the import graph?
7 more...	Anomalies, waste, concentration, etc.
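As an illustration of the first row, an A→B→A→B loop can be found with a sliding window over the trace. This is a sketch of the idea, not LocalityLens's detector; `count_oscillations` and the window length of four are assumptions.

```python
def count_oscillations(trace: list[str]) -> int:
    """Count windows of four consecutive accesses shaped A, B, A, B (A != B)."""
    count = 0
    for i in range(len(trace) - 3):
        a, b, c, d = trace[i:i + 4]
        if a == c and b == d and a != b:
            count += 1
    return count


# Hypothetical trace for illustration.
trace = ["auth.py", "login.py", "auth.py", "login.py", "validation.py"]
print(count_oscillations(trace))  # 1
```

A consecutive repeat such as `login.py → login.py` does not count here, because the window requires two distinct files.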

Concrete Example

Your codebase:

api/
  └─ routes.py
models/
  └─ user.py (imports api.routes)
auth/
  ├─ login.py
  └─ utils.py

Agent trace:

routes.py → user.py → utils.py → login.py → login.py 
→ routes.py → unrelated_file.py

What LocalityLens sees:

⚠️  Locality Score: 0.62 (GOOD)
    └─ Decent focus, but faltering

🔄 Thrashing Detected
    └─ login.py and routes.py each read twice (context loss)

📊 Semantic Drift
    └─ Sudden jump to unrelated_file.py
    └─ Suggests: Agent lost understanding

💾 Churn Ratio: 29%
    └─ 2 of 7 accesses were repeat reads

Actionable insight: The agent appears to have lost its earlier analysis midway through. Try prompting it with the import graph upfront.
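The 29% churn figure can be reproduced by hand if churn is taken to mean repeat reads divided by total accesses. That definition is an assumption (the post does not spell it out), and `churn_ratio` is a hypothetical helper; the trace is the one from the example.

```python
def churn_ratio(trace: list[str]) -> float:
    """Share of accesses that re-read a file already seen earlier in the trace."""
    seen: set[str] = set()
    repeats = 0
    for path in trace:
        if path in seen:
            repeats += 1
        seen.add(path)
    return repeats / len(trace) if trace else 0.0


trace = [
    "routes.py", "user.py", "utils.py", "login.py",
    "login.py", "routes.py", "unrelated_file.py",
]
print(f"{churn_ratio(trace):.0%}")  # 29%
```

The two repeats are the second read of `login.py` and the second read of `routes.py`, giving 2 / 7 ≈ 0.29.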

Why This Matters

1. Debug Agent Failures

When an agent produces wrong code, check the trace:

High thrashing? Agent lost context.
High drift? Agent jumped between unrelated files.
Usually, you see context loss before the error.

2. Improve Prompts

Experiment with what helps locality:

"Here's the import graph" → Higher locality ✅
"Start at the entry point" → Less oscillation ✅
Measure before and after.

3. Compare Agent Behaviors

Different prompts = different profiles:

Exploratory agents have high drift (intentional)
Focused agents have high locality (efficient)
Lost agents have high thrashing (bad)

Real Results

I ran LocalityLens on three agents solving the same task:

Agent	Locality	Thrashing	Time
Agent A	0.78	0.02	45s ✅
Agent B	0.54	0.18	72s ⚠️
Agent C	0.31	0.41	120s ❌

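Observation: Across these three runs, higher locality went hand in hand with less thrashing and faster completion. Three runs is a small sample, so treat this as a pattern to test rather than a proven correlation.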

How It Works (Technically)

Stack:

Tree-sitter — Multi-language AST parsing
NetworkX — Import graph + BFS distance
Dataclasses + Protocol — Type-safe, extensible analyzers
Pure Python — Offline, no external APIs
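To show what "BFS distance" on an import graph means, here is a dependency-free sketch using `collections.deque`. The tool itself is described as using NetworkX, where `networkx.shortest_path_length` would play this role. The helper `bfs_distance` is hypothetical, and so are all graph edges below except `models/user.py` importing `api/routes.py`.

```python
from collections import deque


def bfs_distance(graph: dict[str, set[str]], start: str, goal: str):
    """Fewest import edges between two files, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Undirected adjacency; edges are illustrative assumptions.
graph = {
    "models/user.py": {"api/routes.py"},
    "api/routes.py": {"models/user.py", "auth/login.py"},
    "auth/login.py": {"api/routes.py", "auth/utils.py"},
    "auth/utils.py": {"auth/login.py"},
}
print(bfs_distance(graph, "models/user.py", "auth/utils.py"))  # 3
```

A small distance suggests a jump between closely related files, while a large or missing distance is the kind of signal a drift metric could build on.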

Why this approach:

No vendor lock-in (runs locally)
Fast (semantic map built once, queried by 12 analyzers)
Extensible (new metrics as simple plugins)
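One way "new metrics as simple plugins" could look with dataclasses and `Protocol` is sketched below. Every name here (`MetricResult`, `Analyzer`, `UniqueFileRatio`, `run_analyzers`) is hypothetical; only the `name`, `value`, and `severity` fields mirror attributes that appear in the usage snippet later in this post.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class MetricResult:
    name: str
    value: float
    severity: str


class Analyzer(Protocol):
    """Anything with a matching analyze() method counts as an analyzer."""

    def analyze(self, trace: Sequence[str]) -> MetricResult: ...


class UniqueFileRatio:
    """Example plugin: distinct files divided by total accesses."""

    def analyze(self, trace: Sequence[str]) -> MetricResult:
        value = len(set(trace)) / len(trace) if trace else 1.0
        return MetricResult("unique_file_ratio", value, "info")


def run_analyzers(trace: Sequence[str], analyzers: Sequence[Analyzer]) -> list[MetricResult]:
    """Run every analyzer over the same trace and collect the results."""
    return [analyzer.analyze(trace) for analyzer in analyzers]


results = run_analyzers(["a.py", "b.py", "a.py"], [UniqueFileRatio()])
print(results[0].name, round(results[0].value, 2))  # unique_file_ratio 0.67
```

Because `Protocol` uses structural typing, a new metric only needs an `analyze` method with the right shape; it does not have to inherit from a base class.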

Using LocalityLens

Step 1: Collect a trace from your agent

[
  {"kind": "FILE_READ", "target": "auth.py", "timestamp": "..."},
  {"kind": "FILE_READ", "target": "login.py", "timestamp": "..."},
  ...
]
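A trace in this shape is easy to reduce to the ordered list of files read. In the sketch below, the `TOOL_CALL` event kind and the timestamp values are placeholders I made up for illustration; the post only shows `FILE_READ` events and elides the timestamps.

```python
import json

# Placeholder events: "TOOL_CALL" and the timestamp strings are assumptions.
raw = """[
  {"kind": "FILE_READ", "target": "auth.py", "timestamp": "t0"},
  {"kind": "TOOL_CALL", "target": "grep", "timestamp": "t1"},
  {"kind": "FILE_READ", "target": "login.py", "timestamp": "t2"}
]"""


def file_reads(events: list[dict]) -> list[str]:
    """Keep only FILE_READ events and return their targets in order."""
    return [event["target"] for event in events if event.get("kind") == "FILE_READ"]


print(file_reads(json.loads(raw)))  # ['auth.py', 'login.py']
```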

Step 2: Point LocalityLens at your trace + codebase

from localitylens.pipeline import run_pipeline

report = run_pipeline(
    trace_path="agent_trace.json",
    repo_path="."  # Your codebase
)

Step 3: Get results

for metric in report.metrics:
    print(f"{metric.name}: {metric.value:.2f} ({metric.severity})")

Output: Metrics, dashboard, anomalies, transition graph.

What's Next

LocalityLens started as research into how AI agents navigate code. It's now a tool for:

AI researchers — Understanding agent behavior
Prompt engineers — Tuning agent focus
Teams building agents — Debugging and optimization

LocalityLens is still in development. Follow the project for updates.

Key Takeaway

As AI agents become more capable, we need better observability. Metrics like "task completion" are baseline. Metrics like semantic locality reveal how well agents think.

LocalityLens gives you the lens to see that.

How are you debugging your agents today? Drop a comment—I'd love to hear.
