You watch Claude Code analyze your repository. Files flash by. Symbols get resolved. It's working.
But how well is it working?
Here's a thought: we measure AI coding agents by the wrong metric.
We ask: "Did it complete the task?" But the better question is: "Did it stay focused while solving it?"
An agent that bounces between unrelated files three times, re-reads the same code, and loses semantic context is inefficient—even if it eventually gets the answer right. And we have no way to measure that.
Until now.
The Problem: Semantic Thrashing
Imagine reading a complex codebase. You open auth.py, then login.py, then back to auth.py, then validation.py, then auth.py again.
You can feel your brain context-switching. The threads you were following get tangled.
This happens to AI agents too. It matters because:
- Token waste — Re-reading the same file consumes context budget
- Decision degradation — Reasoning quality drops without stable context
- Inefficiency — The agent takes 2-3x longer to reach correct answers
- Hallucinations — Thrashing patterns correlate with errors and confusion
Existing tools measure what agents do. LocalityLens measures how well they navigate.
What is LocalityLens?
An AST-aware observability tool that analyzes agent traces and reveals how agents navigate your codebase.
(GitHub)
The workflow:
Agent Trace (file access log)
↓
Parse files, extract imports & symbols via AST
↓
Compute 12 specialized metrics
↓
Visualize patterns: locality, thrashing, drift, entropy
It runs offline on your trace + your codebase. No external APIs, no LLM calls.
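The "extract imports" step can be sketched for Python sources with the standard-library `ast` module. This is an illustration only: LocalityLens is described as using Tree-sitter for multi-language parsing, and the function name `extract_imports` is hypothetical.

```python
import ast


def extract_imports(source: str) -> set[str]:
    """Return the module names imported by a Python source string."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from . import x" has no module name and is skipped here.
            modules.add(node.module)
    return modules


sample = "import os\nfrom api.routes import router\n"
print(sorted(extract_imports(sample)))  # ['api.routes', 'os']
```

Mapping those module names back to files in the repository gives the edges of the import graph.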
The Core Metric: Locality Score
Question: How often does the agent stay within semantic context?
How it works:
- The agent maintains a working set of related files (based on imports and calls)
- Each file transition is classified:
- Local — Next file is in the working set
- Non-local — Jump to unrelated code
- Score: combines the fraction of local transitions with context overlap and dependency continuity, on a 0–1 scale
Interpretation:
0.75+ = Excellent focus (agent stays on task)
0.55–0.75 = Good (occasional context switches)
0.35–0.55 = Poor (frequent task-switching)
<0.35 = Critical (agent is lost)
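A minimal sketch of the first ingredient, the fraction of local transitions. Assumptions: the working set is approximated as the previous file plus its direct import neighbours, and the context-overlap and dependency-continuity terms are left out, so the result will not match LocalityLens's own score. The names `local_fraction` and `interpret` and the neighbour map are hypothetical.

```python
def local_fraction(trace, neighbours):
    """Fraction of transitions whose target is in the previous file's working set."""
    transitions = list(zip(trace, trace[1:]))
    if not transitions:
        return 1.0  # a single access has nothing to compare against
    local = sum(
        1
        for prev, nxt in transitions
        if nxt == prev or nxt in neighbours.get(prev, set())
    )
    return local / len(transitions)


def interpret(score):
    """Map a score to the bands listed above."""
    if score >= 0.75:
        return "Excellent"
    if score >= 0.55:
        return "Good"
    if score >= 0.35:
        return "Poor"
    return "Critical"


# Hypothetical import neighbourhood for a tiny repo.
neighbours = {
    "auth.py": {"login.py", "validation.py"},
    "login.py": {"auth.py"},
    "validation.py": {"auth.py"},
}
trace = ["auth.py", "login.py", "auth.py", "billing.py", "auth.py"]
score = local_fraction(trace, neighbours)
print(f"{score:.2f} {interpret(score)}")  # 0.50 Poor
```

Two of the four transitions stay inside the neighbourhood; the jump to `billing.py` and the jump back both count as non-local.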
Other Metrics That Matter
| Metric | What It Detects |
|---|---|
| Oscillation Thrashing | A→B→A→B loops (stuck patterns) |
| Semantic Drift | Jumping between distant code regions |
| Context Entropy | Disorder in the context window (LLM simulation) |
| Churn Ratio | Files re-accessed (context loss) |
| Retrieval Pressure | Over-reliance on search/lookup tools |
| Cognitive Load | Symbol namespace complexity |
| Semantic Continuity | Does the agent respect the import graph? |
| Others | Anomalies, waste, concentration, etc. |
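To make the Oscillation Thrashing row concrete: an A→B→A→B loop can be found by sliding a window of four accesses over the trace and checking whether it alternates between two files. This is a sketch of the idea, not LocalityLens's analyzer, and `count_oscillations` is a hypothetical name.

```python
def count_oscillations(trace):
    """Count windows of four accesses that form an A, B, A, B pattern."""
    count = 0
    for i in range(len(trace) - 3):
        a, b, c, d = trace[i:i + 4]
        if a == c and b == d and a != b:
            count += 1
    return count


trace = ["auth.py", "login.py", "auth.py", "login.py", "auth.py", "validation.py"]
print(count_oscillations(trace))  # 2
```

Overlapping windows are counted separately, so a long ping-pong sequence scores higher than a single loop.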
Concrete Example
Your codebase:
api/
└─ routes.py
models/
└─ user.py (imports api.routes)
auth/
├─ login.py
└─ utils.py
Agent trace:
routes.py → user.py → utils.py → login.py → login.py
→ routes.py → unrelated_file.py
What LocalityLens sees:
⚠️ Locality Score: 0.62 (GOOD)
└─ Decent focus, but faltering
🔄 Thrashing Detected
└─ login.py accessed twice (context loss)
📊 Semantic Drift
└─ Sudden jump to unrelated_file.py
└─ Suggests: Agent lost understanding
💾 Churn Ratio: 29%
└─ 2 of 7 accesses are re-reads (login.py, routes.py)
Actionable insight: The agent forgot earlier analysis midway through. Try prompting it with the import graph upfront.
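The 29% churn figure is consistent with counting re-reads over total accesses: the trace has seven accesses and five distinct files, so two accesses are re-reads. A sketch of that arithmetic, assuming this is how churn is defined (the formula is not given above) and using a hypothetical `churn_ratio` helper:

```python
def churn_ratio(trace):
    """Share of accesses that revisit a file already seen in the trace."""
    if not trace:
        return 0.0
    return (len(trace) - len(set(trace))) / len(trace)


trace = [
    "routes.py", "user.py", "utils.py", "login.py", "login.py",
    "routes.py", "unrelated_file.py",
]
print(f"{churn_ratio(trace):.0%}")  # 29% (2 re-reads out of 7 accesses)
```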
Why This Matters
1. Debug Agent Failures
When an agent produces wrong code, check the trace:
- High thrashing? Agent lost context.
- High drift? Agent jumped between unrelated files.
- Usually, you see context loss before the error.
2. Improve Prompts
Experiment with what helps locality:
- "Here's the import graph" → Higher locality ✅
- "Start at the entry point" → Less oscillation ✅
- Measure before and after.
3. Compare Agent Behaviors
Different prompts = different profiles:
- Exploratory agents have high drift (intentional)
- Focused agents have high locality (efficient)
- Lost agents have high thrashing (bad)
Real Results
I ran LocalityLens on three agents solving the same task:
| Agent | Locality | Thrashing | Time |
|---|---|---|---|
| Agent A | 0.78 | 0.02 | 45s ✅ |
| Agent B | 0.54 | 0.18 | 72s ⚠️ |
| Agent C | 0.31 | 0.41 | 120s ❌ |
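Observation: Across these three runs, higher locality went with less thrashing, faster completion, and better outcomes. Three runs is a small sample, so read this as a pattern to test, not a proven correlation.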
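To put a number on the relationship in the table, a Pearson coefficient can be computed by hand. With only three runs the value is descriptive, not statistically meaningful, and the `pearson` helper is a hypothetical name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the table above.
locality = [0.78, 0.54, 0.31]
thrashing = [0.02, 0.18, 0.41]
seconds = [45, 72, 120]

print(f"locality vs time:  {pearson(locality, seconds):.2f}")
print(f"thrashing vs time: {pearson(thrashing, seconds):.2f}")
```

A strongly negative first value and a strongly positive second value match the table's story, but any three roughly ordered points will produce coefficients near ±1.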
How It Works (Technically)
Stack:
- Tree-sitter — Multi-language AST parsing
- NetworkX — Import graph + BFS distance
- Dataclasses + Protocol — Type-safe, extensible analyzers
- Pure Python — Offline, no external APIs
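The BFS-distance idea can be shown without any dependencies. The sketch below treats imports as undirected edges and counts hops between two files. NetworkX is the library named in the stack, so the hand-written search here is a stand-in. Apart from `models/user.py` importing `api/routes.py`, the edges are invented for illustration.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from (importer, imported) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distance(graph, start, goal):
    """Hops between two files, or None if no path connects them."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


graph = undirected([
    ("models/user.py", "api/routes.py"),  # from the example repo
    ("api/routes.py", "auth/login.py"),   # hypothetical edge
    ("auth/login.py", "auth/utils.py"),   # hypothetical edge
])
print(bfs_distance(graph, "models/user.py", "auth/utils.py"))      # 3
print(bfs_distance(graph, "models/user.py", "unrelated_file.py"))  # None
```

A large or undefined distance between consecutive accesses is one plausible signal for semantic drift.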
Why this approach:
- No vendor lock-in (runs locally)
- Fast (semantic map built once, queried by 12 analyzers)
- Extensible (new metrics as simple plugins)
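What "new metrics as simple plugins" might look like with dataclasses and `typing.Protocol`: every class and method name below is hypothetical, and only the `name`, `value`, and `severity` fields mirror what the usage snippet in this post prints.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class MetricResult:
    name: str
    value: float
    severity: str


class Analyzer(Protocol):
    """Anything with a matching analyze() method counts as an analyzer."""

    def analyze(self, trace: list[str]) -> MetricResult: ...


class RevisitAnalyzer:
    """Example plugin: share of accesses that revisit an earlier file."""

    def analyze(self, trace: list[str]) -> MetricResult:
        value = (len(trace) - len(set(trace))) / len(trace) if trace else 0.0
        severity = "warning" if value > 0.25 else "ok"  # arbitrary threshold
        return MetricResult("revisit_share", value, severity)


def run_all(analyzers: list[Analyzer], trace: list[str]) -> list[MetricResult]:
    return [analyzer.analyze(trace) for analyzer in analyzers]


for result in run_all([RevisitAnalyzer()], ["a.py", "b.py", "a.py", "c.py"]):
    print(f"{result.name}: {result.value:.2f} ({result.severity})")
# revisit_share: 0.25 (ok)
```

Because `Protocol` uses structural typing, a new analyzer needs no base class; it only has to provide `analyze` with the right signature.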
Using LocalityLens
Step 1: Collect a trace from your agent
[
  {"kind": "FILE_READ", "target": "auth.py", "timestamp": "..."},
  {"kind": "FILE_READ", "target": "login.py", "timestamp": "..."},
  ...
]
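If your agent framework does not already emit such a log, a thin wrapper can record one. The sketch below assumes only the three fields shown above (`kind`, `target`, `timestamp`). The ISO 8601 timestamp format and the `TraceRecorder` class are assumptions, not part of LocalityLens.

```python
import json
from datetime import datetime, timezone


class TraceRecorder:
    """Collects file-read events in the shape of the trace shown above."""

    def __init__(self):
        self.events = []

    def file_read(self, path):
        self.events.append({
            "kind": "FILE_READ",
            "target": path,
            # Assumed format: UTC timestamp in ISO 8601.
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        return json.dumps(self.events, indent=2)


recorder = TraceRecorder()
recorder.file_read("auth.py")
recorder.file_read("login.py")
print(recorder.to_json())
```

Writing the returned string to `agent_trace.json` produces a file to pass as the trace in the next step.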
Step 2: Point LocalityLens at your trace + codebase
from localitylens.pipeline import run_pipeline

report = run_pipeline(
    trace_path="agent_trace.json",
    repo_path=".",  # Your codebase
)
Step 3: Get results
for metric in report.metrics:
    print(f"{metric.name}: {metric.value:.2f} ({metric.severity})")
Output: Metrics, dashboard, anomalies, transition graph.
What's Next
LocalityLens started as research into how AI agents navigate code. It's now a tool for:
- AI researchers — Understanding agent behavior
- Prompt engineers — Tuning agent focus
- Teams building agents — Debugging and optimization
LocalityLens is still in development. Follow the project for updates.
Key Takeaway
As AI agents become more capable, we need better observability. Metrics like "task completion" are baseline. Metrics like semantic locality reveal how well agents think.
LocalityLens gives you the lens to see that.
How are you debugging your agents today? Drop a comment—I'd love to hear.