Ratish Jain

Posted on Jun 16

I built a CLI that learns from your past bugs — here's how the scoring works

#devtools #opensource #claude #debugging

The problem with AI-assisted debugging isn't Claude. It's the 10 minutes before you even open Claude — manually hunting through 300 files to figure out which 5 are actually relevant.

My debugging loop used to look like this:

1. Error fires in production
2. Open the repo: 300 files staring back at me
3. Spend 10 minutes manually figuring out which 5 files are actually relevant
4. Copy those files into Claude
5. Claude gives a great answer
6. Repeat tomorrow with a different error

Step 3 was the bottleneck. Not the AI — the archaeology before the AI.

So I built dug — a CLI that does that context-gathering step automatically, and gets better at it every time you fix a bug.

What dug actually does

dug init          # indexes your codebase once
dug "your error"  # generates a Claude Code prompt with ranked file context

That second command outputs something like this — on the dug codebase itself:

dug "language detection returning wrong languages includes python in typescript project"

## Bug Report

**Error:** language detection returning wrong languages includes python in typescript project

**Files to investigate (ranked by relevance):**
  - src/dug/__main__.py  (modified in relevant recent commit, semantic match 3.36/5)
  - src/dug/graph.py     (semantic match 2.00/5)
  - src/dug/chunker.py   (semantic match 1.51/5)

**Recent commits touching these files:**
  0560e92: "fix: language detection now respects ignore_paths"  (0d ago)

**Suggested starting point:**
  Begin at src/dug/__main__.py.

__main__.py is exactly where the bug lived — in a function called _detect_languages(). The tool found it from a plain English description with no file names, no line numbers, no stack trace.

Here's how.

The three-layer scoring system

dug doesn't just grep. It combines three independent signals into a single relevance score for every file in your repo.

Layer 1 — Structural scoring

During dug init, dug builds a directed graph of your codebase:

FILE nodes   →  SYMBOL nodes (functions, classes)
FILE nodes   →  FILE nodes (imports)
COMMIT nodes →  FILE nodes (recently changed)

At query time, dug extracts signals from your error text — file names mentioned, symbol names, error type — and walks this graph:

Signal	Points
File directly mentioned in error	+10
File imports the mentioned file	+8
File modified in a commit matching the error	+8
File modified in any recent commit	+2

This alone gets you far. A NullPointerException in UserService will score UserService.java highly just from the graph — no ML required.
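
A minimal sketch of how the point table above could be applied, assuming plain dictionaries stand in for the graph. The function name `structural_scores`, its parameters, the word-overlap test for a "matching" commit, and the choice to count only the strongest commit signal per file are all assumptions for illustration, not dug's actual code.

```python
from collections import defaultdict

# Point values taken from the table above.
DIRECT_MENTION = 10
IMPORTS_MENTIONED = 8
MATCHING_COMMIT = 8
RECENT_COMMIT = 2


def structural_scores(mentioned, imports, commits, error_words):
    """Score files from graph edges alone.

    mentioned:   set of file paths named in the error text
    imports:     dict mapping a file to the set of files it imports
    commits:     list of (message, files) pairs for recent commits
    error_words: set of lowercase words from the error text
    """
    scores = defaultdict(int)

    for path in mentioned:
        scores[path] += DIRECT_MENTION

    # FILE -> FILE edges: reward files that import a mentioned file.
    for path, imported in imports.items():
        if imported & mentioned:
            scores[path] += IMPORTS_MENTIONED

    # COMMIT -> FILE edges: a commit "matches" here if its message shares
    # a word with the error text (a deliberately crude assumption).
    commit_points = defaultdict(int)
    for message, files in commits:
        words = set(message.lower().split())
        points = MATCHING_COMMIT if words & error_words else RECENT_COMMIT
        for path in files:
            commit_points[path] = max(commit_points[path], points)
    for path, points in commit_points.items():
        scores[path] += points

    return dict(scores)


scores = structural_scores(
    mentioned={"UserService.java"},
    imports={"CheckoutController.java": {"UserService.java"}},
    commits=[
        ("fix null check in checkout", {"CheckoutController.java"}),
        ("update readme", {"README.md"}),
    ],
    error_words={"nullpointerexception", "null", "checkout"},
)
for path, points in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(path, points)
```

In this toy run the importing file outranks the mentioned file because it collects both an import signal and a matching-commit signal, which is the kind of ranking a plain grep would not produce.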

Layer 2 — Semantic scoring

Structural scoring fails when the error text doesn't match file names. "checkout is failing with a null value" won't grep-match anything useful.

During dug init, every function body gets embedded using fastembed (ONNX-based, no PyTorch, runs fully local) and stored in LanceDB. At query time, the error text gets embedded with the same model and a cosine similarity search finds the most semantically related functions.

The model is sentence-transformers/all-MiniLM-L6-v2 — 384 dimensions, fast on CPU. Semantic hits add up to +5 points based on similarity score.
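
To make the "up to +5 points" step concrete, here is a rough sketch that uses only the standard library. Toy 3-dimensional vectors stand in for the real 384-dimensional embeddings, and the linear mapping from similarity to points, the per-file "best function wins" rule, and the names `semantic_points` and `best_semantic_points` are assumptions rather than dug's confirmed behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_points(similarity, max_points=5.0):
    """Map a similarity to 0..max_points (linear mapping is an assumption)."""
    return max_points * max(0.0, min(1.0, similarity))


def best_semantic_points(query_vec, function_vecs):
    """Score each file by its single most similar function.

    function_vecs: dict mapping file path -> list of function embeddings
    """
    return {
        path: max(semantic_points(cosine_similarity(query_vec, v)) for v in vecs)
        for path, vecs in function_vecs.items()
        if vecs
    }


# Toy vectors; a real index would hold one embedding per function body.
query = [1.0, 0.0, 0.0]
index = {
    "checkout.py": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "utils.py": [[0.0, 0.0, 1.0]],
}
points = best_semantic_points(query, index)
print(points)
```

Clamping negative similarities to zero keeps an unrelated function from subtracting points that the structural layer already awarded.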

Layer 3 — History boost

This is the part that makes dug different from every other code search tool.

After you fix a bug, you run:

dug solved

It shows what it suggested and asks which files actually had the fix:

Last query: "language detection returning wrong languages"
Suggested files were:
  - src/dug/__main__.py
  - src/dug/graph.py

Which files actually contained the bug? (comma-separated paths)
> src/dug/__main__.py

This gets saved to .dug/history.json:

{
  "bug_input": "language detection returning wrong languages",
  "error_type": "None",
  "resolved_files": ["src/dug/__main__.py"],
  "solve_count": 1,
  "last_solved": "2024-06-16T10:23:00Z"
}
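
A sketch of how an entry like this might be written, assuming the history file holds a JSON list of such objects and that solving the same description twice bumps `solve_count`. Both of those points, and the helper name `record_solution`, are guesses for illustration; the real file layout may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_solution(history_path, bug_input, error_type, resolved_files):
    """Append a solved bug to the history file, or bump an existing entry."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    for entry in history:
        if entry["bug_input"] == bug_input:
            # Same description solved again: merge files and count it.
            merged = set(entry["resolved_files"]) | set(resolved_files)
            entry["resolved_files"] = sorted(merged)
            entry["solve_count"] += 1
            entry["last_solved"] = now
            break
    else:
        # No existing entry matched, so start a new one.
        history.append({
            "bug_input": bug_input,
            "error_type": error_type,
            "resolved_files": sorted(resolved_files),
            "solve_count": 1,
            "last_solved": now,
        })

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history, indent=2))
    return history
```

Keeping the store as a flat, human-readable JSON file means a wrong entry can be fixed by hand in a text editor.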

Next time a similar error comes in, find_similar_past_bugs() scores it against every entry in history:

score = text_similarity × 0.6
      + signal_overlap × 0.25    # shared files/symbols between queries
      + error_type_match × 0.2   # exact error class gives a bonus

Text similarity blends character-level SequenceMatcher with word-level Jaccard over tokens split on CamelCase and snake_case boundaries, so "NullPointerException" and "null pointer in config" score as similar even though the raw strings differ in casing and word boundaries.
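
The blend and the weighted formula above can be sketched with `difflib` and a small tokenizer. This is an illustration under assumptions, not the body of find_similar_past_bugs(): the 50/50 split between the character and word measures is a guess, the helper names are made up, and because the published weights add up to 1.05 the sketch caps the result at 1.0.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Split on non-alphanumerics, snake_case, and CamelCase boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def text_similarity(a, b, char_weight=0.5):
    """Blend character-level and word-level similarity (50/50 is assumed)."""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ta, tb = tokens(a), tokens(b)
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return char_weight * char_sim + (1 - char_weight) * jaccard


def past_bug_score(text_sim, signal_overlap, error_type_match):
    """Weighted sum from the formula above, capped at 1.0.

    The cap is an assumption that keeps the result usable as a
    0..1 similarity even though the weights sum to 1.05.
    """
    raw = text_sim * 0.6 + signal_overlap * 0.25 + error_type_match * 0.2
    return min(1.0, raw)


print(sorted(tokens("NullPointerException")))
print(text_similarity("NullPointerException", "null pointer in config"))
```

Splitting identifiers before computing Jaccard is what lets an exception class name overlap with a plain-English description of the same failure.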

Files from matching past bugs get up to +6 points, scaled by similarity:

boost = 6.0 × similarity_score
# 0.9 similar → +5.4 points
# 0.5 similar → +3.0 points

On top of that, there's an error pattern boost: if UserService.java appeared in 8 of 10 past NullPointerException fixes, it gets up to +3 extra points from that pattern alone, independent of text similarity.
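
Both boosts can be sketched in a few lines. The 6.0 and 3.0 ceilings come from the text above; scaling the pattern boost linearly by how often a file fixed that error type, letting each file keep only its best history boost, and the function names themselves are assumptions made for this example.

```python
from collections import Counter

HISTORY_BOOST_MAX = 6.0   # ceiling stated in the article
PATTERN_BOOST_MAX = 3.0   # ceiling stated in the article


def history_boost(matches):
    """matches: list of (similarity, resolved_files) for similar past bugs.

    Each file keeps the boost from its most similar past bug.
    """
    boosts = {}
    for similarity, files in matches:
        for path in files:
            boost = HISTORY_BOOST_MAX * similarity
            boosts[path] = max(boosts.get(path, 0.0), boost)
    return boosts


def pattern_boost(history, error_type):
    """Scale the pattern boost by how often a file fixed this error type.

    Linear scaling by frequency is an assumption; the article only
    gives the 0 to 3 range.
    """
    same_type = [e for e in history if e["error_type"] == error_type]
    if not same_type:
        return {}
    counts = Counter(p for e in same_type for p in set(e["resolved_files"]))
    return {p: PATTERN_BOOST_MAX * n / len(same_type) for p, n in counts.items()}


# Toy history: 8 of 10 NullPointerException fixes touched UserService.java.
history = (
    [{"error_type": "NullPointerException", "resolved_files": ["UserService.java"]}] * 8
    + [{"error_type": "NullPointerException", "resolved_files": ["Cart.java"]}] * 2
)
print(pattern_boost(history, "NullPointerException"))
print(history_boost([(0.9, ["UserService.java"]), (0.5, ["Cart.java"])]))
```

Keeping the two boosts as separate functions makes it possible to show each contribution next to the file it raised, so a ranking can be questioned rather than taken on faith.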

Why "learning" is the right word

Fresh install, dug is good. After 20 bugs marked solved, dug is better for your specific codebase. After 100 bugs, it knows:

- ImportError almost always means __main__.py in this project
- TypeError undefined almost always means api/client.ts
- The auth bug that keeps coming back always starts in src/auth/

It's not ML training in the PyTorch sense. It's weighted frequency built from your team's actual debugging history. Stateless tools like grep can't do this — they have no memory. dug does.

Zero LLM calls

The entire pipeline (graph traversal, vector search, history lookup, scoring, prompt assembly) runs locally. No API key, no network requests, no round-trip latency.

The LLM call happens after, when you paste the output into Claude Code. dug's job is purely context assembly.

This means it works offline, costs nothing to run, and produces deterministic output you can inspect and debug.

Install

# macOS
brew tap ratishjain12/dug
brew trust ratishjain12/dug
brew install dug-cli

# Python users
pipx install dug-cli

# Linux / macOS one-liner
curl -fsSL https://raw.githubusercontent.com/ratishjain12/dug/main/install.sh | sh

Supports Python, TypeScript, JavaScript, and Java.

GitHub: github.com/ratishjain12/dug

What's next

Sentry / error tracker integration — dug sentry <issue-url> fetches the stack trace directly. Eliminates copy-paste entirely.

MCP server — expose dug as an MCP tool so Claude Code can call it mid-session directly.

Call graph edges — use jedi to add SYMBOL→SYMBOL edges for Python. Callers of the broken function get scored too.

VSCode extension — highlight error text, right-click, "Generate dug prompt."

If you try it, run dug solved after your first fix — that's what starts the learning loop. The first few times it's just bookkeeping. By week two it's noticeably better at predicting where your bugs live.

Questions about the architecture or scoring? Happy to go deeper in the comments.

Top comments (2)

Alex Shev • Jun 16

A CLI that learns from past bugs is useful if the scoring stays explainable. The best version would not just say "this file is risky"; it would show the historical pattern that made the score rise, so the developer can challenge it.

Alex Shev • Jun 17

The scoring idea is interesting because past bugs are usually better retrieval anchors than generic docs. The hard part is avoiding superstition: a file was involved in a previous bug, but that does not mean it is guilty this time. I would want the score to show why it thinks a path matters, not just rank it.