DEV Community: Raghav Chamadiya

I tested whether a code health score actually predicts bugs. Here's the benchmark

Raghav Chamadiya — Sat, 06 Jun 2026 07:57:41 +0000

Most code health scores are vibes. A number goes up, a number goes down, and nobody checks whether the files it flags are the files that actually break later. I wanted to know if the score I built does better than that, so I ran it against a defect corpus and put it head to head with the leading commercial code-health tool.

On the same 2,770 files across 9 languages, scored at the same leakage-free commit against the same defect labels, the score surfaces 2.3x the defects under a fixed review budget.

This post is how that works, and the four other layers sitting next to it in repowise.

What the score is

Every file gets a 1 to 10 score from 25 deterministic biomarkers. McCabe complexity, deep nesting, brain methods, class cohesion (LCOM4), god classes, native Rabin-Karp clone detection, untested hotspots, function-level churn, code-age volatility, ownership dispersion, change entropy, co-change scatter, prior-defect history, test-quality smells, and more.

No LLM calls. No cloud. No new runtime dependency. It is pure Python over tree-sitter and git data, and it finishes in under 30 seconds on a 3,000-file repo.

repowise health                       # KPIs + lowest-scoring files
repowise health --coverage cov.lcov   # ingest LCOV/Cobertura/Clover
repowise health --refactoring-targets # ranked by impact / effort
repowise health --trend               # snapshots + declining alerts

The biomarker weights are calibrated against a real defect corpus instead of hand-tuned. Only the learned constants ship. The runtime itself stays fully deterministic, so the same file produces the same score every time.

Does the score find bugs

The validation setup avoids the usual leakage trap. Health scores are collected at a historical commit (call it T0). Bug-fixing commits are counted over the following 6 months. Then the two get correlated. The score never sees the future it is being graded on.

Across 21 open-source repositories spanning all 9 Full-tier languages:

Cross-project mean ROC AUC of 0.74 [95% CI 0.68 to 0.79] at identifying files that go on to receive bug fixes. Up to 0.90 on individual repos.
It survives controlling for file size (partial Spearman rho = -0.16). It is not just flagging the big files.
It out-discriminates recent churn by +0.10 AUC and prior-defect history by +0.12 AUC, DeLong p < 1e-9.
It holds on an external published dataset it has never seen (PROMISE/jEdit CK-metrics: AUC 0.76 to 0.78, within about 0.03 of that dataset's own tuned model).

Head to head

Same files, same commit, same labels, paired tests against the leading commercial code-health tool:

Axis	repowise	Commercial tool
Recall @ 20%-of-lines budget	0.173	0.074
Effort-aware ranking (Popt)	0.607	0.462
Defect density, size-normalized (Alert:Healthy)	2.18x	0.56x
Discrimination (ROC AUC)	0.731	0.705

Ranking files by repowise health surfaces 2.3x the defects under a fixed review budget. Popt delta +0.144, recall delta +0.098, density delta p = 0.003, all paired and significant.

Full methodology and CIs.

The four other layers

Code health is one of five. The point of the other four is that your AI coding agent reads files but knows nothing about how the codebase got there.

Graph. A real tree-sitter dependency graph across 15 languages. File and symbol nodes, 3-tier call resolution, Leiden communities, PageRank, framework-aware route-to-handler edges.

Git. Behavioral signals static analysis cannot see. Hotspots (churn times complexity), ownership percentages, co-change pairs that expose hidden coupling, bus factor, reviewer suggestions.

Docs. An LLM-generated wiki per module and file, rebuilt incrementally on every commit with freshness and confidence scoring, searchable via hybrid RAG (full-text plus vector through reciprocal rank fusion).

Decisions. Architectural decisions mined from 8 sources, evidence-backed, linked to graph nodes, connected by supersedes / refines / conflicts_with edges. This is the layer most tools capture nowhere.

The agent angle

All five layers expose through nine MCP tools shaped around tasks, not data entities. You pass multiple files or symbols in one call and get complete context back, instead of chaining 30 greps and reads.

Paired SWE-QA runs on real repos, same model and harness, with and without the MCP tools:

70% fewer tool calls
89% fewer file reads
36% lower cost per query
answer quality at parity

Feeding an agent a commit through get_context costs 2,391 tokens versus 64,039 for the raw changed files. About 27x fewer.

There is also repowise distill, which compresses noisy command output before the agent reads it, errors first, every omission reversible:

Command	Raw to distilled tokens	Saved
`pytest -q` (11 failures)	3,374 to 1,317	61%
`git log -50`	3,064 to 331	89%
`git diff` (30 commits)	62,833 to 8,635	86%

Try it

pip install repowise
cd your-project
repowise init        # builds all five layers
repowise serve       # MCP server + local dashboard

The graph, git, dead-code, and health layers build in minutes with zero LLM calls. Run repowise init --index-only for a queryable index almost immediately. After that, every commit-triggered update takes under 30 seconds and only regenerates the pages your change touched.

100% local, bring your own API key, AGPL-3.0.

Repo, benchmarks, and live demo: github.com/repowise-dev/repowise and repowise.dev.

If you run the health-defect benchmark on your own repos, I want to see the numbers. The whole harness is public so you can reproduce or break it.

Why Grinding LeetCode Randomly Stops Working After a Point

Raghav Chamadiya — Fri, 26 Dec 2025 07:46:04 +0000

Most people don’t fail technical interviews because they didn’t solve enough problems.

They fail because their practice stopped compounding.

If you’ve done 200, 300, sometimes even 500 LeetCode problems and still feel shaky in interviews, this post is for you. Not because you’re bad at DSA, but because random practice quietly plateaus.

The illusion of progress

Early on, any practice works.

You solve two sum, reverse a linked list, maybe your first BFS. Every new concept feels like progress because it is. Your brain is laying down basic patterns.

But after a while, something weird happens.

You solve more problems, but interviews don’t feel easier. New questions still feel unfamiliar. Under pressure, your mind goes blank even for things you “know”.

That’s not a motivation problem. It’s a structure problem.

Problems are not the unit of learning

We treat problems as the unit of progress because they’re countable.

One problem done. Two problems done. Streaks maintained.

But interviews don’t test whether you’ve seen a problem before. They test whether you can recognize a pattern, adapt it, and explain your thinking clearly under ambiguity.

Patterns, not problems, are the real unit of learning.

If you solve ten problems that all secretly rely on the same idea, but you never abstract that idea, you didn’t learn ten things. You learned one thing ten times.

Random practice breaks pattern recognition

Random problem solving does three harmful things over time.

First, it fragments your mental model. You remember solutions, not strategies.

Second, it hides gaps. You might be strong at sliding window but weak at tree recursion and random practice won’t surface that clearly.

Third, it trains recall instead of reasoning. Interviews rarely reward recalling the exact solution you practiced last night.

This is why people say things like “I knew this problem, but I couldn’t do it in the interview.”

They knew the answer. They didn’t own the pattern.

What structured practice looks like

Structured practice is not about rigid schedules or fancy roadmaps. It’s about sequencing learning so that each problem reinforces an idea.

A simple example.

Instead of solving random array problems, you group problems by technique. Two pointers. Sliding window. Prefix sums.

You solve a few problems back to back that use the same idea, with increasing difficulty.

After each set, you pause and ask:

What was the invariant here

What made this pattern applicable

What variations could break my approach

That reflection is where learning actually happens.

Interviews test clarity, not cleverness

Interviewers are not impressed by clever tricks. They’re assessing whether you can reason clearly, communicate tradeoffs, and recover when stuck.

Structured practice naturally builds these skills because you’re constantly asking why something works, not just whether it passes.

Over time, you stop memorizing solutions and start recognizing shapes of problems.

That’s when interviews start feeling familiar again.

The real takeaway

If your prep feels busy but not effective, don’t add more hours.

Change the structure.

Stop counting problems. Start owning patterns.

This shift takes slightly more effort upfront, but it’s the difference between endless grinding and actual confidence.

It’s also the difference between hoping the interview question matches something you’ve seen and knowing you can handle whatever shows up.