Ty Wells

We found 250 semantic bugs in popular open-source projects that linters completely missed

AI coding assistants generate code that compiles cleanly but contains semantic bugs: SQL injection, auth bypasses, null dereferences. Linters and type checkers miss them because the bugs live in what the code claims to do, not in how it's structured.
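Here's the kind of bug I mean. The handler below is a minimal sketch (my own illustration, not taken from any of the audited projects): it type-checks and lints clean, but the interpolated query is injectable.

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Compiles and lints clean: types are correct, structure is fine.
// The bug is semantic: user input is concatenated straight into SQL.
async function getUser(id: string) {
  // BUG: id = "1 OR 1=1" returns every row in the table.
  // The claimed behavior is the parameterized version:
  //   pool.query("SELECT * FROM users WHERE id = $1", [id])
  return pool.query(`SELECT * FROM users WHERE id = ${id}`);
}
```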

I built Assay to catch what static tools can't. Then I ran it on popular open-source projects.

The results

| Project          | Stars | Claims Verified | Bugs | Critical | Score  |
|------------------|-------|-----------------|------|----------|--------|
| LiteLLM          | 18K   | 1,381           | 185  | 30       | 78/100 |
| Chatbot UI       | 28K   | 476             | 41   | 12       | 91/100 |
| LobeChat         | 50K   | 205             | 14   | 1        | 87/100 |
| Open Interpreter | 55K   | 12              | 4    | 2        | 60/100 |

Total: 2,400+ claims verified. 250 bugs found.

Every finding links to an interactive dashboard with file paths, line numbers, and code evidence.

How it works

Assay extracts every testable claim from a codebase:

  • "this validates auth tokens"
  • "this handles null input"
  • "this query prevents injection"

Then it uses an adversarial AI pass to verify each claim against the actual code. Think red team for code, not code review.
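Concretely, you can picture each extracted claim as a small record that the verifier tries to falsify. The sketch below is my illustration of the idea, not Assay's actual schema; the field names and the callModel helper are assumptions.

```typescript
// Sketch of a claim record; field names are assumptions, not Assay's schema.
interface Claim {
  file: string;                              // e.g. "src/auth/middleware.ts"
  line: number;
  text: string;                              // e.g. "this validates auth tokens"
  source: "docstring" | "comment" | "name";  // where the claim was stated
}

type Verdict =
  | { status: "verified" }
  | { status: "violated"; evidence: string }; // concrete counterexample

// Hypothetical LLM call, standing in for the Anthropic API.
declare function callModel(prompt: string): Promise<string>;

// The adversarial pass: don't ask "does this look right?" -- ask the model
// to construct an input or code path that breaks the claim.
async function verify(claim: Claim, code: string): Promise<Verdict> {
  const prompt =
    `Claim: "${claim.text}" (${claim.file}:${claim.line})\n` +
    `Code:\n${code}\n` +
    `Find a concrete input or execution path that falsifies the claim. ` +
    `If none exists, reply VERIFIED.`;
  const answer = await callModel(prompt);
  return answer.trim().startsWith("VERIFIED")
    ? { status: "verified" }
    : { status: "violated", evidence: answer };
}
```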

The approach is based on a formal framework we published: DOI 10.5281/zenodo.18522644

Benchmark results ($638 total experiment cost)

HumanEval (164 coding tasks) — $220

  • Baseline: 86.6% pass rate
  • Assay: 100% at pass@5 (164/164)
  • Self-refine: 87.2% (barely above baseline)
  • LLM-as-judge: peaks at 99.4%, then drops to 97.2% at k=5 (more review = worse code)
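A note on the metric: I'm assuming pass@k here follows the standard unbiased estimator from the HumanEval paper (Chen et al., 2021), where n samples are drawn per task and c of them pass the tests:

```typescript
// Unbiased pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
// the chance that a random size-k subset of n samples contains a pass.
// Product form avoids computing huge binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw includes a passing sample
  let allFail = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

console.log(passAtK(10, 2, 5)); // ≈ 0.78
```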

SWE-bench (300 real GitHub bugs) — $246

  • Baseline: 18.3% resolved
  • Assay: 30.3% resolved (a 65.5% relative improvement over baseline)

What I learned building this

  1. The largest codebases have the most bugs. LiteLLM (52 API routes) had 185 bugs. Smaller, more focused projects scored higher.

  2. Critical bugs hide in plain sight. These projects have thousands of stars, active communities, and regular releases. The bugs aren't in obscure corners — they're in core functionality.

  3. Traditional tools don't catch semantic bugs. Linters check syntax. Type checkers check types. Nothing checks whether the code actually does what it claims to do. That's the gap Assay fills (a concrete example follows this list).

  4. LLM-as-judge gets worse with more attempts. At k=5, it starts approving code that actually fails tests. Verification needs to be adversarial, not just "ask the AI if it looks good."
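To make point 3 concrete: here's a hypothetical Express middleware whose comment claims token validation but whose code only checks that a token is present. Types and lint are both satisfied; only comparing the claim against the code flags it.

```typescript
import type { Request, Response, NextFunction } from "express";

// Claim: "this validates auth tokens."
// Reality: it only checks that *a* token exists. A forged or expired
// token sails through. Nothing here is a type or syntax error.
function requireAuth(req: Request, res: Response, next: NextFunction) {
  const token = req.headers.authorization?.replace("Bearer ", "");
  if (!token) {
    return res.status(401).json({ error: "missing token" });
  }
  // MISSING: signature, expiry, and audience checks,
  // e.g. jwt.verify(token, SECRET) before calling next().
  next();
}
```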

Try it

```bash
npx tryassay assess /path/to/your/project
```

Free, open source. Uses the Anthropic API (~$2-3 for a small project, ~$30-50 for a large codebase). Add --publish for an interactive dashboard at tryassay.ai.

Free offer: Drop a repo link in the comments and I'll run Assay on it and share the dashboard. No charge — I want the data.


Have you caught semantic bugs in AI-generated code that linters missed? What tools do you use?
