AI coding assistants generate code that compiles cleanly but contains semantic bugs: SQL injection, auth bypasses, null dereferences. Linters and type checkers miss them because the bugs are in what the code claims to do, not in how it's structured.
I built Assay to catch what static tools can't. Then I ran it on popular open-source projects.
## The results
| Project | Stars | Claims Verified | Bugs | Critical | Score |
|---|---|---|---|---|---|
| LiteLLM | 18K | 1,381 | 185 | 30 | 78/100 |
| Chatbot UI | 28K | 476 | 41 | 12 | 91/100 |
| LobeChat | 50K | 205 | 14 | 1 | 87/100 |
| Open Interpreter | 55K | 12 | 4 | 2 | 60/100 |
Total: 2,400+ claims verified. 250 bugs found.
Every finding links to an interactive dashboard with file paths, line numbers, and code evidence.
## How it works
Assay extracts every testable claim from a codebase:
- "this validates auth tokens"
- "this handles null input"
- "this query prevents injection"
Then it uses an adversarial AI pass to verify each claim against the actual code. Think red team for code, not code review.
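Assay's internals aren't shown in this post, so here is a minimal, hypothetical sketch of what a claim-extraction and adversarial-verification loop could look like. The claim keywords, the `Claim` type, and the stubbed verifier are all illustrative assumptions, not Assay's actual implementation; in particular, a real adversarial pass would prompt a model to construct inputs that falsify each claim rather than run a trivial check.

```python
import re
from dataclasses import dataclass

@dataclass
class Claim:
    file: str
    line: int
    text: str

# Hypothetical keywords that signal a testable behavioral claim
# in a comment or docstring.
CLAIM_PATTERN = re.compile(
    r"(validates?|handles?|prevents?|sanitizes?)\s+.+", re.IGNORECASE
)

def extract_claims(path: str, source: str) -> list[Claim]:
    """Scan comment/docstring lines for testable claims about behavior."""
    claims = []
    for i, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(("#", '"""', "'''")):
            m = CLAIM_PATTERN.search(stripped)
            if m:
                claims.append(Claim(path, i, m.group(0)))
    return claims

def adversarial_verify(claim: Claim, source: str) -> bool:
    """Placeholder for the adversarial pass: instead of asking
    'does this look right?', a real verifier would try to build an
    input that breaks the claim. Stubbed here for illustration."""
    return "TODO" not in source

demo = '''
def check_token(tok):
    # validates auth tokens before granting access
    return tok is not None and len(tok) > 8
'''

claims = extract_claims("auth.py", demo)
print([c.text for c in claims])
```

The key structural point is the two phases: first enumerate what the code *claims*, then attack each claim independently, so a plausible-looking function can't pass on vibes alone.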
The approach is based on a formal framework we published: DOI 10.5281/zenodo.18522644
## Benchmark results ($638 total experiment cost)
### HumanEval (164 coding tasks, $220)
- Baseline: 86.6% pass rate
- Assay: 100% at pass@5 (164/164)
- Self-refine: 87.2% (barely above baseline)
- LLM-as-judge: peaks at 99.4%, then drops to 97.2% at k=5 (more review = worse code)
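For readers unfamiliar with the pass@k metric used above: it estimates the probability that at least one of k sampled completions passes the tests. The standard unbiased estimator, given n generated samples of which c passed, is pass@k = 1 − C(n−c, k) / C(n, k). A small self-contained implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k completions drawn from n samples (c of which passed) passes."""
    if n - c < k:
        # Fewer failing samples than draws: a passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 samples per task, 3 of them passing.
print(round(pass_at_k(5, 3, 1), 2))  # 0.6
print(round(pass_at_k(5, 3, 5), 2))  # 1.0
```

This is why pass@5 is a much easier bar than pass@1: even a 60% per-sample pass rate saturates to 100% when any one of five attempts counts.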
### SWE-bench (300 real GitHub bugs, $246)
- Baseline: 18.3% resolved
- Assay: 30.3% resolved (a 65.5% relative improvement)
## What I learned building this
The biggest projects have the most bugs. LiteLLM (52 API routes) had 185 bugs. Smaller, more focused projects scored higher.
Critical bugs hide in plain sight. These projects have thousands of stars, active communities, and regular releases. The bugs aren't in obscure corners — they're in core functionality.
Traditional tools don't catch semantic bugs. Linters check syntax. Type checkers check types. Nothing checks whether the code actually does what it claims to do. That's the gap Assay fills.
LLM-as-judge gets worse with more attempts. At k=5, it starts approving code that actually fails tests. Verification needs to be adversarial, not just "ask the AI if it looks good."
## Try it
```shell
npx tryassay assess /path/to/your/project
```
Free, open source. Uses the Anthropic API (~$2-3 for a small project, ~$30-50 for a large codebase). Add --publish for an interactive dashboard at tryassay.ai.
- GitHub: gtsbahamas/hallucination-reversing-system
- npm: tryassay
- Live dashboards: tryassay.ai
Free offer: Drop a repo link in the comments and I'll run Assay on it and share the dashboard. No charge — I want the data.
Have you caught semantic bugs in AI-generated code that linters missed? What tools do you use?