AI coding assistants generate code that compiles cleanly but contains semantic bugs: SQL injection, auth bypasses, null dereferences. Linters and type checkers miss them because the bugs are in what the code claims to do, not in how it's structured.
I built Assay to catch what static tools can't. Then I ran it on popular open-source projects.
## The results
| Project | Stars | Claims Verified | Bugs | Critical | Score |
|---|---|---|---|---|---|
| LiteLLM | 18K | 1,381 | 185 | 30 | 78/100 |
| Chatbot UI | 28K | 476 | 41 | 12 | 91/100 |
| LobeChat | 50K | 205 | 14 | 1 | 87/100 |
| Open Interpreter | 55K | 12 | 4 | 2 | 60/100 |
Total: 2,400+ claims verified. 250 bugs found.
Every finding links to an interactive dashboard with file paths, line numbers, and code evidence.
## How it works
Assay extracts every testable claim from a codebase:
- "this validates auth tokens"
- "this handles null input"
- "this query prevents injection"
Then it uses an adversarial AI pass to verify each claim against the actual code. Think red team for code, not code review.
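Assay's internals aren't shown in this post, so here is a minimal, hypothetical sketch of what a claim-extraction and adversarial-verification loop could look like. The claim keywords, the `Claim` type, and the stubbed verifier are all illustrative assumptions, not Assay's actual implementation; in particular, a real adversarial pass would prompt a model to construct inputs that falsify each claim rather than run a trivial check.

```python
import re
from dataclasses import dataclass

@dataclass
class Claim:
    file: str
    line: int
    text: str

# Hypothetical keywords that signal a testable behavioral claim
# in a comment or docstring.
CLAIM_PATTERN = re.compile(
    r"(validates?|handles?|prevents?|sanitizes?)\s+.+", re.IGNORECASE
)

def extract_claims(path: str, source: str) -> list[Claim]:
    """Scan comment/docstring lines for testable claims about behavior."""
    claims = []
    for i, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(("#", '"""', "'''")):
            m = CLAIM_PATTERN.search(stripped)
            if m:
                claims.append(Claim(path, i, m.group(0)))
    return claims

def adversarial_verify(claim: Claim, source: str) -> bool:
    """Placeholder for the adversarial pass: instead of asking
    'does this look right?', a real verifier would try to build an
    input that breaks the claim. Stubbed here for illustration."""
    return "TODO" not in source

demo = '''
def check_token(tok):
    # validates auth tokens before granting access
    return tok is not None and len(tok) > 8
'''

claims = extract_claims("auth.py", demo)
print([c.text for c in claims])
```

The key structural point is the two phases: first enumerate what the code *claims*, then attack each claim independently, so a plausible-looking function can't pass on vibes alone.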
The approach is based on a formal framework we published: DOI 10.5281/zenodo.18522644
## Benchmark results ($638 total experiment cost)
### HumanEval (164 coding tasks, $220)
- Baseline: 86.6% pass rate
- Assay: 100% at pass@5 (164/164)
- Self-refine: 87.2% (barely above baseline)
- LLM-as-judge: peaks at 99.4%, then drops to 97.2% at k=5 (more review = worse code)
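For readers unfamiliar with the pass@k metric used above: it estimates the probability that at least one of k sampled completions passes the tests. The standard unbiased estimator, given n generated samples of which c passed, is pass@k = 1 − C(n−c, k) / C(n, k). A small self-contained implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k completions drawn from n samples (c of which passed) passes."""
    if n - c < k:
        # Fewer failing samples than draws: a passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 samples per task, 3 of them passing.
print(round(pass_at_k(5, 3, 1), 2))  # 0.6
print(round(pass_at_k(5, 3, 5), 2))  # 1.0
```

This is why pass@5 is a much easier bar than pass@1: even a 60% per-sample pass rate saturates to 100% when any one of five attempts counts.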
### SWE-bench (300 real GitHub bugs, $246)
- Baseline: 18.3% resolved
- Assay: 30.3% resolved (a 65.5% relative improvement)
## What I learned building this
The biggest projects have the most bugs. LiteLLM (52 API routes) had 185 bugs. Smaller, more focused projects scored higher.
Critical bugs hide in plain sight. These projects have thousands of stars, active communities, and regular releases. The bugs aren't in obscure corners — they're in core functionality.
Traditional tools don't catch semantic bugs. Linters check syntax. Type checkers check types. Nothing checks whether the code actually does what it claims to do. That's the gap Assay fills.
LLM-as-judge gets worse with more attempts. At k=5, it starts approving code that actually fails tests. Verification needs to be adversarial, not just "ask the AI if it looks good."
## Try it
```shell
npx tryassay assess /path/to/your/project
```
Free, open source. Uses the Anthropic API (~$2-3 for a small project, ~$30-50 for a large codebase). Add --publish for an interactive dashboard at tryassay.ai.
- GitHub: gtsbahamas/hallucination-reversing-system
- npm: tryassay
- Live dashboards: tryassay.ai
Free offer: Drop a repo link in the comments and I'll run Assay on it and share the dashboard. No charge — I want the data.
Have you caught semantic bugs in AI-generated code that linters missed? What tools do you use?