<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ty Wells</title>
    <description>The latest articles on DEV Community by Ty Wells (@ty_wells_7d0b523d1a02a496).</description>
    <link>https://dev.to/ty_wells_7d0b523d1a02a496</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3210882%2F1af088c0-82af-4f4b-91df-d0a5094dda0f.png</url>
      <title>DEV Community: Ty Wells</title>
      <link>https://dev.to/ty_wells_7d0b523d1a02a496</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ty_wells_7d0b523d1a02a496"/>
    <language>en</language>
    <item>
      <title>We found 250 semantic bugs in popular open-source projects that linters completely missed</title>
      <dc:creator>Ty Wells</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:06:19 +0000</pubDate>
      <link>https://dev.to/ty_wells_7d0b523d1a02a496/we-found-250-semantic-bugs-in-popular-open-source-projects-that-linters-completely-missed-1bli</link>
      <guid>https://dev.to/ty_wells_7d0b523d1a02a496/we-found-250-semantic-bugs-in-popular-open-source-projects-that-linters-completely-missed-1bli</guid>
      <description>&lt;p&gt;AI coding assistants generate code that compiles clean but contains &lt;strong&gt;semantic bugs&lt;/strong&gt; — SQL injection, auth bypasses, null dereferences. Linters and type checkers miss them because the bugs are in &lt;em&gt;what the code claims to do&lt;/em&gt;, not how it's structured.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://tryassay.ai" rel="noopener noreferrer"&gt;Assay&lt;/a&gt; to catch what static tools can't. Then I ran it on popular open-source projects.&lt;/p&gt;
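&lt;p&gt;To make "semantic bug" concrete, here is a contrived TypeScript example (not taken from any audited project). Both functions type-check and pass a linter; only one keeps the claim "this query prevents injection":&lt;/p&gt;

```typescript
// Contrived illustration, not from any audited project: both versions
// compile clean, but only one upholds "this query prevents injection".

// Claimed safe, actually injectable: user input is spliced into the SQL text.
function findUserUnsafe(email: string): string {
  return `SELECT id FROM users WHERE email = '${email}'`;
}

// Actually safe: the SQL keeps a placeholder; the driver binds the value.
function findUserSafe(email: string): { sql: string; params: string[] } {
  return { sql: "SELECT id FROM users WHERE email = ?", params: [email] };
}

const payload = "x' OR '1'='1";
console.log(findUserUnsafe(payload)); // payload becomes part of the SQL text
console.log(findUserSafe(payload));   // payload stays in params, never in the SQL
```

&lt;p&gt;A type checker sees two valid string-producing functions; only claim-level verification can tell them apart.&lt;/p&gt;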

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Claims Verified&lt;/th&gt;
&lt;th&gt;Bugs&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;18K&lt;/td&gt;
&lt;td&gt;1,381&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;185&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;78/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chatbot UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28K&lt;/td&gt;
&lt;td&gt;476&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;91/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LobeChat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;td&gt;205&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Interpreter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55K&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60/100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total across all scanned projects: 2,400+ claims verified, 250 bugs found.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every finding links to an interactive dashboard with file paths, line numbers, and code evidence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tryassay.ai/reports/0bccf817-1cb6-43ff-b724-866f14539073" rel="noopener noreferrer"&gt;LiteLLM report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tryassay.ai/reports/cc8c0c61-9b5a-4774-aed1-f99cc4f6991b" rel="noopener noreferrer"&gt;Chatbot UI report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tryassay.ai/reports/915dfc1a-64ec-483d-b4b5-effb53a86553" rel="noopener noreferrer"&gt;LobeChat report&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Assay extracts every &lt;strong&gt;testable claim&lt;/strong&gt; from a codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"this validates auth tokens"&lt;/li&gt;
&lt;li&gt;"this handles null input"&lt;/li&gt;
&lt;li&gt;"this query prevents injection"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it runs an adversarial AI pass to verify each claim against the actual code. Think of it as a red team for your code, not a code review.&lt;/p&gt;
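&lt;p&gt;The shape of that adversarial pass can be sketched as follows. The &lt;code&gt;Claim&lt;/code&gt; type and prompt wording are illustrative, not Assay's actual internals:&lt;/p&gt;

```typescript
// Illustrative sketch only; the Claim shape and prompt wording are
// hypothetical, not Assay's real implementation.
interface Claim {
  id: string;
  text: string; // e.g. "this validates auth tokens"
  file: string;
  line: number;
}

// Build a prompt that asks the model to attack the claim rather than
// confirm it: find an input or call path under which the claim is false.
function adversarialPrompt(claim: Claim, source: string): string {
  return [
    `Claim ${claim.id} (${claim.file}:${claim.line}): "${claim.text}"`,
    "Do not assess whether the code looks reasonable.",
    "Instead, try to construct a concrete input or call sequence",
    "for which the claim is FALSE. Cite exact lines as evidence.",
    "",
    "Source under review:",
    source,
  ].join("\n");
}
```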

&lt;p&gt;The approach is based on a formal framework we published: &lt;a href="https://doi.org/10.5281/zenodo.18522644" rel="noopener noreferrer"&gt;DOI 10.5281/zenodo.18522644&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results ($638 total experiment cost)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HumanEval (164 coding tasks) — $220
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Baseline: 86.6% pass rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assay: 100% at pass@5&lt;/strong&gt; (164/164)&lt;/li&gt;
&lt;li&gt;Self-refine: 87.2% (barely above baseline)&lt;/li&gt;
&lt;li&gt;LLM-as-judge: peaks at 99.4%, then drops to 97.2% at k=5 (more rounds of review produced worse code)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SWE-bench (300 real GitHub bugs) — $246
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Baseline: 18.3% resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assay: 30.3% resolved&lt;/strong&gt; (12 points over baseline, a ~66% relative improvement)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I learned building this
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The largest codebases have the most bugs.&lt;/strong&gt; LiteLLM (52 API routes, 1,381 claims verified) had 185 bugs. Smaller, more focused projects scored higher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Critical bugs hide in plain sight.&lt;/strong&gt; These projects have thousands of stars, active communities, and regular releases. The bugs aren't in obscure corners — they're in core functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traditional tools don't catch semantic bugs.&lt;/strong&gt; Linters check syntax. Type checkers check types. Nothing checks whether the code actually does what it claims to do. That's the gap Assay fills.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-judge gets worse with more attempts.&lt;/strong&gt; At k=5, it starts approving code that actually fails tests. Verification needs to be adversarial, not just "ask the AI if it looks good."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tryassay assess /path/to/your/project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free, open source. Uses the Anthropic API (~$2-3 for a small project, ~$30-50 for a large codebase). Add &lt;code&gt;--publish&lt;/code&gt; for an interactive dashboard at tryassay.ai.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/gtsbahamas/hallucination-reversing-system" rel="noopener noreferrer"&gt;gtsbahamas/hallucination-reversing-system&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/tryassay" rel="noopener noreferrer"&gt;tryassay&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live dashboards:&lt;/strong&gt; &lt;a href="https://tryassay.ai" rel="noopener noreferrer"&gt;tryassay.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free offer:&lt;/strong&gt; Drop a repo link in the comments and I'll run Assay on it and share the dashboard. No charge — I want the data.&lt;/p&gt;




&lt;p&gt;Have you caught semantic bugs in AI-generated code that linters missed? What tools do you use?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Use AI Hallucination to Generate Your Software Spec</title>
      <dc:creator>Ty Wells</dc:creator>
      <pubDate>Sun, 08 Feb 2026 07:04:13 +0000</pubDate>
      <link>https://dev.to/ty_wells_7d0b523d1a02a496/how-to-use-ai-hallucination-to-generate-your-software-spec-1eja</link>
      <guid>https://dev.to/ty_wells_7d0b523d1a02a496/how-to-use-ai-hallucination-to-generate-your-software-spec-1eja</guid>
      <description>&lt;h2&gt;
  
  
  What if the most hated property of AI models is actually their most useful feature for software development?
&lt;/h2&gt;

&lt;p&gt;Every AI coding tool fights hallucination. LUCID, the open-source tool covered here, exploits it. This tutorial shows you how to use deliberate AI hallucination to generate a comprehensive, testable software specification for your application -- then verify it against your actual code.&lt;/p&gt;

&lt;p&gt;By the end, you will have extracted 80-150 testable requirements spanning functionality, security, privacy, performance, and compliance from a single LLM prompt. Total cost: about $3 per iteration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20+&lt;/li&gt;
&lt;li&gt;An Anthropic API key (set as ANTHROPIC_API_KEY)&lt;/li&gt;
&lt;li&gt;A codebase you want to specify (any language, any framework)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/gtsbahamas/hallucination-reversing-system.git
cd hallucination-reversing-system
npm install
npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1: Initialize Your Project
&lt;/h2&gt;

&lt;p&gt;Navigate to your application's root directory and initialize LUCID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a .lucid/ directory to store iterations, claims, and verification results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Describe Your App (Loosely)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid describe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LUCID will prompt you for a description of your application. The key here is to be deliberately vague. Do not write a detailed spec. Write what you would tell a friend at a bar:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It's a career development platform. Users set goals, get AI coaching, manage their finances, upload documents. There's a subscription tier."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The vagueness is the point. Every gap you leave is a gap the AI will fill with its own hallucinated requirements. That is the raw material.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Hallucinate
&lt;/h2&gt;

&lt;p&gt;This is where the magic happens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid hallucinate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LUCID prompts the LLM to write a full Terms of Service and Acceptable Use Policy for your application as if it were already live in production with paying customers. The model has no way of knowing which of those obligations your app actually meets, so it confabulates.&lt;/p&gt;

&lt;p&gt;The output is saved to .lucid/iterations/1/hallucinated-tos.md. Open it up and read it. You will find the LLM has invented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific features you never mentioned&lt;/li&gt;
&lt;li&gt;Data handling procedures&lt;/li&gt;
&lt;li&gt;Security measures&lt;/li&gt;
&lt;li&gt;Performance guarantees&lt;/li&gt;
&lt;li&gt;User rights and limitations&lt;/li&gt;
&lt;li&gt;Account lifecycle rules&lt;/li&gt;
&lt;li&gt;SLA commitments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in precise, legally styled declarative language. A typical hallucination runs 400-600 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Extract Claims
&lt;/h2&gt;

&lt;p&gt;Now parse every declarative statement into a testable requirement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This produces a structured JSON file at .lucid/iterations/1/claims.json. Each claim looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "CLAIM-042",
  "section": "Data Handling",
  "category": "security",
  "severity": "critical",
  "text": "User data is encrypted at rest using AES-256",
  "testable": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On our test run, this produced 91 claims across five categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Functionality&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;Feature capabilities, user workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;Encryption, access control, auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Privacy&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Data retention, deletion, portability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Uptime, rate limits, backups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Liability, modifications, termination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No human requirements session produces this breadth in 30 seconds.&lt;/p&gt;
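&lt;p&gt;Since claims.json is plain JSON, the category breakdown above is easy to reproduce yourself. A minimal sketch, assuming the claim schema shown earlier:&lt;/p&gt;

```typescript
// Tally extracted claims by category, assuming the claims.json
// schema shown above (only the fields used here are declared).
interface ExtractedClaim {
  id: string;
  category: string;
  severity: string;
  testable: boolean;
}

function tallyByCategory(claims: ExtractedClaim[]) {
  const counts: { [category: string]: number } = {};
  for (const c of claims) {
    counts[c.category] = (counts[c.category] ?? 0) + 1;
  }
  return counts;
}
```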




&lt;h2&gt;
  
  
  Step 5: Verify Against Your Codebase
&lt;/h2&gt;

&lt;p&gt;This is where hallucination meets reality:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LUCID reads your codebase and checks each claim against what actually exists in your code. Each claim receives a verdict:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PASS -- Code fully implements the claim&lt;/li&gt;
&lt;li&gt;PARTIAL -- Code partially implements it&lt;/li&gt;
&lt;li&gt;FAIL -- Code does not implement or contradicts it&lt;/li&gt;
&lt;li&gt;N/A -- Cannot be verified from code alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output goes to .lucid/iterations/1/verification-results.json.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Generate Your Gap Report
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This generates a human-readable gap analysis. The compliance score formula is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score = (PASS + 0.5 * PARTIAL) / (Total - N/A) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
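&lt;p&gt;As code, the formula is a one-liner; the verdict counts below are made up for illustration:&lt;/p&gt;

```typescript
// The compliance score formula above: (PASS + 0.5*PARTIAL) / (Total - N/A) * 100.
// The counts here are invented for illustration.
interface VerdictCounts {
  pass: number;
  partial: number;
  fail: number;
  na: number;
}

function complianceScore(v: VerdictCounts): number {
  const total = v.pass + v.partial + v.fail + v.na;
  const verifiable = total - v.na;
  if (verifiable === 0) return 0; // guard; the CLI's behavior here is unspecified
  return (100 * (v.pass + 0.5 * v.partial)) / verifiable;
}

// e.g. 6 PASS, 2 PARTIAL, 2 FAIL, 2 N/A: (6 + 1) / 10 * 100 = 70
console.log(complianceScore({ pass: 6, partial: 2, fail: 2, na: 2 })); // 70
```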

&lt;p&gt;Our first verifiable iteration scored 57.3%. The report shows exactly which claims failed and why -- your development backlog writes itself.&lt;/p&gt;

&lt;p&gt;Example report output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LUCID Gap Report - Iteration 3
===============================
Compliance Score: 57.3%

PASS:    38 claims (44.7%)
PARTIAL: 15 claims (17.6%)
FAIL:    32 claims (37.6%)
N/A:      6 claims

TOP FAILURES (Critical):
- CLAIM-012: Rate limiting not enforced server-side
- CLAIM-027: No malware scanning for file uploads
- CLAIM-041: Account lockout parameters don't match spec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 7: Fix, Then Remediate
&lt;/h2&gt;

&lt;p&gt;After addressing gaps in your code, generate specific fix tasks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid remediate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This converts FAIL and PARTIAL verdicts into actionable remediation tasks, sorted by severity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "REM-001",
  "claimId": "CLAIM-012",
  "title": "Add rate limiting middleware",
  "action": "add",
  "targetFiles": ["src/middleware/rate-limit.ts"],
  "estimatedEffort": "medium",
  "codeGuidance": "Implement express-rate-limit with..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
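&lt;p&gt;The severity ordering can be sketched as a join back to the claims. The task and claim shapes follow the JSON above; the rank values are assumed, not LUCID's actual ones:&lt;/p&gt;

```typescript
// Order remediation tasks by the severity of the claim they fix.
// Shapes follow the JSON above; the severity ranking is an assumption.
interface Task { id: string; claimId: string; title: string; }
interface ClaimInfo { id: string; severity: string; }

const rank: { [severity: string]: number } = { critical: 0, high: 1, medium: 2, low: 3 };

function bySeverity(tasks: Task[], claims: ClaimInfo[]): Task[] {
  const sev: { [claimId: string]: number } = {};
  for (const c of claims) sev[c.id] = rank[c.severity] ?? 9; // unknown severities sort last
  return [...tasks].sort((a, b) => (sev[a.claimId] ?? 9) - (sev[b.claimId] ?? 9));
}
```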




&lt;h2&gt;
  
  
  Step 8: Regenerate and Loop
&lt;/h2&gt;

&lt;p&gt;After implementing fixes, feed the updated reality back to the model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lucid regenerate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This generates a new ToS that incorporates what now exists, while hallucinating new capabilities built on the verified foundation. Extract, verify, report again. Each iteration, the score climbs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;57.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;69.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;83.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;90.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The loop converges because each regeneration is grounded in more reality. New hallucinations become more contextually appropriate. The gap shrinks.&lt;/p&gt;
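&lt;p&gt;The loop itself is a simple driver. In this sketch, &lt;code&gt;runIteration&lt;/code&gt; is a hypothetical stand-in for one regenerate/extract/verify/report cycle, stubbed here with the scores from the table:&lt;/p&gt;

```typescript
// Driver sketch for the regenerate loop. runIteration stands in for one
// "lucid regenerate; lucid extract; lucid verify; lucid report" cycle and
// returns that iteration's compliance score.
function loopUntil(target: number, maxIters: number, runIteration: () => number): number {
  let score = 0;
  for (let i = 0; maxIters > i; i++) {
    score = runIteration();     // one full hallucinate-verify cycle
    if (score >= target) break; // converged: stop iterating
  }
  return score;
}

// Replaying the scores from the table above, with a 90% target:
const replay = [57.3, 69.8, 83.2, 90.8];
let k = 0;
console.log(loopUntil(90, 10, () => replay[k++])); // 90.8
```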




&lt;h2&gt;
  
  
  When to Stop
&lt;/h2&gt;

&lt;p&gt;Stop when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All critical claims are verified&lt;/li&gt;
&lt;li&gt;Remaining gaps are intentionally deferred&lt;/li&gt;
&lt;li&gt;New hallucinations offer diminishing returns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On our test run, we stopped at 90.8% after 6 iterations. The 5 remaining failures were genuine missing functionality (rate limiting, malware scanning, data retention logic). The hallucinated ToS correctly identified them as requirements a production app should have.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Approximate Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinate&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extract&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verify&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remediate&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regenerate&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per iteration&lt;/td&gt;
&lt;td&gt;~$2.90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Six iterations cost about $17 total. For a verified specification with 91 claims, a gap report, and a prioritized remediation plan, that is the cheapest spec you will ever produce.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;p&gt;The theoretical basis is not hand-waving. Transformer self-attention is mathematically equivalent to the update rule of modern Hopfield networks (Ramsauer et al., 2020) -- the same associative pattern completion long used to model hippocampal memory retrieval. When the LLM hallucinates, it is performing pattern completion from partial cues against its training data. The output includes both accurate completions (real patterns) and confabulated completions (plausible extensions).&lt;/p&gt;
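&lt;p&gt;That pattern-completion behavior can be shown with a toy modern-Hopfield retrieval in a few lines; the patterns, cue, and inverse temperature here are illustrative, not taken from the paper:&lt;/p&gt;

```typescript
// Toy modern-Hopfield retrieval (softmax pattern completion), in the
// spirit of Ramsauer et al.; the numbers and dimensions are illustrative.
const stored = [
  [1, 1, -1, -1],   // pattern A
  [-1, -1, 1, 1],   // pattern B
];

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Complete a partial cue toward the closest stored pattern.
function complete(cue: number[], beta: number): number[] {
  const scores = stored.map((p) => beta * dot(p, cue));
  const m = Math.max(...scores);
  const w = scores.map((s) => Math.exp(s - m));     // stable softmax weights
  const z = w.reduce((s, x) => s + x, 0);
  // Output is the attention-weighted sum of stored patterns.
  return cue.map((_, j) => stored.reduce((s, p, i) => s + (w[i] / z) * p[j], 0));
}

// A cue matching only half of pattern A still retrieves A.
console.log(complete([1, 0, 0, -1], 2).map((x) => Math.round(x))); // rounds back to pattern A
```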

&lt;p&gt;The Terms of Service format forces precision because legal language cannot be vague. And external verification (against the codebase, not the model's own assessment) provides the reality-checking that LLMs have been shown unable to perform reliably on themselves (Huang et al., ICLR 2024).&lt;/p&gt;

&lt;p&gt;The closest precedent: protein hallucination from the Baker Lab, where neural network "dreams" served as blueprints for novel proteins. That line of work earned David Baker a share of the 2024 Nobel Prize in Chemistry.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/gtsbahamas/lucid.git
cd lucid
npm install &amp;amp;&amp;amp; npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Full paper with neuroscience grounding: &lt;a href="https://github.com/gtsbahamas/lucid/blob/main/docs/paper.md" rel="noopener noreferrer"&gt;https://github.com/gtsbahamas/lucid/blob/main/docs/paper.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions, issues, and contributions welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
