I Built a Security Scanner for AI-Generated Code, Then Found Vulnerabilities in My Own Projects
What happens when you run your own tool on your own code

AyushkhatiDev — Mon, 25 May 2026 02:29:18 +0000

Been building with Cursor and Bolt lately like a lot of people here.

Started wondering — is the code these tools generate actually secure?
So I dug into it.

Turns out the numbers are bad:

45% of AI-generated code has OWASP Top 10 vulnerabilities (Veracode)
65% of vibe-coded apps have security issues (Escape.tech, 1400+ apps)
35 CVEs in a single month attributed to AI-generated code (March 2026)

Patterns I kept seeing in AI-generated code:

Hardcoded API keys and Supabase service keys
RLS disabled on Supabase tables
Hallucinated npm packages that don't exist
Wildcard CORS in backends
eval() calls with dynamic values
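
For a rough idea of how checks for a few of these patterns can work, here is a minimal regex-based sketch. It is not vibesec's actual implementation: the rule names, the regexes, and the `scan_text` helper are all assumptions made for illustration, and a line-by-line regex pass like this will miss plenty and flag some false positives.

```python
import re

# Hypothetical rules for illustration only; a real scanner needs many more
# rules plus context (file type, comments, test fixtures, and so on).
RULES = {
    # A key-like name assigned a quoted literal of 16+ characters.
    "hardcoded-secret": re.compile(
        r"""(?i)(api[_-]?key|secret|service[_-]?(role[_-]?)?key|token)\w*\s*[:=]\s*["'][^"']{16,}["']"""
    ),
    # A CORS header or `origins` option set to the literal "*".
    "wildcard-cors": re.compile(
        r"""(?i)(Access-Control-Allow-Origin["']?\]?\s*[:=,]\s*["']\*["']|origins?["']?\s*[:=]\s*["']\*["'])"""
    ),
    # A bare eval( call whose first argument is not a string literal.
    # Method calls such as model.eval() are deliberately skipped.
    "dynamic-eval": re.compile(r"""(?<![\w.])eval\s*\((?!\s*["'])"""),
}


def scan_text(source: str) -> list:
    """Return (line_number, rule_name) pairs for every rule that matches."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# Made-up sample input; the key below is a placeholder, not a real secret.
SAMPLE = '''SUPABASE_SERVICE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"
CORS(app, origins="*")
result = eval(user_input)
total = eval("1 + 1")
'''

if __name__ == "__main__":
    for lineno, rule in scan_text(SAMPLE):
        print(f"line {lineno}: {rule}")
```

On the sample, the first three lines are flagged and the last one is not, because its `eval` argument is a string literal. Missing RLS and hallucinated npm packages are not covered here, since checking them needs the database config or a registry lookup.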

So I scanned my own projects. Found 4 vulnerabilities in my own
LLM eval platform that's already live:

3 eval() XSS risks in frontend
Wildcard CORS in Flask backend
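
One common way to replace a wildcard CORS policy is an explicit origin allowlist. The sketch below shows the idea as a plain function so it runs without Flask installed. The origins and the `cors_headers` name are hypothetical, and the Flask wiring in the trailing comment is one possible way to hook it up, not necessarily how the platform above was fixed.

```python
from typing import Optional

# Hypothetical allowlist; replace with the origins your frontend is served from.
ALLOWED_ORIGINS = frozenset({"https://app.example.com", "http://localhost:3000"})


def cors_headers(request_origin: Optional[str], allowed=ALLOWED_ORIGINS) -> dict:
    """Return CORS response headers for an allowed origin, else no headers."""
    if request_origin is not None and request_origin in allowed:
        return {
            # Echo back the specific origin instead of "*".
            "Access-Control-Allow-Origin": request_origin,
            # Tell caches that the response varies by the Origin header.
            "Vary": "Origin",
        }
    return {}


# In a Flask app this could be wired up roughly like so (assumes `app` and
# `request` come from flask):
#
#   @app.after_request
#   def add_cors(resp):
#       resp.headers.update(cors_headers(request.headers.get("Origin")))
#       return resp
```

Requests from origins outside the allowlist get no `Access-Control-Allow-Origin` header, so the browser blocks cross-origin reads of the response.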

Built an open-source CLI to automate this scan for anyone:

pip install vibesec
vibesec scan ./your-project

Full writeup: https://medium.com/@ayushiskhati305/i-built-a-security-scanner-for-ai-generated-code-then-found-vulnerabilities-in-my-own-projects-82974fc97e43
Repo: github.com/AyushkhatiDev/vibesec

Curious what others find when they scan their Cursor projects.
Anyone else checked their AI-generated code?

I built an open-source LLM eval framework as a BCA student — hallucination detection, red-teaming, regression tracking

AyushkhatiDev — Tue, 19 May 2026 03:51:11 +0000

## The Problem

Every company building AI products needs to know if their LLM is
actually working — or getting worse over time. This is harder than
it sounds.

I built an open-source evaluation framework to solve this.

## What It Does

Runs a 27-test suite covering factual accuracy, safety refusals, hallucination resistance, adversarial prompts, and reasoning
Scores outputs using a 3-tier judge chain: semantic similarity → LLM judge → regex fallback
Auto-generates adversarial prompt attacks to red-team any endpoint
Tracks regressions across model versions
Live dashboard with pass/fail rates and per-test inspection
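
To make the 3-tier judge chain more concrete, here is a minimal sketch of how such a fallback chain can be structured. Everything in it is an assumption for illustration: the function names, the 0.85 threshold, and the escalation rule (a tier returns `None` when it is inconclusive) are not taken from the framework. It also swaps in `difflib` string similarity as a standard-library stand-in for real embedding-based semantic similarity, and the LLM judge is an injected callable rather than an actual Groq API call.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Hypothetical signature: an LLM judge takes (output, expected) and returns
# True/False, or None when it cannot decide.
LlmJudge = Callable[[str, str], Optional[bool]]


def similarity_tier(output: str, expected: str, threshold: float = 0.85) -> Optional[bool]:
    """Tier 1: pass on a near match, otherwise stay inconclusive.

    Lexical stand-in only; a real system would compare embeddings.
    """
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return True if ratio >= threshold else None


def regex_tier(output: str, expected: str) -> bool:
    """Tier 3: last resort, pass if the expected text appears in the output."""
    return re.search(re.escape(expected), output, re.IGNORECASE) is not None


def judge(output: str, expected: str, llm_judge: Optional[LlmJudge] = None) -> Tuple[bool, str]:
    """Run the tiers in order and report which one produced the verdict."""
    verdict = similarity_tier(output, expected)
    if verdict is not None:
        return verdict, "similarity"

    if llm_judge is not None:
        try:
            verdict = llm_judge(output, expected)
        except Exception:
            verdict = None  # API error or rate limit: fall through to regex
        if verdict is not None:
            return verdict, "llm"

    return regex_tier(output, expected), "regex"


if __name__ == "__main__":
    print(judge("Paris", "paris"))
    print(judge("The capital of France is Paris.", "Paris"))
```

Returning the tier name alongside the verdict makes it easy to see, per test, how often the cheap tiers were enough and how often the chain had to fall back.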

## Research Finding

The hallucination scorer reached 86% classification accuracy (43 of
50 cases) against a 50% random baseline on a 50-case benchmark.
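
With only 50 cases, it is fair to ask whether 86% could be luck. A quick sanity check is the binomial tail probability of getting at least that many cases right by coin-flip guessing. The sketch below assumes a binary classification task where each case is independent and a random guess is right half the time, which matches the stated 50% baseline but is otherwise an assumption. The `binomial_tail` helper is illustrative, not part of the framework.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_cases = 50
n_correct = round(0.86 * n_cases)  # 43 of 50
p_value = binomial_tail(n_correct, n_cases)

if __name__ == "__main__":
    print(f"{n_correct}/{n_cases} correct, chance probability ~ {p_value:.2e}")
```

The probability comes out at roughly 1 in 10 million, so under these assumptions the result is very unlikely to be chance. A 50-case benchmark still leaves a wide confidence interval around the 86% figure itself.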

## Architecture

Flask backend → PostgreSQL → Groq API → Next.js dashboard

Deployed completely free on Render + Vercel + Neon + Upstash.

## Stack

Flask, SQLAlchemy, Groq SDK, PostgreSQL, Next.js, Framer Motion,
Render, Vercel

## About Me

I'm a BCA student from Siliguri, India. I built this in a few weeks
because I wanted a portfolio project that solves a real problem —
not another todo app.

Would love feedback on the scoring approach and architecture.

DEV Community: AyushkhatiDev