Aditi Bhatnagar

We Scanned AI-Built Apps and Found Holes That Would End Companies. Here's What We Found.

I want to tell you about a bootstrap endpoint.

It was in a production app, live, serving real users. The endpoint existed because an AI assistant had helpfully included it during scaffolding. It returned the application's master authentication token. To anyone who asked. No login required. No API key. No nothing.

One HTTP request and you had every LLM API key, every database password, every third-party credential the app had ever touched. Fourteen secrets, handed over cheerfully by an app that had no idea it was doing anything wrong.

The developer who built this wasn't careless. They were fast. That's the whole point.


How We Got Here

Georgetown's CSET ran the numbers on AI-generated code versus hand-written code. The result was 2.74 times more vulnerabilities. Not a little worse. Nearly three times.

That tracks with what we see. Not because AI is bad at writing code — it's genuinely extraordinary at writing code. But writing code that works and writing code that's secure are two completely different objectives, and AI coding tools are only trained on one of them.

The model doesn't have a threat model. It has never been attacked. It doesn't know what SQL injection feels like from the inside, or why that particular pattern of URL handling becomes a problem the moment someone points it at an internal metadata service. It knows what working code looks like, and it reproduces those patterns at a speed no human can match — including the insecure ones that have been quietly dangerous in codebases for years.

This matters more than it used to because the other side has also gotten faster. Time from published CVE to working exploit used to be weeks; with a language model in the loop it can be under an hour. The cost of building a targeted attack has dropped roughly 100x. Your attack surface is expanding daily, and so is the speed of the people probing it.


What's Actually Sitting in Production

We've been scanning AI-built apps. Here's a sample of what we've found this year. Every single one of these teams responded immediately, patched within the same day or same week, and handled disclosure with more professionalism than most companies twice their size. That's worth saying upfront. These are security-conscious teams. The vulnerabilities got in anyway — because that's the nature of AI-generated code at speed, not the nature of the people shipping it.


Cognithor · Critical · CVSS 9.8

The bootstrap endpoint I opened with. Cognithor is an AI assistant with an active user base, and when we reported this they patched it the same day and credited the disclosure in their release notes, SECURITY.md, and commit message. That's the response of a team that takes this seriously.

The vulnerability itself: the app generated a master bearer token at startup. The /api/v1/bootstrap route returned it to any caller. No authentication. Not even a rate limit — that route was explicitly exempted. The API server bound to 0.0.0.0 by default, so any host that could reach the port could retrieve the token. With it you had access to every API key, every database password, the ability to wipe the entire configuration. Everything, one unauthenticated request away.
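
To make the shape of the failure concrete, here's a minimal sketch of the pattern in FastAPI-style Python. This is not Cognithor's actual source; the framework, names, and route body are assumptions for illustration.

```python
# Hypothetical sketch of the vulnerable pattern (not Cognithor's code).
import secrets

from fastapi import FastAPI

app = FastAPI()

# Master token minted once at startup. This part is fine.
MASTER_TOKEN = secrets.token_urlsafe(32)

# The problem: the bootstrap route hands the token to any caller.
# No auth dependency, and in the real case the route was also
# explicitly exempted from rate limiting.
@app.get("/api/v1/bootstrap")
def bootstrap():
    return {"token": MASTER_TOKEN}

# Compounding it: binding to 0.0.0.0 exposes the route to every host
# that can reach the port, not just localhost.
# uvicorn.run(app, host="0.0.0.0", port=8000)
```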

The AI built authentication that worked perfectly for the frontend. It just also built it for everyone else.

Fixed in v0.78.2. Same day.


LiteLLM · Critical · CVSS 9.0

LiteLLM is the proxy layer many teams use to manage access to language model APIs. Thousands of deployments. When we reported this they moved fast — patched and shipped within the week across two separate releases that addressed both failure points.

The bug was subtle. An org admin hitting POST /user/bulk_update could elevate any user on the entire platform to proxy_admin — full access to all model configurations, API keys, spend data, every tenant — in a single request. The authorization check read organization_id from the raw request body, confirmed it, then handed the payload to a Pydantic model that didn't declare that field. Pydantic silently dropped it. The handler never saw it. The scope constraint vanished between middleware and execution.
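
Here's a stripped-down reproduction of that silent drop, using Pydantic's default behavior. The model and function names are invented; only the mechanism mirrors the report.

```python
# Hypothetical sketch of the silent-drop pattern (names are illustrative).
from pydantic import BaseModel


class BulkUpdateRequest(BaseModel):
    # organization_id is deliberately NOT declared here, mirroring
    # the model that dropped the field.
    user_ids: list[str]
    new_role: str


def authorize(raw_body: dict, caller_org: str) -> None:
    # The check reads organization_id from the raw request body...
    if raw_body.get("organization_id") != caller_org:
        raise PermissionError("cross-org update not allowed")


raw = {"organization_id": "org-123", "user_ids": ["u1"], "new_role": "proxy_admin"}
authorize(raw, caller_org="org-123")  # passes

# ...but by default Pydantic ignores undeclared fields, so the validated
# payload the handler sees has no organization_id at all.
validated = BulkUpdateRequest(**raw)
print(validated.model_dump())  # {'user_ids': ['u1'], 'new_role': 'proxy_admin'}
# The scope constraint is gone; the handler applies the role platform-wide.
```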

The team identified both root causes immediately and addressed them in separate targeted fixes rather than a band-aid patch. That's engineering maturity.

Patched in v1.83.7 and v1.83.8 within the week.


Microsoft VibeVoice · High · CVSS 7.8

Microsoft acknowledged the report promptly and patched the same week. Their response was exactly what responsible disclosure is supposed to look like.

The vulnerability was in VibeVoice's checkpoint conversion script, which called torch.load() without the weights_only=True flag. PyTorch's default pickle-based deserialization executes arbitrary Python during unpickling. The guard has existed since PyTorch 1.13. The AI wrote the code without it because most torch.load() examples in training data don't include it: a pattern that's harmless on files you trust and dangerous the moment a file comes from anywhere else.
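
The mechanism is easy to demonstrate end to end. This sketch is not VibeVoice's conversion script; the file name and payload are placeholders, and the guarded call at the end shows the one-parameter fix.

```python
import os

import torch


# How a crafted checkpoint smuggles code: pickle honors __reduce__,
# so unpickling this object runs an attacker-chosen command.
class Evil:
    def __reduce__(self):
        return (os.system, ("echo pwned > /tmp/proof",))


torch.save(Evil(), "model.pt")

# weights_only=False was the default before PyTorch 2.6. The command
# above runs during deserialization, before any application code
# inspects the "model".
torch.load("model.pt", weights_only=False)

# The guarded call, available since PyTorch 1.13: only tensors and
# plain containers may be deserialized, so this raises an error
# instead of executing the payload.
torch.load("model.pt", weights_only=True)
```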

A crafted model file passed to that script executes whatever the attacker embedded, before any application code runs, before any checks happen. In CI that means secrets, tokens, and internal network access exfiltrated silently during a routine pipeline run.

One parameter. The team knew exactly what to do the moment we showed them.

Patched same week. Credited in the commit.


The Point Isn't That These Teams Made Mistakes

They didn't, really. They built fast, used AI to help them build faster, and ended up with vulnerabilities that no reasonable code review process would have caught. The same vulnerabilities are sitting in hundreds of other codebases right now, in teams that haven't had someone look.

What these teams did right — and why we're naming them — is that they fixed it. Immediately. Professionally. Without drama. Every one of them treated this like the security-conscious organizations they are.

The vulnerability wasn't a character flaw. It was a structural problem with how AI writes code, and it's a problem every team shipping AI-generated code is carrying whether they know it or not.


Why The Old Approach Can't Keep Up

Quarterly pentests are a snapshot. You pay $15k, they examine the codebase as it stood when the engagement started, you get a report two weeks later. Your team shipped daily in between.

One security engineer against ten developers with AI assistants isn't a program. It's a gesture.

Static scanners generate noise. Most teams stop reading them after a few weeks because the signal-to-noise ratio makes them unusable as a daily workflow. The findings above would pass most static scans; they require tracing data flow across the whole stack, not pattern matching on individual files.

None of these solutions run at the speed you're actually shipping.


What Kira Does

Kira runs on every commit. Connects to GitHub, GitLab, or Bitbucket in one click, nothing to install.

It traces data flow across your entire codebase looking for real attack paths: the ones where untrusted input travels through several functions and ends up somewhere it shouldn't, where a permission check validates a value that gets modified after the check, where a combination of individually reasonable decisions creates something exploitable.
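
A hypothetical example of the first kind of path, with every name invented: each function below is innocuous on its own, and the injection only becomes visible when the untrusted value is traced across all three.

```python
# Hypothetical attack path; all names are invented for illustration.
# Each function is plausible in isolation. The flaw only shows up when
# the untrusted value is followed from entry point to sink.

def handle_request(params: dict) -> str:
    username = params["user"]          # 1. untrusted input enters here
    return build_report(username)

def build_report(username: str) -> str:
    return fetch_user_rows(username)   # 2. hops through a helper

def fetch_user_rows(username: str) -> str:
    # 3. the sink: string-built SQL. Scanning this function alone can't
    # tell whether the argument is attacker-controlled; tracing the
    # whole path can.
    return f"SELECT * FROM users WHERE name = '{username}'"
```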

When it finds something it doesn't ask you to investigate. It shows you whether it's actually exploitable and exactly how. Proof of concept. Reproduction steps you can run yourself.

Findings come out as shareable reports. Clean enough to hand to a customer asking about your security posture or an auditor who wants to see evidence of process.

The findings above all came from Kira. So did a lot of others we haven't written up yet.


One Thing Worth Knowing

Every line your AI writes today is code a security engineer didn't review. That's not a criticism, it's arithmetic. You're moving too fast for manual review to keep up, and that gap is exactly where the vulnerabilities that matter tend to live.

Your first scan is free. No credit card. Connect your repo and see what's actually there.

offgridsec.com

If it comes back clean, great. If it doesn't, you'll be glad you looked before someone else did.

Top comments (1)

PEACEBINFLOW

The bootstrap endpoint story is the kind of thing that sticks with you because it's not really about AI making a mistake — it's about AI doing exactly what it was asked to do, thoroughly, without ever understanding the context that would make it stop.

What I keep thinking about is the asymmetry here: the AI can scaffold an entire authentication system in seconds, but it has no instinct for "this endpoint probably shouldn't exist." That instinct isn't a technical skill — it's a product of having been burned before, of having seen something similar go wrong in a previous codebase, of having a visceral reaction to the word "bootstrap" in a production context. The model doesn't have that. It's never been the person who got paged at 2 AM.

The 2.74x vulnerability stat makes sense through that lens. It's not that AI writes worse code on a line-by-line basis. It's that security is a property of the whole system — the interactions between components, the assumptions about who can reach what — and AI generates each piece without awareness of the whole. Traditional code review partly addressed this because the reviewer carried that systemic awareness. But when both the code generation and the review speed are being compressed, that awareness doesn't have time to activate. I wonder if the real gap isn't a scanning tool but something that injects friction specifically at the points where systemic awareness usually lives — the architectural decisions, the endpoint definitions, the places where a human would say "hang on, why does this route exist?"