DEV Community

I Scanned 100 AI Codebases - Here's What I Found

Mykola Kondratiuk on March 19, 2026

I've been building VibeCheck for the past few months - it's a security scanner specifically for AI-generated code. And after scanning over a hundred...
Daniel Yarmoluk

Can't this be mitigated by adding more context into the model?

Mykola Kondratiuk

it helps but only partially. i tried this - added security guidelines, told it to never hardcode secrets, to use parameterized queries, etc. the model follows them when the task is clearly framed that way. but when it's doing something else (like "add this feature") it just... forgets. the security context doesn't carry over into every decision. what i found works better is having a separate security review pass after generation - treat it like a code review step, not something you bake into the initial prompt

Daniel Yarmoluk

well, one of my students is doing security knowledge graphs, compressing them at the md level as a start. it's not the end, but it will bring fidelity to the front (problems/issues). try it out - I'd be happy to do it and give it back to you

Mykola Kondratiuk

that is actually a clever angle - compressing security knowledge into structured context rather than loose prose instructions. would be curious to see how well it holds up across different types of vulns. the fidelity problem is real, vague rules get interpreted loosely. share it when done

Daniel Yarmoluk

yeah, give me an input, or I can go with your thread, choose the battlefield and weaponry

Mykola Kondratiuk

haha ok - battlefield: a prod codebase that was 100% vibe coded in one weekend. weaponry: a security scanner and a code review. my bet is on the scanner finding something before you finish reading the first file

Sarwar

I think this problem most commonly surfaces in apps that were built in a single go/prompt. When the whole codebase is generated as the output, it can be hard to wrap your head around what to look for and validate. Any "shortcuts" taken by the AI at that point are easy to miss. Combined with the "don't fix it if it ain't broken" mentality and the excitement to ship new apps, it's easy to see how people end up shipping something real but insecure.

Mykola Kondratiuk

Yeah, totally agree. The single-prompt codebases were the worst offenders in what I scanned - you could almost tell by the file structure alone. Everything flat, no real separation, and the AI just... didn't know what it didn't know. Auth mixed with business logic, env vars hardcoded because nothing was telling it otherwise. The "don't know what to validate" part is real too - when you've never had to think about the security surface of a piece of code, you have no mental model for what could go wrong. At least with iterative builds there's usually some moment where a human has to integrate things and maybe notices something feels off.

Daniel Yarmoluk

The framework here is backwards. Let me explain.

Boris Cherny created Claude Code and shipped 300+ PRs in December 2025 running 5 agents simultaneously. His setup is intentionally vanilla - minimal customization, trust the model, let it rip. That works when you're Boris and you've internalized 20 years of engineering patterns. The model fills in what he already knows.

But that's exactly the problem this post exposes. Most developers don't have Boris's mental model. So the AI optimizes for "make it work" with zero weight on security, architecture, or trust boundaries. The model doesn't know what it doesn't know.

The fix isn't scanning after the fact. It's structuring what the model knows before it writes a single line.

I've been building the opposite of Boris's vanilla approach. Instead of trusting the model to figure it out, I compress domain knowledge - security constraints, architectural rules, dependency relationships, trust boundaries - into structured .md files as knowledge graphs. Entities connected by typed relationships. Not prose instructions that get buried. Traversable constraints the model follows like a checklist it can't skip.

Two personas using the same tool:

Boris (creator of Claude Code):

  • Vanilla setup, minimal customization
  • 300+ PRs in a month running 5 agents
  • Trusts the model - his 20 years of engineering IS the context
  • Works because his mental model fills the gaps the AI can't see

Dan (knowledge graph builder):

  • Heavy context architecture - structured .md files as input
  • Domain knowledge compressed to 12KB traversable graphs
  • Doesn't trust the model to know what it doesn't know
  • Works because the graph fills the gaps instead of hoping the developer will

Boris can afford vanilla because he IS the knowledge graph. Most developers aren't Boris. They need the structure externalized.

It's the same pattern everywhere:

Chef vs. recipe follower. A Michelin chef doesn't need a recipe - decades of training IS the context. A home cook needs the recipe or they burn the sauce. The recipe is the knowledge graph. AI is the kitchen.

Surgeon vs. med student. A senior surgeon operates from pattern recognition built over thousands of procedures. A resident needs the checklist. Atul Gawande wrote a whole book about this - structured checklists reduced surgical deaths by 47%. Not because surgeons are bad. Because externalized structure catches what even experts miss under pressure.

Senior dev vs. vibe coder. A senior dev reads AI output and instinctively flags "wait, why does this have admin access?" A vibe coder ships it because nothing broke in dev. That instinct IS a knowledge graph - it's just trapped in one person's head.

The 100 codebases in this post are the home cooks, the residents, the vibe coders. They're not careless. They just don't have the graph externalized yet.

LEAST_PRIVILEGE → APPLIES_TO → Service Accounts
  ↳ RULE: Never default to admin. Scope to minimum required.
ENV_VARS → MUST_USE → .env + .gitignore
  ↳ RULE: Zero hardcoded credentials. Flag and block.
INPUT_VALIDATION → REQUIRED_AT → Every boundary
  ↳ RULE: Sanitize before shell, DB, or API pass-through.
CORS → MUST_BE → Origin-specific
  ↳ RULE: Never wildcard in production.

When those constraints are typed relationships in the model's context, it doesn't "forget" them. It can't take the path of least resistance because the graph constrains the path.
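A typed-relationship block like the one above is also cheap to make machine-checkable. A minimal sketch in Python - the entity and relation names mirror the example, but the loader itself is hypothetical, not Graphify.md's actual format:

```python
# Sketch: loading typed security constraints into a traversable graph.
# The rule set mirrors the example above; the representation is illustrative.

RULES = [
    ("LEAST_PRIVILEGE", "APPLIES_TO", "service_accounts",
     "Never default to admin. Scope to minimum required."),
    ("ENV_VARS", "MUST_USE", "dotenv_plus_gitignore",
     "Zero hardcoded credentials. Flag and block."),
    ("INPUT_VALIDATION", "REQUIRED_AT", "every_boundary",
     "Sanitize before shell, DB, or API pass-through."),
    ("CORS", "MUST_BE", "origin_specific",
     "Never wildcard in production."),
]

def build_graph(rules):
    """Index rules by subject entity so a model (or linter) can traverse them."""
    graph = {}
    for subject, relation, obj, rule in rules:
        graph.setdefault(subject, []).append((relation, obj, rule))
    return graph

def constraints_for(graph, entity):
    """Return the checklist of typed constraints attached to one entity."""
    return [f"{rel} -> {obj}: {rule}" for rel, obj, rule in graph.get(entity, [])]

graph = build_graph(RULES)
for line in constraints_for(graph, "CORS"):
    print(line)
# prints: MUST_BE -> origin_specific: Never wildcard in production.
```

The point of the structure is that every entity carries its rules with it, so a traversal cannot reach "CORS" without also surfacing the wildcard constraint.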

Same Claude. Same model. Different structure going in. Different security posture coming out.

The 100 codebases in this post didn't fail because AI is bad at security. They failed because nobody structured what "secure" means for that specific domain before the model started coding.

The numbers from my builds: 3M+ lines of code in 50 days. Solo. On Claude Code. 170x token density when compressing domain knowledge to .md. 93% token compression. ~3,000 tokens instead of ~500,000 for the same reasoning quality. 99.4% lower carbon per query.

That's what I build at Graphify.md β€” domain knowledge graphs compressed to portable .md. Works in Claude, ChatGPT, Cursor, anywhere. Security is just one domain.

graphifymd.com

Mykola Kondratiuk

yeah this is basically the crux of it. the model inherits your taste - if you have good taste, output is good. if you don't, the model confidently produces something that looks right but isn't. the security holes we found were exactly that pattern, generated code that looked professional but was missing the instincts a senior dev would have applied automatically

Daniel Yarmoluk

love on the model Mykola, context is love

Mykola Kondratiuk

haha context is love until it forgets everything past 200k tokens and you are back to square one

Pixeliro

That makes a lot of sense.

What you're describing reminds me of how tools like Google's Stitch are approaching UI generation - instead of generating arbitrary code, they start from structured HTML/design foundations and then build on top of that.

It feels like the same idea in a different domain:
not letting the system generate freely, but constraining it within a known structure from the beginning.

In your case, it's about trust and permissions.
In UI systems, it's about tokens, semantics, and layout constraints.

Both are trying to solve the same underlying problem:
reduce the space of "unsafe" decisions before they ever make it into production.

I'm starting to think the future isn't better generation - it's better constraint systems around generation.

Curious if you see this evolving toward more "structured-first" AI tools rather than free-form ones.

Mykola Kondratiuk

that Stitch parallel is actually really sharp. constrained generation - whether it's design tokens or permission templates - forces the model to work within a safe envelope instead of hallucinating its way into free-form territory. I wonder if the same idea applies to infra: give the AI a library of known-good IAM snippets to compose from rather than letting it generate policies from scratch. less creative, but the blast radius when it's wrong is way smaller
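The compose-from-snippets idea can be sketched in a few lines. The snippet library and policy shape below are hypothetical, not any real cloud provider's IAM format:

```python
# Sketch: composing access policies only from vetted, known-good snippets
# instead of free-form generation. Names and shapes are illustrative.

VETTED_SNIPPETS = {
    "read_own_bucket": {"action": "storage:GetObject", "resource": "bucket/{app}/*"},
    "write_logs":      {"action": "logs:PutEvents",    "resource": "logs/{app}"},
}

def compose_policy(app: str, snippet_names: list) -> list:
    """Build a policy from vetted snippets; unknown names are rejected outright."""
    policy = []
    for name in snippet_names:
        if name not in VETTED_SNIPPETS:
            raise ValueError(f"unvetted snippet: {name}")
        s = VETTED_SNIPPETS[name]
        policy.append({"action": s["action"],
                       "resource": s["resource"].format(app=app)})
    return policy

print(compose_policy("myapp", ["read_own_bucket"]))
# prints: [{'action': 'storage:GetObject', 'resource': 'bucket/myapp/*'}]
```

The model can still choose which snippets to combine, but it can never emit an action the library does not contain - that is the smaller blast radius.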

Daniel Yarmoluk

Why didn't the industry pay attention to graph databases more? They were always the better model, but very few, very skilled professionals map entire ontologies. Palantir does, Bloomberg does - big money. Now if your knowledge graph structure is compressed time and time again, reduced to 10-15kb, you're pushing more context through the context window and hence more fidelity. Can I get the Jensen Huang engineer from 250k to 5k? Context efficiency, reasoning hops, context density. I'd like to graph it and give you the files to play with in any environment. That's my proposal. However, I will not use search engines personally - think about what that means.

Mykola Kondratiuk

graph databases had the tooling problem for a long time - the queryability and dev experience lagged behind relational. LLMs are actually changing that equation because natural language queries map better to graph traversal. compressed ontologies as context is a real pattern, especially for domain-heavy apps where the graph IS the knowledge

Pixeliro

That's a really sharp extension of the idea. Constrained generation feels like the common pattern here - whether it's design tokens, UI components, or permission templates, the goal is to keep the model operating inside a safe envelope instead of letting it hallucinate freely.

Applying this to infra makes a lot of sense. Composing IAM policies from a library of validated snippets is far more predictable than generating them from scratch.

You lose some flexibility, but the reduction in blast radius when things go wrong is absolutely worth it in production.

Mykola Kondratiuk

composing from validated snippets is the right direction. basically policy-as-code with a model-friendly interface. you get the expressiveness of natural language generation but bounded by proven primitives - audit trail becomes way cleaner too

arun rajkumar

The trust misconfiguration pattern is painfully familiar - and not just from AI-generated code. We run 15 microservices at a fintech startup, and before we centralised our env management, we had the exact same issues with human-written code: DB connection strings with full admin creds, no connection pooling limits, env files copy-pasted over Slack.

The fix was a shared Zod schema that validates every env var on startup. Dev-local mode warns, production mode calls process.exit(1). It catches the "fastest path to working" problem before it ships. AI just makes that anti-pattern happen faster. The real gap is that nobody validates the boring infrastructure decisions - human or AI.
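For readers outside the TypeScript world: Zod is a TypeScript validation library, but the same fail-fast startup pattern is easy to sketch in plain Python. The schema, variable names, and rules here are illustrative, not the commenter's actual setup:

```python
# Sketch of a fail-fast env check: validate every declared variable on
# startup, warn in dev, refuse to boot in production.
import sys

# name -> (required, validator)
ENV_SCHEMA = {
    "DATABASE_URL": (True,  lambda v: v.startswith("postgres://")),
    "PORT":         (True,  lambda v: v.isdigit()),
    "LOG_LEVEL":    (False, lambda v: v in {"debug", "info", "warn", "error"}),
}

def validate_env(env: dict) -> list:
    """Check every declared var against the schema; return readable problems."""
    problems = []
    for name, (required, ok) in ENV_SCHEMA.items():
        value = env.get(name)
        if value is None:
            if required:
                problems.append(f"missing required env var: {name}")
        elif not ok(value):
            problems.append(f"invalid value for {name}: {value!r}")
    return problems

def check_on_startup(env: dict, production: bool) -> None:
    """Dev-local mode warns; production mode refuses to start."""
    problems = validate_env(env)
    for p in problems:
        print("env:", p, file=sys.stderr)
    if problems and production:
        sys.exit(1)

check_on_startup({"DATABASE_URL": "postgres://db", "PORT": "3000"}, production=True)
```

The key property is the same as in the Zod version: a misconfigured service cannot reach "running" state in production, so the fastest-path-to-working shortcut gets caught at boot rather than in an incident.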

Mykola Kondratiuk

the human-written fintech example is actually the better data point tbh - same patterns, just slower velocity. AI just makes it impossible to ignore anymore because it surfaces 3x faster. centralising env management was the right call regardless of how you got there

arun rajkumar

Exactly right. The AI just made it impossible to ignore what was always there. We'd been accumulating env variable drift for years - every new service copied from the last one with slight modifications nobody documented. An AI agent flagged 23 inconsistencies in minutes. A human would've found the same things eventually, just spread over months of "hmm, that's weird" moments. The centralised env management was the real fix. The AI was just the mirror that made us stop pretending the mess wasn't there.

Mykola Kondratiuk

23 inconsistencies in minutes is a good example of the pattern - not finding new problems, just surfacing old ones faster. The "hmm that is weird" moments that used to disappear into the backlog now come back with receipts. Centralized env management makes sense as the fix but the harder part is usually the cultural shift: actually enforcing it after you find the problems.

arun rajkumar

We added a drift checker in PR and pre-commit hooks, so instead of relying on cultural alignment we made it an automated process - which prevents this from ever happening again.
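A drift checker like that can be surprisingly small. A minimal sketch, assuming each service ships a .env.example and there is one shared schema file - the format and file layout are illustrative:

```python
# Sketch of a pre-commit style env drift check: every service must declare
# exactly the keys in the shared schema, no more and no less.

def parse_env_keys(text: str) -> set:
    """Extract KEY names from NAME=value lines, ignoring comments and blanks."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys

def drift(shared_schema: str, service_env: str) -> dict:
    """Report keys missing from the service and undocumented extras."""
    want, have = parse_env_keys(shared_schema), parse_env_keys(service_env)
    return {"missing": sorted(want - have), "extra": sorted(have - want)}

report = drift("DATABASE_URL=\nPORT=\n", "PORT=3000\nDEBUG=1\n")
print(report)  # {'missing': ['DATABASE_URL'], 'extra': ['DEBUG']}
```

Wired into pre-commit, a non-empty report just exits non-zero - the same "automated constraint, not cultural ask" idea.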

Mykola Kondratiuk

That is the right call - baking it into the pipeline so it is not a cultural ask anymore. The cultural alignment approach works until someone is rushing a deploy at 11pm. Automated constraint is the only version that holds under pressure.

Tomik T

Great writeup. Security in vibe coded projects is the elephant in the room that nobody wants to talk about. I run a small hosting platform and we added automated YARA scanning on every single deploy exactly because of this. You would not believe the stuff people upload: hardcoded API keys, SQL injection galore, sometimes even test credentials for production databases. The AI generates code that works but it has zero concept of security. I think every hosting provider that targets vibe coders needs to have some form of automated scanning built in, it's not optional anymore

Mykola Kondratiuk

honestly the YARA scanning is smart - we ended up doing something similar for VibeCheck, scanning for patterns that scream "AI wrote this and forgot security exists". the hardcoded credentials thing you mentioned is everywhere, I found them in like 40% of the repos I scanned, sometimes buried 3 folders deep like the dev thought obscurity = security. tbh I think you're right that hosting platforms need to build this in because most vibe coders won't think about it until something breaks. the AI just doesn't model threat scenarios, it models "make the tests pass"
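The pattern-scanning approach can be sketched with a few regexes. These rules are illustrative stand-ins, not VibeCheck's or YARA's actual signatures:

```python
# Sketch of a pattern-based scan for hardcoded credentials, loosely in the
# spirit of YARA rules. The patterns are illustrative examples only.
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"""(?i)api[_-]?key\s*=\s*['"][A-Za-z0-9]{16,}['"]"""),
    "private_key": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
}

def scan_source(text: str) -> list:
    """Return (line_number, rule_name) for every line matching a secret pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

code = 'db = connect()\nAPI_KEY = "abcd1234abcd1234abcd"\n'
print(scan_source(code))  # [(2, 'generic_api_key')]
```

Regex scanning has false positives and misses, which is why real scanners layer entropy checks and context on top - but even this much catches the "key pasted straight into source" case.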

Pavel Ishchin

YARA at deploy is smart but what happens when it catches something? Do you block the deploy entirely or just flag it? Asking because the article mentions that most of these issues aren't malicious, they're trust misconfigurations. Blocking a deploy for a hardcoded key makes sense. Blocking it for open CORS might be too aggressive if the dev intended it for a public API. Where do you draw the line on what's a hard block vs a warning?

Mykola Kondratiuk

the block vs flag question is real and context-dependent. hardcoded secrets = block, no debate. open CORS or overpermissioned IAM - I would flag with severity + required acknowledgment rather than hard block. the key is forcing a conscious decision rather than hoping the dev noticed. "I know this is open CORS and I intended it" is a very different state than it slipping through unreviewed
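That block-vs-acknowledge split can be expressed as a tiny deploy gate. The finding categories and actions below are illustrative, not VibeCheck's actual policy:

```python
# Sketch of a block-vs-flag deploy gate: hard blocks for secrets, explicit
# acknowledgment required for risky-but-sometimes-intentional findings.

POLICY = {
    "hardcoded_secret":   "block",        # never deployable
    "open_cors":          "acknowledge",  # allowed only with a conscious sign-off
    "overbroad_iam":      "acknowledge",
    "missing_rate_limit": "warn",         # surfaced but never gating
}

def gate_deploy(findings: list, acknowledged: set) -> tuple:
    """Return (allowed, reasons). Acknowledged findings pass; blocks never do."""
    reasons = []
    for f in findings:
        action = POLICY.get(f, "warn")
        if action == "block":
            reasons.append(f"{f}: hard block")
        elif action == "acknowledge" and f not in acknowledged:
            reasons.append(f"{f}: needs explicit acknowledgment")
    return (not reasons, reasons)

print(gate_deploy(["open_cors"], acknowledged={"open_cors"}))  # (True, [])
print(gate_deploy(["hardcoded_secret"], acknowledged=set()))
# (False, ['hardcoded_secret: hard block'])
```

The acknowledgment set is the interesting part: it turns "the dev hopefully noticed" into a recorded decision that shows up in the audit trail.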

Daniel Yarmoluk

Who are you worried about? Like, you're worried for what group of people? The security company that watches what?

Mykola Kondratiuk

mostly worried about the builders themselves - a lot of vibe coders ship without ever looking at what permissions their apps are granting or what data they are exposing. no malicious intent, just not in the habit of thinking about attack surface

wong2 kim

As someone who builds iOS apps entirely with AI-assisted development, the trust misconfiguration pattern is painfully real. I've caught myself shipping overly permissive Supabase RLS policies because the AI-generated code "just worked" in dev.

The gap between "works in dev" and "safe for real users" is exactly what keeps me up at night. My approach has been adding a manual security review checkpoint before every App Store submission - but that doesn't scale well when you're shipping multiple apps.

The idea of background tooling that catches these issues during the coding session (not after) feels like the right direction. Would love to try VibeCheck on my projects.

Mykola Kondratiuk

yeah the Supabase RLS thing hit me too. had a policy that was basically true for reads because the AI wanted to make the demo work fast. caught it only when I read through the migration files line by line before launch. now I ask the AI to specifically audit the policies after writing them - weirdly works better than asking it to write secure ones in the first place. still manual but at least it's a separate pass with a different prompt

Daniel Yarmoluk

We can knowledge graph these solutions. I cannot say what will or will not work, but I can help get more efficient context architecture to assist in your solution development. We all can assist each other in solving big problems together.

Mykola Kondratiuk

collective intelligence angle is interesting. the challenge is keeping context sharp - more nodes can help or hurt depending on how well the graph is structured

Daniel Yarmoluk

Honestly, I'd rather have a recorded conversation, a real conversation, and load that text into the model. Not tax my prefrontal cortex to write a prompt. I'm too dumb

Mykola Kondratiuk

conversation as context is underrated. transcripts carry nuance that prompts lose. the "too dumb to prompt" framing is honest - most people think in conversation not structured instructions

Daniel Yarmoluk

You can just give me a call

Mykola Kondratiuk

haha lets keep it in the comments for now - more interesting for others to read too

Daniel Yarmoluk

I'm going to give this to you in a few hours

Mykola Kondratiuk

no rush

Daniel Yarmoluk

Mykola - here it is. I owe you a proper response, not scattered comments from behind the wheel.

the thread earlier was raw signal - fragments of an idea typed between errands. this is what happens when I actually sit down and structure it. a page of structured context enables 170 reasoning paths. a few comments in a thread enable maybe 3. you deserve the full page.

you said: "battlefield: a prod codebase that was 100% vibe coded in one weekend. weaponry: a security scanner and a code review. my bet is on the scanner finding something before you finish reading the first file."

I accept. but here's the twist.

I'm going to vibe code a real production-ready app this weekend. 100% AI-generated, one weekend, shipped. but before I write a single line, I'm loading a security knowledge graph built from YOUR findings into the model's context. your article, your comment thread, your data - structured as typed relationships with rules, constraints, and the vulnerability patterns you found across 100 codebases.

your own data, aimed back at your scanner.

then you scan it with VibeCheck. if your scanner finds vulns - you win, the scanner beats the graph. if it comes back clean - the graph wins, structured context prevented what scanning would have caught after the fact.

either way, we both learn something real. and the experiment is documented.

I took your article + this thread + your twitter findings and compressed them into a knowledge graph. 58 entities, 89 typed relationships, 8 layers, ~320 reasoning paths. paste it into any model and ask it anything about your security domain.

but it's not just a diagnostic. layers 6-8 are solutions - things you can actually build, including the "something watching in the background" you said nobody's solved yet (spoiler: it's solved - Claude Code hooks do exactly this).

there are also 5 weak signals I found in your own words that you might not have noticed the significance of. your data told a bigger story than the article captured. the graph connected them.

I already ran the A/B test before posting. same model (ChatGPT, extended thinking), same prompt: "what's my minimum security checklist for shipping a vibe-coded app this weekend?"

flat article: generic 11-item OWASP checklist. zero references to your actual findings. zero named vulnerability patterns from your data. the model ignored your article and fell back to training data.

knowledge graph: 3-tier ship/no-ship decision framework. 16 identifiable graph traversals. 6 named patterns from your scans (open CORS, overprivileged accounts, hardcoded creds, unsanitized passthrough, f-string SQL, secrets in source). 4 direct citations of your data. referenced your failure modes by name.

same model. same prompt. 16:0 on domain-specific insights. the flat version didn't score lower - it didn't score.

the full math is in Layer 10 of the knowledge graph. run the same test yourself - paste your flat article into one chat, paste the KG into another, same question, compare. that's the proof.

10 layers. 78 entities. 124 typed relationships. ~450 reasoning paths. from one page of structured .md - the same token budget as your flat article, but 170x more ways for a model to reason through it. that's the claim. the A/B test is the proof. the knowledge graph is in my next comment.

and when you're ready to define the battlefield - framework, scope, what "production-ready" means to you - tell me. I'll build it live, KG-loaded, and you scan the result. loser buys the other one a coffee via the internet.

let's go. 🇺🇦

We both win this way, I would argue... but I need more context from you, when you're the one concerned with context architecture.

Jace Altgen

This is super interesting to validate statistically. Having security principles in prompts definitely helps, and so does having at least a rough idea of the architecture of your application. Now, I get that that's exactly what some people can't define, because they aren't experts. But let's say you're writing a piece of Python tooling: tell it which version you want, that you want pyproject.toml instead of requirements.txt, some file structure you'd typically want, that you want to use pydantic for settings. And more importantly, what you don't want it to do. In my experience this increases the chance of the output being easier to validate and check, because it aligns with my mental model of how the tool should probably work. Knowing where to look becomes easier.

Mykola Kondratiuk

yeah the "you have to know what you want" problem is real. i ran into this exact wall - if you can articulate the architecture you already kind of know what you are doing. the devs who get burned the hardest are the ones who don't have that mental model yet, so they just accept whatever the model outputs without questioning it. the pyproject.toml type of constraint helps a lot though, having a checklist of non-negotiables you paste in every time basically forces a floor on quality
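A checklist of non-negotiables can even be enforced mechanically after generation. A minimal sketch that checks a generated project against a couple of file-layout rules - the rule set here is illustrative, pick your own:

```python
# Sketch of a "floor on quality" check: verify a generated project against
# a pasted-in list of non-negotiables. The rules are example placeholders.

NON_NEGOTIABLES = {
    "must_exist":     ["pyproject.toml"],   # modern packaging, not ad-hoc
    "must_not_exist": ["requirements.txt"], # the thing we told the model to avoid
}

def check_project(files: set) -> list:
    """Return violations of the non-negotiable file-layout rules."""
    violations = []
    for f in NON_NEGOTIABLES["must_exist"]:
        if f not in files:
            violations.append(f"missing: {f}")
    for f in NON_NEGOTIABLES["must_not_exist"]:
        if f in files:
            violations.append(f"forbidden: {f}")
    return violations

print(check_project({"pyproject.toml", "src/tool/__init__.py"}))  # []
print(check_project({"requirements.txt"}))
# ['missing: pyproject.toml', 'forbidden: requirements.txt']
```

The same idea extends past file layout (banned imports, required settings module, etc.) - the point is that the checklist lives outside any one prompt, so it applies even when the model "forgets".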

Daniel Yarmoluk

I found Dev.to last night. I don't even know why I'm here - it feels like setting myself up to be rejected by people who certainly won't help me. But again, I'm trying my best to show what, in my opinion, has merit.

Mykola Kondratiuk

hey, for what it's worth - your takes have merit. the graph / context angle is genuinely interesting, it pushed me to think differently. dev.to is worth sticking around for

Daniel Yarmoluk

Or better yet - why don't I give you a compressed md file of a graph database of your request on here later this evening? Then you go work with it and get more context out of it, like I want to

Mykola Kondratiuk

sounds good

Sloan, the sloth mascot
Comment deleted
Mykola Kondratiuk

hey, probably worth deleting that - public comments are indexed. keep it here