DEV Community

Mykola Kondratiuk

I Scanned 100 AI Codebases - Here's What I Found

I've been building VibeCheck for the past few months - it's a security scanner specifically for AI-generated code. And after scanning over a hundred real codebases that people built with Cursor, Copilot, Claude, and various other AI tools, I have thoughts.

Not the "AI is dangerous" hot take. Something more specific than that.

The pattern that kept showing up

Almost every codebase had the same category of issue. Not SQL injection or XSS or anything that would show up in a classic OWASP checklist. The dominant problem was what I started calling trust misconfigurations - places where the code just... assumed everything was fine.

Open CORS policies. Service accounts with admin permissions because that was the fastest path to getting it working. API keys hardcoded in config files that weren't in .gitignore. Input that got passed straight into shell commands with no sanitization.

None of it was malicious. The AI wasn't trying to introduce vulnerabilities. It was just optimizing for "make it work" and had zero weight on "make it survivable in production."
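To make the shell-command case concrete, here is a minimal Python sketch of the anti-pattern and the boring fix. The grep example is invented for illustration, not lifted from any scanned repo:

```python
import shlex
import subprocess

def run_grep_unsafe(pattern: str, path: str) -> None:
    # The anti-pattern: user input interpolated into a shell string.
    # A "pattern" like "x; rm -rf ~" smuggles in a second command.
    subprocess.run(f"grep {pattern} {path}", shell=True)

def run_grep_safe(pattern: str, path: str) -> subprocess.CompletedProcess:
    # Argument-list form: no shell is involved, so metacharacters in
    # the pattern reach grep as literal text instead of being executed.
    # The "--" also stops a hostile pattern from being parsed as a flag.
    return subprocess.run(["grep", "--", pattern, path],
                          capture_output=True, text=True)

def quoted(pattern: str) -> str:
    # If a shell string is truly unavoidable, quote each piece.
    return shlex.quote(pattern)
```

The same list-of-arguments idea exists in every mainstream runtime; the point is that "make it work" and "make it survivable" are one line apart.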

The thing that surprised me most

I expected the biggest problems in the actual logic - like the AI misunderstanding authentication flows or getting crypto wrong. That exists too, but it's not the main thing.

The main thing is environmental. All these tiny decisions about permissions and access and trust that a senior dev would make automatically, almost subconsciously, because they've been burned before - the AI just doesn't make those decisions. It picks the path of least resistance every time.

One project had a DB connection string with full admin creds, no connection pooling limits, and a query that accepted raw user input. Technically functional. Completely fine for local dev. The kind of thing that gets quietly exploited six months after launch.
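For anyone who has not seen that failure mode up close, here is a self-contained sketch of the raw-input query next to the parameterized version. It uses sqlite3 purely for illustration; the post does not name the project's actual driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

def find_user_unsafe(name: str) -> list:
    # The scanned pattern: raw input spliced into SQL. Passing
    # "x' OR '1'='1" as the name returns every row in the table.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str) -> list:
    # Placeholder binding: the driver treats name as data, never as SQL.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```

Both versions are "technically functional" in dev, which is exactly why the difference never surfaces until someone goes looking.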

What actually helps

Scanning after the fact (what we do with VibeCheck) catches the obvious stuff. But the real fix is earlier in the loop.

The projects with the fewest issues were the ones where the developer was actually paying attention during generation - not just accepting output wholesale but reading it, asking "wait, why does this need admin access?" That friction, even a little bit of it, makes a big difference.

Some people are building this into their prompts - explicitly telling the AI to follow least-privilege principles, to validate all inputs, to not hardcode credentials. It works okay, but it feels like a workaround.

The better solution is probably tooling that runs in the background during vibe coding sessions and flags stuff in real time. Not a code review gate. Just... something watching.

The uncomfortable part

A lot of these codebases were shipped. Some had real users. A few were running in production environments with actual credentials and real data.

The developers weren't careless people. Most of them were genuinely excited about what they'd built - and most of what they built was genuinely cool. The security stuff just wasn't on their radar because it never came up during development. Nothing broke. Tests passed. It worked on their machine.

I keep thinking about that gap. Between "works fine in dev" and "safe to run with real users." AI coding tools are really good at closing the first gap - getting something functional fast. Nobody's really solved the second one yet.

That's the problem I'm trying to figure out. Not sure I have it yet. But the 100 codebases were pretty clarifying.


If you're using AI to build things and want to know what the scanner finds in your repo, VibeCheck is live. Free tier, no credit card. Takes about 2 minutes.

Top comments (89)

Daniel Yarmoluk

Can't this be mitigated by adding more context into model?

Mykola Kondratiuk

it helps but only partially. i tried this - added security guidelines, told it to never hardcode secrets, use parameterized queries, etc. the model follows it when the task is clearly framed that way. but when it is doing something else (like "add this feature") it just... forgets. the security context doesn't carry over into every decision. what i found works better is having a separate security review pass after generation - treat it like a code review step, not something you bake into the initial prompt

Daniel Yarmoluk

well, one of my students is doing security knowledge graphs and compressing at the md level as a start, it's not the end, but it will bring fidelity to the front (problems/issues), try it out, I'd be happy to do it and give it back to you

Mykola Kondratiuk

that is actually a clever angle - compressing security knowledge into structured context rather than loose prose instructions. would be curious to see how well it holds up across different types of vulns. the fidelity problem is real, vague rules get interpreted loosely. share it when done

Daniel Yarmoluk

yeah, give me an input, or I can go with your thread, choose the battlefield and weaponry

Mykola Kondratiuk

haha ok - battlefield: a prod codebase that was 100% vibe coded in one weekend. weaponry: a security scanner and a code review. my bet is on the scanner finding something before you finish reading the first file

Daniel Yarmoluk

You can have a highly compressed structured knowledge graph that knows structures of relationships that’s way more reasoning hops.

Mykola Kondratiuk

fair point on reasoning hops - that is a real advantage for knowledge-dense domains. different problem space than what i was scanning for though

 
Daniel Yarmoluk

What language are you speaking? What’s your problem?

Daniel Yarmoluk

First of all, this is not a problem I give a crap about philosophically, so I don’t work on that. You probably have these nuances that could be really graphed out. Are you graph aware? Are you context aware? How much reasoning hops or token density have you figured? You are married to the narrow mindedness that a relational database is any good, or legacy dogfood for gluttonous corporations?

Mykola Kondratiuk

haha no problem at all - this is just a good debate. i build things, you challenge assumptions, that is how it should work

Daniel Yarmoluk

Perfect! Slava Ukraine

Daniel Yarmoluk

No I build things. I have 3 million lines of code, 4 million tokens on one Mac mini on one max plan per month. I compress graph knowledge. I have 50 problems I’m solving. I’m not building?

Mykola Kondratiuk

Slava Ukraini 🇺🇦

 
Daniel Yarmoluk

But as a man of good sport and international goodwill, I accept the challenge to learn together. Mario Andretti said, if you seem like you got things under control, you’re not going fast enough. Speed kills. Probabilistically speaking, that is, in my little vibe coding mind.

Mykola Kondratiuk

graph awareness helps but most vibe coders are not reaching for neo4j, they are pasting code into chatgpt and clicking deploy. the relational db critique is fair for complex domains but the security issues i found were not about data modeling - hardcoded secrets, no input sanitization, open cors. those are not architecture problems

Daniel Yarmoluk

Create Claude skills to knowledge graph scanner, reduce to the MD level second knowledge graph the hardcoded secrets reduce those anomalies to an md level. Keep going. More edges and nodes and directions of our relationships in textured way. I have calculated 170x more reasoning hops than flat files. There’s math in interconnected relationships, which sounds spiritual. 🙏

Daniel Yarmoluk

Open claw? I never left my terminal as I was building things. If I solve problems, I don’t care what anybody says, right? It’s who cares who will win this game = context. Power to the people

Mykola Kondratiuk

170x is a bold claim - curious how you measured it. the interconnected relationships angle does have something to it though

Daniel Yarmoluk

Graphifymd.com

Daniel Yarmoluk

Or better, why don’t I give you a compressed md file of a graph database of your request on here later this evening? Then you go work with it and get more context out of it, like I want to

Mykola Kondratiuk

will check it out

 
Daniel Yarmoluk

Or better why don’t I give you a compressed md file of a graph database of your request on here later this evening? Then you go work with it and get more context out of it, like I want to

Mykola Kondratiuk

yeah drop it when ready, genuinely curious what that looks like

Daniel Yarmoluk

Ok I’m driving. I need a few hours. I’m running errands

Mykola Kondratiuk

take your time, no hurry

 
Daniel Yarmoluk

But I need more context. Describe those signals you sense, we will get more fidelity. Describe your friction right now with this process, tell me what you wrestle with = context. More is better. Conversations recorded or voice notes are better than dumb prompts. Graph building skills are better than your representation, use your voice

Mykola Kondratiuk

main friction was that vibe-coded apps failed in ways that weren't obvious until you looked - surface looked fine, internals were not. that's the signal worth capturing

Comment deleted
Daniel Yarmoluk

Precious/Oliver, who are you asking this?

Mykola Kondratiuk

not sure I follow - looks like this comment thread got a bit tangled. feel free to reach out directly if there is something specific you wanted to discuss

Sarwar

I think this problem would most commonly surface in apps that were built in a single go/prompt. When the whole codebase is generated as the output, it can be hard to wrap your head around what to look for and validate. Any "shortcuts" taken by the AI at that point are easy to miss. Combined with the "don't fix it if it ain't broken" mentality and the excitement to ship new apps, it's easy to see how people end up shipping something real but insecure.

Mykola Kondratiuk

Yeah, totally agree. The single-prompt codebases were the worst offenders in what I scanned - you could almost tell by the file structure alone. Everything flat, no real separation, and the AI just... didn't know what it didn't know. Auth mixed with business logic, env vars hardcoded because nothing was telling it otherwise. The "don't know what to validate" part is real too - when you've never had to think about the security surface of a piece of code, you have no mental model for what could go wrong. At least with iterative builds there's usually some moment where a human has to integrate things and maybe notices something feels off.

Daniel Yarmoluk

The framework here is backwards. Let me explain.

Boris Cherny created Claude Code and shipped 300+ PRs in December 2025 running 5 agents simultaneously. His setup is intentionally vanilla — minimal customization, trust the model, let it rip. That works when you're Boris and you've internalized 20 years of engineering patterns. The model fills in what he already knows.

But that's exactly the problem this post exposes. Most developers don't have Boris's mental model. So the AI optimizes for "make it work" with zero weight on security, architecture, or trust boundaries. The model doesn't know what it doesn't know.

The fix isn't scanning after the fact. It's structuring what the model knows before it writes a single line.

I've been building the opposite of Boris's vanilla approach. Instead of trusting the model to figure it out, I compress domain knowledge — security constraints, architectural rules, dependency relationships, trust boundaries — into structured .md files as knowledge graphs. Entities connected by typed relationships. Not prose instructions that get buried. Traversable constraints the model follows like a checklist it can't skip.

Two personas using the same tool:

Boris (creator of Claude Code):

  • Vanilla setup, minimal customization
  • 300+ PRs in a month running 5 agents
  • Trusts the model — his 20 years of engineering IS the context
  • Works because his mental model fills the gaps the AI can't see

Dan (knowledge graph builder):

  • Heavy context architecture — structured .md files as input
  • Domain knowledge compressed to 12KB traversable graphs
  • Doesn't trust the model to know what it doesn't know
  • Works because the graph fills the gaps instead of hoping the developer will

Boris can afford vanilla because he IS the knowledge graph. Most developers aren't Boris. They need the structure externalized.

It's the same pattern everywhere:

Chef vs. recipe follower. A Michelin chef doesn't need a recipe — decades of training IS the context. A home cook needs the recipe or they burn the sauce. The recipe is the knowledge graph. AI is the kitchen.

Surgeon vs. med student. A senior surgeon operates from pattern recognition built over thousands of procedures. A resident needs the checklist. Atul Gawande wrote a whole book about this — structured checklists reduced surgical deaths by 47%. Not because surgeons are bad. Because externalized structure catches what even experts miss under pressure.

Senior dev vs. vibe coder. A senior dev reads AI output and instinctively flags "wait, why does this have admin access?" A vibe coder ships it because nothing broke in dev. That instinct IS a knowledge graph — it's just trapped in one person's head.

The 100 codebases in this post are the home cooks, the residents, the vibe coders. They're not careless. They just don't have the graph externalized yet.

LEAST_PRIVILEGE → APPLIES_TO → Service Accounts
  ↳ RULE: Never default to admin. Scope to minimum required.
ENV_VARS → MUST_USE → .env + .gitignore
  ↳ RULE: Zero hardcoded credentials. Flag and block.
INPUT_VALIDATION → REQUIRED_AT → Every boundary
  ↳ RULE: Sanitize before shell, DB, or API pass-through.
CORS → MUST_BE → Origin-specific
  ↳ RULE: Never wildcard in production.

When those constraints are typed relationships in the model's context, it doesn't "forget" them. It can't take the path of least resistance because the graph constrains the path.

Same Claude. Same model. Different structure going in. Different security posture coming out.

The 100 codebases in this post didn't fail because AI is bad at security. They failed because nobody structured what "secure" means for that specific domain before the model started coding.

The numbers from my builds: 3M+ lines of code in 50 days. Solo. On Claude Code. 170x token density when compressing domain knowledge to .md. 93% token compression. ~3,000 tokens instead of ~500,000 for the same reasoning quality. 99.4% lower carbon per query.

That's what I build at Graphify.md — domain knowledge graphs compressed to portable .md. Works in Claude, ChatGPT, Cursor, anywhere. Security is just one domain.

graphifymd.com

Mykola Kondratiuk

yeah this is basically the crux of it. the model inherits your taste - if you have good taste, output is good. if you don't, the model confidently produces something that looks right but isn't. the security holes we found were exactly that pattern, generated code that looked professional but was missing the instincts a senior dev would have applied automatically

Daniel Yarmoluk

love on the model Mykola, context is love

Mykola Kondratiuk

haha context is love until it forgets everything past 200k tokens and you are back to square one

Daniel Yarmoluk

just like those pretty ladies in Kiev, back to square one after all those tokens gone, my life story.

Mykola Kondratiuk

haha too real - context window as a metaphor for life, hits different at 2am

Daniel Yarmoluk

So are you pleased with result? I’m happy if you’re happy.

Pavel Ishchin

The excitement to ship is the part that gets me. I have shipped stuff I knew wasn't fully reviewed because the demo was working and the window felt short. It's not even ignorance at that point, it's a conscious tradeoff you make and then just hope nothing happens. The difference with vibe-coded apps is the developer might not even know the tradeoff exists. They think the app is fine because nothing broke during testing. At least when I cut corners I know where I cut them.

Mykola Kondratiuk

the invisible tradeoff is exactly it. "nothing broke yet" and "nothing is broken" feel the same from the inside. experienced devs at least know they made a tradeoff. vibe coders might not even know the category of risk exists

Pixeliro

That makes a lot of sense.

What you’re describing reminds me of how tools like Google’s Stitch are approaching UI generation — instead of generating arbitrary code, they start from structured HTML/design foundations and then build on top of that.

It feels like the same idea in a different domain:
not letting the system generate freely, but constraining it within a known structure from the beginning.

In your case, it’s about trust and permissions.
In UI systems, it’s about tokens, semantics, and layout constraints.

Both are trying to solve the same underlying problem:
reduce the space of “unsafe” decisions before they ever make it into production.

I’m starting to think the future isn’t better generation — it’s better constraint systems around generation.

Curious if you see this evolving toward more “structured-first” AI tools rather than free-form ones.

Mykola Kondratiuk

that Stitch parallel is actually really sharp. constrained generation - whether it's design tokens or permission templates - forces the model to work within a safe envelope instead of hallucinating its way into free-form territory. I wonder if the same idea applies to infra: give the AI a library of known-good IAM snippets to compose from rather than letting it generate policies from scratch. less creative, but the blast radius when it's wrong is way smaller
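That composition idea can be sketched in a few lines. Everything here is hypothetical - the template names, the policy shapes - but it shows the point: the generator may only select and parameterize vetted statements, never emit raw policy:

```python
# Hypothetical library of known-good policy templates. The generator can
# only pick from these by name; unknown names fail loudly.
POLICY_TEMPLATES = {
    "read_bucket": {"Effect": "Allow", "Action": ["s3:GetObject"],
                    "Resource": "arn:aws:s3:::{bucket}/*"},
    "write_queue": {"Effect": "Allow", "Action": ["sqs:SendMessage"],
                    "Resource": "arn:aws:sqs:*:*:{queue}"},
}

def compose_policy(picks: list) -> dict:
    """Build a policy from (template_name, params) pairs."""
    statements = []
    for name, params in picks:
        if name not in POLICY_TEMPLATES:
            raise ValueError(f"unknown template: {name}")  # small blast radius
        tpl = POLICY_TEMPLATES[name]
        statements.append({**tpl, "Resource": tpl["Resource"].format(**params)})
    return {"Version": "2012-10-17", "Statement": statements}
```

Less expressive than free-form generation, but the worst case is a rejected template name, not a wildcard admin policy.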

Daniel Yarmoluk

Why didn’t the industry pay attention to graph databases more? They were always the better option, but very few very skilled professionals map entire ontologies. Palantir does, Bloomberg does, big money. Now if your knowledge graph structure is compressed time and time again, reduced to 10-15kb, you’re pushing more context through the context window and hence more fidelity. I can get the Jensen Huang engineer 250k to 5k? Context efficiency, reasoning hops, context density. I’d like to graph and give the files for you to play with in any environment. That’s my proposal. However, I will not use search engines personally, think what that means.

Mykola Kondratiuk

graph databases had the tooling problem for a long time - the queryability and dev experience lagged behind relational. LLMs are actually changing that equation because natural language queries map better to graph traversal. compressed ontologies as context is a real pattern, especially for domain-heavy apps where the graph IS the knowledge

Pixeliro

That’s a really sharp extension of the idea. Constrained generation feels like the common pattern here — whether it’s design tokens, UI components, or permission templates, the goal is to keep the model operating inside a safe envelope instead of letting it hallucinate freely.

Applying this to infra makes a lot of sense. Composing IAM policies from a library of validated snippets is far more predictable than generating them from scratch.

You lose some flexibility, but the reduction in blast radius when things go wrong is absolutely worth it in production.

Mykola Kondratiuk

composing from validated snippets is the right direction. basically policy-as-code with a model-friendly interface. you get the expressiveness of natural language generation but bounded by proven primitives - audit trail becomes way cleaner too

Daniel Yarmoluk

Well, what if I say I knowledge graph any question I have through Claude Code CLI, and my Google is the Claude on the browser. What am I?

Mykola Kondratiuk

someone who built a workflow that actually fits how they think. using Claude as both the graph and the search layer is interesting - you lose the precision of structured queries but gain the flexibility. depends what you are optimizing for

Pixeliro

That’s a really interesting way to frame it — especially the idea of turning Claude into a knowledge graph interface rather than a generator.

I think we’re seeing the same shift in UI systems. The problem isn’t that models aren’t good enough at generating — it’s that we’re letting them operate in an unconstrained space.

What we’re experimenting with is pushing structure before generation — forcing everything through component specs, tokens, and layout constraints. The model isn’t generating UI anymore, it’s composing within a system.

So yeah, I’m starting to believe the future isn’t better generation — it’s better boundaries around generation.

Mykola Kondratiuk

you are someone who found a workflow that actually fits how you think. the tool matters less than whether it extends your cognition or interrupts it

arun rajkumar

The trust misconfiguration pattern is painfully familiar — and not just from AI-generated code. We run 15 microservices at a fintech startup, and before we centralised our env management, we had the exact same issues with human-written code: DB connection strings with full admin creds, no connection pooling limits, env files copy-pasted over Slack.

The fix was a shared Zod schema that validates every env var on startup. Dev-local mode warns, production mode calls process.exit(1). It catches the "fastest path to working" problem before it ships. AI just makes that anti-pattern happen faster. The real gap is that nobody validates the boring infrastructure decisions — human or AI.
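Zod is TypeScript, but the fail-fast shape described here translates to a few lines of stdlib Python. The variable names and rules below are placeholders, not the commenter's actual schema:

```python
import sys

# Hypothetical required env vars and their validity checks.
REQUIRED = {
    "DATABASE_URL": lambda v: v.startswith(("postgres://", "postgresql://")),
    "PORT": lambda v: v.isdigit(),
}

def validate_env(env: dict, production: bool) -> list:
    """Return a list of problems. In production, crash instead of warn."""
    problems = [
        f"{key}: missing or invalid"
        for key, ok in REQUIRED.items()
        if key not in env or not ok(env[key])
    ]
    if problems and production:
        print("\n".join(problems), file=sys.stderr)
        sys.exit(1)  # the process.exit(1) equivalent from the comment
    return problems  # dev-local mode: warn and continue
```

Crashing on startup turns a six-months-later incident into a thirty-second deploy failure, which is the whole trade.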

Mykola Kondratiuk

the human-written fintech example is actually the better data point tbh - same patterns, just slower velocity. AI just makes it impossible to ignore anymore because it surfaces 3x faster. centralising env management was the right call regardless of how you got there

arun rajkumar

Exactly right. The AI just made it impossible to ignore what was always there. We'd been accumulating env variable drift for years — every new service copied from the last one with slight modifications nobody documented. An AI agent flagged 23 inconsistencies in minutes. A human would've found the same things eventually, just spread over months of "hmm, that's weird" moments. The centralised env management was the real fix. The AI was just the mirror that made us stop pretending the mess wasn't there.

Mykola Kondratiuk

23 inconsistencies in minutes is a good example of the pattern - not finding new problems, just surfacing old ones faster. The "hmm that is weird" moments that used to disappear into the backlog now come back with receipts. Centralized env management makes sense as the fix but the harder part is usually the cultural shift: actually enforcing it after you find the problems.

arun rajkumar

We added a drift checker in PR and pre-commit hooks, so instead of relying on cultural alignment we made it an automated process, which prevents this from ever happening again.

Mykola Kondratiuk

That is the right call - baking it into the pipeline so it is not a cultural ask anymore. The cultural alignment approach works until someone is rushing a deploy at 11pm. Automated constraint is the only version that holds under pressure.

arun rajkumar

100% agree on the automated constraint point. We learned this the hard way — we had a "best practices" doc for env management that everyone agreed with and nobody followed consistently. The moment we replaced it with a Zod schema that validates on startup and crashes the process in production if something's wrong, compliance went from "mostly" to "always." The doc was aspirational. The schema is a fact. Cultural alignment is important for things that genuinely need judgment. For things that have a clear right answer — like "don't deploy with an empty DB connection string" — just make the wrong thing impossible.

Mykola Kondratiuk

the zod schema approach is the right call - trust the invariant in code, not the dev's memory of a doc. we hit the same thing with config validation for AI agents: once we moved from 'the readme says X' to 'startup crashes if X is wrong', compliance stopped being a conversation. the cultural buy-in you already had makes the technical enforcement feel collaborative rather than punitive too, which is a nice side effect.

Tomik T

Great writeup. Security in vibe coded projects is the elephant in the room that nobody wants to talk about. I run a small hosting platform and we added automated YARA scanning on every single deploy exactly because of this. You would not believe the stuff people upload: hardcoded API keys, SQL injection galore, sometimes even test credentials for production databases. The AI generates code that works but it has zero concept of security. I think every hosting provider that targets vibe coders needs to have some form of automated scanning built in, it's not optional anymore

Mykola Kondratiuk

honestly the YARA scanning is smart - we ended up doing something similar for VibeCheck, scanning for patterns that scream "AI wrote this and forgot security exists". the hardcoded credentials thing you mentioned is everywhere, I found them in like 40% of the repos I scanned, sometimes buried 3 folders deep like the dev thought obscurity = security. tbh I think you're right that hosting platforms need to build this in because most vibe coders won't think about it until something breaks. the AI just doesn't model threat scenarios, it models "make the tests pass"
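For a feel of what that pattern-based scanning looks like, here is a toy version with two illustrative rules. Real scanners (YARA rulesets, gitleaks, etc.) ship hundreds of patterns plus entropy checks to cut false positives - treat this as a sketch, not a detector:

```python
import re

# Two illustrative rules only: an AWS-style access key ID, and a
# generic "api_key = '<long token>'" assignment.
SECRET_PATTERNS = [
    ("aws_access_key", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("generic_api_key", re.compile(
        r"(?i)(api[_-]?key|secret)\s*[=:]\s*['\"][A-Za-z0-9_\-]{16,}['\"]")),
]

def scan_text(text: str) -> list:
    """Return the names of secret patterns found in a blob of source."""
    return [name for name, pat in SECRET_PATTERNS if pat.search(text)]
```

Note that reading a key from the environment does not trip the rule - the whole point is distinguishing "credential in source" from "credential referenced at runtime".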

Pavel Ishchin

YARA at deploy is smart but what happens when it catches something? Do you block the deploy entirely or just flag it? Asking because the article mentions that most of these issues aren't malicious, they're trust misconfigurations. Blocking a deploy for a hardcoded key makes sense. Blocking it for open CORS might be too aggressive if the dev intended it for a public API. Where do you draw the line on what's a hard block vs a warning?

Mykola Kondratiuk

the block vs flag question is real and context-dependent. hardcoded secrets = block, no debate. open CORS or overpermissioned IAM - I would flag with severity + required acknowledgment rather than hard block. the key is forcing a conscious decision rather than hoping the dev noticed. "I know this is open CORS and I intended it" is a very different state than it slipping through unreviewed
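That block / acknowledge / warn split can be sketched as a tiny triage table. The finding names here are hypothetical:

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"              # deploy stops, no override
    ACKNOWLEDGE = "acknowledge"  # deploy stops until someone signs off
    WARN = "warn"                # noted, deploy proceeds

# Hypothetical mapping of the policy described above.
TRIAGE = {
    "hardcoded_secret": Action.BLOCK,
    "open_cors": Action.ACKNOWLEDGE,
    "overpermissioned_iam": Action.ACKNOWLEDGE,
    "missing_rate_limit": Action.WARN,
}

def gate_deploy(findings: list, acknowledged: set) -> bool:
    """True if the deploy may proceed under the triage policy."""
    for finding in findings:
        action = TRIAGE.get(finding, Action.WARN)
        if action is Action.BLOCK:
            return False
        if action is Action.ACKNOWLEDGE and finding not in acknowledged:
            return False  # forces the conscious decision
    return True
```

The ACKNOWLEDGE tier is the interesting one: it encodes "I know this is open CORS and I intended it" as a recorded state rather than a hope.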

Daniel Yarmoluk

Who are you worried about? Like you’re worried for what group of people? The security company that watches what?

Mykola Kondratiuk

mostly worried about the builders themselves - a lot of vibe coders ship without ever looking at what permissions their apps are granting or what data they are exposing. no malicious intent, just not in the habit of thinking about attack surface

wong2 kim

As someone who builds iOS apps entirely with AI-assisted development, the trust misconfiguration pattern is painfully real. I've caught myself shipping overly permissive Supabase RLS policies because the AI-generated code "just worked" in dev.

The gap between "works in dev" and "safe for real users" is exactly what keeps me up at night. My approach has been adding a manual security review checkpoint before every App Store submission — but that doesn't scale well when you're shipping multiple apps.

The idea of background tooling that catches these issues during the coding session (not after) feels like the right direction. Would love to try VibeCheck on my projects.

Mykola Kondratiuk

yeah the Supabase RLS thing hit me too. had a policy that was basically true for reads because the AI wanted to make the demo work fast. caught it only when I read through the migration files line by line before launch. now I ask the AI to specifically audit the policies after writing them - weirdly works better than asking it to write secure ones in the first place. still manual but at least it's a separate pass with a different prompt

Daniel Yarmoluk

We can knowledge graph these solutions. I cannot say what will or will not work, but I can help get more efficient context architecture to assist in your solution development. We all can assist each other in solving big problems together.

Mykola Kondratiuk

collective intelligence angle is interesting. the challenge is keeping context sharp - more nodes can help or hurt depending on how well the graph is structured

Daniel Yarmoluk

Honestly, I’d rather have a recorded conversation, a real conversation, and load that text into model. Not tax my prefrontal cortex to write a prompt. I’m too dumb

Mykola Kondratiuk

conversation as context is underrated. transcripts carry nuance that prompts lose. the "too dumb to prompt" framing is honest - most people think in conversation not structured instructions

Daniel Yarmoluk

You can just give me a call

Mykola Kondratiuk

haha lets keep it in the comments for now - more interesting for others to read too

Daniel Yarmoluk

I’m going to give this to you in a few hours

Mykola Kondratiuk

no rush

Daniel Yarmoluk

Mykola — here it is. I owe you a proper response, not scattered comments from behind the wheel.

the thread earlier was raw signal — fragments of an idea typed between errands. this is what happens when I actually sit down and structure it. a page of structured context enables 170 reasoning paths. a few comments in a thread enable maybe 3. you deserve the full page.

you said: "battlefield: a prod codebase that was 100% vibe coded in one weekend. weaponry: a security scanner and a code review. my bet is on the scanner finding something before you finish reading the first file."

I accept. but here's the twist.

I'm going to vibe code a real production-ready app this weekend. 100% AI-generated, one weekend, shipped. but before I write a single line, I'm loading a security knowledge graph built from YOUR findings into the model's context. your article, your comment thread, your data — structured as typed relationships with rules, constraints, and the vulnerability patterns you found across 100 codebases.

your own data, aimed back at your scanner.

then you scan it with VibeCheck. if your scanner finds vulns — you win, the scanner beats the graph. if it comes back clean — the graph wins, structured context prevented what scanning would have caught after the fact.

either way, we both learn something real. and the experiment is documented.

I took your article + this thread + your twitter findings and compressed them into a knowledge graph. 58 entities, 89 typed relationships, 8 layers, ~320 reasoning paths. paste it into any model and ask it anything about your security domain.

but it's not just a diagnostic. layers 6-8 are solutions — things you can actually build, including the "something watching in the background" you said nobody's solved yet (spoiler: it's solved — Claude Code hooks do exactly this).

there are also 5 weak signals I found in your own words that you might not have noticed the significance of. your data told a bigger story than the article captured. the graph connected them.

I already ran the A/B test before posting. same model (ChatGPT, extended thinking), same prompt: "what's my minimum security checklist for shipping a vibe-coded app this weekend?"

flat article: generic 11-item OWASP checklist. zero references to your actual findings. zero named vulnerability patterns from your data. the model ignored your article and fell back to training data.

knowledge graph: 3-tier ship/no-ship decision framework. 16 identifiable graph traversals. 6 named patterns from your scans (open CORS, overprivileged accounts, hardcoded creds, unsanitized passthrough, f-string SQL, secrets in source). 4 direct citations of your data. referenced your failure modes by name.

same model. same prompt. 16:0 on domain-specific insights. the flat version didn't score lower — it didn't score.

the full math is in Layer 10 of the knowledge graph. run the same test yourself — paste your flat article into one chat, paste the KG into another, same question, compare. that's the proof.

10 layers. 78 entities. 124 typed relationships. ~450 reasoning paths. from one page of structured .md — the same token budget as your flat article, but 170x more ways for a model to reason through it. that's the claim. the A/B test is the proof. the knowledge graph is in my next comment.

and when you're ready to define the battlefield — framework, scope, what "production-ready" means to you — tell me. I'll build it live, KG-loaded, and you scan the result. loser buys the other one a coffee via the internet.

let's go. 🇺🇦

We both win this way, I would argue... but I need more context when you're concerned with context architecture.

Daniel Yarmoluk

Here's the knowledge graph. Copy everything below and paste into any AI model.


Vibe-Coded Security Vulnerability Knowledge Graph

Source: "I Scanned 100 AI Codebases" — Mykola Kondratiuk (VibeCheck)

58 entities | 89 typed relationships | 8 layers | ~320 reasoning paths


GRAPH SCHEMA

[ENTITY] --RELATIONSHIP--> [ENTITY]
  ↳ CONSTRAINT: rule or invariant
  ↳ EVIDENCE: observed data point
  ↳ METRIC: quantified signal

LAYER 1: ROOT CAUSE MODEL

AI_CODE_GENERATION --OPTIMIZES_FOR--> FUNCTIONAL_CORRECTNESS
  ↳ CONSTRAINT: Zero weight on production survivability
  ↳ EVIDENCE: 100/100 codebases prioritized "make it work"

AI_CODE_GENERATION --IGNORES--> TRUST_BOUNDARIES
  ↳ CONSTRAINT: Model has no concept of blast radius
  ↳ EVIDENCE: Dominant vulnerability class = trust misconfigurations (not OWASP classics)

DEVELOPER_EXPERTISE --INVERSELY_CORRELATES--> VULNERABILITY_DENSITY
  ↳ METRIC: Single-prompt codebases = worst offenders
  ↳ METRIC: Iterative builds with human review = significantly fewer issues

PATH_OF_LEAST_RESISTANCE --IS_DEFAULT_FOR--> AI_DECISION_MAKING
  ↳ CONSTRAINT: Model selects fastest working path, not safest path
  ↳ EVIDENCE: Admin creds used because "fastest path to getting it working"

LAYER 2: VULNERABILITY TAXONOMY

TRUST_MISCONFIGURATION --PARENT_OF--> OPEN_CORS
  ↳ RULE: CORS must be origin-specific, never wildcard in production
  ↳ FREQUENCY: High (appeared across majority of scanned repos)
  ↳ DETECTION: Static scan for Access-Control-Allow-Origin: *

TRUST_MISCONFIGURATION --PARENT_OF--> OVERPRIVILEGED_SERVICE_ACCOUNTS
  ↳ RULE: Least privilege — never default to admin, scope to minimum required
  ↳ FREQUENCY: High
  ↳ EXPLOIT_WINDOW: Lateral movement after initial compromise

TRUST_MISCONFIGURATION --PARENT_OF--> HARDCODED_CREDENTIALS
  ↳ RULE: Zero hardcoded secrets. Use .env + .gitignore. Flag and block.
  ↳ FREQUENCY: High (API keys in config files not in .gitignore)
  ↳ EXPLOIT_WINDOW: Credential harvesting from public repos

TRUST_MISCONFIGURATION --PARENT_OF--> UNSANITIZED_INPUT_PASSTHROUGH
  ↳ RULE: Sanitize before shell, DB, or API pass-through. Always.
  ↳ FREQUENCY: High (input passed straight into shell commands)
  ↳ SEVERITY: Critical — command injection vector

SQL_INJECTION --CAUSED_BY--> STRING_CONCATENATION_IN_QUERIES
  ↳ RULE: Always use parameterized queries. Never f-string SQL.
  ↳ FREQUENCY: 11/50 apps (22%) — concentrated in "quick fix" late-night commits
  ↳ AI_TRIGGER: Model defaults to f-strings when unsure about ORM

EXPOSED_SECRETS --CAUSED_BY--> ENV_IN_SOURCE
  ↳ RULE: Secrets manager or .env with .gitignore. No exceptions.
  ↳ FREQUENCY: 17/50 apps (34%)
  ↳ AI_TRIGGER: Model includes working example values that are real credentials

NO_CONNECTION_POOLING --CAUSED_BY--> DEFAULT_DB_CONFIG
  ↳ RULE: Set connection pool limits. Never expose raw admin connection string.
  ↳ EVIDENCE: One project had full admin creds + no pooling + raw user input in queries
  ↳ EXPLOIT_WINDOW: Resource exhaustion + data exfiltration
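several of the LAYER 2 rules reduce to source patterns a scanner can grep for. a minimal sketch of a three-rule detector (the regexes are illustrative guesses, not VibeCheck's actual signatures; a real scanner would use AST analysis to cut false positives):

```python
import re

# Illustrative patterns for three LAYER 2 rules. Regex-level checks
# only -- good enough to demonstrate the taxonomy, not production-grade.
CHECKS = {
    "open_cors": re.compile(
        r"Access-Control-Allow-Origin['\"]?\s*\]?\s*[:=]\s*['\"]\*"
    ),
    "fstring_sql": re.compile(
        r"f['\"](SELECT|INSERT|UPDATE|DELETE)\b[^'\"]*\{", re.IGNORECASE
    ),
    "hardcoded_secret": re.compile(
        r"(api[_-]?key|secret|password)\s*=\s*['\"][A-Za-z0-9_\-]{16,}['\"]",
        re.IGNORECASE,
    ),
}

def scan_source(source: str) -> list[str]:
    """Return the names of every rule the source text violates."""
    return [name for name, pattern in CHECKS.items() if pattern.search(source)]
```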

LAYER 3: FAILURE MODE ANALYSIS

DEV_ENVIRONMENT --MASKS--> PRODUCTION_VULNERABILITIES
  ↳ MECHANISM: Nothing breaks in dev. Tests pass. Works on their machine.
  ↳ GAP: "works fine in dev" ≠ "safe to run with real users"
  ↳ METRIC: Multiple scanned codebases were running in production with real data

SINGLE_PROMPT_GENERATION --PRODUCES--> FLAT_ARCHITECTURE
  ↳ SIGNAL: Everything in one directory. No separation of concerns.
  ↳ SIGNAL: Auth logic mixed with business logic
  ↳ SIGNAL: Can identify single-prompt codebases by file structure alone

EXCITEMENT_TO_SHIP --OVERRIDES--> SECURITY_REVIEW
  ↳ CONSTRAINT: "Don't fix it if it ain't broken" mentality
  ↳ EVIDENCE: Developers were genuinely excited, built cool things, security never surfaced
  ↳ INSIGHT: Not careless people — security just wasn't on their radar

AI_CONTEXT_WINDOW --DROPS--> SECURITY_CONSTRAINTS
  ↳ MECHANISM: Security guidelines in prompt work when task is security-framed
  ↳ MECHANISM: When task shifts to "add feature," security context doesn't carry over
  ↳ EVIDENCE: Author tested adding security guidelines — partial improvement only

LAYER 4: INTERVENTION MODEL

SCANNING_AFTER_GENERATION --CATCHES--> OBVIOUS_VULNERABILITIES
  ↳ TOOL: VibeCheck (vibe-checker.dev)
  ↳ TIMING: Post-generation, pre-deployment
  ↳ LIMITATION: Reactive, not preventive

DEVELOPER_ATTENTION_DURING_GENERATION --PREVENTS--> TRUST_MISCONFIGURATIONS
  ↳ MECHANISM: Reading output, asking "why does this need admin access?"
  ↳ METRIC: Projects with human review during generation = fewest issues
  ↳ INSIGHT: Even a little friction makes a big difference

SECURITY_PROMPTS --PARTIALLY_MITIGATE--> VULNERABILITY_GENERATION
  ↳ METHOD: Explicitly telling AI to follow least-privilege, validate inputs, no hardcoded creds
  ↳ EFFECTIVENESS: Partial — works when task aligns, fails on context switch
  ↳ ASSESSMENT: "Feels like workarounds"

STRUCTURED_CONTEXT_INJECTION --CONSTRAINS--> AI_DECISION_PATHS
  ↳ METHOD: Security rules as typed relationships loaded into model context
  ↳ ADVANTAGE: Model can't take path of least resistance when graph constrains the path
  ↳ ADVANTAGE: Relationships persist across task switches (unlike prose instructions)

LAYER 5: DOMAIN TRANSFER PATTERNS

KNOWLEDGE_GRAPH_APPROACH --ANALOGOUS_TO--> SURGICAL_CHECKLIST
  ↳ EVIDENCE: Atul Gawande — structured checklists reduced surgical deaths 47%
  ↳ PRINCIPLE: Externalized structure catches what even experts miss under pressure
  ↳ TRANSFER: Security KG = surgical checklist for code generation

EXPERT_INTUITION --IS_A--> INTERNALIZED_KNOWLEDGE_GRAPH
  ↳ EXAMPLE: Senior dev instinctively flags "why admin access?" = graph traversal in wetware
  ↳ PROBLEM: Most developers don't have that instinct yet. Can't externalize what they don't have.
  ↳ SOLUTION: Externalize the graph so any developer can traverse expert-level constraints

LAYER 6: SOLUTION ARCHITECTURE — What You Can Build

VIBECHECK_SCANNER --CURRENTLY_IS--> REACTIVE_POST_GENERATION
  ↳ STATUS: Works. Catches obvious vulns after code exists.
  ↳ LIMITATION: By the time you scan, the developer has already shipped.
  ↳ UPGRADE_PATH: Move scanning LEFT in the pipeline (pre-commit, real-time, context-injection).

VIBECHECK_SCANNER --SHOULD_BECOME--> THREE_STAGE_SECURITY_PIPELINE
  ↳ STAGE_1: Pre-generation (context injection — security KG loaded before coding starts)
  ↳ STAGE_2: During-generation (real-time monitoring — flags as code is written)
  ↳ STAGE_3: Post-generation (current VibeCheck scan — catches what slipped through)
  ↳ PRINCIPLE: Defense in depth. No single stage catches everything.

STAGE 1: Pre-Generation Context Injection

SECURITY_RULES_FILE --INJECTS_INTO--> AI_CODING_SESSION
  ↳ FORMAT: .cursorrules (Cursor), CLAUDE.md (Claude Code), .github/copilot-instructions.md (Copilot)
  ↳ MECHANISM: IDE/tool loads rules file automatically before any code generation
  ↳ CONTENT: This KG's LAYER 2 rules, compressed to ~500 tokens
  ↳ RESULT: Model sees security constraints BEFORE it writes line 1

SECURITY_CONTEXT_FILE --CONTAINS--> MANDATORY_RULES
  ↳ RULE_1: "Never hardcode secrets. Use environment variables with .env + .gitignore."
  ↳ RULE_2: "Always use parameterized queries. Never use string concatenation for SQL."
  ↳ RULE_3: "CORS must be origin-specific. Never use wildcard (*) in production."
  ↳ RULE_4: "Apply least privilege to all service accounts. Never default to admin."
  ↳ RULE_5: "Sanitize all user input before passing to shell, database, or API."
  ↳ RULE_6: "Set connection pool limits on all database connections."
  ↳ RULE_7: "Separate auth logic from business logic. Never mix in same file."
  ↳ DELIVERY: VibeCheck generates this file automatically from scan results
  ↳ VALUE: "We scanned your code. Here's the rules file so it doesn't happen again."

VIBECHECK --CAN_GENERATE--> SECURITY_RULES_FILE
  ↳ FLOW: Scan repo → identify violations → generate .cursorrules/.claude.md with rules
  ↳ DIFFERENTIATION: No other scanner generates pre-generation context files
  ↳ MOAT: Becomes the bridge between "what broke" and "don't break it again"
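the scan → rules-file flow is mostly a lookup plus a template. a minimal sketch, assuming scan output arrives as a list of violation names (that input format is an assumption; the rule wording mirrors the SECURITY_CONTEXT_FILE entries above):

```python
# Maps a violation name from a scan to the rule text to inject into the
# next coding session. File names (.cursorrules, CLAUDE.md) are the
# conventions Cursor and Claude Code load automatically.
RULE_TEXT = {
    "hardcoded_secret": "Never hardcode secrets. Use environment variables with .env + .gitignore.",
    "fstring_sql": "Always use parameterized queries. Never use string concatenation for SQL.",
    "open_cors": "CORS must be origin-specific. Never use wildcard (*) in production.",
    "overprivileged_account": "Apply least privilege to all service accounts. Never default to admin.",
}

def generate_rules_file(violations: list[str]) -> str:
    """Build a .cursorrules / CLAUDE.md body from the violations a scan found."""
    lines = ["# Security rules (generated from scan results)"]
    for v in violations:
        if v in RULE_TEXT:
            lines.append(f"- {RULE_TEXT[v]}")
    return "\n".join(lines) + "\n"
```

drop the generated file in the project root and the next generation session starts with the constraints already in context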

STAGE 2: During-Generation Real-Time Monitoring

PRE_COMMIT_HOOK --INTERCEPTS--> INSECURE_CODE_AT_COMMIT
  ↳ TOOL: pre-commit framework (pre-commit.com) + custom VibeCheck hook
  ↳ MECHANISM: Developer runs git commit → hook runs VibeCheck scan → blocks if critical
  ↳ FRICTION: Low — only fires on commit, not every keystroke
  ↳ IMPLEMENTATION: .pre-commit-config.yaml in repo root, VibeCheck as a hook entry

CLAUDE_CODE_HOOK --INTERCEPTS--> INSECURE_CODE_AT_WRITE
  ↳ TOOL: Claude Code hooks system (settings.json → hooks[] config)
  ↳ MECHANISM: PostToolUse event on Edit/Write → runs security check → warns developer
  ↳ FRICTION: Near-zero — runs in background, only surfaces when something is wrong
  ↳ NOTE: This is your "unsolved problem" — "something watching during vibe coding." It's solved. Claude Code hooks do exactly this.

FILE_WATCHER --MONITORS--> CODE_CHANGES_IN_REAL_TIME
  ↳ TOOL: fswatch / chokidar / nodemon pattern
  ↳ MECHANISM: Watch project directory → on file change → run quick security scan
  ↳ SPEED_REQUIREMENT: Must complete in <2 seconds or developers will disable it
  ↳ SCOPE: Only scan changed files, not full repo
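the pre-commit wiring is one file. a hypothetical .pre-commit-config.yaml, assuming VibeCheck publishes hook metadata (the repo URL, rev, hook id, and flags are all made up for illustration):

```yaml
# .pre-commit-config.yaml -- hypothetical entry; assumes VibeCheck
# ships a CLI and publishes a pre-commit hook definition.
repos:
  - repo: https://github.com/vibecheck/pre-commit-hook  # assumed URL
    rev: v0.1.0
    hooks:
      - id: vibecheck-scan
        # scan only staged files; fail the commit on critical findings
        args: ["--changed-only", "--fail-on", "critical"]
```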

STAGE 3: Post-Generation (Enhanced Current VibeCheck)

VIBECHECK_SCAN --ENHANCED_BY--> KNOWLEDGE_GRAPH_RULES
  ↳ CURRENT: Pattern matching against known vulnerability signatures
  ↳ ENHANCED: Traverse LAYER 2 taxonomy → check each typed relationship as a rule
  ↳ ADVANTAGE: KG rules have FREQUENCY and SEVERITY → prioritize findings by real-world impact
  ↳ ADVANTAGE: KG rules have AI_TRIGGER → explain WHY the AI made this mistake

SCAN_RESULTS --FEED_BACK_INTO--> KNOWLEDGE_GRAPH
  ↳ MECHANISM: Each new scan adds new vulnerability patterns to the KG
  ↳ RESULT: KG grows with every codebase scanned — self-improving system
  ↳ METRIC: After 1,000 scans, KG predicts which vuln types appear in which frameworks

SCAN_REPORT --GENERATES--> REMEDIATION_CONTEXT
  ↳ CURRENT: "Found hardcoded API key on line 47" (what's wrong)
  ↳ ENHANCED: "Found hardcoded API key on line 47. AI_TRIGGER: Model included working example value as real credential. RULE: Use .env + .gitignore. Here's the fix + a .cursorrules entry to prevent recurrence." (what's wrong + why + fix + prevention)
  ↳ DIFFERENTIATION: No other scanner explains WHY the AI made the mistake

LAYER 7: PRODUCT EVOLUTION — VibeCheck Becomes a Platform

VIBECHECK_V1 --IS--> SCANNER (current)
  ↳ INPUT: Codebase URL or upload
  ↳ OUTPUT: Vulnerability report
  ↳ VALUE: "Here's what's wrong"

VIBECHECK_V2 --BECOMES--> SCANNER_PLUS_CONTEXT_GENERATOR
  ↳ INPUT: Codebase URL or upload
  ↳ OUTPUT: Vulnerability report + .cursorrules + CLAUDE.md + copilot-instructions.md
  ↳ VALUE: "Here's what's wrong + here's how to never do it again"

VIBECHECK_V3 --BECOMES--> REAL_TIME_SECURITY_COPILOT
  ↳ INPUT: Installed as pre-commit hook + Claude Code hook + IDE extension
  ↳ OUTPUT: Real-time warnings during coding + auto-generated fixes
  ↳ VALUE: "I'm watching while you vibe code — you'll never ship insecure code"

VIBECHECK_V4 --BECOMES--> SECURITY_KNOWLEDGE_PLATFORM
  ↳ INPUT: Aggregated scan data from all users (anonymized)
  ↳ OUTPUT: Community security KG — which frameworks produce which vulns at what rates
  ↳ VALUE: "We've scanned 10,000 vibe-coded apps. Here's what every AI tool gets wrong."
  ↳ MOAT: Network effect — more users → better KG → better prevention → more users

COMMUNITY_SECURITY_KG --ENABLES--> FRAMEWORK_SPECIFIC_RULES
  ↳ EXAMPLE: "Next.js + Claude → 40% chance of open CORS. Auto-inject: CORS_RULE."
  ↳ EXAMPLE: "Flask + Copilot → 30% chance of f-string SQL. Auto-inject: PARAMETERIZED_QUERY_RULE."
  ↳ EXAMPLE: "React + Cursor → 25% chance of exposed API keys. Auto-inject: ENV_VAR_RULE."
  ↳ DATA_SOURCE: Aggregated VibeCheck scan results across all users
  ↳ COMPETITIVE_ADVANTAGE: Nobody else has this dataset

LAYER 8: IMPLEMENTATION QUICKSTART — Build This Weekend

WEEKEND_BUILD_1 --IS--> SECURITY_RULES_GENERATOR (4 hours)
  ↳ STEP_1: Take VibeCheck scan output → parse into structured violations
  ↳ STEP_2: Map violations to LAYER 2 rules (this KG provides the mapping)
  ↳ STEP_3: Generate .cursorrules / CLAUDE.md / copilot-instructions.md from rules
  ↳ STEP_4: User downloads generated file → drops in project root → done
  ↳ OUTPUT: "Scan + Protect" feature on vibe-checker.dev
  ↳ EFFORT: Mostly template generation — the hard part (the rules) is already in this KG

WEEKEND_BUILD_2 --IS--> PRE_COMMIT_HOOK_PACKAGE (2 hours)
  ↳ STEP_1: Wrap VibeCheck CLI as a pre-commit hook
  ↳ STEP_2: Publish to pre-commit hook registry
  ↳ STEP_3: Users add one line to .pre-commit-config.yaml
  ↳ EFFORT: Minimal if VibeCheck already has a CLI

WEEKEND_BUILD_3 --IS--> CLAUDE_CODE_SECURITY_SKILL (3 hours)
  ↳ STEP_1: Use Claude Code's skill-creator to scaffold the skill
  ↳ STEP_2: Embed LAYER 2 rules as skill references
  ↳ STEP_3: Write evals (test prompts → expected security behavior)
  ↳ STEP_4: Publish to Claude skill marketplace
  ↳ TRIGGER_PHRASES: "scan this for security" / "is this safe" / "check before I ship"

WEEKEND_BUILD_4 --IS--> ENHANCED_SCAN_REPORT (2 hours)
  ↳ STEP_1: Add AI_TRIGGER field to scan findings ("why did the AI do this?")
  ↳ STEP_2: Add REMEDIATION field with copy-paste fix
  ↳ STEP_3: Add PREVENTION field with .cursorrules entry
  ↳ IMPACT: Transforms report from "what's wrong" to "what's wrong + why + fix + prevent"
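WEEKEND_BUILD_4 is mostly a data-shape change. a minimal python sketch of the enrichment step (the finding dict format and the enrichment text are assumptions keyed to the KG entries above):

```python
# Illustrative enrichment table keyed by finding type. AI_TRIGGER and
# PREVENTION wording follows the knowledge graph entries above.
ENRICHMENT = {
    "hardcoded_api_key": {
        "ai_trigger": "Model included a working example value that is a real credential.",
        "remediation": "Move the key to an environment variable loaded from .env.",
        "prevention": "- Never hardcode secrets. Use .env + .gitignore.",
    },
}

def enrich_finding(finding: dict) -> dict:
    """Turn a bare 'what's wrong' finding into what's wrong + why + fix + prevention."""
    extra = ENRICHMENT.get(finding["type"], {})
    return {**finding, **extra}
```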

QUERY INTERFACE

to use this graph, paste it into any AI model and traverse relationships:

| Query | Try this |
| --- | --- |
| What causes X? | SQL_INJECTION --CAUSED_BY--> STRING_CONCATENATION |
| What prevents X? | DEVELOPER_ATTENTION --PREVENTS--> TRUST_MISCONFIGURATIONS |
| What does AI ignore? | AI_CODE_GENERATION --IGNORES--> TRUST_BOUNDARIES |
| What's the gap? | DEV_ENVIRONMENT --MASKS--> PRODUCTION_VULNERABILITIES |
| What's unsolved? | REAL_TIME_MONITORING → see LAYER 6 STAGE 2 (it's solved) |
| What should I build first? | WEEKEND_BUILD_1 → Security Rules Generator (4 hours, highest ROI) |
| How does VibeCheck evolve? | V1 → V2 → V3 → V4 (scanner → context gen → copilot → platform) |
| What's the moat? | COMMUNITY_SECURITY_KG --ENABLES--> FRAMEWORK_SPECIFIC_RULES |

YOUR NUMBERS

| Metric | Value |
| --- | --- |
| Codebases scanned | 100 |
| Dominant vulnerability class | Trust misconfigurations |
| SQL injection via f-strings | 22% (11/50) |
| Exposed secrets in source | 34% (17/50) |
| Shipped to production with issues | Multiple confirmed |
| Worst offender pattern | Single-prompt, one-weekend builds |
| Best mitigation | Human attention during generation + structured context |

LAYER 9: THE CHALLENGE — Scanner vs. Knowledge Graph

CHALLENGE --PROPOSED_BY--> MYKOLA
  ↳ BATTLEFIELD: A prod codebase, 100% vibe coded in one weekend
  ↳ WEAPONRY: VibeCheck security scanner + code review
  ↳ BET: "Scanner finds something before you finish reading the first file"

CHALLENGE --ACCEPTED_BY--> DAN
  ↳ TWIST: Vibe code the app with THIS KG loaded as structured context
  ↳ SOURCE: Mykola's own findings from 100 codebase scans
  ↳ HYPOTHESIS: Structured context injection prevents the vulns the scanner would catch

Experiment Protocol

EXPERIMENT --HAS_PHASES--> FOUR_PHASES

PHASE_1 --IS--> DEFINE_BATTLEFIELD
  ↳ ACTION: Mykola defines framework, scope, what "production-ready" means
  ↳ CONSTRAINT: Must be realistic — auth, database, user input, API calls
  ↳ CONSTRAINT: Must be vibe-codeable in one weekend (reasonable scope)

PHASE_2 --IS--> VIBE_CODE_WITH_KG
  ↳ ACTION: Dan vibe codes the entire app using AI (Claude Code)
  ↳ CONTEXT_LOADED: This KG — all 8 layers, all 89 relationships
  ↳ ADDITIONAL_CONTEXT: LAYER 2 rules extracted as .cursorrules / CLAUDE.md
  ↳ NO_MANUAL_SECURITY_REVIEW: Pure vibe coding. The KG is the only safety net.
  ↳ DOCUMENTATION: Every prompt, every AI output, timestamped

PHASE_3 --IS--> SCAN_AND_REVIEW
  ↳ ACTION: Mykola scans the codebase with VibeCheck
  ↳ ACTION: Mykola does a manual code review
  ↳ OUTPUT: Full vulnerability report — what was found, severity, location
  ↳ METRIC: Time to first finding (scanner) vs. time to first finding (human review)

PHASE_4 --IS--> ANALYZE_AND_PUBLISH
  ↳ ACTION: Compare findings against LAYER 2 predictions
  ↳ QUESTION_1: Which LAYER 2 rules held? (KG prevented the vulnerability)
  ↳ QUESTION_2: Which LAYER 2 rules failed? (KG was loaded but AI still took shortcuts)
  ↳ QUESTION_3: What NEW vulnerability types appeared that aren't in the KG yet?
  ↳ OUTPUT: Updated KG with new findings → graph grows from the experiment
  ↳ OUTPUT: Joint blog post / dev.to article documenting results

What the Experiment Proves (Either Way)

IF_SCANNER_WINS --MEANS--> KG_NEEDS_MORE_RULES
  ↳ INSIGHT: Which specific rules failed? Were they in the KG but ignored by the model?
  ↳ INSIGHT: Or were they missing from the KG entirely? (new vulnerability class)
  ↳ ACTION: Add new findings to the KG → it gets stronger for next time
  ↳ VALUE_FOR_MYKOLA: Real data on what structured context can and can't prevent
  ↳ VALUE_FOR_DAN: Real data on where graph constraints break down

IF_GRAPH_WINS --MEANS--> STRUCTURED_CONTEXT_PREVENTS_VULNS
  ↳ INSIGHT: Which specific rules held? How did the model use the constraints?
  ↳ INSIGHT: Did the model reference LAYER 2 rules explicitly during generation?
  ↳ ACTION: Package the winning KG as a downloadable security context file
  ↳ VALUE_FOR_MYKOLA: Proof that prevention > detection → informs V2 product direction
  ↳ VALUE_FOR_DAN: Proof that structured .md context changes AI output quality

EITHER_WAY --PRODUCES--> VALUABLE_DATA
  ↳ NOVEL: Nobody has tested KG-loaded vibe coding against a security scanner before
  ↳ PUBLISHABLE: Joint article on dev.to, real numbers, real code, real scan results
  ↳ GROWTH: Both audiences see the collaboration — Mykola's security community + Dan's graph community
  ↳ KG_GROWS: New findings feed back into the graph → self-improving system

Weak Signals That Inform the Challenge

MYKOLA_SAID --WEAK_SIGNAL--> "single-prompt codebases were worst offenders"
  ↳ IMPLICATION: If Dan builds iteratively with the KG loaded at each step, this risk drops significantly
  ↳ CHALLENGE_ADVANTAGE: KG-loaded iterative builds ≠ single-prompt generation

MYKOLA_SAID --WEAK_SIGNAL--> "security context doesn't carry over into every decision"
  ↳ IMPLICATION: Prose instructions degrade across tasks. Typed relationships persist.
  ↳ CHALLENGE_ADVANTAGE: The KG's RULE fields are explicit constraints, not suggestions

MYKOLA_SAID --WEAK_SIGNAL--> "projects with human review during generation had fewest issues"
  ↳ IMPLICATION: The KG acts as a 24/7 reviewer that never loses attention
  ↳ CHALLENGE_ADVANTAGE: Every file write checked against LAYER 2 rules automatically

MYKOLA_SAID --WEAK_SIGNAL--> "adding security to prompts feels like workarounds"
  ↳ IMPLICATION: He already knows prose instructions aren't enough
  ↳ CHALLENGE_ADVANTAGE: This isn't prose instructions — it's a traversable constraint graph

MYKOLA_SAID --WEAK_SIGNAL--> "nobody's really solved the second gap yet"
  ↳ IMPLICATION: The gap between "works in dev" and "safe in prod" is the challenge itself
  ↳ CHALLENGE_QUESTION: Does structured context close the gap, or just narrow it?

GRAPH TOTALS (including challenge layer)

| Metric | Value |
| --- | --- |
| Entities | 72 |
| Typed relationships | 112 |
| Layers | 9 |
| Reasoning paths | ~400 |

LAYER 10: THE MATH — A/B Test Results

I ran this before posting. same model (ChatGPT, extended thinking). same prompt. two different inputs.

Test A: flat article pasted in (no graph)

Test B: this knowledge graph pasted in

TEST_A --INPUT--> FLAT_ARTICLE (~4,300 tokens, linear prose)
  ↳ OUTPUT: Generic 11-item OWASP checklist
  ↳ DOMAIN_SPECIFIC_INSIGHTS: 0
  ↳ NAMED_VULNERABILITY_PATTERNS_FROM_SOURCE: 0
  ↳ FAILURE_MODE_REFERENCES: 0
  ↳ MYKOLA_DATA_CITATIONS: 0
  ↳ STRUCTURE: Flat numbered list
  ↳ VERDICT: Model ignored the article. Fell back to OWASP training data.

TEST_B --INPUT--> THIS_KNOWLEDGE_GRAPH (~3,200 tokens, structured)
  ↳ OUTPUT: 3-tier decision framework (don't ship / checklist / ship gate)
  ↳ DOMAIN_SPECIFIC_INSIGHTS: 16 identifiable graph traversals
  ↳ NAMED_VULNERABILITY_PATTERNS_FROM_SOURCE: 6 (open CORS, overprivileged accounts, hardcoded creds, unsanitized passthrough, f-string SQL, secrets in source)
  ↳ FAILURE_MODE_REFERENCES: 2 (single-prompt worst offenders, dev→prod gap)
  ↳ MYKOLA_DATA_CITATIONS: 4 direct
  ↳ STRUCTURE: Ship/no-ship binary gate — production deployment decision
  ↳ VERDICT: Model traversed the graph. Output was domain-specific, evidence-backed, actionable.

the numbers

COMPARISON --METRIC--> TOKEN_EFFICIENCY
  ↳ TEST_A: 4,300 tokens in → 0 domain-specific insights out
  ↳ TEST_B: 3,200 tokens in → 16 domain-specific insights out
  ↳ RESULT: Fewer tokens in, categorically more actionable output

COMPARISON --METRIC--> GRAPH_TRAVERSALS
  ↳ TEST_A: 0 identifiable traversals (model used training data, not the article)
  ↳ TEST_B: 16 identifiable traversals from 1 question
  ↳ RESULT: 16:0 — the flat version didn't score lower. it didn't score.

COMPARISON --METRIC--> PATH_UTILIZATION
  ↳ AVAILABLE_PATHS_IN_GRAPH: ~170 (structural — 47 relationships × 3.6 avg chain depth)
  ↳ PATHS_ACTIVATED_BY_1_QUESTION: 16 (measured — 9.4% utilization)
  ↳ PATHS_IN_FLAT_PROSE: ~1 (linear narrative, top to bottom)
  ↳ IMPLICATION: ~10-11 different questions would traverse all 170 paths

the 170x claim — what it means

CLAIM --IS_NOT--> "170x better output"
CLAIM --IS--> "170x more reasoning paths per page of context"

SAME_TOKEN_BUDGET --COMPARISON-->
  ↳ FLAT_PROSE_PAGE: ~3,200 tokens → ~1 narrative thread → model reads top to bottom
  ↳ GRAPH_MD_PAGE: ~3,200 tokens → ~170 traversable paths → model starts anywhere, follows any edge
  ↳ RATIO: 170 paths : 1 thread = 170x structural advantage per page

ONE_QUESTION_PROVED -->
  ↳ GRAPH: activated 16 of 170 available paths (9.4%)
  ↳ FLAT: activated 0 of ~1 available path (0%) — fell back to training data
  ↳ RESULT: same page of context, same token budget, 170 paths vs 1 thread, 16:0 on one question

THE_INVITATION -->
  ↳ "paste both versions into any model. ask 10 different questions. count the domain-specific insights each version produces. that's the proof — not my claim, your experiment."

GRAPH TOTALS (final)

| Metric | Value |
| --- | --- |
| Entities | 78 |
| Typed relationships | 124 |
| Layers | 10 |
| Reasoning paths | ~450 |

78 entities. 124 relationships. 10 layers. ~450 reasoning paths. grew from 31 entities in v1 to 78 in final — each refinement compounded, never restarted.

the A/B test proved 16:0 on one question. the graph has 170 paths per page. test it yourself — paste this into any model, ask it anything about vibe-coded security, and compare against the flat article. the graph does the reasoning. you ask the questions.

built with graphify.md — domain knowledge → portable .md

Mykola Kondratiuk

okay this is a proper setup - you ship it, then I scan it. I'm in. build it this weekend and drop the link, VibeCheck will run a full scan and we post the results. tbh I'm curious whether context-loaded generation actually holds up against the patterns I found or just looks cleaner on the surface.

Jace Altgen

This is super interesting to validate statistically. Definitely, having security principles in prompts helps, and so does having a semblance of an idea of the architecture of your application. Now, I get that that's exactly what some people can't define, because they aren't experts. But let's say you're writing a piece of python tooling: tell it which version you want, that you want pyproject.toml instead of requirements.txt, some file structure you'd typically want, that you want to use pydantic for settings. And more importantly, what you don't want it to do. In my experience this increases the chance of the output being easier to validate and check, because it aligns with my mental model of how this tool should probably work. Knowing where to look becomes easier.

Mykola Kondratiuk

yeah the "you have to know what you want" problem is real. i ran into this exact wall - if you can articulate the architecture you already kind of know what you are doing. the devs who get burned the hardest are the ones who dont have that mental model yet, so they just accept whatever the model outputs without questioning it. the pyproject.toml type of constraint helps a lot though, having a checklist of non-negotiables you paste in every time basically forces a floor on quality

Daniel Yarmoluk

I found Dev.to last night. I don’t even know why I’m here, it feels like being rejected by people who certainly don’t help me, but again, I’m trying my best to show what I think, in my opinion, has merit.

Mykola Kondratiuk

hey, for what it's worth - your takes have merit. the graph / context angle is genuinely interesting, it pushed me to think differently. dev.to is worth sticking around for

Daniel Yarmoluk

Or better yet, why don’t I give you a compressed md file of a graph database of your request on here later this evening? Then you go work with it and get more context out of it, like I want to

Mykola Kondratiuk

sounds good

Sloan, the sloth mascot
Comment deleted
 
Mykola Kondratiuk

hey, probably worth deleting that - public comments are indexed. keep it here
