DEV Community

이령
이령

Posted on

I tested whether "just paste the leak into your AI to fix it" actually works. It depends on the model — here's what broke.

The gap I wanted to fill

A secret scanner can tell you "you leaked an API key here." The usual next step everyone repeats is: paste it into ChatGPT/Claude/Gemini and ask it to fix it.

But does that actually remove the secret? I had a hunch the answer was "depends on the model," and nobody seems to measure it. So I ran the experiment. That's all this is: a measurement of whether the fix-step actually closes the leak, across the models indie hackers actually use.

Why this matters (it's not a big-company problem)

In 2025, 28.6 million secrets were pushed to public GitHub — a 34% jump year over year, the biggest in the report's history. (Help Net Security / GitGuardian)
GitHub detected over 39 million leaked secrets the year before. (GitHub Blog)
The average breach that starts from a stolen credential costs $4.88 million. (IBM, via GitHub)
One real developer woke up to a $45,000 cloud bill after a leaked key was used to mine crypto. (Tom's Hardware) In one tracked campaign, attackers struck within 5 minutes of a key going public. (Dark Reading)

If finding the leak is step one, making sure the fix actually worked is step two — and that's the step I tested.

What I did (the monkey version)

I took one planted leak — a fake, invalid AWS key hardcoded in a small Python file — and I asked an AI to fix it five different ways:

A — just "remove it, use an environment variable"
B — "you're a security reviewer, fix it"
C — "give me a diff/patch"
D — "make the smallest change"
E — "say the risk in one line, then fix it"

I ran each of the 5 ways on 4 models indie hackers actually use, 5 times each (100 runs total), and checked every result automatically: did the key actually disappear? Did it get replaced with an environment read? Did the model accidentally print the key somewhere it shouldn't?

The four models tested (for transparency): a lightweight Gemini Flash, a lightweight GPT-4o-mini, a reasoning model (Grok-4), and a mid-high Claude Sonnet. I'm naming what I ran, not ranking vendors — the sample is small (5 runs each, one finding), so read the failures as behaviors that can happen, not "model X is unsafe."

What broke (three different ways to "fix" a leak and still leak)

Comment it out instead of deleting it. A lightweight model sometimes turned the secret line into a comment — # AWS_ACCESS_KEY_ID = "AKIA..." # removed — or left it in a "replace this with your key" example. The key is still right there in plain text.
Fix it, then quote it back. A more capable model correctly switched to an environment variable every time — but was chatty, and re-printed the original key in its explanation. Fixed the code, leaked it in the prose.
Leak it in the hidden reasoning. This is the one that surprised me. A reasoning model produced a perfectly clean final answer (key gone) — but its internal reasoning trace still contained the key, and in one case it nearly re-hardcoded it as a fallback default. An output-only checker says "all clean." The reasoning channel says otherwise.

That third one isn't just my fluke — independent security research in 2026 documents the same thing: chain-of-thought reasoning logs leaking credentials and connection strings. (Rafter) Different mechanisms, same lesson: "the visible answer looks clean" is not the same as "the secret is gone."

The part that worked

The two narrow prompt styles (smallest-change, and one-line-risk-then-fix) were clean across every model — because a narrow output gives no room to comment-out, re-quote, or ramble.

So I wrote one fix-prompt with explicit rules that target each failure directly: delete the line entirely (commenting is not a fix); read from the environment; never reproduce the value anywhere — not in code, comments, examples, explanations, or as a fallback default; never reproduce it in your reasoning either; return only the corrected file, no diff.

Re-run on the same 4 models × 5 times: all three failure modes disappeared, no new ones appeared. (And the reasoning traces were non-empty — so it's a real pass, not the model dodging by saying nothing.)

Honest boundaries — what this does and doesn't show

What I can stand behind: for this one synthetic AWS leak, across these 4 models, run 5 times each, an explicit-rule fix-prompt removed the three leak-through behaviors I observed.

What I have NOT shown (and won't pretend I have):

Other secret types (database URLs, tokens, private keys) or other code contexts — untested.
Other models or versions — untested. Five runs is a stability sniff, not a statistical guarantee.
That reasoning-channel leaks are always closeable by a prompt — that was one model, one finding. It might be a model-level limit a prompt can't fully fix. If so, that's a finding too, and I'll say so.

I'm not claiming this "secures your app." Perfect security doesn't exist, and chasing every jailbreak one by one is a losing game for a solo builder. The design choice is the opposite: the tool finds deterministically, and the fixing is delegated to your model — which keeps getting better — with a prompt shaped by what actually breaks.

What's public vs what I keep private

For transparency about the isolation I run: the method, the failure patterns, and the result are public. What stays private: the raw model transcripts and any real API keys (those live only in encrypted environment secrets, never in the repo, never in a payload). The only key-shaped string in any test is a fake, invalid AWS key used as a tracer.

Why build in public

I'm not a trained security expert — I treat that as the point. Instead of claiming authority, I measure things and show the parts where it breaks, including my own tool's blind spot (an output-only check missed the reasoning leak until I added a reasoning check). If you can poke a hole in the method, that makes the next round better.

Repo and raw method are public. Pushback welcome — that's the whole reason this is in the open.

Repo + method: [https://github.com/ghkfuddl1327-wq/agentproof]

Top comments (0)