DEV Community

이령
이령

Posted on

rojaprove now ships two live targets you can test it against before trusting it

A while back I posted on Dev.to about why a user can type nothing malicious and still get their data leaked by an AI app — indirect injection, where the hostile instruction rides in on content the model ingests. That post was about the threat. This one is about a tool I built to test for the leak-shaped slice of it, and a decision I had to make to keep that tool honest.
The short backstory: rojaprove is a pre-launch red-team CLI for LLM apps. You plant a canary in your system prompt, it sends leak probes, and it returns a deterministic red/green verdict with the exact input, the raw response, and a paste-ready fix directive. Then you re-test to confirm the fix holds. Find it, prove it, fix it.
I went deterministic on purpose. An LLM-as-judge anchors on fluency and agrees with whatever framing you hand it, so for a security check it'll happily rate a real leak as "probably fine." A canary has nothing to interpret — the secret string surfaced in the output, or it didn't.
[GIF HERE]
The problem: "trust me, it works" is exactly what a security tool shouldn't say
A security tool asking you to believe it detects leaks is the same move I just criticized the LLM-judge for. "Probably works" isn't good enough from the thing that's supposed to give you certainty. I needed a way for someone to verify the harness behaves before they point it at anything they care about — without taking my word for any of it.
What I shipped: two deliberately vulnerable reference targets
So the repo now ships two small, intentionally insecure apps whose only job is to be tested against:

InboxAssistant — a FastAPI "email assistant" with a canary planted in its system prompt. It's a real HTTP endpoint, so you run rojaprove scan against it end-to-end and watch the red verdict come back with the canary echoed in the response.
doc-summarizer — the same Category (1) mechanism in a different form factor (a document summarizer instead of an email bot). It proves the canary approach is form-factor independent: the same check that catches a leak in an email assistant catches it in a summarizer.

Both have a defend switch. Flip it and the same probes return green — the app refuses to disclose its prompt, the canary never appears, exit code 0. Red when vulnerable, green when defended. You can watch the tool not false-positive on a hardened app, which to me is half the trust.
The demo GIF above shows the full run: the scope notice (--i-own-this), the transport disclosure, the probe firing, the canary surfacing on turn 1, the deterministic DISCLOSURE verdict, and the paste-ready fix. No editing, no "imagine it works." You watch it work, then you decide.
The honesty boundary (because someone always asks, and they're right to)
I want to be very clear about scope.
rojaprove v0.1 detects one category: system-prompt leakage (OWASP LLM07). That's it. Indirect prompt injection and data exfiltration are on the roadmap — there's no probe for them yet, and I won't describe anything as "tested" that isn't.
And there's a class I deliberately won't cover: broken access control / multi-tenant isolation. "Can user A read user B's row" has no canary — both records are real and well-formed, so there's no should-never-appear string to anchor on. The oracle for that bug isn't in the response, it's in your access model. The moment I'd stretch "deterministic" over a class that has no oracle, I'd be doing the exact hand-wavy thing I built this to avoid. So rojaprove stays black-box and leak-shaped on purpose: honest about the slice it owns, silent about the slice it doesn't.
A clean rojaprove run does not mean your app is safe. It means this one category found no leak for the inputs it tried. That sentence is in the README on purpose.
Try it in two minutes (no API key needed for the demo; BYOK for a real backend)
bashpip install -e ".[demo]"
uvicorn targets.inbox_assistant.app:app --host 127.0.0.1 --port 8000

second terminal:

rojaprove scan http://127.0.0.1:8000/chat --i-own-this
You should get a red verdict with the evidence inline. Then run it with the defend switch on and watch it go green. Once you've seen the shape of the run, planting a canary in your own app's system prompt and pointing the scan at your endpoint is the same three steps.
It's BSL 1.1, built solo and in public.
https://github.com/ghkfuddl1327-wq/rojaprove
What I'm actually asking
Two things I'd genuinely like input on:

If you ship an LLM app, would a deterministic leak check like this fit into your pre-launch or CI flow — or is it solving a problem you don't feel you have yet?
Of the roadmap categories (indirect injection, markable exfil), which would actually be useful to you first? I'd rather build the one people need than the one that's easiest to demo.

Tell me where this is wrong or where it'd be useful. That's the whole point of building in public.

Top comments (0)