I Tried Fable 5 for a Security Review — and It Flagged My Own Request

#claude #ai #devtools #anthropic

A day ago I wrote that Claude Fable 5 was out and I hadn't tried it yet. I promised a follow-up once I shipped something real with it. This is that follow-up — and it didn't go the way I expected.

My first real task for Fable 5 was mundane: review my own app for security bugs. Not exploit development, nothing offensive — just "read my code and tell me where I left holes." The kind of pass I run before pushing to production.

Fable 5 flagged the request itself.

What actually happened

Instead of a review, I got a safety notice. Fable 5's safety measures had flagged my message. The banner was refreshingly upfront about why: the measures are "intentionally broad" right now, and can flag safe, routine programming, cybersecurity, or biology work. It framed this as a deliberate trade — broad guardrails let them ship stronger capabilities sooner, with refinement to follow.

Then something I didn't expect: it quietly switched me over to Opus 4.8, which went ahead and did the review. I wasn't blocked. I was rerouted.

That detail matters more than the flag. The failure mode wasn't a hard "no" — it was a graceful downgrade to a model the safety layer trusts with this kind of work. My task still got done; it just didn't get done by the model I picked.

Why a security review looks scary to a model

Here's the uncomfortable part: at the token level, "find the security holes in my code" and "find the security holes in someone else's code" look almost identical. The difference is intent, and intent lives in context the model doesn't fully own — whose app this is, what I'm going to do with the findings, whether I have the right to touch it.

A young model with wide guardrails resolves that ambiguity conservatively. It would rather flag a legitimate audit than green-light recon it can't distinguish from the audit. Early in a model's life, that's the honest default. Annoying when you're on the receiving end, but honest.

This is the same boundary problem I keep running into while building agent tooling. The hard part is almost never having a guardrail. The hard part is telling routine work apart from real risk when both produce the same surface text. You solve it with context and provenance, not with a bigger keyword blocklist — who is asking, against what resource, with what standing authority. A model prompt alone rarely carries enough of that to be sure, so it hedges.

What I'm taking away

A few practical notes from the experience:

Keep a fallback model wired in. The single best thing about this run was that I never hit a dead end. The switch to Opus 4.8 meant the work continued. If your stack pins one model, a broad guardrail becomes a hard stop instead of a detour.
Frame benign work plainly. I wasn't trying to sneak anything past a filter, but the lesson generalizes: describe the actual task and the actual scope. "Review my app" carries different weight than a bare "find exploits," even when you mean the same thing.
Expect this to tighten over time, not loosen. The banner said the guardrails are broad for now. That's a snapshot, not a permanent state. If you benchmark a new model's refusals in its first week, re-test later — the boundary moves.
A flag is data, not a verdict. Getting flagged told me something useful about where this model's line currently sits. That's worth knowing before I lean on it for a whole class of tasks.

The honest verdict

So, did I get to try Fable 5? Sort of. I got to watch its safety layer make a call, and I got my review from Opus 4.8. Not the head-to-head I planned, but a more interesting first contact than a clean benchmark would have been.

Broad-then-refined is a reasonable way to ship a frontier model. I'd rather a new model be over-cautious about security work and get sharper over time than the reverse. The friction is real, but the fallback made it survivable — and the transparency about why I got flagged is the part I actually appreciated.

I'll keep testing. Next time I'll hand Fable 5 something it has no reason to squint at, and see how it does when the guardrails aren't the story.

Written and published straight from my own self-hosted stack.