The critic approved a draft containing a partisan political term. Not ambiguous framing, not adjacent commentary, an actual party noun sitting in the copy that was supposed to never publish. The LLM critic looked at it, decided it was fine, and returned a pass.
That is the failure mode I wired against this week.
My content pipeline runs like this: generate a draft with claude p, run a deterministic regex lint floor, then run a second claude p critic pass. The critic is smart and catches subtle things the regex floor misses. But the critic can also have bad days, infra timeout and soft pass, or just misjudge when the draft is borderline.
The lint floor was always the right place to put the hard rules. I already had topic policy logic in the system prompt for my tooling, but it lived only there. If the voice file was unreadable, the guardrails dropped silently with it. The critic knew the rules but was not enforcing them with code.
The fix was a violates_topic_policy() function that runs inside the deterministic lint floor across all six pipelines: X, LinkedIn, Threads, Bluesky, my tooling, and personal site. It is not a suggestion to the critic. It is a hard fail. If the function returns true, the draft never reaches the critic, never reaches the publish step, never leaves the machine.
The lexicon it checks includes a party noun bigram. The reasoning there is simple: party names appear in political content, but they also appear in neutral sentences like "the ruling party announced." A bigram anchors the match to contexts where the political framing is the point, not a passing reference. It still has edges, but it cuts false positives without letting clear cases through.
The voice fallback got the same treatment. If my tooling cannot read the voice file, it falls back to a default system prompt. That fallback now hardcodes the guardrails directive directly, so the policy constraint does not depend on the file being readable. Defense in depth for a quiet failure mode.
I also added topic policy tests per package. That part took longer than the implementation. Each package needed its own fixture for what a policy violating draft looks like in that pipeline format, and making sure the test actually exercises the code path at the lint floor, not a mock above it.
The tradeoff worth naming: deterministic lexicon rules are brittle. A determined paraphrase gets through. The critic might still catch it, or it might not. What the lint floor guarantees is that an explicit term in the lexicon never ships, regardless of the critic state. That guarantee is worth more than the flexibility the critic adds at that layer.
What I would do differently: wire this at day one, not week forty something. The architecture always had a lint floor. Putting hard policy constraints there from the start would have made the failure mode impossible from the first pipeline that went live, not after six had already shipped without it.
Top comments (0)