DEV Community

Deva

Two topic guardrails in one commit, and why they do different jobs

The build log has an interesting split: topic_guardrails_directive() gets injected into every generation prompt, and violates_topic_policy() is a pure Python lexical gate wired into the lint step before publish. Two different tools doing two different jobs.

The problem they solve: a content engine whose voice profile actively listed partisan political positions. Left unchecked, that profile instructs the generator to produce exactly what the guardrail is meant to block. The prompt fragment supersedes the voice profile by sitting higher in the instruction stack. The Python gate catches whatever the prompt missed.
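A minimal sketch of what "sitting higher in the instruction stack" could look like, assuming the prompt is assembled from plain strings. Only topic_guardrails_directive() is named in the build log; its wording, build_generation_prompt(), and the sample voice profile below are hypothetical.

```python
def topic_guardrails_directive() -> str:
    # Hypothetical wording; the real directive text is not shown in the build log.
    return (
        "TOPIC POLICY (overrides everything below): do not take partisan "
        "political positions or endorse parties, candidates, or campaigns."
    )


def build_generation_prompt(voice_profile: str, task: str) -> str:
    # The directive goes first so it sits above the voice profile
    # in the assembled prompt.
    sections = [topic_guardrails_directive(), voice_profile, task]
    return "\n\n".join(sections)


prompt = build_generation_prompt(
    voice_profile="Voice: opinionated, dry, technical.",
    task="Write a short post about build caching.",
)
```

Ordering alone does not guarantee the model obeys the directive, which is why the second layer exists.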

What prompted the split? The key line in the commit is "never passes through on infra failure." The generation critic gate in these pipelines is another LLM call. When the LLM backend is down, critics pass by design so an infra failure does not block posting. That is the right default for a critic. It is the wrong default for a hard policy gate. You cannot have a system where "backend unreachable" means "sure, post the political rant." So the hard gate is pure Python: no network call, no subprocess, no dependency on claude -p being alive. It either runs or the process dies.
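The difference in failure behavior can be sketched like this. The word list, critic_passes(), and backend_down() are hypothetical stand-ins; only the name violates_topic_policy() comes from the build log, and its real body is not shown there.

```python
import re

# Hypothetical word list; the real one is not shown in the build log.
FLAGGED_TERMS = ("election", "senator", "ballot", "partisan")

_FLAGGED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in FLAGGED_TERMS) + r")\b",
    re.IGNORECASE,
)


def violates_topic_policy(text: str) -> bool:
    # Pure in-process check: no network, no subprocess, no LLM.
    # Any exception raised here propagates and stops the publish step.
    return _FLAGGED_RE.search(text) is not None


def critic_passes(text: str, call_llm) -> bool:
    # Fail-open: if the backend is unreachable, the critic lets the draft through.
    try:
        return call_llm(text)
    except ConnectionError:
        return True


def backend_down(text: str) -> bool:
    # Stand-in for an LLM call made while the backend is unreachable.
    raise ConnectionError("LLM backend unreachable")


draft = "Why every senator should back this bill"
critic_ok = critic_passes(draft, backend_down)  # True: infra failure passes
blocked = violates_topic_policy(draft)          # True: the gate still fires
```

The critic swallows the outage and passes the draft. The gate has no outage to swallow.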

The two layers handle different failure modes. The injected prompt directive handles the base case where the model is alive and cooperative. The lexical gate handles the model ignoring the instruction, the model hallucinating around the constraint, and infra being down entirely.

That is the right architecture. Soft guards at generation time, hard gate at publish time. The publish gate must be independent of whatever generated the content.

The tradeoff with lexical matching is that it is inherently imprecise. A word list catches obvious cases. It misses creative reframings and raises false positives on legitimate economic analysis that happens to use flagged vocabulary. You are trading precision and recall for reliability. That is acceptable here because a false positive means manual review, not a catastrophic post going live.
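Both kinds of error are easy to reproduce. The sketch below uses a hypothetical two-word list and a hypothetical helper, flagged_words(), purely to illustrate the imprecision; it is not the gate from the commit.

```python
# Hypothetical flagged vocabulary, for illustration only.
FLAGGED = {"tariff", "election"}


def flagged_words(text: str) -> set[str]:
    # Lowercase, strip surrounding punctuation, intersect with the list.
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return words & FLAGGED


analysis = "The tariff raised input costs for small manufacturers."
reframed = "You know exactly who to pick at the polls in November."

false_positive = flagged_words(analysis)  # {'tariff'}: neutral analysis, flagged
missed = flagged_words(reframed)          # set(): partisan nudge, not flagged
```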

What I would do differently: the lexical gate is currently wired into the step that runs immediately before publishing. I would add a second call right after generation and before the critic, so bad drafts get rejected before spending tokens on a full critic pass. The critic is expensive. Killing a draft early on a keyword match is cheap. Running the check twice would add one fast Python call and save one LLM call on the failure path.
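As a sketch of that ordering, with every stage passed in as a plain callable: run_pipeline(), the return strings, and the stub stages below are all hypothetical, and the critic is assumed to return a simple pass/fail boolean.

```python
def run_pipeline(generate, violates_topic_policy, critic, publish):
    draft = generate()

    # Early check: cheap, and skips the critic's LLM call on the failure path.
    if violates_topic_policy(draft):
        return "rejected_before_critic"

    if not critic(draft):
        return "rejected_by_critic"

    # The existing check stays in place immediately before publish.
    if violates_topic_policy(draft):
        return "rejected_before_publish"

    publish(draft)
    return "published"


critic_calls = []


def fake_critic(draft):
    # Records each call so we can see whether the critic was reached.
    critic_calls.append(draft)
    return True


result = run_pipeline(
    generate=lambda: "vote for my candidate",
    violates_topic_policy=lambda text: "candidate" in text,
    critic=fake_critic,
    publish=lambda draft: None,
)
# result == "rejected_before_critic" and critic_calls == []
```

In this straight-line version the second check looks redundant. It earns its place once anything between generation and publish can modify the draft, and it keeps the publish gate independent of the stages before it.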

One more change worth noting: "partisan" became "opinionated" in the voice profile header. That is not cosmetic. "Partisan" in a system prompt is an affirmative instruction to produce partisan content. The injected directive then has to override it. Removing the affirmative instruction means the directive only has to maintain a constraint rather than reverse a standing order. That is easier for the model to follow and less surprising when the constraint holds.
