614 tests green, no regressions. That is what the guardrail cost in test churn: zero.
The problem was straightforward, even if the fix required touching five files across four packages. I run content engines on X, Bluesky, Threads, and LinkedIn. All of them generate posts and comments, and all of them go through a two-layer quality gate (a regex lint floor plus a Claude critic). None of them had an explicit rule against personal attacks.
That sounds obvious in hindsight. You build a system that generates sharp, opinionated content and you assume the model knows not to be cruel. Sometimes it does. Sometimes it produces something that attacks the person instead of the idea, and "the person" is exactly what you cannot touch. Critique the argument, the claim, the track record, the logic. Never the human.
So I added it at two seams.
Generation. In core/voice_prompts.py, inside _hard_rules(), a universal rule now ships with every build_voice_block() call. Every engine already imports this, so one edit covers all four platforms. The rule is explicit: attack ideas, arguments, logic, and track records as hard as you want. No insults, no name-calling, no slurs, no content that targets a person or group.
Critic. The second seam is the quality gate that reviews generated content before it posts. In quality.py for X, Bluesky, and Threads: pass=false on a personal attack, same gate as the value or score floor. In LinkedIn's quality.py and critic.py: a critical issue flag in the post rubric that caps the score at 3 and forces a revise verdict, plus the same check in the comment rubric.
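To make the "prompt only" shape concrete, here is a rough sketch, assuming the critic replies with JSON that contains a boolean `pass` field. The rule wording, the build_critic_prompt() and is_postable() helpers, and the JSON shape are all hypothetical; the point is that the new rule lives in the prompt text while the code that reads the verdict stays the same.

```python
import json

# Hypothetical rubric line added to the critic prompt; the wording is assumed.
PERSONAL_ATTACK_RULE = (
    "If the content attacks a person or group instead of an idea, "
    'set "pass" to false.'
)


def build_critic_prompt(draft: str) -> str:
    """Assemble an illustrative critic prompt that includes the new rule."""
    return (
        "Review the draft below and reply with JSON containing "
        '"pass" (boolean) and "score" (integer).\n'
        f"{PERSONAL_ATTACK_RULE}\n\n"
        f"Draft:\n{draft}"
    )


def is_postable(critic_reply: str) -> bool:
    """Read the critic's boolean gate; anything other than true blocks the post."""
    verdict = json.loads(critic_reply)
    return verdict.get("pass") is True
```

In this sketch is_postable() never looks for a personal attack itself. It trusts whatever the critic put in `pass`, which is why the gate is only as reliable as the model's reading of the rule.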
The tradeoff I chose deliberately: prompt-only, no control flow change. The alternative was adding a separate classification step, something like a lightweight binary classifier that runs before the critic and hard-kills any content in which it detects a personal attack. That would be more robust, especially as an automated tripwire that does not depend on the model following instructions correctly.
I went prompt-only for two reasons. First, I already have five interrelated quality layers, and adding a classification step as a sixth would increase latency and cost on every generation cycle. Second, the critic prompt already has a boolean pass gate. A personal attack trips it to false, which is the same outcome as a hard kill, just one step later in the pipeline. The failure mode I accepted: a personal attack could score high on other dimensions and slip through if the critic misses it. That is a real risk, not a theoretical one.
What I would do differently: add a targeted regex scan after generation for the most obvious personal attack patterns, before the critic even runs. Slurs and direct insults have a finite vocabulary. A compact blocklist that short-circuits immediately would catch the worst cases without a full critic pass, and I could add it without touching any of the existing critic logic. That is the next step. The prompt rule handles the middle ground; the regex handles the floor.
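A sketch of what that regex floor might look like. The terms in BLOCKLIST are mild placeholders, and fails_attack_floor(), review(), and the run_critic callable are hypothetical names; none of this exists in the codebase yet.

```python
import re

# Placeholder terms only; a real blocklist would be curated separately.
BLOCKLIST = ["idiot", "moron", "clown"]

# One compiled pattern: whole-word match, case-insensitive.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)


def fails_attack_floor(text: str) -> bool:
    """Return True if the draft contains a blocklisted term as a whole word."""
    return _PATTERN.search(text) is not None


def review(draft: str, run_critic) -> bool:
    """Short-circuit on the regex floor; only call the critic if the draft clears it."""
    if fails_attack_floor(draft):
        return False
    return run_critic(draft)
```

Whole-word matching keeps false positives down, but it also misses plurals and creative spellings, which is the gap the critic prompt is left to cover.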
The core package carries 40 tests; the platform packages carry 574. Both are green, with no changes needed to any test. Prompt-only changes have that advantage: they do not change the control flow the tests were written against, so the test suite validates the surrounding logic without modification. When you do need to test a prompt change, you test it by exercising the actual generation path, not by writing unit assertions against prompt text.
That is the build. One hard rule, five files, 614 tests still green.