This week a post hit the top of Hacker News: "An AI Agent Published a Hit Piece on Me."
If you haven't read it, the short version: someone set up an AI agent to research and publish content autonomously. It did. About a real person. Without their consent. With accuracy problems. Published. Live on the internet.
500+ comments. Most of them a variation of: "This is why we can't have nice things."
I've been running 23 AI agents in production for over a year. I've had agents send emails I didn't approve, book calendar events I didn't ask for, and post content that made me cringe. I've learned — the hard way — that the question isn't "is this agent capable enough?"
It's: "What happens when this agent does exactly what you told it to, and it's wrong?"
The Real Problem Isn't the Model
Everyone defaults to blaming the LLM. "Hallucination." "Misalignment." "The model made stuff up."
That's a cop-out.
When an agent publishes damaging content about a real person, the model didn't fail. The system failed. Specifically, four things:
1. No approval gate before irreversible actions
Publishing content is irreversible. Once it's indexed, you're fighting Google for weeks. Any agent pipeline that involves: sending messages, posting content, making purchases, or deleting data — needs a human-in-the-loop checkpoint before execution. Non-negotiable.
If your agent can publish to the internet without you seeing it first, that's not automation. That's delegation without oversight.
2. The scope was undefined
"Research and publish content" is not a scope. It's a blank cheque. Agents are literal. They will do exactly what you said, at maximum velocity, with no judgment about what's appropriate.
Proper scope looks like:
- Topics: [specific domains only]
- Subjects: [no content about named individuals without explicit approval]
- Output: [draft only — never publish autonomously]
- Escalate if: [content involves real people, legal claims, or sensitive categories]
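A scope like the one above can be codified rather than left in prose. Here's a minimal sketch of that idea as a pre-flight check — the field names (`SCOPE`, `within_scope`) and example topics are illustrative, not from any real system:

```python
# Hypothetical scope config -- field names and topics are illustrative.
SCOPE = {
    "topics": ["python-tooling", "devops"],   # specific domains only
    "forbid_named_individuals": True,          # no content about real people
    "output_mode": "draft_only",               # never publish autonomously
    "escalate_on": ["real_people", "legal_claims", "sensitive_categories"],
}

def within_scope(topic: str, mentions_person: bool) -> bool:
    """Return True only if the task fits the declared scope."""
    if topic not in SCOPE["topics"]:
        return False
    if mentions_person and SCOPE["forbid_named_individuals"]:
        return False
    return True
```

The point isn't the specific checks; it's that "research and publish content" becomes a data structure the pipeline can enforce, instead of an instruction the agent can reinterpret.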
3. No tool policy locking
If an agent has access to a publishing API, it will use the publishing API. If it has access to email, it will send email. If it has read access to your contacts, your contacts are fair game.
The principle of least privilege applies to agents too. Give them the minimum tools to do the job. Lock everything else.
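One way to make least privilege concrete is to grant tools per job from an explicit allowlist, with an empty default. A sketch, assuming a hypothetical tool registry (all names here are made up for illustration):

```python
# Hypothetical tool registry -- tool and job names are illustrative.
ALL_TOOLS = {"web_search", "read_docs", "send_email", "publish_post", "delete_file"}

def build_toolset(job: str) -> set[str]:
    """Grant only the minimum tools for the job; everything else stays locked."""
    grants = {
        "research": {"web_search", "read_docs"},
        "drafting": {"read_docs"},
    }
    return grants.get(job, set())  # unknown job -> no tools at all
```

The key design choice is the default: an unrecognized job gets nothing, rather than everything.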
4. No output review pipeline
Content agents specifically need a review layer. Before any output goes anywhere public, it needs to pass through:
- Factual claim detection (does this make verifiable assertions about real people?)
- Sentiment check (is this disparaging a named individual?)
- A human read, always, before publish
These aren't hard to build. They're just skipped in the rush to ship.
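As a sketch of how little code the review layer needs to start from: the checks below are crude stand-ins (a regex and a word list, chosen for illustration) for real claim-detection and sentiment classifiers, but the pipeline shape is the point.

```python
import re

def review(draft: str) -> list[str]:
    """Return the reasons a draft must be held before anything public.

    The individual checks are illustrative stand-ins for real classifiers.
    """
    flags = []
    # Crude factual-claim detection: a capitalized name followed by a claim verb.
    if re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+ (is|was|did|committed)\b", draft):
        flags.append("factual_claim_about_person")
    # Crude sentiment check: disparaging vocabulary anywhere in the draft.
    if any(word in draft.lower() for word in ("fraud", "liar", "scam")):
        flags.append("disparaging_language")
    # A human read, always, before publish.
    flags.append("human_read_required")
    return flags
```

Even this toy version would have flagged a hit piece twice over before it ever reached a publish call.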
What We Actually Run in Production
Here's our config philosophy, shaped by a year of getting this wrong:
Tiered action risk levels
Every tool available to an agent is tagged with a risk level:
- `read_only` — agent can do this freely
- `reversible_write` — agent can do this, logs everything
- `irreversible_write` — agent must hold and request approval
- `high_risk` — human approval required, with explicit confirmation
Publishing, sending, deleting = irreversible. Always.
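The tiers above can be expressed as an ordered enum so the approval gate is a comparison, not a judgment call. A minimal sketch — the tool names and tagging are hypothetical:

```python
from enum import Enum

class Risk(Enum):
    READ_ONLY = 1          # agent can do this freely
    REVERSIBLE_WRITE = 2   # agent can do this, logs everything
    IRREVERSIBLE_WRITE = 3 # agent must hold and request approval
    HIGH_RISK = 4          # human approval with explicit confirmation

# Hypothetical tagging -- tool names are illustrative.
TOOL_RISK = {
    "web_search": Risk.READ_ONLY,
    "update_draft": Risk.REVERSIBLE_WRITE,
    "publish_post": Risk.IRREVERSIBLE_WRITE,
    "send_email": Risk.IRREVERSIBLE_WRITE,
    "delete_data": Risk.HIGH_RISK,
}

def needs_approval(tool: str) -> bool:
    """Anything at or above irreversible must wait for a human."""
    return TOOL_RISK[tool].value >= Risk.IRREVERSIBLE_WRITE.value
```

Because the ordering lives in the enum, adding a new tool means tagging it once; the gate logic never changes.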
Hard content rules at the system prompt level
Not guidelines. Hard rules:
NEVER generate content that:
- Makes factual claims about named individuals without verified sources
- Could be published without human review
- Contains negative characterizations of real people
Rules at the system prompt level are cheaper than rules in the workflow. Put them where they can't be bypassed.
The "would I sign this?" test
We ask every agent a simple proxy before any public action: "Would the account owner sign this with their name attached?" If the agent can't confidently say yes, it escalates. Every time.
This sounds simple. It works because LLMs are actually pretty good at modeling social consequences when you ask them to — they just don't do it unless prompted.
The Autonomy Dial
There's a real tension here that the HN comments mostly missed.
Full autonomy is dangerous. Full human-in-the-loop is just expensive software. The answer is a dial, not a binary.
For content specifically:
| Action | Our policy |
|---|---|
| Research, draft, summarize | Fully autonomous |
| Internal posts, notes, drafts | Autonomous with logging |
| Public posts (any platform) | Draft + human approve |
| Content about named people | Always human approve |
| Anything on a news/media site | Block entirely |
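A dial like this is also just a lookup table in code. One possible encoding of the policy above — the action keys are illustrative, and the important choice is that anything unrecognized falls through to the most restrictive setting:

```python
# Hypothetical encoding of the content policy -- action keys are illustrative.
CONTENT_POLICY = {
    "research":        "autonomous",
    "internal_post":   "autonomous_logged",
    "public_post":     "draft_then_approve",
    "about_person":    "always_approve",
    "news_media_site": "blocked",
}

def resolve(action: str) -> str:
    """Unknown actions default to blocked, never to autonomous."""
    return CONTENT_POLICY.get(action, "blocked")
```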
After a year of tuning, this is where we landed. It's not perfect. But we've never had an agent publish something we didn't want published.
The Lesson That Keeps Repeating
Every agent failure I've seen follows the same pattern:
Someone gave an agent too much trust, too fast, without adequate controls, because they were excited it worked.
The agent that published the hit piece didn't go rogue. It completed the task. The failure was in what task it was given, with what tools, with what guardrails.
Agents are not coworkers you can trust with judgment. They're interns with infinite energy and no social consequences for mistakes. You'd never give an intern your publishing credentials on day one. Don't give them to an agent either.
Where This Goes
The pattern that works:
- Start with read-only agents
- Add write access incrementally, reversible first
- Never give irreversible write access without an approval gate
- Review every public output, always, until you have evidence the agent can be trusted
- Codify trust in config, not vibes
The incident that blew up on HN this week will not be the last. The agents are getting more capable. The stakes are getting higher. The builders who survive this wave are the ones who treat control design with as much seriousness as capability design.
Your agent can do a lot. The question is what you let it do.
If you're building multi-agent systems, check out Mission Control OS — we've been running it in production for a year: https://jarveyspecter.gumroad.com/l/pmpfz