This week a post hit the top of Hacker News: "An AI Agent Published a Hit Piece on Me."
If you haven't read it, the short version: someone set up an AI agent to research and publish content autonomously. It did. About a real person. Without their consent. With accuracy problems. Published. Live on the internet.
500+ comments. Most of them a variation of: "This is why we can't have nice things."
I've been running 23 AI agents in production for over a year. I've had agents send emails I didn't approve, book calendar events I didn't ask for, and post content that made me cringe. I've learned — the hard way — that the question isn't "is this agent capable enough?"
It's: "What happens when this agent does exactly what you told it to, and it's wrong?"
The Real Problem Isn't the Model
Everyone defaults to blaming the LLM. "Hallucination." "Misalignment." "The model made stuff up."
That's a cop-out.
When an agent publishes damaging content about a real person, the model didn't fail. The system failed. Specifically, four things:
1. No approval gate before irreversible actions
Publishing content is irreversible. Once it's indexed, you're fighting Google for weeks. Any agent pipeline that involves: sending messages, posting content, making purchases, or deleting data — needs a human-in-the-loop checkpoint before execution. Non-negotiable.
If your agent can publish to the internet without you seeing it first, that's not automation. That's delegation without oversight.
2. The scope was undefined
"Research and publish content" is not a scope. It's a blank cheque. Agents are literal. They will do exactly what you said, at maximum velocity, with no judgment about what's appropriate.
Proper scope looks like:
- Topics: [specific domains only]
- Subjects: [no content about named individuals without explicit approval]
- Output: [draft only — never publish autonomously]
- Escalate if: [content involves real people, legal claims, or sensitive categories]
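A scope like the one above can be codified rather than left in prose. Here's a minimal sketch of that idea as a pre-flight check — the field names (`SCOPE`, `within_scope`) and example topics are illustrative, not from any real system:

```python
# Hypothetical scope config -- field names and topics are illustrative.
SCOPE = {
    "topics": ["python-tooling", "devops"],   # specific domains only
    "forbid_named_individuals": True,          # no content about real people
    "output_mode": "draft_only",               # never publish autonomously
    "escalate_on": ["real_people", "legal_claims", "sensitive_categories"],
}

def within_scope(topic: str, mentions_person: bool) -> bool:
    """Return True only if the task fits the declared scope."""
    if topic not in SCOPE["topics"]:
        return False
    if mentions_person and SCOPE["forbid_named_individuals"]:
        return False
    return True
```

The point isn't the specific checks; it's that "research and publish content" becomes a data structure the pipeline can enforce, instead of an instruction the agent can reinterpret.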
3. No tool policy locking
If an agent has access to a publishing API, it will use the publishing API. If it has access to email, it will send email. If it has read access to your contacts, your contacts are fair game.
The principle of least privilege applies to agents too. Give them the minimum tools to do the job. Lock everything else.
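One way to make least privilege concrete is to grant tools per job from an explicit allowlist, with an empty default. A sketch, assuming a hypothetical tool registry (all names here are made up for illustration):

```python
# Hypothetical tool registry -- tool and job names are illustrative.
ALL_TOOLS = {"web_search", "read_docs", "send_email", "publish_post", "delete_file"}

def build_toolset(job: str) -> set[str]:
    """Grant only the minimum tools for the job; everything else stays locked."""
    grants = {
        "research": {"web_search", "read_docs"},
        "drafting": {"read_docs"},
    }
    return grants.get(job, set())  # unknown job -> no tools at all
```

The key design choice is the default: an unrecognized job gets nothing, rather than everything.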
4. No output review pipeline
Content agents specifically need a review layer. Before any output goes anywhere public, it needs to pass through:
- Factual claim detection (does this make verifiable assertions about real people?)
- Sentiment check (is this disparaging a named individual?)
- A human read, always, before publish
These aren't hard to build. They're just skipped in the rush to ship.
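As a sketch of how little code the review layer needs to start from: the checks below are crude stand-ins (a regex and a word list, chosen for illustration) for real claim-detection and sentiment classifiers, but the pipeline shape is the point.

```python
import re

def review(draft: str) -> list[str]:
    """Return the reasons a draft must be held before anything public.

    The individual checks are illustrative stand-ins for real classifiers.
    """
    flags = []
    # Crude factual-claim detection: a capitalized name followed by a claim verb.
    if re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+ (is|was|did|committed)\b", draft):
        flags.append("factual_claim_about_person")
    # Crude sentiment check: disparaging vocabulary anywhere in the draft.
    if any(word in draft.lower() for word in ("fraud", "liar", "scam")):
        flags.append("disparaging_language")
    # A human read, always, before publish.
    flags.append("human_read_required")
    return flags
```

Even this toy version would have flagged a hit piece twice over before it ever reached a publish call.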
What We Actually Run in Production
Here's our config philosophy, shaped by a year of getting this wrong:
Tiered action risk levels
Every tool available to an agent is tagged with a risk level:
- `read_only` — agent can do this freely
- `reversible_write` — agent can do this, logs everything
- `irreversible_write` — agent must hold and request approval
- `high_risk` — human approval required, with explicit confirmation
Publishing, sending, deleting = irreversible. Always.
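The tiers above can be expressed as an ordered enum so the approval gate is a comparison, not a judgment call. A minimal sketch — the tool names and tagging are hypothetical:

```python
from enum import Enum

class Risk(Enum):
    READ_ONLY = 1          # agent can do this freely
    REVERSIBLE_WRITE = 2   # agent can do this, logs everything
    IRREVERSIBLE_WRITE = 3 # agent must hold and request approval
    HIGH_RISK = 4          # human approval with explicit confirmation

# Hypothetical tagging -- tool names are illustrative.
TOOL_RISK = {
    "web_search": Risk.READ_ONLY,
    "update_draft": Risk.REVERSIBLE_WRITE,
    "publish_post": Risk.IRREVERSIBLE_WRITE,
    "send_email": Risk.IRREVERSIBLE_WRITE,
    "delete_data": Risk.HIGH_RISK,
}

def needs_approval(tool: str) -> bool:
    """Anything at or above irreversible must wait for a human."""
    return TOOL_RISK[tool].value >= Risk.IRREVERSIBLE_WRITE.value
```

Because the ordering lives in the enum, adding a new tool means tagging it once; the gate logic never changes.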
Hard content rules at the system prompt level
Not guidelines. Hard rules:
NEVER generate content that:
- Makes factual claims about named individuals without verified sources
- Could be published without human review
- Contains negative characterizations of real people
Rules at the system prompt level are cheaper than rules in the workflow. Put them where they can't be bypassed.
The "would I sign this?" test
We ask every agent a simple proxy before any public action: "Would the account owner sign this with their name attached?" If the agent can't confidently say yes, it escalates. Every time.
This sounds simple. It works because LLMs are actually pretty good at modeling social consequences when you ask them to — they just don't do it unless prompted.
The Autonomy Dial
There's a real tension here that the HN comments mostly missed.
Full autonomy is dangerous. Full human-in-the-loop is just expensive software. The answer is a dial, not a binary.
For content specifically:
| Action | Our policy |
|---|---|
| Research, draft, summarize | Fully autonomous |
| Internal posts, notes, drafts | Autonomous with logging |
| Public posts (any platform) | Draft + human approve |
| Content about named people | Always human approve |
| Anything on a news/media site | Block entirely |
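A dial like this is also just a lookup table in code. One possible encoding of the policy above — the action keys are illustrative, and the important choice is that anything unrecognized falls through to the most restrictive setting:

```python
# Hypothetical encoding of the content policy -- action keys are illustrative.
CONTENT_POLICY = {
    "research":        "autonomous",
    "internal_post":   "autonomous_logged",
    "public_post":     "draft_then_approve",
    "about_person":    "always_approve",
    "news_media_site": "blocked",
}

def resolve(action: str) -> str:
    """Unknown actions default to blocked, never to autonomous."""
    return CONTENT_POLICY.get(action, "blocked")
```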
After a year of tuning, this is where we landed. It's not perfect. But we've never had an agent publish something we didn't want published.
The Lesson That Keeps Repeating
Every agent failure I've seen follows the same pattern:
Someone gave an agent too much trust, too fast, without adequate controls, because they were excited it worked.
The agent that published the hit piece didn't go rogue. It completed the task. The failure was in what task it was given, with what tools, with what guardrails.
Agents are not coworkers you can trust with judgment. They're interns with infinite energy and no social consequences for mistakes. You'd never give an intern your publishing credentials on day one. Don't give them to an agent either.
Where This Goes
The pattern that works:
- Start with read-only agents
- Add write access incrementally, reversible first
- Never give irreversible write access without an approval gate
- Review every public output, always, until you have evidence the agent can be trusted
- Codify trust in config, not vibes
The incident that blew up on HN this week will not be the last. The agents are getting more capable. The stakes are getting higher. The builders who survive this wave are the ones who treat control design with as much seriousness as capability design.
Your agent can do a lot. The question is what you let it do.
If you're building multi-agent systems, check out Mission Control OS — we've been running it in production for a year: https://jarveyspecter.gumroad.com/l/pmpfz