How many lines of safety code guard my agent? About 40, in 2 files.
That sounds small, but they are the most important 40 lines in the whole
pipeline. I run an autonomous AI operator that builds small tools and writes
about what it learns. Recently it started publishing to public channels on its
own: dev.to, Reddit, LinkedIn. Before flipping that switch, 2 failure
modes scared me more than a typo: leaking something private, and getting an
account permanently banned. Both are unrecoverable. A bad post you delete in 5
seconds. A leaked secret or a killed account you do not get back at all.
The fix lives in 2 modules: a gate in identity_firewall.py and a routing table
in social_sessions.py, wired together by a small autoposter.py loop. Here are
both guardrails, and why each one is shaped the way it is.
1. The identity firewall must fail CLOSED
A common pattern is to scan outbound text for forbidden strings and block the
send when one shows up. The subtle bug is what happens when the scanner itself
cannot run: the binary is missing after a deploy, the subprocess times out after
10 seconds, an import throws. If your default in that case is "allow," you have
built a filter that silently disables itself exactly when something is wrong. In
my system that default flipped once and went unnoticed for 3 days, which is how I
learned to care about it.
The gate is plain Python: a subprocess call to a scanner binary, guarded by a
10-second timeout:
import subprocess

# FIREWALL_BIN (a Path to the scanner binary) is defined elsewhere in the module.

def check(text, *, egress=True):
    """Return (allowed, detail). Egress paths fail CLOSED on any scanner error."""
    if not text:
        return True, ""
    if not FIREWALL_BIN.exists():
        # egress paths fail CLOSED: a missing scanner means "send nothing",
        # not "send unfiltered".
        return (False, "firewall_unavailable") if egress else (True, "")
    try:
        proc = subprocess.run([FIREWALL_BIN, "--check"], input=text,
                              capture_output=True, text=True, timeout=10)
    except (subprocess.TimeoutExpired, OSError):
        # a timeout or a failed launch is treated like a missing scanner
        return (False, "firewall_error") if egress else (True, "")
    return (proc.returncode == 0), proc.stdout.strip()
The scanner itself is a small list of regex patterns, versioned on GitHub and
runnable on Python 3.11. It checks every outbound string against about 1200
characters of pattern definitions in well under 100ms. Nothing fancy. The
discipline is in the defaults, not the cleverness.
The rule: a false positive blocks 1 post and the loop just regenerates it. A
false negative leaks something you can never take back. Bias every ambiguous
case toward blocking. In my runs, roughly 1 in 20 drafts trips the filter and
gets regenerated, which is a price worth paying.
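The block-and-regenerate behavior can be sketched as a bounded retry loop. This is a minimal sketch, not the actual autoposter.py: draft_until_clean, generate_draft, and max_attempts are hypothetical names, and the cap of 3 attempts is an assumption. The check argument is any callable with the same (allowed, detail) return shape as the gate above.

```python
from typing import Callable, Optional, Tuple


def draft_until_clean(
    generate_draft: Callable[[], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,  # assumed cap, not a value from the real loop
) -> Optional[str]:
    """Return the first draft the firewall allows, or None if all are blocked."""
    for _ in range(max_attempts):
        draft = generate_draft()
        allowed, _detail = check(draft)
        if allowed:
            return draft
    # Every attempt was blocked: send nothing rather than send unfiltered.
    return None
```

Returning None when every attempt is blocked keeps the same bias as the gate itself: an exhausted retry budget means no post, never an unfiltered one.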
Two more things that turned out to matter:
- Scan the title, not just the body. It is easy to route the body through the
  filter and forget the headline. Cover the whole surface.
- Word boundaries beat substrings. Blocking a bare 3-letter token should not
  trip on a longer word that happens to contain it (blocking "cat" should not
  flag "category"). Use \b anchors in your regex and test the near-misses on
  purpose.
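Both points can be sketched in a few lines. This is an illustration under stated assumptions, not the real scanner: FORBIDDEN_TOKENS holds made-up placeholder tokens, and scan_post is a hypothetical helper name. The pattern wraps each token in \b anchors, and the function scans the title as well as the body.

```python
import re

# Hypothetical forbidden tokens, for illustration only.
FORBIDDEN_TOKENS = ["cat", "hunter2"]

# \b on both sides so a token matches as a whole word, never as a substring.
FORBIDDEN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in FORBIDDEN_TOKENS) + r")\b",
    re.IGNORECASE,
)


def scan_post(title: str, body: str) -> list[str]:
    """Return every forbidden token found in the title or the body."""
    hits = []
    for field in (title, body):  # the headline is part of the surface too
        hits.extend(m.group(0).lower() for m in FORBIDDEN_RE.finditer(field))
    return hits
```

Near-misses such as "category", "concatenate", and "cats" are worth keeping as explicit test cases, since they are exactly what a substring match would flag by mistake. One caveat: \b only works as intended for tokens that start and end with a word character.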
2. Do not fight a platform's anti-bot system. Route around it.
Some platforms are fine with API posting and hostile to browser automation.
Twitter/X is the clearest example: the official API v2 is supported, but
driving a headless browser to post is a detection game you will eventually lose,
and the loss is a permanent ban. Reddit is similar for self-promotion, where
most subreddits treat frequent self-posts as spam. dev.to and LinkedIn, by
contrast, are far more tolerant. If you drive a headless browser to post on a
hostile platform, you are not "automating it," you are gambling the account.
So the distribution loop treats browser automation as disabled for those
platforms, even when a valid logged-in session exists. The session stays
"verified" for honest status reporting, but it is never marked postable through
the browser path:
is_hostile_to_browser_posting = platform in ("x", "twitter")
session = {
    "platform": platform,
    "verified_signed_in": True,  # truthful: we ARE logged in
    "ready_for_public_post": not is_hostile_to_browser_posting,
    "reason": "needs_official_api_not_browser_automation"
              if is_hostile_to_browser_posting else None,
}
Posting to that platform then waits for the official API instead of risking
the account. A channel you can post to safely tomorrow beats a banned account
today.
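One way the loop might consume such a session record is sketched below. It is an assumption about the routing step, not code from social_sessions.py: choose_post_path and has_api_credentials are hypothetical names, and only the dictionary keys come from the record shown above.

```python
from typing import Optional


def choose_post_path(session: dict, has_api_credentials: bool) -> Optional[str]:
    """Pick "browser", "api", or None (wait) for one platform session."""
    if session.get("ready_for_public_post"):
        return "browser"
    if session.get("reason") == "needs_official_api_not_browser_automation":
        # Being logged in is not enough: only the official API may post here.
        return "api" if has_api_credentials else None
    # Anything unrecognized means "do not post".
    return None
```

A verified browser session on a hostile platform never yields "browser" here, which is the point: the status stays truthful while the risky path stays closed.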
The throttle that makes "autonomous" not mean "spam"
The last piece is a per-channel cooldown so the loop physically cannot flood.
Different platforms tolerate different cadences, so the cooldown is per-platform,
not global. My current values: dev.to at 12 hours, Reddit at 168 hours
(7 days), Twitter/X at 24 hours, and Hacker News at 720 hours (30 days), because
a given URL is basically once-per-life there:
COOLDOWN_HOURS = {
    "devto": 12,
    "reddit": 168,  # 7 days; most subs treat frequent self-posts as spam
    "x": 24,
    "hn": 720,      # 30 days; a given URL is basically once-per-life
}
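A check against that table could look like the sketch below. The function name can_post, its arguments, and the choice to refuse unknown channels are assumptions for illustration. The cooldown values are the ones listed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COOLDOWN_HOURS = {"devto": 12, "reddit": 168, "x": 24, "hn": 720}


def can_post(channel: str, last_posted: Optional[datetime], now: datetime) -> bool:
    """True only if the channel is known and its cooldown has fully elapsed."""
    hours = COOLDOWN_HOURS.get(channel)
    if hours is None:
        # Unknown channel (assumed policy): refuse rather than guess a cadence.
        return False
    if last_posted is None:
        return True  # never posted on this channel before
    return now - last_posted >= timedelta(hours=hours)


# Usage: 11 hours after a dev.to post, the loop is still locked out.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
dev_ok = can_post("devto", now - timedelta(hours=11), now)
```

Passing now in explicitly, instead of calling the clock inside the function, keeps the check easy to test at the exact cooldown boundary.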
A quality gate sits in front of all of it: I score each draft 0-100 and refuse
anything under 70, which rejects roughly 60% of first drafts. The scorer weighs
4 things: specificity at 30%, hook strength at 25%, novelty at 25%, and a
self-promotion ratio at 20%. "Autonomous" should never mean "as fast as
possible." It means the system decides when NOT to act without a human reminding
it.
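The weighting works out as a plain weighted sum. The sketch below assumes each of the 4 sub-scores is already on a 0-100 scale where higher is better (so a low self-promotion ratio earns a high sub-score). The names WEIGHTS, MIN_SCORE, quality_score, and passes_gate are hypothetical, and how the sub-scores themselves are computed is out of scope here.

```python
# Weights from the text; they sum to 1.0.
WEIGHTS = {
    "specificity": 0.30,
    "hook_strength": 0.25,
    "novelty": 0.25,
    "self_promotion": 0.20,
}
MIN_SCORE = 70  # drafts scoring below this are refused


def quality_score(subscores: dict[str, float]) -> float:
    """Weighted sum of 0-100 sub-scores; a missing sub-score raises KeyError."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


def passes_gate(subscores: dict[str, float]) -> bool:
    """True when the weighted score reaches the refusal threshold."""
    return quality_score(subscores) >= MIN_SCORE
```

For example, sub-scores of 80, 60, 70, and 90 in the order above give 24 + 15 + 17.5 + 18 = 74.5, which clears the gate. A draft that scores 60 on everything lands at 60 and is refused.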
The takeaway
If you are about to let an agent act in public on its own, the interesting code
is not the posting. It is the 3 decisions about when to refuse: fail closed when
your safety check cannot run, route around platforms that ban automation instead
of fighting them, and rate-limit per channel so the thing cannot become a flood.
I shipped all 3 before I let the loop post a single time, and a dry run caught a
real leak on the very first attempt in under 5s: a forbidden token I had
accidentally left inside an example code comment. The filter blocked its own
author. That is exactly the behavior you want. For context, the system publishes
posts of over 1000 words at a time, backed by more than 5000 lines of supporting
code, and the only posts that ever went out are the ones all 3 gates approved.
Build these guardrails first. The posting is the easy part.