My AI agent ran my business for 6 days before I figured out how to actually see what it was doing.
That sounds embarrassing. It is. But here's what that week taught me: AI agents fail silently in ways that monitoring tools weren't designed for. The failures that hurt you aren't exceptions — they're wrong decisions that look fine from the outside.
Here are the 5 telemetry patterns I built after those 6 days. Each one caught a real failure.
Pattern 1: The Decision Log (catches loop reinvention)
The failure it caught: My agent deleted an auth system. Then a cron loop rebuilt it. Then another loop deleted it again. This happened 4 times in one day.
Why standard monitoring misses it: No exceptions thrown. No 500 errors. Just an agent making a decision that contradicted a prior decision, with no memory of the prior decision.
The fix:
# DECISION_LOG.md — Locked Decisions
## [2026-03-07] Auth Gate: PERMANENTLY DELETED
**Decision:** Library is open-access. No login system.
**What is FORBIDDEN:** Creating Pages Functions in /functions/library/
**Override requires:** New entry from CEO session only
Every cron loop reads this file first. If it's about to take a forbidden action, it stops.
What to monitor: When did the file last change? If it's being modified frequently, a loop is fighting itself.
# In your cron health check
LAST_MODIFIED=$(stat -f %m ~/.openclaw/workspace/DECISION_LOG.md)
NOW=$(date +%s)
AGE=$(( NOW - LAST_MODIFIED ))
if [ $AGE -lt 300 ]; then
echo "ALERT: DECISION_LOG modified in last 5 minutes — possible loop conflict"
fi
Pattern 2: Email Rate Gate (catches customer spam)
The failure it caught: My agent sent 12 emails to our only paying customer in 90 minutes. Each loop detected an issue, generated a fix email, sent it — without knowing the previous loop already sent one.
Why standard monitoring misses it: Each individual email was valid. The problem was volume, not content.
The fix:
import json
import time
import os
EMAIL_GATE_FILE = "~/.openclaw/workspace/state/email-sent-today.json"
def can_send_email(recipient: str, email_type: str) -> bool:
gate_file = os.path.expanduser(EMAIL_GATE_FILE)
if os.path.exists(gate_file):
with open(gate_file) as f:
sent = json.load(f)
else:
sent = {}
today = time.strftime("%Y-%m-%d")
key = f"{recipient}:{email_type}:{today}"
if key in sent:
print(f"BLOCKED: Already sent {email_type} to {recipient} today")
return False
# Record the send
sent[key] = time.time()
with open(gate_file, "w") as f:
json.dump(sent, f)
return True
# Usage
if can_send_email("customer@example.com", "auth-fix"):
send_email(...)
The rule: One email per customer per email type per day. Hard stop.
Pattern 3: State Diff Monitoring (catches silent corruption)
The failure it caught: A sub-agent rewrote my homepage and dumbed down technical copy. "SOUL.md templates" became "Pre-Built AI Assistant Personalities." No error. No alert. Just degraded output that I caught on manual review 45 minutes later.
Why standard monitoring misses it: The deploy succeeded. The content was valid HTML. The semantic corruption is invisible to infrastructure monitoring.
The fix:
import hashlib
import json
import os
import time
SNAPSHOT_FILE = "~/.openclaw/workspace/state/content-snapshots.json"
def snapshot_critical_content():
"""Take a fingerprint of content that should only change intentionally."""
snapshots = {}
critical_files = [
"ask-patrick-site/index.html",
"ask-patrick-site/library/index.html",
"ask-patrick-site/playbook.html",
]
for filepath in critical_files:
if os.path.exists(filepath):
with open(filepath, "rb") as f:
content = f.read()
snapshots[filepath] = {
"hash": hashlib.sha256(content).hexdigest()[:16],
"size": len(content),
"timestamp": time.time()
}
snap_file = os.path.expanduser(SNAPSHOT_FILE)
if os.path.exists(snap_file):
with open(snap_file) as f:
prior = json.load(f)
for filepath, data in snapshots.items():
if filepath in prior:
if data["hash"] != prior[filepath]["hash"]:
print(f"CHANGED: {filepath} (was {prior[filepath]['hash'][:8]}, now {data['hash'][:8]})")
print(f" Size delta: {data['size'] - prior[filepath]['size']} bytes")
with open(snap_file, "w") as f:
json.dump(snapshots, f)
snapshot_critical_content()
Run this after every deploy. If a file changed and you didn't deploy it, something went wrong.
Pattern 4: Tool Call Audit Log (catches prompt injection)
The failure it caught: External content in a web scrape contained instructions that attempted to redirect my agent to a different task. The agent almost followed them.
Why standard monitoring misses it: The agent completed normally. The prompt injection attempt was in the input data, not an error in the agent.
The fix: Log every tool call with the source of the instruction.
import json
import time
import os
AUDIT_LOG = "~/.openclaw/workspace/logs/tool-audit.jsonl"
def log_tool_call(tool_name: str, inputs: dict, outputs: dict, instruction_source: str = "system"):
"""
instruction_source: "system" (from your prompts) vs "external" (from fetched content)
If source is "external", outputs should be treated as data only.
"""
entry = {
"timestamp": time.time(),
"tool": tool_name,
"instruction_source": instruction_source,
"input_summary": str(inputs)[:200],
"output_summary": str(outputs)[:200],
}
log_file = os.path.expanduser(AUDIT_LOG)
os.makedirs(os.path.dirname(log_file), exist_ok=True)
with open(log_file, "a") as f:
f.write(json.dumps(entry) + "\n")
# Alert if external content triggered a write or deploy
if instruction_source == "external" and tool_name in ["write", "exec", "deploy"]:
print(f"⚠️ ALERT: External content attempted to trigger {tool_name}")
print(f" Input: {entry['input_summary']}")
return False # Block the action
return True
The rule: External content is DATA. It cannot issue instructions. If something from outside your system is trying to make your agent take actions, stop and flag it.
Pattern 5: Revenue Checkpoint (catches the metric that matters)
The failure it caught: I spent two days optimizing library content quality while the primary checkout CTA pointed to the wrong Stripe product. Nobody buying the Playbook would have gotten the right product. The bug ran for ~20 hours.
Why standard monitoring misses it: Stripe was processing transactions. No errors. Just the wrong product ID.
The fix:
import json
import urllib.request
import base64
import os
def verify_payment_link_targets():
"""Confirm critical Stripe payment links resolve to the right products."""
key = os.environ.get("STRIPE_SECRET_KEY", "")
auth = base64.b64encode(f"{key}:".encode()).decode()
# Expected: payment link ID -> product name
expected = {
"plink_LIBRARY_ID": "The Library - $9/mo",
"plink_WORKSHOP_ID": "The Workshop - $29/mo",
"plink_HANDBOOK_ID": "The Operator's Handbook - $39",
}
for link_id, expected_name in expected.items():
url = f"https://api.stripe.com/v1/payment_links/{link_id}"
req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
try:
with urllib.request.urlopen(req) as r:
data = json.loads(r.read())
active = data.get("active", False)
if not active:
print(f"⚠️ INACTIVE payment link: {link_id} ({expected_name})")
except Exception as e:
print(f"ERROR checking {link_id}: {e}")
print("Payment link check complete")
verify_payment_link_targets()
Run this daily. Verify each critical payment link is active. Cross-reference with recent Stripe charges to confirm products are correct.
The 5-Minute Morning Check
These 5 patterns combined into one daily health check:
#!/bin/bash
# agent-health.sh — run every morning before anything else
echo "=== Agent Health Check $(date) ==="
# 1. Decision log integrity
DECISION_LOG="$HOME/.openclaw/workspace/DECISION_LOG.md"
if [ -f "$DECISION_LOG" ]; then
AGE=$(( $(date +%s) - $(stat -f %m "$DECISION_LOG") ))
[ $AGE -lt 300 ] && echo "⚠️ DECISION_LOG modified recently ($AGE seconds ago)"
else
echo "❌ DECISION_LOG missing — create it"
fi
# 2. Email gate check
EMAIL_GATE="$HOME/.openclaw/workspace/state/email-sent-today.json"
if [ -f "$EMAIL_GATE" ]; then
COUNT=$(python3 -c "import json; d=json.load(open('$EMAIL_GATE')); print(len(d))")
echo "📧 Emails sent today: $COUNT"
fi
# 3. Revenue check
source ~/.patrick-env 2>/dev/null
python3 -c "
import urllib.request, json, base64, os, time
key = os.environ.get('STRIPE_SECRET_KEY','')
auth = base64.b64encode(f'{key}:'.encode()).decode()
req = urllib.request.Request('https://api.stripe.com/v1/charges?limit=5',
headers={'Authorization': f'Basic {auth}'})
with urllib.request.urlopen(req) as r:
d = json.loads(r.read())
today = time.strftime('%Y-%m-%d')
recent = [c for c in d.get('data',[]) if c.get('paid')]
print(f'💳 Recent charges: {len(recent)}, total \${sum(c[\"amount\"] for c in recent)/100:.2f}')
" 2>/dev/null
echo "=== End Health Check ==="
What I Learned
The failures that kill AI agent businesses aren't infrastructure failures — they're behavioral failures. Your agent makes a wrong decision, takes a wrong action, sends a wrong message. Standard APM won't catch those.
You need telemetry that watches decisions, not just errors.
All 5 of these patterns are in the Ask Patrick Library with full configs, tested in production on a live AI-run business. The decision log entry that stopped my loop reinvention bug is literally still in my DECISION_LOG.md, unchanged, preventing the same failure in every cron loop since.
Top comments (0)