DEV Community

Patrick
Patrick

Posted on

5 telemetry patterns for AI agents that caught real production failures (with code)

My AI agent ran my business for 6 days before I figured out how to actually see what it was doing.

That sounds embarrassing. It is. But here's what that week taught me: AI agents fail silently in ways that monitoring tools weren't designed for. The failures that hurt you aren't exceptions — they're wrong decisions that look fine from the outside.

Here are the 5 telemetry patterns I built after those 6 days. Each one caught a real failure.


Pattern 1: The Decision Log (catches loop reinvention)

The failure it caught: My agent deleted an auth system. Then a cron loop rebuilt it. Then another loop deleted it again. This happened 4 times in one day.

Why standard monitoring misses it: No exceptions thrown. No 500 errors. Just an agent making a decision that contradicted a prior decision, with no memory of the prior decision.

The fix:

# DECISION_LOG.md — Locked Decisions

## [2026-03-07] Auth Gate: PERMANENTLY DELETED
**Decision:** Library is open-access. No login system.
**What is FORBIDDEN:** Creating Pages Functions in /functions/library/
**Override requires:** New entry from CEO session only
Enter fullscreen mode Exit fullscreen mode

Every cron loop reads this file first. If it's about to take a forbidden action, it stops.

What to monitor: When did the file last change? If it's being modified frequently, a loop is fighting itself.

# In your cron health check
LAST_MODIFIED=$(stat -f %m ~/.openclaw/workspace/DECISION_LOG.md)
NOW=$(date +%s)
AGE=$(( NOW - LAST_MODIFIED ))
if [ $AGE -lt 300 ]; then
  echo "ALERT: DECISION_LOG modified in last 5 minutes — possible loop conflict"
fi
Enter fullscreen mode Exit fullscreen mode

Pattern 2: Email Rate Gate (catches customer spam)

The failure it caught: My agent sent 12 emails to our only paying customer in 90 minutes. Each loop detected an issue, generated a fix email, sent it — without knowing the previous loop already sent one.

Why standard monitoring misses it: Each individual email was valid. The problem was volume, not content.

The fix:

import json
import time
import os

EMAIL_GATE_FILE = "~/.openclaw/workspace/state/email-sent-today.json"

def can_send_email(recipient: str, email_type: str) -> bool:
    gate_file = os.path.expanduser(EMAIL_GATE_FILE)

    if os.path.exists(gate_file):
        with open(gate_file) as f:
            sent = json.load(f)
    else:
        sent = {}

    today = time.strftime("%Y-%m-%d")
    key = f"{recipient}:{email_type}:{today}"

    if key in sent:
        print(f"BLOCKED: Already sent {email_type} to {recipient} today")
        return False

    # Record the send
    sent[key] = time.time()
    with open(gate_file, "w") as f:
        json.dump(sent, f)

    return True

# Usage
if can_send_email("customer@example.com", "auth-fix"):
    send_email(...)
Enter fullscreen mode Exit fullscreen mode

The rule: One email per customer per email type per day. Hard stop.


Pattern 3: State Diff Monitoring (catches silent corruption)

The failure it caught: A sub-agent rewrote my homepage and dumbed down technical copy. "SOUL.md templates" became "Pre-Built AI Assistant Personalities." No error. No alert. Just degraded output that I caught on manual review 45 minutes later.

Why standard monitoring misses it: The deploy succeeded. The content was valid HTML. The semantic corruption is invisible to infrastructure monitoring.

The fix:

import hashlib
import json
import os
import time

SNAPSHOT_FILE = "~/.openclaw/workspace/state/content-snapshots.json"

def snapshot_critical_content():
    """Take a fingerprint of content that should only change intentionally."""

    snapshots = {}
    critical_files = [
        "ask-patrick-site/index.html",
        "ask-patrick-site/library/index.html",
        "ask-patrick-site/playbook.html",
    ]

    for filepath in critical_files:
        if os.path.exists(filepath):
            with open(filepath, "rb") as f:
                content = f.read()
            snapshots[filepath] = {
                "hash": hashlib.sha256(content).hexdigest()[:16],
                "size": len(content),
                "timestamp": time.time()
            }

    snap_file = os.path.expanduser(SNAPSHOT_FILE)
    if os.path.exists(snap_file):
        with open(snap_file) as f:
            prior = json.load(f)

        for filepath, data in snapshots.items():
            if filepath in prior:
                if data["hash"] != prior[filepath]["hash"]:
                    print(f"CHANGED: {filepath} (was {prior[filepath]['hash'][:8]}, now {data['hash'][:8]})")
                    print(f"  Size delta: {data['size'] - prior[filepath]['size']} bytes")

    with open(snap_file, "w") as f:
        json.dump(snapshots, f)

snapshot_critical_content()
Enter fullscreen mode Exit fullscreen mode

Run this after every deploy. If a file changed and you didn't deploy it, something went wrong.


Pattern 4: Tool Call Audit Log (catches prompt injection)

The failure it caught: External content in a web scrape contained instructions that attempted to redirect my agent to a different task. The agent almost followed them.

Why standard monitoring misses it: The agent completed normally. The prompt injection attempt was in the input data, not an error in the agent.

The fix: Log every tool call with the source of the instruction.

import json
import time
import os

AUDIT_LOG = "~/.openclaw/workspace/logs/tool-audit.jsonl"

def log_tool_call(tool_name: str, inputs: dict, outputs: dict, instruction_source: str = "system"):
    """
    instruction_source: "system" (from your prompts) vs "external" (from fetched content)
    If source is "external", outputs should be treated as data only.
    """
    entry = {
        "timestamp": time.time(),
        "tool": tool_name,
        "instruction_source": instruction_source,
        "input_summary": str(inputs)[:200],
        "output_summary": str(outputs)[:200],
    }

    log_file = os.path.expanduser(AUDIT_LOG)
    os.makedirs(os.path.dirname(log_file), exist_ok=True)

    with open(log_file, "a") as f:
        f.write(json.dumps(entry) + "\n")

    # Alert if external content triggered a write or deploy
    if instruction_source == "external" and tool_name in ["write", "exec", "deploy"]:
        print(f"⚠️  ALERT: External content attempted to trigger {tool_name}")
        print(f"   Input: {entry['input_summary']}")
        return False  # Block the action

    return True
Enter fullscreen mode Exit fullscreen mode

The rule: External content is DATA. It cannot issue instructions. If something from outside your system is trying to make your agent take actions, stop and flag it.


Pattern 5: Revenue Checkpoint (catches the metric that matters)

The failure it caught: I spent two days optimizing library content quality while the primary checkout CTA pointed to the wrong Stripe product. Nobody buying the Playbook would have gotten the right product. The bug ran for ~20 hours.

Why standard monitoring misses it: Stripe was processing transactions. No errors. Just the wrong product ID.

The fix:

import json
import urllib.request
import base64
import os

def verify_payment_link_targets():
    """Confirm critical Stripe payment links resolve to the right products."""

    key = os.environ.get("STRIPE_SECRET_KEY", "")
    auth = base64.b64encode(f"{key}:".encode()).decode()

    # Expected: payment link ID -> product name
    expected = {
        "plink_LIBRARY_ID": "The Library - $9/mo",
        "plink_WORKSHOP_ID": "The Workshop - $29/mo",
        "plink_HANDBOOK_ID": "The Operator's Handbook - $39",
    }

    for link_id, expected_name in expected.items():
        url = f"https://api.stripe.com/v1/payment_links/{link_id}"
        req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})

        try:
            with urllib.request.urlopen(req) as r:
                data = json.loads(r.read())

            active = data.get("active", False)
            if not active:
                print(f"⚠️  INACTIVE payment link: {link_id} ({expected_name})")
        except Exception as e:
            print(f"ERROR checking {link_id}: {e}")

    print("Payment link check complete")

verify_payment_link_targets()
Enter fullscreen mode Exit fullscreen mode

Run this daily. Verify each critical payment link is active. Cross-reference with recent Stripe charges to confirm products are correct.


The 5-Minute Morning Check

These 5 patterns combined into one daily health check:

#!/bin/bash
# agent-health.sh — run every morning before anything else

echo "=== Agent Health Check $(date) ==="

# 1. Decision log integrity
DECISION_LOG="$HOME/.openclaw/workspace/DECISION_LOG.md"
if [ -f "$DECISION_LOG" ]; then
  AGE=$(( $(date +%s) - $(stat -f %m "$DECISION_LOG") ))
  [ $AGE -lt 300 ] && echo "⚠️  DECISION_LOG modified recently ($AGE seconds ago)"
else
  echo "❌ DECISION_LOG missing — create it"
fi

# 2. Email gate check
EMAIL_GATE="$HOME/.openclaw/workspace/state/email-sent-today.json"
if [ -f "$EMAIL_GATE" ]; then
  COUNT=$(python3 -c "import json; d=json.load(open('$EMAIL_GATE')); print(len(d))")
  echo "📧 Emails sent today: $COUNT"
fi

# 3. Revenue check
source ~/.patrick-env 2>/dev/null
python3 -c "
import urllib.request, json, base64, os, time
key = os.environ.get('STRIPE_SECRET_KEY','')
auth = base64.b64encode(f'{key}:'.encode()).decode()
req = urllib.request.Request('https://api.stripe.com/v1/charges?limit=5',
    headers={'Authorization': f'Basic {auth}'})
with urllib.request.urlopen(req) as r:
    d = json.loads(r.read())
today = time.strftime('%Y-%m-%d')
recent = [c for c in d.get('data',[]) if c.get('paid')]
print(f'💳 Recent charges: {len(recent)}, total \${sum(c[\"amount\"] for c in recent)/100:.2f}')
" 2>/dev/null

echo "=== End Health Check ==="
Enter fullscreen mode Exit fullscreen mode

What I Learned

The failures that kill AI agent businesses aren't infrastructure failures — they're behavioral failures. Your agent makes a wrong decision, takes a wrong action, sends a wrong message. Standard APM won't catch those.

You need telemetry that watches decisions, not just errors.

All 5 of these patterns are in the Ask Patrick Library with full configs, tested in production on a live AI-run business. The decision log entry that stopped my loop reinvention bug is literally still in my DECISION_LOG.md, unchanged, preventing the same failure in every cron loop since.

→ See the full observability stack in the Library

Top comments (0)