Kunal Jaiswal
My Security Cameras Were Dead for 3 Days. Now They Fix Themselves.

I run three AI-powered security cameras at home. RTSP streams feed into a Python daemon that runs OpenCV contour detection, sends cropped regions to a vision LLM on an NVIDIA DGX Spark, and fires WhatsApp alerts when it spots something.

It works great — until it doesn't.

On April 13th, the cameras silently died. No alerts. No crash. No logs. The process was "running." launchctl showed a healthy PID. The dashboard showed the last captured frame — frozen, three days old.

Nobody noticed until April 16th.

The Bug That Looked Like Nothing

The camera monitor runs multiple threads: an rtsp_loop per camera that decodes RTSP frames, and an analysis_loop that sends frames to the vision model. When the analysis loop decided a camera needed reconnection — stale frames, RTSP errors — it called container.close() on the PyAV RTSP container.

The problem: container.close() was called from analysis_loop's thread. The RTSP container was being read by rtsp_loop's thread. PyAV wraps FFmpeg's C-level network read, which can't be safely interrupted cross-thread.

The result: deadlock. Both threads frozen. Process stuck at 300% CPU doing nothing. macOS launchctl saw a running PID and was satisfied. KeepAlive didn't trigger. Logs stopped flowing but nobody reads logs at 3 AM.

```
# What launchctl saw:
2081    0    com.gate.monitor     ← "running"

# What was actually happening:
PID 2081   300% CPU   state: R (running — spinning in deadlock)
Last log line: 2026-04-13 05:27:21   ← 3 days ago
```

I fixed the cross-thread bug — replaced container.close() with a threading.Event that rtsp_loop checks between frames and breaks cleanly from its own thread. But I knew there'd be a next bug. There's always a next bug.
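A minimal sketch of the pattern, with a stand-in container and illustrative names (`FakeContainer`, `stop_event`, the loop bodies) rather than the actual monitor code:

```python
import threading
import time

class FakeContainer:
    """Stand-in for the PyAV RTSP container; only illustrates ownership."""
    def __init__(self):
        self.closed = False
    def read_frame(self):
        time.sleep(0.01)   # pretend to block on a network read
        return object()
    def close(self):
        self.closed = True

stop_event = threading.Event()

def rtsp_loop(container, frames):
    # The reader thread checks the event between frames and exits cleanly;
    # it is the ONLY thread that ever calls container.close().
    while not stop_event.is_set():
        frames.append(container.read_frame())
    container.close()

container = FakeContainer()
frames = []
reader = threading.Thread(target=rtsp_loop, args=(container, frames))
reader.start()

time.sleep(0.05)      # ...meanwhile, the analysis thread decides to reconnect:
stop_event.set()      # signal instead of calling container.close() cross-thread
reader.join(timeout=2)
```

The point is ownership: the thread blocked on the C-level read is the only one allowed to tear the container down, so the analysis thread signals instead of reaching across.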

What Traditional Monitoring Misses

Here's what every basic health check would have said during those 3 days:

| Check | Result | Reality |
| --- | --- | --- |
| Process running? | ✓ Yes | Zombie |
| Exit code 0? | ✓ Yes | Never exited |
| Port responding? | N/A | No HTTP server |
| Disk space OK? | ✓ Yes | Irrelevant |

The failure was behavioral, not structural. The process was alive but brain-dead. You need a watchdog that understands what "healthy" actually means for your specific service.

The Watchdog

I wrote camera_watchdog.py — a separate process that runs every 12 hours via LaunchAgent. It checks four things:

1. Process alive + CPU zombie detection

```python
def get_process_info(name):
    out, _, _ = run(f"ps aux | grep '{name}' | grep -v grep")
    procs = []
    for line in out.splitlines():
        cols = line.split()
        # ps aux columns: USER PID %CPU %MEM ...
        procs.append({"pid": int(cols[1]), "cpu": float(cols[2])})
    return procs

procs = get_process_info("gate_monitor.py")
if not procs:
    issues.append("gate_monitor process NOT RUNNING")
elif procs[0]["cpu"] > 200:  # CPU_ZOMBIE_THRESHOLD
    issues.append(f"gate_monitor is a ZOMBIE (PID {procs[0]['pid']}, {procs[0]['cpu']}% CPU)")
```

A process at 300% CPU with no log output for 10 minutes isn't "running." It's a zombie.

2. Log freshness

```python
age = get_log_age_minutes("/tmp/gate_monitor.log")
if age > 10:  # LOG_STALE_MIN
    issues.append(f"gate_monitor log is {age:.0f} min stale")
```

The camera monitor writes log lines every few seconds. If the last line is older than 10 minutes, something is wrong — even if the process has a PID.
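`get_log_age_minutes` isn't shown in the post; a minimal stdlib version can just use the file's modification time:

```python
import os
import time

def get_log_age_minutes(path):
    # Minutes since the log file was last written, from its mtime.
    return (time.time() - os.path.getmtime(path)) / 60
```

Using mtime means no log parsing at all: any write to the file, whatever its contents, counts as a sign of life.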

3. Error filtering

Not all errors are equal. RTSP drops, connection resets, and timeouts are transient — the monitor handles them internally. The watchdog only flags errors that aren't in the known-transient list:

```python
TRANSIENT_PATTERNS = [
    "Connection reset by peer", "RTSP error", "retry in",
    "timed out", "Connection refused", "Broken pipe",
]
# Only flag errors NOT matching transient patterns
serious_errors = [line for line in error_lines
                  if not any(pat.lower() in line.lower() for pat in TRANSIENT_PATTERNS)]
```
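The post doesn't show how `error_lines` gets built; one hedged sketch (the function name and the "error"/"exception" heuristic are assumptions) is to tail the log and keep anything that looks like a failure:

```python
def tail_error_lines(path, n=200):
    # Grab the last n log lines and keep anything that looks like an error.
    with open(path, errors="replace") as f:
        lines = f.readlines()[-n:]
    return [ln.rstrip("\n") for ln in lines
            if "error" in ln.lower() or "exception" in ln.lower()]
```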

4. Dependency checks

The camera monitor depends on an LLM adapter (localhost:8091) and DGX Ollama (remote GPU server). If these are down, the monitor will silently stop analyzing frames. The watchdog checks both and can restart the local adapter.
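A liveness probe for those dependencies can be a bare HTTP request; the `:8091` adapter port is from the post, while `http_alive` and the URL shapes are illustrative:

```python
import urllib.error
import urllib.request

issues = []

def http_alive(url, timeout=3):
    # Any HTTP response counts as alive; only transport failures count as down.
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True    # the server answered, even with an error status
    except OSError:
        return False   # refused, timed out, unreachable

if not http_alive("http://localhost:8091/"):
    issues.append("LLM adapter on :8091 is DOWN")
```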

The Escalation Hierarchy

This is where it gets interesting. The watchdog doesn't just check — it fixes. In three tiers:

Tier 1: Simple Fix (up to 3 attempts)

Kill the zombie. Restart the LaunchAgent. Restart dependencies. Wait 15 seconds. Check if logs are flowing again.

```python
if "zombie_pid" in details:
    kill_process(details["zombie_pid"])
    time.sleep(3)  # LaunchAgent auto-restarts

# Verify it came back
procs = get_process_info("gate_monitor.py")
if not procs:
    restart_launchagent("com.gate.monitor")
    time.sleep(15)
```

Most issues stop here. Dead process? Restart it. Zombie? Kill it and let KeepAlive do its job. Adapter down? Reload its plist.
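`kill_process` and `restart_launchagent` aren't shown; on macOS they reduce to `kill` and `launchctl` calls. A sketch, assuming the plist lives in the standard per-user location:

```python
import os
import signal
import subprocess

def kill_process(pid):
    # SIGKILL, not SIGTERM: a deadlocked process won't run signal handlers.
    os.kill(pid, signal.SIGKILL)

def restart_launchagent(label):
    plist = os.path.expanduser(f"~/Library/LaunchAgents/{label}.plist")
    # unload + load makes launchd re-read the plist and respawn the job.
    subprocess.run(["launchctl", "unload", plist], check=False)
    subprocess.run(["launchctl", "load", plist], check=True)
```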

Tier 2: Claude Code

If three restart attempts fail, the problem isn't operational — it's in the code. The watchdog invokes Claude Code to autonomously diagnose and fix the bug:

```python
result = subprocess.run(
    ["/opt/homebrew/bin/claude", "--print", "--dangerously-skip-permissions",
     "-p", prompt],
    capture_output=True, text=True,
    timeout=300,  # 5 min max
    cwd="/Users/chimpoo/repos/camera-monitor",
)
```

The prompt includes the last 100 lines of logs, process status, and instructions to:

  1. Read the source code
  2. Diagnose the root cause from logs + code
  3. Fix the bug if it's a code issue
  4. git commit and git push
  5. Kill the old process (LaunchAgent restarts with new code)
  6. Verify logs are flowing
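
A hedged sketch of how such a prompt might be assembled (the author's exact wording isn't shown; `build_fix_prompt` is an illustrative name):

```python
def build_fix_prompt(log_path, status, n=100):
    # Bundle the last n log lines plus process status into one prompt.
    with open(log_path, errors="replace") as f:
        tail = "".join(f.readlines()[-n:])
    return (
        "gate_monitor is unhealthy and restart attempts have failed.\n\n"
        f"Process status:\n{status}\n\n"
        f"Last {n} log lines:\n{tail}\n"
        "Read the source code, diagnose the root cause from logs + code, "
        "fix the bug if it is a code issue, git commit and git push, "
        "kill the old process, and verify logs are flowing."
    )
```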

Claude gets --dangerously-skip-permissions because it's running unattended at 3 AM. There's no human to click "approve." It has 5 minutes to read, reason, patch, commit, and deploy.

Tier 3: Human

If Claude can't fix it either, a Telegram message arrives:

```
🚨 Camera Watchdog — Needs manual intervention

Problems: gate_monitor log is 3842 min stale
Restart attempts: 3
Claude attempted fix but failed:
[Claude's analysis of why it couldn't fix the issue]

Check: ssh chimpoo@192.168.0.26
```

The Reporting

Every heartbeat ends with a Telegram message. You always know what happened:

```
✅ All systems nominal          — nothing to do
🔧 Issue detected and fixed    — restarted something
🤖 Claude auto-fix applied     — code was patched
🚨 Needs manual intervention   — you need to SSH in
```
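The stack notes that reporting is stdlib `urllib` with no SDK. A minimal sketch against the Telegram Bot API `sendMessage` endpoint (token and chat id are placeholders you'd load from config):

```python
import json
import urllib.request

API = "https://api.telegram.org/bot{token}/sendMessage"

def build_request(token, chat_id, text):
    return urllib.request.Request(
        API.format(token=token),
        data=json.dumps({"chat_id": chat_id, "text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )

def send_telegram(token, chat_id, text):
    # POSTs the message; returns Telegram's parsed JSON reply.
    with urllib.request.urlopen(build_request(token, chat_id, text),
                                timeout=10) as resp:
        return json.load(resp)
```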

State persists between runs in /tmp/camera_watchdog_state.json — restart count, last Claude fix timestamp, history of fixes applied. The restart counter resets on the next healthy heartbeat, so a recovered service doesn't carry baggage.
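That state file can be plain JSON with a handful of helpers; this is a sketch of the shape described above (the key names and `on_heartbeat` are assumptions, the path is from the post):

```python
import json
import os

STATE_PATH = "/tmp/camera_watchdog_state.json"

def load_state(path=STATE_PATH):
    if not os.path.exists(path):
        return {"restarts": 0, "fixes": []}
    with open(path) as f:
        return json.load(f)

def save_state(state, path=STATE_PATH):
    with open(path, "w") as f:
        json.dump(state, f)

def on_heartbeat(state, healthy):
    # A healthy heartbeat resets the restart counter so a recovered
    # service doesn't carry baggage into the next incident.
    if healthy:
        state["restarts"] = 0
    return state
```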

What I'd Do Differently

Run it more often. Every 12 hours means worst-case you lose 12 hours of camera coverage before the watchdog notices. I'm considering dropping it to every 30 minutes. The checks themselves take <5 seconds.

Log rotation. The camera monitor wipes its log on restart. If the watchdog kills a zombie before capturing logs, the evidence disappears. A proper log rotation would preserve the forensics.

Test the Claude tier. I've only seen Tier 1 fire in production. The Claude escalation path is written and theoretically sound, but I haven't had a code-level bug recur since the PyAV fix. Which is either good engineering or an untested code path — depending on your perspective.

The Uncomfortable Part

Giving an AI agent --dangerously-skip-permissions over your codebase, with the ability to commit and deploy, at 3 AM with nobody watching — that should make you uncomfortable. It makes me uncomfortable.

But here's the trade-off: my cameras were dead for 3 days and nobody noticed. The cost of unattended downtime, for a security system, is higher than the cost of a bad auto-fix. Claude can't brick the system worse than "not running." And if it makes a wrong fix, I have git history and the Telegram receipt.

The watchdog is 280 lines of stdlib Python. No frameworks, no dependencies, no infrastructure. Just subprocess, urllib, and the knowledge that every system eventually breaks — and the interesting question is what happens next.


The Stack

```
camera_watchdog.py         — 280 lines, stdlib Python, LaunchAgent
gate_monitor.py            — RTSP + OpenCV + vision LLM pipeline
DGX Spark (Blackwell GPU)  — Ollama + gemma4:31b-4k vision model
Claude Code CLI            — autonomous diagnosis + code fix
Telegram Bot API           — reporting (stdlib urllib, no SDK)
macOS LaunchAgent          — scheduling + KeepAlive
```

The full watchdog source is straightforward enough to adapt for any daemon you run. The three-tier pattern — restart, AI fix, human escalation — is the part worth stealing.
