How I Use Claude to Watch My Infrastructure While I Sleep
Running a monitoring service means your infrastructure can't go down. The irony isn't lost on me - if Boop goes offline, nobody gets alerted that their own services are down.
Our customers rely on us to watch their websites, APIs, and servers 24/7. When their infrastructure has a problem at 3am, we're the ones who wake them up. That only works if we're awake ourselves.
So I built something a little unusual: I have Claude running every 30 minutes, checking on everything, and fixing problems automatically. Here's how it works and why I can actually sleep at night.
The Problem: Alert Fatigue vs. Missing Real Issues
Traditional monitoring has two failure modes:
- Too many alerts - You get paged for every blip and eventually ignore them all
- Too few alerts - You miss the one that matters until a user complains
I wanted something smarter. Not just "send me an alert" but "diagnose the problem and fix it if you can."
The Setup: Claude as an On-Call Engineer
Every 30 minutes, a cron job runs a health check script. But instead of just checking metrics and sending alerts, it hands off to Claude with a simple mission:
Check everything. If something's broken, fix it.
Here's what Claude actually does:
- Checks Fly.io machine status - Are all 4 regions running?
- Hits the health endpoint - Is the API responding?
- Reviews recent logs - Any errors or warnings?
- Checks queue depth - Is work backing up?
- Diagnoses issues - What's actually wrong?
- Fixes what it can - Restart machines, deploy fixes, push code changes
claude -p "Do a health check on boop infrastructure..." \
--allowedTools "Bash,Read,Edit,Write,Grep,Glob" \
--permission-mode bypassPermissions
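The prompt in the command above is elided. As a rough sketch of how a script might gather pre-check results and fold them into that prompt, here is one possible shape. The names `HEALTH_URL`, `collect_prechecks`, and `build_prompt` are placeholders of mine, not taken from the actual Boop script, and the queue-depth check is left out because the post doesn't say how it is measured.

```bash
#!/usr/bin/env bash
# Sketch only: APP matches the app name used in this post; HEALTH_URL and the
# function names are hypothetical placeholders.
APP="boop-monitor"
HEALTH_URL="https://example.com/health"

# Run each pre-check and print a labelled summary. Failures are recorded
# rather than aborting, so Claude sees the full picture.
collect_prechecks() {
  local machines health
  if machines=$(fly status -a "$APP" 2>&1); then
    printf 'MACHINES: ok\n%s\n' "$machines"
  else
    printf 'MACHINES: fly status failed\n%s\n' "$machines"
  fi
  # -f makes curl exit non-zero on HTTP errors; --max-time bounds a hung request
  if health=$(curl -fsS --max-time 10 "$HEALTH_URL" 2>&1); then
    printf 'HEALTH: ok\n'
  else
    printf 'HEALTH: failed (%s)\n' "$health"
  fi
}

# Combine a mission statement ($1) with the pre-check output ($2).
build_prompt() {
  printf '%s\n\nPre-check results:\n%s\n' "$1" "$2"
}
```

The combined text would then be passed as the `-p` argument, for example `claude -p "$(build_prompt "$MISSION" "$(collect_prechecks)")"` followed by the same flags as above, where `MISSION` holds whatever instructions you give Claude.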
What Claude Can Actually Fix
When Claude finds a problem, it doesn't just report it. It acts:
Machine down? Restart it:
fly machine start <machine-id> -a boop-monitor
Need a redeploy? Deploy it:
fly deploy -a boop-monitor
Code bug causing issues? Fix and push:
# Edit the problematic code
# Run npm run build to verify
git add -A && git commit -m "Auto-fix: improve queue handling" && git push
The CI/CD pipeline automatically deploys code changes. I wake up to an email saying "Fixed a bug while you slept."
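The comments in that snippet mention running the build before pushing. One way to make that a hard gate rather than a suggestion is to wrap the commit in a function that refuses to push when the build fails. This is a sketch, not the actual Boop setup, and `safe_push` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: commit and push only if the build succeeds.
safe_push() {
  local message="$1"
  if ! npm run build; then
    echo "Build failed; leaving changes uncommitted" >&2
    return 1
  fi
  git add -A
  # `git diff --cached --quiet` exits 0 when nothing is staged.
  if git diff --cached --quiet; then
    echo "No changes to commit" >&2
    return 0
  fi
  git commit -m "$message" && git push
}
```

Called as `safe_push "Auto-fix: improve queue handling"`, it returns non-zero if the build, the commit, or the push fails, so the calling script can report the failure instead of pushing broken code.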
The Contention Problem
Here's where it gets tricky. What happens when:
- I'm actively coding and have uncommitted changes?
- A GitHub Action is deploying?
- Another health check is already running?
- I'm doing maintenance and don't want automation interfering?
If Claude just blindly runs and pushes code, it could:
- Overwrite my work in progress
- Race against CI/CD and cause conflicts
- Stack multiple fix attempts on top of each other
- Push broken code because it didn't have the full context
The Solution: Four Layers of Contention Prevention
I built a "do not disturb" system with four layers:
Layer 1: Lock File
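LOCK_FILE="/tmp/boop-health-check.lock"
# noclobber makes the redirect fail if the file already exists,
# so checking for the lock and creating it happen in one atomic step
if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
  log "SKIP - Another health check is already running"
  exit 0
fi
# Remove the lock when the script exits so a failed run doesn't block the next one
trap 'rm -f "$LOCK_FILE"' EXIT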
Only one health check runs at a time. Period.
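The snippets in this post call a `log` helper that isn't shown. A minimal version might look like the following, assuming a `LOG_FILE` variable (a hypothetical name and path) and timestamped lines.

```bash
#!/usr/bin/env bash
# Hypothetical log destination; adjust to taste.
LOG_FILE="${LOG_FILE:-/tmp/boop-health-check.log}"

# Append a timestamped message to the log file and echo it to stderr,
# which keeps stdout free for data captured by command substitution.
log() {
  printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" | tee -a "$LOG_FILE" >&2
}
```

Writing to stderr as well as the file means cron's own mail or redirect still captures the message if the log file is unavailable.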
Layer 2: Uncommitted Changes Detection
if [ -n "$(git -C "$MAIN_REPO" status --porcelain 2>/dev/null)" ]; then
log "SKIP - Uncommitted changes detected in main repo"
exit 0
fi
If I'm working on something, Claude backs off. My uncommitted changes are a signal that a human is actively developing.
Layer 3: CI/CD Awareness
RUNNING_ACTIONS=$(gh run list --repo "$GITHUB_REPO" --status in_progress --json databaseId --jq 'length')
if [ "$RUNNING_ACTIONS" != "0" ]; then
log "SKIP - $RUNNING_ACTIONS GitHub Action(s) in progress"
exit 0
fi
If GitHub Actions is deploying, Claude waits. No racing against the pipeline.
Layer 4: Manual Override
if [ -f "$MAIN_REPO/.no-health-check" ]; then
log "SKIP - Do not disturb file present"
exit 0
fi
If I'm doing something unusual and want automation to stay away, I create a .no-health-check file. Simple human override.
The Flow
Every 30 minutes:
┌─────────────────────────────────────────┐
│       Health Check Script Starts        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ Check: Lock file exists?                │──Yes──▶ EXIT (another check running)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: .no-health-check file exists?    │──Yes──▶ EXIT (manual override)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: Uncommitted changes in repo?     │──Yes──▶ EXIT (human working)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: GitHub Actions running?          │──Yes──▶ EXIT (CI/CD in progress)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Run pre-checks (machines, API, queue)   │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ Launch Claude with context              │
│   - Check status                        │
│   - Review logs                         │
│   - Diagnose issues                     │
│   - Fix if possible                     │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ If fixes made → Email notification      │
└─────────────────────────────────────────┘
Bounded Autonomy
Claude isn't running wild. It has explicit boundaries:
Can do:
- Read files and logs
- Run bash commands (fly status, curl, etc.)
- Edit code files
- Grep/search the codebase
- Commit and push to git
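Not supposed to do (the Bash tool can run any command, so these limits rest on the prompt rather than on hard restrictions):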
- Delete files
- Access cloud provider directly (only via fly CLI)
- Make speculative improvements
The prompt explicitly says:
"Only make code changes if there's a clear issue that needs fixing. Do NOT make speculative improvements or refactors."
Notifications: Only When It Matters
I don't get emailed for every health check. Only when something actually happened:
if echo "$CLAUDE_OUTPUT" | grep -qiE "(fixed|restarted|deployed|resolved|corrected|pushed to git)"; then
send_email "$EMAIL_SUBJECT" "$EMAIL_BODY"
fi
If everything is healthy, the log just says "All systems healthy" and I never hear about it.
If Claude fixed something, I get an email with:
- What the pre-check found
- What Claude did to fix it
- Full output for review
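The `send_email` helper and the surrounding logic aren't shown in the post. A minimal sketch of how they might fit together follows. It assumes a configured `mail` command (mailx or similar); the recipient, the subject line, and the function names `fixes_reported` and `notify_if_fixed` are placeholders of mine.

```bash
#!/usr/bin/env bash
# Hypothetical recipient; the real script's address and mail transport are not shown.
NOTIFY_TO="${NOTIFY_TO:-you@example.com}"

# Succeeds if the text contains any of the action keywords from the check above.
fixes_reported() {
  printf '%s\n' "$1" | grep -qiE '(fixed|restarted|deployed|resolved|corrected|pushed to git)'
}

# Assumes a working `mail` command (mailx or similar) is installed and configured.
send_email() {
  local subject="$1" body="$2"
  printf '%s\n' "$body" | mail -s "$subject" "$NOTIFY_TO"
}

# Email only when Claude's output reports an action; stay silent otherwise.
notify_if_fixed() {
  local output="$1"
  if fixes_reported "$output"; then
    send_email "Boop health check: fixes applied" "$output"
  fi
}
```

Keyword matching is deliberately simple, and it has a known weakness: output such as "nothing needed to be fixed" would also match. If that turns out to be a problem, one option is to ask Claude to end its output with an explicit marker line and match on that instead.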
Real Results
In the past month:
- 48 health checks per day (every 30 minutes)
- ~1,400 total checks run automatically
- 3 automated fixes pushed while I slept
- 0 false positives or unnecessary alerts
- 0 conflicts with my development work
The fixes were real issues:
- A worker that crashed and needed restart
- A connection pool that needed tuning
- A queue backpressure threshold that needed adjustment
Each time, I woke up to an email explaining what happened and what was fixed. I reviewed the changes, confirmed they were correct, and moved on with my day.
Most importantly: I sleep well. I don't lie awake wondering if something's broken. I don't compulsively check my phone at 2am. I know that if something goes wrong, Claude will fix it, and that lets me actually rest.
Why This Works for Solo Developers
If you're running infrastructure solo, you can't be on-call 24/7. You have to sleep. You have other things to do.
This approach gives me:
- Coverage - Something is watching even when I'm not
- Intelligence - Not just alerts, but diagnosis and fixes
- Safety - Multiple layers prevent automation from causing problems
- Transparency - I know exactly what happened and why
The contention strategy is the key. Without it, I'd be afraid to let automation touch anything. With it, I know Claude will back off whenever there's a reason to.
Try It Yourself
This pattern works with any infrastructure:
- Write a health check script that does your pre-flight checks
- Add contention prevention (lock file, git status, CI check)
- Call Claude with bounded permissions
- Only notify on actual fixes
The Claude CLI makes this straightforward. The --allowedTools flag lets you restrict what Claude can do, and --permission-mode bypassPermissions lets it run unattended.
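Putting the four steps together, a skeleton might look like the sketch below. It's a simplified outline under my own assumptions, not the actual Boop script: the function names are hypothetical, `REPO_DIR`, `GITHUB_REPO`, and `LOCK_FILE` are placeholders, and the Claude call is reduced to a stub for you to fill in.

```bash
#!/usr/bin/env bash
# Skeleton only. REPO_DIR, GITHUB_REPO, LOCK_FILE, and all function names are placeholders.
REPO_DIR="${REPO_DIR:-$HOME/src/my-service}"
GITHUB_REPO="${GITHUB_REPO:-owner/repo}"
LOCK_FILE="${LOCK_FILE:-/tmp/my-health-check.lock}"

# Print the reason to skip this run, or print nothing if it is safe to proceed.
skip_reason() {
  if [ -f "$REPO_DIR/.no-health-check" ]; then
    echo "do-not-disturb file present"
  elif [ -n "$(git -C "$REPO_DIR" status --porcelain 2>/dev/null)" ]; then
    echo "uncommitted changes"
  elif [ "$(gh run list --repo "$GITHUB_REPO" --status in_progress \
            --json databaseId --jq 'length' 2>/dev/null)" != "0" ]; then
    # An empty result (gh missing or failing) also lands here, which errs on the side of skipping.
    echo "CI in progress (or gh unavailable)"
  fi
}

# Replace with your pre-checks and the `claude -p ...` call for your setup.
run_health_check() {
  echo "TODO: run pre-checks and hand off to Claude"
}

main() {
  local reason
  # noclobber makes the redirect fail if the lock already exists (atomic check-and-create).
  if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
    echo "SKIP - another check is running"
    return 0
  fi
  trap 'rm -f "$LOCK_FILE"' EXIT
  reason=$(skip_reason)
  if [ -n "$reason" ]; then
    echo "SKIP - $reason"
    return 0
  fi
  run_health_check
}

# In a real script, finish with: main "$@"
```

Keeping every skip condition in one function makes it easy to add a new guard later without touching the rest of the flow.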
Running a service solo? I'd love to hear how you handle on-call. Drop a comment below.