boop dev

Posted on Jan 15

How I Use Claude to Watch My Infrastructure While I Sleep

#devops #ai #automation #monitoring

Running a monitoring service means your infrastructure can't go down. The irony isn't lost on me - if Boop goes offline, nobody gets alerted that their own services are down.

Our customers rely on us to watch their websites, APIs, and servers 24/7. When their infrastructure has a problem at 3am, we're the ones who wake them up. That only works if we're awake ourselves.

So I built something a little unusual: I have Claude running every 30 minutes, checking on everything, and fixing problems automatically. Here's how it works and why I can actually sleep at night.

The Problem: Alert Fatigue vs. Missing Real Issues

Traditional monitoring has two failure modes:

Too many alerts - You get paged for every blip and eventually ignore them all
Too few alerts - You miss the one that matters until a user complains

I wanted something smarter. Not just "send me an alert" but "diagnose the problem and fix it if you can."

The Setup: Claude as an On-Call Engineer

Every 30 minutes, a cron job runs a health check script. But instead of just checking metrics and sending alerts, it hands off to Claude with a simple mission:

Check everything. If something's broken, fix it.

Here's what Claude actually does:

Checks Fly.io machine status - Are all 4 regions running?
Hits the health endpoint - Is the API responding?
Reviews recent logs - Any errors or warnings?
Checks queue depth - Is work backing up?
Diagnoses issues - What's actually wrong?
Fixes what it can - Restart machines, deploy fixes, push code changes

claude -p "Do a health check on boop infrastructure..." \
    --allowedTools "Bash,Read,Edit,Write,Grep,Glob" \
    --permission-mode bypassPermissions
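
To make the hand-off concrete, here is a minimal sketch of how pre-check results might be folded into the prompt. The helper names (`is_healthy_status`, `build_prompt`), the prompt wording, and the variables in the commented usage are assumptions for illustration, not the actual Boop script.

```shell
#!/bin/sh
# Sketch only: helper names and prompt wording are hypothetical.

# is_healthy_status CODE: succeed for any 2xx HTTP status code.
is_healthy_status() {
    case $1 in
        2[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# build_prompt MACHINES HTTP_CODE QUEUE_DEPTH: print a prompt with pre-check context.
build_prompt() {
    if is_healthy_status "$2"; then api="responding"; else api="NOT responding"; fi
    printf '%s\n' "Do a health check on boop infrastructure."
    printf '%s\n' "Pre-check results:"
    printf '%s\n' "- Machines: $1"
    printf '%s\n' "- Health endpoint: HTTP $2 ($api)"
    printf '%s\n' "- Queue depth: $3"
}

# Usage, commented out so the sketch can be sourced without calling any CLI.
# HEALTH_URL, MACHINES and QUEUE_DEPTH are assumed to be set by earlier pre-checks:
# HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$HEALTH_URL")
# claude -p "$(build_prompt "$MACHINES" "$HTTP_CODE" "$QUEUE_DEPTH")" \
#     --allowedTools "Bash,Read,Edit,Write,Grep,Glob" \
#     --permission-mode bypassPermissions
```

Passing the raw numbers along with a plain-language verdict gives Claude a starting point, so it does not have to rediscover what the script already measured.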

What Claude Can Actually Fix

When Claude finds a problem, it doesn't just report it. It acts:

Machine down? Restart it:

fly machine start <machine-id> -a boop-monitor

Need a redeploy? Deploy it:

fly deploy -a boop-monitor

Code bug causing issues? Fix and push:

# Edit the problematic code
# Run npm run build to verify
git add -A && git commit -m "Auto-fix: improve queue handling" && git push

The CI/CD pipeline automatically deploys code changes. I wake up to an email saying "Fixed a bug while you slept."
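
Since every push triggers a deploy, the "fix and push" step can be gated on the build. The following is a minimal sketch of that idea; `push_if_build_passes` is a hypothetical helper name, and the build command is whatever your project uses.

```shell
#!/bin/sh
# Sketch only: push_if_build_passes is a hypothetical helper name.

# push_if_build_passes MESSAGE BUILD_CMD...: run the build command first and
# commit/push only when it exits 0.
push_if_build_passes() {
    commit_msg=$1
    shift
    if ! "$@"; then
        echo "Build failed - leaving the working tree uncommitted" >&2
        return 1
    fi
    git add -A && git commit -m "$commit_msg" && git push
}

# Usage (commented out so nothing is pushed when the sketch is sourced):
# push_if_build_passes "Auto-fix: improve queue handling" npm run build
```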

The Contention Problem

Here's where it gets tricky. What happens when:

I'm actively coding and have uncommitted changes?
A GitHub Action is deploying?
Another health check is already running?
I'm doing maintenance and don't want automation interfering?

If Claude just blindly runs and pushes code, it could:

Overwrite my work in progress
Race against CI/CD and cause conflicts
Stack multiple fix attempts on top of each other
Push broken code because it didn't have the full context

The Solution: Four Layers of Contention Prevention

I built a "do not disturb" system with four layers:

Layer 1: Lock File

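LOCK_FILE="/tmp/boop-health-check.lock"

# Create the lock atomically: with noclobber, the redirect fails if the file already exists
if ! ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
    log "SKIP - Another health check is already running"
    exit 0
fi
# Remove the lock whenever the script exits, including the early SKIP exits in later layers
trap 'rm -f "$LOCK_FILE"' EXIT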

Only one health check runs at a time. Period.
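
A lock file can outlive the run that created it, for example if the process is killed before it can clean up. One way to handle that, sketched here under the assumption that the lock stores its owner's PID, is to treat a lock whose owner is no longer running as stale. `acquire_lock` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: acquire_lock is a hypothetical helper; assumes the lock holds a PID.

# acquire_lock FILE: return 0 if we now hold the lock, non-zero if a live
# process already holds it.
acquire_lock() {
    lock=$1
    if [ -f "$lock" ]; then
        holder=$(cat "$lock" 2>/dev/null)
        # kill -0 sends no signal; it only tests whether the PID can be signalled
        if [ -n "$holder" ] && kill -0 "$holder" 2>/dev/null; then
            return 1
        fi
        rm -f "$lock"   # owner is gone: treat the lock as stale
    fi
    # noclobber makes the redirect fail if another run created the file first
    ( set -o noclobber; echo "$$" > "$lock" ) 2>/dev/null
}

# Usage (log is assumed to come from the surrounding script):
# acquire_lock /tmp/boop-health-check.lock || { log "SKIP - already running"; exit 0; }
```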

Layer 2: Uncommitted Changes Detection

if [ -n "$(git status --porcelain 2>/dev/null)" ]; then
    log "SKIP - Uncommitted changes detected in main repo"
    exit 0
fi

If I'm working on something, Claude backs off. My uncommitted changes are a signal that a human is actively developing.
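
Cron typically starts jobs from the user's home directory, so it may help to point git at the repository explicitly and to treat a failing `git status` as a reason to back off rather than as "clean". A minimal sketch, where `repo_state` is a hypothetical helper and `$MAIN_REPO` is assumed to hold the repo path:

```shell
#!/bin/sh
# Sketch only: repo_state is a hypothetical helper name.

# repo_state DIR: print "clean", "dirty", or "unknown" (git itself failed).
repo_state() {
    # -C makes git operate on DIR regardless of the current working directory
    if ! porcelain=$(git -C "$1" status --porcelain 2>/dev/null); then
        echo "unknown"
    elif [ -n "$porcelain" ]; then
        echo "dirty"
    else
        echo "clean"
    fi
}

# Usage ($MAIN_REPO and log are assumed to come from the surrounding script):
# case $(repo_state "$MAIN_REPO") in
#     clean) ;;
#     *) log "SKIP - Repo is dirty or unreadable"; exit 0 ;;
# esac
```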

Layer 3: CI/CD Awareness

RUNNING_ACTIONS=$(gh run list --repo "$GITHUB_REPO" --status in_progress --json databaseId --jq 'length')

if [ "$RUNNING_ACTIONS" != "0" ]; then
    log "SKIP - $RUNNING_ACTIONS GitHub Action(s) in progress"
    exit 0
fi

If GitHub Actions is deploying, Claude waits. No racing against the pipeline.
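
If the `gh` call itself fails (expired token, network hiccup), the count comes back empty. The comparison above still skips in that case, which is the safe direction, but the log line reads oddly. Here is a small sketch that separates "CI is busy" from "could not tell"; `ci_state` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: ci_state is a hypothetical helper name.

# ci_state COUNT: classify the run count returned by `gh run list ... --jq 'length'`.
ci_state() {
    case $1 in
        ''|*[!0-9]*) echo "unknown" ;;  # empty or non-numeric: the gh call failed
        0)           echo "clear" ;;    # nothing in progress
        *)           echo "busy" ;;     # one or more runs in progress
    esac
}

# Usage ($RUNNING_ACTIONS and log are assumed to come from the surrounding script):
# case $(ci_state "$RUNNING_ACTIONS") in
#     clear)   ;;
#     busy)    log "SKIP - $RUNNING_ACTIONS GitHub Action(s) in progress"; exit 0 ;;
#     unknown) log "SKIP - Could not query GitHub Actions"; exit 0 ;;
# esac
```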

Layer 4: Manual Override

if [ -f "$MAIN_REPO/.no-health-check" ]; then
    log "SKIP - Do not disturb file present"
    exit 0
fi

If I'm doing something unusual and want automation to stay away, I create a .no-health-check file. Simple human override.

The Flow

Every 30 minutes:
┌─────────────────────────────────────────┐
│ Health Check Script Starts              │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ Check: Lock file exists?                │──Yes──▶ EXIT (another check running)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: .no-health-check file exists?    │──Yes──▶ EXIT (manual override)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: Uncommitted changes in repo?     │──Yes──▶ EXIT (human working)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Check: GitHub Actions running?          │──Yes──▶ EXIT (CI/CD in progress)
└────────────────┬────────────────────────┘
                 │ No
                 ▼
┌─────────────────────────────────────────┐
│ Run pre-checks (machines, API, queue)   │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ Launch Claude with context              │
│ - Check status                          │
│ - Review logs                           │
│ - Diagnose issues                       │
│ - Fix if possible                       │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ If fixes made → Email notification      │
└─────────────────────────────────────────┘
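
The guard portion of the diagram can be sketched as a single function that returns the first reason to skip. The function names here are hypothetical, and the sketch only checks for the lock; creating it would follow once the result is "clear".

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; guard order follows the diagram.

# Default probes. If gh is missing or fails, the count is empty, which is
# treated as "in progress" so the automation backs off.
has_uncommitted_changes() {
    [ -n "$(git -C "$1" status --porcelain 2>/dev/null)" ]
}
ci_in_progress() {
    [ "$(gh run list --repo "$GITHUB_REPO" --status in_progress \
        --json databaseId --jq 'length' 2>/dev/null)" != "0" ]
}

# preflight LOCK_FILE REPO_DIR: print the reason to skip and return 1,
# or print "clear" and return 0 when the health check may proceed.
preflight() {
    if [ -f "$1" ]; then echo "another check running"; return 1; fi
    if [ -f "$2/.no-health-check" ]; then echo "manual override"; return 1; fi
    if has_uncommitted_changes "$2"; then echo "human working"; return 1; fi
    if ci_in_progress; then echo "CI/CD in progress"; return 1; fi
    echo "clear"
}

# Usage ($MAIN_REPO and log are assumed to come from the surrounding script):
# reason=$(preflight /tmp/boop-health-check.lock "$MAIN_REPO") || {
#     log "SKIP - $reason"
#     exit 0
# }
```

The cheap local file checks run first, so the network call to GitHub only happens when nothing else has already ruled the run out.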

Bounded Autonomy

Claude isn't running wild. It has explicit boundaries:

Can do:

Read files and logs
Run bash commands (fly status, curl, etc.)
Edit code files
Grep/search the codebase
Commit and push to git

Instructed not to:

Delete files
Access cloud provider directly (only via fly CLI)
Make speculative improvements

The prompt explicitly says:

"Only make code changes if there's a clear issue that needs fixing. Do NOT make speculative improvements or refactors."

Notifications: Only When It Matters

I don't get emailed for every health check. Only when something actually happened:

if echo "$CLAUDE_OUTPUT" | grep -qiE "(fixed|restarted|deployed|resolved|corrected|pushed to git)"; then
    send_email "$EMAIL_SUBJECT" "$EMAIL_BODY"
fi

If everything is healthy, the log just says "All systems healthy" and I never hear about it.

If Claude fixed something, I get an email with:

What the pre-check found
What Claude did to fix it
Full output for review
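
As a rough sketch of how the keyword check and the three-part email could fit together (the function names and the body layout are assumptions, not the actual script):

```shell
#!/bin/sh
# Sketch only: function names and the body layout are hypothetical.

# fix_was_made OUTPUT: succeed when the output mentions an action keyword.
fix_was_made() {
    printf '%s\n' "$1" |
        grep -qiE "(fixed|restarted|deployed|resolved|corrected|pushed to git)"
}

# build_email_body PRECHECK OUTPUT: print a body with the three parts listed above.
build_email_body() {
    printf 'PRE-CHECK FINDINGS\n%s\n\n' "$1"
    printf 'ACTIONS TAKEN\n'
    # Use the lines that triggered the notification as a quick summary
    printf '%s\n' "$2" |
        grep -iE "(fixed|restarted|deployed|resolved|corrected|pushed to git)"
    printf '\nFULL OUTPUT\n%s\n' "$2"
}

# Usage ($PRECHECK_SUMMARY, $CLAUDE_OUTPUT and send_email are assumed to exist):
# if fix_was_made "$CLAUDE_OUTPUT"; then
#     send_email "$EMAIL_SUBJECT" "$(build_email_body "$PRECHECK_SUMMARY" "$CLAUDE_OUTPUT")"
# fi
```

A keyword match can also fire on phrasing like "nothing needed to be fixed", so it is worth keeping the full output in the email to see what actually triggered it.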

Real Results

In the past month:

48 health checks per day (every 30 minutes)
~1,400 total checks run automatically
3 automated fixes pushed while I slept
0 false positives or unnecessary alerts
0 conflicts with my development work

The fixes were real issues:

A worker that crashed and needed restart
A connection pool that needed tuning
A queue backpressure threshold that needed adjustment

Each time, I woke up to an email explaining what happened and what was fixed. I reviewed the changes, confirmed they were correct, and moved on with my day.

Most importantly: I sleep well. I don't lie awake wondering if something's broken. I don't compulsively check my phone at 2am. I know that if something goes wrong, Claude will fix it, and that lets me actually rest.

Why This Works for Solo Developers

If you're running infrastructure solo, you can't be on-call 24/7. You have to sleep. You have other things to do.

This approach gives me:

Coverage - Something is watching even when I'm not
Intelligence - Not just alerts, but diagnosis and fixes
Safety - Multiple layers prevent automation from causing problems
Transparency - I know exactly what happened and why

The contention strategy is the key. Without it, I'd be afraid to let automation touch anything. With it, I know Claude will back off whenever there's a reason to.

Try It Yourself

This pattern works with any infrastructure:

Write a health check script that does your pre-flight checks
Add contention prevention (lock file, git status, CI check)
Call Claude with bounded permissions
Only notify on actual fixes

The Claude CLI makes this straightforward. The --allowedTools flag lets you restrict what Claude can do, and --permission-mode bypassPermissions lets it run unattended.

Running a service solo? I'd love to hear how you handle on-call. Drop a comment below.
