I Built a Self-Healing PR Monitor With OpenClaw (And It Caught Its Own Bugs)
6 months of running an automated PR monitoring system. Here's what broke, what I learned, and why self-healing automation is the future.
The Problem
I contribute to open source. Not just one repo: 25+ repos across Node.js, Python, TypeScript, and JavaScript ecosystems.
The problem: I can't check 25 GitHub repos for review comments manually every 5 minutes.
I tried:
- GitHub notification emails: too much noise, wrong format, delayed by hours
- GitHub mobile app: notifications buried between stars and likes
- Manual checking: I'd forget, miss things, or spend hours just refreshing pages
So I built something better.
The Architecture
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   Cron Job    │────▶│  PR Scanner   │────▶│   State DB    │
│ (every 5min)  │     │   (GraphQL)   │     │  (JSON file)  │
└───────────────┘     └───────────────┘     └───────────────┘
                              │                     │
                              ▼                     ▼
                      ┌───────────────┐     ┌───────────────┐
                      │Event Detector │────▶│ Alert Sender  │
                      │ (diff state)  │     │(Feishu/Slack) │
                      └───────────────┘     └───────────────┘
Core components (a minimal loop wiring them together is sketched after the list):
- Scanner: GraphQL API queries all tracked PRs in parallel
- State manager: compares current state vs last known state
- Event detector: identifies NEW events (comments, reviews, status changes)
- Alert router: sends urgent alerts to the right channel
- Watchdog: monitors the monitor itself (meta, but necessary)
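To make the flow concrete, here's roughly how one scan cycle ties those pieces together. This is a minimal sketch, not the production code: loadState, saveState, scanAllPRs, and detectEvents are shown later in this post, and sendAlert is a hypothetical stand-in for the Feishu/Slack webhook call.

// Sketch of one 5-minute cycle. loadState/saveState/scanAllPRs/detectEvents appear
// later in this post; sendAlert() is a hypothetical webhook helper.
async function runOnce(prIds) {
  const previous = loadState().prs || [];     // last snapshot; empty on first run
  const current = await scanAllPRs(prIds);    // one GraphQL round-trip for all tracked PRs
  const events = await detectEvents(current, previous);
  for (const event of events) {
    await sendAlert(event);                   // hypothetical: posts to Feishu/Slack
  }
  // Persist only after alerts go out, so a crash mid-run re-detects instead of dropping events
  saveState({ prs: current, scannedAt: new Date().toISOString() });
}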
The v1 Mistakes (Learn From My Pain)
Mistake 1: No State Persistence
My first version just checked PRs and printed changes. Great for testing, useless in production.
When the process restarts (and it WILL restart), it has no memory of what it saw before. Every event looks "new" and you get duplicate alerts forever.
Fix: Persist state to a JSON file after every scan:
const fs = require('fs');
const STATE_FILE = './pr-state.json';

function saveState(state) {
  // Atomic write to prevent corruption on crash
  const tmp = STATE_FILE + '.tmp';
  fs.writeFileSync(tmp, JSON.stringify(state));
  fs.renameSync(tmp, STATE_FILE);
}

function loadState() {
  try { return JSON.parse(fs.readFileSync(STATE_FILE)); }
  catch { return {}; } // First run
}
Mistake 2: Silent Failures
Version 2 ran for two weeks before I noticed it had stopped working. The cause was a simple API rate-limit error that failed silently.
Fix: Health checks + watchdog:
#!/bin/bash
# pr-monitor-watchdog.sh
LOG="/tmp/pr-monitor-v3.log"
AGE=$(( $(date +%s) - $(stat -c %Y "$LOG" 2>/dev/null || echo 0) ))

if [ "$AGE" -gt 600 ]; then  # 10 minutes
  echo "[FAIL] Log stale (${AGE}s old)"
  # Restart or alert
else
  LINES=$(wc -l < "$LOG" 2>/dev/null || echo 0)
  echo "[OK] v3 healthy - log updated ${AGE}s ago, ${LINES} lines"
fi
Mistake 3: Variable Name Conflicts in Browser Automation
This one's specific but painful: when using browser automation tools across multiple eval calls, variable names persist. Using const b in one call and const b in another throws SyntaxError: Identifier 'b' has already been declared.
Fix: Use IIFE (Immediately Invoked Function Expressions):
// Bad (fails on second call):
eval("const b = document.getElementById('foo')")
// Good (always works):
eval("(function(){ const b = document.getElementById('foo'); ... })()")
Mistake 4: Not Testing After Writing
I wrote a monitoring script, deployed it via crontab, assumed it worked. Three weeks later I discovered it had never successfully run once due to a syntax error.
The rule now: Every script must be tested manually before being added to crontab.
# ALWAYS test before deploying
python3 scripts/pr-monitor-v3.py --test
# Check output is valid JSON
# Check log file was created
# Verify timestamp is recent
The Real-World Results
After 6 months of running this system:
| Metric | Value |
|---|---|
| PRs tracked | 208 |
| Events detected | 3,400+ |
| False positives | < 2% |
| Uptime | ~98% (crashes from OOM) |
| Avg detection time | < 5 minutes |
| Response time (me) | < 15 minutes |
The biggest win: A maintainer commented on my PR at 8 PM on a Friday. I saw it, replied within 10 minutes, fixed the issue over the weekend, and got merged Monday morning.
Without automation? That comment would've sat there until Tuesday. Maybe longer.
What I'd Do Differently
1. Start With Logging, Add Features Later
I built features first, logging second. Reverse that order. A working system you can debug beats a feature-rich black box.
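In practice, "logging first" can be a single helper that every other function calls. Here's a minimal sketch; the log path matches the watchdog script above, and the helper name is mine, not from the production code:

const fs = require('fs');
const LOG_FILE = '/tmp/pr-monitor-v3.log'; // same file the watchdog checks for freshness

// Append one timestamped line per event. The watchdog only needs the mtime to advance,
// but the timestamps are what make post-mortems possible.
function log(message) {
  fs.appendFileSync(LOG_FILE, `${new Date().toISOString()} ${message}\n`);
}

Every scan, error, and backoff goes through that one function, so the log becomes the single source of truth when something breaks.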
2. Rate Limit Handling from Day One
GitHub's GraphQL API has a rate limit of 5,000 points/hour. With 208 PRs, each query costs ~50-100 points. That's 25-50 scans per hour max.
I hit this limit hard. Build in backoff logic immediately:
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function graphqlQuery(query, vars) {
  const res = await fetch(GITHUB_GRAPHQL, { /* ... */ });
  const remaining = Number(res.headers.get('x-ratelimit-remaining')); // header is a string
  if (remaining < 100) {
    console.log(`Rate limit low: ${remaining}. Backing off.`);
    await sleep(60000); // Wait 1 minute
  }
  return res.json();
}
3. Separate "Detection" from "Action"
My first version detected AND sent alerts in the same process. When I wanted to change alert formatting, I had to redeploy the whole thing.
Better architecture:
Detector → writes to /tmp/pr-events.json (append-only)
Reader → watches the file, sends alerts independently
This way you can swap alert channels without touching the detector.
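A rough sketch of that split, assuming one JSON object per line in /tmp/pr-events.json and Node 18+ for global fetch; the ALERT_WEBHOOK_URL environment variable and the payload shape are placeholders, not the real Feishu/Slack integration:

const fs = require('fs');
const EVENTS_FILE = '/tmp/pr-events.json';

// Detector side: append one JSON object per line and forget about delivery.
function emitEvent(event) {
  fs.appendFileSync(EVENTS_FILE, JSON.stringify(event) + '\n');
}

// Reader side: runs as its own process and only consumes complete lines past the last offset.
let offset = 0;
async function drainEvents() {
  if (!fs.existsSync(EVENTS_FILE)) return;
  const text = fs.readFileSync(EVENTS_FILE, 'utf8');
  const end = text.lastIndexOf('\n') + 1;        // ignore a partially written last line
  const fresh = text.slice(offset, end);
  offset = end;
  for (const line of fresh.split('\n').filter(Boolean)) {
    const event = JSON.parse(line);
    await fetch(process.env.ALERT_WEBHOOK_URL, { // placeholder webhook, swap freely
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({ text: `PR #${event.pr}: new ${event.type} from ${event.author}` }),
    });
  }
}

Because the reader owns the webhook, changing the alert format or channel means restarting one small process; the detector never notices.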
The Code (Simplified)
Here's the core scanning loop (the actual production version is ~500 lines):
const { graphql } = require('@octokit/graphql');

const QUERY = `
  query($prs: [ID!]!) {
    nodes(ids: $prs) {
      ... on PullRequest {
        number title state url
        comments(last: 5) { nodes { author { login } body createdAt } }
        reviews(last: 3) { nodes { author { login } state body } }
        statusCheckRollup { state }
        updatedAt
      }
    }
  }
`;

async function scanAllPRs(prIds) {
  // Auth assumes a personal access token in GITHUB_TOKEN
  const result = await graphql(QUERY, {
    prs: prIds,
    headers: { authorization: `token ${process.env.GITHUB_TOKEN}` },
  });
  return result.nodes.filter(Boolean); // Remove nulls for deleted PRs
}
async function detectEvents(current, previous) {
  const events = [];
  for (const pr of current) {
    const prev = previous.find(p => p.number === pr.number);
    if (!prev) continue; // New PR, not an event

    // Check for new comments
    const newComments = pr.comments.nodes.filter(
      c => !prev.comments.nodes.find(p => p.createdAt === c.createdAt)
    );
    if (newComments.length) {
      events.push({
        type: 'comment',
        pr: pr.number,
        author: newComments[0].author.login,
        body: newComments[0].body.substring(0, 200),
      });
    }
    // Similar logic for reviews, status changes...
  }
  return events;
}
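The elided "similar logic" for reviews and status checks looks roughly like this in my version. It's a sketch that assumes the fields from the GraphQL query above: the query fetches no review IDs or timestamps, so deduplication keys on author, state, and body.

// Inside the same for (const pr of current) loop, after the comment check (sketch):
const reviewKey = r => `${r.author?.login}:${r.state}:${r.body}`;
const newReviews = pr.reviews.nodes.filter(
  r => !prev.reviews.nodes.find(p => reviewKey(p) === reviewKey(r))
);
for (const review of newReviews) {
  events.push({
    type: 'review',
    pr: pr.number,
    author: review.author?.login,
    state: review.state,                        // APPROVED, CHANGES_REQUESTED, COMMENTED
    body: (review.body || '').substring(0, 200),
  });
}

// Status changes: a simple state comparison against the previous snapshot
if (pr.statusCheckRollup && prev.statusCheckRollup &&
    pr.statusCheckRollup.state !== prev.statusCheckRollup.state) {
  events.push({ type: 'status', pr: pr.number, state: pr.statusCheckRollup.state });
}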
The Meta Lesson
Building a tool to monitor your own work feels narcissistic. But here's the thing:
Every hour this system saves me is an hour I can spend on actual development.
That's 27+ hours per week recovered from manual checking. What would you do with an extra day per week?
For me: write more code, ship more projects, build more things that matter.
The investment in automation pays compound interest. And unlike financial compound interest, you don't need capital to start: just time and willingness to solve your own boring problems.
Running your own automation setup? I'd love to hear about it. Drop a comment or find me on GitHub.
Follow @armorbreak for more practical engineering posts.