Tired of Being Paged at 3am? Let Your AI Handle the Runbook

#pagerduty #ai #claude #webdev

When that alert fires at 3:14am on a Sunday, you know the drill: VPN in, SSH to the server, check the logs, maybe restart the service, or escalate the page to someone else. You've probably done this a hundred times.

What if the runbook executed itself? See for yourself.

Meet RunbookAI

RunbookAI is an open-source autonomous incident response agent. Connect it to PagerDuty, fire a webhook at it, and it reads your runbook, diagnoses the problem, and acts, all without paging a human first.

How It Works

Alert fires → RunbookAI reads the runbook
Diagnosis → runs tools: check_logs, http_check, run_db_check, query_metrics, check_disk, check_processes
Remediation → executes: restart_service, clear_disk, scale_service
Resolves or escalates → writes a full summary either way; on a clean resolve, no human was involved
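
To make the four stages above concrete, here is a minimal sketch of the loop. It is not RunbookAI's actual code: the `Incident` class, the `handle_alert` function, and the stubbed tool bodies are hypothetical, and only the tool names come from the list above. The tools run in a fixed order here, which skips the part where an LLM decides what to run next.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    service: str
    timeline: list[str] = field(default_factory=list)
    status: str = "open"


# Hypothetical stand-ins for the real tools; each returns a canned finding.
def check_logs(service: str) -> str:
    return f"{service}: worker killed by OOM"


def http_check(service: str) -> str:
    return f"{service}: health endpoint returned 503"


def restart_service(service: str) -> bool:
    return True  # pretend the restart succeeded


DIAGNOSTICS: list[Callable[[str], str]] = [check_logs, http_check]


def handle_alert(incident: Incident, runbook: str) -> Incident:
    # Stage 1: the alert fired, so read the runbook.
    incident.timeline.append(f"read runbook ({len(runbook)} chars)")
    # Stage 2: diagnosis. Record what every tool found.
    for tool in DIAGNOSTICS:
        incident.timeline.append(f"{tool.__name__}: {tool(incident.service)}")
    # Stage 3: remediation.
    fixed = restart_service(incident.service)
    incident.timeline.append(f"restart_service: {'ok' if fixed else 'failed'}")
    # Stage 4: resolve or escalate; the timeline doubles as the summary.
    incident.status = "resolved" if fixed else "escalated"
    return incident
```

Calling `handle_alert(Incident("api"), runbook_text)` returns the incident with its status set and a timeline entry for each step.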

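# Clone the repo and install it into a virtual environment
git clone https://github.com/Pritom14/runbookai
cd runbookai
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run with local LLM (no API keys)
ollama pull qwen2.5:7b
# Start the server in demo mode and leave it running
DEMO_MODE=true uvicorn runbookai.main:app --port 7000
# In a second terminal (with the virtualenv activated), fire the regression demo
python demo/run_demo.py regression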

The Game-Changer: Regression Detection

Here's the real magic. Say your service crashed 2 hours ago and RunbookAI restarted it. If it crashes again within 6 hours, the agent is warned: "Don't just restart again. You did that before. Dig deeper."

Instead of blindly running the same remediation, it:

Checks for new logs
Queries recent metric changes
Looks for disk space issues, process hangs, or configuration drift
Suggests a root cause before acting

This turns "fix the symptom" into "understand the problem."
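
The core of that check is small enough to sketch. The version below assumes the agent keeps a history of `(service, action, timestamp)` tuples. That data shape and the names `is_regression` and `build_prompt_hint` are made up for illustration; only the 6-hour window and the warning text come from the description above.

```python
from datetime import datetime, timedelta

REGRESSION_WINDOW = timedelta(hours=6)  # window taken from the post


def is_regression(service, action, now, history, window=REGRESSION_WINDOW):
    """Return True if `action` was already applied to `service` within `window` before `now`."""
    return any(
        past_service == service
        and past_action == action
        and timedelta(0) <= now - when <= window
        for past_service, past_action, when in history
    )


def build_prompt_hint(service, now, history):
    """Return the extra warning the agent would see on a repeat crash, or an empty string."""
    if is_regression(service, "restart_service", now, history):
        return "Don't just restart again. You did that before. Dig deeper."
    return ""
```

A restart 2 hours ago trips the check; one from 9 hours ago does not, so the agent would be free to restart as usual.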

Suggest Mode

High-risk actions (service restart, disk cleanup, scale-up) pause with a 5-second countdown for human approval. You stay in control while the agent handles the grunt work.
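
Here is one way such a gate could look, as a rough sketch rather than the project's implementation. The `gate` function and the `threading.Event` used to signal approval are assumptions. The post doesn't say what happens when the countdown expires, so the sketch makes that an explicit parameter and defaults to not proceeding.

```python
import threading

# Tool names from the remediation list; treated as high-risk per the post.
HIGH_RISK = {"restart_service", "clear_disk", "scale_service"}


def gate(action: str, approved: threading.Event,
         timeout: float = 5.0, proceed_on_timeout: bool = False) -> bool:
    """Return True if `action` may run.

    Low-risk actions pass straight through. High-risk actions wait up to
    `timeout` seconds for `approved` to be set (for example by an HTTP
    handler that a human hits). The timeout behavior is an assumption.
    """
    if action not in HIGH_RISK:
        return True
    if approved.wait(timeout):
        return True
    return proceed_on_timeout
```

Another thread (or a request handler) calls `approved.set()` to release the action before the countdown runs out.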

Auto-Generated Postmortem

After every resolved incident, hit GET /incidents/{id}/postmortem and get a ready-to-share markdown document: full timeline, actions taken, regression analysis, duration, and a recommendations checklist. Two hours of postmortem work, done automatically.
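
If you'd rather script it than curl it, a small client might look like this. The endpoint path is the one above and the port matches the quick start. The `localhost` host, the helper names, and the assumption that the response body is the raw markdown are mine.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7000"  # port from the quick start; host is assumed


def postmortem_url(incident_id: str, base_url: str = BASE_URL) -> str:
    """Build the postmortem URL for one incident, escaping the id."""
    safe_id = urllib.parse.quote(str(incident_id), safe="")
    return f"{base_url.rstrip('/')}/incidents/{safe_id}/postmortem"


def fetch_postmortem(incident_id: str, base_url: str = BASE_URL) -> str:
    """Fetch the postmortem, assuming the body is the markdown text itself."""
    with urllib.request.urlopen(postmortem_url(incident_id, base_url), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `fetch_postmortem(incident_id)` returns text you can write straight to a `.md` file.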

Slack Lifecycle Notifications

Set SLACK_WEBHOOK_URL and RunbookAI posts a rich message at every stage: incident started, approval required (with the curl command to approve), resolved with duration, escalated with reason. Your Slack channel becomes your incident dashboard.
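
For a feel of what happens under the hood, here is a stripped-down notifier. Slack incoming webhooks accept a JSON body with a `text` field, which is all this sketch sends. The real messages are richer, and `build_message` and `notify` are hypothetical names, not RunbookAI functions.

```python
import json
import os
import urllib.request


def build_message(stage: str, incident_id: str, detail: str = "") -> dict:
    """Build a plain-text Slack payload for one lifecycle stage."""
    text = f"[{stage}] incident {incident_id}"
    if detail:
        text += f": {detail}"
    return {"text": text}


def notify(stage: str, incident_id: str, detail: str = "") -> bool:
    """POST the message to SLACK_WEBHOOK_URL; return False if it is not set."""
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        return False  # notifications are optional
    request = urllib.request.Request(
        url,
        data=json.dumps(build_message(stage, incident_id, detail)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status == 200
```

Leaving the variable unset turns `notify` into a no-op, so the same code path works with or without Slack.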

AgentTrace Replay UI

Every tool call, every decision, every second of the remediation is logged. Open the browser and replay the entire incident timeline to understand what the agent decided and why.
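
The idea behind a replayable trace can be sketched with a decorator that records each tool call as it happens. Nothing here reflects AgentTrace's real storage format: the `traced` decorator, the in-memory `TRACE` list, and the stubbed `check_disk` body are illustrative assumptions.

```python
import time
from functools import wraps

TRACE: list[dict] = []  # in-memory stand-in for a persisted trace


def traced(tool):
    """Record every call to `tool`: arguments, result, start time, duration."""
    @wraps(tool)
    def wrapper(*args, **kwargs):
        started = time.time()
        result = tool(*args, **kwargs)
        TRACE.append({
            "tool": tool.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "started_at": started,
            "duration_s": time.time() - started,
        })
        return result
    return wrapper


@traced
def check_disk(path: str = "/") -> float:
    return 0.91  # stubbed usage fraction, not a real disk check


def replay(trace: list[dict]) -> list[str]:
    """Render the recorded calls in order, one line per step."""
    return [
        f"{i}. {entry['tool']}{entry['args']} -> {entry['result']}"
        for i, entry in enumerate(trace, start=1)
    ]
```

Because every entry carries its inputs and output, a UI only needs to walk the list in order to show what the agent saw at each step.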

Why Open Source?

Incident response is deeply custom: every company's runbooks, tools, and risk tolerance differ. We ship the core (diagnosis + remediation) free and self-hosted. No vendor lock-in, no SaaS fee, and no calls to external APIs by default.

No API Keys Needed

RunbookAI runs on Ollama locally. qwen2.5:7b is small, fast, and good enough for runbook reasoning, and everything stays on your infrastructure. Prefer a hosted model? Swap in OpenAI, Anthropic, or Groq with a single env var and no code changes.
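
A single-variable switch like that usually boils down to a lookup. The sketch below uses a made-up variable name, `LLM_PROVIDER`, so check the repo for the real one. The API key variables are the names those providers' SDKs conventionally read.

```python
import os
from typing import Mapping, Optional

# Provider name -> env var holding its API key (None means no key needed).
REQUIRED_KEY = {
    "ollama": None,
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def select_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the LLM backend from the environment, defaulting to local Ollama."""
    env = os.environ if env is None else env
    # LLM_PROVIDER is a hypothetical name used for illustration only.
    name = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if name not in REQUIRED_KEY:
        raise ValueError(f"unknown provider: {name!r}")
    key_var = REQUIRED_KEY[name]
    if key_var and not env.get(key_var):
        raise RuntimeError(f"{name} selected but {key_var} is not set")
    return name
```

Failing fast on a missing key at startup beats discovering it mid-incident.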

GitHub: https://github.com/Pritom14/runbookai

Try it now. Fire a demo alert. See regression detection in action. Fork, extend, and own your incident response.