When that alert fires at 3:14 a.m. on a Sunday, you know the drill: VPN in, SSH to the server, check the logs, maybe restart the service, and escalate the page to someone else if that doesn't work. You've probably done this a hundred times.
What if the runbook executed itself? See it for yourself.
Meet RunbookAI
RunbookAI is an open-source autonomous incident response agent. Connect it to PagerDuty, fire a webhook at it, and it reads your runbook, diagnoses the problem, and acts—without paging a human first.
How It Works
- Alert fires → RunbookAI reads the runbook
- Diagnosis → runs tools: check_logs, http_check, run_db_check, query_metrics, check_disk, check_processes
- Remediation → executes: restart_service, clear_disk, scale_service
- Resolves or escalates → full summary either way; if it resolves, no human was involved
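The steps above can be sketched as a small dispatch loop. This is an illustration only, not RunbookAI's actual code: the tool names come from the list, but the stub bodies, the `handle_alert` function, and the fixed `plan` list (which stands in for whatever the LLM decides) are assumptions.

```python
from typing import Callable, Dict, List

# Tool names are taken from the list above. The bodies are stand-in stubs
# with made-up output, not RunbookAI's real implementations.
DIAGNOSTIC_TOOLS: Dict[str, Callable[[str], str]] = {
    "check_logs": lambda service: f"{service}: example log finding",
    "check_disk": lambda service: f"{service}: example disk report",
}
REMEDIATION_TOOLS: Dict[str, Callable[[str], str]] = {
    "restart_service": lambda service: f"{service}: restarted",
}


def handle_alert(service: str, plan: List[str]) -> dict:
    """Run each step of the plan in order; escalate on an unknown tool."""
    timeline = []
    for step in plan:
        tool = DIAGNOSTIC_TOOLS.get(step) or REMEDIATION_TOOLS.get(step)
        if tool is None:
            # Hypothetical escalation path: stop and hand over what we have.
            return {
                "status": "escalated",
                "reason": f"unknown tool {step!r}",
                "timeline": timeline,
            }
        timeline.append((step, tool(service)))
    return {"status": "resolved", "timeline": timeline}
```

Calling `handle_alert("api", ["check_logs", "restart_service"])` returns a resolved result whose timeline records both steps, which is the raw material for the summary.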
```bash
git clone https://github.com/Pritom14/runbookai
cd runbookai
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Run with local LLM (no API keys)
ollama pull qwen2.5:7b
DEMO_MODE=true uvicorn runbookai.main:app --port 7000
python demo/run_demo.py regression
```
The Game-Changer: Regression Detection
Here's the real magic. Your service crashed 2 hours ago. RunbookAI restarted it. But if it crashes again within 6 hours, the agent is warned: "Don't just restart again—you did that before. Dig deeper."
Instead of blindly running the same remediation, it:
- Checks for new logs
- Queries recent metrics changes
- Looks for disk space issues, process hangs, or configuration drift
- Suggests a root cause before acting
This turns "fix the symptom" into "understand the problem."
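To make the idea concrete, here is a minimal sketch of a regression check, assuming incidents are stored with a service name, the remediation that was run, and a resolution time. The `PastIncident` record, both function names, and the warning wording are hypothetical; only the 6-hour window comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

REGRESSION_WINDOW = timedelta(hours=6)  # window taken from the text above


@dataclass
class PastIncident:  # hypothetical record shape
    service: str
    remediation: str
    resolved_at: datetime


def find_regression(
    service: str,
    now: datetime,
    history: List[PastIncident],
    window: timedelta = REGRESSION_WINDOW,
) -> Optional[PastIncident]:
    """Return the most recent incident for this service inside the window."""
    recent = [
        incident
        for incident in history
        if incident.service == service
        and timedelta(0) <= now - incident.resolved_at <= window
    ]
    return max(recent, key=lambda incident: incident.resolved_at, default=None)


def regression_warning(prior: Optional[PastIncident]) -> str:
    """Build the extra context handed to the agent; empty if no regression."""
    if prior is None:
        return ""
    return (
        f"Regression: {prior.remediation} was already run on {prior.service} "
        f"at {prior.resolved_at.isoformat()}. "
        "Investigate the root cause before repeating it."
    )
```

A restart from 2 hours ago matches and produces a warning; one from 9 hours ago falls outside the window and is treated as a fresh incident.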
Suggest Mode
High-risk actions (service restart, disk cleanup, scale-up) pause with a 5-second countdown for human approval. You stay in control while the agent handles the grunt work.
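A rough sketch of such an approval gate is below. It assumes the action is skipped when nobody approves before the countdown ends; what actually happens on timeout isn't spelled out above, so treat that as a guess. `ApprovalGate` and `run_action` are hypothetical names, and the high-risk set simply maps the three actions mentioned to the tool names used earlier.

```python
import threading
from typing import Callable

# Assumption: these tool names correspond to the high-risk actions above.
HIGH_RISK_ACTIONS = {"restart_service", "clear_disk", "scale_service"}


class ApprovalGate:
    """Blocks until a human approves or the countdown runs out."""

    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout_seconds = timeout_seconds
        self._approved = threading.Event()

    def approve(self) -> None:
        # Would be called from wherever approvals arrive (e.g. an HTTP handler).
        self._approved.set()

    def wait(self) -> bool:
        # Event.wait returns True if approved in time, False on timeout.
        return self._approved.wait(self.timeout_seconds)


def run_action(action: str, gate: ApprovalGate, execute: Callable[[], str]) -> str:
    """Run low-risk actions immediately; gate high-risk ones on approval."""
    if action in HIGH_RISK_ACTIONS and not gate.wait():
        return "skipped: no approval"  # assumed timeout behavior
    return execute()
```

Diagnostic tools such as `check_logs` pass straight through, so the countdown only slows down the steps that can actually break something.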
Auto-Generated Postmortem
After every resolved incident, hit GET /incidents/{id}/postmortem and get a ready-to-share markdown document: full timeline, actions taken, regression analysis, duration, and a recommendations checklist. Two hours of postmortem work, done automatically.
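If you'd rather script it than curl it, something like the following should work. It assumes the server is running locally on port 7000 (as in the demo command) and that the endpoint returns the markdown as a UTF-8 text body; the helper names and the example incident ID are made up.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:7000"  # assumption: local demo server


def postmortem_url(incident_id: str, base_url: str = BASE_URL) -> str:
    """Build the postmortem URL for one incident."""
    safe_id = quote(str(incident_id), safe="")
    return f"{base_url.rstrip('/')}/incidents/{safe_id}/postmortem"


def fetch_postmortem(
    incident_id: str, base_url: str = BASE_URL, timeout: float = 10.0
) -> str:
    """GET the postmortem; assumes the body is UTF-8 text."""
    with urlopen(postmortem_url(incident_id, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server up, `print(fetch_postmortem("inc-123"))` would dump the document to your terminal, where `inc-123` stands in for a real incident ID.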
Slack Lifecycle Notifications
Set SLACK_WEBHOOK_URL and RunbookAI posts a rich message at every stage: incident started, approval required (with the curl command to approve), resolved with duration, escalated with reason. Your Slack channel becomes your incident dashboard.
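For a feel of what a lifecycle notifier involves, here is a minimal sketch built on Slack's incoming-webhook format (a JSON body with a `text` field). The stage names mirror the four stages above, but the message templates and the `notify` helper are illustrative assumptions, not RunbookAI's actual messages.

```python
import json
import os
from urllib.request import Request, urlopen

# Hypothetical message templates, one per lifecycle stage named above.
STAGE_TEMPLATES = {
    "started": "Incident {incident_id} started",
    "approval_required": "Incident {incident_id} needs approval: {detail}",
    "resolved": "Incident {incident_id} resolved in {detail}",
    "escalated": "Incident {incident_id} escalated: {detail}",
}


def build_slack_payload(stage: str, incident_id: str, detail: str = "") -> bytes:
    """Render the stage template into a Slack incoming-webhook JSON body."""
    text = STAGE_TEMPLATES[stage].format(incident_id=incident_id, detail=detail)
    return json.dumps({"text": text}).encode("utf-8")


def notify(stage: str, incident_id: str, detail: str = "") -> bool:
    """Post to Slack if SLACK_WEBHOOK_URL is set; otherwise do nothing."""
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        return False
    request = Request(
        webhook_url,
        data=build_slack_payload(stage, incident_id, detail),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=10) as response:
        return response.status == 200
```

Leaving the variable unset turns `notify` into a no-op, which keeps local demos quiet.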
AgentTrace Replay UI
Every tool call, every decision, every second of the remediation is logged. Open the browser, replay the entire incident timeline. Understand what the agent decided and why.
Why Open Source?
Incident response is deeply custom: every company's runbooks, tools, and risk tolerance differ. We ship the core (diagnosis + remediation) free and self-hosted. No vendor lock-in, no SaaS fee, and no calls to external APIs unless you opt in.
No API Keys Needed
Runs on Ollama locally. qwen2.5:7b is small, fast, and good enough for runbook reasoning. Everything stays on your infrastructure. Or swap in OpenAI, Anthropic, or Groq with a single env var, no code changes.
GitHub: https://github.com/Pritom14/runbookai
Try it now. Fire a demo alert. See regression detection in action. Fork, extend, and own your incident response.