
Sajja Sudhakararao

Originally published at autoshiftops.com

🚀 Building an AI Incident Copilot: How I Automated the First 15 Minutes of Every Production Incident

Every production incident follows the same painful ritual.

An alert fires at 2am. An engineer wakes up, SSHes into a server, and begins the manual loop: pulling logs, scanning for errors, guessing what to check next. This loop can take 15 to 45 minutes before the real diagnosis even begins. Multiply that by every incident across every team in your organisation, and you have thousands of engineering hours lost every year to work that is repetitive, stressful, and largely automatable.

I've been on that on-call rotation. I know what it costs — not just in time, but in cognitive load, in missed context, and in the compounding pressure of an active incident. So I built incopilot: a CLI tool that automates the entire first-pass triage so engineers can skip straight to actual problem-solving.

This post walks through the architecture, the design decisions, and exactly how to build it yourself. Everything is open source at https://github.com/AutoShiftOps/incopilot.

Project structure

incopilot/
  __init__.py
  cli.py          # argument parsing + console output
  collectors.py   # journalctl, docker logs, file, bundle
  analyzer.py     # pattern detection + line normalization
  reporter.py     # report.md / report.json generation
  config.py       # patterns, golden-signal map, safe-command list
scripts/
  demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md
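
To make the analyzer.py line in the tree above more concrete, here is a minimal sketch of what "pattern detection + line normalization" could look like. It is not the code from the repo: the pattern names, the regexes, and the `normalize` and `analyze` functions are all assumptions made for illustration, and the real pattern list in config.py may look quite different.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the real ones live in config.py.
PATTERNS = {
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
}

_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*")
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUMBER = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Replace volatile tokens so repeated messages collapse to one template."""
    line = _TIMESTAMP.sub("<ts>", line)
    line = _HEX.sub("<hex>", line)
    line = _NUMBER.sub("<n>", line)
    return line.strip()


def analyze(lines):
    """Return (pattern hit counts, normalized line counts) for an iterable of log lines."""
    hits = Counter()
    templates = Counter()
    for line in lines:
        templates[normalize(line)] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits, templates
```

Normalizing before counting means two lines that differ only in timestamp or request ID count as the same message, so the noisiest template surfaces first via `templates.most_common()`.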

Setup

git clone https://github.com/AutoShiftOps/incopilot.git
cd incopilot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick test (no real services needed)

python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log
ls out/

Systemd journal triage

python -m incopilot journal --unit nginx --since "30 min ago"
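
For readers wondering what a journal collector might do under the hood, the sketch below shells out to journalctl with the same unit and time window as the command above. The function names `journal_command` and `collect_journal` are hypothetical and the repo's collectors.py may be structured differently; the journalctl flags themselves (`--unit`, `--since`, `--no-pager`, `--output`) are standard.

```python
import subprocess


def journal_command(unit: str, since: str) -> list[str]:
    """Build the journalctl argv for one unit and time window (hypothetical helper)."""
    return [
        "journalctl",
        "--unit", unit,
        "--since", since,
        "--no-pager",             # never block waiting on a pager
        "--output", "short-iso",  # ISO timestamps are easier to parse later
    ]


def collect_journal(unit: str, since: str) -> list[str]:
    """Run journalctl and return its output as a list of lines."""
    result = subprocess.run(
        journal_command(unit, since),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if journalctl exits non-zero
    )
    return result.stdout.splitlines()
```

Passing the arguments as a list rather than a shell string avoids quoting problems with values such as "30 min ago".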

Docker triage

python -m incopilot docker --container my-api --since 1h
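
A Docker collector could follow the same shape, with one wrinkle: `docker logs` replays the container's stderr on its own stderr, so a collector that only reads stdout can miss the error lines. The sketch below merges both streams. As before, `docker_command` and `collect_docker` are hypothetical names, not taken from the repo.

```python
import subprocess


def docker_command(container: str, since: str) -> list[str]:
    """Build the docker logs argv for one container and time window (hypothetical helper)."""
    return ["docker", "logs", "--since", since, "--timestamps", container]


def collect_docker(container: str, since: str) -> list[str]:
    """Run docker logs and return stdout and stderr merged into one list of lines."""
    result = subprocess.run(
        docker_command(container, since),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold the container's stderr into the same stream
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```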

Both sources (bundle)

python -m incopilot bundle \
  --unit nginx \
  --container my-api \
  --since-journal "30 min ago" \
  --since-docker 1h
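
One plausible way to combine two sources is to tag every line with where it came from and keep going if one collector fails. The sketch below assumes that design; the `bundle` function and its return shape are illustrative only and are not confirmed by the repo.

```python
from typing import Callable, Iterable


def bundle(
    sources: dict[str, Callable[[], Iterable[str]]],
) -> tuple[list[tuple[str, str]], dict[str, str]]:
    """Run each collector and return (tagged lines, errors by source name)."""
    tagged: list[tuple[str, str]] = []
    errors: dict[str, str] = {}
    for name, collect in sources.items():
        try:
            for line in collect():
                tagged.append((name, line))
        except Exception as exc:
            # Assumption: during an incident, partial data beats no data,
            # so a failing source is recorded instead of aborting the run.
            errors[name] = str(exc)
    return tagged, errors
```

Each collector is passed as a zero-argument callable, for example `{"journal": lambda: ["line 1"], "docker": lambda: ["line 2"]}`, which keeps `bundle` independent of how any single source is read.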

What you get

out/report.md — paste into your incident doc

out/report.json — attach to a ticket or POST to a webhook
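
As a rough idea of how the two files could be produced from the same data, here is a small sketch that writes a findings dictionary to both formats. The `write_reports` function, the report heading, and the assumption that findings are a simple pattern-to-count mapping are all hypothetical; the actual report layout in reporter.py is likely richer.

```python
import json
from pathlib import Path


def write_reports(findings: dict[str, int], out_dir: str = "out") -> tuple[Path, Path]:
    """Write report.md and report.json into out_dir and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Machine-readable copy for tickets and webhooks.
    json_path = out / "report.json"
    json_path.write_text(json.dumps(findings, indent=2, sort_keys=True), encoding="utf-8")

    # Human-readable copy, most frequent finding first.
    lines = ["# Incident triage report", ""]
    for name, count in sorted(findings.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"- {name}: {count}")
    md_path = out / "report.md"
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    return md_path, json_path
```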

What to improve next

  • Per-service pattern packs (nginx, postgres, java, node)
  • Slack/Teams webhook posting (--webhook <url>)
  • Unit tests + GitHub Actions CI
  • Scheduled timer (systemd timer unit) for proactive reports
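
The webhook item in the list above could start from something as small as the sketch below, which posts the report JSON using only the standard library. The `--webhook` flag does not exist yet, and `build_webhook_request` and `post_report` are hypothetical names. Slack and Teams each expect their own payload shape (Slack incoming webhooks, for instance, take a JSON body with a `text` field), so a real implementation would need to reshape the report before sending it.

```python
import json
import urllib.request


def build_webhook_request(url: str, report: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying the report (hypothetical helper)."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_report(url: str, report: dict, timeout: float = 10.0) -> int:
    """Send the report and return the HTTP status code."""
    request = build_webhook_request(url, report)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect in a unit test without touching the network.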

Sudhakar Sajja is an Application Architect at Tech Mahindra with 13 years of experience across protocol testing, SDET, DevOps, and cloud architecture. He specialises in AI-powered DevOps operations, building tools that use LLMs to replace manual incident response and query diagnostics. He writes weekly at AutoShiftOps (autoshiftops.com) and built QueryTuner (querytuner.com), an AI-driven SQL query analysis tool. Based in Mississauga, Canada.
