Every production incident follows the same painful ritual.
An alert fires at 2am. An engineer wakes up, SSH's into a server, and begins the manual loop — pulling logs, scanning for errors, guessing what to check next. This loop can take 15 to 45 minutes before the real diagnosis even begins. Multiply that by every incident across every team in your organisation, and you have thousands of engineering hours lost every year to work that is repetitive, stressful, and largely automatable.
I've been on that on-call rotation. I know what it costs — not just in time, but in cognitive load, in missed context, and in the compounding pressure of an active incident. So I built incopilot: a CLI tool that automates the entire first-pass triage so engineers can skip straight to actual problem-solving.
This post walks through the architecture, the design decisions, and exactly how to build it yourself. Everything is open source at https://github.com/AutoShiftOps/incopilot.
Project structure
incopilot/
__init__.py
cli.py # argument parsing + console output
collectors.py # journalctl, docker logs, file, bundle
analyzer.py # pattern detection + line normalization
reporter.py # report.md / report.json generation
config.py # patterns, golden-signal map, safe-command list
scripts/
demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md
Setup
git clone https://github.com/AutoShiftOps/incopilot.git
cd incopilot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Quick test (no real services needed)
python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log
ls out/
Systemd journal triage
python -m incopilot journal --unit nginx --since "30 min ago"
Docker triage
python -m incopilot docker --container my-api --since 1h
Both sources (bundle)
python -m incopilot bundle \
--unit nginx \
--container my-api \
--since-journal "30 min ago" \
--since-docker 1h
What you get
out/report.md — paste into your incident doc
out/report.json — attach to a ticket or POST to a webhook
What to improve next
- Per-service pattern packs (nginx, postgres, java, node)
- Slack/Teams webhook posting (
--webhook <url>) - Unit tests + GitHub Actions CI
- Scheduled timer (systemd timer unit) for proactive reports
Sudhakar Sajja is an Application Architect at TechMahindra with 13 years of experience across protocol testing, SDET, DevOps, and cloud architecture. He specialises in AI-powered DevOps operations — building tools that use LLMs to replace manual incident response and query diagnostics. He writes weekly at AutoShiftOps (autoshiftops.com) and built QueryTuner (querytuner.com), an AI-driven SQL query analysis tool. Based in Mississauga, Canada.
Top comments (0)