<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sajja Sudhakararao</title>
    <description>The latest articles on DEV Community by Sajja Sudhakararao (@sajjas).</description>
    <link>https://dev.to/sajjas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12388%2F9ee6f0e9-48cf-40c1-8a31-cc2f392506e1.png</url>
      <title>DEV Community: Sajja Sudhakararao</title>
      <link>https://dev.to/sajjas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sajjas"/>
    <language>en</language>
    <item>
      <title>🚀 Building an AI Incident Copilot: How I Automated the First 15 Minutes of Every Production Incident</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Wed, 29 Apr 2026 17:15:12 +0000</pubDate>
      <link>https://dev.to/sajjas/building-an-ai-incident-copilot-how-i-automated-the-first-15-minutes-of-every-production-incident-1cbm</link>
      <guid>https://dev.to/sajjas/building-an-ai-incident-copilot-how-i-automated-the-first-15-minutes-of-every-production-incident-1cbm</guid>
      <description>&lt;p&gt;Every production incident follows the same painful ritual.&lt;/p&gt;

&lt;p&gt;An alert fires at 2 a.m. An engineer wakes up, SSHes into a server, and begins the manual loop: pulling logs, scanning for errors, guessing what to check next. This loop can take 15 to 45 minutes before the real diagnosis even begins. Multiply that by every incident across every team in your organisation, and you have thousands of engineering hours lost every year to work that is repetitive, stressful, and largely automatable.&lt;/p&gt;

&lt;p&gt;I've been on that on-call rotation. I know what it costs — not just in time, but in cognitive load, in missed context, and in the compounding pressure of an active incident. So I built incopilot: a CLI tool that automates the entire first-pass triage so engineers can skip straight to actual problem-solving.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture, the design decisions, and exactly how to build it yourself. Everything is open source at &lt;a href="https://github.com/AutoShiftOps/incopilot" rel="noopener noreferrer"&gt;https://github.com/AutoShiftOps/incopilot&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Project structure&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incopilot/
  __init__.py
  cli.py          # argument parsing + console output
  collectors.py   # journalctl, docker logs, file, bundle
  analyzer.py     # pattern detection + line normalization
  reporter.py     # report.md / report.json generation
  config.py       # patterns, golden-signal map, safe-command list
scripts/
  demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
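
&lt;p&gt;To make the role of &lt;code&gt;analyzer.py&lt;/code&gt; concrete, here is a minimal sketch of pattern detection plus line normalization. It is not the code from the repository: the pattern names, the regular expressions, and the &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;analyze&lt;/code&gt; functions are assumptions chosen for illustration, and the real patterns in &lt;code&gt;config.py&lt;/code&gt; may look quite different.&lt;/p&gt;

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the real ones live in config.py.
PATTERNS = {
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
}

# Volatile tokens that make otherwise identical lines look different.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
_HEX_ID = re.compile(r"\b0x[0-9a-f]+\b|\b[0-9a-f]{8,}\b", re.IGNORECASE)
_NUMBER = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Replace timestamps, long hex ids, and numbers with placeholders."""
    line = _TIMESTAMP.sub("{TS}", line)
    line = _HEX_ID.sub("{ID}", line)
    line = _NUMBER.sub("{N}", line)
    return line.strip()


def analyze(lines):
    """Return (pattern hit counts, counts of normalized line shapes)."""
    hits = Counter()
    shapes = Counter()
    for raw in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(raw):
                hits[name] += 1
        shapes[normalize(raw)] += 1
    return hits, shapes
```

&lt;p&gt;Normalizing before counting means that two lines differing only in timestamp, port, or request id collapse into one shape, so the most frequent shapes surface first.&lt;/p&gt;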



&lt;h2&gt;Setup&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/AutoShiftOps/incopilot.git
&lt;span class="nb"&gt;cd &lt;/span&gt;incopilot
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Quick test (no real services needed)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/demo_generate_sample_logs.py
python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot file &lt;span class="nt"&gt;--path&lt;/span&gt; sample.log
&lt;span class="nb"&gt;ls &lt;/span&gt;out/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Systemd journal triage&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot journal &lt;span class="nt"&gt;--unit&lt;/span&gt; nginx &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"30 min ago"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Docker triage&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot docker &lt;span class="nt"&gt;--container&lt;/span&gt; my-api &lt;span class="nt"&gt;--since&lt;/span&gt; 1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Both sources (bundle)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot bundle &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--unit&lt;/span&gt; nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--container&lt;/span&gt; my-api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since-journal&lt;/span&gt; &lt;span class="s2"&gt;"30 min ago"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since-docker&lt;/span&gt; 1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
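
&lt;p&gt;For readers curious how a collector for these two sources might work, the sketch below shells out to &lt;code&gt;journalctl&lt;/code&gt; and &lt;code&gt;docker logs&lt;/code&gt; with read-only flags. The function names and the choice of extra flags (&lt;code&gt;--no-pager&lt;/code&gt;, &lt;code&gt;--output short-iso&lt;/code&gt;, &lt;code&gt;--timestamps&lt;/code&gt;) are assumptions for illustration, not necessarily what &lt;code&gt;collectors.py&lt;/code&gt; does.&lt;/p&gt;

```python
import subprocess


def journal_command(unit, since):
    """Build a read-only journalctl invocation for one systemd unit."""
    return ["journalctl", "--unit", unit, "--since", since,
            "--no-pager", "--output", "short-iso"]


def docker_command(container, since):
    """Build a read-only docker logs invocation for one container."""
    return ["docker", "logs", "--since", since, "--timestamps", container]


def collect(command, timeout=30):
    """Run a command and return its output lines.

    Returns an empty list if the binary is missing or the call times out.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout, check=False)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return []
    # docker logs replays the container's stderr on stderr, so keep both streams.
    return (result.stdout + result.stderr).splitlines()


def collect_bundle(unit, container, since_journal, since_docker):
    """Gather lines from both sources, keyed by where they came from."""
    return {
        "journal": collect(journal_command(unit, since_journal)),
        "docker": collect(docker_command(container, since_docker)),
    }
```

&lt;p&gt;Passing the command as a list rather than a shell string avoids quoting problems with values such as &lt;code&gt;"30 min ago"&lt;/code&gt;.&lt;/p&gt;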



&lt;h2&gt;What you get&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;out/report.md&lt;/code&gt;: paste into your incident doc&lt;br&gt;
&lt;code&gt;out/report.json&lt;/code&gt;: attach to a ticket or POST to a webhook&lt;/p&gt;
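
&lt;p&gt;As a rough sketch of the webhook route, the snippet below reads &lt;code&gt;out/report.json&lt;/code&gt; and wraps a one-line summary in a JSON body with a &lt;code&gt;text&lt;/code&gt; field, the shape Slack incoming webhooks accept. The function names are hypothetical, and because the report schema is not shown here, the summary only lists the report's top-level keys (assuming the report is a JSON object).&lt;/p&gt;

```python
import json
import urllib.request
from pathlib import Path


def build_webhook_request(report_path, webhook_url):
    """Build (but do not send) a POST request carrying a short text summary."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # Assumption: the report is a JSON object; only its top-level keys are used.
    summary = "incopilot report: " + ", ".join(sorted(report))
    body = json.dumps({"text": summary}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_report(report_path, webhook_url, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_webhook_request(report_path, webhook_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

&lt;p&gt;Keeping request construction separate from sending makes the payload easy to inspect before anything leaves the machine.&lt;/p&gt;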

&lt;h2&gt;What to improve next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Per-service pattern packs (nginx, postgres, java, node)&lt;/li&gt;
&lt;li&gt;Slack/Teams webhook posting (&lt;code&gt;--webhook &amp;lt;url&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Unit tests + GitHub Actions CI&lt;/li&gt;
&lt;li&gt;Scheduled timer (systemd timer unit) for proactive reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sudhakar Sajja is an Application Architect at TechMahindra with 13 years of experience across protocol testing, SDET, DevOps, and cloud architecture. He specialises in AI-powered DevOps operations — building tools that use LLMs to replace manual incident response and query diagnostics. He writes weekly at AutoShiftOps (autoshiftops.com) and built QueryTuner (querytuner.com), an AI-driven SQL query analysis tool. Based in Mississauga, Canada.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
