<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sajja Sudhakararao</title>
    <description>The latest articles on DEV Community by Sajja Sudhakararao (@sajjasudhakararao).</description>
    <link>https://dev.to/sajjasudhakararao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3613095%2F38ea8a76-19cb-4ef0-a58a-5717c66d0b98.png</url>
      <title>DEV Community: Sajja Sudhakararao</title>
      <link>https://dev.to/sajjasudhakararao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sajjasudhakararao"/>
    <language>en</language>
    <item>
      <title>Build an Alert Decision Layer CLI in Python</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 19 Apr 2026 01:53:24 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/build-an-alert-decision-layer-cli-in-python-50l0</link>
      <guid>https://dev.to/sajjasudhakararao/build-an-alert-decision-layer-cli-in-python-50l0</guid>
      <description>&lt;p&gt;We talk a lot about &lt;strong&gt;alerting&lt;/strong&gt;, but not enough about &lt;strong&gt;deciding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This weekend project builds a small &lt;strong&gt;Alert Decision Layer&lt;/strong&gt; as a Python CLI called &lt;code&gt;alertdecider&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: alerts JSON (think Alertmanager or PagerDuty export).&lt;/li&gt;
&lt;li&gt;Engine: a clear rule set that considers severity, environment, service tier, and flapping history.&lt;/li&gt;
&lt;li&gt;Output: Markdown + JSON with decisions (&lt;code&gt;page&lt;/code&gt;, &lt;code&gt;ticket&lt;/code&gt;, &lt;code&gt;aggregate&lt;/code&gt;, &lt;code&gt;suppress&lt;/code&gt;) and reasons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you liked project-based posts like "AI trading bot in Python" or "Self-healing containers with Bash", this sits in the same category: you end up with a tool you can run and extend.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you'll build
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial you’ll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python package &lt;code&gt;alertdecider-agent/&lt;/code&gt; with:

&lt;ul&gt;
&lt;li&gt;Dataclasses for &lt;code&gt;Alert&lt;/code&gt;, &lt;code&gt;ServiceProfile&lt;/code&gt;, &lt;code&gt;History&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A rule-based &lt;code&gt;AlertDecisionEngine&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A CLI entry point.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;An &lt;code&gt;examples/&lt;/code&gt; folder with sample alerts, services, and alert history.&lt;/li&gt;

&lt;li&gt;A command you can run locally to triage alerts and generate a report.&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;Basic familiarity with JSON/YAML&lt;/li&gt;
&lt;li&gt;A terminal where you can run &lt;code&gt;python -m ...&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Clone and set up the project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/AutoShiftOps/alertdecider.git
&lt;span class="nb"&gt;cd &lt;/span&gt;alertdecider
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;requirements.txt&lt;/code&gt; is intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rich==13.9.4
PyYAML==6.0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Model the domain: alerts, services, history
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;alertdecider-agent/models.py&lt;/code&gt; we define three dataclasses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;starts_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceProfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;slo_critical&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;History&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;count_24h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;last_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us a &lt;strong&gt;normalized view&lt;/strong&gt; of alerts and some context we can use to make better decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Load alerts, services, and history
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;alertdecider-agent/loader.py&lt;/code&gt; contains helpers to turn raw files into those models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;load_alerts(path)&lt;/code&gt; reads &lt;code&gt;alerts.json&lt;/code&gt; and extracts labels like &lt;code&gt;alertname&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;env&lt;/code&gt;, and &lt;code&gt;fingerprint&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_services(path)&lt;/code&gt; reads &lt;code&gt;services.yml&lt;/code&gt; and builds &lt;code&gt;ServiceProfile&lt;/code&gt; objects.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_history(path)&lt;/code&gt; reads &lt;code&gt;history.json&lt;/code&gt; and tracks how many times each fingerprint fired in the last 24h.&lt;/li&gt;
&lt;/ul&gt;
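&lt;p&gt;To make the normalization step concrete, here is a minimal sketch of what a &lt;code&gt;load_alerts&lt;/code&gt;-style helper could look like. It is not the repository's implementation: the function name &lt;code&gt;parse_alerts&lt;/code&gt;, the fallback values, and the assumed payload shape (a JSON array with &lt;code&gt;labels&lt;/code&gt;, &lt;code&gt;startsAt&lt;/code&gt;, and &lt;code&gt;fingerprint&lt;/code&gt; keys, similar to an Alertmanager export) are illustrative assumptions.&lt;/p&gt;

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]


def parse_alerts(payload: str, source: str = "alertmanager") -> List[Alert]:
    """Normalize a JSON array of alerts (assumed Alertmanager-like shape)."""
    alerts = []
    for index, item in enumerate(json.loads(payload)):
        labels = item.get("labels", {})
        fingerprint = item.get("fingerprint") or labels.get("fingerprint", "")
        alerts.append(
            Alert(
                # Fall back to a positional id when no fingerprint is present.
                id=fingerprint or f"alert-{index}",
                source=source,
                name=labels.get("alertname", "unknown"),
                service=labels.get("service", "unknown"),
                severity=labels.get("severity", "info"),
                environment=labels.get("env", "unknown"),
                starts_at=item.get("startsAt", ""),
                fingerprint=fingerprint,
                raw=item,  # keep the original payload for debugging
            )
        )
    return alerts


# Hypothetical sample payload, used only for illustration.
sample = json.dumps([
    {
        "labels": {"alertname": "HighErrorRate", "service": "checkout-api",
                   "severity": "critical", "env": "prod"},
        "startsAt": "2026-04-18T22:10:00Z",
        "fingerprint": "abc123",
    }
])
alerts = parse_alerts(sample)
print(alerts[0].name, alerts[0].environment)
```

&lt;p&gt;Keeping the untouched payload in &lt;code&gt;raw&lt;/code&gt; means a rule can still reach for a label the model does not expose yet.&lt;/p&gt;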

&lt;p&gt;Example &lt;code&gt;services.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;checkout-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tier1&lt;/span&gt;
    &lt;span class="na"&gt;slo_critical&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-checkout&lt;/span&gt;
  &lt;span class="na"&gt;notification-service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tier1&lt;/span&gt;
    &lt;span class="na"&gt;slo_critical&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-notify&lt;/span&gt;
  &lt;span class="na"&gt;internal-reporting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tier2&lt;/span&gt;
    &lt;span class="na"&gt;slo_critical&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
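&lt;p&gt;As a rough sketch of the second loader, the snippet below turns the already-parsed YAML mapping into &lt;code&gt;ServiceProfile&lt;/code&gt; objects keyed by service name. The helper name &lt;code&gt;build_profiles&lt;/code&gt; and the default tier are assumptions, and a plain dict stands in for the result of parsing &lt;code&gt;services.yml&lt;/code&gt; so the example runs without extra dependencies.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ServiceProfile:
    name: str
    tier: str
    slo_critical: bool
    owner: Optional[str] = None


def build_profiles(doc: Dict[str, Any]) -> Dict[str, ServiceProfile]:
    """Turn the parsed services mapping into ServiceProfile objects keyed by name."""
    profiles = {}
    for name, fields in (doc.get("services") or {}).items():
        profiles[name] = ServiceProfile(
            name=name,
            tier=str(fields.get("tier", "tier3")),  # default tier is an assumption
            slo_critical=bool(fields.get("slo_critical", False)),
            owner=fields.get("owner"),
        )
    return profiles


# Stand-in for the parsed content of the services.yml example above.
parsed = {
    "services": {
        "checkout-api": {"tier": "tier1", "slo_critical": True, "owner": "team-checkout"},
        "internal-reporting": {"tier": "tier2", "slo_critical": False, "owner": "team-data"},
    }
}
profiles = build_profiles(parsed)
print(profiles["checkout-api"])
```

&lt;p&gt;Keying the result by service name lets the engine look up a profile with a simple &lt;code&gt;services.get(alert.service)&lt;/code&gt;, which returns &lt;code&gt;None&lt;/code&gt; for services that are not listed.&lt;/p&gt;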






&lt;h2&gt;
  
  
  4. Implement the AlertDecisionEngine
&lt;/h2&gt;

&lt;p&gt;Now the interesting part: turn alerts + context into decisions.&lt;/p&gt;

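&lt;p&gt;In &lt;code&gt;alertdecider-agent/engine.py&lt;/code&gt; we implement &lt;code&gt;AlertDecisionEngine&lt;/code&gt; with a few rules. Each rule returns a &lt;code&gt;Decision&lt;/code&gt; that pairs the alert with an action and a human-readable reason, and the first rule that matches wins. Conceptually:&lt;br&gt;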
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AlertDecisionEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_decide_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_decide_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;service_profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 1) Suppress noisy low-severity alerts in non-prod
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sev&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suppress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low-severity alert in non-prod environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2) Page for tier1, slo-critical services on critical alerts
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;service_profile&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt;
                &lt;span class="n"&gt;service_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tier1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;service_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slo_critical&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical alert on tier1 slo-critical service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3) Aggregate flapping alerts (lots of repeats in 24h)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_24h&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggregate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert fingerprint is flapping/noisy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4) Warnings for tier1 services become tickets (in any environment)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;service_profile&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;service_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tier1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warning on tier1 service; track as ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 5) Default: ticket for prod, suppress for non-prod
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod alert without more specific rule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suppress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-prod alert without more specific rule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These rules are not perfect – they’re a &lt;strong&gt;starting point&lt;/strong&gt; you can tweak.&lt;/p&gt;

&lt;p&gt;The key is that they’re &lt;strong&gt;explicit&lt;/strong&gt;. Anyone on your team can read, discuss, and change them.&lt;/p&gt;
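&lt;p&gt;Because the first matching rule wins, rule order is part of the policy. The sketch below is a condensed restatement of the five rules as a plain function, with a simplified two-field &lt;code&gt;Decision&lt;/code&gt; (a hypothetical stand-in for the real class), so you can see how a flapping alert is treated differently depending on severity.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

PROD = {"prod", "production"}


@dataclass
class Decision:
    action: str
    reason: str


def decide(severity: str, environment: str, tier: Optional[str],
           slo_critical: bool, count_24h: int) -> Decision:
    """Condensed restatement of the five rules; the first match wins."""
    sev = severity.lower()
    if sev in {"info", "debug"} and environment not in PROD:
        return Decision("suppress", "low-severity alert in non-prod environment")
    if sev == "critical" and tier == "tier1" and slo_critical:
        return Decision("page", "critical alert on tier1 slo-critical service")
    if count_24h >= 20:
        return Decision("aggregate", "alert fingerprint is flapping/noisy")
    if sev == "warning" and tier == "tier1":
        return Decision("ticket", "warning on tier1 service; track as ticket")
    if environment in PROD:
        return Decision("ticket", "prod alert without more specific rule")
    return Decision("suppress", "non-prod alert without more specific rule")


# A flapping critical alert on a tier1, SLO-critical service still pages,
# because the page rule is checked before the flapping rule.
flapping_critical = decide("critical", "prod", "tier1", True, count_24h=50)
# The same flapping fingerprint at warning severity is aggregated instead.
flapping_warning = decide("warning", "prod", "tier1", True, count_24h=50)
print(flapping_critical.action, flapping_warning.action)
```

&lt;p&gt;If you would rather aggregate noisy critical alerts too, swapping the second and third checks is the whole change, which is exactly the kind of discussion explicit rules make easy.&lt;/p&gt;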




&lt;h2&gt;
  
  
  5. Wire up the CLI
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;alertdecider-agent/cli.py&lt;/code&gt; glues everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alertdecider-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alert Decision Layer CLI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--alerts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--out-dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_alerts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_services&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AlertDecisionEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;decisions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;write_reports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decisions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;render_console&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decisions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also have &lt;code&gt;__main__.py&lt;/code&gt; so you can use &lt;code&gt;python -m alertdecider-agent&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Run it with sample data
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;examples/&lt;/code&gt; folder contains a simple dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; alertdecider-agent   &lt;span class="nt"&gt;--alerts&lt;/span&gt; examples/alerts.json   &lt;span class="nt"&gt;--services&lt;/span&gt; examples/services.yml   &lt;span class="nt"&gt;--history&lt;/span&gt; examples/history.json   &lt;span class="nt"&gt;--out-dir&lt;/span&gt; out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Check the CLI table output.&lt;/li&gt;
&lt;li&gt;Open &lt;code&gt;out/decision_report.md&lt;/code&gt; to see the human-friendly report.&lt;/li&gt;
&lt;li&gt;Open &lt;code&gt;out/decision_report.json&lt;/code&gt; if you want to wire this into another tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try changing severity, env, or service tier in the examples and see how decisions change.&lt;/p&gt;
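&lt;p&gt;If you want a feel for how the two report formats relate, here is a small sketch that renders the same list of decisions as a Markdown table and as JSON. The column names, dictionary keys, and sample rows are assumptions for illustration; the actual layout of &lt;code&gt;decision_report.md&lt;/code&gt; and &lt;code&gt;decision_report.json&lt;/code&gt; may differ.&lt;/p&gt;

```python
import json
from typing import Dict, List


def render_markdown(decisions: List[Dict[str, str]]) -> str:
    """Render decisions as a Markdown table; the columns are an assumption."""
    rows = ["| Alert | Service | Decision | Reason |", "| --- | --- | --- | --- |"]
    for d in decisions:
        rows.append(f"| {d['alert']} | {d['service']} | {d['action']} | {d['reason']} |")
    return "\n".join(rows)


# Hypothetical decisions, shaped as plain dicts for this sketch.
decisions = [
    {"alert": "HighErrorRate", "service": "checkout-api", "action": "page",
     "reason": "critical alert on tier1 slo-critical service"},
    {"alert": "DiskUsageHigh", "service": "internal-reporting", "action": "ticket",
     "reason": "prod alert without more specific rule"},
]

markdown_report = render_markdown(decisions)
json_report = json.dumps({"decisions": decisions}, indent=2)
print(markdown_report)
```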




&lt;h2&gt;
  
  
  7. Make it yours
&lt;/h2&gt;

&lt;p&gt;Here are some ideas for adapting this to your environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add time-of-day logic (e.g., don’t page at 03:00 for non-critical alerts).&lt;/li&gt;
&lt;li&gt;Add SLO signals (e.g., error budget burn rate) into the decision rules.&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;history.json&lt;/code&gt; with a real datastore of past alerts.&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;alertdecider&lt;/code&gt; from your Alertmanager/PagerDuty webhook pipeline.&lt;/li&gt;
&lt;/ul&gt;
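&lt;p&gt;As one example, the time-of-day idea can be sketched as a small post-processing step applied to the engine's action. The function name &lt;code&gt;apply_quiet_hours&lt;/code&gt; and the 00:00 to 06:00 window are assumptions, and the step only matters once you add rules that page on something other than critical alerts.&lt;/p&gt;

```python
from datetime import datetime

QUIET_HOURS = range(0, 6)  # assumed quiet window: 00:00 up to 06:00 local time


def apply_quiet_hours(action: str, severity: str, now: datetime) -> str:
    """Downgrade non-critical pages to tickets during the quiet window."""
    if action == "page" and severity.lower() != "critical" and now.hour in QUIET_HOURS:
        return "ticket"
    return action


night = datetime(2026, 4, 19, 3, 0)
afternoon = datetime(2026, 4, 19, 15, 0)
print(apply_quiet_hours("page", "warning", night))      # downgraded to a ticket
print(apply_quiet_hours("page", "critical", night))     # critical pages still go out
print(apply_quiet_hours("page", "warning", afternoon))  # outside the window, unchanged
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; in as an argument, rather than reading the clock inside the function, keeps the rule easy to test.&lt;/p&gt;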

&lt;p&gt;You now have a small, understandable &lt;strong&gt;alert decision layer&lt;/strong&gt; you can evolve – and a much better place to plug in AI later.&lt;/p&gt;

&lt;p&gt;If you build on this, drop a link – I’d love to see different rule sets and architectures.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>python</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Build an AI Incident Copilot CLI in Python</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 12 Apr 2026 04:46:54 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/build-an-ai-incident-copilot-cli-in-python-4850</link>
      <guid>https://dev.to/sajjasudhakararao/build-an-ai-incident-copilot-cli-in-python-4850</guid>
      <description>&lt;p&gt;When an incident fires, you don't need more dashboards.&lt;br&gt;
You need &lt;strong&gt;answers, fast&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post is a build-a-tool weekend project: a Python CLI that collects logs from systemd and Docker, highlights repeating patterns, maps them to the Golden Signals, and generates a ready-to-use incident report.&lt;/p&gt;
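&lt;p&gt;To illustrate the pattern-highlighting idea, here is a minimal sketch that collapses volatile tokens (numbers, hex values) so repeated messages group together, then tags each pattern with a Golden Signal by keyword. The function names and the keyword table are assumptions for illustration, not the tool's actual analyzer.&lt;/p&gt;

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Keyword-to-signal mapping is illustrative, not the tool's real table.
SIGNAL_KEYWORDS = {
    "latency": ("timeout", "timed out", "slow"),
    "errors": ("error", "exception", "failed"),
    "traffic": ("rate limit", "too many requests"),
    "saturation": ("out of memory", "disk full", "no space left"),
}


def normalize(line: str) -> str:
    """Collapse volatile tokens so repeated messages share one pattern."""
    line = line.strip().lower()
    line = re.sub(r"0x[0-9a-f]+", "HEX", line)  # hex values first
    return re.sub(r"\d+", "N", line)            # then plain numbers


def top_patterns(lines: Iterable[str], limit: int = 3) -> List[Tuple[str, int]]:
    """Return the most frequent normalized patterns with their counts."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(limit)


def classify(pattern: str) -> str:
    """Map a pattern to the first Golden Signal whose keywords it contains."""
    for signal, keywords in SIGNAL_KEYWORDS.items():
        if any(keyword in pattern for keyword in keywords):
            return signal
    return "unclassified"


# Hypothetical log lines, used only for illustration.
logs = [
    "upstream timed out after 3000 ms (request 4512)",
    "upstream timed out after 3000 ms (request 4513)",
    "upstream timed out after 2999 ms (request 4520)",
    "worker 7 failed: connection refused",
]
for pattern, count in top_patterns(logs):
    print(count, classify(pattern), pattern)
```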

&lt;h2&gt;
  
  
  Project files
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incopilot/
  cli.py         collectors.py
  analyzer.py    reporter.py    config.py
scripts/
  demo_generate_sample_logs.py
requirements.txt    pyproject.toml    README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick demo (no real services)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/demo_generate_sample_logs.py
python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot file &lt;span class="nt"&gt;--path&lt;/span&gt; sample.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Systemd triage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot journal &lt;span class="nt"&gt;--unit&lt;/span&gt; nginx &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"30 min ago"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Docker triage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot docker &lt;span class="nt"&gt;--container&lt;/span&gt; my-api &lt;span class="nt"&gt;--since&lt;/span&gt; 1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Both (bundle)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; incopilot bundle &lt;span class="nt"&gt;--unit&lt;/span&gt; nginx &lt;span class="nt"&gt;--container&lt;/span&gt; my-api &lt;span class="nt"&gt;--since-journal&lt;/span&gt; &lt;span class="s2"&gt;"30 min ago"&lt;/span&gt; &lt;span class="nt"&gt;--since-docker&lt;/span&gt; 1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;out/report.md&lt;/code&gt; — human-friendly&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;out/report.json&lt;/code&gt; — machine-friendly&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://autoshiftops.com/devops/ai/incident%20management/machine%20learning/2025/11/29/ai-incident-copilot.html" rel="noopener noreferrer"&gt;autoshiftops.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💬 What's your go-to first command when an incident fires?&lt;br&gt;
Drop it in the comments — I'll add the best ones to the safe-commands list.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>docker</category>
      <category>sre</category>
    </item>
    <item>
      <title>Self-Healing Docker: Bash Script That Auto-Restarts Containers</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 22 Feb 2026 23:51:18 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/self-healing-docker-bash-script-that-auto-restarts-containers-3jk2</link>
      <guid>https://dev.to/sajjasudhakararao/self-healing-docker-bash-script-that-auto-restarts-containers-3jk2</guid>
      <description>&lt;p&gt;Manual restarts during incidents are reactive. Self-healing means your containers recover themselves between alerts.&lt;/p&gt;

&lt;p&gt;This post shows how to build a lightweight bash watchdog that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitors container health via Docker health checks&lt;/li&gt;
&lt;li&gt;Restarts unhealthy containers&lt;/li&gt;
&lt;li&gt;Integrates with systemd for daemon-like behavior&lt;/li&gt;
&lt;li&gt;Logs everything for incident review&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Self-Healing Architecture
&lt;/h2&gt;

&lt;p&gt;Docker has built-in restart policies (&lt;code&gt;--restart unless-stopped&lt;/code&gt;), but they only react when the container’s main process exits; they ignore health checks. A container can be "running" yet unhealthy (app deadlocked, dependencies down, etc.).&lt;/p&gt;

&lt;p&gt;Our script loops every 30 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query Docker API for container health&lt;/li&gt;
&lt;li&gt;Restart unhealthy containers&lt;/li&gt;
&lt;li&gt;Log actions to systemd journal&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Docker Health Checks (the foundation)
&lt;/h3&gt;

&lt;p&gt;First, ensure your containers have proper health checks in &lt;code&gt;docker-compose.yml&lt;/code&gt; or &lt;code&gt;docker run&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;40s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



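&lt;p&gt;Docker then reports each container as &lt;code&gt;starting&lt;/code&gt; (during startup), &lt;code&gt;healthy&lt;/code&gt;, or &lt;code&gt;unhealthy&lt;/code&gt; (after &lt;code&gt;retries&lt;/code&gt; consecutive failures). The check runs inside the container, so the image must include &lt;code&gt;curl&lt;/code&gt; and the app must serve &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;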
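
&lt;p&gt;To make the status transitions concrete, here is a minimal Python sketch of how a run of probe results maps to a health status. It is a simplified model, not Docker’s implementation: the function name &lt;code&gt;health_status&lt;/code&gt; is hypothetical, and &lt;code&gt;start_period&lt;/code&gt; and probe timeouts are deliberately ignored.&lt;/p&gt;

```python
def health_status(probe_results, retries=3):
    """Replay probe results (True = probe passed) and return the final status.

    Simplified model (assumption): ignores start_period and probe timeouts.
    """
    status = "starting"
    failing_streak = 0
    for passed in probe_results:
        if passed:
            # Any successful probe resets the streak and marks the container healthy.
            failing_streak = 0
            status = "healthy"
        else:
            failing_streak += 1
            # Only `retries` consecutive failures flip the status to unhealthy.
            if failing_streak >= retries:
                status = "unhealthy"
    return status


if __name__ == "__main__":
    print(health_status([True, False, False]))         # healthy
    print(health_status([True, False, False, False]))  # unhealthy
```

&lt;p&gt;The point for the watchdog: one failed probe does not trigger a restart; only a full streak of failures does.&lt;/p&gt;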

&lt;h3&gt;
  
  
  Step 2: The Self-Healing Watchdog Script
&lt;/h3&gt;

&lt;p&gt;Save this as &lt;code&gt;/usr/local/bin/docker-autoheal.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="c"&gt;# Config&lt;/span&gt;
&lt;span class="nv"&gt;CHECK_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30
&lt;span class="nv"&gt;LOG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/docker-autoheal.log"&lt;/span&gt;
&lt;span class="nv"&gt;CONTAINER_LABEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"autoheal=true"&lt;/span&gt;  &lt;span class="c"&gt;# Label your containers&lt;/span&gt;

log&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s1"&gt;'+%Y-%m-%d %H:%M:%S'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="nv"&gt;$*&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

heal_containers&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;unhealthy&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"health=unhealthy"&lt;/span&gt; &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"label=&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_LABEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"{{.Names}}"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;container &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;unhealthy&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;log &lt;span class="s2"&gt;"RESTARTING UNHEALTHY CONTAINER: &lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="c"&gt;# Graceful stop first&lt;/span&gt;
    docker stop &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

    &lt;span class="c"&gt;# Hard kill after timeout&lt;/span&gt;
    docker &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

    &lt;span class="c"&gt;# Restart with original command&lt;/span&gt;
    docker start &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    log &lt;span class="s2"&gt;"RESTARTED: &lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Main loop&lt;/span&gt;
log &lt;span class="s2"&gt;"Docker Auto-Heal started (check interval: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CHECK_INTERVAL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s)"&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;heal_containers
  &lt;span class="nb"&gt;sleep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CHECK_INTERVAL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Run as a Systemd Service (production-ready)
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/docker-autoheal.service&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Docker Container Auto-Healer
After=docker.service
Requires=docker.service

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/docker-autoheal.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable and start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;docker-autoheal.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker-autoheal.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status docker-autoheal.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; docker-autoheal.service &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Label Your Containers
&lt;/h3&gt;

&lt;p&gt;Tag the containers you want to auto-heal. Labels are set at creation time, so existing containers must be recreated with the label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--label&lt;/span&gt; &lt;span class="s2"&gt;"autoheal=true"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--health-cmd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"curl -f http://localhost:8080/health || exit 1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Advanced Features
&lt;/h3&gt;

&lt;p&gt;Grace Period (avoid restart loops)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to script before restart&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;docker inspect &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.RestartCount}}'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;log &lt;span class="s2"&gt;"SKIPPING: &lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="s2"&gt; restart count too high (restart loop?)"&lt;/span&gt;
  &lt;span class="k"&gt;continue
fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
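
&lt;p&gt;An alternative guard is for the watchdog to count its own restarts per container inside a sliding time window. The Python sketch below illustrates that idea only; the class name &lt;code&gt;RestartBudget&lt;/code&gt; and its defaults (3 restarts per 10 minutes) are assumptions, not part of the script above.&lt;/p&gt;

```python
from collections import deque


class RestartBudget:
    """Allow at most max_restarts per container within a sliding time window."""

    def __init__(self, max_restarts=3, window_seconds=600):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.history = {}  # container name -> deque of restart timestamps

    def allow(self, container, now):
        """Record and permit a restart at time `now`, or return False if over budget."""
        times = self.history.setdefault(container, deque())
        # Forget restarts that have aged out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_restarts:
            return False
        times.append(now)
        return True


if __name__ == "__main__":
    import time

    budget = RestartBudget()
    print(budget.allow("web", time.time()))
```

&lt;p&gt;When &lt;code&gt;allow&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;, the watchdog would log the skip and leave the container for a human to inspect.&lt;/p&gt;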



&lt;p&gt;Webhook Alerts&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;container&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$container&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;action&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;restart&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WEBHOOK_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Host (Docker Swarm)
&lt;/h3&gt;

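&lt;p&gt;The watchdog only talks to the local Docker daemon, so run it on every host. On a Swarm cluster, the orchestrator already replaces tasks whose health checks fail, so the script is only needed for standalone containers.&lt;/p&gt;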

&lt;h3&gt;
  
  
  Testing It (safely)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Spin up a container with a failing health check:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; test-fail &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--label&lt;/span&gt; &lt;span class="s2"&gt;"autoheal=true"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--health-cmd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"exit 1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  alpine &lt;span class="nb"&gt;sleep &lt;/span&gt;infinity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



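&lt;ol start="2"&gt;
&lt;li&gt;Watch it get restarted. With Docker’s default health-check interval (30s) and retries (3), expect the first restart after roughly 90 seconds: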
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; docker-autoheal.service &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Limitations + When to Upgrade
&lt;/h3&gt;

&lt;p&gt;This works great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small deployments / homelabs&lt;/li&gt;
&lt;li&gt;Edge services / single-host apps&lt;/li&gt;
&lt;li&gt;Dev/staging environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Upgrade to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes: liveness/readiness probes + pod disruption budgets&lt;/li&gt;
&lt;li&gt;Docker Swarm: service replicas + constraints&lt;/li&gt;
&lt;li&gt;Nomad: health checks + restart stanzas&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary Table (Copy/Paste)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Component        | Command                             | Purpose                  |
| ---------------- | ----------------------------------- | ------------------------ |
| Health Check     | docker ps --filter health=unhealthy | Find broken containers   |
| Watchdog         | systemctl status docker-autoheal    | Confirm service running  |
| Logs             | journalctl -u docker-autoheal       | Review restart history   |
| Container Labels | --label autoheal=true               | Target specific services |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>docker</category>
      <category>devops</category>
      <category>troubleshooting</category>
      <category>linux</category>
    </item>
    <item>
      <title>The Container Troubleshooting Playbook: OOMs, CPU, and I/O</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 15 Feb 2026 13:12:21 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/the-container-troubleshooting-playbook-ooms-cpu-and-io-19g4</link>
      <guid>https://dev.to/sajjasudhakararao/the-container-troubleshooting-playbook-ooms-cpu-and-io-19g4</guid>
      <description>&lt;p&gt;When a container fails in production, you don’t always have time to browse StackOverflow. You need a checklist.&lt;/p&gt;

&lt;p&gt;This post is a field guide for the three most common container "murders": Memory (OOMKilled), CPU Throttling, and I/O Saturation. We’ll diagnose each using the &lt;code&gt;docker stats&lt;/code&gt; + Linux host tools workflow we established last week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 1: The "Silent" Death (OOMKilled)
&lt;/h2&gt;

&lt;p&gt;Symptom: The container restarts randomly. No error logs in the application output because it was killed instantly by the kernel.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Confirm it was an OOM Kill
&lt;/h3&gt;

&lt;p&gt;Docker knows why the container died. Ask it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &amp;lt;container&amp;gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.State.OOMKilled}}'&lt;/span&gt;
&lt;span class="c"&gt;# Output: true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



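&lt;p&gt;Or check the exit code. 137 = 128 + 9 (SIGKILL), which is consistent with an OOM kill but also with a manual &lt;code&gt;docker kill&lt;/code&gt;, so treat &lt;code&gt;OOMKilled&lt;/code&gt; as the stronger signal:&lt;br&gt;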
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &amp;lt;container&amp;gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.State.ExitCode}}'&lt;/span&gt;
&lt;span class="c"&gt;# Output: 137&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
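
&lt;p&gt;The "128 + signal number" convention can be decoded mechanically. A small Python sketch, assuming a Linux host; the helper name &lt;code&gt;signal_from_exit_code&lt;/code&gt; is made up for illustration:&lt;/p&gt;

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal name encoded in an exit code above 128, else None."""
    if exit_code > 128:
        try:
            return signal.Signals(exit_code - 128).name
        except ValueError:
            # The remainder is not a known signal number on this platform.
            return None
    return None


if __name__ == "__main__":
    print(signal_from_exit_code(137))  # SIGKILL
    print(signal_from_exit_code(143))  # SIGTERM
```

&lt;p&gt;An exit code of 143 (SIGTERM) usually means something asked the container to stop, which points away from the OOM killer.&lt;/p&gt;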



&lt;h3&gt;
  
  
  2. Find the "Smoking Gun" in Kernel Logs
&lt;/h3&gt;

&lt;p&gt;If Docker confirms it, see exactly when the kernel snapped. Run this on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dmesg &lt;span class="nt"&gt;-T&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"killed process"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see a line like: &lt;code&gt;Out of memory: Killed process 1234 (node) total-vm:2048kB, anon-rss:1024kB&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
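&lt;p&gt;&lt;strong&gt;Immediate:&lt;/strong&gt; Bump the memory limit if the host has capacity. If the container already has a limit, raise &lt;code&gt;--memory-swap&lt;/code&gt; in the same command; Docker rejects a memory limit larger than the existing swap limit.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker update &lt;span class="nt"&gt;--memory&lt;/span&gt; 2g &lt;span class="nt"&gt;--memory-swap&lt;/span&gt; 4g &amp;lt;container&amp;gt;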
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; Check your application for memory leaks. If it’s Java, check the heap settings (&lt;code&gt;-Xmx&lt;/code&gt;). If it’s Node, check the V8 heap cap (&lt;code&gt;--max-old-space-size&lt;/code&gt;) and the GC behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scenario 2: The "Slow" Death (CPU Throttling)
&lt;/h2&gt;

&lt;p&gt;Symptom: App is running but incredibly slow. Latency spikes. Health checks time out.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Check if it’s throttling
&lt;/h3&gt;

&lt;p&gt;Linux cgroups enforce CPU limits by "pausing" your process once it has used its quota for the current scheduling period (100 ms by default). The kernel doesn’t kill the app; it freezes it until the next period begins.&lt;/p&gt;

&lt;p&gt;Check &lt;code&gt;docker stats&lt;/code&gt; first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stats &lt;span class="nt"&gt;--no-stream&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CPU %&lt;/code&gt; is relative to one full core, not to your limit. If it sits consistently at the configured ceiling (e.g., you gave the container 0.5 CPUs and it hovers around 50%), it is very likely being throttled.&lt;/p&gt;
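
&lt;p&gt;The arithmetic is easy to get wrong under pressure, so here is a tiny Python sketch of it. It assumes &lt;code&gt;docker stats&lt;/code&gt; reports 100% per full core; the function name &lt;code&gt;limit_utilization&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def limit_utilization(cpu_percent, cpus_limit):
    """Return the fraction of the CPU limit in use (1.0 = at the ceiling).

    cpu_percent: the CPU % column from docker stats (100 = one full core).
    cpus_limit:  the value passed to --cpus when the container was created.
    """
    if not cpus_limit > 0:
        raise ValueError("cpus_limit must be positive")
    return cpu_percent / (cpus_limit * 100.0)


if __name__ == "__main__":
    # 0.5 CPUs allowed, 50% observed: the container is at 100% of its limit.
    print(limit_utilization(50.0, 0.5))  # 1.0
```

&lt;p&gt;Values that stay close to 1.0 are the cue to look at the cgroup throttle counters.&lt;/p&gt;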

&lt;h3&gt;
  
  
  2. Verify Throttling in cgroups
&lt;/h3&gt;

&lt;p&gt;Look at the raw cgroup metrics (the path depends on the cgroup version and driver):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find the container ID&lt;/span&gt;
docker inspect &amp;lt;container&amp;gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.Id}}'&lt;/span&gt;

&lt;span class="c"&gt;# Check throttle stats (path varies by distro, commonly:)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/cpu/docker/&amp;lt;long-id&amp;gt;/cpu.stat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;nr_throttled&lt;/code&gt; and &lt;code&gt;throttled_time&lt;/code&gt; (&lt;code&gt;throttled_usec&lt;/code&gt; on cgroup v2). If these numbers keep rising between reads, your app is gasping for air.&lt;/p&gt;
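
&lt;p&gt;A rough way to quantify this is the share of scheduling periods in which the container was throttled (&lt;code&gt;nr_throttled / nr_periods&lt;/code&gt;). The Python sketch below parses the text of a &lt;code&gt;cpu.stat&lt;/code&gt; file; the function name &lt;code&gt;throttle_ratio&lt;/code&gt; and the sample numbers are illustrative assumptions.&lt;/p&gt;

```python
def throttle_ratio(cpu_stat_text):
    """Return the fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Each line of cpu.stat is "key value"; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit configured, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 123456789\n"
    print(throttle_ratio(sample))  # 0.25
```

&lt;p&gt;The counters are cumulative, so comparing two reads a few seconds apart says more about the current state than a single snapshot.&lt;/p&gt;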

&lt;h3&gt;
  
  
  3. The Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
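&lt;p&gt;&lt;strong&gt;Lift the limit&lt;/strong&gt; temporarily to prove it’s the bottleneck. &lt;code&gt;docker update&lt;/code&gt; treats a value of &lt;code&gt;0&lt;/code&gt; as "no change", so raise the limit to the host’s core count instead.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker update &lt;span class="nt"&gt;--cpus&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;lt;container&amp;gt;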
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tune the limit:&lt;/strong&gt; If the app genuinely needs that CPU, increase the limit. If it’s a bug (e.g., an infinite loop), profile the app.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scenario 3: The "Gridlock" (Disk I/O Saturation)
&lt;/h2&gt;

&lt;p&gt;Symptom: The container becomes unresponsive, &lt;code&gt;docker ps&lt;/code&gt; hangs, or logs stop writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identify the I/O Hog
&lt;/h3&gt;

&lt;p&gt;Is it the container or the neighbor?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check host I/O&lt;/span&gt;
iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



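&lt;p&gt;If &lt;code&gt;%util&lt;/code&gt; stays above 80%, the disk is probably saturated. On SSDs and NVMe drives, which serve requests in parallel, also check &lt;code&gt;r_await&lt;/code&gt; and &lt;code&gt;w_await&lt;/code&gt; before drawing conclusions.&lt;/p&gt;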

&lt;h3&gt;
  
  
  2. Blame the Container
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;pidstat&lt;/code&gt; (part of &lt;code&gt;sysstat&lt;/code&gt;) to find which process is thrashing the disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pidstat &lt;span class="nt"&gt;-d&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



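&lt;p&gt;Look for the PID with high &lt;code&gt;kB_rd/s&lt;/code&gt; or &lt;code&gt;kB_wr/s&lt;/code&gt;. To match it back to a container, print each candidate container’s main PID and compare; for child processes, &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/cgroup&lt;/code&gt; on the host includes the owning container’s ID:&lt;br&gt;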
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.State.Pid}}'&lt;/span&gt; &amp;lt;container&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
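
&lt;p&gt;Going the other way, from a PID reported by &lt;code&gt;pidstat&lt;/code&gt; to a container, can be scripted by reading that process’s cgroup file on the host. This is a sketch under the assumption that the cgroup path embeds the 64-character container ID, as it does with Docker’s usual cgroupfs and systemd drivers; both helper names are hypothetical.&lt;/p&gt;

```python
import re

# Docker container IDs are 64 lowercase hexadecimal characters.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Extract a container ID from the contents of a /proc/PID/cgroup file."""
    match = CONTAINER_ID.search(cgroup_text)
    return match.group(0) if match else None


def container_id_for_pid(pid):
    """Read /proc/PID/cgroup on the Docker host and extract the container ID."""
    with open(f"/proc/{pid}/cgroup", encoding="utf-8") as handle:
        return container_id_from_cgroup(handle.read())


if __name__ == "__main__":
    import sys

    # Usage: pass the PID reported by pidstat as the first argument.
    print(container_id_for_pid(int(sys.argv[1])))
```

&lt;p&gt;A result of &lt;code&gt;None&lt;/code&gt; suggests the process belongs to the host itself rather than to a container.&lt;/p&gt;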



&lt;h3&gt;
  
  
  3. The Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Limit the blast radius:&lt;/strong&gt; Set a Block I/O limit on the greedy container so it doesn’t kill the host.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker update &lt;span class="nt"&gt;--blkio-weight&lt;/span&gt; 100 &amp;lt;container&amp;gt;  &lt;span class="c"&gt;# Low priority (default 500)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Move logs:&lt;/strong&gt; Ensure your app isn’t flooding &lt;code&gt;stdout&lt;/code&gt; with debug output; the default JSON log driver writes every line to the host’s disk. Lower the log level or ship logs off the host.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bonus: Network Connectivity Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; "Connection refused" or timeouts between containers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "Can I reach it?" Check
&lt;/h3&gt;

&lt;p&gt;Don't guess. Enter the container’s namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;source-container&amp;gt; sh
&lt;span class="c"&gt;# Inside:&lt;/span&gt;
ping &amp;lt;target-container-name&amp;gt;
nc &lt;span class="nt"&gt;-zv&lt;/span&gt; &amp;lt;target-container-name&amp;gt; &amp;lt;port&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. If DNS fails
&lt;/h3&gt;

&lt;p&gt;Docker has its own internal DNS. Check &lt;code&gt;/etc/resolv.conf&lt;/code&gt; inside the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/resolv.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



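&lt;p&gt;On a user-defined network it should point to Docker’s embedded DNS server (&lt;code&gt;127.0.0.11&lt;/code&gt;). Containers on the default &lt;code&gt;bridge&lt;/code&gt; network inherit the host’s resolvers instead and cannot resolve other containers by name. If the entry is missing or wrong, check the container’s network and your daemon config.&lt;/p&gt;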

&lt;h2&gt;
  
  
  Summary Checklist (Copy/Paste)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Check Command&lt;/th&gt;
&lt;th&gt;Fix Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random Restarts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docker inspect &amp;lt;container&amp;gt; --format '{{.State.OOMKilled}}'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Increase RAM limit / Fix memory leak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sluggish App&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cat /sys/fs/cgroup/cpu/docker/&amp;lt;id&amp;gt;/cpu.stat&lt;/code&gt; (cgroup v1 path; check &lt;code&gt;nr_throttled&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Increase CPU limit / Profile app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Host Unresponsive&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;iostat -x 1 5&lt;/code&gt; &amp;amp; &lt;code&gt;pidstat -d 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Limit Block I/O weight / Reduce logging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network Timeout&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docker exec &amp;lt;container&amp;gt; nc -zv &amp;lt;target&amp;gt; &amp;lt;port&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check Docker DNS / Verify network aliases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Now that you can debug containers manually, how do you automate this? Next week, we’ll build a "Self-Healing" Bash Script that detects these states and alerts you automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which of these kills your containers most often? For me, it's always the silent OOM killer.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>troubleshooting</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Docker Monitoring Without a Platform: docker stats + cgroups (DevOps)</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sat, 31 Jan 2026 21:40:31 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/docker-monitoring-without-a-platform-docker-stats-cgroups-devops-23gf</link>
      <guid>https://dev.to/sajjasudhakararao/docker-monitoring-without-a-platform-docker-stats-cgroups-devops-23gf</guid>
      <description>&lt;p&gt;When an incident hits a containerized service, you often don’t need a full observability stack to get traction. You need fast answers: Which container is hot? What resource is saturating? Is it an app problem or a limit problem?&lt;/p&gt;

&lt;p&gt;This guide shows a practical monitoring stack you can run from any Docker host:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Docker-level commands (docker stats, docker inspect, docker logs)&lt;/li&gt;
&lt;li&gt;Host Linux tools (ps/top/free/df/iostat/ss/journalctl)&lt;/li&gt;
&lt;li&gt;Kernel primitives: cgroups (resource limits/accounting) and namespaces (isolation)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1) Start with docker stats (the fastest signal)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;docker stats&lt;/code&gt; streams runtime metrics for containers, including CPU%, memory usage/limit, network I/O, and block I/O.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stats &lt;span class="nt"&gt;--no-stream&lt;/span&gt;          &lt;span class="c"&gt;# Snapshot (good for scripts)&lt;/span&gt;
docker stats &amp;lt;container_name&amp;gt;     &lt;span class="c"&gt;# Focus on one container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to interpret it (in plain language):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU%:&lt;/strong&gt; who’s burning compute right now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEM USAGE / LIMIT:&lt;/strong&gt; how close you are to the memory ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NET I/O:&lt;/strong&gt; traffic spikes, retries, or unusual egress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BLOCK I/O:&lt;/strong&gt; slow disks, chatty logging, or heavy read/write workloads.&lt;/li&gt;
&lt;/ul&gt;
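
&lt;p&gt;For scripted triage, the snapshot can be reduced to a "who is hottest" list. The Python sketch below assumes the text came from &lt;code&gt;docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemPerc}}"&lt;/code&gt;; the function name &lt;code&gt;hottest_containers&lt;/code&gt; and the sample rows are made up for illustration.&lt;/p&gt;

```python
def hottest_containers(stats_text, top=3):
    """Return (name, cpu_percent, mem_percent) tuples sorted by CPU, highest first."""
    rows = []
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, cpu, mem = parts
        try:
            rows.append((name, float(cpu.rstrip("%")), float(mem.rstrip("%"))))
        except ValueError:
            continue  # skip rows whose values are not numeric percentages
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    sample = "web 12.50% 3.10%\napi 250.00% 40.00%\ndb 0.75% 12.00%\n"
    for name, cpu, mem in hottest_containers(sample, top=2):
        print(name, cpu, mem)
```

&lt;p&gt;Sorting by CPU is only one choice; swapping the sort key to the memory column gives the "closest to the ceiling" view instead.&lt;/p&gt;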

&lt;h2&gt;
  
  
  2) Jump from “container name” → “what is it?”
&lt;/h2&gt;

&lt;p&gt;Once you identify a hot container, immediately gather identity + configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
docker inspect &amp;lt;container&amp;gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful &lt;code&gt;inspect&lt;/code&gt; questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What image/tag is running?&lt;/li&gt;
&lt;li&gt;What env vars/config are set?&lt;/li&gt;
&lt;li&gt;What ports and volumes are attached?&lt;/li&gt;
&lt;li&gt;Are there memory/CPU limits configured?&lt;/li&gt;
&lt;/ul&gt;
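
&lt;p&gt;The last question can be answered without scrolling through the full JSON. The sketch below reads the &lt;code&gt;HostConfig.Memory&lt;/code&gt; (bytes) and &lt;code&gt;HostConfig.NanoCpus&lt;/code&gt; fields from &lt;code&gt;docker inspect&lt;/code&gt; output, where &lt;code&gt;0&lt;/code&gt; means "no limit"; the function name &lt;code&gt;configured_limits&lt;/code&gt; is hypothetical, and limits set via &lt;code&gt;--cpu-quota&lt;/code&gt; rather than &lt;code&gt;--cpus&lt;/code&gt; are not covered.&lt;/p&gt;

```python
import json


def configured_limits(inspect_json):
    """Summarize memory and CPU limits from `docker inspect CONTAINER` output."""
    # docker inspect prints a JSON array with one object per container.
    host_config = json.loads(inspect_json)[0]["HostConfig"]
    memory_bytes = host_config.get("Memory", 0)
    nano_cpus = host_config.get("NanoCpus", 0)
    return {
        # None means the container runs without that limit.
        "memory_mib": memory_bytes / (1024 * 1024) if memory_bytes else None,
        "cpus": nano_cpus / 1e9 if nano_cpus else None,
    }


if __name__ == "__main__":
    sample = '[{"HostConfig": {"Memory": 536870912, "NanoCpus": 500000000}}]'
    print(configured_limits(sample))  # {'memory_mib': 512.0, 'cpus': 0.5}
```

&lt;p&gt;A container with no limits at all can only be squeezed by the host, which changes where you look next.&lt;/p&gt;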

&lt;h2&gt;
  
  
  3) Logs: confirm symptoms fast
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;--tail&lt;/span&gt; 200 &amp;lt;container&amp;gt;
docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;lt;container&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is often enough to spot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;crash loops&lt;/li&gt;
&lt;li&gt;OOM errors / memory pressure&lt;/li&gt;
&lt;li&gt;upstream timeouts&lt;/li&gt;
&lt;li&gt;DB connection exhaustion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4) Understand why it’s happening: cgroups + namespaces (the mental model)
&lt;/h2&gt;

&lt;p&gt;Docker relies on Linux kernel features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespaces isolate views of processes, networking, mounts, etc.&lt;/li&gt;
&lt;li&gt;cgroups control and account for resources like CPU, memory, and I/O.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why this matters during incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A container can be “slow” because it’s CPU-throttled, not because the app code suddenly got worse.&lt;/li&gt;
&lt;li&gt;A container can restart because it hit its memory limit and the kernel’s OOM behavior targeted its processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5) Host-level confirmation (tie back to your Linux monitoring toolkit)
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;docker stats&lt;/code&gt; shows a spike, verify on the host to avoid false conclusions.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU hogs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps aux &lt;span class="nt"&gt;--sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-%cpu | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Memory pressure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disk full / log explosions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;du&lt;/span&gt; &lt;span class="nt"&gt;-sh&lt;/span&gt; /var/lib/docker/&lt;span class="k"&gt;*&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disk I/O saturation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unexpected listeners / traffic patterns
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ss &lt;span class="nt"&gt;-tuln&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These host checks help you decide whether you’re dealing with a single container or a node-wide saturation problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  6) What to do with the data (action mapping)
&lt;/h2&gt;

&lt;p&gt;Use the shortest safe path to stability:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CPU high + latency rising
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If CPU is legitimately needed: scale out / add capacity.&lt;/li&gt;
&lt;li&gt;If CPU is throttled: revisit limits/requests (or container CPU shares).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Memory near limit
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If memory leak suspected: restart as mitigation + open an issue with heap profiling.&lt;/li&gt;
&lt;li&gt;If limit too low for normal peaks: adjust limit carefully and monitor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Block I/O high
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check log volume and disk saturation; reduce noisy logs or move logs off disk.&lt;/li&gt;
&lt;li&gt;Consider storage performance constraints and workload patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Network I/O abnormal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Look for retries, timeouts, DDoS/abuse patterns, or upstream issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7) Copy/paste triage sequence (5 minutes)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1) Find the hot container&lt;/span&gt;
docker stats &lt;span class="nt"&gt;--no-stream&lt;/span&gt;

&lt;span class="c"&gt;# 2) Identify it&lt;/span&gt;
docker ps
docker inspect &amp;lt;container&amp;gt; | less

&lt;span class="c"&gt;# 3) Check symptoms&lt;/span&gt;
docker logs &lt;span class="nt"&gt;--tail&lt;/span&gt; 200 &amp;lt;container&amp;gt;

&lt;span class="c"&gt;# 4) Confirm on host (avoid guessing)&lt;/span&gt;
ps aux &lt;span class="nt"&gt;--sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-%cpu | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 3
ss &lt;span class="nt"&gt;-tuln&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s your most common container failure mode: OOM kills, CPU throttling, disk I/O, or network timeouts?&lt;/p&gt;

</description>
      <category>linux</category>
      <category>observability</category>
      <category>docker</category>
      <category>sre</category>
    </item>
    <item>
      <title>Incident Response Runbook Template for DevOps</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 18 Jan 2026 04:13:03 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/incident-response-runbook-template-for-devops-4ljl</link>
      <guid>https://dev.to/sajjasudhakararao/incident-response-runbook-template-for-devops-4ljl</guid>
      <description>&lt;h2&gt;
  
  
  Incident Response Runbook Template for DevOps
&lt;/h2&gt;

&lt;p&gt;Incidents are stressful when the team is improvising. A simple runbook reduces MTTR by making response repeatable, not heroic.&lt;br&gt;
&lt;br&gt;
This post provides a ready-to-use incident response runbook template plus a practical Linux triage checklist you can run from any box.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this runbook optimizes for
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fast acknowledgement and clear ownership (Incident Commander + roles).&lt;/li&gt;
&lt;li&gt;Early impact assessment and severity assignment to avoid under/over‑reacting.&lt;/li&gt;
&lt;li&gt;Communication cadence and “known/unknown/next update” structure that builds trust.&lt;/li&gt;
&lt;li&gt;Evidence capture (commands + logs) to support post‑incident review.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The incident runbook template
&lt;/h2&gt;

&lt;p&gt;Copy this into your internal wiki, README, Notion, or ops repo.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Trigger
&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Triggers:&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitoring alert / SLO breach&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer report escalated&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Internal detection (logs, latency spikes, error spikes)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2. Acknowledge (0–5 minutes)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Acknowledge page/alert in your paging system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an incident channel: &lt;strong&gt;#inc-YYYYMMDD-service-shortdesc&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign Incident Commander (IC) and Comms Lead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start an incident document: timeline + links + decisions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3. Assess severity (5–10 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Answer quickly:&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What’s impacted (service, region, feature)?&lt;/li&gt;
&lt;li&gt;How many users are affected, and is there revenue or compliance impact?&lt;/li&gt;
&lt;li&gt;Is impact ongoing and spreading?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Suggested severity:&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SEV1: Major outage / severe user impact; immediate coordination.&lt;/li&gt;
&lt;li&gt;SEV2: Partial outage / significant degradation; urgent but controlled.&lt;/li&gt;
&lt;li&gt;SEV3: Minor impact; can be handled async.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Stabilize first (10–30 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Goal:&lt;/b&gt; stop the bleeding before chasing root cause.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Typical mitigations:&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Roll back the last deploy/config change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disable a feature flag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scale up/out temporarily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fail over if safe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rate-limit or block abusive traffic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  5. Triage checklist (host-level)
&lt;/h3&gt;

&lt;p&gt;Run these to establish the baseline quickly (copy/paste friendly).&lt;/p&gt;

&lt;p&gt;&lt;b&gt;CPU&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps aux &lt;span class="nt"&gt;--sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-%cpu | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; any process &amp;gt;50% sustained.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Memory&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; available &amp;lt;20% total RAM.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Disk&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;du&lt;/span&gt; &lt;span class="nt"&gt;-sh&lt;/span&gt; /var/log/&lt;span class="k"&gt;*&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; any filesystem &amp;gt;90%.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Disk I/O&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; %util &amp;gt;80%, r_await/w_await &amp;gt;20ms.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Network listeners&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ss &lt;span class="nt"&gt;-tuln&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; unexpected listeners/ports.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Logs (example: nginx)&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nginx &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert cue:&lt;/b&gt; 5xx errors spiking.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Comms cadence (keep it boring)
&lt;/h3&gt;

&lt;p&gt;SEV1: updates every 10–15 minutes.&lt;br&gt;
SEV2: updates every 30 minutes.&lt;br&gt;
SEV3: async updates acceptable.&lt;/p&gt;

&lt;p&gt;Use this structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What we know&lt;/li&gt;
&lt;li&gt;What we don’t know&lt;/li&gt;
&lt;li&gt;What we’re doing now&lt;/li&gt;
&lt;li&gt;Next update at: TIME&lt;/li&gt;
&lt;/ul&gt;
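
&lt;p&gt;One way to keep updates consistent is to generate them from a small helper. The sketch below rests on a few assumptions: the function name &lt;code&gt;status_update&lt;/code&gt; and its argument order are invented for illustration, it relies on GNU &lt;code&gt;date -d&lt;/code&gt;, and it uses 15 minutes for SEV1 (the upper end of the range above).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# status_update SEV KNOWN UNKNOWN DOING
# Prints an update in the known / unknown / doing / next-update structure.
# Hypothetical helper; requires GNU date for the -d option.
status_update() {
  local sev="$1" known="$2" unknown="$3" doing="$4" minutes
  case "$sev" in
    SEV1) minutes=15 ;;
    SEV2) minutes=30 ;;
    *)    minutes="" ;;   # SEV3 and anything else: async, no fixed time
  esac
  echo "[$sev] What we know: $known"
  echo "[$sev] What we don't know: $unknown"
  echo "[$sev] What we're doing now: $doing"
  if [ -n "$minutes" ]; then
    echo "[$sev] Next update at: $(date -d "+$minutes minutes" +%H:%M)"
  else
    echo "[$sev] Next update at: async"
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, &lt;code&gt;status_update SEV2 "Checkout latency is elevated" "Root cause" "Rolling back the last deploy"&lt;/code&gt; prints four lines ready to paste into the incident channel.&lt;/p&gt;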

&lt;h3&gt;
  
  
  7. Verify resolution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Confirm user impact is gone (synthetic checks + error rate + latency).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confirm saturation is back to normal (CPU/memory/disk/I/O).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Watch for 30–60 minutes for regression.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Close and learn (post-incident)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write a brief timeline (detection → mitigation → resolution).&lt;/li&gt;
&lt;li&gt;Capture what worked, what didn’t, and what to automate.&lt;/li&gt;
&lt;li&gt;Create follow-ups: alerts tuning, runbook updates, tests, guardrails.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus: “Golden signals” lens for incidents
&lt;/h2&gt;

&lt;p&gt;When you’re lost, anchor on the four golden signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency (are requests slower?)&lt;/li&gt;
&lt;li&gt;Traffic (is demand abnormal?)&lt;/li&gt;
&lt;li&gt;Errors (is failure rate rising?)&lt;/li&gt;
&lt;li&gt;Saturation (are resources hitting limits?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps triage focused on user impact and system limits, not vanity metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Download / reuse
&lt;/h2&gt;

&lt;p&gt;If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>incidentmanagement</category>
      <category>observability</category>
      <category>linux</category>
    </item>
    <item>
      <title>Linux Monitoring &amp; Alerting: Command-Line Mastery for DevOps</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:36:07 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/linux-monitoring-alerting-command-line-mastery-for-devops-jo7</link>
      <guid>https://dev.to/sajjasudhakararao/linux-monitoring-alerting-command-line-mastery-for-devops-jo7</guid>
      <description>&lt;h2&gt;
  
  
  The Monitoring Gap Every DevOps Engineer Faces
&lt;/h2&gt;

&lt;p&gt;Full monitoring stacks like Prometheus + Grafana are great, but they take time to set up. What about the servers you inherit? The staging environments? The emergency VM you spin up during an outage?&lt;/p&gt;

&lt;p&gt;Command-line monitoring is your immediate, universal answer. These tools work on every Linux box, no agents required. Better yet, they're fast enough to script into alerting workflows.&lt;/p&gt;

&lt;p&gt;This post covers the essential Linux monitoring commands plus patterns for turning raw metrics into actionable alerts, a natural follow-up to the Bash scripting guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Real-Time Resource Dashboards
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;The &lt;code&gt;top&lt;/code&gt;/&lt;code&gt;htop&lt;/code&gt; Foundation&lt;/b&gt;&lt;br&gt;
&lt;code&gt;top&lt;/code&gt; gives you an instant system snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;top - 11:26:45 up 5 days,  3:12,  2 &lt;span class="nb"&gt;users&lt;/span&gt;,  load average: 1.23, 1.45, 1.67
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;: 12.3 us,  8.7 sy,  0.0 ni, 78.9 &lt;span class="nb"&gt;id&lt;/span&gt;,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  7900.2 total,  1234.5 free,  4567.8 used,  2097.9 buff/cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Pro move:&lt;/b&gt; &lt;code&gt;htop&lt;/code&gt; (install with &lt;code&gt;apt install htop&lt;/code&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mouse/keyboard navigation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Color-coded resource bars&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tree view of processes (&lt;code&gt;F5&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick filters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;htop &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep &lt;span class="nt"&gt;-d&lt;/span&gt;, nginx&lt;span class="si"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Monitor nginx processes only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Memory Deep Dive:&lt;/b&gt; &lt;code&gt;free -h&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       4.2Gi       1.2Gi       128Mi       2.3Gi       3.1Gi 
Swap:          2.0Gi          0B       2.0Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



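&lt;p&gt;What matters: focus on the &lt;code&gt;available&lt;/code&gt; column, not &lt;code&gt;free&lt;/code&gt;. Linux aggressively uses spare RAM for the page cache (&lt;code&gt;buff/cache&lt;/code&gt;), and most of that memory can be reclaimed when applications need it.&lt;/p&gt;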
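
&lt;p&gt;A rough sketch of turning that advice into a number you can alert on. It assumes the &lt;code&gt;free&lt;/code&gt; layout shown above, where &lt;code&gt;available&lt;/code&gt; is the seventh field of the &lt;code&gt;Mem:&lt;/code&gt; row; the function name &lt;code&gt;mem_available_pct&lt;/code&gt; is illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# mem_available_pct: reads `free` output on stdin and prints available memory
# as a whole-number percentage of total (field 2 = total, field 7 = available).
# Usage: free | mem_available_pct
mem_available_pct() {
  awk '/^Mem:/ { printf "%.0f\n", $7 * 100 / $2 }'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Feed it plain &lt;code&gt;free&lt;/code&gt; rather than &lt;code&gt;free -h&lt;/code&gt; so the fields are raw numbers, then compare the result against whatever threshold suits the host.&lt;/p&gt;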

&lt;h2&gt;
  
  
  2. CPU Analysis: Who's Eating Cycles?
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Per-Process Breakdown&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps aux &lt;span class="nt"&gt;--sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-%cpu | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql     1234 45.2 12.3 2.1g  980m ?        S    10:00   3:45 /usr/sbin/mysqld
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Historical CPU Trends:&lt;/b&gt; &lt;code&gt;sar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install: apt install sysstat&lt;/span&gt;
sar &lt;span class="nt"&gt;-u&lt;/span&gt; 1 5     &lt;span class="c"&gt;# CPU every 1 sec, 5 samples&lt;/span&gt;
sar &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/sysstat/sa08  &lt;span class="c"&gt;# Data for the 8th of the month (saDD; RHEL-family hosts use /var/log/sa/)&lt;/span&gt;

Average: CPU %user %nice %system %iowait %steal %idle
Average:    all  12.34  0.00  8.76    1.23   0.00  77.67
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Alert pattern:&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;sar &lt;span class="nt"&gt;-u&lt;/span&gt; 1 3 | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{exit !($NF &amp;lt; 70)}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU idle &amp;lt;70% for 3s - investigate!"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Disk I/O: The Silent Killer
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Current I/O:&lt;/b&gt; &lt;code&gt;iostat&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5
Device            r/s     w/s     rkB/s    wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
sda              23.4     1.2   234.5    12.3     0.0     10.2   0.00  89.12    0.1    2.3   0.45    10.0     6.2  1.23  45.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Red flags:&lt;/b&gt; &lt;code&gt;%util &amp;gt;80%&lt;/code&gt;, &lt;code&gt;r_await&lt;/code&gt;/&lt;code&gt;w_await &amp;gt;20ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Disk Space Alerts:&lt;/b&gt; &lt;code&gt;df&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;,fstype,size,used,avail,pcent,target | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; tmpfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Scriptable alert:&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"[8-9][0-9]%|[9][0-9]%|[100]%"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Disk healthy"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Network Troubleshooting Masters
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Active Connections:&lt;/b&gt; &lt;code&gt;ss&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace netstat everywhere&lt;/span&gt;
ss &lt;span class="nt"&gt;-tuln&lt;/span&gt;          &lt;span class="c"&gt;# Listening TCP/UDP&lt;/span&gt;
ss &lt;span class="nt"&gt;-tunap&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; :80   &lt;span class="c"&gt;# Processes on port 80&lt;/span&gt;
ss &lt;span class="nt"&gt;-t&lt;/span&gt; state established | &lt;span class="nb"&gt;grep&lt;/span&gt; :443 | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;  &lt;span class="c"&gt;# Active HTTPS connections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Drop Counters:&lt;/b&gt; &lt;code&gt;netstat&lt;/code&gt; or &lt;code&gt;ss&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;netstat &lt;span class="nt"&gt;-s&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"errors|dropped|retrans"&lt;/span&gt;
Ip:
    1234 total packets received
    56 dropped because of memory problems
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Live Packet Capture:&lt;/b&gt; &lt;code&gt;tcpdump&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Capture 100 packets on interface eth0, port 80&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpdump &lt;span class="nt"&gt;-i&lt;/span&gt; eth0 &lt;span class="nt"&gt;-c&lt;/span&gt; 100 port 80 &lt;span class="nt"&gt;-w&lt;/span&gt; capture.pcap

&lt;span class="c"&gt;# Read capture&lt;/span&gt;
tcpdump &lt;span class="nt"&gt;-r&lt;/span&gt; capture.pcap &lt;span class="nt"&gt;-nn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Log Monitoring: Beyond tail -f
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Service Logs:&lt;/b&gt; &lt;code&gt;journalctl&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nginx &lt;span class="nt"&gt;-f&lt;/span&gt;           &lt;span class="c"&gt;# Follow nginx logs&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nginx &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"1h ago"&lt;/span&gt;  &lt;span class="c"&gt;# Last hour&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-p&lt;/span&gt; err &lt;span class="nt"&gt;-u&lt;/span&gt; nginx      &lt;span class="c"&gt;# Only errors&lt;/span&gt;
journalctl &lt;span class="nt"&gt;--no-pager&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; panic  &lt;span class="c"&gt;# System panics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Pattern Mining:&lt;/b&gt; &lt;code&gt;grep + awk&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Count 5xx errors per minute&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nginx &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"10min ago"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;" 5[0-9]{2} "&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1, $2, substr($3, 1, 5)}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;

&lt;span class="c"&gt;# Slow requests (&amp;gt;2s); assumes $request_time is the last field of your log_format&lt;/span&gt;
&lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'$NF &amp;gt; 2 {print}'&lt;/span&gt; /var/log/nginx/access.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
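
&lt;p&gt;If nginx writes to &lt;code&gt;/var/log/nginx/access.log&lt;/code&gt; rather than the journal, the per-minute count can be sketched with &lt;code&gt;awk&lt;/code&gt; alone. This assumes the default &lt;code&gt;combined&lt;/code&gt; log format, where the timestamp is field 4 and the status code is field 9; the function name &lt;code&gt;count_5xx_per_minute&lt;/code&gt; is made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# count_5xx_per_minute: reads access-log lines on stdin (combined format
# assumed) and prints one "COUNT DD/Mon/YYYY:HH:MM" row per minute that
# contains at least one 5xx response.
count_5xx_per_minute() {
  # substr drops the leading "[" and the trailing ":SS" from the timestamp
  awk '$9 ~ /^5[0-9][0-9]$/ { print substr($4, 2, 17) }' | sort | uniq -c
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A typical call would be &lt;code&gt;tail -n 5000 /var/log/nginx/access.log | count_5xx_per_minute&lt;/code&gt;. A custom &lt;code&gt;log_format&lt;/code&gt; will shift the field numbers, so check a sample line first.&lt;/p&gt;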



&lt;h2&gt;
  
  
  6. Production Alerting Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;CPU/Memory Watchdog&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="c"&gt;# Slack incoming webhooks expect a JSON body with a "text" field&lt;/span&gt;
alert&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{\"text\": \"CPU ${CPU}%, MEM ${MEM}%\"}"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SLACK_WEBHOOK&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Round to integers: bash arithmetic cannot compare values like 12.3&lt;/span&gt;
&lt;span class="nv"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;top &lt;span class="nt"&gt;-bn1&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/Cpu\(s\)/ {printf "%.0f", $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;MEM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;free | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/Mem:/ {printf "%.0f", $3/$2 * 100}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; CPU &amp;gt; 80 &lt;span class="o"&gt;||&lt;/span&gt; MEM &amp;gt; 80 &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;alert&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
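
&lt;p&gt;If parsing &lt;code&gt;top&lt;/code&gt; output feels brittle, an alternative is to compute CPU usage from two readings of the aggregate &lt;code&gt;cpu&lt;/code&gt; line in &lt;code&gt;/proc/stat&lt;/code&gt;. This is a sketch rather than a drop-in replacement: the function name &lt;code&gt;cpu_busy_pct&lt;/code&gt; and the one-second sampling interval are assumptions, and it treats idle and iowait time as not busy.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# cpu_busy_pct FIRST SECOND: integer percentage of non-idle CPU time between
# two "cpu ..." lines taken from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total[NR] = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9   # user..steal
      idle[NR]  = $5 + $6                                  # idle + iowait
    }
    END {
      dt = total[2] - total[1]
      if (dt == 0) { print 0; exit }
      printf "%.0f\n", (dt - (idle[2] - idle[1])) * 100 / dt
    }
  '
}

# Usage (hypothetical):
#   first=$(head -n 1 /proc/stat); sleep 1; second=$(head -n 1 /proc/stat)
#   CPU=$(cpu_busy_pct "$first" "$second")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the result is already an integer, it can be compared directly in a bash arithmetic test.&lt;/p&gt;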



&lt;p&gt;&lt;b&gt;Disk Space Guardian&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# One df call yields usage and mount point; querying by source fails on names like tmpfs&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;--local&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pcent,target | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; +2 | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; usage mount&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  usage=${usage%\%}
  [[ $usage =~ ^[0-9]+$ ]] || continue  &lt;span class="c"&gt;# skip rows without a numeric percentage&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; (( usage &amp;gt; 85 )); &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ALERT: $mount at ${usage}%"&lt;/span&gt;; &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Cron schedule:&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Every 5 minutes&lt;/span&gt;
&lt;span class="k"&gt;*&lt;/span&gt;/5 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /usr/local/bin/check_resources.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. One-Line Dashboards
&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Combine tools into instant observability:&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# System overview (alias this to 'sys')&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s1"&gt;'printf "\nCPU: "; sar -u 1 1 | tail -1; printf "MEM: "; free -h | grep Mem:; printf "DISK: "; df -h / /var | tail -2'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Top resource hogs&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s1"&gt;'ps aux --sort=-%cpu | head -8; echo "---"; ps aux --sort=-%mem | head -8'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Reference Table
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Scenario    | Command                    | Pro Tip                              |
| ----------- | -------------------------- | ------------------------------------ |
| CPU trends  | sar -u 1 5                 | Historical data in /var/log/sysstat/ |
| Memory      | free -h                    | Watch available, ignore free         |
| Disk I/O    | iostat -x 1                | %util &amp;gt;80% = trouble                 |
| Connections | ss -tuln                   | Modern netstat replacement           |
| Logs        | journalctl -u nginx -f     | systemd's tail -f                    |
| Processes   | htop -p $(pgrep -d, nginx) | Filter to specific app               |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>devops</category>
      <category>linux</category>
      <category>bash</category>
      <category>shell</category>
    </item>
    <item>
      <title>Advanced Bash Scripting for DevOps Automation (With Copy‑Pasteable Examples)</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Fri, 09 Jan 2026 03:02:04 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/advanced-bash-scripting-for-devops-automation-with-copy-pasteable-examples-8a6</link>
      <guid>https://dev.to/sajjasudhakararao/advanced-bash-scripting-for-devops-automation-with-copy-pasteable-examples-8a6</guid>
      <description>&lt;p&gt;Bash is still the glue that holds a lot of DevOps workflows together. Whether you’re deploying services, wiring health checks into CI, or cleaning up logs on a forgotten VM, a few solid scripting patterns go a very long way.&lt;br&gt;
&lt;br&gt;
In this post, you’ll find copy‑pasteable Bash snippets for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Safer script defaults&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parameterized deployments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Health checks and rollbacks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Log rotation and cleanup&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple CPU/memory watchdogs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is written with day‑to‑day DevOps work in mind—not contrived toy examples.&lt;/p&gt;


&lt;h3&gt;
  
  
  1. Bash foundations that prevent outages
&lt;/h3&gt;

&lt;p&gt;Even experienced engineers skip basics that later cause flaky scripts and silent failures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
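&lt;li&gt;&lt;p&gt;&lt;strong&gt;set -e&lt;/strong&gt; – exit when a command fails instead of continuing in a bad state (commands tested by &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;while&lt;/code&gt;, &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;, or &lt;code&gt;||&lt;/code&gt; are exempt)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;set -u&lt;/strong&gt; – treat expanding an unset variable as an error&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;set -o pipefail&lt;/strong&gt; – make a pipeline fail if any command in it fails, not just the last one&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IFS=$'\n\t'&lt;/strong&gt; – split words on newlines and tabs only, so unquoted values containing spaces are not broken apart&lt;/p&gt;&lt;/li&gt;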
&lt;/ul&gt;
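&lt;p&gt;To see what &lt;code&gt;pipefail&lt;/code&gt; changes, here is a minimal sketch that runs the same failing pipeline with the option off and then on. The variable names &lt;code&gt;status_without&lt;/code&gt; and &lt;code&gt;status_with&lt;/code&gt; are illustrative only.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Compare a pipeline's exit status with and without pipefail.

set +o pipefail
false | true
status_without=$?                # only the last command (true) is reported

set -o pipefail
status_with=0
false | true || status_with=$?   # the failure of false is now reported

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

&lt;p&gt;With the option off, the pipeline reports 0 because &lt;code&gt;true&lt;/code&gt; ran last. With it on, the failure of &lt;code&gt;false&lt;/code&gt; surfaces as status 1, which is what lets &lt;code&gt;set -e&lt;/code&gt; stop the script.&lt;/p&gt;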

&lt;p&gt;Add a tiny logging helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log() { echo "[$(date +'%F %T')] $*"; }
die() { log "ERROR: $*" &amp;gt;&amp;amp;2; exit 1; }  # errors go to stderr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone makes every script more observable and safer to reuse across environments.&lt;/p&gt;
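&lt;p&gt;One way to put the helpers to work is a preflight check that stops the script before it does anything when a tool is missing. This is a sketch only: &lt;code&gt;require_cmd&lt;/code&gt; is a hypothetical helper name, not something defined elsewhere in this post.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

log() { echo "[$(date +'%F %T')] $*"; }
die() { log "ERROR: $*"; exit 1; }

# require_cmd (hypothetical helper): abort early if a command is not on PATH.
require_cmd() {
  command -v "$1" >/dev/null || die "Missing required command: $1"
}

require_cmd date
log "All required commands are available"
```

&lt;p&gt;In a deployment script you would list the real dependencies instead, for example &lt;code&gt;require_cmd git&lt;/code&gt; and &lt;code&gt;require_cmd npm&lt;/code&gt;.&lt;/p&gt;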


&lt;h3&gt;
  
  
  2. Parameters, flags, and environment‑aware scripts
&lt;/h3&gt;

&lt;p&gt;Hard‑coding values is fine for demos, but production scripts must be configurable.&lt;br&gt;
Use arguments with sensible defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="${1:-staging}"

log() { echo "[$(date +'%F %T')] [$ENVIRONMENT] $*"; }

log "Deploying to $ENVIRONMENT"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more control, turn your script into a tiny CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
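&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defaults keep set -u from tripping when a flag is omitted
ENVIRONMENT="staging"
VERSION=""

usage() { echo "Usage: deploy.sh -e &amp;lt;env&amp;gt; -v &amp;lt;version&amp;gt;"; }

while getopts "e:v:h" opt; do
  case "$opt" in
    e) ENVIRONMENT="$OPTARG" ;;
    v) VERSION="$OPTARG" ;;
    h) usage; exit 0 ;;
    *) usage &amp;gt;&amp;amp;2; exit 1 ;;
  esac
done
shift $((OPTIND - 1))  # drop parsed flags, leaving positional arguments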
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now teammates and CI pipelines can call the same script with explicit flags.&lt;/p&gt;
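&lt;p&gt;Parsing flags is only half the job; validating them catches typos before anything is deployed. The sketch below wraps the loop in a hypothetical &lt;code&gt;parse_args&lt;/code&gt; function. The allowed environment names and the rule that a version is required are assumptions for illustration, so adjust them to your setup.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="staging"
VERSION=""

# parse_args (hypothetical): parse -e/-v and reject invalid combinations.
parse_args() {
  local opt OPTIND=1   # reset OPTIND so the function can be called repeatedly
  while getopts "e:v:" opt; do
    case "$opt" in
      e) ENVIRONMENT="$OPTARG" ;;
      v) VERSION="$OPTARG" ;;
      *) return 2 ;;
    esac
  done

  # Assumed policy: only these environments exist, and a version is required.
  case "$ENVIRONMENT" in
    staging|production) ;;
    *) echo "Unknown environment: $ENVIRONMENT"; return 2 ;;
  esac
  if [ -z "$VERSION" ]; then
    echo "Missing -v VERSION"
    return 2
  fi
}

# Demo invocation; a real script would call: parse_args "$@"
parse_args -e production -v 1.4.2
echo "Deploying $VERSION to $ENVIRONMENT"
```

&lt;p&gt;Returning a status instead of exiting inside the function keeps it easy to test and leaves the decision to abort with the caller.&lt;/p&gt;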


&lt;h3&gt;
  
  
  3. A realistic deployment script pattern
&lt;/h3&gt;

&lt;p&gt;Here’s a trimmed‑down deployment flow you can adapt for your services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="${1:-staging}"
APP_DIR="/srv/myapp"
REPO_URL="git@github.com:org/myapp.git"

log() { echo "[$(date +'%F %T')] [$ENVIRONMENT] $*"; }

deploy() {
  log "Updating code..."
  if [[ ! -d "$APP_DIR/.git" ]]; then
    git clone "$REPO_URL" "$APP_DIR"
  fi

  cd "$APP_DIR"
  git fetch --all
  git checkout main
  git pull --ff-only

  log "Installing dependencies..."
  npm ci

  log "Running tests..."
  npm test

  log "Building..."
  npm run build

  log "Restarting service..."
  sudo systemctl restart myapp

  log "Deployment complete."
}

deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern is safe to re‑run (it clones only when the checkout is missing and fast‑forwards otherwise), easy to wire into CI/CD, and leaves service restarts to systemd.&lt;/p&gt;


&lt;h3&gt;
  
  
  4. Health checks and rollbacks in one script
&lt;/h3&gt;

&lt;p&gt;Automation without safety is just a faster way to ship broken code. Add health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;health_check() {
  local url="${1:-http://localhost/health}"
  if curl -fsS "$url" &amp;gt; /dev/null; then
    log "Health check passed for $url"
  else
    log "Health check FAILED for $url"
    return 1  # return rather than exit, so the caller can decide to roll back
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then define a simple rollback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;previous_version() {
  git describe --tags --abbrev=0 HEAD~1 2&amp;gt;/dev/null || echo ""
}

rollback() {
  local prev
  prev="$(previous_version)"
  [[ -z "$prev" ]] &amp;amp;&amp;amp; die "No previous version found for rollback"

  log "Rolling back to $prev"
  git checkout "$prev"
  npm ci  # reinstall the dependencies pinned by the older lockfile
  npm run build
  sudo systemctl restart myapp
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deploy
if ! health_check "https://myapp.example.com/health"; then
  log "Health check failed; rolling back"
  rollback
  exit 1  # the deploy still failed, so let CI see a non-zero status
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have a single script that deploys, validates, and self‑heals.&lt;/p&gt;
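&lt;p&gt;A service often needs a few seconds after a restart before it answers, so a single probe can trigger a needless rollback. A possible refinement is to retry the check a few times first. The &lt;code&gt;retry&lt;/code&gt; helper below is a hypothetical sketch, and the attempt count and delay are arbitrary example values.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds,
# giving up (status 1) after ATTEMPTS tries, sleeping DELAY seconds between tries.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Real use, with the URL from the example above:
#   retry 5 3 curl -fsS -o /dev/null "https://myapp.example.com/health"
# Stand-in command so the sketch runs without a network:
if retry 3 0 test -d /; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

&lt;p&gt;Because the command is passed as arguments, the same helper works for any check, not only &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;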


&lt;h3&gt;
  
  
  5. Log rotation and cleanup that actually runs
&lt;/h3&gt;

&lt;p&gt;Not every environment needs a full‑blown logging stack; Bash plus cron still works well.&lt;br&gt;
Compress older logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

LOG_DIR="/var/log/myapp"
DAYS_TO_KEEP=7

find "$LOG_DIR" -type f -name "*.log" -mtime +$DAYS_TO_KEEP -print0 \
  | while IFS= read -r -d '' file; do
      gzip "$file"
    done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove stale archives:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;find "$LOG_DIR" -type f -name "*.gz" -mtime +30 -delete&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Schedule it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;30 1 * * * /usr/local/bin/log_cleanup.sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s often enough to keep disks from filling up silently.&lt;/p&gt;
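&lt;p&gt;Before pointing a cleanup job at a real log directory, it can help to rehearse it on throwaway files. The sketch below assumes GNU &lt;code&gt;touch&lt;/code&gt; (for &lt;code&gt;-d "10 days ago"&lt;/code&gt;) and uses a temporary directory with made-up file names. It also uses &lt;code&gt;-exec gzip {} +&lt;/code&gt; as a shorter alternative to the &lt;code&gt;while read&lt;/code&gt; loop.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory standing in for /var/log/myapp
LOG_DIR="$(mktemp -d)"
DAYS_TO_KEEP=7

touch -d "10 days ago" "$LOG_DIR/old.log"   # backdated: should be compressed
touch "$LOG_DIR/new.log"                    # fresh: should be left alone

# Same selection as the cleanup script, applied to the test directory
find "$LOG_DIR" -type f -name "*.log" -mtime "+$DAYS_TO_KEEP" -exec gzip {} +

ls "$LOG_DIR"
```

&lt;p&gt;Note that &lt;code&gt;gzip&lt;/code&gt; keeps the original file's modification time, so the 30‑day rule for archives counts from the last write to the log, not from the day it was compressed.&lt;/p&gt;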


&lt;h3&gt;
  
  
  6. Lightweight monitoring and alert hooks
&lt;/h3&gt;

&lt;p&gt;You can wrap traditional Linux tools in Bash and push alerts to Slack, email, or a webhook endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

CPU_THRESHOLD=80
MEM_THRESHOLD=80

cpu_usage() {
  # %idle is the last column of the Average line; round so (( )) gets an integer
  mpstat 1 1 | awk '/Average/ {printf("%.0f", 100 - $NF)}'
}

mem_usage() {
  free | awk '/Mem:/ {printf("%.0f", $3/$2 * 100)}'
}

CPU=$(cpu_usage)
MEM=$(mem_usage)

if (( CPU &amp;gt; CPU_THRESHOLD || MEM &amp;gt; MEM_THRESHOLD )); then
  echo "High usage detected: CPU=${CPU}% MEM=${MEM}%"
  # TODO: send to Slack / email / webhook here
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a great complement to Prometheus, Grafana, or hosted solutions.&lt;/p&gt;
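&lt;p&gt;The TODO in the watchdog could be filled in along these lines. This is a sketch under assumptions: &lt;code&gt;WEBHOOK_URL&lt;/code&gt;, &lt;code&gt;alert_payload&lt;/code&gt;, and &lt;code&gt;send_alert&lt;/code&gt; are hypothetical names, and the payload uses the simple &lt;code&gt;{"text": ...}&lt;/code&gt; shape that Slack incoming webhooks accept. When no URL is configured it only prints the payload.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical setting: export WEBHOOK_URL to enable real delivery.
WEBHOOK_URL="${WEBHOOK_URL:-}"

alert_payload() {
  # %d accepts integers only, so the values need no JSON escaping.
  printf '{"text":"High usage detected: CPU=%d%% MEM=%d%%"}' "$1" "$2"
}

send_alert() {
  local payload
  payload="$(alert_payload "$1" "$2")"
  if [ -z "$WEBHOOK_URL" ]; then
    echo "dry run: $payload"
    return 0
  fi
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$payload" "$WEBHOOK_URL" >/dev/null
}

send_alert 91 72
```

&lt;p&gt;The dry‑run branch makes the script safe to test from cron or CI before a real endpoint is wired in.&lt;/p&gt;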


&lt;h3&gt;
  
  
  7. Safer configuration changes
&lt;/h3&gt;

&lt;p&gt;Always pair config edits with backups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONFIG="/etc/myapp/config.yaml"
BACKUP="/etc/myapp/config.yaml.$(date +'%F-%H%M%S').bak"

cp "$CONFIG" "$BACKUP"

sed -i 's/feature_x: false/feature_x: true/' "$CONFIG"

systemctl restart myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern makes it trivial to revert a bad change during an incident.&lt;/p&gt;
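&lt;p&gt;One caveat: &lt;code&gt;sed&lt;/code&gt; exits with status 0 even when the pattern matched nothing, so it is worth confirming the edit before restarting anything. The sketch below rehearses the backup, edit, verify, and restore steps on a temporary file. The file contents are made up for the demo, and &lt;code&gt;sed -i&lt;/code&gt; is used in its GNU form.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# Temporary file standing in for /etc/myapp/config.yaml
CONFIG="$(mktemp)"
printf 'feature_x: false\n' > "$CONFIG"
BACKUP="$CONFIG.$(date +'%F-%H%M%S').bak"

cp "$CONFIG" "$BACKUP"
sed -i 's/feature_x: false/feature_x: true/' "$CONFIG"

# Verify the edit really happened; otherwise put the backup back.
if grep -q '^feature_x: true$' "$CONFIG"; then
  echo "change applied"
else
  cp "$BACKUP" "$CONFIG"
  echo "change not applied; restored backup"
  exit 1
fi
```

&lt;p&gt;In the real script, the service restart would go inside the success branch so that a failed edit never triggers a restart.&lt;/p&gt;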




&lt;p&gt;If you adapt any of these snippets into your own tooling, drop a comment with your variations; other engineers will benefit from seeing real‑world tweaks. Also, tell me which part you’d like to see expanded: deployments, monitoring, or incident tooling.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bash</category>
      <category>linux</category>
      <category>automation</category>
    </item>
    <item>
      <title>🚀 Building an AI-Powered Stock Trading Bot in Python (With Backtesting)</title>
      <dc:creator>Sajja Sudhakararao</dc:creator>
      <pubDate>Sun, 04 Jan 2026 21:37:13 +0000</pubDate>
      <link>https://dev.to/sajjasudhakararao/building-an-ai-powered-stock-trading-bot-in-python-with-backtesting-19f6</link>
      <guid>https://dev.to/sajjasudhakararao/building-an-ai-powered-stock-trading-bot-in-python-with-backtesting-19f6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;From prediction to execution — a practical guide for engineers&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;📌 Introduction&lt;/strong&gt;&lt;br&gt;
Algorithmic trading is no longer reserved for hedge funds. With Python, open APIs, and modern AI models, individual engineers can build intelligent stock trading bots that analyze data, predict price movement, backtest strategies, and automate trades.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk you through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing an AI agent for stock price prediction&lt;/li&gt;
&lt;li&gt;Converting predictions into trading decisions&lt;/li&gt;
&lt;li&gt;Backtesting the strategy on historical data&lt;/li&gt;
&lt;li&gt;Preparing the system for real-world deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide is hands-on, practical, and written for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software / DevOps engineers&lt;/li&gt;
&lt;li&gt;Python developers&lt;/li&gt;
&lt;li&gt;Anyone curious about AI in finance&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;strong&gt;🧠 What Is an AI Trading Agent?&lt;/strong&gt;&lt;br&gt;
An AI trading agent is a system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observes the market (historical &amp;amp; live data)&lt;/li&gt;
&lt;li&gt;Learns patterns using machine learning&lt;/li&gt;
&lt;li&gt;Makes decisions (Buy / Sell / Hold)&lt;/li&gt;
&lt;li&gt;Executes trades automatically&lt;/li&gt;
&lt;li&gt;Improves through evaluation and backtesting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Core Components&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Component          | Purpose                                     |
| ------------------ | ------------------------------------------- |
| Data Source        | Market prices (Yahoo Finance, Alpaca, etc.) |
| AI Model           | Predict future price movement               |
| Strategy Engine    | Convert predictions into actions            |
| Backtesting Engine | Validate strategy on past data              |
| Broker API         | Execute trades                              |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;🏗️ System Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Market Data → AI Model → Trading Strategy → Backtesting → Broker API&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This separation keeps the system modular, testable, and scalable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📊 Step 1: Fetching Stock Market Data&lt;/strong&gt;&lt;br&gt;
We’ll use Yahoo Finance for historical prices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
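&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import yfinance as yf
import pandas as pd

ticker = "AAPL"
df = yf.download(ticker, start="2020-01-01", end="2024-01-01")

# Newer yfinance releases may return MultiIndex columns; flatten them if so
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)

# Keep only the closing price and drop missing rows (no in-place edit of a slice)
df = df[["Close"]].dropna()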
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us clean, daily closing prices — perfect for modeling.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🤖 Step 2: AI Model (LSTM for Time-Series Prediction)&lt;/strong&gt;&lt;br&gt;
Stock prices are sequential data, so an LSTM (Long Short-Term Memory) network is a reasonable model to start with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why LSTM?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learns temporal patterns across a window of past prices&lt;/li&gt;
&lt;li&gt;Can model nonlinear relationships that simple linear regression cannot&lt;/li&gt;
&lt;li&gt;Widely used in time-series forecasting research, including finance&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Prepare Data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
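# Scale prices to [0, 1]. Caveat: fitting on the full history leaks future
# information; for a fair evaluation, fit the scaler on a training split only.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

def create_sequences(data, window=50):
    """Return (samples, window, 1) inputs and the value that follows each window."""
    X, y = [], []
    for i in range(len(data) - window):
        X.append(data[i:i+window])
        y.append(data[i+window])
    return np.array(X), np.array(y)

X, y = create_sequences(scaled)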
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;🧪 Step 3: Training the Model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(50,1)),
    Dropout(0.2),
    LSTM(50),
    Dense(1)
])

model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model predicts the next day's closing price (in scaled units) from the previous 50 closes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📈 Step 4: Designing the Trading Strategy&lt;/strong&gt;&lt;br&gt;
Predictions alone are useless without rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Strategy Logic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Condition                            | Action |
| ------------------------------------ | ------ |
| Predicted price &amp;gt; current price + 2% | BUY    |
| Predicted price &amp;lt; current price - 2% | SELL   |
| Otherwise                            | HOLD   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 2% band ignores small predicted moves, which cuts down on over-trading and on reacting to noise.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🔁 Step 5: Backtesting the Strategy&lt;/strong&gt;&lt;br&gt;
Backtesting answers one question:&lt;/p&gt;

&lt;p&gt;“Would this strategy have worked in the past?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backtesting Engine&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def backtest(df, model, initial_cash=10000, window=50):
    """Trade one share at a time on the model's next-day prediction."""
    cash = initial_cash
    position = 0
    trades = []

    for i in range(window, len(df)):
        # Days i-window .. i-1 are known; the model predicts the close of day i
        history = df.iloc[i - window:i]
        X = scaler.transform(history).reshape(1, window, 1)
        predicted = scaler.inverse_transform(model.predict(X, verbose=0))[0][0]
        # Decide against the latest known close, not the day being predicted
        price = float(df.iloc[i - 1, 0])

        if predicted &amp;gt; price * 1.02 and cash &amp;gt;= price:
            cash -= price
            position += 1
            trades.append(("BUY", price))
        elif predicted &amp;lt; price * 0.98 and position &amp;gt; 0:
            cash += price
            position -= 1
            trades.append(("SELL", price))

    final_value = cash + position * float(df.iloc[-1, 0])
    return final_value, trades
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;📊 Step 6: Performance Evaluation&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;AI Strategy vs Buy &amp;amp; Hold&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_value, trades = backtest(df, model)
# Baseline: put the same starting cash into the stock on day one and hold
buy_hold = 10000 * float(df.iloc[-1, 0]) / float(df.iloc[0, 0])

print(f"AI Strategy: {final_value:.2f}")
print(f"Buy &amp;amp; Hold:  {buy_hold:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



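&lt;p&gt;This comparison shows whether the strategy beats simply holding the stock. Because the model was trained on the same period it is tested on, treat the result as optimistic; a fairer test trains on earlier data and backtests on later, unseen data.&lt;/p&gt;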




&lt;p&gt;&lt;strong&gt;⚠️ Important Risk Considerations&lt;/strong&gt;&lt;br&gt;
AI trading is not magic.&lt;/p&gt;

&lt;p&gt;Be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overfitting&lt;/li&gt;
&lt;li&gt;Market regime changes&lt;/li&gt;
&lt;li&gt;Latency in real trading&lt;/li&gt;
&lt;li&gt;Slippage and transaction fees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Never deploy without:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backtesting&lt;/li&gt;
&lt;li&gt;Paper trading&lt;/li&gt;
&lt;li&gt;Risk limits&lt;/li&gt;
&lt;li&gt;Stop-loss rules&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🚀 Production Readiness Checklist&lt;/strong&gt;&lt;br&gt;
Before going live:&lt;/p&gt;

&lt;p&gt;✅ Paper trading (Alpaca)&lt;br&gt;
✅ Daily trade limits&lt;br&gt;
✅ Stop-loss &amp;amp; take-profit&lt;br&gt;
✅ Logging &amp;amp; monitoring&lt;br&gt;
✅ Model retraining strategy&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🧩 Where This Can Go Next&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reinforcement Learning (RL)&lt;/li&gt;
&lt;li&gt;Multi-stock portfolio optimization&lt;/li&gt;
&lt;li&gt;Sentiment analysis (news + social)&lt;/li&gt;
&lt;li&gt;Kubernetes-based trading microservices&lt;/li&gt;
&lt;li&gt;Fully autonomous AI agents&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🧠 Final Thoughts&lt;/strong&gt;&lt;br&gt;
AI-powered trading bots are an excellent real-world application of machine learning, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data engineering&lt;/li&gt;
&lt;li&gt;AI modeling&lt;/li&gt;
&lt;li&gt;System design&lt;/li&gt;
&lt;li&gt;Financial reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if you never trade real money, building one will level up your skills dramatically.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📣 Disclaimer&lt;/strong&gt;&lt;br&gt;
This article is for educational purposes only.&lt;br&gt;
It is not financial advice.&lt;br&gt;
Always understand the risks before trading.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;✍️ About the Author&lt;/strong&gt;&lt;br&gt;
I’m a DevOps Engineer exploring the intersection of AI, automation, and real-world systems.&lt;br&gt;
I write about practical AI, engineering, and building systems that actually work.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
