DEV Community: Sajja Sudhakararao

🚀 Building an AI Incident Copilot: How I Automated the First 15 Minutes of Every Production Incident

Sajja Sudhakararao — Wed, 29 Apr 2026 17:15:12 +0000

Every production incident follows the same painful ritual.

An alert fires at 2am. An engineer wakes up, SSHes into a server, and begins the manual loop: pulling logs, scanning for errors, guessing what to check next. This loop can take 15 to 45 minutes before the real diagnosis even begins. Multiply that by every incident across every team in your organisation, and you have thousands of engineering hours lost every year to work that is repetitive, stressful, and largely automatable.

I've been on that on-call rotation. I know what it costs — not just in time, but in cognitive load, in missed context, and in the compounding pressure of an active incident. So I built incopilot: a CLI tool that automates the entire first-pass triage so engineers can skip straight to actual problem-solving.

This post walks through the architecture, the design decisions, and exactly how to build it yourself. Everything is open source at https://github.com/AutoShiftOps/incopilot.

Project structure

incopilot/
  __init__.py
  cli.py          # argument parsing + console output
  collectors.py   # journalctl, docker logs, file, bundle
  analyzer.py     # pattern detection + line normalization
  reporter.py     # report.md / report.json generation
  config.py       # patterns, golden-signal map, safe-command list
scripts/
  demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md
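
To make the "pattern detection + line normalization" role of analyzer.py more concrete, here is a minimal sketch of the general idea. It is not the actual incopilot code: the rule list, the placeholder names, and the functions `normalize` and `top_patterns` are assumptions made for illustration. Volatile tokens (timestamps, IPs, numbers) are replaced with placeholders so repeated errors collapse into one countable pattern.

```python
import re
from collections import Counter

# Hypothetical normalization rules (not taken from incopilot): order matters,
# timestamps and IPs must be replaced before the generic number rule runs.
_RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\b"), "<ts>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<n>"),
]


def normalize(line: str) -> str:
    """Replace volatile tokens so similar log lines become identical."""
    for pattern, placeholder in _RULES:
        line = pattern.sub(placeholder, line)
    return line.strip()


def top_patterns(lines, limit=5):
    """Return the most frequent normalized lines as (pattern, count) pairs."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        "2026-04-29 02:00:01 ERROR upstream 10.0.0.5 timed out after 30 s",
        "2026-04-29 02:00:07 ERROR upstream 10.0.0.9 timed out after 31 s",
        "2026-04-29 02:00:09 INFO started worker 7",
    ]
    for pattern, count in top_patterns(sample):
        print(count, pattern)
```

The two ERROR lines differ only in timestamp, IP, and duration, so they count as one pattern seen twice.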

Setup

git clone https://github.com/AutoShiftOps/incopilot.git
cd incopilot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick test (no real services needed)

python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log
ls out/

Systemd journal triage

python -m incopilot journal --unit nginx --since "30 min ago"

Docker triage

python -m incopilot docker --container my-api --since 1h

Both sources (bundle)

python -m incopilot bundle \
  --unit nginx \
  --container my-api \
  --since-journal "30 min ago" \
  --since-docker 1h

What you get

out/report.md — paste into your incident doc

out/report.json — attach to a ticket or POST to a webhook

What to improve next

Per-service pattern packs (nginx, postgres, java, node)
Slack/Teams webhook posting (--webhook <url>)
Unit tests + GitHub Actions CI
Scheduled timer (systemd timer unit) for proactive reports

Sudhakar Sajja is an Application Architect at TechMahindra with 13 years of experience across protocol testing, SDET, DevOps, and cloud architecture. He specialises in AI-powered DevOps operations — building tools that use LLMs to replace manual incident response and query diagnostics. He writes weekly at AutoShiftOps (autoshiftops.com) and built QueryTuner (querytuner.com), an AI-driven SQL query analysis tool. Based in Mississauga, Canada.

Build an Alert Decision Layer CLI in Python

Sajja Sudhakararao — Sun, 19 Apr 2026 01:53:24 +0000

We talk a lot about alerting, but not enough about deciding.

This weekend project builds a small Alert Decision Layer as a Python CLI called alertdecider:

Input: alerts JSON (think Alertmanager or PagerDuty export).
Engine: a clear rule set that considers severity, environment, service tier, and flapping history.
Output: Markdown + JSON with decisions (page, ticket, aggregate, suppress) and reasons.

If you liked project-based posts like "AI trading bot in Python" or "Self-healing containers with Bash", this sits in the same category: you end up with a tool you can run and extend.

What you'll build

By the end of this tutorial you’ll have:

A Python package alertdecider-agent/ with:
- Dataclasses for Alert, ServiceProfile, History.
- A rule-based AlertDecisionEngine.
- A CLI entry point.
An examples/ folder with sample alerts, services, and alert history.
A command you can run locally to triage alerts and generate a report.

Prerequisites

Python 3.10+
Basic familiarity with JSON/YAML
A terminal where you can run python -m ...

1. Clone and set up the project

git clone https://github.com/AutoShiftOps/alertdecider.git
cd alertdecider
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The requirements.txt is intentionally small:

rich==13.9.4
PyYAML==6.0.2

2. Model the domain: alerts, services, history

In alertdecider-agent/models.py we define three dataclasses:

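from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str        # stable identity used to look up history
    raw: Dict[str, Any]     # original payload, kept for debugging

@dataclass
class ServiceProfile:
    name: str
    tier: str
    slo_critical: bool
    owner: str | None = None    # "X | None" syntax needs Python 3.10+

@dataclass
class History:
    fingerprint: str
    count_24h: int
    last_status: str | None = None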

This gives us a normalized view of alerts and some context we can use to make better decisions.

3. Load alerts, services, and history

alertdecider-agent/loader.py contains helpers to turn raw files into those models.

load_alerts(path) reads alerts.json and extracts labels like alertname, service, severity, env, and fingerprint.
load_services(path) reads services.yml and builds ServiceProfile objects.
load_history(path) reads history.json and tracks how many times each fingerprint fired in the last 24h.
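
As a rough illustration of what a helper like load_alerts might do, here is a minimal sketch. It assumes the input looks like an Alertmanager webhook payload (a top-level "alerts" list whose items carry "labels", "startsAt", and "fingerprint"); the real loader.py may differ, and `parse_alerts` plus the fallback values are names chosen for this example only.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]


def parse_alerts(payload: Dict[str, Any], source: str = "alertmanager") -> List[Alert]:
    """Map each item under the (assumed) top-level "alerts" key to an Alert."""
    alerts = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        fingerprint = item.get("fingerprint", "")
        alerts.append(
            Alert(
                # Fall back to the alert name when no fingerprint is present
                id=fingerprint or labels.get("alertname", "unknown"),
                source=source,
                name=labels.get("alertname", "unknown"),
                service=labels.get("service", "unknown"),
                severity=labels.get("severity", "info"),
                environment=labels.get("env", "unknown"),
                starts_at=item.get("startsAt", ""),
                fingerprint=fingerprint,
                raw=item,
            )
        )
    return alerts


def load_alerts(path: str) -> List[Alert]:
    """Read a JSON file from disk and normalize it into Alert objects."""
    with open(path, encoding="utf-8") as fh:
        return parse_alerts(json.load(fh))
```

Splitting parsing from file I/O keeps the mapping easy to unit test with plain dictionaries.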

Example services.yml:

services:
  checkout-api:
    tier: tier1
    slo_critical: true
    owner: team-checkout
  notification-service:
    tier: tier1
    slo_critical: false
    owner: team-notify
  internal-reporting:
    tier: tier2
    slo_critical: false
    owner: team-data

4. Implement the AlertDecisionEngine

Now the interesting part: turn alerts + context into decisions.

In alertdecider-agent/engine.py we implement AlertDecisionEngine with a few rules. Conceptually:

from typing import List

# Alert comes from models.py. Decision is used here as a small record of
# (alert, action, reason), matching how it is constructed below.

class AlertDecisionEngine:
    def __init__(self, services, history):
        self.services = services    # service name -> ServiceProfile
        self.history = history      # fingerprint -> History

    def decide(self, alerts: List[Alert]) -> List[Decision]:
        return [self._decide_one(a) for a in alerts]

    def _decide_one(self, alert: Alert) -> Decision:
        # Rules are evaluated top to bottom; the first match wins.
        service_profile = self.services.get(alert.service)
        hist = self.history.get(alert.fingerprint)
        sev = alert.severity.lower()

        # 1) Suppress noisy low-severity alerts in non-prod
        if sev in {"info", "debug"} and alert.environment not in {"prod", "production"}:
            return Decision(alert, "suppress", "low-severity alert in non-prod environment")

        # 2) Page for tier1, slo-critical services on critical alerts
        if (sev == "critical" and service_profile and
                service_profile.tier == "tier1" and service_profile.slo_critical):
            return Decision(alert, "page", "critical alert on tier1 slo-critical service")

        # 3) Aggregate flapping alerts (lots of repeats in 24h)
        if hist and hist.count_24h >= 20:
            return Decision(alert, "aggregate", "alert fingerprint is flapping/noisy")

        # 4) Warnings on tier1 services become tickets (in any environment)
        if sev == "warning" and service_profile and service_profile.tier == "tier1":
            return Decision(alert, "ticket", "warning on tier1 service; track as ticket")

        # 5) Default: ticket for prod, suppress for non-prod
        if alert.environment in {"prod", "production"}:
            return Decision(alert, "ticket", "prod alert without more specific rule")

        return Decision(alert, "suppress", "non-prod alert without more specific rule")

These rules are not perfect – they’re a starting point you can tweak.

The key is that they’re explicit. Anyone on your team can read, discuss, and change them.

5. Wire up the CLI

alertdecider-agent/cli.py glues everything together:

import argparse

# load_*, AlertDecisionEngine, write_reports and render_console come from
# the package's other modules.
ap = argparse.ArgumentParser(
    prog="alertdecider-agent",
    description="Alert Decision Layer CLI",
)
ap.add_argument("--alerts", required=True)
ap.add_argument("--services", default="")
ap.add_argument("--history", default="")
ap.add_argument("--out-dir", default="out")

args = ap.parse_args()

# Load inputs and normalize them into the dataclasses from step 2
alerts = load_alerts(args.alerts)
services = load_services(args.services)
history = load_history(args.history)

engine = AlertDecisionEngine(services, history)
decisions = engine.decide(alerts)

# Persist Markdown + JSON reports, then print the summary table
write_reports(args.out_dir, decisions)
render_console(decisions)

We also have __main__.py so you can use python -m alertdecider-agent.

6. Run it with sample data

The examples/ folder contains a simple dataset:

python -m alertdecider-agent \
  --alerts examples/alerts.json \
  --services examples/services.yml \
  --history examples/history.json \
  --out-dir out

Check the CLI table output.
Open out/decision_report.md to see the human-friendly report.
Open out/decision_report.json if you want to wire this into another tool.

Try changing severity, env, or service tier in the examples and see how decisions change.

7. Make it yours

Here are some ideas for adapting this to your environment:

Add time-of-day logic (e.g., don’t page at 03:00 for non-critical stuff).
Add SLO signals (e.g., error budget burn rate) into the decision rules.
Replace history.json with a real datastore of past alerts.
Call alertdecider from your Alertmanager/PagerDuty webhook pipeline.
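
As one example of the ideas above, time-of-day logic can be a small post-processing step applied to a decision. This is only a sketch under assumptions: the quiet-hours window (22:00 to 07:00), the function names, and the choice to downgrade to a ticket are all illustrative, not part of alertdecider.

```python
from datetime import datetime, time

# Assumed quiet-hours window; it wraps past midnight.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(now: datetime) -> bool:
    """True when `now` falls between QUIET_START and QUIET_END."""
    current = now.time()
    return current >= QUIET_START or current < QUIET_END


def apply_quiet_hours(action: str, severity: str, now: datetime) -> str:
    """Downgrade non-critical pages to tickets during quiet hours."""
    if action == "page" and severity.lower() != "critical" and in_quiet_hours(now):
        return "ticket"
    return action


if __name__ == "__main__":
    three_am = datetime(2026, 4, 19, 3, 0)
    print(apply_quiet_hours("page", "warning", three_am))
    print(apply_quiet_hours("page", "critical", three_am))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the rule deterministic and easy to test.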

You now have a small, understandable alert decision layer you can evolve – and a much better place to plug AI into in the future.

If you build on this, drop a link – I’d love to see different rule sets and architectures.

Build an AI Incident Copilot CLI in Python

Sajja Sudhakararao — Sun, 12 Apr 2026 04:46:54 +0000

When an incident fires, you don't need more dashboards.
You need answers, fast.

This post is a build-a-tool weekend project: a Python CLI that collects logs from systemd and Docker, highlights repeating patterns, maps them to the Golden Signals, and generates a ready-to-use incident report.

Project files

incopilot/
  cli.py         collectors.py
  analyzer.py    reporter.py    config.py
scripts/
  demo_generate_sample_logs.py
requirements.txt    pyproject.toml    README.md

Setup

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Quick demo (no real services)

python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log

Systemd triage

python -m incopilot journal --unit nginx --since "30 min ago"

Docker triage

python -m incopilot docker --container my-api --since 1h

Both (bundle)

python -m incopilot bundle --unit nginx --container my-api --since-journal "30 min ago" --since-docker 1h

Outputs

out/report.md — human-friendly
out/report.json — machine-friendly


💬 What's your go-to first command when an incident fires?
Drop it in the comments — I'll add the best ones to the safe-commands list.

Self-Healing Docker: Bash Script That Auto-Restarts Containers

Sajja Sudhakararao — Sun, 22 Feb 2026 23:51:18 +0000

Manual restarts during incidents are reactive. Self-healing means your containers recover themselves between alerts.

This post shows how to build a lightweight bash watchdog that:

Monitors container health via Docker health checks
Restarts unhealthy containers
Integrates with systemd for daemon-like behavior
Logs everything for incident review

The Self-Healing Architecture

Docker has built-in restart policies (--restart unless-stopped), but they don’t respect health checks. A container can be "running" but unhealthy (app crashed, dependencies down, etc.).

Our script loops every 30 seconds:

Query Docker API for container health
Restart unhealthy containers
Log actions to systemd journal
Repeat

Step 1: Docker Health Checks (the foundation)

First, ensure your containers have proper health checks in docker-compose.yml or docker run:

services:
  web:
    image: nginx
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

This marks containers healthy, unhealthy, or starting.

Step 2: The Self-Healing Watchdog Script

Save this as /usr/local/bin/docker-autoheal.sh:

#!/usr/bin/env bash
set -euo pipefail

# Config
CHECK_INTERVAL=30
LOG_FILE="/var/log/docker-autoheal.log"
CONTAINER_LABEL="autoheal=true"  # Label your containers

log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
}

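heal_containers() {
  local unhealthy=()
  # mapfile keeps one container name per array element and copes with an empty result
  mapfile -t unhealthy < <(docker ps --filter "health=unhealthy" --filter "label=$CONTAINER_LABEL" --format "{{.Names}}")

  # Bash older than 4.4 treats an empty array as unset under `set -u`
  (( ${#unhealthy[@]} )) || return 0

  for container in "${unhealthy[@]}"; do
    log "RESTARTING UNHEALTHY CONTAINER: $container"

    # Graceful stop: SIGTERM, then SIGKILL once the 10s grace period expires
    docker stop -t 10 "$container" || true

    # Fallback hard kill in case the stop itself failed
    docker kill "$container" 2>/dev/null || true

    # Start again with the container's original configuration
    if docker start "$container"; then
      log "RESTARTED: $container"
    else
      log "FAILED TO START: $container"
    fi
  done
}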

# Main loop
log "Docker Auto-Heal started (check interval: ${CHECK_INTERVAL}s)"
while true; do
  heal_containers
  sleep "$CHECK_INTERVAL"
done

Step 3: Run as a Systemd Service (production-ready)

Create /etc/systemd/system/docker-autoheal.service:

[Unit]
Description=Docker Container Auto-Healer
After=docker.service
Requires=docker.service

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/docker-autoheal.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable docker-autoheal.service
sudo systemctl start docker-autoheal.service

Check status:

sudo systemctl status docker-autoheal.service
sudo journalctl -u docker-autoheal.service -f

Step 4: Label Your Containers

Tag containers you want to auto-heal:

docker run -d \
  --label "autoheal=true" \
  --restart unless-stopped \
  --health-cmd="curl -f http://localhost:8080/health || exit 1" \
  nginx

Step 5: Advanced Features

Grace Period (avoid restart loops)

# Add to script before restart
restarts=$(docker inspect --format '{{.RestartCount}}' "$container")
# Compare numerically; a text match on "10" would also hit 100 and miss 11
if (( restarts >= 10 )); then
  log "SKIPPING: $container restart count too high (restart loop?)"
  continue
fi
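
Another way to think about the restart-loop guard is as a sliding-window budget: allow at most N restarts per container within a time window. The sketch below shows that idea in Python rather than Bash so it is easy to test in isolation; the class name `RestartBudget`, the default limits, and the injectable clock are assumptions for illustration, not part of the watchdog script.

```python
import time
from collections import defaultdict, deque


class RestartBudget:
    """Allow at most `max_restarts` per container within `window_seconds`."""

    def __init__(self, max_restarts=3, window_seconds=600, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._events = defaultdict(deque)

    def allow(self, container: str) -> bool:
        """Record a restart and return True, or return False if over budget."""
        now = self._clock()
        events = self._events[container]
        # Drop restarts that have aged out of the window
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        if len(events) >= self.max_restarts:
            return False
        events.append(now)
        return True


if __name__ == "__main__":
    budget = RestartBudget(max_restarts=2, window_seconds=60)
    for attempt in range(3):
        print(attempt, budget.allow("web"))
```

A watchdog would call `allow(name)` before restarting and skip (and alert) when it returns False.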

Webhook Alerts

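# Send JSON with an explicit content type; don't let a failed webhook stop the watchdog
curl -X POST -H "Content-Type: application/json" -d "{\"container\":\"$container\",\"action\":\"restart\"}" "$WEBHOOK_URL" || true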

Multi-Host (Docker Swarm)

Use labels + a central orchestrator or run the script on each host.

Testing It (safely)

Spin up a container with a failing health check:

docker run -d \
  --name test-fail \
  --label "autoheal=true" \
  --health-cmd="exit 1" \
  alpine sleep infinity

Watch it get restarted:

journalctl -u docker-autoheal.service -f

Limitations + When to Upgrade

This works great for:

Small deployments / homelabs
Edge services / single-host apps
Dev/staging environments

Upgrade to:

Kubernetes: liveness/readiness probes + pod disruption budgets
Docker Swarm: service replicas + constraints
Nomad: health checks + restart stanzas

Summary Table (Copy/Paste)

| Component        | Command                             | Purpose                  |
| ---------------- | ----------------------------------- | ------------------------ |
| Health Check     | docker ps --filter health=unhealthy | Find broken containers   |
| Watchdog         | systemctl status docker-autoheal    | Confirm service running  |
| Logs             | journalctl -u docker-autoheal       | Review restart history   |
| Container Labels | --label autoheal=true               | Target specific services |

The Container Troubleshooting Playbook: OOMs, CPU, and I/O

Sajja Sudhakararao — Sun, 15 Feb 2026 13:12:21 +0000

When a container fails in production, you don’t always have time to browse StackOverflow. You need a checklist.

This post is a field guide for the three most common container "murders": Memory (OOMKilled), CPU Throttling, and I/O Saturation. We’ll diagnose each using the docker stats + Linux host tools workflow we established last week.

Scenario 1: The "Silent" Death (OOMKilled)

Symptom: The container restarts randomly. No error logs in the application output because it was killed instantly by the kernel.

1. Confirm it was an OOM Kill

Docker knows why the container died. Ask it:

docker inspect <container> --format '{{.State.OOMKilled}}'
# Output: true

Or check the specific exit code (137 = 128 + 9 SIGKILL):

docker inspect <container> --format '{{.State.ExitCode}}'
# Output: 137
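
The "128 + signal number" arithmetic can be decoded programmatically. This is a small sketch, assuming a Linux host (signal numbers are platform-specific); the function name `describe_exit_code` is made up for the example.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable cause."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # By convention, 128 + N means the process was terminated by signal N
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"application error (exit {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, describe_exit_code(code))
```

Note that 137 only tells you the process received SIGKILL; the OOMKilled flag above is what confirms the kernel's OOM killer was responsible.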

2. Find the "Smoking Gun" in Kernel Logs

If Docker confirms it, see exactly when the kernel snapped. Run this on the host:

dmesg -T | grep -i "killed process"

You’ll see a line like: Out of memory: Killed process 1234 (node) total-vm:2048kB, anon-rss:1024kB.
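
If you want to pull the victim process out of that kernel line in a script, a regular expression is usually enough. The sketch below assumes the line has the shape shown above; kernel versions vary in the exact fields they print, so treat the pattern and the function name `parse_oom_line` as illustrative.

```python
import re
from typing import Dict, Optional, Union

# Matches the "Killed process <pid> (<comm>) ... anon-rss:<n>kB" shape
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB",
    re.IGNORECASE,
)


def parse_oom_line(line: str) -> Optional[Dict[str, Union[int, str, float]]]:
    """Return pid, command name and anonymous RSS in MiB, or None if no match."""
    match = OOM_RE.search(line)
    if not match:
        return None
    return {
        "pid": int(match["pid"]),
        "comm": match["comm"],
        "anon_rss_mib": int(match["anon_rss_kb"]) / 1024,
    }


if __name__ == "__main__":
    sample = ("Out of memory: Killed process 1234 (node) "
              "total-vm:2048kB, anon-rss:1024kB")
    print(parse_oom_line(sample))
```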

3. The Fix

Immediate: Bump the memory limit if the host has capacity.
```
docker update --memory 2g <container>
```
Root Cause: Check your application for memory leaks. If it’s Java, check the heap settings (-Xmx). If it’s Node, check the GC behavior.

Scenario 2: The "Slow" Death (CPU Throttling)

Symptom: App is running but incredibly slow. Latency spikes. Health checks time out.

1. Check if it’s throttling

Linux cgroups enforce CPU limits by "pausing" your process when it uses its quota. It doesn’t kill the app; it just freezes it for milliseconds at a time.

Check docker stats first:

docker stats --no-stream

If CPU % is consistently near 100% of your configured limit (e.g., if you gave it 0.5 CPUs and it’s at 50%), you are being throttled.

2. Verify Throttling in cgroups

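Look at the raw cgroup metrics. The file is cpu.stat on both cgroup v1 and v2, but the path and one field name differ:

# Find the container ID
docker inspect <container> --format '{{.Id}}'

# Check throttle stats (path varies by distro and cgroup driver, commonly:)
cat /sys/fs/cgroup/cpu/docker/<long-id>/cpu.stat                  # cgroup v1
cat /sys/fs/cgroup/system.slice/docker-<long-id>.scope/cpu.stat   # cgroup v2 (systemd driver)

Look for nr_throttled and throttled_time (throttled_usec on cgroup v2). If these numbers are rising, your app is gasping for air.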
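
"Rising" is easiest to judge by sampling cpu.stat twice and comparing. The sketch below parses the key/value text and computes the share of scheduler periods in which the container was throttled between two samples. The function names are assumptions for this example; `nr_periods` and `nr_throttled` are the counters cpu.stat exposes when a CPU limit is set.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Fraction of enforcement periods that were throttled between samples."""
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    return throttled / periods if periods > 0 else 0.0


if __name__ == "__main__":
    first = parse_cpu_stat("nr_periods 100\nnr_throttled 10\nthrottled_time 5000\n")
    second = parse_cpu_stat("nr_periods 200\nnr_throttled 60\nthrottled_time 90000\n")
    print(f"throttled in {throttle_ratio(first, second):.0%} of periods")
```

In practice you would read the file, sleep a few seconds, read it again, and pass both snapshots to `throttle_ratio`.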

3. The Fix

Remove the limit temporarily to prove it’s the bottleneck.
```
docker update --cpus 0 <container>
```
Tune requests: If the app needs that CPU, increase the limit. If it’s a bug (infinite loop), profile the app.

Scenario 3: The "Gridlock" (Disk I/O Saturation)

Symptom: The container becomes unresponsive, docker ps hangs, or logs stop writing.

1. Identify the I/O Hog

Is it the container or the neighbor?

# Check host I/O
iostat -x 1 5

If %util is >80%, the disk is saturated.

2. Blame the Container

Use pidstat (part of sysstat) to find which process is thrashing the disk:

pidstat -d 1

Look for the PID with high kB_rd/s or kB_wr/s. Match that PID back to a container:

docker inspect --format '{{.State.Pid}}' <container>

3. The Fix

Limit the blast radius: Set a Block I/O limit on the greedy container so it doesn’t kill the host.
```
docker update --blkio-weight 100 <container>  # Low priority (default 500)
```
Move logs: Ensure your app isn’t logging debug data to the container’s JSON log driver (which writes to disk). Use a log shipper or write to stdout sparingly.

Bonus: Network Connectivity Issues

Symptom: "Connection refused" or timeouts between containers.

1. The "Can I reach it?" Check

Don't guess. Enter the container’s namespace:

docker exec -it <source-container> sh
# Inside:
ping <target-container-name>
nc -zv <target-container-name> <port>

2. If DNS fails

Docker has its own internal DNS. Check /etc/resolv.conf inside the container:

cat /etc/resolv.conf

It should point to Docker’s embedded DNS server (usually 127.0.0.11). If it’s missing or wrong, check your daemon config.

Summary Checklist (Copy/Paste)

| Symptom | Check Command | Fix Action |
| --- | --- | --- |
| Random Restarts | `docker inspect <container> --format '{{.State.OOMKilled}}'` | Increase RAM limit / Fix memory leak |
| Sluggish App | `cat /sys/fs/cgroup/cpu/docker/<id>/cpu.stat` (check `nr_throttled`) | Increase CPU limit / Profile app |
| Host Unresponsive | `iostat -x 1 5` & `pidstat -d 1` | Limit Block I/O weight / Reduce logging |
| Network Timeout | `docker exec <container> nc -zv <target> <port>` | Check Docker DNS / Verify network aliases |

Next Steps

Now that you can debug containers manually, how do you automate this? Next week, we’ll build a "Self-Healing" Bash Script that detects these states and alerts you automatically.

Which of these kills your containers most often? For me, it's always the silent OOM killer.

Docker Monitoring Without a Platform: docker stats + cgroups (DevOps)

Sajja Sudhakararao — Sat, 31 Jan 2026 21:40:31 +0000

When an incident hits a containerized service, you often don’t need a full observability stack to get traction. You need fast answers: Which container is hot? What resource is saturating? Is it an app problem or a limit problem?

This guide shows a practical monitoring stack you can run from any Docker host:

Docker-level commands (docker stats, docker inspect, docker logs)
Host Linux tools (ps/top/free/df/iostat/ss/journalctl)
Kernel primitives: cgroups (resource limits/accounting) and namespaces (isolation)

1) Start with docker stats (the fastest signal)

docker stats streams runtime metrics for containers, including CPU%, memory usage/limit, network I/O, and block I/O.

docker stats

Common workflows:

docker stats --no-stream          # Snapshot (good for scripts)
docker stats <container_name>     # Focus on one container

How to interpret it (in plain language)

CPU%: who’s burning compute right now.
MEM USAGE / LIMIT: how close you are to the memory ceiling.
NET I/O: traffic spikes, retries, or unusual egress.
BLOCK I/O: slow disks, chatty logging, or heavy read/write workloads.
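
For scripting, docker stats can emit one JSON object per container with `docker stats --no-stream --format '{{json .}}'`. The sketch below ranks containers from that output. It assumes each line carries `Name`, `CPUPerc`, and `MemPerc` fields as percentage strings; the helper names are made up for this example.

```python
import json
from typing import List, Tuple


def parse_percent(value: str) -> float:
    """Turn '12.50%' into 12.5; placeholder values such as '--' become 0.0."""
    try:
        return float(value.strip().rstrip("%"))
    except ValueError:
        return 0.0


def rank_containers(stats_output: str, key: str = "CPUPerc") -> List[Tuple[str, float]]:
    """Sort containers by the given percentage column, highest first."""
    rows = [json.loads(line) for line in stats_output.splitlines() if line.strip()]
    ranked = [(row["Name"], parse_percent(row[key])) for row in rows]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = (
        '{"Name":"web","CPUPerc":"12.50%","MemPerc":"40.00%"}\n'
        '{"Name":"db","CPUPerc":"85.10%","MemPerc":"70.25%"}\n'
    )
    print(rank_containers(sample))
    print(rank_containers(sample, key="MemPerc"))
```

To use it against a live host, capture the command's stdout (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the text to `rank_containers`.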

2) Jump from “container name” → “what is it?”

Once you identify a hot container, immediately gather identity + configuration.

docker ps
docker inspect <container> | less

Useful inspect questions:

What image/tag is running?
What env vars/config are set?
What ports and volumes are attached?
Are there memory/CPU limits configured?

3) Logs: confirm symptoms fast

docker logs --tail 200 <container>
docker logs -f <container>

This is often enough to spot:

crash loops
OOM errors / memory pressure
upstream timeouts
DB connection exhaustion

4) Understand why it’s happening: cgroups + namespaces (the mental model)

Docker relies on Linux kernel features:

Namespaces isolate views of processes, networking, mounts, etc.
cgroups control and account for resources like CPU, memory, and I/O.

Why this matters during incidents:

A container can be “slow” because it’s CPU-throttled, not because the app code suddenly got worse.
A container can restart because it hit its memory limit and the kernel’s OOM behavior targeted its processes.

5) Host-level confirmation (tie back to your Linux monitoring toolkit)

When docker stats shows a spike, verify on the host to avoid false conclusions.

CPU hogs

ps aux --sort=-%cpu | head -15

Memory pressure

free -h

Disk full / log explosions

df -h
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -10

Disk I/O saturation

iostat -x 1 3

Unexpected listeners / traffic patterns

ss -tuln

These host checks help you decide whether you’re dealing with a single container or a node-wide saturation problem.

6) What to do with the data (action mapping)

Use the shortest safe path to stability:

1. CPU high + latency rising

If CPU is legitimately needed: scale out / add capacity.
If CPU is throttled: revisit limits/requests (or container CPU shares).

2. Memory near limit

If memory leak suspected: restart as mitigation + open an issue with heap profiling.
If limit too low for normal peaks: adjust limit carefully and monitor.

3. Block I/O high

Check log volume and disk saturation; reduce noisy logs or move logs off disk.
Consider storage performance constraints and workload patterns.

4. Network I/O abnormal

Look for retries, timeouts, DDoS/abuse patterns, or upstream issues.

7) Copy/paste triage sequence (5 minutes)

# 1) Find the hot container
docker stats --no-stream

# 2) Identify it
docker ps
docker inspect <container> | less

# 3) Check symptoms
docker logs --tail 200 <container>

# 4) Confirm on host (avoid guessing)
ps aux --sort=-%cpu | head -10
free -h
df -h
iostat -x 1 3
ss -tuln

What’s your most common container failure mode: OOM kills, CPU throttling, disk I/O, or network timeouts?

Incident Response Runbook Template for DevOps

Sajja Sudhakararao — Sun, 18 Jan 2026 04:13:03 +0000

Incidents are stressful when the team is improvising. A simple runbook reduces MTTR by making response repeatable, not heroic.

This post provides a ready-to-use incident response runbook template plus a practical Linux triage checklist you can run from any box.

What this runbook optimizes for

Fast acknowledgement and clear ownership (Incident Commander + roles).
Early impact assessment and severity assignment to avoid under/over‑reacting.
Communication cadence and “known/unknown/next update” structure that builds trust.
Evidence capture (commands + logs) to support post‑incident review.

The incident runbook template

Copy this into your internal wiki, README, Notion, or ops repo.

1. Trigger

Triggers:

Monitoring alert / SLO breach
Customer report escalated
Internal detection (logs, latency spikes, error spikes)

2. Acknowledge (0–5 minutes)

Acknowledge page/alert in your paging system.
Create an incident channel: #inc-YYYYMMDD-service-shortdesc.
Assign Incident Commander (IC) and Comms Lead.
Start an incident document: timeline + links + decisions.

3. Assess severity (5–10 minutes)

Answer quickly:
- What’s impacted (service, region, feature)?
- How many users / revenue / compliance impact?
- Is impact ongoing and spreading?

Suggested severity:
- SEV1: Major outage / severe user impact; immediate coordination.
- SEV2: Partial outage / significant degradation; urgent but controlled.
- SEV3: Minor impact; can be handled async.

4. Stabilize first (10–30 minutes)

Goal: stop the bleeding before chasing root cause.

Typical mitigations:

Roll back the last deploy/config change.
Disable a feature flag.
Scale up/out temporarily.
Fail over if safe.
Rate-limit or block abusive traffic.

5. Triage checklist (host-level)

Run these to establish the baseline quickly (copy/paste friendly).

CPU

ps aux --sort=-%cpu | head -15

Alert cue: any process >50% sustained.

Memory

free -h

Alert cue: available <20% total RAM.

Disk

df -h
du -sh /var/log/* 2>/dev/null | sort -h | tail -10

Alert cue: any filesystem >90%.

Disk I/O

iostat -x 1 3

Alert cue: %util >80%, await >20ms.

Network listeners

ss -tuln

Alert cue: unexpected listeners/ports.

Logs (example: nginx)

journalctl -u nginx -f

Alert cue: 5xx errors spiking.

6. Comms cadence (keep it boring)

SEV1: updates every 10–15 minutes.
SEV2: updates every 30 minutes.
SEV3: async updates acceptable.

Use this structure:

What we know
What we don’t know
What we’re doing now
Next update at: TIME

7. Verify resolution

Confirm user impact is gone (synthetic checks + error rate + latency).
Confirm saturation is back to normal (CPU/memory/disk/I/O).
Watch for 30–60 minutes for regression.

8. Close and learn (post-incident)

Write a brief timeline (detection → mitigation → resolution).
Capture what worked, what didn’t, and what to automate.
Create follow-ups: alerts tuning, runbook updates, tests, guardrails.

Bonus: “Golden signals” lens for incidents

When you’re lost, anchor on the four golden signals:

Latency (are requests slower?)
Traffic (is demand abnormal?)
Errors (is failure rate rising?)
Saturation (are resources hitting limits?)

This keeps triage focused on user impact and system limits, not vanity metrics.
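
The same lens can be applied to log lines during triage by tagging each line with the signal it most likely relates to. The keyword lists below are assumptions chosen for illustration, not a standard mapping, and real services will need their own patterns.

```python
import re
from typing import List

# Hypothetical keyword map: golden signal -> pattern that hints at it
SIGNAL_PATTERNS = {
    "latency": re.compile(r"timed? ?out|timeout|slow|deadline exceeded", re.I),
    "traffic": re.compile(r"rate limit|too many requests|\b429\b", re.I),
    "errors": re.compile(r"\b5\d{2}\b|exception|traceback|\berror\b", re.I),
    "saturation": re.compile(
        r"out of memory|\boom|no space left|too many open files|connection pool",
        re.I,
    ),
}


def classify(line: str) -> List[str]:
    """Return every golden signal whose pattern matches the log line."""
    return [name for name, pattern in SIGNAL_PATTERNS.items() if pattern.search(line)]


if __name__ == "__main__":
    for line in (
        "upstream timed out while reading response",
        "GET /api 502 Bad Gateway",
        "write failed: No space left on device",
    ):
        print(classify(line), line)
```

A line can match more than one signal, which is often useful: a timeout that also logs an error points at both latency and errors.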

Download / reuse

If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.

Linux Monitoring & Alerting: Command-Line Mastery for DevOps

Sajja Sudhakararao — Sun, 11 Jan 2026 00:36:07 +0000

The Monitoring Gap Every DevOps Engineer Faces

Full monitoring stacks like Prometheus + Grafana are great, but they take time to set up. What about the servers you inherit? The staging environments? The emergency VM you spin up during an outage?

Command-line monitoring is your immediate, universal answer. These tools work on every Linux box, no agents required. Better yet, they're fast enough to script into alerting workflows.

This post covers the essential Linux monitoring commands plus patterns to turn raw metrics into actionable alerts—perfect follow-up to our Bash scripting guide.

1. Real-Time Resource Dashboards

The top/htop Foundation
top gives you an instant system snapshot:

top - 11:26:45 up 5 days,  3:12,  2 users,  load average: 1.23, 1.45, 1.67
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.3 us,  8.7 sy,  0.0 ni, 78.9 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  7900.2 total,  1234.5 free,  4567.8 used,  2097.9 buff/cache

Pro move: htop (install with apt install htop)

Mouse/keyboard navigation
Color-coded resource bars
Tree view of processes (F5)

Quick filters:

htop -p $(pgrep -d, nginx)  # Monitor nginx processes only

Memory Deep Dive: free -h

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       4.2Gi       1.2Gi       128Mi       2.3Gi       3.1Gi 
Swap:          2.0Gi          0B       2.0Gi

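What matters: Focus on the available column, not free. Linux aggressively uses otherwise idle RAM for the page cache (buff/cache), and most of that memory is reclaimable when applications need it.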
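
The available figure comes from the MemAvailable field in /proc/meminfo, so a script can compute "percent available" directly. A minimal sketch, with function names chosen for this example:

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_percent(info: Dict[str, int]) -> float:
    """MemAvailable as a percentage of MemTotal."""
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__":
    sample = (
        "MemTotal:        8000000 kB\n"
        "MemFree:         1200000 kB\n"
        "MemAvailable:    3200000 kB\n"
    )
    print(f"{available_percent(parse_meminfo(sample)):.1f}% available")
```

On a real host, read the text with `open("/proc/meminfo").read()` and alert when the percentage drops below your threshold.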

2. CPU Analysis: Who's Eating Cycles?

Per-Process Breakdown

ps aux --sort=-%cpu | head -10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql     1234 45.2 12.3 2.1g  980m ?        S    10:00   3:45 /usr/sbin/mysqld

Historical CPU Trends: sar

# Install: apt install sysstat
sar -u 1 5     # CPU every 1 sec, 5 samples
sar -u -f /var/log/sysstat/sa08  # Data for the 8th of the month (files are named saDD)

Average: CPU %user %nice %system %iowait %steal %idle
Average:    all  12.34  0.00  8.76    1.23   0.00  77.67

Alert pattern:

#!/bin/bash
# awk exits 0 (success) only when %idle, the last column, is below 70
if sar -u 1 3 | tail -1 | awk '{exit !($NF < 70)}'; then
  echo "CPU idle <70% (3s average) - investigate!"
fi

3. Disk I/O: The Silent Killer

Current I/O: iostat

iostat -x 1 5
Device            r/s     w/s     rkB/s    wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
sda              23.4     1.2   234.5    12.3     0.0     10.2   0.00  89.12    0.1    2.3   0.45    10.0     6.2  1.23  45.2

Red flags: %util >80%, await >20ms

Disk Space Alerts: df

df -h --output=source,fstype,size,used,avail,pcent,target | grep -v tmpfs

Scriptable alert:

# Match 80-100% only; a bracket expression like [100] matches a single character
df -h | grep -E '[[:space:]](8[0-9]|9[0-9]|100)%' || echo "Disk healthy"
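
If you would rather not parse df output at all, the same check can be done with Python's standard library. This is a sketch, not a drop-in replacement: `shutil.disk_usage` reports used/total, which can differ slightly from df's Use% because df excludes blocks reserved for root. The function names and the 85% default are assumptions.

```python
import shutil
from typing import Iterable, List, Tuple


def usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(paths: Iterable[str], threshold: float = 85.0) -> List[Tuple[str, float]]:
    """Return (path, percent) for every path above the threshold."""
    results = []
    for path in paths:
        percent = usage_percent(path)
        if percent > threshold:
            results.append((path, round(percent, 1)))
    return results


if __name__ == "__main__":
    alerts = over_threshold(["/"])
    print(alerts if alerts else "Disk healthy")
```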

4. Network Troubleshooting Masters

Active Connections: ss

# Replace netstat everywhere
ss -tuln          # Listening TCP/UDP
ss -tunap | grep :80   # Processes on port 80
ss -t state established | grep :443 | wc -l  # Active HTTPS connections

Drop Counters: netstat or ss

netstat -s | grep -E "errors|dropped|retrans"
Ip:
    1234 total packets received
    56 dropped because of memory problems

Live Packet Capture: tcpdump

# Capture 100 packets on interface eth0, port 80
sudo tcpdump -i eth0 -c 100 port 80 -w capture.pcap

# Read capture
tcpdump -r capture.pcap -nn

5. Log Monitoring: Beyond tail -f

Service Logs: journalctl

journalctl -u nginx -f           # Follow nginx logs
journalctl -u nginx --since "1h ago"  # Last hour
journalctl -p err -u nginx      # Only errors
journalctl --no-pager | grep -i panic  # System panics

Pattern Mining: grep + awk

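# Count 5xx errors per minute (fields 1-3 are month, day, HH:MM:SS in journalctl's default output)
journalctl -u nginx --since "10min ago" | \
grep -E " 5[0-9]{2} " | \
awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c

# Slow requests (>2s); assumes your log_format ends with $request_time
awk '$NF > 2 {print}' /var/log/nginx/access.log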
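
For anything beyond a quick look, the same per-minute count is easier to maintain in a short script. The sketch below assumes lines in nginx's default "combined" access log format (timestamp in square brackets, status code right after the quoted request); the regular expression and function name are illustrative.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Captures the timestamp truncated to the minute and the 3-digit status code
LINE_RE = re.compile(
    r'\[(?P<minute>[^\]:]+:\d{2}:\d{2}):\d{2} [^\]]*\] "[^"]*" (?P<status>\d{3})\b'
)


def count_5xx_per_minute(lines: Iterable[str]) -> Dict[str, int]:
    """Count responses with a 5xx status, grouped by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match["status"].startswith("5"):
            counts[match["minute"]] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [11/Jan/2026:00:36:07 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.0"',
        '10.0.0.2 - - [11/Jan/2026:00:36:41 +0000] "GET /api HTTP/1.1" 500 157 "-" "curl/8.0"',
        '10.0.0.3 - - [11/Jan/2026:00:36:50 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
        '10.0.0.1 - - [11/Jan/2026:00:37:02 +0000] "GET /api HTTP/1.1" 503 157 "-" "curl/8.0"',
    ]
    for minute, count in count_5xx_per_minute(sample).items():
        print(minute, count)
```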

6. Production Alerting Patterns

CPU/Memory Watchdog

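#!/bin/bash
set -euo pipefail

# Slack incoming webhooks expect a JSON body with a "text" field
alert() { curl -X POST -H "Content-Type: application/json" -d "{\"text\":\"CPU ${CPU}%, MEM ${MEM}%\"}" "$SLACK_WEBHOOK"; }

# Round to whole numbers: [[ -gt ]] only compares integers
CPU=$(top -bn1 | awk '/Cpu\(s\)/ {printf "%.0f", $2}')   # user-space CPU %
MEM=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')

# An if block keeps the exit status 0 when nothing is wrong
if [[ "$CPU" -gt 80 || "$MEM" -gt 80 ]]; then
  alert
fi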

Disk Space Guardian

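#!/bin/bash
# Read usage and mount point together instead of re-running df per device
df --local --output=pcent,target | tail -n +2 | while read -r pcent mount; do
  usage=${pcent%\%}
  # Skip entries without a numeric percentage (shown as "-")
  if [[ "$usage" =~ ^[0-9]+$ && "$usage" -gt 85 ]]; then
    echo "ALERT: $mount at ${usage}%"
  fi
done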

Cron schedule:

# Every 5 minutes
*/5 * * * * /usr/local/bin/check_resources.sh

7. One-Line Dashboards

Combine tools into instant observability:

# System overview (alias this to 'sys')
watch -n 2 'printf "\nCPU: "; sar -u 1 1 | tail -1; printf "MEM: "; free -h | grep Mem:; printf "DISK: "; df -h / /var | tail -2'

# Top resource hogs
watch -n 2 'ps aux --sort=-%cpu | head -8; echo "---"; ps aux --sort=-%mem | head -8'

Quick Reference Table

| Scenario    | Command                | Pro Tip                              |
| ----------- | ---------------------- | ------------------------------------ |
| CPU trends  | sar -u 1 5             | Historical data in /var/log/sysstat/ |
| Memory      | free -h                | Watch available, ignore free         |
| Disk I/O    | iostat -x 1            | %util >80% = trouble                 |
| Connections | ss -tuln               | Modern netstat replacement           |
| Logs        | journalctl -u nginx -f | systemd's tail -f                    |
| Processes   | htop -p $(pgrep nginx) | Filter to specific app               |

Advanced Bash Scripting for DevOps Automation (With Copy‑Pasteable Examples)

Sajja Sudhakararao — Fri, 09 Jan 2026 03:02:04 +0000

Bash is still the glue that holds a lot of DevOps workflows together. Whether you’re deploying services, wiring health checks into CI, or cleaning up logs on a forgotten VM, a few solid scripting patterns go a very long way.

In this post, you’ll find copy‑pasteable Bash snippets for:

Safer script defaults
Parameterized deployments
Health checks and rollbacks
Log rotation and cleanup
Simple CPU/memory watchdogs

Everything is written with day‑to‑day DevOps work in mind—not contrived toy examples.

1. Bash foundations that prevent outages

Even experienced engineers skip basics that later cause flaky scripts and silent failures.

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

set -e – exit on the first error instead of continuing in a bad state
set -u – fail if a variable is undefined
set -o pipefail – make pipelines fail if any command fails

Add a tiny logging helper:

log() { echo "[$(date +'%F %T')] $*"; }
die() { log "ERROR: $*"; exit 1; }

This alone makes every script more observable and safer to reuse across environments.

2. Parameters, flags, and environment‑aware scripts

Hard‑coding values is fine for demos, but production scripts must be configurable.

Use arguments with sensible defaults:

#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="${1:-staging}"

log() { echo "[$(date +'%F %T')] [$ENVIRONMENT] $*"; }

log "Deploying to $ENVIRONMENT"

For more control, turn your script into a tiny CLI:

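# Defaults first, so `set -u` does not trip on an unset variable
ENVIRONMENT="staging"
VERSION=""

while getopts "e:v:h" opt; do
  case "$opt" in
    e) ENVIRONMENT="$OPTARG" ;;
    v) VERSION="$OPTARG" ;;
    h) echo "Usage: deploy.sh -e <env> -v <version>"; exit 0 ;;
    *) echo "Usage: deploy.sh -e <env> -v <version>" >&2; exit 1 ;;
  esac
done
shift $((OPTIND - 1))   # drop parsed flags, leaving positional arguments

[[ -n "$VERSION" ]] || { echo "Missing required -v <version>" >&2; exit 1; }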

Now teammates and CI pipelines can call the same script with explicit flags.
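If part of your tooling lives in Python, `argparse` gives you the same interface with validation and `--help` for free. A small sketch mirroring the flags above; the long option names and the list of allowed environments are assumptions I added for illustration.

```python
import argparse


def build_parser():
    """Build a parser that accepts the same -e / -v flags as the Bash version."""
    parser = argparse.ArgumentParser(prog="deploy.py", description="Deploy a service")
    parser.add_argument("-e", "--environment", default="staging",
                        choices=["staging", "production"],  # hypothetical environments
                        help="target environment (default: staging)")
    parser.add_argument("-v", "--version", required=True,
                        help="version or tag to deploy")
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments
args = build_parser().parse_args(["-e", "production", "-v", "1.4.2"])
print(f"Deploying {args.version} to {args.environment}")
```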

3. A realistic deployment script pattern

Here’s a trimmed‑down deployment flow you can adapt for your services.

#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="${1:-staging}"
APP_DIR="/srv/myapp"
REPO_URL="git@github.com:org/myapp.git"

log() { echo "[$(date +'%F %T')] [$ENVIRONMENT] $*"; }

deploy() {
  log "Updating code..."
  if [[ ! -d "$APP_DIR/.git" ]]; then
    git clone "$REPO_URL" "$APP_DIR"
  fi

  cd "$APP_DIR"
  git fetch --all
  git checkout main
  git pull --ff-only

  log "Installing dependencies..."
  npm ci

  log "Running tests..."
  npm test

  log "Building..."
  npm run build

  log "Restarting service..."
  sudo systemctl restart myapp

  log "Deployment complete."
}

deploy

This pattern can be re-run safely (each run converges on the current state of main), is easy to wire into CI/CD, and uses systemd for reliable service restarts.

4. Health checks and rollbacks in one script

Automation without safety is just a faster way to ship broken code. Add health checks:

health_check() {
  local url="${1:-http://localhost/health}"
  if curl -fsS "$url" > /dev/null; then
    log "Health check passed for $url"
  else
    log "Health check FAILED for $url"
    return 1   # return instead of exiting, so the caller can still roll back
  fi
}

Then define a simple rollback:

previous_version() {
  git describe --tags --abbrev=0 HEAD~1 2>/dev/null || echo ""
}

rollback() {
  local prev
  prev="$(previous_version)"
  [[ -z "$prev" ]] && die "No previous version found for rollback"

  log "Rolling back to $prev"
  git checkout "$prev"
  npm run build
  sudo systemctl restart myapp
}

Wire it together:

deploy
if ! health_check "https://myapp.example.com/health"; then
  log "Health check failed; rolling back"
  rollback
fi

You now have a single script that deploys, validates, and self‑heals.
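The control flow is the important part, so here it is as a small Python sketch with the three steps passed in as plain callables. The retry loop is an addition of mine (the Bash version checks once), and all names are illustrative.

```python
import time


def deploy_with_rollback(deploy, health_check, rollback, retries=3, delay=0.0):
    """Run deploy, then poll health_check; call rollback if it never passes.

    health_check must return True when the service is healthy.
    Returns "deployed" or "rolled_back".
    """
    deploy()
    for attempt in range(1, retries + 1):
        if health_check():
            return "deployed"
        if attempt < retries:
            time.sleep(delay)  # give the service time to come up before retrying
    rollback()
    return "rolled_back"


# Simulated run: the health check never passes, so the rollback fires
steps = []
outcome = deploy_with_rollback(
    deploy=lambda: steps.append("deploy"),
    health_check=lambda: False,
    rollback=lambda: steps.append("rollback"),
)
print(outcome, steps)
```

Retrying before rolling back helps with services that need a few seconds to warm up; set `delay` to something realistic for your app.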

5. Log rotation and cleanup that actually runs

Not every environment needs a full‑blown logging stack; Bash plus cron still works well.

Compress older logs:

#!/usr/bin/env bash
set -euo pipefail

LOG_DIR="/var/log/myapp"
DAYS_TO_KEEP=7

find "$LOG_DIR" -type f -name "*.log" -mtime +$DAYS_TO_KEEP -print0 \
  | while IFS= read -r -d '' file; do
      gzip "$file"
    done

Remove stale archives:

find "$LOG_DIR" -type f -name "*.gz" -mtime +30 -delete

Schedule it:

30 1 * * * /usr/local/bin/log_cleanup.sh

That’s often enough to keep disks from filling up silently.
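For hosts where a Python script is easier to test than a `find` pipeline, the compression step could look roughly like this. It is a sketch, not a drop-in replacement: the age check compares modification times against a simple cutoff, which is close to but not identical to how `find -mtime` rounds days.

```python
import gzip
import shutil
import time
from pathlib import Path


def compress_old_logs(log_dir, days_to_keep=7, now=None):
    """Gzip every *.log file under log_dir whose mtime is older than days_to_keep."""
    now = time.time() if now is None else now
    cutoff = now - days_to_keep * 86400
    compressed = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            target = path.with_name(path.name + ".gz")
            with path.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # remove the original only after the archive is written
            compressed.append(target)
    return compressed
```

Call it as `compress_old_logs("/var/log/myapp")` from the same cron entry.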

6. Lightweight monitoring and alert hooks

You can wrap traditional Linux tools in Bash and push alerts to Slack, email, or a webhook endpoint.

#!/usr/bin/env bash
set -euo pipefail

CPU_THRESHOLD=80
MEM_THRESHOLD=80

cpu_usage() {
  # %idle is the last column; round so the (( )) comparison below gets an integer
  mpstat 1 1 | awk '/Average/ {printf "%.0f", 100 - $NF}'
}

mem_usage() {
  free | awk '/Mem:/ {printf("%.0f", $3/$2 * 100)}'
}

CPU=$(cpu_usage)
MEM=$(mem_usage)

if (( CPU > CPU_THRESHOLD || MEM > MEM_THRESHOLD )); then
  echo "High usage detected: CPU=${CPU}% MEM=${MEM}%"
  # TODO: send to Slack / email / webhook here
fi

This is a great complement to Prometheus, Grafana, or hosted solutions.
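If you port this check to Python, you can read `/proc/meminfo` directly instead of parsing `free`. The sketch below bases the percentage on `MemAvailable`, which is a different (and usually more useful) number than the used/total ratio in the Bash version; the sample text and function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Percent of memory that is not available to new workloads."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round((total - available) / total * 100)


def over_threshold(readings, thresholds):
    """Return the names of readings that exceed their thresholds."""
    return [name for name, value in readings.items() if value > thresholds[name]]


SAMPLE = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""

mem = memory_used_percent(parse_meminfo(SAMPLE))
print(over_threshold({"mem": mem}, {"mem": 80}))  # 75% used, so nothing is reported
```

On a real host, pass `open("/proc/meminfo").read()` instead of `SAMPLE`.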

7. Safer configuration changes

Always pair config edits with backups.

CONFIG="/etc/myapp/config.yaml"
BACKUP="/etc/myapp/config.yaml.$(date +'%F-%H%M%S').bak"

cp "$CONFIG" "$BACKUP"

sed -i 's/feature_x: false/feature_x: true/' "$CONFIG"

systemctl restart myapp

This pattern makes it trivial to revert a bad change during an incident.
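The same backup-then-edit idea can be wrapped in a helper that restores the file automatically when a check fails. This is a sketch under my own assumptions: `transform` and `validate` are hypothetical callables you supply, and in practice `validate` might run a config linter or check that the service came back up.

```python
import shutil
import time
from pathlib import Path


def edit_with_backup(config_path, transform, validate):
    """Back up config_path, write transform(text), and restore the backup if validate fails.

    Returns the backup path on success; raises ValueError after restoring.
    """
    config = Path(config_path)
    stamp = time.strftime("%Y-%m-%d-%H%M%S")
    backup = config.with_name(f"{config.name}.{stamp}.bak")
    shutil.copy2(config, backup)  # copy2 also preserves mtime and permissions
    new_text = transform(config.read_text())
    config.write_text(new_text)
    if not validate(new_text):
        shutil.copy2(backup, config)  # roll back to the saved copy
        raise ValueError(f"validation failed; restored {config} from {backup}")
    return backup
```

A call equivalent to the `sed` line would pass `lambda text: text.replace("feature_x: false", "feature_x: true")` as `transform`.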

If you adapt any of these snippets into your own tooling, drop a comment with your variations; other engineers will benefit from seeing real‑world tweaks. Also, tell me which part you’d like to see expanded: deployments, monitoring, or incident tooling.

🚀 Building an AI-Powered Stock Trading Bot in Python (With Backtesting)

Sajja Sudhakararao — Sun, 04 Jan 2026 21:37:13 +0000

From prediction to execution — a practical guide for engineers

📌 Introduction
Algorithmic trading is no longer reserved for hedge funds. With Python, open APIs, and modern AI models, individual engineers can build intelligent stock trading bots that analyze data, predict price movement, backtest strategies, and automate trades.

In this post, I’ll walk you through:

Designing an AI agent for stock price prediction
Converting predictions into trading decisions
Backtesting the strategy on historical data
Preparing the system for real-world deployment

This guide is hands-on, practical, and written for:

Software / DevOps engineers
Python developers
Anyone curious about AI in finance

🧠 What Is an AI Trading Agent?
An AI trading agent is a system that:

Observes the market (historical & live data)
Learns patterns using machine learning
Makes decisions (Buy / Sell / Hold)
Executes trades automatically
Improves through evaluation and backtesting

Core Components

| Component          | Purpose                                     |
| ------------------ | ------------------------------------------- |
| Data Source        | Market prices (Yahoo Finance, Alpaca, etc.) |
| AI Model           | Predict future price movement               |
| Strategy Engine    | Convert predictions into actions            |
| Backtesting Engine | Validate strategy on past data              |
| Broker API         | Execute trades                              |

🏗️ System Architecture

Market Data → AI Model → Trading Strategy → Backtesting → Broker API

This separation keeps the system modular, testable, and scalable.

📊 Step 1: Fetching Stock Market Data
We’ll use Yahoo Finance for historical prices.

import yfinance as yf
import pandas as pd

ticker = "AAPL"
df = yf.download(ticker, start="2020-01-01", end="2024-01-01")
df = df[["Close"]].dropna()  # keep only the closing price and drop missing rows

This gives us clean, daily closing prices — perfect for modeling.

🤖 Step 2: AI Model (LSTM for Time-Series Prediction)
Stock prices are sequential data, so an LSTM (Long Short-Term Memory) network is a reasonable starting point.

Why LSTM?

Learns temporal patterns
Handles noisy financial data better than simple regression
Widely used in quantitative finance research

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
import numpy as np

Prepare Data


scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

def create_sequences(data, window=50):
    X, y = [], []
    for i in range(len(data) - window):
        X.append(data[i:i+window])
        y.append(data[i+window])
    return np.array(X), np.array(y)

X, y = create_sequences(scaled)
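To see what `create_sequences` produces, here is the same sliding-window logic on a tiny list, written in plain Python so the pairs are easy to check by hand. The toy series and `window=3` are for illustration only; the real code uses 50-day windows on the scaled prices.

```python
def sliding_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


inputs, targets = sliding_windows([1, 2, 3, 4, 5, 6], window=3)
for x, y in zip(inputs, targets):
    print(x, "->", y)
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
# [3, 4, 5] -> 6
```

Each input window is what the model sees, and the target is the value it should learn to predict next.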

🧪 Step 3: Training the Model

model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(50,1)),
    Dropout(0.2),
    LSTM(50),
    Dense(1)
])

model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32)

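This model predicts the next-day closing price from the previous 50 closes. The output is in scaled units, so pass it through scaler.inverse_transform to get a price.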
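One thing the code above does not do is hold out data: the scaler and the model both see the full history. A common alternative is a chronological split where the scaling range is learned from the training part only. Here is a dependency-free sketch of that idea; the prices are made up, and the helper names are mine rather than scikit-learn's.

```python
def chronological_split(values, train_fraction=0.8):
    """Split a time-ordered list without shuffling: oldest part trains, newest part tests."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def fit_min_max(train):
    """Learn the scaling range from the training data only."""
    return min(train), max(train)


def scale(values, lo, hi):
    """Apply min-max scaling with a previously fitted range."""
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    return [(v - lo) / span for v in values]


closes = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0, 112.0, 115.0, 120.0, 118.0]  # made-up prices
train, test = chronological_split(closes)
lo, hi = fit_min_max(train)
scaled_test = scale(test, lo, hi)  # can exceed 1.0 when test prices leave the training range
print(scaled_test)
```

With scikit-learn the equivalent is calling `fit` on the training slice and `transform` on both slices.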

📈 Step 4: Designing the Trading Strategy
Predictions alone are useless without rules.

Simple Strategy Logic

| Condition                            | Action |
| ------------------------------------ | ------ |
| Predicted price > current price + 2% | BUY    |
| Predicted price < current price - 2% | SELL   |
| Otherwise                            | HOLD   |

This avoids over-trading and reduces noise.
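The table translates directly into a small function. This sketch uses the 2% band from the table as the default; the function name and signature are my own.

```python
def decide(predicted, current, threshold=0.02):
    """Map a predicted next close and the current close to BUY, SELL, or HOLD."""
    if current <= 0:
        raise ValueError("current price must be positive")
    if predicted > current * (1 + threshold):
        return "BUY"
    if predicted < current * (1 - threshold):
        return "SELL"
    return "HOLD"


print(decide(103.0, 100.0))  # BUY: predicted is 3% above current
print(decide(97.0, 100.0))   # SELL: predicted is 3% below current
print(decide(101.0, 100.0))  # HOLD: inside the 2% band
```

Keeping the rule in one place makes it easy to unit test and to reuse in both the backtest and live trading.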

🔁 Step 5: Backtesting the Strategy
Backtesting answers one question:

“Would this strategy have worked in the past?”

Backtesting Engine

def backtest(df, model, initial_cash=10000, window_size=50):
    cash = initial_cash
    position = 0
    trades = []
    closes = df["Close"].to_numpy().ravel()

    for i in range(window_size, len(df)):
        # The window ends at day i - 1, the latest close we actually know
        window = df.iloc[i - window_size:i]
        X = scaler.transform(window).reshape(1, window_size, 1)
        predicted = scaler.inverse_transform(model.predict(X, verbose=0))[0][0]
        # Compare against (and trade at) the last known close, not the day being predicted
        price = closes[i - 1]

        if predicted > price * 1.02 and cash >= price:
            cash -= price
            position += 1
            trades.append(("BUY", price))

        elif predicted < price * 0.98 and position > 0:
            cash += price
            position -= 1
            trades.append(("SELL", price))

    final_value = cash + position * closes[-1]
    return final_value, trades

📊 Step 6: Performance Evaluation
AI Strategy vs Buy & Hold

final_value, trades = backtest(df, model)
buy_hold = 10000 * (df.iloc[-1]['Close'] / df.iloc[0]['Close'])

print("AI Strategy:", final_value)
print("Buy & Hold:", buy_hold)

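This comparison tells you whether the AI adds real value or just noise. Keep in mind that the model was trained on the same 2020 to 2023 data it is backtested on, so this result is in-sample and will look better than live performance; a fair test holds out the most recent period.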
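Final portfolio value hides how rough the ride was, so it helps to look at drawdown as well. The sketch below assumes you record the portfolio value after each day (the `backtest` function above does not do this yet); the equity numbers are made up.

```python
def total_return(equity):
    """Fractional gain from the first to the last point of an equity curve."""
    return equity[-1] / equity[0] - 1


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


equity = [10000, 10500, 9975, 10200, 11000]  # made-up daily portfolio values
print(f"return: {total_return(equity):.1%}")        # return: 10.0%
print(f"max drawdown: {max_drawdown(equity):.1%}")  # max drawdown: 5.0%
```

Two strategies with the same final value can have very different drawdowns, and the one with the smaller drawdown is usually easier to stick with.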

⚠️ Important Risk Considerations
AI trading is not magic.

Be aware of:

Overfitting
Market regime changes
Latency in real trading
Slippage and transaction fees

Never deploy without:

Backtesting
Paper trading
Risk limits
Stop-loss rules
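Two of these items are cheap to model before you ever place an order. Here is a rough sketch of a slippage-adjusted fill price and a stop-loss check for a long position; the default percentages are placeholders I picked for illustration, not recommendations.

```python
def fill_price(price, side, slippage=0.0005):
    """Worsen the quoted price by a slippage fraction: buys fill higher, sells lower."""
    if side == "BUY":
        return price * (1 + slippage)
    if side == "SELL":
        return price * (1 - slippage)
    raise ValueError(f"unknown side: {side}")


def stop_loss_hit(entry_price, current_price, max_loss=0.05):
    """True once an open long position has lost at least max_loss of its entry price."""
    return current_price <= entry_price * (1 - max_loss)


print(stop_loss_hit(100.0, 94.0))  # True: down 6%, past the 5% limit
print(stop_loss_hit(100.0, 96.0))  # False: down 4%, still inside the limit
```

Plugging `fill_price` into a backtest in place of the raw close gives a more conservative result.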

🚀 Production Readiness Checklist
Before going live:

✅ Paper trading (Alpaca)
✅ Daily trade limits
✅ Stop-loss & take-profit
✅ Logging & monitoring
✅ Model retraining strategy

🧩 Where This Can Go Next

Reinforcement Learning (RL)
Multi-stock portfolio optimization
Sentiment analysis (news + social)
Kubernetes-based trading microservices
Fully autonomous AI agents

🧠 Final Thoughts
AI-powered trading bots are an excellent real-world application of machine learning, combining:

Data engineering
AI modeling
System design
Financial reasoning

Even if you never trade real money, building one will level up your skills dramatically.

📣 Disclaimer
This article is for educational purposes only.
It is not financial advice.
Always understand the risks before trading.

✍️ About the Author
I’m a DevOps Engineer exploring the intersection of AI, automation, and real-world systems.
I write about practical AI, engineering, and building systems that actually work.