I Built 20 AI-Powered DevOps Tools Because I Got Tired of Doing This Stuff Manually

#aws #sre #devops #ai

I've been a DevOps/SRE engineer for 10+ years.
I've managed 50+ EKS clusters at Apple scale, built OTA firmware
pipelines for 300+ EV chargers, migrated 80 applications to AWS,
and been the sole infrastructure engineer at two energy startups
where I supported teams of 30-40 engineers alone.
In all of that time, certain tasks never stopped being painful.
47 CloudWatch alarms firing at 11pm — and you have to figure out
which 3 actually matter.
A pod CrashLoopBackOff at 2am — logs open, describe output open,
trying to diagnose while half asleep.
A Terraform plan before a production apply — tired, reviewing it
manually, knowing you'll miss something.
A weekly AWS bill spike — someone asks why, you dig through Cost
Explorer for 40 minutes.
I got tired of doing all of this manually. So I built AI agents
for all of it.

What I Built
devops-ai-toolkit — 20 open source AI-powered tools across
5 sections, built with Python and Groq LLaMA 3.3.
🔗 github.com/manekanttasuru/devops-ai-toolkit
Every tool came from a real problem. None of this is theoretical.

The Tools — By Section
Kubernetes (4 tools)

Pod Failure Analyzer — diagnoses CrashLoopBackOff, OOMKilled,
Pending pods automatically from logs + describe output
Cluster Upgrade Advisor — reads your EKS version, scans for
deprecated APIs, produces a prioritized upgrade plan
RBAC Auditor — scans all roles, bindings, service accounts,
flags dangerous permissions ranked CRITICAL/HIGH/MEDIUM/LOW
Network Policy Analyzer — maps pod coverage, finds unprotected
namespaces, generates suggested NetworkPolicy YAML

AWS (4 tools)

IAM Analyzer — flags wildcards, missing MFA, old access keys,
over-permissioned roles with risk scoring
Security Group Auditor — finds open ports to 0.0.0.0/0,
orphaned groups, adds remediation commands per finding
VPC Network Analyzer — maps full topology, flags IP exhaustion,
missing flow logs, generates ASCII topology diagram
Unused Resource Hunter — finds idle EC2s, unattached EBS,
unused Elastic IPs, estimates monthly waste in dollars

Terraform (4 tools)

Security Plan Reviewer — reads terraform plan output, flags
security issues, rates CRITICAL/HIGH/MEDIUM/LOW with HCL fixes
Drift Detector — runs terraform plan, classifies drift as
INTENTIONAL/ACCIDENTAL/CONCERNING, gives per-resource recommendations
State Analyzer — scans tfstate for orphans, sensitive values,
missing tags, resource age estimation
Compliance Checker — maps your Terraform against CIS/HIPAA/SOC2
with control numbers and compliance score percentage

Monitoring (4 tools)

Dashboard Generator — takes a service name and metrics,
generates complete Grafana dashboard JSON ready to import
Log Pattern Analyzer — reads CloudWatch or local logs,
ranks error patterns by frequency and severity
Grafana Alert Router — classifies P1-P4 severity, routes to
right team, posts directly to Slack via webhook
Anomaly Detector — queries Prometheus + CloudWatch, flags
unusual patterns before they cross alert thresholds

SRE (4 tools)

Incident Runbook Generator — takes service + symptoms,
produces structured runbook with exact commands
On-Call Handoff Generator — takes current system state,
writes clean handoff brief for incoming engineer
Deployment Risk Scorer — rates LOW/MEDIUM/HIGH/CRITICAL
with go/no-go checklist per change type
Chaos Engineering Planner — generates full experiment plan
with hypothesis, steps, rollback, safety constraints

Stack
LLM: Groq API — LLaMA 3.3-70b-versatile (fast, free tier available)
Language: Python 3.9+
AWS: boto3
K8s: kubectl via subprocess
No heavy frameworks — each tool is a single Python file

Quick Start
bashgit clone https://github.com/manekanttasuru/devops-ai-toolkit
cd devops-ai-toolkit
pip install -r shared/requirements.txt
export GROQ_API_KEY=your_key_here

Run any tool — example:

cd kubernetes/pod-failure-analyzer
python main.py
Get a free Groq API key at console.groq.com

Why Groq + LLaMA
Fast enough for real-time infrastructure tooling. Free tier is
generous for experimentation. LLaMA 3.3 handles technical DevOps
context well. I use Groq in production for my other AI projects
too — MANI AI and BabyMind AI.

Every tool has a README with example output so you know what
you're getting before you run it.
If you find it useful — a star helps others find it.
If something is broken or you have ideas — open an issue.
🔗 github.com/manekanttasuru/devops-ai-toolkit

DEV Community

I Built 20 AI-Powered DevOps Tools Because I Got Tired of Doing This Stuff Manually

Run any tool — example:

Top comments (0)