It's 3 AM. PagerDuty fires. Your API latency is spiking. The on-call engineer wakes up, opens their laptop, checks Grafana, reads the alert, SSHs into the server, checks logs, identifies the root cause (a memory leak from the latest deploy), rolls back the deployment, verifies the fix, and goes back to sleep. Total time: 45 minutes of groggy debugging.
An AI DevOps agent does the same thing in 3 minutes. It receives the alert, correlates it with recent deployments, checks relevant logs, identifies the root cause, executes the rollback runbook, verifies the fix, and pages the human only if it can't resolve the issue automatically.
This isn't science fiction — teams are running these agents in production today. Here's how to build one.
## What an AI DevOps Agent Can Handle
| Task | Manual Time | Agent Time | Automation Level |
| --- | --- | --- | --- |
| Alert triage & correlation | 10-30 min | 30 sec | Fully auto |
| Log analysis & root cause | 15-60 min | 1-2 min | Fully auto |
| Runbook execution | 10-20 min | 2-3 min | Auto with approval |
| Deployment rollback | 5-15 min | 1 min | Auto with approval |
| Scaling decisions | 5-10 min | 30 sec | Auto within limits |
| Post-incident report | 1-2 hours | 5 min | Fully auto |
| Security alert response | 30-60 min | 2-5 min | Triage auto, response manual |
## Architecture: The AI SRE
```
Alert (PagerDuty/Grafana/Datadog)
        │
        ▼
┌──────────────────┐
│ Alert Classifier │ → Severity, category, affected service
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Context Gatherer │ → Recent deploys, related alerts, metrics, logs
└────────┬─────────┘
         │
         ▼
┌─────────────────────┐
│ Root Cause Analyzer │ → Correlate signals, identify probable cause
└────────┬────────────┘
         │
         ▼
┌──────────────────┐
│ Runbook Selector │ → Match to known resolution playbook
└────────┬─────────┘
         │
         ▼
┌─────────────────┐
│ Action Executor │ → Run remediation (with approval if needed)
└────────┬────────┘
         │
         ▼
┌──────────────┐
│ Verification │ → Confirm fix, update status, generate report
└──────────────┘
```
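The stages above chain naturally into one async pipeline, with each stage enriching a shared incident dict and any stage able to short-circuit to a human. A minimal sketch of the wiring (the stage callables and `status` field are illustrative assumptions, not a fixed API):

```python
import asyncio

class IncidentPipeline:
    """Chain the pipeline stages; each stage enriches a shared incident dict."""

    def __init__(self, *stages):
        self.stages = stages  # async callables: classifier, gatherer, analyzer, ...

    async def handle(self, alert: dict) -> dict:
        incident = {"alert": alert, "status": "open"}
        for stage in self.stages:
            incident = await stage(incident)
            # Bail out early if a stage escalates to a human
            if incident.get("status") == "escalated":
                break
        return incident

# Demo with trivial stand-in stages: escalation short-circuits the rest
async def classify(i):
    i["category"] = "performance"
    return i

async def escalate(i):
    i["status"] = "escalated"
    return i

incident = asyncio.run(
    IncidentPipeline(classify, escalate, classify).handle({"title": "p99 spike"})
)
```

The key design choice is that stages only read and write the incident dict, so you can swap any stage (e.g. replace the executor with a dry-run version) without touching the others.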
## Step 1: Alert Classification and Enrichment
Raw alerts are noisy. Your agent's first job is to classify, deduplicate, and enrich them with context.
```python
class AlertClassifier:
    CATEGORIES = {
        "performance": ["latency", "slow", "timeout", "p99", "response time"],
        "availability": ["down", "unreachable", "5xx", "health check", "connection refused"],
        "resource": ["cpu", "memory", "disk", "oom", "out of memory", "storage"],
        "deployment": ["deploy", "rollout", "version", "release", "canary"],
        "security": ["unauthorized", "403", "brute force", "suspicious", "CVE"],
    }

    async def classify(self, alert: dict) -> dict:
        # Quick keyword classification
        text = f"{alert['title']} {alert['description']}".lower()
        category = "unknown"
        for cat, keywords in self.CATEGORIES.items():
            if any(kw in text for kw in keywords):
                category = cat
                break

        # Enrich with context
        enriched = {
            **alert,
            "category": category,
            "affected_service": self.extract_service(alert),
            "recent_deploys": await self.get_recent_deploys(alert["service"], hours=6),
            "related_alerts": await self.get_correlated_alerts(alert, minutes=30),
            "current_metrics": await self.get_service_metrics(alert["service"]),
        }

        # Severity adjustment
        if len(enriched["related_alerts"]) > 3:
            enriched["severity"] = "critical"  # Multiple correlated alerts = serious
        return enriched

    async def get_recent_deploys(self, service: str, hours: int) -> list:
        """Check CI/CD for recent deployments to this service."""
        deploys = await github.get_deployments(service, since=f"{hours}h")
        return [{"sha": d.sha, "author": d.author, "time": d.time,
                 "message": d.message} for d in deploys]
```
**Tip:** Alert deduplication is critical. A single outage can generate 50+ alerts across monitors. Group alerts by service + time window before analyzing. Your agent should see one incident, not 50 individual alerts.
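That service-plus-time-window grouping can be a few lines of pure Python. A sketch, assuming epoch-second timestamps (the 5-minute window is an assumption you should tune to your alert volume):

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], window_seconds: int = 300) -> list[list[dict]]:
    """Group alerts by service, then split each service's alerts into
    time-window buckets so one incident surfaces as one group."""
    by_service = defaultdict(list)
    for a in alerts:
        by_service[a["service"]].append(a)

    groups = []
    for service_alerts in by_service.values():
        service_alerts.sort(key=lambda a: a["triggered_at"])
        current = [service_alerts[0]]
        for a in service_alerts[1:]:
            if a["triggered_at"] - current[-1]["triggered_at"] <= window_seconds:
                current.append(a)  # within the window: same incident
            else:
                groups.append(current)
                current = [a]
        groups.append(current)
    return groups

# Three raw alerts on one service within five minutes -> one incident group
alerts = [
    {"service": "api", "triggered_at": 0},
    {"service": "api", "triggered_at": 120},
    {"service": "api", "triggered_at": 240},
]
print(len(group_alerts(alerts)))  # 1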
## Step 2: Intelligent Log Analysis
Logs hold the answer to most incidents. But parsing through thousands of log lines at 3 AM is where humans make mistakes. Your agent doesn't get tired.
```python
class LogAnalyzer:
    async def analyze(self, service: str, time_range: tuple, alert: dict) -> dict:
        # Fetch relevant logs
        logs = await self.fetch_logs(
            service=service,
            start=time_range[0],
            end=time_range[1],
            level=["ERROR", "WARN", "FATAL"]
        )

        # Pattern detection: find error spikes
        error_timeline = self.build_error_timeline(logs, bucket_minutes=5)
        spike_time = self.find_spike(error_timeline)

        # Extract unique error messages
        unique_errors = self.deduplicate_errors(logs)

        # Use LLM to analyze patterns
        analysis = await self.llm.generate(f"""Analyze these error logs from service '{service}'.

Alert: {alert['title']}
Error spike detected at: {spike_time}
Recent deployments: {alert.get('recent_deploys', 'None')}

Unique errors (count, message):
{self.format_errors(unique_errors[:20])}

Questions to answer:
1. What is the most likely root cause?
2. Did this start after a deployment?
3. Is this a new error or a recurring pattern?
4. What is the blast radius (which users/features affected)?
5. Suggested remediation steps.

Be specific. Reference actual error messages and timestamps.""")

        return {
            "spike_time": spike_time,
            "unique_errors": unique_errors[:10],
            "analysis": analysis,
            "log_count": len(logs)
        }
```
## Step 3: Automated Runbook Execution
Runbooks are step-by-step procedures for handling known incidents. Most teams have them in Confluence or Notion, gathering dust. An AI agent can **execute them automatically**.
```python
class RunbookExecutor:
    def __init__(self):
        self.runbooks = self.load_runbooks()

    def load_runbooks(self) -> dict:
        return {
            "high_latency_api": {
                "trigger": {"category": "performance", "service_type": "api"},
                "steps": [
                    {"action": "check_metrics", "params": {"metric": "request_rate"}},
                    {"action": "check_metrics", "params": {"metric": "error_rate"}},
                    {"action": "check_recent_deploys", "params": {}},
                    {"action": "check_downstream_health", "params": {}},
                    {"decision": "if_recent_deploy_and_error_spike",
                     "true": "rollback_deployment",
                     "false": "scale_horizontally"},
                    {"action": "verify_recovery", "params": {"wait_seconds": 120}},
                ],
                "approval_required": False,  # Auto-execute for latency
            },
            "oom_kill": {
                "trigger": {"category": "resource", "error_pattern": "OOM"},
                "steps": [
                    {"action": "identify_pod", "params": {}},
                    {"action": "capture_heap_dump", "params": {}},
                    {"action": "restart_pod", "params": {}},
                    {"action": "increase_memory_limit", "params": {"factor": 1.5}},
                    {"action": "verify_recovery", "params": {"wait_seconds": 60}},
                ],
                "approval_required": True,  # Memory changes need approval
            },
            "deployment_rollback": {
                "trigger": {"category": "deployment"},
                "steps": [
                    {"action": "identify_bad_deploy", "params": {}},
                    {"action": "rollback_to_previous", "params": {}},
                    {"action": "verify_recovery", "params": {"wait_seconds": 180}},
                    {"action": "notify_deployer", "params": {}},
                    {"action": "create_incident_ticket", "params": {}},
                ],
                "approval_required": True,
            }
        }

    async def execute(self, runbook_name: str, context: dict) -> dict:
        runbook = self.runbooks[runbook_name]
        results = []
        for step in runbook["steps"]:
            if "decision" in step:
                # Evaluate the condition, then run the chosen branch action
                branch = step["true"] if self.evaluate(step["decision"], context) else step["false"]
                result = await self.execute_action(branch, {}, context)
            else:
                result = await self.execute_action(step["action"], step["params"], context)
            results.append(result)
            if not result["success"]:
                return {"status": "failed", "failed_at": step, "results": results}
        return {"status": "resolved", "results": results}
```
**Warning:** Start with read-only actions (check metrics, read logs, analyze). Only add write actions (rollback, restart, scale) after extensive testing. A buggy agent that rolls back production is worse than a slow human.
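That read-first discipline can be enforced structurally rather than by convention: put an approval gate between the runbook and every write action. A minimal sketch of the pattern, where `run_action` and `request_approval` are stand-ins for your executor and your Slack/PagerDuty approval flow (all names here are illustrative):

```python
import asyncio

READ_ONLY_ACTIONS = {"check_metrics", "check_recent_deploys", "check_downstream_health"}

async def gated_execute(action: str, params: dict, run_action, request_approval) -> dict:
    """Run read-only actions immediately; require a human yes for anything else."""
    if action in READ_ONLY_ACTIONS:
        return await run_action(action, params)
    approved = await request_approval(action, params)  # e.g. Slack button, PD link
    if not approved:
        return {"success": False, "reason": f"approval denied for {action}"}
    return await run_action(action, params)

# Demo with stub transports: the read action runs, the write action is blocked
async def run_action(action, params):
    return {"success": True, "action": action}

async def deny_all(action, params):
    return False

read_result = asyncio.run(gated_execute("check_metrics", {}, run_action, deny_all))
write_result = asyncio.run(gated_execute("rollback_deployment", {}, run_action, deny_all))
```

Because the allowlist is of read-only actions rather than a blocklist of dangerous ones, a new action you forget to categorize defaults to requiring approval.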
## Step 4: Deployment Intelligence
Most incidents trace back to a recent change. Your agent should automatically correlate alerts with deployments.
```python
class DeploymentCorrelator:
    async def correlate(self, alert: dict, deploys: list) -> dict:
        if not deploys:
            return {"deployment_related": False}

        # Deploys in the hour before the alert fired are suspects
        alert_time = alert["triggered_at"]
        suspect_deploys = [
            d for d in deploys
            if 0 <= (alert_time - d["time"]).total_seconds() <= 3600
        ]
        if not suspect_deploys:
            return {"deployment_related": False}

        # The deploy closest to the alert is the prime suspect
        suspect_deploys.sort(key=lambda d: d["time"], reverse=True)
        return {
            "deployment_related": True,
            "suspect_deploy": suspect_deploys[0],
            "confidence": "high" if len(suspect_deploys) == 1 else "medium",
        }
```
## Step 5: Automated Post-Incident Reports
Once the incident is resolved, the agent can draft the post-incident report while the details are still fresh.
```python
class IncidentReporter:
    async def generate_report(self, incident: dict) -> str:
        report = await self.llm.generate(f"""Generate a post-incident report.

Incident data:
- Alert: {incident['alert']['title']}
- Severity: {incident['severity']}
- Triggered: {incident['started_at']}
- Resolved: {incident['resolved_at']}
- Duration: {incident['duration_minutes']} minutes
- Root cause: {incident['root_cause']}
- Resolution: {incident['resolution_steps']}
- Affected services: {incident['affected_services']}
- User impact: {incident.get('user_impact', 'Unknown')}
- Related deployment: {incident.get('suspect_deploy', 'None')}

Format as a standard post-incident report with sections:
1. Summary (2-3 sentences)
2. Timeline (key events with timestamps)
3. Root Cause Analysis
4. Resolution
5. Impact Assessment
6. Action Items (preventive measures)

Be factual and specific. Include actual timestamps and metrics.""")
        return report
```
## Tools Your Agent Needs
| Category | Tools | Purpose |
| --- | --- | --- |
| Monitoring | Grafana API, Datadog API, Prometheus | Read metrics, check dashboards |
| Logging | Elasticsearch, Loki, CloudWatch Logs | Search and analyze logs |
| Alerting | PagerDuty, OpsGenie, Slack | Receive alerts, update status |
| CI/CD | GitHub Actions, ArgoCD, Jenkins | Check deploys, trigger rollbacks |
| Infrastructure | Kubernetes API, AWS API, Terraform | Scale, restart, modify resources |
| Communication | Slack, Teams, email | Notify teams, request approval |
## Platform Comparison: AIOps Tools
| Platform | Best For | Price | Key Feature |
| --- | --- | --- | --- |
| Shoreline.io | Automated remediation | Custom | Op-based runbook automation |
| BigPanda | Alert correlation | Custom | ML-powered alert grouping |
| Moogsoft | Noise reduction | $15/host/mo | AI alert correlation |
| PagerDuty AIOps | Existing PD users | Add-on | Intelligent triage, similar incidents |
| Custom (this guide) | Full control | $100-300/mo | Your runbooks, your rules |
## Safety Guardrails for DevOps Agents
DevOps agents have access to production infrastructure. The guardrails must be strict:
- **Read-first, write-later:** Start with read-only access. Add write actions one at a time after proving reliability
- **Blast radius limits:** Agent can restart 1 pod, not the entire deployment. Scale by 2x, not 10x
- **Approval gates:** Rollbacks, scaling changes, and config modifications require human approval via Slack/PagerDuty
- **Dry-run mode:** Agent shows what it *would* do, human approves, then it executes
- **Kill switch:** One command to disable all agent write actions immediately
- **Audit trail:** Every action logged with timestamp, reason, and outcome
- **Time boundaries:** No infrastructure changes during business hours without approval
```python
from datetime import datetime

class DevOpsGuardrails:
    MAX_SCALE_FACTOR = 2.0         # Never scale more than 2x
    MAX_RESTARTS_PER_HOUR = 5      # Prevent restart loops
    CHANGE_FREEZE_HOURS = (9, 17)  # No auto-changes during business hours

    REQUIRE_APPROVAL = [
        "rollback_deployment",
        "scale_service",
        "modify_config",
        "restart_service",  # After initial testing, this can be auto
    ]

    def can_execute(self, action: str, params: dict) -> tuple[bool, str]:
        # Business hours check
        hour = datetime.now().hour
        if self.CHANGE_FREEZE_HOURS[0] <= hour < self.CHANGE_FREEZE_HOURS[1]:
            return False, "Change freeze: no automatic changes during business hours"

        # Approval check
        if action in self.REQUIRE_APPROVAL:
            return False, f"Action '{action}' requires human approval"

        # Blast radius check
        if action == "scale_service" and params.get("factor", 1) > self.MAX_SCALE_FACTOR:
            return False, f"Scale factor {params['factor']} exceeds max {self.MAX_SCALE_FACTOR}"

        return True, "OK"
```
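The kill switch from the list above can be as simple as a flag file that every write action checks before executing, so ops can disable the agent with a single `touch` even if the agent process itself is misbehaving. A sketch (the flag-file path is an arbitrary choice for illustration):

```python
import os
import tempfile

KILL_SWITCH_PATH = os.path.join(tempfile.gettempdir(), "devops_agent_disabled")

def kill_switch_engaged() -> bool:
    """Every write action checks this first; ops flip it with `touch <path>`."""
    return os.path.exists(KILL_SWITCH_PATH)

def engage_kill_switch(reason: str) -> None:
    with open(KILL_SWITCH_PATH, "w") as f:
        f.write(reason)  # record why, for the audit trail

def disengage_kill_switch() -> None:
    if os.path.exists(KILL_SWITCH_PATH):
        os.remove(KILL_SWITCH_PATH)

def execute_write_action(action: str) -> dict:
    if kill_switch_engaged():
        return {"success": False, "reason": "kill switch engaged"}
    return {"success": True, "action": action}

# While the switch is engaged, every write action is refused
engage_kill_switch("incident INC-123: agent misbehaving")
blocked = execute_write_action("restart_pod")
disengage_kill_switch()
allowed = execute_write_action("restart_pod")
```

A file-based flag survives agent restarts and needs no extra infrastructure; a shared key in Redis or Consul works the same way when the agent runs on multiple hosts.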
## Implementation Roadmap
Don't try to build the full AI SRE in one sprint. Follow this progression:
- **Week 1-2: Observer** — Alert classification, log analysis, context gathering. No write actions. Agent produces reports in Slack.
- **Week 3-4: Advisor** — Root cause analysis, runbook recommendation. Agent suggests actions, human executes.
- **Week 5-6: Assistant** — Agent executes simple runbooks (restart pod, clear cache) with approval. Human handles complex cases.
- **Week 7-8: Responder** — Auto-execute known runbooks for recurring incidents. Approval required for new patterns.
- **Month 3+: Autonomous** — Full auto-remediation for known incident types. Human-in-the-loop for novel issues.
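Observer mode in weeks 1-2 can literally be the full pipeline with every write action replaced by a description of what it would have done. A sketch (the `post_to_slack` callable is a stand-in for your notifier, not a real API):

```python
def observer_mode(runbook_steps: list[dict], post_to_slack) -> list[str]:
    """Dry-run a runbook: describe each step instead of executing it."""
    planned = []
    for step in runbook_steps:
        desc = step.get("action") or f"decision: {step['decision']}"
        planned.append(f"WOULD RUN: {desc} {step.get('params', {})}")
    post_to_slack("\n".join(planned))
    return planned

# Demo: collect the "would run" plan instead of touching infrastructure
messages = []
plan = observer_mode(
    [{"action": "check_metrics", "params": {"metric": "error_rate"}},
     {"action": "rollback_to_previous", "params": {}}],
    messages.append,
)
```

Reviewing these dry-run plans against what the humans actually did is how you build the evidence to promote the agent to the next stage.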
Building AI agents for DevOps and SRE? [AI Agents Weekly](/newsletter.html) covers infrastructure automation, AIOps tools, and production deployment patterns 3x/week. Join free.
## Conclusion
The best on-call engineer is one who never has to wake up. An AI DevOps agent won't replace your SRE team, but it will handle the repetitive incidents that drain their energy — the 3 AM OOM kills, the deployment rollbacks, the "disk is 90% full" alerts that have a known fix.
Start as an observer. Prove the agent can correctly identify root causes. Then gradually give it the keys to execute. Every incident it handles autonomously is an interrupted sleep your team gets back. That's not just efficiency — it's quality of life.
Get our free AI Agent Starter Kit — templates, checklists, and deployment guides for building production AI agents.