Faizan Hussain Rabbani

Posted on May 25 • Edited on May 26

Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation

#agents #ai #devops #sre

Modern SRE teams manage many microservices with tangled interdependencies. When something breaks, engineers manually query multiple observability backends, correlate signals across layers, dig through historical post-mortems, and execute runbooks under pressure. The result: long MTTR, alert fatigue, and toil.

So I built the Autonomous SRE Agent — an AI-driven reliability system that runs the detect → investigate → diagnose → remediate → learn loop, with a phased path toward full autonomy.

The agent is built on strict hexagonal architecture with hard-coded safety guardrails — autonomy is earned through measurable graduation criteria, not granted by default. A clear "what's next" section at the end covers the design-stage components honestly, separated from running code.

🎯 Scope today

The agent automates detection, diagnosis, and remediation of infrastructure incidents, with a target of sub-30-second diagnostic latency. Five incident categories are detected and diagnosed; two have a fully executable end-to-end remediation path in shipped code:

Incident category	Trigger	Remediation	Status
OOM kill	Memory pressure > 85% for 5+ min	Pod / container restart	✅ End-to-end
High latency	p99 above rolling baseline + sigma threshold	HPA scale-up	✅ End-to-end
Error-rate spike	Surge over baseline	GitOps revert	🟡 Detect + diagnose only
Disk exhaustion	High utilization with projected breach	Log truncation	🟡 Detect + diagnose only
Cert expiry	Cert expires within configured window	Cert rotation	🟡 Detect + diagnose only

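The 🟡 rows are fully detected and classified and run through the diagnosis pipeline, but their remediation executors (GitOps revert, log truncation, cert rotation) are not shipped yet; the roadmap at the end lists each one.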

🏗️ The architecture

A layered pipeline transforms raw telemetry into safe, verifiable remediations. Six layers, each with a clear contract:

1. Observability / Ingestion — vendor-agnostic by design

This layer gathers high-fidelity telemetry through ports that the domain depends on, with swappable adapters implementing them. Five concrete telemetry providers are shipped:

OpenTelemetry for app metrics, traces, and structured logs (composite of Prometheus, Jaeger, Loki adapters)
CloudWatch as an AWS-native composite (metrics, logs, X-Ray)
New Relic via NerdGraph GraphQL — full implementation of all four query ports
Pixie eBPF for kernel-level visibility (syscalls, network flows, process activity) via PxL scripts, running as a Kubernetes DaemonSet
Kubernetes pod logs for container-level evidence

A continuous dependency-graph service consumes trace data to build a real-time service topology, which is essential for calculating blast radius before any remediation runs.
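As a rough illustration of why topology matters for blast radius, here is a minimal sketch that turns caller/callee pairs (the kind of edge you can derive from trace spans) into a reverse-dependency map and walks it. The names `build_callers_map` and `impacted_services`, and the sample edges, are hypothetical and not taken from the agent's dependency-graph service.

```python
from collections import defaultdict, deque


def build_callers_map(edges):
    """Map each callee to the set of services that call it directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def impacted_services(callers, target):
    """Return every service that depends on `target`, directly or transitively."""
    seen = set()
    queue = deque([target])
    while queue:
        service = queue.popleft()
        for caller in callers.get(service, ()):
            if caller != target and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical edges, as (caller, callee) pairs extracted from traces.
edges = [
    ("gateway", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("gateway", "search"),
]
callers = build_callers_map(edges)
print(sorted(impacted_services(callers, "payments")))  # ['checkout', 'gateway']
```

Restarting `payments` in this toy graph touches two upstream services, while restarting `gateway` touches none, which is the kind of difference a blast-radius check needs to see before acting.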

2. Detection — threshold-free anomalies

Static thresholds are dead. The detection layer computes rolling statistical baselines, segmented by time-of-day and day-of-week, then evaluates deviations. Concretely:

BaselineService maintains per-(service, metric) baselines that adjust for daily and weekly patterns
AnomalyDetector runs detection rules across the canonical AnomalyType enum (memory pressure, latency spike, error-rate surge, traffic anomaly, deployment-induced regression, cert expiry, disk exhaustion)
SignalCorrelator joins metrics, logs, traces, and eBPF events into per-service CorrelatedSignals so a single root cause doesn't produce alert storms
AlertCorrelationEngine uses the dependency graph to fold related alerts into a single Incident

The output: a canonical, deduplicated Incident ready for the intelligence layer.
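The baseline idea can be sketched in a few lines. This is a simplified illustration, not the shipped BaselineService: the class name `SeasonalBaseline`, the `min_samples` parameter, and the (weekday, hour) bucketing are assumptions chosen to mirror the time-of-day and day-of-week segmentation described above.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


class SeasonalBaseline:
    """Hypothetical baseline that keeps samples per (weekday, hour) bucket."""

    def __init__(self, min_samples=5):
        self.min_samples = min_samples
        self._buckets = defaultdict(list)

    @staticmethod
    def _key(ts):
        return (ts.weekday(), ts.hour)

    def observe(self, ts, value):
        self._buckets[self._key(ts)].append(value)

    def deviation_sigma(self, ts, value):
        """Return how many standard deviations `value` sits from the bucket mean.

        Returns None when the bucket has too little history to judge.
        """
        samples = self._buckets.get(self._key(ts), [])
        if len(samples) < self.min_samples:
            return None
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)
        if stdev == 0:
            return 0.0 if value == mean else float("inf")
        return (value - mean) / stdev


baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9)  # 2024-01-01 is a Monday
for week, latency_ms in enumerate([100, 102, 98, 101, 99]):
    baseline.observe(monday_9am + timedelta(weeks=week), latency_ms)

sigma = baseline.deviation_sigma(datetime(2024, 2, 5, 9), 110)
print(f"{sigma:.2f} sigma above the Monday 09:00 baseline")
```

A detector would then compare the returned sigma against a configured threshold; the same 110 ms reading on a Saturday night returns `None` here because that bucket has no history yet.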

3. Intelligence — RAG-grounded reasoning

The diagnostic engine is built around a multi-stage Retrieval-Augmented Generation pipeline:

Incoming incidents are embedded (sentence-transformers) and matched against a vector store of historical post-mortems and runbooks — pgvector with HNSW indexing, falling back to JSONB cosine similarity in environments without the extension.
Retrieved evidence is reranked by a cross-encoder (ms-marco-MiniLM-L-6-v2) for query-aware relevance, then optionally compressed by LLMLingua with SRE-critical force tokens (OOM, kill, etc.) preserved.
A fingerprint-keyed diagnostic cache short-circuits recurring incidents (key: service + anomaly_type + metric) to avoid redundant LLM calls.
The compressed context is fed to an LLM — Claude Sonnet (claude-sonnet-4-20250514) by default — to generate a root-cause hypothesis.
A second-opinion validator cross-checks the reasoning. It supports both RULE_BASED and LLM_CROSS_CHECK strategies; the LLM cross-check typically uses gpt-4o-mini as a cheaper, faster reviewer of Claude's output.
The confidence scorer produces a composite score from four weighted components — LLM self-confidence (35%), validator agreement (25%), retrieval relevance (25%), and evidence volume (15%) — so no single signal dominates.

The output: a typed diagnosis with a calibrated composite confidence percentage.
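To make the weighting concrete, here is a minimal sketch of a composite score using the four weights listed above. The function name `composite_confidence`, the component key names, and the assumption that each component is already normalized to the 0..1 range are mine; how the agent normalizes evidence volume is not shown here.

```python
# Weights from the article; key names are hypothetical.
WEIGHTS = {
    "llm_self_confidence": 0.35,
    "validator_agreement": 0.25,
    "retrieval_relevance": 0.25,
    "evidence_volume": 0.15,
}


def composite_confidence(components):
    """Combine normalized component scores (0..1) into a percentage."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    score = 0.0
    for name, weight in WEIGHTS.items():
        clamped = min(max(components[name], 0.0), 1.0)  # guard against out-of-range inputs
        score += weight * clamped
    return round(score * 100, 1)


example = {
    "llm_self_confidence": 0.9,
    "validator_agreement": 1.0,
    "retrieval_relevance": 0.8,
    "evidence_volume": 0.6,
}
print(composite_confidence(example))  # 85.5
```

Because the LLM's self-reported confidence carries only 35% of the weight, a fully confident model with no validator agreement, no relevant retrievals, and no evidence would score 35 at most.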

4. Governance — graduation criteria and coordination

Autonomy isn't granted, it's earned. Two pieces of this layer are shipped:

Graduation gate evaluator — PhaseGate.evaluate_graduation() checks five hard criteria continuously:

Criterion	Threshold
Diagnostic accuracy	≥ 90%
Destructive false positives	0
Sev 3-4 autonomous resolution rate	≥ 95%
Remediation integration coverage	≥ 30%
Clean soak-test days	≥ 7

Phase transitions today are an operator decision informed by this evaluator's output (the full state machine is a Phase 3 deliverable).
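A gate like this reduces to a handful of comparisons. The sketch below is a hypothetical free function, not the actual `PhaseGate.evaluate_graduation()` implementation: the `GraduationMetrics` dataclass and its field names are assumptions, and only the thresholds come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraduationMetrics:
    """Hypothetical input shape; rates are fractions in the 0..1 range."""
    diagnostic_accuracy: float
    destructive_false_positives: int
    sev34_autonomous_resolution_rate: float
    remediation_integration_coverage: float
    clean_soak_test_days: int


def evaluate_graduation(m):
    """Return (passed, failed_criteria) for the five hard criteria."""
    checks = {
        "diagnostic accuracy >= 90%": m.diagnostic_accuracy >= 0.90,
        "destructive false positives == 0": m.destructive_false_positives == 0,
        "Sev 3-4 autonomous resolution rate >= 95%": m.sev34_autonomous_resolution_rate >= 0.95,
        "remediation integration coverage >= 30%": m.remediation_integration_coverage >= 0.30,
        "clean soak-test days >= 7": m.clean_soak_test_days >= 7,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)


metrics = GraduationMetrics(
    diagnostic_accuracy=0.93,
    destructive_false_positives=0,
    sev34_autonomous_resolution_rate=0.96,
    remediation_integration_coverage=0.20,
    clean_soak_test_days=9,
)
print(evaluate_graduation(metrics))
```

Returning the list of failed criteria, rather than a bare boolean, gives the operator making the phase decision something specific to act on.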

Distributed lock manager — Redis, etcd, and in-memory adapters all implement DistributedLockManagerPort, so the agent acquires resource-scoped locks before remediation and releases them after verification. This is the foundation for multi-agent coordination (FinOps, SecOps) once a second agent enters the picture.
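The acquire-before, release-after pattern can be sketched with an in-memory adapter. The port name `DistributedLockManagerPort` comes from the project, but the `acquire`/`release` method signatures, the `InMemoryLockManager` class, and the `resource_lock` helper below are assumptions for illustration. A real distributed adapter would also need lock expiry and fencing, which this sketch leaves out.

```python
import threading
import uuid
from contextlib import contextmanager
from typing import Protocol


class DistributedLockManagerPort(Protocol):
    """Assumed port shape; the real signatures may differ."""

    def acquire(self, resource: str, owner: str) -> bool: ...
    def release(self, resource: str, owner: str) -> bool: ...


class InMemoryLockManager:
    """Single-process stand-in, useful for unit tests."""

    def __init__(self):
        self._owners = {}
        self._mutex = threading.Lock()

    def acquire(self, resource, owner):
        with self._mutex:
            if resource in self._owners:
                return False
            self._owners[resource] = owner
            return True

    def release(self, resource, owner):
        with self._mutex:
            if self._owners.get(resource) != owner:
                return False  # only the current holder may release
            del self._owners[resource]
            return True


@contextmanager
def resource_lock(locks: DistributedLockManagerPort, resource: str):
    """Hold a resource-scoped lock for the duration of a remediation."""
    owner = uuid.uuid4().hex
    if not locks.acquire(resource, owner):
        raise RuntimeError(f"{resource} is locked by another actor")
    try:
        yield owner
    finally:
        locks.release(resource, owner)


locks = InMemoryLockManager()
with resource_lock(locks, "deployment/checkout"):
    # Remediation and verification would run here.
    print(locks.acquire("deployment/checkout", "finops-agent"))  # False
```

The `finally` clause releases the lock even when remediation raises, so a failed action does not leave the resource permanently fenced.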

5. Action & Guardrails — the final mile

This is where the AI meets reality. Every remediation passes through hard-coded safety gates before execution:

Safety gates (Python-enforced, not OPA):

BlastRadiusCalculator — validates affected_pods_percentage against the plan's max_blast_radius_percentage, with a hard "scale-up ≤ 2× current replicas" override
KillSwitch — global toggle that gates every autonomous action; emits domain events on activation
CooldownEnforcer — prevents the same service from being remediated repeatedly in a short window
GuardrailOrchestrator — composes the above and produces a single GuardrailResult(allowed, reason)

Compute executors (idempotent, behind CloudOperatorPort):

Platform	Adapter	Operations
Kubernetes	`kubernetes/operator.py`	Pod restart, deployment scale
AWS ECS	`aws/ecs_operator.py`	Service update, task restart
AWS EC2 ASG	`aws/ec2_asg_operator.py`	Instance refresh, scale
AWS Lambda	`aws/lambda_operator.py`	Concurrency adjust, redeploy
Azure App Service	`azure/app_service_operator.py`	Restart, scale-out
Azure Functions	`azure/functions_operator.py`	Plan adjustment

After execution, RemediationVerifier compares post-action metrics against baseline within a sigma tolerance. Failed verification returns a FAILED status that the engine surfaces; automatic rollback on failed verification is planned for a future phase.
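The verification step boils down to a bounded comparison. The sketch below is an assumption-laden illustration rather than the shipped RemediationVerifier: the function name `verify_recovery`, its parameters, the default tolerance of 2 sigma, and the `"VERIFIED"` status string are hypothetical; only the `FAILED` status is named in the project.

```python
import statistics


def verify_recovery(post_action_samples, baseline_mean, baseline_stdev, sigma_tolerance=2.0):
    """Return "VERIFIED" if the post-action mean sits within tolerance of baseline."""
    if not post_action_samples:
        return "FAILED"  # no evidence of recovery counts as failure
    observed = statistics.fmean(post_action_samples)
    allowed_deviation = sigma_tolerance * baseline_stdev
    if abs(observed - baseline_mean) <= allowed_deviation:
        return "VERIFIED"
    return "FAILED"


# Memory utilization (%) sampled after a pod restart, against a 60% +/- 5 baseline.
print(verify_recovery([62.0, 61.5, 63.0], baseline_mean=60.0, baseline_stdev=5.0))  # VERIFIED
print(verify_recovery([88.0, 90.5, 91.0], baseline_mean=60.0, baseline_stdev=5.0))  # FAILED
```

Treating an empty sample window as a failure is a deliberate choice in this sketch: an action that cannot be observed should not be recorded as a success.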

6. Data & Persistence — one source of truth

Consolidated around PostgreSQL (ADR-006), with ten migrations in src/sre_agent/adapters/persistence/migrations/:

PostgreSQL is the system of record, with the transactional outbox pattern (postgres_outbox.py + outbox_relay.py) and a processed_events table for idempotent consumers
TimescaleDB extension handles the vector_embeddings and metrics hypertables (migration 002, with continuous-aggregate tuning in migration 009)
pgvector stores embeddings with HNSW indexing, including SET LOCAL hnsw.ef_search = 100 per query for recall/latency tuning, plus a JSONB fallback for dev environments
Redis Streams is the internal event bus (XADD / XREADGROUP with consumer groups, at-least-once delivery)
Every LLM prompt, retrieval, tool call, and decision lands in an immutable reasoning trace via reasoning_trace_store.py (agent_runs, tool_calls, retrieved_contexts) — the foundation for blameless post-mortems
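The outbox and idempotent-consumer patterns are easier to see in code. This sketch uses SQLite from the Python standard library purely as a stand-in for PostgreSQL so it runs anywhere. The table layouts and the function names `record_incident`, `relay`, and `handle_once` are hypothetical and do not reflect the contents of postgres_outbox.py or outbox_relay.py; only the `processed_events` table name comes from the project.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incidents (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT UNIQUE NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
""")


def record_incident(conn, incident_id, status):
    """Write the state change and its event in one transaction."""
    with conn:  # commits on success, rolls back both inserts on error
        conn.execute("INSERT INTO incidents (id, status) VALUES (?, ?)", (incident_id, status))
        payload = json.dumps({"incident_id": incident_id, "status": status})
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"incident-{incident_id}-{status}", payload),
        )


def relay(conn, publish):
    """Publish pending outbox rows in order, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, json.loads(payload))  # e.g. an XADD to the event bus
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


def handle_once(conn, event_id, handler):
    """Run `handler` only the first time this event_id is seen."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, already handled
        handler()
        return True


published = []
record_incident(conn, "inc-1", "OPEN")
relay(conn, lambda event_id, body: published.append(event_id))
print(published)  # ['incident-inc-1-OPEN']
```

If the relay crashes between publishing and marking a row, the event is sent again on the next pass. That is why at-least-once delivery is paired with a consumer-side check like `handle_once`.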

🔄 The end-to-end flow

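From the moment a metric goes sideways to the moment the agent verifies recovery, an incident moves through the layers above in order: telemetry ingestion, anomaly detection and correlation into an Incident, RAG-grounded diagnosis with a confidence score, guardrail checks, execution, and post-action verification.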

Two things worth highlighting:

Severity classification is deterministic, not LLM-driven. The SeverityClassifier combines multi-dimensional impact scoring with hard rules and service-tier metadata — so the same incident always gets the same Sev. This matters for audit and graduation criteria.
Verification is structural, not aspirational. RemediationVerifier compares post-action metrics against the same baseline that triggered detection. If the agent restarts a pod and the OOM signal doesn't subside within sigma tolerance, the result is recorded as FAILED, so an operator reviewing the run sees an unambiguous outcome.
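To show what "deterministic" means in practice, here is a toy classifier that applies hard rules first and a tier-weighted impact score second. Every rule, weight, threshold, and field name below is invented for illustration; the actual SeverityClassifier logic is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactSketch:
    """Hypothetical impact dimensions."""
    service_tier: int               # 0 = most critical
    error_rate: float               # fraction of failing requests, 0..1
    affected_users_fraction: float  # 0..1
    data_loss_risk: bool


# Assumed tier multipliers; unknown tiers fall back to the lowest weight.
TIER_WEIGHT = {0: 1.0, 1: 0.8, 2: 0.6}


def classify_severity(impact):
    """Return a severity from 1 (worst) to 4, with no randomness and no LLM call."""
    # Hard rules take precedence over scoring.
    if impact.data_loss_risk:
        return 1
    if impact.service_tier == 0 and impact.error_rate >= 0.5:
        return 1
    # Tier-weighted impact score.
    raw = 0.5 * impact.error_rate + 0.5 * impact.affected_users_fraction
    score = raw * TIER_WEIGHT.get(impact.service_tier, 0.4)
    if score >= 0.5:
        return 2
    if score >= 0.2:
        return 3
    return 4


incident = ImpactSketch(service_tier=1, error_rate=0.4, affected_users_fraction=0.3, data_loss_risk=False)
print(classify_severity(incident))  # 3
```

Because the function is pure, replaying the same incident during an audit yields the same severity, which is the property the graduation criteria rely on.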

🧱 Design philosophy

Hexagonal architecture (Ports & Adapters)

The single most important architectural decision (ADR-001) was strict hexagonal layering. The domain/ package never imports kubernetes, boto3, anthropic, or openai directly. It depends only on abstract ports/, and every external interaction goes through an adapter — OtelProvider, CloudWatchProvider, PixieAdapter, AnthropicLLMAdapter, OpenAILLMAdapter, PgVectorAdapter, RedisStreamsEventBus, KubernetesOperator, and so on.

This isn't theoretical purity — it pays off concretely:

The reasoning engine runs unchanged on AWS, Azure, and on-prem Kubernetes
Every adapter is replaceable in tests with an in-memory fake — see adapters/coordination/in_memory_lock_manager.py and the JSONB fallback in the pgvector adapter
No LangChain, no LlamaIndex, no framework lock-in: direct SDK calls with custom Pydantic dataclasses tailored to SRE diagnostics
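The port-and-adapter split is small enough to show end to end. In this sketch the names `LLMPort`, `FakeLLM`, `Hypothesis`, and `diagnose` are hypothetical; the point is only that domain code depends on an abstract interface, so a test can swap the real Anthropic or OpenAI adapter for an in-memory fake.

```python
from dataclasses import dataclass
from typing import Protocol


class LLMPort(Protocol):
    """Abstract port: the domain only knows this interface."""

    def complete(self, prompt: str) -> str: ...


class FakeLLM:
    """In-memory fake for unit tests; records prompts and returns a canned reply."""

    def __init__(self, reply):
        self.reply = reply
        self.prompts = []

    def complete(self, prompt):
        self.prompts.append(prompt)
        return self.reply


@dataclass(frozen=True)
class Hypothesis:
    service: str
    root_cause: str


def diagnose(llm: LLMPort, service: str, evidence: list[str]) -> Hypothesis:
    """Domain logic: builds the prompt and never imports a vendor SDK."""
    lines = "\n".join(f"- {item}" for item in evidence)
    prompt = f"Service: {service}\nEvidence:\n{lines}\nState the most likely root cause."
    return Hypothesis(service=service, root_cause=llm.complete(prompt).strip())


fake = FakeLLM("Memory leak in the session cache\n")
result = diagnose(fake, "checkout", ["OOMKilled 3 times in 10 minutes", "heap grows linearly"])
print(result.root_cause)  # Memory leak in the session cache
```

A production adapter would implement the same `complete` method by calling the vendor SDK, and `diagnose` would not change.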

Autonomy earned, not granted

The agent is currently in Phase 2 (Assist). Diagnosis runs autonomously; remediation runs through the guardrail stack with approval state tracked in RemediationPlan.approval_state. The graduation evaluator continuously checks whether the agent has earned the right to advance:

Phase	What runs	What's mutated	Status
1 · Shadow	Detection + diagnosis	Nothing — actions written to audit log	✅ Shipped
2 · Assist	Diagnosis + plan generation, guardrail-gated execution	OOM and latency remediation against shipped compute executors	✅ Current phase
3 · Autonomous	Fully autonomous Sev 3-4 execution, automatic phase transitions	Subject to blast-radius limits and kill switch	🟡 In design
4 · Predictive	Trend analysis, pre-emptive action	Slow-moving issues caught before threshold breach	🟡 Roadmap

Graduation requires demonstrated diagnostic accuracy over a sustained window — not a manager's gut feel.

🛠️ Implementation stack

Concern	Choice	Why
Language	Python 3.11+	Mature AI/ML ecosystem, async-native
API surface	FastAPI	Async-first, OpenAPI by default
Data models	Pydantic v2	Strict runtime validation of canonical types
LLM SDKs	`anthropic` + `openai` (direct)	No LangChain — domain models stay in Pydantic, no impedance mismatch
Embeddings	`sentence-transformers`	Local, deterministic, no API tax
Operational store	PostgreSQL + TimescaleDB	One database for OLTP, time-series, and vectors
Vector store	pgvector (HNSW) + JSONB fallback	Same DB, ACID transactions, no extra ops surface
Event bus	Redis Streams	At-least-once with consumer groups, simple ops
Coordination	Redis + etcd	Distributed locks for multi-agent fencing
Testing	Testcontainers + LocalStack Pro	True integration tests against ephemeral cloud

The testing approach matters: because this agent mutates infrastructure, contract tests run against real (containerized) backends. In-memory fakes are reserved for unit tests against domain logic.

🚧 What's next — honest roadmap

To stay honest, here's what's designed but not yet shipped. These are the things you might expect from the architecture diagram that aren't running today:

Operator dashboard — Next.js SPA with real-time incident feed, confidence visualization, phase tracker. OpenSpec proposal at openspec/changes/phase-2-7-operator-dashboard/; zero tasks complete.
ChatOps adapters — Slack / Teams Block Kit approvals, PagerDuty escalation, Jira auto-ticketing. Directories specified, not yet implemented.
GitOps executor — ArgoCD / PyGithub adapter for rollback PRs. GITOPS_REVERT strategy is defined; no execution adapter yet.
Cert-manager executor — CERTIFICATE_ROTATION strategy mapped, no adapter to invoke cert-manager.
Log-truncation executor — LOG_TRUNCATION strategy mapped, no shipped adapter.
Phase state machine — the gate evaluator is shipped, but automatic phase transitions and a persisted state machine are Phase 3 work.
Oscillation detector — multi-agent coordination has the lock primitive; oscillation detection between SRE/FinOps/SecOps agents is a future requirement when a second agent exists.
WebSocket / SSE stream — REST endpoints are shipped; the real-time push channel for the dashboard is part of Phase 2.7.

I'm sharing this roadmap explicitly because the gap between "impressive architecture diagram" and "running code" is the credibility test for any AI infrastructure project. Phase 2 is what runs today; the rest is sequenced in OpenSpec changes with concrete tasks.

🚀 The bigger bet

The bet is that separating cognitive reasoning from infrastructure adapters, and treating safety as a non-negotiable constraint rather than a feature, can bridge the gap between "impressive AI demos" and Tier-0 infrastructure automation that an enterprise actually trusts.

Phase 3 is where the agent stops asking for approval on Sev 3-4. Phase 4 is where it stops waiting for thresholds to be breached. Both require the foundation that Phase 2 lays — strict hexagonal boundaries, immutable audit trails, and graduation criteria that aren't negotiable.

Source code, OpenSpec changes, and ADRs: github.com/faizanhussainrabbani/autonomous-sre-agent

If you're working in platform engineering, AI infrastructure, or SRE, I'd love feedback on the architecture and safety patterns. What guardrails would you add? What would you take out? Drop a comment.
