Faizan Hussain Rabbani

Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation

Modern SRE teams manage many microservices with tangled interdependencies. When something breaks, engineers manually query multiple observability backends, correlate signals across layers, dig through historical post-mortems, and execute runbooks under pressure. The result: long MTTR, alert fatigue, and toil.

So I built the Autonomous SRE Agent — an AI-driven reliability system that runs the detect → investigate → diagnose → remediate → learn loop, with a phased path toward full autonomy.

The agent is built on strict hexagonal architecture with hard-coded safety guardrails — autonomy is earned through measurable graduation criteria, not granted by default. A clear "what's next" section at the end covers the design-stage components honestly, separated from running code.


🎯 Scope today

The agent automates detection, diagnosis, and remediation of infrastructure incidents, with a target of sub-30-second diagnostic latency. Five incident categories are detected and diagnosed; two have a fully executable end-to-end remediation path in shipped code:

| Incident category | Trigger | Remediation | Status |
| --- | --- | --- | --- |
| OOM kill | Memory pressure > 85% for 5+ min | Pod / container restart | ✅ End-to-end |
| High latency | p99 above rolling baseline + sigma threshold | HPA scale-up | ✅ End-to-end |
| Error-rate spike | Surge over baseline | GitOps revert | 🟡 Detect + diagnose only |
| Disk exhaustion | High utilization with projected breach | Log truncation | 🟡 Detect + diagnose only |
| Cert expiry | Cert expires within configured window | Cert rotation | 🟡 Detect + diagnose only |

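The 🟡 rows are fully detected, classified, and diagnosed. Their remediation strategies are defined, but no execution adapter has shipped for them yet (see the roadmap at the end).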


🏗️ The architecture

A layered pipeline transforms raw telemetry into safe, verifiable remediations. Six layers, each with a clear contract:

Figure: Autonomous SRE Agent system architecture, showing six layers from observability ingestion through detection, intelligence, governance, action, and data persistence.

1. Observability / Ingestion — vendor-agnostic by design

This layer gathers high-fidelity telemetry through ports that the domain depends on, with swappable adapters implementing them. Five concrete telemetry providers are shipped:

  • OpenTelemetry for app metrics, traces, and structured logs (composite of Prometheus, Jaeger, Loki adapters)
  • CloudWatch as an AWS-native composite (metrics, logs, X-Ray)
  • New Relic via NerdGraph GraphQL — full implementation of all four query ports
  • Pixie eBPF for kernel-level visibility (syscalls, network flows, process activity) via PxL scripts, running as a Kubernetes DaemonSet
  • Kubernetes pod logs for container-level evidence

A continuous dependency-graph service consumes trace data to build a real-time service topology, which is essential for calculating blast radius before any remediation runs.
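To make the blast-radius idea concrete, here is a minimal sketch of walking a service topology to find everything that depends, directly or transitively, on a given service. The graph shape, the `blast_radius` function, and the service names are illustrative assumptions, not the agent's actual dependency-graph service.

```python
from collections import deque

# Hypothetical topology: each key calls the services in its list.
CALLS = {
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(calls: dict[str, list[str]], target: str) -> set[str]:
    """Return every service that directly or transitively calls `target`."""
    # Invert the call graph so we can walk from a callee to its callers.
    callers: dict[str, set[str]] = {svc: set() for svc in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for upstream in callers.get(current, ()):
            if upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    return affected


impacted = blast_radius(CALLS, "inventory")
```

In this toy graph, touching `inventory` puts `checkout`, `search`, and `frontend` in scope, which is the kind of answer a guardrail needs before any remediation runs.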

2. Detection — threshold-free anomalies

Static thresholds are dead. The detection layer computes rolling statistical baselines, segmented by time-of-day and day-of-week, then evaluates deviations. Concretely:

  • BaselineService maintains per-(service, metric) baselines that adjust for daily and weekly patterns
  • AnomalyDetector runs detection rules across the canonical AnomalyType enum (memory pressure, latency spike, error-rate surge, traffic anomaly, deployment-induced regression, cert expiry, disk exhaustion)
  • SignalCorrelator joins metrics, logs, traces, and eBPF events into per-service CorrelatedSignals so a single root cause doesn't produce alert storms
  • AlertCorrelationEngine uses the dependency graph to fold related alerts into a single Incident

The output: a canonical, deduplicated Incident ready for the intelligence layer.

3. Intelligence — RAG-grounded reasoning

The diagnostic engine is built around a multi-stage Retrieval-Augmented Generation pipeline:

  1. Incoming incidents are embedded (sentence-transformers) and matched against a vector store of historical post-mortems and runbooks — pgvector with HNSW indexing, falling back to JSONB cosine similarity in environments without the extension.
  2. Retrieved evidence is reranked by a cross-encoder (ms-marco-MiniLM-L-6-v2) for query-aware relevance, then optionally compressed by LLMLingua with SRE-critical force tokens (OOM, kill, etc.) preserved.
  3. A fingerprint-keyed diagnostic cache short-circuits recurring incidents (key: service + anomaly_type + metric) to avoid redundant LLM calls.
  4. The compressed context is fed to an LLM — Claude Sonnet (claude-sonnet-4-20250514) by default — to generate a root-cause hypothesis.
  5. A second-opinion validator cross-checks the reasoning. It supports both RULE_BASED and LLM_CROSS_CHECK strategies; the LLM cross-check typically uses gpt-4o-mini as a cheaper, faster reviewer of Claude's output.
  6. The confidence scorer produces a composite score from four weighted components — LLM self-confidence (35%), validator agreement (25%), retrieval relevance (25%), and evidence volume (15%) — so no single signal dominates.

The output: a typed diagnosis with a calibrated composite confidence percentage.
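The weighted composite is simple arithmetic, and a small sketch makes it easy to check. The weights come from the list above; the `ConfidenceInputs` type, the 0 to 1 input scale, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass

# Weights as stated above; they sum to 1.0.
WEIGHTS = {
    "llm_self_confidence": 0.35,
    "validator_agreement": 0.25,
    "retrieval_relevance": 0.25,
    "evidence_volume": 0.15,
}


@dataclass(frozen=True)
class ConfidenceInputs:
    # Each component is assumed to be normalized to the range 0..1.
    llm_self_confidence: float
    validator_agreement: float
    retrieval_relevance: float
    evidence_volume: float


def composite_confidence(inputs: ConfidenceInputs) -> float:
    """Return the weighted composite as a percentage, rounded to 0.1."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(inputs, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return round(total * 100, 1)


score = composite_confidence(ConfidenceInputs(0.9, 1.0, 0.8, 0.5))
```

With these inputs the components contribute 31.5, 25, 20, and 7.5 points, for a composite of 84.0%. Note that a very confident LLM alone can contribute at most 35 points.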

4. Governance — graduation criteria and coordination

Autonomy isn't granted, it's earned. Two pieces of this layer are shipped:

Graduation gate evaluator: PhaseGate.evaluate_graduation() checks five hard criteria continuously:

| Criterion | Threshold |
| --- | --- |
| Diagnostic accuracy | ≥ 90% |
| Destructive false positives | 0 |
| Sev 3-4 autonomous resolution rate | ≥ 95% |
| Remediation integration coverage | ≥ 30% |
| Clean soak-test days | ≥ 7 |

Phase transitions today are an operator decision informed by this evaluator's output (the full state machine is a Phase 3 deliverable).
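The five criteria translate naturally into a pure function that reports which gates failed. This is a sketch under assumptions: the thresholds are the ones in the table above, but `GraduationMetrics`, the field names, and `evaluate_graduation_sketch` are hypothetical and do not reflect the real `PhaseGate.evaluate_graduation()` signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraduationMetrics:
    diagnostic_accuracy: float                # fraction, 0..1
    destructive_false_positives: int
    sev34_autonomous_resolution_rate: float   # fraction, 0..1
    remediation_integration_coverage: float   # fraction, 0..1
    clean_soak_days: int


def evaluate_graduation_sketch(m: GraduationMetrics) -> tuple[bool, list[str]]:
    """Return (ready, failed_criteria) using the thresholds from the table."""
    checks = {
        "diagnostic_accuracy": m.diagnostic_accuracy >= 0.90,
        "destructive_false_positives": m.destructive_false_positives == 0,
        "sev34_autonomous_resolution_rate": m.sev34_autonomous_resolution_rate >= 0.95,
        "remediation_integration_coverage": m.remediation_integration_coverage >= 0.30,
        "clean_soak_days": m.clean_soak_days >= 7,
    }
    failures = [name for name, passed in checks.items() if not passed]
    return (not failures, failures)


ready, failed = evaluate_graduation_sketch(
    GraduationMetrics(0.93, 0, 0.91, 0.40, 9)
)
```

Returning the list of failed criteria, not just a boolean, gives the operator who makes the phase decision something concrete to act on.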

Distributed lock manager: Redis, etcd, and in-memory adapters all implement DistributedLockManagerPort, so the agent acquires resource-scoped locks before remediation and releases them after verification. This is the foundation for multi-agent coordination (FinOps, SecOps) once a second agent enters the picture.
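The acquire, act, release pattern can be sketched with a port and an in-memory adapter. The method names and signatures below are assumptions; the article does not show the real DistributedLockManagerPort interface, and a production adapter would also need expiry and fencing.

```python
from typing import Callable, Protocol


class LockManagerPort(Protocol):
    """Hypothetical port; the real interface may differ."""

    def acquire(self, resource: str, owner: str) -> bool: ...
    def release(self, resource: str, owner: str) -> bool: ...


class InMemoryLockManager:
    """Single-process stand-in, suitable only for tests."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def acquire(self, resource: str, owner: str) -> bool:
        holder = self._owners.get(resource)
        if holder is not None and holder != owner:
            return False  # someone else holds the lock
        self._owners[resource] = owner
        return True

    def release(self, resource: str, owner: str) -> bool:
        if self._owners.get(resource) != owner:
            return False  # only the holder may release
        del self._owners[resource]
        return True


def remediate_with_lock(
    locks: LockManagerPort, resource: str, owner: str, action: Callable[[], None]
) -> bool:
    """Run `action` only while holding the lock; always release afterwards."""
    if not locks.acquire(resource, owner):
        return False
    try:
        action()
        return True
    finally:
        locks.release(resource, owner)


locks = InMemoryLockManager()
ran = remediate_with_lock(locks, "deployment/checkout", "sre-agent", lambda: None)
```

Because the helper depends only on the port, swapping the in-memory adapter for a Redis or etcd one would not change the calling code.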

5. Action & Guardrails — the final mile

This is where the AI meets reality. Every remediation passes through hard-coded safety gates before execution:

Safety gates (Python-enforced, not OPA):

  • BlastRadiusCalculator — validates affected_pods_percentage against the plan's max_blast_radius_percentage, with a hard "scale-up ≤ 2× current replicas" override
  • KillSwitch — global toggle that gates every autonomous action; emits domain events on activation
  • CooldownEnforcer — prevents the same service from being remediated repeatedly in a short window
  • GuardrailOrchestrator — composes the above and produces a single GuardrailResult(allowed, reason)

Compute executors (idempotent, behind CloudOperatorPort):

| Platform | Adapter | Operations |
| --- | --- | --- |
| Kubernetes | kubernetes/operator.py | Pod restart, deployment scale |
| AWS ECS | aws/ecs_operator.py | Service update, task restart |
| AWS EC2 ASG | aws/ec2_asg_operator.py | Instance refresh, scale |
| AWS Lambda | aws/lambda_operator.py | Concurrency adjust, redeploy |
| Azure App Service | azure/app_service_operator.py | Restart, scale-out |
| Azure Functions | azure/functions_operator.py | Plan adjustment |

After execution, RemediationVerifier compares post-action metrics against baseline within a sigma tolerance. A failed verification returns a FAILED status that the engine surfaces; automatic rollback on failure is planned for a future phase.
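The verification step can be pictured as a small comparison against the baseline. In this sketch the `VerificationStatus` enum, the `verify_recovery` function, and the default tolerance of 2 sigma are assumptions; only the idea of checking post-action metrics against baseline within a sigma tolerance comes from the text.

```python
from enum import Enum
from statistics import mean


class VerificationStatus(Enum):
    VERIFIED = "VERIFIED"
    FAILED = "FAILED"


def verify_recovery(
    post_action_samples: list[float],
    baseline_mean: float,
    baseline_std: float,
    sigma_tolerance: float = 2.0,  # assumed default
) -> VerificationStatus:
    """Pass only if the post-action mean is back within tolerance of baseline."""
    if not post_action_samples:
        return VerificationStatus.FAILED  # no evidence counts as failure
    deviation = abs(mean(post_action_samples) - baseline_mean)
    if deviation <= sigma_tolerance * baseline_std:
        return VerificationStatus.VERIFIED
    return VerificationStatus.FAILED


# Memory utilization (%) after a pod restart, against a baseline of 60 ± 5.
recovered = verify_recovery([58.0, 61.0, 63.0], baseline_mean=60.0, baseline_std=5.0)
still_bad = verify_recovery([88.0, 90.0, 91.0], baseline_mean=60.0, baseline_std=5.0)
```

Treating missing samples as FAILED is a deliberate choice in this sketch: a remediation that cannot be verified should not be counted as a success.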

6. Data & Persistence — one source of truth

Consolidated around PostgreSQL (ADR-006), with ten migrations in src/sre_agent/adapters/persistence/migrations/:

  • PostgreSQL is the system of record, with the transactional outbox pattern (postgres_outbox.py + outbox_relay.py) and a processed_events table for idempotent consumers
  • TimescaleDB extension handles the vector_embeddings and metrics hypertables (migration 002, with continuous-aggregate tuning in migration 009)
  • pgvector stores embeddings with HNSW indexing, including SET LOCAL hnsw.ef_search = 100 per query for recall/latency tuning, plus a JSONB fallback for dev environments
  • Redis Streams is the internal event bus (XADD / XREADGROUP with consumer groups, at-least-once delivery)
  • Every LLM prompt, retrieval, tool call, and decision lands in an immutable reasoning trace via reasoning_trace_store.py (agent_runs, tool_calls, retrieved_contexts) — the foundation for blameless post-mortems
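The outbox and processed_events ideas are easier to see in a few lines of code. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the table columns and function names are assumptions, not the contents of postgres_outbox.py or the shipped migrations.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL in this sketch
conn.executescript(
    """
    CREATE TABLE incidents (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    """
)


def record_incident(incident_id: str, event_id: str) -> None:
    """Write the state change and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO incidents VALUES (?, ?)", (incident_id, "open"))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps({"incident_id": incident_id})),
        )


def handle_once(event_id: str, handler: Callable[[], None]) -> bool:
    """Run `handler` only the first time `event_id` is seen."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, skip
        handler()  # if this raises, the marker row is rolled back too
        return True


record_incident("inc-1", "evt-1")
calls: list[str] = []
first = handle_once("evt-1", lambda: calls.append("diagnose"))
second = handle_once("evt-1", lambda: calls.append("diagnose"))
```

This is why at-least-once delivery on the event bus is acceptable: a redelivered event hits the processed_events check and is skipped.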

🔄 The end-to-end flow

Here's what happens from the moment a metric goes sideways to the moment the agent verifies recovery:

Figure: End-to-end incident flow showing nine numbered steps from telemetry ingestion through detection, RAG retrieval, LLM diagnosis, validation, severity classification, safety gates, remediation execution, and post-action verification, with an immutable audit log capturing every step.

Two things worth highlighting:

  1. Severity classification is deterministic, not LLM-driven. The SeverityClassifier combines multi-dimensional impact scoring with hard rules and service-tier metadata — so the same incident always gets the same Sev. This matters for audit and graduation criteria.
  2. Verification is structural, not aspirational. RemediationVerifier compares post-action metrics against the same baseline that triggered detection. If the agent restarts a pod and the OOM signal doesn't subside within sigma tolerance, the result is recorded as FAILED, which leaves no ambiguity for the operator reviewing it.
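To show what "deterministic, not LLM-driven" can look like, here is a toy classifier built from a hard rule plus additive impact scoring. Every field, threshold, and tier value below is a made-up assumption for illustration; the real SeverityClassifier's dimensions and rules are not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    service_tier: int          # 0 = most critical (hypothetical scale)
    error_rate: float          # fraction of failing requests, 0..1
    affected_users_pct: float  # 0..100


def classify_severity(impact: Impact) -> int:
    """Map an impact to Sev 1..4 with fixed rules, so results are repeatable."""
    # Hard rule: a tier-0 service failing half its requests is always Sev 1.
    if impact.service_tier == 0 and impact.error_rate >= 0.5:
        return 1

    score = 0
    score += 2 if impact.error_rate >= 0.25 else 1 if impact.error_rate >= 0.05 else 0
    score += (
        2 if impact.affected_users_pct >= 50
        else 1 if impact.affected_users_pct >= 10
        else 0
    )
    score += 1 if impact.service_tier <= 1 else 0

    if score >= 4:
        return 2
    if score >= 2:
        return 3
    return 4


sev = classify_severity(Impact(service_tier=1, error_rate=0.3, affected_users_pct=60.0))
```

Because the function has no randomness and no model call, replaying the same impact during an audit always yields the same severity.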

🧱 Design philosophy

Hexagonal architecture (Ports & Adapters)

The single most important architectural decision (ADR-001) was strict hexagonal layering. The domain/ package never imports kubernetes, boto3, anthropic, or openai directly. It depends only on abstract ports/, and every external interaction goes through an adapter — OtelProvider, CloudWatchProvider, PixieAdapter, AnthropicLLMAdapter, OpenAILLMAdapter, PgVectorAdapter, RedisStreamsEventBus, KubernetesOperator, and so on.

This isn't theoretical purity — it pays off concretely:

  • The reasoning engine runs unchanged on AWS, Azure, and on-prem Kubernetes
  • Every adapter is replaceable in tests with an in-memory fake — see adapters/coordination/in_memory_lock_manager.py and the JSONB fallback in the pgvector adapter
  • No LangChain, no LlamaIndex, no framework lock-in: direct SDK calls with custom Pydantic dataclasses tailored to SRE diagnostics
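The port-and-adapter boundary can be shown in miniature. In this sketch `LLMPort`, `FakeLLMAdapter`, and `diagnose` are hypothetical names and signatures, not the project's actual ports; the point is only that the domain function never imports a vendor SDK.

```python
from typing import Protocol


class LLMPort(Protocol):
    """Hypothetical port the domain depends on."""

    def complete(self, prompt: str) -> str: ...


class FakeLLMAdapter:
    """In-memory fake for unit tests; records prompts, returns a canned reply."""

    def __init__(self, canned_reply: str) -> None:
        self.canned_reply = canned_reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_reply


def diagnose(llm: LLMPort, service: str, anomaly_type: str, evidence: list[str]) -> str:
    """Domain logic: builds the prompt and delegates to whatever adapter is wired in."""
    prompt = (
        f"Service: {service}\n"
        f"Anomaly: {anomaly_type}\n"
        "Evidence:\n" + "\n".join(f"- {item}" for item in evidence) + "\n"
        "State the most likely root cause in one sentence."
    )
    return llm.complete(prompt).strip()


fake = FakeLLMAdapter("  Memory leak in the cache layer.\n")
hypothesis = diagnose(fake, "checkout", "memory_pressure", ["OOMKilled x3 in 10 min"])
```

A real adapter wrapping the anthropic or openai SDK would satisfy the same protocol, so the test and the production wiring exercise identical domain code.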

Autonomy earned, not granted

The agent is currently in Phase 2 (Assist). Diagnosis runs autonomously; remediation runs through the guardrail stack with approval state tracked in RemediationPlan.approval_state. The graduation evaluator continuously checks whether the agent has earned the right to advance:

| Phase | What runs | What's mutated | Status |
| --- | --- | --- | --- |
| 1 · Shadow | Detection + diagnosis | Nothing (actions written to audit log) | ✅ Shipped |
| 2 · Assist | Diagnosis + plan generation, guardrail-gated execution | OOM and latency remediation against shipped compute executors | ✅ Current phase |
| 3 · Autonomous | Fully autonomous Sev 3-4 execution, automatic phase transitions | Subject to blast-radius limits and kill switch | 🟡 In design |
| 4 · Predictive | Trend analysis, pre-emptive action | Slow-moving issues caught before threshold breach | 🟡 Roadmap |

Graduation requires demonstrated diagnostic accuracy over a sustained window — not a manager's gut feel.


🛠️ Implementation stack

| Concern | Choice | Why |
| --- | --- | --- |
| Language | Python 3.11+ | Mature AI/ML ecosystem, async-native |
| API surface | FastAPI | Async-first, OpenAPI by default |
| Data models | Pydantic v2 | Strict runtime validation of canonical types |
| LLM SDKs | anthropic + openai (direct) | No LangChain; domain models stay in Pydantic, no impedance mismatch |
| Embeddings | sentence-transformers | Local, deterministic, no API tax |
| Operational store | PostgreSQL + TimescaleDB | One database for OLTP, time-series, and vectors |
| Vector store | pgvector (HNSW) + JSONB fallback | Same DB, ACID transactions, no extra ops surface |
| Event bus | Redis Streams | At-least-once with consumer groups, simple ops |
| Coordination | Redis + etcd | Distributed locks for multi-agent fencing |
| Testing | Testcontainers + LocalStack Pro | True integration tests against ephemeral cloud |

The testing approach matters: because this agent mutates infrastructure, contract tests run against real (containerized) backends. In-memory fakes are reserved for unit tests against domain logic.


🚧 What's next — honest roadmap

To stay honest, here's what's designed but not yet shipped. These are the things you might expect from the architecture diagram that aren't running today:

  • Operator dashboard — Next.js SPA with real-time incident feed, confidence visualization, phase tracker. OpenSpec proposal at openspec/changes/phase-2-7-operator-dashboard/; zero tasks complete.
  • ChatOps adapters — Slack / Teams Block Kit approvals, PagerDuty escalation, Jira auto-ticketing. Directories specified, not yet implemented.
  • GitOps executor — ArgoCD / PyGithub adapter for rollback PRs. GITOPS_REVERT strategy is defined; no execution adapter yet.
  • Cert-manager executor: CERTIFICATE_ROTATION strategy mapped, no adapter to invoke cert-manager.
  • Log-truncation executor: LOG_TRUNCATION strategy mapped, no shipped adapter.
  • Phase state machine — the gate evaluator is shipped, but automatic phase transitions and a persisted state machine are Phase 3 work.
  • Oscillation detector — multi-agent coordination has the lock primitive; oscillation detection between SRE/FinOps/SecOps agents is a future requirement when a second agent exists.
  • WebSocket / SSE stream — REST endpoints are shipped; the real-time push channel for the dashboard is part of Phase 2.7.

I'm sharing this roadmap explicitly because the gap between "impressive architecture diagram" and "running code" is the credibility test for any AI infrastructure project. Phase 2 is what runs today; the rest is sequenced in OpenSpec changes with concrete tasks.


🚀 The bigger bet

By separating cognitive reasoning from infrastructure adapters, and by treating safety as a non-negotiable constraint rather than a feature, the goal is to bridge the gap between "impressive AI demos" and Tier-0 infrastructure automation that an enterprise actually trusts.

Phase 3 is where the agent stops asking for approval on Sev 3-4. Phase 4 is where it stops waiting for thresholds to be breached. Both require the foundation that Phase 2 lays — strict hexagonal boundaries, immutable audit trails, and graduation criteria that aren't negotiable.

Source code, OpenSpec changes, and ADRs: github.com/faizanhussainrabbani/autonomous-sre-agent


If you're working in platform engineering, AI infrastructure, or SRE, I'd love feedback on the architecture and safety patterns. What guardrails would you add? What would you take out? Drop a comment.
