DEV Community

Faizan Hussain Rabbani

Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation

Modern Site Reliability Engineering (SRE) teams manage hundreds of microservices with complex interdependencies. When an incident occurs, engineers must manually query multiple observability backends, correlate signals across layers, consult historical post-mortems, and execute runbooks. This manual process leads to high Mean Time to Recovery (MTTR), alert fatigue, and operational toil.

To solve this, I built the Autonomous SRE Agent—an AI-powered reliability system that executes the full incident loop (detect → investigate → diagnose → remediate → learn).

Unlike simplistic AI wrappers that execute LLM outputs blindly, this agent is built on a rigorous Hexagonal Architecture with hard-coded safety guardrails, ensuring that autonomy is earned through a strict phased rollout, rather than granted by default.

Here is a deep dive into the purpose, architecture, and implementation of the Autonomous SRE Agent.


🎯 Purpose and Core Capabilities

The Autonomous SRE Agent is designed to automate the triage and remediation of well-understood infrastructure incidents. The goal is to cut MTTR by bringing diagnostic latency under 30 seconds.

Currently, the agent supports end-to-end autonomous resolution for five critical incident types:

  • OOM Kills: Triggered by memory pressure >85% for >5 min. Remediated via Pod restarts.
  • High Latency: Triggered by p99 latency more than 3σ above its rolling baseline for >2 min. Remediated via HPA scale-up.
  • Error Rate Spikes: Triggered by >200% error surges. Remediated via GitOps deployment rollbacks.
  • Disk Exhaustion: Triggered by >80% usage with a 24h projection. Remediated via log truncation.
  • Certificate Expiry: Triggered when a cert expires within 14 days. Remediated via cert renewal triggers.
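Most of these triggers share one shape: a value must stay past a threshold for a minimum duration. A minimal sketch of that check, using the OOM rule (memory >85% for >5 min) as the example, might look like the following. The `Sample` type and the `sustained_breach` function are hypothetical names chosen for illustration, not the agent's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    at: datetime   # when the measurement was taken
    value: float   # e.g. memory utilisation as a fraction (0.0 to 1.0)


def sustained_breach(samples: list[Sample], threshold: float, window: timedelta) -> bool:
    """Return True if the value stayed above `threshold` for at least `window`.

    Samples are assumed to be sorted by time, oldest first.
    """
    breach_started = None
    for s in samples:
        if s.value > threshold:
            if breach_started is None:
                breach_started = s.at  # first sample of a new streak
            if s.at - breach_started >= window:
                return True
        else:
            breach_started = None  # the streak is broken; start over
    return False


start = datetime(2024, 1, 1, 12, 0)
# One sample per minute for 7 minutes, all at 90% memory utilisation.
readings = [Sample(start + timedelta(minutes=i), 0.90) for i in range(7)]
print(sustained_breach(readings, threshold=0.85, window=timedelta(minutes=5)))
```

Requiring an unbroken streak is one reasonable reading of "for >5 min"; a production rule might tolerate brief dips instead.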

🏗️ System Architecture: The Five Layers

The system is designed as a sophisticated, layered processing pipeline that translates raw infrastructure telemetry into actionable incidents and safe remediations.

1. Observability Layer (Ingestion)

This layer gathers high-fidelity telemetry. It connects to OpenTelemetry (OTel) for application-level metrics, distributed traces, and structured logs, and utilizes eBPF for deep kernel-level visibility (network flows, syscalls) with minimal overhead. It continuously analyzes trace spans to build a real-time Service Dependency Graph, which is vital for calculating the blast radius of any future remediation.
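To make the dependency-graph idea concrete, here is a small sketch that derives caller/callee edges from trace spans and walks them to find every service that could be affected if one service is disrupted. The span format (a plain `(caller, callee)` tuple) and the service names are simplifying assumptions; real OTel spans carry far more structure.

```python
from collections import defaultdict, deque

# Hypothetical, simplified spans: (caller_service, callee_service).
spans = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "ledger-db"),
]


def build_dependents(spans):
    """Map each service to the set of services that call it directly."""
    dependents = defaultdict(set)
    for caller, callee in spans:
        dependents[callee].add(caller)
    return dependents


def blast_radius(service, dependents):
    """Return every service that transitively depends on `service`."""
    seen, queue = set(), deque([service])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


graph = build_dependents(spans)
print(sorted(blast_radius("ledger-db", graph)))
```

A breadth-first walk over reversed edges is only one way to define blast radius; weighting edges by call volume would be a natural refinement.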

2. Detection Layer

Moving away from static thresholds, this layer computes rolling statistical baselines. It uses machine-learning models (such as Isolation Forests) to detect multi-dimensional anomalies (e.g., a latency spike correlated with an error surge). The Alert Correlation Engine uses the dependency graph to group related anomalies into a single, deduplicated Incident.
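A rolling baseline can be illustrated with a plain z-score check over a sliding window. This is a deliberately simple stand-in for the statistical part of the layer, not the Isolation Forest model; the class name, window size, and 3σ default are assumptions for the sketch.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that deviate from a sliding-window mean by more than N sigma."""

    def __init__(self, window: int = 60, sigmas: float = 3.0):
        self.values = deque(maxlen=window)  # oldest values fall off automatically
        self.sigmas = sigmas

    def is_anomalous(self, value: float) -> bool:
        """Compare `value` to the current baseline, then fold it into the window."""
        anomalous = False
        if len(self.values) >= 2:  # stdev needs at least two points
            mu, sd = mean(self.values), stdev(self.values)
            anomalous = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.values.append(value)
        return anomalous


baseline = RollingBaseline(window=30)
for latency_ms in [100, 102, 98, 101, 99] * 4:
    baseline.is_anomalous(latency_ms)      # warm up on normal traffic
print(baseline.is_anomalous(250))          # far outside the learned band
```

Note that this sketch folds anomalous values back into the window, which lets a long incident drag the baseline upward; excluding flagged samples is a common alternative.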

3. Intelligence Layer (The Cognitive Brain)

This layer acts as the diagnostic engine, utilizing a multi-stage Retrieval-Augmented Generation (RAG) pipeline.

  • It embeds incoming anomaly alerts and performs semantic similarity searches against a Vector Database containing historical post-mortems and runbooks.
  • It feeds this grounded context to an LLM (Anthropic Claude or OpenAI GPT-4o) to generate a root-cause hypothesis.
  • A Second-Opinion Validator cross-checks the LLM's logic to prevent hallucinations, while a Confidence Scorer evaluates the structural evidence.
  • Advanced Token Optimization techniques—such as cross-encoder reranking, semantic diagnostic caching, and LLMLingua evidence compression—reduce context window bloat and lower API costs.
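The retrieval step of such a pipeline reduces to ranking stored documents by similarity to the embedded alert. The sketch below shows that ranking with cosine similarity over toy three-dimensional vectors held in memory. The document titles and vectors are invented for illustration; in a real system the vectors would come from an embedding model and the search would run inside the vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, documents, k=2):
    """Return the k (score, title) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), title) for title, vec in documents]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones come from an embedding model.
runbooks = [
    ("Runbook: OOM kill on checkout pods", [0.9, 0.1, 0.0]),
    ("Post-mortem: expired TLS certificate", [0.0, 0.2, 0.9]),
    ("Runbook: disk exhaustion from logs", [0.1, 0.9, 0.1]),
]
alert_vec = [0.8, 0.2, 0.1]  # stands in for the embedded alert text

for score, title in top_k(alert_vec, runbooks):
    print(f"{score:.2f}  {title}")
```

The retrieved titles (and their text) would then be placed in the LLM prompt as grounding context.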

4. Action & Guardrails Layer

This is the "final mile" where the AI interfaces with reality. To prevent catastrophic actions, it relies on a Safety Guardrails Engine.
Remediations are routed through strict policies (e.g., "never restart more than 10% of the fleet globally"). Actions are executed via idempotent cloud APIs or GitOps Pull Requests (using ArgoCD/Flux) for deployment rollbacks. A Post-Remediation Monitor tracks metrics and triggers auto-rollbacks if the system degrades further.
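A guardrail like the 10% rule can be expressed as a pure function that runs before any action is dispatched. The sketch below assumes a hypothetical `RestartRequest` shape and returns a decision plus a human-readable reason for the audit log; it is an illustration of the policy, not the engine's real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartRequest:
    targets: int      # pods the agent wants to restart
    fleet_size: int   # total pods in the fleet


MAX_FLEET_FRACTION = 0.10  # from the example policy in the text


def check_blast_radius(req: RestartRequest) -> tuple[bool, str]:
    """Deny any restart that would touch more than the allowed fleet fraction."""
    if req.fleet_size <= 0:
        return False, "denied: unknown fleet size"  # fail closed
    fraction = req.targets / req.fleet_size
    if fraction > MAX_FLEET_FRACTION:
        return False, f"denied: {fraction:.0%} of fleet exceeds {MAX_FLEET_FRACTION:.0%} limit"
    return True, "allowed"


print(check_blast_radius(RestartRequest(targets=5, fleet_size=100)))
print(check_blast_radius(RestartRequest(targets=25, fleet_size=100)))
```

Keeping the check free of I/O makes it easy to unit-test exhaustively, which matters for code that gates infrastructure mutations.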

5. Orchestration & Operator Layer

To ensure the AI is never a "black box," the Operator Layer provides a React/Next.js dashboard. It features a real-time incident timeline, confidence score breakdowns, and ChatOps integrations (Slack/MS Teams) for "Human-in-the-Loop" approvals on Severity 1 and 2 incidents. The Orchestration layer also manages Multi-Agent Coordination using distributed locks (Redis/etcd) to prevent conflicts between the SRE agent and other autonomous entities (like FinOps or SecOps agents).
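The multi-agent locking idea can be sketched without Redis or etcd. The in-memory class below is a stand-in written for illustration only: it grants a lock per resource with a time-to-live and hands out a monotonically increasing fencing token, so a downstream system could reject writes from a holder whose lock has since expired. All names are hypothetical.

```python
import itertools


class InMemoryLockTable:
    """Stand-in for a Redis/etcd lock service, for illustration only."""

    def __init__(self):
        self._owners = {}                   # resource -> (agent, expires_at)
        self._tokens = itertools.count(1)   # monotonically increasing fencing tokens

    def acquire(self, resource, agent, now, ttl=30.0):
        """Return a fencing token, or None if another agent holds a live lock."""
        holder = self._owners.get(resource)
        if holder and holder[0] != agent and holder[1] > now:
            return None
        self._owners[resource] = (agent, now + ttl)
        return next(self._tokens)

    def release(self, resource, agent):
        """Release the lock only if `agent` is the current holder."""
        if self._owners.get(resource, (None,))[0] == agent:
            del self._owners[resource]


locks = InMemoryLockTable()
print(locks.acquire("deploy/checkout", "sre-agent", now=0.0))      # token granted
print(locks.acquire("deploy/checkout", "finops-agent", now=10.0))  # blocked
print(locks.acquire("deploy/checkout", "finops-agent", now=31.0))  # lock expired
```

Time is passed in explicitly to keep the sketch deterministic; a real lock service would use server-side expiry.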


🧱 Design Philosophy: Hexagonal & Safety-First

Strict Hexagonal Architecture (Ports & Adapters)

The most critical architectural decision (ADR-001) was adopting Hexagonal Architecture. The core domain logic (domain/) never imports external SDKs like kubernetes, boto3, or openai directly.
Instead, it relies on abstract interfaces (ports/), which are implemented by swappable adapters/ (e.g., OtelProvider, CloudWatchLogsAdapter, PostgresIncidentStore). This guarantees that the core reasoning engine remains highly portable across AWS, Azure, and Kubernetes.
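In Python, a port can be written as a `typing.Protocol` and an adapter as any class with matching methods. The sketch below is an assumption about how such a boundary might look: `IncidentStore`, `InMemoryIncidentStore`, and `record_diagnosis` are illustrative names, and the method signatures are not taken from the project.

```python
from typing import Optional, Protocol


class IncidentStore(Protocol):
    """Port: what the domain needs from persistence, and nothing more."""

    def save(self, incident_id: str, summary: str) -> None: ...
    def get(self, incident_id: str) -> Optional[str]: ...


class InMemoryIncidentStore:
    """Adapter for tests; a Postgres-backed adapter would expose the same methods."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, incident_id: str, summary: str) -> None:
        self._rows[incident_id] = summary

    def get(self, incident_id: str) -> Optional[str]:
        return self._rows.get(incident_id)


def record_diagnosis(store: IncidentStore, incident_id: str, hypothesis: str) -> str:
    """Domain logic: depends only on the port, never on a concrete SDK."""
    summary = f"{incident_id}: {hypothesis}"
    store.save(incident_id, summary)
    return summary


store = InMemoryIncidentStore()
print(record_diagnosis(store, "INC-42", "memory leak in checkout"))
```

Because the domain function accepts anything that satisfies the protocol, swapping the in-memory adapter for a database-backed one requires no change to domain code.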

Autonomy Earned, Not Granted

The agent utilizes a strict Phased Rollout State Machine to build operator trust.

  1. Phase 1 (Observe): The agent runs in shadow mode, analyzing telemetry and writing its intended actions to an audit log without executing them.
  2. Phase 2 (Assist): The agent diagnoses incidents and proposes remediation plans via Slack/PagerDuty, requiring human approval (Human-in-the-Loop) to proceed.
  3. Phase 3 (Autonomous): After demonstrating sustained high diagnostic accuracy in the earlier phases, the agent is granted permission to autonomously execute actions for lower-severity (Sev 3-4) incidents, strictly bound by blast-radius limits.
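The three phases above can be sketched as a small state machine with two functions: one that decides what the agent may do with a proposed remediation, and one that promotes the agent by a single phase when its measured accuracy clears a bar. The 0.95 accuracy bar and the returned action strings are assumptions made for this example.

```python
from enum import Enum


class Phase(Enum):
    OBSERVE = 1
    ASSIST = 2
    AUTONOMOUS = 3


def decide(phase: Phase, severity: int) -> str:
    """Return what the agent may do with a proposed remediation."""
    if phase is Phase.OBSERVE:
        return "log_only"            # shadow mode: audit log only
    if phase is Phase.ASSIST or severity <= 2:
        return "request_approval"    # Sev 1-2 always need a human
    return "execute"                 # Sev 3-4 in the autonomous phase


def next_phase(phase: Phase, diagnostic_accuracy: float, required: float = 0.95) -> Phase:
    """Promote by one phase only when measured accuracy meets the bar."""
    if phase is Phase.AUTONOMOUS or diagnostic_accuracy < required:
        return phase
    return Phase(phase.value + 1)


print(decide(Phase.AUTONOMOUS, severity=3))
print(next_phase(Phase.OBSERVE, diagnostic_accuracy=0.97))
```

Promotion moves one step at a time, so an agent cannot jump from shadow mode straight to autonomous execution.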

🛠️ Implementation & Technology Stack

The implementation uses the following stack:

  • Language & Core: Python 3.11+ using FastAPI for async-first API surfaces and Pydantic v2 for strict runtime validation of canonical data models.
  • Persistence (ADR-006): I consolidated our data strategy around PostgreSQL. It serves as the primary operational store, utilizing the TimescaleDB extension for high-volume telemetry metrics and the pgvector extension for vector embeddings indexed with HNSW.
  • Eventing & Coordination: Redis Streams serves as the internal event bus, fed by a transactional outbox that gives at-least-once delivery, so audit events are not lost (consumers must tolerate duplicates). Redis and etcd manage the distributed locking for multi-agent fencing.
  • Testing Rigor: Because this agent mutates infrastructure, it requires a strict 60/30/10 test pyramid. I maintain 100% code coverage on domain logic, relying heavily on Testcontainers and LocalStack Pro to emulate AWS infrastructure dynamically during integration tests.
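The transactional outbox mentioned under Eventing & Coordination can be illustrated with the standard-library `sqlite3` module standing in for PostgreSQL and a plain callback standing in for Redis Streams. The table layout, function names, and event ID format below are assumptions for the sketch. The key points are that the state change and the event row commit together, and that the consumer deduplicates because delivery is at-least-once.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE incidents (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_id TEXT UNIQUE, payload TEXT,
                         published INTEGER DEFAULT 0);
""")


def resolve_incident(incident_id):
    # State change and event row are committed in ONE transaction,
    # so the audit event cannot be lost relative to the state change.
    with db:
        db.execute("INSERT OR REPLACE INTO incidents VALUES (?, 'resolved')", (incident_id,))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (f"{incident_id}:resolved", "incident resolved"))


def relay(publish):
    """Publish pending events; a crash before the UPDATE causes a re-send."""
    rows = db.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))


seen, audit_log = set(), []


def idempotent_consumer(event_id, payload):
    """Ignore events already processed, since delivery is at-least-once."""
    if event_id in seen:
        return
    seen.add(event_id)
    audit_log.append(payload)


resolve_incident("INC-1")
relay(idempotent_consumer)
relay(idempotent_consumer)  # nothing pending; no duplicate publish
print(audit_log)
```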

🚀 The Path Forward

The Autonomous SRE Agent is moving rapidly toward Phase 4: Predictive Capabilities. In this future state, the agent will monitor long-term degradation trends and automatically scale systems or recommend architectural shifts before an anomaly threshold is ever breached.

By separating the cognitive reasoning engine from the infrastructure adapters, and by making safety the ultimate non-negotiable constraint, I am bridging the gap between "impressive AI demos" and true, enterprise-ready Tier-0 infrastructure automation.


If you are working in platform engineering, AI infrastructure, or SRE, I would love to hear your feedback on the architecture and safety patterns in the comments!
