<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Faizan Hussain Rabbani</title>
    <description>The latest articles on DEV Community by Faizan Hussain Rabbani (@faizanhussainrabbani).</description>
    <link>https://dev.to/faizanhussainrabbani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F57791%2Fa5ab492c-5692-4207-a472-f37551b8ef67.jpg</url>
      <title>DEV Community: Faizan Hussain Rabbani</title>
      <link>https://dev.to/faizanhussainrabbani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/faizanhussainrabbani"/>
    <language>en</language>
    <item>
      <title>Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation</title>
      <dc:creator>Faizan Hussain Rabbani</dc:creator>
      <pubDate>Mon, 25 May 2026 08:52:41 +0000</pubDate>
      <link>https://dev.to/faizanhussainrabbani/building-an-autonomous-sre-agent-from-raw-telemetry-to-safe-ai-driven-remediation-2olo</link>
      <guid>https://dev.to/faizanhussainrabbani/building-an-autonomous-sre-agent-from-raw-telemetry-to-safe-ai-driven-remediation-2olo</guid>
      <description>&lt;p&gt;Modern SRE teams manage several microservices with tangled interdependencies. When something breaks, engineers manually query several observability backends, correlate signals across layers, dig through historical post-mortems, and execute runbooks under pressure. The result: long MTTR, alert fatigue, and toil.&lt;/p&gt;

&lt;p&gt;So I built the &lt;strong&gt;&lt;a href="https://github.com/faizanhussainrabbani/autonomous-sre-agent" rel="noopener noreferrer"&gt;Autonomous SRE Agent&lt;/a&gt;&lt;/strong&gt; — an AI-driven reliability system that runs the &lt;strong&gt;detect → investigate → diagnose → remediate → learn&lt;/strong&gt; loop, with a phased path toward full autonomy.&lt;/p&gt;

&lt;p&gt;The agent is built on strict &lt;strong&gt;hexagonal architecture&lt;/strong&gt; with &lt;strong&gt;hard-coded safety guardrails&lt;/strong&gt;: autonomy is &lt;em&gt;earned&lt;/em&gt; through measurable graduation criteria, not granted by default. The "what's next" section at the end lists the design-stage components separately from the running code.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Scope today
&lt;/h2&gt;

&lt;p&gt;The agent automates detection, diagnosis, and remediation of infrastructure incidents, with a target of &lt;strong&gt;sub-30-second diagnostic latency&lt;/strong&gt;. Five incident categories are detected and diagnosed; two have a fully executable end-to-end remediation path in shipped code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident category&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Remediation&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OOM kill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory pressure &amp;gt; 85% for 5+ min&lt;/td&gt;
&lt;td&gt;Pod / container restart&lt;/td&gt;
&lt;td&gt;✅ End-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;p99 above rolling baseline + sigma threshold&lt;/td&gt;
&lt;td&gt;HPA scale-up&lt;/td&gt;
&lt;td&gt;✅ End-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error-rate spike&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Surge over baseline&lt;/td&gt;
&lt;td&gt;GitOps revert&lt;/td&gt;
&lt;td&gt;🟡 Detect + diagnose only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk exhaustion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High utilization with projected breach&lt;/td&gt;
&lt;td&gt;Log truncation&lt;/td&gt;
&lt;td&gt;🟡 Detect + diagnose only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cert expiry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cert expires within configured window&lt;/td&gt;
&lt;td&gt;Cert rotation&lt;/td&gt;
&lt;td&gt;🟡 Detect + diagnose only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 🟡 rows are fully detected, classified, and diagnosed. Their remediation strategies are mapped, but the executors that would run them are not shipped yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ The architecture
&lt;/h2&gt;

&lt;p&gt;A layered pipeline transforms raw telemetry into safe, verifiable remediations. Six layers, each with a clear contract:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42erod72td92scxz2jxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42erod72td92scxz2jxb.png" alt="Autonomous SRE Agent — system architecture, showing six layers from observability ingestion through detection, intelligence, governance, action, and data persistence" width="720" height="1300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Observability / Ingestion — vendor-agnostic by design
&lt;/h3&gt;

&lt;p&gt;This layer gathers high-fidelity telemetry through &lt;strong&gt;ports&lt;/strong&gt; that the domain depends on, with swappable &lt;strong&gt;adapters&lt;/strong&gt; implementing them. Five concrete telemetry providers are shipped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; for app metrics, traces, and structured logs (composite of Prometheus, Jaeger, Loki adapters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; as an AWS-native composite (metrics, logs, X-Ray)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New Relic&lt;/strong&gt; via NerdGraph GraphQL — full implementation of all four query ports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pixie eBPF&lt;/strong&gt; for kernel-level visibility (syscalls, network flows, process activity) via PxL scripts, running as a Kubernetes DaemonSet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes pod logs&lt;/strong&gt; for container-level evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A continuous &lt;strong&gt;dependency-graph service&lt;/strong&gt; consumes trace data to build a real-time service topology, which is essential for calculating blast radius before any remediation runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Detection — threshold-free anomalies
&lt;/h3&gt;

&lt;p&gt;Static thresholds are dead. The detection layer computes &lt;strong&gt;rolling statistical baselines&lt;/strong&gt;, segmented by time-of-day and day-of-week, then evaluates deviations. Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BaselineService&lt;/code&gt;&lt;/strong&gt; maintains per-(service, metric) baselines that adjust for daily and weekly patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AnomalyDetector&lt;/code&gt;&lt;/strong&gt; runs detection rules across the canonical &lt;code&gt;AnomalyType&lt;/code&gt; enum (memory pressure, latency spike, error-rate surge, traffic anomaly, deployment-induced regression, cert expiry, disk exhaustion)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SignalCorrelator&lt;/code&gt;&lt;/strong&gt; joins metrics, logs, traces, and eBPF events into per-service &lt;code&gt;CorrelatedSignals&lt;/code&gt; so a single root cause doesn't produce alert storms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AlertCorrelationEngine&lt;/code&gt;&lt;/strong&gt; uses the dependency graph to fold related alerts into a single &lt;code&gt;Incident&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output: a canonical, deduplicated &lt;code&gt;Incident&lt;/code&gt; ready for the intelligence layer.&lt;/p&gt;
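
&lt;p&gt;To make the baseline idea concrete, here is a minimal Python sketch of a time-segmented baseline with a sigma check. It is not the project's &lt;code&gt;BaselineService&lt;/code&gt;: the class name &lt;code&gt;SeasonalBaseline&lt;/code&gt;, the (weekday, hour) bucketing, the &lt;code&gt;min_samples&lt;/code&gt; parameter, and the 3-sigma cut-off are all assumptions made for illustration.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Optional


class SeasonalBaseline:
    """Hypothetical baseline: one bucket of samples per (weekday, hour)."""

    def __init__(self, min_samples: int = 5) -> None:
        self.min_samples = min_samples
        self._buckets = defaultdict(list)

    @staticmethod
    def _key(ts: datetime) -> tuple:
        # Segment by day-of-week and hour-of-day so Monday 09:00 traffic
        # is only compared with earlier Monday 09:00 traffic.
        return (ts.weekday(), ts.hour)

    def observe(self, ts: datetime, value: float) -> None:
        self._buckets[self._key(ts)].append(value)

    def sigma_deviation(self, ts: datetime, value: float) -> Optional[float]:
        """Return how many standard deviations value sits from the bucket mean."""
        samples = self._buckets.get(self._key(ts), [])
        if self.min_samples > len(samples):
            return None  # not enough history to judge this time slot
        mu, sd = mean(samples), pstdev(samples)
        if sd == 0:
            return 0.0 if value == mu else float("inf")
        return (value - mu) / sd


baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 is a Monday
for week, p99_ms in enumerate([100.0, 102.0, 98.0, 101.0, 99.0]):
    baseline.observe(monday_9am + timedelta(weeks=week), p99_ms)

next_monday = monday_9am + timedelta(weeks=5)
deviation = baseline.sigma_deviation(next_monday, 110.0)
is_anomaly = deviation is not None and deviation > 3.0  # assumed 3-sigma rule
```

&lt;p&gt;A slot with too little history returns &lt;code&gt;None&lt;/code&gt; rather than a score, which keeps a cold-start bucket from raising false anomalies.&lt;/p&gt;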

&lt;h3&gt;
  
  
  3. Intelligence — RAG-grounded reasoning
&lt;/h3&gt;

&lt;p&gt;The diagnostic engine is built around a multi-stage Retrieval-Augmented Generation pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Incoming incidents are embedded (sentence-transformers) and matched against a vector store of historical post-mortems and runbooks — &lt;strong&gt;pgvector with HNSW indexing&lt;/strong&gt;, falling back to JSONB cosine similarity in environments without the extension.&lt;/li&gt;
&lt;li&gt;Retrieved evidence is reranked by a &lt;strong&gt;cross-encoder&lt;/strong&gt; (&lt;code&gt;ms-marco-MiniLM-L-6-v2&lt;/code&gt;) for query-aware relevance, then optionally compressed by &lt;strong&gt;LLMLingua&lt;/strong&gt; with SRE-critical force tokens (&lt;code&gt;OOM&lt;/code&gt;, &lt;code&gt;kill&lt;/code&gt;, etc.) preserved.&lt;/li&gt;
&lt;li&gt;A fingerprint-keyed &lt;strong&gt;diagnostic cache&lt;/strong&gt; short-circuits recurring incidents (key: &lt;code&gt;service + anomaly_type + metric&lt;/code&gt;) to avoid redundant LLM calls.&lt;/li&gt;
&lt;li&gt;The compressed context is fed to an LLM — &lt;strong&gt;Claude Sonnet&lt;/strong&gt; (&lt;code&gt;claude-sonnet-4-20250514&lt;/code&gt;) by default — to generate a root-cause hypothesis.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;second-opinion validator&lt;/strong&gt; cross-checks the reasoning. It supports both &lt;code&gt;RULE_BASED&lt;/code&gt; and &lt;code&gt;LLM_CROSS_CHECK&lt;/code&gt; strategies; the LLM cross-check typically uses &lt;strong&gt;gpt-4o-mini&lt;/strong&gt; as a cheaper, faster reviewer of Claude's output.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;confidence scorer&lt;/strong&gt; produces a composite score from four weighted components — LLM self-confidence (35%), validator agreement (25%), retrieval relevance (25%), and evidence volume (15%) — so no single signal dominates.&lt;/li&gt;
&lt;/ol&gt;
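
&lt;p&gt;The fingerprint-keyed cache from step 3 can be sketched in a few lines of Python. Only the key fields (&lt;code&gt;service + anomaly_type + metric&lt;/code&gt;) come from the description above. The hashing scheme, the &lt;code&gt;DiagnosticCache&lt;/code&gt; class, and the &lt;code&gt;get_or_diagnose&lt;/code&gt; method are hypothetical, and the sample service and metric names are made up.&lt;/p&gt;

```python
import hashlib
from typing import Callable, Dict


def fingerprint(service: str, anomaly_type: str, metric: str) -> str:
    """Build a stable cache key from the three fields named in the text."""
    raw = "|".join(part.strip().lower() for part in (service, anomaly_type, metric))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DiagnosticCache:
    """Hypothetical in-memory cache; a real one would also expire entries."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def get_or_diagnose(
        self,
        service: str,
        anomaly_type: str,
        metric: str,
        diagnose: Callable[[], str],
    ) -> str:
        key = fingerprint(service, anomaly_type, metric)
        if key not in self._entries:
            # Cache miss: pay for the expensive diagnosis exactly once.
            self._entries[key] = diagnose()
        return self._entries[key]


llm_calls = []


def fake_llm_diagnosis() -> str:
    # Stand-in for the real LLM call; records how often it was invoked.
    llm_calls.append(1)
    return "memory leak in checkout worker"


cache = DiagnosticCache()
first = cache.get_or_diagnose(
    "checkout", "MEMORY_PRESSURE", "container_memory", fake_llm_diagnosis
)
second = cache.get_or_diagnose(
    "checkout", "MEMORY_PRESSURE", "container_memory", fake_llm_diagnosis
)
```

&lt;p&gt;The second lookup returns the stored diagnosis without calling the stand-in LLM again.&lt;/p&gt;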

&lt;p&gt;The output: a typed diagnosis with a calibrated composite confidence percentage.&lt;/p&gt;
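
&lt;p&gt;As a worked example of the weighting in step 6, the following Python sketch combines the four components with the stated weights (35%, 25%, 25%, 15%). The names &lt;code&gt;ConfidenceInputs&lt;/code&gt; and &lt;code&gt;composite_confidence&lt;/code&gt;, the 0 to 1 input scale, and the clamping are assumptions, not the project's actual scorer.&lt;/p&gt;

```python
from dataclasses import dataclass

# Weights as stated in the text; they sum to 1.0.
WEIGHTS = {
    "llm_self_confidence": 0.35,
    "validator_agreement": 0.25,
    "retrieval_relevance": 0.25,
    "evidence_volume": 0.15,
}


@dataclass(frozen=True)
class ConfidenceInputs:
    """Hypothetical container; each component is assumed to be in [0, 1]."""

    llm_self_confidence: float
    validator_agreement: float
    retrieval_relevance: float
    evidence_volume: float


def composite_confidence(inputs: ConfidenceInputs) -> float:
    """Return the weighted composite as a percentage rounded to one decimal."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp so a single out-of-range component cannot skew the score.
        component = min(1.0, max(0.0, getattr(inputs, name)))
        total += weight * component
    return round(100.0 * total, 1)


score = composite_confidence(
    ConfidenceInputs(
        llm_self_confidence=0.9,
        validator_agreement=1.0,
        retrieval_relevance=0.8,
        evidence_volume=0.5,
    )
)
# 0.35*0.9 + 0.25*1.0 + 0.25*0.8 + 0.15*0.5 = 0.84, so score is 84.0
```

&lt;p&gt;Because the largest weight is 35%, even a fully confident LLM cannot lift the composite above 35% when the other three signals are at zero.&lt;/p&gt;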

&lt;h3&gt;
  
  
  4. Governance — graduation criteria and coordination
&lt;/h3&gt;

&lt;p&gt;Autonomy isn't granted, it's earned. Two pieces of this layer are shipped:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graduation gate evaluator&lt;/strong&gt; — &lt;code&gt;PhaseGate.evaluate_graduation()&lt;/code&gt; checks five hard criteria continuously:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Diagnostic accuracy&lt;/td&gt;
&lt;td&gt;≥ 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Destructive false positives&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sev 3-4 autonomous resolution rate&lt;/td&gt;
&lt;td&gt;≥ 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remediation integration coverage&lt;/td&gt;
&lt;td&gt;≥ 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clean soak-test days&lt;/td&gt;
&lt;td&gt;≥ 7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phase transitions today are an operator decision &lt;strong&gt;informed&lt;/strong&gt; by this evaluator's output (the full state machine is a Phase 3 deliverable).&lt;/p&gt;
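
&lt;p&gt;The five criteria in the table translate directly into a checklist. The Python sketch below is an illustration only: the thresholds are the ones from the table, while the &lt;code&gt;GraduationMetrics&lt;/code&gt; fields, the fraction-based scale, and the return shape are assumptions rather than the real &lt;code&gt;PhaseGate.evaluate_graduation()&lt;/code&gt; signature.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GraduationMetrics:
    """Hypothetical snapshot; rates are fractions between 0 and 1."""

    diagnostic_accuracy: float
    destructive_false_positives: int
    sev34_autonomous_resolution_rate: float
    remediation_integration_coverage: float
    clean_soak_test_days: int


def evaluate_graduation(m: GraduationMetrics) -> Dict[str, bool]:
    """Return a pass/fail flag per criterion, using the table's thresholds."""
    return {
        "diagnostic_accuracy": m.diagnostic_accuracy >= 0.90,
        "destructive_false_positives": m.destructive_false_positives == 0,
        "sev34_autonomous_resolution_rate": m.sev34_autonomous_resolution_rate >= 0.95,
        "remediation_integration_coverage": m.remediation_integration_coverage >= 0.30,
        "clean_soak_test_days": m.clean_soak_test_days >= 7,
    }


def ready_to_graduate(m: GraduationMetrics) -> bool:
    # Every criterion is hard: one failure blocks graduation.
    return all(evaluate_graduation(m).values())


snapshot = GraduationMetrics(
    diagnostic_accuracy=0.93,
    destructive_false_positives=0,
    sev34_autonomous_resolution_rate=0.91,  # below the 95% bar
    remediation_integration_coverage=0.40,
    clean_soak_test_days=9,
)
report = evaluate_graduation(snapshot)
failed = [name for name, passed in report.items() if not passed]
```

&lt;p&gt;Returning a per-criterion report, rather than a single boolean, gives the operator the reason a phase transition is blocked.&lt;/p&gt;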

&lt;p&gt;&lt;strong&gt;Distributed lock manager&lt;/strong&gt; — &lt;code&gt;Redis&lt;/code&gt;, &lt;code&gt;etcd&lt;/code&gt;, and &lt;code&gt;in-memory&lt;/code&gt; adapters all implement &lt;code&gt;DistributedLockManagerPort&lt;/code&gt;, so the agent acquires resource-scoped locks before remediation and releases them after verification. This is the foundation for multi-agent coordination (FinOps, SecOps) once a second agent enters the picture.&lt;/p&gt;
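
&lt;p&gt;A rough Python sketch of the acquire, remediate, release pattern follows, using an in-memory lock table. The method names (&lt;code&gt;acquire&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;), the &lt;code&gt;resource_lock&lt;/code&gt; helper, and the resource key format are hypothetical. The real &lt;code&gt;DistributedLockManagerPort&lt;/code&gt; may look different, and a Redis or etcd adapter would also need lease expiry, which is left out here.&lt;/p&gt;

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class InMemoryLockManager:
    """Hypothetical single-process stand-in for a distributed lock manager."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owners: Dict[str, str] = {}

    def acquire(self, resource: str, owner: str) -> bool:
        with self._guard:
            if resource in self._owners:
                return False  # another agent already holds this resource
            self._owners[resource] = owner
            return True

    def release(self, resource: str, owner: str) -> bool:
        with self._guard:
            if self._owners.get(resource) != owner:
                return False  # only the current holder may release
            del self._owners[resource]
            return True


@contextmanager
def resource_lock(manager: InMemoryLockManager, resource: str, owner: str) -> Iterator[None]:
    """Hold a resource-scoped lock for the duration of a remediation."""
    if not manager.acquire(resource, owner):
        raise RuntimeError("resource is locked: " + resource)
    try:
        yield
    finally:
        manager.release(resource, owner)


locks = InMemoryLockManager()
resource = "k8s/prod/deployment/checkout"  # made-up resource key

with resource_lock(locks, resource, owner="sre-agent"):
    # While the SRE agent holds the lock, a second agent is refused.
    finops_got_lock_during = locks.acquire(resource, "finops-agent")

finops_got_lock_after = locks.acquire(resource, "finops-agent")
```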

&lt;h3&gt;
  
  
  5. Action &amp;amp; Guardrails — the final mile
&lt;/h3&gt;

&lt;p&gt;This is where the AI meets reality. Every remediation passes through hard-coded safety gates before execution:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety gates (Python-enforced, not OPA):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BlastRadiusCalculator&lt;/code&gt;&lt;/strong&gt; — validates &lt;code&gt;affected_pods_percentage&lt;/code&gt; against the plan's &lt;code&gt;max_blast_radius_percentage&lt;/code&gt;, with a hard "scale-up ≤ 2× current replicas" override&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;KillSwitch&lt;/code&gt;&lt;/strong&gt; — global toggle that gates every autonomous action; emits domain events on activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CooldownEnforcer&lt;/code&gt;&lt;/strong&gt; — prevents the same service from being remediated repeatedly in a short window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GuardrailOrchestrator&lt;/code&gt;&lt;/strong&gt; — composes the above and produces a single &lt;code&gt;GuardrailResult(allowed, reason)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
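
&lt;p&gt;Here is one way the gates above could compose, as a minimal Python sketch. The field names &lt;code&gt;affected_pods_percentage&lt;/code&gt; and &lt;code&gt;max_blast_radius_percentage&lt;/code&gt;, the 2x scale-up cap, and the &lt;code&gt;GuardrailResult(allowed, reason)&lt;/code&gt; shape come from the list. Everything else, including &lt;code&gt;PlanCheck&lt;/code&gt;, &lt;code&gt;evaluate_guardrails&lt;/code&gt;, the check order, and the 600-second cooldown, is an assumption.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GuardrailResult:
    allowed: bool
    reason: str


@dataclass(frozen=True)
class PlanCheck:
    """Hypothetical slice of a remediation plan that the gates inspect."""

    service: str
    affected_pods_percentage: float
    max_blast_radius_percentage: float
    current_replicas: int = 0
    target_replicas: Optional[int] = None  # set only for scale-up plans


def evaluate_guardrails(
    plan: PlanCheck,
    kill_switch_active: bool,
    last_remediated_at: Dict[str, float],
    now: float,
    cooldown_seconds: float = 600.0,  # assumed value
) -> GuardrailResult:
    """Run every gate in order and stop at the first refusal."""
    if kill_switch_active:
        return GuardrailResult(False, "kill switch active")
    last = last_remediated_at.get(plan.service)
    if last is not None and cooldown_seconds > now - last:
        return GuardrailResult(False, "cooldown window still open")
    if plan.affected_pods_percentage > plan.max_blast_radius_percentage:
        return GuardrailResult(False, "blast radius exceeds plan limit")
    if plan.target_replicas is not None and plan.target_replicas > 2 * plan.current_replicas:
        return GuardrailResult(False, "scale-up exceeds 2x current replicas")
    return GuardrailResult(True, "all guardrails passed")


plan = PlanCheck(
    service="checkout",
    affected_pods_percentage=20.0,
    max_blast_radius_percentage=25.0,
    current_replicas=4,
    target_replicas=8,
)
allowed = evaluate_guardrails(plan, False, {}, now=1000.0)
blocked = evaluate_guardrails(plan, False, {"checkout": 900.0}, now=1000.0)
```

&lt;p&gt;Checking the kill switch first means a global stop wins over every other consideration, and the single result object keeps the refusal reason available for the audit log.&lt;/p&gt;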

&lt;p&gt;&lt;strong&gt;Compute executors (idempotent, behind &lt;code&gt;CloudOperatorPort&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Adapter&lt;/th&gt;
&lt;th&gt;Operations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes/operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pod restart, deployment scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS ECS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws/ecs_operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Service update, task restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS EC2 ASG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws/ec2_asg_operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Instance refresh, scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws/lambda_operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrency adjust, redeploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure App Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;azure/app_service_operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Restart, scale-out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure Functions&lt;/td&gt;
&lt;td&gt;&lt;code&gt;azure/functions_operator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Plan adjustment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After execution, &lt;strong&gt;&lt;code&gt;RemediationVerifier&lt;/code&gt;&lt;/strong&gt; compares post-action metrics against baseline within a sigma tolerance. Failed verification returns a &lt;code&gt;FAILED&lt;/code&gt; status that the engine surfaces; automatic rollback on failure is planned for a future phase.&lt;/p&gt;
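
&lt;p&gt;The verification step can be pictured as a simple sigma check. In the Python sketch below, only the &lt;code&gt;FAILED&lt;/code&gt; status comes from the text. The function name &lt;code&gt;verify_remediation&lt;/code&gt;, the &lt;code&gt;VERIFIED&lt;/code&gt; status, the use of a plain mean, and the default tolerance of 2 sigma are assumptions.&lt;/p&gt;

```python
from statistics import mean
from typing import Sequence


def verify_remediation(
    post_action_samples: Sequence[float],
    baseline_mean: float,
    baseline_std: float,
    sigma_tolerance: float = 2.0,  # assumed default
) -> str:
    """Compare post-action metrics with the baseline that triggered detection."""
    if not post_action_samples:
        return "FAILED"  # no evidence of recovery counts as a failure
    observed = mean(post_action_samples)
    if baseline_std == 0:
        deviation = 0.0 if observed == baseline_mean else float("inf")
    else:
        deviation = abs(observed - baseline_mean) / baseline_std
    return "VERIFIED" if sigma_tolerance >= deviation else "FAILED"


# Memory utilisation (percent) sampled after a pod restart.
recovered = verify_remediation([61.0, 59.5, 60.5], baseline_mean=60.0, baseline_std=2.0)
still_hot = verify_remediation([88.0, 90.0, 91.0], baseline_mean=60.0, baseline_std=2.0)
```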

&lt;h3&gt;
  
  
  6. Data &amp;amp; Persistence — one source of truth
&lt;/h3&gt;

&lt;p&gt;Consolidated around PostgreSQL (ADR-006), with ten migrations in &lt;code&gt;src/sre_agent/adapters/persistence/migrations/&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; is the system of record, with the &lt;strong&gt;transactional outbox pattern&lt;/strong&gt; (&lt;code&gt;postgres_outbox.py&lt;/code&gt; + &lt;code&gt;outbox_relay.py&lt;/code&gt;) and a &lt;code&gt;processed_events&lt;/code&gt; table for idempotent consumers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB extension&lt;/strong&gt; handles the &lt;code&gt;vector_embeddings&lt;/code&gt; and metrics hypertables (migration 002, with continuous-aggregate tuning in migration 009)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector&lt;/strong&gt; stores embeddings with HNSW indexing, including &lt;code&gt;SET LOCAL hnsw.ef_search = 100&lt;/code&gt; per query for recall/latency tuning, plus a JSONB fallback for dev environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Streams&lt;/strong&gt; is the internal event bus (&lt;code&gt;XADD&lt;/code&gt; / &lt;code&gt;XREADGROUP&lt;/code&gt; with consumer groups, at-least-once delivery)&lt;/li&gt;
&lt;li&gt;Every LLM prompt, retrieval, tool call, and decision lands in an &lt;strong&gt;immutable reasoning trace&lt;/strong&gt; via &lt;code&gt;reasoning_trace_store.py&lt;/code&gt; (&lt;code&gt;agent_runs&lt;/code&gt;, &lt;code&gt;tool_calls&lt;/code&gt;, &lt;code&gt;retrieved_contexts&lt;/code&gt;) — the foundation for blameless post-mortems&lt;/li&gt;
&lt;/ul&gt;
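
&lt;p&gt;Because Redis Streams delivers at least once, consumers have to tolerate duplicates, which is what the &lt;code&gt;processed_events&lt;/code&gt; table is for. The Python sketch below shows the idea with the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module standing in for PostgreSQL so it runs anywhere. The &lt;code&gt;event_id&lt;/code&gt; column and the &lt;code&gt;handle_once&lt;/code&gt; helper are hypothetical. In PostgreSQL the insert would typically be written with &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3
from typing import Callable, List


def make_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the processed_events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def handle_once(conn: sqlite3.Connection, event_id: str, handler: Callable[[str], None]) -> bool:
    """Run handler only the first time event_id is seen; return True if it ran."""
    # The connection context manager commits on success and rolls back if the
    # handler raises, so a failed event is not marked as processed.
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cursor.rowcount == 0:
            return False  # duplicate delivery: already processed
        handler(event_id)
    return True


handled: List[str] = []
store = make_store()

first_delivery = handle_once(store, "evt-001", handled.append)
redelivery = handle_once(store, "evt-001", handled.append)
```

&lt;p&gt;Recording the event and running the handler inside one transaction is what makes the redelivery a no-op instead of a second remediation.&lt;/p&gt;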




&lt;h2&gt;
  
  
  🔄 The end-to-end flow
&lt;/h2&gt;

&lt;p&gt;Here's what happens from the moment a metric goes sideways to the moment the agent verifies recovery:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm88od4bcw1fc813oyjkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm88od4bcw1fc813oyjkn.png" alt="End-to-end incident flow showing nine numbered steps from telemetry ingestion through detection, RAG retrieval, LLM diagnosis, validation, severity classification, safety gates, remediation execution, and post-action verification, with an immutable audit log capturing every step" width="720" height="1180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things worth highlighting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Severity classification is deterministic, not LLM-driven.&lt;/strong&gt; The &lt;code&gt;SeverityClassifier&lt;/code&gt; combines multi-dimensional impact scoring with hard rules and service-tier metadata — so the same incident always gets the same Sev. This matters for audit and graduation criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification is structural, not aspirational.&lt;/strong&gt; &lt;code&gt;RemediationVerifier&lt;/code&gt; compares post-action metrics against the same baseline that triggered detection. If the agent restarts a pod and the OOM signal doesn't subside within sigma tolerance, the result is recorded as &lt;code&gt;FAILED&lt;/code&gt;, so the operator reviewing the incident sees an unambiguous outcome.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🧱 Design philosophy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hexagonal architecture (Ports &amp;amp; Adapters)
&lt;/h3&gt;

&lt;p&gt;The single most important architectural decision (ADR-001) was strict hexagonal layering. The &lt;code&gt;domain/&lt;/code&gt; package &lt;strong&gt;never&lt;/strong&gt; imports &lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;boto3&lt;/code&gt;, &lt;code&gt;anthropic&lt;/code&gt;, or &lt;code&gt;openai&lt;/code&gt; directly. It depends only on abstract &lt;code&gt;ports/&lt;/code&gt;, and every external interaction goes through an adapter — &lt;code&gt;OtelProvider&lt;/code&gt;, &lt;code&gt;CloudWatchProvider&lt;/code&gt;, &lt;code&gt;PixieAdapter&lt;/code&gt;, &lt;code&gt;AnthropicLLMAdapter&lt;/code&gt;, &lt;code&gt;OpenAILLMAdapter&lt;/code&gt;, &lt;code&gt;PgVectorAdapter&lt;/code&gt;, &lt;code&gt;RedisStreamsEventBus&lt;/code&gt;, &lt;code&gt;KubernetesOperator&lt;/code&gt;, and so on.&lt;/p&gt;

&lt;p&gt;This isn't theoretical purity — it pays off concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reasoning engine runs unchanged on AWS, Azure, and on-prem Kubernetes&lt;/li&gt;
&lt;li&gt;Every adapter is replaceable in tests with an in-memory fake — see &lt;code&gt;adapters/coordination/in_memory_lock_manager.py&lt;/code&gt; and the JSONB fallback in the pgvector adapter&lt;/li&gt;
&lt;li&gt;No LangChain, no LlamaIndex, no framework lock-in: direct SDK calls with custom Pydantic dataclasses tailored to SRE diagnostics&lt;/li&gt;
&lt;/ul&gt;
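
&lt;p&gt;The ports-and-adapters split is easiest to see in code. The Python sketch below defines a port as a &lt;code&gt;typing.Protocol&lt;/code&gt;, a domain function that depends only on that port, and an in-memory fake for tests. &lt;code&gt;CloudOperatorPort&lt;/code&gt; is a name from this article, but its methods here (&lt;code&gt;restart&lt;/code&gt;, &lt;code&gt;scale&lt;/code&gt;) are guesses, and &lt;code&gt;InMemoryOperator&lt;/code&gt; and &lt;code&gt;remediate_oom&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from typing import List, Protocol, Tuple


class CloudOperatorPort(Protocol):
    """Port: the only surface the domain layer is allowed to depend on."""

    def restart(self, target: str) -> None: ...

    def scale(self, target: str, replicas: int) -> None: ...


class InMemoryOperator:
    """Test adapter: records calls instead of touching real infrastructure."""

    def __init__(self) -> None:
        self.calls: List[Tuple] = []

    def restart(self, target: str) -> None:
        self.calls.append(("restart", target))

    def scale(self, target: str, replicas: int) -> None:
        self.calls.append(("scale", target, replicas))


def remediate_oom(operator: CloudOperatorPort, target: str) -> str:
    """Domain logic: no kubernetes or boto3 import, only the port."""
    operator.restart(target)
    return "restarted " + target


fake = InMemoryOperator()
outcome = remediate_oom(fake, "checkout-7d9f")
```

&lt;p&gt;Swapping &lt;code&gt;InMemoryOperator&lt;/code&gt; for a Kubernetes or ECS adapter would not change &lt;code&gt;remediate_oom&lt;/code&gt; at all, which is the practical payoff of the layering.&lt;/p&gt;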

&lt;h3&gt;
  
  
  Autonomy earned, not granted
&lt;/h3&gt;

&lt;p&gt;The agent is currently in &lt;strong&gt;Phase 2 (Assist)&lt;/strong&gt;. Diagnosis runs autonomously; remediation runs through the guardrail stack with approval state tracked in &lt;code&gt;RemediationPlan.approval_state&lt;/code&gt;. The graduation evaluator continuously checks whether the agent has earned the right to advance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What runs&lt;/th&gt;
&lt;th&gt;What's mutated&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1 · Shadow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detection + diagnosis&lt;/td&gt;
&lt;td&gt;Nothing — actions written to audit log&lt;/td&gt;
&lt;td&gt;✅ Shipped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 · Assist&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagnosis + plan generation, guardrail-gated execution&lt;/td&gt;
&lt;td&gt;OOM and latency remediation against shipped compute executors&lt;/td&gt;
&lt;td&gt;✅ Current phase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3 · Autonomous&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully autonomous Sev 3-4 execution, automatic phase transitions&lt;/td&gt;
&lt;td&gt;Subject to blast-radius limits and kill switch&lt;/td&gt;
&lt;td&gt;🟡 In design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4 · Predictive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trend analysis, pre-emptive action&lt;/td&gt;
&lt;td&gt;Slow-moving issues caught before threshold breach&lt;/td&gt;
&lt;td&gt;🟡 Roadmap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Graduation requires demonstrated diagnostic accuracy over a sustained window — not a manager's gut feel.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Implementation stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python 3.11+&lt;/td&gt;
&lt;td&gt;Mature AI/ML ecosystem, async-native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API surface&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;Async-first, OpenAPI by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data models&lt;/td&gt;
&lt;td&gt;Pydantic v2&lt;/td&gt;
&lt;td&gt;Strict runtime validation of canonical types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM SDKs&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;anthropic&lt;/code&gt; + &lt;code&gt;openai&lt;/code&gt; (direct)&lt;/td&gt;
&lt;td&gt;No LangChain — domain models stay in Pydantic, no impedance mismatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local, deterministic, no API tax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational store&lt;/td&gt;
&lt;td&gt;PostgreSQL + TimescaleDB&lt;/td&gt;
&lt;td&gt;One database for OLTP, time-series, and vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector store&lt;/td&gt;
&lt;td&gt;pgvector (HNSW) + JSONB fallback&lt;/td&gt;
&lt;td&gt;Same DB, ACID transactions, no extra ops surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event bus&lt;/td&gt;
&lt;td&gt;Redis Streams&lt;/td&gt;
&lt;td&gt;At-least-once with consumer groups, simple ops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coordination&lt;/td&gt;
&lt;td&gt;Redis + etcd&lt;/td&gt;
&lt;td&gt;Distributed locks for multi-agent fencing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Testcontainers + LocalStack Pro&lt;/td&gt;
&lt;td&gt;True integration tests against ephemeral cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The testing approach matters: because this agent mutates infrastructure, contract tests run against real (containerized) backends. In-memory fakes are reserved for unit tests against domain logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚧 What's next — honest roadmap
&lt;/h2&gt;

&lt;p&gt;To stay honest, here's what's &lt;strong&gt;designed&lt;/strong&gt; but &lt;strong&gt;not yet shipped&lt;/strong&gt;. These are the things you might expect from the architecture diagram that aren't running today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operator dashboard&lt;/strong&gt; — Next.js SPA with real-time incident feed, confidence visualization, phase tracker. OpenSpec proposal at &lt;code&gt;openspec/changes/phase-2-7-operator-dashboard/&lt;/code&gt;; zero tasks complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatOps adapters&lt;/strong&gt; — Slack / Teams Block Kit approvals, PagerDuty escalation, Jira auto-ticketing. Directories specified, not yet implemented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps executor&lt;/strong&gt; — ArgoCD / PyGithub adapter for rollback PRs. &lt;code&gt;GITOPS_REVERT&lt;/code&gt; strategy is defined; no execution adapter yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cert-manager executor&lt;/strong&gt; — &lt;code&gt;CERTIFICATE_ROTATION&lt;/code&gt; strategy mapped, no adapter to invoke cert-manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log-truncation executor&lt;/strong&gt; — &lt;code&gt;LOG_TRUNCATION&lt;/code&gt; strategy mapped, no shipped adapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase state machine&lt;/strong&gt; — the gate evaluator is shipped, but automatic phase transitions and a persisted state machine are Phase 3 work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oscillation detector&lt;/strong&gt; — multi-agent coordination has the lock primitive; oscillation detection between SRE/FinOps/SecOps agents is a future requirement when a second agent exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket / SSE stream&lt;/strong&gt; — REST endpoints are shipped; the real-time push channel for the dashboard is part of Phase 2.7.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm sharing this roadmap explicitly because the gap between "impressive architecture diagram" and "running code" is the credibility test for any AI infrastructure project. Phase 2 is what runs today; the rest is sequenced in OpenSpec changes with concrete tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 The bigger bet
&lt;/h2&gt;

&lt;p&gt;By separating cognitive reasoning from infrastructure adapters, and by treating safety as a non-negotiable constraint rather than a feature, this project aims to bridge the gap between "impressive AI demos" and Tier-0 infrastructure automation that an enterprise actually trusts.&lt;/p&gt;

&lt;p&gt;Phase 3 is where the agent stops asking for approval on Sev 3-4. Phase 4 is where it stops waiting for thresholds to be breached. Both require the foundation that Phase 2 lays — strict hexagonal boundaries, immutable audit trails, and graduation criteria that aren't negotiable.&lt;/p&gt;

&lt;p&gt;Source code, OpenSpec changes, and ADRs: &lt;strong&gt;&lt;a href="https://github.com/faizanhussainrabbani/autonomous-sre-agent" rel="noopener noreferrer"&gt;github.com/faizanhussainrabbani/autonomous-sre-agent&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're working in platform engineering, AI infrastructure, or SRE, I'd love feedback on the architecture and safety patterns. What guardrails would you add? What would you take out? Drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>ai</category>
      <category>devops</category>
      <category>agents</category>
    </item>
    <item>
      <title>Entity Framework Core 2.1 features</title>
      <dc:creator>Faizan Hussain Rabbani</dc:creator>
      <pubDate>Tue, 11 Sep 2018 07:56:07 +0000</pubDate>
      <link>https://dev.to/faizanhussainrabbani/entity-framework-core-21-features-20j3</link>
      <guid>https://dev.to/faizanhussainrabbani/entity-framework-core-21-features-20j3</guid>
      <description>&lt;h2&gt;
  
  
  What is Entity Framework Core?
&lt;/h2&gt;

&lt;p&gt;Entity Framework Core (EF Core) is a lightweight, extensible, and cross-platform version of Entity Framework. &lt;/p&gt;

&lt;p&gt;EF Core introduces many improvements and new features when compared with EF6. &lt;/p&gt;

&lt;p&gt;EF Core keeps the developer experience from EF6, and most of the top-level APIs remain the same too, so EF Core will feel very familiar to those who have used EF6.&lt;/p&gt;

&lt;p&gt;At the same time, EF Core is a new code base and not as mature as EF6.&lt;/p&gt;

&lt;h2&gt;
  
  
  New features introduced in EF Core 2.1:
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Entity types with constructors
&lt;/h4&gt;

&lt;p&gt;It is now possible to define a constructor with parameters and have EF Core call this constructor when creating an instance of the entity. &lt;/p&gt;

&lt;p&gt;Constructor parameters can be bound to mapped properties or to various kinds of services, so a constructor can be used to inject property values, lazy-loading delegates, and services.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
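&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.ComponentModel.DataAnnotations; // provides [Key]

public class Employee
{
    // EF Core 2.1 can call this constructor when it creates an entity,
    // binding "name" and "designation" to the Name and Designation properties.
    public Employee(string name, string designation)
    {
        Name = name;
        Designation = designation;
    }

    // Not a constructor parameter, so EF Core sets it after construction.
    [Key]
    public int EmpId { get; set; }
    public string Name { get; set; }
    public string Designation { get; set; }
}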
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Some points to consider:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Not all properties need to have constructor parameters. If a property is not set by any constructor parameter, EF Core will set it after calling the constructor in the normal way.&lt;/li&gt;
  &lt;li&gt;The parameter types and names must match property types and names, except that properties can be Pascal-cased while the parameters are camel-cased.&lt;/li&gt;
  &lt;li&gt;EF Core cannot set navigation properties using a constructor.&lt;/li&gt;
  &lt;li&gt;The constructor can be public, private, or have any other accessibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Value Conversions
&lt;/h4&gt;

&lt;p&gt;Value converters allow property values to be converted when reading from or writing to the database. &lt;/p&gt;

&lt;p&gt;This conversion can be from one value to another of the same type (for example, encrypting strings) or from a value of one type to a value of another type (for example, converting enum values to and from strings in the database).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class Rider
{
    public int Id { get; set; }
    public EquineBeast Mount { get; set; }
}

public enum EquineBeast
{
    Donkey,
    Mule,
    Horse,
    Unicorn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then conversions can be defined in OnModelCreating to store the enum values as strings (for example, "Donkey", "Mule", ...) in the database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder
        .Entity&amp;lt;Rider&amp;gt;()
        .Property(e =&amp;gt; e.Mount)
        .HasConversion(
            v =&amp;gt; v.ToString(),
            v =&amp;gt; (EquineBeast)Enum.Parse(typeof(EquineBeast), v));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Limitations:&lt;/p&gt;

&lt;p&gt;There are a few known current limitations of the value conversion system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Null cannot be converted: a null value is never passed to a value converter.&lt;/li&gt;
&lt;li&gt;There is currently no way to spread a conversion of one property to multiple columns or vice-versa.&lt;/li&gt;
&lt;li&gt;Use of value conversions may impact the ability of EF Core to translate expressions to SQL. A warning will be logged for such cases.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Removal of these limitations is being considered for a future release.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Query Types
&lt;/h4&gt;

&lt;p&gt;An EF Core model can now include query types. Unlike entity types, query types do not have keys defined on them and cannot be inserted, deleted or updated (that is, they are read-only), but they can be returned directly by queries. Some of the usage scenarios for query types are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping to views without primary keys&lt;/li&gt;
&lt;li&gt;Mapping to tables without primary keys&lt;/li&gt;
&lt;li&gt;Mapping to queries defined in the model&lt;/li&gt;
&lt;li&gt;Serving as the return type for FromSql() queries&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
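&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create a database view that has no primary key.
db.Database.ExecuteSqlCommand(
    @"CREATE VIEW View_BlogPostCounts AS
        SELECT b.Name, COUNT(p.PostId) AS PostCount
        FROM Blogs b
        JOIN Posts p ON p.BlogId = b.BlogId
        GROUP BY b.Name");

// Query type: read-only, with no key defined.
public class BlogPostsCount
{
    public string BlogName { get; set; }
    public int PostCount { get; set; }
}

// In OnModelCreating: map the query type to the view and
// map BlogName to the view's Name column.
modelBuilder
    .Query&amp;lt;BlogPostsCount&amp;gt;()
    .ToView("View_BlogPostCounts")
    .Property(v =&amp;gt; v.BlogName)
    .HasColumnName("Name");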
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  4. Eager loading
&lt;/h4&gt;

&lt;p&gt;Use the Include method to specify related data to be included in query results. In the following example, the blogs that are returned in the results will have their Posts property populated with the related posts.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog =&amp;gt; blog.Posts)
        .ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can include related data from multiple relationships in a single query.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog =&amp;gt; blog.Posts)
        .Include(blog =&amp;gt; blog.Owner)
        .ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can drill down through relationships to include multiple levels of related data using the ThenInclude method. The following example loads all blogs, their related posts, and the author of each post.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog =&amp;gt; blog.Posts)
            .ThenInclude(post =&amp;gt; post.Author)
        .ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Thank You!
&lt;/h3&gt;

</description>
      <category>efcore</category>
      <category>entityframeworkcore</category>
      <category>ef6</category>
      <category>csharp</category>
    </item>
  </channel>
</rss>
