DevHelm

Posted on Jun 8 • Originally published at devhelm.io

AI SRE: What an Autonomous Agent Doing On-Call Actually Looks Like

#ai #engineering #reliability

Six months ago, we deployed an AI agent that handles on-call for DevHelm's production infrastructure. It triages Grafana alerts, correlates signals from Sentry and deploy pipelines, opens Linear tickets with context, and — for P0 and P1 incidents — launches multi-turn investigation sessions using Claude to diagnose root causes.

This is not a concept piece. We're a small team running a monitoring platform across two data centers. The agent, which we call Nighthawk, processes every reliability signal in our stack. Here's what we built, what it costs, and what it can't do yet.

The three modes of AI SRE

AI-assisted operations exist on a spectrum. Most teams start with the first mode below and move toward the third as trust builds:

1. Advisory mode — classification and routing ($0/incident)

The agent receives a signal (alert fired, error spike, deploy failed), classifies it by severity and category using deterministic rules, creates a ticket in your project tracker with structured context (affected service, probable cause, relevant dashboards), and sends a notification to the on-call channel.

No LLM involved. No cost per event. This is a rules engine with structured output — the kind of automation that SRE teams have been building with PagerDuty webhooks and custom Slack bots for years. The value isn't AI; it's that the classification rules and routing logic live in one place instead of scattered across 15 webhook integrations.

2. Investigation mode — LLM-powered diagnosis (~$6/session)

When a P0 or P1 alert fires, the agent escalates from advisory to investigation. It launches an LLM conversation (we use Claude) with the full incident context: the alert payload, recent deploy history, correlated signals from other sources, and access to diagnostic tools (log search, metric queries, trace lookup).

The investigation runs as a multi-turn session. The agent asks questions, executes diagnostic commands, analyzes results, and builds a hypothesis. After each batch of turns, it pauses and reports findings to the human on-call. The human can inject additional context ("we deployed a database migration 20 minutes ago") or steer the investigation ("check the connection pool metrics, not the query latency").
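
The batch-and-pause loop described above can be sketched roughly as follows. This is an illustration under assumptions, not Nighthawk's actual code: `run_investigation`, `ask_model`, and `get_human_input` are hypothetical names, the model call is a plain callable you would supply, and the "DONE" prefix is an invented convention for signalling that the agent has reached a conclusion.

```python
from typing import Callable, Optional


def run_investigation(
    context: str,
    ask_model: Callable[[list[str]], str],                  # hypothetical LLM wrapper
    get_human_input: Callable[[list[str]], Optional[str]],  # hypothetical on-call hook
    max_turns: int = 25,
    batch_size: int = 5,
) -> list[str]:
    """Run turns in batches, pausing after each batch for human steering."""
    transcript = [f"context: {context}"]
    turns_used = 0
    while turns_used < max_turns:
        # One batch of autonomous turns, never exceeding the hard budget.
        for _ in range(min(batch_size, max_turns - turns_used)):
            reply = ask_model(transcript)
            transcript.append(f"agent: {reply}")
            turns_used += 1
            if reply.startswith("DONE"):  # assumed convention for "hypothesis reached"
                return transcript
        # Pause: report findings and let the human add context or redirect.
        note = get_human_input(transcript)
        if note is not None:
            transcript.append(f"human: {note}")
    return transcript
```

Keeping the human hook between batches, rather than inside every turn, means the agent can work uninterrupted for a few turns while the on-call engineer still gets regular chances to steer.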

This is where the real value appears. A P1 investigation that takes a human 45 minutes of context-switching — opening dashboards, reading logs, cross-referencing deploy history — takes the agent 3–5 minutes of autonomous work. The human still decides what to do with the findings, but the diagnostic legwork is automated.

3. Autonomous remediation — the frontier (not yet)

The logical next step: the agent not only diagnoses the issue but executes the fix. Restart the crashed pod, roll back the bad deploy, scale up the database connection pool. The technology is ready — tool use in modern LLMs is reliable enough for scoped operations. The problem is trust and blast radius. An agent that can restart pods can also restart the wrong pods. An agent that can roll back deploys can roll back the wrong deploy.

We haven't enabled autonomous remediation yet. The investigation-to-human-approval handoff is where we are today, and it's where we think most teams should start.

What our agent actually does

Nighthawk runs as a deployment in our Kubernetes cluster. All reliability signals flow through its webhook endpoints:

Signal source	What it carries
Grafana (38+ alert rules)	Metric threshold breaches: high error rates, latency spikes, disk/memory pressure, replication lag
Sentry	Unhandled exceptions, error spikes, new issue types across API and pipeline
Deploy pipeline	Build failures, health check failures post-deploy, rollback triggers
Failover controller	Cross-datacenter promotion events, replication failures, tunnel status changes
Pipeline workers	Adapter failures, SQS dead-letter events, rate limit exhaustion
Canary organization	Synthetic checks that exercise the full product path as a real user

Every signal goes through the same pipeline:

Deduplication. If the same alert fires 5 times in 2 minutes, the agent correlates them into a single incident instead of creating 5 tickets.
Severity classification. Rules-based mapping from signal metadata to incident severity levels (P0–P3). Grafana critical alerts map to P0. Sentry error spikes with > 100 events/minute map to P1. Build failures map to P2.
Context enrichment. The agent attaches recent deploy history, related signals from the last 30 minutes, and links to relevant dashboards and runbooks.
Routing. Create a Linear ticket. Send a Telegram notification with a one-paragraph summary. For P0/P1: auto-launch an investigation session.
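
As an illustration of the deduplication stage, here is a minimal sketch. It assumes each signal can be reduced to a string fingerprint (for example source, rule name, and service) and that a repeat within 120 seconds of the previous firing belongs to the same incident. The `Deduplicator` class, the fingerprint format, and the sliding-window choice are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    fingerprint: str
    first_seen: float
    last_seen: float
    count: int = 1


class Deduplicator:
    """Fold repeated firings of the same alert into one incident."""

    def __init__(self, window_seconds: float = 120.0):
        self.window_seconds = window_seconds
        self._open: dict[str, Incident] = {}

    def observe(self, fingerprint: str, timestamp: float) -> tuple[Incident, bool]:
        """Return (incident, is_new); is_new is False for a duplicate."""
        existing = self._open.get(fingerprint)
        if existing and timestamp - existing.last_seen <= self.window_seconds:
            # Same alert again inside the window: extend the open incident.
            existing.last_seen = timestamp
            existing.count += 1
            return existing, False
        # First firing, or the previous incident has gone quiet: open a new one.
        incident = Incident(fingerprint, timestamp, timestamp)
        self._open[fingerprint] = incident
        return incident, True
```

Only signals where `is_new` is true would go on to ticket creation; duplicates just bump the counter on the existing incident.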

The advisory pipeline processes signals in under 2 seconds. The investigation session typically runs 5–15 turns over 3–8 minutes.

The economics

The cost model is the first thing anyone asks about, so here are real numbers:

Advisory mode: $0 per incident. No LLM calls. The classification and routing logic is deterministic Python. We process 50–200 signals per day at zero marginal cost.

Investigation sessions: ~$6 per session using Claude Opus. A session runs up to 25 turns (hard budget), with 5 turns per invocation cycle. Most investigations resolve in 10–15 turns. Token usage averages 15,000 input tokens and 3,000 output tokens per turn.
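
The per-session figure can be sanity-checked with a short calculation. The token averages come from the paragraph above; the per-token prices are an assumption (USD 15 per million input tokens and USD 75 per million output tokens), since the post does not state which rates apply.

```python
# Assumed prices in USD per million tokens; not stated in the post.
INPUT_USD_PER_MTOK = 15.00
OUTPUT_USD_PER_MTOK = 75.00

# Per-turn averages quoted above.
INPUT_TOKENS_PER_TURN = 15_000
OUTPUT_TOKENS_PER_TURN = 3_000


def turn_cost_usd() -> float:
    """Cost of one average turn under the assumed prices."""
    return (
        INPUT_TOKENS_PER_TURN * INPUT_USD_PER_MTOK
        + OUTPUT_TOKENS_PER_TURN * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def session_cost_usd(turns: int) -> float:
    """Cost of a session, treating every turn as an average turn."""
    return turns * turn_cost_usd()


for turns in (10, 15, 25):
    print(f"{turns} turns: ${session_cost_usd(turns):.2f}")
```

Under these assumed prices an average turn costs about $0.45, so a 10 to 15 turn session lands between $4.50 and $6.75, in line with the ~$6 figure. A session that used the full 25-turn budget would come to roughly $11.25.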

Daily cost controls:

Circuit breaker at $10/day — if total investigation spend exceeds this, new investigations queue for human approval instead of auto-launching
Maximum 2 concurrent investigations — prevents a cascade of correlated alerts from draining the budget
Only P0 and P1 incidents auto-investigate — P2 and P3 get advisory-only treatment
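
These three controls amount to a small gate in front of the investigation launcher. The sketch below is one way to express it; `InvestigationGate` and the returned strings are hypothetical names, and the "wait" outcome for the concurrency limit is an assumption, since the post does not say what happens to an investigation that arrives while two are already running.

```python
from dataclasses import dataclass


@dataclass
class InvestigationGate:
    daily_budget_usd: float = 10.0
    max_concurrent: int = 2
    spent_today_usd: float = 0.0
    running: int = 0

    def decide(self, severity: str) -> str:
        """Return 'launch', 'queue_for_approval', 'wait', or 'advisory_only'."""
        if severity not in ("P0", "P1"):
            return "advisory_only"       # P2/P3 never auto-investigate
        if self.spent_today_usd > self.daily_budget_usd:
            return "queue_for_approval"  # circuit breaker tripped
        if self.running >= self.max_concurrent:
            return "wait"                # assumed behaviour at the concurrency cap
        return "launch"
```

For example, `InvestigationGate(spent_today_usd=12.0).decide("P0")` returns `"queue_for_approval"`, while the same call with no spend and no running sessions returns `"launch"`.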

In practice, we spend $30–60/month on investigations. That's less than the cost of half a day of human on-call time, even at a conservative estimate. The value isn't just time savings: investigations start immediately at 3 AM instead of waiting for a human to wake up and orient.

What AI SRE can't do yet

Here are the limitations we've run into so far:

It can't prioritize between competing incidents. When three alerts fire simultaneously from different services, the agent investigates them independently. A human engineer would recognize that all three are downstream effects of a single root cause (the database is slow) and triage accordingly. We're building correlation heuristics, but the "is this the root cause or a symptom?" judgment still requires human pattern recognition.

It can't assess business impact. The agent knows that checkout error rates spiked. It doesn't know that this is the last day of a product launch campaign and every lost checkout costs 10x the normal revenue. Severity classification is based on technical signals, not business context.

It hallucinates diagnostic results. In ~5% of investigation sessions, the agent confidently states "the connection pool is exhausted" when the actual metric shows 30% utilization. We mitigate this by requiring the agent to cite specific metric values or log lines for every claim — if it can't produce the evidence, the finding is flagged as unverified.
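
A minimal sketch of the evidence requirement might look like this. The `Finding` shape and `verify` function are hypothetical. The "unverified" flag follows the mitigation described above; the extra "contradicted" state, which compares the cited value against what a metric query actually returned during the session, is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    claim: str
    metric: Optional[str] = None         # metric the agent cites as evidence
    cited_value: Optional[float] = None  # value the agent says it observed


def verify(finding: Finding, observed: dict[str, float], tolerance: float = 0.01) -> str:
    """Return 'verified', 'contradicted', or 'unverified'."""
    if finding.metric is None or finding.cited_value is None:
        return "unverified"    # no evidence cited at all
    actual = observed.get(finding.metric)
    if actual is None:
        return "unverified"    # cited a metric that was never queried
    if abs(actual - finding.cited_value) > tolerance:
        return "contradicted"  # cited value does not match the query result
    return "verified"
```

With this check, a claim that the connection pool is exhausted, citing a utilization of 1.0 when the recorded query result is 0.30, comes back as `"contradicted"` rather than reaching the on-call engineer as fact.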

It doesn't learn across incidents. Each investigation session starts from scratch. The agent doesn't remember that last week's P1 was caused by the same database migration pattern. We're building a "learnings" store that surfaces relevant past investigations, but it's not production-ready.

How to build your own advisory agent

You don't need to start with investigation sessions. The advisory layer alone — signal routing, classification, ticket creation, notification — handles 80% of the toil and costs nothing to run. Here's how to start:

Step 1: Consolidate signal routing

Pick a single webhook endpoint that receives all your reliability signals. Grafana alerts, Sentry webhooks, CI/CD notifications, and custom health checks should all flow through one router. This gives you a single place to add classification logic and prevents the "we have 12 Slack channels and nobody knows which one matters" problem.
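
One way to structure such a router is a registry of per-source normalizers that all emit the same `Signal` shape, so everything downstream handles one format. This is a sketch under assumptions: the payload field names (`service`, `severity`, `project`, `events_per_minute`) are illustrative placeholders, not the real Grafana or Sentry webhook schemas, and the HTTP layer is left out.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Signal:
    source: str
    kind: str
    service: str
    attrs: dict[str, Any]


NORMALIZERS: dict[str, Callable[[dict], Signal]] = {}


def normalizer(source: str):
    """Register a function that converts one source's payload into a Signal."""
    def register(fn: Callable[[dict], Signal]) -> Callable[[dict], Signal]:
        NORMALIZERS[source] = fn
        return fn
    return register


@normalizer("grafana")
def _from_grafana(payload: dict) -> Signal:
    # Placeholder field names; map these to your real alert payload.
    return Signal("grafana", "alert", payload.get("service", "unknown"),
                  {"severity": payload.get("severity")})


@normalizer("sentry")
def _from_sentry(payload: dict) -> Signal:
    # Placeholder field names; map these to your real webhook payload.
    return Signal("sentry", "error_spike", payload.get("project", "unknown"),
                  {"events_per_minute": payload.get("events_per_minute", 0)})


def route(source: str, payload: dict) -> Signal:
    """Single entry point: every webhook handler calls this."""
    try:
        convert = NORMALIZERS[source]
    except KeyError:
        raise ValueError(f"no normalizer registered for source {source!r}") from None
    return convert(payload)
```

Adding a new signal source then means writing one more decorated function, with no changes to classification or routing.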

Step 2: Define severity classification rules

Map signal metadata to severity levels. Start simple:

Grafana alert with severity=critical → P0
Sentry new issue with error count > 100/min → P1
Deploy health check failure → P2
Everything else → P3

Refine the rules as you learn what actually correlates with user-facing impact. The rules will be wrong at first — that's fine. A human reviewing the classification for 2 weeks will generate enough corrections to calibrate.
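
The starter rules above can be written as an ordered table where the first match wins, which keeps later refinements to one-line changes. The signal is modeled here as a plain dict, and the key names (`source`, `severity`, `events_per_minute`, `kind`) are assumptions for the example rather than any tool's real payload format.

```python
from typing import Callable

Rule = tuple[Callable[[dict], bool], str]

# Ordered rules: the first predicate that matches decides the severity.
RULES: list[Rule] = [
    (lambda s: s.get("source") == "grafana" and s.get("severity") == "critical", "P0"),
    (lambda s: s.get("source") == "sentry" and s.get("events_per_minute", 0) > 100, "P1"),
    (lambda s: s.get("source") == "deploy" and s.get("kind") == "health_check_failure", "P2"),
]


def classify(signal: dict) -> str:
    """Map a normalized signal to P0-P3; anything unmatched is P3."""
    for matches, severity in RULES:
        if matches(signal):
            return severity
    return "P3"
```

For instance, `classify({"source": "sentry", "events_per_minute": 250})` returns `"P1"`, and a signal that matches no rule falls through to `"P3"`.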

Step 3: Automate ticket creation

For every classified signal, create a ticket in your project tracker with structured fields: severity, affected service, timestamp, summary, links to relevant dashboards. This is the lever for mean time to resolution (MTTR): the ticket exists before the human starts investigating, with context already attached.
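
A sketch of the structured ticket payload, using the fields listed above. The `Ticket` dataclass, the `build_ticket` helper, and the title format are hypothetical, and the call that posts the payload to your tracker is deliberately left out because it depends on the tracker's API.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Ticket:
    severity: str
    service: str
    summary: str
    timestamp: str  # ISO 8601, UTC
    dashboards: list[str] = field(default_factory=list)


def build_ticket(
    severity: str,
    service: str,
    summary: str,
    dashboards: list[str],
    now: Optional[datetime] = None,
) -> dict:
    """Build a tracker-agnostic payload; sending it is up to the caller."""
    now = now or datetime.now(timezone.utc)
    ticket = Ticket(severity, service, summary, now.isoformat(), list(dashboards))
    payload = asdict(ticket)
    # Assumed title convention so severity is visible in list views.
    payload["title"] = f"[{severity}] {service}: {summary}"
    return payload
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default stamps the ticket with the current UTC time.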

Step 4: Add investigation when ready

Once you trust the classification and routing (after ~30 days of advisory-only operation), add LLM-powered investigation for P0/P1 incidents. Give the agent read access to your logs, metrics, and deploy history. Start with a conservative turn budget (10 turns max) and review every investigation output for the first month.

The role of external monitoring

An AI SRE agent that processes internal signals has a blind spot: it can't detect issues that originate outside your infrastructure. If your cloud provider's API degrades, your database host has a network partition, or a third-party service your pipeline depends on goes down, the failure is invisible to internal alerting until the downstream effects cascade into your metrics.

External uptime monitoring — checks that run from outside your infrastructure and verify endpoint availability every 30 seconds — closes this gap. It's the signal source that catches what internal monitoring misses. Start with checks for your most critical external dependencies at app.devhelm.io, then feed the results into your agent's signal router alongside Grafana and Sentry.
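
To show what an external check produces as a signal, here is a minimal sketch using only the Python standard library. It is not how DevHelm's checks work; `check_endpoint` and the result fields are hypothetical, the "up" rule (any status from 200 to 399) is an assumption, and scheduling the check every 30 seconds from outside your infrastructure is left to whatever runs it.

```python
import time
import urllib.error
import urllib.request
from typing import Any, Callable


def check_endpoint(
    url: str,
    timeout: float = 5.0,
    opener: Callable[..., Any] = urllib.request.urlopen,  # injectable for tests
) -> dict:
    """Probe a URL once and return a result dict for the signal router."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server answered, but with a 4xx/5xx status
    except OSError as exc:
        error = str(exc)   # DNS failure, refused connection, timeout, ...
    return {
        "url": url,
        "up": status is not None and 200 <= status < 400,
        "status": status,
        "error": error,
        "latency_seconds": time.monotonic() - started,
    }
```

A result with `"up": False` can be handed to the same router as any other signal, so an outage in a dependency gets classified and ticketed like an internal alert.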

