DEV Community

Ali


Building AI SRE: Our journey

Even with automation and observability, most on-call workflows still rely on human responders juggling dashboards, logs, and chat threads under pressure. It’s reactive, fragmented, and cognitively draining. At ilert, we set out to change that. Not by adding another dashboard, but by making incident response more agentic: intelligent systems that understand, recommend, and act safely in real time.

In this article, we want to share how we are building agentic incident response, what we've learned along the way, and what comes next.

By the way, if you haven’t heard of us — ilert is an AI-first incident management platform that helps engineering teams reduce downtime and improve MTTR.

This article was written by ilert engineer Tim. You can find the full version on the ilert engineering blog, where we regularly share our journey of building AI SRE.

We began our path to agentic incident response by designing an architecture focused on flexibility, scalability, and intelligent automation throughout the entire incident lifecycle.

ilert AI SRE Architecture

Laying the groundwork

Hive: LLM Orchestration Layer

Hive is ilert’s backbone for AI-driven operations — a secure orchestration layer that manages multiple large language models for incident analysis, summaries, and contextual recommendations. It lets us route workloads to the best model for the job, maintain data privacy, and integrate new LLMs effortlessly as they emerge.
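To make the routing idea concrete, here is a minimal sketch of how a task could be matched to a model. It is not Hive's actual implementation: the registry, the model names, the `privacy_approved` flag, and the `cost_rank` ordering are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str               # hypothetical model identifier
    tasks: frozenset[str]   # task types this model is suited for
    privacy_approved: bool  # hypothetical flag: cleared for sensitive incident data
    cost_rank: int          # hypothetical ordering: lower is cheaper/faster


# Hypothetical registry; supporting a new LLM means adding one entry here.
REGISTRY = [
    ModelProfile("small-fast-model", frozenset({"summary"}), True, 1),
    ModelProfile(
        "large-reasoning-model",
        frozenset({"analysis", "recommendation", "summary"}),
        True,
        3,
    ),
    ModelProfile("external-model", frozenset({"analysis"}), False, 2),
]


def route(task: str, require_privacy: bool = True) -> ModelProfile:
    """Pick the cheapest registered model that can handle the task."""
    candidates = [
        m
        for m in REGISTRY
        if task in m.tasks and (m.privacy_approved or not require_privacy)
    ]
    if not candidates:
        raise LookupError(f"no model registered for task {task!r}")
    return min(candidates, key=lambda m: m.cost_rank)
```

In this sketch, a summary request would go to the small model, while an analysis request with the privacy constraint switched on would skip the unapproved model even though it ranks cheaper.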

AI Voice Agent: Hands-free response

When responders can’t type, our AI voice agent becomes the interface: it captures spoken issue updates, turns them into structured alerts, and pulls fresh data from multiple sources. It bridges natural communication with automated precision.

The core: Model Context Protocol

The Model Context Protocol (MCP), originally developed by Anthropic, is an open protocol that standardizes how applications provide context and tools to LLMs. At ilert, we use it as the real-time layer that connects operational data to the ilert AI SRE. It provides the structured context our agents need to act intelligently during incidents.

Why MCP? Traditional integrations often leave systems disconnected, forcing teams to manually correlate telemetry, logs, and infrastructure data. MCP eliminates these silos by automatically aggregating and structuring incident-relevant information in real time.

MCP collects data from monitoring tools, log aggregators, deployment pipelines, and infrastructure platforms, processes it within a secure, EU-compliant, multi-tenant architecture, and delivers only the essential insights to our agentic responders.

This ensures that:

  • Agents have real-time, granular incident awareness;
  • Data remains isolated, secure, and compliant;
  • Manual correlation and cognitive load are minimized;
  • Interactions with the ilert AI SRE agent are low-latency and context-rich.

So, MCP is a kind of neural layer connecting your observability stack, codebase, and infrastructure to our AI systems — keeping every action contextually accurate, relevant, and safe.
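The two properties described above, tenant isolation and delivering only the essential insights, can be sketched roughly as follows. This is an illustrative in-memory model, not ilert's implementation: the `Signal` fields, the severity scale, and the `ContextStore` class are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    tenant_id: str
    source: str    # e.g. "logs", "metrics", "deployments"
    service: str
    severity: int  # hypothetical scale: 0 = info ... 3 = critical
    message: str


class ContextStore:
    """Keeps signals partitioned per tenant so lookups never cross tenants."""

    def __init__(self) -> None:
        self._by_tenant: dict[str, list[Signal]] = defaultdict(list)

    def ingest(self, signal: Signal) -> None:
        self._by_tenant[signal.tenant_id].append(signal)

    def context_for(
        self, tenant_id: str, service: str, min_severity: int = 2, limit: int = 5
    ) -> list[Signal]:
        """Return only the most severe signals for one tenant's service."""
        relevant = [
            s
            for s in self._by_tenant.get(tenant_id, [])
            if s.service == service and s.severity >= min_severity
        ]
        return sorted(relevant, key=lambda s: s.severity, reverse=True)[:limit]
```

The agent only ever asks for `context_for(tenant_id, service)`, so low-severity noise and other tenants' data never reach the prompt in this model.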

The ilert AI SRE: Turning alerts into agent-proposed actions

ilert AI SRE recommended actions

We built an end-to-end pipeline that turns monitoring signals into intelligent, actionable workflows to speed up incident resolution. When an alert is triggered, Event Flow (automated workflows to streamline processing, routing, and escalating events in ilert) applies rules and thresholds to notify the right teams instantly — cutting noise and delay.
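As a rough illustration of rule-and-threshold routing, consider the sketch below. It does not reflect Event Flow's real rule format; the `Alert` and `Rule` types, the thresholds, and the team names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    service: str
    metric: str
    value: float


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Alert], bool]
    notify: str  # hypothetical team identifier


# Hypothetical rules; the first matching rule wins.
RULES = [
    Rule(
        "checkout-errors",
        lambda a: a.service == "checkout"
        and a.metric == "error_rate"
        and a.value >= 0.05,
        "payments-oncall",
    ),
    Rule(
        "high-latency",
        lambda a: a.metric == "p99_latency_ms" and a.value >= 1500,
        "platform-oncall",
    ),
]


def route_alert(alert: Alert) -> Optional[str]:
    """Return the team to notify, or None if the alert is below all thresholds."""
    for rule in RULES:
        if rule.matches(alert):
            return rule.notify
    return None  # treated as noise in this sketch
```

Returning `None` for alerts that match no rule is what keeps low-value signals from paging anyone in this simplified model.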

At the same moment, the MCP enriches the alert by gathering and structuring telemetry, logs, deployment data, and infrastructure status from tools like Prometheus, Grafana, GitHub, and Kubernetes. This gives the ilert agent full situational awareness without any manual correlation.
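One way to picture this enrichment step is a concurrent fan-out with a per-source timeout, so a slow source never blocks the rest. The sketch below uses stub fetchers rather than real Prometheus, GitHub, or Kubernetes clients; the function names, return shapes, and timeout value are assumptions.

```python
import asyncio


# Stub fetchers standing in for real integrations (hypothetical data).
async def fetch_metrics(service: str) -> dict:
    await asyncio.sleep(0.01)
    return {"error_rate": 0.07}


async def fetch_deployments(service: str) -> dict:
    await asyncio.sleep(0.01)
    return {"last_deploy": "hypothetical-sha"}


async def fetch_logs(service: str) -> dict:
    await asyncio.sleep(1.0)  # simulates a slow source
    return {"lines": []}


SOURCES = {
    "metrics": fetch_metrics,
    "deployments": fetch_deployments,
    "logs": fetch_logs,
}


async def _fetch_one(name: str, fetch, service: str, timeout: float):
    """Fetch one source; a timeout yields None instead of failing the whole call."""
    try:
        return name, await asyncio.wait_for(fetch(service), timeout)
    except asyncio.TimeoutError:
        return name, None


async def enrich(alert: dict, timeout: float = 0.1) -> dict:
    """Query all sources concurrently and attach whatever arrived in time."""
    results = await asyncio.gather(
        *(_fetch_one(n, f, alert["service"], timeout) for n, f in SOURCES.items())
    )
    context = {name: data for name, data in results if data is not None}
    missing = [name for name, data in results if data is None]
    return {**alert, "context": context, "missing_sources": missing}
```

Calling `asyncio.run(enrich({"service": "checkout"}))` returns the alert with the metrics and deployment context attached and the slow log source listed under `missing_sources`, so the agent knows which context it is working without.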

With this context in place, the ilert AI SRE becomes an active participant in the incident, not just a notifier. It analyzes data in real time to propose root causes, remediation steps, and escalation paths. Everything is surfaced in an interactive chat interface where responders can review, adjust, or safely execute actions on the spot.

What we learned along the way

Building and running agentic systems for real-world, mission-critical incident response has been an insightful journey. Here are a few things we’ve learned along the way:

  1. Transparency builds trust. When agents act autonomously (collecting data, correlating signals, or even executing predefined actions), human responders need to see what’s happening and why. Full visibility builds confidence. For high-impact actions, we let teams add approval steps, striking the right balance between speed and safety.
  2. Context is everything. To avoid hallucinations or half-baked suggestions, we feed our agents rich, structured data through the MCP. This keeps every insight grounded in reality — and makes the agent feel more like a reliable teammate than a guessing machine.
  3. Low latency matters. In an incident, seconds matter. We’ve optimized for speed with speculative tool calls and efficient data paths so responders get insights almost instantly. Less waiting, faster recovery.
  4. Feedback makes it better. Every incident teaches something new. Built-in feedback loops help the system learn what works (and what doesn’t), so it becomes sharper and more helpful over time.
  5. Safety first, always. Autonomous doesn’t mean out of control. By defining safe, scoped actions, the agent can fix certain issues on its own — with full rollback options if needed. That way, automation accelerates recovery without ever compromising reliability.
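The approval steps from point 1 and the scoped, reversible actions from point 5 can be combined in a small sketch like the one below. It is only an illustration of the pattern, not ilert's implementation: `ScopedAction`, `execute`, the status strings, and the audit log format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScopedAction:
    name: str
    apply: Callable[[], None]     # performs the change
    rollback: Callable[[], None]  # undoes the change
    verify: Callable[[], bool]    # checks whether the change helped
    high_impact: bool = False     # high-impact actions need human approval


def execute(
    action: ScopedAction,
    approved_by: Optional[str] = None,
    audit_log: Optional[list] = None,
) -> str:
    """Run an action with an approval gate, an audit trail, and automatic rollback."""
    log = audit_log if audit_log is not None else []
    if action.high_impact and approved_by is None:
        log.append(f"{action.name}: awaiting approval")
        return "pending_approval"
    log.append(f"{action.name}: applying (approved by {approved_by or 'policy'})")
    action.apply()
    if action.verify():
        log.append(f"{action.name}: verified")
        return "applied"
    action.rollback()
    log.append(f"{action.name}: verification failed, rolled back")
    return "rolled_back"
```

Every branch writes to the audit log, so responders can see what the agent did and why, and a failed verification always ends in the rollback path rather than leaving a half-applied change.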
