OpenSRE: Build Your Own AI Incident-Investigation Agent

#ai #devops #opensource #sre

Most AI coding tools stop at the editor. They help you write code. But the hardest, most stressful part of running software is not writing it. It is the moment it breaks in production at 2 a.m.

OpenSRE is built for that moment.

The problem it solves

When an incident hits, the evidence is scattered. Logs are in Datadog. Metrics are in Grafana. The config change that caused it is in Git. Service dependencies live in your infra layer. Each system saw part of what happened. None of them saw all of it.

So you do it manually. You pull logs, line up timestamps, ping the colleague who knows that stack, and slowly piece the story together. It takes hours. Under on-call pressure, you often just ship a patch to get the system back up and figure out the real cause later.

OpenSRE automates that investigation.

What it is

OpenSRE is an open-source framework, built on LangGraph, for building AI-powered SRE agents that automate incident investigation and root cause analysis. It is Apache 2.0 licensed and maintained by Tracer.

The point is not a single fixed product. It is a toolkit. You plug in the alerting sources you already use and compose custom investigation workflows tailored to your own infrastructure.

How the investigation runs

When an alert fires, the agent works through a defined sequence:

Ingest the alert from your monitoring or incident system.
Assemble context from logs, metrics, configs, and dependencies.
Frame the failure modes that could plausibly explain the incident.
Execute investigation queries across connected systems, in parallel.
Evaluate hypotheses against the evidence collected.
Deliver a root cause report and recommended next actions, with Slack supported out of the box.

The agent tests several hypotheses at once and stops when it has enough confidence to give a clear answer, rather than running forever or guessing early.
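
To make the parallel testing and the stopping rule concrete, here is a minimal Python sketch. It is not OpenSRE's code or API: `Hypothesis`, `Verdict`, `investigate`, `gather_more`, the 0.8 threshold, and the round limit are all hypothetical names and values chosen for illustration. The confidence threshold guards against guessing early, and the round limit guards against running forever.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Hypothesis:
    name: str
    # Hypothetical check: reads the gathered context and returns
    # (confidence between 0 and 1, list of supporting evidence lines).
    check: Callable[[dict], tuple]


@dataclass
class Verdict:
    name: str
    confidence: float
    evidence: list = field(default_factory=list)


def investigate(context, hypotheses, gather_more, threshold=0.8, max_rounds=3):
    """Score every hypothesis in parallel each round.

    Returns (best_verdict, conclusive). The loop ends as soon as one
    hypothesis reaches the threshold, or after max_rounds at the latest.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    best = None
    for _ in range(max_rounds):
        with ThreadPoolExecutor() as pool:
            scored = list(pool.map(lambda h: h.check(context), hypotheses))
        verdicts = [
            Verdict(h.name, conf, ev) for h, (conf, ev) in zip(hypotheses, scored)
        ]
        best = max(verdicts, key=lambda v: v.confidence)
        if best.confidence >= threshold:
            return best, True
        context = gather_more(context)  # widen the search before retrying
    return best, False


# Toy checks standing in for real log, metric, and Git queries.
def recent_deploy(ctx):
    if ctx.get("deploy_minutes_before_alert") is not None:
        return 0.9, ["deploy landed shortly before the alert"]
    return 0.1, []


def disk_full(ctx):
    return 0.2, [f"disk usage at {ctx.get('disk_pct', 0)}%"]


def fetch_deploy_history(ctx):
    # Stand-in for a second, wider query (for example, the Git log).
    return {**ctx, "deploy_minutes_before_alert": 4}


verdict, conclusive = investigate(
    {"disk_pct": 41},
    [Hypothesis("recent_deploy", recent_deploy), Hypothesis("disk_full", disk_full)],
    gather_more=fetch_deploy_history,
)
print(verdict.name, verdict.confidence, conclusive)  # recent_deploy 0.9 True
```

In this toy run the first round is inconclusive, so the sketch gathers more context and only then reports the deploy as the likely cause. A caller should treat `conclusive == False` as "no clear answer" rather than as a root cause.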

What it connects to

OpenSRE integrates with the systems that already power modern platforms:

Data platform: Apache Airflow, Kafka, Spark
Observability: Grafana, Datadog, CloudWatch, Sentry
Infrastructure: Kubernetes, AWS, GCP, Azure
Dev tools: GitHub
Communication: Slack, PagerDuty

According to the project, adding a new output destination, such as routing reports to PagerDuty or OpsGenie, is one of the easiest contributions you can make.
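
The article does not show OpenSRE's actual extension interface, so the following is only a sketch of the general shape such an output destination tends to take. `ReportSink`, `InMemorySink`, `BrokenSink`, and `deliver` are hypothetical names, not part of the project.

```python
from typing import Protocol


class ReportSink(Protocol):
    """Hypothetical interface: anything with a send() method is a destination."""

    def send(self, title: str, body: str) -> None: ...


class InMemorySink:
    """Collects reports in a list; handy for tests and dry runs."""

    def __init__(self):
        self.sent = []

    def send(self, title, body):
        self.sent.append((title, body))


class BrokenSink:
    """Simulates a destination that is down."""

    def send(self, title, body):
        raise ConnectionError("webhook unreachable")


def deliver(sinks, title, body):
    """Send the report to every sink; one failing sink does not block the rest.

    Returns a list of (sink class name, error message) for the failures.
    """
    failures = []
    for sink in sinks:
        try:
            sink.send(title, body)
        except Exception as exc:
            failures.append((type(sink).__name__, str(exc)))
    return failures


memory = InMemorySink()
failures = deliver([BrokenSink(), memory], "Root cause report", "Deploy broke the DAG")
print(memory.sent)
print(failures)
```

A real destination would replace the body of `send` with a call to the target service. Keeping the interface to a single method is what makes a new destination a small contribution.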

Design principles worth noting

OpenSRE leans on a few principles that matter for production use: deterministic investigations, evidence-backed conclusions, parallel hypothesis testing, and fully auditable workflows.

That last point is important. This is not a black-box LLM that hands you a guess. The investigation is traceable, so you can see why it reached a conclusion.
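
One way to picture a traceable investigation is an append-only trail in which each step records what was asked, what came back, and a hash of the step before it. This is an illustrative sketch under that assumption, not a description of how OpenSRE stores its traces. `Step`, `AuditTrail`, and the field names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Step:
    index: int
    action: str     # what the agent did, e.g. "query_logs"
    detail: str     # what it asked
    finding: str    # what came back
    prev_hash: str  # hash of the previous step ("" for the first step)


def step_hash(step):
    """Stable SHA-256 over the step's fields."""
    payload = json.dumps(asdict(step), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class AuditTrail:
    def __init__(self):
        self.steps = []

    def record(self, action, detail, finding):
        prev = step_hash(self.steps[-1]) if self.steps else ""
        step = Step(len(self.steps), action, detail, finding, prev)
        self.steps.append(step)
        return step

    def verify(self):
        """True if every step still matches the hash stored by its successor."""
        return all(
            self.steps[i].prev_hash == step_hash(self.steps[i - 1])
            for i in range(1, len(self.steps))
        )

    def to_jsonl(self):
        """One JSON object per line, ready to ship to a log store."""
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


trail = AuditTrail()
trail.record("query_logs", "errors in the last 30 minutes", "spike at 02:04")
trail.record("query_git", "commits before 02:04", "config change at 02:01")
trail.record("conclude", "compare timestamps", "config change is the likely cause")
print(trail.to_jsonl())
print(trail.verify())  # True
```

Reading the trail top to bottom shows why the conclusion was reached. Note that the chain only protects steps that have a successor, so the final step would need an external signature or timestamp to be covered as well.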

Getting started

You can try it without touching production. The repo ships a local Grafana plus Loki demo that produces a real root cause report. After a short setup, a single make target runs it:

git clone https://github.com/Tracer-Cloud/open-sre-agent
cd open-sre-agent
make install
make install-hooks
cp .env.example .env
opensre onboard
make local-grafana-live

The opensre onboard step walks you through configuring a local LLM provider and optionally validating integrations like Grafana, Datadog, Slack, AWS, GitHub, and Sentry. There is also a bundled demo that skips Docker entirely if you just want to see the flow.

Is it useful?

Promising, with caveats worth being honest about.

It is one of the youngest projects in the new wave of AI-agent tooling, with a smaller community and no tagged releases yet. It is also clearly aimed at data-platform teams, the Airflow, Kafka, and Spark crowd. If that describes your stack and on-call is genuinely painful, the local demo is worth an afternoon.

Heed the project's own security guidance: use read-only credentials, restrict network exposure, log every investigation, and always review a report before any automated remediation. An agent touching production systems deserves that caution.
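
As a rough illustration of what "log everything and review before remediation" can look like in code, here is a small guard. It is a sketch, not OpenSRE's mechanism: the action naming convention, `READ_ONLY_VERBS`, `is_read_only`, and `run_action` are all assumptions made for this example.

```python
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("investigation")

# Assumed convention: the verb before the first underscore says what an action does.
READ_ONLY_VERBS = {"get", "list", "describe", "query"}


def is_read_only(action: str) -> bool:
    return action.split("_", 1)[0] in READ_ONLY_VERBS


def run_action(action: str, approved_by: Optional[str] = None) -> str:
    """Allow read-only actions; require a named reviewer for anything else.

    Every decision is logged. The actual execution is left out of this sketch.
    """
    if is_read_only(action):
        log.info("read-only action allowed: %s", action)
        return "executed"
    if approved_by is None:
        log.warning("blocked pending review: %s", action)
        return "blocked"
    log.info("action %s approved by %s", action, approved_by)
    return "executed"


print(run_action("query_logs"))                       # executed
print(run_action("restart_service"))                  # blocked
print(run_action("restart_service", approved_by="oncall-lead"))  # executed
```

A guard like this complements read-only credentials rather than replacing them: the credentials limit what the agent can do, and the guard records what it tried to do.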

The takeaway

AI agents are moving past the editor and into operations. OpenSRE is an early, open look at what an AI SRE actually involves: not a magic fix-it button, but a structured, auditable investigator that correlates the signals you already have. If incident response on your team still means hours of manual log-correlation, it is a project worth watching and, if your stack fits, worth trying.
