DEV Community: Siddharth Singh

What is an AI SRE? Definition, Capabilities, and 2026 Buyer's Lens

Siddharth Singh — Thu, 21 May 2026 23:45:57 +0000

Key Takeaways

An AI SRE is a multi-step large-language-model agent that investigates production incidents, queries live telemetry, and drafts a root-cause analysis with remediation guidance. It is not an alerting tool, not an AIOps correlator, and not a chatbot. The agent calls infrastructure tools (kubectl, cloud APIs, log queries) during an incident to gather new evidence.

The category emerged in 2024 and consolidated in 2025-2026. Open-source projects include HolmesGPT (CNCF Sandbox since 8 October 2025), K8sGPT (CNCF Sandbox since 19 December 2023), and Aurora (Apache 2.0, multi-cloud). Commercial entrants include Resolve.ai ($125M Series A at $1B in February 2026) and Traversal ($48M Series A in June 2025).

An AI SRE is not the same as an AIOps platform. AIOps tools cluster alerts statistically and predate LLMs. An AI SRE reasons through an incident step by step using an LLM that calls tools. The two categories are complementary, not interchangeable.

Five capabilities define a credible AI SRE: multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, and a structured root-cause output. Tools that ship fewer than three of these are something else (chatbot, summariser, correlator).

Adoption is bounded by trust, not capability. Most 2026 buyers run the agent in read-only investigation mode for the first ninety days. Closed-loop remediation is a separate trust decision that follows clean operation, never the first decision.

An AI SRE is a multi-step large-language-model agent that investigates production incidents on behalf of a site reliability engineer. When an alert fires, the agent queries telemetry, traverses infrastructure dependencies, retrieves relevant runbooks, and produces a structured root-cause analysis. The category sits next to, not inside, the older AIOps and incident-management markets.

This page is a definitional reference. For the deep methodology and procurement-stage detail, see our AI SRE Complete Guide. For tool selection, see Top 15 AI SRE Tools in 2026.

What does an AI SRE do? The Five-Capability Test

We call the rubric below the Five-Capability AI SRE Test. A tool that ships fewer than three of these capabilities is in an adjacent category (copilot, summariser, correlator) and should not be evaluated against a real AI SRE.

Multi-step investigation. The agent runs an iterative reasoning loop (ReAct, tool-calling, or a graph-based equivalent) where each step uses the previous tool result to decide the next call. Single-shot summarisation is a different category.
Infrastructure tool execution. The agent reads from kubectl, cloud SDKs, observability backends, and ticket systems. Some agents also write, with guardrails. HolmesGPT documents read-only access with RBAC respect. Aurora documents sandboxed execution into an isolated namespace. K8sGPT documents Kubernetes-only diagnostics with anonymisation before any AI backend call.
Dependency-graph awareness. The agent knows that service A talks to service B and uses that topology to assess blast radius. Aurora ships a Memgraph-backed dependency graph. Causely is built on a causal-graph foundation; see How Causely Works.
Knowledge-base RAG. The agent retrieves runbooks and past postmortems using hybrid search (BM25 plus dense vectors). Aurora documents a Weaviate hybrid index. The leading commercial AI SREs all integrate Confluence and ticket systems.
Structured root-cause output. The agent emits a final artefact (summary, evidence chain, suggested remediation) rather than a chat transcript. Postmortem export to Confluence or Jira is increasingly table-stakes.
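The multi-step loop described in the first capability can be sketched as follows. This is a minimal illustration under stated assumptions, not any project's implementation: the tool names, the canned tool outputs, and `choose_next_step` (a hard-coded stand-in for the LLM's decision) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical read-only tools returning canned strings. A real agent would
# call kubectl, a cloud SDK, or a log backend at this point.
def get_pod_status(target: str) -> str:
    return f"{target}: CrashLoopBackOff, last exit code 137"

def get_recent_logs(target: str) -> str:
    return f"{target}: container killed, memory limit 256Mi exceeded"

TOOLS: dict[str, Callable[[str], str]] = {
    "get_pod_status": get_pod_status,
    "get_recent_logs": get_recent_logs,
}

@dataclass
class Investigation:
    alert: str
    evidence: list[tuple[str, str]] = field(default_factory=list)
    root_cause: Optional[str] = None

def choose_next_step(inv: Investigation) -> Optional[str]:
    """Stand-in for the LLM: pick the next tool from the evidence so far."""
    if not inv.evidence:
        return "get_pod_status"
    last_tool, last_result = inv.evidence[-1]
    if last_tool == "get_pod_status" and "exit code 137" in last_result:
        return "get_recent_logs"  # 137 hints at an OOM kill, so read the logs next
    return None  # nothing further to look up

def investigate(alert: str, target: str, max_steps: int = 5) -> Investigation:
    inv = Investigation(alert=alert)
    for _ in range(max_steps):  # bounded loop: each step depends on the last result
        tool_name = choose_next_step(inv)
        if tool_name is None:
            break
        inv.evidence.append((tool_name, TOOLS[tool_name](target)))
    if any("memory limit" in result for _, result in inv.evidence):
        inv.root_cause = "Container exceeded its memory limit (OOM kill)"
    return inv

report = investigate("PodCrashLooping", "checkout-7d9f")
```

The point of the sketch is the shape: the second tool call is chosen only because of what the first one returned, and the final artefact carries the evidence chain alongside the conclusion rather than a chat transcript.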

The minimum coherent product ships investigation, tool execution, and a structured output. Dependency-graph awareness and knowledge-base RAG push the tool from "interesting demo" to "load-bearing in production."
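For the knowledge-base RAG capability, hybrid search means merging a keyword ranking with a vector ranking. The sketch below uses reciprocal rank fusion, which is one common way to merge two ranked lists; it is an assumption chosen for illustration, not a description of how any tool named here fuses results. The document IDs are hypothetical, and the two input rankings stand in for BM25 and dense-vector retrieval output.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first.

    Each document earns 1 / (k + rank) per list it appears in, so items
    ranked highly by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then by ID so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical retrieval results for the query "pod OOMKilled after deploy".
bm25_ranking = ["runbook-oom", "postmortem-2025-11", "runbook-dns"]
dense_ranking = ["runbook-oom", "runbook-disk", "postmortem-2025-11"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document that only one retriever finds still appears in the fused list, just lower down, which is the practical benefit of combining the two.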

How is an AI SRE different from a human SRE?

An AI SRE does not replace a human site reliability engineer. The 2026 division of labour is concrete.

Human stays in the loop for scope decisions (what counts as an incident), trust decisions (when to allow remediation), capacity planning, postmortem facilitation, runbook authorship, and the SLO conversation with product owners.
The agent absorbs the first sixty to ninety minutes of evidence-gathering on noisy alerts, the late-night triage of unclear pages, the cross-system correlation that humans defer until morning, and the boilerplate of a draft postmortem.

The economic argument is bounded. The category's investors (Sequoia, Kleiner, Lightspeed, Felicis) underwrite an "agent does first triage, human does decision" workflow, not a headcount-replacement claim. The SigNoz newsletter discussion of deskilling risk is a useful counterweight.

How is an AI SRE different from AIOps?

The two categories share a similar-sounding name and almost no implementation.

Dimension	AIOps platform	AI SRE
Primary technique	Statistical clustering, anomaly detection, correlation rules	LLM reasoning, tool-calling agents
When it was named	Coined by Gartner in 2017	Emerged in vendor marketing 2024 to 2025
What it produces	Alert clusters, noise reduction, incident summaries	A reasoned root-cause analysis, evidence chain
Representative tools	BigPanda, Moogsoft, Dynatrace Davis, PagerDuty Intelligent Alert Grouping	HolmesGPT, K8sGPT, Aurora, Resolve.ai, Traversal
Replaces	Manual alert triage	First-pass incident investigation

AIOps platforms predate LLMs and remain useful for alert hygiene. An AI SRE is downstream: once the alert lands, the AI SRE investigates it. Most mature teams will end up with both.

How is an AI SRE different from an incident-management copilot?

A copilot inside Rootly, incident.io, FireHydrant, or Datadog Bits AI drafts Slack updates, suggests on-call swaps, and writes a postmortem from artefacts the team has already produced. An AI SRE generates the evidence those artefacts describe. The two categories cooperate; they do not substitute. See our AI SRE vs traditional incident management comparison for the long form.

What are the open-source vs commercial AI SRE options?

In May 2026, three open-source projects dominate this lane.

HolmesGPT. Apache 2.0. 2.5k GitHub stars on the canonical repository as of May 2026, per the HolmesGPT/holmesgpt about box. Originally created by Robusta.dev with major contributions from Microsoft. CNCF Sandbox since 8 October 2025. Project legal entity: HolmesGPT a Series of LF Projects, LLC.
K8sGPT. Apache 2.0. 7.8k GitHub stars on the canonical repository as of May 2026, per the k8sgpt-ai/k8sgpt about box. CNCF Sandbox since 19 December 2023. The June 2024 CNCF blog notes that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (CNCF: K8sGPT, June 2024). Kubernetes-scoped.
Aurora by Arvo AI. Apache 2.0. Multi-cloud (AWS, Azure, GCP, OVH, Scaleway, Kubernetes). Sandboxed command execution, dependency-graph awareness, RAG over runbooks and postmortems. See the direct comparison of all three and our self-hosted AI SRE guide.

Commercial entrants raise larger cheques but ship a narrower deployment surface. Resolve.ai confirmed a $125M Series A at a $1B valuation in February 2026 and an extension at a $1.5B valuation in April 2026. Traversal raised $48M in June 2025 led by Sequoia and Kleiner Perkins. Incumbents shipped 2025-2026 launches: PagerDuty SRE Agent, Datadog Bits AI SRE, and ServiceNow Now Assist for incident operations.

How is an AI SRE evaluated?

Three questions resolve most procurement debates:

Does the agent investigate or just summarise? A summariser repeats what the dashboard already says. An investigator gathers new evidence. Ask the vendor to walk through one tool call after the alert; if the answer is "we summarise the alert payload," the product is a copilot, not an AI SRE.
Where does inference run? A SaaS-only inference plane is fine for unregulated teams and disqualifying for regulated ones. The deployment tier is fixed by the strictest constraint, not the average. See the Sovereignty Spectrum in our self-hosted guide.
What is the remediation boundary? Read-only investigation is one trust decision. PR-based suggestions are another. Sandboxed in-cluster execution is the third. Most teams stage these three independently across a six-to-twelve-month adoption arc, not in a single procurement.
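The remediation boundary can be written down as an explicit policy rather than left implicit. The sketch below is illustrative only: the stage names paraphrase the three trust decisions above, and treating them as strictly ordered (each stage implies the ones before it) is a simplifying assumption, since teams may approve them independently.

```python
from enum import IntEnum

class TrustStage(IntEnum):
    READ_ONLY = 1            # investigation only
    PR_SUGGESTIONS = 2       # agent may propose changes for human review
    SANDBOXED_EXECUTION = 3  # agent may run commands in an isolated sandbox

def is_allowed(action_stage: TrustStage, approved_stage: TrustStage) -> bool:
    """An action is allowed only if the team has approved that stage or a later one."""
    return action_stage <= approved_stage

# A team in its first ninety days has approved read-only investigation only.
approved = TrustStage.READ_ONLY
can_investigate = is_allowed(TrustStage.READ_ONLY, approved)
can_open_pr = is_allowed(TrustStage.PR_SUGGESTIONS, approved)
```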

For a detailed tool matrix scored on five axes (investigation, remediation, postmortem, deployment flexibility, source availability), see Top 15 AI SRE Tools in 2026.

ROI: where the time actually comes back

Independent ROI numbers specifically for AI SRE are still thin in 2026. The broader industry adoption picture is well-sourced:

Google's 2025 DORA report announcement states "90% of survey respondents report using AI at work" and that "More than 80% believe it has increased their productivity."
Stack Overflow's 2025 Developer Survey reports that 84 percent of respondents are using or planning to use AI tools in their development process, and 51 percent of professional developers use AI tools daily.
The same DORA 2025 report notes that "AI adoption still has a negative relationship with software delivery stability," which is exactly the gap an investigation-grade AI SRE is positioned to close, distinct from the coding-assistant category that drives most of the AI adoption signal above.

Where AI SRE specifically takes hours back is mid-tier paging volume: the alerts that are too ambiguous to ignore and too low-stakes to wake a senior on. The agent's first-pass triage moves those from "morning standup discussion" to "closed before breakfast."

What are the common mistakes when buying an AI SRE?

Conflating a postmortem generator with an AI SRE. A tool that writes a draft from the Slack transcript is not investigating. It is summarising.
Buying multi-cloud AI SRE for a single-cloud problem. If 95 percent of the estate is one cloud, a Kubernetes-only or AWS-only agent may be a better cost-to-fit match.
Starting with remediation. The fastest way to lose stakeholder trust is to let an agent execute a command before the team understands its investigation pattern. Stage trust.
Skipping the dependency-graph question. If the agent does not understand what calls what, it will miss blast-radius assessments and waste investigation steps. The capability is invisible in a demo and load-bearing in production.
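To make the dependency-graph point concrete, blast radius is a graph traversal: start at the failing service and walk the "who calls this" edges. The sketch below assumes a hypothetical five-service topology held in a plain dictionary; a production agent would read this from a graph store or service mesh rather than a literal.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it calls.
CALLS = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}

def blast_radius(calls, failing):
    """Return every service that depends, directly or transitively, on `failing`."""
    # Invert the edges so we can ask "who calls this service?"
    callers = {service: [] for service in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    affected = set()
    queue = deque([failing])
    while queue:  # breadth-first walk upstream from the failing service
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

impacted = blast_radius(CALLS, "inventory")
```

Without the inverted edges, an agent investigating an `inventory` failure has no principled way to know that `web` is exposed, which is the investigation waste the last mistake above describes.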

How to evaluate an AI SRE in 14 days

The two-week plan below fits inside a single procurement quarter and maps directly to the Five-Capability AI SRE Test.

Day 1 to 2: Score the shortlist on the Five-Capability Test. Take the five capabilities (multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, structured root-cause output) and score every shortlisted tool 0 to 3 on each axis. Drop any tool that scores below 6 out of 15.
Day 3 to 4: Resolve the three procurement questions. Answer in writing: does the agent investigate or just summarise; where does inference run; what is the remediation boundary. Match the deployment tier to the strictest constraint, not the average.
Day 5 to 7: Run a sandboxed proof of value. Pick one real incident from the last 30 days. Replay it against the top two shortlisted tools using a non-production cloud key and a sandbox cluster.
Day 8 to 9: Run the security review. Walk security through each tool's data path: what telemetry leaves the customer perimeter, what is anonymised before LLM calls, what the read or write capability boundary is.
Day 10 to 11: Pilot one team for one week. Route a defined subset of alerts (one severity tier, one service domain) into the tool in read-only investigation mode. Do not touch remediation.
Day 12 to 13: Stage trust separately. Read-only investigation is one trust decision. PR-based suggestions are the second. Sandboxed in-cluster execution is the third. Most teams stage these over six to twelve months.
Day 14: Decide on five inputs. Five-Capability Test score, three-question filter answers, week-by-week investigation quality reading, total cost of ownership at projected incident volume, and security review status.
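The Day 1 to 2 scoring step is simple arithmetic, sketched here so the cut-off is unambiguous. The five axis names and the 6-out-of-15 threshold come from the plan above; the tool names and their scores are hypothetical.

```python
CAPABILITIES = (
    "multi_step_investigation",
    "infrastructure_tool_execution",
    "dependency_graph_awareness",
    "knowledge_base_rag",
    "structured_root_cause_output",
)
MIN_TOTAL = 6  # tools scoring below 6 out of 15 are dropped

def total_score(scores):
    """Sum the five axis scores after checking each is present and in 0..3."""
    missing = set(CAPABILITIES) - set(scores)
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for name in CAPABILITIES:
        if not 0 <= scores[name] <= 3:
            raise ValueError(f"{name} must be between 0 and 3")
    return sum(scores[name] for name in CAPABILITIES)

def shortlist(tools):
    """Keep only the tools whose total meets the threshold."""
    totals = {name: total_score(scores) for name, scores in tools.items()}
    return {name: total for name, total in totals.items() if total >= MIN_TOTAL}

# Hypothetical scores for two candidate tools.
candidates = {
    "tool-a": dict(zip(CAPABILITIES, (3, 2, 1, 2, 3))),
    "tool-b": dict(zip(CAPABILITIES, (1, 1, 0, 1, 2))),
}
kept = shortlist(candidates)
```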

Where this guide fits

This is the short definitional reference. For deeper material:

AI SRE: The Complete Guide for Engineering Teams in 2026, procurement and adoption arc.
Top 15 AI SRE Tools in 2026, full capability matrix.
Self-Hosted AI SRE, deployment-tier framework.
Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT, three-way comparison.
HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison, two-way head-to-head.
AI-Powered Incident Investigation: The Complete Guide for SRE Teams, investigation-pattern detail.
What is Agentic Incident Management?, category framing.

Frequently Asked Questions

What is an AI SRE in simple terms?
An AI SRE is a multi-step LLM agent that investigates production incidents. It reads alerts, runs infrastructure commands such as kubectl or cloud SDK calls, queries observability backends, and produces a structured root-cause analysis. It augments a human site reliability engineer rather than replacing one.

How is an AI SRE different from AIOps?
AIOps is a 2017-era Gartner category built on statistical alert clustering and anomaly detection. An AI SRE is downstream of that: once an alert lands, the AI SRE uses an LLM to reason through it step by step, calling tools to gather new evidence. Mature teams typically run both.

Is an AI SRE the same as an incident-management chatbot?
No. A chatbot inside Rootly, incident.io, FireHydrant, or PagerDuty drafts Slack updates and summarises artefacts the team already has. An AI SRE generates those artefacts by investigating the incident from telemetry. The two categories cooperate but do not substitute.

Will AI replace SREs?
No. Investor framing across Sequoia, Kleiner, Lightspeed, and Felicis-backed AI SRE companies in 2025 to 2026 has consistently been agent-as-first-triage with a human in the loop for scope, trust, capacity, and SLO decisions. The deskilling risk is real and discussed in industry essays such as the SigNoz newsletter; the headcount-replacement claim is not part of the category thesis.

What are the main open-source AI SRE tools in 2026?
Three projects dominate. HolmesGPT (Apache 2.0, CNCF Sandbox since 8 October 2025, Kubernetes-first, 2.5k GitHub stars per the about box in May 2026). K8sGPT (Apache 2.0, CNCF Sandbox since 19 December 2023, Kubernetes diagnostics, 7.8k GitHub stars per the about box in May 2026). Aurora by Arvo AI (Apache 2.0, multi-cloud, sandboxed command execution).

How does an AI SRE handle security and data privacy?
Practice varies by tool. HolmesGPT operates with read-only access that respects RBAC and is documented as safe to run in production. K8sGPT anonymises cluster object names and labels before sending data to the AI backend. Aurora supports air-gapped deployment with local LLMs through Ollama. Most commercial AI SREs run inference on vendor-managed infrastructure, which is the gating constraint for regulated buyers.

How long does an AI SRE take to deploy?
An open-source AI SRE can be up and running in a single afternoon with a Docker Compose or Helm install and one cloud and one monitoring integration connected. Production rollout, including secret rotation, RBAC scoping, runbook ingestion, and Slack integration, takes two to four weeks for most teams. Closed-loop remediation is staged separately, three to twelve months after read-only operation begins.

What does an AI SRE cost?
Open-source AI SREs are free at the licence layer; the running cost is infrastructure plus LLM inference. Self-hosted Aurora with a local Ollama model removes the LLM cost entirely. Commercial AI SREs price either per-seat or per-investigation. Resolve.ai and Traversal price by custom contract; PagerDuty and Datadog bundle their AI SRE features into existing platform tiers.

Can an AI SRE run in an air-gapped environment?
Yes, for a small set of tools. Aurora supports air-gapped deployment with Ollama or vLLM for local inference. HolmesGPT supports self-hosted LLM endpoints. K8sGPT supports local backends including Ollama and LocalAI. Most commercial AI SREs require outbound calls to a vendor-managed inference plane and do not satisfy air-gapped procurement.

What does an AI SRE not do?
It does not set SLOs, define what counts as an incident, run capacity planning, facilitate a postmortem with the affected team, or own the customer relationship during a major outage. It is a tool for evidence-gathering and first-pass reasoning, not for the judgment work that defines the site reliability discipline.

Originally published at arvoai.ca/blog/what-is-an-ai-sre. Aurora by Arvo AI is open-source on GitHub under Apache 2.0.

HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison for SRE Teams

Siddharth Singh — Thu, 21 May 2026 23:43:47 +0000

Key Takeaways

HolmesGPT and K8sGPT are both Apache 2.0, both CNCF Sandbox, and both branded as AI for SRE work, but they solve different problems. HolmesGPT is an investigation agent that runs across "any infrastructure - VMs, bare metal, cloud services, or containers." K8sGPT is a Kubernetes diagnostics tool: "a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English."

GitHub adoption signals diverge sharply. As of May 2026, K8sGPT shows 7.8k stars and 996 forks and is written 98.9 percent in Go. HolmesGPT shows 2.5k stars and 347 forks and is written 84.5 percent in Python. K8sGPT had a head start of nearly two years in the CNCF Sandbox (19 December 2023 vs 8 October 2025 for HolmesGPT).

Execution model differs. HolmesGPT operates with read-only access that "respects RBAC permissions", plus a separate Operator Mode that "can open PRs to fix the problems it finds" through the GitHub MCP integration. K8sGPT runs as a CLI scanner or in-cluster operator with a 30-second default reconciliation interval (k8sgpt-operator) and anonymises Kubernetes object names and labels before any LLM call.

LLM backend lists overlap heavily and diverge at the edges. Both projects register Anthropic, OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Ollama as backends. K8sGPT's source registers a broader enterprise set: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. HolmesGPT documents a broader developer-tooling set: GitHub Copilot, GitHub Models, Azure AI Foundry, OpenRouter, Robusta AI, and OpenAI-Compatible (LiteLLM proxy).

Governance shapes the trust story. HolmesGPT's project entity is HolmesGPT a Series of LF Projects, LLC; the project was originally created by Robusta.dev with major contributions from Microsoft. The June 2024 CNCF post on K8sGPT states that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (CNCF Blog, 7 June 2024).

This is a strict comparison of two open-source projects that are often grouped together because both attach AI to Kubernetes work, both are CNCF Sandbox, and both are Apache 2.0. Past that, they target different problems with different runtimes, different backends, and different governance. Every claim below is cited to a primary source: the project's GitHub repository, its official docs site, or a CNCF page. No quote is paraphrased from third-party blog posts.

A note on bias. Arvo builds Aurora, a separate open-source AI SRE listed alongside HolmesGPT and K8sGPT in our three-way comparison. This page intentionally excludes Aurora from the main comparison except for a small section at the end.

We call the rubric used below the Open-Source AI SRE Decision Matrix. Six axes, each evaluated against the project's own primary documentation, no third-party claims. The six axes are: stated scope, execution model, continuous operation, LLM provider breadth, Model Context Protocol direction (host vs consume), and project governance. Every cell in the comparison table that follows maps back to one of these six axes.

What is HolmesGPT?

HolmesGPT describes itself as an "Open-source AI agent for investigating production incidents and finding root causes". Repository statistics on the project's about box in May 2026 show 2.5k stars, 347 forks, and Python at 84.5 percent of the codebase (github.com/HolmesGPT/holmesgpt).

Scope is cross-infrastructure: "Open-source SRE agent for investigating production incidents across any infrastructure - Kubernetes, VMs, cloud services, databases, and more" (holmesgpt.dev). The same point is made on the project repository: "No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers" (github.com/HolmesGPT/holmesgpt).

Governance is shared between two entities. Origin attribution: "Originally created by Robusta.Dev, with major contributions from Microsoft". The project's legal entity is named on the docs site: "HolmesGPT a Series of LF Projects, LLC". CNCF acceptance is documented at "October 8, 2025 at the Sandbox maturity level" (cncf.io/projects/holmesgpt).

The latest release at time of writing is v0.30.1 on 20 May 2026 per the Releases page. The release notes for v0.30.1 mention Loki raw response handling on parse failure, a GitLab MCP entry in the datasource catalog, a Bash echo allowlist fix, and user_email persistence on chat requests.

What is K8sGPT?

K8sGPT describes itself as "a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI". Repository statistics on the project's about box in May 2026 show 7.8k stars, 996 forks, and Go at 98.9 percent of the codebase (github.com/k8sgpt-ai/k8sgpt).

Scope is explicitly Kubernetes. The project makes no claim of non-Kubernetes runtime support. The marketing site at k8sgpt.ai carries the tagline "K8sGPT - Giving Kubernetes Superpowers to Everyone."

Governance is community-led. The 7 June 2024 CNCF blog (Dotan Horovits) states: "unlike many popular projects, there is no company behind this project, and no business plan behind it" (CNCF Blog). CNCF acceptance is documented at "December 19, 2023 at the Sandbox maturity level" (cncf.io/projects/k8sgpt).

The latest release at time of writing is v0.4.33 on 13 May 2026 per the Releases page. Recent feature releases include v0.4.27 (mcp v2, 18 December 2025), v0.4.32 (Azure API type support and custom HTTP header, 22 April 2026), and v0.4.33 (analyze previous logs for restarted containers, 13 May 2026).

At a glance

Dimension	HolmesGPT	K8sGPT
License	Apache 2.0	Apache 2.0
CNCF status	Sandbox, 8 October 2025	Sandbox, 19 December 2023
Stars (May 2026)	2.5k	7.8k
Primary language	Python (84.5%)	Go (98.9%)
Stated scope	"Any infrastructure - VMs, bare metal, cloud services, or containers"	Kubernetes clusters
Operating model	Multi-step investigation agent + optional 24/7 Operator Mode (Alpha)	Scanner CLI + k8sgpt-operator for continuous in-cluster runs
Default permission model	"Read-only access and respects RBAC permissions"	Diagnoses; anonymises sensitive data before AI calls
Write capability	Can open GitHub PRs via the GitHub MCP integration in Operator Mode	None documented
MCP support	MCP-based integrations for AWS, Azure, GCP, GitHub, GitLab, Jenkins, Kubernetes Remediation, Sentry, Splunk, MariaDB, Prefect	Hosts an MCP server exposing 12 tools and 3 resources for Kubernetes operations
LLM providers	Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, Robusta AI	Anthropic, OpenAI, Azure OpenAI, AWS Bedrock (and Bedrock Converse), Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, Custom REST
Latest release at writing	v0.30.1, 20 May 2026	v0.4.33, 13 May 2026
Founding entity	Originally Robusta.dev, major Microsoft contributions	Community-led, no commercial backer per June 2024 CNCF blog

What is the scope difference between HolmesGPT and K8sGPT?

This is the load-bearing axis on the Open-Source AI SRE Decision Matrix, and the easiest one for teams to get wrong.

K8sGPT is, by stated scope, a Kubernetes diagnostics tool. The pkg/analyzer folder ships analysers for around 29 Kubernetes resource types as of May 2026, with a documented "default" subset (Pod, PVC, ReplicaSet, Service, Event, Ingress, StatefulSet, Deployment, Job, CronJob, Node, MutatingWebhook, ValidatingWebhook, ConfigMap) and an extended set covering HPA, PDB, NetworkPolicy, Gateway, GatewayClass, HTTPRoute, Log, Storage, Security, plus OLM-related resources (CatalogSource, ClusterServiceVersion, Subscription, etc.). Every analyser is scoped to a Kubernetes resource type. A team running on bare VMs, on managed cloud services without Kubernetes, or on a mainframe is not the K8sGPT audience.

HolmesGPT rebuts the Kubernetes-only assumption directly: "No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers". Its data-source catalogue, visible in the docs navigation, covers VM-era systems alongside Kubernetes-era ones: Bash, ClickHouse, MariaDB (via MCP), Confluence, Sentry, plus Kubernetes resources and Helm. The Operator Mode page also frames non-Kubernetes scope: "While the operator itself runs in Kubernetes, health checks can query any data source Holmes is connected to - VMs, cloud services, databases, SaaS platforms".

For SRE teams whose estate is entirely Kubernetes, this difference is academic. For teams that still run managed databases outside Kubernetes (Amazon RDS, Cloud SQL, Amazon Aurora), VM workloads, or third-party SaaS at incident-critical positions in the stack, K8sGPT cannot reach those resources without integration glue, and HolmesGPT can.

Can HolmesGPT or K8sGPT execute commands against my cluster?

Both projects ship a fundamentally read-shaped default. The phrasing differs.

HolmesGPT is explicit: "By design, HolmesGPT has read-only access and respects RBAC permissions. It is safe to run in production environments". The Operator Mode page describes how the read-only default is preserved while a separate write path opens: "Connect the GitHub MCP server so Holmes can open PRs to fix the problems it finds - not just report them". Writes do not happen against the cluster; they happen against the user's Git repository, where humans approve the change.

K8sGPT does not use the phrase "read-only" in its repository documentation, but its operational profile is similar: the tool scans cluster state through Kubernetes APIs and feeds analyser output to an LLM. Anonymisation happens before the LLM call: "the data is anonymized before being sent to the AI Backend... k8sgpt retrieves sensitive data (Kubernetes object names, labels, etc.). This data is masked when sent to the AI backend". The same primary source also notes that anonymisation "does not currently apply to events" and that certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count) are not masked. The trade-off is openly disclosed. The masking implementation lives in pkg/util/util.go as the MaskString function.
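The general idea of masking before an LLM call, then restoring names in the response, can be sketched in a few lines. This is not K8sGPT's MaskString implementation; the `Masker` class, its placeholder format, and the example object names are hypothetical and exist only to show the round trip.

```python
class Masker:
    """Swap sensitive names for stable placeholders and restore them later."""

    def __init__(self):
        self._forward = {}  # real name -> placeholder
        self._reverse = {}  # placeholder -> real name

    def mask(self, text, sensitive):
        # Replace longer names first so a short name never clobbers
        # part of a longer one that contains it.
        for name in sorted(sensitive, key=len, reverse=True):
            token = self._forward.get(name)
            if token is None:
                token = f"<MASKED_{len(self._forward)}>"
                self._forward[name] = token
                self._reverse[token] = name
            text = text.replace(name, token)
        return text

    def unmask(self, text):
        for token, name in self._reverse.items():
            text = text.replace(token, name)
        return text

masker = Masker()
original = "Pod payments-api-5c9 in namespace billing is in CrashLoopBackOff"
masked = masker.mask(original, ["billing", "payments-api-5c9"])
# The masked string is what would be sent to the AI backend.
restored = masker.unmask("Check the memory limit on <MASKED_0>")
```

Note what the sketch leaves exposed: the status text `CrashLoopBackOff` passes through unmasked, which mirrors the kind of trade-off the K8sGPT documentation discloses for its unmasked fields.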

How does continuous operation differ between the two operators?

Both projects have an in-cluster operator, and again the framing differs.

HolmesGPT's Operator Mode is a 24/7 background agent: "HolmesGPT runs in the background 24/7, spots problems before your customers notice, and messages you in Slack with the fix" (holmesgpt.dev/latest/operator). The docs note its architecture: "a lightweight kopf-based controller handles CRD orchestration and scheduling, while stateless Holmes API servers execute the actual checks." The same page carries an explicit "Holmes Operator - Alpha Release" warning, and includes a cost caution: "Begin with infrequent schedules (e.g., hourly or daily) and monitor usage before scaling up."

K8sGPT's operator (a separate repo, k8sgpt-ai/k8sgpt-operator) is a continuous scanner: "This Operator is designed to enable K8sGPT within a Kubernetes cluster... It will allow you to create a custom resource that defines the behaviour and scope of a managed K8sGPT workload." The default reconciliation interval is 30 seconds, enforced in the controller code (ReconcileSuccessInterval = 30 * time.Second). Output goes to in-cluster Result CRDs, with optional Slack, Mattermost, and CloudEvents sinks. Prometheus and Grafana integration is exposed through ServiceMonitor and dashboard parameters (k8sgpt-operator docs).

Architecturally: HolmesGPT's Operator Mode is event-driven and incident-shaped (run on alert, run on schedule). K8sGPT's operator is poll-shaped (scan every 30 seconds, surface anomalies).
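The difference between the two shapes is easiest to see as run counts. The arithmetic below uses the 30-second interval and the hourly and daily schedules quoted above; it counts scheduled runs only and assumes nothing about how many of those runs trigger an LLM call, which depends on each operator's configuration.

```python
from datetime import timedelta

def runs_per_day(interval: timedelta) -> int:
    """Number of whole intervals that fit in 24 hours."""
    return timedelta(days=1) // interval

poll_runs = runs_per_day(timedelta(seconds=30))  # 30-second reconciliation
hourly_runs = runs_per_day(timedelta(hours=1))   # hourly schedule
daily_runs = runs_per_day(timedelta(days=1))     # daily schedule
```

A 30-second loop runs 2,880 times a day against 24 for an hourly schedule, which is why the two operators call for different cost and rate-limit planning.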

Which LLM providers does each tool support?

Both projects support multiple LLM backends.

K8sGPT's source code at pkg/ai/iai.go registers 17 backends as of May 2026: openai, anthropic, localai, ollama, azureopenai, cohereai, amazonbedrock, amazonbedrockconverse, amazonsagemaker, googleai, noopai, huggingface, googlevertexai, ocigenai, customrest, ibmwatsonxai, groq.

HolmesGPT's docs site navigation enumerates: Anthropic, AWS Bedrock, Azure AI Foundry, Gemini, GitHub Copilot, GitHub Models, Google Vertex AI, Ollama, OpenRouter, OpenAI, OpenAI-Compatible, Robusta AI.

The two lists overlap heavily on the headline providers (Anthropic, OpenAI, Azure OpenAI, Bedrock, Google Vertex AI, Ollama) and diverge at the edges. K8sGPT's edge list leans enterprise: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. HolmesGPT's edge list leans developer-tooling: GitHub Copilot, GitHub Models, Azure AI Foundry, OpenRouter, Robusta AI, and an OpenAI-Compatible (LiteLLM proxy) catch-all. The right choice usually comes from the LLM the security team has already approved, not from this list.

How does each tool handle Model Context Protocol?

Both projects support MCP, and again the shape differs.

K8sGPT hosts an MCP server that the project ships: "K8sGPT provides a Model Context Protocol server that exposes Kubernetes operations as standardized tools for AI assistants." The server exposes "12 tools for cluster analysis, resource management, and debugging" and "3 resources for cluster information access," with "Stateless HTTP mode for one-off invocations" and "Full integration with Claude Desktop and other MCP clients." The MCP v2 feature landed in release v0.4.27 on 18 December 2025.

HolmesGPT consumes MCP servers as data sources rather than hosting one. The data-sources catalogue lists MCP-labelled integrations for AWS, Azure, GitHub, GitLab, Jenkins, GCP, Kubernetes Remediation, MariaDB, Prefect, Sentry, and Splunk. The docs navigation makes the consumption pattern explicit through entries like "MCP Servers" and "OAuth MCP Servers."

The implication: K8sGPT publishes cluster operations for Claude Desktop and other MCP clients to consume. HolmesGPT subscribes to MCP-published tools across third-party systems. Teams building MCP-shaped workflows will pick the direction that matches their existing investment.

Who governs each project, and how does that change the trust story?

The CNCF Sandbox label is identical on both projects. The economic shape behind each is not.

HolmesGPT is held under "HolmesGPT a Series of LF Projects, LLC", with origin attribution: "Originally created by Robusta.Dev, with major contributions from Microsoft". Robusta sells a managed SaaS product that integrates HolmesGPT, and Slack and Microsoft Teams integrations are flagged "Available via Robusta". This is a sponsored-open-source pattern.

K8sGPT is community-led. The June 2024 CNCF blog states: "unlike many popular projects, there is no company behind this project, and no business plan behind it." The same post names production users: "Companies like Kubermatic, SpectroCloud, and Nethopper have enthusiastically embraced K8sGPT capabilities." The project's GOVERNANCE.md further codifies the model: "No single vendor may control project direction."

Neither shape is structurally better. Sponsored open source ships polish and integrations faster; community open source is harder to commercially deprecate. Match the governance to the team's risk model.

Release cadence and recent feature deltas

HolmesGPT shipped v0.30.1 on 20 May 2026, with notes for the release covering Loki raw-response handling on parse failure, a GitLab MCP datasource entry, a Bash echo allowlist fix, user_email persistence on chat requests, and documentation refinements (release tag).

K8sGPT's recent releases include v0.4.33 ("analyze previous logs for restarted containers," 13 May 2026), v0.4.32 ("add Azure API Type Support and add Custom HTTP Header," 22 April 2026), and v0.4.27 ("mcp v2," 18 December 2025) (Releases).

Both projects ship monthly or near-monthly. Neither has demonstrated a multi-month pause in the period documented.

What HolmesGPT and K8sGPT are NOT

Five misreadings of this comparison show up repeatedly in vendor briefings and procurement memos. Naming them in advance saves a procurement cycle.

Neither is an alerting platform. Alerts originate in Prometheus AlertManager, Grafana, Datadog, CloudWatch, or PagerDuty. HolmesGPT fetches alerts from "AlertManager, PagerDuty, OpsGenie, or Jira"; K8sGPT integrates downstream of Prometheus alert rules. Buying either tool does not solve "we have too many or too few alerts."
Neither is a full AIOps platform. AIOps is a 2017-era category built on statistical correlation and noise reduction. Both tools sit downstream of that layer: once an alert lands, the agent investigates. Teams running BigPanda, Moogsoft, Dynatrace Davis, or PagerDuty Intelligent Alert Grouping should not expect either project to replace those products.
Neither is a managed SaaS by default. Both are open-source projects requiring self-hosting. Robusta sells a managed product around HolmesGPT, which is the closest commercial offering. K8sGPT has no commercial entity behind it per the June 2024 CNCF blog. A team that needs a vendor SOC 2 report against the open-source binary itself will not find one.
K8sGPT is not a multi-cloud reasoning tool. Its analysers map one-to-one to Kubernetes resource types. A managed RDS, a Datadog dashboard, or an OVH Bare Metal instance is invisible to K8sGPT's analysers.
HolmesGPT is not a deterministic rules engine. Its agent loop uses LLM tool-calling, which means investigation paths are non-deterministic and depend on the LLM provider and prompt context. Teams that need bit-for-bit reproducible incident analysis should set their expectations for an agent, not for a runbook executor.

When should I choose HolmesGPT vs K8sGPT?

Pick HolmesGPT when:

The estate spans more than Kubernetes (VMs, managed databases, SaaS platforms at incident-critical positions).
The LLM choice is GitHub Copilot, GitHub Models, OpenRouter, or Robusta AI (HolmesGPT-specific).
The team wants a 24/7 background agent that can post to Slack and open GitHub PRs through MCP integration. Note that Operator Mode is marked as an Alpha release at time of writing.
The team values an explicit, project-documented "read-only access and respects RBAC" guarantee.
A managed SaaS option (via Robusta) is acceptable or attractive.

Pick K8sGPT when:

The estate is Kubernetes-first or Kubernetes-only.
The team wants a Go binary that runs as a CLI and an in-cluster operator out of the box.
The LLM choice is IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, or Amazon SageMaker (K8sGPT-specific).
The team plans to publish cluster operations to MCP clients (Claude Desktop, custom tooling) rather than to consume external MCP services.
The team wants documented anonymisation of cluster object names and labels before LLM calls.
The team prefers a community-governed project with no commercial entity behind it.

The two are not directly substitutable for most teams. They are adjacent tools that can plausibly run alongside one another in a Kubernetes-heavy estate.

How to choose between HolmesGPT and K8sGPT in 14 days

This is a two-week evaluation plan for picking between HolmesGPT and K8sGPT, or for confirming that the team needs both. Every step produces a concrete deliverable that a procurement reviewer can sign off on.

Day 1 to 2: Scope your estate. List every system that hosts incident-relevant state: Kubernetes clusters, VMs, managed databases, third-party SaaS, on-prem hardware. If the answer is Kubernetes plus one or two managed services, K8sGPT alone may cover it. If non-Kubernetes systems sit at incident-critical positions in the stack, HolmesGPT's stated "any infrastructure" scope is the better fit.
Day 3 to 4: Confirm the LLM standard. Identify the LLM provider the team is already approved to use. Cross-check against each project's published backend list. Both register Anthropic, OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Ollama. K8sGPT adds enterprise-leaning options (IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker). HolmesGPT adds developer-tooling options (GitHub Copilot, GitHub Models, OpenRouter, Robusta AI, OpenAI-Compatible proxy).
Day 5 to 6: Install both in a dev cluster. Install K8sGPT via brew or its Helm chart (helm repo add k8sgpt https://charts.k8sgpt.ai/) and the k8sgpt-operator. Install HolmesGPT via the official Helm chart documented at holmesgpt.dev. Connect a non-production LLM key.
Day 7 to 8: Run a known-bad scenario. Trigger a documented failure (CrashLoopBackOff, OOMKilled, ImagePullBackOff) in the dev cluster. Capture each tool's full output: time to first useful finding, false positives, and signal-to-noise.
Day 9 to 10: Assess the trust surface. Walk security through the read model. HolmesGPT operates with read-only access plus RBAC. K8sGPT anonymises cluster object names and labels but does not mask certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count). Get a written sign-off on each tool's data path before any production read.
Day 11 to 12: Test the operator behaviour. Enable HolmesGPT Operator Mode on an infrequent schedule (hourly, since Operator Mode is Alpha) and enable the K8sGPT operator at its 30-second default. Watch LLM token consumption and alert volume.
Day 13 to 14: Pick one, both, or neither. Three valid outcomes. (1) Pick K8sGPT alone if the estate is Kubernetes-only and the team needs continuous posture. (2) Pick HolmesGPT alone if the estate is multi-platform and the team values 24/7 Operator Mode with GitHub PR opening. (3) Pick both if the estate is Kubernetes-heavy and the team wants continuous posture (K8sGPT) plus incident investigation (HolmesGPT).
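For the Day 7 to 8 step, the known-bad scenarios can be generated rather than hand-written. The sketch below builds three Pod manifests as plain dictionaries and prints them as a JSON `List`. The pod names, the `purpose` label, the `busybox:1.36` image, the unreachable `example.invalid` registry, and the 32Mi limit are illustrative assumptions; the `tail /dev/zero` command is a common trick for exhausting memory, and its behaviour should be confirmed in the dev cluster before relying on it.

```python
import json


def pod(name, image, command=None, memory_limit=None):
    """Return a minimal Pod manifest as a dict (core/v1 schema)."""
    container = {"name": name, "image": image}
    if command:
        container["command"] = command
    if memory_limit:
        container["resources"] = {"limits": {"memory": memory_limit}}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"purpose": "ai-sre-eval"}},
        "spec": {"restartPolicy": "Always", "containers": [container]},
    }


# One pod per failure mode named in the evaluation plan.
SCENARIOS = {
    # Container exits non-zero immediately, so the kubelet backs off restarts.
    "CrashLoopBackOff": pod("eval-crashloop", "busybox:1.36", ["sh", "-c", "exit 1"]),
    # Image reference that cannot be pulled (hypothetical registry host).
    "ImagePullBackOff": pod("eval-imagepull", "example.invalid/no-such-image:0.0.0"),
    # Process that keeps buffering input until it exceeds the memory limit.
    "OOMKilled": pod(
        "eval-oom", "busybox:1.36", ["sh", "-c", "tail /dev/zero"], memory_limit="32Mi"
    ),
}

# kubectl accepts a v1 List wrapping several objects.
manifest = {"apiVersion": "v1", "kind": "List", "items": list(SCENARIOS.values())}
print(json.dumps(manifest, indent=2))
```

Saved as a script, the output can be piped into `kubectl apply -f -` against the dev cluster, after which each tool's findings are compared against the failure mode that was deliberately injected.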

Where Aurora fits

Aurora by Arvo AI is a separate Apache 2.0 project at github.com/Arvo-AI/aurora. Compared to the two projects above, Aurora ships multi-cloud investigation (AWS, Azure, GCP, OVH, Scaleway, Kubernetes), a Memgraph-backed infrastructure dependency graph, hybrid (BM25 plus vector) RAG over runbooks and postmortems via Weaviate, and sandboxed kubectl execution into an isolated "untrusted" namespace with a four-layer command-safety pipeline (input rail, SigmaHQ signature match, per-org policy, LLM safety judge).

A team can run all three. The most common pattern in 2026 design-partner conversations is K8sGPT for continuous in-cluster posture, HolmesGPT or Aurora for incident investigation, and Aurora for the multi-cloud and remediation-staging path that K8sGPT does not target. For the full three-way comparison see Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT.

Where this guide fits

Top 15 AI SRE Tools in 2026, full capability matrix including commercial entrants.
Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT, three-way comparison.
Self-Hosted AI SRE, the deployment-tier framework.
What is an AI SRE?, the definitional reference.
AI Agent kubectl Safety, the sandboxing pattern that distinguishes investigation from remediation.

Frequently Asked Questions

What is the difference between HolmesGPT and K8sGPT?
HolmesGPT is an AI agent for investigating production incidents across any infrastructure including VMs, bare metal, cloud services, and containers. K8sGPT is a tool for scanning Kubernetes clusters and diagnosing issues in simple English, scoped to Kubernetes resources only. Both are Apache 2.0 and CNCF Sandbox projects.

Which is more popular on GitHub, HolmesGPT or K8sGPT?
As of May 2026, the K8sGPT about box on github.com/k8sgpt-ai/k8sgpt shows 7.8k stars and 996 forks. The HolmesGPT about box on github.com/HolmesGPT/holmesgpt shows 2.5k stars and 347 forks. K8sGPT had a head start of nearly two years: it joined the CNCF Sandbox on 19 December 2023, while HolmesGPT joined on 8 October 2025.

Can HolmesGPT or K8sGPT execute commands against my cluster?
HolmesGPT operates with read-only access and respects RBAC permissions. The HolmesGPT docs describe an Operator Mode that can open GitHub pull requests via the GitHub MCP server, but those writes happen against the user's Git repository, not directly against the cluster. K8sGPT scans Kubernetes resources and anonymises object names and labels before sending data to its AI backend.

Which LLM providers does each tool support?
Both projects support the headline providers. K8sGPT's source registers 17 backends including Anthropic, OpenAI, Azure OpenAI, AWS Bedrock and Bedrock Converse, Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, and a Custom REST endpoint. HolmesGPT supports Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, and Robusta AI. K8sGPT's edge providers lean enterprise (watsonx, OCI, Cohere); HolmesGPT's lean developer tooling (Copilot, Models, OpenRouter).

Do HolmesGPT and K8sGPT both support MCP?
Yes, but in different directions. K8sGPT hosts a Model Context Protocol server that exposes 12 tools and 3 resources for cluster analysis, with full integration with Claude Desktop and other MCP clients. The MCP v2 feature shipped in v0.4.27 on 18 December 2025. HolmesGPT consumes MCP-exposed tools as data sources, including AWS, Azure, GCP, GitHub, GitLab, Jenkins, MariaDB, Prefect, Sentry, and Splunk.

Are HolmesGPT and K8sGPT both CNCF projects?
Both are CNCF Sandbox projects. The cncf.io project pages document HolmesGPT accepted on 8 October 2025 and K8sGPT accepted on 19 December 2023. Sandbox is the entry tier for CNCF projects and indicates the project is in an early stage relative to Incubating and Graduated.

Is there a company behind HolmesGPT or K8sGPT?
HolmesGPT is held under HolmesGPT a Series of LF Projects, LLC, and was originally created by Robusta.dev with major contributions from Microsoft. Robusta sells a managed SaaS product that integrates HolmesGPT. K8sGPT is community-led; the 7 June 2024 CNCF blog states that unlike many popular projects there is no company behind K8sGPT and no business plan behind it.

Which project is updated more often?
Both projects ship monthly or near-monthly. HolmesGPT's latest release at writing is v0.30.1 on 20 May 2026. K8sGPT's latest release at writing is v0.4.33 on 13 May 2026. Both Releases pages on GitHub show consistent 2025 to 2026 cadence with no documented multi-month pause.

Can HolmesGPT or K8sGPT run air-gapped?
Both projects support local LLM inference. K8sGPT's auth list includes localai and ollama, and the K8sGPT team recommends using a local model in critical production environments. HolmesGPT's docs nav lists Ollama and OpenAI-Compatible providers, which covers self-hosted LLM endpoints. To claim an air-gapped deployment, both the agent runtime and the LLM must run inside the customer perimeter.

Can I use HolmesGPT and K8sGPT together?
Yes. K8sGPT is built as a continuous in-cluster scanner with a 30-second default reconciliation interval. HolmesGPT runs as an incident-driven investigation agent that can also operate 24/7 in Operator Mode (Alpha). A common 2026 pattern is to use K8sGPT for posture and HolmesGPT for incident investigation, with results routed to the same Slack channels or ticket systems.

Originally published at arvoai.ca/blog/holmesgpt-vs-k8sgpt. Aurora by Arvo AI is open-source on GitHub under Apache 2.0.

Self-Hosted AI SRE in 2026: Air-Gapped, Multi-Cloud, BYO-LLM

Siddharth Singh — Tue, 19 May 2026 01:01:04 +0000

Key Takeaways

Self-hosted AI SRE means the agent runtime, its memory layer, and the LLM all run inside the customer's perimeter. Every inference call, every telemetry read, and every postmortem write happens on customer-owned infrastructure. The definition is structural. A vendor agent that ships data to vendor-managed inference is not self-hosted under this definition.

We propose the Sovereignty Spectrum. Five deployment tiers: T1 Public SaaS, T2 Private SaaS, T3 VPC-Isolated, T4 On-Prem Hosted, T5 Air-Gapped. Of the fifteen most-cited AI SRE tools in 2026, only Aurora, HolmesGPT, and K8sGPT credibly reach T4 or T5. The other twelve top out at T1 or T2.

Air-gapped deployment requires three independent stacks: orchestration, memory, and inference. Orchestration is the agent loop (LangGraph, ReAct). Memory is the dependency graph plus RAG corpus (Memgraph, Weaviate). Inference is the LLM (Ollama, vLLM, or a sovereign endpoint). All three must run locally, with no outbound network call.

Regulatory drivers are concrete and dated. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The EU AI Act implementation timeline phases in through 2027. The SEC adopted cybersecurity disclosure rules on 26 July 2023 (Form 8-K Item 1.05 effective 18 December 2023).

Open-weight LLMs in 2026 are credible for local inference. Meta's Llama 3.3 70B (December 2024) delivers similar performance to Llama 3.1 405B at lower inference cost, per Meta's own announcement. Mistral, DeepSeek, and Qwen have released competitive open-weight models. Aurora's reference local stack uses Ollama with a 70B-class model.

In Arvo's design-partner conversations across 2025, every regulated customer ran into the same procurement wall: every credible commercial AI SRE required production telemetry, including customer data inside log lines, error messages, and stack traces, to leave the customer perimeter for inference. For a SaaS startup the wall is paperwork. For a bank, a defence contractor, an EU sovereign-data buyer, or a healthcare provider, it blocks the procurement.

Self-hosted AI SRE removes the wall. The agent, its memory, and the LLM all run inside the customer's perimeter. This guide is the 2026 reference for evaluating, designing, and deploying a self-hosted AI SRE, with every commercial tool mapped to its deployment tier and Aurora's air-gapped stack used as the worked example.

What does self-hosted AI SRE mean?

The phrase is overloaded. Three definitions circulate in 2026 vendor marketing, and only the strictest meaningfully reduces the trust surface.

Self-hosted collector with VPC peering. A vendor agent runs in the customer VPC, gathers telemetry, and ships it (sometimes after partial filtering) to a vendor-managed inference plane. The inference call leaves the customer perimeter. Most commercial AI SREs in 2026 use this pattern and call it "private deployment."
Single-tenant SaaS. A dedicated vendor-managed instance inside a vendor-owned cloud account. The data plane is isolated from other tenants but still vendor-operated. Inference still leaves the customer perimeter.
True self-hosted. Every component (orchestration runtime, memory layers, inference endpoint, secrets manager) runs on customer-owned infrastructure. No outbound network call is required for an investigation to complete.

This guide uses the third definition. For audits and compliance reviews, only the third meaning answers the question "could a malicious actor at the vendor have read our incident transcript" with a structural no.

The Sovereignty Spectrum

Each tier increases perimeter control over the previous one. Choose the tier the team can defend operationally; aiming further than that is engineering debt waiting to happen.

Tier	What runs on customer infrastructure	What leaves the perimeter	Representative tools
T1, Public SaaS	Nothing	Telemetry, transcripts, investigation prompts	Datadog Bits AI, incident.io AI SRE, Rootly AI, PagerDuty SRE Agent, ServiceNow Now Assist, Splunk ITSI, Cleric.ai, Causely
T2, Private SaaS (VPC peering)	A vendor-supplied agent or collector	Telemetry, embeddings, sometimes whole log lines, all inference calls	Resolve.ai (satellite agent), Traversal, NeuBird Hawkeye (VPC option), Edwin AI
T3, VPC-Isolated single-tenant	Vendor-managed control plane inside a vendor-owned cloud account dedicated to one customer	All inference calls; cross-tenant data flow is structurally absent, the vendor still operates the plane	Some incumbent "private cloud" tiers (custom-quoted)
T4, On-prem hosted, hosted LLM	Agent, memory, dependency graph, RAG corpus	LLM API calls to OpenAI, Anthropic, Google, or Bedrock	Aurora with managed LLM; HolmesGPT with managed LLM
T5, Air-gapped	Agent, memory, dependency graph, RAG corpus, and a local LLM via Ollama, vLLM, or a sovereign endpoint	Nothing. Investigation completes without an outbound call	Aurora with Ollama; HolmesGPT with self-hosted LLM endpoint; K8sGPT with local LLM (Kubernetes-only scope)

A team's deployment tier is fixed by its strictest constraint, not its average. The FINMA Circular 2018/03 on outsourcing for Swiss banks and insurers pushes regulated workloads toward T5. A privacy-by-design product advertising "your incident data never leaves your servers" lands at T5. A team that cannot obtain controller approval for an LLM provider under GDPR Article 28 lands at T5.

Weaker constraints leave T3 or T4 open. A single strict regulator collapses the choice to T5.
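The "strictest constraint wins" rule is a maximum over per-constraint minimum tiers. A minimal sketch, assuming hypothetical constraint keys: the three T5 entries mirror the examples in the text, while the T3 and T4 entries are illustrative assumptions rather than rules stated in this guide.

```python
from enum import IntEnum


class Tier(IntEnum):
    T1 = 1  # Public SaaS
    T2 = 2  # Private SaaS (VPC peering)
    T3 = 3  # VPC-isolated single-tenant
    T4 = 4  # On-prem hosted, hosted LLM
    T5 = 5  # Air-gapped


# Minimum tier each constraint demands. Keys are hypothetical labels.
MIN_TIER = {
    "finma_circular_2018_03": Tier.T5,          # from the text
    "data_never_leaves_servers_claim": Tier.T5,  # from the text
    "no_gdpr_art28_llm_approval": Tier.T5,       # from the text
    "tenant_isolation_required": Tier.T3,        # assumption for illustration
    "self_operated_data_plane": Tier.T4,         # assumption for illustration
}


def required_tier(constraints):
    """The strictest constraint fixes the tier; no constraints means T1."""
    return max((MIN_TIER[c] for c in constraints), default=Tier.T1)


print(required_tier(["tenant_isolation_required", "self_operated_data_plane"]).name)
print(required_tier(["tenant_isolation_required", "finma_circular_2018_03"]).name)
```

Taking a maximum rather than an average is the point: adding one T5 constraint to a list of otherwise relaxed requirements moves the whole deployment to T5.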

Why does self-hosting matter in 2026?

Three pressures, in roughly this order.

Regulatory. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The boundary covers data processing and storage for core services and is the model EU procurement teams now apply to other vendors. The EU AI Act timeline phases in through 2027, with high-risk system obligations under Chapter III (risk management, data governance, human oversight, post-market monitoring) applicable to operational AI used in critical infrastructure. The SEC's cybersecurity disclosure rules (adopted 26 July 2023, Form 8-K Item 1.05 effective 18 December 2023) make incident response transparency a public-company concern.

Sovereignty and latency. Sovereign cloud is no longer a French preoccupation. OVHcloud Sovereign Cloud, Scaleway, T-Systems Sovereign Cloud, Stackit (Schwarz Group), and Oracle EU Sovereign Cloud ship contractually sovereign tiers. An AI SRE that cannot operate without sending telemetry to a US hyperscaler region is unfit for these workloads. Latency follows the same constraint: an EU-hosted agent calling a US-hosted LLM during an incident incurs round-trip latency on every step of a multi-turn investigation.

Data leakage and trust. Production log lines frequently contain customer PII, secrets, and proprietary identifiers. GitGuardian's State of Secrets Sprawl 2024 found 12.8 million new exposed secrets across public repositories alone in 2023, a steady reminder that telemetry contains material auditors care about. The audit calculation for a security team is the same as for any third-party data flow: if it can leak, model the risk as if it will. T5 makes the model trivial because nothing leaves the perimeter.

For the full incident-investigation context, see AI-Powered Incident Investigation: The Complete Guide for SRE Teams.

Which AI SRE tools can be fully self-hosted?

The honest map.

Tool	Best achievable tier	Constraint
Aurora	T5, Air-Gapped	Reference stack: Docker Compose or Helm chart, Ollama local LLM, Vault, Memgraph, Weaviate. See the Aurora repo.
HolmesGPT	T4, On-prem with hosted LLM (T5 with self-hosted LLM endpoint)	Apache 2.0. Per the HolmesGPT docs, documentation assumes a hosted model provider (OpenAI, Azure OpenAI, Bedrock). Self-hosted LLM is an advanced configuration.
K8sGPT	T4, On-prem (T5 with local LLM, Kubernetes scope only)	CLI or Helm. Local LLMs via Ollama supported. Scope is limited to the Kubernetes API.
Resolve.ai	T2, Private SaaS	Satellite agent in the customer VPC for telemetry. Inference is vendor-managed. No publicly documented air-gapped option.
Traversal	T2, Private SaaS	Flexible deployment options. Inference is vendor-managed.
NeuBird Hawkeye	T2, Private SaaS (VPC)	VPC deployment available. Ephemeral telemetry processing claimed by NeuBird. Inference path is vendor-managed.
Causely	T1, Public SaaS	Kubernetes-only. SaaS control plane.
Cleric.ai	T1, Public SaaS	Slack-first SaaS.
PagerDuty SRE Agent	T1, Public SaaS	Inside PagerDuty Operations Cloud.
Datadog Bits AI SRE	T1, Public SaaS	Multi-tenant inside Datadog. HIPAA-compliant per Datadog's documentation, not air-gapped.
incident.io AI SRE	T1, Public SaaS	Hosted multi-tenant. AI SRE access design-partner-gated.
Rootly AI	T1, Public SaaS	Closed-core SaaS. Rootly AI Labs publishes open-source prototypes.
ServiceNow Now Assist SRE	T1, Public SaaS	ServiceNow cloud. GA targeted June 2026.
Edwin AI (LogicMonitor)	T2, Private (LogicMonitor-managed)	Bundled with LogicMonitor Envision platform. Not standalone.
Splunk ITSI Episode Summarization	T1, Public SaaS	Splunk Cloud only as of May 2026 (Alpha).

The open-source projects are the only tools today that credibly reach T4 or T5 with public documentation. Aurora is the only one with multi-cloud scope at T5. Resolve.ai, Traversal, NeuBird, and Datadog Bits AI publish FedRAMP-adjacent or HIPAA tiers but no air-gapped reference architecture as of May 2026. For the broader category overview, see our open-source incident management overview and the Aurora Actions launch post for scheduled and event-triggered automations on top of self-hosted Aurora.

What is the architecture of a self-hosted AI SRE?

A self-hosted agentic AI SRE has three concurrent runtime stacks. Skip any one and the deployment regresses to a lower sovereignty tier.

1. Orchestration runtime

The agent loop is the LangGraph, ReAct, or equivalent orchestration that decides what tool to call next. It is the smallest of the three stacks by resource footprint and the easiest to self-host. Requirements:

A Python or Node runtime, typically containerised.
A task queue (Celery, RQ, BullMQ) for long-running investigations.
Postgres for agent state, investigation records, and audit logs.
A secrets store (HashiCorp Vault, AWS Secrets Manager, or KMS) for cloud credentials and LLM keys.
A web UI or API surface for engineers to inspect and trigger investigations.

Aurora ships this stack as a Docker Compose for single-node deployment and a Helm chart for Kubernetes-native deployment, both documented in the repo.

2. Memory layer

An agent without memory is a stateless inference call. Memory is the difference between an agent that learns from the environment and an agent that makes the same investigative mistake every week.

Dependency graph. A graph database (Memgraph, Neo4j) that holds the live topology of the infrastructure: services, dependencies, alert sources, and ownership. The agent traverses the graph to assess blast radius and trace upstream causes before issuing tool calls.
RAG corpus. A vector database (Weaviate, Qdrant, Chroma) holding embeddings of past postmortems, runbooks, design docs, and code. Hybrid retrieval combining BM25 and vector search outperforms either alone on SRE corpora because exact-match identifiers (service names, error codes) coexist with semantic concepts (failure modes). See also the root cause analysis complete guide for SREs for the broader investigation context.
Event store. Postgres or an event-sourcing database for the agent's own investigation history. Past investigations become future evidence.
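The dependency-graph traversal can be sketched without a graph database. The example below uses a plain adjacency dictionary and breadth-first search as a stand-in for the Memgraph queries a real deployment would run; the service names are hypothetical.

```python
from collections import deque

# "A depends on B" edges for a hypothetical estate.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres-main"],
    "cart": ["redis-cache"],
    "search": ["redis-cache"],
}


def reverse(graph):
    """Flip every edge so we can ask 'who depends on X?'."""
    flipped = {}
    for source, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def reachable(graph, start):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Blast radius: everything that transitively depends on the failing node.
blast_radius = reachable(reverse(DEPENDS_ON), "redis-cache")

# Candidate causes: everything the alerting service transitively depends on.
candidate_causes = reachable(DEPENDS_ON, "checkout")

print(sorted(blast_radius))
print(sorted(candidate_causes))
```

Running the traversal before any tool call narrows the investigation: the agent only needs to query telemetry for the services in `candidate_causes` rather than the whole estate.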

Aurora's reference stack is Memgraph, Weaviate, and Postgres. Each runs in a customer container, and none requires an outbound network call.
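Hybrid retrieval ultimately has to merge two ranked lists, one from BM25 and one from vector search. The sketch below uses reciprocal rank fusion, one common merging technique; it is shown as an illustration and is not claimed to be the fusion method Aurora or Weaviate applies. The document ids and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "redis OOM eviction".
# BM25 rewards the exact identifier match; vector search rewards the concept.
bm25_ranking = ["runbook-redis-oom", "postmortem-2025-cache", "design-doc-cart"]
vector_ranking = ["postmortem-2025-cache", "runbook-eviction-policy", "runbook-redis-oom"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which is the behaviour the text describes: exact identifiers and semantic matches reinforce each other instead of competing.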

3. Inference layer

The LLM. Three paths, in increasing sovereignty:

Managed LLM API. OpenAI, Anthropic, Google, Bedrock. Cheapest to start, lowest operational burden, but the deployment stays at T4.
Private endpoint. Azure OpenAI dedicated, Bedrock Provisioned Throughput, or a partner-hosted endpoint. Stronger contractual perimeter, although the data still leaves the customer cloud account.
Local LLM. Ollama, vLLM, or a sovereign inference appliance. Reaches T5.

For T5, the inference stack is the operational lift. Hardware is the largest single line item, and team expertise is the second.

BYO-LLM: which models run well locally?

Open-weight model quality has progressed enough to anchor an agentic SRE loop in 2026. The current options:

Llama 3.3 70B (Meta, December 2024). Meta states the model delivers similar performance to Llama 3.1 405B at lower inference cost. A common starting point for local deployments.
DeepSeek-R1 (model card). A reasoning-tuned open-weight model.
Qwen 2.5 and 3 families (Qwen 2.5 release). Strong multilingual support for teams with non-English runbook content.
Mistral Large (Mistral models). Strong tool-use performance.

Hardware sizing for a 70B-class model: in float16, weights are roughly 140GB (70 billion parameters at 2 bytes each), so plan two 80GB cards (a pair of H100 or A100 80GB). A single H200 (141GB) holds the weights but leaves almost no headroom for the KV cache and activations, so size it against the intended context length before committing. Q4-quantised variants compress weights to roughly 35-40GB and fit on a single 80GB card with context room, at some latency and quality cost. See the Llama 3.3 70B model card for the canonical parameter and tensor sizes. Specific latency targets are workload-dependent and should be measured, not assumed.
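The sizing arithmetic is parameters times bits per weight, divided by eight. A minimal sketch follows; the 4.5 effective bits for a Q4 variant and the 15GB runtime headroom are assumptions for illustration, and real headroom depends on context length and batch size.

```python
import math


def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def cards_needed(weights_gb, card_gb, headroom_gb=0.0):
    """Cards required to hold the weights plus a runtime headroom budget."""
    return math.ceil((weights_gb + headroom_gb) / card_gb)


fp16_gb = weight_gb(70, 16)   # 2 bytes per parameter
q4_gb = weight_gb(70, 4.5)    # assumed effective bits for a Q4 variant

HEADROOM_GB = 15  # assumed budget for KV cache and activations

print(f"float16 weights: {fp16_gb:.1f} GB")
print(f"Q4 weights:      {q4_gb:.1f} GB")
print("80GB cards for float16:", cards_needed(fp16_gb, 80, HEADROOM_GB))
print("80GB cards for Q4:     ", cards_needed(q4_gb, 80, HEADROOM_GB))
print("Spare on one 141GB card, weights only:", 141 - fp16_gb, "GB")
```

The last line shows why the headroom budget matters: weights alone nearly fill a 141GB card, so the KV cache budget decides whether a single card is enough.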

The constraint to flag: running a local LLM is a real engineering discipline. Teams without LLM-ops capacity should consider T4 (managed API) as the long-term answer and revisit T5 when the team is staffed for it.

How does multi-cloud authentication work in a self-hosted agent?

A self-hosted agent must still reach customer cloud APIs. The auth pattern matters because credentials live in the customer perimeter. Vendor-managed inference makes credential exfiltration a vendor-trust problem. Self-hosted inference makes it a customer-operations problem, which is the desired state.

Aurora's reference multi-cloud auth pattern:

Cloud	Pattern
AWS	STS AssumeRole into customer accounts via a least-privilege investigation role. Credentials never persist in agent storage.
Azure	Service Principal with Reader (and incident-scoped Operator) role assignments.
GCP	OAuth-based authentication or workload identity federation.
OVH	API key per investigation scope, stored in Vault.
Scaleway	API token stored in Vault.
Kubernetes	Kubeconfig per cluster, stored in Vault. Sandboxed kubectl execution into an isolated namespace; see our AI Agent kubectl Safety guide.

The Vault binding matters: every cloud credential is short-lived where the cloud supports it, and every credential use is auditable. In a T5 deployment, the auditor's "who issued this command" question is answered by the Vault audit log and the agent's tool-call trace, not by a vendor SOC 2 attestation.
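A least-privilege investigation role is easier to defend when the policy is checked mechanically. The sketch below flags any allowed AWS action whose verb does not start with `Describe`, `Get`, or `List`. This prefix rule is a heuristic assumption, not an AWS feature or Aurora's implementation, and the example policy is hypothetical. Some `Get` actions still return sensitive data, so a pass here is a starting point for review rather than an approval.

```python
READ_ONLY_VERBS = ("Describe", "Get", "List")


def as_list(value):
    """IAM allows a single item or a list in several fields."""
    return value if isinstance(value, list) else [value]


def mutating_actions(policy):
    """Return allowed actions that do not look read-only."""
    flagged = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for action in as_list(statement.get("Action", [])):
            # "ec2:DescribeInstances" -> "DescribeInstances"; "*" -> "".
            _, _, verb = action.partition(":")
            if not verb.startswith(READ_ONLY_VERBS):
                flagged.append(action)
    return flagged


# Hypothetical investigation-role policy.
investigation_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "logs:GetLogEvents",
                "cloudwatch:GetMetricData",
                "eks:ListClusters",
            ],
            "Resource": "*",
        }
    ],
}

print(mutating_actions(investigation_policy))

# Adding a write action or a wildcard is caught.
risky = {"Statement": {"Effect": "Allow", "Action": ["ec2:TerminateInstances", "s3:*"]}}
print(mutating_actions(risky))
```

A check like this fits naturally in CI for the infrastructure repository, so that a widened role is caught at review time rather than during an audit.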

What does an air-gapped AI SRE deployment require?

The hard version requires no outbound network call during an investigation, including for inference.

Aurora's air-gapped reference architecture covers six layers:

Mirrored container registry. Every image (Aurora, Memgraph, Weaviate, Postgres, Vault, Ollama) is pulled from a customer-internal registry. No Docker Hub calls.
Mirrored package indices. Python wheels and OS packages served from internal Artifactory or equivalent.
Mirrored model weights. Llama 3.3 weights downloaded once on a connected jumpbox, scanned, hashed, and copied into the air-gapped network. Same for embedding models.
Local DNS. No outbound DNS resolution required. Cloud APIs are reached via VPC private endpoints (AWS PrivateLink, Azure Private Endpoint, GCP Private Service Connect).
No telemetry to vendor. Neither Aurora nor the open-source components phone home; this is verified per release.
Sealed Vault. Vault sealed and unsealed via internal HSM or Shamir keyshares. No auto-unseal against a vendor KMS.
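The "hashed and copied" step for model weights can be sketched with the standard library: record SHA-256 digests on the connected jumpbox, then verify them after the copy into the air-gapped network. The file names and stand-in contents below are hypothetical, and the manifest format (relative path to hex digest) is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose digest differs."""
    failures = []
    for relative_path, expected in manifest.items():
        candidate = Path(root) / relative_path
        if not candidate.is_file() or sha256_file(candidate) != expected.lower():
            failures.append(relative_path)
    return failures


# Demonstration with stand-in files; real weights would be far larger.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.safetensors").write_bytes(b"stand-in weights")
    manifest = {
        "model.safetensors": hashlib.sha256(b"stand-in weights").hexdigest(),
        "tokenizer.json": "0" * 64,  # listed in the manifest but never copied
    }
    failures = verify_manifest(tmp, manifest)

print(failures)
```

The same manifest can cover embedding models and container image archives, giving the auditor one artefact that ties every mirrored file to what was scanned on the connected side.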

The provisioning lift is real. Teams that have operated air-gapped Kubernetes will recognise the pattern. Teams that have not should pilot in a connected environment first.

How Aurora implements the Sovereignty Spectrum

Every Aurora deployment is configured for the customer's tier. The same code base supports all five.

T1 and T2. Aurora deployed to a public-cloud account with managed services for Postgres, Memgraph, and Weaviate. LLM via OpenAI or Anthropic API. Useful for evaluation pilots.
T3. Aurora deployed to a customer-owned VPC with private endpoints to managed services. LLM via private endpoint (Azure OpenAI dedicated, Bedrock).
T4. Aurora deployed to customer-owned VMs or Kubernetes with self-hosted Postgres, Memgraph, and Weaviate. LLM via managed API or private endpoint.
T5. Aurora deployed to customer-owned air-gapped infrastructure with Ollama-hosted Llama 3.3 (or a sovereign LLM endpoint). All dependencies mirrored.

Because one codebase serves every tier, a tier downgrade ("drop from T5 to T3 for one workload") or upgrade ("move the EU workload from T3 to T5") becomes a configuration change rather than a migration.

How does self-hosted AI SRE cost compare to SaaS?

A precise total cost of ownership depends on team size, model choice, infrastructure pricing, regional rates, and incident volume. Procurement should model the variable axes against incident volume rather than anchor on a single vendor-supplied number.

Self-hosted T4 or T5 fixed costs. Compute for the agent runtime, memory stores, and (for T5) the LLM node. Storage for the RAG corpus and audit log. Engineering time to operate the stack.
Self-hosted T4 variable costs. Managed LLM API usage at provider rates (OpenAI pricing, Anthropic pricing, Bedrock pricing). Scales with the number and depth of investigations.
Commercial SaaS variable costs. Per-seat tiers (incident.io, Rootly, PagerDuty), per-investigation billing (Datadog Bits AI, NeuBird), or per-credit consumption (ServiceNow). All published on the vendor's pricing page.

The break-even between a self-hosted T5 deployment and per-investigation SaaS depends on the vendor's per-investigation price, the LLM choice, and the engineering cost of running the stack. Procurement teams should model three points: today's incident volume, twelve-month projected volume, and a 3x scenario. If any of the three is dominated by sovereignty rather than economics, the regulator decides the deployment tier, not the spreadsheet.
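The three-point model can be sketched as a small calculator. Every number below is a placeholder chosen to make the arithmetic easy to follow; none is a vendor price or a measured cost, and the function names are hypothetical.

```python
import math


def saas_monthly(investigations, price_per_investigation):
    """Per-investigation SaaS billing."""
    return investigations * price_per_investigation


def self_hosted_monthly(investigations, fixed_infra, engineering, llm_per_investigation=0.0):
    """Fixed stack cost plus any per-investigation LLM API spend (zero at T5)."""
    return fixed_infra + engineering + investigations * llm_per_investigation


def break_even_volume(price_per_investigation, fixed_infra, engineering, llm_per_investigation=0.0):
    """Monthly investigations at which self-hosting costs no more than SaaS."""
    margin = price_per_investigation - llm_per_investigation
    if margin <= 0:
        return None  # self-hosting never catches up on cost alone
    return math.ceil((fixed_infra + engineering) / margin)


# Placeholder inputs, not real prices.
PRICE = 50.0          # SaaS price per investigation
FIXED_INFRA = 6000.0  # monthly compute and storage for the T5 stack
ENGINEERING = 9000.0  # monthly engineering time to operate it

threshold = break_even_volume(PRICE, FIXED_INFRA, ENGINEERING)
print("break-even investigations per month:", threshold)

today = 100
for label, volume in [("today", today), ("12-month", 180), ("3x", 3 * today)]:
    saas = saas_monthly(volume, PRICE)
    hosted = self_hosted_monthly(volume, FIXED_INFRA, ENGINEERING)
    print(f"{label:>8}: SaaS {saas:>8.0f}  self-hosted {hosted:>8.0f}")
```

With these placeholders, self-hosting only reaches parity in the 3x scenario, which illustrates why a sovereignty requirement, rather than cost, is usually what moves a team to T5 early.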

When self-hosting is the wrong answer

Self-hosting is an engineering commitment, not a checkbox.

Teams that should skip it:

No LLM-ops capacity. If no one on the team has run inference servers in production, do not start with air-gapped Ollama. Pilot at T1 or T2.
Small team, low incident volume. Below twenty incidents per month, the operational overhead can exceed the cost savings of self-hosting. T1 is fine if the data classification allows it.
No regulatory or sovereignty pressure. If the compliance team is not asking and the data classification is not sensitive, the sovereignty premium is paid for nothing.
Early in the AI SRE evaluation curve. A managed pilot validates the value of the agent to the team. Self-host after that decision, not before it.

Teams that should default to self-hosting:

Regulated workloads (finance, healthcare, defence, critical infrastructure).
EU sovereign-data customers.
Customers that advertise sovereignty as a product attribute themselves.
Public-sector buyers under FedRAMP High, IRAP PROTECTED, IL5, or equivalent.
Anyone whose log lines contain customer PII that has not been scrubbed at source.

What to watch next

Arvo expects three shifts in the self-hosted AI SRE landscape over the next twelve months.

Sovereign LLM endpoints. EU-hosted, contract-bound LLM endpoints from cloud regions outside US jurisdiction will turn T4 into a viable tier for European regulated customers without forcing T5. Anthropic, OpenAI, and Google are each shipping or piloting EU-resident inference.
Air-gap reference appliances. Appliance-style packages (preloaded GPU servers with Aurora, a local LLM, and a sealed Vault) sold as turn-key T5 deployments are likely to emerge from hardware vendors.
Open benchmark cohorts. Closed-source players still measure themselves on private datasets. The first open, named, multi-LLM benchmark on a public incident corpus will become the citation surface the category orbits.

In 2024 self-hosted AI SRE was a theoretical option. By 2025 it was niche. In 2026 it has become the procurement default for regulated workloads. The tools that can execute it today are Aurora at the multi-cloud end, HolmesGPT at the CNCF and Kubernetes end, and K8sGPT for diagnostics.

For the full landscape of AI SRE tools and how each maps to a deployment tier, see Top 15 AI SRE Tools in 2026. For the broader category overview, see AI SRE: The Complete Guide for Engineering Teams in 2026. For the investigation and postmortem halves of the workflow, see AI-Powered Incident Investigation and Automated Post-Mortem Generation.

Originally published at arvoai.ca.

Top 15 AI SRE Tools in 2026: Open-Source and Commercial

Siddharth Singh — Tue, 19 May 2026 00:55:24 +0000

Key Takeaways

An AI SRE tool applies large-language-model reasoning to incident response, usually as a multi-step agent that runs infrastructure tools, summarizes events, or drafts postmortems. The label spans five archetypes that vendors blur in marketing: agentic investigation, AIOps correlation, postmortem generation, ITSM-integrated copilots, and workflow-automation suites with AI add-ons.

We score every tool on the AI SRE Capability Matrix. Five axes (Investigation, Remediation, Postmortem, Deployment Flexibility, Source Availability), each 0 to 3, total 15. The matrix tracks publicly documented capability as of May 2026.

Three open-source projects span the agentic-investigation lane. Aurora (Apache 2.0, multi-cloud), HolmesGPT (Apache 2.0, CNCF Sandbox since October 2025, co-maintained by Robusta and Microsoft), and K8sGPT (Apache 2.0, CNCF Sandbox since 19 December 2023, Kubernetes diagnostics).

Cited funding rounds since 2024. Resolve.ai raised $125M at a $1B valuation in February 2026 and extended at a $1.5B valuation in April 2026. Traversal raised $48M in June 2025. incident.io closed a $62M Series B in September 2024.

Incumbents shipped AI SRE features by Q2 2026. PagerDuty SRE Agent, Datadog Bits AI SRE, Splunk ITSI Episode Summarization announced at .conf25 (September 2025), ServiceNow Now Assist SRE Specialist (GA targeted June 2026), and LogicMonitor Edwin AI. The procurement question moves from "is there an AI option" to "which archetype, at what deployment tier."

Site reliability teams in 2026 are evaluating tools in a market that has reorganised faster than most procurement processes can keep up with. Five archetypes share the "AI SRE" label, and buyers regularly compare a postmortem generator to an agentic investigator as if they did the same job. This guide compares the fifteen most-cited tools across both open-source and commercial categories, scored on a single capability matrix so the decision becomes one of fit.

A note on bias. Arvo builds Aurora, an open-source agentic AI SRE tool listed below. We applied the same scoring rubric to every product on the list, including our own, and cited every numeric or capability claim that is not common knowledge.

What is an AI SRE tool?

An AI SRE tool applies large-language-model reasoning to incident response. The term covers five distinct archetypes, and only two of them actually investigate incidents.

Agentic investigation. A multi-step LLM agent that calls infrastructure tools (kubectl, cloud APIs, log queries, dependency graphs) during an incident to gather new evidence and produce a root-cause analysis. Aurora, HolmesGPT, K8sGPT, Resolve.ai, Traversal, NeuBird, Cleric, Causely, and Ciroos all market themselves with this framing.
AIOps correlation. Statistical or ML clustering of alerts to reduce noise. PagerDuty Intelligent Alert Grouping, BigPanda, Dell APEX (Moogsoft), Dynatrace Davis. The category predates LLMs.
Postmortem generation. An LLM that drafts the retrospective from artefacts the team already has (Slack transcripts, monitor data, the investigation trace). Rootly, incident.io Scribe, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered in our Automated Post-Mortem Generation guide.
ITSM-integrated copilot. AI inside an existing service-management workflow. ServiceNow Now Assist SRE Specialist, LogicMonitor Edwin AI, Splunk ITSI Episode Summarization.
Workflow-automation suite plus AI add-on. Incident platforms that bolted AI onto existing on-call, runbook, and status-page features. incident.io AI SRE, Rootly AI, FireHydrant AI.

Conflating archetypes is the most common evaluation mistake. A team buying a postmortem generator will not get root-cause analysis. A team buying an AIOps correlator will not get a tool that runs kubectl. For the foundational definitions, see our AI SRE Complete Guide and AI-Powered Incident Investigation.

The AI SRE Capability Matrix

Five axes, each scored 0 to 3. We apply the same rubric to every tool in the shortlist.

Axis	0	1	2	3
Investigation	None	Single-shot LLM summary	Multi-step agent, single cloud or platform	Multi-step agent, multi-cloud, with RAG over historical evidence
Remediation	None	Suggested commands	PR-based fixes with approval	Sandboxed in-cluster execution with policy guardrails
Postmortem	None	Manual export of a transcript	LLM-drafted from artefacts	LLM-drafted from the agent's own investigation trace, exported to Confluence or Jira
Deployment flexibility	SaaS-only, public cloud	SaaS with private VPC peering	Self-hosted in customer VPC	Air-gapped with local LLM (Ollama or vLLM)
Source availability	Closed source	Source-available, paid	Open core	Apache 2.0 or MIT, fully open

A higher score is not always "better." A team without LLM-ops capacity gains little from a deployment-flexibility score of 3 and should not count it toward its shortlist. The matrix is for like-for-like comparison, not a leaderboard.
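The rubric maps naturally onto a small data structure. The sketch below is one possible encoding, not the tooling used to produce this guide. The class and function names are hypothetical, and the scores are copied from the tool entries later in this guide. The `meets` helper filters on per-axis minimums rather than ranking by total, which is the like-for-like use the matrix is meant for.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CapabilityScore:
    """One tool's score on the five axes, each 0 to 3."""
    investigation: int
    remediation: int
    postmortem: int
    deployment: int
    source: int

    def __post_init__(self) -> None:
        # Reject anything outside the rubric's 0..3 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 3:
                raise ValueError(f"{f.name} must be 0..3, got {value}")

    @property
    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))


# Scores transcribed from this guide's entries (a subset, for illustration).
SCORES = {
    "Aurora": CapabilityScore(3, 3, 3, 3, 3),
    "HolmesGPT": CapabilityScore(2, 1, 1, 2, 3),
    "K8sGPT": CapabilityScore(1, 1, 0, 2, 3),
    "Datadog Bits AI SRE": CapabilityScore(2, 1, 2, 0, 0),
}


def meets(score: CapabilityScore, minimums: dict[str, int]) -> bool:
    """True if the tool reaches every per-axis minimum the team requires."""
    return all(getattr(score, axis) >= floor
               for axis, floor in minimums.items())


if __name__ == "__main__":
    # A team that must self-host and wants LLM-drafted postmortems.
    required = {"deployment": 2, "postmortem": 2}
    print([name for name, s in SCORES.items() if meets(s, required)])
```

Filtering on minimums keeps a tool with a low total in play when it is strong on the only axes a team cares about.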

For a deeper treatment of the deployment-flexibility axis, see our companion piece, Self-Hosted AI SRE: The 2026 Guide to Air-Gapped, Multi-Cloud, and BYO-LLM Deployment.

Which AI SRE tools are most-cited in 2026?

Ordered alphabetically inside each archetype. Scoring reflects the publicly documented capability of each product as of May 2026, not roadmap claims. For category foundations, see our open-source incident management overview and the root cause analysis complete guide for SREs.

Agentic-investigation tools

1. Aurora (Arvo AI), Apache 2.0, multi-cloud

Best for: SRE teams that need self-hosted, multi-cloud, BYO-LLM agentic investigation with the option to graduate into PR-based remediation.
Deployment: Docker Compose, Helm chart, or air-gapped with Ollama. Customer-owned infrastructure.
License: Apache 2.0. Code at github.com/Arvo-AI/aurora.
Investigation: LangGraph-orchestrated ReAct agent, 30+ integrations across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes. Memgraph dependency graph feeds an alert-correlation pre-step. Weaviate hybrid (BM25 plus vector) RAG over runbooks and past postmortems.
Remediation: Sandboxed kubectl execution into an isolated "untrusted" namespace, wrapped in a four-layer command-safety pipeline (input rail, SigmaHQ signature match, per-org policy, LLM safety judge). Aurora Actions add scheduled and event-triggered automations.
Postmortem: Postmortem agent fed by the investigation trace, exported to Confluence Cloud (OAuth) or Server / Data Center (PAT).
Pricing: Free (Apache 2.0). Infrastructure cost only. Optionally, LLM API usage. With local Ollama the recurring software cost is zero.
Watch out for: Self-host means the team operates the agent. Teams without basic Kubernetes ops capacity should pilot in an existing managed cluster first.
Capability score: Investigation 3, Remediation 3, Postmortem 3, Deployment 3, Source 3, total 15/15. The score reflects the breadth of the open-source feature set against the matrix, not a quality verdict relative to commercial competitors.
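A layered command-safety pipeline of the kind described under Remediation can be pictured as a chain of checks where the first denial wins. The sketch below is a generic illustration of that pattern and is not Aurora's implementation: the function names, the two deny patterns, and the allowed-verb policy are all hypothetical, and the LLM judge is a stub that always passes.

```python
import re
import shlex
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    allowed: bool
    layer: str
    reason: str


# A check returns a denial reason, or None to let the command through.
Check = Callable[[str], Optional[str]]


def input_rail(command: str) -> Optional[str]:
    """Reject shell metacharacters so one command cannot smuggle another."""
    if re.search(r"[;&|`$<>]", command):
        return "shell metacharacter in command"
    return None


# Hypothetical stand-ins for signature rules; real rule sets are far larger.
DENY_PATTERNS = [r"\bdelete\s+(ns|namespace)\b", r"--force\b"]


def signature_match(command: str) -> Optional[str]:
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return f"matched deny pattern {pattern!r}"
    return None


def make_org_policy(allowed_verbs: set[str]) -> Check:
    """Build a per-organisation check from a set of permitted kubectl verbs."""
    def org_policy(command: str) -> Optional[str]:
        try:
            tokens = shlex.split(command)
        except ValueError:
            return "command could not be parsed"
        if len(tokens) < 2 or tokens[0] != "kubectl":
            return "only kubectl commands are permitted"
        if tokens[1] not in allowed_verbs:
            return f"verb {tokens[1]!r} not allowed by org policy"
        return None
    return org_policy


def llm_judge_stub(command: str) -> Optional[str]:
    """Placeholder for a model-based review; always passes in this sketch."""
    return None


def evaluate(command: str, layers: list[tuple[str, Check]]) -> Verdict:
    """Run the layers in order and stop at the first denial."""
    for name, check in layers:
        reason = check(command)
        if reason is not None:
            return Verdict(False, name, reason)
    return Verdict(True, "all", "passed every layer")


if __name__ == "__main__":
    layers = [
        ("input_rail", input_rail),
        ("signature", signature_match),
        ("org_policy", make_org_policy({"get", "describe", "logs"})),
        ("llm_judge", llm_judge_stub),
    ]
    for cmd in ("kubectl get pods -n untrusted",
                "kubectl delete namespace prod"):
        print(cmd, "->", evaluate(cmd, layers))
```

Ordering cheap deterministic checks before the model-based one means most unsafe commands are rejected without an LLM call, and the verdict records which layer made the decision.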

2. Causely, closed source, Kubernetes-only

Best for: Kubernetes-only teams that want causal-graph reasoning rather than LLM-first investigation.
Deployment: SaaS with in-cluster collector. Causely appears in the CNCF member listing (as a member organisation, not a hosted project).
License: Closed source.
Investigation: Topology graph plus causality graph plus a "codebook" of failure patterns; the authors describe a deterministic abductive-inference layer that precedes any LLM call. See How Causely Works and the InfoQ piece on causal reasoning in observability. Gartner Cool Vendor for AIOps, December 2025.
Remediation: Suggestion-based via MCP server.
Postmortem: Not a first-class artefact.
Pricing: Not publicly disclosed.
Watch out for: Kubernetes-only by design. If the platform spans cloud SDKs and managed services, the model is incomplete.
Capability score: Investigation 2, Remediation 1, Postmortem 0, Deployment 0, Source 0, total 3/15.

3. Cleric.ai, closed source, Slack-first

Best for: SRE teams that triage primarily in Slack and use Datadog or Grafana for telemetry.
Deployment: SaaS.
License: Closed.
Investigation: Slack-native AI SRE per cleric.ai. Integrations with Datadog and Grafana are documented on the product site.
Remediation: Suggestion-based.
Postmortem: Investigation transcript only.
Pricing: Not publicly disclosed.
Watch out for: Slack-first is a strong constraint. Teams on Microsoft Teams or under strict ChatOps governance may find the surface rigid.
Capability score: Investigation 2, Remediation 1, Postmortem 1, Deployment 0, Source 0, total 4/15.

4. HolmesGPT, Apache 2.0, Kubernetes-first

Best for: Kubernetes-heavy teams that want a CNCF-aligned, RBAC-respecting investigation agent.
Deployment: Helm via Robusta, or standalone CLI. LLM provider is the customer's choice.
License: Apache 2.0. Code at github.com/HolmesGPT/holmesgpt. CNCF Sandbox since October 2025, co-maintained by Robusta and Microsoft.
Investigation: Iterative ReAct agent. Built-in toolsets span Prometheus, Grafana, AWS / Azure / GCP via MCP read-only, Datadog, and Confluence. Releases v0.20 through v0.25 shipped between February and April 2026 (Releases page).
Remediation: Read-only by default. Operator mode can open GitHub PRs. No in-cluster execution.
Postmortem: Not first-class. Investigations route to Slack, PagerDuty, or Jira.
Pricing: Free. Robusta sells a managed SaaS that wraps HolmesGPT.
Watch out for: AWS, Azure, and GCP support is exposed through MCP wrappers rather than first-class cloud SDK integration. The customer IAM model must fit MCP's read-only assumptions.
Capability score: Investigation 2, Remediation 1, Postmortem 1, Deployment 2, Source 3, total 9/15.

5. K8sGPT, Apache 2.0, Kubernetes-only diagnostics

Best for: Quick diagnostic sanity checks on a single cluster.
Deployment: CLI, in-cluster operator, or Helm.
License: Apache 2.0. CNCF Sandbox since 19 December 2023.
Investigation: Rule-based analyser set (Pod, Deployment, Ingress, Service, NetworkPolicy, etc.) with an LLM translating findings into natural language. Closer to L3 (single-shot diagnosis) than L4 (agentic multi-step) on the AICL.
Remediation: Suggestion-based per k8sgpt docs.
Postmortem: Not a feature.
Pricing: Free.
Watch out for: Scope is limited to the cluster API; the tool cannot reach out to cloud APIs or external systems. On the privacy side, resource names and labels are anonymised before LLM calls per the docs.
Capability score: Investigation 1, Remediation 1, Postmortem 0, Deployment 2, Source 3, total 7/15.
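The idea of anonymising resource names before an LLM call can be illustrated with a reversible mask. The sketch below shows the general technique only and is not K8sGPT's code: the `Anonymizer` class and its token format are hypothetical. It does not guard against a sensitive value that happens to match part of a generated token.

```python
import secrets


class Anonymizer:
    """Swap sensitive strings for random tokens and restore them afterwards."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}  # original value -> token
        self._reverse: dict[str, str] = {}  # token -> original value

    def mask(self, text: str, sensitive: list[str]) -> str:
        # Replace longer values first so "payments" does not clobber
        # part of "payments-api-7d9f".
        for value in sorted(sensitive, key=len, reverse=True):
            token = self._forward.get(value)
            if token is None:
                token = "res-" + secrets.token_hex(4)
                self._forward[value] = token
                self._reverse[token] = value
            text = text.replace(value, token)
        return text

    def unmask(self, text: str) -> str:
        # Restore the original names in the model's answer.
        for token, value in self._reverse.items():
            text = text.replace(token, value)
        return text


if __name__ == "__main__":
    anonymizer = Anonymizer()
    finding = "Pod payments-api-7d9f in namespace payments is CrashLoopBackOff"
    masked = anonymizer.mask(finding, ["payments-api-7d9f", "payments"])
    print(masked)                     # what would be sent to the LLM
    print(anonymizer.unmask(masked))  # what the operator sees
```

The mapping never leaves the process, so the model sees consistent placeholders while the operator still reads real resource names.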

6. NeuBird Hawkeye, closed source, multi-platform

Best for: Datadog-heavy AWS shops that want a managed AI SRE.
Deployment: SaaS or VPC. Mayfield, M12, and AWS GenAI Accelerator backing per neubird.ai.
License: Closed.
Investigation: Ephemeral processing (telemetry not stored). Integrations with Datadog, Splunk, CloudWatch, PagerDuty, and ServiceNow per the Hawkeye deep-dive.
Remediation: Read-only by default. Integrations forward to ITSM.
Postmortem: Investigation transcript export.
Pricing: Per-investigation pricing listed on AWS Marketplace; enterprise contracts also available. See NeuBird's product page for the latest.
Watch out for: "Self-learning" implies a vector store that customers cannot directly inspect. Run due diligence on the data path for regulated workloads.
Capability score: Investigation 2, Remediation 1, Postmortem 1, Deployment 1, Source 0, total 5/15.

7. Resolve.ai, closed source

Best for: Enterprise teams that want a managed "AI Production Engineer" with named-customer case studies.
Deployment: SaaS with in-VPC satellite agent for telemetry. No on-prem option. SOC 2, GDPR, HIPAA per the Resolve trust page.
License: Closed.
Investigation: Knowledge-graph plus LLM agent per the Resolve knowledge-graph post. Founders include Spiros Xanthos, an OpenTelemetry co-creator. Resolve's Series A press release reports vendor-claimed customer results that Arvo has not independently verified: 72% investigation-time reduction at Coinbase, 87% faster investigations at DoorDash, and 30% fewer engineers per incident at Zscaler.
Remediation: Generates suggested commands. Public architecture detail is limited.
Postmortem: Investigation transcript.
Pricing: Enterprise. Public pricing is not disclosed.
Watch out for: Cloud-only and closed-source. The two public LLM benchmark posts (Sonnet 4.6) use a private dataset with no public methodology, so the numbers are unreplicable.
Capability score: Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total 6/15.

8. Traversal, closed source

Best for: Log-heavy enterprise environments where causal search across telemetry is the bottleneck.
Deployment: SaaS with flexible deployment options. $48M from Sequoia and Kleiner Perkins, June 2025.
License: Closed.
Investigation: "Production World Model" and "Causal Search Engine" per Traversal's product blog. Vendor-reported production results at American Express, summarised in the Fortune launch coverage and Traversal's Amex announcement: 32% MTTR reduction and 82% RCA accuracy across roughly 250 billion log lines per day. Customer stories at Eventbrite, PepsiCo, and DigitalOcean.
Remediation: Read-only.
Postmortem: Investigation transcript.
Pricing: Enterprise.
Watch out for: Heavy reliance on trademarked frameworks. Confirm during evaluation how much is novel architecture versus packaging.
Capability score: Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total 6/15.

Incumbent and incident-workflow tools

9. Datadog Bits AI SRE, closed source

Best for: Teams standardised on Datadog observability who want investigation where the data already lives.
Deployment: SaaS, multi-tenant.
License: Closed.
Investigation: Multi-agent architecture with planner and worker agents. Datadog's engineering posts Building Bits AI SRE and the evaluation platform describe the design without releasing source. HIPAA-compliant per the product page. Seven triage actions including Slack, Teams, and Jira.
Remediation: Triage actions only.
Postmortem: Bits AI drafts post-incident reports per the product page.
Pricing: Per-conclusive-investigation billing on top of host, APM, logs, and RUM licensing per Datadog pricing.
Watch out for: Bits is tightly bound to Datadog's data plane. Using it without the full Datadog stack is not a supported pattern.
Capability score: Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total 5/15.

10. Edwin AI (LogicMonitor), closed source

Best for: Existing LogicMonitor Envision customers expanding into agentic AIOps.
Deployment: SaaS layered on LogicMonitor.
License: Closed.
Investigation: Ten-plus specialised sub-agents (investigation, correlation, remediation, orchestrator) per the agent-taxonomy post. MCP ecosystem support (Dynatrace, Splunk, ServiceNow, Elastic, GitHub, Confluence). A Forrester Total Economic Impact study commissioned by LogicMonitor reports 313% ROI on a composite organisation with sub-six-month payback.
Remediation: Closed-loop with policy guardrails per LogicMonitor's product description.
Postmortem: Investigation transcript.
Pricing: Bundled with LogicMonitor; quoted.
Watch out for: Customers must purchase LogicMonitor to use Edwin. Not a standalone option.
Capability score: Investigation 2, Remediation 2, Postmortem 1, Deployment 1, Source 0, total 6/15.

11. incident.io AI SRE, closed source

Best for: Teams already using incident.io for on-call and incident workflow who want the AI add-on.
Deployment: SaaS.
License: Closed.
Investigation: Multi-agent system searching GitHub PRs, Slack, historical incidents, logs, metrics, and traces per incident.io's AI SRE introduction. An "ambient agent" continuously monitors. The ZenML LLMOps case study documents the retrieval evolution from embeddings-only to deterministic tagging plus re-ranking.
Remediation: Recommendations only.
Postmortem: Scribe drafts post-incident reports.
Pricing: Platform tiers on incident.io's pricing page. AI SRE access is gated to design partners as of the launch announcement.
Watch out for: Verify AI SRE availability for your tier before assuming you can use it on day one.
Capability score: Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total 5/15.

12. PagerDuty SRE Agent, closed source

Best for: PagerDuty Operations Cloud customers who want a memory-equipped agent inside the existing on-call surface.
Deployment: SaaS, inside PagerDuty Operations Cloud per the product page.
License: Closed.
Investigation: Per-tenant memory: service-scoped observations, incident recollections, human-promoted playbooks. See PagerDuty's engineering post We Built an SRE Agent With Memory. MCP server. Connectors to Grafana, New Relic, and Honeycomb. Three-tier engagement model (agent-led, collaborative, human-led).
Remediation: Suggestions and automation hooks through existing PagerDuty workflows.
Postmortem: PagerDuty Scribe.
Pricing: Per-seat tiers and AIOps add-ons listed on PagerDuty pricing.
Watch out for: AI pricing across the incident-management category is moving from per-seat to usage-based. Model the long-term cost against incident volume rather than seat count.
Capability score: Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total 5/15.

13. Rootly AI, closed source

Best for: Teams that want an AI-first ChatOps incident response with an open MCP server and an actively published agent roadmap.
Deployment: SaaS.
License: Closed core. Rootly AI Labs publishes open-source prototypes.
Investigation: Analyses code changes, telemetry, and past incidents per the Rootly AI SRE page. An AI Meeting Bot joins incident bridges and transcribes. The Rootly API agent-first announcement describes the MCP-based agentic surface used by Cursor, Windsurf, and Claude.
Remediation: Suggestions plus workflow automation.
Postmortem: AI-drafted from incident artefacts.
Pricing: Tiers listed on Rootly pricing.
Watch out for: "AI-first" branding outpaces the published architecture detail; in evaluation, ask for the agent loop description and the rule-based-automation boundary.
Capability score: Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 1, total 6/15.

14. ServiceNow Now Assist SRE Specialist, closed source

Best for: Enterprises on ServiceNow ITSM that want triage and post-mortems inside the same platform.
Deployment: SaaS, ServiceNow cloud.
License: Closed.
Investigation: The "SRE Specialist" performs triage (what, impact, priority, who) and autonomous post-mortem authoring, announced as part of the Autonomous Workforce in ServiceNow's Knowledge 2026 release. GA targeted June 2026.
Remediation: Workflow automation.
Postmortem: Autonomous authoring claimed.
Pricing: Custom-quoted. Public pricing is not disclosed.
Watch out for: As of May 2026 the product is pre-GA and most coverage is press-release or keynote material. Treat capabilities as preliminary until verified during the design-partner phase.
Capability score: Investigation 2, Remediation 2, Postmortem 2, Deployment 0, Source 0, total 6/15.

15. Splunk ITSI Episode Summarization, closed source (Alpha)

Best for: Splunk-heavy enterprises that want LLM summaries layered on existing KPI engines.
Deployment: Splunk Cloud.
License: Closed.
Investigation: ITSI Episode Summarization, announced at .conf25 (September 2025), is in Alpha. The feature layers an LLM-generated summary (what happened, when, key events, suspected cause) onto Splunk ITSI's KPI-based episodes. Splunk also ships Event iQ for AI-driven alert correlation, listed on the ITSI product page.
Remediation: Recommendation-based.
Postmortem: Not yet a published feature.
Pricing: Splunk ITSI is data-volume or entity-count licensed. The AI features are in Alpha.
Watch out for: Alpha contract and capability terms can shift. Plan a re-evaluation after GA.
Capability score: Investigation 1, Remediation 1, Postmortem 1, Deployment 0, Source 0, total 3/15.

Scoring summary

#	Tool	License	Score
1	Aurora	Apache 2.0	15
4	HolmesGPT	Apache 2.0	9
5	K8sGPT	Apache 2.0	7
7	Resolve.ai	Closed	6
8	Traversal	Closed	6
10	Edwin AI	Closed	6
13	Rootly AI	Closed (Labs OSS)	6
14	ServiceNow Now Assist SRE	Closed	6
6	NeuBird Hawkeye	Closed	5
9	Datadog Bits AI SRE	Closed	5
11	incident.io AI SRE	Closed	5
12	PagerDuty SRE Agent	Closed	5
3	Cleric.ai	Closed	4
2	Causely	Closed	3
15	Splunk ITSI Episode Summarization	Closed	3

The open-source projects lead the deployment-flexibility and source-availability axes by definition. Aurora is the only entry that scores 3 on every axis. Commercial tools cluster around 5 to 6 because most score 2 or 3 on investigation but 0 or 1 on deployment flexibility and source availability. Causely (Kubernetes-only) and the Alpha-stage Splunk ITSI feature score lowest, at 3, because their scope or maturity caps multiple axes. K8sGPT's narrow scope also caps its investigation and postmortem scores, but its deployment and source scores lift it to 7.

The score does not pick a winner. It picks a fit. A bank under FedRAMP High obligations evaluates this list differently from a 50-engineer Series B startup. The deployment axis answers the fitness question; investigation answers the depth question; source availability answers the trust question.

How do I choose an AI SRE tool?

Most procurement processes stall because the team compares across all five axes at once. Asking these three questions in order eliminates twelve of the fifteen tools before vendor demos.

Does the data have to stay in our perimeter? If yes, the answer is Aurora, HolmesGPT, or K8sGPT. Every commercial product on this list requires data to leave the customer perimeter for inference. See Self-Hosted AI SRE for the architecture you will need.
Is the scope multi-cloud or Kubernetes-only? If multi-cloud, the open-source shortlist narrows to Aurora; in the commercial set, Resolve.ai, Traversal, NeuBird, and incident.io are the credible candidates. If Kubernetes-only, every tool on the list remains valid, and Aurora's non-Kubernetes integrations simply go unused.
Do you need to take action, or only investigate? Read-only covers most of the open-source category and most incumbent AI features. Actioning agents narrow the list to Aurora (PR-based, sandboxed kubectl, plus Aurora Actions), ServiceNow Now Assist (workflow automation), and Edwin AI (closed-loop within LogicMonitor).
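The three questions behave like successive filters over the shortlist. The sketch below encodes a subset of the tools with flags transcribed from the three answers above, so a tool not named in an answer is marked False. The `Tool` fields and the `shortlist` function are hypothetical names, and a real evaluation would verify each flag with the vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    self_hosted: bool    # data can stay inside the customer perimeter
    multi_cloud: bool    # named as a multi-cloud candidate in this guide
    takes_action: bool   # named as an actioning agent in this guide


# Subset of the fifteen tools, for illustration only.
TOOLS = [
    Tool("Aurora", True, True, True),
    Tool("HolmesGPT", True, False, False),
    Tool("K8sGPT", True, False, False),
    Tool("Resolve.ai", False, True, False),
    Tool("Traversal", False, True, False),
    Tool("Edwin AI", False, False, True),
]


def shortlist(tools: list[Tool],
              data_must_stay_in_perimeter: bool,
              needs_multi_cloud: bool,
              needs_action: bool) -> list[str]:
    """Apply the three questions in order; a 'no' leaves the list untouched."""
    result = list(tools)
    if data_must_stay_in_perimeter:
        result = [t for t in result if t.self_hosted]
    if needs_multi_cloud:
        result = [t for t in result if t.multi_cloud]
    if needs_action:
        result = [t for t in result if t.takes_action]
    return [t.name for t in result]


if __name__ == "__main__":
    print(shortlist(TOOLS, True, False, False))   # perimeter only
    print(shortlist(TOOLS, False, True, False))   # multi-cloud only
    print(shortlist(TOOLS, False, False, True))   # action only
```

Asking the perimeter question first matters because it is the least negotiable: it removes every SaaS-only option before any feature comparison starts.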

For depth on the action-safety question, see our AI Agent kubectl Safety guide and CI/CD Auto-Remediation Complete Guide.

What to watch next

Arvo expects the category to converge along three axes through the rest of 2026.

Model Context Protocol convergence. PagerDuty, Rootly, Aurora, HolmesGPT, Causely, and Edwin AI have all shipped MCP servers. MCP is on track to become table stakes by year-end, which means differentiation will shift to prompt graphs, RAG quality, and policy guardrails.
Open benchmarking. Resolve.ai and Rootly have published proprietary LLM benchmark posts, neither with a reproducible dataset. The first open, named benchmark with a public incident corpus is likely to set the citation surface the category orbits.
Pricing model fragmentation. Per-seat (PagerDuty, Rootly, incident.io), per-investigation (Datadog Bits AI, NeuBird), per-credit (ServiceNow), per-cloud-host (Edwin AI), and free open source (Aurora, HolmesGPT, K8sGPT) coexist today. Expect convergence on a published reference cost per investigation as buyers compare more rigorously.

Differentiation in this market is structural rather than feature-by-feature. Buyers who score against the capability matrix and apply the deployment, scope, and action questions usually land on a credible shortlist of two or three tools within a week. Buyers running feature-list comparisons evaluate for a quarter.

Originally published at arvoai.ca.

Automated Post-Mortem Generation: The Complete Guide for SRE Teams (2026)

Siddharth Singh — Wed, 13 May 2026 16:13:39 +0000

Key Takeaways

Automated post-mortem generation is the process of producing an incident retrospective from artifacts already collected during the incident — chat transcript, alert timeline, monitor data, and (in agentic systems) the investigation agent's own tool-call trace. The category is not a single technology; it's an output shared by three distinct architectures.

We propose the Postmortem Provenance Model (PPM). Three source types: (1) chat-transcript postmortems (Rootly, incident.io, FireHydrant) summarize what humans said in the channel; (2) observability-stitched postmortems (Datadog Bits AI) summarize what monitors recorded; (3) agentic-investigation postmortems (Aurora) compose from the agent's causal reasoning trace. The three artifacts answer different questions and are not interchangeable.

The standards that anchor this work are old and unchanged by AI. Google SRE Book Chapter 15, "Postmortem Culture" (Lunney and Lueder, 2016), and John Allspaw's "Blameless PostMortems and a Just Culture" (Etsy, May 2012) define what a postmortem is for. AI changes the authoring cost, not the purpose.

The vendor landscape has consolidated since late 2023. PagerDuty acquired Jeli in November 2023 for $29.7M; FireHydrant was acquired by Freshworks in December 2025; Squadcast was acquired by SolarWinds. ServiceNow's Now Assist SRE specialist (GA targeted June 2026) brings the largest ITSM vendor into the postmortem-generation lane.

Open-source agentic-investigation postmortems are a small lane. Aurora (Apache 2.0) generates postmortems from its own investigation agent's reasoning chain and exports to Confluence Cloud (OAuth) or Server / Data Center (PAT), with customizable per-org templates and version history.

A good postmortem outlives the incident. An automated post-mortem is an incident retrospective whose narrative, timeline, root cause, contributing factors, and action items are drafted by software rather than by hand — typically a large language model, sometimes a tool-using agent, always built on artifacts already collected during the incident. This guide is for SRE, platform, and incident-management leaders deciding which automated-postmortem architecture matches their team's working style — not which vendor logo to add to their stack.

Why automation, and why now

Most teams write postmortems by hand. Most postmortems are late, short, and read by no one. The reason is unsentimental: writing a good postmortem takes hours of reconstruction work, on top of an incident that has already drained the on-call engineer's day. Practitioner posts converge on a figure of 4–8 hours per postmortem of moderate complexity, most of it spent in Slack, dashboards, and ticket trails trying to reassemble the timeline.

The market response since 2023 has been a wave of automated-postmortem features: Rootly AI Copilot, incident.io Scribe and AI summaries, FireHydrant AI-Drafted Retrospectives, Datadog Bits AI postmortem variables, and PagerDuty Scribe Agent. The pitch is similar across them: 90 minutes of human reconstruction collapses to 15 minutes of human review.

The honest framing is that these tools do real work, but most of them are summarizing artifacts that already exist. They are not investigating; they are transcribing. That's enough for many teams, especially those whose incidents are well-captured in their incident-channel chatter. It is not enough for teams whose incidents require deep investigation across systems — and that gap is what the agentic-investigation category is starting to fill.

The Postmortem Provenance Model (PPM)

The three architectures differ in what they read from, not in what they produce. Same sections, different evidence.

Source type	Reads from	Strength	Limitation
Chat-transcript	Slack / Teams / Zoom channel for the incident window; on-call chatter; status updates	Captures human narrative, decisions, and judgment calls verbatim	Inherits human errors and gaps; weak on infrastructure facts the channel didn't surface
Observability-stitched	Monitor events, alert timeline, dashboards, deployment history	Strong factual timeline, embedded graphs and logs	Misses human context; weak on contributing factors that aren't in telemetry
Agentic-investigation	The investigation agent's tool-call trace, reasoning chain, evidence collected mid-incident	Causal record of what the system did and what the agent found	Requires running an investigation agent in the first place; quality depends on the agent

A team's choice should match its incident profile. If most incidents resolve in chat with little investigation needed, a chat-transcript tool is fine. If incidents are surfaced and resolved entirely in your observability stack, an observability-stitched approach gives you tight monitor-to-postmortem fidelity. If your incidents require traversing AWS, GCP, Kubernetes, and your own services to find the cause, an agentic-investigation postmortem is the only artifact that records the work the agent actually did.

Standards: what a postmortem is for

It is worth grounding the conversation in what postmortems were designed to do before LLMs existed.

Google SRE Book, Chapter 15, "Postmortem Culture: Learning from Failure" by John Lunney and Sue Lueder (O'Reilly, 2016). The canonical text on blameless postmortems as organizational learning. The companion SRE Workbook Chapter 10 updates the practical guidance.
John Allspaw — Blameless PostMortems and a Just Culture (Etsy Code as Craft, May 22, 2012). The earlier articulation of why blameless-ness is operationally load-bearing.
Lunney — Postmortem Action Items (USENIX ;login: Spring 2017). The honest practitioner read on why most postmortems' action items never get done.
PagerDuty's open-source Postmortem documentation (Apache 2.0, GitHub). Includes a maintained postmortem template used as a baseline by many teams.
Verica Open Incident Database (VOID). The 2nd Annual VOID Report (December 2022) catalogs approximately 10,000 incidents from 600+ organizations; its central finding is that MTTR is statistically unreliable as a cross-organization comparison and that only ~25% of public incident reports clearly identify a root cause. A useful corrective to the "we reduced MTTR by X%" claims that pepper vendor marketing.
Dan Luu's curated postmortems collection. The widest public corpus of real postmortems; useful as RAG fuel for any AI postmortem system.

A blameless, learning-oriented postmortem is the goal. Automation changes the authoring cost; it does not relax the standard.

What gets auto-generated today

A typical 2026 automated postmortem produces some subset of:

Summary — one paragraph, the executive read.
Timeline — chronological events with timestamps (often HH:MM UTC).
Impact — customer-facing effect, services affected, error budget burn.
Root cause — the technical fault.
Contributing factors — human, process, and organizational conditions that allowed the incident.
Resolution — what stopped the bleeding.
Action items — owners, due dates, follow-ups.
Lessons learned — what the team would do differently.

Different products auto-draft different subsets. The "Lessons Learned" section, in particular, is left to humans in most products — for the obvious reason that it is the section where judgment is most consequential.
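
The section list above can be modeled as a small data structure, which makes it easy to see which sections a given tool left undrafted and which still need a human author. This is a minimal sketch: the `PostmortemDraft` and `Section` types and the `HUMAN_OWNED` set are illustrative assumptions, not any vendor's schema.

```python
from dataclasses import dataclass, field

# Section names follow the list above. HUMAN_OWNED is an assumption for
# illustration: it marks sections a team wants a person to write or rewrite.
SECTIONS = (
    "summary", "timeline", "impact", "root_cause",
    "contributing_factors", "resolution", "action_items", "lessons_learned",
)
HUMAN_OWNED = {"lessons_learned"}


@dataclass
class Section:
    text: str
    author: str  # "ai" or "human"


@dataclass
class PostmortemDraft:
    incident_id: str
    sections: dict[str, Section] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Sections nobody has drafted yet, in canonical order."""
        return [name for name in SECTIONS if name not in self.sections]

    def awaiting_human(self) -> list[str]:
        """Human-owned sections that are absent or still machine-written."""
        return [
            name for name in SECTIONS
            if name in HUMAN_OWNED
            and (name not in self.sections or self.sections[name].author != "human")
        ]
```

A publish step could refuse to proceed while `awaiting_human()` is non-empty, which keeps the judgment-heavy section in human hands.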

The tooling landscape

Concrete vendor positioning as of May 2026.

Product	License / hosting	What it auto-generates	Notes
Rootly AI Copilot	Closed, SaaS	Narrative summary, timeline, action items, root cause, embedded Datadog charts; meeting-bot transcription	Headline claim: 90 min → 15 min review. Exports to Confluence, Google Docs, Notion, Slack.
incident.io AI postmortems	Closed, SaaS	Summary, timeline, contributing factors, suggested follow-ups; Scribe transcribes call audio	"Lessons Learned" is left to humans by design. Exports to Confluence, Notion, Google Docs.
FireHydrant AI-Drafted Retrospectives	Closed, SaaS	Description, customer impact, lessons learned; Copilot compares ongoing incident to past incidents	Acquired by Freshworks December 2025; AI features are Enterprise tier only.
Datadog Bits AI postmortems	Closed, SaaS	Summary, customer impact, lessons learned variables; dynamic embedded graphs and logs	Exports to Datadog Notebooks, Confluence, or Google Drive.
PagerDuty Scribe Agent	Closed, SaaS	Real-time call transcription and timeline contributions to PagerDuty's Postmortems product	Part of PagerDuty's Spring 2026 agent suite (SRE Agent, Scribe Agent, Insights Agent).
Aurora	Apache 2.0, self-hosted	Summary, timeline (HH:MM UTC), root cause, impact, contributing factors, resolution, action items, lessons learned; generated from the investigation agent's reasoning trace	Per-org template overrides; Confluence Cloud (OAuth) and Server / Data Center (PAT) export.
ServiceNow Now Assist SRE specialist	Closed, SaaS	Triage + postmortem documentation end to end	GA targeted June 2026 (Knowledge 2026 announcement).
Squadcast	Closed, SaaS	One-click postmortem, webhook automation, templates	Acquired by SolarWinds.

The pattern: the SaaS-IM vendors all do chat-transcript postmortems well; Datadog owns the observability-stitched lane; Aurora is the open-source agentic-investigation option. ServiceNow's June 2026 GA brings the largest ITSM vendor into the category as a fourth meaningful entrant.

Architecture: how agentic-investigation postmortems work

This is worth describing in detail because agentic investigation is the category least visible to most buyers.

In a chat-transcript postmortem system, the flow is: incident channel → LLM with a postmortem template prompt → draft document. In an observability-stitched postmortem system, the flow is: incident timeline + dashboards → LLM with embedding variables → draft document with live charts.

An agentic-investigation postmortem starts earlier — at the investigation. The pattern, using Aurora as the concrete open-source example:

Alert webhook arrives. PagerDuty, Datadog, Grafana, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, NewRelic, OpsGenie, or incident.io fires. The provider-specific RCA-prompt builder constructs the agent's first message, including alert metadata, severity, service, and environment.
Investigation runs. Aurora's ReAct-style LangGraph agent calls tools across the next 3–15 minutes — kubectl, cloud CLIs, knowledge-base search, Terraform read, Confluence search — and accumulates a transcript of tool calls, tool results, and reasoning steps. The result is persisted as the incident's aurora_summary — the agent's RCA narrative.
Postmortem dispatch. When the incident is resolved (either manually, via Aurora's "Run Action" dropdown on completed incidents, or via an Aurora Actions on-incident-completion trigger), a postmortem agent run is dispatched with the agent's RCA summary as load-bearing context. The postmortem agent re-reads the original investigation output, optionally pulls Slack channel context for the incident window, and composes the postmortem under a per-org template.
Storage and versioning. Drafts are stored in PostgreSQL with version history. Engineers can edit; subsequent regenerations preserve human edits as a separate version.
Confluence export. The user clicks Export. Aurora pushes the rendered postmortem to Confluence Cloud (OAuth) or Server / Data Center (PAT), creating a page under a configured space and parent. Export is currently user-triggered rather than automatic, which preserves the human review step before publication.

The structural difference from chat-transcript postmortems is what evidence the LLM gets. A chat-transcript system can only describe what humans typed. An agentic-investigation system describes what the agent did, which tools it ran, what the cloud responded with, and how it reasoned through to the root cause. The artifact carries the actual causal trail, not a social reconstruction of it.
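
The first and third steps of that flow can be sketched in a few lines: turning alert metadata into the agent's opening message, and handing the postmortem run the investigation's evidence trail rather than a chat log. The function names, field names, and prompt wording below are hypothetical and are not taken from Aurora's codebase.

```python
from dataclasses import dataclass

# Hypothetical alert fields; real webhook payloads differ per provider.
ALERT_FIELDS = ("source", "title", "severity", "service", "environment")


def build_first_message(alert: dict) -> str:
    """Render alert metadata into the agent's opening prompt."""
    lines = [f"{name}: {alert.get(name, 'unknown')}" for name in ALERT_FIELDS]
    return "Investigate this alert and identify the root cause.\n" + "\n".join(lines)


@dataclass
class Step:
    tool: str
    command: str
    result: str


def postmortem_context(rca_summary: str, transcript: list[Step]) -> str:
    """Bundle the RCA narrative with the tool calls that produced it."""
    evidence = "\n".join(
        f"- {step.tool}: {step.command} -> {step.result}" for step in transcript
    )
    return f"RCA summary:\n{rca_summary}\n\nEvidence trail:\n{evidence}"
```

The point of the second function is the contrast drawn above: the postmortem prompt carries what the tools returned, not what people typed.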

How to evaluate an automated postmortem tool

A rubric you can run on any vendor — open source or commercial.

Provenance match. Does the tool's source-of-truth match how your team actually runs incidents? Chat-heavy team → chat-transcript. Observability-heavy team → Datadog or equivalent. Investigation-heavy team → agentic.
Template control. Can you replace the vendor's template with your team's, and does the tool support per-team templates? Aurora supports per-org template overrides via its actions configuration table; vendor SaaS varies.
Export target. Confluence Cloud, Server / Data Center, Notion, Google Docs, internal wiki. Match your team's documentation home. Aurora supports Confluence (both flavors); the SaaS vendors support different combinations.
Edit lineage. When the AI draft is edited, regenerated, and edited again, what survives? Test this explicitly with three round trips. Aurora preserves version history; check each candidate.
Action-item ownership. Does the tool extract action items with owners and due dates, or just bullet points? The Lunney USENIX piece is blunt about why this matters: action items without owners do not get done.
Embedded evidence. Are graphs, logs, and resource identifiers embedded inline or linked? Embedded survives the documentation system; linked rots over time.
Cost and privacy. Where does the postmortem text get processed? Self-hosted with bring-your-own-LLM (Aurora) keeps incident data on your infrastructure; SaaS vendors vary in how they handle this and your security team will want to know.
Standards alignment. Does the generated artifact match the blameless tradition (Allspaw, Lunney, the SRE Book) or accidentally drift into individual blame? Check the prompt if you can; otherwise inspect a sample.
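
The edit-lineage check in the rubric can be made concrete with an append-only version history and three draft-edit round trips. This is a sketch of the property to test for, using hypothetical `Version` and `DraftHistory` types; it does not describe how any particular product stores drafts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Version:
    number: int
    author: str  # "ai" or "human"
    body: str


@dataclass
class DraftHistory:
    versions: list[Version] = field(default_factory=list)

    def save(self, author: str, body: str) -> Version:
        """Append a new version; earlier versions are never overwritten."""
        version = Version(len(self.versions) + 1, author, body)
        self.versions.append(version)
        return version

    def latest_human(self) -> Optional[Version]:
        """Most recent human-authored version, if any."""
        return next((v for v in reversed(self.versions) if v.author == "human"), None)


# Three round trips of regenerate-then-edit, followed by one more regeneration.
history = DraftHistory()
for round_trip in range(1, 4):
    history.save("ai", f"AI draft {round_trip}")
    history.save("human", f"Human edit {round_trip}")
history.save("ai", "AI draft 4")
```

After the final regeneration, every human edit should still be retrievable. A tool that fails the equivalent test in its own UI is overwriting work.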

How to roll out automation without breaking culture

A six-step adoption plan that respects the standards while saving the time.

Start with the easiest 30% — short-impact incidents with mostly-chat investigations. These produce passable AI drafts on day one.
Keep humans on lessons learned. Even tools that auto-generate the "Lessons Learned" section ship it as a draft to be aggressively rewritten. The judgment in that section is the point of the postmortem.
Require human edit before publish. The on-call engineer who ran the incident should always be the one who clicks "Publish." This is the cultural firewall.
Track action-item completion separately. Postmortem action items have a well-documented completion gap, and AI-generated ones are no exception. Add a weekly review of the previous week's postmortem action items, with owners called out by name.
Run a quarterly audit of the generated postmortems. Pick five at random; have a senior engineer read them critically. Look for drift toward individual blame, missed contributing factors, and surface-level root causes.
Tighten the loop with the investigation tool. If your investigation tool and postmortem tool are the same product (Aurora, eventually Resolve.ai-class systems), the postmortem inherits the investigation's evidence chain. This is the highest-quality automated postmortem possible — but it requires running an agentic investigation in the first place.

What can go wrong

A short failure-mode list.

Surface-level root cause. AI drafts read confidently while attributing a deep system issue to its most visible symptom. The cure is human review by someone who was in the incident.
Hallucinated timeline. The LLM invents events, misattributes timestamps, or duplicates entries. This is most common when the input artifact (chat transcript or telemetry) has gaps the model patches over.
Blame drift. The AI summary slips into individual-blame framing because the human chat did. The blameless tradition exists exactly for this reason; the AI does not enforce it on its own.
Action items without ownership. A bullet list of "should do X" with no owner is not an action item; it is decoration. Treat ownerless action items as a failure of the tool's prompt.
Edit loss on regeneration. Some tools overwrite human edits when the user clicks "Regenerate." Verify that version history is preserved before trusting the tool for a quarter's worth of postmortems.
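
The "action items without ownership" failure is easy to lint for before a postmortem is published. A minimal sketch, assuming a hypothetical `ActionItem` shape with optional owner and due date:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return items missing an owner or a due date."""
    return [item for item in items if not item.owner or item.due is None]


items = [
    ActionItem("Raise memory limit on checkout", "dana", date(2026, 6, 1)),
    ActionItem("Add alert on restart count"),
    ActionItem("Review HPA settings", owner="lee"),
]
flagged = [item.title for item in incomplete_items(items)]
```

Here `flagged` contains the two items that lack an owner or a due date; a publish gate could block on a non-empty list.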

Where Aurora fits

Aurora is the open-source agentic-investigation entry in this category. Apache 2.0, self-hosted via Docker Compose or Helm. Postmortems are generated from the same agent that ran the investigation, with per-org template control, version history, Slack context backfill, and export to Confluence Cloud or Server / Data Center. If your incidents look like chat-resolved coordination work, you probably don't need Aurora's postmortem layer specifically. If your incidents look like deep cross-cloud investigation work, you probably do.

For more on how Aurora's investigation half works, see our AI-Powered Incident Investigation guide. For how Aurora's automation primitive (Aurora Actions) lets you chain postmortem generation onto every incident automatically, see the Aurora Actions launch post.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Related guides: AI-Powered Incident Investigation · Aurora Actions · Root Cause Analysis: Complete Guide for SREs

AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)

Siddharth Singh — Wed, 13 May 2026 16:12:09 +0000

Key Takeaways

AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs kubectl, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.

We propose the AI Investigation Capability Ladder (AICL). Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).

CNCF now hosts two open-source agentic projects in this lane. HolmesGPT entered the CNCF Sandbox in October 2025. K8sGPT has been a Sandbox project since December 19, 2023. Aurora (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.

The 2024 DORA State of DevOps Report formalized recovery time as Failed Deployment Recovery Time (FDRT). Per DORA's metrics history, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous. The 2024 DORA report added "deployment rework rate" as a fifth core measure.

The closed-source peer set is well-funded. Resolve.ai raised $125M at a $1B valuation in February 2026. Traversal reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.

Cloud incidents in 2026 surface faster than humans can investigate them. AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis. Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs kubectl describe, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.

This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.

What "investigation" means here

Three things blur together when people say "AI incident response":

Alert correlation — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.
Postmortem generation — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our Automated Post-Mortem Generation guide.
Agentic investigation — an LLM that runs new tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.

Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run kubectl will be even more disappointed.

The AI Investigation Capability Ladder (AICL)

Six tiers of increasing autonomy. Pick the highest tier you can defend operationally: climbing further is an engineering problem, and stopping short is a process decision.

Tier	What runs	Human role	Representative tools
L0 — Manual	Engineer reads alerts, runs `kubectl` and cloud CLIs by hand	Everything	PagerDuty, Slack, Datadog
L1 — Alert correlation	ML correlator clusters and dedupes events	Triage from a smaller list	PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI
L2 — LLM-summarized timeline	LLM summarizes an event stream into prose	Reads summary instead of raw events	Datadog Bits AI summaries, incident.io Scribe
L3 — Single-shot LLM diagnosis	LLM produces an RCA from one prompt over alert + telemetry	Trusts a single inference	K8sGPT analyzers, vendor "AI insights" buttons
L4 — Agentic multi-step investigation	LLM agent calls many tools across multiple turns, replans as findings arrive	Reviews trace, ships fix	Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely
L5 — Closed-loop investigate + remediate	Agent investigates and proposes (or applies, with approval) a fix	Approves remediation	Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE
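
One way to apply the ladder during an evaluation is to record which capabilities you actually observed and map them to a tier. The capability labels and the mapping rules below are an illustrative reading of the table, not a formal definition of the AICL.

```python
from enum import IntEnum


class Tier(IntEnum):
    L0_MANUAL = 0
    L1_ALERT_CORRELATION = 1
    L2_SUMMARIZED_TIMELINE = 2
    L3_SINGLE_SHOT = 3
    L4_AGENTIC = 4
    L5_CLOSED_LOOP = 5


def classify(capabilities: set[str]) -> Tier:
    """Map observed capabilities (hypothetical labels) to the highest tier met."""
    if {"multi_step_tools", "remediation_with_approval"} <= capabilities:
        return Tier.L5_CLOSED_LOOP
    if "multi_step_tools" in capabilities:
        return Tier.L4_AGENTIC
    if "single_shot_diagnosis" in capabilities:
        return Tier.L3_SINGLE_SHOT
    if "llm_summary" in capabilities:
        return Tier.L2_SUMMARIZED_TIMELINE
    if "alert_correlation" in capabilities:
        return Tier.L1_ALERT_CORRELATION
    return Tier.L0_MANUAL
```

In this sketch, remediation without multi-step investigation does not count as L5, on the assumption that closed-loop work builds on the L4 investigation.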

The honest framing: most teams are L0 or L1 today. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.

Traditional AIOps vs agentic investigation

Both are useful; they cover non-overlapping work.

Capability	Traditional AIOps (L1)	Agentic investigation (L4)
Input	Event stream, telemetry already ingested	Same, plus live tool calls
Output	Ranked suspects, correlated incidents	RCA narrative, evidence chain, suggested fix
New evidence	No — operates on what's already in the system	Yes — agent issues new commands
Reasoning	ML clustering / topology distance scoring	LLM step-by-step (ReAct or similar)
Why it can be wrong	Missing event, weak topology graph	Hallucination, tool misuse, prompt drift
Cost model	Per-event or per-host	Per LLM token + tool runtime
Failure mode	Quiet — wrong cluster, you don't know	Loud — agent's trace is human-readable

Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Evidence of this stacking pattern is showing up across vendors in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's ITSI Episode Summarization (announced at .CONF25, September 2025) layers an LLM summary on top of its KPI engine.

The agentic peer set in 2026

This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.

Product	License	Scope	Notes
Aurora	Apache 2.0	AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations	LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm.
HolmesGPT	Apache 2.0	Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets	Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira.
K8sGPT	Apache 2.0	Kubernetes resource diagnostics	CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder.
Cleric.ai	Closed source	Slack-first AI SRE	Gartner Cool Vendor 2025. Integrates Datadog and Grafana.
Resolve.ai	Closed source	Multi-cloud AI SRE	$125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk.
Traversal	Closed source	"Causal search engine" for production systems	$48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express.
Neubird Hawkeye	Closed source	Llama 3.2 70B fine-tuned + ChromaDB RAG	SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow.
Causely	Closed source	Causal-graph reasoner for Kubernetes	Gartner Cool Vendor 2025. MCP server. Gemini-powered.
Ciroos.AI	Closed source	"SRE Teammate" multi-agent	MCP and A2A architecture.

If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is a strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.

For a deeper open-source-only comparison, see our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide.

Architecture: what makes investigation "agentic"?

Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.

1. A tool-calling loop (ReAct or similar)

The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the ReAct pattern (Reason + Act, Yao et al. 2022). Aurora's implementation is a single-node LangGraph workflow wrapping langchain.agents.create_agent; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is implementation detail — what matters is multi-turn tool use.
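
The loop itself is small. The sketch below shows the reason-act-observe cycle in plain Python, with a scripted policy standing in for the LLM and two fake read-only tools standing in for kubectl. None of these names come from Aurora, HolmesGPT, or LangGraph; they only illustrate the control flow.

```python
from typing import Callable


# Hypothetical read-only tools; a real agent would call kubectl or cloud APIs.
def get_pod_status(name: str) -> str:
    return "CrashLoopBackOff" if name == "checkout" else "Running"


def get_pod_logs(name: str) -> str:
    return "OOMKilled: memory limit 256Mi exceeded"


TOOLS: dict[str, Callable[[str], str]] = {
    "get_pod_status": get_pod_status,
    "get_pod_logs": get_pod_logs,
}


def scripted_policy(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: pick the next action from what has been seen."""
    if not observations:
        return ("get_pod_status", "checkout")
    if observations[-1] == "CrashLoopBackOff":
        return ("get_pod_logs", "checkout")
    return ("finish", f"Root cause candidate: {observations[-1]}")


def investigate(policy, max_steps: int = 5):
    """Run the loop until the policy finishes or the step budget runs out."""
    observations: list[str] = []
    trace: list[tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, arg = policy(observations)
        if action == "finish":
            return arg, trace
        result = TOOLS[action](arg)          # act
        trace.append((action, arg, result))  # keep a human-readable trace
        observations.append(result)          # observe, then re-plan
    return "No conclusion within step budget", trace


rca, trace = investigate(scripted_policy)
```

The second tool call depends on the result of the first, which is the property that separates a multi-turn agent from a single-shot diagnosis. The `max_steps` budget is a simple guard against runaway loops.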

2. Tool reach across the stack

An investigation agent that can only read Kubernetes will miss every multi-cloud incident. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, NewRelic, OpsGenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.

3. Sandboxed CLI execution

Letting the agent run kubectl and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:

Prompt-injection input rail (NVIDIA NeMo Guardrails) blocks commands that originate from injected instructions.
Static signature match against 37 vendored SigmaHQ detection rules covering known-malicious command patterns.
Per-org command policy — allow/deny lists scoped to the customer's tenant.
LLM safety judge adapted from Meta's PurpleLlama AlignmentCheck.

Approved commands execute via kubectl exec into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our AI Agent kubectl Safety guide for the full threat model.
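
A layered pipeline of this kind can be sketched as an ordered list of checks, where the first rejection wins. Every check below is a toy stand-in: the injection rail is a string match rather than NeMo Guardrails, the signatures are two hand-written regexes rather than SigmaHQ rules, and the judge is a placeholder that always passes. Only the ordering and the short-circuit behavior are the point.

```python
import re
from typing import Callable, Optional

# A check returns a rejection reason, or None to let the command continue.
Check = Callable[[str], Optional[str]]


def injection_rail(cmd: str) -> Optional[str]:
    suspicious = "ignore previous instructions" in cmd.lower()
    return "possible injected instruction" if suspicious else None


# Toy patterns for illustration only.
SIGNATURES = [re.compile(r"rm\s+-rf\s+/"), re.compile(r"curl\s+[^|]*\|\s*(ba)?sh")]


def signature_match(cmd: str) -> Optional[str]:
    hit = any(pattern.search(cmd) for pattern in SIGNATURES)
    return "matched known-malicious pattern" if hit else None


DENY_PREFIXES = ("kubectl delete", "terraform destroy")  # hypothetical org policy


def org_policy(cmd: str) -> Optional[str]:
    return "denied by org policy" if cmd.strip().startswith(DENY_PREFIXES) else None


def safety_judge(cmd: str) -> Optional[str]:
    return None  # placeholder: a real judge would ask an LLM to review the command


PIPELINE: list[tuple[str, Check]] = [
    ("injection_rail", injection_rail),
    ("signature_match", signature_match),
    ("org_policy", org_policy),
    ("safety_judge", safety_judge),
]


def review(cmd: str) -> tuple[bool, str]:
    """Run every layer in order; stop at the first rejection."""
    for name, check in PIPELINE:
        reason = check(cmd)
        if reason:
            return False, f"{name}: {reason}"
    return True, "approved"
```

Returning the name of the rejecting layer alongside the reason keeps the audit trail useful when a legitimate command is blocked.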

4. Retrieval over organizational memory

The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses Weaviate for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.

The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.
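
A toy example shows why blending the two signals helps. The lexical score below is simple token overlap, not real BM25, and the two-dimensional "embeddings" are made up by hand; the weighted blend with an `alpha` parameter is one common fusion approach and is not claimed to be how Weaviate or Aurora computes hybrid scores.

```python
import math


def lexical_score(query: str, doc: str) -> float:
    """Fraction of query tokens that appear verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank docs by alpha * dense + (1 - alpha) * lexical, best first."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * lexical_score(query, text), doc_id)
        for doc_id, text, vec in docs
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# (id, text, hand-made embedding) triples; all values are illustrative.
DOCS = [
    ("runbook-oom", "payments-api pod restarts after memory limit exceeded", [0.7, 0.3]),
    ("runbook-crash", "containers killed when out of memory", [0.9, 0.1]),
    ("postmortem-dns", "dns resolution failures in eu-west", [0.1, 0.9]),
]
QUERY, QUERY_VEC = "payments-api OOMKilled", [1.0, 0.0]

hybrid_top = hybrid_rank(QUERY, QUERY_VEC, DOCS)[0]
dense_only_top = hybrid_rank(QUERY, QUERY_VEC, DOCS, alpha=1.0)[0]
```

With these inputs, dense-only ranking prefers the generic out-of-memory runbook, while the hybrid blend promotes the document that literally names `payments-api`. That is the identifier-match effect described above.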

5. Infrastructure topology

An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses Memgraph as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.
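
The idea of handing the agent digested dependency context rather than graph access can be sketched with a breadth-first walk over an adjacency map. The service names, the in-memory graph, and the exact line format are assumptions for illustration; only the "[Auto-Discovery]" tag comes from the description above.

```python
from collections import deque

# Hypothetical dependency map: service -> things it depends on.
DEPENDS_ON = {
    "checkout": ["payments-api", "cart"],
    "payments-api": ["postgres-main"],
    "cart": ["redis-cache"],
}


def dependencies(service: str, graph: dict, max_depth: int = 2) -> list[tuple[str, int]]:
    """Breadth-first walk returning (dependency, distance) pairs."""
    seen, found = {service}, []
    queue = deque([(service, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth budget
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                found.append((dep, depth + 1))
                queue.append((dep, depth + 1))
    return found


def digest(service: str, graph: dict) -> str:
    """Render the walk as plain text the agent can read as context."""
    return "\n".join(
        f"[Auto-Discovery] {service} depends on {dep} (distance {distance})"
        for dep, distance in dependencies(service, graph)
    )
```

Calling `digest("checkout", DEPENDS_ON)` produces four lines, nearest dependencies first, which is roughly what a human would read off a service map.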

What the DORA and VOID anchors actually say

Two industry sources are worth grounding the investigation conversation in.

DORA: Failed Deployment Recovery Time. Per DORA's metrics history, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically, which is where investigation speed matters most. The 2024 DORA State of DevOps Report further refined the metric set, adding "deployment rework rate" as a fifth core measure.

VOID: incident reality, not vendor claims. The Verica Open Incident Database catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, including when vendors quote them. Separately, the 2024 DORA report notes that AI adoption correlated with a 1.5% throughput decrease and a 7.2% stability decrease in the 2024 cohort, a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.

An evaluation scorecard for AI investigation tools

Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.

Multi-step tool use. Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.
Cloud scope. Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.
Sandboxing and RBAC. Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.
RAG quality. Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?
Trace readability. Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?
Cost and rate-limit headroom. Long agentic investigations are token-expensive. Budget the LLM bill at 10x your typical per-incident estimate, and stress-test rate limits against your busiest incident week, not a quiet one.
Open source vs SaaS posture. If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.
Where it sits on the AICL. Decide up front whether you want L4 (recommended) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.
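
The first scorecard row can be partly automated if the tool exports its trace. The heuristic below treats a trace as multi-step when more than one distinct tool was called and some later call's arguments reuse a token from the preceding result. The trace shape (`tool`, `args`, `result` keys) is an assumption, and token reuse is only a rough proxy for "depends on the previous result", so treat a pass as a prompt for manual review rather than proof.

```python
def is_multi_step(trace: list[dict]) -> bool:
    """Heuristic: >1 distinct tool, and a call that reuses the prior result."""
    if len({step["tool"] for step in trace}) < 2:
        return False
    for previous, following in zip(trace, trace[1:]):
        if any(token in following["args"] for token in previous["result"].split()):
            return True
    return False


agentic_trace = [
    {"tool": "list_pods", "args": "-n checkout", "result": "checkout-7f9c CrashLoopBackOff"},
    {"tool": "get_logs", "args": "checkout-7f9c", "result": "OOMKilled"},
]
single_shot_trace = [
    {"tool": "diagnose", "args": "alert-123", "result": "Probably a memory issue"},
]
```

Here the second call in `agentic_trace` uses the pod name returned by the first, so it passes; the single-call trace does not.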

How to run a low-risk pilot

Pick one alert source and one cluster. PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.
Run read-only for at least four weeks. Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).
Ingest your historical context. Past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.
Add one chat channel and one slash command. Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.
Review traces weekly. Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.
Promote to alert-triggered investigation when the trace is clean for two consecutive weeks. Webhook from PagerDuty / Datadog / Grafana / incident.io straight into the agent. The investigation is now happening before the on-call has opened their laptop.
Decide on L5 (remediation) only after three months at clean L4. Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's Aurora Actions feature is the open-source pattern for this.
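
The read-only comparison in the second step reduces to a small calculation once each incident has an agent RCA and a human RCA recorded. The record fields below are hypothetical, and the sketch assumes RCAs have been normalized to comparable labels so that equality is meaningful.

```python
from statistics import median


def pilot_report(incidents: list[dict]) -> dict:
    """Summarize agent-vs-human agreement and time to RCA for a pilot."""
    agreed = sum(1 for i in incidents if i["agent_rca"] == i["human_rca"])
    return {
        "agreement_rate": agreed / len(incidents),
        "median_agent_minutes": median(i["agent_minutes"] for i in incidents),
        "median_human_minutes": median(i["human_minutes"] for i in incidents),
    }


# Illustrative records only; not real pilot data.
incidents = [
    {"agent_rca": "oom", "human_rca": "oom", "agent_minutes": 6, "human_minutes": 40},
    {"agent_rca": "bad-deploy", "human_rca": "bad-deploy", "agent_minutes": 9, "human_minutes": 55},
    {"agent_rca": "dns", "human_rca": "cert-expiry", "agent_minutes": 12, "human_minutes": 30},
    {"agent_rca": "disk-full", "human_rca": "disk-full", "agent_minutes": 4, "human_minutes": 90},
]
report = pilot_report(incidents)
```

Disagreements such as the third record are the ones worth reading in trace review, since either side may have been wrong.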

What can go wrong

A short list of failure modes worth pre-mortem-ing.

Prompt drift. A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents.
Tool misuse. The agent runs against the wrong cloud account or the wrong cluster, or issues a destructive subcommand. Sandboxing and RBAC mitigate this but do not eliminate it, so keep traces auditable.
Hallucinated identifiers. Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.
Token cost runaway. Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.
Over-trust. The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.
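
The prompt-drift mitigation, gating model upgrades on a regression suite of past incidents, can be sketched as a pass-rate check. The agent is modeled as a plain callable from alert text to an RCA label, and the 0.9 threshold is an arbitrary assumption; real suites would need fuzzier matching than string equality.

```python
from typing import Callable


def regression_gate(
    agent: Callable[[str], str],
    cases: list[tuple[str, str]],
    min_pass_rate: float = 0.9,
) -> tuple[bool, float]:
    """Replay past incidents and allow the upgrade only above the threshold."""
    passed = sum(1 for alert, expected in cases if agent(alert) == expected)
    rate = passed / len(cases)
    return rate >= min_pass_rate, rate


# Past incidents with their accepted root causes (illustrative).
CASES = [
    ("checkout pods restarting", "oom"),
    ("5xx after release 412", "bad-deploy"),
    ("resolver timeouts eu-west", "dns"),
    ("writes failing on db-2", "disk-full"),
]
KNOWN = dict(CASES)


def pinned_model(alert: str) -> str:
    return KNOWN[alert]  # stand-in for the currently pinned model


def candidate_model(alert: str) -> str:
    # Stand-in for an upgraded model that regresses on one incident type.
    return "network" if "resolver" in alert else KNOWN[alert]
```

In this example the pinned model passes all four cases and the candidate passes three, so the gate would hold the upgrade until someone looks at the regression.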

Where Aurora fits

We build Aurora — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the Aurora Actions feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Related guides: Aurora Actions · Automated Post-Mortem Generation · CI/CD Auto-Remediation · Open-Source AI SRE comparison

Aurora Actions: User-Defined Background Automations for Incident Response

Siddharth Singh — Mon, 11 May 2026 17:49:20 +0000

Key Takeaways

Aurora Actions are reusable, natural-language automations that Aurora's agent executes in the background using all 22+ connected integrations. Available today on the main branch of Aurora.

Three trigger types out of the box: manual ("run now"), on incident completion (chain follow-up work after every RCA), and recurring schedule (Celery Beat–driven intervals).

Same agent, same tools, different prompt scaffolding. Actions reuse Aurora's existing LangGraph agent and 30+ tools (kubectl, aws, gcloud, az, Terraform, Confluence, Slack, GitHub) — they just run as background chat sessions with eager-loaded skills and no RCA mandate.

/action <name> is a first-class chat primitive. Slash-command autocomplete in the chat input, "Run Action" dropdown on completed incidents, and full RBAC-gated CRUD UI in Settings.

Aurora Actions turn the agent into a programmable platform. This is the building block for CI/CD auto-remediation, scheduled audits, and post-incident health checks — covered in our CI/CD Auto-Remediation guide.

We shipped one of the most-requested features in Aurora's history: Aurora Actions — user-defined background automations that run on Aurora's agent. An Aurora Action is a named, natural-language instruction the user writes once and then triggers manually, on incident completion, or on a recurring schedule; Aurora's agent executes it as a background task with full access to every connected integration. Where traditional incident management tools force you to pick from a fixed catalog of "automations" (close incident, post to Slack, run runbook), Actions are written in plain English and inherit the full reasoning capability of the agent.

This post is for SRE and platform teams already running Aurora — or evaluating it — who want to understand what Actions actually do, where they fit on the agentic spectrum, and how to use them safely.

What is an Aurora Action?

An Aurora Action has four parts:

A name — used as the slash-command handle (/action <name>) and as the dropdown label on incident cards.
A natural-language instruction — the prompt the agent will execute. The same instruction the user would type into chat, except it can reference incident context placeholders when triggered post-incident.
A trigger type — manual, on-incident-completion, or on-schedule (interval-based via Celery Beat).
An on/off toggle — actions can be disabled without deletion, with full RBAC for who can create, edit, or trigger them.

The implementation is a thin layer over Aurora's existing chat agent. When an Action triggers, the executor service creates a background chat session with the action's instruction as the user message, runs it through the same LangGraph workflow that powers interactive chat, and persists the run history. The agent has full tool access (kubectl, cloud CLIs, Terraform, Slack, GitHub, Confluence, Memgraph, Weaviate) and eager-loaded skills — the only differences from interactive chat are scaffolded prompts and the absence of any RCA mandate.
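
The four parts and the trigger logic map onto a small data model. This is a sketch with hypothetical names (`Action`, `Trigger`, `actions_for`, `to_user_message`); it is not Aurora's schema or executor, and it leaves out RBAC and scheduling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trigger(Enum):
    MANUAL = "manual"
    ON_INCIDENT_COMPLETION = "on_incident_completion"
    ON_SCHEDULE = "on_schedule"


@dataclass
class Action:
    name: str          # slash-command handle
    instruction: str   # natural-language prompt for the agent
    trigger: Trigger
    enabled: bool = True


def actions_for(event: Trigger, actions: list[Action]) -> list[Action]:
    """Enabled actions that should fire for a given trigger event."""
    return [a for a in actions if a.enabled and a.trigger is event]


def to_user_message(action: Action, incident_summary: Optional[str] = None) -> str:
    """Build the message a background chat session would start with."""
    if incident_summary:
        return f"{action.instruction}\n\nIncident context:\n{incident_summary}"
    return action.instruction


ACTIONS = [
    Action("health-check", "Check error rate and pod restarts.", Trigger.ON_INCIDENT_COMPLETION),
    Action("iam-audit", "Audit unused IAM roles.", Trigger.ON_SCHEDULE),
    Action("old-check", "Deprecated check.", Trigger.ON_INCIDENT_COMPLETION, enabled=False),
]
```

Disabled actions stay in the list but never fire, which mirrors the on/off toggle described above.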

Why this matters

Most incident management automation today is workflow automation: PagerDuty fires, a Slack channel is created, the status page is updated, a runbook link is posted. The "automation" is a directed graph of static actions. There is no reasoning, no investigation, no judgment. Tools like Rootly, FireHydrant, and incident.io are excellent at this, but none of it replaces the investigation and verification an SRE still has to do by hand afterward.

Aurora's bet has always been the opposite: automate the investigation itself. Aurora Actions extend that bet from one-shot incident investigations to recurring or post-incident workflows. A few concrete examples:

Noisy alert tuning — "Every Friday at 5pm, review which Datadog alerts fired more than 20 times this week with mean time-to-acknowledge over 10 minutes. Open a Terraform PR to widen the thresholds or move them to a warning channel."
Post-incident health check — "After every completed RCA, run a 15-minute observation on the affected service: check error rate, p99 latency, and pod restart count. Post results to #incident-followup."
Scheduled infrastructure audit — "Every Monday at 9am, audit IAM roles in the production AWS account that have not been used in 90 days. List candidates for removal in a Confluence page."

None of these are runbook automation. Each requires the agent to query infrastructure, reason about results, and produce a structured output. Each one was previously the job of an on-call engineer doing follow-up between pages.

Where Actions sit on the agentic capability spectrum

In our Open-Source AI SRE comparison, we proposed a four-level spectrum for AI SRE capability. Actions don't change the level — they change when the agent runs.

When the agent runs	Trigger	Pre-Actions example	With Actions
On alert	Webhook from PagerDuty / Datadog / Grafana	Aurora investigates the alert and produces an RCA	Same — investigation flow is unchanged
On user request	Engineer asks a question in chat	Aurora answers using tools	Same — plus `/action <name>` shortcuts
After every incident	Incident state transitions to "resolved"	Postmortem generated; engineer manually does follow-up checks	Action runs automatically with incident context in scope
On a schedule	Celery Beat interval schedule	No equivalent; required an external scheduler plus custom code	Single source of truth: agent runs the prompt on cadence

The post-incident and scheduled triggers are the genuinely new capability. Before Actions, anything recurring or post-incident required gluing Aurora to an external scheduler, an external prompt store, and bespoke trigger code. Actions collapse all three into the product surface.

How Actions work under the hood

This is for the technically curious. A few architecturally interesting things from the implementation:

1. Background chat sessions, not a separate runtime. When an Action triggers, the executor service creates a regular chat session with the action's instruction as the seed message and dispatches it as a background Celery task. The agent doesn't know it's running an Action — it just runs the workflow. This means every capability the interactive agent has (tool calls, RAG, graph traversal, sub-agent orchestration) is available inside Actions for free.

2. Eager-loaded skills, no RCA mandate. Interactive chat lazy-loads skills based on the user message. Background actions eager-load all skills because there is no human to clarify ambiguity. The system prompt also strips the "your job is to find root cause" framing — Actions can do anything the agent can do, not just investigate.

3. RLS context is preserved. Aurora uses PostgreSQL row-level security for multi-tenancy. The executor explicitly sets RLS context (org_id, user_id) before running so background tasks see only their own org's data — even though they run under a service identity.

4. Stale run cleanup is integrated. Aurora's existing background-chat janitor already handles orphaned chat sessions from crashed pods. Action runs go through the same path, so a worker pod dying mid-action doesn't leave the run state inconsistent.

5. RBAC is enforced at the route layer. Action CRUD is gated by Aurora's Casbin-based RBAC. Org admins can restrict which roles can create or trigger actions — important because an Action with cloud-CLI access has real blast radius.
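
For item 3, setting tenant context before a background task runs could look roughly like the sketch below. It assumes a DB-API cursor with `%s` placeholders and RLS policies that read custom settings through `current_setting`. The setting names `app.org_id` and `app.user_id` and the function name are hypothetical, not taken from Aurora's code; `set_config` is a PostgreSQL built-in.

```python
from __future__ import annotations

from typing import Any, Protocol


class Cursor(Protocol):
    def execute(self, sql: str, params: tuple[Any, ...]) -> Any: ...


def set_rls_context(cur: Cursor, org_id: str, user_id: str) -> None:
    """Scope the current transaction to one tenant before running a task.

    Assumes RLS policies compare rows against current_setting('app.org_id').
    The setting names are hypothetical; set_config() is a PostgreSQL built-in.
    """
    # is_local = true limits the setting to the current transaction, so a
    # pooled connection cannot carry one tenant's context into the next task.
    cur.execute("SELECT set_config('app.org_id', %s, true)", (org_id,))
    cur.execute("SELECT set_config('app.user_id', %s, true)", (user_id,))


class RecordingCursor:
    """Stand-in cursor that records statements instead of hitting a database."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, tuple[Any, ...]]] = []

    def execute(self, sql: str, params: tuple[Any, ...]) -> None:
        self.calls.append((sql, params))


cur = RecordingCursor()
set_rls_context(cur, org_id="org-123", user_id="user-456")
```

Passing the IDs as bound parameters rather than formatting them into the SQL string keeps the context-setting step itself safe from injection.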

Trigger types in detail

Manual triggers

The simplest case. An admin creates the action, an engineer triggers it from the Actions page or via /action <name> in chat. Useful for codifying common operational tasks ("rotate ECS task definitions for service X", "scan Confluence for stale runbooks") into named, repeatable commands.

The chat integration is worth calling out: /action is implemented as an LLM tool call using the same pattern as Aurora's /rca slash command. The agent processes the action dispatch and then continues responding to the rest of the user's message — so you can write "kick off the IAM audit and tell me what changed since last week" and the agent will dispatch the audit action and answer your question in the same turn.

On-incident-completion triggers

When an incident transitions to "resolved", any action with this trigger type runs against the incident context. The incident's metadata, RCA, and timeline are available to the action's agent without the user having to paste anything in. This is the trigger that turns Aurora from a reactive tool ("investigate this page") into a continuous one ("investigate, then run health checks, then file the postmortem").
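
As a rough illustration of how incident context can reach a prompt, the sketch below fills placeholders with Python's `string.Template`. The `$service`-style syntax and the field names are assumptions for this example; we are not documenting Aurora's actual placeholder format here.

```python
from string import Template


def render_instruction(instruction: str, incident: dict[str, str]) -> str:
    """Fill $placeholders in an action prompt from incident metadata.

    safe_substitute leaves unknown placeholders untouched instead of raising,
    so a typo in the prompt shows up in the run trace rather than crashing.
    """
    return Template(instruction).safe_substitute(incident)


incident = {"service": "checkout-api", "incident_id": "INC-1042"}
prompt = render_instruction(
    "Observe $service for 15 minutes after $incident_id; report to $channel.",
    incident,
)
print(prompt)
```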

Scheduled triggers

Interval-based, driven by Celery Beat. Choose a cadence (every N minutes / hours / days), and the action runs without user involvement. This is the building block for the CI/CD auto-remediation and scheduled audit use cases — and it's why we're calling this post and the CI/CD Auto-Remediation guide sister posts.
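
A minimal sketch of how an "every N units" cadence might translate into a Celery Beat entry. Only the dict layout (`task`, `schedule`, `args`) follows Celery's documented `beat_schedule` format; the task path, entry key, and helper names are hypothetical.

```python
UNIT_SECONDS = {"minutes": 60, "hours": 3600, "days": 86400}


def cadence_to_seconds(every: int, unit: str) -> float:
    """Convert an 'every N units' cadence into seconds."""
    if every <= 0:
        raise ValueError("cadence must be positive")
    return float(every * UNIT_SECONDS[unit])


def beat_entry(action_name: str, every: int, unit: str) -> dict:
    """Build one entry in the shape Celery's beat_schedule setting expects.

    The task path and entry key are hypothetical; only the dict layout
    (task / schedule / args) follows Celery's documented format.
    """
    return {
        f"action-{action_name}": {
            "task": "actions.run_action",  # hypothetical task name
            "schedule": cadence_to_seconds(every, unit),  # interval in seconds
            "args": (action_name,),
        }
    }


schedule = beat_entry("iam-audit", 7, "days")
```

Note that an interval repeats relative to when the schedule starts counting, not at a fixed wall-clock time, so "every 7 days" is weekly but not necessarily "Monday at 9am".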

What Actions don't do (and why)

A few capability decisions worth being explicit about:

No external webhook triggers in this release. We could have added "trigger on arbitrary webhook" but it overlaps with the existing alert-triggered investigation flow. We may add it if we see demand for triggers from systems that don't go through PagerDuty / Datadog / Grafana.
No agent-authored Actions yet. The agent can't create or modify Actions on its own. Self-modification is a serious security boundary; we'd want approval gating and audit logging before opening that door. (See our AI Agent kubectl Safety guide for the threat model.)
No conditional / DAG composition in this release. Actions are single-prompt for now. If you need a multi-step workflow, write a single prompt that describes the steps — the agent is good at sequencing. We'll add explicit composition if the natural-language form proves limiting.

Safety: what to think about before enabling

Every Action is a small program with access to your cloud environment. A few rules we use ourselves:

Start read-only. Actions inherit Aurora's tool permissions. If your tool config restricts write actions (no kubectl apply, no aws ec2 terminate-instances), Actions inherit that posture. Keep it that way for the first few weeks.
Use scheduled triggers conservatively. A daily audit is cheap. A 5-minute polling loop with cloud CLI calls is not. Watch the LLM bill.
Audit who can create Actions. RBAC defaults to org-admin-only creation. Leave it there unless you have a clear reason to widen.
Pin the model. Action prompts can be sensitive to model behavior. Pin a known-good model per action (gpt-5.5, claude-sonnet-4.6, opus-4.7, etc.) using Aurora's per-org model dropdown until you have confidence in cross-model stability.
Review action runs weekly. Every action has a run-history view. Spend 10 minutes a week reading the agent's traces for your scheduled actions — anomalous reasoning is the leading indicator of prompt drift or tool drift.
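
To put the "watch the LLM bill" rule in numbers, here is a back-of-the-envelope calculation. The token count per run and the price per million tokens are made-up inputs for illustration; substitute your own.

```python
def runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of scheduled runs in a month at a fixed interval."""
    return (days * 24 * 60) // interval_minutes


def monthly_cost(interval_minutes: int, tokens_per_run: int,
                 usd_per_million_tokens: float) -> float:
    """Rough LLM spend for one scheduled action. All inputs are assumptions."""
    runs = runs_per_month(interval_minutes)
    return runs * tokens_per_run * usd_per_million_tokens / 1_000_000


# Illustrative inputs only: 50k tokens per run at $5 per million tokens.
daily = monthly_cost(24 * 60, 50_000, 5.0)
polling = monthly_cost(5, 50_000, 5.0)
print(f"daily audit: ${daily:.2f}/month, 5-minute loop: ${polling:.2f}/month")
```

Under these assumptions the daily audit runs 30 times a month and the 5-minute loop runs 8,640 times, a 288x difference in spend before counting cloud API calls.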

How to ship your first Action

A six-step recipe.

1. Pick a recurring task you currently do manually

Anything you do every week or after every incident. Examples: stale-PR review, alert-noise audit, on-call handover summary. The smaller and more deterministic, the better for v1.

2. Write the prompt as if you were typing it into chat

Don't translate to "automation language." Write it the way you would write a chat message to a smart junior SRE. "Look at..." "Check whether..." "Open a PR that..."

3. Create the Action with a manual trigger

Settings → Actions → New Action. Paste the prompt, set trigger = manual, leave it disabled if you want to review before enabling. Trigger it once and watch the run.

4. Inspect the run trace

Click the run in the history view. Read every tool call. Look for: tool misuse (wrong cloud account), excessive tool calls (3 attempts at the same thing), hallucinated paths or resource IDs. Iterate on the prompt until the trace is clean for three consecutive runs.
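
If you export run traces, part of this check can be scripted. The sketch below assumes a trace is a list of steps with `tool` and `args` keys, which is a hypothetical format chosen for this example. It flags repeated identical tool calls and applies the "three consecutive clean runs" rule; it does not detect wrong accounts or hallucinated resource IDs, which still need a human read.

```python
from collections import Counter


def repeated_calls(trace: list[dict], limit: int = 3) -> list[tuple[str, str]]:
    """Return (tool, args) pairs that appear `limit` or more times in one run."""
    counts = Counter((step["tool"], step["args"]) for step in trace)
    return [call for call, n in counts.items() if n >= limit]


def ready_to_promote(runs: list[list[dict]], required_clean: int = 3) -> bool:
    """True when the most recent `required_clean` runs have no repeated calls."""
    recent = runs[-required_clean:]
    return len(recent) == required_clean and all(not repeated_calls(r) for r in recent)


clean = [{"tool": "kubectl", "args": "get pods -n prod"}]
noisy = [{"tool": "aws", "args": "iam list-roles"}] * 3
```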

5. Promote to the right trigger type

If the action makes sense after every incident → on-incident-completion. If it's a routine sweep → on-schedule with the longest cadence that still meets your need. Only use short cadences when you have a clear cost and blast-radius understanding.

6. Add it to your team's incident review

Treat agent runs the same way you treat human runs: include them in your weekly incident review. Look for actions that produced wrong output, actions that nobody read the output of, and actions that produced output nobody acted on. Delete or downgrade as needed.

Aurora Actions vs traditional incident-management automation

The category most people compare us to is "workflow automation in incident-management SaaS": Rootly, FireHydrant, incident.io. The comparison is informative, but the two are ultimately different categories:

Capability	Aurora Actions	Rootly / FireHydrant / incident.io workflows
Authoring	Natural language	DSL or visual builder
Reasoning	Yes — LLM agent	No — fixed conditional graph
Tool reach	Cloud CLIs, kubectl, Terraform, Slack, Confluence, GitHub, RAG, infra graph	Slack, status pages, Zoom, runbook links, ticket creation
Scheduled execution	Yes (Celery Beat)	Limited (some support timed reminders)
Post-incident chaining	Yes — full incident context available	Yes — but limited to workflow actions
Open source	Yes (Apache 2.0, self-hosted)	No
Pricing	Free (self-hosted; LLM tokens only)	Per-user SaaS

The honest framing: traditional incident-management tools automate the process around the incident. Aurora Actions automate the investigation and follow-up work itself. Both have value, and they cover non-overlapping work. If you live in PagerDuty and use Rootly for incident channels, Aurora Actions sit alongside that setup rather than replacing it.

What's next

Aurora Actions is the foundation for several capabilities on our roadmap:

DAG composition — explicit multi-step Action chains where each step is itself an Action.
Approval gates — Actions that pause for human approval before destructive tool calls (already supported in chat; explicit Action-level gating coming).
CI/CD auto-remediation hooks — first-class integration with GitHub Actions, Jenkins, and ArgoCD so a failing pipeline becomes a triggered Aurora investigation. (Background and detailed write-up in our CI/CD Auto-Remediation guide.)
Action marketplace — community-contributed Actions you can install with one click. Bring-your-own prompt store.

We'll publish each of these as they ship.

Get Aurora

Aurora is fully open source under Apache 2.0. Self-host with Docker Compose or Helm. Actions ship in the next tagged release after aurora-oss-1.2.15 (April 15, 2026); the feature is available on main today.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Compare against alternatives: Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT · Aurora vs traditional incident-management tools

Originally published at arvoai.ca.

CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)

Siddharth Singh — Mon, 11 May 2026 17:32:08 +0000

Key Takeaways

Most teams do not yet auto-remediate inside CI/CD. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of respondents don't use AI in CI/CD workflows at all — even though AI is now widely used elsewhere in the development lifecycle.

CI/CD auto-remediation is an architectural pattern, not a product category. It combines progressive delivery (canary, blue-green), automated metric-driven rollback, and AI-assisted root-cause-and-fix. You assemble it from components you own; you do not buy it as a single SKU.

Three layers, five maturity levels. We propose the CI/CD Auto-Remediation Maturity Spectrum (CARM): L0 (manual), L1 (rollback), L2 (rollback + diagnostic), L3 (rollback + diagnostic + remediation), L4 (closed-loop with policy gates).

Open-source stack is mature. Argo Rollouts, Flagger, and metric-driven AnalysisTemplates cover L1–L2 with no AI. AI agents like Aurora extend to L3 with Actions-based remediation.

DORA's bar is real. Top-performing teams keep change failure rate low and recover from failed deployments in under one hour (DORA program guidance). Auto-remediation is how non-elite teams close the gap.

Of the 46+ AI SRE products and dozens of progressive-delivery tools shipping in 2026, only a handful explicitly target the pattern this guide is about. CI/CD auto-remediation is the practice of having your delivery pipeline automatically detect, diagnose, and recover from failure — without paging a human — using a combination of progressive-delivery primitives, metric-driven rollback policies, and (increasingly) AI agents that propose or apply fixes. It is not the same as auto-deploy. It is not the same as canary rollout. It is the closing of the loop between "the pipeline noticed something is wrong" and "the system is back in a good state" — without an engineer in the middle.

This guide is for SRE and platform teams who already run continuous delivery and want to push toward the auto-remediation end of the spectrum. By the end, you should be able to: position your current setup on the CARM maturity spectrum, identify the next concrete step, and pick a credible tool stack to get there.

Why auto-remediation matters in 2026

Three trends explain the demand.

1. AI is shipping more code, faster. Per JetBrains' AI Pulse coverage on the TeamCity blog (April 2026), AI tools are now used by a large majority of developers in their daily work. The DX 2026 change-failure-rate analysis puts a number on it: with 91% of developers having adopted AI and 20%+ of merged code now AI-authored, code velocity has gone up while quality has gone in the opposite direction. More deployments per day means more chances to break production.

2. The pipeline itself is the new bottleneck. JetBrains' 2025 State of CI/CD survey documents widespread frustration with slow and unreliable CI/CD pipelines as a leading contributor to developer burnout.

3. AI in CI/CD specifically lags adoption. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of respondents don't use AI in CI/CD workflows at all — even though most use AI everywhere else in the development lifecycle. The gap isn't capability; it's trust and integration. AI in IDEs is low-risk; AI in pipelines touches production. Teams want the impact but won't take the blast radius until the architecture is right.

Auto-remediation is the architecture that closes that gap. It bounds the agent's reach (only inside the delivery pipeline), gives it deterministic guardrails (progressive delivery and metric-driven rollback), and produces a clear contract: detect, diagnose, fix-or-rollback, log.

What "auto-remediation" actually means

It is easiest to define by negation. Auto-remediation is not:

Auto-deploy. Auto-deploy ships code on merge. Auto-remediation is what happens after a problem appears.
Canary release. Canary is the detection mechanism — it surfaces problems early by shifting traffic gradually. Remediation is the response — rolling back, hotfixing, or reverting.
Self-healing infrastructure. Self-healing systems like Kubernetes restart pods. Auto-remediation includes that plus change-driven failure recovery: rolling back a bad deploy, rolling forward a fix, or pausing the pipeline while a human investigates.
AIOps. AIOps platforms surface alerts and correlations. Auto-remediation closes the loop by acting on them.

The minimum viable definition: a pipeline transition from a degraded state back to a healthy state, triggered by automated detection, executed by automated action, observed and logged for human review.

The CI/CD Auto-Remediation Maturity Spectrum (CARM)

There is no single industry-standard maturity model for auto-remediation. We use the following five-level spectrum — derived from how teams actually evolve.

Level	What happens on failed deploy	Tools that get you here	Trust required
L0 — Manual	Pipeline fails. PagerDuty pages the on-call. Engineer investigates, decides to roll back or hotfix, executes manually.	None — this is the default for most teams.	None — humans do everything.
L1 — Automated Rollback	Pipeline detects health-check failure (error rate, latency, smoke test). Automatically rolls back to the previous version. Pages a human after the fact.	Argo Rollouts, Flagger, Spinnaker	Trust that the health metric reflects user-visible failure.
L2 — Rollback + Diagnostic	L1 plus: AI agent runs an investigation when rollback fires. Produces an RCA before the human starts. Page goes out with context, not blank.	L1 stack + HolmesGPT, Aurora, K8sGPT	Trust that the diagnostic is right enough to bias human reasoning.
L3 — Rollback + Diagnostic + Remediation	L2 plus: agent proposes (or in some cases applies) a fix — a PR, a config change, an alert threshold update. Human reviews and merges.	L2 stack + Aurora Actions, HolmesGPT Operator mode	Trust that the agent's fix is correct, scoped, and reviewable.
L4 — Closed-loop with policy gates	L3 plus: certain low-risk, well-understood fixes are applied automatically inside policy guardrails (alert threshold widening, log-only changes, retry loops). Destructive or high-risk changes still gated.	L3 stack + policy engine (OPA, Casbin, Kyverno) + audit logging	Trust the policy gate definitions more than the agent.
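
Because each level builds on the one below, positioning a team on the spectrum reduces to finding the first missing capability. A minimal sketch, with parameter names that are our own shorthand for the table rows:

```python
def carm_level(auto_rollback: bool, auto_diagnostic: bool,
               agent_proposes_fix: bool, policy_gated_apply: bool) -> str:
    """Return the highest CARM level whose prerequisites are all met.

    Each level builds on the one below, so a capability only counts when
    every lower-level capability is also present.
    """
    capabilities = [auto_rollback, auto_diagnostic, agent_proposes_fix, policy_gated_apply]
    level = 0
    for present in capabilities:
        if not present:
            break
        level += 1
    return f"L{level}"


print(carm_level(True, True, False, False))  # prints: L2
```

A team with an AI diagnostic agent but no automated rollback still scores L0 under this reading, which matches the point that L1 is the foundation.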

Most teams in 2026 are at L0 or L1. The leap from L1 to L2 is the single highest-leverage move available because it preserves human-in-the-loop decision-making while removing the "blank-page" delay that explains a large share of MTTR. DORA's recent research replaced its time-to-restore metric with Failed Deployment Recovery Time (FDRT) precisely because the metric is more meaningful when scoped to change-driven failures, which is exactly the failure mode auto-remediation targets.

L1: Automated rollback (where most serious teams should be)

This is the foundation. Without L1, AI-assisted remediation at L2-L3 has nowhere to act.

The two Apache 2.0 incumbents are Argo Rollouts and Flagger. Both run in Kubernetes; both implement metric-driven progressive delivery with automated rollback. They differ in invasiveness.

Capability	Argo Rollouts	Flagger
CNCF status	Part of Argo (Graduated, Dec 2022)	Part of Flux (Graduated, Nov 2022)
Resource model	Replaces `Deployment` with `Rollout` CRD	Wraps existing `Deployment`
GitOps pairing	ArgoCD	FluxCD
Analysis	`AnalysisTemplate` querying Prometheus, Datadog, CloudWatch, etc.	Service-mesh metrics + custom webhooks
Automated rollback	Metric-threshold breach → revert	Metric-threshold breach → revert
Traffic shaping	Native + ingress + service mesh	Service-mesh first (Istio, Linkerd, App Mesh)
Invasiveness	Higher (changes resource type)	Lower (transparent wrapper)
Webhooks for custom logic	`Experiment` resource + analysis runs	Pre-/post-/during-rollout hooks

Pick Argo Rollouts if you already use ArgoCD and want explicit per-step canary control. Pick Flagger if you use a service mesh and want progressive delivery to be transparent to existing manifests.

For non-Kubernetes pipelines, equivalent capability lives in Spinnaker (multi-cloud, mature), Harness (commercial), and feature-flag platforms like LaunchDarkly (when "rollback" can be a flag flip).

A minimal Argo Rollouts AnalysisTemplate for HTTP error rate, simplified from the official docs:

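apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    # Supplied by the Rollout that references this template.
    - name: service-name
  metrics:
    - name: error-rate
      # Take one measurement every 30 seconds.
      interval: 30s
      # A measurement passes when at most 1% of requests return 5xx.
      successCondition: result[0] <= 0.01
      # Number of failed measurements tolerated before the analysis fails.
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Ratio of 5xx responses to all responses over the last minute.
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[1m]))
            / sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))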

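Once failed measurements exceed the failureLimit of 3 (that is, on the fourth failed 30-second measurement), the analysis fails and the rollout rolls back. This is L1 in about 20 lines of YAML.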
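
To make the failureLimit behaviour concrete, here is a small Python model of a single metric. It is a simplification under our reading of Argo Rollouts' semantics (a run fails once the number of failed measurements exceeds failureLimit). It ignores inconclusive and errored measurements, and the function name is ours.

```python
def analysis_outcome(error_rates: list[float], threshold: float = 0.01,
                     failure_limit: int = 3) -> str:
    """Simplified model of one metric in an analysis run.

    A measurement fails when it violates the success condition
    (error rate <= threshold). The run is marked failed as soon as the
    number of failed measurements exceeds failure_limit.
    """
    failed = 0
    for rate in error_rates:
        if rate > threshold:
            failed += 1
        if failed > failure_limit:
            return "Failed"  # the rollout would abort and roll back here
    return "Successful"


print(analysis_outcome([0.05, 0.05, 0.05]))        # three failures are tolerated
print(analysis_outcome([0.05, 0.05, 0.05, 0.05]))  # the fourth one fails the run
```

If you want the rollout to abort on the first bad measurement, this model suggests setting the limit to 0 rather than 1.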

L2: Rollback + automated diagnostic

L1 gets you out of an outage fast. It does not tell you why the deploy failed. The human gets paged with a rollback notification and starts from zero.

L2 fills that gap with an AI agent that runs when rollback fires. The agent queries the cluster state, the application logs, the rollout metrics, and the changed code — and produces an RCA before the human starts typing.

Three credible open-source options exist as of 2026 (compared in detail in our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide):

K8sGPT — rule-based scanner with LLM explanations. Best for low-blast-radius first deployment; explains why a resource is unhealthy.
HolmesGPT — ReAct-loop AI agent (CNCF Sandbox). 30+ observability integrations. Read-only by default. Strong fit for cluster-scoped investigation.
Aurora — LangGraph supervisor agent. Multi-cloud (AWS / Azure / GCP / OVH / Scaleway). Generates postmortems. Opens remediation PRs with human approval.

Wiring up L2 is straightforward: configure your AI SRE's webhook to receive the rollback event. Argo Rollouts emits Kubernetes events and ships a notifications feature that can send them to a webhook, for example on the on-rollout-aborted trigger. The agent investigates and posts results to the on-call Slack channel before the human acknowledges the page.
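
On the receiving side, the glue is small. The sketch below turns a rollback notification body into an investigation request. The payload fields (`rollout`, `namespace`, `revision`) and the request shape are hypothetical; they stand in for whatever your notification template sends and whatever your agent's API accepts.

```python
import json


def investigation_request(event_body: str) -> dict:
    """Turn a rollback notification into an agent investigation request.

    The payload fields (rollout, namespace, revision) are hypothetical;
    they stand in for whatever your notification template sends.
    """
    event = json.loads(event_body)
    missing = [k for k in ("rollout", "namespace") if k not in event]
    if missing:
        raise ValueError(f"rollback event missing fields: {missing}")
    return {
        "prompt": (
            f"Rollout {event['rollout']} in namespace {event['namespace']} "
            "was rolled back. Investigate the failed revision and post an RCA."
        ),
        "context": {"revision": event.get("revision", "unknown")},
        "read_only": True,  # L2 is diagnostic only; no write tools
    }


body = json.dumps({"rollout": "checkout-api", "namespace": "prod", "revision": "42"})
request = investigation_request(body)
```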

L3: Diagnostic + agent-proposed remediation

L3 is where the agent starts proposing fixes, not just diagnosing. The pattern that works:

Pipeline fails → automated rollback (L1).
Agent investigates → RCA produced (L2).
Agent proposes a fix as a pull request, with the RCA as the PR description, the diff scoped to one file, and tests where possible.
Human reviews PR. If correct, merges. If wrong, comments and rejects.

This works because the pull request is the natural human-review surface. The agent doesn't touch production directly; it touches the repository, which already has CI, code review, and a merge gate.

Aurora Actions is built precisely for this pattern. A post-incident-completion Action with a prompt like "Open a PR widening alert thresholds for the three noisiest alerts in this incident" converts the human follow-up step into automated PR creation. The human review surface stays exactly the same as for human-authored PRs.

The HolmesGPT equivalent ships as "Operator mode" — the agent can write to GitHub when explicitly enabled.

L4: Closed-loop with policy gates

L4 is the contentious one. It involves the agent making changes without human approval — but only inside a tightly scoped policy.

The pattern:

A policy engine (Open Policy Agent, Kyverno, Casbin) defines which classes of remediation can run automatically.
The agent proposes a fix. The policy engine evaluates whether the fix matches a permitted class.
If yes → apply automatically with audit logging. If no → route to L3 (PR for human review).

Permitted classes that are usually safe at L4: widening an alert threshold by less than 2x, restarting a pod, scaling a deployment within preset bounds, adding a retry loop to a network call, suppressing a noisy log line.

Classes that are usually not safe at L4: any data-plane change, any production traffic routing change, any secret or RBAC change, any change touching the policy engine itself.
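
The gate logic itself is simple. In practice it would live in a policy engine such as OPA or Kyverno, but a plain-Python sketch shows the shape. The `Fix` type, the fix-class names, and the bounds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    kind: str  # e.g. "alert_threshold", "scale", "rbac"
    old_value: float = 0.0
    new_value: float = 0.0


# Hypothetical policy: which fix classes may be applied without review.
MAX_WIDEN_FACTOR = 2.0
REPLICA_BOUNDS = (2, 20)


def route(fix: Fix) -> str:
    """Return 'auto-apply' for permitted classes, else 'open-pr' (L3 path)."""
    if fix.kind == "alert_threshold":
        widened = (fix.old_value > 0
                   and fix.old_value < fix.new_value < fix.old_value * MAX_WIDEN_FACTOR)
        return "auto-apply" if widened else "open-pr"
    if fix.kind == "scale":
        low, high = REPLICA_BOUNDS
        return "auto-apply" if low <= fix.new_value <= high else "open-pr"
    # Default deny: anything not explicitly permitted goes to human review.
    return "open-pr"


print(route(Fix("alert_threshold", 100, 150)))  # prints: auto-apply
print(route(Fix("rbac")))                       # prints: open-pr
```

The important design choice is the default-deny fall-through: a new or unrecognized fix class lands in a PR, never in production.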

The reason L4 is contentious is that the policy gate is now a high-value target. An attacker who can broaden the policy can broaden the agent's blast radius. The same threat model we walk through in our AI Agent kubectl Safety guide applies, plus an additional layer: the policy engine must be operated with the same rigor as the orchestration plane itself.

Almost no production teams in 2026 run pure L4. The credible deployments are L3 with hardcoded L4 exceptions for two or three well-understood remediation classes. That's where to aim.

Common pitfalls

A short list of failure modes we have seen — in our own work and in customer deployments.

Auto-remediating into a worse state. The classic failure is auto-scaling a service to handle elevated error rates that are themselves caused by a downstream dependency. The service scales, hammers the dependency harder, and the dependency collapses. Fix: never auto-remediate without dependency-graph awareness. Aurora uses Memgraph for this; HolmesGPT uses its toolset structure; pure-L1 stacks should require manual escalation when the failure crosses service boundaries.
Trusting the AnalysisTemplate metric too much. A 1% error rate threshold on a P99-tail service is meaningless if your real failure mode is request-stalled-not-failed. Fix: model what user-visible failure actually looks like, not what the cleanest Prometheus query produces.
Letting the agent run unbounded retries. AI agents that hit a "this didn't work" signal will often retry — sometimes thousands of times — burning tokens and triggering downstream rate limits. Fix: cap the agent's tool-call budget. Aurora's executor enforces this by default; verify your agent does the same.
Skipping the post-mortem. Auto-remediation that "just worked" without a clear human review of what happened is a slow path to brittleness. Every auto-remediation event should produce a postmortem the on-call reads.
Conflating auto-remediation with "self-healing infra". Kubernetes pod restarts are not auto-remediation. They are a runtime affordance. Auto-remediation is the response to a change-driven failure — the deploy, the config push, the schema migration. Keep the categories separate.
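
The unbounded-retry pitfall has a mechanical fix: a hard cap on tool calls per run. The sketch below is a generic illustration, not Aurora's executor; the class and function names are ours.

```python
class BudgetExceeded(RuntimeError):
    pass


class ToolBudget:
    """Hard cap on tool calls for a single agent run."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.used = 0

    def call(self, tool, *args):
        # Count the attempt before running it so failed calls also consume budget.
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"tool-call budget of {self.max_calls} exhausted")
        self.used += 1
        return tool(*args)


def flaky_tool():
    raise TimeoutError("downstream did not respond")


budget = ToolBudget(max_calls=5)
attempts = 0
try:
    while True:  # a naive retry loop, stopped only by the budget
        attempts += 1
        try:
            budget.call(flaky_tool)
        except TimeoutError:
            continue
except BudgetExceeded:
    pass
```

Here the loop makes five real calls and is stopped on the sixth attempt, instead of retrying thousands of times.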

A pragmatic 90-day path to auto-remediation

For a team currently at L0 or L1.

Days 1–14: instrument and detect

Pick your three highest-traffic services. Add or harden:

Synthetic checks that exercise the user-visible path.
One Prometheus error-rate metric per service with a clear threshold.
A canary or blue-green rollout primitive (Argo Rollouts or Flagger).

Goal at end of week 2: a controlled bad deploy auto-rolls back without human intervention.

Days 15–45: wire in the agent

Deploy one of Aurora, HolmesGPT, or K8sGPT in read-only mode. Configure rollback events to webhook the agent. Have it post an RCA to your incident channel within five minutes of rollback.

Goal at end of week 6: every rollback comes with a written diagnostic before the human acknowledges.

Days 46–75: add agent-proposed remediation

Enable PR-creation for the agent (Aurora Actions on-incident-completion trigger, or HolmesGPT Operator mode). Constrain initial scope to one repo and one class of fix (alert thresholds, retry loops, log suppression). Review every PR for the first two weeks.

Goal at end of week 11: agent opens correct PRs in 70%+ of fired rollbacks. False-positive PRs are caught at code review.

Days 76–90: policy-gate one fix class for L4

Pick the safest class — usually alert threshold widening when an alert fired more than N times in M hours with mean TTA above some bound. Define an OPA / Kyverno policy that permits only that class. Wire the agent to apply directly when the policy permits, raise a PR otherwise.

Goal at end of week 12: one L4 lane open for one fix class with full audit trail.

This is the conservative path. Aggressive teams have moved faster, but we have not seen anyone skip steps successfully.

The DORA reality check

The DORA program's published guidance is blunt about what good looks like. Historical State of DevOps Reports have consistently shown the same shape of distribution:

Change Failure Rate: top performers maintain low single-digit percentages; lower performers see substantially higher rates.
Failed Deployment Recovery Time (FDRT): top performers recover in under one hour; lower performers can take days to weeks.

DORA's research has also consistently found that speed and stability reinforce each other rather than trade off — the fastest teams are also the most stable, per DORA's history of metrics and successive State of DevOps Reports. Auto-remediation is one of the small number of capabilities that moves teams across these tiers without requiring deeper organizational change. The L1→L2 jump alone reduces FDRT meaningfully because the human is no longer reconstructing context — the agent has already done it.
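
To track the two metrics yourself, a rough calculation from deployment records is enough to start. The record format below is hypothetical and the sample data is illustrative, not real. We use the median for recovery time so one long outage does not dominate.

```python
from statistics import median


def change_failure_rate(deployments: list[dict]) -> float:
    """Share of deployments that caused a failure needing remediation."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return failed / len(deployments)


def failed_deployment_recovery_minutes(deployments: list[dict]) -> float:
    """Median minutes from failure to recovery across failed deployments."""
    times = [d["recovery_minutes"] for d in deployments if d["failed"]]
    return float(median(times)) if times else 0.0


# Illustrative records, not real data: 18 good deploys and 2 failed ones.
history = (
    [{"failed": False, "recovery_minutes": 0}] * 18
    + [{"failed": True, "recovery_minutes": 12},
       {"failed": True, "recovery_minutes": 95}]
)
print(change_failure_rate(history), failed_deployment_recovery_minutes(history))
```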

Where this is heading

Two predictions, each with a reasonable evidence base.

1. The L2 → L3 transition becomes table-stakes within 18 months. AI-authored PRs from agents are already merging in production at multiple companies in our network. Once the review surface is the same as for human-authored PRs (which it already is via GitHub / Bitbucket / GitLab), there is no organizational reason not to use them.

2. L4 stays narrow. The threat surface of agent-applied changes is genuinely scary, and the per-incident savings of going from L3 to L4 are smaller than the savings from L1 to L2. Expect L4 to be the place where one or two well-understood fix classes get automated, while everything else stays L3.

The teams who win in 2026-2027 are the ones who get to credible L3 first.

Where Aurora fits

Aurora is the AI agent layer of an auto-remediation stack — it covers L2 (investigation), L3 (PR-based remediation via Aurora Actions), and the agent half of L4 (policy-gated remediation). It does not replace Argo Rollouts or Flagger at L1; those remain the foundation. Aurora is the difference between rolling back blind and rolling back with a written RCA and a draft PR.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Aurora Actions launch: Aurora Actions: User-Defined Background Automations
OSS comparison: Aurora vs HolmesGPT vs K8sGPT
Safety architecture: AI Agent kubectl Safety

Originally published at arvoai.ca.

AI Agent kubectl Safety: Sandboxed Execution for Production

Siddharth Singh — Wed, 06 May 2026 20:44:12 +0000

Key Takeaways

Giving an AI agent kubectl access is an architecture decision, not a permission flag. Per-permission gates fail under prompt injection.

OWASP ranks "Excessive Agency" as LLM06 in the 2025 Top 10 for LLM Applications and "Tool Misuse and Exploitation" as ASI02 in the 2026 Top 10 for Agentic Applications.

The Kubernetes ecosystem already has an answer: k8s-sigs/agent-sandbox provides a declarative API for isolated agent runtimes using gVisor or Kata Containers.

Real precedent exists. EchoLeak (CVE-2025-32711), CVSS 9.3, was the first publicly documented zero-click prompt-injection data exfiltration in a production LLM system. The kubectl analogue would be cluster-wide.

Aurora runs every kubectl command in a pod-isolated process via its terminal_run primitive, with an environment-variable allowlist that strips secrets, signature-matcher and LLM-judge guardrails, and per-invocation cloud credentials.

Of the 46+ products marketed as "AI SRE" in 2026, only a handful publicly document their kubectl execution architecture — and the gap between vendors that handle this well and vendors that handle it badly is the single largest unspoken risk in the category. AI agent kubectl safety is the architectural discipline of letting an AI agent run kubectl (or any cloud CLI) against production without inheriting cluster-wide blast radius if the agent is compromised. It is not the same as RBAC scoping, and it is not the same as a human approval prompt — both are necessary but neither is sufficient on its own.

When OWASP published its 2025 Top 10 for LLM Applications, it ranked Prompt Injection (LLM01) as the top risk and Excessive Agency (LLM06) as one of the most consequential — defining it across three root causes: excessive functionality, excessive permissions, and excessive autonomy. In December 2025, OWASP followed up with a dedicated Top 10 for Agentic Applications that names Tool Misuse and Exploitation (ASI02) and Identity and Privilege Abuse (ASI03) as primary attack surfaces.

Translation: if you give an AI agent the ability to run kubectl, aws, or gcloud commands against production, you have a security architecture problem — not a permissions problem. This guide walks through the threat model, the emerging Kubernetes sandboxing standard, and how to evaluate any AI SRE on its kubectl safety.

What can go wrong when AI agents run kubectl?

Any LLM-driven agent that executes commands inherits the security properties of the LLM, the harness, and the runtime. Three real-world precedents illustrate the failure modes:

EchoLeak (CVE-2025-32711) — Microsoft 365 Copilot, CVSS 9.3 critical, patched in June 2025. Discovered by Aim Security, it was the first publicly documented zero-click indirect prompt-injection data exfiltration in a production LLM system. A crafted email sat in Outlook; when the user later asked Copilot for an unrelated summary, the email's hidden instructions fired and exfiltrated SharePoint, OneDrive, and Teams data. Research paper: arXiv:2509.10540.
MITRE ATLAS prompt-injection techniques — MITRE ATLAS catalogues real-world adversary techniques against AI systems, including indirect prompt injection that turns an LLM with tool access into an attacker-controlled execution surface. The framework specifically documents techniques for exfiltration via AI agent tool invocation.
Agent Session Smuggling — Palo Alto Unit 42 (November 2025) demonstrated rogue agents exploiting trust in the Agent-to-Agent (A2A) protocol with multi-turn manipulation. Documented in OWASP's Agentic Top 10.

None of these specifically targeted kubectl-running agents in production — but the class is the same and the blast radius would be larger. An agent that can run kubectl delete is one prompt-injection payload away from a cluster-wide outage.

The Four Attack Surfaces of Agentic kubectl

Most teams think of kubectl agent safety as a single problem ("can the agent be tricked?"). It's actually four distinct attack surfaces, each requiring its own mitigation.

Surface	Failure mode	Why permission-scoping alone fails	Mitigation
1. Prompt injection	Hidden instructions in logs, alerts, runbooks, or chat coerce the agent	Compromised agent acts within its granted permissions, which is exactly what permission-scoping permits	Sandboxed runtime; never trust LLM output derived from data the LLM read
2. Credential leakage	Executed command reads `AWS_SECRET_ACCESS_KEY`, `VAULT_TOKEN`, `KUBECONFIG` from inherited env	Permissions live on credentials; if the credential leaks, the permission set leaks with it	Per-invocation short-lived credentials (STS, Service Principal); explicit env allowlist that strips secrets
3. Blast radius escalation	Legitimate command runs against wrong namespace, region, or cluster	Permissions don't model "right action, wrong target"	Default read-only; dependency-graph awareness; human approval for destructive writes
4. Audit trail gaps	Logs capture commands without the agent's reasoning	Permission systems audit "who ran what," not "why"	Per-investigation transcripts that link reasoning → tool calls → outputs

Attack Surface 1: Prompt injection

The agent reads a log line, alert payload, runbook, or chat message that contains hidden instructions. The LLM cannot reliably distinguish data from instructions in the same channel — this is the fundamental property OWASP's LLM01 captures. Even frontier models do not eliminate it. Anthropic has publicly stated that "no browser agent is immune to prompt injection" and publishes defense benchmarks showing measurable but imperfect attack-prevention rates across computer-use, bash tool use, and MCP workflows. The implication for kubectl-running agents is clear: the LLM is not the security boundary. The runtime is.

Mitigation: treat any LLM output derived from untrusted input (log lines, alert payloads, runbooks, chat messages) as untrusted itself. Sandbox the execution layer so even a successful injection has limited blast radius.

Attack Surface 2: Credential leakage

If the agent runs commands with credentials inherited from the host process environment (AWS_SECRET_ACCESS_KEY, KUBECONFIG, VAULT_TOKEN), a successful command-injection or shell escape exposes everything the agent process has access to. Long-lived static credentials make this catastrophic.

Mitigation: scope credentials per invocation using short-lived tokens (AWS STS AssumeRole, Azure Service Principal sessions, GCP short-lived tokens), and strip everything else from the child process environment with an explicit allowlist.
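The allowlist half of this mitigation can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the key set and the helper names `sanitized_env` and `run_sanitized` are assumptions chosen for the example.

```python
import os
import subprocess

# Illustrative allowlist; a real deployment would choose its own keys.
SAFE_ENV_KEYS = {"PATH", "HOME", "LANG", "TERM", "TMPDIR"}


def sanitized_env(source=None):
    """Return only allowlisted variables from `source` (default: os.environ)."""
    source = os.environ if source is None else source
    return {key: value for key, value in source.items() if key in SAFE_ENV_KEYS}


def run_sanitized(argv, source_env=None):
    """Run `argv` (a list, never a shell string) with the stripped environment."""
    return subprocess.run(
        argv,
        env=sanitized_env(source_env),  # child sees nothing outside the allowlist
        capture_output=True,
        text=True,
        check=False,
    )
```

Because the filter is an allowlist rather than a denylist, a secret added to the host environment later is dropped by default instead of leaking until someone remembers to block it.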

Attack Surface 3: Blast radius escalation

Even legitimate, non-injected commands can have outsized effects: kubectl delete pod in the wrong namespace, or aws ec2 terminate-instances against a misidentified region. The agent doesn't need to be compromised; it just needs to be wrong.

Mitigation: read-only by default, write actions behind explicit human approval, and dependency-graph awareness so the agent can compute blast radius before acting.
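A read-only default can be enforced before a command ever reaches the cluster. The sketch below is one hedged way to do it: the verb set, the metacharacter filter, and the function name `classify_kubectl` are all assumptions for illustration, and the parser is deliberately strict so that anything unusual falls to approval or rejection.

```python
import shlex

# Verbs treated as read-only in this sketch; the set is an assumption.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version"}
SHELL_METACHARS = set(";|&$`<>")


def classify_kubectl(command: str) -> str:
    """Return "read_only", "needs_approval", or "reject" for a proposed command."""
    # Reject anything that looks like shell chaining or substitution.
    if any(ch in SHELL_METACHARS for ch in command):
        return "reject"
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return "reject"
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return "reject"
    # The verb must come directly after `kubectl`. Leading flags are not
    # parsed, so such commands go to approval instead of running automatically.
    return "read_only" if tokens[1] in READ_ONLY_VERBS else "needs_approval"
```

The filter is conservative: it also rejects some legitimate commands (for example, output templates containing `$`). Read-only is also not the same as harmless, since `kubectl get secrets` is a read, so RBAC still has to restrict what the read-only identity can see.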

Attack Surface 4: Audit trail gaps

When an investigation runs across 20+ tool invocations, traditional audit systems (CloudTrail, Kubernetes audit logs) record what was run but not why. A reviewer six months later cannot tell whether a kubectl scale was a legitimate response to a load spike or an injected instruction.

Mitigation: structured per-investigation transcripts that capture agent reasoning alongside tool calls. The right log isn't "kubectl was run" — it's "in response to alert X, the agent hypothesized Y, ran kubectl Z, and observed W."
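One possible shape for such a transcript is a JSON Lines record per tool call. The field names below (`alert_id`, `hypothesis`, `command`, `observation`) are hypothetical and simply mirror the "alert X, hypothesis Y, command Z, observation W" structure described above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TranscriptStep:
    alert_id: str      # which alert triggered the investigation
    hypothesis: str    # what the agent believed when it chose the command
    command: str       # the exact tool call that was run
    observation: str   # what came back
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl(steps: List[TranscriptStep]) -> str:
    """Serialize steps as one JSON object per line, with stable key order."""
    return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in steps)
```

Keeping the reasoning and the command in the same record lets a reviewer answer "why was this run?" months later without having to join separate log sources.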

Why "human approval" alone is not enough

The most common safety story in the AI SRE space is "the agent suggests; humans approve." That is necessary but not sufficient.

The problem with approval gates as the only line of defense:

Decision fatigue. An agent that handles 50 alerts a week generates dozens of approval prompts. Humans rubber-stamp.
Approval ≠ understanding. Engineers approve commands they don't fully understand because the agent's reasoning sounds plausible.
Injected intent looks legitimate. A prompt-injection payload can produce a recommendation that reads exactly like a normal RCA. The approver has no signal that the underlying instruction came from an attacker.

Approval gates are critical, but they need to sit on top of an already-sandboxed runtime — not be the only protection.

Permission scoping vs sandboxed execution: what's the difference?

These two terms get conflated. They aren't the same thing.

Permission scoping restricts what an agent's identity can do. RBAC roles, IAM policies, kubeconfig contexts. It's necessary, but it operates at the cluster-API layer — meaning a successful prompt injection can still use every permission the agent has.

Sandboxed execution isolates the runtime in which commands execute. If the agent's process is compromised, the sandbox limits what the compromised process can do regardless of the credentials it holds. The compromised process can't read other pods' files, can't reach other nodes, can't escalate to the host kernel.

The defensible architecture combines both: tight permission scoping (small RBAC role, short-lived credentials) + runtime isolation (sandboxed execution).
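The permission-scoping half is ordinary Kubernetes RBAC. As a minimal sketch, the function below builds a namespaced, read-only Role manifest as a Python dict; the role name and the resource list are illustrative assumptions. The output can be serialized to JSON, which `kubectl apply -f` accepts.

```python
import json

READ_VERBS = ["get", "list", "watch"]


def read_only_role(namespace: str, name: str = "ai-sre-readonly") -> dict:
    """Build a namespaced Role that grants read verbs only (illustrative resource list)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # core API group
                "resources": ["pods", "pods/log", "events", "services"],
                "verbs": READ_VERBS,
            },
            {
                "apiGroups": ["apps"],
                "resources": ["deployments", "replicasets", "statefulsets"],
                "verbs": READ_VERBS,
            },
        ],
    }


manifest_json = json.dumps(read_only_role("payments"), indent=2)
```

A Role like this bounds what the agent's identity may ask the API server to do. It does nothing about what a compromised process can do on the node, which is why runtime isolation is still needed.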

How sandboxed kubectl actually works

The Kubernetes ecosystem began converging on this pattern in 2025–2026.

k8s-sigs/agent-sandbox

k8s-sigs/agent-sandbox is a formal Kubernetes SIG Apps subproject that launched at KubeCon Atlanta in November 2025. It provides a declarative Kubernetes API for "isolated, stateful, singleton workloads" — built specifically for AI agent runtimes that may execute untrusted, LLM-generated code.

Core CRDs:

Sandbox — an isolated pod-equivalent with stronger boundaries
SandboxTemplate — reusable configuration
SandboxClaim — request a sandbox for a workload
SandboxWarmPool — pre-created sandboxes that bring cold-start under one second

The Kubernetes blog post from March 2026 makes the architectural claim explicit: "Isolation achieved via runtime-level sandboxing (gVisor/Kata), not just container-level namespaces."

gVisor

gVisor is a Google-maintained user-space application kernel that provides kernel-level isolation without full virtualization. Architecture: Sentry (a kernel emulator written in Go) intercepts roughly 200 Linux syscalls; Gofer brokers filesystem access over 9P. The OCI runtime is runsc, drop-in compatible with runc.

gVisor runs in production at Google for App Engine standard, Cloud Functions, Cloud Run, and Cloud ML Engine. GKE Sandbox productizes it for GKE node pools. It is one of two named isolation backends in agent-sandbox (the other being Kata Containers, which uses lightweight VMs).
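At the pod level, opting into a sandboxed runtime comes down to the standard `runtimeClassName` field. The sketch below builds such a Pod manifest as a Python dict. It assumes the cluster already defines a RuntimeClass named `gvisor` (the name is cluster-specific), and the image name in the usage line is a placeholder.

```python
import json


def sandboxed_pod(name: str, image: str, runtime_class: str = "gvisor") -> dict:
    """Build a Pod manifest that requests a sandboxed runtime via RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Must match a RuntimeClass object that exists in the cluster.
            "runtimeClassName": runtime_class,
            # Do not mount a service account token the command does not need.
            "automountServiceAccountToken": False,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "runner",
                    "image": image,
                    "securityContext": {
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                    },
                }
            ],
        },
    }


pod_json = json.dumps(sandboxed_pod("kubectl-runner", "example.invalid/kubectl:latest"))
```

This shows only the plain-Pod mechanism. The agent-sandbox CRDs listed earlier wrap this kind of configuration in their own API, whose schema is not reproduced here.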

Why this matters for AI SRE

An AI SRE that runs kubectl against production is exactly the kind of workload agent-sandbox was built for. It executes LLM-generated commands. It needs file system isolation, syscall isolation, and per-invocation credential scoping. It benefits enormously from a warm pool that reduces cold-start latency.

If you are evaluating an AI SRE in 2026, this is one of the right questions to ask: what isolation backend does the agent use when it executes commands?

How Aurora's pod-isolated execution works

Aurora's approach predates agent-sandbox and follows the same architectural principles.

When Aurora's agent runs a kubectl, aws, az, or gcloud command, it doesn't use subprocess.run() directly. It uses an internal primitive called terminal_run, defined in server/utils/terminal/terminal_run.py. The module's docstring is explicit:

Drop-in replacement for subprocess.run() that executes in terminal pods. This module provides a terminal_run() function that mimics subprocess.run() API but executes commands in isolated terminal pods via kubectl exec. Safety guardrails (signature matcher + LLM judge) run automatically unless the caller passes trusted=True for known-safe internal operations.

Three properties matter:

1. Pod-isolated execution. When the ENABLE_POD_ISOLATION flag is set (the default in Kubernetes deployments), every external command runs inside a separate terminal pod via kubectl exec. The agent's own process never executes the command directly. A successful command-injection in the agent's reasoning loop does not give an attacker access to the agent host.

2. Two-stage safety guardrails. Before any non-trusted command runs, two checks fire automatically: a deterministic signature matcher that rejects known-dangerous patterns, and an LLM judge that evaluates the proposed command against the investigation context. The trusted=True flag bypasses both — used only for known-safe internal operations like configured connector calls.

3. Sanitized environment allowlist. Aurora's terminal_exec_tool module defines an explicit _SAFE_ENV_KEYS set: PATH, HOME, USER, SHELL, TERM, LANG, TMPDIR, SSL_CERT_FILE, plus ENABLE_POD_ISOLATION itself. Everything else — including VAULT_TOKEN, DATABASE_URL, SECRET_KEY, and any cloud credentials — is stripped from the child process environment. A compromised command cannot read the agent's secrets via env.
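The two-stage guardrail in the second property can be illustrated with a generic sketch. This is not Aurora's code: the deny patterns, the `check_command` name, and the `judge` callable (a stand-in for an LLM call) are all hypothetical.

```python
import re
from typing import Callable, Tuple

# Stage 1: deterministic signatures for known-dangerous commands (illustrative).
DENY_PATTERNS = [
    re.compile(r"\bkubectl\s+delete\b.*\s--all\b"),
    re.compile(r"\bkubectl\s+delete\s+(namespace|ns)\b"),
    re.compile(r"\bterminate-instances\b"),
]


def check_command(
    command: str,
    context: str,
    judge: Callable[[str, str], bool],
) -> Tuple[bool, str]:
    """Return (allowed, reason). The judge only runs if no signature matches."""
    for pattern in DENY_PATTERNS:
        if pattern.search(command):
            return False, f"blocked by signature: {pattern.pattern}"
    # Stage 2: context-aware check, stubbed here as a plain callable.
    if not judge(command, context):
        return False, "blocked by judge"
    return True, "allowed"
```

Running the deterministic stage first means the most dangerous patterns are rejected cheaply and predictably, without depending on a model that could itself be influenced by injected text.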

Cloud credentials are handled separately. Aurora calls generate_contextual_access_token and generate_azure_access_token per invocation. AWS uses STS AssumeRole via cross-account roles (aurora-cross-account-role.yaml) — short-lived credentials, not long-lived access keys. Azure uses Service Principal sessions. GCP uses OAuth-derived tokens.
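To make the per-invocation idea concrete, the sketch below turns one short-lived credential set into a child-process environment and refuses credentials that have already expired. It is an assumption-laden illustration, not Aurora's implementation: the input is assumed to have the shape of the `Credentials` object in an STS AssumeRole response, and the example values are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Assumed minimal base environment for the child process (illustrative).
BASE_ENV = {"PATH": "/usr/local/bin:/usr/bin:/bin"}


def env_from_sts_credentials(
    creds: Dict, now: Optional[datetime] = None
) -> Dict[str, str]:
    """Build a child environment from one short-lived credential set."""
    now = now or datetime.now(timezone.utc)
    if creds["Expiration"] <= now:
        raise ValueError("credentials have expired; request a fresh session")
    env = dict(BASE_ENV)
    env["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
    env["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
    env["AWS_SESSION_TOKEN"] = creds["SessionToken"]
    return env


# Placeholder values only; these are not real credentials.
example_creds = {
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
    "Expiration": datetime.now(timezone.utc) + timedelta(minutes=15),
}
child_env = env_from_sts_credentials(example_creds)
```

If a command does leak its environment, what leaks is a token that expires within minutes and carries only the assumed role's permissions.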

For agents that need to reach customer Kubernetes clusters Aurora can't access directly, a separate kubectl-agent binary deploys via Helm into the customer's cluster and connects outbound over WebSocket. No inbound network access required, no kubeconfig sharing, no static credentials at rest.

How to evaluate an AI SRE's kubectl safety model

Eight questions to ask any AI SRE vendor or open-source project before enabling production access:

Where does the command actually execute? Same process as the agent? Same host? Separate container? Sandboxed runtime (gVisor/Kata)?
What credentials does the command inherit from the host environment? Specifically: can the executed command read your agent's vault token, database URL, or other host secrets?
Are credentials short-lived or static? STS / Service Principal sessions, or long-lived access keys?
Is the default read-only? What flag, configuration, or RBAC role enables write access?
What happens between "agent decides to run X" and "X runs"? Is there a deterministic policy check? An LLM judge? A human approval prompt? All three?
Are destructive actions specifically gated? What's the definition of "destructive" — vendor-defined or operator-configurable?
What does the audit trail capture? Just the commands, or the agent's reasoning + the commands together?
What's the blast radius of a single successful prompt injection? Walk through the worst case explicitly with the vendor.

If a vendor can't answer these clearly, the architecture isn't ready for production write access.

Open questions in 2026

This is a young problem space. Several questions are not yet resolved:

Standardization. k8s-sigs/agent-sandbox is the leading candidate for a standard, but Knative Sandbox, container-level approaches, and microVM-based runtimes (Firecracker) are all in play.
Multi-cloud isolation. Sandboxing a Kubernetes pod is a solved problem. Sandboxing a process that calls aws, az, gcloud across cloud APIs from a single agent is harder — the credentials and trust boundaries change per provider.
Approval UX at scale. Engineers can't approve 200 actions per week. The right UI for batch approval, policy-based pre-approval, and rollback-only autonomy is still being figured out.

Expect significant movement on all three through 2026 and into 2027.

Aurora's approach in summary

If you operate an AI SRE in production, the safety questions are non-negotiable. Aurora's answer is: pod-isolated execution by default, deterministic + LLM-judge guardrails before any non-trusted command, environment-variable allowlist that strips secrets, per-invocation cloud credentials via STS/Service Principal/short-lived tokens, and human approval for destructive write operations. The full architecture is open source under Apache 2.0 — auditable in the Aurora repository.

For background on the agent and tool model, see the complete guide to AI SRE, the open-source AI SRE comparison, or the explainer on agentic incident management.

Originally published at arvoai.ca.

Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)

Siddharth Singh — Wed, 06 May 2026 20:38:19 +0000

Key Takeaways

Three credible open-source AI SREs exist in 2026: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox). All three are Apache 2.0.

Two of the three are true multi-step agents. HolmesGPT runs an iterative ReAct loop, and Aurora is a multi-step LangGraph agent with cross-cloud execution. K8sGPT is a rule-based scanner that uses an LLM only to explain findings.

Only Aurora handles multi-cloud out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.

Two of the three can open remediation pull requests. Aurora drafts PRs against GitHub and Bitbucket behind a human approval gate; HolmesGPT can open PRs with suggested fixes in Operator mode; K8sGPT is strictly read-only with no write actions.

All three support BYO LLM, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.

Of the 46+ companies offering "AI SRE" products in 2026, only a handful are open source, and only three are credible enough to deploy in production: Aurora, HolmesGPT, and K8sGPT. An open-source AI SRE is an AI agent that performs incident investigation, root cause analysis, and (sometimes) remediation under a permissive license that allows self-hosting, source-code audit, and modification. The three get lumped together in marketing, but architecturally they are different products solving different parts of the incident response problem.

This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.

What is an open-source AI SRE?

An open-source AI SRE is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:

License: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.
Self-hostable: runs entirely inside your environment without phoning home to a vendor.
LLM-driven: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)

The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, auditable AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.

Why open source matters for AI SRE

Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:

Data sovereignty. Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs send all of it to their backend and to a third-party LLM. Self-hosted means it stays in your VPC.
Audit transparency. Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.
Cost predictability. Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens — and Ollama-local inference can flatten the LLM bill entirely.

The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.

How the three compare

This is the only table you need. Verified from each project's GitHub repo, official docs, and source as of May 2026.

Dimension	Aurora	HolmesGPT	K8sGPT
License	Apache 2.0	Apache 2.0	Apache 2.0
GitHub stars	201	2,366	7,737
Latest release	v1.1.1 (Mar 2026)	0.26.0 (Apr 2026)	v0.4.32 (Apr 2026)
CNCF status	Independent	Sandbox (Oct 2025)	Sandbox
Built by	Arvo AI	Robusta + Microsoft	k8sgpt-ai community
Agent architecture	LangGraph supervisor + sub-agents	ReAct loop (`ToolCallingLLM`)	Rule-based scanner + LLM explainer
Multi-step reasoning	Yes	Yes	No (single-shot per analyzer)
Cloud providers	AWS, Azure, GCP, OVH, Scaleway	Kubernetes + AWS via MCP	Kubernetes only
Kubernetes execution	`kubectl` in sandboxed pods	Read-only `kubectl get`/`describe`	Read-only via Kube API
Other integrations	22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.)	30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.)	None — Kubernetes-only by design
Knowledge base / RAG	Weaviate vector search over runbooks + postmortems	Yes (via toolsets)	No
Dependency graph	Memgraph (cross-cloud blast radius)	No	No
Postmortem generation	Yes, exports to Confluence	Investigation reports only	No
Pull request remediation	GitHub + Bitbucket with human approval gate	GitHub PRs in Operator mode	None — strictly read-only
MCP support	Server (340+ endpoints, 6 named tools)	Client (consumes MCP servers)	No
LLM providers	OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama	OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama	OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama
Air-gapped support	Yes (Ollama + image tarballs)	Yes (Ollama)	Yes (LocalAI / Ollama)
Deployment	Docker Compose or Helm	Binary, API server, K8s Operator, Python SDK	Go binary, K8s operator

The OSS AI SRE Maturity Spectrum

A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.

Level	What the agent does	Tools at this level
L1 — Diagnostic Explainer	Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only.	K8sGPT
L2 — Read-Only Investigator	Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design.	HolmesGPT
L3 — Investigation + Suggestion	Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure.	HolmesGPT (Operator mode), Aurora
L4 — Investigation + Approved Remediation	Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails — typically a sandboxed runtime with explicit human approval for destructive operations.	Aurora (with Bitbucket connector's human approval gate for destructive actions)

No open-source tool today operates at a hypothetical fifth level, L5 (closed-loop remediation without human approval), and that's by design. Most serious teams want explicit gates before agents touch production.

Aurora vs HolmesGPT — which should you choose?

Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on how far your incidents reach beyond Kubernetes.

Pick HolmesGPT when:

Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.
You want a tool that already integrates with 30+ observability sources, including Loki, AlertManager, NewRelic, Datadog APM, OpsGenie, and Slack.
You value CNCF governance and fast ecosystem velocity.
You don't need cross-cloud (AWS APIs, Azure resources, GCP services) reasoning out of the box.

Pick Aurora when:

You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.
You want auto-generated postmortems exported to Confluence.
You want the agent to draft remediation PRs against your codebase.
You need a graph-based blast radius model (Memgraph) for dependency analysis.
You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.

In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.

Aurora vs K8sGPT — which should you choose?

This is closer to "which tool category do you need?" than a head-to-head.

Pick K8sGPT when:

You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as k8sgpt analyze --explain.
Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.
You want the maturity of a 7.7k-star CNCF Sandbox project with rule-based analyzers that won't hallucinate causes (because findings are produced deterministically before the LLM ever sees them).

Pick Aurora when:

You need agentic investigation, not just diagnostic explanation.
You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.
You want auto-generated postmortems and remediation PRs.

These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.

HolmesGPT vs K8sGPT — head-to-head

Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.

Aspect	HolmesGPT	K8sGPT
What it is	Multi-step AI agent	Rule-based scanner with LLM explanations
When it shines	Investigating an alert end-to-end across signals	Diagnosing why a specific resource is unhealthy
Latency	Seconds to minutes (multi-step)	Sub-second per analyzer
LLM cost	Higher (multiple calls per investigation)	Lower (one explanation per finding)
Hallucination risk	Higher (agent reasons across signals)	Lower (deterministic before LLM)
Best fit	On-call engineers handling alerts	Platform teams running periodic cluster audits

K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.

When NOT to use open-source AI SRE

Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:

You don't have the operational capacity to run another stateful service in production.
You want vendor support with SLAs and a phone number to call at 3 AM.
Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE.
You need certifications (SOC 2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.

How to pilot an open-source AI SRE in your team

A six-step, low-risk pilot for any of the three tools:

Pick one cluster and one observability source. Don't try to cover everything at once.
Install in read-only mode first. All three tools default to read-only — keep it that way for the first two weeks.
Connect one alert source. PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.
Run for two weeks alongside human on-call. Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.
Feed it your historical context. Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.
Expand carefully. Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.
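The comparison in the two-week step can be kept honest with a small scoring script. The sketch below assumes each incident is logged as a record with hypothetical fields `agent_rca`, `human_rca`, `agent_minutes`, and `human_minutes`. It uses exact string match as a crude proxy for agreement; in practice a reviewer would judge whether two RCAs agree.

```python
from statistics import median
from typing import Dict, List


def pilot_summary(records: List[Dict]) -> Dict[str, float]:
    """Summarize agreement rate and median time-to-RCA for a pilot period."""
    if not records:
        raise ValueError("no incidents recorded")
    agreed = sum(1 for r in records if r["agent_rca"] == r["human_rca"])
    return {
        "agreement_rate": agreed / len(records),
        "median_agent_minutes": median(r["agent_minutes"] for r in records),
        "median_human_minutes": median(r["human_minutes"] for r in records),
    }
```

Median is used instead of mean so that one unusually long incident does not dominate a small pilot sample.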

Getting started with Aurora

Aurora is the multi-cloud, multi-tool option among open-source AI SREs. To run it:

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Aurora supports any LLM provider — OpenAI, Anthropic, Google, OpenRouter, or local models via Ollama for air-gapped deployments.

For the technical side of running an agent that executes kubectl against production, see the companion piece on AI agent kubectl safety and sandboxed execution.

Originally published at arvoai.ca.

AI SRE: The Complete Guide for Engineering Teams in 2026

Siddharth Singh — Fri, 24 Apr 2026 21:37:36 +0000

Key Takeaway: An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that triages alerts, investigates incidents, performs root cause analysis, and generates postmortems without step-by-step human direction. Gartner projects that by 2029, 70% of enterprises will deploy agentic AI to operate their IT infrastructure, up from less than 5% in 2025. This guide explains what an AI SRE actually does, how it differs from AIOps and traditional SRE, and how to evaluate the commercial and open-source tools available in 2026.

An AI SRE is an autonomous software agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, postmortem generation, and in some cases guided remediation — using large language models and production tooling to operate with minimal human direction. Unlike chatbots or copilots, an AI SRE decides what to investigate, which systems to query, and how to synthesize findings into actionable outcomes.

The category crystallized in 2026. Microsoft made Azure SRE Agent generally available on March 10, 2026. Komodor reports being named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling. Open-source options like Aurora, K8sGPT, and HolmesGPT emerged as credible alternatives to commercial platforms.

What is an AI SRE?

An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that performs SRE work — alert triage, incident investigation, root cause analysis, postmortem generation, and guided remediation — without requiring step-by-step human direction.

Three characteristics distinguish an AI SRE from earlier generations of operations tooling:

Autonomy. An AI SRE decides which tools to use and what data to gather. It is not a runbook that executes predefined steps; it is an agent that plans a multi-step investigation based on the specific alert.
Access to production. An AI SRE reads real infrastructure signals — metrics, logs, traces, Kubernetes events, cloud API responses, deployment history — rather than working only from summaries.
Synthesis. An AI SRE produces structured outputs: a root cause analysis, a timeline, a blast radius assessment, a postmortem, or a remediation PR. It does not stop at "the error rate is elevated."

Why AI SRE Emerged in 2026

The conditions that made AI SRE viable came together between 2024 and 2026:

Alert volume outpaced human capacity. PagerDuty's State of Digital Operations data shows the average on-call engineer receives roughly 50 alerts per week, with only 2–5% requiring real human intervention. A 2024 Catchpoint study cited by OneUptime found that 70% of SRE teams list alert fatigue as a top-three operational concern.

Multi-cloud became the default. According to the Flexera 2025 State of the Cloud Report, organizations use an average of 2.4 public cloud providers, and 70% operate a hybrid cloud strategy. Correlating incidents across AWS, Azure, and GCP by hand is increasingly impractical.

Change velocity rose faster than reliability tooling. The 2025 DORA State of AI-Assisted Software Development report found that incidents per PR increased 242.7% as AI coding assistants accelerated delivery — without a matching improvement in incident response capacity.

LLM tool use matured. Agent frameworks like LangGraph made it practical to give a language model 30+ tools and let it chain them into a coherent investigation. Claude, GPT-5, and Gemini 2.5+ reached enough reliability at structured tool use to be trusted with read-only production access.

Gartner codified the category. In Predicts 2026: AI Agents Will Transform IT Infrastructure and Operations, Gartner projected that by 2029, 70% of enterprises will deploy agentic AI to operate IT infrastructure, up from less than 5% in 2025.

How Does an AI SRE Work?

An AI SRE runs a repeatable loop for every alert it receives:

Alert ingestion. A monitoring tool (PagerDuty, Datadog, Grafana, BigPanda) fires a webhook. The AI SRE receives the payload and begins investigation without waiting for a human to acknowledge the page.
Context gathering. The agent reads the recent state: pod status, metric trends, deployment history, recent configuration changes, related alerts within a time window.
Hypothesis formation. Using the alert semantics plus the gathered context, the agent proposes one or more candidate causes.
Evidence collection. The agent selects from its tool inventory — running kubectl describe, querying metrics, searching a vector knowledge base of past postmortems — to test each hypothesis.
Root cause synthesis. The agent produces a structured RCA: what failed, why, what the blast radius is, which services are affected, whether a recent change likely caused it.
Remediation (optional). Some AI SREs stop at recommendations. Others generate a PR, roll back a deployment, or restart a service — typically behind a human approval gate for destructive actions.
Postmortem generation. The agent assembles a draft postmortem with timeline, contributing factors, impact, and action items, ready for human review and export to Confluence or another docs system.

A trustworthy AI SRE is transparent about this loop — surfacing the evidence it considered, the hypotheses it ruled out, and its confidence in the final answer.
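The hypothesis and evidence steps of the loop above can be sketched as a small Python skeleton. In a real agent the LLM proposes the hypotheses and chooses the tools; here both are passed in as plain data and stub callables so the control flow is visible. All names (`Investigation`, `investigate`, the tool names in the usage note) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A tool returns (supports_hypothesis, short note describing what it saw).
Tool = Callable[[], Tuple[bool, str]]


@dataclass
class Investigation:
    alert: str
    evidence: List[str] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)
    root_cause: str = "undetermined"


def investigate(
    alert: str,
    hypotheses: List[Tuple[str, str]],  # (hypothesis, name of tool that tests it)
    tools: Dict[str, Tool],
) -> Investigation:
    """Test hypotheses in order, recording evidence and what was ruled out."""
    result = Investigation(alert=alert)
    for hypothesis, tool_name in hypotheses:
        supported, note = tools[tool_name]()
        result.evidence.append(f"{tool_name}: {note}")
        if supported:
            result.root_cause = hypothesis
            break
        result.ruled_out.append(hypothesis)
    return result
```

For example, calling `investigate` with a "node pressure" hypothesis whose stub tool returns `(False, ...)` followed by a "bad deploy" hypothesis whose stub returns `(True, ...)` yields a result with one ruled-out hypothesis, two evidence entries, and "bad deploy" as the root cause, which is the transparency property described above.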

AI SRE vs Traditional SRE vs AIOps

The three categories are often conflated but address different problems.

Aspect	Traditional SRE	AIOps	AI SRE
Primary function	Human engineers manage reliability	Anomaly detection, alert correlation	Autonomous incident investigation and RCA
Investigation	Manual (human reads logs, queries systems)	Suggests related alerts	Agent runs multi-step investigation
Root cause analysis	Hours, depends on engineer's expertise	Correlation hints, not causation	Structured RCA in minutes
Tool use	Engineer runs kubectl, aws CLI, dashboards	Reads pre-ingested telemetry	Dynamically selects from 20–40+ tools
Remediation	Human-driven	Typically suggestions only	Agentic execution, often with approval gates
Knowledge transfer	Runbooks, tribal knowledge	Alert correlation models	RAG over runbooks and past postmortems
Core technology	Humans plus monitoring dashboards	ML models for anomaly detection	LLM agents with tool calling

The short version: AIOps tells you what is anomalous. An AI SRE tells you why it is happening and, increasingly, fixes it. Traditional SRE is the human discipline both categories augment.

What Capabilities Should an AI SRE Have?

Serious AI SREs in 2026 share a consistent capability stack:

Autonomous multi-step investigation

The agent must plan and execute investigations without requiring humans to choose tools or pass data between steps. Simple tool-calling is not enough — the agent needs memory across steps and the ability to revise hypotheses as evidence arrives.

Broad tool access with safe execution

The agent needs access to kubectl, aws, az, gcloud, metric queries, log search, deployment history, and IaC state. How tools are executed matters: running kubectl on the agent host is a production risk. Aurora, for example, runs CLI commands in sandboxed Kubernetes pods with per-invocation credential scoping, not on the agent host.
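One common guardrail is a verb allowlist in front of the sandbox. The sketch below is an assumption-laden illustration, not Aurora's implementation: the allowlist contents and the `is_read_only_kubectl` helper are hypothetical, and the check is deliberately conservative (it rejects global flags placed before the verb and anything containing shell metacharacters).

```python
import shlex

# Assumed allowlist of kubectl verbs that only read cluster state.
READ_ONLY_KUBECTL_VERBS = {"get", "describe", "logs", "top", "explain"}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = set(";|&$`<>")


def is_read_only_kubectl(command: str) -> bool:
    """Return True only for `kubectl <verb> ...` with an allowlisted verb."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    if len(parts) < 2 or parts[0] != "kubectl":
        return False
    return parts[1] in READ_ONLY_KUBECTL_VERBS
```

A filter like this complements sandboxing and scoped credentials rather than replacing them. The command should still be executed as an argument list, never through a shell.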

Cross-cloud and cross-platform reach

With the Flexera 2025 average at 2.4 public clouds per organization, an AI SRE that works only inside AWS or only inside Kubernetes will miss incidents that span providers or originate outside its single platform.

Knowledge base retrieval

Past postmortems, runbooks, and docs should be searchable by the agent via vector search (RAG). The knowledge your senior SRE built up should be available to the agent on day one.
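To make the retrieval step concrete, here is a toy ranking function. It uses bag-of-words term counts and cosine similarity as a stand-in for learned embeddings and a vector database, so it shows the shape of the lookup only. The document names and the `top_k` helper are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase term counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, docs: dict, k: int = 2) -> list:
    """Return the names of the k most similar documents, best first."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(body)), name)
                     for name, body in docs.items()), reverse=True)
    return [name for score, name in scored[:k] if score > 0]
```

Given a knowledge base with one postmortem about an exhausted Postgres connection pool and another about an expired TLS certificate, a query such as "connection pool exhausted" ranks the first document on top, and the agent can pull it into its context.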

Infrastructure dependency graph

When a database fails, the agent needs to know which services depend on it. Graph databases like Memgraph are a common choice for modeling cross-service and cross-cloud relationships.
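The question "which services depend on this database?" is a reverse traversal of the dependency graph. The sketch below uses a plain in-memory dictionary and breadth-first search rather than a graph database. The service names and the `blast_radius` helper are made up for illustration.

```python
from collections import deque


def blast_radius(depends_on: dict, failed: str) -> set:
    """Return every service that depends on `failed`, directly or transitively.

    `depends_on` maps a service name to the set of things it depends on.
    """
    # Invert the edges: for each dependency, who relies on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for service in dependents.get(node, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

With `checkout` depending on `orders-db` and `payments`, `payments` depending on `orders-db`, and `web` depending on `checkout`, a failure of `orders-db` yields `checkout`, `payments`, and `web` as the affected set, while an unrelated `search` service stays out of it.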

Postmortem generation

Structured timeline, contributing factors, blast radius, action items — produced during the investigation, not written manually afterward.

Remediation with guardrails

Generating PRs, rolling back deployments, restarting services. Destructive actions should require human approval. Aurora's Bitbucket connector, added in v1.1.0, requires explicit human approval before agents can write.
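An approval gate can be as simple as refusing to run a destructive action unless a named human has signed off. This is a minimal sketch under that assumption. The `Action` type and `run_action` function are hypothetical and do not reflect how Aurora's connector is built.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    name: str
    destructive: bool  # True for rollbacks, restarts, and repository writes


def run_action(action: Action,
               execute: Callable[[], str],
               approved_by: Optional[str] = None) -> str:
    """Run `execute` only if the action is safe or a human has approved it."""
    if action.destructive and not approved_by:
        raise PermissionError(f"{action.name} requires human approval")
    return execute()
```

Read-only actions pass straight through. A rollback raises `PermissionError` until `approved_by` carries an approver's identity, which also gives the audit log something to record.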

LLM flexibility

The agent should support OpenAI, Anthropic, Google, and local models via Ollama for air-gapped deployments. Lock-in to a single LLM vendor is a real risk as model quality and pricing evolve rapidly.

The AI SRE Landscape in 2026

Commercial platforms

Azure SRE Agent — Microsoft's first-party agent, generally available since March 10, 2026. Deep Azure integration, adjustable autonomy from "review recommendations" to "fully automated," billed via Azure Agent Units on pay-as-you-go.
Rootly AI SRE — AI layer built on top of a mature incident management platform. Transparent chain-of-thought reasoning. SOC2 since January 2022. Depends on external observability tools for telemetry.
Komodor Klaudia: Kubernetes-specialized AI SRE. Komodor reports that Klaudia achieves 95% accuracy across real-world incident scenarios, and says it was named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling.
incident.io AI SRE — Multi-agent AI investigation integrated into an incident response platform, with code fix suggestions.
Traversal — Focused on large distributed systems using causal ML. Traversal reports a 38% MTTR reduction at DigitalOcean. Supports on-prem and bring-your-own model.
Resolve.ai — Pushes toward high-autonomy resolution with guardrails.

Open-source AI SRE options

Aurora — Apache 2.0, self-hosted, multi-cloud (AWS via STS AssumeRole, Azure via Service Principal, GCP, OVH, Scaleway, Kubernetes). LangGraph-orchestrated agents with 30+ tools, Memgraph dependency graph, Weaviate RAG, postmortem export to Confluence, PR generation via GitHub and Bitbucket. Works with any LLM (OpenAI, Anthropic, Google, OpenRouter, Ollama).
K8sGPT — Open-source CLI for scanning Kubernetes clusters and explaining failures in plain English. Narrower scope than a full AI SRE.
HolmesGPT — Open-source cross-stack SRE agent covering Kubernetes, Prometheus, logs, and Slack workflows.
Coroot (Community Edition) — Kubernetes observability plus AI-assisted RCA. Community Edition is free; commercial tier is priced transparently from $1 per monitored CPU core per month.

Open-Source vs Commercial AI SRE

Consideration	Open-Source	Commercial
Data residency	Fully self-hosted; incident data stays in your environment	Usually SaaS; incident data leaves your perimeter
Cost model	Free software; you pay for infra and LLM API usage	Per-seat or per-incident pricing
LLM choice	Bring any provider, including local via Ollama	Often bundled or restricted
Audit transparency	Source code available; you can audit how the agent behaves	Typically black-box
Support and managed ops	Community plus self-managed	Vendor support, SLAs, managed infrastructure
Time to deploy	Longer — self-hosting has setup cost	Shorter — SaaS onboarding
Customization	Fork, modify, add tools	Limited to what the vendor exposes

For regulated industries (finance, healthcare, government), air-gapped deployments, or teams already operating their own Kubernetes, open-source AI SRE is often the right fit. For teams prioritizing fastest time to value, commercial platforms win.

How to Evaluate an AI SRE Tool

If you are piloting an AI SRE in 2026, these are the questions to answer before committing:

How does the agent actually execute commands? Host process, container, sandboxed pod? Read-only or write? What credentials does it use?
Which alerts can it investigate today? Ask for specific integrations by name (PagerDuty, Datadog, CloudWatch) and test with your own alert payloads.
What happens when it is wrong? How does the agent surface low-confidence answers? Can you see the evidence it gathered?
Can it handle multi-cloud? If you run on more than one cloud, does it correlate across providers or investigate each in isolation?
Does it learn from past incidents? Does it ingest your existing runbooks and postmortems? How?
What is the remediation model? Suggestions only? PRs with human approval? Direct execution? Where are the guardrails?
Which LLM does it use — and can you change it? LLM cost and quality move quickly. Lock-in is a risk.
Where does your incident data go? Self-hosted, vendor cloud, LLM provider? Read the data flow carefully.

Limitations of AI SREs in 2026

The category is real but not a silver bullet:

Novel failure modes. Agents excel at recognizing patterns similar to past incidents. Genuinely new failures still often require human judgment.
Organizational root causes. "The deploy pipeline does not validate environment variables" is the kind of root cause an AI SRE can surface. "We do not have enough staff to maintain this service" is not.
LLM cost at scale. Complex investigations can consume hundreds of LLM calls. Local inference via Ollama mitigates this but requires GPU infrastructure.
Tool coverage gaps. An AI SRE can only investigate systems it has tools for. Legacy systems, internal tooling, and unusual stacks require custom connectors.
Trust-building takes time. Teams typically start with the agent in "observe" mode, graduate to "suggest," and only later enable autonomous remediation.

The DORA 2025 report is instructive: AI improves throughput but can increase instability in teams without strong platform engineering foundations. AI SRE tools amplify existing practices more than they fix broken ones.

How to Pilot an AI SRE in Your Team

A low-risk pilot follows six steps. Expect it to take four to six weeks end-to-end.

Pick one service and one alert source. Do not try to cover everything at once. Choose a service your team knows well and a monitoring tool you already use.
Deploy the AI SRE in read-only mode. Connect it to alerts, read-only cloud credentials, and your existing observability tools. Do not grant write permissions yet.
Run for two weeks, compare to human RCA. Let the agent investigate every incident that fires. Compare its root cause conclusions to what the on-call engineer eventually determined.
Measure accuracy and time-to-RCA. Two metrics matter: was the agent's root cause correct, and how much faster was it than the human?
Expand scope gradually. Add more services, enable remediation suggestions, then (only after trust is established) approved automated actions for specific low-risk patterns.
Feed historical context. Ingest your existing runbooks and past postmortems into the agent's knowledge base. Agents become dramatically more useful with organizational memory.
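The two pilot metrics are easy to compute once each incident has been reviewed. The sketch below assumes a hypothetical record format in which a human reviewer has marked whether the agent's root cause was correct (`agent_correct`) and both times to RCA are recorded in minutes. None of these field names come from a specific tool.

```python
from statistics import median


def pilot_metrics(incidents: list) -> dict:
    """Summarize accuracy and time saved across reviewed pilot incidents."""
    if not incidents:
        raise ValueError("no incidents to evaluate")
    correct = sum(1 for incident in incidents if incident["agent_correct"])
    minutes_saved = [incident["human_minutes"] - incident["agent_minutes"]
                     for incident in incidents]
    return {
        "accuracy": correct / len(incidents),
        "median_minutes_saved": median(minutes_saved),
    }
```

Using the median rather than the mean keeps one unusually long incident from dominating the result. A negative value means the agent was slower than the on-call engineer.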

Getting Started with Aurora

Aurora is an open-source (Apache 2.0) AI SRE built by Arvo AI. It autonomously investigates incidents across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, integrating with 22+ tools including PagerDuty, Datadog, Grafana, Slack, Bitbucket, and Confluence.

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Aurora works with any LLM provider — OpenAI, Anthropic, Google Gemini, OpenRouter, or local models via Ollama for air-gapped deployments. See the full documentation or the original post on arvoai.ca for more context.

This post was originally published on arvoai.ca.

Opsgenie 2026: Features, Pricing, EOL & Alternatives

Siddharth Singh — Tue, 21 Apr 2026 17:36:17 +0000

TL;DR — Opsgenie is ending. Atlassian stopped new Opsgenie signups on June 4, 2025 and will shut the service down permanently on April 5, 2027. Any data not migrated by that date will be deleted. Atlassian's official migration paths are Jira Service Management (JSM) Operations and Compass. Many teams are using the forced migration as a chance to evaluate alternatives — especially AI-powered options that weren't available when Opsgenie was originally adopted.

Opsgenie is an alerting and on-call management platform that was acquired by Atlassian in 2018. For years it was one of the most widely adopted tools in the SRE stack, sitting alongside PagerDuty and xMatters. In March 2025 Atlassian announced that Opsgenie's capabilities would be absorbed into Jira Service Management and Compass, and that the standalone product would be retired.

This guide covers what Opsgenie is, how it works, what it costs, the exact end-of-life timeline, what happens to your data when it shuts down, the official migration paths, and the current landscape of alternatives. Every claim is linked to an official source.

Last updated: April 21, 2026.

What is Opsgenie?

Opsgenie is a cloud-based incident alerting and on-call management platform for DevOps and SRE teams. It routes alerts from 200+ monitoring tools to the right on-call responders via SMS, voice, email, push, Slack, and Microsoft Teams. Atlassian acquired Opsgenie in 2018 and will retire the standalone product on April 5, 2027.

Opsgenie was founded in 2012; its capabilities are now being absorbed into Jira Service Management and Compass.

Opsgenie at a glance vs top alternatives

Capability	Opsgenie (retiring)	JSM Operations	PagerDuty	Aurora
Available after April 2027	No	Yes	Yes	Yes
Starting price	N/A (closed)	Per-agent	$21/user/mo	Free (OSS)
Built-in AI RCA	No	Partial	Add-on ($699+/mo)	Yes (agentic)
Open source	No	No	No	Apache 2.0
On-call + escalations	Yes	Yes	Yes	Via integration

Opsgenie End-of-Life Timeline (Official)

Atlassian announced the end of Opsgenie in The Evolution of IT Operations. The three critical milestones are:

Milestone	Date	What it means
End of Sale	June 4, 2025	No new signups, upgrades, or downgrades on standalone Opsgenie
End of Support / Shutdown	April 5, 2027	Opsgenie service is turned off; REST APIs stop responding
Data Deletion	April 5, 2027	All unmigrated alerts, schedules, escalation policies, integrations, and incidents are permanently deleted

Existing customers can continue using Opsgenie through April 5, 2027, but cannot expand their footprint. After migration, Opsgenie and the new JSM or Compass instance can run in parallel for up to 120 days, after which Opsgenie is automatically switched off (official source).

"Opsgenie REST APIs will continue to work until April 5, 2027. However, Atlassian recommends updating all API endpoints before Opsgenie is turned off to avoid any disruptions." — Atlassian Support

Opsgenie Features

Opsgenie's core feature set is mature: the product dates back to 2012. Here is what it currently provides, verified from Atlassian's documentation.

Integrations

Opsgenie ships with over 200 integrations with monitoring, ticketing, chat, and ITSM tools. Most are bidirectional — alerts flow in, and acknowledgement or closure events flow back.

Multi-Channel Notifications

Supported notification channels, per Atlassian documentation:

SMS — Aggregated at a minimum 1-minute interval; users can acknowledge or close alerts via reply
Voice calls — Capped at 2 minutes; dial-pad actions (1 = read, 2 = close, 3 = acknowledge, 4 = escalate)
Email — With inline action buttons
Push notifications — iOS and Android with swipe-to-ack/close
Slack — Bidirectional integration
Microsoft Teams — Bidirectional integration

On-Call Management

Opsgenie supports daily, weekly, and custom rotation types including follow-the-sun, with ad-hoc overrides, "Take on-call for an hour" self-service, and a "No-One" participant for scheduled gaps (official docs).

Escalation Policies

Default escalation is 5 minutes, then 10 minutes, repeatable up to 20 times per alert. Acknowledgement or closure stops the policy (official docs).

Heartbeat Monitoring

A "dead man's switch" — if an expected HTTP ping doesn't arrive within the configured interval (minimum 1 minute), Opsgenie fires an alert. Available on Standard and Enterprise plans only (official docs).
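The dead man's switch pattern is generic and worth seeing in code if you need to rebuild it elsewhere. This sketch models the idea only and says nothing about Opsgenie's internals: the `Heartbeat` class is hypothetical, and the one-minute floor simply mirrors the minimum interval mentioned above.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(minutes=1)  # mirrors the minimum interval above


class Heartbeat:
    """Tracks the last ping and reports when it is overdue."""

    def __init__(self, name: str, interval: timedelta, started_at: datetime):
        if interval < MIN_INTERVAL:
            raise ValueError("interval must be at least 1 minute")
        self.name = name
        self.interval = interval
        self.last_ping = started_at

    def ping(self, at: datetime) -> None:
        """Record that the monitored job checked in."""
        self.last_ping = at

    def is_expired(self, now: datetime) -> bool:
        """True when no ping has arrived within the configured interval."""
        return now - self.last_ping > self.interval
```

A scheduler would call `is_expired` periodically and raise an alert the first time it returns True, for example when a nightly backup job stops checking in.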

Alert Deduplication, Suppression, and Grouping

Opsgenie uses an alias field to deduplicate alerts — identical alias values increment a counter on the existing alert instead of creating a new one. The counter stops logging at 100 occurrences, but deduplication continues (official docs).
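Alias-based deduplication is straightforward to reproduce when testing a replacement tool. The following sketch is an approximation of the behavior described above, not Opsgenie's code: the `AlertStore` class and its fields are hypothetical, and the cap of 100 is taken from the paragraph above.

```python
LOGGED_OCCURRENCE_CAP = 100  # counter ceiling described above


class AlertStore:
    """Keeps one open alert per alias and counts repeat occurrences."""

    def __init__(self):
        self.open_alerts = {}

    def ingest(self, alias: str, message: str) -> dict:
        alert = self.open_alerts.get(alias)
        if alert is None:
            # First occurrence: create the alert.
            alert = {"alias": alias, "message": message, "count": 1}
            self.open_alerts[alias] = alert
            return alert
        # Repeat occurrence: never create a second alert, and stop
        # incrementing the counter once it reaches the cap.
        if alert["count"] < LOGGED_OCCURRENCE_CAP:
            alert["count"] += 1
        return alert
```

Sending the same alias 150 times leaves exactly one open alert with a count of 100, which matches the "stops logging but keeps deduplicating" behavior.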

Delay policies can hold notifications for a fixed time, until a deduplication threshold is reached, or until an occurrence rate threshold triggers.

Routing Rules

Each team can have up to 100 routing rules, evaluated top-down with first-match semantics. Free and Essentials plans are limited to 1 routing rule and can only route by priority or tags. Standard and Enterprise plans support full-field routing.
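First-match, top-down evaluation means rule order is part of the configuration. A minimal sketch, assuming a hypothetical `RoutingRule` type with a predicate and a target (the alert fields and target names are illustrative):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutingRule:
    name: str
    matches: Callable[[dict], bool]  # predicate over the alert
    target: str                      # schedule or escalation to notify


def route(alert: dict, rules: list, default_target: str) -> str:
    """Return the target of the first rule that matches, top to bottom."""
    for rule in rules:
        if rule.matches(alert):
            return rule.target
    return default_target
```

If a P1 rule sits above a database-tag rule, a P1 database alert goes to the P1 target and the second rule is never consulted. When porting rules to another tool, keep the original order or the routing will change silently.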

Reporting by Plan

Report	Essentials	Standard	Enterprise
Notifications + API Usage	Yes	Yes	Yes
Monthly Overview (Looker)	No	Yes	Yes
Advanced reporting / MTTA / MTTR	No	Yes	Yes
Team Reports	No	No	Yes
Global Reports + Looker dashboards	No	No	Yes
Post-Incident Analysis	No	No	Yes

Source: Opsgenie Advanced Reporting.

Mobile App

Opsgenie's iOS and Android apps support swipe-to-acknowledge from the lock screen and iOS Critical Alerts that override Do Not Disturb and silent mode.

SSO / SAML

SSO is available on Standard and Enterprise plans only, with supported providers including Google, Azure AD, Okta, OneLogin, Ping Identity, and Microsoft AD FS (official docs).

Compliance

Opsgenie is covered under Atlassian's Trust program with SOC 2 Type II (annual), ISO/IEC 27001, ISO/IEC 27018, CSA, and TISAX AL2 certifications, plus a pre-signed GDPR DPA (official page).

Data Residency

Opsgenie is offered in US and EU regions, both hosted on AWS (official docs).

Who Should Use Opsgenie in 2026?

With end-of-sale already behind us, Opsgenie is only relevant to existing subscribers planning their exit. New teams cannot sign up. The question for existing subscribers is whether to stay with Atlassian (migrate to JSM or Compass) or evaluate alternatives.

Stay with Atlassian (migrate to JSM Operations) if you are already a Jira Service Management customer, need ITSM workflows (change, problem, incident), and are comfortable with the Premium-tier price increase.
Stay with Atlassian (migrate to Compass) if you are a DevOps or SRE team that wants alerting paired with a software component catalog and service ownership model, not ITSM.
Switch to a dedicated alerting tool (PagerDuty, ilert, Squadcast) if you want deeper alerting features and do not need Atlassian platform integration.
Switch to AI-powered incident management (incident.io, Rootly, Aurora) if you want autonomous investigation and root cause analysis, not just alert routing.

Opsgenie Pricing (Standalone, 100-User Reference)

Pricing below is for standalone Opsgenie with 100 users — sourced from the official Opsgenie pricing page. New signups are closed, so these numbers apply only to existing customers on legacy plans.

Plan	Monthly	Annual	Routing Rules	Heartbeats	SSO
Free	$0 (up to 5 users)	—	1	No	No
Essentials	$11.55/user/mo	$9.45/user/mo	1	No	No
Standard	~ $29/user/mo	Discounted	100 per team	Yes	Yes
Enterprise	~ $39/user/mo	Discounted	100 per team	Yes	Yes

Enterprise-exclusive features include Incident Command Center (built-in video chatroom tied to incidents), Stakeholders (notification-only users), Service Subscriptions, Incident Templates, and Post-Incident Analysis.

Incoming call routing is charged separately: $0.10 per minute for US/Canada and $0.35 per minute internationally after the free tier.
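For budgeting, the per-user prices translate into yearly totals as follows. This is a back-of-the-envelope sketch: it assumes the annual rate is a per-user monthly price billed for twelve months, and it ignores the call-routing free tier because its size is not stated here. The function names are illustrative.

```python
from decimal import Decimal

ESSENTIALS_MONTHLY = Decimal("11.55")  # per user per month, billed monthly
ESSENTIALS_ANNUAL = Decimal("9.45")    # per user per month, billed annually
CALL_RATE_US_CA = Decimal("0.10")      # per minute
CALL_RATE_INTL = Decimal("0.35")       # per minute


def yearly_cost(users: int, per_user_per_month: Decimal) -> Decimal:
    """Total yearly spend for a given seat count and per-user monthly price."""
    return per_user_per_month * users * 12


def call_routing_cost(minutes: int, international: bool = False) -> Decimal:
    """Incoming call routing charge, ignoring any free tier (assumption)."""
    rate = CALL_RATE_INTL if international else CALL_RATE_US_CA
    return rate * minutes
```

At 100 users, Essentials works out to $13,860 per year on monthly billing and $11,340 per year on annual billing, a difference of $2,520. Five hundred US/Canada call minutes would add $50.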

What Happens When Opsgenie Is Turned Off

On April 5, 2027, Atlassian will:

Disable the Opsgenie web application, mobile apps, and REST APIs
Delete all data that was not migrated to JSM or Compass — alerts, on-call schedules, escalation policies, integrations, incidents, notes, attachments
Stop accepting any incoming webhooks or notifications

Important: unlike the legacy Opsgenie Enterprise plan, JSM automatically deletes alert data after a retention window. Once alert data is deleted in JSM, it cannot be recovered. Export anything you need for compliance or audit before migration (official source).

Opsgenie Migration Paths: JSM vs Compass

Atlassian offers two official migration destinations. Both share the same underlying Operations engine, so schedules, alerts, and policies sync bidirectionally, but the surrounding product and pricing differ (see Atlassian's guidance on managing operations across both).

Jira Service Management (JSM) Operations

JSM Operations is the ITSM-centric path — alerts are paired with change, problem, and incident workflows. JSM pricing (official page):

JSM Plan	Price	Outbound Webhooks	Incident Command Center	Post-Incident Reviews	99.9% SLA
Free	$0 (up to 3 agents)	No	No	No	No
Standard	Per-agent	No	No	No	No
Premium	Per-agent	Yes	No	Yes	Yes
Enterprise	Contact sales	Yes	Yes	Yes	99.95%

Opsgenie features that do not carry over to JSM Operations, per Atlassian's shifting guide:

Incoming Call Routing integration is not supported
Stakeholder role and custom roles do not carry over; custom Opsgenie roles default to User
Alert creation rules from Opsgenie do not migrate
Legacy api.opsgenie.com/v1/services endpoint stops working
Chat integrations must be reconnected manually
The old Opsgenie mobile app stops working — responders switch to the Jira mobile app

Compass

Compass is positioned as a software component catalog + alerting platform aimed at DevOps, SRE, and Platform Engineering teams rather than ITSM. Compass pricing (official page):

Compass Plan	Price	Alerting	Heartbeats	99.9% SLA
Free	$0 (up to 3 full users)	Basic	No	No
Standard	$8/user/mo	Yes (150+ integrations)	No	No
Premium	$25/user/mo	Advanced	Yes	Yes

Migration Friction

Real complaints from the Atlassian Community:

Price increases — JSM Premium is widely reported as more expensive than standalone Opsgenie Standard
Feature parity gaps — some users need JSM and Compass together to match Opsgenie's alert processing depth
120-day forced cutover — Opsgenie auto-shuts-down 120 days after migration begins; Atlassian has declined requests to extend the window
Split paths confusion — some features only exist in JSM, others only in Compass, forcing customers to choose or buy both

One user put it bluntly: "Switching to Compass seems like buying a new car just to listen to the radio."

Why Teams Are Evaluating Alternatives Instead of Migrating

The forced migration has created a rare evaluation moment. Teams that adopted Opsgenie in 2018 are re-evaluating the entire category with three shifts in mind:

AI-native incident management has arrived. Products like Aurora, incident.io AI SRE, Rootly AI, and PagerDuty Advance didn't exist when most Opsgenie contracts were signed. Per Gartner (October 2025), 54% of I&O leaders are now adopting AI in operations.
On-call burnout is a hiring and retention problem. The Catchpoint SRE Report 2025 found that roughly 70% of SREs cite on-call stress as a direct cause of burnout, and toil rose to 30% of SRE work.
Downtime costs have climbed. PagerDuty's 2024 research put the average cost of a major incident at $794,000, or $4,537 per minute. ITIC's 2024 survey found 97% of large enterprises say an hour of downtime costs them over $100,000.

Against this backdrop, "like-for-like Opsgenie replacement" is no longer the only question — many teams are asking whether the replacement should also do autonomous investigation, not just alerting.

"By 2030, 75% of IT work will be human plus AI, 25% will be AI-only, and zero percent will be human-only." — Gartner CIO survey of 700+ CIOs, 2025

Top Opsgenie Alternatives in 2026

Verified pricing and capabilities from each vendor's official site. Last checked April 2026.

Product	Starting price	Free plan	Open source	AI-native	Best for
Aurora by Arvo AI	$0 self-hosted	Yes (OSS)	Apache 2.0	Yes (agentic)	OSS teams wanting alerting + autonomous RCA in one stack
PagerDuty	$21/user/mo	14-day trial	No	Yes (PagerDuty Advance, $415+/mo)	Enterprises wanting the incumbent with AI add-ons
ilert	Up to €49/user/mo Scale	Yes (5 responders)	Partial (MCP server)	Yes	EU-based teams requiring GDPR data residency
Squadcast	$9/user/mo Pro	Yes (5 users)	No	Yes	Small SRE teams on tight budgets
Rootly OnCall	From $20/user/mo	Trial	Partial (MCP, Agents JSON)	Yes (AI SRE standalone)	Teams wanting modular IR + on-call + AI SRE
incident.io On-call	$19 base + $10 add-on	Trial	No	Yes (AI SRE)	Slack-native incident coordination with AI
FireHydrant Signals	Usage-based	Trial	No	Yes (AI Copilot)	Teams preferring pay-per-alert over per-seat
xMatters	$39/user/mo base	Yes (10 users)	No	Partial	Everbridge customers needing codeless workflows
Grafana OnCall OSS	Free	Yes	AGPLv3 (archived)	No	Not recommended — archived March 24, 2026

Product Notes

PagerDuty — Most mature alerting product. PagerDuty Advance adds AI agents (SRE, Scribe, Shift) but requires a paid base plan and a separate $415+/mo Advance subscription. AIOps features require a $699+/mo add-on.

ilert — EU-hosted with a clear GDPR and data-sovereignty story; the AI SRE opts out of LLM training on customer data. Free tier includes 5 responders.

Squadcast — Acquired by SolarWinds on March 3, 2025. Roadmap now driven by SolarWinds.

Rootly — Rootly AI Labs launched February 20, 2026; Rootly MCP GA April 2, 2026. Rootly sells IR, On-Call, and AI SRE as standalone products.

incident.io — $62M Series B funded the launch of AI SRE — an always-on agent that investigates alerts, drafts PRs, and can autoresolve incidents.

FireHydrant — Acquisition by Freshworks expected to close Q1 2026; FireHydrant will become the incident layer inside Freshservice.

Grafana OnCall — Entered maintenance mode March 11, 2025 and archived March 24, 2026. Do not start new deployments. Grafana is consolidating on a unified Cloud IRM app.

Splunk On-Call (VictorOps) — Pricing not publicly listed. Cisco completed its $28B Splunk acquisition in March 2024; no official EOL announcement as of April 2026, but the product has seen minimal public investment since.

How Aurora Integrates with Opsgenie and JSM Operations

Aurora is an open-source agentic incident management tool that works alongside Opsgenie (and its JSM Operations successor). Most AI incident tools have already deprecated Opsgenie support ahead of the 2027 shutdown. Aurora supports both, so teams can run their migration on their own timeline. The integration is fully documented in Aurora's docs.

What Aurora does with Opsgenie alerts:

Dual authentication: accepts either a native Opsgenie GenieKey (US or EU region) or a JSM Operations Atlassian API token. Credentials are stored encrypted in HashiCorp Vault.
Webhook ingestion — Receives Create, Acknowledge, Close, and custom alert actions. Only Create triggers an investigation, preventing duplicates from acknowledgement webhooks.
Alert correlation — Aurora's AlertCorrelator groups incoming alerts with existing incidents by service, title, and time proximity. Correlated alerts attach to the parent incident instead of spawning a new one.
Priority mapping — Opsgenie priorities map deterministically: P1 → critical, P2 → high, P3 → medium, P4/P5 → low.
Service extraction — Aurora reads alerts for a service:xxx tag first, then falls back to the source and entity fields.
Autonomous RCA — On alert creation, Aurora creates an incident record, generates an AI summary, and launches a LangGraph-orchestrated agent that queries your cloud infrastructure to find the root cause.
Bidirectional JSM commenting — For JSM Operations users, Aurora posts an "RCA in progress" comment back onto the linked Jira incident and updates it with findings.
Chatbot query surface — Engineers can ask Aurora in natural language: "Who is on-call right now?", "Show me P1 alerts from the last 24 hours", "Get details for alert ABC-123". Aurora queries 8 Opsgenie resource types (alerts, alert details, incidents, incident details, services, on-call, schedules, teams) via parallel API calls.
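Three of the behaviors in the list above (Create-only triggering, priority mapping, and service extraction) are simple enough to sketch. This is not Aurora's source code. The payload shape (`action`, `alert.priority`, `alert.tags`, `alert.source`, `alert.entity`, `alert.message`) is an assumption about the webhook body, and the "unknown" fallback for unmapped priorities is an illustrative choice.

```python
from typing import Optional

# Mapping stated in the list above.
PRIORITY_TO_SEVERITY = {
    "P1": "critical",
    "P2": "high",
    "P3": "medium",
    "P4": "low",
    "P5": "low",
}


def extract_service(alert: dict) -> Optional[str]:
    """Prefer a service:xxx tag, then fall back to source, then entity."""
    for tag in alert.get("tags") or []:
        if tag.startswith("service:"):
            value = tag[len("service:"):].strip()
            if value:
                return value
    return alert.get("source") or alert.get("entity") or None


def handle_webhook(payload: dict) -> Optional[dict]:
    """Return an incident stub for Create actions; ignore everything else."""
    if payload.get("action") != "Create":
        return None  # Acknowledge/Close must not start a new investigation
    alert = payload.get("alert") or {}
    return {
        "title": alert.get("message"),
        # "unknown" for unmapped priorities is an assumption of this sketch.
        "severity": PRIORITY_TO_SEVERITY.get(alert.get("priority"), "unknown"),
        "service": extract_service(alert),
    }
```

Filtering on the action before doing any other work is what keeps an acknowledgement webhook from spawning a duplicate investigation.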

"Most AI investigation tools only work with PagerDuty. We built Aurora to meet SRE teams where they already live — including Opsgenie and JSM — so AI-powered RCA isn't gated on migrating your alerting stack first." — Noah Casarotto-Dinning, CEO at Arvo AI

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init && make prod-prebuilt

How to Migrate Off Opsgenie Before April 5, 2027

Prerequisites: administrator access to your Opsgenie account, access to your monitoring stack, and a target destination decided (JSM Operations, Compass, or a third-party alternative).

If You Are Staying with Atlassian

Inventory your Opsgenie config. Document integrations, escalation policies, routing rules, heartbeats, on-call schedules, and custom roles.
Choose JSM Operations vs Compass. Pick JSM if you need ITSM workflows (change, problem, incident); pick Compass if you want alerting tied to a service catalog.
Verify feature parity. Review the Atlassian shifting guide for features that do not migrate.
Export historical data. Alert data in JSM auto-deletes after a retention window — export anything needed for audit or compliance first.
Run the in-product migration tool. Atlassian provides a guided migration that copies your data to JSM or Compass.
Re-authenticate chat integrations. Re-authorize Slack and Microsoft Teams — OAuth grants do not transfer.
Update API endpoints. Every consumer of the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.
Replan the mobile rollout. The standalone Opsgenie mobile app stops working — responders move to the Jira mobile app.
Close Opsgenie within 120 days. After migration, Opsgenie runs in parallel for up to 120 days, then auto-shuts down.
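The 120-day parallel window and the hard shutdown date interact, so it helps to compute the effective switch-off date for a planned migration. A small sketch, assuming the window starts on the migration date and can never extend past April 5, 2027 (the function names are illustrative):

```python
from datetime import date, timedelta

SHUTDOWN_DATE = date(2027, 4, 5)        # Opsgenie end of support
PARALLEL_WINDOW = timedelta(days=120)   # maximum parallel-run period


def opsgenie_off_date(migration_date: date) -> date:
    """Date Opsgenie switches off for a migration performed on `migration_date`."""
    return min(migration_date + PARALLEL_WINDOW, SHUTDOWN_DATE)


def latest_start_for_full_window() -> date:
    """Last migration date that still gets the full 120-day parallel run."""
    return SHUTDOWN_DATE - PARALLEL_WINDOW
```

Under these assumptions, a migration on June 1, 2026 ends its parallel run on September 29, 2026, and any migration after December 6, 2026 gets less than the full 120 days.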

If You Are Evaluating Alternatives

Shortlist two or three alternatives using the comparison table above.
Run a 90-day parallel trial alongside Opsgenie — most vendors offer free trials.
Validate the integrations that matter — especially monitoring tool webhooks and your chat platform.
Measure MTTR and on-call satisfaction against your Opsgenie baseline.
Decide before Atlassian's 120-day cutover window closes on any migration you start with JSM or Compass.

Frequently Asked Questions

When is Opsgenie being shut down?
Atlassian will shut down Opsgenie permanently on April 5, 2027. End of sale was June 4, 2025 — no new signups, upgrades, or downgrades are allowed. On April 5, 2027 the service will be disabled and any data that has not been migrated to Jira Service Management or Compass will be permanently deleted.

Can I still buy Opsgenie in 2026?
No. Atlassian closed new Opsgenie sales on June 4, 2025. Existing customers can continue using their current Opsgenie subscription until April 5, 2027 but cannot upgrade, downgrade, or add new users beyond their existing plan limits.

What are the official Opsgenie migration paths?
Atlassian offers two paths: Jira Service Management (JSM) Operations for ITSM teams needing change, problem, and incident workflows, and Compass for DevOps/SRE teams wanting alerting paired with a service catalog. Both share the same Operations engine, so schedules, alerts, and policies sync if you use both.

Will my Opsgenie data be preserved after migration?
Only data you explicitly migrate through Atlassian's in-product migration tool is preserved. Unlike legacy Opsgenie Enterprise, JSM automatically deletes alert data after a retention window — so you must export anything needed for compliance or audit before migration. Some features like alert creation rules and custom roles do not carry over at all.

How much does Opsgenie cost in 2026?
Existing standalone customers pay $9.45/user/month annual or $11.55/user/month monthly on Essentials at 100 users. Standard and Enterprise add full routing, SSO, heartbeats, and advanced reporting. Incoming call routing is billed separately at $0.10/minute (US/Canada) and $0.35/minute (international). New signups are no longer accepted.

What are the best Opsgenie alternatives?
The strongest 2026 alternatives are PagerDuty (incumbent with AI add-ons), incident.io (Slack-native with AI SRE), ilert (EU-hosted, GDPR-focused), Squadcast (budget-friendly, SolarWinds-owned), Rootly (modular IR + on-call + AI SRE), and Aurora by Arvo AI (open-source agentic RCA with Opsgenie and JSM support). Grafana OnCall OSS was archived in March 2026.

Does Opsgenie support AI-powered root cause analysis?
Standalone Opsgenie is an alerting and on-call product — it does not perform root cause analysis. Atlassian is adding AIOps features (alert grouping, automated resolutions) to JSM and Compass. Teams wanting autonomous multi-step RCA typically pair Opsgenie with a dedicated tool like Aurora, which ingests Opsgenie webhooks and investigates incidents automatically.

What happens to my Opsgenie integrations after migration?
Monitoring integrations (Datadog, New Relic, Prometheus) migrate automatically via Atlassian's in-product tool. Chat integrations (Slack, Microsoft Teams) must be re-authorized manually because the OAuth grants do not transfer. Custom webhooks calling the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.

Can Aurora connect to Opsgenie and JSM?
Yes. Aurora supports both standalone Opsgenie (GenieKey authentication, US and EU regions) and JSM Operations (Atlassian API token). Aurora ingests alert webhooks, runs AI-powered alert correlation to group related alerts into incidents, and autonomously investigates the root cause. For JSM users, Aurora posts findings back as comments on the linked Jira incident.

Is Jira Service Management cheaper than Opsgenie?
No. JSM Premium is widely reported by Atlassian Community users as more expensive than standalone Opsgenie Standard. Real-time outbound webhooks require JSM Premium, and Incident Command Center requires JSM Enterprise. Many Opsgenie customers see a net price increase after migration, which is why teams use the forced migration to evaluate alternatives.

All Opsgenie, JSM, Compass, and alternative-vendor claims verified from official sources in April 2026. Last updated: April 21, 2026.

Originally published on arvoai.ca/blog.

By Team at Arvo AI.