<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siddharth Singh</title>
    <description>The latest articles on DEV Community by Siddharth Singh (@siddharth_singh_409bd5267).</description>
    <link>https://dev.to/siddharth_singh_409bd5267</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836164%2Fed12b658-4232-401b-be5c-924bb828c22f.png</url>
      <title>DEV Community: Siddharth Singh</title>
      <link>https://dev.to/siddharth_singh_409bd5267</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siddharth_singh_409bd5267"/>
    <language>en</language>
    <item>
      <title>AI SRE vs AIOps in 2026: Definitions, Differences, and How to Choose</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 28 May 2026 22:55:26 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-sre-vs-aiops-in-2026-definitions-differences-and-how-to-choose-565g</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-sre-vs-aiops-in-2026-definitions-differences-and-how-to-choose-565g</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AIOps and AI SRE are not interchangeable terms.&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/AIOps" rel="noopener noreferrer"&gt;Gartner coined "AIOps" in 2016&lt;/a&gt; and defines an AIOps platform as one that "combines big data and machine learning functionality to support all primary IT operations functions through the scalable ingestion and analysis of the ever-increasing volume, variety and velocity of data generated by IT" (&lt;a href="https://www.gartner.com/en/information-technology/glossary/aiops-platform" rel="noopener noreferrer"&gt;Gartner IT glossary&lt;/a&gt;). "AI SRE" is a 2024-to-2026 category for multi-step LLM agents that investigate incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The technical separation is clean.&lt;/strong&gt; AIOps platforms cluster alerts and detect anomalies using statistical machine learning. An AI SRE runs a large-language-model agent that calls tools (&lt;code&gt;kubectl&lt;/code&gt;, cloud SDKs, log queries) to gather new evidence during an incident. See our &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;definition of an AI SRE&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps does noise reduction; an AI SRE does investigation.&lt;/strong&gt; Classic AIOps vendors include &lt;a href="https://www.bigpanda.io/" rel="noopener noreferrer"&gt;BigPanda&lt;/a&gt;, Moogsoft (&lt;a href="https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2023~07~dell-technologies-announces-intent-to-acquire-moogsoft.htm" rel="noopener noreferrer"&gt;acquired by Dell in 2023 per Dell's announcement&lt;/a&gt;), and Dynatrace Davis. AI SRE entrants include &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;), &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;), &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://resolve.ai/" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt;, and &lt;a href="https://www.traversal.com/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The two categories are complementary.&lt;/strong&gt; AIOps handles the pre-alert stage (correlation, deduplication, noise reduction). An AI SRE handles the post-alert stage (evidence gathering, root-cause analysis, remediation drafting). Most 2026 SRE teams will end up running both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buyer signal in 2025 to 2026 has shifted toward AI SRE.&lt;/strong&gt; &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai confirmed a $125M Series A at a $1B valuation in February 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal raised $48M in June 2025 led by Sequoia and Kleiner Perkins&lt;/a&gt;. Datadog's &lt;a href="https://www.datadoghq.com/about/latest-news/press-releases/datadog-launches-bits-ai-sre-agent-to-resolve-incidents-faster/" rel="noopener noreferrer"&gt;Bits AI SRE went generally available on 2 December 2025&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AIOps and AI SRE labels are confused because both compress to "AI for ops" and both pitch reliability outcomes. The categories were named years apart, built on different technical foundations, and address different stages of the incident lifecycle. This guide draws the line, cell by cell, with every claim cited to a primary source.&lt;/p&gt;

&lt;p&gt;For the standalone definition of an AI SRE, see our &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;What is an AI SRE? glossary entry&lt;/a&gt;. For the procurement and adoption arc, see the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE Complete Guide&lt;/a&gt;. The framework introduced below is what we use internally; we call it the &lt;strong&gt;Four-Axis AIOps vs AI SRE Matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AIOps? A 2016 Gartner category
&lt;/h2&gt;

&lt;p&gt;The term "AIOps" was first published by Gartner in 2016 (&lt;a href="https://en.wikipedia.org/wiki/AIOps" rel="noopener noreferrer"&gt;Wikipedia: AIOps&lt;/a&gt;). Gartner's own glossary defines an AIOps platform as one that "combines big data and machine learning functionality to support all primary IT operations functions through the scalable ingestion and analysis of the ever-increasing volume, variety and velocity of data generated by IT. The platform enables the concurrent use of multiple data sources, data collection methods, and analytical and presentation technologies" (&lt;a href="https://www.gartner.com/en/information-technology/glossary/aiops-platform" rel="noopener noreferrer"&gt;Gartner IT glossary, AIOps platform&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Three things about that definition are load-bearing in 2026.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It predates the LLM era.&lt;/strong&gt; ChatGPT was released in November 2022. Gartner's AIOps definition is six years older. The "AI" in AIOps refers to classical machine learning techniques (anomaly detection, time-series forecasting, clustering, correlation rules), not the multi-step language-model agents that emerged after 2023.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is platform-shaped.&lt;/strong&gt; Gartner's definition describes a data platform that ingests telemetry and produces insight. It is not an agent that takes actions; it is an analytical layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Its core job is noise reduction.&lt;/strong&gt; The category was created to address the alert-storm problem: thousands of alerts firing per day from disparate monitoring tools, with no automated way to group them. Classic AIOps tools cluster these alerts so an on-call human sees ten meaningful incidents instead of a thousand symptoms.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Representative AIOps vendors include &lt;a href="https://www.bigpanda.io/" rel="noopener noreferrer"&gt;BigPanda&lt;/a&gt; (founded 2012), Moogsoft (&lt;a href="https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2023~07~dell-technologies-announces-intent-to-acquire-moogsoft.htm" rel="noopener noreferrer"&gt;acquired by Dell, announced July 2023&lt;/a&gt;), Dynatrace with its Davis AI engine, &lt;a href="https://sciencelogic.com/" rel="noopener noreferrer"&gt;ScienceLogic&lt;/a&gt;, and PagerDuty's Intelligent Alert Grouping. PagerDuty's &lt;a href="https://www.pagerduty.com/resources/aiops/learn/what-is-aiops/" rel="noopener noreferrer"&gt;own glossary page on AIOps&lt;/a&gt; frames the use cases as event correlation, anomaly detection, and noise reduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI SRE? A 2024-to-2026 LLM-agent category
&lt;/h2&gt;

&lt;p&gt;The "AI SRE" term emerged in vendor marketing through 2024 and consolidated in 2025 to 2026 as a recognisable category. An AI SRE is a multi-step large-language-model agent that investigates production incidents on behalf of a site reliability engineer. The defining capability is &lt;strong&gt;tool-calling investigation&lt;/strong&gt;: the agent runs an iterative reasoning loop (&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt;-style, function-calling, or graph-based) where each step uses prior evidence to decide the next tool call. We cover the five capabilities that define a credible AI SRE in our &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;What is an AI SRE? glossary entry&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The category's investor signal is concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai confirmed a $125M Series A at a $1B valuation in February 2026&lt;/a&gt;, with an &lt;a href="https://www.prnewswire.com/news-releases/resolve-ai-announces-series-a-extension-at-a-1-5b-valuation-and-launches-resolve-ai-labs-to-advance-ai-systems-for-complex-production-environments-302743888.html" rel="noopener noreferrer"&gt;extension at a $1.5B valuation in April 2026&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal emerged from stealth in June 2025 with $48M led by Sequoia and Kleiner Perkins&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Datadog's &lt;a href="https://www.datadoghq.com/about/latest-news/press-releases/datadog-launches-bits-ai-sre-agent-to-resolve-incidents-faster/" rel="noopener noreferrer"&gt;Bits AI SRE became generally available on 2 December 2025&lt;/a&gt;, with a &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre-deeper-reasoning/" rel="noopener noreferrer"&gt;March 2026 update Datadog describes as completing investigations "about 2 times faster than before"&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;PagerDuty has shipped the &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty SRE Agent&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open-source projects shape the lower end of the category. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;) and &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;) sit alongside &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (multi-cloud, sandboxed execution). See our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;open-source three-way comparison&lt;/a&gt; for the per-project details.&lt;/p&gt;

&lt;h2&gt;
  
  
  AIOps vs AI SRE: the Four-Axis Matrix
&lt;/h2&gt;

&lt;p&gt;The matrix below resolves most procurement debates. Each row is a separate axis; the two categories almost never overlap on the same cell.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;AIOps platform&lt;/th&gt;
&lt;th&gt;AI SRE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Origin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/AIOps" rel="noopener noreferrer"&gt;Gartner, 2016&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Vendor marketing, 2024 to 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary technique&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Statistical ML: clustering, anomaly detection, correlation rules&lt;/td&gt;
&lt;td&gt;LLM tool-calling agents (ReAct loops, function calling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Triggered by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw telemetry stream (metrics, logs, events at firehose volume)&lt;/td&gt;
&lt;td&gt;A specific alert or incident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clustered alerts, noise-reduced event stream, anomaly score&lt;/td&gt;
&lt;td&gt;A reasoned root-cause analysis with an evidence chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lifecycle stage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-alert: from telemetry to incident&lt;/td&gt;
&lt;td&gt;Post-alert: from incident to root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Misclusters or misses anomalies (false negatives)&lt;/td&gt;
&lt;td&gt;Hallucinates a plausible-but-wrong root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Representative vendors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BigPanda, Moogsoft, Dynatrace Davis, ScienceLogic, PagerDuty Intelligent Alert Grouping&lt;/td&gt;
&lt;td&gt;HolmesGPT, K8sGPT, Aurora, Resolve.ai, Traversal, Bits AI SRE, PagerDuty SRE Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it replaces in the team&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human alert triage&lt;/td&gt;
&lt;td&gt;First-pass incident investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two of the eight axes deserve separate treatment because they are most often misread by buyers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axis 2: technique difference, in detail
&lt;/h3&gt;

&lt;p&gt;Classical AIOps relies on statistical machine-learning techniques that were mature well before 2020. A typical AIOps pipeline ingests metrics, applies time-series anomaly detection (Holt-Winters, ARIMA, isolation forests), and correlates anomalies across services using clustering on temporal proximity, topology proximity, or symbolic patterns. The pipeline is &lt;strong&gt;trained, not prompted&lt;/strong&gt;. It outputs a probability score and a group label; it does not "decide" anything.&lt;/p&gt;
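&lt;p&gt;The two stages described above can be sketched in a few lines. This is a minimal illustration, not any vendor's pipeline: a rolling z-score stands in for the heavier detectors named above (Holt-Winters, ARIMA, isolation forests), and temporal proximity is the only clustering signal. The function names, the &lt;code&gt;ts&lt;/code&gt; field, and the 60-second gap are assumptions made for the example.&lt;/p&gt;

```python
from statistics import mean, stdev


def anomaly_scores(series, window=5):
    """Rolling z-score: distance of each point from the mean of the preceding window."""
    scores = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        spread = stdev(history) or 1e-9  # avoid division by zero on a flat window
        scores.append((i, abs(series[i] - mean(history)) / spread))
    return scores


def cluster_by_time(alerts, max_gap_s=60):
    """Group alerts whose timestamp is within max_gap_s of the previous alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        starts_new_group = not groups or alert["ts"] - groups[-1][-1]["ts"] > max_gap_s
        if starts_new_group:
            groups.append([alert])
        else:
            groups[-1].append(alert)
    return groups
```

&lt;p&gt;Note what the sketch returns: a score per point and a group per burst of alerts. Nothing in it chooses a next step, which is the "trained, not prompted" property in miniature.&lt;/p&gt;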

&lt;p&gt;An AI SRE is built around an LLM that consumes a small amount of context and chooses the next tool to call. The agent does not need to be retrained for a new failure mode; it inspects the failure mode at runtime by reading logs, fetching pod state, or querying a database. This is why products in the category typically build on models from frontier providers (Anthropic, OpenAI, Google) and are sensitive to model quality in a way that classical AIOps is not.&lt;/p&gt;
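&lt;p&gt;A rough sketch of that runtime loop follows. Everything in it is hypothetical: the two tools return canned data instead of calling &lt;code&gt;kubectl&lt;/code&gt; or a log backend, and &lt;code&gt;choose_next_tool&lt;/code&gt; is a hand-written stand-in for the LLM call that a real agent would make at each step. The point is only the shape of the loop, where prior evidence decides the next tool call.&lt;/p&gt;

```python
def get_pod_status(_evidence):
    # Hypothetical tool: a real agent might shell out to kubectl here.
    return {"pod_status": "CrashLoopBackOff"}


def get_pod_logs(_evidence):
    # Hypothetical tool: canned log line instead of a real log query.
    return {"log_line": "OOMKilled: memory limit exceeded"}


TOOLS = {"get_pod_status": get_pod_status, "get_pod_logs": get_pod_logs}


def choose_next_tool(evidence):
    """Stand-in for the LLM call: pick the next tool from the evidence so far."""
    if "pod_status" not in evidence:
        return "get_pod_status"
    if evidence["pod_status"] == "CrashLoopBackOff" and "log_line" not in evidence:
        return "get_pod_logs"
    return None  # enough evidence gathered


def investigate(alert, max_steps=5):
    """Iterate: decide on a tool, run it, fold the result into the evidence."""
    evidence = {"alert": alert}
    trail = []
    for _ in range(max_steps):
        tool_name = choose_next_tool(evidence)
        if tool_name is None:
            break
        evidence.update(TOOLS[tool_name](evidence))
        trail.append(tool_name)
    return evidence, trail
```

&lt;p&gt;The &lt;code&gt;max_steps&lt;/code&gt; cap is a design choice worth keeping in any real version, since it bounds inference cost when the agent cannot converge.&lt;/p&gt;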

&lt;h3&gt;
  
  
  Axis 5: lifecycle-stage difference, in detail
&lt;/h3&gt;

&lt;p&gt;AIOps lives before the alert lands on a human. Its job is to convert ten thousand metric points and a thousand raw events into a tractable list of "things that look like incidents." Once a human (or downstream system) accepts that an incident exists, AIOps has done its work.&lt;/p&gt;

&lt;p&gt;An AI SRE picks up at that handoff. Its job is to take "an incident exists" and resolve it into "here is the most likely root cause and the evidence that supports it." The agent does not need to discover the incident; it needs to investigate it.&lt;/p&gt;

&lt;p&gt;This is why a team that buys an AI SRE without an upstream noise-reduction layer often suffers: the agent gets paged on every false positive, which burns LLM inference cost and dilutes the trust signal. Conversely, a team that buys AIOps without an investigation layer pages a human on every clustered incident, which leaves the time-back opportunity on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does AIOps still win?
&lt;/h2&gt;

&lt;p&gt;AIOps has not been retired by AI SRE. Three jobs remain firmly in the AIOps lane in 2026.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Carrier-scale event correlation.&lt;/strong&gt; A telco core network or a national observability tier producing millions of events per minute is the wrong shape for an LLM agent to inspect end-to-end. Statistical correlation on this firehose, with rule overlays for known patterns, remains the production-grade approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert deduplication and routing.&lt;/strong&gt; AIOps platforms dedupe alerts across overlapping monitoring tools and route them to the right on-call rotation. This is plumbing-grade work that does not need an LLM and should not be delegated to one on cost grounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon trend analysis on numeric telemetry.&lt;/strong&gt; Forecasting capacity, modelling seasonal traffic patterns, and detecting drift in metrics are still better served by classical time-series methods than by language models.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where does AI SRE win?
&lt;/h2&gt;

&lt;p&gt;The AI SRE category dominates four jobs that AIOps platforms either cannot do or do poorly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First-pass investigation on a single incident.&lt;/strong&gt; The agent fetches pod logs, traces, recent deploys, and ticket history, then assembles the evidence chain a human SRE would have built manually. Datadog's Bits AI SRE product page quotes iFood SRE Rafael Bento: &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;"From day one, Bits AI SRE started cutting our MTTR by 70%"&lt;/a&gt;, and frames the category outcome on the same page as helping teams "restore services 90% faster." Traversal's &lt;a href="https://www.traversal.com/blog/american-express-announcement" rel="noopener noreferrer"&gt;American Express announcement&lt;/a&gt; reports an "82% root cause analysis accuracy rate" and a "32% reduction in potential mean time to resolution (MTTR)" within six months of deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-system reasoning during an incident.&lt;/strong&gt; A human SRE who needs to correlate Kubernetes events, an RDS slow-query log, a recent deploy in GitHub, and a Confluence runbook is switching between at least four tabs. An AI SRE does the same correlation in a single context window. This is where the time-back curve bends hardest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafting structured artefacts.&lt;/strong&gt; Postmortems, evidence chains, and remediation suggestions land as Markdown the team can edit, not as a chat transcript. See our &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;automated post-mortem guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gapped and self-hosted deployment.&lt;/strong&gt; Open-source AI SRE projects support local LLMs through Ollama, vLLM, or LocalAI. Most classical AIOps platforms are SaaS-only. For regulated buyers, the deployment story alone shifts spend toward AI SRE.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Do you need both AIOps and an AI SRE?
&lt;/h2&gt;

&lt;p&gt;In 2026, most enterprise SRE teams will end up running both. The functional split is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AIOps below the alert line.&lt;/strong&gt; Ingest the firehose, correlate, dedupe, route. The team should never see a thousand raw events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI SRE above the alert line.&lt;/strong&gt; Investigate each incident the AIOps layer surfaces. Produce the evidence chain a human signs off on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Smaller and AI-native teams often skip the AIOps layer and connect the AI SRE directly to monitoring webhooks (PagerDuty, Datadog, Grafana) on the assumption that the alert hygiene is already acceptable. This is a reasonable starting position for teams under ~50 services, but it breaks down at larger event volumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you choose between an AI SRE and an AIOps platform?
&lt;/h2&gt;

&lt;p&gt;The decision tree is shorter than the matrix suggests.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Is the bottleneck noise or investigation?&lt;/strong&gt; If your on-call is drowning in alerts, the first move is AIOps (or PagerDuty Intelligent Alert Grouping, which is bundled with PagerDuty). If your on-call is producing reasonable alert volume but spending hours on each investigation, the first move is an AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What does the deployment posture require?&lt;/strong&gt; Air-gapped or strict-residency buyers should default to open-source AI SRE. SaaS-comfortable buyers have a wider field. See our &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;self-hosted AI SRE guide&lt;/a&gt; for the deployment tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is Kubernetes the dominant runtime?&lt;/strong&gt; Kubernetes-heavy estates have stronger open-source AI SRE options (HolmesGPT, K8sGPT, Aurora). VM-heavy or multi-cloud estates narrow the field to the cross-infrastructure agents (HolmesGPT, Aurora, commercial SaaS).&lt;/li&gt;
&lt;/ol&gt;
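&lt;p&gt;For teams that prefer to see the tree written down, the three questions can be encoded as a small function. This is a sketch under stated assumptions: the function name and argument names are invented for the example, and dropping "commercial SaaS" for air-gapped buyers is our reading of question 2 rather than a rule stated above.&lt;/p&gt;

```python
OPEN_SOURCE_K8S = ["HolmesGPT", "K8sGPT", "Aurora"]
CROSS_INFRA = ["HolmesGPT", "Aurora", "commercial SaaS"]


def first_move(bottleneck, air_gapped, kubernetes_dominant):
    """Walk the three questions in order and return a starting recommendation."""
    if bottleneck == "noise":
        # Question 1: alert volume is the pain, so start with the AIOps layer.
        return {"category": "AIOps", "shortlist": []}
    # Question 3: the dominant runtime decides which agents are in scope.
    shortlist = list(OPEN_SOURCE_K8S if kubernetes_dominant else CROSS_INFRA)
    if air_gapped:
        # Question 2: assumption that air-gapped buyers rule out hosted SaaS.
        shortlist = [name for name in shortlist if name != "commercial SaaS"]
    return {"category": "AI SRE", "shortlist": shortlist}
```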

&lt;p&gt;For tool selection past this step, see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt; and our &lt;a href="https://www.arvoai.ca/blog/top-10-aiops-platforms-free-root-cause-analysis-2026" rel="noopener noreferrer"&gt;Top 10 AIOps Platforms Offering Free Root Cause Analysis&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes when treating AIOps and AI SRE as substitutes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buying an AI SRE to fix alert noise.&lt;/strong&gt; The agent will get paged on every false positive and the LLM cost curve will dominate the conversation. Noise is a layer below the AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buying AIOps to get root-cause analysis.&lt;/strong&gt; Classical AIOps platforms generate anomaly clusters, not investigations. The "root cause" they surface is a statistical correlation, not a causal chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming the two categories will merge into one product.&lt;/strong&gt; Some vendors are bundling. The job split is not going away, because the underlying techniques are different and the cost curves are different.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discounting open-source AIOps.&lt;/strong&gt; Open-source projects like &lt;a href="https://www.keephq.dev/" rel="noopener noreferrer"&gt;Keep&lt;/a&gt; exist in the AIOps lane too, and they pair cleanly with an open-source AI SRE for an end-to-end self-hosted stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between AIOps and AI SRE?
&lt;/h3&gt;

&lt;p&gt;AIOps is a 2016 Gartner category for platforms that combine big data and machine learning to reduce noise across IT operations, primarily through statistical clustering and anomaly detection. AI SRE is a 2024-to-2026 category for multi-step LLM agents that investigate individual incidents by calling infrastructure tools and producing a reasoned root-cause analysis. AIOps sits before the alert; an AI SRE sits after it. Most mature teams run both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who coined the term AIOps and when?
&lt;/h3&gt;

&lt;p&gt;Gartner coined the term in 2016, initially as a shortening of "Algorithmic IT Operations" and later as "Artificial Intelligence for IT Operations." Gartner's glossary defines an AIOps platform as one that combines big data and machine learning to support primary IT operations functions through scalable ingestion and analysis of telemetry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is an AI SRE just a rebrand of AIOps?
&lt;/h3&gt;

&lt;p&gt;No. The two categories use different technical foundations and address different stages of the incident lifecycle. AIOps platforms rely on classical machine learning (clustering, anomaly detection) trained on metric and event streams. An AI SRE runs a large-language-model agent that calls tools during an incident to gather new evidence and reason through the cause. The terms are often confused because both compress to "AI for ops," but the products and skills required to run them are different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can an AI SRE replace an AIOps platform?
&lt;/h3&gt;

&lt;p&gt;Not for teams operating at carrier or telco scale, where event volumes exceed what an LLM can usefully reason over. Classical AIOps is still the right answer for raw-firehose correlation, alert deduplication, and trend analysis on numeric telemetry. An AI SRE replaces the human investigation step that follows an alert, not the noise-reduction step that precedes one.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the leading AIOps tools in 2026?
&lt;/h3&gt;

&lt;p&gt;Established commercial AIOps tools include BigPanda, Moogsoft (acquired by Dell in 2023), Dynatrace Davis, and ScienceLogic. PagerDuty Intelligent Alert Grouping ships as a feature inside PagerDuty. Open-source AIOps is led by Keep, which pairs cleanly with open-source AI SRE projects for an end-to-end self-hosted stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the leading AI SRE tools in 2026?
&lt;/h3&gt;

&lt;p&gt;Open-source: HolmesGPT (CNCF Sandbox since 8 October 2025), K8sGPT (CNCF Sandbox since 19 December 2023), and Aurora by Arvo AI. Commercial: Resolve.ai (Series A at a $1B valuation in February 2026), Traversal (raised $48M in June 2025, led by Sequoia and Kleiner Perkins), Datadog Bits AI SRE (GA on 2 December 2025), and PagerDuty SRE Agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which should a small team buy first, AIOps or an AI SRE?
&lt;/h3&gt;

&lt;p&gt;If alert volume is the pain point, start with the AIOps layer or with the noise-reduction features bundled in your incident-management tool (PagerDuty Intelligent Alert Grouping, Opsgenie alert policies). If alert volume is acceptable but investigations take hours, start with an AI SRE. Smaller teams under roughly 50 services often skip the AIOps layer initially and connect the AI SRE directly to monitoring webhooks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does AIOps include LLMs in 2026?
&lt;/h3&gt;

&lt;p&gt;Some AIOps vendors have added LLM features such as natural-language alert summaries or chat interfaces over their dashboards. This blurs the boundary at the product level but does not change the underlying job split. The LLM bolt-ons inside an AIOps product are typically copilot-grade summarisers, not multi-step investigation agents. Buyers should not assume an LLM feature inside an AIOps platform delivers AI SRE capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is AI SRE the same as Site Reliability Engineering with AI?
&lt;/h3&gt;

&lt;p&gt;Not exactly. Site Reliability Engineering is a discipline created at Google around 2003 covering SLOs, error budgets, capacity planning, postmortem culture, and on-call practices. An AI SRE is a tool category that automates one specific job inside that discipline, namely first-pass incident investigation. Investor framing across Sequoia, Kleiner, Lightspeed, and Felicis-backed AI SRE companies has consistently been agent-as-first-triage with a human in the loop, not headcount replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I still need AIOps if I have an AI SRE?
&lt;/h3&gt;

&lt;p&gt;For most enterprise estates, yes. The two categories handle different stages: AIOps reduces the firehose of telemetry into a tractable list of incidents; the AI SRE investigates each one. Skipping the AIOps layer is reasonable for smaller estates with acceptable alert hygiene but breaks down at large event volumes where LLM cost and context-window limits become a constraint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/ai-sre-vs-aiops" rel="noopener noreferrer"&gt;arvoai.ca/blog/ai-sre-vs-aiops&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Evaluate an AI SRE Platform</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 28 May 2026 22:43:58 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/how-to-evaluate-an-ai-sre-platform-2115</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/how-to-evaluate-an-ai-sre-platform-2115</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic SaaS RFPs do not fit AI SRE.&lt;/strong&gt; The category is younger than most procurement templates and the failure modes (hallucinated root causes, model drift, signal-type sensitivity) are not covered by traditional vendor checklists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation quality is measurable.&lt;/strong&gt; The &lt;a href="https://arxiv.org/abs/2412.17015" rel="noopener noreferrer"&gt;RCAEval benchmark&lt;/a&gt; (Pham et al., December 2024, &lt;a href="https://dl.acm.org/doi/proceedings/10.1145/3701716" rel="noopener noreferrer"&gt;published at ACM Web Conference 2025 Companion Proceedings&lt;/a&gt;) provides 735 fault-injection cases across three microservice systems with 11 fault types and 15 reproducible baselines. The &lt;a href="https://www.nofire.ai/ai-sre-benchmark" rel="noopener noreferrer"&gt;NOFire AI benchmark&lt;/a&gt; extends this with a signal-type ladder showing Top-1 accuracy rises from 29 percent on metrics-only inputs to 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust is a separate axis from capability.&lt;/strong&gt; &lt;a href="https://rootly.com/ai-sre-guide/maturity-model" rel="noopener noreferrer"&gt;Rootly's AI SRE Maturity Model&lt;/a&gt; maps the trust ladder in four steps: Level 0 (manual), Level 1 (read-only copilot), Level 2 (assisted actions with approvals), Level 3 (guardrailed autonomy for narrow, reversible failure modes). Buyers should stage trust across that ladder, not buy at the top.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment sovereignty is a gating constraint, not a tiebreaker.&lt;/strong&gt; Air-gapped, residency-bound, and BYO-LLM buyers must filter the shortlist on inference location before scoring anything else. See our &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCO is not just the licence line.&lt;/strong&gt; AI SRE cost models include LLM inference, observability surface, runbook ingestion, and the engineering time spent on guardrails. Open-source projects shift the licence cost to zero and surface the operating cost transparently.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;This guide is the deep evaluation framework. For the brief two-week procurement plan, see the &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;HowTo schema in our What is an AI SRE? glossary entry&lt;/a&gt;. For the five-capability rubric that filters the shortlist before evaluation starts, see the &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;Five-Capability AI SRE Test in that same post&lt;/a&gt;. Everything below assumes the shortlist has already cleared that test.&lt;/p&gt;

&lt;p&gt;We call the framework below the &lt;strong&gt;Four-Pillar AI SRE Evaluation Framework&lt;/strong&gt;. Each pillar is a separate scoring axis and each is anchored to a primary source where one exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do generic RFPs fail for AI SRE?
&lt;/h2&gt;

&lt;p&gt;A typical SaaS procurement RFP covers uptime, security posture, pricing, support tiers, and integrations. None of these surface the failure modes that matter for an AI SRE.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated root causes.&lt;/strong&gt; An AI SRE that produces a confident, plausible, wrong root-cause analysis is worse than no AI SRE, because the on-call rotation will follow the suggestion before second-guessing it. RFPs that do not include investigation-quality measurement miss this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signal-type sensitivity.&lt;/strong&gt; The NOFire benchmark shows that the same agent moves from 29 percent Top-1 accuracy on metrics-only inputs to 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning (&lt;a href="https://www.nofire.ai/ai-sre-benchmark" rel="noopener noreferrer"&gt;NOFire AI Benchmark&lt;/a&gt;). Buyers who do not test the agent against their actual signal mix can misestimate accuracy by as much as a factor of three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust ladder mismatches.&lt;/strong&gt; Buyers often evaluate against the wrong trust level. A team that wants Level 1 read-only operation and scores a tool on its Level 3 closed-loop remediation features is grading the tool on capabilities it will never use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference-location ambiguity.&lt;/strong&gt; Where the LLM call physically runs is a single-question filter that can disqualify a large part of the shortlist for regulated buyers. Standard RFPs bury this in a "data residency" footnote rather than putting it at the top of the form.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four pillars below replace the generic RFP with an AI-SRE-specific scoring sheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you measure AI SRE investigation quality?
&lt;/h2&gt;

&lt;p&gt;Investigation quality is the central question and the one most often left to vendor demos. The literature now provides enough scaffolding to measure it without a labelled production dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchor: the RCAEval benchmark
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2412.17015" rel="noopener noreferrer"&gt;RCAEval&lt;/a&gt; (Pham, Zhang, Ha, Salim, Zhang; December 2024; published at ACM Web Conference 2025 per the &lt;a href="https://dl.acm.org/doi/10.1145/3701716.3715290" rel="noopener noreferrer"&gt;ACM Digital Library record&lt;/a&gt;) ships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;735 failure cases collected from three microservice systems.&lt;/li&gt;
&lt;li&gt;11 fault types observed in real-world failures, including CPU throttling, memory leaks, network latency, container crashes, deployment errors, resource exhaustion, and database connection failures.&lt;/li&gt;
&lt;li&gt;Multi-source telemetry (metrics, logs, and traces) supporting metric-based, trace-based, and multi-source RCA approaches.&lt;/li&gt;
&lt;li&gt;15 reproducible baselines covering coarse-grained and fine-grained RCA.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors describe the work as the first comprehensive benchmark for root-cause analysis of microservices. The gap it fills, in their words, is that there was "no standard benchmark that includes large-scale datasets and supports comprehensive evaluation environments." For a buyer, this gives a defensible neutral testbed: ask the vendor to run their agent against the RCAEval fault-injection set and report Top-1 accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchor: the NOFire signal-type ladder
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.nofire.ai/ai-sre-benchmark" rel="noopener noreferrer"&gt;NOFire AI's published benchmark&lt;/a&gt; uses the RCAEval dataset and reports two findings buyers should internalise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Signal type matters more than model choice.&lt;/strong&gt; The benchmark reports Top-1 accuracy of 29 percent on metrics-only inputs, 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning. The implication: a buyer who only sends metrics to the agent should not expect more than a third of investigations to converge correctly, regardless of which vendor they pick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-context graphs beat plain LLMs.&lt;/strong&gt; NOFire reports 89 percent Top-1 accuracy on the RCAEval set, versus 42 percent for the best academic baseline (described on the benchmark page as "Academic SOTA (GALA BARO M2)"). The published implication is that a structured representation of the production environment (services, dependencies, recent changes) gives the agent better grounding than narrative telemetry alone. The vendor publishes this benchmark; treat the absolute numbers with the appropriate scepticism, but the relative ranking and the signal-type ladder are reproducible from the open dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What to measure during evaluation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top-1 accuracy on RCAEval-style fault injection.&lt;/strong&gt; Replay a handful of fault types, check whether the agent's top-ranked root cause matches the injected fault, and ask the vendor to walk through the investigation trace. The presence or absence of a coherent reasoning chain is the binary signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signal-type robustness.&lt;/strong&gt; Run the same fault under three configurations: metrics only, metrics plus logs, metrics plus logs plus traces. The shape of the accuracy curve tells you whether the agent compensates for signal gaps or simply degrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first useful finding.&lt;/strong&gt; Not MTTR (which depends on the human response loop), but the elapsed time from alert ingestion to the first piece of evidence a human SRE would have surfaced manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-system reasoning.&lt;/strong&gt; Construct a synthetic incident that spans Kubernetes, a managed database, and a recent deploy. Measure whether the agent reasons across all three or fixates on the loudest signal.&lt;/li&gt;
&lt;/ul&gt;
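
&lt;p&gt;The first two measurements above can be tallied with very little tooling. The sketch below is illustrative only: the configuration labels, service names, and the &lt;code&gt;RESULTS&lt;/code&gt; records are hypothetical placeholders, not RCAEval or NOFire data. It assumes each replayed fault yields the agent's top-ranked root cause and the known injected cause, and it reports Top-1 accuracy per signal configuration so the shape of the curve is visible.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical replay results: (signal configuration, agent's top-ranked
# root cause, ground-truth root cause). The values are illustrative only.
RESULTS = [
    ("metrics", "checkout", "cart"),
    ("metrics", "cart", "cart"),
    ("metrics+logs", "cart", "cart"),
    ("metrics+logs", "payments", "payments"),
    ("metrics+logs+traces", "cart", "cart"),
    ("metrics+logs+traces", "payments", "payments"),
]


def top1_accuracy_by_config(results):
    """Return {configuration: fraction of cases whose top-ranked answer is correct}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for config, predicted, actual in results:
        totals[config] += 1
        hits[config] += int(predicted == actual)
    return {config: hits[config] / totals[config] for config in totals}


if __name__ == "__main__":
    for config, accuracy in top1_accuracy_by_config(RESULTS).items():
        print(f"{config}: {accuracy:.0%}")
```

&lt;p&gt;With a realistic number of cases per configuration, comparing the three figures side by side shows whether the agent degrades gracefully when a signal type is missing.&lt;/p&gt;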

&lt;h2&gt;
  
  
  How do you evaluate AI SRE trust and governance?
&lt;/h2&gt;

&lt;p&gt;Investigation quality without trust controls is a liability. The buyer's question is not "can the agent act" but "under what conditions, with what evidence, and with what rollback."&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchor: the Rootly AI SRE Maturity Model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://rootly.com/ai-sre-guide/maturity-model" rel="noopener noreferrer"&gt;Rootly publishes a four-level maturity model&lt;/a&gt; that maps the trust ladder cleanly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 0: Manual reliability operations.&lt;/strong&gt; No AI assistance. Responders hunt across dashboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 1: Read-only AI SRE, evidence-first copilot.&lt;/strong&gt; The "trust-building stage." The AI accelerates context gathering and produces ranked hypotheses linked to evidence, but executes no changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2: Assisted actions with approvals.&lt;/strong&gt; The AI can propose and run approved actions through a governed workflow engine with RBAC, audit logs, verification gates, and rollback readiness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3: Guardrailed autonomy for narrow, reversible failure modes.&lt;/strong&gt; The AI autonomously executes pre-approved runbooks for a small set of repeatable incidents, within strict bounds and with continuous verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Rootly framing is useful for a reason buyers often miss: &lt;strong&gt;the levels are not a feature ranking, they are a deployment posture&lt;/strong&gt;. A tool that ships Level 3 features is not "better" than a tool that ships Level 1 features; it is appropriate for a different stage of the buyer's adoption arc.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to evaluate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Match the maturity level to the buyer's current state.&lt;/strong&gt; A team that has not yet shipped Level 1 should not be paying for Level 3 capability they will not turn on. A team ready for Level 2 should not be buying a Level 1-only tool that they will outgrow in twelve months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action class boundaries.&lt;/strong&gt; Read-only investigation, PR-based suggestions, and sandboxed in-cluster execution are three different trust decisions. Document which classes the tool supports and which the buyer will enable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit and rollback.&lt;/strong&gt; Every action the agent can take must have an audit log entry and a rollback path. Komodor's published benchmarking guide names this dimension as "Transparency: evidence, timelines, and change history alongside every recommendation" (&lt;a href="https://komodor.com/resources/beyond-the-hype-a-benchmarking-guide-for-ai-sre-in-2026/" rel="noopener noreferrer"&gt;Komodor: Beyond the Hype&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination guardrails.&lt;/strong&gt; Komodor's guide also calls out "rigorous testing cycles and closed feedback loops" as the path to "95% RCA precision." Ask the vendor what those feedback loops look like in practice; a tool with no published guardrail story should be downgraded.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What deployment sovereignty checks matter for an AI SRE?
&lt;/h2&gt;

&lt;p&gt;For regulated industries, this pillar is a filter, not a scoring axis. The shortlist either includes a deployment posture that matches the buyer's constraints or it does not.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to evaluate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference location.&lt;/strong&gt; Does the LLM call run on vendor-managed infrastructure (SaaS), on customer-managed infrastructure (self-hosted), or on a local model (air-gapped)? See our &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt; guide for the full deployment-tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data residency.&lt;/strong&gt; Where does telemetry physically reside when sent to the agent? Buyers under GDPR, HIPAA, or sector-specific regimes need a written answer, not a marketing one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BYO-LLM support.&lt;/strong&gt; Can the buyer point the agent at their own LLM endpoint? Open-source projects support this directly; &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT documents OpenAI-Compatible (LiteLLM proxy) and Ollama&lt;/a&gt;; &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;K8sGPT registers IBM watsonx, Oracle OCI GenAI, and a generic Custom REST endpoint&lt;/a&gt; among others; Aurora supports local inference through Ollama. Most commercial AI SREs offer a smaller backend list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gapped operation.&lt;/strong&gt; Can the agent run with no outbound network calls? This is the strictest test and disqualifies most SaaS-only products.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Anchor: cite the project's own documentation
&lt;/h3&gt;

&lt;p&gt;For an open-source AI SRE, the source of truth is the project's GitHub repository and official docs site. For a commercial AI SRE, the source of truth is the vendor's data-processing addendum and the published deployment architecture. Avoid relying on the sales conversation for this pillar; the engineering documentation is where commitments are durable.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you model AI SRE total cost of ownership?
&lt;/h2&gt;

&lt;p&gt;TCO for AI SRE breaks down into four layers, only one of which is the licence.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Licence or subscription cost.&lt;/strong&gt; Open-source AI SREs are zero at this layer. Commercial AI SREs use per-seat, per-investigation, or platform-bundled pricing. Datadog Bits AI SRE bundles into the broader Datadog platform per the &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;product page&lt;/a&gt;. PagerDuty SRE Agent bundles into the PagerDuty platform per the &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;product page&lt;/a&gt;. Resolve.ai and Traversal price by custom contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM inference cost.&lt;/strong&gt; This is the live-burn cost. Frontier-model API pricing changes frequently; buyers should model investigations per month against the published per-token rates of their chosen provider (Anthropic, OpenAI, Google). For BYO-LLM deployments running local models through Ollama or vLLM, the inference cost is reduced to the underlying compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating surface cost.&lt;/strong&gt; The agent reads from observability backends, ticket systems, and source-control. Heavy use can increase the costs of the systems it reads from (Datadog ingestion, Splunk indexing, GitHub API rate-limit upgrades). Buyers should add a line item for this and ask their FinOps team to model it before purchase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering time on guardrails and runbook ingestion.&lt;/strong&gt; Every AI SRE needs runbooks ingested, integrations configured, and RBAC scoped. The first month of engineering time is the largest hidden TCO cost.&lt;/li&gt;
&lt;/ol&gt;
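
&lt;p&gt;The four layers above translate directly into a small spreadsheet-style model. The following is a minimal sketch under stated assumptions: every field name is hypothetical, inference is approximated as investigations times average tokens times a blended per-million-token rate, and all inputs must come from the buyer's own quotes and estimates rather than from this post.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class TcoInputs:
    """All fields are hypothetical planning inputs, in one currency."""
    annual_licence: float               # 0 for an open-source tool
    investigations_per_month: int
    tokens_per_investigation: int       # input plus output tokens, averaged
    price_per_million_tokens: float     # blended per-token rate from the provider
    annual_operating_surface: float     # extra ingestion, indexing, API tiers
    engineering_hours: float            # guardrails, RBAC, runbook ingestion
    hourly_engineering_rate: float


def year_one_tco(inputs):
    """Return the four cost lines and their total for year one."""
    inference = (
        inputs.investigations_per_month * 12
        * inputs.tokens_per_investigation / 1_000_000
        * inputs.price_per_million_tokens
    )
    lines = {
        "licence": inputs.annual_licence,
        "inference": inference,
        "operating_surface": inputs.annual_operating_surface,
        "engineering": inputs.engineering_hours * inputs.hourly_engineering_rate,
    }
    lines["total"] = sum(lines.values())
    return lines
```

&lt;p&gt;Running the same function for each shortlisted tool, plus an open-source baseline with &lt;code&gt;annual_licence=0&lt;/code&gt;, keeps the comparison on one shape of cost data.&lt;/p&gt;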

&lt;h3&gt;
  
  
  Why we are not publishing competitor pricing numbers
&lt;/h3&gt;

&lt;p&gt;Most direct AI SRE competitors (Resolve.ai, Cleric, PagerDuty SRE Agent, Datadog Bits AI SRE) do not publish per-investigation or per-seat pricing on their public sites. Buyers must request a quote. Any TCO comparison we published with specific dollar figures would either be out of date by the time you read it or fabricated. The correct approach is to issue an RFP that asks for the same shape of cost data (licence floor, per-investigation rate, per-seat rate, included integrations) from every vendor on the shortlist and to model the rest yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How long should the evaluation take?
&lt;/h2&gt;

&lt;p&gt;Our recommendation is a 21-day evaluation sprint. This is longer than the 14-day plan in the &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;What is an AI SRE? HowTo schema&lt;/a&gt; because the four-pillar framework explicitly measures investigation quality with synthetic fault injection rather than relying on demo-driven impressions.&lt;/p&gt;

&lt;p&gt;The 21-day sprint is documented in the appendix at the end of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do competitor evaluation frameworks miss?
&lt;/h2&gt;

&lt;p&gt;Several competitor frameworks exist; each has gaps the Four-Pillar Framework closes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.traversal.com/blog/how-should-you-evaluate-an-ai-sre-product" rel="noopener noreferrer"&gt;Traversal's "How Should You Evaluate an AI SRE Product?" post&lt;/a&gt;&lt;/strong&gt; focuses on selecting representative incidents, defining success metrics, and a multi-tier accuracy rubric for root cause analysis. It is strong on the testing methodology and quiet on TCO modelling and open-source alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://komodor.com/resources/beyond-the-hype-a-benchmarking-guide-for-ai-sre-in-2026/" rel="noopener noreferrer"&gt;Komodor's "Beyond the Hype" benchmarking guide&lt;/a&gt;&lt;/strong&gt; is strong on transparency and on the LLM-as-a-Judge testing methodology. The detailed scoring rubric is gated behind an ebook download and the framework does not extend to deployment sovereignty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rootly's maturity model&lt;/strong&gt; is the cleanest published trust ladder, but does not address investigation quality measurement or signal-type sensitivity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;a href="https://www.traversal.com/blog/llm-benchmarking-in-context-retrieval-reasoning-incident-root-cause-analysis" rel="noopener noreferrer"&gt;Traversal LLM benchmarking paper for incident root cause analysis&lt;/a&gt;&lt;/strong&gt; is excellent on model-level evaluation and silent on the buyer-process question of how to map evaluation results onto a maturity-level decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Four-Pillar Framework is intentionally additive. A buyer who has already adopted Rootly's maturity model and Komodor's testing methodology can use the framework above to fill the deployment-sovereignty and TCO gaps without throwing away their existing work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the common AI SRE evaluation mistakes?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scoring on demo polish.&lt;/strong&gt; A demo is a curated success path. Evaluate on failure cases the vendor has not seen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the signal-type test.&lt;/strong&gt; The NOFire ladder shows accuracy can vary by a factor of three depending on what telemetry the agent receives. Run the test on the buyer's actual signal mix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buying remediation before trust.&lt;/strong&gt; Most teams should buy at Level 1 (read-only investigation) and stage trust upward across six to twelve months, not procure at Level 3 in a single decision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring open-source baselines.&lt;/strong&gt; A Five-Capability-passing open-source AI SRE deployed in a single afternoon is the fairest baseline against which to measure any commercial pitch. See our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;open-source three-way comparison&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treating TCO as the licence line.&lt;/strong&gt; Inference, operating surface, and guardrail engineering frequently exceed the licence cost in year one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Appendix: How to evaluate an AI SRE platform in 21 days
&lt;/h2&gt;

&lt;p&gt;A three-week evaluation sprint that applies the Four-Pillar AI SRE Evaluation Framework end-to-end. Each step produces a written deliverable a procurement reviewer can sign off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1 to 3: Filter the shortlist on the Five-Capability AI SRE Test.&lt;/strong&gt; Score every shortlisted tool on multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, and structured root-cause output. Drop any tool that scores below 6 out of 15. The deliverable is a one-page capability scorecard for each tool on the remaining shortlist.&lt;/p&gt;
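
&lt;p&gt;As a rough illustration of the Day 1 to 3 filter, the sketch below assumes each of the five capabilities is scored as an integer from 0 to 3 (an assumption that makes the maximum 15; the scale is not spelled out in this post) and keeps only tools that reach the 6-point threshold. The capability keys and the function name are hypothetical.&lt;/p&gt;

```python
from operator import ge

CAPABILITIES = (
    "multi_step_investigation",
    "infrastructure_tool_execution",
    "dependency_graph_awareness",
    "knowledge_base_rag",
    "structured_root_cause_output",
)
PASS_THRESHOLD = 6  # from the text: drop tools scoring below 6 out of 15


def shortlist(scorecards):
    """Keep tools whose total across the five capabilities meets the threshold.

    scorecards maps a tool name to {capability: score}, with each score
    assumed to be an integer from 0 to 3.
    """
    kept = {}
    for tool, scores in scorecards.items():
        total = sum(scores[capability] for capability in CAPABILITIES)
        if ge(total, PASS_THRESHOLD):
            kept[tool] = total
    return kept
```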

&lt;p&gt;&lt;strong&gt;Day 4 to 8: Measure investigation quality on synthetic fault injection.&lt;/strong&gt; Replay a sample of RCAEval fault types (CPU throttling, memory leak, network latency, container crash, deployment error) against each tool. Measure Top-1 accuracy, time to first useful finding, and the coherence of the investigation trace. Run the NOFire signal-type ladder by configuring the same fault under metrics-only, metrics-plus-logs, and metrics-plus-logs-plus-traces and report the accuracy curve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 9 to 11: Run the trust and governance walkthrough.&lt;/strong&gt; Map each tool to a Rootly AI SRE Maturity Model level (0, 1, 2, or 3). Document the action classes the tool supports (read-only investigation, PR-based suggestion, sandboxed execution), the audit and rollback paths for each, and the hallucination guardrails the vendor publishes. Match the tool's maturity level to your adoption stage; do not procure higher than you will operate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 12 to 14: Apply the deployment sovereignty filter.&lt;/strong&gt; Document where the LLM call physically runs (SaaS, self-hosted, air-gapped), where telemetry resides, whether the tool supports BYO-LLM, and whether air-gapped operation is supported. For regulated buyers this is a filter, not a scoring axis: a tool that fails residency or air-gapped requirements drops off the shortlist regardless of its score on the other pillars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 15 to 17: Model total cost of ownership.&lt;/strong&gt; Build four cost lines for each tool: licence or subscription, projected LLM inference at expected investigation volume, projected operating-surface increase (observability ingestion, ticket-system API load), and projected engineering time on guardrails and runbook ingestion. Model year one and year three. Add an open-source baseline (HolmesGPT, K8sGPT, or Aurora) to the cost table as a reference floor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 18 to 20: Pilot the top two tools in read-only mode.&lt;/strong&gt; Pick a single SRE team or product squad. Route a defined subset of alerts (one severity tier, one service domain) into each tool in read-only investigation mode. Capture the team's qualitative read on whether the investigation traces are trustable, faster, and surface evidence the team would have missed manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 21: Produce the four-pillar decision memo.&lt;/strong&gt; Write a one-page memo with the per-tool scores on each pillar: Investigation Quality (Top-1 accuracy on RCAEval-style fault injection plus signal-type robustness), Trust and Governance (Rootly maturity level fit plus published guardrail story), Deployment Sovereignty (pass or fail on inference location and residency), and TCO (year-one total across the four cost lines). The decision is the tool that scores highest once each pillar is weighted by the buyer's priorities, not the tool with the highest unweighted total.&lt;/p&gt;
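
&lt;p&gt;One way to make the memo's final ranking mechanical is sketched below. It is an assumption-laden example, not a prescribed formula: the pillar keys, the 0.0 to 1.0 score scale, and the weights are all hypothetical and would be set by the buyer. Deployment sovereignty is applied as a pass or fail gate before any weighting, matching its role as a filter rather than a scoring axis.&lt;/p&gt;

```python
def rank_tools(tools, weights):
    """Rank tools that pass the sovereignty gate by weighted pillar score.

    tools maps a name to a dict with a boolean "sovereignty_pass" and
    scores from 0.0 to 1.0 for each key in weights (higher is better, so
    a cheaper TCO should be given a higher "tco" score).
    """
    ranked = []
    for name, card in tools.items():
        if not card["sovereignty_pass"]:
            continue  # a gate, not a scoring axis
        score = sum(weights[pillar] * card[pillar] for pillar in weights)
        ranked.append((name, round(score, 3)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

&lt;p&gt;Because the weights are explicit, two reviewers with different priorities can rerun the ranking and see exactly where their conclusions diverge.&lt;/p&gt;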

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I benchmark an AI SRE platform's accuracy?
&lt;/h3&gt;

&lt;p&gt;Use the RCAEval public benchmark (arxiv 2412.17015), which provides 735 fault-injection cases across three microservice systems with 11 fault types. Ask the vendor to run their agent against the RCAEval set and report Top-1 accuracy. Replay a handful of fault types yourself and inspect the investigation trace for a coherent reasoning chain. Cross-reference against the NOFire AI benchmark for the signal-type ladder, which shows Top-1 accuracy rising from 29 percent on metrics-only inputs to 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the four pillars of AI SRE evaluation?
&lt;/h3&gt;

&lt;p&gt;Investigation quality, trust and governance, deployment sovereignty, and total cost of ownership. Investigation quality is anchored to the RCAEval benchmark and NOFire signal-type ladder. Trust and governance maps to Rootly's four-level AI SRE Maturity Model. Deployment sovereignty is a gating filter on inference location and data residency. TCO covers licence, LLM inference, operating-surface cost, and engineering time on guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the RCAEval benchmark?
&lt;/h3&gt;

&lt;p&gt;RCAEval is the first comprehensive open benchmark for root-cause analysis of microservice systems, published by Pham et al. in December 2024 and presented at the ACM Web Conference 2025. It ships 735 failure cases across three microservice systems, 11 fault types observed in real-world failures, multi-source telemetry (metrics, logs, traces), and 15 reproducible baselines covering coarse-grained and fine-grained RCA. It is the closest thing the AI SRE category has to a neutral evaluation testbed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the NOFire AI signal-type ladder?
&lt;/h3&gt;

&lt;p&gt;The NOFire benchmark, built on the RCAEval dataset, reports Top-1 accuracy across four signal configurations: 29 percent on metrics-only inputs, 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning. The published implication is that signal type matters more than model choice; a buyer who sends only metrics to the agent should not expect more than a third of investigations to converge regardless of vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Rootly's AI SRE Maturity Model?
&lt;/h3&gt;

&lt;p&gt;A four-level trust ladder published by Rootly. Level 0 is manual reliability operations. Level 1 is read-only AI SRE that produces ranked hypotheses linked to evidence but executes nothing. Level 2 is assisted actions with approvals through a governed workflow engine. Level 3 is guardrailed autonomy for a narrow set of pre-approved, reversible failure modes. The levels are a deployment posture, not a feature ranking; buyers should match the level to their adoption stage rather than always procuring at the top.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does an AI SRE evaluation take?
&lt;/h3&gt;

&lt;p&gt;A focused buyer can complete the Four-Pillar evaluation in 21 days: three days on shortlisting against the Five-Capability AI SRE Test, five days on investigation-quality testing with synthetic fault injection, three days on the trust-and-governance walkthrough with security, three days on the deployment-sovereignty filter, three days on TCO modelling, and the remainder on pilot operation and the decision memo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do generic RFPs fail for AI SRE evaluation?
&lt;/h3&gt;

&lt;p&gt;Standard SaaS procurement templates miss the failure modes specific to LLM agents: hallucinated root causes, signal-type sensitivity, trust-ladder mismatch, and inference-location ambiguity. RFPs that score on uptime and integrations without measuring investigation quality on synthetic fault injection can misestimate accuracy by as much as a factor of three (per the NOFire benchmark) and tend to push buyers toward over-procurement at higher maturity levels than they will operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I evaluate the cost of an AI SRE?
&lt;/h3&gt;

&lt;p&gt;Model four cost layers, not one. The licence or subscription line is the visible cost and is zero for open-source AI SREs. LLM inference is the live-burn cost, modelled against per-token rates from the chosen provider or against compute costs for self-hosted models. The operating-surface cost covers increased ingestion or rate-limit pressure on the systems the agent reads from (observability backends, ticket systems, source control). Engineering time on guardrails and runbook ingestion is the largest hidden cost in year one and frequently exceeds the licence line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I evaluate an AI SRE without buying first?
&lt;/h3&gt;

&lt;p&gt;Yes, for the open-source projects. HolmesGPT, K8sGPT, and Aurora can be installed with Docker Compose or a Helm chart in a single afternoon and run against the RCAEval fault-injection dataset with no commercial commitment. Most commercial AI SREs offer a trial period; Datadog Bits AI SRE documents a 14-day free trial in its launch blog. Use the open-source baseline to calibrate expectations before scoring any commercial pitch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not publish a TCO comparison table?
&lt;/h3&gt;

&lt;p&gt;Most direct AI SRE competitors (Resolve.ai, Cleric, PagerDuty SRE Agent, Datadog Bits AI SRE) do not publish per-investigation or per-seat pricing on their public sites. Any TCO comparison with specific dollar figures would either be out of date by the time it was read or fabricated. The honest answer is to issue an RFP that asks for the same shape of cost data (licence floor, per-investigation rate, per-seat rate, included integrations) from every vendor and to model the inference, operating-surface, and engineering layers yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/how-to-evaluate-ai-sre-platform" rel="noopener noreferrer"&gt;arvoai.ca/blog/how-to-evaluate-ai-sre-platform&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>What is an AI SRE? Definition, Capabilities, and 2026 Buyer's Lens</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 21 May 2026 23:45:57 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/what-is-an-ai-sre-definition-capabilities-and-2026-buyers-lens-41l4</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/what-is-an-ai-sre-definition-capabilities-and-2026-buyers-lens-41l4</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE is a multi-step large-language-model agent that investigates production incidents, queries live telemetry, and drafts a root-cause analysis with remediation guidance.&lt;/strong&gt; It is not an alerting tool, not an AIOps correlator, and not a chatbot. The agent calls infrastructure tools (&lt;code&gt;kubectl&lt;/code&gt;, cloud APIs, log queries) during an incident to gather new evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The category emerged in 2024 and consolidated in 2025-2026.&lt;/strong&gt; Open-source projects include &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;), &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;), and &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, multi-cloud). Commercial entrants include &lt;a href="https://resolve.ai/" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt; (&lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;$125M Series A at a $1B valuation in February 2026&lt;/a&gt;) and &lt;a href="https://www.traversal.com/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt; (&lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;$48M Series A in June 2025&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE is not the same as an AIOps platform.&lt;/strong&gt; AIOps tools cluster alerts statistically and predate LLMs. An AI SRE reasons through an incident step by step using an LLM that calls tools. The two categories are complementary, not interchangeable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five capabilities define a credible AI SRE.&lt;/strong&gt; Multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, and a structured root-cause output. Tools that ship fewer than three of these are something else (chatbot, summariser, correlator).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption is bounded by trust, not capability.&lt;/strong&gt; Most 2026 buyers run the agent in read-only investigation mode for the first ninety days. Closed-loop remediation is a separate trust decision that follows clean operation, never the first decision.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;An AI SRE is a multi-step large-language-model agent that investigates production incidents on behalf of a site reliability engineer.&lt;/strong&gt; When an alert fires, the agent queries telemetry, traverses infrastructure dependencies, retrieves relevant runbooks, and produces a structured root-cause analysis. The category sits next to, not inside, the older AIOps and incident-management markets.&lt;/p&gt;

&lt;p&gt;This page is a definitional reference. For the deep methodology and procurement-stage detail, see our &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE Complete Guide&lt;/a&gt;. For tool selection, see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an AI SRE do? The Five-Capability Test
&lt;/h2&gt;

&lt;p&gt;We call the rubric below the &lt;strong&gt;Five-Capability AI SRE Test&lt;/strong&gt;. A tool that ships fewer than three of these capabilities is in an adjacent category (copilot, summariser, correlator) and should not be evaluated against a real AI SRE.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step investigation.&lt;/strong&gt; The agent runs an iterative reasoning loop (&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt;, tool-calling, or a graph-based equivalent) where each step uses the previous tool result to decide the next call. Single-shot summarisation is a different category.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure tool execution.&lt;/strong&gt; The agent reads from &lt;code&gt;kubectl&lt;/code&gt;, cloud SDKs, observability backends, and ticket systems. Some agents also write, with guardrails. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT documents read-only access with RBAC respect&lt;/a&gt;. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora documents sandboxed execution into an isolated namespace&lt;/a&gt;. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT documents Kubernetes-only diagnostics with anonymisation before any AI backend call&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency-graph awareness.&lt;/strong&gt; The agent knows that service A talks to service B and uses that topology to assess blast radius. Aurora ships a Memgraph-backed dependency graph. Causely is built on a causal-graph foundation; see &lt;a href="https://docs.causely.ai/getting-started/how-causely-works/" rel="noopener noreferrer"&gt;How Causely Works&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge-base RAG.&lt;/strong&gt; The agent retrieves runbooks and past postmortems using hybrid search (&lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25" rel="noopener noreferrer"&gt;BM25&lt;/a&gt; plus dense vectors). Aurora documents a &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; hybrid index. The leading commercial AI SREs all integrate Confluence and ticket systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured root-cause output.&lt;/strong&gt; The agent emits a final artefact (summary, evidence chain, suggested remediation) rather than a chat transcript. Postmortem export to Confluence or Jira is increasingly table-stakes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The minimum coherent product ships investigation, tool execution, and a structured output. Items 3 and 4 push the tool from "interesting demo" to "load-bearing in production."&lt;/p&gt;
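
&lt;p&gt;The loop behind capabilities 1, 2, and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how HolmesGPT, K8sGPT, or Aurora are implemented: &lt;code&gt;investigate&lt;/code&gt;, &lt;code&gt;Step&lt;/code&gt;, &lt;code&gt;RootCauseReport&lt;/code&gt;, the stub planner, and the stub tools are all hypothetical names, and the planner stands in for what would be an LLM call in a real agent.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One tool call and its result (hypothetical type for this sketch)."""
    tool: str
    argument: str
    result: str


@dataclass
class RootCauseReport:
    """Structured output instead of a chat transcript (hypothetical type)."""
    summary: str
    evidence: List[Step] = field(default_factory=list)
    suggested_remediation: str = ""


# The planner sees the alert plus all evidence so far and returns the next
# (tool, argument) pair, or None when it has enough evidence.
# Assumption: a real agent would make an LLM call here.
Planner = Callable[[str, List[Step]], Optional[Tuple[str, str]]]


def investigate(
    alert: str,
    planner: Planner,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> RootCauseReport:
    evidence: List[Step] = []
    for _ in range(max_steps):
        decision = planner(alert, evidence)
        if decision is None:
            break
        tool, argument = decision
        if tool in tools:
            result = tools[tool](argument)
        else:
            result = "unknown tool: " + tool
        # Each result is appended so the next planning step can use it.
        evidence.append(Step(tool, argument, result))
    summary = evidence[-1].result if evidence else "no evidence gathered"
    # Remediation is left empty: that is a separate, human-gated decision.
    return RootCauseReport(summary=summary, evidence=evidence)


def stub_planner(alert: str, evidence: List[Step]) -> Optional[Tuple[str, str]]:
    """Hard-coded stand-in for the LLM: list pods first, then read logs."""
    if not evidence:
        return ("get_pods", "checkout")
    last = evidence[-1]
    if last.tool == "get_pods" and "CrashLoopBackOff" in last.result:
        return ("get_logs", "checkout-7d9f")
    return None


# Read-only stub tools returning canned strings (made-up data).
STUB_TOOLS: Dict[str, Callable[[str], str]] = {
    "get_pods": lambda namespace: "checkout-7d9f CrashLoopBackOff",
    "get_logs": lambda pod: "OOMKilled: memory limit 256Mi exceeded",
}

if __name__ == "__main__":
    report = investigate("checkout 5xx spike", stub_planner, STUB_TOOLS)
    for step in report.evidence:
        print(step.tool, step.argument, step.result)
```

&lt;p&gt;The point of the sketch is the shape, not the stubs: the second tool call is chosen only because of what the first one returned, and the function returns an evidence chain rather than free text. A single-shot summariser has no equivalent of the &lt;code&gt;evidence&lt;/code&gt; list feeding back into the planner.&lt;/p&gt;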

&lt;h2&gt;
  
  
  How is an AI SRE different from a human SRE?
&lt;/h2&gt;

&lt;p&gt;An AI SRE does not replace a human site reliability engineer. The 2026 division of labour is concrete.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human stays in the loop for&lt;/strong&gt; scope decisions (what counts as an incident), trust decisions (when to allow remediation), capacity planning, postmortem facilitation, runbook authorship, and the SLO conversation with product owners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The agent absorbs&lt;/strong&gt; the first sixty to ninety minutes of evidence-gathering on noisy alerts, the late-night triage of unclear pages, the cross-system correlation that humans defer until morning, and the boilerplate of a draft postmortem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The economic argument is bounded. The category's investors (Sequoia, Kleiner, Lightspeed, Felicis) underwrite an "agent does first triage, human does decision" workflow, not a headcount-replacement claim. The &lt;a href="https://newsletter.signoz.io/p/ai-isnt-replacing-sres-its-deskilling" rel="noopener noreferrer"&gt;SigNoz newsletter discussion of deskilling risk&lt;/a&gt; is a useful counterweight.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE different from AIOps?
&lt;/h2&gt;

&lt;p&gt;The two names sound alike, but the categories share almost no implementation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;AIOps platform&lt;/th&gt;
&lt;th&gt;AI SRE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary technique&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Statistical clustering, anomaly detection, correlation rules&lt;/td&gt;
&lt;td&gt;LLM reasoning, tool-calling agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it was named&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coined by &lt;a href="https://www.gartner.com/en/information-technology/glossary/aiops-platform" rel="noopener noreferrer"&gt;Gartner in 2016&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Emerged in vendor marketing 2024 to 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it produces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert clusters, noise reduction, incident summaries&lt;/td&gt;
&lt;td&gt;A reasoned root-cause analysis, evidence chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Representative tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BigPanda, Moogsoft, Dynatrace Davis, PagerDuty Intelligent Alert Grouping&lt;/td&gt;
&lt;td&gt;HolmesGPT, K8sGPT, Aurora, Resolve.ai, Traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual alert triage&lt;/td&gt;
&lt;td&gt;First-pass incident investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AIOps platforms predate LLMs and remain useful for alert hygiene. An AI SRE is downstream: once the alert lands, the AI SRE investigates it. Most mature teams will end up with both.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE different from an incident-management copilot?
&lt;/h2&gt;

&lt;p&gt;A copilot inside &lt;a href="https://rootly.com/" rel="noopener noreferrer"&gt;Rootly&lt;/a&gt;, &lt;a href="https://incident.io/" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt;, &lt;a href="https://firehydrant.com/" rel="noopener noreferrer"&gt;FireHydrant&lt;/a&gt;, or &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI&lt;/a&gt; drafts Slack updates, suggests on-call swaps, and writes a postmortem from artefacts the team has already produced. An AI SRE generates the evidence those artefacts describe. The two categories cooperate; they do not substitute. See our &lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;AI SRE vs traditional incident management comparison&lt;/a&gt; for the long form.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the open-source vs commercial AI SRE options?
&lt;/h2&gt;

&lt;p&gt;In May 2026, three open-source projects dominate this lane.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt/blob/master/LICENSE" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. 2.5k GitHub stars on the canonical repository as of May 2026, per the &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT/holmesgpt about box&lt;/a&gt;. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;Originally created by Robusta.dev with major contributions from Microsoft&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;. Project legal entity: &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT a Series of LF Projects, LLC&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT.&lt;/strong&gt; &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. 7.8k GitHub stars on the canonical repository as of May 2026, per the &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;k8sgpt-ai/k8sgpt about box&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;. The June 2024 CNCF blog notes that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF: K8sGPT, June 2024&lt;/a&gt;). Kubernetes-scoped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora by Arvo AI.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. Multi-cloud (AWS, Azure, GCP, OVH, Scaleway, Kubernetes). Sandboxed command execution, dependency-graph awareness, RAG over runbooks and postmortems. See the &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;direct comparison of all three&lt;/a&gt; and our &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;self-hosted AI SRE guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commercial entrants raise larger cheques but ship a narrower deployment surface. &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai confirmed a $125M Series A at a $1B valuation in February 2026&lt;/a&gt; and an &lt;a href="https://www.prnewswire.com/news-releases/resolve-ai-announces-series-a-extension-at-a-1-5b-valuation-and-launches-resolve-ai-labs-to-advance-ai-systems-for-complex-production-environments-302743888.html" rel="noopener noreferrer"&gt;extension at a $1.5B valuation in April 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal raised $48M in June 2025 led by Sequoia and Kleiner Perkins&lt;/a&gt;. Incumbents shipped 2025-2026 launches: &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty SRE Agent&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI SRE&lt;/a&gt;, and ServiceNow Now Assist for incident operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE evaluated?
&lt;/h2&gt;

&lt;p&gt;Three questions resolve most procurement debates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the agent investigate or just summarise?&lt;/strong&gt; A summariser repeats what the dashboard already says. An investigator gathers new evidence. Ask the vendor to walk through one tool call after the alert; if the answer is "we summarise the alert payload," the product is a copilot, not an AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where does inference run?&lt;/strong&gt; A SaaS-only inference plane is fine for unregulated teams and disqualifying for regulated ones. The deployment tier is fixed by the strictest constraint, not the average. See the &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Sovereignty Spectrum in our self-hosted guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the remediation boundary?&lt;/strong&gt; Read-only investigation is one trust decision. PR-based suggestions are another. Sandboxed in-cluster execution is the third. Most teams stage these three independently across a six-to-twelve-month adoption arc, not in a single procurement.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a detailed tool matrix scored on five axes (investigation, remediation, postmortem, deployment flexibility, source availability), see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ROI: where the time actually comes back
&lt;/h2&gt;

&lt;p&gt;Independent ROI numbers specifically for AI SRE are still thin in 2026. The broader industry adoption picture is well-sourced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report" rel="noopener noreferrer"&gt;Google's 2025 DORA report announcement&lt;/a&gt; states "90% of survey respondents report using AI at work" and that "More than 80% believe it has increased their productivity."&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://survey.stackoverflow.co/2025/" rel="noopener noreferrer"&gt;Stack Overflow's 2025 Developer Survey&lt;/a&gt; reports that 84 percent of respondents are using or planning to use AI tools in their development process, and 51 percent of professional developers use AI tools daily.&lt;/li&gt;
&lt;li&gt;The same DORA 2025 report notes that "AI adoption still has a negative relationship with software delivery stability," which is exactly the gap an investigation-grade AI SRE is positioned to close, distinct from the coding-assistant category that drives most of the AI adoption signal above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where AI SRE specifically takes hours back is mid-tier paging volume: the alerts that are too ambiguous to ignore and too low-stakes to wake a senior on. The agent's first-pass triage moves those from "morning standup discussion" to "closed before breakfast."&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the common mistakes when buying an AI SRE?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conflating a postmortem generator with an AI SRE.&lt;/strong&gt; A tool that writes a draft from the Slack transcript is not investigating. It is summarising.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buying multi-cloud AI SRE for a single-cloud problem.&lt;/strong&gt; If 95 percent of the estate is one cloud, a Kubernetes-only or AWS-only agent may be a better cost-to-fit match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting with remediation.&lt;/strong&gt; The fastest way to lose stakeholder trust is to let an agent execute a command before the team understands its investigation pattern. Stage trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the dependency-graph question.&lt;/strong&gt; If the agent does not understand what calls what, it will miss blast-radius assessments and waste investigation steps. The capability is invisible in a demo and load-bearing in production.&lt;/li&gt;
&lt;/ul&gt;
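
&lt;p&gt;The dependency-graph point is easier to judge with a concrete picture of what "blast radius" means. The sketch below assumes a hypothetical call graph held as a plain dictionary and walks it backwards, from a failing service to everything that depends on it. The service names and the &lt;code&gt;blast_radius&lt;/code&gt; function are illustrative only; production tools keep this topology in a graph store rather than in memory.&lt;/p&gt;

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical call graph: each key calls the services in its list.
CALLS: Dict[str, List[str]] = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(calls: Dict[str, List[str]], failing: str) -> Set[str]:
    """Return every service that directly or transitively calls `failing`."""
    # Invert the edges so we can walk from a callee to its callers.
    callers: Dict[str, List[str]] = {service: [] for service in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    # Breadth-first search upstream; `seen` also protects against cycles.
    seen: Set[str] = {failing}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen - {failing}


if __name__ == "__main__":
    print(sorted(blast_radius(CALLS, "inventory")))
```

&lt;p&gt;Without the graph, an agent investigating an &lt;code&gt;inventory&lt;/code&gt; failure has no principled reason to look at &lt;code&gt;web&lt;/code&gt;, which is why the capability rarely shows up in a single-service demo.&lt;/p&gt;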

&lt;h2&gt;
  
  
  How to evaluate an AI SRE in 14 days
&lt;/h2&gt;

&lt;p&gt;The following two-week evaluation plan fits inside a single procurement quarter and maps directly to the Five-Capability AI SRE Test.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Day 1 to 2: Score the shortlist on the Five-Capability Test.&lt;/strong&gt; Take the five capabilities (multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, structured root-cause output) and score every shortlisted tool 0 to 3 on each axis. Drop any tool that scores below 6 out of 15.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3 to 4: Resolve the three procurement questions.&lt;/strong&gt; Answer in writing: does the agent investigate or just summarise; where does inference run; what is the remediation boundary. Match the deployment tier to the strictest constraint, not the average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5 to 7: Run a sandboxed proof of value.&lt;/strong&gt; Pick one real incident from the last 30 days. Replay it against the top two shortlisted tools using a non-production cloud key and a sandbox cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 8 to 9: Run the security review.&lt;/strong&gt; Walk security through each tool's data path: what telemetry leaves the customer perimeter, what is anonymised before LLM calls, what the read or write capability boundary is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 10 to 11: Pilot with one team.&lt;/strong&gt; Route a defined subset of alerts (one severity tier, one service domain) into the tool in read-only investigation mode. Do not enable remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 12 to 13: Stage trust separately.&lt;/strong&gt; Read-only investigation is one trust decision. PR-based suggestions are the second. Sandboxed in-cluster execution is the third. Most teams stage these over six to twelve months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 14: Decide on five inputs.&lt;/strong&gt; The inputs are the Five-Capability Test score, the three-question filter answers, the investigation quality observed during the pilot, total cost of ownership at projected incident volume, and security review status.&lt;/li&gt;
&lt;/ol&gt;
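
&lt;p&gt;The Day 1 to 2 scoring step is simple arithmetic, and a short script keeps it consistent across reviewers. The sketch below assumes each tool is scored 0 to 3 on the five capabilities and dropped when the total falls below 6 out of 15, as described in the list. The tool names and scores are made-up placeholders, and &lt;code&gt;total_score&lt;/code&gt; and &lt;code&gt;shortlist&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
from typing import Dict, List

CAPABILITIES = (
    "multi_step_investigation",
    "infrastructure_tool_execution",
    "dependency_graph_awareness",
    "knowledge_base_rag",
    "structured_root_cause_output",
)
MAX_PER_AXIS = 3   # each axis is scored 0 to 3
MIN_TOTAL = 6      # tools scoring below 6 out of 15 are dropped


def total_score(scores: Dict[str, int]) -> int:
    """Sum the five axis scores, rejecting missing or out-of-range values."""
    for name in CAPABILITIES:
        if name not in scores:
            raise ValueError("missing score for " + name)
        if scores[name] not in range(MAX_PER_AXIS + 1):
            raise ValueError("score for " + name + " must be 0 to 3")
    return sum(scores[name] for name in CAPABILITIES)


def shortlist(candidates: Dict[str, Dict[str, int]]) -> List[str]:
    """Return surviving tools, highest total first, ties broken by name."""
    totals = {tool: total_score(scores) for tool, scores in candidates.items()}
    kept = [tool for tool, total in totals.items() if total >= MIN_TOTAL]
    return sorted(kept, key=lambda tool: (-totals[tool], tool))


def as_scores(values: List[int]) -> Dict[str, int]:
    """Convenience: map five numbers onto the capability names in order."""
    return dict(zip(CAPABILITIES, values))


# Placeholder candidates with made-up scores, for illustration only.
CANDIDATES = {
    "tool_a": as_scores([3, 3, 2, 2, 3]),
    "tool_b": as_scores([1, 1, 0, 1, 2]),
    "tool_c": as_scores([2, 1, 1, 1, 1]),
}

if __name__ == "__main__":
    print(shortlist(CANDIDATES))
```

&lt;p&gt;Keeping the per-axis scores rather than only the total matters for the later steps: a tool can clear the threshold while scoring zero on dependency-graph awareness, and that gap should be visible at the Day 14 decision.&lt;/p&gt;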

&lt;h2&gt;
  
  
  Where this guide fits
&lt;/h2&gt;

&lt;p&gt;This is the short definitional reference. For deeper material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE: The Complete Guide for Engineering Teams in 2026&lt;/a&gt;, procurement and adoption arc.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;, full capability matrix.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt;, deployment-tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;, three-way comparison.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison&lt;/a&gt;, two-way head-to-head.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation: The Complete Guide for SRE Teams&lt;/a&gt;, investigation-pattern detail.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;What is Agentic Incident Management?&lt;/a&gt;, category framing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is an AI SRE in simple terms?&lt;/strong&gt;&lt;br&gt;
An AI SRE is a multi-step LLM agent that investigates production incidents. It reads alerts, runs infrastructure commands such as kubectl or cloud SDK calls, queries observability backends, and produces a structured root-cause analysis. It augments a human site reliability engineer rather than replacing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is an AI SRE different from AIOps?&lt;/strong&gt;&lt;br&gt;
AIOps is a Gartner category dating from 2016, built on statistical alert clustering and anomaly detection. An AI SRE is downstream of that: once an alert lands, the AI SRE uses an LLM to reason through it step by step, calling tools to gather new evidence. Mature teams typically run both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is an AI SRE the same as an incident-management chatbot?&lt;/strong&gt;&lt;br&gt;
No. A chatbot inside Rootly, incident.io, FireHydrant, or PagerDuty drafts Slack updates and summarises artefacts the team already has. An AI SRE generates those artefacts by investigating the incident from telemetry. The two categories cooperate but do not substitute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will AI replace SREs?&lt;/strong&gt;&lt;br&gt;
No. Investor framing across Sequoia, Kleiner, Lightspeed, and Felicis-backed AI SRE companies in 2025 to 2026 has consistently been agent-as-first-triage with a human in the loop for scope, trust, capacity, and SLO decisions. The deskilling risk is real and discussed in industry essays such as the SigNoz newsletter; the headcount-replacement claim is not part of the category thesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the main open-source AI SRE tools in 2026?&lt;/strong&gt;&lt;br&gt;
Three projects dominate. HolmesGPT (Apache 2.0, CNCF Sandbox since 8 October 2025, Kubernetes-first, 2.5k GitHub stars per the about box in May 2026). K8sGPT (Apache 2.0, CNCF Sandbox since 19 December 2023, Kubernetes diagnostics, 7.8k GitHub stars per the about box in May 2026). Aurora by Arvo AI (Apache 2.0, multi-cloud, sandboxed command execution).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does an AI SRE handle security and data privacy?&lt;/strong&gt;&lt;br&gt;
Practice varies by tool. HolmesGPT operates with read-only access that respects RBAC and is documented as safe to run in production. K8sGPT anonymises cluster object names and labels before sending data to the AI backend. Aurora supports air-gapped deployment with local LLMs through Ollama. Most commercial AI SREs run inference on vendor-managed infrastructure, which is the gating constraint for regulated buyers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long does an AI SRE take to deploy?&lt;/strong&gt;&lt;br&gt;
An open-source AI SRE can be up and running within a single afternoon using a Docker Compose or Helm install, with one cloud and one monitoring integration connected. Production rollout, including secret rotation, RBAC scoping, runbook ingestion, and Slack integration, takes two to four weeks for most teams. Closed-loop remediation is staged separately, three to twelve months after read-only operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does an AI SRE cost?&lt;/strong&gt;&lt;br&gt;
Open-source AI SREs are free at the licence layer; the running cost is infrastructure plus LLM inference. Self-hosted Aurora with a local Ollama model removes per-token API fees, although the hardware that runs the model still counts as infrastructure cost. Commercial AI SREs price either per-seat or per-investigation. Resolve.ai and Traversal price by custom contract; PagerDuty and Datadog bundle their AI SRE features into existing platform tiers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can an AI SRE run in an air-gapped environment?&lt;/strong&gt;&lt;br&gt;
Yes, for a small set of tools. Aurora supports air-gapped deployment with Ollama or vLLM for local inference. HolmesGPT supports self-hosted LLM endpoints. K8sGPT supports local backends including Ollama and LocalAI. Most commercial AI SREs require outbound calls to a vendor-managed inference plane and do not satisfy air-gapped procurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does an AI SRE not do?&lt;/strong&gt;&lt;br&gt;
It does not set SLOs, define what counts as an incident, run capacity planning, facilitate a postmortem with the affected team, or own the customer relationship during a major outage. It is a tool for evidence-gathering and first-pass reasoning, not for the judgment work that defines the site reliability discipline.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;arvoai.ca/blog/what-is-an-ai-sre&lt;/a&gt;. Aurora by Arvo AI is open-source on &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; under Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison for SRE Teams</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 21 May 2026 23:43:47 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/holmesgpt-vs-k8sgpt-a-2026-head-to-head-comparison-for-sre-teams-2aco</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/holmesgpt-vs-k8sgpt-a-2026-head-to-head-comparison-for-sre-teams-2aco</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT and K8sGPT are both Apache 2.0, both CNCF Sandbox, and both branded as AI for SRE work, but they solve different problems.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; is an investigation agent that runs across "any infrastructure - VMs, bare metal, cloud services, or containers." &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; is a Kubernetes diagnostics tool: "a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub adoption signals diverge sharply.&lt;/strong&gt; As of May 2026, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT shows 7.8k stars and 996 forks&lt;/a&gt;, written 98.9 percent in Go. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT shows 2.5k stars and 347 forks&lt;/a&gt;, written 84.5 percent in Python. K8sGPT had a head start of nearly two years (CNCF Sandbox 19 December 2023 vs HolmesGPT 8 October 2025).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution model differs.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT operates with read-only access that "respects RBAC permissions"&lt;/a&gt;, plus a separate &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;Operator Mode that "can open PRs to fix the problems it finds"&lt;/a&gt; through the GitHub MCP integration. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; runs as a CLI scanner or in-cluster operator with a 30-second default reconciliation interval (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-operator&lt;/a&gt;) and anonymises Kubernetes object names and labels before any LLM call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM backend lists overlap heavily and diverge at the edges.&lt;/strong&gt; Both projects register Anthropic, OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Ollama as backends. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;K8sGPT's source registers a broader enterprise set&lt;/a&gt;: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT documents a broader developer-tooling set&lt;/a&gt;: GitHub Copilot, GitHub Models, Azure AI Foundry, OpenRouter, Robusta AI, and OpenAI-Compatible (LiteLLM proxy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance shapes the trust story.&lt;/strong&gt; HolmesGPT's project entity is &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT a Series of LF Projects, LLC&lt;/a&gt;; the project was &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;originally created by Robusta.dev with major contributions from Microsoft&lt;/a&gt;. The June 2024 CNCF post on K8sGPT states that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF Blog, 7 June 2024&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a strict comparison of two open-source projects that are often grouped together because both attach AI to Kubernetes work, both are CNCF Sandbox, and both are Apache 2.0. Past that, they target different problems with different runtimes, different backends, and different governance. Every claim below is cited to a primary source: the project's GitHub repository, its official docs site, or a CNCF page. No quote is paraphrased from third-party blog posts.&lt;/p&gt;

&lt;p&gt;A note on bias. Arvo builds &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, a separate open-source AI SRE listed alongside HolmesGPT and K8sGPT in our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;three-way comparison&lt;/a&gt;. This page intentionally excludes Aurora from the main comparison except for a small section at the end.&lt;/p&gt;

&lt;p&gt;We call the rubric used below the &lt;strong&gt;Open-Source AI SRE Decision Matrix&lt;/strong&gt;. Six axes, each evaluated against the project's own primary documentation, no third-party claims. The six axes are: stated scope, execution model, continuous operation, LLM provider breadth, Model Context Protocol direction (host vs consume), and project governance. Every cell in the comparison table that follows maps back to one of these six axes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is HolmesGPT?
&lt;/h2&gt;

&lt;p&gt;HolmesGPT describes itself as an &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Open-source AI agent for investigating production incidents and finding root causes"&lt;/a&gt;. Repository statistics on the project's about box in May 2026 show 2.5k stars, 347 forks, and Python at 84.5 percent of the codebase (&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Scope is cross-infrastructure: "Open-source SRE agent for investigating production incidents across any infrastructure - Kubernetes, VMs, cloud services, databases, and more" (&lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;holmesgpt.dev&lt;/a&gt;). The same point is made on the project repository: "No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers" (&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Governance is shared between two entities. Origin attribution: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Originally created by Robusta.Dev, with major contributions from Microsoft"&lt;/a&gt;. The project's legal entity is named on the docs site: &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;"HolmesGPT a Series of LF Projects, LLC"&lt;/a&gt;. CNCF acceptance is documented at "October 8, 2025 at the Sandbox maturity level" (&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;cncf.io/projects/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The latest release at time of writing is &lt;strong&gt;v0.30.1 on 20 May 2026&lt;/strong&gt; per the &lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;. The release notes for v0.30.1 mention Loki raw response handling on parse failure, a GitLab MCP entry in the datasource catalog, a Bash echo allowlist fix, and user_email persistence on chat requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is K8sGPT?
&lt;/h2&gt;

&lt;p&gt;K8sGPT describes itself as &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;"a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI"&lt;/a&gt;. Repository statistics on the project's about box in May 2026 show 7.8k stars, 996 forks, and Go at 98.9 percent of the codebase (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;github.com/k8sgpt-ai/k8sgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Scope is explicitly Kubernetes. The project makes no claim of non-Kubernetes runtime support. The marketing site at &lt;a href="https://k8sgpt.ai/" rel="noopener noreferrer"&gt;k8sgpt.ai&lt;/a&gt; carries the tagline "K8sGPT - Giving Kubernetes Superpowers to Everyone."&lt;/p&gt;

&lt;p&gt;Governance is community-led. The 7 June 2024 CNCF blog (Dotan Horovits) states: "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF Blog&lt;/a&gt;). CNCF acceptance is documented at "December 19, 2023 at the Sandbox maturity level" (&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;cncf.io/projects/k8sgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The latest release at time of writing is &lt;strong&gt;v0.4.33 on 13 May 2026&lt;/strong&gt; per the &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.33" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;. Recent feature releases include v0.4.27 (mcp v2, 18 December 2025), v0.4.32 (Azure API type support and custom HTTP header, 22 April 2026), and v0.4.33 (analyze previous logs for restarted containers, 13 May 2026).&lt;/p&gt;

&lt;h2&gt;
  
  
  At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt/blob/master/LICENSE" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;Sandbox, 8 October 2025&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;Sandbox, 19 December 2023&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars (May 2026)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;2.5k&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;7.8k&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python (84.5%)&lt;/td&gt;
&lt;td&gt;Go (98.9%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stated scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Any infrastructure - VMs, bare metal, cloud services, or containers"&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Kubernetes clusters&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operating model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step investigation agent + optional 24/7 &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;Operator Mode&lt;/a&gt; (Alpha)&lt;/td&gt;
&lt;td&gt;Scanner CLI + &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-operator&lt;/a&gt; for continuous in-cluster runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default permission model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Read-only access and respects RBAC permissions"&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Diagnoses; anonymises sensitive data before AI calls&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;open GitHub PRs via the GitHub MCP integration in Operator Mode&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;None documented&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;MCP-based integrations for AWS, Azure, GCP, GitHub, GitLab, Jenkins, Kubernetes Remediation, Sentry, Splunk, MariaDB, Prefect&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Hosts an MCP server exposing 12 tools and 3 resources for Kubernetes operations&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, Robusta AI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;Anthropic, OpenAI, Azure OpenAI, AWS Bedrock (and Bedrock Converse), Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, Custom REST&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest release at writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;v0.30.1, 20 May 2026&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.33" rel="noopener noreferrer"&gt;v0.4.33, 13 May 2026&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Founding entity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;Originally Robusta.dev, major Microsoft contributions&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Community-led, no commercial backer per &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What is the scope difference between HolmesGPT and K8sGPT?
&lt;/h2&gt;

&lt;p&gt;This is the load-bearing axis on the Open-Source AI SRE Decision Matrix, and the easiest one for teams to get wrong.&lt;/p&gt;

&lt;p&gt;K8sGPT is, by stated scope, a Kubernetes diagnostics tool. The &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/tree/main/pkg/analyzer" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/analyzer&lt;/code&gt;&lt;/a&gt; folder ships analysers for around 29 Kubernetes resource types as of May 2026, with a documented "default" subset (Pod, PVC, ReplicaSet, Service, Event, Ingress, StatefulSet, Deployment, Job, CronJob, Node, MutatingWebhook, ValidatingWebhook, ConfigMap) and an extended set covering HPA, PDB, NetworkPolicy, Gateway, GatewayClass, HTTPRoute, Log, Storage, Security, plus OLM-related resources (CatalogSource, ClusterServiceVersion, Subscription, etc.). Every analyser is scoped to a Kubernetes resource type. A team running on bare VMs, on managed cloud services without Kubernetes, or on a mainframe is not the K8sGPT audience.&lt;/p&gt;

&lt;p&gt;HolmesGPT rebuts the Kubernetes-only assumption directly: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers"&lt;/a&gt;. Its data-source catalogue, visible in the &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs navigation&lt;/a&gt;, covers VM-era systems alongside Kubernetes-era ones: Bash, ClickHouse, MariaDB (via MCP), Confluence, Sentry, plus Kubernetes resources and Helm. The Operator Mode page also frames non-Kubernetes scope: &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;"While the operator itself runs in Kubernetes, health checks can query any data source Holmes is connected to - VMs, cloud services, databases, SaaS platforms"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For SRE teams whose estate is entirely Kubernetes, this difference is academic. For teams that still run managed databases outside Kubernetes (RDS, Cloud SQL, Aurora), VM workloads, or third-party SaaS at incident-critical positions in the stack, K8sGPT cannot reach those resources without integration glue, and HolmesGPT can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can HolmesGPT or K8sGPT execute commands against my cluster?
&lt;/h2&gt;

&lt;p&gt;Both projects default to read-oriented behaviour, though they phrase it differently.&lt;/p&gt;

&lt;p&gt;HolmesGPT is explicit: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"By design, HolmesGPT has read-only access and respects RBAC permissions. It is safe to run in production environments"&lt;/a&gt;. The Operator Mode page describes how the read-only default is preserved while a separate write path opens: &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;"Connect the GitHub MCP server so Holmes can open PRs to fix the problems it finds - not just report them"&lt;/a&gt;. Writes do not happen against the cluster; they happen against the user's Git repository, where humans approve the change.&lt;/p&gt;

&lt;p&gt;K8sGPT does not use the phrase "read-only" in its repository documentation, but its operational profile is similar: the tool scans cluster state through Kubernetes APIs and feeds analyser output to an LLM. Anonymisation happens before the LLM call: &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;"the data is anonymized before being sent to the AI Backend... k8sgpt retrieves sensitive data (Kubernetes object names, labels, etc.). This data is masked when sent to the AI backend"&lt;/a&gt;. The same primary source also notes that anonymisation "does not currently apply to events" and that certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count) are not masked. The trade-off is openly disclosed. The masking implementation lives in &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/util/util.go" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/util/util.go&lt;/code&gt;&lt;/a&gt; as the &lt;code&gt;MaskString&lt;/code&gt; function.&lt;/p&gt;
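&lt;p&gt;To make the masking idea concrete, here is a minimal Python sketch of token substitution before an LLM call. It is not K8sGPT's &lt;code&gt;MaskString&lt;/code&gt; implementation (that code is Go and lives in the file linked above). The helper names &lt;code&gt;mask_value&lt;/code&gt; and &lt;code&gt;unmask_text&lt;/code&gt;, the &lt;code&gt;masked-&lt;/code&gt; prefix, and the pod name are all hypothetical, chosen only for illustration.&lt;/p&gt;

```python
import secrets


def mask_value(value, mapping):
    # Reuse the token if this value was masked before, so a repeated
    # object name stays consistent within one prompt.
    for token, original in mapping.items():
        if original == value:
            return token
    # Hypothetical token format: fixed prefix plus 8 random hex characters.
    token = "masked-" + secrets.token_hex(4)
    mapping[token] = value
    return token


def unmask_text(text, mapping):
    # Put the original values back into text returned by the model.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


mapping = {}
pod_name = "payments-api-7d9f"  # hypothetical Kubernetes object name
prompt = "Pod " + mask_value(pod_name, mapping) + " is in CrashLoopBackOff"
restored = unmask_text(prompt, mapping)
```

&lt;p&gt;The sketch only covers values that are explicitly passed through the masking helper. Free-text fields such as event messages would go out unmasked, which is the same class of gap the K8sGPT documentation discloses.&lt;/p&gt;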

&lt;h2&gt;
  
  
  How does continuous operation differ between the two operators?
&lt;/h2&gt;

&lt;p&gt;Both projects have an in-cluster operator, and again the framing differs.&lt;/p&gt;

&lt;p&gt;HolmesGPT's Operator Mode is a 24/7 background agent: "HolmesGPT runs in the background 24/7, spots problems before your customers notice, and messages you in Slack with the fix" (&lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;holmesgpt.dev/latest/operator&lt;/a&gt;). The docs note its architecture: "a lightweight kopf-based controller handles CRD orchestration and scheduling, while stateless Holmes API servers execute the actual checks." The same page carries an explicit "Holmes Operator - Alpha Release" warning, and includes a cost caution: "Begin with infrequent schedules (e.g., hourly or daily) and monitor usage before scaling up."&lt;/p&gt;

&lt;p&gt;K8sGPT's operator (a separate repo, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-ai/k8sgpt-operator&lt;/a&gt;) is a continuous scanner: "This Operator is designed to enable K8sGPT within a Kubernetes cluster... It will allow you to create a custom resource that defines the behaviour and scope of a managed K8sGPT workload." The &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;default reconciliation interval is 30 seconds&lt;/a&gt;, enforced in the controller code (&lt;code&gt;ReconcileSuccessInterval = 30 * time.Second&lt;/code&gt;). Output goes to in-cluster Result CRDs, with optional Slack, Mattermost, and CloudEvents sinks. Prometheus and Grafana integration is exposed through ServiceMonitor and dashboard parameters (&lt;a href="https://docs.k8sgpt.ai/reference/operator/overview/" rel="noopener noreferrer"&gt;k8sgpt-operator docs&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Architecturally, HolmesGPT's Operator Mode is incident-shaped: checks run on an alert or on a schedule. K8sGPT's operator is poll-shaped: it reconciles every 30 seconds by default and surfaces anomalies.&lt;/p&gt;
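&lt;p&gt;The difference in cadence is easy to quantify. The Python sketch below counts how often a fixed-interval loop fires in a day, using the 30-second K8sGPT operator default and the hourly schedule suggested for HolmesGPT Operator Mode. The function name is hypothetical, and the result is an upper bound on loop iterations, not a count of LLM calls: how many iterations actually reach the LLM depends on configuration that this sketch does not model.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def runs_per_day(interval_seconds):
    # Number of times a fixed-interval loop fires in 24 hours.
    return SECONDS_PER_DAY // interval_seconds


k8sgpt_default_runs = runs_per_day(30)     # 30-second reconcile interval
holmes_hourly_runs = runs_per_day(60 * 60)  # hourly schedule
```

&lt;p&gt;That is 2,880 iterations a day against 24, which is why token consumption is worth watching separately for each operator.&lt;/p&gt;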

&lt;h2&gt;
  
  
  Which LLM providers does each tool support?
&lt;/h2&gt;

&lt;p&gt;Both projects support multiple LLM backends. The lists overlap heavily on the headline providers and diverge at the edges.&lt;/p&gt;

&lt;p&gt;K8sGPT's source code at &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/ai/iai.go&lt;/code&gt;&lt;/a&gt; registers 17 backends as of May 2026: openai, anthropic, localai, ollama, azureopenai, cohereai, amazonbedrock, amazonbedrockconverse, amazonsagemaker, googleai, noopai, huggingface, googlevertexai, ocigenai, customrest, ibmwatsonxai, groq.&lt;/p&gt;

&lt;p&gt;HolmesGPT's &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs site navigation&lt;/a&gt; enumerates: Anthropic, AWS Bedrock, Azure AI Foundry, Gemini, GitHub Copilot, GitHub Models, Google Vertex AI, Ollama, OpenRouter, OpenAI, OpenAI-Compatible, Robusta AI.&lt;/p&gt;

&lt;p&gt;The shared headline providers are Anthropic, OpenAI, AWS Bedrock, Google Vertex AI, and Ollama, plus Azure under different product names (Azure OpenAI in K8sGPT, Azure AI Foundry in HolmesGPT). K8sGPT's edge list leans enterprise: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. HolmesGPT's edge list leans developer-tooling: GitHub Copilot, GitHub Models, OpenRouter, Robusta AI, and an OpenAI-Compatible (LiteLLM proxy) catch-all. The right choice usually comes from the LLM the security team has already approved, not from this list.&lt;/p&gt;
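&lt;p&gt;The overlap can be checked mechanically. The Python sketch below compares the two provider lists using the display names from the at-a-glance table (K8sGPT's &lt;code&gt;noopai&lt;/code&gt; backend is left out because it does not appear in that table). It matches names literally, so Azure OpenAI and Azure AI Foundry, or Gemini and Google GenAI, count as different entries. Whether those are equivalent for approval purposes is a judgment the team has to make.&lt;/p&gt;

```python
holmesgpt = {
    "Anthropic", "OpenAI", "Azure AI Foundry", "AWS Bedrock",
    "Google Vertex AI", "Gemini", "GitHub Copilot", "GitHub Models",
    "Ollama", "OpenRouter", "OpenAI-Compatible", "Robusta AI",
}
k8sgpt = {
    "Anthropic", "OpenAI", "Azure OpenAI", "AWS Bedrock",
    "AWS Bedrock Converse", "Amazon SageMaker", "Google Vertex AI",
    "Google GenAI", "Cohere", "Groq", "HuggingFace", "IBM watsonx",
    "Oracle OCI GenAI", "Ollama", "LocalAI", "Custom REST",
}

# Providers listed under the same name by both projects.
shared = sorted(holmesgpt.intersection(k8sgpt))
# Providers listed by only one of the two projects.
only_holmesgpt = sorted(holmesgpt.difference(k8sgpt))
only_k8sgpt = sorted(k8sgpt.difference(holmesgpt))
```

&lt;p&gt;Dropping the approved provider's name into either set comparison gives a quick first filter before the Day 3 to 4 check in the evaluation plan.&lt;/p&gt;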

&lt;h2&gt;
  
  
  How does each tool handle Model Context Protocol?
&lt;/h2&gt;

&lt;p&gt;Both projects support MCP, and again the shape differs.&lt;/p&gt;

&lt;p&gt;K8sGPT &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;ships its own MCP server&lt;/a&gt;: "K8sGPT provides a Model Context Protocol server that exposes Kubernetes operations as standardized tools for AI assistants." The server exposes "12 tools for cluster analysis, resource management, and debugging" and "3 resources for cluster information access," with "Stateless HTTP mode for one-off invocations" and "Full integration with Claude Desktop and other MCP clients." The MCP v2 feature landed in &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.27" rel="noopener noreferrer"&gt;release v0.4.27 on 18 December 2025&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;HolmesGPT consumes MCP servers as data sources rather than hosting one. The &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;data-sources catalogue&lt;/a&gt; lists MCP-labelled integrations for AWS, Azure, GitHub, GitLab, Jenkins, GCP, Kubernetes Remediation, MariaDB, Prefect, Sentry, and Splunk. The &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs navigation&lt;/a&gt; makes the consumption pattern explicit through entries like "MCP Servers" and "OAuth MCP Servers."&lt;/p&gt;

&lt;p&gt;The implication: K8sGPT publishes cluster operations for Claude Desktop and other MCP clients to consume. HolmesGPT subscribes to MCP-published tools across third-party systems. Teams building MCP-shaped workflows will pick the direction that matches their existing investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who governs each project, and how does that change the trust story?
&lt;/h2&gt;

&lt;p&gt;The CNCF Sandbox label is identical on both projects. The economic shape behind each is not.&lt;/p&gt;

&lt;p&gt;HolmesGPT is held under &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;"HolmesGPT a Series of LF Projects, LLC"&lt;/a&gt;, with origin attribution: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Originally created by Robusta.Dev, with major contributions from Microsoft"&lt;/a&gt;. &lt;a href="https://home.robusta.dev/" rel="noopener noreferrer"&gt;Robusta&lt;/a&gt; sells a managed SaaS product that integrates HolmesGPT, and Slack and Microsoft Teams integrations are flagged &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Available via Robusta"&lt;/a&gt;. This is a sponsored-open-source pattern.&lt;/p&gt;

&lt;p&gt;K8sGPT is community-led. The &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt; states: "unlike many popular projects, there is no company behind this project, and no business plan behind it." The same post names production users: "Companies like Kubermatic, SpectroCloud, and Nethopper have enthusiastically embraced K8sGPT capabilities." The project's &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/GOVERNANCE.md" rel="noopener noreferrer"&gt;&lt;code&gt;GOVERNANCE.md&lt;/code&gt;&lt;/a&gt; further codifies the model: "No single vendor may control project direction."&lt;/p&gt;

&lt;p&gt;Neither shape is structurally better. Sponsored open source ships polish and integrations faster; community open source is harder to commercially deprecate. Match the governance to the team's risk model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release cadence and recent feature deltas
&lt;/h2&gt;

&lt;p&gt;HolmesGPT shipped v0.30.1 on 20 May 2026, with notes for the release covering Loki raw-response handling on parse failure, a GitLab MCP datasource entry, a Bash echo allowlist fix, user_email persistence on chat requests, and documentation refinements (&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;release tag&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;K8sGPT's recent releases include v0.4.33 ("analyze previous logs for restarted containers," 13 May 2026), v0.4.32 ("add Azure API Type Support and add Custom HTTP Header," 22 April 2026), and v0.4.27 ("mcp v2," 18 December 2025) (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Both projects ship monthly or near-monthly. Neither has demonstrated a multi-month pause in the period documented.&lt;/p&gt;

&lt;h2&gt;
  
  
  What HolmesGPT and K8sGPT are NOT
&lt;/h2&gt;

&lt;p&gt;Five misreadings of this comparison show up repeatedly in vendor briefings and procurement memos. Naming them in advance saves a procurement cycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neither is an alerting platform.&lt;/strong&gt; Alerts originate in Prometheus AlertManager, Grafana, Datadog, CloudWatch, or PagerDuty. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT fetches alerts from "AlertManager, PagerDuty, OpsGenie, or Jira"&lt;/a&gt;; K8sGPT integrates downstream of Prometheus alert rules. Buying either tool does not solve "we have too many or too few alerts."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neither is a full AIOps platform.&lt;/strong&gt; AIOps is a 2017-era category built on statistical correlation and noise reduction. Both tools sit downstream of that layer: once an alert lands, the agent investigates. Teams running BigPanda, Moogsoft, Dynatrace Davis, or PagerDuty Intelligent Alert Grouping should not expect either project to replace those products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neither is a managed SaaS by default.&lt;/strong&gt; Both are open-source projects requiring self-hosting. &lt;a href="https://home.robusta.dev/" rel="noopener noreferrer"&gt;Robusta&lt;/a&gt; sells a managed product around HolmesGPT, which is the closest commercial offering. K8sGPT has no commercial entity behind it per the &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt;. A team that needs a vendor SOC 2 report against the open-source binary itself will not find one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT is not a multi-cloud reasoning tool.&lt;/strong&gt; Its &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;analysers map one-to-one to Kubernetes resource types&lt;/a&gt;. A managed RDS, a Datadog dashboard, or an OVH Bare Metal instance is invisible to K8sGPT's analysers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT is not a deterministic rules engine.&lt;/strong&gt; Its agent loop uses LLM tool-calling, which means investigation paths are non-deterministic and depend on the LLM provider and prompt context. Teams that need bit-for-bit reproducible incident analysis should set their expectations by the agent pattern, not by what a runbook executor delivers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When should I choose HolmesGPT vs K8sGPT?
&lt;/h2&gt;

&lt;p&gt;Pick &lt;strong&gt;HolmesGPT&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The estate spans more than Kubernetes (VMs, managed databases, SaaS platforms at incident-critical positions).&lt;/li&gt;
&lt;li&gt;The LLM choice is GitHub Copilot, GitHub Models, OpenRouter, or Robusta AI (HolmesGPT-specific).&lt;/li&gt;
&lt;li&gt;The team wants a 24/7 background agent that can post to Slack and open GitHub PRs through MCP integration. Note that Operator Mode is marked as an Alpha release at time of writing.&lt;/li&gt;
&lt;li&gt;The team values an explicit, project-documented "read-only access and respects RBAC" guarantee.&lt;/li&gt;
&lt;li&gt;A managed SaaS option (via Robusta) is acceptable or attractive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick &lt;strong&gt;K8sGPT&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The estate is Kubernetes-first or Kubernetes-only.&lt;/li&gt;
&lt;li&gt;The team wants a Go binary that runs as a CLI and an in-cluster operator out of the box.&lt;/li&gt;
&lt;li&gt;The LLM choice is IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, or Amazon SageMaker (K8sGPT-specific).&lt;/li&gt;
&lt;li&gt;The team plans to publish cluster operations to MCP clients (Claude Desktop, custom tooling) rather than to consume external MCP services.&lt;/li&gt;
&lt;li&gt;The team wants documented anonymisation of cluster object names and labels before LLM calls.&lt;/li&gt;
&lt;li&gt;The team prefers a community-governed project with no commercial entity behind it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two are not directly substitutable for most teams. They are adjacent tools that can plausibly run alongside one another in a Kubernetes-heavy estate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose between HolmesGPT and K8sGPT in 14 days
&lt;/h2&gt;

&lt;p&gt;This two-week evaluation plan helps a team pick between HolmesGPT and K8sGPT, or confirm that it needs both. Every step produces a concrete deliverable that a procurement reviewer can sign off.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Day 1 to 2: Scope your estate.&lt;/strong&gt; List every system that hosts incident-relevant state: Kubernetes clusters, VMs, managed databases, third-party SaaS, on-prem hardware. If the answer is Kubernetes plus one or two managed services, K8sGPT alone may cover it. If non-Kubernetes systems sit at incident-critical positions in the stack, HolmesGPT's stated "any infrastructure" scope is the better fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3 to 4: Confirm the LLM standard.&lt;/strong&gt; Identify the LLM provider the team is already approved to use. Cross-check against each project's published backend list. Both list Anthropic, OpenAI, AWS Bedrock, Google Vertex AI, and Ollama, plus an Azure option (Azure OpenAI in K8sGPT, Azure AI Foundry in HolmesGPT). K8sGPT adds enterprise-leaning options (IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker). HolmesGPT adds developer-tooling options (GitHub Copilot, GitHub Models, OpenRouter, Robusta AI, OpenAI-Compatible proxy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5 to 6: Install both in a dev cluster.&lt;/strong&gt; Install K8sGPT via brew or its Helm chart (&lt;code&gt;helm repo add k8sgpt https://charts.k8sgpt.ai/&lt;/code&gt;) and the k8sgpt-operator. Install HolmesGPT via the official Helm chart documented at holmesgpt.dev. Connect a non-production LLM key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 7 to 8: Run a known-bad scenario.&lt;/strong&gt; Trigger a documented failure (CrashLoopBackOff, OOMKilled, ImagePullBackOff) in the dev cluster. Capture each tool's full output: time to first useful finding, false positives, and signal-to-noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 9 to 10: Assess the trust surface.&lt;/strong&gt; Walk security through the read model. HolmesGPT operates with read-only access plus RBAC. K8sGPT anonymises cluster object names and labels but does not mask certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count). Get a written sign-off on each tool's data path before any production read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 11 to 12: Test the operator behaviour.&lt;/strong&gt; Enable HolmesGPT Operator Mode on an infrequent schedule (hourly, since Operator Mode is Alpha) and enable the K8sGPT operator at its 30-second default. Watch LLM token consumption and alert volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 13 to 14: Pick one or both.&lt;/strong&gt; There are three valid outcomes. (1) Pick K8sGPT alone if the estate is Kubernetes-only and the team needs continuous posture. (2) Pick HolmesGPT alone if the estate is multi-platform and the team values 24/7 Operator Mode with GitHub PR opening. (3) Pick both if the estate is Kubernetes-heavy and the team wants continuous posture (K8sGPT) plus incident investigation (HolmesGPT).&lt;/li&gt;
&lt;/ol&gt;
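&lt;p&gt;The Day 13 to 14 outcomes can be written down as a small decision function so the result is recorded the same way for every team. The Python sketch below is one hypothetical encoding: the function name, the three estate labels, and the two boolean flags are assumptions, and the 24/7 Operator Mode preference is folded into a single &lt;code&gt;wants_investigation&lt;/code&gt; flag as a simplification. Any combination the plan does not name returns "undecided" instead of guessing.&lt;/p&gt;

```python
def recommend(estate, wants_posture, wants_investigation):
    # estate is one of: "kubernetes-only", "kubernetes-heavy", "multi-platform"
    if estate == "kubernetes-only" and wants_posture:
        return "K8sGPT"
    if estate == "multi-platform" and wants_investigation:
        return "HolmesGPT"
    if estate == "kubernetes-heavy" and wants_posture and wants_investigation:
        return "both"
    # Combinations the plan does not cover need a human decision.
    return "undecided"


outcome_one = recommend("kubernetes-only", True, False)
outcome_two = recommend("multi-platform", False, True)
outcome_three = recommend("kubernetes-heavy", True, True)
```

&lt;p&gt;An "undecided" result is a prompt to revisit the findings from the earlier days, not a verdict against either tool.&lt;/p&gt;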

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;Aurora by Arvo AI is a separate Apache 2.0 project at &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;. Compared to the two projects above, Aurora ships multi-cloud investigation (AWS, Azure, GCP, OVH, Scaleway, Kubernetes), a Memgraph-backed infrastructure dependency graph, hybrid (BM25 plus vector) RAG over runbooks and postmortems via Weaviate, and sandboxed &lt;code&gt;kubectl&lt;/code&gt; execution into an isolated "untrusted" namespace with a four-layer command-safety pipeline (input rail, &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; signature match, per-org policy, LLM safety judge).&lt;/p&gt;

&lt;p&gt;A team can run all three. The most common pattern in 2026 design-partner conversations is K8sGPT for continuous in-cluster posture, HolmesGPT or Aurora for incident investigation, and Aurora for the multi-cloud and remediation-staging path that K8sGPT does not target. For the full three-way comparison see &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this guide fits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;, full capability matrix including commercial entrants.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;, three-way comparison.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt;, the deployment-tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;What is an AI SRE?&lt;/a&gt;, the definitional reference.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt;, the sandboxing pattern that distinguishes investigation from remediation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between HolmesGPT and K8sGPT?&lt;/strong&gt;&lt;br&gt;
HolmesGPT is an AI agent for investigating production incidents across any infrastructure including VMs, bare metal, cloud services, and containers. K8sGPT is a tool for scanning Kubernetes clusters and diagnosing issues in simple English, scoped to Kubernetes resources only. Both are Apache 2.0 and CNCF Sandbox projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which is more popular on GitHub, HolmesGPT or K8sGPT?&lt;/strong&gt;&lt;br&gt;
As of May 2026, the K8sGPT about box on github.com/k8sgpt-ai/k8sgpt shows 7.8k stars and 996 forks. The HolmesGPT about box on github.com/HolmesGPT/holmesgpt shows 2.5k stars and 347 forks. K8sGPT had a head start of nearly two years in the CNCF Sandbox: it joined on 19 December 2023, while HolmesGPT joined on 8 October 2025.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can HolmesGPT or K8sGPT execute commands against my cluster?&lt;/strong&gt;&lt;br&gt;
HolmesGPT operates with read-only access and respects RBAC permissions. The HolmesGPT docs describe an Operator Mode that can open GitHub pull requests via the GitHub MCP server, but those writes happen against the user's Git repository, not directly against the cluster. K8sGPT scans Kubernetes resources and anonymises object names and labels before sending data to its AI backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which LLM providers does each tool support?&lt;/strong&gt;&lt;br&gt;
Both projects support the headline providers. K8sGPT's source registers 17 backends including Anthropic, OpenAI, Azure OpenAI, AWS Bedrock and Bedrock Converse, Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, and a Custom REST endpoint. HolmesGPT supports Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, and Robusta AI. K8sGPT's edge providers lean enterprise (watsonx, OCI, Cohere); HolmesGPT's lean developer tooling (Copilot, Models, OpenRouter).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do HolmesGPT and K8sGPT both support MCP?&lt;/strong&gt;&lt;br&gt;
Yes, but in different directions. K8sGPT hosts a Model Context Protocol server that exposes 12 tools and 3 resources for cluster analysis, with full integration with Claude Desktop and other MCP clients. The MCP v2 feature shipped in v0.4.27 on 18 December 2025. HolmesGPT consumes MCP-exposed tools as data sources, including AWS, Azure, GCP, GitHub, GitLab, Jenkins, MariaDB, Prefect, Sentry, and Splunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are HolmesGPT and K8sGPT both CNCF projects?&lt;/strong&gt;&lt;br&gt;
Both are CNCF Sandbox projects. The cncf.io project pages document HolmesGPT accepted on 8 October 2025 and K8sGPT accepted on 19 December 2023. Sandbox is the entry tier for CNCF projects and indicates the project is in an early stage relative to Incubating and Graduated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a company behind HolmesGPT or K8sGPT?&lt;/strong&gt;&lt;br&gt;
HolmesGPT is held under "HolmesGPT a Series of LF Projects, LLC" and was originally created by Robusta.dev with major contributions from Microsoft. Robusta sells a managed SaaS product that integrates HolmesGPT. K8sGPT is community-led; the 7 June 2024 CNCF blog states that, unlike many popular projects, there is no company behind K8sGPT and no business plan behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which project is updated more often?&lt;/strong&gt;&lt;br&gt;
Both projects ship monthly or near-monthly. HolmesGPT's latest release at writing is v0.30.1 on 20 May 2026. K8sGPT's latest release at writing is v0.4.33 on 13 May 2026. Both Releases pages on GitHub show consistent 2025 to 2026 cadence with no documented multi-month pause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can HolmesGPT or K8sGPT run air-gapped?&lt;/strong&gt;&lt;br&gt;
Both projects support local LLM inference. K8sGPT's auth list includes localai and ollama, and the K8sGPT team recommends using a local model in critical production environments. HolmesGPT's docs nav lists Ollama and OpenAI-Compatible providers, which covers self-hosted LLM endpoints. The agent runtime and the LLM together must run inside the customer perimeter to claim air-gapped deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use HolmesGPT and K8sGPT together?&lt;/strong&gt;&lt;br&gt;
Yes. K8sGPT is built as a continuous in-cluster scanner with a 30-second default reconciliation interval. HolmesGPT runs as an incident-driven investigation agent that can also operate 24/7 in Operator Mode (Alpha). A common 2026 pattern is to use K8sGPT for continuous cluster scanning and HolmesGPT for incident investigation, with results routed to the same Slack channels or ticket systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;arvoai.ca/blog/holmesgpt-vs-k8sgpt&lt;/a&gt;. Aurora by Arvo AI is open-source on &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; under Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Self-Hosted AI SRE in 2026: Air-Gapped, Multi-Cloud, BYO-LLM</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 19 May 2026 01:01:04 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/self-hosted-ai-sre-in-2026-air-gapped-multi-cloud-byo-llm-53ha</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/self-hosted-ai-sre-in-2026-air-gapped-multi-cloud-byo-llm-53ha</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted AI SRE means the agent runtime, its memory layer, and the LLM all run inside the customer's perimeter.&lt;/strong&gt; Every inference call, every telemetry read, and every postmortem write happens on customer-owned infrastructure. The definition is structural. A vendor agent that ships data to vendor-managed inference is not self-hosted under this definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the Sovereignty Spectrum.&lt;/strong&gt; Five deployment tiers: T1 Public SaaS, T2 Private SaaS, T3 VPC-Isolated, T4 On-Prem Hosted, T5 Air-Gapped. Of the fifteen most-cited AI SRE tools in 2026, only &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, and &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; credibly reach T4 or T5. The other twelve top out at T1 or T2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gapped deployment requires three independent stacks: orchestration, memory, and inference.&lt;/strong&gt; Orchestration is the agent loop (LangGraph, ReAct). Memory is the dependency graph plus RAG corpus (Memgraph, Weaviate). Inference is the LLM (&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;, or a sovereign endpoint). All three must run locally, with no outbound network call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory drivers are concrete and dated.&lt;/strong&gt; The &lt;a href="https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn" rel="noopener noreferrer"&gt;EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025&lt;/a&gt;. The &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;EU AI Act implementation timeline&lt;/a&gt; phases in through 2027. The &lt;a href="https://www.sec.gov/newsroom/press-releases/2023-139" rel="noopener noreferrer"&gt;SEC adopted cybersecurity disclosure rules on 26 July 2023&lt;/a&gt; (Form 8-K Item 1.05 effective 18 December 2023).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-weight LLMs in 2026 are credible for local inference.&lt;/strong&gt; &lt;a href="https://ai.meta.com/blog/meta-llama-3-3-70b/" rel="noopener noreferrer"&gt;Meta's Llama 3.3 70B (December 2024)&lt;/a&gt; delivers similar performance to Llama 3.1 405B at lower inference cost, per Meta's own announcement. Mistral, DeepSeek, and Qwen have released competitive open-weight models. Aurora's reference local stack uses Ollama with a 70B-class model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Arvo's design-partner conversations across 2025, every regulated customer ran into the same procurement wall: every credible commercial AI SRE required production telemetry, including customer data inside log lines, error messages, and stack traces, to leave the customer perimeter for inference. For a SaaS startup the wall is paperwork. For a bank, a defence contractor, an EU sovereign-data buyer, or a healthcare provider, it blocks the procurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted AI SRE removes the wall.&lt;/strong&gt; The agent, its memory, and the LLM all run inside the customer's perimeter. This guide is the 2026 reference for evaluating, designing, and deploying a self-hosted AI SRE, with every commercial tool mapped to its deployment tier and Aurora's air-gapped stack used as the worked example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does self-hosted AI SRE mean?
&lt;/h2&gt;

&lt;p&gt;The phrase is overloaded. Three definitions circulate in 2026 vendor marketing, and only the strictest meaningfully reduces the trust surface.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted collector with VPC peering.&lt;/strong&gt; A vendor agent runs in the customer VPC, gathers telemetry, and ships it (sometimes after partial filtering) to a vendor-managed inference plane. The inference call leaves the customer perimeter. Most commercial AI SREs in 2026 use this pattern and call it "private deployment."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-tenant SaaS.&lt;/strong&gt; A dedicated vendor-managed instance inside a vendor-owned cloud account. The data plane is isolated from other tenants but still vendor-operated. Inference still leaves the customer perimeter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True self-hosted.&lt;/strong&gt; Every component (orchestration runtime, memory layers, inference endpoint, secrets manager) runs on customer-owned infrastructure. No outbound network call is required for an investigation to complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This guide uses the third definition. For audits and compliance reviews, only the third meaning answers the question "could a malicious actor at the vendor have read our incident transcript" with a structural no.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sovereignty Spectrum
&lt;/h2&gt;

&lt;p&gt;Each tier increases perimeter control over the previous one. Choose the tier the team can defend operationally; aiming further than that is engineering debt waiting to happen.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What runs on customer infrastructure&lt;/th&gt;
&lt;th&gt;What leaves the perimeter&lt;/th&gt;
&lt;th&gt;Representative tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T1, Public SaaS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;Telemetry, transcripts, investigation prompts&lt;/td&gt;
&lt;td&gt;Datadog Bits AI, incident.io AI SRE, Rootly AI, PagerDuty SRE Agent, ServiceNow Now Assist, Splunk ITSI, Cleric.ai, Causely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T2, Private SaaS (VPC peering)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A vendor-supplied agent or collector&lt;/td&gt;
&lt;td&gt;Telemetry, embeddings, sometimes whole log lines, all inference calls&lt;/td&gt;
&lt;td&gt;Resolve.ai (satellite agent), Traversal, NeuBird Hawkeye (VPC option), Edwin AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T3, VPC-Isolated single-tenant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vendor-managed control plane inside a vendor-owned cloud account dedicated to one customer&lt;/td&gt;
&lt;td&gt;All inference calls. Cross-tenant data flow is structurally absent, but the vendor still operates the plane&lt;/td&gt;
&lt;td&gt;Some incumbent "private cloud" tiers (custom-quoted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T4, On-prem with hosted LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent, memory, dependency graph, RAG corpus&lt;/td&gt;
&lt;td&gt;LLM API calls to OpenAI, Anthropic, Google, or Bedrock&lt;/td&gt;
&lt;td&gt;Aurora with managed LLM; HolmesGPT with managed LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T5, Air-gapped&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent, memory, dependency graph, RAG corpus, and a local LLM via Ollama, vLLM, or a sovereign endpoint&lt;/td&gt;
&lt;td&gt;Nothing. Investigation completes without an outbound call&lt;/td&gt;
&lt;td&gt;Aurora with Ollama; HolmesGPT with self-hosted LLM endpoint; K8sGPT with local LLM (Kubernetes-only scope)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A team's deployment tier is fixed by its strictest constraint, not its average. The &lt;a href="https://www.finma.ch/en/~/media/finma/dokumente/dokumentencenter/myfinma/rundschreiben/finma-rs-2018-03-20180101.pdf?la=en" rel="noopener noreferrer"&gt;FINMA Circular 2018/03&lt;/a&gt; on outsourcing for Swiss banks and insurers pushes regulated workloads toward T5. A privacy-by-design product advertising "your incident data never leaves your servers" lands at T5. A team that cannot obtain controller approval for an LLM provider under &lt;a href="https://gdpr-info.eu/art-28-gdpr/" rel="noopener noreferrer"&gt;GDPR Article 28&lt;/a&gt; lands at T5.&lt;/p&gt;

&lt;p&gt;Any other constraint allows T3 or T4. A single strict regulator collapses the choice to T5.&lt;/p&gt;
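&lt;p&gt;The "strictest constraint wins" rule can be sketched in a few lines of Python. This is a minimal illustration rather than part of any tool: the Tier enum, the required_tier helper, and the example mapping of constraints to minimum tiers are all hypothetical.&lt;/p&gt;

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sovereignty Spectrum tiers, ordered from least to most perimeter control."""
    T1_PUBLIC_SAAS = 1
    T2_PRIVATE_SAAS = 2
    T3_VPC_ISOLATED = 3
    T4_ON_PREM_HOSTED_LLM = 4
    T5_AIR_GAPPED = 5


def required_tier(minimums: dict[str, Tier]) -> Tier:
    """Return the tier demanded by the strictest constraint (the maximum, not the mean)."""
    if not minimums:
        raise ValueError("at least one constraint is required")
    return max(minimums.values())


# Hypothetical constraint set for one team; the minimum tier per constraint is an assumption.
constraints = {
    "internal security policy": Tier.T3_VPC_ISOLATED,
    "customer contract": Tier.T4_ON_PREM_HOSTED_LLM,
    "FINMA-regulated workload": Tier.T5_AIR_GAPPED,
}
print(required_tier(constraints).name)
```

&lt;p&gt;Removing the regulated workload from the mapping drops the result to T4, which is the point of the rule: one strict entry decides the outcome regardless of how permissive the others are.&lt;/p&gt;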

&lt;h2&gt;
  
  
  Why does self-hosting matter in 2026?
&lt;/h2&gt;

&lt;p&gt;Three pressures, in roughly this order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulatory.&lt;/strong&gt; The &lt;a href="https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn" rel="noopener noreferrer"&gt;EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025&lt;/a&gt;. The boundary covers data processing and storage for core services and is the model EU procurement teams now apply to other vendors. The &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;EU AI Act timeline&lt;/a&gt; phases in through 2027, with &lt;a href="https://artificialintelligenceact.eu/chapter/3/" rel="noopener noreferrer"&gt;high-risk system obligations under Chapter III&lt;/a&gt; (risk management, data governance, human oversight, post-market monitoring) applicable to operational AI used in critical infrastructure. The &lt;a href="https://www.sec.gov/newsroom/press-releases/2023-139" rel="noopener noreferrer"&gt;SEC's cybersecurity disclosure rules&lt;/a&gt; (adopted 26 July 2023, Form 8-K Item 1.05 effective 18 December 2023) make incident response transparency a public-company concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sovereignty and latency.&lt;/strong&gt; Sovereign cloud is no longer a French preoccupation. &lt;a href="https://www.ovhcloud.com/en/enterprise/products/hosted-private-cloud/" rel="noopener noreferrer"&gt;OVHcloud Sovereign Cloud&lt;/a&gt;, &lt;a href="https://www.scaleway.com/en/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt;, &lt;a href="https://www.t-systems.com/de/en/cloud-services/sovereign-cloud" rel="noopener noreferrer"&gt;T-Systems Sovereign Cloud&lt;/a&gt;, &lt;a href="https://www.stackit.de/en/" rel="noopener noreferrer"&gt;Stackit&lt;/a&gt; (Schwarz Group), and &lt;a href="https://www.oracle.com/cloud/sovereign-cloud/eu/" rel="noopener noreferrer"&gt;Oracle EU Sovereign Cloud&lt;/a&gt; ship contractually sovereign tiers. An AI SRE that cannot operate without sending telemetry to a US hyperscaler region is unfit for these workloads. Latency follows the same constraint: an EU-hosted agent calling a US-hosted LLM during an incident incurs round-trip latency on every step of a multi-turn investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data leakage and trust.&lt;/strong&gt; Production log lines frequently contain customer PII, secrets, and proprietary identifiers. &lt;a href="https://www.gitguardian.com/state-of-secrets-sprawl-report-2024" rel="noopener noreferrer"&gt;GitGuardian's State of Secrets Sprawl 2024&lt;/a&gt; found 12.8 million new exposed secrets across public repositories alone in 2023. The figure covers source code rather than logs, but it is a steady reminder that engineering artefacts, telemetry included, contain material auditors care about. The audit calculation for a security team is the same as for any third-party data flow: if it can leak, model the risk as if it will. T5 makes the model trivial because nothing leaves the perimeter.&lt;/p&gt;

&lt;p&gt;For the full incident-investigation context, see &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation: The Complete Guide for SRE Teams&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI SRE tools can be fully self-hosted?
&lt;/h2&gt;

&lt;p&gt;The honest map.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best achievable tier&lt;/th&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T5, Air-Gapped&lt;/td&gt;
&lt;td&gt;Reference stack: Docker Compose or Helm chart, Ollama local LLM, Vault, Memgraph, Weaviate. See the &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora repo&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HolmesGPT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T4, On-prem with hosted LLM (T5 with self-hosted LLM endpoint)&lt;/td&gt;
&lt;td&gt;Apache 2.0. The &lt;a href="https://holmesgpt.dev/" rel="noopener noreferrer"&gt;HolmesGPT docs&lt;/a&gt; assume a hosted model provider (OpenAI, Azure OpenAI, Bedrock). Self-hosted LLM is an advanced configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K8sGPT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T4, On-prem (T5 with local LLM, Kubernetes scope only)&lt;/td&gt;
&lt;td&gt;CLI or Helm. &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;Local LLMs via Ollama supported&lt;/a&gt;. Scope is limited to the Kubernetes API.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resolve.ai&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS&lt;/td&gt;
&lt;td&gt;Satellite agent in the customer VPC for telemetry. Inference is vendor-managed. No publicly documented air-gapped option.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS&lt;/td&gt;
&lt;td&gt;Flexible deployment options. Inference is vendor-managed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NeuBird Hawkeye&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS (VPC)&lt;/td&gt;
&lt;td&gt;VPC deployment available. Ephemeral telemetry processing claimed by NeuBird. Inference path is vendor-managed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Causely&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Kubernetes-only. SaaS control plane.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cleric.ai&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Slack-first SaaS.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PagerDuty SRE Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Inside &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty Operations Cloud&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog Bits AI SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Multi-tenant inside Datadog. HIPAA-compliant per &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog's documentation&lt;/a&gt;, not air-gapped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incident.io AI SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Hosted multi-tenant. AI SRE access design-partner-gated.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rootly AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Closed-core SaaS. &lt;a href="https://rootly.com/labs" rel="noopener noreferrer"&gt;Rootly AI Labs&lt;/a&gt; publishes open-source prototypes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ServiceNow Now Assist SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;ServiceNow cloud. GA targeted June 2026.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edwin AI (LogicMonitor)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private (LogicMonitor-managed)&lt;/td&gt;
&lt;td&gt;Bundled with LogicMonitor Envision platform. Not standalone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk ITSI Episode Summarization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Splunk Cloud only as of May 2026 (Alpha).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The open-source projects are the only tools today that credibly reach T4 or T5 with public documentation. Aurora is the only one with multi-cloud scope at T5. Resolve.ai, Traversal, NeuBird, and Datadog Bits AI publish FedRAMP-adjacent or HIPAA tiers but no air-gapped reference architecture as of May 2026. For the broader category, see our &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;open-source incident management overview&lt;/a&gt;; for scheduled and event-triggered automations on top of self-hosted Aurora, see the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions launch post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the architecture of a self-hosted AI SRE?
&lt;/h2&gt;

&lt;p&gt;A self-hosted agentic AI SRE has three concurrent runtime stacks. Skip any one and the deployment regresses to a lower sovereignty tier.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Orchestration runtime
&lt;/h3&gt;

&lt;p&gt;The agent loop is the LangGraph, ReAct, or equivalent orchestration that decides what tool to call next. It is the smallest of the three stacks by resource footprint and the easiest to self-host. Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python or Node runtime, typically containerised.&lt;/li&gt;
&lt;li&gt;A task queue (&lt;a href="https://docs.celeryq.dev/" rel="noopener noreferrer"&gt;Celery&lt;/a&gt;, &lt;a href="https://python-rq.org/" rel="noopener noreferrer"&gt;RQ&lt;/a&gt;, &lt;a href="https://docs.bullmq.io/" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt;) for long-running investigations.&lt;/li&gt;
&lt;li&gt;Postgres for agent state, investigation records, and audit logs.&lt;/li&gt;
&lt;li&gt;A secrets store (&lt;a href="https://developer.hashicorp.com/vault" rel="noopener noreferrer"&gt;HashiCorp Vault&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/secretsmanager/" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt;, or &lt;a href="https://cloud.google.com/security/products/security-key-management" rel="noopener noreferrer"&gt;KMS&lt;/a&gt;) for cloud credentials and LLM keys.&lt;/li&gt;
&lt;li&gt;A web UI or API surface for engineers to inspect and trigger investigations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora ships this stack as a Docker Compose for single-node deployment and a Helm chart for Kubernetes-native deployment, both &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;documented in the repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory layer
&lt;/h3&gt;

&lt;p&gt;An agent without memory is a stateless inference call. Memory is the difference between an agent that learns from the environment and an agent that makes the same investigative mistake every week.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency graph.&lt;/strong&gt; A graph database (&lt;a href="https://memgraph.com/docs" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt;, Neo4j) that holds the live topology of the infrastructure: services, dependencies, alert sources, and ownership. The agent traverses the graph to assess blast radius and trace upstream causes before issuing tool calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG corpus.&lt;/strong&gt; A vector database (&lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;, &lt;a href="https://qdrant.tech/documentation/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;, &lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;) holding embeddings of past postmortems, runbooks, design docs, and code. &lt;a href="https://weaviate.io/developers/weaviate/search/hybrid" rel="noopener noreferrer"&gt;Hybrid retrieval combining BM25 and vector search&lt;/a&gt; outperforms either alone on SRE corpora because exact-match identifiers (service names, error codes) coexist with semantic concepts (failure modes). See also the &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;root cause analysis complete guide for SREs&lt;/a&gt; for the broader investigation context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event store.&lt;/strong&gt; Postgres or an event-sourcing database for the agent's own investigation history. Past investigations become future evidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora's reference stack is Memgraph, Weaviate, and Postgres. Each runs in a customer container, and none requires an outbound network call.&lt;/p&gt;
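&lt;p&gt;To make the dependency-graph traversal concrete, here is a minimal sketch of a blast-radius query. It uses an in-memory dictionary as a stand-in for a graph database, so it is not how Memgraph or Aurora implements the lookup; the topology, service names, and the blast_radius helper are hypothetical.&lt;/p&gt;

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["redis"],
    "payments": ["postgres"],
    "search": ["elasticsearch"],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so the walk goes from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search; the seen set also guards against dependency cycles.
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(blast_radius("redis", DEPENDS_ON)))
```

&lt;p&gt;Tracing upstream causes is the same walk over the original, un-inverted edges. In a real deployment the equivalent query would run inside the graph database rather than in application code.&lt;/p&gt;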

&lt;h3&gt;
  
  
  3. Inference layer
&lt;/h3&gt;

&lt;p&gt;The LLM. Three paths, in increasing sovereignty:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Managed LLM API.&lt;/strong&gt; OpenAI, Anthropic, Google, Bedrock. Cheapest to start, lowest operational burden, but the deployment stays at T4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private endpoint.&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput" rel="noopener noreferrer"&gt;Azure OpenAI dedicated&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html" rel="noopener noreferrer"&gt;Bedrock Provisioned Throughput&lt;/a&gt;, or a partner-hosted endpoint. Stronger contractual perimeter, although the data still leaves the customer cloud account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM.&lt;/strong&gt; &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;, or a sovereign inference appliance. Reaches T5.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For T5, the inference stack is the main operational lift. Hardware is the largest single line item, and team expertise is the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  BYO-LLM: which models run well locally?
&lt;/h2&gt;

&lt;p&gt;Open-weight model quality has progressed enough to anchor an agentic SRE loop in 2026. The current options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Llama 3.3 70B&lt;/strong&gt; (&lt;a href="https://ai.meta.com/blog/meta-llama-3-3-70b/" rel="noopener noreferrer"&gt;Meta, December 2024&lt;/a&gt;). Meta states the model delivers similar performance to Llama 3.1 405B at lower inference cost. A common starting point for local deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1&lt;/strong&gt; (&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;model card&lt;/a&gt;). A reasoning-tuned open-weight model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 2.5 and 3 families&lt;/strong&gt; (&lt;a href="https://qwenlm.github.io/blog/qwen2.5/" rel="noopener noreferrer"&gt;Qwen 2.5 release&lt;/a&gt;). Strong multilingual support for teams with non-English runbook content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral Large&lt;/strong&gt; (&lt;a href="https://docs.mistral.ai/getting-started/models/models_overview/" rel="noopener noreferrer"&gt;Mistral models&lt;/a&gt;). Strong tool-use performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hardware sizing for a 70B-class model: in float16, weights are roughly 140GB (70 billion parameters at 2 bytes each), so plan two 80GB cards (a pair of H100 or A100 80GB). A single H200 (141GB) holds the weights but leaves almost no headroom for the KV cache, so treat it as a fit only with quantisation. Q4-quantised variants compress weights to roughly 35-40GB and fit on a single 80GB card with context room, at some latency and quality cost. See the &lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct" rel="noopener noreferrer"&gt;Llama 3.3 70B model card&lt;/a&gt; for the canonical parameter and tensor sizes. Specific latency targets are workload-dependent and should be measured, not assumed.&lt;/p&gt;
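&lt;p&gt;The sizing arithmetic is easy to reproduce. The sketch below counts weights only, assuming 1 GB = 1e9 bytes and ignoring KV cache, activations, and the per-block scale factors that push real 4-bit files toward the upper end of the 35-40GB range. The weight_memory_gb helper is a hypothetical name.&lt;/p&gt;

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


fp16 = weight_memory_gb(70, 16)  # 70B parameters at 2 bytes each
q4 = weight_memory_gb(70, 4)     # 70B parameters at half a byte each
print(f"float16: {fp16:.0f} GB, 4-bit: {q4:.0f} GB")

# Memory left over for KV cache and runtime overhead, before any measurement.
print(f"2 x 80 GB cards at float16: {2 * 80 - fp16:.0f} GB spare")
print(f"1 x 80 GB card at 4-bit: {80 - q4:.0f} GB spare")
```

&lt;p&gt;The spare figure is what bounds context length and concurrency, which is why it is worth computing before choosing hardware.&lt;/p&gt;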

&lt;p&gt;The constraint to flag: running a local LLM is a real engineering discipline. Teams without LLM-ops capacity should treat T4 (managed API) as the working answer and revisit T5 when the team is staffed for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does multi-cloud authentication work in a self-hosted agent?
&lt;/h2&gt;

&lt;p&gt;A self-hosted agent must still reach customer cloud APIs. The auth pattern matters because credentials live in the customer perimeter. Vendor-managed inference makes credential exfiltration a vendor-trust problem. Self-hosted inference makes it a customer-operations problem, which is the desired state.&lt;/p&gt;

&lt;p&gt;Aurora's reference multi-cloud auth pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" rel="noopener noreferrer"&gt;STS AssumeRole&lt;/a&gt; into customer accounts via a least-privilege investigation role. Credentials never persist in agent storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal" rel="noopener noreferrer"&gt;Service Principal&lt;/a&gt; with Reader (and incident-scoped Operator) role assignments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth-based authentication or &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;workload identity federation&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OVH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://help.ovhcloud.com/csm/en-gb-api-getting-started-ovhcloud-api" rel="noopener noreferrer"&gt;API key per investigation scope&lt;/a&gt;, stored in Vault.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaleway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.scaleway.com/en/docs/iam/how-to/create-api-keys/" rel="noopener noreferrer"&gt;API token&lt;/a&gt; stored in Vault.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubeconfig per cluster, stored in Vault. Sandboxed kubectl execution in an isolated namespace; see our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Vault binding matters: every cloud credential is short-lived where the cloud supports it, and every credential use is auditable. In a T5 deployment, the auditor's "who issued this command" question is answered by the Vault audit log and the agent's tool-call trace, not by a vendor SOC 2 attestation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an air-gapped AI SRE deployment require?
&lt;/h2&gt;

&lt;p&gt;A strict air-gapped deployment makes no outbound network call during an investigation, including for inference.&lt;/p&gt;

&lt;p&gt;Aurora's air-gapped reference architecture covers six layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored container registry.&lt;/strong&gt; Every image (Aurora, Memgraph, Weaviate, Postgres, Vault, Ollama) is pulled from a customer-internal registry. No Docker Hub calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored package indices.&lt;/strong&gt; Python wheels and OS packages served from internal Artifactory or equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored model weights.&lt;/strong&gt; Llama 3.3 weights downloaded once on a connected jumpbox, scanned, hashed, and copied into the air-gapped network. Same for embedding models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local DNS.&lt;/strong&gt; No outbound DNS resolution required. Cloud APIs are reached via VPC private endpoints (&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" rel="noopener noreferrer"&gt;AWS PrivateLink&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview" rel="noopener noreferrer"&gt;Azure Private Endpoint&lt;/a&gt;, &lt;a href="https://cloud.google.com/vpc/docs/private-service-connect" rel="noopener noreferrer"&gt;GCP Private Service Connect&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No telemetry to vendor.&lt;/strong&gt; Neither Aurora nor the open-source components phone home; this is verified per release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sealed Vault.&lt;/strong&gt; Vault sealed and unsealed via internal HSM or &lt;a href="https://developer.hashicorp.com/vault/docs/concepts/seal" rel="noopener noreferrer"&gt;Shamir keyshares&lt;/a&gt;. No auto-unseal against a vendor KMS.&lt;/li&gt;
&lt;/ol&gt;
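&lt;p&gt;The "scanned, hashed, and copied" step for model weights can be checked on the air-gapped side with the standard library alone. This is a minimal sketch, not Aurora's tooling: the manifest format (relative path mapped to a SHA-256 hex digest recorded on the connected jumpbox) and the function names are assumptions.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large weight shards are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose hash differs from the manifest."""
    failures = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

&lt;p&gt;An empty list from verify_manifest means every file arrived intact. Anything else should block the model from being loaded until the transfer is repeated.&lt;/p&gt;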

&lt;p&gt;The provisioning lift is real. Teams that have operated air-gapped Kubernetes will recognise the pattern. Teams that have not should pilot in a connected environment first.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aurora implements the Sovereignty Spectrum
&lt;/h2&gt;

&lt;p&gt;Every Aurora deployment is configured for the customer's tier. The same codebase supports all five.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T1 and T2.&lt;/strong&gt; Aurora deployed to a public-cloud account with managed services for Postgres, Memgraph, and Weaviate. LLM via OpenAI or Anthropic API. Useful for evaluation pilots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T3.&lt;/strong&gt; Aurora deployed to a customer-owned VPC with private endpoints to managed services. LLM via private endpoint (Azure OpenAI dedicated, Bedrock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T4.&lt;/strong&gt; Aurora deployed to customer-owned VMs or Kubernetes with self-hosted Postgres, Memgraph, and Weaviate. LLM via managed API or private endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T5.&lt;/strong&gt; Aurora deployed to customer-owned air-gapped infrastructure with Ollama-hosted Llama 3.3 (or a sovereign LLM endpoint). All dependencies mirrored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because one codebase serves every tier, a tier downgrade ("drop from T5 to T3 for one workload") or upgrade ("move the EU workload from T3 to T5") becomes a configuration change rather than a migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does self-hosted AI SRE cost compare to SaaS?
&lt;/h2&gt;

&lt;p&gt;A precise total cost of ownership depends on team size, model choice, infrastructure pricing, regional rates, and incident volume. Procurement should model the variable axes against incident volume rather than anchor on a single vendor-supplied number.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted T4 or T5 fixed costs.&lt;/strong&gt; Compute for the agent runtime, memory stores, and (for T5) the LLM node. Storage for the RAG corpus and audit log. Engineering time to operate the stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted T4 variable costs.&lt;/strong&gt; Managed LLM API usage at provider rates (&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI pricing&lt;/a&gt;, &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;Anthropic pricing&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;Bedrock pricing&lt;/a&gt;). Scales with the number and depth of investigations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commercial SaaS variable costs.&lt;/strong&gt; Per-seat tiers (incident.io, Rootly, PagerDuty), per-investigation billing (&lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;Datadog Bits AI&lt;/a&gt;, NeuBird), or per-credit consumption (ServiceNow). All published on the vendor's pricing page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even between a self-hosted T5 deployment and per-investigation SaaS depends on the vendor's per-investigation price, the LLM choice, and the engineering cost of running the stack. Procurement teams should model three points: today's incident volume, twelve-month projected volume, and a 3x scenario. If sovereignty rather than economics dominates at any of the three, the regulator decides the deployment tier, not the spreadsheet.&lt;/p&gt;
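
&lt;p&gt;The three-point comparison can be sketched in a few lines. This is a minimal model under stated assumptions, not a pricing tool: it treats self-hosted cost as a fixed monthly amount plus a per-investigation LLM cost, treats SaaS cost as purely per-investigation, and reads the 3x scenario as three times today's volume. Every number in the usage note is a placeholder, not a vendor rate.&lt;/p&gt;

```python
def monthly_self_hosted(fixed_monthly, llm_cost_per_investigation, investigations):
    """Fixed stack cost plus LLM usage that scales with investigations."""
    return fixed_monthly + llm_cost_per_investigation * investigations


def monthly_saas(price_per_investigation, investigations):
    """Per-investigation SaaS billing with no fixed component (assumption)."""
    return price_per_investigation * investigations


def break_even_volume(fixed_monthly, llm_cost_per_investigation, price_per_investigation):
    """Investigations per month at which both options cost the same.

    Returns None when SaaS is never more expensive per investigation,
    because self-hosting then never recovers its fixed cost.
    """
    margin = price_per_investigation - llm_cost_per_investigation
    if not margin > 0:
        return None
    return fixed_monthly / margin


def compare(scenarios, fixed_monthly, llm_cost_per_investigation, price_per_investigation):
    """Cost both options for each named monthly-volume scenario."""
    rows = {}
    for name, volume in scenarios.items():
        self_hosted = monthly_self_hosted(fixed_monthly, llm_cost_per_investigation, volume)
        saas = monthly_saas(price_per_investigation, volume)
        rows[name] = {
            "self_hosted": self_hosted,
            "saas": saas,
            "cheaper": "self-hosted" if min(self_hosted, saas) == self_hosted else "saas",
        }
    return rows


# Placeholder inputs only: replace with figures from your own quotes.
EXAMPLE = compare(
    {"today": 100, "twelve_month": 180, "three_x": 300},
    fixed_monthly=6000.0,
    llm_cost_per_investigation=0.0,  # local LLM, so no per-call API charge
    price_per_investigation=30.0,
)
```

&lt;p&gt;With these placeholder inputs the break-even sits at 200 investigations per month, so only the 3x scenario favours self-hosting. The point of the exercise is the shape of the curve, not the specific figures.&lt;/p&gt;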

&lt;h2&gt;
  
  
  When self-hosting is the wrong answer
&lt;/h2&gt;

&lt;p&gt;Self-hosting is an engineering commitment, not a checkbox.&lt;/p&gt;

&lt;p&gt;Teams that should skip it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No LLM-ops capacity.&lt;/strong&gt; If no one on the team has run inference servers in production, do not start with air-gapped Ollama. Pilot at T1 or T2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small team, low incident volume.&lt;/strong&gt; Below twenty incidents per month, the operational overhead can exceed the cost savings of self-hosting. T1 is fine if the data classification allows it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No regulatory or sovereignty pressure.&lt;/strong&gt; If the compliance team is not asking and the data classification is not sensitive, the sovereignty premium is paid for nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early in the AI SRE evaluation curve.&lt;/strong&gt; A managed pilot validates the value of the agent to the team. Self-host after that decision, not before it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams that should default to self-hosting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regulated workloads (finance, healthcare, defence, critical infrastructure).&lt;/li&gt;
&lt;li&gt;EU sovereign-data customers.&lt;/li&gt;
&lt;li&gt;Customers that themselves advertise sovereignty as a product attribute.&lt;/li&gt;
&lt;li&gt;Public-sector buyers under FedRAMP High, IRAP PROTECTED, IL5, or equivalent.&lt;/li&gt;
&lt;li&gt;Anyone whose log lines contain customer PII that has not been scrubbed at source.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Arvo expects three shifts in the self-hosted AI SRE landscape over the next twelve months.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sovereign LLM endpoints.&lt;/strong&gt; EU-hosted, contract-bound LLM endpoints from cloud regions outside US jurisdiction will turn T4 into a viable tier for European regulated customers without forcing T5. Anthropic, OpenAI, and Google are each shipping or piloting EU-resident inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gap reference appliances.&lt;/strong&gt; Appliance-style packages (preloaded GPU servers with Aurora, a local LLM, and a sealed Vault) sold as turn-key T5 deployments are likely to emerge from hardware vendors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open benchmark cohorts.&lt;/strong&gt; Closed-source players still measure themselves on private datasets. The first open, named, multi-LLM benchmark on a public incident corpus is likely to become the reference point that the whole category cites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2024 self-hosted AI SRE was a theoretical option. By 2025 it was niche. In 2026 it has become the procurement default for regulated workloads. The tools that can execute it today are Aurora at the multi-cloud end, HolmesGPT at the CNCF and Kubernetes end, and K8sGPT for diagnostics.&lt;/p&gt;

&lt;p&gt;For the full landscape of AI SRE tools and how each maps to a deployment tier, see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;. For the broader category overview, see &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE: The Complete Guide for Engineering Teams in 2026&lt;/a&gt;. For the investigation and postmortem halves of the workflow, see &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt; and &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Top 15 AI SRE Tools in 2026: Open-Source and Commercial</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 19 May 2026 00:55:24 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/top-15-ai-sre-tools-in-2026-open-source-and-commercial-1bl7</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/top-15-ai-sre-tools-in-2026-open-source-and-commercial-1bl7</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE tool applies large-language-model reasoning to incident response, usually as a multi-step agent that runs infrastructure tools, summarizes events, or drafts postmortems.&lt;/strong&gt; The label spans five archetypes that vendors blur in marketing: agentic investigation, AIOps correlation, postmortem generation, ITSM-integrated copilots, and workflow-automation suites with AI add-ons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We score every tool on the AI SRE Capability Matrix.&lt;/strong&gt; Five axes (Investigation, Remediation, Postmortem, Deployment Flexibility, Source Availability), each 0 to 3, total 15. The matrix tracks publicly documented capability as of May 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three open-source projects span the agentic-investigation lane.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, multi-cloud), &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since October 2025&lt;/a&gt;, co-maintained by &lt;a href="https://www.cncf.io/blog/2026/01/07/holmesgpt-agentic-troubleshooting-built-for-the-cloud-native-era/" rel="noopener noreferrer"&gt;Robusta and Microsoft&lt;/a&gt;), and &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;, Kubernetes diagnostics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cited funding rounds since late 2024.&lt;/strong&gt; &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai raised $125M at a $1B valuation in February 2026&lt;/a&gt; and &lt;a href="https://www.prnewswire.com/news-releases/resolve-ai-announces-series-a-extension-at-a-1-5b-valuation-and-launches-resolve-ai-labs-to-advance-ai-systems-for-complex-production-environments-302743888.html" rel="noopener noreferrer"&gt;extended at a $1.5B valuation in April 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal raised $48M in June 2025&lt;/a&gt;. &lt;a href="https://incident.io/blog/incident-io-raises-62m-series-b-to-build-ai-agents-that-resolve-incidents-with-you" rel="noopener noreferrer"&gt;incident.io closed a $62M Series B in September 2024&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incumbents shipped AI SRE features by Q2 2026.&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty SRE Agent&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI SRE&lt;/a&gt;, &lt;a href="https://www.splunk.com/en_us/blog/observability/conf25-splunk-observability-announcements.html" rel="noopener noreferrer"&gt;Splunk ITSI Episode Summarization announced at .conf25&lt;/a&gt; (September 2025), &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow Now Assist SRE Specialist&lt;/a&gt; (GA targeted June 2026), and &lt;a href="https://www.logicmonitor.com/edwin-ai" rel="noopener noreferrer"&gt;LogicMonitor Edwin AI&lt;/a&gt;. The procurement question moves from "is there an AI option" to "which archetype, at what deployment tier."&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Site reliability teams in 2026 are evaluating tools in a market that has reorganised faster than most procurement processes can keep up with. Five archetypes share the "AI SRE" label, and buyers regularly compare a postmortem generator to an agentic investigator as if they did the same job. This guide compares the fifteen most-cited tools across both open-source and commercial categories, scored on a single capability matrix so the decision becomes one of fit.&lt;/p&gt;

&lt;p&gt;A note on bias. Arvo builds &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, an open-source agentic AI SRE tool listed below. We applied the same scoring rubric to every product on the list, including our own, and cited every numeric or capability claim that is not common knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI SRE tool?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An AI SRE tool applies large-language-model reasoning to incident response.&lt;/strong&gt; The term covers five distinct archetypes, and only two of them actually investigate incidents.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agentic investigation.&lt;/strong&gt; A multi-step LLM agent that calls infrastructure tools (&lt;code&gt;kubectl&lt;/code&gt;, cloud APIs, log queries, dependency graphs) during an incident to gather new evidence and produce a root-cause analysis. Aurora, HolmesGPT, K8sGPT, Resolve.ai, Traversal, NeuBird, Cleric, Causely, and Ciroos all market themselves with this framing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps correlation.&lt;/strong&gt; Statistical or ML clustering of alerts to reduce noise. PagerDuty Intelligent Alert Grouping, BigPanda, Dell APEX (Moogsoft), Dynatrace Davis. The category predates LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation.&lt;/strong&gt; An LLM that drafts the retrospective from artefacts the team already has (Slack transcripts, monitor data, the investigation trace). Rootly, incident.io Scribe, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered in our &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ITSM-integrated copilot.&lt;/strong&gt; AI inside an existing service-management workflow. ServiceNow Now Assist SRE Specialist, LogicMonitor Edwin AI, Splunk ITSI Episode Summarization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow-automation suite plus AI add-on.&lt;/strong&gt; Incident platforms that bolted AI onto existing on-call, runbook, and status-page features. incident.io AI SRE, Rootly AI, FireHydrant AI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conflating archetypes is the most common evaluation mistake. A team buying a postmortem generator will not get root-cause analysis. A team buying an AIOps correlator will not get a tool that runs &lt;code&gt;kubectl&lt;/code&gt;. For the foundational definitions, see our &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE Complete Guide&lt;/a&gt; and &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI SRE Capability Matrix
&lt;/h2&gt;

&lt;p&gt;Five axes, each scored 0 to 3. We apply the same rubric to every tool in the shortlist.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Single-shot LLM summary&lt;/td&gt;
&lt;td&gt;Multi-step agent, single cloud or platform&lt;/td&gt;
&lt;td&gt;Multi-step agent, multi-cloud, with RAG over historical evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Suggested commands&lt;/td&gt;
&lt;td&gt;PR-based fixes with approval&lt;/td&gt;
&lt;td&gt;Sandboxed in-cluster execution with policy guardrails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postmortem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Manual export of a transcript&lt;/td&gt;
&lt;td&gt;LLM-drafted from artefacts&lt;/td&gt;
&lt;td&gt;LLM-drafted from the agent's own investigation trace, exported to Confluence or Jira&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS-only, public cloud&lt;/td&gt;
&lt;td&gt;SaaS with private VPC peering&lt;/td&gt;
&lt;td&gt;Self-hosted in customer VPC&lt;/td&gt;
&lt;td&gt;Air-gapped with local LLM (Ollama or vLLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Source-available, paid&lt;/td&gt;
&lt;td&gt;Open core&lt;/td&gt;
&lt;td&gt;Apache 2.0 or MIT, fully open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A higher score is not always "better." A team without LLM-ops capacity gains little from a deployment-flexibility score of 3, because it cannot yet operate an air-gapped local LLM. The matrix is for like-for-like comparison, not a leaderboard.&lt;/p&gt;
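
&lt;p&gt;To make the like-for-like use concrete, the sketch below totals the five axes and filters tools by a team's minimum requirement on each axis rather than ranking by total. The helper names (&lt;code&gt;total&lt;/code&gt;, &lt;code&gt;meets&lt;/code&gt;, &lt;code&gt;shortlist&lt;/code&gt;) are hypothetical, and the three score tuples are copied from the open-source entries in this guide.&lt;/p&gt;

```python
AXES = ("investigation", "remediation", "postmortem", "deployment", "source")

# Scores as published in this guide's entries for the three open-source tools.
SCORES = {
    "Aurora": (3, 3, 3, 3, 3),
    "HolmesGPT": (2, 1, 1, 2, 3),
    "K8sGPT": (1, 1, 0, 2, 3),
}


def total(scores):
    """Sum the five axis scores after checking each is an integer from 0 to 3."""
    if len(scores) != len(AXES):
        raise ValueError("expected one score per axis")
    for score in scores:
        if score not in range(0, 4):
            raise ValueError("each axis is scored 0 to 3")
    return sum(scores)


def meets(scores, required):
    """True when every required axis reaches its minimum score."""
    by_axis = dict(zip(AXES, scores))
    return all(by_axis[axis] >= minimum for axis, minimum in required.items())


def shortlist(required):
    """Names of tools that satisfy the team's per-axis minimums, alphabetically."""
    return sorted(name for name, scores in SCORES.items() if meets(scores, required))
```

&lt;p&gt;For example, &lt;code&gt;shortlist({"investigation": 2})&lt;/code&gt; keeps Aurora and HolmesGPT, while a team that only requires self-hosting and an open licence (&lt;code&gt;{"deployment": 2, "source": 3}&lt;/code&gt;) keeps all three. Filtering on requirements first, and comparing totals only among the survivors, matches the intent of the matrix.&lt;/p&gt;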

&lt;p&gt;For a deeper treatment of the deployment-flexibility axis, see our companion piece, &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE: The 2026 Guide to Air-Gapped, Multi-Cloud, and BYO-LLM Deployment&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI SRE tools are most-cited in 2026?
&lt;/h2&gt;

&lt;p&gt;Ordered alphabetically inside each archetype. Scoring reflects the publicly documented capability of each product as of May 2026, not roadmap claims. For category foundations, see our &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;open-source incident management overview&lt;/a&gt; and the &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;root cause analysis complete guide for SREs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic-investigation tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Aurora (Arvo AI), Apache 2.0, multi-cloud
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; SRE teams that need self-hosted, multi-cloud, BYO-LLM agentic investigation with the option to graduate into PR-based remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Docker Compose, Helm chart, or air-gapped with &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. Customer-owned infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. Code at &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; LangGraph-orchestrated ReAct agent, 30+ integrations across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes. Memgraph dependency graph feeds an alert-correlation pre-step. Weaviate hybrid (BM25 plus vector) RAG over runbooks and past postmortems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Sandboxed &lt;code&gt;kubectl&lt;/code&gt; execution into an isolated "untrusted" namespace, wrapped in a four-layer command-safety pipeline (input rail, &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; signature match, per-org policy, LLM safety judge). Aurora Actions add scheduled and event-triggered automations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Postmortem agent fed by the investigation trace, exported to Confluence Cloud (OAuth) or Server / Data Center (PAT).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free (Apache 2.0). Infrastructure cost only. Optionally, LLM API usage. With local Ollama the recurring software cost is zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Self-host means the team operates the agent. Teams without basic Kubernetes ops capacity should pilot in an existing managed cluster first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 3, Postmortem 3, Deployment 3, Source 3, total &lt;strong&gt;15/15&lt;/strong&gt;. The score reflects the breadth of the open-source feature set against the matrix, not a quality verdict relative to commercial competitors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Causely, closed source, Kubernetes-only
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Kubernetes-only teams that want causal-graph reasoning rather than LLM-first investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with in-cluster collector. CNCF &lt;a href="https://www.cncf.io/sandbox-projects/" rel="noopener noreferrer"&gt;Causely member listing&lt;/a&gt; (member, not project).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Topology graph plus causality graph plus a "codebook" of failure patterns; the authors describe a deterministic abductive-inference layer that precedes any LLM call. See &lt;a href="https://docs.causely.ai/getting-started/how-causely-works/" rel="noopener noreferrer"&gt;How Causely Works&lt;/a&gt; and the &lt;a href="https://www.infoq.com/articles/causal-reasoning-observability/" rel="noopener noreferrer"&gt;InfoQ piece on causal reasoning in observability&lt;/a&gt;. &lt;a href="https://www.causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes" rel="noopener noreferrer"&gt;Gartner Cool Vendor for AIOps, December 2025&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based via &lt;a href="https://www.causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not a first-class artefact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Not publicly disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Kubernetes-only by design. If the platform spans cloud SDKs and managed services, the model is incomplete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 0, Deployment 0, Source 0, total &lt;strong&gt;3/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Cleric.ai, closed source, Slack-first
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; SRE teams that triage primarily in Slack and use Datadog or Grafana for telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Slack-native AI SRE per &lt;a href="https://cleric.ai/" rel="noopener noreferrer"&gt;cleric.ai&lt;/a&gt;. Integrations with Datadog and Grafana are documented on the product site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Not publicly disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Slack-first is a strong constraint. Teams on Microsoft Teams or under strict ChatOps governance may find the surface rigid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 0, Source 0, total &lt;strong&gt;4/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. HolmesGPT, Apache 2.0, Kubernetes-first
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Kubernetes-heavy teams that want a CNCF-aligned, RBAC-respecting investigation agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Helm via Robusta, or standalone CLI. LLM provider is the customer's choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. Code at &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since October 2025&lt;/a&gt;, co-maintained by Robusta and Microsoft.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Iterative ReAct agent. &lt;a href="https://holmesgpt.dev/data-sources/builtin-toolsets/" rel="noopener noreferrer"&gt;Built-in toolsets&lt;/a&gt; span Prometheus, Grafana, AWS / Azure / GCP via MCP read-only, Datadog, and Confluence. Releases v0.20 through v0.25 shipped between February and April 2026 (&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only by default. Operator mode can open GitHub PRs. No in-cluster execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not first-class. Investigations route to Slack, PagerDuty, or Jira.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free. Robusta sells a managed SaaS that wraps HolmesGPT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; AWS, Azure, and GCP support is exposed through MCP wrappers rather than first-class cloud SDK integration. The customer IAM model must fit MCP's read-only assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 2, Source 3, total &lt;strong&gt;9/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. K8sGPT, Apache 2.0, Kubernetes-only diagnostics
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Quick diagnostic sanity checks on a single cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; CLI, in-cluster operator, or Helm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Rule-based analyser set (Pod, Deployment, Ingress, Service, NetworkPolicy, etc.) with an LLM translating findings into natural language. Closer to L3 (single-shot diagnosis) than L4 (agentic multi-step) on the &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AICL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based per &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;k8sgpt docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not a feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Scope is limited to the cluster API; the tool cannot reach out to cloud APIs or external systems. On the privacy side, resource names and labels are anonymised before LLM calls per the &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 1, Remediation 1, Postmortem 0, Deployment 2, Source 3, total &lt;strong&gt;7/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. NeuBird Hawkeye, closed source, multi-platform
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Datadog-heavy AWS shops that want a managed AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS or VPC. Mayfield, M12, and AWS GenAI Accelerator backing per &lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;neubird.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Ephemeral processing (telemetry not stored). Integrations with Datadog, Splunk, CloudWatch, PagerDuty, and ServiceNow per the &lt;a href="https://neubird.ai/blog/how-hawkeye-works-deep-dive-secure-genai-powered-it-operations/" rel="noopener noreferrer"&gt;Hawkeye deep-dive&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only by default. Integrations forward to ITSM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-investigation pricing listed on AWS Marketplace; enterprise contracts also available. See &lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;NeuBird's product page&lt;/a&gt; for the latest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; "Self-learning" implies a vector store that customers cannot directly inspect. Diligence the data path for regulated workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Resolve.ai, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise teams that want a managed "AI Production Engineer" with named-customer case studies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with in-VPC satellite agent for telemetry. No on-prem option. SOC 2, GDPR, HIPAA per the &lt;a href="https://resolve.ai/about-us" rel="noopener noreferrer"&gt;Resolve trust page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Knowledge-graph plus LLM agent per the &lt;a href="https://resolve.ai/blog/knowledge-graph-agentic-ai-incident-response" rel="noopener noreferrer"&gt;Resolve knowledge-graph post&lt;/a&gt;. Founders include Spiros Xanthos, an OpenTelemetry co-creator. Resolve's &lt;a href="https://resolve.ai/news/resolveai-raises-125-million-series-a" rel="noopener noreferrer"&gt;Series A press release&lt;/a&gt; reports vendor-claimed customer results that Arvo has not independently verified: 72% investigation-time reduction at Coinbase, 87% faster investigations at DoorDash, and 30% fewer engineers per incident at Zscaler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Generates suggested commands. Public architecture detail is limited.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise. Public pricing is not disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Cloud-only and closed-source. The two public LLM benchmark posts (&lt;a href="https://resolve.ai/blog/Our-early-impressions-of-Claude-Sonnet-4.6" rel="noopener noreferrer"&gt;Sonnet 4.6&lt;/a&gt;) use a private dataset with no public methodology, so the numbers cannot be independently replicated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  8. Traversal, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Log-heavy enterprise environments where causal search across telemetry is the bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with flexible deployment options. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;$48M from Sequoia and Kleiner Perkins, June 2025&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; "Production World Model" and "Causal Search Engine" per &lt;a href="https://traversal.com/blog/introducing-causal-search-engine-from-correlated-alerts-to-causally-consistent-diagnoses" rel="noopener noreferrer"&gt;Traversal's product blog&lt;/a&gt;. Vendor-reported production results at American Express, summarised in the Fortune launch coverage and Traversal's &lt;a href="https://traversal.com/blog/american-express-announcement" rel="noopener noreferrer"&gt;Amex announcement&lt;/a&gt;: 32% MTTR reduction and 82% RCA accuracy across roughly 250 billion log lines per day. Customer stories at &lt;a href="https://traversal.com/customer-stories/eventbrite" rel="noopener noreferrer"&gt;Eventbrite&lt;/a&gt;, PepsiCo, and DigitalOcean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Heavy reliance on trademarked frameworks. Confirm during evaluation how much is novel architecture versus packaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Incumbent and incident-workflow tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  9. Datadog Bits AI SRE, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams standardised on Datadog observability who want investigation where the data already lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, multi-tenant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Multi-agent architecture with planner and worker agents. Datadog's engineering posts &lt;a href="https://www.datadoghq.com/blog/building-bits-ai-sre/" rel="noopener noreferrer"&gt;Building Bits AI SRE&lt;/a&gt; and &lt;a href="https://www.datadoghq.com/blog/engineering/bits-ai-eval-platform/" rel="noopener noreferrer"&gt;the evaluation platform&lt;/a&gt; describe the design without releasing source. &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;HIPAA-compliant&lt;/a&gt; per the product page. Seven triage actions including Slack, Teams, and Jira.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Triage actions only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Bits AI drafts post-incident reports per Datadog's &lt;a href="https://www.datadoghq.com/blog/bits-ai-for-incident-management/" rel="noopener noreferrer"&gt;incident-management blog post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-conclusive-investigation billing on top of host, APM, logs, and RUM licensing per &lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;Datadog pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Bits is tightly bound to Datadog's data plane. Using it without the full Datadog stack is not a supported pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  10. Edwin AI (LogicMonitor), closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Existing LogicMonitor Envision customers expanding into agentic AIOps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS layered on LogicMonitor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Ten-plus specialised sub-agents (investigation, correlation, remediation, orchestrator) per the &lt;a href="https://www.logicmonitor.com/blog/meet-edwin-ai-specialized-agents-agentic-aiops" rel="noopener noreferrer"&gt;agent-taxonomy post&lt;/a&gt;. MCP ecosystem support (Dynatrace, Splunk, ServiceNow, Elastic, GitHub, Confluence). A &lt;a href="https://www.logicmonitor.com/blog/fortune-500-it-incident-reduction-edwin-ai" rel="noopener noreferrer"&gt;Forrester Total Economic Impact study&lt;/a&gt; commissioned by LogicMonitor reports 313% ROI on a composite organisation with sub-six-month payback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Closed-loop with policy guardrails per LogicMonitor's product description.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Bundled with LogicMonitor; quoted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Customers must purchase LogicMonitor to use Edwin. Not a standalone option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 2, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  11. incident.io AI SRE, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using incident.io for on-call and incident workflow who want the AI add-on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Multi-agent system searching GitHub PRs, Slack, historical incidents, logs, metrics, and traces per &lt;a href="https://incident.io/blog/introducing-ai-sre" rel="noopener noreferrer"&gt;incident.io's AI SRE introduction&lt;/a&gt;. An "ambient agent" continuously monitors. The &lt;a href="https://www.zenml.io/llmops-database/ai-powered-incident-response-system-with-multi-agent-investigation" rel="noopener noreferrer"&gt;ZenML LLMOps case study&lt;/a&gt; documents the retrieval evolution from embeddings-only to deterministic tagging plus re-ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Recommendations only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Scribe drafts post-incident reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Platform tiers on &lt;a href="https://incident.io/pricing" rel="noopener noreferrer"&gt;incident.io's pricing page&lt;/a&gt;. AI SRE access is gated to design partners as of the launch announcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Verify AI SRE availability for your tier before assuming you can use it on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  12. PagerDuty SRE Agent, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; PagerDuty Operations Cloud customers who want a memory-equipped agent inside the existing on-call surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, inside PagerDuty Operations Cloud per the &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Per-tenant memory: service-scoped observations, incident recollections, human-promoted playbooks. See PagerDuty's engineering post &lt;a href="https://www.pagerduty.com/blog/ai/we-built-an-sre-agent-with-memory-and-its-transforming-incident-response/" rel="noopener noreferrer"&gt;We Built an SRE Agent With Memory&lt;/a&gt;. MCP server. Connectors to Grafana, New Relic, and Honeycomb. Three-tier engagement model (agent-led, collaborative, human-led).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestions and automation hooks through existing PagerDuty workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; PagerDuty Scribe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-seat tiers and AIOps add-ons listed on &lt;a href="https://www.pagerduty.com/pricing/aiops/" rel="noopener noreferrer"&gt;PagerDuty pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; AI pricing across the incident-management category is moving from per-seat to usage-based. Model the long-term cost against incident volume rather than seat count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  13. Rootly AI, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want an AI-first ChatOps incident response with an open MCP server and an actively published agent roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed core. &lt;a href="https://rootly.com/labs" rel="noopener noreferrer"&gt;Rootly AI Labs&lt;/a&gt; publishes open-source prototypes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Analyses code changes, telemetry, and past incidents per the &lt;a href="https://rootly.com/ai-sre" rel="noopener noreferrer"&gt;Rootly AI SRE page&lt;/a&gt;. An AI Meeting Bot joins incident bridges and transcribes. The &lt;a href="https://rootly.com/blog/introducing-rootlys-api-ai-agent-first-approach" rel="noopener noreferrer"&gt;Rootly API agent-first announcement&lt;/a&gt; describes the MCP-based agentic surface used by Cursor, Windsurf, and Claude.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestions plus workflow automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; AI-drafted from incident artefacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Tiers listed on &lt;a href="https://rootly.com/pricing" rel="noopener noreferrer"&gt;Rootly pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; "AI-first" branding outpaces the published architecture detail; in evaluation, ask for the agent loop description and the rule-based-automation boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 1, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  14. ServiceNow Now Assist SRE Specialist, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises on ServiceNow ITSM that want triage and post-mortems inside the same platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, ServiceNow cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; The "SRE Specialist" performs triage (what, impact, priority, who) and autonomous post-mortem authoring, announced as part of the Autonomous Workforce in &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow's Knowledge 2026 release&lt;/a&gt;. GA targeted June 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Workflow automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Autonomous authoring claimed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Custom-quoted. Public pricing is not disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; As of May 2026 the product is pre-GA and most coverage is press-release or keynote material. Treat capabilities as preliminary until verified during the design-partner phase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 2, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  15. Splunk ITSI Episode Summarization, closed source (Alpha)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Splunk-heavy enterprises that want LLM summaries layered on existing KPI engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Splunk Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; &lt;a href="https://www.splunk.com/en_us/blog/observability/conf25-splunk-observability-announcements.html" rel="noopener noreferrer"&gt;ITSI Episode Summarization&lt;/a&gt;, announced at .conf25 (September 2025), is in Alpha. The feature layers an LLM-generated summary (what happened, when, key events, suspected cause) onto Splunk ITSI's KPI-based episodes. Splunk also ships Event iQ for AI-driven alert correlation, listed on the &lt;a href="https://www.splunk.com/en_us/products/it-service-intelligence.html" rel="noopener noreferrer"&gt;ITSI product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Recommendation-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not yet a published feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Splunk ITSI is data-volume or entity-count licensed. The AI features are in Alpha.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Alpha contract and capability terms can shift. Plan a re-evaluation after GA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 1, Remediation 1, Postmortem 1, Deployment 0, Source 0, total &lt;strong&gt;3/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Aurora&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;HolmesGPT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;K8sGPT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Resolve.ai&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Traversal&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Edwin AI&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Rootly AI&lt;/td&gt;
&lt;td&gt;Closed (Labs OSS)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;ServiceNow Now Assist SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;NeuBird Hawkeye&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Datadog Bits AI SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;incident.io AI SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;PagerDuty SRE Agent&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Cleric.ai&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Causely&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Splunk ITSI Episode Summarization&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The open-source projects lead the deployment-flexibility and source-availability axes by definition. Aurora is the only entry that scores 3 on every axis. Most commercial tools cluster around 5 to 6 because they are strong on investigation but weak on deployment flexibility and source availability. Causely and the Alpha-stage Splunk ITSI feature sit at the bottom with 3 each because narrow scope or immaturity caps multiple axes. K8sGPT is also Kubernetes-only, yet its open-source licence lifts it to 7.&lt;/p&gt;

&lt;p&gt;The score does not pick a winner. It picks a fit. A bank under FedRAMP High obligations evaluates this list differently from a 50-engineer Series B startup. The deployment axis answers the fitness question; investigation answers the depth question; source availability answers the trust question.&lt;/p&gt;
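
&lt;p&gt;As a rough illustration of how the totals are computed, the Python sketch below sums the five axis scores for tools 8 to 15, using the per-axis values listed in their entries, and ranks them. The &lt;code&gt;Score&lt;/code&gt; type, the &lt;code&gt;ranked&lt;/code&gt; and &lt;code&gt;weighted_total&lt;/code&gt; helpers, and any weights passed to them are illustrative assumptions, not part of any vendor's tooling.&lt;/p&gt;

```python
from typing import NamedTuple


class Score(NamedTuple):
    """One tool's score on the five axes, each from 0 to 3."""
    investigation: int
    remediation: int
    postmortem: int
    deployment: int
    source: int

    def total(self):
        # The published total is the plain sum of the five axes (max 15).
        return sum(self)


# Per-axis scores copied from the entries for tools 8 to 15.
SCORES = {
    "Traversal": Score(3, 1, 1, 1, 0),
    "Datadog Bits AI SRE": Score(2, 1, 2, 0, 0),
    "Edwin AI": Score(2, 2, 1, 1, 0),
    "incident.io AI SRE": Score(2, 1, 2, 0, 0),
    "PagerDuty SRE Agent": Score(2, 1, 2, 0, 0),
    "Rootly AI": Score(2, 1, 2, 0, 1),
    "ServiceNow Now Assist SRE": Score(2, 2, 2, 0, 0),
    "Splunk ITSI Episode Summarization": Score(1, 1, 1, 0, 0),
}


def ranked(scores):
    """Return (tool, total) pairs, highest total first; ties keep input order."""
    pairs = [(name, score.total()) for name, score in scores.items()]
    return sorted(pairs, key=lambda pair: -pair[1])


def weighted_total(score, weights):
    """Hypothetical fit score: each axis multiplied by a buyer-chosen weight."""
    return sum(w * v for w, v in zip(weights, score))


if __name__ == "__main__":
    for name, total in ranked(SCORES):
        print(name, total)
```

&lt;p&gt;A buyer with strict data-residency obligations might pass weights such as &lt;code&gt;(1, 1, 1, 3, 3)&lt;/code&gt; to &lt;code&gt;weighted_total&lt;/code&gt; to emphasise deployment and source availability. Those weights are hypothetical; the point is only that the same axis scores can produce a different ordering for a different buyer.&lt;/p&gt;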

&lt;h2&gt;
  
  
  How do I choose an AI SRE tool?
&lt;/h2&gt;

&lt;p&gt;Most procurement processes stall because the team compares across all five axes at once. Asking these three questions in order can eliminate as many as twelve of the fifteen tools before vendor demos.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the data have to stay inside your perimeter?&lt;/strong&gt; If yes, the answer is Aurora, HolmesGPT, or K8sGPT. Every commercial product on this list requires data to leave the customer perimeter for inference. See &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt; for the architecture you will need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the scope multi-cloud or Kubernetes-only?&lt;/strong&gt; If multi-cloud, the open-source shortlist narrows to Aurora; in the commercial set, Resolve.ai, Traversal, NeuBird, and incident.io are the credible candidates. If Kubernetes-only, every tool on the list remains a candidate; Aurora's non-Kubernetes integrations simply go unused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need to take action, or only investigate?&lt;/strong&gt; Read-only covers most of the open-source category and most incumbent AI features. Actioning agents narrow the list to Aurora (PR-based, sandboxed kubectl, plus Aurora Actions), ServiceNow Now Assist (workflow automation), and Edwin AI (closed-loop within LogicMonitor).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For depth on the action-safety question, see our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt; guide and &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation Complete Guide&lt;/a&gt;.&lt;/p&gt;
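
&lt;p&gt;The three questions can be read as successive set filters. The sketch below is one possible encoding in Python, assuming the tool groupings named in the answers above; the &lt;code&gt;shortlist&lt;/code&gt; function and the set names are hypothetical and exist only for illustration.&lt;/p&gt;

```python
# Tool names as they appear in the scoring summary table.
ALL_TOOLS = frozenset({
    "Aurora", "HolmesGPT", "K8sGPT", "Resolve.ai", "Traversal",
    "Edwin AI", "Rootly AI", "ServiceNow Now Assist SRE",
    "NeuBird Hawkeye", "Datadog Bits AI SRE", "incident.io AI SRE",
    "PagerDuty SRE Agent", "Cleric.ai", "Causely",
    "Splunk ITSI Episode Summarization",
})

# Groupings taken from the three answers in the list above.
SELF_HOSTABLE = frozenset({"Aurora", "HolmesGPT", "K8sGPT"})
MULTI_CLOUD = frozenset({
    "Aurora", "Resolve.ai", "Traversal", "NeuBird Hawkeye", "incident.io AI SRE",
})
ACTIONING = frozenset({"Aurora", "ServiceNow Now Assist SRE", "Edwin AI"})


def shortlist(data_stays_in_perimeter, multi_cloud, needs_action):
    """Apply the three questions in order and return the surviving tools."""
    candidates = set(ALL_TOOLS)
    if data_stays_in_perimeter:
        candidates.intersection_update(SELF_HOSTABLE)
    if multi_cloud:
        # Kubernetes-only scope applies no filter: every tool stays valid.
        candidates.intersection_update(MULTI_CLOUD)
    if needs_action:
        # Read-only needs apply no filter either.
        candidates.intersection_update(ACTIONING)
    return sorted(candidates)


if __name__ == "__main__":
    print(shortlist(data_stays_in_perimeter=True, multi_cloud=False, needs_action=False))
```

&lt;p&gt;A "no" answer leaves the candidate set untouched, so how far the list narrows depends on how many constraints actually apply to the buyer.&lt;/p&gt;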

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Arvo expects the category to converge along three axes through the rest of 2026.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol convergence.&lt;/strong&gt; PagerDuty, Rootly, Aurora, HolmesGPT, Causely, and Edwin AI have all shipped MCP servers. MCP is on track to become table stakes by year-end, which means differentiation will shift to prompt graphs, RAG quality, and policy guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open benchmarking.&lt;/strong&gt; Resolve.ai and Rootly have published proprietary LLM benchmark posts, neither with a reproducible dataset. The first open, named benchmark with a public incident corpus is likely to become the reference point the rest of the category cites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing model fragmentation.&lt;/strong&gt; Per-seat (PagerDuty, Rootly, incident.io), per-investigation (Datadog Bits AI, NeuBird), per-credit (ServiceNow), per-cloud-host (Edwin AI), and free open source (Aurora, HolmesGPT, K8sGPT) coexist today. Expect convergence on a published reference cost per investigation as buyers compare more rigorously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Differentiation in this market is structural rather than feature-driven. Buyers who score against the capability matrix and apply the deployment, scope, and action questions usually land on a credible shortlist of two or three tools within a week. Buyers running feature-list comparisons can spend a quarter evaluating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Automated Post-Mortem Generation: The Complete Guide for SRE Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 13 May 2026 16:13:39 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/automated-post-mortem-generation-the-complete-guide-for-sre-teams-2026-55ck</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/automated-post-mortem-generation-the-complete-guide-for-sre-teams-2026-55ck</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated post-mortem generation is the process of producing an incident retrospective from artifacts already collected during the incident&lt;/strong&gt; — chat transcript, alert timeline, monitor data, and (in agentic systems) the investigation agent's own tool-call trace. The category is not a single technology; it's an output shared by three distinct architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the Postmortem Provenance Model (PPM).&lt;/strong&gt; Three source types: &lt;strong&gt;(1) chat-transcript postmortems&lt;/strong&gt; (Rootly, incident.io, FireHydrant) summarize what humans said in the channel; &lt;strong&gt;(2) observability-stitched postmortems&lt;/strong&gt; (Datadog Bits AI) summarize what monitors recorded; &lt;strong&gt;(3) agentic-investigation postmortems&lt;/strong&gt; (Aurora) compose from the agent's causal reasoning trace. The three artifacts answer different questions and are not interchangeable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The standards that anchor this work are old, but unchanged by AI.&lt;/strong&gt; &lt;a href="https://sre.google/sre-book/postmortem-culture/" rel="noopener noreferrer"&gt;Google SRE Book, Chapter 15: Postmortem Culture&lt;/a&gt; (Lunney and Lueder, 2016) and &lt;a href="https://www.etsy.com/codeascraft/blameless-postmortems" rel="noopener noreferrer"&gt;John Allspaw's "Blameless PostMortems and a Just Culture"&lt;/a&gt; (Etsy, May 2012) define what a postmortem is for. AI changes the authoring cost, not the purpose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The vendor landscape has been consolidating since late 2023.&lt;/strong&gt; PagerDuty acquired &lt;a href="https://www.pagerduty.com/newsroom/pagerduty-acquires-jeli/" rel="noopener noreferrer"&gt;Jeli in November 2023 for $29.7M&lt;/a&gt;; FireHydrant was acquired by Freshworks in December 2025; Squadcast was acquired by SolarWinds. ServiceNow's &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;Now Assist SRE specialist&lt;/a&gt; (GA targeted June 2026) brings the largest ITSM vendor into the postmortem-generation lane.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source agentic-investigation postmortems are a small lane.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0) generates postmortems from its own investigation agent's reasoning chain and exports to Confluence Cloud (OAuth) or Server / Data Center (PAT), with customizable per-org templates and version history.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A good postmortem outlives the incident. &lt;strong&gt;An automated post-mortem is an incident retrospective whose narrative, timeline, root cause, contributing factors, and action items are drafted by software rather than by hand — typically a large language model, sometimes a tool-using agent, always built on artifacts already collected during the incident.&lt;/strong&gt; This guide is for SRE, platform, and incident-management leaders deciding which automated-postmortem architecture matches their team's working style — not which vendor logo to add to their stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why automation, and why now
&lt;/h2&gt;

&lt;p&gt;Most teams write postmortems by hand. Most postmortems are late, short, and read by no one. The reason is unsentimental: writing a good postmortem takes hours of reconstruction work, on top of an incident that has already drained the on-call engineer's day. Practitioner write-ups converge on roughly 4 to 8 hours per postmortem of moderate complexity, most of it spent in Slack, dashboards, and ticket trails trying to reassemble the timeline.&lt;/p&gt;

&lt;p&gt;The market response since 2023 has been a wave of automated-postmortem features: &lt;a href="https://rootly.com/retrospectives" rel="noopener noreferrer"&gt;Rootly AI Copilot&lt;/a&gt;, &lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; Scribe and AI summaries, &lt;a href="https://firehydrant.com/ai/" rel="noopener noreferrer"&gt;FireHydrant AI-Drafted Retrospectives&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/create-postmortems-with-datadog/" rel="noopener noreferrer"&gt;Datadog Bits AI postmortem variables&lt;/a&gt;, and &lt;a href="https://support.pagerduty.com/main/docs/scribe-agent" rel="noopener noreferrer"&gt;PagerDuty Scribe Agent&lt;/a&gt;. The pitch is similar across them: 90 minutes of human reconstruction collapses to 15 minutes of human review.&lt;/p&gt;

&lt;p&gt;The honest framing is that these tools do real work, but most of them are summarizing artifacts that already exist. &lt;strong&gt;They are not investigating; they are transcribing.&lt;/strong&gt; That's enough for many teams, especially those whose incidents are well-captured in their incident-channel chatter. It is not enough for teams whose incidents require deep investigation across systems — and that gap is what the agentic-investigation category is starting to fill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Postmortem Provenance Model (PPM)
&lt;/h2&gt;

&lt;p&gt;The three architectures differ in what they read from, not in what they produce. Same sections, different evidence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source type&lt;/th&gt;
&lt;th&gt;Reads from&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat-transcript&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack / Teams / Zoom channel for the incident window; on-call chatter; status updates&lt;/td&gt;
&lt;td&gt;Captures human narrative, decisions, and judgment calls verbatim&lt;/td&gt;
&lt;td&gt;Inherits human errors and gaps; weak on infrastructure facts the channel didn't surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability-stitched&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitor events, alert timeline, dashboards, deployment history&lt;/td&gt;
&lt;td&gt;Strong factual timeline, embedded graphs and logs&lt;/td&gt;
&lt;td&gt;Misses human context; weak on contributing factors that aren't in telemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic-investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The investigation agent's tool-call trace, reasoning chain, evidence collected mid-incident&lt;/td&gt;
&lt;td&gt;Causal record of what the system did and what the agent found&lt;/td&gt;
&lt;td&gt;Requires running an investigation agent in the first place; quality depends on the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A team's choice should match its incident profile. If most incidents resolve in chat with little investigation needed, a chat-transcript tool is fine. If incidents are surfaced and resolved entirely in your observability stack, an observability-stitched approach gives you tight monitor-to-postmortem fidelity. If your incidents require traversing AWS, GCP, Kubernetes, and your own services to find the cause, an agentic-investigation postmortem is the only artifact that records the work the agent actually did.&lt;/p&gt;
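
&lt;p&gt;The matching rule in the paragraph above can be sketched as a small decision function. This is a minimal Python illustration, not a feature of any product: the &lt;code&gt;SourceType&lt;/code&gt; enum and &lt;code&gt;recommend_source_type&lt;/code&gt; are hypothetical names, and the precedence order (cross-system investigation first) is an assumption made for the example.&lt;/p&gt;

```python
from enum import Enum


class SourceType(Enum):
    """The three PPM source types from the table above."""
    CHAT_TRANSCRIPT = "chat-transcript"
    OBSERVABILITY_STITCHED = "observability-stitched"
    AGENTIC_INVESTIGATION = "agentic-investigation"


def recommend_source_type(resolved_in_chat,
                          resolved_in_observability_stack,
                          needs_cross_system_investigation):
    """Map an incident profile to a PPM source type.

    Assumed precedence: if incidents need investigation across systems,
    only an agent trace records that work, so it wins over the others.
    """
    if needs_cross_system_investigation:
        return SourceType.AGENTIC_INVESTIGATION
    if resolved_in_observability_stack:
        return SourceType.OBSERVABILITY_STITCHED
    if resolved_in_chat:
        return SourceType.CHAT_TRANSCRIPT
    return None  # profile does not match any of the three cases


if __name__ == "__main__":
    choice = recommend_source_type(True, False, False)
    print(choice.value)
```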

&lt;h2&gt;
  
  
  Standards: what a postmortem is for
&lt;/h2&gt;

&lt;p&gt;It is worth grounding the conversation in what postmortems were designed to do before LLMs existed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://sre.google/sre-book/postmortem-culture/" rel="noopener noreferrer"&gt;Google SRE Book, Chapter 15 — Postmortem Culture: Learning from Failure&lt;/a&gt;&lt;/strong&gt; by John Lunney and Sue Lueder (O'Reilly, 2017). The canonical text on blameless postmortems as organizational learning. The companion &lt;a href="https://sre.google/workbook/postmortem-culture/" rel="noopener noreferrer"&gt;SRE Workbook Chapter 10&lt;/a&gt; updates the practical guidance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.etsy.com/codeascraft/blameless-postmortems" rel="noopener noreferrer"&gt;John Allspaw — Blameless PostMortems and a Just Culture&lt;/a&gt;&lt;/strong&gt; (Etsy Code as Craft, May 22, 2012). The earlier articulation of why blameless-ness is operationally load-bearing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usenix.org/system/files/login/articles/login_spring17_09_lunney.pdf" rel="noopener noreferrer"&gt;Lunney — Postmortem Action Items&lt;/a&gt;&lt;/strong&gt; (USENIX ;login: Spring 2017). The honest practitioner read on why most postmortems' action items never get done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://postmortems.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty's open-source Postmortem documentation&lt;/a&gt;&lt;/strong&gt; (Apache 2.0, &lt;a href="https://github.com/PagerDuty/postmortem-docs" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;). Includes a maintained &lt;a href="https://github.com/PagerDuty/postmortem-docs/blob/master/docs/resources/post_mortem_template.md" rel="noopener noreferrer"&gt;postmortem template&lt;/a&gt; used as a baseline by many teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.thevoid.community/" rel="noopener noreferrer"&gt;Verica Open Incident Database (VOID)&lt;/a&gt;&lt;/strong&gt;. The 2nd Annual VOID Report (December 2022) catalogs approximately 10,000 incidents from 600+ organizations; its central finding is that MTTR is statistically unreliable as a cross-organization comparison and that only ~25% of public incident reports clearly identify a root cause. A useful corrective to the "we reduced MTTR by X%" claims that pepper vendor marketing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/danluu/post-mortems" rel="noopener noreferrer"&gt;Dan Luu's curated postmortems collection&lt;/a&gt;&lt;/strong&gt;. The widest public corpus of real postmortems; useful as RAG fuel for any AI postmortem system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A blameless, learning-oriented postmortem is the goal. Automation changes the authoring cost; it does not relax the standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets auto-generated today
&lt;/h2&gt;

&lt;p&gt;A typical 2026 automated postmortem produces some subset of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; — one paragraph, the executive read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt; — chronological events with timestamps (often HH:MM UTC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt; — customer-facing effect, services affected, error budget burn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause&lt;/strong&gt; — the technical fault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributing factors&lt;/strong&gt; — human, process, and organizational conditions that allowed the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt; — what stopped the bleeding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action items&lt;/strong&gt; — owners, due dates, follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lessons learned&lt;/strong&gt; — what the team would do differently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different products auto-draft different subsets. The "Lessons Learned" section, in particular, is left to humans in most products — for the obvious reason that it is the section where judgment is most consequential.&lt;/p&gt;
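
&lt;p&gt;One way to picture the eight sections is as a draft record in which some sections are machine-generated and others are reserved for people. The Python sketch below assumes that split; &lt;code&gt;PostmortemDraft&lt;/code&gt;, &lt;code&gt;add_generated&lt;/code&gt;, and &lt;code&gt;needs_human&lt;/code&gt; are hypothetical names, not any vendor's schema, and treating "Lessons learned" as strictly human-only is an assumption based on the common practice described above.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# The eight sections listed above, in order.
SECTIONS = (
    "summary", "timeline", "impact", "root_cause",
    "contributing_factors", "resolution", "action_items", "lessons_learned",
)

# Assumption for this sketch: lessons learned is never auto-drafted.
HUMAN_ONLY = frozenset({"lessons_learned"})


@dataclass
class PostmortemDraft:
    """Holds the sections a tool has drafted so far."""
    generated: dict = field(default_factory=dict)

    def add_generated(self, section, text):
        """Store machine-drafted text for a section that allows it."""
        if section not in SECTIONS:
            raise ValueError("unknown section: " + section)
        if section in HUMAN_ONLY:
            raise ValueError(section + " is reserved for human authors")
        self.generated[section] = text

    def needs_human(self):
        """Sections still empty, in document order, for a reviewer to fill."""
        return [s for s in SECTIONS if s not in self.generated]


if __name__ == "__main__":
    draft = PostmortemDraft()
    draft.add_generated("summary", "Checkout latency rose after a deploy.")
    print(draft.needs_human())
```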

&lt;h2&gt;
  
  
  The tooling landscape
&lt;/h2&gt;

&lt;p&gt;Concrete vendor positioning as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;License / hosting&lt;/th&gt;
&lt;th&gt;What it auto-generates&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://rootly.com/retrospectives" rel="noopener noreferrer"&gt;Rootly AI Copilot&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Narrative summary, timeline, action items, root cause, embedded Datadog charts; meeting-bot transcription&lt;/td&gt;
&lt;td&gt;Headline claim: 90 min → 15 min review. Exports to Confluence, Google Docs, Notion, Slack.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io AI postmortems&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Summary, timeline, contributing factors, suggested follow-ups; Scribe transcribes call audio&lt;/td&gt;
&lt;td&gt;"Lessons Learned" is left to humans by design. Exports to Confluence, Notion, Google Docs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://firehydrant.com/ai/" rel="noopener noreferrer"&gt;FireHydrant AI-Drafted Retrospectives&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Description, customer impact, lessons learned; Copilot compares ongoing incident to past incidents&lt;/td&gt;
&lt;td&gt;Acquired by Freshworks December 2025; AI features are Enterprise tier only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.datadoghq.com/blog/create-postmortems-with-datadog/" rel="noopener noreferrer"&gt;Datadog Bits AI postmortems&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Summary, customer impact, lessons learned variables; dynamic embedded graphs and logs&lt;/td&gt;
&lt;td&gt;Exports to Datadog Notebooks, Confluence, or Google Drive.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://support.pagerduty.com/main/docs/scribe-agent" rel="noopener noreferrer"&gt;PagerDuty Scribe Agent&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Real-time call transcription and timeline contributions to PagerDuty's Postmortems product&lt;/td&gt;
&lt;td&gt;Part of PagerDuty's Spring 2026 agent suite (SRE Agent, Scribe Agent, Insights Agent).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0, self-hosted&lt;/td&gt;
&lt;td&gt;Summary, timeline (HH:MM UTC), root cause, impact, contributing factors, resolution, action items, lessons learned; generated from the investigation agent's reasoning trace&lt;/td&gt;
&lt;td&gt;Per-org template overrides; Confluence Cloud (OAuth) and Server / Data Center (PAT) export.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow Now Assist SRE specialist&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Triage + postmortem documentation end to end&lt;/td&gt;
&lt;td&gt;GA targeted June 2026 (Knowledge 2026 announcement).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.squadcast.com/product/postmortems" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;One-click postmortem, webhook automation, templates&lt;/td&gt;
&lt;td&gt;Acquired by SolarWinds.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: the SaaS incident-management vendors all do chat-transcript postmortems well; Datadog owns the observability-stitched lane; Aurora is the open-source agentic-investigation option. ServiceNow's June 2026 GA brings the largest ITSM vendor into the category as a fourth meaningful entrant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: how agentic-investigation postmortems work
&lt;/h2&gt;

&lt;p&gt;Worth describing in detail because this is the category least visible to most buyers.&lt;/p&gt;

&lt;p&gt;In a chat-transcript postmortem system, the flow is: incident channel → LLM with a postmortem template prompt → draft document. In an observability-stitched postmortem system, the flow is: incident timeline + dashboards → LLM with embedding variables → draft document with live charts.&lt;/p&gt;

&lt;p&gt;An agentic-investigation postmortem starts earlier — at the &lt;em&gt;investigation&lt;/em&gt;. The pattern, using Aurora as the concrete open-source example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert webhook arrives.&lt;/strong&gt; PagerDuty, Datadog, Grafana, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, New Relic, Opsgenie, or incident.io fires. The provider-specific RCA-prompt builder constructs the agent's first message, including alert metadata, severity, service, and environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation runs.&lt;/strong&gt; Aurora's ReAct-style LangGraph agent calls tools across the next 3–15 minutes — &lt;code&gt;kubectl&lt;/code&gt;, cloud CLIs, knowledge-base search, Terraform read, Confluence search — and accumulates a transcript of tool calls, tool results, and reasoning steps. The result is persisted as the incident's &lt;code&gt;aurora_summary&lt;/code&gt; — the agent's RCA narrative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem dispatch.&lt;/strong&gt; Once the incident is resolved, a postmortem agent run is dispatched with the agent's RCA summary as load-bearing context. The dispatch is either manual, via Aurora's "Run Action" dropdown on completed incidents, or automatic, via an &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; on-incident-completion trigger. The postmortem agent re-reads the original investigation output, optionally pulls Slack channel context for the incident window, and composes the postmortem under a per-org template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage and versioning.&lt;/strong&gt; Drafts are stored in PostgreSQL with version history. Engineers can edit; subsequent regenerations preserve human edits as a separate version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence export.&lt;/strong&gt; The user clicks Export. Aurora pushes the rendered postmortem to Confluence Cloud (OAuth) or Server / Data Center (PAT), creating a page under a configured space and parent. Export is currently user-triggered rather than automatic, which preserves the human review step before publication.&lt;/li&gt;
&lt;/ol&gt;
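
&lt;p&gt;The storage-and-versioning step is the easiest one to get subtly wrong, so here is a minimal sketch of the idea: an append-only history in which regeneration adds a version instead of overwriting a human edit. It is an in-memory stand-in written for illustration, not Aurora's PostgreSQL schema; &lt;code&gt;DraftHistory&lt;/code&gt;, &lt;code&gt;Version&lt;/code&gt;, and the author labels are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int
    author: str   # "agent" or a human identifier (hypothetical convention)
    body: str


@dataclass
class DraftHistory:
    """Append-only history: nothing is overwritten, so human edits survive."""
    versions: list = field(default_factory=list)

    def _append(self, author: str, body: str) -> Version:
        version = Version(len(self.versions) + 1, author, body)
        self.versions.append(version)
        return version

    def generate(self, body: str) -> Version:
        # A regeneration is recorded as a new agent-authored version.
        return self._append("agent", body)

    def edit(self, author: str, body: str) -> Version:
        return self._append(author, body)

    def latest_human_edit(self):
        # Walk backwards so the most recent human-authored version wins.
        for version in reversed(self.versions):
            if version.author != "agent":
                return version
        return None


history = DraftHistory()
history.generate("Draft v1 from the RCA summary.")
history.edit("oncall-alice", "Draft v1 with corrected impact numbers.")
history.generate("Draft v2 after new Slack context.")
print([v.author for v in history.versions])
print(history.latest_human_edit().number)
```

&lt;p&gt;The property to look for in any tool is the one this sketch enforces by construction: after a regeneration, the human-edited version is still retrievable.&lt;/p&gt;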

&lt;p&gt;The structural difference from chat-transcript postmortems is what evidence the LLM gets. A chat-transcript system can only describe what humans typed. An agentic-investigation system describes what the agent did, which tools it ran, what the cloud responded with, and how it reasoned through to the root cause. The artifact carries the actual causal trail, not a social reconstruction of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate an automated postmortem tool
&lt;/h2&gt;

&lt;p&gt;A rubric you can run on any vendor — open source or commercial.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provenance match.&lt;/strong&gt; Does the tool's source-of-truth match how your team actually runs incidents? Chat-heavy team → chat-transcript. Observability-heavy team → Datadog or equivalent. Investigation-heavy team → agentic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template control.&lt;/strong&gt; Can you replace the vendor's template with your team's? Per-team templates? Aurora supports per-org template overrides via its &lt;code&gt;actions&lt;/code&gt; configuration table; vendor SaaS varies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export target.&lt;/strong&gt; Confluence Cloud, Server / Data Center, Notion, Google Docs, internal wiki. Match your team's documentation home. Aurora supports Confluence (both flavors); the SaaS vendors support different combinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edit lineage.&lt;/strong&gt; When the AI draft is edited, regenerated, and edited again, what survives? Test this explicitly with three round trips. Aurora preserves version history; check each candidate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action-item ownership.&lt;/strong&gt; Does the tool extract action items with owners and due dates, or just bullet points? The Lunney USENIX piece is blunt about why this matters: action items without owners do not get done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded evidence.&lt;/strong&gt; Are graphs, logs, and resource identifiers embedded inline or linked? Embedded evidence survives inside the document itself; linked evidence rots over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and privacy.&lt;/strong&gt; Where does the postmortem text get processed? Self-hosted with bring-your-own-LLM (Aurora) keeps incident data on your infrastructure; SaaS vendors vary in how they handle this and your security team will want to know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standards alignment.&lt;/strong&gt; Does the generated artifact match the blameless tradition (Allspaw, Lunney, the SRE Book) or accidentally drift into individual blame? Check the prompt if you can; otherwise inspect a sample.&lt;/li&gt;
&lt;/ol&gt;
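
&lt;p&gt;If you want to compare several candidates side by side, the rubric can be turned into a small score sheet. The sketch below is one possible shape: the criteria keys mirror the eight items above, while the 0-to-2 rating scale, the equal weighting, and the &lt;code&gt;score_tool&lt;/code&gt; name are assumptions made for illustration.&lt;/p&gt;

```python
# Criteria names mirror the eight rubric items above; the 0-2 scale and the
# equal weighting are assumptions made for this sketch.
CRITERIA = (
    "provenance_match", "template_control", "export_target", "edit_lineage",
    "action_item_ownership", "embedded_evidence", "cost_and_privacy",
    "standards_alignment",
)


def score_tool(ratings: dict) -> tuple:
    """Return (total, unmet) where unmet lists criteria rated 0 or not rated."""
    unknown = set(ratings) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    total = sum(ratings.get(name, 0) for name in CRITERIA)
    unmet = [name for name in CRITERIA if ratings.get(name, 0) == 0]
    return total, unmet


# Hypothetical ratings for one candidate tool.
candidate = {name: 2 for name in CRITERIA}
candidate["edit_lineage"] = 0
candidate["embedded_evidence"] = 1
print(score_tool(candidate))
```

&lt;p&gt;Listing unmet criteria separately from the total is deliberate: a tool that scores well overall but fails edit lineage outright may still be the wrong choice.&lt;/p&gt;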

&lt;h2&gt;
  
  
  How to roll out automation without breaking culture
&lt;/h2&gt;

&lt;p&gt;A six-step adoption plan that respects the standards while still saving time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the easiest 30%&lt;/strong&gt; — short-impact incidents with mostly-chat investigations. These produce passable AI drafts on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep humans on lessons learned.&lt;/strong&gt; Even tools that auto-generate the "Lessons Learned" section ship it as a draft to be aggressively rewritten. The judgment in that section is the point of the postmortem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require human edit before publish.&lt;/strong&gt; The on-call engineer who ran the incident should always be the one who clicks "Publish." This is the cultural firewall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track action-item completion separately.&lt;/strong&gt; AI-generated action items have a known completion-gap problem. Add a weekly review of last week's postmortem action items, with owners called out by name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a quarterly audit of the generated postmortems.&lt;/strong&gt; Pick five at random; have a senior engineer read them critically. Look for drift toward individual blame, missed contributing factors, and surface-level root causes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tighten the loop with the investigation tool.&lt;/strong&gt; If your investigation tool and postmortem tool are the same product (Aurora today, and eventually Resolve.ai-class systems), the postmortem inherits the investigation's evidence chain. This is the highest-quality automated postmortem possible — but it requires running an agentic investigation in the first place.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What can go wrong
&lt;/h2&gt;

&lt;p&gt;A short failure-mode list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surface-level root cause.&lt;/strong&gt; AI drafts read confidently while attributing a deep system issue to its most visible symptom. The cure is human review by someone who was in the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated timeline.&lt;/strong&gt; LLM invents events, misattributes timestamps, or doubles up on entries. Most common when the input artifact (chat transcript or telemetry) has gaps the model patches over.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blame drift.&lt;/strong&gt; AI summary slips into individual-blame framing because the human chat did. The blameless tradition exists exactly for this reason; the AI does not enforce it on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action items without ownership.&lt;/strong&gt; A bullet list of "should do X" with no owner is not an action item; it is decoration. Treat ownerless action items as a failure of the tool's prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edit loss on regeneration.&lt;/strong&gt; Some tools overwrite human edits when the user clicks "Regenerate." Verify that version history is preserved before trusting the tool for a quarter's worth of postmortems.&lt;/li&gt;
&lt;/ul&gt;
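
&lt;p&gt;Some of these failure modes are mechanical enough to catch before a human ever reads the draft. The sketch below lints a draft for duplicate or out-of-order timeline entries and for action items without an owner or due date. The input shapes and the &lt;code&gt;lint_draft&lt;/code&gt; name are hypothetical, and the HH:MM comparison assumes the incident does not cross midnight. It cannot detect a shallow root cause or blame drift; those still need a human reviewer.&lt;/p&gt;

```python
from datetime import datetime


def lint_draft(timeline: list, action_items: list) -> list:
    """Flag mechanical problems; it cannot judge root-cause depth or blame."""
    problems = []
    seen = set()
    previous = None
    for stamp, event in timeline:
        # Assumes HH:MM timestamps within a single day.
        when = datetime.strptime(stamp, "%H:%M")
        if (stamp, event) in seen:
            problems.append(f"duplicate timeline entry: {stamp} {event}")
        seen.add((stamp, event))
        if previous is not None and previous > when:
            problems.append(f"out-of-order timestamp: {stamp}")
        previous = when
    for item in action_items:
        if not item.get("owner"):
            problems.append(f"action item without owner: {item['title']}")
        if not item.get("due"):
            problems.append(f"action item without due date: {item['title']}")
    return problems


# Hypothetical draft content with deliberate defects.
timeline = [
    ("14:02", "Alert fired"),
    ("14:16", "Rollback completed"),
    ("14:09", "Pool exhaustion identified"),
    ("14:02", "Alert fired"),
]
action_items = [
    {"title": "Add pool saturation alert", "owner": "alice", "due": "2026-06-15"},
    {"title": "Review config rollout process", "owner": "", "due": None},
]
for problem in lint_draft(timeline, action_items):
    print(problem)
```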

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is the open-source agentic-investigation entry in this category. Apache 2.0, self-hosted via Docker Compose or Helm. Postmortems are generated from the same agent that ran the investigation, with per-org template control, version history, Slack context backfill, and export to Confluence Cloud or Server / Data Center. If your incidents look like chat-resolved coordination work, you probably don't need Aurora's postmortem layer specifically. If your incidents look like deep cross-cloud investigation work, you probably do.&lt;/p&gt;

&lt;p&gt;For more on how Aurora's investigation half works, see our &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation guide&lt;/a&gt;. For how Aurora's automation primitive (Aurora Actions) lets you chain postmortem generation onto every incident automatically, see the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions launch post&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related guides:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;Root Cause Analysis: Complete Guide for SREs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 13 May 2026 16:12:09 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-powered-incident-investigation-the-complete-guide-for-sre-teams-2026-4hl0</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-powered-incident-investigation-the-complete-guide-for-sre-teams-2026-4hl0</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps&lt;/strong&gt; — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs &lt;code&gt;kubectl&lt;/code&gt;, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the AI Investigation Capability Ladder (AICL).&lt;/strong&gt; Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNCF now hosts two open-source agentic projects in this lane.&lt;/strong&gt; &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; &lt;a href="https://www.cncf.io/blog/2026/01/07/holmesgpt-agentic-troubleshooting-built-for-the-cloud-native-era/" rel="noopener noreferrer"&gt;entered the CNCF Sandbox in October 2025&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; has been Sandbox since December 19, 2023. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DORA has formalized recovery time as Failed Deployment Recovery Time (FDRT).&lt;/strong&gt; Per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's metrics history&lt;/a&gt;, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous. The &lt;a href="https://dora.dev/research/2024/dora-report/2024-dora-accelerate-state-of-devops-report.pdf" rel="noopener noreferrer"&gt;2024 DORA report PDF&lt;/a&gt; added "deployment rework rate" as a fifth core measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The closed-source peer set is well-funded.&lt;/strong&gt; &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai raised $125M at a $1B valuation in February 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt; reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cloud incidents in 2026 surface faster than humans can investigate them. &lt;strong&gt;AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis.&lt;/strong&gt; Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs &lt;code&gt;kubectl describe&lt;/code&gt;, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.&lt;/p&gt;

&lt;p&gt;This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "investigation" means here
&lt;/h2&gt;

&lt;p&gt;Three things blur together when people say "AI incident response":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert correlation&lt;/strong&gt; — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation&lt;/strong&gt; — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic investigation&lt;/strong&gt; — an LLM that runs &lt;em&gt;new&lt;/em&gt; tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run &lt;code&gt;kubectl&lt;/code&gt; will be even more disappointed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Investigation Capability Ladder (AICL)
&lt;/h2&gt;

&lt;p&gt;Six tiers, increasing autonomy. Pick the tier you can defend operationally — going further is engineering, going less far is process.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What runs&lt;/th&gt;
&lt;th&gt;Human role&lt;/th&gt;
&lt;th&gt;Representative tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0 — Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer reads alerts, runs &lt;code&gt;kubectl&lt;/code&gt; and cloud CLIs by hand&lt;/td&gt;
&lt;td&gt;Everything&lt;/td&gt;
&lt;td&gt;PagerDuty, Slack, Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Alert correlation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML correlator clusters and dedupes events&lt;/td&gt;
&lt;td&gt;Triage from a smaller list&lt;/td&gt;
&lt;td&gt;PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — LLM-summarized timeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM summarizes an event stream into prose&lt;/td&gt;
&lt;td&gt;Reads summary instead of raw events&lt;/td&gt;
&lt;td&gt;Datadog Bits AI summaries, incident.io Scribe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Single-shot LLM diagnosis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM produces an RCA from one prompt over alert + telemetry&lt;/td&gt;
&lt;td&gt;Trusts a single inference&lt;/td&gt;
&lt;td&gt;K8sGPT analyzers, vendor "AI insights" buttons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Agentic multi-step investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM agent calls many tools across multiple turns, replans as findings arrive&lt;/td&gt;
&lt;td&gt;Reviews trace, ships fix&lt;/td&gt;
&lt;td&gt;Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L5 — Closed-loop investigate + remediate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent investigates and proposes (or applies, with approval) a fix&lt;/td&gt;
&lt;td&gt;Approves remediation&lt;/td&gt;
&lt;td&gt;Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest framing: &lt;strong&gt;most teams are L0 or L1 today.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional AIOps vs agentic investigation
&lt;/h2&gt;

&lt;p&gt;Both are useful; they cover non-overlapping work.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Traditional AIOps (L1)&lt;/th&gt;
&lt;th&gt;Agentic investigation (L4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event stream, telemetry already ingested&lt;/td&gt;
&lt;td&gt;Same, plus live tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ranked suspects, correlated incidents&lt;/td&gt;
&lt;td&gt;RCA narrative, evidence chain, suggested fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New evidence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — operates on what's already in the system&lt;/td&gt;
&lt;td&gt;Yes — agent issues new commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML clustering / topology distance scoring&lt;/td&gt;
&lt;td&gt;LLM step-by-step (ReAct or similar)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Why it can be wrong&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Missing event, weak topology graph&lt;/td&gt;
&lt;td&gt;Hallucination, tool misuse, prompt drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-event or per-host&lt;/td&gt;
&lt;td&gt;Per LLM token + tool runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quiet — wrong cluster, you don't know&lt;/td&gt;
&lt;td&gt;Loud — agent's trace is human-readable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Vendor-neutral evidence of this stacking pattern is showing up in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's &lt;a href="https://www.splunk.com/en_us/products/it-service-intelligence.html" rel="noopener noreferrer"&gt;ITSI Episode Summarization&lt;/a&gt; (announced at .conf25, September 2025) layers an LLM summary on top of its KPI engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agentic peer set in 2026
&lt;/h2&gt;

&lt;p&gt;This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations&lt;/td&gt;
&lt;td&gt;LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets&lt;/td&gt;
&lt;td&gt;Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Kubernetes resource diagnostics&lt;/td&gt;
&lt;td&gt;CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cleric.ai/" rel="noopener noreferrer"&gt;Cleric.ai&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Slack-first AI SRE&lt;/td&gt;
&lt;td&gt;Gartner Cool Vendor 2025. Integrates Datadog and Grafana.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://resolve.ai/" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Multi-cloud AI SRE&lt;/td&gt;
&lt;td&gt;$125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://traversal.ai/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;"Causal search engine" for production systems&lt;/td&gt;
&lt;td&gt;$48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;Neubird Hawkeye&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Llama 3.2 70B fine-tuned + ChromaDB RAG&lt;/td&gt;
&lt;td&gt;SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.causely.ai/" rel="noopener noreferrer"&gt;Causely&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Causal-graph reasoner for Kubernetes&lt;/td&gt;
&lt;td&gt;Gartner Cool Vendor 2025. MCP server. Gemini-powered.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://ciroos.ai/" rel="noopener noreferrer"&gt;Ciroos.AI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;"SRE Teammate" multi-agent&lt;/td&gt;
&lt;td&gt;MCP and A2A architecture.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is the strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.&lt;/p&gt;

&lt;p&gt;For a deeper open-source-only comparison, see our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: what makes investigation "agentic"?
&lt;/h2&gt;

&lt;p&gt;Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A tool-calling loop (ReAct or similar)
&lt;/h3&gt;

&lt;p&gt;The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the &lt;strong&gt;ReAct pattern&lt;/strong&gt; (Reason + Act, &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;Yao et al. 2022&lt;/a&gt;). Aurora's implementation is a single-node LangGraph workflow wrapping &lt;code&gt;langchain.agents.create_agent&lt;/code&gt;; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is implementation detail — what matters is multi-turn tool use.&lt;/p&gt;
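
&lt;p&gt;The loop itself is small enough to show in full. The sketch below is a framework-free illustration of the reason, act, observe cycle described above, with a scripted function standing in for the LLM so that it runs without any model or API key. It is not Aurora's or HolmesGPT's implementation; the tool names, the &lt;code&gt;scripted_policy&lt;/code&gt; function, and the canned tool outputs are all hypothetical.&lt;/p&gt;

```python
from typing import Callable


# Hypothetical read-only tools; real agents would wrap kubectl, cloud CLIs, etc.
def get_pod_status(service: str) -> str:
    return f"{service}: CrashLoopBackOff, last exit code 137"


def get_recent_deploys(service: str) -> str:
    return f"{service}: memory limit lowered from 1Gi to 256Mi 20 minutes ago"


TOOLS = {"get_pod_status": get_pod_status, "get_recent_deploys": get_recent_deploys}


def scripted_policy(transcript: list) -> dict:
    """Stand-in for the LLM: picks the next step from what it has seen so far."""
    observations = [step["observation"] for step in transcript]
    if not observations:
        return {"action": "get_pod_status", "input": "checkout"}
    if "exit code 137" in observations[-1]:
        return {"action": "get_recent_deploys", "input": "checkout"}
    return {"action": "finish", "input": "OOM kills after memory limit was lowered"}


def run_investigation(policy: Callable, max_steps: int = 5) -> tuple:
    transcript = []
    for _ in range(max_steps):
        decision = policy(transcript)             # reason
        if decision["action"] == "finish":
            return decision["input"], transcript
        tool = TOOLS[decision["action"]]          # act
        observation = tool(decision["input"])     # observe
        transcript.append({**decision, "observation": observation})
    # The step budget keeps a confused policy from looping forever.
    return "inconclusive: step budget exhausted", transcript


rca, transcript = run_investigation(scripted_policy)
print(rca)
print([step["action"] for step in transcript])
```

&lt;p&gt;Two details carry over to real systems regardless of framework: the transcript of actions and observations is the evidence trail a reviewer reads afterwards, and a step budget bounds both cost and runaway loops.&lt;/p&gt;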

&lt;h3&gt;
  
  
  2. Tool reach across the stack
&lt;/h3&gt;

&lt;p&gt;An investigation agent that can only read Kubernetes will miss every multi-cloud incident. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, New Relic, Opsgenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sandboxed CLI execution
&lt;/h3&gt;

&lt;p&gt;Letting the agent run &lt;code&gt;kubectl&lt;/code&gt; and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prompt-injection input rail&lt;/strong&gt; (&lt;a href="https://github.com/NVIDIA/NeMo-Guardrails" rel="noopener noreferrer"&gt;NVIDIA NeMo Guardrails&lt;/a&gt;) blocks commands that originate from injected instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static signature match&lt;/strong&gt; against 37 vendored &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; detection rules covering known-malicious command patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-org command policy&lt;/strong&gt; — allow/deny lists scoped to the customer's tenant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM safety judge&lt;/strong&gt; adapted from &lt;a href="https://github.com/meta-llama/PurpleLlama" rel="noopener noreferrer"&gt;Meta's PurpleLlama AlignmentCheck&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Approved commands execute via &lt;code&gt;kubectl exec&lt;/code&gt; into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; for the full threat model.&lt;/p&gt;
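&lt;p&gt;To make the layering concrete, here is a minimal sketch of a fail-closed check chain in the same order as the list above. It is not Aurora's code: the injection rail and the LLM judge are trivial placeholders, the two regex signatures are invented for illustration (they are not SigmaHQ rules), and the denied prefix is a hypothetical org policy.&lt;/p&gt;

```python
import re
from typing import Callable, List, Optional, Tuple

# Each check returns None to pass, or a short reason string to block.
Check = Callable[[str], Optional[str]]


def injection_rail(command: str) -> Optional[str]:
    # Placeholder: a real rail inspects the conversation, not just the command.
    if "ignore previous" in command.lower():
        return "possible injected instruction"
    return None


# Hypothetical signatures, invented for this sketch.
SIGNATURES = [re.compile(r"rm\s+-rf\s+/"), re.compile(r"curl\s+.*\|\s*(ba)?sh")]


def signature_match(command: str) -> Optional[str]:
    for pattern in SIGNATURES:
        if pattern.search(command):
            return f"matched signature {pattern.pattern}"
    return None


def make_org_policy(denied_prefixes: List[str]) -> Check:
    def org_policy(command: str) -> Optional[str]:
        for prefix in denied_prefixes:
            if command.startswith(prefix):
                return f"denied by org policy: {prefix}"
        return None
    return org_policy


def safety_judge(command: str) -> Optional[str]:
    # Placeholder for an LLM-based judge; always passes in this sketch.
    return None


def evaluate(command: str, checks: List[Tuple[str, Check]]) -> Tuple[bool, str]:
    """Run the layers in order and stop at the first one that blocks."""
    for layer, check in checks:
        reason = check(command)
        if reason is not None:
            return False, f"{layer}: {reason}"
    return True, "approved"


PIPELINE = [
    ("input_rail", injection_rail),
    ("signatures", signature_match),
    ("org_policy", make_org_policy(["kubectl delete"])),
    ("safety_judge", safety_judge),
]
```

&lt;p&gt;Calling &lt;code&gt;evaluate("kubectl get pods -n prod", PIPELINE)&lt;/code&gt; approves the command, while a command starting with &lt;code&gt;kubectl delete&lt;/code&gt; is stopped at the policy layer and never reaches the judge. Cheap deterministic checks run first so the expensive model-based check sees only what survived them.&lt;/p&gt;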

&lt;h3&gt;
  
  
  4. Retrieval over organizational memory
&lt;/h3&gt;

&lt;p&gt;The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.&lt;/p&gt;

&lt;p&gt;The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.&lt;/p&gt;
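&lt;p&gt;One common way to combine a sparse ranking with a dense one is reciprocal rank fusion, sketched below. This illustrates the general idea of hybrid retrieval and is not a description of how Weaviate or Aurora fuses results; the document IDs and the constant &lt;code&gt;k=60&lt;/code&gt; are assumptions chosen for the example.&lt;/p&gt;

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "checkout pods OOMKilled".
sparse_ranking = ["pm-oom-checkout", "runbook-redis", "pm-dns"]           # keyword match
dense_ranking = ["pm-memory-leak", "pm-oom-checkout", "runbook-redis"]    # embedding match

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

&lt;p&gt;Documents that both retrievers agree on rise to the top, while a document found by only one retriever still appears in the fused list instead of being dropped.&lt;/p&gt;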

&lt;h3&gt;
  
  
  5. Infrastructure topology
&lt;/h3&gt;

&lt;p&gt;An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses &lt;a href="https://memgraph.com/" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.&lt;/p&gt;
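&lt;p&gt;The idea of handing the agent digested dependency context can be sketched with a plain dictionary standing in for the graph database. The service names and the output format are hypothetical; a real deployment would populate the map from discovery rather than hard-code it.&lt;/p&gt;

```python
from collections import deque
from typing import Dict, List

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-api": ["payments-api", "orders-db"],
    "payments-api": ["payments-db", "fraud-service"],
    "fraud-service": [],
    "orders-db": [],
    "payments-db": [],
}


def transitive_dependencies(service: str, graph: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning every dependency reachable from the service."""
    seen = {service}
    order: List[str] = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependency in graph.get(current, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


def dependency_context(service: str, graph: Dict[str, List[str]]) -> str:
    """Render the dependencies as one line of text suitable for a prompt."""
    dependencies = transitive_dependencies(service, graph)
    listing = ", ".join(dependencies) if dependencies else "nothing"
    return f"{service} depends on: {listing}"
```

&lt;p&gt;For &lt;code&gt;checkout-api&lt;/code&gt; this yields a single line naming all four downstream services, which is the kind of compact summary an agent can read without querying the graph itself.&lt;/p&gt;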

&lt;h2&gt;
  
  
  What the DORA and VOID anchors actually say
&lt;/h2&gt;

&lt;p&gt;Two industry sources are worth grounding the investigation conversation in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DORA — Failed Deployment Recovery Time.&lt;/strong&gt; Per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's metrics history&lt;/a&gt;, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically — the place where investigation speed matters most. The &lt;a href="https://dora.dev/research/2024/dora-report/2024-dora-accelerate-state-of-devops-report.pdf" rel="noopener noreferrer"&gt;2024 DORA State of DevOps Report PDF&lt;/a&gt; further refined the metric set, adding "deployment rework rate" as a fifth core measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VOID — incident reality, not vendor claims.&lt;/strong&gt; The &lt;a href="https://www.thevoid.community/" rel="noopener noreferrer"&gt;Verica Open Incident Database&lt;/a&gt; catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, including when vendors quote them. The 2024 DORA report itself estimates that a 25% increase in AI adoption was associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability in the 2024 cohort, a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  An evaluation scorecard for AI investigation tools
&lt;/h2&gt;

&lt;p&gt;Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step tool use.&lt;/strong&gt; Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud scope.&lt;/strong&gt; Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxing and RBAC.&lt;/strong&gt; Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG quality.&lt;/strong&gt; Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace readability.&lt;/strong&gt; Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and rate-limit headroom.&lt;/strong&gt; Long agentic investigations are token-expensive. Budget the LLM bill at 10x typical and stress-test rate limits against your busiest incident week, not a quiet one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source vs SaaS posture.&lt;/strong&gt; If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where it sits on the AICL.&lt;/strong&gt; Decide &lt;em&gt;up front&lt;/em&gt; whether you want L4 (investigate and recommend) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to run a low-risk pilot
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one alert source and one cluster.&lt;/strong&gt; PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run read-only for at least four weeks.&lt;/strong&gt; Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingest your historical context.&lt;/strong&gt; Load past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add one chat channel and one slash command.&lt;/strong&gt; Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review traces weekly.&lt;/strong&gt; Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promote to alert-triggered investigation when the trace is clean for two consecutive weeks.&lt;/strong&gt; Point a webhook from PagerDuty / Datadog / Grafana / incident.io straight at the agent. The investigation then starts before the on-call engineer has opened their laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide on L5 (remediation) only after three months at clean L4.&lt;/strong&gt; Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; feature is the open-source pattern for this.&lt;/li&gt;
&lt;/ol&gt;
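&lt;p&gt;The read-only comparison in step 2 is easier to sustain if the numbers come from a small script instead of a spreadsheet. The sketch below assumes a human reviewer marks each incident as agreeing or not; the record fields and sample values are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class PilotIncident:
    incident_id: str
    agent_minutes_to_rca: float
    human_minutes_to_rca: float
    reviewer_says_match: bool  # a human judges whether the two RCAs agree


def summarize(incidents: List[PilotIncident]) -> dict:
    """Compute agreement rate and median time to RCA for a pilot period."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    matches = sum(1 for incident in incidents if incident.reviewer_says_match)
    return {
        "incidents": len(incidents),
        "agreement_rate": matches / len(incidents),
        "median_agent_minutes": median(i.agent_minutes_to_rca for i in incidents),
        "median_human_minutes": median(i.human_minutes_to_rca for i in incidents),
    }


# Made-up pilot data, for illustration only.
pilot = [
    PilotIncident("INC-1", 4, 30, True),
    PilotIncident("INC-2", 6, 45, True),
    PilotIncident("INC-3", 10, 20, False),
    PilotIncident("INC-4", 8, 60, True),
]
report = summarize(pilot)
```

&lt;p&gt;Medians are used instead of means so that one unusually long investigation does not dominate the comparison, which is in keeping with the VOID caution about averaged recovery times.&lt;/p&gt;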

&lt;h2&gt;
  
  
  What can go wrong
&lt;/h2&gt;

&lt;p&gt;A short list of failure modes worth pre-mortem-ing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt drift.&lt;/strong&gt; A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool misuse.&lt;/strong&gt; The agent runs against the wrong cloud account or the wrong cluster, or issues a destructive subcommand. Sandboxing and RBAC mitigate this but do not eliminate it, so keep traces auditable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated identifiers.&lt;/strong&gt; Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost runaway.&lt;/strong&gt; Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-trust.&lt;/strong&gt; The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.&lt;/li&gt;
&lt;/ul&gt;
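&lt;p&gt;Hallucinated identifiers are one of the few failure modes that can be checked mechanically. The sketch below extracts pod-like names from an RCA and flags any that are not in a list of live pods. The regular expression assumes Deployment-style pod names (name, ReplicaSet hash, five-character suffix) and the sample names are invented; in practice the live list would come from the cluster API.&lt;/p&gt;

```python
import re
from typing import Iterable, List

# Assumed naming convention: name-replicasethash-suffix, all lowercase.
POD_NAME = re.compile(r"\b[a-z0-9]+(?:-[a-z0-9]+)*-[a-z0-9]{5,10}-[a-z0-9]{5}\b")


def unknown_pods(rca_text: str, live_pods: Iterable[str]) -> List[str]:
    """Return pod names cited in the RCA that do not exist in the live inventory."""
    known = set(live_pods)
    cited = POD_NAME.findall(rca_text)
    return sorted({name for name in cited if name not in known})


rca = (
    "Pod checkout-7d9f8b6c5d-x2k4q was OOMKilled after "
    "payments-5c6d7e8f9a-abcde restarted."
)
suspicious = unknown_pods(rca, ["checkout-7d9f8b6c5d-x2k4q"])
```

&lt;p&gt;A non-empty result does not prove the RCA is wrong, since a pod may have been replaced since the investigation ran, but it is a cheap signal for which traces to read first.&lt;/p&gt;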

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;We build &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related guides:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE comparison&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Aurora Actions: User-Defined Background Automations for Incident Response</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:49:20 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions are reusable, natural-language automations&lt;/strong&gt; that Aurora's agent executes in the background using all 22+ connected integrations. Available today on the main branch of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three trigger types out of the box&lt;/strong&gt;: manual ("run now"), on incident completion (chain follow-up work after every RCA), and recurring schedule (Celery Beat–driven intervals).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same agent, same tools, different prompt scaffolding.&lt;/strong&gt; Actions reuse Aurora's existing LangGraph agent and 30+ tools (kubectl, aws, gcloud, az, Terraform, Confluence, Slack, GitHub) — they just run as background chat sessions with eager-loaded skills and no RCA mandate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; is a first-class chat primitive.&lt;/strong&gt; Slash-command autocomplete in the chat input, "Run Action" dropdown on completed incidents, and full RBAC-gated CRUD UI in Settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions turn the agent into a programmable platform.&lt;/strong&gt; This is the building block for CI/CD auto-remediation, scheduled audits, and post-incident health checks — covered in &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;our CI/CD Auto-Remediation guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;We shipped one of the most-requested features in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;'s history: &lt;strong&gt;Aurora Actions — user-defined background automations that run on Aurora's agent.&lt;/strong&gt; &lt;strong&gt;An Aurora Action is a named, natural-language instruction the user writes once and then triggers manually, on incident completion, or on a recurring schedule; Aurora's agent executes it as a background task with full access to every connected integration.&lt;/strong&gt; Where traditional incident management tools force you to pick from a fixed catalog of "automations" (close incident, post to Slack, run runbook), Actions are written in plain English and inherit the full reasoning capability of the agent.&lt;/p&gt;

&lt;p&gt;This post is for SRE and platform teams already running Aurora — or evaluating it — who want to understand what Actions actually do, where they fit on the agentic spectrum, and how to use them safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Aurora Action?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Aurora Action&lt;/strong&gt; has four parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A name&lt;/strong&gt; — used as the slash-command handle (&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt;) and as the dropdown label on incident cards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A natural-language instruction&lt;/strong&gt; — the prompt the agent will execute. The same instruction the user would type into chat, except it can reference incident context placeholders when triggered post-incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A trigger type&lt;/strong&gt; — manual, on-incident-completion, or on-schedule (interval-based via &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An on/off toggle&lt;/strong&gt; — actions can be disabled without deletion, with full RBAC for who can create, edit, or trigger them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation is a thin layer over Aurora's existing chat agent. When an Action triggers, the executor service creates a background chat session with the action's instruction as the user message, runs it through the same LangGraph workflow that powers interactive chat, and persists the run history. The agent has full tool access (kubectl, cloud CLIs, Terraform, Slack, GitHub, Confluence, Memgraph, Weaviate) and eager-loaded skills — the only differences from interactive chat are scaffolded prompts and the absence of any RCA mandate.&lt;/p&gt;
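&lt;p&gt;The four parts map naturally onto a small record type. The sketch below is an illustration of that shape and not Aurora's actual schema: the field names, the rule that handles contain no spaces, and the &lt;code&gt;interval_minutes&lt;/code&gt; field are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    MANUAL = "manual"
    ON_INCIDENT_COMPLETION = "on_incident_completion"
    ON_SCHEDULE = "on_schedule"


@dataclass
class Action:
    name: str                  # slash-command handle and dropdown label
    instruction: str           # natural-language prompt the agent executes
    trigger: Trigger
    enabled: bool = True       # on/off toggle; disabling does not delete
    interval_minutes: int = 0  # only meaningful for ON_SCHEDULE (assumed field)

    def slash_command(self) -> str:
        return f"/action {self.name}"

    def validate(self) -> None:
        if not self.name or " " in self.name:
            raise ValueError("name must be a non-empty handle without spaces")
        if not self.instruction.strip():
            raise ValueError("instruction must not be empty")
        if self.trigger is Trigger.ON_SCHEDULE and not self.interval_minutes > 0:
            raise ValueError("scheduled actions need a positive interval")


audit = Action(
    name="iam-audit",
    instruction="Audit IAM roles in production that have not been used in 90 days.",
    trigger=Trigger.ON_SCHEDULE,
    interval_minutes=7 * 24 * 60,
)
audit.validate()
```

&lt;p&gt;Validating at creation time matters more for background work than for chat, because nobody is watching when a misconfigured scheduled action first runs.&lt;/p&gt;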

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Most incident management automation today is &lt;strong&gt;workflow automation&lt;/strong&gt;: PagerDuty fires, Slack channel is created, status page is updated, runbook link is posted. The "automation" is a directed graph of static actions. There is no reasoning, no investigation, no judgment. Tools like Rootly, FireHydrant, and incident.io are excellent at this — but they don't &lt;em&gt;do&lt;/em&gt; anything an SRE wouldn't have to manually verify after the fact.&lt;/p&gt;

&lt;p&gt;Aurora's bet has always been the opposite: &lt;strong&gt;automate the investigation itself.&lt;/strong&gt; Aurora Actions extend that bet from one-shot incident investigations to recurring or post-incident workflows. A few concrete examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Noisy alert tuning&lt;/strong&gt; — "Every Friday at 5pm, review which Datadog alerts fired more than 20 times this week with mean time-to-acknowledge over 10 minutes. Open a Terraform PR to widen the thresholds or move them to a warning channel."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-incident health check&lt;/strong&gt; — "After every completed RCA, run a 15-minute observation on the affected service: check error rate, p99 latency, and pod restart count. Post results to #incident-followup."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled infrastructure audit&lt;/strong&gt; — "Every Monday at 9am, audit IAM roles in the production AWS account that have not been used in 90 days. List candidates for removal in a Confluence page."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are runbook automation. Each requires the agent to query infrastructure, reason about results, and produce a structured output. Each one was previously the job of an on-call engineer doing follow-up between pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Actions sit on the agentic capability spectrum
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE comparison&lt;/a&gt;, we proposed a four-level spectrum for AI SRE capability. Actions don't change the level — they change &lt;em&gt;when the agent runs.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;When the agent runs&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Pre-Actions example&lt;/th&gt;
&lt;th&gt;With Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On alert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Webhook from PagerDuty / Datadog / Grafana&lt;/td&gt;
&lt;td&gt;Aurora investigates the alert and produces an RCA&lt;/td&gt;
&lt;td&gt;Same — investigation flow is unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On user request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer asks a question in chat&lt;/td&gt;
&lt;td&gt;Aurora answers using tools&lt;/td&gt;
&lt;td&gt;Same — plus &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; shortcuts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After every incident&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incident state transitions to "resolved"&lt;/td&gt;
&lt;td&gt;Postmortem generated; engineer manually does follow-up checks&lt;/td&gt;
&lt;td&gt;Action runs automatically with incident context in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On a schedule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Celery Beat interval schedule&lt;/td&gt;
&lt;td&gt;No equivalent — required external scheduler + custom code&lt;/td&gt;
&lt;td&gt;Single source of truth: agent runs the prompt on cadence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The post-incident and scheduled triggers are the genuinely new capability. Before Actions, anything recurring or post-incident required gluing Aurora to an external scheduler, an external prompt store, and bespoke trigger code. Actions collapse all three into the product surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Actions work under the hood
&lt;/h2&gt;

&lt;p&gt;This is for the technically curious. A few architecturally interesting things from the implementation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Background chat sessions, not a separate runtime.&lt;/strong&gt; When an Action triggers, the executor service creates a regular chat session with the action's instruction as the seed message and dispatches it as a background Celery task. The agent doesn't know it's running an Action — it just runs the workflow. This means every capability the interactive agent has (tool calls, RAG, graph traversal, sub-agent orchestration) is available inside Actions for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Eager-loaded skills, no RCA mandate.&lt;/strong&gt; Interactive chat lazy-loads skills based on the user message. Background actions eager-load all skills because there is no human to clarify ambiguity. The system prompt also strips the "your job is to find root cause" framing — Actions can do anything the agent can do, not just investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. RLS context is preserved.&lt;/strong&gt; Aurora uses &lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html" rel="noopener noreferrer"&gt;PostgreSQL row-level security&lt;/a&gt; for multi-tenancy. The executor explicitly sets RLS context (&lt;code&gt;org_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;) before running so background tasks see only their own org's data — even though they run under a service identity.&lt;/p&gt;
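&lt;p&gt;One way to set per-transaction tenant context in PostgreSQL is the built-in &lt;code&gt;set_config&lt;/code&gt; function with its third argument set to true, which scopes the value to the current transaction. The sketch below only builds the statements and does not connect to a database; the setting names &lt;code&gt;app.org_id&lt;/code&gt; and &lt;code&gt;app.user_id&lt;/code&gt; and the &lt;code&gt;%s&lt;/code&gt; placeholder style are assumptions, not Aurora's actual names.&lt;/p&gt;

```python
from typing import List, Tuple

Statement = Tuple[str, tuple]  # (parameterized SQL, parameters)


def rls_context_statements(org_id: str, user_id: str) -> List[Statement]:
    """Build the statements a worker would run at the start of its transaction."""
    if not org_id or not user_id:
        # Fail closed: a background task without tenant context must not run.
        raise ValueError("refusing to run a background task without tenant context")
    return [
        # Third argument true: the setting lasts only for the current transaction.
        ("SELECT set_config('app.org_id', %s, true)", (org_id,)),
        ("SELECT set_config('app.user_id', %s, true)", (user_id,)),
    ]


statements = rls_context_statements("org-123", "user-456")
```

&lt;p&gt;A row-level security policy can then compare a table's tenant column against &lt;code&gt;current_setting('app.org_id')&lt;/code&gt;, so the service identity sees only the rows belonging to the organization that owns the action.&lt;/p&gt;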

&lt;p&gt;&lt;strong&gt;4. Stale run cleanup is integrated.&lt;/strong&gt; Aurora's existing background-chat janitor already handles orphaned chat sessions from crashed pods. Action runs go through the same path, so a worker pod dying mid-action doesn't leave the run state inconsistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. RBAC is enforced at the route layer.&lt;/strong&gt; Action CRUD is gated by Aurora's Casbin-based RBAC. Org admins can restrict which roles can create or trigger actions — important because an Action with cloud-CLI access has real blast radius.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trigger types in detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual triggers
&lt;/h3&gt;

&lt;p&gt;The simplest case. An admin creates the action, an engineer triggers it from the Actions page or via &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; in chat. Useful for codifying common operational tasks ("rotate ECS task definitions for service X", "scan Confluence for stale runbooks") into named, repeatable commands.&lt;/p&gt;

&lt;p&gt;The chat integration is worth calling out: &lt;code&gt;/action&lt;/code&gt; is implemented as an LLM tool call using the same pattern as Aurora's &lt;code&gt;/rca&lt;/code&gt; slash command. The agent processes the action dispatch and then continues responding to the rest of the user's message — so you can write "kick off the IAM audit and tell me what changed since last week" and the agent will dispatch the audit action &lt;em&gt;and&lt;/em&gt; answer your question in the same turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-incident-completion triggers
&lt;/h3&gt;

&lt;p&gt;When an incident transitions to "resolved", any action with this trigger type runs against the incident context. The incident's metadata, RCA, and timeline are available to the action's agent without the user having to paste anything in. This is the trigger that turns Aurora from a reactive tool ("investigate this page") into a continuous one ("investigate, then run health checks, then file the postmortem").&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled triggers
&lt;/h3&gt;

&lt;p&gt;Interval-based, driven by &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;. Choose a cadence (every N minutes / hours / days), and the action runs without user involvement. This is the building block for the CI/CD auto-remediation and scheduled audit use cases — and it's why we're calling this post and the &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt; sister posts.&lt;/p&gt;
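&lt;p&gt;Celery Beat reads its periodic tasks from a &lt;code&gt;beat_schedule&lt;/code&gt; mapping whose entries name a task, a schedule (which may be a &lt;code&gt;timedelta&lt;/code&gt;), and arguments. The sketch below builds such a mapping from stored actions without importing Celery. The task name &lt;code&gt;actions.run_action&lt;/code&gt; and the action record fields are hypothetical, and this is not how Aurora necessarily registers its schedules.&lt;/p&gt;

```python
from datetime import timedelta
from typing import Dict, List


def build_beat_schedule(actions: List[dict]) -> Dict[str, dict]:
    """Turn enabled scheduled actions into beat_schedule-shaped entries."""
    schedule: Dict[str, dict] = {}
    for action in actions:
        if not action["enabled"] or action["trigger"] != "on_schedule":
            continue
        schedule[f"action-{action['name']}"] = {
            "task": "actions.run_action",  # hypothetical task name
            "schedule": timedelta(minutes=action["interval_minutes"]),
            "args": (action["name"],),
        }
    return schedule


stored_actions = [
    {"name": "iam-audit", "trigger": "on_schedule", "enabled": True, "interval_minutes": 1440},
    {"name": "handover", "trigger": "manual", "enabled": True, "interval_minutes": 0},
    {"name": "noisy-alerts", "trigger": "on_schedule", "enabled": False, "interval_minutes": 60},
]
beat_schedule = build_beat_schedule(stored_actions)
```

&lt;p&gt;Only the enabled scheduled action produces an entry here; manual and disabled actions are skipped, which matches the on/off toggle semantics described earlier.&lt;/p&gt;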

&lt;h2&gt;
  
  
  What Actions don't do (and why)
&lt;/h2&gt;

&lt;p&gt;A few capability decisions worth being explicit about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No external webhook triggers&lt;/strong&gt; in this release. We could have added "trigger on arbitrary webhook" but it overlaps with the existing alert-triggered investigation flow. We may add it if we see demand for triggers from systems that don't go through PagerDuty / Datadog / Grafana.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No agent-authored Actions&lt;/strong&gt; yet. The agent can't create or modify Actions on its own. Self-modification is a serious security boundary; we'd want approval gating and audit logging before opening that door. (See our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; for the threat model.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No conditional / DAG composition&lt;/strong&gt; in this release. Actions are single-prompt for now. If you need a multi-step workflow, write a single prompt that describes the steps — the agent is good at sequencing. We'll add explicit composition if the natural-language form proves limiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Safety: what to think about before enabling
&lt;/h2&gt;

&lt;p&gt;Every Action is a small program with access to your cloud environment. A few rules we use ourselves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start read-only.&lt;/strong&gt; Actions inherit Aurora's tool permissions. If your tool config restricts write actions (no &lt;code&gt;kubectl apply&lt;/code&gt;, no &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt;), Actions inherit that posture. Keep it that way for the first few weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use scheduled triggers conservatively.&lt;/strong&gt; A daily audit is cheap. A 5-minute polling loop with cloud CLI calls is not. Watch the LLM bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit who can create Actions.&lt;/strong&gt; RBAC defaults to org-admin-only creation. Leave it there unless you have a clear reason to widen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin the model.&lt;/strong&gt; Action prompts can be sensitive to model behavior. Pin a known-good model per action (gpt-5.5, claude-sonnet-4.6, opus-4.7, etc.) using Aurora's per-org model dropdown until you have confidence in cross-model stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review action runs weekly.&lt;/strong&gt; Every action has a run-history view. Spend 10 minutes a week reading the agent's traces for your scheduled actions — anomalous reasoning is the leading indicator of prompt drift or tool drift.&lt;/li&gt;
&lt;/ol&gt;
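&lt;p&gt;The cadence warning in rule 2 is simple arithmetic, sketched below. The token count per run and the price per million tokens are placeholder figures chosen to make the numbers easy to follow; substitute your own measurements and your provider's actual pricing.&lt;/p&gt;

```python
def runs_per_day(interval_minutes: int) -> float:
    """How many times a scheduled action fires in 24 hours."""
    if not interval_minutes > 0:
        raise ValueError("interval must be positive")
    return 24 * 60 / interval_minutes


def daily_cost_usd(interval_minutes: int, tokens_per_run: int,
                   usd_per_million_tokens: float) -> float:
    """Estimated daily LLM spend for one scheduled action."""
    total_tokens = runs_per_day(interval_minutes) * tokens_per_run
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder assumptions: 50,000 tokens per run at 10 USD per million tokens.
daily_audit = daily_cost_usd(1440, 50_000, 10.0)   # one run per day
five_minute_poll = daily_cost_usd(5, 50_000, 10.0) # 288 runs per day
```

&lt;p&gt;Under these assumed figures the daily audit costs 0.50 USD per day and the five-minute poll costs 144 USD per day, a 288x difference that comes entirely from the cadence.&lt;/p&gt;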

&lt;h2&gt;
  
  
  How to ship your first Action
&lt;/h2&gt;

&lt;p&gt;A six-step recipe.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick a recurring task you currently do manually
&lt;/h3&gt;

&lt;p&gt;Anything you do every week or after every incident. Examples: stale-PR review, alert-noise audit, on-call handover summary. The smaller and more deterministic, the better for v1.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write the prompt as if you were typing it into chat
&lt;/h3&gt;

&lt;p&gt;Don't translate to "automation language." Write it the way you would write a chat message to a smart junior SRE. "Look at..." "Check whether..." "Open a PR that..."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create the Action with a manual trigger
&lt;/h3&gt;

&lt;p&gt;Settings → Actions → New Action. Paste the prompt, set trigger = manual, leave it disabled if you want to review before enabling. Trigger it once and watch the run.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Inspect the run trace
&lt;/h3&gt;

&lt;p&gt;Click the run in the history view. Read every tool call. Look for: tool misuse (wrong cloud account), excessive tool calls (3 attempts at the same thing), hallucinated paths or resource IDs. Iterate on the prompt until the trace is clean for three consecutive runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Promote to the right trigger type
&lt;/h3&gt;

&lt;p&gt;If the action makes sense after every incident → on-incident-completion. If it's a routine sweep → on-schedule with the longest cadence that still meets your need. Only use short cadences when you have a clear cost and blast-radius understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Add it to your team's incident review
&lt;/h3&gt;

&lt;p&gt;Treat agent runs the same way you treat human runs: include them in your weekly incident review. Look for actions that produced wrong output, actions whose output nobody read, and actions that produced output nobody acted on. Delete or downgrade as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora Actions vs traditional incident-management automation
&lt;/h2&gt;

&lt;p&gt;The category most people compare us to is "workflow automation in incident-management SaaS": Rootly, FireHydrant, incident.io. The comparison is informative, but the two are ultimately different categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Aurora Actions&lt;/th&gt;
&lt;th&gt;Rootly / FireHydrant / incident.io workflows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language&lt;/td&gt;
&lt;td&gt;DSL or visual builder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — LLM agent&lt;/td&gt;
&lt;td&gt;No — fixed conditional graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool reach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud CLIs, kubectl, Terraform, Slack, Confluence, GitHub, RAG, infra graph&lt;/td&gt;
&lt;td&gt;Slack, status pages, Zoom, runbook links, ticket creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduled execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Celery Beat)&lt;/td&gt;
&lt;td&gt;Limited (some support timed reminders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Post-incident chaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — full incident context available&lt;/td&gt;
&lt;td&gt;Yes — but limited to workflow actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0, self-hosted)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (self-hosted; LLM tokens only)&lt;/td&gt;
&lt;td&gt;Per-user SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest framing: traditional incident-management tools automate the &lt;em&gt;process around&lt;/em&gt; the incident. Aurora Actions automate &lt;em&gt;what happens inside the agent&lt;/em&gt;. Both have value; they cover non-overlapping work. If you live in PagerDuty and use Rootly for incident channels, Aurora Actions sit alongside that — they don't replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Aurora Actions is the foundation for several capabilities on our roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG composition&lt;/strong&gt; — explicit multi-step Action chains where each step is itself an Action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval gates&lt;/strong&gt; — Actions that pause for human approval before destructive tool calls (already supported in chat; explicit Action-level gating coming).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation hooks&lt;/strong&gt; — first-class integration with GitHub Actions, Jenkins, and ArgoCD so a failing pipeline becomes a triggered Aurora investigation. (Background and detailed write-up in our &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action marketplace&lt;/strong&gt; — community-contributed Actions you can install with one click. Bring-your-own prompt store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll publish each of these as they ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora is fully open source under Apache 2.0. Self-host with &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Docker Compose or Helm&lt;/a&gt;. Actions ship in the next tagged release after &lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;aurora-oss-1.2.15&lt;/a&gt; (April 15, 2026); the feature is available on &lt;code&gt;main&lt;/code&gt; today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare against alternatives:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs traditional incident-management tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:32:08 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most teams do not yet auto-remediate inside CI/CD.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt; — even though AI is now widely used elsewhere in the development lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation is an architectural pattern, not a product category.&lt;/strong&gt; It combines progressive delivery (canary, blue-green), automated metric-driven rollback, and AI-assisted root-cause-and-fix. You assemble and own the components; there is no single SKU to buy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three layers, five maturity levels.&lt;/strong&gt; We propose the &lt;strong&gt;CI/CD Auto-Remediation Maturity Spectrum (CARM)&lt;/strong&gt;: L0 (manual), L1 (rollback), L2 (rollback + diagnostic), L3 (rollback + diagnostic + remediation), L4 (closed-loop with policy gates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source stack is mature.&lt;/strong&gt; &lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, and metric-driven &lt;code&gt;AnalysisTemplates&lt;/code&gt; cover L1–L2 with no AI. AI agents like &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; extend to L3 with Actions-based remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DORA's bar is real.&lt;/strong&gt; Top-performing teams keep change failure rate low and recover from failed deployments in under one hour (&lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program guidance&lt;/a&gt;). Auto-remediation is how non-elite teams close the gap.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ AI SRE products&lt;/a&gt; and dozens of progressive-delivery tools shipping in 2026, only a handful explicitly target the pattern this guide is about. &lt;strong&gt;CI/CD auto-remediation is the practice of having your delivery pipeline automatically detect, diagnose, and recover from failure — without paging a human — using a combination of progressive-delivery primitives, metric-driven rollback policies, and (increasingly) AI agents that propose or apply fixes.&lt;/strong&gt; It is not the same as auto-deploy. It is not the same as canary rollout. It is the closing of the loop between "the pipeline noticed something is wrong" and "the system is back in a good state" — without an engineer in the middle.&lt;/p&gt;

&lt;p&gt;This guide is for SRE and platform teams who already run continuous delivery and want to push toward the auto-remediation end of the spectrum. By the end, you should be able to: position your current setup on the CARM maturity spectrum, identify the next concrete step, and pick a credible tool stack to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why auto-remediation matters in 2026
&lt;/h2&gt;

&lt;p&gt;Three trends explain the demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AI is shipping more code, faster.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage on the TeamCity blog (April 2026)&lt;/a&gt;, AI tools are now used by a large majority of developers in their daily work. The &lt;a href="https://getdx.com/blog/change-failure-rate/" rel="noopener noreferrer"&gt;DX 2026 change-failure-rate analysis&lt;/a&gt; puts a number on it: with 91% of developers having adopted AI and 20%+ of merged code now AI-authored, &lt;strong&gt;code velocity has gone up while quality has gone down.&lt;/strong&gt; More deployments per day mean more chances to break production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The pipeline itself is the new bottleneck.&lt;/strong&gt; &lt;a href="https://blog.jetbrains.com/teamcity/2025/10/the-state-of-cicd/" rel="noopener noreferrer"&gt;JetBrains' 2025 State of CI/CD survey&lt;/a&gt; documents widespread frustration with slow and unreliable CI/CD pipelines as a leading contributor to developer burnout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AI adoption lags in CI/CD specifically.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt;, even though most use AI everywhere else in the development lifecycle. The gap isn't capability; it's trust and integration. AI in IDEs is low-risk; AI in pipelines touches production. Teams want the impact but won't take the blast radius until the architecture is right.&lt;/p&gt;

&lt;p&gt;Auto-remediation is the architecture that closes that gap. It bounds the agent's reach (only inside the delivery pipeline), gives it deterministic guardrails (progressive delivery and metric-driven rollback), and produces a clear contract: detect, diagnose, fix-or-rollback, log.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "auto-remediation" actually means
&lt;/h2&gt;

&lt;p&gt;It is easiest to define by negation. Auto-remediation is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-deploy.&lt;/strong&gt; Auto-deploy ships code on merge. Auto-remediation is what happens &lt;em&gt;after&lt;/em&gt; a problem appears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canary release.&lt;/strong&gt; Canary is the &lt;em&gt;detection mechanism&lt;/em&gt; — it surfaces problems early by shifting traffic gradually. Remediation is the &lt;em&gt;response&lt;/em&gt; — rolling back, hotfixing, or reverting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing infrastructure.&lt;/strong&gt; Self-healing systems like Kubernetes restart pods at runtime. Auto-remediation goes further and handles &lt;em&gt;change-driven&lt;/em&gt; failure recovery: rolling back a bad deploy, rolling forward a fix, or pausing the pipeline while a human investigates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps.&lt;/strong&gt; AIOps platforms surface alerts and correlations. Auto-remediation closes the loop by &lt;em&gt;acting&lt;/em&gt; on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The minimum viable definition: &lt;strong&gt;a pipeline transition from a degraded state back to a healthy state, triggered by automated detection, executed by automated action, observed and logged for human review.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI/CD Auto-Remediation Maturity Spectrum (CARM)
&lt;/h2&gt;

&lt;p&gt;There is no single industry-standard maturity model for auto-remediation. We use the following five-level spectrum — derived from how teams actually evolve.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What happens on failed deploy&lt;/th&gt;
&lt;th&gt;Tools that get you here&lt;/th&gt;
&lt;th&gt;Trust required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0 — Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline fails. PagerDuty pages the on-call. Engineer investigates, decides to roll back or hotfix, executes manually.&lt;/td&gt;
&lt;td&gt;None — this is the default for most teams.&lt;/td&gt;
&lt;td&gt;None — humans do everything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Automated Rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline detects health-check failure (error rate, latency, smoke test). Automatically rolls back to the previous version. Pages a human after the fact.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, &lt;a href="https://spinnaker.io/" rel="noopener noreferrer"&gt;Spinnaker&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the health metric reflects user-visible failure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Rollback + Diagnostic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L1 plus: AI agent runs an investigation when rollback fires. Produces an RCA before the human starts. Page goes out with context, not blank.&lt;/td&gt;
&lt;td&gt;L1 stack + &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the diagnostic is right enough to bias human reasoning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Rollback + Diagnostic + Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L2 plus: agent proposes (or in some cases applies) a fix — a PR, a config change, an alert threshold update. Human reviews and merges.&lt;/td&gt;
&lt;td&gt;L2 stack + Aurora Actions, HolmesGPT Operator mode&lt;/td&gt;
&lt;td&gt;Trust that the agent's fix is correct, scoped, and reviewable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Closed-loop with policy gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L3 plus: certain &lt;em&gt;low-risk, well-understood&lt;/em&gt; fixes are applied automatically inside policy guardrails (alert threshold widening, log-only changes, retry loops). Destructive or high-risk changes still gated.&lt;/td&gt;
&lt;td&gt;L3 stack + policy engine (OPA, Casbin, Kyverno) + audit logging&lt;/td&gt;
&lt;td&gt;Trust the policy gate definitions more than the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams in 2026 are at &lt;strong&gt;L0 or L1&lt;/strong&gt;. The leap from L1 to L2 is the single highest-leverage move available because it preserves human-in-the-loop decision-making while removing the "blank-page" delay that accounts for a large share of MTTR. Recent DORA research &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;renamed its time-to-restore metric (commonly reported as MTTR) to Failed Deployment Recovery Time (FDRT)&lt;/a&gt; precisely because the metric is more meaningful when scoped to change-driven failures, which is exactly the failure mode auto-remediation targets.&lt;/p&gt;
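&lt;p&gt;To make the spectrum concrete, here is a minimal sketch of how a team might place itself on it. The capability flags and the &lt;code&gt;carm_level&lt;/code&gt; helper are hypothetical names of ours, not part of any tool; the only idea taken from the table is that each level builds on the one below it.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCapabilities:
    """Hypothetical capability flags; the names are ours, not from any tool."""
    automated_rollback: bool = False       # L1: metric-driven rollback
    automated_diagnostic: bool = False     # L2: agent produces an RCA on rollback
    agent_proposed_fixes: bool = False     # L3: agent opens PRs for human review
    policy_gated_auto_apply: bool = False  # L4: policy engine allows some fixes


def carm_level(caps: PipelineCapabilities) -> int:
    """Return the highest CARM level whose prerequisites are all met.

    Levels are cumulative: an agent without automated rollback still
    leaves the team at L0, because every level builds on the previous one.
    """
    ladder = [
        caps.automated_rollback,
        caps.automated_diagnostic,
        caps.agent_proposed_fixes,
        caps.policy_gated_auto_apply,
    ]
    level = 0
    for met in ladder:
        if not met:
            break
        level += 1
    return level


if __name__ == "__main__":
    team = PipelineCapabilities(automated_rollback=True, automated_diagnostic=True)
    print(f"CARM level: L{carm_level(team)}")
```

&lt;p&gt;Treating the levels as a ladder rather than a checklist is deliberate: a team that deploys a diagnostic agent before it has automated rollback has not reached L2 in the sense used here.&lt;/p&gt;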

&lt;h2&gt;
  
  
  L1: Automated rollback (where most serious teams should be)
&lt;/h2&gt;

&lt;p&gt;This is the foundation. Without L1, AI-assisted remediation at L2-L3 has nowhere to act.&lt;/p&gt;

&lt;p&gt;The two Apache 2.0 incumbents are &lt;strong&gt;Argo Rollouts&lt;/strong&gt; and &lt;strong&gt;Flagger.&lt;/strong&gt; Both run in Kubernetes; both implement metric-driven progressive delivery with automated rollback. They differ in invasiveness.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/argo/" rel="noopener noreferrer"&gt;Argo&lt;/a&gt; (Graduated, Dec 2022)&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; (Graduated, Nov 2022)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Replaces &lt;code&gt;Deployment&lt;/code&gt; with &lt;code&gt;Rollout&lt;/code&gt; CRD&lt;/td&gt;
&lt;td&gt;Wraps existing &lt;code&gt;Deployment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps pairing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;FluxCD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AnalysisTemplate&lt;/code&gt; querying Prometheus, Datadog, CloudWatch, etc.&lt;/td&gt;
&lt;td&gt;Service-mesh metrics + custom webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traffic shaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native + ingress + service mesh&lt;/td&gt;
&lt;td&gt;Service-mesh first (Istio, Linkerd, App Mesh)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Invasiveness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (changes resource type)&lt;/td&gt;
&lt;td&gt;Lower (transparent wrapper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhooks for custom logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Experiment&lt;/code&gt; resource + analysis runs&lt;/td&gt;
&lt;td&gt;Pre-/post-/during-rollout hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pick Argo Rollouts&lt;/strong&gt; if you already use ArgoCD and want explicit per-step canary control. &lt;strong&gt;Pick Flagger&lt;/strong&gt; if you use a service mesh and want progressive delivery to be transparent to existing manifests.&lt;/p&gt;

&lt;p&gt;For non-Kubernetes pipelines, equivalent capability lives in &lt;strong&gt;Spinnaker&lt;/strong&gt; (multi-cloud, mature), &lt;strong&gt;Harness&lt;/strong&gt; (commercial), and feature-flag platforms like &lt;strong&gt;LaunchDarkly&lt;/strong&gt; (when "rollback" can be a flag flip).&lt;/p&gt;

&lt;p&gt;A minimal Argo Rollouts AnalysisTemplate for HTTP error rate, simplified from the &lt;a href="https://argoproj.github.io/argo-rollouts/features/analysis/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[1m]))&lt;/span&gt;
            &lt;span class="s"&gt;/ sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



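&lt;p&gt;Once more than three 30-second measurements fail (&lt;code&gt;failureLimit: 3&lt;/code&gt; is the number of failures tolerated), the analysis run fails and the rollout aborts back to the stable version. This is L1 in under 20 lines of YAML.&lt;/p&gt;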
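&lt;p&gt;If it helps to reason about when the template above trips, the following sketch replays a list of measured error rates against the same two settings. It is a simulation under an assumption, not Argo Rollouts code: we assume the run fails as soon as the number of failed measurements exceeds &lt;code&gt;failureLimit&lt;/code&gt;, which is how we read that field. The function name and sample data are ours.&lt;/p&gt;

```python
def analysis_outcome(error_rates, max_error_rate=0.01, failure_limit=3):
    """Replay per-interval error rates against the template's rules.

    Assumption: the run fails as soon as the count of failed measurements
    exceeds failure_limit. Returns ("Failed", index) for the measurement
    that would trigger rollback, or ("Running", None) otherwise.
    """
    failures = 0
    for index, rate in enumerate(error_rates):
        # Mirrors the successCondition: a measurement passes at 0.01 or below.
        if rate > max_error_rate:
            failures += 1
        if failures > failure_limit:
            return ("Failed", index)
    return ("Running", None)


if __name__ == "__main__":
    healthy = [0.001, 0.004, 0.002, 0.003]
    degraded = [0.002, 0.05, 0.08, 0.03, 0.09, 0.07]
    print(analysis_outcome(healthy))
    print(analysis_outcome(degraded))
```

&lt;p&gt;Replaying a few real incidents through a function like this is a cheap way to check whether your interval and limit would have rolled back early enough.&lt;/p&gt;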

&lt;h2&gt;
  
  
  L2: Rollback + automated diagnostic
&lt;/h2&gt;

&lt;p&gt;L1 gets you out of an outage fast. It does not tell you &lt;em&gt;why&lt;/em&gt; the deploy failed. The human gets paged with a rollback notification and starts from zero.&lt;/p&gt;

&lt;p&gt;L2 fills that gap with an AI agent that runs when rollback fires. The agent queries the cluster state, the application logs, the rollout metrics, and the changed code — and produces an RCA before the human starts typing.&lt;/p&gt;

&lt;p&gt;Three credible open-source options exist as of 2026 (compared in detail in our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; guide):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt; — rule-based scanner with LLM explanations. Best for low-blast-radius first deployment; explains &lt;em&gt;why&lt;/em&gt; a resource is unhealthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt; — ReAct-loop AI agent (CNCF Sandbox). 30+ observability integrations. Read-only by default. Strong fit for cluster-scoped investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt; — LangGraph supervisor agent. Multi-cloud (AWS / Azure / GCP / OVH / Scaleway). Generates postmortems. Opens remediation PRs with human approval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wiring up L2 is straightforward: send the rollback event to your AI SRE's webhook. Argo Rollouts emits Kubernetes events, and you can route them to the agent via &lt;a href="https://argoproj.github.io/argo-rollouts/features/notifications/" rel="noopener noreferrer"&gt;Argo Rollouts notifications&lt;/a&gt;. The agent investigates and posts results to the on-call Slack channel before the human acknowledges the page.&lt;/p&gt;
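&lt;p&gt;On the receiving side, the glue can be very small. The sketch below shows one way to turn an incoming rollback notification into an investigation request. The payload shape is an assumption: the field names (&lt;code&gt;reason&lt;/code&gt;, &lt;code&gt;rollout&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;revision&lt;/code&gt;) and the reason strings are placeholders for whatever your notification template sends, and the returned dict is not any agent's real API.&lt;/p&gt;

```python
import json

# Placeholder reason strings; match them to what your template actually emits.
ROLLBACK_REASONS = {"RolloutAborted", "AnalysisRunFailed"}


def investigation_request(raw_body: bytes):
    """Turn a rollback notification into a request for the agent, or None.

    Events that are not rollbacks are ignored, so the agent only runs
    after L1 has already fired.
    """
    event = json.loads(raw_body)
    if event.get("reason") not in ROLLBACK_REASONS:
        return None
    return {
        "service": event["rollout"],
        "namespace": event.get("namespace", "default"),
        "failed_revision": event.get("revision"),
        "prompt": (
            f"Rollout {event['rollout']} was rolled back. "
            "Investigate cluster state, logs, rollout metrics, and the "
            "changed code, then post an RCA to the on-call channel."
        ),
    }
```

&lt;p&gt;Filtering on the event reason before calling the agent keeps token spend tied to actual rollbacks rather than to every rollout step.&lt;/p&gt;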

&lt;h2&gt;
  
  
  L3: Diagnostic + agent-proposed remediation
&lt;/h2&gt;

&lt;p&gt;L3 is where the agent starts proposing fixes, not just diagnosing. The pattern that works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pipeline fails → automated rollback (L1).&lt;/li&gt;
&lt;li&gt;Agent investigates → RCA produced (L2).&lt;/li&gt;
&lt;li&gt;Agent proposes a fix as a &lt;strong&gt;pull request&lt;/strong&gt;, with the RCA as the PR description, the diff scoped to one file, and tests where possible.&lt;/li&gt;
&lt;li&gt;Human reviews PR. If correct, merges. If wrong, comments and rejects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because the pull request is the natural human-review surface. The agent doesn't touch production directly; it touches the repository, which already has CI, code review, and a merge gate.&lt;/p&gt;
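&lt;p&gt;Step 3 of the pattern above is where scope discipline matters most, so it is worth enforcing in code. This sketch builds a pull-request description from an agent's proposed fix and refuses anything that touches more than one file. &lt;code&gt;ProposedFix&lt;/code&gt; and &lt;code&gt;to_pull_request&lt;/code&gt; are hypothetical names, and the result is a plain dict for illustration, not a call to any Git hosting API.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ProposedFix:
    """Hypothetical container for what the agent wants to change."""
    title: str
    rca_markdown: str
    changed_files: dict = field(default_factory=dict)  # path -> new content


def to_pull_request(fix: ProposedFix, base_branch: str = "main") -> dict:
    """Build a PR description, rejecting fixes that touch more than one file."""
    if len(fix.changed_files) != 1:
        raise ValueError(
            f"fix must be scoped to exactly one file, got {len(fix.changed_files)}"
        )
    (path,) = fix.changed_files  # safe: exactly one key at this point
    return {
        "title": fix.title,
        "body": fix.rca_markdown,  # the RCA doubles as the PR description
        "base": base_branch,
        "head": "agent/fix-" + path.replace("/", "-").replace(".", "-"),
        "files": dict(fix.changed_files),
        "draft": True,  # a human still reviews and merges
    }
```

&lt;p&gt;Rejecting multi-file fixes outright, instead of trimming them, keeps the failure visible: the agent's proposal goes back for a narrower attempt or to a human.&lt;/p&gt;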

&lt;p&gt;&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; is built precisely for this pattern. A post-incident-completion Action with a prompt like "Open a PR widening alert thresholds for the three noisiest alerts in this incident" converts the human follow-up step into automated PR creation. The human review surface stays exactly the same as for human-authored PRs.&lt;/p&gt;

&lt;p&gt;The HolmesGPT equivalent ships as &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;"Operator mode"&lt;/a&gt; — the agent can write to GitHub when explicitly enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  L4: Closed-loop with policy gates
&lt;/h2&gt;

&lt;p&gt;L4 is the contentious one. It involves the agent making changes &lt;em&gt;without&lt;/em&gt; human approval — but only inside a tightly scoped policy.&lt;/p&gt;

&lt;p&gt;The pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;policy engine&lt;/strong&gt; (&lt;a href="https://www.openpolicyagent.org/" rel="noopener noreferrer"&gt;Open Policy Agent&lt;/a&gt;, &lt;a href="https://kyverno.io/" rel="noopener noreferrer"&gt;Kyverno&lt;/a&gt;, Casbin) defines which classes of remediation can run automatically.&lt;/li&gt;
&lt;li&gt;The agent proposes a fix. The policy engine evaluates whether the fix matches a permitted class.&lt;/li&gt;
&lt;li&gt;If yes → apply automatically with audit logging. If no → route to L3 (PR for human review).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Permitted classes that are usually safe at L4: widening an alert threshold by less than 2x, restarting a pod, scaling a deployment within preset bounds, adding a retry loop to a network call, suppressing a noisy log line.&lt;/p&gt;

&lt;p&gt;Classes that are usually &lt;em&gt;not&lt;/em&gt; safe at L4: any data-plane change, any production traffic routing change, any secret or RBAC change, any change touching the policy engine itself.&lt;/p&gt;
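&lt;p&gt;The routing decision itself is an allowlist with a default of "send it to a human". In production this logic would live in the policy engine (OPA, Kyverno, or Casbin); the Python below only illustrates the shape of the decision. The class identifiers and field names are ours, and we assume a larger threshold value means a wider alert.&lt;/p&gt;

```python
# Fix classes listed above as usually safe at L4. The identifiers are ours.
AUTO_APPLY_CLASSES = {
    "alert_threshold_widen",
    "pod_restart",
    "scale_within_bounds",
    "add_retry",
    "suppress_log_line",
}


def route_fix(fix: dict) -> str:
    """Return "auto-apply" or "pull-request" for a proposed fix.

    Anything not explicitly permitted falls back to L3 (human-reviewed PR).
    """
    kind = fix.get("kind")
    if kind not in AUTO_APPLY_CLASSES:
        return "pull-request"
    if kind == "alert_threshold_widen":
        old, new = fix["old_threshold"], fix["new_threshold"]
        # Only widening by less than 2x is permitted.
        if not (old > 0 and new > old and 2 * old > new):
            return "pull-request"
    if kind == "scale_within_bounds":
        lo, hi = fix["min_replicas"], fix["max_replicas"]
        if not (hi >= fix["replicas"] >= lo):
            return "pull-request"
    return "auto-apply"
```

&lt;p&gt;The important property is the default: an unknown or malformed fix class never reaches the auto-apply branch.&lt;/p&gt;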

&lt;p&gt;The reason L4 is contentious is that the policy gate is now a high-value target. An attacker who can broaden the policy can broaden the agent's blast radius. The same threat model we walk through in our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; applies, plus an additional layer: the policy engine must be operated with the same rigor as the orchestration plane itself.&lt;/p&gt;

&lt;p&gt;Almost no production teams in 2026 run pure L4. The credible deployments are &lt;strong&gt;L3 with hardcoded L4 exceptions&lt;/strong&gt; for two or three well-understood remediation classes. That's where to aim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;A short list of failure modes we have seen — in our own work and in customer deployments.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediating &lt;em&gt;into&lt;/em&gt; a worse state.&lt;/strong&gt; The classic failure is auto-scaling a service to handle elevated error rates that are themselves caused by a downstream dependency. The service scales, hammers the dependency harder, and the dependency collapses. &lt;strong&gt;Fix:&lt;/strong&gt; never auto-remediate without dependency-graph awareness. Aurora uses &lt;a href="https://memgraph.com/" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; for this; HolmesGPT uses its toolset structure; pure-L1 stacks should require manual escalation when the failure crosses service boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trusting the AnalysisTemplate metric too much.&lt;/strong&gt; A 1% error-rate threshold tells you nothing if your real failure mode is requests that stall rather than fail, as on a service dominated by P99 tail latency. &lt;strong&gt;Fix:&lt;/strong&gt; model what user-visible failure actually looks like, not what the cleanest Prometheus query produces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letting the agent run unbounded retries.&lt;/strong&gt; AI agents that hit a "this didn't work" signal will often retry — sometimes thousands of times — burning tokens and triggering downstream rate limits. &lt;strong&gt;Fix:&lt;/strong&gt; cap the agent's tool-call budget. Aurora's executor enforces this by default; verify your agent does the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the post-mortem.&lt;/strong&gt; Auto-remediation that "just worked" without a clear human review of what happened is a slow path to brittleness. Every auto-remediation event should produce a postmortem the on-call reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflating auto-remediation with "self-healing infra".&lt;/strong&gt; Kubernetes pod restarts are not auto-remediation. They are a runtime affordance. Auto-remediation is the response to a &lt;em&gt;change-driven&lt;/em&gt; failure — the deploy, the config push, the schema migration. Keep the categories separate.&lt;/li&gt;
&lt;/ol&gt;
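&lt;p&gt;For pitfall 3, a hard cap on tool calls is simple to add if your agent framework lets you wrap its tools. The following is a generic sketch, not how Aurora or HolmesGPT implement their limits; &lt;code&gt;ToolBudget&lt;/code&gt; and &lt;code&gt;ToolBudgetExceeded&lt;/code&gt; are names we made up for the example.&lt;/p&gt;

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when the agent has used up its tool calls for one investigation."""


class ToolBudget:
    """Counts tool calls and refuses to run more than max_calls of them."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def call(self, tool, *args, **kwargs):
        """Run tool(*args, **kwargs) if budget remains, else raise."""
        if self.used >= self.max_calls:
            raise ToolBudgetExceeded(
                f"tool-call budget of {self.max_calls} exhausted"
            )
        self.used += 1
        return tool(*args, **kwargs)


if __name__ == "__main__":
    budget = ToolBudget(max_calls=2)
    print(budget.call(len, "kubectl get pods"))
    print(budget.call(str.upper, "describe"))
    try:
        budget.call(len, "third call")
    except ToolBudgetExceeded as exc:
        print(f"stopped: {exc}")
```

&lt;p&gt;Counting the call before it runs means a tool that fails or times out still consumes budget, which is the behavior you want when the failure mode is a retry storm.&lt;/p&gt;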

&lt;h2&gt;
  
  
  A pragmatic 90-day path to auto-remediation
&lt;/h2&gt;

&lt;p&gt;For a team currently at L0 or L1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 1–14: instrument and detect
&lt;/h3&gt;

&lt;p&gt;Pick your three highest-traffic services. Add or harden:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic checks that exercise the user-visible path.&lt;/li&gt;
&lt;li&gt;One Prometheus error-rate metric per service with a clear threshold.&lt;/li&gt;
&lt;li&gt;A canary or blue-green rollout primitive (&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt; or &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal at end of week 2: a controlled bad deploy auto-rolls back without human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 15–45: wire in the agent
&lt;/h3&gt;

&lt;p&gt;Deploy one of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, or &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; in read-only mode. Configure rollback events to webhook the agent. Have it post an RCA to your incident channel within five minutes of rollback.&lt;/p&gt;

&lt;p&gt;Goal at end of week 6: every rollback comes with a written diagnostic before the human acknowledges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 46–75: add agent-proposed remediation
&lt;/h3&gt;

&lt;p&gt;Enable PR-creation for the agent (&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; on-incident-completion trigger, or HolmesGPT Operator mode). Constrain initial scope to one repo and one class of fix (alert thresholds, retry loops, log suppression). Review every PR for the first two weeks.&lt;/p&gt;

&lt;p&gt;Goal at end of week 11: agent opens correct PRs in 70%+ of fired rollbacks. False-positive PRs are caught at code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 76–90: policy-gate one fix class for L4
&lt;/h3&gt;

&lt;p&gt;Pick the safest class — usually alert threshold widening when an alert fired more than N times in M hours with mean TTA above some bound. Define an OPA / Kyverno policy that permits &lt;em&gt;only that class.&lt;/em&gt; Wire the agent to apply directly when the policy permits, raise a PR otherwise.&lt;/p&gt;
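&lt;p&gt;The "more than N times in M hours with mean TTA above some bound" rule can be written down precisely before it is translated into a policy. A minimal sketch, assuming TTA means time-to-acknowledge in seconds and that you can export alert firings as (timestamp, TTA) pairs; the function and parameter names are ours.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import mean


def eligible_for_auto_widening(firings, now, min_firings, window_hours, min_mean_tta_s):
    """Decide whether an alert qualifies for the policy-gated L4 lane.

    firings: list of (fired_at: datetime, seconds_to_acknowledge: float).
    The alert qualifies when it fired more than min_firings times inside
    the window and the mean TTA of those firings exceeds the bound.
    """
    cutoff = now - timedelta(hours=window_hours)
    recent = [tta for fired_at, tta in firings if fired_at >= cutoff]
    if not (len(recent) > min_firings):
        return False
    return mean(recent) > min_mean_tta_s


if __name__ == "__main__":
    now = datetime(2026, 5, 1, 12, 0)
    firings = [(now - timedelta(hours=h), 900.0) for h in range(1, 7)]
    print(eligible_for_auto_widening(firings, now, 5, 24, 600.0))
```

&lt;p&gt;Keeping N, M, and the TTA bound as explicit parameters makes them reviewable values in the policy rather than numbers buried in a prompt.&lt;/p&gt;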

&lt;p&gt;Goal at end of week 12: one L4 lane open for one fix class with full audit trail.&lt;/p&gt;

&lt;p&gt;This is the conservative path. Aggressive teams have moved faster, but we have not seen anyone skip steps successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DORA reality check
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program's published guidance&lt;/a&gt; is blunt about what good looks like. Historical State of DevOps Reports have consistently shown the same shape of distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Failure Rate&lt;/strong&gt;: top performers maintain low single-digit percentages; lower performers see substantially higher rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed Deployment Recovery Time (FDRT)&lt;/strong&gt;: top performers recover in under one hour; lower performers can take days to weeks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DORA's research has also consistently found that &lt;strong&gt;speed and stability reinforce each other rather than trade off&lt;/strong&gt; — the fastest teams are also the most stable, per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's history of metrics&lt;/a&gt; and successive State of DevOps Reports. Auto-remediation is one of the small number of capabilities that moves teams across these tiers without requiring deeper organizational change. The L1→L2 jump alone reduces FDRT meaningfully because the human is no longer reconstructing context — the agent has already done it.&lt;/p&gt;
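&lt;p&gt;To track whether each step actually moves these two metrics, you need a consistent way to compute them from your own deployment history. A minimal sketch, assuming one record per deployment with a &lt;code&gt;failed&lt;/code&gt; flag and a &lt;code&gt;recovery_minutes&lt;/code&gt; value for failed ones; the record shape is ours, and using the median for FDRT is our choice rather than something DORA's guidance prescribes.&lt;/p&gt;

```python
from statistics import median


def change_failure_rate(deployments):
    """Fraction of deployments that failed and needed remediation."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return failed / len(deployments)


def failed_deployment_recovery_minutes(deployments):
    """Median minutes from a failed deployment to recovery, or None."""
    durations = [d["recovery_minutes"] for d in deployments if d["failed"]]
    return median(durations) if durations else None


if __name__ == "__main__":
    history = [
        {"failed": False},
        {"failed": True, "recovery_minutes": 12},
        {"failed": True, "recovery_minutes": 48},
        {"failed": False},
    ]
    print(f"CFR: {change_failure_rate(history):.0%}")
    print(f"FDRT (median): {failed_deployment_recovery_minutes(history)} min")
```

&lt;p&gt;Computing both numbers before and after the L1 to L2 change gives you your own evidence for the FDRT reduction instead of relying on industry averages.&lt;/p&gt;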

&lt;h2&gt;
  
  
  Where this is heading
&lt;/h2&gt;

&lt;p&gt;Two predictions, each with a reasonable evidence base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The L2 → L3 transition becomes table-stakes within 18 months.&lt;/strong&gt; AI-authored PRs from agents are already merging in production at multiple companies in our network. Once the review surface is the same as for human-authored PRs (which it already is via GitHub / Bitbucket / GitLab), there is no organizational reason not to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. L4 stays narrow.&lt;/strong&gt; The threat surface of agent-applied changes is genuinely scary, and the per-incident savings of going from L3 to L4 are smaller than the savings from L1 to L2. Expect L4 to be the place where one or two well-understood fix classes get automated, while everything else stays L3.&lt;/p&gt;

&lt;p&gt;The teams who win in 2026-2027 are the ones who get to credible L3 first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;Aurora is the AI agent layer of an auto-remediation stack — it covers L2 (investigation), L3 (PR-based remediation via &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt;), and the agent half of L4 (policy-gated remediation). It does not replace Argo Rollouts or Flagger at L1; those remain the foundation. Aurora is the difference between rolling back blind and rolling back with a written RCA and a draft PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions launch:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions: User-Defined Background Automations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSS comparison:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety architecture:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AI Agent kubectl Safety: Sandboxed Execution for Production</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:44:12 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Giving an AI agent kubectl access is an architecture decision, not a permission flag.&lt;/strong&gt; Per-permission gates fail under prompt injection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OWASP ranks "Excessive Agency" as LLM06 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;2025 Top 10 for LLM Applications&lt;/a&gt;&lt;/strong&gt; and "Tool Misuse and Exploitation" as ASI02 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;2026 Top 10 for Agentic Applications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Kubernetes ecosystem already has an answer&lt;/strong&gt;: &lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; provides a declarative API for isolated agent runtimes using gVisor or Kata Containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real precedent exists.&lt;/strong&gt; &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;EchoLeak (CVE-2025-32711)&lt;/a&gt;, CVSS 9.3, was the first publicly documented zero-click prompt-injection data exfiltration in a production LLM system. The kubectl analogue would be cluster-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora runs every &lt;code&gt;kubectl&lt;/code&gt; command in a pod-isolated process&lt;/strong&gt; via its &lt;code&gt;terminal_run&lt;/code&gt; primitive, with an environment-variable allowlist that strips secrets, signature-matcher and LLM-judge guardrails, and per-invocation cloud credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ products marketed as "AI SRE" in 2026&lt;/a&gt;, only a handful publicly document their kubectl execution architecture — and the gap between vendors that handle this well and vendors that handle it badly is the single largest unspoken risk in the category. &lt;strong&gt;AI agent kubectl safety is the architectural discipline of letting an AI agent run &lt;code&gt;kubectl&lt;/code&gt; (or any cloud CLI) against production without inheriting cluster-wide blast radius if the agent is compromised.&lt;/strong&gt; It is not the same as RBAC scoping, and it is not the same as a human approval prompt — both are necessary but neither is sufficient on its own.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;OWASP published its 2025 Top 10 for LLM Applications&lt;/a&gt;, it ranked &lt;strong&gt;Prompt Injection (LLM01)&lt;/strong&gt; as the top risk and &lt;strong&gt;Excessive Agency (LLM06)&lt;/strong&gt; as one of the most consequential — defining it across three root causes: excessive functionality, excessive permissions, and excessive autonomy. In December 2025, OWASP followed up with a &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;dedicated Top 10 for Agentic Applications&lt;/a&gt; that names &lt;strong&gt;Tool Misuse and Exploitation (ASI02)&lt;/strong&gt; and &lt;strong&gt;Identity and Privilege Abuse (ASI03)&lt;/strong&gt; as primary attack surfaces.&lt;/p&gt;

&lt;p&gt;Translation: if you give an AI agent the ability to run &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; commands against production, you have a security architecture problem — not a permissions problem. This guide walks through the threat model, the emerging Kubernetes sandboxing standard, and how to evaluate any AI SRE on its kubectl safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can go wrong when AI agents run kubectl?
&lt;/h2&gt;

&lt;p&gt;Any LLM-driven agent that executes commands inherits the security properties of the LLM, the harness, and the runtime. Three public references illustrate the failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EchoLeak (CVE-2025-32711)&lt;/strong&gt; — Microsoft 365 Copilot, CVSS 9.3 critical, &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;patched in June 2025&lt;/a&gt;. Discovered by Aim Security, it was the first publicly documented zero-click indirect prompt-injection data exfiltration in a production LLM system. A crafted email sat in Outlook; when the user later asked Copilot for an unrelated summary, the email's hidden instructions fired and exfiltrated SharePoint, OneDrive, and Teams data. Research paper: &lt;a href="https://arxiv.org/abs/2509.10540" rel="noopener noreferrer"&gt;arXiv:2509.10540&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MITRE ATLAS prompt-injection techniques&lt;/strong&gt; — &lt;a href="https://atlas.mitre.org/" rel="noopener noreferrer"&gt;MITRE ATLAS&lt;/a&gt; catalogues real-world adversary techniques against AI systems, including indirect prompt injection that turns an LLM with tool access into an attacker-controlled execution surface. The framework specifically documents techniques for exfiltration via AI agent tool invocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Session Smuggling&lt;/strong&gt; — Palo Alto Unit 42 (November 2025) demonstrated rogue agents exploiting trust in the Agent-to-Agent (A2A) protocol with multi-turn manipulation. Documented in OWASP's Agentic Top 10.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these specifically targeted kubectl-running agents in production — but the class is the same and the blast radius would be larger. An agent that can run &lt;code&gt;kubectl delete&lt;/code&gt; is one prompt-injection payload away from a cluster-wide outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Attack Surfaces of Agentic kubectl
&lt;/h2&gt;

&lt;p&gt;Most teams think of kubectl agent safety as a single problem ("can the agent be tricked?"). It's actually four distinct attack surfaces, each requiring its own mitigation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Why permission-scoping alone fails&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Prompt injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hidden instructions in logs, alerts, runbooks, or chat coerce the agent&lt;/td&gt;
&lt;td&gt;Compromised agent acts within its granted permissions, which is exactly what permission-scoping permits&lt;/td&gt;
&lt;td&gt;Sandboxed runtime; never trust LLM output derived from data the LLM read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Credential leakage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executed command reads &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt; from inherited env&lt;/td&gt;
&lt;td&gt;Permissions live on credentials; if the credential leaks, the permission set leaks with it&lt;/td&gt;
&lt;td&gt;Per-invocation short-lived credentials (STS, Service Principal); explicit env allowlist that strips secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Blast radius escalation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Legitimate command runs against wrong namespace, region, or cluster&lt;/td&gt;
&lt;td&gt;Permissions don't model "right action, wrong target"&lt;/td&gt;
&lt;td&gt;Default read-only; dependency-graph awareness; human approval for destructive writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Audit trail gaps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logs capture commands without the agent's reasoning&lt;/td&gt;
&lt;td&gt;Permission systems audit "who ran what," not "why"&lt;/td&gt;
&lt;td&gt;Per-investigation transcripts that link reasoning → tool calls → outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Attack Surface 1: Prompt injection
&lt;/h3&gt;

&lt;p&gt;The agent reads a log line, alert payload, runbook, or chat message that contains hidden instructions. The LLM cannot reliably distinguish data from instructions in the same channel — this is the fundamental property OWASP's LLM01 captures. Even frontier models do not eliminate it. Anthropic has publicly stated that "no browser agent is immune to prompt injection" and publishes &lt;a href="https://www.anthropic.com/news/prompt-injection-defenses" rel="noopener noreferrer"&gt;defense benchmarks&lt;/a&gt; showing measurable but imperfect attack-prevention rates across computer-use, bash tool use, and MCP workflows. The implication for kubectl-running agents is clear: &lt;strong&gt;the LLM is not the security boundary. The runtime is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mitigation: never trust LLM output that originates from data the LLM also read. Sandbox the execution layer so even a successful injection has limited blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attack Surface 2: Credential leakage
&lt;/h3&gt;

&lt;p&gt;If the agent runs commands with credentials inherited from the host process environment (&lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;), a successful command-injection or shell escape exposes everything the agent process has access to. Long-lived static credentials make this catastrophic.&lt;/p&gt;

&lt;p&gt;Mitigation: per-invocation credential scoping. AWS STS AssumeRole, Azure Service Principal sessions, GCP short-lived tokens. Strip everything else from the child process environment with an explicit allowlist.&lt;/p&gt;
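
&lt;p&gt;The allowlist half of this mitigation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any project's actual implementation: &lt;code&gt;SAFE_ENV_KEYS&lt;/code&gt;, &lt;code&gt;build_child_env&lt;/code&gt;, and &lt;code&gt;run_isolated&lt;/code&gt; are hypothetical names, and the short-lived credentials are assumed to have been minted elsewhere (for example by an STS call) and passed in as a plain dictionary.&lt;/p&gt;

```python
import os
import subprocess

# Hypothetical allowlist for illustration; tune it to what your CLI tools need.
SAFE_ENV_KEYS = frozenset({"PATH", "HOME", "LANG", "TMPDIR", "SSL_CERT_FILE"})


def build_child_env(parent_env, short_lived_creds=None):
    """Return an environment holding only allowlisted keys plus scoped credentials."""
    child_env = {k: v for k, v in parent_env.items() if k in SAFE_ENV_KEYS}
    # Credentials minted for this single invocation are added explicitly,
    # so nothing is inherited from the agent process by accident.
    child_env.update(short_lived_creds or {})
    return child_env


def run_isolated(argv, parent_env=None, short_lived_creds=None):
    """Run argv (a list, never a shell string) with the sanitized environment."""
    source_env = dict(os.environ if parent_env is None else parent_env)
    env = build_child_env(source_env, short_lived_creds)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
```

&lt;p&gt;The design choice worth noting is that the filter is an allowlist rather than a denylist: a secret added to the agent's environment next month is stripped by default, without anyone having to remember to block it.&lt;/p&gt;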

&lt;h3&gt;
  
  
  Attack Surface 3: Blast radius escalation
&lt;/h3&gt;

&lt;p&gt;Even legitimate, non-injected commands can have outsized effects: &lt;code&gt;kubectl delete pod&lt;/code&gt; in the wrong namespace, or &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt; against a misidentified region. The agent doesn't need to be compromised; it just needs to be wrong.&lt;/p&gt;

&lt;p&gt;Mitigation: read-only by default, write actions behind explicit human approval, and dependency-graph awareness so the agent can compute blast radius before acting.&lt;/p&gt;
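
&lt;p&gt;A rough sketch of the read-only default and the approval gate, assuming a hypothetical verb allowlist and hypothetical function names. It deliberately fails closed: anything it cannot positively identify as a read is treated as a write. Dependency-graph awareness is out of scope for this sketch.&lt;/p&gt;

```python
import shlex

# Hypothetical verb allowlist for illustration; extend it for your own policy.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top", "explain", "version"})


def classify_kubectl(command):
    """Return 'read' for allowlisted verbs and 'write' for everything else."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        raise ValueError("not a kubectl command")
    # The first token that is not a flag is taken as the verb. A form such as
    # `kubectl -n prod get pods` yields "prod" here and is therefore classified
    # as a write, which errs on the side of asking for approval.
    verb = next((t for t in tokens[1:] if not t.startswith("-")), None)
    return "read" if verb in READ_ONLY_VERBS else "write"


def is_allowed(command, human_approved=False):
    """Reads run freely; anything else needs explicit human approval."""
    return classify_kubectl(command) == "read" or human_approved
```

&lt;p&gt;For example, &lt;code&gt;is_allowed("kubectl delete pod web-1 -n prod")&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt; until a reviewer sets &lt;code&gt;human_approved=True&lt;/code&gt;.&lt;/p&gt;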

&lt;h3&gt;
  
  
  Attack Surface 4: Audit trail gaps
&lt;/h3&gt;

&lt;p&gt;When an investigation runs across 20+ tool invocations, traditional audit systems (CloudTrail, Kubernetes audit logs) record what was run but not why. A reviewer six months later cannot tell whether a &lt;code&gt;kubectl scale&lt;/code&gt; was a legitimate response to a load spike or an injected instruction.&lt;/p&gt;

&lt;p&gt;Mitigation: structured per-investigation transcripts that capture agent reasoning alongside tool calls. The right log isn't "kubectl was run" — it's "in response to alert X, the agent hypothesized Y, ran kubectl Z, and observed W."&lt;/p&gt;
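
&lt;p&gt;One way to structure such a transcript entry, sketched in Python. The record type and its field names are assumptions made for illustration; the point is only that the alert, the hypothesis, the command, and the observation are written as a single unit.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCallRecord:
    """One step of an investigation: why the agent acted, what it ran, what it saw."""
    investigation_id: str
    alert: str
    hypothesis: str
    command: str
    observation: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record):
    """Serialize the record as one JSON line so reasoning and command stay together."""
    return json.dumps(asdict(record), sort_keys=True)
```

&lt;p&gt;A reviewer can then filter by &lt;code&gt;investigation_id&lt;/code&gt; and read the chain of hypotheses next to the commands they produced, instead of correlating two separate logs by timestamp.&lt;/p&gt;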

&lt;h2&gt;
  
  
  Why "human approval" alone is not enough
&lt;/h2&gt;

&lt;p&gt;The most common safety story in the AI SRE space is "the agent suggests; humans approve." That is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;The problem with approval gates as the only line of defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision fatigue.&lt;/strong&gt; An agent that handles 50 alerts a week generates dozens of approval prompts. Humans rubber-stamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval ≠ understanding.&lt;/strong&gt; Engineers approve commands they don't fully understand because the agent's reasoning sounds plausible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injected intent looks legitimate.&lt;/strong&gt; A prompt-injection payload can produce a recommendation that &lt;em&gt;reads&lt;/em&gt; exactly like a normal RCA. The approver has no signal that the underlying instruction came from an attacker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Approval gates are critical, but they need to sit on top of an already-sandboxed runtime — not be the only protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permission scoping vs sandboxed execution: what's the difference?
&lt;/h2&gt;

&lt;p&gt;These two terms get conflated. They aren't the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission scoping&lt;/strong&gt; restricts what an agent's identity can do. RBAC roles, IAM policies, kubeconfig contexts. It's necessary, but it operates at the cluster-API layer — meaning a successful prompt injection can still use every permission the agent has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxed execution&lt;/strong&gt; isolates the &lt;em&gt;runtime&lt;/em&gt; in which commands execute. If the agent's process is compromised, the sandbox limits what the compromised process can do regardless of the credentials it holds. The compromised process can't read other pods' files, can't reach other nodes, can't escalate to the host kernel.&lt;/p&gt;

&lt;p&gt;The defensible architecture combines both: tight permission scoping (small RBAC role, short-lived credentials) + runtime isolation (sandboxed execution).&lt;/p&gt;

&lt;h2&gt;
  
  
  How sandboxed kubectl actually works
&lt;/h2&gt;

&lt;p&gt;The Kubernetes ecosystem began converging on this pattern in 2025–2026, although no single standard has been settled yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  k8s-sigs/agent-sandbox
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; is a formal Kubernetes SIG Apps subproject that launched at KubeCon Atlanta in November 2025. It provides a declarative Kubernetes API for "isolated, stateful, singleton workloads" — built specifically for AI agent runtimes that may execute untrusted, LLM-generated code.&lt;/p&gt;

&lt;p&gt;Core CRDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Sandbox&lt;/code&gt; — an isolated pod-equivalent with stronger boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxTemplate&lt;/code&gt; — reusable configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxClaim&lt;/code&gt; — request a sandbox for a workload&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxWarmPool&lt;/code&gt; — pre-created sandboxes that bring cold-start under one second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/" rel="noopener noreferrer"&gt;Kubernetes blog post from March 2026&lt;/a&gt; makes the architectural claim explicit: "Isolation achieved via runtime-level sandboxing (gVisor/Kata), not just container-level namespaces."&lt;/p&gt;

&lt;h3&gt;
  
  
  gVisor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; is a Google-maintained user-space application kernel that provides kernel-level isolation without full virtualization. Architecture: &lt;strong&gt;Sentry&lt;/strong&gt; (a kernel emulator written in Go) intercepts roughly 200 Linux syscalls; &lt;strong&gt;Gofer&lt;/strong&gt; brokers filesystem access over 9P. The OCI runtime is &lt;code&gt;runsc&lt;/code&gt;, drop-in compatible with &lt;code&gt;runc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;gVisor runs in production at Google for App Engine standard, Cloud Functions, Cloud Run, and Cloud ML Engine. GKE Sandbox productizes it for GKE node pools. It is one of two named isolation backends in agent-sandbox (the other being Kata Containers, which uses lightweight VMs).&lt;/p&gt;
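
&lt;p&gt;Outside of agent-sandbox, the usual way to opt a pod into gVisor is a &lt;code&gt;RuntimeClass&lt;/code&gt; whose handler is &lt;code&gt;runsc&lt;/code&gt;, referenced from the pod via &lt;code&gt;runtimeClassName&lt;/code&gt;. The sketch below builds both manifests as Python dictionaries and emits JSON, which &lt;code&gt;kubectl apply -f -&lt;/code&gt; accepts. It assumes the nodes already have &lt;code&gt;runsc&lt;/code&gt; installed and registered with the container runtime; the pod name and image are placeholders.&lt;/p&gt;

```python
import json


def gvisor_runtime_class(name="gvisor"):
    """RuntimeClass that maps a name to the runsc handler configured on the nodes."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": "runsc",
    }


def sandboxed_pod(pod_name, image, runtime_class="gvisor"):
    """Pod that opts into the sandboxed runtime through runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class,
            "containers": [{"name": "terminal", "image": image}],
        },
    }


def as_list(items):
    """Wrap manifests in a v1 List so they can be applied in one call."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    # Placeholder names; pipe the output into `kubectl apply -f -`.
    manifests = as_list([
        gvisor_runtime_class(),
        sandboxed_pod("agent-terminal", "example.com/agent-terminal:latest"),
    ])
    print(json.dumps(manifests, indent=2))
```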

&lt;h3&gt;
  
  
  Why this matters for AI SRE
&lt;/h3&gt;

&lt;p&gt;An AI SRE that runs &lt;code&gt;kubectl&lt;/code&gt; against production is exactly the kind of workload agent-sandbox was built for. It executes LLM-generated commands. It needs file system isolation, syscall isolation, and per-invocation credential scoping. It benefits enormously from a warm pool that reduces cold-start latency.&lt;/p&gt;

&lt;p&gt;If you are evaluating an AI SRE in 2026, this is one of the right questions to ask: &lt;em&gt;what isolation backend does the agent use when it executes commands?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aurora's pod-isolated execution works
&lt;/h2&gt;

&lt;p&gt;Aurora's approach predates agent-sandbox and follows the same architectural principles.&lt;/p&gt;

&lt;p&gt;When Aurora's agent runs a &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; command, it doesn't use &lt;code&gt;subprocess.run()&lt;/code&gt; directly. It uses an internal primitive called &lt;code&gt;terminal_run&lt;/code&gt;, defined in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;&lt;code&gt;server/utils/terminal/terminal_run.py&lt;/code&gt;&lt;/a&gt;. The module's docstring is explicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Drop-in replacement for subprocess.run() that executes in terminal pods. This module provides a terminal_run() function that mimics subprocess.run() API but executes commands in isolated terminal pods via kubectl exec. Safety guardrails (signature matcher + LLM judge) run automatically unless the caller passes &lt;code&gt;trusted=True&lt;/code&gt; for known-safe internal operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three properties matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pod-isolated execution.&lt;/strong&gt; When the &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; flag is set (the default in Kubernetes deployments), every external command runs inside a separate terminal pod via &lt;code&gt;kubectl exec&lt;/code&gt;. The agent's own process never executes the command directly. A successful command-injection in the agent's reasoning loop does not give an attacker access to the agent host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Two-stage safety guardrails.&lt;/strong&gt; Before any non-trusted command runs, two checks fire automatically: a deterministic signature matcher that rejects known-dangerous patterns, and an LLM judge that evaluates the proposed command against the investigation context. The &lt;code&gt;trusted=True&lt;/code&gt; flag bypasses both — used only for known-safe internal operations like configured connector calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sanitized environment allowlist.&lt;/strong&gt; Aurora's &lt;code&gt;terminal_exec_tool&lt;/code&gt; module defines an explicit &lt;code&gt;_SAFE_ENV_KEYS&lt;/code&gt; set: &lt;code&gt;PATH&lt;/code&gt;, &lt;code&gt;HOME&lt;/code&gt;, &lt;code&gt;USER&lt;/code&gt;, &lt;code&gt;SHELL&lt;/code&gt;, &lt;code&gt;TERM&lt;/code&gt;, &lt;code&gt;LANG&lt;/code&gt;, &lt;code&gt;TMPDIR&lt;/code&gt;, &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;, plus &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; itself. Everything else — including &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;DATABASE_URL&lt;/code&gt;, &lt;code&gt;SECRET_KEY&lt;/code&gt;, and any cloud credentials — is stripped from the child process environment. A compromised command cannot read the agent's secrets via &lt;code&gt;env&lt;/code&gt;.&lt;/p&gt;
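
&lt;p&gt;The two-stage guardrail in property 2 can be pictured roughly as follows. This is a sketch of the general pattern, not Aurora's source: the deny patterns, the &lt;code&gt;guard&lt;/code&gt; function, and the &lt;code&gt;judge&lt;/code&gt; callable are hypothetical, and the LLM call is left as a pluggable function so the example runs on its own.&lt;/p&gt;

```python
import re

# Hypothetical deny patterns for illustration; a real list would be far longer.
DANGEROUS_SIGNATURES = [
    re.compile(r"\bkubectl\s+delete\s+(namespace|ns)\b"),
    re.compile(r"\bkubectl\s+delete\b.*\s--all\b"),
    re.compile(r"\brm\s+-rf\s+/"),
]


def signature_check(command):
    """Stage 1: deterministic match against known-dangerous patterns."""
    return not any(p.search(command) for p in DANGEROUS_SIGNATURES)


def guard(command, context, judge, trusted=False):
    """Return True when the command may run.

    `judge` is any callable(command, context) returning True or False. In
    practice it would wrap an LLM call, which is stubbed out here.
    """
    if trusted:
        return True  # bypass reserved for known-safe internal operations
    if not signature_check(command):
        return False  # cheap deterministic rejection before the judge runs
    return bool(judge(command, context))
```

&lt;p&gt;Running the deterministic stage first means an obviously destructive command is rejected even if the judge model has itself been influenced by injected text.&lt;/p&gt;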

&lt;p&gt;Cloud credentials are handled separately. Aurora calls &lt;code&gt;generate_contextual_access_token&lt;/code&gt; and &lt;code&gt;generate_azure_access_token&lt;/code&gt; per invocation. AWS uses STS AssumeRole via cross-account roles (&lt;a href="https://github.com/Arvo-AI/aurora/tree/main/server/connectors/aws_connector" rel="noopener noreferrer"&gt;&lt;code&gt;aurora-cross-account-role.yaml&lt;/code&gt;&lt;/a&gt;) — short-lived credentials, not long-lived access keys. Azure uses Service Principal sessions. GCP uses OAuth-derived tokens.&lt;/p&gt;

&lt;p&gt;For agents that need to reach customer Kubernetes clusters Aurora can't access directly, a separate &lt;a href="https://github.com/Arvo-AI/aurora/tree/main/kubectl-agent" rel="noopener noreferrer"&gt;&lt;code&gt;kubectl-agent&lt;/code&gt;&lt;/a&gt; binary deploys via Helm into the customer's cluster and connects outbound over WebSocket. No inbound network access required, no kubeconfig sharing, no static credentials at rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate an AI SRE's kubectl safety model
&lt;/h2&gt;

&lt;p&gt;Eight questions to ask any AI SRE vendor or open-source project before enabling production access:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where does the command actually execute?&lt;/strong&gt; Same process as the agent? Same host? Separate container? Sandboxed runtime (gVisor/Kata)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What credentials does the command inherit from the host environment?&lt;/strong&gt; Specifically: can the executed command read your agent's vault token, database URL, or other host secrets?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are credentials short-lived or static?&lt;/strong&gt; STS / Service Principal sessions, or long-lived access keys?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the default read-only?&lt;/strong&gt; What flag, configuration, or RBAC role enables write access?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens between "agent decides to run X" and "X runs"?&lt;/strong&gt; Is there a deterministic policy check? An LLM judge? A human approval prompt? All three?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are destructive actions specifically gated?&lt;/strong&gt; What's the definition of "destructive" — vendor-defined or operator-configurable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What does the audit trail capture?&lt;/strong&gt; Just the commands, or the agent's reasoning + the commands together?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the blast radius of a single successful prompt injection?&lt;/strong&gt; Walk through the worst case explicitly with the vendor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a vendor can't answer these clearly, the architecture isn't ready for production write access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions in 2026
&lt;/h2&gt;

&lt;p&gt;This is a young problem space. Several questions are not yet resolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization.&lt;/strong&gt; k8s-sigs/agent-sandbox is the leading candidate for a standard, but Knative Sandbox, container-level approaches, and microVM-based runtimes (Firecracker) are all in play.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud isolation.&lt;/strong&gt; Sandboxing a Kubernetes pod is a solved problem. Sandboxing a process that calls &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; across cloud APIs from a single agent is harder — the credentials and trust boundaries change per provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval UX at scale.&lt;/strong&gt; Engineers can't approve 200 actions per week. The right UI for batch approval, policy-based pre-approval, and rollback-only autonomy is still being figured out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expect significant movement on all three through 2026 and into 2027.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora's approach in summary
&lt;/h2&gt;

&lt;p&gt;If you operate an AI SRE in production, the safety questions are non-negotiable. Aurora's answer is: pod-isolated execution by default, deterministic + LLM-judge guardrails before any non-trusted command, environment-variable allowlist that strips secrets, per-invocation cloud credentials via STS/Service Principal/short-lived tokens, and human approval for destructive write operations. The full architecture is open source under Apache 2.0 — auditable in the &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For background on the agent and tool model, see the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;complete guide to AI SRE&lt;/a&gt;, the &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;open-source AI SRE comparison&lt;/a&gt;, or the explainer on &lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;agentic incident management&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:38:19 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three credible open-source AI SREs exist in 2026&lt;/strong&gt;: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox). All three are Apache 2.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only two are true multi-step agents.&lt;/strong&gt; HolmesGPT runs an iterative ReAct loop. Aurora is a multi-step LangGraph agent with cross-cloud execution. K8sGPT is a rule-based scanner that uses an LLM only to explain findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only Aurora handles multi-cloud&lt;/strong&gt; out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation pull requests separate the three.&lt;/strong&gt; Aurora generates them; HolmesGPT can open PRs with suggested fixes in Operator mode; K8sGPT is strictly read-only with no write actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three support BYO LLM&lt;/strong&gt;, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ companies offering "AI SRE" products in 2026&lt;/a&gt;, only a handful are open source — and only three are credible enough to deploy in production: &lt;strong&gt;Aurora&lt;/strong&gt;, &lt;strong&gt;HolmesGPT&lt;/strong&gt;, and &lt;strong&gt;K8sGPT&lt;/strong&gt;. &lt;strong&gt;An open-source AI SRE is an AI agent that performs incident investigation, root cause analysis, and (sometimes) remediation under a permissive license that allows self-hosting, source-code audit, and modification.&lt;/strong&gt; They get lumped together in marketing, but architecturally these three are different products solving different parts of the incident response problem.&lt;/p&gt;

&lt;p&gt;This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an open-source AI SRE?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;open-source AI SRE&lt;/strong&gt; is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hostable&lt;/strong&gt;: runs entirely inside your environment without phoning home to a vendor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-driven&lt;/strong&gt;: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, auditable AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why open source matters for AI SRE
&lt;/h2&gt;

&lt;p&gt;Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty.&lt;/strong&gt; Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs typically send it to their own backend and often to a third-party LLM. Self-hosted means it stays in your VPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit transparency.&lt;/strong&gt; Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost predictability.&lt;/strong&gt; Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens — and Ollama-local inference can flatten the LLM bill entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the three compare
&lt;/h2&gt;

&lt;p&gt;This is the only table you need. Verified from each project's GitHub repo, official docs, and source as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;201&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;2,366&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;7,737&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;v1.1.1 (Mar 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt/releases" rel="noopener noreferrer"&gt;0.26.0 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases" rel="noopener noreferrer"&gt;v0.4.32 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;Sandbox (Oct 2025)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arvo AI&lt;/td&gt;
&lt;td&gt;Robusta + Microsoft&lt;/td&gt;
&lt;td&gt;k8sgpt-ai community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangGraph supervisor + sub-agents&lt;/td&gt;
&lt;td&gt;ReAct loop (&lt;code&gt;ToolCallingLLM&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Rule-based scanner + LLM explainer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-step reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (single-shot per analyzer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway&lt;/td&gt;
&lt;td&gt;Kubernetes + AWS via MCP&lt;/td&gt;
&lt;td&gt;Kubernetes only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kubectl&lt;/code&gt; in sandboxed pods&lt;/td&gt;
&lt;td&gt;Read-only &lt;code&gt;kubectl get&lt;/code&gt;/&lt;code&gt;describe&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Read-only via Kube API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Other integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.)&lt;/td&gt;
&lt;td&gt;30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.)&lt;/td&gt;
&lt;td&gt;None — Kubernetes-only by design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge base / RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weaviate vector search over runbooks + postmortems&lt;/td&gt;
&lt;td&gt;Yes (via toolsets)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependency graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memgraph (cross-cloud blast radius)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postmortem generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, exports to Confluence&lt;/td&gt;
&lt;td&gt;Investigation reports only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pull request remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub + Bitbucket with human approval gate&lt;/td&gt;
&lt;td&gt;GitHub PRs in Operator mode&lt;/td&gt;
&lt;td&gt;None — strictly read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (340+ endpoints, 6 named tools)&lt;/td&gt;
&lt;td&gt;Consumes MCP servers (acts as a client)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Air-gapped support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Ollama + image tarballs)&lt;/td&gt;
&lt;td&gt;Yes (Ollama)&lt;/td&gt;
&lt;td&gt;Yes (LocalAI / Ollama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker Compose or Helm&lt;/td&gt;
&lt;td&gt;Binary, API server, K8s Operator, Python SDK&lt;/td&gt;
&lt;td&gt;Go binary, K8s operator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The OSS AI SRE Maturity Spectrum
&lt;/h2&gt;

&lt;p&gt;A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What the agent does&lt;/th&gt;
&lt;th&gt;Tools at this level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Diagnostic Explainer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K8sGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Read-Only Investigator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HolmesGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Investigation + Suggestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;HolmesGPT (Operator mode)&lt;/strong&gt;, &lt;strong&gt;Aurora&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Investigation + Approved Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails — typically a sandboxed runtime with explicit human approval for destructive operations.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; (with the Bitbucket connector's &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;human approval gate&lt;/a&gt; for destructive actions)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these tools operates at a fully autonomous fifth level (L5: closed-loop remediation without human approval), and that's by design. Most serious teams want explicit gates before agents touch production.&lt;/p&gt;
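&lt;p&gt;The spectrum can be read as a simple policy: the configured level bounds which actions an agent may attempt, and anything that changes infrastructure also needs a human sign-off. The sketch below is one way to express that, assuming hypothetical action names and a made-up &lt;code&gt;is_allowed&lt;/code&gt; helper. None of it is taken from Aurora, HolmesGPT, or K8sGPT.&lt;/p&gt;

```python
from enum import IntEnum


class Level(IntEnum):
    """Maturity levels L1 to L4 from the table above."""
    DIAGNOSTIC_EXPLAINER = 1
    READ_ONLY_INVESTIGATOR = 2
    INVESTIGATION_SUGGESTION = 3
    APPROVED_REMEDIATION = 4


# Hypothetical action names mapped to the lowest level that permits them.
MIN_LEVEL = {
    "explain_finding": Level.DIAGNOSTIC_EXPLAINER,
    "query_logs": Level.READ_ONLY_INVESTIGATOR,
    "open_pull_request": Level.INVESTIGATION_SUGGESTION,
    "rollback_deployment": Level.APPROVED_REMEDIATION,
}

# Assumption: actions that change infrastructure always need human approval.
NEEDS_APPROVAL = {"rollback_deployment"}


def is_allowed(action: str, configured: Level, approved: bool = False) -> bool:
    """Return True if the action fits the configured level and approval rules."""
    required = MIN_LEVEL.get(action)
    if required is None:
        return False  # unknown actions are denied by default
    if configured >= required:
        return approved or action not in NEEDS_APPROVAL
    return False
```

&lt;p&gt;Under these assumptions, &lt;code&gt;is_allowed("rollback_deployment", Level.APPROVED_REMEDIATION)&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt; until &lt;code&gt;approved=True&lt;/code&gt; is passed, which mirrors the approval gate described for L4. Denying unknown actions by default is a deliberate choice for anything that touches production.&lt;/p&gt;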

&lt;h2&gt;
  
  
  Aurora vs HolmesGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on how far your infrastructure extends beyond Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick HolmesGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.&lt;/li&gt;
&lt;li&gt;You want a tool that already integrates with 30+ toolsets, including Loki, Alertmanager, New Relic, Datadog APM, Opsgenie, and Slack.&lt;/li&gt;
&lt;li&gt;You value CNCF governance and a fast-moving ecosystem.&lt;/li&gt;
&lt;li&gt;You don't need out-of-the-box reasoning across cloud providers (AWS APIs, Azure resources, GCP services).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems exported to Confluence.&lt;/li&gt;
&lt;li&gt;You want the agent to draft remediation PRs against your codebase.&lt;/li&gt;
&lt;li&gt;You need a graph-based blast radius model (Memgraph) for dependency analysis.&lt;/li&gt;
&lt;li&gt;You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora vs K8sGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;This is closer to "which tool category do you need?" than a head-to-head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick K8sGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as &lt;code&gt;k8sgpt analyze --explain&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.&lt;/li&gt;
&lt;li&gt;You want the maturity of a 7.7k-star CNCF Sandbox project with rule-based analyzers that won't hallucinate causes, because findings are produced deterministically before the LLM ever sees them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need agentic investigation, not just diagnostic explanation.&lt;/li&gt;
&lt;li&gt;You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems and remediation PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.&lt;/p&gt;
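&lt;p&gt;The first-line scanner pattern (L1 in the spectrum above) keeps detection deterministic and uses the model only for wording. A minimal sketch of that split, assuming simplified pod records, an arbitrary restart threshold, and a stubbed model call. This is an illustration of the pattern, not K8sGPT's analyzer code.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical, simplified pod records; real Kubernetes objects are richer.
PODS = [
    {"name": "api-7c9d", "phase": "Running", "restarts": 0},
    {"name": "worker-5f2a", "phase": "Running", "restarts": 14},
    {"name": "cron-91bb", "phase": "Pending", "restarts": 0},
]

RESTART_THRESHOLD = 5  # assumed cutoff for "restarting too often"


def scan(pods):
    """Deterministic rules: the same input always yields the same findings."""
    findings = []
    for pod in pods:
        if pod["phase"] != "Running":
            findings.append(f"{pod['name']}: stuck in phase {pod['phase']}")
        elif pod["restarts"] > RESTART_THRESHOLD:
            findings.append(f"{pod['name']}: restarted {pod['restarts']} times")
    return findings


def explain(findings, llm: Callable[[str], str]):
    """Ask the model to explain each finding; it never decides what is broken."""
    return {f: llm(f"Explain this Kubernetes finding and suggest a fix: {f}") for f in findings}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "stub explanation"
```

&lt;p&gt;Here &lt;code&gt;scan(PODS)&lt;/code&gt; flags &lt;code&gt;worker-5f2a&lt;/code&gt; and &lt;code&gt;cron-91bb&lt;/code&gt; whether or not a model is available, and &lt;code&gt;explain&lt;/code&gt; issues exactly one model call per finding. The model can still word an explanation badly, but it cannot invent a finding the rules did not produce.&lt;/p&gt;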

&lt;h2&gt;
  
  
  HolmesGPT vs K8sGPT — head-to-head
&lt;/h2&gt;

&lt;p&gt;Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step AI agent&lt;/td&gt;
&lt;td&gt;Rule-based scanner with LLM explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it shines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigating an alert end-to-end across signals&lt;/td&gt;
&lt;td&gt;Diagnosing why a specific resource is unhealthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds to minutes (multi-step)&lt;/td&gt;
&lt;td&gt;Sub-second per analyzer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (multiple calls per investigation)&lt;/td&gt;
&lt;td&gt;Lower (one explanation per finding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hallucination risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (agent reasons across signals)&lt;/td&gt;
&lt;td&gt;Lower (deterministic before LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-call engineers handling alerts&lt;/td&gt;
&lt;td&gt;Platform teams running periodic cluster audits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
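&lt;p&gt;The latency and cost rows follow from control flow: a multi-step agent calls the model once per step and feeds each tool result back in. A minimal sketch of a ReAct-style loop, assuming a scripted stand-in for the model and two hypothetical tools. It is not HolmesGPT's &lt;code&gt;ToolCallingLLM&lt;/code&gt; implementation.&lt;/p&gt;

```python
def react_loop(question, llm, tools, max_steps=5):
    """Alternate between model decisions and tool calls until an answer appears."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        decision = llm("\n".join(transcript))  # one model call per step
        if decision["type"] == "answer":
            return decision["text"], transcript
        tool = tools.get(decision["tool"])
        observation = tool(decision["input"]) if tool else "unknown tool"
        transcript.append(f"Action: {decision['tool']}({decision['input']})")
        transcript.append(f"Observation: {observation}")
    return "no conclusion within step budget", transcript


def scripted_llm(prompt):
    """Stand-in for a model: picks the next step from what has been observed."""
    if "Observation: OOMKilled" in prompt:
        return {"type": "answer", "text": "checkout was OOMKilled; raise its memory limit"}
    if "Observation: 3 restarts" in prompt:
        return {"type": "tool", "tool": "describe_pod", "input": "checkout"}
    return {"type": "tool", "tool": "get_restarts", "input": "checkout"}


# Hypothetical read-only tools returning canned observations.
TOOLS = {
    "get_restarts": lambda pod: "3 restarts",
    "describe_pod": lambda pod: "OOMKilled",
}
```

&lt;p&gt;Calling &lt;code&gt;react_loop("Why is checkout failing?", scripted_llm, TOOLS)&lt;/code&gt; takes three model calls and two tool calls before it answers. The &lt;code&gt;max_steps&lt;/code&gt; budget is what keeps an agent like this from looping indefinitely, and it is also the main lever on LLM cost.&lt;/p&gt;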

&lt;p&gt;K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.&lt;/p&gt;
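&lt;p&gt;The idea behind this kind of masking is straightforward: swap identifying names for placeholder tokens before the prompt leaves the cluster, then swap them back in the response. A minimal sketch of the technique, assuming an invented token format and helper names. It does not reproduce K8sGPT's actual implementation.&lt;/p&gt;

```python
import re


def anonymize(text: str, names: list[str]):
    """Replace known resource names with placeholder tokens in a single pass."""
    unique = [n for n in dict.fromkeys(names) if n]  # drop duplicates and empties
    if not unique:
        return text, {}
    tokens = {name: f"__RES_{i}__" for i, name in enumerate(unique)}
    # Longest names first so "payments-db-0" wins over its prefix "payments".
    longest_first = sorted(unique, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in longest_first))
    masked = pattern.sub(lambda m: tokens[m.group(0)], text)
    mapping = {token: name for name, token in tokens.items()}
    return masked, mapping


def deanonymize(text: str, mapping: dict):
    """Restore original names in the model's response."""
    return re.sub(r"__RES_\d+__", lambda m: mapping.get(m.group(0), m.group(0)), text)
```

&lt;p&gt;For the input &lt;code&gt;pod payments-db-0 in namespace payments failed readiness probe&lt;/code&gt; with names &lt;code&gt;["payments", "payments-db-0"]&lt;/code&gt;, the masked text is &lt;code&gt;pod __RES_1__ in namespace __RES_0__ failed readiness probe&lt;/code&gt;. Note that this only hides names you already know about; free-text log payloads can still leak identifiers.&lt;/p&gt;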

&lt;h2&gt;
  
  
  When NOT to use open-source AI SRE
&lt;/h2&gt;

&lt;p&gt;Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't have the operational capacity to run another stateful service in production.&lt;/li&gt;
&lt;li&gt;You want vendor support with SLAs and a phone number to call at 3 AM.&lt;/li&gt;
&lt;li&gt;Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE.&lt;/li&gt;
&lt;li&gt;You need certifications (SOC 2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.&lt;/li&gt;
&lt;/ul&gt;
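&lt;p&gt;The third point is a break-even question, and it is worth doing the arithmetic before deciding. A rough sketch follows, where every figure (token counts, token price, seat price) is a made-up placeholder to replace with your own numbers. It also ignores infrastructure and staffing costs on the self-hosted side.&lt;/p&gt;

```python
def monthly_llm_cost(investigations, tokens_per_investigation, usd_per_million_tokens):
    """Estimated monthly LLM API spend for a self-hosted agent."""
    return investigations * tokens_per_investigation * usd_per_million_tokens / 1_000_000


def monthly_saas_cost(seats, usd_per_seat):
    """Monthly bill for a per-seat SaaS product."""
    return seats * usd_per_seat


def break_even_investigations(seats, usd_per_seat, tokens_per_investigation, usd_per_million_tokens):
    """Investigations per month at which LLM spend equals the SaaS bill."""
    per_investigation = tokens_per_investigation * usd_per_million_tokens / 1_000_000
    if per_investigation == 0:
        return float("inf")  # e.g. local inference with no per-token charge
    return monthly_saas_cost(seats, usd_per_seat) / per_investigation
```

&lt;p&gt;With placeholder inputs of 4 seats at $50 each, 150,000 tokens per investigation, and $5 per million tokens, the SaaS bill is $200 a month and each investigation costs $0.75, so the LLM bill overtakes the SaaS price at roughly 267 investigations a month.&lt;/p&gt;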

&lt;h2&gt;
  
  
  How to pilot an open-source AI SRE in your team
&lt;/h2&gt;

&lt;p&gt;A six-step, low-risk pilot for any of the three tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one cluster and one observability source.&lt;/strong&gt; Don't try to cover everything at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install in read-only mode first.&lt;/strong&gt; All three tools default to read-only — keep it that way for the first two weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect one alert source.&lt;/strong&gt; PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run for two weeks alongside human on-call.&lt;/strong&gt; Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed it your historical context.&lt;/strong&gt; Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand carefully.&lt;/strong&gt; Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.&lt;/li&gt;
&lt;/ol&gt;
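&lt;p&gt;Step 4 is easier to act on if the comparison is recorded in a consistent shape from day one. A minimal sketch of such a scorecard, assuming you label each incident's root cause with a short category string and log time-to-RCA in minutes. The field names and sample incidents are invented for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PilotIncident:
    """One incident handled by both the agent and the human on-call."""
    incident_id: str
    agent_rca: str        # root-cause label the agent reported
    human_rca: str        # root-cause label the engineers settled on
    agent_minutes: float  # minutes from alert to the agent's RCA
    human_minutes: float  # minutes from alert to the human RCA


def summarize(incidents):
    """Return RCA agreement rate and median time-to-RCA for both sides."""
    if not incidents:
        raise ValueError("need at least one incident to summarize")
    agreed = sum(1 for i in incidents if i.agent_rca == i.human_rca)
    return {
        "agreement_rate": agreed / len(incidents),
        "median_agent_minutes": median(i.agent_minutes for i in incidents),
        "median_human_minutes": median(i.human_minutes for i in incidents),
    }


# Made-up sample data for illustration only.
SAMPLE = [
    PilotIncident("INC-1", "oom_kill", "oom_kill", 4.0, 35.0),
    PilotIncident("INC-2", "bad_deploy", "bad_deploy", 6.0, 50.0),
    PilotIncident("INC-3", "dns_failure", "expired_cert", 3.0, 20.0),
    PilotIncident("INC-4", "disk_full", "disk_full", 9.0, 90.0),
]
```

&lt;p&gt;For the sample data, &lt;code&gt;summarize(SAMPLE)&lt;/code&gt; reports an agreement rate of 0.75 with median times of 5.0 and 42.5 minutes. Exact label matching is a crude measure of agreement; in a real pilot you would likely want a reviewer to judge whether a differing label is actually wrong.&lt;/p&gt;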

&lt;h2&gt;
  
  
  Getting started with Aurora
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is the multi-cloud, multi-tool option among open-source AI SREs. To run it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora supports several LLM providers: OpenAI, Anthropic, Google, Vertex, OpenRouter, or local models via Ollama for air-gapped deployments.&lt;/p&gt;

&lt;p&gt;For the technical side of running an agent that executes &lt;code&gt;kubectl&lt;/code&gt; against production, see the companion piece on &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI agent kubectl safety and sandboxed execution&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
