<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siddharth Singh</title>
    <description>The latest articles on DEV Community by Siddharth Singh (@siddharth_singh_409bd5267).</description>
    <link>https://dev.to/siddharth_singh_409bd5267</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836164%2Fed12b658-4232-401b-be5c-924bb828c22f.png</url>
      <title>DEV Community: Siddharth Singh</title>
      <link>https://dev.to/siddharth_singh_409bd5267</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siddharth_singh_409bd5267"/>
    <language>en</language>
    <item>
      <title>What is an AI SRE? Definition, Capabilities, and 2026 Buyer's Lens</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 21 May 2026 23:45:57 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/what-is-an-ai-sre-definition-capabilities-and-2026-buyers-lens-41l4</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/what-is-an-ai-sre-definition-capabilities-and-2026-buyers-lens-41l4</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE is a multi-step large-language-model agent that investigates production incidents, queries live telemetry, and drafts a root-cause analysis with remediation guidance.&lt;/strong&gt; It is not an alerting tool, not an AIOps correlator, and not a chatbot. The agent calls infrastructure tools (&lt;code&gt;kubectl&lt;/code&gt;, cloud APIs, log queries) during an incident to gather new evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The category emerged in 2024 and consolidated in 2025-2026.&lt;/strong&gt; Open-source projects include &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;), &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;), and &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, multi-cloud). Commercial entrants include &lt;a href="https://resolve.ai/" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt; (&lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;$125M Series A at $1B in February 2026&lt;/a&gt;) and &lt;a href="https://www.traversal.com/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt; (&lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;$48M Series A in June 2025&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE is not the same as an AIOps platform.&lt;/strong&gt; AIOps tools cluster alerts statistically and predate LLMs. An AI SRE reasons through an incident step by step using an LLM that calls tools. The two categories are complementary, not interchangeable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five capabilities define a credible AI SRE.&lt;/strong&gt; They are multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, and a structured root-cause output. Tools that ship fewer than three of these are something else (chatbot, summariser, correlator).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption is bounded by trust, not capability.&lt;/strong&gt; Most 2026 buyers run the agent in read-only investigation mode for the first ninety days. Closed-loop remediation is a separate trust decision that follows clean operation, never the first decision.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;An AI SRE is a multi-step large-language-model agent that investigates production incidents on behalf of a site reliability engineer.&lt;/strong&gt; When an alert fires, the agent queries telemetry, traverses infrastructure dependencies, retrieves relevant runbooks, and produces a structured root-cause analysis. The category sits next to, not inside, the older AIOps and incident-management markets.&lt;/p&gt;

&lt;p&gt;This page is a definitional reference. For the deep methodology and procurement-stage detail, see our &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE Complete Guide&lt;/a&gt;. For tool selection, see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an AI SRE do? The Five-Capability Test
&lt;/h2&gt;

&lt;p&gt;We call the rubric below the &lt;strong&gt;Five-Capability AI SRE Test&lt;/strong&gt;. A tool that ships fewer than three of these capabilities is in an adjacent category (copilot, summariser, correlator) and should not be evaluated against a real AI SRE.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step investigation.&lt;/strong&gt; The agent runs an iterative reasoning loop (&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt;, tool-calling, or a graph-based equivalent) where each step uses the previous tool result to decide the next call. Single-shot summarisation is a different category.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure tool execution.&lt;/strong&gt; The agent reads from &lt;code&gt;kubectl&lt;/code&gt;, cloud SDKs, observability backends, and ticket systems. Some agents also write, with guardrails. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT documents read-only access with RBAC respect&lt;/a&gt;. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora documents sandboxed execution into an isolated namespace&lt;/a&gt;. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT documents Kubernetes-only diagnostics with anonymisation before any AI backend call&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency-graph awareness.&lt;/strong&gt; The agent knows that service A talks to service B and uses that topology to assess blast radius. Aurora ships a Memgraph-backed dependency graph. Causely is built on a causal-graph foundation; see &lt;a href="https://docs.causely.ai/getting-started/how-causely-works/" rel="noopener noreferrer"&gt;How Causely Works&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge-base RAG.&lt;/strong&gt; The agent retrieves runbooks and past postmortems using hybrid search (&lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25" rel="noopener noreferrer"&gt;BM25&lt;/a&gt; plus dense vectors). Aurora documents a &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; hybrid index. The leading commercial AI SREs all integrate Confluence and ticket systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured root-cause output.&lt;/strong&gt; The agent emits a final artefact (summary, evidence chain, suggested remediation) rather than a chat transcript. Postmortem export to Confluence or Jira is increasingly table-stakes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The minimum coherent product ships investigation, tool execution, and a structured output. Items 3 and 4 push the tool from "interesting demo" to "load-bearing in production."&lt;/p&gt;
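
&lt;p&gt;The minimum product described above (investigation, tool execution, structured output) can be sketched as a small loop. This is an illustrative sketch only: the tool functions, the &lt;code&gt;choose_next_tool&lt;/code&gt; policy, and the sample strings are hypothetical stand-ins, and a real agent would put an LLM call where the rule-based policy sits.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical read-only tools; each stands in for a real kubectl, log, or metrics query.
def get_pod_status(target: str) -> str:
    return f"pod {target}: CrashLoopBackOff, last exit code 137"

def get_recent_logs(target: str) -> str:
    return f"logs for {target}: container killed, memory limit exceeded"

TOOLS: Dict[str, Callable[[str], str]] = {
    "pod_status": get_pod_status,
    "recent_logs": get_recent_logs,
}

@dataclass
class Investigation:
    alert: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)

def choose_next_tool(inv: Investigation) -> Optional[str]:
    """Stand-in for the LLM: pick the next tool from the evidence gathered so far."""
    if not inv.evidence:
        return "pod_status"
    last_tool, last_result = inv.evidence[-1]
    # The previous result decides the next call, which is the multi-step property.
    if last_tool == "pod_status" and "CrashLoopBackOff" in last_result:
        return "recent_logs"
    return None  # nothing further to check

def investigate(alert: str, target: str, max_steps: int = 5) -> dict:
    inv = Investigation(alert=alert)
    for _ in range(max_steps):
        tool = choose_next_tool(inv)
        if tool is None:
            break
        inv.evidence.append((tool, TOOLS[tool](target)))
    # Emit a structured artefact instead of a chat transcript.
    return {
        "summary": f"Investigated '{alert}' in {len(inv.evidence)} steps",
        "evidence_chain": [f"{tool}: {result}" for tool, result in inv.evidence],
    }

report = investigate("checkout pods restarting", "checkout-7d9f")
```

&lt;p&gt;The step cap (&lt;code&gt;max_steps&lt;/code&gt;) is an assumption of this sketch, included so that the loop always terminates even if the policy never returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;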

&lt;h2&gt;
  
  
  How is an AI SRE different from a human SRE?
&lt;/h2&gt;

&lt;p&gt;An AI SRE does not replace a human site reliability engineer. The 2026 division of labour is concrete.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human stays in the loop for&lt;/strong&gt; scope decisions (what counts as an incident), trust decisions (when to allow remediation), capacity planning, postmortem facilitation, runbook authorship, and the SLO conversation with product owners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The agent absorbs&lt;/strong&gt; the first sixty to ninety minutes of evidence-gathering on noisy alerts, the late-night triage of unclear pages, the cross-system correlation that humans defer until morning, and the boilerplate of a draft postmortem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The economic argument is bounded. The category's investors (Sequoia, Kleiner, Lightspeed, Felicis) underwrite an "agent does first triage, human does decision" workflow, not a headcount-replacement claim. The &lt;a href="https://newsletter.signoz.io/p/ai-isnt-replacing-sres-its-deskilling" rel="noopener noreferrer"&gt;SigNoz newsletter discussion of deskilling risk&lt;/a&gt; is a useful counterweight.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE different from AIOps?
&lt;/h2&gt;

&lt;p&gt;The two names sound alike, but the categories share almost no implementation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;AIOps platform&lt;/th&gt;
&lt;th&gt;AI SRE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary technique&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Statistical clustering, anomaly detection, correlation rules&lt;/td&gt;
&lt;td&gt;LLM reasoning, tool-calling agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it was named&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coined by &lt;a href="https://www.gartner.com/en/information-technology/glossary/aiops-platform" rel="noopener noreferrer"&gt;Gartner in 2017&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Emerged in vendor marketing 2024 to 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it produces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert clusters, noise reduction, incident summaries&lt;/td&gt;
&lt;td&gt;A reasoned root-cause analysis, evidence chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Representative tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BigPanda, Moogsoft, Dynatrace Davis, PagerDuty Intelligent Alert Grouping&lt;/td&gt;
&lt;td&gt;HolmesGPT, K8sGPT, Aurora, Resolve.ai, Traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual alert triage&lt;/td&gt;
&lt;td&gt;First-pass incident investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AIOps platforms predate LLMs and remain useful for alert hygiene. An AI SRE is downstream: once the alert lands, the AI SRE investigates it. Most mature teams will end up with both.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE different from an incident-management copilot?
&lt;/h2&gt;

&lt;p&gt;A copilot inside &lt;a href="https://rootly.com/" rel="noopener noreferrer"&gt;Rootly&lt;/a&gt;, &lt;a href="https://incident.io/" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt;, &lt;a href="https://firehydrant.com/" rel="noopener noreferrer"&gt;FireHydrant&lt;/a&gt;, or &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI&lt;/a&gt; drafts Slack updates, suggests on-call swaps, and writes a postmortem from artefacts the team has already produced. An AI SRE generates the evidence those artefacts describe. The two categories cooperate; they do not substitute. See our &lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;AI SRE vs traditional incident management comparison&lt;/a&gt; for the long form.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the open-source vs commercial AI SRE options?
&lt;/h2&gt;

&lt;p&gt;As of May 2026, three open-source projects dominate this lane.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt/blob/master/LICENSE" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. 2.5k GitHub stars on the canonical repository as of May 2026, per the &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT/holmesgpt about box&lt;/a&gt;. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;Originally created by Robusta.dev with major contributions from Microsoft&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 8 October 2025&lt;/a&gt;. Project legal entity: &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT a Series of LF Projects, LLC&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT.&lt;/strong&gt; &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. 7.8k GitHub stars on the canonical repository as of May 2026, per the &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;k8sgpt-ai/k8sgpt about box&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;. The June 2024 CNCF blog notes that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF: K8sGPT, June 2024&lt;/a&gt;). Kubernetes-scoped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora by Arvo AI.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;. Multi-cloud (AWS, Azure, GCP, OVH, Scaleway, Kubernetes). Sandboxed command execution, dependency-graph awareness, RAG over runbooks and postmortems. See the &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;direct comparison of all three&lt;/a&gt; and our &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;self-hosted AI SRE guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commercial entrants raise larger cheques but ship a narrower deployment surface. &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai confirmed a $125M Series A at a $1B valuation in February 2026&lt;/a&gt; and an &lt;a href="https://www.prnewswire.com/news-releases/resolve-ai-announces-series-a-extension-at-a-1-5b-valuation-and-launches-resolve-ai-labs-to-advance-ai-systems-for-complex-production-environments-302743888.html" rel="noopener noreferrer"&gt;extension at a $1.5B valuation in April 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal raised $48M in June 2025 led by Sequoia and Kleiner Perkins&lt;/a&gt;. Incumbents launched their own products in 2025-2026: &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty SRE Agent&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI SRE&lt;/a&gt;, and ServiceNow Now Assist for incident operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is an AI SRE evaluated?
&lt;/h2&gt;

&lt;p&gt;Three questions resolve most procurement debates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the agent investigate or just summarise?&lt;/strong&gt; A summariser repeats what the dashboard already says. An investigator gathers new evidence. Ask the vendor to walk through one tool call after the alert; if the answer is "we summarise the alert payload," the product is a copilot, not an AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where does inference run?&lt;/strong&gt; A SaaS-only inference plane is fine for unregulated teams and disqualifying for regulated ones. The deployment tier is fixed by the strictest constraint, not the average. See the &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Sovereignty Spectrum in our self-hosted guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the remediation boundary?&lt;/strong&gt; Read-only investigation is one trust decision. PR-based suggestions are another. Sandboxed in-cluster execution is the third. Most teams stage these three independently across a six-to-twelve-month adoption arc, not in a single procurement.&lt;/li&gt;
&lt;/ol&gt;
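
&lt;p&gt;The remediation boundary in question 3 can be written down as an explicit gate. The sketch below is one possible encoding, not any vendor's implementation: the action names and the &lt;code&gt;REQUIRED_STAGE&lt;/code&gt; mapping are hypothetical, and it assumes the three trust decisions are granted in a fixed order, which is a simplification.&lt;/p&gt;

```python
from enum import IntEnum

class TrustStage(IntEnum):
    READ_ONLY = 1      # investigation only
    PR_SUGGESTION = 2  # agent may propose changes as pull requests
    SANDBOX_EXEC = 3   # agent may run commands in a sandbox

# Hypothetical mapping from action type to the minimum stage that permits it.
REQUIRED_STAGE = {
    "query_logs": TrustStage.READ_ONLY,
    "open_pull_request": TrustStage.PR_SUGGESTION,
    "run_command": TrustStage.SANDBOX_EXEC,
}

def is_allowed(action: str, granted: TrustStage) -> bool:
    """Allow an action only if the team has granted at least the required stage."""
    required = REQUIRED_STAGE.get(action)
    if required is None:
        return False  # unknown actions are denied by default
    return granted >= required

# A team still in read-only mode can investigate but not execute.
can_query = is_allowed("query_logs", TrustStage.READ_ONLY)
can_execute = is_allowed("run_command", TrustStage.READ_ONLY)
```

&lt;p&gt;Denying unknown actions by default keeps the gate conservative: a new tool added to the agent stays blocked until someone assigns it a stage.&lt;/p&gt;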

&lt;p&gt;For a detailed tool matrix scored on five axes (investigation, remediation, postmortem, deployment flexibility, source availability), see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ROI: where the time actually comes back
&lt;/h2&gt;

&lt;p&gt;Independent ROI numbers specifically for AI SRE are still thin in 2026. The broader industry adoption picture is well-sourced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report" rel="noopener noreferrer"&gt;Google's 2025 DORA report announcement&lt;/a&gt; states "90% of survey respondents report using AI at work" and that "More than 80% believe it has increased their productivity."&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://survey.stackoverflow.co/2025/" rel="noopener noreferrer"&gt;Stack Overflow's 2025 Developer Survey&lt;/a&gt; reports that 84 percent of respondents are using or planning to use AI tools in their development process, and 51 percent of professional developers use AI tools daily.&lt;/li&gt;
&lt;li&gt;The same DORA 2025 report notes that "AI adoption still has a negative relationship with software delivery stability," which is exactly the gap an investigation-grade AI SRE is positioned to close, distinct from the coding-assistant category that drives most of the AI adoption signal above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where AI SRE specifically takes hours back is mid-tier paging volume: the alerts that are too ambiguous to ignore and too low-stakes to justify waking a senior engineer. The agent's first-pass triage moves those from "morning standup discussion" to "closed before breakfast."&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the common mistakes when buying an AI SRE?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conflating a postmortem generator with an AI SRE.&lt;/strong&gt; A tool that writes a draft from the Slack transcript is not investigating. It is summarising.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buying multi-cloud AI SRE for a single-cloud problem.&lt;/strong&gt; If 95 percent of the estate is one cloud, a Kubernetes-only or AWS-only agent may be a better cost-to-fit match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting with remediation.&lt;/strong&gt; The fastest way to lose stakeholder trust is to let an agent execute a command before the team understands its investigation pattern. Stage trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the dependency-graph question.&lt;/strong&gt; If the agent does not understand what calls what, it will miss blast-radius assessments and waste investigation steps. The capability is invisible in a demo and load-bearing in production.&lt;/li&gt;
&lt;/ul&gt;
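
&lt;p&gt;To make the dependency-graph point concrete, here is a minimal sketch of a blast-radius query. The service names and the &lt;code&gt;CALLS&lt;/code&gt; map are hypothetical, and the traversal is a plain breadth-first search over reversed edges, not the method any named tool uses.&lt;/p&gt;

```python
from collections import deque
from typing import Dict, List

# Hypothetical call graph: each key calls the services in its list.
CALLS: Dict[str, List[str]] = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}

def blast_radius(calls: Dict[str, List[str]], failing: str) -> List[str]:
    """Return every service that directly or transitively calls `failing`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: Dict[str, List[str]] = {name: [] for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    seen = {failing}
    queue = deque([failing])
    affected = []
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                affected.append(caller)
                queue.append(caller)
    return sorted(affected)

impacted = blast_radius(CALLS, "inventory")
```

&lt;p&gt;Without the graph, an agent looking at a failing &lt;code&gt;inventory&lt;/code&gt; service has no way to know that &lt;code&gt;web&lt;/code&gt; is affected through two separate paths.&lt;/p&gt;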

&lt;h2&gt;
  
  
  How to evaluate an AI SRE in 14 days
&lt;/h2&gt;

&lt;p&gt;This two-week, single-quarter procurement plan maps directly to the Five-Capability AI SRE Test.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Day 1 to 2: Score the shortlist on the Five-Capability Test.&lt;/strong&gt; Take the five capabilities (multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, structured root-cause output) and score every shortlisted tool 0 to 3 on each axis. Drop any tool that scores below 6 out of 15.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3 to 4: Resolve the three procurement questions.&lt;/strong&gt; Answer in writing: does the agent investigate or just summarise; where does inference run; what is the remediation boundary. Match the deployment tier to the strictest constraint, not the average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5 to 7: Run a sandboxed proof of value.&lt;/strong&gt; Pick one real incident from the last 30 days. Replay it against the top two shortlisted tools using a non-production cloud key and a sandbox cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 8 to 9: Run the security review.&lt;/strong&gt; Walk security through each tool's data path: what telemetry leaves the customer perimeter, what is anonymised before LLM calls, what the read or write capability boundary is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 10 to 11: Pilot one team for one week.&lt;/strong&gt; Route a defined subset of alerts (one severity tier, one service domain) into the tool in read-only investigation mode. Do not touch remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 12 to 13: Stage trust separately.&lt;/strong&gt; Read-only investigation is one trust decision. PR-based suggestions are the second. Sandboxed in-cluster execution is the third. Most teams stage these over six to twelve months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 14: Decide on five inputs.&lt;/strong&gt; Five-Capability Test score, three-question filter answers, week-by-week investigation quality reading, total cost of ownership at projected incident volume, and security review status.&lt;/li&gt;
&lt;/ol&gt;
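
&lt;p&gt;The Day 1 to 2 scoring step is simple arithmetic: five axes, 0 to 3 each, 15 maximum, drop anything below 6. A minimal sketch follows; the tool names and scores are made up for illustration.&lt;/p&gt;

```python
from typing import Dict, List, Tuple

CAPABILITIES = (
    "multi_step_investigation",
    "infrastructure_tool_execution",
    "dependency_graph_awareness",
    "knowledge_base_rag",
    "structured_root_cause_output",
)
MAX_PER_AXIS = 3
CUTOFF = 6  # tools scoring below 6 out of 15 are dropped

def total_score(scores: Dict[str, int]) -> int:
    """Sum the five axis scores after checking each is in the 0 to 3 range."""
    for capability in CAPABILITIES:
        value = scores[capability]
        if value not in range(MAX_PER_AXIS + 1):
            raise ValueError(f"{capability} must be 0 to {MAX_PER_AXIS}, got {value}")
    return sum(scores[capability] for capability in CAPABILITIES)

def shortlist(tools: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Keep tools at or above the cutoff, highest total first."""
    totals = {name: total_score(scores) for name, scores in tools.items()}
    kept = [(name, total) for name, total in totals.items() if total >= CUTOFF]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical tools with made-up scores, for illustration only.
candidates = {
    "tool_a": dict(zip(CAPABILITIES, (3, 3, 2, 2, 3))),
    "tool_b": dict(zip(CAPABILITIES, (1, 1, 0, 1, 2))),
    "tool_c": dict(zip(CAPABILITIES, (2, 2, 1, 0, 2))),
}
result = shortlist(candidates)
```

&lt;p&gt;Here &lt;code&gt;tool_b&lt;/code&gt; totals 5 and is dropped, while &lt;code&gt;tool_c&lt;/code&gt; survives at 7 despite scoring 0 on knowledge-base RAG, which shows that the cutoff filters on the total rather than on any single axis.&lt;/p&gt;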

&lt;h2&gt;
  
  
  Where this guide fits
&lt;/h2&gt;

&lt;p&gt;This is the short definitional reference. For deeper material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE: The Complete Guide for Engineering Teams in 2026&lt;/a&gt;, procurement and adoption arc.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;, full capability matrix.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt;, deployment-tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;, three-way comparison.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison&lt;/a&gt;, two-way head-to-head.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation: The Complete Guide for SRE Teams&lt;/a&gt;, investigation-pattern detail.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;What is Agentic Incident Management?&lt;/a&gt;, category framing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is an AI SRE in simple terms?&lt;/strong&gt;&lt;br&gt;
An AI SRE is a multi-step LLM agent that investigates production incidents. It reads alerts, runs infrastructure commands such as kubectl or cloud SDK calls, queries observability backends, and produces a structured root-cause analysis. It augments a human site reliability engineer rather than replacing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is an AI SRE different from AIOps?&lt;/strong&gt;&lt;br&gt;
AIOps is a 2017-era Gartner category built on statistical alert clustering and anomaly detection. An AI SRE is downstream of that: once an alert lands, the AI SRE uses an LLM to reason through it step by step, calling tools to gather new evidence. Mature teams typically run both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is an AI SRE the same as an incident-management chatbot?&lt;/strong&gt;&lt;br&gt;
No. A chatbot inside Rootly, incident.io, FireHydrant, or PagerDuty drafts Slack updates and summarises artefacts the team already has. An AI SRE generates those artefacts by investigating the incident from telemetry. The two categories cooperate but do not substitute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will AI replace SREs?&lt;/strong&gt;&lt;br&gt;
No. Investor framing across Sequoia, Kleiner, Lightspeed, and Felicis-backed AI SRE companies in 2025 to 2026 has consistently been agent-as-first-triage with a human in the loop for scope, trust, capacity, and SLO decisions. The deskilling risk is real and discussed in industry essays such as the SigNoz newsletter; the headcount-replacement claim is not part of the category thesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the main open-source AI SRE tools in 2026?&lt;/strong&gt;&lt;br&gt;
Three projects dominate. HolmesGPT (Apache 2.0, CNCF Sandbox since 8 October 2025, Kubernetes-first, 2.5k GitHub stars per the about box in May 2026). K8sGPT (Apache 2.0, CNCF Sandbox since 19 December 2023, Kubernetes diagnostics, 7.8k GitHub stars per the about box in May 2026). Aurora by Arvo AI (Apache 2.0, multi-cloud, sandboxed command execution).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does an AI SRE handle security and data privacy?&lt;/strong&gt;&lt;br&gt;
Practice varies by tool. HolmesGPT operates with read-only access that respects RBAC and is documented as safe to run in production. K8sGPT anonymises cluster object names and labels before sending data to the AI backend. Aurora supports air-gapped deployment with local LLMs through Ollama. Most commercial AI SREs run inference on vendor-managed infrastructure, which is the gating constraint for regulated buyers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long does an AI SRE take to deploy?&lt;/strong&gt;&lt;br&gt;
An open-source AI SRE can be up and running in a single afternoon using a Docker Compose or Helm install with one cloud and one monitoring integration connected. Production rollout, including secret rotation, RBAC scoping, runbook ingestion, and Slack integration, takes two to four weeks for most teams. Closed-loop remediation is staged separately, three to twelve months after read-only operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does an AI SRE cost?&lt;/strong&gt;&lt;br&gt;
Open-source AI SREs are free at the licence layer; the running cost is infrastructure plus LLM inference. Self-hosted Aurora with a local Ollama model removes the LLM cost entirely. Commercial AI SREs price either per-seat or per-investigation. Resolve.ai and Traversal price by custom contract; PagerDuty and Datadog bundle their AI SRE features into existing platform tiers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can an AI SRE run in an air-gapped environment?&lt;/strong&gt;&lt;br&gt;
Yes, for a small set of tools. Aurora supports air-gapped deployment with Ollama or vLLM for local inference. HolmesGPT supports self-hosted LLM endpoints. K8sGPT supports local backends including Ollama and LocalAI. Most commercial AI SREs require outbound calls to a vendor-managed inference plane and do not satisfy air-gapped procurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does an AI SRE not do?&lt;/strong&gt;&lt;br&gt;
It does not set SLOs, define what counts as an incident, run capacity planning, facilitate a postmortem with the affected team, or own the customer relationship during a major outage. It is a tool for evidence-gathering and first-pass reasoning, not for the judgment work that defines the site reliability discipline.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;arvoai.ca/blog/what-is-an-ai-sre&lt;/a&gt;. Aurora by Arvo AI is open-source on &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; under Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>HolmesGPT vs K8sGPT: A 2026 Head-to-Head Comparison for SRE Teams</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 21 May 2026 23:43:47 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/holmesgpt-vs-k8sgpt-a-2026-head-to-head-comparison-for-sre-teams-2aco</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/holmesgpt-vs-k8sgpt-a-2026-head-to-head-comparison-for-sre-teams-2aco</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT and K8sGPT are both Apache 2.0, both CNCF Sandbox, and both branded as AI for SRE work, but they solve different problems.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; is an investigation agent that runs across "any infrastructure - VMs, bare metal, cloud services, or containers." &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; is a Kubernetes diagnostics tool: "a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub adoption signals diverge sharply.&lt;/strong&gt; As of May 2026, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT shows 7.8k stars and 996 forks&lt;/a&gt;, written 98.9 percent in Go. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT shows 2.5k stars and 347 forks&lt;/a&gt;, written 84.5 percent in Python. K8sGPT had a head start of nearly two years in the CNCF Sandbox (19 December 2023 vs 8 October 2025 for HolmesGPT).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution model differs.&lt;/strong&gt; &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT operates with read-only access that "respects RBAC permissions"&lt;/a&gt;, plus a separate &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;Operator Mode that "can open PRs to fix the problems it finds"&lt;/a&gt; through the GitHub MCP integration. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; runs as a CLI scanner or in-cluster operator with a 30-second default reconciliation interval (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-operator&lt;/a&gt;) and anonymises Kubernetes object names and labels before any LLM call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM backend lists overlap heavily and diverge at the edges.&lt;/strong&gt; Both projects register Anthropic, OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Ollama as backends. &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;K8sGPT's source registers a broader enterprise set&lt;/a&gt;: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT documents a broader developer-tooling set&lt;/a&gt;: GitHub Copilot, GitHub Models, Azure AI Foundry, OpenRouter, Robusta AI, and OpenAI-Compatible (LiteLLM proxy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance shapes the trust story.&lt;/strong&gt; HolmesGPT's project entity is &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;HolmesGPT a Series of LF Projects, LLC&lt;/a&gt;; the project was &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;originally created by Robusta.dev with major contributions from Microsoft&lt;/a&gt;. The June 2024 CNCF post on K8sGPT states that "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF Blog, 7 June 2024&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a strict comparison of two open-source projects that are often grouped together because both attach AI to Kubernetes work, both are CNCF Sandbox projects, and both are licensed under Apache 2.0. Past that, they target different problems with different runtimes, different backends, and different governance. Every claim below is cited to a primary source: the project's GitHub repository, its official docs site, or a CNCF page. No quote is paraphrased from third-party blog posts.&lt;/p&gt;

&lt;p&gt;A note on bias. Arvo builds &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, a separate open-source AI SRE listed alongside HolmesGPT and K8sGPT in our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;three-way comparison&lt;/a&gt;. This page intentionally excludes Aurora from the main comparison except for a small section at the end.&lt;/p&gt;

&lt;p&gt;We call the rubric used below the &lt;strong&gt;Open-Source AI SRE Decision Matrix&lt;/strong&gt;. It has six axes, each evaluated against the project's own primary documentation rather than third-party claims: stated scope, execution model, continuous operation, LLM provider breadth, Model Context Protocol direction (host vs consume), and project governance. Every cell in the comparison table that follows maps back to one of these six axes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is HolmesGPT?
&lt;/h2&gt;

&lt;p&gt;HolmesGPT describes itself as an &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Open-source AI agent for investigating production incidents and finding root causes"&lt;/a&gt;. Repository statistics on the project's about box in May 2026 show 2.5k stars, 347 forks, and Python at 84.5 percent of the codebase (&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Scope is cross-infrastructure: "Open-source SRE agent for investigating production incidents across any infrastructure - Kubernetes, VMs, cloud services, databases, and more" (&lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;holmesgpt.dev&lt;/a&gt;). The same point is made on the project repository: "No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers" (&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Governance is shared between two entities. Origin attribution: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Originally created by Robusta.Dev, with major contributions from Microsoft"&lt;/a&gt;. The project's legal entity is named on the docs site: &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;"HolmesGPT a Series of LF Projects, LLC"&lt;/a&gt;. CNCF acceptance is documented at "October 8, 2025 at the Sandbox maturity level" (&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;cncf.io/projects/holmesgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The latest release at time of writing is &lt;strong&gt;v0.30.1 on 20 May 2026&lt;/strong&gt; per the &lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;. The release notes for v0.30.1 mention Loki raw response handling on parse failure, a GitLab MCP entry in the datasource catalog, a Bash echo allowlist fix, and user_email persistence on chat requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is K8sGPT?
&lt;/h2&gt;

&lt;p&gt;K8sGPT describes itself as &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;"a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI"&lt;/a&gt;. Repository statistics on the project's about box in May 2026 show 7.8k stars, 996 forks, and Go at 98.9 percent of the codebase (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;github.com/k8sgpt-ai/k8sgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Scope is explicitly Kubernetes. The project makes no claim of non-Kubernetes runtime support. The marketing site at &lt;a href="https://k8sgpt.ai/" rel="noopener noreferrer"&gt;k8sgpt.ai&lt;/a&gt; carries the tagline "K8sGPT - Giving Kubernetes Superpowers to Everyone."&lt;/p&gt;

&lt;p&gt;Governance is community-led. The 7 June 2024 CNCF blog (Dotan Horovits) states: "unlike many popular projects, there is no company behind this project, and no business plan behind it" (&lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;CNCF Blog&lt;/a&gt;). CNCF acceptance is documented at "December 19, 2023 at the Sandbox maturity level" (&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;cncf.io/projects/k8sgpt&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The latest release at time of writing is &lt;strong&gt;v0.4.33 on 13 May 2026&lt;/strong&gt; per the &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.33" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;. Recent feature releases include v0.4.27 (mcp v2, 18 December 2025), v0.4.32 (Azure API type support and custom HTTP header, 22 April 2026), and v0.4.33 (analyze previous logs for restarted containers, 13 May 2026).&lt;/p&gt;

&lt;h2&gt;
  
  
  At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt/blob/master/LICENSE" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;Sandbox, 8 October 2025&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;Sandbox, 19 December 2023&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars (May 2026)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;2.5k&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;7.8k&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python (84.5%)&lt;/td&gt;
&lt;td&gt;Go (98.9%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stated scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Any infrastructure - VMs, bare metal, cloud services, or containers"&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Kubernetes clusters&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operating model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step investigation agent + optional 24/7 &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;Operator Mode&lt;/a&gt; (Alpha)&lt;/td&gt;
&lt;td&gt;Scanner CLI + &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-operator&lt;/a&gt; for continuous in-cluster runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default permission model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Read-only access and respects RBAC permissions"&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Diagnoses; anonymises sensitive data before AI calls&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;open GitHub PRs via the GitHub MCP integration in Operator Mode&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;None documented&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;MCP-based integrations for AWS, Azure, GCP, GitHub, GitLab, Jenkins, Kubernetes Remediation, Sentry, Splunk, MariaDB, Prefect&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;Hosts an MCP server exposing 12 tools and 3 resources for Kubernetes operations&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, Robusta AI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;Anthropic, OpenAI, Azure OpenAI, AWS Bedrock (and Bedrock Converse), Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, Custom REST&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest release at writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;v0.30.1, 20 May 2026&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.33" rel="noopener noreferrer"&gt;v0.4.33, 13 May 2026&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Founding entity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;Originally Robusta.dev, major Microsoft contributions&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Community-led, no commercial backer per &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What is the scope difference between HolmesGPT and K8sGPT?
&lt;/h2&gt;

&lt;p&gt;This is the load-bearing axis on the Open-Source AI SRE Decision Matrix, and the easiest one for teams to get wrong.&lt;/p&gt;

&lt;p&gt;K8sGPT is, by stated scope, a Kubernetes diagnostics tool. The &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/tree/main/pkg/analyzer" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/analyzer&lt;/code&gt;&lt;/a&gt; folder ships analysers for around 29 Kubernetes resource types as of May 2026, with a documented "default" subset (Pod, PVC, ReplicaSet, Service, Event, Ingress, StatefulSet, Deployment, Job, CronJob, Node, MutatingWebhook, ValidatingWebhook, ConfigMap) and an extended set covering HPA, PDB, NetworkPolicy, Gateway, GatewayClass, HTTPRoute, Log, Storage, Security, plus OLM-related resources (CatalogSource, ClusterServiceVersion, Subscription, etc.). Every analyser is scoped to a Kubernetes resource type. A team running on bare VMs, on managed cloud services without Kubernetes, or on a mainframe is not the K8sGPT audience.&lt;/p&gt;

&lt;p&gt;HolmesGPT rebuts the Kubernetes-only assumption directly: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"No Kubernetes required: Works with any infrastructure - VMs, bare metal, cloud services, or containers"&lt;/a&gt;. Its data-source catalogue, visible in the &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs navigation&lt;/a&gt;, covers VM-era systems alongside Kubernetes-era ones: Bash, ClickHouse, MariaDB (via MCP), Confluence, Sentry, plus Kubernetes resources and Helm. The Operator Mode page also frames non-Kubernetes scope: &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;"While the operator itself runs in Kubernetes, health checks can query any data source Holmes is connected to - VMs, cloud services, databases, SaaS platforms"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For SRE teams whose estate is entirely Kubernetes, this difference is academic. For teams that still run managed databases outside Kubernetes (Amazon RDS, Cloud SQL, Amazon Aurora), VM workloads, or third-party SaaS at incident-critical positions in the stack, K8sGPT cannot reach those resources without integration glue, and HolmesGPT can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can HolmesGPT or K8sGPT execute commands against my cluster?
&lt;/h2&gt;

&lt;p&gt;Both projects default to a read-oriented posture, but they document it differently.&lt;/p&gt;

&lt;p&gt;HolmesGPT is explicit: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"By design, HolmesGPT has read-only access and respects RBAC permissions. It is safe to run in production environments"&lt;/a&gt;. The Operator Mode page describes how the read-only default is preserved while a separate write path opens: &lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;"Connect the GitHub MCP server so Holmes can open PRs to fix the problems it finds - not just report them"&lt;/a&gt;. Writes do not happen against the cluster; they happen against the user's Git repository, where humans approve the change.&lt;/p&gt;

&lt;p&gt;K8sGPT does not use the phrase "read-only" in its repository documentation, but its operational profile is similar: the tool scans cluster state through Kubernetes APIs and feeds analyser output to an LLM. Anonymisation happens before the LLM call: &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;"the data is anonymized before being sent to the AI Backend... k8sgpt retrieves sensitive data (Kubernetes object names, labels, etc.). This data is masked when sent to the AI backend"&lt;/a&gt;. The same primary source also notes that anonymisation "does not currently apply to events" and that certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count) are not masked. The trade-off is openly disclosed. The masking implementation lives in &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/util/util.go" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/util/util.go&lt;/code&gt;&lt;/a&gt; as the &lt;code&gt;MaskString&lt;/code&gt; function.&lt;/p&gt;
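
&lt;p&gt;The mask-then-restore idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not K8sGPT's &lt;code&gt;MaskString&lt;/code&gt; implementation (which is written in Go). The function names, the &lt;code&gt;obj-&lt;/code&gt; token prefix, and the sample pod and namespace names are all assumptions made for the example.&lt;/p&gt;

```python
import re
import secrets


def mask_names(text, sensitive):
    """Swap each sensitive value for an opaque token.

    Returns the masked text plus a token-to-original map so the
    LLM's answer can be translated back afterwards.
    """
    # Hypothetical token format; a real tool may use a different scheme.
    forward = {value: "obj-" + secrets.token_hex(4) for value in set(sensitive)}
    if not forward:
        return text, {}
    # Longest values first so a short name never matches inside a longer one.
    ordered = sorted(forward, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    masked = pattern.sub(lambda m: forward[m.group(0)], text)
    reverse = {token: value for value, token in forward.items()}
    return masked, reverse


def unmask(text, reverse):
    """Restore original names in text that came back from the LLM."""
    if not reverse:
        return text
    pattern = re.compile("|".join(re.escape(token) for token in reverse))
    return pattern.sub(lambda m: reverse[m.group(0)], text)


# Sample names are invented for illustration.
prompt = "Pod payments-api in namespace billing-prod is in CrashLoopBackOff"
masked, reverse = mask_names(prompt, ["payments-api", "billing-prod"])
restored = unmask(masked, reverse)
```

&lt;p&gt;Note what the sketch leaves visible: the status string &lt;code&gt;CrashLoopBackOff&lt;/code&gt; passes through unmasked, which mirrors the kind of unmasked-field trade-off the K8sGPT README discloses. Security reviewers should check the actual unmasked field list rather than rely on a sketch like this one.&lt;/p&gt;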

&lt;h2&gt;
  
  
  How does continuous operation differ between the two operators?
&lt;/h2&gt;

&lt;p&gt;Both projects have an in-cluster operator, and again the framing differs.&lt;/p&gt;

&lt;p&gt;HolmesGPT's Operator Mode is a 24/7 background agent: "HolmesGPT runs in the background 24/7, spots problems before your customers notice, and messages you in Slack with the fix" (&lt;a href="https://holmesgpt.dev/latest/operator/" rel="noopener noreferrer"&gt;holmesgpt.dev/latest/operator&lt;/a&gt;). The docs note its architecture: "a lightweight kopf-based controller handles CRD orchestration and scheduling, while stateless Holmes API servers execute the actual checks." The same page carries an explicit "Holmes Operator - Alpha Release" warning, and includes a cost caution: "Begin with infrequent schedules (e.g., hourly or daily) and monitor usage before scaling up."&lt;/p&gt;

&lt;p&gt;K8sGPT's operator (a separate repo, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;k8sgpt-ai/k8sgpt-operator&lt;/a&gt;) is a continuous scanner: "This Operator is designed to enable K8sGPT within a Kubernetes cluster... It will allow you to create a custom resource that defines the behaviour and scope of a managed K8sGPT workload." The &lt;a href="https://github.com/k8sgpt-ai/k8sgpt-operator" rel="noopener noreferrer"&gt;default reconciliation interval is 30 seconds&lt;/a&gt;, enforced in the controller code (&lt;code&gt;ReconcileSuccessInterval = 30 * time.Second&lt;/code&gt;). Output goes to in-cluster Result CRDs, with optional Slack, Mattermost, and CloudEvents sinks. Prometheus and Grafana integration is exposed through ServiceMonitor and dashboard parameters (&lt;a href="https://docs.k8sgpt.ai/reference/operator/overview/" rel="noopener noreferrer"&gt;k8sgpt-operator docs&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Architecturally: HolmesGPT's Operator Mode is event-driven and incident-shaped (run on alert, run on schedule). K8sGPT's operator is poll-shaped (scan every 30 seconds, surface anomalies).&lt;/p&gt;
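
&lt;p&gt;The cadence gap is easier to see as arithmetic. The sketch below counts cycles per day for a 30-second interval versus the hourly and daily schedules suggested for Operator Mode. It deliberately does not assume that every cycle triggers an LLM call (that depends on each tool's configuration and caching); &lt;code&gt;calls_per_cycle&lt;/code&gt; is a hypothetical knob to fill in from your own measurements.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cycles_per_day(interval_seconds):
    """Number of complete cycles that fit in one day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def llm_calls_per_day(interval_seconds, calls_per_cycle):
    """Upper-bound estimate; calls_per_cycle is an assumption you must measure."""
    return cycles_per_day(interval_seconds) * calls_per_cycle


poll_cycles = cycles_per_day(30)        # 30-second reconciliation interval
hourly_cycles = cycles_per_day(3600)    # hourly schedule
daily_cycles = cycles_per_day(86400)    # daily schedule

print(poll_cycles, hourly_cycles, daily_cycles)  # prints: 2880 24 1
```

&lt;p&gt;A 30-second loop runs 120 times as often as an hourly schedule, so even a small per-cycle LLM cost is worth measuring before either operator is left running.&lt;/p&gt;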

&lt;h2&gt;
  
  
  Which LLM providers does each tool support?
&lt;/h2&gt;

&lt;p&gt;Both projects support multiple LLM backends. The published lists are reproduced here, followed by a summary of where they overlap.&lt;/p&gt;

&lt;p&gt;K8sGPT's source code at &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/pkg/ai/iai.go" rel="noopener noreferrer"&gt;&lt;code&gt;pkg/ai/iai.go&lt;/code&gt;&lt;/a&gt; registers 17 backends as of May 2026: openai, anthropic, localai, ollama, azureopenai, cohereai, amazonbedrock, amazonbedrockconverse, amazonsagemaker, googleai, noopai, huggingface, googlevertexai, ocigenai, customrest, ibmwatsonxai, groq.&lt;/p&gt;

&lt;p&gt;HolmesGPT's &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs site navigation&lt;/a&gt; enumerates: Anthropic, AWS Bedrock, Azure AI Foundry, Gemini, GitHub Copilot, GitHub Models, Google Vertex AI, Ollama, OpenRouter, OpenAI, OpenAI-Compatible, Robusta AI.&lt;/p&gt;

&lt;p&gt;The two lists overlap heavily on the headline providers (Anthropic, OpenAI, Azure OpenAI, Bedrock, Google Vertex AI, Ollama) and diverge at the edges. K8sGPT's edge list leans enterprise: IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker, and a generic Custom REST endpoint. HolmesGPT's edge list leans developer-tooling: GitHub Copilot, GitHub Models, Azure AI Foundry, OpenRouter, Robusta AI, and an OpenAI-Compatible (LiteLLM proxy) catch-all. The right choice usually comes from the LLM the security team has already approved, not from this list.&lt;/p&gt;
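
&lt;p&gt;Starting from the approved provider and working backwards can be expressed as a simple lookup. The sketch below encodes the shared and edge lists exactly as summarised in the paragraph above; the provider labels are this article's wording, not identifiers from either codebase, and the function name is an assumption.&lt;/p&gt;

```python
# Provider groupings as summarised in this article (labels are editorial).
SHARED = {
    "Anthropic", "OpenAI", "Azure OpenAI",
    "AWS Bedrock", "Google Vertex AI", "Ollama",
}
K8SGPT_EDGE = {
    "IBM watsonx", "Oracle OCI GenAI", "Cohere", "Groq",
    "HuggingFace", "Amazon SageMaker", "Custom REST",
}
HOLMESGPT_EDGE = {
    "GitHub Copilot", "GitHub Models", "Azure AI Foundry",
    "OpenRouter", "Robusta AI", "OpenAI-Compatible",
}

SUPPORT = {
    "HolmesGPT": SHARED | HOLMESGPT_EDGE,
    "K8sGPT": SHARED | K8SGPT_EDGE,
}


def tools_supporting(provider):
    """Return the tools whose listed backends include the approved provider."""
    wanted = provider.casefold()
    return sorted(
        tool
        for tool, providers in SUPPORT.items()
        if any(p.casefold() == wanted for p in providers)
    )


print(tools_supporting("Anthropic"))       # ['HolmesGPT', 'K8sGPT']
print(tools_supporting("Groq"))            # ['K8sGPT']
print(tools_supporting("GitHub Copilot"))  # ['HolmesGPT']
```

&lt;p&gt;An empty result means the provider is not on either summarised list; check the projects' current backend documentation before ruling a tool out.&lt;/p&gt;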

&lt;h2&gt;
  
  
  How does each tool handle Model Context Protocol?
&lt;/h2&gt;

&lt;p&gt;Both projects support MCP, and again the shape differs.&lt;/p&gt;

&lt;p&gt;K8sGPT &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;hosts an MCP server that the project ships&lt;/a&gt;: "K8sGPT provides a Model Context Protocol server that exposes Kubernetes operations as standardized tools for AI assistants." The server exposes "12 tools for cluster analysis, resource management, and debugging" and "3 resources for cluster information access," with "Stateless HTTP mode for one-off invocations" and "Full integration with Claude Desktop and other MCP clients." The MCP v2 feature landed in &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases/tag/v0.4.27" rel="noopener noreferrer"&gt;release v0.4.27 on 18 December 2025&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;HolmesGPT consumes MCP servers as data sources rather than hosting one. The &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;data-sources catalogue&lt;/a&gt; lists MCP-labelled integrations for AWS, Azure, GitHub, GitLab, Jenkins, GCP, Kubernetes Remediation, MariaDB, Prefect, Sentry, and Splunk. The &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;docs navigation&lt;/a&gt; makes the consumption pattern explicit through entries like "MCP Servers" and "OAuth MCP Servers."&lt;/p&gt;

&lt;p&gt;The implication: K8sGPT publishes cluster operations for Claude Desktop and other MCP clients to consume. HolmesGPT subscribes to MCP-published tools across third-party systems. Teams building MCP-shaped workflows will pick the direction that matches their existing investment.&lt;/p&gt;
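
&lt;p&gt;To make the host-versus-consume distinction concrete, here is a sketch of the JSON-RPC 2.0 messages an MCP client sends to any MCP server to discover and call tools. The method names (&lt;code&gt;tools/list&lt;/code&gt;, &lt;code&gt;resources/list&lt;/code&gt;, &lt;code&gt;tools/call&lt;/code&gt;) come from the MCP specification. The tool name &lt;code&gt;analyze&lt;/code&gt; and its arguments are hypothetical placeholders, not taken from K8sGPT's tool catalogue, and the sketch only builds the payloads rather than sending them.&lt;/p&gt;

```python
import json
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server what it publishes.
list_tools = jsonrpc_request("tools/list")
list_resources = jsonrpc_request("resources/list")

# Invocation: the tool name and arguments below are hypothetical.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "analyze", "arguments": {"namespace": "default"}},
)

wire_payload = json.dumps(call_tool)
```

&lt;p&gt;A real session begins with an &lt;code&gt;initialize&lt;/code&gt; handshake before these requests, which the sketch omits. In the K8sGPT direction, K8sGPT is the server answering such requests; in the HolmesGPT direction, HolmesGPT is the client issuing them to third-party servers.&lt;/p&gt;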

&lt;h2&gt;
  
  
  Who governs each project, and how does that change the trust story?
&lt;/h2&gt;

&lt;p&gt;The CNCF Sandbox label is identical on both projects. The economic shape behind each is not.&lt;/p&gt;

&lt;p&gt;HolmesGPT is held under &lt;a href="https://holmesgpt.dev/latest/" rel="noopener noreferrer"&gt;"HolmesGPT a Series of LF Projects, LLC"&lt;/a&gt;, with origin attribution: &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Originally created by Robusta.Dev, with major contributions from Microsoft"&lt;/a&gt;. &lt;a href="https://home.robusta.dev/" rel="noopener noreferrer"&gt;Robusta&lt;/a&gt; sells a managed SaaS product that integrates HolmesGPT, and Slack and Microsoft Teams integrations are flagged &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;"Available via Robusta"&lt;/a&gt;. This is a sponsored-open-source pattern.&lt;/p&gt;

&lt;p&gt;K8sGPT is community-led. The &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt; states: "unlike many popular projects, there is no company behind this project, and no business plan behind it." The same post names production users: "Companies like Kubermatic, SpectroCloud, and Nethopper have enthusiastically embraced K8sGPT capabilities." The project's &lt;a href="https://github.com/k8sgpt-ai/k8sgpt/blob/main/GOVERNANCE.md" rel="noopener noreferrer"&gt;&lt;code&gt;GOVERNANCE.md&lt;/code&gt;&lt;/a&gt; further codifies the model: "No single vendor may control project direction."&lt;/p&gt;

&lt;p&gt;Neither shape is structurally better. Sponsored open source ships polish and integrations faster; community open source is harder to commercially deprecate. Match the governance to the team's risk model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release cadence and recent feature deltas
&lt;/h2&gt;

&lt;p&gt;HolmesGPT shipped v0.30.1 on 20 May 2026, with notes for the release covering Loki raw-response handling on parse failure, a GitLab MCP datasource entry, a Bash echo allowlist fix, user_email persistence on chat requests, and documentation refinements (&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases/tag/0.30.1" rel="noopener noreferrer"&gt;release tag&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;K8sGPT's recent releases include v0.4.33 ("analyze previous logs for restarted containers," 13 May 2026), v0.4.32 ("add Azure API Type Support and add Custom HTTP Header," 22 April 2026), and v0.4.27 ("mcp v2," 18 December 2025) (&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Both projects ship monthly or near-monthly. Neither has demonstrated a multi-month pause in the period documented.&lt;/p&gt;

&lt;h2&gt;
  
  
  What HolmesGPT and K8sGPT are NOT
&lt;/h2&gt;

&lt;p&gt;Five misreadings of this comparison show up repeatedly in vendor briefings and procurement memos. Naming them in advance saves a procurement cycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neither is an alerting platform.&lt;/strong&gt; Alerts originate in Prometheus AlertManager, Grafana, Datadog, CloudWatch, or PagerDuty. &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT fetches alerts from "AlertManager, PagerDuty, OpsGenie, or Jira"&lt;/a&gt;; K8sGPT integrates downstream of Prometheus alert rules. Buying either tool does not solve "we have too many or too few alerts."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neither is a full AIOps platform.&lt;/strong&gt; AIOps is a 2017-era category built on statistical correlation and noise reduction. Both tools sit downstream of that layer: once an alert lands, the agent investigates. Teams running BigPanda, Moogsoft, Dynatrace Davis, or PagerDuty Intelligent Alert Grouping should not expect either project to replace those products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neither is a managed SaaS by default.&lt;/strong&gt; Both are open-source projects requiring self-hosting. &lt;a href="https://home.robusta.dev/" rel="noopener noreferrer"&gt;Robusta&lt;/a&gt; sells a managed product around HolmesGPT, which is the closest commercial offering. K8sGPT has no commercial entity behind it per the &lt;a href="https://www.cncf.io/blog/2024/06/07/generative-ai-for-kubernetes-meet-k8sgpt-open-source-project/" rel="noopener noreferrer"&gt;June 2024 CNCF blog&lt;/a&gt;. A team that needs a vendor SOC 2 report against the open-source binary itself will not find one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT is not a multi-cloud reasoning tool.&lt;/strong&gt; Its &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;analysers map one-to-one to Kubernetes resource types&lt;/a&gt;. A managed RDS, a Datadog dashboard, or an OVH Bare Metal instance is invisible to K8sGPT's analysers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HolmesGPT is not a deterministic rules engine.&lt;/strong&gt; Its agent loop uses LLM tool-calling, which means investigation paths are non-deterministic and depend on the LLM provider and prompt context. Teams that need bit-for-bit reproducible incident analysis should evaluate it as an agent, not as a runbook executor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When should I choose HolmesGPT vs K8sGPT?
&lt;/h2&gt;

&lt;p&gt;Pick &lt;strong&gt;HolmesGPT&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The estate spans more than Kubernetes (VMs, managed databases, SaaS platforms at incident-critical positions).&lt;/li&gt;
&lt;li&gt;The LLM choice is GitHub Copilot, GitHub Models, OpenRouter, or Robusta AI (HolmesGPT-specific).&lt;/li&gt;
&lt;li&gt;The team wants a 24/7 background agent that can post to Slack and open GitHub PRs through MCP integration. Note that Operator Mode is marked as an Alpha release at time of writing.&lt;/li&gt;
&lt;li&gt;The team values an explicit, project-documented "read-only access and respects RBAC" guarantee.&lt;/li&gt;
&lt;li&gt;A managed SaaS option (via Robusta) is acceptable or attractive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick &lt;strong&gt;K8sGPT&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The estate is Kubernetes-first or Kubernetes-only.&lt;/li&gt;
&lt;li&gt;The team wants a Go binary that runs as a CLI and an in-cluster operator out of the box.&lt;/li&gt;
&lt;li&gt;The LLM choice is IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, or Amazon SageMaker (K8sGPT-specific).&lt;/li&gt;
&lt;li&gt;The team plans to publish cluster operations to MCP clients (Claude Desktop, custom tooling) rather than to consume external MCP services.&lt;/li&gt;
&lt;li&gt;The team wants documented anonymisation of cluster object names and labels before LLM calls.&lt;/li&gt;
&lt;li&gt;The team prefers a community-governed project with no commercial entity behind it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two are not directly substitutable for most teams. They are adjacent tools that can plausibly run alongside one another in a Kubernetes-heavy estate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose between HolmesGPT and K8sGPT in 14 days
&lt;/h2&gt;

&lt;p&gt;This is a two-week evaluation plan for picking between HolmesGPT and K8sGPT, or for confirming that the team needs both. Every step produces a concrete deliverable that a procurement reviewer can sign off on.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Day 1 to 2: Scope your estate.&lt;/strong&gt; List every system that hosts incident-relevant state: Kubernetes clusters, VMs, managed databases, third-party SaaS, on-prem hardware. If the answer is Kubernetes plus one or two managed services, K8sGPT alone may cover it. If non-Kubernetes systems sit at incident-critical positions in the stack, HolmesGPT's stated "any infrastructure" scope is the better fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3 to 4: Confirm the LLM standard.&lt;/strong&gt; Identify the LLM provider the team is already approved to use. Cross-check against each project's published backend list. Both register Anthropic, OpenAI, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Ollama. K8sGPT adds enterprise-leaning options (IBM watsonx, Oracle OCI GenAI, Cohere, Groq, HuggingFace, Amazon SageMaker). HolmesGPT adds developer-tooling options (GitHub Copilot, GitHub Models, OpenRouter, Robusta AI, OpenAI-Compatible proxy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5 to 6: Install both in a dev cluster.&lt;/strong&gt; Install K8sGPT via brew or its Helm chart (&lt;code&gt;helm repo add k8sgpt https://charts.k8sgpt.ai/&lt;/code&gt;) and the k8sgpt-operator. Install HolmesGPT via the official Helm chart documented at holmesgpt.dev. Connect a non-production LLM key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 7 to 8: Run a known-bad scenario.&lt;/strong&gt; Trigger a documented failure (CrashLoopBackOff, OOMKilled, ImagePullBackOff) in the dev cluster. Capture each tool's full output: time to first useful finding, false positives, and signal-to-noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 9 to 10: Assess the trust surface.&lt;/strong&gt; Walk security through the read model. HolmesGPT operates with read-only access plus RBAC. K8sGPT anonymises cluster object names and labels but does not mask certain fields (Describe, ObjectStatus, Replicas, ContainerStatus, Event Message, ReplicaStatus, Count). Get a written sign-off on each tool's data path before any production read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 11 to 12: Test the operator behaviour.&lt;/strong&gt; Enable HolmesGPT Operator Mode on an infrequent schedule (hourly, since Operator Mode is Alpha) and enable the K8sGPT operator at its 30-second default. Watch LLM token consumption and alert volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 13 to 14: Pick one, both, or neither.&lt;/strong&gt; Three valid outcomes. (1) Pick K8sGPT alone if the estate is Kubernetes-only and the team needs continuous posture. (2) Pick HolmesGPT alone if the estate is multi-platform and the team values 24/7 Operator Mode with GitHub PR opening. (3) Pick both if the estate is Kubernetes-heavy and the team wants continuous posture (K8sGPT) plus incident investigation (HolmesGPT).&lt;/li&gt;
&lt;/ol&gt;
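
&lt;p&gt;The Day 13 to 14 outcomes above can be written down as a small decision function so the final call is reviewable. This is a deliberate simplification: the three estate labels and the two boolean inputs are assumptions introduced for the sketch, and any combination the plan does not cover returns "no clear fit" rather than guessing.&lt;/p&gt;

```python
def recommend(estate, wants_continuous_posture, wants_incident_investigation):
    """Map evaluation findings to one of the plan's three outcomes.

    estate is one of "kubernetes-only", "kubernetes-heavy", "multi-platform"
    (labels invented for this sketch).
    """
    # Outcome 1: Kubernetes-only estate that needs continuous posture.
    if estate == "kubernetes-only" and wants_continuous_posture:
        return "K8sGPT"
    # Outcome 2: multi-platform estate that wants incident investigation.
    if estate == "multi-platform" and wants_incident_investigation:
        return "HolmesGPT"
    # Outcome 3: Kubernetes-heavy estate that wants both capabilities.
    if (
        estate == "kubernetes-heavy"
        and wants_continuous_posture
        and wants_incident_investigation
    ):
        return "both"
    # Anything else is outside the three documented outcomes.
    return "no clear fit"


print(recommend("kubernetes-only", True, False))   # K8sGPT
print(recommend("multi-platform", False, True))    # HolmesGPT
print(recommend("kubernetes-heavy", True, True))   # both
```

&lt;p&gt;A "no clear fit" result is a prompt to revisit the Day 1 to 2 scoping, not a verdict against either project.&lt;/p&gt;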

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;Aurora by Arvo AI is a separate Apache 2.0 project at &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;. Compared to the two projects above, Aurora ships multi-cloud investigation (AWS, Azure, GCP, OVH, Scaleway, Kubernetes), a Memgraph-backed infrastructure dependency graph, hybrid (BM25 plus vector) RAG over runbooks and postmortems via Weaviate, and sandboxed &lt;code&gt;kubectl&lt;/code&gt; execution into an isolated "untrusted" namespace with a four-layer command-safety pipeline (input rail, &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; signature match, per-org policy, LLM safety judge).&lt;/p&gt;

&lt;p&gt;A team can run all three. The most common pattern in 2026 design-partner conversations is K8sGPT for continuous in-cluster posture, HolmesGPT or Aurora for incident investigation, and Aurora for the multi-cloud and remediation-staging path that K8sGPT does not target. For the full three-way comparison see &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this guide fits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;, full capability matrix including commercial entrants.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;, three-way comparison.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt;, the deployment-tier framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/what-is-an-ai-sre" rel="noopener noreferrer"&gt;What is an AI SRE?&lt;/a&gt;, the definitional reference.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt;, the sandboxing pattern that distinguishes investigation from remediation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between HolmesGPT and K8sGPT?&lt;/strong&gt;&lt;br&gt;
HolmesGPT is an AI agent for investigating production incidents across any infrastructure including VMs, bare metal, cloud services, and containers. K8sGPT is a tool for scanning Kubernetes clusters and diagnosing issues in simple English, scoped to Kubernetes resources only. Both are Apache 2.0 and CNCF Sandbox projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which is more popular on GitHub, HolmesGPT or K8sGPT?&lt;/strong&gt;&lt;br&gt;
K8sGPT, by both stars and forks. As of May 2026, the K8sGPT about box on github.com/k8sgpt-ai/k8sgpt shows 7.8k stars and 996 forks. The HolmesGPT about box on github.com/HolmesGPT/holmesgpt shows 2.5k stars and 347 forks. K8sGPT had a head start of nearly two years: it joined the CNCF Sandbox on 19 December 2023, while HolmesGPT joined on 8 October 2025.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can HolmesGPT or K8sGPT execute commands against my cluster?&lt;/strong&gt;&lt;br&gt;
HolmesGPT operates with read-only access and respects RBAC permissions. The HolmesGPT docs describe an Operator Mode that can open GitHub pull requests via the GitHub MCP server, but those writes happen against the user's Git repository, not directly against the cluster. K8sGPT scans Kubernetes resources and anonymises object names and labels before sending data to its AI backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which LLM providers does each tool support?&lt;/strong&gt;&lt;br&gt;
Both projects support the headline providers. K8sGPT's source registers 17 backends including Anthropic, OpenAI, Azure OpenAI, AWS Bedrock and Bedrock Converse, Amazon SageMaker, Google Vertex AI, Google GenAI, Cohere, Groq, HuggingFace, IBM watsonx, Oracle OCI GenAI, Ollama, LocalAI, and a Custom REST endpoint. HolmesGPT supports Anthropic, OpenAI, Azure AI Foundry, AWS Bedrock, Google Vertex AI, Gemini, GitHub Copilot, GitHub Models, Ollama, OpenRouter, OpenAI-Compatible, and Robusta AI. K8sGPT's edge providers lean enterprise (watsonx, OCI, Cohere); HolmesGPT's lean developer tooling (Copilot, Models, OpenRouter).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do HolmesGPT and K8sGPT both support MCP?&lt;/strong&gt;&lt;br&gt;
Yes, but in different directions. K8sGPT hosts a Model Context Protocol server that exposes 12 tools and 3 resources for cluster analysis, with full integration with Claude Desktop and other MCP clients. The MCP v2 feature shipped in v0.4.27 on 18 December 2025. HolmesGPT consumes MCP-exposed tools as data sources, including AWS, Azure, GCP, GitHub, GitLab, Jenkins, MariaDB, Prefect, Sentry, and Splunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are HolmesGPT and K8sGPT both CNCF projects?&lt;/strong&gt;&lt;br&gt;
Both are CNCF Sandbox projects. The cncf.io project pages document HolmesGPT accepted on 8 October 2025 and K8sGPT accepted on 19 December 2023. Sandbox is the entry tier for CNCF projects and indicates the project is in an early stage relative to Incubating and Graduated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a company behind HolmesGPT or K8sGPT?&lt;/strong&gt;&lt;br&gt;
HolmesGPT is held under the legal entity "HolmesGPT a Series of LF Projects, LLC", and was originally created by Robusta.dev with major contributions from Microsoft. Robusta sells a managed SaaS product that integrates HolmesGPT. K8sGPT is community-led; the 7 June 2024 CNCF blog states that unlike many popular projects there is no company behind K8sGPT and no business plan behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which project is updated more often?&lt;/strong&gt;&lt;br&gt;
Both projects ship monthly or near-monthly. HolmesGPT's latest release at writing is v0.30.1 on 20 May 2026. K8sGPT's latest release at writing is v0.4.33 on 13 May 2026. Both Releases pages on GitHub show consistent 2025 to 2026 cadence with no documented multi-month pause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can HolmesGPT or K8sGPT run air-gapped?&lt;/strong&gt;&lt;br&gt;
Both projects support local LLM inference. K8sGPT's auth list includes localai and ollama, and the K8sGPT team recommends using a local model in critical production environments. HolmesGPT's docs nav lists Ollama and OpenAI-Compatible providers, which covers self-hosted LLM endpoints. Both the agent runtime and the LLM must run inside the customer perimeter before a deployment can be called air-gapped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use HolmesGPT and K8sGPT together?&lt;/strong&gt;&lt;br&gt;
Yes. K8sGPT is built as a continuous in-cluster scanner with a 30-second default reconciliation interval. HolmesGPT runs as an incident-driven investigation agent that can also operate 24/7 in Operator Mode (Alpha). A common 2026 pattern is to use K8sGPT for posture and HolmesGPT for incident investigation, with results routed to the same Slack channels or ticket systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;arvoai.ca/blog/holmesgpt-vs-k8sgpt&lt;/a&gt;. Aurora by Arvo AI is open-source on &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; under Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Self-Hosted AI SRE in 2026: Air-Gapped, Multi-Cloud, BYO-LLM</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 19 May 2026 01:01:04 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/self-hosted-ai-sre-in-2026-air-gapped-multi-cloud-byo-llm-53ha</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/self-hosted-ai-sre-in-2026-air-gapped-multi-cloud-byo-llm-53ha</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted AI SRE means the agent runtime, its memory layer, and the LLM all run inside the customer's perimeter.&lt;/strong&gt; Every inference call, every telemetry read, and every postmortem write happens on customer-owned infrastructure. The definition is structural. A vendor agent that ships data to vendor-managed inference is not self-hosted under this definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the Sovereignty Spectrum.&lt;/strong&gt; Five deployment tiers: T1 Public SaaS, T2 Private SaaS, T3 VPC-Isolated, T4 On-Prem Hosted, T5 Air-Gapped. Of the fifteen most-cited AI SRE tools in 2026, only &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, and &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; credibly reach T4 or T5. The other twelve top out at T1 or T2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gapped deployment requires three independent stacks: orchestration, memory, and inference.&lt;/strong&gt; Orchestration is the agent loop (LangGraph, ReAct). Memory is the dependency graph plus RAG corpus (Memgraph, Weaviate). Inference is the LLM (&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;, or a sovereign endpoint). All three must run locally, with no outbound network call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory drivers are concrete and dated.&lt;/strong&gt; The &lt;a href="https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn" rel="noopener noreferrer"&gt;EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025&lt;/a&gt;. The &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;EU AI Act implementation timeline&lt;/a&gt; phases in through 2027. The &lt;a href="https://www.sec.gov/newsroom/press-releases/2023-139" rel="noopener noreferrer"&gt;SEC adopted cybersecurity disclosure rules on 26 July 2023&lt;/a&gt; (Form 8-K Item 1.05 effective 18 December 2023).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-weight LLMs in 2026 are credible for local inference.&lt;/strong&gt; &lt;a href="https://ai.meta.com/blog/meta-llama-3-3-70b/" rel="noopener noreferrer"&gt;Meta's Llama 3.3 70B (December 2024)&lt;/a&gt; delivers similar performance to Llama 3.1 405B at lower inference cost, per Meta's own announcement. Mistral, DeepSeek, and Qwen have released competitive open-weight models. Aurora's reference local stack uses Ollama with a 70B-class model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Arvo's design-partner conversations across 2025, every regulated customer ran into the same procurement wall: every credible commercial AI SRE required production telemetry, including customer data inside log lines, error messages, and stack traces, to leave the customer perimeter for inference. For a SaaS startup the wall is paperwork. For a bank, a defence contractor, an EU sovereign-data buyer, or a healthcare provider, it blocks the procurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted AI SRE removes the wall.&lt;/strong&gt; The agent, its memory, and the LLM all run inside the customer's perimeter. This guide is the 2026 reference for evaluating, designing, and deploying a self-hosted AI SRE, with every commercial tool mapped to its deployment tier and Aurora's air-gapped stack used as the worked example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does self-hosted AI SRE mean?
&lt;/h2&gt;

&lt;p&gt;The phrase is overloaded. Three definitions circulate in 2026 vendor marketing, and only the strictest meaningfully reduces the trust surface.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted collector with VPC peering.&lt;/strong&gt; A vendor agent runs in the customer VPC, gathers telemetry, and ships it (sometimes after partial filtering) to a vendor-managed inference plane. The inference call leaves the customer perimeter. Most commercial AI SREs in 2026 use this pattern and call it "private deployment."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-tenant SaaS.&lt;/strong&gt; A dedicated vendor-managed instance inside a vendor-owned cloud account. The data plane is isolated from other tenants but still vendor-operated. Inference still leaves the customer perimeter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True self-hosted.&lt;/strong&gt; Every component (orchestration runtime, memory layers, inference endpoint, secrets manager) runs on customer-owned infrastructure. No outbound network call is required for an investigation to complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This guide uses the third definition. For audits and compliance reviews, only the third meaning answers the question "could a malicious actor at the vendor have read our incident transcript" with a structural no.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sovereignty Spectrum
&lt;/h2&gt;

&lt;p&gt;Each tier increases perimeter control over the previous one. Choose the tier the team can defend operationally; aiming further than that is engineering debt waiting to happen.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What runs on customer infrastructure&lt;/th&gt;
&lt;th&gt;What leaves the perimeter&lt;/th&gt;
&lt;th&gt;Representative tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T1, Public SaaS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;Telemetry, transcripts, investigation prompts&lt;/td&gt;
&lt;td&gt;Datadog Bits AI, incident.io AI SRE, Rootly AI, PagerDuty SRE Agent, ServiceNow Now Assist, Splunk ITSI, Cleric.ai, Causely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T2, Private SaaS (VPC peering)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A vendor-supplied agent or collector&lt;/td&gt;
&lt;td&gt;Telemetry, embeddings, sometimes whole log lines, all inference calls&lt;/td&gt;
&lt;td&gt;Resolve.ai (satellite agent), Traversal, NeuBird Hawkeye (VPC option), Edwin AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T3, VPC-Isolated single-tenant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vendor-managed control plane inside a vendor-owned cloud account dedicated to one customer&lt;/td&gt;
&lt;td&gt;All inference calls. Cross-tenant data flow is structurally absent, but the vendor still operates the plane&lt;/td&gt;
&lt;td&gt;Some incumbent "private cloud" tiers (custom-quoted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T4, On-prem with hosted LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent, memory, dependency graph, RAG corpus&lt;/td&gt;
&lt;td&gt;LLM API calls to OpenAI, Anthropic, Google, or Bedrock&lt;/td&gt;
&lt;td&gt;Aurora with managed LLM; HolmesGPT with managed LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T5, Air-gapped&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent, memory, dependency graph, RAG corpus, and a local LLM via Ollama, vLLM, or a sovereign endpoint&lt;/td&gt;
&lt;td&gt;Nothing. Investigation completes without an outbound call&lt;/td&gt;
&lt;td&gt;Aurora with Ollama; HolmesGPT with self-hosted LLM endpoint; K8sGPT with local LLM (Kubernetes-only scope)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A team's deployment tier is fixed by its strictest constraint, not its average. The &lt;a href="https://www.finma.ch/en/~/media/finma/dokumente/dokumentencenter/myfinma/rundschreiben/finma-rs-2018-03-20180101.pdf?la=en" rel="noopener noreferrer"&gt;FINMA Circular 2018/03&lt;/a&gt; on outsourcing for Swiss banks and insurers pushes regulated workloads toward T5. A privacy-by-design product advertising "your incident data never leaves your servers" lands at T5. A team that cannot obtain controller approval for an LLM provider under &lt;a href="https://gdpr-info.eu/art-28-gdpr/" rel="noopener noreferrer"&gt;GDPR Article 28&lt;/a&gt; lands at T5.&lt;/p&gt;

&lt;p&gt;Any other constraint allows T3 or T4. A single strict regulator collapses the choice to T5.&lt;/p&gt;
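
&lt;p&gt;The "strictest constraint wins" rule can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the constraint names and the tier each one maps to are hypothetical, and the &lt;code&gt;Tier&lt;/code&gt; enum simply mirrors the table above.&lt;/p&gt;

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sovereignty Spectrum tiers, ordered by perimeter control."""
    T1_PUBLIC_SAAS = 1
    T2_PRIVATE_SAAS = 2
    T3_VPC_ISOLATED = 3
    T4_ON_PREM_HOSTED_LLM = 4
    T5_AIR_GAPPED = 5


def required_tier(constraints: dict[str, Tier]) -> Tier:
    """Return the lowest tier that satisfies every constraint.

    Each value is the lowest tier that one constraint accepts, so the
    strictest (highest) value decides the deployment tier.
    """
    if not constraints:
        return Tier.T1_PUBLIC_SAAS
    return max(constraints.values())


# Hypothetical constraint names and mappings, for illustration only.
example = {
    "internal_security_review": Tier.T3_VPC_ISOLATED,
    "customer_contract_clause": Tier.T4_ON_PREM_HOSTED_LLM,
    "strict_regulator": Tier.T5_AIR_GAPPED,
}
print(required_tier(example).name)
```

&lt;p&gt;Taking the maximum rather than an average is the whole point: two permissive constraints do not offset one strict one.&lt;/p&gt;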

&lt;h2&gt;
  
  
  Why does self-hosting matter in 2026?
&lt;/h2&gt;

&lt;p&gt;Three pressures, in roughly this order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulatory.&lt;/strong&gt; The &lt;a href="https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn" rel="noopener noreferrer"&gt;EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025&lt;/a&gt;. The boundary covers data processing and storage for core services and is the model EU procurement teams now apply to other vendors. The &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;EU AI Act timeline&lt;/a&gt; phases in through 2027, with &lt;a href="https://artificialintelligenceact.eu/chapter/3/" rel="noopener noreferrer"&gt;high-risk system obligations under Chapter III&lt;/a&gt; (risk management, data governance, human oversight, post-market monitoring) applicable to operational AI used in critical infrastructure. The &lt;a href="https://www.sec.gov/newsroom/press-releases/2023-139" rel="noopener noreferrer"&gt;SEC's cybersecurity disclosure rules&lt;/a&gt; (adopted 26 July 2023, Form 8-K Item 1.05 effective 18 December 2023) make incident response transparency a public-company concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sovereignty and latency.&lt;/strong&gt; Sovereign cloud is no longer a French preoccupation. &lt;a href="https://www.ovhcloud.com/en/enterprise/products/hosted-private-cloud/" rel="noopener noreferrer"&gt;OVHcloud Sovereign Cloud&lt;/a&gt;, &lt;a href="https://www.scaleway.com/en/" rel="noopener noreferrer"&gt;Scaleway&lt;/a&gt;, &lt;a href="https://www.t-systems.com/de/en/cloud-services/sovereign-cloud" rel="noopener noreferrer"&gt;T-Systems Sovereign Cloud&lt;/a&gt;, &lt;a href="https://www.stackit.de/en/" rel="noopener noreferrer"&gt;Stackit&lt;/a&gt; (Schwarz Group), and &lt;a href="https://www.oracle.com/cloud/sovereign-cloud/eu/" rel="noopener noreferrer"&gt;Oracle EU Sovereign Cloud&lt;/a&gt; ship contractually sovereign tiers. An AI SRE that cannot operate without sending telemetry to a US hyperscaler region is unfit for these workloads. Latency follows the same constraint: an EU-hosted agent calling a US-hosted LLM during an incident incurs round-trip latency on every step of a multi-turn investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data leakage and trust.&lt;/strong&gt; Production log lines frequently contain customer PII, secrets, and proprietary identifiers. &lt;a href="https://www.gitguardian.com/state-of-secrets-sprawl-report-2024" rel="noopener noreferrer"&gt;GitGuardian's State of Secrets Sprawl 2024&lt;/a&gt; found 12.8 million new exposed secrets across public repositories alone in 2023, a steady reminder that telemetry contains material auditors care about. The audit calculation for a security team is the same as for any third-party data flow: if it can leak, model the risk as if it will. T5 makes the model trivial because nothing leaves the perimeter.&lt;/p&gt;

&lt;p&gt;For the full incident-investigation context, see &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation: The Complete Guide for SRE Teams&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI SRE tools can be fully self-hosted?
&lt;/h2&gt;

&lt;p&gt;The honest map.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best achievable tier&lt;/th&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T5, Air-Gapped&lt;/td&gt;
&lt;td&gt;Reference stack: Docker Compose or Helm chart, Ollama local LLM, Vault, Memgraph, Weaviate. See the &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora repo&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HolmesGPT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T4, On-prem with hosted LLM (T5 with self-hosted LLM endpoint)&lt;/td&gt;
&lt;td&gt;Apache 2.0. The &lt;a href="https://holmesgpt.dev/" rel="noopener noreferrer"&gt;HolmesGPT docs&lt;/a&gt; assume a hosted model provider (OpenAI, Azure OpenAI, Bedrock). A self-hosted LLM is an advanced configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K8sGPT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T4, On-prem (T5 with local LLM, Kubernetes scope only)&lt;/td&gt;
&lt;td&gt;CLI or Helm. &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;Local LLMs via Ollama supported&lt;/a&gt;. Scope is limited to the Kubernetes API.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resolve.ai&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS&lt;/td&gt;
&lt;td&gt;Satellite agent in the customer VPC for telemetry. Inference is vendor-managed. No publicly documented air-gapped option.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS&lt;/td&gt;
&lt;td&gt;Flexible deployment options. Inference is vendor-managed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NeuBird Hawkeye&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private SaaS (VPC)&lt;/td&gt;
&lt;td&gt;VPC deployment available. Ephemeral telemetry processing claimed by NeuBird. Inference path is vendor-managed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Causely&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Kubernetes-only. SaaS control plane.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cleric.ai&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Slack-first SaaS.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PagerDuty SRE Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Inside &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty Operations Cloud&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog Bits AI SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Multi-tenant inside Datadog. HIPAA-compliant per &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog's documentation&lt;/a&gt;, not air-gapped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incident.io AI SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Hosted multi-tenant. AI SRE access design-partner-gated.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rootly AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Closed-core SaaS. &lt;a href="https://rootly.com/labs" rel="noopener noreferrer"&gt;Rootly AI Labs&lt;/a&gt; publishes open-source prototypes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ServiceNow Now Assist SRE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;ServiceNow cloud. GA targeted June 2026.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edwin AI (LogicMonitor)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T2, Private (LogicMonitor-managed)&lt;/td&gt;
&lt;td&gt;Bundled with LogicMonitor Envision platform. Not standalone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk ITSI Episode Summarization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;T1, Public SaaS&lt;/td&gt;
&lt;td&gt;Splunk Cloud only as of May 2026 (Alpha).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The open-source projects are the only tools today that credibly reach T4 or T5 with public documentation. Aurora is the only one with multi-cloud scope at T5. Resolve.ai, Traversal, NeuBird, and Datadog Bits AI publish FedRAMP-adjacent or HIPAA tiers but no air-gapped reference architecture as of May 2026. For the broader category overview, see our &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;open-source incident management overview&lt;/a&gt; and the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions launch post&lt;/a&gt; for scheduled and event-triggered automations on top of self-hosted Aurora.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the architecture of a self-hosted AI SRE?
&lt;/h2&gt;

&lt;p&gt;A self-hosted agentic AI SRE has three concurrent runtime stacks. Skip any one and the deployment regresses to a lower sovereignty tier.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Orchestration runtime
&lt;/h3&gt;

&lt;p&gt;The agent loop is the LangGraph, ReAct, or equivalent orchestration that decides what tool to call next. It is the smallest of the three stacks by resource footprint and the easiest to self-host. Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python or Node runtime, typically containerised.&lt;/li&gt;
&lt;li&gt;A task queue (&lt;a href="https://docs.celeryq.dev/" rel="noopener noreferrer"&gt;Celery&lt;/a&gt;, &lt;a href="https://python-rq.org/" rel="noopener noreferrer"&gt;RQ&lt;/a&gt;, &lt;a href="https://docs.bullmq.io/" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt;) for long-running investigations.&lt;/li&gt;
&lt;li&gt;Postgres for agent state, investigation records, and audit logs.&lt;/li&gt;
&lt;li&gt;A secrets store (&lt;a href="https://developer.hashicorp.com/vault" rel="noopener noreferrer"&gt;HashiCorp Vault&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/secretsmanager/" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt;, or &lt;a href="https://cloud.google.com/security/products/security-key-management" rel="noopener noreferrer"&gt;KMS&lt;/a&gt;) for cloud credentials and LLM keys.&lt;/li&gt;
&lt;li&gt;A web UI or API surface for engineers to inspect and trigger investigations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora ships this stack as a Docker Compose for single-node deployment and a Helm chart for Kubernetes-native deployment, both &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;documented in the repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory layer
&lt;/h3&gt;

&lt;p&gt;An agent without memory is a stateless inference call. Memory is the difference between an agent that learns from the environment and an agent that makes the same investigative mistake every week.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency graph.&lt;/strong&gt; A graph database (&lt;a href="https://memgraph.com/docs" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt;, Neo4j) that holds the live topology of the infrastructure: services, dependencies, alert sources, and ownership. The agent traverses the graph to assess blast radius and trace upstream causes before issuing tool calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG corpus.&lt;/strong&gt; A vector database (&lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;, &lt;a href="https://qdrant.tech/documentation/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;, &lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;) holding embeddings of past postmortems, runbooks, design docs, and code. &lt;a href="https://weaviate.io/developers/weaviate/search/hybrid" rel="noopener noreferrer"&gt;Hybrid retrieval combining BM25 and vector search&lt;/a&gt; outperforms either alone on SRE corpora because exact-match identifiers (service names, error codes) coexist with semantic concepts (failure modes). See also the &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;root cause analysis complete guide for SREs&lt;/a&gt; for the broader investigation context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event store.&lt;/strong&gt; Postgres or an event-sourcing database for the agent's own investigation history. Past investigations become future evidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora's reference stack is Memgraph, Weaviate, and Postgres. Each runs in a customer container, and none requires an outbound network call.&lt;/p&gt;
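
&lt;p&gt;To make the hybrid-retrieval point in the RAG corpus bullet concrete, here is a minimal sketch of weighted score fusion in plain Python. It is a generic illustration, not Weaviate's or Aurora's implementation: the function names, the min-max normalisation, the &lt;code&gt;alpha&lt;/code&gt; weighting, and the document identifiers are all assumptions made for the example.&lt;/p&gt;

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0..1 range so the two rankers are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(keyword: dict[str, float], vector: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    """Blend keyword (BM25-style) and vector similarity scores.

    alpha=0 uses only keyword scores, alpha=1 only vector scores.
    A document missing from one ranker contributes 0 for that side.
    """
    kw, vec = min_max(keyword), min_max(vector)
    docs = set(kw) | set(vec)
    fused = {d: (1 - alpha) * kw.get(d, 0.0) + alpha * vec.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: the keyword side rewards the exact error code,
# the vector side rewards documents about a similar failure mode.
keyword_scores = {
    "postmortem-payments-503": 12.0,
    "runbook-redis-failover": 7.0,
    "design-doc-cache-layer": 2.0,
}
vector_scores = {
    "runbook-redis-failover": 0.82,
    "design-doc-cache-layer": 0.78,
    "postmortem-payments-503": 0.40,
}

for doc, score in hybrid_rank(keyword_scores, vector_scores, alpha=0.5):
    print(doc, round(score, 3))
```

&lt;p&gt;With these made-up numbers the runbook ranks first at &lt;code&gt;alpha=0.5&lt;/code&gt; because it scores reasonably on both sides, while either ranker alone would put a different emphasis on the exact-match postmortem. The right weighting for a real corpus should be measured, not assumed.&lt;/p&gt;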

&lt;h3&gt;
  
  
  3. Inference layer
&lt;/h3&gt;

&lt;p&gt;The LLM. Three paths, in increasing sovereignty:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Managed LLM API.&lt;/strong&gt; OpenAI, Anthropic, Google, Bedrock. Cheapest to start, lowest operational burden, but the deployment stays at T4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private endpoint.&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput" rel="noopener noreferrer"&gt;Azure OpenAI dedicated&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html" rel="noopener noreferrer"&gt;Bedrock Provisioned Throughput&lt;/a&gt;, or a partner-hosted endpoint. Stronger contractual perimeter, although the data still leaves the customer cloud account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM.&lt;/strong&gt; &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;, or a sovereign inference appliance. Reaches T5.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For T5, the inference stack is the operational lift. Hardware is the largest single line item, and team expertise is the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  BYO-LLM: which models run well locally?
&lt;/h2&gt;

&lt;p&gt;Open-weight model quality has progressed enough to anchor an agentic SRE loop in 2026. The current options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Llama 3.3 70B&lt;/strong&gt; (&lt;a href="https://ai.meta.com/blog/meta-llama-3-3-70b/" rel="noopener noreferrer"&gt;Meta, December 2024&lt;/a&gt;). Meta states the model delivers similar performance to Llama 3.1 405B at lower inference cost. A common starting point for local deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1&lt;/strong&gt; (&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;model card&lt;/a&gt;). A reasoning-tuned open-weight model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 2.5 and 3 families&lt;/strong&gt; (&lt;a href="https://qwenlm.github.io/blog/qwen2.5/" rel="noopener noreferrer"&gt;Qwen 2.5 release&lt;/a&gt;). Strong multilingual support for teams with non-English runbook content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral Large&lt;/strong&gt; (&lt;a href="https://docs.mistral.ai/getting-started/models/models_overview/" rel="noopener noreferrer"&gt;Mistral models&lt;/a&gt;). Strong tool-use performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hardware sizing for a 70B-class model: in float16, weights are roughly 140GB, so plan at least two 80GB cards (a pair of H100 or A100 80GB). A single H200 (141GB) holds the weights but leaves almost no room for the KV cache, so treat it as a fit only with quantisation. Q4-quantised variants compress weights to roughly 35-40GB and fit on a single 80GB card with context room, at some latency and quality cost. See the &lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct" rel="noopener noreferrer"&gt;Llama 3.3 70B model card&lt;/a&gt; for the canonical parameter and tensor sizes. Specific latency targets are workload-dependent and should be measured, not assumed.&lt;/p&gt;
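
&lt;p&gt;The sizing arithmetic above can be reproduced with a short Python sketch. It counts weights only and uses 1 GB = 10^9 bytes; the function names and the optional &lt;code&gt;headroom_gb&lt;/code&gt; parameter (a stand-in for KV cache and runtime overhead) are assumptions for illustration, not a vendor sizing formula.&lt;/p&gt;

```python
import math


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only.

    KV cache, activations, and runtime overhead are not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def cards_needed(weights_gb: float, card_gb: float,
                 headroom_gb: float = 0.0) -> int:
    """Cards whose combined memory covers the weights plus headroom."""
    return math.ceil((weights_gb + headroom_gb) / card_gb)


fp16_gb = weight_memory_gb(70, 16)  # float16: 2 bytes per parameter
q4_gb = weight_memory_gb(70, 4)     # 4-bit quantisation: 0.5 bytes per parameter

print("float16 weights (GB):", fp16_gb, "cards at 80GB:", cards_needed(fp16_gb, 80))
print("Q4 weights (GB):", q4_gb, "cards at 80GB:", cards_needed(q4_gb, 80))
```

&lt;p&gt;Real Q4 formats store scales and other metadata alongside the 4-bit values, which is why the practical figure lands above the 35GB floor this sketch computes. Passing a non-zero &lt;code&gt;headroom_gb&lt;/code&gt; shows how quickly context length changes the card count.&lt;/p&gt;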

&lt;p&gt;The constraint to flag: running a local LLM is a real engineering discipline. Teams without LLM-ops capacity should treat T4 (managed API) as the default and revisit T5 once the team is staffed for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does multi-cloud authentication work in a self-hosted agent?
&lt;/h2&gt;

&lt;p&gt;A self-hosted agent must still reach customer cloud APIs. The auth pattern matters because credentials live in the customer perimeter. Vendor-managed inference makes credential exfiltration a vendor-trust problem. Self-hosted inference makes it a customer-operations problem, which is the desired state.&lt;/p&gt;

&lt;p&gt;Aurora's reference multi-cloud auth pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" rel="noopener noreferrer"&gt;STS AssumeRole&lt;/a&gt; into customer accounts via a least-privilege investigation role. Credentials never persist in agent storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal" rel="noopener noreferrer"&gt;Service Principal&lt;/a&gt; with Reader (and incident-scoped Operator) role assignments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth-based authentication or &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;workload identity federation&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OVH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://help.ovhcloud.com/csm/en-gb-api-getting-started-ovhcloud-api" rel="noopener noreferrer"&gt;API key per investigation scope&lt;/a&gt;, stored in Vault.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaleway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.scaleway.com/en/docs/iam/how-to/create-api-keys/" rel="noopener noreferrer"&gt;API token&lt;/a&gt; stored in Vault.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubeconfig per cluster, stored in Vault. Sandboxed &lt;code&gt;kubectl&lt;/code&gt; execution in an isolated namespace; see our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
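&lt;p&gt;As an illustration of the AWS row, the sketch below builds the parameters for an STS AssumeRole request scoped to one investigation. It is not Aurora's implementation. The role name &lt;code&gt;aurora-investigator&lt;/code&gt;, the session-name prefix, and the helper &lt;code&gt;assume_role_request&lt;/code&gt; are hypothetical, and the ARN assumes the standard &lt;code&gt;aws&lt;/code&gt; partition. The code makes no network call.&lt;/p&gt;

```python
import re


def assume_role_request(account_id, role_name, incident_id, duration_seconds=900):
    """Build keyword arguments for an STS AssumeRole call (no network access)."""
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    # STS accepts 900 seconds up to the role's configured maximum (at most 43200).
    if duration_seconds not in range(900, 43201):
        raise ValueError("DurationSeconds must be between 900 and 43200")
    # Session names allow word characters plus "+=,.@-" and at most 64 characters.
    raw_name = "investigation-" + str(incident_id)
    session = re.sub(r"[^\w+=,.@-]", "-", raw_name, flags=re.ASCII)[:64]
    return {
        "RoleArn": "arn:aws:iam::" + account_id + ":role/" + role_name,
        "RoleSessionName": session,
        "DurationSeconds": duration_seconds,
    }


request = assume_role_request("123456789012", "aurora-investigator", "INC 42")
print(request["RoleArn"])
print(request["RoleSessionName"])
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("sts").assume_role(**request)
# The temporary keys come back under the "Credentials" key of the response.
```

&lt;p&gt;Tying the session name to the incident identifier means the CloudTrail record of each call can be matched back to one investigation, and the short default duration keeps the credential from outliving it.&lt;/p&gt;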

&lt;p&gt;The Vault binding matters: every cloud credential is short-lived where the cloud supports it, and every credential use is auditable. In a T5 deployment, the auditor's "who issued this command" question is answered by the Vault audit log and the agent's tool-call trace, not by a vendor SOC 2 attestation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an air-gapped AI SRE deployment require?
&lt;/h2&gt;

&lt;p&gt;The strict form of an air-gapped deployment permits no outbound network call during an investigation, including for inference.&lt;/p&gt;

&lt;p&gt;Aurora's air-gapped reference architecture covers six layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored container registry.&lt;/strong&gt; Every image (Aurora, Memgraph, Weaviate, Postgres, Vault, Ollama) is pulled from a customer-internal registry. No Docker Hub calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored package indices.&lt;/strong&gt; Python wheels and OS packages served from internal Artifactory or equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirrored model weights.&lt;/strong&gt; Llama 3.3 weights downloaded once on a connected jumpbox, scanned, hashed, and copied into the air-gapped network. Same for embedding models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local DNS.&lt;/strong&gt; No outbound DNS resolution required. Cloud APIs are reached via VPC private endpoints (&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" rel="noopener noreferrer"&gt;AWS PrivateLink&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview" rel="noopener noreferrer"&gt;Azure Private Endpoint&lt;/a&gt;, &lt;a href="https://cloud.google.com/vpc/docs/private-service-connect" rel="noopener noreferrer"&gt;GCP Private Service Connect&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No telemetry to vendor.&lt;/strong&gt; Neither Aurora nor the open-source components phone home; this is verified per release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sealed Vault.&lt;/strong&gt; Vault sealed and unsealed via internal HSM or &lt;a href="https://developer.hashicorp.com/vault/docs/concepts/seal" rel="noopener noreferrer"&gt;Shamir keyshares&lt;/a&gt;. No auto-unseal against a vendor KMS.&lt;/li&gt;
&lt;/ol&gt;
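&lt;p&gt;The "hashed" step for mirrored model weights can be sketched as follows: record a SHA-256 digest for each file on the connected jumpbox, then recompute and compare inside the air-gapped network. This is a minimal sketch rather than Aurora's tooling. The manifest layout (a mapping of file path to expected hex digest) and the function names are assumptions made for illustration.&lt;/p&gt;

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so multi-gigabyte weights never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths whose current digest differs from the recorded one."""
    return [
        path
        for path, expected in manifest.items()
        if sha256_of_file(path) != expected.lower()
    ]


# Usage inside the air-gapped network, with a hypothetical path and digest:
#   mismatches = verify_manifest({"/models/weights.gguf": "ab12..."})
#   An empty list means every file matches what was hashed on the jumpbox.
```

&lt;p&gt;The same check applies to embedding models and to any other artefact carried across the gap.&lt;/p&gt;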

&lt;p&gt;The provisioning lift is real. Teams that have operated air-gapped Kubernetes will recognise the pattern. Teams that have not should pilot in a connected environment first.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aurora implements the Sovereignty Spectrum
&lt;/h2&gt;

&lt;p&gt;Every Aurora deployment is configured for the customer's tier. The same codebase supports all five.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T1 and T2.&lt;/strong&gt; Aurora deployed to a public-cloud account with managed services for Postgres, Memgraph, and Weaviate. LLM via OpenAI or Anthropic API. Useful for evaluation pilots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T3.&lt;/strong&gt; Aurora deployed to a customer-owned VPC with private endpoints to managed services. LLM via private endpoint (Azure OpenAI dedicated, Bedrock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T4.&lt;/strong&gt; Aurora deployed to customer-owned VMs or Kubernetes with self-hosted Postgres, Memgraph, and Weaviate. LLM via managed API or private endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T5.&lt;/strong&gt; Aurora deployed to customer-owned air-gapped infrastructure with Ollama-hosted Llama 3.3 (or a sovereign LLM endpoint). All dependencies mirrored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because a single codebase serves all five tiers, a tier downgrade ("drop from T5 to T3 for one workload") or upgrade ("move the EU workload from T3 to T5") becomes a configuration change rather than a migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does self-hosted AI SRE cost compare to SaaS?
&lt;/h2&gt;

&lt;p&gt;A precise total cost of ownership depends on team size, model choice, infrastructure pricing, regional rates, and incident volume. Procurement should model the variable axes against incident volume rather than anchor on a single vendor-supplied number.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted T4 or T5 fixed costs.&lt;/strong&gt; Compute for the agent runtime, memory stores, and (for T5) the LLM node. Storage for the RAG corpus and audit log. Engineering time to operate the stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted T4 variable costs.&lt;/strong&gt; Managed LLM API usage at provider rates (&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI pricing&lt;/a&gt;, &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;Anthropic pricing&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;Bedrock pricing&lt;/a&gt;). Scales with the number and depth of investigations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commercial SaaS variable costs.&lt;/strong&gt; Per-seat tiers (incident.io, Rootly, PagerDuty), per-investigation billing (&lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;Datadog Bits AI&lt;/a&gt;, NeuBird), or per-credit consumption (ServiceNow). All published on the vendor's pricing page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even between a self-hosted T5 deployment and per-investigation SaaS depends on the vendor's per-investigation price, the LLM choice, and the engineering cost of running the stack. Procurement teams should model three points: today's incident volume, twelve-month projected volume, and a 3x scenario. If any of the three is dominated by sovereignty rather than economics, the regulator decides the deployment tier, not the spreadsheet.&lt;/p&gt;
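&lt;p&gt;The three-point model can be sketched as a small calculation. Every number below is a placeholder chosen for illustration, not a vendor price or a measured cost: a fixed monthly self-hosting cost of 6000, a SaaS price of 25 per investigation, and volumes of 100, 200, and 300 investigations per month. The function names are hypothetical.&lt;/p&gt;

```python
import math


def monthly_cost_self_hosted(fixed_monthly, llm_cost_per_investigation, investigations):
    """Fixed stack cost plus any per-investigation LLM spend (zero for a local model)."""
    return fixed_monthly + llm_cost_per_investigation * investigations


def monthly_cost_saas(price_per_investigation, investigations):
    """Per-investigation SaaS billing with no fixed component."""
    return price_per_investigation * investigations


def break_even_volume(fixed_monthly, llm_cost_per_investigation, price_per_investigation):
    """Monthly volume at which self-hosting costs no more than SaaS, or None if never."""
    margin = price_per_investigation - llm_cost_per_investigation
    if margin > 0:
        return math.ceil(fixed_monthly / margin)
    return None


# Placeholder inputs: replace with quoted prices and measured costs.
FIXED, LLM_COST, SAAS_PRICE = 6000, 0, 25
scenarios = {"today": 100, "twelve-month": 200, "3x": 300}

print("break-even volume:", break_even_volume(FIXED, LLM_COST, SAAS_PRICE))
for label, volume in scenarios.items():
    hosted = monthly_cost_self_hosted(FIXED, LLM_COST, volume)
    saas = monthly_cost_saas(SAAS_PRICE, volume)
    print(label, "self-hosted:", hosted, "saas:", saas)
```

&lt;p&gt;The model leaves out engineering time unless it is folded into the fixed monthly figure, which is usually the largest and least certain input.&lt;/p&gt;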

&lt;h2&gt;
  
  
  When self-hosting is the wrong answer
&lt;/h2&gt;

&lt;p&gt;Self-hosting is an engineering commitment, not a checkbox.&lt;/p&gt;

&lt;p&gt;Teams that should skip it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No LLM-ops capacity.&lt;/strong&gt; If no one on the team has run inference servers in production, do not start with air-gapped Ollama. Pilot at T1 or T2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small team, low incident volume.&lt;/strong&gt; Below twenty incidents per month, the operational overhead can exceed the cost savings of self-hosting. T1 is fine if the data classification allows it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No regulatory or sovereignty pressure.&lt;/strong&gt; If the compliance team is not asking and the data classification is not sensitive, the sovereignty premium is paid for nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early in the AI SRE evaluation curve.&lt;/strong&gt; A managed pilot validates the value of the agent to the team. Self-host after that decision, not before it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams that should default to self-hosting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regulated workloads (finance, healthcare, defence, critical infrastructure).&lt;/li&gt;
&lt;li&gt;EU sovereign-data customers.&lt;/li&gt;
&lt;li&gt;Customers that themselves advertise sovereignty as a product attribute.&lt;/li&gt;
&lt;li&gt;Public-sector buyers under FedRAMP High, IRAP PROTECTED, IL5, or equivalent.&lt;/li&gt;
&lt;li&gt;Anyone whose log lines contain customer PII that has not been scrubbed at source.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Arvo expects three shifts in the self-hosted AI SRE landscape over the next twelve months.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sovereign LLM endpoints.&lt;/strong&gt; EU-hosted, contract-bound LLM endpoints from cloud regions outside US jurisdiction will turn T4 into a viable tier for European regulated customers without forcing T5. Anthropic, OpenAI, and Google are each shipping or piloting EU-resident inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Air-gap reference appliances.&lt;/strong&gt; Appliance-style packages (preloaded GPU servers with Aurora, a local LLM, and a sealed Vault) sold as turn-key T5 deployments are likely to emerge from hardware vendors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open benchmark cohorts.&lt;/strong&gt; Closed-source players still measure themselves on private datasets. The first open, named, multi-LLM benchmark on a public incident corpus will become the citation surface the category orbits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2024 self-hosted AI SRE was a theoretical option. By 2025 it was niche. In 2026 it has become the procurement default for regulated workloads. The tools that can execute it today are Aurora at the multi-cloud end, HolmesGPT at the CNCF and Kubernetes end, and K8sGPT for diagnostics.&lt;/p&gt;

&lt;p&gt;For the full landscape of AI SRE tools and how each maps to a deployment tier, see &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;Top 15 AI SRE Tools in 2026&lt;/a&gt;. For the broader category overview, see &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE: The Complete Guide for Engineering Teams in 2026&lt;/a&gt;. For the investigation and postmortem halves of the workflow, see &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt; and &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Top 15 AI SRE Tools in 2026: Open-Source and Commercial</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 19 May 2026 00:55:24 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/top-15-ai-sre-tools-in-2026-open-source-and-commercial-1bl7</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/top-15-ai-sre-tools-in-2026-open-source-and-commercial-1bl7</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE tool applies large-language-model reasoning to incident response, usually as a multi-step agent that runs infrastructure tools, summarizes events, or drafts postmortems.&lt;/strong&gt; The label spans five archetypes that vendors blur in marketing: agentic investigation, AIOps correlation, postmortem generation, ITSM-integrated copilots, and workflow-automation suites with AI add-ons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We score every tool on the AI SRE Capability Matrix.&lt;/strong&gt; Five axes (Investigation, Remediation, Postmortem, Deployment Flexibility, Source Availability), each 0 to 3, total 15. The matrix tracks publicly documented capability as of May 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three open-source projects span the agentic-investigation lane.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, multi-cloud), &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since October 2025&lt;/a&gt;, co-maintained by &lt;a href="https://www.cncf.io/blog/2026/01/07/holmesgpt-agentic-troubleshooting-built-for-the-cloud-native-era/" rel="noopener noreferrer"&gt;Robusta and Microsoft&lt;/a&gt;), and &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; (Apache 2.0, &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;, Kubernetes diagnostics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent cited funding rounds.&lt;/strong&gt; &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai raised $125M at a $1B valuation in February 2026&lt;/a&gt; and &lt;a href="https://www.prnewswire.com/news-releases/resolve-ai-announces-series-a-extension-at-a-1-5b-valuation-and-launches-resolve-ai-labs-to-advance-ai-systems-for-complex-production-environments-302743888.html" rel="noopener noreferrer"&gt;extended at a $1.5B valuation in April 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal raised $48M in June 2025&lt;/a&gt;. &lt;a href="https://incident.io/blog/incident-io-raises-62m-series-b-to-build-ai-agents-that-resolve-incidents-with-you" rel="noopener noreferrer"&gt;incident.io closed a $62M Series B in September 2024&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incumbents shipped AI SRE features by Q2 2026.&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;PagerDuty SRE Agent&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/bits-ai-sre/" rel="noopener noreferrer"&gt;Datadog Bits AI SRE&lt;/a&gt;, &lt;a href="https://www.splunk.com/en_us/blog/observability/conf25-splunk-observability-announcements.html" rel="noopener noreferrer"&gt;Splunk ITSI Episode Summarization announced at .conf25&lt;/a&gt; (September 2025), &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow Now Assist SRE Specialist&lt;/a&gt; (GA targeted June 2026), and &lt;a href="https://www.logicmonitor.com/edwin-ai" rel="noopener noreferrer"&gt;LogicMonitor Edwin AI&lt;/a&gt;. The procurement question moves from "is there an AI option" to "which archetype, at what deployment tier."&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Site reliability teams in 2026 are evaluating tools in a market that has reorganised faster than most procurement processes can keep up with. Five archetypes share the "AI SRE" label, and buyers regularly compare a postmortem generator to an agentic investigator as if they did the same job. This guide compares the fifteen most-cited tools across both open-source and commercial categories, scored on a single capability matrix so the decision becomes one of fit.&lt;/p&gt;

&lt;p&gt;A note on bias. Arvo builds &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, an open-source agentic AI SRE tool listed below. We applied the same scoring rubric to every product on the list, including our own, and cited every numeric or capability claim that is not common knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI SRE tool?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An AI SRE tool applies large-language-model reasoning to incident response.&lt;/strong&gt; The term covers five distinct archetypes, and only two of them actually investigate incidents.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agentic investigation.&lt;/strong&gt; A multi-step LLM agent that calls infrastructure tools (&lt;code&gt;kubectl&lt;/code&gt;, cloud APIs, log queries, dependency graphs) during an incident to gather new evidence and produce a root-cause analysis. Aurora, HolmesGPT, K8sGPT, Resolve.ai, Traversal, NeuBird, Cleric, Causely, and Ciroos all market themselves with this framing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps correlation.&lt;/strong&gt; Statistical or ML clustering of alerts to reduce noise. PagerDuty Intelligent Alert Grouping, BigPanda, Dell APEX (Moogsoft), Dynatrace Davis. The category predates LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation.&lt;/strong&gt; An LLM that drafts the retrospective from artefacts the team already has (Slack transcripts, monitor data, the investigation trace). Rootly, incident.io Scribe, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered in our &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ITSM-integrated copilot.&lt;/strong&gt; AI inside an existing service-management workflow. ServiceNow Now Assist SRE Specialist, LogicMonitor Edwin AI, Splunk ITSI Episode Summarization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow-automation suite plus AI add-on.&lt;/strong&gt; Incident platforms that bolted AI onto existing on-call, runbook, and status-page features. incident.io AI SRE, Rootly AI, FireHydrant AI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conflating archetypes is the most common evaluation mistake. A team buying a postmortem generator will not get root-cause analysis. A team buying an AIOps correlator will not get a tool that runs &lt;code&gt;kubectl&lt;/code&gt;. For the foundational definitions, see our &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;AI SRE Complete Guide&lt;/a&gt; and &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI SRE Capability Matrix
&lt;/h2&gt;

&lt;p&gt;Five axes, each scored 0 to 3. We apply the same rubric to every tool in the shortlist.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Single-shot LLM summary&lt;/td&gt;
&lt;td&gt;Multi-step agent, single cloud or platform&lt;/td&gt;
&lt;td&gt;Multi-step agent, multi-cloud, with RAG over historical evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Suggested commands&lt;/td&gt;
&lt;td&gt;PR-based fixes with approval&lt;/td&gt;
&lt;td&gt;Sandboxed in-cluster execution with policy guardrails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postmortem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Manual export of a transcript&lt;/td&gt;
&lt;td&gt;LLM-drafted from artefacts&lt;/td&gt;
&lt;td&gt;LLM-drafted from the agent's own investigation trace, exported to Confluence or Jira&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS-only, public cloud&lt;/td&gt;
&lt;td&gt;SaaS with private VPC peering&lt;/td&gt;
&lt;td&gt;Self-hosted in customer VPC&lt;/td&gt;
&lt;td&gt;Air-gapped with local LLM (Ollama or vLLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Source-available, paid&lt;/td&gt;
&lt;td&gt;Open core&lt;/td&gt;
&lt;td&gt;Apache 2.0 or MIT, fully open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A higher score is not always "better." A team without LLM-ops capacity should not count a deployment-flexibility score of 3 as an advantage, because it cannot operate an air-gapped local LLM. The matrix is for like-for-like comparison, not a leaderboard.&lt;/p&gt;
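&lt;p&gt;The matrix lends itself to a small scoring helper. The sketch below sums the five axes and, because the total is not a leaderboard, also filters tools against a team's own minimum per axis. The two score sets are the HolmesGPT and K8sGPT scores given later in this article; the function names and the example minimums are illustrative assumptions.&lt;/p&gt;

```python
AXES = ("investigation", "remediation", "postmortem", "deployment", "source")


def total_score(scores):
    """Sum the five axes after checking each is present and within 0 to 3."""
    for axis in AXES:
        if axis not in scores:
            raise ValueError("missing axis: " + axis)
        if scores[axis] not in (0, 1, 2, 3):
            raise ValueError("score out of range for " + axis)
    return sum(scores[axis] for axis in AXES)


def meets_requirements(scores, minimums):
    """True when every axis the team cares about reaches its minimum level."""
    return all(scores[axis] >= level for axis, level in minimums.items())


tools = {
    "HolmesGPT": {"investigation": 2, "remediation": 1, "postmortem": 1,
                  "deployment": 2, "source": 3},
    "K8sGPT": {"investigation": 1, "remediation": 1, "postmortem": 0,
               "deployment": 2, "source": 3},
}

# Hypothetical team requirement: multi-step investigation, self-hostable.
minimums = {"investigation": 2, "deployment": 2}

for name, scores in tools.items():
    print(name, total_score(scores), meets_requirements(scores, minimums))
```

&lt;p&gt;Filtering on minimums first and comparing totals second keeps an axis the team cannot use from inflating a tool's apparent fit.&lt;/p&gt;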

&lt;p&gt;For a deeper treatment of the deployment-flexibility axis, see our companion piece, &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE: The 2026 Guide to Air-Gapped, Multi-Cloud, and BYO-LLM Deployment&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI SRE tools are most-cited in 2026?
&lt;/h2&gt;

&lt;p&gt;Ordered alphabetically inside each archetype. Scoring reflects the publicly documented capability of each product as of May 2026, not roadmap claims. For category foundations, see our &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;open-source incident management overview&lt;/a&gt; and the &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;root cause analysis complete guide for SREs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic-investigation tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Aurora (Arvo AI), Apache 2.0, multi-cloud
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; SRE teams that need self-hosted, multi-cloud, BYO-LLM agentic investigation with the option to graduate into PR-based remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Docker Compose, Helm chart, or air-gapped with &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. Customer-owned infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. Code at &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; LangGraph-orchestrated ReAct agent, 30+ integrations across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes. Memgraph dependency graph feeds an alert-correlation pre-step. Weaviate hybrid (BM25 plus vector) RAG over runbooks and past postmortems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Sandboxed &lt;code&gt;kubectl&lt;/code&gt; execution into an isolated "untrusted" namespace, wrapped in a four-layer command-safety pipeline (input rail, &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; signature match, per-org policy, LLM safety judge). Aurora Actions add scheduled and event-triggered automations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Postmortem agent fed by the investigation trace, exported to Confluence Cloud (OAuth) or Server / Data Center (PAT).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free (Apache 2.0). Costs are infrastructure plus, optionally, LLM API usage. With local Ollama the recurring software cost is zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Self-host means the team operates the agent. Teams without basic Kubernetes ops capacity should pilot in an existing managed cluster first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 3, Postmortem 3, Deployment 3, Source 3, total &lt;strong&gt;15/15&lt;/strong&gt;. The score reflects the breadth of the open-source feature set against the matrix, not a quality verdict relative to commercial competitors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Causely, closed source, Kubernetes-only
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Kubernetes-only teams that want causal-graph reasoning rather than LLM-first investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with in-cluster collector. CNCF &lt;a href="https://www.cncf.io/sandbox-projects/" rel="noopener noreferrer"&gt;Causely member listing&lt;/a&gt; (member, not project).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Topology graph plus causality graph plus a "codebook" of failure patterns; the authors describe a deterministic abductive-inference layer that precedes any LLM call. See &lt;a href="https://docs.causely.ai/getting-started/how-causely-works/" rel="noopener noreferrer"&gt;How Causely Works&lt;/a&gt; and the &lt;a href="https://www.infoq.com/articles/causal-reasoning-observability/" rel="noopener noreferrer"&gt;InfoQ piece on causal reasoning in observability&lt;/a&gt;. &lt;a href="https://www.causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes" rel="noopener noreferrer"&gt;Gartner Cool Vendor for AIOps, December 2025&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based via &lt;a href="https://www.causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not a first-class artefact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Not publicly disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Kubernetes-only by design. If the platform spans cloud SDKs and managed services, the model is incomplete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 0, Deployment 0, Source 0, total &lt;strong&gt;3/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Cleric.ai, closed source, Slack-first
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; SRE teams that triage primarily in Slack and use Datadog or Grafana for telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Slack-native AI SRE per &lt;a href="https://cleric.ai/" rel="noopener noreferrer"&gt;cleric.ai&lt;/a&gt;. Integrations with Datadog and Grafana are documented on the product site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Not publicly disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Slack-first is a strong constraint. Teams on Microsoft Teams or under strict ChatOps governance may find the surface rigid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 0, Source 0, total &lt;strong&gt;4/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. HolmesGPT, Apache 2.0, Kubernetes-first
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Kubernetes-heavy teams that want a CNCF-aligned, RBAC-respecting investigation agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Helm via Robusta, or standalone CLI. LLM provider is the customer's choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. Code at &lt;a href="https://github.com/HolmesGPT/holmesgpt" rel="noopener noreferrer"&gt;github.com/HolmesGPT/holmesgpt&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since October 2025&lt;/a&gt;, co-maintained by Robusta and Microsoft.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Iterative ReAct agent. &lt;a href="https://holmesgpt.dev/data-sources/builtin-toolsets/" rel="noopener noreferrer"&gt;Built-in toolsets&lt;/a&gt; span Prometheus, Grafana, AWS / Azure / GCP via MCP read-only, Datadog, and Confluence. Releases v0.20 through v0.25 shipped between February and April 2026 (&lt;a href="https://github.com/HolmesGPT/holmesgpt/releases" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only by default. Operator mode can open GitHub PRs. No in-cluster execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not first-class. Investigations route to Slack, PagerDuty, or Jira.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free. Robusta sells a managed SaaS that wraps HolmesGPT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; AWS, Azure, and GCP support is exposed through MCP wrappers rather than first-class cloud SDK integration. The customer IAM model must fit MCP's read-only assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 2, Source 3, total &lt;strong&gt;9/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. K8sGPT, Apache 2.0, Kubernetes-only diagnostics
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Quick diagnostic sanity checks on a single cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; CLI, in-cluster operator, or Helm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;CNCF Sandbox since 19 December 2023&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Rule-based analyser set (Pod, Deployment, Ingress, Service, NetworkPolicy, etc.) with an LLM translating findings into natural language. Closer to L3 (single-shot diagnosis) than L4 (agentic multi-step) on the &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AICL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestion-based per &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;k8sgpt docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not a feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Scope is limited to the cluster API; the tool cannot reach cloud APIs or external systems. On the privacy side it is strong: resource names and labels are anonymised before LLM calls per the &lt;a href="https://docs.k8sgpt.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 1, Remediation 1, Postmortem 0, Deployment 2, Source 3, total &lt;strong&gt;7/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. NeuBird Hawkeye, closed source, multi-platform
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Datadog-heavy AWS shops that want a managed AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS or VPC. Mayfield, M12, and AWS GenAI Accelerator backing per &lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;neubird.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Ephemeral processing (telemetry not stored). Integrations with Datadog, Splunk, CloudWatch, PagerDuty, and ServiceNow per the &lt;a href="https://neubird.ai/blog/how-hawkeye-works-deep-dive-secure-genai-powered-it-operations/" rel="noopener noreferrer"&gt;Hawkeye deep-dive&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only by default. Integrations forward to ITSM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-investigation pricing listed on AWS Marketplace; enterprise contracts also available. See &lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;NeuBird's product page&lt;/a&gt; for the latest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; "Self-learning" implies a vector store that customers cannot directly inspect. Run due diligence on the data path for regulated workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Resolve.ai, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise teams that want a managed "AI Production Engineer" with named-customer case studies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with in-VPC satellite agent for telemetry. No on-prem option. SOC 2, GDPR, HIPAA per the &lt;a href="https://resolve.ai/about-us" rel="noopener noreferrer"&gt;Resolve trust page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Knowledge-graph plus LLM agent per the &lt;a href="https://resolve.ai/blog/knowledge-graph-agentic-ai-incident-response" rel="noopener noreferrer"&gt;Resolve knowledge-graph post&lt;/a&gt;. Founders include Spiros Xanthos, an OpenTelemetry co-creator. Resolve's &lt;a href="https://resolve.ai/news/resolveai-raises-125-million-series-a" rel="noopener noreferrer"&gt;Series A press release&lt;/a&gt; reports vendor-claimed customer results that Arvo has not independently verified: 72% investigation-time reduction at Coinbase, 87% faster investigations at DoorDash, and 30% fewer engineers per incident at Zscaler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Generates suggested commands. Public architecture detail is limited.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise. Public pricing is not disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Cloud-only and closed-source. The two public LLM benchmark posts (for example, &lt;a href="https://resolve.ai/blog/Our-early-impressions-of-Claude-Sonnet-4.6" rel="noopener noreferrer"&gt;Sonnet 4.6&lt;/a&gt;) use a private dataset with no public methodology, so the numbers cannot be independently reproduced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  8. Traversal, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Log-heavy enterprise environments where causal search across telemetry is the bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS with flexible deployment options. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;$48M from Sequoia and Kleiner Perkins, June 2025&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; "Production World Model" and "Causal Search Engine" per &lt;a href="https://traversal.com/blog/introducing-causal-search-engine-from-correlated-alerts-to-causally-consistent-diagnoses" rel="noopener noreferrer"&gt;Traversal's product blog&lt;/a&gt;. Vendor-reported production results at American Express, summarised in the Fortune launch coverage and Traversal's &lt;a href="https://traversal.com/blog/american-express-announcement" rel="noopener noreferrer"&gt;Amex announcement&lt;/a&gt;: 32% MTTR reduction and 82% RCA accuracy across roughly 250 billion log lines per day. Customer stories at &lt;a href="https://traversal.com/customer-stories/eventbrite" rel="noopener noreferrer"&gt;Eventbrite&lt;/a&gt;, PepsiCo, and DigitalOcean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Read-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Heavy reliance on trademarked frameworks. Confirm during evaluation how much is novel architecture versus packaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 3, Remediation 1, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Incumbent and incident-workflow tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  9. Datadog Bits AI SRE, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams standardised on Datadog observability who want investigation where the data already lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, multi-tenant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Multi-agent architecture with planner and worker agents. Datadog's engineering posts &lt;a href="https://www.datadoghq.com/blog/building-bits-ai-sre/" rel="noopener noreferrer"&gt;Building Bits AI SRE&lt;/a&gt; and &lt;a href="https://www.datadoghq.com/blog/engineering/bits-ai-eval-platform/" rel="noopener noreferrer"&gt;the evaluation platform&lt;/a&gt; describe the design without releasing source. &lt;a href="https://www.datadoghq.com/product/ai/bits-ai-sre/" rel="noopener noreferrer"&gt;HIPAA-compliant&lt;/a&gt; per the product page. Seven triage actions including Slack, Teams, and Jira.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Triage actions only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Bits AI drafts post-incident reports per the &lt;a href="https://www.datadoghq.com/blog/bits-ai-for-incident-management/" rel="noopener noreferrer"&gt;product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-conclusive-investigation billing on top of host, APM, logs, and RUM licensing per &lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;Datadog pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Bits is tightly bound to Datadog's data plane. Using it without the full Datadog stack is not a supported pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  10. Edwin AI (LogicMonitor), closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Existing LogicMonitor Envision customers expanding into agentic AIOps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS layered on LogicMonitor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Ten-plus specialised sub-agents (investigation, correlation, remediation, orchestrator) per the &lt;a href="https://www.logicmonitor.com/blog/meet-edwin-ai-specialized-agents-agentic-aiops" rel="noopener noreferrer"&gt;agent-taxonomy post&lt;/a&gt;. MCP ecosystem support (Dynatrace, Splunk, ServiceNow, Elastic, GitHub, Confluence). A &lt;a href="https://www.logicmonitor.com/blog/fortune-500-it-incident-reduction-edwin-ai" rel="noopener noreferrer"&gt;Forrester Total Economic Impact study&lt;/a&gt; commissioned by LogicMonitor reports 313% ROI on a composite organisation with sub-six-month payback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Closed-loop with policy guardrails per LogicMonitor's product description.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Investigation transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Bundled with LogicMonitor; quoted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Customers must purchase LogicMonitor to use Edwin. Not a standalone option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 2, Postmortem 1, Deployment 1, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  11. incident.io AI SRE, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using incident.io for on-call and incident workflow who want the AI add-on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Multi-agent system searching GitHub PRs, Slack, historical incidents, logs, metrics, and traces per &lt;a href="https://incident.io/blog/introducing-ai-sre" rel="noopener noreferrer"&gt;incident.io's AI SRE introduction&lt;/a&gt;. An "ambient agent" continuously monitors. The &lt;a href="https://www.zenml.io/llmops-database/ai-powered-incident-response-system-with-multi-agent-investigation" rel="noopener noreferrer"&gt;ZenML LLMOps case study&lt;/a&gt; documents the retrieval evolution from embeddings-only to deterministic tagging plus re-ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Recommendations only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Scribe drafts post-incident reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Platform tiers on &lt;a href="https://incident.io/pricing" rel="noopener noreferrer"&gt;incident.io's pricing page&lt;/a&gt;. AI SRE access is gated to design partners as of the launch announcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Verify AI SRE availability for your tier before assuming you can use it on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  12. PagerDuty SRE Agent, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; PagerDuty Operations Cloud customers who want a memory-equipped agent inside the existing on-call surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, inside PagerDuty Operations Cloud per the &lt;a href="https://www.pagerduty.com/platform/ai-agents/sre/" rel="noopener noreferrer"&gt;product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Per-tenant memory: service-scoped observations, incident recollections, human-promoted playbooks. See PagerDuty's engineering post &lt;a href="https://www.pagerduty.com/blog/ai/we-built-an-sre-agent-with-memory-and-its-transforming-incident-response/" rel="noopener noreferrer"&gt;We Built an SRE Agent With Memory&lt;/a&gt;. MCP server. Connectors to Grafana, New Relic, and Honeycomb. Three-tier engagement model (agent-led, collaborative, human-led).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestions and automation hooks through existing PagerDuty workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; PagerDuty Scribe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Per-seat tiers and AIOps add-ons listed on &lt;a href="https://www.pagerduty.com/pricing/aiops/" rel="noopener noreferrer"&gt;PagerDuty pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; AI pricing across the incident-management category is moving from per-seat to usage-based. Model the long-term cost against incident volume rather than seat count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;5/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  13. Rootly AI, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want an AI-first ChatOps incident response with an open MCP server and an actively published agent roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed core. &lt;a href="https://rootly.com/labs" rel="noopener noreferrer"&gt;Rootly AI Labs&lt;/a&gt; publishes open-source prototypes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; Analyses code changes, telemetry, and past incidents per the &lt;a href="https://rootly.com/ai-sre" rel="noopener noreferrer"&gt;Rootly AI SRE page&lt;/a&gt;. An AI Meeting Bot joins incident bridges and transcribes. The &lt;a href="https://rootly.com/blog/introducing-rootlys-api-ai-agent-first-approach" rel="noopener noreferrer"&gt;Rootly API agent-first announcement&lt;/a&gt; describes the MCP-based agentic surface used by Cursor, Windsurf, and Claude.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Suggestions plus workflow automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; AI-drafted from incident artefacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Tiers listed on &lt;a href="https://rootly.com/pricing" rel="noopener noreferrer"&gt;Rootly pricing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; "AI-first" branding outpaces the published architecture detail; in evaluation, ask for the agent loop description and the rule-based-automation boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 1, Postmortem 2, Deployment 0, Source 1, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  14. ServiceNow Now Assist SRE Specialist, closed source
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises on ServiceNow ITSM that want triage and post-mortems inside the same platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS, ServiceNow cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; The "SRE Specialist" performs triage (what, impact, priority, who) and autonomous post-mortem authoring, announced as part of the Autonomous Workforce in &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow's Knowledge 2026 release&lt;/a&gt;. GA targeted June 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Workflow automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Autonomous authoring claimed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Custom-quoted. Public pricing is not disclosed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; As of May 2026 the product is pre-GA and most coverage is press-release or keynote material. Treat capabilities as preliminary until verified during the design-partner phase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 2, Remediation 2, Postmortem 2, Deployment 0, Source 0, total &lt;strong&gt;6/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  15. Splunk ITSI Episode Summarization, closed source (Alpha)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Splunk-heavy enterprises that want LLM summaries layered on existing KPI engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Splunk Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; Closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation:&lt;/strong&gt; &lt;a href="https://www.splunk.com/en_us/blog/observability/conf25-splunk-observability-announcements.html" rel="noopener noreferrer"&gt;ITSI Episode Summarization&lt;/a&gt;, announced at .conf25 (September 2025), is in Alpha. The feature layers an LLM-generated summary (what happened, when, key events, suspected cause) onto Splunk ITSI's KPI-based episodes. Splunk also ships Event iQ for AI-driven alert correlation, listed on the &lt;a href="https://www.splunk.com/en_us/products/it-service-intelligence.html" rel="noopener noreferrer"&gt;ITSI product page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; Recommendation-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem:&lt;/strong&gt; Not yet a published feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Splunk ITSI is data-volume or entity-count licensed. The AI features are in Alpha.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch out for:&lt;/strong&gt; Alpha contract and capability terms can shift. Plan a re-evaluation after GA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability score:&lt;/strong&gt; Investigation 1, Remediation 1, Postmortem 1, Deployment 0, Source 0, total &lt;strong&gt;3/15&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entry no.&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Score (out of 15)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Aurora&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;HolmesGPT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;K8sGPT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Resolve.ai&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Traversal&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Edwin AI&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Rootly AI&lt;/td&gt;
&lt;td&gt;Closed (Labs OSS)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;ServiceNow Now Assist SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;NeuBird Hawkeye&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Datadog Bits AI SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;incident.io AI SRE&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;PagerDuty SRE Agent&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Cleric.ai&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Causely&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Splunk ITSI Episode Summarization&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The open-source projects lead the deployment-flexibility and source-availability axes by definition. Aurora is the only entry that scores 3 on every axis. Commercial leaders cluster around 5 to 6 because they score 2 or 3 on investigation but only 0 or 1 on deployment flexibility and source availability. Kubernetes-only projects (K8sGPT at 7, Causely at 3) and pre-GA incumbents (Splunk ITSI at 3) are held down because their scope or maturity caps multiple axes.&lt;/p&gt;

&lt;p&gt;The score does not pick a winner. It picks a fit. A bank under FedRAMP High obligations evaluates this list differently from a 50-engineer Series B startup. The deployment axis answers the fitness question; investigation answers the depth question; source availability answers the trust question.&lt;/p&gt;
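
&lt;p&gt;The totals in the table are plain sums of the five axis scores, so they are easy to recompute and to break down by axis. The sketch below is illustrative only: the axis values are copied from a handful of the entries above, while the names &lt;code&gt;AXES&lt;/code&gt;, &lt;code&gt;SCORES&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt;, and &lt;code&gt;axis_view&lt;/code&gt; are hypothetical helpers, not part of any tool on the list.&lt;/p&gt;

```python
# Axis order used throughout the article; each axis is scored 0 to 3.
AXES = ("investigation", "remediation", "postmortem", "deployment", "source")

# Values copied from the per-tool "Capability score" lines (a subset, for brevity).
SCORES = {
    "K8sGPT": (1, 1, 0, 2, 3),
    "NeuBird Hawkeye": (2, 1, 1, 1, 0),
    "Resolve.ai": (3, 1, 1, 1, 0),
    "Datadog Bits AI SRE": (2, 1, 2, 0, 0),
    "Splunk ITSI Episode Summarization": (1, 1, 1, 0, 0),
}


def total(tool):
    """Sum the five axis scores for one tool (maximum 15)."""
    return sum(SCORES[tool])


def axis_view(tool):
    """Return the per-axis breakdown as a dict so weak axes are easy to spot."""
    return dict(zip(AXES, SCORES[tool]))


# Highest total first; ties keep their original order.
ranked = sorted(SCORES, key=total, reverse=True)

for name in ranked:
    print(f"{name}: {total(name)}/15 {axis_view(name)}")
```

&lt;p&gt;Keeping the per-axis breakdown next to the total matters because two tools with the same total can fail on different axes, which is exactly the fit question the score is meant to surface.&lt;/p&gt;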

&lt;h2&gt;
  
  
  How do I choose an AI SRE tool?
&lt;/h2&gt;

&lt;p&gt;Most procurement processes stall because the team compares across all five axes at once. Asking these three questions in order eliminates twelve of the fifteen tools before vendor demos.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does the data have to stay in our perimeter?&lt;/strong&gt; If yes, the answer is Aurora, HolmesGPT, or K8sGPT. Every commercial product on this list requires data to leave the customer perimeter for inference. See &lt;a href="https://www.arvoai.ca/blog/self-hosted-ai-sre" rel="noopener noreferrer"&gt;Self-Hosted AI SRE&lt;/a&gt; for the architecture you will need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the scope multi-cloud or Kubernetes-only?&lt;/strong&gt; If multi-cloud, the open-source shortlist narrows to Aurora; in the commercial set, Resolve.ai, Traversal, NeuBird, and incident.io are the credible candidates. If Kubernetes-only, every tool still on the shortlist remains valid, and Aurora's non-Kubernetes integrations simply go unused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need to take action, or only investigate?&lt;/strong&gt; Read-only covers most of the open-source category and most incumbent AI features. Actioning agents narrow the list to Aurora (PR-based, sandboxed kubectl, plus Aurora Actions), ServiceNow Now Assist (workflow automation), and Edwin AI (closed-loop within LogicMonitor).&lt;/li&gt;
&lt;/ol&gt;
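
&lt;p&gt;The three questions above behave like successive set intersections, which is why asking them in order shrinks the list so quickly. A minimal sketch, assuming the candidate pool is limited to the tools the three questions name explicitly; the set names and the &lt;code&gt;shortlist&lt;/code&gt; function are hypothetical and exist only for illustration.&lt;/p&gt;

```python
# Membership taken from the three questions above; tools not named there are omitted.
IN_PERIMETER = {"Aurora", "HolmesGPT", "K8sGPT"}
MULTI_CLOUD = {"Aurora", "Resolve.ai", "Traversal", "NeuBird Hawkeye", "incident.io AI SRE"}
ACTIONING = {"Aurora", "ServiceNow Now Assist SRE", "Edwin AI"}

# Assumption: the starting pool is just the union of the named tools.
ALL_NAMED = IN_PERIMETER | MULTI_CLOUD | ACTIONING


def shortlist(data_stays_inside, multi_cloud, needs_action):
    """Apply the three questions in order, intersecting the pool at each 'yes'."""
    candidates = set(ALL_NAMED)
    if data_stays_inside:
        candidates = candidates.intersection(IN_PERIMETER)
    if multi_cloud:
        candidates = candidates.intersection(MULTI_CLOUD)
    if needs_action:
        candidates = candidates.intersection(ACTIONING)
    return sorted(candidates)


print(shortlist(True, False, False))
print(shortlist(False, False, True))
print(shortlist(True, True, True))
```

&lt;p&gt;A "no" answer leaves the pool untouched in this sketch, which mirrors the text: a Kubernetes-only scope or a read-only requirement does not rule anything out by itself.&lt;/p&gt;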

&lt;p&gt;For depth on the action-safety question, see our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt; guide and &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation Complete Guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;Arvo expects the category to converge along three axes through the rest of 2026.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol convergence.&lt;/strong&gt; PagerDuty, Rootly, Aurora, HolmesGPT, Causely, and Edwin AI have all shipped MCP servers. MCP is on track to become table stakes by year-end, which means differentiation will shift to prompt graphs, RAG quality, and policy guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open benchmarking.&lt;/strong&gt; Resolve.ai and Rootly have published proprietary LLM benchmark posts, neither with a reproducible dataset. The first open, named benchmark with a public incident corpus is likely to become the reference point the rest of the category cites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing model fragmentation.&lt;/strong&gt; Per-seat (PagerDuty, Rootly, incident.io), per-investigation (Datadog Bits AI, NeuBird), per-credit (ServiceNow), per-cloud-host (Edwin AI), and free open source (Aurora, HolmesGPT, K8sGPT) coexist today. Expect convergence on a published reference cost per investigation as buyers compare more rigorously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Differentiation in this market is structural rather than feature-driven. Buyers who score against the capability matrix and apply the deployment, scope, and action questions usually land on a credible shortlist of two or three tools within a week. Buyers running feature-list comparisons can spend a quarter evaluating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/top-ai-sre-tools-2026" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Automated Post-Mortem Generation: The Complete Guide for SRE Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 13 May 2026 16:13:39 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/automated-post-mortem-generation-the-complete-guide-for-sre-teams-2026-55ck</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/automated-post-mortem-generation-the-complete-guide-for-sre-teams-2026-55ck</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated post-mortem generation is the process of producing an incident retrospective from artifacts already collected during the incident&lt;/strong&gt; — chat transcript, alert timeline, monitor data, and (in agentic systems) the investigation agent's own tool-call trace. The category is not a single technology; it's an output shared by three distinct architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the Postmortem Provenance Model (PPM).&lt;/strong&gt; Three source types: &lt;strong&gt;(1) chat-transcript postmortems&lt;/strong&gt; (Rootly, incident.io, FireHydrant) summarize what humans said in the channel; &lt;strong&gt;(2) observability-stitched postmortems&lt;/strong&gt; (Datadog Bits AI) summarize what monitors recorded; &lt;strong&gt;(3) agentic-investigation postmortems&lt;/strong&gt; (Aurora) compose from the agent's causal reasoning trace. The three artifacts answer different questions and are not interchangeable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The standards that anchor this work predate AI and are unchanged by it.&lt;/strong&gt; &lt;a href="https://sre.google/sre-book/postmortem-culture/" rel="noopener noreferrer"&gt;Google SRE Book Chapter 15: Postmortem Culture&lt;/a&gt; (Lunney and Lueder, 2016) and &lt;a href="https://www.etsy.com/codeascraft/blameless-postmortems" rel="noopener noreferrer"&gt;John Allspaw's "Blameless PostMortems and a Just Culture"&lt;/a&gt; (Etsy, May 2012) define what a postmortem is for. AI changes the authoring cost, not the purpose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The vendor landscape has been consolidating since late 2023.&lt;/strong&gt; PagerDuty acquired &lt;a href="https://www.pagerduty.com/newsroom/pagerduty-acquires-jeli/" rel="noopener noreferrer"&gt;Jeli in November 2023 for $29.7M&lt;/a&gt;; FireHydrant was acquired by Freshworks in December 2025; Squadcast was acquired by SolarWinds. ServiceNow's &lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;Now Assist SRE specialist&lt;/a&gt; (GA targeted June 2026) brings the largest ITSM vendor into the postmortem-generation lane.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source agentic-investigation postmortems are a small lane.&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0) generates postmortems from its own investigation agent's reasoning chain and exports to Confluence Cloud (OAuth) or Server / Data Center (PAT), with customizable per-org templates and version history.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A good postmortem outlives the incident. &lt;strong&gt;An automated post-mortem is an incident retrospective whose narrative, timeline, root cause, contributing factors, and action items are drafted by software rather than by hand — typically a large language model, sometimes a tool-using agent, always built on artifacts already collected during the incident.&lt;/strong&gt; This guide is for SRE, platform, and incident-management leaders deciding which automated-postmortem architecture matches their team's working style — not which vendor logo to add to their stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why automation, and why now
&lt;/h2&gt;

&lt;p&gt;Most teams write postmortems by hand. Most postmortems are late, short, and read by no one. The reason is unsentimental: writing a good postmortem takes hours of reconstruction work, on top of an incident that has already drained the on-call's day. A literature survey of practitioner posts converges on a figure of 4–8 hours per postmortem of moderate complexity, most of it spent in Slack, dashboards, and ticket trails trying to reassemble the timeline.&lt;/p&gt;

&lt;p&gt;The market response since 2023 has been a wave of automated-postmortem features: &lt;a href="https://rootly.com/retrospectives" rel="noopener noreferrer"&gt;Rootly AI Copilot&lt;/a&gt;, &lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; Scribe and AI summaries, &lt;a href="https://firehydrant.com/ai/" rel="noopener noreferrer"&gt;FireHydrant AI-Drafted Retrospectives&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/blog/create-postmortems-with-datadog/" rel="noopener noreferrer"&gt;Datadog Bits AI postmortem variables&lt;/a&gt;, and &lt;a href="https://support.pagerduty.com/main/docs/scribe-agent" rel="noopener noreferrer"&gt;PagerDuty Scribe Agent&lt;/a&gt;. The pitch is similar across them: 90 minutes of human reconstruction collapses to 15 minutes of human review.&lt;/p&gt;

&lt;p&gt;The honest framing is that these tools do real work, but most of them are summarizing artifacts that already exist. &lt;strong&gt;They are not investigating; they are transcribing.&lt;/strong&gt; That's enough for many teams, especially those whose incidents are well-captured in their incident-channel chatter. It is not enough for teams whose incidents require deep investigation across systems — and that gap is what the agentic-investigation category is starting to fill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Postmortem Provenance Model (PPM)
&lt;/h2&gt;

&lt;p&gt;The three architectures differ in what they read from, not in what they produce. Same sections, different evidence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source type&lt;/th&gt;
&lt;th&gt;Reads from&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat-transcript&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack / Teams / Zoom channel for the incident window; on-call chatter; status updates&lt;/td&gt;
&lt;td&gt;Captures human narrative, decisions, and judgment calls verbatim&lt;/td&gt;
&lt;td&gt;Inherits human errors and gaps; weak on infrastructure facts the channel didn't surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability-stitched&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitor events, alert timeline, dashboards, deployment history&lt;/td&gt;
&lt;td&gt;Strong factual timeline, embedded graphs and logs&lt;/td&gt;
&lt;td&gt;Misses human context; weak on contributing factors that aren't in telemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic-investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The investigation agent's tool-call trace, reasoning chain, evidence collected mid-incident&lt;/td&gt;
&lt;td&gt;Causal record of what the system did and what the agent found&lt;/td&gt;
&lt;td&gt;Requires running an investigation agent in the first place; quality depends on the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A team's choice should match its incident profile. If most incidents resolve in chat with little investigation needed, a chat-transcript tool is fine. If incidents are surfaced and resolved entirely in your observability stack, an observability-stitched approach gives you tight monitor-to-postmortem fidelity. If your incidents require traversing AWS, GCP, Kubernetes, and your own services to find the cause, an agentic-investigation postmortem is the only artifact that records the work the agent actually did.&lt;/p&gt;
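
&lt;p&gt;The matching rule in the previous paragraph can be written down as a small decision function. This is a sketch under stated assumptions, not a prescribed procedure: the &lt;code&gt;Source&lt;/code&gt; enum and &lt;code&gt;recommend&lt;/code&gt; function are hypothetical names, and checking the cross-system case first is an assumed tie-break for teams whose incidents fit more than one profile.&lt;/p&gt;

```python
from enum import Enum


class Source(Enum):
    """The three source types of the Postmortem Provenance Model."""
    CHAT_TRANSCRIPT = "chat-transcript"
    OBSERVABILITY_STITCHED = "observability-stitched"
    AGENTIC_INVESTIGATION = "agentic-investigation"


def recommend(resolved_in_chat, resolved_in_observability, crosses_systems):
    """Map an incident profile to a PPM source type.

    Assumption: when several flags are true, the most demanding profile wins,
    so cross-system investigation is checked first.
    """
    if crosses_systems:
        return Source.AGENTIC_INVESTIGATION
    if resolved_in_observability:
        return Source.OBSERVABILITY_STITCHED
    if resolved_in_chat:
        return Source.CHAT_TRANSCRIPT
    return None  # profile matches none of the three cases described in the text


print(recommend(True, False, False).value)
print(recommend(False, True, False).value)
print(recommend(True, True, True).value)
```

&lt;p&gt;In practice a team would answer these questions from a sample of recent incidents rather than from intuition, since the dominant profile decides which artifact is worth generating.&lt;/p&gt;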

&lt;h2&gt;
  
  
  Standards: what a postmortem is for
&lt;/h2&gt;

&lt;p&gt;It is worth grounding the conversation in what postmortems were designed to do before LLMs existed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://sre.google/sre-book/postmortem-culture/" rel="noopener noreferrer"&gt;Google SRE Book, Chapter 15 — Postmortem Culture: Learning from Failure&lt;/a&gt;&lt;/strong&gt; by John Lunney and Sue Lueder (O'Reilly, 2017). The canonical text on blameless postmortems as organizational learning. The companion &lt;a href="https://sre.google/workbook/postmortem-culture/" rel="noopener noreferrer"&gt;SRE Workbook Chapter 10&lt;/a&gt; updates the practical guidance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.etsy.com/codeascraft/blameless-postmortems" rel="noopener noreferrer"&gt;John Allspaw — Blameless PostMortems and a Just Culture&lt;/a&gt;&lt;/strong&gt; (Etsy Code as Craft, May 22, 2012). The earlier articulation of why blameless-ness is operationally load-bearing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usenix.org/system/files/login/articles/login_spring17_09_lunney.pdf" rel="noopener noreferrer"&gt;Lunney — Postmortem Action Items&lt;/a&gt;&lt;/strong&gt; (USENIX ;login: Spring 2017). The honest practitioner read on why most postmortems' action items never get done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://postmortems.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty's open-source Postmortem documentation&lt;/a&gt;&lt;/strong&gt; (Apache 2.0, &lt;a href="https://github.com/PagerDuty/postmortem-docs" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;). Includes a maintained &lt;a href="https://github.com/PagerDuty/postmortem-docs/blob/master/docs/resources/post_mortem_template.md" rel="noopener noreferrer"&gt;postmortem template&lt;/a&gt; used as a baseline by many teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.thevoid.community/" rel="noopener noreferrer"&gt;Verica Open Incident Database (VOID)&lt;/a&gt;&lt;/strong&gt;. The 2nd Annual VOID Report (December 2022) catalogs approximately 10,000 incidents from 600+ organizations; its central finding is that MTTR is statistically unreliable as a cross-organization comparison and that only ~25% of public incident reports clearly identify a root cause. A useful corrective to the "we reduced MTTR by X%" claims that pepper vendor marketing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/danluu/post-mortems" rel="noopener noreferrer"&gt;Dan Luu's curated postmortems collection&lt;/a&gt;&lt;/strong&gt;. The widest public corpus of real postmortems; useful as RAG fuel for any AI postmortem system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A blameless, learning-oriented postmortem is the goal. Automation changes the authoring cost; it does not relax the standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets auto-generated today
&lt;/h2&gt;

&lt;p&gt;A typical automated postmortem tool in 2026 drafts some subset of the following sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; — one paragraph, the executive read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt; — chronological events with timestamps (often HH:MM UTC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt; — customer-facing effect, services affected, error budget burn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause&lt;/strong&gt; — the technical fault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributing factors&lt;/strong&gt; — human, process, and organizational conditions that allowed the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt; — what stopped the bleeding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action items&lt;/strong&gt; — owners, due dates, follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lessons learned&lt;/strong&gt; — what the team would do differently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different products auto-draft different subsets. The "Lessons Learned" section is the clearest example: incident.io leaves it to humans by design, while FireHydrant, Datadog, and Aurora draft it for review. It is the section where judgment is most consequential.&lt;/p&gt;
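
&lt;p&gt;The eight sections listed above map naturally onto a small data structure. The following is a minimal sketch, not any vendor's schema: the class names, field names, and the &lt;code&gt;missing_sections&lt;/code&gt; helper are assumptions made for illustration. It shows one way to check which subset of sections a given draft actually filled in.&lt;/p&gt;

```python
from dataclasses import dataclass, field, fields


@dataclass
class ActionItem:
    description: str
    owner: str = ""      # empty string means nobody owns it yet
    due_date: str = ""   # ISO date such as "2026-06-01"; empty if unset


@dataclass
class Postmortem:
    # Hypothetical field names mirroring the eight sections in the list above.
    summary: str = ""
    timeline: list[tuple[str, str]] = field(default_factory=list)  # ("HH:MM", event)
    impact: str = ""
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    resolution: str = ""
    action_items: list[ActionItem] = field(default_factory=list)
    lessons_learned: str = ""


def missing_sections(pm: Postmortem) -> list[str]:
    """Return the names of sections the draft left empty."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]


draft = Postmortem(
    summary="Checkout API returned 5xx for 14 minutes.",
    timeline=[("10:02", "Error-rate alert fired"), ("10:16", "Rollback completed")],
    root_cause="Connection pool exhausted after a config change.",
)
print(missing_sections(draft))
```

&lt;p&gt;Running a check like this against sample output from each candidate tool makes the "different subsets" point concrete before a purchase decision.&lt;/p&gt;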

&lt;h2&gt;
  
  
  The tooling landscape
&lt;/h2&gt;

&lt;p&gt;Concrete vendor positioning as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;License / hosting&lt;/th&gt;
&lt;th&gt;What it auto-generates&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://rootly.com/retrospectives" rel="noopener noreferrer"&gt;Rootly AI Copilot&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Narrative summary, timeline, action items, root cause, embedded Datadog charts; meeting-bot transcription&lt;/td&gt;
&lt;td&gt;Headline claim: 90 min → 15 min review. Exports to Confluence, Google Docs, Notion, Slack.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io AI postmortems&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Summary, timeline, contributing factors, suggested follow-ups; Scribe transcribes call audio&lt;/td&gt;
&lt;td&gt;"Lessons Learned" is left to humans by design. Exports to Confluence, Notion, Google Docs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://firehydrant.com/ai/" rel="noopener noreferrer"&gt;FireHydrant AI-Drafted Retrospectives&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Description, customer impact, lessons learned; Copilot compares ongoing incident to past incidents&lt;/td&gt;
&lt;td&gt;Acquired by Freshworks December 2025; AI features are Enterprise tier only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.datadoghq.com/blog/create-postmortems-with-datadog/" rel="noopener noreferrer"&gt;Datadog Bits AI postmortems&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Summary, customer impact, lessons learned variables; dynamic embedded graphs and logs&lt;/td&gt;
&lt;td&gt;Exports to Datadog Notebooks, Confluence, or Google Drive.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://support.pagerduty.com/main/docs/scribe-agent" rel="noopener noreferrer"&gt;PagerDuty Scribe Agent&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Real-time call transcription and timeline contributions to PagerDuty's Postmortems product&lt;/td&gt;
&lt;td&gt;Part of PagerDuty's Spring 2026 agent suite (SRE Agent, Scribe Agent, Insights Agent).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0, self-hosted&lt;/td&gt;
&lt;td&gt;Summary, timeline (HH:MM UTC), root cause, impact, contributing factors, resolution, action items, lessons learned; generated from the investigation agent's reasoning trace&lt;/td&gt;
&lt;td&gt;Per-org template overrides; Confluence Cloud (OAuth) and Server / Data Center (PAT) export.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-brings-Autonomous-Workforce-to-every-major-business-function/default.aspx" rel="noopener noreferrer"&gt;ServiceNow Now Assist SRE specialist&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;Triage + postmortem documentation end to end&lt;/td&gt;
&lt;td&gt;GA targeted June 2026 (Knowledge 2026 announcement).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.squadcast.com/product/postmortems" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed, SaaS&lt;/td&gt;
&lt;td&gt;One-click postmortem, webhook automation, templates&lt;/td&gt;
&lt;td&gt;Acquired by SolarWinds.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: the SaaS incident-management (IM) vendors all do chat-transcript postmortems well; Datadog owns the observability-stitched lane; Aurora is the open-source agentic-investigation option. ServiceNow's planned June 2026 GA would bring the largest ITSM vendor into the category as a fourth meaningful entrant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: how agentic-investigation postmortems work
&lt;/h2&gt;

&lt;p&gt;This category is worth describing in detail because it is the one least visible to most buyers.&lt;/p&gt;

&lt;p&gt;In a chat-transcript postmortem system, the flow is: incident channel → LLM with a postmortem template prompt → draft document. In an observability-stitched postmortem system, the flow is: incident timeline + dashboards → LLM with embedding variables → draft document with live charts.&lt;/p&gt;

&lt;p&gt;An agentic-investigation postmortem starts earlier — at the &lt;em&gt;investigation&lt;/em&gt;. The pattern, using Aurora as the concrete open-source example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert webhook arrives.&lt;/strong&gt; PagerDuty, Datadog, Grafana, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, New Relic, Opsgenie, or incident.io fires. The provider-specific RCA-prompt builder constructs the agent's first message, including alert metadata, severity, service, and environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation runs.&lt;/strong&gt; Aurora's ReAct-style LangGraph agent calls tools across the next 3–15 minutes — &lt;code&gt;kubectl&lt;/code&gt;, cloud CLIs, knowledge-base search, Terraform read, Confluence search — and accumulates a transcript of tool calls, tool results, and reasoning steps. The result is persisted as the incident's &lt;code&gt;aurora_summary&lt;/code&gt; — the agent's RCA narrative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem dispatch.&lt;/strong&gt; Once the incident is resolved, a postmortem agent run is dispatched, either manually via Aurora's "Run Action" dropdown on completed incidents or automatically via an &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; on-incident-completion trigger. The run receives the agent's RCA summary as load-bearing context. The postmortem agent re-reads the original investigation output, optionally pulls Slack channel context for the incident window, and composes the postmortem under a per-org template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage and versioning.&lt;/strong&gt; Drafts are stored in PostgreSQL with version history. Engineers can edit; subsequent regenerations preserve human edits as a separate version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence export.&lt;/strong&gt; The user clicks Export. Aurora pushes the rendered postmortem to Confluence Cloud (OAuth) or Server / Data Center (PAT), creating a page under a configured space and parent. Export is currently user-triggered rather than automatic, which preserves the human review step before publication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The structural difference from chat-transcript postmortems is what evidence the LLM gets. A chat-transcript system can only describe what humans typed. An agentic-investigation system describes what the agent did, which tools it ran, what the cloud responded with, and how it reasoned through to the root cause. The artifact carries the actual causal trail, not a social reconstruction of it.&lt;/p&gt;
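
&lt;p&gt;The storage-and-versioning step is the easiest one to get subtly wrong, so a sketch may help. The code below is an assumption-laden illustration of an append-only draft history, not Aurora's actual PostgreSQL schema: &lt;code&gt;Version&lt;/code&gt;, &lt;code&gt;PostmortemHistory&lt;/code&gt;, and the "agent" / "human" author labels are all hypothetical names. The property it demonstrates is that a regeneration adds a version and never overwrites a human edit.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Version:
    number: int
    author: str   # "agent" or "human" (hypothetical labels)
    body: str


@dataclass
class PostmortemHistory:
    """Append-only draft history; an in-memory stand-in for a database table."""
    versions: list[Version] = field(default_factory=list)

    def save(self, author: str, body: str) -> Version:
        # Every save appends a new version; nothing is overwritten in place.
        version = Version(len(self.versions) + 1, author, body)
        self.versions.append(version)
        return version

    def latest(self) -> Version:
        return self.versions[-1]

    def latest_by(self, author: str) -> Optional[Version]:
        # Walk backwards to find the newest version written by this author.
        for version in reversed(self.versions):
            if version.author == author:
                return version
        return None


history = PostmortemHistory()
history.save("agent", "Draft generated from the RCA summary")
history.save("human", "Draft with corrected impact section")
history.save("agent", "Regenerated draft")

# The regeneration is the newest version, but the human edit is still retrievable.
print(history.latest().number, history.latest_by("human").body)
```

&lt;p&gt;When evaluating a tool, the equivalent test is to edit a draft, regenerate it, and confirm the edited version can still be retrieved.&lt;/p&gt;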

&lt;h2&gt;
  
  
  How to evaluate an automated postmortem tool
&lt;/h2&gt;

&lt;p&gt;A rubric you can run on any vendor — open source or commercial.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provenance match.&lt;/strong&gt; Does the tool's source-of-truth match how your team actually runs incidents? Chat-heavy team → chat-transcript. Observability-heavy team → Datadog or equivalent. Investigation-heavy team → agentic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template control.&lt;/strong&gt; Can you replace the vendor's template with your team's? Per-team templates? Aurora supports per-org template overrides via its &lt;code&gt;actions&lt;/code&gt; configuration table; vendor SaaS varies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export target.&lt;/strong&gt; Confluence Cloud, Server / Data Center, Notion, Google Docs, internal wiki. Match your team's documentation home. Aurora supports Confluence (both flavors); the SaaS vendors support different combinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edit lineage.&lt;/strong&gt; When the AI draft is edited, regenerated, and edited again, what survives? Test this explicitly with three round trips. Aurora preserves version history; check each candidate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action-item ownership.&lt;/strong&gt; Does the tool extract action items with owners and due dates, or just bullet points? The Lunney USENIX piece is blunt about why this matters: action items without owners do not get done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded evidence.&lt;/strong&gt; Are graphs, logs, and resource identifiers embedded inline or linked? Embedded survives the documentation system; linked rots over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and privacy.&lt;/strong&gt; Where does the postmortem text get processed? Self-hosted with bring-your-own-LLM (Aurora) keeps incident data on your infrastructure; SaaS vendors vary in how they handle this and your security team will want to know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standards alignment.&lt;/strong&gt; Does the generated artifact match the blameless tradition (Allspaw, Lunney, the SRE Book) or accidentally drift into individual blame? Check the prompt if you can; otherwise inspect a sample.&lt;/li&gt;
&lt;/ol&gt;
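
&lt;p&gt;One way to keep the rubric honest across several vendors is to score each criterion and weight the ones that matter most to your team. The sketch below is illustrative only: the criterion keys, the 0 to 5 rating scale, and the example weights are assumptions, not part of any published method.&lt;/p&gt;

```python
# Hypothetical keys, one per rubric item above.
CRITERIA = [
    "provenance_match", "template_control", "export_target", "edit_lineage",
    "action_item_ownership", "embedded_evidence", "cost_and_privacy",
    "standards_alignment",
]


def score_tool(ratings: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted mean of 0-5 ratings; criteria without a weight count as 1.0."""
    unrated = [c for c in CRITERIA if c not in ratings]
    if unrated:
        raise ValueError(f"unrated criteria: {unrated}")
    total_weight = sum(weights.get(c, 1.0) for c in CRITERIA)
    weighted_sum = sum(ratings[c] * weights.get(c, 1.0) for c in CRITERIA)
    return round(weighted_sum / total_weight, 2)


# Example: a team that cares most about provenance match and edit lineage.
weights = {"provenance_match": 3.0, "edit_lineage": 2.0}
ratings = {c: 3 for c in CRITERIA}
ratings["provenance_match"] = 5
ratings["edit_lineage"] = 4
print(score_tool(ratings, weights))  # 41 / 11, rounded: 3.73
```

&lt;p&gt;The number itself matters less than the discipline: refusing to score a tool until every criterion has a rating forces the edit-lineage and privacy checks to actually happen.&lt;/p&gt;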

&lt;h2&gt;
  
  
  How to roll out automation without breaking culture
&lt;/h2&gt;

&lt;p&gt;A six-step adoption plan that respects the standards while saving time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the easiest 30%&lt;/strong&gt; — short-impact incidents with mostly-chat investigations. These produce passable AI drafts on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep humans on lessons learned.&lt;/strong&gt; Even tools that auto-generate the "Lessons Learned" section ship it as a draft to be aggressively rewritten. The judgment in that section is the point of the postmortem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require human edit before publish.&lt;/strong&gt; The on-call engineer who ran the incident should always be the one who clicks "Publish." This is the cultural firewall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track action-item completion separately.&lt;/strong&gt; Postmortem action items have a well-documented completion gap (the Lunney USENIX piece covers it), and AI drafting does not close that gap. Add a weekly review of last week's postmortem action items, with owners called out by name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a quarterly audit of the generated postmortems.&lt;/strong&gt; Pick five at random; have a senior engineer read them critically. Look for drift toward individual blame, missed contributing factors, and surface-level root causes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tighten the loop with the investigation tool.&lt;/strong&gt; If your investigation tool and postmortem tool are the same product (Aurora, eventually Resolve.ai-class systems), the postmortem inherits the investigation's evidence chain. That gives the draft the most complete evidence trail of the three approaches, but it requires running an agentic investigation in the first place.&lt;/li&gt;
&lt;/ol&gt;
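
&lt;p&gt;The quarterly audit in step 5 works best when the sample is drawn mechanically rather than hand-picked. A minimal sketch, assuming postmortems carry string identifiers (the &lt;code&gt;PM-001&lt;/code&gt; format and the function name are hypothetical):&lt;/p&gt;

```python
import random


def pick_audit_sample(postmortem_ids: list[str], k: int = 5, seed: int = 0) -> list[str]:
    """Pick up to k postmortems for review, reproducibly for a given seed."""
    rng = random.Random(seed)
    pool = sorted(set(postmortem_ids))  # stable order so the seed is meaningful
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical identifiers for one quarter's generated postmortems.
quarter = [f"PM-{n:03d}" for n in range(1, 24)]
print(pick_audit_sample(quarter, seed=2026))
```

&lt;p&gt;Recording the seed alongside the audit notes lets anyone reproduce which five documents were reviewed.&lt;/p&gt;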

&lt;h2&gt;
  
  
  What can go wrong
&lt;/h2&gt;

&lt;p&gt;A short failure-mode list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surface-level root cause.&lt;/strong&gt; AI drafts read confidently while attributing a deep system issue to its most visible symptom. The cure is human review by someone who was in the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated timeline.&lt;/strong&gt; LLM invents events, misattributes timestamps, or doubles up on entries. Most common when the input artifact (chat transcript or telemetry) has gaps the model patches over.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blame drift.&lt;/strong&gt; AI summary slips into individual-blame framing because the human chat did. The blameless tradition exists exactly for this reason; the AI does not enforce it on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action items without ownership.&lt;/strong&gt; A bullet list of "should do X" with no owner is not an action item; it is decoration. Treat ownerless action items as a failure of the tool's prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edit loss on regeneration.&lt;/strong&gt; Some tools overwrite human edits when the user clicks "Regenerate." Verify that version history is preserved before trusting the tool for a quarter's worth of postmortems.&lt;/li&gt;
&lt;/ul&gt;
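
&lt;p&gt;Some of these failure modes leave mechanical traces that a pre-publish check can catch. The sketch below is a hypothetical lint pass, not a feature of any tool named here. It assumes timeline entries are zero-padded &lt;code&gt;HH:MM&lt;/code&gt; strings within a single day and that action items are dictionaries with &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, and &lt;code&gt;due&lt;/code&gt; keys. It flags doubled-up entries, out-of-order timestamps, and ownerless action items. It cannot detect an invented event or a shallow root cause, which still need a human who was in the incident.&lt;/p&gt;

```python
from collections import Counter


def lint_draft(timeline: list[tuple[str, str]], action_items: list[dict]) -> list[str]:
    """Return human-readable findings; an empty list means no mechanical issues."""
    findings = []

    # Doubled-up entries: the same (time, event) pair appearing more than once.
    for (time, event), count in Counter(timeline).items():
        if count > 1:
            findings.append(f"duplicate timeline entry: {time} {event}")

    # Zero-padded "HH:MM" strings sort chronologically within a single day.
    times = [time for time, _ in timeline]
    if times != sorted(times):
        findings.append("timeline is not in chronological order")

    # Ownerless or undated action items are flagged rather than silently kept.
    for item in action_items:
        if not item.get("owner"):
            findings.append(f"action item has no owner: {item['description']}")
        if not item.get("due"):
            findings.append(f"action item has no due date: {item['description']}")
    return findings


timeline = [
    ("10:02", "Alert fired"),
    ("10:02", "Alert fired"),
    ("10:16", "Rollback completed"),
    ("10:09", "On-call paged"),
]
action_items = [
    {"description": "Add pool-size alert", "owner": "", "due": "2026-06-01"},
    {"description": "Document rollback", "owner": "dana", "due": "2026-06-05"},
]
for finding in lint_draft(timeline, action_items):
    print(finding)
```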

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is the open-source agentic-investigation entry in this category. Apache 2.0, self-hosted via Docker Compose or Helm. Postmortems are generated from the same agent that ran the investigation, with per-org template control, version history, Slack context backfill, and export to Confluence Cloud or Server / Data Center. If your incidents look like chat-resolved coordination work, you probably don't need Aurora's postmortem layer specifically. If your incidents look like deep cross-cloud investigation work, you probably do.&lt;/p&gt;

&lt;p&gt;For more on how Aurora's investigation half works, see our &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation guide&lt;/a&gt;. For how Aurora's automation primitive (Aurora Actions) lets you chain postmortem generation onto every incident automatically, see the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions launch post&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related guides:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/ai-powered-incident-investigation" rel="noopener noreferrer"&gt;AI-Powered Incident Investigation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;Root Cause Analysis: Complete Guide for SREs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 13 May 2026 16:12:09 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-powered-incident-investigation-the-complete-guide-for-sre-teams-2026-4hl0</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-powered-incident-investigation-the-complete-guide-for-sre-teams-2026-4hl0</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps&lt;/strong&gt; — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs &lt;code&gt;kubectl&lt;/code&gt;, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We propose the AI Investigation Capability Ladder (AICL).&lt;/strong&gt; Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNCF now hosts two open-source agentic projects in this lane.&lt;/strong&gt; &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt; &lt;a href="https://www.cncf.io/blog/2026/01/07/holmesgpt-agentic-troubleshooting-built-for-the-cloud-native-era/" rel="noopener noreferrer"&gt;entered the CNCF Sandbox in October 2025&lt;/a&gt;. &lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; has been Sandbox since December 19, 2023. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DORA now reports recovery time as Failed Deployment Recovery Time (FDRT).&lt;/strong&gt; Per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's metrics history&lt;/a&gt;, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous. The &lt;a href="https://dora.dev/research/2024/dora-report/2024-dora-accelerate-state-of-devops-report.pdf" rel="noopener noreferrer"&gt;2024 DORA report PDF&lt;/a&gt; added "deployment rework rate" as a fifth core measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The closed-source peer set is well-funded.&lt;/strong&gt; &lt;a href="https://techcrunch.com/2026/02/04/ai-sre-resolve-ai-confirms-125m-raise-unicorn-valuation/" rel="noopener noreferrer"&gt;Resolve.ai raised $125M at a $1B valuation in February 2026&lt;/a&gt;. &lt;a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt; reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cloud incidents in 2026 surface faster than humans can investigate them. &lt;strong&gt;AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis.&lt;/strong&gt; Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs &lt;code&gt;kubectl describe&lt;/code&gt;, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.&lt;/p&gt;

&lt;p&gt;This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "investigation" means here
&lt;/h2&gt;

&lt;p&gt;Three things blur together when people say "AI incident response":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert correlation&lt;/strong&gt; — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation&lt;/strong&gt; — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic investigation&lt;/strong&gt; — an LLM that runs &lt;em&gt;new&lt;/em&gt; tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run &lt;code&gt;kubectl&lt;/code&gt; will be even more disappointed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Investigation Capability Ladder (AICL)
&lt;/h2&gt;

&lt;p&gt;Six tiers of increasing autonomy. Pick the highest tier you can defend operationally: climbing a tier is mostly an engineering problem, while stopping short leaves the gap to human process.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What runs&lt;/th&gt;
&lt;th&gt;Human role&lt;/th&gt;
&lt;th&gt;Representative tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0 — Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer reads alerts, runs &lt;code&gt;kubectl&lt;/code&gt; and cloud CLIs by hand&lt;/td&gt;
&lt;td&gt;Everything&lt;/td&gt;
&lt;td&gt;PagerDuty, Slack, Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Alert correlation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML correlator clusters and dedupes events&lt;/td&gt;
&lt;td&gt;Triage from a smaller list&lt;/td&gt;
&lt;td&gt;PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — LLM-summarized timeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM summarizes an event stream into prose&lt;/td&gt;
&lt;td&gt;Reads summary instead of raw events&lt;/td&gt;
&lt;td&gt;Datadog Bits AI summaries, incident.io Scribe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Single-shot LLM diagnosis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM produces an RCA from one prompt over alert + telemetry&lt;/td&gt;
&lt;td&gt;Trusts a single inference&lt;/td&gt;
&lt;td&gt;K8sGPT analyzers, vendor "AI insights" buttons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Agentic multi-step investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM agent calls many tools across multiple turns, replans as findings arrive&lt;/td&gt;
&lt;td&gt;Reviews trace, ships fix&lt;/td&gt;
&lt;td&gt;Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L5 — Closed-loop investigate + remediate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent investigates and proposes (or applies, with approval) a fix&lt;/td&gt;
&lt;td&gt;Approves remediation&lt;/td&gt;
&lt;td&gt;Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest framing: &lt;strong&gt;most teams are L0 or L1 today.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.&lt;/p&gt;
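
&lt;p&gt;For teams that want to record their current tier in tooling or config, the ladder can be encoded as an ordered enum. This is a sketch under assumptions: the member names and both helper functions are hypothetical, and the cut-off at L4 follows the ladder's own description of which tiers issue tool calls.&lt;/p&gt;

```python
from enum import IntEnum


class AICL(IntEnum):
    L0_MANUAL = 0
    L1_ALERT_CORRELATION = 1
    L2_SUMMARIZED_TIMELINE = 2
    L3_SINGLE_SHOT_DIAGNOSIS = 3
    L4_AGENTIC_INVESTIGATION = 4
    L5_CLOSED_LOOP = 5


def gathers_new_evidence(tier: AICL) -> bool:
    # Only the agentic tiers issue fresh tool calls during the incident,
    # which is also why they need infrastructure permissions.
    return tier >= AICL.L4_AGENTIC_INVESTIGATION


def needs_remediation_approval(tier: AICL) -> bool:
    # Only the closed-loop tier proposes or applies fixes.
    return tier == AICL.L5_CLOSED_LOOP


for tier in AICL:
    print(tier.name, gathers_new_evidence(tier), needs_remediation_approval(tier))
```

&lt;p&gt;Because &lt;code&gt;IntEnum&lt;/code&gt; members compare as integers, a policy such as "security review required at L4 and above" becomes a one-line check.&lt;/p&gt;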

&lt;h2&gt;
  
  
  Traditional AIOps vs agentic investigation
&lt;/h2&gt;

&lt;p&gt;Both are useful; they cover non-overlapping work.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Traditional AIOps (L1)&lt;/th&gt;
&lt;th&gt;Agentic investigation (L4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event stream, telemetry already ingested&lt;/td&gt;
&lt;td&gt;Same, plus live tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ranked suspects, correlated incidents&lt;/td&gt;
&lt;td&gt;RCA narrative, evidence chain, suggested fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New evidence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — operates on what's already in the system&lt;/td&gt;
&lt;td&gt;Yes — agent issues new commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML clustering / topology distance scoring&lt;/td&gt;
&lt;td&gt;LLM step-by-step (ReAct or similar)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Why it can be wrong&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Missing event, weak topology graph&lt;/td&gt;
&lt;td&gt;Hallucination, tool misuse, prompt drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-event or per-host&lt;/td&gt;
&lt;td&gt;Per LLM token + tool runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quiet — wrong cluster, you don't know&lt;/td&gt;
&lt;td&gt;Loud — agent's trace is human-readable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Evidence of this stacking pattern is showing up across vendors in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's &lt;a href="https://www.splunk.com/en_us/products/it-service-intelligence.html" rel="noopener noreferrer"&gt;ITSI Episode Summarization&lt;/a&gt; (announced at .conf25, September 2025) layers an LLM summary on top of its KPI engine.&lt;/p&gt;
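
&lt;p&gt;The "New evidence" and "Reasoning" rows in the table above are easier to see in code. The following is a toy ReAct-style loop, offered as a sketch rather than any product's implementation: the two tools return canned strings instead of calling real infrastructure, and &lt;code&gt;scripted_policy&lt;/code&gt; is a hard-coded stand-in for the LLM. All names and outputs are hypothetical. What it shows is the shape of the loop: choose an action, run a tool, append the observation, and decide again.&lt;/p&gt;

```python
from typing import Callable


# Stub tools standing in for real infrastructure calls (canned, hypothetical output).
def describe_pod(name: str) -> str:
    return f"pod {name}: CrashLoopBackOff, last exit reason OOMKilled"


def recent_deploys(service: str) -> str:
    return f"{service}: memory limit lowered from 512Mi to 128Mi 20 minutes ago"


TOOLS: dict[str, Callable[[str], str]] = {
    "describe_pod": describe_pod,
    "recent_deploys": recent_deploys,
}


def scripted_policy(evidence: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: picks the next action from the evidence so far."""
    if not evidence:
        return ("describe_pod", "checkout-7d9f")
    if len(evidence) == 1 and "OOMKilled" in evidence[-1]:
        return ("recent_deploys", "checkout")
    return ("finish", "Memory limit change caused OOM kills in checkout")


def investigate(policy, max_steps: int = 5) -> tuple[str, list[str]]:
    evidence: list[str] = []
    for _ in range(max_steps):
        action, argument = policy(evidence)
        if action == "finish":
            return argument, evidence
        # Run the chosen tool and feed the observation into the next decision.
        evidence.append(TOOLS[action](argument))
    return "inconclusive: step budget exhausted", evidence


conclusion, trace = investigate(scripted_policy)
print(conclusion)
for step in trace:
    print(" -", step)
```

&lt;p&gt;The returned trace is the human-readable record referred to in the "Failure mode" row: a reviewer can see which tool output led to which conclusion.&lt;/p&gt;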

&lt;h2&gt;
  
  
  The agentic peer set in 2026
&lt;/h2&gt;

&lt;p&gt;This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations&lt;/td&gt;
&lt;td&gt;LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets&lt;/td&gt;
&lt;td&gt;Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.cncf.io/projects/k8sgpt/" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Kubernetes resource diagnostics&lt;/td&gt;
&lt;td&gt;CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cleric.ai/" rel="noopener noreferrer"&gt;Cleric.ai&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Slack-first AI SRE&lt;/td&gt;
&lt;td&gt;Gartner Cool Vendor 2025. Integrates Datadog and Grafana.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://resolve.ai/" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Multi-cloud AI SRE&lt;/td&gt;
&lt;td&gt;$125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://traversal.ai/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;"Causal search engine" for production systems&lt;/td&gt;
&lt;td&gt;$48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://neubird.ai/" rel="noopener noreferrer"&gt;Neubird Hawkeye&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Llama 3.2 70B fine-tuned + ChromaDB RAG&lt;/td&gt;
&lt;td&gt;SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://www.causely.ai/" rel="noopener noreferrer"&gt;Causely&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;Causal-graph reasoner for Kubernetes&lt;/td&gt;
&lt;td&gt;Gartner Cool Vendor 2025. MCP server. Gemini-powered.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://ciroos.ai/" rel="noopener noreferrer"&gt;Ciroos.AI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;td&gt;"SRE Teammate" multi-agent&lt;/td&gt;
&lt;td&gt;MCP and A2A architecture.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is a strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.&lt;/p&gt;

&lt;p&gt;For a deeper open-source-only comparison, see our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: what makes investigation "agentic"?
&lt;/h2&gt;

&lt;p&gt;Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A tool-calling loop (ReAct or similar)
&lt;/h3&gt;

&lt;p&gt;The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the &lt;strong&gt;ReAct pattern&lt;/strong&gt; (Reason + Act, &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;Yao et al. 2022&lt;/a&gt;). Aurora's implementation is a single-node LangGraph workflow wrapping &lt;code&gt;langchain.agents.create_agent&lt;/code&gt;; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is an implementation detail; what matters is multi-turn tool use, where each call depends on the result of the previous one.&lt;/p&gt;
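&lt;p&gt;To make the loop concrete, here is a minimal sketch of the observe-decide-act cycle in plain Python. It is not Aurora's or HolmesGPT's code: the two tools, the &lt;code&gt;stub_policy&lt;/code&gt; function standing in for the LLM, and the &lt;code&gt;checkout&lt;/code&gt; service are all hypothetical, chosen only to show each tool call depending on the previous result.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical read-only tools; a real agent would wrap kubectl, cloud APIs, log queries.
def get_pod_status(name: str) -> str:
    return "CrashLoopBackOff" if name == "checkout" else "Running"

def get_recent_logs(name: str) -> str:
    return "OOMKilled: memory limit 256Mi exceeded"

TOOLS: dict[str, Callable[[str], str]] = {
    "get_pod_status": get_pod_status,
    "get_recent_logs": get_recent_logs,
}

def stub_policy(observations: list[tuple[str, str]]) -> dict:
    """Stand-in for the LLM: picks the next step from what has been seen so far."""
    if not observations:
        return {"action": "get_pod_status", "input": "checkout"}
    last_tool, last_result = observations[-1]
    if last_tool == "get_pod_status" and last_result != "Running":
        return {"action": "get_recent_logs", "input": "checkout"}
    return {"action": "finish", "input": f"Root cause candidate: {last_result}"}

def investigate(policy, max_steps: int = 5) -> tuple[str, list[tuple[str, str]]]:
    """Run the loop until the policy finishes or the step budget runs out."""
    observations: list[tuple[str, str]] = []
    for _ in range(max_steps):
        step = policy(observations)
        if step["action"] == "finish":
            return step["input"], observations
        result = TOOLS[step["action"]](step["input"])
        observations.append((step["action"], result))
    return "No conclusion within step budget", observations

answer, trace = investigate(stub_policy)
print(answer)
for tool, result in trace:
    print(f"{tool} -> {result}")
```

&lt;p&gt;The &lt;code&gt;max_steps&lt;/code&gt; budget is the part worth copying into any real loop: it bounds both cost and the damage a confused agent can do.&lt;/p&gt;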

&lt;h3&gt;
  
  
  2. Tool reach across the stack
&lt;/h3&gt;

&lt;p&gt;An investigation agent that can only read Kubernetes cannot follow an incident that crosses into cloud services or another provider. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, New Relic, Opsgenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sandboxed CLI execution
&lt;/h3&gt;

&lt;p&gt;Letting the agent run &lt;code&gt;kubectl&lt;/code&gt; and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prompt-injection input rail&lt;/strong&gt; (&lt;a href="https://github.com/NVIDIA/NeMo-Guardrails" rel="noopener noreferrer"&gt;NVIDIA NeMo Guardrails&lt;/a&gt;) blocks commands that originate from injected instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static signature match&lt;/strong&gt; against 37 vendored &lt;a href="https://github.com/SigmaHQ/sigma" rel="noopener noreferrer"&gt;SigmaHQ&lt;/a&gt; detection rules covering known-malicious command patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-org command policy&lt;/strong&gt; — allow/deny lists scoped to the customer's tenant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM safety judge&lt;/strong&gt; adapted from &lt;a href="https://github.com/meta-llama/PurpleLlama" rel="noopener noreferrer"&gt;Meta's PurpleLlama AlignmentCheck&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Approved commands execute via &lt;code&gt;kubectl exec&lt;/code&gt; into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; for the full threat model.&lt;/p&gt;
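&lt;p&gt;The layered idea can be sketched as a chain of checks where the first denial wins. This is an illustration only, not Aurora's pipeline: every check below is a toy stand-in (no NeMo Guardrails, Sigma rules, or LLM judge is actually called), and the patterns and allow list are hypothetical.&lt;/p&gt;

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    allowed: bool
    layer: str
    reason: str

# Each check returns a denial reason, or None to pass the command to the next layer.
Check = Callable[[str], Optional[str]]

def injection_rail(cmd: str) -> Optional[str]:
    # Toy stand-in for a prompt-injection rail.
    return "looks like injected instructions" if "ignore previous" in cmd.lower() else None

DENY_PATTERNS = [re.compile(r"rm\s+-rf\s+/"), re.compile(r"curl\s+.*\|\s*sh")]

def signature_match(cmd: str) -> Optional[str]:
    # Toy stand-in for vendored detection rules.
    for pattern in DENY_PATTERNS:
        if pattern.search(cmd):
            return f"matched signature {pattern.pattern}"
    return None

def org_policy(cmd: str) -> Optional[str]:
    # Hypothetical tenant policy: read-only kubectl verbs only.
    parts = cmd.split()
    allowed_verbs = {"get", "describe", "logs", "top"}
    if len(parts) >= 2 and parts[0] == "kubectl" and parts[1] in allowed_verbs:
        return None
    return "not on the org allow list"

def llm_judge(cmd: str) -> Optional[str]:
    # Placeholder for an LLM safety judge; always passes in this sketch.
    return None

LAYERS: list[tuple[str, Check]] = [
    ("injection_rail", injection_rail),
    ("signature_match", signature_match),
    ("org_policy", org_policy),
    ("llm_judge", llm_judge),
]

def evaluate(cmd: str) -> Verdict:
    """Run the command through every layer in order; stop at the first denial."""
    for name, check in LAYERS:
        reason = check(cmd)
        if reason is not None:
            return Verdict(False, name, reason)
    return Verdict(True, "all", "passed every layer")

print(evaluate("kubectl get pods -n prod"))
print(evaluate("kubectl delete pod api-0"))
```

&lt;p&gt;Ordering cheap deterministic checks before the LLM judge keeps the expensive layer off the hot path for commands that were never going to be allowed, and recording which layer denied a command makes the trace auditable.&lt;/p&gt;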

&lt;h3&gt;
  
  
  4. Retrieval over organizational memory
&lt;/h3&gt;

&lt;p&gt;The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.&lt;/p&gt;

&lt;p&gt;The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.&lt;/p&gt;
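&lt;p&gt;One common way to combine a sparse and a dense ranking is reciprocal rank fusion, sketched below. This is a generic illustration of why hybrid retrieval helps, not a description of how Weaviate or Aurora fuse results; the document IDs and the constant &lt;code&gt;k=60&lt;/code&gt; are assumptions for the example.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the query "checkout pods OOMKilled after deploy".
sparse = ["pm-114", "runbook-7", "pm-031"]   # keyword search: literal identifier matches
dense = ["pm-031", "pm-114", "runbook-2"]    # vector search: semantically similar incidents

fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

&lt;p&gt;Documents that both retrievers agree on (&lt;code&gt;pm-114&lt;/code&gt;, &lt;code&gt;pm-031&lt;/code&gt;) rise to the top, while results found by only one retriever are kept rather than dropped.&lt;/p&gt;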

&lt;h3&gt;
  
  
  5. Infrastructure topology
&lt;/h3&gt;

&lt;p&gt;An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses &lt;a href="https://memgraph.com/" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.&lt;/p&gt;
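&lt;p&gt;A small sketch of what "digested dependency context" could look like: walk a dependency map and emit one line of text the agent can read. The service names and the in-memory dictionary are hypothetical; a real deployment would read this from a graph store such as Memgraph rather than a Python dict.&lt;/p&gt;

```python
from collections import deque

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON: dict[str, list[str]] = {
    "checkout-api": ["payments-svc", "orders-db"],
    "payments-svc": ["payments-db", "fraud-svc"],
    "fraud-svc": [],
    "orders-db": [],
    "payments-db": [],
}

def transitive_dependencies(service: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of everything `service` depends on, nearest first."""
    seen = {service}
    order: list[str] = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

def dependency_digest(service: str, graph: dict[str, list[str]]) -> str:
    """Render the walk as a single tagged line suitable for an agent's context."""
    deps = transitive_dependencies(service, graph)
    return f"[Auto-Discovery] {service} depends on: " + ", ".join(deps)

print(dependency_digest("checkout-api", DEPENDS_ON))
```

&lt;p&gt;The &lt;code&gt;seen&lt;/code&gt; set matters in practice: real service graphs contain cycles, and an unguarded walk would never terminate.&lt;/p&gt;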

&lt;h2&gt;
  
  
  What the DORA and VOID anchors actually say
&lt;/h2&gt;

&lt;p&gt;Two industry sources are worth grounding the investigation conversation in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DORA — Failed Deployment Recovery Time.&lt;/strong&gt; Per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's metrics history&lt;/a&gt;, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically — the place where investigation speed matters most. The &lt;a href="https://dora.dev/research/2024/dora-report/2024-dora-accelerate-state-of-devops-report.pdf" rel="noopener noreferrer"&gt;2024 DORA State of DevOps Report PDF&lt;/a&gt; further refined the metric set, adding "deployment rework rate" as a fifth core measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VOID — incident reality, not vendor claims.&lt;/strong&gt; The &lt;a href="https://www.thevoid.community/" rel="noopener noreferrer"&gt;Verica Open Incident Database&lt;/a&gt; catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, including when vendors quote them. The 2024 DORA report itself estimates that a 25% increase in AI adoption was associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability, a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  An evaluation scorecard for AI investigation tools
&lt;/h2&gt;

&lt;p&gt;Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step tool use.&lt;/strong&gt; Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud scope.&lt;/strong&gt; Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxing and RBAC.&lt;/strong&gt; Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG quality.&lt;/strong&gt; Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace readability.&lt;/strong&gt; Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and rate-limit headroom.&lt;/strong&gt; Long agentic investigations are token-expensive. Budget the LLM bill at 10x your typical per-incident estimate and stress-test rate limits against your busiest incident week, not a quiet one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source vs SaaS posture.&lt;/strong&gt; If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where it sits on the AICL.&lt;/strong&gt; Decide &lt;em&gt;up front&lt;/em&gt; whether you want L4 (recommend only) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to run a low-risk pilot
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one alert source and one cluster.&lt;/strong&gt; PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run read-only for at least four weeks.&lt;/strong&gt; Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingest your historical context.&lt;/strong&gt; Load past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add one chat channel and one slash command.&lt;/strong&gt; Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review traces weekly.&lt;/strong&gt; Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promote to alert-triggered investigation when the trace is clean for two consecutive weeks.&lt;/strong&gt; Wire a webhook from PagerDuty / Datadog / Grafana / incident.io straight into the agent, so the investigation starts before the on-call has opened their laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide on L5 (remediation) only after three months at clean L4.&lt;/strong&gt; Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; feature is the open-source pattern for this.&lt;/li&gt;
&lt;/ol&gt;
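&lt;p&gt;The read-only comparison in step 2 is easy to keep honest with a small scorecard. The sketch below assumes a reviewer has already judged whether the agent's RCA agrees with the human one; the record fields and sample numbers are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class PilotIncident:
    incident_id: str
    agent_matches_human: bool     # reviewer judged the two RCAs to agree
    agent_minutes_to_rca: float
    human_minutes_to_rca: float

def summarize(incidents: list[PilotIncident]) -> dict:
    """Agreement rate and median time to RCA across a pilot's incidents."""
    if not incidents:
        raise ValueError("no incidents recorded yet")
    agreed = sum(1 for i in incidents if i.agent_matches_human)
    return {
        "incidents": len(incidents),
        "agreement_rate": agreed / len(incidents),
        "median_agent_minutes": median(i.agent_minutes_to_rca for i in incidents),
        "median_human_minutes": median(i.human_minutes_to_rca for i in incidents),
    }

# Hypothetical four-incident pilot log.
log = [
    PilotIncident("INC-1", True, 6.0, 42.0),
    PilotIncident("INC-2", False, 9.0, 30.0),
    PilotIncident("INC-3", True, 4.0, 55.0),
    PilotIncident("INC-4", True, 12.0, 25.0),
]
summary = summarize(log)
print(summary)
```

&lt;p&gt;Medians are used instead of means because a single long-running incident would otherwise dominate a small pilot sample.&lt;/p&gt;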

&lt;h2&gt;
  
  
  What can go wrong
&lt;/h2&gt;

&lt;p&gt;A short list of failure modes worth pre-mortem-ing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt drift.&lt;/strong&gt; A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool misuse.&lt;/strong&gt; Agent runs against the wrong cloud account, the wrong cluster, or with a destructive subcommand. Mitigated by sandboxing and RBAC, but not eliminated — keep traces auditable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated identifiers.&lt;/strong&gt; Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost runaway.&lt;/strong&gt; Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-trust.&lt;/strong&gt; The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;We build &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related guides:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/automated-post-mortem-generation" rel="noopener noreferrer"&gt;Automated Post-Mortem Generation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE comparison&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Aurora Actions: User-Defined Background Automations for Incident Response</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:49:20 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions are reusable, natural-language automations&lt;/strong&gt; that Aurora's agent executes in the background using all 22+ connected integrations. Available today on the main branch of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three trigger types out of the box&lt;/strong&gt;: manual ("run now"), on incident completion (chain follow-up work after every RCA), and recurring schedule (Celery Beat–driven intervals).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same agent, same tools, different prompt scaffolding.&lt;/strong&gt; Actions reuse Aurora's existing LangGraph agent and 30+ tools (kubectl, aws, gcloud, az, Terraform, Confluence, Slack, GitHub) — they just run as background chat sessions with eager-loaded skills and no RCA mandate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; is a first-class chat primitive.&lt;/strong&gt; Slash-command autocomplete in the chat input, "Run Action" dropdown on completed incidents, and full RBAC-gated CRUD UI in Settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions turn the agent into a programmable platform.&lt;/strong&gt; This is the building block for CI/CD auto-remediation, scheduled audits, and post-incident health checks — covered in &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;our CI/CD Auto-Remediation guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;We shipped one of the most-requested features in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;'s history: &lt;strong&gt;Aurora Actions — user-defined background automations that run on Aurora's agent.&lt;/strong&gt; &lt;strong&gt;An Aurora Action is a named, natural-language instruction the user writes once and then triggers manually, on incident completion, or on a recurring schedule; Aurora's agent executes it as a background task with full access to every connected integration.&lt;/strong&gt; Where traditional incident management tools force you to pick from a fixed catalog of "automations" (close incident, post to Slack, run runbook), Actions are written in plain English and inherit the full reasoning capability of the agent.&lt;/p&gt;

&lt;p&gt;This post is for SRE and platform teams already running Aurora — or evaluating it — who want to understand what Actions actually do, where they fit on the agentic spectrum, and how to use them safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Aurora Action?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Aurora Action&lt;/strong&gt; has four parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A name&lt;/strong&gt; — used as the slash-command handle (&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt;) and as the dropdown label on incident cards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A natural-language instruction&lt;/strong&gt; — the prompt the agent will execute. The same instruction the user would type into chat, except it can reference incident context placeholders when triggered post-incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A trigger type&lt;/strong&gt; — manual, on-incident-completion, or on-schedule (interval-based via &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An on/off toggle&lt;/strong&gt; — actions can be disabled without deletion, with full RBAC for who can create, edit, or trigger them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation is a thin layer over Aurora's existing chat agent. When an Action triggers, the executor service creates a background chat session with the action's instruction as the user message, runs it through the same LangGraph workflow that powers interactive chat, and persists the run history. The agent has full tool access (kubectl, cloud CLIs, Terraform, Slack, GitHub, Confluence, Memgraph, Weaviate) and eager-loaded skills — the only differences from interactive chat are scaffolded prompts and the absence of any RCA mandate.&lt;/p&gt;
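&lt;p&gt;As a mental model, the four parts map onto a small record type. The sketch below is illustrative only and is not Aurora's schema: the class, field names, and validation rules are assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional

class Trigger(Enum):
    MANUAL = "manual"
    ON_INCIDENT_COMPLETION = "on_incident_completion"
    ON_SCHEDULE = "on_schedule"

@dataclass
class Action:
    name: str                 # slash-command handle and dropdown label
    instruction: str          # natural-language prompt the agent executes
    trigger: Trigger
    enabled: bool = True      # on/off toggle; disabling does not delete
    interval: Optional[timedelta] = None   # only meaningful for scheduled actions

    def __post_init__(self) -> None:
        if self.trigger is Trigger.ON_SCHEDULE and self.interval is None:
            raise ValueError("scheduled actions need an interval")
        if self.trigger is not Trigger.ON_SCHEDULE and self.interval is not None:
            raise ValueError("only scheduled actions take an interval")

    def slash_command(self) -> str:
        return f"/action {self.name}"

audit = Action(
    name="iam-audit",
    instruction="Audit IAM roles unused for 90 days and list removal candidates.",
    trigger=Trigger.ON_SCHEDULE,
    interval=timedelta(days=7),
)
print(audit.slash_command())
```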

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Most incident management automation today is &lt;strong&gt;workflow automation&lt;/strong&gt;: PagerDuty fires, Slack channel is created, status page is updated, runbook link is posted. The "automation" is a directed graph of static actions. There is no reasoning, no investigation, no judgment. Tools like Rootly, FireHydrant, and incident.io are excellent at this, but the diagnostic work stays with the SRE: someone still has to work out what went wrong and verify the outcome by hand.&lt;/p&gt;

&lt;p&gt;Aurora's bet has always been the opposite: &lt;strong&gt;automate the investigation itself.&lt;/strong&gt; Aurora Actions extend that bet from one-shot incident investigations to recurring or post-incident workflows. A few concrete examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Noisy alert tuning&lt;/strong&gt; — "Every Friday at 5pm, review which Datadog alerts fired more than 20 times this week with mean time-to-acknowledge over 10 minutes. Open a Terraform PR to widen the thresholds or move them to a warning channel."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-incident health check&lt;/strong&gt; — "After every completed RCA, run a 15-minute observation on the affected service: check error rate, p99 latency, and pod restart count. Post results to #incident-followup."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled infrastructure audit&lt;/strong&gt; — "Every Monday at 9am, audit IAM roles in the production AWS account that have not been used in 90 days. List candidates for removal in a Confluence page."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are runbook automation. Each requires the agent to query infrastructure, reason about results, and produce a structured output. Each one was previously the job of an on-call engineer doing follow-up between pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Actions sit on the agentic capability spectrum
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE comparison&lt;/a&gt;, we proposed a four-level spectrum for AI SRE capability. Actions don't change the level — they change &lt;em&gt;when the agent runs.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;When the agent runs&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Pre-Actions example&lt;/th&gt;
&lt;th&gt;With Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On alert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Webhook from PagerDuty / Datadog / Grafana&lt;/td&gt;
&lt;td&gt;Aurora investigates the alert and produces an RCA&lt;/td&gt;
&lt;td&gt;Same — investigation flow is unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On user request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer asks a question in chat&lt;/td&gt;
&lt;td&gt;Aurora answers using tools&lt;/td&gt;
&lt;td&gt;Same — plus &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; shortcuts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After every incident&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incident state transitions to "resolved"&lt;/td&gt;
&lt;td&gt;Postmortem generated; engineer manually does follow-up checks&lt;/td&gt;
&lt;td&gt;Action runs automatically with incident context in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On a schedule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Celery Beat interval schedule&lt;/td&gt;
&lt;td&gt;No equivalent — required external scheduler + custom code&lt;/td&gt;
&lt;td&gt;Single source of truth: agent runs the prompt on cadence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The post-incident and scheduled triggers are the genuinely new capability. Before Actions, anything recurring or post-incident required gluing Aurora to an external scheduler, an external prompt store, and bespoke trigger code. Actions collapse all three into the product surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Actions work under the hood
&lt;/h2&gt;

&lt;p&gt;This is for the technically curious. A few architecturally interesting things from the implementation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Background chat sessions, not a separate runtime.&lt;/strong&gt; When an Action triggers, the executor service creates a regular chat session with the action's instruction as the seed message and dispatches it as a background Celery task. The agent doesn't know it's running an Action — it just runs the workflow. This means every capability the interactive agent has (tool calls, RAG, graph traversal, sub-agent orchestration) is available inside Actions for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Eager-loaded skills, no RCA mandate.&lt;/strong&gt; Interactive chat lazy-loads skills based on the user message. Background actions eager-load all skills because there is no human to clarify ambiguity. The system prompt also strips the "your job is to find root cause" framing — Actions can do anything the agent can do, not just investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. RLS context is preserved.&lt;/strong&gt; Aurora uses &lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html" rel="noopener noreferrer"&gt;PostgreSQL row-level security&lt;/a&gt; for multi-tenancy. The executor explicitly sets RLS context (&lt;code&gt;org_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;) before running so background tasks see only their own org's data — even though they run under a service identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Stale run cleanup is integrated.&lt;/strong&gt; Aurora's existing background-chat janitor already handles orphaned chat sessions from crashed pods. Action runs go through the same path, so a worker pod dying mid-action doesn't leave the run state inconsistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. RBAC is enforced at the route layer.&lt;/strong&gt; Action CRUD is gated by Aurora's Casbin-based RBAC. Org admins can restrict which roles can create or trigger actions — important because an Action with cloud-CLI access has real blast radius.&lt;/p&gt;
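&lt;p&gt;For readers unfamiliar with the RLS pattern in point 3, here is a rough sketch of how a background worker might pin tenant context before querying. It is not Aurora's code: the setting names (&lt;code&gt;app.current_org_id&lt;/code&gt;, &lt;code&gt;app.current_user_id&lt;/code&gt;), the &lt;code&gt;action_runs&lt;/code&gt; table, and the policy name are hypothetical. PostgreSQL's &lt;code&gt;set_config&lt;/code&gt; and &lt;code&gt;current_setting&lt;/code&gt; functions are real, and the cursor is any DB-API cursor that uses &lt;code&gt;%s&lt;/code&gt; placeholders.&lt;/p&gt;

```python
# Example policy a DBA might create once (hypothetical table and setting names).
POLICY_SQL = """
CREATE POLICY org_isolation ON action_runs
    USING (org_id = current_setting('app.current_org_id'));
"""

def set_rls_context(cursor, org_id: str, user_id: str) -> None:
    """Pin tenant context for the current transaction before any query runs.

    The third argument to set_config (true) makes the setting transaction-local,
    so it cannot leak to the next task that reuses the connection.
    """
    cursor.execute("SELECT set_config('app.current_org_id', %s, true)", (org_id,))
    cursor.execute("SELECT set_config('app.current_user_id', %s, true)", (user_id,))

class RecordingCursor:
    """Tiny fake cursor so the sketch runs without a database."""
    def __init__(self) -> None:
        self.calls: list[tuple[str, tuple]] = []

    def execute(self, sql: str, params: tuple = ()) -> None:
        self.calls.append((sql, params))

cursor = RecordingCursor()
set_rls_context(cursor, "org-1", "user-9")
for sql, params in cursor.calls:
    print(sql, params)
```

&lt;p&gt;Passing the IDs as bound parameters rather than formatting them into the SQL string keeps the context-setting step itself safe from injection.&lt;/p&gt;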

&lt;h2&gt;
  
  
  Trigger types in detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual triggers
&lt;/h3&gt;

&lt;p&gt;The simplest case. An admin creates the action, an engineer triggers it from the Actions page or via &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; in chat. Useful for codifying common operational tasks ("rotate ECS task definitions for service X", "scan Confluence for stale runbooks") into named, repeatable commands.&lt;/p&gt;

&lt;p&gt;The chat integration is worth calling out: &lt;code&gt;/action&lt;/code&gt; is implemented as an LLM tool call using the same pattern as Aurora's &lt;code&gt;/rca&lt;/code&gt; slash command. The agent processes the action dispatch and then continues responding to the rest of the user's message — so you can write "kick off the IAM audit and tell me what changed since last week" and the agent will dispatch the audit action &lt;em&gt;and&lt;/em&gt; answer your question in the same turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-incident-completion triggers
&lt;/h3&gt;

&lt;p&gt;When an incident transitions to "resolved", any action with this trigger type runs against the incident context. The incident's metadata, RCA, and timeline are available to the action's agent without the user having to paste anything in. This is the trigger that turns Aurora from a reactive tool ("investigate this page") into a continuous one ("investigate, then run health checks, then file the postmortem").&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled triggers
&lt;/h3&gt;

&lt;p&gt;Interval-based, driven by &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;. Choose a cadence (every N minutes / hours / days), and the action runs without user involvement. This is the building block for the CI/CD auto-remediation and scheduled audit use cases — and it's why we're calling this post and the &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt; sister posts.&lt;/p&gt;
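&lt;p&gt;For orientation, an interval schedule in Celery Beat is a dictionary of entries with &lt;code&gt;task&lt;/code&gt;, &lt;code&gt;schedule&lt;/code&gt;, and &lt;code&gt;args&lt;/code&gt; keys, where &lt;code&gt;schedule&lt;/code&gt; can be a &lt;code&gt;timedelta&lt;/code&gt;. The sketch below shows that shape and a quick way to see how cadence drives run count. The task name &lt;code&gt;actions.run_action&lt;/code&gt; and the action names are hypothetical, and this is not Aurora's actual configuration.&lt;/p&gt;

```python
from datetime import timedelta

# Shape of a Celery Beat schedule; in a real app this would be assigned
# to app.conf.beat_schedule. Task and action names are hypothetical.
beat_schedule = {
    "iam-audit": {
        "task": "actions.run_action",
        "schedule": timedelta(days=7),
        "args": ("iam-audit",),
    },
    "queue-poll": {
        "task": "actions.run_action",
        "schedule": timedelta(minutes=5),
        "args": ("queue-poll",),
    },
}

def runs_per_day(entry: dict) -> float:
    """How many times a day an interval entry fires."""
    return timedelta(days=1) / entry["schedule"]

for name, entry in beat_schedule.items():
    print(name, round(runs_per_day(entry), 2))
```

&lt;p&gt;A five-minute cadence means 288 agent runs a day for a single action, which is the arithmetic to do before choosing a short interval.&lt;/p&gt;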

&lt;h2&gt;
  
  
  What Actions don't do (and why)
&lt;/h2&gt;

&lt;p&gt;A few capability decisions worth being explicit about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No external webhook triggers&lt;/strong&gt; in this release. We could have added "trigger on arbitrary webhook" but it overlaps with the existing alert-triggered investigation flow. We may add it if we see demand for triggers from systems that don't go through PagerDuty / Datadog / Grafana.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No agent-authored Actions&lt;/strong&gt; yet. The agent can't create or modify Actions on its own. Self-modification is a serious security boundary; we'd want approval gating and audit logging before opening that door. (See our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; for the threat model.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No conditional / DAG composition&lt;/strong&gt; in this release. Actions are single-prompt for now. If you need a multi-step workflow, write a single prompt that describes the steps — the agent is good at sequencing. We'll add explicit composition if the natural-language form proves limiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Safety: what to think about before enabling
&lt;/h2&gt;

&lt;p&gt;Every Action is a small program with access to your cloud environment. A few rules we use ourselves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start read-only.&lt;/strong&gt; Actions inherit Aurora's tool permissions. If your tool config restricts write actions (no &lt;code&gt;kubectl apply&lt;/code&gt;, no &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt;), Actions inherit that posture. Keep it that way for the first few weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use scheduled triggers conservatively.&lt;/strong&gt; A daily audit is cheap. A 5-minute polling loop with cloud CLI calls is not. Watch the LLM bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit who can create Actions.&lt;/strong&gt; RBAC defaults to org-admin-only creation. Leave it there unless you have a clear reason to widen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin the model.&lt;/strong&gt; Action prompts can be sensitive to model behavior. Pin a known-good model per action (gpt-5.5, claude-sonnet-4.6, opus-4.7, etc.) using Aurora's per-org model dropdown until you have confidence in cross-model stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review action runs weekly.&lt;/strong&gt; Every action has a run-history view. Spend 10 minutes a week reading the agent's traces for your scheduled actions — anomalous reasoning is the leading indicator of prompt drift or tool drift.&lt;/li&gt;
&lt;/ol&gt;
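
&lt;p&gt;To make the cadence rule concrete, here is a rough sketch of the arithmetic behind "a daily audit is cheap, a 5-minute polling loop is not." The per-run cost (&lt;code&gt;ASSUMED_CENTS_PER_RUN&lt;/code&gt;) is an assumption for illustration only; substitute the figure from your own LLM bill.&lt;/p&gt;

```python
from datetime import timedelta


def runs_in_window(cadence, window=timedelta(days=30)):
    """Count how many scheduled runs fit in the window at a fixed cadence."""
    return window // cadence


def spend_in_window_cents(cadence, cents_per_run, window=timedelta(days=30)):
    """Total spend in cents, given an assumed flat cost per run."""
    return runs_in_window(cadence, window) * cents_per_run


# Assumed figure for illustration only: 5 cents of LLM tokens per run.
ASSUMED_CENTS_PER_RUN = 5

daily_runs = runs_in_window(timedelta(days=1))          # 30 runs in 30 days
polling_runs = runs_in_window(timedelta(minutes=5))     # 8640 runs in 30 days
daily_spend = spend_in_window_cents(timedelta(days=1), ASSUMED_CENTS_PER_RUN)
polling_spend = spend_in_window_cents(timedelta(minutes=5), ASSUMED_CENTS_PER_RUN)
```

&lt;p&gt;Whatever the real per-run cost is, the 5-minute loop runs 288 times as often as the daily audit, so the bill scales by the same factor.&lt;/p&gt;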

&lt;h2&gt;
  
  
  How to ship your first Action
&lt;/h2&gt;

&lt;p&gt;A six-step recipe.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick a recurring task you currently do manually
&lt;/h3&gt;

&lt;p&gt;Anything you do every week or after every incident. Examples: stale-PR review, alert-noise audit, on-call handover summary. The smaller and more deterministic, the better for v1.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write the prompt as if you were typing it into chat
&lt;/h3&gt;

&lt;p&gt;Don't translate to "automation language." Write it the way you would write a chat message to a smart junior SRE. "Look at..." "Check whether..." "Open a PR that..."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create the Action with a manual trigger
&lt;/h3&gt;

&lt;p&gt;Settings → Actions → New Action. Paste the prompt and set trigger = manual; leave the Action disabled if you want to review it before enabling. Then trigger it once and watch the run.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Inspect the run trace
&lt;/h3&gt;

&lt;p&gt;Click the run in the history view. Read every tool call. Look for: tool misuse (wrong cloud account), excessive tool calls (3 attempts at the same thing), hallucinated paths or resource IDs. Iterate on the prompt until the trace is clean for three consecutive runs.&lt;/p&gt;
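
&lt;p&gt;If you export a run trace, the "excessive tool calls" check can be partly mechanized. The sketch below assumes a hypothetical trace shape (a list of steps, each with a &lt;code&gt;tool&lt;/code&gt; name and an &lt;code&gt;args&lt;/code&gt; dict); it is not Aurora's actual trace format, so adapt the field names to whatever your run history exposes.&lt;/p&gt;

```python
import json
from collections import Counter


def repeated_calls(trace, limit=3):
    """Return the (tool, args) pairs that occur `limit` or more times in a trace."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
    )
    # `n not in range(limit)` is true exactly when n is at least `limit`.
    return sorted(key for key, n in counts.items() if n not in range(limit))


# Hypothetical trace: a list of steps, each with a tool name and its arguments.
example_trace = [
    {"tool": "kubectl", "args": {"cmd": "get pods -n payments"}},
    {"tool": "kubectl", "args": {"cmd": "get pods -n payments"}},
    {"tool": "kubectl", "args": {"cmd": "get pods -n payments"}},
    {"tool": "logs", "args": {"query": "level=error"}},
]
flagged = repeated_calls(example_trace)
```

&lt;p&gt;A check like this only catches exact repeats. Wrong-account tool use and hallucinated resource IDs still need a human reading the trace.&lt;/p&gt;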

&lt;h3&gt;
  
  
  5. Promote to the right trigger type
&lt;/h3&gt;

&lt;p&gt;If the action makes sense after every incident → on-incident-completion. If it's a routine sweep → on-schedule with the longest cadence that still meets your need. Only use short cadences when you have a clear cost and blast-radius understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Add it to your team's incident review
&lt;/h3&gt;

&lt;p&gt;Treat agent runs the same way you treat human runs: include them in your weekly incident review. Look for actions that produced wrong output, actions whose output nobody read, and actions whose output nobody acted on. Delete or downgrade as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora Actions vs traditional incident-management automation
&lt;/h2&gt;

&lt;p&gt;The category most people compare us to is "workflow automation in incident-management SaaS": Rootly, FireHydrant, incident.io. The comparison is informative, but the two are ultimately different categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Aurora Actions&lt;/th&gt;
&lt;th&gt;Rootly / FireHydrant / incident.io workflows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language&lt;/td&gt;
&lt;td&gt;DSL or visual builder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — LLM agent&lt;/td&gt;
&lt;td&gt;No — fixed conditional graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool reach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud CLIs, kubectl, Terraform, Slack, Confluence, GitHub, RAG, infra graph&lt;/td&gt;
&lt;td&gt;Slack, status pages, Zoom, runbook links, ticket creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduled execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Celery Beat)&lt;/td&gt;
&lt;td&gt;Limited (some support timed reminders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Post-incident chaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — full incident context available&lt;/td&gt;
&lt;td&gt;Yes — but limited to workflow actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0, self-hosted)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (self-hosted; LLM tokens only)&lt;/td&gt;
&lt;td&gt;Per-user SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest framing: traditional incident-management tools automate the &lt;em&gt;process around&lt;/em&gt; the incident. Aurora Actions automate &lt;em&gt;what happens inside the agent&lt;/em&gt;. Both have value; they cover non-overlapping work. If you live in PagerDuty and use Rootly for incident channels, Aurora Actions sit alongside that — they don't replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Aurora Actions is the foundation for several capabilities on our roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG composition&lt;/strong&gt; — explicit multi-step Action chains where each step is itself an Action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval gates&lt;/strong&gt; — Actions that pause for human approval before destructive tool calls (already supported in chat; explicit Action-level gating coming).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation hooks&lt;/strong&gt; — first-class integration with GitHub Actions, Jenkins, and ArgoCD so a failing pipeline becomes a triggered Aurora investigation. (Background and detailed write-up in our &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action marketplace&lt;/strong&gt; — community-contributed Actions you can install with one click. Bring-your-own prompt store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll publish each of these as they ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora is fully open source under Apache 2.0. Self-host with &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Docker Compose or Helm&lt;/a&gt;. Actions ship in the next tagged release after &lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;aurora-oss-1.2.15&lt;/a&gt; (April 15, 2026); the feature is available on &lt;code&gt;main&lt;/code&gt; today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare against alternatives:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs traditional incident-management tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:32:08 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most teams do not yet auto-remediate inside CI/CD.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt; — even though AI is now widely used elsewhere in the development lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation is an architectural pattern, not a product category.&lt;/strong&gt; It combines progressive delivery (canary, blue-green), automated metric-driven rollback, and AI-assisted root-cause-and-fix. You assemble it from components you own; it is not a single SKU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three layers, five maturity levels.&lt;/strong&gt; We propose the &lt;strong&gt;CI/CD Auto-Remediation Maturity Spectrum (CARM)&lt;/strong&gt;: L0 (manual), L1 (rollback), L2 (rollback + diagnostic), L3 (rollback + diagnostic + remediation), L4 (closed-loop with policy gates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source stack is mature.&lt;/strong&gt; &lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, and metric-driven &lt;code&gt;AnalysisTemplates&lt;/code&gt; cover L1–L2 with no AI. AI agents like &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; extend to L3 with Actions-based remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DORA's bar is real.&lt;/strong&gt; Top-performing teams keep change failure rate low and recover from failed deployments in under one hour (&lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program guidance&lt;/a&gt;). Auto-remediation is how non-elite teams close the gap.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ AI SRE products&lt;/a&gt; and dozens of progressive-delivery tools shipping in 2026, only a handful explicitly target the pattern this guide is about. &lt;strong&gt;CI/CD auto-remediation is the practice of having your delivery pipeline automatically detect, diagnose, and recover from failure — without paging a human — using a combination of progressive-delivery primitives, metric-driven rollback policies, and (increasingly) AI agents that propose or apply fixes.&lt;/strong&gt; It is not the same as auto-deploy. It is not the same as canary rollout. It is the closing of the loop between "the pipeline noticed something is wrong" and "the system is back in a good state" — without an engineer in the middle.&lt;/p&gt;

&lt;p&gt;This guide is for SRE and platform teams who already run continuous delivery and want to push toward the auto-remediation end of the spectrum. By the end, you should be able to: position your current setup on the CARM maturity spectrum, identify the next concrete step, and pick a credible tool stack to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why auto-remediation matters in 2026
&lt;/h2&gt;

&lt;p&gt;Three numbers explain the demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AI is shipping more code, faster.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage on the TeamCity blog (April 2026)&lt;/a&gt;, AI tools are now used by a large majority of developers in their daily work. The &lt;a href="https://getdx.com/blog/change-failure-rate/" rel="noopener noreferrer"&gt;DX 2026 change-failure-rate analysis&lt;/a&gt; puts a number on it: with 91% of developers having adopted AI and 20%+ of merged code now AI-authored, &lt;strong&gt;code velocity has gone up while quality has gone in the opposite direction.&lt;/strong&gt; More deployments per day means more chances to break production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The pipeline itself is the new bottleneck.&lt;/strong&gt; &lt;a href="https://blog.jetbrains.com/teamcity/2025/10/the-state-of-cicd/" rel="noopener noreferrer"&gt;JetBrains' 2025 State of CI/CD survey&lt;/a&gt; documents widespread frustration with slow and unreliable CI/CD pipelines as a leading contributor to developer burnout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AI in CI/CD specifically lags adoption.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt; — even though most use AI everywhere else in the development lifecycle. The gap isn't capability; it's trust and integration. AI in IDEs is low-risk; AI in pipelines touches production. Teams want the impact but won't take the blast radius until the architecture is right.&lt;/p&gt;

&lt;p&gt;Auto-remediation is the architecture that closes that gap. It bounds the agent's reach (only inside the delivery pipeline), gives it deterministic guardrails (progressive delivery and metric-driven rollback), and produces a clear contract: detect, diagnose, fix-or-rollback, log.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "auto-remediation" actually means
&lt;/h2&gt;

&lt;p&gt;It is easiest to define by negation. Auto-remediation is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-deploy.&lt;/strong&gt; Auto-deploy ships code on merge. Auto-remediation is what happens &lt;em&gt;after&lt;/em&gt; a problem appears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canary release.&lt;/strong&gt; Canary is the &lt;em&gt;detection mechanism&lt;/em&gt; — it surfaces problems early by shifting traffic gradually. Remediation is the &lt;em&gt;response&lt;/em&gt; — rolling back, hotfixing, or reverting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing infrastructure.&lt;/strong&gt; Self-healing systems like Kubernetes restart pods. Auto-remediation includes that plus &lt;em&gt;change-driven&lt;/em&gt; failure recovery: rolling back a bad deploy, rolling forward a fix, or pausing the pipeline while a human investigates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps.&lt;/strong&gt; AIOps platforms surface alerts and correlations. Auto-remediation closes the loop by &lt;em&gt;acting&lt;/em&gt; on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The minimum viable definition: &lt;strong&gt;a pipeline transition from a degraded state back to a healthy state, triggered by automated detection, executed by automated action, observed and logged for human review.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI/CD Auto-Remediation Maturity Spectrum (CARM)
&lt;/h2&gt;

&lt;p&gt;There is no single industry-standard maturity model for auto-remediation. We use the following five-level spectrum — derived from how teams actually evolve.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What happens on failed deploy&lt;/th&gt;
&lt;th&gt;Tools that get you here&lt;/th&gt;
&lt;th&gt;Trust required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0 — Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline fails. PagerDuty pages the on-call. Engineer investigates, decides to roll back or hotfix, executes manually.&lt;/td&gt;
&lt;td&gt;None — this is the default for most teams.&lt;/td&gt;
&lt;td&gt;None — humans do everything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Automated Rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline detects health-check failure (error rate, latency, smoke test). Automatically rolls back to the previous version. Pages a human after the fact.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, &lt;a href="https://spinnaker.io/" rel="noopener noreferrer"&gt;Spinnaker&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the health metric reflects user-visible failure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Rollback + Diagnostic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L1 plus: AI agent runs an investigation when rollback fires. Produces an RCA before the human starts. Page goes out with context, not blank.&lt;/td&gt;
&lt;td&gt;L1 stack + &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the diagnostic is right enough to bias human reasoning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Rollback + Diagnostic + Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L2 plus: agent proposes (or in some cases applies) a fix — a PR, a config change, an alert threshold update. Human reviews and merges.&lt;/td&gt;
&lt;td&gt;L2 stack + Aurora Actions, HolmesGPT Operator mode&lt;/td&gt;
&lt;td&gt;Trust that the agent's fix is correct, scoped, and reviewable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Closed-loop with policy gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L3 plus: certain &lt;em&gt;low-risk, well-understood&lt;/em&gt; fixes are applied automatically inside policy guardrails (alert threshold widening, log-only changes, retry loops). Destructive or high-risk changes still gated.&lt;/td&gt;
&lt;td&gt;L3 stack + policy engine (OPA, Casbin, Kyverno) + audit logging&lt;/td&gt;
&lt;td&gt;Trust the policy gate definitions more than the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams in 2026 are at &lt;strong&gt;L0 or L1&lt;/strong&gt;. The leap from L1 to L2 is the single highest-leverage move available because it preserves human-in-the-loop decision-making while removing the "blank-page" delay that explains a large share of MTTR. The 2024-2025 DORA research &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;renamed MTTR to Failed Deployment Recovery Time (FDRT)&lt;/a&gt; precisely because the metric is more meaningful when scoped to change-driven failures — which is exactly the failure mode auto-remediation targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  L1: Automated rollback (where most serious teams should be)
&lt;/h2&gt;

&lt;p&gt;This is the foundation. Without L1, AI-assisted remediation at L2-L3 has nowhere to act.&lt;/p&gt;

&lt;p&gt;The two Apache 2.0 incumbents are &lt;strong&gt;Argo Rollouts&lt;/strong&gt; and &lt;strong&gt;Flagger.&lt;/strong&gt; Both run in Kubernetes; both implement metric-driven progressive delivery with automated rollback. They differ in invasiveness.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/argo/" rel="noopener noreferrer"&gt;Argo&lt;/a&gt; (Graduated, Dec 2022)&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; (Graduated, Nov 2022)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Replaces &lt;code&gt;Deployment&lt;/code&gt; with &lt;code&gt;Rollout&lt;/code&gt; CRD&lt;/td&gt;
&lt;td&gt;Wraps existing &lt;code&gt;Deployment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps pairing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;FluxCD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AnalysisTemplate&lt;/code&gt; querying Prometheus, Datadog, CloudWatch, etc.&lt;/td&gt;
&lt;td&gt;Service-mesh metrics + custom webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traffic shaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native + ingress + service mesh&lt;/td&gt;
&lt;td&gt;Service-mesh first (Istio, Linkerd, App Mesh)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Invasiveness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (changes resource type)&lt;/td&gt;
&lt;td&gt;Lower (transparent wrapper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhooks for custom logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Experiment&lt;/code&gt; resource + analysis runs&lt;/td&gt;
&lt;td&gt;Pre-/post-/during-rollout hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pick Argo Rollouts&lt;/strong&gt; if you already use ArgoCD and want explicit per-step canary control. &lt;strong&gt;Pick Flagger&lt;/strong&gt; if you use a service mesh and want progressive delivery to be transparent to existing manifests.&lt;/p&gt;

&lt;p&gt;For non-Kubernetes pipelines, equivalent capability lives in &lt;strong&gt;Spinnaker&lt;/strong&gt; (multi-cloud, mature), &lt;strong&gt;Harness&lt;/strong&gt; (commercial), and feature-flag platforms like &lt;strong&gt;LaunchDarkly&lt;/strong&gt; (when "rollback" can be a flag flip).&lt;/p&gt;

&lt;p&gt;A minimal Argo Rollouts AnalysisTemplate for HTTP error rate, simplified from the &lt;a href="https://argoproj.github.io/argo-rollouts/features/analysis/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[1m]))&lt;/span&gt;
            &lt;span class="s"&gt;/ sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



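&lt;p&gt;With &lt;code&gt;failureLimit: 3&lt;/code&gt;, the analysis tolerates three failed 30-second measurements; the fourth fails the run and triggers rollback. This is L1 in under 20 lines of YAML.&lt;/p&gt;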

&lt;h2&gt;
  
  
  L2: Rollback + automated diagnostic
&lt;/h2&gt;

&lt;p&gt;L1 gets you out of an outage fast. It does not tell you &lt;em&gt;why&lt;/em&gt; the deploy failed. The human gets paged with a rollback notification and starts from zero.&lt;/p&gt;

&lt;p&gt;L2 fills that gap with an AI agent that runs when rollback fires. The agent queries the cluster state, the application logs, the rollout metrics, and the changed code — and produces an RCA before the human starts typing.&lt;/p&gt;

&lt;p&gt;Three credible open-source options exist as of 2026 (compared in detail in our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; guide):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt; — rule-based scanner with LLM explanations. Best for low-blast-radius first deployment; explains &lt;em&gt;why&lt;/em&gt; a resource is unhealthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt; — ReAct-loop AI agent (CNCF Sandbox). 30+ observability integrations. Read-only by default. Strong fit for cluster-scoped investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt; — LangGraph supervisor agent. Multi-cloud (AWS / Azure / GCP / OVH / Scaleway). Generates postmortems. Opens remediation PRs with human approval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wiring up L2 is straightforward: configure your AI SRE's webhook to receive the rollback event (Argo Rollouts emits Kubernetes events; you can route them via &lt;a href="https://argoproj.github.io/argo-rollouts/features/notifications/" rel="noopener noreferrer"&gt;Argo Notifications&lt;/a&gt; to the agent). The agent investigates and posts results to the on-call Slack channel before the human acknowledges the page.&lt;/p&gt;
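
&lt;p&gt;As a rough illustration of the receiving side, the sketch below turns a rollback notification body into a read-only investigation request. The payload fields (&lt;code&gt;rollout&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;revision&lt;/code&gt;) and the returned request shape are hypothetical; they depend on the notification template you configure and on your agent's API, neither of which is specified here.&lt;/p&gt;

```python
import json

# Hypothetical payload fields; set them to match your notification template.
REQUIRED_FIELDS = ("rollout", "namespace", "revision")


def investigation_request(raw_body, channel="#on-call"):
    """Turn a rollback notification body into a read-only investigation request."""
    event = json.loads(raw_body)
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"rollback event is missing fields: {missing}")
    prompt = (
        f"Rollout {event['rollout']} in namespace {event['namespace']} "
        f"was rolled back at revision {event['revision']}. "
        "Inspect cluster state, application logs, rollout metrics, and the "
        "changed code, then write a root-cause analysis."
    )
    # The request shape below is an assumption, not a documented agent API.
    return {"prompt": prompt, "read_only": True, "post_to": channel}


example = investigation_request(
    json.dumps({"rollout": "checkout", "namespace": "prod", "revision": "42"})
)
```

&lt;p&gt;Rejecting malformed events up front keeps the agent from investigating with a half-filled prompt, which tends to produce confident but unfocused RCAs.&lt;/p&gt;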

&lt;h2&gt;
  
  
  L3: Diagnostic + agent-proposed remediation
&lt;/h2&gt;

&lt;p&gt;L3 is where AI starts proposing fixes, not just diagnoses. The pattern that works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pipeline fails → automated rollback (L1).&lt;/li&gt;
&lt;li&gt;Agent investigates → RCA produced (L2).&lt;/li&gt;
&lt;li&gt;Agent proposes a fix as a &lt;strong&gt;pull request&lt;/strong&gt;, with the RCA as the PR description, the diff scoped to one file, and tests where possible.&lt;/li&gt;
&lt;li&gt;Human reviews PR. If correct, merges. If wrong, comments and rejects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because the pull request is the natural human-review surface. The agent doesn't touch production directly; it touches the repository, which already has CI, code review, and a merge gate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; is built precisely for this pattern. A post-incident-completion Action with a prompt like "Open a PR widening alert thresholds for the three noisiest alerts in this incident" converts the human follow-up step into automated PR creation. The human review surface stays exactly the same as for human-authored PRs.&lt;/p&gt;

&lt;p&gt;The HolmesGPT equivalent ships as &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;"Operator mode"&lt;/a&gt; — the agent can write to GitHub when explicitly enabled.&lt;/p&gt;
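
&lt;p&gt;The "diff scoped to one file" constraint from step 3 can be checked mechanically before a PR is opened. The sketch below is one possible guard, assuming the agent's proposed change is available as a git-style unified diff; the function names are illustrative and not part of Aurora or HolmesGPT.&lt;/p&gt;

```python
def files_touched(unified_diff):
    """List the files a git-style unified diff modifies, from its '+++ b/' headers."""
    prefix = "+++ b/"
    return sorted(
        {line[len(prefix):] for line in unified_diff.splitlines() if line.startswith(prefix)}
    )


def is_single_file_change(unified_diff):
    """True when the proposed fix stays within exactly one file."""
    return len(files_touched(unified_diff)) == 1


# Illustrative diff: one alert threshold widened in one file.
example_diff = """\
diff --git a/alerts/latency.yaml b/alerts/latency.yaml
--- a/alerts/latency.yaml
+++ b/alerts/latency.yaml
@@ -3,1 +3,1 @@
-  threshold: 200
+  threshold: 300
"""
```

&lt;p&gt;This is a header-based heuristic: it ignores deletions (which show &lt;code&gt;+++ /dev/null&lt;/code&gt;) and could miscount an added content line that happens to begin with the same prefix. Treat it as a pre-filter; code review remains the real gate.&lt;/p&gt;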

&lt;h2&gt;
  
  
  L4: Closed-loop with policy gates
&lt;/h2&gt;

&lt;p&gt;L4 is the contentious one. It involves the agent making changes &lt;em&gt;without&lt;/em&gt; human approval — but only inside a tightly scoped policy.&lt;/p&gt;

&lt;p&gt;The pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;policy engine&lt;/strong&gt; (&lt;a href="https://www.openpolicyagent.org/" rel="noopener noreferrer"&gt;Open Policy Agent&lt;/a&gt;, &lt;a href="https://kyverno.io/" rel="noopener noreferrer"&gt;Kyverno&lt;/a&gt;, Casbin) defines which classes of remediation can run automatically.&lt;/li&gt;
&lt;li&gt;The agent proposes a fix. The policy engine evaluates whether the fix matches a permitted class.&lt;/li&gt;
&lt;li&gt;If yes → apply automatically with audit logging. If no → route to L3 (PR for human review).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Permitted classes that are usually safe at L4: widening an alert threshold by less than 2x, restarting a pod, scaling a deployment within preset bounds, adding a retry loop to a network call, suppressing a noisy log line.&lt;/p&gt;

&lt;p&gt;Classes that are usually &lt;em&gt;not&lt;/em&gt; safe to permit at L4: any data-plane change, any production traffic routing change, any secret or RBAC change, any change touching the policy engine itself.&lt;/p&gt;
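
&lt;p&gt;The gate logic itself is small. Here is a minimal sketch in plain Python, assuming hypothetical class names that loosely follow the lists above. In a real deployment this decision would live in the policy engine (OPA, Kyverno, or Casbin), not in application code; the sketch only shows the shape of the rule: deny-listed classes always go to a PR, unknown classes default to a PR, and only allow-listed classes inside their bounds are auto-applied.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical class names, loosely following the classes discussed above.
AUTO_APPLY_CLASSES = frozenset({"restart_pod", "scale_within_bounds", "suppress_noisy_log"})
ALWAYS_GATED_CLASSES = frozenset({"data_plane", "traffic_routing", "secret_or_rbac", "policy_engine"})


@dataclass(frozen=True)
class Decision:
    action: str  # "auto_apply" or "open_pr"
    reason: str


def evaluate(fix_class, replicas=None, replica_bounds=range(2, 11), audit_log=None):
    """Decide whether a proposed fix may be applied without human approval."""
    if fix_class in ALWAYS_GATED_CLASSES:
        decision = Decision("open_pr", "class is never auto-applied")
    elif fix_class not in AUTO_APPLY_CLASSES:
        decision = Decision("open_pr", "class is not on the allowlist")
    elif fix_class == "scale_within_bounds" and replicas not in replica_bounds:
        decision = Decision("open_pr", "replica count is outside the preset bounds")
    else:
        decision = Decision("auto_apply", "class is on the allowlist")
    # Record every decision, including the ones routed to human review.
    if audit_log is not None:
        audit_log.append((fix_class, decision.action, decision.reason))
    return decision
```

&lt;p&gt;Defaulting unknown classes to &lt;code&gt;open_pr&lt;/code&gt; is the important design choice: a new kind of fix falls back to L3 review until someone deliberately adds it to the allowlist.&lt;/p&gt;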

&lt;p&gt;The reason L4 is contentious is that the policy gate is now a high-value target. An attacker who can broaden the policy can broaden the agent's blast radius. The same threat model we walk through in our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; applies, plus an additional layer: the policy engine must be operated with the same rigor as the orchestration plane itself.&lt;/p&gt;

&lt;p&gt;Almost no production teams in 2026 run pure L4. The credible deployments are &lt;strong&gt;L3 with hardcoded L4 exceptions&lt;/strong&gt; for two or three well-understood remediation classes. That's where to aim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;A short list of failure modes we have seen — in our own work and in customer deployments.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediating &lt;em&gt;into&lt;/em&gt; a worse state.&lt;/strong&gt; The classic failure is auto-scaling a service to handle elevated error rates that are themselves caused by a downstream dependency. The service scales, hammers the dependency harder, and the dependency collapses. &lt;strong&gt;Fix:&lt;/strong&gt; never auto-remediate without dependency-graph awareness. Aurora uses &lt;a href="https://memgraph.com/" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; for this; HolmesGPT uses its toolset structure; pure-L1 stacks should require manual escalation when the failure crosses service boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trusting the AnalysisTemplate metric too much.&lt;/strong&gt; A 1% error-rate threshold on a service where tail (P99) latency matters is meaningless if your real failure mode is requests that stall rather than fail: stalled requests never register as errors, so the ratio stays clean. &lt;strong&gt;Fix:&lt;/strong&gt; model what user-visible failure actually looks like, not what the cleanest Prometheus query produces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letting the agent run unbounded retries.&lt;/strong&gt; AI agents that hit a "this didn't work" signal will often retry — sometimes thousands of times — burning tokens and triggering downstream rate limits. &lt;strong&gt;Fix:&lt;/strong&gt; cap the agent's tool-call budget. Aurora's executor enforces this by default; verify your agent does the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the post-mortem.&lt;/strong&gt; Auto-remediation that "just worked" without a clear human review of what happened is a slow path to brittleness. Every auto-remediation event should produce a postmortem the on-call reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflating auto-remediation with "self-healing infra".&lt;/strong&gt; Kubernetes pod restarts are not auto-remediation. They are a runtime affordance. Auto-remediation is the response to a &lt;em&gt;change-driven&lt;/em&gt; failure — the deploy, the config push, the schema migration. Keep the categories separate.&lt;/li&gt;
&lt;/ol&gt;
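
&lt;p&gt;For the unbounded-retries pitfall, a tool-call budget can be as simple as a counter wrapped around every tool invocation. The sketch below is illustrative only; it is not Aurora's executor, and the class names are made up for this example.&lt;/p&gt;

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when the agent tries to exceed its tool-call budget."""


class ToolCallBudget:
    """Wraps tool invocations and refuses to run more than `max_calls` of them."""

    def __init__(self, max_calls):
        self.remaining = max(0, max_calls)

    def call(self, tool, *args, **kwargs):
        if self.remaining == 0:
            raise ToolBudgetExceeded("tool-call budget exhausted; escalate to a human")
        # Spend the budget before running the tool so failed calls count too.
        self.remaining -= 1
        return tool(*args, **kwargs)
```

&lt;p&gt;Charging the budget before the call matters: a tool that keeps raising errors still drains the budget, which is exactly the retry loop this cap is meant to stop.&lt;/p&gt;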

&lt;h2&gt;
  
  
  A pragmatic 90-day path to auto-remediation
&lt;/h2&gt;

&lt;p&gt;For a team currently at L0 or L1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 1–14: instrument and detect
&lt;/h3&gt;

&lt;p&gt;Pick your three highest-traffic services. Add or harden:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic checks that exercise the user-visible path.&lt;/li&gt;
&lt;li&gt;One Prometheus error-rate metric per service with a clear threshold.&lt;/li&gt;
&lt;li&gt;A canary or blue-green rollout primitive (&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt; or &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal at end of week 2: a controlled bad deploy auto-rolls back without human intervention.&lt;/p&gt;
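&lt;p&gt;The rollback decision itself is simple to reason about. The sketch below is loosely modeled on a failure-limit style analysis and is not the implementation used by Argo Rollouts or Flagger; &lt;code&gt;should_roll_back&lt;/code&gt;, the 1% threshold, and the limit of two failed samples are all assumptions chosen for illustration.&lt;/p&gt;

```python
def should_roll_back(error_rates, threshold=0.01, failure_limit=2):
    """Return True once more than `failure_limit` samples exceed `threshold`.

    `error_rates` is a list of per-interval error ratios (0.0 to 1.0),
    for example the results of a Prometheus query sampled once a minute.
    """
    failures = sum(1 for rate in error_rates if rate > threshold)
    return failures > failure_limit


# A controlled bad deploy: three of four samples breach the 1% threshold.
bad_deploy = should_roll_back([0.002, 0.05, 0.08, 0.2])

# A noisy but healthy deploy: a single breach stays within the limit.
healthy_deploy = should_roll_back([0.002, 0.05, 0.004])
```

&lt;p&gt;Tolerating a small number of breaches keeps one noisy sample from triggering a rollback, at the cost of reacting a few intervals later.&lt;/p&gt;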

&lt;h3&gt;
  
  
  Days 15–45: wire in the agent
&lt;/h3&gt;

&lt;p&gt;Deploy one of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, or &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; in read-only mode. Configure rollback events to webhook the agent. Have it post an RCA to your incident channel within five minutes of rollback.&lt;/p&gt;

&lt;p&gt;Goal at end of week 6: every rollback comes with a written diagnostic before the human acknowledges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 46–75: add agent-proposed remediation
&lt;/h3&gt;

&lt;p&gt;Enable PR-creation for the agent (&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; on-incident-completion trigger, or HolmesGPT Operator mode). Constrain initial scope to one repo and one class of fix (alert thresholds, retry loops, log suppression). Review every PR for the first two weeks.&lt;/p&gt;

&lt;p&gt;Goal at end of week 11: agent opens correct PRs in 70%+ of fired rollbacks. False-positive PRs are caught at code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 76–90: policy-gate one fix class for L4
&lt;/h3&gt;

&lt;p&gt;Pick the safest class, usually widening an alert threshold when the alert has fired more than N times in M hours with a mean time to acknowledge (TTA) above some bound. Define an OPA / Kyverno policy that permits &lt;em&gt;only that class.&lt;/em&gt; Wire the agent to apply the change directly when the policy permits it and to raise a PR otherwise.&lt;/p&gt;
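&lt;p&gt;The routing logic behind such a gate can be sketched as follows. In a real deployment the rule would live in OPA or Kyverno rather than in application code; this Python version only illustrates the decision. &lt;code&gt;ProposedFix&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, the class name, and the values chosen for N, M, and the mean time-to-acknowledge bound are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedFix:
    fix_class: str
    fire_count: int          # times the alert fired inside the window
    window_hours: float      # length of the lookback window
    mean_tta_minutes: float  # mean time to acknowledge


# Assumed policy values; tune N, M, and the bound for your own alerts.
ALLOWED_CLASS = "alert-threshold-widening"
MIN_FIRES = 5
MAX_WINDOW_HOURS = 24
MIN_MEAN_TTA_MINUTES = 30


def route(fix):
    """Return "apply" only for the single permitted class, else "pull_request"."""
    permitted = (
        fix.fix_class == ALLOWED_CLASS
        and fix.fire_count > MIN_FIRES
        and MAX_WINDOW_HOURS >= fix.window_hours
        and fix.mean_tta_minutes > MIN_MEAN_TTA_MINUTES
    )
    return "apply" if permitted else "pull_request"


noisy_alert = ProposedFix("alert-threshold-widening", 8, 24, 45)
other_class = ProposedFix("retry-loop", 8, 24, 45)
```

&lt;p&gt;Everything that is not explicitly permitted falls through to the PR path, so a new fix class stays at L3 until someone deliberately adds it to the policy.&lt;/p&gt;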

&lt;p&gt;Goal at end of week 12: one L4 lane open for one fix class with full audit trail.&lt;/p&gt;

&lt;p&gt;This is the conservative path. Aggressive teams have moved faster, but we have not seen anyone skip steps successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DORA reality check
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program's published guidance&lt;/a&gt; is blunt about what good looks like. Historical State of DevOps Reports have consistently shown the same shape of distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Failure Rate&lt;/strong&gt;: top performers maintain low single-digit percentages; lower performers see substantially higher rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed Deployment Recovery Time (FDRT)&lt;/strong&gt;: top performers recover in under one hour; lower performers can take days to weeks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DORA's research has also consistently found that &lt;strong&gt;speed and stability reinforce each other rather than trade off&lt;/strong&gt;: the fastest teams are also the most stable, per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's history of metrics&lt;/a&gt; and successive State of DevOps Reports. Auto-remediation is one of the few capabilities that move teams across these tiers without requiring deeper organizational change. The L1→L2 jump alone shortens FDRT because the human is no longer reconstructing context; the agent has already done it.&lt;/p&gt;
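&lt;p&gt;Both metrics are easy to compute from deployment records, which makes it practical to track them before and after each step of the rollout. The sketch below assumes a hypothetical record shape with &lt;code&gt;failed_at&lt;/code&gt; and &lt;code&gt;recovered_at&lt;/code&gt; timestamps and uses made-up sample data; it is not DORA's official tooling.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import median


def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure in production."""
    failed = sum(1 for d in deploys if d["failed_at"] is not None)
    return failed / len(deploys)


def median_recovery_time(deploys):
    """Median time from failure to recovery across failed deployments."""
    durations = [
        d["recovered_at"] - d["failed_at"]
        for d in deploys
        if d["failed_at"] is not None
    ]
    return median(durations)


# Made-up sample: five deployments, two of which failed.
t0 = datetime(2026, 5, 1, 9, 0)
deploys = [
    {"failed_at": None, "recovered_at": None},
    {"failed_at": t0, "recovered_at": t0 + timedelta(minutes=30)},
    {"failed_at": None, "recovered_at": None},
    {"failed_at": t0 + timedelta(days=1),
     "recovered_at": t0 + timedelta(days=1, minutes=50)},
    {"failed_at": None, "recovered_at": None},
]

cfr = change_failure_rate(deploys)
fdrt = median_recovery_time(deploys)
```

&lt;p&gt;For this sample the change failure rate is 0.4 and the median recovery time is 40 minutes.&lt;/p&gt;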

&lt;h2&gt;
  
  
  Where this is heading
&lt;/h2&gt;

&lt;p&gt;Two predictions, each with a reasonable evidence base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The L2 → L3 transition becomes table stakes within 18 months.&lt;/strong&gt; Agent-authored PRs are already merging in production at multiple companies in our network. Once the review surface is the same as for human-authored PRs (which it already is via GitHub / Bitbucket / GitLab), there is no organizational reason not to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. L4 stays narrow.&lt;/strong&gt; The threat surface of agent-applied changes is genuinely scary, and the per-incident savings of going from L3 to L4 are smaller than the savings from L1 to L2. Expect L4 to be the place where one or two well-understood fix classes get automated, while everything else stays L3.&lt;/p&gt;

&lt;p&gt;The teams who win in 2026-2027 are the ones who get to credible L3 first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;Aurora is the AI agent layer of an auto-remediation stack — it covers L2 (investigation), L3 (PR-based remediation via &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt;), and the agent half of L4 (policy-gated remediation). It does not replace Argo Rollouts or Flagger at L1; those remain the foundation. Aurora is the difference between rolling back blind and rolling back with a written RCA and a draft PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions launch:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions: User-Defined Background Automations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSS comparison:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety architecture:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AI Agent kubectl Safety: Sandboxed Execution for Production</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:44:12 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Giving an AI agent kubectl access is an architecture decision, not a permission flag.&lt;/strong&gt; Per-permission gates fail under prompt injection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OWASP ranks "Excessive Agency" as LLM06 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;2025 Top 10 for LLM Applications&lt;/a&gt;&lt;/strong&gt; and "Tool Misuse and Exploitation" as ASI02 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;2026 Top 10 for Agentic Applications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Kubernetes ecosystem already has an answer&lt;/strong&gt;: &lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; provides a declarative API for isolated agent runtimes using gVisor or Kata Containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real precedent exists.&lt;/strong&gt; &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;EchoLeak (CVE-2025-32711)&lt;/a&gt;, CVSS 9.3, was the first publicly documented zero-click prompt-injection data exfiltration in a production LLM system. The kubectl analogue would be cluster-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora runs every &lt;code&gt;kubectl&lt;/code&gt; command in a pod-isolated process&lt;/strong&gt; via its &lt;code&gt;terminal_run&lt;/code&gt; primitive, with an environment-variable allowlist that strips secrets, signature-matcher and LLM-judge guardrails, and per-invocation cloud credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ products marketed as "AI SRE" in 2026&lt;/a&gt;, only a handful publicly document their kubectl execution architecture — and the gap between vendors that handle this well and vendors that handle it badly is the single largest unspoken risk in the category. &lt;strong&gt;AI agent kubectl safety is the architectural discipline of letting an AI agent run &lt;code&gt;kubectl&lt;/code&gt; (or any cloud CLI) against production without inheriting cluster-wide blast radius if the agent is compromised.&lt;/strong&gt; It is not the same as RBAC scoping, and it is not the same as a human approval prompt — both are necessary but neither is sufficient on its own.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;OWASP published its 2025 Top 10 for LLM Applications&lt;/a&gt;, it ranked &lt;strong&gt;Prompt Injection (LLM01)&lt;/strong&gt; as the top risk and &lt;strong&gt;Excessive Agency (LLM06)&lt;/strong&gt; as one of the most consequential — defining it across three root causes: excessive functionality, excessive permissions, and excessive autonomy. In December 2025, OWASP followed up with a &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;dedicated Top 10 for Agentic Applications&lt;/a&gt; that names &lt;strong&gt;Tool Misuse and Exploitation (ASI02)&lt;/strong&gt; and &lt;strong&gt;Identity and Privilege Abuse (ASI03)&lt;/strong&gt; as primary attack surfaces.&lt;/p&gt;

&lt;p&gt;Translation: if you give an AI agent the ability to run &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; commands against production, you have a security architecture problem — not a permissions problem. This guide walks through the threat model, the emerging Kubernetes sandboxing standard, and how to evaluate any AI SRE on its kubectl safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can go wrong when AI agents run kubectl?
&lt;/h2&gt;

&lt;p&gt;Any LLM-driven agent that executes commands inherits the security properties of the LLM, the harness, and the runtime. Three real-world precedents illustrate the failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EchoLeak (CVE-2025-32711)&lt;/strong&gt; — Microsoft 365 Copilot, CVSS 9.3 critical, &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;patched in June 2025&lt;/a&gt;. Discovered by Aim Security, it was the first publicly documented zero-click indirect prompt-injection data exfiltration in a production LLM system. A crafted email sat in Outlook; when the user later asked Copilot for an unrelated summary, the email's hidden instructions fired and exfiltrated SharePoint, OneDrive, and Teams data. Research paper: &lt;a href="https://arxiv.org/abs/2509.10540" rel="noopener noreferrer"&gt;arXiv:2509.10540&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MITRE ATLAS prompt-injection techniques&lt;/strong&gt; — &lt;a href="https://atlas.mitre.org/" rel="noopener noreferrer"&gt;MITRE ATLAS&lt;/a&gt; catalogues real-world adversary techniques against AI systems, including indirect prompt injection that turns an LLM with tool access into an attacker-controlled execution surface. The framework specifically documents techniques for exfiltration via AI agent tool invocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Session Smuggling&lt;/strong&gt; — Palo Alto Unit 42 (November 2025) demonstrated rogue agents exploiting trust in the Agent-to-Agent (A2A) protocol with multi-turn manipulation. Documented in OWASP's Agentic Top 10.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these specifically targeted kubectl-running agents in production — but the class is the same and the blast radius would be larger. An agent that can run &lt;code&gt;kubectl delete&lt;/code&gt; is one prompt-injection payload away from a cluster-wide outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Attack Surfaces of Agentic kubectl
&lt;/h2&gt;

&lt;p&gt;Most teams think of kubectl agent safety as a single problem ("can the agent be tricked?"). It's actually four distinct attack surfaces, each requiring its own mitigation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Why permission-scoping alone fails&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Prompt injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hidden instructions in logs, alerts, runbooks, or chat coerce the agent&lt;/td&gt;
&lt;td&gt;Compromised agent acts within its granted permissions, which is exactly what permission-scoping permits&lt;/td&gt;
&lt;td&gt;Sandboxed runtime; never trust LLM output derived from data the LLM read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Credential leakage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executed command reads &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt; from inherited env&lt;/td&gt;
&lt;td&gt;Permissions live on credentials; if the credential leaks, the permission set leaks with it&lt;/td&gt;
&lt;td&gt;Per-invocation short-lived credentials (STS, Service Principal); explicit env allowlist that strips secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Blast radius escalation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Legitimate command runs against wrong namespace, region, or cluster&lt;/td&gt;
&lt;td&gt;Permissions don't model "right action, wrong target"&lt;/td&gt;
&lt;td&gt;Default read-only; dependency-graph awareness; human approval for destructive writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Audit trail gaps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logs capture commands without the agent's reasoning&lt;/td&gt;
&lt;td&gt;Permission systems audit "who ran what," not "why"&lt;/td&gt;
&lt;td&gt;Per-investigation transcripts that link reasoning → tool calls → outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Attack Surface 1: Prompt injection
&lt;/h3&gt;

&lt;p&gt;The agent reads a log line, alert payload, runbook, or chat message that contains hidden instructions. The LLM cannot reliably distinguish data from instructions in the same channel — this is the fundamental property OWASP's LLM01 captures. Even frontier models do not eliminate it. Anthropic has publicly stated that "no browser agent is immune to prompt injection" and publishes &lt;a href="https://www.anthropic.com/news/prompt-injection-defenses" rel="noopener noreferrer"&gt;defense benchmarks&lt;/a&gt; showing measurable but imperfect attack-prevention rates across computer-use, bash tool use, and MCP workflows. The implication for kubectl-running agents is clear: &lt;strong&gt;the LLM is not the security boundary. The runtime is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mitigation: never trust LLM output that originates from data the LLM also read. Sandbox the execution layer so even a successful injection has limited blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attack Surface 2: Credential leakage
&lt;/h3&gt;

&lt;p&gt;If the agent runs commands with credentials inherited from the host process environment (&lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;), a successful command-injection or shell escape exposes everything the agent process has access to. Long-lived static credentials make this catastrophic.&lt;/p&gt;

&lt;p&gt;Mitigation: per-invocation credential scoping. AWS STS AssumeRole, Azure Service Principal sessions, GCP short-lived tokens. Strip everything else from the child process environment with an explicit allowlist.&lt;/p&gt;
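&lt;p&gt;The allowlist half of that mitigation can be sketched with the standard library alone. The key set below is an illustrative assumption, &lt;code&gt;sanitized_env&lt;/code&gt; and &lt;code&gt;run_sanitized&lt;/code&gt; are hypothetical helper names, and the demo assumes a Unix-like host where a child Python process starts with a minimal environment.&lt;/p&gt;

```python
import os
import subprocess
import sys

# Assumed allowlist; anything not named here never reaches the child process.
SAFE_ENV_KEYS = {"PATH", "HOME", "LANG", "TMPDIR"}


def sanitized_env(parent_env):
    """Keep only allowlisted variables from the parent environment."""
    return {k: v for k, v in parent_env.items() if k in SAFE_ENV_KEYS}


def run_sanitized(argv, parent_env=None):
    """Run argv without a shell, passing only the allowlisted environment."""
    source = os.environ if parent_env is None else parent_env
    return subprocess.run(
        argv,
        env=sanitized_env(source),
        capture_output=True,
        text=True,
        check=False,
    )


# Simulate an agent process that holds a secret in its environment.
parent = dict(os.environ, VAULT_TOKEN="not-a-real-token")
probe = "import os; print('VAULT_TOKEN' in os.environ)"
result = run_sanitized([sys.executable, "-c", probe], parent)
```

&lt;p&gt;An allowlist fails closed: a secret added to the agent's environment next year is stripped by default, whereas a denylist would have to be updated to catch it.&lt;/p&gt;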

&lt;h3&gt;
  
  
  Attack Surface 3: Blast radius escalation
&lt;/h3&gt;

&lt;p&gt;Even legitimate, non-injected commands can have outsized effects. &lt;code&gt;kubectl delete pod&lt;/code&gt; on the wrong namespace. &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt; against a misidentified region. The agent doesn't need to be compromised — it just needs to be wrong.&lt;/p&gt;

&lt;p&gt;Mitigation: read-only by default, write actions behind explicit human approval, and dependency-graph awareness so the agent can compute blast radius before acting.&lt;/p&gt;
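&lt;p&gt;A read-only default can be enforced with a small classifier in front of the executor. This is a simplified sketch: &lt;code&gt;classify&lt;/code&gt; and the verb set are assumptions, the input is an argument list meant to be executed without a shell, and a read-only verb is not automatically harmless (&lt;code&gt;kubectl get&lt;/code&gt; on a secret still discloses it).&lt;/p&gt;

```python
# kubectl verbs treated as read-only in this sketch.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version"}


def classify(argv):
    """Classify an argument list as "allow", "needs_approval", or "deny".

    The argument list is assumed to be run with shell=False, so shell
    metacharacters inside an argument are never interpreted.
    """
    if not argv or argv[0] != "kubectl":
        return "deny"
    # Treat the first non-flag token after `kubectl` as the verb. A leading
    # flag value such as `-n prod` is misread as the verb, which fails safe
    # by routing the command to approval.
    verb = next((tok for tok in argv[1:] if not tok.startswith("-")), None)
    if verb in READ_ONLY_VERBS:
        return "allow"
    return "needs_approval"


read_cmd = classify(["kubectl", "get", "pods", "-n", "prod"])
write_cmd = classify(["kubectl", "delete", "pod", "api-0"])
```

&lt;p&gt;Unknown verbs land in the approval queue rather than being executed, so the gate stays conservative when kubectl gains new subcommands or plugins.&lt;/p&gt;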

&lt;h3&gt;
  
  
  Attack Surface 4: Audit trail gaps
&lt;/h3&gt;

&lt;p&gt;When an investigation runs across 20+ tool invocations, traditional audit systems (CloudTrail, Kubernetes audit logs) record what was run but not why. A reviewer six months later cannot tell whether a &lt;code&gt;kubectl scale&lt;/code&gt; was a legitimate response to a load spike or an injected instruction.&lt;/p&gt;

&lt;p&gt;Mitigation: structured per-investigation transcripts that capture agent reasoning alongside tool calls. The right log isn't "kubectl was run" — it's "in response to alert X, the agent hypothesized Y, ran kubectl Z, and observed W."&lt;/p&gt;
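&lt;p&gt;One way to structure such a transcript is one JSON line per step, keyed by investigation. The field names below are hypothetical and simply mirror the alert, hypothesis, command, and observation chain described above.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TranscriptStep:
    alert: str        # what triggered the investigation
    hypothesis: str   # what the agent believed at this step
    command: list     # the exact argument list that was executed
    observation: str  # what the agent concluded from the output


def to_audit_line(investigation_id, step_index, step):
    """Serialize one step as a single JSON line for an append-only log."""
    record = {
        "investigation_id": investigation_id,
        "step": step_index,
        **asdict(step),
    }
    return json.dumps(record, sort_keys=True)


step = TranscriptStep(
    alert="HighErrorRate on checkout",
    hypothesis="Recent deploy exhausted the connection pool",
    command=["kubectl", "logs", "deploy/checkout", "-n", "prod"],
    observation="Pool timeout errors begin at the deploy timestamp",
)
line = to_audit_line("inv-42", 1, step)
```

&lt;p&gt;Because every line carries the investigation ID and step index, a reviewer can join these records against CloudTrail or Kubernetes audit logs and recover the reasoning behind each command.&lt;/p&gt;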

&lt;h2&gt;
  
  
  Why "human approval" alone is not enough
&lt;/h2&gt;

&lt;p&gt;The most common safety story in the AI SRE space is "the agent suggests; humans approve." That is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;The problem with approval gates as the only line of defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision fatigue.&lt;/strong&gt; An agent that handles 50 alerts a week generates dozens of approval prompts. Humans rubber-stamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval ≠ understanding.&lt;/strong&gt; Engineers approve commands they don't fully understand because the agent's reasoning sounds plausible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injected intent looks legitimate.&lt;/strong&gt; A prompt-injection payload can produce a recommendation that &lt;em&gt;reads&lt;/em&gt; exactly like a normal RCA. The approver has no signal that the underlying instruction came from an attacker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Approval gates are critical, but they need to sit on top of an already-sandboxed runtime — not be the only protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permission scoping vs sandboxed execution: what's the difference?
&lt;/h2&gt;

&lt;p&gt;These two terms get conflated. They aren't the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission scoping&lt;/strong&gt; restricts what an agent's identity can do. RBAC roles, IAM policies, kubeconfig contexts. It's necessary, but it operates at the cluster-API layer — meaning a successful prompt injection can still use every permission the agent has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxed execution&lt;/strong&gt; isolates the &lt;em&gt;runtime&lt;/em&gt; in which commands execute. If the agent's process is compromised, the sandbox limits what the compromised process can do regardless of the credentials it holds. The compromised process can't read other pods' files, can't reach other nodes, can't escalate to the host kernel.&lt;/p&gt;

&lt;p&gt;The defensible architecture combines both: tight permission scoping (small RBAC role, short-lived credentials) + runtime isolation (sandboxed execution).&lt;/p&gt;

&lt;h2&gt;
  
  
  How sandboxed kubectl actually works
&lt;/h2&gt;

&lt;p&gt;The Kubernetes ecosystem began converging on this pattern in 2025–2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  k8s-sigs/agent-sandbox
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; is a formal Kubernetes SIG Apps subproject that launched at KubeCon Atlanta in November 2025. It provides a declarative Kubernetes API for "isolated, stateful, singleton workloads" — built specifically for AI agent runtimes that may execute untrusted, LLM-generated code.&lt;/p&gt;

&lt;p&gt;Core CRDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Sandbox&lt;/code&gt; — an isolated pod-equivalent with stronger boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxTemplate&lt;/code&gt; — reusable configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxClaim&lt;/code&gt; — request a sandbox for a workload&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxWarmPool&lt;/code&gt; — pre-created sandboxes that bring cold-start under one second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/" rel="noopener noreferrer"&gt;Kubernetes blog post from March 2026&lt;/a&gt; makes the architectural claim explicit: "Isolation achieved via runtime-level sandboxing (gVisor/Kata), not just container-level namespaces."&lt;/p&gt;

&lt;h3&gt;
  
  
  gVisor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; is a Google-maintained user-space application kernel that provides kernel-level isolation without full virtualization. Architecture: &lt;strong&gt;Sentry&lt;/strong&gt; (an application kernel written in Go) intercepts the sandboxed workload's system calls and implements roughly 200 of them itself instead of passing them to the host kernel; &lt;strong&gt;Gofer&lt;/strong&gt; brokers filesystem access on the Sentry's behalf (originally over 9P, replaced by LISAFS in newer releases). The OCI runtime is &lt;code&gt;runsc&lt;/code&gt;, which can be configured as a drop-in alternative to &lt;code&gt;runc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;gVisor runs in production at Google for App Engine standard, Cloud Functions, Cloud Run, and Cloud ML Engine. GKE Sandbox productizes it for GKE node pools. It is one of two named isolation backends in agent-sandbox (the other being Kata Containers, which uses lightweight VMs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters for AI SRE
&lt;/h3&gt;

&lt;p&gt;An AI SRE that runs &lt;code&gt;kubectl&lt;/code&gt; against production is exactly the kind of workload agent-sandbox was built for. It executes LLM-generated commands. It needs file system isolation, syscall isolation, and per-invocation credential scoping. It benefits enormously from a warm pool that reduces cold-start latency.&lt;/p&gt;

&lt;p&gt;If you are evaluating an AI SRE in 2026, this is one of the right questions to ask: &lt;em&gt;what isolation backend does the agent use when it executes commands?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aurora's pod-isolated execution works
&lt;/h2&gt;

&lt;p&gt;Aurora's approach predates agent-sandbox and follows the same architectural principles.&lt;/p&gt;

&lt;p&gt;When Aurora's agent runs a &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; command, it doesn't use &lt;code&gt;subprocess.run()&lt;/code&gt; directly. It uses an internal primitive called &lt;code&gt;terminal_run&lt;/code&gt;, defined in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;&lt;code&gt;server/utils/terminal/terminal_run.py&lt;/code&gt;&lt;/a&gt;. The module's docstring is explicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Drop-in replacement for subprocess.run() that executes in terminal pods. This module provides a terminal_run() function that mimics subprocess.run() API but executes commands in isolated terminal pods via kubectl exec. Safety guardrails (signature matcher + LLM judge) run automatically unless the caller passes &lt;code&gt;trusted=True&lt;/code&gt; for known-safe internal operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three properties matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pod-isolated execution.&lt;/strong&gt; When the &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; flag is set (the default in Kubernetes deployments), every external command runs inside a separate terminal pod via &lt;code&gt;kubectl exec&lt;/code&gt;. The agent's own process never executes the command directly. A successful command-injection in the agent's reasoning loop does not give an attacker access to the agent host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Two-stage safety guardrails.&lt;/strong&gt; Before any non-trusted command runs, two checks fire automatically: a deterministic signature matcher that rejects known-dangerous patterns, and an LLM judge that evaluates the proposed command against the investigation context. The &lt;code&gt;trusted=True&lt;/code&gt; flag bypasses both — used only for known-safe internal operations like configured connector calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sanitized environment allowlist.&lt;/strong&gt; Aurora's &lt;code&gt;terminal_exec_tool&lt;/code&gt; module defines an explicit &lt;code&gt;_SAFE_ENV_KEYS&lt;/code&gt; set: &lt;code&gt;PATH&lt;/code&gt;, &lt;code&gt;HOME&lt;/code&gt;, &lt;code&gt;USER&lt;/code&gt;, &lt;code&gt;SHELL&lt;/code&gt;, &lt;code&gt;TERM&lt;/code&gt;, &lt;code&gt;LANG&lt;/code&gt;, &lt;code&gt;TMPDIR&lt;/code&gt;, &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;, plus &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; itself. Everything else — including &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;DATABASE_URL&lt;/code&gt;, &lt;code&gt;SECRET_KEY&lt;/code&gt;, and any cloud credentials — is stripped from the child process environment. A compromised command cannot read the agent's secrets via &lt;code&gt;env&lt;/code&gt;.&lt;/p&gt;
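&lt;p&gt;The two-stage guardrail in point 2 can be sketched as a deterministic check followed by a pluggable judge. This is not Aurora's code: the patterns are illustrative assumptions, &lt;code&gt;guard&lt;/code&gt; and &lt;code&gt;matches_dangerous_signature&lt;/code&gt; are hypothetical names, and the LLM judge is represented by any callable that returns true or false.&lt;/p&gt;

```python
import re

# Illustrative signatures only; a real deny set would be far larger.
DANGEROUS_PATTERNS = [
    re.compile(r"\bkubectl\s+delete\s+(ns|namespace)\b"),
    re.compile(r"\bkubectl\s+.*--all(\s|$)"),
    re.compile(r"\brm\s+-rf\s+/"),
]


def matches_dangerous_signature(command):
    """Stage 1: deterministic match against known-dangerous patterns."""
    return any(p.search(command) for p in DANGEROUS_PATTERNS)


def guard(command, judge):
    """Run the cheap deterministic stage first, then the contextual judge."""
    if matches_dangerous_signature(command):
        return "rejected_by_signature"
    # Stage 2: `judge` stands in for an LLM call that weighs the command
    # against the investigation context.
    if not judge(command):
        return "rejected_by_judge"
    return "allowed"


def approve_everything(command):
    return True


blocked = guard("kubectl delete namespace prod", approve_everything)
passed = guard("kubectl get pods --all-namespaces", approve_everything)
```

&lt;p&gt;Running the signature matcher first means a known-dangerous command is rejected even if the judge has been talked into approving it.&lt;/p&gt;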

&lt;p&gt;Cloud credentials are handled separately. Aurora calls &lt;code&gt;generate_contextual_access_token&lt;/code&gt; and &lt;code&gt;generate_azure_access_token&lt;/code&gt; per invocation. AWS uses STS AssumeRole via cross-account roles (&lt;a href="https://github.com/Arvo-AI/aurora/tree/main/server/connectors/aws_connector" rel="noopener noreferrer"&gt;&lt;code&gt;aurora-cross-account-role.yaml&lt;/code&gt;&lt;/a&gt;) — short-lived credentials, not long-lived access keys. Azure uses Service Principal sessions. GCP uses OAuth-derived tokens.&lt;/p&gt;

&lt;p&gt;For agents that need to reach customer Kubernetes clusters Aurora can't access directly, a separate &lt;a href="https://github.com/Arvo-AI/aurora/tree/main/kubectl-agent" rel="noopener noreferrer"&gt;&lt;code&gt;kubectl-agent&lt;/code&gt;&lt;/a&gt; binary deploys via Helm into the customer's cluster and connects outbound over WebSocket. No inbound network access required, no kubeconfig sharing, no static credentials at rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate an AI SRE's kubectl safety model
&lt;/h2&gt;

&lt;p&gt;Eight questions to ask any AI SRE vendor or open-source project before enabling production access:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where does the command actually execute?&lt;/strong&gt; Same process as the agent? Same host? Separate container? Sandboxed runtime (gVisor/Kata)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What credentials does the command inherit from the host environment?&lt;/strong&gt; Specifically: can the executed command read your agent's vault token, database URL, or other host secrets?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are credentials short-lived or static?&lt;/strong&gt; STS / Service Principal sessions, or long-lived access keys?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the default read-only?&lt;/strong&gt; What flag, configuration, or RBAC role enables write access?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens between "agent decides to run X" and "X runs"?&lt;/strong&gt; Is there a deterministic policy check? An LLM judge? A human approval prompt? All three?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are destructive actions specifically gated?&lt;/strong&gt; What's the definition of "destructive" — vendor-defined or operator-configurable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What does the audit trail capture?&lt;/strong&gt; Just the commands, or the agent's reasoning + the commands together?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the blast radius of a single successful prompt injection?&lt;/strong&gt; Walk through the worst case explicitly with the vendor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a vendor can't answer these clearly, the architecture isn't ready for production write access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions in 2026
&lt;/h2&gt;

&lt;p&gt;This is a young problem space. Several questions are not yet resolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization.&lt;/strong&gt; k8s-sigs/agent-sandbox is the leading candidate for a standard, but Knative Sandbox, container-level approaches, and microVM-based runtimes (Firecracker) are all in play.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud isolation.&lt;/strong&gt; Sandboxing a Kubernetes pod is a solved problem. Sandboxing a process that calls &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; across cloud APIs from a single agent is harder — the credentials and trust boundaries change per provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval UX at scale.&lt;/strong&gt; Engineers can't approve 200 actions per week. The right UI for batch approval, policy-based pre-approval, and rollback-only autonomy is still being figured out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expect significant movement on all three through 2026 and into 2027.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora's approach in summary
&lt;/h2&gt;

&lt;p&gt;If you operate an AI SRE in production, the safety questions are non-negotiable. Aurora's answer is: pod-isolated execution by default, deterministic + LLM-judge guardrails before any non-trusted command, environment-variable allowlist that strips secrets, per-invocation cloud credentials via STS/Service Principal/short-lived tokens, and human approval for destructive write operations. The full architecture is open source under Apache 2.0 — auditable in the &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For background on the agent and tool model, see the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;complete guide to AI SRE&lt;/a&gt;, the &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;open-source AI SRE comparison&lt;/a&gt;, or the explainer on &lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;agentic incident management&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:38:19 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three credible open-source AI SREs exist in 2026&lt;/strong&gt;: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox). All three are Apache 2.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only two are true multi-step agents.&lt;/strong&gt; HolmesGPT runs an iterative ReAct loop, and Aurora is a multi-step LangGraph agent with cross-cloud execution. K8sGPT is a rule-based scanner that uses an LLM only to explain findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only Aurora handles multi-cloud&lt;/strong&gt; out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two of the three can open remediation pull requests.&lt;/strong&gt; Aurora drafts PRs against GitHub and Bitbucket behind a human approval gate; HolmesGPT can open PRs with suggested fixes in Operator mode; K8sGPT is strictly read-only with no write actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three support BYO LLM&lt;/strong&gt;, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ companies offering "AI SRE" products in 2026&lt;/a&gt;, only a handful are open source — and only three are credible enough to deploy in production: &lt;strong&gt;Aurora&lt;/strong&gt;, &lt;strong&gt;HolmesGPT&lt;/strong&gt;, and &lt;strong&gt;K8sGPT&lt;/strong&gt;. &lt;strong&gt;An open-source AI SRE is an AI agent that performs incident investigation, root cause analysis, and (sometimes) remediation under a permissive license that allows self-hosting, source-code audit, and modification.&lt;/strong&gt; They get lumped together in marketing, but architecturally these three are different products solving different parts of the incident response problem.&lt;/p&gt;

&lt;p&gt;This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an open-source AI SRE?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;open-source AI SRE&lt;/strong&gt; is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hostable&lt;/strong&gt;: runs entirely inside your environment without phoning home to a vendor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-driven&lt;/strong&gt;: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, auditable AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why open source matters for AI SRE
&lt;/h2&gt;

&lt;p&gt;Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty.&lt;/strong&gt; Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs send all of it to their backend and to a third-party LLM. Self-hosted means it stays in your VPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit transparency.&lt;/strong&gt; Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost predictability.&lt;/strong&gt; Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens, and local inference via Ollama replaces per-token API charges with the fixed cost of hardware you run yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the three compare
&lt;/h2&gt;

&lt;p&gt;This is the only table you need. Verified from each project's GitHub repo, official docs, and source as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;201&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;2,366&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;7,737&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;v1.1.1 (Mar 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt/releases" rel="noopener noreferrer"&gt;0.26.0 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases" rel="noopener noreferrer"&gt;v0.4.32 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;Sandbox (Oct 2025)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arvo AI&lt;/td&gt;
&lt;td&gt;Robusta + Microsoft&lt;/td&gt;
&lt;td&gt;k8sgpt-ai community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangGraph supervisor + sub-agents&lt;/td&gt;
&lt;td&gt;ReAct loop (&lt;code&gt;ToolCallingLLM&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Rule-based scanner + LLM explainer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-step reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (single-shot per analyzer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway&lt;/td&gt;
&lt;td&gt;Kubernetes + AWS via MCP&lt;/td&gt;
&lt;td&gt;Kubernetes only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kubectl&lt;/code&gt; in sandboxed pods&lt;/td&gt;
&lt;td&gt;Read-only &lt;code&gt;kubectl get&lt;/code&gt;/&lt;code&gt;describe&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Read-only via Kube API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Other integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.)&lt;/td&gt;
&lt;td&gt;30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.)&lt;/td&gt;
&lt;td&gt;None — Kubernetes-only by design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge base / RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weaviate vector search over runbooks + postmortems&lt;/td&gt;
&lt;td&gt;Yes (via toolsets)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependency graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memgraph (cross-cloud blast radius)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postmortem generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, exports to Confluence&lt;/td&gt;
&lt;td&gt;Investigation reports only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pull request remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub + Bitbucket with human approval gate&lt;/td&gt;
&lt;td&gt;GitHub PRs in Operator mode&lt;/td&gt;
&lt;td&gt;None — strictly read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (340+ endpoints, 6 named tools)&lt;/td&gt;
&lt;td&gt;Yes (consumes MCP servers)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Air-gapped support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Ollama + image tarballs)&lt;/td&gt;
&lt;td&gt;Yes (Ollama)&lt;/td&gt;
&lt;td&gt;Yes (LocalAI / Ollama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker Compose or Helm&lt;/td&gt;
&lt;td&gt;Binary, API server, K8s Operator, Python SDK&lt;/td&gt;
&lt;td&gt;Go binary, K8s operator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The OSS AI SRE Maturity Spectrum
&lt;/h2&gt;

&lt;p&gt;A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What the agent does&lt;/th&gt;
&lt;th&gt;Tools at this level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Diagnostic Explainer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K8sGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Read-Only Investigator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HolmesGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Investigation + Suggestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;HolmesGPT (Operator mode)&lt;/strong&gt;, &lt;strong&gt;Aurora&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Investigation + Approved Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails — typically a sandboxed runtime with explicit human approval for destructive operations.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; (with the Bitbucket connector's &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;human approval gate&lt;/a&gt; for destructive actions)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No open-source tool today operates as a fully autonomous L5 (closed-loop remediation without human approval) — and that's by design. Most serious teams want explicit gates before agents touch production.&lt;/p&gt;
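
&lt;p&gt;One way to read the spectrum is as a permission policy: each level unlocks a few more action types, and infrastructure writes stay behind a human gate. The Python sketch below illustrates that reading. The action names and the &lt;code&gt;decide&lt;/code&gt; function are hypothetical and do not come from any of the three projects.&lt;/p&gt;

```python
from enum import IntEnum


class Level(IntEnum):
    """Maturity levels L1 to L4 from the table above."""
    DIAGNOSTIC_EXPLAINER = 1
    READ_ONLY_INVESTIGATOR = 2
    INVESTIGATION_AND_SUGGESTION = 3
    APPROVED_REMEDIATION = 4


# Hypothetical action names, grouped by the level that unlocks them.
READ_ACTIONS = frozenset({"explain_finding", "query_telemetry"})
ALLOWED_ACTIONS = {
    Level.DIAGNOSTIC_EXPLAINER: frozenset({"explain_finding"}),
    Level.READ_ONLY_INVESTIGATOR: READ_ACTIONS,
    Level.INVESTIGATION_AND_SUGGESTION: READ_ACTIONS | {"open_pull_request"},
    Level.APPROVED_REMEDIATION: READ_ACTIONS
    | {"open_pull_request", "rollback_deployment"},
}

# Writes to infrastructure always need a human, whatever the level.
NEEDS_APPROVAL = frozenset({"rollback_deployment"})


def decide(level, action, human_approved=False):
    """Return 'allow', 'needs_approval', or 'deny' for a requested action."""
    if action not in ALLOWED_ACTIONS[level]:
        return "deny"
    if action in NEEDS_APPROVAL and not human_approved:
        return "needs_approval"
    return "allow"


if __name__ == "__main__":
    print(decide(Level.READ_ONLY_INVESTIGATOR, "open_pull_request"))
    print(decide(Level.APPROVED_REMEDIATION, "rollback_deployment"))
```

&lt;p&gt;Unknown actions fall through to &lt;code&gt;deny&lt;/code&gt;, so the policy fails closed when the agent asks for something nobody anticipated.&lt;/p&gt;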

&lt;h2&gt;
  
  
  Aurora vs HolmesGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on where your incidents live: inside a Kubernetes-centric observability stack, or across several cloud providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick HolmesGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.&lt;/li&gt;
&lt;li&gt;You want a tool that already integrates with 30+ observability sources, including Loki, Alertmanager, New Relic, Datadog APM, Opsgenie, and Slack.&lt;/li&gt;
&lt;li&gt;You value CNCF governance and a fast-moving ecosystem.&lt;/li&gt;
&lt;li&gt;You don't need cross-cloud (AWS APIs, Azure resources, GCP services) reasoning out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems exported to Confluence.&lt;/li&gt;
&lt;li&gt;You want the agent to draft remediation PRs against your codebase.&lt;/li&gt;
&lt;li&gt;You need a graph-based blast radius model (Memgraph) for dependency analysis.&lt;/li&gt;
&lt;li&gt;You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora vs K8sGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;This is closer to "which tool category do you need?" than a head-to-head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick K8sGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as &lt;code&gt;k8sgpt analyze --explain&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.&lt;/li&gt;
&lt;li&gt;You want the maturity of a 7.7k-star CNCF Sandbox project whose rule-based analyzers detect problems deterministically, so findings are fixed before the LLM ever sees them and the LLM only explains them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need agentic investigation, not just diagnostic explanation.&lt;/li&gt;
&lt;li&gt;You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems and remediation PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  HolmesGPT vs K8sGPT — head-to-head
&lt;/h2&gt;

&lt;p&gt;Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step AI agent&lt;/td&gt;
&lt;td&gt;Rule-based scanner with LLM explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it shines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigating an alert end-to-end across signals&lt;/td&gt;
&lt;td&gt;Diagnosing why a specific resource is unhealthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds to minutes (multi-step)&lt;/td&gt;
&lt;td&gt;Sub-second per analyzer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (multiple calls per investigation)&lt;/td&gt;
&lt;td&gt;Lower (one explanation per finding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hallucination risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (agent reasons across signals)&lt;/td&gt;
&lt;td&gt;Lower (deterministic before LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-call engineers handling alerts&lt;/td&gt;
&lt;td&gt;Platform teams running periodic cluster audits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to use open-source AI SRE
&lt;/h2&gt;

&lt;p&gt;Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't have the operational capacity to run another stateful service in production.&lt;/li&gt;
&lt;li&gt;You want vendor support with SLAs and a phone number to call at 3 AM.&lt;/li&gt;
&lt;li&gt;Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE.&lt;/li&gt;
&lt;li&gt;You need certifications (SOC 2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to pilot an open-source AI SRE in your team
&lt;/h2&gt;

&lt;p&gt;A six-step, low-risk pilot for any of the three tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one cluster and one observability source.&lt;/strong&gt; Don't try to cover everything at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install in read-only mode first.&lt;/strong&gt; All three tools default to read-only — keep it that way for the first two weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect one alert source.&lt;/strong&gt; PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run for two weeks alongside human on-call.&lt;/strong&gt; Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed it your historical context.&lt;/strong&gt; Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand carefully.&lt;/strong&gt; Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.&lt;/li&gt;
&lt;/ol&gt;
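
&lt;p&gt;For step 4, the comparison can be as simple as a small script over a shared incident log. The sketch below assumes you record, per incident, the root cause named by the agent, the one your engineers settled on, and the minutes each took. The field names and the sample records are made-up placeholders, not measurements from any real pilot.&lt;/p&gt;

```python
from statistics import median

# Made-up pilot log: one record per incident handled during the pilot.
# agent_rca and human_rca are short root-cause labels agreed on in review;
# the *_minutes fields are time from alert to a stated root cause.
PILOT_LOG = [
    {"agent_rca": "bad_deploy", "human_rca": "bad_deploy",
     "agent_minutes": 4, "human_minutes": 35},
    {"agent_rca": "oom_kill", "human_rca": "oom_kill",
     "agent_minutes": 6, "human_minutes": 20},
    {"agent_rca": "dns_failure", "human_rca": "expired_cert",
     "agent_minutes": 9, "human_minutes": 50},
    {"agent_rca": "node_pressure", "human_rca": "node_pressure",
     "agent_minutes": 3, "human_minutes": 15},
]


def summarize(records):
    """Compute agreement rate and median time-to-RCA for agent and humans."""
    if not records:
        raise ValueError("no incidents recorded")
    matches = sum(1 for r in records if r["agent_rca"] == r["human_rca"])
    return {
        "incidents": len(records),
        "agreement_rate": matches / len(records),
        "median_agent_minutes": median(r["agent_minutes"] for r in records),
        "median_human_minutes": median(r["human_minutes"] for r in records),
    }


if __name__ == "__main__":
    for name, value in summarize(PILOT_LOG).items():
        print(name, value)
```

&lt;p&gt;Medians are used instead of means so that one unusually long incident does not dominate the comparison.&lt;/p&gt;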

&lt;h2&gt;
  
  
  Getting started with Aurora
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is the multi-cloud, multi-tool option among open-source AI SREs. To run it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora works with several LLM providers: OpenAI, Anthropic, Google, OpenRouter, or local models via Ollama for air-gapped deployments.&lt;/p&gt;

&lt;p&gt;For the technical side of running an agent that executes &lt;code&gt;kubectl&lt;/code&gt; against production, see the companion piece on &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI agent kubectl safety and sandboxed execution&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
    <item>
      <title>AI SRE: The Complete Guide for Engineering Teams in 2026</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Fri, 24 Apr 2026 21:37:36 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/ai-sre-the-complete-guide-for-engineering-teams-in-2026-51ba</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/ai-sre-the-complete-guide-for-engineering-teams-in-2026-51ba</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; An &lt;strong&gt;AI SRE (AI Site Reliability Engineer)&lt;/strong&gt; is an autonomous AI agent that triages alerts, investigates incidents, performs root cause analysis, and generates postmortems without step-by-step human direction. &lt;a href="https://www.gartner.com/en/documents/7242030" rel="noopener noreferrer"&gt;Gartner projects&lt;/a&gt; that by 2029, 70% of enterprises will deploy agentic AI to operate their IT infrastructure, up from less than 5% in 2025. This guide explains what an AI SRE actually does, how it differs from AIOps and traditional SRE, and how to evaluate the commercial and open-source tools available in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An &lt;strong&gt;AI SRE&lt;/strong&gt; is an autonomous software agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, postmortem generation, and in some cases guided remediation — using large language models and production tooling to operate with minimal human direction. Unlike chatbots or copilots, an AI SRE decides what to investigate, which systems to query, and how to synthesize findings into actionable outcomes.&lt;/p&gt;

&lt;p&gt;The category crystallized in 2026. Microsoft made &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/announcing-general-availability-for-the-azure-sre-agent/4500682" rel="noopener noreferrer"&gt;Azure SRE Agent generally available on March 10, 2026&lt;/a&gt;. Komodor reports being named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling. Open-source options like &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, K8sGPT, and HolmesGPT emerged as credible alternatives to commercial platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI SRE?
&lt;/h2&gt;

&lt;p&gt;An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that performs SRE work — alert triage, incident investigation, root cause analysis, postmortem generation, and guided remediation — without requiring step-by-step human direction.&lt;/p&gt;

&lt;p&gt;Three characteristics distinguish an AI SRE from earlier generations of operations tooling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy.&lt;/strong&gt; An AI SRE decides which tools to use and what data to gather. It is not a runbook that executes predefined steps; it is an agent that plans a multi-step investigation based on the specific alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access to production.&lt;/strong&gt; An AI SRE reads real infrastructure signals — metrics, logs, traces, Kubernetes events, cloud API responses, deployment history — rather than working only from summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis.&lt;/strong&gt; An AI SRE produces structured outputs: a root cause analysis, a timeline, a blast radius assessment, a postmortem, or a remediation PR. It does not stop at "the error rate is elevated."&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why AI SRE Emerged in 2026
&lt;/h2&gt;

&lt;p&gt;The conditions that made AI SRE viable came together between 2024 and 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert volume outpaced human capacity.&lt;/strong&gt; PagerDuty's State of Digital Operations data shows the average on-call engineer receives roughly 50 alerts per week, with only 2–5% requiring real human intervention. A &lt;a href="https://oneuptime.com/blog/post/2026-03-05-alert-fatigue-ai-on-call/view" rel="noopener noreferrer"&gt;2024 Catchpoint study cited by OneUptime&lt;/a&gt; found that 70% of SRE teams list alert fatigue as a top-three operational concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-cloud became the default.&lt;/strong&gt; According to the &lt;a href="https://resources.flexera.com/web/pdf/Flexera-State-of-the-Cloud-Report-2025.pdf" rel="noopener noreferrer"&gt;Flexera 2025 State of the Cloud Report&lt;/a&gt;, organizations use an average of 2.4 public cloud providers, and 70% operate a hybrid cloud strategy. Correlating incidents across AWS, Azure, and GCP by hand is increasingly impractical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change velocity rose faster than reliability tooling.&lt;/strong&gt; The &lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;2025 DORA State of AI-Assisted Software Development report&lt;/a&gt; found that incidents per PR increased 242.7% as AI coding assistants accelerated delivery — without a matching improvement in incident response capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM tool use matured.&lt;/strong&gt; Agent frameworks like LangGraph made it practical to give a language model 30+ tools and let it chain them into a coherent investigation. Claude, GPT-5, and Gemini 2.5+ reached enough reliability at structured tool use to be trusted with read-only production access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gartner codified the category.&lt;/strong&gt; In &lt;a href="https://www.gartner.com/en/documents/7242030" rel="noopener noreferrer"&gt;Predicts 2026: AI Agents Will Transform IT Infrastructure and Operations&lt;/a&gt;, Gartner projected that by 2029, 70% of enterprises will deploy agentic AI to operate IT infrastructure, up from less than 5% in 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an AI SRE Work?
&lt;/h2&gt;

&lt;p&gt;An AI SRE runs a repeatable loop for every alert it receives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert ingestion.&lt;/strong&gt; A monitoring tool (PagerDuty, Datadog, Grafana, BigPanda) fires a webhook. The AI SRE receives the payload and begins investigation without waiting for a human to acknowledge the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context gathering.&lt;/strong&gt; The agent reads the recent state: pod status, metric trends, deployment history, recent configuration changes, related alerts within a time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis formation.&lt;/strong&gt; Using the alert semantics plus the gathered context, the agent proposes one or more candidate causes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence collection.&lt;/strong&gt; The agent selects from its tool inventory — running &lt;code&gt;kubectl describe&lt;/code&gt;, querying metrics, searching a vector knowledge base of past postmortems — to test each hypothesis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause synthesis.&lt;/strong&gt; The agent produces a structured RCA: what failed, why, what the blast radius is, which services are affected, whether a recent change likely caused it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation (optional).&lt;/strong&gt; Some AI SREs stop at recommendations. Others generate a PR, roll back a deployment, or restart a service — typically behind a human approval gate for destructive actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation.&lt;/strong&gt; The agent assembles a draft postmortem with timeline, contributing factors, impact, and action items, ready for human review and export to Confluence or another docs system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A trustworthy AI SRE is transparent about this loop — surfacing the evidence it considered, the hypotheses it ruled out, and its confidence in the final answer.&lt;/p&gt;
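
&lt;p&gt;The hypothesis and evidence steps of the loop, together with the transparency requirement, can be sketched in a few lines of Python. Everything here is hypothetical: the &lt;code&gt;Hypothesis&lt;/code&gt; and &lt;code&gt;Report&lt;/code&gt; types, the &lt;code&gt;investigate&lt;/code&gt; function, and the stub tools stand in for LLM calls and real integrations, and do not reflect any specific product.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    cause: str
    check: str  # name of the tool that can confirm or rule out this cause


@dataclass
class Report:
    root_cause: str
    evidence: list = field(default_factory=list)   # (tool, observation) pairs
    ruled_out: list = field(default_factory=list)  # causes that did not hold


def investigate(alert, propose, tools):
    """Test each candidate cause with its tool and record what was seen.

    propose: callable returning a list of Hypothesis for the alert
             (in a real agent this would be an LLM call).
    tools:   dict mapping a tool name to a callable(alert) that returns
             a (confirmed, observation) pair.
    """
    report = Report(root_cause="undetermined")
    for hypothesis in propose(alert):
        confirmed, observation = tools[hypothesis.check](alert)
        report.evidence.append((hypothesis.check, observation))
        if confirmed:
            report.root_cause = hypothesis.cause
            break
        report.ruled_out.append(hypothesis.cause)
    return report


if __name__ == "__main__":
    def propose(alert):
        return [
            Hypothesis("node out of memory", "node_metrics"),
            Hypothesis("bad deployment", "deploy_history"),
        ]

    stub_tools = {
        "node_metrics": lambda alert: (False, "memory usage normal"),
        "deploy_history": lambda alert: (True, "deploy shortly before alert"),
    }
    print(investigate({"service": "checkout"}, propose, stub_tools))
```

&lt;p&gt;The returned report keeps the ruled-out causes and the raw observations alongside the conclusion, which is what lets a human reviewer audit how the agent got there.&lt;/p&gt;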

&lt;h2&gt;
  
  
  AI SRE vs Traditional SRE vs AIOps
&lt;/h2&gt;

&lt;p&gt;The three categories are often conflated but address different problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional SRE&lt;/th&gt;
&lt;th&gt;AIOps&lt;/th&gt;
&lt;th&gt;AI SRE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human engineers manage reliability&lt;/td&gt;
&lt;td&gt;Anomaly detection, alert correlation&lt;/td&gt;
&lt;td&gt;Autonomous incident investigation and RCA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual (human reads logs, queries systems)&lt;/td&gt;
&lt;td&gt;Suggests related alerts&lt;/td&gt;
&lt;td&gt;Agent runs multi-step investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root cause analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hours, depends on engineer's expertise&lt;/td&gt;
&lt;td&gt;Correlation hints, not causation&lt;/td&gt;
&lt;td&gt;Structured RCA in minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer runs kubectl and the AWS CLI, reads dashboards&lt;/td&gt;
&lt;td&gt;Reads pre-ingested telemetry&lt;/td&gt;
&lt;td&gt;Dynamically selects from 20–40+ tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-driven&lt;/td&gt;
&lt;td&gt;Typically suggestions only&lt;/td&gt;
&lt;td&gt;Agentic execution, often with approval gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runbooks, tribal knowledge&lt;/td&gt;
&lt;td&gt;Alert correlation models&lt;/td&gt;
&lt;td&gt;RAG over runbooks and past postmortems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core technology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Humans plus monitoring dashboards&lt;/td&gt;
&lt;td&gt;ML models for anomaly detection&lt;/td&gt;
&lt;td&gt;LLM agents with tool calling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The short version: &lt;strong&gt;AIOps tells you what is anomalous. An AI SRE tells you why it is happening and, increasingly, fixes it.&lt;/strong&gt; Traditional SRE is the human discipline both categories augment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Capabilities Should an AI SRE Have?
&lt;/h2&gt;

&lt;p&gt;Serious AI SREs in 2026 share a consistent capability stack:&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous multi-step investigation
&lt;/h3&gt;

&lt;p&gt;The agent must plan and execute investigations without requiring humans to choose tools or pass data between steps. Simple tool-calling is not enough — the agent needs memory across steps and the ability to revise hypotheses as evidence arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Broad tool access with safe execution
&lt;/h3&gt;

&lt;p&gt;The agent needs &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, and &lt;code&gt;gcloud&lt;/code&gt;, plus metric queries, log search, deployment history, and IaC state. &lt;strong&gt;How tools are executed matters&lt;/strong&gt;: running &lt;code&gt;kubectl&lt;/code&gt; directly on the agent host is a production risk. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, for example, runs CLI commands in sandboxed Kubernetes pods with per-invocation credential scoping, not on the agent host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-cloud and cross-platform reach
&lt;/h3&gt;

&lt;p&gt;With the Flexera 2025 average at 2.4 public clouds per organization, an AI SRE that works only inside AWS or only inside Kubernetes will miss incidents that cross provider or platform boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge base retrieval
&lt;/h3&gt;

&lt;p&gt;Past postmortems, runbooks, and docs should be searchable by the agent through vector search (RAG). The knowledge your senior SREs have built up should be available to the agent on day one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure dependency graph
&lt;/h3&gt;

&lt;p&gt;When a database fails, the agent needs to know which services depend on it. Graph databases like Memgraph are a common choice for modeling cross-service and cross-cloud relationships.&lt;/p&gt;
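
&lt;p&gt;The lookup itself is a reverse traversal of the dependency edges. Below is a minimal in-memory sketch, assuming a made-up service map (&lt;code&gt;DEPENDS_ON&lt;/code&gt;) and a hypothetical helper &lt;code&gt;impacted_by&lt;/code&gt;. A graph database would answer the same question with a query, so this only shows the idea.&lt;/p&gt;

```python
from collections import deque

# Hypothetical edges: service -> the things it depends on.
DEPENDS_ON = {
    "checkout": ["orders-db", "payments"],
    "payments": ["orders-db"],
    "search": ["catalog-db"],
    "storefront": ["checkout", "search"],
}


def impacted_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for svc in dependents.get(node, []):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return sorted(seen)


print(impacted_by("orders-db", DEPENDS_ON))
# prints ['checkout', 'payments', 'storefront']
```

&lt;p&gt;In this toy map, &lt;code&gt;storefront&lt;/code&gt; never touches the database directly, yet it is in the blast radius because it calls &lt;code&gt;checkout&lt;/code&gt;.&lt;/p&gt;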

&lt;h3&gt;
  
  
  Postmortem generation
&lt;/h3&gt;

&lt;p&gt;Structured timeline, contributing factors, blast radius, action items — produced during the investigation, not written manually afterward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remediation with guardrails
&lt;/h3&gt;

&lt;p&gt;Generating PRs, rolling back deployments, restarting services. Destructive actions should require human approval. Aurora's Bitbucket connector, added in &lt;a href="https://github.com/Arvo-AI/aurora/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;v1.1.0&lt;/a&gt;, requires explicit human approval before agents can write.&lt;/p&gt;
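
&lt;p&gt;One way to picture such a guardrail is an approval gate in front of the action executor. The sketch below is an assumption-laden illustration, not Aurora's implementation: the action names, the &lt;code&gt;DESTRUCTIVE&lt;/code&gt; set, and the &lt;code&gt;execute&lt;/code&gt; function are all hypothetical.&lt;/p&gt;

```python
# Hypothetical action names; a real deployment would define its own list.
DESTRUCTIVE = {"rollback_deployment", "restart_service", "delete_pod"}


class ApprovalRequired(Exception):
    """Raised when a destructive action is attempted without human sign-off."""


def execute(action, target, approved_by=None):
    """Run an action, refusing destructive ones unless a human approved them."""
    if action in DESTRUCTIVE and not approved_by:
        raise ApprovalRequired(f"{action} on {target} needs human approval")
    # The real side effect would happen here; the sketch only reports it.
    return f"{action} on {target} executed (approved_by={approved_by})"


print(execute("get_logs", "checkout"))
try:
    execute("restart_service", "checkout")
except ApprovalRequired as err:
    print("blocked:", err)
print(execute("restart_service", "checkout", approved_by="alice"))
```

&lt;p&gt;Read-only actions pass straight through, while anything in the destructive set fails closed until a named approver is attached.&lt;/p&gt;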

&lt;h3&gt;
  
  
  LLM flexibility
&lt;/h3&gt;

&lt;p&gt;The agent should support OpenAI, Anthropic, Google, and local models via Ollama for air-gapped deployments. Lock-in to a single LLM vendor is a real risk while model quality and pricing change quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI SRE Landscape in 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial platforms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://azure.microsoft.com/en-us/products/sre-agent/" rel="noopener noreferrer"&gt;Azure SRE Agent&lt;/a&gt;&lt;/strong&gt; — Microsoft's first-party agent, &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/announcing-general-availability-for-the-azure-sre-agent/4500682" rel="noopener noreferrer"&gt;generally available since March 10, 2026&lt;/a&gt;. Deep Azure integration, adjustable autonomy from "review recommendations" to "fully automated," billed via Azure Agent Units on pay-as-you-go.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://rootly.com/ai-sre" rel="noopener noreferrer"&gt;Rootly AI SRE&lt;/a&gt;&lt;/strong&gt; — AI layer built on top of a mature incident management platform. Transparent chain-of-thought reasoning. SOC2 since January 2022. Depends on external observability tools for telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://komodor.com/ai-sre-2026/" rel="noopener noreferrer"&gt;Komodor Klaudia&lt;/a&gt;&lt;/strong&gt; — Kubernetes-specialized AI SRE. Komodor reports Klaudia achieves 95% accuracy across real-world incident scenarios and that Komodor was named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io AI SRE&lt;/a&gt;&lt;/strong&gt; — Multi-agent AI investigation integrated into an incident response platform, with code fix suggestions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.traversal.com/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt;&lt;/strong&gt; — Focused on large distributed systems using causal ML. Traversal reports a 38% MTTR reduction at DigitalOcean. Supports on-prem and bring-your-own model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve.ai&lt;/strong&gt; — Pushes toward high-autonomy resolution with guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open-source AI SRE options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt; — Apache 2.0, self-hosted, multi-cloud (AWS via STS AssumeRole, Azure via Service Principal, GCP, OVH, Scaleway, Kubernetes). LangGraph-orchestrated agents with 30+ tools, Memgraph dependency graph, Weaviate RAG, postmortem export to Confluence, PR generation via GitHub and Bitbucket. Works with any LLM (OpenAI, Anthropic, Google, OpenRouter, Ollama).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt; — Open-source CLI for scanning Kubernetes clusters and explaining failures in plain English. Narrower scope than a full AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt; — Open-source cross-stack SRE agent covering Kubernetes, Prometheus, logs, and Slack workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coroot (Community Edition)&lt;/strong&gt; — Kubernetes observability plus AI-assisted RCA. Community Edition is free; commercial tier is priced transparently from $1 per monitored CPU core per month.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open-Source vs Commercial AI SRE
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Consideration&lt;/th&gt;
&lt;th&gt;Open-Source&lt;/th&gt;
&lt;th&gt;Commercial&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data residency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully self-hosted; incident data stays in your environment&lt;/td&gt;
&lt;td&gt;Usually SaaS; incident data leaves your perimeter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free software; you pay for infra and LLM API usage&lt;/td&gt;
&lt;td&gt;Per-seat or per-incident pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM choice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bring any provider, including local via Ollama&lt;/td&gt;
&lt;td&gt;Often bundled or restricted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Source code available; you can audit how the agent behaves&lt;/td&gt;
&lt;td&gt;Typically black-box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support and managed ops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Community plus self-managed&lt;/td&gt;
&lt;td&gt;Vendor support, SLAs, managed infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to deploy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Longer — self-hosting has setup cost&lt;/td&gt;
&lt;td&gt;Shorter — SaaS onboarding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Customization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fork, modify, add tools&lt;/td&gt;
&lt;td&gt;Limited to what the vendor exposes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For regulated industries (finance, healthcare, government), air-gapped deployments, or teams already operating their own Kubernetes, open-source AI SRE is often the right fit. For teams prioritizing the fastest time to value, commercial platforms usually win.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Evaluate an AI SRE Tool
&lt;/h2&gt;

&lt;p&gt;If you are piloting an AI SRE in 2026, these are the questions to answer before committing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How does the agent actually execute commands?&lt;/strong&gt; Host process, container, sandboxed pod? Read-only or write? What credentials does it use?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which alerts can it investigate today?&lt;/strong&gt; Ask for specific integrations by name (PagerDuty, Datadog, CloudWatch) and test with your own alert payloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when it is wrong?&lt;/strong&gt; How does the agent surface low-confidence answers? Can you see the evidence it gathered?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can it handle multi-cloud?&lt;/strong&gt; If you run on more than one cloud, does it correlate across providers or investigate each in isolation?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it learn from past incidents?&lt;/strong&gt; Does it ingest your existing runbooks and postmortems? How?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the remediation model?&lt;/strong&gt; Suggestions only? PRs with human approval? Direct execution? Where are the guardrails?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which LLM does it use — and can you change it?&lt;/strong&gt; LLM cost and quality move quickly. Lock-in is a risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where does your incident data go?&lt;/strong&gt; Self-hosted, vendor cloud, LLM provider? Read the data flow carefully.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Limitations of AI SREs in 2026
&lt;/h2&gt;

&lt;p&gt;The category is real but not a silver bullet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Novel failure modes.&lt;/strong&gt; Agents excel at recognizing patterns similar to past incidents. Genuinely new failures still often require human judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational root causes.&lt;/strong&gt; "The deploy pipeline does not validate environment variables" is the kind of root cause an AI SRE can surface. "We do not have enough staff to maintain this service" is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM cost at scale.&lt;/strong&gt; Complex investigations can consume hundreds of LLM calls. Local inference via Ollama mitigates this but requires GPU infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool coverage gaps.&lt;/strong&gt; An AI SRE can only investigate systems it has tools for. Legacy systems, internal tooling, and unusual stacks require custom connectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust-building takes time.&lt;/strong&gt; Teams typically start with the agent in "observe" mode, graduate to "suggest," and only later enable autonomous remediation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;DORA 2025 report&lt;/a&gt; is instructive: AI improves throughput but can increase instability in teams without strong platform engineering foundations. AI SRE tools amplify existing practices more than they fix broken ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Pilot an AI SRE in Your Team
&lt;/h2&gt;

&lt;p&gt;A low-risk pilot follows six steps. Expect it to take four to six weeks end-to-end.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one service and one alert source.&lt;/strong&gt; Do not try to cover everything at once. Choose a service your team knows well and a monitoring tool you already use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy the AI SRE in read-only mode.&lt;/strong&gt; Connect it to alerts, read-only cloud credentials, and your existing observability tools. Do not grant write permissions yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run for two weeks, compare to human RCA.&lt;/strong&gt; Let the agent investigate every incident that fires. Compare its root cause conclusions to what the on-call engineer eventually determined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure accuracy and time-to-RCA.&lt;/strong&gt; Two metrics matter: was the agent's root cause correct, and how much faster was it than the human?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand scope gradually.&lt;/strong&gt; Add more services, enable remediation suggestions, then (only after trust is established) approved automated actions for specific low-risk patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed historical context.&lt;/strong&gt; Ingest your existing runbooks and past postmortems into the agent's knowledge base. Agents become dramatically more useful with organizational memory.&lt;/li&gt;
&lt;/ol&gt;
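
&lt;p&gt;The two metrics from step 4 can be computed from a simple incident log. The sketch below uses made-up records purely for illustration (they are not measurements from any real pilot), and the field names and the &lt;code&gt;pilot_summary&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from statistics import median

# Made-up pilot records for illustration only: one row per incident.
incidents = [
    {"agent_correct": True, "agent_minutes": 6, "human_minutes": 42},
    {"agent_correct": True, "agent_minutes": 9, "human_minutes": 30},
    {"agent_correct": False, "agent_minutes": 12, "human_minutes": 55},
    {"agent_correct": True, "agent_minutes": 4, "human_minutes": 20},
]


def pilot_summary(rows):
    """Return root-cause accuracy and the median minutes saved on correct answers."""
    correct = [r for r in rows if r["agent_correct"]]
    accuracy = len(correct) / len(rows)
    # Compare speed only where the agent was right; a fast wrong answer saves nothing.
    saved = [r["human_minutes"] - r["agent_minutes"] for r in correct]
    return {"accuracy": accuracy, "median_minutes_saved": median(saved)}


print(pilot_summary(incidents))
# prints {'accuracy': 0.75, 'median_minutes_saved': 21}
```

&lt;p&gt;Restricting the time comparison to correct answers is a design choice: it keeps a quick but wrong diagnosis from inflating the speed-up.&lt;/p&gt;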

&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open-source (Apache 2.0) AI SRE built by Arvo AI. It autonomously investigates incidents across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, integrating with 22+ tools including PagerDuty, Datadog, Grafana, Slack, Bitbucket, and Confluence.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora works with any LLM provider — OpenAI, Anthropic, Google Gemini, OpenRouter, or local models via Ollama for air-gapped deployments. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; or the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;original post on arvoai.ca&lt;/a&gt; for more context.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Opsgenie 2026: Features, Pricing, EOL &amp; Alternatives</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 17:36:17 +0000</pubDate>
      <link>https://dev.to/siddharth_singh_409bd5267/opsgenie-2026-features-pricing-eol-alternatives-1bm0</link>
      <guid>https://dev.to/siddharth_singh_409bd5267/opsgenie-2026-features-pricing-eol-alternatives-1bm0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR — Opsgenie is ending.&lt;/strong&gt; Atlassian stopped new Opsgenie signups on &lt;strong&gt;June 4, 2025&lt;/strong&gt; and will shut the service down permanently on &lt;strong&gt;April 5, 2027&lt;/strong&gt;. Any data not migrated by that date will be deleted. Atlassian's official migration paths are Jira Service Management (JSM) Operations and Compass. Many teams are using the forced migration as a chance to evaluate alternatives — especially AI-powered options that weren't available when Opsgenie was originally adopted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Opsgenie is an alerting and on-call management platform that was acquired by Atlassian in 2018. For years it was one of the most widely adopted tools in the SRE stack, sitting alongside PagerDuty and xMatters. In March 2025 Atlassian announced that Opsgenie's capabilities would be absorbed into Jira Service Management and Compass, and that the standalone product would be retired.&lt;/p&gt;

&lt;p&gt;This guide covers what Opsgenie is, how it works, what it costs, the exact end-of-life timeline, what happens to your data when it shuts down, the official migration paths, and the current landscape of alternatives. Every claim is linked to an official source.&lt;/p&gt;

&lt;p&gt;Last updated: April 21, 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Opsgenie?
&lt;/h2&gt;

&lt;p&gt;Opsgenie is a cloud-based incident alerting and on-call management platform for DevOps and SRE teams. It routes alerts from 200+ monitoring tools to the right on-call responders via SMS, voice, email, push, Slack, and Microsoft Teams. &lt;a href="https://www.atlassian.com/software/opsgenie" rel="noopener noreferrer"&gt;Atlassian acquired Opsgenie in 2018&lt;/a&gt; and will retire the standalone product on April 5, 2027.&lt;/p&gt;

&lt;p&gt;Opsgenie was founded in 2012. Its capabilities are now being absorbed into &lt;a href="https://www.atlassian.com/software/jira/service-management" rel="noopener noreferrer"&gt;Jira Service Management&lt;/a&gt; and &lt;a href="https://www.atlassian.com/software/compass" rel="noopener noreferrer"&gt;Compass&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opsgenie at a glance vs top alternatives
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Opsgenie (retiring)&lt;/th&gt;
&lt;th&gt;JSM Operations&lt;/th&gt;
&lt;th&gt;PagerDuty&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Available after April 2027&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting price&lt;/td&gt;
&lt;td&gt;N/A (closed)&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;$21/user/mo&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in AI RCA&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Add-on ($699+/mo)&lt;/td&gt;
&lt;td&gt;Yes (agentic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-call + escalations&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Via integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Opsgenie End-of-Life Timeline (Official)
&lt;/h2&gt;

&lt;p&gt;Atlassian announced the end of Opsgenie in &lt;a href="https://www.atlassian.com/blog/announcements/evolution-of-it-operations" rel="noopener noreferrer"&gt;The Evolution of IT Operations&lt;/a&gt;. The three critical milestones, two of which fall on the same date, are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End of Sale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;June 4, 2025&lt;/td&gt;
&lt;td&gt;No new signups, upgrades, or downgrades on standalone Opsgenie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End of Support / Shutdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;April 5, 2027&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opsgenie service is turned off; REST APIs stop responding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Deletion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;April 5, 2027&lt;/td&gt;
&lt;td&gt;All unmigrated alerts, schedules, escalation policies, integrations, and incidents are permanently deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Existing customers can continue using Opsgenie through April 5, 2027, but cannot expand their footprint. After migration, Opsgenie and the new JSM or Compass instance can run in parallel for up to 120 days, after which Opsgenie is automatically switched off (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;official source&lt;/a&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Opsgenie REST APIs will continue to work until April 5, 2027. However, Atlassian recommends updating all API endpoints before Opsgenie is turned off to avoid any disruptions." — &lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;Atlassian Support&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Opsgenie Features
&lt;/h2&gt;

&lt;p&gt;Opsgenie's core feature set is mature: the product dates back to 2012. Here is what it currently provides, verified from Atlassian's documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;Opsgenie ships with &lt;a href="https://www.atlassian.com/software/opsgenie/integrations" rel="noopener noreferrer"&gt;over 200 integrations&lt;/a&gt; with monitoring, ticketing, chat, and ITSM tools. Most are bidirectional — alerts flow in, and acknowledgement or closure events flow back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Channel Notifications
&lt;/h3&gt;

&lt;p&gt;Supported notification channels, per &lt;a href="https://support.atlassian.com/opsgenie/docs/send-voice-and-sms-notifications/" rel="noopener noreferrer"&gt;Atlassian documentation&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SMS&lt;/strong&gt; — Aggregated at a minimum 1-minute interval; users can acknowledge or close alerts via reply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice calls&lt;/strong&gt; — Capped at 2 minutes; dial-pad actions (1 = read, 2 = close, 3 = acknowledge, 4 = escalate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt; — With inline action buttons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push notifications&lt;/strong&gt; — iOS and Android with swipe-to-ack/close&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — &lt;a href="https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-slack-app/" rel="noopener noreferrer"&gt;Bidirectional integration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Teams&lt;/strong&gt; — &lt;a href="https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-microsoft-teams/" rel="noopener noreferrer"&gt;Bidirectional integration&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  On-Call Management
&lt;/h3&gt;

&lt;p&gt;Opsgenie supports daily, weekly, and custom rotation types including follow-the-sun, with ad-hoc overrides, "Take on-call for an hour" self-service, and a "No-One" participant for scheduled gaps (&lt;a href="https://support.atlassian.com/opsgenie/docs/manage-on-call-schedules-and-rotations/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Escalation Policies
&lt;/h3&gt;

&lt;p&gt;The default escalation steps fire at 5 minutes and then at 10 minutes, and the policy can repeat up to 20 times per alert. Acknowledging or closing the alert stops the policy (&lt;a href="https://support.atlassian.com/opsgenie/docs/how-do-escalations-work-in-opsgenie/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Heartbeat Monitoring
&lt;/h3&gt;

&lt;p&gt;A "dead man's switch" — if an expected HTTP ping doesn't arrive within the configured interval (minimum 1 minute), Opsgenie fires an alert. Available on &lt;strong&gt;Standard and Enterprise plans only&lt;/strong&gt; (&lt;a href="https://support.atlassian.com/opsgenie/docs/check-system-health-with-opsgenie-heartbeats/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;
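
&lt;p&gt;The dead man's switch pattern is small enough to sketch. This is a generic illustration of the idea, not Opsgenie's implementation or API: the &lt;code&gt;Heartbeat&lt;/code&gt; class and its injectable clock are assumptions made for the example.&lt;/p&gt;

```python
import time


class Heartbeat:
    """Minimal dead man's switch: report expiry when pings stop arriving."""

    def __init__(self, interval_seconds, now=time.monotonic):
        self.interval = interval_seconds
        self._now = now  # injectable clock so the logic can be tested
        self.last_ping = now()

    def ping(self):
        """Record that the monitored job checked in."""
        self.last_ping = self._now()

    def expired(self):
        """True when no ping has arrived within the configured interval."""
        return self._now() - self.last_ping > self.interval


# Usage with a fake clock so the example runs instantly.
clock = [0.0]
hb = Heartbeat(interval_seconds=60, now=lambda: clock[0])
clock[0] = 30.0
print(hb.expired())  # still inside the interval
clock[0] = 61.0
print(hb.expired())  # interval exceeded, so an alert would fire
```

&lt;p&gt;The alert is driven by the absence of a signal, which is why this catches jobs that die silently without ever emitting an error.&lt;/p&gt;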

&lt;h3&gt;
  
  
  Alert Deduplication, Suppression, and Grouping
&lt;/h3&gt;

&lt;p&gt;Opsgenie uses an &lt;code&gt;alias&lt;/code&gt; field to deduplicate alerts — identical alias values increment a counter on the existing alert instead of creating a new one. The counter stops logging at 100 occurrences, but deduplication continues (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;
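
&lt;p&gt;The behaviour described above can be modelled in a few lines. This is an in-memory sketch of the idea only, not Opsgenie code: the &lt;code&gt;AlertStore&lt;/code&gt; class and its &lt;code&gt;ingest&lt;/code&gt; method are hypothetical, and the cap of 100 is taken from the paragraph above.&lt;/p&gt;

```python
COUNT_CAP = 100  # per the text above, the counter stops at 100 occurrences


class AlertStore:
    """In-memory model of alias-based deduplication (not Opsgenie code)."""

    def __init__(self):
        self.open_alerts = {}  # alias -> alert dict

    def ingest(self, alias, message):
        """Create a new alert, or fold a repeat into the existing one."""
        existing = self.open_alerts.get(alias)
        if existing is None:
            self.open_alerts[alias] = {"message": message, "count": 1}
            return "created"
        # Same alias: bump the counter (up to the cap) instead of creating an alert.
        existing["count"] = min(existing["count"] + 1, COUNT_CAP)
        return "deduplicated"


store = AlertStore()
results = [store.ingest("disk-full:db-1", "Disk 95% on db-1") for _ in range(150)]
print(results.count("created"), store.open_alerts["disk-full:db-1"]["count"])
# prints 1 100
```

&lt;p&gt;After 150 identical events there is still one open alert, and the counter has stopped at the cap while later events continue to be folded in.&lt;/p&gt;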

&lt;p&gt;Delay policies can hold notifications for a fixed time, until a deduplication threshold is reached, or until an occurrence rate threshold triggers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routing Rules
&lt;/h3&gt;

&lt;p&gt;Each team can have &lt;strong&gt;up to 100 routing rules&lt;/strong&gt;, evaluated top-down with first-match semantics. Free and Essentials plans are limited to &lt;strong&gt;1 routing rule&lt;/strong&gt; and can only route by priority or tags. Standard and Enterprise plans support full-field routing.&lt;/p&gt;
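
&lt;p&gt;Top-down, first-match evaluation means rule order decides the outcome. A minimal sketch follows, assuming hypothetical rule predicates and destination names. It mirrors the semantics described above rather than any Opsgenie API.&lt;/p&gt;

```python
# Hypothetical rules in priority order: (predicate over alert fields, destination).
RULES = [
    (lambda a: a.get("priority") == "P1", "primary-escalation"),
    (lambda a: "database" in a.get("tags", []), "dba-schedule"),
    (lambda a: True, "default-schedule"),  # catch-all, must stay last
]


def route(alert, rules):
    """Evaluate rules top-down and return the first matching destination."""
    for matches, destination in rules:
        if matches(alert):
            return destination
    return None  # no rule matched


print(route({"priority": "P1", "tags": ["database"]}, RULES))
print(route({"priority": "P3", "tags": ["database"]}, RULES))
print(route({"priority": "P4", "tags": []}, RULES))
```

&lt;p&gt;The first alert matches both the priority rule and the tag rule, but only the earlier one applies. Moving the catch-all to the top would shadow every rule below it.&lt;/p&gt;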

&lt;h3&gt;
  
  
  Reporting by Plan
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Report&lt;/th&gt;
&lt;th&gt;Essentials&lt;/th&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;Enterprise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Notifications + API Usage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly Overview (Looker)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced reporting / MTTA / MTTR&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team Reports&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Reports + Looker dashboards&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-Incident Analysis&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://www.atlassian.com/software/opsgenie/advanced-reporting-and-analytics" rel="noopener noreferrer"&gt;Opsgenie Advanced Reporting&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mobile App
&lt;/h3&gt;

&lt;p&gt;Opsgenie's &lt;a href="https://www.atlassian.com/software/opsgenie/mobile-app" rel="noopener noreferrer"&gt;iOS and Android apps&lt;/a&gt; support swipe-to-acknowledge from the lock screen and iOS Critical Alerts that override Do Not Disturb and silent mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSO / SAML
&lt;/h3&gt;

&lt;p&gt;SSO is available on &lt;strong&gt;Standard and Enterprise plans only&lt;/strong&gt;, with supported providers including Google, Azure AD, Okta, OneLogin, Ping Identity, and Microsoft AD FS (&lt;a href="https://support.atlassian.com/opsgenie/docs/configure-sso-for-opsgenie/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance
&lt;/h3&gt;

&lt;p&gt;Opsgenie is covered under Atlassian's Trust program with &lt;strong&gt;SOC 2 Type II (annual), ISO/IEC 27001, ISO/IEC 27018, CSA, and TISAX AL2&lt;/strong&gt; certifications, plus a pre-signed GDPR DPA (&lt;a href="https://www.atlassian.com/software/opsgenie/security" rel="noopener noreferrer"&gt;official page&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Residency
&lt;/h3&gt;

&lt;p&gt;Opsgenie is offered in &lt;strong&gt;US and EU&lt;/strong&gt; regions, both hosted on AWS (&lt;a href="https://support.atlassian.com/opsgenie/docs/opsgenies-data-residency/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Should Use Opsgenie in 2026?
&lt;/h2&gt;

&lt;p&gt;With end-of-sale already behind us, Opsgenie is only relevant to &lt;strong&gt;existing subscribers&lt;/strong&gt; planning their exit. New teams cannot sign up. The question for existing subscribers is whether to stay with Atlassian (migrate to JSM or Compass) or evaluate alternatives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stay with Atlassian (migrate to JSM Operations)&lt;/strong&gt; if you are already a Jira Service Management customer, need ITSM workflows (change, problem, incident), and are comfortable with the Premium-tier price increase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay with Atlassian (migrate to Compass)&lt;/strong&gt; if you are a DevOps or SRE team that wants alerting paired with a software component catalog and service ownership model, not ITSM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch to a dedicated alerting tool&lt;/strong&gt; (PagerDuty, ilert, Squadcast) if you want deeper alerting features and do not need Atlassian platform integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch to AI-powered incident management&lt;/strong&gt; (incident.io, Rootly, Aurora) if you want autonomous investigation and root cause analysis, not just alert routing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Opsgenie Pricing (Standalone, 100-User Reference)
&lt;/h2&gt;

&lt;p&gt;Pricing below is for standalone Opsgenie with 100 users — sourced from the &lt;a href="https://www.atlassian.com/software/opsgenie/pricing" rel="noopener noreferrer"&gt;official Opsgenie pricing page&lt;/a&gt;. New signups are closed, so these numbers apply only to existing customers on legacy plans.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;th&gt;Annual&lt;/th&gt;
&lt;th&gt;Routing Rules&lt;/th&gt;
&lt;th&gt;Heartbeats&lt;/th&gt;
&lt;th&gt;SSO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 5 users)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Essentials&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$11.55/user/mo&lt;/td&gt;
&lt;td&gt;$9.45/user/mo&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$29/user/mo&lt;/td&gt;
&lt;td&gt;Discounted&lt;/td&gt;
&lt;td&gt;100 per team&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$39/user/mo&lt;/td&gt;
&lt;td&gt;Discounted&lt;/td&gt;
&lt;td&gt;100 per team&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
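
&lt;p&gt;As a worked example for the 100-user reference size, the Essentials row translates into the following totals. The per-user prices come from the table above; the variable names are illustrative, and &lt;code&gt;Decimal&lt;/code&gt; is used only to avoid floating-point rounding in currency arithmetic.&lt;/p&gt;

```python
from decimal import Decimal

USERS = 100
ESSENTIALS_MONTHLY = Decimal("11.55")  # per user, billed monthly
ESSENTIALS_ANNUAL = Decimal("9.45")    # per user per month, billed annually

monthly_bill = ESSENTIALS_MONTHLY * USERS    # cost per month on monthly billing
annual_plan_bill = ESSENTIALS_ANNUAL * USERS  # cost per month on annual billing
yearly_saving = (monthly_bill - annual_plan_bill) * 12

print(monthly_bill, annual_plan_bill, yearly_saving)
# prints 1155.00 945.00 2520.00
```

&lt;p&gt;For a 100-user team on Essentials, annual billing works out to $2,520 less per year than monthly billing.&lt;/p&gt;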

&lt;p&gt;Enterprise-exclusive features include Incident Command Center (built-in video chatroom tied to incidents), Stakeholders (notification-only users), Service Subscriptions, Incident Templates, and Post-Incident Analysis.&lt;/p&gt;

&lt;p&gt;Incoming call routing is charged separately: &lt;strong&gt;$0.10 per minute&lt;/strong&gt; for US/Canada and &lt;strong&gt;$0.35 per minute&lt;/strong&gt; internationally after the free tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens When Opsgenie Is Turned Off
&lt;/h2&gt;

&lt;p&gt;On April 5, 2027, Atlassian will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disable the Opsgenie web application, mobile apps, and REST APIs&lt;/li&gt;
&lt;li&gt;Delete all data that was &lt;strong&gt;not migrated&lt;/strong&gt; to JSM or Compass — alerts, on-call schedules, escalation policies, integrations, incidents, notes, attachments&lt;/li&gt;
&lt;li&gt;Stop accepting any incoming webhooks or notifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; unlike the legacy Opsgenie Enterprise plan, JSM automatically deletes alert data after a retention window. Once alert data is deleted in JSM, it cannot be recovered. Export anything you need for compliance or audit before migration (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;official source&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Opsgenie Migration Paths: JSM vs Compass
&lt;/h2&gt;

&lt;p&gt;Atlassian offers two official migration destinations. Both share the same underlying Operations engine — schedules, alerts, and policies sync bidirectionally — but the wrapping product and pricing differ (&lt;a href="https://support.atlassian.com/opsgenie/docs/managing-operations-in-compass-and-jira-service-management-at-the-same-time/" rel="noopener noreferrer"&gt;managing operations across both&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Jira Service Management (JSM) Operations
&lt;/h3&gt;

&lt;p&gt;JSM Operations is the ITSM-centric path — alerts are paired with change, problem, and incident workflows. JSM pricing (&lt;a href="https://www.atlassian.com/software/jira/service-management/pricing" rel="noopener noreferrer"&gt;official page&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;JSM Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Outbound Webhooks&lt;/th&gt;
&lt;th&gt;Incident Command Center&lt;/th&gt;
&lt;th&gt;Post-Incident Reviews&lt;/th&gt;
&lt;th&gt;99.9% SLA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 3 agents)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Contact sales&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (99.95%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Opsgenie features that &lt;strong&gt;do not carry over&lt;/strong&gt; to JSM Operations, per &lt;a href="https://support.atlassian.com/jira-service-management-cloud/docs/start-shifting-from-opsgenie-to-jira-service-management/" rel="noopener noreferrer"&gt;Atlassian's shifting guide&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incoming Call Routing integration is not supported&lt;/li&gt;
&lt;li&gt;Stakeholder and custom Opsgenie roles do not carry over; affected users default to the User role&lt;/li&gt;
&lt;li&gt;Alert creation rules from Opsgenie do not migrate&lt;/li&gt;
&lt;li&gt;Legacy &lt;code&gt;api.opsgenie.com/v1/services&lt;/code&gt; endpoint stops working&lt;/li&gt;
&lt;li&gt;Chat integrations must be reconnected manually&lt;/li&gt;
&lt;li&gt;The old Opsgenie mobile app stops working — responders switch to the Jira mobile app&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compass
&lt;/h3&gt;

&lt;p&gt;Compass is positioned as a software component catalog and alerting platform aimed at DevOps, SRE, and Platform Engineering teams rather than ITSM. Compass pricing (&lt;a href="https://www.atlassian.com/software/compass/pricing" rel="noopener noreferrer"&gt;official page&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compass Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Alerting&lt;/th&gt;
&lt;th&gt;Heartbeats&lt;/th&gt;
&lt;th&gt;99.9% SLA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 3 full users)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$8/user/mo&lt;/td&gt;
&lt;td&gt;Yes (150+ integrations)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$25/user/mo&lt;/td&gt;
&lt;td&gt;Advanced&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
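
&lt;p&gt;To compare the paid Compass tiers for a given team size, the per-user prices in the table can be turned into a rough annual figure. The sketch below is illustrative only: &lt;code&gt;annual_cost&lt;/code&gt; is a hypothetical helper, the 25-user team is a made-up example, and the result ignores free seats, discounts, and any difference between annual and monthly billing.&lt;/p&gt;

```python
# Per-user monthly list prices copied from the Compass table above.
# Everything else here (function name, team size) is a hypothetical example.
COMPASS_MONTHLY_PRICE_USD = {"Standard": 8, "Premium": 25}


def annual_cost(plan: str, users: int) -> int:
    """Return a rough yearly cost in USD for `users` paid seats on `plan`."""
    return COMPASS_MONTHLY_PRICE_USD[plan] * users * 12


if __name__ == "__main__":
    for plan in COMPASS_MONTHLY_PRICE_USD:
        # For 25 users: Standard is 2400 USD/year, Premium is 7500 USD/year.
        print(plan, annual_cost(plan, users=25))
```

&lt;p&gt;Heartbeats appear only on Premium in the table, so for teams that rely on them the Premium figure is the one to budget against.&lt;/p&gt;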

&lt;h3&gt;
  
  
  Migration Friction
&lt;/h3&gt;

&lt;p&gt;Real complaints from the &lt;a href="https://community.atlassian.com/forums/Jira-Service-Management/Replacement-for-Opsgenie/qaq-p/2967670" rel="noopener noreferrer"&gt;Atlassian Community&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price increases&lt;/strong&gt; — JSM Premium is widely reported as more expensive than standalone Opsgenie Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature parity gaps&lt;/strong&gt; — some users need JSM &lt;em&gt;and&lt;/em&gt; Compass together to match Opsgenie's alert processing depth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;120-day forced cutover&lt;/strong&gt; — Opsgenie shuts down automatically 120 days after migration begins; Atlassian has &lt;a href="https://community.atlassian.com/forums/Jira-Service-Management/Extend-120-day-window-to-shutdown-of-Opsgenie-after-migration/qaq-p/3084093" rel="noopener noreferrer"&gt;declined requests&lt;/a&gt; to extend the window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split-path confusion&lt;/strong&gt; — some features only exist in JSM, others only in Compass, forcing customers to choose or buy both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One user put it bluntly: "Switching to Compass seems like buying a new car just to listen to the radio."&lt;/p&gt;
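
&lt;p&gt;The 120-day cutover is easy to plan around once it is written as a date calculation. The sketch below assumes the window is counted from the day migration begins and that the April 5, 2027 shutdown caps it; both the helper name &lt;code&gt;cutover_deadline&lt;/code&gt; and the example start dates are hypothetical, so confirm the exact counting rule with Atlassian before relying on it.&lt;/p&gt;

```python
from datetime import date, timedelta

OPSGENIE_SHUTDOWN = date(2027, 4, 5)   # shutdown date stated in this guide
CUTOVER_WINDOW = timedelta(days=120)   # parallel-run window stated in this guide


def cutover_deadline(migration_start: date) -> date:
    """Last day Opsgenie is assumed to remain available for a given start date.

    Assumption: the window starts on `migration_start` and can never extend
    past the global shutdown date.
    """
    return min(migration_start + CUTOVER_WINDOW, OPSGENIE_SHUTDOWN)


if __name__ == "__main__":
    print(cutover_deadline(date(2026, 6, 1)))   # 2026-09-29
    print(cutover_deadline(date(2027, 2, 1)))   # 2027-04-05 (capped by shutdown)
```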




&lt;h2&gt;
  
  
  Why Teams Are Evaluating Alternatives Instead of Migrating
&lt;/h2&gt;

&lt;p&gt;The forced migration has created a rare evaluation moment. Teams that adopted Opsgenie in 2018 are re-evaluating the entire category with three shifts in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI-native incident management has arrived.&lt;/strong&gt; Products like Aurora, incident.io AI SRE, Rootly AI, and PagerDuty Advance didn't exist when most Opsgenie contracts were signed. Per &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-10-29-gartner-survey-54-percent-of-infrastructure-and-operations-leaders-are-adopting-artificial-intelligence-to-cut-costs" rel="noopener noreferrer"&gt;Gartner (October 2025)&lt;/a&gt;, 54% of infrastructure and operations (I&amp;amp;O) leaders are adopting AI to cut costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-call burnout is a hiring and retention problem.&lt;/strong&gt; The &lt;a href="https://www.catchpoint.com/learn/sre-report-2025" rel="noopener noreferrer"&gt;Catchpoint SRE Report 2025&lt;/a&gt; found that roughly 70% of SREs cite on-call stress as a direct cause of burnout, and toil rose to 30% of SRE work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downtime costs have climbed.&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/newsroom/study-cost-of-incidents/" rel="noopener noreferrer"&gt;PagerDuty's 2024 research&lt;/a&gt; put the average cost of a major incident at $794,000, or $4,537 per minute. &lt;a href="https://itic-corp.com/itic-2024-hourly-cost-of-downtime-part-2/" rel="noopener noreferrer"&gt;ITIC's 2024 survey&lt;/a&gt; found 97% of large enterprises say an hour of downtime costs them over $100,000.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Against this backdrop, "like-for-like Opsgenie replacement" is no longer the only question — many teams are asking whether the replacement should also do autonomous investigation, not just alerting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"By 2030, 75% of IT work will be human plus AI, 25% will be AI-only, and zero percent will be human-only." — &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-11-10-gartner-survey-finds-artificial-intelligence-will-touch-all-information-technology-work-by-2030" rel="noopener noreferrer"&gt;Gartner CIO survey of 700+ CIOs, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Top Opsgenie Alternatives in 2026
&lt;/h2&gt;

&lt;p&gt;Pricing and capabilities below were checked against each vendor's official site in April 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Starting price&lt;/th&gt;
&lt;th&gt;Free plan&lt;/th&gt;
&lt;th&gt;Open source&lt;/th&gt;
&lt;th&gt;AI-native&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora by Arvo AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 self-hosted&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Yes (agentic)&lt;/td&gt;
&lt;td&gt;OSS teams wanting alerting + autonomous RCA in one stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.pagerduty.com/pricing/incident-management/" rel="noopener noreferrer"&gt;$21/user/mo&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14-day trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (PagerDuty Advance, $415+/mo)&lt;/td&gt;
&lt;td&gt;Enterprises wanting the incumbent with AI add-ons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ilert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to €49/user/mo Scale&lt;/td&gt;
&lt;td&gt;Yes (5 responders)&lt;/td&gt;
&lt;td&gt;Partial (MCP server)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;EU-based teams requiring GDPR data residency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Squadcast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.squadcast.com/pricing" rel="noopener noreferrer"&gt;$9/user/mo Pro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Yes (5 users)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Small SRE teams on tight budgets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rootly OnCall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;From $20/user/mo&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;Partial (MCP, Agents JSON)&lt;/td&gt;
&lt;td&gt;Yes (AI SRE standalone)&lt;/td&gt;
&lt;td&gt;Teams wanting modular IR + on-call + AI SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incident.io On-call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$19 base + $10 add-on&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (AI SRE)&lt;/td&gt;
&lt;td&gt;Slack-native incident coordination with AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FireHydrant Signals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (AI Copilot)&lt;/td&gt;
&lt;td&gt;Teams preferring pay-per-alert over per-seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;xMatters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.xmatters.com/pricing" rel="noopener noreferrer"&gt;$39/user/mo base&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Yes (10 users)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Everbridge customers needing codeless workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana OnCall OSS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;AGPLv3 (archived)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Not recommended&lt;/strong&gt; — archived March 24, 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Product Notes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt; — Most mature alerting product. PagerDuty Advance adds AI agents (SRE, Scribe, Shift) but requires a paid base plan and a separate $415+/mo Advance subscription. AIOps features require a $699+/mo add-on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ilert&lt;/strong&gt; — EU-hosted with a clear GDPR and data-sovereignty story; customer data handled by its &lt;a href="https://www.ilert.com/product/ilert-ai" rel="noopener noreferrer"&gt;AI SRE&lt;/a&gt; is opted out of LLM training. The free tier includes 5 responders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Squadcast&lt;/strong&gt; — &lt;a href="https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response" rel="noopener noreferrer"&gt;Acquired by SolarWinds on March 3, 2025&lt;/a&gt;. Roadmap now driven by SolarWinds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; — Rootly AI Labs launched February 20, 2026; Rootly MCP GA April 2, 2026. Rootly sells IR, On-Call, and AI SRE as standalone products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; — &lt;a href="https://incident.io/blog/introducing-ai-sre" rel="noopener noreferrer"&gt;$62M Series B&lt;/a&gt; funded the launch of AI SRE — an always-on agent that investigates alerts, drafts PRs, and can autoresolve incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant&lt;/strong&gt; — &lt;a href="https://firehydrant.com/blog/firehydrant-to-be-acquired-by-freshworks/" rel="noopener noreferrer"&gt;Acquisition by Freshworks expected to close Q1 2026&lt;/a&gt;; FireHydrant will become the incident layer inside Freshservice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana OnCall&lt;/strong&gt; — &lt;a href="https://grafana.com/blog/grafana-oncall-maintenance-mode/" rel="noopener noreferrer"&gt;Entered maintenance mode March 11, 2025 and archived March 24, 2026&lt;/a&gt;. Do not start new deployments. Grafana is consolidating on a unified Cloud IRM app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splunk On-Call (VictorOps)&lt;/strong&gt; — Pricing not publicly listed. &lt;a href="https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2024/m03/cisco-completes-acquisition-of-splunk.html" rel="noopener noreferrer"&gt;Cisco completed its $28B Splunk acquisition in March 2024&lt;/a&gt;; no official EOL announcement as of April 2026, but the product has seen minimal public investment since.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Aurora Integrates with Opsgenie and JSM Operations
&lt;/h2&gt;

&lt;p&gt;Aurora is &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;open-source agentic incident management&lt;/a&gt; that works alongside Opsgenie (and the JSM Operations successor). Most AI incident tools have already deprecated Opsgenie support ahead of the 2027 shutdown — Aurora supports both so teams can run their migration on their own timeline. The integration is &lt;a href="https://arvo-ai.github.io/aurora/docs/integrations/opsgenie-jsm/" rel="noopener noreferrer"&gt;fully documented in Aurora's docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Aurora does with Opsgenie alerts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dual authentication&lt;/strong&gt; — Accepts either a native Opsgenie GenieKey (US or EU region) or a JSM Operations Atlassian API token. Credentials are stored encrypted in HashiCorp Vault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook ingestion&lt;/strong&gt; — Receives Create, Acknowledge, Close, and custom alert actions. Only Create triggers an investigation, preventing duplicates from acknowledgement webhooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert correlation&lt;/strong&gt; — Aurora's AlertCorrelator groups incoming alerts with existing incidents by service, title, and time proximity. Correlated alerts attach to the parent incident instead of spawning a new one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority mapping&lt;/strong&gt; — Opsgenie priorities map deterministically: P1 → critical, P2 → high, P3 → medium, P4/P5 → low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service extraction&lt;/strong&gt; — Aurora first looks for a &lt;code&gt;service:xxx&lt;/code&gt; tag on the alert, then falls back to the source and entity fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous RCA&lt;/strong&gt; — On alert creation, Aurora creates an incident record, generates an AI summary, and launches a LangGraph-orchestrated agent that queries your cloud infrastructure to find the root cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional JSM commenting&lt;/strong&gt; — For JSM Operations users, Aurora posts an "RCA in progress" comment back onto the linked Jira incident and updates it with findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot query surface&lt;/strong&gt; — Engineers can ask Aurora in natural language: &lt;em&gt;"Who is on-call right now?"&lt;/em&gt;, &lt;em&gt;"Show me P1 alerts from the last 24 hours"&lt;/em&gt;, &lt;em&gt;"Get details for alert ABC-123"&lt;/em&gt;. Aurora queries 8 Opsgenie resource types (alerts, alert details, incidents, incident details, services, on-call, schedules, teams) via parallel API calls.&lt;/li&gt;
&lt;/ul&gt;
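
&lt;p&gt;The ingestion rules in the list above (only Create triggers an investigation, the fixed priority mapping, and the tag-first service lookup) can be sketched in a few lines. This is not Aurora's actual code: the function names, the payload field names (&lt;code&gt;action&lt;/code&gt;, &lt;code&gt;alert&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;entity&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;), and the fallback severity for a missing priority are assumptions made for illustration.&lt;/p&gt;

```python
from typing import Optional

# Priority mapping exactly as described in the list above.
PRIORITY_TO_SEVERITY = {
    "P1": "critical",
    "P2": "high",
    "P3": "medium",
    "P4": "low",
    "P5": "low",
}


def should_investigate(action: str) -> bool:
    """Only Create starts an investigation; Acknowledge, Close, etc. do not."""
    return action == "Create"


def extract_service(alert: dict) -> Optional[str]:
    """Prefer a `service:xxx` tag, then fall back to source, then entity."""
    for tag in alert.get("tags", []):
        if tag.startswith("service:"):
            return tag.split(":", 1)[1]
    return alert.get("source") or alert.get("entity")


def handle_webhook(payload: dict) -> Optional[dict]:
    """Turn a webhook payload into an incident stub, or None if ignored.

    The payload shape is an assumption; check it against real webhook bodies.
    """
    if not should_investigate(payload.get("action", "")):
        return None
    alert = payload.get("alert", {})
    return {
        "title": alert.get("message"),
        # Hypothetical choice: treat a missing or unknown priority as "medium".
        "severity": PRIORITY_TO_SEVERITY.get(alert.get("priority"), "medium"),
        "service": extract_service(alert),
    }


if __name__ == "__main__":
    created = {
        "action": "Create",
        "alert": {
            "message": "Checkout latency high",
            "priority": "P1",
            "tags": ["env:prod", "service:checkout"],
        },
    }
    print(handle_webhook(created))
    print(handle_webhook({"action": "Acknowledge", "alert": {}}))  # None
```

&lt;p&gt;Filtering on the action before doing any other work is what keeps acknowledgement and close webhooks from spawning duplicate investigations.&lt;/p&gt;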

&lt;blockquote&gt;
&lt;p&gt;"Most AI investigation tools only work with PagerDuty. We built Aurora to meet SRE teams where they already live — including Opsgenie and JSM — so AI-powered RCA isn't gated on migrating your alerting stack first." — Noah Casarotto-Dinning, CEO at Arvo AI&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To self-host Aurora, clone the repository and run the provided make targets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Migrate Off Opsgenie Before April 5, 2027
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; administrator access to your Opsgenie account, access to your monitoring stack, and a target destination decided (JSM Operations, Compass, or a third-party alternative).&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Are Staying with Atlassian
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory your Opsgenie config.&lt;/strong&gt; Document integrations, escalation policies, routing rules, heartbeats, on-call schedules, and custom roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose JSM Operations vs Compass.&lt;/strong&gt; Pick JSM if you need ITSM workflows (change, problem, incident); pick Compass if you want alerting tied to a service catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify feature parity.&lt;/strong&gt; Review the &lt;a href="https://support.atlassian.com/jira-service-management-cloud/docs/start-shifting-from-opsgenie-to-jira-service-management/" rel="noopener noreferrer"&gt;Atlassian shifting guide&lt;/a&gt; for features that do not migrate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export historical data.&lt;/strong&gt; Alert data in JSM auto-deletes after a retention window — export anything needed for audit or compliance first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the in-product migration tool.&lt;/strong&gt; Atlassian provides a guided migration that copies your data to JSM or Compass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-authenticate chat integrations.&lt;/strong&gt; Re-authorize Slack and Microsoft Teams — OAuth grants do not transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update API endpoints.&lt;/strong&gt; Every consumer of the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replan the mobile rollout.&lt;/strong&gt; The standalone Opsgenie mobile app stops working — responders move to the Jira mobile app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close Opsgenie within 120 days.&lt;/strong&gt; After migration, Opsgenie runs in parallel for up to 120 days, then auto-shuts down.&lt;/li&gt;
&lt;/ol&gt;
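
&lt;p&gt;The inventory step can be partly scripted against the Opsgenie REST API while it is still running. The sketch below only builds the requests and does not send them. It assumes the v2 list endpoints and the &lt;code&gt;GenieKey&lt;/code&gt; authorization header from Opsgenie's public API documentation, so verify paths and required key permissions for your account before use. &lt;code&gt;build_inventory_requests&lt;/code&gt; and the placeholder key are hypothetical.&lt;/p&gt;

```python
import urllib.request

# Assumed base URLs for the US and EU Opsgenie regions.
API_HOSTS = {
    "us": "https://api.opsgenie.com",
    "eu": "https://api.eu.opsgenie.com",
}

# Assumed v2 list endpoints covering most of the inventory in step 1.
INVENTORY_PATHS = [
    "/v2/integrations",
    "/v2/escalations",
    "/v2/schedules",
    "/v2/heartbeats",
    "/v2/teams",
]


def build_inventory_requests(api_key: str, region: str = "us") -> list:
    """Build one GET request per inventory endpoint without sending anything."""
    host = API_HOSTS[region]
    headers = {"Authorization": "GenieKey " + api_key}
    return [
        urllib.request.Request(host + path, headers=headers)
        for path in INVENTORY_PATHS
    ]


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; nothing is sent over the network here.
    for request in build_inventory_requests("YOUR_API_KEY", region="eu"):
        print(request.get_method(), request.full_url)
```

&lt;p&gt;Each request can then be passed to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; and the JSON response saved to disk, which also gives you a snapshot to diff against the migrated configuration.&lt;/p&gt;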

&lt;h3&gt;
  
  
  If You Are Evaluating Alternatives
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shortlist two or three alternatives&lt;/strong&gt; using the comparison table above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a 90-day parallel trial&lt;/strong&gt; alongside Opsgenie — most vendors offer free trials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate the integrations that matter&lt;/strong&gt; — especially monitoring tool webhooks and your chat platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure MTTR and on-call satisfaction&lt;/strong&gt; against your Opsgenie baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide before Atlassian's 120-day cutover window closes&lt;/strong&gt; on any migration you start with JSM or Compass.&lt;/li&gt;
&lt;/ol&gt;
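
&lt;p&gt;Measuring MTTR against the Opsgenie baseline needs nothing more than opened and resolved timestamps exported from each tool. The sketch below is a minimal example with made-up incidents; &lt;code&gt;mttr_minutes&lt;/code&gt; is a hypothetical helper, and it treats MTTR as mean time from alert creation to resolution, which may differ from how your team defines it.&lt;/p&gt;

```python
from datetime import datetime
from statistics import mean


def mttr_minutes(incidents: list) -> float:
    """Mean time to resolve in minutes over (opened, resolved) datetime pairs."""
    durations = [
        (resolved - opened).total_seconds() / 60
        for opened, resolved in incidents
    ]
    return mean(durations)


if __name__ == "__main__":
    # Made-up sample data: two incidents per tool, for illustration only.
    opsgenie_baseline = [
        (datetime(2026, 3, 1, 10, 0), datetime(2026, 3, 1, 10, 45)),
        (datetime(2026, 3, 2, 14, 0), datetime(2026, 3, 2, 15, 15)),
    ]
    trial_tool = [
        (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 9, 30)),
        (datetime(2026, 4, 2, 13, 0), datetime(2026, 4, 2, 13, 50)),
    ]
    print(mttr_minutes(opsgenie_baseline))  # 60.0
    print(mttr_minutes(trial_tool))         # 40.0
```

&lt;p&gt;With only a handful of incidents the mean is noisy, so compare like-for-like severities and collect data across the whole trial before drawing conclusions.&lt;/p&gt;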




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When is Opsgenie being shut down?&lt;/strong&gt;&lt;br&gt;
Atlassian will shut down Opsgenie permanently on April 5, 2027. End of sale was June 4, 2025 — no new signups, upgrades, or downgrades are allowed. On April 5, 2027 the service will be disabled and any data that has not been migrated to Jira Service Management or Compass will be permanently deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I still buy Opsgenie in 2026?&lt;/strong&gt;&lt;br&gt;
No. Atlassian closed new Opsgenie sales on June 4, 2025. Existing customers can continue using their current Opsgenie subscription until April 5, 2027 but cannot upgrade, downgrade, or add new users beyond their existing plan limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the official Opsgenie migration paths?&lt;/strong&gt;&lt;br&gt;
Atlassian offers two paths: Jira Service Management (JSM) Operations for ITSM teams needing change, problem, and incident workflows, and Compass for DevOps/SRE teams wanting alerting paired with a service catalog. Both share the same Operations engine, so schedules, alerts, and policies sync if you use both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will my Opsgenie data be preserved after migration?&lt;/strong&gt;&lt;br&gt;
Only data you explicitly migrate through Atlassian's in-product migration tool is preserved. Unlike legacy Opsgenie Enterprise, JSM automatically deletes alert data after a retention window — so you must export anything needed for compliance or audit before migration. Some features like alert creation rules and custom roles do not carry over at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much does Opsgenie cost in 2026?&lt;/strong&gt;&lt;br&gt;
Existing standalone customers on the Essentials plan pay $9.45/user/month billed annually or $11.55/user/month billed monthly at the 100-user tier. Standard and Enterprise add full routing, SSO, heartbeats, and advanced reporting. Incoming call routing is billed separately at $0.10/minute (US/Canada) and $0.35/minute (international). New signups are no longer accepted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the best Opsgenie alternatives?&lt;/strong&gt;&lt;br&gt;
The strongest 2026 alternatives are PagerDuty (incumbent with AI add-ons), incident.io (Slack-native with AI SRE), ilert (EU-hosted, GDPR-focused), Squadcast (budget-friendly, SolarWinds-owned), Rootly (modular IR + on-call + AI SRE), and Aurora by Arvo AI (open-source agentic RCA with Opsgenie and JSM support). Grafana OnCall OSS was archived in March 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Opsgenie support AI-powered root cause analysis?&lt;/strong&gt;&lt;br&gt;
Standalone Opsgenie is an alerting and on-call product — it does not perform root cause analysis. Atlassian is adding AIOps features (alert grouping, automated resolutions) to JSM and Compass. Teams wanting autonomous multi-step RCA typically pair Opsgenie with a dedicated tool like Aurora, which ingests Opsgenie webhooks and investigates incidents automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens to my Opsgenie integrations after migration?&lt;/strong&gt;&lt;br&gt;
Monitoring integrations (Datadog, New Relic, Prometheus) migrate automatically via Atlassian's in-product tool. Chat integrations (Slack, Microsoft Teams) must be re-authorized manually because the OAuth grants do not transfer. Custom webhooks calling the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Aurora connect to Opsgenie and JSM?&lt;/strong&gt;&lt;br&gt;
Yes. Aurora supports both standalone Opsgenie (GenieKey authentication, US and EU regions) and JSM Operations (Atlassian API token). Aurora ingests alert webhooks, runs AI-powered alert correlation to group related alerts into incidents, and autonomously investigates the root cause. For JSM users, Aurora posts findings back as comments on the linked Jira incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Jira Service Management cheaper than Opsgenie?&lt;/strong&gt;&lt;br&gt;
No. JSM Premium is widely reported by Atlassian Community users as more expensive than standalone Opsgenie Standard. Real-time outbound webhooks require JSM Premium, and Incident Command Center requires JSM Enterprise. Many Opsgenie customers see a net price increase after migration, which is why many teams are using the forced migration as an opportunity to evaluate alternatives.&lt;/p&gt;




&lt;p&gt;Related reading: &lt;a href="https://www.arvoai.ca/blog/top-10-aiops-platforms-free-root-cause-analysis-2026" rel="noopener noreferrer"&gt;Top 10 AIOps Platforms Offering Free Root Cause Analysis&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/pagerduty-alternative-root-cause-analysis" rel="noopener noreferrer"&gt;PagerDuty Alternative: Open-Source Root Cause Analysis&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;What is Agentic Incident Management?&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;Open Source Incident Management: Why It Matters&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All Opsgenie, JSM, Compass, and alternative-vendor claims verified from official sources in April 2026.&lt;/strong&gt; Last updated: April 21, 2026.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.arvoai.ca/blog/opsgenie-complete-guide-2026" rel="noopener noreferrer"&gt;arvoai.ca/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;By the team at Arvo AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
