<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Austin Zhai</title>
    <description>The latest articles on DEV Community by Austin Zhai (@austinzhai).</description>
    <link>https://dev.to/austinzhai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972651%2F57d9bf20-d191-4aab-8b34-5838b0529cf8.jpeg</url>
      <title>DEV Community: Austin Zhai</title>
      <link>https://dev.to/austinzhai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/austinzhai"/>
    <language>en</language>
    <item>
      <title>I open-sourced an AI agent that runs ops from chat apps</title>
      <dc:creator>Austin Zhai</dc:creator>
      <pubDate>Sun, 07 Jun 2026 16:48:36 +0000</pubDate>
      <link>https://dev.to/austinzhai/i-open-sourced-an-ai-agent-that-runs-ops-from-chat-apps-348j</link>
      <guid>https://dev.to/austinzhai/i-open-sourced-an-ai-agent-that-runs-ops-from-chat-apps-348j</guid>
      <description>&lt;h2&gt;
  
  
  The 2:47 AM problem
&lt;/h2&gt;

&lt;p&gt;A Prometheus alert fired at 02:47 last quarter. Something something CPU,&lt;br&gt;
something something device 4. I didn't know what was actually broken yet.&lt;br&gt;
I opened Grafana, eyeballed the dashboard, copy-pasted the &lt;code&gt;device_id&lt;/code&gt;&lt;br&gt;
into a LogQL query, watched the wall of &lt;code&gt;connection reset by peer&lt;/code&gt;,&lt;br&gt;
ssh-ed in, ran &lt;code&gt;top&lt;/code&gt;, ran &lt;code&gt;journalctl -u prod-api&lt;/code&gt;, grepped the wiki&lt;br&gt;
for a runbook with "connection reset" in it, gave up on the wiki,&lt;br&gt;
asked the team Slack, somebody replied 4 minutes later with "did you&lt;br&gt;
check the SNI thing?", I checked the SNI thing, that was it.&lt;/p&gt;

&lt;p&gt;Twenty-two minutes from page to first useful action. Half of that was&lt;br&gt;
&lt;strong&gt;context-switching across tools&lt;/strong&gt;, not actual investigation. The other&lt;br&gt;
half was Grafana's tabs being slow.&lt;/p&gt;

&lt;p&gt;It turns out almost every ops question I've ever had has the same shape:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A specific metric crossed a threshold.&lt;/li&gt;
&lt;li&gt;Match it against logs around that timestamp.&lt;/li&gt;
&lt;li&gt;Maybe pull a trace to see the latency tail.&lt;/li&gt;
&lt;li&gt;Walk the service graph one or two hops upstream.&lt;/li&gt;
&lt;li&gt;Look at a recent change.&lt;/li&gt;
&lt;li&gt;Decide whether to roll back, ssh in, or open an incident.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's a sequence of &lt;strong&gt;tool calls over a fixed set of data sources&lt;/strong&gt;.&lt;br&gt;
Which is exactly what an LLM tool-use loop is built for. But every&lt;br&gt;
AIOps product I tried wanted me to live in &lt;em&gt;their&lt;/em&gt; UI — and ops&lt;br&gt;
doesn't actually happen in a UI. It happens in &lt;strong&gt;Slack or&lt;br&gt;
Telegram&lt;/strong&gt;, because that's where the on-call rotation, the runbooks,&lt;br&gt;
the post-mortems, and the "did you check the SNI thing?" replies all&lt;br&gt;
live.&lt;/p&gt;
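&lt;p&gt;To make that shape concrete, here is a minimal Python sketch of a tool-use loop. It is not Ongrid's code: the three stub tools, the canned strings they return, and &lt;code&gt;plan_next&lt;/code&gt; (a stand-in for the LLM deciding what to call next) are all hypothetical.&lt;/p&gt;

```python
from typing import Callable, Optional

Tool = Callable[[dict], str]

# Hypothetical stub tools; real ones would query metrics, logs, and change events.
TOOLS: dict[str, Tool] = {
    "metrics": lambda ctx: f"5xx rate up on device {ctx['device_id']}",
    "logs": lambda ctx: "connection reset by peer",
    "changes": lambda ctx: "SNI hostname rotated",
}


def plan_next(evidence: list[tuple[str, str]]) -> Optional[str]:
    """Stand-in for the LLM: pick the first tool not yet called, or stop."""
    called = {name for name, _ in evidence}
    for name in TOOLS:
        if name not in called:
            return name
    return None


def investigate(ctx: dict, max_steps: int = 10) -> list[tuple[str, str]]:
    """Call tools one at a time, feeding accumulated evidence back to the planner."""
    evidence: list[tuple[str, str]] = []
    for _ in range(max_steps):
        name = plan_next(evidence)
        if name is None:
            break
        evidence.append((name, TOOLS[name](ctx)))
    return evidence
```

&lt;p&gt;Calling &lt;code&gt;investigate({"device_id": "4"})&lt;/code&gt; walks metrics, then logs, then changes, and returns the evidence in call order. The step cap is there so a confused planner cannot loop forever.&lt;/p&gt;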

&lt;p&gt;So I built Ongrid. It's open-source, Apache 2.0, and you can&lt;br&gt;
&lt;code&gt;docker compose up&lt;/code&gt; it. This post covers what it is, how it works, and what&lt;br&gt;
doesn't work yet.&lt;/p&gt;
&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;In one sentence: &lt;strong&gt;Ongrid is an AI agent that understands your systems,&lt;br&gt;
finds the root cause, and acts — driven from your team's chat apps.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A working session looks like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;@you, in a Slack channel:&lt;/strong&gt; &lt;em&gt;why is device_4 throwing 5xx&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ongrid (90 seconds later):&lt;/strong&gt; &lt;em&gt;Looked at &lt;code&gt;http_requests_total{device_id="4",status=~"5.."}&lt;/code&gt;&lt;br&gt;
in the last 30m — error rate spiked from 0.1% to 14% at 02:47. Loki&lt;br&gt;
shows 240 &lt;code&gt;read: connection reset by peer&lt;/code&gt; entries on the&lt;br&gt;
&lt;code&gt;api-gateway&lt;/code&gt; pod between 02:47 and 02:51 (LogQL: &lt;code&gt;{device_id="4",unit="api-gateway.service"} |~ "connection reset"&lt;/code&gt;).&lt;br&gt;
Trace sampled at 02:48 spent 4.2s in TLS handshake to upstream&lt;br&gt;
&lt;code&gt;auth.internal&lt;/code&gt;. Recent change events on &lt;code&gt;auth.internal&lt;/code&gt;: SNI hostname&lt;br&gt;
rotated at 02:46 by deployment &lt;code&gt;auth-rotate-2026-Q2&lt;/code&gt;. Likely root&lt;br&gt;
cause: &lt;code&gt;auth.internal&lt;/code&gt; no longer matches the SNI we present.&lt;br&gt;
Suggested action: roll back &lt;code&gt;auth-rotate-2026-Q2&lt;/code&gt; OR push the new&lt;br&gt;
SNI to api-gateway. I haven't done either.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the whole loop: query, correlate, ground, propose, &lt;strong&gt;wait for&lt;br&gt;
the human to say go&lt;/strong&gt;.&lt;/p&gt;
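&lt;p&gt;An error-rate figure like the one in that reply usually comes from dividing two &lt;code&gt;rate()&lt;/code&gt; queries. Here is a minimal Python sketch, assuming the metric and label names from the example. The base URL is a placeholder, and this post does not say whether Ongrid builds its queries this way. The &lt;code&gt;/api/v1/query&lt;/code&gt; path is Prometheus's instant-query HTTP endpoint.&lt;/p&gt;

```python
from urllib.parse import urlencode


def error_ratio_promql(device_id: str, window: str = "5m") -> str:
    """Build a PromQL expression: 5xx request rate divided by total request rate."""
    selector = f'device_id="{device_id}"'
    errors = f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query; base_url is a placeholder here."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
```

&lt;p&gt;The result is a ratio between 0 and 1, so 0.14 corresponds to the 14% in the sample reply.&lt;/p&gt;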

&lt;p&gt;Here's a real one from the product — an alert investigated end to end,&lt;br&gt;
with the root cause, the evidence chain, and the pinpointed service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26wc04lrbf92ba17uk3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26wc04lrbf92ba17uk3o.png" alt="Ongrid auto-investigating a resolved " width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture (the parts that matter)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐                        ┌─────────────────┐
│   edge agent    │    outbound tunnel     │     manager     │
│   (on host)     │ ─────────────────────▶ │   agent + LLM   │
│                 │   (outbound only —     │     router      │ ──▶ Slack /
│ • procmetrics   │    manager rides it    │                 │     Telegram
│ • host_probes   │ ◀──── back for ─────── │   Prom · Loki   │ ◀── (chat apps)
│ • promtail      │      host probes)      │  Tempo·Grafana  │
│ • tracegen      │                        │     Qdrant      │
└─────────────────┘                        └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three things worth pulling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The edge dials out.&lt;/strong&gt; Hosts run &lt;code&gt;ongrid-edge&lt;/code&gt;, which opens an&lt;br&gt;
&lt;strong&gt;outbound&lt;/strong&gt; tunnel to the manager. No port 22, 80, or 443&lt;br&gt;
needs to be reachable from outside the host. The manager rides the&lt;br&gt;
same tunnel back when it wants to run a host probe. Result: ops boxes&lt;br&gt;
behind NAT or in a private VPC are first-class targets, not exotic&lt;br&gt;
edge-cases.&lt;/p&gt;
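&lt;p&gt;The dial-out pattern can be illustrated with a toy Python sketch. It assumes plain TCP on localhost and a one-line text protocol invented for the example. The real tunnel (broker, authentication, framing) is not shown and may look nothing like this.&lt;/p&gt;

```python
import socket
import threading


def manager(listener: socket.socket, results: list) -> None:
    """Accept the edge's connection, then send a probe request back over it."""
    conn, _ = listener.accept()
    with conn, conn.makefile("r") as reader:
        conn.sendall(b"probe:disk\n")  # manager-to-edge, on the edge's own connection
        results.append(reader.readline().strip())


def edge(port: int) -> None:
    """Dial out to the manager; the edge never listens on any port."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        with sock.makefile("r") as reader:
            request = reader.readline().strip()
            sock.sendall(f"result:{request}\n".encode())


def run_demo() -> str:
    results: list = []
    # Port 0 lets the OS pick a free port for the toy manager.
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(target=manager, args=(listener, results))
        worker.start()
        edge(port)
        worker.join(timeout=5)
    return results[0]
```

&lt;p&gt;Only the manager side calls &lt;code&gt;accept&lt;/code&gt;. The edge side only connects, which is why a host behind NAT can still receive probe requests.&lt;/p&gt;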

&lt;p&gt;&lt;strong&gt;2. The agent is a graph, not a chatbot.&lt;/strong&gt; The LLM-facing kernel is&lt;br&gt;
&lt;a href="https://github.com/cloudwego/eino" rel="noopener noreferrer"&gt;eino&lt;/a&gt;'s graph runner with our own&lt;br&gt;
tool registry on top. Two design decisions that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator + specialists.&lt;/strong&gt; A coordinator agent dispatches
per-incident work to a specialist (SRE, network, disk, ops,
reviewer). The specialists are read-only by default; mutating
actions go through a separate reviewer agent that gates them. You
see exactly which agent issued which tool call in the timeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool sandboxing by class.&lt;/strong&gt; 26+ built-in tools (&lt;code&gt;query_promql&lt;/code&gt;,
&lt;code&gt;query_logql&lt;/code&gt;, &lt;code&gt;query_traceql&lt;/code&gt;, &lt;code&gt;expand_topology&lt;/code&gt;,
&lt;code&gt;find_topology_node&lt;/code&gt;, &lt;code&gt;query_change_events&lt;/code&gt;, &lt;code&gt;bash&lt;/code&gt; with a policy
whitelist, &lt;code&gt;host_probe_disk/cpu/net/...&lt;/code&gt;, knowledge-base &lt;code&gt;search&lt;/code&gt;,
etc.) — each tagged with a class (&lt;code&gt;safe-read&lt;/code&gt;, &lt;code&gt;safe-write&lt;/code&gt;,
&lt;code&gt;dangerous&lt;/code&gt;). Operators decide per role which classes are callable.&lt;/li&gt;
&lt;/ul&gt;
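&lt;p&gt;A rough Python sketch of class-based gating follows. The three class names come from the list above. Which class each tool carries, the role names, and the policy table are assumptions made for the example, not Ongrid's actual configuration.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class ToolClass(Enum):
    SAFE_READ = "safe-read"
    SAFE_WRITE = "safe-write"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Tool:
    name: str
    tool_class: ToolClass


# Which class each tool carries is an assumption made for this example.
REGISTRY = {
    tool.name: tool
    for tool in (
        Tool("query_promql", ToolClass.SAFE_READ),
        Tool("query_logql", ToolClass.SAFE_READ),
        Tool("bash", ToolClass.DANGEROUS),
    )
}

# Hypothetical roles: read-only specialists, and a reviewer that may also write.
ROLE_POLICY = {
    "specialist": {ToolClass.SAFE_READ},
    "reviewer": {ToolClass.SAFE_READ, ToolClass.SAFE_WRITE},
}


def is_callable(role: str, tool_name: str) -> bool:
    """Allow a call only if the tool exists and its class is granted to the role."""
    tool = REGISTRY.get(tool_name)
    if tool is None:
        return False
    return tool.tool_class in ROLE_POLICY.get(role, set())
```

&lt;p&gt;Unknown roles and unknown tools both fall through to a deny, so forgetting to add a policy entry fails closed.&lt;/p&gt;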

&lt;p&gt;&lt;strong&gt;3. The LLM is pluggable.&lt;/strong&gt; We ship a multi-provider router so you&lt;br&gt;
can drop in &lt;strong&gt;Anthropic, OpenAI, Zhipu (GLM), Gemini, DeepSeek, Kimi,&lt;br&gt;
or any OpenAI-compatible relay&lt;/strong&gt; — switch the default at runtime, no&lt;br&gt;
restart. I run a small, cheap model for the pre-classification step&lt;br&gt;
and a stronger one for the actual investigation. Whatever model you&lt;br&gt;
have a key for works.&lt;/p&gt;
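&lt;p&gt;Here is a minimal Python sketch of that routing idea, assuming a simple in-memory table. The class, method names, step names, and model ids are hypothetical and stand in for whatever the real router exposes.&lt;/p&gt;

```python
class ModelRouter:
    """Maps pipeline steps to providers; the default can be changed at runtime."""

    def __init__(self) -> None:
        self._models: dict[str, str] = {}       # provider name -> model id
        self._step_routes: dict[str, str] = {}  # step name -> provider name
        self._default = ""

    def register(self, provider: str, model: str) -> None:
        self._models[provider] = model
        if not self._default:
            self._default = provider  # first registered provider becomes the default

    def _require(self, provider: str) -> None:
        if provider not in self._models:
            raise KeyError(f"unknown provider: {provider}")

    def set_default(self, provider: str) -> None:
        self._require(provider)
        self._default = provider

    def route_step(self, step: str, provider: str) -> None:
        self._require(provider)
        self._step_routes[step] = provider

    def model_for(self, step: str) -> str:
        """Use the step's pinned provider if there is one, else the current default."""
        provider = self._step_routes.get(step, self._default)
        return self._models[provider]
```

&lt;p&gt;Pinning a step such as pre-classification to a cheap provider while leaving everything else on the default means that switching the default changes the investigation model without touching the pinned step.&lt;/p&gt;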

&lt;p&gt;Here's the agent working a live task — investigating disk usage on a&lt;br&gt;
host and summarising what it found, with a concrete cleanup plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff77dtz0vjnrkso1ecgyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff77dtz0vjnrkso1ecgyx.png" alt="The agent investigating disk usage on a host" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why "chat-native" matters more than it sounds
&lt;/h2&gt;

&lt;p&gt;Every ops AI tool I've seen treats the chat integration as a&lt;br&gt;
notification sink. "We alert into Slack, then you click through to&lt;br&gt;
our dashboard." That misses the point. The conversation needs to&lt;br&gt;
happen &lt;em&gt;inside&lt;/em&gt; the chat client, because the conversation already has&lt;br&gt;
context — who's on call, what was the last incident, what's&lt;br&gt;
half-finished from 6 hours ago, who already tried what.&lt;/p&gt;

&lt;p&gt;Ongrid does both directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outbound:&lt;/strong&gt; alerts, RCA summaries, post-mortem drafts go to the
channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound:&lt;/strong&gt; you @-mention the bot in the same thread to ask
follow-ups. The agent has the thread context, the original alert,
the existing investigation, the knowledge base, the source code
(if you give it a repo). Replies stream back into the same thread.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two-way today: &lt;strong&gt;Slack&lt;/strong&gt; (socket-mode) and&lt;br&gt;
&lt;strong&gt;Telegram&lt;/strong&gt; (getUpdates polling). DingTalk and&lt;br&gt;
WeCom work on the outbound side. The clean way to think about it: each&lt;br&gt;
provider is a thin transport adapter into the same agent runtime.&lt;/p&gt;
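&lt;p&gt;The adapter idea might look roughly like this in Python. The &lt;code&gt;Transport&lt;/code&gt; protocol, the recording adapter, and &lt;code&gt;agent_runtime&lt;/code&gt; are all hypothetical stand-ins. Nothing here talks to Slack or Telegram.&lt;/p&gt;

```python
from typing import Protocol


class Transport(Protocol):
    """What the runtime needs from any chat provider: post text into a thread."""

    def send(self, thread_id: str, text: str) -> None: ...


class RecordingTransport:
    """Stand-in for a Slack or Telegram adapter; it records what it would send."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: list[tuple[str, str]] = []

    def send(self, thread_id: str, text: str) -> None:
        self.sent.append((thread_id, text))


def agent_runtime(message: str) -> str:
    """Placeholder for the shared agent; a real one would run the investigation."""
    return f"looking into: {message}"


def handle_mention(transport: Transport, thread_id: str, message: str) -> None:
    """Same code path for every provider: run the agent, reply in the same thread."""
    transport.send(thread_id, agent_runtime(message))
```

&lt;p&gt;Under this split, adding a provider means writing one small class with a &lt;code&gt;send&lt;/code&gt; method, while the agent logic stays untouched.&lt;/p&gt;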
&lt;h2&gt;
  
  
  What's actually rough
&lt;/h2&gt;

&lt;p&gt;The honest list, because every OSS project hides one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Telegram needs a proxy if your network can't reach its API
directly.&lt;/strong&gt; &lt;code&gt;install.sh&lt;/code&gt; has no &lt;code&gt;--http-proxy&lt;/code&gt; flag yet. We worked around it on
our test box with a &lt;code&gt;docker-compose.override.yml&lt;/code&gt; that sets
&lt;code&gt;HTTPS_PROXY&lt;/code&gt;. It works, but it's a manual step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Some providers' base models reject our default request params&lt;/strong&gt;
with a &lt;code&gt;beta-limitations&lt;/code&gt;-style error, and you have to pick a
specific variant. We should auto-detect and downgrade — open issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Anthropic provider's &lt;code&gt;anthropic_base_url&lt;/code&gt;&lt;/strong&gt; field assumes
you're either on the official endpoint or behind a relay that
speaks the same API shape. It doesn't sniff for differences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No SOP / playbook executor yet.&lt;/strong&gt; Today the agent proposes a
fix; it doesn't yet execute a multi-step runbook with double-sign
gating. On the roadmap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find more, the issue tracker is open.&lt;/p&gt;
&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ubuntu 22.04+, Debian 12+, RHEL/Rocky 9&lt;/span&gt;
wget https://github.com/ongridio/ongrid/releases/download/v0.8.2/ongrid-v0.8.2-linux-amd64.tar.xz
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xf&lt;/span&gt; ongrid-v0.8.2-linux-amd64.tar.xz &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;ongrid-v0.8.2-linux-amd64
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That gets you the manager + Prom + Loki + Tempo + Grafana + Qdrant +&lt;br&gt;
Frontier broker, all in docker-compose. Edge install is a curl-pipe&lt;br&gt;
the manager hands you out of the UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://&amp;lt;manager&amp;gt;/edge/install.sh | bash &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--access-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;... &lt;span class="nt"&gt;--secret-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;... &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--server-edge-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;manager&amp;gt;:40012 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--server-http-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;manager&amp;gt;:443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache 2.0. No telemetry phone-home. Knowledge-base contents stay in&lt;br&gt;
your Qdrant; chat history stays in your MySQL.&lt;/p&gt;

&lt;p&gt;The observability stack comes wired out of the box — and the agent&lt;br&gt;
writes the queries itself:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhr4mcchbzd8ywtyoykzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhr4mcchbzd8ywtyoykzm.png" alt="Ongrid's built-in monitoring dashboard" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to find it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/ongridio/ongrid" rel="noopener noreferrer"&gt;https://github.com/ongridio/ongrid&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://ongrid.cloud" rel="noopener noreferrer"&gt;https://ongrid.cloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demo: see the GIF at the top of the README&lt;/li&gt;
&lt;li&gt;Issues / questions: &lt;a href="https://github.com/ongridio/ongrid/issues" rel="noopener noreferrer"&gt;https://github.com/ongridio/ongrid/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love to know what breaks for you. The fastest way to make this&lt;br&gt;
better is for an SRE who isn't me to install it and tell me what's&lt;br&gt;
wrong with it. If you do, open an issue.&lt;br&gt;
I'll read every one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ongrid is Apache 2.0. The release this post talks about is v0.8.2&lt;br&gt;
(2026-06-04).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>ai</category>
      <category>aiops</category>
    </item>
  </channel>
</rss>
