I open-sourced an AI agent that runs ops from chat apps

#devops #ai #opensource #aiops

The 2:47 AM problem

A Prometheus alert fired at 02:47 last quarter. Something something CPU,
something something device 4. I didn't know what was actually broken yet.
I opened Grafana, eyeballed the dashboard, copy-pasted the device_id
into a LogQL query, watched the wall of connection reset by peer,
ssh-ed in, ran top, ran journalctl -u prod-api, grepped the wiki
for a runbook with "connection reset" in it, gave up on the wiki,
asked the team Slack, somebody replied 4 minutes later with "did you
check the SNI thing?", I checked the SNI thing, that was it.

Twenty-two minutes from page to first useful action. Half of that was
context-switching across tools, not actual investigation. The other
half was Grafana's tabs being slow.

It turns out almost every ops question I've ever had is the same shape:

A specific metric crossed a threshold.
Match it against logs around that timestamp.
Maybe pull a trace to see the latency tail.
Walk the service graph one or two hops upstream.
Look at a recent change.
Decide whether to roll back, ssh in, or open an incident.

That's a sequence of tool calls over a fixed set of data sources.
Which is exactly what an LLM tool-use loop is built for. But every
AIOps product I tried wanted me to live in their UI — and ops
doesn't actually happen in a UI. It happens in Slack or
Telegram, because that's where the on-call rotation, the runbooks,
the post-mortems, and the "did you check the SNI thing?" replies all
live.
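
To make "a sequence of tool calls over a fixed set of data sources" concrete, here is a minimal sketch of a tool-use loop. It is not Ongrid's code: `fake_model` stands in for an LLM, and the two tools and their return strings are invented for illustration.

```python
from typing import Callable

# Hypothetical read-only tools; real ones would query Prometheus, Loki, etc.
def query_metrics(target: str) -> str:
    return f"5xx rate for {target} rose at 02:47"

def query_logs(target: str) -> str:
    return f"{target}: connection reset by peer"

TOOLS: dict[str, Callable[[str], str]] = {
    "query_metrics": query_metrics,
    "query_logs": query_logs,
}

def fake_model(question: str, evidence: list[str]) -> dict:
    """Stand-in for an LLM: picks the next tool, or answers when done."""
    plan = ["query_metrics", "query_logs"]
    if len(evidence) < len(plan):
        return {"tool": plan[len(evidence)], "arg": "device_4"}
    return {"answer": " | ".join(evidence)}

def run_loop(question: str, max_steps: int = 8) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        step = fake_model(question, evidence)
        if "answer" in step:
            return step["answer"]
        tool = TOOLS[step["tool"]]          # unknown tool names raise KeyError
        evidence.append(tool(step["arg"]))  # feed the result back to the model
    return "step budget exhausted"
```

The step budget matters in practice: without it, a model that never decides it is done would keep calling tools forever.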

So I built Ongrid. It's open-source, Apache 2.0, and you can docker compose up it.
This post covers what it is, how it works, and what doesn't work yet.

What it does

In one sentence: Ongrid is an AI agent that understands your systems,
finds the root cause, and acts — driven from your team's chat apps.

A working session looks like:

@you, in a Slack channel: why is device_4 throwing 5xx

Ongrid (90 seconds later): Looked at http_requests_total{device_id="4",status=~"5.."}
in the last 30m — error rate spiked from 0.1% to 14% at 02:47. Loki
shows 240 read: connection reset by peer entries on the
api-gateway pod between 02:47 and 02:51 (LogQL: {device_id="4",unit="api-gateway.service"} |~ "connection reset").
Trace sampled at 02:48 spent 4.2s in TLS handshake to upstream
auth.internal. Recent change events on auth.internal: SNI hostname
rotated at 02:46 by deployment auth-rotate-2026-Q2. Likely root
cause: auth.internal no longer matches the SNI we present.
Suggested action: roll back auth-rotate-2026-Q2 OR push the new
SNI to api-gateway. I haven't done either.

That's the whole loop: query, correlate, ground, propose, wait for
the human to say go.
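
As a rough illustration of the "query" step, the error rate in that reply can be expressed as a ratio of two PromQL rates. The sketch below only builds the expression and the URL for Prometheus's instant-query endpoint (`/api/v1/query`). The 5m window and the base URL are assumptions, and the agent may phrase its own queries differently.

```python
from urllib.parse import urlencode

def error_ratio_promql(device_id: str, window: str = "5m") -> str:
    """Fraction of requests answered with a 5xx status over `window`."""
    errors = f'sum(rate(http_requests_total{{device_id="{device_id}",status=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{device_id="{device_id}"}}[{window}]))'
    return f"{errors} / {total}"

def instant_query_url(base_url: str, promql: str) -> str:
    """URL for Prometheus's instant-query endpoint, /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# Hypothetical Prometheus address, for illustration only.
url = instant_query_url("http://prometheus:9090", error_ratio_promql("4"))
```

A result of 0.14 from that expression corresponds to the 14% figure in the sample reply.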

[Screenshot: a real alert investigated end to end in the product, showing the
root cause, the evidence chain, and the pinpointed service]

Architecture (the parts that matter)

┌─────────────────┐                        ┌─────────────────┐
│   edge agent    │    outbound tunnel     │     manager     │
│   (on host)     │ ─────────────────────▶ │   agent + LLM   │
│                 │   (outbound only —     │     router      │ ──▶ Slack /
│ • procmetrics   │    manager rides it    │                 │     Telegram
│ • host_probes   │ ◀──── back for ─────── │   Prom · Loki   │ ◀── (chat apps)
│ • promtail      │      host probes)      │  Tempo·Grafana  │
│ • tracegen      │                        │     Qdrant      │
└─────────────────┘                        └─────────────────┘

Three things worth pulling out:

1. The edge dials out. Hosts run ongrid-edge, which opens an
outbound tunnel to the manager. No port 22, 80, or 443
needs to be reachable from outside the host. The manager rides the
same tunnel back when it wants to run a host probe. Result: ops boxes
behind NAT or in a private VPC are first-class targets, not exotic
edge cases.
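
The dial-out pattern can be sketched with plain TCP on localhost: the edge opens the connection, and the manager sends its probe request back over that same socket. This is an assumption-laden toy, not Ongrid's tunnel protocol (which this post doesn't specify). It has no TLS, auth, or multiplexing, and the `probe:disk` message format is made up.

```python
import socket
import threading

def manager(listener: socket.socket, results: list[str]) -> None:
    """Accept the edge's inbound connection, then send a probe back over it."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"probe:disk\n")              # manager -> edge, same socket
        results.append(conn.makefile("r").readline().strip())

def edge(manager_addr: tuple[str, int]) -> None:
    """Dial out to the manager and answer the probe that arrives."""
    with socket.create_connection(manager_addr) as conn:  # outbound only
        request = conn.makefile("r").readline().strip()
        # Hypothetical probe result; a real edge would inspect the host.
        conn.sendall(f"{request} -> ok\n".encode())

def demo() -> str:
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    results: list[str] = []
    m = threading.Thread(target=manager, args=(listener, results))
    m.start()
    edge(listener.getsockname())
    m.join()
    listener.close()
    return results[0]
```

Note that only the manager ever listens. The edge host never binds a port, which is why nothing on it needs to be reachable from outside.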

2. The agent is a graph, not a chatbot. The LLM-facing kernel is
eino's graph runner with our own
tool registry on top. Two design decisions that matter:

Coordinator + specialists. A coordinator agent dispatches per-incident work to a specialist (SRE, network, disk, ops, reviewer). The specialists are read-only by default; mutating actions go through a separate reviewer agent that gates them. You see exactly which agent issued which tool call in the timeline.
Tool sandboxing by class. 26+ built-in tools (query_promql, query_logql, query_traceql, expand_topology, find_topology_node, query_change_events, bash with a policy whitelist, host_probe_disk/cpu/net/..., knowledge-base search, etc.) — each tagged with a class (safe-read, safe-write, dangerous). Operators decide per-role what classes are callable.
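
Gating tools by class can be pictured as a registry that checks the caller's role before running anything. The sketch below assumes a hypothetical policy table and tags `bash` as dangerous purely for illustration. The class names come from the list above, but the real registry's types and API are not shown in this post.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    tool_class: str            # "safe-read", "safe-write", or "dangerous"
    run: Callable[[str], str]

# Hypothetical per-role policy: which classes each role may call.
ROLE_POLICY: dict[str, set[str]] = {
    "specialist": {"safe-read"},
    "reviewer": {"safe-read", "safe-write"},
}

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, role: str, name: str, arg: str) -> str:
        tool = self._tools[name]
        allowed = ROLE_POLICY.get(role, set())   # unknown roles get nothing
        if tool.tool_class not in allowed:
            raise PermissionError(f"{role} may not call {name} ({tool.tool_class})")
        return tool.run(arg)

registry = ToolRegistry()
registry.register(Tool("query_promql", "safe-read", lambda q: f"result of {q}"))
registry.register(Tool("bash", "dangerous", lambda cmd: f"ran {cmd}"))
```

With this policy, `registry.call("specialist", "query_promql", "up")` succeeds, while the same role calling `bash` raises `PermissionError`.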

3. The LLM is pluggable. We ship a multi-provider router so you
can drop in Anthropic, OpenAI, Zhipu (GLM), Gemini, DeepSeek, Kimi,
or any OpenAI-compatible relay — switch the default at runtime, no
restart. I run a small, cheap model for the pre-classification step
and a stronger one for the actual investigation. Whatever model you
have a key for works.
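
A router like this boils down to a name-to-provider map with a default that can be swapped under a lock. The sketch below is a guess at the shape, not the shipped router: the `LLMRouter` class, the stage names, and the lambda "providers" are all hypothetical stand-ins for real API clients.

```python
import threading
from typing import Callable

Provider = Callable[[str], str]   # prompt in, completion out

class LLMRouter:
    def __init__(self, default: str) -> None:
        self._providers: dict[str, Provider] = {}
        self._stage_overrides: dict[str, str] = {}
        self._default = default
        self._lock = threading.Lock()

    def register(self, name: str, provider: Provider) -> None:
        with self._lock:
            self._providers[name] = provider

    def set_default(self, name: str) -> None:
        """Swap the default provider while the process keeps running."""
        with self._lock:
            if name not in self._providers:
                raise KeyError(name)
            self._default = name

    def pin_stage(self, stage: str, name: str) -> None:
        """Route one pipeline stage to a specific provider."""
        with self._lock:
            self._stage_overrides[stage] = name

    def complete(self, stage: str, prompt: str) -> str:
        with self._lock:
            name = self._stage_overrides.get(stage, self._default)
            provider = self._providers[name]
        return provider(prompt)   # call outside the lock; it may be slow

router = LLMRouter(default="strong")
router.register("cheap", lambda p: f"cheap:{p}")
router.register("strong", lambda p: f"strong:{p}")
router.pin_stage("classify", "cheap")   # small model for pre-classification
```

Here `router.complete("classify", ...)` goes to the cheap provider and everything else follows the default, which `set_default` can change without restarting anything.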

[Screenshot: the agent working a live task, investigating disk usage on a
host and summarising what it found, with a concrete cleanup plan]

Why "chat-native" matters more than it sounds

Every ops AI tool I've seen treats the chat integration as a
notification sink. "We alert into Slack, then you click through to
our dashboard." That misses the point. The conversation needs to
happen inside the chat client, because the conversation already has
context — who's on call, what was the last incident, what's
half-finished from 6 hours ago, who already tried what.

Ongrid does both directions:

Outbound: alerts, RCA summaries, post-mortem drafts go to the channel.
Inbound: you @-mention the bot in the same thread to ask follow-ups. The agent has the thread context, the original alert, the existing investigation, the knowledge base, the source code (if you give it a repo). Replies stream back into the same thread.

Two-way support currently works on Slack (socket mode) and
Telegram (getUpdates polling). DingTalk and WeCom are
outbound-only for now. The clean way to think about it: each
provider is a thin transport adapter into the same agent runtime.
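
The "thin transport adapter" idea can be sketched as one small interface that every chat provider implements, with the agent only ever seeing a normalised message. Everything here is hypothetical: the payload shapes are invented and are not the real Slack or Telegram formats, and the runtime's actual interface isn't documented in this post.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class Message:
    thread_id: str
    text: str

class Transport(Protocol):
    def parse(self, raw: dict) -> Message: ...
    def send(self, thread_id: str, text: str) -> None: ...

class FakeSlack:
    """Hypothetical adapter; the payload shape is invented, not Slack's."""
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def parse(self, raw: dict) -> Message:
        return Message(thread_id=raw["thread"], text=raw["text"])

    def send(self, thread_id: str, text: str) -> None:
        self.sent.append((thread_id, text))

class FakeTelegram:
    """Hypothetical adapter; the payload shape is invented, not Telegram's."""
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def parse(self, raw: dict) -> Message:
        return Message(thread_id=str(raw["chat"]), text=raw["msg"])

    def send(self, thread_id: str, text: str) -> None:
        self.sent.append((thread_id, text))

def handle(transport: Transport, raw: dict, agent: Callable[[Message], str]) -> None:
    """Same agent for every provider; only parse/send differ."""
    msg = transport.parse(raw)
    transport.send(msg.thread_id, agent(msg))
```

Adding a provider then means writing one more class with `parse` and `send`; `handle` and the agent stay untouched.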

What's actually rough

The honest list, because every OSS project hides one:

Telegram needs a proxy if your network can't reach its API directly. The install.sh has no --http-proxy flag yet. We worked around it on our test box with a docker-compose.override.yml setting HTTPS_PROXY. It works, but it's a manual step.
Some providers' base models reject our default request params with a beta-limitations-style error, and you have to pick a specific variant. We should auto-detect and downgrade — open issue.
The Anthropic provider's anthropic_base_url field assumes you're either on the official endpoint or behind a relay that speaks the same shape. Doesn't sniff for differences.
No SOP / playbook executor yet. Today the agent proposes a fix; it doesn't yet execute a multi-step runbook with double-sign gating. On the roadmap.

If you find more, the issue tracker is open.

Install

# Ubuntu 22.04+, Debian 12+, RHEL/Rocky 9
wget https://github.com/ongridio/ongrid/releases/download/v0.8.2/ongrid-v0.8.2-linux-amd64.tar.xz
tar -xf ongrid-v0.8.2-linux-amd64.tar.xz && cd ongrid-v0.8.2-linux-amd64
sudo ./install.sh

That gets you the manager + Prom + Loki + Tempo + Grafana + Qdrant +
Frontier broker, all in docker-compose. Edge install is a curl-pipe
the manager hands you out of the UI:

curl -k -sSL https://<manager>/edge/install.sh | bash -s -- \
  --access-key=... --secret-key=... \
  --server-edge-addr=<manager>:40012 \
  --server-http-addr=<manager>:443

Apache 2.0. No telemetry phone-home. Knowledge-base contents stay in
your Qdrant; chat history stays in your MySQL.

The observability stack comes wired out of the box, and the agent
writes the queries itself.

Where to find it

Repo: https://github.com/ongridio/ongrid
Docs: https://ongrid.cloud
Demo: see the GIF at the top of the README
Issues / questions: https://github.com/ongridio/ongrid/issues

I'd love to know what breaks for you. The fastest way to make this
better is for an SRE who isn't me to install it and tell me what's
wrong with it. If you do, drop an issue. I'll read every one.

Ongrid is Apache 2.0. The release this post talks about is v0.8.2
(2026-06-04).