The 2:47 AM problem
A Prometheus alert fired at 02:47 last quarter. Something something CPU,
something something device 4. I didn't know what was actually broken yet.
I opened Grafana, eyeballed the dashboard, copy-pasted the device_id
into a LogQL query, watched the wall of connection reset by peer,
ssh-ed in, ran top, ran journalctl -u prod-api, grepped the wiki
for a runbook with "connection reset" in it, gave up on the wiki,
asked the team Slack, somebody replied 4 minutes later with "did you
check the SNI thing?", I checked the SNI thing, that was it.
Twenty-two minutes from page to first useful action. Half of that was
context-switching across tools, not actual investigation. The other
half was Grafana's tabs being slow.
It turns out almost every ops question I've ever had is the same shape:
- A specific metric crossed a threshold.
- Match it against logs around that timestamp.
- Maybe pull a trace to see the latency tail.
- Walk the service graph one or two hops upstream.
- Look at a recent change.
- Decide whether to roll back, ssh in, or open an incident.
That's a sequence of tool calls over a fixed set of data sources.
Which is exactly what an LLM tool-use loop is built for. But every
AIOps product I tried wanted me to live in their UI — and ops
doesn't actually happen in a UI. It happens in Slack or
Telegram, because that's where the on-call rotation, the runbooks,
the post-mortems, and the "did you check the SNI thing?" replies all
live.
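To make that shape concrete, here is a minimal sketch of such a loop in Python. Everything in it is hypothetical: the tool names, the canned tool outputs, and the scripted fake_model that stands in for a real LLM call. It is not Ongrid's code, only an illustration of a model picking tools until it has enough evidence to propose an action.

```python
from typing import Callable

# Hypothetical read-only tools with canned output; real ones would
# query Prometheus, Loki, a change log, and so on.
def query_metrics(target: str) -> str:
    return f"{target}: 5xx rate went from 0.1% to 14% at 02:47"

def query_logs(target: str) -> str:
    return f"{target}: 240 'connection reset by peer' lines, 02:47-02:51"

def query_changes(target: str) -> str:
    return f"{target}: upstream SNI hostname rotated at 02:46"

TOOLS: dict[str, Callable[[str], str]] = {
    "query_metrics": query_metrics,
    "query_logs": query_logs,
    "query_changes": query_changes,
}

def fake_model(evidence: list[str]) -> dict:
    """Stand-in for an LLM: chooses the next tool from how much evidence exists."""
    plan = ["query_metrics", "query_logs", "query_changes"]
    if len(evidence) < len(plan):
        return {"tool": plan[len(evidence)]}
    return {"proposal": "roll back the SNI rotation (awaiting human approval)"}

def investigate(target: str, max_steps: int = 10) -> tuple[list[str], str]:
    """Call tools until the model proposes an action or the step budget runs out."""
    evidence: list[str] = []
    for _ in range(max_steps):
        decision = fake_model(evidence)
        if "proposal" in decision:
            return evidence, decision["proposal"]
        evidence.append(TOOLS[decision["tool"]](target))
    return evidence, "no conclusion within step budget"

evidence, proposal = investigate("device_4")
```

The step budget is the part worth keeping in any real version: a loop that can call tools needs a hard stop that does not depend on the model deciding it is done.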
So I built Ongrid. It's open-source, Apache 2.0, and you can
docker compose up it. This post is what it is, how it works, and what
doesn't work yet.
What it does
In one sentence: Ongrid is an AI agent that understands your systems,
finds the root cause, and acts — driven from your team's chat apps.
A working session looks like:
@you, in a Slack channel: why is device_4 throwing 5xx
Ongrid (90 seconds later): Looked at
http_requests_total{device_id="4",status=~"5.."}
in the last 30m: error rate spiked from 0.1% to 14% at 02:47. Loki
shows 240 "read: connection reset by peer" entries on the
api-gateway pod between 02:47 and 02:51
(LogQL: {device_id="4",unit="api-gateway.service"} |~ "connection reset").
Trace sampled at 02:48 spent 4.2s in TLS handshake to upstream
auth.internal. Recent change events on auth.internal: SNI hostname
rotated at 02:46 by deployment auth-rotate-2026-Q2. Likely root
cause: auth.internal no longer matches the SNI we present.
Suggested action: roll back auth-rotate-2026-Q2 OR push the new
SNI to api-gateway. I haven't done either.
That's the whole loop: query, correlate, ground, propose, wait for
the human to say go.
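The 0.1% to 14% figure in that reply is an error ratio: the 5xx request rate divided by the total request rate. Below is a sketch of how such a query could be assembled for Prometheus's /api/v1/query_range endpoint. The metric and label names come from the session above; the helper names, the 5m rate window, the 60s step, the Prometheus base URL, and the timestamps are assumptions for illustration, not necessarily what Ongrid emits.

```python
from urllib.parse import urlencode

def error_ratio_promql(device_id: str, window: str = "5m") -> str:
    """PromQL for the share of requests answered with a 5xx status."""
    errors = (
        f'sum(rate(http_requests_total{{device_id="{device_id}",'
        f'status=~"5.."}}[{window}]))'
    )
    total = f'sum(rate(http_requests_total{{device_id="{device_id}"}}[{window}]))'
    return f"{errors} / {total}"

def query_range_url(base: str, promql: str, start: int, end: int,
                    step: str = "60s") -> str:
    """Build a GET URL for Prometheus's range-query HTTP API."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"

promql = error_ratio_promql("4")
# Arbitrary Unix timestamps 1800 seconds apart, i.e. a 30-minute window.
url = query_range_url("http://prometheus:9090", promql,
                      start=1780540000, end=1780541800)
```

Multiplying the result by 100 gives the percentage quoted in the reply.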
Here's a real one from the product — an alert investigated end to end,
with the root cause, the evidence chain, and the pinpointed service:
Architecture (the parts that matter)
┌─────────────────┐                          ┌─────────────────┐
│ edge agent      │     outbound tunnel      │ manager         │
│ (on host)       │ ───────────────────────▶ │ agent + LLM     │
│                 │  (outbound only;         │ router          │ ──▶ Slack /
│ • procmetrics   │   manager rides it       │                 │     Telegram
│ • host_probes   │ ◀────── back for ─────── │ Prom · Loki     │ ◀── (chat apps)
│ • promtail      │   host probes)           │ Tempo · Grafana │
│ • tracegen      │                          │ Qdrant          │
└─────────────────┘                          └─────────────────┘
Three things worth pulling out:
1. The edge dials out. Hosts run ongrid-edge, which opens an
outbound tunnel to the manager. No port 22, 80, or 443
needs to be reachable from outside the host. The manager rides the
same tunnel back when it wants to run a host probe. Result: ops boxes
behind NAT or in a private VPC are first-class targets, not exotic
edge-cases.
2. The agent is a graph, not a chatbot. The LLM-facing kernel is
eino's graph runner with our own
tool registry on top. Two design decisions that matter:
- Coordinator + specialists. A coordinator agent dispatches per-incident work to a specialist (SRE, network, disk, ops, reviewer). The specialists are read-only by default; mutating actions go through a separate reviewer agent that gates them. You see exactly which agent issued which tool call in the timeline.
- Tool sandboxing by class. 26+ built-in tools (query_promql, query_logql, query_traceql, expand_topology, find_topology_node, query_change_events, bash with a policy whitelist, host_probe_disk/cpu/net/..., knowledge-base search, etc.), each tagged with a class (safe-read, safe-write, dangerous). Operators decide per-role what classes are callable.
3. The LLM is pluggable. We ship a multi-provider router so you
can drop in Anthropic, OpenAI, Zhipu (GLM), Gemini, DeepSeek, Kimi,
or any OpenAI-compatible relay — switch the default at runtime, no
restart. I run a small, cheap model for the pre-classification step
and a stronger one for the actual investigation. Whatever model you
have a key for works.
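The class-based gating from point 2 can be sketched in a few lines. This is a hypothetical model, not Ongrid's registry: the three class names and the tool names are from the list above, but which class each tool carries, the role names, and the ROLE_POLICY table are invented for illustration.

```python
from enum import Enum

class ToolClass(Enum):
    SAFE_READ = "safe-read"
    SAFE_WRITE = "safe-write"
    DANGEROUS = "dangerous"

# Assumed tagging; the real class of each tool may differ.
TOOL_CLASSES = {
    "query_promql": ToolClass.SAFE_READ,
    "query_logql": ToolClass.SAFE_READ,
    "bash": ToolClass.DANGEROUS,
}

# Hypothetical per-role policy an operator might configure.
ROLE_POLICY = {
    "specialist": {ToolClass.SAFE_READ},
    "admin": {ToolClass.SAFE_READ, ToolClass.SAFE_WRITE, ToolClass.DANGEROUS},
}

def is_callable(role: str, tool: str) -> bool:
    """Allow a call only if the tool is known and its class is granted to the role."""
    tool_class = TOOL_CLASSES.get(tool)
    return tool_class is not None and tool_class in ROLE_POLICY.get(role, set())
```

Note that the check fails closed: an unknown tool or an unknown role gets no access, which is the behaviour you want from anything that can end up running bash.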
Here's the agent working a live task — investigating disk usage on a
host and summarising what it found, with a concrete cleanup plan:
Why "chat-native" matters more than it sounds
Every ops AI tool I've seen treats the chat integration as a
notification sink. "We alert into Slack, then you click through to
our dashboard." That misses the point. The conversation needs to
happen inside the chat client, because the conversation already has
context — who's on call, what was the last incident, what's
half-finished from 6 hours ago, who already tried what.
Ongrid does both directions:
- Outbound: alerts, RCA summaries, post-mortem drafts go to the channel.
- Inbound: you @-mention the bot in the same thread to ask follow-ups. The agent has the thread context, the original alert, the existing investigation, the knowledge base, the source code (if you give it a repo). Replies stream back into the same thread.
Currently working two-way: Slack (socket mode) and
Telegram (getUpdates polling). DingTalk and WeCom work on the
outbound side only. The clean way to think about it: each
provider is a thin transport adapter into the same agent runtime.
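One way to picture "thin transport adapter" is a small interface that every chat provider implements, with the agent runtime written against that interface only. The sketch below is an assumption about the shape, not Ongrid's actual types: ChatTransport, InMemoryTransport, Message, and handle_mention are hypothetical names, and the in-memory adapter stands in for a Slack or Telegram client.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Message:
    thread_id: str
    text: str

class ChatTransport(Protocol):
    """What the runtime needs from any provider: send a reply into a thread."""
    def send(self, msg: Message) -> None: ...

@dataclass
class InMemoryTransport:
    """Test double standing in for a Slack or Telegram adapter."""
    sent: list[Message] = field(default_factory=list)

    def send(self, msg: Message) -> None:
        self.sent.append(msg)

def handle_mention(transport: ChatTransport, incoming: Message) -> None:
    """Provider-agnostic handler: reply in the thread the question came from."""
    # A real runtime would hand the question to the agent here.
    reply = f"investigating: {incoming.text}"
    transport.send(Message(thread_id=incoming.thread_id, text=reply))

slack_like = InMemoryTransport()
handle_mention(slack_like, Message(thread_id="T1", text="why is device_4 throwing 5xx"))
```

Under this shape an outbound-only provider needs nothing beyond send, while a two-way provider adds whatever receives mentions and calls handle_mention.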
What's actually rough
The honest list, because every OSS project hides one:
- Telegram needs a proxy if your network can't reach its API directly. The install.sh has no --http-proxy flag yet. We worked around it on our test box with a docker-compose.override.yml setting HTTPS_PROXY. It works, but it's a manual step.
- Some providers' base models reject our default request params with a beta-limitations-style error, and you have to pick a specific variant. We should auto-detect and downgrade (open issue).
- The Anthropic provider's anthropic_base_url field assumes you're either on the official endpoint or behind a relay that speaks the same shape. Doesn't sniff for differences.
- No SOP / playbook executor yet. Today the agent proposes a fix; it doesn't yet execute a multi-step runbook with double-sign gating. On the roadmap.
If you find more, the issue tracker is open.
Install
# Ubuntu 22.04+, Debian 12+, RHEL/Rocky 9
wget https://github.com/ongridio/ongrid/releases/download/v0.8.2/ongrid-v0.8.2-linux-amd64.tar.xz
tar -xf ongrid-v0.8.2-linux-amd64.tar.xz && cd ongrid-v0.8.2-linux-amd64
sudo ./install.sh
That gets you the manager + Prom + Loki + Tempo + Grafana + Qdrant +
Frontier broker, all in docker-compose. Edge install is a curl-pipe
the manager hands you out of the UI:
curl -k -sSL https://<manager>/edge/install.sh | bash -s -- \
--access-key=... --secret-key=... \
--server-edge-addr=<manager>:40012 \
--server-http-addr=<manager>:443
Apache 2.0. No telemetry phone-home. Knowledge-base contents stay in
your Qdrant; chat history stays in your MySQL.
The observability stack comes wired out of the box — and the agent
writes the queries itself:
Where to find it
- Repo: https://github.com/ongridio/ongrid
- Docs: https://ongrid.cloud
- Demo: see the GIF at the top of the README
- Issues / questions: https://github.com/ongridio/ongrid/issues
I'd love to know what breaks for you. The fastest way to make this
better is for an SRE who isn't me to install it and tell me what's
wrong with it. If you do, drop an issue or ping me on
[Twitter / wherever] — I'll read every one.
Ongrid is Apache 2.0. The release this post talks about is v0.8.2
(2026-06-04).


