Our OpenClaw agent ran for eleven days before I realised I had no idea what it was doing.
Not in the philosophical sense. I mean I couldn't tell you how many messages it handled in a given hour, what percentage of tool calls succeeded, how much each conversation cost, or why it sometimes took 14 seconds to respond to a simple question. The agent was running. People were using it. And we had essentially zero insight into the space between "user sends message" and "bot replies."
This is the observability gap, and nearly every team running OpenClaw in Slack has it.
What You Can't See Will Hurt You
When we first set up our agent, "monitoring" meant checking whether the gateway process was alive. That's the equivalent of monitoring a web app by pinging the homepage. Yes, it's up. No, you don't know if it's actually working.
Here's what we couldn't see:
How long each LLM call took. Our agent uses Claude for complex queries and a cheaper model for simple ones, but we had no data on actual latency per request. When people complained about slow responses, we couldn't tell whether the bottleneck was the model, the MCP server, the Slack API, or our own code.
Which tools were failing silently. An MCP server can return a 200 with garbage data. The agent receives the response, does its best to interpret it, and replies to the user. Nobody gets an error. The answer is just wrong, and unless the user notices, it goes unreported.
What anything cost. We were running two models with different pricing. Tool calls to external APIs had their own costs. After 30 days, our bill was 40% higher than projected and we couldn't attribute the overage to any specific pattern. Was it one power user? A misconfigured system prompt that caused extra reasoning loops? A particular channel that generated unusually long conversations? We had no idea.
The Token Black Hole
The cost problem deserves its own section because it's where the pain hit us hardest.
LLM pricing is per token. Every message in, every message out, every tool call description, every tool result — tokens. And in a Slack workspace, context accumulates. A busy channel might feed the agent 50 messages per hour. Each message includes the sender's name, the full text, maybe a thread context. That's input tokens you're paying for even when the agent decides not to respond.
We found we were spending roughly $8 per day on input tokens just from the agent passively reading the channels it was monitoring. It wasn't doing anything with most of those messages. It was reading them, deciding they weren't relevant, and moving on. But the read still cost money.
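To make that concrete, here's a back-of-the-envelope model of passive-read cost. Every number in it is an illustrative assumption, not an OpenClaw default or our actual pricing — the point is the shape of the multiplication, not the figures:

```python
def passive_read_cost(msgs_per_hour, tokens_per_msg, context_window,
                      active_hours, price_per_mtok):
    """Estimate daily $ cost of an agent passively reading one channel.

    Assumes each incoming message triggers one relevance check over the
    last `context_window` messages -- all billed as input tokens even
    when the agent decides not to respond. Illustrative model only.
    """
    evals_per_day = msgs_per_hour * active_hours
    tokens_per_eval = tokens_per_msg * context_window
    daily_tokens = evals_per_day * tokens_per_eval
    return daily_tokens, daily_tokens / 1_000_000 * price_per_mtok

# 50 msgs/hr, ~120 tokens each, 15-message context window,
# 16 active hours, $3 per million input tokens (all assumed).
tokens, cost = passive_read_cost(50, 120, 15, 16, 3.0)
print(f"{tokens:,} input tokens/day ≈ ${cost:.2f} per channel")
```

Multiply that per-channel figure across a handful of busy channels and you're in the ballpark of our $8/day without the agent ever saying a word.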
The fix required visibility we didn't have. Once we started logging token counts per interaction, we discovered that three channels accounted for 70% of our daily token spend. Two of them rarely needed the agent at all — they'd been added to the watch list during setup and nobody removed them. Fifteen minutes of config cleanup saved us $150 per month.
You won't find those savings without instrumentation.
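The aggregation itself is trivial once the logs exist. This is a minimal sketch of the query we ran — the `(channel, input_tokens, output_tokens)` record shape is our own logging convention, not anything OpenClaw exposes:

```python
from collections import Counter

def channel_token_share(interactions, top_n=3):
    """Sum logged token counts per channel and report what share of
    total spend the top `top_n` channels account for.

    `interactions` is a list of (channel, input_tokens, output_tokens)
    tuples -- the shape of our own logs, not an OpenClaw API.
    """
    per_channel = Counter()
    for channel, tok_in, tok_out in interactions:
        per_channel[channel] += tok_in + tok_out
    total = sum(per_channel.values())
    top = per_channel.most_common(top_n)
    return top, sum(t for _, t in top) / total

# Hypothetical day of logs.
log = [
    ("#engineering", 9000, 1200), ("#deploys", 6000, 400),
    ("#data", 4000, 300), ("#general", 1500, 200),
    ("#random", 800, 100),
]
top, share = channel_token_share(log)
print(top, f"top 3 = {share:.0%} of spend")
```

Running something like this over a week of logs is what surfaced our three-channel concentration.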
What OpenTelemetry Actually Gives You
OpenClaw added built-in OpenTelemetry support in the 2026.2 release. If you're running a recent version and haven't turned it on, you're leaving the most useful debugging tool on the table.
Here's what it captures when configured correctly:
Trace spans for each conversation turn. From message received to response sent, broken into LLM call, tool execution, and response formatting. You can see exactly where time goes. Our 14-second responses? Turned out one MCP server was doing a DNS lookup on every call because we'd configured it with a hostname instead of an IP. Three seconds per tool call, and the agent was making four calls per response.
Token counts per span. Input tokens, output tokens, total. Broken down by model if you're using routing. This is how we found the channel cost problem.
Tool call success/failure rates. Not just HTTP status codes, but whether the tool returned data the agent could actually use. We defined a custom metric for "tool returned data but agent didn't reference it in the response" which turned out to be a surprisingly reliable indicator of malformed tool results.
Error categorisation. Rate limit hits, auth failures, timeouts, malformed responses. Each one gets a span event with the error details. We built an alert for "more than 3 auth failures in 5 minutes" which caught an expired OAuth token about 40 minutes after it expired, instead of the usual "someone complained on Monday."
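The alert logic behind that "more than 3 auth failures in 5 minutes" rule is a plain sliding window. In practice we run the equivalent in our alerting backend; this is a stand-alone sketch of the same idea:

```python
from collections import deque

class ThresholdAlert:
    """Fire when more than `limit` events land inside `window` seconds.

    A minimal stand-in for an alerting-backend rule like 'more than
    3 auth failures in 5 minutes'; timestamps are in seconds.
    """
    def __init__(self, limit=3, window=300):
        self.limit, self.window = limit, window
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.limit  # True => fire the alert

alert = ThresholdAlert()
fired = [alert.record(t) for t in (0, 60, 120, 180)]
print(fired)  # → [False, False, False, True]
```

The fourth failure inside the window trips the alert; failures older than five minutes stop counting against the threshold.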
The setup isn't hard. Export spans to Jaeger or whatever you already use. If you don't have a tracing backend, Grafana Cloud has a free tier that'll handle a small agent's output. The OpenClaw docs cover the config — it's about 10 lines in your settings file.
The Dashboards Nobody Builds
Having traces is step one. Doing something useful with them is step two, and this is where most teams stall.
Here's what we track on a daily dashboard:
Response time P50/P95/P99. The P50 tells you normal performance. The P99 tells you when someone's having a bad time. Our P50 is 2.3 seconds. Our P99 is 11 seconds. That gap is almost entirely explained by responses that require multiple tool calls.
Tool call success rate per MCP server. We have six servers. Four run above 99%. The GitHub one sits at 96% because the GitHub API rate-limits us about once a day. The deploy server is at 94% because half our deploys take longer than the timeout. Different failure modes, different responses needed.
Cost per channel per day. This is the one that changed our behaviour. When you can see that #general costs $0.40 per day and #engineering costs $4.20, you start asking questions about the engineering channel's system prompt.
Unanswered messages. Messages directed at the agent that got no response. Either the agent crashed, the context was too long, or the message was filtered by our relevance check. We aim for under 2% unanswered. When it spikes, something's broken.
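All four dashboard numbers reduce to one pass over per-turn records. Here's a sketch under our own log schema (`latency_s`, `cost_usd`, `channel`, `tool_calls`, `answered` are field names we chose, not an OpenClaw schema):

```python
import math
from collections import defaultdict

def percentile(samples, q):
    """Nearest-rank percentile; q in (0, 100]."""
    s = sorted(samples)
    idx = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[idx]

def dashboard(turns):
    """Reduce per-turn records to the daily dashboard numbers.

    Each record: latency_s, cost_usd, channel, tool_calls (list of
    (server, ok) pairs), answered (bool) -- our log shape, assumed.
    """
    latencies = [t["latency_s"] for t in turns]
    cost_by_channel = defaultdict(float)
    tool_stats = defaultdict(lambda: [0, 0])  # server -> [ok, total]
    answered = 0
    for t in turns:
        cost_by_channel[t["channel"]] += t["cost_usd"]
        for server, ok in t["tool_calls"]:
            tool_stats[server][0] += ok
            tool_stats[server][1] += 1
        answered += t["answered"]
    return {
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "cost_by_channel": dict(cost_by_channel),
        "tool_success": {s: ok / n for s, (ok, n) in tool_stats.items()},
        "unanswered_rate": 1 - answered / len(turns),
    }

# Three hypothetical turns.
turns = [
    {"latency_s": 2.0, "cost_usd": 0.02, "channel": "#general",
     "tool_calls": [("github", True)], "answered": True},
    {"latency_s": 3.5, "cost_usd": 0.05, "channel": "#engineering",
     "tool_calls": [("github", True), ("deploy", False)], "answered": True},
    {"latency_s": 12.0, "cost_usd": 0.09, "channel": "#engineering",
     "tool_calls": [("deploy", True)], "answered": False},
]
m = dashboard(turns)
print(m["p50"], m["p99"], m["tool_success"], m["unanswered_rate"])
```

In production you'd feed this the span attributes your OpenTelemetry exporter already emits, bucketed by day, rather than hand-built dicts.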
New Relic launched an agentic observability platform specifically for this in February. Fiddler AI has been publishing about what they call the "observability gap" in autonomous agents. ClawMetry showed up on Product Hunt pitched as "Grafana but for AI agents." The tooling is arriving because the problem is real and widespread.
What Managed Platforms Get Right
I'll be honest about why I ended up recommending SlackClaw to teams who ask me about this: they build the dashboard before the agent.
When you deploy an agent through SlackClaw, you get per-channel cost tracking, tool call success rates, response time percentiles, and token usage breakdowns from day one. No OpenTelemetry config, no Grafana setup, no custom metrics. It's just there.
The credit-based pricing model also changes the economics of observability. Instead of "how much did this cost in API fees that I need to calculate across three providers," you see credit consumption per channel per day. One number. The abstraction hides complexity you don't need.
Building equivalent visibility into a self-hosted setup took us about two weeks. Getting it reliable took another two. And we still don't have cost attribution as clean as what a managed platform gives you, because our cost calculation involves two model providers, six MCP servers, and a routing layer that makes decisions we don't fully control.
The Baseline You Need Before Anything Else
If you take one thing from this: turn on tracing before you add your next feature. Before you build the third MCP server, before you add the deploy integration, before you roll the agent out to more channels. Instrument first.
At minimum, you need:
Response time per conversation turn. Cost per conversation turn. Tool call success rate per MCP server. Daily cost per channel.
Four metrics. You can get them from OpenTelemetry spans and a simple aggregation query. Once you have them, every other decision gets easier: which channels to monitor, which tools to invest in, which model to use for which queries, and whether the whole thing is worth the money you're spending.
Without them, you're flying blind. And flying blind is fine right up until the bill arrives, or the agent starts giving wrong answers, or the tool that broke three days ago is still broken because nobody has any way to know.
Helen Mireille is chief of staff at an early-stage tech startup. She writes about the gap between AI demos and what actually runs in production.