Two months ago we deployed an OpenClaw agent in our team Slack. First week was great. Second week was fine. Third week, it started failing in ways I didn't know were possible.
The agent would go silent for hours. Or it'd reply to messages with hallucinated tool results. Once it told a colleague the deploy had succeeded when the deploy had actually timed out — because the Vercel API returned a 504 and the agent treated the error page HTML as a success response.
We've since fixed all of these. Some were our fault. Some were architectural. Here are the five failure modes we hit, in order of how much they annoyed us.
1. The Silent Death
Your agent stops responding. No errors in the logs. The gateway process is running. Slack shows the bot as online. But messages go in and nothing comes out.
Nine times out of ten, this is a token expiry. The Slack bot token expired, or the model provider API key hit its rate limit, or (our favourite) the MCP server's OAuth token to a third-party service silently expired at 2am and the agent's attempt to call the tool returned a 401 that the error handler swallowed.
The fix was embarrassingly simple once we figured it out: health check every tool on a schedule, not just the gateway. We run a cron job every 15 minutes that makes a test call to each MCP server and alerts if any return non-200. Before that, we'd only know something was broken when someone complained in Slack.
```bash
# Basic health check — hit each MCP server's health endpoint.
# One port per server; the port numbers here are illustrative, adjust to your setup.
declare -A ports=([linear]=7101 [notion]=7102 [github]=7103 [deploy]=7104)
for server in linear notion github deploy; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:${ports[$server]}/health")
  if [ "$status" != "200" ]; then
    echo "MCP server $server is down (HTTP $status)" | tg  # tg: our alerting alias
  fi
done
```
The other silent death: disk space. The gateway logs to disk. If nobody rotates the logs and the disk fills, the gateway crashes without writing a final error — because there's nowhere to write it. We lost four hours to this one on a Saturday. Log rotation is boring right up until it isn't.
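Rotation itself is one config file away. A minimal logrotate entry, assuming the gateway writes to /var/log/openclaw/gateway.log (the path is illustrative):

```
/var/log/openclaw/gateway.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
```

The copytruncate directive matters here: it truncates the file in place, so the gateway keeps logging without needing to reopen its log file after rotation.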
2. The Hallucinated Tool Result
This one's insidious. The agent calls a tool, the tool fails, and instead of saying "I couldn't do that," the agent fabricates a plausible-looking result.
In our case, the Linear API returned a timeout error. The agent received the error message as plain text, decided it looked like ticket data, and told the user their ticket had been updated. It hadn't.
The root cause is that most MCP servers return errors as text strings. The model sees text and does what models do: it makes sense of it. If the error message contains words like "ticket" or "updated," the model might interpret it as confirmation rather than failure.
Our fix: every MCP server response now includes a structured status field. Not "here's some text, figure it out" but {"status": "error", "code": 504, "message": "Linear API timeout"}. The agent's system prompt explicitly says: "If any tool returns status: error, tell the user the action failed and include the error message. Never infer a successful result from an error response."
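A minimal sketch of that envelope in Python — the wrapper and the `flaky_update` example are illustrative, not our actual code. The point is that every tool result the model sees passes through one function that guarantees an explicit status field:

```python
import json

def wrap_tool_call(fn, *args, **kwargs):
    """Run a tool function and always return a structured envelope.

    The model never sees a bare error string: every response carries
    an explicit "status" field the system prompt tells it to check.
    """
    try:
        result = fn(*args, **kwargs)
        return {"status": "ok", "data": result}
    except TimeoutError as exc:
        return {"status": "error", "code": 504, "message": str(exc)}
    except Exception as exc:  # catch-all so raw tracebacks never reach the model
        return {"status": "error", "code": 500, "message": str(exc)}

# A failing tool call produces an explicit error envelope, not free text
def flaky_update(ticket_id):
    raise TimeoutError("Linear API timeout")

print(json.dumps(wrap_tool_call(flaky_update, "PROJ-123")))
```

With this in place, the system-prompt rule ("if status is error, report the failure") has something unambiguous to key off.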
This cut hallucinated results by roughly 90%. The remaining 10% come from edge cases where the tool technically succeeds but returns unexpected data. We're still chasing those.
3. The Context Window Overflow
OpenClaw agents in Slack have a problem that CLI and web agents mostly avoid: the conversation never ends, so context grows continuously with no session boundary to reset it.
In a busy channel, the agent might process 50-100 messages per hour. Each message, plus any tool calls and their results, goes into the context window. Eventually the context fills, and one of two things happens: the agent starts dropping earlier messages (losing important context), or the API call fails with a token limit error.
We hit this on day 12 when the agent was watching both #engineering and #support simultaneously. Someone asked a question that required context from 3 hours of conversation, and the agent gave an answer based on the last 40 minutes because everything before that had been evicted from context.
The fix is a summarisation layer. Every 30 minutes, the agent compresses older conversation into a summary: "In the last 2 hours, the team discussed the billing migration. Key decisions: schema change approved, launch date set for Thursday. Open questions: API backward compatibility." This summary takes 200 tokens instead of 8,000. We wrote this as an MCP tool so the agent can call it when its context gets large.
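The trigger logic is simple; the hard part is the summarisation prompt. A sketch, assuming a `summarize` callable that stands in for the model call doing the compression, with tokens estimated at roughly four characters each (close enough for a threshold):

```python
def compress_context(messages, summarize, keep_recent=20, max_tokens=6000):
    """Replace older messages with one summary once the context gets large.

    `summarize` turns a list of messages into a short summary string
    (in practice, a cheap model call). Recent messages stay verbatim so
    the agent can still answer questions about the live conversation.
    """
    est_tokens = sum(len(m) for m in messages) // 4  # ~4 chars per token
    if est_tokens <= max_tokens or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return ["[summary] " + summarize(older)] + recent
```

Running this on a schedule (or whenever the estimate crosses the threshold) keeps the window bounded while preserving the recent turns verbatim.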
On SlackClaw, this is handled automatically — the platform manages context windows per channel and runs summarisation in the background. Building it yourself is doable but you'll spend a week getting the summarisation prompts right.
4. The Permission Escalation
This wasn't a failure exactly — it was working as designed, which was the problem.
Our agent had a tool that could update Linear tickets. We'd configured it so anyone could ask the agent to update tickets, because we thought "updating a ticket" was a low-risk operation.
Then an intern asked the agent to "close all the bugs tagged P3" and it did. All 47 of them. In production.
The issue: we'd defined permissions at the tool level (can use the update tool) instead of the operation level (can update individual tickets but can't bulk close). The tool didn't distinguish between "update status of PROJ-123" and "update status of all tickets matching a filter."
After we re-opened all 47 tickets and had an awkward conversation, we rebuilt the permission system. Each tool now has operation-level permissions: read, update-single, update-bulk, create, delete. Users are mapped to permission levels. The intern can read and update individual tickets. Bulk operations require team lead approval, which the agent requests via a confirmation message in Slack.
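The core of the check is small once the operations are enumerated. A sketch in Python — role names and the escalation set are illustrative, and the key design choice is a three-way answer: allow, deny, or escalate for confirmation rather than flatly refusing:

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUEST_APPROVAL = "request_approval"

# Role -> operations that run without confirmation (illustrative mapping)
GRANTS = {
    "intern":    {"read", "update-single"},
    "engineer":  {"read", "update-single", "create"},
    "team-lead": {"read", "update-single", "update-bulk", "create", "delete"},
}

# Destructive operations escalate instead of failing silently
APPROVAL_REQUIRED = {"update-bulk", "delete"}

def authorize(role, operation):
    if operation in GRANTS.get(role, set()):
        return Decision.ALLOW
    if operation in APPROVAL_REQUIRED:
        return Decision.REQUEST_APPROVAL  # agent posts a Slack confirmation
    return Decision.DENY
```

Under this scheme, "close all P3 bugs" maps to update-bulk and comes back as a confirmation request instead of 47 closed tickets.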
This kind of granular permission system is one of the things that takes SlackClaw from "nice to have" to "actually necessary." Per-channel permissions aren't enough. You need per-operation permissions, and building those into every MCP server is substantial work.
5. The Model Switcheroo
We were running Claude for our agent. Anthropic had a 4-hour outage in February. Our agent was dead for all 4 hours.
The fix is obvious in hindsight: have a fallback model. But switching models isn't just changing an API endpoint. Different models handle tool calling differently. Claude returns tool calls in a specific format. GPT-5 uses a different format. Open models vary wildly. Our system prompts had Claude-specific formatting instructions baked in.
We spent a week abstracting the model layer. Now we have a config that maps model-specific tool calling formats to our internal format, and the agent can switch from Claude to GPT-5 to MiniMax M2.5 without changing anything else. The failover is automatic: if the primary model returns a 5xx three times in a row, we switch to the backup for 30 minutes before trying the primary again.
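The failover is a classic circuit breaker. A sketch with the thresholds from above (three consecutive 5xx responses trip it, 30-minute cooldown before retrying the primary); model names are placeholders:

```python
import time

class ModelFailover:
    """Switch to a backup model after repeated 5xx errors from the primary."""

    def __init__(self, primary, backup, threshold=3, cooldown=30 * 60):
        self.primary, self.backup = primary, backup
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0       # consecutive 5xx count
        self.tripped_at = None  # timestamp when the breaker opened

    def active_model(self, now=None):
        now = time.time() if now is None else now
        if self.tripped_at is not None:
            if now - self.tripped_at < self.cooldown:
                return self.backup
            self.tripped_at = None  # cooldown over: retry the primary
            self.failures = 0
        return self.primary

    def record_response(self, status_code):
        if 500 <= status_code < 600:
            self.failures += 1
            if self.failures >= self.threshold:
                self.tripped_at = time.time()
        else:
            self.failures = 0  # any success resets the streak
```

Any non-5xx response resets the streak, so transient blips don't trip the breaker; only sustained failure does.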
The abstraction also let us run cheaper models for simple queries (summarise this thread, look up a ticket) and expensive models for complex ones (plan this sprint based on the last two weeks of conversations). Our costs dropped about 35%.
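The routing can start as a crude heuristic and be replaced later; the keyword list and model names below are placeholders for whatever classifier you settle on:

```python
# Queries matching these patterns are simple lookups or summaries
SIMPLE_KEYWORDS = ("summarise", "summarize", "look up", "status of")

def route(query, cheap="cheap-model", expensive="frontier-model"):
    """Naive cost router: short, keyword-matched queries go to the cheap model."""
    q = query.lower()
    if any(k in q for k in SIMPLE_KEYWORDS) and len(q) < 200:
        return cheap
    return expensive
```

Even this crude version captures most of the savings, because the simple queries vastly outnumber the complex ones in a busy channel.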
What I'd Do Differently
If I were starting over, three things.
Start with health checks, not features. Build the monitoring before you build the second MCP server. You'll catch problems in hours instead of days.
Define permissions at the operation level from day one. Tool-level permissions will bite you. It's tedious to set up but the alternative is an intern closing 47 tickets.
Abstract the model layer before you need to. You will need to switch models. Whether it's an outage, a pricing change, or a political situation (ask Anthropic about the Pentagon), your model will become unavailable at some point. Have a tested fallback.
Or just use SlackClaw and let someone else solve these problems. After two months of building and maintaining agent infrastructure, I understand why managed hosting exists. The credit-based pricing means you're paying for what you use, not maintaining what you built.
The agent infrastructure problem isn't building v1. It's keeping v1 running at 3am on a Saturday when the disk is full and the OAuth tokens have expired and nobody's awake to notice.
Helen Mireille is chief of staff at an early-stage tech startup. She writes about the gap between AI demos and AI in production.