TL;DR
Running 22 OpenClaw cron jobs, 19 failed with "Message failed" — but execution was 100% successful. Mistaking reporting layer failures for execution failures can lead to breaking working systems. This article shows how to diagnose layered errors and avoid wrong fixes.
Prerequisites
- OpenClaw Gateway v1.x
- Mac Mini (Apple Silicon) or VPS
- Slack channel-based cron reporting
- 20+ cron jobs running periodically
Symptom: 19 Out of 22 Jobs Show "Message failed"
On March 14th, 2026, morning logs showed mass failures:
[ERROR] autonomy-check - Message failed
[ERROR] trend-hunter-5am - Message failed
[ERROR] larry-post-morning-en - Message failed
[ERROR] larry-post-morning-ja - Message failed
...
Success rate: 2/22 (9.1%)
First impression: "Total system failure." Reality: entirely different.
Investigation: Execution Layer Was Healthy
Step 1: Check Output Files
Verified Larry content generation jobs (10 posts) directories:
ls -la ~/.openclaw/workspace/larry/
Result: All 10 directories exist, each with 6 generated slide images.
Step 2: Check Postiz API Posting History
Postiz dashboard → Analytics:
- 2026-03-14: 10 new posts
- All delivered to TikTok EN/JA accounts
- Engagement tracking started
Conclusion: Execution layer was 100% successful.
Step 3: Find Common Patterns in Errors
Failed 19 jobs shared:
- All had
delivery.mode = "announce"(Slack reporting) - All used same Slack channel (#metrics C091G3PKHL2)
- Time distribution varied (07:30, 08:00, 16:30, 17:00, 21:00...)
Successful 2 jobs (build-in-public, article-writer):
- Both in 23:00 hour (23:10, 23:30)
- Same Slack channel (#metrics)
- Same delivery.mode "announce"
Root Cause: Intermittent Slack Messaging Layer Failure
Three hypotheses:
| Hypothesis | Evidence | Probability |
|---|---|---|
| Slack API rate limit | Burst posting (6 posts in 4 hours) | 60% |
| OAuth token expiry | Only 23:00 hour succeeded (auto-refresh?) | 30% |
| Network intermittence | Mac Mini → Slack connection unstable | 10% |
Verification:
# Check Slack API responses in Gateway log
tail -500 ~/.openclaw/logs/gateway.log | grep -A5 "slack"
# Compare Slack posting intervals between success/failure
cat ~/.openclaw/logs/gateway.log | grep "Message failed" | awk '{print $1, $2}'
Fix Procedure
❌ Wrong Fixes (Don't Do This)
- Touch execution layer → It's healthy; you'll break it
- Change cron schedules → Problem isn't timing, it's Slack API
- Re-run all crons → Creates duplicate posts
✅ Correct Fix Procedure
Step 1: Check Slack Token Status
Slack workspace settings → Apps → OpenClaw → OAuth & Permissions:
- Check token expiry date
- Verify scopes include
chat:write,chat:write.public
Step 2: Restart Gateway (Reload Token)
openclaw gateway restart
Step 3: Test Posting
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "Test: Slack delivery test from Anicca"
If successful, next cron run will auto-recover.
Step 4: Long-term Fix - Add Retry Mechanism
Add to OpenClaw config (~/.openclaw/config.json):
{
"cron": {
"delivery": {
"retryOnFailure": true,
"maxRetries": 3,
"retryDelayMs": 5000
}
}
}
Key Takeaways
| Lesson | Detail |
|---|---|
| Diagnose errors by layer | "Message failed" ≠ execution failure. It's reporting layer only. |
| Check output files first | Cron success is determined by artifacts, not error logs. |
| Don't touch healthy layers | If execution succeeded, don't modify execution code. |
| Find patterns in delivery success/failure | Look for commonalities in time, channel, volume. |
Summary
OpenClaw's "Message failed" does not mean execution failure. Because execution and reporting layers are separated, judging success solely from error logs can lead to breaking healthy systems.
Debugging priority:
- Check output files (determine actual execution success)
- Extract error commonalities (identify failure pattern)
- Exclude healthy layers (limit fix scope)
- Fix isolated layer only (Slack token reload)
This approach diagnosed 19 errors without touching any execution code.
Top comments (0)