zac

Originally published at remoteopenclaw.com.

OpenClaw Monitoring Dashboard: Track Agent Health and Performance

Marketplace

Free skills and AI personas for OpenClaw — browse the marketplace.

Browse the Marketplace →

Join the Community

Join 1k+ OpenClaw operators sharing deployment guides, security configs, and workflow automations.

Join the Community →

What to Monitor

A healthy OpenClaw agent needs monitoring in five areas:

1. Container health. Is the Docker container running? How much CPU and memory is it using? Has it restarted unexpectedly? Container crashes are the most common failure mode and the most important to detect quickly.

2. API connectivity. Can the agent reach its AI model provider (Anthropic, OpenAI, Google, Ollama)? API outages or expired keys cause the agent to stop responding even though the container is running.

3. Message throughput. How many messages is the agent processing per hour? A sudden drop in throughput can indicate a connectivity issue, a messaging platform problem, or a misconfiguration. A sudden spike can indicate abuse or a runaway automation.

4. Error rate. How often do actions fail? A baseline error rate of 1-5% is normal (temporary API failures, rate limits, network hiccups). Above 10% indicates a systemic problem that needs investigation.

5. Disk usage. OpenClaw stores logs, conversation history, and memory data on disk. Without monitoring, these can grow until the disk is full, causing the agent to crash. Monitor disk usage and set up alerts at 80% capacity.


Docker Stats: Quick Health Check

The fastest way to check your agent's health is Docker's built-in monitoring:

Check if the container is running:

docker ps | grep openclaw

This shows the container status, uptime, and port mappings. If the container is not listed, it has crashed or been stopped.

Real-time resource monitoring:

docker stats openclaw

This shows a live dashboard with CPU usage, memory usage, network I/O, and disk I/O. Press Ctrl+C to exit.
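For scripts and cron jobs, `--no-stream` takes a single snapshot instead of a live stream, and `--format` makes the output parseable. A sketch, assuming the container is named `openclaw`:

```shell
# One-shot snapshot (no live stream), stripped to a bare CPU percentage
CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" openclaw | tr -d '%')
echo "openclaw CPU: ${CPU}%"
```

The same pattern works with `{{.MemPerc}}`, `{{.MemUsage}}`, and `{{.NetIO}}` for the other baselines below.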

Healthy baselines for OpenClaw:

  • CPU: 1-5% at idle. Spikes to 20-50% during active processing. Sustained above 80% indicates a problem.
  • Memory: 800MB-1.5GB depending on loaded skills. Gradual increase over time (memory leak) is abnormal.
  • Network: Active during API calls and message processing. Zero network I/O when the agent should be active indicates connectivity loss.

Check container restart count:

docker inspect openclaw --format='{{.RestartCount}}'

If the restart count is increasing, the container is crashing and Docker's restart policy is bringing it back. Check the logs to find the crash cause.
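For unattended checks, the count can be compared against the previous run so you are alerted only on new crashes. A minimal sketch (the state-file path is an arbitrary choice for illustration):

```shell
# Alert only when the restart count has grown since the last check
STATE_FILE=/tmp/openclaw-restarts
CURRENT=$(docker inspect openclaw --format='{{.RestartCount}}')
PREVIOUS=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
if [ "$CURRENT" -gt "$PREVIOUS" ]; then
  echo "openclaw restarted $((CURRENT - PREVIOUS)) time(s) since last check"
  docker logs --tail 50 openclaw   # show recent output to find the crash cause
fi
echo "$CURRENT" > "$STATE_FILE"
```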

Health endpoint:

curl -s http://localhost:3000/health

OpenClaw exposes a health endpoint that returns a 200 status when the agent is running and responsive. Use this endpoint for automated monitoring.
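curl can report the status code and the response time in one call, which covers both the down case and the slow-response case. A sketch, assuming the default port 3000:

```shell
# Capture HTTP status and total response time in a single request
RESP=$(curl -s -o /dev/null -w "%{http_code} %{time_total}" --max-time 10 \
  http://localhost:3000/health)
echo "status=${RESP% *} seconds=${RESP#* }"
```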


Mission Control Setup

OpenClaw's web UI includes a Mission Control page that provides a browser-based dashboard for monitoring your agent. To access it:

  1. Open your OpenClaw web UI (https://your-domain:3000 or your Tailscale URL)
  2. Enter your gateway token
  3. Navigate to the Mission Control tab

Mission Control shows:

  • Agent status: Online/offline indicator with uptime duration
  • Active sessions: Current conversations and their status
  • Recent actions: Timeline of the last 50 actions the agent took
  • Scheduled tasks: List of upcoming scheduled tasks with next run time
  • Model usage: Token consumption and API call counts for the current period
  • Error log: Recent errors with stack traces for debugging

Mission Control is useful for day-to-day monitoring and debugging. For historical analysis and automated alerting, you need additional tools.


Uptime Monitoring

Uptime monitoring continuously checks that your agent is responsive and alerts you when it goes down. Two good options:

Uptime Kuma (self-hosted, free):

Uptime Kuma is an open-source monitoring tool you can run alongside OpenClaw. It supports HTTP checks, ping, Docker container monitoring, and dozens of notification channels.

# Add to the services: section of docker-compose.yml
uptime-kuma:
  image: louislam/uptime-kuma:latest
  container_name: uptime-kuma
  ports:
    - "3001:3001"
  volumes:
    - ./uptime-kuma-data:/app/data
  restart: unless-stopped

After starting Uptime Kuma, access it at port 3001 and add a monitor for your OpenClaw health endpoint:

  • Type: HTTP
  • URL: http://openclaw:3000/health (use the Docker container name)
  • Interval: 60 seconds
  • Notification: Telegram, Slack, email, or webhook

Healthchecks.io (cloud, free tier):

If you prefer a hosted solution, Healthchecks.io provides a free tier with up to 20 checks. Create a check and add a cron job to ping it:

# Add to crontab (crontab -e)
* * * * * curl -fsS --retry 3 https://hc-ping.com/your-check-uuid > /dev/null

If the ping stops arriving, Healthchecks.io sends you an alert. This is a "dead man's switch" approach — you are alerted when something stops working, rather than when a specific check fails.
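The heartbeat can be chained to the local health check so it only fires while the agent is actually healthy; a sketch of the crontab entry (the check UUID placeholder is as in the example above):

```shell
# Ping Healthchecks.io only if the local health check succeeds, so a down
# agent silences the heartbeat and triggers the alert
* * * * * curl -fsS --max-time 10 http://localhost:3000/health > /dev/null && curl -fsS --retry 3 https://hc-ping.com/your-check-uuid > /dev/null
```

The `&&` short-circuits: if the first curl fails (`-f` makes non-2xx responses an error), the ping is never sent.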


Log Analysis

OpenClaw logs contain detailed information about every action the agent takes, every API call, every error, and every scheduled task execution.

View recent logs:

# Last 100 lines
docker logs --tail 100 openclaw

# Follow logs in real-time
docker logs -f openclaw

# Logs from the last hour
docker logs --since 1h openclaw

Search logs for errors:

docker logs openclaw 2>&1 | grep -i error
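To turn this into a rough error-rate signal, count matching lines over a fixed window. A sketch:

```shell
# Count error lines in the last hour (case-insensitive line count)
ERRORS=$(docker logs --since 1h openclaw 2>&1 | grep -ci error)
echo "errors in last hour: $ERRORS"
```

Running this hourly and comparing against your baseline is a cheap way to spot the systemic-problem threshold described earlier.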

Configure persistent log storage:

By default, Docker stores logs in JSON files that can grow without limit. Configure max size and rotation in your docker-compose.yml:

services:
  openclaw:
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"

This keeps a maximum of 250MB of logs (5 files at 50MB each), rotating automatically as each file fills. Note that logging options apply at container creation, so recreate the container (docker compose up -d) for the change to take effect.

What to look for in logs:

  • API errors: 401 (expired key), 429 (rate limited), 500 (provider issue), connection timeouts
  • Memory warnings: "Heap out of memory" or increasing memory allocation messages
  • Unhandled exceptions: Stack traces indicating bugs or unexpected inputs
  • Slow responses: API response times consistently above 10 seconds
  • Restart messages: The agent restarting unexpectedly during normal operation
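A single case-insensitive grep can scan for most of these signals at once; a sketch, with a pattern list you should tune to your own log format:

```shell
# Scan the last day of logs for the warning signs listed above
docker logs --since 24h openclaw 2>&1 | \
  grep -iE '401|429|500|timeout|heap out of memory|unhandled|restart' | tail -20
```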

Alerting With Webhooks

Automated alerting ensures you know about problems before they affect your workflows. The simplest approach is a shell script that runs via cron and sends alerts through Telegram or Slack.

Basic alerting script:

#!/bin/bash
# /opt/openclaw/monitor.sh

TELEGRAM_BOT_TOKEN="your-bot-token"
TELEGRAM_CHAT_ID="your-chat-id"
HEALTH_URL="http://localhost:3000/health"

# Send one Telegram message; --data-urlencode keeps spaces and symbols intact
send_alert() {
  curl -s -X POST "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/sendMessage" \
    -d "chat_id=$TELEGRAM_CHAT_ID" \
    --data-urlencode "text=$1" > /dev/null
}

# Check if OpenClaw is responding
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$HEALTH_URL")
if [ "$HTTP_STATUS" != "200" ]; then
  send_alert "OpenClaw is DOWN. Health check returned HTTP $HTTP_STATUS."
fi

# Check disk usage (alert above 80%)
DISK_USAGE=$(df /opt/openclaw/data | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
  send_alert "OpenClaw disk usage is at ${DISK_USAGE}%. Consider cleaning up logs."
fi

# Check container memory usage (alert above 80%); default to 0 if the
# container is down and docker stats returns nothing
MEM_USAGE=$(docker stats --no-stream --format "{{.MemPerc}}" openclaw | sed 's/%//')
MEM_INT=${MEM_USAGE%.*}
if [ "${MEM_INT:-0}" -gt 80 ]; then
  send_alert "OpenClaw memory usage is at ${MEM_USAGE}%. Consider restarting."
fi

Add to crontab to run every 5 minutes:

*/5 * * * * /opt/openclaw/monitor.sh

This gives you basic monitoring that covers the most important failure modes: agent down, disk full, and memory exhaustion.


Grafana Integration

For operators who want historical dashboards, trend analysis, and professional-grade monitoring, Grafana with Prometheus provides a complete observability stack.

Architecture:

  • cAdvisor collects Docker container metrics (CPU, memory, network, disk)
  • Prometheus stores metrics in a time-series database
  • Grafana provides the visual dashboard

Add these to your docker-compose.yml:

cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
  ports:
    - "8080:8080"

prometheus:
  image: prom/prometheus:latest
  container_name: prometheus
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
  ports:
    - "9090:9090"

grafana:
  image: grafana/grafana:latest
  container_name: grafana
  ports:
    - "3002:3000"
  volumes:
    - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Create a prometheus.yml configuration:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

After starting the stack, access Grafana at port 3002, add Prometheus as a data source, and import a Docker monitoring dashboard (Grafana dashboard ID 193 is a good starting point).
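Once cAdvisor metrics are flowing, a couple of PromQL queries cover the key baselines. The metric names below are standard cAdvisor exports, but the `name` label depends on your Docker setup, so verify it in Prometheus's expression browser first:

```promql
# Memory usage of the openclaw container, in bytes
container_memory_usage_bytes{name="openclaw"}

# CPU usage as a 5-minute rate (fraction of one core)
rate(container_cpu_usage_seconds_total{name="openclaw"}[5m])
```

These are also the expressions you would attach Grafana alerting rules to, e.g. firing when the CPU rate stays above 0.8 for several minutes.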

With Grafana, you get:

  • Historical CPU and memory graphs showing trends over days, weeks, or months
  • Alerting rules that fire when metrics cross thresholds
  • Custom dashboards tailored to your specific monitoring needs
  • Comparison between multiple OpenClaw instances if you run more than one

The Grafana stack adds roughly 500MB of RAM to your server requirements. For a single OpenClaw instance, the basic shell-script alerting may be sufficient. Grafana becomes valuable when you manage multiple agents or need historical trend data for capacity planning.
