Shifu
The Autonomous Agent Trap: How My AI Burned 300+ LLM Calls a Day Checking Its Own Pulse 💸🤖

🧵 I thought my AI agent was just casually checking system health. Instead, it was running a full-blown medical drama every 55 minutes—and racking up massive token usage behind my back. 🎬


💸 The Fear of the Runaway API Bill

If you're building autonomous AI agents with frameworks like OpenClaw, LangChain, or AutoGPT, you already know the existential dread of waking up to a massive API billing alert.

When we give an LLM the ability to autonomously call tools in a loop to "achieve a goal," we hand over the keys to our wallets.

This week, my AI assistant—running on OpenClaw using Google's Gemini models—started throwing 429 RESOURCE_EXHAUSTED errors. At first, I was just annoyed by the rate limits. But when I looked at the dashboard, my annoyance turned to panic.

The daily quota of 1,500 requests was seemingly exhausted.

The terrifying part? I hadn't even talked to the agent all day.

The only automated task running was a "simple" system health heartbeat set to trigger every 55 minutes. That’s just ~26 pings a day. Where were all these hundreds of requests coming from? I needed to know exactly where those tokens were flying off to.


🕵️‍♂️ The Investigation: Digging Through the JSON Logs

My first assumption was a configuration error—maybe the heartbeat frequency was accidentally set to 5 minutes instead of 55? I checked my openclaw.json config file. Nope, strictly set to "every": "55m".
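For reference, the schedule lived in my config roughly like this (the surrounding structure is my illustration, not OpenClaw's exact schema; only the `"every": "55m"` value is verbatim from my file):

```json
{
  "heartbeat": {
    "file": "HEARTBEAT.md",
    "every": "55m"
  }
}
```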

So, I brought out the heavy machinery: the raw agent logs.

I downloaded the 5MB openclaw.log file from the server. OpenClaw logs everything in structured JSON, which is great for machines but terrible for human eyes. Staring at raw JSON wasn't going to cut it, so I wrote two custom Node.js parser scripts (extract_events.js and trace_sessions.js) to reconstruct the crime scene.

Here is what the scripts did:

  1. Regex-matched every embedded run start and embedded run done to capture the LLM execution times.
  2. Grouped every event by sessionId to track long-running conversations.
  3. Extracted every single tool invocation (exec, read_file, web_search) attached to those runs.

When the scripts spit out the final timeline, my jaw dropped. 😲

What I found was a textbook case of uncontrolled LLM tool looping: the silent killer of API budgets. 🌪️


🔪 The Smoking Gun: The System Health Definition

My agent is designed to run autonomously. Every 55 minutes, a cron job wakes it up and tells it to read a file called HEARTBEAT.md.

Here was the fateful instruction inside that file:

"System Health Check: Monitor for stalled interactive processes and kill them. Check memory usage (free -h)."

To a human sysadmin, this is a 10-second task. You run ps aux, maybe free -h, and you're done.

But to a stateless LLM agent wired into a tool-chain architecture? It's a multi-round forensic investigation. 🕵️‍♂️

Here is the exact timeline of a single 55-minute heartbeat check my script extracted:

| Time | Action | What the LLM was doing |
| --- | --- | --- |
| 07:51:55 | 🛠️ Tool: `exec` | Ran `ps aux` to list all processes |
| 07:52:15 | 🛠️ Tool: `exec` | Ran `grep` to filter the list |
| 07:56:49 | 🛠️ Tool: `exec` | Checked a specific process |
| 07:56:54 | 🛠️ Tool: `exec` | Checked memory with `free -h` |
| 07:57:02 | 🌐 Tool: `web_search` | Looked something up on the internet!? |
| 07:57:24 | 🛠️ Tool: `exec` | Checked disk space (`df -h`) |
| 07:58:10 | 🛠️ Tool: `exec` | Final cleanup/verification |
| 07:58:12 | Done | Summarized findings |

Total duration: 6.2 minutes.
Total tool calls: 12.


❄️ The Context Snowball Effect (How the tokens multiply)

Here is the critical architectural quirk I had overlooked (and why so many AutoGPT users end up with massive API bills): In an LLM tool-calling loop, every single tool execution is a brand new API request.

When the agent ran ps aux, it fetched the result. To decide what to do next, it had to send the entire conversation history (including the massive ps aux output) back to the LLM. Then it decided to run free -h. It executed it, got the result, and sent the history back again.

Instead of 26 lightweight pings a day, my "simple" health check was generating 300+ massive LLM round-trips daily, each with a larger context window than the last. 🏔️

My agent was silently burning through hundreds of thousands of tokens every single day just to check if the server was okay.
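To see why this snowballs, here's a toy calculation. The token counts are invented purely for illustration; the point is the shape of the growth, since each round re-sends everything accumulated so far, total input tokens grow roughly quadratically with the number of tool calls.

```javascript
// Toy model of the context snowball: every tool call triggers a fresh LLM
// request that re-sends the entire history accumulated so far.
// All token counts here are invented, illustrative numbers.
function totalTokens(toolOutputs, systemPromptTokens = 500) {
  let history = systemPromptTokens;
  let total = 0;
  for (const outputTokens of toolOutputs) {
    total += history;        // this request re-sends everything so far...
    history += outputTokens; // ...then the tool output is appended to history
  }
  return total;
}

// 12 tool calls, each dumping ~800 tokens of output (think `ps aux`):
const perRun = totalTokens(Array(12).fill(800));
console.log(perRun);      // input tokens for ONE heartbeat run
console.log(perRun * 26); // ...multiplied by 26 heartbeats a day
```

Even with these modest made-up numbers, one heartbeat costs tens of thousands of input tokens, and a day of heartbeats lands in the seven figures.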

⛈️ The Retry Storm

This aggressive tool usage is also what caused the rate limits. When the agent hit its 12-tool streak in 6 minutes, it bumped into Google's per-minute quota (~15 requests/min).

When the API returned a 429 Rate Limit error, OpenClaw (as designed) initiated an exponential backoff retry. But during those retry windows, other scheduled checks queued up.

At exactly 11:15 UTC, the dam broke. The logs showed 12 API requests firing in 40 seconds as the system panic-retried a backlog of tool calls.

I wasn't being rate-limited because of daily usage. I was being rate-limited because my agent was behaving like an over-caffeinated sysadmin slamming the terminal with 12 commands a minute. ☕💥
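I haven't read OpenClaw's retry internals, but the generic exponential-backoff pattern it describes looks something like this sketch (the `withBackoff` helper and its options are my own, not OpenClaw's API):

```javascript
// Generic exponential backoff with jitter -- a sketch of the pattern,
// not OpenClaw's actual implementation.
async function withBackoff(fn, { retries = 5, baseMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 || attempt >= retries) throw err;
      // 1s, 2s, 4s, 8s... plus jitter so queued checks don't re-fire in sync
      const delay = baseMs * 2 ** attempt + Math.random() * 250;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

The jitter is the detail that matters here: without it, a backlog of queued tool calls all wakes up at the same instant, which is exactly the kind of 12-requests-in-40-seconds burst my logs showed.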


🛠️ The Fix: Taking the Keys Away

When building autonomous agents, it's tempting to give the LLM control over everything. Why write a bash script when the AI can just figure it out dynamically?

This incident is exactly why. Some tasks don't need "reasoning." They just need execution.

The Solution:

  1. I opened HEARTBEAT.md and completely deleted the actionable instructions. I left it as a comment-only file so the LLM wakes up, sees nothing to do, and goes immediately back to sleep (1 API call instead of 12).
  2. I moved the actual system monitoring to a dumb, reliable cron bash script:
```bash
# /home/user/health_check.sh
AVAILABLE=$(free -m | awk '/Mem:/ {print $7}')
if [ "$AVAILABLE" -lt 200 ]; then
  echo "[$(date)] LOW MEMORY: ${AVAILABLE}MB" >> /tmp/health_alerts.log
fi
```

Now, a traditional cron job runs every 55 minutes, takes 0.1 seconds, costs 0 API tokens, and logs any issues to a file. The LLM only needs to get involved if a human explicitly asks it to read that file.


🧠 The Takeaway for Agent Builders

If you are building LLM agents with access to real tools (exec, browser, search), remember:

  1. Every tool call is a full LLM round-trip. A 5-step thought process is 5 API calls. Set hard caps (max_iterations) on your agent loops to prevent them from digging a bottomless pit in your wallet.
  2. Never give an LLM a monitoring job that a config file or bash script can do. Reserve the expensive AI reasoning for when things actually break and need diagnosing, not for the routine patrol.
  3. Log your tool chains. If I hadn't built custom JS scripts to trace the session IDs and see exactly which tools were being called in sequence, I would have had no idea my agent was hallucinating 12-step system audits in the background.
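Point 1 is the cheapest insurance. I don't know what OpenClaw calls its setting, but in a hand-rolled agent loop the hard cap looks like this (a sketch; `runAgent`, `callLLM`, and `runTool` are hypothetical placeholders for your framework's equivalents):

```javascript
// Agent loop with a hard iteration cap. callLLM and runTool are hypothetical
// placeholders for whatever your framework provides.
async function runAgent(task, { callLLM, runTool, maxIterations = 8 }) {
  const history = [{ role: "user", content: task }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await callLLM(history); // one full LLM round-trip per loop
    history.push(reply);
    if (!reply.toolCall) return reply.content; // no tool wanted: final answer
    const result = await runTool(reply.toolCall);
    history.push({ role: "tool", content: result });
  }
  // Cap hit: bail out loudly instead of silently burning the budget
  throw new Error(`Agent exceeded ${maxIterations} iterations`);
}
```

An agent that throws after 8 iterations is annoying once; an agent that loops unbounded at 3 AM is a billing incident.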

Has your AI agent ever run away with your API quota or surprised you with a massive bill? Let me know your horror stories in the comments! 👇
