Shifu
The Autonomous Agent Trap: How My AI Burned 300+ LLM Calls a Day Checking Its Own Pulse 💸🤖

🧵 I thought my AI agent was just casually checking system health. Instead, it was running a full-blown medical drama every 55 minutes—and racking up massive token usage behind my back. 🎬


💸 The Fear of the Runaway API Bill

If you're building autonomous AI agents with frameworks like OpenClaw, LangChain, or AutoGPT, you already know the existential dread of waking up to a massive API billing alert.

When we give an LLM the ability to autonomously call tools in a loop to "achieve a goal," we hand over the keys to our wallets.

This week, my AI assistant—running on OpenClaw using Google's Gemini models—started throwing 429 RESOURCE_EXHAUSTED errors. At first, I was just annoyed by the rate limits. But when I looked at the dashboard, my annoyance turned to panic.

The daily quota of 1,500 requests was seemingly exhausted.

The terrifying part? I hadn't even talked to the agent all day.

The only automated task running was a "simple" system health heartbeat set to trigger every 55 minutes. That’s just ~26 pings a day. Where were all these hundreds of requests coming from? I needed to know exactly where those tokens were flying off to.


🕵️‍♂️ The Investigation: Digging Through the JSON Logs

My first assumption was a configuration error—maybe the heartbeat frequency was accidentally set to 5 minutes instead of 55? I checked my openclaw.json config file. Nope, strictly set to "every": "55m".
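For reference, the schedule lived in my config roughly like this (the surrounding structure is my illustration, not OpenClaw's exact schema; only the `"every": "55m"` value is verbatim from my file):

```json
{
  "heartbeat": {
    "file": "HEARTBEAT.md",
    "every": "55m"
  }
}
```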

So, I brought out the heavy machinery: the raw agent logs.

I downloaded the 5MB openclaw.log file from the server. OpenClaw logs everything in structured JSON, which is great for machines but terrible for human eyes. Staring at raw JSON wasn't going to cut it, so I wrote two custom Node.js parser scripts (extract_events.js and trace_sessions.js) to reconstruct the crime scene.

Here is what the scripts did:

  1. Regex-matched every embedded run start and embedded run done to capture the LLM execution times.
  2. Grouped every event by sessionId to track long-running conversations.
  3. Extracted every single tool invocation (exec, read_file, web_search) attached to those runs.

When the scripts spit out the final timeline, my jaw dropped. 😲

What I found was a textbook case of uncontrolled LLM tool looping: the silent killer of API budgets. 🌪️


🔪 The Smoking Gun: The System Health Definition

My agent is designed to run autonomously. Every 55 minutes, a cron job wakes it up and tells it to read a file called HEARTBEAT.md.

Here was the fateful instruction inside that file:

"System Health Check: Monitor for stalled interactive processes and kill them. Check memory usage (free -h)."

To a human sysadmin, this is a 10-second task. You run ps aux, maybe free -h, and you're done.

But to a stateless LLM agent wired into a tool-chain architecture? It's a multi-round forensic investigation. 🕵️‍♂️

Here is the exact timeline of a single 55-minute heartbeat check my script extracted:

| Time | Action | What the LLM was doing |
| --- | --- | --- |
| 07:51:55 | 🛠️ Tool: `exec` | Ran `ps aux` to list all processes |
| 07:52:15 | 🛠️ Tool: `exec` | Ran `grep` to filter the list |
| 07:56:49 | 🛠️ Tool: `exec` | Checked a specific process |
| 07:56:54 | 🛠️ Tool: `exec` | Checked memory with `free -h` |
| 07:57:02 | 🌐 Tool: `web_search` | Looked something up on the internet!? |
| 07:57:24 | 🛠️ Tool: `exec` | Checked disk space (`df -h`) |
| 07:58:10 | 🛠️ Tool: `exec` | Final cleanup/verification |
| 07:58:12 | Done | Summarized findings |

Total duration: 6.2 minutes.
Total tool calls: 12.


❄️ The Context Snowball Effect (How the tokens multiply)

Here is the critical architectural quirk I had overlooked (and why so many AutoGPT users end up with massive API bills): In an LLM tool-calling loop, every single tool execution is a brand new API request.

When the agent ran ps aux, it fetched the result. To decide what to do next, it had to send the entire conversation history (including the massive ps aux output) back to the LLM. Then it decided to run free -h. It executed it, got the result, and sent the history back again.

Instead of 26 lightweight pings a day, my "simple" health check was generating 300+ massive LLM round-trips daily, each with a larger context window than the last. 🏔️

My agent was silently burning through hundreds of thousands of tokens every single day just to check if the server was okay.
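To see why this snowballs, here's a toy calculation. The token counts are invented purely for illustration; the point is the shape of the growth, since each round re-sends everything accumulated so far, total input tokens grow roughly quadratically with the number of tool calls.

```javascript
// Toy model of the context snowball: every tool call triggers a fresh LLM
// request that re-sends the entire history accumulated so far.
// All token counts here are invented, illustrative numbers.
function totalTokens(toolOutputs, systemPromptTokens = 500) {
  let history = systemPromptTokens;
  let total = 0;
  for (const outputTokens of toolOutputs) {
    total += history;        // this request re-sends everything so far...
    history += outputTokens; // ...then the tool output is appended to history
  }
  return total;
}

// 12 tool calls, each dumping ~800 tokens of output (think `ps aux`):
const perRun = totalTokens(Array(12).fill(800));
console.log(perRun);      // input tokens for ONE heartbeat run
console.log(perRun * 26); // ...multiplied by 26 heartbeats a day
```

Even with these modest made-up numbers, one heartbeat costs tens of thousands of input tokens, and a day of heartbeats lands in the seven figures.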

⛈️ The Retry Storm

This aggressive tool usage is also what caused the rate limits. When the agent hit its 12-tool streak in 6 minutes, it bumped into Google's per-minute quota (~15 requests/min).

When the API returned a 429 Rate Limit error, OpenClaw (as designed) initiated an exponential backoff retry. But during those retry windows, other scheduled checks queued up.

At exactly 11:15 UTC, the dam broke. The logs showed 12 API requests firing in 40 seconds as the system panic-retried a backlog of tool calls.

I wasn't being rate-limited because of daily usage. I was being rate-limited because my agent was behaving like an over-caffeinated sysadmin slamming the terminal with 12 commands a minute. ☕💥
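I haven't read OpenClaw's retry internals, but the generic exponential-backoff pattern it describes looks something like this sketch (the `withBackoff` helper and its options are my own, not OpenClaw's API):

```javascript
// Generic exponential backoff with jitter -- a sketch of the pattern,
// not OpenClaw's actual implementation.
async function withBackoff(fn, { retries = 5, baseMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 || attempt >= retries) throw err;
      // 1s, 2s, 4s, 8s... plus jitter so queued checks don't re-fire in sync
      const delay = baseMs * 2 ** attempt + Math.random() * 250;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

The jitter is the detail that matters here: without it, a backlog of queued tool calls all wakes up at the same instant, which is exactly the kind of 12-requests-in-40-seconds burst my logs showed.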


🛠️ The Fix: Taking the Keys Away

When building autonomous agents, it's tempting to give the LLM control over everything. Why write a bash script when the AI can just figure it out dynamically?

This incident is exactly why. Some tasks don't need "reasoning." They just need execution.

The Solution:

  1. I opened HEARTBEAT.md and completely deleted the actionable instructions. I left it as a comment-only file so the LLM wakes up, sees nothing to do, and goes immediately back to sleep (1 API call instead of 12).
  2. I moved the actual system monitoring to a dumb, reliable cron bash script:
```bash
# /home/user/health_check.sh
AVAILABLE=$(free -m | awk '/Mem:/ {print $7}')
if [ "$AVAILABLE" -lt 200 ]; then
  echo "[$(date)] LOW MEMORY: ${AVAILABLE}MB" >> /tmp/health_alerts.log
fi
```

Now, a traditional cron job runs every 55 minutes, takes 0.1 seconds, costs 0 API tokens, and logs any issues to a file. The LLM only needs to get involved if a human explicitly asks it to read that file.


🧠 The Takeaway for Agent Builders

If you are building LLM agents with access to real tools (exec, browser, search), remember:

  1. Every tool call is a full LLM round-trip. A 5-step thought process is 5 API calls. Set hard caps (max_iterations) on your agent loops to prevent them from digging a bottomless pit in your wallet.
  2. Never give an LLM a monitoring job that a config file or bash script can do. Reserve the expensive AI reasoning for when things actually break and need diagnosing, not for the routine patrol.
  3. Log your tool chains. If I hadn't built custom JS scripts to trace the session IDs and see exactly which tools were being called in sequence, I would have had no idea my agent was hallucinating 12-step system audits in the background.
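Point 1 is the cheapest insurance. I don't know what OpenClaw calls its setting, but in a hand-rolled agent loop the hard cap looks like this (a sketch; `runAgent`, `callLLM`, and `runTool` are hypothetical placeholders for your framework's equivalents):

```javascript
// Agent loop with a hard iteration cap. callLLM and runTool are hypothetical
// placeholders for whatever your framework provides.
async function runAgent(task, { callLLM, runTool, maxIterations = 8 }) {
  const history = [{ role: "user", content: task }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await callLLM(history); // one full LLM round-trip per loop
    history.push(reply);
    if (!reply.toolCall) return reply.content; // no tool wanted: final answer
    const result = await runTool(reply.toolCall);
    history.push({ role: "tool", content: result });
  }
  // Cap hit: bail out loudly instead of silently burning the budget
  throw new Error(`Agent exceeded ${maxIterations} iterations`);
}
```

An agent that throws after 8 iterations is annoying once; an agent that loops unbounded at 3 AM is a billing incident.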

Has your AI agent ever run away with your API quota or surprised you with a massive bill? Let me know your horror stories in the comments! 👇
