<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joongho Kwon</title>
    <description>The latest articles on DEV Community by Joongho Kwon (@joongho_kwon_2754f08bdadd).</description>
    <link>https://dev.to/joongho_kwon_2754f08bdadd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846623%2F7e26dc65-ff85-4ed8-a3e4-138ef9b43549.jpg</url>
      <title>DEV Community: Joongho Kwon</title>
      <link>https://dev.to/joongho_kwon_2754f08bdadd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joongho_kwon_2754f08bdadd"/>
    <language>en</language>
    <item>
      <title>Why Your AI Agent Health Check Is Lying to You</title>
      <dc:creator>Joongho Kwon</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:05:35 +0000</pubDate>
      <link>https://dev.to/joongho_kwon_2754f08bdadd/why-your-ai-agent-health-check-is-lying-to-you-1njj</link>
      <guid>https://dev.to/joongho_kwon_2754f08bdadd/why-your-ai-agent-health-check-is-lying-to-you-1njj</guid>
      <description>

&lt;p&gt;&lt;em&gt;&lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; monitors your AI agents in production — heartbeat watchdog, auto-restart, loop detection, and cost tracking. Free for up to 3 agents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Your AI Agent Looks Healthy — But Your API Bill Says Otherwise</title>
      <dc:creator>Joongho Kwon</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:08:13 +0000</pubDate>
      <link>https://dev.to/joongho_kwon_2754f08bdadd/your-ai-agent-looks-healthy-but-your-api-bill-says-otherwise-3c2f</link>
      <guid>https://dev.to/joongho_kwon_2754f08bdadd/your-ai-agent-looks-healthy-but-your-api-bill-says-otherwise-3c2f</guid>
      <description>&lt;p&gt;You wake up to a $200 API bill. Your agent ran all night. It looked healthy — heartbeat green, no errors, process running. But token usage went from 200/min to 40,000/min because it was stuck re-parsing a malformed response in a loop.&lt;/p&gt;

&lt;p&gt;This is the most expensive failure mode in AI agent operations, and traditional monitoring won't catch it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cost tracking matters for AI agents
&lt;/h2&gt;

&lt;p&gt;Traditional services have relatively predictable costs. A web server handles N requests per second, each costing roughly the same in compute.&lt;/p&gt;

&lt;p&gt;AI agents are different. A single LLM call can cost anywhere from $0.001 to $2.00 depending on the model, context size, and output length. A logic loop that retries the same failing operation can burn through hundreds of dollars in minutes.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;for LLM-backed agents, cost is a health metric, not just a billing metric.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern: cost per heartbeat cycle
&lt;/h2&gt;

&lt;p&gt;Instead of tracking total spend, track &lt;strong&gt;cost per work cycle&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_llm_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a time series of cost-per-cycle. Normal is ~200 tokens. If it jumps to 40,000, you know immediately.&lt;/p&gt;
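&lt;p&gt;A minimal sketch of that spike check against a rolling baseline (the class name, window size, and 10x factor here are illustrative choices, not from any particular library):&lt;/p&gt;

```python
from collections import deque

class SpikeDetector:
    """Flags a work cycle whose token usage dwarfs the rolling baseline."""

    def __init__(self, window=288, factor=10, min_samples=10):
        # window=288 is roughly 24h of 5-minute cycles; tune to your cadence
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_samples = min_samples

    def observe(self, tokens):
        """Record one cycle's token count; return True if it looks like a loop."""
        baseline = sum(self.history) / len(self.history) if self.history else 0
        spike = (
            len(self.history) >= self.min_samples
            and tokens > self.factor * baseline
        )
        if not spike:
            self.history.append(tokens)  # keep spike cycles out of the baseline
        return spike
```

&lt;p&gt;Feeding each cycle's token count through &lt;code&gt;observe()&lt;/code&gt; turns the 200-to-40,000 jump into an immediate boolean alert.&lt;/p&gt;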

&lt;h2&gt;
  
  
  What to track
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tokens per cycle&lt;/td&gt;
&lt;td&gt;Catch loops&lt;/td&gt;
&lt;td&gt;10x above 24h average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per hour&lt;/td&gt;
&lt;td&gt;Budget protection&lt;/td&gt;
&lt;td&gt;Fixed dollar amount&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls per cycle&lt;/td&gt;
&lt;td&gt;Catch recursive tool use&lt;/td&gt;
&lt;td&gt;5x above baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Auto-tracking with SDK monkey-patching
&lt;/h2&gt;

&lt;p&gt;If you use OpenAI or Anthropic SDKs, you can patch the API client to automatically track every call without changing your application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assumes openai_client is your existing OpenAI client instance&lt;/span&gt;

&lt;span class="c1"&gt;# Wrap the OpenAI client to track usage
&lt;/span&gt;&lt;span class="n"&gt;original_create&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tracked_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;original_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;track_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracked_create&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wrapper intercepts the API call, extracts &lt;code&gt;usage.total_tokens&lt;/code&gt; from the response, estimates cost based on the model, and logs it. You can pipe this into your existing monitoring stack or a simple SQLite database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost alerting strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Absolute threshold
&lt;/h3&gt;

&lt;p&gt;Alert if hourly cost exceeds $X. Simple, catches catastrophic loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Relative spike
&lt;/h3&gt;

&lt;p&gt;Alert if current cycle cost is 10x+ above the rolling 24-hour average. Catches loops that start gradually.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Budget gate
&lt;/h3&gt;

&lt;p&gt;Hard-stop the agent if daily spend exceeds a configured limit. Last line of defense.&lt;/p&gt;
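&lt;p&gt;Strategies 1 and 3 can share one small accumulator. This is a sketch; the class name and dollar limits are illustrative, and the hourly/daily counter resets are omitted for brevity:&lt;/p&gt;

```python
class BudgetGate:
    """Tracks spend for the absolute-threshold alert and the hard budget gate.

    Counter resets (top of hour / midnight) are intentionally omitted here.
    """

    def __init__(self, hourly_limit_usd=1.00, daily_limit_usd=5.00):
        self.hourly_limit = hourly_limit_usd
        self.daily_limit = daily_limit_usd
        self.hour_spend = 0.0
        self.day_spend = 0.0

    def record(self, cost_usd):
        """Call after every LLM call with its estimated cost."""
        self.hour_spend += cost_usd
        self.day_spend += cost_usd

    def should_alert(self):
        # Strategy 1: absolute hourly threshold
        return self.hour_spend > self.hourly_limit

    def should_stop(self):
        # Strategy 3: hard daily budget gate -- stop the agent, not just alert
        return self.day_spend > self.daily_limit
```

&lt;p&gt;In the main loop, &lt;code&gt;should_stop()&lt;/code&gt; returning true should break out of the loop entirely; a loop that ignores its own budget gate is no gate at all.&lt;/p&gt;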

&lt;h2&gt;
  
  
  The real-world numbers
&lt;/h2&gt;

&lt;p&gt;From running three production agents with cost tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normal operation&lt;/strong&gt;: $0.01-0.05 per day per agent (gpt-4o-mini, ~50 tokens/cycle)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loop incident&lt;/strong&gt;: $50 in 40 minutes (40,000 tokens/min)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection time with cost tracking&lt;/strong&gt;: &amp;lt; 60 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection time without&lt;/strong&gt;: 6+ hours (discovered via billing alert next morning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between a $0.50 incident and a $200 incident is whether you detect the cost spike in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Track tokens per work cycle, not just total spend&lt;/li&gt;
&lt;li&gt;Alert on 10x spikes above baseline&lt;/li&gt;
&lt;li&gt;Use SDK monkey-patching to auto-track without code changes&lt;/li&gt;
&lt;li&gt;Set a hard daily budget gate as last resort&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cost isn't just a billing concern for AI agents — it's the single best health signal for catching the failure modes that traditional monitoring misses.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I built &lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; after dealing with exactly these problems. But the pattern matters more than the tool — even a simple SQLite table tracking tokens-per-cycle would have saved me that $200 bill.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch</title>
      <dc:creator>Joongho Kwon</dc:creator>
      <pubDate>Sun, 29 Mar 2026 23:27:14 +0000</pubDate>
      <link>https://dev.to/joongho_kwon_2754f08bdadd/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2p2n</link>
      <guid>https://dev.to/joongho_kwon_2754f08bdadd/three-ai-agent-failure-modes-that-traditional-monitoring-will-never-catch-2p2n</guid>
      <description>&lt;p&gt;I run several AI agents in production — trading bots, data scrapers, monitoring agents. They run 24/7, unattended. Over the past few months, I've hit three failure modes that my existing monitoring (process checks, log watchers, CPU/memory alerts) completely missed.&lt;/p&gt;

&lt;p&gt;These aren't exotic edge cases. If you're running any long-lived AI agent, you'll probably hit all three eventually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #1: The Silent Exit
&lt;/h2&gt;

&lt;p&gt;One of my agents exited cleanly at 3 AM. No traceback. No error log. No crash dump. The Python process simply stopped. My log monitoring saw nothing because there was nothing to log.&lt;/p&gt;

&lt;p&gt;I found out &lt;strong&gt;six hours later&lt;/strong&gt; when I noticed the bot hadn't posted since 3 AM.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The kernel's OOM killer terminated the process. The agent had been slowly leaking memory: a library was caching LLM responses with no eviction policy, so RSS grew from 200MB to 4GB over a few days. The OOM killer sends SIGKILL, which leaves no Python traceback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process monitoring (systemd, supervisor):&lt;/strong&gt; Saw the exit code, but by the time you check alerts, the damage is done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log monitoring (Datadog, CloudWatch):&lt;/strong&gt; Nothing to see — OOM kill happens below the application layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory dashboards:&lt;/strong&gt; Would have caught it &lt;em&gt;if&lt;/em&gt; someone was watching. Nobody watches dashboards at 3 AM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Positive heartbeat.&lt;/strong&gt; Instead of monitoring for bad signals (errors, crashes), monitor for the &lt;em&gt;absence&lt;/em&gt; of a good signal. The agent must actively report "I'm alive" every N seconds. If the heartbeat stops for any reason — clean exit, OOM, segfault, kernel panic — you know immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside your agent's main loop
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# This is the line that matters
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;heartbeat()&lt;/code&gt; doesn't fire, something is wrong. You don't need to know &lt;em&gt;what&lt;/em&gt; — you need to know &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;
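&lt;p&gt;The receiving side of this pattern is a watchdog that alerts on silence. A minimal sketch (class and method names are my illustration, not a specific tool's API):&lt;/p&gt;

```python
import time

class Watchdog:
    """Server-side half of the heartbeat pattern: alert on silence."""

    def __init__(self, timeout_s=90):
        self.timeout_s = timeout_s
        self.last_beat = {}  # agent name -> last heartbeat timestamp

    def beat(self, agent, now=None):
        """Called whenever an agent reports in."""
        self.last_beat[agent] = time.time() if now is None else now

    def silent_agents(self, now=None):
        """Agents that have not reported within the timeout window."""
        now = time.time() if now is None else now
        return [a for a, t in self.last_beat.items() if now - t > self.timeout_s]
```

&lt;p&gt;Run &lt;code&gt;silent_agents()&lt;/code&gt; on a timer and page on any non-empty result. Note that it catches every cause of silence the same way: clean exit, OOM, hang, or host death.&lt;/p&gt;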

&lt;h2&gt;
  
  
  Failure #2: The Zombie Agent
&lt;/h2&gt;

&lt;p&gt;This one is more insidious. The process was running. CPU usage normal. Memory stable. Every health check said "healthy."&lt;/p&gt;

&lt;p&gt;But the agent hadn't done useful work in &lt;strong&gt;four hours&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent was stuck on an HTTP request. An upstream API had rotated its TLS certificate, and the request was hanging: the socket was open and the TCP connection established, but the TLS handshake never completed. No timeout was set on the request (a classic oversight).&lt;/p&gt;

&lt;p&gt;From the outside, the process was "running." From the inside, the main loop was blocked on line 47 of &lt;code&gt;api_client.py&lt;/code&gt;, and it would stay blocked forever.&lt;/p&gt;
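&lt;p&gt;The immediate fix for that oversight is a deadline on every outbound call. With the &lt;code&gt;requests&lt;/code&gt; library that's &lt;code&gt;requests.get(url, timeout=(5, 30))&lt;/code&gt;; the same idea at the socket level, as a self-contained sketch:&lt;/p&gt;

```python
import socket

def connect_with_deadline(host, port, timeout_s=5.0):
    """Open a connection that raises socket.timeout instead of blocking forever.

    A peer that accepts the TCP connection but then goes quiet now fails
    fast, so the work loop (and its heartbeat) keeps moving.
    """
    sock = socket.create_connection((host, port), timeout=timeout_s)
    sock.settimeout(timeout_s)  # also bound every subsequent read
    return sock
```

&lt;p&gt;Timeouts don't replace the heartbeat, but they convert an infinite hang into a loud exception the agent can log and retry.&lt;/p&gt;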

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PID checks:&lt;/strong&gt; Process exists ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port checks:&lt;/strong&gt; Agent's HTTP server responds ✓ (the health endpoint runs on a separate thread)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The health check thread was fine. The &lt;em&gt;work&lt;/em&gt; thread was dead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Application-level heartbeat.&lt;/strong&gt; The heartbeat must come from &lt;em&gt;inside the work loop&lt;/em&gt;, not from a separate health-check thread or sidecar process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad — heartbeat from a separate thread
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;beat_forever&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beat_forever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Good — heartbeat from the actual work loop
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    &lt;span class="c1"&gt;# If this hangs...
&lt;/span&gt;    &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                &lt;span class="c1"&gt;# ...this never fires
&lt;/span&gt;    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is critical. If your heartbeat runs independently from your work loop, it's measuring "is the process alive?" not "is the agent working?" These are two very different questions.&lt;/p&gt;
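&lt;p&gt;If you already expose a health endpoint, one way to make it answer the second question is to have it report work-loop freshness rather than thread liveness. A sketch (function names and the staleness threshold are illustrative):&lt;/p&gt;

```python
import time

last_cycle_ts = time.time()  # updated only by the work loop, never by a sidecar

def record_cycle(now=None):
    """Call at the end of each successful work cycle."""
    global last_cycle_ts
    last_cycle_ts = time.time() if now is None else now

def health(max_staleness_s=120, now=None):
    """Report 'stale' when the work loop has not completed a cycle recently."""
    now = time.time() if now is None else now
    age = now - last_cycle_ts
    status = "stale" if age > max_staleness_s else "ok"
    return {"status": status, "last_cycle_age_s": age}
```

&lt;p&gt;Now a hung &lt;code&gt;fetch_from_api()&lt;/code&gt; shows up as a growing &lt;code&gt;last_cycle_age_s&lt;/code&gt; even though the endpoint's own thread is perfectly alive.&lt;/p&gt;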

&lt;h2&gt;
  
  
  Failure #3: The Runaway Loop
&lt;/h2&gt;

&lt;p&gt;This is the scariest failure mode because the agent looks &lt;em&gt;great&lt;/em&gt;. It's running. It's doing work. It's calling the LLM API, getting responses, processing them, and calling again. Every metric says "healthy."&lt;/p&gt;

&lt;p&gt;Except your bill is exploding.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The agent received a malformed response from an API. It asked the LLM to parse it. The LLM returned a structured output that triggered the same code path again. The agent asked the LLM to re-parse. Same result. Repeat.&lt;/p&gt;

&lt;p&gt;Token usage went from 200/min (normal) to &lt;strong&gt;40,000/min&lt;/strong&gt;. In 40 minutes, it burned through about $50 of API credits. Not catastrophic for a single incident, but imagine this happening overnight with a larger model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why traditional monitoring missed it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process health:&lt;/strong&gt; Running ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat:&lt;/strong&gt; Firing normally ✓ (the loop is &lt;em&gt;running&lt;/em&gt;, just wastefully)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; Zero ✓ (no errors — the LLM is responding successfully every time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory:&lt;/strong&gt; Normal ✓ (LLM calls are I/O-bound, not compute-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The pattern that catches this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost as a health metric.&lt;/strong&gt; Track token usage (or API cost) per heartbeat cycle. If it spikes 10-100x above baseline, flag it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_llm_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cost_estimate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the one metric that's unique to LLM-backed agents. Traditional services don't have a per-request cost that can spike 200x. AI agents do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Stack for AI Agents
&lt;/h2&gt;

&lt;p&gt;After dealing with all three failures, I realized the monitoring requirements for AI agents are fundamentally different from web services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What to monitor&lt;/th&gt;
&lt;th&gt;Web service&lt;/th&gt;
&lt;th&gt;AI agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is it alive?&lt;/td&gt;
&lt;td&gt;Process check&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Positive heartbeat&lt;/strong&gt; (agent must prove it's alive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it working?&lt;/td&gt;
&lt;td&gt;Request latency&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Application-level heartbeat&lt;/strong&gt; (from inside the work loop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it healthy?&lt;/td&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cost per cycle&lt;/strong&gt; (token usage as health signal)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The minimum viable version of this is surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Put a heartbeat call inside your main loop (not in a health-check thread)&lt;/li&gt;
&lt;li&gt;Include token/cost data in each heartbeat&lt;/li&gt;
&lt;li&gt;Alert on silence (missed heartbeat) and on cost spikes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That alone would have caught all three of my failures within 60 seconds instead of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;After reimplementing this pattern across multiple agents, I packaged it into &lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; — an open monitoring service for AI agents. Two lines of code to add heartbeat + cost tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clevagent&lt;/span&gt;
&lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLEVAGENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;clevagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It handles the alerting, auto-restart, loop detection, and daily reports. Free for up to 3 agents.&lt;/p&gt;

&lt;p&gt;But honestly, the pattern matters more than the tool. Even if you roll your own with a simple webhook + PagerDuty, the three signals — heartbeat, application-level liveness, and cost tracking — will save you from 90% of production AI agent failures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running AI agents in production? I'd genuinely like to hear what monitoring patterns work for you. The failure modes keep surprising me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>I Run 6 AI Agents as My Dev Team — Here's the Architecture That Actually Works</title>
      <dc:creator>Joongho Kwon</dc:creator>
      <pubDate>Sat, 28 Mar 2026 15:41:38 +0000</pubDate>
      <link>https://dev.to/joongho_kwon_2754f08bdadd/i-run-6-ai-agents-as-my-dev-team-heres-the-architecture-that-actually-works-3bgo</link>
      <guid>https://dev.to/joongho_kwon_2754f08bdadd/i-run-6-ai-agents-as-my-dev-team-heres-the-architecture-that-actually-works-3bgo</guid>
      <description>&lt;p&gt;I'm not a developer. I don't write code. But I ship production software across 8+ projects — trading bots, SaaS platforms, monitoring tools, market dashboards — every single week.&lt;/p&gt;

&lt;p&gt;My secret? I run 6 AI agents (Claude Code instances) as a structured engineering team, each with a distinct role, personality, and set of responsibilities. They communicate through a shared file, hand off work to each other, and I just... watch.&lt;/p&gt;

&lt;p&gt;Here's exactly how it works, what failed spectacularly, and what I'd do differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: One Human, Too Many Projects
&lt;/h2&gt;

&lt;p&gt;I manage multiple production systems simultaneously. Trading algorithms that execute real money. A SaaS product with paying users. Market analysis pipelines. Each needs ongoing development, bug fixes, and monitoring.&lt;/p&gt;

&lt;p&gt;A single AI coding assistant hits a wall fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context overload&lt;/strong&gt; — one agent can't hold the full picture of 8 projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No specialization&lt;/strong&gt; — the same agent doing architecture AND line-by-line bug fixes is inefficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No review&lt;/strong&gt; — AI-generated code reviewing itself is meaningless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential bottleneck&lt;/strong&gt; — one agent means one task at a time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built a team.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: 6 Agents, 6 Roles
&lt;/h2&gt;

&lt;p&gt;Each agent runs in its own terminal (tmux session) with a dedicated role:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What They Do&lt;/th&gt;
&lt;th&gt;What They Don't Do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Max&lt;/strong&gt; (Director)&lt;/td&gt;
&lt;td&gt;Architect&lt;/td&gt;
&lt;td&gt;Design systems, break down tasks, route work&lt;/td&gt;
&lt;td&gt;Write production code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Isabelle&lt;/strong&gt; (Developer)&lt;/td&gt;
&lt;td&gt;Senior Dev&lt;/td&gt;
&lt;td&gt;Implement features, make design decisions&lt;/td&gt;
&lt;td&gt;Review her own code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Kevin&lt;/strong&gt; (Coder)&lt;/td&gt;
&lt;td&gt;Junior Dev&lt;/td&gt;
&lt;td&gt;Execute well-specified tasks, bug fixes&lt;/td&gt;
&lt;td&gt;Make design choices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Sarah&lt;/strong&gt; (Reviewer)&lt;/td&gt;
&lt;td&gt;Code Reviewer&lt;/td&gt;
&lt;td&gt;Review code quality, catch edge cases&lt;/td&gt;
&lt;td&gt;Write code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Sam&lt;/strong&gt; (Optimizer)&lt;/td&gt;
&lt;td&gt;Cleanup&lt;/td&gt;
&lt;td&gt;Remove dead code, run audits&lt;/td&gt;
&lt;td&gt;Add features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Alex&lt;/strong&gt; (Partner)&lt;/td&gt;
&lt;td&gt;Specialist&lt;/td&gt;
&lt;td&gt;Independent research, analysis&lt;/td&gt;
&lt;td&gt;Core dev loop tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;each agent has hard boundaries&lt;/strong&gt;. Sarah &lt;em&gt;cannot&lt;/em&gt; write code. Max &lt;em&gt;cannot&lt;/em&gt; implement features. Kevin &lt;em&gt;cannot&lt;/em&gt; make design decisions. These constraints prevent the "do everything badly" failure mode.&lt;/p&gt;




&lt;h2&gt;
  
  
  Communication: A Shared Markdown File
&lt;/h2&gt;

&lt;p&gt;All 6 agents communicate through a single file: &lt;code&gt;current.md&lt;/code&gt;. That's it. No database, no message queue, no WebSocket server. Just a markdown file.&lt;/p&gt;

&lt;p&gt;Every message follows a strict format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### [DIRECTOR] 2026-03-28 14:30&lt;/span&gt;

&lt;span class="gs"&gt;**Status**&lt;/span&gt;: done
&lt;span class="gs"&gt;**Turn**&lt;/span&gt;: DEVELOPER
&lt;span class="gs"&gt;**Tier**&lt;/span&gt;: 2

&lt;span class="gu"&gt;#### What I Did&lt;/span&gt;
Designed the new notification system. Three components needed...

&lt;span class="gu"&gt;#### For Developer&lt;/span&gt;
Implement the webhook handler in src/webhooks/.
Use the existing auth middleware. Expected: POST /webhooks/notify returns 200.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Turn&lt;/strong&gt; field is the traffic light. Only one agent works at a time (per task). When Max writes &lt;code&gt;Turn: DEVELOPER&lt;/code&gt;, Isabelle picks it up. When Isabelle finishes, she writes &lt;code&gt;Turn: REVIEWER&lt;/code&gt; and Sarah takes over.&lt;/p&gt;
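&lt;p&gt;The turn check can be done mechanically. A sketch of how a wrapper around each agent could read the latest &lt;strong&gt;Turn&lt;/strong&gt; value from &lt;code&gt;current.md&lt;/code&gt; (the function and the regex are my illustration of the format above, not part of the author's actual setup):&lt;/p&gt;

```python
import re

def current_turn(text):
    """Return the Turn value from the most recent message in current.md."""
    turns = re.findall(r"\*\*Turn\*\*:\s*(\w+)", text)
    return turns[-1] if turns else None
```

&lt;p&gt;An agent wrapper would poll the file, call &lt;code&gt;current_turn()&lt;/code&gt;, and only wake its agent when the value matches its own role.&lt;/p&gt;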

&lt;h3&gt;
  
  
  Why This Works Better Than You'd Think
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full audit trail&lt;/strong&gt; — every decision, every handoff, every review comment is in one file. When something breaks at 2 AM, I can read exactly what happened.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Async by default&lt;/strong&gt; — agents don't need to be "online" simultaneously. Max designs at 9 AM, Isabelle implements at 2 PM, Sarah reviews at 6 PM. The file is the queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No lost context&lt;/strong&gt; — unlike chat-based communication, the shared file preserves the full thread. Agent 4 can read what Agent 1 said without anyone relaying the message.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Tier System: Not Everything Needs a Review
&lt;/h2&gt;

&lt;p&gt;Early on, I made the mistake of routing every change through the full pipeline. A typo fix going through Director → Developer → Reviewer → Director was absurd.&lt;/p&gt;

&lt;p&gt;Now I use tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 (Trivial):&lt;/strong&gt; Config edits, docs, one-line fixes. Director handles it directly. No review needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 (Standard):&lt;/strong&gt; New features, scripts, logic changes. Director designs, Implementer builds, Director verifies. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 (Critical):&lt;/strong&gt; Trading logic, security, data loss risk. Director designs, Sarah reviews the &lt;em&gt;design first&lt;/em&gt;, Implementer builds, Sarah reviews the &lt;em&gt;code&lt;/em&gt;, Director confirms, then I sign off.&lt;/p&gt;

&lt;p&gt;Tier 3 is the one that saved me real money. Sarah caught a rounding error in a trading algorithm that would have compounded into significant losses over time. The design pre-review step caught an architecture flaw that would have taken days to refactor.&lt;/p&gt;
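&lt;p&gt;One way to encode those pipelines is a plain lookup table. The role sequences below are a sketch of my setup, not a prescription:&lt;/p&gt;

```python
# Hypothetical pipeline table; role names follow the post.
PIPELINES = {
    1: ["DIRECTOR"],                                        # trivial: no review
    2: ["DIRECTOR", "DEVELOPER", "DIRECTOR"],               # standard
    3: ["DIRECTOR", "REVIEWER", "DEVELOPER", "REVIEWER",
        "DIRECTOR", "HUMAN"],                               # critical
}

def next_role(tier, handoffs_completed):
    """Given how many handoffs are done, who takes the next turn?"""
    pipeline = PIPELINES[tier]
    if handoffs_completed >= len(pipeline):
        return None  # task finished
    return pipeline[handoffs_completed]
```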




&lt;h2&gt;
  
  
  What Failed Spectacularly
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Agents Going Rogue
&lt;/h3&gt;

&lt;p&gt;Without hard constraints, agents would "help" by doing work outside their role. The reviewer would silently fix bugs instead of reporting them. The coder would redesign systems instead of implementing the spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Explicit boundary rules in each agent's profile + automated hooks that physically block violations. The Director's terminal literally rejects &lt;code&gt;.py&lt;/code&gt; file edits.&lt;/p&gt;
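&lt;p&gt;A hook like that can be a few lines. This is a hypothetical sketch of the boundary check, not the actual hook I run:&lt;/p&gt;

```python
import os

# Hypothetical boundary table: file suffixes each role may NOT edit.
FORBIDDEN = {
    "DIRECTOR": (".py",),          # the Director designs, never codes
    "REVIEWER": (".py", ".sh"),    # the Reviewer reports, never fixes
}

def check_edit(role, path):
    """Raise before an out-of-role edit is applied."""
    suffix = os.path.splitext(path)[1]
    if suffix in FORBIDDEN.get(role, ()):
        raise PermissionError(f"{role} may not edit {suffix} files: {path}")
```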

&lt;h3&gt;
  
  
  2. The Echo Chamber
&lt;/h3&gt;

&lt;p&gt;When one agent designs and another implements with no friction, bad ideas sail through unchallenged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Sarah (Reviewer) has an &lt;em&gt;obligation&lt;/em&gt; to challenge design decisions, not just review code. And the Director &lt;em&gt;must respond&lt;/em&gt; to her challenges — silence is not an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Stale Handoffs
&lt;/h3&gt;

&lt;p&gt;Agent A sets &lt;code&gt;Turn: AGENT_B&lt;/code&gt;, but Agent B's session crashed. The work sits there forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A watchdog script checks for handoffs older than 13 minutes and alerts me. Agents themselves check after 5 minutes of no response.&lt;/p&gt;
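&lt;p&gt;The watchdog itself can stay simple. A sketch, assuming the timestamped header format from &lt;code&gt;current.md&lt;/code&gt; (the function and its parsing details are illustrative):&lt;/p&gt;

```python
import re
import time

STALE_SECONDS = 13 * 60  # threshold from the post: 13 minutes

def stale_handoff(path, now=None):
    """Return the waiting agent's name if the last handoff is stale, else None."""
    now = now if now is not None else time.time()
    with open(path) as f:
        text = f.read()
    # Headers look like: ### [DIRECTOR] 2026-03-28 14:30
    headers = re.findall(r"### \[\w+\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2})", text)
    turns = re.findall(r"\*\*Turn\*\*: (\w+)", text)
    if not headers or not turns:
        return None
    last = time.mktime(time.strptime(headers[-1], "%Y-%m-%d %H:%M"))
    if now - last > STALE_SECONDS:
        return turns[-1]
    return None
```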

&lt;h3&gt;
  
  
  4. "Done" Doesn't Mean Done
&lt;/h3&gt;

&lt;p&gt;The biggest recurring problem: an agent says "done" but the work is incomplete, untested, or breaks something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Three completion gates that must be explicitly passed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate 1:&lt;/strong&gt; Does it run without errors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 2:&lt;/strong&gt; Is the output actually correct? (not just "exit 0")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 3:&lt;/strong&gt; Are all related files updated? (docs, configs, tests)&lt;/li&gt;
&lt;/ul&gt;
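&lt;p&gt;The gates are mechanical enough to script. A hedged sketch, where &lt;code&gt;check_output&lt;/code&gt; is whatever correctness test fits the task (all names here are hypothetical):&lt;/p&gt;

```python
import os
import subprocess

def passes_gates(run_cmd, check_output, required_files):
    """Run the three completion gates; each must pass explicitly."""
    # Gate 1: does it run without errors?
    proc = subprocess.run(run_cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return False, "gate 1: non-zero exit"
    # Gate 2: is the output actually correct? (not just "exit 0")
    if not check_output(proc.stdout):
        return False, "gate 2: wrong output"
    # Gate 3: are all related files updated? (docs, configs, tests)
    missing = [f for f in required_files if not os.path.exists(f)]
    if missing:
        return False, f"gate 3: missing {missing}"
    return True, "done"
```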




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After 2+ months of running this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 active projects&lt;/strong&gt; maintained simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~30 sessions&lt;/strong&gt; completed per week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 catch rate&lt;/strong&gt;: Sarah has caught 12 critical issues that would have hit production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My daily involvement&lt;/strong&gt;: ~2 hours of direction-setting, the rest is autonomous&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost is real — running 6 Claude instances isn't cheap. But compared to a human engineering team? It's a rounding error. And they work weekends.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Takeaways If You Want to Try This
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with 2 agents, not 6.&lt;/strong&gt; A Director + Implementer pair is enough to prove the pattern. Add reviewers and specialists later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The shared file is non-negotiable.&lt;/strong&gt; Every other communication method I tried (databases, APIs, inter-process messages) added complexity without adding value. A markdown file is human-readable, git-trackable, and impossible to misconfigure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard role boundaries matter more than smart prompts.&lt;/strong&gt; An agent that "can do everything" will do everything poorly. Constraints create quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate the handoffs.&lt;/strong&gt; Manual "go check the file" instructions get forgotten. A simple notification script that pokes the next agent is the difference between a working system and an abandoned experiment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build in a review loop for anything that touches money or user data.&lt;/strong&gt; This is the one thing that pays for the entire system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm now building &lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt; — a monitoring tool for AI agents, born directly from needing to keep my own agent team healthy. When your "developers" are AI processes that can silently crash, you need monitoring that understands AI agent behavior, not just uptime.&lt;/p&gt;

&lt;p&gt;If you're experimenting with multi-agent systems, I'd love to hear your approach. What worked? What blew up? Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post describes a real production system I use daily, not a theoretical framework. The agent names are their actual configured personas. Yes, they have personalities. No, I'm not apologizing for that.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>What I wish I knew before running AI agents 24/7</title>
      <dc:creator>Joongho Kwon</dc:creator>
      <pubDate>Fri, 27 Mar 2026 22:46:44 +0000</pubDate>
      <link>https://dev.to/joongho_kwon_2754f08bdadd/what-i-wish-i-knew-before-running-ai-agents-247-211d</link>
      <guid>https://dev.to/joongho_kwon_2754f08bdadd/what-i-wish-i-knew-before-running-ai-agents-247-211d</guid>
      <description>&lt;p&gt;I've been running long-lived AI agents in production for a while now. The specific workloads changed over time, but the operational failures were surprisingly consistent.&lt;/p&gt;

&lt;p&gt;What follows is the setup I wish I had from day one. None of it depends on a specific framework. If you run an LLM in a loop, poll external APIs, or make decisions on a schedule, these patterns matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your agent will die and not tell you
&lt;/h2&gt;

&lt;p&gt;The first time one of my agents crashed overnight, I lost hours before I noticed. There was no error log because it was not an application error in the usual sense: the OS's out-of-memory killer terminated the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I do now:&lt;/strong&gt; Every agent sends a heartbeat from inside its main loop. Not a separate health-check thread. Not a sidecar process. From the actual loop.&lt;/p&gt;

&lt;p&gt;That distinction matters. If the main loop is stuck on I/O, deadlocked on a lock, or wedged inside a retry path, an external "process is up" check tells you very little.&lt;/p&gt;

&lt;p&gt;Here is the minimal pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Send this to whatever monitoring system you use.
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent_cycle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;

    &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the heartbeat stops, something is wrong. I usually check every 60 seconds and alert after 2 missed beats.&lt;/p&gt;
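&lt;p&gt;On the monitoring side, "alert after 2 missed beats" is a one-line check. A sketch; the thresholds mirror my numbers above:&lt;/p&gt;

```python
import time

CHECK_INTERVAL = 60  # expected heartbeat period, seconds
MISSED_BEATS = 2     # alert after this many silent periods

def is_unhealthy(last_beat_ts, now=None):
    """True once more than two heartbeat periods have passed in silence."""
    now = now if now is not None else time.time()
    return (now - last_beat_ts) > CHECK_INTERVAL * MISSED_BEATS
```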

&lt;h2&gt;
  
  
  2. Auto-restart is harder than you think
&lt;/h2&gt;

&lt;p&gt;"Just restart it" sounds simple until you hit edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restart loops:&lt;/strong&gt; A bad config causes the agent to crash immediately after starting. Without a cooldown, you get crash → restart → crash → restart forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform differences:&lt;/strong&gt; Docker restart policies work well. &lt;code&gt;launchd&lt;/code&gt; on macOS silently fails if the service domain is wrong. &lt;code&gt;systemd&lt;/code&gt; needs a &lt;code&gt;RestartSec&lt;/code&gt; or it can spin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State corruption:&lt;/strong&gt; If your agent crashed mid-write to a state file, restarting puts it in an inconsistent state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I do now:&lt;/strong&gt; 5-minute cooldown between restarts. After 3 failed restarts, stop trying and alert me. On restart, the agent validates its state before resuming.&lt;/p&gt;

&lt;p&gt;A good restart policy is less like "always restart" and more like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;missed heartbeats -&amp;gt; mark unhealthy
restart once -&amp;gt; wait 5 minutes
restart again -&amp;gt; wait 5 minutes
restart third time -&amp;gt; stop auto-restarting, escalate to human
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
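&lt;p&gt;In code, that escalation ladder might look like this sketch, with &lt;code&gt;restart&lt;/code&gt;, &lt;code&gt;validate_state&lt;/code&gt;, and &lt;code&gt;alert&lt;/code&gt; supplied by whatever platform you run on:&lt;/p&gt;

```python
import time

COOLDOWN = 5 * 60   # seconds between restart attempts
MAX_TRIES = 3       # then stop auto-restarting and escalate

def recover(restart, validate_state, alert, cooldown=COOLDOWN, max_tries=MAX_TRIES):
    """Restart with a cooldown; after max_tries failures, escalate to a human."""
    for attempt in range(1, max_tries + 1):
        restart()
        if validate_state():
            return True  # agent came back healthy
        time.sleep(cooldown)  # prevents crash/restart loops
    alert(f"gave up after {max_tries} restarts; human needed")
    return False
```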



&lt;h2&gt;
  
  
  3. LLM cost is a health metric
&lt;/h2&gt;

&lt;p&gt;This was my biggest insight. For traditional services, you monitor CPU, memory, and latency. For LLM agents, &lt;strong&gt;token cost per cycle&lt;/strong&gt; is often the metric that catches problems first.&lt;/p&gt;

&lt;p&gt;A runaway loop doesn't spike CPU (API calls are I/O bound). It doesn't spike memory. But token usage goes from 200/min to 40,000/min instantly. If you're not tracking cost per cycle, you'll find out from your API bill.&lt;/p&gt;

&lt;p&gt;The simplest version of this is a moving baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rolling_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens_per_cycle&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens_used_this_cycle&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;possible loop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokens_used_this_cycle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Graceful shutdown is not optional
&lt;/h2&gt;

&lt;p&gt;One of my agents sends a burst of API calls during shutdown to finish cleanup safely. The first time I added loop detection, it flagged every graceful shutdown as a runaway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I do now:&lt;/strong&gt; The agent signals "shutting down" before cleanup. The monitoring system knows to expect a burst and does not flag it.&lt;/p&gt;
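&lt;p&gt;The signaling itself is trivial; the value is in the ordering. A sketch with hypothetical function names:&lt;/p&gt;

```python
def graceful_shutdown(send_heartbeat, cleanup):
    """Signal intent first so the monitor expects the API burst, then clean up."""
    send_heartbeat(status="shutting_down")  # monitor now suppresses loop alerts
    cleanup()                               # may burst API calls safely
    send_heartbeat(status="stopped")        # final beat: silence after this is expected
```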

&lt;h2&gt;
  
  
  5. Daily reports catch the slow problems
&lt;/h2&gt;

&lt;p&gt;Alerts catch emergencies. Daily reports catch slower drift that alerts miss — an agent that is gradually using more tokens per cycle, or one that restarts once a day at the same time because of a cron conflict.&lt;/p&gt;

&lt;p&gt;I review a daily summary of each agent's health, cost, and event history. Most of my operational improvements came from patterns in that report, not from real-time alerts.&lt;/p&gt;

&lt;p&gt;The basic report I want every morning is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Was the agent alive the whole day?
- How many restart events happened?
- Did token cost per cycle move outside baseline?
- Were there loop-detection or cooldown events?
- Did anything get auto-recovered, or does it need a human?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;These patterns aren't complicated, but I didn't find them written down anywhere when I started. Hopefully this saves someone a few "learning experiences."&lt;/p&gt;

&lt;p&gt;If you want to see what my setup looks like, I built these ideas into &lt;a href="https://clevagent.io?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;ClevAgent&lt;/a&gt;. But honestly, even a homegrown heartbeat plus cost-per-cycle tracker gets you most of the way there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>monitoring</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
