Last Tuesday a content generation agent ran overnight. Completed every task. Reported success on all of them. Exit codes clean.
The output was wrong on 80% of tasks. The agent had evaluated its own work, decided it met the criteria, and marked each task complete. By its own definition — done. By any useful definition — not done.
Nobody knew until a human reviewed the output the next morning.
This is the AI agent failure mode that conventional monitoring was never built to catch. Not a crash. Not an exception. Not an error rate spike. A process that completed correctly and produced wrong output.
Here's how to detect it before the human review catches it — or worse, before it doesn't.
Why your existing monitoring misses this
Your error monitor watches for exceptions. The agent didn't throw one.
Your uptime monitor watches for downtime. The agent was running the whole time.
Your APM tool watches latency and throughput. Both looked normal — API calls completing, responses arriving.
Your logs show: agent started, tasks processed, agent completed. Clean run.
None of these tools know what the agent was supposed to produce, what it actually produced, or whether those two things match. They watch the infrastructure around the agent. Nobody watches the agent itself.
The four failure modes that look like success
1. Ghost run — completed, produced nothing useful
The agent finishes. Exit code clean. Task marked done. Output exists — technically. But the output doesn't accomplish the task. The agent evaluated its own work using criteria that were too loose, met its own bar, and finished.
From your monitoring: successful run.
From reality: wasted compute, wrong output, downstream processes inheriting bad state.
2. Infinite loop — running, producing nothing
The agent keeps running. Tool calls keep firing. Tokens keep accumulating. The process looks healthy. What's actually happening — the agent called the same tool 47 times and can't converge on a result. No completion event ever arrives.
From your monitoring: healthy process, slight delay.
From reality: $4.80 in tokens, zero useful output, worker occupied for hours.
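A loop like this can often be caught from inside the agent by counting identical tool calls. The following is a minimal sketch, assuming each tool call is visible as a name plus an arguments dict. `RepeatGuard` and the threshold of 5 are illustrative names and values, not part of any library.

```python
import json
from collections import Counter


class RepeatGuard:
    """Counts identical tool calls within one run and flags likely loops."""

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats  # illustrative threshold, tune per agent
        self.counts = Counter()

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Record one call; return True once the same call repeats too often."""
        # Serialize with sorted keys so equal dicts always map to the same key.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats


guard = RepeatGuard(max_repeats=5)
# Simulate the scenario above: the same call fired 47 times.
flags = [guard.record("search", {"query": "pricing page"}) for _ in range(47)]
first_flagged_call = flags.index(True) + 1  # 1-based call number
```

With this threshold the guard trips on the sixth identical call, long before the 47th. Whether to abort the run or only raise an alert at that point is a design choice that depends on how expensive a false positive is.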
3. Stall — alive, not progressing
The agent is waiting. For an API response. For a tool result. For a resource that never arrives. Not looping — just stuck. Process alive, no progress, no timeout fired.
From your monitoring: healthy process, normal latency.
From reality: stuck for 4 hours, dependent processes blocked, output delayed indefinitely.
4. Token burn — working, but at 10x cost
The agent is legitimately working. Making progress. But task scope expanded somewhere, or prompting is inefficient, or a reasoning loop is consuming tokens without converging efficiently. Finishes eventually. OpenAI bill for that one run: $12 instead of $0.04.
From your monitoring: successful run.
From reality: 300x cost overrun, no signal it happened.
What you actually need to monitor
Not whether the agent ran. Whether the agent ran the way it normally runs.
That requires a baseline — what does a normal run look like for this agent on this task type?
- How many tool calls does it normally make?
- How many iterations does it normally take?
- How long does it normally run?
- How many tokens does it normally consume?
- Does it normally produce an output event before completing?
When a run deviates from that baseline (47 tool calls instead of 3-5, 45 minutes instead of 2-3, $12 in tokens instead of $0.04, no output event before completion), that's the signal.
Not a threshold you configured. A deviation from what you've learned is normal.
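To make "deviation from a learned baseline" concrete, here is a rough sketch using a z-score over past runs. It is an assumption about one reasonable way to do this, not a description of how NotiLens computes baselines. The function name, the metric keys, and the limit of 3 standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def deviations(history: list[dict], current: dict, z_limit: float = 3.0) -> dict:
    """Return the metrics in `current` that sit more than z_limit standard
    deviations away from the mean of the same metric in past runs."""
    flagged = {}
    for name, value in current.items():
        past = [run[name] for run in history if name in run]
        if len(past) < 2:
            continue  # not enough data to estimate spread
        mu, sigma = mean(past), stdev(past)
        # Floor on sigma avoids division by zero when past runs are identical.
        z = (value - mu) / max(sigma, 1e-9)
        if abs(z) > z_limit:
            flagged[name] = round(z, 1)
    return flagged


# Hypothetical past runs, using the normal ranges quoted above.
history = [
    {"tool_calls": 3, "minutes": 2.0, "cost_usd": 0.04},
    {"tool_calls": 4, "minutes": 2.5, "cost_usd": 0.04},
    {"tool_calls": 5, "minutes": 3.0, "cost_usd": 0.05},
    {"tool_calls": 4, "minutes": 2.4, "cost_usd": 0.04},
]
normal = deviations(history, {"tool_calls": 4, "minutes": 2.6, "cost_usd": 0.04})
runaway = deviations(history, {"tool_calls": 47, "minutes": 45.0, "cost_usd": 12.0})
```

Here `normal` comes back empty and `runaway` flags all three metrics. A z-score is only one option. With few past runs or skewed metrics, a median-based rule may behave better.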
Setting up AI agent monitoring with NotiLens
Install:
pip install notilens
npm install @notilens/notilens
Get credentials at notilens.com → Topic → create a new topic.
Basic setup — auto-instrumentation
The fastest path. patch=True auto-instruments OpenAI, Anthropic, and LangChain calls:
Python
import notilens
nl = notilens.init(
    name="content-agent",
    token="YOUR_TOKEN",
    secret="YOUR_SECRET",
    patch=True,  # auto-instruments all AI calls
)
Node.js
import { NotiLens } from '@notilens/notilens';
const nl = NotiLens.init('content-agent', {
  token: 'YOUR_TOKEN',
  secret: 'YOUR_SECRET'
});
Full instrumentation — loop and output tracking
For complete visibility including loop detection and ghost run detection:
Python
import notilens
nl = notilens.init(name="content-agent", token="TOKEN", secret="SECRET")
run = nl.task("content-generation")
run.start()
try:
    run.progress("Starting content generation")
    for i, task in enumerate(tasks):
        run.loop(f"[{i+1}] Processing: {task.title}")  # every iteration
        run.metric("tool_calls", 1)  # accumulates
        result = agent.execute(task)
        run.metric("tokens", result.usage.total_tokens)
        run.metric("cost_usd", result.usage.cost)
    # Critical: only fires if output was actually produced
    # If this never fires, NotiLens detects a ghost run
    run.output_generated(f"Generated {len(tasks)} content pieces")
    run.complete(f"Processed {len(tasks)} tasks")
except Exception as e:
    run.fail(str(e))
Node.js
import { NotiLens } from '@notilens/notilens';
const nl = NotiLens.init('content-agent', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('content-generation');
run.start();
try {
  run.progress('Starting content generation');
  for (const [i, task] of tasks.entries()) {
    run.loop(`[${i+1}] Processing: ${task.title}`); // every iteration
    run.metric('tool_calls', 1); // accumulates
    const result = await agent.execute(task);
    run.metric('tokens', result.usage.totalTokens);
    run.metric('cost_usd', result.usage.cost);
  }
  // Critical: only fires if output was actually produced
  run.outputGenerated(`Generated ${tasks.length} content pieces`);
  run.complete(`Processed ${tasks.length} tasks`);
} catch (err) {
  run.fail(err.message);
}
The ghost run detection — why output_generated matters
The run.output_generated() call is the ghost run detector.
A run that calls run.complete() without calling run.output_generated() first — NotiLens flags it. Agent said it was done. No output event arrived before completion. That's the signal.
Place run.output_generated() only after you've confirmed output was actually produced — not just that the agent finished, but that something useful came out of it.
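One way to "confirm output was actually produced" is to gate the event on a check that runs outside the agent's own self-evaluation. This sketch assumes outputs are plain strings. `looks_useful`, `finish_run`, the 50-word minimum, and the `RecordingRun` stand-in are all hypothetical; only `output_generated()` and `complete()` come from the examples above.

```python
def looks_useful(text: str, min_words: int = 50, required: tuple = ()) -> bool:
    """Cheap structural check, independent of the agent's own judgment."""
    if len(text.split()) < min_words:
        return False
    lowered = text.lower()
    return all(term.lower() in lowered for term in required)


def finish_run(run, outputs: list, required: tuple = ()) -> int:
    """Emit the output event only if every piece passes the check, then
    complete the run. Returns the number of pieces that passed."""
    passed = sum(looks_useful(text, required=required) for text in outputs)
    if outputs and passed == len(outputs):
        run.output_generated(f"Generated {passed} content pieces")
    run.complete(f"Processed {len(outputs)} tasks")
    return passed


class RecordingRun:
    """Stand-in for the real run object so the sketch executes offline."""

    def __init__(self):
        self.events = []

    def output_generated(self, message):
        self.events.append("output_generated")

    def complete(self, message):
        self.events.append("complete")


good = RecordingRun()
finish_run(good, ["word " * 60 + "pricing"], required=("pricing",))

ghost = RecordingRun()
finish_run(ghost, ["Done."], required=("pricing",))
```

The `ghost` run still calls `complete()`, but without a preceding output event, which is exactly the pattern described above. A word count and keyword check is deliberately crude. A schema check or a second-model review would slot into `looks_useful` the same way.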
Stall detection
For agents pausing on slow tools or external APIs:
Python
run.wait("Awaiting external API response")
result = call_slow_api()
run.progress("API response received, continuing")
run.wait() is non-terminal — the run continues. NotiLens learns how long your agent normally spends between events and fires if the gap becomes anomalous. A 4-hour stall on a tool that normally responds in 200ms — that's the alert.
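The idea of an anomalous gap can be illustrated with a simple rule: compare the current silence against the typical gap seen so far. This is a sketch under assumptions, not NotiLens's actual detection logic. `is_stalled`, the factor of 20, and the 30-second floor are made up for illustration.

```python
from statistics import median


def is_stalled(past_gaps_s: list, current_gap_s: float,
               factor: float = 20.0, floor_s: float = 30.0) -> bool:
    """Flag a gap far longer than the typical gap between events."""
    typical = median(past_gaps_s)
    # The floor keeps very fast tools from alerting on small absolute delays.
    limit = max(typical * factor, floor_s)
    return current_gap_s > limit


# Seconds between events for a tool that normally answers in about 200 ms.
past_gaps = [0.18, 0.21, 0.20, 0.25, 0.19]

slow_but_fine = is_stalled(past_gaps, 0.4)
four_hour_stall = is_stalled(past_gaps, 4 * 3600)
```

In a live agent the current gap would be the time since the last event, checked periodically by a separate watchdog, since a blocked call cannot report on itself.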
Token and cost tracking
run.metric("tokens", tokens_used_this_call)
run.metric("cost_usd", round(tokens_used_this_call * 0.0000002, 6))
NotiLens accumulates these per run and compares against the learned baseline. A run consuming 10x normal token count — cost anomaly alert. No budget limit to configure. The baseline is the limit.
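The arithmetic behind a cost anomaly is small enough to sketch directly. This assumes the flat per-token rate from the snippet above, which is illustrative; real pricing differs by model and by input versus output tokens. The helper names and the default factor of 10 are hypothetical.

```python
PRICE_PER_TOKEN_USD = 0.0000002  # rate used in the snippet above, not a real price list


def run_cost(token_counts: list) -> float:
    """Total cost of one run from its per-call token counts."""
    return round(sum(token_counts) * PRICE_PER_TOKEN_USD, 6)


def is_cost_anomaly(run_value: float, baseline_value: float, factor: float = 10.0) -> bool:
    """True when a run's tokens or cost reach `factor` times the baseline."""
    return run_value >= baseline_value * factor


cost = run_cost([20_000, 22_000])        # 42,000 tokens in total
overrun = round(12.00 / 0.04)            # the $12 versus $0.04 run from earlier
flagged = is_cost_anomaly(12.00, 0.04)   # well past a 10x factor
```

At this rate 42,000 tokens cost $0.0084, and the $12 run is a 300x overrun against a $0.04 baseline.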
What the alert looks like
✅ task.started Content agent — task started
🔄 task.loop [1] Processing: Article 1
🔄 task.loop [2] Processing: Article 2
...
🔄 task.loop [40] Processing: Article 40
⚠️ Anomaly detected Loop count 40 — exceeds baseline (avg: 8.2)
No output_generated event received
tokens: 42,000 | cost: $0.0084
→ Push notification fired
Loop count deviation. No output event. Token anomaly. Three signals, one alert, fired while the agent is still running — not after the morning review catches it.
Full monitoring checklist for AI agents
- [ ] run.start() fires when task begins
- [ ] run.loop() called on every agent iteration
- [ ] run.metric("tool_calls", 1) accumulates per iteration
- [ ] run.metric("tokens", n) tracks token usage
- [ ] run.metric("cost_usd", n) tracks cost per run
- [ ] run.wait() fires when agent pauses on slow calls
- [ ] run.output_generated() fires only when useful output is confirmed
- [ ] run.complete() fires on successful completion
- [ ] run.fail() fires on unhandled exceptions
- [ ] run.error() fires on non-fatal tool errors
The short version
Your agent completing and your agent doing the right thing are two different events.
Conventional monitoring watches the first one. You need to watch the second.
run.output_generated() before run.complete() is the simplest version of that check. Everything else — loop count, token tracking, stall detection — builds the baseline that makes deviations detectable automatically.
Works with LangChain, CrewAI, AutoGen, LlamaIndex, Pydantic AI, or any custom agent loop.
notilens.com — 7-day free trial, no credit card required.