Stephen Souza

Posted on May 28

Your AI agent said it was done. It wasn't. Here's how to catch that.

#ai #agents #devops

Last Tuesday a content generation agent ran overnight. Completed every task. Reported success on all of them. Exit codes clean.

The output was wrong on 80% of tasks. The agent had evaluated its own work, decided it met the criteria, and marked each task complete. By its own definition — done. By any useful definition — not done.

Nobody knew until a human reviewed the output the next morning.

This is the AI agent failure mode that conventional monitoring was never built to catch. Not a crash. Not an exception. Not an error rate spike. A process that completed correctly and produced wrong output.

Here's how to detect it before the human review catches it — or worse, before it doesn't.

Why your existing monitoring misses this

Your error monitor watches for exceptions. The agent didn't throw one.

Your uptime monitor watches for downtime. The agent was running the whole time.

Your APM tool watches latency and throughput. Both looked normal — API calls completing, responses arriving.

Your logs show: agent started, tasks processed, agent completed. Clean run.

None of these tools know what the agent was supposed to produce, what it actually produced, or whether those two things match. They watch the infrastructure around the agent. Nobody watches the agent itself.

The four failure modes that look like success

1. Ghost run — completed, produced nothing useful

The agent finishes. Exit code clean. Task marked done. Output exists — technically. But the output doesn't accomplish the task. The agent evaluated its own work using criteria that were too loose, met its own bar, and finished.

From your monitoring: successful run.
From reality: wasted compute, wrong output, downstream processes inheriting bad state.
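One way to tighten the bar is to check the output with code the agent doesn't control. A minimal sketch, assuming content tasks where a minimum length and a few required terms are reasonable acceptance criteria. `check_output`, `min_words`, and `required_terms` are hypothetical names for illustration, not part of any library:

```python
def check_output(text: str, min_words: int, required_terms: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    word_count = len(text.split())
    if word_count < min_words:
        problems.append(f"too short: {word_count} words, expected at least {min_words}")
    lowered = text.lower()
    for term in required_terms:
        # Case-insensitive substring match; swap in stricter checks as needed.
        if term.lower() not in lowered:
            problems.append(f"missing required term: {term}")
    return problems


if __name__ == "__main__":
    draft = "Monitoring matters."
    print(check_output(draft, min_words=300, required_terms=["baseline"]))
```

The checks themselves will differ per task. What matters is that the pass/fail decision is made outside the agent's own evaluation.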

2. Infinite loop — running, producing nothing

The agent keeps running. Tool calls keep firing. Tokens keep accumulating. The process looks healthy. What's actually happening — the agent called the same tool 47 times and can't converge on a result. No completion event ever arrives.

From your monitoring: healthy process, slight delay.
From reality: $4.80 in tokens, zero useful output, worker occupied for hours.
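A loop like this can often be spotted locally by counting identical tool calls. The sketch below assumes you can see each tool name and its serialized arguments before the call is made. `RepeatCallGuard` and its `max_repeats` limit are illustrative, not an existing API:

```python
from collections import Counter


class RepeatCallGuard:
    """Counts identical tool calls and reports when one repeats too often."""

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool_name: str, arguments: str) -> bool:
        """Record one call; return True once the same call exceeds max_repeats."""
        key = (tool_name, arguments)
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats


if __name__ == "__main__":
    guard = RepeatCallGuard(max_repeats=5)
    for attempt in range(1, 48):
        if guard.record("search", '{"query": "pricing"}'):
            print(f"possible loop: identical call repeated {attempt} times")
            break
```

A fixed limit is cruder than a learned baseline, but it stops the 47-call case long before call 47.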

3. Stall — alive, not progressing

The agent is waiting. For an API response. For a tool result. For a resource that never arrives. Not looping — just stuck. Process alive, no progress, no timeout fired.

From your monitoring: healthy process, normal latency.
From reality: stuck for 4 hours, dependent processes blocked, output delayed indefinitely.
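The usual local guard against a stall is a hard timeout around each blocking call. A minimal sketch using only the standard library; `call_with_timeout` is a hypothetical helper, and the timeout values are placeholders to tune per tool:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s: float, *args):
    """Run fn(*args) in a worker thread; raise TimeoutError if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"no result after {timeout_s}s") from None
    finally:
        # Don't block on the worker; a stuck call may never return.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    def slow_tool():
        time.sleep(0.5)  # stands in for a dependency that hangs
        return "late"

    try:
        call_with_timeout(slow_tool, 0.05)
    except TimeoutError as exc:
        print(f"stall detected: {exc}")
```

Note the limitation: Python cannot kill the worker thread, so this only unblocks the caller. Where the client library offers its own timeout parameter, prefer that.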

4. Token burn — working, but at 10x cost

The agent is legitimately working. Making progress. But task scope expanded somewhere, or prompting is inefficient, or a reasoning loop is consuming tokens without converging efficiently. Finishes eventually. OpenAI bill for that one run: $12 instead of $0.04.

From your monitoring: successful run.
From reality: 300x cost overrun, no signal it happened.

What you actually need to monitor

Not whether the agent ran. Whether the agent ran the way it normally runs.

That requires a baseline — what does a normal run look like for this agent on this task type?

How many tool calls does it normally make?
How many iterations does it normally take?
How long does it normally run?
How many tokens does it normally consume?
Does it normally produce an output event before completing?

When a run deviates from that baseline (47 tool calls instead of 3-5, 45 minutes instead of 2-3, $12 in tokens instead of $0.04, no output event before completion), that's the signal.

Not a threshold you configured. A deviation from what you've learned is normal.
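To make "deviation from normal" concrete, here is a rough sketch that flags a run whose metric sits several standard deviations from the mean of past runs. This is a generic statistical stand-in, not a description of how NotiLens builds its baseline. `is_anomalous`, the 3-sigma cutoff, and the sample history are all assumptions:

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], observed: float, max_sigma: float = 3.0) -> bool:
    """Flag observed if it sits more than max_sigma standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough runs to have a baseline yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past run was identical, so any difference is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > max_sigma


if __name__ == "__main__":
    tool_calls_per_run = [3, 4, 5, 4, 3, 5, 4]  # illustrative past runs
    print(is_anomalous(tool_calls_per_run, 47))
    print(is_anomalous(tool_calls_per_run, 4))
```

The same function works for duration, iteration count, or tokens. Only the history you feed it changes.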

Setting up AI agent monitoring with NotiLens

Install:

Python

pip install notilens

Node.js

npm install @notilens/notilens

Get credentials at notilens.com → Topic → create a new topic.

Basic setup — auto-instrumentation

The fastest path. patch=True auto-instruments OpenAI, Anthropic, and LangChain calls:

Python

import notilens

nl = notilens.init(
    name="content-agent",
    token="YOUR_TOKEN",
    secret="YOUR_SECRET",
    patch=True  # auto-instruments all AI calls
)

Node.js

import { NotiLens } from '@notilens/notilens';

const nl = NotiLens.init('content-agent', {
  token: 'YOUR_TOKEN',
  secret: 'YOUR_SECRET'
});

Full instrumentation — loop and output tracking

For complete visibility including loop detection and ghost run detection:

Python

import notilens

nl  = notilens.init(name="content-agent", token="TOKEN", secret="SECRET")
run = nl.task("content-generation")
run.start()

try:
    run.progress("Starting content generation")

    produced = 0  # pieces that passed your own output check

    for i, task in enumerate(tasks):
        run.loop(f"[{i+1}] Processing: {task.title}")  # every iteration
        run.metric("tool_calls", 1)                     # accumulates

        result = agent.execute(task)
        run.metric("tokens", result.usage.total_tokens)
        run.metric("cost_usd", result.usage.cost)

        if is_useful(result):  # placeholder: your own validation, not the agent's
            produced += 1

    # Critical: fire this only if output was actually produced.
    # If it never fires before complete(), NotiLens detects a ghost run.
    if produced:
        run.output_generated(f"Generated {produced} content pieces")
    run.complete(f"Processed {len(tasks)} tasks")

except Exception as e:
    run.fail(str(e))

Node.js

import { NotiLens } from '@notilens/notilens';

const nl  = NotiLens.init('content-agent', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('content-generation');
run.start();

try {
  run.progress('Starting content generation');

  let produced = 0;  // pieces that passed your own output check

  for (const [i, task] of tasks.entries()) {
    run.loop(`[${i+1}] Processing: ${task.title}`);  // every iteration
    run.metric('tool_calls', 1);                      // accumulates

    const result = await agent.execute(task);
    run.metric('tokens', result.usage.totalTokens);
    run.metric('cost_usd', result.usage.cost);

    if (isUseful(result)) produced += 1;  // placeholder: your own validation
  }

  // Critical: fire this only if output was actually produced
  if (produced > 0) {
    run.outputGenerated(`Generated ${produced} content pieces`);
  }
  run.complete(`Processed ${tasks.length} tasks`);

} catch (err) {
  run.fail(err.message);
}

The ghost run detection — why output_generated matters

The run.output_generated() call is the ghost run detector.

A run that calls run.complete() without calling run.output_generated() first — NotiLens flags it. Agent said it was done. No output event arrived before completion. That's the signal.

Place run.output_generated() only after you've confirmed output was actually produced — not just that the agent finished, but that something useful came out of it.
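The rule itself is simple enough to express as an ordering check over a run's events. The sketch below only illustrates the logic and is not NotiLens's implementation. The event names are shorthand borrowed from the method names above, and `is_ghost_run` is hypothetical:

```python
def is_ghost_run(events: list[str]) -> bool:
    """True if the run completed without an earlier output_generated event."""
    saw_output = False
    for name in events:
        if name == "output_generated":
            saw_output = True
        elif name == "complete":
            return not saw_output
    return False  # run has not completed yet, so no verdict


if __name__ == "__main__":
    print(is_ghost_run(["start", "loop", "loop", "complete"]))
    print(is_ghost_run(["start", "loop", "output_generated", "complete"]))
```

Because the check depends only on event order, it is only as trustworthy as the placement of the output event in your own code.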

Stall detection

For agents pausing on slow tools or external APIs:

Python

run.wait("Awaiting external API response")
result = call_slow_api()
run.progress("API response received, continuing")

run.wait() is non-terminal — the run continues. NotiLens learns how long your agent normally spends between events and fires if the gap becomes anomalous. A 4-hour stall on a tool that normally responds in 200ms — that's the alert.

Token and cost tracking

run.metric("tokens", tokens_used_this_call)
run.metric("cost_usd", round(tokens_used_this_call * 0.0000002, 6))

NotiLens accumulates these per run and compares against the learned baseline. A run consuming 10x normal token count — cost anomaly alert. No budget limit to configure. The baseline is the limit.
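For intuition, the accumulate-and-compare step might look like the sketch below. It reuses the illustrative per-token rate from the snippet above. `RunUsage`, the 4,000-token baseline, and the 10x factor are assumptions made for the example, not library behavior:

```python
PRICE_PER_TOKEN_USD = 0.0000002  # illustrative rate; substitute your model's pricing


class RunUsage:
    """Accumulates tokens for one run and compares the total to a baseline."""

    def __init__(self, baseline_tokens: float):
        self.baseline_tokens = baseline_tokens
        self.tokens = 0

    def add(self, tokens_used_this_call: int) -> None:
        self.tokens += tokens_used_this_call

    @property
    def cost_usd(self) -> float:
        return round(self.tokens * PRICE_PER_TOKEN_USD, 6)

    def over_baseline(self, factor: float = 10.0) -> bool:
        """True once the run has used more than factor times its usual tokens."""
        return self.tokens > factor * self.baseline_tokens


if __name__ == "__main__":
    usage = RunUsage(baseline_tokens=4000)  # hypothetical learned baseline
    for _ in range(42):
        usage.add(1000)
    print(usage.tokens, usage.cost_usd, usage.over_baseline())
```

Checking `over_baseline()` after every call, rather than at the end of the run, is what lets an alert fire while the agent is still burning tokens.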

What the alert looks like

✅ task.started        Content agent — task started
🔄 task.loop           [1] Processing: Article 1
🔄 task.loop           [2] Processing: Article 2
...
🔄 task.loop           [40] Processing: Article 40
⚠️  Anomaly detected   Loop count 40 — exceeds baseline (avg: 8.2)
                       No output_generated event received
                       tokens: 42,000 | cost: $0.0084
→ Push notification fired

Loop count deviation. No output event. Token anomaly. Three signals, one alert, fired while the agent is still running — not after the morning review catches it.

Full monitoring checklist for AI agents

[ ] run.start() fires when task begins
[ ] run.loop() called on every agent iteration
[ ] run.metric("tool_calls", 1) accumulates per iteration
[ ] run.metric("tokens", n) tracks token usage
[ ] run.metric("cost_usd", n) tracks cost per run
[ ] run.wait() fires when agent pauses on slow calls
[ ] run.output_generated() fires only when useful output is confirmed
[ ] run.complete() fires on successful completion
[ ] run.fail() fires on unhandled exceptions
[ ] run.error() fires on non-fatal tool errors

The short version

Your agent completing and your agent doing the right thing are two different events.

Conventional monitoring watches the first one. You need to watch the second.

run.output_generated() before run.complete() is the simplest version of that check. Everything else — loop count, token tracking, stall detection — builds the baseline that makes deviations detectable automatically.

Works with LangChain, CrewAI, AutoGen, LlamaIndex, Pydantic AI, or any custom agent loop.

notilens.com — 7-day free trial, no credit card required.

DEV Community
