Probably not. And that's the problem.
Your process was running. Tool calls were firing. No exceptions thrown. From every angle your stack can see — healthy agent doing legitimate work.
What was actually happening: the agent called web_search with "latest AI news". Got results. Called it again with "recent AI developments". Got similar results. Called it again. And again. 47 tool calls. Zero useful output. $4.80 in tokens. Nobody knew until the OpenAI invoice arrived.
This is the failure mode conventional monitoring was never built to catch.
Why your current monitoring misses it
Your uptime monitor sees: process running.
Your error monitor sees: no exceptions thrown.
Your log monitor sees: tool calls completing successfully.
Your OpenAI dashboard sees: token consumption — but no per-task breakdown.
Nothing in that stack identifies that the agent is calling the same tool repeatedly with no progress toward completion. A looping agent looks identical to a healthy agent doing legitimate multi-step research.
The signal isn't an error. It's a pattern — same tool, repeated calls, no convergence toward a completion event. Catching that requires monitoring that understands what the agent is supposed to be doing, not just whether it's technically running.
What actually causes loops
Ambiguous task objective — the agent can't determine if it has "enough" to complete the task. It keeps searching for a confidence threshold it can never reach.
Tool output that looks like progress but isn't — each result genuinely looks like partial progress. The agent never recognises the cycle.
Malformed tool responses — the agent retries with slightly different parameters, gets the same malformed response, retries again.
Context window pressure — as context fills with results, the agent loses track of what it already tried and starts repeating earlier tool calls.
max_iterations as the only safeguard — stops the loop but doesn't alert you, doesn't tell you how many times the agent looped, and doesn't fire until the budget is already spent. Circuit breaker, not a monitor.
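To see why max_iterations alone behaves as a circuit breaker rather than a monitor, here is a minimal Python sketch (the stuck agent and its tool are hypothetical): the loop burns its entire budget and exits with no alert and no record of why.

```python
MAX_ITERATIONS = 47

def stuck_agent_step():
    """Hypothetical agent step: always picks the same tool, never finishes."""
    return "web_search", "latest AI news"

def run_agent(max_iterations=MAX_ITERATIONS):
    calls = []
    for _ in range(max_iterations):
        tool, query = stuck_agent_step()
        calls.append(tool)
        # execute_tool(tool, query) would run here; its result never satisfies is_done
    # The loop ends only because the budget ran out: no alert, no diagnosis.
    return {"completed": False, "tool_calls": len(calls)}

result = run_agent()
# result == {"completed": False, "tool_calls": 47}
```

Nothing upstream ever learns that 47 identical calls were made; the function just returns quietly.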
Three loop signatures to watch
Signature 1 — Same tool, repeated calls
web_search → web_search → web_search ← loop detected
Same tool appears in 3 of the last 5 calls without a completion event.
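The "3 of the last 5" rule above can be sketched in a few lines. This is illustrative only, not NotiLens's actual detector; the tool names are made up.

```python
from collections import Counter, deque

def detect_repeat_loop(call_history, window=5, threshold=3):
    """Flag a loop when one tool fills `threshold` of the last `window` calls."""
    recent = deque(call_history, maxlen=window)  # sliding window over tool names
    if not recent:
        return False
    tool, count = Counter(recent).most_common(1)[0]  # most frequent recent tool
    return count >= threshold

detect_repeat_loop(
    ["fetch_page", "web_search", "web_search", "summarize", "web_search"]
)
# → True: web_search appears 3 times in the last 5 calls
```

In practice you would also gate this on "no completion event yet", since a healthy agent can legitimately call the same tool several times before finishing.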
Signature 2 — High iteration count, no completion
Tool calls: 18
Normal completion at: 4–6 tool calls
Status: no completion event
→ Loop detected
Signature 3 — Token budget anomaly
Tokens consumed: 42,000
Normal consumption: 3,000–6,000
Status: no completion event
→ Budget anomaly detected
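Signatures 2 and 3 have the same shape: a metric far above its learned baseline with no completion event. A toy version of that check, using a simple multiple-of-the-mean rule in place of whatever model NotiLens actually trains:

```python
def is_anomalous(value, history, factor=3.0):
    """Flag `value` when it exceeds `factor` times the historical mean."""
    if not history:
        return False  # no baseline learned yet
    baseline = sum(history) / len(history)
    return value > factor * baseline

past_iterations = [4, 5, 6, 4, 5]    # normal completions: 4-6 tool calls
past_tokens = [3000, 4500, 6000]     # normal consumption: 3,000-6,000 tokens

is_anomalous(18, past_iterations)    # → True  (signature 2: 18 vs ~4.8 average)
is_anomalous(42_000, past_tokens)    # → True  (signature 3: 42,000 vs 4,500 average)
is_anomalous(5, past_iterations)     # → False (within normal range)
```

A real detector would use something more robust than the mean, but the principle is the same: the alert condition is "metric is anomalous AND no completion event".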
Setting up loop detection with NotiLens
Install:
pip install notilens
npm install @notilens/notilens
The pattern is simple: call run.loop() on every agent iteration. NotiLens's ML layer learns how many iterations your agent normally runs and alerts when a run climbs anomalously high with no run.complete() arriving. No threshold to configure. No detection logic to write.
Python
import notilens

nl = notilens.init(name="ai-agent")
run = nl.task("research")
run.start()

for iteration in range(max_iterations):
    tool_name, tool_input = agent.decide_next_action()
    run.loop(f"[{iteration + 1}] Tool: {tool_name}")  # every iteration
    run.metric("tool_calls", 1)  # accumulates
    result = execute_tool(tool_name, tool_input)
    if agent.is_done(result):
        break

run.metric("tokens", total_tokens_used)
run.complete("Task completed")
Node.js
import { NotiLens } from '@notilens/notilens';

const nl = NotiLens.init('ai-agent');
const run = nl.task('research');
run.start();

for (let i = 0; i < maxIterations; i++) {
  const { toolName, toolInput } = agent.decideNextAction();
  run.loop(`[${i + 1}] Tool: ${toolName}`); // every iteration
  run.metric('tool_calls', 1); // accumulates
  const result = await executeTool(toolName, toolInput);
  if (agent.isDone(result)) break;
}

run.metric('tokens', totalTokensUsed);
run.complete('Task completed');
Token and cost tracking
Track token consumption per run — NotiLens uses this alongside iteration count to detect cost anomalies:
run.metric("tokens", tokens_used_this_call)
run.metric("cost_usd", round(tokens_used_this_call * 0.0000002, 6))
If a run consumes 5x more tokens than your baseline without completing, NotiLens flags it — even without a manually configured budget limit.
Agent stall detection
For agents pausing on slow tools or external APIs:
run.wait("Awaiting API response")
result = call_slow_external_api()
run.progress("API response received")
run.wait() is non-terminal — the run continues. NotiLens learns how long your agent normally spends between events and fires if the gap becomes anomalous.
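The same baseline idea applies to stalls: record a timestamp per event, learn the typical gap between events, and flag a gap far outside it. A hand-rolled sketch of that check (NotiLens does this server-side; the timestamps here are illustrative seconds):

```python
def stalled(event_times, now, factor=5.0):
    """Flag a stall when time since the last event far exceeds the typical gap."""
    if len(event_times) < 2:
        return False  # not enough history to know what "normal" is
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    typical_gap = sum(gaps) / len(gaps)
    return (now - event_times[-1]) > factor * typical_gap

# Events arrived roughly every 2 seconds, then nothing for 30 seconds:
stalled([0.0, 2.0, 4.1, 6.0], now=36.0)  # → True
stalled([0.0, 2.0, 4.1, 6.0], now=7.5)   # → False
```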
What the alert looks like
✅ task.started Research agent — task started
🔄 task.loop [1] Tool: web_search
🔄 task.loop [2] Tool: web_search
🔄 task.loop [3] Tool: web_search
🔄 task.loop [4] Tool: web_search
🔄 task.loop [5] Tool: web_search
🔄 task.loop [6] Tool: web_search
⚠️ Anomaly detected Iteration count 6 — exceeds learned baseline (avg: 3.2)
No task.complete received
tool_calls: 6 | tokens: 8,420 | cost: $0.0017
→ Push notification fired
One line per iteration. NotiLens detected the pattern automatically.
Full agent monitoring checklist
- run.loop() called on every agent iteration
- run.start() fires when task begins
- run.complete() fires on successful completion
- run.fail() fires on any unhandled exception
- run.error() fires on non-fatal tool errors
- run.timeout() fires if agent exceeds SLA window
- run.wait() fires when agent pauses on slow external call
- run.metric("tool_calls", 1) accumulates per iteration
- run.metric("tokens", ...) tracks token usage
- run.metric("cost_usd", ...) tracks cost per run
The short answer
max_iterations stops the loop. NotiLens tells you it happened at call 6, not call 47 — while there's still something to investigate.
Works with LangChain, CrewAI, AutoGen, LlamaIndex, Pydantic AI, or any custom agent loop. No framework dependency.
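Framework-agnostic integration usually comes down to one shape: a loop that fires a hook per iteration and a hook on completion. Here is a generic sketch with injected callables standing in for run.loop() and run.complete(); the stub agent and hook names are hypothetical, not part of any framework's API.

```python
def monitored_loop(step, is_done, on_loop, on_complete, max_iterations=25):
    """Run any agent step function, firing hooks per iteration and on completion.

    `step` produces a result; `is_done(result)` decides completion; `on_loop`
    and `on_complete` are stand-ins for run.loop() / run.complete().
    """
    for i in range(max_iterations):
        result = step()
        on_loop(i + 1, result)          # one loop event per iteration
        if is_done(result):
            on_complete(result)         # completion event ends the run
            return result
    return None  # budget exhausted with no completion: the case worth alerting on

# Stub agent: each step returns the iteration number, "done" at step 3.
events = []
result = monitored_loop(
    step=lambda: len(events) + 1,
    is_done=lambda r: r >= 3,
    on_loop=lambda i, r: events.append(("loop", i)),
    on_complete=lambda r: events.append(("complete", r)),
)
```

Wiring any real framework in means calling the two hooks from wherever that framework exposes "a tool call happened" and "the task finished".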