Mukunda Rao Katta

Posted on May 25

How to Know When Your Agent Is Done (Before It Runs Forever)

#hermeschallenge #ai #python #agents

Agent loops need exit conditions. Every agent author knows this. Most agents exit on MaxIters. Set it to 10, run the loop, stop after 10 turns. Simple.

The problem is that MaxIters alone is not sufficient. A 10-iteration run can cost $50 if each iteration involves expensive tool calls and large contexts. A 100-iteration run can cost $0.02 if each turn is a short exchange. And a run that makes progress on turn 1 and stalls on turn 2 will burn every remaining iteration unless you detect the stall.
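
To make the mismatch between iteration count and cost concrete, here is a small worked calculation. The per-million-token prices are the ones used in cost_from_usage later in this post; the two workloads are made-up illustrations, and the exact dollar figures will vary with model and pricing.

```python
# Per-million-token prices, copied from the post's cost_from_usage helper.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def run_cost(iters: int, input_tokens: int, output_tokens: int) -> float:
    """Cost of a run where every turn uses the same token counts."""
    per_turn = (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )
    return iters * per_turn


# Hypothetical workloads: 10 heavy turns versus 100 tiny ones.
heavy = run_cost(iters=10, input_tokens=150_000, output_tokens=1_000)
light = run_cost(iters=100, input_tokens=50, output_tokens=10)

print(f"10 heavy turns:  ${heavy:.2f}")
print(f"100 light turns: ${light:.2f}")
```

The run with ten times fewer iterations costs over a hundred times more, which is why an iteration cap says very little about spend.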

This post covers five stop conditions, why you need all of them, and how to compose them in a single evaluator.

Hook

Consider an agent tasked with debugging a build failure. It calls a log reader, gets a wall of text, calls a search tool, gets nothing useful, calls the log reader again with slightly different args, gets the same wall of text, calls the search tool again with slightly different args. This continues. Each call is different enough that MaxIters is the only thing that stops it. At 20 iterations and $12 of API calls, the user still has no answer.

What you actually wanted:

Stop after 5 iterations of search returning empty results (NoProgress)
Stop before the cost hits $2 (MaxUsd)
Stop after 120 seconds of real time (MaxSeconds)
Stop if context grows past 80,000 tokens (MaxTokens)

MaxIters is the last resort. The other four are the real safeguards.

Main Code

import time
import anthropic
from llm_stop_conditions import (
    StopEvaluator,
    MaxIters,
    MaxUsd,
    MaxTokens,
    MaxSeconds,
    NoProgress,
    StopResult,
)
from agenttrace import Tracer

client = anthropic.Anthropic()
tracer = Tracer()


def build_evaluator() -> StopEvaluator:
    """
    Compose five stop conditions.

    Any one of them triggers a stop. The evaluator returns
    a StopResult that tells you which condition fired and why.
    """
    return StopEvaluator(
        conditions=[
            MaxIters(limit=25),
            MaxUsd(limit=3.00),
            MaxTokens(limit=80_000),
            MaxSeconds(limit=120.0),
            NoProgress(
                window=4,           # look at last 4 tool results
                similarity=0.85,    # if results are 85%+ similar to prior, count as no progress
                max_stale=3,        # stop after 3 consecutive stale results
            ),
        ]
    )


def cost_from_usage(usage: anthropic.types.Usage) -> float:
    """Rough Claude Sonnet 4.6 cost estimate."""
    input_cost = (usage.input_tokens / 1_000_000) * 3.00
    output_cost = (usage.output_tokens / 1_000_000) * 15.00
    cache_write_cost = ((usage.cache_creation_input_tokens or 0) / 1_000_000) * 3.75
    cache_read_cost = ((usage.cache_read_input_tokens or 0) / 1_000_000) * 0.30
    return input_cost + output_cost + cache_write_cost + cache_read_cost


def agent_loop(task: str) -> dict:
    messages = [{"role": "user", "content": task}]
    evaluator = build_evaluator()
    tracer_run = tracer.start_run()

    total_cost = 0.0
    total_tokens = 0
    start_time = time.monotonic()

    # The loop context passed to the evaluator on each turn
    ctx = {
        "iters": 0,
        "cost_usd": 0.0,
        "tokens": 0,
        "elapsed_seconds": 0.0,
        "last_tool_results": [],
    }

    while True:
        # Check stop conditions before calling the LLM
        stop: StopResult = evaluator.check(ctx)
        if stop.should_stop:
            tracer.end_run(tracer_run, note=f"stopped: {stop.reason}")
            return {
                "status": "stopped",
                "reason": stop.reason,
                "condition": stop.condition_name,
                "cost_usd": total_cost,
                "tokens": total_tokens,
                "iters": ctx["iters"],
            }

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=[
                {
                    "name": "search_logs",
                    "description": "Search error logs.",
                    "input_schema": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                }
            ],
            messages=messages,
        )

        # Update metrics
        usage = response.usage
        turn_cost = cost_from_usage(usage)
        total_cost += turn_cost
        total_tokens += usage.input_tokens + usage.output_tokens
        ctx["iters"] += 1
        ctx["cost_usd"] = total_cost
        ctx["tokens"] = total_tokens
        ctx["elapsed_seconds"] = time.monotonic() - start_time

        if response.stop_reason == "end_turn":
            text = next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )
            tracer.end_run(tracer_run, note="completed")
            return {
                "status": "complete",
                "answer": text,
                "cost_usd": total_cost,
                "tokens": total_tokens,
                "iters": ctx["iters"],
            }

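        # Handle tool calls
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            # Stubbed tool: always reports no matches, the case NoProgress should catch
            result_text = f"Search results for '{block.input.get('query', '')}': no matches found."
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result_text,
            })
            ctx["last_tool_results"].append(result_text)

        if not tool_results:
            # The model stopped without end_turn and without a tool call
            # (for example stop_reason == "max_tokens"). Sending an empty
            # user turn is likely to be rejected by the API, so stop here.
            tracer.end_run(tracer_run, note=f"stopped: {response.stop_reason}")
            return {
                "status": "stopped",
                "reason": f"no tool call, stop_reason={response.stop_reason}",
                "condition": "stop_reason",
                "cost_usd": total_cost,
                "tokens": total_tokens,
                "iters": ctx["iters"],
            }

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})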


def main():
    result = agent_loop(
        "Find the root cause of the build failure that started happening "
        "after 2am UTC. Search the logs and explain what went wrong."
    )
    print(f"Status: {result['status']}")
    if result["status"] == "stopped":
        print(f"Stopped by: {result['condition']} ({result['reason']})")
    else:
        print(f"Answer: {result.get('answer', '')}")
    print(f"Cost: ${result['cost_usd']:.4f}")
    print(f"Tokens: {result['tokens']}")
    print(f"Iterations: {result['iters']}")


if __name__ == "__main__":
    main()

What It Does NOT Do

NoProgress here uses similarity between tool result strings. It is a heuristic. It catches the case where search returns the same empty result or the same error message repeatedly. It does not detect semantic repetition. If the agent rephrases the same query into ten different forms and gets ten slightly different empty results, NoProgress might not fire. MaxIters and MaxUsd are the backstop for that case.
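
To show what a string-similarity heuristic of this kind looks like, here is a minimal sketch built on the standard library's difflib. It is not the implementation inside llm-stop-conditions, whose similarity metric this post does not specify; is_stale and count_trailing_stale are hypothetical names, and only the window and similarity parameters mirror the configuration above.

```python
from difflib import SequenceMatcher


def is_stale(results: list[str], window: int = 4, similarity: float = 0.85) -> bool:
    """True if the newest result is near-identical to any of the `window` results before it."""
    if len(results) < 2:
        return False
    newest = results[-1]
    earlier = results[-(window + 1):-1]
    return any(
        SequenceMatcher(None, newest, old).ratio() >= similarity
        for old in earlier
    )


def count_trailing_stale(results: list[str], **kwargs) -> int:
    """Number of consecutive stale results at the end of the history."""
    stale = 0
    for end in range(len(results), 1, -1):
        if not is_stale(results[:end], **kwargs):
            break
        stale += 1
    return stale


repeated = ["no matches found."] * 4
varied = [
    "ERROR: disk full on node-7",
    "Traceback: KeyError 'user_id' in handler",
    "42 tests passed, 0 failed",
]

print(count_trailing_stale(repeated))  # identical results pile up as stale
print(count_trailing_stale(varied))    # unrelated results do not
```

With max_stale=3, the repeated history would trigger a stop and the varied one would not. Results that differ only in the echoed query fall somewhere in between, which is exactly the tuning problem described above.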

This evaluator is not async-aware. The MaxSeconds check runs at the start of each loop iteration, not in a background timer. If one LLM call takes 90 seconds and your limit is 60 seconds, the check fires at the start of the next turn, not at the 60-second mark. For hard real-time cutoffs, use agent-deadline with a background cancellation token.
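
For comparison, a hard cutoff can be sketched with the standard library's asyncio.wait_for, which cancels the awaited call once the timeout elapses. This is not the agent-deadline API, which this post does not show; slow_llm_call is a stand-in for a real model call, and the approach assumes an awaitable call (for example the SDK's async client) rather than the synchronous client used above.

```python
import asyncio


async def slow_llm_call() -> str:
    """Stand-in for a model call that takes longer than the deadline."""
    await asyncio.sleep(0.5)
    return "answer"


async def call_with_deadline(deadline_seconds: float) -> dict:
    try:
        # wait_for cancels the inner coroutine once the timeout elapses.
        text = await asyncio.wait_for(slow_llm_call(), timeout=deadline_seconds)
        return {"status": "complete", "answer": text}
    except asyncio.TimeoutError:
        return {"status": "stopped", "reason": "deadline exceeded"}


result = asyncio.run(call_with_deadline(0.05))
print(result)
```

Here the cutoff lands at roughly the deadline itself instead of at the start of the next turn.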

There is no per-tool budget here. MaxUsd applies to the whole run. If you want separate budgets for the LLM calls versus external API calls, track them separately and add a custom condition to the evaluator.
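
Tracking the two kinds of spend separately could look like the sketch below. SplitBudget is a hypothetical helper, not part of llm-stop-conditions, and the post does not show the library's custom-condition interface, so wiring the result into StopEvaluator is left open.

```python
from dataclasses import dataclass, field


@dataclass
class SplitBudget:
    """Hypothetical tracker: one USD limit per spend category."""
    limits: dict[str, float]
    spent: dict[str, float] = field(default_factory=dict)

    def record(self, category: str, usd: float) -> None:
        self.spent[category] = self.spent.get(category, 0.0) + usd

    def exceeded(self) -> list[str]:
        """Categories whose spend has reached their limit."""
        return [
            name for name, limit in self.limits.items()
            if self.spent.get(name, 0.0) >= limit
        ]


budget = SplitBudget(limits={"llm": 2.00, "external_api": 0.50})
budget.record("llm", 0.40)
budget.record("external_api", 0.30)
budget.record("external_api", 0.25)
print(budget.exceeded())  # only the external API budget is used up
```

A custom condition would then stop the run whenever exceeded() returns a non-empty list.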

Design Reasoning

Five conditions is not arbitrary. Each covers a different failure mode.

MaxIters is a hard structural limit. It catches infinite loops the other conditions miss. Never remove it.

MaxUsd is the billing limit. It fires before a run that is clearly stuck can rack up charges. Set it to 2x your expected per-run cost as a starting point.

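MaxTokens is the context limit. It fires before the LLM starts failing with context-exceeded errors. Set it to 80% of the model's context window. Note that agent_loop, as written, feeds it a running total of input and output tokens across all turns, which overstates the current context size, so the 80,000 limit there behaves more like a total-token budget.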
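
Which token count feeds this guard matters: a running total across turns and the size of the latest request are different numbers. A minimal sketch with made-up usage figures follows; the 200,000-token window is an assumption, so check your model's documentation.

```python
# Hypothetical per-turn usage: each turn resends the whole history,
# so input_tokens grows as the conversation gets longer.
turns = [
    {"input_tokens": 1_000, "output_tokens": 200},
    {"input_tokens": 1_500, "output_tokens": 300},
    {"input_tokens": 2_200, "output_tokens": 250},
]

# Running total: counts the resent history again on every turn.
cumulative = sum(t["input_tokens"] + t["output_tokens"] for t in turns)

# Approximate size of the context after the latest turn.
latest_context = turns[-1]["input_tokens"] + turns[-1]["output_tokens"]

CONTEXT_WINDOW = 200_000             # assumed window size, not from the post
limit = CONTEXT_WINDOW * 80 // 100   # the 80% rule of thumb from the text

print(f"cumulative={cumulative} latest_context={latest_context} limit={limit}")
```

The running total is the right input for a spend-style cap, while the latest-turn figure is the one to compare against the context window.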

MaxSeconds is the user-facing deadline. A user waiting for an answer does not want to wait 10 minutes. Set it to a number you are willing to put in your SLA.

NoProgress is the progress guard. It catches the case where the agent is technically running but not getting anywhere. The other conditions are about resource limits. This one is about whether the tool results are still changing, which is a proxy for output quality rather than a semantic measure of it.

Composing them with StopEvaluator means any single condition firing stops the run. You get the most conservative behavior by default.
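
The any-of composition can be sketched in plain Python. This is an illustration of the idea, not the source of StopEvaluator: Verdict, Condition, max_iters, max_usd, and check are hypothetical names, and only the ctx keys and the result fields (should_stop, condition_name, reason) follow the example earlier in this post.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    should_stop: bool
    condition_name: Optional[str] = None
    reason: Optional[str] = None


# A condition here is a name plus a function that returns a reason string
# when it fires and None otherwise. This shape is an assumption for the sketch.
Condition = tuple[str, Callable[[dict], Optional[str]]]


def max_iters(limit: int) -> Condition:
    def test(ctx: dict) -> Optional[str]:
        return f"iters {ctx['iters']} >= {limit}" if ctx["iters"] >= limit else None
    return ("MaxIters", test)


def max_usd(limit: float) -> Condition:
    def test(ctx: dict) -> Optional[str]:
        return f"cost ${ctx['cost_usd']:.2f} >= ${limit:.2f}" if ctx["cost_usd"] >= limit else None
    return ("MaxUsd", test)


def check(conditions: list[Condition], ctx: dict) -> Verdict:
    """Any-of semantics: the first condition that fires stops the run."""
    for name, test in conditions:
        reason = test(ctx)
        if reason is not None:
            return Verdict(True, name, reason)
    return Verdict(False)


conditions = [max_iters(10), max_usd(1.00)]
verdict = check(conditions, {"iters": 3, "cost_usd": 1.20})
print(verdict.condition_name, "-", verdict.reason)
```

In this sketch the order of the list decides which condition gets reported when several would fire on the same turn.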

When This Applies / Does Not Apply

Use all five conditions in any agent that calls external tools, handles user requests, or runs unsupervised. These are the conditions that prevent a bad prompt or an unusual dataset from turning into a $50 bill.

For batch agents processing many small tasks, MaxUsd and MaxTokens should be set per task, not per batch. A single expensive task in a batch should not consume the whole batch budget.
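
A per-task budget amounts to resetting the spend counter for every task. The sketch below simulates that with made-up per-turn costs; run_batch is a hypothetical helper, and the check mirrors the loop in this post by running at the start of each turn, so a task can overshoot its limit by one turn.

```python
def run_batch(task_costs: list[list[float]], per_task_limit_usd: float) -> list[tuple[str, float]]:
    """Each task gets a fresh budget; hitting the limit stops only that task."""
    outcomes = []
    for costs in task_costs:      # costs: simulated USD per turn for one task
        spent = 0.0               # reset for every task, never shared
        status = "complete"
        for turn_cost in costs:
            if spent >= per_task_limit_usd:
                status = "stopped"
                break
            spent += turn_cost
        outcomes.append((status, round(spent, 2)))
    return outcomes


outcomes = run_batch(
    [[0.1, 0.1], [0.6, 0.6, 0.6], [0.2]],
    per_task_limit_usd=1.0,
)
for status, spent in outcomes:
    print(f"{status}: ${spent:.2f}")
```

The expensive second task is cut off, and the third task still runs with its full budget.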

Skip NoProgress, or enable it only after tuning, for agents whose valid behavior includes searching the same source multiple times with different queries. A code analysis agent that runs the same grep with different patterns is not stuck. If you do enable it, tune the similarity threshold and window size against real traces from your use case first.

For short single-turn queries, skip the evaluator entirely. It adds overhead that is not worth it for a single client.messages.create call.

Quick-Start Snippet

pip install llm-stop-conditions agenttrace

Minimal usage:

from llm_stop_conditions import StopEvaluator, MaxIters, MaxUsd

evaluator = StopEvaluator(conditions=[MaxIters(10), MaxUsd(1.00)])
ctx = {"iters": 0, "cost_usd": 0.0}

for _ in range(100):
    stop = evaluator.check(ctx)
    if stop.should_stop:
        print(f"Stopped: {stop.reason}")
        break
    # ... LLM call ...
    ctx["iters"] += 1
    ctx["cost_usd"] += 0.05

Siblings Table

Library	Stop dimension	GitHub
llm-stop-conditions	All five conditions in one evaluator	MukundaKatta/llm-stop-conditions
agent-deadline	Hard time deadline with cancellation	MukundaKatta/agent-deadline
tool-loop-guard	Per-tool call frequency in a window	MukundaKatta/tool-loop-guard
llm-cost-cap	Pre-flight USD cost gate per call	MukundaKatta/llm-cost-cap
token-budget-py	Token budget pool for concurrent runs	MukundaKatta/token-budget-py
llm-budget-window	Time-windowed token and USD budget	MukundaKatta/llm-budget-window

What's Next

The gap this evaluator does not cover is partial success. Right now a stopped run returns the stop reason but no partial answer. For many tasks, a partial answer is better than nothing. The next step is to checkpoint the best answer seen so far on each turn, and return it alongside the stop reason when a condition fires.
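
A checkpointing loop of that kind might look like the sketch below. It simulates model output with a list of strings rather than calling an API; run_with_checkpoint and the partial_answer key are hypothetical, and "latest non-empty text wins" is a deliberately naive stand-in for a real scoring policy.

```python
def run_with_checkpoint(turns: list[str], max_iters: int) -> dict:
    """`turns` simulates the text produced on each turn ('' means tool call only)."""
    best_so_far = ""
    for i, text in enumerate(turns):
        if i >= max_iters:
            # A stop condition fired: return what we have alongside the reason.
            return {
                "status": "stopped",
                "reason": "MaxIters",
                "partial_answer": best_so_far,
            }
        if text:
            best_so_far = text  # naive policy: the latest non-empty text wins
    return {"status": "complete", "answer": best_so_far}


result = run_with_checkpoint(
    ["", "Likely a cache miss.", "", "Root cause: expired token."],
    max_iters=3,
)
print(result)
```

The stopped run still hands back the hypothesis from turn two instead of nothing.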

The second gap is adaptive limits. A run that is making clear progress should be allowed to continue past MaxIters. A run that costs $0.001 per iteration is not a billing risk even at 100 iterations. Dynamic limits based on rate-of-progress are possible but require more state in ctx. That is a good use case for agent-state-checkpoint to track rolling averages.
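
One possible shape for a cost-aware iteration limit is sketched below. It is not the agent-state-checkpoint API, which this post does not show; RollingAverage and effective_iter_limit are hypothetical names, and the policy (extend the limit while the projected spend fits the remaining budget, up to a hard cap) is only one of several reasonable choices.

```python
from collections import deque


class RollingAverage:
    """Mean of the most recent `window` per-turn costs."""

    def __init__(self, window: int = 5) -> None:
        self._values: deque = deque(maxlen=window)

    def add(self, value: float) -> None:
        self._values.append(value)

    def mean(self) -> float:
        return sum(self._values) / len(self._values) if self._values else 0.0


def effective_iter_limit(
    base: int, hard_cap: int, iters_done: int,
    avg_turn_usd: float, budget_left_usd: float,
) -> int:
    """Extend the base limit while projected spend fits the budget, never past hard_cap."""
    if avg_turn_usd <= 0:
        return base
    affordable_more = int(budget_left_usd / avg_turn_usd)
    return min(hard_cap, max(base, iters_done + affordable_more))


cheap = RollingAverage()
for cost in [0.001, 0.001, 0.001]:
    cheap.add(cost)

# Cheap turns: the limit stretches to the hard cap.
cheap_limit = effective_iter_limit(25, 100, 10, cheap.mean(), budget_left_usd=2.0)
# Expensive turns: the base limit stays in force.
costly_limit = effective_iter_limit(25, 100, 10, 0.5, budget_left_usd=2.0)
print(cheap_limit, costly_limit)
```

The hard cap keeps the structural guarantee of MaxIters even when the adaptive part is generous.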

All repos are at MukundaKatta on GitHub. Issues and PRs welcome.
