How to write an AI agent that knows when to stop and ask

#programming #ai #llm #python

The most valuable code in my agent stack is the code that does nothing.

I run a pipeline where agents research, draft, and queue content for publishing, mostly unattended. The thing that has saved me the most money and embarrassment is not a clever system prompt. It's a short, hard-coded list of actions the agent cannot take without a human click — and a set of rules for when it has to stop mid-task and ask a question instead of guessing.

Almost no agent tutorial covers this. They all show the happy path: model plans, model calls tools, task completes, confetti. But the defining property of an agent — the loop keeps going without you — is exactly what makes the stop condition the hardest design decision in the whole system. Get it wrong in one direction and your agent confidently does something irreversible and wrong. Get it wrong in the other direction and you've built a very expensive confirmation dialog.

Here's the case for building the stop, and then three mechanisms you can implement this afternoon.

Agents guess by default, and guessing measurably fails

Two results are worth having in your head.

τ-bench (Sierra Research, June 2024) put function-calling agents in simulated conversations with users, with real domain policies to follow (airline rebooking, retail returns). The headline: even the best agents at the time succeeded on fewer than 50% of tasks — and reliability was worse than that number suggests. On the pass^8 metric (does the agent succeed all 8 times on the same task?), performance dropped below 25% in the retail domain (paper, benchmark repo). Same task, same agent, different outcome depending on the run. A big share of the failures are exactly what you'd predict: the agent proceeds on an assumption instead of resolving what the user actually wants or what the policy actually allows.
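
To make the gap between a headline success rate and pass^k concrete, here is a small sketch. It assumes the estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks, which is how I read the τ-bench definition. The per-task counts are made up for illustration and are not benchmark data.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from `trials` all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    # math.comb returns 0 when k > successes, so any failure zeroes pass^trials.
    return comb(successes, k) / comb(trials, k)

def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks given (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)

# Hypothetical (successes, trials) per task, 8 runs each.
results = [(8, 8), (6, 8), (4, 8), (7, 8)]

print(mean_pass_hat_k(results, 1))  # 0.78125: looks healthy on a single run
print(mean_pass_hat_k(results, 8))  # 0.25: only one task passed all 8 runs
```

A single failed run in eight is enough to drop a task out of pass^8, which is why the reliability number falls so much faster than the average.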

ClarifyGPT (October 2023, later published at FSE 2024 — paper page) tested the inverse: what happens if you force the model to detect ambiguity and ask before generating? On MBPP-sanitized, having GPT-4 ask targeted clarifying questions on ambiguous requirements raised Pass@1 from 70.96% to 80.80% — roughly ten points from asking instead of guessing. The paper's motivating observation is the important part: left alone, LLMs will generate a complete, confident answer to an ambiguous request rather than ask about it.

That's the core problem. The model is trained to be maximally helpful this turn. Asking a question feels, to the model, like failure — so unless you build the ask-path yourself, it doesn't exist.

The vendors shipping these models say the same thing. Anthropic's "Building Effective Agents" (December 2024, good summary in Simon Willison's write-up) recommends checkpoints where agents pause for human review, specifically before irreversible actions. OpenAI's "A Practical Guide to Building Agents" (April 2025, coverage) names two triggers that should escalate to a human: exceeding failure thresholds and high-risk actions — sensitive, irreversible, or high-stakes operations. Both guides are telling you the same thing: don't prompt for caution, architect for it.

The three triggers

Everything I gate reduces to three questions, checked in this order:

Is the action irreversible? Can it be undone with another tool call? Sending an email, charging a card, publishing a post, deleting a record — no. Writing a draft, creating a branch, staging a file — yes. Irreversible → confirm, always, no matter how confident the agent is. Confidence is not the variable that matters here; blast radius is.
Is the task ambiguous? Do materially different actions follow from reasonable readings of the request? If two competent interpretations lead to two different tool calls, the agent should ask one question, not pick one silently.
Is the failure budget spent? Has the agent already failed at this step N times? Retry loops without a ceiling are how you get an agent that burns $40 of tokens elaborately failing. After N failures, stop retrying and escalate with a summary.

Notice what's not on the list: "is the model unsure?" Self-reported confidence is the one signal I've stopped trusting — the model that just guessed wrong on τ-bench was not hedging while it did it. All three triggers above are computable outside the model.

Mechanism 1: the confirmation gate

The pattern: every tool call passes through a policy function before it executes. The policy is boring, deterministic Python — deliberately not an LLM.

from dataclasses import dataclass

# Tier the tools by blast radius. Default-deny anything unlisted.
ALLOW   = {"read_file", "search_docs", "write_draft", "run_tests"}
CONFIRM = {"send_email", "publish_post", "delete_record", "issue_refund"}

@dataclass
class FailureBudget:
    max_failures: int = 3
    failures: int = 0

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

    def exhausted(self) -> bool:
        return self.failures >= self.max_failures

def gate(tool_name: str, budget: FailureBudget) -> str:
    """Return 'allow', 'confirm', or 'deny'."""
    if budget.exhausted():
        return "confirm"        # stop retrying, escalate with a summary
    if tool_name in CONFIRM:
        return "confirm"
    if tool_name in ALLOW:
        return "allow"
    return "deny"               # unknown tool = someone forgot to classify it

Wired into a standard Anthropic tool-use loop:

import anthropic

client = anthropic.Anthropic()

def tool_result_block(tool_use_id: str, content) -> dict:
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": str(content)}

# TOOLS, history, budget, ask_human, and execute are defined elsewhere.
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    tools=TOOLS,
    messages=history,
)

# The assistant turn (with its tool_use blocks) must be in history BEFORE
# the tool results, or the API rejects the conversation.
history.append({"role": "assistant", "content": resp.content})

results = []    # one tool_result block per tool_use block in this turn
for block in resp.content:
    if block.type != "tool_use":
        continue

    decision = gate(block.name, budget)

    if decision == "deny":
        results.append(tool_result_block(
            block.id,
            "Denied by policy: this tool is not classified in the gate. "
            "Do not call it again; propose an alternative approach."))
        continue

    if decision == "confirm":
        verdict = ask_human(block.name, block.input)   # CLI prompt, Slack, queue
        if not verdict.approved:
            results.append(tool_result_block(
                block.id,
                f"Operator declined: {verdict.reason}. "
                f"Do not retry this action; propose an alternative."))
            continue

    result = execute(block.name, block.input)
    budget.record(result.ok)
    results.append(tool_result_block(block.id, result))

# Send every result for this turn back in ONE user message; if the model
# made parallel tool calls, their results belong together.
if results:
    history.append({"role": "user", "content": results})

Two details that matter more than they look:

Default-deny for unknown tools. The gate's failure mode should be "annoyingly asks about a safe tool," never "silently allows a dangerous one you added last week and forgot to classify."
The denial goes back into the conversation as a tool result. The agent doesn't crash on a "no" — it learns the operator's reason and re-plans. A denial is information.

ask_human can be as primitive as input() in a CLI tool or as real as a Slack message with two buttons and a parked task. The frameworks have first-class versions of this now — LangGraph ships an interrupt() primitive that pauses the graph, persists state, and resumes on human input — but the pattern is 40 lines without a framework, and the 40-line version is easier to audit.
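
For the primitive end of that range, here is a minimal CLI sketch of ask_human. The Verdict shape (approved, reason) is inferred from how the loop above uses it; the prompt_fn parameter is my own addition so the function can be exercised without a terminal, and the wording of the prompts is arbitrary.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    approved: bool
    reason: str = ""

def ask_human(tool_name: str, tool_input: dict,
              prompt_fn: Callable[[str], str] = input) -> Verdict:
    """Block until an operator answers. Anything but an explicit yes is a no."""
    print(f"Agent wants to call {tool_name} with:")
    print(json.dumps(tool_input, indent=2, default=str))
    answer = prompt_fn("Approve? [y/N] ").strip().lower()
    if answer in {"y", "yes"}:
        return Verdict(approved=True)
    # The reason is fed back to the agent, so ask for one on every decline.
    reason = prompt_fn("Reason (sent back to the agent): ").strip()
    return Verdict(approved=False, reason=reason or "no reason given")
```

Defaulting to "no" on an empty or unrecognized answer keeps the gate's failure mode on the safe side: a stray Enter key declines rather than approves.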

Mechanism 2: an ambiguity check you can actually implement

"Detect ambiguity" sounds like it needs a research team. ClarifyGPT's trick is simpler and steals well: sample the model several times and check whether the answers agree. In their setup, they sample multiple code solutions and test whether the solutions behave differently on the same inputs — behavioral divergence means the spec is ambiguous (paper).

For a general agent, the cheap adaptation is to sample the plan, not the code:

import re

PLAN_PROMPT = (
    "Task: {task}\n"
    "In ONE short line, state the single concrete final action you would take. "
    "No hedging, no options — commit to one action."
)

FILLER = {"a", "an", "the", "i", "would", "will", "then"}

def canonicalize(plan: str) -> str:
    # lowercase, strip punctuation, drop filler words, collapse whitespace
    words = re.sub(r"[^a-z0-9 ]", " ", plan.lower()).split()
    return " ".join(w for w in words if w not in FILLER)

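def is_ambiguous(task: str, k: int = 4) -> tuple[bool, list[str]]:
    # Reuses the `client` created for the tool-use loop above.
    # Returns (diverged?, raw sampled plans) so the caller can build a question.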
    plans = []
    for _ in range(k):
        r = client.messages.create(
            model="claude-sonnet-4-5",      # or a cheaper model; this is a probe
            max_tokens=60,
            temperature=1.0,                 # you WANT the variance here
            messages=[{"role": "user",
                       "content": PLAN_PROMPT.format(task=task)}],
        )
        plans.append(r.content[0].text.strip())

    distinct = {canonicalize(p) for p in plans}
    return len(distinct) > 1, plans

If four high-temperature samples all commit to the same action, the request is specific enough — proceed. If they diverge, you've got concrete evidence of ambiguity and you're holding the raw material for a great clarifying question, because you know exactly which readings are in conflict:

CLARIFY_PROMPT = (
    "The user asked: {task}\n"
    "Reasonable readings led to different actions:\n{plans}\n"
    "Ask the user ONE question whose answer decides between these readings. "
    "Offer the most likely reading as a default they can accept with 'yes'."
)

Cost: k short probe calls with tiny max_tokens, on a cheap model if you like — a rounding error next to one wrong irreversible action. I run this only on tasks that will reach a CONFIRM-tier tool; read-only work doesn't need it.
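
One way to turn the sampled plans into the {plans} slot of CLARIFY_PROMPT is sketched below. The helper names (distinct_readings, build_clarify_prompt) are hypothetical, and the demo template and normalizer are stand-ins so the snippet runs on its own; in practice you would pass in the canonicalize function and CLARIFY_PROMPT from above.

```python
from typing import Callable

def distinct_readings(plans: list[str],
                      canonicalize: Callable[[str], str]) -> list[str]:
    """Keep the first wording seen for each canonical form, in sample order."""
    seen: dict[str, str] = {}
    for plan in plans:
        seen.setdefault(canonicalize(plan), plan)
    return list(seen.values())

def build_clarify_prompt(task: str, plans: list[str], template: str,
                         canonicalize: Callable[[str], str]) -> str:
    """Fill the template with one bullet per distinct reading."""
    readings = distinct_readings(plans, canonicalize)
    bullets = "\n".join(f"- {reading}" for reading in readings)
    return template.format(task=task, plans=bullets)

# Stand-ins for illustration only.
def demo_canonicalize(plan: str) -> str:
    return " ".join(plan.lower().split())

demo_template = "The user asked: {task}\nReadings:\n{plans}\nAsk ONE question."
demo_plans = ["Delete posts older than 2023",
              "delete posts  older than 2023",
              "Archive posts older than 2023"]

print(build_clarify_prompt("Clean up the old posts", demo_plans,
                           demo_template, demo_canonicalize))
```

Deduplicating before prompting keeps the question short: the model sees each competing reading once, in the wording a sample actually produced.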

Mechanism 3: escalate well, not just often

The failure-budget code was in the gate above; the part people skip is what the escalation says. An agent that stops and dumps 200 lines of log on you has technically asked — and functionally taught you to ignore it. Every escalation from my agents has to fit this shape:

BLOCKED: [one line — what it was trying to do]
Tried: [the 2-3 approaches, one line each, with the error]
Believes: [its best guess at the cause]
Question: [ONE decidable question]
Default: [what it will do if you just reply "go"]

The Default line is the trick that keeps this fast: most escalations become a one-word human reply, so the human actually keeps answering them. The "one decidable question + recommended default" shape is the same thing ClarifyGPT found effective for code and the same thing you'd want from a junior engineer at your door.
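
If you want the shape enforced rather than requested, one option is to have the agent fill a small structure and render the message from it. This is a sketch under my own assumptions: the Escalation class and its limits (at most three attempts, single-line fields, exactly one question mark) are illustrative choices, not a fixed spec, and the example values are invented.

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    blocked: str        # one line: what the agent was trying to do
    tried: list[str]    # the 2-3 approaches, each with the error it hit
    believes: str       # best guess at the cause
    question: str       # ONE decidable question
    default: str        # what happens if the operator just replies "go"

    def render(self) -> str:
        if not 1 <= len(self.tried) <= 3:
            raise ValueError("summarize at most 3 attempts, not the whole log")
        one_liners = [self.blocked, self.believes, self.question,
                      self.default, *self.tried]
        if any("\n" in text for text in one_liners):
            raise ValueError("every field must fit on one line")
        # Crude heuristic for "one question": exactly one question mark.
        if self.question.count("?") != 1:
            raise ValueError("ask exactly one question")
        tried = "\n".join(f"  - {attempt}" for attempt in self.tried)
        return (f"BLOCKED: {self.blocked}\n"
                f"Tried:\n{tried}\n"
                f"Believes: {self.believes}\n"
                f"Question: {self.question}\n"
                f"Default: {self.default}")

msg = Escalation(
    blocked="Publish the queued post to the CMS",
    tried=["POST /posts returned 401", "Refreshed token, still 401"],
    believes="The API key lost its publish scope",
    question="Should I save this as a draft instead?",
    default="Save as draft and retry publishing tomorrow",
).render()
print(msg)
```

Rejecting a malformed escalation at render time means a 200-line log dump fails loudly in the agent's loop instead of landing in your inbox.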

Which is the mental model for the whole post, honestly. A junior who does whatever they think you meant is dangerous; one who asks about everything is exhausting; the one you promote pushes through the reversible stuff and shows up with one sharp question when the action is irreversible or genuinely underspecified. OpenAI's guide frames human intervention as something you tune down as evidence of reliability accumulates (guide, April 2025) — that's the right direction of travel. Start with a wide CONFIRM tier, log every gate decision, and graduate tools to ALLOW when the log shows the confirmations were all rubber stamps.
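
The "log every gate decision" step can be as small as the sketch below. GateEvent, graduation_candidates, and the min_confirms threshold of 20 are hypothetical names and values I picked for illustration, and the log is made up. The function only nominates tools for a human to review; it does not move anything into ALLOW by itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class GateEvent:
    tool: str
    decision: str                     # 'allow', 'confirm', or 'deny'
    approved: Optional[bool] = None   # set only when a human was asked

def graduation_candidates(events: list[GateEvent],
                          min_confirms: int = 20) -> list[str]:
    """Tools confirmed at least min_confirms times and never declined."""
    counts = defaultdict(lambda: [0, 0])    # tool -> [times asked, times approved]
    for event in events:
        if event.decision != "confirm":
            continue
        counts[event.tool][0] += 1
        counts[event.tool][1] += 1 if event.approved else 0
    return sorted(tool for tool, (asked, approved) in counts.items()
                  if asked >= min_confirms and approved == asked)

# Made-up log for illustration.
log = (
    [GateEvent("publish_post", "confirm", True)] * 25
    + [GateEvent("send_email", "confirm", True)] * 24
    + [GateEvent("send_email", "confirm", False)]
    + [GateEvent("issue_refund", "confirm", True)] * 5
    + [GateEvent("read_file", "allow")] * 100
)
print(graduation_candidates(log))   # ['publish_post']
```

A single decline keeps send_email in the CONFIRM tier, and issue_refund stays put because five approvals are too few to call a pattern. Whether a rubber-stamped but irreversible tool should ever graduate is still a blast-radius call for a person to make.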

But start with the gate. The agent that knows when to stop is the one you can afford to let run.

I keep the full set of reliability rules I apply before letting any agent run unattended, including the gate tiers and escalation template above, in the free Reliable Agent Field Guide: penloomstudio.com/field-guide.html. And if your gate keeps firing because the agent calls the wrong tool in the first place, that's usually a schema problem; there's a $2.99 tool-calling reliability pack with the linter and schema patterns I use.