Gabriel Anhaia

Anthropic Stop Reasons in Production: 5 Cases That Need Different Reactions


Two weeks after your first Claude-backed feature ships, three support tickets land: long answers are being cut off mid-sentence, the assistant "stops talking" right after deciding to call a tool, and one user got a polite refusal that the handler logged as a normal answer and shipped back to the UI as if it were one.

All three bugs live in one field. The Anthropic Messages API returns a stop_reason on every completion, and most handlers I've seen treat its five possible values as one. Each value demands a different reaction, and the dispatcher you'll have written by the end of this post is the cheapest LLM observability surface you can ship.

The five values come from the Anthropic Messages API reference. The behavior below matches the tool-use docs and the stop sequences guide.

Where stop_reason lives in the response

Every non-streaming Messages response carries the field at the top level:

{
  "id": "msg_01abc...",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-5",
  "content": [
    { "type": "text", "text": "..." }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": { "input_tokens": 412, "output_tokens": 88 }
}

In the streaming API, the same value arrives on the message_delta event near the end, with the final usage numbers. stop_reason carries the branch decision. Read it before you read content.
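
A minimal streaming sketch with the Python SDK, assuming the Anthropic client constructed in the next section (handle_stop is a hypothetical stand-in for the dispatcher this post builds):

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain stop reasons."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # forward deltas as they arrive
    # The SDK accumulates the full message; stop_reason rides on it.
    final = stream.get_final_message()

handle_stop(final.stop_reason)  # branch before you trust the content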

The values you'll see in production are end_turn, max_tokens, stop_sequence, tool_use, and refusal. They each want a different branch.

1. end_turn: the happy path, but verify the shape

end_turn means the model finished what it had to say on its own. No budget hit, no delimiter, no tool dispatch, no safety trip. This is what you want most of the time.

A clean end_turn doesn't guarantee a parseable response. It can still hand you partial JSON if you asked for JSON without a schema, a tool intent narrated in prose instead of dispatched, or empty content blocks if the model decided the answer was nothing.

from anthropic import Anthropic

client = Anthropic()

class EmptyResponse(Exception):
    """An end_turn response that carried no usable text."""

def call(prompt: str):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

    if msg.stop_reason == "end_turn":
        # content is a list of typed blocks; keep only the text ones.
        text = "".join(
            block.text for block in msg.content
            if block.type == "text"
        )
        if not text.strip():
            raise EmptyResponse(msg.id)
        return text

    # Everything else routes through the dispatcher built at the end of this post.
    return handle_other_stop(msg)

Two observability hooks earn their keep here. Track the ratio of end_turn to total stops per route. As a starting heuristic, expect the bulk of traffic on a healthy text endpoint to land on end_turn; a material drop usually means something upstream changed, like a longer system prompt or a new tool in the catalogue. The other one to track is empty content on end_turn, which is rare enough that any non-zero count deserves a look.
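
A sketch of both hooks, assuming the same statsd-style metrics client the later examples lean on:

def record_stop(route: str, msg) -> None:
    # Hook 1: stop-reason mix per route.
    metrics.incr(
        "anthropic_stop_reason_total",
        tags={"route": route, "reason": msg.stop_reason},
    )
    # Hook 2: empty content on end_turn; any non-zero count deserves a look.
    has_text = any(
        b.type == "text" and b.text.strip() for b in msg.content
    )
    if msg.stop_reason == "end_turn" and not has_text:
        metrics.incr("anthropic_empty_end_turn_total", tags={"route": route})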

2. max_tokens: raise the budget, or stream and continue

max_tokens means the model was still writing when the API stopped sampling because it had hit the ceiling you set. The response is truncated by definition, so if your code returns content[0].text straight to the user, the user sees a sentence that ends with a comma.

There are two real reactions, picked by route:

For finite tasks (summaries, extractions, classifications): bump the budget. Most truncations on these routes mean your max_tokens was guessed once and never revisited. The fix is one line, and you'll know it landed when the max_tokens rate drops.

For open-ended tasks (drafting, agent reasoning, code generation): accept that some answers are long, and continue. The pattern is to take the truncated assistant message, append it as the next assistant turn, and ask the model to keep going.

def call_with_continuation(prompt: str, max_rounds: int = 3):
    messages = [{"role": "user", "content": prompt}]
    full_text = ""

    for _ in range(max_rounds):
        msg = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=messages,
        )
        chunk = "".join(
            b.text for b in msg.content if b.type == "text"
        )
        full_text += chunk

        if msg.stop_reason != "max_tokens":
            return full_text

        messages.append({"role": "assistant", "content": chunk})
        messages.append({
            "role": "user",
            "content": "Continue from where you stopped.",
        })

    return full_text

A bound on max_rounds is mandatory. Without one, a runaway response runs up an unbounded bill. Three is a reasonable default for most open-ended endpoints; raise it only with eyes on the cost dashboard.

The metric to track is max_tokens_rate{route=...}. Useful starting thresholds: above a few percent on a finite-task route is a signal that the budget is too low, and a steady rate north of ten percent on an open-ended route means you're paying for the continuation overhead and should consider larger windows up front.

3. stop_sequence: a custom delimiter fired, route on which one

stop_sequence means the model emitted one of the strings you passed in stop_sequences and the API cut sampling at that boundary. The matched string lives in the response's stop_sequence field, which is null on every other branch.

The pattern shows up most in two places: enforcing a structured output protocol (</answer>, ###END###) and stopping a chain-of-thought scratchpad before it bleeds into the user-visible answer. In both cases, the right reaction depends on which delimiter fired.

def call_with_delimiters(prompt: str):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        stop_sequences=["</answer>", "###HUMAN###"],
        messages=[{"role": "user", "content": prompt}],
    )

    if msg.stop_reason != "stop_sequence":
        return handle_other_stop(msg)

    matched = msg.stop_sequence
    body = "".join(
        b.text for b in msg.content if b.type == "text"
    )

    if matched == "</answer>":
        return parse_answer_block(body)
    if matched == "###HUMAN###":
        raise PromptInjectionSuspected(msg.id)

    raise UnknownStopSequence(matched)

Two things to remember. The delimiter string itself is not included in the response text — the model stopped before emitting it — so don't try to parse it back out. And every delimiter you add is a string the model will sometimes hit by accident; pick something that real text would never produce. Closing-tag styles like </answer> work because no plausible English answer ends with them naturally.
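
If a downstream parser insists on seeing the closing tag, restore it yourself instead of hunting for it in the body. A two-line sketch against the handler above:

if matched == "</answer>":
    body += "</answer>"  # the API stripped it before returning; restore before parsing
    return parse_answer_block(body)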

The ###HUMAN### branch above is one of the cheap hardening tricks for prompt-injection paranoia. If a user input convinces the model to start a new "human" turn inside the assistant message, the stop sequence catches it before the output reaches a parser that might trust it.

4. tool_use: dispatch the call and resume the conversation

When you pass a tools array to the Messages API, tool_use becomes a possible terminal stop. It means the assistant message contains at least one tool_use content block and is asking your code to run that tool. The conversation isn't over; the model is paused, waiting for a tool_result to come back so it can continue.

The shape your handler has to learn is that the assistant turn is no longer a single text blob — it's a list of typed blocks, and you have to walk it.

def run_with_tools(messages, tools):
    while True:
        msg = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )

        if msg.stop_reason == "end_turn":
            return final_text(msg)

        if msg.stop_reason != "tool_use":
            return handle_other_stop(msg)

        messages.append({"role": "assistant", "content": msg.content})

        tool_results = []
        for block in msg.content:
            if block.type != "tool_use":
                continue
            output = dispatch(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            })

        messages.append({"role": "user", "content": tool_results})

A few details bite teams the first time they wire this up. The full assistant content list (text blocks, thinking blocks, and tool_use blocks together) has to go back into messages unchanged on the next turn; you cannot pluck out only the tool calls. Every tool_use_id from a tool_use block has to match the tool_use_id you set on the corresponding tool_result. And tool_result blocks travel inside a user message, not an assistant one. OpenAI's API has a separate tool role; Anthropic's does not.

A bound on the loop is non-negotiable. A misbehaving tool that always returns the same input back to the model can produce an infinite call chain that bills happily until your alerting catches it. Cap the number of tool rounds, log each dispatch with timing and inputs, and break the loop on repeat-call patterns.
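
One way to wire those guards, as a sketch; MAX_TOOL_ROUNDS, the cap value, and the duplicate-call heuristic are all assumptions to tune per route:

import json

MAX_TOOL_ROUNDS = 8  # hypothetical cap; tune against your cost dashboard

def check_tool_round(round_no: int, msg, seen_calls: set) -> None:
    """Raise before dispatching if the tool loop looks like a runaway."""
    if round_no >= MAX_TOOL_ROUNDS:
        raise RuntimeError(f"tool loop exceeded {MAX_TOOL_ROUNDS} rounds")
    for block in msg.content:
        if block.type != "tool_use":
            continue
        # The same tool called with the same input twice is the classic
        # runaway signature; catch it before it bills another round.
        key = (block.name, json.dumps(block.input, sort_keys=True))
        if key in seen_calls:
            raise RuntimeError(f"repeated tool call: {block.name}")
        seen_calls.add(key)

Call it at the top of each iteration of run_with_tools, with a round counter and a set that survives across rounds.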

5. refusal: the model declined; the prompt is the bug

refusal is the newest of the five and the one most teams haven't built a branch for yet. It signals that the model declined to produce the response, usually because the prompt brushed against a safety policy (see the Messages API reference for the full list of stop_reason values). The response still has content, but the text is a polite refusal, not the answer the caller asked for.

Your handler should never ship a refusal body to a user as if it were a normal answer.

def call_with_refusal_handling(prompt: str, route: str):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

    if msg.stop_reason != "refusal":
        return handle_other_stop(msg)

    metrics.incr(
        "anthropic_refusal_total",
        tags={"route": route},
    )
    log.warn("anthropic.refusal", extra={
        "route": route,
        "request_id": getattr(msg, "id", None),
        "prompt_hash": hash_prompt(prompt),
    })

    if route in SAFETY_REVIEW_ROUTES:
        return enqueue_for_review(prompt, msg)

    raise PromptNeedsRedesign(route)

For routes where occasional refusals are expected (user-facing chat, public APIs that take arbitrary input), enqueue the request for review and surface a graceful fallback to the user. The metric on anthropic_refusal_total{route=...} is the long-term signal for whether your input filtering needs tightening or your prompt needs softer framing.

For routes that should never refuse (internal classification, structured extraction over your own data), a refusal means the prompt is wrong. Some real input is hitting a phrasing the model won't answer, often a system prompt that frames the task adversarially or a few-shot example that crossed a line. Fix the prompt; don't retry the same payload.

Putting the branches together

The dispatcher at the top of every Anthropic call ends up looking like this:

def safe_call(prompt: str, route: str):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=route_tools(route),
        stop_sequences=route_delimiters(route),
        messages=[{"role": "user", "content": prompt}],
    )

    metrics.incr(
        "anthropic_stop_reason_total",
        tags={"route": route, "reason": msg.stop_reason},
    )

    match msg.stop_reason:
        case "end_turn":
            return extract_text(msg)
        case "max_tokens":
            return continue_or_truncate(msg, route)
        case "stop_sequence":
            return route_on_delimiter(msg)
        case "tool_use":
            return run_tool_loop(msg, route)
        case "refusal":
            return handle_refusal(msg, route)
        case _:
            raise UnknownStopReason(msg.stop_reason)

One counter, tagged by route and reason: the cheapest LLM observability you can ship. A max_tokens spike means the budget is too low. Refusals climbing? The prompt or the inputs drifted. tool_use rate creeping up signals an agent loop heating up, and stop_sequence firing on safe text means a delimiter is leaking into prose it shouldn't have triggered on.

Wire this dispatcher in early. The next piece of agent surface area worth this kind of attention is the rate-limit headers Anthropic returns alongside stop_reason, and that's a separate post.


If this was useful

Stop reasons are one slice of the surface area an agent presents to the rest of your stack. Tool loops, retry budgets, message-history hygiene, safety branches — all of it gets easier when you have a vocabulary for the response taxonomy and a metric per branch. The AI Agents Pocket Guide walks through the patterns for building agents that don't surprise you in production.

AI Agents Pocket Guide — Patterns for Building Autonomous Systems with LLMs
