Return Claude's Thinking Blocks or Your Agent Breaks

Enable extended thinking on Claude Opus 4.x, wire it into a tool-use loop, and the second turn throws a 400: messages.1.content.0.type: expected thinking or redacted_thinking. Or worse — it doesn't throw, the model just gets dumber, loops on the same tool, or contradicts a chain of reasoning it can no longer see. The cause is almost always the same: your agent loop dropped the thinking blocks before sending the conversation back.

This is the single most common way people break Claude's extended thinking in production, and it's invisible until you look at exactly what bytes you're putting back into messages.

TL;DR

When extended thinking is enabled and the model calls a tool, you must send the thinking (or redacted_thinking) block back verbatim, in the same assistant turn as the tool_use block, including its signature.
The signature is a cryptographic check. Modify, truncate, or reorder the thinking block and the API rejects the request; drop it and you get a 400 or degraded reasoning.
Thinking blocks are billed as output tokens and count against max_tokens. budget_tokens is an upper target for reasoning, not a floor: the model may use less, and whatever it uses comes out of the same max_tokens as the final answer.
You only need to preserve thinking for the current tool-use cycle. Once a turn ends with a normal text answer, earlier thinking blocks are no longer required on subsequent turns.
Interleaved thinking (a beta header) lets Claude reason between tool calls — powerful for multi-step agents, but it makes preserving those blocks non-optional.

Why does Claude require the thinking block back during tool use?

Because the thinking block is part of the model's state, not a log line for you. When Claude Opus 4.x decides to call a tool mid-reasoning, the thinking block is the reasoning that led to that call. On the next request, the model needs to continue from that reasoning to interpret the tool result. If the block isn't there, the model is asked to make sense of a tool_result with no memory of why it called the tool.

The API enforces this structurally. With thinking enabled, the first content block of an assistant turn that contains tool_use must be a thinking or redacted_thinking block. Strip it and you get:

400 messages.1.content.0.type: expected `thinking` or `redacted_thinking`,
but found `tool_use`

This is different from normal (non-thinking) tool use, where you're free to drop everything except the tool_use block. Many agent frameworks were written for that world — they reconstruct the assistant turn from just the tool call — and they silently violate the thinking contract the moment you flip the feature on.
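
A quick way to see whether a framework is doing this is to compare what the API returned with what the loop is about to send back. The sketch below is illustrative only: dropped_thinking_count and block_type are hypothetical helper names, the blocks are hand-written dicts rather than a real response, and the signature value is a placeholder.

```python
THINKING_TYPES = {"thinking", "redacted_thinking"}


def block_type(block):
    """Return a content block's type, whether it is a dict or an SDK object."""
    if isinstance(block, dict):
        return block.get("type")
    return getattr(block, "type", None)


def dropped_thinking_count(received, outgoing):
    """Count thinking-type blocks that were in the response but are not sent back."""

    def count(content):
        return sum(1 for b in content if block_type(b) in THINKING_TYPES)

    return max(0, count(received) - count(outgoing))


# Illustrative response content (the signature value is a placeholder).
received = [
    {"type": "thinking", "thinking": "Need the weather first.", "signature": "sig-placeholder"},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Seoul"}},
]

# The non-thinking habit: rebuild the assistant turn from the tool call alone.
rebuilt = [b for b in received if block_type(b) == "tool_use"]

print(dropped_thinking_count(received, rebuilt))   # 1: the thinking block was lost
print(dropped_thinking_count(received, received))  # 0: full passthrough
```

Running a check like this in a test around your framework's message-building step tends to surface the problem before the API does.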

What is the signature field and why can't I edit the thinking?

The signature is a cryptographic signature over the thinking content, generated by the API. It exists so Anthropic can verify that the thinking block was produced by the model and handed back unaltered. You don't compute it, you don't validate it, you just carry it.

The practical rule: treat the entire thinking block as an opaque token. Don't pretty-print it, don't strip whitespace, don't reorder fields, don't summarize it, don't merge two blocks into one. If you mutate the content, the signature no longer matches and the request is rejected.
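
You can't validate the signature yourself, but you can catch your own accidental edits before the API does. The following is a minimal sketch of a local guard, assuming the blocks have already been converted to plain dicts. The names fingerprint, snapshot, and assert_untouched are hypothetical, and the hash has nothing to do with the API's signature; it only detects changes made inside your own process.

```python
import hashlib
import json

THINKING_TYPES = {"thinking", "redacted_thinking"}


def fingerprint(block):
    """Hash a block's canonical JSON so later edits to its values are detectable."""
    canonical = json.dumps(block, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(content):
    """Fingerprint every thinking-type block, in order, at the time of receipt."""
    return [fingerprint(b) for b in content if b.get("type") in THINKING_TYPES]


def assert_untouched(content, expected):
    """Raise if thinking blocks were changed, dropped, or reordered since the snapshot."""
    if snapshot(content) != expected:
        raise ValueError("thinking blocks were changed, dropped, or reordered")


# Illustrative content; the signature is a placeholder, not a real one.
content = [
    {"type": "thinking", "thinking": "  Check Seoul first.\n", "signature": "sig-placeholder"},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Seoul"}},
]

taken = snapshot(content)          # take this right after the response arrives
assert_untouched(content, taken)   # call this right before the next request
```

Even a harmless-looking strip() on the thinking text changes the fingerprint, which is the point: the guard fails in your code, with a message you wrote, instead of at the API.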

You'll also occasionally get redacted_thinking blocks — encrypted content the safety system chose not to expose in plaintext. These look like noise and there's nothing to read, but they carry the same rule: pass them back exactly as received. A common bug is filtering out blocks your rendering code doesn't recognize; that filter eats redacted_thinking and breaks the next turn.
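
One way to avoid that bug is to keep rendering and history as two separate paths: filter for display, never for storage. A small sketch under that assumption, using hand-written dict blocks; render_for_display is a hypothetical helper and the data value is a placeholder for the encrypted payload.

```python
def render_for_display(content):
    """Return only user-visible text; other block types are skipped for display."""
    return "\n".join(b["text"] for b in content if b.get("type") == "text")


# Illustrative response content with a redacted block (data is a placeholder).
response_content = [
    {"type": "redacted_thinking", "data": "opaque-placeholder"},
    {"type": "text", "text": "Let me check the weather."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Seoul"}},
]

history = []

# History gets the unfiltered content, including blocks the UI cannot render.
history.append({"role": "assistant", "content": response_content})

# The UI gets a filtered view that never feeds back into history.
shown = render_for_display(response_content)
print(shown)  # Let me check the weather.
```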

How do I write the tool loop correctly?

Append the assistant's entire content array — thinking block included — before you append your tool_result. Don't rebuild the assistant message from parsed fields.

import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "What's the weather in Seoul, in Fahrenheit?"}]

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city (Celsius).",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

while True:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},
        tools=tools,
        messages=messages,
    )

    # Append the FULL content array verbatim — thinking + tool_use together.
    # Do NOT filter to just the tool_use block.
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        break

    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            # ... your real tool call here ...
            result = "9°C, clear"
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })

    messages.append({"role": "user", "content": tool_results})

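# Print the final answer. Join the text blocks rather than assuming the
# last block is text (it may not be if the response was cut off).
print("".join(block.text for block in resp.content if block.type == "text"))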

The load-bearing line is messages.append({"role": "assistant", "content": resp.content}). Passing resp.content straight through keeps the thinking block, its signature, and any redacted_thinking intact. The moment someone "cleans this up" into [{"type": "tool_use", ...}], the loop breaks.

Do I need to keep thinking blocks from every previous turn?

No, and this is the nuance that saves you tokens and confusion. You only need thinking blocks for the tool-use cycle currently in flight. Once an assistant turn ends with a final text answer (not a tool call), the thinking that produced earlier answers is no longer required on later turns. On most Claude 4 models the API strips previous-turn thinking from the context automatically, so re-sending it buys you nothing. Anthropic's docs note that Opus 4.5 and later keep prior-turn thinking by default, so check the behavior for the model you run; passing everything back unchanged is accepted either way.

Concretely: a single user question that triggers three tool calls before answering is one cycle — you must preserve the thinking across all three round-trips. But once Claude gives its final text answer and the user asks a new question, you don't need to drag the old thinking blocks forward.
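
The cycle boundary can be found mechanically: it is the last user message that is a real prompt rather than a carrier for tool_result blocks. The sketch below assumes messages are plain dicts; current_cycle_start and drop_stale_thinking are hypothetical helpers, and whether trimming is worth doing at all depends on your model and caching setup. Passing everything back untouched remains the simpler default.

```python
THINKING_TYPES = {"thinking", "redacted_thinking"}


def is_tool_result_turn(message):
    """True for a user message made up entirely of tool_result blocks."""
    content = message["content"]
    return (
        message["role"] == "user"
        and isinstance(content, list)
        and len(content) > 0
        and all(isinstance(b, dict) and b.get("type") == "tool_result" for b in content)
    )


def current_cycle_start(messages):
    """Index of the last user message that is a prompt, not a tool_result carrier."""
    for i in range(len(messages) - 1, -1, -1):
        m = messages[i]
        if m["role"] == "user" and not is_tool_result_turn(m):
            return i
    return 0


def drop_stale_thinking(messages):
    """Return a copy with thinking removed from turns before the current cycle."""
    start = current_cycle_start(messages)
    out = []
    for i, m in enumerate(messages):
        if i < start and m["role"] == "assistant" and isinstance(m["content"], list):
            kept = [b for b in m["content"] if b.get("type") not in THINKING_TYPES]
            # Never emit an empty assistant turn; keep the original if nothing is left.
            out.append({**m, "content": kept} if kept else m)
        else:
            out.append(m)
    return out


# Illustrative history: one finished cycle, then a second cycle still in flight.
messages = [
    {"role": "user", "content": "Weather in Seoul?"},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "A", "signature": "sig-a"},
        {"type": "tool_use", "id": "t1", "name": "get_weather", "input": {"city": "Seoul"}},
    ]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": "9°C"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "About 48°F."}]},
    {"role": "user", "content": "And in Busan?"},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "B", "signature": "sig-b"},
        {"type": "tool_use", "id": "t2", "name": "get_weather", "input": {"city": "Busan"}},
    ]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t2", "content": "11°C"}]},
]

trimmed = drop_stale_thinking(messages)
```

In this history the cycle in flight starts at the "And in Busan?" message, so only the thinking block from the first cycle is removed; the one paired with the pending Busan tool call is left exactly as it was.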

Interleaved thinking changes the intensity of this. With the beta header interleaved-thinking-2025-05-14, Claude can emit a fresh thinking block after each tool_result — reasoning about what the tool returned before deciding the next call. That's exactly what you want for multi-step agents (read a file, reason, decide the next read), but it means a single cycle can contain several thinking blocks, each of which must survive back to the API in order.
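
As a rough sketch, the beta can be requested by sending the header with each call. The example below builds the keyword arguments for client.messages.create and passes the header through the SDK's extra_headers parameter. The function name interleaved_request is hypothetical, and the model and token values simply mirror the loop shown earlier.

```python
INTERLEAVED_BETA = "interleaved-thinking-2025-05-14"


def interleaved_request(messages, tools, model="claude-opus-4-5"):
    """Build kwargs for client.messages.create with interleaved thinking requested."""
    return {
        "model": model,
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": 2048},
        "tools": tools,
        "messages": messages,
        # Per-request beta header; the rest of the tool loop stays the same.
        "extra_headers": {"anthropic-beta": INTERLEAVED_BETA},
    }


kwargs = interleaved_request(
    messages=[{"role": "user", "content": "What's the weather in Seoul, in Fahrenheit?"}],
    tools=[],
)

# With a configured client this would be sent as:
# resp = client.messages.create(**kwargs)
```

Nothing else in the loop needs to change as long as the full content array is appended on every round-trip; the extra thinking blocks ride along with it.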

How does extended thinking interact with prompt caching?

Carefully, and this is where costs get surprising. Thinking blocks are billed as output tokens, and on the next request they re-enter as input. If you're running long tool-use cycles with a big budget_tokens, that reasoning shows up on both sides of the ledger.

Two things worth knowing:

Thinking blocks don't give you a new place to cache. Where previous-turn thinking is stripped once a turn resolves, the message prefix changes and cache breakpoints inside messages can stop matching. Put your stable, cacheable prefix (system prompt, tool definitions) before anything that varies, as usual.
Changing thinking parameters between requests, including budget_tokens, invalidates cached message prefixes; cached system prompts and tool definitions still hit. Keep the budget stable across a cycle.

The mental model: extended thinking is not free context you get to reuse. It's expensive output that you're obligated to carry for the duration of one reasoning cycle, then allowed to drop.
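
One possible arrangement of that advice: mark the system prompt and the last tool definition as cache breakpoints, and hold the thinking budget in one constant so it cannot drift between requests. This is a sketch, not a recommendation for every workload; cached_request and STABLE_BUDGET are hypothetical names, and the model and token values mirror the earlier loop.

```python
STABLE_BUDGET = 2048  # keep this fixed for the whole tool-use cycle


def cached_request(messages, system_text, tools):
    """Build kwargs with cache breakpoints on the stable prefix only."""
    # Copy tool definitions so the caller's list is not mutated.
    tools_out = [dict(t) for t in tools]
    if tools_out:
        tools_out[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": STABLE_BUDGET},
        "system": [
            {"type": "text", "text": system_text, "cache_control": {"type": "ephemeral"}}
        ],
        "tools": tools_out,
        "messages": messages,
    }


tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city (Celsius).",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

kwargs = cached_request(
    messages=[{"role": "user", "content": "Weather in Seoul?"}],
    system_text="You are a weather assistant.",
    tools=tools,
)
```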

What are the other constraints that bite in production?

A short list of things that will cost you an afternoon if you don't know them:

budget_tokens must be at least 1024 and must be less than max_tokens. It's a target for reasoning depth, not a guarantee — the model can use less.
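Temperature is constrained. With extended thinking you cannot change temperature or top_k, top_p is limited to a narrow range just below 1 (0.95 to 1 in the docs at the time of writing), and forcing a specific tool through tool_choice is not supported. If your framework hard-codes temperature=0.2, thinking requests will error. Let it default.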
Streaming is required past a certain max_tokens (the docs give 21,333 at the time of writing). Large thinking budgets push total output high enough that the API expects a streaming request; non-streaming can time out or be rejected.
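budget_tokens counts toward max_tokens. Setting max_tokens=2048 with budget_tokens=2048 is rejected outright because the budget must be strictly smaller, and max_tokens=2100 would pass validation but leave about 50 tokens for the actual answer. Size max_tokens to hold both.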
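
The numeric rules in this list are easy to check before a request ever leaves your process. A minimal sketch of such a preflight check follows. check_thinking_params is a hypothetical helper, and min_answer_tokens is a local rule of thumb, not an API limit.

```python
MIN_BUDGET_TOKENS = 1024  # minimum thinking budget stated above


def check_thinking_params(max_tokens, budget_tokens, temperature=None, min_answer_tokens=512):
    """Return a list of human-readable problems; an empty list means the values look sane."""
    problems = []
    if budget_tokens < MIN_BUDGET_TOKENS:
        problems.append(f"budget_tokens={budget_tokens} is below {MIN_BUDGET_TOKENS}")
    if budget_tokens >= max_tokens:
        problems.append("budget_tokens must be less than max_tokens")
    elif max_tokens - budget_tokens < min_answer_tokens:
        # Local heuristic: reasoning could consume nearly all of max_tokens.
        problems.append(
            f"only {max_tokens - budget_tokens} tokens left for the answer"
        )
    if temperature is not None:
        problems.append("temperature is set explicitly; leave it at the default")
    return problems


print(check_thinking_params(4096, 2048))   # []
print(check_thinking_params(2100, 2048))   # ['only 52 tokens left for the answer']
```

Calling this from the same place that builds the request keeps a hard-coded temperature or an undersized max_tokens from reaching production unnoticed.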

Direct answer: why does returning Claude's thinking blocks matter?

When extended thinking is enabled on Claude Opus 4.x or Sonnet 4.x and the model calls a tool, the thinking block is part of the model's reasoning state, not a debug artifact. You must return it to the API verbatim, in the same assistant turn as the tool_use block, complete with its cryptographic signature and any redacted_thinking blocks. Strip it and you get a 400 error (expected thinking or redacted_thinking); mutate it and the signature check fails; filter out blocks your code doesn't recognize and you silently break redacted thinking. The fix is one line: append the model's entire content array back into messages rather than reconstructing it from the tool call alone. That obligation lasts only for the tool-use cycle currently in flight. Get that right and extended thinking plus tool use is one of the strongest agent patterns available; get it wrong and it fails in ways that look like the model, not your loop.