Lavelle Hatcher Jr

Calling Anthropic's Advisor Tool in 50 Lines of Python


This article reflects my own experience and research. It is not the official view of any company mentioned.

When I first read Anthropic's Advisor Strategy post earlier this week, my first thought was: can a single /v1/messages call really let one Claude model consult another one mid-generation? I wanted to see the actual wire format and the token accounting before I trusted it in production, so I sat down and wrote the smallest working example I could. That is what this article is.

Versions used

  • Python 3.11
  • anthropic Python SDK 0.94.0 (released 2026-04-10)
  • Claude API, advisor tool in public beta since 2026-04-09
  • Beta header: advisor-tool-2026-03-01
  • Tool type: advisor_20260301

If you are reading this later, double check the beta header and tool type against the official docs. Beta names change when features move toward GA.

What the advisor tool actually does

Most server-side tools the executor can call (web search, code execution) perform an action and return data. The advisor tool is different. When the executor invokes it, the server runs a separate sub-inference on a stronger model over the entire transcript so far, then injects the advice back into the executor's stream. No extra round trip on your side.

The mechanics are slightly unusual. The executor emits a server_tool_use block with name: "advisor" and an empty input; the executor only decides the timing. The server constructs the advisor's view automatically from the full transcript (system prompt, tool definitions, prior turns, prior tool results). The advisor then runs without tools and without its own context management, its thinking blocks are stripped, and only the advice text lands back in the executor's prompt as an advisor_tool_result block. The executor resumes generating.
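
To make that flow concrete, here is a hypothetical sketch of the two content blocks as plain dicts. Only the type, name, and empty input are stated in the docs as I read them; the id fields and their pairing follow the usual tool-use pattern and are my assumption, not a verified schema.

```python
# Hypothetical shapes of the two advisor-related content blocks,
# reconstructed from the descriptions above (not an official schema).
advisor_call = {
    "type": "server_tool_use",
    "id": "srvtoolu_01Example",  # placeholder id
    "name": "advisor",
    "input": {},  # always empty: the executor only picks the timing
}

advisor_result = {
    "type": "advisor_tool_result",
    "tool_use_id": advisor_call["id"],  # assumed pairing, as with other tools
    "content": {
        "type": "advisor_result",
        "text": "Consider errgroup plus context cancellation for shutdown.",
    },
}

# The advice text is what gets spliced back into the executor's prompt.
print(advisor_result["content"]["text"])
```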

The pairing Anthropic recommends is Sonnet 4.6 (executor) plus Opus 4.6 (advisor). Haiku 4.5 also works as an executor. The only advisor model available today is claude-opus-4-6, and the advisor must be at least as capable as the executor.

The minimal call

Here is the smallest viable request, using client.beta.messages.create:

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Build a concurrent worker pool in Go with graceful shutdown.",
        }
    ],
)

print(response)

Four things worth pointing at:

  • betas=["advisor-tool-2026-03-01"] turns the feature on. This is the SDK shortcut for the anthropic-beta header.
  • The tool type is advisor_20260301, and name must literally be the string advisor.
  • model inside the tool definition is the advisor model. The top-level model is the executor.
  • You call client.beta.messages.create, not client.messages.create.
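
If you prefer raw HTTP over the SDK, the same request reduces to a JSON payload plus the anthropic-beta header. This is a sketch that only builds the request, never sends it; the header names follow the standard Messages API and the field names follow the SDK call above, so verify both against the docs.

```python
import json

# Sketch of the equivalent raw HTTP request (payload built, not sent).
headers = {
    "x-api-key": "YOUR_API_KEY",  # placeholder
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "advisor-tool-2026-03-01",  # the betas=[...] shortcut expanded
    "content-type": "application/json",
}

payload = {
    "model": "claude-sonnet-4-6",  # executor
    "max_tokens": 4096,
    "tools": [
        {
            "type": "advisor_20260301",
            "name": "advisor",  # must be literally "advisor"
            "model": "claude-opus-4-6",  # advisor
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": "Build a concurrent worker pool in Go with graceful shutdown.",
        }
    ],
}

body = json.dumps(payload)
```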

Reading what came back

When the executor decides to consult the advisor, two new content blocks appear in the response: a server_tool_use block with an empty input, followed by an advisor_tool_result block carrying the advice. This loop walks the content array and pulls each piece out:

for block in response.content:
    if block.type == "text":
        print("EXECUTOR:", block.text)
    elif block.type == "server_tool_use" and block.name == "advisor":
        print("ADVISOR CALL:", block.id)
    elif block.type == "advisor_tool_result":
        content = block.content
        if content.type == "advisor_result":
            print("ADVISOR SAID:", content.text)
        elif content.type == "advisor_tool_result_error":
            print("ADVISOR FAILED:", content.error_code)

Notice the two success variants. advisor_result carries human-readable text. advisor_redacted_result carries encrypted_content that you round-trip verbatim on the next turn. Opus 4.6 returns plaintext today, but other advisor models may not. If the sub-inference fails, you get advisor_tool_result_error with an error_code such as overloaded, too_many_requests, max_uses_exceeded, prompt_too_long, or execution_time_exceeded. The whole request does not fail in that case; the executor keeps going without further advice.
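
To make the round-trip rule concrete, here is a hypothetical helper that separates readable advice (safe to log) from redacted blocks (carried forward untouched). The dict shapes are assumptions based on the block names above.

```python
def split_advice(content_blocks):
    """Separate readable advice from opaque redacted results.

    Readable advice can be logged or displayed; redacted blocks must be
    forwarded byte-for-byte in the next request's message history.
    """
    readable, carry_forward = [], []
    for block in content_blocks:
        if block.get("type") != "advisor_tool_result":
            continue
        inner = block.get("content", {})
        if inner.get("type") == "advisor_result":
            readable.append(inner["text"])
        elif inner.get("type") == "advisor_redacted_result":
            # Never decode or edit this; round-trip the whole block verbatim.
            carry_forward.append(block)
    return readable, carry_forward

blocks = [
    {"type": "text", "text": "Working on it..."},
    {"type": "advisor_tool_result",
     "content": {"type": "advisor_result", "text": "Use a buffered channel."}},
    {"type": "advisor_tool_result",
     "content": {"type": "advisor_redacted_result",
                 "encrypted_content": "opaque-bytes"}},
]

advice, redacted = split_advice(blocks)
```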

Counting the tokens properly

This is the part I wanted to see with my own eyes. Usage is split between executor and advisor, and the top-level usage.input_tokens does not include the advisor's tokens at all. Everything lives in usage.iterations[]:

usage = response.usage
print(f"Executor output tokens (top level): {usage.output_tokens}")

for i, it in enumerate(usage.iterations):
    if it.type == "advisor_message":
        print(
            f"  [{i}] advisor ({it.model}): "
            f"in={it.input_tokens} out={it.output_tokens}"
        )
    else:
        print(
            f"  [{i}] executor: "
            f"in={it.input_tokens} out={it.output_tokens}"
        )

Advisor tokens bill at the advisor model's rate, so rolling them into the executor numbers would give you the wrong cost. The docs spell out the aggregation rules: top-level output_tokens is the sum across executor iterations, and top-level input_tokens reflects the first executor iteration only. For anything resembling billing, loop over iterations and group by type.
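
As a sketch of that rollup, here is a per-model cost function over mock iteration data. The rates are placeholders, not real prices, and the "message" type string for executor iterations is my assumption; the loop above only distinguishes advisor_message from everything else.

```python
# Hypothetical (input, output) USD per million tokens -- placeholders only.
RATES_PER_MTOK = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
}

def cost_usd(iterations, executor_model):
    """Sum cost across iterations, billing advisor turns at the advisor rate."""
    total = 0.0
    for it in iterations:
        # Advisor iterations carry their own model; everything else is the executor.
        model = it["model"] if it["type"] == "advisor_message" else executor_model
        rate_in, rate_out = RATES_PER_MTOK[model]
        total += it["input_tokens"] / 1e6 * rate_in
        total += it["output_tokens"] / 1e6 * rate_out
    return total

iterations = [
    {"type": "message", "input_tokens": 1_000, "output_tokens": 500},
    {"type": "advisor_message", "model": "claude-opus-4-6",
     "input_tokens": 2_000, "output_tokens": 600},
    {"type": "message", "input_tokens": 0, "output_tokens": 800},
]

print(f"${cost_usd(iterations, 'claude-sonnet-4-6'):.4f}")
```

Note how the middle iteration dominates: reading usage.output_tokens alone would miss it entirely.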

Capping cost with max_uses

The advisor tool ships without a conversation-level cap, but it does support a per-request max_uses:

tools = [
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "max_uses": 2,
    }
]

Once the executor hits that cap, additional advisor calls return an advisor_tool_result_error with error_code: "max_uses_exceeded". The cap is per request, so on a multi-turn conversation you still need a client-side counter if you want a total ceiling. When you decide to stop offering the advisor, the docs are explicit: remove it from tools AND strip every advisor_tool_result block from your message history before the next request. Leaving the blocks behind without the tool returns a 400 invalid_request_error.
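
A minimal sketch of that cleanup step, assuming the history is a list of plain dicts. Per the rule above it strips only advisor_tool_result blocks; string content passes through untouched.

```python
def drop_advisor_results(messages):
    """Strip every advisor_tool_result block before a request without the tool."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            content = [b for b in content if b.get("type") != "advisor_tool_result"]
        cleaned.append({**msg, "content": content})
    return cleaned

history = [
    {"role": "user", "content": "Refactor the worker pool."},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Plan: ..."},
        {"type": "advisor_tool_result",
         "content": {"type": "advisor_result", "text": "Use errgroup."}},
    ]},
]

cleaned = drop_advisor_results(history)
```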

Advisor-side caching

For long agent loops where the advisor fires three or more times, you can enable caching on the advisor's own transcript:

tools = [
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
        "caching": {"type": "ephemeral", "ttl": "5m"},
    }
]

The shape is fixed: type must be ephemeral and ttl is 5m or 1h. Unlike cache_control on normal content blocks, this is just an on/off switch; the server decides where the cache boundaries go. The documented break-even point is about three advisor calls per conversation. Below that, the write cost exceeds the read savings.

Things to watch out for

  • Streaming pauses. The advisor sub-inference does not stream. While it runs, your executor stream sits idle except for standard SSE ping keepalives roughly every 30 seconds. Short advisor calls may show no pings at all. Your UI needs to handle that silence without timing out.
  • max_tokens bounds the executor only. It does not cap advisor output. Budget for an extra 1,400 to 1,800 tokens per advisor call (400 to 700 text plus thinking).
  • Rate limits draw from two buckets. Executor rate limits fail the whole request with HTTP 429. Advisor rate limits come back as too_many_requests inside the advisor_tool_result block, and the request continues.
  • Invalid pairings return 400. The advisor must be at least as capable as the executor. Today that means Opus as advisor for any executor. Haiku as advisor is not supported.
  • Do not rewrite redacted results. If the advisor returns advisor_redacted_result, pass the opaque encrypted_content back on the next turn verbatim. The server decrypts it server side. Reading it or substituting text will break the conversation.
  • Context editing has sharp edges. clear_thinking with any keep value other than "all" shifts the advisor's quoted transcript each turn and kills advisor side caching. If you use extended thinking alongside the advisor, set keep: "all" explicitly.
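
The two rate-limit buckets imply two different handling paths: executor 429s surface as HTTP errors your client library raises, while advisor limits arrive in-band. Here is a hypothetical collector for the in-band case, reusing the dict shapes assumed earlier.

```python
def advisor_errors(content_blocks):
    """Collect in-band advisor failures (the request itself succeeded).

    Executor-side 429s arrive as HTTP errors and never reach this
    function; advisor-side limits show up as error blocks instead.
    """
    errors = []
    for block in content_blocks:
        if block.get("type") != "advisor_tool_result":
            continue
        inner = block.get("content", {})
        if inner.get("type") == "advisor_tool_result_error":
            errors.append(inner.get("error_code"))
    return errors

blocks = [
    {"type": "text", "text": "Continuing without advice."},
    {"type": "advisor_tool_result",
     "content": {"type": "advisor_tool_result_error",
                 "error_code": "too_many_requests"}},
]
```

Whether you retry, alert, or just log depends on the code: too_many_requests and overloaded are transient, max_uses_exceeded is expected once you hit your cap.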

Is it worth wiring in?

From a single-request cost angle, the advisor is cheaper than Opus solo whenever your task is mostly mechanical output with a few key decisions, and more expensive than Sonnet solo whenever those decisions are unnecessary. That tradeoff lives in your prompt and your workload, not in the API. I would not blindly turn it on for chat, but for agent loops with dozens of turns it is the right knob to have.

Anthropic's own guidance in the docs is specific about timing: call the advisor early, after a few exploratory reads are in the transcript but before substantive work begins, and call it again near the end after file writes and test outputs are available. That matches what I see in practice. The advisor adds almost all of its value in the first call, before your approach crystallizes. If you wait until the executor is three quarters of the way through a wrong solution, the advisor will politely tell you so and you will still have to redo the work.

The part I underestimated before writing this example was the usage accounting. If you have a billing pipeline that reads usage.input_tokens and usage.output_tokens directly, it will silently undercount advisor time. Migrate to iterations before you flip this on in production.

What would you use a second opinion model for in your own agent loops? I am curious whether people are reaching for this more on planning or on verification.
