- Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A friend of mine flipped on extended thinking across an entire production endpoint last quarter, or so the story was told to me. The endpoint did three things: classify a support ticket into one of eight categories, extract a customer email, and route the ticket to a queue. He set a high `budget_tokens` value because someone on a podcast said reasoning makes models smarter. The next invoice was roughly four times the previous month's. Classification accuracy barely moved.
That is the kind of mistake adaptive thinking was built to prevent. As of writing, Anthropic recommends adaptive thinking on its current model line. On Opus 4.7, the docs steer you to adaptive rather than manual `budget_tokens`. The model string in this post is `claude-opus-4-7`. Swap to `claude-sonnet-4-6` if you want manual `budget_tokens` for comparison.
The interesting question was never "should I turn thinking on." It was "how do I tell, for my tasks, which prompts pay back the reasoning tokens and which do not." That answer is empirical. You build a small harness, you run the same task three ways, you read the numbers.
## What thinking actually buys you
Per the Anthropic docs, thinking tokens are billed as standard output tokens at the model's normal output rate. There is no separate "reasoning" tier on the price card. Whatever Claude generates inside the thinking block lands on your invoice at the output rate of the model you picked. Check the current pricing page for the rate on your model as of writing. For a given prompt, switching from no-thinking to high-effort adaptive thinking can multiply your output token count several-fold, depending on the task. If the task is hard enough that the answer quality moves, the multiplier earns its place. If the task is a one-line transformation, you have just paid that multiplier for the same answer.
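The arithmetic is worth a five-line sketch. The rates and token counts here are placeholders, not real pricing (pull the actual per-million rates from the pricing page); the point is the shape of the multiplier, not the numbers:

```python
RATE_OUT = 25 / 1_000_000  # illustrative output rate in $/token, not a quote

baseline_out = 150         # tokens for a direct answer, no thinking
thinking_out = 150 + 900   # same answer plus a 900-token thinking trace

baseline_cost = baseline_out * RATE_OUT
thinking_cost = thinking_out * RATE_OUT

# The trace multiplied output cost 7x; worth it only if the score moved.
print(thinking_cost / baseline_cost)  # 7.0
```

If the answer quality is identical in both rows, that ratio is the exact factor you overpaid by.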
Three task families where the reasoning tends to pay off, based on what I see in my own harness runs:
- Multi-step math. The model has to chain operations and check intermediate results. Thinking lets it backtrack from a wrong path before committing to an answer.
- Multi-document synthesis. Reading three PDFs and reconciling the contradictions. The thinking trace is where the reconciliation happens; without it, the model picks one source and ignores the others.
- Agent planning with competing options. "Should I call `search_docs` or `read_file` first?" When the cost of the wrong choice is a wasted tool call and a wrong answer downstream, thinking is the cheapest place to catch it.
Three task families where it is pure waste:
- Short factoid recall. "What is the capital of France?" The model has the answer in the first forward pass. Thinking adds latency and cost without changing the output.
- Deterministic transformations. "Convert this JSON to YAML." There is one right answer; reasoning about it does not improve it.
- Simple classification. Eight buckets, clear rules. The signal is in the surface text, not in any chain of reasoning the model would build.
The line between those two lists is where the harness lives.
## The harness
Same task, three modes, fair eval. The point is to measure the lift from reasoning for the specific shape of work you actually run, not the average across someone else's benchmark.
```python
import json
import time
from dataclasses import dataclass

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-7"


@dataclass
class Run:
    case_id: str
    mode: str
    answer: str
    thinking_chars: int
    output_tokens: int
    input_tokens: int
    elapsed_ms: int
```
Three modes wrap the same `messages.create` call. Per the docs as of writing, adaptive is the recommended thinking mode on Opus 4.7. To compare against a no-thinking baseline, you omit the `thinking` parameter entirely. The `effort` field on `output_config` controls how deep the reasoning goes.
```python
def call(prompt: str, mode: str) -> Run:
    kwargs = dict(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    if mode == "off":
        pass  # no thinking parameter at all: the baseline
    elif mode == "low":
        kwargs["thinking"] = {
            "type": "adaptive",
            "display": "summarized",
        }
        kwargs["output_config"] = {"effort": "low"}
    elif mode == "high":
        kwargs["thinking"] = {
            "type": "adaptive",
            "display": "summarized",
        }
        kwargs["output_config"] = {"effort": "high"}
    else:
        raise ValueError(mode)

    t0 = time.perf_counter()
    msg = client.messages.create(**kwargs)
    elapsed = int((time.perf_counter() - t0) * 1000)

    text_parts = []
    thinking_chars = 0
    for block in msg.content:
        if block.type == "text":
            text_parts.append(block.text)
        elif block.type == "thinking":
            thinking_chars += len(block.thinking or "")

    return Run(
        case_id="",  # filled in by the eval loop
        mode=mode,
        answer="".join(text_parts),
        thinking_chars=thinking_chars,
        output_tokens=msg.usage.output_tokens,
        input_tokens=msg.usage.input_tokens,
        elapsed_ms=elapsed,
    )
```
Two notes on that snippet. First, on Opus 4.7 the `display` field controls whether the thinking trace comes back over the wire (per the docs as of writing); if you want to see the reasoning for evaluation purposes, set `display: "summarized"` explicitly. The bill is the same either way: you pay for the full thinking trace regardless of what is returned. Second, `output_tokens` already includes the thinking tokens. Do not double-count.
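A quick sanity check on the accounting, with made-up usage numbers and the same illustrative rates as before (real rates live on the pricing page):

```python
# Usage numbers from a single response (made-up values).
input_tokens = 900
output_tokens = 1400   # already INCLUDES the thinking tokens
thinking_chars = 5200  # trace length in characters, diagnostics only

# Illustrative $5 / $25 per-million rates; not actual pricing.
cost = input_tokens * 5 / 1_000_000 + output_tokens * 25 / 1_000_000

# WRONG would be: estimating thinking tokens and adding them on top of
# output_tokens -- that double-counts the trace you already paid for.
print(f"${cost:.4f}")
```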
The eval loop is where the work happens. A small JSONL file with prompts and reference answers, an LLM-judge that grades each candidate against the rubric on a 0-to-5 scale, and a per-mode aggregate at the end.
```python
def evaluate(cases_path: str):
    rows = []
    for line in open(cases_path):
        case = json.loads(line)
        for mode in ("off", "low", "high"):
            run = call(case["prompt"], mode)
            run.case_id = case["id"]
            score = judge(case["reference"], run.answer)
            rows.append((run, score, case["category"]))
    return rows
```
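The cases file is plain JSONL, one case per line, carrying the four fields the loop reads: `id`, `prompt`, `reference`, `category`. The rows below are made-up examples of the shape, not real eval data:

```jsonl
{"id": "math-01", "category": "multi_step_math", "prompt": "A 120 L tank drains at 3 L/min. How long until it is empty?", "reference": "40 minutes"}
{"id": "cls-01", "category": "classification", "prompt": "Classify this ticket: 'I was charged twice this month.'", "reference": "billing"}
```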
```python
def judge(reference: str, candidate: str) -> int:
    rubric = (
        "Score the candidate 0-5 against the reference. "
        "5 = identical meaning, 0 = wrong or off-topic. "
        "Return only the integer."
    )
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"{rubric}\n\nReference: {reference}"
                f"\n\nCandidate: {candidate}\n\nScore:"
            ),
        }],
    )
    try:
        return int(msg.content[0].text.strip()[0])
    except (ValueError, IndexError):
        return 0  # unparseable grade counts as a failure, not a crash
```
The judge is a different model (Sonnet 4.6) on purpose. You do not want the same model grading its own work. Keep the rubric short. If the rubric is doing real work, write a longer one and run a second-pass calibration against human grades on a 20-case slice. Otherwise the noise floor of the judge swamps the lift you are trying to measure.
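The calibration pass can be as small as a mean-absolute-error check over that 20-case slice. The grades below are invented for illustration; the human column is whatever you collected by hand:

```python
def judge_mae(human_scores, judge_scores):
    """Mean absolute gap between human and judge grades on the 0-5 scale."""
    pairs = list(zip(human_scores, judge_scores))
    return sum(abs(h - j) for h, j in pairs) / len(pairs)

# Made-up grades for a 5-case slice.
human = [5, 3, 4, 0, 2]
judge = [5, 2, 4, 1, 2]
print(judge_mae(human, judge))  # 0.4
```

If the MAE is near or above the off-to-high score lift you measured, the judge is too noisy to trust on that category.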
## Reading the output
Aggregate per category, per mode. The shape you are looking for is two columns: the score lift from off→high, and the dollar cost of that lift.
```python
def summarize(rows):
    by = {}
    for run, score, cat in rows:
        key = (cat, run.mode)
        b = by.setdefault(key, {
            "scores": [],
            "out_tokens": 0,
            "in_tokens": 0,
            "ms": 0,
            "n": 0,
        })
        b["scores"].append(score)
        b["out_tokens"] += run.output_tokens
        b["in_tokens"] += run.input_tokens
        b["ms"] += run.elapsed_ms
        b["n"] += 1

    for (cat, mode), b in sorted(by.items()):
        avg_score = sum(b["scores"]) / b["n"]
        # Illustrative $5 / $25 per-million rates; pull current ones
        # from the pricing page before trusting the dollar column.
        cost = (
            b["in_tokens"] * 5 / 1_000_000
            + b["out_tokens"] * 25 / 1_000_000
        )
        print(
            f"{cat:>20s} {mode:>4s} "
            f"score={avg_score:.2f} "
            f"out={b['out_tokens']:>6d} "
            f"$={cost:.4f} "
            f"ms={b['ms']//b['n']}"
        )
```
Pull the current per-million input/output rates from the pricing page when you wire up the cost column; the discounts available via prompt caching and the batch API are listed there too. Third-party writeups have also flagged a tokenizer change in 4.7 (e.g., the Finout post on the topic). The takeaway, if it holds: a "rate-card unchanged" upgrade can still raise your invoice if the same input string produces more tokens. Verify against the primary docs before quoting any specific multiplier in your own write-ups.
What you should expect to see, qualitatively. On math and multi-doc synthesis, the score column climbs as you move off→low→high and the dollar column climbs faster. The question is whether your category is one where the score curve is steep enough to justify the cost curve. On classification and short factoid recall, the score column is flat and the dollar column is anything but. That is the entire decision.
## When to ship which mode
Three rules that fall out of running this harness.
Default to `off` for anything under 200 input tokens. A short prompt rarely benefits from reasoning. The model has the answer immediately; you are paying for a deliberation that produces the same output.

Default to low effort for agentic loops. You want adaptive thinking on so the model can plan between tool calls (interleaved thinking is automatic in adaptive mode), but you want the effort dial low so it does not over-deliberate every read. Let the complexity of each step decide.

Reserve high or max effort for the prompts where your harness shows a real lift. Measured, in your eval, not vibes. If you do not have evidence of a lift on a category, you do not have a reason to pay for thinking on that category.
Run the harness on a representative slice of your traffic before the next pricing review.
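The three rules collapse into a small router. Everything here is an assumption to adapt to your own harness output: the 200-token cutoff, the category names, and the idea that you keep a set of categories where the eval showed a measured lift:

```python
def pick_mode(input_tokens: int, category: str, lifted: set,
              agentic: bool = False) -> str:
    """Map a request to 'off' / 'low' / 'high' using harness results."""
    if category in lifted:
        return "high"  # the eval showed a real score lift here
    if agentic:
        return "low"   # keep planning on between tool calls, but cheap
    if input_tokens < 200:
        return "off"   # short prompts rarely pay back reasoning tokens
    return "low"

# Hypothetical categories where a harness run showed a lift.
lifted = {"multi_step_math", "multi_doc_synthesis"}
print(pick_mode(80, "classification", lifted))         # off
print(pick_mode(1200, "multi_doc_synthesis", lifted))  # high
```

The router is deliberately dumb: all the intelligence lives in the `lifted` set, which is just the summary table turned into a lookup.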
## If this was useful
This is the kind of measurement-first thinking the Prompt Engineering Pocket Guide builds out across every Anthropic feature: caching, batching, tool-use, and now adaptive thinking. The book is short on purpose. It is the chapters you would have written if you had spent a year reading invoices instead of marketing posts. If you ship anything that calls Claude in a hot loop, the patterns in there pay for themselves the first time you push the right effort dial.
