Gabriel Anhaia

Posted on May 24

Reasoning Effort: Low, Medium, High: When Each Setting Actually Pays Off

#ai #llm #prompt #cost

Somebody on your team flipped reasoning_effort to high six weeks ago. The eval scores moved a percentage point. Nobody benchmarked the cost. Now the invoice is up 3x and the product manager is asking why.

Here's the part nobody puts on the marketing page: the high setting earns its keep on maybe three task types. On the other seventeen, it's a tax. You're paying for thinking tokens the model didn't need, on prompts that didn't reward them.

This post is the calibration nobody runs. What the dial actually does, where it pays back, where it burns money, and a 40-line script you can point at your own task set tonight.

What reasoning_effort actually does

OpenAI's reasoning_effort (on the o-series and GPT-5 reasoning models; spelled reasoning={"effort": ...} in the Responses API) and Anthropic's extended thinking with budget_tokens are the same idea wearing different uniforms. Both tell the model how much hidden chain-of-thought to generate before it commits to an answer.

OpenAI's version is a knob with three positions (low, medium, high, plus minimal on some models). Anthropic's version is a number: thinking.budget_tokens in the request, anywhere from 1024 to tens of thousands. Different shape, same outcome: the model emits more or fewer internal reasoning tokens before the visible response.

Those reasoning tokens are billed. They count against your output token quota even though you never see them. On OpenAI you can inspect them as output_tokens_details.reasoning_tokens in the response. On Anthropic the thinking blocks come back in the content array tagged type: "thinking", and the tokens land in usage.output_tokens.

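# OpenAI (Responses API)
from openai import OpenAI

openai_client = OpenAI()
resp = openai_client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},
    input="...",
)
print(resp.usage.output_tokens_details.reasoning_tokens)

# Anthropic (Messages API)
from anthropic import Anthropic

anthropic_client = Anthropic()
resp = anthropic_client.messages.create(
    model="claude-sonnet-4-5",
    thinking={"type": "enabled", "budget_tokens": 8000},
    max_tokens=16000,  # must be greater than budget_tokens
    messages=[{"role": "user", "content": "..."}],
)
print(resp.usage.output_tokens)  # includes thinking tokens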

The default on most reasoning models is medium. That's not a neutral choice. It's already paying for a chunk of hidden thought you may or may not need.
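To see how much of a response's output bill is hidden thinking, you can split the usage numbers. The helper below is a minimal sketch: reasoning_share is a hypothetical name, and the token counts in the usage line are made up for illustration. On OpenAI you would feed it usage.output_tokens and usage.output_tokens_details.reasoning_tokens.

```python
def reasoning_share(output_tokens: int, reasoning_tokens: int) -> dict:
    """Split billed output tokens into visible and hidden (reasoning) parts."""
    if reasoning_tokens > output_tokens:
        raise ValueError("reasoning tokens cannot exceed total output tokens")
    visible = output_tokens - reasoning_tokens
    # Guard against division by zero on empty responses.
    share = reasoning_tokens / output_tokens if output_tokens else 0.0
    return {"visible_tokens": visible, "reasoning_share": share}


# Illustrative numbers, not measurements.
print(reasoning_share(output_tokens=2000, reasoning_tokens=1500))
```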

The cost shape

Rough back-of-napkin from runs against a 200-prompt mixed task set:

low produces a few hundred reasoning tokens per response. Sometimes near zero.
medium produces 1k-4k reasoning tokens on hard prompts, a few hundred on easy ones.
high produces 4k-20k reasoning tokens, and on genuinely hard prompts will push past 30k.

In dollar terms: medium runs roughly 2x low. High runs 3-4x medium on the same prompts. So high is around 6-8x the cost of low, sometimes more on prompts where the model decides to really chew on it.

The catch is variance. High doesn't spend its full budget every time; it spends it when the prompt looks worth it. So your average cost depends entirely on your prompt mix. A workload that's mostly "extract three fields from this invoice" will cost almost the same on low and high. A workload that's mostly "solve this scheduling problem" will see the full 6-8x.

That's the trap. You enable high, your eval set is mostly easy prompts, the bill doesn't move much, you ship. Then production traffic hits a different mix and the cost ramps overnight.
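The arithmetic behind that trap fits in a few lines. This is a rough sketch, not a pricing model: the per-class multiples and both traffic mixes are hypothetical numbers picked to match the shape described above (easy prompts barely move, hard prompts see about 6x).

```python
def blended_cost_multiple(mix: dict[str, float], multiple: dict[str, float]) -> float:
    """Expected cost of `high` relative to `low` for a given prompt mix.

    mix: fraction of traffic per prompt class (must sum to 1).
    multiple: assumed high-vs-low cost multiple for each class.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(share * multiple[cls] for cls, share in mix.items())


# Hypothetical per-class multiples: easy prompts barely move, hard ones see ~6x.
MULTIPLE = {"easy": 1.1, "hard": 6.0}

eval_set = {"easy": 0.9, "hard": 0.1}    # what you benchmarked
production = {"easy": 0.4, "hard": 0.6}  # what actually arrives

print(f"eval set:   {blended_cost_multiple(eval_set, MULTIPLE):.2f}x")
print(f"production: {blended_cost_multiple(production, MULTIPLE):.2f}x")
```

With these made-up inputs the eval set lands around 1.6x and production around 4x, which is the overnight ramp in miniature.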

Where high earns its tokens

Three task families consistently reward the extra thinking budget:

Multi-step math and logic. Anything that requires holding intermediate state across more than three steps. Constraint satisfaction. Algebra word problems where one variable feeds another. Combinatorial counting. The model needs scratch space, and the thinking tokens are literally that scratch space.

Code generation with verification loops. Especially when the prompt includes test cases or a contract the output must satisfy. High thinking lets the model run mental simulations of the code against the tests, catch off-by-ones, fix them before emitting. The gap between medium and high on competitive-programming-style problems is real, often a 10-20% pass-rate swing.

Planning under constraints. Scheduling, resource allocation, "find the order to do these N things given these dependencies." Tasks where there's a search over options and the model has to evaluate each.

If your task looks like one of these, and you have a measurable success metric, high is worth a serious A/B test.

Where high is just expensive

The other seventeen task types. Specifically:

Classification. "Is this email spam?" "Which of these 12 categories does this ticket belong to?" The model knows the answer in the first 50 tokens of thought. Another 10,000 tokens of thinking adds nothing and sometimes adds noise.

Extraction. Pulling structured fields from text. The work is recognition, not reasoning. Low is fine. Often minimal is fine.

Short-form Q&A over provided context. RAG-style answers. If the retrieval is good, the model just needs to read and paraphrase. High burns tokens second-guessing facts that are right in the context window.

Summarisation. Same story. The work is compression, not deduction.

Style rewrites, translations, format conversions. These are surface transformations. Thinking budget is the wrong tool.

A team I work with had reasoning_effort: high on a customer-support classifier for two months. The eval delta over medium was 0.4%. The cost delta was 3x. After they dropped it to low, accuracy went down 1.1% and cost dropped 7x. They took the 1.1% and never looked back.

A 40-line A/B script

Don't argue about this in a planning meeting. Run it against your own prompts. The script below sweeps three settings, logs accuracy and token usage, prints a table. Plug in your own task list and grading function.

import json, time, statistics as stats
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"

# Your task set: a JSON list of objects like {"prompt": "...", "expected": "..."}
with open("eval_tasks.json") as f:
    TASKS = json.load(f)

def run(effort: str, prompt: str):
    t0 = time.time()
    r = client.responses.create(
        model=MODEL,
        reasoning={"effort": effort},
        input=prompt,
        max_output_tokens=30000,  # reasoning tokens count against this cap; leave room for high
    )
    return {
        "answer": r.output_text,
        "reasoning_tokens": r.usage.output_tokens_details.reasoning_tokens,
        "output_tokens": r.usage.output_tokens,
        "latency_s": time.time() - t0,
    }

def grade(task, answer: str) -> int:
    # Replace with your real grader. A case-insensitive substring match is the simplest case.
    return int(task["expected"].strip().lower() in answer.strip().lower())

results = {e: [] for e in ("low", "medium", "high")}
for task in TASKS:
    for effort in results:
        r = run(effort, task["prompt"])
        r["correct"] = grade(task, r["answer"])
        results[effort].append(r)

print(f"{'effort':<8} {'acc':>6} {'avg_reasoning':>14} {'avg_total':>11} {'p95_latency':>12}")
for effort, rs in results.items():
    acc = sum(r["correct"] for r in rs) / len(rs)
    avg_r = stats.mean(r["reasoning_tokens"] for r in rs)
    avg_t = stats.mean(r["output_tokens"] for r in rs)
    p95 = stats.quantiles([r["latency_s"] for r in rs], n=20)[18]
    print(f"{effort:<8} {acc:>6.2%} {avg_r:>14.0f} {avg_t:>11.0f} {p95:>12.2f}")

You'll get back something like:

effort      acc  avg_reasoning   avg_total  p95_latency
low      78.50%            312         487         2.14
medium   82.00%           1840        2103         5.81
high     82.50%           7920        8211        18.42

That table is the conversation. If your medium-to-high accuracy delta is 0.5% and your cost delta is 4x, that's your answer. If your low-to-medium accuracy delta is 8%, that's also your answer. The right setting is the one your data picks, not the one the docs default to.
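One way to turn the table into a decision is to pick the cheapest setting whose accuracy stays within a tolerance of the best one. The sketch below assumes that rule; pick_effort and max_acc_drop are hypothetical names, and average output tokens stand in for cost.

```python
def pick_effort(rows: dict[str, dict], max_acc_drop: float = 0.01) -> str:
    """Return the cheapest setting whose accuracy is within max_acc_drop of the best.

    rows maps effort -> {"acc": fraction correct, "avg_total": mean output tokens}.
    """
    best_acc = max(r["acc"] for r in rows.values())
    eligible = [e for e, r in rows.items() if best_acc - r["acc"] <= max_acc_drop]
    # Output tokens are used as the cost proxy here.
    return min(eligible, key=lambda e: rows[e]["avg_total"])


# The sample table above, as data.
table = {
    "low": {"acc": 0.785, "avg_total": 487},
    "medium": {"acc": 0.820, "avg_total": 2103},
    "high": {"acc": 0.825, "avg_total": 8211},
}
print(pick_effort(table))                     # tolerate a 1-point drop
print(pick_effort(table, max_acc_drop=0.05))  # tolerate a 5-point drop
```

On the sample numbers, a 1-point tolerance picks medium and a 5-point tolerance picks low. The tolerance is a product decision, not something the script can choose for you.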

For Anthropic, swap the request shape to thinking={"type": "enabled", "budget_tokens": N} with N in [1024, 4000, 16000], then pull usage.output_tokens for cost. The structure of the script doesn't change.

The hybrid router

Once you've measured, the next move isn't "pick one setting forever." It's a router: classify the incoming task, pick the effort per request.

The classifier itself runs on a small fast model. Its cost is negligible compared to what you save by not running high on every classification task.

ROUTER_MODEL = "gpt-5-nano"  # cheap, fast
WORKER_MODEL = "gpt-5"

ROUTER_PROMPT = """Classify the user task into one of:
- TRIVIAL: classification, extraction, short Q&A, format conversion
- MODERATE: summarisation, rewrite, multi-paragraph explanation
- HARD: multi-step math, code with tests, planning, debugging
Respond with one word."""

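def route(task: str) -> str:
    r = client.responses.create(
        model=ROUTER_MODEL,
        reasoning={"effort": "minimal"},  # the router is a reasoning model too
        input=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": task},
        ],
        max_output_tokens=200,  # a 10-token cap can be used up by reasoning before the label
    )
    return r.output_text.strip().upper()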

EFFORT_MAP = {"TRIVIAL": "low", "MODERATE": "medium", "HARD": "high"}

def answer(task: str):
    label = route(task)
    effort = EFFORT_MAP.get(label, "medium")  # safe default
    return client.responses.create(
        model=WORKER_MODEL,
        reasoning={"effort": effort},
        input=task,
        max_output_tokens=30000,  # leave room for high-effort reasoning
    )

Three pieces of advice once you wire this up:

Log the router's classification alongside the worker's response. You want to see when the router said TRIVIAL and the worker came back with a 6000-token reasoning trace. That's a misclassification you can correct with one more example in the router prompt.

Default to medium on uncertainty, not high. The router will misfire. Misfiring upward costs you 4x. Misfiring downward costs you a percentage point of accuracy. Pick the cheaper failure mode.

Run the router on a sample of traffic, not 100%, until you trust it. A traffic mirror that runs the router on 5% of requests and logs the implied savings is enough to convince finance.
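The first piece of advice, spotting TRIVIAL labels that produced long reasoning traces, can be a simple filter over your logs. This is a sketch under assumptions: the record fields (label, reasoning_tokens) and the 2000-token ceiling are placeholders for whatever your logging actually stores.

```python
def flag_router_misses(records: list[dict], trivial_ceiling: int = 2000) -> list[dict]:
    """Return log records where the router said TRIVIAL but the worker thought hard.

    Each record is assumed to carry the router's label and the worker's
    reasoning token count; the field names and the ceiling are placeholders.
    """
    return [
        r for r in records
        if r["label"] == "TRIVIAL" and r["reasoning_tokens"] > trivial_ceiling
    ]


# Made-up log lines for illustration.
log = [
    {"id": "a1", "label": "TRIVIAL", "reasoning_tokens": 120},
    {"id": "b2", "label": "TRIVIAL", "reasoning_tokens": 6000},
    {"id": "c3", "label": "HARD", "reasoning_tokens": 9500},
]
for miss in flag_router_misses(log):
    print(miss["id"], miss["reasoning_tokens"])
```

The flagged prompts are the candidates for new examples in the router prompt.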

The max_tokens gotcha

This one quietly breaks production and looks like a model bug.

reasoning_effort: high can emit 10k-30k thinking tokens. Those tokens count against max_output_tokens on OpenAI's Responses API (max_completion_tokens on Chat Completions) and against max_tokens on Anthropic, where budget_tokens must be strictly less than max_tokens. If you set the cap to 4000 because that's what your old gpt-4o-mini calls used, the response can be cut off right in the middle of reasoning: Chat Completions reports finish_reason: "length", the Responses API reports status: "incomplete" with incomplete_details.reason set to "max_output_tokens", and Anthropic reports stop_reason: "max_tokens". The user-visible output is sometimes blank. The model didn't even reach the answer.

The OpenAI docs are explicit about this and most people miss it: reserve enough budget for the reasoning tail. A rule of thumb that's worked across a few deployments:

low: output token cap = visible answer budget + 1000
medium: visible answer budget + 5000
high: visible answer budget + 25000
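The rule of thumb above is easy to encode so nobody hardcodes a cap inherited from an older model. A minimal sketch; output_cap and REASONING_HEADROOM are hypothetical names, and the headroom values are the ones from the list, not official limits.

```python
# Reasoning headroom per effort level, taken from the rule of thumb above.
REASONING_HEADROOM = {"low": 1000, "medium": 5000, "high": 25000}


def output_cap(effort: str, answer_budget: int) -> int:
    """Token cap to request: visible answer budget plus reasoning headroom."""
    if effort not in REASONING_HEADROOM:
        raise ValueError(f"unknown effort: {effort!r}")
    return answer_budget + REASONING_HEADROOM[effort]


print(output_cap("high", 4000))
```

Pass the result as max_output_tokens on the Responses API, so the cap moves with the effort level instead of staying at whatever the previous model needed.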

For Anthropic with extended thinking, you must set max_tokens > budget_tokens, and you should leave at least a few thousand on top of the budget for the actual response. Setting max_tokens=8000, budget_tokens=8000 is the most common mistake, and the API rejects it because the budget must be strictly less than max_tokens. Setting max_tokens only slightly above the budget passes validation, but once the model spends the budget on thinking there is almost nothing left for the answer.

If you see finish_reason: "length" (or, on the Responses API, an incomplete status with reason max_output_tokens) showing up after enabling reasoning, this is almost certainly it. The fix is one line. The bug looks like the model lost its mind.

What to do this week

Run the script on 100 prompts from your real traffic. Print the table. If high doesn't beat medium by enough to justify the cost on your task mix, drop to medium and pocket the difference. If low matches medium, drop further. Then add the router for the long tail.

The thing the dial rewards isn't more effort. It's more thinking on the prompts that need it, and no thinking on the prompts that don't. That's a measurement problem, not a model problem.

What's your current reasoning_effort setting in production, and have you ever actually measured the accuracy delta against the cost delta on your real traffic? Drop the numbers in the comments.

If this was useful

This is the kind of calibration the Prompt Engineering Pocket Guide covers in the chapter on cost-aware prompting — when reasoning budgets pay back, how to design routers, and the failure modes that look like model bugs but are really configuration mistakes. If you ship LLM features and the bill is starting to hurt, that's the chapter to read first.