AIchain Reasoning: One Parameter for Every Provider

#llm #api #ai

OpenAI calls it reasoning_effort. Anthropic calls it budget_tokens. Google calls it thinkingBudget. Kimi calls it thinking. Qwen calls it enable_thinking. DeepSeek skips the parameter entirely and routes you to a separate model. Same idea, six implementations — and real cost implications if you use it carelessly.

Every major model provider now ships some version of "let the model think longer before answering." The concept is identical: allocate extra compute at inference time so the model can work through a problem step by step before committing to a response. But each provider wraps that in a different API parameter, with different value types, different scaling behavior, and different documentation you'll need to dig through.

I got tired of reading all of them.

Six Providers, Six Parameter Names, One Idea

Here's what the landscape actually looks like when you strip away the marketing:

Provider	Native Parameter	Value Format
OpenAI	`reasoning_effort`	`"low"`, `"medium"`, `"high"`
Anthropic	`budget_tokens`	Integer (token count)
Google	`thinkingBudget`	Integer (token count)
Kimi	`thinking`	Enabled / routes to a thinking model
DeepSeek	(no param)	Routes entire request to `deepseek-reasoner`
Qwen	`enable_thinking`	Boolean

OpenAI gives you a string enum. Anthropic and Google want a raw token budget — but on different scales. Kimi toggles a flag or swaps the model entirely. DeepSeek doesn't even have a parameter; it redirects your call to a completely separate model. Qwen gives you a boolean, take it or leave it.

If you're building anything that runs across multiple providers — for cost optimization, fallback routing, or A/B testing — this fragmentation is a genuine problem. You end up writing provider-specific branching logic that has nothing to do with your actual application.
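For a sense of what that branching looks like when written by hand, here's a rough sketch. The function name, the provider labels, and the token budgets are my own placeholders, and where each field sits in the real request body depends on the SDK you use, so treat it as an illustration rather than a drop-in client.

```python
from typing import Optional

# Hypothetical token budgets per level; real values depend on the model and your limits.
BUDGETS = {"low": 1024, "medium": 4096, "high": 16384}


def native_request(provider: str, model: str, level: Optional[str]) -> dict:
    """Build the provider-specific request fragment for one reasoning level."""
    request = {"model": model}

    if provider == "qwen":
        # Boolean only: any level turns thinking on, no level turns it off.
        request["enable_thinking"] = level is not None
        return request

    if level is None:
        return request  # standard inference, nothing to add

    if provider == "openai":
        request["reasoning_effort"] = level
    elif provider == "anthropic":
        request["thinking"] = {"type": "enabled", "budget_tokens": BUDGETS[level]}
    elif provider == "google":
        # Nested under generationConfig in the REST API.
        request["thinkingConfig"] = {"thinkingBudget": BUDGETS[level]}
    elif provider == "deepseek":
        request["model"] = "deepseek-reasoner"  # no parameter, separate model
    elif provider == "kimi":
        request["model"] = "kimi-k2-thinking"  # swap to the thinking variant
    else:
        raise ValueError(f"Unknown provider: {provider!r}")
    return request
```

Six branches, and none of them has anything to do with the prompt you actually wanted to send.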

One Parameter That Translates Everywhere

In aichain, the reasoning interface is a single key in the model options:

from yait_aichain import Model, Skill

skill = Skill(
    model=Model("gpt-4o", options={"reasoning": "high"}),
    input={"messages": [{"role": "user", "parts": ["{prompt}"]}]},
)
result = skill.run(variables={"prompt": "Solve this step by step: ..."})
print(result.content)  # result.content holds the model's response text

The reasoning option accepts three levels ("low", "medium", "high"), or you can omit it entirely for standard inference. The library handles translation to each provider's native format automatically. For Qwen, enable_thinking is set to True for any reasoning level and False when reasoning is omitted. Qwen has no partial levels, so "low", "medium", and "high" all map to True.

Swapping the model is a one-line change, and the reasoning behavior follows:

from yait_aichain import Model, Skill

PROMPT = {
    "messages": [{"role": "user", "parts": ["Prove that sqrt(2) is irrational. Show your work."]}]
}

models = [
    Model("gpt-4o",             options={"reasoning": "high"}),  # → reasoning_effort: "high"
    Model("claude-sonnet-4-5",  options={"reasoning": "high"}),  # → budget_tokens: large value
    Model("gemini-2.5-pro",     options={"reasoning": "high"}),  # → thinkingBudget: large value
    Model("kimi-k2",            options={"reasoning": "high"}),  # → routes to thinking variant
    Model("deepseek-chat",      options={"reasoning": "high"}),  # → routes to deepseek-reasoner
    Model("qwen-plus",          options={"reasoning": "high"}),  # → enable_thinking: true
]

for model in models:
    skill = Skill(model=model, input=PROMPT)
    result = skill.run()
    print(f"{model}: {result.content[:200]}")

Six providers, six different native implementations, zero branching logic in your code. The provider is inferred from the model name prefix — gpt-* routes to OpenAI, claude-* to Anthropic, gemini-* to Google, and so on — and the reasoning parameter gets translated accordingly.
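Prefix-based routing like this is simple to picture. The sketch below is my guess at the general shape, not the library's actual code: the gpt-, claude-, and gemini- prefixes come from the description above, while the other three prefixes and all of the provider labels are assumptions based on the model names in the example.

```python
# Hypothetical prefix table; the last three entries are assumed from the example model names.
PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
    "kimi-": "kimi",
    "deepseek-": "deepseek",
    "qwen-": "qwen",
}


def infer_provider(model_name: str) -> str:
    """Return the provider label whose prefix matches the model name."""
    for prefix, provider in PREFIXES.items():
        if model_name.startswith(prefix):
            return provider
    raise ValueError(f"No provider known for model: {model_name!r}")


print(infer_provider("claude-sonnet-4-5"))  # anthropic
```

Failing loudly on an unknown prefix is a deliberate choice in this sketch: a silent default would send the request, and the bill, to the wrong provider.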

Some providers have quirks that the library absorbs. DeepSeek doesn't have a reasoning toggle; it has an entirely separate model called deepseek-reasoner. When you pass options={"reasoning": "high"} on deepseek-chat, the library silently reroutes your request to deepseek-reasoner:

from yait_aichain import Model, Skill

# Without reasoning: hits deepseek-chat
skill = Skill(
    model=Model("deepseek-chat"),
    input={"messages": [{"role": "user", "parts": ["Explain gradient descent."]}]},
)

# With reasoning: library automatically routes to deepseek-reasoner
skill_reasoning = Skill(
    model=Model("deepseek-chat", options={"reasoning": "high"}),
    input={"messages": [{"role": "user", "parts": ["Explain gradient descent."]}]},
)

Kimi works similarly. The library maps kimi-k2 to the matching Kimi API model slug, and the reasoning flag switches it to the thinking variant. You can also target the thinking model directly; both approaches hit the same endpoint:

from yait_aichain import Model, Skill

# Direct model targeting
skill_a = Skill(
    model=Model("kimi-k2-thinking"),
    input={"messages": [{"role": "user", "parts": ["Find the bug in this code: {code}"]}]},
)

# Reasoning flag on the base model — library resolves to the same endpoint
skill_b = Skill(
    model=Model("kimi-k2", options={"reasoning": "high"}),
    input={"messages": [{"role": "user", "parts": ["Find the bug in this code: {code}"]}]},
)

The Most Expensive Token Is the One You Didn't Need

Before you turn reasoning on everywhere, the cost math matters. Reasoning tokens are billed as output tokens, and they accumulate fast. A single call to OpenAI's o3 at high effort can generate 10–20× the token count of the same task on gpt-4o. Anthropic's extended thinking bills thinking tokens at the output rate, up to the budget_tokens limit you set. If a call uses a full 10,000-token budget on Claude Sonnet, priced at $15 per million output tokens, that's roughly $0.15 extra per call just for the thinking.

Scale that across thousands of requests and the math turns ugly quickly.
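To see how quickly it adds up, here's a back-of-the-envelope sketch. Every input is a made-up illustration (the price, the thinking tokens per call, the call volume), so plug in your own provider's current rate and your own traffic.

```python
def thinking_cost_usd(thinking_tokens: int, price_per_million_usd: float) -> float:
    """Cost of thinking tokens when they are billed at the output-token rate."""
    return thinking_tokens * price_per_million_usd / 1_000_000


# Illustrative inputs only, not any provider's actual rate card.
PRICE_PER_MILLION = 10.0   # USD per million output tokens (assumed)
TOKENS_PER_CALL = 8_000    # thinking tokens per request (assumed)
CALLS_PER_MONTH = 50_000   # request volume (assumed)

per_call = thinking_cost_usd(TOKENS_PER_CALL, PRICE_PER_MILLION)
per_month = thinking_cost_usd(TOKENS_PER_CALL * CALLS_PER_MONTH, PRICE_PER_MILLION)

print(f"per call:  ${per_call:.2f}")    # per call:  $0.08
print(f"per month: ${per_month:.2f}")   # per month: $4000.00
```

Eight cents looks harmless on a single request. Four thousand dollars a month for thinking that a summarization task never needed does not.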

Google's technical report on Gemini 2.5 Pro shows meaningful gains on the MATH benchmark when thinking is enabled. But that result applies to math problems. Not every task is a math problem.

A rough breakdown of where reasoning earns its cost and where it doesn't:

Worth the extra tokens:

Competitive programming
Formal proofs
Complex debugging
Multi-constraint optimization
Multi-step logical reasoning

Rarely worth it:

Text summarization
Sentiment analysis
Simple classification
Creative writing
Data formatting
Translation

The pattern is consistent: reasoning shines on tasks with a verifiably correct answer. If there's a right answer and a wrong answer — a proof that holds or doesn't, code that passes tests or fails — reasoning will find it more reliably. For generative or subjective tasks, you're paying a premium for the model to overthink something that doesn't benefit from overthinking. Standard inference is faster, cheaper, and produces output that's just as good.

When to Turn the Dial

Start with no reasoning. Measure quality. If it's failing on correctness — wrong calculations, flawed logic, missed edge cases — bump to "medium" and measure again. Only go to "high" when medium falls short on tasks you can objectively evaluate. This matters because "medium" often captures most of the quality gain at a fraction of the cost of "high".

The practical workflow: define your eval set, run it at each reasoning level, compare quality scores against token costs. The universal parameter makes this straightforward — changing reasoning level is one string, swapping the underlying model is one line. You're comparing apples to apples across providers without rewriting your integration each time.
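Here's one way that escalation loop might look, as a sketch. The pick_level helper and the evaluate callback are hypothetical names of mine; in practice evaluate would run your eval set through a Skill at the given reasoning level and return a quality score plus the output tokens spent. The numbers in the stub are placeholders, not measurements.

```python
from typing import Callable, List, Optional, Tuple

# Escalation order from the text: no reasoning first, then "medium", then "high".
LEVELS: Tuple[Optional[str], ...] = (None, "medium", "high")

Result = Tuple[Optional[str], float, int]  # (level, quality score, output tokens)


def pick_level(
    evaluate: Callable[[Optional[str]], Tuple[float, int]],
    target: float,
) -> Tuple[Optional[str], bool, List[Result]]:
    """Try levels cheapest-first and stop at the first one that meets the target score."""
    results: List[Result] = []
    for level in LEVELS:
        score, tokens = evaluate(level)
        results.append((level, score, tokens))
        if score >= target:
            return level, True, results
    # Nothing met the target: report the highest level tried and flag the miss.
    return LEVELS[-1], False, results


# Stub with placeholder numbers, standing in for a real eval run.
PLACEHOLDER_RUNS = {None: (0.70, 400), "medium": (0.90, 2500), "high": (0.92, 9000)}


def fake_evaluate(level: Optional[str]) -> Tuple[float, int]:
    return PLACEHOLDER_RUNS[level]


level, met_target, results = pick_level(fake_evaluate, target=0.85)
print(level, met_target)  # medium True
```

Stopping at the first level that clears the bar means "high" never runs unless "medium" has already failed, which keeps the eval itself from becoming the expensive part.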

One practical note: you'll need valid API credentials for each provider set as environment variables before any of this works. The library picks up standard provider key names automatically, but an auth error will stop you before reasoning ever comes into play.
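A quick preflight check can save a confusing first run. The variable names below are assumptions on my part (they follow the names the vendors' own SDKs and docs commonly use), so confirm against the library's documentation which names it actually reads.

```python
import os
from typing import Iterable, List, Mapping

# Assumed environment variable names; verify what the library really looks for.
ASSUMED_KEY_NAMES = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def missing_keys(
    providers: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the variable names that are unset or empty for the given providers."""
    return [
        ASSUMED_KEY_NAMES[p]
        for p in providers
        if not env.get(ASSUMED_KEY_NAMES[p])
    ]


missing = missing_keys(["openai", "anthropic"])
if missing:
    print("Set these before running:", ", ".join(missing))
```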

The provider fragmentation will keep growing as each company iterates on its approach. You don't need to track all of it. You need the model to think when thinking helps, and skip it when it doesn't — and one parameter handles that regardless of which provider you're calling.