If you have ever checked the pricing page for OpenAI, Anthropic, or Google and wondered why there are three different token prices listed, you are not alone. The distinction between input tokens, output tokens, and reasoning tokens is one of the most misunderstood aspects of LLM pricing - and getting it wrong can mean overspending by 5-10x on your AI workloads.
This guide breaks down exactly what each token type is, why they cost different amounts, and how to optimize your spending - whether you are building an AI application, running code reviews, or just trying to understand your API bill.
What Are Tokens in LLMs?
Before diving into pricing, let's clarify what tokens actually are. A token is the fundamental unit of text that large language models process. It is not a word, not a character, but something in between.
On average, one token equals roughly 4 characters or 0.75 words in English. A long word like "understanding" may be a single token or split into several, depending on the tokenizer. A line of Python code like def calculate_total(items): is about 8 tokens.
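For quick back-of-envelope estimates, the 4-characters-per-token rule of thumb is enough. Here is a minimal sketch of such an estimator; for exact counts you would use a real tokenizer library (such as OpenAI's tiktoken), since actual tokenization varies by model.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb.

    Real tokenizers give exact, model-specific counts; this heuristic is
    only for quick cost estimates before you have usage data.
    """
    return max(1, round(len(text) / 4))

# The 27-character Python line from above lands near the ~8 tokens cited:
print(estimate_tokens("def calculate_total(items):"))  # 7
```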
Every API call involves two phases:
- Input processing - the model reads your prompt (input tokens)
- Output generation - the model writes its response (output tokens)
Some newer models add a third phase:
- Reasoning - the model thinks through the problem before responding (reasoning tokens)
Each phase has different computational costs, which is why providers charge differently for each.
Input Tokens - What You Send
Input tokens represent everything you send to the model in your API request. This includes:
- System prompts - instructions that define the model's behavior
- User messages - your actual question or request
- Context - code snippets, documents, conversation history
- Few-shot examples - example inputs and outputs you provide
For AI code review, input tokens typically make up the bulk of your usage. When a tool like CodeRabbit or CodeAnt AI reviews a pull request, it sends the diff, surrounding code context, repository rules, and review instructions as input. A single PR review can easily consume 10,000-50,000 input tokens depending on the size of the change.
Why Input Tokens Are the Cheapest
Input tokens cost less because the model processes them in parallel using a single forward pass through the neural network. All input tokens are read simultaneously, making this phase computationally efficient. The GPU can process thousands of input tokens in roughly the same time it takes to process a few hundred.
Output Tokens - What You Receive
Output tokens are the tokens in the model's response. Every word, code snippet, and explanation the model generates counts as output tokens.
Output tokens are consistently more expensive than input tokens across every major provider. The ratio varies by provider - from 1.5x at the low end to over 8x - with 4-5x being the most common range.
Why Output Tokens Cost More
The reason is fundamental to how LLMs work. During output generation, the model must:
- Predict one token at a time (autoregressive generation)
- Run a full forward pass through the entire network for each token
- Maintain the full attention context from all previous tokens
- Store and update KV (key-value) cache for each new token
This sequential process means generating 1,000 output tokens requires roughly 1,000 separate forward passes, while reading 1,000 input tokens requires just one. The GPU utilization during output generation is also less efficient because it processes one token per step instead of batching thousands.
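The asymmetry between the two phases can be sketched as a toy accounting of forward passes. This is not a real inference engine, just an illustration of the claim above: prefill handles the whole prompt in one parallel pass, while decode pays one pass per generated token.

```python
def forward_passes(input_tokens: int, output_tokens: int) -> dict:
    """Toy accounting of transformer inference work.

    Prefill reads all input tokens in a single parallel forward pass;
    decode runs one forward pass per generated token (autoregressive).
    """
    return {
        "prefill_passes": 1,              # entire prompt processed at once
        "decode_passes": output_tokens,   # one pass per output token
    }

# Reading 1,000 input tokens vs generating 1,000 output tokens:
print(forward_passes(1000, 1000))  # {'prefill_passes': 1, 'decode_passes': 1000}
```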
This is why a verbose model response is not just annoying - it is literally expensive.
Reasoning Tokens - The Hidden Cost
Reasoning tokens are the newest and most misunderstood token type. Introduced with OpenAI's o1 model and Anthropic's extended thinking feature, reasoning tokens represent the model's internal chain-of-thought process.
When you ask a reasoning model to solve a complex problem, it does not jump straight to the answer. Instead, it generates an internal monologue - breaking the problem into steps, considering approaches, checking its work, and then producing the final response.
How Reasoning Tokens Work
Here is what happens when you send a request to a reasoning model:
- The model reads your input tokens (same as any model)
- The model generates reasoning tokens - internal thinking that you do not see
- The model generates output tokens - the visible response
The critical detail: reasoning tokens are billed at the output token rate because they require the same expensive sequential generation process. But they are not visible in the API response.
This means a request that returns a 500-token response might actually consume 3,000 or more total output tokens - 2,500 for reasoning and 500 for the visible answer.
Models That Use Reasoning Tokens
| Model | Provider | Reasoning Type | Reasoning Visible? |
|---|---|---|---|
| o1 | OpenAI | Built-in chain-of-thought | No (summary only) |
| o3 | OpenAI | Built-in chain-of-thought | No (summary only) |
| o4-mini | OpenAI | Built-in chain-of-thought | No (summary only) |
| Claude Opus 4.5+ | Anthropic | Extended thinking | Yes (thinking blocks) |
| Claude Sonnet 4.5+ | Anthropic | Extended thinking | Yes (thinking blocks) |
| Gemini 2.5 Pro | Google | Thinking mode | Yes (thought summaries) |
One important difference: Anthropic's extended thinking tokens are visible to the developer (returned as thinking blocks), while OpenAI's reasoning tokens are hidden. Both are billed at output rates.
LLM Token Pricing Comparison (March 2026)
Here is the current pricing for major LLM providers. All prices are per million tokens.
Standard Models (No Reasoning)
| Model | Input (per 1M) | Output (per 1M) | Output/Input Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| GPT-4o-mini | $0.15 | $0.60 | 4x |
| GPT-4.1 | $2.00 | $8.00 | 4x |
| GPT-4.1-mini | $0.40 | $1.60 | 4x |
| Claude Opus 4.5 | $5.00 | $25.00 | 5x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 5x |
| Gemini 2.5 Pro | $1.25 | $10.00 | 8x |
| Gemini 2.5 Flash | $0.30 | $2.50 | 8.3x |
| DeepSeek V3 | $0.28 | $0.42 | 1.5x |
Reasoning Models
| Model | Input (per 1M) | Output (per 1M) | Reasoning Rate | Notes |
|---|---|---|---|---|
| o1 | $15.00 | $60.00 | Same as output | Most expensive reasoning model |
| o3 | $2.00 | $8.00 | Same as output | Best price-to-reasoning ratio |
| o4-mini | $1.10 | $4.40 | Same as output | Budget reasoning option |
| Claude Sonnet (thinking) | $3.00 | $15.00 | Same as output | Extended thinking enabled |
| Claude Opus (thinking) | $5.00 | $25.00 | Same as output | Extended thinking enabled |
| Gemini 2.5 Pro (thinking) | $1.25 | $10.00 | Same as output | Thinking mode enabled |
Notice that reasoning models do not have a separate per-token rate for reasoning. Instead, reasoning tokens are simply billed as output tokens. The cost impact comes from the volume - a reasoning model might generate 5-10x more total output tokens than a standard model for the same request.
Real-World Cost Calculations
Let's work through some practical examples to show how token types affect your actual spending.
Example 1 - Simple Code Review with GPT-4o
You send a 200-line diff for review with a system prompt and code context.
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| System prompt | 2,000 | $2.50/1M input | $0.005 |
| Code diff + context | 8,000 | $2.50/1M input | $0.020 |
| Review response | 1,500 | $10.00/1M output | $0.015 |
| Total | 11,500 | | $0.040 |
At this rate, reviewing 100 PRs per month costs about $4.00.
Example 2 - Deep Security Review with o3
Same diff, but you want the model to reason through potential security vulnerabilities.
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| System prompt | 2,000 | $2.00/1M input | $0.004 |
| Code diff + context | 8,000 | $2.00/1M input | $0.016 |
| Reasoning tokens (hidden) | 6,000 | $8.00/1M output | $0.048 |
| Visible response | 2,000 | $8.00/1M output | $0.016 |
| Total | 18,000 | | $0.084 |
The reasoning tokens more than doubled the cost compared to a standard model, even though the visible output was similar in length.
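The arithmetic in both examples reduces to one formula: input tokens at the input rate, plus visible and hidden reasoning tokens at the output rate. A minimal sketch, using the rates from the tables above:

```python
def request_cost(input_toks: int, output_toks: int,
                 in_rate: float, out_rate: float,
                 reasoning_toks: int = 0) -> float:
    """Cost of one API request in dollars; rates are $ per 1M tokens.

    Reasoning tokens are billed at the output rate, matching how
    OpenAI and Anthropic price hidden/extended thinking.
    """
    return (input_toks * in_rate
            + (output_toks + reasoning_toks) * out_rate) / 1_000_000

# Example 1: GPT-4o review ($2.50 in / $10.00 out)
print(round(request_cost(10_000, 1_500, 2.50, 10.00), 3))  # 0.04

# Example 2: o3 security review ($2.00 in / $8.00 out, 6k hidden reasoning tokens)
print(round(request_cost(10_000, 2_000, 2.00, 8.00, reasoning_toks=6_000), 3))  # 0.084
```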
Example 3 - High-Volume Review with GPT-4o-mini
For a team doing 500 PR reviews per month with lighter analysis.
| Component | Tokens per Review | Monthly Total | Rate | Monthly Cost |
|---|---|---|---|---|
| Input tokens | 8,000 | 4,000,000 | $0.15/1M | $0.60 |
| Output tokens | 1,000 | 500,000 | $0.60/1M | $0.30 |
| Total | 9,000 | 4,500,000 | | $0.90/month |
That is less than a dollar per month for 500 code reviews. This is why smaller models are so attractive for high-volume, lower-complexity tasks.
Why the Output-to-Input Ratio Matters
The output-to-input cost ratio varies significantly across providers, and understanding this ratio is crucial for cost optimization.
Google's Gemini models have the highest ratio at 8x, meaning output tokens cost eight times more than input tokens. If your use case is output-heavy (generating long code, documentation, or detailed reviews), Gemini becomes comparatively more expensive on the output side despite having competitive input prices.
DeepSeek has the lowest ratio at just 1.5x, making it the most balanced option for output-heavy workloads. However, DeepSeek's overall quality for code review tasks may not match GPT-4o or Claude for certain languages.
OpenAI and Anthropic sit at 4-5x, which is the most common range across the industry.
The practical takeaway: if you can keep your outputs short and your inputs lean, you will save the most money with high-ratio providers. If your task requires long outputs, a lower-ratio provider might be cheaper overall even if its base rates are higher.
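This takeaway is easy to check numerically. The sketch below compares an output-heavy workload across three providers using the rates from the pricing table; the low-ratio provider wins by a wide margin even though its headline input price is not the lowest factor at play.

```python
RATES = {  # $ per 1M tokens (input, output), from the pricing table above
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-sonnet-4.5": (3.00, 15.00),
    "deepseek-v3": (0.28, 0.42),
}

def workload_cost(model: str, input_toks: int, output_toks: int) -> float:
    """Dollar cost of one request for a given model's rates."""
    in_rate, out_rate = RATES[model]
    return (input_toks * in_rate + output_toks * out_rate) / 1_000_000

# Output-heavy job: 5k tokens in, 20k tokens out
for model in RATES:
    print(model, round(workload_cost(model, 5_000, 20_000), 4))
```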
Prompt Caching - The Biggest Cost Saver
Prompt caching is the single most effective way to reduce LLM API costs, especially for AI code review workloads where the same system prompt and repository context are sent with every request.
How Prompt Caching Works
Instead of reprocessing the same prompt prefix on every request, the provider stores it in GPU memory. Subsequent requests that share the same prefix get a significant discount on those cached input tokens.
Caching Pricing by Provider
| Provider | Cache Write Cost | Cache Read Cost | Cache Duration |
|---|---|---|---|
| Anthropic | 1.25x base input | 0.1x base input | 5 minutes |
| Anthropic (extended) | 2x base input | 0.1x base input | 1 hour |
| OpenAI | 1x base input (automatic) | 0.5x base input | Up to 1 hour |
| Google | 1x base input | 0.25x base input | Varies |
Anthropic's caching is the most aggressive - a cache read costs just 10% of the normal input rate. For Claude Sonnet at $3/1M input tokens, cached reads cost just $0.30/1M. The 1.25x write overhead pays for itself after a single cache hit.
Caching Example for Code Review
Suppose your AI code review tool uses a 3,000-token system prompt with every review. Over 100 reviews:
Without caching (Claude Sonnet):
- 3,000 tokens x 100 reviews = 300,000 input tokens
- Cost: 300,000 / 1M x $3.00 = $0.90
With caching (Claude Sonnet):
- First request (cache write): 3,000 tokens x $3.75/1M = $0.011
- 99 cached requests: 3,000 x 99 = 297,000 tokens x $0.30/1M = $0.089
- Total: $0.10
That is an 89% reduction just from caching the system prompt. In practice, you can also cache repository-level context, coding standards, and review rules for even greater savings.
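The caching example above can be reproduced with a few lines. The defaults here assume Anthropic's 5-minute cache pricing (1.25x write, 0.1x read) and that every request after the first is a cache hit - a best-case assumption.

```python
def cached_prompt_cost(prefix_toks: int, requests: int, base_rate: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of sending a cached prompt prefix `requests` times.

    Defaults model Anthropic's 5-minute cache (1.25x write, 0.1x read).
    Assumes every request after the first hits the cache.
    """
    write = prefix_toks * base_rate * write_mult / 1_000_000
    reads = prefix_toks * (requests - 1) * base_rate * read_mult / 1_000_000
    return write + reads

uncached = 3_000 * 100 * 3.00 / 1_000_000       # $0.90 without caching
cached = cached_prompt_cost(3_000, 100, 3.00)   # ~$0.10 with caching
print(round(cached, 2), f"{1 - cached / uncached:.0%} saved")  # 0.1 89% saved
```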
How AI Code Review Tools Handle Token Costs
Understanding token economics explains many of the design decisions behind modern AI code review tools. Here is how the major tools manage costs:
Diff-Based Analysis
Tools like CodeRabbit, CodeAnt AI, and Sourcery only send the changed code (the diff) rather than entire files. This dramatically reduces input tokens. A 10-line change in a 500-line file uses roughly 95% fewer input tokens than sending the whole file.
Model Routing
Sophisticated tools route different types of analysis to different models:
- Style and formatting checks go to cheap, fast models like GPT-4o-mini or Gemini Flash
- Logic and bug detection go to mid-tier models like GPT-4o or Claude Sonnet
- Security vulnerability analysis may use reasoning models like o3 for deeper analysis
This tiered approach can reduce costs by 60-80% compared to using a single premium model for everything.
Prompt Caching for Repository Context
When a code review tool indexes your repository, it builds a context document with your coding standards, common patterns, and project structure. This context is cached and reused across every PR review, saving thousands of input tokens per request.
Flat Pricing Abstraction
Most AI code review tools charge a flat per-user monthly fee rather than passing through token costs directly. This means the tool vendor absorbs the token cost risk and optimizes internally. Tools like CodeRabbit ($15/user/month) and CodeAnt AI (free for public repos) handle all the token math behind the scenes.
8 Strategies to Optimize Your LLM Token Costs
Whether you are building your own AI-powered tool or managing API costs directly, these strategies will help you spend less without sacrificing quality.
1. Use Prompt Caching Aggressively
Put your longest, most stable content at the beginning of your prompt. System instructions, few-shot examples, and reference documents are prime candidates for caching. Structure prompts so the cached prefix is reused across as many requests as possible.
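As a concrete illustration, here is a sketch of an Anthropic Messages API request body that marks the stable prefix as cacheable via the cache_control field. The model name and prompt text are placeholders; check your provider's current docs for exact field names, since OpenAI's caching is automatic and needs no annotation.

```python
# Sketch of an Anthropic-style request body with a cacheable system prompt.
# The stable, reusable content goes first and is tagged with cache_control;
# the per-request content (the diff) goes in the user message.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code reviewer. <long, stable instructions...>",
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    "messages": [
        {"role": "user", "content": "Review this diff: <per-request content>"}
    ],
}

print(request_body["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```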
2. Choose the Right Model for the Task
Do not use o3 or Claude Opus for tasks that GPT-4o-mini can handle. Create a model routing strategy:
- Tier 1 (cheap): Formatting, linting, simple classification - GPT-4o-mini, Gemini Flash
- Tier 2 (balanced): Code review, bug detection, summarization - GPT-4o, Claude Sonnet
- Tier 3 (premium): Security analysis, architecture review, complex reasoning - o3, Claude Opus
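The tiers above can be sketched as a simple lookup-based router. The task names and the default tier are illustrative assumptions; a production router would classify tasks dynamically rather than from a fixed table.

```python
# Minimal sketch of tiered model routing; task names are illustrative.
ROUTES = {
    "formatting": "gpt-4o-mini",   # Tier 1: cheap and fast
    "lint": "gpt-4o-mini",
    "code_review": "gpt-4o",       # Tier 2: balanced
    "bug_detection": "gpt-4o",
    "security": "o3",              # Tier 3: premium reasoning
    "architecture": "o3",
}

def route(task: str) -> str:
    """Pick the cheapest model tier that can handle the task."""
    return ROUTES.get(task, "gpt-4o")  # unknown tasks fall back to Tier 2

print(route("lint"), route("security"))  # gpt-4o-mini o3
```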
3. Trim Your Input Context
Every unnecessary token in your prompt costs money. Remove:
- Redundant instructions
- Overly verbose few-shot examples
- Code that is not relevant to the current task
- Conversation history beyond what is needed for context
For code review, send only the diff and immediately surrounding lines rather than entire files.
4. Limit Output Length
Set max_tokens to prevent the model from generating unnecessarily long responses. If you need a yes/no classification, cap the output at 10 tokens. If you need a code review summary, 500-1,000 tokens is usually sufficient.
This is especially important with reasoning models where unconstrained thinking budgets can generate thousands of expensive reasoning tokens.
5. Use Batch APIs for Non-Urgent Work
OpenAI and Anthropic both offer batch processing APIs with significant discounts:
- OpenAI Batch API: 50% discount on all models
- Anthropic Message Batches: 50% discount, results within 24 hours
If your code reviews do not need to be instant (for example, nightly security scans), batch processing cuts your costs in half.
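For OpenAI's Batch API, you submit a JSONL file with one request per line, each carrying a unique custom_id. The sketch below writes such a file for a nightly review run; the filenames, diffs, and prompt wording are placeholders.

```python
import json

# Sketch of an OpenAI Batch API input file: one JSON request per line.
# Batched requests are billed at a 50% discount and complete within 24 hours.
reviews = [
    ("pr-101", "diff --git a/app.py ..."),
    ("pr-102", "diff --git a/lib.py ..."),
]

with open("nightly_reviews.jsonl", "w") as f:
    for custom_id, diff in reviews:
        f.write(json.dumps({
            "custom_id": custom_id,           # unique per request
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user",
                              "content": f"Review this diff:\n{diff}"}],
                "max_tokens": 800,            # cap output cost per review
            },
        }) + "\n")
```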
6. Monitor and Set Spending Limits
Track your token usage by model and endpoint. Set daily and monthly spending limits to avoid surprise bills. Most providers offer usage dashboards and API-level budget controls.
7. Compress Code Context Intelligently
Instead of sending raw source code, consider:
- Sending AST (Abstract Syntax Tree) representations which are more token-efficient
- Using code summaries for files that provide context but are not being reviewed
- Stripping comments and whitespace from context files (but not from the code being reviewed)
8. Cache and Reuse Responses
If multiple developers open PRs that modify similar code, cache the review analysis for shared components. This avoids paying for the same analysis twice.
Token Costs for Popular AI Tasks
To put token pricing in broader perspective, here is what common AI-powered developer tasks cost using GPT-4o pricing ($2.50 input / $10.00 output per million tokens):
| Task | Input Tokens | Output Tokens | Cost per Task |
|---|---|---|---|
| PR code review (small) | 5,000 | 1,000 | $0.023 |
| PR code review (large) | 30,000 | 3,000 | $0.105 |
| Security vulnerability scan | 20,000 | 2,000 | $0.070 |
| Code documentation generation | 3,000 | 5,000 | $0.058 |
| Bug explanation and fix | 4,000 | 2,000 | $0.030 |
| Test case generation | 5,000 | 8,000 | $0.093 |
| Architecture review | 50,000 | 5,000 | $0.175 |
These costs are per individual task. At scale, they add up - but they are remarkably cheap compared to the developer time they save.
The Future of Token Pricing
Token prices have dropped roughly 80% from 2024 to 2026, and the trend shows no signs of slowing. Several factors are driving costs down:
- Hardware improvements - newer GPU architectures (NVIDIA Blackwell, AMD MI400) are more efficient
- Model distillation - smaller models trained on outputs from larger models close the quality gap at lower cost
- Inference optimization - techniques like speculative decoding, quantization, and better KV cache management reduce compute per token
- Competition - DeepSeek, Mistral, and open-source models keep pressure on pricing
For AI code review tools, this means the cost of running comprehensive analysis on every PR is approaching near-zero. The limiting factor is shifting from cost to quality - which model provides the most accurate, lowest-false-positive reviews.
How This Affects Your Choice of AI Code Review Tool
When evaluating AI code review tools, understanding token economics helps you assess whether a tool's pricing is fair:
- Tools charging $15-30/user/month likely use mid-tier models with caching, which is sustainable and cost-effective
- Tools offering unlimited free tiers are either using very cheap models, heavily rate-limiting, or subsidizing costs with venture capital
- Self-hosted tools like SonarQube or Semgrep avoid token costs entirely but require infrastructure investment
The best value typically comes from tools that intelligently route between model tiers - using cheap models for simple checks and premium models only when the complexity warrants it. CodeAnt AI and CodeRabbit both take this approach.
Key Takeaways
- Input tokens are cheapest because the model processes them in parallel
- Output tokens cost 3-8x more because they require sequential generation
- Reasoning tokens are billed as output tokens and can multiply your costs 3-10x with no visible output increase
- Prompt caching can reduce costs by up to 90% for repeated prompts
- Model routing - using cheap models for simple tasks - is the most impactful optimization strategy
- Batch APIs offer 50% discounts for non-urgent workloads
- For AI code review, token costs are now low enough that comprehensive analysis on every PR is economically viable for teams of any size
Understanding these fundamentals puts you in control of your AI spending rather than being surprised by your monthly bill. Whether you are using AI APIs directly or evaluating managed tools, knowing what drives token costs helps you make smarter decisions.
Frequently Asked Questions
Why do output tokens cost more than input tokens?
Output tokens typically cost 4-5x more than input tokens (up to 8x at some providers) because generating text is far more compute-intensive than reading it. During input processing, the model runs one forward pass over all tokens in parallel. During output generation, the model must run a separate forward pass for every single token, predicting one token at a time while maintaining the full context. This sequential, autoregressive process requires significantly more GPU time and memory bandwidth per token.
What are reasoning tokens and how are they billed?
Reasoning tokens are internal chain-of-thought tokens generated by models like OpenAI o1, o3, and Claude with extended thinking. These tokens represent the model's step-by-step problem-solving process. They are billed at the output token rate because they require the same sequential generation process. Reasoning tokens are not visible in the API response but still consume your token budget and context window. A 500-token visible response may use 2,000 or more total tokens when reasoning is included.
How can I reduce LLM API costs without losing quality?
The most effective strategies are prompt caching (up to 90% savings on repeated prompts), using smaller models for simple tasks and reserving expensive models for complex ones, trimming unnecessary context from prompts, batching requests during off-peak hours for 50% discounts, and setting maximum token limits on output. For AI code review, focusing reviews on changed files only rather than entire repositories dramatically reduces input token usage.
How does prompt caching work and when should I use it?
Prompt caching stores frequently used prompt prefixes so they do not need to be reprocessed on every request. Anthropic charges 0.1x the base input rate for cache reads and 1.25x for 5-minute cache writes. OpenAI offers automatic caching at 0.5x the input rate. You should use caching whenever you send the same system prompt, code context, or instructions repeatedly - which is exactly how AI code review tools work when analyzing multiple PRs against the same codebase rules.
Which LLM model offers the best value for AI code review?
For most AI code review use cases, GPT-4.1 at $2/$8 per million tokens or Claude Sonnet at $3/$15 offer the best balance of quality and cost. If you need deep reasoning for complex security analysis, o3 at $2/$8 with reasoning tokens is more cost-effective than o1. For high-volume linting and style checks, Gemini 2.5 Flash at $0.30/$2.50 or GPT-4o-mini at $0.15/$0.60 are the most economical choices.
How do AI code review tools manage token costs internally?
AI code review tools use several strategies to keep costs manageable. They use diff-based analysis to only send changed code rather than entire files, employ prompt caching for system instructions and repository context, route simple checks to cheaper models while using premium models for security analysis, and batch multiple file reviews where possible. Tools like CodeRabbit and CodeAnt AI abstract this complexity so you pay a flat per-user fee instead of worrying about token math.
Originally published at aicodereview.cc