If you have ever checked the pricing page for OpenAI, Anthropic, or Google and wondered why there are three different token prices listed, you are not alone. The distinction between input tokens, output tokens, and reasoning tokens is one of the most misunderstood aspects of LLM pricing - and getting it wrong can mean overspending by 5-10x on your AI workloads.
This guide breaks down exactly what each token type is, why they cost different amounts, and how to optimize your spending - whether you are building an AI application, running code reviews, or just trying to understand your API bill.
What Are Tokens in LLMs?
Before diving into pricing, let's clarify what tokens actually are. A token is the fundamental unit of text that large language models process. It is not a word, not a character, but something in between.
On average, one token equals roughly 4 characters or 0.75 words in English. A long word like "understanding" may be a single token or split into several, depending on the tokenizer. A line of Python code like def calculate_total(items): is about 8 tokens.
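For quick back-of-envelope estimates, the 4-characters-per-token rule of thumb is enough. Here is a minimal sketch of such an estimator; for exact counts you would use a real tokenizer library (such as OpenAI's tiktoken), since actual tokenization varies by model.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb.

    Real tokenizers give exact, model-specific counts; this heuristic is
    only for quick cost estimates before you have usage data.
    """
    return max(1, round(len(text) / 4))

# The 27-character Python line from above lands near the ~8 tokens cited:
print(estimate_tokens("def calculate_total(items):"))  # 7
```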
Every API call involves two phases:
- Input processing - the model reads your prompt (input tokens)
- Output generation - the model writes its response (output tokens)
Some newer models add a third phase:
- Reasoning - the model thinks through the problem before responding (reasoning tokens)
Each phase has different computational costs, which is why providers charge differently for each.
Input Tokens - What You Send
Input tokens represent everything you send to the model in your API request. This includes:
- System prompts - instructions that define the model's behavior
- User messages - your actual question or request
- Context - code snippets, documents, conversation history
- Few-shot examples - example inputs and outputs you provide
For AI code review, input tokens typically make up the bulk of your usage. When a tool like CodeRabbit or CodeAnt AI reviews a pull request, it sends the diff, surrounding code context, repository rules, and review instructions as input. A single PR review can easily consume 10,000-50,000 input tokens depending on the size of the change.
Why Input Tokens Are the Cheapest
Input tokens cost less because the model processes them in parallel using a single forward pass through the neural network. All input tokens are read simultaneously, making this phase computationally efficient. The GPU can process thousands of input tokens in roughly the same time it takes to process a few hundred.
Output Tokens - What You Receive
Output tokens are the tokens in the model's response. Every word, code snippet, and explanation the model generates counts as output tokens.
Output tokens are consistently more expensive than input tokens across every major provider. The ratio varies by provider - from 1.5x at the low end to over 8x - with 4-5x being the most common range.
Why Output Tokens Cost More
The reason is fundamental to how LLMs work. During output generation, the model must:
- Predict one token at a time (autoregressive generation)
- Run a full forward pass through the entire network for each token
- Maintain the full attention context from all previous tokens
- Store and update KV (key-value) cache for each new token
This sequential process means generating 1,000 output tokens requires roughly 1,000 separate forward passes, while reading 1,000 input tokens requires just one. The GPU utilization during output generation is also less efficient because it processes one token per step instead of batching thousands.
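The asymmetry between the two phases can be sketched as a toy accounting of forward passes. This is not a real inference engine, just an illustration of the claim above: prefill handles the whole prompt in one parallel pass, while decode pays one pass per generated token.

```python
def forward_passes(input_tokens: int, output_tokens: int) -> dict:
    """Toy accounting of transformer inference work.

    Prefill reads all input tokens in a single parallel forward pass;
    decode runs one forward pass per generated token (autoregressive).
    """
    return {
        "prefill_passes": 1,              # entire prompt processed at once
        "decode_passes": output_tokens,   # one pass per output token
    }

# Reading 1,000 input tokens vs generating 1,000 output tokens:
print(forward_passes(1000, 1000))  # {'prefill_passes': 1, 'decode_passes': 1000}
```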
This is why a verbose model response is not just annoying - it is literally expensive.
Reasoning Tokens - The Hidden Cost
Reasoning tokens are the newest and most misunderstood token type. Introduced with OpenAI's o1 model and Anthropic's extended thinking feature, reasoning tokens represent the model's internal chain-of-thought process.
When you ask a reasoning model to solve a complex problem, it does not jump straight to the answer. Instead, it generates an internal monologue - breaking the problem into steps, considering approaches, checking its work, and then producing the final response.
How Reasoning Tokens Work
Here is what happens when you send a request to a reasoning model:
- The model reads your input tokens (same as any model)
- The model generates reasoning tokens - internal thinking that you do not see
- The model generates output tokens - the visible response
The critical detail: reasoning tokens are billed at the output token rate because they require the same expensive sequential generation process. But they are not visible in the API response.
This means a request that returns a 500-token response might actually consume 3,000 or more total output tokens - 2,500 for reasoning and 500 for the visible answer.
Models That Use Reasoning Tokens
| Model | Provider | Reasoning Type | Reasoning Visible? |
|---|---|---|---|
| o1 | OpenAI | Built-in chain-of-thought | No (summary only) |
| o3 | OpenAI | Built-in chain-of-thought | No (summary only) |
| o4-mini | OpenAI | Built-in chain-of-thought | No (summary only) |
| Claude Opus 4.5+ | Anthropic | Extended thinking | Yes (thinking blocks) |
| Claude Sonnet 4.5+ | Anthropic | Extended thinking | Yes (thinking blocks) |
| Gemini 2.5 Pro | Google | Thinking mode | Yes (thought summaries) |
One important difference: Anthropic's extended thinking tokens are visible to the developer (returned as thinking blocks), while OpenAI's reasoning tokens are hidden. Both are billed at output rates.
LLM Token Pricing Comparison (March 2026)
Here is the current pricing for major LLM providers. All prices are per million tokens.
Standard Models (No Reasoning)
| Model | Input (per 1M) | Output (per 1M) | Output/Input Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| GPT-4o-mini | $0.15 | $0.60 | 4x |
| GPT-4.1 | $2.00 | $8.00 | 4x |
| GPT-4.1-mini | $0.40 | $1.60 | 4x |
| Claude Opus 4.5 | $5.00 | $25.00 | 5x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 5x |
| Gemini 2.5 Pro | $1.25 | $10.00 | 8x |
| Gemini 2.5 Flash | $0.30 | $2.50 | 8.3x |
| DeepSeek V3 | $0.28 | $0.42 | 1.5x |
Reasoning Models
| Model | Input (per 1M) | Output (per 1M) | Reasoning Rate | Notes |
|---|---|---|---|---|
| o1 | $15.00 | $60.00 | Same as output | Most expensive reasoning model |
| o3 | $2.00 | $8.00 | Same as output | Best price-to-reasoning ratio |
| o4-mini | $1.10 | $4.40 | Same as output | Budget reasoning option |
| Claude Sonnet (thinking) | $3.00 | $15.00 | Same as output | Extended thinking enabled |
| Claude Opus (thinking) | $5.00 | $25.00 | Same as output | Extended thinking enabled |
| Gemini 2.5 Pro (thinking) | $1.25 | $10.00 | Same as output | Thinking mode enabled |
Notice that reasoning models do not have a separate per-token rate for reasoning. Instead, reasoning tokens are simply billed as output tokens. The cost impact comes from the volume - a reasoning model might generate 5-10x more total output tokens than a standard model for the same request.
Real-World Cost Calculations
Let's work through some practical examples to show how token types affect your actual spending.
Example 1 - Simple Code Review with GPT-4o
You send a 200-line diff for review with a system prompt and code context.
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| System prompt | 2,000 | $2.50/1M input | $0.005 |
| Code diff + context | 8,000 | $2.50/1M input | $0.020 |
| Review response | 1,500 | $10.00/1M output | $0.015 |
| Total | 11,500 | | $0.040 |
At this rate, reviewing 100 PRs per month costs about $4.00.
Example 2 - Deep Security Review with o3
Same diff, but you want the model to reason through potential security vulnerabilities.
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| System prompt | 2,000 | $2.00/1M input | $0.004 |
| Code diff + context | 8,000 | $2.00/1M input | $0.016 |
| Reasoning tokens (hidden) | 6,000 | $8.00/1M output | $0.048 |
| Visible response | 2,000 | $8.00/1M output | $0.016 |
| Total | 18,000 | | $0.084 |
The reasoning tokens more than doubled the cost compared to a standard model, even though the visible output was similar in length.
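The arithmetic in both examples reduces to one formula: input tokens at the input rate, plus visible and hidden reasoning tokens at the output rate. A minimal sketch, using the rates from the tables above:

```python
def request_cost(input_toks: int, output_toks: int,
                 in_rate: float, out_rate: float,
                 reasoning_toks: int = 0) -> float:
    """Cost of one API request in dollars; rates are $ per 1M tokens.

    Reasoning tokens are billed at the output rate, matching how
    OpenAI and Anthropic price hidden/extended thinking.
    """
    return (input_toks * in_rate
            + (output_toks + reasoning_toks) * out_rate) / 1_000_000

# Example 1: GPT-4o review ($2.50 in / $10.00 out)
print(round(request_cost(10_000, 1_500, 2.50, 10.00), 3))  # 0.04

# Example 2: o3 security review ($2.00 in / $8.00 out, 6k hidden reasoning tokens)
print(round(request_cost(10_000, 2_000, 2.00, 8.00, reasoning_toks=6_000), 3))  # 0.084
```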
Example 3 - High-Volume Review with GPT-4o-mini
For a team doing 500 PR reviews per month with lighter analysis.
| Component | Tokens per Review | Monthly Total | Rate | Monthly Cost |
|---|---|---|---|---|
| Input tokens | 8,000 | 4,000,000 | $0.15/1M | $0.60 |
| Output tokens | 1,000 | 500,000 | $0.60/1M | $0.30 |
| Total | 9,000 | 4,500,000 | | $0.90/month |
That is less than a dollar per month for 500 code reviews. This is why smaller models are so attractive for high-volume, lower-complexity tasks.
Why the Output-to-Input Ratio Matters
The output-to-input cost ratio varies significantly across providers, and understanding this ratio is crucial for cost optimization.
Google's Gemini models have the highest ratio at 8x, meaning output tokens cost eight times more than input tokens. If your use case is output-heavy (generating long code, documentation, or detailed reviews), Gemini becomes comparatively more expensive on the output side despite having competitive input prices.
DeepSeek has the lowest ratio at just 1.5x, making it the most balanced option for output-heavy workloads. However, DeepSeek's overall quality for code review tasks may not match GPT-4o or Claude for certain languages.
OpenAI and Anthropic sit at 4-5x, which is the most common range across the industry.
The practical takeaway: if you can keep your outputs short and your inputs lean, you will save the most money with high-ratio providers. If your task requires long outputs, a lower-ratio provider might be cheaper overall even if its base rates are higher.
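This takeaway is easy to check numerically. The sketch below compares an output-heavy workload across three providers using the rates from the pricing table; the low-ratio provider wins by a wide margin even though its headline input price is not the lowest factor at play.

```python
RATES = {  # $ per 1M tokens (input, output), from the pricing table above
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-sonnet-4.5": (3.00, 15.00),
    "deepseek-v3": (0.28, 0.42),
}

def workload_cost(model: str, input_toks: int, output_toks: int) -> float:
    """Dollar cost of one request for a given model's rates."""
    in_rate, out_rate = RATES[model]
    return (input_toks * in_rate + output_toks * out_rate) / 1_000_000

# Output-heavy job: 5k tokens in, 20k tokens out
for model in RATES:
    print(model, round(workload_cost(model, 5_000, 20_000), 4))
```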
Prompt Caching - The Biggest Cost Saver
Prompt caching is the single most effective way to reduce LLM API costs, especially for AI code review workloads where the same system prompt and repository context are sent with every request.
How Prompt Caching Works
Instead of reprocessing the same prompt prefix on every request, the provider stores it in GPU memory. Subsequent requests that share the same prefix get a significant discount on those cached input tokens.
Caching Pricing by Provider
| Provider | Cache Write Cost | Cache Read Cost | Cache Duration |
|---|---|---|---|
| Anthropic | 1.25x base input | 0.1x base input | 5 minutes |
| Anthropic (extended) | 2x base input | 0.1x base input | 1 hour |
| OpenAI | 1x base input (automatic) | 0.5x base input | Up to 1 hour |
| Google | 1x base input | 0.25x base input | Varies |
Anthropic's caching is the most aggressive - a cache read costs just 10% of the normal input rate. For Claude Sonnet at $3/1M input tokens, cached reads cost just $0.30/1M. The 1.25x write overhead pays for itself after a single cache hit.
Caching Example for Code Review
Suppose your AI code review tool uses a 3,000-token system prompt with every review. Over 100 reviews:
Without caching (Claude Sonnet):
- 3,000 tokens x 100 reviews = 300,000 input tokens
- Cost: 300,000 / 1M x $3.00 = $0.90
With caching (Claude Sonnet):
- First request (cache write): 3,000 tokens x $3.75/1M = $0.011
- 99 cached requests: 3,000 x 99 = 297,000 tokens x $0.30/1M = $0.089
- Total: $0.10
That is an 89% reduction just from caching the system prompt. In practice, you can also cache repository-level context, coding standards, and review rules for even greater savings.
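The caching example above can be reproduced with a few lines. The defaults here assume Anthropic's 5-minute cache pricing (1.25x write, 0.1x read) and that every request after the first is a cache hit - a best-case assumption.

```python
def cached_prompt_cost(prefix_toks: int, requests: int, base_rate: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of sending a cached prompt prefix `requests` times.

    Defaults model Anthropic's 5-minute cache (1.25x write, 0.1x read).
    Assumes every request after the first hits the cache.
    """
    write = prefix_toks * base_rate * write_mult / 1_000_000
    reads = prefix_toks * (requests - 1) * base_rate * read_mult / 1_000_000
    return write + reads

uncached = 3_000 * 100 * 3.00 / 1_000_000       # $0.90 without caching
cached = cached_prompt_cost(3_000, 100, 3.00)   # ~$0.10 with caching
print(round(cached, 2), f"{1 - cached / uncached:.0%} saved")  # 0.1 89% saved
```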
How AI Code Review Tools Handle Token Costs
Understanding token economics explains many of the design decisions behind modern AI code review tools. Here is how the major tools manage costs:
Diff-Based Analysis
Tools like CodeRabbit, CodeAnt AI, and Sourcery only send the changed code (the diff) rather than entire files. This dramatically reduces input tokens. A 10-line change in a 500-line file uses roughly 95% fewer input tokens than sending the whole file.
Model Routing
Sophisticated tools route different types of analysis to different models:
- Style and formatting checks go to cheap, fast models like GPT-4o-mini or Gemini Flash
- Logic and bug detection go to mid-tier models like GPT-4o or Claude Sonnet
- Security vulnerability analysis may use reasoning models like o3 for deeper analysis
This tiered approach can reduce costs by 60-80% compared to using a single premium model for everything.
Prompt Caching for Repository Context
When a code review tool indexes your repository, it builds a context document with your coding standards, common patterns, and project structure. This context is cached and reused across every PR review, saving thousands of input tokens per request.
Flat Pricing Abstraction
Most AI code review tools charge a flat per-user monthly fee rather than passing through token costs directly. This means the tool vendor absorbs the token cost risk and optimizes internally. Tools like CodeRabbit ($15/user/month) and CodeAnt AI (free for public repos) handle all the token math behind the scenes.
8 Strategies to Optimize Your LLM Token Costs
Whether you are building your own AI-powered tool or managing API costs directly, these strategies will help you spend less without sacrificing quality.
1. Use Prompt Caching Aggressively
Put your longest, most stable content at the beginning of your prompt. System instructions, few-shot examples, and reference documents are prime candidates for caching. Structure prompts so the cached prefix is reused across as many requests as possible.
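As a concrete illustration, here is a sketch of an Anthropic Messages API request body that marks the stable prefix as cacheable via the cache_control field. The model name and prompt text are placeholders; check your provider's current docs for exact field names, since OpenAI's caching is automatic and needs no annotation.

```python
# Sketch of an Anthropic-style request body with a cacheable system prompt.
# The stable, reusable content goes first and is tagged with cache_control;
# the per-request content (the diff) goes in the user message.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code reviewer. <long, stable instructions...>",
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    "messages": [
        {"role": "user", "content": "Review this diff: <per-request content>"}
    ],
}

print(request_body["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```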
2. Choose the Right Model for the Task
Do not use o3 or Claude Opus for tasks that GPT-4o-mini can handle. Create a model routing strategy:
- Tier 1 (cheap): Formatting, linting, simple classification - GPT-4o-mini, Gemini Flash
- Tier 2 (balanced): Code review, bug detection, summarization - GPT-4o, Claude Sonnet
- Tier 3 (premium): Security analysis, architecture review, complex reasoning - o3, Claude Opus
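The tiers above can be sketched as a simple lookup-based router. The task names and the default tier are illustrative assumptions; a production router would classify tasks dynamically rather than from a fixed table.

```python
# Minimal sketch of tiered model routing; task names are illustrative.
ROUTES = {
    "formatting": "gpt-4o-mini",   # Tier 1: cheap and fast
    "lint": "gpt-4o-mini",
    "code_review": "gpt-4o",       # Tier 2: balanced
    "bug_detection": "gpt-4o",
    "security": "o3",              # Tier 3: premium reasoning
    "architecture": "o3",
}

def route(task: str) -> str:
    """Pick the cheapest model tier that can handle the task."""
    return ROUTES.get(task, "gpt-4o")  # unknown tasks fall back to Tier 2

print(route("lint"), route("security"))  # gpt-4o-mini o3
```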
3. Trim Your Input Context
Every unnecessary token in your prompt costs money. Remove:
- Redundant instructions
- Overly verbose few-shot examples
- Code that is not relevant to the current task
- Conversation history beyond what is needed for context
For code review, send only the diff and immediately surrounding lines rather than entire files.
4. Limit Output Length
Set max_tokens to prevent the model from generating unnecessarily long responses. If you need a yes/no classification, cap the output at 10 tokens. If you need a code review summary, 500-1,000 tokens is usually sufficient.
This is especially important with reasoning models where unconstrained thinking budgets can generate thousands of expensive reasoning tokens.
5. Use Batch APIs for Non-Urgent Work
OpenAI and Anthropic both offer batch processing APIs with significant discounts:
- OpenAI Batch API: 50% discount on all models
- Anthropic Message Batches: 50% discount, results within 24 hours
If your code reviews do not need to be instant (for example, nightly security scans), batch processing cuts your costs in half.
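For OpenAI's Batch API, you submit a JSONL file with one request per line, each carrying a unique custom_id. The sketch below writes such a file for a nightly review run; the filenames, diffs, and prompt wording are placeholders.

```python
import json

# Sketch of an OpenAI Batch API input file: one JSON request per line.
# Batched requests are billed at a 50% discount and complete within 24 hours.
reviews = [
    ("pr-101", "diff --git a/app.py ..."),
    ("pr-102", "diff --git a/lib.py ..."),
]

with open("nightly_reviews.jsonl", "w") as f:
    for custom_id, diff in reviews:
        f.write(json.dumps({
            "custom_id": custom_id,           # unique per request
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user",
                              "content": f"Review this diff:\n{diff}"}],
                "max_tokens": 800,            # cap output cost per review
            },
        }) + "\n")
```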
6. Monitor and Set Spending Limits
Track your token usage by model and endpoint. Set daily and monthly spending limits to avoid surprise bills. Most providers offer usage dashboards and API-level budget controls.
7. Compress Code Context Intelligently
Instead of sending raw source code, consider:
- Sending AST (Abstract Syntax Tree) representations which are more token-efficient
- Using code summaries for files that provide context but are not being reviewed
- Stripping comments and whitespace from context files (but not from the code being reviewed)
8. Cache and Reuse Responses
If multiple developers open PRs that modify similar code, cache the review analysis for shared components. This avoids paying for the same analysis twice.
Token Costs for Popular AI Tasks
To put token pricing in broader perspective, here is what common AI-powered developer tasks cost using GPT-4o pricing ($2.50 input / $10.00 output per million tokens):
| Task | Input Tokens | Output Tokens | Cost per Task |
|---|---|---|---|
| PR code review (small) | 5,000 | 1,000 | $0.023 |
| PR code review (large) | 30,000 | 3,000 | $0.105 |
| Security vulnerability scan | 20,000 | 2,000 | $0.070 |
| Code documentation generation | 3,000 | 5,000 | $0.058 |
| Bug explanation and fix | 4,000 | 2,000 | $0.030 |
| Test case generation | 5,000 | 8,000 | $0.093 |
| Architecture review | 50,000 | 5,000 | $0.175 |
These costs are per individual task. At scale, they add up - but they are remarkably cheap compared to the developer time they save.
The Future of Token Pricing
Token prices have dropped roughly 80% from 2024 to 2026, and the trend shows no signs of slowing. Several factors are driving costs down:
- Hardware improvements - newer GPU architectures (NVIDIA Blackwell, AMD MI400) are more efficient
- Model distillation - smaller models trained on outputs from larger models close the quality gap at lower cost
- Inference optimization - techniques like speculative decoding, quantization, and better KV cache management reduce compute per token
- Competition - DeepSeek, Mistral, and open-source models keep pressure on pricing
For AI code review tools, this means the cost of running comprehensive analysis on every PR is approaching near-zero. The limiting factor is shifting from cost to quality - which model provides the most accurate, lowest-false-positive reviews.
How This Affects Your Choice of AI Code Review Tool
When evaluating AI code review tools, understanding token economics helps you assess whether a tool's pricing is fair:
- Tools charging $15-30/user/month likely use mid-tier models with caching, which is sustainable and cost-effective
- Tools offering unlimited free tiers are either using very cheap models, heavily rate-limiting, or subsidizing costs with venture capital
- Self-hosted tools like SonarQube or Semgrep avoid token costs entirely but require infrastructure investment
The best value typically comes from tools that intelligently route between model tiers - using cheap models for simple checks and premium models only when the complexity warrants it. CodeAnt AI and CodeRabbit both take this approach.
Key Takeaways
- Input tokens are cheapest because the model processes them in parallel
- Output tokens cost 3-8x more because they require sequential generation
- Reasoning tokens are billed as output tokens and can multiply your costs 3-10x with no visible output increase
- Prompt caching can reduce costs by up to 90% for repeated prompts
- Model routing - using cheap models for simple tasks - is the most impactful optimization strategy
- Batch APIs offer 50% discounts for non-urgent workloads
- For AI code review, token costs are now low enough that comprehensive analysis on every PR is economically viable for teams of any size
Understanding these fundamentals puts you in control of your AI spending rather than being surprised by your monthly bill. Whether you are using AI APIs directly or evaluating managed tools, knowing what drives token costs helps you make smarter decisions.
Frequently Asked Questions
Why do output tokens cost more than input tokens?
Output tokens typically cost 4-5x more than input tokens (up to 8x at some providers) because generating text is far more compute-intensive than reading it. During input processing, the model runs one forward pass over all tokens in parallel. During output generation, the model must run a separate forward pass for every single token, predicting one token at a time while maintaining the full context. This sequential, autoregressive process requires significantly more GPU time and memory bandwidth per token.
What are reasoning tokens and how are they billed?
Reasoning tokens are internal chain-of-thought tokens generated by models like OpenAI o1, o3, and Claude with extended thinking. These tokens represent the model's step-by-step problem-solving process. They are billed at the output token rate because they require the same sequential generation process. Reasoning tokens are not visible in the API response but still consume your token budget and context window. A 500-token visible response may use 2,000 or more total tokens when reasoning is included.
How can I reduce LLM API costs without losing quality?
The most effective strategies are prompt caching (up to 90% savings on repeated prompts), using smaller models for simple tasks and reserving expensive models for complex ones, trimming unnecessary context from prompts, batching requests during off-peak hours for 50% discounts, and setting maximum token limits on output. For AI code review, focusing reviews on changed files only rather than entire repositories dramatically reduces input token usage.
How does prompt caching work and when should I use it?
Prompt caching stores frequently used prompt prefixes so they do not need to be reprocessed on every request. Anthropic charges 0.1x the base input rate for cache reads and 1.25x for 5-minute cache writes. OpenAI offers automatic caching at 0.5x the input rate. You should use caching whenever you send the same system prompt, code context, or instructions repeatedly - which is exactly how AI code review tools work when analyzing multiple PRs against the same codebase rules.
Which LLM model offers the best value for AI code review?
For most AI code review use cases, GPT-4.1 at $2/$8 per million tokens or Claude Sonnet at $3/$15 offer the best balance of quality and cost. If you need deep reasoning for complex security analysis, o3 at $2/$8 with reasoning tokens is more cost-effective than o1. For high-volume linting and style checks, Gemini 2.5 Flash at $0.30/$2.50 or GPT-4o-mini at $0.15/$0.60 are the most economical choices.
How do AI code review tools manage token costs internally?
AI code review tools use several strategies to keep costs manageable. They use diff-based analysis to only send changed code rather than entire files, employ prompt caching for system instructions and repository context, route simple checks to cheaper models while using premium models for security analysis, and batch multiple file reviews where possible. Tools like CodeRabbit and CodeAnt AI abstract this complexity so you pay a flat per-user fee instead of worrying about token math.
Originally published at aicodereview.cc