Claude Opus 4.8 costs $5 per million input tokens and $25 per million output tokens in standard mode. That matches Opus 4.7 pricing, so upgrading from 4.7 does not change your base token rate. Your real bill, however, depends on how you use the model: fast mode, effort, prompt caching, batching, and output limits can move costs significantly.
This guide shows how to estimate and reduce Opus 4.8 costs with practical examples. For the model overview, see "What is Claude Opus 4.8". To start building, see the API guide.
The Opus 4.8 rate card
| Mode | Input, per 1M tokens | Output, per 1M tokens | Speed |
|---|---|---|---|
| Standard | $5 | $25 | Baseline |
| Fast | $10 | $50 | 2.5x faster output |
Two implementation details matter most:
- Output tokens are 5x more expensive than input tokens. Long answers, verbose tool traces, and unconstrained generations drive spend quickly.
- Fast mode doubles the per-token price. You pay $10/$50 per million tokens for output that streams 2.5x faster.
You can confirm current rates in Anthropic’s pricing docs.
When to use fast mode
Use standard mode by default.
Use fast mode only when latency directly affects the user experience, for example:
- Live coding assistants
- Interactive chat UIs
- Agent workflows where a user is watching progress
- Cursor-style autocomplete or editing tools
Avoid fast mode for:
- Background jobs
- Scheduled tasks
- Offline evals
- Data labeling
- Bulk summarization
- Agent loops that do not need real-time feedback
A simple rule:
If a human is waiting, consider fast mode.
If the job runs in the background, use standard mode.
Use effort to control token spend
Opus 4.8’s effort parameter controls how many tokens the model spends across the full response, including tool calls. Because output tokens cost more than input tokens, tuning effort is one of the most direct ways to reduce cost.
The available levels are:
- low: terse answers, fewest tool calls, lowest spend
- medium: balanced
- high: default, more thorough
- xhigh: deeper reasoning, more tool calls, recommended for coding
- max: no constraints, highest spend
Use lower effort for deterministic or shallow tasks:
{
  "model": "claude-opus-4-8",
  "messages": [
    {
      "role": "user",
      "content": "Classify this support ticket as billing, bug, or feature request: ..."
    }
  ],
  "effort": "low",
  "max_tokens": 100
}
Use higher effort for tasks where reasoning quality matters:
{
  "model": "claude-opus-4-8",
  "messages": [
    {
      "role": "user",
      "content": "Analyze this repository and propose a safe refactor plan..."
    }
  ],
  "effort": "xhigh",
  "max_tokens": 4000
}
Anthropic’s effort guidance explains where each level tends to hold quality. The practical takeaway: do not run every workload at high or xhigh by default.
Cost formula
Use this formula to estimate standard-mode costs:
cost =
(input_tokens / 1,000,000 * 5)
+ (output_tokens / 1,000,000 * 25)
For fast mode:
cost =
(input_tokens / 1,000,000 * 10)
+ (output_tokens / 1,000,000 * 50)
For batch jobs with a 50% discount:
cost =
((input_tokens / 1,000,000 * 5)
+ (output_tokens / 1,000,000 * 25)) * 0.5
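The three formulas above can be folded into one helper. This is a minimal sketch: the function name `estimate_cost` and the `RATES` table are illustrative, and the rates are the ones quoted in this article. The article only shows the batch discount applied to standard rates, so treat `batch=True` with `mode="fast"` as untested territory.

```python
# Rates in USD per million tokens, copied from the rate card in this article.
RATES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}
BATCH_DISCOUNT = 0.5  # the 50% batch discount used in the formula above


def estimate_cost(input_tokens: int, output_tokens: int,
                  mode: str = "standard", batch: bool = False) -> float:
    """Return the estimated cost in USD for one request or job."""
    rate = RATES[mode]
    cost = (input_tokens / 1_000_000 * rate["input"]
            + output_tokens / 1_000_000 * rate["output"])
    if batch:
        cost *= BATCH_DISCOUNT
    return cost


print(f"{estimate_cost(1_000, 500):.4f}")                      # 0.0175
print(f"{estimate_cost(1_000, 500, mode='fast'):.4f}")         # 0.0350
print(f"{estimate_cost(1_000_000, 200_000, batch=True):.2f}")  # 5.00
```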
Worked cost scenarios
The following examples use standard pricing: $5 input and $25 output per million tokens.
Actual costs depend on your prompt length, response length, tool calls, and selected effort.
Scenario 1: chatbot turn
Assume:
- 1,000 input tokens
- 500 output tokens
Calculation:
Input: 1,000 / 1,000,000 * $5 = $0.005
Output: 500 / 1,000,000 * $25 = $0.0125
Total: $0.0175
Approximate cost:
$0.018 per turn
If you lower effort and cap max_tokens, the output may shrink enough to bring the turn under one cent.
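To see how tight the cap has to be, you can invert the formula and solve for output tokens. The sketch below uses integer micro-dollars to avoid floating-point rounding ($5 per million tokens is 5 micro-dollars per token). The helper name `max_output_tokens` is hypothetical.

```python
INPUT_MICRO_USD_PER_TOKEN = 5    # $5 per 1M input tokens (standard mode)
OUTPUT_MICRO_USD_PER_TOKEN = 25  # $25 per 1M output tokens (standard mode)


def max_output_tokens(budget_micro_usd: int, input_tokens: int) -> int:
    """Largest output token count that keeps one turn at or under the budget."""
    remaining = budget_micro_usd - input_tokens * INPUT_MICRO_USD_PER_TOKEN
    return max(0, remaining // OUTPUT_MICRO_USD_PER_TOKEN)


# One cent is 10,000 micro-dollars; the chatbot turn sends 1,000 input tokens.
print(max_output_tokens(10_000, 1_000))  # 200
```

With 1,000 input tokens, the turn costs exactly one cent at 200 output tokens, so the response has to stay below that to land under a cent.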
Scenario 2: agentic coding task
Assume:
- 50,000 input tokens of repository context
- 8,000 output tokens
- effort: "xhigh"
Calculation:
Input: 50,000 / 1,000,000 * $5 = $0.25
Output: 8,000 / 1,000,000 * $25 = $0.20
Total: $0.45
Approximate cost:
$0.45 per task
If the 50K-token context repeats across calls, prompt caching can reduce the repeated input cost to roughly $0.025, bringing the task closer to:
$0.23 per task
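The cached figure can be checked with a short calculation. This sketch assumes cached reads are billed at exactly one tenth of the input rate (the article says "roughly a tenth") and that the whole 50K-token context is served from cache. The function name `task_cost` is illustrative.

```python
INPUT_RATE = 5.0             # USD per million input tokens (standard mode)
OUTPUT_RATE = 25.0           # USD per million output tokens (standard mode)
CACHE_READ_MULTIPLIER = 0.1  # assumption: cached reads cost a tenth of input


def task_cost(cached_tokens: int, fresh_input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call that mixes cached and uncached input."""
    cached = cached_tokens / 1_000_000 * INPUT_RATE * CACHE_READ_MULTIPLIER
    fresh = fresh_input_tokens / 1_000_000 * INPUT_RATE
    output = output_tokens / 1_000_000 * OUTPUT_RATE
    return cached + fresh + output


print(f"{task_cost(0, 50_000, 8_000):.3f}")  # 0.450 (no caching)
print(f"{task_cost(50_000, 0, 8_000):.3f}")  # 0.225 (context read from cache)
```

Note that output tokens now dominate the bill, so after caching, effort and max_tokens become the main levers.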
Scenario 3: overnight batch job
Assume:
- 1,000,000 input tokens
- 200,000 output tokens
- Batch API with a 50% discount
Calculation:
Input: 1,000,000 / 1,000,000 * $5 * 0.5 = $2.50
Output: 200,000 / 1,000,000 * $25 * 0.5 = $2.50
Total: $5.00
Approximate cost:
$5.00 for the whole batch
For comparison shopping against cheaper models, see the Gemini 3.5 Flash pricing breakdown and Xiaomi MiMo v2.5 API cost.
Prompt caching: reduce repeated input cost
If you send the same system prompt, document, or codebase on every request, you repeatedly pay full input-token pricing for content the model has already seen.
Prompt caching reduces that repeated input cost. After the initial cache write, cached input reads are charged at a fraction of the normal input rate, roughly a tenth.
Good candidates for caching:
- Long system prompts
- Product documentation
- API references
- Repository context
- Style guides
- Policy documents
- Shared agent instructions
A common agent pattern looks like this:
First request:
- Send full system prompt + repo context
- Write cache
- Pay normal input cost for cached content
Later requests:
- Reuse cached prompt/context
- Pay reduced cached-read cost
- Send only the new task-specific input
Long-context coding agents usually benefit the most because the same repository context is reused across many calls.
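The first-request/later-requests pattern above can be costed over a whole session. This is a sketch under stated assumptions: the cache stays warm between calls, cached reads cost a tenth of the input rate, and the first request is billed at the normal input rate as described above. Cache writes may be priced differently in practice, so `CACHE_WRITE_MULTIPLIER` is left as a knob to adjust against the pricing docs.

```python
INPUT_RATE = 5.0              # USD per million input tokens (standard mode)
CACHE_READ_MULTIPLIER = 0.1   # "roughly a tenth", per this article
CACHE_WRITE_MULTIPLIER = 1.0  # assumption: first request billed at the normal rate


def repeated_context_cost(context_tokens: int, calls: int, cached: bool) -> float:
    """Input cost in USD of sending the same context on every call."""
    if calls <= 0:
        return 0.0
    base = context_tokens / 1_000_000 * INPUT_RATE
    if not cached:
        return base * calls
    first = base * CACHE_WRITE_MULTIPLIER
    later = base * CACHE_READ_MULTIPLIER * (calls - 1)
    return first + later


without_cache = repeated_context_cost(50_000, 20, cached=False)
with_cache = repeated_context_cost(50_000, 20, cached=True)
print(f"without cache: ${without_cache:.3f}")  # $5.000
print(f"with cache:    ${with_cache:.3f}")     # $0.725
```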
Use the Batch API for non-real-time work
The Batch API is useful when you do not need an immediate response.
Use it for:
- Evals
- Bulk summarization
- Data labeling
- Report generation
- Document processing
- Offline analysis pipelines
The Batch API also raises the output ceiling. Opus 4.8 supports:
- Up to 128K output tokens on the synchronous Messages API
- Up to 300K output tokens through the Batch API with the output-300k-2026-03-24 beta header
Use synchronous calls for interactive workflows. Use batch calls when minutes of latency are acceptable and lower cost matters more than immediate output.
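A batch submission is a list of independent requests, each tagged with an ID so results can be matched back later. The sketch below only builds that list. The `custom_id`/`params` shape follows the Message Batches API as I understand it, so verify it against the current API reference before use. The function name and the summarization prompt are hypothetical.

```python
def build_batch_requests(documents: dict[str, str], max_tokens: int = 500) -> list[dict]:
    """Turn {doc_id: text} into one batch entry per document."""
    requests = []
    for doc_id, text in documents.items():
        requests.append({
            "custom_id": doc_id,  # dict keys are unique, so IDs are too
            "params": {
                "model": "claude-opus-4-8",
                "max_tokens": max_tokens,  # cap output per document
                "messages": [
                    {"role": "user", "content": f"Summarize this document:\n\n{text}"}
                ],
            },
        })
    return requests


batch = build_batch_requests({"doc-1": "first report", "doc-2": "second report"})
print(len(batch), batch[0]["custom_id"])  # 2 doc-1
```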
Opus pricing across generations
Opus 4.8 keeps the same pricing as recent Opus releases:
| Model | Input, per 1M tokens | Output, per 1M tokens |
|---|---|---|
| Opus 4.1 | $15 | $75 |
| Opus 4.5 | $5 | $25 |
| Opus 4.6 | $5 | $25 |
| Opus 4.7 | $5 | $25 |
| Opus 4.8 | $5 | $25 |
Opus pricing dropped from $15/$75 to $5/$25 at the 4.5 generation and has stayed there through 4.8.
For a head-to-head comparison against other flagship models, see Opus 4.8 vs GPT-5.5 vs Gemini 3.5.
Cost optimization checklist
Before scaling Opus 4.8 in production, apply these controls.
1. Set effort per workload
Do not use one global effort level.
Example mapping:
| Task | Suggested effort |
|---|---|
| Classification | low |
| Extraction | low or medium |
| Customer support draft | medium |
| Technical explanation | high |
| Coding agent | xhigh |
| Open-ended deep reasoning | xhigh or max |
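One way to enforce the mapping is a lookup table that every request passes through. This is a sketch, not a prescribed setup: the task keys are made up for illustration, the cheaper option is chosen where the table offers two, and unknown tasks fall back to "high" because the article lists it as the default. The top-level "effort" field mirrors the JSON examples in this article and should be checked against the API reference.

```python
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",              # table allows low or medium
    "support_draft": "medium",
    "technical_explanation": "high",
    "coding_agent": "xhigh",
    "deep_reasoning": "xhigh",        # table allows xhigh or max
}


def build_request(task_type: str, prompt: str, max_tokens: int) -> dict:
    """Build a request body with the effort level chosen for this workload."""
    effort = EFFORT_BY_TASK.get(task_type, "high")  # fall back to the default level
    return {
        "model": "claude-opus-4-8",
        "messages": [{"role": "user", "content": prompt}],
        "effort": effort,
        "max_tokens": max_tokens,
    }


print(build_request("classification", "Classify this ticket: ...", 100)["effort"])  # low
```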
2. Cap max_tokens
Always set a reasonable max_tokens value.
{
  "effort": "low",
  "max_tokens": 150
}
This bounds the worst-case output cost per call.
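The bound is easy to compute: the most a single call can cost in output is max_tokens times the output rate. A small sketch, using the standard-mode output rate from this article and a hypothetical helper name:

```python
OUTPUT_RATE = 25.0  # USD per million output tokens (standard mode)


def worst_case_output_cost(max_tokens: int, calls: int = 1) -> float:
    """Upper bound in USD on output spend if every call hits max_tokens."""
    return max_tokens / 1_000_000 * OUTPUT_RATE * calls


print(f"{worst_case_output_cost(150):.5f}")                   # 0.00375 per call
print(f"{worst_case_output_cost(150, calls=1_000_000):.2f}")  # 3750.00 per million calls
```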
3. Cache repeated context
Cache anything large and stable:
system prompt
+ docs
+ repo context
+ style guide
+ shared tool instructions
Then send only task-specific content in each request.
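In request terms, this means marking the stable block as cacheable and keeping the per-task message separate. The sketch below builds such a body. The `cache_control` marker on a system text block follows Anthropic's prompt caching docs as I understand them, so confirm the exact shape for Opus 4.8 before relying on it. The function name is hypothetical.

```python
def build_cached_request(stable_context: str, task: str, max_tokens: int = 1000) -> dict:
    """Request body with a cacheable system block and a task-specific user turn."""
    return {
        "model": "claude-opus-4-8",
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": stable_context,                  # docs, repo context, style guide
                "cache_control": {"type": "ephemeral"},  # mark this block as cacheable
            }
        ],
        "messages": [{"role": "user", "content": task}],  # only this changes per call
    }


body = build_cached_request("<system prompt + docs + repo context>", "Fix the failing test.")
print(body["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Keep the cached block byte-for-byte identical between calls; any change to it means the next request cannot reuse the cached copy.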
4. Batch non-urgent work
Move offline work to the Batch API:
interactive user request -> synchronous API
overnight processing -> Batch API
bulk evals -> Batch API
5. Use standard mode unless latency is the product
Fast mode is useful, but expensive. Keep it limited to user-facing interactions where faster streaming improves the experience.
6. Track quota and usage
Rate limits, spend, and usage tiers change how you operate at scale. The Claude Code weekly limits change is a reminder to monitor quota, not just per-token pricing.
Track real spend with Apidog
Estimates are useful before launch, but production cost depends on actual responses.
Every Messages API response includes a usage object with token counts. You should log or inspect it for each workload.
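A small accumulator is enough to turn those usage objects into per-workload spend. This is a sketch: `SpendTracker` is a hypothetical helper, it reads only the `input_tokens` and `output_tokens` fields, and it prices them at the standard rates from this article. The sample usage values are made up.

```python
from collections import defaultdict


class SpendTracker:
    """Accumulates token counts per workload and prices them."""

    def __init__(self, input_rate: float = 5.0, output_rate: float = 25.0):
        self.input_rate = input_rate    # USD per million input tokens
        self.output_rate = output_rate  # USD per million output tokens
        self.totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})

    def record(self, workload: str, usage: dict) -> None:
        """Add one response's usage counts to the workload's running total."""
        self.totals[workload]["input_tokens"] += usage["input_tokens"]
        self.totals[workload]["output_tokens"] += usage["output_tokens"]

    def cost(self, workload: str) -> float:
        """Total spend in USD recorded so far for this workload."""
        t = self.totals[workload]
        return (t["input_tokens"] / 1_000_000 * self.input_rate
                + t["output_tokens"] / 1_000_000 * self.output_rate)


tracker = SpendTracker()
tracker.record("support-chat", {"input_tokens": 1_000, "output_tokens": 500})
tracker.record("support-chat", {"input_tokens": 1_200, "output_tokens": 300})
print(f"${tracker.cost('support-chat'):.4f}")  # $0.0310
```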
Apidog helps you test and compare real requests:
- Send an Opus 4.8 request and inspect the usage block
- Run the same prompt with low, high, and xhigh effort
- Compare input and output token counts
- Save requests for each workload
- Re-run requests as prompts change
- Mock the endpoint while developing to avoid spending tokens during local testing
A practical test flow:
1. Create one representative request.
2. Run it with effort = low.
3. Record input and output tokens.
4. Run it with effort = high.
5. Run it with effort = xhigh.
6. Compare cost and output quality.
7. Pick the cheapest effort level that meets your quality bar.
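Steps 6 and 7 can be automated once you have recorded the runs. The sketch below assumes you score each run yourself (the `quality` field is a hypothetical 0-to-1 score from your own eval). The token counts and scores in the example are placeholder values, not measurements.

```python
INPUT_RATE = 5.0    # USD per million input tokens (standard mode)
OUTPUT_RATE = 25.0  # USD per million output tokens (standard mode)


def run_cost(run: dict) -> float:
    """Cost in USD of one recorded run."""
    return (run["input_tokens"] / 1_000_000 * INPUT_RATE
            + run["output_tokens"] / 1_000_000 * OUTPUT_RATE)


def pick_effort(runs: list[dict], quality_bar: float):
    """Return the cheapest effort level whose quality meets the bar, or None."""
    passing = [r for r in runs if r["quality"] >= quality_bar]
    if not passing:
        return None
    return min(passing, key=run_cost)["effort"]


# Placeholder numbers for illustration only.
runs = [
    {"effort": "low", "input_tokens": 1_000, "output_tokens": 120, "quality": 0.70},
    {"effort": "high", "input_tokens": 1_000, "output_tokens": 600, "quality": 0.90},
    {"effort": "xhigh", "input_tokens": 1_000, "output_tokens": 1_500, "quality": 0.92},
]
print(pick_effort(runs, quality_bar=0.85))  # high
```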
FAQ
How much does Claude Opus 4.8 cost?
Claude Opus 4.8 costs $5 per million input tokens and $25 per million output tokens in standard mode.
Fast mode costs $10 per million input tokens and $50 per million output tokens.
Is Opus 4.8 more expensive than Opus 4.7?
No. Opus 4.8 and Opus 4.7 use the same per-token rates.
What is the difference between standard and fast mode pricing?
Fast mode doubles the per-token rate in exchange for output that streams about 2.5x faster. Use it when latency matters to a waiting user.
How do I lower Opus 4.8 costs?
Use these controls:
- Lower effort for simpler tasks
- Cache repeated prompt content
- Batch non-urgent jobs
- Set tight max_tokens limits
- Avoid fast mode unless latency matters
- Monitor real usage values in responses
Does prompt caching save money?
Yes. After the first cache write, repeated input is read at roughly a tenth of the normal input rate. Long-context agents usually save the most.
How many output tokens can Opus 4.8 produce?
Opus 4.8 supports up to 128K output tokens on the synchronous Messages API and up to 300K output tokens through the Batch API with the output-300k-2026-03-24 beta header.
Where do I see token usage per call?
Check the usage object in each Messages API response. Tools like Apidog surface it so you can compare cost across prompts, effort levels, and workloads.
