Hassann

Originally published at apidog.com

Claude Opus 4.8 Pricing: The Full Cost Breakdown

Claude Opus 4.8 costs $5 per million input tokens and $25 per million output tokens in standard mode. That matches Opus 4.7 pricing, so upgrading from 4.7 does not change your base token rate. Your real bill, however, depends on how you use the model: fast mode, effort, prompt caching, batching, and output limits can move costs significantly.


This guide shows how to estimate and reduce Opus 4.8 costs with practical examples. For the model overview, see what is Claude Opus 4.8. To start building, see the API guide.

The Opus 4.8 rate card

Mode | Input, per 1M tokens | Output, per 1M tokens | Speed
Standard | $5 | $25 | Baseline
Fast | $10 | $50 | 2.5x faster output

Two implementation details matter most:

  1. Output tokens are 5x more expensive than input tokens.

    Long answers, verbose tool traces, and unconstrained generations drive spend quickly.

  2. Fast mode doubles the per-token price.

    You pay $10/$50 per million tokens for output that streams 2.5x faster.

You can confirm current rates in Anthropic’s pricing docs.

When to use fast mode

Use standard mode by default.

Use fast mode only when latency directly affects the user experience, for example:

  • Live coding assistants
  • Interactive chat UIs
  • Agent workflows where a user is watching progress
  • Cursor-style autocomplete or editing tools

Avoid fast mode for:

  • Background jobs
  • Scheduled tasks
  • Offline evals
  • Data labeling
  • Bulk summarization
  • Agent loops that do not need real-time feedback

A simple rule:

If a human is waiting, consider fast mode.
If the job runs in the background, use standard mode.

Use effort to control token spend

Opus 4.8’s effort parameter controls how many tokens the model spends across the full response, including tool calls. Because output tokens cost more than input tokens, tuning effort is one of the most direct ways to reduce cost.

The available levels are:

  • low: terse answers, fewest tool calls, lowest spend
  • medium: balanced
  • high: default, more thorough
  • xhigh: deeper reasoning, more tool calls, recommended for coding
  • max: no constraints, highest spend

Use lower effort for deterministic or shallow tasks:

{
  "model": "claude-opus-4-8",
  "messages": [
    {
      "role": "user",
      "content": "Classify this support ticket as billing, bug, or feature request: ..."
    }
  ],
  "effort": "low",
  "max_tokens": 100
}

Use higher effort for tasks where reasoning quality matters:

{
  "model": "claude-opus-4-8",
  "messages": [
    {
      "role": "user",
      "content": "Analyze this repository and propose a safe refactor plan..."
    }
  ],
  "effort": "xhigh",
  "max_tokens": 4000
}

Anthropic’s effort guidance explains where each level tends to hold quality. The practical takeaway: do not run every workload at high or xhigh by default.

Cost formula

Use this formula to estimate standard-mode costs:

cost =
  (input_tokens / 1,000,000 * 5)
+ (output_tokens / 1,000,000 * 25)

For fast mode:

cost =
  (input_tokens / 1,000,000 * 10)
+ (output_tokens / 1,000,000 * 50)

For batch jobs with a 50% discount:

cost =
  ((input_tokens / 1,000,000 * 5)
+  (output_tokens / 1,000,000 * 25)) * 0.5
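
The three formulas above can be folded into one helper. This is a minimal sketch, not an official calculator: the function name `estimate_cost` and the `RATES` table are illustrative, the rates are copied from this article's rate card, and the batch discount is only modeled for standard mode because that is the only case described here.

```python
# Rates in USD per million tokens, copied from the rate card in this article.
RATES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}
BATCH_DISCOUNT = 0.5  # the 50% batch discount assumed in the formula above


def estimate_cost(input_tokens, output_tokens, mode="standard", batch=False):
    """Return the estimated cost in USD for one request or job."""
    rate = RATES[mode]
    cost = (
        input_tokens / 1_000_000 * rate["input"]
        + output_tokens / 1_000_000 * rate["output"]
    )
    if batch:
        # Assumption: the discount applies to standard-mode pricing only.
        if mode != "standard":
            raise ValueError("batch discount is only modeled for standard mode")
        cost *= BATCH_DISCOUNT
    return cost


if __name__ == "__main__":
    print(f"standard: ${estimate_cost(1_000, 500):.4f}")
    print(f"fast:     ${estimate_cost(1_000, 500, mode='fast'):.4f}")
    print(f"batch:    ${estimate_cost(1_000_000, 200_000, batch=True):.2f}")
```

Keeping the rates in one table means a price change only has to be edited in one place.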

Worked cost scenarios

The following examples use standard pricing: $5 input and $25 output per million tokens.

Actual costs depend on your prompt length, response length, tool calls, and selected effort.

Scenario 1: chatbot turn

Assume:

  • 1,000 input tokens
  • 500 output tokens

Calculation:

Input:  1,000 / 1,000,000 * $5  = $0.005
Output:   500 / 1,000,000 * $25 = $0.0125
Total:                               $0.0175

Approximate cost:

$0.018 per turn

If you lower effort and cap max_tokens, the output may shrink enough to bring the turn under one cent. With 1,000 input tokens already costing $0.005, that requires fewer than 200 output tokens.

Scenario 2: agentic coding task

Assume:

  • 50,000 input tokens of repository context
  • 8,000 output tokens
  • effort: "xhigh"

Calculation:

Input:  50,000 / 1,000,000 * $5  = $0.25
Output:  8,000 / 1,000,000 * $25 = $0.20
Total:                                $0.45

Approximate cost:

$0.45 per task

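If the 50K-token context repeats across calls, prompt caching can cut the repeated input cost from $0.25 to roughly $0.025, about a tenth of the normal input rate. With the $0.20 of output unchanged, each cached call costs about $0.225, or roughly: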

$0.23 per task

Scenario 3: overnight batch job

Assume:

  • 1,000,000 input tokens
  • 200,000 output tokens
  • Batch API with a 50% discount

Calculation:

Input:  1,000,000 / 1,000,000 * $5  * 0.5 = $2.50
Output:   200,000 / 1,000,000 * $25 * 0.5 = $2.50
Total:                                           $5.00

Approximate cost:

$5.00 for the whole batch

For comparison shopping against cheaper models, see the Gemini 3.5 Flash pricing breakdown and Xiaomi MiMo v2.5 API cost.

Prompt caching: reduce repeated input cost

If you send the same system prompt, document, or codebase on every request, you repeatedly pay full input-token pricing for content the model has already seen.

Prompt caching reduces that repeated input cost. After the initial cache write, cached input reads are charged at a fraction of the normal input rate, roughly a tenth.

Good candidates for caching:

  • Long system prompts
  • Product documentation
  • API references
  • Repository context
  • Style guides
  • Policy documents
  • Shared agent instructions

A common agent pattern looks like this:

First request:
  - Send full system prompt + repo context
  - Write cache
  - Pay normal input cost for cached content

Later requests:
  - Reuse cached prompt/context
  - Pay reduced cached-read cost
  - Send only the new task-specific input

Long-context coding agents usually benefit the most because the same repository context is reused across many calls.
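
To see how quickly caching pays off, the pattern above can be modeled in a few lines. This is a rough sketch under the article's assumptions: the first call pays the normal input rate, later calls pay about a tenth of it, and the cache stays warm for every call. The function name and constants are illustrative, and any surcharge on cache writes is ignored, so check the pricing docs before relying on the numbers.

```python
INPUT_RATE = 5.0         # USD per 1M input tokens, standard mode
CACHE_READ_FACTOR = 0.1  # assumption: cached reads cost about a tenth


def repeated_context_cost(context_tokens, calls, cached=True):
    """Input cost in USD of sending the same context on `calls` requests."""
    full_price = context_tokens / 1_000_000 * INPUT_RATE
    if not cached or calls == 0:
        return full_price * calls
    # First call writes the cache at the normal input rate (per the article);
    # every later call reads the cached context at the reduced rate.
    return full_price + (calls - 1) * full_price * CACHE_READ_FACTOR


if __name__ == "__main__":
    for calls in (1, 10, 100):
        plain = repeated_context_cost(50_000, calls, cached=False)
        warm = repeated_context_cost(50_000, calls, cached=True)
        print(f"{calls:>3} calls: ${plain:.3f} uncached vs ${warm:.3f} cached")
```

The saving grows with the number of calls that reuse the same context, which is why multi-step agents see the largest effect.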

Use the Batch API for non-real-time work

The Batch API is useful when you do not need an immediate response.

Use it for:

  • Evals
  • Bulk summarization
  • Data labeling
  • Report generation
  • Document processing
  • Offline analysis pipelines

The Batch API also raises the output ceiling. Opus 4.8 supports:

  • Up to 128K output tokens on the synchronous Messages API
  • Up to 300K output tokens through the Batch API with the output-300k-2026-03-24 beta header

Use synchronous calls for interactive workflows. Use batch calls when minutes of latency are acceptable and lower cost matters more than immediate output.
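
A batch submission is a list of independent requests, each with its own identifier. The sketch below only builds that list and makes no network call. It assumes the request shape used by Anthropic's Message Batches API (a `custom_id` plus a `params` object holding a normal Messages payload); the function name, the `doc-N` ID scheme, and the prompt wording are hypothetical, so confirm the shape against the current API reference before submitting.

```python
def build_batch_requests(documents, model="claude-opus-4-8", max_tokens=1000):
    """Turn a list of document strings into batch request entries."""
    requests = []
    for index, text in enumerate(documents):
        requests.append(
            {
                # custom_id lets you match each result back to its input.
                "custom_id": f"doc-{index}",
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": [
                        {
                            "role": "user",
                            "content": f"Summarize this document:\n\n{text}",
                        }
                    ],
                },
            }
        )
    return requests


if __name__ == "__main__":
    batch = build_batch_requests(["First report text", "Second report text"])
    print(len(batch), "requests ready to submit")
```

Setting max_tokens on every entry keeps the worst-case output cost of the whole batch predictable.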

Opus pricing across generations

Opus 4.8 keeps the same pricing as recent Opus releases:

Model | Input, per 1M tokens | Output, per 1M tokens
Opus 4.1 | $15 | $75
Opus 4.5 | $5 | $25
Opus 4.6 | $5 | $25
Opus 4.7 | $5 | $25
Opus 4.8 | $5 | $25

Opus pricing dropped from $15/$75 to $5/$25 at the 4.5 generation and has stayed there through 4.8.

For a head-to-head comparison against other flagship models, see Opus 4.8 vs GPT-5.5 vs Gemini 3.5.

Cost optimization checklist

Before scaling Opus 4.8 in production, apply these controls.

1. Set effort per workload

Do not use one global effort level.

Example mapping:

Task | Suggested effort
Classification | low
Extraction | low or medium
Customer support draft | medium
Technical explanation | high
Coding agent | xhigh
Open-ended deep reasoning | xhigh or max
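
One way to apply this mapping is to resolve the effort level in code rather than hard-coding it per call. The sketch below is illustrative: the task keys and the `build_request` helper are hypothetical, the payload mirrors the JSON examples earlier in this article, and where the table offers two levels it picks the cheaper one.

```python
# Task names are hypothetical keys; levels follow the mapping table above.
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",           # table allows low or medium
    "support_draft": "medium",
    "technical_explanation": "high",
    "coding_agent": "xhigh",
    "deep_reasoning": "xhigh",     # table allows xhigh or max
}
DEFAULT_EFFORT = "high"  # the article lists high as the default level


def build_request(task, prompt, max_tokens):
    """Build a request payload with the effort level chosen per task."""
    return {
        "model": "claude-opus-4-8",
        "messages": [{"role": "user", "content": prompt}],
        "effort": EFFORT_BY_TASK.get(task, DEFAULT_EFFORT),
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    request = build_request("classification", "Classify this ticket: ...", 100)
    print(request["effort"], request["max_tokens"])
```

Unknown task names fall back to the default level, so a new workload never silently runs at max.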

2. Cap max_tokens

Always set a reasonable max_tokens value.

{
  "effort": "low",
  "max_tokens": 150
}

This bounds the worst-case output cost per call.
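
The bound is easy to compute ahead of time. A minimal sketch, using the output rates from this article's rate card; the function name is illustrative.

```python
# Output rates in USD per million tokens, from the rate card in this article.
OUTPUT_RATE = {"standard": 25.0, "fast": 50.0}


def max_output_cost(max_tokens, mode="standard"):
    """Upper bound in USD on output spend for one call capped at max_tokens."""
    return max_tokens / 1_000_000 * OUTPUT_RATE[mode]


if __name__ == "__main__":
    for cap in (150, 4_000, 128_000):
        print(f"max_tokens={cap}: at most ${max_output_cost(cap):.5f} of output")
```

Multiplying this bound by expected call volume gives a ceiling on daily output spend for a workload.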

3. Cache repeated context

Cache anything large and stable:

system prompt
+ docs
+ repo context
+ style guide
+ shared tool instructions

Then send only task-specific content in each request.

4. Batch non-urgent work

Move offline work to the Batch API:

interactive user request -> synchronous API
overnight processing     -> Batch API
bulk evals               -> Batch API

5. Use standard mode unless latency is the product

Fast mode is useful, but expensive. Keep it limited to user-facing interactions where faster streaming improves the experience.

6. Track quota and usage

Rate limits, spend, and usage tiers change how you operate at scale. The Claude Code weekly limits change is a reminder to monitor quota, not just per-token pricing.

Track real spend with Apidog

Estimates are useful before launch, but production cost depends on actual responses.

Every Messages API response includes a usage object with token counts. You should log or inspect it for each workload.
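
A small helper can turn that usage object into a dollar figure for logging. This sketch treats the usage block as a plain dict and assumes the field names `input_tokens`, `output_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; verify them against the current API reference. It prices cache writes at the normal input rate and cache reads at a tenth, following this article's description rather than an official formula.

```python
def cost_from_usage(usage, input_rate=5.0, output_rate=25.0,
                    cache_read_factor=0.1):
    """Estimate the USD cost of one response from its usage block (a dict)."""

    def count(field):
        # Cache fields can be absent or null, so treat both as zero.
        return usage.get(field) or 0

    per_input_token = input_rate / 1_000_000
    per_output_token = output_rate / 1_000_000
    input_cost = (
        count("input_tokens") * per_input_token
        # Assumption: cache writes are billed like normal input.
        + count("cache_creation_input_tokens") * per_input_token
        + count("cache_read_input_tokens") * per_input_token * cache_read_factor
    )
    return input_cost + count("output_tokens") * per_output_token


if __name__ == "__main__":
    usage = {"input_tokens": 1_000, "output_tokens": 500}
    print(f"${cost_from_usage(usage):.4f}")
```

Logging this value next to a workload name makes it easy to see which workloads dominate the bill.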

Apidog helps you test and compare real requests:

  • Send an Opus 4.8 request and inspect the usage block
  • Run the same prompt with low, high, and xhigh effort
  • Compare input and output token counts
  • Save requests for each workload
  • Re-run requests as prompts change
  • Mock the endpoint while developing to avoid spending tokens during local testing

A practical test flow:

1. Create one representative request.
2. Run it with effort = low.
3. Record input and output tokens.
4. Run it with effort = high.
5. Run it with effort = xhigh.
6. Compare cost and output quality.
7. Pick the cheapest effort level that meets your quality bar.
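
The last two steps can be automated once the runs are recorded. This is a sketch under stated assumptions: each run is a dict with the token counts you logged, and `quality` is a hypothetical score from your own evaluation, since the API does not return one. The example numbers are placeholders for illustration, not measurements.

```python
def run_cost(run, input_rate=5.0, output_rate=25.0):
    """Standard-mode cost in USD of one recorded run."""
    return (
        run["input_tokens"] * input_rate + run["output_tokens"] * output_rate
    ) / 1_000_000


def pick_effort(runs, quality_bar):
    """Return the cheapest effort level whose quality meets the bar."""
    passing = [run for run in runs if run["quality"] >= quality_bar]
    if not passing:
        return None  # nothing qualifies: revisit the prompt or the bar
    return min(passing, key=run_cost)["effort"]


# Placeholder numbers for illustration only, not measurements.
example_runs = [
    {"effort": "low", "input_tokens": 1_000, "output_tokens": 200, "quality": 0.70},
    {"effort": "high", "input_tokens": 1_000, "output_tokens": 600, "quality": 0.90},
    {"effort": "xhigh", "input_tokens": 1_000, "output_tokens": 1_500, "quality": 0.92},
]

if __name__ == "__main__":
    print(pick_effort(example_runs, quality_bar=0.85))
```

Selecting by measured cost rather than by level name also covers cases where a higher effort level happens to produce a shorter answer.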

FAQ

How much does Claude Opus 4.8 cost?

Claude Opus 4.8 costs $5 per million input tokens and $25 per million output tokens in standard mode.

Fast mode costs $10 per million input tokens and $50 per million output tokens.

Is Opus 4.8 more expensive than Opus 4.7?

No. Opus 4.8 and Opus 4.7 use the same per-token rates.

What is the difference between standard and fast mode pricing?

Fast mode doubles the per-token rate in exchange for output that streams about 2.5x faster. Use it when latency matters to a waiting user.

How do I lower Opus 4.8 costs?

Use these controls:

  • Lower effort for simpler tasks
  • Cache repeated prompt content
  • Batch non-urgent jobs
  • Set tight max_tokens limits
  • Avoid fast mode unless latency matters
  • Monitor real usage values in responses

Does prompt caching save money?

Yes. After the first cache write, repeated input is read at roughly a tenth of the normal input rate. Long-context agents usually save the most.

How many output tokens can Opus 4.8 produce?

Opus 4.8 supports up to 128K output tokens on the synchronous Messages API and up to 300K output tokens through the Batch API with the output-300k-2026-03-24 beta header.

Where do I see token usage per call?

Check the usage object in each Messages API response. Tools like Apidog surface it so you can compare cost across prompts, effort levels, and workloads.
