DeepSeek released V4 pricing on April 23, 2026, resetting expectations for frontier AI costs. V4-Flash starts at $0.14 per million input tokens and $0.28 per million output tokens. V4-Pro is priced at $1.74 input and $3.48 output per million tokens. Both support a 1M-token context window and up to 384K output tokens, with a cache-hit discount that cuts input costs by 80–90% on repeated prompts.
This guide covers the full rate card, how context caching affects real per-call costs, a comparison with GPT-5.5 and Claude Opus, and four rules for keeping your spend predictable—plus how to track it all in Apidog.
For additional details, see what is DeepSeek V4, the DeepSeek V4 API walkthrough, and how to use DeepSeek V4 for free.
TL;DR
- V4-Flash: $0.14 / M input (cache miss), $0.028 / M input (cache hit), $0.28 / M output
- V4-Pro: $1.74 / M input (cache miss), $0.145 / M input (cache hit), $3.48 / M output
- Context window: 1M tokens input, 384K tokens output on both
- Cache-hit discount: ~80% off Flash, ~92% off Pro on repeated prefixes
- `deepseek-chat` and `deepseek-reasoner` are deprecated July 24, 2026; billing maps to V4-Flash
- At cache-miss rates, V4-Pro is ~2.9x cheaper than GPT-5.5 on input and ~8.6x cheaper on output
The Full Rate Card
| Model | Input (cache miss) | Input (cache hit) | Output | Context |
|---|---|---|---|---|
| `deepseek-v4-flash` | $0.14 / M | $0.028 / M | $0.28 / M | 1M / 384K |
| `deepseek-v4-pro` | $1.74 / M | $0.145 / M | $3.48 / M | 1M / 384K |
| `deepseek-chat` (deprecated) | maps to V4-Flash non-thinking | — | — | — |
| `deepseek-reasoner` (deprecated) | maps to V4-Flash thinking | — | — | — |
Key implementation details:
- Pricing is set by model ID—thinking/non-thinking mode only affects how many tokens you consume, not the rate.
- Cache-hit pricing is automatic—any repeated prefix ≥1,024 tokens (byte-for-byte match) within the same account gets discounted input pricing. No setup required.
- Old model IDs (`deepseek-chat`, `deepseek-reasoner`) are now V4-Flash aliases. If you haven’t migrated, you’re already billed at V4-Flash rates. The deprecation deadline is July 24, 2026.
Context Caching Explained
Context caching is the biggest lever to reduce DeepSeek V4 costs. Any repeated content across calls—such as long system prompts, agent schemas, or RAG context—is billed at a heavily discounted input rate after the first call.
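Structuring requests to exploit this is mostly a matter of message ordering: keep the static content as a byte-identical prefix and append only the dynamic turn. A minimal sketch (Python with `requests`, OpenAI-style payload per the endpoint used later in this guide); the prompt file and model choice are illustrative:

```python
import os
import requests

API_URL = "https://api.deepseek.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"}

# Static prefix: repeat it byte-for-byte across calls so any prefix
# >= 1,024 tokens is billed at the cache-hit input rate.
SYSTEM_PROMPT = open("agent_system_prompt.txt").read()  # illustrative path

def ask(question: str) -> dict:
    payload = {
        "model": "deepseek-v4-flash",
        "messages": [
            # Identical prefix first -> eligible for cache-hit pricing.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Dynamic content last -> billed at the cache-miss rate.
            {"role": "user", "content": question},
        ],
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```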
Example: Agent with Static System Prompt
Suppose you run an agent on V4-Pro with a 20,000-token system prompt, 100 user questions (200 tokens each), and ~500-token responses.
Without caching:
- Input: 100 × 20,200 × $1.74 / M = $3.51
- Output: 100 × 500 × $3.48 / M = $0.17
- Total: $3.69
With caching (1 miss, 99 hits):
- First input: 20,200 × $1.74 / M = $0.035
- 99 cache-hit prefixes: 99 × 20,000 × $0.145 / M = $0.287
- 99 user turns: 99 × 200 × $1.74 / M = $0.034
- Output: 100 × 500 × $3.48 / M = $0.174
- Total: $0.53 (~7x cheaper)
On V4-Flash, the relative discount is slightly smaller (~80% vs ~92%), but the absolute numbers are near-negligible: at Flash rates the same workload runs about $0.30 uncached and $0.08 cached.
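To sanity-check the arithmetic (or rerun it for your own prompt sizes), here is a short script reproducing the V4-Pro example above; the rates come from the rate card and the token counts from the example:

```python
# V4-Pro rates, $ per million tokens (from the rate card above).
IN_MISS, IN_HIT, OUT = 1.74, 0.145, 3.48

calls, system_toks, user_toks, out_toks = 100, 20_000, 200, 500

# Without caching: every call pays the cache-miss rate on the full prompt.
no_cache = (calls * (system_toks + user_toks) * IN_MISS
            + calls * out_toks * OUT) / 1e6

# With caching: 1 miss on the full first prompt, then 99 discounted prefixes.
cached = ((system_toks + user_toks) * IN_MISS
          + (calls - 1) * system_toks * IN_HIT
          + (calls - 1) * user_toks * IN_MISS
          + calls * out_toks * OUT) / 1e6

print(f"no cache: ${no_cache:.2f}, cached: ${cached:.2f}, "
      f"savings: {no_cache / cached:.1f}x")
# -> no cache: $3.69, cached: $0.53, savings: 7.0x
```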
Comparing DeepSeek V4 to GPT-5.5 and Claude
| Model | Input (std) | Input (cached) | Output | Context |
|---|---|---|---|---|
| DeepSeek V4-Flash | $0.14 / M | $0.028 / M | $0.28 / M | 1M |
| DeepSeek V4-Pro | $1.74 / M | $0.145 / M | $3.48 / M | 1M |
| GPT-5.5 | $5 / M | $1.25 / M | $30 / M | 1M |
| GPT-5.5 Pro | $30 / M | — | $180 / M | 1M |
| Claude Opus 4.6 | $15 / M | $1.50 / M | $75 / M | 200K |
Takeaways:
- V4-Pro is ~8.6x cheaper than GPT-5.5 and ~21x cheaper than Claude Opus 4.6 on output tokens.
- Cached input: V4-Pro is ~8.6x cheaper than GPT-5.5 and ~10x cheaper than Claude.
- Benchmarking: V4-Pro scores 93.5 on LiveCodeBench, top tier alongside GPT-5.5, and beats it on Codeforces (3206 vs 3168). For full benchmarks, see what is DeepSeek V4.
Caveats: Claude outperforms V4-Pro on long-context retrieval, and Gemini 3.1 Pro leads on MMLU-Pro. If your workload depends on long-context retrieval, weigh quality vs. price savings.
Cost Modeling for Common Workloads
Here’s what typical workloads cost on V4-Pro (cache-miss baseline):
1. Agentic Coding Loop (50K context, 2K output, 20 calls)
Input: 50,000 × 20 × $1.74 / M = $1.74
Output: 2,000 × 20 × $3.48 / M = $0.14
Per-task cost: ~$1.88
GPT-5.5: ~$6.20 per task.
2. Long-Document Q&A (500K context, 1K output)
Input: 500,000 × $1.74 / M = $0.87
Output: 1,000 × $3.48 / M = $0.003
Per-call cost: ~$0.87
GPT-5.5: ~$2.53 per call.
3. High-Volume Classification (2K context, 200 output, 10,000 calls)
Use V4-Flash; V4-Pro is overkill.
Input: 2,000 × 10,000 × $0.14 / M = $2.80
Output: 200 × 10,000 × $0.28 / M = $0.56
Run cost: ~$3.36
GPT-5.5: ~$160 per run at the rates above.
4. Repeated-Prompt Chatbot (10K system, 500 user, 1K output, 1,000 sessions)
First input: 10,500 × $1.74 / M = $0.018
Cache-hit input: 999 × 10,000 × $0.145 / M = $1.45
Cache-miss user: 999 × 500 × $1.74 / M = $0.87
Output: 1,000 × 1,000 × $3.48 / M = $3.48
Session run cost: ~$5.82
GPT-5.5 (with caching): ~$45 per run at the rates above.
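All four workloads follow the same arithmetic, so a small estimator is handy for plugging in your own numbers. Rates are from the rate card; the cache model mirrors the rules above (first call is a miss, later calls get the hit rate on the repeated prefix):

```python
def v4_cost(calls, in_toks, out_toks, model="pro", cached_prefix=0):
    """Estimate dollar spend for a workload on DeepSeek V4.

    cached_prefix: tokens per call billed at the cache-hit rate
    (applies to every call after the first).
    """
    rates = {"pro": (1.74, 0.145, 3.48), "flash": (0.14, 0.028, 0.28)}
    miss, hit, out = rates[model]
    first = in_toks * miss  # first call is always a full cache miss
    rest = (calls - 1) * (cached_prefix * hit + (in_toks - cached_prefix) * miss)
    return (first + rest + calls * out_toks * out) / 1e6

# Workload 1: agentic coding loop
print(f"${v4_cost(20, 50_000, 2_000):.2f}")                           # ~$1.88
# Workload 4: repeated-prompt chatbot
print(f"${v4_cost(1_000, 10_500, 1_000, cached_prefix=10_000):.2f}")  # ~$5.82
```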
Hidden Costs to Watch
Be aware of these cost traps:
- Thinking-mode token inflation: `thinking_max` burns 3–10x more output tokens. Only use Think Max for critical tasks.
- Silent context growth: Agent loops that feed entire conversations back into each turn can balloon costs. Truncate or summarize aggressively.
- Retry storms: Uncapped retries (e.g., on every HTTP 500) can quickly double your bill. Implement exponential backoff and set a hard retry cap (see the sketch after this list).
- Development churn: Iterating with raw curl replays the full context each time. Use Apidog for variable substitution and to avoid unnecessary prompt replays.
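A minimal cap-and-backoff sketch, assuming the `requests`-based client from earlier; the cap and delays are illustrative defaults:

```python
import time
import requests

MAX_RETRIES = 3  # hard cap: a stuck endpoint should fail fast, not 10x your bill

def post_with_backoff(url: str, **kwargs) -> requests.Response:
    for attempt in range(MAX_RETRIES + 1):
        resp = requests.post(url, **kwargs)
        if resp.status_code < 500:
            resp.raise_for_status()  # 4xx: fix the request, don't retry it
            return resp
        if attempt == MAX_RETRIES:
            resp.raise_for_status()  # out of retries: surface the 5xx
        time.sleep(2 ** attempt)     # 1s, 2s, 4s between transient failures

# Usage: resp = post_with_backoff(API_URL, headers=HEADERS, json=payload, timeout=60)
```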
Track Cost in Apidog
Optimize workflow and avoid surprises:
- Download Apidog and store your `DEEPSEEK_API_KEY` as a secret per environment.
- Save a POST request to `https://api.deepseek.com/v1/chat/completions`.
- In the response panel, pin `usage.prompt_tokens`, `usage.completion_tokens`, and `usage.reasoning_tokens`—you’ll see cost metrics with every call.
- Parameterize `model` and `thinking_mode` so you can A/B V4-Flash vs V4-Pro, and Non-Think vs Think Max, without duplicating requests.
- Mirror the collection for GPT-5.5 using the GPT-5.5 API guide. One window, both providers, full cost transparency.
This setup catches most of the cost surprises that would otherwise show up on invoices.
Four Rules to Keep Spend Predictable
- Default to V4-Flash. Use V4-Pro only if a measurable quality gap impacts revenue.
- Default to Non-Think. Escalate to Think High as needed; reserve Think Max for correctness-critical work.
- Cap `max_tokens`. The 384K output ceiling is a safety net, not a target. Production answers usually fit in 2K.
- Ship usage telemetry. Log `prompt_tokens`, `completion_tokens`, and `reasoning_tokens` on every call. Alert on reasoning-token spikes—they often signal prompt drift into Think Max.
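A minimal sketch of rule 4; the usage field names are the ones pinned in Apidog above, and the alert threshold is an illustrative assumption you should tune to your own baseline:

```python
import logging

logger = logging.getLogger("llm.usage")
REASONING_ALERT = 5_000  # illustrative threshold; tune to your traffic

def log_usage(response_json: dict, request_id: str) -> None:
    usage = response_json.get("usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    reasoning = usage.get("reasoning_tokens", 0)
    logger.info("req=%s prompt=%d completion=%d reasoning=%d",
                request_id, prompt, completion, reasoning)
    if reasoning > REASONING_ALERT:
        # Spikes here often mean the prompt drifted into Think Max.
        logger.warning("req=%s reasoning-token spike: %d", request_id, reasoning)
```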
FAQ
Is there a free tier?
No usage-free API tier, but new accounts may get a trial credit. For zero-cost options, see how to use DeepSeek V4 for free.
How does cache-hit pricing work?
Prefixes ≥1,024 tokens that repeat across requests in the same account are billed at the cache-hit rate. First call is full rate; subsequent identical-prefix calls are discounted. Caching is automatic.
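If you want to see the discount in your own telemetry rather than just on the invoice, send two requests sharing a long prefix and compare the reported usage. Earlier DeepSeek API versions exposed cache counters such as `prompt_cache_hit_tokens`; treat those field names as assumptions and verify against the live docs:

```python
# Reuses ask() from the caching sketch above; both calls share the same
# long system prompt, so the second should land on the cache.
first = ask("Summarize our refund policy.")
second = ask("Summarize our refund policy in one sentence.")

for label, resp in (("first", first), ("second", second)):
    usage = resp.get("usage", {})
    # Field names from earlier DeepSeek APIs -- assumption, check live docs.
    print(label,
          "hit:", usage.get("prompt_cache_hit_tokens", "n/a"),
          "miss:", usage.get("prompt_cache_miss_tokens", "n/a"))
# Expect ~0 hit tokens on the first call; on the second, the shared
# system-prompt prefix (if >= 1,024 tokens) should count as hits.
```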
Do thinking modes cost more?
Per-token rates are unchanged. Thinking modes generate more tokens (reasoning traces). Monitor `reasoning_tokens` in the usage object to assess real cost.
Is pricing stable?
DeepSeek updates pricing periodically. V3.2 rates lasted most of 2025; V4 pricing has no published end-date. Always check the live pricing page before budgeting.
Are V4-Pro and V4-Flash output rates the same?
No. V4-Pro output is $3.48 / M; V4-Flash is $0.28 / M. The 12.4x difference is the main reason to default to V4-Flash.
Does the Anthropic-format endpoint change pricing?
No. `https://api.deepseek.com/anthropic` uses the same pricing as the OpenAI-format endpoint. Format does not affect cost.
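For completeness, a sketch of hitting the Anthropic-format endpoint by pointing the official `anthropic` SDK at the base URL above; the model ID follows the rate card, but verify SDK compatibility against DeepSeek's docs:

```python
import os
import anthropic

# Point the Anthropic SDK at DeepSeek's Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/anthropic",
)

message = client.messages.create(
    model="deepseek-v4-flash",  # same rate card as the OpenAI-format endpoint
    max_tokens=2_000,           # rule 3: cap output, don't lean on the 384K ceiling
    messages=[{"role": "user", "content": "Ping"}],
)
print(message.usage)  # token counts for cost tracking
```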