Hassann

Posted on May 25 • Originally published at apidog.com

DeepSeek V4-Pro 75% Price Cut Is Now Permanent: What It Means for Developers (2026)

DeepSeek turned the most aggressive temporary discount in 2026 LLM pricing into the new normal. On May 22, the team announced that the 75% off DeepSeek-V4-Pro offer, originally set to expire on May 31, 2026 at 15:59 UTC, would not roll back. The promotional rate becomes the permanent list price: input drops to $0.435 per million tokens, output to $0.87, and cache hits to $0.003625. Here’s what changed, what stayed the same, and what API developers should update in their cost models this week.


TL;DR

DeepSeek-V4-Pro API pricing is now permanent at 1/4 of the original list price: $0.435/MTok input, $0.87/MTok output, $0.003625/MTok cache hit.
The 75% promo discount that was set to end May 31, 2026 is now the regular rate.
V4-Pro is now roughly 34x cheaper than GPT-5.5 on output while landing within 3 to 7 percentage points of GPT-5.5 on most coding and reasoning benchmarks.
The cache-hit price is the implementation detail to optimize for. Long, stable system prompts can become almost free at the prefix.
If you priced AI features against GPT-5.5 or Claude Opus 4.7 last quarter, rerun the math before you defer anything on cost.

Why this matters now

LLM pricing usually moves down slowly, with caveats. DeepSeek removed the main caveat: the discount does not expire. The team ran an aggressive promo through May, watched developer traffic climb, and locked the rate in instead of rolling it back.

If your product calls an LLM in a hot path—autocomplete, RAG chat, code review, agent loops—the difference between $3.48 and $0.87 per million output tokens shows up quickly.

For example:

50M output tokens/day × $3.48 / 1M × 30 days = $5,220/month
50M output tokens/day × $0.87 / 1M × 30 days = $1,305/month

That is a roughly $3,915/month reduction on output tokens alone.
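The same arithmetic generalizes to any volume. Here is a minimal sketch; the helper name `monthlyOutputCost` and the 30-day month are illustrative assumptions, not anything taken from DeepSeek's billing docs:

```javascript
// Hypothetical helper: monthly output spend for a given daily volume.
// Rates are USD per million output tokens, taken from the figures above.
const OLD_OUTPUT_RATE = 3.48;
const NEW_OUTPUT_RATE = 0.87;

function monthlyOutputCost(outputTokensPerDay, ratePerMTok, days = 30) {
  return (outputTokensPerDay / 1_000_000) * ratePerMTok * days;
}

const before = monthlyOutputCost(50_000_000, OLD_OUTPUT_RATE);
const after = monthlyOutputCost(50_000_000, NEW_OUTPUT_RATE);

console.log(before.toFixed(2)); // 5220.00
console.log(after.toFixed(2)); // 1305.00
console.log((before - after).toFixed(2)); // 3915.00
```

Swap in your own daily token count to see what the cut is worth for your traffic.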

Building on top of DeepSeek? Apidog lets you generate, test, and monitor V4-Pro API calls in one workspace, including streaming, tool calls, and JSON schema validation.

In the rest of this post, we’ll turn the announcement into implementation steps: pricing math, model comparisons, cache-hit design, workload routing, and a practical migration checklist.

What changed: the announcement decoded

DeepSeek’s official pricing notice is short, but three points matter for developers:

1. The 75% discount is permanent.

The promo running through May 31, 2026 15:59 UTC was supposed to revert to the launch list price on June 1. It will not. The promo rate is now the list rate.

2. The cut applies to V4-Pro only.

DeepSeek-V4-Flash, at $0.14 / $0.28 per million tokens, was already cheap. V4-Pro is the frontier-tier model whose price dropped. See What is DeepSeek V4 for the Flash vs Pro split.

3. Cache-hit pricing was cut to 1/10 of launch, effective April 26, 2026 12:15 UTC.

This stacks with the headline cut. The result is cache hits at $0.003625/MTok.

Read together, the announcement points to a clear developer strategy: make V4-Pro cheap enough to become the default model for agentic and long-context workloads, then rely on usage volume.

The new permanent price sheet

Pricing per 1 million tokens, USD, effective immediately and permanent:

Token type	Old list	New permanent	Cut
Input, cache miss	$1.74	$0.435	75%
Input, cache hit	$0.0145	$0.003625	75%
Output	$3.48	$0.87	75%

Implementation takeaways:

Output cost is the big invoice lever. Agent loops, code generation, summarization, and content tools often produce large outputs.
Cache hits change prompt architecture. Input miss to input hit is roughly 120:1. Stable prefixes now matter a lot.
These rates apply to the API only. DeepSeek’s web chat remains free for individuals.
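The price sheet is small enough to keep as data and sanity-check in code. The sketch below assumes nothing beyond the table values; the object layout and the helper name `cutPercent` are illustrative:

```javascript
// Price sheet from the table above, USD per million tokens.
const PRICES = {
  inputMiss: { old: 1.74, current: 0.435 },
  inputHit: { old: 0.0145, current: 0.003625 },
  output: { old: 3.48, current: 0.87 },
};

// Percentage cut for one row, rounded to one decimal place.
function cutPercent({ old, current }) {
  return Math.round((1 - current / old) * 1000) / 10;
}

for (const [tokenType, row] of Object.entries(PRICES)) {
  console.log(tokenType, `${cutPercent(row)}%`);
}

// How many cached input tokens cost the same as one uncached token.
const missToHit = Math.round(
  PRICES.inputMiss.current / PRICES.inputHit.current
);
console.log(`miss:hit = ${missToHit}:1`); // miss:hit = 120:1
```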

For more historical context on V4 pricing tiers and Flash-vs-Pro tradeoffs, see the DeepSeek V4 API Pricing reference.

How V4-Pro compares to GPT-5.5, Claude Opus 4.7, and Gemini 3.5 Flash

The useful comparison is not V4-Pro versus its old price. It is V4-Pro versus other frontier and near-frontier models.

Model	Input ($/MTok)	Output ($/MTok)	SWE-bench Pro
DeepSeek-V4-Pro, new	$0.435	$0.87	55.4%
GPT-5.5	$5.00	$30.00	58.6%
Claude Opus 4.7	$3.00	$15.00	~62%
Gemini 3.5 Flash	~$1.50	~$9.00	~48%
DeepSeek-V4-Flash	$0.14	$0.28	~42%

Two numbers matter:

On output tokens, DeepSeek-V4-Pro is 34x cheaper than GPT-5.5.
On public coding and reasoning evals, V4-Pro lands within 3 to 7 percentage points of GPT-5.5 on most benchmarks, according to the DataCamp comparison.
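To see both numbers side by side for every row, the comparison table can be reduced to a price multiple and a benchmark gap relative to V4-Pro. This is a rough sketch over the table values only; entries the table marks with "~" are approximate there and remain approximate here:

```javascript
// Rows copied from the comparison table above (output price in USD per
// million tokens, SWE-bench Pro in percent).
const MODELS = [
  { name: "DeepSeek-V4-Pro", output: 0.87, sweBenchPro: 55.4 },
  { name: "GPT-5.5", output: 30.0, sweBenchPro: 58.6 },
  { name: "Claude Opus 4.7", output: 15.0, sweBenchPro: 62 },
  { name: "Gemini 3.5 Flash", output: 9.0, sweBenchPro: 48 },
  { name: "DeepSeek-V4-Flash", output: 0.28, sweBenchPro: 42 },
];

const baseline = MODELS.find((m) => m.name === "DeepSeek-V4-Pro");

// Output price multiple and benchmark gap relative to V4-Pro.
const comparison = MODELS.map((m) => ({
  name: m.name,
  outputPriceMultiple: Number((m.output / baseline.output).toFixed(1)),
  sweBenchGapPoints: Number((m.sweBenchPro - baseline.sweBenchPro).toFixed(1)),
}));

console.table(comparison);
```

A positive gap means the other model scores higher than V4-Pro; a multiple above 1 means it costs more per output token.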

If your workload is latency-tolerant and quality-acceptable in that band, migration becomes a cost-routing problem. If the last few benchmark points matter, V4-Pro can still be useful as a draft model, fallback model, or first-pass model behind a critic.

For deeper head-to-head reviews, see DeepSeek V4 vs Claude Opus 4.5 for coding and GLM-5 vs DeepSeek V3 vs GPT-5: speed, cost, and practical developer comparison.

The cache-hit angle most articles miss

The $0.87 output price is obvious. The $0.003625 cache-hit input price is where implementation choices matter.

DeepSeek’s prompt cache hits when the prefix of your request is byte-identical to a recent prior request, within roughly a 30-minute window. For chat agents and retrieval pipelines, the prefix is usually:

system prompt
tool definitions
JSON schema instructions
few-shot examples
safety or formatting rules

That prefix often sits around 4,000 to 10,000 tokens and changes rarely.

Example: 100,000 chat turns/day

Assume:

System prompt: 6,000 tokens
User message: 200 tokens
Average response: 800 output tokens
Traffic: 100,000 turns/day

Without cache hits:

100,000 × 6,200 input tokens × $0.435 / 1,000,000
= $269.70/day on input

With 90% of the system-prompt tokens hitting cache:

Per-turn input cost, before dividing by 1,000,000:

200 × $0.435
+
6,000 × ((0.9 × $0.003625) + (0.1 × $0.435))
= $367.575

Divide by 1,000,000 and multiply by 100,000 turns.

That comes out to about $36.76/day on input, an 86% reduction.
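The calculation can be parameterized so you can plug in your own prefix size and hit ratio. This is a sketch under the assumptions listed above; `dailyInputCost` is a hypothetical helper, and it assumes user-message tokens never hit the cache:

```javascript
// Rates in USD per million input tokens, from the price sheet above.
const INPUT_MISS_RATE = 0.435;
const INPUT_HIT_RATE = 0.003625;

// Hypothetical helper: daily input cost when a fraction of the prefix
// tokens is served from cache. User tokens are always billed as misses.
function dailyInputCost({ turns, prefixTokens, userTokens, prefixHitRatio }) {
  const blendedPrefixRate =
    prefixHitRatio * INPUT_HIT_RATE + (1 - prefixHitRatio) * INPUT_MISS_RATE;
  const perTurn =
    userTokens * INPUT_MISS_RATE + prefixTokens * blendedPrefixRate;
  return (perTurn / 1_000_000) * turns;
}

const workload = { turns: 100_000, prefixTokens: 6_000, userTokens: 200 };
const noCache = dailyInputCost({ ...workload, prefixHitRatio: 0 });
const withCache = dailyInputCost({ ...workload, prefixHitRatio: 0.9 });

console.log(noCache.toFixed(2)); // 269.70
console.log(withCache.toFixed(2)); // 36.76
console.log(`${((1 - withCache / noCache) * 100).toFixed(0)}% lower`); // 86% lower
```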

For more on how prefix caching works across providers, see the prompt caching deep dive.

How to design prompts for cache hits

Use these patterns in real agents:

1. Pin the prefix

Keep stable content at the start of every request:

SYSTEM:
- Role and behavior
- Tool schemas
- JSON output rules
- Few-shot examples
- Static product constraints

USER:
- Current user input
- Request-specific context
- Session-specific metadata

Avoid putting timestamps, request IDs, user IDs, or retrieved snippets inside the system prompt.
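One way to enforce that split is to build messages through a single function that never interpolates into the system prompt. The following is a minimal sketch; the prompt text, the field names, and the `buildMessages` helper are all hypothetical:

```javascript
// Stable prefix: defined once, never interpolated per request.
// The prompt text is a placeholder for illustration.
const SYSTEM_PROMPT = [
  "You are a support agent for an example product.",
  "Always answer in valid JSON.",
].join("\n");

// Hypothetical helper: everything that varies per request goes into the
// user message, after the byte-identical system message.
function buildMessages({ userInput, sessionId, retrievedSnippets = [] }) {
  const dynamicContext = [
    `session: ${sessionId}`,
    ...retrievedSnippets.map((s, i) => `snippet ${i + 1}: ${s}`),
  ].join("\n");

  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `${dynamicContext}\n\n${userInput}` },
  ];
}

const a = buildMessages({ userInput: "Reset my password", sessionId: "s-1" });
const b = buildMessages({
  userInput: "Cancel my order",
  sessionId: "s-2",
  retrievedSnippets: ["Orders can be cancelled within 24 hours."],
});

console.log(a[0].content === b[0].content); // true
```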

2. Keep tool schemas stable

If your tool definitions are generated dynamically, sort keys and keep ordering deterministic.

Bad:

{
  "tools": [
    { "name": "search_docs", "description": "..." },
    { "name": "create_ticket", "description": "..." }
  ],
  "request_id": "req_2026_05_22_abc"
}

Better:

{
  "tools": [
    { "name": "create_ticket", "description": "..." },
    { "name": "search_docs", "description": "..." }
  ]
}

Put request-specific values in the user message or metadata layer instead.
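If tool definitions come from code rather than a static file, a small normalization step can keep the serialized bytes stable. This sketch assumes each tool is a plain object with a `name` field; `sortKeysDeep` and `serializeTools` are illustrative helpers, not part of any SDK:

```javascript
// Recursively rebuild a value with object keys in sorted order so that
// JSON.stringify produces the same bytes for equivalent inputs.
function sortKeysDeep(value) {
  if (Array.isArray(value)) return value.map(sortKeysDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value)
        .sort()
        .map((key) => [key, sortKeysDeep(value[key])])
    );
  }
  return value;
}

// Hypothetical helper: order tools by name, then normalize key order.
function serializeTools(tools) {
  const ordered = [...tools].sort((x, y) =>
    x.name < y.name ? -1 : x.name > y.name ? 1 : 0
  );
  return JSON.stringify(sortKeysDeep(ordered));
}

const first = serializeTools([
  { name: "search_docs", description: "Search the docs" },
  { description: "Open a ticket", name: "create_ticket" },
]);
const second = serializeTools([
  { name: "create_ticket", description: "Open a ticket" },
  { description: "Search the docs", name: "search_docs" },
]);

console.log(first === second); // true
```

The comparison uses plain string ordering rather than a locale-aware comparator, so the result does not depend on the machine's locale.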

3. Sort or hash dynamic context

If you append retrieved chunks, sort them stably. If identical requests are common, hash the normalized context and route matching hashes consistently.

Small prefix changes can invalidate the cache.
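A rough sketch of the sort-then-hash idea follows. To stay dependency-free it uses FNV-1a, a small non-cryptographic hash, which is a swapped-in choice for illustration; a production router would more likely use SHA-256 from a crypto library. The `normalizeContext` helper and the `---` separator are hypothetical:

```javascript
// FNV-1a (32-bit), used here only to keep the sketch dependency-free.
// It hashes UTF-16 code units, not bytes.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical helper: trim and sort chunks so retrieval order does not
// change the serialized context, then derive a routing key from it.
function normalizeContext(chunks) {
  const sorted = chunks.map((c) => c.trim()).sort();
  const text = sorted.join("\n---\n");
  return { text, key: fnv1a(text) };
}

const run1 = normalizeContext(["Refund policy: 30 days.", "Shipping: 3-5 days."]);
const run2 = normalizeContext(["Shipping: 3-5 days. ", "Refund policy: 30 days."]);

console.log(run1.text === run2.text); // true
console.log(run1.key === run2.key); // true
```

Requests that share a key can then be sent to the same worker or account, which raises the odds that the provider sees an identical prefix.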

4. Warm up the prefix

On agent startup, send one request with the full stable prefix before user traffic arrives. This seats the prefix in the provider cache.
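A warm-up call might look like the sketch below. It assumes the endpoint and model name used in this article's curl smoke test, and that `max_tokens` is accepted as in other OpenAI-compatible APIs. Both helper names are hypothetical, and whether a one-token request is enough to seat the cache is something to verify against your own usage logs:

```javascript
// Assumption: same endpoint and model name as the curl smoke test.
const DEEPSEEK_URL = "https://api.deepseek.com/chat/completions";

// Hypothetical helper: build a request whose messages start with the exact
// stable prefix used by real traffic, and cap output to keep it cheap.
function buildWarmupRequest(apiKey, stablePrefixMessages) {
  return {
    url: DEEPSEEK_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "deepseek-v4-pro",
        messages: [...stablePrefixMessages, { role: "user", content: "ping" }],
        max_tokens: 1,
      }),
    },
  };
}

// Call once at startup. Uses the global fetch available in Node 18+.
async function warmUpPrefix(apiKey, stablePrefixMessages) {
  const { url, options } = buildWarmupRequest(apiKey, stablePrefixMessages);
  const response = await fetch(url, options);
  return response.ok;
}
```

Building the request separately from sending it makes it easy to assert in tests that the warm-up prefix is byte-identical to the production one.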

Quick API smoke test

If your current provider uses an OpenAI-compatible request shape, start with a minimal smoke test against DeepSeek.

export DEEPSEEK_API_KEY="YOUR_API_KEY"

curl https://api.deepseek.com/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise coding assistant. Return valid JSON when asked."
      },
      {
        "role": "user",
        "content": "Write a JavaScript function that calculates token cost from input tokens, output tokens, and per-million-token rates."
      }
    ]
  }'

Then test the same prompt against your current model and compare:

response quality
latency
tool-call shape
JSON validity
retry rate
total cost per request
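Several of these checks can be scripted from the response body. The sketch below assumes an OpenAI-style body with `choices[0].message.content` and `usage.prompt_tokens` / `usage.completion_tokens`; confirm the field names against the responses you actually receive. `scoreResponse` is a hypothetical helper, and latency is measured by the caller:

```javascript
// Rates in USD per million tokens; pass the rates of whichever model
// produced the response.
const V4_PRO_RATES = { input: 0.435, output: 0.87 };

function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: reduce one chat completion to comparable metrics.
function scoreResponse(body, latencyMs, rates) {
  const content = body.choices[0].message.content ?? "";
  const { prompt_tokens, completion_tokens } = body.usage;
  const cost =
    (prompt_tokens / 1_000_000) * rates.input +
    (completion_tokens / 1_000_000) * rates.output;
  return { latencyMs, jsonValid: isValidJson(content), cost };
}

// Hand-written sample body, not a recorded API response.
const sample = {
  choices: [{ message: { role: "assistant", content: '{"ok": true}' } }],
  usage: { prompt_tokens: 6200, completion_tokens: 800 },
};

console.log(scoreResponse(sample, 420, V4_PRO_RATES));
```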

For a hands-on walkthrough of the V4-Pro endpoint shape, see How to use the DeepSeek V4 API.

What you should do this week

The migration decision is not binary. Route by workload.

1. Measure your output:input ratio

Start with actual production traces. Compute token spend by route:

// USD per million tokens (DeepSeek-V4-Pro, cache-miss input).
const INPUT_RATE = 0.435;
const OUTPUT_RATE = 0.87;

// Estimate the cost of one request from its token counts.
function estimateCost({ inputTokens, outputTokens }) {
  const inputCost = (inputTokens / 1_000_000) * INPUT_RATE;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_RATE;
  return { inputCost, outputCost, totalCost: inputCost + outputCost };
}

// Example: a 6,200-token prompt with an 800-token reply.
console.log(
  estimateCost({
    inputTokens: 6200,
    outputTokens: 800,
  })
);

If your route spends most of its budget on output, V4-Pro’s new pricing is especially relevant.
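To get the ratio per route rather than per request, the traces can be grouped first. This is a sketch only; the trace shape `{ route, inputTokens, outputTokens }` and the helper name `summarizeByRoute` are assumptions about how your logs look:

```javascript
// Rates in USD per million tokens, matching the snippet above.
const RATES = { input: 0.435, output: 0.87 };

// Hypothetical trace shape: { route, inputTokens, outputTokens }.
function summarizeByRoute(traces) {
  const totals = new Map();
  for (const { route, inputTokens, outputTokens } of traces) {
    const t = totals.get(route) ?? { inputTokens: 0, outputTokens: 0 };
    t.inputTokens += inputTokens;
    t.outputTokens += outputTokens;
    totals.set(route, t);
  }

  return [...totals].map(([route, t]) => {
    const inputCost = (t.inputTokens / 1_000_000) * RATES.input;
    const outputCost = (t.outputTokens / 1_000_000) * RATES.output;
    return {
      route,
      outputToInputRatio: t.outputTokens / t.inputTokens,
      outputCostShare: outputCost / (inputCost + outputCost),
    };
  });
}

const routeSummary = summarizeByRoute([
  { route: "chat", inputTokens: 6200, outputTokens: 800 },
  { route: "chat", inputTokens: 6200, outputTokens: 800 },
  { route: "codegen", inputTokens: 1500, outputTokens: 4500 },
]);

console.table(routeSummary);
```

Routes with a high `outputCostShare` are the ones where the output price cut moves the invoice most.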

2. Run a 100-sample eval on your real workload

Do not rely only on public benchmarks. Pull 100 production traces, run them through V4-Pro and your current model with identical prompts, then score using your own criteria.

Track:

task completion
hallucination rate
JSON/schema validity
tool-call correctness
latency
cost per successful task
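Once each replayed trace has been scored, the headline numbers are a few lines of aggregation. The result shape and the `summarizeEval` helper below are hypothetical; how you decide `completed` and `schemaValid` is up to your own graders:

```javascript
// Hypothetical result shape for one replayed trace:
// { completed: boolean, schemaValid: boolean, latencyMs: number, cost: number }
function summarizeEval(results) {
  const successes = results.filter((r) => r.completed && r.schemaValid);
  const totalCost = results.reduce((sum, r) => sum + r.cost, 0);
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);

  return {
    successRate: successes.length / results.length,
    // Failed attempts still cost money, so divide total spend by successes.
    costPerSuccessfulTask:
      successes.length > 0 ? totalCost / successes.length : Infinity,
    // Upper median when the sample size is even.
    medianLatencyMs: latencies[Math.floor(latencies.length / 2)],
  };
}

const summary = summarizeEval([
  { completed: true, schemaValid: true, latencyMs: 900, cost: 0.004 },
  { completed: true, schemaValid: false, latencyMs: 1100, cost: 0.004 },
  { completed: true, schemaValid: true, latencyMs: 700, cost: 0.002 },
  { completed: false, schemaValid: true, latencyMs: 1500, cost: 0.006 },
]);

console.log(summary);
```

Comparing `costPerSuccessfulTask` across models is usually more informative than comparing raw price per token, because a cheaper model that fails more often narrows the gap.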

A common outcome is that V4-Pro is “good enough” for roughly 70% to 85% of traffic, but let your own eval set the number.

3. Route by difficulty

A practical routing pattern:

Simple requests           -> DeepSeek-V4-Pro
Medium coding/reasoning   -> DeepSeek-V4-Pro
Hard tail / high-risk     -> Premium model
Failed validation         -> Retry or escalate

This captures most savings without forcing a full migration.
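The routing table above can be expressed as two small functions. In this sketch the difficulty labels come from a classifier you would supply, `premium-model` is a placeholder for whichever higher-priced model you already use, and the retry-once-then-escalate policy is one reasonable choice rather than a recommendation from DeepSeek:

```javascript
// Model identifiers are placeholders for illustration.
const CHEAP_MODEL = "deepseek-v4-pro";
const PREMIUM_MODEL = "premium-model";

// Hypothetical classifier output: difficulty is "simple", "medium", or "hard".
function pickModel({ difficulty, highRisk = false }) {
  if (highRisk || difficulty === "hard") return PREMIUM_MODEL;
  return CHEAP_MODEL;
}

// After a failed validation: retry the cheap model once, then escalate.
function nextAttempt({ model, attempt }) {
  if (model === CHEAP_MODEL && attempt < 2) {
    return { model: CHEAP_MODEL, attempt: attempt + 1 };
  }
  if (model === CHEAP_MODEL) {
    return { model: PREMIUM_MODEL, attempt: 1 };
  }
  return null; // The premium attempt failed too: surface the error.
}

console.log(pickModel({ difficulty: "medium" })); // deepseek-v4-pro
console.log(pickModel({ difficulty: "simple", highRisk: true })); // premium-model
console.log(nextAttempt({ model: CHEAP_MODEL, attempt: 2 })); // { model: 'premium-model', attempt: 1 }
```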

4. Lock in cache prefixes

Audit every system prompt. Move variable fields out of the prefix:

timestamps
user IDs
session IDs
request IDs
retrieved chunks
per-request instructions

Stable prefix first. Dynamic context later.
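Part of that audit can be automated with a lint-style check over rendered system prompts. The patterns below are heuristics chosen for illustration, not an exhaustive list, and `findVolatileFields` is a hypothetical helper; tune the regexes to your own ID formats:

```javascript
// Heuristic patterns for values that usually change per request.
const VOLATILE_PATTERNS = {
  isoTimestamp: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}/,
  uuid: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  requestId: /\breq_[A-Za-z0-9_]+/,
  templatePlaceholder: /\{\{\s*\w+\s*\}\}|\$\{\w+\}/,
};

// Return the names of every pattern found in a system prompt.
function findVolatileFields(systemPrompt) {
  return Object.entries(VOLATILE_PATTERNS)
    .filter(([, pattern]) => pattern.test(systemPrompt))
    .map(([name]) => name);
}

console.log(
  findVolatileFields(
    "You are a helpful agent. Now: 2026-05-22T10:15:00Z. Trace: req_2026_05_22_abc"
  )
); // [ 'isoTimestamp', 'requestId' ]
console.log(findVolatileFields("You are a helpful agent.")); // []
```

Running this in CI against every prompt template is a cheap way to stop a stray timestamp from silently breaking cache hits.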

5. Add regression tests before shipping

This is where Apidog helps. Record golden responses from your current model, replay the same requests against V4-Pro, and diff the outputs. Apidog’s JSON schema validation can catch drift in tool-call shapes before production.

Download Apidog, import your OpenAI-compatible collection, and change the base URL to:

https://api.deepseek.com

Then run a side-by-side smoke test.
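If you also want a scripted check alongside the Apidog run, comparing structure rather than exact text is usually enough to catch tool-call drift. The sketch below is independent of Apidog; `shapeOf` and `sameShape` are hypothetical helpers, and arrays are described by their first element only, which is a simplification:

```javascript
// Describe a JSON value by structure only: keys and primitive types.
function shapeOf(value) {
  if (Array.isArray(value)) return value.length ? [shapeOf(value[0])] : [];
  if (value === null) return "null";
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value)
        .sort()
        .map((k) => [k, shapeOf(value[k])])
    );
  }
  return typeof value;
}

// Hypothetical check: same shape means same keys and types, any values.
function sameShape(golden, candidate) {
  return JSON.stringify(shapeOf(golden)) === JSON.stringify(shapeOf(candidate));
}

const golden = { name: "create_ticket", arguments: { title: "A", priority: 2 } };
const matching = { arguments: { priority: 1, title: "B" }, name: "create_ticket" };
const drifted = { name: "create_ticket", arguments: { title: "B", priority: "high" } };

console.log(sameShape(golden, matching)); // true
console.log(sameShape(golden, drifted)); // false
```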

How V4-Pro stacks up against other 2026 price drops

DeepSeek is not the only lab cutting prices. The 2026 LLM market is in a clear margin-compression phase:

OpenAI's o3 dropped 80% in June 2025. See the O3 pricing breakdown.
Kimi K2 repriced aggressively to compete with DeepSeek’s V3 tier. Kimi K2 API pricing covers the details.
Anthropic Claude held the line on Opus pricing but introduced cheaper Haiku and Sonnet tiers. The full Claude API cost breakdown walks through where each tier fits.

V4-Pro’s cut is different because it targets the frontier capability band, not only the budget tier.

The build math shifted

DeepSeek did not just drop the price. It changed the baseline. Frontier capability at sub-dollar output pricing is now part of the 2026 cost model.

Do three things next:

Audit your top three LLM workloads and pick one route to test on V4-Pro this week.
Stabilize your cache prefixes, regardless of which model you use.
Wire up an Apidog regression suite so the next price cut takes hours to evaluate instead of weeks.

The promo flag came off. The discount did not.
