
Owen

Posted on • Originally published at ofox.ai

Qwen 3.6 Plus vs DeepSeek V4 Pro for Coding: Open-Weight API Showdown (3 Tasks, Real Cost)

TL;DR

Both models achieved comparable scores on SWE-bench Verified (Qwen at 78.8%, V4 Pro at 80.6%), but they demonstrate distinct failure patterns in practical applications. V4 Pro excels in speed and cost efficiency during promotional pricing, while Qwen 3.6 Plus's integrated reasoning mechanism catches edge cases that V4 Pro overlooks. Context coherence begins degrading for V4 Pro beyond 200K tokens. Three specific coding tasks reveal that optimal model selection depends on task characteristics rather than choosing a single solution. The recommendation is to implement task-based routing rather than committing exclusively to one model.

Background Context

The question about open-weight models for coding previously centered on DeepSeek versus other options. As of May 2026, the meaningful comparison involves DeepSeek versus Alibaba's Qwen, with a performance gap of 1.8 points on SWE-bench Verified—well within typical variance margins. Both models offer 1M-token context windows, expose OpenAI-compatible tool-calling functionality, and cost substantially less than Claude Opus. The core inquiry shifts from identifying an outright winner to understanding where each model's capabilities degrade.

Pricing and Architecture: What Actually Differs

| Model | Input (list) | Output (list) | Context | Architecture / Parameters | Released |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.6 Plus (ofox) | $0.50/M | $3.00/M | 1M | Linear-attention MoE, reasoning-by-default | 2026-04-02 |
| DeepSeek V4 Pro (direct) | $1.74/M | $3.48/M | 1M | 1.6T total / 49B active MoE, MIT license | 2026-04-24 |
| DeepSeek V4 Pro (launch promo, ends 2026-05-31) | $0.435/M | $0.87/M | 1M | same as direct | same as direct |

Sources: DeepSeek API pricing (verified 2026-05-15), ofox.ai model catalog, Hugging Face V4 Pro card

Two pricing considerations reshape the interpretation of this comparison. V4 Pro's launch promotional pricing expires on May 31, after which both input and output costs increase fourfold. Teams budgeting based on promotional rates will face substantial increases starting June 1. Qwen 3.6 Plus's input pricing of $0.50/M actually undercuts DeepSeek's standard rate; its output at $3.00/M remains below V4 Pro's post-promotional $3.48/M. For workloads extending beyond the promotional window, the pricing differential narrows considerably.
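Using the list prices from the table above, a short helper makes the per-request arithmetic in the rest of this post easy to reproduce (prices are hardcoded from the table; the dictionary keys are illustrative labels, not API model IDs):

```python
# Per-million-token prices (USD), copied from the pricing table above.
PRICES = {
    "qwen3.6-plus": {"input": 0.50, "output": 3.00},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
    "deepseek-v4-pro-promo": {"input": 0.435, "output": 0.87},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The promo-to-list jump is exactly 4x on both input and output.
ratio = PRICES["deepseek-v4-pro"]["input"] / PRICES["deepseek-v4-pro-promo"]["input"]
print(round(ratio, 2))  # 4.0
```

At 10,000 calls a month, multiplying any per-request figure by 10,000 shows where the pennies turn into real budget.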

The architectural distinction carries greater significance than the price difference. V4 Pro implements sparse mixture-of-experts architecture, routing each token through 49B active parameters from a 1.6T total parameter pool. Qwen 3.6 Plus combines linear attention with mandatory chain-of-thought reasoning, meaning every response includes a reasoning_content field regardless of explicit request. Output tokens consumed by reasoning incur full pricing. Tasks demanding careful analysis benefit from this approach. Routine tasks incur overhead for reasoning generation.

For comprehensive DeepSeek pricing context, consult the DeepSeek API pricing breakdown. For cost-quality analysis within the V4 family, the V4 Pro vs Flash comparison examines when Pro capacity exceeds requirements. For isolated Qwen 3.6 Plus evaluation, the complete guide includes model IDs and curl command examples.

Task 1: Algorithmic Implementation with Edge Cases

The first evaluation involved implementing a function matching specified constraints, including three non-obvious edge cases: empty input, single-character input, and off-by-one boundary conditions on window sizing.
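The post does not name the exact function, so here is a hypothetical stand-in with the same three traps: a sliding-window routine that must survive empty input, single-character input, and an off-by-one on the window bound:

```python
def max_vowels_in_window(s: str, k: int) -> int:
    """Hypothetical stand-in for the benchmark task: the most vowels found
    in any window of size k. The three edge cases are handled up front."""
    if not s or k <= 0:   # edge case 1: empty input (or a degenerate window)
        return 0
    k = min(k, len(s))    # edge cases 2 and 3: single-char input, and windows
                          # larger than the string (off-by-one territory)
    vowels = set("aeiou")
    count = sum(ch in vowels for ch in s[:k])
    best = count
    for i in range(k, len(s)):  # slide: add s[i], drop s[i - k]
        count += (s[i] in vowels) - (s[i - k] in vowels)
        best = max(best, count)
    return best

print(max_vowels_in_window("", 3))         # 0
print(max_vowels_in_window("a", 3))        # 1
print(max_vowels_in_window("abcidef", 3))  # 2
```

The pattern both models were graded on is exactly this kind of boundary handling before the main loop, not the loop itself.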

V4 Pro Performance:

  • Generated clean, idiomatic code within approximately 8 seconds
  • Correctly handled empty-input case
  • Missed single-character edge case on first attempt, producing incorrect function output
  • Clarifying follow-up prompt resolved the discrepancy

Qwen 3.6 Plus Performance:

  • Required 14 seconds including reasoning trace generation
  • Handled all three edge cases correctly on initial output
  • Reasoning trace explicitly enumerated boundary conditions prior to implementation
  • Code demonstrated slightly less elegance than V4 Pro's first attempt (extra variable, redundant length check), but correctness achieved without iteration

The consistent pattern across algorithmic evaluation: V4 Pro produces faster, more elegant-appearing first-pass code but more frequently skips edge cases compared to Qwen 3.6 Plus. The reasoning trace functions beyond cosmetic presentation—it compels explicit boundary condition enumeration before code generation, consistently identifying conditions that V4 Pro omits.

Cost Analysis:
A request of 2,500 input tokens / 800 output tokens costs approximately $0.0018 on V4 Pro at the promotional rate, or $0.0071 at standard list pricing. Qwen's mandatory reasoning trace makes its cost tangible: roughly 1,500 reasoning tokens on top of the 800-token answer add about $0.0045 of output, bringing the total to roughly $0.0082 per request. The difference is negligible at the individual task level but becomes material at 10,000 monthly calls.

Selection Guidance:
For scenarios permitting single follow-up prompts to address missed edge cases without pipeline disruption, V4 Pro offers superior speed and cost efficiency. For pipelines intolerant of first-pass errors—such as unattended agents committing code changes—Qwen 3.6 Plus's reasoning cost provides concrete value.

Task 2: Multi-File Refactor with Cross-References

The second evaluation was designed to separate models that merely demonstrate syntactic understanding from those that retain a working memory of the codebase. Both models received four related files (a TypeScript service, two consumer implementations, and a test file) with instructions to rename a method, replace its positional arguments with an options-object parameter, update both call sites, and modify the test mocks accordingly.

The prompt consumed approximately 12K tokens, leaving substantial context capacity for both models. Initial review suggested both produced syntactically valid output.

V4 Pro Performance:

  • Successfully renamed method in service file
  • Correctly updated first consumer
  • Missed option default in second consumer—passed empty object {} where original code provided specific default value as positional argument
  • Bug would manifest only during specific second-consumer code path, undetected by existing tests
  • Quiet semantic drift rather than syntax error

Qwen 3.6 Plus Performance:

  • Captured the missing default value
  • Reasoning trace explicitly noted that consumer B's second positional argument defaultPolicy required options-object transformation to { policy: defaultPolicy }
  • Flagged that test file mock setup required additional assertion verifying new signature—a point V4 Pro omitted

Qwen's advantage on this evaluation goes beyond code quality (both outputs were syntactically valid); it reflects comprehension of unstated invariants. Multi-file refactors carry implicit assumptions: default values, ordering conventions, error-handling patterns kept consistent across a codebase. V4 Pro follows the explicit instructions while dropping the implicit assumptions. Qwen's always-on reasoning surfaces these invariants so they get handled explicitly.
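The refactor was in TypeScript, but the dropped-default failure mode is language-agnostic. A Python sketch of the same shape (the names are illustrative, not from the actual task files):

```python
DEFAULT_POLICY = "strict"

# After the refactor, the service takes an options dict instead of positionals.
def apply_policy(data, options=None):
    options = options or {}
    # The internal fallback differs from the value consumer B used to pass.
    return (data, options.get("policy", "lenient"))

# Consumer B originally called apply_policy(data, DEFAULT_POLICY) positionally.
# A correct refactor preserves that value in the options dict:
ok = apply_policy("payload", {"policy": DEFAULT_POLICY})
# The V4 Pro-style slip passes an empty dict, silently changing behavior:
drifted = apply_policy("payload", {})
print(ok[1], drifted[1])  # strict lenient
```

Nothing here is a syntax error, and no existing test fails; the behavior change only appears on the code path that relied on the old positional default.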

This mirrors the failure pattern documented in DeepSeek V4 Pro vs Flash comparison for Flash on extended-file refactors—except V4 Pro here occupies the role of missing subtle invariants. The consistency gap between Pro and Flash narrows when tasks fit within 12K tokens and difficulty derives from reasoning depth rather than context length requirements.

Cost Analysis:
Complete prompt plus output: approximately 12K input / 3K output tokens. V4 Pro: $0.031 (standard) / $0.008 (promotional). Qwen 3.6 Plus with reasoning: $0.018. Qwen achieves cost advantage at standard pricing, loses marginally under promotional pricing, and delivers first-pass correctness either way.

Selection Guidance:
Multi-file refactors where prompt cannot enumerate every invariant explicitly: Qwen 3.6 Plus. The reasoning trace delivers concrete utility on this evaluation category—it represents substantive analysis, not presentation.

Task 3: Long-Context Bug Triage (200K-Token Repo Snapshot)

The third evaluation stresses context length capabilities. Approximately 200K tokens of open-source codebase content—three major directories, approximately 80 files—populated the prompt with a request to identify root cause from a stack trace. The trace referenced a generic error path; the actual cause resided three call levels deep within an unnamed file.

Both models report 1M-token context windows. The evaluation assesses performance at upper input ranges, not mere acceptance.

V4 Pro Performance:

  • Identified immediate calling function from stack trace
  • Examined associated file
  • Concluded the bug existed in the immediate caller (incorrect)
  • Actual bug sat three call levels down, in transformation logic that silently mutated an array
  • Response was confident and specific, proposing a fix for the symptom rather than the root cause
  • A follow-up prompt directing investigation three levels deeper identified the actual bug

Qwen 3.6 Plus Performance:

  • Applied reasoning budget to data flow tracing rather than call stack navigation
  • Worked backward from bad-value origin point through each transformation
  • Correctly identified silent array mutation on first attempt
  • Reasoning trace consumed 4,000 tokens
  • Answer achieved correctness without follow-up
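The bug class both models were hunting, a transformation that silently mutates its input, is easy to reproduce in miniature (illustrative Python, not the audited repo's code):

```python
def normalize_scores(scores):
    """Buggy: sort() reorders the caller's list in place, so every later
    reader of `scores` sees mutated data, several frames away from where
    the eventual stack trace points."""
    scores.sort()  # silent in-place mutation
    top = scores[-1] or 1
    return [s / top for s in scores]

def normalize_scores_fixed(scores):
    """Fix: work on a copy; the caller's list is untouched."""
    ordered = sorted(scores)
    top = ordered[-1] or 1
    return [s / top for s in ordered]

data = [3, 1, 2]
normalize_scores_fixed(data)
print(data)  # [3, 1, 2] -- caller's list preserved
normalize_scores(data)
print(data)  # [1, 2, 3] -- caller's list reordered
```

A stack trace from code that later reads `data` points nowhere near `normalize_scores`, which is why call-stack navigation failed and data-flow tracing succeeded.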

The notable observation across long-context tasks: V4 Pro at 200K-token input maintains syntactic understanding coherence but demonstrates reduced accuracy on causal reasoning chains. Qwen 3.6 Plus operates slower and more expensively at this input scale (reasoning tokens scale with input complexity) but produces noticeably superior cause-and-effect analysis.

This aligns with independent reviewer findings. Artificial Analysis's intelligence-index methodology scores Qwen 3.6 Plus at a 50 composite versus a 35 median for reasoning models in the same price tier; the gap is widest on tasks that reward reasoning depth rather than throughput. BenchLM's V4 Pro reporting shows the inverse pattern: V4 Pro excels on throughput benchmarks and shorter-context coding work.

Cost Analysis:
200K input + 4K output (V4 Pro), or 200K input + 4K answer + 4K reasoning (Qwen). V4 Pro at standard pricing: $0.362. V4 Pro at promotional pricing: $0.090. Qwen 3.6 Plus: $0.124. Qwen wins on cost at standard pricing, loses under promotional pricing, and is the only one to deliver first-pass correctness.

Selection Guidance:
Long-context bug triage and "codebase explanation" evaluation categories: Qwen 3.6 Plus. V4 Pro demonstrates speed advantage, but on large-input causal reasoning, speed offers minimal benefit if follow-up prompts become necessary.

What the Aggregate Picture Looks Like

Across the three evaluation tasks, wins distribute evenly:

  • Task 1 (algorithmic edge cases): Tie after follow-up iteration; Qwen wins on initial correctness. V4 Pro wins on speed and promotional-pricing cost.
  • Task 2 (multi-file refactor): Qwen wins on correctness. V4 Pro wins only on promotional-pricing cost.
  • Task 3 (long-context triage): Qwen wins on correctness. V4 Pro wins on speed and promotional-pricing cost.

Flattening into single ranking would characterize Qwen 3.6 Plus as more deliberate and V4 Pro as faster—roughly accurate but structurally incomplete. The meaningful conclusion involves prompt-dependent decision making:

  • Prompts enumerating every edge case and invariant explicitly: V4 Pro generates cleaner initial output and processes faster.
  • Exploratory prompts or implicit-knowledge-dependent ones: Qwen 3.6 Plus reasoning captures gaps V4 Pro misses.

Most production prompts occupy the middle ground. Implementing task-based routing, sending well-specified one-shots to V4 Pro and exploratory or multi-step work to Qwen 3.6 Plus, captures each model's strengths while avoiding characteristic failure modes. For routing implementation within Claude Code and comparable systems, the hybrid routing pattern guide covers concrete technical approaches.
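A minimal version of such a router, using the model IDs listed later in this post, might look like the following (the routing heuristics are illustrative, not the linked guide's):

```python
QWEN = "bailian/qwen3.6-plus"
V4_PRO = "deepseek/deepseek-v4-pro"

def pick_model(task_type: str, prompt_tokens: int) -> str:
    """Route by the failure modes observed in the three tasks:
    bounded, fully specified one-shots go to V4 Pro; exploratory,
    multi-file, or long-context work goes to Qwen 3.6 Plus."""
    if task_type in {"refactor", "triage", "exploration"}:
        return QWEN
    if prompt_tokens > 100_000:  # causal reasoning degraded here for V4 Pro
        return QWEN
    return V4_PRO                # bounded algorithmic tasks

print(pick_model("algorithm", 2_500))    # deepseek/deepseek-v4-pro
print(pick_model("refactor", 12_000))    # bailian/qwen3.6-plus
print(pick_model("algorithm", 200_000))  # bailian/qwen3.6-plus
```

In production the `task_type` label would come from your pipeline's own metadata (or a cheap classifier), and the token threshold is a knob to tune, not a fixed constant.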

For 2026 coding model selection context, the best LLM for coding ranked by real use post places both in the broader landscape. The LLM API selection decision matrix provides task-type-by-model mapping across the complete catalog.

The Promo-Window Decision

A significant portion of this comparison becomes irrelevant June 1, 2026, when DeepSeek's promotional window expires and V4 Pro pricing reverts to $1.74 / $3.48 per million tokens. Three concrete decisions warrant attention:

  • Task 1-heavy workloads (bounded algorithmic code) currently using V4 Pro at promotional rates: budget for 4x cost increase on June 1, or construct router downshifting bounded tasks to V4 Flash. The V4 Pro vs Flash document identifies appropriate transition points.
  • Task 2-heavy workloads (multi-file refactors with implicit invariants): Qwen 3.6 Plus currently represents the correctness-optimal selection. Post-June 1, it becomes the cost-optimal selection as well.
  • Task 3-heavy workloads (long-context exploration): Qwen 3.6 Plus represents the optimal selection independent of promotional timing. V4 Pro's promotional cost advantage disappears when follow-up prompts become necessary.

The broader pattern: V4 Pro's promotional pricing functions as marketing tactic rather than sustainable economic model. Token budgeting for remainder of 2026 should employ standard list pricing rather than discounted rates.

Access Both Through One Key

Both models are available through ofox.ai with OpenAI-compatible endpoints. Model identifiers:

  • Qwen 3.6 Plus: bailian/qwen3.6-plus
  • DeepSeek V4 Pro: deepseek/deepseek-v4-pro
```python
import os
from openai import OpenAI

# Read the key from the environment rather than hardcoding it.
client = OpenAI(base_url="https://api.ofox.ai/v1", api_key=os.environ["OFOX_API_KEY"])

resp = client.chat.completions.create(
    model="bailian/qwen3.6-plus",  # or "deepseek/deepseek-v4-pro"
    messages=[{"role": "user", "content": "..."}],
)
```

Routing logic—determining which model handles which task—resides within application code rather than billing configuration. Single authentication key supports both models; swap via modification of the model parameter string. For complete gateway setup context, see AI API aggregation documentation. For cost-reduction tactics complementing model selection, the cost reduction guide covers caching, batching, and routing patterns applicable to both.

Core Takeaway

These models demonstrate sufficient benchmark equivalence that the optimal approach involves "both, behind a router"—anyone claiming definitive winner selection is overfitting to specific task types. Building the router once transforms open-weight coding selection from vendor lock-in to intelligent load balancing.

References

  • DeepSeek V4 Pro pricing and release notes: api-docs.deepseek.com/quick_start/pricing (verified 2026-05-15)
  • DeepSeek V4 Pro model card: huggingface.co/deepseek-ai/DeepSeek-V4-Pro
  • DeepSeek V4 Pro review and SWE-bench data: Codersera V4 Pro review
  • Qwen 3.6 Plus model details: Qwen blog 3.6 release
  • Independent benchmarks: Artificial Analysis intelligence index, BenchLM V4 Pro page
  • Cross-vendor coding comparison: LLM Coding Benchmark May 2026 (AkitaOnRails)
  • ofox model catalog: ofox.ai/llms.txt

Originally published on ofox.ai/blog.
