DEV Community

Stanislav


Debugging Benchmark: DeepSeek V4 Pro vs MiMo V2.5 Pro

A real-world comparison of two LLMs on a genuine race condition bug from GitHub


TL;DR

| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
| --- | --- | --- |
| Time | ~8 min (2 rounds) | ~15 min (2 rounds) |
| Tokens | 2.43M | 3.36M |
| Cache hit rate | 92.1% | 95.2% |
| Cost | $0.14 (6% top-up fee) | $0.13 (0% fee) |
| Bugs found | 1 race condition | 3 race conditions |
| Fix approach | Prevention (lock-based) | Prevention (three-phase separation) |

Verdict: On this bug, MiMo was the better debugger (3 race conditions found vs 1, deeper analysis) and slightly cheaper ($0.13 vs $0.14). DeepSeek was nearly twice as fast and is the better fit for writing code.


Why This Benchmark?

Most LLM benchmarks test coding ability — write a function, solve a puzzle, implement an algorithm. But in real-world development, debugging is harder than writing code. You need to:

  1. Understand complex, multi-file codebases
  2. Find non-obvious root causes
  3. Explain the mechanism clearly
  4. Propose a correct fix

We wanted to test this specific skill. So we took a real race condition bug from a popular open-source library and gave it to both models.


The Bug: httpcore #961

Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations

What's httpcore?

httpcore is a low-level HTTP client library used by httpx (the popular Python HTTP client). It handles connection pooling, HTTP/2, proxies, and more.

The Bug

When async tasks are cancelled during connection operations, the pool's internal state becomes inconsistent. The pool thinks connections are still in use when they're actually cancelled, leading to pool exhaustion — new requests can never acquire a connection.
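The failure mode can be reproduced in miniature. The sketch below is a hypothetical one-connection pool written for this illustration (`ToyPool`, `request`, and `demo` are made-up names, not httpcore code). It assumes only the pattern described above: the pool marks a connection as in use, then awaits I/O, and a cancellation lands in between.

```python
import asyncio


class ToyPool:
    """Hypothetical one-connection pool; NOT httpcore's implementation."""

    def __init__(self) -> None:
        self.in_use = False
        self._freed = asyncio.Event()
        self._freed.set()

    async def acquire(self) -> None:
        await self._freed.wait()
        self._freed.clear()
        self.in_use = True          # state update...
        await asyncio.sleep(0.01)   # ...then I/O: a cancellation can land here

    def release(self) -> None:
        self.in_use = False
        self._freed.set()


async def request(pool: ToyPool) -> None:
    await pool.acquire()
    try:
        await asyncio.sleep(0)      # pretend to use the connection
    finally:
        pool.release()


async def demo() -> bool:
    """Return True if the pool is exhausted after one cancelled request."""
    pool = ToyPool()
    task = asyncio.create_task(request(pool))
    await asyncio.sleep(0)          # let the task reach the I/O inside acquire()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    try:
        # The cancelled task never reached its try/finally, so nobody released.
        await asyncio.wait_for(request(pool), timeout=0.05)
    except asyncio.TimeoutError:
        return True
    return False
```

Here `asyncio.run(demo())` returns `True`: the pool still believes the connection is in use, so the second request waits until it times out.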

Why It's Hard

  • Multi-file: Involves connection_pool.py, connection.py, and http2.py
  • Async-specific: Only manifests with asyncio/trio cancellation
  • Non-obvious: Logs show normal operation; the race window is tiny
  • Real-world: This is a production issue affecting real users

Methodology

Benchmark Structure

We gave each model the entire httpcore project at the commit BEFORE the fix (commit 79fa6bf). The project included:

  • Full source code (90 files)
  • README.md with bug description (no hints about the fix)
  • PROMPT.md with instructions
  • SOLUTION.md and SOLUTION.diff (hidden from models)

Round 1: Find the Bug

Prompt (identical for both models):

You are given a Python project with a bug. Your task is to find the bug
and write a detailed explanation of how to fix it.

1. Read README.md to understand the project and the bug description.

2. Analyze the source code in httpcore/_async/ and httpcore/_sync/
   to find the root cause of the race condition.

3. Run the tests to see which ones fail:
   pip install -e ".[asyncio]"
   pytest tests/ -v

4. Write your findings to SOLUTION.md with:
   - Root cause analysis (what exactly goes wrong)
   - Why it happens (the mechanism)
   - How to fix it (the approach, not necessarily the exact code)
   - Which files need to be changed

Do NOT modify the source code. Only write SOLUTION.md.

Round 2: Refine the Fix

After Round 1, both models proposed patches (cleanup handlers) rather than prevention (atomic state management). We gave them a hint:

Prompt (identical for both models):

Your previous fix handled orphaned connections in the cancellation
handler. This works, but it treats the symptom — connections still
get orphaned, you just clean them up after.

A better approach would be to prevent the race condition from
happening in the first place. The root cause is that state
management (tracking idle vs in-use connections) is interleaved
with I/O operations (queue.get(), queue.put()). When a task is
cancelled between state update and I/O, the pool loses track.

Can you find a way to make the state management atomic — so that
cancellation cannot happen midway through the acquire/release
sequence?

Write your refined solution to SOLUTION_V2.md.

Results: Round 1

DeepSeek V4 Pro (~3 minutes)

Root Cause: Found 1 race condition: connections are orphaned when a task is cancelled after a connection has been assigned to it but before the task resumes.

Key Insight: "The connection remains in the pool marked as 'in use' but the task that was supposed to use it is gone."

Proposed Fix: Handle orphaned connections in the cancellation handler — check if a connection was assigned and release it.
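A cleanup-handler fix of this kind might look roughly like the sketch below. It uses a hypothetical one-connection pool (`ToyPool`, `request`, and `demo` are illustrative names, not DeepSeek's output or httpcore code). The state still becomes inconsistent for a moment; the handler repairs it afterwards.

```python
import asyncio


class ToyPool:
    """Hypothetical one-connection pool with a cleanup handler."""

    def __init__(self) -> None:
        self.in_use = False
        self._freed = asyncio.Event()
        self._freed.set()

    def release(self) -> None:
        self.in_use = False
        self._freed.set()

    async def acquire(self) -> None:
        await self._freed.wait()
        self._freed.clear()
        self.in_use = True
        try:
            await asyncio.sleep(0.01)   # stand-in for connection I/O
        except BaseException:
            # Cancelled (or failed) after the state update: undo it, then re-raise.
            self.release()
            raise


async def request(pool: ToyPool) -> None:
    await pool.acquire()
    try:
        await asyncio.sleep(0)          # pretend to use the connection
    finally:
        pool.release()


async def demo() -> bool:
    """Return True if a new request succeeds after a cancelled one."""
    pool = ToyPool()
    task = asyncio.create_task(request(pool))
    await asyncio.sleep(0)              # task is now inside acquire()'s I/O
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.wait_for(request(pool), timeout=0.5)
    return not pool.in_use
```

Catching `BaseException` matters here because `asyncio.CancelledError` has not been a subclass of `Exception` since Python 3.8.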

Quality: Excellent root cause analysis with a step-by-step explanation of the mechanism. However, the proposed fix was a patch (cleanup handler), not prevention.

MiMo V2.5 Pro (~9 minutes)

Root Cause: Found 3 distinct race conditions:

  1. Status leak during initial pool lock
  2. Status leak during ConnectionNotAvailable retry
  3. Connection leak during cleanup

Key Insight: Explained why existing tests don't catch it — they use single-request scenarios.

Proposed Fix: Add cleanup handlers + defensive connection sweep.

Quality: Excellent analysis, deeper than DeepSeek's (3 bugs vs 1). However, the proposed fix was also a patch (cleanup handlers), not prevention.

Round 1 Comparison

| Aspect | DeepSeek | MiMo |
| --- | --- | --- |
| Time | ~3 min | ~9 min |
| Bugs found | 1 | 3 |
| Fix approach | Patch (cleanup) | Patch (cleanup) |
| Fix quality | 🟡 Treats symptoms | 🟡 Treats symptoms |
| Explanation quality | Excellent | Excellent |

Results: Round 2

DeepSeek V4 Pro (~5 minutes)

Approach: Move connection claiming to the waiting task, make it atomic inside a lock.

Key Changes:

  • New _wait_and_acquire() method
  • Removed proactive assignment from response_closed()
  • Added _pool_state_changed event
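The idea of letting the waiting task claim a connection itself can be sketched as follows. This is a guess at the shape of such a design, not DeepSeek's actual patch: `LockedPool`, its attributes, and `demo` are hypothetical, and the method and event names are only loosely modeled on the ones listed above.

```python
import asyncio


class LockedPool:
    """Hypothetical pool where the waiter claims a connection under a lock."""

    def __init__(self, size: int) -> None:
        self._idle = list(range(size))      # connection ids
        self._in_use = set()
        self._lock = asyncio.Lock()
        self._state_changed = asyncio.Event()

    async def wait_and_acquire(self) -> int:
        while True:
            async with self._lock:
                # No await inside this block, so the check and the claim
                # cannot be separated by a cancellation.
                if self._idle:
                    conn = self._idle.pop()
                    self._in_use.add(conn)
                    return conn
                self._state_changed.clear()
            # Cancellation here is harmless: this task owns nothing yet.
            await self._state_changed.wait()

    async def release(self, conn: int) -> None:
        async with self._lock:
            self._in_use.discard(conn)
            self._idle.append(conn)
            self._state_changed.set()       # wake waiters; they claim for themselves


async def demo():
    pool = LockedPool(size=1)
    conn = await pool.wait_and_acquire()
    waiter = asyncio.create_task(pool.wait_and_acquire())
    await asyncio.sleep(0)                  # waiter is parked on the event
    waiter.cancel()
    try:
        await waiter
    except asyncio.CancelledError:
        pass
    await pool.release(conn)
    again = await asyncio.wait_for(pool.wait_and_acquire(), timeout=0.1)
    return again, len(pool._in_use)
```

Because nothing is ever assigned to a task on its behalf, a cancelled waiter leaves no orphan behind. In this single-threaded asyncio sketch the lock is never contended and mainly marks the critical section; it would do real work with threads or a different async backend.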

Quality: 🟢 Architecturally clean, similar to the actual fix.

MiMo V2.5 Pro (~6 minutes)

Approach: Three-phase separation — CLEANUP (I/O), STATE (sync), I/O (network).

Key Changes:

  • _attempt_to_acquire_connection is now synchronous (no await inside lock)
  • AsyncShieldCancellation for critical sections
  • Top-level cleanup handler as safety net
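A rough sketch of the phase split, under stated assumptions: `PhasedPool` and `demo` are hypothetical, this toy always closes the connection after a request, and `asyncio.shield` from the standard library stands in for the `AsyncShieldCancellation` helper mentioned above (the two are not equivalent).

```python
import asyncio


class PhasedPool:
    """Hypothetical pool illustrating the STATE / I/O / CLEANUP split."""

    def __init__(self, size: int) -> None:
        self.idle = list(range(size))
        self.in_use = set()
        self.closed = []

    def assign(self) -> int:
        # STATE phase: plain synchronous code, so no cancellation point
        # exists between the two updates.
        conn = self.idle.pop()
        self.in_use.add(conn)
        return conn

    def unassign(self, conn: int) -> None:
        # STATE phase again, also synchronous.
        self.in_use.discard(conn)

    async def close(self, conn: int) -> None:
        # CLEANUP phase: stand-in for closing a socket.
        await asyncio.sleep(0.01)
        self.closed.append(conn)

    async def request(self) -> None:
        conn = self.assign()
        try:
            await asyncio.sleep(1)          # I/O phase: cancellation may land here
        finally:
            self.unassign(conn)
            # Shield the cleanup so a second cancellation cannot abort it.
            await asyncio.shield(self.close(conn))


async def demo():
    pool = PhasedPool(size=1)
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0)                  # task is in the I/O phase
    task.cancel()
    await asyncio.sleep(0)                  # task is now in the shielded cleanup
    task.cancel()                           # second cancellation, during cleanup
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0.05)               # let the shielded close() finish
    return pool.in_use, pool.closed
```

Even with two cancellations, the state phase leaves `in_use` empty and the shielded close still runs to completion.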

Quality: 🟢 Systematic approach, analyzed 5 cancellation scenarios.

Round 2 Comparison

| Aspect | DeepSeek | MiMo |
| --- | --- | --- |
| Time | ~5 min | ~6 min |
| Approach | Lock-based atomic | Three-phase separation |
| Complexity | Medium | High |
| Edge cases | Good | Excellent (5 scenarios) |

Token Usage & Cost

| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
| --- | --- | --- |
| Total tokens | 2,431,121 | 3,356,951 |
| Cache hit | 2,198,400 | 3,146,304 |
| Cache miss | 189,058 | 157,502 |
| Output | 43,663 | 53,145 |
| Cache hit rate | 92.1% | 95.2% |
| API requests | 30 | 34 |
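The reported cache hit rates appear to be computed over input tokens only, that is hit / (hit + miss) with output tokens excluded. A quick check using the counts from the table (the function names are just for this sketch):

```python
# Token counts copied from the table above.
USAGE = {
    "DeepSeek V4 Pro": {"cache_hit": 2_198_400, "cache_miss": 189_058, "output": 43_663},
    "MiMo V2.5 Pro": {"cache_hit": 3_146_304, "cache_miss": 157_502, "output": 53_145},
}


def cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache (output tokens excluded)."""
    return usage["cache_hit"] / (usage["cache_hit"] + usage["cache_miss"])


def total_tokens(usage: dict) -> int:
    """Cache hits + cache misses + output tokens."""
    return usage["cache_hit"] + usage["cache_miss"] + usage["output"]


for model, usage in USAGE.items():
    print(f"{model}: {cache_hit_rate(usage):.1%} hit rate, {total_tokens(usage):,} tokens")
```

Under that definition the counts reproduce both the 92.1% / 95.2% rates and the total-token figures.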

Cost Calculation

Both models use the same pricing on OpenCode Go:

  • Cache hit: $0.003625/M tokens
  • Cache miss: $0.435/M tokens
  • Output: $0.87/M tokens

DeepSeek V4 Pro:

  • Cache hit: 2.198M × $0.003625 = $0.008
  • Cache miss: 0.189M × $0.435 = $0.082
  • Output: 0.044M × $0.87 = $0.038
  • Subtotal: $0.128
  • With 6% top-up fee: $0.14

MiMo V2.5 Pro:

  • Cache hit: 3.146M × $0.003625 = $0.011
  • Cache miss: 0.158M × $0.435 = $0.069
  • Output: 0.053M × $0.87 = $0.046
  • Subtotal: $0.126
  • With 0% fee: $0.13
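The arithmetic above can be reproduced from the exact token counts. This is a small sketch using the prices and fees as stated in this post; `cost` and the dictionary names are illustrative.

```python
# USD per million tokens, as listed above.
PRICE_PER_M = {"cache_hit": 0.003625, "cache_miss": 0.435, "output": 0.87}

# Exact token counts from the token usage table.
DEEPSEEK = {"cache_hit": 2_198_400, "cache_miss": 189_058, "output": 43_663}
MIMO = {"cache_hit": 3_146_304, "cache_miss": 157_502, "output": 53_145}


def cost(usage: dict, fee: float = 0.0) -> float:
    """Cost in USD; `fee` is the top-up commission as a fraction (0.06 = 6%)."""
    subtotal = sum(usage[kind] / 1_000_000 * PRICE_PER_M[kind] for kind in PRICE_PER_M)
    return subtotal * (1 + fee)


print(f"DeepSeek: ${cost(DEEPSEEK):.3f} before fee, ${cost(DEEPSEEK, fee=0.06):.2f} after")
print(f"MiMo:     ${cost(MIMO):.3f} before fee, ${cost(MIMO, fee=0.0):.2f} after")
```

Working from unrounded counts gives the same subtotals ($0.128 and $0.126) and the same final figures ($0.14 and $0.13).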

Why MiMo Is Cheaper

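Even though MiMo used 38% more tokens, it was still cheaper because:

  1. No top-up commission (0% vs 6% for DeepSeek), which accounts for most of the gap
  2. Higher cache hit rate (95.2% vs 92.1%): cache misses cost 120x more than cache hits, and MiMo had fewer of them (157,502 vs 189,058), so its pre-fee subtotal was already slightly lower ($0.126 vs $0.128)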
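Relative comparisons like these depend on which model is taken as the baseline. A short check using the totals from the token usage table and the round times reported above (variable names are illustrative):

```python
# Totals from the token usage table and the overall times from the TL;DR.
deepseek_tokens, mimo_tokens = 2_431_121, 3_356_951
deepseek_minutes, mimo_minutes = 8, 15

# "MiMo used X% more tokens" uses DeepSeek as the baseline...
mimo_more_pct = (mimo_tokens / deepseek_tokens - 1) * 100
# ...while "DeepSeek used Y% fewer tokens" uses MiMo as the baseline.
deepseek_fewer_pct = (1 - deepseek_tokens / mimo_tokens) * 100
# Wall-clock ratio between the two runs.
speedup = mimo_minutes / deepseek_minutes

print(f"MiMo: {mimo_more_pct:.1f}% more tokens")
print(f"DeepSeek: {deepseek_fewer_pct:.1f}% fewer tokens, {speedup:.2f}x faster")
```

The two percentages are not interchangeable: 38% more tokens for MiMo corresponds to about 28% fewer for DeepSeek.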

The Reference Fix (PR #880)

The actual fix by Tom Christie (httpcore author) was conceptually simple, even though the diff itself is large:

Approach: Move ALL state management into non-cancellable sections using locks.

Key Insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."

Files Changed: 9 files, +512/-379 lines

Both models converged on this approach in Round 2, though with different implementations:

  • DeepSeek: Lock-based atomic claiming
  • MiMo: Three-phase separation (CLEANUP → STATE → I/O)

Verdict

Different Profiles, Not Better/Worse

| Task | Better Model | Why |
| --- | --- | --- |
| Writing code | DeepSeek V4 Pro | Faster, fewer tokens, cleaner architecture |
| Debugging | MiMo V2.5 Pro | Finds more bugs, deeper analysis, cheaper |

DeepSeek Strengths

  • Speed: Nearly 2x faster (8 min vs 15 min)
  • Efficiency: 28% fewer tokens (2.43M vs 3.36M)
  • Architecture: Cleaner, simpler solutions
  • Best for: Implementing features, writing new code

MiMo Strengths

  • Depth: Found 3 race conditions vs 1
  • Analysis: 5 cancellation scenarios, explains why tests don't catch it
  • Cost: Cheaper despite using more tokens (0% commission)
  • Best for: Debugging complex issues, code review, finding bugs

Cost Efficiency

For this specific debugging task:

  • DeepSeek: $0.14 for 2 rounds
  • MiMo: $0.13 for 2 rounds

MiMo is both better at debugging AND cheaper. Its higher token usage is offset by fewer cache-miss tokens and the lack of a top-up commission.


Methodology Notes

Why Real Bugs?

Synthetic bugs are too easy — models solve them in seconds. Real bugs from production codebases require:

  • Understanding complex architectures
  • Tracing state through multiple files
  • Reasoning about async/concurrent behavior
  • Finding non-obvious root causes

Why Two Rounds?

In real-world debugging, you often get a quick fix first, then refine it. Testing both rounds shows:

  1. Initial debugging ability (finding the bug)
  2. Iterative improvement (refining with hints)

Token counting methodology

  • Baseline snapshot before benchmark
  • After snapshot when complete
  • Delta = tokens used for the task
  • Cost calculated using OpenCode Go pricing
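The snapshot-and-delta approach can be sketched as below. The `UsageSnapshot` class is hypothetical, and the baseline and after values are made-up counters chosen so that the delta matches DeepSeek's reported usage; the benchmark's raw snapshots are not given in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSnapshot:
    """Cumulative account counters at one point in time (hypothetical shape)."""

    cache_hit: int
    cache_miss: int
    output: int

    def __sub__(self, earlier: "UsageSnapshot") -> "UsageSnapshot":
        # Delta = later cumulative counters minus earlier ones, per category.
        return UsageSnapshot(
            self.cache_hit - earlier.cache_hit,
            self.cache_miss - earlier.cache_miss,
            self.output - earlier.output,
        )

    @property
    def total(self) -> int:
        return self.cache_hit + self.cache_miss + self.output


# Illustrative values only, NOT the benchmark's actual snapshots.
baseline = UsageSnapshot(cache_hit=1_000_000, cache_miss=50_000, output=10_000)
after = UsageSnapshot(cache_hit=3_198_400, cache_miss=239_058, output=53_663)

delta = after - baseline
print(delta, delta.total)
```

Taking the difference per category, rather than only on the total, keeps the cache-hit, cache-miss, and output counts available for the cost calculation.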

Conclusion

This benchmark suggests that debugging and code writing are different skills. DeepSeek excels at writing clean, efficient code quickly. MiMo excels at deep analysis and finding subtle bugs.

For teams building AI-assisted development tools:

  • Use DeepSeek for code generation, refactoring, implementation
  • Use MiMo for code review, debugging, root cause analysis

The surprise finding: MiMo is cheaper for debugging despite using more tokens, thanks to a higher cache hit rate and zero commission on top-up. The difference here is about one cent per run, so it only becomes meaningful for high-volume debugging workloads.


Benchmark conducted on June 30, 2026 using DeepSeek API and Xiaomi MiMo API platforms. Full benchmark data available in the author's GitHub repository.

Top comments (1)

Alex Shev

Debugging benchmarks get much more useful when they include the failed reasoning path, not only the final fix. For coding agents, I care less about one lucky patch and more about whether the tool can localize, test, and explain the change.