Stanislav

Posted on Jun 30 • Edited on Jul 2

DeepSeek V4 Pro vs MiMo V2.5 Pro - Debugging Benchmark

#ai #llm #programming #python

A real-world comparison of two LLMs on a genuine race condition bug from GitHub

TL;DR

Metric	DeepSeek V4 Pro	MiMo V2.5 Pro
Time	~8 min (2 rounds)	~15 min (2 rounds)
Tokens	2.43M	3.36M
Cache hit rate	92.1%	95.2%
Cost	$0.14 (6% top-up fee)	$0.13 (0% fee)
Bugs found	1 race condition	3 race conditions
Fix approach	Prevention (lock-based)	Prevention (three-phase separation)

Verdict: MiMo is better at debugging (it finds more bugs and analyzes them in more depth) and slightly cheaper. DeepSeek is faster and, judging by its leaner fix design, better suited to writing code.

Why This Benchmark?

Most LLM benchmarks test coding ability: write a function, solve a puzzle, implement an algorithm. In real-world development, though, debugging is often harder than writing code. You need to:

Understand complex, multi-file codebases
Find non-obvious root causes
Explain the mechanism clearly
Propose a correct fix

We wanted to test this specific skill. So we took a real race condition bug from a popular open-source library and gave it to both models.

The Bug: httpcore #961

Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations

What's httpcore?

httpcore is a low-level HTTP client library used by httpx (the popular Python HTTP client). It handles connection pooling, HTTP/2, proxies, and more.

The Bug

When async tasks are cancelled during connection operations, the pool's internal state becomes inconsistent. The pool still counts connections as in use after the tasks that held them have been cancelled, which leads to pool exhaustion: new requests can never acquire a connection.
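The failure mode can be reproduced in miniature. The sketch below is a hypothetical toy pool, not httpcore's code: `ToyPool`, `in_use`, and the sleeps standing in for I/O are assumptions made for illustration. It shows how a cancellation that lands between a state update and the await that follows it leaves the pool's bookkeeping stuck.

```python
import asyncio


class ToyPool:
    """Hypothetical one-slot pool; not httpcore's actual implementation."""

    def __init__(self, size: int = 1) -> None:
        self.size = size
        self.in_use = 0  # connections the pool believes are checked out

    async def acquire(self) -> None:
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1          # state is updated first...
        await asyncio.sleep(0.1)  # ...then comes I/O, where cancellation can land

    def release(self) -> None:
        self.in_use -= 1


async def request(pool: ToyPool) -> None:
    await pool.acquire()
    try:
        await asyncio.sleep(0)  # stand-in for sending the request
    finally:
        pool.release()


async def demo() -> int:
    pool = ToyPool(size=1)
    task = asyncio.create_task(request(pool))
    await asyncio.sleep(0.01)  # let the task get suspended inside acquire()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pool.in_use


leaked = asyncio.run(demo())
print(f"slots still marked in use: {leaked}")
```

The cancelled task never reaches its `try` block, so `release()` never runs and the only slot stays marked as in use. Any later `acquire()` on this toy pool would raise "pool exhausted".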

Why It's Hard

Multi-file: Involves connection_pool.py, connection.py, and http2.py
Async-specific: Only manifests with asyncio/trio cancellation
Non-obvious: Logs show normal operation; the race window is tiny
Real-world: This is a production issue affecting real users

Methodology

Benchmark Structure

We gave each model the entire httpcore project at the commit BEFORE the fix (commit 79fa6bf). The project included:

Full source code (90 files)
README.md with bug description (no hints about the fix)
PROMPT.md with instructions
Reference SOLUTION.md and SOLUTION.diff (hidden from the models)

Round 1: Find the Bug

Prompt (identical for both models):

You are given a Python project with a bug. Your task is to find the bug
and write a detailed explanation of how to fix it.

1. Read README.md to understand the project and the bug description.

2. Analyze the source code in httpcore/_async/ and httpcore/_sync/
   to find the root cause of the race condition.

3. Run the tests to see which ones fail:
   pip install -e ".[asyncio]"
   pytest tests/ -v

4. Write your findings to SOLUTION.md with:
   - Root cause analysis (what exactly goes wrong)
   - Why it happens (the mechanism)
   - How to fix it (the approach, not necessarily the exact code)
   - Which files need to be changed

Do NOT modify the source code. Only write SOLUTION.md.

Round 2: Refine the Fix

After Round 1, both models proposed patches (cleanup handlers) rather than prevention (atomic state management). We gave them a hint:

Prompt (identical for both models):

Your previous fix handled orphaned connections in the cancellation
handler. This works, but it treats the symptom — connections still
get orphaned, you just clean them up after.

A better approach would be to prevent the race condition from
happening in the first place. The root cause is that state
management (tracking idle vs in-use connections) is interleaved
with I/O operations (queue.get(), queue.put()). When a task is
cancelled between state update and I/O, the pool loses track.

Can you find a way to make the state management atomic — so that
cancellation cannot happen midway through the acquire/release
sequence?

Write your refined solution to SOLUTION_V2.md.

Results: Round 1

DeepSeek V4 Pro (~3 minutes)

Root Cause: Found 1 race condition: connections are orphaned when a task is cancelled after a connection has been assigned to it but before the task resumes.

Key Insight: "The connection remains in the pool marked as 'in use' but the task that was supposed to use it is gone."

Proposed Fix: Handle orphaned connections in the cancellation handler — check if a connection was assigned and release it.

Quality: Excellent root cause analysis with a step-by-step explanation of the mechanism. However, the proposed fix was a patch (cleanup handler), not prevention.

MiMo V2.5 Pro (~9 minutes)

Root Cause: Found 3 distinct race conditions:

Status leak during initial pool lock
Status leak during ConnectionNotAvailable retry
Connection leak during cleanup

Key Insight: Explained why existing tests don't catch it — they use single-request scenarios.

Proposed Fix: Add cleanup handlers + defensive connection sweep.

Quality: Excellent analysis, deeper than DeepSeek's (3 bugs vs 1). However, the proposed fix was also a patch (cleanup handlers), not prevention.

Round 1 Comparison

Aspect	DeepSeek	MiMo
Time	~3 min	~9 min
Bugs found	1	3
Fix approach	Patch (cleanup)	Patch (cleanup)
Fix quality	🟡 Treats symptoms	🟡 Treats symptoms
Explanation quality	Excellent	Excellent

Results: Round 2

DeepSeek V4 Pro (~5 minutes)

Approach: Move connection claiming to the waiting task, make it atomic inside a lock.

Key Changes:

New _wait_and_acquire() method
Removed proactive assignment from response_closed()
Added _pool_state_changed event

Quality: 🟢 Architecturally clean, similar to the actual fix.
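The idea of letting the waiting task claim its own slot can be sketched as follows. This is a minimal illustration under stated assumptions, not DeepSeek's actual patch: `ToyPool` is hypothetical, and an `asyncio.Condition` stands in for the `_pool_state_changed` event named above.

```python
import asyncio
from typing import List, Tuple


class ToyPool:
    """Hypothetical pool in which each waiter claims its own slot."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0
        # Assumption: a Condition plays the role of the pool-state-changed event.
        self._state_changed = asyncio.Condition()

    async def acquire(self) -> None:
        async with self._state_changed:
            # Cancellation can only land while waiting, before any state is
            # touched. Once the predicate holds, the claim follows with no
            # await in between.
            await self._state_changed.wait_for(lambda: self.in_use < self.size)
            self.in_use += 1

    async def release(self) -> None:
        async with self._state_changed:
            self.in_use -= 1
            self._state_changed.notify_all()  # waiters wake up and re-check


async def worker(pool: ToyPool, hold: float, log: List[str], name: str) -> None:
    await pool.acquire()
    try:
        await asyncio.sleep(hold)  # stand-in for the request itself
        log.append(name)
    finally:
        await pool.release()


async def demo() -> Tuple[int, List[str]]:
    pool = ToyPool(size=1)
    log: List[str] = []
    a = asyncio.create_task(worker(pool, 0.05, log, "a"))
    b = asyncio.create_task(worker(pool, 0.0, log, "b"))
    await asyncio.sleep(0.01)  # "a" holds the slot, "b" is waiting for it
    b.cancel()                 # cancel the waiter before it claims anything
    c = asyncio.create_task(worker(pool, 0.0, log, "c"))
    await asyncio.gather(a, b, c, return_exceptions=True)
    return pool.in_use, log


result = asyncio.run(demo())
print(result)
```

Because nothing is assigned to a waiter in advance, cancelling "b" leaves no orphaned slot, and "c" can still acquire once "a" releases. A real implementation would also have to make `release()` itself safe against cancellation, which this sketch does not attempt.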

MiMo V2.5 Pro (~6 minutes)

Approach: Three-phase separation — CLEANUP (I/O), STATE (sync), I/O (network).

Key Changes:

_attempt_to_acquire_connection is now synchronous (no await inside lock)
AsyncShieldCancellation for critical sections
Top-level cleanup handler as safety net

Quality: 🟢 Systematic approach, analyzed 5 cancellation scenarios.

Round 2 Comparison

Aspect	DeepSeek	MiMo
Time	~5 min	~6 min
Approach	Lock-based atomic	Three-phase separation
Complexity	Medium	High
Edge cases	Good	Excellent (5 scenarios)

Token Usage & Cost

Metric	DeepSeek V4 Pro	MiMo V2.5 Pro
Total tokens	2,431,121	3,356,951
Cache hit (input tokens)	2,198,400	3,146,304
Cache miss (input tokens)	189,058	157,502
Output (tokens)	43,663	53,145
Cache hit rate	92.1%	95.2%
API requests	30	34
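The totals and hit rates in the table can be re-derived from the raw counts. The sketch below assumes the hit rate is computed over input tokens only (cache hit plus cache miss), which is the reading that reproduces the published 92.1% and 95.2%. The `USAGE` dict and the helper names are chosen for this example.

```python
from typing import Dict

# Token counts copied from the table above.
USAGE: Dict[str, Dict[str, int]] = {
    "DeepSeek V4 Pro": {"cache_hit": 2_198_400, "cache_miss": 189_058, "output": 43_663},
    "MiMo V2.5 Pro": {"cache_hit": 3_146_304, "cache_miss": 157_502, "output": 53_145},
}


def total_tokens(counts: Dict[str, int]) -> int:
    """Sum of cached input, uncached input, and output tokens."""
    return counts["cache_hit"] + counts["cache_miss"] + counts["output"]


def cache_hit_rate(counts: Dict[str, int]) -> float:
    """Share of input tokens served from cache (output tokens excluded)."""
    return counts["cache_hit"] / (counts["cache_hit"] + counts["cache_miss"])


for model, counts in USAGE.items():
    print(f"{model}: {total_tokens(counts):,} tokens, "
          f"{cache_hit_rate(counts):.1%} cache hit rate")
```

Dividing cache hits by the total including output tokens would instead give about 90.4% for DeepSeek, so the published figures appear to use the input-only definition.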

Cost Calculation

Both models use the same pricing on OpenCode Go:

Cache hit: $0.003625/M tokens
Cache miss: $0.435/M tokens
Output: $0.87/M tokens

DeepSeek V4 Pro:

Cache hit: 2.198M × $0.003625 = $0.008
Cache miss: 0.189M × $0.435 = $0.082
Output: 0.044M × $0.87 = $0.038
Subtotal: $0.128
With 6% top-up fee: $0.14

MiMo V2.5 Pro:

Cache hit: 3.146M × $0.003625 = $0.011
Cache miss: 0.158M × $0.435 = $0.069
Output: 0.053M × $0.87 = $0.046
Subtotal: $0.126
With 0% fee: $0.13
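The arithmetic above can be checked with a few lines of Python. The prices, token counts, and fee rates are taken from this post; `PRICE_PER_M`, `USAGE`, and `cost` are names chosen for this sketch.

```python
from typing import Dict, Tuple

# Price in USD per million tokens, as listed above.
PRICE_PER_M: Dict[str, float] = {
    "cache_hit": 0.003625,
    "cache_miss": 0.435,
    "output": 0.87,
}

# Token counts from the usage table.
USAGE: Dict[str, Dict[str, int]] = {
    "DeepSeek V4 Pro": {"cache_hit": 2_198_400, "cache_miss": 189_058, "output": 43_663},
    "MiMo V2.5 Pro": {"cache_hit": 3_146_304, "cache_miss": 157_502, "output": 53_145},
}

# Top-up fee applied on top of the token cost.
FEE_RATE: Dict[str, float] = {"DeepSeek V4 Pro": 0.06, "MiMo V2.5 Pro": 0.0}


def cost(tokens: Dict[str, int], fee_rate: float) -> Tuple[float, float]:
    """Return (subtotal, total including top-up fee) in USD."""
    subtotal = sum(tokens[kind] / 1_000_000 * price
                   for kind, price in PRICE_PER_M.items())
    return subtotal, subtotal * (1 + fee_rate)


for model, tokens in USAGE.items():
    subtotal, total = cost(tokens, FEE_RATE[model])
    print(f"{model}: subtotal ${subtotal:.3f}, with fee ${total:.2f}")
```

Before the fee, the two subtotals differ by only about $0.002. Most of the final one-cent gap comes from the 6% top-up fee.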

Why MiMo Is Cheaper

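Even though MiMo used 38% more tokens, it was still cheaper because:

No top-up commission (0% vs 6% for DeepSeek), which accounts for most of the roughly one-cent gap
Higher cache hit rate (95.2% vs 92.1%), so fewer tokens were billed at the full cache-miss price (157,502 vs 189,058) and the pre-fee subtotals were nearly equal ($0.126 vs $0.128)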

The Reference Fix (PR #880)

The actual fix by Tom Christie (httpcore author) was conceptually simple:

Approach: Move ALL state management into non-cancellable sections using locks.

Key Insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."

Files Changed: 9 files, +512/-379 lines

Both models converged on this approach in Round 2, though with different implementations:

DeepSeek: Lock-based atomic claiming
MiMo: Three-phase separation (CLEANUP → STATE → I/O)
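The common idea, keeping every state change free of await points so that cancellation cannot split it, can be sketched as follows. This is a hypothetical toy pool, not the code from PR #880 or from either model: `ToyPool`, `_claim`, and `_release` are names invented for illustration, and the sketch omits the lock that threaded (sync) code would need around the same state changes.

```python
import asyncio


class ToyPool:
    """Hypothetical pool: state changes are synchronous, I/O is kept separate."""

    def __init__(self, size: int = 1) -> None:
        self.size = size
        self.in_use = 0

    def _claim(self) -> bool:
        # No await in here, so on a single event loop neither a cancellation
        # nor a task switch can interrupt the check-and-update.
        if self.in_use >= self.size:
            return False
        self.in_use += 1
        return True

    def _release(self) -> None:
        self.in_use -= 1

    async def request(self) -> str:
        if not self._claim():
            raise RuntimeError("pool exhausted")
        try:
            await asyncio.sleep(0.1)  # stand-in for connect, send, and receive
            return "ok"
        finally:
            self._release()  # synchronous, so it also completes on cancellation


async def demo() -> int:
    pool = ToyPool(size=1)
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0.01)  # the task is now suspended in its I/O phase
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return pool.in_use


remaining = asyncio.run(demo())
print(f"slots still marked in use: {remaining}")
```

Here the cancellation can only arrive during the I/O phase, which sits inside `try`/`finally`, so the slot is always handed back and the counter returns to zero.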

Verdict

Different Profiles, Not Better/Worse

Task	Better Model	Why
Writing code	DeepSeek V4 Pro	Faster, fewer tokens, cleaner architecture
Debugging	MiMo V2.5 Pro	Finds more bugs, deeper analysis, cheaper

DeepSeek Strengths

Speed: Nearly 2x faster (~8 min vs ~15 min)
Efficiency: 28% fewer tokens (2.43M vs 3.36M)
Architecture: Cleaner, simpler solutions
Best for: Implementing features, writing new code

MiMo Strengths

Depth: Found 3 race conditions vs 1
Analysis: 5 cancellation scenarios, explains why tests don't catch it
Cost: Cheaper despite using more tokens (0% commission)
Best for: Debugging complex issues, code review, finding bugs

Cost Efficiency

For this specific debugging task:

DeepSeek: $0.14 for 2 rounds
MiMo: $0.13 for 2 rounds

MiMo is both better at debugging AND cheaper. The higher token usage is offset by a higher cache hit rate and the lack of a top-up commission.

Methodology Notes

Why Real Bugs?

Synthetic bugs are often too easy: models solve them in seconds. Real bugs from production codebases require:

Understanding complex architectures
Tracing state through multiple files
Reasoning about async/concurrent behavior
Finding non-obvious root causes

Why Two Rounds?

In real-world debugging, you often get a quick fix first, then refine it. Testing both rounds shows:

Initial debugging ability (finding the bug)
Iterative improvement (refining with hints)

Token Counting Methodology

Baseline usage snapshot taken before the benchmark
Second snapshot taken when the run completes
Delta = tokens used for the task
Cost calculated using OpenCode Go pricing
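The snapshot-and-delta procedure can be sketched as below. The `UsageSnapshot` shape is an assumption about what a provider's usage page exposes, and the counter values are placeholders invented for illustration, not figures from this benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSnapshot:
    """Cumulative token counters at one point in time (assumed shape)."""

    cache_hit: int
    cache_miss: int
    output: int

    def delta(self, baseline: "UsageSnapshot") -> "UsageSnapshot":
        """Tokens consumed between the baseline and this snapshot."""
        return UsageSnapshot(
            cache_hit=self.cache_hit - baseline.cache_hit,
            cache_miss=self.cache_miss - baseline.cache_miss,
            output=self.output - baseline.output,
        )

    @property
    def total(self) -> int:
        return self.cache_hit + self.cache_miss + self.output


# Placeholder counters, invented for illustration only.
before = UsageSnapshot(cache_hit=10_000, cache_miss=2_000, output=500)
after = UsageSnapshot(cache_hit=14_000, cache_miss=2_600, output=900)

used = after.delta(before)
print(used, used.total)
```

Taking the difference per category, rather than only of the totals, matters because each category is billed at a different price.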

Conclusion

This benchmark, though limited to a single bug, suggests that debugging and code writing are different skills. DeepSeek excels at writing clean, efficient code quickly. MiMo excels at deep analysis and finding subtle bugs.

For teams building AI-assisted development tools:

Use DeepSeek for code generation, refactoring, implementation
Use MiMo for code review, debugging, root cause analysis

The surprise finding: MiMo is cheaper for debugging despite using more tokens, thanks to a higher cache hit rate and zero commission on top-up. For high-volume debugging workloads, this cost difference adds up.

Benchmark conducted on June 30, 2026 using DeepSeek API and Xiaomi MiMo API platforms. Full benchmark data available in the author's GitHub repository.

Top comments (1)

Alex Shev • Jun 30

Debugging benchmarks get much more useful when they include the failed reasoning path, not only the final fix. For coding agents, I care less about one lucky patch and more about whether the tool can localize, test, and explain the change.