A real-world comparison of two LLMs on a genuine race condition bug from GitHub
TL;DR
| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
|---|---|---|
| Time | ~8 min (2 rounds) | ~15 min (2 rounds) |
| Tokens | 2.43M | 3.36M |
| Cache hit rate | 92.1% | 95.2% |
| Cost | $0.14 (6% top-up fee) | $0.13 (0% fee) |
| Bugs found | 1 race condition | 3 race conditions |
| Fix approach | Prevention (lock-based) | Prevention (three-phase separation) |
Verdict: MiMo is better at debugging (finds more bugs, deeper analysis) AND cheaper. DeepSeek is faster and better for writing code.
Why This Benchmark?
Most LLM benchmarks test coding ability — write a function, solve a puzzle, implement an algorithm. But in real-world development, debugging is harder than writing code. You need to:
- Understand complex, multi-file codebases
- Find non-obvious root causes
- Explain the mechanism clearly
- Propose a correct fix
We wanted to test this specific skill. So we took a real race condition bug from a popular open-source library and gave it to both models.
The Bug: httpcore #961
Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations
What's httpcore?
httpcore is a low-level HTTP client library used by httpx (the popular Python HTTP client). It handles connection pooling, HTTP/2, proxies, and more.
The Bug
When async tasks are cancelled during connection operations, the pool's internal state becomes inconsistent. The pool thinks connections are still in use when they're actually cancelled, leading to pool exhaustion — new requests can never acquire a connection.
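The failure mode can be reproduced in miniature. The following is a minimal sketch, assuming a hypothetical single-slot `ToyPool` (not httpcore's actual classes), in which the bookkeeping update and the I/O are separated by an await point. Cancelling the task at that await leaves the slot marked as in use.

```python
import asyncio


class ToyPool:
    """Hypothetical single-slot pool for illustration; not httpcore's implementation."""

    def __init__(self) -> None:
        self.in_use = 0

    async def request(self) -> None:
        # The state update happens first ...
        self.in_use += 1
        # ... then an await point, where asyncio can deliver a cancellation.
        await asyncio.sleep(0.01)  # stands in for network I/O
        # Never reached if the task is cancelled at the await above.
        self.in_use -= 1


async def demo() -> int:
    pool = ToyPool()
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0)  # let the task run up to its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pool.in_use


leaked = asyncio.run(demo())
print(leaked)  # 1: the slot is still counted as in use, but no task holds it
```

With a real pool that enforces a connection limit, enough of these leaked slots would block every later request, which is the exhaustion described above.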
Why It's Hard
- Multi-file: Involves connection_pool.py, connection.py, and http2.py
- Async-specific: Only manifests with asyncio/trio cancellation
- Non-obvious: Logs show normal operation; the race window is tiny
- Real-world: This is a production issue affecting real users
Methodology
Benchmark Structure
We gave each model the entire httpcore project at the commit BEFORE the fix (commit 79fa6bf). The project included:
- Full source code (90 files)
- README.md with bug description (no hints about the fix)
- PROMPT.md with instructions
- SOLUTION.md and SOLUTION.diff (hidden from models)
Round 1: Find the Bug
Prompt (identical for both models):
You are given a Python project with a bug. Your task is to find the bug
and write a detailed explanation of how to fix it.
1. Read README.md to understand the project and the bug description.
2. Analyze the source code in httpcore/_async/ and httpcore/_sync/
to find the root cause of the race condition.
3. Run the tests to see which ones fail:
pip install -e ".[asyncio]"
pytest tests/ -v
4. Write your findings to SOLUTION.md with:
- Root cause analysis (what exactly goes wrong)
- Why it happens (the mechanism)
- How to fix it (the approach, not necessarily the exact code)
- Which files need to be changed
Do NOT modify the source code. Only write SOLUTION.md.
Round 2: Refine the Fix
After Round 1, both models proposed patches (cleanup handlers) rather than prevention (atomic state management). We gave them a hint:
Prompt (identical for both models):
Your previous fix handled orphaned connections in the cancellation
handler. This works, but it treats the symptom — connections still
get orphaned, you just clean them up after.
A better approach would be to prevent the race condition from
happening in the first place. The root cause is that state
management (tracking idle vs in-use connections) is interleaved
with I/O operations (queue.get(), queue.put()). When a task is
cancelled between state update and I/O, the pool loses track.
Can you find a way to make the state management atomic — so that
cancellation cannot happen midway through the acquire/release
sequence?
Write your refined solution to SOLUTION_V2.md.
Results: Round 1
DeepSeek V4 Pro (~3 minutes)
Root Cause: Found 1 race condition: a connection is orphaned when a task is cancelled after the connection has been assigned to it but before the task resumes.
Key Insight: "The connection remains in the pool marked as 'in use' but the task that was supposed to use it is gone."
Proposed Fix: Handle orphaned connections in the cancellation handler — check if a connection was assigned and release it.
Quality: Excellent root cause analysis, step-by-step mechanism explanation. However, proposed fix was a patch (cleanup handler), not prevention.
MiMo V2.5 Pro (~9 minutes)
Root Cause: Found 3 distinct race conditions:
- Status leak during initial pool lock
- Status leak during ConnectionNotAvailable retry
- Connection leak during cleanup
Key Insight: Explained why existing tests don't catch it — they use single-request scenarios.
Proposed Fix: Add cleanup handlers + defensive connection sweep.
Quality: Excellent analysis, deeper than DeepSeek (3 bugs vs 1). However, proposed fix was also a patch (cleanup handlers), not prevention.
Round 1 Comparison
| Aspect | DeepSeek | MiMo |
|---|---|---|
| Time | ~3 min | ~9 min |
| Bugs found | 1 | 3 |
| Fix approach | Patch (cleanup) | Patch (cleanup) |
| Fix quality | 🟡 Treats symptoms | 🟡 Treats symptoms |
| Explanation quality | Excellent | Excellent |
Results: Round 2
DeepSeek V4 Pro (~5 minutes)
Approach: Move connection claiming into the waiting task and make the claim atomic inside a lock.
Key Changes:
- New _wait_and_acquire() method
- Removed proactive assignment from response_closed()
- Added _pool_state_changed event
Quality: 🟢 Architecturally clean, similar to the actual fix.
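The idea of letting the waiting task claim a connection itself can be sketched as follows. This is a simplified illustration, not DeepSeek's actual output or httpcore's code: `ClaimingPool`, `wait_and_acquire`, and `release` are hypothetical names that only echo the changes listed above, and the "lock" is modelled by a synchronous check-and-update with no await in between.

```python
import asyncio


class ClaimingPool:
    """Hypothetical sketch: waiters claim slots themselves; nothing is pre-assigned."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_use = 0
        self._state_changed = asyncio.Event()

    async def wait_and_acquire(self) -> None:
        while True:
            # Claim step: no await between the check and the update,
            # so a cancellation cannot land in the middle of it.
            if self.in_use < self.capacity:
                self.in_use += 1
                return
            # A cancellation while waiting here leaves in_use untouched,
            # because nothing has been assigned to this task yet.
            self._state_changed.clear()
            await self._state_changed.wait()

    def release(self) -> None:
        # The releaser only updates state and signals a change. It never
        # hands a slot to a specific waiter that might already be cancelled.
        self.in_use -= 1
        self._state_changed.set()


async def demo() -> int:
    pool = ClaimingPool(capacity=1)
    await pool.wait_and_acquire()  # a holder takes the only slot
    doomed = asyncio.create_task(pool.wait_and_acquire())
    survivor = asyncio.create_task(pool.wait_and_acquire())
    await asyncio.sleep(0)  # both waiters are now blocked on the event
    doomed.cancel()
    pool.release()  # the holder gives the slot back
    await survivor  # the surviving waiter claims it
    return pool.in_use


in_use_after = asyncio.run(demo())
print(in_use_after)  # 1: only the surviving waiter holds a slot
```

Because the cancelled waiter never owned anything, there is no orphaned slot to clean up afterwards.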
MiMo V2.5 Pro (~6 minutes)
Approach: Three-phase separation — CLEANUP (I/O), STATE (sync), I/O (network).
Key Changes:
- _attempt_to_acquire_connection is now synchronous (no await inside lock)
- AsyncShieldCancellation for critical sections
- Top-level cleanup handler as safety net
Quality: 🟢 Systematic approach, analyzed 5 cancellation scenarios.
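A rough sketch of the phase separation, under the assumption that state transitions are plain synchronous functions and only the I/O phase awaits. `PhasedPool` and its methods are hypothetical names for illustration and do not reproduce MiMo's patch; the `finally` block plays the role of the safety-net cleanup handler.

```python
import asyncio
from typing import Tuple


class PhasedPool:
    """Hypothetical toy pool; names are illustrative, not httpcore's API."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_use = 0

    def _try_acquire(self) -> bool:
        # STATE phase: synchronous code, so there is no await point
        # at which asyncio could deliver a cancellation midway.
        if self.in_use < self.capacity:
            self.in_use += 1
            return True
        return False

    def _release(self) -> None:
        # STATE phase again: also synchronous.
        self.in_use -= 1

    async def request(self) -> str:
        if not self._try_acquire():
            raise RuntimeError("pool exhausted")
        try:
            # I/O phase: a cancellation may land here, but the pool
            # state is already consistent at this point.
            await asyncio.sleep(0.01)  # stands in for network I/O
            return "ok"
        finally:
            # Runs on success, on error, and on cancellation.
            self._release()


async def demo() -> Tuple[int, str]:
    pool = PhasedPool(capacity=1)
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0)  # let the task reach its I/O phase
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    after_cancel = pool.in_use
    result = await pool.request()  # the slot is available again
    return after_cancel, result


after_cancel, result = asyncio.run(demo())
print(after_cancel, result)  # 0 ok
```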
Round 2 Comparison
| Aspect | DeepSeek | MiMo |
|---|---|---|
| Time | ~5 min | ~6 min |
| Approach | Lock-based atomic | Three-phase separation |
| Complexity | Medium | High |
| Edge cases | Good | Excellent (5 scenarios) |
Token Usage & Cost
| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
|---|---|---|
| Total tokens | 2,431,121 | 3,356,951 |
| Cache hit | 2,198,400 | 3,146,304 |
| Cache miss | 189,058 | 157,502 |
| Output | 43,663 | 53,145 |
| Cache hit rate | 92.1% | 95.2% |
| API requests | 30 | 34 |
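The table's figures are internally consistent if total tokens are the sum of cache hits, cache misses, and output, and if the cache hit rate is computed over input tokens only (hits plus misses). That reading is an inference from the numbers, not something the API dashboards state here. A short check:

```python
usage = {
    "DeepSeek V4 Pro": {"hit": 2_198_400, "miss": 189_058, "output": 43_663},
    "MiMo V2.5 Pro": {"hit": 3_146_304, "miss": 157_502, "output": 53_145},
}


def summarize(u: dict) -> tuple:
    """Return (total tokens, cache hit rate in percent over input tokens)."""
    total = u["hit"] + u["miss"] + u["output"]
    hit_rate = 100 * u["hit"] / (u["hit"] + u["miss"])
    return total, round(hit_rate, 1)


summary = {model: summarize(u) for model, u in usage.items()}
for model, (total, rate) in summary.items():
    print(f"{model}: {total:,} tokens, {rate}% cache hit rate")
```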
Cost Calculation
Both models use the same pricing on OpenCode Go:
- Cache hit: $0.003625/M tokens
- Cache miss: $0.435/M tokens
- Output: $0.87/M tokens
DeepSeek V4 Pro:
- Cache hit: 2.198M × $0.003625 = $0.008
- Cache miss: 0.189M × $0.435 = $0.082
- Output: 0.044M × $0.87 = $0.038
- Subtotal: $0.128
- With 6% top-up fee: $0.14
MiMo V2.5 Pro:
- Cache hit: 3.146M × $0.003625 = $0.011
- Cache miss: 0.158M × $0.435 = $0.069
- Output: 0.053M × $0.87 = $0.046
- Subtotal: $0.126
- With 0% fee: $0.13
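The arithmetic above can be reproduced from the exact token counts in the usage table. This sketch assumes the top-up fee is applied as a simple multiplier on the subtotal; the function and variable names are illustrative.

```python
# Prices in USD per million tokens, as listed above.
PRICE_PER_M = {"hit": 0.003625, "miss": 0.435, "output": 0.87}

runs = {
    "DeepSeek V4 Pro": {"hit": 2_198_400, "miss": 189_058, "output": 43_663, "fee": 0.06},
    "MiMo V2.5 Pro": {"hit": 3_146_304, "miss": 157_502, "output": 53_145, "fee": 0.0},
}


def cost(run: dict) -> tuple:
    """Return (subtotal, total including the top-up fee) in USD."""
    subtotal = sum(run[kind] / 1_000_000 * price for kind, price in PRICE_PER_M.items())
    return subtotal, subtotal * (1 + run["fee"])


costs = {model: cost(run) for model, run in runs.items()}
for model, (subtotal, total) in costs.items():
    print(f"{model}: subtotal ${subtotal:.3f}, with fee ${total:.2f}")
```

Before fees the two runs differ by only about $0.002, so most of the final gap comes from the 6% top-up fee.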
Why MiMo Is Cheaper
Even though MiMo used 38% more tokens, it was still cheaper because:
- No top-up commission (0% vs 6% for DeepSeek)
- Higher cache hit rate (95.2% vs 92.1%)
The Reference Fix (PR #880)
The actual fix by Tom Christie (httpcore author) was elegantly simple:
Approach: Move ALL state management into non-cancellable sections using locks.
Key Insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."
Files Changed: 9 files, +512/-379 lines
Both models converged on this approach in Round 2, though with different implementations:
- DeepSeek: Lock-based atomic claiming
- MiMo: Three-phase separation (CLEANUP → STATE → I/O)
Verdict
Different Profiles, Not Better/Worse
| Task | Better Model | Why |
|---|---|---|
| Writing code | DeepSeek V4 Pro | Faster, fewer tokens, cleaner architecture |
| Debugging | MiMo V2.5 Pro | Finds more bugs, deeper analysis, cheaper |
DeepSeek Strengths
- Speed: Nearly 2x faster (~8 min vs ~15 min)
- Efficiency: 28% fewer tokens (2.43M vs 3.36M)
- Architecture: Cleaner, simpler solutions
- Best for: Implementing features, writing new code
MiMo Strengths
- Depth: Found 3 race conditions vs 1
- Analysis: 5 cancellation scenarios, explains why tests don't catch it
- Cost: Cheaper despite using more tokens (0% commission)
- Best for: Debugging complex issues, code review, finding bugs
Cost Efficiency
For this specific debugging task:
- DeepSeek: $0.14 for 2 rounds
- MiMo: $0.13 for 2 rounds
MiMo is both better at debugging AND cheaper. The higher token usage is offset by the lack of top-up commission.
Methodology Notes
Why Real Bugs?
Synthetic bugs are too easy — models solve them in seconds. Real bugs from production codebases require:
- Understanding complex architectures
- Tracing state through multiple files
- Reasoning about async/concurrent behavior
- Finding non-obvious root causes
Why Two Rounds?
In real-world debugging, you often get a quick fix first, then refine it. Testing both rounds shows:
- Initial debugging ability (finding the bug)
- Iterative improvement (refining with hints)
Token counting methodology
- Baseline snapshot of the account's token counters taken before the benchmark
- Second snapshot taken when the benchmark completes
- Delta = tokens used for the task
- Cost calculated using OpenCode Go pricing
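The snapshot-and-delta approach can be sketched as below. The `UsageSnapshot` shape is an assumption about what a usage dashboard exposes, and the numbers are placeholders for illustration, not measurements from this benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSnapshot:
    """Cumulative token counters at one point in time (assumed shape)."""

    cache_hit: int
    cache_miss: int
    output: int

    def __sub__(self, earlier: "UsageSnapshot") -> "UsageSnapshot":
        # Counter-by-counter difference between two snapshots.
        return UsageSnapshot(
            self.cache_hit - earlier.cache_hit,
            self.cache_miss - earlier.cache_miss,
            self.output - earlier.output,
        )

    @property
    def total(self) -> int:
        return self.cache_hit + self.cache_miss + self.output


# Placeholder numbers for illustration only.
baseline = UsageSnapshot(cache_hit=1_000, cache_miss=200, output=50)
after = UsageSnapshot(cache_hit=4_000, cache_miss=500, output=150)

delta = after - baseline  # tokens attributable to the task
print(delta.total)  # 3400
```

This only isolates the task's usage if nothing else consumes tokens on the same account between the two snapshots.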
Conclusion
This benchmark reveals that debugging and code writing are different skills. DeepSeek excels at writing clean, efficient code quickly. MiMo excels at deep analysis and finding subtle bugs.
For teams building AI-assisted development tools:
- Use DeepSeek for code generation, refactoring, implementation
- Use MiMo for code review, debugging, root cause analysis
The surprise finding: MiMo is cheaper for debugging despite using more tokens, thanks to zero commission on top-up. For high-volume debugging workloads, this cost difference adds up.
Benchmark conducted on June 30, 2026 using DeepSeek API and Xiaomi MiMo API platforms. Full benchmark data available in the author's GitHub repository.