Darwin-35B-A3B-Opus
"The child surpassed both parents — that is evolution."
TL;DR: 35B MoE (3B active) | GPQA Diamond 90.0% (vs Father 84.2% & Mother 85.0%) | MMMLU 85.0% | Multimodal ✅ | 201 Languages | 262K Context | 147.8 tok/s | Apache 2.0
Table of Contents
Why Darwin — The Child That Surpassed Both Parents
Model Overview
Parent Models
Darwin V5 — Beyond Simple Merging
Model MRI Scans — Parent Neural Anatomy
Child Model Health Check — MRI Verification
Inherited Capabilities
Father's Official Benchmarks (Reference)
Performance & Hardware Requirements
Model Specifications
Usage
Built By
FAQ
- Why Darwin — The Child That Surpassed Both Parents
There is a fundamental question at the heart of AI model merging: if the parent models already exist, why crossbreed at all?
This model is the answer.
Benchmark Results
GPQA Diamond (198 Questions, Graduate-Level Reasoning)
Model Accuracy Multimodal Benchmark Published
🧬 Darwin-35B-A3B-Opus (Child) 90.0% ✅ Image/Video ✅ Fully Open
👩 Mother — Jackrong Claude 4.6 Opus Distilled 85.0% ❌ Text-only ❌ Not Published
👨 Father — Qwen3.5-35B-A3B (Official) 84.2% ✅ Image/Video ✅ Official
Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format ("ANSWER: LETTER")
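Under this setup, scoring reduces to parsing the model's final answer line. A minimal parser for the "ANSWER: LETTER" format might look like the sketch below (illustrative only; the actual evaluation harness is not published):

```python
import re

def extract_answer(completion):
    """Return the last 'ANSWER: X' letter (A-D) in a completion, or None."""
    found = re.findall(r"ANSWER:\s*([A-D])", completion)
    return found[-1] if found else None

def accuracy(completions, gold_letters):
    """Fraction of completions whose extracted letter matches the gold answer."""
    hits = sum(extract_answer(c) == g for c, g in zip(completions, gold_letters))
    return hits / len(gold_letters)
```

With temperature 0 and greedy decoding, the extracted letter is deterministic, so a single pass over the 198 questions fixes the score.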
MMMLU (Multilingual Knowledge, 29 Languages)
Model Accuracy
🧬 Darwin-35B-A3B-Opus (Child) 85.0%
👨 Father — Qwen3.5-35B-A3B (Official) 85.2%
The child outperformed both parents in reasoning while preserving Father-level multilingual knowledge.
GPQA vs Father: +6.9% relative improvement ((90.0−84.2)/84.2)
GPQA vs Mother: +5.9% relative improvement ((90.0−85.0)/85.0)
MMMLU: 85.0% — Father-level (85.2%) multilingual knowledge preserved
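The relative-improvement figures follow directly from the definition (child - parent) / parent:

```python
def rel_improvement(child, parent):
    """Relative improvement of the child over a parent score, in percent."""
    return (child - parent) / parent * 100

gpqa_vs_father = round(rel_improvement(90.0, 84.2), 1)  # 6.9
gpqa_vs_mother = round(rel_improvement(90.0, 85.0), 1)  # 5.9
```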
Why Not Simply Use the Mother?
Mother (Claude Distilled) Darwin (Child)
Reasoning Strong (85.0%) Stronger (90.0%)
Image/Video ❌ Lost during text-only fine-tuning ✅ Inherited from Father
201 Languages ❌ Potentially degraded ✅ Inherited from Father
262K Context Unverified ✅ Father's architecture preserved
Benchmark Transparency ❌ No scores published ✅ Fully open
Why Not Simply Use the Father?
The Father (Qwen3.5-35B-A3B) excels in versatility but plateaus at 84.2% on hard reasoning tasks. Darwin pushes reasoning to 90.0% while retaining Father-level multilingual knowledge (MMMLU 85.0% vs 85.2%) along with all general-purpose capabilities.
Bottom line: Darwin is the only model that exceeds the Mother's reasoning, preserves the Father's multilingual knowledge, and retains full multimodal capability — all at once.
- Model Overview
Darwin-35B-A3B-Opus is a next-generation reasoning-enhanced language model produced by VIDRAFT's Darwin V5 evolution engine.
Darwin V5 fuses two key innovations:
Evolutionary Merge — Applies natural selection to automatically discover optimal weight combinations across generations of candidates
Model MRI Integration — CT-scans each parent model layer by layer before merging, steering the evolutionary process with structural insight
If conventional merging is "mixing ingredients blindfolded," Darwin V5 is "precision surgery under X-ray guidance."
- Parent Models
Role Model Strengths
👨 Father Qwen/Qwen3.5-35B-A3B General knowledge, multimodal (image/video), coding, agents, 201 languages, 262K context
👩 Mother Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled Claude 4.6 Opus CoT distillation, structured step-by-step reasoning, coding agent compatibility
- Darwin V5 — Beyond Simple Merging
The Limitations of Conventional Merging
Traditional model merging requires humans to set hyperparameters — ratio, density, and the like — by intuition. You pick ratio=0.5, density=0.9, run the merge once, and hope for the best. The outcome hinges on luck, and applying a single ratio uniformly across billions of parameters ignores the distinct role each layer plays.
Darwin V4's Breakthrough
Darwin V4 addressed this with evolutionary algorithms — automatically exploring hundreds of parameter combinations and selecting survivors based on real benchmark scores. Yet V4 was still blind evolution: it had no understanding of what each layer actually does.
Darwin V5: Model MRI Opens the Eyes
V5 integrates Model MRI — a neural anatomy analyzer — to give the evolutionary process "sight":
[Phase 0] Model MRI — CT-scan both parents, layer by layer
↓ "Father's layers 15–25 concentrate multilingual knowledge"
↓ "Mother's layers 30–40 concentrate reasoning patterns"
↓
[Phase 1] MRI-Guided Evolution — Begin from a scan-informed initial genome
↓ Not random, but "initialized from CT findings"
↓
[Phase 2] mergekit real merge + benchmark-driven fitness selection
↓ Faster convergence within the MRI-narrowed search space
↓
[Phase 3] MRI Health Check — CT-scan the child model
↓ Detect interference and function loss
↓ Prescribe layer-specific ratio adjustments
↓
[Final] Darwin-35B-A3B-Opus
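The select-and-mutate loop of Phases 1–2 can be sketched as a toy evolutionary search. Everything here is illustrative (the `evolve` helper, mutation scheme, and toy fitness are assumptions); in the real engine, fitness is a mergekit merge followed by actual benchmark evaluation:

```python
import random

def evolve(init_genome, fitness, generations=10, pop=8, sigma=0.05):
    """Toy evolutionary search over a merge-hyperparameter genome.

    init_genome: dict of floats in [0, 1] (in V5, an MRI-informed prior
    from Phase 0 rather than a random start).
    fitness: callable scoring a genome (in Darwin V5, a real merge plus
    benchmark run; here, any cheap stand-in).
    """
    best, best_score = dict(init_genome), fitness(init_genome)
    for _ in range(generations):
        # Mutate the current best into a small candidate population.
        candidates = [
            {k: min(1.0, max(0.0, v + random.gauss(0, sigma)))
             for k, v in best.items()}
            for _ in range(pop)
        ]
        for g in candidates:
            s = fitness(g)
            if s > best_score:  # survival of the fittest
                best, best_score = g, s
    return best, best_score

# Toy fitness: prefer genomes near a hypothetical optimum.
target = {"ratio": 0.8, "ffn": 0.59}
def toy_fitness(g):
    return -sum((g[k] - target[k]) ** 2 for k in g)

genome, score = evolve({"ratio": 0.5, "ffn": 0.5}, toy_fitness)
```

The MRI-guided start matters because the fitness function is expensive: every evaluation is a full merge plus benchmark, so a narrower search space converges in far fewer generations.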
V4 vs V5 at a Glance
Darwin V4 Darwin V5
Analogy Mixing ingredients blindfolded Precision surgery under X-ray
Initial genome Random MRI-guided
Layer control 2 ratios (attn/ffn) 40 layers independently
Pre-diagnosis ❌ None ✅ Phase 0 MRI scan
Post-verification Benchmark only ✅ Phase 3 health check
Search efficiency Broad, unfocused Narrowed, guided search
Failure diagnosis Unknown "why" Pinpoints the failing layer
Darwin V4: Discovered Parameters (Blind Evolution)
Parameter Value Interpretation
ratio 0.481 Father 52% : Mother 48% — asymmetric blend
density_a 0.855 85.5% of Father's weights selected
density_b 0.971 97.1% of Mother's weights adopted
attn 0.168 Only 16.8% modification in attention layers
ffn 0.841 84.1% modification in FFN layers
What this means: Attention patterns (determining what to focus on) are almost entirely preserved from the Father, while FFN layers (the knowledge store) are largely overwritten with the Mother's reasoning patterns.
Discovering attn=0.168 alongside ffn=0.841 — this extreme asymmetry — is virtually impossible to arrive at through human intuition.
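Applied naively, the V4 genome amounts to per-module-type linear interpolation. A minimal sketch, assuming weight names distinguish attention from FFN modules (the naming convention and plain-list tensors here are illustrative, not mergekit's actual implementation):

```python
def merge_state_dicts(father, mother, attn_t=0.168, ffn_t=0.841):
    """Interpolate two state dicts with separate ratios per module type.

    t is the Mother's share: merged = (1 - t) * father + t * mother.
    Attention stays close to the Father (t=0.168); FFN weights are
    largely overwritten by the Mother (t=0.841).
    """
    merged = {}
    for name, w_f in father.items():
        w_m = mother[name]
        t = attn_t if "attn" in name else ffn_t  # assumed naming convention
        merged[name] = [(1 - t) * a + t * b for a, b in zip(w_f, w_m)]
    return merged

father = {"attn.q_proj": [0.0, 0.0], "mlp.up_proj": [0.0, 0.0]}
mother = {"attn.q_proj": [1.0, 1.0], "mlp.up_proj": [1.0, 1.0]}
out = merge_state_dicts(father, mother)
```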
Darwin V5: The MRI-Guided Merge Recipe
After scanning both parents, Model MRI prescribed a fundamentally different recipe:
MRI-Guided Genome
Parameter V4 (Blind) V5 (MRI) Shift
global_ratio 0.481 0.800 Mother weight ↑↑
attn_ratio 0.168 0.320 Attention also shifts toward Mother
ffn_ratio 0.841 0.590 FFN becomes more conservative
density_a 0.855 0.799 Similar
density_b 0.971 0.799 Mother density ↓ (Dead Expert compensation)
The key insight: MRI prescribed "draw more heavily from the Mother (ratio 0.8), but reduce density (0.799) because 50–65% of her experts are dead." V4, searching blindly, landed on ratio=0.481 — the opposite direction entirely.
Layer-Wise Merge Strategy (3 Surgical Blocks)
MRI did not prescribe uniform ratios. Instead, it partitioned all 40 layers into 3 distinct blocks:
Merge Ratio + Parent Importance + MoE Health per Layer
Block Layers t (Mother %) Router Source Rationale
Block 1 L0–L37 59.9% Mother Reasoning pattern injection across the bulk of the network
Block 2 L38 90.0% Mother Golden Layer — the Mother's core reasoning engine
Block 3 L39 53.4% Father Output layer — Father's router preserves multimodal routing
L38 is the "Golden Layer": The Mother's MRI revealed peak cosine distance at L34–L38 (see Mother MRI below). Darwin V5 responded by assigning t=0.9 to L38 — transplanting the Mother's reasoning engine nearly in its entirety.
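The 3-block prescription reduces to a per-layer schedule of t (the Mother's share), using the exact values from the merge recipe. The `layer_t` helper below is illustrative, not the engine's API:

```python
def layer_t(layer):
    """Mother-share t for each of the 40 layers, per the MRI prescription."""
    if layer == 38:
        return 0.9000  # Block 2, "Golden Layer": Mother's reasoning core
    if layer == 39:
        return 0.5336  # Block 3, output layer: Father-leaning routing
    return 0.5988      # Block 1, L0-L37: broad reasoning-pattern injection

schedule = [layer_t(i) for i in range(40)]
```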
- Model MRI Scans — Parent Neural Anatomy
Mother MRI: Claude 4.6 Opus Distilled
Mother Probe Cosine Distance
Probe-wise Layer Importance: Layers L34–L38 light up in intense red (high cosine distance) across the REASONING, CODE, and LOGIC probes — this is the Mother's reasoning engine.
Mother MoE Health
Metric Status Interpretation
Router Entropy ✅ ~1.0 across all layers Healthy — experts are evenly distributed
Dead Expert % 🔴 50–65% Critical — Claude distillation killed half the experts
Expert Similarity ✅ 0.001–0.008 Healthy — surviving experts remain diverse
A Dead Expert rate of 50–65% is the telltale fingerprint of Claude's text-only distillation. The fine-tuning process silenced multimodal and multilingual experts that were never activated during text-only training.
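The two headline metrics, dead-expert rate and router entropy, can be computed from per-expert routing counts. A sketch (the `moe_health` helper is an assumption about how such metrics are defined, not the Model MRI implementation):

```python
import math

def moe_health(counts):
    """Dead-expert fraction and normalized router entropy for one MoE layer.

    counts[i] = number of tokens the router sent to expert i.
    Normalized entropy of 1.0 means perfectly uniform expert utilization;
    a dead expert is one that received no tokens at all.
    """
    total = sum(counts)
    probs = [c / total for c in counts]
    dead_frac = sum(c == 0 for c in counts) / len(counts)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return dead_frac, entropy / math.log(len(counts))

dead, h = moe_health([100] * 8)              # healthy layer: uniform routing
dead2, h2 = moe_health([100] * 4 + [0] * 4)  # half the experts silenced
```

Note that the Mother can score ~1.0 entropy over her surviving experts while still having a high dead fraction, which is exactly the pattern the table reports.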
Mother Expert Utilization Heatmap
Expert Utilization Heatmap: The map is predominantly dark (inactive), with only sparse bright activations — the Claude reasoning pattern is concentrated in a small cluster of specialized experts.
Father MRI: A Healthy Generalist (The Organ Donor)
Father MoE Health
Father Expert Utilization Heatmap
Father Layer Importance by Probe
The Father (Qwen3.5-35B-A3B) exhibits healthy, uniform expert activation across all 40 layers — a well-balanced generalist with every expert alive and contributing. He serves as the "organ donor" who revives the Mother's dead 50–65% of experts.
Parent Comparison: Layer Advantage Map
Parent A vs B Layer Advantage
Above zero (↑ A): Father is stronger — primarily L0–L5 (embedding and early layers)
Below zero (↓ B): Mother is stronger — scattered but consistent from L5 through L35
L34–L38: Mother shows her strongest advantage on the REASONING and CODE probes
L39: Father recovers — the output layer favors Father's multimodal routing
This advantage map directly informed the 3-block merge recipe: Mother dominates L0–L38, Father reclaims L39.
How GPQA 90% Was Achieved
Mother L34–L38: reasoning engine (MRI red zone)
↓ t=0.9 — transplanted nearly in full
+
Father L39: output router (multimodal/multilingual expert activation)
↓ t=0.53 — Father's routing preserved
+
Dead Expert replacement → Father's living experts fill the Mother's dead slots
↓
= GPQA 90.0% (surpassing both parents)
The Mother's "reasoning brain" was transplanted while her dead experts were replaced with the Father's living counterparts. Reasoning went up; versatility stayed intact.
Evolution History
Phase 1 → Phase 2 evolution complete
Final real_score: 0.8405
Merge time: 181.6 seconds
Merge commit: 109838c2
- Child Model Health Check — MRI Verification
Darwin Health Check — Child vs Parents
✅ Verdict: Healthy — No issues detected.
The chart above plots the layer-by-layer importance of the child (Darwin, green bars) against both parents (Father = blue dashed, Mother = red dashed). Key findings:
Layer 0 (Embedding): The child's importance spikes to 0.42 — both parents exhibit similar peaks (~0.35–0.50). The child has successfully inherited the critical embedding layer from both parents with no interference.
Layers 1–33 (Middle): Near-zero importance across all three models. This is expected — middle layers in MoE architectures process information incrementally, with no single layer acting as a bottleneck. The child tracks both parents precisely, confirming zero function loss across the bulk of the network.
Layers 34–39 (Reasoning Engine): Importance rises sharply. This is the exact region where the Mother's MRI revealed intense reasoning activity (cosine distance > 0.6). The child's green bars match or exceed both parents — demonstrating that the Mother's reasoning patterns were successfully transplanted while the Father's output routing was preserved.
Layer 39 (Output): The child peaks at ~0.48, closely tracking both parents. The final output layer is intact.
Why This Matters
The MRI health check confirms three critical outcomes:
No interference — There is no layer where the child's importance abnormally exceeds the parents' (which would signal weight conflict)
No function loss — There is no layer where the parents had high importance but the child collapsed to zero
Successful transplant — The L34–L39 reasoning engine from the Mother is fully operational in the child
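These checks can be expressed as simple rules over the per-layer importance arrays. A sketch with hypothetical tolerances (`tol` and `floor` are illustrative, not the engine's actual thresholds):

```python
def health_check(child, father, mother, tol=1.5, floor=0.05):
    """Scan per-layer importance scores for the two failure modes.

    Interference: the child's importance far exceeds both parents
    (a sign of weight conflict). Function loss: a parent carried high
    importance at a layer but the child collapsed toward zero.
    """
    issues = []
    for i, (c, f, m) in enumerate(zip(child, father, mother)):
        parent_peak = max(f, m)
        if c > tol * max(parent_peak, 1e-9):
            issues.append((i, "interference"))
        if parent_peak > floor and c < 0.2 * parent_peak:
            issues.append((i, "function_loss"))
    return issues

# A child tracking both parents (as in the chart above) passes cleanly.
report = health_check([0.42, 0.01, 0.48], [0.40, 0.01, 0.47], [0.45, 0.02, 0.50])
```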
Darwin V5 MRI-Guided Merge Recipe
MRI-guided layer-wise merge (3 blocks)
Genome: ratio=0.800 attn=0.320 ffn=0.590 density=0.799
L0–L37: t=0.5988 (Mother 60%) — router from Mother
L38: t=0.9000 (Mother 90%) — "Golden Layer" reasoning core
L39: t=0.5336 (Father 47%) — router from Father (output routing)
Insight Detail
L38 = "Golden Layer" MRI identified L34–L38 as the Mother's reasoning core. Darwin assigned t=0.9 (90% Mother) to L38 specifically
Router Strategy: B→B→A Mother's router for the reasoning layers, Father's router for the final output — preserving both the reasoning pathways and multimodal routing
Dead Expert Revival The Mother's 50–65% dead experts (killed during text-only fine-tuning) were replaced with the Father's living experts — restoring multimodal and multilingual capabilities
📄 The full algorithm and technical details of the Darwin V5 evolution engine will be released alongside an upcoming paper.
- Inherited Capabilities
From the Father (Qwen3.5-35B-A3B)
Multimodal: Image and video understanding
201 Languages: Global linguistic coverage
262K Context: Native long-context support (extendable to 1M via YaRN)
Gated DeltaNet + MoE: Efficient hybrid architecture
Multi-Token Prediction: Improved inference throughput
From the Mother (Claude 4.6 Opus Distilled)
Structured Thinking: Systematic step-by-step reasoning within <think> tags
Efficient Reasoning: "Let me analyze this request carefully: 1… 2… 3…" pattern
Coding Agent Compatibility: Native "developer" role support for Claude Code and OpenCode
Tool Calling Stability: Consistent performance in tool-use scenarios
Autonomous Execution: Extended autonomous operation in agentic environments
- Father's Official Benchmarks (Reference)
Darwin is built on this architecture with enhanced reasoning:
Category Benchmark Father Official
Knowledge MMLU-Pro 85.3
Knowledge MMLU-Redux 93.3
Reasoning GPQA Diamond 84.2
Reasoning HLE w/ CoT 22.4
Math HMMT Feb 2025 89.0
Coding SWE-bench Verified 69.2
Coding LiveCodeBench v6 74.6
Agent TAU2-Bench 81.2
Agent BFCL-V4 (Tool Use) 67.3
Instruction IFEval 91.9
Multilingual MMMLU 85.2
Agentic Search BrowseComp 61.0
Performance & Hardware Requirements
Inference Speed
Metric Value
Generation Speed 147.8 tok/s
Environment Single NVIDIA H100 93GB NVL, SGLang, BF16
Qwen Official API 162.8 tok/s (Alibaba Cloud)
Hardware Requirements
Setup VRAM Status
BF16 (Full Precision) 65.5 GiB
Single H100 93GB NVL 93 GB ✅ Comfortable
Single A100 80GB 80 GB ⚠️ Tight
Single A100 40GB 40 GB ❌ Insufficient
Q8 Quantized ~35 GiB
Single A100 40GB 40 GB ✅ Feasible
Q4_K_M Quantized ~18 GiB
Single RTX 4090 24GB 24 GB ✅ Comfortable
2× RTX 4090 (tp=2) 48 GB ✅ Q8 feasible
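The weight-memory figures above are close to back-of-envelope arithmetic; the table's numbers additionally include format overhead, and the Q4_K_M bytes-per-parameter value below is an assumption of roughly 4.4 bits per weight:

```python
def vram_gib(params_billions, bytes_per_param):
    """Weights-only memory in GiB; ignores KV cache, activations, overhead."""
    return params_billions * 1e9 * bytes_per_param / 2**30

bf16 = vram_gib(35, 2.0)   # BF16: 2 bytes per parameter, ~65 GiB
q8   = vram_gib(35, 1.0)   # Q8: ~1 byte per parameter, ~33 GiB
q4   = vram_gib(35, 0.55)  # Q4_K_M: ~4.4 bits per parameter, ~18 GiB
```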
As a Mixture-of-Experts model, only 3B parameters are active per token despite loading the full 35B. This sparsity means quantization has minimal impact on output quality.
Model Specifications
Architecture Qwen3.5 MoE (Gated DeltaNet + MoE)
Total Parameters 35B
Active Parameters 3B per forward pass
Hidden Dimension 2,048
Layers 40
Layer Layout 10 × (3 × GDN→MoE + 1 × Attention→MoE)
Experts 256 (8 routed + 1 shared active)
Expert Intermediate Dim 512
Context Length 262,144 native (up to 1,010,000 via YaRN)
Languages 201
Multimodal ✅ Image & Video input
License Apache 2.0
Engine Darwin V5 (Evolutionary Merge + Model MRI)
Evolution Phase Phase 2, real_score 0.8405
Merge Commit 109838c2
Usage
SGLang (Recommended)
python -m sglang.launch_server \
--model-path FINAL-Bench/Darwin-35B-A3B-Opus \
--tp 1 \
--mem-fraction-static 0.90 \
--context-length 32768 \
--trust-remote-code
vLLM
vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
--trust-remote-code \
--enforce-eager
Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"FINAL-Bench/Darwin-35B-A3B-Opus",
dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
)
Best Practices
Use context ≥ 32K for reasoning tasks — the model leverages extended thinking
For maximum reasoning quality, use thinking mode (default) with generous max_tokens (≥ 16384)
The model generates <think>…</think> blocks for internal reasoning; extract the final answer after the closing </think> tag
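Stripping the thinking block can be done with a one-line split on the closing tag. A minimal helper, assuming the <think>…</think> convention described above:

```python
import re

def final_answer(output):
    """Text after the last closing </think> tag; whole output if no tag."""
    return re.split(r"</think>", output)[-1].strip()

sample = "<think>Let me analyze this request carefully: 1…</think>\nThe answer is 42."
```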
- Built By
Developer VIDRAFT
Evolution Engine Darwin V5 (Evolutionary Merge + Model MRI)
Infrastructure 4 × NVIDIA H100 93GB NVL GPU
Merge Time 181.6 seconds
Shard Distribution 14 shards → GPU [1, 2, 3] round-robin
Acknowledgements
Korean Government — This research was supported by the Korean Government's 'GPU Support Program' research grant
Qwen Team — Qwen3.5-35B-A3B base architecture
Jackrong — Claude 4.6 Opus Reasoning Distilled model
nohurry, TeichAI — Distillation datasets
Citation
@misc{vidraft_darwin_35b_opus,
  title = {Darwin-35B-A3B-Opus: MRI-Guided Evolutionary Merge Beyond Both Parents},
  author = {VIDRAFT},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}}
}
Contact
📧 kkms1116@koreacu.ac.kr
- FAQ
What is Darwin-35B-A3B-Opus?
How does Darwin V5 differ from simple model merging?
What GPU do I need to run this model?
Does it support multimodal inputs (images/video)?
What languages does it support?
What is Model MRI?
What are "Dead Experts" and why do they matter?
Is this model open source?

#DarwinAI #EvolutionaryMerge #ModelMRI #DarwinV5 #GPQA90 #Qwen35 #MoE3B #Reasoning #Multimodal #201Languages #OpenSource #Apache2 #VIDRAFT #NaturalSelection #LayerWiseMerge #ClaudeOpus #ThinkingModel #CodingAgent #LongContext262K #BestOpenSourceLLM2026 #DeadExpertRevival #GoldenLayer #MoEMerge #NeuralAnatomy