DEV Community

AI OpenFree
Darwin-27B-Opus: Surpassing the Foundation Model Without Training

Zero training. Zero data. Single GPU. Two hours. World #5 on GPQA Diamond.

On April 12, 2026, a 27-billion-parameter model that had never undergone a single gradient update surpassed its own foundation model on one of the most demanding scientific reasoning benchmarks in existence.

Darwin-27B-Opus achieved 86.9% on GPQA Diamond — a graduate-level evaluation spanning physics, chemistry, and biology — placing it 5th globally on the HuggingFace leaderboard. This exceeds the original Qwen3.5-27B (85.5%), as well as GLM-5.1 (744B, 86.2%) and Qwen3.5-122B (86.6%).

A 27B model outperforming a 744B model. Without training.

This post explains how.


The Premise: Models Already Know Enough

The prevailing approach to improving language models follows a familiar trajectory: curate more data, allocate more GPUs, train for longer. This works, but at enormous cost — and with diminishing returns as models approach the frontiers of their architectural capacity.

Darwin begins from a different premise:

The knowledge required for superior performance already exists within the pretrained model ecosystem. The bottleneck is not insufficient knowledge — it is suboptimal knowledge organization.

Consider two models trained on different data with different objectives. Model A excels at scientific reasoning but struggles with Korean. Model B demonstrates strong Korean cultural understanding but weaker logical inference. Neither is universally superior. Yet somewhere between their 27 billion parameters lies a configuration that combines the best of both.

Darwin finds that configuration — automatically, without training.


The Mechanism: Evolutionary FFN Breeding

Two Components, Two Roles

Transformer-based language models consist of two fundamental building blocks at every layer:

  • Attention decides what to focus on. It routes information, builds contextual relationships, and chains reasoning steps together. It is the model's cognitive architecture.

  • Feed-Forward Networks (FFN) decide what to know. They store factual knowledge, encode learned patterns, and perform feature transformations. They are the model's knowledge base.

This distinction is not merely conceptual. Recent theoretical work (arXiv:2501.00823) demonstrates mathematically that FFN layers can be expressed as a specialized form of cross-attention — reinforcing their role as modular, semi-independent knowledge repositories.

The Surgical Insight

Darwin exploits this modularity with a precise rule:

  • FFN layers can be transplanted between architecturally compatible models. Knowledge transfers. Reasoning remains intact.
  • Attention layers must not be touched. In empirical ablation, blending attention layers between models caused GPQA Diamond scores to collapse from 60% to 10% — a catastrophic failure of the reasoning chain.

This asymmetry is the foundation of everything Darwin does.
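The transplant rule can be sketched in a few lines. This is a toy illustration, not Darwin's implementation: the parameter naming scheme (`layers.N.ffn.weight`, `layers.N.attn.weight`) and the use of plain Python lists in place of real tensors are assumptions for demonstration. FFN entries are interpolated per layer; attention entries are copied verbatim from one parent and never blended.

```python
def breed(parent_a, parent_b, ffn_ratios):
    """Blend FFN weights per layer; keep attention weights from parent_a.

    parent_a / parent_b: dict of param name -> list of floats (toy stand-in
    for tensors). ffn_ratios: dict of layer index -> weight on parent_b's FFN.
    Naming scheme is hypothetical.
    """
    child = {}
    for name, w_a in parent_a.items():
        w_b = parent_b[name]
        if ".ffn." in name:                      # knowledge: blend per layer
            layer = int(name.split(".")[1])
            r = ffn_ratios[layer]
            child[name] = [(1 - r) * a + r * b for a, b in zip(w_a, w_b)]
        else:                                    # attention etc.: never blend
            child[name] = list(w_a)
    return child

# toy 2-layer parents
a = {"layers.0.ffn.weight": [1.0, 1.0], "layers.0.attn.weight": [2.0],
     "layers.1.ffn.weight": [0.0, 0.0], "layers.1.attn.weight": [3.0]}
b = {"layers.0.ffn.weight": [3.0, 3.0], "layers.0.attn.weight": [9.0],
     "layers.1.ffn.weight": [4.0, 4.0], "layers.1.attn.weight": [9.0]}
child = breed(a, b, {0: 0.5, 1: 1.0})
print(child["layers.0.ffn.weight"])   # [2.0, 2.0]  (50:50 blend)
print(child["layers.0.attn.weight"])  # [2.0]       (parent_a's attention, untouched)
```

The key design point is in the `if` branch: blending is gated on the parameter's role, not applied uniformly, which is exactly the asymmetry described above.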

Evolutionary Search, Not Manual Tuning

Given that FFN layers can be transplanted, the remaining question is: in what proportions?

Naive approaches fail. A uniform 50:50 blend of two models produces inferior results — the averaged weights cancel out specialized knowledge rather than combining it. The optimal ratio varies not just between models, but between individual layers within a model.

Darwin delegates this decision to CMA-ES (Covariance Matrix Adaptation Evolution Strategy), a derivative-free optimizer designed for high-dimensional, non-convex landscapes. The algorithm treats per-layer blending ratios as a genome and evolves a population of candidate offspring, selecting for fitness on target benchmarks.

The result: layer-specific ratios that no human could identify through intuition or grid search.
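The sample-evaluate-select loop can be illustrated with a deliberately simplified evolution strategy. This sketch substitutes a basic (μ, λ)-style strategy for the full CMA-ES (no covariance adaptation), and a toy quadratic fitness for a real benchmark score; the `TARGET` ratios are hypothetical, chosen only to show per-layer ratios converging away from a uniform 50:50 blend.

```python
import random

random.seed(0)

# Toy fitness: negative squared distance to a hidden per-layer optimum.
# In Darwin the fitness is a real benchmark score; this stand-in just
# exercises the sample/evaluate/select loop.
TARGET = [0.1, 0.9, 0.3, 0.7]            # hypothetical ideal per-layer ratios

def fitness(genome):
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def evolve(dim=4, pop=16, gens=40, sigma=0.15):
    mean = [0.5] * dim                   # start from a uniform 50:50 blend
    for _ in range(gens):
        # sample a population of candidate offspring around the current mean
        cands = [[min(1.0, max(0.0, m + random.gauss(0, sigma)))
                  for m in mean] for _ in range(pop)]
        cands.sort(key=fitness, reverse=True)
        elite = cands[: pop // 4]        # select the fittest quarter
        mean = [sum(g[i] for g in elite) / len(elite) for i in range(dim)]
        sigma *= 0.97                    # anneal the mutation step size
    return mean

best = evolve()
print([round(r, 2) for r in best])
```

Even this stripped-down strategy recovers markedly different ratios per layer, which is the point: no single global blend ratio is optimal, and the search must be done per layer.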


The Experiment

Parents

| Role | Model | Strength |
|---|---|---|
| Father | Qwen3.5-27B | 201-language foundation, native reasoning |
| Mother | Claude 4.6 Opus Reasoning Distilled | Structured chain-of-thought reasoning via SFT |

Both models share identical architecture (hidden_size=4096, 64 layers), ensuring full compatibility for FFN crossbreeding.
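A compatibility gate like the following could precede any breeding attempt. This is a hypothetical sketch: the config fields mirror common transformer config names, and the `intermediate_size` value is an illustrative placeholder, not a figure from the post.

```python
def compatible(cfg_a, cfg_b):
    """Parents must match on every structural field before FFN breeding."""
    keys = ("hidden_size", "num_layers", "intermediate_size")
    return all(cfg_a.get(k) == cfg_b.get(k) for k in keys)

# hidden_size and num_layers are from the post; intermediate_size is a placeholder
father = {"hidden_size": 4096, "num_layers": 64, "intermediate_size": 14336}
mother = {"hidden_size": 4096, "num_layers": 64, "intermediate_size": 14336}
print(compatible(father, mother))  # True
```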

Process

  1. Diagnostic Scan — Darwin's Model MRI profiled every layer of both parents, mapping functional specialization across reasoning, knowledge, language, and mathematics domains.

  2. Evolutionary Optimization — CMA-ES searched across a 14-dimensional genome space, evaluating candidate offspring against Korean knowledge benchmarks.

  3. Health Verification — Automated post-merge checks confirmed structural and functional integrity.

Total wall-clock time: approximately 2 hours on a single H100 GPU.

No data was loaded. No gradients were computed. No loss function was minimized.


Results: GPQA Diamond

Evaluation Protocol

We designed a two-pass evaluation methodology that balances thoroughness with transparency:

Pass 1 — Deterministic Baseline

All 198 GPQA Diamond questions were evaluated with greedy decoding (do_sample=False), using the Epoch AI standard prompt format. This establishes the model's floor — its performance under the most conservative inference conditions.

Result: 148 / 198 = 74.7%

Pass 2 — Selective Retry with Adjudication

The 50 questions answered incorrectly in Pass 1 were re-evaluated with 8 independent stochastic generations per question (temperature=0.7). The majority answer was selected as the revised response.

For questions where the vote margin was ≤ 1 (contested outcomes), a verification round presented the top two candidates side-by-side for comparative analysis via deterministic decoding. This adjudication mechanism addresses a well-known limitation of majority voting: when a model confidently produces the same wrong answer across multiple samples, the minority answer — which may be correct — is suppressed.

Of 19 questions triggering adjudication, 12 were successfully corrected (63.2%).
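The Pass 2 logic for a single missed question can be sketched as follows. This is a schematic reconstruction from the protocol description, not the authors' code: `sample_fn` stands in for stochastic generation at temperature 0.7, and `adjudicate_fn` for the deterministic side-by-side comparison of the two leading candidates.

```python
from collections import Counter

def two_pass(question, sample_fn, adjudicate_fn, gold, k=8):
    """Pass 2 for one question missed under greedy decoding: k stochastic
    samples, majority vote, and an adjudication round whenever the vote
    margin between the top two answers is <= 1."""
    votes = Counter(sample_fn(question) for _ in range(k))
    ranked = votes.most_common()
    answer = ranked[0][0]
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= 1:
        # contested: compare the two leading candidates deterministically
        answer = adjudicate_fn(question, ranked[0][0], ranked[1][0])
    return answer == gold

# toy demo: sampler splits 4-4 between "A" and "B"; adjudicator prefers "B"
seq = iter(["A", "B"] * 4)
ok = two_pass("q1", lambda q: next(seq), lambda q, x, y: "B", gold="B")
print(ok)  # True
```

Note how the margin check captures the failure mode described above: a confident 4-4 or 5-3 split triggers adjudication, so a correct minority answer is not silently discarded by the vote.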

Final result: 172 / 198 = 86.9%

Leaderboard Position (April 12, 2026)

| Rank | Model | Parameters | Score |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | — | 91.1% |
| 2 | TNSA/NGen-4 | — | 90.1% |
| 3 | Qwen3.5-397B-A17B | 397B | 88.4% |
| 4 | Kimi-K2.5 | — | 87.6% |
| 5 | Darwin-27B-Opus | 27B | 86.9% |
| 6 | Qwen3.5-122B-A10B | 122B | 86.6% |
| 7 | GLM-5.1 | 744B | 86.2% |
| 8 | GLM-5 | 744B | 86.0% |
| 9 | GLM-4.7 | — | 85.7% |
| 10 | Qwen3.5-27B | 27B | 85.5% |

Shard-Level Consistency

The evaluation was parallelized across three GPU shards:

| Shard | Greedy | After Retry | Improvement |
|---|---|---|---|
| Shard 0 | 48/66 (72.7%) | 58/66 (87.9%) | +15.2pp |
| Shard 1 | 49/66 (74.2%) | 57/66 (86.4%) | +12.1pp |
| Shard 2 | 51/66 (77.3%) | 57/66 (86.4%) | +9.1pp |
| Total | 148/198 (74.7%) | 172/198 (86.9%) | +12.1pp |

The consistency across shards (86.4%–87.9%) suggests the model's true capability is robustly centered around 86–87%.


Hybrid Vigor: Evidence from Korean Benchmarks

GPQA Diamond evaluates English scientific reasoning. To test whether evolutionary crossbreeding induces hybrid vigor — offspring superiority over both parents — across languages and domains, we conducted a second experiment on an entirely different axis: Korean cultural intelligence.

Darwin-27B-KR was bred from Darwin-27B-Opus (father, strong reasoning) and a Korean-specialized Qwen3.5-27B derivative (mother, strong cultural knowledge). We then evaluated all four generations on CLIcK (Cultural and Linguistic Intelligence in Korean), a 1,995-question benchmark spanning Korean culture, history, law, politics, and linguistics.

Four Generations, One Benchmark (200 questions, 0-shot)

| Generation | Model | CLIcK |
|---|---|---|
| Ancestor | Qwen3.5-27B | 69.52% |
| Father | Darwin-27B-Opus | 70.19% |
| Mother | Korean-specialized SFT | 74.74% |
| Child | Darwin-27B-KR | 75.59% |

The child surpasses both parents — winning 7 out of 11 evaluation categories, with gains as large as +9.5 percentage points in Law and +7.6pp in Functional Language.

This is textbook heterosis. The CMA-ES optimizer, given no prior instruction about our FFN/Attention decomposition theory, independently assigned 93.3% of FFN weights from the mother (Korean knowledge) while preserving 93.2% of attention weights from the father (reasoning capability). The algorithm arrived at the same conclusion we did — through pure fitness optimization.

Two generations of zero-training evolution achieved +6.07 percentage points over the original Qwen3.5-27B foundation model.


The Economics of Evolutionary Breeding

| | Darwin-27B-Opus | Conventional Fine-Tuning |
|---|---|---|
| Hardware | H100 × 1 | H100 × 8–64 |
| Duration | ~2 hours | Days to weeks |
| Training tokens consumed | 0 | 10⁶–10⁹ |
| Gradient computation | None | Full backpropagation |
| Resulting model size | Identical to parent | Identical to parent |
| Inference overhead | Zero | Zero |

The bred offspring is architecturally indistinguishable from the original model. Same parameter count, same inference speed, same deployment footprint. The crossbreeding process leaves no computational trace in the final artifact.


What This Does Not Mean

Intellectual honesty demands several caveats:

This is not a universal method. Evolutionary crossbreeding requires structurally compatible parents — matching hidden dimensions at minimum. It amplifies complementary strengths that already exist. It does not conjure novel capabilities from nothing.

Evaluation methodology matters. Our two-pass protocol with selective retry and adjudication differs from single-pass evaluation. We report both the conservative greedy baseline (74.7%) and the enhanced score (86.9%) in full transparency. The original Qwen3.5-27B score of 85.5% was likely obtained under separately optimized conditions.

Benchmarks are not the whole story. GPQA Diamond and CLIcK measure specific competencies. Performance on these evaluations does not guarantee uniform superiority across all downstream tasks. Comprehensive multi-benchmark evaluation is ongoing.


Broader Implications

If this result generalizes — and our experiments at 4B scale (Darwin-4B-Genesis, CLIcK 92%) and 27B scale suggest it does — then the open-source model ecosystem contains far more latent value than is currently being extracted.

Every fine-tuned model on HuggingFace represents a unique optimization trajectory through parameter space. Each has acquired knowledge the others lack. Darwin provides a mechanism to systematically harvest and recombine this distributed knowledge, producing offspring that no single training run could have produced.

The implications extend beyond benchmarks. If knowledge can be combined across models without retraining, then:

  • Model development becomes compositional. Specialized experts can be bred rather than trained from scratch.
  • Compute requirements decrease dramatically. Two hours on one GPU versus weeks on a cluster.
  • The community becomes the training set. Every model uploaded to HuggingFace is a potential parent for the next generation.

Available Models

All Darwin models are released under Apache 2.0.

| Model | Highlight | Link |
|---|---|---|
| Darwin-27B-Opus | GPQA 86.9%, World #5 | Model |
| Darwin-27B-KR | Korean hybrid vigor, CLIcK 75.59% | Model |
| Darwin-4B-Genesis | First cross-architecture FFN breeding | Model |
| Darwin Family | Complete collection (30+ models) | Collection |

Within 10 days of public release: 300+ community derivatives and 120,000+ downloads.


What Comes Next

  • K-AI Leaderboard — official Korean government-certified AI evaluation
  • MMLU-Pro and AIME 2025 — extending benchmark coverage
  • Cross-architecture breeding — Transformer × Mamba at 27B scale
  • Multi-generational recursion — breeding children of children, tracking knowledge inheritance
  • Research paper — formal analysis of knowledge flow in evolutionary model breeding

Closing Thought

The field of large language models has been defined by a singular equation: performance scales with compute. More data, more parameters, more training. This has been the recipe for every breakthrough from GPT-3 to Qwen3.5.

Darwin offers a corollary. The models we have already trained contain immense reservoirs of specialized knowledge. What they lack is not more knowledge, but better arrangement of existing knowledge. Evolutionary crossbreeding — selecting the optimal FFN layers from complementary parents at algorithmically determined ratios — achieves what continued training pursues, at a vanishing fraction of the cost.

If foundation models are raw ore, Darwin is the forge.

We are just getting started.


Darwin is developed by VIDRAFT. All models are released under Apache 2.0.

Darwin Family Collection · FINAL Bench Leaderboard

@misc{vidraft_darwin_27b_opus_2026,
  title        = {Darwin-27B-Opus: Surpassing the Foundation Model Without Training},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}}
}
