What Is Ornith-1.0 and Why Does It Matter?
Ornith-1.0 is an open-source family of agentic coding models from DeepReinforce AI that learns to write its own reinforcement learning scaffolds during training — jointly optimising its orchestration strategy and the code it generates. The 397B flagship matches Claude Opus 4.7 on SWE-Bench Verified (82.4% vs 80.8%) and beats it on Terminal-Bench 2.1 (77.5% vs 70.3%), while the 9B variant fits a single GPU and outperforms models three times its size.
Released on June 25, 2026 under a permissive MIT license, Ornith-1.0 is available on GitHub and Hugging Face in four variants from 9B to 397B parameters. The real story is not the benchmark numbers — it is how Ornith-1.0 achieves them, and what that means for the future of AI-assisted development.
Sam Witteveen provides a comprehensive overview of Ornith-1.0, demonstrating the self-scaffolding RL training process and practical coding benchmarks — including real tests with the 35B MoE variant.
The Self-Scaffolding Breakthrough
Here is what makes Ornith-1.0 fundamentally different. Most coding agents pair a model with a fixed, human-designed harness (scaffold) that manages memory, tool orchestration, and error handling. Teams spend months hand-designing one scaffold per task category. It is expensive, brittle, and fails to scale.
Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the model's policy:
Propose Scaffold: The model reads the task and proposes a refined scaffold optimised for that specific problem.
Generate Rollout: Using the new scaffold, the model generates the code solution. The reward signal flows back to both stages — teaching better code and better orchestration simultaneously.
Over thousands of RL iterations, per-task orchestration strategies emerge automatically. The model mutates its own scaffold templates and naturally selects higher-reward trajectories without any human workflow engineering.
As remio.ai notes, "Agentic coding performance has stopped rising mainly from raw parameter count. Success now depends on how reliably a model recovers from partial failures and re-plans without human intervention." Ornith-1.0 directly addresses this — producing longer autonomous runs before a developer needs to step in.
Benchmark Performance: Beats Opus 4.7 (But Not 4.8)
On headline benchmarks, Ornith-1.0-397B delivers:
SWE-Bench Verified: 82.4% (vs Claude Opus 4.7 at 80.8%)
Terminal-Bench 2.1: 77.5% (vs Claude Opus 4.7 at 70.3%)
SWE-Bench Multilingual: 78.9% — a standout for multi-language engineering
SWE-Bench Pro: 62.2% (vs Claude Opus 4.7 at 64.3%)
However, Claude Opus 4.8 still leads at 87.6% on SWE-Bench Verified and 85.0% on Terminal-Bench 2.1. GLM-5.2-744B beats Ornith on Terminal-Bench at 81.0%. DeepReinforce's "SOTA" claim is scoped to open-source models of comparable size.
One important caveat: independent research from March 2026 found that ~19.78% of patches labeled as "resolved" by top SWE-Bench agents are semantically incorrect. Real repositories have noisy dependencies and undocumented APIs that curated benchmarks cannot replicate.
The 9B Surprise: Agentic Coding on a Single GPU
The Ornith-1.0-9B variant (built on Gemma 4) scores 69.4% on SWE-Bench Verified — comfortably surpassing Gemma 4-31B (52.0%) despite being over three times smaller. On Terminal-Bench 2.1, it scores 43.1%, doubling Qwen 3.5-9B (21.3%) and matching Gemma 4-31B (42.1%). This puts capable agentic coding within reach of individual developers on consumer hardware.
The 35B MoE variant is equally impressive. Activating only ~3B parameters per token, it scores 64.2 on Terminal-Bench 2.1 — beating Qwen 3.5-397B (53.5), a model over ten times its total parameter count.
Fahd Mirza walks through running Ornith-1.0-9B locally — demonstrating how the compact variant delivers impressive agentic coding performance on consumer hardware.
Defending Against Reward Hacking
Letting a model write its own scaffold introduces a real safety concern: reward hacking. A model controlling its own evaluation could theoretically read test files or game the verifier. As detailed on the DeepReinforce blog, the team built three defence layers:
Fixed Outer Trust Boundary: The environment and test isolation are immutable. The model only evolves its inner scaffold.
Deterministic Monitor: Rule-based flagging bans reading withheld paths or modifying verification scripts — such trajectories receive zero reward.
Frozen LLM Judge (Veto): A separate frozen LLM monitors the allowed tool surface to catch subtle intent-level gaming.
These defences have not been independently validated in production, and the literature suggests reward hacking risk scales with agent capability. Still, this is the most comprehensive approach seen in open-source agentic training to date.
What This Means for the AI Coding Landscape
The open-source agentic coding race is becoming a workflow battle, not merely a benchmark competition. Ornith-1.0 shifts the target from handcrafted engineering to learned behaviour — and competitors must follow suit or fall behind in real-world coding performance.
With its MIT license and no regional restrictions, Ornith-1.0 is positioned for rapid adoption. The 9B variant already runs on consumer hardware, making it relevant to the growing ecosystem of AI development tools that developers integrate daily.
Watch for three developments in the coming quarter: independent scaffold tests on private monorepos, wider GGUF quantisation adoption on platforms like Ollama, and competing open projects adopting joint scaffold-and-solution RL training.
How to Get Started
All four Ornith-1.0 variants are available on Hugging Face under the deepreinforce-ai/ornith-10 collection. They support vLLM (>=0.19.1), SGLang (>=0.5.9), and Transformers (>=5.8.1) with an OpenAI-compatible endpoint. FP8 and GGUF quantised builds are available for all sizes.
Hero image: DeepReinforce AI / Ornith-1.0
Top comments (0)