This article was originally published on runaihome.com
TL;DR: Ornith-1.0 is DeepReinforce's new MIT-licensed coding family — 9B Dense, 31B Dense, 35B MoE, and 397B MoE, post-trained on Gemma 4 and Qwen 3.5. The home-lab pick is the 35B MoE: ~3B active parameters per token make it fast, and the Q4_K_M GGUF is 21.2 GB, so it just fits a single 24 GB card. The catch: 21.2 GB on a 24 GB GPU leaves almost no room for long context.
| 9B Dense | 35B MoE | 397B MoE | |
|---|---|---|---|
| Best for | 8–12 GB cards | The 24 GB sweet spot | Cloud / API only |
| Q4 size | ~6 GB | 21.2 GB (Q4_K_M) | ~225 GB+ |
| Active params | 9B (dense) | ~3B per token | ~? per token |
| Runs on a single consumer GPU? | Yes, easily | Yes, on 24 GB | No |
| The catch | Weakest of the family | No headroom for 256K context | No card holds it |
Honest take: If you have a 24 GB card, grab the 35B MoE Q4_K_M — it's the rare model that gives you MoE speed and a license you can actually ship a product on. If you're on 8–16 GB, run the 9B and keep your expectations modest. The 397B is an API model; don't try to buy hardware for it.
The local-AI release calendar has been relentless this month, but Ornith-1.0 is worth stopping for — not because it's the biggest, but because it lands the two things home-labbers actually ask for: a permissive license and a variant that runs fast on a card you already own.
What Ornith-1.0 actually is
DeepReinforce released the Ornith-1.0 family on June 25, 2026, under the MIT license with no regional restrictions — every checkpoint, including the GGUF and FP8 builds, ships under that license on Hugging Face. That alone separates it from a lot of "open weight" releases that bolt on usage clauses or research-only terms.
The family spans four checkpoints, all post-trained on Gemma 4 and Qwen 3.5 bases:
- Ornith-1.0-9B — dense, edge/resource-constrained target
- Ornith-1.0-31B — dense
- Ornith-1.0-35B — sparse Mixture-of-Experts
- Ornith-1.0-397B — flagship MoE
The headline feature is the training method, not the size. Ornith is a self-scaffolding model: during reinforcement learning it learns to write its own harness — the tool-use loop, the test scaffold — and jointly optimizes that scaffold alongside the code it produces. It's also reasoning-first: each assistant turn opens with a chain-of-thought block, and the serving stack returns that reasoning in a separate field from the final answer. For a coding agent, that's the right shape: it plans, then acts.
If you've read our piece on why local LLMs got good in 2026, this is the same story playing out — sparse activation plus better post-training, not raw parameter count, is what's closing the gap.
The benchmarks (and where to be skeptical)
The vendor numbers are strong, and a few are striking enough to flag as vendor-reported until independent runs land:
- Ornith-1.0-397B: 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1. DeepReinforce positions this above Claude Opus 4.7; for context, Claude Opus 4.8 sits at 87.6 on SWE-Bench Verified, so the 397B trails only the very top of the closed-model field on that test.
- Ornith-1.0-35B MoE: 64.2 on Terminal-Bench 2.1 — above Qwen 3.5-397B's 53.5, a model with more than ten times the total parameter count. If that holds up under independent testing, it's the most interesting result in the release.
- Ornith-1.0-9B Dense: 43.1 on Terminal-Bench 2.1, essentially matching Gemma 4-31B's 42.1.
A 35B MoE beating a 397B dense model on an agentic benchmark is exactly the kind of claim that needs third-party confirmation — vendor benchmark suites tend to flatter the home team. Treat these as a reason to try the model, not as settled fact. We've taken the same cautious line on every fresh-drop coding model, from Kimi K2.7 to Qwen3-Coder-Next.
Which GPU runs which variant
This is the part that matters for your wallet. Here's how each variant maps to real hardware.
Ornith-1.0-9B — for 8 GB to 16 GB cards
The 9B dense weights are about 6 GB at Q4 quantization and roughly 19 GB in BF16. At Q4_K_M it runs comfortably on 6–8 GB of VRAM, which means it's the variant for an RTX 3060 12GB, an RTX 4060 Ti 8GB or 16GB, or even an older 8 GB card. With a 16 GB card you can move up to Q6_K or Q8_0 and still leave plenty of room for context.
Being a dense 9B, it's bandwidth-bound — generation speed scales with your card's memory bandwidth, not its FLOPS. On a modern 16 GB card you'll get interactive speeds, but understand the trade-off: this is the weakest member of the family. It's a capable local autocomplete and small-task assistant, not a replacement for a frontier agent.
Ornith-1.0-35B MoE — the 24 GB sweet spot
This is the one to care about. The 35B MoE is a sparse model with 256 routed experts, 8 active per token plus a shared expert, across 40 layers, activating roughly 3B parameters per token. That architecture is the whole point: all 35B weights have to sit in VRAM, but only ~3B are read per token, so it generates far faster than a dense model of the same footprint.
The official GGUF sizes:
| Quant | Size | Fits |
|---|---|---|
| Q4_K_M | 21.2 GB | 24 GB card (tight) |
| Q5_K_M | 24.7 GB | 32 GB card |
| Q6_K | 28.5 GB | 32 GB card |
| Q8_0 | 36.9 GB | 48 GB+ / multi-GPU |
The practical read: Q4_K_M at 21.2 GB fits a single 24 GB GPU, but barely. That leaves under 3 GB for the KV cache and runtime overhead. You'll run it fine at 8K–16K context; the model's full 256K context window is a cloud-serving figure, not something you'll reach on 24 GB. If you want real context headroom, you want a 32 GB RTX 5090 and the Q4_K_M, or you accept short context on 24 GB.
What about speed? We don't have independent tokens/sec measurements for Ornith yet, so we won't invent one. But we do have a measured reference point on this exact site: Nemotron-Cascade 2 is a ~3B-active MoE (30B-A3B) that hits 187 tok/s on a used RTX 3090 at a comparable quant. Ornith-1.0-35B has near-identical active-parameter math, so expect it to land in the same neighborhood on the same hardware — fast enough for genuinely interactive agentic coding. We'll update this with real numbers once community benchmarks are out. For more on how 3B-active MoE compares to dense models at the same VRAM, see our Qwen 3.6 35B-A3B guide.
The best-value card for this remains the used RTX 3090: 24 GB, 936 GB/s of bandwidth, and a used average around $1,070 as of June 2026. The RTX 4090 is faster but costs roughly twice as much used for the same 24 GB ceiling.
Ornith-1.0-31B Dense — capable, but awkward
The 31B dense variant sits in an odd spot. As a dense 31B it needs a similar VRAM footprint to the 35B MoE at the same quant (call it ~18–20 GB at Q4), but because it's dense it activates all 31B parameters per token — so it'll be meaningfully slower than the 35B MoE while taking up about the same space. Unless you have a specific reason to prefer a dense model's behavior, the 35B MoE is the better pick on identical hardware. This is the same dense-vs-MoE trade we walked through in the Codestral 2 guide.
Ornith-1.0-397B MoE — rent it, don't buy for it
The flagship is not a home-lab model. The FP8 checkpoint alone is on the order of 225 GB+, well beyond any single consumer GPU and beyond most multi-GPU home builds. If you want to use the 397B, the sane paths are the
Top comments (0)