Jovan Chan

Posted on • Originally published at runaihome.com

Open-Source LLM Shootout 2026: Qwen3.6 vs Gemma 4 vs Llama 4 vs GLM-5.1 vs DeepSeek V4 — Which Fits Your GPU?

TL;DR: Of the five open-weight families everyone is benchmarking in 2026, only two — Qwen3.6 and Gemma 4 — actually run well on a single consumer GPU. Llama 4 Maverick, GLM-5.1, and DeepSeek V4 Pro are server-class MoEs that need 200 GB+ of VRAM at usable quantization. The headline "five-way shootout" collapses to a two-horse race on the hardware most home labs own, with DeepSeek V4 Flash and Llama 4 Scout sitting in an awkward middle.

| | Qwen3.6 35B-A3B | Gemma 4 (12B–31B) | Llama 4 Scout 109B | DeepSeek V4 Flash 284B | GLM-5.1 754B |
|---|---|---|---|---|---|
| Best for | Single RTX 3090, speed | Quality per GB, 8–24 GB | Gray-zone, 24 GB at 1.78-bit | 2× RTX 4090 minimum | Server / API only |
| Min usable VRAM | ~23 GB (Q3) | 6.6 GB (12B Q4) → 24 GB (31B Q4) | ~24 GB (1.78-bit, degraded) | ~33 GB (heavy quant) | ~860 GB (FP8) |
| RTX 3090 speed | ~120 tok/s | 30–119 tok/s | ~20 tok/s | Not viable | Not viable |
| License | Apache 2.0 | Gemma custom | Llama 4 Community | MIT | MIT |

Honest take: If you have one 24 GB GPU, run Qwen3.6 35B-A3B for speed or Gemma 4 31B for reasoning depth, and pick by license and task, not by leaderboard rank. The server-class models (GLM-5.1 at 754B, DeepSeek V4 Pro at 1.6T, Llama 4 Maverick at 402B) are best used through an API, not built locally.

The five families, and what they actually are

Every "open-source LLM comparison" article in mid-2026 lists the same five names. The problem is that they treat a 12B dense model and a 1.6T MoE as if they belong in the same buying decision. They don't. Here is what each family ships, and the size of the gap.

Qwen3.6 (Alibaba). The Qwen3.6-35B-A3B dropped April 16, 2026 — a 35B-total MoE that activates just 3B parameters per token. There are also dense 8B and 14B members carried over from the Qwen3 line. Everything is Apache 2.0, which matters more than people admit. Qwen optimized this family explicitly for consumer hardware.

Gemma 4 (Google). A dense-and-MoE family spanning E2B up to a 31B dense flagship, plus a 26B-A4B MoE. Google's June 5 QAT (quantization-aware training) checkpoints cut VRAM roughly 72% versus BF16 with near-original quality, which is why a 26B now fits 16 GB. License is Google's custom Gemma terms — permissive for most use, but not OSI-approved.

Llama 4 (Meta). Scout is 109B total / 17B active; Maverick is 402B total / 17B active. Both are MoE, and both load all parameters into memory regardless of how few activate per token — which is the trap. The Llama 4 Community License carries a 700M-MAU clause, a "Built with Llama" attribution requirement, and a ban on multimodal use by entities domiciled in the EU.

DeepSeek V4. Two checkpoints: V4 Flash (284B total / 13B active) and V4 Pro (1.6T total / ~49B active), released April 2026 under MIT. Flash is the only one with any home-lab pathway.

GLM-5.1 (Z.ai). A 754B open-weight MoE under MIT, released April 7, 2026. It matches frontier closed models on coding benchmarks — and needs roughly 860 GB of VRAM at FP8, i.e. 8× H200. It is on this list because it is excellent, not because you can run it.
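The "all parameters load, however few activate" point from the Llama 4 paragraph is easy to check with arithmetic. Here is a minimal sketch, assuming weights dominate memory and that size is simply total parameters times bits per weight; the function names are illustrative, and KV cache and runtime overhead are deliberately left out, so real requirements run higher.

```python
def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    total_params_b is the TOTAL parameter count in billions, not the
    active count. Billions of parameters * bits / 8 gives gigabytes.
    """
    return total_params_b * bits_per_weight / 8


def weights_fit(total_params_b: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit; this ignores KV cache and overhead."""
    return weight_gb(total_params_b, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    # Llama 4 Scout: 109B total, 17B active (figures from this article).
    print(weight_gb(109, 1.78))  # sub-2-bit dynamic quant, about 24 GB
    print(weight_gb(109, 4))     # 54.5 GB at plain 4-bit
    print(weight_gb(17, 4))      # 8.5 GB: the mistaken "17B" expectation
    # Llama 4 Maverick: 402B total at INT4.
    print(weight_gb(402, 4))     # 201.0 GB
```

The same formula lands close to the other figures quoted here: a 12B model at roughly 4.4 bits per weight comes out at 6.6 GB, and GLM-5.1 at 8 bits is 754 GB of weights before any cache or overhead.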

Which family fits which GPU

This is the only question that matters for a home lab. Here is the decision by the VRAM you actually have.

8 GB (RTX 3060, RTX 4060, RTX 5060)

Your only real options are small dense models. Gemma 4 12B at Q4 needs about 6.6 GB of weights — it fits, but with little headroom for context, so keep num_ctx modest. A Qwen3 8B at Q4 sits around 5 GB and leaves more room. Llama 4, GLM-5.1, and DeepSeek V4 are all out at this tier — not "slow," but flatly impossible without spilling to system RAM and crawling.
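If you serve the model through Ollama, `num_ctx` can be set per request in the `options` field of its local REST API. A minimal sketch, assuming Ollama is listening on its default port; the helper name and the model tag are placeholders, so substitute whatever tag you actually pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request with a capped context."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # smaller context means a smaller KV cache
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "gemma4:12b" is a placeholder tag, not a confirmed model name.
    req = build_generate_request("gemma4:12b", "Say hello.", num_ctx=4096)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req), then read the "response" field.
```

On an 8 GB card, start with a small value such as 4096 and raise it only while the model stays fully on the GPU.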

16 GB (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 5070 Ti)

This is where Gemma 4's MoE shines. The 26B-A4B at Q4_K_M lands right at 16 GB and, because only ~4B parameters activate per token, it generates fast — 64–119 tok/s on a 24 GB RTX 3090, and comfortably usable on a 16 GB card if you trim context. Qwen3 14B dense also fits here. Qwen3.6 30B/35B-A3B at Q4 wants ~17 GB, so it's right on the edge — doable at low context, more comfortable at 24 GB.

24 GB (RTX 3090, RTX 4090)

The sweet spot, and where the shootout gets interesting:

  • Qwen3.6 35B-A3B: with Unsloth's Q3 quant it takes ~23 GB and runs at ~120 tok/s on an RTX 3090. This is the fastest capable model you can run on one consumer card.
  • Gemma 4 31B dense at Q4_K_M fills 24 GB and runs ~30–34 tok/s on the 3090 — slower, but a dense 31B scoring 87.1% MMLU is a different quality tier than a 3B-active MoE.
  • Gemma 4 26B-A4B at Q8 also fits 24 GB if you want quality with speed.
  • Llama 4 Scout technically fits via Unsloth's 1.78-bit dynamic quant at ~20 tok/s — but a 1.78-bit 109B model is a curiosity, not a daily driver. The quality you lose at that bit depth erases the point of running Scout over Qwen3.6.

On an RTX 4090, the same models run faster on the small end (Qwen3 8B Q4 hits ~104 tok/s, 14B ~69 tok/s), but the 24 GB ceiling is identical to the 3090, so it changes speed, not which models fit.

Beyond 24 GB (multi-GPU / server)

DeepSeek V4 Flash needs ~33 GB even when heavily quantized, which realistically means 2× RTX 4090 or a single RTX 6000-class card. Llama 4 Maverick at INT4 wants ~200 GB (4× H100). GLM-5.1 and DeepSeek V4 Pro are 8-GPU server territory. For these, renting a cloud GPU from a provider such as RunPod is far cheaper than assembling an 8× H200 box you'll use occasionally.
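The tiers above boil down to a lookup. Here is a small sketch that encodes the approximate VRAM figures quoted in this article and filters them by what your card has; `candidates` and `reserve_gb` are illustrative names, and the numbers are the article's rough estimates, not measurements.

```python
# (model and quant, approximate VRAM in GB) as quoted in the tiers above
MODELS = [
    ("Qwen3 8B Q4", 5.0),
    ("Gemma 4 12B Q4", 6.6),
    ("Gemma 4 26B-A4B Q4_K_M", 16.0),
    ("Qwen3.6 35B-A3B Q3 (Unsloth)", 23.0),
    ("Gemma 4 31B Q4_K_M", 24.0),
    ("Llama 4 Scout 1.78-bit", 24.0),
    ("DeepSeek V4 Flash (heavy quant)", 33.0),
    ("Llama 4 Maverick INT4", 200.0),
    ("GLM-5.1 FP8", 860.0),
]


def candidates(vram_gb: float, reserve_gb: float = 0.0) -> list[str]:
    """Return models whose quoted footprint fits in vram_gb minus a reserve.

    reserve_gb is a hypothetical knob for context and desktop overhead.
    """
    budget = vram_gb - reserve_gb
    return [name for name, need_gb in MODELS if need_gb <= budget]


if __name__ == "__main__":
    for vram in (8, 16, 24, 48):
        print(vram, "GB:", ", ".join(candidates(vram)))
```

Passing a non-zero `reserve_gb` is the quickest way to see which entries sit right on the edge of a tier: with a 2 GB reserve, a 24 GB card drops everything quoted at 23 GB or more.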

Speed vs. quality: the real trade-off at 24 GB

The MoE-vs-dense split is the whole story on a single 24 GB card. Qwen3.6 35B-A3B activates 3B parameters per token, so it feels like running a 3B model — ~120 tok/s — while having the knowledge of a 35B. Gemma 4 31B is dense: every one of its 31B parameters fires on every token, so you get ~30 tok/s but denser reasoning per token. Neither is "better"; they're different shapes.

For interactive chat and agentic loops where latency compounds, Qwen3.6's 120 tok/s wins. For one-shot reasoning, summarization, or code review where you read the output once, Gemma 4 31B's depth is worth the slower stream. We go deeper on the two-model decision in our Qwen3.6 35B-A3B guide and the Gemma 4 QAT update.
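To see how latency compounds, here is a back-of-the-envelope sketch. The workload (10 agent steps of 500 generated tokens each) is a made-up example, and the model ignores prompt processing and tool-call time, so treat the outputs as lower bounds on wall-clock time.

```python
def stream_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` at a steady decode rate."""
    return tokens / tok_per_s


def loop_seconds(steps: int, tokens_per_step: int, tok_per_s: float) -> float:
    """Total generation time for a sequential agent loop."""
    return steps * stream_seconds(tokens_per_step, tok_per_s)


if __name__ == "__main__":
    steps, tokens_per_step = 10, 500  # hypothetical agentic workload
    for name, rate in (("Qwen3.6 35B-A3B", 120.0), ("Gemma 4 31B", 30.0)):
        total = loop_seconds(steps, tokens_per_step, rate)
        print(f"{name}: {total:.0f} s of generation")
```

At the speeds quoted above, the same loop takes roughly 42 seconds on Qwen3.6 and roughly 167 seconds on Gemma 4 31B. A single 500-token answer, by contrast, differs by only about 12 seconds, which is why the slower model is easier to justify for one-shot work.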

Licenses: the column most comparisons skip

For a home lab tinkering on a side project, license rarely bites. For anyone building a product, it's decisive.

| Family | License | Commercial use | Catch |
|---|---|---|---|
| Qwen3.6 | Apache 2.0 | Yes, unrestricted | None |
| DeepSeek V4 | MIT | Yes, unrestricted | None |
| GLM-5.1 | MIT | Yes, unrestricted | None |
| Gemma 4 | Gemma custom | Yes | Use restrictions, not OSI-approved |
| Llama 4 | Community | Yes, with limits | 700M-MAU clause, "Built with Llama" badge, EU multimodal ban |

The irony of the 2026 landscape: the two model families you can actually run locally (Qwen3.6 and Gemma 4) sit at opposite ends on licensing. Qwen3.6 is the cleaner (Apache 2.0) and Gemma 4 the more encumbered of the two. If license cleanliness matters to you and you want speed, Qwen3.6 is the unambiguous pick.

The gotcha that wastes an afternoon

The most common failure we see: someone reads "Llama 4 Scout is only 17B active" and runs ollama pull expecting a 17B-class memory footprint. Then they hit CUDA error: out of memory on a 24 GB card. MoE models must hold all 109B parameters in VRAM even though 17B activate per token — the "active" number describes compute, not memory. The fix isn't a smaller context; it's accepting that Scout needs either a sub-2-bit dynamic quant (degraded) or more than one GPU. If you hit OOM mid-load on any of these MoEs, check total parameter count, not active count, against your VRAM. (Our full [CUDA OOM fixes](/blog/cuda-out-of-memory