Every week or two, a model drops that makes the local AI community lose its collective mind. This week it was three at once: DeepSeek V4-Pro, DeepSeek V4-Flash, and Zyphra ZAYA1-8B. All three are genuinely impressive. All three are models I wanted to benchmark on our homelab. And after doing the research, I'm not testing any of them.
Not because I don't want to. Because I physically can't — or can't yet.
This post isn't a benchmark. It's the research that happens before the benchmark, where you figure out which models are even candidates for your hardware. If you're building or considering a local inference setup, the reasons these three models don't work are more instructive than any leaderboard score.
The Rig
Quick refresher on what we're working with:
| Resource | Spec |
|---|---|
| GPU | NVIDIA RTX 5090 — 32 GB VRAM |
| RAM | 64 GB DDR5 |
| CPU | AMD Ryzen 9 9950X3D — 16 cores / 32 threads |
| Disk | 1.8 TB NVMe |
| Inference | llama.cpp on the GPU |
This is a strong homelab by any measure. We run Qwen 3.5 35B-A3B daily for agentic coding at 200+ tok/s. In previous benchmark rounds, Devstral, Codestral, Gemma 4, and DeepSeek R1 14B have all run comfortably. The 5090 is the sweet spot for 20B–35B models.
But the new generation of models isn't playing in the 20B–35B range anymore.
DeepSeek V4-Pro: Too Big for Anything Short of a Data Center
V4-Pro is DeepSeek's new flagship. The numbers are staggering:
| Spec | Value |
|---|---|
| Total parameters | 1.6 trillion |
| Activated per token | 49B (MoE, 256 experts, top-6 routing) |
| Model weights (FP4+FP8 mixed) | 805 GB on disk |
| Context window | 1M tokens |
That 805 GB number is the wall. Our entire system — 32 GB VRAM plus 64 GB RAM — gives us 96 GB of addressable memory. The model is 8.4x larger than our total memory. There are no GGUF quants available, and nobody is making them because there's no consumer hardware that could run them meaningfully.
For context, we tried running Kimi K2.6 (a similarly sized 1T MoE model) a few weeks ago. It "ran" at less than 1 token per second — the weights spilled out of VRAM into RAM, and we hit the DDR5 memory bandwidth ceiling (~80 GB/s vs the 5090's ~1.8 TB/s). V4-Pro at 1.6T would be even slower.
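A rough sanity check makes the slowdown concrete: once the weights spill out of VRAM, decoding is bound by whatever bandwidth feeds the activated experts each token. Here's a minimal sketch using the post's own numbers; it ignores KV cache traffic, routing overhead, and PCIe hops, so treat it as an optimistic ceiling rather than a prediction.

```python
# Back-of-envelope decode speed for a big MoE that has spilled out of VRAM.
# Rough model: every generated token has to stream the activated expert
# weights from memory, so tok/s ~ bandwidth / bytes read per token.

TOTAL_PARAMS_B = 1600       # 1.6T total parameters
ACTIVE_PARAMS_B = 49        # activated per token
WEIGHTS_ON_DISK_GB = 805    # FP4+FP8 mixed

bytes_per_param = WEIGHTS_ON_DISK_GB / TOTAL_PARAMS_B   # ~0.5 bytes/param average
gb_per_token = ACTIVE_PARAMS_B * bytes_per_param         # ~25 GB streamed per token

print(f"All in VRAM (1.8 TB/s):    ~{1800 / gb_per_token:.0f} tok/s")   # ~73
print(f"Spilled to DDR5 (80 GB/s): ~{80 / gb_per_token:.1f} tok/s")     # ~3.2, a ceiling, not reality
```

Even the rosiest DDR5-side number is a single-digit ceiling, and the Kimi K2.6 experiment shows real-world throughput lands well below it.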
Verdict: Cloud API only. DeepSeek serves it at api.deepseek.com and we've added it to our benchmark rig as a cloud provider alongside Anthropic.
DeepSeek V4-Flash: Close, But Not Close Enough
V4-Flash is V4-Pro's smaller sibling and the one I was actually hopeful about:
| Spec | Value |
|---|---|
| Total parameters | 284B |
| Activated per token | 13B (MoE, 256 experts, top-6 routing) |
| Smallest GGUF quant (Q2_K) | 96.2 GB |
| Most popular quant (Q4_K_M) | 160.2 GB |
| Context window | 1M tokens |
Only 13B activated per token sounds incredible — that's smaller than our DeepSeek R1 14B. But MoE models need all of their expert weights resident in memory even though only a fraction fires per token; all 284B parameters have to sit somewhere the inference engine can reach at memory speed.
The math doesn't work (a quick fit check follows the table):
| Quant | Size | Fits in VRAM + RAM (96 GB)? |
|---|---|---|
| Q2_K | 96.2 GB | No — 0.2 GB over before KV cache |
| Q3_K_M | 126.2 GB | No — needs 30 GB disk offload |
| Q4_K_M | 160.2 GB | No — needs 64 GB disk offload |
| FP4-FP8 native | 145.4 GB | No — needs 49 GB disk offload |
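The fit check behind that table is trivial to reproduce; it's just quant size against raw VRAM + RAM, before any KV cache or runtime overhead, which is how the table counts it:

```python
# Fit check for V4-Flash quants: weight size vs total addressable memory.
VRAM_GB, RAM_GB = 32, 64
BUDGET_GB = VRAM_GB + RAM_GB  # 96 GB, before KV cache and runtime overhead

quant_sizes_gb = {  # sizes from the table above
    "Q2_K": 96.2,
    "Q3_K_M": 126.2,
    "Q4_K_M": 160.2,
    "FP4-FP8 native": 145.4,
}

for name, size_gb in quant_sizes_gb.items():
    overflow_gb = size_gb - BUDGET_GB
    if overflow_gb <= 0:
        print(f"{name:>16}: {size_gb:6.1f} GB -> fits, before KV cache")
    else:
        print(f"{name:>16}: {size_gb:6.1f} GB -> needs {overflow_gb:.1f} GB of disk offload")
```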
There were IQ1_S (54 GB) and IQ2_M (87 GB) quants that would have fit — but the community removed them. When quant maintainers pull their own files, that's a strong signal the output quality was garbage.
And even if one of these squeaked into memory, there's a bigger problem: llama.cpp doesn't support the DeepSeek V4 architecture yet. All existing GGUFs require custom forks. The mainline support PRs are still open and under active debate. You'd be building from an untested branch to run a model that barely fits.
Verdict: Not ready. We've added V4-Flash to the benchmark as a cloud API model for now. When llama.cpp merges V4 support and a viable sub-90 GB quant exists, we'll revisit.
ZAYA1-8B: The Right Size, the Wrong Stack
This is the one that hurts the most, because on paper it's a perfect homelab model:
| Spec | Value |
|---|---|
| Total parameters | 8.4B |
| Activated per token | 760M (MoE, 16 experts, top-1 routing) |
| VRAM at bf16 | ~17 GB |
| Context window | 128K tokens |
| AIME '26 score | 89.1 |
8.4 billion parameters. 17 GB in bf16. Fits trivially on the 5090 with room to spare. Punches absurdly above its weight on reasoning benchmarks — 89.1 on AIME '26 is competitive with models 10–15x its size.
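The sizing arithmetic is simple enough to do in your head, but here it is anyway. No GGUF quants of ZAYA1 actually exist yet, so the quant rows are hypothetical, and the bytes-per-parameter figures are approximations (real GGUF formats add per-block scale overhead):

```python
# Weight-size arithmetic for ZAYA1-8B at different precisions.
PARAMS_B = 8.4  # billions of parameters

for fmt, bytes_per_param in [("bf16", 2.0), ("Q8_0 (approx)", 1.07), ("Q4_K_M (approx)", 0.57)]:
    print(f"{fmt:>16}: ~{PARAMS_B * bytes_per_param:.1f} GB of weights")

# bf16 lands around 16.8 GB: comfortably inside 32 GB of VRAM, with room
# left over for long-context state.
```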
So what's the problem? Architecture.
ZAYA1 uses CCA (Compressed Convolutional Attention) — Zyphra's novel hybrid of Mamba-style recurrence and traditional attention. It's not standard Mamba2. It's not standard transformer attention. It's a fundamentally new layer type with small 1D convolutions, custom Q/K projections, and learned residual scaling.
llama.cpp has no support for this architecture. There's an open feature request with nothing but +1 comments. No GGUF quants exist because there's nothing to run them on. Even Zyphra's older Zamba2 architecture (#21412) remains unimplemented.
The only way to run ZAYA1 today is through Zyphra's custom vLLM fork — a completely different serving stack from our llama.cpp setup. It would work on the 5090, but it means standing up and maintaining a parallel inference pipeline.
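For reference, here's roughly what that parallel pipeline would look like using vLLM's standard offline API. This is a sketch, not something we've run: ZAYA1 needs Zyphra's fork installed in place of stock vLLM, and the model identifier below is a placeholder rather than a verified repo name.

```python
# Sketch of a second, vLLM-based serving backend (standard vLLM offline API).
from vllm import LLM, SamplingParams

llm = LLM(model="Zyphra/ZAYA1-8B")  # hypothetical Hugging Face repo id
sampling = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(["Write a function that merges two sorted lists."], sampling)
print(outputs[0].outputs[0].text)
```

A handful of lines, but it's a second Python environment, a second set of GPU memory settings, and a second thing to keep patched — which is exactly why it's on the to-do list rather than done.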
Verdict: On the to-do list. When llama.cpp adds CCA support or we carve out time to set up vLLM as a second serving backend, this is the first model we'll test.
What Actually Runs on a 32 GB GPU
Here's the uncomfortable reality of local inference in mid-2026: the models generating the most hype are the ones you can't run.
The models that fly on a 32 GB card — where you get 100+ tok/s and useful agentic performance — are capped at roughly 24–28 GB of weights (leaving room for KV cache). That means:
| Category | What Fits |
|---|---|
| Dense models | Up to ~14B at Q8, ~20B at Q6, ~27B at Q4 |
| MoE models | Up to ~35B total at Q4 (e.g. Qwen 3.5 35B-A3B) |
| What doesn't | Anything over ~28 GB of quantized weights |
Our current daily driver — Qwen 3.5 35B-A3B at Q4_K_XL — is 22 GB of weights with 3B activated per token, running at 200+ tok/s. It's fast, it's good, and it's approximately the ceiling of what a single 5090 can do at interactive speeds.
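Where does the 24–28 GB ceiling come from? Mostly the KV cache. A rough estimate, with illustrative layer and head counts rather than any specific model's config:

```python
# Rough KV-cache sizing, to show why a "32 GB" card only leaves ~24-28 GB
# for weights. GQA models keep far fewer KV heads than attention heads,
# which is what makes long context affordable at all on a single card.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, fp16/bf16 cache by default
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

VRAM_GB = 32
cache = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, context_len=32_768)
print(f"KV cache at 32K context: ~{cache:.1f} GB")            # ~6.4 GB
print(f"Weight budget left:      ~{VRAM_GB - cache:.1f} GB")  # ~25.6 GB
```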
The Three Walls
Each of these models hits a different wall, and that's what makes this exercise useful:
V4-Pro — pure size. 805 GB of weights. No amount of quantization or clever offloading helps when the model is 8x your total memory.
V4-Flash — the quantization gap. The model almost fits at extreme compression, but the quality degrades too far. We're in a window where the model exists but the tooling hasn't caught up to make it practical on consumer hardware.
ZAYA1 — architecture support. The model fits perfectly. The hardware is more than enough. But the inference engine doesn't speak the language yet.
If you're evaluating models for a homelab or edge deployment, these are the three questions to ask before you even think about benchmarks: Is it small enough? Is the quantization viable? Does my inference stack support it?
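If it helps, here are those three questions as a tiny triage helper. The thresholds are this rig's numbers and our own judgment calls, not universal constants:

```python
# Pre-benchmark triage: size, quantization, architecture support -- in that order.

def can_we_even_run_it(weights_gb: float, quant_is_usable: bool,
                       arch_supported: bool,
                       vram_gb: float = 32, ram_gb: float = 64) -> str:
    if weights_gb > vram_gb + ram_gb:
        return "No: bigger than total system memory (the V4-Pro wall)"
    if not quant_is_usable:
        return "No: the quants that fit are too degraded (the V4-Flash wall)"
    if not arch_supported:
        return "No: the inference stack doesn't speak the architecture (the ZAYA1 wall)"
    if weights_gb > vram_gb:
        return "Maybe: fits only with CPU offload, expect a big speed hit"
    return "Yes: benchmark it"

print(can_we_even_run_it(805.0, quant_is_usable=False, arch_supported=False))  # V4-Pro
print(can_we_even_run_it(87.0,  quant_is_usable=False, arch_supported=False))  # V4-Flash IQ2_M (pulled)
print(can_we_even_run_it(17.0,  quant_is_usable=True,  arch_supported=False))  # ZAYA1-8B bf16
print(can_we_even_run_it(22.0,  quant_is_usable=True,  arch_supported=True))   # Qwen 3.5 35B-A3B Q4_K_XL
```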
By the Numbers
- 805 GB — DeepSeek V4-Pro model weight size. 8.4x our total system memory.
- 96.2 GB — smallest V4-Flash GGUF quant. Still 0.2 GB over our VRAM + RAM.
- 17 GB — ZAYA1-8B at bf16. Fits trivially, runs nowhere (yet).
- 22 GB — our actual daily driver (Qwen 3.5 35B-A3B at Q4_K_XL). The real ceiling.
- 0 — number of these three models with merged llama.cpp support.
- 2 — models we added to the benchmark as cloud API endpoints instead (V4-Flash, V4-Pro).
Top comments (1)
The 24–28GB usable ceiling on a "32GB" card is the number nobody puts on the box. People budget for the model weights and forget that KV cache at any reasonable context length easily takes another several GB, plus whatever the inference runtime keeps resident.
The V4-Flash compression-degradation point is the one I'd push hardest on. There's a real perception problem in the open-weights community where Q3/Q2 quants of a flagship model get benchmarked against Q4/Q5 quants of a smaller model and "win" on MMLU but fall apart on anything requiring multi-step reasoning. The benchmarks aren't catching the degradation modes that matter for agentic use.
ZAYA1's architecture story is the more interesting medium-term issue — every novel attention variant means another 6-12 months before llama.cpp / vLLM / MLX have a tuned kernel for it. The "we can't run it" window is real even when the hardware would technically fit, and it keeps getting longer as architectures diverge.