DEV Community

Rob

Posted on • Originally published at vibescoder.dev

Thursday Thoughts: The Models We Can't Run

Every week or two, a model drops that makes the local AI community lose its collective mind. This week it was three at once: DeepSeek V4-Pro, DeepSeek V4-Flash, and Zyphra ZAYA1-8B. All three are genuinely impressive. All three are models I wanted to benchmark on our homelab. And after doing the research, I'm not testing any of them.

Not because I don't want to. Because I physically can't — or can't yet.

This post isn't a benchmark. It's the research that happens before the benchmark, where you figure out which models are even candidates for your hardware. If you're building or considering a local inference setup, the reasons these three models don't work are more instructive than any leaderboard score.

The Rig

Quick refresher on what we're working with:

| Resource | Spec |
| --- | --- |
| GPU | NVIDIA RTX 5090 — 32 GB VRAM |
| RAM | 64 GB DDR5 |
| CPU | AMD Ryzen 9 9950X3D — 16 cores / 32 threads |
| Disk | 1.8 TB NVMe |
| Inference | llama.cpp on the GPU |

This is a strong homelab by any measure. We run Qwen 3.5 35B-A3B daily for agentic coding at 200+ tok/s. In previous benchmark rounds, Devstral, Codestral, Gemma 4, and DeepSeek R1 14B have all run comfortably. The 5090 is the sweet spot for 20B–35B models.

But the new generation of models isn't playing in the 20B–35B range anymore.

DeepSeek V4-Pro: Too Big for Anything Short of a Data Center

V4-Pro is DeepSeek's new flagship. The numbers are staggering:

| Spec | Value |
| --- | --- |
| Total parameters | 1.6 trillion |
| Activated per token | 49B (MoE, 256 experts, top-6 routing) |
| Model weights (FP4+FP8 mixed) | 805 GB on disk |
| Context window | 1M tokens |

That 805 GB number is the wall. Our entire system — 32 GB VRAM plus 64 GB RAM — gives us 96 GB of addressable memory. The model is 8.4x larger than our total memory. There are no GGUF quants available, and nobody is making them because there's no consumer hardware that could run them meaningfully.

For context, we tried running Kimi K2.6 (a similarly-sized 1T MoE model) a few weeks ago. It "ran" at less than 1 token per second — the weights spilled out of VRAM into RAM, and we hit the DDR5 memory bandwidth ceiling (~80 GB/s vs the 5090's ~1.8 TB/s). V4-Pro at 1.6T would be even slower.
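The arithmetic behind that experience is simple enough to sketch. Here's the back-of-envelope version in Python, using the numbers from this post; the `fits` and `tok_s_ceiling` helpers are illustrative, not a real tool, and the throughput formula is a crude upper bound that ignores compute and runtime overhead:

```python
# Back-of-envelope: does a model fit, and what does spilling to RAM cost?
# Hardware numbers are from the rig table above.

VRAM_GB = 32        # RTX 5090
RAM_GB = 64         # DDR5
GPU_BW_GBS = 1800   # ~1.8 TB/s GDDR7
RAM_BW_GBS = 80     # ~80 GB/s dual-channel DDR5

def fits(weights_gb: float) -> bool:
    """True if the weights fit in VRAM + RAM (ignoring KV cache)."""
    return weights_gb <= VRAM_GB + RAM_GB

def tok_s_ceiling(active_gb_per_token: float, spills_to_ram: bool) -> float:
    """Crude upper bound: every activated weight is read once per token,
    so throughput is capped by the bandwidth of the slowest tier it lives in."""
    bw = RAM_BW_GBS if spills_to_ram else GPU_BW_GBS
    return bw / active_gb_per_token

# V4-Pro: 805 GB of weights vs 96 GB of addressable memory
print(fits(805))  # False

# 49B active params at ~0.5 bytes/param (FP4) is ~24.5 GB touched per token.
# If that lives in DDR5, the ceiling is ~3 tok/s before any other overhead.
print(round(tok_s_ceiling(24.5, spills_to_ram=True), 1))
```

In practice the observed number is worse than the ceiling (as with Kimi K2.6's sub-1 tok/s), because weights also page across the PCIe link and the runtime isn't reading each byte exactly once.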

Verdict: Cloud API only. DeepSeek serves it at api.deepseek.com and we've added it to our benchmark rig as a cloud provider alongside Anthropic.

DeepSeek V4-Flash: Close, But Not Close Enough

V4-Flash is V4-Pro's smaller sibling and the one I was actually hopeful about:

| Spec | Value |
| --- | --- |
| Total parameters | 284B |
| Activated per token | 13B (MoE, 256 experts, top-6 routing) |
| Smallest GGUF quant (Q2_K) | 96.2 GB |
| Most popular quant (Q4_K_M) | 160.2 GB |
| Context window | 1M tokens |

Only 13B activated per token sounds incredible — that's smaller than our DeepSeek R1 14B. But MoE models need all their expert weights resident in memory even though only a fraction fires per token. That 284B of total parameters has to be somewhere accessible.
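To make the active-vs-resident distinction concrete, here's a rough sketch. The ~4.5 bits/weight figure is an approximation of Q4_K_M's average, and `gguf_size_gb` is a made-up helper, not anything from llama.cpp:

```python
# Why "13B activated" doesn't mean "13B of memory": the router picks
# experts per token, but any expert can fire, so ALL expert weights
# must stay resident. Rough Q4_K_M math at ~4.5 bits per weight:

BYTES_PER_PARAM_Q4 = 4.5 / 8

def gguf_size_gb(total_params_b: float) -> float:
    """Approximate Q4_K_M file size for total_params_b billion parameters."""
    return total_params_b * 1e9 * BYTES_PER_PARAM_Q4 / 1e9

# If only the active weights counted, V4-Flash would be tiny:
print(round(gguf_size_gb(13), 1))   # ~7.3 GB

# But all 284B must be resident:
print(round(gguf_size_gb(284), 1))  # ~159.8 GB
```

The ~159.8 GB estimate lands right next to the actual 160.2 GB Q4_K_M file, which is a decent sanity check on the approximation.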

The math doesn't work:

| Quant | Size | Fits in VRAM + RAM (96 GB)? |
| --- | --- | --- |
| Q2_K | 96.2 GB | No — 0.2 GB over, even before KV cache |
| Q3_K_M | 126.2 GB | No — needs 30 GB disk offload |
| Q4_K_M | 160.2 GB | No — needs 64 GB disk offload |
| FP4-FP8 native | 145.4 GB | No — needs 49 GB disk offload |

There were IQ1_S (54 GB) and IQ2_M (87 GB) quants that would have fit — but the community removed them. When quant maintainers pull their own files, that's a strong signal the output quality was garbage.

And even if one of these squeaked into memory, there's a bigger problem: llama.cpp doesn't support the DeepSeek V4 architecture yet. All existing GGUFs require custom forks. The mainline support PRs are still open and under active debate. You'd be building from an untested branch to run a model that barely fits.

Verdict: Not ready. We've added V4-Flash to the benchmark as a cloud API model for now. When llama.cpp merges V4 support and a viable sub-90 GB quant exists, we'll revisit.

ZAYA1-8B: The Right Size, the Wrong Stack

This is the one that hurts the most, because on paper it's a perfect homelab model:

| Spec | Value |
| --- | --- |
| Total parameters | 8.4B |
| Activated per token | 760M (MoE, 16 experts, top-1 routing) |
| VRAM at bf16 | ~17 GB |
| Context window | 128K tokens |
| AIME '26 score | 89.1 |

8.4 billion parameters. 17 GB in bf16. Fits trivially on the 5090 with room to spare. Punches absurdly above its weight on reasoning benchmarks — 89.1 on AIME '26 is competitive with models 10–15x its size.
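The ~17 GB figure checks out from first principles, since bf16 is two bytes per parameter (the extra beyond 16.8 GB is activations, KV cache, and runtime overhead):

```python
# Sanity-checking ZAYA1-8B's "~17 GB at bf16": weights alone are
# parameters x 2 bytes; the remainder is cache and overhead.
params = 8.4e9
bf16_bytes = 2
print(round(params * bf16_bytes / 1e9, 1))  # 16.8 (GB of weights)
```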

So what's the problem? Architecture.

ZAYA1 uses CCA (Cross-Channel Attention) — Zyphra's novel hybrid of Mamba-style recurrence and traditional attention. It's not standard Mamba2. It's not standard transformer attention. It's a fundamentally new layer type with small 1D convolutions, custom Q/K projections, and learned residual scaling.

llama.cpp has no support for this architecture. There's an open feature request with nothing but +1 comments. No GGUF quants exist because there's nothing to run them on. Even Zyphra's older Zamba2 architecture (#21412) remains unimplemented.

The only way to run ZAYA1 today is through Zyphra's custom vLLM fork — a completely different serving stack from our llama.cpp setup. It would work on the 5090, but it means standing up and maintaining a parallel inference pipeline.

Verdict: On the to-do list. When llama.cpp adds CCA support or we carve out time to set up vLLM as a second serving backend, this is the first model we'll test.

What Actually Runs on a 32 GB GPU

Here's the uncomfortable reality of local inference in mid-2026: the models generating the most hype are the ones you can't run.

The models that fly on a 32 GB card — where you get 100+ tok/s and useful agentic performance — are capped at roughly 24–28 GB of weights (leaving room for KV cache). That means:

| Category | What Fits |
| --- | --- |
| Dense models | Up to ~14B at Q8, ~20B at Q6, ~27B at Q4 |
| MoE models | Up to ~35B total at Q4 (e.g. Qwen 3.5 35B-A3B) |
| What doesn't | Anything over ~28 GB of quantized weights |

Our current daily driver — Qwen 3.5 35B-A3B at Q4_K_XL — is 22 GB of weights with 3B activated per token, running at 200+ tok/s. It's fast, it's good, and it's approximately the ceiling of what a single 5090 can do at interactive speeds.
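You can turn that table into a quick fit check. The bits-per-weight values below are approximate GGUF averages, and `fits_on_5090` is a hypothetical helper for illustration, not part of any tool:

```python
# Rough "will it fly?" check for a 32 GB card: quantized weights must
# fit in ~24-28 GB, leaving the rest for KV cache and runtime overhead.

BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8}  # approximate
USABLE_VRAM_GB = 28  # keep >=4 GB free on a 32 GB card

def weight_size_gb(params_b: float, quant: str) -> float:
    """Estimated quantized weight size for params_b billion parameters."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

def fits_on_5090(params_b: float, quant: str) -> bool:
    return weight_size_gb(params_b, quant) <= USABLE_VRAM_GB

print(fits_on_5090(14, "Q8_0"))    # ~14.9 GB -> True
print(fits_on_5090(27, "Q4_K_M"))  # ~16.2 GB -> True
print(fits_on_5090(35, "Q6_K"))    # ~28.9 GB -> False
```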

The Three Walls

Each of these models hits a different wall, and that's what makes this exercise useful:

  1. V4-Pro — pure size. 805 GB of weights. No amount of quantization or clever offloading helps when the model is 8x your total memory.

  2. V4-Flash — the quantization gap. The model almost fits at extreme compression, but the quality degrades too far. We're in a window where the model exists but the tooling hasn't caught up to make it practical on consumer hardware.

  3. ZAYA1 — architecture support. The model fits perfectly. The hardware is more than enough. But the inference engine doesn't speak the language yet.

If you're evaluating models for a homelab or edge deployment, these are the three questions to ask before you even think about benchmarks: Is it small enough? Is the quantization viable? Does my inference stack support it?

By the Numbers

  • 805 GB — DeepSeek V4-Pro model weight size. 8.4x our total system memory.
  • 96.2 GB — smallest V4-Flash GGUF quant. Still 0.2 GB over our VRAM + RAM.
  • 17 GB — ZAYA1-8B at bf16. Fits trivially, runs nowhere (yet).
  • 22 GB — our actual daily driver (Qwen 3.5 35B-A3B at Q4_K_XL). The real ceiling.
  • 0 — number of these three models with merged llama.cpp support.
  • 2 — models we added to the benchmark as cloud API endpoints instead (V4-Flash, V4-Pro).

Top comments (1)

Max Quimby

The 24–28GB usable ceiling on a "32GB" card is the number nobody puts on the box. People budget for the model weights and forget that KV cache at any reasonable context length easily takes another several GB, plus whatever the inference runtime keeps resident.

The V4-Flash compression-degradation point is the one I'd push hardest on. There's a real perception problem in the open-weights community where Q3/Q2 quants of a flagship model get benchmarked against Q4/Q5 quants of a smaller model and "win" on MMLU but fall apart on anything requiring multi-step reasoning. The benchmarks aren't catching the degradation modes that matter for agentic use.

ZAYA1's architecture story is the more interesting medium-term issue — every novel attention variant means another 6-12 months before llama.cpp / vLLM / MLX have a tuned kernel for it. The "we can't run it" window is real even when the hardware would technically fit, and it keeps getting longer as architectures diverge.