ZAYA1-8B Review 2026: Apache 2.0 Reasoning MoE on AMD

#zaya1 #llm #reasoning #localllm

This article was originally published on aifoss.dev

TL;DR: ZAYA1-8B is an Apache 2.0 Mixture-of-Experts reasoning model from Zyphra — 8.4B total parameters, ~760M active, trained entirely on AMD Instinct MI300X. It posts frontier-class math scores for its size, but it needs Zyphra's custom forks of vLLM or transformers to run, and there's no official GGUF yet. Great weights, awkward to self-host today.

	ZAYA1-8B	Qwen3-4B-Thinking-2507	Gemma-4-E4B-it
Best for	Math/reasoning density per param	Drop-in Ollama reasoning	Multimodal + easy local use
Active params	~760M (8.4B total MoE)	4B dense	~4B effective
Install complexity	High (custom vLLM/transformers fork)	Low (`ollama run`)	Low (`ollama run`)
License	Apache 2.0	Apache 2.0	Gemma terms
GGUF / Ollama	None official (June 2026)	Yes	Yes
Min VRAM (bf16)	~17 GB weights, ~47 GB observed w/ vLLM defaults	~9 GB	~8 GB

Honest take: If you want the best math-per-watt open weights of mid-2026 and you're comfortable on AMD or building from a fork, ZAYA1-8B is genuinely special. If you just want a reasoning model running tonight, `ollama run qwen3:4b` beats it on convenience by a mile.

What ZAYA1-8B actually is

Zyphra released ZAYA1-8B on May 6, 2026 under the Apache 2.0 license, with weights on Hugging Face and a technical report on arXiv. The headline isn't the size — it's the efficiency. This is a sparse Mixture-of-Experts model with 8.4B total parameters but only about 760M active per token. The pitch is "maximum intelligence density per parameter," and unusually for a 2026 frontier-adjacent model, it was pretrained 100% on AMD Instinct MI300X GPUs rather than NVIDIA hardware.

That AMD detail isn't marketing fluff. Most open-weight models are trained on NVIDIA clusters, and the software stack reflects that. ZAYA1 is a proof point that a full pretraining run — including long-context extension — works end to end on AMD silicon with Pensando networking on IBM Cloud. If you care about hardware diversity in the open ecosystem, this matters.

Three architecture changes carry the model, all part of what Zyphra calls its MoE++ stack:

Compressed Convolutional Attention (CCA) — attention that operates in a compressed latent space and achieves roughly 8× KV-cache compression versus standard attention. The KV cache is the per-token memory the model holds during generation; cutting it 8× is what makes long context affordable.
An MLP-based expert router with PID-controller bias balancing, which keeps expert utilization stable instead of collapsing onto a few experts.
Learned residual scaling, a small but real contributor to training stability at this sparsity.
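To make the router bullet concrete, here is a toy sketch of PID-style bias balancing in general. It is not Zyphra's implementation: the random scores stand in for real router logits, and the expert count, gains, skew, and the function name `simulate_pid_balancing` are all hypothetical. Each expert gets a bias added to its score before top-1 selection, and a PID loop moves that bias according to the gap between uniform load and observed load.

```python
import numpy as np

def simulate_pid_balancing(num_experts=8, steps=300, batch=8192,
                           kp=1.0, ki=1.0, kd=0.1, seed=0):
    """Toy top-1 router whose per-expert biases are driven by a PID loop."""
    rng = np.random.default_rng(seed)
    # Hypothetical built-in preference: expert 0 is favored, the last is not.
    skew = np.linspace(1.0, 0.0, num_experts)
    target = np.full(num_experts, 1.0 / num_experts)  # uniform load

    bias = np.zeros(num_experts)
    integral = np.zeros(num_experts)
    prev_err = np.zeros(num_experts)
    loads = []

    for _ in range(steps):
        # Stand-in for router logits; a real router would compute these.
        scores = rng.normal(size=(batch, num_experts)) + skew
        chosen = np.argmax(scores + bias, axis=1)
        load = np.bincount(chosen, minlength=num_experts) / batch
        loads.append(load)

        err = target - load            # positive means the expert is underused
        integral += err                # I term: accumulated imbalance
        derivative = err - prev_err    # D term: change since the last step
        bias = kp * err + ki * integral + kd * derivative
        prev_err = err

    return np.array(loads), bias

loads, bias = simulate_pid_balancing()
print("load spread, first step:", loads[0].max() - loads[0].min())
print("load spread, last step: ", loads[-1].max() - loads[-1].min())
print("final biases:", np.round(bias, 2))
```

In this toy setup the favored experts end up with negative biases and the neglected ones with positive biases, which is the "stable utilization instead of collapse" behavior the bullet describes.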

It was trained at context lengths of up to 32K tokens, with context-parallel techniques used to push effective context further (the report describes scaling to 131K with eight ranks).
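A quick back-of-envelope shows why the 8× KV-cache figure matters at these context lengths. Only the 8× factor and the 32K/131K lengths come from the text above; the layer count, KV-head count, and head dimension below are hypothetical placeholders, not ZAYA1-8B's real configuration, and the formula is the one for standard attention with bf16 values.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim,
                   bytes_per_value=2, compression=1.0):
    """KV-cache size for one sequence: keys + values for every layer and token."""
    raw = 2 * tokens * layers * kv_heads * head_dim * bytes_per_value
    return raw / compression

# Hypothetical model shape, chosen only to make the arithmetic readable.
HYPOTHETICAL_SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

for tokens in (32_768, 131_072):  # assuming 32K = 32,768 and 131K = 131,072
    standard = kv_cache_bytes(tokens, **HYPOTHETICAL_SHAPE)
    compressed = kv_cache_bytes(tokens, compression=8.0, **HYPOTHETICAL_SHAPE)
    print(f"{tokens} tokens: {standard / 2**30:.2f} GiB standard, "
          f"{compressed / 2**30:.2f} GiB at 8x compression")
```

With these placeholder dimensions, a single 131,072-token sequence drops from 16 GiB of cache to 2 GiB. The absolute numbers will differ for the real model; the ratio is the point.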

The benchmarks — and the asterisk

ZAYA1-8B punches well above its active-parameter count on math and reasoning. The numbers Zyphra published:

Benchmark	ZAYA1-8B	Comparison
AIME'25	91.9	with Markovian RSA test-time compute
HMMT'25	89.6	vs Claude 4.5 Sonnet 88.3
GPQA-Diamond	71.0	knowledge/reasoning
AIME'26	89.1	vs Mistral-Small-4-119B 86.4
HMMT Feb'26	71.6	vs Mistral-Small-4-119B 70.6

Read that Mistral row again: a model with 760M active parameters edging out a 119B model on competition math. Zyphra also reports ZAYA1-8B beating Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across math and coding categories, and staying competitive with first-generation frontier reasoning models like DeepSeek-R1-0528 and Gemini 2.5 Pro.

Here's the asterisk you need before you quote these at work: the top-line results that approach Claude 4.5 Sonnet lean on Markovian RSA, Zyphra's test-time-compute scheme. Markovian RSA generates parallel reasoning traces and recursively aggregates them while carrying forward only a bounded ~4K-token "tail" between rounds — so you get long effective reasoning without unbounded memory growth. That's clever, and the constant-memory property is the real engineering win. But it means those scores reflect extra inference budget, not a single greedy pass. Without that budget, ZAYA1-8B is still strong for its size, but the gap to frontier models widens. Anyone comparing it to a vanilla pass@1 number from another model is comparing apples to oranges.
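The control flow described above can be sketched in a few lines. This is a structural illustration built only from the description in this article, not Zyphra's algorithm: `markovian_rsa_sketch`, `generate`, and `aggregate` are hypothetical names, the model calls are stubbed out, and the real scheme in the technical report may differ in how traces are sampled and merged.

```python
from typing import Callable, List, Sequence

Tokens = List[str]

def markovian_rsa_sketch(
    prompt: Tokens,
    generate: Callable[[Tokens, int], Tokens],
    aggregate: Callable[[Tokens, Sequence[Tokens]], Tokens],
    rounds: int = 3,
    width: int = 4,
    tail_tokens: int = 4096,
) -> Tokens:
    """Run several rounds of parallel reasoning, carrying only a bounded tail."""
    if tail_tokens <= 0:
        raise ValueError("tail_tokens must be positive")
    tail: Tokens = []
    for _ in range(rounds):
        # Each round sees the prompt plus the bounded tail, never the full history.
        context = prompt + tail
        traces = [generate(context, seed) for seed in range(width)]
        merged = aggregate(context, traces)
        # The "Markovian" part: only the last tail_tokens survive into the next round.
        tail = merged[-tail_tokens:]
    return tail

# Stub model calls so the sketch runs without any weights.
def fake_generate(context: Tokens, seed: int) -> Tokens:
    return [f"trace{seed}"] * 10_000

def fake_aggregate(context: Tokens, traces: Sequence[Tokens]) -> Tokens:
    return [tok for trace in traces for tok in trace[-2_000:]]

result = markovian_rsa_sketch(["question"] * 50, fake_generate, fake_aggregate)
print(len(result))
```

However many rounds you run, the context handed to the model is capped at the prompt plus the tail. That is the constant-memory property, and it is also why the scores cost `rounds × width` generations rather than one.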

Self-hosting reality check (read this before you download)

This is where the review gets practical, and where most "ZAYA1-8B is the new local king" posts go quiet. The custom architecture that makes the model efficient also means it does not run on stock vLLM or stock transformers as of June 2026. The supported paths are Zyphra's own forks.

A minimal transformers-fork load looks like this:

# Install Zyphra's transformers fork (CCA + MoE++ router not in upstream yet)
pip install "transformers @ git+https://github.com/Zyphra/transformers.git"

python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Zyphra/ZAYA1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

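msg = [{"role": "user", "content": "Prove there are infinitely many primes."}]
# return_dict=True gives input_ids plus attention_mask, regardless of the fork's default
inputs = tok.apply_chat_template(
    msg, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Slice off the prompt tokens so only the model's answer is printed
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))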
PY

Expected behavior on first run: a multi-GB shard download, then a chain-of-thought style answer. Confirm the exact repo path and fork URL on the model's Hugging Face card before copying — Zyphra's repo names have moved during the preview window.

VRAM, honestly. At bf16, 8.4B parameters is only about 17 GB of weights. But in real testing, served through vLLM on an NVIDIA RTX 6000, the process consumed roughly 47 GB once fully loaded. That's not the model being secretly huge; it's vLLM's `gpu_memory_utilization` setting (0.9 by default) pre-allocating most of the card for the KV cache and paged-attention pool. You can dial that down (`--gpu-memory-utilization 0.5`) and fit it in far less, but plan for a 24 GB card minimum at bf16 and don't be surprised when vLLM grabs everything you give it. Community quantizations (BNB and MXFP4 builds) have started appearing on Hugging Face, which bring the footprint down further, but they're unofficial.
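The arithmetic behind those numbers is easy to check. The sketch below is a rough budget, not a measurement: it assumes a 48 GB card for the test machine (the article only says "RTX 6000"), ignores activations and runtime overhead, and uses hypothetical helper names. The 0.9 value is vLLM's default for `gpu_memory_utilization`.

```python
def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights in GB (bf16 = 2 bytes per parameter)."""
    return params * bytes_per_param / 1e9

def kv_pool_gb(card_gb: float, gpu_memory_utilization: float, params: float) -> float:
    """Rough space left for the KV-cache pool after weights, ignoring other overhead."""
    budget = card_gb * gpu_memory_utilization
    return budget - weights_gb(params)

ZAYA1_PARAMS = 8.4e9  # total parameters, from the article

print(f"weights: {weights_gb(ZAYA1_PARAMS):.1f} GB")
# (card size in GB, gpu_memory_utilization); the 48 GB card size is an assumption
for card_gb, util in [(48, 0.9), (48, 0.5), (24, 0.9)]:
    pool = kv_pool_gb(card_gb, util, ZAYA1_PARAMS)
    print(f"{card_gb} GB card at {util}: ~{pool:.1f} GB left for KV cache")
```

On a 24 GB card at the default setting, that leaves under 5 GB for cache, which is why 24 GB is a floor rather than a comfortable target at bf16.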

No Ollama, no llama.cpp (yet). There is no official GGUF from Zyphra, and the llama.cpp feature request (issue #22776) was still open with no merged implementation when I checked. CCA and the Markovian RSA sampler are non-trivial to port. If your entire workflow is `ollama run`, ZAYA1-8B is not ready for you this month. Watch that issue.

If you don't own a 24 GB+ card and just want to try the weights without buying hardware, renting is the rational move: a single 48 GB cloud GPU on RunPod costs less than a coffee per hour and saves you the fork-wrangling on a fresh image. For the dual-consumer-GPU route, two used RTX 3090 cards give you 48 GB of VRAM in total, split 24 GB per card, so the model has to be sharded across both; our friends at runaihome.com cover those multi-GPU home-lab builds in depth.

Where it fits versus what you already run

If you've read our Ollama vs LM Studio vs llama.cpp comparison, you already know the convenience hierarchy: anything with a GGUF and an Ollama tag wins on setup time. ZAYA1-8B sits outside that comfort zone right now. So the question isn't "is it better than Qwen3-4B-Thinking?" on a benchmark sheet (on math, it is); it's "is the setup tax worth it for your workload?"

For pure math and structured re