Jangwook Kim

Posted on • Originally published at effloow.com

ZAYA1-8B: Zyphra's Efficient MoE Reasoning Model Guide

The scaling-is-everything story has a new challenger. On May 6, 2026, Zyphra released ZAYA1-8B — an open-weight Mixture-of-Experts reasoning model with 8.4 billion total parameters and fewer than 800 million active per token. On AIME 2025, a benchmark where DeepSeek-R1 sits at 87.5 with its 671 billion parameter footprint, ZAYA1-8B scores 91.9. That gap in parameter count — roughly two orders of magnitude — is the headline, but the engineering story underneath is more interesting.

This guide covers the architecture, the novel test-time compute method, how to run the model today, and what the benchmark numbers actually mean in practice.

Why This Model Matters Right Now

The reasoning model landscape in 2026 has sorted into two camps: closed frontier models (GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro) that require API calls with unpredictable pricing, and large open-weight models (DeepSeek-R1-0528 at 671B, Llama 4 Maverick at 400B+) that technically run locally but demand multi-GPU clusters to do so practically.

ZAYA1-8B occupies a gap that has largely been empty: a model small enough to run on a single high-end consumer GPU, open-weight under Apache 2.0, and strong enough on math and coding to be genuinely useful rather than just impressive in press releases.

For developers building agents, math tutors, code review tools, or anything requiring extended reasoning chains, the ability to self-host a frontier-class reasoner at reasonable cost changes what's economically viable to build. That's the real significance here.

Core Concepts

The MoE++ Architecture

ZAYA1-8B is built on Zyphra's custom MoE++ architecture. Standard Mixture-of-Experts models activate a subset of their total parameters per token — that's how a 671B model like DeepSeek-R1 keeps inference costs closer to a 37B dense model in practice. Zyphra's variant introduces three architectural changes on top of that foundation:

Compressed Convolutional Attention (CCA) replaces standard multi-head attention with a sequence-mixing mechanism that operates in a compressed latent space. The result is an 8x reduction in KV-cache size compared to full multi-head attention. For long reasoning chains — the kind that burn memory fast in standard transformers — this is a meaningful practical advantage.
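
To make the 8x figure concrete, here is the rough cache arithmetic. The layer count, head count, and head dimension below are illustrative guesses, not published specs; only the compression factor comes from Zyphra's claim.

```python
# Back-of-envelope KV-cache sizing. All architecture numbers here are
# hypothetical; only the 8x compression factor is Zyphra's claim.
n_layers, n_kv_heads, head_dim = 32, 8, 128   # assumed, not published
bytes_per_elem = 2                            # bf16
seq_len = 32_768                              # a long reasoning trace

# Standard attention caches K and V per layer, per position
mha_cache = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
cca_cache = mha_cache / 8                     # CCA's claimed reduction

print(f"standard: {mha_cache / 1e9:.2f} GB, CCA: {cca_cache / 1e9:.2f} GB")
# standard: 4.29 GB, CCA: 0.54 GB -- per sequence, before batching
```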

MLP-based expert router replaces the linear router used in most MoE designs. Standard linear routers are fast but limited in their ability to capture token-context relationships when deciding which experts to activate. A multi-layer MLP router is more expensive but more expressive, which the team argues improves routing stability across diverse reasoning tasks.
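
A minimal sketch of the difference, in PyTorch. The hidden width, depth, and activation are assumptions; Zyphra hasn't published the router internals.

```python
import torch
import torch.nn as nn

d_model, n_experts, d_hidden = 512, 32, 256  # illustrative sizes only

# Standard MoE router: one linear projection from token state to expert logits
linear_router = nn.Linear(d_model, n_experts)

# MLP router in the spirit of MoE++: the extra hidden layer lets routing
# depend on nonlinear combinations of token features (shape is a guess)
mlp_router = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.GELU(),
    nn.Linear(d_hidden, n_experts),
)

x = torch.randn(4, d_model)                               # a batch of token states
experts = torch.topk(mlp_router(x), k=2, dim=-1).indices  # 2 experts per token
print(experts)
```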

Learned residual scaling controls how the residual norm grows with depth. This is a low-cost addition (negligible parameter and FLOP overhead) that addresses gradient dynamics in deep networks — particularly relevant for reasoning models that benefit from very deep computation graphs.
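
The mechanism is simple enough to sketch generically. This is the idea, not Zyphra's exact formulation:

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Residual connection with a learned per-block scale (generic sketch)."""

    def __init__(self, block: nn.Module, init_scale: float = 1.0):
        super().__init__()
        self.block = block
        # Learned scalar: the network can damp or amplify each block's
        # contribution, keeping the residual norm controlled as depth grows
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.block(x)
```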

Together, these changes give ZAYA1-8B its headline statistic: 760M active parameters and 8.4B total, with a KV-cache footprint that stays manageable even during extended multi-step reasoning.

Training: AMD Hardware, Multi-Stage RL

The model was trained on 1,024 AMD MI300X GPUs running on IBM Cloud infrastructure with AMD Pensando Pollara networking — an end-to-end AMD stack. The pretraining was done entirely on this hardware. This is notable less for the hardware choice itself and more for what it signals: that AMD's MI300X cluster is capable of frontier model training at scale, not just inference.

Post-training follows a multi-stage reinforcement learning cascade. The stages described in the technical report include:

  1. Reasoning warmup — standard SFT on chain-of-thought examples to establish a baseline reasoning style
  2. RLVE-Gym — a curriculum of 400 adaptive puzzle-like environments for RL training across mathematics, logic, and coding domains
  3. Behavioral polishing — final alignment passes to smooth output quality and instruction following

The curriculum-based RL approach (RLVE-Gym specifically) is where the model's mathematical reasoning capabilities are thought to originate. Most open-weight reasoning models rely on GRPO or similar RL methods with verifiable math rewards; Zyphra's staged curriculum is a more structured alternative.

Markovian RSA: Test-Time Compute Without Memory Blowup

This is the most technically novel piece. Most approaches to test-time compute scaling fall into two patterns: serial chain-of-thought (more reasoning steps in a single trace) or parallel sampling (generate N traces, pick the best). Serial scaling hits context-window limits. Parallel sampling multiplies memory use linearly with N.

Markovian RSA (Randomized Sequential Aggregation) is Zyphra's answer to both problems. Instead of generating N fully independent traces, it generates traces sequentially where each new trace is conditioned on a fixed-length summary of prior traces. The "Markovian" part means each trace only needs the summary, not the full history — keeping memory cost constant regardless of how many traces you run.

In practice, this means you can allocate more inference compute to hard problems by running more RSA steps, without the memory cost growing unboundedly. It's the mechanism behind ZAYA1-8B's 91.9 AIME 2025 score.
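
The loop structure, as described, looks roughly like this. The summarization prompt, the helper functions, and the truncation length are guesses at the mechanism, not Zyphra's published procedure:

```python
def extract_answer(trace: str) -> str:
    # Naive stand-in: treat the last line of the trace as the candidate answer
    return trace.strip().splitlines()[-1]

def markovian_rsa(model, problem: str, n_steps: int = 4) -> str:
    summary = ""   # fixed-length state carried between traces
    answer = ""
    for _ in range(n_steps):
        # Each trace sees only the problem plus the compact summary, never
        # the full history of earlier traces, so memory is constant in n_steps
        prompt = f"{problem}\n\nPrior findings:\n{summary}" if summary else problem
        trace = model.generate_text(prompt)   # hypothetical generation helper
        answer = extract_answer(trace)
        # Fold the new trace into an updated, bounded-size summary
        summary = model.generate_text(
            f"Summarize the key steps and candidate answer so far:\n{trace}"
        )[:2000]                              # hard cap keeps the state fixed-length
    return answer
```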

One thing worth being direct about: Markovian RSA scores are not comparable to standard pass@1 numbers from Claude 4.5 Sonnet or GPT-5. RSA runs substantially more inference compute per problem than a single chain-of-thought call. When you see "ZAYA1-8B beats Claude Sonnet on HMMT 2025 (89.6 vs 88.3)", the correct read is "at higher compute budget, ZAYA1-8B can close the gap with much larger models" — not "8B beats 200B+ on equal footing."

That framing doesn't diminish the result. It means ZAYA1-8B gives you a knob: spend more compute on hard problems, less on easy ones, without switching models.

Benchmark Performance in Context

| Benchmark | ZAYA1-8B (Markovian RSA) | DeepSeek-R1-0528 | Claude 4.5 Sonnet | Mistral-Small-4 |
| --- | --- | --- | --- | --- |
| AIME 2025 | 91.9 | 87.5 | – | – |
| AIME 2026 | 89.1 | – | – | 86.4 |
| HMMT 2025 | 89.6 | – | 88.3 | – |
| HMMT Feb 2026 | 71.6 | – | – | 70.6 |
| Active Params | 760M | ~37B | ~200B+ | 6B |

The comparison with Mistral-Small-4 (6B active parameters) is arguably the most useful for developers. ZAYA1-8B has fewer active parameters and beats it on both AIME 2026 and HMMT Feb 2026. That's efficiency in the most direct sense — fewer FLOPs per token, better reasoning output.

The DeepSeek-R1-0528 comparison (87.5 AIME 2025 at 671B total params) is the headline number, and it holds up under scrutiny. DeepSeek-R1's AIME scores use a similar extended test-time compute budget, so that particular comparison is closer to apples-to-apples than the Claude/GPT comparisons.

How to Run ZAYA1-8B Today

Option 1: Zyphra Cloud (no setup required)

The fastest path is the free serverless endpoint at cloud.zyphra.com. No model weights to manage, no GPU required. Rate limits and pricing beyond the free tier are not publicly documented at time of writing — check the platform directly.

Option 2: Self-host with vLLM (recommended for production)

ZAYA1-8B requires Zyphra's fork of vLLM for the custom MoE++ routing and Markovian RSA inference. The upstream vLLM does not support the model yet.

```bash
# Install Zyphra's vLLM fork
git clone -b zaya1-pr https://github.com/Zyphra/vllm.git
cd vllm && pip install -e .

# Serve the model
vllm serve Zyphra/ZAYA1-8B \
  --port 8010 \
  --mamba-cache-dtype float32 \
  --dtype bfloat16 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser zaya_xml
```

Hardware requirement: approximately 16 GB VRAM for BF16. A single A100 40GB or RTX 4090 24GB with some headroom covers this.
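
The serve command exposes vLLM's standard OpenAI-compatible API on port 8010, so any OpenAI client can talk to it. A quick smoke test (the prompt is arbitrary):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; no real API key is required locally
client = OpenAI(base_url="http://localhost:8010/v1", api_key="unused")

response = client.chat.completions.create(
    model="Zyphra/ZAYA1-8B",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    temperature=0.6,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```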

Option 3: Transformers inference (experimentation)

For exploration without production-grade serving, use Zyphra's transformers fork:

```bash
pip install git+https://github.com/Zyphra/transformers.git@zaya1
pip install accelerate bitsandbytes
```

With 4-bit quantization, VRAM drops to approximately 6 GB:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Zyphra/ZAYA1-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve: For all integers n, prove that n^3 - n is divisible by 6."}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=2048, temperature=0.6, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```

Community-made quantizations (4-bit BNB and MXFP4 variants) are available on HuggingFace at barozp/ZAYA1-8B-BNB and OsaurusAI/ZAYA1-8B-MXFP4, but these are unofficial. Use them for local experiments, not production.

Option 4: HuggingFace weights for custom inference

The model weights are available at Zyphra/ZAYA1-8B under Apache 2.0. Download and integrate however your stack requires — the license imposes no usage restrictions beyond attribution.
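
If you just want the raw files, huggingface_hub pulls the whole repository in one call:

```python
from huggingface_hub import snapshot_download

# Downloads all weight shards and config files; returns the local cache path
local_dir = snapshot_download("Zyphra/ZAYA1-8B")
print(local_dir)
```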

Practical Application: Where to Use This Model

Mathematical reasoning pipelines. The AIME and HMMT scores indicate genuine competition-level math ability. For any application where verifiable step-by-step reasoning on math problems matters — tutoring, automated grading, scientific computing — ZAYA1-8B is a credible small-footprint option.

Code review and debugging agents. Coding tasks were part of the RLVE-Gym training curriculum. The model is worth evaluating for agents that need to reason about code correctness, not just autocomplete.

Self-hosted reasoning API on constrained hardware. The combination of 760M active parameters + Markovian RSA lets you run a high-quality reasoning model on hardware that can't serve DeepSeek-R1. A single A100 40GB can handle this; a four-GPU setup of RTX 3090s could too with careful batching.

Multi-model routing. Use ZAYA1-8B for tasks where extended reasoning is needed, and a smaller fast model for routine completions. The Apache 2.0 license means you can build commercial routing systems without licensing friction.
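
A minimal sketch of that routing pattern, assuming both models sit behind OpenAI-compatible endpoints. The second endpoint, the fast-model name, and the keyword heuristic are placeholders; a real router would use a classifier or confidence signal:

```python
from openai import OpenAI

reasoner = OpenAI(base_url="http://localhost:8010/v1", api_key="unused")
fast = OpenAI(base_url="http://localhost:8011/v1", api_key="unused")  # placeholder

def needs_reasoning(prompt: str) -> bool:
    # Crude stand-in for a real task classifier
    return any(k in prompt.lower() for k in ("prove", "solve", "debug", "why"))

def complete(prompt: str) -> str:
    client, model = (
        (reasoner, "Zyphra/ZAYA1-8B")
        if needs_reasoning(prompt)
        else (fast, "fast-model")  # placeholder model name
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```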

Common Mistakes to Avoid

Comparing RSA scores to pass@1 baselines. The 91.9 AIME 2025 number uses Markovian RSA, which runs multiple inference passes. If you're benchmarking ZAYA1-8B in your own pipeline with a single inference call, expect lower numbers — competitive with models in its active-parameter class, but not frontier-beating without the RSA budget.

Using upstream vLLM or unforked transformers. The model will either error out or produce degraded outputs. The custom fork is required for correct MoE++ routing. Check the HuggingFace model card for the current recommended fork revision before deploying.

Ignoring the KV-cache advantage. If your use case involves long reasoning chains (500+ tokens of chain-of-thought), CCA's 8x KV-cache reduction is a concrete throughput win. Many benchmarks don't capture this because they use short prompts. Run your own throughput tests with realistic prompt lengths.
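
A quick way to probe this against the vLLM server from Option 2. The padded prompt below is a stand-in for your real workload:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="unused")

# Short benchmark prompts won't exercise the KV cache the way long
# reasoning chains do, so pad to a realistic length for your use case
long_prompt = "Review this function for correctness:\n" + "x = x + 1\n" * 400

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Zyphra/ZAYA1-8B",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=1024,
)
elapsed = time.perf_counter() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```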

Treating community quantizations as production-ready. The 4-bit and MXFP4 variants on HuggingFace were not released by Zyphra. They may work fine, but quality degradation on hard reasoning tasks is worth verifying before deploying.

FAQ

Q: Does ZAYA1-8B run on consumer GPUs?

The BF16 model needs around 16 GB of VRAM, which fits on an RTX 4090 (24 GB) with headroom left for the KV cache. With 4-bit quantization via bitsandbytes, you can get this down to roughly 6 GB, making it runnable on midrange consumer hardware. Performance at 4-bit on hard reasoning tasks hasn't been officially benchmarked, so expect some degradation on competition-level math.
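
The rough arithmetic behind those numbers, counting weights only (KV cache and runtime overhead add a few GB on top):

```python
params = 8.4e9  # total parameters; all must be resident in VRAM, MoE or not
print(f"BF16 weights : {params * 2 / 1e9:.1f} GB")    # ~16.8 GB
print(f"4-bit weights: {params * 0.5 / 1e9:.1f} GB")  # ~4.2 GB
```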

Q: Is the Apache 2.0 license truly unrestricted for commercial use?

Apache 2.0 allows commercial use, modification, and distribution with attribution. There are no additional usage restrictions from Zyphra beyond what Apache 2.0 specifies. This makes it one of the most permissive licenses for an open-weight reasoning model at this capability level.

Q: How does Markovian RSA compare to majority voting / best-of-N sampling?

Best-of-N generates N independent samples and picks the highest-scored one — memory scales with N. Markovian RSA generates traces sequentially where each conditions on a fixed summary of prior traces, keeping memory constant. For a fixed memory budget, RSA can run more "deliberation steps" than best-of-N. The tradeoff is latency: RSA traces are sequential rather than parallel.

Q: When will ZAYA1-8B be supported in mainstream vLLM?

Zyphra maintains a fork (zaya1-pr branch) with the MoE++ changes. Upstream vLLM integration is not confirmed at time of writing. Watch the Zyphra vLLM fork for merge activity.

Q: What's the context window size?

The technical report (arXiv:2605.05365) details the model architecture, but the maximum context length wasn't confirmed in publicly available sources at time of writing. CCA's 8x KV-cache reduction suggests the model is designed for extended contexts — check the HuggingFace model card for the current spec.

Key Takeaways

Bottom Line

ZAYA1-8B is the most parameter-efficient competitive reasoning model available under an open license as of May 2026. The 760M active parameter footprint and Apache 2.0 license make it the first credible option for self-hosted reasoning on single-GPU hardware — if you can navigate the custom vLLM fork requirement. The Markovian RSA scores are genuinely impressive, with the important caveat that they reflect higher inference compute, not just model quality. For developers who need strong math and coding reasoning without closed-model API dependency or multi-GPU infrastructure, this is worth evaluating seriously.

What makes ZAYA1-8B technically distinctive:

  • 760M active parameters out of 8.4B total — roughly 9% activation ratio, far below comparable MoE models
  • CCA drops KV-cache 8x, reducing memory pressure on long reasoning chains
  • Markovian RSA enables compute-scaling at inference without memory blowup
  • Apache 2.0, no usage restrictions

What to watch:

  • Upstream vLLM/transformers integration status
  • Official quantization support from Zyphra
  • Community fine-tunes leveraging the Apache 2.0 license for domain-specific reasoning

The technical report is available at arXiv:2605.05365. Model weights are at Zyphra/ZAYA1-8B. The free inference endpoint is at cloud.zyphra.com.
