DEV Community

Thousand Miles AI

ZAYA1-8B: a 760M-active MoE trained on AMD MI300X

Zyphra released ZAYA1-8B on May 6, 2026. It's an 8.4B-parameter Mixture-of-Experts model where only about 760M parameters activate per token, and on the math and coding benchmarks Zyphra ran it sits next to models many times its size, including Claude 4.5 Sonnet and DeepSeek-R1-0528. The small-model performance is the headline. The interesting part is how they got there.

TL;DR

  • ZAYA1-8B has 8.4B total parameters but only ~760M active per token, so per-token inference compute looks like a sub-billion model's, while the memory footprint is still that of an 8.4B model.
  • Three architecture changes do most of the work: Compressed Convolutional Attention (CCA), an MLP-based router for expert selection, and learned residual scaling.
  • Pretrained end-to-end on 1,024 AMD Instinct MI300X nodes with a Pensando Pollara fabric, one of the first public MoE training runs at this scale on an all-AMD stack.

Background

A Mixture of Experts (MoE) is a Transformer where each feed-forward block is replaced by a set of feed-forward blocks (the "experts") plus a small router that picks a few of them, the top-k, for each token. The model's total parameter count covers every expert plus the shared layers (attention, embeddings, routers). The active parameter count covers only the shared layers and the few experts that fire for a given token. Active count dictates inference FLOPs. Total count dictates capacity. Pretty much every modern open-weights MoE (DeepSeek-V3, Qwen3, gpt-oss, now ZAYA1-8B) is a play on that gap.
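
To make the total-versus-active gap concrete, here is a toy sketch of one MoE layer in plain Python. Everything in it is hypothetical: the four "experts" just scale their input, the router weights are hand-picked, and renormalizing the gate weights over the chosen experts is one common convention rather than anything confirmed about ZAYA1-8B.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, k=2):
    """Route one token (a list of floats) to its top-k experts.

    router_weights[e] is the score vector for expert e; experts[e] is a callable.
    Returns the mixed output and the indices of the experts that actually ran.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in chosen)  # assumption: renormalize over chosen experts
    output = [0.0] * len(token)
    for e in chosen:
        expert_out = experts[e](token)  # only k of the experts do any work
        for i, value in enumerate(expert_out):
            output[i] += (probs[e] / norm) * value
    return output, chosen

def make_expert(scale):
    # Hypothetical stand-in for a feed-forward block: scales the token.
    return lambda token: [scale * x for x in token]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([2.0, 1.0], router_weights, experts, k=2)
print(chosen, output)
```

All four experts count toward the layer's total parameters, but only the two in `chosen` are ever called for this token, which is the whole trick.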

What "intelligence density per active parameter" actually means

When Zyphra says ZAYA1-8B has frontier intelligence density per active parameter, they mean the ratio of benchmark performance to active parameter count is high. Concretely: 760M active parameters, AIME-2025 and HMMT-2025 scores at the level of Claude 4.5 Sonnet on the configurations Zyphra ran (with their Markovian-RSA test-time compute scheme on top — more on that caveat later).

Why active count matters for the person serving this thing: the matrix multiplies in the forward pass scale with active params, not total. An 8.4B MoE with 760M active costs roughly as much compute per token as a dense 760M model, plus a small router overhead, but all 8.4B parameters still have to sit in memory. You pay for capacity in VRAM and for throughput in active parameters. ZAYA1-8B keeps the second bill small.
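
A back-of-envelope version of that trade-off, as a rough sketch. It assumes about 2 FLOPs per active parameter per token (one multiply and one add) and bf16 weights at 2 bytes each; both are common rules of thumb, not figures from Zyphra, and the function name is made up for illustration.

```python
def moe_serving_estimate(total_params, active_params, bytes_per_param=2):
    """Rough serving cost for an MoE.

    Assumes ~2 FLOPs per active parameter per token and ignores
    attention-score FLOPs, the KV cache, and activation memory.
    """
    weight_memory_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2 * active_params
    return weight_memory_gb, flops_per_token

# Parameter counts from the article; bf16 weights are an assumption.
moe_mem, moe_flops = moe_serving_estimate(8.4e9, 760e6)
dense_mem, dense_flops = moe_serving_estimate(8.4e9, 8.4e9)

print(f"weights: {moe_mem:.1f} GB")
print(f"MoE GFLOPs/token: {moe_flops / 1e9:.2f}")
print(f"dense 8.4B vs MoE compute: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions the weights take about 16.8 GB either way, while the MoE does roughly 11x fewer matmul FLOPs per token than a dense 8.4B model would.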

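Here's what a router looks like, as a small NumPy sketch:

import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def topk(probs, k):
    # Indices of the k largest probabilities, highest first.
    return np.argsort(probs)[::-1][:k]

# Per token, per MoE layer.
# Standard "linear" router: one matmul, softmax, top-k.
def linear_router(hidden, W_route, k=2):
    logits = hidden @ W_route        # shape: [num_experts]
    probs  = softmax(logits)
    top_k_experts = topk(probs, k)   # which experts to send this token to
    return top_k_experts

# Zyphra's MLP router: a small MLP instead of a single matmul.
def mlp_router(hidden, W1, W2, k=2):
    h      = np.maximum(hidden @ W1, 0.0)  # small hidden layer with ReLU
    logits = h @ W2                        # shape: [num_experts]
    probs  = softmax(logits)
    return topk(probs, k)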

This is illustrative — see the technical report for actual shapes, normalization, and gating details.

The three architectural ideas

Compressed Convolutional Attention (CCA). Zyphra's attention variant down-projects queries, keys, and values into a smaller shared latent space and runs the entire attention operation there, with small convolutions mixing local context into the compressed representations. The intuition: attention does not need the full model width to compute useful scores. Working in the compressed space cuts attention parameters, KV-cache size, and attention FLOPs by roughly the compression factor, while the cost stays quadratic in sequence length. The original CCA paper is at arXiv:2510.04476.

MLP-based expert router. The standard MoE router is a single matrix multiply: hidden state in, expert logits out, softmax + top-k. Zyphra replaces it with a small two-layer MLP. They report that the nonlinear router gives smoother gradients during training and reduces expert collapse, the failure mode where the router learns to send almost every token to the same few experts. By Zyphra's account, it's a small change with an outsized effect on training stability.
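
Expert collapse is easy to spot if you log which experts each token was routed to. The sketch below is a generic diagnostic, not something taken from Zyphra's report: it computes each expert's share of routed slots and a normalized entropy of that distribution. The function names and the toy assignments are made up for illustration.

```python
from collections import Counter
import math

def expert_load(assignments, num_experts):
    """Fraction of routed slots that went to each expert.

    assignments: one list of chosen expert indices per token.
    """
    counts = Counter(e for chosen in assignments for e in chosen)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def load_entropy(load):
    """Normalized entropy of the load: 1.0 is perfectly balanced, lower is more collapsed."""
    h = -sum(p * math.log(p) for p in load if p > 0)
    return h / math.log(len(load))

# Toy routing logs for 4 tokens, 4 experts, top-2 routing.
balanced = [[0, 1], [2, 3], [0, 2], [1, 3]]
collapsed = [[0, 1], [0, 1], [0, 1], [0, 1]]

print(load_entropy(expert_load(balanced, 4)))
print(load_entropy(expert_load(collapsed, 4)))
```

In the collapsed log, experts 2 and 3 receive no tokens at all, so their parameters add to the total count without contributing anything.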

Learned residual scaling. In a deep Transformer, the residual stream grows in norm as you stack layers; each block writes onto it. Zyphra adds one learned scalar per block that multiplies the residual contribution before it lands back in the stream. Cost: one parameter per block, basically free. Effect: explicit control over how much each layer is allowed to perturb the residual, which keeps gradients better-behaved at depth.
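
A minimal sketch of the idea, assuming the scalar multiplies the block output before the residual add (x <- x + alpha * f(x)). The exact placement and initialization in ZAYA1-8B are not given here, `toy_block` is a hypothetical stand-in for an attention or MLP block, and alpha is a fixed float rather than a trained parameter so its effect on the stream norm is easy to see.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def scaled_residual_block(stream, block_fn, alpha):
    """Apply one block with a scalar on its contribution: x <- x + alpha * f(x).

    In a real model alpha would be a trainable parameter, one per block.
    """
    update = block_fn(stream)
    return [x + alpha * u for x, u in zip(stream, update)]

def toy_block(stream):
    # Hypothetical block: writes a copy of its input back onto the stream.
    return list(stream)

def run_stack(alpha, depth=8):
    # Push a unit vector through `depth` blocks and report the final norm.
    stream = [1.0, 0.0]
    for _ in range(depth):
        stream = scaled_residual_block(stream, toy_block, alpha)
    return l2_norm(stream)

print(run_stack(alpha=1.0))
print(run_stack(alpha=0.25))
```

With this toy block, an unscaled stack doubles the stream norm at every layer, while alpha = 0.25 grows it by only 1.25x per layer. A learned alpha lets training pick that rate per block.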

And then there's the AMD-only training story

ZAYA1-8B was pretrained, midtrained, and supervised fine-tuned (SFT) on 1,024 AMD Instinct MI300X nodes with the AMD Pensando Pollara interconnect, on a cluster co-built with IBM. Not ported. Not fine-tuned on AMD after pretraining on Nvidia. The whole run.

Why this matters: the AI training stack has been Nvidia-shaped for years. CUDA, NCCL, all the tooling, all the half-precision tricks, all the kernel libraries. ROCm has existed for a while, but "ROCm at frontier scale" was an asterisk on most papers. ZAYA1-8B is one of the first public datapoints showing the AMD stack can produce a competitive MoE at the 1,000-node level.

What this changes for builders

Not much today; you're not swapping your serving stack overnight. But three small things follow. First, the intelligence-density framing is the right way to look at MoE choice: compare on active params, not total, and remember that an MoE needs more VRAM than a dense model with the same active count. Second, the architectural ideas are likely to show up in other releases: both CCA and learned residual scaling are small enough to lift into an existing codebase, and the router change is a one-week experiment for anyone training a small MoE from scratch. Third, the AMD story matters for compute pricing: if MI300X clusters become a credible second source, competition should put downward pressure on training and serving costs.

Caveats and open questions

  • The headline result that approaches Claude 4.5 Sonnet and GPT-5 High on HMMT-25 uses Markovian RSA, Zyphra's test-time-compute scheme that generates multiple parallel traces and recursively aggregates them. Without that extra inference budget, ZAYA1-8B is still strong but the gap to frontier models widens.
  • The benchmarks Zyphra leans on are math (AIME, HMMT), coding (LCB), science reasoning (GPQA-Diamond), and instruction following (IFEval). Generalist chat and long-context evals are not the focus of this release.
  • We don't know yet how much of the gain is the architecture vs. the post-training pipeline (five stages: SFT, reasoning warmup, RLVE-Gym, math/code RL, RLHF/RLAIF). The ablations in the technical report are where to look.

The full picture — routing-stability ablations, the cluster build, and the Markovian-RSA training setup — is in the arXiv technical report (2605.05365). Weights are at huggingface.co/Zyphra/Zaya1-8B. This is the kind of release worth reading the report on, not just the blog.
