ZAYA1-8B: Zyphra's 760M-Active MoE Trained on AMD

Zyphra's ZAYA1-8B is an 8.4-billion-parameter MoE model with 760 million active parameters per token, trained end-to-end on AMD hardware.

May 24, 2026

ZAYA1-8B: The 760M-Active MoE That Trained Entirely on AMD

The dominant narrative in frontier model training is simple: you need Nvidia. H100s, H200s, B200s, NVLink, CUDA. Everything else is a distant second. Zyphra just published a counterpoint.

On May 6, the San Francisco lab released ZAYA1-8B, an 8.4-billion-parameter Mixture-of-Experts language model with 760 million active parameters per token that was pretrained, midtrained, and supervised fine-tuned end-to-end on AMD hardware. The full training run used a cluster of 1,024 AMD Instinct MI300X nodes connected by AMD's Pensando Pollara interconnect, built in partnership with IBM.

The headline numbers are louder than the hardware story, though. ZAYA1-8B's reasoning variant scores 89.6 on HMMT'25, edging out Claude Sonnet 4.5 (88.3) and beating GPT-5-High on the same benchmark. For a model that activates fewer parameters per token than a single layer of Llama 3 70B, that is not a normal result.
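The single-layer comparison can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama 3 70B shape (hidden size 8192, feed-forward size 28672, 64 query heads, 8 key/value heads) and ignores the small normalization weights; the function name is illustrative.

```python
def llama_layer_params(hidden=8192, ffn=28672, n_heads=64, n_kv_heads=8):
    """Approximate weight count of one Llama-style decoder layer (norms ignored)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim                      # grouped-query attention shrinks K and V
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim    # Q and O, then K and V projections
    mlp = 3 * hidden * ffn                              # gate, up, and down projections
    return attn + mlp

layer = llama_layer_params()
zaya_active = 760_000_000                               # active parameters per token, from the article
print(f"Llama 3 70B layer: {layer / 1e6:.0f}M, ZAYA1-8B active: {zaya_active / 1e6:.0f}M")
```

Under those assumptions one Llama 3 70B layer holds roughly 856M weights, so the 760M active figure does come in below it.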

Architecture: three changes that compound

ZAYA1-8B is built on what Zyphra calls MoE++, a stack of three architectural changes that compound rather than substitute.

The first is Compressed Convolutional Attention (CCA), a sequence-mixing mechanism that operates in a compressed latent space. Zyphra reports an 8× KV-cache compression versus standard attention, meaning the model holds the same context in roughly an eighth of the KV-cache memory. For an 8B-class model serving long-context workloads, that translates directly into either longer windows or cheaper serving on the same hardware.
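To see what an 8× reduction buys, here is a rough KV-cache calculation. The layer count, head count, head dimension, and context length are hypothetical 8B-class values chosen for illustration, not ZAYA1-8B's actual configuration, and the compression is modeled as a flat divisor.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, compression=1):
    """Bytes needed to cache keys and values for one sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = keys and values
    return seq_len * per_token // compression

# Hypothetical 8B-class shape; NOT ZAYA1-8B's real configuration.
shape = dict(seq_len=131_072, n_layers=32, n_kv_heads=8, head_dim=128)
baseline = kv_cache_bytes(**shape)
compressed = kv_cache_bytes(**shape, compression=8)
print(f"baseline:   {baseline / 2**30:.1f} GiB")    # 16.0 GiB
print(f"compressed: {compressed / 2**30:.1f} GiB")  # 2.0 GiB
```

Read the other way, a fixed cache budget would hold a context about eight times longer, which is the trade the paragraph above describes.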

The second is a new router. Standard MoE designs use a single linear layer to decide which experts handle each token. ZAYA1's router is an MLP — a multi-layer perceptron — which Zyphra says "substantially increases" routing expressiveness and improves routing stability under depth. A bad router is one of the most common ways a sparse model loses performance to its dense equivalent; Zyphra is betting that a richer router is worth the small parameter overhead.
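The difference between the two router styles can be sketched in a few lines of plain Python. This is a toy illustration, not Zyphra's implementation: the sizes, the tanh nonlinearity, the random weights, and all function names are assumptions.

```python
import math
import random

def matvec(w, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def top_k(scores, k):
    """Indices of the k highest-scoring experts, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def linear_router(x, w, k):
    """Conventional router: one linear map from token to expert logits."""
    return top_k(matvec(w, x), k)

def mlp_router(x, w1, w2, k):
    """MLP router: a hidden layer with a nonlinearity before the expert logits."""
    hidden = [math.tanh(h) for h in matvec(w1, x)]
    return top_k(matvec(w2, hidden), k)

def rand_matrix(rows, cols, rng):
    """Random Gaussian weights standing in for trained parameters."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden, n_experts, k = 8, 16, 4, 1   # illustrative sizes only
token = [rng.gauss(0, 1) for _ in range(d_model)]
print(linear_router(token, rand_matrix(n_experts, d_model, rng), k))
print(mlp_router(token, rand_matrix(d_hidden, d_model, rng),
                 rand_matrix(n_experts, d_hidden, rng), k))
```

The overhead is the extra hidden layer: d_model × d_hidden + d_hidden × n_experts router weights instead of d_model × n_experts, which stays small next to the experts themselves.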

The third is learned residual scaling, which controls how the residual-stream norm grows through the layers. The cost is "negligible parameter and FLOP," in Zyphra's words; the payoff is more stable training at depth.
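The source does not spell out Zyphra's exact formulation, so the following is only a generic sketch of residual scaling: each layer's update is multiplied by a scalar before being added back. In a real model that scalar would be a trained parameter; here it is fixed, and the "layer" is a toy stand-in, to show the effect on norm growth.

```python
import math

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def block(x):
    """Stand-in for a layer: an update orthogonal to x with the same norm."""
    a, b = x
    return [-b, a]

def run(depth, alpha):
    """Apply x <- x + alpha * block(x) for `depth` layers and return the final norm."""
    x = [1.0, 0.0]
    for _ in range(depth):
        update = block(x)
        x = [xi + alpha * ui for xi, ui in zip(x, update)]
    return norm(x)

print(run(32, alpha=1.0))   # unscaled residuals: norm grows as 2 ** (depth / 2)
print(run(32, alpha=0.1))   # scaled residuals: norm stays close to 1
```

One scalar per layer is consistent with the "negligible parameter and FLOP" description, though the actual parameterization may differ.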

Component	Role	Why it matters
Compressed Convolutional Attention	Sequence mixing in latent space	8× KV-cache compression at inference
MLP-based router	Expert selection	More stable routing, better expert utilization
Learned residual scaling	Depth normalization	Steady training without exotic tricks

None of these is exotic on its own. Together, they're a deliberate bet that intelligence per FLOP is the metric to optimize — not raw parameter count.

Markovian RSA: the test-time-compute story

Architecture is half the story. The other half is Markovian RSA, the test-time-compute (TTC) scheme Zyphra co-designed with the model.

The idea fuses two recent threads in reasoning research. From RSA, it borrows recursive self-aggregation — generate many candidate traces in parallel, then aggregate them into a fresh seed for the next round. From the Markovian thinker, it borrows chunked reasoning — keep the model's working context window bounded by only carrying the tail of each chunk forward.

"Rollout generation can be done in parallel taking advantage of batching, while the Markovian chunking strategy ensures that no matter how long the model reasons for the intermediate chain-of-thoughts, the context length always remains bounded." — Zyphra technical writeup

In practice, Zyphra reports the headline configuration uses a 40k-token budget for intermediate reasoning and carries only the last 4k tokens into the next iteration. At that setting ZAYA1-8B closes in on DeepSeek-V3.2 and Qwen3-A22B and lands within a few points of GPT-5-High. With an "extra-high-TTC" configuration burning 5.5M tokens per problem, it surpasses both DeepSeek-V3.2 and GPT-OSS-120B (high) on the APEX-shortlist mathematics benchmark.
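The control flow described above can be sketched as a loop. This is a simplified reading of the description, not Zyphra's harness: the `generate` and `aggregate` callables, the candidate and round counts, and the stub functions are all hypothetical; only the 40k budget and 4k carry come from the article.

```python
import random

CHUNK_BUDGET = 40_000   # tokens of intermediate reasoning per round (from the article)
CARRY_TOKENS = 4_000    # tail carried into the next round (from the article)

def markovian_rsa(problem, generate, aggregate, n_candidates=4, n_rounds=3):
    """Bounded-context reasoning loop: parallel rollouts, keep tails, aggregate, repeat."""
    seed = []                                   # aggregated state carried between rounds
    for _ in range(n_rounds):
        # Each rollout sees only the problem plus the bounded seed, never the full history.
        rollouts = [generate(problem, seed, CHUNK_BUDGET) for _ in range(n_candidates)]
        tails = [r[-CARRY_TOKENS:] for r in rollouts]   # Markovian step: keep only the tail
        seed = aggregate(problem, tails)[-CARRY_TOKENS:]
    return seed

rng = random.Random(0)

def fake_generate(problem, seed, budget):
    """Stub model call: emits `budget` placeholder token ids."""
    return [rng.randrange(100) for _ in range(budget)]

def fake_aggregate(problem, tails):
    """Stub aggregator; a real system would prompt the model with the candidate tails."""
    return max(tails, key=sum)

final = markovian_rsa("toy problem", fake_generate, fake_aggregate)
print(len(final))
```

However many rounds run, the context handed to the model never exceeds the problem plus `CARRY_TOKENS` of seed, while the rollouts inside a round are independent and can be batched.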

Two things are worth noting about that result. First, the per-problem token budget is enormous — Markovian RSA is not a free lunch, it's a way of spending inference compute that small models couldn't previously spend usefully. Second, Zyphra reports that simply applying the same Markovian-RSA harness to Qwen3-4B-Thinking-2507 yielded "substantially less" uplift. The model has to be trained to live inside the harness for it to pay out — a point worth remembering when evaluating any reasoning-scaffold paper.

Why the AMD detail actually matters

It's tempting to read "trained on AMD" as a press detail. It isn't, for two reasons.

The first is supply. Access to Nvidia data-center GPUs is one of the largest constraints on AI training right now. A working end-to-end pipeline on MI300X, covering pretraining, midtraining, SFT, and a multi-stage RL pipeline, means that labs with Instinct allocations can now do real frontier work, not just benchmark runs.

The second is the cluster design itself. The 1,024-node training run used AMD's Pensando Pollara interconnect instead of Nvidia's NVLink/InfiniBand stack. Networking is the part of large-scale training people most often gloss over, and the part most likely to silently bottleneck a run. Zyphra's choice to publish both the ZAYA1-base technical report and the full ZAYA1-8B paper means the recipe is reproducible, not just rumored.

Post-training: a five-stage pipeline

The post-training pipeline is where ZAYA1-8B does most of its earning. Five stages run sequentially:

1. SFT on chat, instruction-following, code, math, and test-time-compute prompts
2. Reasoning warmup combining math, logic, and TTC-aware prompts so the model learns to self-aggregate candidate solutions
3. RLVE-Gym: a large-scale RL phase using dynamically-adjusted puzzle difficulty to drill core reasoning circuits
4. Math and code RL to deepen the model's most verifiable skills
5. Lightweight RLHF / RLAIF for chat behavior, style, and less-verifiable rewards
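The article says only that the RLVE-Gym stage adjusts puzzle difficulty dynamically. One simple way such a scheme could work is a success-rate controller like the sketch below; the class name, window size, and thresholds are hypothetical and not taken from Zyphra's report.

```python
from collections import deque

class DifficultyController:
    """Raise puzzle difficulty when the policy succeeds often, lower it when it struggles."""

    def __init__(self, level=1, window=20, raise_above=0.8, lower_below=0.3, max_level=10):
        self.level = level
        self.results = deque(maxlen=window)
        self.raise_above = raise_above
        self.lower_below = lower_below
        self.max_level = max_level

    def record(self, solved):
        """Log one verified episode outcome and adjust the level once the window is full."""
        self.results.append(bool(solved))
        if len(self.results) < self.results.maxlen:
            return self.level
        rate = sum(self.results) / len(self.results)
        if rate > self.raise_above and self.level < self.max_level:
            self.level += 1
            self.results.clear()        # start a fresh window at the new level
        elif rate < self.lower_below and self.level > 1:
            self.level -= 1
            self.results.clear()
        return self.level

ctrl = DifficultyController(window=5)
for _ in range(5):
    ctrl.record(True)
print(ctrl.level)    # 2: five straight successes raise the level
for _ in range(5):
    ctrl.record(False)
print(ctrl.level)    # 1: five straight failures lower it again
```

The aim of any such controller is to keep episodes near the edge of what the policy can solve, where a verifiable reward carries the most signal.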

Most of the post-training delta lands in math and coding, which lines up with the benchmark table. Zyphra notes smaller bumps in MMLU, GPQA, and creative writing — areas where verifiable reward is harder to build.

How to try it

ZAYA1-8B is released under the Apache-2.0 license and is available two ways:

Hugging Face weights: Zyphra/ZAYA1-8B
Serverless endpoint: Zyphra Cloud

For local inference, the model fits comfortably on a single 80GB GPU: the full 8.4B parameters occupy roughly 17 GB in 16-bit precision, CCA's compression keeps the KV cache small, and the small active-parameter count keeps per-token compute low. The catch: Markovian RSA inference is not a one-shot call. Reproducing the headline math results requires the aggregation harness, and Zyphra has signaled the recipe is in the technical report rather than baked into a simple generate().
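A quick estimate of the weight footprint, using only the parameter counts quoted in this article. All experts must be resident in memory, so it is the total count, not the active count, that sets the floor. The precisions listed are illustrative assumptions, not a statement about which formats Zyphra ships.

```python
def weight_gib(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return total_params * bytes_per_param / 2**30

TOTAL = 8.4e9      # total parameters, from the article
ACTIVE = 760e6     # active parameters per token, from the article

for name, nbytes in [("bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:>5}: {weight_gib(TOTAL, nbytes):.1f} GiB")
print(f"active fraction: {ACTIVE / TOTAL:.1%}")
```

At 16-bit precision the weights come to under 16 GiB, leaving most of an 80GB card for KV cache and batching. That headroom matters because a Markovian RSA run generates many rollouts in parallel.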

The Bottom Line

ZAYA1-8B is not the most capable model released this month, and Zyphra is not pretending otherwise. It is, however, the cleanest evidence to date that frontier-grade reasoning can be trained on non-Nvidia silicon, at small active-parameter budgets, with a co-designed reasoning harness doing most of the heavy lifting at inference.

Two interesting questions follow. First, can AMD's ecosystem attract more labs of Zyphra's caliber off the back of this? Second, will Markovian-RSA-style trained-in harnesses become the default way small models punch above their weight, or remain a specialty trick?

Either way, "you need Nvidia to do this" got slightly less true on May 6.

