ZAYA1-8B: The 760M-Active MoE That Trained Entirely on AMD
The dominant narrative in frontier model training is simple: you need Nvidia. H100s, H200s, B200s, NVLink, CUDA. Everything else is a distant second. Zyphra just published a counterpoint.
On May 6, the San Francisco lab released ZAYA1-8B, an 8.4-billion-parameter Mixture-of-Experts language model with 760 million active parameters per token that was pretrained, midtrained, and supervised fine-tuned end-to-end on AMD hardware. The full training run used a cluster of 1,024 AMD Instinct MI300X nodes wired together with AMD's Pensando Pollara interconnect, built in partnership with IBM.
The headline numbers are louder than the hardware story, though. ZAYA1-8B's reasoning variant scores 89.6 on HMMT'25, edging out Claude 4.5 Sonnet (88.3) and beating GPT-5-High on the same benchmark. For a model that activates fewer parameters per token than a single layer of Llama 3 70B, that is not a normal result.
Architecture: three changes that compound
ZAYA1-8B is built on what Zyphra calls MoE++, a stack of three architectural changes that compound rather than substitute.
The first is Compressed Convolutional Attention (CCA), a sequence-mixing mechanism that operates in a compressed latent space. Zyphra reports an 8× KV-cache compression versus standard attention — meaning the model holds the same effective context using roughly an eighth of the inference-time memory. For an 8B-class model serving long-context workloads, that math directly translates to either longer windows or cheaper serving on the same hardware.
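To make the 8× figure concrete, here is a back-of-envelope sketch of KV-cache size for one sequence. The layer count, head count, head dimension, and context length are hypothetical placeholders, not ZAYA1-8B's published configuration; only the 8× compression ratio comes from Zyphra's report.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, compression=1):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    uncompressed = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return uncompressed // compression

# Hypothetical shape for illustration only, NOT ZAYA1-8B's real config.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131_072)

baseline = kv_cache_bytes(**shape)                    # standard attention, 16-bit cache
compressed = kv_cache_bytes(**shape, compression=8)   # 8x ratio reported by Zyphra

GIB = 1024 ** 3
print(f"baseline:   {baseline / GIB:.1f} GiB")
print(f"compressed: {compressed / GIB:.1f} GiB")
```

With these made-up dimensions, a 128k-token sequence drops from 16 GiB of cache to 2 GiB, which is the difference between one long conversation per GPU and several.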
The second is a new router. Standard MoE designs use a single linear layer to decide which experts handle each token. ZAYA1's router is an MLP — a multi-layer perceptron — which Zyphra says "substantially increases" routing expressiveness and improves routing stability under depth. A bad router is one of the most common ways a sparse model loses performance to its dense equivalent; Zyphra is betting that a richer router is worth the small parameter overhead.
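The difference between the two router styles can be sketched in a few lines of plain Python. This is an illustration of the general idea, not Zyphra's implementation: the dimensions, the ReLU activation, the single hidden layer, and the top-2 selection are all assumptions made for the example.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def top_k(probs, k):
    """Pick the k most probable experts and renormalize their weights."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return [(i, probs[i] / total) for i in idx]

def linear_router(x, W):
    # Standard design: one linear map from token state to expert logits.
    return softmax(matvec(W, x))

def mlp_router(x, W1, W2):
    # Richer design: a hidden layer with a nonlinearity before the logits.
    hidden = [max(0.0, h) for h in matvec(W1, x)]  # ReLU is an assumption
    return softmax(matvec(W2, hidden))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 16, 32, 8, 2

x = [rng.gauss(0, 1) for _ in range(D_MODEL)]       # one token's hidden state
W = rand_matrix(N_EXPERTS, D_MODEL)
W1 = rand_matrix(D_HIDDEN, D_MODEL)
W2 = rand_matrix(N_EXPERTS, D_HIDDEN)

linear_choice = top_k(linear_router(x, W), TOP_K)
mlp_choice = top_k(mlp_router(x, W1, W2), TOP_K)
print("linear router:", linear_choice)
print("MLP router:   ", mlp_choice)
```

The extra parameters live in `W1` and `W2`, which are tiny next to the experts themselves; that is the "small parameter overhead" in the trade-off.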
The third is learned residual scaling, which controls how the residual norm grows through the layers. The cost is "negligible parameter and FLOP," in Zyphra's words; the payoff is more stable training at depth.
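A toy simulation shows why scaling the residual stream matters. The update rule `h = alpha * h + beta * block(h)` and the fixed values of `alpha` and `beta` are assumptions standing in for whatever parameterization Zyphra actually learns, and the random "block" is a placeholder for an attention or expert layer.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def run(depth, alpha, beta, dim=64, seed=0):
    """Track the residual-stream norm through `depth` toy layers."""
    rng = random.Random(seed)
    h = [rng.gauss(0, 1) for _ in range(dim)]
    norms = []
    for _ in range(depth):
        # Placeholder for a layer's output: a random update vector.
        update = [rng.gauss(0, 1) for _ in range(dim)]
        # alpha and beta stand in for learned per-layer scales (assumption).
        h = [alpha * hi + beta * ui for hi, ui in zip(h, update)]
        norms.append(norm(h))
    return norms

plain = run(depth=40, alpha=1.0, beta=1.0)    # ordinary residual connection
scaled = run(depth=40, alpha=0.9, beta=0.5)   # scaled residual connection

print(f"norm after 40 layers, plain:  {plain[-1]:.1f}")
print(f"norm after 40 layers, scaled: {scaled[-1]:.1f}")
```

In the plain case the norm keeps growing with depth, while the scaled case settles to a steady level. Two scalars per layer is the kind of overhead the "negligible" description implies.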
| Component | Role | Why it matters |
|---|---|---|
| Compressed Convolutional Attention | Sequence mixing in latent space | 8× KV-cache compression at inference |
| MLP-based router | Expert selection | More stable routing, better expert utilization |
| Learned residual scaling | Depth normalization | Steady training without exotic tricks |
None of these is exotic on its own. Together, they're a deliberate bet that intelligence per FLOP is the metric to optimize — not raw parameter count.
Markovian RSA: the test-time-compute story
Architecture is half the story. The other half is Markovian RSA, the test-time-compute (TTC) scheme Zyphra co-designed with the model.
The idea fuses two recent threads in reasoning research. From RSA, it borrows recursive self-aggregation — generate many candidate traces in parallel, then aggregate them into a fresh seed for the next round. From the Markovian thinker, it borrows chunked reasoning — keep the model's working context window bounded by only carrying the tail of each chunk forward.
"Rollout generation can be done in parallel taking advantage of batching, while the Markovian chunking strategy ensures that no matter how long the model reasons for the intermediate chain-of-thoughts, the context length always remains bounded." — Zyphra technical writeup
In practice, Zyphra reports the headline configuration uses a 40k-token budget for intermediate reasoning and carries only the last 4k tokens into the next iteration. At that setting ZAYA1-8B closes in on DeepSeek-V3.2 and Qwen3-A22B and lands within a few points of GPT-5-High. With an "extra-high-TTC" configuration burning 5.5M tokens per problem, it surpasses both DeepSeek-V3.2 and GPT-OSS-120B (high) on the APEX-shortlist mathematics benchmark.
Two things are worth noting about that result. First, the per-problem token budget is enormous — Markovian RSA is not a free lunch, it's a way of spending inference compute that small models couldn't previously spend usefully. Second, Zyphra reports that simply applying the same Markovian-RSA harness to Qwen3-4B-Thinking-2507 yielded "substantially less" uplift. The model has to be trained to live inside the harness for it to pay out — a point worth remembering when evaluating any reasoning-scaffold paper.
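The control flow described in this section can be sketched as a short loop. This is a minimal reading of the article's description, not Zyphra's harness: `generate` and `aggregate` are hypothetical callables standing in for model calls, the toy versions below emit fake token ids, and the round and rollout counts are arbitrary. Only the 40k chunk budget and 4k carry-over come from the reported configuration.

```python
import random

def markovian_rsa(prompt, generate, aggregate, n_rounds=3, n_rollouts=4,
                  chunk_budget=40_000, carry_tokens=4_000):
    """Bounded-context recursive aggregation; returns (seed, peak_context)."""
    seed = []          # state carried between rounds
    peak_context = 0   # largest context any single rollout reached
    for _ in range(n_rounds):
        # Rollouts are independent, so a real system would batch them in parallel.
        rollouts = [generate(prompt, seed, chunk_budget) for _ in range(n_rollouts)]
        longest = max(len(r) for r in rollouts)
        peak_context = max(peak_context, len(prompt) + len(seed) + longest)
        # Markovian step: keep only the tail of each reasoning chunk.
        tails = [r[-carry_tokens:] for r in rollouts]
        # RSA step: fold the candidates into one fresh seed for the next round.
        seed = aggregate(prompt, tails)[-carry_tokens:]
    return seed, peak_context

rng = random.Random(0)

def toy_generate(prompt, seed, budget):
    # Stand-in for a model call: between budget/2 and budget fake token ids.
    return [rng.randrange(50_000) for _ in range(rng.randint(budget // 2, budget))]

def toy_aggregate(prompt, tails):
    # Stand-in for model-driven aggregation: simply keep the longest tail.
    return max(tails, key=len)

prompt = list(range(200))   # pretend 200-token problem statement
seed, peak = markovian_rsa(prompt, toy_generate, toy_aggregate)
bound = len(prompt) + 4_000 + 40_000
print(f"carried state: {len(seed)} tokens, peak context: {peak}, bound: {bound}")
```

However many rounds run, the peak context never exceeds prompt plus carry-over plus one chunk. Total tokens spent, by contrast, grow with rounds times rollouts, which is where the multi-million-token budgets come from.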
Why the AMD detail actually matters
It's tempting to read "trained on AMD" as a press detail. It isn't, for two reasons.
The first is supply. The Nvidia data-center waitlist is the largest constraint on AI training right now. A working end-to-end pipeline on MI300X (including pretraining, midtraining, SFT, and a multi-stage RL pipeline) means that labs with Instinct allocations can now do real frontier work, not just benchmark runs.
The second is the cluster design itself. The 1,024-node training run used AMD's Pensando Pollara interconnect instead of Nvidia's NVLink/InfiniBand stack. Networking is the part of large-scale training people most often gloss over, and the part most likely to silently bottleneck a run. Zyphra's choice to publish both the ZAYA1-base technical report and the full ZAYA1-8B paper means the recipe is reproducible, not just rumored.
Post-training: a five-stage pipeline
The post-training pipeline is where ZAYA1-8B does most of its earning. Five stages run sequentially:
- SFT on chat, instruction-following, code, math, and test-time-compute prompts
- Reasoning warmup combining math, logic, and TTC-aware prompts so the model learns to self-aggregate candidate solutions
- RLVE-Gym — a large-scale RL phase using dynamically-adjusted puzzle difficulty to drill core reasoning circuits
- Math and code RL to deepen the model's most verifiable skills
- Lightweight RLHF / RLAIF for chat behavior, style, and less-verifiable rewards
Most of the post-training delta lands in math and coding, which lines up with the benchmark table. Zyphra notes smaller bumps in MMLU, GPQA, and creative writing — areas where verifiable reward is harder to build.
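The "dynamically-adjusted puzzle difficulty" in the RLVE-Gym stage above can be pictured as a simple controller that keeps problems neither too easy nor too hard. The thresholds, level range, and toy solver below are hypothetical; the article does not describe Zyphra's actual adjustment rule.

```python
import math

def adjust_difficulty(level, success_rate, low=0.3, high=0.7,
                      min_level=1, max_level=10):
    """Raise difficulty when the policy succeeds often, lower it when it struggles."""
    if success_rate > high:
        return min(level + 1, max_level)
    if success_rate < low:
        return max(level - 1, min_level)
    return level

def toy_success_rate(skill, level):
    # Hypothetical solver: success falls off as the level exceeds its skill.
    return 1.0 / (1.0 + math.exp(level - skill))

level, skill = 1, 5.0
history = []
for _ in range(12):
    rate = toy_success_rate(skill, level)
    level = adjust_difficulty(level, rate)
    history.append(level)
    skill += 0.1   # pretend the policy improves a little each step

print(history)
```

The point of such a controller is to keep the reward signal informative: puzzles the model always solves or never solves teach it little.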
How to try it
ZAYA1-8B is Apache-2.0 and available two ways:
- Hugging Face weights: Zyphra/ZAYA1-8B
- Serverless endpoint: Zyphra Cloud
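For local inference, the full 8.4B parameters take roughly 17 GB at 16-bit precision, so the model fits comfortably in the memory budget of a single 80GB GPU, and CCA's compression keeps the KV cache small as context grows. The small active-parameter count lowers compute per token rather than weight memory, since every expert still has to be loaded. The catch: Markovian RSA inference is not a one-shot call. Reproducing the headline math results requires the aggregation harness, and Zyphra has signaled the recipe is in the technical report rather than baked into a simple generate().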
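The single-GPU claim is easy to sanity-check with arithmetic. The parameter counts come from the article; the precisions are assumptions about how the weights might be stored, and the estimate covers weights only, ignoring KV cache, activations, and framework overhead.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 8.4e9    # all experts must be resident in memory
ACTIVE_PARAMS = 760e6   # used per token; drives compute, not weight memory
GPU_MEMORY_GB = 80

for label, nbytes in [("16-bit", 2), ("8-bit", 1)]:
    needed = weight_memory_gb(TOTAL_PARAMS, nbytes)
    spare = GPU_MEMORY_GB - needed
    print(f"{label}: {needed:.1f} GB of weights, {spare:.1f} GB left over")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.1%} of total parameters")
```

Roughly 9% of the parameters do the work on any given token, which is why per-token compute resembles a sub-1B dense model even though the memory footprint is that of an 8B one.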
The Bottom Line
ZAYA1-8B is not the most capable model released this month, and Zyphra is not pretending otherwise. It is, however, the cleanest evidence to date that frontier-grade reasoning can be trained on non-Nvidia silicon, at small active-parameter budgets, with a co-designed reasoning harness doing most of the heavy lifting at inference.
Two open questions follow. First, whether AMD's ecosystem can attract more labs of Zyphra's caliber off the back of this. Second, whether Markovian-RSA-style trained-in harnesses become the default way small models punch above their weight, or remain a specialty trick.
Either way, "you need Nvidia to do this" got slightly less true on May 6.


