
Zyphra Releases ZAYA1-8B-Diffusion-Preview: The First MoE Diffusion Model Converted From an Autoregressive LLM With Up to 7.7x Speedup

Tech | By Gavin Wallace | 15/05/2026 | 11 Mins Read

Zyphra, the San Francisco-based AI lab behind the ZAYA1 model family, has released ZAYA1-8B-Diffusion-Preview, a preview of its early work on diffusion language models. The release demonstrates that an autoregressive language model can be converted into a discrete diffusion model without sacrificing evaluation performance.

https://www.zyphra.com/post/zaya1-8b-diffusion-preview

The problem with autoregressive decoding

To see why this matters, it helps to know how most language models generate text today. Standard large language models are autoregressive: they decode one token at a time. For each new token, the attention mechanism looks back over all previously generated tokens and loads their stored representations, called the KV-cache, from GPU memory. Because each user in a batch has their own token history, the KV-cache must be loaded separately for each user and cannot be shared across requests.

This creates a bottleneck. When the GPU spends more time moving data than computing, the system becomes memory-bandwidth bound rather than compute-bound. That limits how efficiently modern GPU hardware, whose compute FLOPs have been scaling faster than memory bandwidth, can be used during inference.

Diffusion models offer an alternative. Instead of generating one token at a time, a diffusion model drafts N tokens simultaneously and refines them over multiple steps. The GPU is used more efficiently because all N tokens share the same KV-cache. ZAYA1-8B-Diffusion-Preview specifically performs a single-step transformation from mask to token for each token in the block: it predicts the unmasked token directly in one step rather than denoising iteratively.
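The bandwidth argument can be made concrete with a rough roofline calculation. The sketch below is illustrative only: the function names are hypothetical, the hardware and model numbers are round placeholders rather than measurements of ZAYA1 or any real GPU, and it assumes the bytes moved per forward pass stay constant as more tokens are drafted.

```python
def arithmetic_intensity(tokens_per_pass: int,
                         flops_per_token: float,
                         bytes_per_pass: float) -> float:
    """FLOPs performed per byte read from memory in one forward pass."""
    return tokens_per_pass * flops_per_token / bytes_per_pass


def is_memory_bound(intensity: float,
                    peak_flops: float,
                    peak_bandwidth: float) -> bool:
    """A pass is memory-bandwidth bound when its intensity is below the
    hardware ridge point (peak FLOP/s divided by peak bytes/s)."""
    return intensity < peak_flops / peak_bandwidth


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
FLOPS_PER_TOKEN = 2e9   # hypothetical compute per token
BYTES_PER_PASS = 4e9    # hypothetical weights + KV-cache read per pass
PEAK_FLOPS = 1e15       # hypothetical accelerator peak, FLOP/s
PEAK_BANDWIDTH = 5e12   # hypothetical memory bandwidth, bytes/s

one_token = arithmetic_intensity(1, FLOPS_PER_TOKEN, BYTES_PER_PASS)
block_of_16 = arithmetic_intensity(16, FLOPS_PER_TOKEN, BYTES_PER_PASS)

print(f"1 token/pass:   {one_token:.1f} FLOPs/byte")
print(f"16 tokens/pass: {block_of_16:.1f} FLOPs/byte")
print("1 token/pass memory-bound:",
      is_memory_bound(one_token, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumptions, drafting 16 tokens against the same cache load raises the work done per byte by a factor of 16, which is the sense in which extra drafted tokens are nearly free while the pass remains bandwidth-limited.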

Converting autoregression to diffusion without training from scratch

Training a diffusion language model from scratch is hard, and few known recipes exist. The Zyphra team gives two reasons for preferring conversion: first, that difficulty; second, training in diffusion mode offers no advantage because training is already compute-bound, and the memory-bandwidth bottleneck that diffusion addresses only appears at inference time. Since all of diffusion’s benefits arrive at inference time, an existing pretraining stack can be reused as is.

Building on the TiDAR recipe, Zyphra took the ZAYA1-8B base checkpoint and ran 600 billion tokens of diffusion-conversion mid-training at a 32k context length, followed by 500 billion tokens of native context extension to 128k. The model then went through a diffusion supervised fine-tuning (SFT) phase.
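The conversion stages can be written down as a small table to check the token budget. This is only a bookkeeping sketch: the `Stage` class and its field names are hypothetical, and the SFT stage is left without a token count because the article does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_billions: Optional[int]  # None where the article gives no figure
    context: Optional[str]          # context length as stated in the article


PIPELINE = [
    Stage("diffusion-conversion mid-training", 600, "32k"),
    Stage("native context extension", 500, "128k"),
    Stage("diffusion SFT", None, None),
]


def total_mid_training_tokens(stages) -> int:
    """Sum the stated token counts (in billions), skipping unstated ones."""
    return sum(s.tokens_billions for s in stages
               if s.tokens_billions is not None)


print(total_mid_training_tokens(PIPELINE), "billion tokens")
```

The two stated stages add up to 1,100 billion tokens, that is 1.1 trillion tokens of mid-training on top of the original ZAYA1-8B pretraining.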

ZAYA1-8B-Diffusion-Preview is the first MoE diffusion model converted from an autoregressive LLM, and the first diffusion language model trained on AMD GPUs. Zyphra reports minimal evaluation degradation compared with the base autoregressive checkpoint, along with gains on some benchmarks such as LCBv6. The researchers attribute the gains to improved mid-training datasets and to the greater expressivity of non-causal, diffusion-style attention within blocks compared with causal autoregression.

How the diffusion sampler works

During inference, ZAYA1-8B-Diffusion-Preview drafts 16 tokens simultaneously. A sampling scheme borrowed from speculative decoding then accepts a portion of these tokens. The advantage is that the same model serves as both speculator and verifier in a single forward pass, so there is no need to run two models as with methods such as EAGLE and dFlash. In heavily memory-bandwidth-bound regimes, almost every accepted token is a free speedup over autoregressive decoding: the data is already loaded on the GPU, and the extra tokens cost very little additional compute.

The Zyphra team has developed two different samplers that trade off speed and quality.

  • Lossless diffusion sampler: uses the standard speculative-decoding acceptance criterion min(1, p(x)/q(x)), where p is the distribution given by the autoregressive logits and q is the distribution given by the diffusion logits. On rejection, the next token is sampled from the residual distribution p(x) − q(x). This sampler achieves a 4.6x speedup with no systematic evaluation degradation.
  • Logit-mixing sampler: verification uses a mixture of the logits from the diffusion speculator and the autoregressive model. Acceptance rates improve because the verification logits look more like the diffusion logits, at some cost in quality. This sampler achieves a 7.7x speedup. The trade-off between quality and speed can be chosen at inference time.
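The two samplers can be illustrated on a single drafted token. This is a minimal sketch of the textbook speculative-sampling rule, not Zyphra's implementation: the function names are hypothetical, the residual is clipped at zero and renormalized as in standard speculative decoding, and the mixing weight `alpha` together with the choice to interpolate in logit space are assumptions, since the article does not specify how the logits are mixed.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify_token(token, p, q, rng):
    """Lossless check for one token that was drafted from q.

    p: verifier (autoregressive) probabilities
    q: drafter (diffusion) probabilities, with q[token] > 0
    Returns (accepted, token_to_emit).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    # Rejected: resample from the residual max(0, p - q); random.choices
    # normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    resampled = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, resampled


def mixed_verifier_probs(ar_logits, diffusion_logits, alpha=0.5):
    """Assumed logit mixing: interpolate the two logit vectors, then softmax.

    alpha = 0 recovers the autoregressive verifier (the lossless case);
    larger alpha pulls the verifier toward the drafter, raising acceptance.
    """
    mixed = [(1.0 - alpha) * a + alpha * d
             for a, d in zip(ar_logits, diffusion_logits)]
    return softmax(mixed)


rng = random.Random(0)
ar_logits = [2.0, 0.5, -1.0]
diffusion_logits = [1.0, 1.5, -1.0]
q = softmax(diffusion_logits)
drafted = rng.choices(range(3), weights=q, k=1)[0]
print(verify_token(drafted, softmax(ar_logits), q, rng))
print(verify_token(drafted, mixed_verifier_probs(ar_logits, diffusion_logits), q, rng))
```

When the verifier and drafter distributions coincide, the acceptance ratio is 1 and every draft is kept, which is why moving the verification distribution toward the diffusion logits increases acceptance while departing from the exact autoregressive distribution.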

One important caveat on these numbers: ZAYA1-8B-Diffusion-Preview is a base mid-train checkpoint that has not yet undergone RL training, so Zyphra reports pass@ evaluations rather than standard accuracy benchmarks to better represent the model’s potential after RL training. Keep this in mind when comparing its reported results with those of other models.
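For readers unfamiliar with pass@ metrics, the sketch below shows the commonly used unbiased pass@k estimator from code-generation evaluation. The article does not say which k, how many samples, or which estimator Zyphra used, so this is a general illustration rather than a description of their evaluation setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k samples include a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 5 correct, a single draw succeeds half the time.
print(pass_at_k(n=10, c=5, k=1))
```

Because pass@k rewards a model for producing a correct answer anywhere among k attempts, it generally reads higher than single-attempt accuracy, which is why such scores are not directly comparable with standard accuracy numbers.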

Zyphra also notes that the speedups from diffusion exceed those of alternatives such as MTP and speculative decoding methods like EAGLE3. TiDAR-style diffusion needs only a single forward pass of the model while achieving acceptance rates comparable to those of dFlash.


Architectural Details

ZAYA1-8B-Diffusion-Preview is a single-step speculative diffusion model that uses order-constrained generation: the diffusion model can only generate tokens in a contiguous subsequence starting from the prefix. This constraint improves training stability compared with unconstrained mask diffusion objectives or set block decoding, and it was a main reason Zyphra built on the TiDAR recipe.
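One way to picture the contiguity constraint at sampling time is that only an unbroken run of drafted tokens, starting right after the prefix, can be kept. The helper below is a hypothetical sketch of that convention as used in ordinary speculative decoding; the article does not describe how Zyphra's sampler handles tokens after the first rejection.

```python
def accepted_prefix(draft, accepted_flags):
    """Keep drafted tokens up to, but not including, the first rejection.

    draft: drafted token ids for one block
    accepted_flags: per-position verification results for that block
    """
    kept = []
    for token, ok in zip(draft, accepted_flags):
        if not ok:
            break  # everything after the first rejection is discarded
        kept.append(token)
    return kept


# Position 2 is rejected, so the accepted token at position 3 is dropped too.
print(accepted_prefix([5, 6, 7, 8], [True, True, False, True]))
```

Under this convention the output always extends the existing prefix without gaps, so the sequence can be continued from the first rejected position on the next pass.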

The model uses Zyphra’s CCA attention variant, originally developed for ZAYA1-8B. CCA sharply reduces the FLOPs of attention during prefill, which benefits diffusion because diffusion turns decoding into a prefill-like operation. As a result, the model can diffuse more tokens in parallel before it hits its compute limit.

The architecture uses CCGQA with a 4:1 ratio of query heads to key heads. Zyphra chose this design over MLA, whose high arithmetic intensity it judged a poor fit here. Because every token in a diffusion block reads the same KV-cache, arithmetic intensity scales with both the block size and the number of blocks in a forward pass. The system can support roughly three block-sized proposals per forward pass on AMD MI300x in bf16, rising to five on MI355x. CCGQA also operates at 2x compression, which helps Zyphra absorb the extra training FLOPs that TiDAR requires. AMD GPUs also offer larger VRAM capacity, which makes diffusion training more efficient.
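The figures in this paragraph translate into simple arithmetic. The sketch below is an interpretation, not something the article states: it treats each proposal as one 16-token block (the draft size given earlier) and reads the 4:1 head ratio as four query heads sharing each key/value head. Both function names are hypothetical.

```python
def draft_positions_per_pass(block_size: int, proposals: int) -> int:
    """Draft positions evaluated in one forward pass, assuming every
    proposal is a full block of `block_size` tokens."""
    return block_size * proposals


def kv_cache_fraction(query_heads_per_kv_head: int) -> float:
    """KV-cache size relative to a layout with one key/value head per
    query head, at the same head dimension."""
    return 1.0 / query_heads_per_kv_head


BLOCK_SIZE = 16  # tokens drafted at once, per the article

print("MI300x:", draft_positions_per_pass(BLOCK_SIZE, 3), "positions per pass")
print("MI355x:", draft_positions_per_pass(BLOCK_SIZE, 5), "positions per pass")
print("KV-cache fraction at 4:1:", kv_cache_fraction(4))
```

Under these assumptions a single forward pass covers 48 draft positions on MI300x and 80 on MI355x, all reading the same cached keys and values.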

The theoretical speedups are harder to achieve in practice because diffusion carries higher operational overhead, and inference tooling for diffusion models is less mature than for autoregressive models.

Marktechpost’s Visual Explainer

■ Marktechpost Guide
ZAYA1-8B-Diffusion-Preview

01 / 08  —  Overview
What is ZAYA1-8B-Diffusion-Preview?
Zyphra released ZAYA1-8B-Diffusion-Preview on May 14, 2026. The model converts an existing autoregressive MoE model into a discrete diffusion model without any systematic degradation in evaluation performance.
It drafts 16 tokens at once instead of only one.

Released: May 14, 2026, San Francisco

Lab: Zyphra

Base model: ZAYA1-8B (autoregressive MoE)

Hardware: AMD MI300x, MI355x

First of a kind: the first MoE diffusion model converted from an AR LLM, and the first diffusion LM trained on AMD

02 / 08  —  The Problem
The Bottleneck in Autoregressive Decoding
Standard LLMs are autoregressive, producing one token per step. For every token, the model loads each user’s KV-cache from GPU memory. Because every user in a batch has a unique token history, the caches cannot be shared.
Decoding easily becomes memory-bandwidth bound in many serving scenarios: the GPU waits on data transfers instead of computing. The gap keeps growing as modern GPUs scale FLOPs faster than memory bandwidth.

For engineers: memory-bandwidth bound means the GPU’s compute units are waiting on data from HBM; compute-bound means they are fully utilized. Diffusion targets this gap by sharing one KV-cache load across N tokens.

03 / 08  —  The Solution
How Diffusion Removes the Bottleneck
A diffusion model drafts N tokens at once. All N tokens in a block share the same KV-cache, so there is one cache load regardless of block size. The workload becomes compute-bound instead of memory-bandwidth bound.

Autoregressive
1 token per pass
Separate KV-cache per user
Memory-bandwidth bound
Low GPU utilization

Diffusion
16 tokens per pass
Shared KV per Block
Compute-bound
Speed up to 7.7x

04 / 08  —  Training Pipeline
How the Model Was Converted
Training a diffusion language model from scratch is hard, and it offers no advantage because training is already compute-bound; the bottleneck only appears at inference. Zyphra instead converts an existing model during mid-training using the TiDAR recipe, which reuses the existing pretraining stack.

ZAYA1-8B base checkpoint: pretrained autoregressive MoE base model

Diffusion mid-training, 600B tokens @ 32k: conversion to discrete diffusion using the TiDAR recipe

Context extension, 500B tokens @ 128k: native extension of the context length to 128,000 tokens

Diffusion SFT phase: supervised fine-tuning in diffusion mode

Total: ZAYA1-8B pretraining plus 1.1 trillion tokens of mid-training.

05 / 08  —  Inference
Two Samplers: Speed vs. Quality
The model drafts 16 tokens per step. A fraction are accepted via a sampling criterion, similar to speculative decoding, but the same model acts as both speculator and verifier in a single forward pass, so no separate draft model is needed, unlike EAGLE or dFlash.

4.6x
Lossless sampler
No systematic eval loss
min(1, p(x)/q(x))

7.7x
Logit-mixing sampler
Some quality trade-off
Mixes AR + diffusion logits

Note: on rejection in the lossless sampler, the next token is sampled from the residual distribution p(x) − q(x). The speed/quality trade-off can be chosen at inference time.

06 / 08  —  Architecture
Architectural Details
ZAYA1-8B-Diffusion-Preview is a single-step speculative diffusion model using order-constrained generation: it only generates tokens in a contiguous subsequence starting from the prefix. Training is more stable than with unconstrained mask diffusion or set block decoding.

Attention: Zyphra’s CCA attention reduces prefill FLOPs and allows more parallel tokens before the compute limit

CCGQA: 4:1 query-to-key heads; 2x compression; avoids MLA because of its high arithmetic intensity

MI300x (bf16): 3 block-sized proposals per forward pass

MI355x: 5 block-sized proposals per forward pass

07 / 08  —  Results
Benchmark Results & Comparisons
Evaluation shows minimal degradation compared with the AR base checkpoint. Some benchmarks, including LCBv6, improve, which Zyphra attributes to better mid-training datasets and to the greater expressivity of non-causal, diffusion-style attention within blocks.

ZAYA1 Diffusion: 4.6x to 7.7x
MTP: lower
EAGLE3: lower
dFlash: lower net speedup

Important: the evaluations use pass@ metrics, not standard accuracy benchmarks, because this is a base mid-train checkpoint that has not undergone RL training. Take care when comparing these scores with standard benchmark results from other models.

08 / 08  —  Implications
What this means for AI engineers
The deeper implication is for RL training: on-policy rollouts, the model-generated sequences used during reinforcement learning, are expensive. Faster, more compute-efficient generation reduces rollout costs, which makes RL, and even test-time compute scaling, more feasible.

For MLEs: better GPU utilization at serving time with compute-bound inference

For RL teams: more RL for the same budget thanks to cheaper on-policy rollouts

For architects: CCA + CCGQA co-designed for diffusion from the start, not bolted on

Availability: ZAYA1-8B base on Hugging Face (Zyphra). The diffusion inference stack is in its early stages.

Key Takeaways

  • Zyphra converted its existing ZAYA1-8B autoregressive MoE model into a discrete diffusion model using the TiDAR recipe, with 1.1 trillion additional tokens of mid-training.
  • The model drafts 16 tokens per step, reaching a 4.6x speedup with the lossless sampler and up to 7.7x with the logit-mixing sampler.
  • It is the first MoE diffusion model converted from an autoregressive LLM, and the first diffusion language model trained on AMD GPUs.
  • Evaluation figures are pass@ metrics on a base mid-train checkpoint; the model has not yet undergone RL training.
  • Faster diffusion inference makes on-policy RL rollouts and test-time compute scaling more practical.
