Zyphra, the San Francisco-based AI lab behind the ZAYA1 model family, has released ZAYA1-8B-Diffusion-Preview, a preview of its early work on diffusion language models. The release demonstrates how an autoregressive language model can be converted into a discrete diffusion model without sacrificing evaluation performance.
The Problem with Autoregressive Decoding
To see why this matters, it helps to understand how most language models generate text today. Standard large language models are autoregressive: they decode one token at a time. For each new token, the attention mechanism has to look back over all previously generated tokens and load their stored representations, called the KV-cache, from GPU memory. Importantly, since each user in a batch has their own token history, the KV-cache for each user must be loaded individually and cannot be shared across requests.
This creates a bottleneck. When the GPU spends more time moving data than performing computations, the system becomes memory-bandwidth-bound rather than compute-bound. This limits how efficiently modern GPU hardware, whose compute FLOPs have been scaling faster than memory bandwidth, can be used during inference.
Diffusion models offer an alternative. Instead of generating one token at a time, a diffusion model drafts a block of N tokens simultaneously and typically refines them over multiple steps. Because all N tokens share the same KV-cache, the GPU is used more efficiently. In ZAYA1-8B-Diffusion-Preview specifically, the model performs a single-step transformation from mask to token for each token in the block, meaning it directly predicts the unmasked token in one step rather than iteratively denoising.
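The bandwidth argument can be made concrete with a back-of-the-envelope roofline estimate. The sketch below is only an illustration: the function names and every number in `HW` are hypothetical, not measurements of ZAYA1 or of any real GPU, and it assumes the bytes read per forward pass stay constant while N tokens are scored together.

```python
def step_time_s(n_tokens, bytes_loaded, flops_per_token, bandwidth, peak_flops):
    """Idealized time for one forward pass that scores n_tokens at once."""
    # Weights and KV-cache are read once per pass, however many tokens share them.
    memory_time = bytes_loaded / bandwidth
    # Compute grows linearly with the number of tokens in the pass.
    compute_time = n_tokens * flops_per_token / peak_flops
    # The slower of the two resources bounds the step.
    return max(memory_time, compute_time)


def tokens_per_second(n_tokens, **hw):
    """Upper-bound throughput, assuming every token in the pass is kept."""
    return n_tokens / step_time_s(n_tokens, **hw)


# Hypothetical numbers, chosen only to make the crossover easy to see.
HW = dict(
    bytes_loaded=2e10,     # bytes read per forward pass
    flops_per_token=2e10,  # FLOPs needed per token
    bandwidth=2e12,        # memory bandwidth in bytes/s
    peak_flops=1e15,       # peak compute in FLOP/s
)

if __name__ == "__main__":
    for n in (1, 16, 500, 1000):
        print(n, round(tokens_per_second(n, **HW)))
```

Under these made-up numbers the step time is identical for 1 and 16 tokens, so throughput grows with N until the pass becomes compute-bound (at N = 500 here) and then flattens. Real speedups are lower, since not every drafted token is accepted.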
Converting Autoregression to Diffusion Without Training from Scratch
The Zyphra team gives two reasons for preferring conversion over training a diffusion language model from scratch. First, training in diffusion mode is simply hard, and few known recipes exist. Second, there is no advantage to pretraining in diffusion mode, because training is already compute-bound; the memory-bandwidth bottleneck that diffusion addresses only appears at inference time. Since the benefits of diffusion arrive only at inference, an existing pretraining stack can be reused.
Building on the TiDAR recipe, Zyphra took the ZAYA1-8B base checkpoint and ran 600 billion tokens of diffusion-conversion mid-training at a 32k context length, followed by 500 billion tokens of native context extension to 128k. The model then went through a diffusion supervised fine-tuning (SFT) phase.
ZAYA1-8B-Diffusion-Preview is the first MoE diffusion model converted from an autoregressive LLM and the first diffusion language model trained on AMD GPUs. Zyphra reports minimal evaluation degradation compared with the base autoregressive checkpoint, and gains on some benchmarks, such as LCBv6. The researchers attribute the improvement to the mid-training datasets and to the greater expressivity of non-causal diffusion within blocks compared to causal autoregression.
How the Diffusion Sampler Works
During inference, ZAYA1-8B-Diffusion-Preview drafts 16 tokens simultaneously. The sampling scheme is borrowed from speculative decoding: only a portion of the drafted tokens is accepted. The advantage of this approach is that a single model acts as both speculator and verifier in one forward pass, eliminating the need to run two models, as in traditional speculative methods such as EAGLE and dFlash. In heavily memory-bandwidth-bound regimes, almost every accepted token is a free speedup over autoregressive decoding: the GPU has already loaded the data it needs, and the extra tokens cost very little additional compute.
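To get a feel for how the acceptance rate translates into tokens per forward pass, here is a minimal sketch. It rests on simplifying assumptions that are not from Zyphra: each drafted token is accepted independently with the same probability `accept_rate`, the first rejection ends the block, and the rejected position still yields one replacement token. The function name is hypothetical.

```python
def expected_tokens_per_pass(accept_rate, block_size=16):
    """Expected tokens emitted per forward pass under an i.i.d. acceptance model.

    The pass emits the accepted prefix plus one replacement token at the first
    rejection, or the whole block if nothing is rejected. Summing the
    probabilities that each position is reached gives a geometric series:
    1 + a + a^2 + ... + a^(block_size - 1).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(block_size)  # every draft token is kept
    return (1.0 - accept_rate ** block_size) / (1.0 - accept_rate)


if __name__ == "__main__":
    for a in (0.5, 0.8, 0.9):
        print(a, expected_tokens_per_pass(a))
```

Because the series saturates, a longer block helps mainly when the acceptance rate is already high; at low acceptance rates most of the 16 drafted tokens are discarded.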
The Zyphra team has developed two different samplers that trade off speed and quality.
- Lossless diffusion sampler: Uses the standard speculative-decoding acceptance criterion min(1, p(x)/q(x)), where p is the token distribution from the autoregressive model’s logits and q is the token distribution from the diffusion model’s logits. When a draft token is rejected, the next token is sampled from the residual distribution, proportional to max(0, p(x) − q(x)). This sampler delivers a 4.6x speedup with no degradation in evaluation performance.
- Logit-mixing sampler: Averages the logits from the diffusion speculator and the autoregressive model, then verifies against the averaged logits. This raises acceptance rates, because the verification logits look more like the diffusion logits, at some cost in quality. This sampler delivers a 7.7x speedup, so users can trade quality for speed at inference time.
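The two samplers can be sketched in a few lines of plain Python. This is an illustrative reading of the description above, not Zyphra's implementation: all function names are hypothetical, the `mix_weight` parameter is an assumption (0.0 verifies against the pure autoregressive distribution, 0.5 against a plain average of the two sets of logits), and the bonus token that speculative decoding usually samples after a fully accepted block is omitted.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual(p, q):
    """Normalized positive part of p - q, used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p equals q a rejection cannot occur; fall back to p for safety.
    return [d / total for d in diff] if total > 0 else list(p)


def mix_logits(ar_logits, diff_logits, weight):
    """Weighted average of autoregressive and diffusion logits (weight is assumed)."""
    return [(1 - weight) * a + weight * d for a, d in zip(ar_logits, diff_logits)]


def verify_block(draft, ar_logits, diff_logits, rng, mix_weight=0.0):
    """Accept a contiguous prefix of the draft; stop at the first rejection."""
    out = []
    for tok, a, d in zip(draft, ar_logits, diff_logits):
        q = softmax(d)                               # speculator distribution
        p = softmax(mix_logits(a, d, mix_weight))    # verification distribution
        # tok is assumed to have been drawn from q, so q[tok] > 0.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                          # draft token accepted
        else:
            # Rejected: emit a replacement from the residual and end the block.
            probs = residual(p, q)
            out.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
            break
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    ar = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]   # toy logits, vocabulary of 3
    df = [[1.5, 0.5, -1.0], [1.0, 0.0, 0.0]]
    print(verify_block([0, 0], ar, df, rng, mix_weight=0.0))
    print(verify_block([0, 0], ar, df, rng, mix_weight=0.5))
```

With `mix_weight=0.0` this reduces to standard speculative sampling against the autoregressive distribution. Raising the weight pulls the verification distribution toward the speculator, which raises p(x)/q(x) for drafted tokens and therefore the acceptance rate, but the output no longer follows the autoregressive distribution exactly.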
One important caveat on these numbers: because ZAYA1-8B-Diffusion-Preview is a base mid-train checkpoint that has not yet undergone RL training, Zyphra uses pass@ evaluations rather than standard accuracy benchmarks to better represent the model’s ultimate potential after RL training. Keep this in mind when comparing the model’s reported benchmarks with those of other models.
Zyphra also notes that the speedups from diffusion are higher than those from alternatives such as MTP and speculative decoding methods like EAGLE3. TiDAR-style diffusion models need only a single forward pass while achieving acceptance rates comparable to those of dFlash.

Architectural Details
ZAYA1-8B-Diffusion-Preview is a single-step speculative diffusion model that uses order-constrained generation, meaning the diffusion model can only generate tokens in a contiguous subsequence starting from the prefix. This constraint improves training stability compared to unconstrained mask-diffusion objectives or block decoding, and it was a main reason Zyphra built on the TiDAR recipe.
The model uses Zyphra’s CCA attention variant, originally developed for ZAYA1-8B. CCA dramatically reduces attention FLOPs during prefill, which benefits diffusion because diffusion turns decoding into a prefill-like operation. As a result, CCA lets the model diffuse more tokens simultaneously before it becomes compute-bound.
The architecture uses CCGQA with a 4:1 ratio of query heads to key heads. This was a deliberate design decision to avoid MLA, whose high arithmetic intensity was deemed a mismatch for this approach. Because block diffusion reads the same cache under CCGQA, arithmetic intensity scales with both the block size and the number of blocks in a forward pass. The system can support roughly three proposals of the same block size per forward pass on AMD MI300X in bf16, rising to five on MI355X. CCGQA also operates at 2x compression, which helps Zyphra absorb the extra training FLOPs that TiDAR requires. AMD GPUs also offer larger VRAM capacity, which allows for more efficient diffusion training.
In practice, the theoretical speedups are harder to achieve, because diffusion carries higher operational overhead and inference tooling for diffusion models is less mature than for autoregressive models.
Key Takeaways
- Zyphra converted its existing ZAYA1-8B autoregressive MoE into a discrete diffusion model using the TiDAR recipe, with 1.1 trillion additional tokens of mid-training.
- The model drafts 16 tokens per step, achieving a 4.6x speedup with the lossless sampler and 7.7x with the logit-mixing sampler.
- It is the first MoE diffusion model converted from an autoregressive LLM and the first diffusion language model trained on AMD GPUs.
- Evaluation figures are pass@ metrics on a base mid-train checkpoint; the model has not yet undergone RL training.
- Faster diffusion inference makes test-time compute more practical.

