Alibaba’s Qwen team has released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants, Instruct and Thinking, aimed at high-throughput inference with ultra-long context and MoE efficiency. The FP8 releases mirror the BF16 versions but ship “fine-grained FP8” weights (block size 128) along with deployment notes for both sglang and vLLM. The model cards retain the original BF16 benchmarks; the FP8 checkpoints are provided “for convenience and performance,” not as an independent evaluation run.
What’s inside the A3B stack
Qwen3-Next-80B-A3B combines Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse mixture-of-experts (MoE). Of the 80B total parameters, only ~3B are activated per token, routed across 512 experts. The design stacks 48 layers arranged as 12 blocks, each block being 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). Native context is 262,144 tokens, validated up to 1,010,000 tokens via RoPE scaling (YaRN). Hidden size is 2048; Gated Attention uses 16 Q heads and 2 KV heads at head dim 256, while Gated DeltaNet uses 32 V heads and 16 QK heads at head dim 128.
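As a rough sketch of this interleaving pattern (illustrative Python, not the actual implementation; the layer names are placeholders), the 48-layer schedule looks like this:

```python
# Illustrative sketch of the Qwen3-Next-80B-A3B layer schedule (not the
# actual implementation): 12 blocks, each 3x(Gated DeltaNet -> MoE)
# followed by 1x(Gated Attention -> MoE), for 48 layers total.

def build_layer_schedule(num_blocks: int = 12) -> list[str]:
    schedule = []
    for _ in range(num_blocks):
        # Three linear-attention (Gated DeltaNet) layers, each with an MoE FFN.
        schedule.extend(["gated_deltanet+moe"] * 3)
        # One full Gated Attention layer, also with an MoE FFN.
        schedule.append("gated_attention+moe")
    return schedule

layers = build_layer_schedule()
assert len(layers) == 48
print(layers[:8])
```

Full attention appears only in every fourth layer, which keeps KV-cache growth low at long context while DeltaNet handles most of the token mixing.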
The Qwen team reports that the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at ~10% of its training cost, and delivers ~10× inference throughput beyond 32K context, driven by the low-activation MoE and multi-token prediction (MTP). The Instruct variant emits no reasoning traces (no <think> blocks), while the Thinking variant enforces reasoning traces by default and is optimized for complex problems.
What actually changed in the FP8 release?
The FP8 model cards state that quantization is “fine-grained fp8” with block size 128. Deployment differs slightly from BF16: example serving commands are provided for both sglang and vLLM, with 256K context and optional MTP. The Thinking FP8 card additionally suggests a reasoning-parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, deepseek_r1 in vLLM). The release remains Apache-2.0 licensed.
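As a minimal sketch of loading the FP8 checkpoint, assuming vLLM’s offline LLM API (the repo ID follows the model card naming; the parallelism and sampling settings are illustrative, and the cards’ own serve commands should take precedence):

```python
# Minimal sketch: loading the FP8 Instruct checkpoint with vLLM's offline API.
# tensor_parallel_size and max_model_len are assumptions -- tune them for your
# GPUs and memory budget. For production serving (reasoning parser, MTP),
# follow the OpenAI-compatible server commands on the model card instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    max_model_len=262144,    # native 256K context window
    tensor_parallel_size=4,  # assumption: split across 4 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Explain fine-grained FP8 quantization in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```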
Benchmarks (reported using BF16 weights).
The Instruct FP8 card reproduces Qwen’s BF16 comparison table, putting Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge/reasoning/coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and post-training signals
The series is pre-trained on ~15T tokens before post-training. Qwen highlights stability additions, and uses GSPO in RL post-training to handle the combination of hybrid attention and high-sparsity MoE. MTP both strengthens the pretraining signal and accelerates inference through speculative decoding.
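For intuition on why MTP speeds up decoding, here is a generic speculative-decoding toy (not Qwen’s actual MTP head): a cheap draft proposes k future tokens, and the full model keeps the longest prefix it agrees with.

```python
# Toy illustration of MTP-style speculative decoding (generic sketch, not
# Qwen's implementation). A cheap draft proposes k future tokens; the full
# model verifies them and keeps the longest agreeing prefix. In a real
# engine, verification of all k tokens happens in one batched forward pass.

def target_next(context: list[int]) -> int:
    """Stand-in for the full model's next-token rule (deterministic toy)."""
    return (context[-1] * 7 + 3) % 100

def draft_next_k(context: list[int], k: int) -> list[int]:
    """Stand-in for the MTP draft head; here it happens to match the target."""
    out, ctx = [], list(context)
    for _ in range(k):
        ctx.append(target_next(ctx))
        out.append(ctx[-1])
    return out

def speculative_step(context: list[int], k: int = 3) -> list[int]:
    accepted = []
    for tok in draft_next_k(context, k):
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft verified: an almost-free token
        else:
            accepted.append(expected)  # mismatch: take the model's token, stop
            break
    return accepted

# With a perfect draft, every verification pass yields k tokens; the real
# speedup scales with the draft's acceptance rate.
print(speculative_step([42]))  # -> [97, 82, 77]
```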
Why FP8 Matters
On modern accelerators, FP8 weights and activations roughly halve memory footprint and bandwidth relative to BF16, allowing larger batches or longer sequences at similar latency. Because A3B routes only ~3B active parameters per token, MoE sparsity compounds with FP8 to raise throughput in long-context regimes, especially when combined with MTP speculative decoding as the serving flags show. That said, quantization interacts with routing and attention variants; real-world speculative-decoding acceptance rates and end-task accuracy can vary with engine and kernel implementations, hence Qwen’s guidance to use current sglang/vLLM builds and to tune speculative settings.
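For a concrete picture of what “fine-grained FP8 with block size 128” means, here is a generic per-block quantization sketch in PyTorch (assuming the float8_e4m3fn dtype from PyTorch 2.1+; Qwen’s exact scaling and kernel layout may differ):

```python
# Generic sketch of fine-grained (block-wise) FP8 quantization, block size
# 128: one scale per 128x128 weight tile rather than one scale per tensor,
# which limits how far a single outlier can distort its neighbors.
import torch

FP8_MAX = 448.0  # max representable magnitude in float8_e4m3fn

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    scales = torch.empty(rows // block, cols // block)
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / FP8_MAX  # per-block scale
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales

w = torch.randn(256, 256)
q, scales = quantize_blockwise_fp8(w)
deq = q.to(torch.float32) * scales.repeat_interleave(128, 0).repeat_interleave(128, 1)
print((w - deq).abs().max())  # small per-block quantization error
```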
Summary
Qwen’s FP8 releases make the 80B-total/3B-active A3B models servable at 256K context on mainstream engines while preserving the hybrid-MoE design and the MTP path for high throughput. Because the model cards still report BF16 benchmarks, teams should validate FP8 accuracy and latency on their own stacks, particularly around reasoning parsers and speculative-decoding settings. Net result: lower memory bandwidth, better concurrency, and no architectural regressions, positioned for long-context production workloads.

