
Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

Tech · By Gavin Wallace · 29/04/2026 · 6 Mins Read

In the race to cut the cost and latency of running large models, the two main levers have been the model architecture and the hardware. But there is a third, often underappreciated frontier: the GPU kernel. A kernel is the lowest-level routine that executes the mathematical operations on the GPU. Writing a great kernel requires understanding not only the mathematics, but also the memory layout, the instruction scheduling, and the hardware quirks of the target chip. Many ML practitioners never write their own kernels; they rely instead on libraries such as FlashAttention, or on compilers such as Triton, to handle it.

Meet FlashQLA, the QwenLM team's contribution to this layer, released under the MIT License. Built on the TileLang compiler framework, it is a high-performance linear attention kernel library optimized specifically for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.

Why is linear attention important?

To fully understand FlashQLA, it helps to know what standard softmax attention costs. In a conventional Transformer, the attention mechanism has O(n²) complexity, meaning that doubling the sequence length quadruples the computation. That bottleneck is what makes long documents or codebases expensive to process.

Linear attention replaces the softmax formulation with one that reduces this complexity to O(n), so computation scales linearly with sequence length. Gated Delta Networks (GDN) are one such linear attention mechanism. They are integrated into Qwen's hybrid models, which alternate GDN layers with full-attention layers. The hybrid design aims for the best of both worlds: full attention where it matters most, and linear attention's efficiency everywhere else.
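To make the scaling gap concrete, here is a back-of-the-envelope FLOP comparison for the attention mixing step alone (a rough illustration; the head dimension d = 128 is an assumption, and real kernels involve many more terms):

```python
# Rough per-head FLOP count for the attention mixing step.
# Softmax attention pays ~2*n^2*d FLOPs for the QK^T scores;
# a linear-attention recurrence pays ~2*n*d^2 for its running
# k v^T state updates, which is linear in sequence length n.
d = 128  # assumed head dimension, for illustration only
for n in (4_096, 16_384, 65_536):
    softmax_flops = 2 * n * n * d
    linear_flops = 2 * n * d * d
    print(f"n={n:>6}: softmax ~{softmax_flops / 1e12:.3f} TFLOPs, "
          f"linear ~{linear_flops / 1e12:.5f} TFLOPs")
```

Quadrupling n multiplies the softmax figure by sixteen but the linear figure only by four, which is exactly the gap the hybrid design targets.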

GDN uses a 'gated' formulation: it applies an exponentially decaying gate to control how much past context is carried forward. This gate is precisely what FlashQLA exploits for its performance.
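In code, the idea is a recurrence over a fixed-size state. The sketch below is a minimal, naive per-token form (a simplification: the real GDN update also includes a delta-rule correction term, and the shapes and gate parameterization here are illustrative assumptions):

```python
import torch

def gated_linear_attention(q, k, v, g):
    """Naive recurrent form of gated linear attention.

    q, k: (T, d_k); v: (T, d_v); g: (T,) per-token decay gates in (0, 1).
    The state S is a (d_k, d_v) running summary of past key/value pairs;
    the gate g[t] exponentially decays it, so distant tokens fade out.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    out = torch.empty(T, d_v)
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])  # decay, then absorb token t
        out[t] = q[t] @ S                       # constant-cost read-out
    return out
```

Because the entire past is compressed into the fixed-size state S, the cost per token is constant, which is where the O(n) total comes from.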

Where existing kernels fall short

Before FlashQLA, the standard implementation of GDN operations came from the Flash Linear Attention (FLA) library, which uses Triton kernels (Triton is OpenAI's Python-based GPU programming language). Triton's ease of use makes kernels easier to write, but there are trade-offs: the kernels it produces may not be fully tuned for specific hardware, especially NVIDIA's Hopper generation, the H100 and H200 GPUs.

Hopper's architecture introduced new features, such as warpgroup-level Tensor Core instructions and asynchronous data pipelines, that Triton could not fully exploit. FlashQLA was designed to close this gap.

How FlashQLA is different

FlashQLA applies operator fusion and performance optimization to both the forward pass and the gradient computation of GDN chunked prefill used in training. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared with the FLA Triton kernels, measured across multiple scenarios on NVIDIA's Hopper GPUs.

These gains are driven by three technical innovations:

1. Gate-driven automatic intra-card context parallelism. Context parallelism (CP) splits a sequence across several processing units so that they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split mathematically valid: because of the gate's decay, tokens far apart in a sequence have diminishing influence on each other. FlashQLA automatically activates intra-card CP under tensor parallelism (TP), small head counts, long sequences, and similar settings, improving utilization of the GPU's streaming multiprocessors (SMs) without any manual configuration (see the sketch after this list for why decay makes block-wise splitting work).

2. Hardware-friendly algebraic reformulation. FlashQLA reformulates the forward and backward computation of GDN chunked prefill to reduce overhead across three types of GPU hardware units: Tensor Cores (which perform matrix multiplications), CUDA Cores (which handle scalar and vector operations), and Special Function Units. Critically, this is done without sacrificing numerical precision, an important guarantee when the reformulated kernels are used for model training.

3. TileLang fused warp-specialized kernels. FlashQLA uses TileLang to build several key fused kernels and manually implements warpgroup specialization, a technique that assigns different warpgroups (groups of 128 threads on Hopper) to specialized roles, such as one warpgroup moving data from global memory into shared memory while another simultaneously runs Tensor Core matrix multiplications. By overlapping Tensor Core computation, CUDA Core computation, and data movement, FlashQLA approaches the theoretical peak performance of the hardware.
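Here is the sketch promised in item 1: because the gate decays multiplicatively, everything one block of the sequence needs from its past is the carried state, scaled by a cumulative gate product, so blocks can be processed largely independently. The following plain-PyTorch chunked version of the earlier recurrence illustrates the idea (it is not FlashQLA's fused kernel; the log-space decay trick in the middle is also the flavor of algebraic rewrite item 2 alludes to):

```python
import torch

def chunked_gated_linear_attention(q, k, v, g, C=64):
    """Chunk-wise form of the gated recurrence from the earlier sketch.

    Splits the sequence into chunks of length C; the only information a
    chunk needs from its past is the carried state S, decayed by the
    chunk's cumulative gate product.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)   # state handed across chunk boundaries
    out = torch.empty(T, d_v)
    for s0 in range(0, T, C):
        qc, kc, vc = q[s0:s0 + C], k[s0:s0 + C], v[s0:s0 + C]
        gc = g[s0:s0 + C]
        c = torch.cumsum(torch.log(gc), dim=0)  # c[i] = log(g_0 * ... * g_i)
        A = torch.exp(c)                        # cumulative decay to position i
        # Cross-chunk term: each position reads the carried state, decayed.
        inter = (A[:, None] * qc) @ S
        # Intra-chunk term: decay between positions j <= i is exp(c[i] - c[j]);
        # computing it from cumulative logs avoids O(C^2) repeated multiplies.
        decay = torch.exp(c[:, None] - c[None, :])
        scores = (qc @ kc.T) * decay
        out[s0:s0 + C] = inter + scores.tril() @ vc  # tril = causal mask
        # Hand the state forward: decay the old state, absorb this chunk.
        S = A[-1] * S + ((A[-1] / A)[:, None] * kc).T @ vc
    return out
```

A quick sanity check is to compare its output against the naive recurrence above with torch.allclose. The chunk-boundary hand-off is a single (d_k × d_v) tensor, which is what makes splitting a long prefill across SMs cheap.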

Benchmarks

FlashQLA was benchmarked against two baselines, the FLA Triton kernel (version 0.5.0) and the FlashInfer kernel (version 0.6.9), using TileLang 0.1.8 on NVIDIA H200 GPUs. The benchmarks used the Qwen3.5/Qwen3.6 attention-head configurations (query/key and value head counts) at tensor-parallelism settings TP1, TP2, TP4, TP8, TP16, TP24, TP32, TP48, and TP64.

The forward (FWD) benchmarks measure single-kernel latency across models and TP settings at varying batch sizes. The backward (BWD) benchmarks measure the latency of a single gradient update for a given batch as a function of total token count.
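For readers who want to reproduce this style of measurement, the standard pattern is CUDA-event timing around repeated launches. The harness below is a generic sketch, not the FlashQLA authors' benchmark code, and the function passed in is a placeholder:

```python
import torch

def kernel_latency_ms(fn, *args, warmup=10, iters=100):
    """Average single-launch latency of fn(*args), in milliseconds."""
    for _ in range(warmup):       # warm-up: compilation, caches, clocks
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()      # wait until all launches have finished
    return start.elapsed_time(end) / iters
```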

Source: https://qwen.ai/blog?id=flashqla

What you need to know

  • FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang and optimized specifically for the forward and backward passes of Gated Delta Network (GDN) chunked prefill.
  • It achieves a 2–3× forward speedup and a 2× backward speedup over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs.
  • The gains come from three core innovations: gate-driven automatic intra-card context parallelism; a hardware-friendly algebraic reformulation that reduces Tensor Core and CUDA Core overhead without sacrificing numerical precision; and TileLang warp-specialized kernels that overlap data movement with Tensor Core and CUDA Core computation.
  • GDN is an O(n) linear attention mechanism used in Qwen's hybrid model architecture alongside standard full-attention layers, making efficient GDN kernels critical for both training and long-context inference at scale.
  • FlashQLA is released under the MIT License. It requires SM90+ GPUs, CUDA 12.8+, and PyTorch 2.8+. Installation is a simple pip install, and both high-level and low-level Python APIs are available.

Check out the GitHub repo for examples of how to get started and for further technical details.
