Moonshot AI, the company behind Kimi.ai, has released FlashKDA, an open-source, CUTLASS-based CUDA kernel that provides a high-performance implementation of the Kimi Delta Attention (KDA) mechanism. FlashKDA is MIT-licensed and delivers prefilling speedups of 1.72× to 2.22× over the existing flash-linear-attention implementation on the NVIDIA H20 GPU, and it installs as a drop-in backend for the flash-linear-attention library.
What Is Kimi Delta Attention and Why Does It Matter?
To understand FlashKDA, it helps to first place it within the broader landscape of LLM attention mechanisms.
Standard softmax attention has quadratic complexity with respect to sequence length, meaning that as you feed longer context into a model, compute costs grow extremely fast. The result has been a surge of research into linear attention: mechanisms that replace or approximate softmax attention with alternatives that scale linearly. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space, a linear attention mechanism that refines Gated DeltaNet with a finer-grained, channel-wise gating mechanism, allowing more efficient use of the RNN's finite-state memory.
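To make that concrete, here is a minimal, purely illustrative PyTorch sketch of a delta-rule recurrence with a channel-wise decay gate. It is a conceptual reference for the recurrent form, not Moonshot AI's actual KDA formulation; the function name, single-head shapes, and gate parameterization are simplifications chosen for readability.

```python
import torch

def channelwise_gated_delta_rule(q, k, v, g, beta, scale=1.0):
    """Toy recurrent reference for one head (hypothetical helper, not the FlashKDA kernel).

    q, k: [T, K]  queries / keys
    v:    [T, V]  values
    g:    [T, K]  per-channel log-decay (<= 0), i.e. the channel-wise gate
    beta: [T]     write strength in (0, 1)
    """
    T, K = k.shape
    S = torch.zeros(K, v.shape[-1])                        # finite-size recurrent memory
    outputs = []
    for t in range(T):
        S = torch.exp(g[t]).unsqueeze(-1) * S              # channel-wise decay of the state
        pred = k[t] @ S                                    # what memory currently returns for k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)   # delta-rule correction toward v_t
        outputs.append((scale * q[t]) @ S)                 # read out with the query
    return torch.stack(outputs)                            # [T, V]

# Tiny smoke test with random inputs.
T, K, V = 16, 8, 8
out = channelwise_gated_delta_rule(
    torch.randn(T, K), torch.randn(T, K), torch.randn(T, V),
    -torch.rand(T, K),          # log-decay values in (-1, 0)
    torch.rand(T),              # beta in (0, 1)
)
print(out.shape)                # torch.Size([16, 8])
```

The key difference from a scalar-gated DeltaNet is the per-channel decay vector applied to the state, which is what "channel-wise gating" refers to here.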
KDA is more than a prototype. It is the core attention mechanism of Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio, i.e. three KDA layers for every global attention layer, which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at 1 million context length compared to full attention. FlashKDA is the production-grade CUDA kernel that accelerates prefilling for this architecture.
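As a rough sanity check on the 75% figure, the arithmetic of the 3:1 layout works out as follows (a simplification that treats each KDA layer's fixed-size recurrent state as negligible next to a growing KV cache):

```python
# Rough illustration only: each MLA layer keeps a full growing KV cache,
# each KDA layer keeps a fixed-size state that we ignore here.
kda_layers, mla_layers = 3, 1                      # one 3:1 block as described above
relative_kv_cache = mla_layers / (kda_layers + mla_layers)
print(f"KV cache vs. full attention: {relative_kv_cache:.0%} "
      f"(a {1 - relative_kv_cache:.0%} reduction)")   # 25% (a 75% reduction)
```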
The KDA forward pass takes queries (q), keys (k), values (v), gate activations (g), and beta logits as inputs, together with a scale factor and an output tensor. It also accepts gate parameters: A_log (a per-head log-gate parameter), dt_bias (the gate bias), and lower_bound (the gate lower bound, in the range -5.0 to 0). The sigmoid activation on the beta logits is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states, useful for multi-turn inference where you want to carry state across requests.
Models built on this recurrent form can decode efficiently during generation, but prefill for these architectures still requires highly optimized GPU kernels, which is exactly what FlashKDA delivers.
Under the Hood: CUTLASS on Hopper
FlashKDA is built with CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance kernel and linear algebra development. CUTLASS lets developers write kernels that take full advantage of NVIDIA's Tensor Core architecture, and it is also the foundation for libraries such as FlashAttention-3.
The library targets SM90 and higher, meaning NVIDIA's Hopper architecture (H100, H20) and newer. Minimum requirements are CUDA 12.9.0 and PyTorch 2.4. The majority of the codebase is CUDA (56.4%), with Python (36.2%) and C++ glue code (6.7%).
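Before building, it is worth confirming that the GPU and toolchain meet these requirements; a quick check along these lines (illustrative, not part of the FlashKDA repository) can save a failed compile:

```python
import torch

# SM90+ means compute capability (9, 0) or newer, i.e. Hopper (H100, H20) and up.
major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) >= (9, 0), f"FlashKDA targets SM90+, found SM{major}{minor}"

# The build also expects CUDA 12.9+ and PyTorch 2.4+.
print("PyTorch:", torch.__version__, "| CUDA toolkit:", torch.version.cuda)
```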
The core API is flash_kda.fwd, which takes the following inputs:

- q, k, v, g: bf16 tensors of shape [B, T, H, K] (or [B, T, H, V] for the values), where g is the gate pre-activation
- beta: bf16 beta logits of shape [B, T, H]; the sigmoid is applied internally
- scale: fp32 scalar scaling factor
- o: bf16 output tensor of shape [B, T, H, V]
- A_log, dt_bias, lower_bound: fp32 gate parameters
- initial_state, final_state: optional recurrent states (bf16/fp32)
- cu_seqlens: optional int64 sequence lengths for variable-length batching
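Put together, a call might look roughly like the sketch below. Tensor shapes and dtypes follow the list above, but the keyword names, argument order, and gate-parameter shapes are assumptions for illustration; the authoritative signature of flash_kda.fwd lives in the repository.

```python
import torch
import flash_kda   # the installed extension module

B, T, H, K, V = 1, 8192, 96, 128, 128     # K = V = 128 is the supported head dimension
dev = "cuda"

q = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)
k = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)
v = torch.randn(B, T, H, V, dtype=torch.bfloat16, device=dev)
g = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)   # gate pre-activation
beta = torch.randn(B, T, H, dtype=torch.bfloat16, device=dev)   # sigmoid applied in-kernel
o = torch.empty(B, T, H, V, dtype=torch.bfloat16, device=dev)

# Per-head fp32 gate parameters (shapes here are an assumption for illustration).
A_log = torch.zeros(H, dtype=torch.float32, device=dev)
dt_bias = torch.zeros(H, dtype=torch.float32, device=dev)
lower_bound = torch.full((H,), -5.0, dtype=torch.float32, device=dev)

# Keyword names mirror the documented inputs; confirm the exact signature in the repo.
flash_kda.fwd(q=q, k=k, v=v, g=g, beta=beta, scale=K ** -0.5, o=o,
              A_log=A_log, dt_bias=dt_bias, lower_bound=lower_bound)
```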
One current limitation: the kernel requires a head dimension of K = V = 128.
Support for variable-length batching via cu_seqlens is especially notable for production use. Requests in a real-world inference service rarely share the same sequence length, so high-throughput systems need to pack multiple sequences of different lengths into a single kernel call.
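The cu_seqlens convention is presumably the usual one from varlen attention kernels: cumulative token offsets into a packed layout. A minimal sketch of constructing it under that assumption (check the repository's tests for the exact expected format):

```python
import torch

# Per-request lengths from the varlen benchmark case below.
seq_lens = [1300, 547, 2048, 963, 271, 3063]

lens = torch.tensor(seq_lens, dtype=torch.int64)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), lens.cumsum(0)]).to("cuda")
# tensor([   0, 1300, 1847, 3895, 4858, 5129, 8192])

total_tokens = int(cu_seqlens[-1])   # all requests packed along a single token axis (T=8192)
```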
Benchmark Results: 1.72× to 2.22× on H20
The benchmarks compare flash_kda against fla_chunk_kda, the existing flash-linear-attention implementation, at sequence length T=8192 and head dimension D=128, in two head-count configurations: H=96 and H=64. Each benchmark used 30 warmup iterations, 200 measurements, and 5 repetitions.
Results with H=96:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed length | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
Results with H=64:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed length | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8, i.e. eight sequences of length 1024 that together add up to T=8192), while the fixed-length case delivers the floor of the range at 1.72×. FlashKDA consistently outperforms the flash-linear-attention baseline by a considerable margin across both head configurations and all three sequence scenarios.
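For readers who want to reproduce numbers like these on their own hardware, a generic CUDA-event timing harness with warmup looks like the sketch below; the warmup and iteration counts match the stated methodology, but this is a template, not the script behind the published results.

```python
import torch

def bench_ms(fn, warmup=30, iters=200):
    """Mean milliseconds per call, measured with CUDA events after a warmup."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# e.g. repeat 5 times and report the best/median, mirroring the stated methodology:
# times = [bench_ms(lambda: flash_kda.fwd(...)) for _ in range(5)]
```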
Integration with flash-linear-attention
One of FlashKDA's most useful features is its integration path: once installed, it is auto-dispatched from flash-linear-attention's chunk_kda, which means existing codebases using flash-linear-attention pick up the faster kernel without any manual wiring. The integration is tracked in flash-linear-attention PR #852.
Installing is simple:
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
The correctness test suite (tests/test_fwd.py) performs exact-match validation against a PyTorch reference implementation and cross-validates against flash-linear-attention, giving developers a baseline for auditing kernel behavior before deploying to production.
Key Takeaways
- FlashKDA is Moonshot AI's open-source, CUTLASS-based CUDA kernel for Kimi Delta Attention (KDA), delivering a 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs.
- KDA refines Gated DeltaNet with fine-grained channel-wise gating and is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at 1M context length.
- The kernel targets SM90+ hardware (NVIDIA Hopper: H100, H20 and newer), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.
- Variable-length batching is supported natively via the cu_seqlens parameter, allowing multiple sequences of different lengths to be packed into a single kernel call, a critical feature for high-throughput inference serving.
- Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, making it a drop-in performance upgrade for any codebase using the flash-linear-attention library, with no architecture changes required.
Check out the GitHub Repo.

