NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. Decoding runs in constant memory, and sequence mixing can be reduced to linear time. Deciding what to forget is not the hard part. The hard part is editing a compressed memory without destroying existing associations.

NVIDIA has released Gated DeltaNet-2. The model decouples the active memory edit into two channel-wise gates. At 1.3B parameters, trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the benchmark suite.

The problem with scalar gates in delta-rule models

A linear attention layer stores a recurrent matrix state S_t, which the query then reads. DeltaNet adds an active edit: before writing, it subtracts the value currently associated with the key. A scalar β_t controls how much to overwrite. Mamba-2 adds a data-dependent scalar decay α_t for global forgetting. Gated DeltaNet merged both operations. However, both gates were scalar per head.

Kimi Delta Attention (KDA) refines the decay by replacing the scalar α_t with a channel-wise vector. KDA still uses a single scalar β_t for the active edit, however. That scalar controls two different things at once. It decides how much old content to erase on the key side. It also decides how much new content to commit on the value side. These two decisions act on different axes of the state. Tying them together is a modeling restriction, not a property of the delta rule.

https://github.com/NVlabs/GatedDeltaNet-2/blob/main/paper/GDN2_paper.pdf

Two gates in place of one: Gated Delta Rule-2

Gated DeltaNet-2 separates the two decisions with the Gated Delta Rule-2. It introduces a channel-wise erase gate b_t ∈ [0,1]^{d_k} on the key axis and a channel-wise write gate w_t ∈ [0,1]^{d_v} on the value axis. Both gates are produced by sigmoid projections of the token representation. The update applies the decay first and then the active edit.
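As a rough illustration of the gate construction, the sketch below produces an erase gate and a write gate from one token representation with two sigmoid projections. The weight names W_b and W_w, the absence of bias terms, and the toy dimensions are assumptions made for this example; the repository's actual parameterization may differ.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, mapping reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def make_gates(x, W_b, W_w):
    """Return (erase gate, write gate) for one token.

    x:   (d_model,) token representation
    W_b: (d_model, d_k) hypothetical erase-gate projection
    W_w: (d_model, d_v) hypothetical write-gate projection
    """
    b = sigmoid(x @ W_b)  # one erase strength per key channel
    w = sigmoid(x @ W_w)  # one write strength per value channel
    return b, w

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 3  # toy sizes, not the paper's
x = rng.standard_normal(d_model)
b, w = make_gates(
    x,
    rng.standard_normal((d_model, d_k)),
    rng.standard_normal((d_model, d_v)),
)
print(b.shape, w.shape)  # (4,) (3,)
```

The point of the sketch is only that the two gates live on different axes: b has one entry per key channel and w has one entry per value channel.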

In compact form, the recurrence is:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Here D_t = Diag(α_t) is the channel-wise decay carried over from KDA. The left factor of the erase matrix remains k_t, which preserves the delta-rule write direction. The right factor becomes b_t ⊙ k_t, which makes the read channel-selective. The write term is k_t z_t^⊤ with z_t = w_t ⊙ v_t, which selects the value channels to update.
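To make the roles of the two gates concrete, here is a minimal NumPy sketch of a single step of the recurrence S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤. It is written densely for readability and is not the repository's fused kernel; the function name gdr2_step and the toy vectors are assumptions for illustration.

```python
import numpy as np

def gdr2_step(S_prev, k, v, b, w, alpha):
    """One Gated Delta Rule-2 update, written densely for readability.

    S_prev: (d_k, d_v) state; k, b, alpha: (d_k,); v, w: (d_v,)
    """
    d_k = k.shape[0]
    D = np.diag(alpha)                        # channel-wise decay D_t
    erase = np.eye(d_k) - np.outer(k, b * k)  # I - k_t (b_t * k_t)^T
    write = np.outer(k, w * v)                # k_t (w_t * v_t)^T
    return erase @ D @ S_prev + write

d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])            # unit-norm key
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([-1.0, 0.5, 4.0])
ones_k, ones_v = np.ones(d_k), np.ones(d_v)

# Store v_old under k (no decay, both gates fully open).
S1 = gdr2_step(np.zeros((d_k, d_v)), k, v_old, ones_k, ones_v, ones_k)

# Fully open erase gate: the old value is removed and replaced by v_new.
S_replace = gdr2_step(S1, k, v_new, ones_k, ones_v, ones_k)

# Closed erase gate: nothing is removed, so the two values accumulate.
S_accumulate = gdr2_step(S1, k, v_new, np.zeros(d_k), ones_v, ones_k)

read_replace = S_replace.T @ k        # equals v_new
read_accumulate = S_accumulate.T @ k  # equals v_old + v_new
print(read_replace, read_accumulate)
```

With the erase gate open, reading with the same key returns only the new value; with it closed, the read returns the sum of both writes. Intermediate gate values interpolate between the two, per key channel.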

When both gates collapse to the same scalar β_t, the update recovers KDA exactly. When the decay α_t also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update.
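The collapse can be checked numerically. The sketch below ties both gates to one scalar β and compares the result against a delta-rule step that uses that scalar directly, first with a channel-wise decay and then with a scalar decay. The function tied_beta_step is the scalar-β form implied by the paragraph above, written for this example; it is not code taken from the KDA or Gated DeltaNet releases.

```python
import numpy as np

def gdr2_step(S_prev, k, v, b, w, alpha):
    """Gated Delta Rule-2 step: channel-wise erase (b), write (w), decay (alpha)."""
    erase = np.eye(k.shape[0]) - np.outer(k, b * k)
    return erase @ (alpha[:, None] * S_prev) + np.outer(k, w * v)

def tied_beta_step(S_prev, k, v, beta, alpha):
    """Delta-rule step with one scalar beta shared by erase and write."""
    erase = np.eye(k.shape[0]) - beta * np.outer(k, k)
    return erase @ (alpha[:, None] * S_prev) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
beta = 0.7
b_tied, w_tied = np.full(d_k, beta), np.full(d_v, beta)

# Case 1: gates tied to beta, decay still channel-wise (KDA-style form).
alpha_vec = rng.uniform(0.5, 1.0, size=d_k)
matches_vector_decay = np.allclose(
    gdr2_step(S_prev, k, v, b_tied, w_tied, alpha_vec),
    tied_beta_step(S_prev, k, v, beta, alpha_vec),
)

# Case 2: decay also tied to one scalar (Gated DeltaNet-style form).
a = 0.9
scalar_form = a * (np.eye(d_k) - beta * np.outer(k, k)) @ S_prev + beta * np.outer(k, v)
matches_scalar_decay = np.allclose(
    gdr2_step(S_prev, k, v, b_tied, w_tied, np.full(d_k, a)),
    scalar_form,
)

print(matches_vector_decay, matches_scalar_decay)  # True True
```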

In the fast-weight view, the Gated Delta Rule-2 is one online gradient step on a local regression loss. The state stays close to the decayed memory, while the residual edit uses gated targets for both the read and the write.

Chunkwise training and a gate-aware backward pass

The recurrence admits a chunkwise WY form with the same structure as KDA’s. Each rank-one erase absorbs the cumulative channel-wise decay. The per-chunk update is then a product of asymmetric matrices of the form I − k̄_r ē_r^⊤. The implementation uses chunk size C = 64 with fused Triton kernels.
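One way to see how the decay can be absorbed into asymmetric erase factors is a change of variables S_t = Γ_t Ŝ_t, where Γ_t is the cumulative channel-wise decay. The sketch below compares a naive token-by-token scan with the decay-absorbed form on a toy chunk. The scaling used here (k̄ = k / γ, ē = γ ⊙ b ⊙ k) is our own derivation for illustration and may not match the paper's exact definitions; it is also not a WY factorization or the repository's Triton kernel, only a sequential reference.

```python
import numpy as np

def scan_reference(k, v, b, w, alpha):
    """Naive Gated Delta Rule-2 scan over T tokens, starting from a zero state."""
    T, d_k = k.shape
    S = np.zeros((d_k, v.shape[1]))
    for t in range(T):
        S = alpha[t][:, None] * S                   # D_t S_{t-1}
        read = (b[t] * k[t]) @ S                    # gated read, shape (d_v,)
        S = S + np.outer(k[t], w[t] * v[t] - read)  # erase and write along k_t
    return S

def scan_decay_absorbed(k, v, b, w, alpha):
    """Same scan after substituting S_t = Gamma_t @ S_hat_t (no explicit decay)."""
    T, d_k = k.shape
    gamma = np.cumprod(alpha, axis=0)               # cumulative decay per key channel
    S_hat = np.zeros((d_k, v.shape[1]))
    for t in range(T):
        k_bar = k[t] / gamma[t]                     # left factor of I - k_bar e_bar^T
        e_bar = gamma[t] * b[t] * k[t]              # right factor, differs from k_bar
        S_hat = S_hat + np.outer(k_bar, w[t] * v[t] - e_bar @ S_hat)
    return gamma[-1][:, None] * S_hat               # map back to the original state

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 3                               # toy chunk, not C = 64
k = rng.standard_normal((T, d_k))
k /= np.linalg.norm(k, axis=1, keepdims=True)
v = rng.standard_normal((T, d_v))
b = rng.uniform(0.0, 1.0, size=(T, d_k))
w = rng.uniform(0.0, 1.0, size=(T, d_v))
alpha = rng.uniform(0.5, 1.0, size=(T, d_k))

S_ref = scan_reference(k, v, b, w, alpha)
S_abs = scan_decay_absorbed(k, v, b, w, alpha)
print(np.allclose(S_ref, S_abs))  # True
```

Because the left and right factors are scaled in opposite directions, each erase matrix is no longer symmetric even when b is all ones, which is the asymmetry the text refers to. Dividing by the cumulative decay is only numerically safe over short spans, which is one reason to work chunk by chunk.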

KDA’s scalar shortcut does not apply in the backward pass. The write side carries a diagonal gate that varies across value channels, and the erase side carries a diagonal gate across key channels. The gate factors must therefore appear inside the dot products that accumulate gradients. The paper derives this gate-aware vector-Jacobian product explicitly. To avoid a Triton WGMMA assertion on Hopper GPUs, the fused WY forward kernel is restricted to two or four warps.

Hybrid block design

Gated DeltaNet-2 serves as the token mixer in a standard Transformer-style block. The query and key paths use a linear projection, a short causal convolution, L2 normalization, and SiLU. The value path uses a linear projection, SiLU, and a short convolution. The decay α_t, the erase gate b_t, and the write gate w_t come from separate linear branches. The recurrent output is RMS-normalized and multiplied by a SiLU output gate.

The hybrid version adds sliding-window attention (SWA) after the mixer. Each block consists of Gated DeltaNet-2, an MLP, SWA, and another MLP. SWA handles exact local interactions, while the recurrent mixer compresses the longer history. The hybrid keeps linear scaling with a bounded attention cache.

Results at 1.3B parameters

All models have 1.3B parameters and are trained on 100B FineWeb-Edu tokens. They are matched in parameter count and recurrent state size. Each recurrent state holds 262,144 floats per layer per batch element. The training sequence length is 4K tokens, and hybrid models use a 2K SWA window. Mamba-3 MIMO uses rank R = 4.

Gated DeltaNet-2 has the best average across language modeling and commonsense reasoning. In the recurrent setting it averages 53.11 over LAMBADA and the reasoning suite, ahead of Mamba-3 MIMO (52.39) and KDA (52.28). In the hybrid setting it averages 53.97 versus 52.72 for Mamba-3 MIMO. Because the recurrent state size is matched, the gains come from the update rule rather than from extra memory.

The clearest gains appear on RULER long-context retrieval. In the recurrent setting, S-NIAH-2 rises to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K climbs from 28.0 (KDA) to 37.8.

Gated DeltaNet-2 also leads on real-world retrieval (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP) in both settings. The recurrent average is 29.88 and the hybrid average is 42.28.

Marktechpost’s Visual Explainer

NVIDIA · 2026

Gated DeltaNet-2

Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates.

PyTorch
Triton kernels
1.3B parameters
100B FineWeb-Edu tokens

Step 01 · The Idea

Two gates instead of one

Linear attention compresses an unbounded KV cache into a fixed-size state. The hard part is editing that memory without scrambling existing associations.

The problem

Prior delta-rule models (Gated DeltaNet, KDA) tie erasing old content and writing new content to a single scalar gate β_t.

The fix

Split it: a channel-wise erase gate b_t on the key axis and a channel-wise write gate w_t on the value axis.

Erase gate: selects which key-side coordinates are read and erased.
Write gate: picks which value coordinates receive new content.
Channel-wise decay: keeps KDA’s fine-grained global forgetting.

Step 02 · The Update Rule

The Gated Delta Rule-2

With erase gate b_t ∈ [0,1]^{d_k}, write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Recovers KDA: when both gates collapse to the same scalar.
Recovers Gated DeltaNet: when the decay also reduces to a scalar.
Trains efficiently: a chunkwise WY form absorbs the channel-wise decay into asymmetric erase factors.

Step 03 · Get the Code

Clone and create the environment

The official PyTorch implementation ships with a Dockerfile, training scripts, and lit_gpt model definitions.

git clone https://github.com/NVlabs/GatedDeltaNet-2.git
cd GatedDeltaNet-2

# Build the environment from the Dockerfile
docker build -t gdn2 .
docker run --gpus all -it --ipc=host -v $PWD:/workspace gdn2

Repo layout

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py data & KV cache · paper/ ArXiv PDF

Step 04 · Launch Training

Run `pretrain.py`

This is a streamlined version of the README command. Replace the placeholders with your dataset paths and config name.

python ../pretrain.py \
  --train_data_dir ${TRAIN_DATA} \
  --val_data_dir ${VALIDATION_DATA} \
  --output_root ${SAVE_DIR} \
  --exp_name ${NAME} \
  --model_name ${MODEL} \
  --train_config ${CONFIG} \
  --eval_iters ${EVAL_ITERS} \
  --learning_rate ${LR} \
  --micro_batch_size ${MICRO_BATCH_SIZE}

Pro tip

Add --interactive_job --debug for an interactive debugging session.

Step 05 · Default Recipe

The 1.3B / 100B setup

All baselines (Mamba-2, Gated DeltaNet, KDA, and Mamba-3) are compared under identical optimizer settings and a matched recurrent state size.

Optimizer

AdamW · peak LR 4e-4 · weight decay 0.1 · gradient clip 1.0 · cosine schedule · 1B-token warmup.
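The schedule in this recipe can be sketched as a function of tokens seen. The peak learning rate, the 1B-token warmup, and the 100B-token horizon come from the recipe; the linear warmup shape and a final learning rate of zero are assumptions made for this example, since the recipe does not state them.

```python
import math

PEAK_LR = 4e-4
WARMUP_TOKENS = 1_000_000_000    # 1B-token warmup
TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
MIN_LR = 0.0                     # assumption: decay all the way to zero

def lr_at(tokens_seen):
    """Learning rate after `tokens_seen` tokens: warmup, then cosine decay."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: warmup is linear in tokens.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for tokens in (0, 500_000_000, WARMUP_TOKENS, 50_500_000_000, TOTAL_TOKENS):
    print(f"{tokens:>15,d} tokens -> lr {lr_at(tokens):.2e}")
```

Expressing the schedule in tokens rather than optimizer steps keeps it independent of how the global batch is split into micro-batches.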

Batch & Sequence

Global batch 0.5M tokens · sequence length 4K · hybrid models use a 2K sliding-window attention size.

Model Shape

16 heads · d_k = d_v = 128 · per-layer recurrent state of 262,144 floats, matched to Mamba-2/3.
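The state size follows directly from the model shape: each head holds one d_k × d_v matrix. The short calculation below reproduces the 262,144 figure. The byte count assumes fp32 storage, which is an assumption for illustration rather than something the recipe states.

```python
heads, d_k, d_v = 16, 128, 128

floats_per_head = d_k * d_v                 # one d_k x d_v matrix per head
floats_per_layer = heads * floats_per_head
print(floats_per_layer)                     # 262144

# Assumption: fp32 storage (4 bytes per float); other dtypes scale accordingly.
bytes_fp32 = floats_per_layer * 4
print(bytes_fp32 / 2**20)                   # 1.0 (MiB per layer per batch element)
```

Unlike a KV cache, this figure does not grow with sequence length.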

Hybrid block

Repeated cell: Gated DeltaNet-2 → MLP → SWA → MLP. SWA handles exact local interactions; the recurrent mixer compresses long histories.

Step 06 · Results

Paste these numbers into your comparison

Best average across language modeling and commonsense reasoning, with the largest gains on long-context retrieval.

Setting · Metric	KDA	Mamba-3 MIMO	GDN-2
Recurrent avg. (LMB + reasoning)	52.28	52.39	53.11
Hybrid avg. (LMB + reasoning)	52.68	52.72	53.97
S-NIAH-3 @2K (recurrent)	63.2	72.4	89.8
MK-NIAH-1 @4K (recurrent)	28.0	18.0	37.8
Recurrent avg. (real-world recall)	28.67	28.35	29.88
Hybrid avg. (real-world recall)	40.14	40.11	42.28
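For readers who want the margins rather than the raw scores, the short script below computes the GDN-2 gain over KDA and over the stronger of the two baselines for each row. The numbers are copied from the table above; the shortened row labels and the variable names are ours.

```python
# (KDA, Mamba-3 MIMO, GDN-2) scores copied from the results table.
results = {
    "LMB + reasoning avg. (recurrent)": (52.28, 52.39, 53.11),
    "LMB + reasoning avg. (hybrid)": (52.68, 52.72, 53.97),
    "S-NIAH-3 @2K (recurrent)": (63.2, 72.4, 89.8),
    "MK-NIAH-1 @4K (recurrent)": (28.0, 18.0, 37.8),
    "Real-world recall avg. (recurrent)": (28.67, 28.35, 29.88),
    "Real-world recall avg. (hybrid)": (40.14, 40.11, 42.28),
}

gain_vs_kda = {}
gain_vs_best = {}
for metric, (kda, mamba3, gdn2) in results.items():
    gain_vs_kda[metric] = round(gdn2 - kda, 2)
    gain_vs_best[metric] = round(gdn2 - max(kda, mamba3), 2)
    print(f"{metric}: {gain_vs_kda[metric]:+.2f} vs KDA, "
          f"{gain_vs_best[metric]:+.2f} vs best baseline")
```

The retrieval rows dominate: the S-NIAH-3 margin over KDA is 26.6 points, while the language modeling and reasoning averages move by about one point.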

Step 07 · Resources

Paper, code, and citation

Everything you need in one place to read, run, or cite Gated DeltaNet-2.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}


Key Takeaways

Gated DeltaNet-2 splits the scalar β_t into a channel-wise erase gate b_t (key axis) and a channel-wise write gate w_t (value axis).
When both gates collapse to the same scalar, the update recovers KDA; when the decay also collapses to a scalar, it recovers Gated DeltaNet.
Training uses a chunkwise WY form with asymmetric erase factors, a gate-aware backward pass, and fused Triton kernels.
At 1.3B parameters and 100B tokens, it has a better average than Mamba-2, Gated DeltaNet, KDA, and Mamba-3.
The largest gains come on RULER long-context retrieval: S-NIAH-3 at 2K rises 63.2 → 89.8 and MK-NIAH-1 at 4K rises 28.0 → 37.8 over KDA (recurrent).
