AI-trends.today

NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

Tech · By Gavin Wallace · 24/05/2026 · 10 Mins Read

Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. Decoding then runs in constant memory, and sequence mixing takes linear time. Deciding what to forget is not the hard part. The hard part is editing the compressed memory without destroying existing associations.

NVIDIA has released Gated DeltaNet-2, a model that decouples the active memory edit into two channel-wise gates. With 1.3B parameters and trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA and Mamba-3 on the benchmark suite.

The problem with scalar gates in delta-rule models

A linear attention layer stores a recurrent matrix state S_t, which the query then reads. DeltaNet adds an active edit: it subtracts the value currently associated with the key, and uses a scalar β_t to control how much to overwrite. Mamba-2 adds a data-dependent scalar decay α_t for global forgetting. Gated DeltaNet merged both operations. However, both gates were scalar for each head.

Kimi Delta Attention (KDA) refines the decay by replacing the scalar α_t with a channel-wise vector. KDA still uses a single scalar β_t for the active edit. That scalar controls two different things at once. It decides how much old content to erase on the key side. It also decides how much new content to commit on the value side. These two decisions act on different axes of the state. Tying them together is a modeling restriction, not a property of the delta rule.

https://github.com/NVlabs/GatedDeltaNet-2/blob/main/paper/GDN2_paper.pdf

Two gates in place of one: Gated Delta Rule-2

Gated DeltaNet-2 separates the two decisions with the Gated Delta Rule-2. It introduces a channel-wise erase gate b_t ∈ [0,1]^{d_k} on the key axis and a channel-wise write gate w_t ∈ [0,1]^{d_v} on the value axis. Both gates are produced by sigmoid projections of the token representation. In the update, the decay is applied to the previous state first, followed by the active edit.

Written in compact form, the recurrence is:

S_t = (I − k_t (b_t ⊙ k_t)⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)⊤

Here D_t = Diag(α_t) is the channel-wise decay carried over from KDA. The left factor of the erase matrix remains k_t, which preserves the delta-rule write direction. The right factor becomes b_t ⊙ k_t, which selects the channels used for the read. The write term is k_t z_t⊤ with z_t = w_t ⊙ v_t, which selects the channels for the value update.

When both gates collapse to the same scalar β_t, the update recovers KDA exactly. When the decay α_t also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update.
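To make the update and its tied special case concrete, here is a minimal pure-Python sketch of a single step. It is not the repository's fused Triton implementation. The function names (`sigmoid_gate`, `gdr2_step`, `tied_scalar_step`) and the list-of-rows state layout are assumptions made for illustration. `tied_scalar_step` is simply the recurrence with b_t and w_t both set to one scalar β.

```python
import math


def sigmoid_gate(x, weight):
    """Gate in [0, 1]: sigmoid of a linear projection.

    `weight` is a hypothetical projection matrix (one row per gate channel).
    """
    return [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
            for row in weight]


def gdr2_step(S, k, v, alpha, b, w):
    """One step of S_t = (I - k (b*k)^T) Diag(alpha) S_prev + k (w*v)^T.

    S is a d_k x d_v state stored as a list of rows.
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay along the key axis: D_t S_{t-1}.
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Gated read (b * k)^T (D_t S_{t-1}): one number per value channel.
    read = [sum(b[i] * k[i] * decayed[i][j] for i in range(d_k))
            for j in range(d_v)]
    # 3) Erase the read content and write the gated value, both along k.
    return [[decayed[i][j] + k[i] * (w[j] * v[j] - read[j])
             for j in range(d_v)] for i in range(d_k)]


def tied_scalar_step(S, k, v, alpha, beta):
    """Same recurrence with b = beta * 1 and w = beta * 1 (the tied case)."""
    d_k, d_v = len(k), len(v)
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    read = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    return [[decayed[i][j] + beta * k[i] * (v[j] - read[j])
             for j in range(d_v)] for i in range(d_k)]


# Usage: with tied gates, both functions agree up to floating-point rounding.
S0 = [[0.5, -1.0, 0.25], [2.0, 0.75, -0.5]]
k0, v0, alpha0 = [0.6, 0.8], [1.0, -2.0, 0.5], [0.9, 0.7]
general = gdr2_step(S0, k0, v0, alpha0, b=[0.3, 0.3], w=[0.3, 0.3, 0.3])
tied = tied_scalar_step(S0, k0, v0, alpha0, beta=0.3)
```

Passing different values per channel in `b` and `w` is what the tied form cannot express: for example, erasing fully on one key channel while committing only part of the new value on another value channel.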

Viewed as a fast-weight update, Gated Delta Rule-2 is one online gradient step on a local regression loss. The state stays close to the decayed memory, while the residual edit uses gated quantities for both the read and the write.
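Expanding the recurrence makes the residual-edit reading explicit. The lines below are plain algebra on the update rule, not the paper's derivation of the loss. The shorthand \tilde S_t for the decayed state is introduced here for convenience.

```latex
\begin{aligned}
S_t &= \bigl(I - k_t (b_t \odot k_t)^\top\bigr) D_t S_{t-1} + k_t (w_t \odot v_t)^\top,
\qquad \tilde S_t := D_t S_{t-1} \\
    &= \tilde S_t - k_t \bigl((b_t \odot k_t)^\top \tilde S_t\bigr) + k_t (w_t \odot v_t)^\top \\
    &= \tilde S_t + k_t \bigl(\, w_t \odot v_t \;-\; \tilde S_t^\top (b_t \odot k_t) \bigr)^\top
\end{aligned}
```

In this form, \tilde S_t^\top (b_t \odot k_t) is what the decayed memory returns for the gated key, w_t \odot v_t is the gated target, and their difference is written into the state along k_t.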

Chunkwise training and a gate-aware backward pass

The recurrence admits a chunkwise WY form similar to the one used by KDA. Each rank-one erase absorbs the cumulative channel-wise decay, so the per-chunk update is a product of asymmetric matrices of the form I − k̄_r ē_r⊤. The implementation uses chunk size C = 64 with fused Triton kernels.

KDA's scalar shortcut does not carry over to the backward pass. On the write side, the diagonal gate differs across value channels. On the erase side, the diagonal gate differs across key channels. The gate factors must therefore appear inside the dot products that accumulate gradients. The paper derives this gate-aware vector-Jacobian product explicitly. To avoid Triton WGMMA assertions on Hopper GPUs, the fused WY forward kernel is restricted to two and four warps.

Hybrid block design

Gated DeltaNet-2 serves as the token mixer in a standard Transformer-style block. The query and key paths use a linear projection, a short causal convolution, L2 normalization and SiLU. The value path uses a linear projection, SiLU and a short convolution. The decay α_t, the erase gate b_t and the write gate w_t come from separate linear branches. The recurrent output is RMS-normalized and multiplied by a SiLU output gate.

The hybrid variant adds sliding-window attention (SWA) after the mixer. Each repeated cell consists of Gated DeltaNet-2, an MLP, SWA and another MLP. SWA handles exact local interactions, while the recurrent mixer compresses long-range history. The hybrid keeps linear scaling with a bounded attention cache.
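The repeated cell and the bounded cache can be sketched in a few lines of Python. This only illustrates the layer ordering described here. The helper names and the number of cells are hypothetical, since the article does not state the model depth.

```python
# Layer order of one hybrid cell, as described in the article.
HYBRID_CELL = ("GatedDeltaNet-2", "MLP", "SWA", "MLP")


def build_layer_pattern(num_cells):
    """Repeat the four-layer hybrid cell `num_cells` times (depth is an assumption)."""
    return [layer for _ in range(num_cells) for layer in HYBRID_CELL]


def swa_cache_tokens(window, seq_len):
    """Tokens an SWA layer keeps in its cache: never more than the window."""
    return min(window, seq_len)


# Usage: two cells give eight layers, and a 2K window caps the cache
# no matter how long the sequence grows.
pattern = build_layer_pattern(2)
cache_at_long_context = swa_cache_tokens(window=2048, seq_len=100_000)
```

Because the SWA cache is capped by the window and the recurrent state has a fixed size, per-token decoding memory does not grow with sequence length.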

Results at 1.3B parameters

All models have 1.3B parameters and are trained on 100B FineWeb-Edu tokens. Parameter count and recurrent state size are matched across models. Each recurrent state holds 262,144 floats per layer and per batch element. The training length is 4K tokens, and hybrid models use a 2K SWA window. Mamba-3 MIMO uses rank R = 4.

Gated DeltaNet-2 has the best average across language modeling and commonsense reasoning. In the recurrent setting it averages 53.11 over LAMBADA and the reasoning suite, ahead of Mamba-3 MIMO (52.39) and KDA (52.28). In the hybrid setting it averages 53.97 versus 52.72 for Mamba-3 MIMO. Because the recurrent state size is matched, the gains come from the update rule rather than from extra memory.

The clearest gains appear on RULER long-context retrieval. In the recurrent setting, S-NIAH-2 rises to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K climbs from 28.0 (KDA) to 37.8.

Gated DeltaNet-2 also leads on real-world retrieval (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP) in both settings. The recurrent average is 29.88 and the hybrid average is 42.28.

Marktechpost’s Visual Explainer

NVIDIA · 2026

Gated DeltaNet-2

Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates.

PyTorch
Triton kernels
1.3B parameters
100B FineWeb-Edu tokens

Step 01 · The Idea

Two gates instead of one

Linear attention compresses an unbounded KV cache into a fixed-size state. The hard part is editing that memory without scrambling existing associations.

The problem

Prior delta-rule models (Gated DeltaNet, KDA) tie erasing old content and writing new content to a single scalar gate β_t.

The fix

Split it: a channel-wise erase gate b_t on the key axis and a channel-wise write gate w_t on the value axis.

  • Erase gate: selects which key-side coordinates are read and removed.
  • Write gate: selects which value-side coordinates receive new content.
  • Channel-wise decay: keeps KDA's fine-grained global forgetting.

Step 02 · The Update Rule

The Gated Delta Rule-2

With erase gate b_t ∈ [0,1]^{d_k}, write gate w_t ∈ [0,1]^{d_v} and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:

S_t = (I − k_t (b_t ⊙ k_t)⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)⊤

  • Recovers KDA when both gates collapse to the same scalar.
  • Recovers Gated DeltaNet when the decay also collapses to a scalar.
  • Trains efficiently with a chunkwise WY form, with channel-wise decay absorbed into asymmetric erase factors.

Step 03 · Get the Code

Clone and build the environment

The official PyTorch repo ships with a Dockerfile, training scripts and lit_gpt model definitions.

git clone https://github.com/NVlabs/GatedDeltaNet-2.git
cd GatedDeltaNet-2

# Build the Docker environment.
docker build -t gdn2 .
docker run --gpus all -it --ipc=host -v $PWD:/workspace gdn2

Repo layout

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py data & KV cache · paper/ ArXiv PDF

Step 04 · Launch Training

Run pretrain.py

This is a streamlined version of the README command. Replace the placeholders with your dataset paths and config name.

python pretrain.py \
  --train_data_dir ${TRAIN_DATA} \
  --val_data_dir ${VALIDATION_DATA} \
  --output_root ${SAVE_DIR} \
  --exp_name ${NAME} \
  --model_name ${MODEL} \
  --train_config ${CONFIG} \
  --eval_iters ${EVAL_ITERS} \
  --learning_rate ${LR} \
  --micro_batch_size ${MICRO_BATCH_SIZE}

Pro tip

Add --interactive_job --debug for an interactive debugging session.

Step 05 · Default Recipe

The 1.3B / 100B setup

The Mamba-2, Gated DeltaNet, KDA and Mamba-3 baselines are compared under identical optimizer settings and a matched recurrent state size.

Optimizer

AdamW · peak LR 4e-4 · weight decay 0.1 · gradient clip 1.0 · cosine schedule · 1B-token warmup.

Batch & Sequence

Global batch 0.5M tokens · sequence length 4K · hybrid models use a 2K sliding-window attention size.
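The optimizer and batch settings translate into step counts and a learning-rate curve. The sketch below is a rough reading of the recipe, not the repository's scheduler. It assumes "0.5M tokens" means exactly 500,000 tokens per step, a linear warmup shape, and a cosine decay to a floor of 0, none of which the article states.

```python
import math

PEAK_LR = 4e-4                     # from the recipe
TOTAL_TOKENS = 100_000_000_000     # 100B training tokens
WARMUP_TOKENS = 1_000_000_000      # 1B-token warmup
TOKENS_PER_STEP = 500_000          # assumption: "0.5M" read as exactly 500,000
MIN_LR = 0.0                       # assumption: no floor is given in the article

TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP
WARMUP_STEPS = WARMUP_TOKENS // TOKENS_PER_STEP


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed shapes)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Usage: inspect the schedule at a few points.
schedule_preview = {s: lr_at(s) for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS)}
```

Under these assumptions the run has 200,000 optimizer steps, of which the first 2,000 are warmup.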

Model Shape

16 heads · d_k = d_v = 128 · per-layer recurrent state of 262,144 floats, matched to Mamba-2/3.
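The state size follows directly from the model shape: one d_k × d_v matrix per head. A quick check, with hypothetical helper names and a byte count that assumes 4-byte floats (the article does not state the state dtype):

```python
NUM_HEADS = 16
D_K = 128
D_V = 128


def recurrent_state_floats(num_heads, d_k, d_v):
    """Floats in one layer's recurrent state: one d_k x d_v matrix per head."""
    return num_heads * d_k * d_v


def state_bytes(num_floats, bytes_per_float):
    """Memory for the state at a given precision (precision is an assumption)."""
    return num_floats * bytes_per_float


floats_per_layer = recurrent_state_floats(NUM_HEADS, D_K, D_V)  # 16 * 128 * 128
fp32_bytes_per_layer = state_bytes(floats_per_layer, 4)
```

This count is per layer and per batch element, and it stays the same at any sequence length.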

Hybrid block

Repeated cell: Gated DeltaNet-2 → MLP → SWA → MLP. SWA handles local interactions, while the recurrent mixer compresses history.

Step 06 · Results

Paste these numbers into your comparison

Best average across language modeling and commonsense reasoning, with the largest gains in long-context retrieval.

Setting · Metric | KDA | Mamba-3 MIMO | GDN-2
Recurrent avg. (LMB + reasoning) | 52.28 | 52.39 | 53.11
Hybrid avg. (LMB + reasoning) | 52.68 | 52.72 | 53.97
S-NIAH-3 @2K (recurrent) | 63.2 | 72.4 | 89.8
MK-NIAH-1 @4K (recurrent) | 28.0 | 18.0 | 37.8
Recurrent avg. (real-world recall) | 28.67 | 28.35 | 29.88
Hybrid avg. (real-world recall) | 40.14 | 40.11 | 42.28
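For readers who want the margins rather than the raw scores, the short script below computes how far GDN-2 sits above the stronger of the two baselines on each row. The numbers are copied from the table above. The dictionary layout and function name are just one convenient way to hold them.

```python
# Scores copied from the table above, in the order (KDA, Mamba-3 MIMO, GDN-2).
RESULTS = {
    "Recurrent avg. (LMB + reasoning)": (52.28, 52.39, 53.11),
    "Hybrid avg. (LMB + reasoning)": (52.68, 52.72, 53.97),
    "S-NIAH-3 @2K (recurrent)": (63.2, 72.4, 89.8),
    "MK-NIAH-1 @4K (recurrent)": (28.0, 18.0, 37.8),
    "Recurrent avg. (real-world recall)": (28.67, 28.35, 29.88),
    "Hybrid avg. (real-world recall)": (40.14, 40.11, 42.28),
}


def margin_over_best_baseline(kda, mamba3, gdn2):
    """GDN-2 score minus the stronger baseline, rounded to two decimals."""
    return round(gdn2 - max(kda, mamba3), 2)


MARGINS = {name: margin_over_best_baseline(*scores) for name, scores in RESULTS.items()}

for name, margin in MARGINS.items():
    print(f"{name}: +{margin}")
```

On the two averaged language-modeling rows the margin is around one point, while the two RULER rows show much larger gaps, which matches the article's reading of where the update rule helps most.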

Step 07 · Resources

Paper, code and citation

Everything you need in one place to run, read or cite Gated DeltaNet-2.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}

MARKTECHPOST  ·  The hub for AI research, dev tools, and model launches

Key Takeaways

  • Gated DeltaNet-2 splits the scalar β_t into a channel-wise erase gate b_t (key axis) and a channel-wise write gate w_t (value axis).
  • When both gates collapse to the same scalar, the update recovers KDA. When the decay also collapses to a scalar, it recovers Gated DeltaNet.
  • Training uses a chunkwise WY form with asymmetric erase factors and a gate-aware backward pass, implemented in fused Triton kernels.
  • At 1.3B parameters, it has a better average than Mamba-2, Gated DeltaNet, KDA and Mamba-3.
  • The largest gains come on RULER long-context retrieval: S-NIAH-3 at 2K rises 63.2 → 89.8 and MK-NIAH-1 at 4K rises 28.0 → 37.8 over KDA (recurrent).
