
NVIDIA AI KVzap: A method for SOTA KV Cache pruning that delivers near-lossless 2x-4x compression

Tech · By Gavin Wallace · 15/01/2026 · 7 Mins Read

The KV cache becomes a deployment bottleneck for transformer decoders once context lengths grow to tens or hundreds of thousands of tokens. The cache holds keys and values for every layer and head, with shape (2, L, H, T, D). For a model like Llama-1-65B, the cache can reach 335 GB at a 128k-token context.
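As a rough sanity check on that figure, here is a minimal back-of-the-envelope sketch, assuming a Llama-1-65B-like shape (80 layers, 64 key-value heads, head dimension 128) stored in fp16:

```python
# Back-of-the-envelope KV-cache size; the shape follows the (2, L, H, T, D) layout above.
def kv_cache_bytes(layers, kv_heads, tokens, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * tokens * head_dim * dtype_bytes  # keys + values

# Assumed Llama-1-65B-like shape: 80 layers, 64 KV heads, head dim 128, fp16.
size_gb = kv_cache_bytes(layers=80, kv_heads=64, tokens=128_000, head_dim=128) / 1e9
print(f"{size_gb:.0f} GB")  # ≈ 335 GB at a 128k-token context
```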

https://arxiv.org/pdf/2601.07891

Architectural compression leaves the sequence axis untouched

Production models already compress the cache along several axes. Grouped Query Attention shares keys and values across several query heads, reaching compression factors as high as 16 in Qwen3-235B-A22B and 12 in GLM-4.5. DeepSeek-V2 compresses the key and value dimension itself with Multi-head Latent Attention. Hybrid models mix in sliding-window attention or state-space layers, which reduces the number of layers that need to keep a full cache.
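As a quick illustration, the GQA compression factor is simply the ratio of query heads to key-value heads. The head counts below are assumptions chosen to reproduce the factors cited above, not values taken from the paper:

```python
# GQA compression factor = query heads / key-value heads (illustrative head counts).
configs = {
    "Qwen3-235B-A22B (assumed 64 Q / 4 KV heads)": (64, 4),   # -> 16x
    "GLM-4.5 (assumed 96 Q / 8 KV heads)": (96, 8),           # -> 12x
}
for name, (q_heads, kv_heads) in configs.items():
    print(name, q_heads // kv_heads)
```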

None of these techniques compress along the sequence axis. Sparse and retrieval-style attention reads only a portion of the cache at each decoding step, yet every token still occupies memory. Serving long contexts therefore requires techniques that evict cache entries with little effect on future tokens.

NVIDIA's KVpress project collects more than 20 such methods in a single codebase and exposes them through a public leaderboard on Hugging Face, so that methods such as Expected Attention, H2O, Compactor, DuoAttention, and KVzip are evaluated consistently.

KVzip and KVzip+ as the scoring oracle

KVzip is currently the strongest cache-pruning baseline on the KVpress leaderboard. It defines the importance of each cache entry through a repetition pretext task: the model is run on an extended prompt in which it is asked to reproduce the original context exactly. The score of each token position in the original prompt is the largest attention that any position in the repeated segment assigns back to that token, taken across all heads sharing the key-value entry under grouped-query attention. Low-scoring entries are evicted until the cache meets the budget.

KVzip+ refines this score: each attention weight is multiplied by the norm of the value's contribution to the residual stream and normalized by the norm of the hidden state. This better reflects how much each token actually changes the residual stream.
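A minimal sketch of these two scoring rules, assuming attention weights from the repeated segment back to the original context have already been gathered; tensor names and shapes are illustrative, not the paper's reference code:

```python
import torch

def kvzip_scores(attn):
    # attn: (kv_group_heads, repeat_len, ctx_len) attention from the repeated
    # segment's query positions back to the original context tokens.
    # KVzip: a context token's score is the largest attention that any repeated
    # position (and any head sharing the KV entry) pays back to it.
    return attn.amax(dim=(0, 1))                       # (ctx_len,)

def kvzip_plus_scores(attn, value_out_norm, query_hidden_norm):
    # KVzip+: weight each attention entry by the norm of the value's
    # contribution to the residual stream, normalized by the querying
    # position's hidden-state norm, before taking the same max.
    # value_out_norm: (ctx_len,), query_hidden_norm: (repeat_len,)
    weighted = attn * value_out_norm[None, None, :] / query_hidden_norm[None, :, None]
    return weighted.amax(dim=(0, 1))                   # (ctx_len,)
```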

These oracle scores are effective but expensive. KVzip is slow because it requires prefilling the extended prompt, and the scoring assumes a fixed prompt, so it cannot be run during decoding.

https://arxiv.org/pdf/2601.07891

KVzap: a surrogate model on hidden states

KVzap is a surrogate scoring model that operates directly on hidden states. For each transformer layer and each sequence position t, the module receives the hidden vector hₜ and outputs a predicted log-score for every key-value head. There are two architectural variants: a single linear layer, and an MLP whose hidden width is about one eighth of the model dimension, with a GELU activation.
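A minimal sketch of the two variants; class names and the exact one-eighth width are assumptions based on the description above:

```python
import torch.nn as nn

class LinearScorer(nn.Module):
    """Thin linear variant: hidden state -> one predicted log-score per KV head."""
    def __init__(self, hidden_size, num_kv_heads):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_kv_heads)

    def forward(self, h):            # h: (batch, seq, hidden_size)
        return self.proj(h)          # (batch, seq, num_kv_heads)


class MLPScorer(nn.Module):
    """Shallow MLP variant with a hidden width of roughly hidden_size / 8 and GELU."""
    def __init__(self, hidden_size, num_kv_heads, shrink=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // shrink),
            nn.GELU(),
            nn.Linear(hidden_size // shrink, num_kv_heads),
        )

    def forward(self, h):
        return self.net(h)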

Training prompts are taken from the Nemotron pretraining dataset. The researchers filter 27k prompts to lengths between 500 and 1,250 tokens, then sample 500 token positions per prompt. This yields a training set of about 1.2 million pairs per key-value head and a validation set of about 23k pairs. The surrogate is trained by regression to predict the log of the KVzip+ score from the hidden state; the squared Pearson correlation between predictions and oracle scores ranges from roughly 0.63 to 0.77.
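A hypothetical training sketch, reusing the MLPScorer sketch above; `train_loader` stands in for the Nemotron-derived (hidden state, oracle log-score) pairs and is not part of the released code:

```python
import torch
import torch.nn.functional as F

# Assumed dimensions for illustration only.
scorer = MLPScorer(hidden_size=4096, num_kv_heads=8)
opt = torch.optim.AdamW(scorer.parameters(), lr=1e-3)

for hidden_states, oracle_log_scores in train_loader:   # (B, T, d), (B, T, H_kv)
    pred = scorer(hidden_states)                         # predicted log-scores
    loss = F.mse_loss(pred, oracle_log_scores)           # plain regression to the oracle
    opt.zero_grad()
    loss.backward()
    opt.step()
```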

https://arxiv.org/pdf/2601.07891

Sliding window, score threshold, and negligible overhead

At inference time, KVzap's scorer runs on the hidden states and produces a score per cache entry. Entries below a threshold are evicted, while a sliding window of the 128 most recent tokens is always kept. The team provides a PyTorch-style function that applies the scorer, keeps the local window, and returns the compressed key and value tensors. All experiments apply pruning after the attention operation.
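A loose sketch of that pruning step under the description above, not the authors' exact routine; per-head ragged eviction is simplified here to keeping any position retained by at least one head:

```python
import torch

def prune_kv(keys, values, scores, threshold, window=128):
    """
    keys, values: (num_kv_heads, seq_len, head_dim)
    scores:       (seq_len, num_kv_heads) predicted log-scores from the surrogate
    """
    keep = scores.transpose(0, 1) >= threshold   # (num_kv_heads, seq_len)
    keep[:, -window:] = True                     # always keep the recent local window
    keep_any = keep.any(dim=0)                   # keep a rectangular cache for simplicity
    return keys[:, keep_any], values[:, keep_any]
```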

KVzap relies on score thresholding instead of fixed top-k selection, so the same threshold produces different compression ratios across benchmarks and even across prompts within the same benchmark. The researchers report up to a 20 percent difference in compression ratio between prompts at a fixed threshold, which reflects differences in information density.

The compute overhead is minimal. A layer-level analysis shows the MLP variant adds roughly 1% extra FLOPs and the linear variant about 0.02%, with relative memory overheads of the same order. At long context lengths, the quadratic cost of attention dominates and the extra FLOPs are negligible.

https://arxiv.org/pdf/2601.07891

Results on RULER, LongBench, and AIME25

KVzap is evaluated on long-context and reasoning benchmarks using Qwen3-8B and Llama-3.1-8B-Instruct. RULER measures long-context behavior with synthetic tasks at sequence lengths from 4k up to 128k. LongBench covers real-world documents across multiple task categories. AIME25 is a math-reasoning benchmark of 30 Olympiad-style problems evaluated at pass@1 and pass@4.

On RULER, KVzap matches the full-cache baseline within a very small margin while removing a significant fraction of the cache. The best KVzap settings for Qwen3-8B remove more than 70 percent of entries on RULER at 4k and 16k while keeping the average score within a few tenths of a point. Llama-3.1-8B-Instruct and Qwen3-32B show similar behavior.

LongBench uses the same thresholds, but the compression ratios are lower because real documents contain less repetition. KVzap stays close to the full-cache baseline up to roughly 2x to 3x compression, while fixed-budget methods such as Expected Attention degrade more on several subsets as compression increases.

On AIME25, the KVzap MLP variant at nearly 2x compression maintains or even improves pass@4 accuracy, so the method remains useful while discarding half of the cache. Performance only collapses under extremely aggressive settings, such as the linear variant with high thresholds that evict more than 90% of the entries.

https://arxiv.org/pdf/2601.07891

Overall, the table above shows that KVzap's best configuration for each model delivers an average cache compression between 2.7x and 3.5x while keeping task scores close to the full-cache baseline across RULER, LongBench, and AIME25.

Key Takeaways

  1. KVzap is an input-adaptive approximation of KVzip+: small per-layer surrogates, either a thin linear layer or a shallow MLP, learn to estimate the oracle KV scores from hidden states, and low-scoring key-value pairs are then pruned.
  2. Training uses Nemotron pretraining prompts with KVzip+ as supervision; each key-value head yields roughly 1.2 million examples, and the squared correlation between predicted and oracle scores falls in the range of about 0.6 to 0.8.
  3. KVzap prunes with a global score threshold plus a sliding window that preserves recent tokens, so compression adapts automatically to each prompt's information density; the team reports up to a 20 percent difference in achieved compression across prompts at the same threshold.
  4. KVzap achieves 2x to 4x KV-cache compression with Qwen3-8B and Llama-3.1-8B-Instruct on RULER and LongBench while keeping accuracy very close to the full cache, and sets state-of-the-art tradeoffs on the NVIDIA KVpress leaderboard.
  5. KVzap is released as an open-source framework with ready-to-use checkpoints on Hugging Face, which makes it straightforward to integrate into long-context LLM serving stacks.

Check out the Paper and the GitHub repo for full details.

