The Top 10 Compression techniques for LLM inference using KV cache: Reduced memory overhead across evictions, low-rank methods, and quantization

Tech | By Gavin Wallace | 29/04/2026 | 10 Mins Read

The key-value (KV) cache has emerged as a memory bottleneck in production inference systems as large language models scale to longer context windows and more simultaneous users. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the KV cache can take up 180 GB. As another example, the weights of a 7-billion-parameter model consume roughly 14 GB of GPU memory, yet its KV cache can require as much as 72 GB.
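As a back-of-the-envelope check, the cache size falls directly out of the model configuration. The Python sketch below uses an assumed 7B-class configuration (32 layers, 32 KV heads of dimension 128, fp16 cache); these numbers are illustrative rather than taken from the article, but they land in the same ballpark as the 72 GB figure above.

# Rough KV cache sizing; the layer/head/dimension values are assumptions
# for a generic 7B-class model, not figures from the article.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Two tensors (keys and values) per layer, per head, per token, per sequence.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=1024, batch_size=128) / 1e9
print(f"KV cache ~= {gb:.0f} GB")  # about 69 GB for this hypothetical configuration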

Compressing the KV cache reduces memory pressure, allows larger batch sizes, and improves throughput, all without retraining the base model. Over the last two years, research has produced several distinct compression strategies. This article walks through the top ten, focusing on how each works and where it fits in an inference pipeline.

Token Eviction using H2O

H2O, published at NeurIPS 2023, is one of the foundational token eviction methods. It is built on the observation that during generation, a small number of tokens account for most of the attention mass; these are the "Heavy Hitters" (H2). H2O keeps a balanced mix of heavy hitters and recent tokens within a fixed cache budget per Transformer layer, selecting tokens by their cumulative attention scores averaged over all queries.
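A minimal sketch of this selection rule for a single attention head is shown below; it is illustrative only, not the authors' implementation, and the cache budget and recent-window size are assumed hyperparameters.

import numpy as np

# Illustrative H2O-style eviction for one head. `attn` holds the attention
# weights accumulated so far, shape [num_queries, num_keys].
def h2o_keep_mask(attn, budget=64, recent=16):
    num_keys = attn.shape[1]
    scores = attn.sum(axis=0)              # cumulative attention per cached token
    keep = np.zeros(num_keys, dtype=bool)
    keep[-recent:] = True                  # always keep the recent window
    heavy = np.argsort(scores[:-recent])[::-1][:budget - recent]
    keep[heavy] = True                     # keep the heavy hitters
    return keep

mask = h2o_keep_mask(np.random.rand(8, 256))
print(mask.sum(), "of 256 KV entries retained")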

Because the distribution of attention weights follows a power law, evicting low-scoring tokens costs little accuracy. H2O operates in the decoding phase and does not reduce prefill computation, which remains a problem for long-context prompts. With 20% heavy hitters, H2O improves throughput over Hugging Face Accelerate by up to 29× on OPT-6.7B and OPT-30B.

Attention Sink Retention (StreamingLLM)

StreamingLLM is designed for settings where LLMs must handle extremely long, even effectively infinite, input streams. Its strategy is to keep the KV states of the initial tokens, which act as attention sinks, and to combine them with a sliding window of recent tokens within the memory budget.

The underlying insight is that, regardless of their semantic content, the initial tokens act as structural anchors that receive a disproportionate share of attention throughout generation. Dropping them sharply degrades output quality, whereas preserving them alongside a recent window keeps generation stable. The method is fast and hardware friendly, but it does not score tokens by importance, so semantically important tokens in the middle of the context can be discarded. It is best suited to streaming dialogue applications where recent context matters most.
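The retention policy is simple enough to sketch in a few lines; the sink count and window size below are assumptions, not values from the paper.

import numpy as np

# Sketch of StreamingLLM-style retention: a handful of initial "sink" tokens
# plus a sliding window of recent tokens. Sizes here are illustrative.
def streaming_keep_indices(seq_len, num_sinks=4, window=1020):
    sinks = np.arange(min(num_sinks, seq_len))
    recent = np.arange(max(seq_len - window, num_sinks), seq_len)
    return np.concatenate([sinks, recent])

idx = streaming_keep_indices(seq_len=8192)
print(len(idx), "cached positions out of 8192")  # 4 sinks + 1020 recent = 1024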

SnapKV (Observation-Window Compression)

SnapKV targets the long-prompt scenario, i.e., the prefill stage. A small observation window at the end of the prompt predicts token importance: the attention scores from queries in this window are aggregated to vote for important positions (the heavy hitters) in the prefix.

SnapKV uses attention-based pooling to select important clusters of KV entries per head, whereas H2O relies on a single cumulative score over the whole sequence. Selecting per head makes SnapKV more accurate at the same budget. It is now a standard baseline for prefill-phase compression and is directly comparable to H2O on the LongBench benchmarks.
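The sketch below illustrates the voting-and-pooling idea for a single head; the window size, budget, and pooling width are assumed values, and the reference implementation operates per head inside the attention kernel rather than on a full attention matrix.

import numpy as np

# Illustrative SnapKV-style selection for one head during prefill.
# attn: [num_queries, num_keys] attention weights from the prefill pass.
def snapkv_select(attn, obs_window=32, budget=256, pool=7):
    votes = attn[-obs_window:].sum(axis=0)             # window queries vote
    kernel = np.ones(pool) / pool
    pooled = np.convolve(votes, kernel, mode="same")   # smooth so clusters survive
    prefix_len = attn.shape[1] - obs_window
    keep_prefix = np.sort(np.argsort(pooled[:prefix_len])[::-1][:budget])
    window_idx = np.arange(prefix_len, attn.shape[1])  # the window itself is kept
    return np.concatenate([keep_prefix, window_idx])

attn = np.random.rand(1024, 1024)
print(len(snapkv_select(attn)), "KV positions kept out of 1024")  # 256 + 32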

PyramidKV / PyramidInfer (Layer-Wise Pyramidal Allocation)

The main limitation of SnapKV and H2O is that they apply a uniform compression budget to every Transformer layer. PyramidKV addresses this by allocating a different cache size per layer according to its attention pattern. The complementary method, PyramidInfer, extends the idea to the prefill phase.

PyramidInfer measures the consistency of attention weights over recent tokens to determine which keys and values matter most. Rather than pruning a pre-computed cache, it saves memory by computing fewer keys and values at deeper layers during prefill. Experimental results show PyramidInfer improves throughput by 2.2× compared to Hugging Face Accelerate, with over 54% GPU memory reduction in the KV cache.

The intuition is that information funnels through the Transformer: early layers attend broadly across the sequence, while deeper layers concentrate on a small set of salient tokens. Allocating compression budgets in proportion to each layer's information density is therefore more effective than a uniform budget.
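A simple way to picture this is a budget schedule that shrinks with depth. The sketch below uses a linear schedule with assumed endpoint budgets; the papers derive their allocations from measured attention statistics rather than a fixed rule.

# Illustrative pyramidal budget schedule: large KV budgets in early layers,
# small ones in deep layers. Endpoints are assumptions, not values from the papers.
def pyramid_budgets(num_layers, max_budget=2048, min_budget=128):
    step = (max_budget - min_budget) / (num_layers - 1)
    return [round(max_budget - i * step) for i in range(num_layers)]

budgets = pyramid_budgets(32)
print(budgets[0], budgets[16], budgets[31])  # 2048, 1057, 128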

KV Cache Quantization — KIVI

KIVI, published at ICML 2024, is a 2-bit KV cache quantization algorithm that requires no fine-tuning. It quantizes the key cache per channel and the value cache per token.

This asymmetric scheme is motivated by the observed distributions: keys have large channel-wise outliers, while values are better represented per token. With this hardware-friendly design, KIVI lets models including Llama-2, Falcon, and Mistral maintain comparable generation quality while reducing combined peak memory (model weights plus KV cache) by 2.6×, enabling up to 4× larger batch sizes and 2.35× to 3.47× higher throughput on real inference workloads. The 2.6× figure covers weights and cache together; at 2-bit precision the KV cache reduction itself is more aggressive, and it is this reduction that drives the batch-size scaling.
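The asymmetric layout is easy to illustrate with plain min-max quantization; the sketch below is not KIVI's actual kernel (which also keeps a small full-precision residual window and packs the 2-bit values), only the per-channel versus per-token distinction.

import numpy as np

# Illustrative asymmetric 2-bit quantization: keys per channel, values per token.
def quantize(x, bits, axis):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((x - lo) / scale)
    return q, scale, lo                       # dequantize with q * scale + lo

K = np.random.randn(1024, 128)                # [tokens, channels]
V = np.random.randn(1024, 128)
qK, sK, zK = quantize(K, bits=2, axis=0)      # key stats shared along the token axis
qV, sV, zV = quantize(V, bits=2, axis=1)      # value stats shared along the channel axis
print(np.abs(qK * sK + zK - K).mean())        # mean reconstruction error of the keys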

KVQuant (Calibrated Mixed-Precision Quantization)

Where KIVI uses a fixed asymmetric scheme, KVQuant takes a multi-component, calibration-based approach to low-bit KV cache quantization. It combines per-channel key quantization, pre-RoPE key quantization (quantizing keys before the positional embeddings distort their distribution), non-uniform sensitivity-weighted quantization that derives the quantization levels from calibration data rather than fixed grids, and dense-and-sparse decomposition that handles extreme outliers separately from the bulk distribution.
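Of these components, the dense-and-sparse decomposition is the easiest to illustrate in isolation. The sketch below uses an assumed outlier fraction and plain uniform quantization for the bulk, whereas KVQuant derives non-uniform levels from calibration data.

import numpy as np

# Illustrative dense-and-sparse decomposition: keep rare outliers in full
# precision, quantize only the bulk of the distribution at low bit width.
def dense_and_sparse(x, outlier_frac=0.01, bits=3):
    thresh = np.quantile(np.abs(x), 1 - outlier_frac)
    outliers = np.abs(x) > thresh
    bulk = np.where(outliers, 0.0, x)
    lo, hi = bulk.min(), bulk.max()
    scale = (hi - lo) / (2**bits - 1)
    dequant = np.round((bulk - lo) / scale) * scale + lo
    dequant[outliers] = x[outliers]           # outliers stored exactly (the sparse part)
    return dequant

K = np.random.standard_cauchy(size=(1024, 128))   # heavy-tailed, outlier-prone data
print(np.abs(dense_and_sparse(K) - K).mean())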

Together these components let KVQuant push below 4-bit precision with better accuracy than fixed schemes. It is aimed at deployments that must accommodate extremely long contexts (the paper evaluates context lengths up to 10,000,000 tokens). In production systems with stable workloads, the calibration cost can be amortized.

TurboQuant (Near-Optimal Online KV Cache Quantization)

TurboQuant is the most recent work in this area from Google Research, accepted at ICLR. It addresses a known weakness of earlier quantization methods: MSE-optimal quantizers introduce a systematic bias in inner-product estimates, and that bias compounds through the attention computation. TurboQuant solves the problem in two stages.

The first stage, PolarQuant (AISTATS 2026), applies a random orthogonal rotation to every key and value vector before quantization. The rotation spreads variance evenly across coordinates without changing the mathematical content, so each coordinate can then be quantized accurately with an analytically derived scalar quantizer; no training or calibration is needed. The second stage applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) correction to the quantization residual, which yields an unbiased inner-product estimator. Together, the two stages achieve at least 6× memory reduction and up to 8× faster attention computation on NVIDIA H100 GPUs at 3-bit precision, operating within a factor of roughly 2.7 of the information-theoretic limit. Because the rotation is a random matrix rather than a learned one, TurboQuant can be applied to any model at inference time without offline preparation.
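The rotate-then-quantize idea of the first stage can be sketched with a random orthogonal matrix and a plain scalar quantizer; the QJL residual correction of the second stage is omitted here, and the bit width and dimensions are assumptions.

import numpy as np

# Illustrative rotate-then-quantize step: an orthogonal rotation spreads variance
# evenly across coordinates before scalar quantization, and is undone afterwards.
rng = np.random.default_rng(0)
d = 128
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix

def rotate_and_quantize(x, bits=3):
    r = x @ R                                       # rotation preserves inner products
    lo, hi = r.min(), r.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((r - lo) / scale)
    return (q * scale + lo) @ R.T                   # dequantize, rotate back

keys = rng.standard_normal((1024, d))
print(np.abs(rotate_and_quantize(keys) - keys).mean())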

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)

MQA and GQA are architectural modifications that shrink the KV cache by design rather than compressing an existing one. MQA shares a single key-value head across all query heads. GQA maps groups of query heads onto a shared set of key-value heads, sitting between standard multi-head attention (one KV head per query head) and MQA. Both require training from scratch or fine-tuning; applying them to an already-trained model without such training degrades performance.
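The cache saving falls directly out of the head counts. The sketch below uses assumed head counts and head dimension to show the grouping and the resulting reduction.

# Illustrative per-layer KV cache sizing under MHA, GQA, and MQA.
# Head counts and head_dim are assumptions for a generic model.
def kv_entries_per_layer(n_kv_heads, head_dim, seq_len):
    return 2 * n_kv_heads * head_dim * seq_len      # keys + values

mha = kv_entries_per_layer(32, 128, 8192)           # one KV head per query head
gqa = kv_entries_per_layer(8, 128, 8192)            # 32 query heads share 8 KV heads
mqa = kv_entries_per_layer(1, 128, 8192)            # all query heads share one KV head
print(mha // gqa, mha // mqa)                       # 4x and 32x smaller cache

# Under GQA, query head q reads KV head q // (n_q_heads // n_kv_heads):
print([q // (32 // 8) for q in range(32)])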

GQA is now the standard for open-weight LLMs. In Llama 2, only the 70B model used GQA; the 7B and 13B variants used standard multi-head attention. Llama 3 extended GQA to both the 8B and 70B sizes. Mistral adopted GQA with its first 7B release in September 2023. GQA has become a standard expectation for practitioners selecting and deploying new model families.

Multi-Head Latent Attention (MLA) — DeepSeek

MLA is DeepSeek’s answer to KV cache memory, first introduced in DeepSeek-V2 in May 2024 and carried forward in DeepSeek-V3 and DeepSeek-R1. It is an attention mechanism with low-rank key-value compression: instead of storing full-dimensional key and value tensors for every token, MLA projects them into a compressed latent vector that is cached during inference.
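A stripped-down sketch of the caching idea is below: only a small latent vector is cached per token, and per-head keys and values are reconstructed by up-projection at attention time. The dimensions are assumptions, and DeepSeek's actual architecture also handles the RoPE components separately, which is omitted here.

import numpy as np

# Illustrative MLA-style caching: cache a compressed latent per token instead of
# full per-head keys and values. Dimensions are assumed, not DeepSeek's.
d_model, d_latent, n_heads, head_dim = 4096, 512, 32, 128
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # shared down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02

h = rng.standard_normal((1, d_model))             # hidden state of one new token
latent = h @ W_down                               # this is all that enters the cache
k = (latent @ W_up_k).reshape(n_heads, head_dim)  # reconstructed at attention time
v = (latent @ W_up_v).reshape(n_heads, head_dim)
print(latent.size, "cached values per token vs", 2 * n_heads * head_dim, "for MHA")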

Of the techniques on this list, MLA reports the strongest results. DeepSeek-V2 reduces the KV cache by over 93% compared with the earlier 67B dense DeepSeek model. This is not a marginal improvement; it fundamentally changes the memory economics of serving large models, enabling significantly longer context windows and larger batch sizes on the same hardware. MLA has also been shown to offer consistently greater expressive power than GQA at the same KV cache budget, which gives these empirical gains a theoretical basis. Among architectural methods, MLA has the strongest validation at scale.

Low-Rank KV Cache Compression (Palu / LoRC)

Low-rank compression targets the hidden dimension of the KV cache rather than the bit width or the sequence length. Palu is a post-training KV cache compression framework that applies low-rank projections to the key and value matrices, using a medium-grained decomposition over grouped heads that balances accuracy against reconstruction overhead.

Other methods in this family, including LoRC, SVDq, CSKV, and ReCalKV, build on the observation that key and value matrices across attention heads have significant low-rank structure, particularly for longer contexts. Low-rank methods are orthogonal to quantization and token eviction, so they can be combined with them for compounded compression. The family remains relatively unexplored compared with eviction-based methods, which makes it an active research area.
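The sketch below shows the basic mechanic shared by this family: factorize a key (or value) projection with a truncated SVD and cache the low-rank activations, reconstructing the full dimension on the fly. The matrix here is random and the rank is an assumed choice; real projection weights are what exhibit the exploitable low-rank structure.

import numpy as np

# Illustrative low-rank KV compression: store rank-r activations instead of the
# full head dimension. W stands in for one head's key projection (random here).
rng = np.random.default_rng(0)
d_model, d_head, rank = 4096, 128, 32
W = rng.standard_normal((d_model, d_head)) * 0.02

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]        # [d_model, rank]: down-projection, applied online
B = Vt[:rank]                     # [rank, d_head]: up-projection, applied at attention time

h = rng.standard_normal((1, d_model))
cached = h @ A                    # rank-dim vector goes into the cache (4x smaller here)
k_approx = cached @ B             # full-dimension key reconstructed when needed
print(cached.shape, "cached vs", (h @ W).shape, "uncached")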

What you need to know:

  • Because KV cache growth scales with both sequence length and batch size, compression is essential for high-throughput serving.
  • SnapKV selects important positions per attention head by pooling attention scores from an observation window at the end of the prompt.
  • Quantization (KIVI, KVQuant, TurboQuant) reduces memory without removing any tokens. KIVI achieves a 2.6× combined peak-memory reduction (model weights plus KV cache) at 2-bit precision; TurboQuant achieves 6× memory reduction at 3-bit precision with no calibration, operating near the information-theoretic limit.
  • Compared with token eviction, low-rank methods target redundancy along the hidden dimension rather than the sequence dimension.
  • Architectural solutions such as GQA and MLA must be built in at training time. GQA was only used in Llama 2 for the 70B model; Llama 3 extended it to all sizes. DeepSeek-V2 reduces the KV cache by 93.3% using MLA.
  • Looking toward 2026, the research frontier is moving to latent-space compaction (Attention Matching, 50× compaction) and reasoning-aware compression (TriAttention, 10.7× memory reduction on AIME25 at matched accuracy).
