
A complete guide to NVIDIA’s KVPress: KV cache compression, memory-efficient generation, and long-context LLM inference

Tech · By Gavin Wallace · 10/04/2026 · 9 Mins Read

In this tutorial, we explore NVIDIA’s KVPress and how it improves the efficiency of long-context inference. We begin by setting up the environment: we install the necessary libraries, load a compact instruct model, and prepare a Colab workflow that demonstrates the real value of KV cache compression. We then create a long-context synthetic corpus, define extraction questions, and run several inference experiments that directly compare different KVPress methods. By the end of this tutorial, you will understand how long-context optimization is implemented in practice, how different presses affect performance, and how to adapt this workflow for retrieval applications, document analysis, or memory-sensitive LLM deployments.
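Before diving in, it helps to see why KV cache compression matters at all. The sketch below estimates KV cache size from model shape parameters; the layer, head, and dimension figures are illustrative assumptions for a small model, not measurements of any particular checkpoint:

```python
# Back-of-envelope KV cache size: 2 tensors (K and V) per layer,
# each of shape [kv_heads, seq_len, head_dim], stored at dtype_bytes per element.
# The default shape parameters below are illustrative assumptions.
def kv_cache_bytes(seq_len, n_layers=28, n_kv_heads=2, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(100_000)
# A press with compression_ratio=0.7 keeps roughly 30% of the KV pairs.
compressed = kv_cache_bytes(30_000)
print(f"full: {full / 1e9:.2f} GB, compressed: {compressed / 1e9:.2f} GB")
```

Even for a small model at float16, the cache grows linearly with sequence length, which is exactly the cost KVPress targets.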

import os, sys, subprocess, textwrap, time, gc, json, math, random
import warnings

warnings.filterwarnings("ignore")


def run(cmd):
    print("\n[RUN]", " ".join(cmd))
    subprocess.check_call(cmd)


run([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "pip"])
run([sys.executable, "-m", "pip", "install", "-q", "torch", "transformers", "accelerate", "bitsandbytes", "sentencepiece", "kvpress==0.4.0"])


try:
    from google.colab import userdata
    hf_token = userdata.get("HF_TOKEN")
except Exception:
    hf_token = os.environ.get("HF_TOKEN", "")


if not hf_token:
    try:
        import getpass
        hf_token = getpass.getpass("Enter your Hugging Face token (leave empty if model is public and accessible): ").strip()
    except Exception:
        hf_token = ""


if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token


import torch
import transformers
import kvpress


from transformers import pipeline, BitsAndBytesConfig
from kvpress import ExpectedAttentionPress, KnormPress


print("Python:", sys.version.split()[0])
print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
   print("GPU:", torch.cuda.get_device_name(0))


MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
MAX_NEW_TOKENS = 96
SEED = 4
random.seed(SEED)
torch.manual_seed(SEED)

We install all necessary libraries and set up Colab so KVPress can run successfully. We collect and store the Hugging Face token, import the core modules for model loading and pipeline execution, and prepare to run compression experiments. We also print hardware and runtime details so we know exactly what environment the tutorial runs in.

if torch.cuda.is_available():
   torch.cuda.empty_cache()
   quantization_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_compute_dtype=torch.float16,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_use_double_quant=True,
   )
   pipe = pipeline(
       "kv-press-text-generation",
       model=MODEL_ID,
       device_map="auto",
       token=hf_token if hf_token else None,
       model_kwargs={
           "quantization_config": quantization_config,
           "attn_implementation": "sdpa",
       },
   )
else:
   pipe = pipeline(
       "kv-press-text-generation",
       model=MODEL_ID,
       device_map="auto",
       torch_dtype=torch.float32,
       token=hf_token if hf_token else None,
       model_kwargs={
           "attn_implementation": "sdpa",
       },
   )


def cuda_mem():
    if not torch.cuda.is_available():
        return {"allocated_gb": None, "reserved_gb": None, "peak_gb": None}
    return {
        "allocated_gb": round(torch.cuda.memory_allocated() / 1024**3, 3),
        "reserved_gb": round(torch.cuda.memory_reserved() / 1024**3, 3),
        "peak_gb": round(torch.cuda.max_memory_allocated() / 1024**3, 3),
    }


def reset_peak():
   if torch.cuda.is_available():
       torch.cuda.reset_peak_memory_stats()


def extract_answer(x):
    if isinstance(x, list) and len(x) > 0:
        x = x[0]
    if isinstance(x, dict):
        for k in ["answer", "generated_text", "text", "output_text"]:
            if k in x:
                return x[k]
        return json.dumps(x, ensure_ascii=False, indent=2)
    return str(x)


def generate_once(context, question, press=None, label="run"):
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    reset_peak()
    start = time.time()
    out = pipe(
        context,
        question=question,
        press=press,
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=False,
        temperature=None,
        return_full_text=False,
    )
    elapsed = time.time() - start
    answer = extract_answer(out)
    stats = cuda_mem()
    result = {
        "label": label,
        "elapsed_sec": round(elapsed, 2),
        "allocated_gb": stats["allocated_gb"],
        "reserved_gb": stats["reserved_gb"],
        "peak_gb": stats["peak_gb"],
        "answer": answer.strip(),
    }
    return result

We initialize the kv-press-text-generation pipeline and configure it differently depending on whether GPU support is available. Helper functions are defined to measure CUDA memory usage, reset peak memory, extract model answers, and cleanly run a generation pass. This section provides the reusable logic for the remainder of the tutorial and lets us compare baseline generation against KV cache compression.
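The timing-and-bookkeeping pattern inside generate_once generalizes beyond this tutorial. As a minimal sketch (our own helper, not part of kvpress), a context manager can record wall time around any block; CUDA peak memory could be captured the same way via torch.cuda.max_memory_allocated() when a GPU is present:

```python
import time
import contextlib

@contextlib.contextmanager
def measure(label, stats):
    # Hypothetical helper mirroring generate_once's bookkeeping:
    # time the wrapped block and store the result under `label`.
    # On GPU, torch.cuda.reset_peak_memory_stats() before the yield and
    # torch.cuda.max_memory_allocated() after would add a memory reading.
    start = time.time()
    try:
        yield
    finally:
        stats[label] = round(time.time() - start, 2)

stats = {}
with measure("demo", stats):
    sum(range(1_000_000))  # stand-in for a generation call
print(stats)
```

Wrapping each experiment this way keeps the measurement logic in one place instead of repeating it per run.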

company_records = [
   {"company": "Arcturus Dynamics", "hq": "Bengaluru", "founded": 2017, "focus": "warehouse robotics"},
   {"company": "BlueMesa Energy", "hq": "Muscat", "founded": 2014, "focus": "grid analytics"},
   {"company": "CinderPeak Health", "hq": "Pune", "founded": 2019, "focus": "clinical imaging AI"},
   {"company": "DeltaForge Marine", "hq": "Kochi", "founded": 2012, "focus": "autonomous vessel telemetry"},
   {"company": "EonCircuit Labs", "hq": "Hyderabad", "founded": 2020, "focus": "edge silicon tooling"},
   {"company": "Frostline Aero", "hq": "Jaipur", "founded": 2016, "focus": "drone inspection"},
]


needle_facts = [
   "PROJECT NEEDLE 1: The internal codename for the confidential pilot program is SAFFRON-17.",
   "PROJECT NEEDLE 2: The audit escalation owner is Meera Vashisht.",
   "PROJECT NEEDLE 3: The approved deployment region for the first production rollout is Oman North.",
   "PROJECT NEEDLE 4: The emergency rollback phrase is amber lantern.",
   "PROJECT NEEDLE 5: The signed commercial start date is 17 September 2026.",
]


background_block = """
Long-context systems contain large amounts of information, including repeated operational notes, history, policies, and retrieval artifacts.
This demo produces a long, realistic prompt in which only a few details are relevant to answering.
By reducing the number of key-value pairs held in cache, KV compression reduces memory consumption while maintaining answer quality.
"""


policy_block = """
Operational policy overview:
1. When sensor confidence is below the threshold, safety takes precedence over throughput.
2. Logs must include the region, timestamp, device class, and operator approval state.
3. Duplicate annexes, OCR-style artifacts, and repeated compliance summaries may appear.
4. A good model must ignore repeated irrelevant details and focus on what actually matters.
"""


records_text = []
for i in range(120):
    rec = company_records[i % len(company_records)]
    records_text.append(
        f"Record {i+1}: {rec['company']} is headquartered in {rec['hq']}, founded in {rec['founded']}, and focuses on {rec['focus']}. "
        f"Quarterly memo {i+1}: retention remained stable, operator training progressed, and the compliance appendix was reattached for review."
    )


needle_insert_positions = {18, 41, 73, 96, 111}
full_corpus = []
for i, para in enumerate(records_text):
    full_corpus.append(background_block.strip())
    full_corpus.append(policy_block.strip())
    full_corpus.append(para)
    if i in needle_insert_positions:
        full_corpus.append(needle_facts[len([x for x in needle_insert_positions if x <= i]) - 1])

We create a synthetic long-context dataset to test the KVPress system in a controlled yet realistic way. We define company records, insert important hidden facts at different positions, and mix them with repeated background and policy blocks, making the prompt long and noisy. This helps us simulate the context in which memory-efficient inference matters and the model must retrieve only the truly relevant details.
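Because we know exactly which facts were planted, any answer can be scored mechanically. The helper below is our own addition (not part of KVPress); it simply checks which planted needle values survive in a model's output:

```python
# Ground-truth values from the planted needle facts above.
NEEDLE_VALUES = {
    "pilot_codename": "SAFFRON-17",
    "audit_owner": "Meera Vashisht",
    "deployment_region": "Oman North",
    "rollback_phrase": "amber lantern",
    "commercial_start_date": "17 September 2026",
}

def needle_recall(answer: str) -> float:
    # Fraction of planted facts whose value appears verbatim
    # (case-insensitively) in the model's answer.
    hits = sum(1 for v in NEEDLE_VALUES.values() if v.lower() in answer.lower())
    return hits / len(NEEDLE_VALUES)

print(needle_recall('{"pilot_codename": "SAFFRON-17", "rollback_phrase": "amber lantern"}'))
```

Applying this scorer to each experiment's answer turns the qualitative "did compression hurt quality?" question into a number we can compare across presses.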

context = "\n\n".join(full_corpus)


question = textwrap.dedent("""
Answer using only the provided context.
Give a compact JSON object with exactly these keys:
commercial_start_date
deployment_region
audit_owner
rollback_phrase
pilot_codename
""").strip()


print("\nContext characters:", len(context))
print("Approx words:", len(context.split()))


experiments = []


baseline = generate_once(context, question, press=None, label="baseline_no_compression")
experiments.append(baseline)


presses = [
    ("expected_attention_0.7", ExpectedAttentionPress(compression_ratio=0.7)),
    ("expected_attention_0.5", ExpectedAttentionPress(compression_ratio=0.5)),
    ("knorm_0.5", KnormPress(compression_ratio=0.5)),
]


for label, press in presses:
    try:
        result = generate_once(context, question, press=press, label=label)
        experiments.append(result)
    except Exception as e:
        experiments.append({
            "label": label,
            "elapsed_sec": None,
            "allocated_gb": None,
            "reserved_gb": None,
            "peak_gb": None,
            "answer": f"FAILED: {type(e).__name__}: {e}",
        })


try:
    import inspect
    from kvpress import DecodingPress
    sig = inspect.signature(DecodingPress)
    kwargs = {"base_press": KnormPress()}
    if "compression_interval" in sig.parameters:
        kwargs["compression_interval"] = 10
    elif "compression_steps" in sig.parameters:
        kwargs["compression_steps"] = 10
    if "target_size" in sig.parameters:
        kwargs["target_size"] = 512
    elif "token_buffer_size" in sig.parameters:
        kwargs["token_buffer_size"] = 512
    if "hidden_states_buffer_size" in sig.parameters:
        kwargs["hidden_states_buffer_size"] = 0
    decoding_press = DecodingPress(**kwargs)
    decoding_result = generate_once(context, question, press=decoding_press, label="decoding_knorm")
    experiments.append(decoding_result)
except Exception as e:
    experiments.append({
        "label": "decoding_knorm",
        "elapsed_sec": None,
        "allocated_gb": None,
        "reserved_gb": None,
        "peak_gb": None,
        "answer": f"SKIPPED_OR_FAILED: {type(e).__name__}: {e}",
    })

We assemble the final context and launch the inference experiment set. We run the baseline without compression first, then apply multiple press strategies so the results can be compared directly. A decoding-oriented experiment extends the tutorial beyond prefill-time compression, giving a wider view of what KVPress can do.
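A quick way to reason about these runs: each press's compression_ratio is, under the kvpress convention, roughly the fraction of KV pairs dropped at prefill, so the cache retains about (1 - ratio) of the prompt tokens. A toy calculation with illustrative numbers (not measured values):

```python
def tokens_kept(prompt_tokens: int, compression_ratio: float) -> int:
    # Approximate number of KV pairs remaining in cache after a press
    # drops `compression_ratio` of them at prefill.
    return int(prompt_tokens * (1 - compression_ratio))

# Hypothetical 20k-token prompt under the three press configurations above.
for label, ratio in [
    ("expected_attention_0.7", 0.7),
    ("expected_attention_0.5", 0.5),
    ("knorm_0.5", 0.5),
]:
    print(label, tokens_kept(20_000, ratio))
```

This is why the 0.7-ratio press should show the largest peak-memory savings in the results below, at the greatest risk to answer quality.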

print("\n" + "=" * 120)
print("RESULTS")
print("=" * 120)


for r in experiments:
    print(f"\n[{r['label']}]")
    print("elapsed_sec:", r["elapsed_sec"])
    print("allocated_gb:", r["allocated_gb"])
    print("reserved_gb:", r["reserved_gb"])
    print("peak_gb:", r["peak_gb"])
    print("answer:")
    print(r["answer"])


print("\n" + "=" * 120)
print("SIMPLE SUMMARY")
print("=" * 120)


def safe_float(x):
    try:
        return float(x)
    except Exception:
        return None


base_peak = safe_float(baseline["peak_gb"]) if baseline.get("peak_gb") is not None else None
base_time = safe_float(baseline["elapsed_sec"]) if baseline.get("elapsed_sec") is not None else None


for r in experiments[1:]:
    peak = safe_float(r["peak_gb"])
    t = safe_float(r["elapsed_sec"])
    peak_delta = None if peak is None or base_peak is None else round(base_peak - peak, 3)
    time_delta = None if t is None or base_time is None else round(base_time - t, 2)
    print({
        "label": r["label"],
        "peak_gb_saved_vs_baseline": peak_delta,
        "time_sec_saved_vs_baseline": time_delta,
        "answer_preview": r["answer"][:180].replace("\n", " "),
    })


print("\n" + "=" * 120)
print("OPTIONAL NEXT STEPS")
print("=" * 120)
print("1. Swap MODEL_ID to a stronger long-context instruct model that fits your GPU.")
print("2. Increase context length by duplicating records_text more times.")
print("3. Try other presses from kvpress, such as SnapKVPress, StreamingLLMPress, QFilterPress, or ChunkKVPress.")
print("4. Replace the synthetic corpus with your own long PDF/text chunks and keep the same evaluation loop.")

Finally, we print all outputs in a readable form and summarize runtime and memory differences relative to the baseline. The simple comparison metrics show at a glance how much time or memory each compression strategy saves. The tutorial then concludes with suggested next steps: stronger models, longer contexts, additional press methods, and real document workloads.
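The saved-versus-baseline deltas can also be expressed as percentages, which are easier to compare across models of different sizes. A small helper (our own addition, with guards for missing values from failed runs) sketches this:

```python
def pct_saved(baseline: float, value: float) -> float:
    # Percentage reduction relative to baseline; returns 0.0 when either
    # number is missing (e.g. a failed run) or the baseline is zero.
    if not baseline or value is None:
        return 0.0
    return round(100 * (baseline - value) / baseline, 1)

# Hypothetical peak-memory figures: 8.0 GB baseline vs 5.6 GB compressed.
print(pct_saved(8.0, 5.6))
```

Feeding each experiment's peak_gb and elapsed_sec through this turns the raw deltas into a percentage summary.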

We have developed an understanding of how NVIDIA’s KVPress can be used in a real-world Colab setting to optimize long-context inference. The workflow went well beyond simply running the model: it included installing the framework, loading the pipeline correctly, constructing a long context, applying multiple compression presses, and evaluating the results on answer quality, memory, and runtime. Comparing baseline generation with compressed KV caching made the trade-offs clear and showed when these methods can reduce resource usage without compromising output quality. By testing several press configurations, including an optional decoding-oriented compression path, we explored the flexibility of KVPress.

