Large language models still struggle with extremely long documents. Even with techniques such as length extrapolation or sparse attention, models suffer degraded performance and high computation costs. To address this, researchers from Tsinghua University and ByteDance Seed have introduced MemAgent, a reinforcement-learning-based agent designed to process long contexts with linear complexity and minimal performance loss.
The limitations of existing approaches
Current solutions to long-context modeling fall into three major categories:
- **Length extrapolation methods:** extend the context window via positional embeddings (e.g., NTK, PI, YaRN, DCA), but often suffer performance degradation and scaling problems.
- **Sparse and linear attention mechanisms:** reduce attention complexity to O(n), but require retraining from scratch and rely on fixed patterns or human-defined rules.
- **Context compression:** uses external memory modules or compression tokens to shrink long inputs, but often disrupts standard generation and struggles to extrapolate.
None of these approaches delivers all of the critical properties at once: consistent accuracy, efficiency, and linear complexity.
MemAgent: A Human-Like Memory Strategy
MemAgent is inspired by how people summarize important information and ignore noise. At each step it reads a chunk of the document together with an internal memory, which is then overwritten with newly compressed information.
Its key innovations are:
- **Fixed-length token-based memory:** compresses vital information while maintaining compatibility with the base model.
- **Segment-wise overwrite mechanism:** supports arbitrarily long text without growing the memory.
- **Linear complexity:** memory-update and decoding costs remain constant per chunk.
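The read-then-overwrite loop described above can be sketched in a few lines. This is an illustrative assumption, not the paper's code: `update_memory` and `answer` stand in for calls to the underlying LLM, and `chunk_size=4096` is an arbitrary choice.

```python
def chunk_text(tokens, chunk_size):
    """Split a long token sequence into fixed-size segments."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def memagent_answer(tokens, question, update_memory, answer, chunk_size=4096):
    """Read a document chunk by chunk, overwriting a fixed-length memory,
    then answer from the final memory. `update_memory` and `answer` are
    stand-ins for LLM calls. Only the memory persists across chunks, so
    per-chunk cost is constant and total cost is linear in len(tokens)."""
    memory = ""  # fixed-length memory: overwritten, never appended to
    for chunk in chunk_text(tokens, chunk_size):
        memory = update_memory(memory, chunk, question)
    return answer(memory, question)
```

The key design point is that the memory is replaced at every step rather than concatenated, which is what keeps the context seen by the model bounded no matter how long the document grows.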
Multi-Conversation RL with GRPO
MemAgent treats each interaction with a document chunk as an independent conversation. Training uses Group Relative Policy Optimization (GRPO) within DAPO, a multi-conversation RL pipeline, enabling reward-driven updates of the memory.
The key components include:
- **Rule-based verifier:** computes outcome rewards by comparing model answers against multiple ground truths.
- **Token-level RL signal:** applied uniformly across all conversations generated from a single sample.
This trains the memory compression to focus on answer-relevant information and discard everything else.
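The outcome reward from the rule-based verifier can be sketched as a simple exact-match check against multiple references. This is a hedged illustration, assuming a standard QA normalization (lowercasing, punctuation stripping); the paper's actual verifier may differ.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace.
    A common normalization for QA exact-match scoring (an assumption here)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def outcome_reward(model_answer, ground_truths):
    """Rule-based verifier: reward 1.0 if the normalized model answer
    matches any ground truth, else 0.0. In the multi-conversation setup,
    this single scalar is applied uniformly as the RL signal for every
    conversation generated from the same sample."""
    pred = normalize(model_answer)
    return 1.0 if any(normalize(gt) == pred for gt in ground_truths) else 0.0
```

Because the reward is computed only from the final answer, the memory-update steps receive no direct supervision, which is why a reinforcement-learning signal is needed rather than backpropagation through the discrete memory.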
Performance Assessment
MemAgent was trained with an 8K context window and extrapolated up to 3.5M tokens, evaluated on the RULER benchmark and synthetic datasets built from HotpotQA and SQuAD.
| Model | 224K | 896K | 3.5M |
|---|---|---|---|
| Qwen2.5-Instruct-14B-1M | 37.5% | 0.0% | N/A |
| QwenLong-L1-32B | 17.2% | 11.7% | N/A |
| RL-MemAgent-14B | 81.3% | 77.3% | 78.1% |
MemAgent consistently outperformed long-context and distillation baselines, maintaining over 95% accuracy on RULER benchmarks (8K to 512K tokens).

Case Study: Multi-Hop Question Answering
Given the question “The director of the romantic comedy ‘Big Stone Gap’ is based in what New York city?”, MemAgent processed the document in three chunks:
- Recognized unrelated content while retaining location-relevant information.
- Kept the memory free of irrelevant distractors.
- Correctly updated the memory upon reading Adriana Trigiani's biography.
Final answer: Greenwich Village, New York City.
Theoretical Foundation and Complexity
MemAgent reformulates autoregressive modeling using latent memory variables (m₁…mₖ):
p(x₁:N) = Σₘ₁:ₖ ∏ₖ p(cₖ | mₖ₋₁) · p(mₖ | cₖ, mₖ₋₁)
This enables O(N) compute cost and human-readable intermediate memory, unlike attention-based feature compression. Because memory updates are discrete, they cannot be learned by backpropagation, and reinforcement learning is required.
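The linear-cost claim follows from a per-chunk accounting. A brief sketch, assuming a fixed chunk size c and a fixed memory size |m| (both constants with respect to the input length N):

```latex
% Each of the K = N/c chunks attends over at most c + |m| tokens,
% so the per-chunk cost is bounded by a constant:
\text{Total cost}
  = \sum_{k=1}^{K} O\!\left((c + |m|)^2\right)
  = \frac{N}{c} \cdot O\!\left((c + |m|)^2\right)
  = O(N),
\qquad \text{since } c \text{ and } |m| \text{ are fixed.}
```

Contrast this with full self-attention over the whole document, whose cost grows as O(N²).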
Conclusion
MemAgent offers a scalable solution to the long-context problem, combining linear complexity with near-lossless accuracy. With its RL-trained overwrite memory, LLMs can read, abstract, and generate over inputs of multi-million tokens.
FAQs
Q1: What is MemAgent?
MemAgent is a reinforcement-learning-based framework that equips LLMs with memory tokens to handle extremely long contexts efficiently.
Q2: How is it different from attention or extrapolation methods?
Unlike attention-scaling or extrapolation techniques, MemAgent uses a token-based memory that is updated through reinforcement learning.
Q3: Which models can MemAgent be applied to?
Any Transformer-based LLM; no architectural changes are required.
Q4: How does it scale with input size?
By fixing the memory size, it maintains linear complexity regardless of input length.
Q5: What are MemAgent's applications?
Long-document question answering, agent memory systems, legal document review, literature reviews, and decision-making over large bodies of evidence.
Check out the paper for more details. All credit for this research goes to the researchers of this project.