How can an LLM agent decide for itself what goes into long-term storage, what stays in context for the moment, and what gets thrown out, without hand-tuned heuristics or additional controllers? Is it possible to learn a single policy that manages both kinds of memory in the same space used for text generation?
Researchers from Alibaba Group and Wuhan University introduce AgeMem (Agentic Memory), a framework that lets large language models learn to manage long-term and short-term memory within a single policy. Instead of depending on external controllers or hand-written rules, the agent decides when to retrieve, store, summarize, and forget using tools built into the model's action space.
What do current LLM agents struggle with?
Most agent frameworks treat memory as two separate systems.
Long-term memory is an external store of facts and experiences that persist across sessions. Short-term memory is the context window, which holds the active dialogue and retrieved documents.
Existing systems manage these two components separately. Long-term memory lives in external stores, such as vector databases with simple add-and-retrieve triggers. Short-term memory is managed with retrieval-augmented generation, sliding windows, or summarization schedules.
Separating the two families creates several issues:
- Short-term and long-term memory are optimized separately, so the interaction between them is never learned end to end.
- Heuristics decide when to memorize and when to summarize; they are fragile and can miss rare but important events.
- Additional controllers and expert models increase cost.
AgeMem eliminates the external memory controller and folds memory management into the agent's own policy.
Memory and the Agent Action Space
AgeMem exposes memory operations as tools. The model may emit normal text or tool calls at each step. The framework has six tools.
Long-term memory:

- `ADD`: stores a new memory item with content and metadata.
- `UPDATE`: modifies an existing memory item.
- `DELETE`: removes items that are obsolete or of low value.

Short-term memory:

- `RETRIEVE`: performs a semantic search over long-term memory and injects the retrieved items into the context.
- `SUMMARY`: compresses the dialogue into a shorter summary.
- `FILTER`: removes context segments that are no longer useful for future reasoning.
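To make the tool set concrete, here is a minimal in-memory sketch of the six operations. Everything below is illustrative: the class name, the keyword-overlap "retrieval," and the truncation-based "summary" are stand-ins, not the paper's implementation (which would use a real store and semantic search).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class MemoryStore:
    """Toy long-term memory store; a real system would back this with a vector DB."""
    items: dict = field(default_factory=dict)
    _next_id: int = 0

    # --- Long-term memory tools ---
    def add(self, content: str, metadata: Optional[dict] = None) -> int:
        """ADD: store a new memory item with content and metadata."""
        self._next_id += 1
        self.items[self._next_id] = {"content": content, "metadata": metadata or {}}
        return self._next_id

    def update(self, item_id: int, content: str) -> None:
        """UPDATE: modify an existing memory item."""
        self.items[item_id]["content"] = content

    def delete(self, item_id: int) -> None:
        """DELETE: drop an obsolete or low-value item."""
        self.items.pop(item_id, None)

    def retrieve(self, query: str, k: int = 3) -> list:
        """RETRIEVE: stand-in for semantic search, scored by keyword overlap."""
        scored = sorted(
            self.items.values(),
            key=lambda it: len(set(query.lower().split())
                               & set(it["content"].lower().split())),
            reverse=True,
        )
        return [it["content"] for it in scored[:k]]

# --- Short-term memory tools ---
def summarize(context: list, max_turns: int = 2) -> list:
    """SUMMARY stand-in: compress older turns into a single line."""
    if len(context) <= max_turns:
        return context
    summary = "Summary: " + " | ".join(t[:40] for t in context[:-max_turns])
    return [summary] + context[-max_turns:]

def filter_context(context: list, keep: Callable[[str], bool]) -> list:
    """FILTER stand-in: drop segments judged no longer useful."""
    return [seg for seg in context if keep(seg)]
```

In AgeMem these operations are not library calls made by a controller; they are actions the policy itself chooses to emit during generation.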
The output is structured. Every step begins with a `think` block, in which the model reasons privately. The model then emits either a `tool` block containing a JSON list of tool invocations, or an `answer` block containing the response visible to the user. Memory actions are therefore first-class decisions, not side effects.
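Based on that description, a single step might be parsed as follows. The tag names (`think`, `tool`, `answer`) and the JSON schema of the tool call are assumptions for illustration, not the paper's exact format.

```python
import json
import re

# Hypothetical example of one agent step: private reasoning, then a tool call.
step_output = """<think>The user asked about an earlier preference; search long-term memory.</think>
<tool>[{"name": "RETRIEVE", "arguments": {"query": "user coffee preference", "top_k": 3}}]</tool>"""

def parse_step(text: str) -> dict:
    """Split a step into private reasoning plus either tool calls or a visible answer."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    tool = re.search(r"<tool>(.*?)</tool>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return {
        "think": think.group(1).strip() if think else None,
        "tool_calls": json.loads(tool.group(1)) if tool else None,
        "answer": answer.group(1).strip() if answer else None,
    }

parsed = parse_step(step_output)
```

Because the tool call is an ordinary token sequence, the same policy gradient that shapes the visible answer also shapes the memory decision.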
Three-Stage Reinforcement Learning for Memory
AgeMem uses reinforcement learning to combine long-term and short-term memory behaviors.
The state at step t includes the current conversational context, the long-term memory store, and the task specification. The policy's action is either a text token or a tool call. Each training trajectory is broken into three stages:
- Stage 1, long-term memory construction. The agent interacts in casual settings and gathers information that will be useful later, using `ADD`, `UPDATE`, and `DELETE` to build and maintain long-term memory.
- Stage 2, short-term memory control under distractors. The short-term context is reset while long-term memory persists. The agent now receives content that is related but not essential, and must use `SUMMARY` and `FILTER` to remove noise and keep only useful information.
- Stage 3, integrated reasoning. The agent retrieves from long-term memory with `RETRIEVE`, controls the short-term context, and produces the final answer.
A crucial point: long-term memory persists across stages, whereas the short-term context is cleared between Stage 1 and Stage 2. This design forces the model to rely on retrieval rather than residual context, exposing genuine long-term dependencies.
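The staging can be sketched as a rollout skeleton. The function and callback names here are hypothetical; the point is only that `long_term` survives all three stages while the short-term `context` is rebuilt from scratch in Stage 2.

```python
def run_three_stage_episode(interact, answer, stage1_input, distractors, question):
    """Illustrative three-stage rollout: long-term memory persists, short-term resets.
    `interact` and `answer` stand in for the policy acting with a given toolset."""
    long_term = []   # persists across all three stages
    context = []     # short-term context window

    # Stage 1: construct long-term memory (ADD / UPDATE / DELETE available).
    context = interact(stage1_input, context, long_term, ("ADD", "UPDATE", "DELETE"))

    # Stage 2: the short-term context is wiped, so nothing from Stage 1 leaks
    # through; distractor content stresses SUMMARY / FILTER.
    context = interact(distractors, [], long_term, ("SUMMARY", "FILTER"))

    # Stage 3: answer using RETRIEVE over long-term memory plus short-term control.
    return answer(question, context, long_term, ("RETRIEVE", "SUMMARY", "FILTER"))
```

With this structure, the only way Stage 3 can succeed is if Stage 1 stored the right items, which is exactly the dependency the reward signal must propagate through.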
Reward Design and Step-Wise GRPO
AgeMem employs a variant of Group Relative Policy Optimization (GRPO). The system samples a group of trajectories for each task, computes a terminal reward for each trajectory, and normalizes it within the group to obtain an advantage signal. This signal is broadcast across the entire trajectory, so all intermediate tool choices are trained against the final outcome.
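The group-relative advantage can be written in a few lines. This is a simplified view: a full GRPO update also includes a PPO-style clipped policy-ratio objective, which is omitted here.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize terminal rewards within a group of trajectories sampled for the
    same task; each trajectory's advantage is then broadcast to all its steps."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero if all rewards match
    return [(r - mean) / std for r in rewards]

# Four trajectories for one task, scored by terminal reward:
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
```

A trajectory that beat its group gets a positive advantage on every step, including its early `ADD` and `SUMMARY` calls, which is how memory decisions far from the final answer still receive credit.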
The reward has three major components:
- A task reward, which uses an LLM judge to score answer quality from 0 to 1.
- A context reward, which measures the effectiveness of short-term memory operations, including compression, summarizing early, and preserving query-relevant content.
- A memory reward, which measures the quality of long-term memory, including the number of high-quality stored items and their usefulness and relevance to queries.
These three components are weighted equally, so each contributes equally to the learning signal. A penalty term is added when the dialogue exceeds its maximum length or the context grows too large.
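Under the equal-weighting scheme described above, the terminal reward might be combined like this. The penalty value and the hard thresholds are assumptions; the paper only states that a penalty applies when the dialogue or context budget is exceeded.

```python
def terminal_reward(task_score, context_score, memory_score,
                    dialogue_len, max_dialogue, context_tokens, max_context,
                    penalty=0.5):
    """Equal-weight mix of the three reward components (each in [0, 1]),
    minus a penalty when a length budget is exceeded. Penalty form is illustrative."""
    reward = (task_score + context_score + memory_score) / 3.0
    if dialogue_len > max_dialogue or context_tokens > max_context:
        reward -= penalty
    return reward

# Within budget: the reward is just the mean of the three components.
r = terminal_reward(0.9, 0.6, 0.9,
                    dialogue_len=12, max_dialogue=20,
                    context_tokens=900, max_context=1000)
```

Equal weights mean the agent cannot trade away context hygiene or memory quality for a marginally better answer score.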
Experimental Setup and Main Results
The research team evaluates AgeMem on five benchmarks:
- ALFWorld for text-based household tasks.
- SciWorld for science-themed environments.
- BabyAI for instruction following.
- PDDL for planning.
- HotpotQA for multi-hop question answering.
Metrics include the LLM judge score for HotpotQA and the success rate for ALFWorld. A memory-quality metric is also defined using an LLM evaluator, which compares the stored memories against HotpotQA's supporting facts.

Baselines include LangMem, A-Mem, Mem0, and Mem0g, as well as a memory-free agent. The backbones are Qwen2.5-7B-Instruct and Qwen3-4B-Instruct.
On Qwen2.5-7B-Instruct, AgeMem scores an average of 41.96 across all five benchmarks; the best baseline, Mem0, scores 37.14. On Qwen3-4B-Instruct, AgeMem achieves 54.31, while the best baseline, A-Mem, reaches 45.74.
Memory quality also improves. On HotpotQA, AgeMem reaches 0.53 with Qwen2.5-7B and 0.605 with Qwen3-4B, better than any baseline.
Short-term memory tools reduce prompt length while maintaining performance. On HotpotQA, configurations with STM use 3 to 5 percent fewer tokens per prompt than variants that replace STM with retrieval pipelines.
Ablation studies show that each component matters. Adding only the long-term memory tools on top of a memory-free baseline already yields clear gains, and training those tools with reinforcement learning improves scores further. With both short-term and long-term tools plus RL, the full system improves scores by up to 21.7 percentage points over the memory-free baseline on SciWorld.
Implications for LLM agent design
AgeMem suggests a design pattern for future agentic systems: memory is best handled inside the learned policy, not in two separate subsystems. By combining language generation with storage, retrieval, and summarization, the agent learns to manage context effectively.
The Key Takeaways
- AgeMem makes memory operations explicit tools (`ADD`, `UPDATE`, `DELETE`, `RETRIEVE`, `SUMMARY`, and `FILTER`), so the policy that generates text also decides when to use them.
- Long-term and short-term memory are trained together through a three-stage RL setup in which long-term memory persists across all stages while the short-term context resets, forcing retrieval-based reasoning.
- The reward function balances task accuracy, context management, and long-term memory quality with uniform weights, plus penalties for excessive dialogue length and context overload.
- AgeMem consistently outperforms memory baselines such as LangMem, A-Mem, Mem0, and Mem0g on ALFWorld, SciWorld, BabyAI, PDDL, and HotpotQA.
- Short-term memory tools reduce prompt length by 3 to 5 percent compared with RAG-style baselines while maintaining or improving performance, showing that effective context filtering and summarization can be learned rather than hand-written.
Take a look at the full paper. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our newsletter, and join us on Telegram.
Check out our latest release, ai2025.dev, an analytics platform focused on 2025 that converts model launches and benchmarks into a structured dataset you can export, compare, and filter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur, Asif is passionate about harnessing Artificial Intelligence to benefit society. His latest venture, Marktechpost, is a media platform focused on Artificial Intelligence, known for in-depth coverage of machine learning and deep learning news that is technically accurate yet accessible to audiences of all backgrounds. The platform draws over 2 million views per month.

