
Prime Intellect’s RLMEnv: Recursive Language Models (RLMs) for Long-Horizon LLM Agents

Tech · By Gavin Wallace · 02/01/2026 · 7 Mins Read

Recursive Language Models (RLMs) address a central trade-off in large language models between accuracy, context length, and cost. Rather than forcing a model to process a huge prompt all at once, an RLM treats the prompt as an external environment: the model decides what code to run to examine the prompt, then recursively invokes itself on smaller chunks.

https://arxiv.org/pdf/2512.24601

How RLMs Work

The entire input is loaded as a string into a Python REPL; GPT-5 is never shown that string directly. Instead, it receives a system prompt describing how to read slices of the variable, write helper functions, and combine results. The model still returns a textual answer to the question, so from the outside an RLM is indistinguishable from a standard chat endpoint.
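To make the "looks like a chat endpoint" point concrete, here is a minimal sketch of what a caller-facing RLM wrapper could look like. Everything here (`rlm_call`, the `ctx` variable name, the stubbed root-model loop) is a hypothetical illustration, not the paper's actual interface:

```python
# Hypothetical sketch: from the caller's perspective, an RLM behaves like a
# plain chat endpoint -- a function from (question, context) to an answer
# string. The recursion and REPL execution are hidden inside the call.

def rlm_call(question: str, context: str) -> str:
    """Pretend RLM endpoint: load the context into a REPL variable, let the
    root model write code against it, and return its final textual answer."""
    repl_env = {"ctx": context}  # the full input lives here, never in the prompt
    system_prompt = (
        "You have a Python REPL with a variable `ctx` holding the input. "
        "Slice it, search it, or call llm_query(...) on chunks, then answer."
    )
    # ... the root-model code-execution loop would run here; stubbed out ...
    return f"answer derived from {len(repl_env['ctx'])} chars of context"

print(rlm_call("Summarize the log", "line1\nline2\n" * 1000))
```

The caller never sees the REPL; it only sees a question in and an answer out, which is why RLMs can be dropped in wherever a chat endpoint is expected.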

The RLM manages long context through the Python REPL. The environment exposes simple tools such as string slicing and regex search, plus a helper function, llm_query, that calls a smaller model (for example, GPT-5 Mini). The root model writes code that uses these helpers to scan, divide, and summarize the external context variable, storing intermediate results in variables and building up a final answer. With this structure, prompt size is no longer bounded by the model’s context window, and long-context handling becomes a program-synthesis problem.
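The tools described above can be sketched as follows. The `llm_query` name follows the article; its body here is a stand-in stub that just labels its input, since the real helper would call a smaller model such as GPT-5 Mini:

```python
import re

# Minimal sketch of the tools the article describes: string slicing, regex
# search, and an llm_query helper. llm_query is stubbed; a real environment
# would route it to a smaller sub-model.

def llm_query(prompt: str) -> str:
    """Stand-in for a recursive call to a smaller sub-model."""
    return f"[summary of {len(prompt)} chars]"

# The external context lives in a REPL variable, never in the model's prompt:
ctx = "ERROR disk full\nINFO ok\nERROR timeout\nINFO ok\n" * 3

peek = ctx[:100]                          # slicing: inspect the start
errors = re.findall(r"ERROR .*", ctx)     # regex: filter relevant lines
summary = llm_query(ctx[:50])             # recurse on a small chunk
answer = f"{len(errors)} error lines; first chunk: {summary}"  # build the result
print(answer)
```

Intermediate values like `errors` and `summary` persist as REPL variables, which is what lets the root model accumulate an answer far larger than any single prompt or completion.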


How Does It Perform in Evaluation?

According to the research paper, the idea was evaluated on four long-context benchmarks with different computational structures. S-NIAH is a needle-in-a-haystack task with constant complexity. BrowseComp-Plus is a web-style multi-hop question-answering benchmark covering up to 1,000 documents. OOLONG requires the model to transform and aggregate many entries. OOLONG Pairs makes the task harder still by aggregating over quadratically many pairs of inputs, which stresses reasoning depth as well as context length.
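To see why OOLONG Pairs is so demanding, a quick sketch of quadratic pair growth helps (the entry count here is illustrative, not taken from the benchmark):

```python
from itertools import combinations

# OOLONG Pairs-style workload: the answer depends on aggregating over all
# pairs of entries, so work grows quadratically with the number of entries.
# (200 entries is an illustrative count, not the benchmark's actual size.)

entries = [f"entry-{i}" for i in range(200)]
pairs = list(combinations(entries, 2))

print(len(entries), "entries ->", len(pairs), "pairs")  # 200 entries -> 19900 pairs
```

Doubling the number of entries roughly quadruples the number of pairs, so any approach that must touch every pair pays a cost that no fixed context window can absorb.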

RLMs are more accurate than direct LLM calls or common long-context agents. With GPT-5, the RLM achieves 62.00 accuracy on CodeQA, and 66.00 on a question-answering setup over a lengthy document. With Qwen3-Coder-480B-A35B, the base model scores 20.00, a CodeAct retrieval agent 52.00, and the RLM 56.00, with a REPL-only variant at 44.66.

OOLONG Pairs is the hardest setting. Direct GPT-5 is nearly unusable at F1 = 0.04, and CodeAct and summarization agents land near 0.01, while the GPT-5 RLM reaches 43.93 and a non-recursive REPL-only variant around 24.67. Qwen3-Coder’s base model remains below 0.10 F1, while its RLM version reaches 23.11 and the REPL-only version 17.34. This shows that both the REPL and the recursive calls are critical for dense quadratic problems.


BrowseComp-Plus highlights the effective context extension. The corpus ranges from 6M to 11M tokens, roughly two orders of magnitude beyond GPT-5’s 272k-token window. The RLM built on GPT-5 maintains strong performance when 1,000 documents are loaded into the environment variable, while GPT-5 baselines degrade as more documents are added. On this benchmark, the GPT-5 RLM achieves 91.33 percent accuracy at an average cost of $0.99 per query; a hypothetical model that read the context directly would cost between $1.50 and $2.75 at current pricing.
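The cost gap follows from simple arithmetic: a direct read pays for every token in the corpus, while an RLM pays only for the slices it chooses to read. The per-token price and the selective-read token count below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope check of the cost claim, under an assumed input price
# (the $/1M-token rate and the selective-read count are illustrative).

corpus_tokens = 6_000_000       # low end of the BrowseComp-Plus corpus
price_per_m_input = 0.25        # assumed dollars per 1M input tokens

direct_read_cost = corpus_tokens / 1_000_000 * price_per_m_input
print(f"direct read of 6M tokens: ${direct_read_cost:.2f}")  # $1.50 at this rate

# An RLM's cost scales with the tokens it actually reads, not corpus size:
tokens_actually_read = 400_000  # hypothetical selective reading budget
rlm_cost = tokens_actually_read / 1_000_000 * price_per_m_input
print(f"selective RLM read: ${rlm_cost:.2f}")
```

At the assumed rate, a full 6M-token read already matches the article's $1.50 lower bound, while selective reading stays far below it; the real RLM cost also includes sub-model calls, which is why the reported average is $0.99 rather than pennies.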

The paper also analyzes trajectories from RLM runs, and several behavioral patterns emerge. A run often begins with a peek, in which the model inspects the first few thousand characters of the context. It then uses regex and keyword searches to filter for relevant lines. For more complex queries, it divides the context into chunks and issues recursive LM calls to extract or label each chunk, followed by programmatic aggregation. RLMs store partial outputs in variables and stitch them together, which lets them bypass the output-length limits of the base model.
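The peek, filter, chunk, recurse, and aggregate pattern can be sketched end to end. As before, `llm_query` is a stub standing in for a recursive sub-model call, and the log-style context is invented for illustration:

```python
import re

# Sketch of the trajectory pattern the paper describes: peek, regex filter,
# chunking, recursive sub-calls, then programmatic aggregation.

def llm_query(prompt: str) -> str:
    """Stand-in for a recursive sub-model call that labels a chunk."""
    return f"<label for {len(prompt)} chars>"

ctx = ("order=41 status=FAILED\n" + "order=42 status=OK\n") * 500

peek = ctx[:2000]                                  # 1. inspect the opening slice
relevant = re.findall(r".*FAILED.*", ctx)          # 2. keyword/regex filtering
chunk_size = 1000
chunks = [ctx[i:i + chunk_size] for i in range(0, len(ctx), chunk_size)]
labels = [llm_query(c) for c in chunks]            # 3. recurse on each chunk
partial = {"failed": len(relevant), "chunks": len(labels)}  # 4. store partials
answer = f"{partial['failed']} failed orders across {partial['chunks']} chunks"
print(answer)
```

Because `partial` and `labels` live in REPL variables, the final `answer` can summarize far more material than the model could ever emit, or even read, in a single call.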

Prime Intellect’s Implementation: RLMEnv

The Prime Intellect team has turned this concept into a concrete environment called RLMEnv, integrated with their Verifiers stack and Environments Hub. In their design, the root model receives only a Python REPL, while sub-LLMs receive the heavy tools such as file or web access. The REPL exposes llm_batch, which lets the root model issue many sub-queries in parallel, and an answer slot where the final solution must be written and marked as ready. The RLM can thus delegate costly operations to sub-models and isolate token-heavy outputs.
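A rough sketch of this interface follows. The `llm_batch` and `answer` names come from the article's description, but the implementation below (a thread pool over a stubbed sub-model) is an assumption about how parallel fan-out could work, not RLMEnv's real code:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of an RLMEnv-style interface: the root model sees a REPL, an
# llm_batch helper for parallel sub-queries, and an `answer` slot it must
# fill and mark ready. The sub-model here is a stub, not the real API.

def llm_batch(prompts):
    """Run many sub-model queries in parallel (stubbed sub-model)."""
    def sub_llm(prompt):
        return f"result({len(prompt)})"
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(sub_llm, prompts))

# Root-model code: fan out sub-queries, aggregate, then commit the answer.
docs = [f"document {i} " * 100 for i in range(10)]
results = llm_batch([f"Extract key facts:\n{d}" for d in docs])

answer = "; ".join(results)  # final solution written to the answer slot
ready = True                 # marking it ready ends the environment loop
```

Batching the sub-queries is what keeps the token-heavy document contents out of the root model's own context: only the short `results` strings ever flow back to it.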

Prime Intellect evaluates this implementation in four environments. DeepDive tests web research with search and open tools over verbose webpages. Math Python exposes a Python REPL for hard math-competition problems. Oolong runs the long-context benchmark inside RLMEnv. Verbatim Copy focuses on exact replication of complex strings in content such as JSON, CSV, or mixed code. In these environments, GPT-5 Mini and the INTELLECT-3 MoE model benefit from the RLM scaffolding in success rate and in robustness at very long contexts.

Both Prime Intellect and the paper’s authors emphasize that current implementations are not fully optimized: RLM calls are synchronous, recursion depth is shallow, and costs have a heavy tail due to long trajectories. RLM scaffolding could be combined with reinforcement learning so that, over time, models learn more effective chunking and recursion policies. RLMs could then convert improvements in system design and base models into longer-horizon agents capable of consuming 10M-token environments or more without context decay.

What you need to know

Five concise technical takeaways from this article:

  • RLMs frame long context as an environment variable: the LLM transforms the tokens through code rather than ingesting them directly into the Transformer context.
  • Recursive inference-time calls extend effective context to roughly 10M tokens: RLMs call sub-LLMs recursively on selected pieces of context, processing prompts up to two orders of magnitude longer than the default context window.
  • RLMs outperform long-context scaffolds on hard benchmarks: across S-NIAH, BrowseComp-Plus, OOLONG, and OOLONG Pairs, RLM variants of GPT-5 and Qwen3-Coder improve F1 and accuracy over direct model calls, retrieval agents (such as CodeAct), and summarization agents, at similar or lower cost per query.
  • REPL-only variants help, but recursion matters for quadratic tasks: a full RLM is required for large gains in information-dense settings like OOLONG Pairs.
  • Prime Intellect’s RLMEnv and INTELLECT-3 operationalize RLMs: the root LM runs in a sandboxed Python REPL, calls heavy tools through sub-LMs, and writes its result to an answer slot; models such as INTELLECT-3 report consistent gains on the DeepDive, Oolong, Math Python, and Verbatim Copy environments.

Check out the Paper for the full technical details.

