
Meta AI introduces DreamGym, a textual experience synthesizer for Reinforcement Learning (RL) agents

Tech · By Gavin Wallace · 18/11/2025 · 7 Mins Read

On paper, reinforcement learning (RL) for agents is attractive, but in practice it is held back by cost, infrastructure, and noisy rewards. Training an agent to click through web pages and complete multi-step tool use can take tens or even hundreds of thousands of environment interactions, each of which is slow, fragile, and hard to reset. DreamGym, a new framework from Meta AI, reframes this bottleneck as a modeling problem: it learns a reasoning-based experience model that simulates these environments entirely in text.

Paper: https://arxiv.org/pdf/2511.03773

Why Doesn’t Real-Environment RL for Agents Scale?

Four interrelated problems plague current RL pipelines for agents. Real rollouts are expensive because the infrastructure stack is heavy, task variety is limited, and reward signals are unstable. Web environments change frequently, and rewards often depend on fragile scrapers. Many actions are irreversible, and reliable reset mechanisms and episode controls are hard to implement, which makes long-horizon tasks noisy and sample-inefficient.

The benchmarks fall into two categories. WebShop and ALFWorld are RL-ready but expensive: strong PPO or GRPO baselines still require about 80,000 real transitions. WebArena Lite is not RL-ready at all, because its automatic reward checks and resets are unreliable, which makes online RL effectively impossible.

DreamGym Is a Reasoning-Based Simulator

DreamGym has three components: a reasoning-based experience model, an experience replay buffer, and an adaptive curriculum task generator. Together they define a synthetic Markov decision process in which the environment is represented entirely as text.

The reasoning-based experience model operates over abstract textual states: compact descriptions of what matters for the task, for example cleaned page elements rather than raw HTML. At each step, the model receives the current state, the agent’s action, the task instruction, and the interaction history. It retrieves the k most similar transitions from the replay buffer and then applies chain-of-thought reasoning to produce a reasoning trace, the next state, and a reward.
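As a concrete illustration of this state abstraction (a sketch, not code from the paper), a minimal reducer might keep only the visible text of interactive elements from raw HTML; the function name and regex are assumptions for illustration:

```python
import re

def compact_state(html: str) -> str:
    """Reduce raw HTML to a compact, task-relevant textual state.

    Illustrative only: keeps the visible text of links and buttons,
    the kind of 'cleaned page elements' abstraction described above.
    """
    items = re.findall(r"<(?:a|button)\b[^>]*>([^<]+)</(?:a|button)>", html)
    return "Page elements: " + ", ".join(t.strip() for t in items)

# compact_state('<div><a href="/buy">Buy now</a><button>Add to cart</button></div>')
# → "Page elements: Buy now, Add to cart"
```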

In effect, the experience model is a world model for LLM agent tasks on the web or with tools, defined purely in text. It is trained by supervised fine-tuning on offline trajectories, with the joint objective of generating both the reasoning trace and the next state conditioned on that trace. This forces the model to encode causal structure rather than only local text statistics.
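The experience model’s step interface can be sketched as follows; the prompt layout, dataclass, and function names are assumptions for illustration, not the paper’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # compact textual state
    action: str
    next_state: str
    reward: float

def build_experience_prompt(task, history, action, retrieved):
    """Assemble the conditioning context for one synthetic step:
    task instruction, interaction history, current action, and the
    k retrieved similar transitions from the replay buffer."""
    examples = "\n".join(
        f"- {t.state} --[{t.action}]--> {t.next_state} (reward={t.reward})"
        for t in retrieved
    )
    return (
        f"Task: {task}\n"
        f"History: {history}\n"
        f"Action: {action}\n"
        f"Similar past transitions:\n{examples}\n"
        "Reason step by step, then output the next state and the reward."
    )

# An LLM call (not shown) would take this prompt and return a reasoning
# trace, the next textual state, and a scalar reward.
```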


A Replay Buffer Grounds the Synthetic Experience

The experience replay buffer is initialized with offline real-environment data from WebShop, ALFWorld, and WebArena Lite. As policies train in the synthetic environment, DreamGym adds the new trajectories to the buffer. At each prediction step, the experience model uses an encoder to retrieve a set of similar transitions from this memory and conditions on them when generating the reasoning trace and the next state.

This retrieval acts as grounding: it keeps the synthetic data distribution close to the real one and reduces hallucination during long rollouts. The researchers found that removing the interaction history and the retrieval of similar transitions degraded the consistency, factuality, and informativeness of generated states, as judged by external evaluators, and reduced downstream success rates on WebShop and WebArena Lite.
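A minimal sketch of the retrieval step, using a bag-of-words cosine similarity as a stand-in for the learned encoder the paper uses:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; DreamGym uses a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(buffer: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k buffered transitions most similar to the current context."""
    q = embed(query)
    return sorted(buffer, key=lambda t: cosine(embed(t), q), reverse=True)[:k]
```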

A Curriculum Driven by Reward Entropy

The curriculum task generator builds on the experience model. It selects tasks with high reward variance under the current policy, which correspond to tasks of intermediate difficulty that the agent sometimes solves and sometimes fails. For each selected task, it generates variants that preserve the action type but modify constraints, targets, or context.

The selection heuristic is based on reward entropy computed over batches of rollouts: tasks with non-zero reward variance and a balanced ratio of successes to failures are preferred. Turning off the adaptive curriculum reduces performance on WebShop and WebArena Lite by about 6 percentage points, and causes performance to plateau as the replay buffer saturates with low-entropy, easy trajectories.
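The heuristic can be sketched like this (a simplification under assumed interfaces; the paper’s exact scoring may differ):

```python
import math

def reward_entropy(rewards: list[float]) -> float:
    """Binary entropy of the success rate over a batch of rollouts.
    Maximal when successes and failures are balanced (p = 0.5)."""
    if not rewards:
        return 0.0
    p = sum(1 for r in rewards if r > 0) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0  # always solved or always failed: low learning signal
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_tasks(rollouts_by_task: dict, n: int = 2) -> list:
    """Prefer tasks of intermediate difficulty (highest reward entropy)."""
    return sorted(rollouts_by_task,
                  key=lambda t: reward_entropy(rollouts_by_task[t]),
                  reverse=True)[:n]

# select_tasks({"easy": [1, 1, 1, 1], "mixed": [1, 0, 1, 0]}, n=1) → ["mixed"]
```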


Training Inside DreamGym and Theoretical Guarantees

The policy inside DreamGym is trained with standard RL algorithms; the researchers evaluate Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Rollouts alternate between the policy selecting actions and the experience model synthesizing next states and rewards. From the RL algorithm’s perspective, DreamGym is just another environment interface.
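From the RL side, a synthetic rollout is an ordinary environment loop; a sketch under assumed interfaces (policy and experience model as plain callables, not the paper’s actual APIs):

```python
def synthetic_rollout(policy, experience_model, task, init_state, max_steps=10):
    """Collect one trajectory entirely inside the experience model.

    Assumed interfaces:
      policy(task, state) -> action
      experience_model(task, state, action) -> (next_state, reward, done)
    The (state, action, reward) tuples feed PPO/GRPO exactly like
    real-environment data.
    """
    state, trajectory = init_state, []
    for _ in range(max_steps):
        action = policy(task, state)
        next_state, reward, done = experience_model(task, state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```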

The researchers also derive a trust-region-style improvement bound that relates policy performance in the synthetic MDP to performance in the real one. The bound contains error terms that depend on reward prediction error and the divergence between real and synthetic state distributions; as these errors shrink, improvements inside DreamGym transfer more tightly to the real environment.

Experimental Results on WebShop, ALFWorld, and WebArena Lite

DreamGym was tested on WebShop, ALFWorld, and WebArena Lite with agents based on Qwen and Llama backbones. The results fall into three regimes.

First, in RL-ready but expensive environments (WebShop and ALFWorld), agents trained with PPO or GRPO inside DreamGym, using only synthetic transitions, match the performance of PPO and GRPO baselines that use around 80,000 real-environment interactions. Reasoning-based synthetic experience evidently carries enough signal for policy improvement.

Second, in environments that are not RL-ready, DreamGym enables training that would otherwise be impossible. On WebArena Lite, the framework achieves a success rate more than 30 percent higher than non-RL baselines, including supervised fine-tuning and direct behavior cloning.

Third, in sim-to-real transfer, DreamGym’s S2R configuration trains policies on synthetic experience first and then fine-tunes them with real-world rollouts. This yields over a 40 percent improvement compared with training purely in the real environment, while using less than 10 percent of the real data and cutting training cost to between a third and a fifth of the baselines.


The Key Takeaways

  1. DreamGym’s reasoning-based experience model replaces real-environment rollouts. It operates in an abstract textual state space and predicts next states and rewards from the task, the interaction history, and retrieved past transitions.
  2. The framework combines three components: a reasoning-based experience model, an experience replay buffer seeded with real trajectories, and a curriculum generator that selects and varies tasks using a reward-entropy heuristic. Together they stabilize and diversify RL training.
  3. Agents trained in DreamGym with PPO or GRPO on purely synthetic interactions match the performance of PPO and GRPO baselines that use about 80,000 real-environment transitions.
  4. On WebArena Lite, which is not RL-ready, DreamGym enables online RL and achieves success rates more than 30 percent higher than non-RL baselines such as supervised fine-tuning and behavior cloning.
  5. The sim-to-real (S2R) configuration pretrains policies in DreamGym and then fine-tunes them on real data, yielding an additional 40 percent improvement while using less than 10 percent of the real-interaction budget and reducing total training cost to between a third and a fifth of standard RL.

DreamGym is a significant step toward practical reinforcement learning for LLM agents. It reframes the environment as a reasoning-based experience model, grounded by replay retrieval and guided by a reward-entropy curriculum. Given the reported results on WebShop, ALFWorld, and WebArena Lite with PPO and GRPO, synthetic experience combined with sim-to-real adaptation could become a common pattern in agent training: DreamGym scales RL agents by scaling the experience model rather than the policy.

