Recent advances in reasoning-focused language models have revolutionized AI by scaling test-time compute. Reinforcement learning (RL) plays a critical role in developing these capabilities and in mitigating reward-hacking pitfalls, yet a fundamental debate persists: does RL elicit genuinely new reasoning from a base model, or does it merely improve sampling efficiency for solutions the model already contains? Current research suffers from two key limitations: (a) heavy reliance on specialized domains such as mathematics, where models are often overtrained to the point that exploration is limited, and (b) terminating RL training prematurely, before models can fully develop new reasoning capabilities.
Reasoning models generate long, detailed chain-of-thought (CoT) traces before producing a final answer. Training recipes such as those behind DeepSeek-R1 and Kimi k1.5 use reinforcement learning with verifiable rewards (RLVR), which has popularized algorithms such as GRPO, Mirror Descent, and RLOO. Earlier landmark systems such as AlphaGo and AlphaZero showed that RL agents can improve indefinitely, discovering strategies absent from their initial policies. More recently, however, some works have questioned whether RLVR training actually improves the reasoning capability of LLMs, pointing to pass@k metrics that show no improvement over base models.
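The pass@k metric behind this debate is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): given n sampled generations of which c are correct, it estimates the probability that at least one of k random samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    probability that at least one of k samples drawn without
    replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per problem, 4 of them correct
print(pass_at_k(16, 4, 1))  # 0.25
```

Large-k pass@k is exactly the quantity skeptics use: if the base model already solves a problem within n samples, RLVR may raise pass@1 without raising pass@k.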
NVIDIA researchers have developed ProRL (Prolonged Reinforcement Learning), a training method that enables extended RL training and deeper exploration of reasoning strategies. ProRL scales training data across diverse tasks, including math problems, coding puzzles, science questions, logic games, and instruction following, and supports more than 2,000 training steps. Using ProRL, the researchers trained Nemotron-Research-Reasoning-Qwen-1.5B, which they present as the world's best 1.5B reasoning model: it substantially outperforms its base model, DeepSeek-R1-Distill-Qwen-1.5B, and even surpasses DeepSeek-R1-7B across diverse benchmarks. Crucially, the researchers found that, given sufficient training time and novel reasoning challenges, RL can discover solution pathways that are simply unavailable to the base model.
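A common ingredient for keeping such long RL runs stable (and one ProRL relies on, per the paper) is a KL penalty that keeps the policy close to a frozen reference model, with the reference periodically reset during training. A minimal sketch of the penalized reward, where `beta` is a hypothetical coefficient, not the paper's value:

```python
import numpy as np

def kl_penalized_reward(reward: float,
                        logp_cur: list[float],
                        logp_ref: list[float],
                        beta: float = 0.01) -> float:
    """Sequence reward minus a KL penalty toward a frozen reference
    policy, estimated from per-token log-probabilities. Drifting far
    from the reference is taxed, which stabilizes prolonged training."""
    # per-token log-ratio log pi_cur(t) - log pi_ref(t)
    kl = np.asarray(logp_cur) - np.asarray(logp_ref)
    return float(reward - beta * kl.sum())
```

When the penalty begins to dominate, resetting the reference to the current policy restores room to improve, which is part of what lets ProRL run past the point where standard RLVR recipes plateau.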
The researchers built an extensive, diverse training dataset spanning 136,000 examples across five domains: math, code, science, logic puzzles, and instruction following. Training is built on verl, an open-source RL framework, and adopts the enhancements to GRPO proposed by DAPO. A broad suite of benchmarks evaluates the model across domains: mathematics with AIME2024, AIME2025, and AMC; coding with HumanEval+ and LiveCodeBench; and logic puzzles with 100 held-out samples reserved for evaluation.
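At the heart of the GRPO family of methods used here is a critic-free advantage: for each prompt, a group of rollouts is sampled and each rollout's reward is standardized against the group's statistics. A minimal sketch of that step only, not NVIDIA's verl/DAPO implementation:

```python
import numpy as np

def grpo_advantages(group_rewards: list[float]) -> np.ndarray:
    """Group-normalized advantages as in GRPO: each rollout's reward
    for the same prompt is standardized against the group mean/std,
    replacing a learned value function."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# 4 rollouts for one prompt with binary verifiable rewards
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward verified solutions without training a separate critic.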
In mathematics, Nemotron-Research-Reasoning-Qwen-1.5B achieves an average improvement of 15.7% across benchmarks, while competitive programming tasks show a 14.4% improvement in pass@1 accuracy. STEM reasoning and instruction following yield gains of 25.9% on GPQA Diamond and 22.0% on IFEval. On Reasoning Gym puzzles, reward improves by 54.8% and the model reaches high accuracy. Out-of-distribution analysis shows significant improvement on three previously unseen Reasoning Gym tasks, highlighting effective generalization beyond the training distribution. The ProRL-trained model also outperforms the domain-specialized DeepScaleR-1.5B and DeepCoder-1.5B on math and code benchmarks, respectively.
In this paper, the researchers introduce ProRL and provide evidence that stable, extended RL training can develop novel reasoning patterns beyond a base model's initial capabilities. Using this method, they built Nemotron-Research-Reasoning-Qwen-1.5B, which they present as the world's best 1.5B reasoning model. The model solves tasks on which its base model fails entirely, showing that prolonged RL training helps models internalize abstract reasoning patterns that transfer beyond the training distribution. These results challenge prior assumptions about the limits of RL and show that sufficient training, paired with appropriate stabilization techniques, can expand a model's reasoning boundaries, paving the way for more capable reasoning models.
Check out the Paper and Model Page. All credit for this research goes to the researchers of this project.


