
NVIDIA introduces ProRL: long-horizon reinforcement learning boosts reasoning and generalization

Tech · By Gavin Wallace · 05/06/2025 · 4 Mins Read

Recent advances in reasoning-focused language models have reshaped AI by scaling test-time computation. Reinforcement learning plays a critical role both in developing these capabilities and in mitigating reward-hacking pitfalls. Yet it remains debated whether RL elicits genuinely new reasoning from a base model or merely improves sample efficiency on solutions the model can already produce. Current research has two important limitations: (a) it relies heavily on specialized domains such as mathematics, where models are overtrained to the point of limited exploration, and (b) it terminates RL training prematurely, before fully new reasoning capabilities can develop.

AI reasoning models engage in long, detailed chain-of-thought (CoT) processes to generate final answers. Training recipes such as DeepSeek's and Kimi's use reinforcement learning with verifiable rewards (RLVR) to train reasoning models, which has made algorithms like GRPO, Mirror Descent, and RLOO popular. Earlier systems such as AlphaGo and AlphaZero showed that RL agents can improve indefinitely, developing techniques absent from their base policy. By contrast, some works have questioned whether RLVR training improves the reasoning capability of LLMs at all, based on pass@k metrics that show no improvement over baseline models.
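These pass@k critiques hinge on how the metric is estimated. The standard unbiased estimator computes, from n generated samples of which c are correct, the probability that at least one of k drawn samples solves the task; here is a minimal Python version (the function name is mine, not from the paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # fewer than k failures exist: a pass is certain
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, 1 correct answer out of 2 generations gives pass@1 = 0.5. Claims that RLVR leaves large-k pass@k unchanged compare exactly this quantity between base and RL-tuned models.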

NVIDIA researchers have developed ProRL, a training method that enables extended RL periods and deeper exploration of reasoning strategies. ProRL scales data across diverse tasks, including math problems, coding puzzles, science questions, logic games, and instruction following, and supports more than 2,000 training steps. Using ProRL, the researchers built Nemotron-Research-Reasoning-Qwen-1.5B, which they describe as the world's best 1.5B reasoning model: it outperforms its base model, DeepSeek-R1-1.5B, and even surpasses DeepSeek-R1-7B across diverse benchmarks. The researchers found that, given sufficient training and novel reasoning challenges, RL can discover solution pathways that are unavailable in base models.
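Stability over thousands of RL steps is the hard part of such extended runs. As I understand the ProRL recipe, it pairs a KL-style penalty against a frozen reference model with periodic resets of that reference, so the anchor tracks progress instead of permanently capping exploration. A toy Python sketch of that loop (all names, update rules, and constants are illustrative stand-ins, not the paper's implementation):

```python
def kl_penalized_reward(task_reward, logp, ref_logp, beta=0.01):
    """Verifiable task reward minus a per-token KL-style penalty
    that keeps the policy close to the current reference model."""
    return task_reward - beta * (logp - ref_logp)

def train_with_reference_resets(n_steps=2000, reset_every=500):
    """Toy long-horizon loop: every `reset_every` steps the frozen
    reference is replaced by a snapshot of the current policy, so
    later steps are penalized relative to recent behavior rather
    than the original base model."""
    policy, ref = 0.0, 0.0       # scalar stand-ins for parameters
    resets = 0
    for step in range(1, n_steps + 1):
        policy += 0.001          # pretend a gradient update happened
        if step % reset_every == 0:
            ref = policy         # reference-policy reset
            resets += 1
    return resets
```

With the defaults above, a 2,000-step run performs four reference resets; the point is only the schedule, not the arithmetic.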

The researchers created an extensive and diverse training set spanning 136,000 tasks across five domains: math, code, science, logic puzzles, and instruction following. Training is built on verl, a framework that adopts the enhancements to GRPO proposed by DAPO. A variety of evaluation benchmarks test the model across domains: mathematics evaluations include AIME2024, AIME2025, and AMC; coding assessments use HumanEval+ and LiveCodeBench; and the logic-puzzle evaluations reserve 100 held-out samples.
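Two ingredients of that recipe can be sketched concretely: GRPO replaces a learned value function with group-relative advantages, and DAPO decouples the PPO clipping range so the upper bound is looser than the lower one ("clip-higher"). A minimal sketch of both ideas; the epsilon defaults are illustrative, not necessarily the values used in this work:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each rollout's reward
    is normalized by its group's mean and standard deviation,
    replacing a learned value baseline."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard all-equal groups
    return [(r - mu) / sigma for r in rewards]

def dapo_clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate term with DAPO's decoupled clipping:
    the upper bound (1 + eps_high) exceeds the lower (1 - eps_low),
    leaving more room to up-weight low-probability tokens."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

Because the advantage is computed per group of rollouts for the same prompt, a group where every sample earns the same reward contributes zero gradient signal, which is why DAPO-style recipes also filter such degenerate groups.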

In mathematics, Nemotron-Research-Reasoning-Qwen-1.5B achieves an average improvement of 15.7% across benchmarks, while competitive programming tasks show a 14.4% improvement in pass@1 accuracy. STEM reasoning and instruction following yield gains of 25.9% on GPQA Diamond and 22.0% on IFEval. On Reasoning Gym, the model improves its reward by 54.8% and solves puzzles with high accuracy. Out-of-distribution analysis shows significant improvement on three previously unseen Reasoning Gym tasks, highlighting effective generalization beyond the training distribution. The ProRL-trained model also outperforms the domain-specialized DeepScaleR-1.5B and DeepCoder-1.5B on math and code benchmarks, respectively.

In summary, the researchers introduce ProRL and provide evidence that stable, extended RL training can develop novel reasoning patterns beyond a base model's initial capabilities. Built with this method, Nemotron-Research-Reasoning-Qwen-1.5B performs tasks its base model cannot, showing that extended RL training helps models learn abstract reasoning patterns that transfer beyond the training distribution. These results challenge prior assumptions about the limits of RL and show that sufficient training with appropriate techniques can expand reasoning boundaries, paving the way for more powerful reasoning models.


Take a look at the Paper and the Model Page. All credit for this research goes to the researchers of the project.


Sajjad is a final-year undergraduate at IIT Kharagpur. A tech enthusiast, he is interested in applications of AI, with an emphasis on their real-world impact and implications, and aims to explain complex AI concepts clearly and accessibly.
