The Growing Need for Scalable Reasoning Models in Machine Intelligence
Machine intelligence has advanced to the point where models can carry out sophisticated multi-step reasoning, especially in domains such as mathematical problem-solving and symbolic reasoning. These models perform multi-step calculations and logical deductions, often producing solutions that mirror human reasoning processes. Reinforcement learning is typically applied after pretraining to push accuracy further, but scaling these techniques while maintaining efficiency remains a challenge. In response to the growing demand for smaller, more resource-efficient models that still exhibit strong reasoning capabilities, researchers have turned to strategies that address data quality, exploration methods, and long-context generalization.
The Challenges of Reinforcement Learning for Large Reasoning Architectures
A persistent issue in reinforcement learning is the mismatch between a model's capability and the difficulty of its training data. When tasks are too easy, the model's learning curve stagnates; when the data is too hard, the model is overwhelmed and produces no learning signal at all. The problem is especially pronounced when recipes designed for small models are applied directly to larger ones. Another issue is the lack of efficient methods for adjusting rollout diversity and output length during both training and inference, which further restricts a model's reasoning ability on complex benchmarks.
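To make the zero-signal problem concrete, here is a minimal sketch of how group-relative (GRPO-style) advantage estimation collapses when a prompt is far too easy or far too hard for the model. The function name and reward values are illustrative and not taken from the Polaris codebase.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: each rollout's reward is normalized
    against the mean and std of its own group of rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:
        # Every rollout received the same reward -> no gradient signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Too-easy prompt: every rollout is correct (reward 1) -> zero advantages.
print(group_relative_advantages([1, 1, 1, 1]))   # [0. 0. 0. 0.]

# Too-hard prompt: every rollout is wrong (reward 0) -> zero advantages.
print(group_relative_advantages([0, 0, 0, 0]))   # [0. 0. 0. 0.]

# Well-matched prompt: mixed outcomes -> informative, non-zero advantages.
print(group_relative_advantages([1, 0, 1, 0]))   # [ 1. -1.  1. -1.]
```

In both extreme cases the policy gradient vanishes, which is why difficulty must be matched to the model's current capability.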
Limitations of Current Post-Training Methods for Advanced Models
Earlier approaches such as DeepScaleR, GRPO, and GRPO-R have shown that reinforcement learning can improve the performance of reasoning models at small scales of around 1.5 billion parameters. However, applying the same recipes to more capable models, such as Qwen3-4B or Deepseek-R1-Distill-Qwen-7B, yields only marginal gains or even performance drops. Two major limitations are static data distributions and limited sampling diversity. Most of these methods do not adjust the sampling temperature or response length over the course of training, nor do they filter data according to the model's capability. As a result, they often fail to scale to more advanced architectures.
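As an illustration of the capability-based filtering these earlier recipes lack, the following is a hypothetical sketch that drops prompts the current model either always or never solves. The `Problem` class, the pass-rate field, and the thresholds are assumptions made for the example, not part of any released pipeline.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    pass_rate: float  # fraction of sampled rollouts the current model solves

def filter_by_capability(problems, low=0.1, high=0.9):
    """Keep only problems the model sometimes, but not always, solves.
    Prompts outside this band contribute little or no learning signal."""
    return [p for p in problems if low <= p.pass_rate <= high]

pool = [
    Problem("trivial arithmetic", 1.00),   # too easy: dropped
    Problem("olympiad geometry", 0.35),    # informative: kept
    Problem("unsolved conjecture", 0.00),  # too hard: dropped
]
print([p.prompt for p in filter_by_capability(pool)])  # ['olympiad geometry']
```

In practice the pass rate would be estimated by sampling several rollouts per prompt and grading them, and the filtered pool would be refreshed as the model improves.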
Polaris: A Recipe Tailored for Scaling Reinforcement Learning on Advanced Reasoning Tasks
Researchers from Fudan University have introduced Polaris, a post-training recipe designed specifically to scale reinforcement learning for advanced reasoning tasks. Polaris comes in two preview versions: Polaris-4B-Preview, fine-tuned from Qwen3-4B, and Polaris-7B-Preview, based on Deepseek-R1-Distill-Qwen-7B. The researchers developed a model-agnostic approach that encourages diverse exploration through controlled sampling temperatures and extends inference length through extrapolation. Both models are optimized to run on consumer graphics processing units, and the recipe relies on open-source datasets and training pipelines.
Polaris Innovations: Difficulty Balancing, Controlled Sampling, and Long-Context Inference
Polaris implements several key innovations. First, the training data is dynamically updated to reflect the model's growing abilities. Second, the researchers adjust the sampling temperature across training stages, using 1.4, 1.45, and 1.5 for Polaris-4B and 0.7, 1.0, and 1.1 for Polaris-7B, to maintain rollout diversity. Third, Yarn-based techniques expand the inference context to 96K tokens without additional training, enabling a "train-short, test-long" approach that addresses the inefficiency of long-sequence learning. Finally, techniques such as the Rollout Rescue System and Intra-Batch Informative Replacement avoid zero-reward batches and preserve useful training signals even when the rollout size is kept small.
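The stage-wise temperatures and the 96K-token extrapolation target can be summarized in a small configuration sketch. The schedule values come from the article above, but the helper function and the rope-scaling dictionary keys (which follow the common Hugging Face YaRN convention) are assumptions rather than the released Polaris configuration.

```python
# Stage-wise sampling temperatures reported above; the schedule structure
# and helper names are illustrative, not from the Polaris code release.
TEMPERATURE_SCHEDULE = {
    "Polaris-4B": [1.40, 1.45, 1.50],
    "Polaris-7B": [0.70, 1.00, 1.10],
}

def rollout_temperature(model_name: str, stage: int) -> float:
    """Return the sampling temperature for a given training stage (0-indexed)."""
    stages = TEMPERATURE_SCHEDULE[model_name]
    return stages[min(stage, len(stages) - 1)]

# YaRN-style "train-short, test-long" extrapolation: scale rotary position
# embeddings at inference to reach ~96K tokens. The dict keys and the assumed
# 32K trained context follow common Hugging Face conventions, not the
# official Polaris configuration.
yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": 96_000 / 32_768,          # target context / trained context (assumed)
    "original_max_position_embeddings": 32_768,
}

print(rollout_temperature("Polaris-4B", stage=2))  # 1.5
```

Raising the temperature stage by stage keeps rollouts diverse as the model sharpens, while the rope-scaling step lets a model trained on shorter sequences be evaluated with much longer reasoning traces.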
Polaris Outperforms Larger Models on Math Benchmarks
Polaris models achieve state-of-the-art results on multiple math benchmarks. Polaris-4B-Preview reaches 81.2% accuracy on AIME24, outperforming the much larger Qwen3-32B while using only a fraction of its parameters, and scores 44.0% on Minerva Math, 69.1% on Olympiad Bench, and 94.8% on AMC23. Polaris-7B-Preview is also a strong performer, with 72.6% on AIME24 and 52.6% on AIME25. These results show consistent gains over models such as Claude-4 and Grok-3, positioning Polaris as a lightweight, compact model whose performance bridges the gap between small open models and 30B+ commercial systems.

Conclusion: Effective Reinforcement Learning Through Smart Post-Training Strategies
The researchers show that scaling reasoning models effectively requires intelligent control over training-data difficulty, sampling diversity, and inference length. Polaris provides a reproducible recipe that tunes all of these elements, allowing smaller models to match the reasoning power of massive commercial systems.
Nikhil works as an intern at Marktechpost. He holds an integrated dual degree in Materials from the Indian Institute of Technology, Kharagpur. An AI/ML enthusiast with a background in Materials Science, he is constantly researching applications of machine learning in biomaterials and other biomedical fields.


