Recent developments in reasoning-centric large language models (LLMs) have broadened the scope of reinforcement learning (RL), enabling its use in more general, reasoning-oriented applications. This shift poses significant challenges, particularly in scaling up the compute required to learn from experience: RL is far more computationally demanding than imitation via pretraining and fine-tuning. One of the most important issues is the collapse of policy entropy, which upsets the balance between exploitation and exploration. This exploitation-exploration trade-off is fundamental in RL, and controlling policy entropy has become critical to maintaining effective exploration during training.
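To make the entropy notion concrete, here is a minimal, self-contained sketch of how policy entropy is computed from a model's output logits. The function names are illustrative, not from the paper; a peaked (exploitative) distribution yields low entropy, while a flat (exploratory) one yields high entropy.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_entropy(logits):
    """Shannon entropy H = -sum(p * log p) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (exploitation) has low entropy...
low = policy_entropy([10.0, 0.0, 0.0, 0.0])
# ...while a uniform one (maximal exploration) has entropy log(vocab size).
high = policy_entropy([1.0, 1.0, 1.0, 1.0])
```

Entropy collapse means `high`-like distributions early in training drift toward `low`-like ones, after which exploration effectively stops.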
Existing efforts address the exploration-exploitation trade-off by regulating policy entropy. In maximum-entropy RL, a regularization term is added to the reward function to encourage exploration by promoting uncertainty in action selection. This technique is widely used in traditional RL algorithms, but its applicability to LLMs has not been fully explored. Likewise, the predictability of RL training for LLMs remains understudied: although neural scaling laws govern LLM pretraining, comparable predictive frameworks for RL are limited. Existing RL methods with verifiable rewards show promising reasoning improvements in LLMs, but a thorough understanding of the underlying mechanisms is still missing.
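The maximum-entropy regularization mentioned above can be sketched as adding an entropy bonus to the task reward. This is a generic illustration of the technique, not the paper's method; `alpha` is a hypothetical temperature coefficient controlling how strongly exploration is rewarded.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_return(reward, action_probs, alpha=0.01):
    """Maximum-entropy RL objective: task reward plus an entropy bonus.

    alpha is a hypothetical coefficient (not a value from the paper)
    trading off reward maximization against keeping the policy uncertain.
    """
    return reward + alpha * entropy(action_probs)

# The same task reward scores higher under a more uncertain policy,
# which discourages premature convergence to a deterministic policy.
flat = regularized_return(1.0, [0.25] * 4)
peaked = regularized_return(1.0, [0.97, 0.01, 0.01, 0.01])
```

The design intuition is that the bonus term penalizes collapsing onto a single action too early, preserving exploration signal in the gradient.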
Researchers from Tsinghua University, Shanghai AI Laboratory, Peking University, Nanjing University, and CUHK have developed a method to deal with the collapse of policy entropy in reasoning-centric LLMs. They established a transformation equation, R = −a·exp(H) + b, where H is policy entropy, R is downstream performance, and a and b are fitting coefficients. This empirical law strongly suggests that policy performance is traded for policy entropy and is therefore bottlenecked by its exhaustion. They also investigated entropy dynamics, showing that the change in policy entropy is driven by the covariance between action probabilities and changes in logits. Based on this, they proposed two techniques, Clip-Cov and KL-Cov, which respectively clip, and apply a KL penalty to, tokens with high covariance.
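Because R = −a·exp(H) + b is linear in x = exp(H), the coefficients a and b can be recovered with ordinary least squares. The sketch below is illustrative: the function name and the synthetic (H, R) points are invented for demonstration, not taken from the paper.

```python
import math

def fit_entropy_performance(entropies, performances):
    """Fit R = -a * exp(H) + b by ordinary least squares.

    Substituting x = exp(H) makes the law linear (R = b - a*x),
    so the slope and intercept have a closed form.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(performances) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, performances))
             / sum((x - mx) ** 2 for x in xs))
    a = -slope
    b = my - slope * mx
    return a, b

# Synthetic points generated from R = -0.2 * exp(H) + 0.9 (illustrative
# coefficients, not the paper's fitted values):
H = [0.1, 0.5, 1.0, 1.5]
R = [-0.2 * math.exp(h) + 0.9 for h in H]
a, b = fit_entropy_performance(H, R)
```

On noiseless synthetic data the fit recovers a = 0.2 and b = 0.9 exactly; in practice the coefficients would be fitted to measured (entropy, benchmark score) pairs over a training run.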
The researchers used an autoregressive setup in which models generate token sequences conditioned on input prompts. The study covers 11 models ranging from 0.5B to 32B parameters across four open-source families: Qwen2.5, Mistral, LLaMA, and DeepSeek. RL training is performed in a zero-shot setting using algorithms such as GRPO, REINFORCE++, and PRIME, optimizing policy performance while tracking entropy dynamics.
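GRPO, one of the algorithms named above, replaces a learned value network with group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its own rollout group. A minimal sketch (function name and example rewards are illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and std of its own group, so no critic model is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses for one prompt under a verifiable reward:
# correct answers (1.0) get positive advantage, wrong ones negative.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With verifiable (0/1) rewards this yields a clean contrastive signal within each group, which is part of why GRPO pairs naturally with the zero-shot RL setting described here.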
Clip-Cov and KL-Cov were evaluated on mathematical reasoning tasks with Qwen2.5 models using DAPO-MATH data. Both methods deliver non-trivial gains across benchmarks: compared with the GRPO baseline, they improve average performance by 2.0% on the 7B model and 6.4% on the 32B model. When the baseline's entropy plateaus, for example, KL-Cov maintains entropy roughly 10x higher, and both methods sustain a high entropy level throughout training. The gains are especially pronounced on the largest model: on AIME24 and AIME25, Qwen2.5-32B improves by 15% and 14.6%, respectively, over GRPO.
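The core mechanism behind Clip-Cov and KL-Cov can be sketched as follows. This is a simplified illustration under my own assumptions, not the paper's implementation: the per-token covariance here is just the product of centered log-probabilities and centered advantages, and `threshold` and `beta` are hypothetical hyperparameters.

```python
def token_covariances(logprobs, advantages):
    """Per-token covariance terms: (logp_i - mean_logp) * (A_i - mean_A).

    Tokens where confidence and advantage move together drive entropy
    collapse; both methods single out these high-covariance tokens.
    (Simplified relative to the paper's formulation.)
    """
    ml = sum(logprobs) / len(logprobs)
    ma = sum(advantages) / len(advantages)
    return [(l - ml) * (a - ma) for l, a in zip(logprobs, advantages)]

def clip_cov_mask(covs, threshold):
    """Clip-Cov sketch: detach (mask out) the policy-gradient update for
    tokens whose covariance exceeds a threshold. 1 = keep, 0 = clip."""
    return [0 if c > threshold else 1 for c in covs]

def kl_cov_penalty(covs, threshold, beta=0.1):
    """KL-Cov sketch: apply a penalty (placeholder proportional to the
    covariance; beta is a hypothetical coefficient) only to
    high-covariance tokens, leaving the rest untouched."""
    return [beta * c if c > threshold else 0.0 for c in covs]

covs = token_covariances([-0.1, -2.0, -0.3], [1.5, -0.5, 1.0])
mask = clip_cov_mask(covs, threshold=0.5)      # high-cov tokens masked
penalty = kl_cov_penalty(covs, threshold=0.5)  # penalized instead
```

Both variants leave the majority of tokens to update normally and intervene only on the small fraction whose updates would most sharply reduce entropy.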
In conclusion, the researchers tackle the problem of policy entropy collapse in reasoning-centric LLMs. They find that performance improvements are consistently purchased with reduced exploration, a trade-off that caps future gains. Through theoretical analysis and empirical validation, they identify entropy dynamics as a key bottleneck and propose two effective regularization strategies, Clip-Cov and KL-Cov, that manage high-covariance tokens to sustain exploration. In an era when RL is a vital axis of scaling, addressing entropy collapse becomes crucial, and this research provides insight into the role entropy plays in scaling RL toward more capable language models.
Check out the Paper and the GitHub Page. All credit for this research goes to the researchers of this project.


