Reinforcement learning has become a foundational post-training method for large language models (LLMs), using supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it depends on queries with verifiable answers, which limits its application to large-scale training on general-domain queries where verification proves difficult. Moreover, current scalar and generative reward models fail to scale test-time compute for reward estimation: they apply uniform computational resources to every input and lack the flexibility to allocate additional computation to challenging queries that require nuanced analysis.
Reward models can be characterized by scoring scheme, formulation strategy, and reward structure. Scalar approaches assign a numerical score to each query-response pair, while generative methods produce natural-language feedback. Scoring is based either on absolute evaluation of individual responses or on comparison among response candidates. The LLM-as-a-Judge paradigm, aligned with generative reward models, offers interpretable feedback but faces reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling through extended reasoning traces. However, these methods do not adapt compute to input difficulty, which limits their effectiveness across diverse query types.
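The two scoring schemes above can be contrasted in a minimal sketch. Both functions here are hypothetical toy stand-ins (a real system would call a trained model), illustrating only the interface difference: a scalar model returns one number per query-response pair, while a generative judge emits interpretable feedback alongside its verdict.

```python
from dataclasses import dataclass

def scalar_reward(query: str, response: str) -> float:
    """Scalar reward model: maps a query-response pair to a single number."""
    # Toy length-based heuristic standing in for a learned scalar head.
    return min(len(response) / 100.0, 1.0)

@dataclass
class Judgment:
    critique: str   # interpretable natural-language feedback
    preferred: int  # index of the preferred response (0 or 1)

def generative_judge(query: str, a: str, b: str) -> Judgment:
    """Generative (LLM-as-a-Judge) model: emits feedback plus a verdict."""
    # Toy comparison standing in for generated reasoning text.
    winner = 0 if scalar_reward(query, a) >= scalar_reward(query, b) else 1
    return Judgment(critique=f"Response {winner} is more detailed.", preferred=winner)
```

The key design difference is that only the generative judge exposes *why* a response won, which is what makes LLM-as-a-Judge feedback interpretable (and also what makes its biases visible in the critique text).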
Researchers from Microsoft Research and Tsinghua University have introduced Reward Reasoning Models (RRMs), which perform an explicit reasoning process before generating the final reward. This reasoning phase lets RRMs adaptively allocate additional computational resources when evaluating responses to complex tasks, enhancing reward modeling through test-time scaling while retaining general applicability across diverse evaluation scenarios. RRMs spend extra test-time computation on complex queries where appropriate rewards are not immediately apparent, and they are trained to develop their reward-reasoning capabilities on their own, without explicit reasoning traces as training data.
RRMs use the Qwen2 model with a Transformer-decoder backbone, framing reward modeling as text completion: the model autoregressively generates a thinking process and then produces a final judgment. Each input consists of a query and two candidate responses, between which the RRM determines a preference. The researchers use the RewardBench repository to guide systematic analysis along evaluation criteria including instruction fidelity, helpfulness, accuracy, safety, and level of detail. For multi-response evaluation, RRMs support ELO rating systems, knockout tournaments, and majority voting, all of which repeatedly sample the RRM for pairwise comparisons.
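The multi-response strategies above reduce to repeated pairwise calls. The sketch below illustrates the knockout-tournament and ELO aggregation patterns; `rrm_prefers` is a hypothetical length-based stand-in for the actual pairwise RRM call, which in the real system generates a reasoning trace before its verdict.

```python
def rrm_prefers(query: str, a: str, b: str) -> int:
    """Toy stand-in for a pairwise RRM judgment: 0 if `a` wins, else 1."""
    return 0 if len(a) >= len(b) else 1

def knockout(query: str, responses: list[str]) -> str:
    """Knockout tournament: pair up responses, advance winners, repeat."""
    pool = list(responses)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            winner = pool[i] if rrm_prefers(query, pool[i], pool[i + 1]) == 0 else pool[i + 1]
            nxt.append(winner)
        if len(pool) % 2 == 1:
            nxt.append(pool[-1])  # odd response out receives a bye
        pool = nxt
    return pool[0]

def elo_ratings(query: str, responses: list[str], k: float = 32.0) -> list[float]:
    """ELO-style ratings accumulated from all pairwise RRM comparisons."""
    ratings = [1000.0] * len(responses)
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
            score_i = 1.0 if rrm_prefers(query, responses[i], responses[j]) == 0 else 0.0
            ratings[i] += k * (score_i - expected_i)
            ratings[j] += k * ((1.0 - score_i) - (1.0 - expected_i))
    return ratings
```

The knockout tournament needs only N-1 comparisons to pick a single winner, while ELO requires all N(N-1)/2 pairings but yields a full ranking, a trade-off between judge-call budget and ranking granularity.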
RRMs achieve competitive performance on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in the reasoning category. Comparison against DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs use test-time computation effectively on complex queries. In reward-guided best-of-N inference, RRMs outperform all baseline models without additional test-time compute, and majority voting provides substantial further improvements across evaluated subsets. Post-training studies show downstream performance improvements on MMLU-Pro and GPQA, and experiments with 7B, 14B, and 32B models confirm that longer thinking horizons improve accuracy.
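Reward-guided best-of-N and majority voting compose naturally: each pairwise verdict is itself a vote over several sampled reasoning traces. The sketch below shows that composition with a hypothetical stochastic judge (`sample_preference` is a toy noisy model, not the paper's); because sampling is random, selection quality is probabilistic rather than guaranteed.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so runs are reproducible

def sample_preference(query: str, a: str, b: str) -> int:
    """One sampled judgment: toy model favoring length, with 20% noise."""
    base = 0 if len(a) >= len(b) else 1
    return base if rng.random() < 0.8 else 1 - base

def majority_vote(query: str, a: str, b: str, n_samples: int = 9) -> int:
    """Aggregate several sampled judgments into one pairwise verdict."""
    votes = Counter(sample_preference(query, a, b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def best_of_n(query: str, responses: list[str]) -> str:
    """Reward-guided best-of-N: keep the running winner of voted comparisons."""
    best = responses[0]
    for candidate in responses[1:]:
        if majority_vote(query, best, candidate) == 1:
            best = candidate
    return best
```

Increasing `n_samples` spends more test-time compute per comparison in exchange for a more reliable verdict, which is the parallel-scaling behavior the results above attribute to majority voting.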
The researchers conclude that RRMs address the inflexibility of existing reward models by performing explicit reasoning before reward assignment. Rule-based reward reinforcement learning enables RRMs to develop reasoning capabilities without requiring explicit reasoning traces as supervision, and RRMs use test-time computation efficiently through both sequential and parallel scaling. This makes RRMs effective in practical applications, including reward-guided best-of-N inference and post-training feedback.
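The rule-based training signal mentioned above can be sketched minimally: the only supervision is whether the RRM's final verdict matches the preference label, leaving the reasoning trace itself unsupervised. The output convention parsed here (a trailing `FINAL: <index>` line) is an illustrative assumption, not the paper's actual format.

```python
def rule_based_reward(rrm_output: str, preferred_label: int) -> float:
    """Score a generated trace by its final verdict only (assumed format)."""
    # Assumed hypothetical convention: the trace ends with "FINAL: 0" or
    # "FINAL: 1" naming the preferred response; the reasoning before it
    # is never directly supervised.
    for line in reversed(rrm_output.strip().splitlines()):
        if line.startswith("FINAL:"):
            verdict = int(line.split(":", 1)[1].strip())
            return 1.0 if verdict == preferred_label else -1.0
    return -1.0  # malformed output is penalized
```

Because only the verdict is rewarded, the policy is free to discover whatever reasoning strategy improves its verdicts, which is how RRMs develop reward-reasoning behavior without trace-level training data.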
Check out the Paper and the Models on Hugging Face. All credit for this research goes to the researchers of this project.



