Reinforcement learning has become a foundational post-training method for large language models (LLMs), using supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it depends on queries with verifiable answers, which limits its application to large-scale training on general-domain queries where verification proves difficult. Moreover, current scalar and generative reward models fail to scale test-time compute for reward estimation: they apply uniform computational resources to every input and lack the flexibility to allocate additional computation to challenging queries that require nuanced analysis.
Reward models can be characterized by scoring scheme, formulation strategy, and reward structure. Scalar approaches assign a numerical score to each query-response pair, while generative methods produce natural-language feedback. Scoring is based either on absolute evaluation of individual responses or on comparison among response candidates. The LLM-as-a-Judge paradigm, aligned with generative reward models, offers interpretable feedback but faces reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling through extended reasoning traces. However, these methods do not adapt compute to input difficulty, which limits their effectiveness across diverse query types.
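The two scoring schemes above can be contrasted in a minimal sketch. Both functions here are hypothetical toy stand-ins (a real system would call a trained model), illustrating only the interface difference: a scalar model returns one number per query-response pair, while a generative judge emits interpretable feedback alongside its verdict.

```python
from dataclasses import dataclass

def scalar_reward(query: str, response: str) -> float:
    """Scalar reward model: maps a query-response pair to a single number."""
    # Toy length-based heuristic standing in for a learned scalar head.
    return min(len(response) / 100.0, 1.0)

@dataclass
class Judgment:
    critique: str   # interpretable natural-language feedback
    preferred: int  # index of the preferred response (0 or 1)

def generative_judge(query: str, a: str, b: str) -> Judgment:
    """Generative (LLM-as-a-Judge) model: emits feedback plus a verdict."""
    # Toy comparison standing in for generated reasoning text.
    winner = 0 if scalar_reward(query, a) >= scalar_reward(query, b) else 1
    return Judgment(critique=f"Response {winner} is more detailed.", preferred=winner)
```

The key design difference is that only the generative judge exposes *why* a response won, which is what makes LLM-as-a-Judge feedback interpretable (and also what makes its biases visible in the critique text).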
Researchers from Microsoft Research and Tsinghua University have introduced Reward Reasoning Models (RRMs), which perform an explicit reasoning process before generating the final reward. This reasoning phase lets RRMs adaptively allocate additional computational resources when evaluating responses to complex tasks, enhancing reward modeling through test-time scaling while retaining general applicability across diverse evaluation scenarios. RRMs spend extra test-time computation on complex queries where appropriate rewards are not immediately apparent, and they are trained to develop their reward-reasoning capabilities on their own, without explicit reasoning traces as training data.
RRMs use the Qwen2 model with a Transformer-decoder backbone, framing reward modeling as text completion: the model autoregressively generates a thinking process and then produces a final judgment. Each input consists of a query and two candidate responses, between which the RRM determines a preference. The researchers use the RewardBench repository to guide systematic analysis along evaluation criteria including instruction fidelity, helpfulness, accuracy, safety, and level of detail. For multi-response evaluation, RRMs support ELO rating systems, knockout tournaments, and majority voting, all of which repeatedly sample the RRM for pairwise comparisons.
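The multi-response strategies above reduce to repeated pairwise calls. The sketch below illustrates the knockout-tournament and ELO aggregation patterns; `rrm_prefers` is a hypothetical length-based stand-in for the actual pairwise RRM call, which in the real system generates a reasoning trace before its verdict.

```python
def rrm_prefers(query: str, a: str, b: str) -> int:
    """Toy stand-in for a pairwise RRM judgment: 0 if `a` wins, else 1."""
    return 0 if len(a) >= len(b) else 1

def knockout(query: str, responses: list[str]) -> str:
    """Knockout tournament: pair up responses, advance winners, repeat."""
    pool = list(responses)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            winner = pool[i] if rrm_prefers(query, pool[i], pool[i + 1]) == 0 else pool[i + 1]
            nxt.append(winner)
        if len(pool) % 2 == 1:
            nxt.append(pool[-1])  # odd response out receives a bye
        pool = nxt
    return pool[0]

def elo_ratings(query: str, responses: list[str], k: float = 32.0) -> list[float]:
    """ELO-style ratings accumulated from all pairwise RRM comparisons."""
    ratings = [1000.0] * len(responses)
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
            score_i = 1.0 if rrm_prefers(query, responses[i], responses[j]) == 0 else 0.0
            ratings[i] += k * (score_i - expected_i)
            ratings[j] += k * ((1.0 - score_i) - (1.0 - expected_i))
    return ratings
```

The knockout tournament needs only N-1 comparisons to pick a single winner, while ELO requires all N(N-1)/2 pairings but yields a full ranking, a trade-off between judge-call budget and ranking granularity.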
RRMs achieve competitive performance on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in the reasoning category. Comparison against DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs use test-time computation effectively on complex queries. In reward-guided best-of-N inference, RRMs outperform all baseline models without additional test-time compute, and majority voting provides substantial further improvements across evaluated subsets. Post-training studies show downstream performance improvements on MMLU-Pro and GPQA, and experiments with 7B, 14B, and 32B models confirm that longer thinking horizons improve accuracy.
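Reward-guided best-of-N and majority voting compose naturally: each pairwise verdict is itself a vote over several sampled reasoning traces. The sketch below shows that composition with a hypothetical stochastic judge (`sample_preference` is a toy noisy model, not the paper's); because sampling is random, selection quality is probabilistic rather than guaranteed.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so runs are reproducible

def sample_preference(query: str, a: str, b: str) -> int:
    """One sampled judgment: toy model favoring length, with 20% noise."""
    base = 0 if len(a) >= len(b) else 1
    return base if rng.random() < 0.8 else 1 - base

def majority_vote(query: str, a: str, b: str, n_samples: int = 9) -> int:
    """Aggregate several sampled judgments into one pairwise verdict."""
    votes = Counter(sample_preference(query, a, b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def best_of_n(query: str, responses: list[str]) -> str:
    """Reward-guided best-of-N: keep the running winner of voted comparisons."""
    best = responses[0]
    for candidate in responses[1:]:
        if majority_vote(query, best, candidate) == 1:
            best = candidate
    return best
```

Increasing `n_samples` spends more test-time compute per comparison in exchange for a more reliable verdict, which is the parallel-scaling behavior the results above attribute to majority voting.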
The researchers conclude that RRMs address the inflexibility of existing reward models by performing explicit reasoning before reward assignment. Rule-based reward reinforcement learning enables RRMs to develop reasoning capabilities without requiring explicit reasoning traces as supervision, and RRMs use test-time computation efficiently through both sequential and parallel scaling. This makes RRMs effective in practical applications, including reward-guided best-of-N inference and post-training feedback.
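The rule-based training signal mentioned above can be sketched minimally: the only supervision is whether the RRM's final verdict matches the preference label, leaving the reasoning trace itself unsupervised. The output convention parsed here (a trailing `FINAL: <index>` line) is an illustrative assumption, not the paper's actual format.

```python
def rule_based_reward(rrm_output: str, preferred_label: int) -> float:
    """Score a generated trace by its final verdict only (assumed format)."""
    # Assumed hypothetical convention: the trace ends with "FINAL: 0" or
    # "FINAL: 1" naming the preferred response; the reasoning before it
    # is never directly supervised.
    for line in reversed(rrm_output.strip().splitlines()):
        if line.startswith("FINAL:"):
            verdict = int(line.split(":", 1)[1].strip())
            return 1.0 if verdict == preferred_label else -1.0
    return -1.0  # malformed output is penalized
```

Because only the verdict is rewarded, the policy is free to discover whatever reasoning strategy improves its verdicts, which is how RRMs develop reward-reasoning behavior without trace-level training data.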
Check out the Paper and the Models on Hugging Face. All credit for this research goes to the researchers of this project.



