Reward models (RMs) play a fundamental role in aligning LLMs with human feedback, yet they are prone to reward hacking: they latch onto superficial attributes such as response length or formatting rather than genuine indicators of quality like factuality and relevance. The root cause is that standard training objectives cannot distinguish spurious correlations present in the training data from the true causal drivers of response quality. This failure produces brittle reward models that induce misaligned policies. A causal understanding of preference data is therefore needed to build RMs that are invariant to spurious signals and sensitive to genuine causal quality attributes.
Limitations of Existing RMs and the Need for Causal Robustness
Existing attempts to mitigate reward hacking in standard RLHF systems rely on Bradley-Terry or pairwise ranking methods. These include architectural modifications such as Odin, policy-level adjustments, and data-centric techniques involving ensembles or consistency checks. Recent causally-inspired methods use MMD regularization against pre-specified spurious factors, or estimate causal effects through corrected rewrites. However, these methods target only predetermined spurious attributes and miss unknown correlates. Their augmentation strategies remain coarse, and evaluation-focused approaches fail to equip the reward model with robust training mechanisms against diverse spurious variations.
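For context, the Bradley-Terry objective that these pipelines build on scores a preferred response above a rejected one via the logistic sigmoid of their reward margin. A minimal sketch (illustrative, not the implementation used in the paper):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard Bradley-Terry pairwise loss: -log P(chosen > rejected),
    where P is the logistic sigmoid of the reward margin."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin for the chosen response lowers the loss, regardless of
# *why* the rewards differ -- which is exactly how a spurious cue such as
# length can get rewarded as if it were quality.
assert bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.5, 0.0)
```

Because the loss only sees the scalar margin, any feature correlated with preference in the data, spurious or causal, drives it down equally.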
Crome: Causally Robust Reward Modeling for LLMs
Researchers from Google DeepMind, McGill University, and MILA – Quebec AI Institute have proposed Crome (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. Crome trains RMs to distinguish genuine quality drivers from superficial cues by augmenting preference data with targeted, LLM-generated counterfactual examples. It creates two types of synthetic training pairs: (a) causal augmentations, which modify specific causal attributes (such as factuality) to enforce sensitivity to true quality shifts, and (b) neutral augmentations, which use tie-labels to enforce invariance to spurious attributes (like style). Crome improves robustness, raising RewardBench accuracy by up to 4.5% while also enhancing safety and reasoning.
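The two pair types can be sketched as data records, with the function names and schema below being hypothetical illustrations (the actual pipeline generates the counterfactual rewrites with an LLM such as Gemini 2.0 Flash):

```python
# Illustrative sketch of Crome-style augmentation pairs (names and schema
# are hypothetical, not the paper's actual data format).

def make_causal_pair(answer: str, degraded_answer: str) -> dict:
    # Degrading a causal attribute (e.g. factuality) yields a clear
    # preference: the original answer should score strictly higher.
    return {"chosen": answer, "rejected": degraded_answer,
            "label": "prefer_chosen"}

def make_neutral_pair(answer: str, restyled_answer: str) -> dict:
    # Rewriting only a spurious attribute (e.g. style) yields a tie:
    # the reward model should be indifferent between the two.
    return {"response_a": answer, "response_b": restyled_answer,
            "label": "tie"}

causal = make_causal_pair("Paris is the capital of France.",
                          "Lyon is the capital of France.")    # fact flipped
neutral = make_neutral_pair("Paris is the capital of France.",
                            "The capital of France is Paris.")  # style only
assert causal["label"] == "prefer_chosen"
assert neutral["label"] == "tie"
```

The causal pairs teach the RM what must change its score; the tie-labeled neutral pairs teach it what must not.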
Technical Approach: Counterfactual Augmentation and Composite Loss Optimization
Crome operates in two main phases: first, it generates attribute-aware counterfactual data based on a causal model; second, it trains the reward model on the combined data with a specialized loss. The paper also provides a theoretical analysis showing, under an idealized model, how causal augmentation isolates true reward drivers from spurious correlates. Crome uses the UltraFeedback dataset with counterfactuals generated by Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The experiments use diverse base LLMs, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B.
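One plausible form of such a composite objective (a simplification for illustration, not the paper's exact formulation) combines the standard Bradley-Terry loss on preference pairs with a tie loss that pushes the predicted win probability toward 0.5 on the neutral, spurious-only pairs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def composite_loss(pref_margins, tie_margins, tie_weight=1.0):
    """Sketch of a composite objective (illustrative simplification):
    - pref_margins: reward margins (chosen - rejected) on preference pairs,
      penalized with the Bradley-Terry loss so margins should be large.
    - tie_margins: reward margins on neutral (spurious-only) pairs,
      penalized with cross-entropy against a 0.5 target so margins
      should shrink toward zero (invariance to the spurious change)."""
    pref_loss = sum(-math.log(sigmoid(m)) for m in pref_margins)
    tie_loss = sum(-0.5 * math.log(sigmoid(m)) - 0.5 * math.log(sigmoid(-m))
                   for m in tie_margins)
    return pref_loss + tie_weight * tie_loss
```

The tie term is minimized at a zero margin, so a model that scores a stylistic rewrite differently from the original pays a penalty, while the preference term still demands a large margin on genuine quality differences.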
Performance Gains on RewardBench and WildGuardTest
Crome achieves significant improvements in ranking accuracy on RewardBench compared to RRM, with gains of up to 13.18% in the Safety category and 7.19% in Reasoning. On reWordBench, Crome delivers aggregate accuracy gains of up to 9.1% with Gemma-2-9B-IT in the PairPM setting and outperforms the baseline on 21 of 23 transformations. Moreover, Crome exhibits a smaller drop in ranking accuracy from RewardBench to reWordBench (19.78% vs. 21.54% for the baseline). On WildGuardTest, Crome shows strong safety improvements with Best-of-N selection, achieving a lower attack success rate on harmful prompts while maintaining a comparable refusal rate on benign prompts.
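Best-of-N selection itself is simple: the reward model scores N candidate responses and the highest-scoring one is returned, so a more robust RM directly translates into a better filter. A minimal sketch with a toy, hypothetical reward model:

```python
def best_of_n(prompt: str, candidates: list, reward_model) -> str:
    """Best-of-N selection: score each candidate with the reward model
    and return the highest-scoring response."""
    return max(candidates, key=lambda resp: reward_model(prompt, resp))

# Toy reward model (purely hypothetical): heavily penalize a marker of
# unsafe content, otherwise prefer longer, more informative responses.
def toy_rm(prompt: str, resp: str) -> float:
    return -100.0 if "UNSAFE" in resp else float(len(resp.split()))

picked = best_of_n("How do I stay safe online?",
                   ["UNSAFE reply", "Use strong unique passwords and 2FA."],
                   toy_rm)
assert "UNSAFE" not in picked
```

In the paper's safety evaluation, the quantity of interest is how often the selected response is still harmful (the attack success rate), which drops when the RM is robust to spurious cues.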
Conclusion: Causal Data Augmentation and Future Directions
The researchers introduced Crome, a framework that addresses reward hacking during RM training through two targeted synthetic data augmentation strategies: causal augmentations and neutral augmentations. Crome outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness to spurious correlations on reWordBench. This dataset-curation-centric training method opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future developments in robust language model alignment.
Take a look at the Paper. All credit for this research goes to the researchers of the project.


