Evaluating large language models (LLMs) is costly, both scientifically and economically. As the field races toward ever-larger models, the methodology for evaluating and comparing them becomes increasingly critical, not just for benchmark scores but for informed development decisions. Recent research from the Allen Institute for Artificial Intelligence (Ai2) introduces a framework built on two key metrics, signal and noise, and their ratio, the signal-to-noise ratio (SNR). The framework has been validated on hundreds of benchmarks and models, and it yields concrete interventions for reducing uncertainty in language model evaluation.
Understanding signal and noise in LLM evaluation
Signal
Signal measures a benchmark's ability to discriminate between better and worse models by quantifying how widely scores vary across models on a given task. High signal means model scores are spread out across the benchmark, making it easy to compare and rank models. On a low-signal benchmark, scores cluster too closely together to determine which model is better.
Noise
Noise refers to the variability of a benchmark score caused by random fluctuations during training, including random initialization, data order, and checkpoint-to-checkpoint changes within a single training run. A high-noise benchmark is less reliable: repeated experiments with the same model and data can produce inconsistent results.
Signal-to-Noise Ratio (SNR)
Ai2's key insight is that a benchmark's utility for model development is governed not by signal or noise individually, but by their ratio: the signal-to-noise ratio. High-SNR benchmarks support decisions made at small scale that reliably transfer to larger scale.
Why SNR matters for development decisions
Two scenarios are common in LLM development where benchmark evaluation guides critical decisions:
- Decision accuracy: Does the ranking of models at small scale hold at larger scale?
- Scaling law prediction error: Does a scaling law fitted to small models accurately predict the performance of larger ones?
In both scenarios, the research shows that high-SNR benchmarks give far more reliable results. SNR correlates strongly with decision accuracy (R² = 0.626) and predicts scaling law prediction error (R² = 0.426). Benchmarks with low signal or high noise make development riskier, since findings at small scale may not hold at larger, production scale.
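Decision accuracy can be made concrete with a small sketch. The function below (an illustrative implementation, not code from the paper; the model names and scores are made up) counts the fraction of model pairs whose small-scale ranking is preserved at large scale:

```python
# Illustrative sketch: decision accuracy is the fraction of model pairs
# whose small-scale benchmark ranking is preserved at large scale.
from itertools import combinations

def decision_accuracy(small_scores, large_scores):
    """small_scores / large_scores: dicts mapping a model variant to its
    benchmark score at small and large scale (names here are hypothetical)."""
    pairs = list(combinations(small_scores, 2))
    agree = sum(
        (small_scores[a] - small_scores[b]) * (large_scores[a] - large_scores[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

small = {"arch_a": 0.41, "arch_b": 0.38, "arch_c": 0.45}
large = {"arch_a": 0.63, "arch_b": 0.58, "arch_c": 0.70}
print(decision_accuracy(small, large))  # 1.0: every pairwise ranking transferred
```

A noisy benchmark pushes this fraction toward 0.5, the accuracy of a coin flip.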
How to measure signal and noise
Practical definitions
- Signal: the spread of scores (maximum minus minimum, normalized by the mean) across a group of models trained with similar compute budgets.
- Noise: the relative standard deviation of a single model's scores over its final training checkpoints.
Combining the two gives: SNR = Relative Dispersion (Signal) / Relative Standard Deviation (Noise)
This is an inexpensive and reliable way to measure evaluation robustness. Importantly, checkpoint-to-checkpoint noise correlates strongly with traditional noise sources such as initialization and data order, making it a practical proxy for a model's overall evaluation noise.
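The two definitions above translate directly into code. This is a minimal sketch assuming you already have final scores for a set of comparable models and the last few checkpoint scores for one of them; the numbers are illustrative:

```python
import statistics

def relative_dispersion(final_scores):
    """Signal: (max - min) spread of final scores across models with
    similar compute budgets, normalized by their mean."""
    return (max(final_scores) - min(final_scores)) / statistics.mean(final_scores)

def relative_std(checkpoint_scores):
    """Noise: relative standard deviation of one model's scores over
    its final training checkpoints."""
    return statistics.stdev(checkpoint_scores) / statistics.mean(checkpoint_scores)

def snr(final_scores, checkpoint_scores):
    return relative_dispersion(final_scores) / relative_std(checkpoint_scores)

# Illustrative numbers: five comparable models, one model's last checkpoints.
models_final = [0.52, 0.47, 0.61, 0.55, 0.44]
last_checkpoints = [0.548, 0.552, 0.545, 0.551, 0.554]
print(snr(models_final, last_checkpoints))
```

Because both quantities are normalized by their means, SNR is a unitless number that can be compared across benchmarks with different score scales.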

What can be done to improve evaluation benchmarks?
Ai2 proposes and tests several practical interventions to boost benchmark SNR—empowering better decisions during LLM development.
1. Filtering subtasks according to SNR
Multitask benchmarks (e.g., MMLU or AutoBencher) are usually averages over many subtasks. The research shows that keeping only a small subset of high-SNR subtasks (rather than all available tasks or larger sample sizes) improves both SNR and decision accuracy. For example, using only 16 of the 57 subtasks yields better predictions and higher SNR than the full set. It also helps eliminate subtasks with frequent labeling errors, because low SNR often indicates poor data quality.
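Subtask filtering can be sketched as follows. This assumes per-subtask score histories are available in the format used earlier (final scores across models, plus one model's checkpoint scores); the subtask names and numbers are hypothetical:

```python
# Sketch: rank subtasks by SNR and keep only the top-k for the benchmark average.
import statistics

def subtask_snr(final_scores_across_models, checkpoint_scores):
    signal = (max(final_scores_across_models) - min(final_scores_across_models)) \
        / statistics.mean(final_scores_across_models)
    noise = statistics.stdev(checkpoint_scores) / statistics.mean(checkpoint_scores)
    return signal / noise

def filter_subtasks(subtasks, k):
    """subtasks: dict name -> (final_scores_across_models, checkpoint_scores)."""
    ranked = sorted(subtasks, key=lambda s: subtask_snr(*subtasks[s]), reverse=True)
    return ranked[:k]

subtasks = {
    "high_snr": ([0.2, 0.5, 0.8], [0.50, 0.51, 0.49]),   # discriminative, stable
    "low_snr":  ([0.48, 0.50, 0.52], [0.40, 0.60, 0.50]), # flat, noisy
}
print(filter_subtasks(subtasks, 1))  # ['high_snr']
```

The averaged score over the kept subtasks then replaces the full-benchmark average.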
2. Averaging Checkpoint Scores
Instead of relying on the score at the last training checkpoint alone, averaging scores over several checkpoints reduces the effect of noise. This consistently improves decision accuracy and reduces scaling law prediction error. For example, averaging improved decision accuracy by 2.4% while reducing prediction error on most benchmarks.
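Checkpoint averaging is a one-liner in practice. This minimal sketch (the score history is made up for illustration) replaces the final-checkpoint score with the mean of the last n checkpoints:

```python
# Sketch: average the last n checkpoint scores to damp step-to-step fluctuations.
def smoothed_score(checkpoint_scores, n=5):
    """checkpoint_scores: scores ordered by training step; average the last n."""
    tail = checkpoint_scores[-n:]
    return sum(tail) / len(tail)

history = [0.41, 0.44, 0.47, 0.52, 0.46, 0.51, 0.48, 0.53]
print(smoothed_score(history))  # mean of the last five checkpoints
```

The smoothed value, not the raw final score, is then used for model comparisons and scaling law fits.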
3. Using Continuous Metrics Like Bits-Per-Byte (BPB)
Classification metrics such as accuracy discard the continuous information in a model's outputs. Measuring bits-per-byte (BPB), a continuous perplexity-based metric, yields a significantly higher SNR, especially on generative tasks such as math and code. Shifting from accuracy to BPB raises SNR on GSM8K, and on MBPP it rises from 2.0 to 41.8. This translates into marked improvements in decision accuracy (e.g., MBPP from 68% to 90%, MinervaMATH from 50% to 90%).
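For reference, bits-per-byte is a byte-normalized perplexity. A minimal sketch of the conversion, assuming you have the summed cross-entropy loss (in nats) over a task's gold answers and their total length in UTF-8 bytes (the numbers below are illustrative, not from the paper):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) over the gold
    answers into bits per byte of answer text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Illustrative numbers only:
print(round(bits_per_byte(total_nll_nats=1385.0, total_bytes=2048), 3))
```

Because every token's log-likelihood contributes, BPB moves smoothly as a model improves, unlike accuracy, which only changes when an answer flips between right and wrong.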
What you need to know
- SNR is a tool for benchmark selection: when choosing LLM benchmarks, prefer those with a high signal-to-noise ratio, so that decisions made in small-scale tests remain predictive at production scale.
- Quality over quantity: more subtasks or larger sample sizes are not necessarily better; SNR-informed subtask and metric selection improves evaluation quality.
- Smooth with checkpoint averaging: average scores over the final and intermediate checkpoints to reduce random noise.
- Continuous metrics improve reliability: for challenging or generative tasks, prefer continuous metrics over classification metrics to increase SNR and stabilize results.
Conclusion
Ai2's signal-and-noise framework changes how model developers evaluate and benchmark LLMs. Using SNR, practitioners can reduce the risk of bad decisions, forecast scaling law behavior, and pick the most informative benchmarks. The research is accompanied by Ai2's open dataset of 900,000 evaluations of open-weight models, providing robust tools for LLM evaluation in practice.
Check out the Paper, Technical Blog, GitHub Page, and Hugging Face Page.
Asif Razzaq, CEO of Marktechpost Media Inc., is an engineer and entrepreneur dedicated to harnessing Artificial Intelligence for social good. His latest venture, Marktechpost, is a media platform known for in-depth yet accessible coverage of machine learning and deep learning news, drawing over 2 million views per month.

