If you have ever stared at thousands of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you are not alone — and Google now has data to prove it.
Google researchers have introduced Auto-Diagnose, an LLM-powered tool that automatically analyzes failure logs to find the root cause of a failed integration test, then posts the diagnosis in the code review where the failure occurred. In a manual assessment of 71 real-world failures spanning 39 different teams, the tool identified the correct root cause 90% of the time. In production it has run on 52,635 distinct failing tests across 224,782 executions, covering 91,130 distinct code changes authored by 22,962 developers, with a "Not helpful" rate of just 5.8% on the feedback received.
Integration tests can be a tax on debugging
Integration tests verify that the components of a distributed software system communicate correctly. Auto-Diagnose targets hermetic functional integration tests: tests where an entire system under test (SUT), typically a graph of communicating servers, is brought up inside an isolated environment by a test driver and exercised against its business logic. In a separate Google survey with 239 respondents, 78% reported performing functional integration testing, which is why the tool focuses on this class of tests.
Diagnosing integration test failures was among the five most common complaints in EngSat, a Google-wide survey of 6,059 developers. A follow-up study of 116 developers found that 38.4% of integration failures take more than one hour to diagnose, and 8.9% take more than 24 hours, versus 2.7% and 0% for unit tests.
The difficulty is structural. The test driver logs typically show only a generic symptom (such as a timeout or a failed assertion), while in the SUT logs the actual error is often buried among unrelated ERROR-level lines and recoverable warnings.

What is Auto-Diagnose?
When an integration test fails, Auto-Diagnose is triggered. The system collects all test driver and SUT component logs at level INFO and above, across data centers, processes, and threads, then sorts and joins them by timestamp into a single stream. The stream and component metadata are inserted into a prompt template.
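The collect-and-merge step amounts to a k-way merge of per-component log streams keyed by timestamp. A minimal sketch, assuming each component's log is already sorted; the component names and record format here are invented for illustration, not Auto-Diagnose internals:

```python
import heapq

# Hypothetical per-component log streams; each record is (timestamp, component, line).
# ISO-8601 timestamps in the same format sort correctly as plain strings.
driver_log = [
    ("2025-05-20T10:00:01Z", "test_driver", "INFO starting SUT"),
    ("2025-05-20T10:02:30Z", "test_driver", "ERROR assertion timed out"),
]
frontend_log = [
    ("2025-05-20T10:00:05Z", "frontend", "INFO listening on :8080"),
    ("2025-05-20T10:01:10Z", "frontend", "WARN retrying RPC to backend"),
]
backend_log = [
    ("2025-05-20T10:01:05Z", "backend", "ERROR connection refused: db"),
]

def merge_logs(*streams):
    """Interleave already-sorted per-component streams into one
    timestamp-ordered stream, as Auto-Diagnose does before prompting."""
    # heapq.merge assumes each input stream is individually sorted.
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))

unified = merge_logs(driver_log, frontend_log, backend_log)
for ts, component, line in unified:
    print(f"{ts} [{component}] {line}")
```

With interleaved output like this, errors from different processes line up in causal order, which is exactly what makes a single prompt over the whole system possible.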
The model is Gemini 2.5 Flash, called with temperature 0.01 and top-p 0.8 for near-deterministic, debuggable outputs. Gemini is not fine-tuned on Google's integration test data; this is pure prompt engineering on a general-purpose model.
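Those sampling settings translate directly into a generation config. A hedged sketch of a request body in the shape of the public Gemini generateContent API; only the model name, temperature, and top-p come from the article, and the prompt contents are placeholders:

```python
import json

# Request body in the shape of the public Gemini generateContent REST API.
# temperature=0.01 and topP=0.8 are the values reported for Auto-Diagnose;
# the prompt text is a placeholder, not the real prompt.
model = "gemini-2.5-flash"
request_body = {
    "contents": [
        {"role": "user", "parts": [{"text": "<component context + merged logs>"}]}
    ],
    "generationConfig": {
        "temperature": 0.01,  # near-deterministic, debuggable outputs
        "topP": 0.8,
    },
}
print(json.dumps(request_body["generationConfig"], indent=2))
```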
Most of the heavy lifting is done by the prompt itself. The model is guided through a step-by-step process: read the component context, scan the logs, locate the failure, and summarize the errors; only then may it attempt a conclusion. Critically, the prompt includes hard negative constraints, for example: "Do not jump to conclusions if you do not see any line from the component which failed."
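A stripped-down sketch of what such a step-by-step prompt with a hard negative constraint might look like. The only wording taken from the article is the quoted constraint and the "more information is needed" escape hatch; every other line is illustrative, not Google's actual prompt:

```python
# Illustrative prompt template; structure and wording are assumptions,
# except the quoted negative constraint reported in the article.
PROMPT_TEMPLATE = """You are diagnosing a failed integration test.

Components in the system under test:
{component_metadata}

Merged, timestamp-ordered logs (test driver + SUT, INFO and above):
{merged_logs}

Follow these steps in order:
1. Read the component context above.
2. Scan the logs and locate where the failure first manifests.
3. Summarize the error messages relevant to that failure.
4. Only after steps 1-3, attempt a conclusion about the root cause.

Hard constraints:
- Do not jump to conclusions if you do not see any line from the component
  which failed.
- If the evidence is insufficient, answer "more information is needed".

Format your answer with ==Conclusion==, ==Investigation Steps==, and
==Most Relevant Log Lines== sections."""

prompt = PROMPT_TEMPLATE.format(
    component_metadata="frontend -> backend -> db (hypothetical)",
    merged_logs="2025-05-20T10:01:05Z [backend] ERROR connection refused: db",
)
print(prompt)
```

Ordering the steps before the conclusion, and spelling out when to refuse, is what keeps a general-purpose model from hallucinating a root cause.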
The model's response is post-processed into a marked-up result with ==Conclusion==, ==Investigation Steps==, and ==Most Relevant Log Lines== sections, which are then posted as comments in Critique, Google's code-review system. Each cited log line is displayed as a link.
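Post-processing a response in that format amounts to splitting on the ==Section== markers. The section names come from the article; the parsing code and the example response are illustrative:

```python
import re

# Example model output in the ==Section== format described above (invented content).
response = """==Conclusion==
The backend failed to connect to the database.

==Investigation Steps==
Scanned merged logs; first error appears in the backend component.

==Most Relevant Log Lines==
2025-05-20T10:01:05Z [backend] ERROR connection refused: db"""

def parse_sections(text):
    """Split a ==Header== delimited response into a {header: body} dict."""
    parts = re.split(r"^==(.+?)==$", text, flags=re.MULTILINE)
    # parts[0] is any preamble before the first header; pair the rest as
    # (header, body) via the shared iterator.
    it = iter(parts[1:])
    return {header.strip(): body.strip() for header, body in zip(it, it)}

sections = parse_sections(response)
for name in ("Conclusion", "Investigation Steps", "Most Relevant Log Lines"):
    print(f"{name}: {sections[name]}")
```

Each parsed section can then be rendered as a separate piece of a review comment, with the log lines turned into links back to the source logs.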
Production numbers
Auto-Diagnose averages 110,617 input tokens and 5,962 output tokens per execution. Findings are posted with a p50 latency of 56 seconds and a p90 of 346 seconds, fast enough that developers see the diagnosis before they have switched contexts.
Critique exposes three feedback buttons on each finding: "Please fix" (used by reviewers to ask authors to act), "Helpful", and "Not helpful" (authors can use either of the latter two). In total, 517 pieces of feedback were received from 437 developers. Of these, 436 (84.3%) were "Please fix" clicks from 370 reviewers, by far the dominant interaction and a sign that reviewers are actively asking authors to act on the diagnoses. The helpfulness ratio H / (H + N) is 62.96%, and the "Not helpful" rate N / (PF + H + N) is 5.8%, well under Google's 10% threshold for keeping a tool live. Across all tools that post findings in Critique, Auto-Diagnose ranks #14 for helpfulness, placing it in the top 3.8%.
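The two reported rates are consistent with a simple breakdown of the 517 feedback events. Back-solving from the percentages gives roughly 51 "Helpful" and 30 "Not helpful" clicks; these two counts are my inference, not figures stated in the paper:

```python
# Stated in the article: 517 total feedback events, 436 of them "Please fix".
please_fix = 436
helpful = 51       # inferred: 51 / (51 + 30) ~= 62.96%
not_helpful = 30   # inferred: 30 / 517 ~= 5.8%

total = please_fix + helpful + not_helpful
helpfulness_ratio = helpful / (helpful + not_helpful)   # H / (H + N)
not_helpful_rate = not_helpful / total                  # N / (PF + H + N)

print(f"total feedback:     {total}")
print(f"helpfulness ratio:  {helpfulness_ratio:.2%}")
print(f"'Not helpful' rate: {not_helpful_rate:.2%}")
```

Note that the denominators differ: helpfulness is computed only over explicit Helpful/Not-helpful votes, while the "Not helpful" rate is computed over all feedback including "Please fix".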
The manual evaluation also revealed a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed; both are real infrastructure bugs, reported back to the relevant teams. In production, around 20 "more information is needed" diagnoses have similarly helped surface infrastructure issues.
What you need to know
- Auto-Diagnose achieved 90.14% accuracy in a manual evaluation of 71 real integration test failures across 39 Google teams, addressing a problem that EngSat's survey of 6,059 developers ranked among the top complaints.
- The system runs on Gemini 2.5 Flash with no fine-tuning, just prompt engineering. On each failure it gathers logs from across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.01 and top-p 0.8.
- The prompt is engineered not to guess. A hard negative constraint forces the model to answer "more information is needed" when evidence is missing, a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google's logging pipeline.
- In production since May 20, 2025, Auto-Diagnose has run on 52,635 failing tests across 224,782 executions and 91,130 code changes by 22,962 developers, posting findings at a p50 of 56 seconds, fast enough that engineers see the diagnosis before switching contexts.
Check out the pre-print paper for full details.

