
Meta Introduces Autodata: A Framework that Turns AI Models Into Autonomous Data Scientists to Create High-Quality Training Data

Tech · By Gavin Wallace · 02/05/2026 · 7 Mins Read

The bottleneck in building better AI models has never been compute alone; it has always been data quality. Meta AI's RAM (Reasoning, Alignment, and Memory) team has now taken on that problem directly. Meta researchers are introducing Autodata, a framework that deploys AI agents in the role of an autonomous data scientist, tasked with iteratively building, evaluating, and refining training and evaluation datasets without relying on costly human annotation at every step.

And the results, tested on complex scientific reasoning problems, show that this approach doesn’t just match classical synthetic data generation methods — it significantly outperforms them.

https://facebookresearch.github.io/RAM/blogs/autodata/

The Creation of Synthetic Data Has Been a Challenge Since the Beginning

Autodata creates training data differently from today's standard pipelines.

Human-written text is the starting point for most modern AI systems. As models grew more capable, researchers began supplementing that human-written material with synthetic data: data generated by the model itself. Synthetic data is appealing because it can cover rare cases and reduces the need for manual labeling.

Synthetic data generation has dominated the industry for years, evolving through several generations of methods. Self-Instruct prompts a large language model (LLM) with zero-shot or few-shot examples to create new training samples. Grounded Self-Instruct extends this by grounding generation in documents and other sources to reduce hallucinations and increase diversity. Chain-of-Thought Self-Instruct (CoT Self-Instruct) pushes further, using chain-of-thought reasoning during creation to produce more complex tasks with greater accuracy. Most recently, "Self-Challenging" methods let a challenger agent interact with tools before proposing a task and an accompanying evaluation function, which is the closest prior work to what Autodata does.

The problem? None of these methods gave researchers a way to control data quality or improve iteratively during the generation process. You could filter, evolve, or refine data after the fact, but the generation pipeline itself remained static and single-pass.

That is the gap Autodata fills.


What Does Autodata Really Do?

Autodata is a framework in which AI agents act as data scientists, iteratively building high-quality training and evaluation data. Rather than generating data all at once, the agent runs a pipeline modeled on how data scientists actually work:

  1. Data Creation: the agent grounds itself on provided source documents (research papers, code, legal text, etc.) and uses tools and acquired skills to create training and evaluation examples.
  2. Data Analysis: the agent then inspects what it created. Is each example correct? Challenging enough? Is the dataset diverse? Does training a particular model on the data actually improve it? The agent synthesizes these lessons first at the example level and eventually at the dataset level.
  3. Iteration: using those lessons, the agent updates its data-generation recipe and loops back to create better data, repeating until a stop criterion is met.
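The three-step loop above can be sketched in a few lines. This is an illustrative skeleton only: `create_examples`, `analyze`, and `refine_recipe` are hypothetical stand-ins for the agent's actual LLM-driven tooling, which Meta has not published in this level of detail, and the stop criterion shown (a quality threshold plus a round budget) is an assumption.

```python
# Hypothetical sketch of Autodata's create -> analyze -> iterate loop.

def run_autodata_loop(sources, create_examples, analyze, refine_recipe,
                      max_rounds=5, target_quality=0.8):
    """Iteratively build a dataset, stopping when quality is good enough."""
    recipe = {"instructions": "initial data-generation recipe"}
    dataset = []
    for _ in range(max_rounds):
        batch = create_examples(sources, recipe)       # 1. Data Creation
        report = analyze(batch)                        # 2. Data Analysis
        dataset.extend(ex for ex in batch if report["keep"][id(ex)])
        if report["quality"] >= target_quality:        # stop criterion met
            break
        recipe = refine_recipe(recipe, report)         # 3. Iteration
    return dataset
```

The key structural point is that the recipe, not just the data, is updated on every pass.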

Because an agent creates the data, additional inference computation can be converted into better model training: the more inference-time compute you give the agent, the better the data it produces. That is a key insight for practitioners managing compute budgets.

The First Implementation: Agentic Self-Instruct

Meta's first instantiation is called Agentic Self-Instruct. Its architecture revolves around an orchestrator LLM that coordinates four sub-agents:

  • Challenger LLM — generates a training example (input + response pair) based on a detailed prompt from the main agent
  • Weak Solver — a smaller, less capable model expected to generally fail on the generated example
  • Strong Solver — a more capable model expected to generally succeed
  • Verifier/Judge — evaluates whether each solver’s output meets quality criteria, using rubrics generated by the Challenger LLM

An important design point: the Weak and Strong Solver can be the same LLM operating in two different modes. For example, the strong version can be allowed more inference-time compute (scaffolding or aggregation) or access to privileged information, giving practitioners flexibility in how they define the capability separation.
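That design point can be made concrete with a small sketch. This is illustrative only: `call_llm` is a hypothetical client function, and the specific budgets (512 vs. 4096 tokens, 5-sample majority vote) are assumptions standing in for the "scaffolding or aggregation" the article mentions.

```python
# One model, two modes: the "strong" solver is the same LLM given more
# inference-time compute via sampling plus majority-vote aggregation.

def solve(question, call_llm, mode="weak"):
    if mode == "weak":
        return call_llm(question, max_tokens=512)     # single cheap pass
    # "strong" mode: spend more compute, then aggregate by majority vote
    candidates = [call_llm(question, max_tokens=4096, temperature=0.7)
                  for _ in range(5)]
    return max(set(candidates), key=candidates.count)
```

The capability separation here comes entirely from the inference budget, not from swapping in a different model.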

The acceptance criteria are precise and multi-condition. For an example to be accepted into the dataset, all four of the following must hold:

  1. Quality verifiers (QVs) must pass the example
  2. The weak solver must mostly fail: weak_avg ≤ 65% and max_weak ≤ 75%, with no negative scores
  3. The strong solver must mostly succeed: strong_avg ≥ 60%, ensuring the question is neither too hard for everyone nor trivially easy
  4. There must be a clear capability gap: strong_avg − weak_avg ≥ 20%
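The four conditions translate directly into code. The thresholds below come from the article; the function name and the convention of expressing solver scores as fractions in [0, 1] are my own assumptions.

```python
# Direct encoding of the four acceptance conditions for a candidate example.

def accept_example(qv_pass, weak_scores, strong_scores):
    weak_avg = sum(weak_scores) / len(weak_scores)
    strong_avg = sum(strong_scores) / len(strong_scores)
    return (
        qv_pass                                # 1. quality verifiers pass
        and weak_avg <= 0.65                   # 2. weak solver mostly fails...
        and max(weak_scores) <= 0.75           #    ...with no run doing too well
        and strong_avg >= 0.60                 # 3. strong solver mostly succeeds
        and strong_avg - weak_avg >= 0.20      # 4. clear capability gap
    )
```

A rejected example does not get discarded silently; as described below, the orchestrator feeds the failure back to the Challenger.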

If any of those thresholds aren’t met, the main agent sends targeted feedback to the Challenger and tries again — from a different reasoning angle. This loop typically runs several rounds per paper (median 3–5) before producing an accepted question or exhausting its step budget.

Numbers that Matter

The quantitative improvements over standard CoT Self-Instruct are measurable and substantial.

Under CoT Self-Instruct, the two solvers score nearly identically: weak at 71.4% and strong at 73.3%, a gap of only 1.9 percentage points. Single-shot generation simply fails to produce tasks hard enough to separate the models. Agentic Self-Instruct drops the weak solver's score from 79.3% to just 43.7% while raising the strong solver's to 77.8%, a gap of roughly 34 points. The loop rewards questions that only the stronger model can answer, not questions both models handle equally well.

The final dataset was generated by processing 10,000 computer-science papers from S2ORC (2022 onward), yielding 2,117 QA pairs that satisfied all the quality constraints and the performance-gap requirement.

When Qwen-3.5-4B was then trained with GRPO for roughly one epoch (batch size 32, learning rate 1e-6) on Agentic Self-Instruct data versus CoT Self-Instruct data — using Kimi-K2.6 as the reward model to score responses against the generated rubrics — the model trained on agentic data demonstrated a clear advantage on both in-distribution and out-of-distribution test sets.

Meta-Optimization – Teaching Agents to be Better Data Scientists

Autodata goes a step further than the inner loop of data creation: it also supports meta-optimization of the data-scientist agent itself, using the same inner-loop quality criteria to optimize the outer-loop agent harness (the agent's code scaffolding, prompts, and evaluation logic).

The meta-optimizer, built on an evolution-based optimization framework, ran 233 iterations in total, of which only 126 were accepted (a mutated harness joins the population only if its validation score exceeds its parent's). It used Kimi-K2.6 as both the analyzer, which reads full evaluation trajectories to diagnose systematic failure patterns, and the implementer, which modifies the agent's harness via a code-editing agent. The setup used 50 papers for training and validation.
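The parent-versus-mutant acceptance rule can be sketched as a minimal evolutionary loop. This is a toy illustration under stated assumptions: `mutate` and `evaluate` are placeholders for the LLM-driven implementer and the validation-set scoring described in the article, and the random parent selection is my own simplification.

```python
# Minimal sketch of evolutionary harness search: a mutant joins the
# population only if its validation score strictly beats its parent's.
import random

def meta_optimize(base_harness, mutate, evaluate, iterations=10, seed=0):
    rng = random.Random(seed)
    population = [(base_harness, evaluate(base_harness))]
    accepted = 0
    for _ in range(iterations):
        parent, parent_score = rng.choice(population)
        child = mutate(parent)                 # implementer edits the harness
        child_score = evaluate(child)          # score on validation papers
        if child_score > parent_score:         # accept strict improvements only
            population.append((child, child_score))
            accepted += 1
    best = max(population, key=lambda p: p[1])
    return best, accepted
```

Note that rejected mutants are simply dropped; only improvements ever enter the population, which matches the 126-of-233 acceptance count reported above.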

Starting from a base harness with a validation pass rate of 12.8%, the meta-optimizer progressively discovered better harnesses. The four most important improvements, all found automatically:

  • Paper-specific enforcement: questions must test knowledge specific to the paper, not general ML/CS knowledge. A self-test was introduced: "If a solver could answer correctly without reading this specific paper, the question is too easy."
  • Context leak prevention: the context may describe only the problem domain and setup; it must not include the solution proposed in the paper.
  • Positive-only rubrics with a weight cap: the optimizer removed all rubric criteria with negative weights, because they historically misfired, destroyed high model scores, and did not improve discrimination. All criterion weights are now positive integers, capped at seven.
  • Strict rubric structure: rubric criteria are formatted as strict JSON with integer weights, eliminating the parsing mistakes that caused evaluation failures in earlier iterations.
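The last two discovered constraints (positive integer weights capped at seven, strict JSON) are easy to enforce mechanically. The validator below is a hypothetical sketch: the article does not publish the rubric schema, so the `criteria`/`name`/`weight` field names are assumptions chosen for illustration.

```python
# Hypothetical validator for the rubric format the optimizer converged on:
# strict JSON, positive integer weights, capped at 7.
import json

def validate_rubric(raw):
    """Parse rubric JSON; reject non-integer, non-positive, or >7 weights."""
    rubric = json.loads(raw)          # strict JSON or ValueError on bad input
    for criterion in rubric["criteria"]:
        w = criterion["weight"]
        if not isinstance(w, int) or not (1 <= w <= 7):
            raise ValueError(f"bad weight {w!r} in {criterion['name']!r}")
    return rubric
```

Failing fast on malformed rubrics is exactly what removed the silent evaluation failures the optimizer diagnosed.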

The takeaway: a data-scientist agent's instructions can be meta-optimized to improve data quality without manual harness engineering.

