Can a small set of curated, tool-grounded demonstrations beat a large volume of generic data for training software agents? Researchers at Shanghai Jiao Tong University and the SII Generative AI Research Lab propose LIMI ("Less Is More for Agency"), a supervised fine-tuning method that turns a base model into a capable software agent using only 78 samples. LIMI scores a 73.5% average on AgencyBench (FTFC 71.7%, RC@3 74.2%, SR@3 74.6%), beating strong baselines such as Qwen3-235B-A22B, GLM-4.5 (45.1%), Kimi K2 (24.1%), and DeepSeek-V3.1 (11.9%), and even outperforms variants trained on 10,000 samples, using 128× less data.
What is new?
- Agency efficiency principle. LIMI's claim is that agentic competence scales with data quality and structure rather than with raw sample count. The research team fine-tunes GLM-4.5 and GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and reports large improvements on AgencyBench and on generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode).
- Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures a complete multi-turn workflow of model reasoning, tool calls, and environment observations, collected in the SII-CLI execution environment. Tasks span "vibe coding" (interactive software development) and research workflows (search, analysis, experiment design).

What is the process?
- Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configurations across all comparisons.
- Data construction: 60 real-world queries from practitioners, plus 18 synthesized from high-star GitHub pull requests (under strict QA by PhD annotators). For each query, LIMI records the complete agent trajectory to successful completion inside SII-CLI.
- Evaluation: AgencyBench (FTFC, SR@3, RC@3), plus the generalization suites TAU2-airline/retail (pass^4), EvalPlus-HE/MBPP, DS-1000, and SciCode.
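AgencyBench's round-based metrics can be made concrete. Below is a minimal sketch, assuming FTFC is the fraction of tasks completed correctly on the first turn and SR@3 is the fraction solved within three rounds; the metric names come from the paper, but this scoring logic is a simplified assumption, not the paper's exact definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    # 1-indexed round in which the agent first met all requirements,
    # or None if it never succeeded within the round budget.
    solved_in_round: Optional[int]

def ftfc(results):
    """Assumed FTFC: fraction of tasks fully completed on the first turn."""
    return sum(r.solved_in_round == 1 for r in results) / len(results)

def sr_at_k(results, k=3):
    """Assumed SR@k: fraction of tasks solved within k rounds."""
    solved = sum(r.solved_in_round is not None and r.solved_in_round <= k
                 for r in results)
    return solved / len(results)

# Toy run: four tasks, solved in rounds 1, 2, never, and 3.
results = [TaskResult(1), TaskResult(2), TaskResult(None), TaskResult(3)]
print(ftfc(results))        # 0.25
print(sr_at_k(results, 3))  # 0.75
```

RC@3 would likewise be derived from per-round completion records; the paper defines the exact scoring.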

What are the results?
- AgencyBench (avg): 73.5% for LIMI vs. 45.1% for GLM-4.5 (+28.4 points); FTFC 71.7% vs. 37.8%; SR@3 74.6% vs. 47.4%.
- Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained on AFM-CodeAgent (10,000 samples): 73.5% vs. 47.8%, a +25.7-point absolute gain (a 53.7% relative improvement) with 128× less data. The gaps against AFM-WebAgent (7,610 samples) and CC-Bench-Traj (2,260 samples) are similar.
- Generalization: Across tool-use, coding, and scientific-computing suites, LIMI averages ~57%, leading GLM-4.5. Even without tool access it stays slightly ahead (50.0% vs. 48.7%), indicating intrinsic gains beyond environment tooling.
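The data-efficiency comparison above checks out arithmetically. A quick verification using only the figures reported in the text (both percentages are AgencyBench averages):

```python
# Reported figures: LIMI trained on 78 samples vs. a baseline trained
# on the 10,000-sample AFM-CodeAgent set.
limi_samples, afm_samples = 78, 10_000
limi_avg, afm_avg = 73.5, 47.8  # AgencyBench averages (%)

data_ratio = afm_samples / limi_samples  # how much less data LIMI used
abs_gain = limi_avg - afm_avg            # absolute gain in points
rel_gain = 100 * abs_gain / afm_avg      # relative improvement (%)

print(round(data_ratio))   # 128
print(round(abs_gain, 1))  # 25.7
print(round(rel_gain, 1))  # ~53.8; the paper reports this relative gain as 53.7%
```

This confirms that the headline "53.7%" figure is a relative improvement over the 10k-sample baseline, while the absolute gap is 25.7 points.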

Key Takeaways
- Data efficiency over data scale. LIMI reaches a 73.5% AgencyBench average with 78 curated trajectories, beating GLM-4.5 (45.1%) and showing a 53.7% relative improvement over the 10k-sample SFT baseline, with 128× fewer samples.
- Trajectory quality, not bulk. The training data are long-horizon, tool-grounded workflows for collaborative software development, scientific research, and related tasks, collected through the SII-CLI execution stack referenced in the paper.
- Gains across metrics. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and a strong RC@3, with detailed tables showing large margins over baselines. Its average across the generalization suites (EvalPlus-HE/MBPP, DS-1000, SciCode) is 57.2%.
- Works across model scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over the base models, indicating the method is robust to model size.
In summary: SII-CLI is a CLI-based environment that combines software engineering and research tasks. The team trains GLM-4.5 on 78 long-horizon, tool-grounded, curated trajectories and reports a 73.5% average on AgencyBench across the FTFC, RC@3, and SR@3 metrics, against a GLM-4.5 baseline of 45.1%. A comparison with a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs. 47.8%, and tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs. 48.7% for GLM-4.5). The trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
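To make the trajectory format concrete, here is a hypothetical sketch of how one multi-turn, tool-grounded trajectory could be represented as SFT data. All role names, fields, and the loss-masking rule below are illustrative assumptions, not the paper's actual schema:

```python
# A toy LIMI-style trajectory: user request, assistant reasoning with a
# tool call, the environment's observation, and the final assistant turn.
trajectory = [
    {"role": "user",
     "content": "Add a --retries flag to the CLI downloader."},
    {"role": "assistant",
     "content": "Plan: find the argument parser, add the flag, run tests.",
     "tool_call": {"name": "shell", "args": {"cmd": "grep -rn argparse src/"}}},
    {"role": "tool",
     "content": "src/cli.py:12: parser = argparse.ArgumentParser()"},
    {"role": "assistant",
     "content": "Flag added in src/cli.py; tests pass."},
]

def loss_turns(traj):
    """Assumed masking: only assistant turns (reasoning + tool calls)
    receive supervision; user turns and tool observations are context."""
    return [t for t in traj if t["role"] == "assistant"]

print(len(loss_turns(trajectory)))  # 2 supervised turns out of 4
```

A real LIMI trajectory is far longer (~42k tokens on average) and interleaves many such tool-call/observation cycles before the final verified completion.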
Check out the Paper, GitHub Page, and Model Card on HF.
Asif Razzaq, CEO of Marktechpost Media Inc. is a visionary engineer and entrepreneur who is dedicated to harnessing Artificial Intelligence’s potential for the social good. Marktechpost is his latest venture, a media platform that focuses on Artificial Intelligence. It is known for providing in-depth news coverage about machine learning, deep learning, and other topics. The content is technically accurate and easy to understand by an audience of all backgrounds. Over 2 million views per month are a testament to the platform’s popularity.

