AI-trends.today

AutoAgent is an open-source AI library which allows you to optimize and engineer your own Agent.

Tech · By Gavin Wallace · 05/04/2026 · 6 Mins Read

Every AI engineer is familiar with a certain kind of monotony: the prompt-tuning loop. You write a prompt for your system, run it against a benchmark, read through the results, and tweak the prompt. Repeat this a couple dozen times and it might make a big difference — but it is gruntwork dressed up as Python files. A new open-source Python library named AutoAgent, built by Kevin Gu of thirdlayer.inc, proposes an unsettling alternative: don’t do that work yourself. Let an AI do it.

AutoAgent is an open-source library that lets agents be improved autonomously on any domain. In a 24-hour run, it scored 96.5% on SpreadsheetBench and 55.1% on TerminalBench.

https://x.com/kevingu/status/2039843234760073341

What is AutoAgent really?

AutoAgent is described as being ‘like autoresearch, but for agent engineering.’ The idea is simple: give an AI agent an assignment, and let it iterate autonomously on an agent harness overnight. The AI agent modifies system prompts, sub-agents, tool configurations, orchestration, and the tools themselves. It runs benchmarks, compares scores, keeps or discards the changes, then repeats.

To understand the analogy: Andrej Karpathy’s autoresearch does the same thing for ML training — it loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. AutoAgent carries the same ratchet from ML into agent engineering. It optimizes not a model’s hyperparameters or weights, but the agent’s harness — the system prompt, tool definitions, routing logic, and orchestration strategy that determine how an agent behaves on a task.

The harness is the scaffolding that surrounds an LLM: it determines what system prompt the model receives, which tools it may call, how requests are routed between agents, and the format in which tasks come in. This scaffolding is usually created by hand. AutoAgent automates iterating on it.
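To make the idea concrete, here is a minimal, entirely hypothetical sketch of what lives inside a harness — a system prompt, a tool schema, and routing logic. None of these names come from AutoAgent’s actual API:

```python
# A toy harness: the layer around the model that AutoAgent's meta-agent
# would rewrite. All names here are illustrative, not AutoAgent's API.

SYSTEM_PROMPT = "You are a spreadsheet assistant. Think step by step."

# Tool schema: what the model is allowed to call.
TOOLS = [
    {"name": "read_cell",  "description": "Read a cell value",  "params": ["ref"]},
    {"name": "write_cell", "description": "Write a cell value", "params": ["ref", "value"]},
]

def route(task: str) -> str:
    """Toy orchestration: pick a sub-agent based on the task text."""
    return "formula_agent" if "formula" in task.lower() else "data_agent"
```

Everything above — prompt, tool definitions, routing — is the harness; the model’s weights are untouched.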

Two agents, one file, one directive

The GitHub repo is designed to be simple. agent.py is the entire harness under test in a single file — it contains the config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter boundary (the adapter section is clearly marked as fixed). program.md contains the instructions that create the meta-agent, plus the direction (what type of agent should be built). It is also the only file humans edit.

Consider it a division of concerns between the human and the machine. The human sets the direction in program.md. The meta-agent — a separate AI, typically of a higher grade — reads this directive, runs benchmarks, diagnoses failures, and rewrites the relevant portions of agent.py. The human never touches agent.py directly.

State is maintained across iterations in results.tsv — an experiment log automatically created and maintained by the meta-agent, which uses it to calibrate and decide what experiments to run next. The rest of the project structure includes Dockerfile.base; the optional .agent/ directory of reusable agent workspace artifacts such as prompts and skills; tasks/, the folder for benchmark payloads (added on each benchmark branch); and jobs/, the directory for Harbor job outputs.
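The layout described above, sketched as a directory tree (file and folder names are from the article; the exact nesting is an assumption):

```
.
├── agent.py          # the harness under test — only the meta-agent edits this
├── program.md        # human-written directive for the meta-agent
├── results.tsv       # experiment log maintained by the meta-agent
├── Dockerfile.base
├── .agent/           # optional reusable workspace artifacts (prompts, skills)
├── tasks/            # benchmark payloads, added per benchmark branch
└── jobs/             # Harbor job outputs
```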

The metric is the benchmark suite’s total score, and the meta-agent hill-climbs on it. Every experiment produces a numeric score: keep the change if the score improves, discard it if not — the same loop as autoresearch.
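The keep-or-discard loop can be sketched in a few lines. Here run_benchmark() and propose_change() are stand-ins for the real Docker benchmark runs and the meta-agent’s edits to agent.py — purely illustrative, not AutoAgent’s code:

```python
# A toy version of the meta-agent's hill-climbing loop.
import random

def run_benchmark(harness: dict) -> float:
    """Stand-in for scoring the harness against the benchmark suite."""
    return harness["quality"]

def propose_change(harness: dict) -> dict:
    """Stand-in for the meta-agent rewriting part of the harness."""
    candidate = dict(harness)
    candidate["quality"] = min(1.0, candidate["quality"] + random.uniform(-0.05, 0.1))
    return candidate

def hill_climb(harness: dict, iterations: int = 20) -> tuple[dict, float]:
    best_score = run_benchmark(harness)
    for _ in range(iterations):
        candidate = propose_change(harness)
        score = run_benchmark(candidate)
        if score > best_score:      # the ratchet: keep only improvements
            harness, best_score = candidate, score
    return harness, best_score
```

Because only improvements survive, the score can never regress below its starting point — the same one-way ratchet autoresearch applies to validation loss.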

Task Format and Harbor Integration

Tasks are expressed in Harbor format. Every task lives in its own subdirectory such as tasks/my-task/, which includes:

  • task.toml for config such as timeouts and metadata
  • instruction.md, the prompt sent to the agent
  • a tests/ directory with test.sh, the entry point that writes a score to /logs/reward.txt, and test.py for verification — either deterministic checks or LLM-as-judge
  • environment/Dockerfile, which defines the task container
  • a files/ directory where referenced files are stored

The verifier logs a score between 0.0 and 1.0, and the meta-agent hill-climbs on it.

The LLM-as-judge pattern here is worth flagging: instead of only checking answers deterministically (like unit tests), the test suite can use another LLM to evaluate whether the agent’s output is ‘correct enough.’ This is common in benchmarks for agentic systems, where correctness cannot be reduced to simple string matching.
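The shape of such a check is roughly the following. Here judge_llm() is a stub standing in for a real model call — the prompt wording and function names are assumptions, not the library’s API:

```python
# Sketch of an LLM-as-judge verifier with the model call stubbed out.

def judge_llm(prompt: str) -> str:
    """Stand-in: a real verifier would send this prompt to an LLM API."""
    return "PASS" if "Paris" in prompt else "FAIL"

def llm_judge_score(task: str, agent_output: str) -> float:
    """Ask a judge model whether the output is 'correct enough'."""
    prompt = (
        f"Task: {task}\n"
        f"Agent output: {agent_output}\n"
        "Reply PASS if the output correctly solves the task, else FAIL."
    )
    verdict = judge_llm(prompt)
    return 1.0 if verdict.strip() == "PASS" else 0.0
```

The judge’s PASS/FAIL verdict collapses a fuzzy correctness question into the same 0.0–1.0 reward signal the deterministic checks produce, so the meta-agent can hill-climb on either.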

The Key Takeaways

  • Automated harness engineering works — AutoAgent shows that a meta-agent can replace the human prompt-tuning loop entirely, iterating on agent.py overnight without a human touching the harness files directly.
  • Benchmark results confirm the method — In a 24-hour run, AutoAgent hit #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating every other entry that was hand-engineered by humans.
  • ‘Model empathy’ may be a real phenomenon — A Claude meta-agent optimizing a Claude task agent appeared to diagnose failures more accurately than when optimizing a GPT-based agent, suggesting same-family model pairing could matter when designing your AutoAgent loop.
  • The human job shifts from engineer to director — You don’t write or edit agent.py; you write program.md, a plain Markdown directive that steers the meta-agent. This reflects the broader shift in agentic programming from writing code to setting goals.
  • Plug-and-play compatibility with benchmarks — Because tasks follow Harbor’s open format and agents run in Docker containers, AutoAgent is domain-agnostic. Any scorable task — spreadsheets, terminal commands, or your own custom domain — can become a target for autonomous self-optimization.

Check out the Repo and Tweet.
