Open Source Terminal Agent Training Environments (SETA) with 400 tasks and the CAMEL toolkit

What would an end-to-end stack for terminal agents look like if you combined structured toolkits, synthetic RL environments, and benchmark-aligned evaluation? The research team from CAMEL AI, together with other collaborators, has released SETA, a stack of toolkits and environments focused on reinforcement learning for terminal agents. The project targets agents that operate in Unix-style shells and must complete verifiable tasks under the Terminal Bench benchmark.

Three major contributions

A state-of-the-art terminal agent on Terminal Bench: the team reports state-of-the-art results with a Claude Sonnet 4.5 based agent on Terminal Bench 2.0 and with a GPT 4.1 based agent on Terminal Bench 1.0. The comparison is restricted to agents that use the same base model.
A synthetic terminal task dataset: the research team releases an initial dataset of 400 terminal tasks of varying difficulty. Of these, 260 are used for RLVR (reinforcement learning with verifiable rewards) fine-tuning of a Qwen3-8B model.
An agent design that generalizes across training and evaluation frameworks: the same agent implementation is used both for local task runs and for the official Terminal Bench evaluation harness.
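
To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward: run a task's test command and map its exit status to a binary reward. The function name verifiable_reward, the 0/1 scheme, and the timeout are assumptions made for illustration; the article does not describe how SETA actually computes rewards.

```python
import subprocess
import sys
from typing import Sequence


def verifiable_reward(test_cmd: Sequence[str], timeout_s: float = 60.0) -> float:
    """Return 1.0 if the test command exits with status 0, else 0.0.

    Hypothetical helper: a binary, exit-code-based reward is an assumption,
    not SETA's documented reward function.
    """
    try:
        completed = subprocess.run(
            list(test_cmd),
            capture_output=True,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test run is treated as a failure.
        return 0.0
    return 1.0 if completed.returncode == 0 else 0.0


if __name__ == "__main__":
    # Stand-in "tests" that simply exit with a chosen status code.
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(verifiable_reward(passing), verifiable_reward(failing))
```

In a real setup the command would presumably be the task's own test script executed inside its container, but that wiring is outside what the article specifies.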

The Terminal Toolkit and Log Structure

The SETA codebase showcases a Terminal Toolkit that turns a language model into an agent able to execute terminal commands. For each task run, the framework generates structured log files under evaluation/terminal_bench_run. The README shows a concrete layout for a task named play-zork.

Key files include:

chatagent.log, which records the full history of agent messages and tool calls, including test results.
A sessions directory containing session_logs, which capture the terminal interactions performed through the Terminal Toolkit.
Within session_logs, files such as blocking_commands.log, session_run_zork_1_correct_path.log, session_zork-1.log, and session_zork_start.log, which store command output for different sessions and modes.
tests.log and tests.log.strip, which record the test run output; the second has terminal control characters removed.

This structure gives a practical way to debug an agent. You can trace the high-level decisions recorded in chatagent.log down to individual shell commands in the session logs, then use the test logs to confirm whether a run succeeded or failed.
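
The relationship between tests.log and tests.log.strip can be illustrated with a short sketch that removes terminal control sequences from raw output. The regular expression and the function name strip_terminal_controls are assumptions; the article does not say how SETA produces the stripped file.

```python
import re

# Matches CSI sequences (colors, cursor movement), OSC sequences
# (such as window titles), and other two-character ESC sequences.
_ANSI_RE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"               # CSI ... final byte
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"    # OSC ... ended by BEL or ST
    r"|\x1b[@-Z\\-_]"                        # remaining two-character escapes
)


def strip_terminal_controls(text: str) -> str:
    """Remove ANSI escape sequences and carriage returns from text."""
    return _ANSI_RE.sub("", text).replace("\r", "")


if __name__ == "__main__":
    raw = "\x1b[32mPASSED\x1b[0m test_zork\r\n"
    print(repr(strip_terminal_controls(raw)))
```

A cleaned copy of this kind is easier to search and compare than the raw capture, which is presumably why both variants are kept.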

For official Terminal Bench evaluation, the GitHub repository provides a dedicated entry point under evaluation/terminal_bench_eval. A developer enters that directory and runs run_eval.sh for Terminal Bench 1.0 or run_tb2.sh for Terminal Bench 2.0.

Results are written to evaluation/terminal_bench_eval/run/{run_id}/results.json. Task-specific logs are kept under evaluation/terminal_bench_eval/logs/camel_logs/{task_id}. The agent class that binds the CAMEL agent to the benchmark is implemented in tbench_camel_agent.py.
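
The path conventions above can be captured in a few small helpers, sketched here under stated assumptions. The directory layout comes from the article; the function names and the repo_root parameter are hypothetical, and because the schema of results.json is not described, the loader returns the parsed JSON without interpreting it.

```python
import json
from pathlib import Path
from typing import Any

# Relative location of the evaluation entry point, as given in the article.
EVAL_ROOT = Path("evaluation") / "terminal_bench_eval"


def results_path(repo_root: Path, run_id: str) -> Path:
    """Path of results.json for one evaluation run."""
    return repo_root / EVAL_ROOT / "run" / run_id / "results.json"


def task_log_dir(repo_root: Path, task_id: str) -> Path:
    """Directory holding the CAMEL logs for one task."""
    return repo_root / EVAL_ROOT / "logs" / "camel_logs" / task_id


def load_results(repo_root: Path, run_id: str) -> Any:
    """Parse results.json as-is; no assumptions are made about its fields."""
    with results_path(repo_root, run_id).open(encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping the layout in one place means a script that compares runs only needs a run_id or task_id to locate its inputs.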

Note Taking Toolkit as persistent memory

The research team also introduces a Note Taking Toolkit, described as persistent memory for long-horizon tasks. Example tool calls show how an agent can write and read notes while working through terminal tasks. The public material focuses on the toolkit's existence and examples of its use; it does not yet describe a complete training objective for note usage.

The important point is that the agent has an explicit channel through which it can externalize intermediate results and hints.
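
As a rough illustration of what such a channel could look like, the sketch below implements a tiny file-backed note store. The NoteStore class, its method names, and the one-file-per-note layout are all hypothetical; this is not the CAMEL Note Taking Toolkit API, which the article does not detail.

```python
from pathlib import Path
from typing import List


class NoteStore:
    """Hypothetical file-backed notes; NOT the CAMEL Note Taking Toolkit API."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Restrict names so a note cannot escape the notes directory.
        if not name or not all(c.isalnum() or c in "-_" for c in name):
            raise ValueError(f"invalid note name: {name!r}")
        return self.root / f"{name}.md"

    def append(self, name: str, text: str) -> None:
        """Add one entry to a note, creating the note if needed."""
        with self._path(name).open("a", encoding="utf-8") as fh:
            fh.write(text.rstrip("\n") + "\n")

    def read(self, name: str) -> str:
        """Return the note's content, or an empty string if it does not exist."""
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def list_notes(self) -> List[str]:
        """Names of all stored notes, sorted alphabetically."""
        return sorted(p.stem for p in self.root.glob("*.md"))
```

Because the notes live on disk rather than in the model's context, they survive context truncation across a long task, which is the property the persistent-memory description points to.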

Understanding Performance

SETA's agent harness achieves leading results on Terminal Bench. With Claude Sonnet 4.5 as its backbone, the CAMEL terminal agent reaches 46.5% accuracy on Terminal Bench 2.0, ranking first and outperforming the second-place system by 3 percentage points. It is particularly strong on git, DevOps, and code security tasks. On Terminal Bench 1.0, the GPT 4.1 based agent reaches 35% accuracy, 4.7 percentage points above the next entry. For comparison, a supervised Qwen3 baseline reaches 3.4% on Terminal Bench 2.0, and the Qwen3 terminal agent trained with the SETA-RL pipeline improves on this baseline in the curated synthetic environments.

The Key Takeaways

SETA is an open-source project that provides both agent toolkits and synthetic RL environments for terminal agents, in a format aligned with Terminal Bench.
The framework reports state-of-the-art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 with Claude Sonnet 4.5 and GPT 4.1 as base models, compared against agents built on the same base models.
The SETA-RL dataset on Hugging Face contains 400 tasks, each packaged as task.yaml, Dockerfile, and run-tests.sh; 260 of them are used for RLVR fine-tuning of a Qwen3-8B based agent.
The open source SETA codebase includes a Terminal Toolkit with structured logging and a Note Taking Toolkit, and it integrates directly with the Terminal Bench evaluation scripts and log paths in the SETA GitHub repository.
The overall design demonstrates a clean path from synthetic RL environments to benchmark-verified agents. Developers can use this stack to train, debug, and evaluate terminal agents without relying on ad hoc tooling.
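
The per-task packaging named in the takeaways (task.yaml, Dockerfile, run-tests.sh) lends itself to a quick sanity check before training. The sketch below only verifies that the three files exist in a task directory; the function name and the idea of one directory per task are assumptions, and nothing is assumed about the contents of task.yaml.

```python
from pathlib import Path
from typing import List

# File names taken from the article; nothing beyond them is assumed.
REQUIRED_TASK_FILES = ("task.yaml", "Dockerfile", "run-tests.sh")


def missing_task_files(task_dir: Path) -> List[str]:
    """Return the required files that are absent from task_dir."""
    return [name for name in REQUIRED_TASK_FILES if not (task_dir / name).is_file()]
```

An empty return value means the directory has all three pieces; any names returned identify what to fix before the task is used.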


Michal Sutter is a data science professional with a Master of Science degree from the University of Padova. With a strong foundation in statistics, machine learning, and data engineering, he excels at transforming large datasets into actionable insights.
