Microsoft Research Releases Webwright - A Terminal Native Web Agent Framework that Scores 60.1% On Odysseys - Up From Base GPT 5.4's 35%

Today, most web agents control a browser by performing one action. The model receives the current page state — as a screenshot or DOM text — and predicts the next click, keypress, or scroll. The action-at a time design was logical when the reasoning abilities of language models were limited. The rigid loop that was once a useful structure has now become an obstacle as models are more adept at debugging and writing code.

Microsoft Research AI Frontiers has developed a new approach. They have developed a new framework that is open source. WebwrightThis gives the agent access to a terminal, instead of an active browser session. The agent uses Playwright to run commands on browsers. It also inspects logs and refines scripts iteratively. Microsoft Playwright, an open-source library for browser automation, allows programmatic control over Chromium Firefox and WebKit.

What makes Webwright different?

Webwright Separates the agent and the browser, treating the browser like something that the agent could launch, inspect and then discard, while developing a programme. It is the code in the workspace that persists, not the session of the browser.

The same developer will use this model when creating an RPA script (Robotic process automation). They write scripts instead of clicking on a website manually each time. The script is able to be run again, altered, or even shared. Webwright uses this for LLM powered agents.

It has a system that allows you to control your home. Three main components There are three types of terminal environments: a model endpoint, a runner, and an environment. The model interface is approximately 550 lines long, while the environment is only 300. There is no multi-agent orchestration or complex planning hierarchy — just a single agent loop.

The workspace stores all intermediate code, screenshots and logs. This makes it easy to examine the results of each run.

https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/

Agent Loop

The model is sent the context currently in use by the runner. Model returns thinking blocks and shell commands. This command is run in the Environment which will return terminal output, logs or screenshots. This loop then continues as the observations are placed back into their context.

Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions — such as selecting a date or filling out an entire form — as a compact program. The agent can generalize similar tasks by using loops, abstractions and functions.

Two Engineering Challenges

Premature ‘done’ and context explosion are the two core issues. The model often reports completion without completing the bash. A gate was introduced: The agent has to create a self reflection config and run the final script on a clean folder with logs, screenshots, before it can emit. "Done: True". The flag will be dropped if it is not retrieved.

Long coding paths quickly surpass context limitations, so the history is compressed every 20 steps.

Benchmark Results

The benchmarks used to evaluate Webwright were Online-Mind2Web (Online) and Odysseys.

The Online-Mind2Web benchmark contains 300 tasks spread across 136 popular sites, and it uses a framework for automated LLM evaluation. The GPT-5.4 harness recipe achieves an overall accuracy of 86.67%, the highest out of all the open-sourced harnesses in the AutoEval section, using a budget for 100 steps. Claude Opus 4.7 reached 84.7% overall but performed better on hard tasks at N=100 steps — 80.5% versus 76.6% for GPT-5.4.

In a traditional screenshot-based setting where they predict x and y coordinates, the researchers reproduced GPT-5.4 in an agent-based model that predicted clicks, typing, or other actions. Webwright, using the same model as the underlying one, achieves significant gains in all three categories of difficulty, showing the benefits of a terminal-driven code over a step-by-step approach to coordinate prediction.

Odysseys assesses tasks that span multiple sites and have a long-term horizon. The average task has 272.3 instructions. Opus 4.6.1 scored the highest in April 2026’s leaderboard with 44.5. The Webwright powered with GPT-5.4 achieves 60.1%. This is a relative improvement of 35.1% over the prior state of the technology. Compared to the base GPT-5.4 performance of 33.5%, this corresponds to a 79.4% relative improvement — or 26.6 absolute points.

Cost Analysis

Claude Opus 4.7 has fewer steps (mean 21.9) to complete each task than GPT-5.4 (26.3). Claude Opus 4.7, however, is significantly more expensive than GPT 5.4 ($5 vs. $3.50 for 1M input tokens and $25 vs. $15 per 1M out put tokens in April 2026), resulting in a higher average cost per task compared with GPT 5.4 ($2.37 vs. $5.09). The first 50 steps deliver 82% accuracy, and the next 50 steps deliver 3–4 additional points.

Small Model Performance

Researchers also evaluated Qwen3.5-9B using the Online-Mind2Web hard split. When pre-built, reusable toolscripts are added to tasks, Qwen3.5-9B is able to achieve 66.2% for Online-Mind2Web sites with more than 5 tools. It shows how smaller models, which are lower cost, can perform complex web tasks if they’re paired with pre-built tools.

Marktechpost’s Visual Explainer

Webwright
Quick Start Guide

01 / 05 — Overview
What Is Webwright
The Webwright framework is a terminal-native, open-source web agent framework. Microsoft Research. The agent does not predict one click on the browser at a given time. Instead, it writes Playwright Code runs bash commands and saves reusable scripts to a workspace.

The 1,000 line of harness code across 3 modules — no hidden orchestration
Single agent loopThe runner, the model endpoint and the terminal environment
86.7% on Online-Mind2Web | 60.1% Odysseys using GPT-5.4
Backends: OpenAI, Anthropic, OpenRouter
Scripts reusable in Claude Code, Codex, OpenClaw

# GitHub repository
github.com/microsoft/Webwright

02 / 05 — Prerequisites
How to Prepare for Installation
Before running any installation commands, confirm that the following items are available.

Python 3.10+ — required minimum runtime
Chromium — installed via Playwright in the next step
The API Key — OpenAI, Anthropic, or OpenRouter
Git — to clone the repository

Python versions can be checked by checking the version number.
Python --version
Python 3 or above must be returned.

03 / 05 — Installation
Clone and Install webwright
Download the Playwright Playback browser control, edit it, and then clone your repo.

# 1. Clone repository
Gi clone https://github.com/microsoft/Webwright
Cd Webwright

# 2. # 2.
Pip Install e.

# 3. # 3.
The playwright Chrome

You can also find out more about the following: -e The flag indicates that the local source editing is applied without having to install.

04 / 05 — Running a Task
Run Your First Web Task
The CLI will accept your API Key, then a Task Instruction and Start URL.

# Export key
You can export your product by clicking here. OPENAI_API_KEY="sk-..."
You can export your product by clicking here. ANTHROPIC_API_KEY="sk-ant-..."

# Run a task
Python -m webwright.run.cli 
  -c base.yaml -c model_openai.yaml 
  -t "Find cheapest economy flight SEA to JFK on 2026-05-15" 
  --start-url https://www.google.com/flights 
  --task-id demo_openai 
  -o outputs/default

Flag	Description
-c	Config file from src/webwright/config/ — stackable
-t	Instructions in Plain English
–start-url	The initial URL of the browser session
–task-id	Subfolder output name
-o	The root output directory of logs and scripts

05 / 05 — Claude Code Integration
Use Webwright to learn Claude Code
The Claude Code capability is built into Webwright. A separate LLM API Key is not required beyond the Claude Code subscription. Claude Code is able to read PNG screenshots.

The scope of the project (only within this repo)
Mkdir -p .claude/skills .claude/commands
The ln -s "$PWD/skills/webwright" .claude/skills/webwright
The ln -s "$PWD/skills/webwright/commands" .claude/commands/webwright

# User-scoped (all projects)
Mkdir -p ~/.claude/skills ~/.claude/commands
The ln -s "$PWD/skills/webwright" ~/.claude/skills/webwright
The ln -s "$PWD/skills/webwright/commands" ~/.claude/commands/webwright

After installing Claude Code, restart it and then enter slash commands.

# One-shot task
/webwright - run search Google Flights SEA JFK 2026-05-25"

# Reusable parameterized CLI tool
A ticket for a craft search from LAX departing on June 7

The Key Take-Aways

In Webwright, the terminal loop is used where Playwright code is written and run by the agent instead of having to anticipate one browser action.
GPT-5.4 reached 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odysseys — 26.6 points above the base GPT-5.4 score of 33.5%.
This harness consists of 1,000 line segments across three modules without multi-agent orchestration.
When augmented by pre-built scripts, Qwen3.5-9B achieved 66.2% of the split in Online-Mind2Web.
All three Claude Code products, Codex and OpenClaw, can share the task scripts.

Microsoft Research Releases Webwright – A Terminal Native Web Agent Framework that Scores 60.1% On Odysseys – Up From Base GPT 5.4’s 35%

NVIDIA AI Releases DeltaNet-2 Gated: A Linear attention layer that decouples the Erase and Write of Delta Rule.

Create a SuperClaude Framework with Modes, Commands and Session memory

TencentDB Agent Memory by Tencent: A Four-Tier Pipeline of Local Memory for AI Agents

The Bumblebee Open Source Supply Chain Scanner is a read-only tool for developer endpoints.

The AI Avatar tool in Gemini allowed me to clone myself. It was unnervingly me

Meta’s New AI Asked for My Raw Health Data—and Gave Me Terrible Advice

The AI tool will tell you to stop slacking off

OpenAI partners with Oracle, SoftBank and Stargate to build 5 new Stargate data centers

Big Tech Dreams to Put Data Centers on the Moon

Top Insights

This Startup Is Trying to Create a DeepSeek Moment in the US

NVIDIA AI Releases DeltaNet-2 Gated: A Linear attention layer that decouples the Erase and Write of Delta Rule.

Latest News

NVIDIA AI Releases DeltaNet-2 Gated: A Linear attention layer that decouples the Erase and Write of Delta Rule.

This Robot is Making Meals in San Francisco’s Tenderloin for a Nonprofit

Microsoft Research Releases Webwright – A Terminal Native Web Agent Framework that Scores 60.1% On Odysseys – Up From Base GPT 5.4’s 35%

What makes Webwright different?

Agent Loop

Two Engineering Challenges

Benchmark Results

Cost Analysis

Small Model Performance

Marktechpost’s Visual Explainer

The Key Take-Aways

Related Posts