OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has released GPT-5.5, its most capable model to date and the first fully retrained base model since GPT-4.5. GPT-5.5 is designed to complete complex, multi-step computer tasks with minimal human direction. Think of it as the difference between an assistant who needs a checklist and one who understands the underlying goal and figures out the steps themselves. The release is rolling out today to Plus, Pro, Business, and Enterprise subscribers across ChatGPT and Codex.

What ‘Agentic’ Actually Means Here

An agentic model doesn’t just respond to a single prompt: it takes a sequence of actions, uses tools (like browsing the web, writing code, running scripts, or operating software), checks its own work, and keeps going until the task is done. Prior models often stalled at handoff points, requiring the user to re-prompt or correct course. GPT-5.5 is built to reduce those interruptions.

OpenAI introduced GPT-5.5 as a model targeted at agentic computer use: it writes and debugs code, browses the web, fills out spreadsheets, and keeps working through multi-step tasks without requiring a human to supervise every move.
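The act, use-a-tool, check, repeat cycle described above can be illustrated with a toy loop. This is only a sketch of the general agentic pattern, not OpenAI's implementation: `run_agent`, `plan_next_step`, `check_done`, and the `fix_file` tool are all hypothetical names invented for the example, and the "planner" here is a hard-coded function rather than a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentRun:
    """Record of one agent run (hypothetical structure for illustration)."""
    goal: str
    history: List[str] = field(default_factory=list)
    done: bool = False


def run_agent(
    goal: str,
    plan_next_step: Callable[[AgentRun], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    check_done: Callable[[AgentRun], bool],
    max_steps: int = 10,
) -> AgentRun:
    """Repeat plan -> tool call -> self-check until done or out of steps."""
    run = AgentRun(goal=goal)
    for _ in range(max_steps):
        tool_name, arg = plan_next_step(run)       # decide the next action
        result = tools[tool_name](arg)             # use a tool
        run.history.append(f"{tool_name}({arg!r}) -> {result}")
        if check_done(run):                        # check its own work
            run.done = True
            break
    return run


# Toy setup: the "task" is to fix three files, one tool call per file.
pending = ["a.py", "b.py", "c.py"]


def fix_file(name: str) -> str:
    return f"fixed {name}"


def plan_next_step(run: AgentRun) -> Tuple[str, str]:
    # A real agent would ask a model here; this stub walks the list in order.
    return "fix_file", pending[len(run.history)]


def check_done(run: AgentRun) -> bool:
    return len(run.history) == len(pending)


run = run_agent("fix all files", plan_next_step, {"fix_file": fix_file}, check_done)
for line in run.history:
    print(line)
print("done:", run.done)
```

The `max_steps` cap stands in for the point where an agent gives up and hands control back to the user, which is the kind of interruption the article says GPT-5.5 is meant to reduce.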

The Four Domains Where Gains Are Concentrated

The gains are concentrated in four areas: agentic coding, computer use, knowledge work, and early scientific research. OpenAI describes these as the domains ‘where progress depends on reasoning across context and taking action over time.’

For software engineers, the most immediately relevant benchmark is SWE-Bench Pro, which evaluates real-world GitHub issue resolution across four programming languages. GPT-5.5 resolves 58.6% of tasks end-to-end in a single pass. Worth noting: Claude Opus 4.7 scores higher at 64.3% on this same benchmark, although OpenAI has noted that Anthropic reported signs of memorization on a subset of these problems, which may affect the comparison.

For long-horizon coding specifically, OpenAI also reports results on Expert-SWE, an internal benchmark measuring tasks with a median estimated human completion time of 20 hours. GPT-5.5 outperforms GPT-5.4 on Expert-SWE. This benchmark is significant because it reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously.

Developers who tested the system early said GPT-5.5 has a better understanding of the “shape” of a software system, and can better understand why something is failing, where the fix is needed, and what else in the codebase could be affected.

https://openai.com/index/introducing-gpt-5-5/

For ML engineers and data scientists who spend significant time in terminal environments orchestrating pipelines and debugging scripts, the Terminal-Bench 2.0 results are the most compelling signal. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, beating Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. That is not a marginal lead: it is more than 13 percentage points over either competitor.

For broader knowledge work, GPT-5.5 scores 84.9% on GDPval, which tests agents across 44 knowledge-work occupations. On OSWorld-Verified, a benchmark measuring whether a model can autonomously operate real computer environments, it reaches 78.7%.

GPT-5.5 also ships with a Pro variant built for harder tasks that demand higher accuracy. On BrowseComp, which tests a model’s ability to track down hard-to-find information across the web, GPT-5.5 Pro scores 90.1%, ahead of Gemini 3.1 Pro at 85.9%. The model is also the top-ranked system on the Artificial Analysis Intelligence Index.

Speed and Token Efficiency

One concern with more capable models is that they tend to be slower or more expensive to run. OpenAI addressed this directly. GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing better across nearly every evaluation measured. It also uses significantly fewer tokens to complete the same Codex tasks, meaning shorter, more efficient runs even on complex agentic workflows.

On pricing, the standard GPT-5.5 API will be charged at $5 per million input tokens and $30 per million output tokens. For context, GPT-5.4 was priced at $2.50 per million input tokens and $15 per million output tokens, so the per-token cost has doubled. The OpenAI team argued that token efficiency gains offset the cost, since GPT-5.5 completes the same Codex tasks with fewer tokens, meaning cheaper runs overall even at the higher per-token rate. GPT-5.5 Pro, the higher-accuracy variant, is priced at $30 per million input tokens and $180 per million output tokens in the API.

For teams running Codex at scale, the net math is what matters: if GPT-5.5 completes a task in materially fewer tokens than GPT-5.4, the effective cost per completed workflow can still come out lower despite the higher rate. Because both the input and output rates doubled, the break-even point is roughly half the tokens.
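The arithmetic can be checked with a short script. The per-million-token rates are the ones quoted in this article; the token counts are made-up figures chosen only to illustrate the comparison, and the dictionary keys are plain labels rather than confirmed API model identifiers.

```python
# Rates in USD per million tokens (input, output), as quoted in the article.
PRICES_PER_MILLION = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at the given token counts."""
    in_rate, out_rate = PRICES_PER_MILLION[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def break_even_token_ratio(old_model: str, new_model: str) -> float:
    """Fraction of the old token usage at which the new model costs the same.

    Assumes input and output tokens shrink by the same factor, and that
    both rates changed by the same multiple (true for the prices above).
    """
    old_in, _ = PRICES_PER_MILLION[old_model]
    new_in, _ = PRICES_PER_MILLION[new_model]
    return old_in / new_in


# Hypothetical Codex task: 400k input tokens and 60k output tokens on GPT-5.4.
old_cost = run_cost("gpt-5.4", 400_000, 60_000)

# Same token usage on GPT-5.5: the bill simply doubles.
same_usage_cost = run_cost("gpt-5.5", 400_000, 60_000)

# Assumed scenario: GPT-5.5 finishes the task with 40% of the tokens.
efficient_cost = run_cost("gpt-5.5", 160_000, 24_000)

print(f"GPT-5.4:                  ${old_cost:.2f}")
print(f"GPT-5.5, same tokens:     ${same_usage_cost:.2f}")
print(f"GPT-5.5, 40% of tokens:   ${efficient_cost:.2f}")
print(f"Break-even token ratio:   {break_even_token_ratio('gpt-5.4', 'gpt-5.5'):.2f}")
```

Under these assumed numbers the run is cheaper on GPT-5.5 only because token usage fell below the 50% break-even ratio. A workload where the savings are smaller than that would cost more per completed task.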

Scale and Adoption

OpenAI has seen a surge in Codex usage, with about 4 million developers using the tool weekly. That scale matters for understanding the deployment context: GPT-5.5 is not a research preview but a production model being pushed to a large, active developer base immediately on launch.

Key Takeaways

GPT-5.5 is OpenAI’s first fully retrained base model since GPT-4.5, designed specifically for agentic workflows: it can understand complex goals, use tools, check its own work, and carry multi-step tasks through to completion with minimal human direction.
The biggest performance gains are in agentic coding, computer use, knowledge work, and early scientific research: GPT-5.5 scores 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld-Verified, outperforming both Claude Opus 4.7 and Gemini 3.1 Pro on several key benchmarks.
GPT-5.5 matches GPT-5.4’s per-token latency while being more capable across nearly every benchmark. It also uses significantly fewer tokens to complete the same Codex tasks, meaning better results without a proportional increase in latency or cost per completed workflow.
API pricing increases to $5/M input tokens and $30/M output tokens (up from $2.50 and $15 for GPT-5.4), with GPT-5.5 Pro priced at $30/M input and $180/M output. The OpenAI team argues token efficiency gains offset the higher per-token rate for most workloads.
GPT-5.5 is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with roughly 4 million developers already using Codex weekly.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
