The AI coding agent market is virtually unrecognizable compared with 2024 and even early 2025. What began as inline autocomplete has developed into fully autonomous systems that read GitHub issues, navigate multi-file codebases, write fixes, run tests, and open pull requests, without a human typing a single line of code. By early 2026, roughly 85% of developers reported regularly using some form of AI assistance for coding. The category has fractured into distinct archetypes: terminal agents, AI-native IDEs, cloud-hosted autonomous engineers, and open-source frameworks that let you swap in whatever model you prefer.
The problem is that every tool claims to be the best, and the benchmarks used to justify those claims are not always measuring the same things, and in some cases are no longer credible measures at all. This article ranks the most important AI coding agents by the metrics that actually matter for production software development, while being honest about where those metrics have broken down. If you are an AI/ML engineer, software developer, or data scientist trying to decide where to invest your tooling budget in 2026, start here.
How to Read These Benchmarks, and Why the Most-Cited One Is Now Disputed
Before the list, an important calibration on the numbers, because one major benchmark shift happened mid-cycle and is not yet reflected in most tool comparison articles.
SWE-bench Verified has been the industry's standard coding benchmark since mid-2024. It presents agents with 500 real GitHub issues drawn from popular Python repositories and measures whether the agent can understand the problem, navigate the codebase, generate a fix, and verify that it passes tests, end to end, without human guidance. It was a credible proxy. In February 2026, that changed.
On February 23, 2026, OpenAI's Frontier Evals team published a detailed post explaining why it had stopped reporting SWE-bench Verified scores. Its auditors reviewed 138 of the hardest problems across 64 independent runs and found that 59.4% had fundamentally flawed or unsolvable test cases: tests that demanded exact function names not mentioned in the problem statement, or that checked unrelated behavior pulled from upstream pull requests. More critically, they found evidence that every major frontier model (GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash) could reproduce the gold-patch solutions verbatim from memory using only the task ID, confirming systematic training data contamination. OpenAI's conclusion: "Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities." OpenAI now recommends SWE-bench Pro as the replacement for frontier coding evaluation.
This does not make SWE-bench Verified scores useless. Other major labs continue to report them, third-party evaluators continue to run them, and they remain useful for broad directional comparison. But any ranking that presents SWE-bench Verified scores as clean, objective measurements of real-world capability, without this caveat, is giving you an incomplete picture. All scores in this article are flagged accordingly.
SWE-bench Pro is harder to interpret than Verified because published results vary significantly by split, scaffold, harness, and reporting source. The benchmark contains 1,865 total tasks divided into a 731-task public set, an 858-task held-out set, and a 276-task commercial/private set drawn from 18 proprietary startup codebases. When the original Scale AI paper measured frontier models using a unified SWE-Agent scaffold, top scores were below 25% (GPT-5 at 23.3%), reflecting a genuinely harder evaluation. However, current public leaderboard and vendor-reported runs show significantly higher scores under newer models and optimized agent harnesses: OpenAI reports GPT-5.5 at 58.6% on SWE-bench Pro (Public), while Anthropic's comparison table lists Claude Opus 4.7 at 64.3% and Gemini 3.1 Pro at 54.2%. These numbers should not be compared directly with the original sub-25% SWE-Agent results without noting the scaffold and split differences: the benchmark has not changed, but the evaluation conditions and model generations have. When you see a 60%+ SWE-bench Pro score alongside a sub-25% one, they are measuring the same benchmark under very different conditions, not two separate tests.
Terminal-Bench 2.0 evaluates terminal-native workflows: shell scripting, file system operations, environment setup, and DevOps automation. As of April 23, 2026, GPT-5.5 leads this benchmark at 82.7%, confirmed in OpenAI's official release. Claude Opus 4.7 scores 69.4% (Anthropic/AWS-reported), and Gemini 3.1 Pro scores 68.5%. An important methodological caveat: different harnesses produce different numbers for the same model. Anthropic's Opus 4.6 system card showed GPT-5.2-Codex scoring 57.5% on the independent Terminus-2 harness versus 64.7% on OpenAI's own Codex CLI harness, a 7-point gap from the harness alone. When comparing Terminal-Bench figures across sources, always check which execution environment was used.
One final cross-benchmark caveat: agent scaffolding matters as much as the underlying model. In a February 2026 evaluation across 731 problems, three different agent frameworks running the same Opus 4.5 model finished 17 problems apart, a 2.3-percentage-point gap that changes relative rankings. A benchmark score labeled with a model name reflects the model plus the specific scaffold wrapped around it, not the model in isolation.
10 AI Agents for Software Development
A Note on Claude Mythos Preview
The current leader on SWE-bench Verified among third-party trackers is Claude Mythos Preview at 93.9%, announced April 7, 2026 under Anthropic's Project Glasswing. It is not generally available. Access is restricted to a limited set of platform partners; Anthropic has stated it does not plan a broad release in the near term, partly due to elevated cybersecurity capability concerns. It sits outside the main comparison below because developers cannot access it through normal channels. Its existence does, however, signal that the practical capability ceiling sits considerably above what any publicly available tool currently delivers.
#1. Claude Code (Anthropic)
SWE-bench Verified (self-reported): 87.6% (Opus 4.7) / 80.8% (Opus 4.6)
SWE-bench Pro (Anthropic internal variant): 64.3% (Opus 4.7, #1) / 53.4% (Opus 4.6)
Terminal-Bench 2.0: 69.4% (Opus 4.7, Anthropic-reported)
CursorBench: 70% (Opus 4.7, Cursor-reported)
Claude Code subscription: $20–$200/month | Opus 4.7 API: $5/$25 per million tokens
Claude Code is Anthropic's terminal-native coding agent and the leader on code quality metrics across most self-reported and third-party evaluations as of May 2026. It runs from the command line, integrates with VS Code and JetBrains via extensions, and is built around Claude Opus 4.7, released April 16, 2026.
Opus 4.7 represents a step change over its predecessor. SWE-bench Verified jumped from 80.8% to 87.6%, a nearly 7-point gain. On Anthropic's internal SWE-bench Pro variant, the model moved from 53.4% to 64.3%, an 11-point gain that puts it ahead of every currently available public competitor on that harder benchmark. On CursorBench, Cursor's CEO reported Opus 4.7 at 70%, up from 58% for Opus 4.6. Rakuten reported 3× more production tasks resolved on its internal SWE-bench variant; CodeRabbit reported over 10% recall improvement on complex PR reviews with stable precision.
Opus 4.7 introduced self-verification behavior: the model writes tests, runs them, and fixes failures before surfacing results, rather than waiting for external feedback. It also introduced multi-agent coordination, the ability to orchestrate parallel AI workstreams rather than processing tasks sequentially, which matters for teams running code review, documentation, and data processing concurrently. The 1 million token context window can hold much larger repository contexts than shorter-window tools, though very large monorepos still benefit from indexing, retrieval, or file selection strategies to stay within practical limits.
One important pricing distinction: the Claude Code subscription tiers ($20–$200/month) are what individual developers pay to use Claude Code in the CLI and IDE integrations. The underlying Opus 4.7 API is priced at $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6, with a 50% batch API discount and prompt caching reducing costs further. Teams building custom agents on top of the Anthropic API are not paying the subscription fee.
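For teams weighing the subscription against direct API usage, a rough back-of-the-envelope estimate is easy to script. The sketch below uses only the $5/$25 per-million-token rates and the 50% batch discount quoted above; the monthly token volumes are hypothetical assumptions, and prompt caching (which would lower costs further) is ignored.

```python
# Rough monthly cost estimate for direct Opus 4.7 API usage vs. a flat subscription.
# Rates come from the pricing above; the token volumes below are invented examples.

INPUT_RATE = 5.00      # USD per million input tokens
OUTPUT_RATE = 25.00    # USD per million output tokens
BATCH_DISCOUNT = 0.50  # batch API traffic is billed at half price

def monthly_api_cost(input_tokens: float, output_tokens: float, batch_share: float = 0.0) -> float:
    """Estimate monthly spend in USD; batch_share is the fraction of traffic sent via the batch API."""
    base = (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE
    return base * (1 - batch_share * BATCH_DISCOUNT)

# Hypothetical workload: 40M input and 8M output tokens per month, half of it batchable.
cost = monthly_api_cost(40e6, 8e6, batch_share=0.5)
print(f"Estimated API spend: ${cost:,.2f}/month vs. a $20-$200/month Claude Code subscription")
```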
On Terminal-Bench 2.0, Opus 4.7 scores 69.4%: strong, but GPT-5.5 has since moved ahead on this particular benchmark at 82.7%. For purely terminal- and DevOps-focused agentic workflows, that gap is worth considering.
Best for: Developers working on complex multi-file engineering tasks, large codebases, or long-horizon refactoring who prioritize output quality over speed.
#2. OpenAI Codex (OpenAI)
Terminal-Bench 2.0 (GPT-5.5): 82.7% (current #1)
SWE-bench Pro Public (OpenAI-reported, GPT-5.5): 58.6%
SWE-bench Verified (third-party trackers, GPT-5.5): ~88.7% (OpenAI does not self-report)
Pricing: Codex CLI is open-source (model usage requires a ChatGPT plan or API key); GPT-5.5 in Codex is available on Plus ($20/month), Pro ($200/month), Business, Enterprise, Edu, and Go plans; API: $5/$30 per million tokens (gpt-5.5)
An important correction to many Codex comparisons: the Codex CLI is a local tool that runs on your machine, not a cloud-sandboxed system. The Codex CLI (available on GitHub as openai/codex) runs a local agent loop in your terminal, using OpenAI's API for model inference. The cloud execution surface, where tasks run in an isolated VM without touching your local environment, is the Codex web product and IDE integrations, not the CLI. This distinction matters for security, network access, and cost modeling.
GPT-5.5 launched April 23, 2026 and is OpenAI's most capable coding model to date. On Terminal-Bench 2.0 it scores 82.7%, the current #1 position across all publicly available models, ahead of Claude Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%). OpenAI describes Terminal-Bench as the more representative benchmark for the kind of work Codex actually does: "complex command-line workflows requiring planning, iteration, and tool coordination." On SWE-bench Pro (Public), GPT-5.5 scores 58.6% per OpenAI's release notes, behind Claude Opus 4.7 (64.3%) but ahead of earlier GPT generations. Claude Opus 4.7 still leads on code quality for multi-file, long-horizon software engineering; GPT-5.5 leads on terminal-native, DevOps-style agentic execution.
A note on SWE-bench Verified: OpenAI stopped self-reporting this metric in February 2026 due to contamination concerns. Third-party trackers show GPT-5.5 around 88.7%, but OpenAI's official position is that the benchmark is no longer a reliable frontier measure. It reports SWE-bench Pro instead.
GPT-5.5 is available in ChatGPT (Plus, Pro, Business, Enterprise, Edu) and across Codex (CLI, IDE extensions, and the Codex web product). API access was announced and is rolling out. API pricing: $5/$30 per million tokens for gpt-5.5, a 2× jump from GPT-5.4. More than 85% of OpenAI employees now use Codex weekly, a signal of internal confidence in the product beyond benchmark numbers.
Best for: Developers focused on terminal-native, DevOps, and pipeline automation workflows where Terminal-Bench performance is the primary signal; also the strongest choice for fire-and-forget execution via the Codex web product.
#3. Cursor
SWE-bench Verified: ~51.7% (default config; rises considerably with an Opus 4.7 backend)
Task completion speed: ~30% faster than GitHub Copilot in head-to-head testing
ARR: $2 billion (February 2026)
Pricing: $20/month (Pro), $60/month (Pro+), Enterprise tiers above
Cursor reached $2 billion ARR in February 2026, doubling from $1 billion in November 2025, and is reportedly in talks to raise roughly $2 billion at a $50 billion-plus valuation, with Thrive Capital and Andreessen Horowitz. Those figures reflect real developer adoption, not benchmark-driven hype.
Cursor's SWE-bench figure (~51.7%) represents its default model configuration. Because Cursor is model-agnostic and supports Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and Grok, its effective benchmark ceiling scales with the model selected: a developer running Cursor with Opus 4.7 gets materially different performance from one using the default configuration. The 30% task completion speed advantage over Copilot reflects Cursor's editor-native architecture, which eliminates the context-switching overhead between a terminal agent and a separate IDE.
Cursor is a VS Code fork rebuilt around AI at every layer. Its Plan/Act mode gives developers a structured workflow: plan, review, then execute. Background Agents (Pro+ tier, $60/month) run autonomous coding sessions on cloud VMs in parallel, without blocking the main editor. Per-task model selection (a fast model for autocomplete, a reasoning-heavy model for complex edits) gives fine-grained cost control.
Cursor is its own editor, not a plugin. Developers using JetBrains, Neovim, or Xcode cannot use Cursor without switching editors. That constraint is real and limits its enterprise footprint compared with Copilot.
Best for: VS Code-native developers who want the best AI-native IDE experience and are willing to pay for the integrated workflow.
#4. Gemini CLI (Google DeepMind)
SWE-bench Verified (Gemini 3.1 Pro): 80.6%
Terminal-Bench 2.0 (Gemini 3.1 Pro): 68.5%
Context window: 1 million tokens
Pricing: Free tier via Google AI Studio; Google One AI Premium for higher limits
Gemini CLI is Google DeepMind's open-source coding agent (npm install -g @google/gemini-cli). Its primary model is Gemini 3.1 Pro, released February 19, 2026, which scores 80.6% on SWE-bench Verified and 68.5% on Terminal-Bench 2.0. Gemini 3 Flash (roughly 78% SWE-bench Verified) is the lighter, cheaper option within the same CLI. These are distinct capabilities, and the Gemini 3.1 Pro number is the correct headline for what Gemini CLI can deliver at full configuration.
Gemini 3.1 Pro also scores strongly on several non-coding benchmarks, including ARC-AGI-2 (77.1%), GPQA Diamond (94.3%), and BrowseComp (85.9%), making it a strong option for scientific computing, agentic research workflows, and tasks that mix coding with deep reasoning. For Google Cloud-native teams, Gemini CLI integrates directly with GCP, Vertex AI, and Android Studio.
The free tier is its most strategically distinctive feature. Solo developers, students, and open-source maintainers who cannot justify a $20–$200/month coding agent subscription have a legitimate frontier-quality option here. At 80.6% SWE-bench Verified, matching Claude Opus 4.6 and ahead of GitHub Copilot's default configuration, this is not a compromise free tier. It is a genuinely competitive product that removes cost as a barrier to entry.
Best for: Cost-sensitive developers, Google Cloud teams, and individual contributors who want frontier model quality without a monthly subscription.
#5. GitHub Copilot (Microsoft/GitHub)
SWE-bench Verified (Agent Mode, default model): ~56%
Adoption: 4.7 million paid subscribers (January 2026)
Pricing: $10/month (Pro), $19/month (Business), $39/month (Pro+), custom Enterprise pricing; AI Credits billing transition on June 1, 2026
GitHub Copilot is not the most capable agent on this list by benchmark, but it is the most widely deployed. With 4.7 million paid subscribers (75% year-over-year growth) and 76% developer awareness per GitHub's Octoverse report, Copilot is the baseline AI coding tool at most enterprise software organizations. Microsoft CEO Satya Nadella confirmed in early 2026 that Copilot now represents a larger business than GitHub itself.
Two important updates to the current pricing picture: GitHub added a Copilot Pro+ tier at $39/month that unlocks the full model roster and higher compute limits. More significantly, GitHub announced that Copilot is moving to AI Credits-based billing on June 1, 2026, which means certain agent actions, premium model calls, and background task execution will draw from a credit pool rather than being included in the flat monthly fee. Base plan prices are unchanged as of the announcement, but the total cost of heavy agentic use may increase depending on how credits are consumed.
On model selection: in February 2026, GitHub made Copilot a multi-model platform by adding Claude and OpenAI Codex as available backends for Copilot Business and Pro customers. The 56% SWE-bench figure reflects the default proprietary Copilot model. Configuring it to use Claude Opus 4.7 or GPT-5.5 would push that number considerably higher, though premium model calls draw from the credit pool under the new billing model.
At $10/month for individuals and $19/month for business seats, Copilot's price-to-capability ratio is the strongest entry point for enterprise teams that need predictable licensing, SOC 2 compliance, audit logs, and broad IDE support across VS Code, JetBrains, Visual Studio, Neovim, and Xcode. In enterprise procurement, compliance posture often outweighs a few SWE-bench percentage points.
Best for: Enterprise teams that need predictable licensing, compliance posture, and broad IDE support across multiple environments.
#6. Devin 2.0 (Cognition AI)
Performance: Higher on clearly scoped tasks; significantly weaker on ambiguous or complex tasks
Pricing (updated April 14, 2026): Free, Pro $20/month, Max $200/month, Teams usage-based with an $80/month minimum, Enterprise custom
Devin holds a special place in this category's history. Its 13.86% SWE-bench Lite score at launch in early 2024, the first time any AI system had autonomously resolved real GitHub issues at meaningful scale, was industry-defining. By today's standards, every tool above it in this ranking has surpassed that number by a factor of four or more.
Devin 2.0 is a substantially different product. It runs in a fully sandboxed cloud environment with its own IDE, browser, terminal, and shell. You assign a task; Devin produces a step-by-step plan you can review and edit; then it writes code, runs tests, and submits a pull request. Interactive Planning and Devin Wiki, which auto-indexes repositories and generates architecture documentation, address two of the original product's biggest criticisms.
On well-scoped, well-defined tasks such as framework upgrades, library migrations, tech debt cleanup, and test coverage additions, Devin reports higher success rates, with independent developer testing consistently showing strong results on clearly specified work. Reliability drops sharply for ambiguous or architecturally complex tasks; one documented community test found far more failures than successes across 20 varied tasks, highlighting that task specification quality directly determines output quality.
On pricing: Cognition retired its older Core and ACU-based self-serve plans on April 14, 2026 and introduced cleaner tiers: Free, Pro at $20/month, Max at $200/month, Teams usage-based with an $80/month minimum, and Enterprise with custom pricing. If you have seen the earlier "$20 Core + $2.25/ACU" pricing in other articles, it is no longer current.
Cognition also partnered with Cognizant in January 2026 to integrate Devin into enterprise engineering transformation offerings, and launched Cognition for Government in February 2026 with FedRAMP High authorization in progress, signaling a deliberate push into institutional deployments.
Best for: Teams with clearly scoped, well-specified engineering tasks, such as migrations, test generation, and framework upgrades, where the cost of reviewing AI output is lower than the cost of doing the work manually.
#7. OpenHands / OpenDevin (All Hands AI)
SWE-bench Verified: 72%
GAIA benchmark: 67.9%
License: MIT
Pricing: Free to self-host; pay only for model API inference
OpenHands (formerly OpenDevin, rebranded in late 2024 under the All Hands AI organization) is the open-source community's answer to Devin. With strong open-source adoption visible through GitHub activity and community usage, and a 72% SWE-bench Verified score, it matches or exceeds commercial agents at several price points.
OpenHands supports 100+ LLM backends: any OpenAI-compatible API, including Claude, GPT-5, Mistral, Llama, and local models via Ollama. The CodeAct agent can execute code, run terminal commands, browse the web, and interact with web-based development tools inside a Docker sandbox. Its 67.9% on the GAIA benchmark confirms that its web interaction capabilities are substantive.
The bring-your-own-key model means zero platform markup: you pay inference costs directly to your model provider. For open-source projects, budget-constrained teams, and developers who want full auditability of agent behavior, it is the strongest option in this tier. Self-hosting requires Docker and access to an LLM provider API; there is no hosted SaaS product.
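The "any OpenAI-compatible API" flexibility that OpenHands, Aider, and Cline share comes down to one pattern: point a standard client at a different base URL and key. The sketch below illustrates that pattern with the openai Python SDK and Ollama's local OpenAI-compatible endpoint; the model names and port are typical defaults rather than guarantees, and OpenHands itself wires this up through its own configuration rather than this exact code.

```python
# The OpenAI-compatible backend pattern behind "bring your own key":
# identical client code, different provider, chosen by base_url and api_key.
from openai import OpenAI

# Hosted provider, billed per token against your own account.
hosted = OpenAI(api_key="sk-...your-key...")

# Local model served by Ollama's OpenAI-compatible endpoint (no per-token cost).
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Send a single chat prompt and return the text of the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Same call path, different backend; only the client and model name change.
# print(ask(hosted, "gpt-5", "Explain this stack trace step by step..."))
# print(ask(local, "llama3", "Explain this stack trace step by step..."))
```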
Best for: Open-source teams, developers who want full control and auditability, and budget-conscious practitioners who already have API credits with a major model provider.
#8. Augment Code
SWE-bench Verified (self-reported, Augment harness): 70.6%
Differentiator: Full repository context engine; MCP-interoperable
Pricing: Team and Enterprise tiers
Augment Code's 70.6% SWE-bench score is self-reported using Augment's own harness and published on Augment's engineering blog. As with all agent-scaffolding-dependent scores, it should be read as "what Augment plus Opus 4.5 achieves with Augment's context engine," not as a standalone model number. That caveat stated, the architectural insight behind the score is real and independently validated: in the February 2026 scaffold comparison described earlier, Augment's context-first approach outperformed other frameworks running the same model by 17 problems out of 731.
The core innovation is that Augment's engine indexes an entire repository before the agent starts work, rather than building context reactively from open files. For enterprise teams working in large, mature monorepos, this produces measurably better results on tasks that require cross-module reasoning. Augment also exposes its context engine via MCP (Model Context Protocol), making it interoperable with other agents: a developer could use Augment's indexing while running Claude Code or Codex for generation.
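To make the two ideas in that paragraph concrete, indexing first and exposing the result over MCP, here is a deliberately simplified sketch: a toy keyword search over a repository, published as a single tool through the official MCP Python SDK's FastMCP helper. The server name, tool, and ranking logic are invented for illustration; Augment's actual engine and MCP surface are far more sophisticated.

```python
# Toy "context engine" exposed over MCP: an MCP-capable agent could call this tool
# to get a shortlist of relevant files instead of scanning the repository itself.
# Requires the official SDK: pip install mcp
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-repo-context")  # server name shown to connecting clients

@mcp.tool()
def relevant_files(task: str, limit: int = 5) -> list[str]:
    """Rank Python files by how many task keywords they contain (toy scoring)."""
    keywords = set(task.lower().split())
    scored: list[tuple[int, str]] = []
    for path in Path(".").rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        score = sum(1 for kw in keywords if kw in text)
        if score:
            scored.append((score, str(path)))
    return [p for _, p in sorted(scored, reverse=True)[:limit]]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which most agent clients expect
```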
Best for: Enterprise teams with large, mature codebases who need deeper repository context than single-session tools provide.
#9. Aider
Pricing: Free (open-source); pay for model API inference
Architecture: Git-native terminal agent
Aider is the git-native coding agent: it operates directly on your local repository and structures its changes as a series of atomic git commits with descriptive messages, a workflow that meshes well with teams that do careful code review. It supports any OpenAI-compatible model, giving it the same model-agnostic flexibility as OpenHands, and it runs entirely in the terminal with no IDE dependency.
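The practical payoff of the commit-per-change workflow is that each AI edit can be reviewed, cherry-picked, or reverted on its own. The snippet below is a generic illustration of that pattern using plain git calls, not Aider's internal implementation; the file paths and commit messages are made up.

```python
# Generic sketch of an atomic commit-per-edit workflow, the pattern Aider follows:
# each logical change becomes its own reviewable, revertible commit.
import subprocess

def commit_change(paths: list[str], message: str) -> None:
    """Stage only the files touched by one edit, then commit them with a descriptive message."""
    subprocess.run(["git", "add", "--", *paths], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

# Hypothetical sequence of agent edits, each isolated in its own commit:
# commit_change(["src/http_client.py"], "fix: retry on transient 503 responses")
# commit_change(["tests/test_http_client.py"], "test: cover retry backoff for 503s")
# A bad change can then be undone without touching the rest of the work:
# subprocess.run(["git", "revert", "--no-edit", "HEAD~1"], check=True)
```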
Where Aider lags behind higher-ranked tools is on complex, multi-step agentic tasks that require web access, browser interaction, or long-horizon planning. It is a powerful tool within a clearly defined scope, terminal-based and git-integrated coding, rather than a general-purpose autonomous agent.
Best for: Developers who prioritize git-native workflows, clean commit histories, and full control over their editor environment.
#10. Cline (Open-Source)
Cline is VS Code's most popular open-source AI coding extension, with 5 million installs claimed across supported marketplaces. It ships with Plan/Act modes, can run terminal commands, edit files across a repository, automate browser testing, and extend through any MCP server. The bring-your-own-key architecture means zero inference markup. Roo Code, a community fork, offers additional customization for teams that want to go beyond the core project.
Best for: VS Code developers who want open-source flexibility, full code auditability, and the ability to bring their own models without platform markup.
Marktechpost's Visual Explainer
The benchmark-maximizing strategy and the productivity-maximizing strategy are not the same thing. Based on community data and developer surveys, roughly 70% of productive professional developers in 2026 use two or more tools concurrently.
The most common pattern is a layered stack:
Terminal agents for complex tasks. Claude Code or Codex for multi-file refactoring, architectural changes, difficult debugging, or any task that requires holding substantial codebase context. These tools earn their higher price on work that would take a senior engineer hours.
IDE extensions for daily editing. Cursor or GitHub Copilot for inline completions, quick edits, test generation, and ambient assistance that speeds up routine coding work. The cognitive overhead of switching between a terminal agent and a separate editor is real; IDE-native tools eliminate it for everyday tasks.
Open-source tools for model flexibility. Aider, Cline, or OpenHands when you want to test a new model, avoid platform markup, or need full auditability of agent behavior. These also serve as a fallback when commercial tools have outages or pricing changes.
What the Next 12 Months Look Like
MCP as infrastructure. The Model Context Protocol is emerging as a shared standard that lets tools share context, hand off tasks, and compose capabilities. Augment's context engine exposed via MCP, and Copilot accepting Claude and Codex as backends, suggest the field is moving toward interoperability rather than winner-take-all consolidation.
Autonomous PR pipelines. GitHub Copilot's cloud agent, Codex's background execution model, and Devin's end-to-end PR workflow all point at the same future: AI agents that work through issues from a backlog overnight and surface reviewed pull requests in the morning. The bottleneck is no longer AI quality; it is the review bandwidth of human engineers and the governance frameworks organizations are building around autonomous code changes.
Enterprise governance as a differentiator. Gartner projects that 40% of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5% today. Compliance posture, audit logs, data handling guarantees, and security certifications will increasingly be the deciding factor in enterprise procurement, not SWE-bench position.
Open-source convergence. OpenHands at 72% SWE-bench Verified, and open-source models like MiniMax M2.5 (80.2% SWE-bench Verified) now matching proprietary frontier performance, show that the quality gap between open and closed systems is closing. The remaining advantages for commercial tools are scaffolding sophistication, enterprise support, and product polish, not raw model capability.
The Mythos ceiling. Claude Mythos Preview at 93.9% SWE-bench Verified, roughly 5 points above the best publicly available model, signals that the performance frontier is well ahead of what developers can currently access. When models at that tier reach general availability, expect the category ranking to shift again.
Main sources: Anthropic Claude Opus 4.7 announcement · AWS blog: Claude Opus 4.7 on Amazon Bedrock · OpenAI: Introducing GPT-5.5 · OpenAI: Why we no longer evaluate SWE-bench Verified · OpenAI: Introducing GPT-5.3-Codex · Scale AI SWE-bench Pro public leaderboard · SWE-bench Pro arXiv paper · Official SWE-bench leaderboard · GitHub: openai/codex · Cognition: New self-serve plans for Devin · GitHub Blog: Copilot moving to usage-based billing · GitHub Changelog: Claude and Codex for Copilot Business & Pro · Augment Code: Auggie tops SWE-bench Pro · Anthropic Project Glasswing · Google DeepMind Gemini 3.1 Pro model card · OpenHands GitHub repository

