Qwen3-Max-Thinking is Alibaba’s flagship reasoning model. It changes not just the parameter count but how inference compute is scheduled, and it ships with built-in tools for search, memory, and code execution.
Data, model scale and deployment
Qwen3-Max-Thinking is the reasoning tier built on Qwen3-Max, a flagship Mixture-of-Experts (MoE) LLM pre-trained on 36T tokens. The model targets long-horizon code and reasoning rather than casual chat, and runs with a context window of 262,144 tokens, enough for large technical reports and whole-repository code.
Qwen3-Max-Thinking is served through Qwen Chat and Alibaba Cloud Model Studio with an OpenAI-compatible HTTP API. It also accepts Claude-style tool schemas, so existing Anthropic and Claude Code flows can be swapped over to Qwen3-Max-Thinking with little change. No public weights are available, so all usage is API-based, which matches its positioning as a closed flagship.
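Because the API is OpenAI-compatible, a request is just a standard chat-completions payload. The sketch below only constructs the payload; the endpoint URL and model id are assumptions and should be checked against your Model Studio account.

```python
import json

# Assumed values; verify against Alibaba Cloud Model Studio documentation.
BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
MODEL = "qwen3-max-thinking"

def build_chat_request(prompt: str) -> dict:
    """Construct an OpenAI-style chat.completions payload."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a careful reasoning assistant."},
            {"role": "user", "content": prompt},
        ],
    }

payload = build_chat_request("Summarize this repository's build steps.")
body = json.dumps(payload)  # what would be POSTed to {BASE_URL}/chat/completions
```

Any OpenAI-compatible client can then POST this body with an Alibaba Cloud API key in the Authorization header.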
Experience-cumulative reasoning and smarter test-time scaling
A standard way to improve LLM reasoning is simple test-time scaling, such as best-of-N sampling or several parallel chains of thought. The cost of this approach grows almost linearly with the number of samples. Qwen3-Max-Thinking instead introduces an experience-cumulative, multi-round test-time scaling strategy.
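For contrast, the baseline best-of-N approach can be sketched as below. The sampler is a placeholder for a full model generation, which is why cost scales linearly with N.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    # Placeholder for one full model generation (the unit of cost).
    return rng.choice(["42", "42", "41"])  # deliberately noisy stub

def best_of_n(question: str, n: int, seed: int = 0) -> tuple[str, int]:
    """Majority vote over n independent samples; cost grows linearly in n."""
    rng = random.Random(seed)
    samples = [sample_answer(question, rng) for _ in range(n)]
    answer, _ = Counter(samples).most_common(1)[0]
    return answer, n  # n = number of full generations paid for

answer, cost = best_of_n("What is 6 * 7?", n=5)
```

Each extra sample buys one more independent attempt but shares nothing with the others, which is the inefficiency the experience-cumulative strategy targets.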
Rather than only sampling more completions in parallel, the model treats intermediate reasoning traces as structured experience. After each round it extracts partial conclusions, then concentrates subsequent compute on the unresolved parts of the question. Developers can steer this process through an explicit thinking budget exposed via API parameters such as enable_thinking.
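In an OpenAI-compatible request this shows up as extra fields on the payload. The enable_thinking flag is named in the article; the thinking_budget token cap is an assumed parameter name and should be verified against the current API docs.

```python
def with_thinking_budget(payload: dict, budget_tokens: int) -> dict:
    """Attach reasoning controls to a chat payload.

    `enable_thinking` turns the thinking phase on; `thinking_budget`
    (assumed name, check the docs) caps the reasoning tokens spent.
    """
    controlled = dict(payload)  # keep the original payload untouched
    controlled["enable_thinking"] = True
    controlled["thinking_budget"] = budget_tokens
    return controlled

base = {"model": "qwen3-max-thinking", "messages": []}
req = with_thinking_budget(base, 4096)
```

Lowering the budget trades reasoning depth for latency and cost on easy queries.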
The reported effect is higher accuracy without a proportional increase in tokens. Qwen’s ablations, for example, show GPQA Diamond rising from around 90 to approximately 92.8 and LiveCodeBench v6 from roughly 88.0 to 91.4 under the experience-cumulative strategy at similar token budgets. The key point is that higher reasoning quality can come from scheduling compute more efficiently, not just from adding more samples.
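The public description suggests a loop of roughly the following shape. Everything here is an illustrative reconstruction, not Qwen’s actual implementation: solve_round stands in for one reasoning round conditioned on the accumulated conclusions.

```python
def solve_round(question: str, experience: list[str]) -> tuple[str, bool]:
    # Stub for one reasoning round that sees all prior partial conclusions.
    conclusion = f"conclusion-{len(experience) + 1}"
    solved = len(experience) + 1 >= 3  # pretend three rounds resolve it
    return conclusion, solved

def experience_cumulative_solve(question: str, max_rounds: int = 5) -> list[str]:
    """Carry partial conclusions forward instead of restarting each sample."""
    experience: list[str] = []
    for _ in range(max_rounds):
        conclusion, solved = solve_round(question, experience)
        experience.append(conclusion)  # reuse, rather than discard, reasoning
        if solved:                     # stop early once the question resolves
            break
    return experience

trace = experience_cumulative_solve("hard question")
```

Unlike best-of-N, later rounds start from the unresolved remainder, so the same token budget is spent where it still matters.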
Native agent stack with Adaptive Tool Use
Qwen3-Max-Thinking integrates Search, Memory, and a Code Interpreter as first-class tools. Search lets the model retrieve fresh web pages, extract their content, and ground its answers. Memory stores user- or session-specific state, enabling personalized reasoning across longer workflows. The Code Interpreter runs Python for numeric and data verification, program synthesis, and runtime checks.
With Adaptive Tool Use, the model itself decides when to invoke these tools during a conversation. Instead of being orchestrated by an external agent, tool calls are interleaved with thinking segments. This design removes the need for separate planners and routers: the model can fetch information or run a calculation directly rather than guessing.
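Since the API also accepts Claude-style and OpenAI-style tool schemas, custom tools sit alongside the built-ins as ordinary declarations. The names and shapes below are illustrative, not the model’s internal tool definitions.

```python
# Hypothetical tool declarations in the OpenAI function-calling shape;
# "web_search" and "code_interpreter" are illustrative names only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Retrieve and extract fresh web pages.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "code_interpreter",
            "description": "Run Python for numeric checks and verification.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]

def tool_names(tools: list[dict]) -> list[str]:
    """List the callable tool names the model may choose between."""
    return [t["function"]["name"] for t in tools]
```

Passed in the request’s `tools` field, these let the model emit a tool call mid-reasoning instead of hallucinating a fact or a computation.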
Tool ability is benchmarked as well. On Tau² Bench, which measures function calling and tool orchestration, Qwen3-Max-Thinking reports a score of 82.1, comparable with other frontier models in this category.
Standard benchmarks across reasoning, knowledge, and search
Qwen3-Max-Thinking is positioned near or at the level of GPT-5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro. On knowledge tasks, Qwen reports 85.7 on MMLU-Pro, 92.8 on MMLU-Redux, and 93.7 on C-Eval, where it leads on Chinese-language evaluation.
On hard reasoning it scores 87.4 on GPQA, 98.0 on HMMT February 25, 94.7 on HMMT November 25, and 83.9 on IMOAnswerBench, putting it at the top of current math and science models. In coding and software, it scores 85.9 on LiveCodeBench v6 and 75.3 on SWE-Bench Verified.
In HLE’s base configuration, Qwen3-Max-Thinking scores 30.2, just below Gemini 3 Pro (37.5) and GPT-5 Thinking (35.5). In an HLE configuration with tools, it scores 49.8, ahead of GPT-5 Thinking (45.5) and Gemini 3 Pro (45.8). In its most aggressive experience-cumulative test-time configuration, it reaches 58.3 on HLE while GPT-5 Thinking stays at 45.5, though this higher number comes from a more inference-heavy mode.
Key Takeaways
- Qwen3-Max-Thinking is a flagship closed reasoning model available only via API, built on a backbone of more than 1 trillion parameters, trained on 36 trillion tokens, with a context window of 262,144 tokens.
- The model incorporates experience-cumulative test-time scaling, reusing intermediate reasoning over multiple rounds, which improves benchmarks such as GPQA Diamond and LiveCodeBench v6 at comparable token budgets.
- Qwen3-Max-Thinking integrates Search, Memory, and a Code Interpreter as native tools, and uses Adaptive Tool Use to decide when the model should browse, store state, or run Python.
- On public benchmarks it reports scores competitive with GPT-5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro, including strong results on MMLU-Pro, GPQA, HMMT, IMOAnswerBench, LiveCodeBench v6, SWE-Bench Verified, and Tau² Bench.