AI-trends.today

Xiaomi MiMo V2.5 Pro and MiMo V2.5 Released: Frontier Model Benchmarks with Significantly Lower Token Cost

Tech | By Gavin Wallace | 23/04/2026 | 7 Mins Read

The Xiaomi MiMo Team has released two new models: MiMo-V2.5-Pro and MiMo-V2.5. Benchmark results and some genuinely impressive real-world tasks suggest that open agentic AI has caught up faster than expected. Both models are accessible immediately through APIs and are priced competitively.
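
Since both models are API-accessible, a typical integration looks like any chat-completion-style request. The sketch below is a hedged illustration only: the endpoint URL, model IDs, and field names are assumptions, not Xiaomi's documented API.

```python
import json

# Hypothetical request builder for a chat-completion-style endpoint.
# NOTE: the endpoint, model IDs, and field names are assumptions for
# illustration; consult the actual API docs before use.
API_URL = "https://api.example.com/v1/chat/completions"  # placeholder URL

def build_request(model, prompt, tools=None):
    """Assemble a JSON request body for the given model and user prompt."""
    payload = {
        "model": model,                      # e.g. "mimo-v2.5" or "mimo-v2.5-pro"
        "messages": [{"role": "user", "content": prompt}],
    }
    if tools:
        payload["tools"] = tools             # agentic tool definitions, if any
    return json.dumps(payload)

body = build_request("mimo-v2.5-pro", "Write a SysY lexer in Rust.")
```

The body would then be POSTed to the provider's endpoint with an API key; only the payload construction is shown here.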

What Is an Agentic Model and Why Is It Important?

LLM benchmarks usually test whether a model can answer a self-contained, single question. Agentic benchmarks test something much harder: whether a model can complete multi-step goals autonomously, using tools such as web search, code execution, and API calls across many turns without ever losing sight of the initial objective.

Think of it as the difference between a model that can answer "how do I write a lexer?" and one that can actually write a complete compiler, run tests against it, catch regressions, and fix them, all without a human in the loop. That is what the Xiaomi MiMo Team is demonstrating here.

Flagship MiMo V2.5 Pro

MiMo V2.5-Pro is Xiaomi's strongest model to date. It offers significant improvements over its predecessor, MiMo V2-Pro, in general agentic ability, complex software engineering, and tasks that require a long-term view.

The key benchmark numbers are competitive with top closed-source models: SWE-bench Pro 57.2, Claw-Eval 63.8, and τ3-Bench 72.9 — placing it alongside Claude Opus 4.6 and GPT-5.4 across most evaluations. V2.5-Pro handles complex long-horizon tasks spanning more than 1,000 tool calls, with significant improvements in instruction following and reliable adherence to context-specific requirements.

Xiaomi MiMo calls this behavior "harness awareness," and it is one of the key differences between V2.5-Pro and earlier models: the model uses the features of its harness, maintains its own memory, and controls how its context window fills in order to reach the goal. It does not just follow instructions; it manages itself to stay on track throughout very long tasks.

Three recent real-world demos from Xiaomi show exactly what "long-horizon agentic capability" means in practice.

Demo 1 — SysY Compiler in Rust: This task comes from the Peking University Compiler Principles course project, which asks students to implement a complete SysY compiler in Rust from scratch. The reference project usually takes a PKU CS major several weeks. MiMo V2.5-Pro finished in 4.3 hours across 672 tool calls, scoring 233/233 on the course's hidden test suite.

What's notable isn't just the final score — it's the architecture of execution. Instead of wasting time on trial and error, the model built the compiler layer by layer: first the pipeline with correct Koopa IR, then the RISC-V backend (103/103), and finally performance optimization (20/20). The first compile passed 137/233 tests, a 59% hot start. When a later refactoring introduced regressions, the model detected them, recovered, and continued. This is structured, self-correcting engineering behavior — not pattern-matched code generation.
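
The loop described above — run the tests, reject edits that regress previously passing tests, keep edits that improve the pass set — can be sketched generically. This is an illustrative toy, not Xiaomi's actual harness; all names and the feature-flag "codebase" are invented for demonstration.

```python
def run_suite(code_state, tests):
    """Return the set of test IDs that pass for the current code state."""
    return {name for name, check in tests.items() if check(code_state)}

def agentic_fix_loop(code_state, tests, propose_fix, max_iters=20):
    """Iterate: propose a fix for failing tests, reject it if it regresses
    any previously passing test, accept it otherwise."""
    best_passing = run_suite(code_state, tests)
    for _ in range(max_iters):
        if len(best_passing) == len(tests):
            break                                      # all tests green
        candidate = propose_fix(code_state, set(tests) - best_passing)
        passing = run_suite(candidate, tests)
        if best_passing - passing:                     # regression introduced
            continue                                   # reject this edit
        code_state, best_passing = candidate, passing  # accept the improvement
    return code_state, best_passing

# Toy usage: the "code" is a dict of feature flags; each test checks one flag,
# and the (deliberately simple) fixer enables one missing feature at a time.
tests = {i: (lambda i: lambda s: s.get(i, False))(i) for i in range(5)}
propose = lambda state, failing: {**state, min(failing): True}
final_state, passing = agentic_fix_loop({0: True, 1: True}, tests, propose)
```

The regression check (`best_passing - passing`) is the self-correcting part: an edit that fixes one test but breaks another is never accepted.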

Demo 2 — Full-Featured Desktop Video Editor: From a few simple instructions, MiMo V2.5-Pro delivered a working app: a multi-track timeline, clip trimming and crossfading, an audio mixing pipeline, and an export pipeline. The final build took 11.5 hours and came to 8,192 lines of code.

Demo 3 — Analog EDA (FVF-LDO Design): This is the most technically specialized demo: a graduate-level analog-circuit EDA task requiring the design and optimization of a complete FVF-LDO (Flipped-Voltage-Follower low-dropout regulator) from scratch in the TSMC 180nm CMOS process. The model had to size the power transistor, tune the compensation network, and pick bias voltages so that six metrics land within spec simultaneously — phase margin, line regulation, load regulation, quiescent current, PSRR, and transient response. Wired into an ngspice simulation loop, in about an hour of closed-loop iteration — calling the simulator, reading waveforms, tweaking parameters — the model produced a design where every target metric is met, with four key metrics improved by an order of magnitude over its own initial attempt.
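
The closed-loop structure here — simulate, compare metrics against specs, adjust parameters, repeat — can be sketched as below. This is a minimal illustration under stated assumptions: the `simulate()` stub stands in for a real ngspice run (which would write a netlist, invoke the simulator, and parse waveforms), and the toy linear circuit model and spec values are placeholders, not TSMC 180nm data.

```python
# Minimum targets for two of the six metrics (illustrative values only).
SPECS = {"phase_margin_deg": 60.0, "psrr_db": 40.0}

def simulate(params):
    """Placeholder for an ngspice run: in a real flow this would write a
    netlist, call ngspice, and parse the output waveforms."""
    w = params["power_fet_width_um"]
    return {"phase_margin_deg": 30.0 + 0.5 * w, "psrr_db": 20.0 + 0.4 * w}

def optimize(params, max_iters=100):
    """Closed-loop iteration: simulate, check specs, tweak, repeat."""
    for _ in range(max_iters):
        metrics = simulate(params)
        misses = {k: SPECS[k] - v for k, v in metrics.items() if v < SPECS[k]}
        if not misses:
            return params, metrics               # every target metric in spec
        params["power_fet_width_um"] += 5.0      # crude tweak; the model's
                                                 # reasoning replaces this step
    raise RuntimeError("did not converge within iteration budget")

params, metrics = optimize({"power_fet_width_um": 10.0})
```

In the real demo the "tweak" step is where the model's reasoning lives — reading waveform shapes and trading off the six coupled metrics rather than nudging a single width.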

Token Efficiency: Frontier-level intelligence is only useful if it is cost-effective. On Claw-Eval, V2.5-Pro lands at 64% Pass^3 using only ~70K tokens per trajectory — roughly 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 at comparable capability levels. For engineers building production agent pipelines, that is a material cost saving, not a marketing statistic.
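
To put the "40–60% fewer tokens" claim in concrete terms: if MiMo spends ~70K tokens, a competitor at the same capability spends 70K / (1 − f) tokens. The per-million-token price below is an assumed placeholder, not any vendor's published rate.

```python
# Sanity-checking the token-efficiency claim above.
mimo_tokens = 70_000                          # ~tokens per Claw-Eval trajectory

competitor_low  = mimo_tokens / (1 - 0.40)    # "40% fewer" -> competitor ~ 116,667
competitor_high = mimo_tokens / (1 - 0.60)    # "60% fewer" -> competitor = 175,000

price_per_mtok = 5.00                         # assumed $/1M tokens, same for all models
saving_low  = (competitor_low  - mimo_tokens) / 1e6 * price_per_mtok
saving_high = (competitor_high - mimo_tokens) / 1e6 * price_per_mtok
```

At the high end, that is 2.5x the token spend per trajectory — which compounds quickly across thousands of agent runs.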

https://mimo.xiaomi.com/mimo-v2-5-pro/

MiMo Coding Bench is Xiaomi's own evaluation suite, designed to test models on real developer tasks inside agentic frameworks such as Claude Code. It covers repo comprehension, project creation, code review, artifacts, planning, SWE tasks, and more. V2.5-Pro leads this benchmark, and Xiaomi has explicitly positioned it as a drop-in for scaffolds like Claude Code, OpenCode, and Kilo.

Native Omnimodal MiMo V2.5 at Half the Cost

While MiMo V2.5-Pro targets the most difficult long-horizon agentic tasks, MiMo V2.5 is the bigger step up in multimodal capability. It is a natively omnimodal agent that reasons across modalities, with built-in audio and visual understanding, and it surpasses MiMo V2-Pro's performance.

The model is built with perception and action integrated from the start: MiMo learns to see, hear, and act on what it perceives from the beginning of training, yielding a model that can both understand everything and get things done. This is architecturally significant — earlier multimodal models often bolted vision onto a text backbone, creating capability gaps at the perception-action boundary.

The value proposition on the coding front is obvious: on MiMo Coding Bench, MiMo V2.5 delivers excellent everyday-coding results, closing the gap to frontier models while matching MiMo V2.5-Pro at a fraction of the price. If you don't need V2.5-Pro's depth but still want a strong model for everyday coding tasks, it is a great option.

https://mimo.xiaomi.com/mimo-v2-5/

On multimodal benchmarks, MiMo V2.5 sits at the Pareto frontier of performance and efficiency. It scores 62.3 on the general Claw-Eval subset and 23.8 on the Claw-Eval Multimodal subset, on par with Claude Sonnet 4.6 and ahead by 8 points.

Video understanding is where MiMo V2.5 shines: it scores 87.7 out of 100 on Video-MME, effectively tying Gemini 3 Pro (88.5) and finishing far ahead of Gemini 3 Flash. Long-horizon video comprehension — scene tracking, temporal reasoning, visual grounding over minutes of footage — is now in frontier territory. For image comprehension, MiMo V2.5 scores 81.0 on CharXiv and 77.9 on MMMU Pro, versus 81.2 for Gemini 3 Pro.

On pricing, MiMo V2.5 is billed at 1x (1 token = 1 credit) and MiMo V2.5-Pro at 2x (1 token = 2 credits). Token plans no longer charge a multiplier for the 1M-token context window — previously a common cost friction for long-context agentic workloads.
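
A minimal sketch of how a per-model credit multiplier translates into billed credits. The multiplier values and trajectory size below are illustrative assumptions based on the figures in this article, not official rates.

```python
# Assumed per-model credit multipliers (illustrative, not official pricing).
MULTIPLIER = {"mimo-v2.5": 1, "mimo-v2.5-pro": 2}

def credits_used(model, tokens):
    """Credits billed = raw tokens times the model's multiplier."""
    return tokens * MULTIPLIER[model]

trajectory = 70_000                              # ~tokens per agent trajectory
base = credits_used("mimo-v2.5", trajectory)     # 70,000 credits
pro  = credits_used("mimo-v2.5-pro", trajectory) # 140,000 credits
```

With no extra multiplier on the 1M-token context window, long-context runs cost the same per token as short ones under this scheme.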

What you need to know

  • MiMo V2.5-Pro matches frontier closed models on key agentic benchmarks (SWE-bench Pro 57.2, Claw-Eval 63.8, τ3-Bench 72.9), while using 40–60% fewer tokens per trajectory than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4.
  • Long-horizon autonomy is real and measurable — V2.5-Pro autonomously built a complete SysY compiler in Rust (233/233 tests, 672 tool calls, 4.3 hours) and a full-featured desktop video editor (8,192 lines of code, 1,868 tool calls, 11.5 hours).
  • MiMo V2.5 is natively omnimodal — trained from scratch to see, hear, and act across modalities with a native 1M-token context window, matching Claude Sonnet 4.6 on Claw-Eval Multimodal and nearly tying Gemini 3 Pro on Video-MME (87.7 vs. 88.4).
  • Get professional-level performance for half the price — on MiMo Coding Bench, MiMo-V2.5 matches MiMo-V2.5-Pro on everyday coding tasks at 1x token pricing, making it the practical choice for most production agent pipelines.
  • Both models are already compatible with popular agentic scaffolds like Claude Code, OpenCode, and Kilo — giving AI devs a drop-in, auditable, self-hostable path to frontier-level agentic AI.

Check out the technical details for MiMo-V2.5 and MiMo-V2.5-Pro. Also, feel free to follow us on Twitter, join our 130k+ ML SubReddit, and subscribe to our newsletter. You can now join us on Telegram as well.

Want to promote your GitHub repo, Hugging Face page, product release, or webinar? Connect with us.

