Microsoft has released Phi-4-reasoning-vision-15B, a multimodal open-weight reasoning model with 15 billion parameters designed for text and image tasks that demand both perception and selective reasoning. The compact model is designed to balance training-data requirements, computational efficiency, and reasoning quality across scientific and mathematical reasoning as well as user-interface understanding.
What is the model built on?
Phi-4-reasoning-vision-15B combines the Phi-4-Reasoning language backbone with the SigLIP-2 vision encoder in a mid-fusion architecture. The vision encoder first converts images into visual tokens; these tokens are then projected into the language embedding space and processed by the pretrained language model. This is an effective trade-off: it maintains strong cross-modal reasoning while keeping training costs low compared with heavier early-fusion designs.
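To make the mid-fusion flow concrete, here is a minimal PyTorch sketch. This is not Microsoft's implementation: the dimensions, module names, and the `vision_encoder`/`language_model` stand-ins are illustrative assumptions. It shows visual tokens being projected into the language embedding space and concatenated with text embeddings before the pretrained language model processes them.

```python
import torch
import torch.nn as nn

class MidFusionVLM(nn.Module):
    """Illustrative mid-fusion wiring: vision tokens are projected into the
    language model's embedding space and processed alongside text tokens.
    All dimensions and submodules here are assumptions, not Phi-4 internals."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, lm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP-2-style encoder
        self.language_model = language_model   # pretrained decoder-only LM
        # Projection from the vision-encoder space into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, pixel_values, input_ids):
        # 1. Encode the image into a sequence of visual tokens.
        visual_tokens = self.vision_encoder(pixel_values)           # (B, N_vis, vision_dim)
        # 2. Project visual tokens into the language embedding space.
        visual_embeds = self.projector(visual_tokens)               # (B, N_vis, lm_dim)
        # 3. Embed the text tokens with the LM's own embedding table.
        text_embeds = self.language_model.embed_tokens(input_ids)   # (B, N_txt, lm_dim)
        # 4. Concatenate and let the pretrained LM attend over both modalities.
        fused = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```

Because only the projector bridges the two pretrained components, training can focus on a comparatively small set of new parameters rather than re-learning the full stack, which is where the cost advantage over early-fusion designs comes from.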

Why did Microsoft go smaller?
Parameter counts and token budgets in many recent vision-language models have grown, increasing both deployment cost and latency. Phi-4-reasoning-vision-15B was built as a smaller alternative that still handles common multimodal workloads without relying on extremely large training datasets or excessive inference-time token generation. The model was trained on 200 billion multimodal tokens, building on Phi-4-Reasoning, which was trained on 16 billion tokens on top of the Phi-4 base model, itself trained on 400 billion unique tokens. Microsoft contrasts this with the more than 1 trillion tokens used by several recent multimodal models such as Qwen 2.5 VL, Qwen 3 VL, Kimi-VL, and Gemma 3.

Designing for high resolution was the core choice
In the technical report, the Microsoft team explains one of its most valuable lessons: multimodal reasoning often fails because perception fails first. Models can be wrong not because their reasoning is weak, but because they cannot discern fine visual detail in dense images such as documents or small interface elements.
Phi-4-reasoning-vision-15B uses a dynamic-resolution vision encoder that scales up to 3,600 visual tokens, supporting high-resolution tasks such as GUI grounding and fine-grained document analysis. Microsoft states that high-resolution and dynamic-resolution encoders provide consistent improvements, underscoring that accurate perception is necessary for high-quality reasoning.
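The report's exact resizing scheme isn't spelled out here, but the sketch below illustrates the general idea behind dynamic-resolution encoding under a token budget: keep the native resolution when its patch grid fits under the 3,600-visual-token cap, and downscale proportionally when it doesn't, so dense document pages get far more tokens than thumbnails. The patch size and rounding policy are assumptions for illustration.

```python
import math

MAX_VISUAL_TOKENS = 3600   # token cap reported for Phi-4-reasoning-vision-15B
PATCH_SIZE = 16            # assumed ViT patch size; the real value may differ

def dynamic_resolution(width: int, height: int):
    """Choose a processing resolution whose patch grid fits the token budget.

    Keeps the aspect ratio and only downscales when the native resolution
    would exceed MAX_VISUAL_TOKENS patches. Illustrative, not Phi-4's code.
    """
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    tokens = cols * rows
    if tokens <= MAX_VISUAL_TOKENS:
        # Native resolution already fits: keep full detail.
        return width, height, tokens
    # Uniformly downscale so the patch grid lands under the budget.
    scale = math.sqrt(MAX_VISUAL_TOKENS / tokens)
    new_w = max(PATCH_SIZE, int(width * scale) // PATCH_SIZE * PATCH_SIZE)
    new_h = max(PATCH_SIZE, int(height * scale) // PATCH_SIZE * PATCH_SIZE)
    return new_w, new_h, (new_w // PATCH_SIZE) * (new_h // PATCH_SIZE)

# An A4 page scanned at 300 dpi (2480x3508) would need ~34k patch tokens at
# native resolution, so it is downscaled until it fits the 3,600-token cap.
print(dynamic_resolution(2480, 3508))
```

The point of the higher cap is visible in the example: a fixed low-resolution encoder would shrink that page so aggressively that small text becomes unreadable, which is exactly the perception failure the report describes.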
Mixed reasoning is better than forcing it everywhere
The second important design decision is the mixed reasoning and non-reasoning training strategy. Rather than forcing chain-of-thought-style reasoning for all tasks, the Microsoft team trained the model to switch between two modes. Reasoning samples cover tasks such as math and science, while non-reasoning samples cover primarily perception-oriented tasks such as simple VQA, OCR, and captioning. Reasoning data makes up around 20% of the overall training mixture.
This hybrid setup lets the model respond quickly on tasks where a longer explanation would only add latency and could reduce accuracy, while structured reasoning remains available for tasks such as math or science. Microsoft also points out an important limitation: the boundary between the two modes is learned implicitly, so mode switching may not always be optimal. Users can override the default behavior with an explicit prompt, as in the examples below.
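How the override is phrased in practice will depend on the model's chat template; the snippet below is a hypothetical illustration of the idea using generic chat-style messages, not a documented Phi-4 prompt format.

```python
# Hypothetical prompts illustrating the mode override described above.
# The exact phrasing/template Phi-4-reasoning-vision expects may differ.

# Default: let the model pick a mode implicitly based on the task.
default_messages = [
    {"role": "user", "content": "What is the total on this receipt? <image>"},
]

# Explicitly request step-by-step reasoning for a math-heavy image.
reasoning_messages = [
    {"role": "system", "content": "Think through the problem step by step before answering."},
    {"role": "user", "content": "Solve the handwritten equation in the image. <image>"},
]

# Explicitly request a direct answer for a perception-only task,
# avoiding the latency of an unnecessary reasoning trace.
direct_messages = [
    {"role": "system", "content": "Answer directly without showing your reasoning."},
    {"role": "user", "content": "Read the text in this screenshot. <image>"},
]
```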
Which areas have the strongest performance?
The Microsoft team highlights two main application areas. The first is scientific and mathematical reasoning over visual inputs, including quantitative documents, handwritten equations, and diagrams. The second is computer-use agent tasks: models that interpret on-screen content, localize GUI components, and support interaction with desktop interfaces, mobile interfaces, or both.

Benchmark results
The Microsoft team reports benchmark scores for Phi-4-reasoning-vision-15B including 84.8 on AI2D (test), 44.9 on MathVerse (mini), 36.2 on MathVision (mini), 75.2 on MathVista (mini), 54.3 on MMMU (val), 64.5 on MMStar, 76.0 on OCRBench, and 88.2 on ScreenSpot-v2, alongside results on ChartQA (test). The report also notes that the results were obtained using Eureka ML Insights and VLMEvalKit, and Microsoft presents them as comparisons rather than leaderboard claims.
Key Takeaways
- Phi-4-reasoning-vision-15B is a 15B-parameter open-weight multimodal model built by combining Phi-4-Reasoning with the SigLIP-2 vision encoder in a mid-fusion architecture.
- Microsoft positions it as a compact multimodal reasoning model focused on math, science, document understanding, and GUI grounding, rather than scaling to an ever-larger parameter count.
- The system is built around high-resolution vision, with support for dynamic-resolution encoding and up to 3,600 visual tokens, which is useful for screenshot-, document-, and interface-heavy tasks.
- The model mixes reasoning and non-reasoning modes, switching between them based on whether a task needs explicit reasoning or a direct, perception-based output.
- Microsoft's benchmark results show strong performance for the model's size, with results on ChartQA (test), AI2D (test), MathVista (mini), OCRBench, and ScreenSpot-v2 supporting its positioning as a compact yet efficient vision-language reasoning system.
Check out the Paper, Repo, and Model Weights for more details.

