Microsoft has released Phi-4-reasoning-vision-15B, a multimodal open-weight reasoning model with 15 billion parameters designed for text and image tasks that demand both perception and selective reasoning. The compact model is designed to balance training-data requirements, computational efficiency, and reasoning quality across scientific and mathematical reasoning as well as user-interface understanding.
What is the model built on?
Phi-4-reasoning-vision-15B combines the Phi-4-Reasoning language backbone with the SigLIP-2 vision encoder in a mid-fusion architecture. The vision encoder first converts images into visual tokens; these tokens are then projected into the language embedding space and processed by the pretrained language model. This is an effective trade-off: it maintains strong cross-modal reasoning while keeping training costs low compared with heavier early-fusion designs.
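To make the mid-fusion flow concrete, here is a minimal PyTorch sketch. This is not Microsoft's implementation: the dimensions, module names, and the `vision_encoder`/`language_model` stand-ins are illustrative assumptions. It shows visual tokens being projected into the language embedding space and concatenated with text embeddings before the pretrained language model processes them.

```python
import torch
import torch.nn as nn

class MidFusionVLM(nn.Module):
    """Illustrative mid-fusion wiring: vision tokens are projected into the
    language model's embedding space and processed alongside text tokens.
    All dimensions and submodules here are assumptions, not Phi-4 internals."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, lm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP-2-style encoder
        self.language_model = language_model   # pretrained decoder-only LM
        # Projection from the vision-encoder space into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, pixel_values, input_ids):
        # 1. Encode the image into a sequence of visual tokens.
        visual_tokens = self.vision_encoder(pixel_values)           # (B, N_vis, vision_dim)
        # 2. Project visual tokens into the language embedding space.
        visual_embeds = self.projector(visual_tokens)               # (B, N_vis, lm_dim)
        # 3. Embed the text tokens with the LM's own embedding table.
        text_embeds = self.language_model.embed_tokens(input_ids)   # (B, N_txt, lm_dim)
        # 4. Concatenate and let the pretrained LM attend over both modalities.
        fused = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```

Because only the projector bridges the two pretrained components, training can focus on a comparatively small set of new parameters rather than re-learning the full stack, which is where the cost advantage over early-fusion designs comes from.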

Why did Microsoft go smaller?
Parameter counts and token budgets in many recent vision-language models have grown, increasing both deployment cost and latency. Phi-4-reasoning-vision-15B was built as a smaller alternative that still handles common multimodal workloads without relying on extremely large training datasets or excessive inference-time token generation. The model was trained on 200 billion multimodal tokens, building on Phi-4-Reasoning, which was trained on 16 billion tokens on top of the Phi-4 base model, itself trained on 400 billion unique tokens. Microsoft contrasts this with the more than 1 trillion tokens used by several recent multimodal models such as Qwen 2.5 VL, Qwen 3 VL, Kimi-VL, and Gemma 3.

Designing for high resolution was the core choice
In the technical report, the Microsoft team explains one of its most valuable lessons: multimodal reasoning often fails because perception fails first. Models can be wrong not because their reasoning is weak, but because they cannot discern fine visual detail in dense images such as documents or small interface elements.
Phi-4-reasoning-vision-15B uses a dynamic-resolution vision encoder that scales up to 3,600 visual tokens, supporting high-resolution tasks such as GUI grounding and fine-grained document analysis. Microsoft states that high-resolution and dynamic-resolution encoders provide consistent improvements, underscoring that accurate perception is necessary for high-quality reasoning.
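The report's exact resizing scheme isn't spelled out here, but the sketch below illustrates the general idea behind dynamic-resolution encoding under a token budget: keep the native resolution when its patch grid fits under the 3,600-visual-token cap, and downscale proportionally when it doesn't, so dense document pages get far more tokens than thumbnails. The patch size and rounding policy are assumptions for illustration.

```python
import math

MAX_VISUAL_TOKENS = 3600   # token cap reported for Phi-4-reasoning-vision-15B
PATCH_SIZE = 16            # assumed ViT patch size; the real value may differ

def dynamic_resolution(width: int, height: int):
    """Choose a processing resolution whose patch grid fits the token budget.

    Keeps the aspect ratio and only downscales when the native resolution
    would exceed MAX_VISUAL_TOKENS patches. Illustrative, not Phi-4's code.
    """
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    tokens = cols * rows
    if tokens <= MAX_VISUAL_TOKENS:
        # Native resolution already fits: keep full detail.
        return width, height, tokens
    # Uniformly downscale so the patch grid lands under the budget.
    scale = math.sqrt(MAX_VISUAL_TOKENS / tokens)
    new_w = max(PATCH_SIZE, int(width * scale) // PATCH_SIZE * PATCH_SIZE)
    new_h = max(PATCH_SIZE, int(height * scale) // PATCH_SIZE * PATCH_SIZE)
    return new_w, new_h, (new_w // PATCH_SIZE) * (new_h // PATCH_SIZE)

# An A4 page scanned at 300 dpi (2480x3508) would need ~34k patch tokens at
# native resolution, so it is downscaled until it fits the 3,600-token cap.
print(dynamic_resolution(2480, 3508))
```

The point of the higher cap is visible in the example: a fixed low-resolution encoder would shrink that page so aggressively that small text becomes unreadable, which is exactly the perception failure the report describes.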
Mixed reasoning is better than forcing it everywhere
The second important design decision is the mixed reasoning and non-reasoning training strategy. Rather than forcing chain-of-thought-style reasoning for all tasks, the Microsoft team trained the model to switch between two modes. Reasoning samples cover tasks such as math and science, while non-reasoning samples cover primarily perception-oriented tasks such as simple VQA, OCR, and captioning. Reasoning data makes up around 20% of the overall training mixture.
This hybrid setup lets the model respond quickly on tasks where a longer explanation would only add latency and could reduce accuracy, while structured reasoning remains available for tasks such as math or science. Microsoft also points out an important limitation: the boundary between the two modes is learned implicitly, so mode switching may not always be optimal. Users can override the default behavior with an explicit prompt, as in the examples below.
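How the override is phrased in practice will depend on the model's chat template; the snippet below is a hypothetical illustration of the idea using generic chat-style messages, not a documented Phi-4 prompt format.

```python
# Hypothetical prompts illustrating the mode override described above.
# The exact phrasing/template Phi-4-reasoning-vision expects may differ.

# Default: let the model pick a mode implicitly based on the task.
default_messages = [
    {"role": "user", "content": "What is the total on this receipt? <image>"},
]

# Explicitly request step-by-step reasoning for a math-heavy image.
reasoning_messages = [
    {"role": "system", "content": "Think through the problem step by step before answering."},
    {"role": "user", "content": "Solve the handwritten equation in the image. <image>"},
]

# Explicitly request a direct answer for a perception-only task,
# avoiding the latency of an unnecessary reasoning trace.
direct_messages = [
    {"role": "system", "content": "Answer directly without showing your reasoning."},
    {"role": "user", "content": "Read the text in this screenshot. <image>"},
]
```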
Which areas have the strongest performance?
The Microsoft team highlights two main application areas. The first is scientific and mathematical reasoning over visual inputs, including quantitative documents, handwritten equations, and diagrams. The second is computer-use agent tasks: models that interpret on-screen content, localize GUI components, and support interaction with desktop interfaces, mobile interfaces, or both.

Benchmark results
The Microsoft team reports benchmark scores for Phi-4-reasoning-vision-15B including 84.8 on AI2D (test), 44.9 on MathVerse (mini), 36.2 on MathVision (mini), 75.2 on MathVista (mini), 54.3 on MMMU (val), 64.5 on MMStar, 76.0 on OCRBench, and 88.2 on ScreenSpot-v2, alongside results on ChartQA (test). The report also notes that the results were obtained using Eureka ML Insights and VLMEvalKit, and Microsoft presents them as comparisons rather than leaderboard claims.
Key Takeaways
- Phi-4-reasoning-vision-15B is a 15B-parameter open-weight multimodal model built by combining Phi-4-Reasoning with the SigLIP-2 vision encoder in a mid-fusion architecture.
- Microsoft positions it as a compact multimodal reasoning model focused on math, science, document understanding, and GUI grounding, rather than scaling to an ever-larger parameter count.
- The system is built around high-resolution vision, with support for dynamic-resolution encoding and up to 3,600 visual tokens, which is useful for screenshot-, document-, and interface-heavy tasks.
- The model mixes reasoning and non-reasoning modes, switching between them based on whether a task needs explicit reasoning or a direct, perception-based output.
- Microsoft's benchmark results show strong performance for the model's size, with results on ChartQA (test), AI2D (test), MathVista (mini), OCRBench, and ScreenSpot-v2 supporting its positioning as a compact yet efficient vision-language reasoning system.
Check out the Paper, Repo, and Model Weights for more details.

