AI faces a new frontier in multimodal reasoning: models that can integrate and interpret data from multiple sources, such as text, diagrams, and images, remain a major challenge. VL-Cogito is a state-of-the-art Multimodal Large Language Model (MLLM) proposed by DAMO Academy, Alibaba Group, and other partners. It introduces a powerful reinforcement learning pipeline that upgrades reasoning skills in large models across math, science, logic, and chart understanding.
Core Innovations
VL-Cogito's approach centers on Progressive Curriculum Reinforcement Learning (PCuRL), a framework designed to eliminate the instability and cross-domain gaps that plague multimodal reasoning. The framework features two key innovations.
- Online Difficulty Soft Weighting (ODSW): The model adapts to its own evolving capability by assigning dynamic weights to samples based on their difficulty. Instead of rigidly filtering out "easy" or "hard" samples, ODSW ensures each prompt contributes appropriately to gradient updates, letting the model progress from clear cases to intricate, challenging ones through a continuous curriculum. Three variants, built on a piecewise function of rollout accuracy and guided by the empirical difficulty distribution, tune the focus toward the easy, medium, or hard stage.
- Dynamic Length Reward (DyLR): In RL-based reasoning, the traditional length target is a static value that ignores task complexity, which encourages verbosity and unnecessary discourse. DyLR resolves this by computing a per-question target length from the average length of that question's rollout samples. For easy tasks, DyLR encourages rapid, short reasoning; for complex ones, it promotes deeper multi-step analysis, balancing correctness and efficiency.
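As a rough illustration, the soft weighting idea can be sketched as a simple function of rollout accuracy. The triangular shape and the stage peaks below are assumptions for clarity; the paper's actual piecewise variants are not reproduced here.

```python
def odsw_weight(rollout_acc: float, stage: str = "medium") -> float:
    """Illustrative ODSW-style soft weight for one prompt.

    `rollout_acc` is the fraction of the prompt's rollouts answered
    correctly (1.0 = trivially easy, 0.0 = unsolved). Each training
    stage peaks at a different difficulty; the exact piecewise
    functions used by VL-Cogito are NOT reproduced here -- this
    triangular ramp is a hypothetical stand-in.
    """
    peaks = {"easy": 0.75, "medium": 0.5, "hard": 0.25}
    peak = peaks[stage]
    # Weight 1.0 at the stage's target difficulty, decaying linearly,
    # but floored above zero so every prompt still contributes to the
    # gradient update (soft weighting, not hard filtering).
    return max(0.1, 1.0 - 2.0 * abs(rollout_acc - peak))
```

In a GRPO-style update, such a weight would scale each prompt's contribution to the policy gradient, which is what distinguishes soft weighting from binary sample filtering.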
Training Pipeline
VL-Cogito's RL post-training starts directly from the Qwen2.5-VL-Instruct-7B backbone, a cold start with no initial supervised fine-tuning (SFT) required. PCuRL is divided into three sequential RL stages: easy, medium, and hard. Each stage involves:
- A reshuffled pass over the training data, so the model faces different generalization challenges at each stage.
- The ODSW weighting function for that stage, biasing gradient updates toward the target difficulty level.
- DyLR activation to adaptively expand the reasoning chain.
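The adaptive length target described above can be sketched as follows; the linear reward shape is an assumption for illustration, not the paper's exact formula.

```python
def dylr_target_and_reward(rollout_lengths, response_length):
    """Sketch of a DyLR-style dynamic length reward.

    The per-question target is the mean length of that question's
    sampled rollouts, so easy questions (short rollouts) yield short
    targets and hard ones long targets. The linear decay below is a
    hypothetical reward shape, not VL-Cogito's exact formula.
    """
    target = sum(rollout_lengths) / len(rollout_lengths)
    # Reward peaks at 1.0 when the response matches the target length
    # and decays to 0 once it over- or under-shoots by 100%.
    reward = max(0.0, 1.0 - abs(response_length - target) / target)
    return target, reward
```

Because the target is recomputed per question from its own rollouts, short answers to easy questions and long chains on hard questions can both earn full length reward.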
Technical setup
- AdamW optimizer, LR=1e-6, DeepSpeed-ZeRO3.
- Rollout batch size: 512; global batch size: 128; sequence length: 4,096; KL divergence loss coefficient: 1e-3; 16 response samples per prompt; temperature: 1.0.
- Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts).
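One plausible way to wire the reported hyperparameters together is a weighted sum of reward terms. The term names and the exact combination below are assumptions; the paper's formula may differ.

```python
def total_reward(acc, fmt, length_r, group_acc,
                 alpha=1.0, beta=0.5, gamma=1.0, w=0.25):
    """Hypothetical combination of reward terms using the reported
    hyperparameters (alpha=1, beta=0.5, gamma=1, w=0.25).

    acc       -- correctness reward for this response (0 or 1)
    fmt       -- format-compliance reward
    length_r  -- length reward (e.g. from DyLR)
    group_acc -- mean accuracy of the prompt's rollout group;
                 zero-accuracy prompts incur the penalty w.
    """
    r = alpha * acc + beta * fmt + gamma * length_r
    if group_acc == 0.0:
        # No rollout for this prompt was correct: apply the penalty.
        r -= w
    return r
```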
Dataset Curation and Data Sampling
The training set is carefully curated from 23 open-source multimodal datasets, covering six task categories, including mathematics, logical reasoning, counting, science reasoning, and chart understanding.
- To prevent superficial exploitation of multiple-choice cues, all samples are reformulated into an open-ended QA format.
- Difficulty-based sampling: Qwen2.5-VL-7B-Instruct is run on each candidate; any sample it answers with ≥50% accuracy over 8 runs is dropped, ensuring that only genuinely challenging tasks remain.
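The difficulty filter amounts to a one-line predicate over the probe model's accuracy; the helper names below are hypothetical.

```python
def keep_sample(correct_runs: int, total_runs: int = 8,
                threshold: float = 0.5) -> bool:
    """Return True if the sample is hard enough to keep.

    A sample is dropped when the probe model (Qwen2.5-VL-7B-Instruct
    in the article) answers it correctly in >= 50% of 8 runs.
    """
    return (correct_runs / total_runs) < threshold

# A candidate pool could then be filtered with something like:
# hard_pool = [s for s in pool if keep_sample(num_correct(s))]
# where num_correct() is a hypothetical accuracy-probing helper.
```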
Evaluation and Benchmark Results
Performance Across Benchmarks
VL-Cogito was benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task panel, including Geometry@3K, MathVerse, MathVista, MathVision, LogicVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.
- Absolute accuracy gains over the backbone: +7.6% on Geometry@3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA, and +3.8% on MMStar.
- On 6 of the 10 benchmarks, VL-Cogito leads or ties the top performers, particularly on difficult math and science tasks. Even against models cold-started with SFT or trained with forced rethinking, its robust RL curriculum remains superior.
| Model | Geo3K | MathVerse | MathVista | MathVision | LogicVista | ChartQA | SciQA | MMMU | EMMA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|
| VL-Cogito (7B) | 68.7 | 53.3 | 74.8 | 30.7 | 48.9 | 83.4 | 87.6 | 52.6 | 29.1 | 66.3 |
| VL-Rethinker (7B) | 67.7 | 54.6 | 73.7 | 30.1 | 45.7 | 83.5 | 86.7 | 52.9 | 28.6 | 64.2 |
| MM-Eureka (8B) | 67.2 | 52.3 | 73.4 | 29.4 | 47.1 | 82.7 | 86.4 | 52.3 | 27.4 | 64.7 |
| Qwen2.5-VL (7B) | 61.6 | 50.4 | 69.3 | 28.7 | 44.0 | 82.4 | 85.4 | 50.9 | 24.6 | 62.5 |
Component-wise Ablation
- Curriculum RL alone raises average scores by 0.8% over vanilla GRPO.
- The dynamic length reward further boosts performance, particularly in hard math domains.
- ODSW consistently outperforms binary hard-sample filtering, particularly when the training data's difficulty distribution is skewed.
Training Dynamics: Reasoning and Efficiency
- Adaptive length rewards are more efficient and yield better accuracy than fixed-length cosine rewards. As intended, the adaptive reasoning length grows for math, logic, and science tasks and shrinks for general understanding and knowledge tasks.
- PCuRL's hard stage induces a jump in both validation accuracy and reasoning length, surpassing vanilla GRPO, whose accuracy plateaus while its reasoning length stays static.
Case Studies
VL-Cogito reasons step by step, in a detailed and self-reflective way. In math, the model actively corrects its own mistakes by decomposing solutions into granular chains with verification steps [1, Figure 5]. It weighs all options before presenting the final answer, demonstrating strong multimodal understanding and process reliability.

Insights and Impact
VL-Cogito's PCuRL pipeline confirms several key insights:
- Curriculum learning matters: model progress is driven most effectively by prompts of medium difficulty.
- Exposure to challenge triggers deep reasoning: over-emphasis on easy samples can degrade performance, while a progressive focus on hard samples builds analytic depth.
- Reward granularity is essential: combining correctness with format and length rewards produces nuanced, context-sensitive reasoning.
- RL can be cold-started without SFT: PCuRL models do not require an expensive SFT warm-up.
Conclusion
VL-Cogito sets a new standard for multimodal RL training. The design and empirical validation of progressive curriculum RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.
Nikhil is an intern at Marktechpost. He holds an integrated dual degree in Materials from the Indian Institute of Technology, Kharagpur. An AI/ML enthusiast with a background in materials science, he is constantly researching applications of AI/ML in biomaterials and other biomedical fields.

