AI faces a new frontier in multimodal reasoning: models that can integrate and interpret data from multiple sources, such as text, diagrams, and images, remain a major challenge. VL-Cogito is a state-of-the-art Multimodal Large Language Model (MLLM) proposed by DAMO Academy, Alibaba Group, and other partners. It introduces a powerful reinforcement learning pipeline that upgrades reasoning skills in large models across math, science, logic, and chart understanding.
Core Innovations
VL-Cogito's approach centers on Progressive Curriculum Reinforcement Learning (PCuRL), a framework designed to eliminate the instability and cross-domain gaps that plague multimodal reasoning. The framework features two key innovations.
- Online Difficulty Soft Weighting (ODSW): The model adapts to its own evolving capability by assigning dynamic weights to samples based on their difficulty. Instead of rigidly filtering out "easy" or "hard" samples, ODSW ensures each prompt contributes appropriately to gradient updates, letting the model progress from clear cases to intricate, challenging ones through a continuous curriculum. Three variants, built on a piecewise function of rollout accuracy and guided by the empirical difficulty distribution, tune the focus toward the easy, medium, or hard stage.
- Dynamic Length Reward (DyLR): In RL-based reasoning, the traditional length target is a static value that ignores task complexity, which encourages verbosity and unnecessary discourse. DyLR resolves this by computing a per-question target length from the average length of that question's rollout samples. For easy tasks, DyLR encourages rapid, short reasoning; for complex ones, it promotes deeper multi-step analysis, balancing correctness and efficiency.
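As a rough illustration, the soft weighting idea can be sketched as a simple function of rollout accuracy. The triangular shape and the stage peaks below are assumptions for clarity; the paper's actual piecewise variants are not reproduced here.

```python
def odsw_weight(rollout_acc: float, stage: str = "medium") -> float:
    """Illustrative ODSW-style soft weight for one prompt.

    `rollout_acc` is the fraction of the prompt's rollouts answered
    correctly (1.0 = trivially easy, 0.0 = unsolved). Each training
    stage peaks at a different difficulty; the exact piecewise
    functions used by VL-Cogito are NOT reproduced here -- this
    triangular ramp is a hypothetical stand-in.
    """
    peaks = {"easy": 0.75, "medium": 0.5, "hard": 0.25}
    peak = peaks[stage]
    # Weight 1.0 at the stage's target difficulty, decaying linearly,
    # but floored above zero so every prompt still contributes to the
    # gradient update (soft weighting, not hard filtering).
    return max(0.1, 1.0 - 2.0 * abs(rollout_acc - peak))
```

In a GRPO-style update, such a weight would scale each prompt's contribution to the policy gradient, which is what distinguishes soft weighting from binary sample filtering.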
Training Pipeline
VL-Cogito's RL post-training starts directly from the Qwen2.5-VL-Instruct-7B backbone, a cold start with no initial supervised fine-tuning (SFT) required. PCuRL is divided into three sequential RL stages: easy, medium, and hard. Each stage involves:
- A reshuffled pass over the training data, so the model faces different generalization challenges at each stage.
- The ODSW weighting function for that stage, biasing gradient updates toward the target difficulty level.
- DyLR activation to adaptively expand the reasoning chain.
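The adaptive length target described above can be sketched as follows; the linear reward shape is an assumption for illustration, not the paper's exact formula.

```python
def dylr_target_and_reward(rollout_lengths, response_length):
    """Sketch of a DyLR-style dynamic length reward.

    The per-question target is the mean length of that question's
    sampled rollouts, so easy questions (short rollouts) yield short
    targets and hard ones long targets. The linear decay below is a
    hypothetical reward shape, not VL-Cogito's exact formula.
    """
    target = sum(rollout_lengths) / len(rollout_lengths)
    # Reward peaks at 1.0 when the response matches the target length
    # and decays to 0 once it over- or under-shoots by 100%.
    reward = max(0.0, 1.0 - abs(response_length - target) / target)
    return target, reward
```

Because the target is recomputed per question from its own rollouts, short answers to easy questions and long chains on hard questions can both earn full length reward.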
Technical setup
- AdamW optimizer, LR=1e-6, DeepSpeed-ZeRO3.
- Rollout batch size: 512; global batch size: 128; sequence length: 4,096; KL divergence loss coefficient: 1e-3; 16 response samples per prompt; temperature: 1.0.
- Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts).
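One plausible way to wire the reported hyperparameters together is a weighted sum of reward terms. The term names and the exact combination below are assumptions; the paper's formula may differ.

```python
def total_reward(acc, fmt, length_r, group_acc,
                 alpha=1.0, beta=0.5, gamma=1.0, w=0.25):
    """Hypothetical combination of reward terms using the reported
    hyperparameters (alpha=1, beta=0.5, gamma=1, w=0.25).

    acc       -- correctness reward for this response (0 or 1)
    fmt       -- format-compliance reward
    length_r  -- length reward (e.g. from DyLR)
    group_acc -- mean accuracy of the prompt's rollout group;
                 zero-accuracy prompts incur the penalty w.
    """
    r = alpha * acc + beta * fmt + gamma * length_r
    if group_acc == 0.0:
        # No rollout for this prompt was correct: apply the penalty.
        r -= w
    return r
```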
Dataset Curation and Data Sampling
The training set is carefully curated from 23 open-source multimodal datasets, covering six task categories, including mathematics, logical reasoning, counting, science reasoning, and chart understanding.
- To prevent superficial exploitation of multiple-choice cues, all samples are reformulated into an open-ended QA format.
- Difficulty-based sampling: Qwen2.5-VL-7B-Instruct is run on each candidate; any sample it answers with ≥50% accuracy over 8 runs is dropped, ensuring that only genuinely challenging tasks remain.
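The difficulty filter amounts to a one-line predicate over the probe model's accuracy; the helper names below are hypothetical.

```python
def keep_sample(correct_runs: int, total_runs: int = 8,
                threshold: float = 0.5) -> bool:
    """Return True if the sample is hard enough to keep.

    A sample is dropped when the probe model (Qwen2.5-VL-7B-Instruct
    in the article) answers it correctly in >= 50% of 8 runs.
    """
    return (correct_runs / total_runs) < threshold

# A candidate pool could then be filtered with something like:
# hard_pool = [s for s in pool if keep_sample(num_correct(s))]
# where num_correct() is a hypothetical accuracy-probing helper.
```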
Evaluation and Benchmark Results
Performance Across Benchmarks
VL-Cogito was benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task panel, including Geometry@3K, MathVerse, MathVista, MathVision, LogicVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.
- Absolute accuracy gains over the backbone: +7.6% on Geometry@3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA, and +3.8% on MMStar.
- On 6 of the 10 benchmarks, VL-Cogito leads or ties the top performers, particularly on difficult math and science tasks. Even against models cold-started with SFT or trained with forced rethinking, its robust RL curriculum remains superior.
| Model | Geo3K | MathVerse | MathVista | MathVision | LogicVista | ChartQA | SciQA | MMMU | EMMA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|
| VL-Cogito (7B) | 68.7 | 53.3 | 74.8 | 30.7 | 48.9 | 83.4 | 87.6 | 52.6 | 29.1 | 66.3 |
| VL-Rethinker (7B) | 67.7 | 54.6 | 73.7 | 30.1 | 45.7 | 83.5 | 86.7 | 52.9 | 28.6 | 64.2 |
| MM-Eureka (8B) | 67.2 | 52.3 | 73.4 | 29.4 | 47.1 | 82.7 | 86.4 | 52.3 | 27.4 | 64.7 |
| Qwen2.5-VL (7B) | 61.6 | 50.4 | 69.3 | 28.7 | 44.0 | 82.4 | 85.4 | 50.9 | 24.6 | 62.5 |
Component-wise Ablation
- Curriculum RL alone raises average scores by 0.8% over vanilla GRPO.
- The dynamic length reward further boosts performance, particularly in hard math domains.
- ODSW consistently outperforms binary hard-sample filtering, particularly when the training data's difficulty distribution is skewed.
Training Dynamics: Reasoning and Efficiency
- Adaptive length rewards are more efficient and yield better accuracy than fixed-length cosine rewards. As intended, the adaptive reasoning length grows for math, logic, and science tasks and shrinks for general understanding and knowledge tasks.
- PCuRL's hard stage induces a jump in both validation accuracy and reasoning length, surpassing vanilla GRPO, whose accuracy plateaus while its reasoning length stays static.
Case Studies
VL-Cogito reasons step by step, in a detailed and self-reflective way. In math, the model actively corrects its own mistakes by decomposing solutions into granular chains with verification steps [1, Figure 5]. It weighs all options before presenting the final answer, demonstrating strong multimodal understanding and process reliability.

Insights and Impact
VL-Cogito's PCuRL pipeline confirms several key insights:
- Curriculum learning matters: model progress is driven most effectively by prompts of medium difficulty.
- Exposure to challenge triggers deep reasoning: over-emphasis on easy samples can degrade performance, while a progressive focus on hard samples builds analytic depth.
- Reward granularity is essential: combining correctness with format and length rewards produces nuanced, context-sensitive reasoning.
- RL can be cold-started without SFT: PCuRL models do not require an expensive SFT warm-up.
Conclusion
VL-Cogito sets a new standard for multimodal RL training. The design and empirical validation of progressive curriculum RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.
Nikhil is an intern at Marktechpost. He holds an integrated dual degree in Materials from the Indian Institute of Technology, Kharagpur. An AI/ML enthusiast with a background in materials science, he is constantly researching applications of AI/ML in biomaterials and other biomedical fields.

