AI-trends.today

VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning

Tech | By Gavin Wallace | 09/08/2025 | 5 Mins Read

Multimodal reasoning is a new frontier for AI: models that must integrate and interpret data from multiple sources, such as text, diagrams, and images, face a substantial challenge. VL-Cogito, a state-of-the-art multimodal large language model (MLLM), is proposed by DAMO Academy, Alibaba Group, and partners. It introduces a powerful reinforcement learning pipeline that upgrades reasoning skills in large models across math, science, logic, and chart understanding.

Core Innovations

What makes VL-Cogito unique is Progressive Curriculum Reinforcement Learning (PCuRL), a framework designed to eliminate the instability and cross-domain gaps that plague multimodal reasoning. The framework features two key innovations.

  • Online Difficulty Soft Weighting (ODSW): assigns each training sample a dynamic weight according to how difficult it is relative to the model's current capability. Instead of rigidly filtering out "easy" or "hard" samples, ODSW ensures each prompt contributes appropriately to gradient updates, enabling the model to progress from clear cases to intricate, challenging ones along a continuous curriculum. Three variants, built from a piecewise function of rollout accuracy and guided by the empirical distribution of difficulty, tune the focus for the easy, medium, and hard stages.
  • Dynamic Length Reward (DyLR): in RL-based reasoning, the traditional length reward uses a statically set target that ignores task complexity, encouraging verbosity and unnecessary discourse. DyLR resolves this by computing a per-question target length from the average length of rollout samples. For easy tasks, DyLR encourages rapid, short reasoning; for complex ones, it promotes deeper multi-step analysis, balancing correctness and efficiency.
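The soft-weighting idea can be illustrated with a minimal sketch. The piecewise-linear functions below are illustrative assumptions, not the paper's exact forms; the point is that every prompt gets a continuous weight from its rollout accuracy rather than a keep/drop decision.

```python
def odsw_weight(rollout_accuracy: float, stage: str) -> float:
    """Soft difficulty weight for one prompt, derived from rollout accuracy.

    Hypothetical piecewise-linear variants (the paper's exact functional
    forms differ): each stage biases gradient updates toward its target
    difficulty instead of hard-filtering samples.
    """
    a = rollout_accuracy          # fraction of correct rollouts, in [0, 1]
    if stage == "easy":           # emphasize prompts the model mostly solves
        return a
    elif stage == "medium":       # peak weight at ~50% accuracy
        return 1.0 - abs(2.0 * a - 1.0)
    elif stage == "hard":         # emphasize prompts the model rarely solves
        return 1.0 - a
    raise ValueError(f"unknown stage: {stage}")

# Each prompt's contribution to the policy-gradient loss is scaled by
# its weight; in the medium stage, mid-difficulty prompts dominate.
weights = [odsw_weight(acc, "medium") for acc in (0.1, 0.5, 0.9)]
```

Note how the medium-stage variant weights 10%- and 90%-accuracy prompts symmetrically, while a binary filter would have to discard one side entirely.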

Training Pipeline

VL-Cogito’s RL post-training starts directly from the Qwen2.5-VL-Instruct-7B backbone as a cold start: no initial supervised fine-tuning (SFT) is required. PCuRL is divided into three sequential RL stages: easy, medium, and hard. Each stage combines:

  • The dataset, randomly reshuffled, so the model faces different generalization challenges at each stage.
  • The ODSW weighting function for that stage, biasing gradient updates toward the target difficulty level.
  • DyLR, triggered to expand the adaptive reasoning chain.
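The staged structure above can be sketched as a simple driver loop. This is a hypothetical skeleton, assuming the same data pool is reshuffled per stage and that DyLR activates in the hard stage; `train_stage` stands in for a full RL training run.

```python
import random

def run_pcurl(pool, train_stage, seed=0):
    """Hypothetical three-stage PCuRL driver (assumed structure).

    pool        -- the curated training samples, reused across stages
    train_stage -- callback standing in for one full RL training stage
    """
    rng = random.Random(seed)
    for stage in ("easy", "medium", "hard"):
        data = pool[:]                 # same dataset...
        rng.shuffle(data)              # ...reshuffled for fresh generalization
        use_dylr = (stage == "hard")   # assumption: DyLR active in hard stage
        train_stage(data, stage=stage, dynamic_length_reward=use_dylr)

# Usage with a stub that just records what each stage would receive:
log = []
run_pcurl(pool=list(range(4)),
          train_stage=lambda data, stage, dynamic_length_reward:
              log.append((stage, dynamic_length_reward)))
```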

Technical Setup Details

  • AdamW optimizer, LR=1e-6, DeepSpeed-ZeRO3.
  • Rollout batch size: 512; global batch size: 128; sequence length: 4,096; KL divergence loss coefficient: 1e-3; 16 response samples per prompt; temperature: 1.
  • Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts).
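One plausible way the listed hyperparameters combine is sketched below. The functional forms (indicator terms, a length reward that peaks at the DyLR target) are assumptions for illustration; only the weights α=1, β=0.5, γ=1, and w=0.25 come from the source.

```python
def composite_reward(correct, well_formatted, length, target_length,
                     alpha=1.0, beta=0.5, gamma=1.0, w=0.25,
                     group_all_wrong=False):
    """Hypothetical combination of the reported reward weights.

    alpha scales correctness, beta format, gamma the length term, and
    w penalizes prompts whose entire rollout group is wrong. The exact
    functional forms here are illustrative, not the paper's.
    """
    r_acc = alpha * (1.0 if correct else 0.0)
    r_fmt = beta * (1.0 if well_formatted else 0.0)
    # Length reward peaks when the response length matches the DyLR
    # target (the per-question average rollout length) and decays linearly.
    r_len = gamma * max(0.0, 1.0 - abs(length - target_length) / target_length)
    penalty = w if group_all_wrong else 0.0
    return r_acc + r_fmt + r_len - penalty
```

A correct, well-formatted response at exactly the target length would score α + β + γ = 2.5, while a zero-accuracy prompt group incurs the w = 0.25 penalty.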

Dataset Curation and Real-Time Data Sampling

The training set is carefully curated from 23 open-source multimodal datasets, covering six task categories: mathematics, logical reasoning, counting, science reasoning, chart understanding, and general understanding.

  • To prevent superficial exploitation of multiple-choice cues, all samples are reformulated into an open-ended QA format.
  • Difficulty sampling: each sample is trialed with Qwen2.5-VL-7B-Instruct; any sample it solves with ≥50% accuracy over 8 runs is dropped, guaranteeing that only genuinely challenging tasks remain.
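The difficulty filter reduces to a small predicate. This is a minimal sketch assuming `solve_fn` stands in for one rollout of the baseline model plus an answer check; the 8-run, 50% threshold is from the source.

```python
def keep_for_training(sample, solve_fn, n_runs=8, threshold=0.5):
    """Return True if the sample is hard enough to keep.

    solve_fn(sample) -> bool is a stand-in for one rollout of the
    Qwen2.5-VL-7B-Instruct baseline followed by an answer check.
    Samples solved with accuracy >= threshold over n_runs are dropped.
    """
    correct = sum(bool(solve_fn(sample)) for _ in range(n_runs))
    return correct / n_runs < threshold   # keep only challenging tasks
```

Note the strict inequality: a sample solved in exactly 4 of 8 runs hits the ≥50% bar and is dropped.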

Assessment and Benchmarking Results

Performance Across Benchmarks

VL-Cogito was benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task evaluation panel, including Geometry@3K, MathVerse, MathVista, MathVision, LogicVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.

  • Absolute accuracy gains over the backbone: +7.6% on Geometry@3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA, and +3.8% on MMStar.
  • State-of-the-art results on 6/10 benchmarks: VL-Cogito either leads or matches top performers, particularly on difficult math and science tasks. Even models "cold-started" with SFT or trained with forced rethinking fall short of the robust RL curriculum.
Model              Geo3K  MathVerse  MathVista  MathVision  LogicVista  ChartQA  SciQA  MMMU  EMMA  MMStar
VL-Cogito (7B)     68.7   53.3       74.8       30.7        48.9        83.4     87.6   52.6  29.1  66.3
VL-Rethinker (7B)  67.7   54.6       73.7       30.1        45.7        83.5     86.7   52.9  28.6  64.2
MM-Eureka (8B)     67.2   52.3       73.4       29.4        47.1        82.7     86.4   52.3  27.4  64.7
Qwen2.5-VL (7B)    61.6   50.4       69.3       28.7        44.0        82.4     85.4   50.9  24.6  62.5

Component-wise Ablation

  • Curriculum RL alone raises average scores by 0.8% over vanilla GRPO.
  • Dynamic Length Reward further boosts performance, particularly in hard math domains.
  • ODSW consistently outperforms binary hard-sample filtering, especially when the training data's difficulty distribution is skewed.

The Training Dynamics of Reasoning and Efficiency

  • Dynamic Length Rewards: DyLR's adaptive length targets are more efficient and deliver better accuracy than fixed-length cosine rewards. As intended, adaptive reasoning lengths grow for math, logic, and science tasks and shrink for general-understanding tasks.
  • PCuRL's hard stage induces a spike in both validation accuracy and reasoning length, surpassing vanilla GRPO, whose accuracy plateaus despite its static length target.

Case Studies

VL-Cogito produces detailed, self-reflective, step-by-step reasoning. The model actively corrects mistakes in math by decomposing solutions into granular chains with verification at each step [1, Figure 5]. It weighs all options before presenting the correct answer, demonstrating strong multimodal understanding and process reliability.

Insights and Impact

VL-Cogito’s PCuRL pipeline confirms several key insights:

  • Curriculum learning matters: model progress is best driven by prompts of medium difficulty.
  • Deep reasoning is triggered by exposure to challenge: over-emphasis on easy samples can degrade performance, while a progressive focus on hard samples builds analytic depth.
  • Reward granularity is essential: combining correctness with format and length rewards produces nuanced, context-sensitive reasoning.
  • RL can start cold, without SFT: PCuRL models do not require an expensive SFT warm-up.

Conclusion

VL-Cogito sets a new standard for multimodal RL training. The design and empirical validation of progressive curriculum-based RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.


Nikhil is an intern at Marktechpost, holding a dual integrated degree in Materials Science from the Indian Institute of Technology Kharagpur. An AI/ML enthusiast, he is constantly exploring applications of AI/ML in biomaterials and other biomedical fields.
