While reinforcement learning has shown that large reasoning models (LRMs) are capable of strong short-context reasoning, these gains do not transfer well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to ingest and analyze sequences exceeding 100K tokens. RL optimization in these regimes, however, suffers from slower reward convergence, unstable policy updates driven by KL-divergence fluctuations, and reduced exploration caused by entropy collapse. These bottlenecks expose a fundamental gap in moving LRMs from short-context proficiency to long-context generalization.
QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
The Qwen research team addresses these problems with QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning. The framework is organized into three stages:
- Warm-up Supervised Fine-Tuning (SFT): provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
- Curriculum-Guided Phased Reinforcement Learning: trains the model over stages of incrementally increasing context length, allowing it to gradually acquire long-context reasoning without destabilizing policy updates.
- Difficulty-Aware Retrospective Sampling: enhances exploration by retaining and reusing hard examples from earlier phases, ranked by difficulty, which encourages deeper reasoning across diverse inputs and improves robustness (a sketch appears below).
These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.
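As a rough illustration of the third stage, the sketch below shows one way difficulty-aware retrospective sampling could work: the hardest examples from earlier phases are retained and mixed back into the current phase's training pool. The `difficulty` field and the mixing fraction are illustrative assumptions, not the paper's exact recipe.

```python
import random

def retrospective_sample(current_pool, past_pools, hard_fraction=0.3, seed=0):
    """Mix the hardest examples from earlier phases into the current pool.

    Each example is a dict with a 'difficulty' score in [0, 1], where higher
    means the policy solved it less often (an illustrative convention).
    """
    rng = random.Random(seed)
    past = [ex for pool in past_pools for ex in pool]
    # Rank earlier-phase examples by difficulty (hardest first).
    past.sort(key=lambda ex: ex["difficulty"], reverse=True)
    n_hard = int(hard_fraction * len(current_pool))
    mixed = current_pool + past[:n_hard]
    rng.shuffle(mixed)
    return mixed

# Toy usage: a phase-2 pool augmented with the hardest phase-1 examples.
phase1 = [{"id": f"p1-{i}", "difficulty": i / 10} for i in range(10)]
phase2 = [{"id": f"p2-{i}", "difficulty": 0.5} for i in range(10)]
train_pool = retrospective_sample(phase2, [phase1])
```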
Technical Design and Methodological Advantages
QwenLong-L1 incorporates recent advances in group-relative policy optimization, specifically GRPO and DAPO, to reduce the computational overhead of value estimation over long contexts:
- GRPO estimates advantages by standardizing rewards within groups of sampled responses, eliminating the need for a separate value network and encouraging diverse generation patterns (a minimal sketch follows this list).
- DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to mitigate length biases during training and prevent entropy collapse.
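The sketch below illustrates the group-relative advantage computation that GRPO relies on; the group size, reward values, and epsilon constant are illustrative assumptions rather than the exact QwenLong-L1 settings.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages by normalizing rewards within each group.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    sampled response. No value network is needed: each response's advantage is
    its reward standardized against the other responses for the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts with 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.5]])
advantages = group_relative_advantages(rewards)
```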
The reward function is defined as the maximum of two signals: a deterministic rule-based match and a judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid answer formats while still crediting correct answers expressed in different notations and phrasings.
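A simplified sketch of such a hybrid reward is shown below. The `llm_judge` callable stands in for the compact evaluator model and is a placeholder assumption; only the max-of-two-signals structure mirrors the description above.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for rule-based matching."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def rule_based_reward(prediction: str, reference: str) -> float:
    """Deterministic exact-match check after light normalization."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

def hybrid_reward(prediction: str, reference: str, llm_judge) -> float:
    """Take the maximum of the rule-based signal and an LLM judge's verdict.

    `llm_judge` is a placeholder callable (e.g. backed by a small evaluator
    model) returning 1.0 if the answer is semantically equivalent, else 0.0.
    """
    return max(rule_based_reward(prediction, reference),
               float(llm_judge(prediction, reference)))

# Toy judge that accepts numerically equivalent answers such as "1/2" vs "0.5".
reward = hybrid_reward("1/2", "0.5", llm_judge=lambda p, r: 1.0)
```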
Performance is further improved through progressive context scaling: the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, which stabilizes training dynamics and facilitates policy generalization.
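The curriculum itself can be viewed as a schedule over maximum input lengths. The sketch below buckets training examples into phases with growing context budgets; apart from the 20K and 60K endpoints reported above, the phase structure and the whitespace "tokenizer" are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence

def build_curriculum(examples: List[dict],
                     token_len: Callable[[dict], int],
                     phase_limits: Sequence[int] = (20_000, 60_000)) -> Dict[int, List[dict]]:
    """Assign each example to the earliest RL phase whose context budget fits it.

    `token_len` returns the tokenized input length of an example; `phase_limits`
    lists the maximum input length allowed in each successive phase. Examples
    exceeding the final limit are dropped.
    """
    phases: Dict[int, List[dict]] = {i: [] for i in range(len(phase_limits))}
    for ex in examples:
        n = token_len(ex)
        for i, limit in enumerate(phase_limits):
            if n <= limit:
                phases[i].append(ex)
                break
    return phases

# Toy usage with a whitespace "tokenizer" standing in for the real one.
data = [{"text": "short context " * 100}, {"text": "long context " * 15_000}]
curriculum = build_curriculum(data, token_len=lambda ex: len(ex["text"].split()))
```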
Experimental Results and Benchmark Performance
QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath and Frames. The 32B variant, QwenLong-L1-32B, is a strong empirical performer:
- It outperforms baselines such as R1-Distill-Qwen-32B by 5.1 points and surpasses leading systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
- Its performance is comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capability under extreme context lengths.
- With increased sampling, its Pass@K scores improve, reaching an average of 73.7 and surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.

Ablation studies confirmed the contributions of phased RL and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.
Conclusion
QwenLong-L1 is a systematic approach to equipping LRMs with robust long-context reasoning abilities through reinforcement learning. By combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation, its design bridges the gap between short-context expertise and the demands of information-dense environments. It not only achieves strong results on long-context benchmarks but also shows how interpretable reasoning patterns emerge during training.
Check out the Paper, the Model on Hugging Face, and the GitHub Page. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary engineer and entrepreneur, he is dedicated to harnessing the potential of Artificial Intelligence for social good. His most recent venture is Marktechpost, a platform focused on machine learning and deep learning news that is praised for being both technically accurate and easy to read, drawing over 2 million views per month.


