Although long chain-of-thought (CoT) reasoning improves the performance of large language models, it has limitations. The typical "think-then-answer" approach slows response times and disrupts real-time interactions, such as those in chatbots. It also risks inaccuracies, since mistakes in earlier reasoning steps can propagate into an incorrect final answer. Unlike humans, who often share partial thoughts during a conversation, LLMs delay responding until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers and overlooks useful intermediate insights. There is growing interest in teaching models to alternate between thinking and answering, but this remains challenging.
RL, which aligns models with human preferences, has become a widely used method for improving reasoning in LLMs. It is typically guided by two kinds of reward: outcome-based rewards (ORM), which score only the final answer, and process-based rewards (PRM), which give feedback on intermediate reasoning steps. PRMs offer more detailed supervision than ORMs, but they usually rely on additional models and human annotation, which makes them costly and prone to issues such as reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and ways to reduce latency.
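The contrast between the two reward types can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the exact-match check, and the averaging scheme are all assumptions.

```python
from typing import Callable, List


def outcome_reward(final_answer: str, gold: str) -> float:
    """ORM-style: a single scalar judged only from the final answer."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """PRM-style: every intermediate step is scored. In practice step_scorer
    is a learned model or human annotation, which is what makes PRMs
    expensive and vulnerable to reward hacking."""
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)
```

An ORM collapses the whole trace into one signal, while a PRM exposes which step went wrong, at the cost of needing a reliable `step_scorer`.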
Researchers from Apple and Duke University developed Interleaved Reasoning, a new RL technique that lets models alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting for the final answer, the model shares informative intermediate answers, which gives users faster feedback and helps guide its own reasoning. Trained with a simple rule-based reward to generate helpful reasoning steps, the approach achieves up to 19.3% higher accuracy and over 80% faster responses. Although trained only on QA and logic datasets, the method generalizes strongly to unseen benchmarks.
This study proposes a reinforcement-learning framework that trains LLMs to perform interleaved reasoning, alternating between internal thinking and intermediate answers that are visible to the user. Each intermediate answer, or "sub-answer," is shared once the model reaches a meaningful milestone in its reasoning. A specialized training template with separate thinking and answering segments structures the output, and a rule-based reward encourages sub-answers that are both correct and timely.
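A minimal sketch of how such an interleaved trace and a conditional, time-discounted reward could work. The `<think>`/`<answer>` tag names, the discount factor, the bonus weight, and all function names here are illustrative assumptions, not the paper's exact implementation:

```python
import re
from typing import List

# Hypothetical interleaved output: the model alternates reasoning
# segments with user-visible sub-answers.
SAMPLE_OUTPUT = (
    "<think>First, find the capital of France.</think>"
    "<answer>Paris</answer>"
    "<think>Next, find the river that runs through it.</think>"
    "<answer>The Seine</answer>"
)


def extract_sub_answers(text: str) -> List[str]:
    """Pull every intermediate <answer> span out of an interleaved trace."""
    return re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)


def conditional_reward(sub_answers: List[str],
                       gold_answers: List[str],
                       final_correct: bool,
                       gamma: float = 0.9) -> float:
    """Rule-based sketch: intermediate credit is granted only when the
    final answer is correct (the 'conditional' part), and later
    sub-answers earn less (the 'time-discounted' part)."""
    if not final_correct:
        return 0.0
    reward = 1.0  # outcome reward for a correct final answer
    for t, (pred, gold) in enumerate(zip(sub_answers, gold_answers)):
        if pred.strip().lower() == gold.strip().lower():
            reward += (gamma ** t) * 0.5  # shrinking partial credit
    return reward
```

Conditioning the intermediate bonus on final correctness is one simple way to discourage the model from gaming the reward with plausible-looking but useless sub-answers.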
Qwen2.5 models (1.5B and 7B) were used to evaluate interleaved reasoning on both familiar and unfamiliar datasets. By delivering incremental answers, the interleaved approach is faster and more helpful than traditional think-then-answer methods. Combining it with intermediate rewards further improves performance while cutting response times by more than 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results indicate that interleaved reasoning makes AI systems more effective at multi-step reasoning.
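One rough way to read the "more than 80% faster" claim is as a reduction in time-to-first-answer, since interleaving surfaces a first sub-answer long before the full trace finishes. The metric and numbers below are hypothetical, for illustration only:

```python
def ttft_speedup(baseline_ttfa: float, interleaved_ttfa: float) -> float:
    """Relative reduction in time-to-first-answer (hypothetical metric):
    0.8 means the first useful answer arrives 80% sooner."""
    return 1.0 - interleaved_ttfa / baseline_ttfa


# e.g. a first sub-answer after 1.2s instead of waiting 6.0s for the
# full think-then-answer response gives an ~80% reduction.
speedup = ttft_speedup(6.0, 1.2)
```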
In conclusion, the study shows how interleaved reasoning, in which models alternate between thinking and producing intermediate answers, can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors demonstrate that providing timely intermediate feedback during training increases accuracy. Among the RL strategies tested, PPO delivered the most consistent results, and the conditional, time-discounted reward scheme proved most effective. The approach relies on rule-based rewards rather than learned reward models, scales to more complex tasks, and outperforms the traditional think-then-answer baseline. Interleaved reasoning improves both the quality and the efficiency of reasoning without depending on external tools.
Take a look at the Paper. All credit for this research goes to the researchers of this project.


