Understanding the Limitations of Current Omni-Modal Architectures
Large multimodal models (LMMs) have demonstrated outstanding capabilities across text, vision, and speech, opening up a vast range of applications. While vision-oriented LMMs have achieved considerable success, supporting speech interaction grounded in visual content remains difficult because of the intrinsic representational differences between modalities. Recent omni-modal LMMs aim to integrate text, vision, and speech by concatenating the representations of each modality along the sequence dimension. This approach, however, depends heavily on large-scale data to learn modality alignments in a purely data-driven way. Given the limited publicly available tri-modal datasets, such alignment remains weak, and the concatenation design lacks the flexibility to produce intermediate text results during speech interaction.
Categorizing Existing LMMs by Modality Focus
Existing LMMs fall into three categories: vision-oriented, speech-oriented, and omni-modal. Vision-oriented LMMs such as LLaVA use a vision encoder to extract visual features, which are combined with text inputs and passed to the LLM for text generation. Speech-oriented LMMs either embed continuous speech features directly into the LLM's embedding space (e.g., Mini-Omni, LLaMA-Omni) or convert speech into discrete units that the LLM processes natively (e.g., SpeechGPT, Moshi). Omni-modal LMMs such as VITA-1.5, MiniCPM-o 2.6, and Qwen2.5-Omni extract representations from each modality with separate encoders, concatenate them for multimodal understanding, and rely on speech decoders for synthesis.
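The vision-oriented pipeline described above can be sketched in a few lines. This is a toy illustration of sequence-dimension concatenation, not actual LLaVA code; all function names and the 4-dimensional embeddings are invented for the example.

```python
# Toy sketch of a vision-oriented LMM input pipeline (illustrative only).
# Embeddings are plain lists of floats; real systems use tensors.

def vision_encoder(image):
    # Stand-in for a ViT-style encoder: one 4-dim "patch" embedding per value.
    return [[float(p), 0.0, 0.0, 0.0] for p in image]

def text_embed(tokens):
    # Stand-in for the LLM's token embedding table.
    return [[0.0, float(t), 0.0, 0.0] for t in tokens]

def build_llm_input(image, tokens):
    # Sequence-dimension concatenation: visual embeddings are simply
    # prepended to the text embeddings before entering the LLM.
    return vision_encoder(image) + text_embed(tokens)

seq = build_llm_input(image=[3, 7], tokens=[11, 42, 5])
print(len(seq))  # 2 visual "patches" + 3 text tokens = 5 positions
```

The key point is that the LLM sees one longer sequence; it must learn from data which positions are visual and how they relate to the text, which is exactly where the data-hungry alignment problem arises.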
Introducing Stream-Omni: A Text-Centric Approach to Alignment
Researchers from the University of Chinese Academy of Sciences have proposed Stream-Omni, a large language-vision-speech model designed to address the modality alignment challenges in omni-modal systems. Rather than relying on concatenation alone, Stream-Omni uses an LLM as its backbone and aligns vision and speech to text according to their semantic relationships. For vision, which is semantically complementary to text, it applies sequence-dimension concatenation; for speech, which is semantically consistent with text, it applies a CTC-based mapping along the layer dimension. By introducing these targeted alignment mechanisms, Stream-Omni overcomes the limitations of pure concatenation methods.
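The CTC-based speech-to-text mapping rests on CTC's collapse rule: merge consecutive repeats, then drop blanks, so a long frame-rate label sequence maps onto a short token-rate sequence. The snippet below is a generic greedy CTC decoding sketch, not Stream-Omni's actual implementation.

```python
BLANK = 0  # conventional CTC blank label

def ctc_collapse(frame_labels):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks.

    This is how a frame-level (speech-rate) label sequence is mapped
    onto a much shorter token-level (text-rate) sequence.
    """
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# 9 speech frames collapse to 3 text tokens.
print(ctc_collapse([0, 7, 7, 0, 0, 3, 3, 9, 0]))  # [7, 3, 9]
```

Because the mapping is monotonic and many-to-one, speech representations can be projected onto text positions without requiring the LLM to learn the alignment from massive paired data.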
Architecture Overview: Visual Encoding and Dual-Layer Speech Integration
Stream-Omni's architecture employs an LLM backbone with progressive modality alignment strategies. For vision, it uses a vision encoder and a projection layer to extract visual representations. For speech, it introduces special speech layers at both the bottom and the top of the LLM, enabling bidirectional mapping between speech and text. Stream-Omni constructs its training corpus through an automated pipeline, drawing on LLaVA datasets for vision-text pairs, LibriSpeech and WenetSpeech for speech-text data, and InstructOmni, created by converting existing instruction datasets into spoken form.
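The dual-layer idea can be caricatured as a three-stage forward pass: bottom speech layers map speech frames toward text-like representations, the LLM backbone operates in that shared text space, and top speech layers map hidden states back out for speech generation. Everything below is a hypothetical stand-in (the pooling, the quantizer, and all names are invented), intended only to show the data flow, not Stream-Omni's real layers.

```python
# Toy sketch of dual-layer speech integration (hypothetical names/operations).

def bottom_speech_layers(speech_frames):
    # Speech -> text-space representations. Stand-in: average adjacent
    # frame pairs to shorten the sequence toward text rate.
    return [(a + b) / 2 for a, b in zip(speech_frames[::2], speech_frames[1::2])]

def llm_backbone(hidden):
    # Stand-in for the shared LLM body operating in the text space.
    return [h + 1.0 for h in hidden]

def top_speech_layers(hidden):
    # Text-space hidden states -> discrete speech units (stand-in quantizer).
    return [round(h) for h in hidden]

speech = [0.2, 0.4, 1.0, 1.2, 2.0, 2.2]
text_like = bottom_speech_layers(speech)        # 6 frames -> 3 positions
units = top_speech_layers(llm_backbone(text_like))
print(len(text_like), units)
```

The design choice this illustrates: because the middle of the stack only ever sees text-like representations, text generated there is available as an intermediate result during speech interaction.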
Benchmarking Multimodal Capabilities Across Domains
In visual understanding, Stream-Omni achieves performance comparable to advanced vision-oriented LMMs and outperforms VITA-1.5, reducing modality interference while maintaining strong visual capabilities. In knowledge-based speech interaction, it delivers outstanding performance while requiring far less speech data than discrete-unit models such as SpeechGPT, Moshi, and GLM-4-Voice. In vision-grounded speech interaction on real-world visual understanding tasks, Stream-Omni again surpasses VITA-1.5. Its speech-text mapping also yields better ASR results on LibriSpeech in both accuracy and inference time.
Conclusion: A Paradigm Shift in Multimodal Alignment
In conclusion, the researchers introduced Stream-Omni, a solution to the modality alignment challenges faced by omni-modal systems. The work demonstrates that efficient modality integration can be achieved through sequence-dimension concatenation for vision-text alignment and layer-dimension mapping for speech-text integration, eliminating the need for extensive tri-modal training data. It also establishes a new paradigm for multimodal LMMs, showing that targeted alignment strategies grounded in semantic relationships can overcome the limitations of traditional concatenation-based approaches.
Check out the Paper and the Model on Hugging Face. All credit for this research goes to the researchers of this project.



