Understanding the Limitations of Current Omni-Modal Architectures
Large multimodal models (LMMs) have demonstrated outstanding capabilities across text, vision, and speech, opening up a vast range of applications. While vision-oriented LMMs have achieved considerable success, supporting speech interaction grounded in visual content remains difficult because of the intrinsic representational differences between modalities. Recent omni-modal LMMs aim to integrate text, vision, and speech by concatenating the representations of each modality along the sequence dimension. This approach, however, depends heavily on large-scale data to learn modality alignments in a purely data-driven way. Given the limited publicly available tri-modal datasets, such alignment remains weak, and the concatenation design lacks the flexibility to produce intermediate text results during speech interaction.
Categorizing Existing LMMs by Modality Focus
Existing LMMs fall into three categories: vision-oriented, speech-oriented, and omni-modal. Vision-oriented LMMs such as LLaVA use a vision encoder to extract visual features, which are combined with text inputs and passed to the LLM for text generation. Speech-oriented LMMs either embed continuous speech features directly into the LLM's embedding space (e.g., Mini-Omni, LLaMA-Omni) or convert speech into discrete units that the LLM processes natively (e.g., SpeechGPT, Moshi). Omni-modal LMMs such as VITA-1.5, MiniCPM-o 2.6, and Qwen2.5-Omni extract representations from each modality with separate encoders, concatenate them for multimodal understanding, and rely on speech decoders for synthesis.
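The vision-oriented pipeline described above can be sketched in a few lines. This is a toy illustration of sequence-dimension concatenation, not actual LLaVA code; all function names and the 4-dimensional embeddings are invented for the example.

```python
# Toy sketch of a vision-oriented LMM input pipeline (illustrative only).
# Embeddings are plain lists of floats; real systems use tensors.

def vision_encoder(image):
    # Stand-in for a ViT-style encoder: one 4-dim "patch" embedding per value.
    return [[float(p), 0.0, 0.0, 0.0] for p in image]

def text_embed(tokens):
    # Stand-in for the LLM's token embedding table.
    return [[0.0, float(t), 0.0, 0.0] for t in tokens]

def build_llm_input(image, tokens):
    # Sequence-dimension concatenation: visual embeddings are simply
    # prepended to the text embeddings before entering the LLM.
    return vision_encoder(image) + text_embed(tokens)

seq = build_llm_input(image=[3, 7], tokens=[11, 42, 5])
print(len(seq))  # 2 visual "patches" + 3 text tokens = 5 positions
```

The key point is that the LLM sees one longer sequence; it must learn from data which positions are visual and how they relate to the text, which is exactly where the data-hungry alignment problem arises.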
Introducing Stream-Omni: A Text-Centric Approach to Alignment
Researchers from the University of Chinese Academy of Sciences have proposed Stream-Omni, a large language-vision-speech model designed to address the modality alignment challenges in omni-modal systems. Rather than relying on concatenation alone, Stream-Omni uses an LLM as its backbone and aligns vision and speech to text according to their semantic relationships. For vision, which is semantically complementary to text, it applies sequence-dimension concatenation; for speech, which is semantically consistent with text, it applies a CTC-based mapping along the layer dimension. By introducing these targeted alignment mechanisms, Stream-Omni overcomes the limitations of pure concatenation methods.
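The CTC-based speech-to-text mapping rests on CTC's collapse rule: merge consecutive repeats, then drop blanks, so a long frame-rate label sequence maps onto a short token-rate sequence. The snippet below is a generic greedy CTC decoding sketch, not Stream-Omni's actual implementation.

```python
BLANK = 0  # conventional CTC blank label

def ctc_collapse(frame_labels):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks.

    This is how a frame-level (speech-rate) label sequence is mapped
    onto a much shorter token-level (text-rate) sequence.
    """
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# 9 speech frames collapse to 3 text tokens.
print(ctc_collapse([0, 7, 7, 0, 0, 3, 3, 9, 0]))  # [7, 3, 9]
```

Because the mapping is monotonic and many-to-one, speech representations can be projected onto text positions without requiring the LLM to learn the alignment from massive paired data.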
Architecture Overview: Visual Encoding and Dual-Layer Speech Integration
Stream-Omni's architecture employs an LLM backbone with progressive modality alignment strategies. For vision, it uses a vision encoder and a projection layer to extract visual representations. For speech, it introduces special speech layers at both the bottom and the top of the LLM, enabling bidirectional mapping between speech and text. Stream-Omni constructs its training corpus through an automated pipeline, drawing on LLaVA datasets for vision-text pairs, LibriSpeech and WenetSpeech for speech-text data, and InstructOmni, created by converting existing instruction datasets into spoken form.
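The dual-layer idea can be caricatured as a three-stage forward pass: bottom speech layers map speech frames toward text-like representations, the LLM backbone operates in that shared text space, and top speech layers map hidden states back out for speech generation. Everything below is a hypothetical stand-in (the pooling, the quantizer, and all names are invented), intended only to show the data flow, not Stream-Omni's real layers.

```python
# Toy sketch of dual-layer speech integration (hypothetical names/operations).

def bottom_speech_layers(speech_frames):
    # Speech -> text-space representations. Stand-in: average adjacent
    # frame pairs to shorten the sequence toward text rate.
    return [(a + b) / 2 for a, b in zip(speech_frames[::2], speech_frames[1::2])]

def llm_backbone(hidden):
    # Stand-in for the shared LLM body operating in the text space.
    return [h + 1.0 for h in hidden]

def top_speech_layers(hidden):
    # Text-space hidden states -> discrete speech units (stand-in quantizer).
    return [round(h) for h in hidden]

speech = [0.2, 0.4, 1.0, 1.2, 2.0, 2.2]
text_like = bottom_speech_layers(speech)        # 6 frames -> 3 positions
units = top_speech_layers(llm_backbone(text_like))
print(len(text_like), units)
```

The design choice this illustrates: because the middle of the stack only ever sees text-like representations, text generated there is available as an intermediate result during speech interaction.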
Benchmarking Multimodal Capabilities Across Domains
In visual understanding, Stream-Omni achieves performance comparable to advanced vision-oriented LMMs and outperforms VITA-1.5, reducing modality interference while maintaining strong visual capabilities. In knowledge-based speech interaction, it delivers outstanding performance while requiring far less speech data than discrete-unit models such as SpeechGPT, Moshi, and GLM-4-Voice. In vision-grounded speech interaction on real-world visual understanding tasks, Stream-Omni again surpasses VITA-1.5. Its speech-text mapping also yields better ASR results on LibriSpeech in both accuracy and inference time.
Conclusion: A Paradigm Shift in Multimodal Alignment
In conclusion, the researchers introduced Stream-Omni, a solution to the modality alignment challenges faced by omni-modal systems. The work demonstrates that efficient modality integration can be achieved through sequence-dimension concatenation for vision-text alignment and layer-dimension mapping for speech-text integration, eliminating the need for extensive tri-modal training data. It also establishes a new paradigm for multimodal LMMs, showing that targeted alignment strategies grounded in semantic relationships can overcome the limitations of traditional concatenation-based approaches.
Check out the Paper and the Model on Hugging Face. All credit for this research goes to the researchers of this project.



