How can you create a model that can listen, watch, read, and react in real time across audio, text, images, and video without losing efficiency? Meituan's LongCat team has released LongCat Flash Omni, an open-source omni-modal model built on the LongCat Flash Mixture-of-Experts backbone. The model has over 560 billion total parameters and activates roughly 27 billion of them per token. It extends the text backbone to audio, vision, and video while keeping a 128K context window, which lets it run real-time conversations as well as document-level understanding in one stack.
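To make the total-versus-active distinction concrete, here is a minimal top-k Mixture-of-Experts sketch in PyTorch. It is illustrative only, not LongCat's implementation: the router sends each token through just a few experts, so every expert counts toward the total parameter budget but only a small slice of the weights is exercised per token.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer (illustrative, not LongCat's code).

    All experts count toward the *total* parameter budget, but each token is
    routed through only `k` of them -- the reason a ~560B-parameter model can
    activate only ~27B parameters per token.
    """

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_tokens, d_model]
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                    # explicit loop kept for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])  # only k experts run per token
        return out

# quick smoke test
moe = TopKMoE(d_model=16)
print(moe(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```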
The Architecture of Modular Attachments
LongCat Flash Omni keeps the existing language model and attaches perception modules to it. LongCat's ViT encoder processes both images and video frames, so there is no need for a separate video tower. The audio encoder, together with the LongCat audio codec, turns speech into discrete audio tokens, and the same tokens can be emitted from the LLM's output stream. This enables real-time audio-visual interaction.
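A rough sketch of this attachment pattern is below, with hypothetical class and module names (the real encoder, codec, and projection interfaces are in Meituan's repo): a shared vision encoder handles both images and video frames, an audio encoder handles speech, and small projections map both into the embedding stream the unchanged text backbone already consumes.

```python
import torch
import torch.nn as nn

class OmniWrapper(nn.Module):
    """Illustrative wiring only; assumes each sub-module returns [batch, seq, dim]."""

    def __init__(self, llm: nn.Module, vit: nn.Module, audio_enc: nn.Module,
                 d_model: int, d_vis: int, d_aud: int):
        super().__init__()
        self.llm = llm                 # unchanged LongCat-style text backbone
        self.vit = vit                 # one encoder for both images and video frames
        self.audio_enc = audio_enc     # streaming audio encoder
        self.vis_proj = nn.Linear(d_vis, d_model)  # project vision features into LLM space
        self.aud_proj = nn.Linear(d_aud, d_model)  # project audio features into LLM space

    def forward(self, text_emb, frames=None, audio=None):
        parts = [text_emb]
        if frames is not None:         # video is handled as a sequence of frames
            parts.append(self.vis_proj(self.vit(frames)))
        if audio is not None:          # audio tokens can also be generated back out by the LLM
            parts.append(self.aud_proj(self.audio_enc(audio)))
        tokens = torch.cat(parts, dim=1)   # one interleaved sequence for the backbone
        return self.llm(tokens)
```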
Watching Streaming Content with Feature Interleaving
The researchers describe chunk-wise audio-visual feature interleaving: audio features, video features, and timestamps are packed into 1-second segments. The report does not tie the sampling rules to whether the user or the model is speaking; instead, sampling is conditioned on clip duration, so duration-conditioned sampling is the accurate description. This keeps latency low while preserving the spatial grounding needed for GUI, OCR, and video QA tasks.
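One way to picture the 1-second chunking is the small helper below. It is a sketch of the general idea, not the report's exact packing format: audio features and sampled video frames that fall in the same second are grouped with their timestamps, then flattened chunk by chunk so the two streams stay temporally aligned.

```python
from dataclasses import dataclass, field

CHUNK_SECONDS = 1.0   # segment length described in the report
VIDEO_FPS = 2.0       # default video sampling rate; adjusted by clip duration

@dataclass
class Chunk:
    start: float
    audio: list = field(default_factory=list)   # audio feature frames in this second
    video: list = field(default_factory=list)   # sampled video frames in this second

def interleave(audio_feats, video_frames, duration):
    """audio_feats / video_frames: lists of (timestamp, feature) pairs."""
    n_chunks = int(duration // CHUNK_SECONDS) + 1
    chunks = [Chunk(start=i * CHUNK_SECONDS) for i in range(n_chunks)]
    for t, feat in audio_feats:
        chunks[int(t // CHUNK_SECONDS)].audio.append(feat)
    for t, frame in video_frames:
        chunks[int(t // CHUNK_SECONDS)].video.append(frame)
    # Flatten chunk by chunk so audio and video stay aligned in time
    stream = []
    for c in chunks:
        stream.extend(c.audio)
        stream.extend(c.video)
    return stream
```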
From Text to Omni
Training follows a staged curriculum that moves from text to omni: the LongCat Flash text backbone comes first, and the additional modalities are layered on progressively rather than trained all at once.
System Design: Modality-Decoupled Parallelism
Meituan employs modality-decoupled parallelism because the encoders scale very differently from the LLM. The audio and vision encoders rely on hybrid sharding (with activation recomputation), while the LLM combines pipeline, context, expert, and hybrid-sharding parallelism. A ModalityBridge aligns the encoder embeddings with the LLM's layout. According to the research team, multimodal supervised fine-tuning retains more than 90 percent of the throughput of text-only training.
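One way to read this setup is as a separate parallelism configuration per component, roughly as in the sketch below. The names and values are illustrative, not Meituan's actual config: the encoders shard independently of the LLM's pipeline/context/expert layout, and a bridge step re-shards embeddings between the two.

```python
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    pipeline: int = 1       # pipeline-parallel stages
    context: int = 1        # sequence/context-parallel degree
    expert: int = 1         # expert-parallel degree (MoE layers only)
    hybrid_shard: int = 1   # FSDP-style hybrid-sharding groups

# Illustrative values only: each component gets its own plan, and the bridge
# step regroups encoder outputs to match the LLM's layout.
PLANS = {
    "vision_encoder": ParallelPlan(hybrid_shard=8),
    "audio_encoder":  ParallelPlan(hybrid_shard=8),
    "llm":            ParallelPlan(pipeline=8, context=4, expert=16, hybrid_shard=2),
}

def modality_bridge(encoder_embeddings, llm_plan: ParallelPlan):
    """Conceptual placeholder for the ModalityBridge step.

    In a real system this would perform gather / all-to-all collectives to
    re-shard encoder embeddings for the LLM's parallel groups; here it only
    marks where that alignment happens.
    """
    return encoder_embeddings
```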

Benchmarks & Positioning
LongCat Flash Omni scores 61.4 out of 100 on OmniBench, slightly above Qwen3 Omni Instruct (58.5) and Qwen2.5 Omni (55.0) but below Gemini 2.5 Pro at 66.8. On Video-MME it scores 78.2, comparable to GPT-4o and Gemini Flash. On VoiceBench it reaches 88.7, slightly above GPT-4o Audio.
What you need to know
- LongCat Flash Omni is an open-source model built on Meituan's 560B-parameter MoE backbone; it activates approximately 27B parameters per token via shortcut-connected MoE with zero-computation experts (see the sketch after this list).
- The model attaches unified image-and-video vision and streaming audio paths to the existing LongCat Flash LLM. Video is sampled at 2 frames per second by default, with duration-conditioned adjustment.
- LongCat Flash Omni scores 61.4 on OmniBench, above Qwen3 Omni Instruct at 58.5, though Gemini 2.5 Pro leads at 66.8.
- Meituan uses modality-decoupled parallelism: the audio and vision encoders use hybrid sharding, while the LLM uses pipeline, context, and expert parallelism.
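The "zero-computation experts" mentioned above can be read as identity (no-op) experts that sit alongside the real FFN experts, as in the hedged sketch below: when the router assigns part of a token's weight to a zero-computation expert, that share of the forward pass costs essentially nothing, which lowers the average activated compute per token. This is one reading of the design, not Meituan's code.

```python
import torch
import torch.nn as nn

class ZeroComputeExpert(nn.Module):
    """Identity expert: contributes no FLOPs and no parameters."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

class FFNExpert(nn.Module):
    """Ordinary feed-forward expert for comparison."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
    def forward(self, x):
        return self.net(x)

# The router can pick from real experts *and* zero-computation ones, so the
# average compute per token drops below k full FFN evaluations (illustrative mix).
d_model = 16
experts = nn.ModuleList([FFNExpert(d_model) for _ in range(6)] +
                        [ZeroComputeExpert() for _ in range(2)])

x = torch.randn(3, d_model)
print(experts[0](x).shape, experts[-1](x).shape)  # both [3, 16]; the last costs ~0 FLOPs
```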
Meituan's latest release is a practical attempt to bring omni-modal interaction into the mainstream rather than leave it as an experiment. It keeps the 560B shortcut-connected Mixture-of-Experts backbone with roughly 27B active parameters per token, so the LongCat language backbone remains compatible. Streaming audio-visual perception uses 2 fps video sampling with duration-conditioned adjustment, which keeps latency low without compromising spatial grounding. Through modality-decoupled parallelism, the team reports more than 90 percent of text-only training throughput.
Take a look at the Paper, Model Weights, and GitHub Repo for more details.


