How can you create a model that can listen, watch, read, and react in real time across audio, text, images, and video without losing efficiency? Meituan's LongCat team has released LongCat Flash Omni, an open-source omni-modal model built on the LongCat Flash Mixture-of-Experts backbone. The model has over 560 billion total parameters and activates roughly 27 billion of them per token. It extends the text backbone to audio, vision, and video while keeping a 128K context window, which lets it run real-time conversations as well as document-level understanding in one stack.
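To make the total-versus-active distinction concrete, here is a minimal top-k Mixture-of-Experts sketch in PyTorch. It is illustrative only, not LongCat's implementation: the router sends each token through just a few experts, so every expert counts toward the total parameter budget but only a small slice of the weights is exercised per token.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer (illustrative, not LongCat's code).

    All experts count toward the *total* parameter budget, but each token is
    routed through only `k` of them -- the reason a ~560B-parameter model can
    activate only ~27B parameters per token.
    """

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_tokens, d_model]
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                    # explicit loop kept for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])  # only k experts run per token
        return out

# quick smoke test
moe = TopKMoE(d_model=16)
print(moe(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```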
The Architecture of Modular Attachments
LongCat Flash Omni keeps the existing language model and attaches perception modules to it. LongCat's ViT encoder processes both images and video frames, so there is no need for a separate video tower. The audio encoder, together with the LongCat audio codec, turns speech into discrete audio tokens, and the same tokens can be emitted from the LLM's output stream. This enables real-time audio-visual interaction.
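A rough sketch of this attachment pattern is below, with hypothetical class and module names (the real encoder, codec, and projection interfaces are in Meituan's repo): a shared vision encoder handles both images and video frames, an audio encoder handles speech, and small projections map both into the embedding stream the unchanged text backbone already consumes.

```python
import torch
import torch.nn as nn

class OmniWrapper(nn.Module):
    """Illustrative wiring only; assumes each sub-module returns [batch, seq, dim]."""

    def __init__(self, llm: nn.Module, vit: nn.Module, audio_enc: nn.Module,
                 d_model: int, d_vis: int, d_aud: int):
        super().__init__()
        self.llm = llm                 # unchanged LongCat-style text backbone
        self.vit = vit                 # one encoder for both images and video frames
        self.audio_enc = audio_enc     # streaming audio encoder
        self.vis_proj = nn.Linear(d_vis, d_model)  # project vision features into LLM space
        self.aud_proj = nn.Linear(d_aud, d_model)  # project audio features into LLM space

    def forward(self, text_emb, frames=None, audio=None):
        parts = [text_emb]
        if frames is not None:         # video is handled as a sequence of frames
            parts.append(self.vis_proj(self.vit(frames)))
        if audio is not None:          # audio tokens can also be generated back out by the LLM
            parts.append(self.aud_proj(self.audio_enc(audio)))
        tokens = torch.cat(parts, dim=1)   # one interleaved sequence for the backbone
        return self.llm(tokens)
```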
Watching Streaming Content with Feature Interleaving
The researchers describe chunk-wise audio-visual feature interleaving: audio features, video features, and timestamps are packed into 1-second segments. The report does not tie the sampling rules to whether the user or the model is speaking; instead, sampling is conditioned on clip duration, so duration-conditioned sampling is the accurate description. This keeps latency low while preserving the spatial grounding needed for GUI, OCR, and video QA tasks.
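One way to picture the 1-second chunking is the small helper below. It is a sketch of the general idea, not the report's exact packing format: audio features and sampled video frames that fall in the same second are grouped with their timestamps, then flattened chunk by chunk so the two streams stay temporally aligned.

```python
from dataclasses import dataclass, field

CHUNK_SECONDS = 1.0   # segment length described in the report
VIDEO_FPS = 2.0       # default video sampling rate; adjusted by clip duration

@dataclass
class Chunk:
    start: float
    audio: list = field(default_factory=list)   # audio feature frames in this second
    video: list = field(default_factory=list)   # sampled video frames in this second

def interleave(audio_feats, video_frames, duration):
    """audio_feats / video_frames: lists of (timestamp, feature) pairs."""
    n_chunks = int(duration // CHUNK_SECONDS) + 1
    chunks = [Chunk(start=i * CHUNK_SECONDS) for i in range(n_chunks)]
    for t, feat in audio_feats:
        chunks[int(t // CHUNK_SECONDS)].audio.append(feat)
    for t, frame in video_frames:
        chunks[int(t // CHUNK_SECONDS)].video.append(frame)
    # Flatten chunk by chunk so audio and video stay aligned in time
    stream = []
    for c in chunks:
        stream.extend(c.audio)
        stream.extend(c.video)
    return stream
```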
From Text to Omni
Training follows a staged curriculum that moves from text to omni: the LongCat Flash text backbone comes first, and the additional modalities are layered on progressively rather than trained all at once.
System Design: Modality-Decoupled Parallelism
Meituan employs modality-decoupled parallelism because the encoders scale very differently from the LLM. The audio and vision encoders rely on hybrid sharding (with activation recomputation), while the LLM combines pipeline, context, expert, and hybrid-sharding parallelism. A ModalityBridge aligns the encoder embeddings with the LLM's layout. According to the research team, multimodal supervised fine-tuning retains more than 90 percent of the throughput of text-only training.
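One way to read this setup is as a separate parallelism configuration per component, roughly as in the sketch below. The names and values are illustrative, not Meituan's actual config: the encoders shard independently of the LLM's pipeline/context/expert layout, and a bridge step re-shards embeddings between the two.

```python
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    pipeline: int = 1       # pipeline-parallel stages
    context: int = 1        # sequence/context-parallel degree
    expert: int = 1         # expert-parallel degree (MoE layers only)
    hybrid_shard: int = 1   # FSDP-style hybrid-sharding groups

# Illustrative values only: each component gets its own plan, and the bridge
# step regroups encoder outputs to match the LLM's layout.
PLANS = {
    "vision_encoder": ParallelPlan(hybrid_shard=8),
    "audio_encoder":  ParallelPlan(hybrid_shard=8),
    "llm":            ParallelPlan(pipeline=8, context=4, expert=16, hybrid_shard=2),
}

def modality_bridge(encoder_embeddings, llm_plan: ParallelPlan):
    """Conceptual placeholder for the ModalityBridge step.

    In a real system this would perform gather / all-to-all collectives to
    re-shard encoder embeddings for the LLM's parallel groups; here it only
    marks where that alignment happens.
    """
    return encoder_embeddings
```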

Benchmarks & Positioning
LongCat Flash Omni scores 61.4 out of 100 on OmniBench, slightly above Qwen3 Omni Instruct (58.5) and Qwen2.5 Omni (55.0) but below Gemini 2.5 Pro at 66.8. On Video-MME it scores 78.2, comparable to GPT-4o and Gemini Flash. On VoiceBench it reaches 88.7, slightly above GPT-4o Audio.
What you need to know
- LongCat Flash Omni is an open-source model built on Meituan's 560B-parameter MoE backbone; it activates approximately 27B parameters per token via shortcut-connected MoE with zero-computation experts (see the sketch after this list).
- The model attaches unified image-and-video vision and streaming audio paths to the existing LongCat Flash LLM. Video is sampled at 2 frames per second by default, with duration-conditioned adjustment.
- LongCat Flash Omni scores 61.4 on OmniBench, above Qwen3 Omni Instruct at 58.5, though Gemini 2.5 Pro leads at 66.8.
- Meituan uses modality-decoupled parallelism: the audio and vision encoders use hybrid sharding, while the LLM uses pipeline, context, and expert parallelism.
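The "zero-computation experts" mentioned above can be read as identity (no-op) experts that sit alongside the real FFN experts, as in the hedged sketch below: when the router assigns part of a token's weight to a zero-computation expert, that share of the forward pass costs essentially nothing, which lowers the average activated compute per token. This is one reading of the design, not Meituan's code.

```python
import torch
import torch.nn as nn

class ZeroComputeExpert(nn.Module):
    """Identity expert: contributes no FLOPs and no parameters."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

class FFNExpert(nn.Module):
    """Ordinary feed-forward expert for comparison."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
    def forward(self, x):
        return self.net(x)

# The router can pick from real experts *and* zero-computation ones, so the
# average compute per token drops below k full FFN evaluations (illustrative mix).
d_model = 16
experts = nn.ModuleList([FFNExpert(d_model) for _ in range(6)] +
                        [ZeroComputeExpert() for _ in range(2)])

x = torch.randn(3, d_model)
print(experts[0](x).shape, experts[-1](x).shape)  # both [3, 16]; the last costs ~0 FLOPs
```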
Meituan's latest release is a practical attempt to bring omni-modal interaction into the mainstream rather than leave it as an experiment. It keeps the 560B shortcut-connected Mixture-of-Experts backbone with roughly 27B active parameters per token, so the LongCat language backbone remains compatible. Streaming audio-visual perception uses 2 fps video sampling with duration-conditioned adjustment, which keeps latency low without compromising spatial grounding. Through modality-decoupled parallelism, the team reports more than 90 percent of text-only training throughput.
Take a look at the Paper, Model Weights, and GitHub Repo for more details.


