Xiaomi’s MiMo team has released MiMo-Audio, a 7-billion-parameter audio-language model that autoregressively models interleaved text and discretized speech, scaling pretraining past 100 million hours of audio.
What is actually new?
MiMo-Audio relies on a custom RVQ (residual vector quantization) tokenizer built for both high-quality reconstruction and semantic fidelity. It runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to effectively “lossless” speech tokens that can be modeled autoregressively alongside text.
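As a quick sanity check on those rates, the ≈200 tokens/s figure is just the frame rate times the number of RVQ layers. A minimal sketch (the constants come from the report; the script itself is illustrative):

```python
# Token-rate arithmetic for the MiMo-Audio tokenizer (illustrative).
FRAME_RATE_HZ = 25   # tokenizer timesteps per second of audio
RVQ_LAYERS = 8       # residual codebooks emitted per timestep

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS
print(tokens_per_second)          # 200 tokens/s, the ≈200 figure above
print(tokens_per_second * 3600)   # 720,000 tokens per hour of raw audio
```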
Architecture: patch encoder → 7B LLM → patch decoder
To handle the audio/text rate mismatch, the system packs four timesteps per patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs the full-rate RVQ streams with a causal patch decoder. A delayed RVQ scheme staggers predictions across codebook layers, which stabilizes synthesis while respecting inter-layer dependencies. All three parts (patch encoder, MiMo-7B backbone, and patch decoder) are trained under a single next-token objective.
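A minimal sketch of both mechanisms, assuming RVQ codes arrive as a (timesteps × 8) grid; the one-step-per-layer offset shown here is the standard “delay pattern” for RVQ decoding, and the exact offsets MiMo-Audio uses may differ:

```python
import numpy as np

FRAMES_PER_PATCH = 4   # 25 Hz -> 6.25 Hz for the LM
RVQ_LAYERS = 8

def patchify(codes: np.ndarray) -> np.ndarray:
    """Group consecutive timesteps into patches for the LM.

    codes: (T, 8) int array of RVQ indices at 25 Hz.
    Returns (T // 4, 4, 8): one patch per LM position.
    """
    T = codes.shape[0] - codes.shape[0] % FRAMES_PER_PATCH
    return codes[:T].reshape(-1, FRAMES_PER_PATCH, RVQ_LAYERS)

def delay_layers(codes: np.ndarray, pad: int = -1) -> np.ndarray:
    """Stagger RVQ layers so layer k is predicted k steps later.

    Coarse layers lead and fine layers follow, so each step's finer
    codes can condition on the coarser codes already emitted.
    """
    T, L = codes.shape
    out = np.full((T + L - 1, L), pad, dtype=codes.dtype)
    for k in range(L):
        out[k:k + T, k] = codes[:, k]
    return out

codes = np.random.randint(0, 1024, size=(100, RVQ_LAYERS))  # 4 s of audio
print(patchify(codes).shape)      # (25, 4, 8) -> 25 LM positions
print(delay_layers(codes).shape)  # (107, 8) staggered decode grid
```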
Training: scale is the algorithm
Training proceeds in two phases. An “understanding” stage optimizes text-token losses over speech-text interleaved corpora; an “understanding + generation” stage then switches on audio losses for S2T/T2S, speech continuation, and instruction-style data. The report emphasizes the compute/data threshold at which few-shot behaviors “switch on,” with emergence curves resembling those observed in large text-only LMs.
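One way to picture the two phases is as a per-position loss mask over the same interleaved sequence: phase one backpropagates only through text positions, phase two through text and audio alike. A minimal PyTorch sketch under that assumption (the token-type convention is illustrative, not the report’s actual schema):

```python
import torch
import torch.nn.functional as F

TEXT, AUDIO = 0, 1  # illustrative token-type ids, not the report's names

def masked_next_token_loss(logits, targets, token_types, stage: int):
    """Single next-token objective with a per-stage modality mask.

    Stage 1 ("understanding"): loss on text positions only.
    Stage 2 ("understanding + generation"): audio loss switched on too.

    logits:      (B, T, V) model outputs
    targets:     (B, T)    next-token ids
    token_types: (B, T)    TEXT or AUDIO per position
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape_as(targets)  # back to (B, T), one loss value per position
    if stage == 1:
        mask = token_types == TEXT
    else:
        mask = torch.ones_like(targets, dtype=torch.bool)
    return per_token[mask].mean()
```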
Benchmarks – Speech Intelligence and General Audio
MiMo-Audio is tested on broad audio benchmarks such as MMAU and speech-reasoning suites, reporting strong scores across speech, music, and sound. Results are given in both text-only and speech-in/speech-out settings to quantify the “modality gap.” Xiaomi has also released MiMo-Audio-Eval, a public toolkit for replicating these results, along with online demos for listen-and-respond tasks (speech translation, voice/emotion conversion, denoising, and speech continuation).

Why does this matter?
The approach is intentionally simple: no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time, just GPT-style next-token prediction over “lossless” audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can consume while preserving prosody and speaker identity, (ii) patchification to keep sequence lengths manageable, and (iii) delayed RVQ decoding to maintain quality during generation. For teams building spoken agents, these design choices translate into robust speech continuation and few-shot speech-to-speech editing with minimal task-specific fine-tuning.
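Few-shot S2S editing follows the usual in-context pattern: the prompt interleaves (source audio, edited audio) demonstration pairs as token spans, then appends the query audio for the model to continue. A schematic sketch with hypothetical helper names (`tokenizer.encode` here stands in for whatever the released inference code exposes):

```python
# Schematic few-shot speech-to-speech prompt (helper names hypothetical).
def build_s2s_prompt(tokenizer, examples, query_wav):
    """Interleave (source, target) audio demonstrations, then the query.

    examples:  list of (source_wav, target_wav) pairs showing the edit,
               e.g. the same sentence spoken neutrally and then angrily.
    query_wav: audio the model should transform in the same way.
    """
    ids = []
    for src, tgt in examples:
        ids += tokenizer.encode(src)   # discretize source speech to RVQ tokens
        ids += tokenizer.encode(tgt)   # ...followed by the edited version
    ids += tokenizer.encode(query_wav) # model continues with the "edit"
    return ids
```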
Six Take-Aways for Technical Professionals
- High-fidelity tokenization: MiMo-Audio employs a custom RVQ tokenizer operating at 25 Hz with eight active codebooks, so speech tokens preserve prosody, timbre, and speaker identity.
- Patchified sequence modeling: The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail (see the sketch after this list).
- Unified next-token objective: A single GPT-style loss over interleaved text and audio simplifies the architecture while still yielding multi-task generalization.
- Emergent few-shot skills: Once training exceeds a large data threshold (~100M hours, trillions of tokens), few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge.
- Benchmark leadership: MiMo-Audio reports state-of-the-art SpeechMMLU scores (69.5 overall; S2S 69.1, T2S 71.5) while shrinking the text-to-speech modality gap to just 3.4 points.
- Open ecosystem release: Xiaomi has released the tokenizer, the 7B checkpoints (base and instruct), the MiMo-Audio-Eval toolkit, and public demos, letting researchers and developers test and extend speech-to-speech pipelines with open-source software.
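To make the patchification takeaway concrete, here is the sequence-length arithmetic for one minute of audio, using only the rates stated above:

```python
# LM sequence length for 60 s of audio, with and without patchification.
SECONDS = 60
frames = 25 * SECONDS        # 1,500 tokenizer timesteps at 25 Hz
raw_tokens = frames * 8      # 12,000 flat RVQ tokens across 8 layers
patches = frames // 4        # 375 LM positions at 6.25 Hz

# 375 positions instead of 1,500 frames (4x) or 12,000 flat tokens (32x).
print(frames, raw_tokens, patches)
```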
Summary
MiMo-Audio demonstrates that high-fidelity, RVQ-based “lossless” tokenization combined with patchified next-token pretraining at scale yields few-shot speech intelligence without task-specific heads. The 7B stack (tokenizer → patch encoder → LLM → patch decoder) bridges the audio/text rate gap (25 → 6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding. Empirically, the model narrows the text↔speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.
Check out the Paper and the GitHub Page for technical details, code, and notebooks.


