Kyutai, an open AI research laboratory, has released a streaming text-to-speech (TTS) model with roughly 2 billion parameters. The model is designed for high-fidelity audio generation at ultra-low latency (220 milliseconds), was trained on an unprecedented 2.5 million hours of audio, and is licensed under the permissive CC BY 4.0, reflecting Kyutai’s commitment to openness. The release redefines what large-scale speech generation models can do, especially for edge deployment and AI agents.
Unpacking Performance: Under 350ms Latency for 32 Concurrent Users on a Single L40 GPU
Streaming is the model’s most distinguishing characteristic. On a single NVIDIA L40 GPU, the system serves up to 32 concurrent users while keeping latency below 350ms, and for an individual user, generation latency can be as low as 220ms. This allows applications such as conversational agents and voice assistants to run in near real time. The performance comes from Kyutai’s Delayed Streams Modeling, a novel approach to speech generation that lets the model produce audio incrementally as text arrives.
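These headline figures are easy to sanity-check empirically. The sketch below shows one way to measure time-to-first-audio against a streaming TTS service; `synthesize_stream` and `fake_stream` are hypothetical stand-ins for whatever client interface a deployment exposes, not Kyutai’s actual API.

```python
import time

def measure_time_to_first_audio(synthesize_stream, text):
    """Return seconds from request to the first audio chunk.

    `synthesize_stream` is assumed to be a generator function that
    yields raw audio chunks (bytes) as the model produces them.
    """
    start = time.perf_counter()
    for chunk in synthesize_stream(text):
        if chunk:  # the first non-empty chunk marks audible output
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")

# Toy stream simulating ~220ms of model latency before the first frame:
def fake_stream(text):
    time.sleep(0.22)
    yield b"\x00" * 3840  # one 80ms frame of 16-bit mono PCM at 24 kHz

latency = measure_time_to_first_audio(fake_stream, "Hello there")
print(f"time to first audio: {latency * 1000:.0f} ms")
```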
Key Technical Metrics
- Model size: 2B parameters
- Training data: 2.5 million hours of speech
- Latency: 220ms single-user; under 350ms with 32 concurrent users
- Language support: English, French
- License: CC BY 4.0 (permissive open license)
Delayed Streams Modeling: Architecting Real-Time Responsiveness
Kyutai’s core innovation is Delayed Streams Modeling, a method that begins synthesizing speech before the entire input text is available. The approach is designed to balance prediction accuracy against response speed, enabling high-throughput TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis.
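As a rough conceptual illustration only, the sketch below captures the core idea: the audio stream runs a fixed number of steps behind the text stream, so each audio token is predicted with some lookahead into the text while generation still proceeds incrementally. The `model_step` function and token values are toy placeholders, not the real architecture.

```python
def model_step(text_context, audio_context):
    # Toy stand-in for the real transformer, which predicts audio codec
    # tokens conditioned on both the text and audio streams so far.
    return f"audio[{len(audio_context)}]"

def delayed_streams_tts(text_tokens, delay=2):
    """Yield audio tokens incrementally while text is still arriving."""
    text_context, audio_context = [], []
    for i, tok in enumerate(text_tokens):   # text arrives token by token
        text_context.append(tok)
        if i >= delay:                      # audio runs `delay` steps behind
            audio_context.append(model_step(text_context, audio_context))
            yield audio_context[-1]
    for _ in range(delay):                  # text is done; audio catches up
        audio_context.append(model_step(text_context, audio_context))
        yield audio_context[-1]

for audio_token in delayed_streams_tts(["Hello", ",", "world", "!"]):
    print(audio_token)
```

The `delay` parameter is the trade-off described above: a larger delay gives the model more text context per audio step, at the cost of added response latency.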
Kyutai has published the source code and training recipe for this architecture in its GitHub repository, supporting full reproducibility as well as community contributions.
Model Availability and Open Research Commitment
Kyutai recently released the model weights and inference scripts on Hugging Face, giving researchers, developers, and commercial teams direct access. The permissive CC BY 4.0 license allows unrestricted integration and adaptation into applications, as long as proper attribution is kept.
The model supports both batch and streaming inference, which makes this release a flexible foundation for anything from voice cloning to real-time bots. With its pretrained English-French models, Kyutai provides the basis for multilingual TTS pipelines.
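To make the distinction between the two modes concrete, here is a minimal sketch; the `FakeTTSClient` class, its method names, and the 24 kHz / 80ms frame sizes are illustrative assumptions, not Kyutai’s actual interface.

```python
import wave

class FakeTTSClient:
    """Toy client mimicking the *shape* of the two inference modes;
    Kyutai's real entry points live in its repository and differ."""

    SAMPLE_RATE = 24_000  # assumed 24 kHz output

    def generate_stream(self, text):
        # Streaming mode: yield one short PCM frame per "model step".
        frame = b"\x00" * 3840  # ~80 ms of 16-bit mono silence
        for _ in text.split():
            yield frame

    def generate(self, text):
        # Batch mode: just the concatenation of the streamed frames.
        return b"".join(self.generate_stream(text))

tts = FakeTTSClient()

# Streaming: hand each chunk to the audio device as soon as it arrives.
for chunk in tts.generate_stream("Hello, how can I help you today?"):
    pass  # e.g. audio_out.write(chunk)

# Batch: write a full utterance to disk, e.g. for voiceover work.
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(FakeTTSClient.SAMPLE_RATE)
    f.writeframes(tts.generate("Bonjour tout le monde"))
```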
Real-Time AI Applications and Implications
Kyutai’s model brings speech latency down to roughly 220ms, approaching the threshold of human-perceptible delay. This unlocks:
- Conversational AI: human-like voice interfaces with low turnaround (see the sketch after this list)
- Assistive tech: faster screen readers and voice-feedback systems
- Media production: voiceovers with rapid iteration cycles
- Edge devices: optimized inference for low-power or on-device environments
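The first item deserves unpacking: with streaming TTS, an agent can begin speaking while the language model is still writing its reply, so the perceived turnaround is governed by TTS latency rather than by the length of the full response. The sketch below illustrates that pipelining with toy stand-ins for both components; none of it is Kyutai’s code.

```python
import queue, threading, time

def llm_stream():
    # Stand-in for a streaming language model: words trickle out over time.
    for word in "Sure, here is the weather for today.".split():
        time.sleep(0.05)
        yield word + " "

def tts_stream(text_queue):
    # Stand-in for streaming TTS: synthesis starts on the first words,
    # so audio can begin before the language model has finished writing.
    while (piece := text_queue.get()) is not None:
        yield f"<audio for {piece.strip()!r}>"

q = queue.Queue()

def feed():
    for piece in llm_stream():
        q.put(piece)
    q.put(None)  # sentinel: no more text

threading.Thread(target=feed, daemon=True).start()

start = time.perf_counter()
for chunk in tts_stream(q):
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{elapsed_ms:6.0f} ms  {chunk}")  # first line approximates turnaround
```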
This efficiency also makes single-L40 deployments attractive for cloud environments that need to scale speech services across many users.
Conclusion: Fast, Open, and Ready for Deployment
Kyutai’s streaming TTS release is a landmark in speech AI. With high-quality, low-latency synthesis and a generous license, it addresses the needs of both researchers and real-world product teams. Its reproducibility and multilingual support make it a compelling alternative to proprietary systems.
The model weights are available on Hugging Face, a technical explanation is on Kyutai’s site, and the implementation is on GitHub.


