The world of generative audio is shifting toward efficiency, and open source is a serious contender. The team at nineninesix.ai has just released Kani-TTS-2, a model that marks a significant departure from heavy, expensive TTS systems. By treating audio as language, it delivers high-fidelity speech with a remarkably small footprint.
Kani-TTS-2 is a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions, and getting started is as simple as pulling the weights down.
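As a minimal sketch, the checkpoints can be fetched with the `huggingface_hub` library. The repo id below is an assumption for illustration, so check the nineninesix.ai model cards for the exact EN or PT identifier.

```python
# Minimal sketch: fetching the Kani-TTS-2 weights from Hugging Face.
# The repo id is a hypothetical placeholder -- see the model card for
# the exact identifier of the EN or PT checkpoint.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nineninesix/kani-tts-2-en")  # assumed id
print(f"Model files downloaded to: {local_dir}")
```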
LFM2 Architecture and NanoCodec
Kani-TTS-2 carries forward the ‘Audio-as-Language’ philosophy. Rather than using a traditional mel-spectrogram pipeline, the model converts audio into discrete tokens with a neural codec.
It works in two stages:
- The language backbone: The model is built on LiquidAI’s LFM2 (350M) architecture. This backbone generates ‘audio intent’ by predicting the next audio tokens, and LFMs are designed to be more efficient than standard transformers.
- The neural codec: NVIDIA’s NanoCodec decodes those tokens into a 22kHz waveform.
By using this architecture, the model captures human-like prosody—the rhythm and intonation of speech—without the ‘robotic’ artifacts found in older TTS systems.
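The data flow can be pictured as a toy two-stage pipeline. The classes below are hypothetical stand-ins, not the model’s actual interfaces; they only illustrate the token-in-the-middle design described above.

```python
# Conceptual sketch of the two-stage "Audio-as-Language" pipeline.
# ToyBackbone and ToyCodec are hypothetical stand-ins illustrating the
# data flow; the real LFM2 backbone and NVIDIA NanoCodec APIs differ.
import numpy as np

SAMPLE_RATE = 22_050  # Kani-TTS-2 decodes to 22kHz audio

class ToyBackbone:
    """Stage 1: predicts a sequence of discrete audio tokens from text."""
    def generate_tokens(self, text: str) -> np.ndarray:
        # Placeholder: the real backbone predicts tokens autoregressively.
        rng = np.random.default_rng(0)
        return rng.integers(0, 4096, size=50 * len(text.split()))

class ToyCodec:
    """Stage 2: decodes discrete tokens back into a waveform."""
    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Placeholder: the real codec reconstructs 22kHz audio from tokens.
        frames_per_token = SAMPLE_RATE // 100  # assume ~10ms per token
        return np.zeros(len(tokens) * frames_per_token, dtype=np.float32)

tokens = ToyBackbone().generate_tokens("Hello from Kani-TTS-2")
audio = ToyCodec().decode(tokens)
print(f"{len(tokens)} tokens -> {len(audio) / SAMPLE_RATE:.2f}s of audio")
```

The key design choice is that a raw waveform never passes between the stages, only compact discrete tokens, which is what keeps the backbone small and fast.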
Efficiency: 10,000 Hours in Just 6 Hours
The training metrics behind Kani-TTS-2 are a lesson in optimization. The English model was trained on 10,000 hours of high-quality speech data.
As impressive as the dataset size is, the training time is the real headline: the researchers trained the model in just 6 hours on a cluster of eight NVIDIA H100 GPUs. Paired with an efficient architecture like LFM2, a massive dataset no longer demands weeks of compute.
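A quick back-of-the-envelope calculation puts those numbers in perspective. Assuming a single pass over the data (the epoch count is not reported), the implied throughput is roughly 208 hours of audio per GPU-hour.

```python
# Throughput implied by the reported training numbers.
dataset_hours = 10_000   # hours of training speech
wall_clock_hours = 6     # reported training time
num_gpus = 8             # NVIDIA H100 cluster

gpu_hours = wall_clock_hours * num_gpus
# Assumes one pass over the data; the epoch count is not reported.
print(f"{dataset_hours / gpu_hours:.0f} hours of audio per GPU-hour")
```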
Zero-Shot Voice Cloning and Performance
For developers, the standout feature is zero-shot voice cloning. Kani-TTS-2 requires no fine-tuning for new voices; it relies on speaker embeddings instead.
- What it does: You provide a short reference audio clip.
- What you get: The model extracts the voice’s unique characteristics and synthesizes your text in that voice (a sketch of this flow follows the list).
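From the caller’s side, the flow looks roughly like the sketch below. `KaniTTS`, `embed_speaker`, and `synthesize` are hypothetical placeholders rather than the model’s actual API; only the reference-clip-to-embedding-to-audio flow is taken from the description above.

```python
# Hedged sketch of zero-shot cloning from the caller's perspective.
# KaniTTS and its methods are hypothetical placeholders, not the
# model's real API; the flow mirrors the description above.
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeakerEmbedding:
    vector: np.ndarray  # fixed-size voice fingerprint

class KaniTTS:  # hypothetical wrapper
    def embed_speaker(self, reference_wav: np.ndarray) -> SpeakerEmbedding:
        # Real model: extract timbre/prosody features from a short clip.
        return SpeakerEmbedding(vector=np.zeros(256, dtype=np.float32))

    def synthesize(self, text: str, speaker: SpeakerEmbedding) -> np.ndarray:
        # Real model: condition token generation on the embedding --
        # no fine-tuning involved. (Toy output: 1s of silence at 22kHz.)
        return np.zeros(22_050, dtype=np.float32)

tts = KaniTTS()
ref = np.zeros(3 * 22_050, dtype=np.float32)  # ~3s reference clip
voice = tts.embed_speaker(ref)
audio = tts.synthesize("Cloned in one shot.", voice)
```

Because conditioning happens through the embedding alone, adding a new voice is an inference-time operation rather than a training job.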
The model is also highly accessible to deploy:
- Parameter count: 400M (0.4B) parameters.
- Speed: A Real-Time Factor (RTF) of 0.2, meaning the system can generate 10 seconds of speech in just 2 seconds (see the sketch after this list).
- Hardware: Requires only 3GB of VRAM, so consumer GPUs such as the RTX 4050 and RTX 3060 are fully compatible.
- License: Released under the Apache 2.0 license, allowing commercial use.
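The RTF arithmetic is simple enough to verify in a couple of lines:

```python
# Real-Time Factor (RTF) = generation time / duration of audio produced.
# An RTF of 0.2 means speech is generated 5x faster than real time.
rtf = 0.2
audio_seconds = 10
generation_seconds = rtf * audio_seconds  # -> 2 seconds
print(f"{audio_seconds}s of speech in {generation_seconds:.0f}s "
      f"({1 / rtf:.0f}x real time)")
```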
Key Takeaways
- Efficient Architecture: The model packs 400M parameters on top of LiquidAI’s LFM2 (350M) backbone. The ‘Audio-as-Language’ approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.
- Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours on eight NVIDIA H100 GPUs.
- Instant Cloning: No fine-tuning is required to reproduce a particular voice. Guided by a brief reference clip and speaker embeddings, the model instantly synthesizes text in the target speaker’s voice.
- High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model generates 10 seconds of audio in about 2 seconds, and it needs just 3GB of VRAM, so consumer GPUs like the RTX 3060 are fully compatible.
- Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for integration into commercial products, offering a low-latency, local-first, open-source alternative to paid TTS APIs.
Check out the Model Weights. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. On Telegram? You can now join us there as well.


