The world of generative audio is shifting toward efficiency, and open source is a serious contender. The team at nineninesix.ai has just released Kani-TTS-2, a model that marks a significant departure from heavy, expensive TTS systems. By treating audio as language, it delivers high-fidelity speech with a remarkably small footprint.
Kani-TTS-2 is a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions, and getting started is as simple as pulling the weights down.
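As a minimal sketch, the checkpoints can be fetched with the `huggingface_hub` library. The repo id below is an assumption for illustration, so check the nineninesix.ai model cards for the exact EN or PT identifier.

```python
# Minimal sketch: fetching the Kani-TTS-2 weights from Hugging Face.
# The repo id is a hypothetical placeholder -- see the model card for
# the exact identifier of the EN or PT checkpoint.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nineninesix/kani-tts-2-en")  # assumed id
print(f"Model files downloaded to: {local_dir}")
```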
LFM2 Architecture and NanoCodec
Kani-TTS-2 carries forward the ‘Audio-as-Language’ philosophy. Rather than using a traditional mel-spectrogram pipeline, the model converts audio into discrete tokens with a neural codec.
It works in two stages:
- The language backbone: The model is built on LiquidAI’s LFM2 (350M) architecture. This backbone generates ‘audio intent’ by predicting the next audio tokens, and LFMs are designed to be more efficient than standard transformers.
- The neural codec: NVIDIA’s NanoCodec decodes those tokens into a 22kHz waveform.
By using this architecture, the model captures human-like prosody—the rhythm and intonation of speech—without the ‘robotic’ artifacts found in older TTS systems.
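The data flow can be pictured as a toy two-stage pipeline. The classes below are hypothetical stand-ins, not the model’s actual interfaces; they only illustrate the token-in-the-middle design described above.

```python
# Conceptual sketch of the two-stage "Audio-as-Language" pipeline.
# ToyBackbone and ToyCodec are hypothetical stand-ins illustrating the
# data flow; the real LFM2 backbone and NVIDIA NanoCodec APIs differ.
import numpy as np

SAMPLE_RATE = 22_050  # Kani-TTS-2 decodes to 22kHz audio

class ToyBackbone:
    """Stage 1: predicts a sequence of discrete audio tokens from text."""
    def generate_tokens(self, text: str) -> np.ndarray:
        # Placeholder: the real backbone predicts tokens autoregressively.
        rng = np.random.default_rng(0)
        return rng.integers(0, 4096, size=50 * len(text.split()))

class ToyCodec:
    """Stage 2: decodes discrete tokens back into a waveform."""
    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Placeholder: the real codec reconstructs 22kHz audio from tokens.
        frames_per_token = SAMPLE_RATE // 100  # assume ~10ms per token
        return np.zeros(len(tokens) * frames_per_token, dtype=np.float32)

tokens = ToyBackbone().generate_tokens("Hello from Kani-TTS-2")
audio = ToyCodec().decode(tokens)
print(f"{len(tokens)} tokens -> {len(audio) / SAMPLE_RATE:.2f}s of audio")
```

The key design choice is that a raw waveform never passes between the stages, only compact discrete tokens, which is what keeps the backbone small and fast.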
Efficiency: 10,000 Hours in Just 6 Hours
The training metrics behind Kani-TTS-2 are a lesson in optimization. The English model was trained on 10,000 hours of high-quality speech data.
As impressive as the dataset size is, the training time is the real headline: the researchers trained the model in just 6 hours on a cluster of eight NVIDIA H100 GPUs. Paired with an efficient architecture like LFM2, a massive dataset no longer demands weeks of compute.
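A quick back-of-the-envelope calculation puts those numbers in perspective. Assuming a single pass over the data (the epoch count is not reported), the implied throughput is roughly 208 hours of audio per GPU-hour.

```python
# Throughput implied by the reported training numbers.
dataset_hours = 10_000   # hours of training speech
wall_clock_hours = 6     # reported training time
num_gpus = 8             # NVIDIA H100 cluster

gpu_hours = wall_clock_hours * num_gpus
# Assumes one pass over the data; the epoch count is not reported.
print(f"{dataset_hours / gpu_hours:.0f} hours of audio per GPU-hour")
```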
Zero-Shot Voice Cloning and Performance
For developers, the standout feature is zero-shot voice cloning. Kani-TTS-2 requires no fine-tuning for new voices; it relies on speaker embeddings instead.
- What it does: You provide a short reference audio clip.
- What you get: The model extracts the voice’s unique characteristics and synthesizes your text in that voice (a sketch of this flow follows the list).
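From the caller’s side, the flow looks roughly like the sketch below. `KaniTTS`, `embed_speaker`, and `synthesize` are hypothetical placeholders rather than the model’s actual API; only the reference-clip-to-embedding-to-audio flow is taken from the description above.

```python
# Hedged sketch of zero-shot cloning from the caller's perspective.
# KaniTTS and its methods are hypothetical placeholders, not the
# model's real API; the flow mirrors the description above.
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeakerEmbedding:
    vector: np.ndarray  # fixed-size voice fingerprint

class KaniTTS:  # hypothetical wrapper
    def embed_speaker(self, reference_wav: np.ndarray) -> SpeakerEmbedding:
        # Real model: extract timbre/prosody features from a short clip.
        return SpeakerEmbedding(vector=np.zeros(256, dtype=np.float32))

    def synthesize(self, text: str, speaker: SpeakerEmbedding) -> np.ndarray:
        # Real model: condition token generation on the embedding --
        # no fine-tuning involved. (Toy output: 1s of silence at 22kHz.)
        return np.zeros(22_050, dtype=np.float32)

tts = KaniTTS()
ref = np.zeros(3 * 22_050, dtype=np.float32)  # ~3s reference clip
voice = tts.embed_speaker(ref)
audio = tts.synthesize("Cloned in one shot.", voice)
```

Because conditioning happens through the embedding alone, adding a new voice is an inference-time operation rather than a training job.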
The model is also highly accessible to deploy:
- Parameter count: 400M (0.4B) parameters.
- Speed: A Real-Time Factor (RTF) of 0.2, meaning the system can generate 10 seconds of speech in just 2 seconds (see the sketch after this list).
- Hardware: Requires only 3GB of VRAM, so consumer GPUs such as the RTX 4050 and RTX 3060 are fully compatible.
- License: Released under the Apache 2.0 license, allowing commercial use.
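The RTF arithmetic is simple enough to verify in a couple of lines:

```python
# Real-Time Factor (RTF) = generation time / duration of audio produced.
# An RTF of 0.2 means speech is generated 5x faster than real time.
rtf = 0.2
audio_seconds = 10
generation_seconds = rtf * audio_seconds  # -> 2 seconds
print(f"{audio_seconds}s of speech in {generation_seconds:.0f}s "
      f"({1 / rtf:.0f}x real time)")
```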
Key Takeaways
- Efficient Architecture: The model packs 400M parameters on top of LiquidAI’s LFM2 (350M) backbone. The ‘Audio-as-Language’ approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.
- Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours on eight NVIDIA H100 GPUs.
- Instant Cloning: No fine-tuning is required to reproduce a particular voice. Guided by a brief reference clip and speaker embeddings, the model instantly synthesizes text in the target speaker’s voice.
- High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model generates 10 seconds of audio in about 2 seconds, and it needs just 3GB of VRAM, so consumer GPUs like the RTX 3060 are fully compatible.
- Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for integration into commercial products, offering a low-latency, local-first, open-source alternative to paid TTS APIs.
Check out the Model Weights. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. On Telegram? You can now join us there as well.


