The Chroma real-time dialogue model takes audio input and produces audio output directly, preserving the speaker's identity across many turns of conversation. It is the first end-to-end spoken dialogue system to combine low latency with personalized voice cloning from only a few minutes of reference audio.
The models are trained on speech data rather than text transcriptions. Chroma targets the same applications as commercial real-time agents, but with a 4B-parameter core and a design that treats speaker similarity not as a secondary feature but as a primary goal. It achieves a 10.96% relative improvement in speaker similarity over the human baseline and a Real-Time Factor (RTF) of 0.43, meaning it generates speech more than twice as fast as playback.
From cascaded ASR → LLM → TTS to end-to-end speech-to-speech
Most production assistants use a three-stage pipeline: automatic speech recognition (ASR) converts audio to text, a large language model handles reasoning, and text-to-speech (TTS) synthesizes the reply. The structure is flexible, but it adds latency, and paralinguistic information such as emotion, timbre, and speaking rate is lost once audio is converted to text. That loss of acoustic detail directly hurts the naturalness and fidelity of real-time conversation.
Chroma belongs to the newer class of speech-to-speech (S2S) systems, which map audio to audio through codec tokens. A speech tokenizer (a neural codec) produces quantized audio codes, and a language model reasons over a sequence that interleaves audio codes and text tokens without an explicit intermediate transcription. Prosody and speaker identity therefore condition the entire processing chain.
How is the Reasoner + speech stack organized?
The Chroma system is divided into two subsystems. The Chroma Reasoner handles multimodal understanding and text generation. The speech stack, comprising the Chroma Backbone, the Chroma Decoder, and the Chroma Codec Decoder, converts that output into personalized audio.
The Chroma Reasoner is built on the Thinker module from the Qwen Omni series and uses the Qwen2-Audio encoder. It takes text and audio inputs through shared front ends, combines them with cross-modal attention, and aligns them in time using Time-aligned Multimodal Rotary Position Embedding. It produces a sequence of hidden states that carry both acoustic and linguistic cues.

The Chroma Backbone is a 1B-parameter LLaMA-style model based on Llama 3, following the CSM-1B design. To condition it on the target voice, a short reference audio clip with its transcription and instruction embeddings is prepended to the context. Token embeddings and hidden states from the Reasoner are then provided as one unified context.
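The ordering of that conditioning context can be pictured as a simple concatenation. This is an illustrative sketch, not the paper's actual API; the names and the toy token ids are assumptions:

```python
def build_backbone_context(ref_audio_tokens, ref_text_tokens,
                           instruction_tokens, reasoner_outputs):
    """Prepend the voice reference (clip + transcript) and instructions,
    then append the Reasoner's output as one unified context."""
    return ref_audio_tokens + ref_text_tokens + instruction_tokens + reasoner_outputs

# toy integer ids standing in for real token embeddings / hidden states
ctx = build_backbone_context([1, 2], [3], [4], [5, 6, 7])
```

The reference material comes first so that every generated token can attend to the target voice.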
The system interleaves text and audio tokens on a 1:2 schedule: the Backbone generates two audio tokens for every text token produced by the Reasoner. The model can therefore begin emitting speech as soon as text generation starts, and this interleaving is a major contributor to the short Time to First Token.
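The 1:2 schedule can be sketched as a simple merge of two token streams. A minimal illustration (the function and tuple tagging are my own, not the model's internals):

```python
def interleave(text_tokens, audio_tokens, ratio=2):
    """Emit one text token followed by `ratio` audio tokens, so speech
    can start streaming as soon as text generation begins."""
    out = []
    audio = iter(audio_tokens)
    for t in text_tokens:
        out.append(("text", t))
        for _ in range(ratio):
            nxt = next(audio, None)
            if nxt is None:
                return out
            out.append(("audio", nxt))
    return out

seq = interleave(["hi", "there"], [0, 1, 2, 3])
# pattern: text, audio, audio, text, audio, audio
```

Because audio tokens appear immediately after the first text token, playback can begin before the full sentence is written.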
The Chroma Decoder is a lightweight LLaMA-style model with 100M parameters. The Backbone only predicts codebook 1 of the Residual Vector Quantization (RVQ) stack for each frame; the Decoder takes the Backbone's hidden state and that first codebook, then autoregressively predicts the remaining RVQ levels within the frame. This factorization keeps long-range temporal structure in the Backbone and limits the Decoder to local, per-frame refinement, reducing computation while improving fine-grained prosody.
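The per-frame factorization can be sketched as a loop over RVQ levels. This is a toy stand-in: the real Decoder is a 100M-parameter model, whereas here `predict_level` is an arbitrary deterministic function used only to show the data flow:

```python
def decode_frame(hidden_state, codebook1_token, predict_level, num_levels=8):
    """Autoregressively fill in the remaining RVQ levels of one frame,
    given the Backbone's hidden state and its level-1 code."""
    codes = [codebook1_token]
    for level in range(1, num_levels):
        # the real model runs a lightweight LLaMA here; this is a stand-in
        codes.append(predict_level(hidden_state, codes, level))
    return codes

# toy predictor: each level derived from the previous code (illustrative only)
frame = decode_frame(hidden_state=7, codebook1_token=42,
                     predict_level=lambda h, codes, lvl: (codes[-1] + lvl) % 1024)
```

Each level conditions on all previously predicted levels of the same frame, but never on future frames, which is what keeps the refinement local and cheap.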
The Chroma Codec Decoder maps the coarse and fine codes back to waveform samples. It follows the Mimi vocoder's decoder design but uses a causal neural network, so every output sample depends only on past context, which is what streaming requires. The codec uses 8 codebooks, which limits the amount of autoregressive refinement the Decoder must perform while preserving enough detail for voice cloning.
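To see why stacked codebooks preserve detail, consider a toy one-dimensional residual quantizer: each stage quantizes whatever error the previous stages left behind, so the approximation improves stage by stage. The scalar codebooks below are invented for illustration; real codecs use learned vector codebooks:

```python
def rvq_encode(x, codebooks):
    """Residual quantization: each codebook quantizes what the
    previous stages failed to capture."""
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]
    return codes, residual  # residual = remaining approximation error

# coarse-to-fine scalar codebooks (toy stand-ins for learned ones)
books = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25], [-0.06, 0.0, 0.06]]
codes, err = rvq_encode(0.8, books)
```

With only three stages the error already shrinks from 0.2 after the coarse codebook to about 0.01; eight codebooks push fidelity much further while each code stays cheap to predict.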
Training setup and synthetic speech-to-speech (S2S) data
High-quality speech dialogues with strong reasoning signals are hard to find, so Chroma uses a synthetic speech-to-speech (S2S) pipeline. First, an LLM Reasoner produces textual responses to user questions. A text-to-speech (TTS) system then synthesizes target speech matching a reference audio clip. The Backbone and Decoder are trained on these synthetic pairs for acoustic modeling and voice cloning, while the Reasoner remains frozen and serves as the source of multimodal hidden states and text embeddings.
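The data pipeline above can be sketched as a single pairing function. Here `llm_respond` and `tts_clone` are stand-ins for the real (frozen) LLM and voice-cloning TTS, and the file names are hypothetical:

```python
def make_s2s_pair(user_audio, reference_clip, llm_respond, tts_clone):
    """Synthesize one training pair: the LLM writes the answer text,
    a voice-cloning TTS speaks it in the reference speaker's voice."""
    answer_text = llm_respond(user_audio)
    answer_speech = tts_clone(answer_text, reference_clip)
    return {"input": user_audio,
            "target_text": answer_text,
            "target_speech": answer_speech}

# mock components just to show the data flow
pair = make_s2s_pair("question.wav", "ref.wav",
                     llm_respond=lambda audio: "synthetic answer",
                     tts_clone=lambda text, ref: f"speech({text}|{ref})")
```

Only the speech stack trains on these pairs; the Reasoner that generated the text never updates, so its reasoning ability is not disturbed by the synthetic audio.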
How does voice-cloning quality compare with other systems?
Objective evaluation follows the SEED-TTS-EVAL protocol on English CommonVoice speakers. Operating at 24 kHz, Chroma reaches a speaker-similarity score of 0.81 against a human baseline of 0.73; CosyVoice 3 scores 0.72, and most TTS baselines fall below the human benchmark. Chroma's score represents a 10.96% relative improvement over the human recordings.
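The 10.96% figure follows directly from the two similarity scores:

```python
chroma_sim, human_sim = 0.81, 0.73

# relative improvement over the human baseline, in percent
relative_gain = (chroma_sim - human_sim) / human_sim * 100
# ≈ 10.96
```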

Subjective evaluation compares Chroma with ElevenLabs' eleven_multilingual_v2 model. In naturalness CMOS, listeners preferred ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In speaker-similarity CMOS the two are nearly tied: 42.4% for ElevenLabs, 40.6% for Chroma, with 17.0% ties. In a follow-up test, ElevenLabs was preferred over the original recordings 92.0% of the time.
Real-time and latency behavior
Latency is measured with one concurrent stream. A total generation time of 16.58 seconds for a 38.80-second response gives a Real-Time Factor (RTF) of 0.43. On average, the Reasoner contributes 119.12 milliseconds to the Time to First Token, the Backbone 8.48 milliseconds, and the Decoder 19.27 milliseconds. The Codec Decoder operates on groups of four frames, so a per-component TTFT does not apply to it. The system's overall Time to First Token is 146.87 ms, well under one second and suitable for interactive dialogue.
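Both headline latency numbers are simple arithmetic over the reported measurements:

```python
# Real-Time Factor: generation time divided by audio duration
gen_time_s, audio_len_s = 16.58, 38.80
rtf = gen_time_s / audio_len_s        # < 1 means faster than real time
# ≈ 0.43

# Time to First Token: sum of the per-component contributions (ms)
ttft_ms = 119.12 + 8.48 + 19.27       # Reasoner + Backbone + Decoder
# ≈ 146.87
```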

Speaking and Reasoning Benchmarks
Chroma is evaluated on the URO-Bench basic track. With only 4B parameters, it achieves an average score of 57.34%. GLM-4-Voice, a 9B-parameter model, leads with 69.09%; Chroma ranks second and outperforms several 0.5B and 7B baselines across many dimensions. It scores 71.14% on Storal, 51.69% on TruthfulQA, and 22.74% on GSM8K. Its strongest spoken-conversation results are 60.26% on MLC and 62.07% on CommonVoice.

In this comparison, Chroma stands out as the only system that offers personalized voice cloning; the others focus solely on reasoning and spoken dialogue. Chroma delivers competitive reasoning while still providing high-quality, real-time voice customization.
The Key Takeaways
- Real-time end-to-end speech: Chroma is a 4B-parameter spoken dialogue model that maps directly from speech to speech using codec tokens, avoiding separate ASR and TTS stages and preserving prosody.
- Speech + Reasoner stack: It uses RVQ codes and a 1:2 interleaved schedule of text and audio tokens, which enables streaming and a low Time to First Token.
- Strong personalized voice cloning: On CommonVoice speakers, Chroma reaches a SEED-TTS-EVAL speaker-similarity score of 0.81, a 10.96% relative improvement over the human baseline (0.73) and ahead of CosyVoice 3.
- Faster-than-real-time generation with sub-second latency: The model generates 38.80 seconds of audio in 16.58 seconds, a Real-Time Factor of 0.43, roughly 2x faster than playback.
- Competitive reasoning with cloning as a unique feature: Chroma scores 57.34% overall on the URO-Bench basic track, with strong results on Storal, TruthfulQA, GSM8K, MLC, and CommonVoice.

