Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. The model is aimed at low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while maintaining a compact footprint.
What's new: a unified backbone with disentangled audio I/O
LFM2-Audio extends the 1.2B-parameter LFM2 language model to treat audio and text as first-class sequence tokens. Crucially, it disentangles its audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input side while keeping generation autoregressive in both modalities on the output side.
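The disentangled I/O can be made concrete with shapes. In the sketch below, the 80 ms chunking, the 8 codebooks, and the 2049-entry codebook size come from the article; the 16 kHz sample rate, the embedding width, and the random projection and codes are illustrative stand-ins for the real encoder and decoder.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed; typical for speech encoders
CHUNK_MS = 80          # input chunk size from the article
EMBED_DIM = 2048       # hypothetical backbone embedding width
NUM_CODEBOOKS = 8      # Mimi codebooks on the output side
CODEBOOK_SIZE = 2049   # per-codebook vocabulary from the model card

def encode_input(waveform: np.ndarray) -> np.ndarray:
    """Continuous input path: chunk raw samples and project each
    80 ms chunk to an embedding (projection is a stand-in)."""
    chunk = SAMPLE_RATE * CHUNK_MS // 1000              # 1280 samples
    n = len(waveform) // chunk
    chunks = waveform[: n * chunk].reshape(n, chunk)
    proj = np.random.default_rng(0).standard_normal((chunk, EMBED_DIM))
    return chunks @ proj                                # (n, EMBED_DIM)

def decode_output(num_frames: int) -> np.ndarray:
    """Discrete output path: one code per codebook per frame."""
    rng = np.random.default_rng(1)
    return rng.integers(0, CODEBOOK_SIZE, size=(num_frames, NUM_CODEBOOKS))

wave = np.zeros(SAMPLE_RATE * 4)           # 4 s of silence as input
embeddings = encode_input(wave)
codes = decode_output(embeddings.shape[0])
print(embeddings.shape)                    # (50, 2048): continuous in
print(codes.shape)                         # (50, 8): discrete codes out
```

The asymmetry is the point: the input side never quantizes the waveform, while the output side stays token-based so the backbone can generate audio autoregressively.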
The released checkpoint implements:
- Backbone: LFM2 (hybrid conv + attention), 1.2B parameters (LM only)
- Audio encoder: FastConformer (~115M, canary-180m-flash)
- Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)
- Context: 32,768 tokens; vocabulary: 65,536 (text) / 2049 × 8 (audio)
- Precision: bfloat16; license: LFM Open License v1.0; languages: English
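A back-of-envelope calculation puts the 32,768-token context in perspective. The ~12.5 Hz frame rate below is an assumption about the Mimi codec that the article does not state, and it further assumes each codebook entry occupies its own context slot (implementations may pack frames differently), so treat this strictly as an estimate.

```python
# Rough context budget for audio tokens.
FRAME_RATE_HZ = 12.5   # assumed Mimi frame rate, not from the article
NUM_CODEBOOKS = 8      # codebooks per frame, from the model card
CONTEXT = 32_768       # backbone context length

tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS   # audio tokens per second
seconds_of_audio = CONTEXT / tokens_per_second      # if codes fill the context
print(f"{tokens_per_second:.0f} audio tokens/s")
print(f"~{seconds_of_audio / 60:.1f} min of pure audio fits in context")
```

Under those assumptions, audio costs about 100 tokens per second, so a context filled entirely with audio codes holds roughly five and a half minutes of speech.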
Two generation modes for real-time agents
- Interleaved generation for live speech-to-speech, where the model alternates text and audio tokens to minimize perceived latency.
- Sequential generation for ASR and TTS, switching modalities turn by turn.
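The two scheduling patterns can be sketched as token streams. The run lengths and the (modality, token) representation below are illustrative, not the model's actual schedule; the point is only how interleaving lets audio start before the text response is finished.

```python
from itertools import islice
from typing import Iterator

def interleaved(text_tokens: list, audio_tokens: list,
                text_run: int = 1, audio_run: int = 4) -> Iterator[tuple]:
    """Interleaved mode: alternate short runs of text and audio so
    playback can begin before the full text response exists."""
    t, a = iter(text_tokens), iter(audio_tokens)
    while True:
        t_run = list(islice(t, text_run))
        a_run = list(islice(a, audio_run))
        if not t_run and not a_run:
            return
        yield from (("text", x) for x in t_run)
        yield from (("audio", x) for x in a_run)

def sequential(src_tokens: list, src: str,
               dst_tokens: list, dst: str) -> Iterator[tuple]:
    """Sequential mode (ASR/TTS): consume one modality fully, then
    emit the other (e.g. audio in, text out for ASR)."""
    yield from ((src, x) for x in src_tokens)
    yield from ((dst, x) for x in dst_tokens)

stream = list(interleaved(["Hi", "!"], [101, 102, 103, 104]))
print(stream[:3])   # [('text', 'Hi'), ('audio', 101), ('audio', 102)]
```

In the interleaved stream, the first audio token arrives right after the first text token, which is the property the latency numbers in the next section measure.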
Liquid AI ships a Python package, liquid-audio, together with a Gradio demo that reproduces these behaviors.
Latency:
Liquid AI reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response, a proxy for perceived responsiveness in interactive use, and states that this is faster than other models smaller than 1.5B parameters under their setup.
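That time-to-first-audio metric is straightforward to measure for any streaming model. The harness below works on any iterator of (modality, token) pairs; the dummy stream and its 50 ms sleep are stand-ins, not the model's actual behavior.

```python
import time
from typing import Iterable, Optional

def time_to_first_audio(stream: Iterable[tuple]) -> Optional[float]:
    """Wall-clock seconds from the start of generation until the
    first audio token appears in the stream, or None if none does."""
    start = time.perf_counter()
    for modality, _token in stream:
        if modality == "audio":
            return time.perf_counter() - start
    return None

def dummy_stream():
    """Stand-in for a real streaming model."""
    yield ("text", "Sure")
    time.sleep(0.05)            # simulated compute before audio starts
    yield ("audio", 1234)

latency = time_to_first_audio(dummy_stream())
print(f"time to first audio: {latency * 1000:.0f} ms")
```

Plugging a real generation stream into `time_to_first_audio` gives a number directly comparable to the sub-100 ms figure Liquid AI cites.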
VoiceBench and ASR results
On VoiceBench, a suite of nine audio-assistant evaluations introduced in 2024 for LLM-based voice assistants, Liquid AI reports an overall score of 56.78 for LFM2-Audio-1.5B. The blog chart lists per-task numbers (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17), and the comparison table sets these against larger models such as Qwen2.5-Omni-3B and Moshi-7B.
The model card on Hugging Face provides an additional VoiceBench table (with closely related, but not identical, per-task values) and includes classic ASR word error rates (WERs), where LFM2-Audio matches or improves on Whisper-large-v3-turbo on some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 for Whisper-large-v3-turbo, and LibriSpeech-clean 2.03 vs. 2.10.
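For readers unfamiliar with the metric, WER is word-level edit distance divided by the reference length, which is why values like 2.03 are percentages of reference words changed. A minimal implementation (the example sentences are made up, not from the benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (free if equal)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# One substitution ("the" -> "a") over six reference words.
print(round(wer("the cat sat on the mat",
                "the cat sat on a mat"), 3))   # 0.167
```

Production evaluations also normalize text (casing, punctuation) before scoring, which this sketch omits.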

Why does voice AI matter?
Typical “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio's single-backbone design, with continuous input embeddings and discrete output codes, reduces glue code and enables early audio emission. For developers, that means faster response times and simpler pipelines, with one model covering ASR, TTS, and conversational agents. Liquid AI provides code, demo entry points, and distribution through Hugging Face.
Take a look at the GitHub page and the Hugging Face model card for technical details, tutorials, code, and notebooks.
Asif Razzaq, the CEO of Marktechpost Media Inc., is a visionary engineer and entrepreneur dedicated to harnessing the potential of Artificial Intelligence for social good. His most recent venture is Marktechpost, an Artificial Intelligence media platform known for in-depth machine learning and deep learning coverage that is technically sound yet accessible to a broad audience. The platform draws over 2 million monthly views.

