NVIDIA has released Nemotron Speech ASR, its latest streaming English transcription model, built specifically for low-latency voice agents and live captioning. The checkpoint, nvidia/nemotron-speech-streaming-en-0.6b, is available on Hugging Face, and the model is tuned for both batch and streaming workloads on NVIDIA GPUs.
Modeling, input assumptions and architecture
Nemotron Speech ASR (Automatic Speech Recognition) is a 600M-parameter model that pairs a 24-layer FastConformer encoder with an RNNT decoder. The encoder employs aggressive 8x convolutional subsampling, which reduces the number of time steps the attention layers must process. The model expects 16 kHz mono input audio.
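To make the effect of the 8x subsampling concrete, here is a short sketch of the resulting encoder frame rate. The 10 ms feature hop is an assumption (a common default for Conformer-style front ends), not a figure from the model card:

```python
# Encoder time resolution after 8x convolutional subsampling.
# ASSUMPTION: a 10 ms feature hop, a common default for Conformer front ends.
FEATURE_HOP_MS = 10
SUBSAMPLING = 8

step_ms = FEATURE_HOP_MS * SUBSAMPLING  # audio covered by one encoder time step
steps_per_second = 1000 / step_ms       # encoder steps emitted per second of audio

print(step_ms)           # 80 ms per encoder time step
print(steps_per_second)  # 12.5 steps per second
```

Under this assumption each encoder step covers 80 ms of audio, which would explain why the model's chunk sizes are all multiples of 80 ms.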
Attention context sizes can be configured to control runtime latency. The model ships with four chunk configurations, corresponding to 80 ms, 160 ms, 560 ms, and 1.12 s of audio. The mode is controlled by the att_context_size parameter, which can be changed at inference time without retraining.
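A minimal sketch of how the four chunk sizes map onto encoder time steps, assuming 80 ms per step after 8x subsampling (the actual att_context_size values to pass are model-specific and documented with the checkpoint; this only illustrates the chunk-to-step arithmetic):

```python
# Map each supported chunk duration to the number of encoder time steps it spans,
# ASSUMING one encoder step covers 80 ms after 8x subsampling.
STEP_MS = 80
CHUNK_SIZES_MS = [80, 160, 560, 1120]  # the four configurations from the model card

chunk_steps = {ms: ms // STEP_MS for ms in CHUNK_SIZES_MS}
print(chunk_steps)  # {80: 1, 160: 2, 560: 7, 1120: 14}
```

In NeMo-style cache-aware models, switching between these modes at inference time is a matter of setting the att_context_size parameter rather than loading a different checkpoint.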
Cache-aware streaming instead of buffered sliding windows
Traditional streaming ASR often uses overlapping windows: each incoming window reprocesses part of the audio from the previous window to preserve context. This wastes computation and increases latency as concurrency grows.
Nemotron Speech ASR instead keeps a cache of encoder state for its self-attention and convolution layers. The model reuses cached activations rather than recomputing overlapping context. The result:
- Compute scales linearly with audio length, since no frames are reprocessed.
- Memory growth is predictable, since the per-stream cache grows with sequence length rather than with recomputed windows.
- Latency stays stable under heavy load, which voice agents need for reliable turn-taking without interruptions.
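The compute saving can be illustrated with a toy frame count. The chunk and context durations below are made-up illustration values, not the model's actual settings:

```python
import math

def frames_overlap(total_ms: int, chunk_ms: int, context_ms: int) -> int:
    """Audio processed when each window reprocesses `context_ms` of past audio."""
    windows = math.ceil(total_ms / chunk_ms)
    return windows * (chunk_ms + context_ms)

def frames_cached(total_ms: int) -> int:
    """Cache-aware streaming encodes each frame exactly once."""
    return total_ms

# ILLUSTRATION ONLY: 11.2 s of audio, 560 ms chunks, 1.12 s of reprocessed context.
total, chunk, context = 11_200, 560, 1_120
print(frames_overlap(total, chunk, context) / frames_cached(total))  # 3.0x the work
```

Under these toy numbers, the overlapping-window approach does three times the encoder work of the cache-aware one, and the multiplier grows with the amount of context each window must reprocess.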
Accuracy vs latency: WER under streaming constraints
Nemotron Speech ASR is evaluated on Hugging Face OpenASR datasets, including AMI, Earnings22, GigaSpeech, and LibriSpeech. Word error rates (WERs) are reported for the different chunk sizes.
Averaged across these benchmarks, the model achieves:
- around 7.84% WER at a 0.16 s chunk size
- around 7.22% WER at a 0.56 s chunk size
- around 7.16% WER at a 1.12 s chunk size
This illustrates the latency-accuracy tradeoff: larger chunks yield slightly lower WER, but every configuration stays under 8%. Developers can choose the operating point to match the application; for example, 160 ms may suit aggressive voice agents, while 560 ms works well for transcription-focused workflows.
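One way to act on this tradeoff is to pick the largest chunk that fits a latency budget, since WER improves monotonically with chunk size in the reported numbers. The selection helper below is a sketch built on the averaged WERs above, not part of any NVIDIA API:

```python
# Averaged OpenASR WERs reported per chunk size (chunk in ms -> WER in %).
WER_BY_CHUNK_MS = {160: 7.84, 560: 7.22, 1120: 7.16}

def pick_chunk(latency_budget_ms: int) -> int:
    """Return the largest benchmarked chunk that fits the latency budget."""
    candidates = [c for c in WER_BY_CHUNK_MS if c <= latency_budget_ms]
    if not candidates:
        raise ValueError("budget below the smallest benchmarked chunk")
    return max(candidates)

print(pick_chunk(200))   # 160 -> aggressive voice agent
print(pick_chunk(600))   # 560 -> transcription-leaning workflow
```

Because the chunk is just an inference-time setting, this choice can even be made per deployment or per session without touching the model weights.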
Throughput and concurrency on modern GPUs
The cache-aware design has a measurable effect on concurrency. Nemotron Speech ASR supports 560 concurrent streams on an NVIDIA H100 GPU at a 320 ms chunk size, roughly three times the concurrency of a standard streaming system at the same target latency. Benchmarks on the RTX A5000 and DGX B200 show similar throughput gains, with the A5000 achieving more than 5x the concurrent streams and the B200 delivering up to 2x.
Latency must also hold steady as concurrency grows. Modal tested the system with 127 concurrent WebSocket users at the 560 ms chunk size; it maintained a constant median end-to-end latency of 182 ms without drifting. This matters for agents that must stay synchronized over long sessions.
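As a back-of-envelope check on what 560 concurrent streams implies: to serve S real-time streams, the server must process S seconds of audio per wall-clock second, i.e. an aggregate real-time factor of at least S, regardless of chunk size. A sketch of the arithmetic:

```python
def required_aggregate_rtfx(num_streams: int) -> float:
    """Real-time audio arrives at 1 s of audio per wall-clock second per stream,
    so aggregate throughput must be at least num_streams x real time."""
    return float(num_streams)

def chunk_compute_budget_ms(chunk_ms: int, num_streams: int) -> float:
    """Wall-clock budget per chunk per stream if chunks were processed serially:
    num_streams chunks must all finish within one chunk duration."""
    return chunk_ms / num_streams

print(required_aggregate_rtfx(560))       # 560.0x real time on one H100
print(chunk_compute_budget_ms(320, 560))  # under 0.6 ms of compute per 320 ms chunk
```

In practice chunks are batched across streams rather than processed serially, but the aggregate budget is the same; avoiding any recomputation of overlapping context is what keeps that budget attainable.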
Training data and ecosystem integration
Nemotron Speech ASR was trained on a mixture that includes the English-language portion of NVIDIA's Granary data, totaling about 285k hours of audio, along with YouTube-Commons, YODAS2, MOSEL, and Libri-Light. Switchboard, WSJ, and VCTK are also included. The labels combine ASR-generated and human-generated transcripts.
Key Takeaways
- Nemotron Speech ASR is a 0.6B-parameter streaming English model. It pairs a cache-aware FastConformer encoder with an RNNT decoder and operates on 16 kHz mono audio, with streaming chunks as small as 80 ms.
- The model exposes four inference chunk configurations of about 80 ms, 160 ms, 560 ms, and 1.12 s. These let engineers trade latency for accuracy without retraining, while keeping WER between roughly 7.2% and 7.8% on standard ASR benchmarks.
- Cache-aware streaming removes the recomputation of overlapping windows, so each audio frame is encoded only once. This yields roughly three times the concurrent streams on the H100, with similar gains on the RTX A5000 and B200.
- Server-side voice-to-voice latency on an RTX 5090 is approximately 500 ms, of which ASR is only a fraction of the total latency budget.
- Nemotron Speech ASR is released under a permissive NVIDIA open model license, with unrestricted weights and published training details. Teams can self-host and fine-tune it in their own stacks for speech applications and low-latency voice agents.
Check out the model weights on Hugging Face.
Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary engineer and entrepreneur, he is dedicated to harnessing the potential of Artificial Intelligence for social good. His latest venture is Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform draws over 2 million monthly views.

