IBM released two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.
Both target a problem enterprise AI teams know well: production-grade automatic speech recognition (ASR) systems typically demand massive compute to maintain accuracy, or compromise on accuracy to keep costs down. IBM's bet is that careful architectural decisions can deliver both.
The Models That Actually Work
Granite Speech 4.1 2B is a compact, efficient, multilingual speech model designed for ASR and bidirectional automatic speech translation (AST), covering English, French, German, Spanish, Portuguese, and Japanese. The non-autoregressive version, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR, specifically targeting latency-sensitive deployments, and supports English, French, German, Spanish, and Portuguese, but not Japanese. Teams that need Japanese transcription or speech translation should opt for the standard autoregressive model.
Alongside these two, IBM quietly launched a third variant. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what, and exactly when, is a requirement.
Word Error Rate (WER) is the main metric for transcription quality, and lower is always better: a WER of 5% means roughly 5 of every 100 words are wrong. Granite Speech 4.1 2B scored an average WER of 5.33 on the Open ASR Leaderboard as of April 2026. Drilling into benchmark detail, the model achieves a WER of 1.33 on LibriSpeech clean and 2.5 on LibriSpeech other.
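To make the metric concrete, here is a minimal Python sketch of how WER is typically computed: word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. It is illustrative only; production pipelines usually normalize text first and rely on a library such as jiwer.

```python
# Minimal WER sketch: word-level edit distance over reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```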
The architecture explained
Both models share the same three-component design at a high level — a speech encoder, a modality adapter, and a language model — though the decoding mechanism diverges significantly.
The first component is the speech encoder. It uses 16 Conformer blocks trained with Connectionist Temporal Classification (CTC) and two classification heads, one for graphemic (character-level) outputs and one for BPE units, with frame importance sampling to focus training on the informative parts of the audio. Conformer layers combine attention mechanisms with convolutional layers to capture both global and local patterns. CTC-based training lets the model learn from audio-text pairs without requiring frame-level alignments.
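For intuition, here is a minimal PyTorch sketch of the dual-head CTC idea: one shared encoder trunk feeding two CTC heads whose losses are summed. Every dimension and vocabulary size is a placeholder, a plain TransformerEncoder stands in for the 16 Conformer blocks, and frame importance sampling is omitted.

```python
import torch
import torch.nn as nn

# Illustrative dual-head CTC training step. Dimensions are made up;
# the real model uses Conformer blocks, not a vanilla TransformerEncoder.
T, B, D = 200, 4, 256              # frames, batch size, embedding dim
CHAR_VOCAB, BPE_VOCAB = 64, 1024   # index 0 is the CTC blank in both heads

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4), num_layers=2)
char_head = nn.Linear(D, CHAR_VOCAB)   # graphemic (character-level) head
bpe_head = nn.Linear(D, BPE_VOCAB)     # BPE-unit head
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

frames = torch.randn(T, B, D)          # stand-in acoustic features
enc = encoder(frames)                   # (T, B, D)

# Dummy targets: each head gets its own tokenization of the transcript.
char_tgt = torch.randint(1, CHAR_VOCAB, (B, 50))
bpe_tgt = torch.randint(1, BPE_VOCAB, (B, 20))
in_lens = torch.full((B,), T, dtype=torch.long)

loss = (ctc(char_head(enc).log_softmax(-1), char_tgt, in_lens,
            torch.full((B,), 50, dtype=torch.long))
        + ctc(bpe_head(enc).log_softmax(-1), bpe_tgt, in_lens,
              torch.full((B,), 20, dtype=torch.long)))
loss.backward()
```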
The second component is the speech-text modality adapter. A 2-layer window query transformer (Q-Former) operates on blocks of 15 1024-dimensional acoustic embeddings coming from the last Conformer block, downsampling by a factor of 5 using 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10 and a 10Hz acoustic embedding rate at the LLM. The adapter bridges the gap between continuous audio features and discrete tokens, compressing the former so the language model can handle them. The 160M-parameter Q-Former draws on four hidden encoder layers (4, 8, 12, 16) and concatenates their outputs.
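The windowed-query mechanism is easy to sketch: a few trainable queries cross-attend to each fixed-size window of acoustic embeddings, so every window of 15 frames collapses to 3 vectors. The single attention layer and head count below are placeholders; the real adapter is a 2-layer, 160M-parameter Q-Former that also taps multiple encoder layers.

```python
import torch
import torch.nn as nn

# Conceptual windowed Q-Former: each block of 15 acoustic embeddings is
# compressed to 3 learned queries via cross-attention (a 5x downsample).
D, WIN, NQ = 1024, 15, 3

queries = nn.Parameter(torch.randn(NQ, D))  # trainable query vectors
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

def downsample(acoustic: torch.Tensor) -> torch.Tensor:
    """acoustic: (B, T, D) with T a multiple of WIN -> (B, T * NQ // WIN, D)."""
    B, T, _ = acoustic.shape
    windows = acoustic.reshape(B * (T // WIN), WIN, D)   # one row per block
    q = queries.unsqueeze(0).expand(windows.size(0), -1, -1)
    out, _ = attn(q, windows, windows)                   # queries attend to block
    return out.reshape(B, (T // WIN) * NQ, D)

emb = downsample(torch.randn(2, 150, D))  # (2, 30, 1024): 5x fewer frames
```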
The third component is the language model. Granite Speech 4.1 2B fine-tunes an intermediate checkpoint of granite-4.0-1b with a 128k context length. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor: granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context, adapted with LoRA at rank 128 applied to both attention and MLP layers.
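A rank-128 LoRA over attention and MLP projections might be expressed with Hugging Face's peft library roughly as follows. The target module names and the alpha value are assumptions that depend on granite-4.0-1b-base's actual module tree, not IBM's published recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical sketch: rank-128 LoRA on attention and MLP projections.
# Module names below are assumptions; inspect the real module tree first.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.0-1b-base")
lora = LoraConfig(
    r=128,
    lora_alpha=256,  # assumed scaling; not stated by IBM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention
                    "gate_proj", "up_proj", "down_proj"],     # MLP
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```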
The Tradeoff: Autoregressive vs. Non-Autoregressive
This is where the two models differ most dramatically, with direct implications for production.
In the standard Granite Speech 4.1 2B, text is generated autoregressively: one token at a time, each depending on every token before it. The result is accurate, stable transcripts, with support for AST, keyword-biased ASR, and punctuation.
Granite Speech 4.1 2B-NAR is a different beast. Instead of decoding tokens one at a time, it uses a non-autoregressive LLM editing (NLE) architecture: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and a bidirectional LLM predicts edits (copy, insert, delete, or replace) at all positions simultaneously in a single forward pass. The result is higher accuracy and faster inference compared to autoregressive decoding.
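A toy sketch makes the editing scheme concrete. Below, a noisy CTC draft is interleaved with insertion slots, and a hard-coded list of edit operations stands in for the bidirectional LLM's single-pass predictions; the slot token and operation names are invented for illustration and are not IBM's actual implementation.

```python
# Toy illustration of non-autoregressive LLM editing (NLE). In the real
# model, a bidirectional LLM predicts one edit per position in one pass;
# here the edits are hard-coded to show how a transcript is assembled.
SLOT = "<ins>"

def interleave_slots(draft: list[str]) -> list[str]:
    out = [SLOT]
    for tok in draft:
        out += [tok, SLOT]
    return out

def apply_edits(positions: list[str], edits: list[tuple[str, str | None]]) -> list[str]:
    final = []
    for tok, (op, arg) in zip(positions, edits):
        if tok == SLOT:
            if op == "insert":        # fill the slot with a new token
                final.append(arg)
        elif op == "copy":
            final.append(tok)
        elif op == "replace":
            final.append(arg)
        # op == "delete": emit nothing for this position
    return final

draft = ["the", "kat", "sat"]          # noisy CTC hypothesis
positions = interleave_slots(draft)    # [<ins>, the, <ins>, kat, <ins>, sat, <ins>]
edits = [("noop", None), ("copy", None), ("noop", None),
         ("replace", "cat"), ("noop", None), ("copy", None),
         ("insert", "down")]
print(apply_edits(positions, edits))   # ['the', 'cat', 'sat', 'down']
```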
RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. Engineers should note a practical limitation: the NAR model requires flash_attention_2 for inference, since that backend respects is_causal=False and supports sequence packing.
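In practice, that constraint shows up at load time. Here is a hedged loading sketch; the model ID and auto class mirror how earlier Granite Speech checkpoints are published on Hugging Face, and both should be verified against the actual model card.

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Sketch of loading the NAR model with the required attention backend.
# The model ID and auto class are assumptions; check the model card.
model_id = "ibm-granite/granite-speech-4.1-2b-nar"  # hypothetical ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # required: respects is_causal=False
    device_map="cuda",
)
```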
Data and Infrastructure for Training
The two models were trained on different datasets. The standard model used 174,000 hours of audio for ASR/AST, plus synthetic datasets to support Japanese ASR/AST and keyword-biased ASR/AST. The NAR model was trained on 130,000 hours across its five supported languages, drawing on datasets such as CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.
The infrastructure differences are equally striking. The standard model's training took 30 days (26 days for the encoder, 4 days for the projector) on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (two nodes) for 5 epochs, a much lighter run that reflects the architectural simplicity of editing over full autoregressive generation.
What you need to know
These are the 5 main points to remember:
- IBM releases two new open ASR models — Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive) — both ~2B parameters, and Apache 2.0 licensed.
- The standard model averages a WER of 5.33 on the Open ASR Leaderboard and supports six languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing, making it competitive with models several times its size.
- The NAR model trades capabilities for speed: it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.
- Three core components make up the architecture — a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
- A third variant, Granite Speech 4.1 2B-Plus, is also available, extending the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.
Check out Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR on Hugging Face.

