Google's Health AI team has released MedASR, an open-weights model that converts physician-patient conversations into text.
What MedASR is and where it fits
MedASR is built on the Conformer architecture and has been pre-trained for medical transcription and dictation. It is designed as a starting point for developers building healthcare voice applications, such as radiology transcription tools and visit-note capture systems.
The model accepts single-channel (mono) audio at 16,000 Hz as 16-bit integer waveforms. It produces only text, so its output can feed directly into downstream language models, including generative ones such as MedGemma.
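As an aside on that input format: Transformers-style audio processors typically take floating-point arrays, so 16-bit integer PCM is conventionally scaled into [-1.0, 1.0) before feature extraction. A minimal sketch, assuming that common scaling convention (it is not something the MedASR documentation mandates):

```python
import numpy as np

# MedASR expects mono, 16 kHz, 16-bit integer audio. Dividing int16
# PCM by 32768 maps full scale to [-1.0, 1.0), the usual convention
# before handing samples to a feature extractor.
pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
floats = pcm.astype(np.float32) / 32768.0
print(floats.min(), floats.max())  # negative full scale hits -1.0; positive stays just under 1.0
```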
MedASR is part of the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP, and other domain-specific models. These medical models share similar usage terms and governance.
Domain-specific training data
MedASR was trained on a large corpus of de-identified medical speech: about 5,000 hours of physician conversations and dictations.
The training data pairs audio clips with transcripts and metadata, and subsets are annotated with the medical names of symptoms, medications, and conditions. This helps ensure the model covers clinical terminology and the phrasing used in standard documentation.
The model is English-only, and the majority of the training audio comes from American speakers for whom English is a first language. The documentation notes that performance can degrade for other speaker profiles or noisy microphone conditions, and recommends fine-tuning for those settings.
Architecture and decoding
MedASR uses the Conformer design, which combines self-attention and convolution to capture both local acoustic patterns and longer-range temporal relationships.
The model is exposed through a standard CTC (connectionist temporal classification) interface. In the reference implementation, AutoProcessor converts waveform audio into input features and AutoModelForCTC produces token sequences. Decoding is greedy by default; pairing the model with an external 6-gram language model and beam search (beam size 8) further reduces word error rates.
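To make "greedy CTC decoding" concrete: the model emits per-frame logits over a vocabulary that includes a blank token, and greedy decoding takes the argmax at each frame, collapses consecutive repeats, and drops blanks. A minimal sketch with a toy character vocabulary (MedASR's real tokenizer is different; this only illustrates the decoding rule):

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, vocab: list, blank_id: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    `logits` has shape (time, vocab_size); `vocab` maps ids to characters.
    This mirrors what AutoModelForCTC logits + batch_decode do under the hood.
    """
    ids = logits.argmax(axis=-1)
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Toy example: five frames spelling "cat", with a repeat and a blank.
vocab = ["<blank>", "a", "c", "t"]
frames = np.array([
    [0.1, 0.0, 0.9, 0.0],  # c
    [0.1, 0.0, 0.9, 0.0],  # c (repeat, collapsed)
    [0.9, 0.0, 0.1, 0.0],  # blank (dropped)
    [0.1, 0.9, 0.0, 0.0],  # a
    [0.0, 0.1, 0.0, 0.9],  # t
])
print(ctc_greedy_decode(frames, vocab))  # → cat
```

An external n-gram language model improves on this by rescoring beam-search hypotheses instead of committing to the per-frame argmax.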
MedASR was trained with JAX and ML Pathways on TPU v4p, v5p, and v5e hardware. These systems offer the scale needed to train large speech models and align with Google's broader foundation-model training stack.
Performance on medical speech tasks
The key results, reported as word error rate (lower is better) for greedy decoding and for decoding with a 6-gram language model, are as follows:
- RAD DICT (radiology dictation): MedASR greedy 6.6 percent, MedASR plus language model 4.6 percent, Gemini 2.5 Pro 10.0 percent, Gemini 2.5 Flash 24.4 percent, Whisper v3 Large 25.3 percent.
- General and internal medicine dictation: MedASR greedy 9.3 percent, MedASR plus language model 6.9 percent, Gemini 2.5 Pro 16.4 percent, Gemini 2.5 Flash 27.1 percent, Whisper v3 Large 33.1 percent.
- FM DICT (family medicine dictation): MedASR greedy 8.1 percent, MedASR plus language model 5.8 percent, Gemini 2.5 Pro 14.6 percent, Gemini 2.5 Flash 19.9 percent, Whisper v3 Large 32.5 percent.
- Eye-gaze dictation on 998 MIMIC chest x-ray cases: MedASR greedy 6.6 percent, MedASR plus language model 5.2 percent, Gemini 2.5 Pro 5.9 percent, Gemini 2.5 Flash 9.3 percent, Whisper v3 Large 12.5 percent.
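All of the figures above are word error rates. As a reminder of what that metric computes, here is a minimal sketch: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length.

    This is the standard metric behind ASR benchmark numbers; production
    evaluations usually normalize text (case, punctuation) first.
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One substitution out of four reference words: WER = 0.25.
print(wer("patient denies chest pain", "patient denies test pain"))  # → 0.25
```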
Developer workflow and deployment options
A minimal pipeline example:
from transformers import pipeline
import huggingface_hub

audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")
pipe = pipeline("automatic-speech-recognition", model="google/medasr")
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
For more control, developers can load AutoProcessor and AutoModelForCTC directly: resample audio to 16,000 Hz (for example with librosa), move tensors to CUDA if it is available, run the model forward pass, and decode the output with processor.batch_decode.
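The resampling step can be sketched without any audio library. Below, plain NumPy linear interpolation stands in for a proper resampler such as librosa.resample, and the helper name to_mono_16k is my own, not part of any library:

```python
import numpy as np

def to_mono_16k(waveform: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Prepare audio for MedASR: average channels to mono, then resample
    to 16 kHz. Linear interpolation is a simple stand-in for a real
    resampler (librosa.resample, scipy.signal.resample_poly)."""
    if waveform.ndim == 2:  # (channels, samples) -> mono
        waveform = waveform.mean(axis=0)
    n_out = int(round(len(waveform) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(waveform), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, waveform).astype(np.float32)

# 0.5 s of stereo audio at 44.1 kHz becomes 8,000 mono samples at 16 kHz.
stereo = np.random.randn(2, 22050)
mono16k = to_mono_16k(stereo, orig_sr=44100)
print(mono16k.shape)  # → (8000,)
```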
Key takeaways
- MedASR is a lightweight, open-weights, English-only medical ASR model released under the Health AI Developer Foundations program.
- Roughly 5,000 hours of medical audio provide domain-specific training. MedASR is pre-trained on physician dictations and clinical conversations across specialties, including radiology, family medicine, and internal medicine, giving it stronger coverage of clinical terminology than general ASR systems.
- On medical dictation benchmarks, MedASR achieves competitive or better word error rates. Across the radiology, general and internal medicine, family medicine, and eye-gaze datasets, MedASR with greedy or language-model decoding matches or outperforms larger general models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Whisper v3 Large on English medical speech.
Asif Razzaq is the CEO of Marktechpost Media Inc. An entrepreneur passionate about harnessing Artificial Intelligence for social good, his most recent venture is Marktechpost, a media platform focused on Artificial Intelligence and known for in-depth coverage of machine learning and deep learning news that is technically sound yet accessible to a broad audience. The platform draws over 2 million monthly views.

