Audio AI had an explosive year. Models like OpenAI Whisper, NVIDIA Parakeet, and Mistral Voxtral have advanced automatic speech recognition. NVIDIA's Audio Flamingo 3 has made significant progress in audio understanding. Nari Labs' Dia-1.6B offers text-to-speech with dialogue-grade quality. Meta also shipped its Perception Encoder Audiovisual (PE-AV), a multimodal encoder that learns a common embedding space for audio, text, and video. The frontier has never moved faster.
What’s the catch? The practical knowledge required to actually work with these models (how to fine-tune them, adapt them to new languages, or run efficient inference) is scattered across GitHub issues, research blogs, and private notebooks that never see the light of day. An ML engineer who wants to run zero-shot video classification with PE-AV, or fine-tune Whisper for a new domain, is starting from scratch.
This is the gap smol-audio closes.
What Is smol-audio?
Released under the Apache 2.0 license by the Deep-unlearning team, smol-audio is a repository of Jupyter notebooks, each focused on a specific practical audio AI task. Every notebook is designed to be opened directly in Google Colab, requires no local GPU setup, and is built entirely on the Hugging Face ecosystem: Transformers, Datasets, PEFT, and Accelerate. A 16 GB Colab GPU is adequate for the vast majority of the notebooks.
The “flat repo” design is an intentional choice. smol-audio does not hide complexity or wrap recipes in a framework; instead, it exposes every step. You can read and understand the training loop and the data pipeline without reverse-engineering a library. That transparency makes it an excellent learning tool for early-career engineers.
ASR Fine-Tuning for Whisper, Parakeet, Voxtral, and Granite Speech
ASR fine-tuning is the biggest category on offer, covering four different model families, and each family requires a different approach.
The Whisper notebook covers fine-tuning with Transformers and Datasets. Whisper uses a sequence-to-sequence approach, generating transcripts token by token: familiar territory for anyone who has worked with language models.
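To make the seq2seq setup concrete, here is a minimal sketch (not the notebook's actual collator) of one detail that trips people up: padded label positions must be set to -100, the ignore index of PyTorch's cross-entropy loss, so padding does not contribute to the training signal.

```python
# Hypothetical sketch of label padding for seq2seq ASR fine-tuning.
# Positions set to -100 are ignored by the cross-entropy loss.

IGNORE_INDEX = -100

def pad_labels(label_batches):
    """Pad variable-length token-id lists to a rectangle with -100."""
    max_len = max(len(labels) for labels in label_batches)
    return [
        labels + [IGNORE_INDEX] * (max_len - len(labels))
        for labels in label_batches
    ]

padded = pad_labels([[50258, 440, 11], [50258, 7]])
print(padded[1])  # [50258, 7, -100]
```

The token ids above are placeholders; real batches come from the Whisper tokenizer.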
NVIDIA’s Parakeet uses a CTC (Connectionist Temporal Classification) architecture rather than sequence-to-sequence. CTC is faster and lighter at inference than autoregressive decoding, but it requires aligning audio frames with output tokens. The smol-audio notebook covers both full fine-tuning and LoRA for Parakeet.
Mistral Voxtral is neither a Whisper nor a Parakeet. Rather than a traditional ASR encoder-decoder, Voxtral is built on a large language model backbone (Ministral 3B for Voxtral Mini, Mistral Small 3.1 24B for Voxtral Small), making it an LLM-based speech understanding model. smol-audio supports both full fine-tuning and LoRA for ASR. Prompt masking matters here precisely because of this LLM architecture: when a model accepts text prompts alongside audio input, you typically do not want to compute loss on the prompt tokens themselves, only on the generated transcription. Getting this wrong degrades training dynamics, and a working reference implementation saves a lot of debugging time.
IBM Granite Speech gets a notebook focused on Italian ASR using the YODAS Granary dataset. This example covers more than the model: it demonstrates domain-specific and language-specific tuning on a real-world multilingual speech corpus.
NVIDIA Audio Flamingo 3 for Audio Understanding
Audio Flamingo 3 is NVIDIA’s Large Audio Language Model for reasoning and understanding across speech, music, and sound. The smol-audio notebook fine-tunes it specifically for audio captioning: generating a natural-language description of an audio clip, which is useful for accessibility tooling, content indexing, and retrieval systems. The notebook supports both LoRA and full fine-tuning, so practitioners can optimize for maximum performance or for memory efficiency.
For those new to parameter-efficient fine-tuning: LoRA freezes the model's original weights and injects trainable rank-decomposition matrices into specific layers. For large multimodal models such as Audio Flamingo 3, LoRA reduces GPU memory by an order of magnitude compared with full fine-tuning, which allows iteration on commodity hardware.
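A toy LoRA layer makes the mechanism and the parameter savings concrete. This is a sketch, not the PEFT implementation: the frozen weight W gains a low-rank update B @ A scaled by alpha / r, and B starts at zero so training begins exactly from the base model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4   # toy dimensions and LoRA rank

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable, random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    # base path plus scaled low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B still zero, the LoRA branch contributes nothing yet:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) for LoRA vs d_in * d_out
# for full fine-tuning of this one layer.
print(r * (d_in + d_out), "vs", d_in * d_out)  # 32 vs 64
```

At realistic dimensions (thousands, not eight) the same ratio is what makes a 16 GB Colab GPU viable.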
Dialogue TTS with Dia-1.6B
The Dia-1.6B notebook covers dialogue text-to-speech: the goal is not to synthesize a single speaker, but to create natural conversational exchanges. Dia is a 1.6-billion-parameter TTS model from Nari Labs capable of producing multi-speaker dialogue, making it relevant for anyone building voice agents, podcast generation tools, or conversational interfaces.
Multimodal Inference with Meta’s PE-AV
Perhaps the most advanced notebook covers inference with Meta’s Perception Encoder Audiovisual (PE-AV). PE-AV is a multimodal encoder that learns a single shared embedding space across audio, video, and text, enabling zero-shot video classification without any task-specific fine-tuning, and audio-to-text retrieval on benchmarks like AudioCaps. Because all three modalities map into the same embedding space, you can make cross-modal queries, such as retrieving an audio clip from a text description.
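Once every modality lives in one embedding space, zero-shot classification reduces to nearest-neighbor search. The sketch below, in the spirit of PE-AV but with synthetic stand-in embeddings rather than real encoder outputs, normalizes the vectors and picks the text label with the highest cosine similarity to a video embedding.

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis so dot products become cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(query_emb, label_embs, labels):
    """Return the label whose embedding is most similar to the query."""
    sims = normalize(label_embs) @ normalize(query_emb)
    return labels[int(np.argmax(sims))]

labels = ["dog barking", "guitar solo", "rainfall"]
label_embs = np.eye(3)                  # toy text embeddings, one per label
video_emb = np.array([0.1, 0.9, 0.05])  # toy video embedding

print(zero_shot_classify(video_emb, label_embs, labels))  # guitar solo
```

Swapping the query and candidate roles gives the audio-to-text retrieval direction with the same two functions.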
The notebook shows how to run the inference pipelines directly. That matters because multimodal models combining audio, visual, and text encoders are architecturally more complex than single-modality models, and they require careful preprocessing for each input modality.
Check out the repo here.


