In this tutorial, we demonstrate an advanced WhisperX implementation, exploring word-level timestamps, transcription, and alignment in depth. We set up the environment, load and process audio files, and run the pipeline from transcription through alignment to analysis, all while maintaining memory efficiency. We visualize the results, export them to multiple formats, and extract keywords from the audio for deeper insight. Check out the FULL CODES here.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q matplotlib pandas seaborn
import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')
CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}
print(f"🚀 Running on: {CONFIG['device']}")
print(f"📊 Compute type: {CONFIG['compute_type']}")
print(f"🎯 Model: {CONFIG['model_size']}")
We install WhisperX and the essential libraries, then configure our setup, choosing the device, compute type, model size, batch size, and language to be used in transcription. Check out the FULL CODES here.
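If you are on a CPU-only runtime, you can override the auto-detected settings before any model loads. A minimal sketch; the exact values below are our suggestions, not requirements:

# Assumed overrides for a CPU-only session -- set these before loading any model
CONFIG["device"] = "cpu"
CONFIG["compute_type"] = "int8"  # float16 is generally unavailable on CPU
CONFIG["batch_size"] = 4         # smaller batches keep RAM usage modest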
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("✅ Sample audio downloaded")
    return "sample.mp3"
def load_and_analyze_audio(audio_path):
    """Load audio and display basic info"""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"📁 Audio: {Path(audio_path).name}")
    print(f"⏱️ Duration: {duration:.2f} seconds")
    print(f"🎵 Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration
def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\n🎤 STEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"✅ Transcription complete!")
    print(f"   Language: {result['language']}")
    print(f"   Segments: {total_segments}")
    print(f"   Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result
We download, load, and analyze a sample audio recording, then transcribe it with WhisperX using batched inference with our selected model size and configuration. We print important details such as the detected language, segment count, and total text length. Check out the FULL CODES here.
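Before aligning, it helps to know the shape of what the transcriber returns. The snippet below is a small inspection sketch, assuming `result` comes from transcribe_audio() above; in typical WhisperX output, each segment carries at least "start", "end", and "text" keys:

# Inspect the first transcribed segment (illustrative; assumes `result` exists)
first = result["segments"][0]
print(f"[{first['start']:.2f}s -> {first['end']:.2f}s] {first['text'].strip()}")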
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\n🎯 STEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"✅ Alignment complete!")
        print(f"   Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"⚠️ Alignment failed: {str(e)}")
        print("   Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}
By aligning the transcription, we generate word-level timestamps: we load the alignment model and apply it to the audio, which refines the timing accuracy. Check out the FULL CODES here.
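After alignment, each segment gains a "words" list. Here is a minimal inspection sketch, assuming `aligned_result` comes from align_transcription() above; "score" is the aligner's confidence, and some tokens (e.g., numerals) may lack timestamps, hence the .get() calls:

# Print the first few aligned words with their timings (illustrative)
for seg in aligned_result["segments"][:1]:
    for word in seg.get("words", [])[:5]:
        print(f"{word['word']:>12}  {word.get('start', 0):.2f}s-{word.get('end', 0):.2f}s  score={word.get('score', 0):.2f}")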
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\n📊 TRANSCRIPTION STATISTICS")
    print("="*70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i+1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("="*70)
Our analysis of the transcription includes statistics such as total duration, segment counts, and word and character counts. To better understand pace and flow, we also compute words per minute and the average duration of each word. Check out the FULL CODES here.
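The setup cell installs matplotlib and seaborn, but no plot appears above, so here is a minimal visualization sketch. The helper name plot_segment_timeline is ours, not part of WhisperX; it assumes a result dict in the same format as above:

import matplotlib.pyplot as plt

def plot_segment_timeline(result):
    """Draw each segment as a horizontal bar on a shared timeline (our helper)."""
    segments = result["segments"]
    fig, ax = plt.subplots(figsize=(10, 2 + 0.2 * len(segments)))
    for i, seg in enumerate(segments):
        ax.barh(i, seg["end"] - seg["start"], left=seg["start"], height=0.6)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Segment #")
    ax.set_title("Speech segments over time")
    ax.invert_yaxis()  # first segment on top
    plt.tight_layout()
    plt.show()

# Usage after running the pipeline:
# plot_segment_timeline(aligned_result)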
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\n💾 Results exported to '{output_dir}/' directory:")
    print(f"   ✓ {filename}.json (full structured data)")
    print(f"   ✓ {filename}.srt (subtitles)")
    print(f"   ✓ {filename}.vtt (web video subtitles)")
    print(f"   ✓ {filename}.txt (plain text)")
    print(f"   ✓ {filename}.csv (timestamps + text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
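A quick sanity check of the two formatters (the input value is arbitrary):

# 3661.5 seconds = 1 h, 1 min, 1.5 s
print(format_timestamp(3661.5))      # 01:01:01,500 (SRT uses a comma before milliseconds)
print(format_timestamp_vtt(3661.5))  # 01:01:01.500 (VTT uses a period)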
def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\n📦 Batch processing {len(audio_files)} files...")
    results = {}
    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"❌ Error processing {audio_path}: {str(e)}")
            results[audio_path] = None
    print(f"\n✅ Batch processing complete! Processed {len(results)} files.")
    return results
def extract_keywords(result, top_n=10):
    """Extract most common words from transcription"""
    from collections import Counter
    import re
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                  'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\n🔑 Top {top_n} Keywords:")
    for word, count in word_counts:
        print(f"   {word}: {count}")
    return word_counts
We format the results as clean tables, export transcripts in JSON, SRT, VTT, TXT, and CSV formats, and keep timestamps precise with helper formatters. We can also process audio files in bulk and extract keywords. Check out the FULL CODES here.
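To smoke-test the exporters without running a full transcription, you can feed them a tiny hand-built result; the segment data below is invented for illustration:

# Hypothetical smoke test for export_results -- the segments are made up
fake_result = {
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "Hello world."},
        {"start": 3.0, "end": 5.0, "text": "This is a test."},
    ],
}
export_results(fake_result, output_dir="test_output", filename="demo")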
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    if show_output:
        print("="*70)
        print("🎵 WhisperX Advanced Tutorial")
        print("="*70)
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        result["segments"],
        audio,
        result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "="*70)
        print("📋 TRANSCRIPTION RESULTS")
        print("="*70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df
# Example 1: Process sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)

# Example 2: Display word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)

# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)

# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)

# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")

print("\n✨ Setup complete! Uncomment examples above to run.")
We then run WhisperX from beginning to end, transcribing the audio and aligning it for word-level timestamps. If enabled, the pipeline analyzes the data, extracts keywords, renders a clean results table, and exports everything to multiple formats.
We built an entire WhisperX pipeline that not only transcribes audio but also aligns it with word-level timestamps. We export the results to multiple formats, process files in bulk, and analyze patterns to improve the quality of the output. We now have an easy-to-use, flexible workflow on Colab for audio transcription.
Check out the FULL CODES here. Feel free to browse our GitHub Page for Tutorials, Codes, and Notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram as well.
Asif Razzaq, CEO of Marktechpost Media Inc., is a visionary engineer and entrepreneur dedicated to using Artificial Intelligence (AI) for the greater good. His most recent venture is Marktechpost, a platform devoted to Artificial Intelligence that is well-known for its technical yet accessible coverage of machine learning and deep learning news. With over 2 million views per month, the platform's popularity speaks for itself.

