
StepFun Releases Step-Audio 2 Mini, an Open-Source 8B Speech-to-Speech AI Model That Surpasses GPT-4o Audio

Tech | By Gavin Wallace | 01/09/2025 | 4 min read

The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that provides expressive, real-time audio interaction. Published under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks, surpassing commercial systems such as GPT-4o-Audio.

https://huggingface.co/stepfun-ai/Step-Audio-2-mini

Key Features

1. Unified Audio–Text Tokenization

Rather than chaining separate ASR, LLM, and TTS stages in a cascaded pipeline, Step-Audio 2 uses Multimodal Discrete Token Modeling, in which audio and text tokens share a single modeling stream.

This enables:

  • Reasoning that applies equally to audio and text.
  • On-the-fly switching of voice styles during inference.
  • Consistent semantic, prosodic, and emotional output.
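To make the idea of a single modeling stream concrete, here is a minimal sketch of how text and audio token ids could be placed in one shared id space. It is an illustration only, not StepFun's tokenizer: the vocabulary sizes, the `Segment` type, and both helper functions are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sizes, chosen only for illustration (not the model's real values).
TEXT_VOCAB_SIZE = 1000
AUDIO_CODEBOOK_SIZE = 256


@dataclass(frozen=True)
class Segment:
    modality: str   # "text" or "audio"
    ids: List[int]  # modality-local token ids


def to_shared_ids(segments: List[Segment]) -> List[int]:
    """Flatten interleaved segments into one stream of shared ids.

    Text ids keep their values; audio ids are shifted past the text
    vocabulary so the two ranges never collide.
    """
    stream: List[int] = []
    for seg in segments:
        if seg.modality == "text":
            limit, offset = TEXT_VOCAB_SIZE, 0
        elif seg.modality == "audio":
            limit, offset = AUDIO_CODEBOOK_SIZE, TEXT_VOCAB_SIZE
        else:
            raise ValueError(f"unknown modality: {seg.modality}")
        for token_id in seg.ids:
            if not 0 <= token_id < limit:
                raise ValueError(f"{seg.modality} id out of range: {token_id}")
            stream.append(token_id + offset)
    return stream


def modality_of(shared_id: int) -> str:
    """Recover the modality of a shared id from the range it falls in."""
    if 0 <= shared_id < TEXT_VOCAB_SIZE:
        return "text"
    if TEXT_VOCAB_SIZE <= shared_id < TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE:
        return "audio"
    raise ValueError(f"id outside shared vocabulary: {shared_id}")


if __name__ == "__main__":
    turn = [Segment("text", [5, 7]), Segment("audio", [0, 255])]
    print(to_shared_ids(turn))
```

Because every token lives in one id space, a single sequence model can read and emit text and audio in any interleaving, which is what removes the hand-offs between separate ASR, LLM, and TTS stages.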

2. Expressive and Emotion-Aware Generation

The model doesn’t just transcribe speech; it also interprets paralinguistic features such as pitch, rhythm, and emotion. This lets conversations carry realistic emotions such as sadness or excitement. On the StepEval-Audio-Paralinguistic benchmark, Step-Audio 2 achieves 83.1% accuracy, while GPT-4o Audio (43.5%) and Qwen-Omni (44.2%) both score far lower.

3. Retrieval-Augmented Speech Generation

Step-Audio 2 incorporates multimodal RAG (Retrieval-Augmented Generation):

  • Web search integration for factual grounding.
  • Audio search, a novel capability that retrieves real voices from a large library and fuses them into responses, enabling voice timbre/style imitation at inference time.
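The retrieval step behind an audio search can be pictured as a nearest-neighbour lookup over voice embeddings. The sketch below is a generic illustration under that assumption; the article does not describe how Step-Audio 2 indexes or scores its voice library, and `search_voices`, the library contents, and the two-dimensional embeddings are all made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_voices(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (voice name, similarity) pairs, best match first."""
    scored = [(name, cosine(query, emb)) for name, emb in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-D embeddings; a real system would use learned speaker/style vectors.
    voice_library = {
        "calm": [1.0, 0.0],
        "excited": [0.0, 1.0],
        "whisper": [0.7, 0.7],
    }
    for name, score in search_voices([0.9, 0.1], voice_library, top_k=2):
        print(f"{name}: {score:.3f}")
```

In a full system the retrieved clip would then condition generation so the response takes on that voice's timbre or style; that fusion step is outside the scope of this sketch.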

4. Tool Calling and Multimodal Reasoning

The system goes beyond speech recognition by supporting tool invocation. On benchmarks, Step-Audio 2 matches textual LLMs in tool selection and accuracy, and it uniquely excels at audio search tool calls, a capability unavailable in text-only LLMs.
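Tool invocation generally means the model emits a structured call that the host application parses and routes to a function. The following sketch shows that routing in its simplest form. The JSON shape, the tool names `web_search` and `audio_search`, and the stub bodies are assumptions made for illustration; the article does not specify Step-Audio 2's actual tool-call format.

```python
import json
from typing import Callable, Dict


def web_search(query: str) -> str:
    """Stub: a real implementation would call a search backend."""
    return f"[stub web results for: {query}]"


def audio_search(description: str) -> str:
    """Stub: a real implementation would look up a clip in a voice library."""
    return f"[stub voice clip matching: {description}]"


# Registry mapping tool names (hypothetical) to Python callables.
TOOLS: Dict[str, Callable[..., str]] = {
    "web_search": web_search,
    "audio_search": audio_search,
}


def dispatch(tool_call_json: str) -> str:
    """Parse a JSON tool call and run the matching tool with its arguments."""
    call = json.loads(tool_call_json)
    name = call["name"]
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)


if __name__ == "__main__":
    model_output = '{"name": "audio_search", "arguments": {"description": "warm narrator"}}'
    print(dispatch(model_output))
```

Tool selection corresponds to the `name` field being right, and argument accuracy to the `arguments` object matching the tool's signature.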

Training and Data Scale

  • Text + Audio Corpus: 1.356T tokens
  • Audio Hours: 8M+ hours of real and synthetic audio
  • Speaker Diversity: 50K voices across languages and dialects
  • Pretraining Pipeline: a multi-stage program covering ASR, TTS, speech-to-speech tasks, and emotion-labeled conversational synthesis.

This large-scale training lets Step-Audio 2 Mini retain strong text reasoning (via Qwen2-Audio and CosyVoice) while mastering fine-grained audio modeling.

Performance Benchmarks

Model: https://huggingface.co/stepfun-ai/Step-Audio-2-mini
Paper: https://arxiv.org/abs/2507.16632

Automatic Speech Recognition

  • English: average WER of 3.14%, lower than GPT-4o Transcribe (average 4.5%).
  • Chinese: average CER of 3.08%, significantly lower than GPT-4o and Qwen-Omni.
  • Robust across dialects and accents.
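For readers unfamiliar with the metrics, WER (word error rate) and CER (character error rate) are both an edit distance divided by the reference length, computed over words and characters respectively. The sketch below implements the standard definitions. It is not the evaluation script used for the reported numbers, and real benchmarks usually normalize text (casing, punctuation) first, which this example omits.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; the usual metric for Chinese."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.3f}")
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.3f}")
```

A WER of 3.14% therefore means roughly three word-level errors per hundred reference words.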

Audio Understanding (MMAU Benchmark)

  • Step-Audio 2: average score of 78.0, ahead of Audio Flamingo 3 (73.1) and Omni-R1 (77.0).
  • Strongest on sound and speech reasoning tasks.

Speech Translation

  • CoVoST 2 (S2TT): BLEU 39.26
  • CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).

Conversational Benchmarks (URO-Bench)

  • Chinese conversations: best overall at 83.3 (basic) and 68.2 (pro).
  • English conversations: comparable to GPT-4o (83.9 vs. 84.5) and far ahead of other open models.
Source: Marktechpost.com

Conclusion

Step-Audio 2 Mini makes multimodal speech intelligence available to developers and the research community. By combining Qwen2-Audio's reasoning ability with CosyVoice's tokenization pipeline, and enhancing both with retrieval-based grounding, StepFun has delivered one of the leading open audio LLMs.




Asif Razzaq is the CEO of Marktechpost Media Inc. An entrepreneur, Asif is passionate about harnessing Artificial Intelligence to benefit society. His most recent venture is Marktechpost, a media platform covering machine learning and deep learning news that is both technically sound and easy for a general audience to understand. The platform draws over 2 million views per month.
