AI-trends.today

xAI Releases Standalone Grok Speech to text and Text to speech APIs, Aimed at Enterprise Voice Developers

Tech | By Gavin Wallace | 19/04/2026 | 4 Mins Read

Elon Musk’s AI company xAI has launched two standalone audio APIs — a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API — both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The launch puts xAI in the competitive speech API space currently occupied by ElevenLabs, Deepgram, and AssemblyAI.

What is the Grok Speech to Text API?

Speech-to-text technology converts spoken audio into text. STT is an essential building block for developers creating meeting transcription software, voice agents, or call center analytics. Instead of building it from scratch, developers can call an endpoint, send audio, and receive a structured transcript.

Grok STT is available in both streaming and batch modes. Batch is intended for pre-recorded audio, while streaming enables real-time transcription. The pricing is straightforward: $0.10 per audio-hour in batch mode and $0.20 per audio-hour for streaming.
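As a quick sanity check on the pricing, transcription cost scales linearly with audio hours. The rate constants below are the published figures; the helper itself is just illustrative arithmetic, not part of any xAI SDK:

```python
# Published Grok STT rates (USD per audio-hour).
BATCH_RATE = 0.10
STREAMING_RATE = 0.20

def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated transcription cost in USD for a given amount of audio."""
    rate = STREAMING_RATE if streaming else BATCH_RATE
    return round(audio_hours * rate, 2)

# 500 hours of pre-recorded call-center audio:
print(stt_cost(500))                  # 50.0 (batch)
print(stt_cost(500, streaming=True))  # 100.0 (streaming)
```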

It also supports speaker diarization and multichannel audio. The API accepts 12 audio formats — nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law) — with a maximum file size of 500 MB per request.

Speaker diarization is the process of separating audio by individual speakers — answering the question of who said what. It is essential for recordings with multiple speakers, such as meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word, enabling applications such as subtitle generation, searching within recordings, and legal documentation. Inverse Text Normalization converts spoken forms like “one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents” into readable structured output: “$167,983.15”.
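To make diarization and word-level timestamps concrete, here is a minimal sketch. The response shape — a `words` list with `speaker`, `start`, and `end` fields — is an assumption for illustration, not xAI’s documented schema; the grouping logic shows how such output becomes “who said what” turns:

```python
# Hypothetical diarized STT response: field names are assumed for
# illustration only. Times are in seconds.
response = {
    "text": "hello how can I help",
    "words": [
        {"word": "hello", "speaker": "S1", "start": 0.00, "end": 0.40},
        {"word": "how",   "speaker": "S2", "start": 0.90, "end": 1.05},
        {"word": "can",   "speaker": "S2", "start": 1.05, "end": 1.20},
        {"word": "I",     "speaker": "S2", "start": 1.20, "end": 1.28},
        {"word": "help",  "speaker": "S2", "start": 1.28, "end": 1.60},
    ],
}

def to_speaker_turns(words):
    """Group consecutive words by speaker into 'who said what' turns."""
    turns = []
    for w in words:
        if turns and turns[-1][0] == w["speaker"]:
            turns[-1][1].append(w["word"])
        else:
            turns.append([w["speaker"], [w["word"]]])
    return [(spk, " ".join(ws)) for spk, ws in turns]

turns = to_speaker_turns(response["words"])
# turns == [("S1", "hello"), ("S2", "how can I help")]
```

The same word-level `start`/`end` fields are what a subtitle generator would use to place caption cues.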

Benchmark Performance

xAI’s research team makes bold claims about accuracy. On phone call entity recognition — names, account numbers, dates — Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. If that margin holds up in production, it is significant. On video and podcast transcripts, Grok, ElevenLabs, and AssemblyAI all reported roughly the same 2.4% error rate, with Deepgram trailing at around 3%. xAI also reported a word error rate of 6.9% on audio benchmarks.

https://x.ai/news/grok-stt-and-tts-apis

What is Grok Text-to-Speech API?

Text-to-speech converts text into audio. TTS APIs are used by developers to build voice assistants, read-aloud features, podcasts, interactive voice response (IVR) systems, and accessibility tools.

The Grok TTS API is priced at $4.20 per million characters and provides fast, natural speech synthesis. The API supports up to 15,000 characters per REST request; WebSocket streaming endpoints are available for content longer than that limit, and they begin returning audio before all input has been processed. The API supports 20 languages and five distinct voices — Ara, Eve, Leo, Rex, and Sal — with Eve as the default.

In addition to voice selection, developers can use inline tags and speech wrappers to control delivery. Inline tags such as [laugh], [sigh], and [breath] insert non-verbal sounds directly into the text, while wrapper tags adjust how an enclosed span is spoken. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
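A request to such an API might be built as below. The endpoint URL and the `input`/`voice` field names are assumptions for illustration, not xAI’s documented schema; the 15,000-character limit and the default voice come from the article:

```python
import json

# Assumed endpoint path for illustration; consult xAI's docs for the real one.
API_URL = "https://api.x.ai/v1/tts"

def build_tts_request(text: str, voice: str = "Eve") -> str:
    """Build a JSON TTS request body, enforcing the 15,000-char REST limit."""
    if len(text) > 15_000:
        raise ValueError("REST limit is 15,000 characters; use the "
                         "WebSocket streaming endpoint for longer text.")
    return json.dumps({"input": text, "voice": voice})

# Inline tags such as [laugh] are written directly into the text:
body = build_tts_request("Well [laugh] that went better than expected.")
```

A caller would POST `body` to the endpoint and stream the returned audio; text beyond the REST limit would go over the WebSocket endpoint instead.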

Key Takeaways

  • xAI launches two new audio APIs — Grok Speech-to-Text (STT) and Text-to-Speech (TTS) — built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
  • Grok STT offers both real-time and batch transcription across 25 languages, with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats — priced at $0.10/hour for batch and $0.20/hour for streaming.
  • On phone call entity recognition benchmarks, Grok STT reported a 5.0% error rate, significantly better than ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%). Grok STT performed particularly well on medical, financial, and legal use cases.
  • The Grok TTS API provides five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline speech tags like [laugh] and [sigh] giving developers fine-grained control over vocal delivery — priced at $4.20 per 1 million characters.


