xAI Releases Standalone Grok Speech-to-Text and Text-to-Speech APIs, Aimed at Enterprise Voice Developers

Elon Musk’s AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The launch puts xAI in direct competition with established speech API providers such as ElevenLabs, Deepgram, and AssemblyAI.

What is the Grok Speech-to-Text API?

Speech-to-text technology converts spoken audio into written text. It is an essential building block for developers creating meeting transcription software, voice agents, or call center analytics. Instead of building the capability from scratch, developers can send audio to an endpoint and receive a structured transcript.

Grok STT is available in both batch and streaming modes. Batch mode is intended for pre-recorded audio, while streaming mode transcribes audio in real time. The pricing is straightforward: $0.10 per hour of audio in batch mode and $0.20 per hour in streaming mode.
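To make the pricing concrete, here is a minimal cost-estimate sketch in Python. It assumes billing is strictly pro-rata by audio duration, with no minimum charge or rounding rules; the article does not describe xAI's actual billing granularity, and the function name `estimate_stt_cost` is purely illustrative.

```python
from decimal import Decimal

# Per-hour prices quoted in the article (USD).
STT_PRICE_PER_HOUR = {"batch": Decimal("0.10"), "streaming": Decimal("0.20")}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumption: cost scales linearly with duration (no minimums, no rounding
    to billing increments). Real invoices may differ.
    """
    if mode not in STT_PRICE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * STT_PRICE_PER_HOUR[mode]).quantize(Decimal("0.0001"))


# 1,000 hours of call recordings in each mode.
print(estimate_stt_cost(1000 * 3600, "batch"))      # 100.0000
print(estimate_stt_cost(1000 * 3600, "streaming"))  # 200.0000
```

Under these assumptions, streaming costs exactly twice as much as batch for the same audio, so pre-recorded workloads are cheaper to run in batch mode.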

The API also supports speaker diarization and multichannel audio. It accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.

Speaker diarization is the process of separating audio by individual speakers, answering the question ‘who said what.’ It is essential for recordings with multiple speakers, such as meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word, which enables applications such as subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: ‘$167,983.15.’
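As an illustration of how diarization and word-level timestamps combine, the sketch below merges consecutive words from the same speaker into timed turns. The `Word` structure and its field names are hypothetical; the article does not show the actual response schema of the Grok STT API, so treat this as one plausible shape rather than the real format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical per-word transcript entry (not the real Grok STT schema)."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    speaker: str  # diarization label, e.g. "S1"


def speaker_turns(words: list[Word]) -> list[tuple[str, float, float, str]]:
    """Merge consecutive same-speaker words into (speaker, start, end, text) turns."""
    turns: list[tuple[str, float, float, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            speaker, start, _, text = turns[-1]
            turns[-1] = (speaker, start, w.end, f"{text} {w.text}")
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append((w.speaker, w.start, w.end, w.text))
    return turns


words = [
    Word("Hello", 0.00, 0.40, "S1"),
    Word("there", 0.45, 0.80, "S1"),
    Word("Hi", 1.10, 1.30, "S2"),
]
for speaker, start, end, text in speaker_turns(words):
    print(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
# [0.00-0.80] S1: Hello there
# [1.10-1.30] S2: Hi
```

The same turn boundaries could feed a subtitle generator or a searchable call log, which is why the two features are usually requested together.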

Benchmark Performance

xAI’s research team makes bold claims about accuracy. On phone call entity recognition (names, account numbers, dates), xAI reports a 5.0% error rate for Grok STT versus 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. If this margin holds up in production, it is significant. On video and podcast transcription, Grok, ElevenLabs, and AssemblyAI all tied at a 2.4% error rate, with Deepgram trailing at roughly 3%. xAI also reported a word error rate of 6.9% on audio benchmarks.
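For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. The sketch below implements that standard definition with a word-level edit distance. It is not xAI's evaluation code: the article does not describe how the benchmarks normalize text, and the lowercase-and-split normalization here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; real benchmarks usually apply heavier normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(word_error_rate("the account number is five",
                      "the account number is nine"))  # 0.2
```

An entity error rate like the phone-call figure above is a related but narrower measure, since it counts only mistakes on names, numbers, and dates rather than on every word.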

https://x.ai/news/grok-stt-and-tts-apis

What is the Grok Text-to-Speech API?

Text-to-speech converts written text into spoken audio. Developers use TTS APIs to build voice assistants, read-aloud features, podcasts, interactive voice response (IVR) systems, and accessibility tools.

The Grok TTS API is priced at $4.20 per million characters and provides fast, natural-sounding speech synthesis. It accepts up to 15,000 characters per REST request. For longer content, a WebSocket streaming endpoint is available that begins returning audio before the full input has been processed. The API supports 20 languages and five distinct voices (Ara, Eve, Leo, Rex, and Sal), with Eve set as the default.
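The article points to WebSocket streaming for text beyond the REST limit. Where a client can only use REST, one possible workaround is to split the input into chunks of at most 15,000 characters and send them as separate requests. The sketch below shows only the splitting step; it makes no API calls, the function name `split_for_tts` is illustrative, and whether chunk-by-chunk synthesis sounds seamless is not something the article addresses.

```python
MAX_CHARS_PER_REQUEST = 15_000  # REST limit quoted in the article


def split_for_tts(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers to cut after a sentence end (". ", "! ", "? "), then at a space,
    and only hard-cuts mid-word when no boundary exists in the window.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: list[str] = []
    remaining = text.strip()
    while len(remaining) > limit:
        window = remaining[:limit]
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        if cut != -1:
            cut += 1  # keep the punctuation mark in the current chunk
        else:
            cut = window.rfind(" ")
            if cut <= 0:
                cut = limit  # no boundary found: hard cut
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


print(split_for_tts("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

Cutting at sentence boundaries keeps each request self-contained, which should reduce audible seams when the resulting audio segments are concatenated.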

In addition to voice selection, developers can control delivery with inline tags and wrapping tags. Inline tags such as [laugh], [sigh], and [breath] insert non-verbal sounds at a specific point, while wrapping tags enclose a span of text to change how it is spoken. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.

Key Takeaways

xAI has launched two new audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
Grok STT offers both real-time and batch transcription across 25 languages, with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats. It is priced at $0.10/hour for batch and $0.20/hour for streaming.
On phone call entity recognition benchmarks, xAI reports a 5.0% error rate for Grok STT, significantly better than ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, financial, and legal use cases.
Grok TTS offers five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline speech tags such as [laugh] and [sigh] giving developers fine-grained control over vocal delivery. It is priced at $4.20 per 1 million characters.

Michal Sutter is a data science professional with a Master of Science degree from the University of Padova. He has a strong foundation in statistics, machine learning, and data engineering, and excels at transforming large datasets into actionable insights.
