Elon Musk’s AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The launch puts xAI in the competitive speech API space currently occupied by ElevenLabs, Deepgram, and AssemblyAI.
What is the Grok Speech-to-Text API?
Speech-to-text technology converts spoken audio into written text. STT is an essential building block for developers creating meeting transcription software, voice agents, or call center analytics. Instead of building it from scratch, developers can send audio to an endpoint and receive a structured transcript.
Grok STT is available in both batch and streaming modes. Batch mode is intended for pre-recorded audio, while streaming mode transcribes audio in real time. The pricing is straightforward: Speech-to-Text costs $0.10 per hour of audio in batch mode and $0.20 per hour in streaming mode.
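To make the pricing concrete, here is a minimal sketch of a cost estimate based on the two hourly rates quoted above. The function name `estimate_stt_cost` is hypothetical, and the calculation assumes billing is strictly proportional to audio duration, with no minimum charge or rounding, which the article does not confirm.

```python
# Rates quoted in the article, in US dollars per hour of audio.
STT_RATES_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> float:
    """Estimate the transcription cost in dollars for a given audio duration.

    Assumes purely linear billing (an assumption, not a documented rule).
    """
    if mode not in STT_RATES_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = audio_seconds / 3600
    return hours * STT_RATES_PER_HOUR[mode]


if __name__ == "__main__":
    # 1,000 hours of recorded calls, transcribed in batch mode.
    print(f"${estimate_stt_cost(1000 * 3600, 'batch'):.2f}")
    # A 45-minute live meeting, transcribed in streaming mode.
    print(f"${estimate_stt_cost(45 * 60, 'streaming'):.2f}")
```

Under these assumptions, a thousand hours of batch audio costs about $100, and streaming costs exactly twice as much per hour.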
The API also supports speaker diarization and multichannel audio. It accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
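A client could check a file against these limits before uploading it. The sketch below is illustrative only. The helper `check_upload` is hypothetical, the mapping from format names to file extensions is an assumption, and 500 MB is interpreted as 500 × 1024 × 1024 bytes. Raw PCM, µ-law, and A-law audio has no standard extension, so it is not covered here.

```python
import os

# Container formats named in the article. The extensions are assumed for
# illustration; the API may detect formats differently (e.g. by content).
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}

# 500 MB per request, assuming 1 MB = 1024 * 1024 bytes.
MAX_BYTES = 500 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in CONTAINER_EXTENSIONS:
        problems.append(f"unrecognized container extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_upload("standup.m4a", 42_000_000))
    print(check_upload("notes.txt", 1_000))
```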
Speaker diarization is the process of separating audio by individual speaker, answering the question ‘who said what.’ It is essential for recordings with multiple speakers, such as meetings, interviews, or customer calls. Word-level timestamps assign a precise start and end time to each word, which enables applications such as subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: ‘$167,983.15.’
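As an illustration of how word-level timestamps and speaker labels combine for subtitle generation, the sketch below turns a list of timed words into SRT cues. The input shape (dicts with `word`, `start`, `end`, and `speaker` keys) is a hypothetical stand-in, not the documented Grok STT response format, and both function names are invented for this example.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into cues, starting a new cue when the speaker
    changes or when the current cue reaches max_words."""
    cues, current = [], []
    for w in words:
        speaker_changed = current and w["speaker"] != current[-1]["speaker"]
        if current and (len(current) >= max_words or speaker_changed):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(w["word"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{cue[0]['speaker']}: {text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical transcript fragment with two speakers.
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "S1"},
        {"word": "there", "start": 0.5, "end": 0.9, "speaker": "S1"},
        {"word": "Hi", "start": 1.2, "end": 1.5, "speaker": "S2"},
    ]
    print(words_to_srt(sample))
```

Breaking cues on speaker changes is one simple design choice. A production subtitle tool would likely also split on pauses and line length.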
Benchmark Performance
xAI’s research team makes bold claims about accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. If this margin holds up in production, it is significant. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, followed by Deepgram at 3.0% and AssemblyAI at 3.2%. xAI also reported a word error rate of 6.9% on general audio benchmarks.
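For readers unfamiliar with the metric, word error rate is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the system output, and N is the number of words in the reference. The sketch below computes it with a standard word-level edit distance. It is a generic illustration, not xAI's scoring procedure, whose text normalization and benchmark data the article does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference words.

    Words are split on whitespace and compared exactly; real evaluations
    usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```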

What is the Grok Text-to-Speech API?
Text-to-speech converts written text into spoken audio. Developers use TTS APIs to build voice assistants, read-aloud features, podcast generation, interactive voice response (IVR) systems, and accessibility tools.
The Grok TTS API is priced at $4.20 per million characters and provides fast, natural speech synthesis. It accepts up to 15,000 characters per REST request. For content longer than that limit, a WebSocket streaming endpoint is available, which begins returning audio before the full input has been processed. The API supports 20 languages and five distinct voices (Ara, Eve, Leo, Rex, and Sal), with Eve set as the default.
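The character limit and the per-character price can be combined into a simple pre-flight check. The sketch below is hypothetical: `plan_tts_request` is not part of any xAI SDK, characters are counted with Python's `len` (Unicode code points), and billing is assumed to be strictly linear. The article does not say how the API counts characters or whether speech tags are billed.

```python
# Figures quoted in the article.
TTS_PRICE_PER_MILLION_CHARS = 4.20  # US dollars
REST_MAX_CHARS = 15_000             # per REST request


def plan_tts_request(text: str) -> dict:
    """Pick a transport for the text and estimate its synthesis cost.

    Text within the REST limit goes over REST; longer text is routed to the
    WebSocket streaming endpoint the article mentions.
    """
    n_chars = len(text)  # assumption: one code point counts as one character
    return {
        "characters": n_chars,
        "transport": "rest" if n_chars <= REST_MAX_CHARS else "websocket",
        "estimated_cost_usd": n_chars / 1_000_000 * TTS_PRICE_PER_MILLION_CHARS,
    }


if __name__ == "__main__":
    chapter = "word " * 20_000  # roughly 100,000 characters of sample text
    print(plan_tts_request(chapter))
```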
In addition to voice selection, developers can control delivery with inline tags and wrapping tags. Inline tags such as [laugh], [sigh], and [breath] are placed directly in the text, while wrapping tags enclose a span of text. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
Key Takeaways
- xAI launches two new audio APIs — Grok Speech-to-Text (STT) and Text-to-Speech (TTS) — built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
- Grok STT offers both real-time and batch transcription across 25 languages, with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.
- On phone call entity recognition benchmarks, Grok STT reported a 5.0% error rate, significantly lower than ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%). Grok STT is reported to perform particularly well in medical, financial, and legal use cases.
- The Grok TTS API provides five expressive voices (Ara, Eve, Leo, Rex, and Sal) in 20 languages, with speech tags such as [laugh] and [sigh] giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.


