Conversational AI has lived with a tension from the beginning: the choice between responding fast and responding smart. Real-time speech-to-speech (S2S) models, the kind that power natural-feeling voice assistants, start talking almost instantly, but their answers tend to be shallow. Cascaded systems, which route speech through a large language model (LLM), are far more knowledgeable, but their processing delay can make conversations feel robotic. Researchers at Sakana AI, the Tokyo-based AI lab, introduce KAME (Knowledge-Access Model Expansion), a hybrid architecture that keeps the low response latency of a direct S2S system while injecting richer knowledge from an LLM back-end in real time.
Two Paradigms, Two Trade-offs
To understand KAME, it helps to contrast the two dominant designs it bridges.
Direct S2S models, such as Kyutai's Moshi, are monolithic transformers that continuously consume audio tokens and produce audio tokens. Because they need no synchronization with external systems, response latency is exceptionally low: for many queries, the model starts speaking before the user even finishes the question. The trade-off is capacity. Audio signals have a much higher information density than text, so the model must spend a significant amount of capacity representing paralinguistic features such as tone, rhythm, and emotion, leaving little room for deep reasoning and factual knowledge.
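To make the timing behavior concrete, here is a minimal, purely illustrative simulation of a direct S2S token loop. The function name and dummy token arithmetic are hypothetical stand-ins, not Moshi's actual API; the point is simply that output begins on the very first ~80 ms cycle.

```python
# Illustrative simulation of a direct S2S token loop (hypothetical, not Moshi's API).
FRAME_MS = 80  # one audio-token cycle, roughly 80 ms as described above

def s2s_step(frame: list[int]) -> list[int]:
    """Stand-in for one autoregressive forward pass: audio tokens in, audio tokens out."""
    return [t + 1 for t in frame]  # dummy transformation

user_frames = [[1, 2], [3, 4], [5, 6]]  # simulated incoming audio-token frames
elapsed_ms = 0
for frame in user_frames:
    out = s2s_step(frame)
    elapsed_ms += FRAME_MS
    print(f"t={elapsed_ms:4d} ms  model already speaking: {out}")
# Output starts on the very first cycle: latency is about one frame (~80 ms).
```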
A cascaded system, by contrast, routes the user's speech through an automatic speech recognition (ASR) model, then feeds the text into a powerful LLM. The LLM's response is converted back to speech by a text-to-speech (TTS) engine. The knowledge quality is excellent, since you can plug in any frontier LLM, but the system must wait for the user to finish speaking before ASR and LLM processing can even begin. Median latency is around 2.1 seconds, enough to break normal conversational flow.
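Contrast that with a strictly sequential cascaded pipeline. The sketch below uses hypothetical stage functions and simulated delays to show why time-to-first-audio is the sum of the ASR, LLM, and TTS stages, none of which can start until the utterance ends.

```python
# Illustrative cascaded pipeline (hypothetical stage names and timings).
import time

def asr(audio: str) -> str:
    time.sleep(0.3)               # simulated ASR delay
    return f"transcript of {audio!r}"

def llm(prompt: str) -> str:
    time.sleep(1.5)               # simulated LLM delay: usually the dominant cost
    return f"answer to {prompt!r}"

def tts(text: str) -> bytes:
    time.sleep(0.3)               # simulated TTS time-to-first-audio
    return text.encode()

start = time.perf_counter()
# Nothing can start until the user has finished speaking.
audio_out = tts(llm(asr("full user utterance")))
print(f"time to first audio: {time.perf_counter() - start:.1f} s")  # roughly 2.1 s
```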

KAME's Architecture: Thinking While Speaking
KAME operates as a tandem of two components running in parallel.
1. The S2S front-end module. Built on Moshi's architecture, it processes audio in real time, handling discrete audio tokens every cycle (approximately 80 milliseconds), so it can start generating a spoken response immediately. Internally, Moshi's original three-stream design (input audio, inner monologue text, and output audio) is extended in KAME with a fourth stream, the oracle stream. This is where the innovation lies.
2. The LLM back-end module. It pairs a speech-to-text (STT) component with a large-scale LLM. The STT component builds up a transcript as the user speaks and streams it to the back-end LLM. For each partial transcript it receives, the LLM generates a candidate text response, called an oracle, and streams it back to the front-end. The earliest oracles are educated guesses, but they become more accurate with each new transcript.
The front-end then conditions its speech output on both its internally built context and these incoming oracle tokens. When a new, better oracle arrives, the model can correct course, effectively updating its response mid-sentence the way a human might. Because the two modules run asynchronously and independently of each other, the response begins almost instantly; the sketch below illustrates the data flow.
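The tandem can be pictured as two concurrent loops sharing a latest-oracle slot. Everything in this sketch (thread structure, function names, timings) is an assumed illustration of the data flow, not Sakana AI's released implementation:

```python
# Illustrative sketch of KAME's tandem loop (hypothetical, not the released code).
import threading
import time

latest_oracle = {"text": None}   # shared slot the front-end reads each cycle
lock = threading.Lock()

def backend(partial_transcripts):
    """STT + LLM loop: refine an 'oracle' answer as the transcript grows."""
    for partial in partial_transcripts:
        time.sleep(0.4)                          # simulated STT + LLM latency
        oracle = f"best guess given: {partial!r}"
        with lock:
            latest_oracle["text"] = oracle       # newer oracle replaces the old one

def frontend(n_cycles):
    """S2S loop: starts speaking at once, conditioning on whatever oracle exists."""
    for cycle in range(n_cycles):
        with lock:
            hint = latest_oracle["text"]
        # A real model would fold oracle tokens into its fourth stream here.
        print(f"cycle {cycle}: speaking, oracle = {hint}")
        time.sleep(0.08)                         # ~80 ms token cycle

transcripts = ["what is", "what is the capital", "what is the capital of France?"]
t = threading.Thread(target=backend, args=(transcripts,))
t.start()
frontend(n_cycles=20)
t.join()
```

The key property is that the front-end never blocks on the back-end: a slow or missing oracle just means the current cycle proceeds on internal context alone.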
Training with Simulated Oracles
A key training challenge is that no existing dataset contains such oracle signals. The Sakana AI researchers address this with a technique called Simulated Oracle Augmentation. Using a 'simulator' LLM and a standard conversational dataset (user utterance plus ground-truth response), the team generates synthetic oracle sequences that mimic what a real-time LLM would produce at different levels of transcript completeness. They define six hint levels (0–5), ranging from a completely unguided guess at level 0 to the verbatim ground-truth response at level 5. KAME's training data is built from 56,582 synthetic dialogues derived from MMLU-Pro, GSM8K, and HSSBench, then converted to audio by TTS.
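A rough sketch of the augmentation logic, under the assumption (mine, for illustration) that higher hint levels correspond to longer transcript prefixes handed to the simulator LLM, with level 5 returning the ground truth verbatim as the text describes:

```python
# Illustrative sketch of simulated oracle augmentation (prompt logic is hypothetical).

def simulate_oracle(user_utterance: str, ground_truth: str, hint_level: int) -> str:
    """Produce a synthetic oracle for one hint level (0 = unguided, 5 = verbatim truth)."""
    if hint_level == 5:
        return ground_truth                       # level 5: exact ground-truth response
    prefix_frac = hint_level / 5                  # more transcript at higher levels
    cutoff = int(len(user_utterance) * prefix_frac)
    partial = user_utterance[:cutoff]             # level 0 sees an empty transcript
    # In the real pipeline a simulator LLM answers from this partial transcript;
    # here a placeholder string stands in for that call.
    return f"[simulator answer given partial transcript: {partial!r}]"

utterance = "What is the boiling point of water at sea level?"
truth = "Water boils at 100 degrees Celsius at sea level."
for level in range(6):
    print(level, simulate_oracle(utterance, truth, level))
```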
Results: Near-Cascaded Quality, Near-Zero Latency
Evaluations on a speech-synthesized subset of the MT-Bench multi-turn Q&A benchmark, specifically the reasoning, STEM, and humanities categories (Coding, Extraction, Math, Roleplay, and Writing were excluded as unsuitable for speech interaction), show a dramatic improvement. Moshi scores an average of 2.05. KAME with gpt-4.1 as the back-end scores 6.43, and KAME with claude-opus-4-1 as the back-end scores 6.23, both at essentially the same latency as Moshi. Unmute, the leading cascaded system (also powered by gpt-4.1), scored 7.70 but had a median latency of 2.1 seconds compared to near zero for KAME.
To isolate back-end capability from timing effects, the research team also directly evaluated the back-end LLM's text responses from the final oracle injection in each KAME session, bypassing the premature-generation problem entirely. These scores averaged 7.79 (reasoning 6.48, STEM 8.34, humanities 8.56), on par with Unmute's 7.70. The gap between KAME and the cascaded system is therefore not a limitation of the back-end LLM's knowledge, but the result of KAME starting to speak before the entire user question has been heard.
KAME is also back-end agnostic. The front-end was trained with gpt-4.1-nano as its back-end, yet the back-end can be swapped at inference time to claude-opus-4-1 or gemini-2.5-flash with no retraining. In Sakana AI's experiments, claude-opus-4-1 tended to outperform gpt-4.1 on reasoning tasks, while gpt-4.1 scored higher on humanities questions, suggesting practitioners can route queries to the most task-appropriate LLM without touching the front-end model.
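In practice, back-end swapping could be as simple as a configuration lookup. The model identifiers below are the ones named in this article, but the client abstraction is assumed for illustration and is not KAME's released inference code:

```python
# Hypothetical back-end selection at inference time (not KAME's actual config API).
BACKENDS = {
    "reasoning":  "claude-opus-4-1",   # stronger on reasoning in Sakana AI's tests
    "humanities": "gpt-4.1",           # scored higher on humanities questions
    "fast":       "gemini-2.5-flash",  # alternative frontier option
}

def make_backend(task: str):
    model = BACKENDS.get(task, "gpt-4.1-nano")  # training-time default back-end
    # The front-end S2S model is untouched; only the oracle-producing LLM changes.
    return lambda partial_transcript: f"[{model}] oracle for {partial_transcript!r}"

oracle_fn = make_backend("reasoning")
print(oracle_fn("what is the derivative of"))
```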
The Key Takeaways
- KAME balances knowledge and speed in conversational artificial intelligence by running a front-end speech-to-speech model and a back-end LLM asynchronously in parallel — the S2S model responds immediately while the LLM continuously injects progressively refined ‘oracle’ signals in real time, shifting the paradigm from ‘think, then speak’ to ‘speak while thinking.’
- Performance gains without latency costs are significant — KAME raises the MT-Bench score from 2.05 (Moshi baseline) to 6.43, approaching the cascaded system Unmute’s 7.70, while maintaining near-zero median response latency versus Unmute’s 2.1 seconds.
- The architecture is back-end agnostic: the front-end was trained using gpt-4.1-nano but supports plug-and-play swapping of any frontier LLM (gpt-4.1, claude-opus-4-1, gemini-2.5-flash) at inference time with no retraining, enabling task-specific LLM selection based on domain strengths.
Check out the Model Weights, Paper, Inference Code, and Technical Details.

