How can a RAG system be accurate when its retriever and generator are trained as two separate systems, each with its own optimizer? Researchers at Apple, in collaboration with the University of Edinburgh, have released CLaRa (Continuous Latent Reasoning), a retrieval-augmented generation framework (with CLaRa-7B Base, CLaRa-7B Instruct, and CLaRa-7B E2E checkpoints) that compresses documents into continuous memory tokens and performs both retrieval and generation in this shared latent space. The goal is simple: avoid double encoding, shorten the context, and let the generator tell the retriever what matters for the downstream answer.
From raw documents to continuous memory tokens
CLaRa begins with a semantic compressor that appends a small number of learned memory tokens to each document. This learning stage is called Salient Compressor Pretraining (SCP). The base model switches between compressor and generator modes via adapters, and the final-layer hidden states of the memory tokens become the compressed representation of the document.
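As a rough sketch of this mechanism (not the released code; the sizes and the toy "layer" below are hypothetical stand-ins), the compressor appends memory-token embeddings after the document tokens, runs the model, and keeps only the final-layer hidden states at the memory positions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_doc_tokens, n_mem_tokens = 64, 512, 16  # hypothetical sizes: 32x compression

# Learned memory-token embeddings, appended after the document tokens.
mem_embed = rng.normal(size=(n_mem_tokens, d_model))
toy_weights = rng.normal(size=(d_model, d_model)) * 0.1  # stand-in for the model

def toy_compressor(doc_embed: np.ndarray) -> np.ndarray:
    """Forward pass over [document tokens; memory tokens]; keep only the
    final hidden states of the memory-token positions as the document's
    compressed representation."""
    seq = np.concatenate([doc_embed, mem_embed], axis=0)
    hidden = np.tanh(seq @ toy_weights)  # toy single "layer"
    return hidden[-n_mem_tokens:]        # shape: (n_mem_tokens, d_model)

doc = rng.normal(size=(n_doc_tokens, d_model))
memory = toy_compressor(doc)
print(memory.shape)  # (16, 64): 512 token vectors compressed into 16 memory vectors
```

The point of the sketch is the shape logic: however long the document is, only `n_mem_tokens` vectors survive as its representation.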
SCP training draws on roughly 2 million passages from a 2021 Wikipedia dump. Local Qwen 32B models generate three supervision signals per passage: simple QA pairs covering atomic facts, complex QA pairs that link multiple facts into one question to reinforce multi-hop reasoning, and paraphrases that reorganize and compress the text while preserving its semantics. Verification loops check factual consistency and coverage, and can regenerate missing questions or paraphrases for up to ten rounds before a sample is accepted.
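The generate-and-verify loop can be sketched as below; `toy_generate` and `toy_verify` are hypothetical stand-ins for the Qwen-based generation and consistency checks described above, not the paper's actual prompts or models:

```python
def build_supervision(passage, generate, verify, max_rounds=10):
    """Regenerate supervision for a passage until the verifier accepts it,
    up to max_rounds attempts; return the sample and the round it passed."""
    for attempt in range(1, max_rounds + 1):
        sample = generate(passage, attempt)
        if verify(passage, sample):
            return sample, attempt
    return None, max_rounds  # rejected after all rounds

# Toy stand-ins: a "generator" that only succeeds on the third try and a
# "verifier" that checks the sample actually mentions the passage.
def toy_generate(passage, attempt):
    return {"qa": f"Q about {passage}" if attempt >= 3 else "off-topic"}

def toy_verify(passage, sample):
    return passage in sample["qa"]

sample, rounds = build_supervision("Edinburgh", toy_generate, toy_verify)
print(rounds)  # 3
```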
Training uses two losses. A cross-entropy term trains the generator to answer questions and produce paraphrases from memory tokens alone, instruction prefixes, or both. A mean-squared-error term aligns the hidden states of the memory tokens with those of the document tokens. The MSE loss gives modest but consistent gains of 0.3 to 0.6 F1 at compression ratios of 32 and 128, and keeps compressed and original representations in the same semantic space.
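A minimal sketch of the combined objective, assuming a simple weighted sum with a hypothetical coefficient (the paper's exact weighting and pooling may differ), on toy shapes:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Token-level CE over answer/paraphrase tokens generated from memory tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def mse_align(mem_hidden, doc_hidden):
    """Align memory-token hidden states with document-token hidden states
    (here assumed pooled to the same shape)."""
    return ((mem_hidden - doc_hidden) ** 2).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100))      # 8 answer tokens, toy vocab of 100
targets = rng.integers(0, 100, size=8)
mem_h = rng.normal(size=(16, 64))
doc_h = rng.normal(size=(16, 64))

lam = 0.1  # hypothetical weighting, not from the paper
loss = cross_entropy(logits, targets) + lam * mse_align(mem_h, doc_h)
print(float(loss))
```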

Joint retrieval and generation in a shared space
After offline compression, each document is represented only by its memory tokens. CLaRa then trains a query reasoner and an answer generator on the same backbone. A LoRA query adapter maps a question into the same embedding space as the document memory tokens, so retrieval becomes a pure embedding search: the system computes the cosine similarity between the query embedding and each document embedding.
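Once everything lives in one space, retrieval is just a similarity lookup. A sketch with random vectors standing in for the pooled memory-token embeddings and the (hypothetical) LoRA query adapter's output:

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and all document vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64))   # one pooled embedding per document
query_vec = rng.normal(size=64)          # output of the query adapter (stand-in)

scores = cosine_scores(query_vec, doc_vecs)
top5 = np.argsort(-scores)[:5]           # indices of the 5 best-matching documents
print(top5)
```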
The memory tokens of the top-scoring documents are concatenated with the query tokens and fed through the generator adapter. Training uses a standard next-token loss on the answer; no explicit relevance labels are needed. A differentiable top-k selector, built on a straight-through estimator, bridges retrieval and generation: the forward pass uses hard top-k selection, while the backward pass routes gradients through a softmax distribution over document scores into the query reasoner's parameters.
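The straight-through trick can be sketched as follows. In an autograd framework the hard mask would be written as `soft + stop_gradient(hard - soft)` so gradients flow through the softmax; this numpy sketch (with made-up scores and temperature) just shows the two halves:

```python
import numpy as np

def straight_through_topk(scores, k, temp=1.0):
    """Forward: hard top-k selection mask. Backward (conceptually): gradients
    flow through the softmax over scores. Returns both parts."""
    soft = np.exp(scores / temp)
    soft = soft / soft.sum()                 # softmax distribution over documents
    hard = np.zeros_like(scores)
    hard[np.argsort(-scores)[:k]] = 1.0      # hard top-k mask used in the forward pass
    # In an autograd framework: mask = soft + stop_gradient(hard - soft)
    return hard, soft

scores = np.array([2.0, 0.5, 1.7, -0.3, 0.9])  # toy retrieval scores for 5 documents
hard, soft = straight_through_topk(scores, k=2)
print(hard)  # [1. 0. 1. 0. 0.] -- documents 0 and 2 selected
```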
The researchers observed two effects in their gradient analysis. First, the retriever is pushed to assign higher probability to documents that increase the likelihood of the correct answer. Second, because the retriever and generator share the same compressed document representations, generator gradients reshape the latent document space to make it easier to reason over. A logit-lens analysis of query embeddings recovers tokens such as "NFL" and "Oklahoma" for a question about the nephew of Ivory Lee Brown, even though these tokens do not appear in the original query, only in the supporting articles.
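A logit lens simply projects an intermediate vector through the unembedding matrix and reads off the nearest vocabulary tokens. A toy sketch (the five-word vocabulary and near-identity unembedding matrix are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["NFL", "Oklahoma", "the", "of", "football"]          # toy vocabulary
unembed = np.eye(len(vocab)) + 0.05 * rng.normal(size=(len(vocab), len(vocab)))

def logit_lens(hidden_vec):
    """Project a hidden vector onto vocabulary logits; return the top token."""
    logits = unembed @ hidden_vec
    return vocab[int(np.argmax(logits))]

# A query embedding that, by construction here, lies along the "NFL" direction.
hidden = unembed[0]
print(logit_lens(hidden))  # prints "NFL"
```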

Quality of compression and accuracy in QA
The compressor was evaluated on four QA datasets: Natural Questions, HotpotQA, 2WikiMultihopQA, and MuSiQue. At 4x compression in the Normal setting, where the system retrieves the top five Wikipedia 2021 documents, SCP-Mistral-7B reaches an average F1 of 39.86, beating the hard-compression baseline LLMLingua-2 by 5.37 points and the soft-compression baseline PISCO by 1.13 points.
In the Oracle setting, where the gold documents are guaranteed to be in the candidate set, SCP-Mistral-7B at 4x compression achieves an average F1 of 66.76, 17.31 above LLMLingua-2 and 5.35 above PISCO. The compressed representations are even more impressive than that: they outperform a BGE-based text retriever with a full-document Mistral-7B generator by about 2.36 average F1 points for Mistral, and about 6.36 for Phi-4-mini. Well-trained soft compression can beat full-text RAG while shrinking the context by a factor of up to 128.

Under Normal retrieval conditions, however, performance drops at very high compression ratios (above 32). According to the research group, weak document relevance is the key reason: the system bottlenecks on retrieval before compression becomes the limiting factor.
End-to-end QA and retrieval behaviour
For end-to-end QA, CLaRa uses 20 candidate documents per query with compression ratios of 4, 16, and 32. With 16x compression and instruction-tuned initialization, CLaRa Mistral-7B reaches an F1 of 50.89 on Natural Questions and 44.66 on 2WikiMultihopQA in the Normal setting. This is comparable to DRO Mistral-7B, which reads the full text, while CLaRa's document representations are 16 times smaller. On some datasets CLaRa at 16x compression even improves slightly over DRO, for example from 43.65 to 47.18 F1 on 2Wiki.
At 4x compression in the Oracle setting, CLaRa Mistral-7B achieves strong F1 on both Natural Questions and HotpotQA: the generator can fully exploit accurate retrieval even when the evidence is stored in compressed memory tokens. Instruction-initialized CLaRa generally beats pretrained CLaRa in the Normal setting, while in Oracle the gap shrinks because retrieval noise is reduced.
Used as a retrieval reranker under Oracle conditions, CLaRa delivers strong Recall@5. After pretraining at 4x compression, CLaRa Mistral-7B reaches a Recall@5 of 96.21, 10.28 points above the BGE reranker baseline of 85.93, and it outperforms even a fully supervised SupInstruct retriever trained with contrastive relevance labels.
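For reference, Recall@5 is the fraction of queries whose gold document lands in the top five ranked candidates. A small sketch on synthetic scores (the boost applied to gold documents is fabricated so the metric comes out high):

```python
import numpy as np

def recall_at_k(scores, gold_idx, k=5):
    """Fraction of queries whose gold document appears among the top-k scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold_idx, topk)]))

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 20))       # 100 queries, 20 candidate docs each
gold = rng.integers(0, 20, size=100)      # index of the gold document per query
scores[np.arange(100), gold] += 3.0       # make gold docs usually rank high

r5 = recall_at_k(scores, gold, k=5)
print(r5)
```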

What has Apple released recently?
Apple released three models on Hugging Face: CLaRa-7B, CLaRa-7B-Base, and CLaRa-7B-E2E. The CLaRa-7B model is described as an instruction-tuned RAG model with 16x and 128x document compression built into the design. It answers instruction-style questions directly from compressed representations, and it uses Mistral-7B Instruct v0.2 as its base.
The Key Takeaways
- CLaRa compresses each document into a small set of continuously learned memory tokens via QA- and paraphrase-guided compression, preserving key reasoning signals even at 16x to 128x compression.
- The query encoder and the generator are optimized with the same language-model loss, so retrieval and generation are trained jointly rather than as two disjoint systems.
- A differentiable top-k estimator routes gradients from answer tokens back to the retriever, aligning document relevance with answer quality and eliminating the typical disjoint tuning cycle of RAG systems.
- At 4x compression, CLaRa's SCP beats PISCO and LLMLingua-2 on QA benchmarks including Natural Questions, HotpotQA, and 2WikiMultihopQA, and it can outperform full-text BGE/Mistral pipelines in average F1.
- Apple has published three CLaRa-7B checkpoints on Hugging Face: CLaRa-7B, CLaRa-7B-Base, and CLaRa-7B-E2E.
Editor’s Notes
CLaRa is an important step for retrieval-augmented generation because it moves beyond text-only pipelines by building retrieval and generation on semantic compression. The results show that SCP-style compression into embeddings, combined with end-to-end training through a differentiable top-k estimator and a language-model loss, can match or exceed text-based RAG baselines while using far shorter contexts and a simpler retrieval stack. CLaRa makes the case that continuous latent reasoning is a viable alternative to chunk-and-retrieve RAG for real-world workloads.
Check out the Paper, the Model Weights on Hugging Face, and the GitHub Repo.
Asif Razzaq is the CEO of Marktechpost Media Inc. and a visionary engineer and entrepreneur dedicated to harnessing artificial intelligence for social good. His latest venture, Marktechpost, is a media platform covering machine learning and deep learning news in a way that is both technically sound and easy to understand, drawing over 2 million monthly views.

