Google has expanded the Gemini family of models with the launch of Gemini Embedding 2, a second-generation model that replaces the text-only gemini-embedding-001. The model is aimed at AI developers facing cross-modal retrieval and high-dimensional storage challenges in Retrieval-Augmented Generation (RAG) systems. The release marks a technical shift in embedding modeling, moving away from modality-specific pipelines toward a single, unified multimodal latent space.
Native multimodality and interleaved inputs
Gemini Embedding 2's main architectural improvement is the ability to map five distinct media types (text, images, audio, video, and documents) into a single high-dimensional vector space. This eliminates the need for complicated pipelines that previously required separate models for different data types, for example CLIP models for images and BERT models for text.
The model supports interleaved inputs, which allow developers to embed multiple modalities in a single request. This is especially relevant in use cases where text alone does not provide enough context. The input limits are:
- Text: up to 8,192 tokens per request.
- Images: up to 6 images.
- Video: up to 120 seconds of video (MP4, MOV, and other formats).
- Audio: up to 80 seconds of native audio (MP3, WAV, etc.), with no separate transcription step required.
- Documents: up to 6 PDF pages.
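The caps above can be expressed as a small pre-flight check before building a request. The constants mirror the documented limits; the `check_request` helper itself is a hypothetical illustration, not part of any Gemini SDK.

```python
# Documented per-request input limits for Gemini Embedding 2.
LIMITS = {
    "text_tokens": 8192,   # tokens of text
    "images": 6,           # number of images
    "video_seconds": 120,  # seconds of video
    "audio_seconds": 80,   # seconds of audio
    "pdf_pages": 6,        # PDF pages
}

def check_request(text_tokens=0, images=0, video_seconds=0,
                  audio_seconds=0, pdf_pages=0):
    """Return the names of any limits a planned embed request would exceed."""
    values = {
        "text_tokens": text_tokens,
        "images": images,
        "video_seconds": video_seconds,
        "audio_seconds": audio_seconds,
        "pdf_pages": pdf_pages,
    }
    return [name for name, v in values.items() if v > LIMITS[name]]

print(check_request(text_tokens=4000, images=2))         # []
print(check_request(video_seconds=300, audio_seconds=90))  # ['video_seconds', 'audio_seconds']
```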
By processing these inputs natively, Gemini Embedding 2 can capture the semantic relationship between, say, a video frame and its accompanying audio dialogue. The resulting vector can then be compared against text queries using distance metrics such as cosine similarity.
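As a minimal sketch of that comparison step, cosine similarity between two embedding vectors can be computed with NumPy. The toy 3-element vectors below are illustrative stand-ins for real 3,072-dimensional embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for a video-frame embedding and a text-query embedding.
video_vec = [0.9, 0.1, 0.2]
query_vec = [0.8, 0.2, 0.1]
print(round(cosine_similarity(video_vec, query_vec), 3))  # 0.987
```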
Matryoshka Representation Learning (MRL)
Storage and compute costs are the primary bottlenecks for large-scale vector search, and Gemini Embedding 2 is designed to mitigate them through Matryoshka Representation Learning (MRL).
Standard embedding models distribute semantic information evenly across all dimensions, so if a developer truncates a 3,072-dimension vector to 768 dimensions, accuracy typically drops because a large share of that information is discarded. Gemini Embedding 2, by contrast, is trained to concentrate the most crucial semantic information in the earliest dimensions of the vector.
The model defaults to 3,072 dimensions, but Google optimized three specific tiers for production use:
- 3,072: highest precision, for legal, medical, or technical datasets.
- 1,536: a balance between performance and storage efficiency.
- 768: reduced memory footprint, optimized for low-latency retrieval.
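Under the MRL design, switching tiers amounts to truncating the vector and re-normalizing it. A brief sketch (the `truncate_embedding` helper is illustrative, not an SDK function), along with the float32 storage cost of each documented tier:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` MRL dimensions and re-normalize to unit length."""
    v = np.asarray(vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

# float32 storage per vector at each documented tier:
for d in (3072, 1536, 768):
    print(f"{d} dims -> {d * 4} bytes per vector")
```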
MRL enables a 'short-listing' architecture: the system can perform a high-speed, coarse search over millions of items using 768-dimension sub-vectors, then accurately re-rank the candidates using the full 3,072-dimension embeddings. This reduces compute overhead at the retrieval stage without sacrificing accuracy.
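The two-stage pattern can be sketched with random data. The corpus below is simulated (64 dimensions standing in for 3,072, a 16-dimension prefix standing in for 768), so it only demonstrates the mechanics, not the real MRL accuracy behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(m):
    """Scale rows (or a single vector) to unit length."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

FULL, SHORT = 64, 16  # stand-ins for 3,072 and 768 dimensions
corpus = normalize(rng.normal(size=(1000, FULL)))
query = normalize(rng.normal(size=FULL))

# Stage 1: coarse search over truncated, re-normalized prefixes.
corpus_short = normalize(corpus[:, :SHORT])
query_short = normalize(query[:SHORT])
top_k = np.argsort(corpus_short @ query_short)[::-1][:50]

# Stage 2: exact re-ranking of the shortlist with full vectors.
reranked = top_k[np.argsort(corpus[top_k] @ query)[::-1]]
print(reranked[:5])  # document ids, best full-precision match first
```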
Benchmarks: MTEB, long-context retrieval, and analysis
In Google's internal evaluations on the Massive Text Embedding Benchmark (MTEB), Gemini Embedding 2 outperformed its predecessor in two specific areas: retrieval accuracy and robustness to domain shift.
Many embedding models suffer from 'domain drift,' where accuracy drops when moving from generic training data (like Wikipedia) to specialized domains (like proprietary codebases). Gemini Embedding 2 uses a multistage training procedure across multiple datasets to achieve higher performance on specialized tasks.
The model's 8,192-token window is one of its most important specifications for RAG. It allows larger 'chunks' of text to be embedded, preserving the context needed to resolve coreferences and long-range dependencies within a document. This reduces the likelihood of 'context fragmentation,' a common issue where a retrieved chunk lacks the information the LLM needs to generate a coherent answer.
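A larger window mainly changes how documents are chunked upstream of embedding. A minimal sketch, using word counts as a rough proxy for tokens (a real pipeline would use the model's tokenizer; `chunk_words` is a hypothetical helper):

```python
def chunk_words(text, max_words=6000, overlap=200):
    """Split text into overlapping chunks.

    Word counts are a rough stand-in for tokens; with an 8,192-token
    window, a ~6,000-word chunk leaves headroom for tokenizer overhead.
    Overlap keeps long-range context from being cut at chunk borders.
    """
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

doc = "word " * 13000  # a ~13,000-"token" document by the word proxy
print([len(c.split()) for c in chunk_words(doc)])  # [6000, 6000, 1400]
```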
What you need to know
- Native multimodality: Gemini Embedding 2 supports five distinct media types (text, images, audio, video, and documents) within a unified vector space. Interleaved inputs allow a mix of modalities (e.g., an image and its caption text) to be embedded in a single request without separate pipelines.
- Matryoshka Representation Learning (MRL): the model stores the most important semantic information at the beginning of the vector. It defaults to 3,072 dimensions but supports efficient truncation to 1,536 or 768 dimensions with minimal accuracy loss, reducing storage costs and speeding up retrieval.
- Expanded context and performance: the model features an 8,192-token input window, allowing larger text 'chunks' in RAG pipelines. Performance on the Massive Text Embedding Benchmark is significantly improved over its predecessor, with better retrieval accuracy on code, technical documentation, and other specialized domains.
- Task-specific optimization: developers can set the task_type parameter (such as RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, or CLASSIFICATION) so that the mathematical properties of the vector are optimized for the intended operation, improving the 'hit rate' of semantic search.
Gemini Embedding 2 is available in Public Preview via the Gemini API and Vertex AI.

