
Meta AI Open Sourced Perception Audiovisual (PE AV): the Audiovisual Encoder that Powers SAM Audio and Large Scale Multimodal Search

Tech | By Gavin Wallace | 22/12/2025 | 6 Mins Read

Meta researchers have introduced Perception Encoder Audiovisual (PE AV), a family of encoders for joint audio-video understanding. The model learns aligned text, audio, and video representations in a single embedding space through large-scale contrastive training on roughly 100M audio-video pairs accompanied by text captions.

From Perception Encoder to PE AV

Perception Encoder (PE) is Meta’s core vision stack in the Perception Models project. This family of image, video, and audio encoders reaches state of the art on many vision and audio benchmarks through contrastive pretraining. PE Core outperforms SigLIP2 on image tasks and InternVideo2 on video tasks, PE Lang powers the Perception Language Model used for multimodal reasoning, and PE Spatial is tuned for dense prediction tasks such as depth estimation and detection.

PE AV extends this backbone to full alignment of audio, video, and text. It is the audiovisual branch of the Perception Models repository, embedding text, audio, and video into one joint embedding space for cross-modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Separate Towers & Fusion

The PE AV architecture consists of frame encoders, a video encoder, an audio encoder, audio-video fusion encoders, and text encoders.

  • Video path: the PE frame encoder runs on RGB frames, with a temporal encoder applied on top.
  • Audio path: a DAC VAE codec converts raw waveforms into discrete audio tokens at a fixed frame rate, roughly one audio embedding every 40 milliseconds.

These towers feed a fusion encoder that learns a shared representation of both streams, while text encoders project text queries into several specialized spaces. The result is one backbone that can be queried in many different ways: audio or video can be retrieved from text and vice versa, and video can be retrieved from text conditioned on audio.
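A minimal sketch of how such a joint embedding space is queried: the random arrays below are stand-ins for real tower outputs (no PE AV weights or APIs are used), and retrieval reduces to dot products between unit-normalized embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 512

# Hypothetical embeddings from the text, audio, and video towers
# for a batch of 4 clips (random stand-ins for real encoder outputs).
text_emb = l2_normalize(rng.normal(size=(4, dim)))
audio_emb = l2_normalize(rng.normal(size=(4, dim)))
video_emb = l2_normalize(rng.normal(size=(4, dim)))

# Text-to-audio retrieval: similarity matrix, then argmax per query.
sim_t2a = text_emb @ audio_emb.T           # (4 queries, 4 candidates)
best_audio = sim_t2a.argmax(axis=1)

# Text-to-video retrieval works identically against the video tower.
sim_t2v = text_emb @ video_emb.T
best_video = sim_t2v.argmax(axis=1)

print(best_audio.shape, best_video.shape)  # (4,) (4,)
```

The same similarity matrices can be read column-wise for the reverse direction (audio-to-text, video-to-text), which is why one backbone serves many retrieval tasks.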


Data Engine: Synthetic Audiovisual Captions at Scale

The research team proposes a two-stage audiovisual data engine that generates high-quality synthetic captions. In the first stage, a pipeline feeds audio captioners, video captioners, confidence scores, and several weak audio captions into a large language model, which generates three caption types for each clip: one for the audio content, one for the visual content, and one for the joint audiovisual content. An initial PE AV model is then trained on this synthetic supervision.

In the second stage, the initial PE AV model is paired with a Perception Language Model decoder, and the two refine captions jointly to maximize audiovisual correspondence. Together, the two stages produce reliable captions for roughly 100M audio-video pairs: the pretraining set contains 92M clips and fine-tuning adds 32M more.
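The first stage of the engine can be sketched as follows. Everything here is illustrative: the confidence cutoff, the prompt strings, and `summarize_with_llm` (a stub standing in for the actual LLM call) are assumptions, not the team’s implementation.

```python
from dataclasses import dataclass

@dataclass
class ClipCaptions:
    audio: str
    visual: str
    audiovisual: str

def summarize_with_llm(prompt: str) -> str:
    # Stand-in for the large language model described in the paper;
    # here it just echoes a truncated prompt so the sketch runs.
    return prompt[:60]

def caption_clip(weak_audio_caps, video_cap, confidences) -> ClipCaptions:
    # Keep only weak audio captions above a (hypothetical) confidence cutoff.
    kept = [c for c, s in zip(weak_audio_caps, confidences) if s >= 0.5]
    audio = summarize_with_llm("Audio: " + "; ".join(kept))
    visual = summarize_with_llm("Visual: " + video_cap)
    joint = summarize_with_llm(f"Joint: {visual} while {audio}")
    return ClipCaptions(audio, visual, joint)

caps = caption_clip(["dog barking", "loud noise"],
                    "a dog runs in a park", [0.9, 0.2])
print(caps.visual)  # Visual: a dog runs in a park
```

The key idea the sketch preserves is that each clip ends up with three caption slots (audio-only, visual-only, joint), which later supply targets for the three-way contrastive alignment.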

Compared with previous work, which often covered only narrow sound or speech domains, the corpus is designed to balance speech, general sounds, and music, and to span diverse video domains.

The Contrastive Objective Across Ten Modality Pairs

PE AV uses sigmoid-based contrastive losses across text, audio, video, and fused representations. According to the research team, the model optimizes eight contrastive loss pairs during pretraining, including combinations such as audio-text and video-text. Fine-tuning adds two more, for a total of ten loss pairs across modalities and caption types.

This generalizes recent contrastive objectives for tri-modal audio-text training. Aligning all views lets the encoder support correspondence, classification, and retrieval tasks through simple dot-product similarities.
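A sigmoid (SigLIP-style) pairwise contrastive loss can be sketched as below. The temperature, bias, and the list of modality pairs are illustrative assumptions, not PE AV’s actual hyperparameters.

```python
import numpy as np

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss for one modality pair.

    Each (i, j) pair is scored independently: matched pairs (i == j)
    get label +1, all others -1, so no softmax over the batch is needed.
    """
    logits = temperature * emb_a @ emb_b.T + bias     # (n, n)
    labels = 2.0 * np.eye(len(emb_a)) - 1.0           # +1 diag, -1 off-diag
    # -log(sigmoid(labels * logits)) via a numerically stable softplus.
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(8, 64)); b /= np.linalg.norm(b, axis=1, keepdims=True)

# The full objective sums a loss like this over every modality pair used
# in training, e.g. audio-text, video-text, fused-text (this pair list
# is illustrative, not the paper's exact set of eight or ten pairs).
pairs = [(a, b), (b, a)]
total = sum(sigmoid_contrastive_loss(x, y) for x, y in pairs)
print(round(float(total), 3))
```

Because the sigmoid loss scores each pair independently, adding more modality pairs just adds more terms to the sum, which is what makes scaling from eight to ten pairs straightforward.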

Performance Across Speech, Audio, Music, and Video

On benchmarks, PE AV targets zero-shot retrieval and classification across domains. It achieves the best reported performance on audio and video benchmarks when compared against recent audio-text and audio-video-text models, as well as ImageBind and LanguageBind.

Here are some concrete gains:

  • Text-to-audio retrieval on AudioCaps improves from 35.4 R@1 to 45.8 R@1.
  • Clip classification accuracy on VGGSound rises from 36.0% to 47.1%.
  • PE AV reaches 85.6 accuracy on VCTK-style speech retrieval tasks, where older models scored near 0.
  • Text-to-video retrieval on ActivityNet improves from 60.4 R@1 to 66.5 R@1.
  • Zero-shot video classification on Kinetics-400 improves from 76.9 to 78.9, surpassing models up to four times larger.

PE A-Frame: Frame-Level Audio-Text Alignment

Alongside PE AV, Meta releases PE A-Frame for sound event localization. PE A-Frame is an audio-text embedding model that outputs one audio embedding per 40-millisecond frame and a single text embedding per query, and it returns temporal intervals indicating where each event occurs in the audio.

PE A-Frame uses frame-level contrastive learning to align audio frames with text, making it possible to pinpoint specific events, such as speakers, instruments, or transient sounds, within long audio sequences.
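The frame-level matching idea can be sketched as thresholding per-frame similarity to the text query and reading off contiguous intervals. The threshold and the synthetic embeddings are hypothetical, not PE A-Frame’s actual method details.

```python
import numpy as np

def locate_events(frame_embs, text_emb, threshold=0.5, frame_ms=40):
    """Return (start_ms, end_ms) intervals where per-frame similarity
    to the text query exceeds a threshold (hypothetical choice)."""
    sims = frame_embs @ text_emb            # one score per 40 ms frame
    active = sims > threshold
    intervals, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                       # event begins
        elif not on and start is not None:
            intervals.append((start * frame_ms, i * frame_ms))
            start = None                    # event ends
    if start is not None:                   # event runs to the end
        intervals.append((start * frame_ms, len(active) * frame_ms))
    return intervals

# Synthetic example: frames 10-19 (i.e. 400-800 ms) match the query.
text = np.zeros(16); text[0] = 1.0
frames = np.zeros((50, 16)); frames[10:20, 0] = 0.9
print(locate_events(frames, text))  # [(400, 800)]
```

Converting frame indices to milliseconds is just a multiplication by the 40 ms frame period, which is why one embedding per frame suffices for temporal localization.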

Role in the Ecosystem: SAM Audio and the Perception Models

PE AV and PE A-Frame sit in the Perception Models stack alongside the Perception Language Model, supporting multimodal perception, reasoning, and generation.

SAM Audio uses PE AV as its core perception engine, building on PE AV embeddings to connect visual and text prompts with sound sources and to score the quality of separated audio tracks.

Key Takeaways

  • PE AV is a unified encoder for text, audio, and video, trained on more than 100M audio-video pairs.
  • It uses separate towers for video and audio, with PE-based visual encoding and DAC VAE audio tokenization, followed by audiovisual fusion encoders and specialized text encoders aligned to the different modality pairings.
  • The data engine uses weak captioners plus an LLM in the first stage to generate synthetic audio and visual captions, then refines them with PE AV and the Perception Language Model in the second stage, enabling multimodal supervision at massive scale without manual labels.
  • PE AV establishes a new state of the art across a range of audio and video benchmarks using a sigmoid contrastive objective over multiple modality pairs, with six public checkpoints spanning model sizes and frame configurations; average retrieval improves from around 45 to about 51.6.
  • PE AV also includes the frame-level PE A-Frame variant, which backs perception in Meta’s SAM Audio system, providing the embeddings for prompt-based audio separation and fine-grained localization of sound events across music, speech, and other sounds.

