
Meta AI Open Sourced Perception Audiovisual (PE AV): the Audiovisual Encoder that Powers SAM Audio and Large Scale Multimodal Search

Tech | By Gavin Wallace | 22/12/2025 | 6 Mins Read

Meta researchers have introduced Perception Encoder Audiovisual (PE AV), a family of encoders for joint audio-video understanding. The model learns aligned text, audio, and video representations in a single embedding space through large-scale contrastive training on roughly 100M audio-video pairs accompanied by text captions.

From Perception Encoder to PE AV

Perception Encoder (PE) is Meta’s core vision stack in the Perception Models project. This family of image, video, and audio encoders reaches state of the art on many vision and audio benchmarks through contrastive pretraining. PE Core outperforms SigLIP2 on image tasks and InternVideo2 on video tasks, PE Lang powers the Perception Language Model used for multimodal reasoning, and PE Spatial is tuned for dense prediction tasks such as depth estimation and detection.

PE AV extends this backbone to full alignment of audio, video, and text. It is the audiovisual branch of the Perception Models repository, embedding text, audio, and video into one joint embedding space for cross-modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Separate Towers & Fusion

The PE AV architecture consists of frame encoders, a video encoder, an audio encoder, audio-video fusion encoders, and text encoders.

  • Video path: the PE frame encoder runs on RGB frames, with a temporal encoder applied on top.
  • Audio path: a DAC VAE codec converts raw waveforms into discrete audio tokens at a fixed frame rate, roughly one audio embedding every 40 milliseconds.

These towers feed a fusion encoder that learns a shared representation of both streams, while text encoders project text queries into several specialized spaces. The result is one backbone that can be queried in many different ways: audio or video can be retrieved from text and vice versa, and video can be retrieved from text conditioned on audio.
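A minimal sketch of how such a joint embedding space is queried: the random arrays below are stand-ins for real tower outputs (no PE AV weights or APIs are used), and retrieval reduces to dot products between unit-normalized embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 512

# Hypothetical embeddings from the text, audio, and video towers
# for a batch of 4 clips (random stand-ins for real encoder outputs).
text_emb = l2_normalize(rng.normal(size=(4, dim)))
audio_emb = l2_normalize(rng.normal(size=(4, dim)))
video_emb = l2_normalize(rng.normal(size=(4, dim)))

# Text-to-audio retrieval: similarity matrix, then argmax per query.
sim_t2a = text_emb @ audio_emb.T           # (4 queries, 4 candidates)
best_audio = sim_t2a.argmax(axis=1)

# Text-to-video retrieval works identically against the video tower.
sim_t2v = text_emb @ video_emb.T
best_video = sim_t2v.argmax(axis=1)

print(best_audio.shape, best_video.shape)  # (4,) (4,)
```

The same similarity matrices can be read column-wise for the reverse direction (audio-to-text, video-to-text), which is why one backbone serves many retrieval tasks.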


Data Engine: Synthetic Audiovisual Captions at Scale

The research team proposes a two-stage audiovisual data engine that generates high-quality synthetic captions. In the first stage, a pipeline feeds audio captioners, video captioners, confidence scores, and several weak audio captions into a large language model, which generates three caption types for each clip: one for the audio content, one for the visual content, and one for the joint audiovisual content. An initial PE AV model is then trained on this synthetic supervision.

In the second stage, the initial PE AV model is paired with a Perception Language Model decoder, and the two refine captions jointly to maximize audiovisual correspondence. Together, the two stages produce reliable captions for roughly 100M audio-video pairs: the pretraining set contains 92M clips and fine-tuning adds 32M more.
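The first stage of the engine can be sketched as follows. Everything here is illustrative: the confidence cutoff, the prompt strings, and `summarize_with_llm` (a stub standing in for the actual LLM call) are assumptions, not the team’s implementation.

```python
from dataclasses import dataclass

@dataclass
class ClipCaptions:
    audio: str
    visual: str
    audiovisual: str

def summarize_with_llm(prompt: str) -> str:
    # Stand-in for the large language model described in the paper;
    # here it just echoes a truncated prompt so the sketch runs.
    return prompt[:60]

def caption_clip(weak_audio_caps, video_cap, confidences) -> ClipCaptions:
    # Keep only weak audio captions above a (hypothetical) confidence cutoff.
    kept = [c for c, s in zip(weak_audio_caps, confidences) if s >= 0.5]
    audio = summarize_with_llm("Audio: " + "; ".join(kept))
    visual = summarize_with_llm("Visual: " + video_cap)
    joint = summarize_with_llm(f"Joint: {visual} while {audio}")
    return ClipCaptions(audio, visual, joint)

caps = caption_clip(["dog barking", "loud noise"],
                    "a dog runs in a park", [0.9, 0.2])
print(caps.visual)  # Visual: a dog runs in a park
```

The key idea the sketch preserves is that each clip ends up with three caption slots (audio-only, visual-only, joint), which later supply targets for the three-way contrastive alignment.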

Compared with previous work, which often covered only narrow sound or speech domains, the corpus is designed to balance speech, general sounds, and music, and to span diverse video domains.

The Contrastive Objective Across Ten Modality Pairs

PE AV uses sigmoid-based contrastive losses across text, audio, video, and fused representations. According to the research team, the model optimizes eight contrastive loss pairs during pretraining, including combinations such as audio-text and video-text. Fine-tuning adds two more, for a total of ten loss pairs across modalities and caption types.

This generalizes recent contrastive objectives for tri-modal audio-text training. Aligning all views lets the encoder support correspondence, classification, and retrieval tasks through simple dot-product similarities.
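A sigmoid (SigLIP-style) pairwise contrastive loss can be sketched as below. The temperature, bias, and the list of modality pairs are illustrative assumptions, not PE AV’s actual hyperparameters.

```python
import numpy as np

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss for one modality pair.

    Each (i, j) pair is scored independently: matched pairs (i == j)
    get label +1, all others -1, so no softmax over the batch is needed.
    """
    logits = temperature * emb_a @ emb_b.T + bias     # (n, n)
    labels = 2.0 * np.eye(len(emb_a)) - 1.0           # +1 diag, -1 off-diag
    # -log(sigmoid(labels * logits)) via a numerically stable softplus.
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(8, 64)); b /= np.linalg.norm(b, axis=1, keepdims=True)

# The full objective sums a loss like this over every modality pair used
# in training, e.g. audio-text, video-text, fused-text (this pair list
# is illustrative, not the paper's exact set of eight or ten pairs).
pairs = [(a, b), (b, a)]
total = sum(sigmoid_contrastive_loss(x, y) for x, y in pairs)
print(round(float(total), 3))
```

Because the sigmoid loss scores each pair independently, adding more modality pairs just adds more terms to the sum, which is what makes scaling from eight to ten pairs straightforward.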

Performance Across Speech, Audio, Music, and Video

On benchmarks, PE AV targets zero-shot retrieval and classification across domains. It achieves the best reported performance on audio and video benchmarks when compared against recent audio-text and audio-video-text models, as well as ImageBind and LanguageBind.

Here are some concrete gains:

  • Text-to-audio retrieval on AudioCaps improves from 35.4 R@1 to 45.8 R@1.
  • Clip classification accuracy on VGGSound rises from 36.0% to 47.1%.
  • PE AV reaches 85.6 accuracy on VCTK-style speech retrieval tasks, where older models scored near 0.
  • Text-to-video retrieval on ActivityNet improves from 60.4 R@1 to 66.5 R@1.
  • Zero-shot video classification on Kinetics-400 improves from 76.9 to 78.9, surpassing models up to four times larger.

PE A-Frame: Frame-Level Audio-Text Alignment

Alongside PE AV, Meta releases PE A-Frame for sound event localization. PE A-Frame is an audio-text embedding model that outputs one audio embedding per 40-millisecond frame and a single text embedding per query, and it returns temporal intervals indicating where each event occurs in the audio.

PE A-Frame uses frame-level contrastive learning to align audio frames with text, making it possible to pinpoint specific events, such as speakers, instruments, or transient sounds, within long audio sequences.
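The frame-level matching idea can be sketched as thresholding per-frame similarity to the text query and reading off contiguous intervals. The threshold and the synthetic embeddings are hypothetical, not PE A-Frame’s actual method details.

```python
import numpy as np

def locate_events(frame_embs, text_emb, threshold=0.5, frame_ms=40):
    """Return (start_ms, end_ms) intervals where per-frame similarity
    to the text query exceeds a threshold (hypothetical choice)."""
    sims = frame_embs @ text_emb            # one score per 40 ms frame
    active = sims > threshold
    intervals, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                       # event begins
        elif not on and start is not None:
            intervals.append((start * frame_ms, i * frame_ms))
            start = None                    # event ends
    if start is not None:                   # event runs to the end
        intervals.append((start * frame_ms, len(active) * frame_ms))
    return intervals

# Synthetic example: frames 10-19 (i.e. 400-800 ms) match the query.
text = np.zeros(16); text[0] = 1.0
frames = np.zeros((50, 16)); frames[10:20, 0] = 0.9
print(locate_events(frames, text))  # [(400, 800)]
```

Converting frame indices to milliseconds is just a multiplication by the 40 ms frame period, which is why one embedding per frame suffices for temporal localization.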

Role in the Ecosystem: SAM Audio and the Perception Models

PE AV and PE A-Frame sit in the Perception Models stack alongside the Perception Language Model, supporting multimodal perception, reasoning, and generation.

SAM Audio uses PE AV as its core perception engine, building on PE AV embeddings to connect visual and text prompts with sound sources and to score the quality of separated audio tracks.

Key Takeaways

  • PE AV is a unified encoder for text, audio, and video, trained on more than 100M audio-video pairs.
  • It uses separate towers for video and audio, with PE-based visual encoding and DAC VAE audio tokenization, followed by audiovisual fusion encoders and specialized text encoders aligned to the different modality pairings.
  • The data engine uses weak captioners plus an LLM in the first stage to generate synthetic audio and visual captions, then refines them with PE AV and the Perception Language Model in the second stage, enabling multimodal supervision at massive scale without manual labels.
  • PE AV establishes a new state of the art across a range of audio and video benchmarks using a sigmoid contrastive objective over multiple modality pairs, with six public checkpoints spanning model sizes and frame configurations; average retrieval improves from around 45 to about 51.6.
  • PE AV also includes the frame-level PE A-Frame variant, which backs perception in Meta’s SAM Audio system, providing the embeddings for prompt-based audio separation and fine-grained localization of sound events across music, speech, and other sounds.

