AI-trends.today
The LiquidAI LFM2-VL-3B brings a 3B Parameter Vision Language Model to Edge-Class devices

Tech · By Gavin Wallace · 24/10/2025 · 5 Mins Read
Liquid AI has released LFM2-VL-3B, a 3B-parameter vision-language model for image-plus-text to text tasks. It extends the LFM2-VL family beyond the 450M and 1.5B variants, aiming for higher accuracy while keeping the LFM2 speed profile. The model ships under the LFM Open License v1.0 and is available through LEAP and Hugging Face.

Model Interface

LFM2-VL-3B takes images and text as input and produces text output. The model exposes a ChatML-like template; the processor inserts image sentinels that are replaced by encoded image tokens at runtime. The default text context is 32,768 tokens. These details help developers reproduce evaluations and integrate the model into existing multimodal pipelines.
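The message structure behind a ChatML-like multimodal template can be sketched as plain data. The role/content layout below follows the common Hugging Face multimodal chat convention and is an assumption, not the model's documented schema; the actual sentinel token is inserted by the processor, not written by hand.

```python
# Sketch of a ChatML-style multimodal conversation, assuming the common
# Hugging Face convention of content lists with "image" and "text" parts.
# The image URL is a placeholder; the processor expands the image entry
# into sentinel tokens, which are replaced by encoded image tokens at
# runtime. Text alone counts against the 32,768-token default context.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
```

A processor's `apply_chat_template` would typically render this structure into the model's prompt format with the sentinel in place.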

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

Architecture

The stack combines a language tower, a shape-aware vision tower, and a projector. The language tower is LFM2-2.6B, a hybrid convolution-and-attention backbone. The vision tower is a 400M-parameter SigLIP2 encoder that preserves native aspect ratios. The connector is a two-layer MLP with pixel shuffle, which compresses image tokens before fusion. With this design, users can cap the vision-token budget without retraining the model.

The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card documents user controls such as minimum and maximum image tokens and image splitting; these controls trade off inference speed against quality.
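The documented 256×384 → 96 mapping can be reproduced with simple arithmetic. A minimal sketch, assuming a 16-pixel patch size and a 2×2 pixel-shuffle that merges 4 patches into 1 token; both constants are inferred from the published example, not stated by the vendor, and the tiled-case total (1,020 tokens) involves additional bookkeeping this sketch does not model.

```python
import math

# Assumed constants, inferred from the documented 256x384 -> 96 example.
PATCH = 16      # pixels per encoder patch (assumption)
SHUFFLE = 4     # 2x2 pixel shuffle merges 4 patches into 1 token (assumption)

def image_tokens(width: int, height: int) -> int:
    """Estimated vision tokens for an image encoded at native resolution
    (valid for inputs at or below 512x512)."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return patches // SHUFFLE

def tile_count(width: int, height: int, tile: int = 512) -> int:
    """Non-overlapping 512x512 tiles for an oversized input; the thumbnail
    pathway adds a separate global-context image on top of these."""
    return math.ceil(width / tile) * math.ceil(height / tile)

print(image_tokens(256, 384))   # 96, matching the documented example
print(tile_count(1000, 3000))   # 12 tiles, plus the thumbnail pathway
```

This kind of back-of-the-envelope estimate is useful when choosing the minimum/maximum image-token controls for a latency budget.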

Inference Setup

The model card lists recommended inference parameters: temperature 0.1, min_p 0.15, and repetition penalty 1.05 for text generation; a minimum of 64 and a maximum of 256 image tokens; and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The published example uses AutoModelForImageTextToText and AutoProcessor at bfloat16 precision.
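The recommended settings above can be captured as transformers-style keyword arguments. The decoding names (`min_p`, `repetition_penalty`) follow the Hugging Face generation API; the image-control names are hypothetical stand-ins for whatever the processor actually exposes, so treat this as a sketch rather than an official config.

```python
# Recommended decoding settings from the model card, expressed as
# transformers-style generate() kwargs (min_p and repetition_penalty are
# real Hugging Face generation parameters; the mapping is an assumption).
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.1,
    "min_p": 0.15,
    "repetition_penalty": 1.05,
}

# Vision-token budget controls from the model card. The key names here are
# hypothetical; check the processor's documentation for the exact ones.
IMAGE_KWARGS = {
    "min_image_tokens": 64,
    "max_image_tokens": 256,
    "do_image_splitting": True,
}
```

In a full pipeline these would be passed alongside inputs prepared by `AutoProcessor`, with the model loaded via `AutoModelForImageTextToText` in bfloat16, e.g. `model.generate(**inputs, **GENERATION_KWARGS)`.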

Training Process

Liquid AI describes a staged training recipe. The team gradually adjusts the text-to-image data ratio during mid-training, then applies supervised fine-tuning focused on image understanding. Data sources include large-scale open datasets plus in-house synthetic data for task coverage.

Benchmarks

The researchers report competitive results among lightweight VLMs: 51.83 on MM-IFEval, 71.37 on RealWorldQA, 79.81 on MMBench (dev, en), and 89.01 on POPE. The table notes that scores for other systems were computed with VLMEvalKit, and that Qwen3-VL-2B is excluded from the table because of its release timing.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

Language capability remains close to LFM2-2.6B, with reported scores of roughly 30 percent on GPQA and 63 percent on MMLU. This matters because perception tasks often include knowledge questions. The team also reports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, and Chinese.

Why It Matters for the Edge

The architecture keeps memory and compute budgets small. Throughput is predictable because image tokens are compressed and user-constrained. The SigLIP2 NaFlex encoder preserves native aspect ratios, which helps fine-grained perception. The projector reduces the token count ahead of the language tower, improving tokens per second. The team also published a GGUF build for on-device runtimes. These properties suit robotics, mobile, and industrial deployments that require local processing and strict data boundaries.

The Key Takeaways

  1. Compact multimodal stack: LFM2-VL-3B combines the LFM2-2.6B language tower with a 400M-parameter SigLIP2 NaFlex vision encoder and a two-layer MLP projector. NaFlex preserves native aspect ratios.
  2. Resolution handling and token budgets: Images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
  3. Reproducible inference: ChatML-style prompting with image sentinels, a default text context of 32,768 tokens, recommended decoding settings, and processor-level image-splitting controls enable reproducible evaluation.
  4. Measured performance: MM-IFEval 51.83; RealWorldQA 71.37; MMBench (dev, en) 79.81; POPE 89.01. The backbone's language-only signal is about 30% on GPQA.

LFM2-VL-3B is a step forward for edge multimodal workloads. The stack couples LFM2-2.6B with a 400M SigLIP2 NaFlex encoder through an efficient projector, lowering the image-token count for predictable latency. Native-resolution processing up to 512×512, tiling for larger inputs, and token caps give deterministic budgets. Scores on MM-IFEval and RealWorldQA are competitive for the size class. Open weights, a GGUF build, and LEAP access simplify integration. The result is a deployable VLM with transparent benchmarks and clear controls.


Check out the model on Hugging Face along with the technical details.


Michal Sutter is a data science professional with a Master of Science degree from the University of Padova. With a solid foundation in statistics, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
