Liquid AI has released LFM2-VL-3B, a 3B-parameter vision-language model for image-plus-text to text tasks. It extends the LFM2-VL family beyond the 450M and 1.5B variants, targeting higher accuracy while preserving the speed profile of the LFM2 line. The model is available through LEAP and Hugging Face under the LFM Open License v1.0.
Model interface
LFM2-VL-3B takes images and text as input and produces text. The model exposes a ChatML-like chat template, and the processor inserts image sentinels that are replaced with encoded image tokens at runtime. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model into existing multimodal pipelines.
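As a rough illustration of the template-and-sentinel idea, a ChatML-style prompt might look like the sketch below. The literal sentinel token and template string are the processor's concern, so treat `<image>` and the exact markup here as placeholders, not the model's actual format.

```python
# Hypothetical sketch of a ChatML-style prompt. "<image>" stands in for
# the sentinel that the processor replaces with encoded image tokens at
# runtime; in practice the processor applies the real template for you.
prompt = (
    "<|im_start|>user\n"
    "<image>What is shown in this photo?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)
```

In real use, developers never assemble this string by hand; the processor's chat-template method builds it from a structured conversation.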
Architecture
The stack combines a language tower, a shape-aware vision tower, and a projector. The language tower is LFM2-2.6B, which uses a hybrid convolution-and-attention backbone. The vision tower is SigLIP2 NaFlex at roughly 400M parameters, which preserves native aspect ratios. The connector is a two-layer MLP with pixel unshuffle that compresses image tokens before fusion. This design lets users cap the vision-token budget without retraining the model.
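The connector's pixel-unshuffle compression can be sketched in a few lines. This is our own illustration, not Liquid AI's implementation: the hidden sizes (1152 for the vision features, 2048 for the language tower) and the 2×2 grouping factor are assumptions chosen to match the documented 4× token reduction.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Sketch of a two-layer MLP connector with 2x2 pixel unshuffle,
    compressing a grid of vision tokens 4x before fusion with the LM."""
    def __init__(self, vis_dim: int, lm_dim: int, s: int = 2):
        super().__init__()
        self.s = s
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim * s * s, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, vis_dim) grid of vision tokens
        b, _, d = x.shape
        x = x.view(b, h, w, d)
        # Group each s x s neighborhood into one token (pixel unshuffle).
        x = x.view(b, h // self.s, self.s, w // self.s, self.s, d)
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // self.s) * (w // self.s), self.s * self.s * d)
        return self.mlp(x)

conn = Connector(vis_dim=1152, lm_dim=2048)
# A 16x24 token grid (e.g. a 256x384 image at 16-pixel patches) -> 96 tokens.
tokens = conn(torch.randn(1, 16 * 24, 1152), h=16, w=24)
print(tokens.shape)  # torch.Size([1, 96, 2048])
```

The 4× reduction ahead of the language model is what keeps the vision-token budget small and user-controllable.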
The encoder processes images at native resolution up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls, including minimum and maximum image tokens and image splitting, for tuning inference speed against quality.
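The single-tile token arithmetic implied by the 256×384 → 96 example can be sketched as follows. The 16-pixel patch size and 2×2 pixel-unshuffle factor are our assumptions, consistent with that documented example; the tiled case adds thumbnail tokens and is not modeled here.

```python
import math

def vision_tokens(width: int, height: int, patch: int = 16, unshuffle: int = 2) -> int:
    """Estimate vision tokens for an image fitting in one 512x512 tile,
    assuming 16-pixel patches and a 2x2 pixel unshuffle (4x compression)."""
    patches = math.ceil(width / patch) * math.ceil(height / patch)
    return patches // (unshuffle ** 2)

print(vision_tokens(256, 384))  # 96, matching the documented example
```

Under these assumptions, the min/max image-token controls simply clamp this count, which is why inference cost stays predictable.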
Inference setup
The model card lists recommended inference parameters: temperature 0.1, min_p 0.15, and repetition penalty 1.05 for text generation, with minimum image tokens 64, maximum image tokens 256, and image splitting enabled. The processor applies the chat template and inserts the image sentinel automatically. The published example uses AutoModelForImageTextToText and AutoProcessor in bfloat16 precision.
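A minimal loading-and-generation sketch using the Hugging Face classes named on the card might look like this. The message structure follows the standard transformers chat-template convention; the function name, image handling, and max_new_tokens value are our choices, and processor behavior may differ across transformers versions.

```python
def describe_image(image_path: str, question: str) -> str:
    """Sketch: load LFM2-VL-3B in bfloat16 and run one image-text query
    with the model card's recommended sampling settings."""
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "LiquidAI/LFM2-VL-3B"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    conversation = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # Recommended decoding: temperature 0.1, min_p 0.15, repetition penalty 1.05.
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.1,
        min_p=0.15,
        repetition_penalty=1.05,
        max_new_tokens=256,
    )
    return processor.batch_decode(output, skip_special_tokens=True)[0]
```

Calling `describe_image("photo.jpg", "Describe this image.")` downloads the weights on first use; the image-token minimum/maximum and image splitting are controlled through the processor configuration rather than generate().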
Training approach
Liquid AI describes a staged training method. The team adjusts the text-to-image data ratio over the course of mid-training, then applies supervised fine-tuning focused on image understanding. Data sources include large-scale open datasets plus in-house synthetic data for task coverage.
Benchmarks
The team reports competitive results among lightweight VLMs. The model reaches 51.83 on MM-IFEval, 71.37 on RealWorldQA, 79.81 on MMBench-dev-en, and 89.01 on POPE. A table note states that scores for the other systems were computed with VLMEvalKit. Qwen3-VL-2B is not included in the comparison table.

Language capability remains close to LFM2-2.6B: the team cites roughly 30 percent on GPQA and 63 percent on MMLU, which matters because perception tasks often embed knowledge questions. They also report expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, and Chinese.
Why it matters for edge deployment
The architecture targets small memory and compute budgets. Because image tokens are compressed and user-capped, throughput is predictable. The SigLIP2 NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The projector reduces the token count ahead of the language model, improving tokens per second. The team has also published a GGUF build for on-device runtimes. These properties suit robotics, mobile, and industrial deployments that require local processing and strict data boundaries.
Key Takeaways
- Compact multimodal stack: the 3B-parameter LFM2-VL-3B combines an LFM2-2.6B language tower, a SigLIP2 NaFlex vision encoder at roughly 400M parameters, and a two-layer MLP projector. NaFlex preserves native aspect ratios.
- Resolution handling and token budgets: images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
- Reproducible inference interface: ChatML-style prompting with image sentinels, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation.
- Performance: MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01, with a language-only signal from the backbone of about 30% on GPQA.
LFM2-VL-3B is a meaningful step forward for edge multimodal workloads. The stack couples LFM2-2.6B with a roughly 400M-parameter SigLIP2 NaFlex encoder and an efficient projector that lowers the image-token count for predictable latency. Native-resolution processing up to 512×512, tiling for larger inputs, and token caps give deterministic budgets. Scores on MM-IFEval and RealWorldQA are competitive for the size class, and open weights, a GGUF build, and LEAP access simplify integration. This is a VLM with transparent benchmarks and clear controls.
Check out the model on Hugging Face for the weights and technical details.


