Recent public releases of multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have shown rapid progress. Yet despite their strong language skills, how well these models actually understand visual data remains poorly characterized. Most current benchmarks rely on text-mediated tasks such as visual question answering and classification, which reflect language ability more than genuine visual skill. Because these tests also require text output, it is hard to compare vision-specific models against MFMs or to assess visual abilities fairly, and current evaluations still overlook critical aspects of vision such as 3D perception and segmentation.
MFMs are highly effective at tasks that combine visual and language comprehension, such as captioning and visual question answering, but it is unclear how well they handle tasks that demand detailed understanding of visual information. Because most benchmarks are built around text-based outputs, fair comparison with vision-only models is difficult, and evaluation is restricted to what can be expressed in language. Prompting strategies that decompose visually challenging tasks into simpler subtasks have allowed MFMs to tackle them, but reproducibility remains a challenge.
Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Because most MFMs output only text and are accessible only through APIs, the researchers developed a framework that converts these visual tasks into text-compatible formats. They found that MFMs are competent generalists but fall short of specialized vision models, especially on geometric tasks. GPT-4o performed best, leading on four of the six tasks. The evaluation toolkit will be released as open source.
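The idea of casting a vision task into a text-compatible format can be illustrated for classification: the model receives the image alongside a prompt listing candidate labels, and its free-text reply is parsed back into a class. This is a minimal sketch of that pattern; the prompt wording, the label set, and the mocked `call_mfm` function are illustrative assumptions, not the paper's exact interface.

```python
# Sketch: posing image classification as a text-only exchange with an
# API-accessible MFM. `call_mfm` is a mock standing in for a real API call.

LABELS = ["cat", "dog", "car"]

def build_prompt(labels):
    """Compose a prompt that constrains the answer to known labels."""
    return ("Which one of the following labels best describes the image? "
            "Answer with the label only: " + ", ".join(labels))

def parse_answer(text, labels):
    """Map the model's free-text reply onto one of the known labels."""
    reply = text.strip().lower()
    for label in labels:
        if label in reply:
            return label
    return None  # unparseable reply; would trigger a retry in practice

def call_mfm(image, prompt):
    # Mock response; a real system would send `image` and `prompt` to an API.
    return "I think this is a Dog."

pred = parse_answer(call_mfm(None, build_prompt(LABELS)), LABELS)
```

Parsing the reply rather than trusting its exact wording is what makes the text interface usable: the model may answer in a full sentence, so the parser only looks for a known label inside it.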
To test MFMs on vision tasks, the researchers developed a prompt-chaining strategy that breaks complex tasks down into simpler, language-friendly subtasks. For object detection, for example, instead of predicting bounding-box coordinates directly, the model first identifies the objects present and then localizes each one through recursive cropping. For segmentation, images are divided into superpixels, which are easy to label and compare, and the model groups them; depth and surface normals are estimated by having the model rank superpixels. This modular approach plays to MFMs' strengths in classification and similarity judgments, while calibration controls ensure fair comparisons. The method remains flexible, and performance improves with more fine-grained prompting.
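The recursive-cropping step above can be sketched as a sequence of binary questions: at each step the model is asked, in text, whether the object lies in one half of the current crop, and the crop is narrowed accordingly. The following is a simplified illustration under assumed details; the `ask` callable stands in for a real MFM API call and is mocked here with a known ground-truth box, and the halving scheme is our own simplification of the paper's procedure.

```python
# Sketch of recursive cropping for object localization: narrow a crop
# around the object by repeatedly asking which half contains it.

def contains(box, crop):
    """True if the box centre lies inside the crop (x0, y0, x1, y1)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return crop[0] <= cx < crop[2] and crop[1] <= cy < crop[3]

def localize(ask, image_size, steps=6):
    """Halve the crop horizontally then vertically at each step."""
    x0, y0 = 0, 0
    x1, y1 = image_size
    for _ in range(steps):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        # "Is the object in the left half?" -- a yes/no text answer.
        if ask((x0, y0, xm, y1)):
            x1 = xm
        else:
            x0 = xm
        # "Is the object in the top half?"
        if ask((x0, y0, x1, ym)):
            y1 = ym
        else:
            y0 = ym
    return (x0, y0, x1, y1)

# Mock "MFM": answers whether the hidden ground-truth object is in the crop.
truth = (120, 40, 200, 110)  # hypothetical ground-truth box
crop = localize(lambda c: contains(truth, c), image_size=(640, 480))
```

Each round of questions halves the crop in both dimensions, so a handful of language-friendly yes/no queries converges on a tight region without ever asking the model to emit coordinates.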
The study compares MFMs such as GPT-4o and Gemini 2.0 Flash on tasks including image classification, segmentation, and object detection, across datasets such as ImageNet, COCO, and Hypersim. GPT-4o trails specialist models such as ViT-G (90.94%) and Co-DETR (91.30%). On semantic segmentation, GPT-4o scores 44.89 mIoU, while the specialist OneFormer leads with 66.52. MFMs handle distribution shifts well but lag behind on precise geometric reasoning. The study also introduces oracle and prompt-chaining baselines to gauge upper-bound performance.
The study concludes by introducing a benchmarking framework for evaluating the visual abilities of MFMs such as GPT-4o, Gemini, and Claude, built on converting standard vision tasks into prompt-based formats. The results show that MFMs handle semantic tasks better than geometric ones, with GPT-4o the best overall; all MFMs, however, lag behind task-specific models. They nonetheless show promise, given that they are trained only on image-text data. Limitations include high inference cost and sensitivity to prompts. The framework offers a uniform approach to evaluating MFMs' visual understanding and lays the foundation for future improvements.
Check out the Paper, GitHub Page, and Project for more details. All credit for this research goes to the researchers of this project.

