Recent public releases of multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have shown rapid progress. Yet despite their strong language skills, how well these models actually understand visual data remains poorly characterized. Most current benchmarks rely on text-mediated tasks such as visual question answering and classification, which reflect language ability more than genuine visual skill. Because these tests also require text output, it is hard to compare vision-specific models against MFMs or to assess visual abilities fairly, and current evaluations still overlook critical aspects of vision such as 3D perception and segmentation.
MFMs are highly effective at tasks that combine visual and language comprehension, such as captioning and visual question answering, but it is unclear how well they handle tasks that demand detailed understanding of visual information. Because most benchmarks are built around text-based outputs, fair comparison with vision-only models is difficult, and evaluation is restricted to what can be expressed in language. Prompting strategies that decompose visually challenging tasks into simpler subtasks have allowed MFMs to tackle them, but reproducibility remains a challenge.
Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Because most MFMs output only text and are accessible only through APIs, the researchers developed a framework that converts these visual tasks into text-compatible formats. They found that MFMs are competent generalists but fall short of specialized vision models, especially on geometric tasks. GPT-4o performed best, leading on four of the six tasks. The evaluation toolkit will be released as open source.
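The idea of casting a vision task into a text-compatible format can be illustrated for classification: the model receives the image alongside a prompt listing candidate labels, and its free-text reply is parsed back into a class. This is a minimal sketch of that pattern; the prompt wording, the label set, and the mocked `call_mfm` function are illustrative assumptions, not the paper's exact interface.

```python
# Sketch: posing image classification as a text-only exchange with an
# API-accessible MFM. `call_mfm` is a mock standing in for a real API call.

LABELS = ["cat", "dog", "car"]

def build_prompt(labels):
    """Compose a prompt that constrains the answer to known labels."""
    return ("Which one of the following labels best describes the image? "
            "Answer with the label only: " + ", ".join(labels))

def parse_answer(text, labels):
    """Map the model's free-text reply onto one of the known labels."""
    reply = text.strip().lower()
    for label in labels:
        if label in reply:
            return label
    return None  # unparseable reply; would trigger a retry in practice

def call_mfm(image, prompt):
    # Mock response; a real system would send `image` and `prompt` to an API.
    return "I think this is a Dog."

pred = parse_answer(call_mfm(None, build_prompt(LABELS)), LABELS)
```

Parsing the reply rather than trusting its exact wording is what makes the text interface usable: the model may answer in a full sentence, so the parser only looks for a known label inside it.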
To test MFMs on vision tasks, the researchers developed a prompt-chaining strategy that breaks complex tasks down into simpler, language-friendly subtasks. For object detection, for example, instead of predicting bounding-box coordinates directly, the model first identifies the objects present and then localizes each one through recursive cropping. For segmentation, images are divided into superpixels, which are easy to label and compare, and the model groups them; depth and surface normals are estimated by having the model rank superpixels. This modular approach plays to MFMs' strengths in classification and similarity judgments, while calibration controls ensure fair comparisons. The method remains flexible, and performance improves with more fine-grained prompting.
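The recursive-cropping step above can be sketched as a sequence of binary questions: at each step the model is asked, in text, whether the object lies in one half of the current crop, and the crop is narrowed accordingly. The following is a simplified illustration under assumed details; the `ask` callable stands in for a real MFM API call and is mocked here with a known ground-truth box, and the halving scheme is our own simplification of the paper's procedure.

```python
# Sketch of recursive cropping for object localization: narrow a crop
# around the object by repeatedly asking which half contains it.

def contains(box, crop):
    """True if the box centre lies inside the crop (x0, y0, x1, y1)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return crop[0] <= cx < crop[2] and crop[1] <= cy < crop[3]

def localize(ask, image_size, steps=6):
    """Halve the crop horizontally then vertically at each step."""
    x0, y0 = 0, 0
    x1, y1 = image_size
    for _ in range(steps):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        # "Is the object in the left half?" -- a yes/no text answer.
        if ask((x0, y0, xm, y1)):
            x1 = xm
        else:
            x0 = xm
        # "Is the object in the top half?"
        if ask((x0, y0, x1, ym)):
            y1 = ym
        else:
            y0 = ym
    return (x0, y0, x1, y1)

# Mock "MFM": answers whether the hidden ground-truth object is in the crop.
truth = (120, 40, 200, 110)  # hypothetical ground-truth box
crop = localize(lambda c: contains(truth, c), image_size=(640, 480))
```

Each round of questions halves the crop in both dimensions, so a handful of language-friendly yes/no queries converges on a tight region without ever asking the model to emit coordinates.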
The study compares MFMs such as GPT-4o and Gemini 2.0 Flash on tasks including image classification, segmentation, and object detection, across datasets such as ImageNet, COCO, and Hypersim. GPT-4o trails specialist models such as ViT-G (90.94%) and Co-DETR (91.30%). On semantic segmentation, GPT-4o scores 44.89 mIoU, while the specialist OneFormer leads with 66.52. MFMs handle distribution shifts well but lag behind on precise geometric reasoning. The study also introduces oracle and prompt-chaining baselines to gauge upper-bound performance.
The study concludes by introducing a benchmarking framework for evaluating the visual abilities of MFMs such as GPT-4o, Gemini, and Claude, built on converting standard vision tasks into prompt-based formats. The results show that MFMs handle semantic tasks better than geometric ones, with GPT-4o the best overall; all MFMs, however, lag behind task-specific models. They nonetheless show promise, given that they are trained only on image-text data. Limitations include high inference cost and sensitivity to prompts. The framework offers a uniform approach to evaluating MFMs' visual understanding and lays the foundation for future improvements.
Check out the Paper, GitHub Page, and Project for more details. All credit for this research goes to the researchers of this project.

