Multimodal Evaluation Benchmark Vision Language

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

Published May 5, 2026 | Featured #8 | In the daily list May 5, 2026

Daily score: 58.3

Editorial review: 6.8

Relevance: 0.481

Freshness: 0.722

Why It Matters

What makes this one worth your time

Knowing where multimodal models succeed and fail on standard vision tasks helps decide when they can stand in for specialist models and where further development is needed.

The paper benchmarks multimodal models on standard vision tasks and finds that they are respectable generalists but are not close to state-of-the-art specialist models on any task.

Summary

The paper evaluates multimodal foundation models, including GPT-4o, on standard computer vision tasks by translating these tasks into text-promptable formats and benchmarking them against specialist models.

Key contributions

Benchmarking multimodal models on standard vision tasks.
Development of a standardized framework for evaluating these models.
Analysis of performance differences between semantic and geometric tasks.

Notable insights

Prompt chaining translates vision tasks into text-promptable, API-compatible formats, so models that output only text can be evaluated on them.
Models with native image generation show failure modes like hallucinated objects.
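
To make the idea of prompt chaining concrete, here is a minimal sketch of how a localization task could be broken into a chain of yes/no questions, where each answer decides which crop is queried next. It is an illustration under stated assumptions, not the paper's implementation: the `contains` callable stands in for a hypothetical model call (for example, sending a crop with the prompt "Is any part of a <label> visible? Answer yes or no."), and the names `split_into_grid` and `localize` are invented for this example.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def split_into_grid(box: Box, n: int = 2) -> List[Box]:
    """Split a box into an n-by-n grid of equally sized cells (row-major order)."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) / n, (y1 - y0) / n
    return [
        (x0 + c * w, y0 + r * h, x0 + (c + 1) * w, y0 + (r + 1) * h)
        for r in range(n)
        for c in range(n)
    ]


def localize(
    label: str,
    image_box: Box,
    contains: Callable[[str, Box], bool],  # hypothetical yes/no model query
    max_steps: int = 4,
) -> Box:
    """Narrow down a region through a chain of yes/no prompts.

    At each step the current region is split into cells and the model is
    asked, per cell, whether the object is visible. Zooming continues only
    while exactly one cell is reported to contain the object.
    """
    region = image_box
    for _ in range(max_steps):
        hits = [cell for cell in split_into_grid(region) if contains(label, cell)]
        if len(hits) != 1:
            break  # object spans several cells, or none was reported: stop
        region = hits[0]
    return region
```

In a real setup `contains` would wrap an API request and parse the text reply; the result is only a coarse box, and its quality depends on how reliably the model answers each intermediate question.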

Possible limitations

Not stated in the abstract

Abstract

arXiv:2507.01955v3

Multimodal foundation models (MFMs), such as GPT-4o, have recently made remarkable progress. However, their detailed visual understanding beyond question answering remains unclear. In this paper, we benchmark popular MFMs (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet, etc.).

The main challenges in performing this analysis are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these by translating vision tasks into text-promptable, API-compatible formats via prompt chaining, creating a standardized benchmarking framework.

We observe that:
1) The MFMs are not close to the state-of-the-art specialist models at any task.
2) They are respectable generalists; this is remarkable, as they are presumably trained on image-text-based tasks.
3) They perform semantic tasks notably better than geometric ones.
4) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks.
5) Reasoning models, e.g., o3, show improvements in geometric tasks.
6) While prompt chaining techniques affect performance, better models are less sensitive to prompt variations.
7) An analysis of models with native image generation, such as the latest GPT-4o, shows they exhibit failure modes, such as hallucinated objects or misalignment between input and output.
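
For readers unfamiliar with how box predictions are usually compared against ground truth on detection datasets such as COCO, the sketch below computes intersection over union (IoU), the overlap measure that standard detection metrics build on. It is a generic illustration, not the paper's evaluation code; the function name `iou` and the `(x0, y0, x1, y1)` box convention are assumptions made for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

A predicted box is typically counted as a match when its IoU with a ground-truth box exceeds a chosen threshold; the threshold itself is a benchmark-specific choice and is not specified here.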