Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
Ying Gu, Mei Chee Leong, Hui Li Tan, Shangbo Mao, Liyuan Li, Nancy Chen
Why It Matters
What makes this one worth your time
Accuracy-based evaluation can reward unwarranted guessing and cannot be applied to novel tasks that lack ground-truth annotations. This work targets both problems with an annotation-free approach to validating MLLMs.
It introduces a metric that evaluates the logical consistency of MLLMs without ground-truth annotations.
Summary
The paper proposes a framework for evaluating the logical consistency of multimodal large language models (MLLMs) without ground-truth annotations. It introduces the Vision-Language Logical Consistency Metric (VL-LCM), defined over both sufficient and necessary cause-effect relations, and evaluates it in experiments on the MMMU and NaturalBench benchmarks.
Key contributions
- Definition of the Vision-Language Logical Consistency Metric (VL-LCM) for MC-VQA and NaturalBench tests.
- Evaluation of 11 recent open-source MLLMs from 4 model families on MMMU and NaturalBench using VL-LCM.
- Analysis of how VL-LCM correlates with ground-truth-based metrics, of its reliability, and of its relation to the response distribution.
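The correlation analysis named in the last contribution can be pictured with a small helper. This is a minimal sketch, not the authors' procedure: the abstract does not say which correlation statistic is used, so Pearson's r is an assumption here, and the function name `pearson` is hypothetical. It expects one score per model for each metric, for example a per-model consistency score and a per-model accuracy.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists.

    Assumption: Pearson's r is only one possible choice of statistic;
    the abstract does not specify which correlation measure the paper uses.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two paired scores")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would mean that models ranked highly by the annotation-free score also tend to score highly on the ground-truth metric, which is the kind of evidence the abstract cites for the metric's validity.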
Notable insights
- Because the metric operates without ground-truth annotations, it can be applied to novel tasks where labeled data is unavailable, including model selection and validation.
- The findings highlight a gap between accuracy and logical consistency in MLLMs, suggesting that accuracy alone may not be a sufficient measure of model performance.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2605.06201v1
Dominant accuracy evaluation might reward unwarranted guessing of Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principle, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests, and recent NaturalBench tests without the need for gt annotation. Through systematic experiments on representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations on the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed for both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.
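To make the idea of annotation-free consistency checking concrete, here is a minimal sketch of one way such a check could be set up for multiple-choice VQA. It is not the paper's definition of VL-LCM, which the abstract does not spell out. The scheme below is an assumption: the model first picks an option, then is asked a yes/no question about every option; the pick counts as consistent if the chosen option is affirmed (treated here as the "sufficient" direction) and every other option is denied (treated here as the "necessary" direction). The names `is_consistent`, `consistency_rate`, `pick`, and `verify` are hypothetical, and the image input is omitted for brevity.

```python
from typing import Callable, Sequence, Tuple

# One test item: (question text, answer options). No ground-truth label is stored.
Item = Tuple[str, Sequence[str]]


def is_consistent(picked: int, verdicts: Sequence[bool]) -> bool:
    """Check one item under the assumed scheme.

    picked   -- index of the option the model chose for the question
    verdicts -- the model's yes/no answer for each option asked separately
    """
    # Assumed "sufficient" direction: the chosen option is affirmed.
    chosen_affirmed = verdicts[picked]
    # Assumed "necessary" direction: no unchosen option is affirmed.
    others_denied = not any(v for i, v in enumerate(verdicts) if i != picked)
    return chosen_affirmed and others_denied


def consistency_rate(
    items: Sequence[Item],
    pick: Callable[[str, Sequence[str]], int],
    verify: Callable[[str, str], bool],
) -> float:
    """Fraction of items on which the model's answers agree with each other."""
    if not items:
        raise ValueError("need at least one item")
    consistent = 0
    for question, options in items:
        picked = pick(question, options)
        verdicts = [verify(question, option) for option in options]
        if is_consistent(picked, verdicts):
            consistent += 1
    return consistent / len(items)


# Toy stand-ins for a model, used only to exercise the functions above.
def toy_pick(question: str, options: Sequence[str]) -> int:
    return 0  # always chooses the first option


def toy_verify(question: str, option: str) -> bool:
    return option == "cat"  # says "yes" only to the option "cat"
```

With the toy stand-ins, an item whose first option is "cat" is consistent and an item whose first option is "dog" is not, so a two-item set containing one of each yields a rate of 0.5. No correct answer is consulted at any point, which is the property the abstract emphasizes.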