Misleading-Scenario Evaluation
MUB measures whether an MLLM abandons a previously correct answer after receiving explicit or implicit deceptive cues.
EMNLP 2025
The benchmark comprises 1.7k multiple-choice and 0.8k true-or-false items, with low-, medium-, and high-difficulty splits calibrated by how many strong MLLMs each item misleads.
A 2K-sample mixed-instruction fine-tuning recipe sharply reduces misleading rates while preserving base capability.
This project studies a failure mode that many users notice in practice but few benchmarks isolate cleanly: an MLLM gives the correct answer first, then abandons it after receiving a misleading cue. The paper names this phenomenon response uncertainty and argues that standard accuracy metrics miss a large part of the problem because they do not measure how stable a correct answer remains under deceptive instructions.
To study this systematically, the paper introduces a two-stage misleading-instruction pipeline and builds the Multimodal Uncertainty Benchmark (MUB). The benchmark is designed to quantify how easily models are pushed away from previously correct answers by explicit false hints or implicit contradictory context.
MUB measures how often a multimodal model reverses a previously correct answer after explicit or implicit misleading cues are introduced.
Response Uncertainty: the benchmark focuses on the specific case where a model already has the right answer but gives it up after being nudged by deceptive multimodal context.
Two-Stage Misleading Pipeline: the procedure first tests the original prompt, then injects misleading information and measures whether the answer flips from correct to incorrect or vice versa.
Robustness Gains: a compact mixed-instruction fine-tuning strategy reduces the explicit misleading rate to 6.97% and the implicit misleading rate to 32.77%; a sketch of the instruction mixing appears after this list.
Benchmark Signal: across nine datasets, open-source MLLMs overturn a previously correct answer in about 65% of cases after a single deceptive cue.
Paper Resource: the full benchmark and analysis are available on arXiv.
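The mixed-instruction recipe referenced above is summarized here only at a high level, so the snippet below is a minimal sketch of how a small 2k-sample set could be assembled: each image-question sample is paired with either an explicit false hint or an implicit contradictory note, while the originally correct answer stays as the training target. The function name build_mixed_instruction_set, the prompt templates, and the record fields are illustrative assumptions, not the authors' released code.

```python
import json
import random

# Hypothetical prompt templates: the explicit cue asserts a wrong option outright,
# the implicit cue adds plausible but contradictory context.
EXPLICIT_TEMPLATE = "{question}\nHint: the correct answer is {wrong_answer}."
IMPLICIT_TEMPLATE = "{question}\nNote: many observers report that {wrong_claim}."

def build_mixed_instruction_set(samples, size=2000, seed=0):
    """Assemble a small fine-tuning set that mixes explicit and implicit
    misleading instructions, keeping the original correct answer as the
    target so the model learns to hold on to it."""
    random.seed(seed)
    picked = random.sample(samples, min(size, len(samples)))
    records = []
    for s in picked:
        template = EXPLICIT_TEMPLATE if random.random() < 0.5 else IMPLICIT_TEMPLATE
        prompt = template.format(
            question=s["question"],
            wrong_answer=s.get("wrong_answer", ""),
            wrong_claim=s.get("wrong_claim", ""),
        )
        records.append({
            "image": s["image_path"],
            "conversations": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": s["correct_answer"]},
            ],
        })
    return records

if __name__ == "__main__":
    # Hypothetical input file of candidate samples with question/answer fields.
    with open("mub_candidates.json") as f:
        candidates = json.load(f)
    mixed = build_mixed_instruction_set(candidates, size=2000)
    with open("mixed_instruction_2k.json", "w") as f:
        json.dump(mixed, f, indent=2)
```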
This is not just a robustness stress test bolted onto existing VQA benchmarks. The paper is explicitly about whether a model can hold onto a correct answer once misleading instructions appear, which makes it a useful complement to standard accuracy tables.
The procedure first queries a model on the original image-question pair. It then adds misleading information to create a second version of the prompt and measures whether the answer flips from correct to incorrect or vice versa. Using this framework, the authors curate a 2.5k-sample benchmark consisting of 1.7k multiple-choice questions and 0.8k true-or-false questions. MUB is further divided into low-, medium-, and high-difficulty groups according to how many strong MLLMs the example can mislead.
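To make the two-stage procedure concrete, the sketch below shows one way the evaluation loop could be organized: answer the clean image-question pair first, and only if that answer is correct, re-ask with a misleading cue and record whether the answer flips. The callables query_model and make_misleading_prompt, and the item fields, are assumptions for illustration; the released benchmark code may structure this differently.

```python
def evaluate_misleading(model, dataset, query_model, make_misleading_prompt):
    """Two-stage check: (1) answer the original image-question pair;
    (2) if correct, re-ask with a misleading cue and see if the answer flips.
    `query_model` and `make_misleading_prompt` are caller-supplied callables."""
    initially_correct, flipped = 0, 0
    for item in dataset:
        first = query_model(model, item["image"], item["question"])
        if first != item["answer"]:
            continue  # only initially correct answers feed the misleading stage
        initially_correct += 1
        misled_prompt = make_misleading_prompt(item)  # explicit hint or implicit context
        second = query_model(model, item["image"], misled_prompt)
        if second != item["answer"]:
            flipped += 1
    # Misleading rate: share of initially correct answers the cue overturns.
    rate = flipped / initially_correct if initially_correct else 0.0
    return {"initially_correct": initially_correct,
            "flipped": flipped,
            "misleading_rate": rate}
```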
A high-accuracy model can still be unreliable if it is easy to push off course. This project turns that intuition into a measurable benchmark and gives the community a way to study uncertainty, susceptibility, and recovery under misleading conditions.
That matters for trustworthy multimodal systems, especially in scenarios where the prompt source may be noisy, adversarial, or simply wrong. MUB and the accompanying analysis make it easier to compare models not only by what they know, but by how firmly they can hold onto a correct answer.
@inproceedings{dang2025exploring,
title={Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios},
author={Dang, Yunkai and Gao, Mengxi and Yan, Yibo and Zou, Xin and Gu, Yanggan and Li, Jungang and Wang, Jingyu and Jiang, Peijie and Liu, Aiwei and Liu, Jia and Hu, Xuming},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Main)},
year={2025}
}