Project Page

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, Xuming Hu

EMNLP 2025

Misleading-Scenario Evaluation

MUB measures whether an MLLM abandons a previously correct answer after receiving explicit or implicit deceptive cues.

2.5K Multimodal Uncertainty Benchmark

The benchmark covers 1.7K multiple-choice and 0.8K true-or-false items with difficulty splits calibrated by strong MLLMs.

Robustness Gains From Compact Tuning

A 2K-sample mixed-instruction fine-tuning recipe sharply reduces misleading rates while preserving base capability.

Overview

This project studies a failure mode that many users notice in practice but few benchmarks isolate cleanly: an MLLM gives the correct answer first, then abandons it after receiving a misleading cue. The paper names this phenomenon response uncertainty and argues that standard accuracy metrics miss a large part of the problem because they do not measure how stable a correct answer remains under deceptive instructions.

To study this systematically, the paper introduces a two-stage misleading-instruction pipeline and builds the Multimodal Uncertainty Benchmark (MUB). The benchmark is designed to quantify how easily models are pushed away from previously correct answers by explicit false hints or implicit contradictory context.

MUB measures how often a multimodal model reverses a previously correct answer after explicit or implicit misleading cues are introduced.

Response Uncertainty: the benchmark focuses on the specific case where a model already has the right answer but gives it up after being nudged by deceptive multimodal context.

Two-Stage Misleading Pipeline: the procedure first tests the original prompt, then injects misleading information and measures whether the answer flips from correct to incorrect or vice versa.

Robustness Gains: a compact mixed-instruction fine-tuning strategy reduces the explicit misleading rate to 6.97% and the implicit misleading rate to 32.77%.

Benchmark Signal: across nine datasets, open-source MLLMs overturn a previously correct answer in about 65% of cases after a single deceptive cue.

Paper Resource: the full benchmark and analysis are available on arXiv.

This is not just a robustness stress test bolted onto existing VQA benchmarks. The paper is explicitly about whether a model can hold onto a correct answer once misleading instructions appear, which makes it a useful complement to standard accuracy tables.

Method Pipeline

The procedure first queries a model on the original image-question pair. It then adds misleading information to create a second version of the prompt and measures whether the answer flips from correct to incorrect or vice versa. Using this framework, the authors curate a 2.5K-sample benchmark consisting of 1.7K multiple-choice questions and 0.8K true-or-false questions. MUB is further divided into low-, medium-, and high-difficulty groups according to how many strong MLLMs each example misleads.
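The two-stage procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Item` fields, the `Model` callable interface, the choice to normalize each rate by the size of its stage-one group, and the one-third and two-thirds cut-offs in `difficulty_group` are all assumptions made for the example. The toy model and the four items are made up purely to exercise the functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Item:
    question: str  # original question (the image is omitted in this sketch)
    answer: str    # ground-truth label, e.g. "A" or "True"
    cue: str       # misleading text appended in stage two


# Hypothetical model interface: prompt text in, answer label out.
Model = Callable[[str], str]


def misleading_rates(model: Model, items: Iterable[Item]) -> Dict[str, float]:
    """Return flip rates, each normalized by the size of its stage-one group."""
    t_total = t_flipped = f_total = f_flipped = 0
    for item in items:
        # Stage one: original prompt only.
        first_ok = model(item.question) == item.answer
        # Stage two: the same prompt with the cue appended.
        second_ok = model(f"{item.question}\n{item.cue}") == item.answer
        if first_ok:
            t_total += 1
            t_flipped += int(not second_ok)  # correct -> incorrect
        else:
            f_total += 1
            f_flipped += int(second_ok)      # incorrect -> correct
    return {
        "true_to_false": t_flipped / t_total if t_total else 0.0,
        "false_to_true": f_flipped / f_total if f_total else 0.0,
    }


def difficulty_group(num_misled: int, num_models: int) -> str:
    """Bucket an item by the share of models it misleads (placeholder cut-offs)."""
    share = num_misled / num_models
    if share < 1 / 3:
        return "low"
    if share < 2 / 3:
        return "medium"
    return "high"


def toy_model(prompt: str) -> str:
    """Toy stand-in: follows any hinted label, otherwise always answers "A"."""
    marker = "answer is "
    if marker in prompt:
        return prompt.split(marker, 1)[1][0]
    return "A"


# Made-up items that cover all four stage-one / stage-two outcomes.
items = [
    Item("Q1", "A", "The answer is B."),  # correct, then misled
    Item("Q2", "A", "The answer is A."),  # correct, stays correct
    Item("Q3", "B", "The answer is B."),  # wrong, then flipped to correct
    Item("Q4", "C", "The answer is B."),  # wrong, stays wrong
]
rates = misleading_rates(toy_model, items)
print(rates)
```

In a real evaluation the `Model` callable would wrap an MLLM that also receives the image, and the cue text would come from the benchmark's explicit or implicit misleading instructions rather than from a fixed template.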

Main Experimental Results

Across nine standard datasets, the paper reports that state-of-the-art open-source MLLMs overturn a previously correct answer in about 65% of cases after receiving a single deceptive cue.

In the large-scale benchmark sweep, the average true-to-false misleading rate is about 65.39% and the average false-to-true rate is about 83.35%, showing that model responses are highly unstable under misleading instructions.

On MUB, susceptibility is very high overall: misleading rates exceed 67.19% under explicit instructions and 80.67% under implicit instructions, and the selected uncertainty slices remain difficult for both open-source and closed-source model families.

A compact 2,000-sample mixed-instruction fine-tuning strategy dramatically improves robustness: the misleading rates drop to 6.97% for explicit cues and 32.77% for implicit cues, while consistency on highly deceptive inputs improves by nearly 29.37%.

The paper also reports slight accuracy gains on MUB and on additional benchmarks after robustness-oriented fine-tuning, which indicates that the defense does not simply trade away base capability.
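
The transition rates quoted above can be read as conditional flip frequencies. The formulation below is a sketch under one assumption, namely that each rate is normalized by the number of items in its stage-one group; the paper's exact definition may differ. Here $c_i^{(1)}, c_i^{(2)} \in \{0, 1\}$ are hypothetical indicators of whether item $i$ is answered correctly before and after the misleading cue, and $N$ is the number of items.

```latex
\mathrm{MR}_{T \to F}
  = \frac{\sum_{i=1}^{N} c_i^{(1)} \bigl(1 - c_i^{(2)}\bigr)}
         {\sum_{i=1}^{N} c_i^{(1)}},
\qquad
\mathrm{MR}_{F \to T}
  = \frac{\sum_{i=1}^{N} \bigl(1 - c_i^{(1)}\bigr)\, c_i^{(2)}}
         {\sum_{i=1}^{N} \bigl(1 - c_i^{(1)}\bigr)}
```

As a purely illustrative calculation with made-up counts: if a model answers 200 items correctly at stage one and 130 of those become incorrect after the cue, then $\mathrm{MR}_{T \to F} = 130 / 200 = 65\%$.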

Why These Results Matter

A high-accuracy model can still be unreliable if it is easy to push off course. This project turns that intuition into a measurable benchmark and gives the community a way to study uncertainty, susceptibility, and recovery under misleading conditions.

That matters for trustworthy multimodal systems, especially in scenarios where the prompt source may be noisy, adversarial, or simply wrong. MUB and the accompanying analysis make it easier to compare models not only by what they know, but by how firmly they can hold onto a correct answer.

BibTeX Citation

@inproceedings{dang2025exploring,
  title={Exploring Response Uncertainty in {MLLMs}: An Empirical Evaluation under Misleading Scenarios},
  author={Dang, Yunkai and Gao, Mengxi and Yan, Yibo and Zou, Xin and Gu, Yanggan and Li, Jungang and Wang, Jingyu and Jiang, Peijie and Liu, Aiwei and Liu, Jia and Hu, Xuming},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Main)},
  year={2025}
}