Project Page

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, Xuming Hu

EMNLP 2025

Misleading-Scenario Evaluation

MUB measures whether an MLLM abandons a previously correct answer after receiving explicit or implicit deceptive cues.

2.5K Multimodal Uncertainty Benchmark

The benchmark covers 1.7K multiple-choice and 0.8K true-or-false items with difficulty splits calibrated by strong MLLMs.

Robustness Gains From Compact Tuning

A 2K-sample mixed-instruction fine-tuning recipe sharply reduces misleading rates while preserving base capability.

Overview

Multimodal large language models can answer visual questions correctly, but that does not always mean the answer is stable. This project studies a concrete failure mode: an MLLM gives the correct response on the original image-question pair, then abandons that answer after a misleading cue is inserted into the prompt.

The paper calls this behavior response uncertainty. Instead of only measuring whether a model can solve a benchmark item once, the evaluation asks whether the model can preserve an originally correct answer when it is confronted with explicit false hints or implicit contextual contradictions.

Response uncertainty motivation

Motivation. The consistency histograms show that misleading-prone examples expose unstable responses, and that targeted fine-tuning improves consistency most strongly on high-misleading-rate data.

Response Uncertainty: the benchmark focuses on correct-to-incorrect flips, where a model already has the right answer but gives it up after a misleading instruction.

Multimodal Uncertainty Benchmark: MUB is curated from uncertainty-prone samples and stratified into low-, medium-, and high-difficulty groups according to how many strong MLLMs are misled.

Explicit and Implicit Misleading: the evaluation covers direct false-answer hints as well as contextual contradictions that nudge the model toward a wrong answer less directly.

Robustness Gains: a compact 2,000-sample mixed-instruction fine-tuning strategy sharply reduces misleading rates while slightly improving standard benchmark accuracy.

Benchmark Signal: across nine datasets, twelve open-source MLLMs overturn a previously correct answer in about 65% of cases after a single deceptive cue.

Paper Resource: the full benchmark and analysis are available on arXiv.

Method Pipeline

MUB method overview

Method Overview. The pipeline first extracts misleading-prone examples from widely used multimodal benchmarks, builds the Multimodal Uncertainty Benchmark (MUB), evaluates open-source and closed-source MLLMs with explicit and implicit misleading instructions, and then fine-tunes open-source models with mixed-instruction data to reduce response uncertainty.

The evaluation starts by querying a model on the original image-question pair. After the initial response is obtained, the prompt is modified with a misleading instruction and the model is queried again. The key metric is the misleading rate, which measures how often an originally correct answer flips to an incorrect one.
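
The two-pass protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `Trial` record and the `misleading_rate` function are hypothetical names, and the sketch assumes the rate is computed only over items the model answered correctly on the first pass.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    gold: str    # ground-truth answer label
    before: str  # model answer on the original image-question pair
    after: str   # model answer after the misleading instruction is added


def misleading_rate(trials):
    """Share of originally correct answers that flip to an incorrect one.

    Assumption: items answered incorrectly on the first pass are excluded
    from the denominator.
    """
    correct_before = [t for t in trials if t.before == t.gold]
    if not correct_before:
        return 0.0
    flipped = sum(1 for t in correct_before if t.after != t.gold)
    return flipped / len(correct_before)


trials = [
    Trial(gold="A", before="A", after="B"),  # correct -> incorrect (misled)
    Trial(gold="A", before="A", after="A"),  # correct -> correct (stable)
    Trial(gold="C", before="B", after="C"),  # initially wrong, not counted
    Trial(gold="D", before="D", after="A"),  # correct -> incorrect (misled)
]
rate = misleading_rate(trials)  # 2 of the 3 initially correct answers flip
```

Restricting the denominator to initially correct items keeps the metric separate from plain accuracy: a model that is wrong from the start cannot be "misled" in this sense.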

Using this protocol, the authors collect uncertainty-prone examples from nine widely used multimodal benchmarks: MME, SEED, MMBench, MMStar, MMMU, ScienceQA, AI2D, MathVista, and ConBench. MUB contains 2.5K samples, comprising 1.7K multiple-choice questions and 0.8K true-or-false questions, and is grouped into low-, medium-, and high-difficulty splits.
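
As a rough illustration of how a difficulty split could be derived from the number of strong MLLMs misled on a sample, consider the sketch below. The function name `difficulty_split` and the one-third and two-thirds cut-offs are assumptions made for this example; the page does not state the thresholds the authors use.

```python
def difficulty_split(num_misled, num_models, low_max=1 / 3, high_min=2 / 3):
    """Assign a difficulty label from the share of models misled on one sample.

    The default cut-offs are hypothetical placeholders, not the paper's values.
    """
    if num_models <= 0:
        raise ValueError("num_models must be positive")
    if not 0 <= num_misled <= num_models:
        raise ValueError("num_misled must lie between 0 and num_models")
    share = num_misled / num_models
    if share <= low_max:
        return "low"
    if share >= high_min:
        return "high"
    return "medium"


# Example with six reference models (the count is also an assumption).
labels = [difficulty_split(n, 6) for n in (1, 3, 5)]
```

Under these assumed cut-offs, a sample that misleads one of six models lands in the low split, three of six in the medium split, and five of six in the high split.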

The paper then evaluates 12 open-source and 5 closed-source MLLMs under both explicit misleading instructions, such as a direct false-answer hint, and implicit misleading instructions, such as a contextual statement that contradicts the visual evidence.
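
The difference between the two instruction types can be made concrete with a small prompt-building sketch. The templates, function names, and the example question below are hypothetical and only illustrate the idea; they are not the wording used in MUB.

```python
def format_options(options):
    """Render answer options as 'A. text' lines."""
    return "\n".join(f"{label}. {text}" for label, text in options.items())


def explicit_misleading_prompt(question, options, wrong_label):
    """Append a direct false-answer hint (hypothetical template)."""
    if wrong_label not in options:
        raise ValueError("wrong_label must be one of the option labels")
    return (
        f"{question}\n{format_options(options)}\n"
        f"Hint: the correct answer is {wrong_label}."
    )


def implicit_misleading_prompt(question, options, contradicting_context):
    """Prepend a statement that contradicts the image without naming an answer."""
    return f"{contradicting_context}\n{question}\n{format_options(options)}"


question = "What color is the car in the image?"
options = {"A": "Red", "B": "Blue"}  # assume the image shows a red car

explicit = explicit_misleading_prompt(question, options, wrong_label="B")
implicit = implicit_misleading_prompt(
    question, options, "The blue car in this photo was parked here yesterday."
)
```

The explicit variant names the wrong option outright, while the implicit variant only plants a false premise and leaves the model to draw the wrong conclusion itself.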

Main Experimental Results

Misleading rate results across datasets and instruction types

Dataset and Instruction Trends. The radar plots summarize misleading rates across nine datasets, while the scatter plots compare explicit and implicit misleading behavior across different MLLMs and prompt variants.

MUB misleading rate comparison before fine-tuning

MUB Evaluation Before Fine-Tuning. Both closed-source and open-source MLLMs show high misleading rates, especially on high-difficulty examples and under implicit misleading instructions.

MUB misleading rate comparison after fine-tuning

MUB Evaluation After Fine-Tuning. Fine-tuning with a compact mixed-instruction dataset substantially lowers correct-to-incorrect misleading rates across model families and difficulty levels.

  • Across nine standard datasets, the paper reports that twelve state-of-the-art open-source MLLMs overturn a previously correct answer in about 65% of cases after receiving a single deceptive cue.
  • On MUB, misleading rates increase with difficulty. Before fine-tuning, the average explicit misleading rate rises from 45.85% on low-difficulty samples to 86.79% on high-difficulty samples, while the average implicit misleading rate reaches 87.68% on high-difficulty samples.
  • Across the benchmark analysis, explicit misleading instructions yield misleading rates above 67.19% and implicit misleading instructions yield rates above 80.67%, showing that subtle contradictions can be even more disruptive than direct false hints.
  • After robustness-oriented fine-tuning, the average explicit misleading rate drops to 6.97% and the average implicit misleading rate drops to 32.77%. In the MUB results table, the post-tuning explicit averages are 4.8%, 8.7%, and 7.4% on low-, medium-, and high-difficulty samples, respectively.
  • The same fine-tuning recipe boosts consistency by nearly 29.37% on highly deceptive inputs and slightly improves standard benchmark accuracy, indicating that the mitigation does not simply trade away base capability.
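
As a quick arithmetic check on the post-tuning figures in the list above, the reported 6.97% explicit average is consistent with an unweighted mean of the three per-split explicit values. The snippet below only reproduces that arithmetic; whether the authors computed the average this way is an assumption.

```python
# Post-tuning explicit misleading rates (%) per difficulty split, as quoted above.
post_tuning_explicit = {"low": 4.8, "medium": 8.7, "high": 7.4}

# Unweighted mean across the three splits (assumed aggregation).
mean_explicit = sum(post_tuning_explicit.values()) / len(post_tuning_explicit)

print(round(mean_explicit, 2))  # 6.97
```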

Qualitative Examples

Explicit misleading examples from MUB

Explicit Misleading Samples. These examples show how a direct false hint can conflict with the visual evidence and pressure a model to abandon a correct answer.

The qualitative cases illustrate why this benchmark is different from ordinary accuracy evaluation. The question itself is still answerable from the image, but the instruction stream includes an adversarial cue. A reliable MLLM should be able to compare that cue against the visual evidence instead of treating the prompt as ground truth.

Why These Results Matter

A high-accuracy model can still be unreliable if it is easy to push off course. This project turns that concern into a measurable benchmark and gives the community a way to study uncertainty, susceptibility, and recovery under misleading conditions.

That matters for trustworthy multimodal systems, especially in settings where the prompt source may be noisy, adversarial, or simply wrong. MUB and the accompanying analysis make it easier to compare models not only by what they know, but by how firmly they can hold onto a correct answer when misleading information appears.

BibTeX Citation

@inproceedings{dang2025exploring,
  title={Exploring Response Uncertainty in {MLLMs}: An Empirical Evaluation under Misleading Scenarios},
  author={Dang, Yunkai and Gao, Mengxi and Yan, Yibo and Zou, Xin and Gu, Yanggan and Li, Jungang and Wang, Jingyu and Jiang, Peijie and Liu, Aiwei and Liu, Jia and Hu, Xuming},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Main)},
  year={2025}
}