A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

Yunkai Dang*, Meiyi Zhu*, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, Yang Gao

* Equal contribution. Corresponding author. Correspondence to: yunkaidang@smail.nju.edu.cn, liwenbin@nju.edu.cn.

arXiv 2025

5,329 Ultra-High-Resolution Scenes

RSHR-Bench is built from full-scene remote sensing images whose long sides are at least 4,000 pixels, placing much stricter demands on visual grounding than lower-resolution benchmarks.

Language-Prior Resistant Evaluation

The benchmark uses adversarial filtering and human verification to reduce shortcut answers that strong text-only LLMs can exploit.

Perception, Reasoning, and Multi-Turn

It evaluates not only single-turn VQA but also captioning, open-ended reasoning, and multi-turn interaction in ultra-high-resolution settings.

Project Overview

RSHR-Bench is built around a direct critique of current remote sensing multimodal evaluation: many benchmarks appear to test visual reasoning, but in practice a strong text-only LLM can sometimes answer a large fraction of the questions without seeing the image at all. That means benchmark scores can exaggerate visual understanding and understate the role of language priors.

The paper introduces a new benchmark for ultra-high-resolution remote sensing MLLMs where the images are large, the tasks are more interaction-heavy, and the annotation pipeline explicitly tries to reduce answer shortcuts. The benchmark is designed to test whether a model can connect high-resolution visual evidence to perception, reasoning, and multi-turn interaction.
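
To make this concrete, here is a minimal sketch of such a text-only probe, which is the same idea that underlies both the adversarial filtering step and the text-only baselines reported below. The `query_llm` callable, the question schema, and the prompt format are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of a text-only "language-prior probe" (illustrative, not the
# paper's exact pipeline). Assumes a hypothetical query_llm(prompt) -> str
# callable backed by any text-only LLM; the model never sees the image.
from typing import Callable, Dict, List


def prior_probe(
    questions: List[Dict],            # each item: {"question", "options", "answer"}
    query_llm: Callable[[str], str],  # text-only LLM call (hypothetical interface)
) -> List[Dict]:
    """Return the questions a text-only model already answers correctly."""
    shortcut_items = []
    for item in questions:
        options = "\n".join(
            f"{label}. {text}" for label, text in item["options"].items()
        )
        prompt = (
            "Answer the multiple-choice question with a single option letter.\n"
            f"Question: {item['question']}\n{options}\nAnswer:"
        )
        prediction = query_llm(prompt).strip()[:1].upper()
        if prediction == item["answer"].upper():
            # Answerable without the image: candidate for rewriting or removal.
            shortcut_items.append(item)
    return shortcut_items


# Usage sketch: flag items for human review before they enter the benchmark.
# flagged = prior_probe(candidate_questions, query_llm=my_text_only_model)
# print(f"{len(flagged)} questions answerable without the image")
```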

Benchmark Design

RSHR-Bench contains 5,329 full-scene remote sensing images with a long side of at least 4,000 pixels, with some scenes reaching roughly 3 × 10^8 pixels. It includes 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or human-verified single-image evaluation pairs. The benchmark spans multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation, and covers nine perception categories together with four reasoning types.

Overview. RSHR-Bench is designed to evaluate visual understanding on genuinely large remote sensing scenes rather than low-resolution shortcuts.

Motivation. Existing remote-sensing MLLM benchmarks can still allow answers driven by language priors, making stricter high-resolution visual grounding necessary.

Benchmark Comparison. RSHR-Bench emphasizes full-scene ultra-high-resolution imagery, richer interactions, and stronger filtering against shortcut reasoning.

Language-Prior Resistance: RSHR-Bench is explicitly designed to reduce shortcut answers that strong text-only models can exploit without seeing the image.

Ultra-High-Resolution Scale: the benchmark contains 5,329 full-scene images, with long sides of at least 4,000 pixels and scenes reaching roughly 3 × 10^8 pixels.

Task Diversity: it evaluates multiple-choice VQA, open-ended VQA, captioning, and single-image evaluation across perception, reasoning, and multi-turn interaction.

Key Finding: text-only models can still reach 51.6% reasoning accuracy on XLRS-Bench, showing why stricter visual evaluation is necessary.

Paper Resource: the benchmark design and evaluation details are available on arXiv.

Main Empirical Findings

Experimental Results. The benchmark exposes clear gaps between text-prior behavior and grounded visual reasoning, especially on ultra-high-resolution remote-sensing tasks.

  • On the remote-sensing subset of MME-RealWorld, a text-only Llama3-8B model still answers 31.22% of the questions correctly after the image is removed. This directly shows that some existing tasks remain solvable through priors rather than vision.
  • On XLRS-Bench, the issue becomes even more obvious: text-only Qwen3-8B reaches 51.6% average reasoning accuracy, surpassing the image-conditioned GPT-4o baseline at 45.2%. The same text-only model reaches 72.0% on anomaly detection and 77.0% on existence-and-counting reasoning, while text-only Llama3-8B achieves 48.0% on route planning.
  • On RSHR-Bench itself, the paper reports that open-source models mostly remain around 25% accuracy on reasoning, indicating that the new benchmark is substantially more demanding than many earlier datasets. Among open-source systems, VILA-HD is highlighted as notably stronger on reasoning, reaching an average of 58.0 in the reported setting.
  • Captioning performance remains low even for strong closed-source models. For example, GPT-4o-mini achieves the best overall caption metrics in the main comparison, but the reported BLEU-4 is still only 4.8 (a brief sketch of this metric follows the list). This is a useful signal that the benchmark is not easy to game with generic language fluency alone.
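
As a side note on the caption numbers above, the following is a minimal sketch of how BLEU-4 is typically computed, assuming NLTK's implementation; the tokenization and the exact evaluation protocol used in the paper may differ.

```python
# Minimal BLEU-4 sketch using NLTK (illustrative; the paper's tokenization and
# reference handling may differ).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


def bleu4(references: list[list[str]], hypotheses: list[str]) -> float:
    """references[i]: reference captions for image i; hypotheses[i]: model caption."""
    ref_tokens = [[r.lower().split() for r in refs] for refs in references]
    hyp_tokens = [h.lower().split() for h in hypotheses]
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    return corpus_bleu(
        ref_tokens,
        hyp_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights up to 4-grams => BLEU-4
        smoothing_function=smooth,
    )


# Toy example (hypothetical captions). A score of about 0.048 would correspond
# to the "4.8" above if scores are reported as percentages.
refs = [["an aerial view of a large port with container ships and cranes"]]
hyps = ["a satellite image of a harbor with several ships"]
print(f"BLEU-4: {bleu4(refs, hyps):.3f}")
```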

Why This Benchmark Matters

A useful benchmark should measure the capability we care about, not just the ability to exploit annotation artifacts. RSHR-Bench matters because it makes that distinction explicit. It is large enough, high-resolution enough, and carefully filtered enough to expose whether a model is actually using the image.

For future remote sensing MLLMs, this benchmark is valuable both as an evaluation tool and as a design pressure. It favors models that can preserve fine-grained visual grounding over long contexts and discourages systems that rely too heavily on language priors.

BibTeX Citation

BibTeX

@article{dang2025RSHR,
  title={A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs},
  author={Dang, Yunkai and Zhu, Meiyi and Wang, Donghao and Zhang, Yizhuo and Yang, Jiacheng and Fan, Qi and Yang, Yuekun and Li, Wenbin and Miao, Feng and Gao, Yang},
  journal={arXiv preprint arXiv:2512.17319},
  year={2025}
}