Project Page

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

Yunkai Dang^*, Minxin Dai^*, Yuekun Yang, Zhangnan Li, Wenbin Li^†, Feng Miao, Yang Gao

^* Equal contribution. ^† Corresponding author. Correspondence to: liwenbin.nju@gmail.com, yunkaidang1@gmail.com.

ICML 2026

Query-Guided Token Compression

UHR-BAT allocates the visual token budget according to the current instruction so that small but decisive evidence is preserved.

Region-Faithful Preserve and Merge

The framework keeps informative regional tokens and merges redundant ones, reducing cost without destroying spatial structure.

Strong UHR Remote Sensing Results

The model reports 44.0 weighted average on XLRS-Bench and strong gains on MMERealworld-RS and RSHR-Bench.

Paper UHR Code RL-MIND

RL-MIND

Demo (Coming Soon)

Weights (Coming Soon) Personal Homepage

Overview

Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget.e Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at this https URL.

UHR-BAT keeps query-relevant evidence while aggressively compressing redundant visual tokens in ultra-high-resolution scenes.

UHR-BAT Framework: We propose UHR-BAT, a token compression framework designed for ultra-high-resolution remote-sensing MLLMs under strict context budgets.

Query-Guided Multi-Scale Input: We propose a query-guided, multi-scale input mechanism to integrate text-derived global priors and capture both holistic context and fine-grained details.

Region-Wise Preserve and Merge: We propose region-wise preserve and merge strategies to preserve salient local evidence and aggregate redundant background into compact representatives.

Empirical Results: Experiments across standard benchmarks confirm that UHR-BAT establishes a new state of the art for efficient UHR understanding, outperforming existing methods under strict token budgets.

Method Pipeline

Method Overview. We encode a high-resolution remote sensing image at multiple scales using a frozen ViT, apply scale-specific positional embeddings to distinguish visual tokens across scales, and obtain an anchor-scale query-to-vision attention map from the MLLM interface. A region partition, induced by SAM or feature-plus-coordinate clustering, enables region-wise preserve-and-merge, while remaining tokens are merged via average pooling to retain coarse context. A final top-k step enforces the per-scale token budget before the processed sequence is fed into the LLM for answer generation.

UHR-BAT encodes ultra-high-resolution remote-sensing images at multiple scales, uses query-to-vision attention to identify the anchor-scale evidence relevant to the current instruction, and then performs region-wise preserve-and-merge so that salient local structures are retained while redundant background tokens are compressed into compact representatives. The final top-k selection step guarantees that each scale respects the preset token budget.

Main Experimental Results

XLRS-Bench Results. This table summarizes perception and reasoning sub-task performance under strict token budgets and shows that UHR-BAT achieves a new state of the art with 44.0 weighted average accuracy.

RSHR-Bench Results. The reported Perception and Reasoning scores show that UHR-BAT remains strong across diverse remote-sensing question types while operating under a constrained visual token budget.

On XLRS-Bench, the paper reports a state-of-the-art weighted average score of 44.0 across perception and reasoning tasks. This corresponds to gains of +22.8 over GeoChat, +20.4 over VHM, +21.0 over GPT-4o, +8.9 over Claude 3.7 Sonnet, and +2.9 over LLaVA-OneVision-72B.
On MMERealworld-RS, UHR-BAT achieves the best mean score of 33.33. Relative to strong remote-sensing baselines, the reported gains are +12.01 over GeoChat, +9.15 over VHM, and +4.92 over GeoLLaVA-8K. It also exceeds the GPT-4o baseline by +5.91, with especially large improvements on Position and Color.
On RSHR-Bench, the method reaches 29.2 on Perception and 45.0 on Reasoning, giving the best overall averages among the open-source and remote-sensing baselines reported in the paper. The perception score is also higher than VILA-HD by +1.2.
The efficiency ablation is also informative: on XLRS-Bench, moving from a 2K to 4K context gives a clear gain for Qwen2.5-VL-7B, from 38.08 to 40.58, while pushing to 8K brings almost no further benefit. The paper uses this to argue that compressed 4K-context processing is the practical operating point.

Qualitative Examples

Attention Maps. We display original images and enlarged views alongside their corresponding attention maps. The heatmaps show that the model focuses on semantically relevant regions, such as specific vehicles or infrastructure aligned with the textual query, while assigning lower importance to the background.

Response Example A. This sample illustrates the versatility of our method in handling spatial reasoning and fine-grained attribute recognition on small objects in ultra-high-resolution scenes.

Response Example B. This sample further shows that UHR-BAT can answer query-specific reasoning questions while preserving the small visual evidence required for precise recognition.

Why These Results Matter

The main takeaway is that token compression for remote sensing cannot be treated as a purely generic speed trick. The model has to remain sensitive to tiny, sparse, and spatially significant evidence. UHR-BAT shows that budget-aware compression can still improve accuracy when the compression rule is aligned with query semantics and region structure.

For practical Earth observation systems, this matters because the deployment bottleneck is usually not just model quality, but quality under strict memory and latency limits. UHR-BAT is therefore useful not only as a better benchmark result, but also as a stronger recipe for building scalable remote sensing MLLMs.

BibTeX Citation

BibTeX

@article{dang2026uhr,
  title={UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing},
  author={Dang, Yunkai and Dai, Minxin and Yang, Yuekun and Li, Zhangnan and Li, Wenbin and Miao, Feng and Gao, Yang},
  journal={arXiv preprint arXiv:2604.13565},
  year={2026}
}