FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang*, Donghao Wang*, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao

* Equal contribution. Corresponding author. Correspondence to: yunkaidang@smail.nju.edu.cn, liwenbin@nju.edu.cn.

arXiv 2025

Multi-Feature Fusion for RS Scenes

The model combines global context with fine-grained local features so that small structures and complex scene layouts are preserved.

Recurrent Visual Feature Injection

Visual evidence is injected back into the language model during generation to reduce visual forgetting in long reasoning chains.

Strong Results Across Three Tasks

FUSE-RSVLM reports 65.76% VQA accuracy, 74.51% average classification accuracy, and state-of-the-art captioning results on multiple RS benchmarks.

Project Overview

FUSE-RSVLM is motivated by a recurring mismatch between generic vision-language models and remote sensing imagery. Earth observation scenes differ from natural images in scale, spatial layout, object density, and the importance of very small structures. As a result, models that work well on natural-image VQA or captioning often fail to preserve fine-grained evidence or gradually lose visual grounding during long language decoding.

The paper addresses this by building a remote-sensing-oriented VLM that performs stronger multi-scale feature extraction and repeatedly re-injects visual evidence into the language model. The goal is not only better static image encoding, but also less visual forgetting during generation.

Method Pipeline

The model learns and fuses complementary visual representations at multiple scales, combining global scene context with localized detail features. This is especially important in remote sensing, where a large scene overview is necessary but tiny local structures often determine the answer. On top of that, FUSE-RSVLM uses recurrent visual feature injection so that visual evidence is not consumed only once at the beginning of decoding, but can keep influencing later reasoning steps.
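To make the fusion step concrete, the sketch below combines a pooled global scene embedding with fine-grained local patch tokens through cross-attention. It is a minimal PyTorch-style sketch under assumed shapes, module names, and a cross-attention fusion operator; it is not the paper's exact architecture.

import torch
import torch.nn as nn


class MultiScaleFusion(nn.Module):
    """Fuse a global scene embedding with fine-grained local patch tokens."""

    def __init__(self, global_dim: int, local_dim: int, hidden_dim: int):
        super().__init__()
        self.global_proj = nn.Linear(global_dim, hidden_dim)
        self.local_proj = nn.Linear(local_dim, hidden_dim)
        # Cross-attention lets every local token attend to the projected
        # global context before the two streams are merged.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, global_feat, local_feat):
        # global_feat: (B, Dg) pooled scene-level embedding
        # local_feat:  (B, N, Dl) patch tokens carrying small-structure detail
        g = self.global_proj(global_feat).unsqueeze(1)    # (B, 1, H)
        loc = self.local_proj(local_feat)                 # (B, N, H)
        attended, _ = self.cross_attn(query=loc, key=g, value=g)
        fused = self.norm(loc + attended)                 # residual fusion
        # Prepend the global token so downstream layers see both scales.
        return torch.cat([g, fused], dim=1)               # (B, N + 1, H)


fusion = MultiScaleFusion(global_dim=1024, local_dim=768, hidden_dim=512)
scene = torch.randn(2, 1024)        # hypothetical global scene embeddings
patches = torch.randn(2, 196, 768)  # hypothetical 14x14 local patch tokens
print(fusion(scene, patches).shape)  # torch.Size([2, 197, 512])

Prepending the global token is just one simple way to keep both scales visible to the language model; the paper's actual fusion operator may differ.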

Figure: FUSE-RSVLM overview. The model fuses multi-scale remote sensing features and repeatedly injects them into the language model to reduce visual forgetting.

Remote Sensing Mismatch: generic vision-language models often miss fine-grained geospatial evidence and gradually lose visual grounding during long decoding chains.

Multi-Feature Fusion: FUSE-RSVLM combines global scene context with localized detail features so that small but important structures remain visible to the model.

Recurrent Visual Injection: visual evidence is repeatedly fed back into the language model to reduce visual forgetting during captioning, VQA, and classification (a minimal code sketch of this idea follows these key points).

Strong Results: the model reports 65.76% VQA accuracy and 74.51% average Top-1 accuracy across remote-sensing classification benchmarks.

Paper Resource: the full method and experiments are available on arXiv.
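The recurrent visual injection point above can be illustrated with a second sketch: a toy decoder stack that re-reads the fused visual tokens every few layers through gated cross-attention, instead of consuming them only once. The injection sites, the gating, and the use of a plain Transformer block as a stand-in for a language-model layer are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn


class VisualInjectionLayer(nn.Module):
    """Re-inject visual tokens into the text hidden states at a given depth."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          batch_first=True)
        # A learnable gate initialized at zero lets training decide how much
        # fresh visual evidence to mix back in at this layer.
        self.gate = nn.Parameter(torch.zeros(1))
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, hidden, visual):
        # hidden: (B, T, H) token states inside the language model
        # visual: (B, V, H) fused visual tokens (e.g. from the fusion sketch)
        attended, _ = self.attn(query=hidden, key=visual, value=visual)
        return self.norm(hidden + torch.tanh(self.gate) * attended)


class ToyDecoderWithInjection(nn.Module):
    """A toy stack that re-reads visual evidence every `period` layers.
    Causal masking is omitted for brevity."""

    def __init__(self, hidden_dim: int = 512, layers: int = 6, period: int = 2):
        super().__init__()
        self.period = period
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
            for _ in range(layers)
        )
        self.inject = nn.ModuleList(
            VisualInjectionLayer(hidden_dim) for _ in range(layers // period)
        )

    def forward(self, text_states, visual_tokens):
        h = text_states
        for i, block in enumerate(self.blocks):
            if i % self.period == 0:  # periodic re-injection of visual tokens
                h = self.inject[i // self.period](h, visual_tokens)
            h = block(h)
        return h


model = ToyDecoderWithInjection()
text_states = torch.randn(2, 32, 512)     # hypothetical decoder token states
visual_tokens = torch.randn(2, 197, 512)  # e.g. output of the fusion sketch
print(model(text_states, visual_tokens).shape)  # torch.Size([2, 32, 512])

The zero-initialized tanh gate is a common stabilization trick so that injection layers start as near no-ops and are learned gradually; whether FUSE-RSVLM gates its injections this way is not something this sketch claims.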

Main Experimental Results

  • On the VRSBench VQA benchmark, FUSE-RSVLM ranks first overall with an average accuracy of 65.76%. The paper reports improvements of +4.93 over the strongest open-source general VLM, +3.67 over the best closed-source system in the comparison, +14.25 over VHM, and +21.06 over SkySenseGPT.
  • On the same VQA evaluation, the model achieves top results on key perception-oriented tasks, including 65.84 on Category and 90.23 on Existence. The authors attribute this to better contextual retention from the visual feature injection mechanism.
  • Across seven remote-sensing classification datasets, the model ranks first overall with an average Top-1 accuracy of 74.51%. This is +2.68 over the strongest remote-sensing baseline and +11.73 over the strongest open-source general VLM. The paper highlights gains such as +3.11 on AID and +4.63 on SIRI-WHU.
  • On remote-sensing image captioning, the model achieves new state-of-the-art results on UCM-Captions across all four metrics, with reported gains of +20.15 BLEU-4, +45.39 METEOR, +260.20 CIDEr, and +56.42 ROUGE-L over the best prior result. The paper also reports new best results on Sydney-Captions, with gains of +12.20, +17.95, +121.58, and +28.10 on the same four metrics.
  • On the VRSBench-Cap split, the model reaches 38.64 BLEU-4, 28.01 METEOR, 38.64 CIDEr, and 28.01 ROUGE-L. The reported gains include +18.35 BLEU-4, +3.00 METEOR, and +7.72 ROUGE-L over prior methods, indicating that the model generates more semantically grounded captions rather than only maximizing n-gram overlap.

Why These Results Matter

FUSE-RSVLM is not just another remote-sensing fine-tune. Its main contribution is to treat visual grounding as a process that must be maintained throughout the reasoning chain rather than a one-time encoding step, which is a better fit for captioning, VQA, and classification over high-resolution geospatial imagery.

The experimental results also show that remote-sensing-specific modeling choices still matter even in the era of general VLMs. Multi-scale visual fusion and recurrent grounding remain strong advantages when the visual signal is dense, fine-grained, and spatially structured.

BibTeX Citation

@article{dang2025fusersvlm,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}