FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang*, Donghao Wang*, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao

* Equal contribution. Corresponding author. Correspondence to: yunkaidang@smail.nju.edu.cn, liwenbin@nju.edu.cn.

arXiv 2025

Multi-Feature Fusion for RS Scenes

The model combines global context with fine-grained local features so that small structures and complex scene layouts are preserved.

Recurrent Visual Feature Injection

Visual evidence is injected back into the language model during generation to reduce visual forgetting in long reasoning chains.

Strong Results Across Three Tasks

FUSE-RSVLM reports 65.76% VQA accuracy, 74.51% average classification accuracy, and state-of-the-art captioning results on multiple RS benchmarks.

Overview

Remote sensing imagery differs sharply from natural images: scenes are viewed from a nadir perspective, objects are small and dense, spatial layout is highly structured, and thin elements such as roads, bridges, ships, and vehicles can disappear after ordinary resizing. Existing remote sensing VLMs therefore face two coupled problems: they often fail to extract fine-grained local visual features, and they can suffer from visual forgetting as static visual tokens pass through deep language-centric decoding layers.

In this paper, we propose FUSE-RSVLM, whose core model is MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model. The method learns multi-scale visual representations, combines global context with local details, and recurrently injects visual evidence into selected LLM layers. It is instruction-tuned with a 293K-sample remote-sensing instruction corpus covering captioning, VQA, visual grounding, scene classification, instruction QA, and detection.

MF-RSVLM comparison across remote sensing tasks

Task-Level Comparison. MF-RSVLM shows strong results across scene classification, single-image VQA, and image captioning, indicating that multi-feature fusion improves both perception and generation in remote-sensing scenes.

Remote Sensing Mismatch: Generic VLMs struggle with nadir-view imagery, tiny objects, dense layouts, and geospatial structures that are easily lost under fixed low-resolution encoding.

Multi-Scale Feature Extraction: MF-RSVLM combines low-resolution global tokens with high-resolution sliding-window detail stacks to preserve both holistic scene context and local evidence.

Recurrent Visual Injection: A router and gated injection module repeatedly writes relevant detail features into selected LLM layers, reducing visual forgetting during generation.

Strong Results: Experiments report 65.76% VRSBench VQA accuracy, 74.51% average classification accuracy, and state-of-the-art captioning on UCM-Captions and Sydney-Captions.

Method Pipeline

MF-RSVLM method overview

Method Overview. MF-RSVLM takes a low-resolution 336x336 image for global context and a high-resolution 672x672 image for local detail extraction. Multi-sized sliding windows generate local patches, the shared vision encoder builds a high-resolution feature canvas, and detail stacks are fused and injected into selected LLM layers through a gate.

The model chains a CLIP ViT-L/14@336 vision encoder, an MLP projector, and a Vicuna-v1.5-7B LLM. For global context, the image is resized to the 336x336 low-resolution view and encoded into global visual tokens. For local detail, the image is resized to a 672x672 canvas, split into overlapping windows of several sizes, such as 336x336 and 168x168, and processed by the shared vision encoder. Features from ViT layers 8, 16, and 24 are scattered back onto a high-resolution feature canvas, then sampled into ordered detail stacks.
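As a rough illustration of the window layout described above, the following Python sketch enumerates overlapping 336x336 and 168x168 windows on the 672x672 canvas. The 50% overlap (strides of 168 and 84 pixels), the helper name sliding_windows, and the 48x48 canvas grid are assumptions made for illustration; the text does not state the stride or the feature-canvas resolution used by MF-RSVLM.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def sliding_windows(canvas: int, window: int, stride: int) -> List[Box]:
    """Enumerate square windows that cover a canvas x canvas image."""
    if window > canvas or (canvas - window) % stride != 0:
        raise ValueError("window and stride must tile the canvas exactly")
    starts = range(0, canvas - window + 1, stride)
    # Row-major order: top-to-bottom, then left-to-right.
    return [(x, y, x + window, y + window) for y in starts for x in starts]


CANVAS = 672  # high-resolution canvas size from the text
PATCH = 14    # patch size of ViT-L/14

# Assumption: 50% overlap between neighbouring windows.
windows_336 = sliding_windows(CANVAS, 336, stride=168)
windows_168 = sliding_windows(CANVAS, 168, stride=84)

# A 336x336 input to ViT-L/14 yields 24 x 24 patch tokens.
tokens_per_336_input = (336 // PATCH) ** 2

# Assumption: the feature canvas has one cell per 14x14 pixel patch.
canvas_cells = (CANVAS // PATCH) ** 2

print(len(windows_336), len(windows_168), tokens_per_336_input, canvas_cells)
```

Under these assumed strides, every pixel of the canvas falls inside at least one window of each size, so features scattered back onto the canvas leave no gaps; cells covered by several windows would need a merge rule such as averaging, which the text does not specify.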

At LLM layers 2, 4, 6, and 8, a lightweight router selects relevant detail stacks conditioned on the current visual stream. A gate then controls how much of the fused detail is written back into the visual hidden states. This design keeps the representation visually grounded across decoding instead of relying on a one-time visual prefix.
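The routing and gating steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the dot-product softmax router, the scalar sigmoid gate, the additive write-back, and the names route, gated_inject, and run_layers are all assumptions chosen to make the idea concrete. Only the injection layers (2, 4, 6, 8) come from the text.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]

INJECT_LAYERS = {2, 4, 6, 8}  # injection layers named in the text


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(query: Vector, stacks: Sequence[Vector]) -> Vector:
    """Assumed router: softmax over dot-product similarity, then a weighted mix."""
    scores = [sum(q * s for q, s in zip(query, stack)) for stack in stacks]
    weights = softmax(scores)
    return [
        sum(w * stack[i] for w, stack in zip(weights, stacks))
        for i in range(len(query))
    ]


def gated_inject(hidden: Vector, detail: Vector, gate_logit: float) -> Vector:
    """Assumed gate: a scalar sigmoid that scales the detail added to the hidden state."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * d for h, d in zip(hidden, detail)]


def run_layers(
    hidden: Vector,
    stacks: Sequence[Vector],
    num_layers: int = 8,
    gate_logit: float = 0.0,
) -> Tuple[Vector, List[int]]:
    """Walk through the layers, re-injecting routed detail at the selected ones."""
    injected_at = []
    for layer in range(1, num_layers + 1):
        # The transformer layer itself is omitted (identity) in this sketch.
        if layer in INJECT_LAYERS:
            detail = route(hidden, stacks)
            hidden = gated_inject(hidden, detail, gate_logit)
            injected_at.append(layer)
    return hidden, injected_at


final_hidden, layers_used = run_layers([1.0, 0.0], [[0.0, 1.0]])
print(final_hidden, layers_used)
```

In a trained model the gate would be a learned function of the hidden state rather than a fixed logit. A strongly negative logit closes the gate and leaves the visual hidden states almost unchanged, which is the behaviour of a one-time visual prefix.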

Main Experimental Results

VRSBench VQA benchmark results

VRSBench VQA Results. MF-RSVLM ranks first overall with 65.76% average accuracy across Category, Existence, Position, Quantity, Scene, Color, Image, Shape, and Direction tasks.

Remote sensing image captioning results

Image Captioning Results. Across five remote-sensing captioning benchmarks, MF-RSVLM reports new state-of-the-art results on UCM-Captions and Sydney-Captions and strong METEOR/ROUGE-L performance on the remaining datasets.

  • On VRSBench VQA, MF-RSVLM ranks first overall with an average accuracy of 65.76%. The reported gains are +4.93 over the strongest open-source general VLM, +3.67 over the best closed-source system, +14.25 over VHM, and +21.06 over SkySenseGPT.
  • On the same VQA benchmark, MF-RSVLM achieves leading scores on key perception tasks, including 65.84 on Category and 90.23 on Existence. The paper attributes this to better contextual retention from recurrent visual feature injection.
  • Across seven remote-sensing classification datasets, MF-RSVLM reports the best macro-average Top-1 accuracy of 74.51%. This is +2.68 over the strongest remote-sensing baseline, LHRS-Bot, and +11.73 over InternVL3.5, the strongest open-source general VLM in the comparison.
  • On classification, the method is especially strong on AID, NWPU-RESISC45, and METER-ML, reaching 94.37, 94.29, and 74.87 Top-1 accuracy respectively. The paper highlights gains of +3.11 on AID and +4.63 on SIRI-WHU over LHRS-Bot.
  • On remote-sensing image captioning, MF-RSVLM establishes new state-of-the-art results on UCM-Captions across all four metrics, with 79.92 BLEU-4, 89.47 METEOR, 387.90 CIDEr, and 88.51 ROUGE-L. The reported gains over the best prior results are +20.15, +45.39, +260.20, and +56.42.
  • On Sydney-Captions, MF-RSVLM also reports new best results with 56.21 BLEU-4, 72.86 METEOR, 242.48 CIDEr, and 71.85 ROUGE-L, corresponding to gains of +12.20, +17.95, +121.58, and +28.10.
  • On RSVQA-LRBEN, MF-RSVLM reaches 89.69% average accuracy, with 90.21% on Presence and 89.16% on Comparison. On VRSBench-Cap, it obtains 38.64 BLEU-4, 28.01 METEOR, 38.64 CIDEr, and 28.01 ROUGE-L.
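The gains quoted in the list above imply the comparison scores they were measured against. The short Python sketch below recovers those implied scores by subtraction and checks that the RSVQA-LRBEN average agrees with the mean of its two question types. The implied values are derived here from the stated numbers, not quoted from the paper's tables, and treating the RSVQA-LRBEN average as an unweighted mean is an assumption.

```python
def implied_baseline(score: float, gain: float) -> float:
    """Score of the compared system implied by a reported gain, in points."""
    return round(score - gain, 2)


VQA_AVG = 65.76  # VRSBench VQA average accuracy
vqa_baselines = {
    "strongest open-source general VLM": implied_baseline(VQA_AVG, 4.93),
    "best closed-source system": implied_baseline(VQA_AVG, 3.67),
    "VHM": implied_baseline(VQA_AVG, 14.25),
    "SkySenseGPT": implied_baseline(VQA_AVG, 21.06),
}

CLS_AVG = 74.51  # macro-average Top-1 over seven classification datasets
cls_baselines = {
    "LHRS-Bot": implied_baseline(CLS_AVG, 2.68),
    "InternVL3.5": implied_baseline(CLS_AVG, 11.73),
}

# Assumption: the RSVQA-LRBEN average is the unweighted mean of the two types.
rsvqa_mean = (90.21 + 89.16) / 2
rsvqa_consistent = abs(rsvqa_mean - 89.69) < 0.01

for name, value in {**vqa_baselines, **cls_baselines}.items():
    print(f"{name}: {value:.2f}")
print("RSVQA-LRBEN mean matches reported average:", rsvqa_consistent)
```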

Ablation and Cases

MF-RSVLM qualitative case study

Qualitative Examples. Case studies on category, existence, and counting questions show that MF-RSVLM better preserves remote-sensing evidence needed for fine-grained answers.

The ablation studies support these design choices. Fusing ViT layers 8/16/24 with both 336x336 and 168x168 sliding windows outperforms variants that use fewer layers or a single window size across classification and composition benchmarks. Similarly, injecting visual features into LLM layers 2/4/6/8 gives the best overall result: compared with using fewer injection layers, METER-ML accuracy rises from 66.37% to 72.74% (+6.37 points) and HR-Comp accuracy rises from 77.90% to 82.80% (+4.90 points).

Why These Results Matter

FUSE-RSVLM is not just another remote-sensing fine-tune. Its main value is that it treats visual grounding as a process that must be maintained through the whole reasoning chain. This suits captioning, VQA, and classification in high-resolution geospatial imagery, where the decisive visual evidence may be small, sparse, or easily overwhelmed by global scene context.

The experimental results also show that remote-sensing-specific modeling choices still matter even in the era of general VLMs. Multi-scale visual fusion and recurrent detail injection remain strong advantages when the visual signal is dense, fine-grained, and spatially structured.

BibTeX Citation

@article{dang2025fusersvlm,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}