CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Yunkai Dang^*, Yizhu Jiang^*, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li^†, Yang Gao

^* Equal contribution. ^† Corresponding author. Correspondence to: liwenbin.nju@gmail.com, yunkaidang1@gmail.com.

arXiv 2026

Class-Adaptive Layer Fusion

CLASP fuses multi-layer vision features according to the instruction category instead of relying on a fixed single-layer token representation.

Dual-Stage Visual Token Pruning

The pruning budget is split between relevance-preserving pivot tokens and coverage-preserving completion tokens for more robust compression.

94.7% Performance at 88.9% Pruning

Under very aggressive compression, CLASP still preserves 94.7% of the original performance and remains strong across multiple MLLM backbones.

Paper

Code (Coming Soon)

Weights (Coming Soon)

Personal Homepage

Project Overview

CLASP studies one of the most persistent efficiency problems in multimodal large language models: visual token redundancy. Existing pruning methods often rely on a single vision layer and a fixed pruning rule, which makes them brittle when the prompt changes or when the task requires a different balance between local detail and global coverage. CLASP addresses this by making both the feature construction stage and the pruning stage adaptive to the semantic class of the instruction.

The core argument of the paper is simple but important: if different prompt types need different visual evidence, then fixed token reduction strategies will inevitably waste budget on the wrong patches or over-prune critical details. CLASP turns token reduction into a prompt-conditioned decision process instead of a one-size-fits-all heuristic.

Method Pipeline

The framework first fuses multiple vision encoder layers to construct category-aware visual representations instead of depending on a single-layer feature map. It then performs dual-stage pruning. In the first stage, attention-salient pivot tokens preserve relevance to the instruction. In the second stage, redundancy-aware completion tokens maintain coverage over the scene. This design aims to preserve both “what matters most” and “what would otherwise be lost.”
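The fusion step at the start of this pipeline can be illustrated with a small sketch. It is not the authors' implementation: it assumes the instruction has already been mapped to a category, and that each category owns a set of per-layer logits that are softmax-normalized into mixing weights. The category names, the logit values, and the function `fuse_layers` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical instruction categories and per-layer fusion logits.
# Names and numbers are illustrative assumptions, not values from the paper.
FUSION_LOGITS = {
    "ocr": np.array([2.0, 1.0, 0.0]),        # favor the first (shallowest) layer
    "counting": np.array([1.0, 1.0, 1.0]),   # uniform mix of all layers
    "reasoning": np.array([0.0, 1.0, 2.0]),  # favor the last (deepest) layer
}


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def fuse_layers(layer_feats, category):
    """Fuse per-layer visual features with category-specific weights.

    layer_feats: array of shape (L, N, D) holding N token features of
                 dimension D from each of L vision encoder layers.
    category:    key into FUSION_LOGITS (assumed to come from some
                 upstream instruction classifier).
    Returns an (N, D) array: the weighted sum over the layer axis.
    """
    weights = softmax(FUSION_LOGITS[category])          # shape (L,)
    return np.tensordot(weights, layer_feats, axes=1)   # shape (N, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(3, 576, 8))  # 3 layers, 576 tokens, 8 dims
    print(fuse_layers(feats, "ocr").shape)
```

The point of the sketch is only that the same image yields different fused features for different instruction categories, which is what distinguishes this design from a fixed single-layer feature map.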

CLASP combines class-adaptive feature fusion with dual-stage pruning to reduce visual tokens without collapsing task robustness.
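A minimal sketch of the dual-stage selection described in the pipeline, under stated assumptions: pivots are taken as the tokens with the highest attention scores, and completion tokens are picked greedily as the tokens least similar (by cosine similarity) to anything already kept. The greedy farthest-point rule, the `pivot_ratio` parameter, and the function name `dual_stage_prune` are illustrative stand-ins, not the paper's exact procedure.

```python
import numpy as np


def dual_stage_prune(tokens, attn_scores, budget, pivot_ratio=0.5):
    """Select `budget` token indices in two stages.

    tokens:      (N, D) visual token features.
    attn_scores: (N,) instruction-to-token attention scores.
    budget:      total number of tokens to keep (must be >= 1).
    pivot_ratio: assumed fraction of the budget given to pivot tokens.
    Returns the kept indices, sorted in ascending order.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    n = tokens.shape[0]
    budget = min(budget, n)
    n_pivot = min(budget, max(1, int(round(budget * pivot_ratio))))

    # Stage 1: pivot tokens are the most attention-salient ones.
    order = np.argsort(-attn_scores, kind="stable")
    selected = [int(i) for i in order[:n_pivot]]

    # Stage 2: completion tokens cover what the pivots miss. Track each
    # token's highest cosine similarity to the kept set and repeatedly
    # add the token with the lowest value (the least redundant one).
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    max_sim = (unit @ unit[selected].T).max(axis=1)
    max_sim[selected] = np.inf  # never re-select a kept token
    while len(selected) < budget:
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
        max_sim[nxt] = np.inf

    return np.sort(np.array(selected))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toks = rng.normal(size=(576, 16))
    scores = rng.random(576)
    print(dual_stage_prune(toks, scores, budget=64).shape)
```

Compared with plain top-k selection on attention scores, the second stage spends part of the budget on tokens that are dissimilar to the pivots, so regions with low attention but distinct content are still represented.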

Visual Token Redundancy: CLASP targets the heavy computational overhead caused by long visual token sequences in multimodal large language models.

Class-Adaptive Layer Fusion: the framework builds category-specific visual representations instead of relying on a fixed single-layer feature map.

Dual-Stage Pruning: attention-salient pivot tokens preserve relevance, while redundancy-aware completion tokens maintain coverage over the full scene.

Aggressive Compression: CLASP still preserves 94.7% normalized performance at 88.9% pruning.

Paper Resource: the full method and experiments are available on arXiv.

Main Experimental Results

On LLaVA-v1.5-7B, CLASP achieves the best normalized average at all three tested budgets: 98.4 with 192 retained tokens, 97.0 with 128 tokens, and 94.7 with only 64 tokens. These settings correspond to token reductions of 66.7%, 77.8%, and 88.9%, respectively.

At 192 tokens, the method keeps 98.4% of the original performance and reaches the strongest GQA score reported in the comparison, 60.4, while outperforming classical reduction baselines by large margins: +9.9 over ToMe and +10.6 over FastV.

At 128 tokens, CLASP attains the best POPE score of 85.2, which is especially relevant for faithfulness and object-hallucination evaluation under tighter budgets.

Even at 64 tokens, CLASP still preserves 94.7% of the original performance and beats the similarity-based DART baseline by +1.7, showing that the method remains stable under very aggressive pruning.

On the higher-resolution LLaVA-NeXT-7B with 320 retained tokens, the paper reports a normalized average of 95.2, improving over DART by +1.3 and over HiRED by +1.9. It also records strong task scores, including 62.7 on GQA, 1723 on MME, 85.8 on POPE, and 61.7 on TextVQA.

The Qwen2.5-VL-7B results show that the method transfers beyond LLaVA: at 66.7% pruning, CLASP reaches a normalized average of 96.5 versus 94.1 for SparseVLM (+2.4), and at 77.8% pruning it reaches 94.4 versus 90.8 (+3.6).
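The pruning percentages quoted for LLaVA-v1.5-7B can be checked with a few lines of arithmetic. The sketch assumes the standard 576 visual tokens per image for LLaVA-v1.5 (a 24 × 24 patch grid); that total is not stated on this page, and the helper name `pruning_ratio` is hypothetical.

```python
def pruning_ratio(retained, total=576):
    """Percentage of visual tokens removed when `retained` of `total` are kept.

    The default total of 576 is an assumption for LLaVA-v1.5 (24 x 24 patches).
    """
    return 100.0 * (1.0 - retained / total)


if __name__ == "__main__":
    for kept in (192, 128, 64):
        print(f"{kept} retained tokens -> {pruning_ratio(kept):.1f}% pruned")
```

Under that assumption, the three budgets of 192, 128, and 64 tokens reproduce the 66.7%, 77.8%, and 88.9% reductions reported above.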

Why These Results Matter

CLASP is useful because it frames efficiency as a conditional modeling problem rather than a hard-coded compression rule. That is a better fit for real multimodal systems, where the visual evidence needed for OCR, counting, grounding, or open-ended reasoning can be very different.

The practical implication is that token pruning does not have to be a crude tradeoff between speed and quality. With prompt-conditioned fusion and budget allocation, a model can cut most of the visual sequence while still preserving the evidence required for robust inference.

BibTeX Citation

@article{dang2026clasp,
  title={CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models},
  author={Dang, Yunkai and Jiang, Yizhu and Jiang, Yifan and Fan, Qi and Shi, Yinghuan and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2604.12767},
  year={2026}
}