Project Page

CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Yunkai Dang*, Yizhu Jiang*, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao

* Equal contribution. Corresponding author. Correspondence to: liwenbin.nju@gmail.com, yunkaidang1@gmail.com.

arXiv 2026

Class-Adaptive Layer Fusion

CLASP fuses multi-layer vision features according to the instruction category instead of relying on a fixed single-layer token representation.

Dual-Stage Visual Token Pruning

The pruning budget is split between relevance-preserving pivot tokens and coverage-preserving completion tokens for more robust compression.

94.7% Performance at 88.9% Pruning

Under very aggressive compression, CLASP still preserves 94.7% of the original performance and remains strong across multiple MLLM backbones.

Overview

Multimodal Large Language Models (MLLMs) often serialize images into long patch-token sequences, creating heavy memory and latency costs during inference. Existing token reduction methods usually rely on single-layer ViT features and static pruning rules, which can be brittle when different instructions require different evidence, such as OCR details, object attributes, counting targets, spatial relations, or scene-level context. In this paper, we propose CLASP, a plug-and-play token reduction framework that combines class-adaptive layer fusion with class-adaptive dual-stage pruning.

The core idea is that visual token pruning should be conditioned on the semantic class of the instruction. CLASP first routes the prompt to a question category, then uses that category to choose how vision-layer features are fused and how the token budget is split between relevance and coverage. This turns token reduction from a fixed heuristic into a prompt-conditioned decision process.

Per-benchmark performance under increasing token pruning

Performance Under Pruning. Across eight evaluation suites, CLASP degrades more slowly than representative pruning baselines and stays closer to the unpruned upper bound, especially under aggressive pruning ratios.

Visual Token Redundancy: CLASP targets the quadratic attention cost and latency overhead caused by long visual token sequences in MLLMs.

Class-Adaptive Layer Fusion: The framework fuses multi-layer ViT features using category-specific weights instead of depending on a fixed single-layer representation.

Dual-Stage Pruning: The pruning budget is dynamically split between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage.

Plug-and-Play Efficiency: CLASP requires no retraining and can be integrated into multiple MLLM backbones for image and video inference.

Method Pipeline

CLASP method overview

Method Overview. CLASP uses a prompt-to-class router to condition visual processing on textual intent. It first performs class-adaptive layer fusion, aggregating ViT features from multiple layers with class-specific mixture weights. It then applies class-adaptive pruning, splitting the retained token budget between attention-based selection and similarity-based clustering.

Given a prompt and an input image, CLASP aims to keep a target visual token budget R while preserving instruction-critical content and broad visual coverage. A lightweight text router maps the prompt to a category. The category then controls two decisions: the layer-mixture weights used to form the visual representation and the split ratio used during pruning.

The first pruning stage keeps attention-salient pivot tokens, which protect the most query-relevant visual evidence. The second stage uses redundancy-aware clustering to add completion tokens that are weakly covered by the pivot set, improving scene coverage without wasting budget on near-duplicate patches. In practice, the paper applies this as a progressive pruning schedule at decoder layers 2, 6, and 15.

Main Experimental Results

LLaVA-v1.5-7B benchmark comparison

LLaVA-v1.5-7B Results. CLASP reports the best normalized average at all three tested retained-token budgets: 98.4% at 192 tokens, 97.0% at 128 tokens, and 94.7% at 64 tokens.

Per-benchmark performance curves

Per-Benchmark Curves. Compared with SparseVLM and PDrop, CLASP maintains stronger results across MME, GQA, MMVet, POPE, TextVQA, VQAv2, SQA, and MMB as pruning becomes more aggressive.

  • On LLaVA-v1.5-7B, CLASP achieves the best normalized average at all three tested budgets: 98.4% with 192 retained tokens, 97.0% with 128 tokens, and 94.7% with only 64 tokens. These correspond to visual token reductions of 66.7%, 77.8%, and 88.9%.
  • At 192 tokens, CLASP reaches the strongest GQA score in the comparison, 60.4, and records the highest normalized average of 98.4%, outperforming DART by +0.3 and VisionZip by +0.3.
  • At 128 tokens, CLASP keeps 97.0% of the original performance and achieves the best POPE score of 85.2, indicating strong faithfulness under tighter visual budgets.
  • At 64 tokens, CLASP still preserves 94.7% normalized performance. This is +1.7 over DART and +1.9 over VisionZip under the same aggressive 88.9% pruning ratio.
  • On LLaVA-NeXT-7B with 320 retained tokens, CLASP reports a normalized average of 95.2%, improving over DART by +1.3 and HiRED by +1.9. It also records strong task scores including 62.7 GQA, 1723 MME, 85.8 POPE, and 61.7 TextVQA.
  • On Qwen2.5-VL-7B, CLASP transfers beyond the LLaVA family. Compared with SparseVLM, the normalized average rises from 94.1% to 96.5% at 66.7% pruning, from 90.8% to 94.4% at 77.8% pruning, and from 82.9% to 89.0% at 88.9% pruning.
  • On Video-LLaVA-7B, CLASP also improves the average score from 51.22 for SparseVLM to 52.95, slightly exceeding the unpruned upper-bound average of 52.23 in the reported table.

Efficiency and Visualization

Token retention visualization

Token-Retention Maps. The visualization compares SparseVLM and CLASP across progressive pruning layers. CLASP retains more task-relevant regions while using attention-selected tokens for relevance and similarity-selected tokens for contextual coverage.

The efficiency study on POPE further shows that CLASP reduces inference cost while preserving accuracy. With 192 retained tokens, CLASP reaches 99.6% accuracy retention and a 1.5x time speedup. With 58 retained tokens, it keeps 95.4% accuracy retention and reaches a 2.1x time speedup.

Why These Results Matter

CLASP is useful because it frames efficiency as a conditional modeling problem rather than a hard-coded compression rule. That is a better fit for real multimodal systems, where the visual evidence needed for OCR, counting, grounding, spatial reasoning, and scene understanding can be very different.

The practical implication is that token pruning does not have to be a crude tradeoff between speed and quality. By conditioning both feature fusion and pruning on the instruction category, CLASP can remove most visual tokens while preserving the evidence needed for robust multimodal inference.

BibTeX Citation

BibTeX

@article{dang2026clasp,
  title={CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models},
  author={Dang, Yunkai and Jiang, Yizhu and Jiang, Yifan and Fan, Qi and Shi, Yinghuan and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2604.12767},
  year={2026}
}