Project Page

FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?

Zihao Jiang, Yunkai Dang, Dong Pang, Huishuai Zhang, Weiran Huang

arXiv 2023

Language-Guided Few-Shot Learning

FILM uses pre-trained language-model semantics as a core signal for few-shot classification rather than as a lightweight auxiliary prior.

Adaptive Visual-Text Metric

A learned metric and bi-level optimization improve how visual evidence and textual class semantics are compared across episodes.

Consistent Gains Across Benchmarks

The paper reports consistent improvements on miniImageNet, tieredImageNet, CIFAR-FS, and CUB in both 1-shot and 5-shot settings.

Overview

FILM asks how few-shot image classification can benefit more substantially from pre-trained language models. Earlier methods often use semantic information from class names, but only as a lightweight auxiliary cue added to otherwise standard few-shot pipelines. FILM instead treats language-derived semantic structure as a central component of the classifier design.

The paper proposes a contrastive few-shot learning framework that aligns visual features and language-model embeddings more carefully. The goal is better transferability to novel classes, especially when only one or a few labeled examples are available.
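
To make the alignment idea concrete, here is a minimal PyTorch sketch of a contrastive objective between visual features and class-name embeddings. The function name align_loss, the temperature value, and the assumption of one distinct class per batch entry are illustrative; the paper's actual objective is defined in the arXiv manuscript.

import torch
import torch.nn.functional as F

def align_loss(img_feats, txt_feats, temperature=0.1):
    # img_feats: (N, d) visual features; txt_feats: (N, d) language-model
    # embeddings of each image's class name. Hypothetical sketch only.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature              # (N, N) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: each image should match its own class text
    # (assumes distinct classes within the batch), and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))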

FILM Overview

FILM integrates pre-trained language-model semantics directly into the few-shot matching process.

Language-Guided Few-Shot Learning: FILM treats class semantics from pre-trained language models as a core transfer signal rather than a weak auxiliary hint.

Adaptive Visual-Text Metric: the model learns a comparison function that goes beyond cosine similarity, so support and query images can be matched against textual semantics more flexibly (see the sketch after this list).

Task-Adaptive Meta Optimization: MAML-style bi-level training allows the metric to adapt to each few-shot episode instead of staying fixed across tasks.

Benchmark Gains: FILM improves 1-shot and 5-shot accuracy by 2.42% and 4.41% on miniImageNet, and by 3.88% and 4.41% on tieredImageNet.

Paper Resource: the full method and experiments are available on arXiv.
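
The adaptive metric item above is easiest to picture with code. The sketch below shows one hedged way a metric can generalize cosine similarity: a learned bilinear interaction blended with the plain cosine score. The class LearnedMetric and its parameterization are assumptions for illustration, not FILM's actual module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedMetric(nn.Module):
    # One illustrative way to "go beyond cosine similarity": a learned
    # bilinear interaction added to a cosine baseline. With W at the
    # identity and alpha at zero, this reduces to plain cosine scoring.
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim))       # bilinear interaction
        self.alpha = nn.Parameter(torch.zeros(1))   # learned blend weight

    def forward(self, visual, text):
        # visual: (B, d) image features; text: (C, d) class embeddings.
        v = F.normalize(visual, dim=-1)
        t = F.normalize(text, dim=-1)
        cosine = v @ t.t()                          # (B, C) baseline scores
        bilinear = v @ self.W @ t.t()               # (B, C) learned scores
        return cosine + self.alpha * bilinear       # logits over classes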

This is not simply a language-augmented prototype classifier. The paper's core contribution is a better transfer metric between vision and language, so that few-shot recognition can exploit semantic priors without collapsing into brittle label-name matching.

Method Pipeline

FILM introduces a textual branch that is specifically designed to align with visual representations extracted from few-shot episodes. A learned metric module then generalizes cosine similarity so that the model can compare visual and textual information more flexibly. To improve adaptation across tasks, the method is trained with MAML-style bi-level optimization, allowing the metric to adjust to new few-shot episodes rather than staying fixed.
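
To make the bi-level training concrete, here is a hedged sketch of one MAML-style episode step using torch.func.functional_call (PyTorch 2.x): the metric's parameters are adapted on the support set in an inner loop, and the outer update is driven by the query-set loss. The function name, episode format, and hyperparameters (inner_lr, inner_steps) are assumptions, not the paper's training code.

import torch
import torch.nn.functional as F

def episode_step(metric, encoder, text_emb, episode, outer_opt,
                 inner_lr=0.01, inner_steps=3):
    sup_x, sup_y, qry_x, qry_y = episode
    fast = dict(metric.named_parameters())

    # Inner loop: adapt the metric to this episode's support set,
    # keeping the graph so the outer update can differentiate through it.
    for _ in range(inner_steps):
        logits = torch.func.functional_call(
            metric, fast, (encoder(sup_x), text_emb))
        loss = F.cross_entropy(logits, sup_y)
        grads = torch.autograd.grad(loss, list(fast.values()),
                                    create_graph=True)
        fast = {name: p - inner_lr * g
                for (name, p), g in zip(fast.items(), grads)}

    # Outer loop: score the query set with the adapted metric and update
    # the shared initialization (and the encoder, if it is in outer_opt).
    qry_logits = torch.func.functional_call(
        metric, fast, (encoder(qry_x), text_emb))
    qry_loss = F.cross_entropy(qry_logits, qry_y)
    outer_opt.zero_grad()
    qry_loss.backward()
    outer_opt.step()
    return float(qry_loss)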

Main Experimental Results

  • On miniImageNet, FILM improves over prior few-shot methods that use class-name semantics by 2.42% in the 1-shot setting and 4.41% in the 5-shot setting.
  • On tieredImageNet, the paper reports gains of 3.88% and 4.41% over AM3 in 1-shot and 5-shot settings, respectively.
  • On CIFAR-FS, FILM outperforms MetaOptNet by 4.99% in the 1-shot setting and 3.06% in the 5-shot setting.
  • On CUB, the paper reports that FILM surpasses all compared methods, including the previous best-performing RE-Net baseline.
  • The cross-domain and ablation analyses further support the main design choice: better language-visual alignment and a more adaptive metric are both important contributors to the final gains.

Why These Results Matter

FILM is valuable because it does not treat language as an afterthought. It shows that pre-trained language models can serve as a structured semantic prior for low-data visual recognition, especially when the alignment mechanism is designed carefully.

The larger takeaway is that few-shot recognition can benefit from foundation-model semantics, but only if the model learns how to compare visual and textual information in a task-adaptive way. FILM is an early and still useful example of that design principle.

BibTeX Citation

@article{jiang2023film,
  title={FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?},
  author={Jiang, Zihao and Dang, Yunkai and Pang, Dong and Zhang, Huishuai and Huang, Weiran},
  journal={arXiv preprint arXiv:2307.04114},
  year={2023}
}