SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

1CAML Lab, University of Cambridge 2DeepMind, London 3University of Tuebingen
ICCV 2023
SuS-X teaser figure

We enhance CLIP's downstream classification performance without training/fine-tuning and without access to any downstream task image samples. We do this by (1) curating a support set, either by generating synthetic samples (Stable Diffusion) or by retrieving natural ones (LAION-5B), and (2) identifying and correcting a mis-calibration of intra-modal distances in CLIP's embedding space.

Abstract

Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks: SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines.

Taxonomy of CLIP Adaptation methods

CLIP's zero-shot downstream classification performance is limited by its pre-training distribution: if the downstream dataset diverges too strongly from the distribution of images seen during pre-training, CLIP's zero-shot performance drops drastically. Several works have tried to mitigate this problem; we characterise some of these methods along three major axes: (i) whether the method requires training, (ii) whether it requires labelled samples from the target task, and (iii) whether it requires samples from the target task distribution.

CLIP adaptation taxonomy figure

Support Set (SuS) construction

To curate a support set with no access to downstream image samples, we employ two strategies (see the sketch after the figure below). For both strategies, we use only the class names of the downstream task and leverage more descriptive prompts generated by GPT-3.

  • Stable Diffusion Generation: We generate synthetic data samples using Stable Diffusion v1.4.
  • LAION-5B Retrieval: We retrieve the nearest-neighbour data samples from LAION-5B using an efficient approximate nearest-neighbour index search.
SuS samples figure
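
A minimal sketch of the two curation strategies using off-the-shelf tooling (the diffusers and clip-retrieval libraries). The prompt, checkpoint name, and retrieval endpoint/index here are illustrative assumptions, not the paper's exact pipeline, which additionally uses GPT-3-generated prompts.

```python
import torch
from diffusers import StableDiffusionPipeline          # SuS-SD: synthetic generation
from clip_retrieval.clip_client import ClipClient      # SuS-LC: LAION-5B retrieval

class_name = "golden retriever"                         # a downstream category name
prompt = f"A photo of a {class_name}."                  # (the paper uses richer GPT-3 prompts)

# --- Strategy 1: generate synthetic support images with Stable Diffusion v1.4 ---
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
synthetic_images = pipe(prompt, num_images_per_prompt=4).images

# --- Strategy 2: retrieve natural support images from LAION-5B ---
# The hosted endpoint / index name may differ from your deployment.
client = ClipClient(url="https://knn.laion.ai/knn-service",
                    indice_name="laion5B-L-14", num_images=4)
retrieved = client.query(text=prompt)                   # list of dicts with "url", "similarity", ...
```

Either way, the curated images are labelled with the class name used to generate or retrieve them, yielding a pseudo-labelled support set without touching the target distribution.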

Motivating TIP-X: Mis-calibrated intra-modal distances in VLMs

We observe that intra-modal and inter-modal similarities are distributed differently in the embedding spaces of VLMs like CLIP: the inter-modal similarities have small means and variances, whereas the intra-modal similarities have larger means and variances. This mismatch arises because contrastive training of CLIP (and other VLMs) maximises the inter-modal cosine similarities of paired samples without regard to intra-modal similarities. As a result, prior methods that rely on intra-image CLIP embedding similarities do not capture the true image-image similarities. We propose a simple modification in the form of TIP-X, where we instead compute intra-image similarities by using the inter-modal similarities as a bridge.

Motivating TIP-X figure
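
A minimal NumPy sketch of this bridging idea, assuming L2-normalised CLIP features and glossing over TIP-X's exact rescaling of the resulting affinities:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def bridged_affinity(test_feats, support_feats, text_feats, eps=1e-8):
    """Illustrative sketch: image-image affinities via the text modality.

    test_feats:    (n_test, d)    L2-normalised test image embeddings
    support_feats: (n_support, d) L2-normalised support image embeddings
    text_feats:    (n_classes, d) L2-normalised class text embeddings
    """
    # Inter-modal similarities: each image vs. the class text embeddings.
    p = softmax(test_feats @ text_feats.T)      # (n_test, n_classes)
    q = softmax(support_feats @ text_feats.T)   # (n_support, n_classes)

    # Pairwise KL(p_i || q_j): lower divergence -> more similar images.
    kl = (p[:, None, :] * (np.log(p[:, None, :] + eps)
                           - np.log(q[None, :, :] + eps))).sum(-1)
    return -kl   # negate so that larger values mean higher affinity
```

These bridged affinities are then combined with the support-set labels to produce classification logits, replacing the raw intra-modal cosine similarities that the mis-calibration analysis above shows to be unreliable.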

Results

Training-free name-only transfer: We beat competitive methods in the name-only transfer setting on 19 diverse datasets.

Name-only transfer figure

SoTA on training-free few-shot setting: We beat TIP-Adapter in the training-free setting on 19 diverse datasets.

Few-shot transfer figure

Generalisation to different CLIP visual backbones: Our method generalises and consistently improves performance across different CLIP visual backbones.

Backbone generalisation figure

Generalisation to different VLMs: Our method generalises across different VLMs other than CLIP.

VLM generalisation figure

Generalisation to different T2I models for SuS-SD curation: Our method generalises to text-to-image (T2I) generation models other than Stable Diffusion v1.4.

T2I generalisation figure

BibTeX

@article{udandarao2022sus,
  title={Sus-x: Training-free name-only transfer of vision-language models},
  author={Udandarao, Vishaal and Gupta, Ankush and Albanie, Samuel},
  journal={arXiv preprint arXiv:2211.16198},
  year={2022}
}