The Development and Future Outlook of CLIP and its Derivative Methods

Authors

  • Jiacheng Shi

DOI:

https://doi.org/10.61173/0p6jpg91

Keywords:

Contrastive Language–Image Pre-training, Contrastive learning, Vision-Language Models

Abstract

With the rapid growth of multimodal learning, Vision-Language Models (VLMs) have become a cutting-edge direction in artificial intelligence. Among them, the Contrastive Language–Image Pre-training (CLIP) model, built on large-scale contrastive learning, has demonstrated powerful capabilities in zero-shot transfer and cross-modal retrieval. However, CLIP’s weakly supervised training paradigm shows clear shortcomings in compositional reasoning. This survey therefore systematically reviews and analyzes representative methods proposed in recent years to address CLIP’s compositional reasoning limitations, including Self-supervision meets Language-Image Pre-training (SLIP), Language-augmented CLIP (LaCLIP), TripletCLIP, Synthetic Perturbations for Advancing Robust Compositional Learning (SPARCL), Compositionally-aware Learning in CLIP (CLIC), and Training-Time Negation Data Generation for Negation Awareness of CLIP (TNG-CLIP). We introduce the principles and characteristics of these methods, then compare their performance on different benchmarks and analyze how each mitigates CLIP’s deficiencies. Through this overview of CLIP and its derivative methods, we hope future research will focus on integrating their strengths while also developing more efficient data synthesis techniques and more comprehensive evaluation benchmarks.
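
To make the contrastive learning paradigm behind CLIP concrete, the snippet below is a minimal PyTorch sketch of the symmetric image–text contrastive (InfoNCE-style) objective; the tensor names, embedding size, batch size, and temperature value are illustrative assumptions rather than details taken from the paper or the official CLIP implementation.

    # Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
    # Assumes image_features and text_features are L2-normalized embeddings
    # of a batch of matched image-text pairs; all names and sizes are illustrative.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # Cosine-similarity logits between every image and every caption in the batch.
        logits = image_features @ text_features.t() / temperature
        # The matching pair for each image/caption sits on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
        return (loss_i2t + loss_t2i) / 2

    # Toy usage: random, normalized 8-sample batches of 512-dimensional embeddings.
    img = F.normalize(torch.randn(8, 512), dim=-1)
    txt = F.normalize(torch.randn(8, 512), dim=-1)
    print(clip_contrastive_loss(img, txt))

Under this objective, each image is trained to score its own caption highest among all captions in the batch, and vice versa; this batch-level matching signal is the weakly supervised paradigm whose compositional limitations the surveyed methods aim to mitigate.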

Published

2025-12-19

Section

Articles