CLIP (Contrastive Language-Image Pre-Training) learns a joint embedding space for images and text using contrastive learning, enabling zero-shot classification and cross-modal retrieval.
The snippet below is a minimal sketch of this workflow using OpenAI's `clip` package with the ViT-B/32 checkpoint; the image path `photo.jpg` and the candidate captions are placeholder examples:

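```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained ViT-B/32 CLIP model and its image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess the image and tokenize the candidate captions
# ("photo.jpg" and the captions below are placeholders).
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
texts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    # Encode both modalities into the shared embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# L2-normalize so the dot product equals cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Contrastive similarity: dot product between image and text embeddings,
# softmax-scaled into zero-shot class probabilities.
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
best = similarity.argmax(dim=-1).item()
print(f"Best match: '{texts[best]}' ({similarity[0, best].item():.2%})")
```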
The code above illustrates the following key points:
- Loads a pre-trained CLIP model: the ViT-B/32 checkpoint bundles a Vision Transformer image encoder and a Transformer text encoder.
- Tokenizes the text and preprocesses the image: ensures both inputs match the format CLIP was trained on.
- Computes image and text embeddings: generates feature vectors for both modalities in the shared embedding space.
- Applies contrastive similarity: a dot product over L2-normalized embeddings (cosine similarity) finds the closest text match.
- Performs zero-shot classification: the closest caption is the predicted label, with no task-specific fine-tuning.
Hence, CLIP’s contrastive learning framework enables efficient cross-modal understanding, supporting zero-shot classification, image-to-text matching, and visual search applications without task-specific fine-tuning.
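
As an illustration of the visual-search use case, the sketch below ranks a small, hypothetical image collection against a free-form text query using the same `clip` package; the file names and the query string are placeholders, not part of the original example:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical image collection to search over.
image_paths = ["beach.jpg", "city.jpg", "forest.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

# Free-form text query (placeholder).
query = clip.tokenize(["a sunny beach with palm trees"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

# Normalize and rank images by cosine similarity to the text query.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(1)

for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

Because all images and texts live in the same embedding space, the same pre-computed image embeddings can be reused for many different queries, which is what makes this kind of retrieval practical at scale.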