What Is Zero Shot Learning in Image Classification?


Among the most fascinating developments in natural language processing (NLP) and machine learning is zero-shot classification. Simply put, it is a model's ability to predict classes it has never seen during training. In this blog, we will cover what zero-shot classification is, how it works for text and images, which models are popular, and how you can implement it.

What is Zero-Shot Learning?

Zero-shot classification is the ability of a model to assign data to categories it was never explicitly trained on. For instance, a language model trained on a broad range of text can predict the sentiment of a movie review without ever having seen a sentiment-labeled review during training. Instead of needing labeled data for every conceivable class, the model draws on the broad knowledge acquired during pre-training to make predictions on unseen tasks.

This ability lets models generalize across tasks without task-specific training samples, which makes them especially useful when labeled data is scarce or the task is highly specialized.
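As a minimal sketch of that movie-review scenario (the review text and label set here are illustrative; the checkpoint is the same NLI-based model used later in this post):

from transformers import pipeline

# Zero-shot sentiment prediction: the model was never trained on
# sentiment-labeled reviews, yet it can score the candidate labels
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

review = "An absolute masterpiece with stunning performances."
result = classifier(review, candidate_labels=["positive", "negative"])

print(result["labels"][0])  # highest-scoring label, expected: "positive"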

Now that we understand the core concept of zero-shot classification, let’s explore some real-world applications where it proves to be highly valuable.

Zero-Shot Classification Applications

Zero-shot classification has numerous applications, including:

- Sentiment analysis of reviews or social media posts without sentiment-labeled training data
- Topic classification of news articles, support tickets, or documents
- Intent detection in chatbots and customer-support systems
- Image tagging and content moderation across open-ended label sets

Given the wide array of applications, it’s important to understand which models are best suited for zero-shot classification. Let’s dive into some popular models.

Popular Zero-Shot Classification Models

Some of the most popular models used for zero-shot classification are:

- facebook/bart-large-mnli – a BART model fine-tuned on natural language inference (NLI), the workhorse for zero-shot text classification
- openai/clip-vit-base-patch32 (CLIP) – a contrastive vision-language model used for zero-shot image classification
- Other NLI checkpoints such as roberta-large-mnli, which drop into the same text-classification pipeline

Now that we have an overview of the models, let’s move on to understanding how to use them for zero-shot classification tasks.

How to Use Zero-Shot Classification Models

Using zero-shot classification models typically involves:

1. Loading a pretrained model, either directly or through a pipeline
2. Providing the input you want to classify (text or an image)
3. Defining a list of candidate labels
4. Running the model and reading off the score assigned to each label

Here is an example using Hugging Face’s transformers library to perform zero-shot text classification:

from transformers import pipeline

# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")  # uses facebook/bart-large-mnli by default

# Sample text and candidate labels
text = "I love playing football on weekends!"
candidate_labels = ["sports", "politics", "technology", "entertainment"]

# Perform zero-shot classification
result = classifier(text, candidate_labels)

print(result)

The output will look something like this (the exact scores, and the ordering of the low-scoring labels, vary by model version):
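{'sequence': 'I love playing football on weekends!',
 'labels': ['sports', 'entertainment', 'technology', 'politics'],
 'scores': [0.97, 0.02, 0.006, 0.004]}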

With text classification covered, let’s now explore the exciting world of zero-shot image classification and how it works.

What is Zero-Shot Image Classification?

Zero-shot image classification assigns an image to categories without any category-specific training data. This is usually achieved by training a model on both images and textual descriptions, then leveraging the learned relationship between the two to generalize to new image categories.

To better understand how zero-shot image classification works, let’s break it down into a step-by-step process.

How Zero-Shot Image Classification Works

Zero-shot image classification relies on models that understand both text and images, such as CLIP (Contrastive Language-Image Pretraining). The model learns to associate textual descriptions with visual features, so given an image it can predict the textual description or category most likely to match it.

For example, CLIP can classify an image of a dog as “dog” or “animal” based on the textual description of the image. It doesn’t need to have seen specific images of dogs during training but understands the relationship between the visual and textual features.
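To make this mechanism concrete, here is a small sketch that computes the image-text similarity scores explicitly (the file name dog_image.jpg and the captions are placeholders):

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Load CLIP; it maps images and texts into a shared embedding space
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_image.jpg")  # placeholder image file
captions = ["a photo of a dog", "a photo of a cat"]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=captions, padding=True, return_tensors="pt"))

# Cosine similarity: normalize the embeddings, then take dot products
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T

print(similarity)  # the higher score indicates the better-matching caption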

Let’s see how we can implement zero-shot image classification using the CLIP model in the next section.

Implementing Zero-Shot Image Classification

Here’s an example of how to implement zero-shot image classification using the CLIP model from Hugging Face:


from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an image
image = Image.open("dog_image.jpg")

# Define candidate labels
labels = ["dog", "cat", "car", "tree"]

# Process the image and text inputs
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Perform zero-shot image classification (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Get the logits for each label
logits_per_image = outputs.logits_per_image  # similarity scores between the image and each label

# Find the label with the highest score
probs = logits_per_image.softmax(dim=1) # Convert logits to probabilities
predicted_label = labels[torch.argmax(probs)]

print(f"Predicted label: {predicted_label}")

Assuming dog_image.jpg actually contains a dog, the output would be:
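Predicted label: dog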

Having seen how easy it is to implement zero-shot image classification, let's step back to text and look at the two main ways of implementing zero-shot classification in practice.

Two Ways to Implement Zero-Shot Classification

You can rely on a prebuilt pipeline for speed and simplicity, or implement the pieces manually for more control. Let's look at both.

Using a Prebuilt Pipeline

Prebuilt pipelines offer a straightforward and effective way to apply zero-shot classification models without managing model architectures or writing significant code. Hugging Face's transformers library provides prebuilt pipelines for tasks like zero-shot classification, letting users run state-of-the-art models with minimal setup.

Benefits of Using a Prebuilt Pipeline

- Minimal code: a working classifier in a few lines
- No manual model, tokenizer, or architecture configuration
- Access to state-of-the-art pretrained models out of the box

Here's how you can use Hugging Face's transformers library to perform zero-shot text classification with a prebuilt pipeline:


from transformers import pipeline

# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")

# Sample text and candidate labels
text = "I love playing football on weekends!"
candidate_labels = ["sports", "politics", "technology", "entertainment"]

# Perform zero-shot classification
result = classifier(text, candidate_labels)

print(result)

The output is identical to the earlier pipeline example: the text "I love playing football on weekends!" is classified into the "sports" category with the highest probability, showing that the model can predict categories it was never explicitly trained on.

This approach works seamlessly for many NLP tasks, such as sentiment analysis and topic classification, without any manual fine-tuning or model configuration.

Manual Implementation

Although prebuilt pipelines are fast and effective, you may sometimes want to implement zero-shot classification manually for more control over the process. A manual implementation lets you use custom architectures, fine-tune the model, or adjust its behavior to better fit your requirements.

Benefits of Manual Implementation

- Full control over tokenization, inference, and post-processing
- Freedom to swap in custom or fine-tuned checkpoints
- Easier to adapt the logic to specialized tasks

Let's go through the steps to manually implement zero-shot classification using the Hugging Face transformers library. We'll use the same example as before but with more control over the model and tokenizer.

1. Load the Pretrained Model and Tokenizer

First, load a model that supports zero-shot classification. For this example, we will use facebook/bart-large-mnli, a BART model fine-tuned for natural language inference (NLI) and the same checkpoint the prebuilt pipeline uses by default.


from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load pre-trained BART model for zero-shot classification
model_name = "facebook/bart-large-mnli"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

2. Tokenize the Input Text and Candidate Labels

NLI-based zero-shot classification pairs the input text (the premise) with one hypothesis per candidate label, such as "This example is about sports." We tokenize each premise-hypothesis pair together so the model can judge whether the text entails the hypothesis.

# Sample input text and candidate labels
text = "I love playing football on weekends!"
candidate_labels = ["sports", "politics", "technology", "entertainment"]

# Build one hypothesis per candidate label and pair each with the text
hypotheses = [f"This example is about {label}." for label in candidate_labels]
inputs = tokenizer([text] * len(candidate_labels), hypotheses, padding=True, truncation=True, return_tensors="pt")

3. Run Inference and Interpret the Results

The model outputs contradiction/neutral/entailment logits for each pair; we keep the entailment score for each label and normalize across labels with softmax.

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# For facebook/bart-large-mnli the logits are ordered
# [contradiction, neutral, entailment]; keep the entailment logit per label
entailment_logits = outputs.logits[:, 2]

# Normalize the entailment scores across the candidate labels
probabilities = torch.nn.functional.softmax(entailment_logits, dim=0)

# Get the predicted label and its probability
predicted_label = candidate_labels[torch.argmax(probabilities)]
predicted_probability = torch.max(probabilities).item()

print(f"Predicted label: {predicted_label} with probability {predicted_probability:.4f}")

The output would look like this (the probability value is illustrative):
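Predicted label: sports with probability 0.9731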

In this manual implementation, we have complete control over how the model processes inputs and generates predictions. You can modify this code to handle more complex setups, such as multi-label classification or other NLP tasks.

Zero-Shot Classification Benefits

- No labeled training data is needed for new classes
- A single model generalizes across many tasks and domains
- Saves the time and cost of collecting and annotating data
- Scales to new categories by simply changing the candidate label list

Although zero-shot classification models offer significant advantages, there are some challenges and restrictions to consider.

Challenges and Restrictions

- Accuracy is usually lower than that of a model fine-tuned on task-specific labeled data
- Results are sensitive to how candidate labels (or hypothesis templates) are phrased
- The large pretrained models involved can be computationally expensive to run
- Ambiguous or overlapping labels can confuse the model

Despite these challenges, zero-shot models have a promising future, and innovations are underway to make them even more effective.

Future Directions

Future developments may bring more efficient zero-shot models that deliver better performance with fewer resources. Zero-shot models can also be fine-tuned to maximize accuracy on particular tasks while preserving their ability to generalize. Combining zero-shot classification with techniques like few-shot learning could make these models even more powerful.

To summarize, let’s wrap up with a conclusion that highlights the importance and potential of zero-shot classification.

Conclusion

Zero-shot classification is transforming machine learning by allowing models to handle tasks and data they've never seen before. Zero-shot classification models offer flexible, scalable solutions for a wide range of real-world applications across text, images, and multimodal tasks. As these models evolve, they have the potential to simplify complex tasks and let AI tackle a broader range of problems without requiring vast volumes of labeled data.

If you want certifications in Generative AI and large language models, Edureka offers the best certifications and training in this field.

For a wide range of courses, training, and certification programs across various domains, check out Edureka’s website to explore more and enhance your skills!

Frequently Asked Questions

1. What is zero-shot intent classification?

Zero-shot intent classification refers to a model’s capacity to categorize the intent of a user input (such as a statement or query) into predefined categories (or intents) without explicitly training on all potential categories. In zero-shot learning, the model does not encounter any labeled examples from the target class during training, but instead uses a general grasp of language to categorize the input based on semantic similarities between the input and the class labels.

For example, in a customer support chatbot, a zero-shot intent classification model could predict the intent behind the query “I need help with billing” as related to a “Billing” intent, even if the model has never been trained specifically on billing-related examples.
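As a quick, hedged sketch of that chatbot scenario (the intent names here are hypothetical):

from transformers import pipeline

# Zero-shot intent classification with an NLI-based model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

query = "I need help with billing"
intents = ["Billing", "Technical Support", "Account Management"]

result = classifier(query, candidate_labels=intents)
print(result["labels"][0])  # expected: "Billing"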

2. What is a zero-shot image segmentation model?

Zero-shot image segmentation refers to a model's ability to segment regions of an image into classes (such as "dog," "car," or "tree") without having seen labeled instances of those classes during training. A zero-shot image segmentation model typically builds on a large pretrained model that understands both images and textual descriptions. Using the semantic relationship between textual labels and visual features, the model can segment an image based on textual descriptions it has never seen before.

For example, using a multimodal model such as CLIP (or a segmentation variant built on it), the model can classify and segment parts of an image by associating image regions with descriptive labels such as "person," "sky," or "water," even though it has not been explicitly trained on those specific image types.
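As a rough sketch using CLIPSeg, a CLIP-based segmentation model available in transformers (the image file and prompt list are placeholders):

from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from PIL import Image
import torch

# CLIPSeg extends CLIP with a lightweight decoder that produces one
# segmentation mask per text prompt
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("scene.jpg")  # placeholder local image
prompts = ["person", "sky", "water"]

# One copy of the image per prompt; fixed-length text padding as the model expects
inputs = processor(text=prompts, images=[image] * len(prompts), padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits holds one low-resolution mask per prompt;
# sigmoid turns the logits into per-pixel probabilities
masks = torch.sigmoid(outputs.logits)
print(masks.shape)  # one mask each for person, sky, water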

3. What is zero-shot and few-shot classification?

Zero-shot classification asks a model to assign labels for which it has seen no training examples at all, relying entirely on knowledge gained during pre-training. Few-shot classification gives the model a handful of labeled examples per class (see question 5 below) and asks it to generalize from those. Both approaches aim to reduce the dependence on large labeled datasets.

4. Why is it called zero-shot?

It is called "zero-shot" because the model handles a task without any training examples, making predictions or classifications on tasks it has never seen during training. The "zero" denotes that the model has zero examples, or "shots," from the target task to guide its predictions; it relies instead on its capacity to generalize from previously learned knowledge.

5. What is few shot classification?

In machine learning, few-shot classification is the setting in which a model must categorize data into classes based on only a few labeled examples per class. Few-shot learning is a way of making machine learning models more data-efficient, particularly when obtaining large volumes of labeled data is difficult. Unlike zero-shot learning, which requires no labeled data for the target class, few-shot learning lets the model see a small number of examples per class (typically between 1 and 100) and generalize from them to handle new, unseen examples of the same classes.
