EricManara for Sinapsi

OpenAI's CLIP Model for Zero-Shot Image Similarity

Image recognition is a difficult task to get right: accuracy usually trades off against speed, so a solution tends to be either too slow or too unreliable. OpenAI's CLIP model, however, makes it a lot easier to check the similarity between a set of images.
My task was to check whether a mechanical object contained in a photo matched a given object type, of which I had a small dataset of images.
An example of an object to recognize
This came with a few issues: the datasets for the individual object types were all quite small, the objects were sometimes not the focus of the images, and many objects look very different from different angles. Training a dedicated recognition model on data like that would have been expensive, and it would still have struggled with the smallest datasets, where an object type might only have one or two example images. The CLIP model offered a solution that is inexpensive (it runs fast even on a weak CPU), easy to use and understand, and reliable.

Image-to-Image Similarity

The CLIP model is mainly used for text-text or image-text similarity, but because of the way it works it can be used for image-image similarity as well. I won't go into too much detail about how CLIP works (more on that here), but in short it encodes both images and texts in the same format (in a shared vector space, for the linear algebra enthusiasts) and then compares those encodings, producing a similarity score between them. What this means in practice is that all images and texts are processed the same way, and all of the encodings are directly comparable. For my specific task, I simply had the model compare the given image to all the images in the dataset of a certain object type until it found one similar enough to count as a "good match" (I swear this works better than Tinder).
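
Just to give an idea of what that looks like in code, here is a minimal sketch (the file names are only placeholders; the actual setup is covered in the Implementation section below):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# 'reference.jpg' and 'query.jpg' are placeholder file names
encodings = model.encode([Image.open('reference.jpg'), Image.open('query.jpg')])

# cosine similarity between the two encodings (1.0 means identical)
score = util.cos_sim(encodings[0], encodings[1])
print(str(score[0][0].item() * 100) + '%')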

Improving Performance

Even though the similarity computation itself is incredibly fast and can score dozens of images in under a second (using the clip-ViT-B-32 model) without needing a GPU, the encoding step takes noticeably longer and would significantly increase response times even for small datasets.
A quick workaround was to "pre-build" and save the encodings for the various object types and, when needed for the similarity calculations, to load that pre-built encoding list instead of loading and encoding all the images again.

def buildEncoding(typeID: str):
    # everything is saved on an Amazon S3 bucket
    encodedImages = [encodeImage(image).tolist() for image in S3_service.getImagesByType(typeID)]
    if len(encodedImages) != 0:
        saveEncoding(encodedImages, typeID)
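
saveEncoding (and the matching loadEncoding) are not shown here, but a minimal sketch could look like the following, assuming the encoding lists are serialized as JSON files on the same S3 bucket (the bucket name and key layout are made up for the example):

import json
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-encodings-bucket'  # hypothetical bucket name

def saveEncoding(encodedImages: list, typeID: str):
    # encodedImages is a list of plain Python lists (see .tolist() above)
    s3.put_object(Bucket=BUCKET, Key='encodings/' + typeID + '.json',
                  Body=json.dumps(encodedImages))

def loadEncoding(typeID: str) -> list:
    obj = s3.get_object(Bucket=BUCKET, Key='encodings/' + typeID + '.json')
    return json.loads(obj['Body'].read())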

This way, the program only spends time building the encoding for the single image we want to identify.

Expanding the DataSet

There is one more big advantage to this implementation: since the model is not trained on the dataset itself (it is trained for image understanding in general), adding an image to the dataset doesn't require any re-training. All it takes is appending the encoding of a successfully matched image to the encoding list of the right type, making the recognition more and more reliable every time an image matches.

def recognizeImage(queryImage: UploadFile, typeID: str):
    name = queryImage.filename
    queryImage = Image.open(queryImage.file)
    encodedImages = S3_service.loadEncoding(typeID)  # pre-built encodings for this type
    queryEncoding = encoding_builder.encodeImage(queryImage)

    for image in encodedImages:
        similarity = util.cos_sim(image, queryEncoding)
        print("Processed image: " + str(similarity[0][0].item() * 100) + '%')
        if similarity[0][0].item() > 0.92:  # the threshold to identify a "good" match
            S3_service.saveImage(queryImage, typeID, name)
            encoding_builder.appendEncoding(encodedImages, queryEncoding, typeID)
            return True
    return False
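
encoding_builder.appendEncoding isn't shown either; assuming the same JSON-on-S3 storage as the sketch above, all it has to do is append the new encoding and persist the updated list:

def appendEncoding(encodedImages: list, queryEncoding, typeID: str):
    encodedImages.append(queryEncoding.tolist())  # add the new match to the list
    saveEncoding(encodedImages, typeID)           # persist the updated list on S3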

Implementation

If you're not familiar with using the CLIP model, also check this link.
The two libraries needed to make this work are Pillow's Image module for loading images and the Sentence Transformers Framework.
The first thing to do is to load the CLIP model, which will then be used to encode the images:

from sentence_transformers import SentenceTransformer

CLIPmodel = SentenceTransformer('clip-ViT-B-32')
# note: you can load a local model from your machine
#       by putting the path instead of the model name

Then, to encode an image, all you need to do is call the encode() function on the model, passing it the image opened with PIL:

from PIL import Image
encoding = CLIPmodel.encode(Image.open(image_path))
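
If you need to encode a whole set of images at once (as buildEncoding does above), encode() also accepts a list of images; image_paths here is just an assumed list of file paths:

images = [Image.open(path) for path in image_paths]
encodings = CLIPmodel.encode(images, batch_size=32, show_progress_bar=True)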

Finally, the util module of Sentence Transformers provides various functions to compute the similarity between encodings, cosine similarity in our case:

from sentence_transformers import util

similarities = util.cos_sim(trainEncoding, queryEncoding)
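
Note that cos_sim also accepts whole lists of encodings, so the comparison against all the pre-built encodings of an object type could be done in a single call instead of a Python loop (a possible variation on the recognizeImage loop above, where trainEncodings is assumed to be the full encoding list):

similarities = util.cos_sim(trainEncodings, queryEncoding)  # shape: (len(trainEncodings), 1)
best_score = similarities.max().item()
is_match = best_score > 0.92  # same threshold as before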

Conclusion

The CLIP model is a very powerful tool that can help with many image recognition tasks, and in many cases it delivers better performance and accuracy than most other methods, even in my situation, where the images differ quite a lot from, and are far more specific than, the data CLIP was trained on. And don't forget that the model can be fine-tuned to improve accuracy even further, but that may be for another article...
