IronSpiderMan

How to implement image search?

Preface

Many search engines have a built-in image search (search-by-image) feature, which can greatly simplify search work. But have you ever wondered how this feature is implemented?

Today, we will analyze the principles and implement the image search function.

Let's first discuss the difficulties of searching for images with an image. The first problem is how to compare the similarity of two images: what makes two images "similar"? A person can judge at a glance, but a computer cannot. To a computer, an image is a matrix of numbers, so comparing images means comparing matrices, and that raises some problems.

The second problem is size. Images usually have different sizes, and matrices of different shapes cannot be compared directly. This one is easy to solve: just resize the images to the same dimensions.

The third problem is that raw pixels carry very limited information and cannot express abstract properties such as painting style, the objects in the scene, or the color tone.

Based on the above, we are left with two problems to solve: what representation to use instead of raw pixels, and how to calculate similarity.

Image features

Representing an image by its raw pixels has many disadvantages, and computer vision offers plenty of alternatives, such as SIFT descriptors or the feature maps of a neural network. In these methods the image is represented as a vector, which we call an embedding.

Embeddings are a powerful idea that has been widely used in NLP. Once we encode content as vectors, we can perform algebraic operations on it, such as computing similarity or doing semantic addition and subtraction. The figure below is an example of word embeddings: "tiger" and "cat" end up close to each other, and so do "car" and "auto".

(word2vec word embedding example)
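
To make this concrete, here is a tiny sketch with made-up 3-dimensional vectors (the numbers are invented purely for illustration; real embeddings have hundreds of dimensions):

import numpy as np

# Made-up toy 3-dimensional embeddings, purely for illustration.
tiger = np.array([0.9, 0.1, 0.0])
cat = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.9, 0.8])

def distance(e1, e2):
    # Euclidean distance: the smaller it is, the more similar the contents are.
    return np.sqrt(np.sum((e1 - e2) ** 2))

print(distance(tiger, cat))  # small: "tiger" and "cat" are close
print(distance(tiger, car))  # large: "tiger" and "car" are far apart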

The question now is how to extract an image embedding. We can use convolutional networks such as ResNet or AlexNet, or we can use a ViT (Vision Transformer). Here we choose ViT.

A ViT encodes an image into a matrix of shape (patches + 1) × embedding_dim. We select the first row, which corresponds to the [CLS] token, as the embedding of the image.

Below is the PyTorch code:

import glob
import hashlib
import random

import numpy as np
import torch
import chromadb
from tqdm import tqdm
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
from chromadb import Documents, EmbeddingFunction, Embeddings


device = "cuda" if torch.cuda.is_available() else "cpu"


def md5encode(text):
    """Hash the file path to build a stable id for the chromadb collection."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class VitEmbeddingFunction(EmbeddingFunction):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The processor resizes and normalizes images; the model outputs a
        # (patches + 1) x embedding_dim hidden state per image.
        self.processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.model = ViTModel.from_pretrained("google/vit-base-patch16-224").to(device)

    def __call__(self, images: Documents) -> Embeddings:
        # The "documents" chromadb passes in are image file paths.
        images = [Image.open(image) for image in images]
        inputs = self.processor(images, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = self.model(**inputs)
            last_hidden_state = outputs.last_hidden_state
        # Use the [CLS] token (first row) as the embedding of each image.
        return last_hidden_state[:, 0, :].cpu().numpy().tolist()

Note that in the latest versions of chromadb, the parameter of EmbeddingFunction's __call__ method must be named input.

To get an image embedding, we only need to create a VitEmbeddingFunction instance and call it with the image paths. In practice, though, we can leave this to chromadb.
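
For example (a minimal sketch; the image file name is just a placeholder):

embedding_function = VitEmbeddingFunction()
# Pass a list of image file paths and get back one embedding per image.
embeddings = embedding_function(["some_image.jpg"])
print(len(embeddings), len(embeddings[0]))  # 1 image, 768-dimensional embedding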

Vector database

A vector database is not strictly necessary, but it can speed up the search. Before using one, let's talk about how the search itself works.

We can use Euclidean distance to measure similarity; the calculation looks like this:

# e1 and e2 are the embedding vectors of two images
distance = np.sqrt(np.sum((e1 - e2) ** 2))

We can use a linear search to find the image with the smallest distance to our query image, as sketched below, but this brute-force approach is too slow for large collections; a tree-like structure makes searching much more efficient.
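
For reference, a brute-force linear search might look like this (a minimal sketch, assuming embeddings is an (n, d) NumPy array with one row per stored image):

import numpy as np

def linear_search(query, embeddings, k=10):
    # Euclidean distance from the query embedding to every stored embedding.
    distances = np.sqrt(np.sum((embeddings - query) ** 2, axis=1))
    # Indices of the k closest images.
    return np.argsort(distances)[:k]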

We can build a binary search tree for numbers, but how to build a tree for vectors?

This is where clustering algorithms come in. We divide all images into m clusters. When searching, we first find the cluster center closest to the query, and then search only within that cluster for the closest image (see the sketch below). This reduces the time complexity from roughly O(n) to O(n/m). We can also keep clustering recursively within each cluster to form a tree, which brings the complexity down to about O(log n).
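
Here is a minimal sketch of this two-stage idea, assuming scikit-learn's KMeans and the same (n, d) embeddings array; it only illustrates the principle, real vector databases use more sophisticated indexes:

import numpy as np
from sklearn.cluster import KMeans

def build_index(embeddings, m=16):
    # Partition all image embeddings into m clusters.
    return KMeans(n_clusters=m, n_init=10).fit(embeddings)

def cluster_search(query, embeddings, kmeans, k=10):
    # Step 1: find the cluster center closest to the query.
    nearest_cluster = np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1))
    # Step 2: linear search, but only among that cluster's members.
    members = np.where(kmeans.labels_ == nearest_cluster)[0]
    distances = np.linalg.norm(embeddings[members] - query, axis=1)
    return members[np.argsort(distances)[:k]]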

In a vector database, this kind of indexing is handled automatically, so we can use it directly.

Below is the Python code:

# Persistent chromadb store; embeddings are produced by VitEmbeddingFunction.
client = chromadb.PersistentClient(path="image_search")
collection = client.get_or_create_collection(name="images", embedding_function=VitEmbeddingFunction())
files = glob.glob(r"G:\datasets\lbxx\lbxx\*")

# You only need to store the embeddings the first time.
# bar = tqdm(total=len(files))
# for file in files:
#     collection.add(
#         documents=[file],                  # the "document" is the image path
#         metadatas=[{"source": file}],
#         ids=[md5encode(file)]              # stable id derived from the path
#     )
#     bar.update(1)

# Query with an image: the path is embedded by VitEmbeddingFunction and
# chromadb returns the 10 nearest stored images.
# file = random.choice(files)
file = r"G:\datasets\lbxx\lbxx\10067.jpeg"
similar_images = collection.query(
    query_texts=[file],
    n_results=10,
)
similar_images = similar_images['documents']

# Stitch the query image and the results side by side for a quick visual check.
images = [file, *similar_images[0]]
image = np.hstack([np.array(Image.open(image).resize((512, 512)))
                   for image in images])
Image.fromarray(image).show()

Below are the search results. The one on the far left is the query image, and the ones to its right are the result images.

(Search results image)

The code has been uploaded to GitHub.

GitHub: ImageSearch.

Top comments (2)

Oscar

Awesome article! Why did you choose Vit over your other choices (like ResNet)?

IronSpiderMan

If you use a convolutional network, you need to decide which part of the output to use as the embedding. With ViT, the output corresponding to the [CLS] token is conventionally used as the embedding, so I chose ViT.