Akmal Chaudhri for SingleStore


Quick tip: Visualising Similarities Between CLIP Text and Image Embeddings

Abstract

In this article, we'll use OpenAI's CLIP model (Contrastive Language-Image Pre-training) to analyse the relationship between text and visual data by encoding and comparing their feature representations. Cosine similarity is calculated between image and text embeddings, and several dimensionality reduction techniques are used to create 2D visualisations of these relationships.

The notebook file used in this article is available on GitHub.

Introduction

In this article, we'll explore OpenAI's CLIP model to evaluate the relationship between image and text data. The CLIP model encodes both text and image features, which are then normalised so that cosine similarity can be computed to measure the relevance between the two modalities.

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Import the notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

Run the notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll begin by installing the necessary libraries and importing dependencies.
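
The notebook contains the exact install and import cells. As a rough sketch (the package list here is inferred from the code used later in the article, not copied from the notebook), the dependencies look something like this:

!pip install ftfy regex tqdm umap-learn plotly scikit-learn
!pip install git+https://github.com/openai/CLIP.git

import requests
import torch
import clip
import umap
import plotly.express as px

from io import BytesIO
from IPython.display import Image, display
from PIL import Image as PILImage
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE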

Next, we'll load the CLIP model and preprocess function:

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)

We'll then download a sample image, preprocess it, and create some sample text, as follows:

image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"
response = requests.get(image_url)
display(Image(url = image_url))

image = preprocess(
    PILImage.open(
        BytesIO(response.content)
    )
).unsqueeze(0).to(device)

texts = [
    "What makes SingleStoreDB unique",
    "Ultra-Fast Ingestion",
    "Pipelines"
]

Next, we'll encode the image and text features:

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(
        clip.tokenize(texts).to(device)
    )

We'll normalise the features:

image_features /= image_features.norm(dim = -1, keepdim = True)
text_features /= text_features.norm(dim = -1, keepdim = True)

then combine the embeddings:

combined_features = torch.cat([
    image_features,
    text_features
], dim = 0).cpu().numpy()

and compute the cosine similarities:

similarities = [calculate_similarity(image_features, text_features[i]) for i in range(len(texts))]
labels = ["What makes SingleStoreDB unique (Image)"] + [
    f"{text} (Cosine Similarity: {similarity:.6f})" for text, similarity in zip(texts, similarities)
]
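
Here, calculate_similarity is a small helper defined in the notebook. Since both sets of features have already been normalised, cosine similarity reduces to a dot product; a minimal sketch of such a helper (the notebook's actual implementation may differ) is:

def calculate_similarity(image_features, text_feature):
    # The features are already L2-normalised, so the dot product is the cosine similarity
    return (image_features @ text_feature).item()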

Before plotting, we'll print the similarity scores:

print(f"{'Text':<35} {'Cosine Similarity':<10}")
print("-" * 60)

for text, similarity in zip(texts, similarities):
    print(f"{text:<35} {similarity:<10.6f}")

Example output:

Text                                Cosine Similarity
------------------------------------------------------------
What makes SingleStoreDB unique     0.265887  
Ultra-Fast Ingestion                0.155181  
Pipelines                           0.153016

We'll create a function to handle the different plots:

def plot_reduction(data, title, similarities):
    fig = px.scatter(
        x = data[:, 0],
        y = data[:, 1],
        color = labels,
        title = title,
        labels = {"x": "x", "y": "y"},
        size = similarities
    )
    # fig.update_traces(marker = dict(sizemode = "diameter", sizemin = 5))
    fig.show()

image_marker_size = 1

First, we'll plot PCA:

pca = PCA(n_components = 2)
pca_result = pca.fit_transform(combined_features)
plot_reduction(
    pca_result,
    "PCA",
    [image_marker_size] + similarities
)

Example output is shown in Figure 1.

Figure 1. PCA.

Next, we'll plot UMAP:

n_neighbors = min(15, combined_features.shape[0] - 1)
umap_model = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, random_state = 42)
umap_result = umap_model.fit_transform(combined_features)
plot_reduction(
    umap_result,
    "UMAP",
    [image_marker_size] + similarities
)

Example output is shown in Figure 2.

Figure 2. UMAP.

Finally, we'll plot t-SNE:

perplexity = min(30, combined_features.shape[0] - 1)
tsne = TSNE(n_components = 2, perplexity = perplexity, random_state = 42)
tsne_result = tsne.fit_transform(combined_features)
plot_reduction(
    tsne_result,
    "t-SNE",
    [image_marker_size] + similarities
)

Example output is shown in Figure 3.

Figure 3. t-SNE.

Summary

In this article, we used Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbour Embedding (t-SNE) to visualise the reduced feature space. Plotly charts for each method displayed the embeddings, with text-image cosine similarities determining marker sizes. This demonstrated CLIP's ability to integrate and interpret multi-modal data, and gave us a quick visual sense of how closely each text description relates to the image.
