Akmal Chaudhri for SingleStore

Quick tip: Visualising Similarities Between CLIP Text and Image Embeddings

Abstract

In this article, we'll use OpenAI's CLIP model (Contrastive Language-Image Pre-training) to analyse the relationship between text and visual data by encoding and comparing their feature representations. Cosine similarity is calculated between image and text embeddings, and several dimensionality reduction techniques are used to create 2D visualisations of these relationships.

The notebook file used in this article is available on GitHub.

Introduction

In this article, we'll explore OpenAI's CLIP model to evaluate the relationship between images and text data. The CLIP model encodes both text and image features, which are then normalised so that cosine similarity can be computed as a measure of the relevance between the two modalities.
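
As a quick illustration of the metric (not taken from the notebook), the cosine similarity of two L2-normalised vectors is simply their dot product:

import numpy as np

# Toy vectors that are already unit length, so the dot product is the cosine similarity
a = np.array([0.6, 0.8])
b = np.array([1.0, 0.0])
print(np.dot(a, b))  # 0.6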

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Import the notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

Run the notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll begin by installing the necessary libraries and importing dependencies.
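
The exact install and import cells are in the notebook; a minimal sketch of the imports the later snippets rely on might look like this (assuming CLIP is installed from the openai/CLIP GitHub repository and UMAP from the umap-learn package):

from io import BytesIO

import clip
import plotly.express as px
import requests
import torch
import umap
from IPython.display import Image, display
from PIL import Image as PILImage
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE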

Next, we'll load the CLIP model and preprocess function:

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)

We'll then download a sample image, preprocess it, and create some sample text, as follows:

image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"
response = requests.get(image_url)
display(Image(url = image_url))

image = preprocess(
    PILImage.open(
        BytesIO(response.content)
    )
).unsqueeze(0).to(device)

texts = [
    "What makes SingleStoreDB unique",
    "Ultra-Fast Ingestion",
    "Pipelines"
]

Next, we'll encode the image and text features:

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(
        clip.tokenize(texts).to(device)
    )

We'll normalise the features:

image_features /= image_features.norm(dim = -1, keepdim = True)
text_features /= text_features.norm(dim = -1, keepdim = True)

then combine the embeddings:

combined_features = torch.cat([
    image_features,
    text_features
], dim = 0).cpu().numpy()

and compute the cosine similarities:

similarities = [calculate_similarity(image_features, text_features[i]) for i in range(len(texts))]
labels = ["What makes SingleStoreDB unique (Image)"] + [
    f"{text} (Cosine Similarity: {similarity:.6f})" for text, similarity in zip(texts, similarities)
]
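
The calculate_similarity helper used above is defined earlier in the notebook. A minimal sketch, assuming it takes the normalised image features and a single normalised text embedding and returns their dot product as a Python float:

def calculate_similarity(image_features, text_feature):
    # Both inputs are already L2-normalised, so their dot product is the cosine similarity
    return (image_features @ text_feature).item()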

Before plotting, we'll print the similarity scores:

print(f"{'Text':<35} {'Cosine Similarity':<10}")
print("-" * 60)

for text, similarity in zip(texts, similarities):
    print(f"{text:<35} {similarity:<10.6f}")

Example output:

Text                                Cosine Similarity
------------------------------------------------------------
What makes SingleStoreDB unique     0.265887  
Ultra-Fast Ingestion                0.155181  
Pipelines                           0.153016

We'll create a function to handle the different plots, using each text's cosine similarity with the image to scale its marker size (the image itself gets a fixed marker size):

def plot_reduction(data, title, similarities):
    fig = px.scatter(
        x = data[:, 0],
        y = data[:, 1],
        color = labels,
        title = title,
        labels = {"x": "x", "y": "y"},
        size = similarities
    )
    # fig.update_traces(marker = dict(sizemode = "diameter", sizemin = 5))
    fig.show()

image_marker_size = 1

First, we'll plot PCA:

pca = PCA(n_components = 2)
pca_result = pca.fit_transform(combined_features)
plot_reduction(
    pca_result,
    "PCA",
    [image_marker_size] + similarities
)

Example output is shown in Figure 1.

Figure 1. PCA.

Next, we'll plot UMAP:

n_neighbors = min(15, combined_features.shape[0] - 1)
umap_model = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, random_state = 42)
umap_result = umap_model.fit_transform(combined_features)
plot_reduction(
    umap_result,
    "UMAP",
    [image_marker_size] + similarities
)

Example output is shown in Figure 2.

Figure 2. UMAP.

Finally, we'll plot t-SNE:

perplexity = min(30, combined_features.shape[0] - 1)
tsne = TSNE(n_components = 2, perplexity = perplexity, random_state = 42)
tsne_result = tsne.fit_transform(combined_features)
plot_reduction(
    tsne_result,
    "t-SNE",
    [image_marker_size] + similarities
)

Example output is shown in Figure 3.

Figure 3. t-SNE.

Summary

In this article, we used Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbour Embedding (t-SNE) to visualise the reduced feature space. Plotly charts for each method displayed the embeddings, with text-image cosine similarities determining marker sizes. This demonstrated CLIP's ability to integrate and interpret multi-modal data, offering a useful way to analyse textual and visual features side by side.
