M Sea Bass

Posted on Jan 4

Try Multimodal Search with ColQwen2!

#multimodal #llm #rag #python

This article shows how to use ColQwen2 for multimodal search.

ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.
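To make "ColBERT-style multi-vector representations" more concrete: each query token and each image patch gets its own embedding vector, and a page is scored by matching every query vector to its most similar page vector and summing those best matches (often called MaxSim or late interaction). The sketch below illustrates that scoring rule in plain Python. It is not ColQwen2's actual implementation; the function names and the tiny 2-D vectors are made up for illustration.

```python
# Minimal sketch of ColBERT-style late-interaction (MaxSim) scoring.
# All vectors below are hypothetical toy values, not real model outputs.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vectors, document_vectors):
    """For each query vector, take its best match among the document
    vectors, then sum those best-match similarities."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )

# One vector per query token (hypothetical values)
query = [[1.0, 0.0], [0.0, 1.0]]

# One vector per image patch, for two hypothetical pages
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.3, 0.3], [0.1, 0.2]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # page_a scores higher than page_b
```

Because every query token looks for its own best-matching patch, a page can score well when different parts of the query match different regions of the page.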

We will test ColQwen2 using Google Colab with an A100 GPU.

Library Installation

First, install the necessary libraries:

!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf

Preparing Image Data

Next, prepare the image data. For this tutorial, we’ll use the ColPali paper.

Using pymupdf, we’ll render each page of the PDF as a PNG image:

import pymupdf
import os

# Constants
DPI = 350  # Can be modified as needed

def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.
    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    os.makedirs(output_dir, exist_ok=True)

    pdf_document = pymupdf.open(pdf_path)

    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        pix = page.get_pixmap(dpi=DPI)
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)

    pdf_document.close()

pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)

Images will be saved in the "output_images" folder.
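Since the search scores later refer to images by position, it can help to map each saved file back to its page number explicitly rather than rely on directory listing order, which glob does not guarantee. The following is a small sketch; `list_page_images` is a hypothetical helper name, and it assumes the `page_XX.png` naming used in the conversion code.

```python
import glob
import os
import re

def list_page_images(image_dir):
    """Return (page_number, path) pairs sorted by page number.

    Assumes file names follow the page_XX.png pattern used above;
    files that do not match the pattern are ignored.
    """
    pages = []
    for path in glob.glob(os.path.join(image_dir, "page_*.png")):
        match = re.fullmatch(r"page_(\d+)\.png", os.path.basename(path))
        if match:
            pages.append((int(match.group(1)), path))
    return sorted(pages)

# Only list files if the folder from the conversion step exists
if os.path.isdir("output_images"):
    for page_number, path in list_page_images("output_images"):
        print(page_number, path)
```

Sorting by the parsed page number keeps the order correct even if a document has more pages than the zero-padding width anticipates.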

Searching the Images

Now, let’s use ColQwen2. Refer to the model’s Hugging Face page for sample code.

After downloading the paper PDF and uploading it to Google Colab, run the following code:

import glob, os

import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
        "vidore/colqwen2-v0.1",
        torch_dtype=torch.bfloat16,
        device_map=device,  # or "mps" if on Apple Silicon
    ).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs
# Sort the file names so that image i corresponds to page i + 1
images = [Image.open(filepath) for filepath in sorted(glob.glob(os.path.join(output_dir, "*.png")))]
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries (the images are processed batch by batch below)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
batch_size = 1  # One page per forward pass to keep GPU memory usage low
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    # Downscale to 512x512 to save memory (note: this ignores the aspect ratio)
    resized_batch = [img.resize((512, 512)) for img in batch]
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)
with torch.no_grad():
    query_embeddings = model(**batch_queries)

image_embeddings = torch.stack(image_embeddings)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)

print(scores)

The scores are returned as a tensor of shape (number of queries, number of images). Each row holds one query’s score for every page:

tensor([[13.2500,  8.4375, 11.3750, 11.1875, 13.8125, 12.0000,  8.3125,  9.0000,
         10.4375,  8.7500, 10.4375, 11.6250,  7.8438,  7.4375,  9.9375,  8.0625,
          7.5000, 10.9375,  9.7500,  7.8750],
        [ 8.3750,  7.5000,  9.6250,  8.3125,  7.5625,  8.1250,  7.9688,  8.4375,
          8.5000,  9.0625,  7.7812,  8.3125,  7.5000,  7.9062,  8.6875,  7.9688,
          7.9062,  7.9688,  8.7500,  7.5000]])
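To read this matrix, rank each row and take the positions of the largest values. The sketch below does this in plain Python on the values printed above; `top_k_pages` is a hypothetical helper, and it assumes column i corresponds to page i + 1 (that is, the images were loaded in page order).

```python
# Scores copied from the output above: one row per query, one column per page.
scores = [
    [13.2500, 8.4375, 11.3750, 11.1875, 13.8125, 12.0000, 8.3125, 9.0000,
     10.4375, 8.7500, 10.4375, 11.6250, 7.8438, 7.4375, 9.9375, 8.0625,
     7.5000, 10.9375, 9.7500, 7.8750],
    [8.3750, 7.5000, 9.6250, 8.3125, 7.5625, 8.1250, 7.9688, 8.4375,
     8.5000, 9.0625, 7.7812, 8.3125, 7.5000, 7.9062, 8.6875, 7.9688,
     7.9062, 7.9688, 8.7500, 7.5000],
]

def top_k_pages(row, k=2):
    """Return the 1-based page numbers of the k highest scores in a row."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [i + 1 for i in ranked[:k]]

for row in scores:
    print(top_k_pages(row))
# → [5, 1]
# → [3, 10]
```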

Visualizing Scores

Let’s visualize the scores:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

scores_df = pd.DataFrame(scores.cpu().numpy(), columns=[f'Image {i+1}' for i in range(scores.shape[1])]).T
scores_df.index.name = 'Images'

# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))

# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

plt.tight_layout()
plt.show()

Inspecting the Top Results

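Let’s check the top 2 results for each query:

# Indices of the two highest-scoring pages per query (0-based)
top_k_indices = scores.topk(2, dim=1).indices

for query, page_indices in zip(queries, top_k_indices.tolist()):
    print(f"{query}: pages {[idx + 1 for idx in page_indices]}")
    for idx in page_indices:
        # File names are zero-padded to two digits (page_01.png, ...)
        image_path = os.path.join(output_dir, f"page_{idx + 1:02}.png")
        display(Image.open(image_path))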

Query 1: What is the architecture of ColPali?

Top 2 results:

1st: Page 5

2nd: Page 1

Page 5, which received the highest score, includes the word “Architecture.” However, the architecture diagram on page 2 received a lower score.

Query 2: How does it differ from previous studies?

Top 2 results:

1st: Page 3

2nd: Page 10

Page 3 contains part of the “Related Work” section, but page 2, where that section begins, scored lower. Page 10, which contains references, also ranked highly, as expected.

Conclusion

We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.

To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.
