Linghua Jin

Why Your Image Search Sucks (And How ColPali + Multi-Vector Indexing Fixes It)

The Problem: Why Traditional Image Search Is Broken

If you've ever tried to build an image search system, you know the pain. Traditional approaches collapse entire images into a single dense vector—essentially compressing a complex visual scene into one point in high-dimensional space.

What gets lost?

  • Spatial layout and positioning
  • Multiple objects in cluttered scenes
  • Fine-grained details like charts, diagrams, or text regions
  • Local semantic context that matters for precision matching

Think about searching a technical manual for "the diagram showing database architecture." A global vector can't pinpoint WHERE in the page that diagram appears or distinguish it from other visual elements. You're stuck with fuzzy, imprecise matches.

Enter ColPali: Patch-Level Multi-Vector Indexing

ColPali (Contextualized Late Interaction over PaliGemma) fundamentally rethinks visual search. Instead of one vector per image, it generates hundreds or thousands of patch-level embeddings—preserving spatial structure and semantic richness.

How It Works

  1. Image Decomposition: Each image is split into a grid of patches (e.g., 32×32 = 1,024 patches per page)
  2. Patch Embeddings: Every patch gets its own contextual embedding using a vision-language model
  3. Late Interaction: At query time, your text query tokens are matched against ALL patch embeddings
  4. MaxSim Scoring: For each query token, we keep only the maximum similarity across all patches, then sum these scores

This is inspired by ColBERT's late interaction paradigm—but adapted for multimodal visual search.
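
To make steps 3–4 concrete, here's a minimal NumPy sketch of MaxSim scoring (shapes and names are illustrative, not CocoIndex internals):

import numpy as np

def maxsim_score(query_tokens, patches):
    # query_tokens: (num_query_tokens, dim); patches: (num_patches, dim).
    # Both are assumed L2-normalized, so dot products are cosine similarities.
    sim = query_tokens @ patches.T            # (num_query_tokens, num_patches)
    # Best-matching patch per query token, summed over tokens
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))      # 8 query tokens, 128-dim embeddings
p = rng.normal(size=(1024, 128))   # 32x32 = 1,024 patches
q /= np.linalg.norm(q, axis=1, keepdims=True)
p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim_score(q, p))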

Why This Matters

🎯 Fine-Grained Search: Match specific regions, not just global semantics

🏗️ Preserved Structure: Spatial relationships and layout information stay intact

📊 Better Recall: Dense visual scenes don't "forget" small important regions

⚡ Efficient Retrieval: Late interaction avoids expensive cross-attention at index time

🚫 No OCR Needed: Process images natively without error-prone text extraction

Building It with CocoIndex + Qdrant

Here's the architecture we're building:

Images → ColPali Embedding → Multi-Vector Storage (Qdrant) → Late Interaction Search

Step 1: Ingest Images

import datetime

import cocoindex


@cocoindex.flow_def(name="ImageObjectEmbeddingColpali")
def image_object_embedding_flow(flow_builder, data_scope):
    # Watch a local directory of images; re-scan for changes every minute
    data_scope["images"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="img",
            included_patterns=["*.jpg", "*.jpeg", "*.png"],
            binary=True,
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )

This watches a local directory and auto-refreshes every minute as new images arrive.

Step 2: Embed with ColPali

img_embeddings = data_scope.add_collector()

with data_scope["images"].row() as img:
    # One multi-vector (patch-level) embedding per image
    img["embedding"] = img["content"].transform(
        cocoindex.functions.ColPaliEmbedImage(
            model="vidore/colpali-v1.2"
        )
    )

Each image now becomes a multi-vector representation: Vector[Vector[Float32, N]]

Where:

  • Outer dimension = number of patches (e.g., 1024)
  • Inner dimension = model hidden size (e.g., 128)
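
As a quick mental model (shapes illustrative, matching the 32×32 example):

import numpy as np

# One image -> 1,024 patch vectors, each 128 floats
patch_embeddings = np.zeros((1024, 128), dtype=np.float32)
print(patch_embeddings.shape)  # (1024, 128), i.e. Vector[Vector[Float32, 128]]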

Step 3: Store in Qdrant

collect_fields = {
    "id": cocoindex.GeneratedField.UUID,
    "filename": img["filename"],
    "embedding": img["embedding"],
}

img_embeddings.collect(**collect_fields)

img_embeddings.export(
    "img_embeddings",
    cocoindex.targets.Qdrant(collection_name="ImageSearchColpali"),
    primary_key_fields=["id"],
)

Qdrant natively supports multi-vector fields, making it perfect for ColPali's patch-based approach.
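
CocoIndex creates and manages the collection for you, but for intuition, configuring the equivalent multi-vector collection by hand with qdrant-client would look roughly like this (the MAX_SIM comparator is what enables late interaction; the exact config CocoIndex generates may differ):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # adjust for your deployment

client.create_collection(
    collection_name="ImageSearchColpali",
    vectors_config={
        "embedding": models.VectorParams(
            size=128,  # ColPali hidden size
            distance=models.Distance.COSINE,
            # Score each query vector against its best-matching patch vector
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    },
)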

Step 4: Real-Time Indexing

from contextlib import asynccontextmanager

from dotenv import load_dotenv
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    load_dotenv()
    cocoindex.init()
    image_object_embedding_flow.setup(report_to_stdout=True)

    # Keep the index in sync with the source for the app's lifetime
    app.state.live_updater = cocoindex.FlowLiveUpdater(
        image_object_embedding_flow
    )
    app.state.live_updater.start()
    yield


app = FastAPI(lifespan=lifespan)

Now your index stays synchronized in real-time as images are added, modified, or deleted.

Querying the Index

from typing import Any

from fastapi import Query
from qdrant_client import QdrantClient

# Assumes a running Qdrant instance; adjust the URL for your deployment
qdrant_client = QdrantClient(url="http://localhost:6333")


@app.get("/search")
def search(
    q: str = Query(..., description="Search query"),
    limit: int = Query(5, description="Number of results"),
) -> Any:
    # Multi-vector embedding for the query (one vector per query token);
    # text_to_colpali_embedding is a transform flow using ColPali's query
    # encoder, defined in the full example
    query_embedding = text_to_colpali_embedding.eval(q)

    # Late interaction (MaxSim) search in Qdrant
    results = qdrant_client.query_points(
        collection_name="ImageSearchColpali",
        query=query_embedding,
        using="embedding",  # the multi-vector field exported above
        limit=limit,
    )

    return results
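
With the app served by uvicorn on its default port (an assumption; adjust to your setup), a query looks like:

curl "http://localhost:8000/search?q=database+architecture+diagram&limit=5"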

The Performance Difference

Compared to single-vector approaches (like CLIP), ColPali delivers:

  • Richer retrieval: captures nuanced visual details
  • Better localization: identifies specific regions in complex scenes
  • Higher recall: doesn't miss small but important elements
  • Interpretability: MaxSim scores show which patches matched which query tokens

Beyond Local Files: Connect Any Data Source

CocoIndex supports production-ready source connectors:

  • Google Drive: Auto-sync documents and images
  • Amazon S3/SQS: Event-driven indexing at scale
  • Azure Blob Storage: Enterprise cloud integration

Changes are automatically detected and reflected in your index in real-time—no manual rebuilds required.
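
For example, switching the flow's source from local files to S3 with SQS change notifications might look like this; AmazonS3 is in CocoIndex's connector list, but treat the exact parameter names below as assumptions and check the connector docs:

# Hedged sketch: replace the LocalFile source with S3 + SQS change events.
# Parameter names are assumptions -- verify against the CocoIndex docs.
data_scope["images"] = flow_builder.add_source(
    cocoindex.sources.AmazonS3(
        bucket_name="my-image-bucket",  # hypothetical bucket
        binary=True,
        included_patterns=["*.jpg", "*.jpeg", "*.png"],
        sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789/image-events",  # hypothetical
    ),
)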

Use Cases

🔍 Visual RAG: Build AI agents that understand document layouts

📚 Document Search: Find specific charts, tables, or diagrams in manuals

🏥 Medical Imaging: Search radiology reports by anatomical features

🛍️ E-commerce: Fine-grained product image search

🎨 Digital Asset Management: Search design files by visual composition

The Technical Details That Matter

Storage Format

Vector[Vector[Float32, embedding_dim]]
  • Each image = array of patch vectors
  • Enables late interaction strategies
  • Compatible with quantization and compression (HPC-ColPali)

Late Interaction Scoring

score = Σ_i max_j sim(query_token_i, patch_j)

(for each query token i, take the best-matching patch j, then sum over tokens)
  • Avoids expensive joint encoding
  • Enables efficient retrieval at scale
  • Preserves interpretability

Scaling Strategies

  • Quantization: Compress embeddings with minimal accuracy loss (see the sketch after this list)
  • Hierarchical Patch Compression: Further reduce storage needs
  • Distributed Indexing: Scale to billions of images
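
As a concrete instance of the quantization point, Qdrant supports scalar quantization of stored vectors; a minimal sketch with qdrant-client (parameter values are illustrative):

from qdrant_client import models

# int8 scalar quantization: ~4x smaller than float32 patch vectors
quantization = models.ScalarQuantization(
    scalar=models.ScalarQuantizationConfig(
        type=models.ScalarType.INT8,
        quantile=0.99,      # clip extreme values before quantizing
        always_ram=True,    # keep quantized vectors in RAM for fast scoring
    )
)
# Pass quantization_config=quantization to client.create_collection(...)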

Try It Yourself

Full working code: github.com/cocoindex-io/cocoindex/tree/main/examples/image_search

pip install cocoindex
# Run the example
python examples/image_search/colpali_main.py

Why This Matters for Production

Traditional image search fails when you need:

  • Precise localization in complex scenes
  • Multi-object understanding
  • Layout-aware retrieval
  • Real-time synchronization with changing data sources

ColPali + CocoIndex gives you a production-ready foundation that handles all of this—with just a few lines of declarative Python.


Star CocoIndex on GitHub if you're building multimodal AI systems: github.com/cocoindex-io/cocoindex

Questions? Join our Discord community or check out the docs.


Building next-gen AI infrastructure for multimodal search? CocoIndex is the missing piece you've been looking for.
