The Problem: Why Traditional Image Search Is Broken
If you've ever tried to build an image search system, you know the pain. Traditional approaches collapse entire images into a single dense vector—essentially compressing a complex visual scene into one point in high-dimensional space.
What gets lost?
- Spatial layout and positioning
- Multiple objects in cluttered scenes
- Fine-grained details like charts, diagrams, or text regions
- Local semantic context that matters for precision matching
Think about searching a technical manual for "the diagram showing database architecture." A global vector can't pinpoint WHERE in the page that diagram appears or distinguish it from other visual elements. You're stuck with fuzzy, imprecise matches.
Enter ColPali: Patch-Level Multi-Vector Indexing
ColPali (ColBERT-style late interaction over the PaliGemma vision-language model) fundamentally rethinks visual search. Instead of one vector per image, it generates hundreds or thousands of patch-level embeddings—preserving spatial structure and semantic richness.
How It Works
- Image Decomposition: Each image is split into a grid (e.g., 32×32 patches = 1,024 patches per page)
- Patch Embeddings: Every patch gets its own contextual embedding using a vision-language model
- Late Interaction: At query time, your text query tokens are matched against ALL patch embeddings
- MaxSim Scoring: For each query token, we keep only the maximum similarity across all patches, then sum these scores
This is inspired by ColBERT's late interaction paradigm—but adapted for multimodal visual search.
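To make the scoring concrete, here's a minimal NumPy sketch of MaxSim late interaction (the shapes and random vectors are illustrative; a real system scores the model's actual query-token and patch embeddings):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, patches: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take its best-matching
    patch, then sum those maxima over all query tokens.

    query_tokens: (num_query_tokens, dim) L2-normalized query embeddings
    patches:      (num_patches, dim)      L2-normalized patch embeddings
    """
    # (num_query_tokens, num_patches) matrix of cosine similarities
    sim = query_tokens @ patches.T
    # Max over patches for each query token, then sum over query tokens
    return float(sim.max(axis=1).sum())

# Toy example: 8 query tokens, 1,024 patches, 128-dim embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(1024, 128)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim_score(q, p))
```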
Why This Matters
🎯 Fine-Grained Search: Match specific regions, not just global semantics
🏗️ Preserved Structure: Spatial relationships and layout information stay intact
📊 Better Recall: Dense visual scenes don't "forget" small important regions
⚡ Efficient Retrieval: Late interaction avoids expensive cross-attention at index time
🚫 No OCR Needed: Process images natively without error-prone text extraction
Building It with CocoIndex + Qdrant
Here's the architecture we're building:
Images → ColPali Embedding → Multi-Vector Storage (Qdrant) → Late Interaction Search
Step 1: Ingest Images
```python
import datetime

import cocoindex

@cocoindex.flow_def(name="ImageObjectEmbeddingColpali")
def image_object_embedding_flow(flow_builder, data_scope):
    data_scope["images"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="img",
            included_patterns=["*.jpg", "*.jpeg", "*.png"],
            binary=True,
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )
```
This watches a local directory and auto-refreshes every minute as new images arrive.
Step 2: Embed with ColPali
```python
img_embeddings = data_scope.add_collector()

with data_scope["images"].row() as img:
    img["embedding"] = img["content"].transform(
        cocoindex.functions.ColPaliEmbedImage(model="vidore/colpali-v1.2")
    )
```
Each image now becomes a multi-vector representation: `Vector[Vector[Float32, N]]`
Where:
- Outer dimension = number of patches (e.g., 1024)
- Inner dimension = model hidden size (e.g., 128)
Step 3: Store in Qdrant
```python
# Still inside the `with data_scope["images"].row() as img:` block:
collect_fields = {
    "id": cocoindex.GeneratedField.UUID,
    "filename": img["filename"],
    "embedding": img["embedding"],
}
img_embeddings.collect(**collect_fields)

img_embeddings.export(
    "img_embeddings",
    cocoindex.targets.Qdrant(collection_name="ImageSearchColpali"),
    primary_key_fields=["id"],
)
```
Qdrant natively supports multi-vector fields, making it perfect for ColPali's patch-based approach.
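If you provision the collection yourself rather than letting CocoIndex set it up, a minimal sketch of Qdrant's multivector configuration looks like this (the vector name `embedding` and the 128-dim size mirror the example above; treat them as assumptions to verify):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# A multivector field scored with MaxSim, matching ColPali's 128-dim patch vectors
client.create_collection(
    collection_name="ImageSearchColpali",
    vectors_config={
        "embedding": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    },
)
```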
Step 4: Real-Time Indexing
```python
from contextlib import asynccontextmanager

from dotenv import load_dotenv
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    load_dotenv()
    cocoindex.init()
    image_object_embedding_flow.setup(report_to_stdout=True)
    app.state.live_updater = cocoindex.FlowLiveUpdater(image_object_embedding_flow)
    app.state.live_updater.start()
    yield
```
Now your index stays synchronized in real-time as images are added, modified, or deleted.
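Hooking the lifespan into the app is one line of standard FastAPI:

```python
app = FastAPI(lifespan=lifespan)
```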
Querying the Index
```python
from typing import Any

from fastapi import Query

@app.get("/search")
def search(
    q: str = Query(..., description="Search query"),
    limit: int = Query(5, description="Number of results"),
) -> Any:
    # Multi-vector embedding for the query (one vector per query token)
    query_embedding = text_to_colpali_embedding.eval(q)

    # Late interaction search in Qdrant; query_points accepts
    # multivector (list-of-vectors) queries
    results = qdrant_client.query_points(
        collection_name="ImageSearchColpali",
        query=query_embedding,
        using="embedding",
        limit=limit,
    )
    return results
```
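One piece the endpoint assumes is the `text_to_colpali_embedding` helper. A minimal sketch using CocoIndex's transform-flow mechanism might look like the following; `ColPaliEmbedQuery` is the query-side counterpart to `ColPaliEmbedImage`, but treat the exact return-type annotation as an assumption to check against the docs:

```python
@cocoindex.transform_flow()
def text_to_colpali_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[list[float]]]:  # multi-vector annotation is illustrative
    # Embed the query text with the same ColPali model used for the images
    return text.transform(
        cocoindex.functions.ColPaliEmbedQuery(model="vidore/colpali-v1.2")
    )
```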
The Performance Difference
Compared to single-vector approaches (like CLIP), ColPali delivers:
✅ Richer retrieval: Captures nuanced visual details
✅ Better localization: Can identify specific regions in complex scenes
✅ Higher recall: Doesn't miss small but important elements
✅ Interpretability: MaxSim scores show which patches matched which query tokens
Beyond Local Files: Connect Any Data Source
CocoIndex supports production-ready source connectors:
- Google Drive: Auto-sync documents and images
- Amazon S3/SQS: Event-driven indexing at scale
- Azure Blob Storage: Enterprise cloud integration
Changes are automatically detected and reflected in your index in real-time—no manual rebuilds required.
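As a sketch, switching sources is just a different `add_source` call. For example, Google Drive (the parameter names below are illustrative assumptions; check the connector docs for the exact signature):

```python
# Hypothetical Google Drive source configuration; verify parameter names in the docs
data_scope["images"] = flow_builder.add_source(
    cocoindex.sources.GoogleDrive(
        service_account_credential_path="credential.json",
        root_folder_ids=["YOUR_FOLDER_ID"],
    ),
    refresh_interval=datetime.timedelta(minutes=1),
)
```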
Use Cases
🔍 Visual RAG: Build AI agents that understand document layouts
📚 Document Search: Find specific charts, tables, or diagrams in manuals
🏥 Medical Imaging: Search radiology reports by anatomical features
🛍️ E-commerce: Fine-grained product image search
🎨 Digital Asset Management: Search design files by visual composition
The Technical Details That Matter
Storage Format
```
Vector[Vector[Float32, embedding_dim]]
```
- Each image = array of patch vectors
- Enables late interaction strategies
- Compatible with quantization and compression (HPC-ColPali)
Late Interaction Scoring
```
score(q, d) = Σᵢ maxⱼ sim(query_tokenᵢ, patchⱼ)
```
For each query token i, only its best-matching patch j contributes; the sum runs over the query tokens.
- Avoids expensive joint encoding
- Enables efficient retrieval at scale
- Preserves interpretability
Scaling Strategies
- Quantization: Compress embeddings with minimal accuracy loss (see the sketch after this list)
- Hierarchical Patch Compression: Further reduce storage needs
- Distributed Indexing: Scale to billions of images
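For the quantization point, Qdrant can compress stored vectors at the collection level. A hedged sketch with int8 scalar quantization, reusing the collection from this example (measure the recall impact on your own data):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Compress stored vectors to int8; Qdrant keeps the originals for rescoring
client.update_collection(
    collection_name="ImageSearchColpali",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep quantized vectors in RAM for fast scoring
        )
    ),
)
```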
Try It Yourself
Full working code: github.com/cocoindex-io/cocoindex/tree/main/examples/image_search
```bash
pip install cocoindex

# Run the example
python examples/image_search/colpali_main.py
```
Why This Matters for Production
Traditional image search fails when you need:
- Precise localization in complex scenes
- Multi-object understanding
- Layout-aware retrieval
- Real-time synchronization with changing data sources
ColPali + CocoIndex gives you a production-ready foundation that handles all of this—with just a few lines of declarative Python.
Star CocoIndex on GitHub if you're building multimodal AI systems: github.com/cocoindex-io/cocoindex
Questions? Join our Discord community or check out the docs.
Building next-gen AI infrastructure for multimodal search? CocoIndex is the missing piece you've been looking for.
