WonderLab

Posted on May 20

RAG Series (23): Multimodal RAG — Images and Tables Can Be Retrieved Too

#rag #multimodal #vision #llm

What Text RAG Can't See

Upload an annual report PDF. It contains revenue trend charts, product comparison tables, architecture diagrams. What does traditional RAG do?

A PDF parser extracts text
Text is chunked, embedded, stored in the vector store
User asks: "What was the revenue growth in Q3?"

The problem: the revenue chart is an image. At best, the PDF parser extracts its alt text (usually empty) or its filename. The numbers live in the pixels, not in the text layer, so text-only RAG can never retrieve them.

Tables are slightly better, but still problematic: parsers often flatten tables into lines of text, destroying the row/column structure and garbling the semantics.

This is a real business pain point. Roughly 30–50% of the information in real-world documents exists in non-plain-text form.

Three Approaches

Approach 1: Extract and Textualize

The most direct and most mature approach: convert images and tables into text descriptions, then run standard text RAG.

Images: use a Vision Language Model (VLM) to generate descriptions

from openai import OpenAI
import base64

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Describe this image in detail, including all numbers, labels, trends, and key information. If this is a chart, list all data points."}
            ]
        }]
    )
    return response.choices[0].message.content

Tables: use pdfplumber to preserve structure, convert to Markdown

import pdfplumber

def extract_tables_as_markdown(pdf_path: str) -> list[str]:
    tables_md = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if not table:
                    continue
                header = table[0]
                rows = table[1:]
                md = "| " + " | ".join(str(h or "") for h in header) + " |\n"
                md += "| " + " | ".join("---" for _ in header) + " |\n"
                for row in rows:
                    md += "| " + " | ".join(str(c or "") for c in row) + " |\n"
                tables_md.append(f"[Page {page_num+1} table]\n{md}")
    return tables_md

Integrate into the RAG pipeline:

from langchain_core.documents import Document

def process_document(pdf_path: str) -> list[Document]:
    docs = []

    # 1. Extract plain text
    text_chunks = extract_text_chunks(pdf_path)
    docs.extend([Document(page_content=t, metadata={"type": "text", "source": pdf_path}) for t in text_chunks])

    # 2. Extract images → VLM descriptions
    images = extract_images_from_pdf(pdf_path)
    for img_path, page_num in images:
        description = describe_image(img_path)
        docs.append(Document(
            page_content=description,
            metadata={"type": "image", "source": pdf_path, "page": page_num, "image_path": img_path}
        ))

    # 3. Extract tables → Markdown
    tables = extract_tables_as_markdown(pdf_path)
    for table_md in tables:
        docs.append(Document(page_content=table_md, metadata={"type": "table", "source": pdf_path}))

    return docs
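
`process_document` relies on two helpers that are not shown here, `extract_text_chunks` and `extract_images_from_pdf`. Below is a minimal sketch of the first one, assuming pdfplumber for text extraction and a simple fixed-size character splitter with overlap. The `chunk_text` helper and the default sizes are illustrative assumptions, not the series' actual implementation.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    text = text.strip()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


def extract_text_chunks(pdf_path: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Hypothetical helper: pull page text with pdfplumber, then chunk it page by page."""
    import pdfplumber  # imported lazily so chunk_text works without pdfplumber installed

    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text() or ""  # image-only pages may yield no text
            chunks.extend(chunk_text(page_text, chunk_size, overlap))
    return chunks
```

Chunking per page keeps each chunk attributable to a single page, at the cost of splitting paragraphs that run across a page break.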

Strengths: Compatible with all existing text RAG infrastructure; no changes to the vector store.

Limitations: VLM captioning adds cost and latency; description quality directly affects retrieval quality; OCR is sensitive to scan quality.
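
Since captioning is the expensive step, one common mitigation is to cache descriptions by image content so that re-indexing an unchanged document does not call the VLM again. A sketch under stated assumptions: `cached_describe` and the JSON cache file are hypothetical, and `describe_fn` stands for a captioning function such as `describe_image` above.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_describe(
    image_path: str,
    describe_fn: Callable[[str], str],
    cache_file: str = "caption_cache.json",
) -> str:
    """Return a cached caption if these image bytes were seen before; otherwise call describe_fn."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}

    # Key on the image content, not the path, so renamed or re-extracted files still hit the cache.
    key = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    if key not in cache:
        cache[key] = describe_fn(image_path)  # the only place the VLM is called
        cache_path.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")
    return cache[key]
```

A JSON file is enough for a prototype; a larger corpus would likely want a key-value store instead.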

Approach 2: CLIP Multimodal Embeddings

Principle: CLIP (Contrastive Language–Image Pre-training, OpenAI 2021) projects both text and images into the same vector space. The embedding of the phrase "revenue trend chart" will be close to the embedding of an actual revenue trend chart image.

from langchain_experimental.open_clip import OpenCLIPEmbeddings

clip_embeddings = OpenCLIPEmbeddings(
    model_name="ViT-H-14",
    checkpoint="laion2b_s32b_b79k"
)

# Embed text
text_embedding = clip_embeddings.embed_query("Q3 revenue trend")

# Embed image
image_embedding = clip_embeddings.embed_image(["path/to/chart.png"])

# Both are in the same vector space — similarity is meaningful
from numpy import dot
from numpy.linalg import norm
similarity = dot(text_embedding, image_embedding[0]) / (norm(text_embedding) * norm(image_embedding[0]))
print(f"Similarity: {similarity:.3f}")  # often around 0.3 for related pairs; rank candidates rather than applying a fixed cutoff

Build a mixed text+image vector store:

import uuid

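# Images stored with their CLIP embeddings.
# image_paths and image_vectorstore are assumed to exist already.
# Caveat: not every LangChain vector store accepts precomputed vectors through
# add_texts(embeddings=...); some ignore the argument and re-embed the text.
# FAISS, for example, takes (text, vector) pairs via add_embeddings() instead.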
for img_path in image_paths:
    img_embedding = clip_embeddings.embed_image([img_path])[0]
    doc_id = str(uuid.uuid4())
    image_vectorstore.add_texts(
        texts=["[IMAGE]"],
        embeddings=[img_embedding],
        metadatas=[{"type": "image", "path": img_path}],
        ids=[doc_id]
    )

Dual-path retrieval at query time:

def multimodal_search(query: str, k: int = 5):
    # Text retrieval
    text_results = text_vectorstore.similarity_search(query, k=k)

    # Image retrieval (via CLIP's text encoder)
    query_embedding = clip_embeddings.embed_query(query)
    image_results = image_vectorstore.similarity_search_by_vector(query_embedding, k=k)

    return text_results + image_results
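
Concatenating the two lists leaves the caller with results from different embedding spaces, so their similarity scores are not directly comparable. One common way to merge them is reciprocal rank fusion (RRF), which uses only each item's rank. The sketch below works on lists of document IDs rather than LangChain objects. `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is the conventional default rather than a value from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge ranked ID lists; each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]
```

Items that both retrievers return rise to the top, which is usually the desired behavior when a chart and its surrounding text both match the query.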

Strengths: Images don't need pre-captioning; retrieval operates on visual content directly.

Limitations: CLIP performs well on natural photographs but poorly on professional charts and graphs — those require understanding numerical relationships, not just visual recognition.

Approach 3: ColPali (The 2024 Breakthrough)

Background: Traditional document RAG follows this pipeline:

PDF → extract text/images → textualize → embed → retrieve

Every step loses information or introduces noise. ColPali (Faysse et al., 2024), built on Google's PaliGemma model, took a different approach:

PDF → screenshot each page → vision language model → patch-level embeddings for each page → retrieve

Process each PDF page directly as an image. Bypass text extraction entirely.

Key components:

Backbone: PaliGemma 3B (Google's vision language model)
Late Interaction (from ColBERT): each page image is split into a 32×32 grid of 1,024 patches, and each patch gets its own embedding (about 1,030 vectors per page once the prompt tokens are counted); the query is embedded token by token; scoring takes, for each query token, its best-matching patch and sums those maxima
The result: ColPali can pinpoint which part of a page answers a question
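
The late-interaction score described above can be sketched in plain Python. For every query token, take the best-matching patch, then sum those maxima (the "MaxSim" operator from ColBERT). Toy 2-dimensional vectors stand in for real embeddings, and the function names are illustrative, not part of any library.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[list[float]], page_patches: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the max dot product with any page patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def best_patch_per_token(query_tokens: list[list[float]], page_patches: list[list[float]]) -> list[int]:
    """Index of the best-matching patch for each query token; this is what enables localization."""
    return [
        max(range(len(page_patches)), key=lambda i: dot(q, page_patches[i]))
        for q in query_tokens
    ]


query = [[1.0, 0.0], [0.0, 1.0]]                 # two query tokens
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # has a strong match for each token
page_b = [[0.5, 0.5]]                            # only a weak match for both tokens

print(maxsim_score(query, page_a))          # 2.0
print(maxsim_score(query, page_b))          # 1.0
print(best_patch_per_token(query, page_a))  # [0, 1]
```

Because each query token keeps its own best patch, the per-token indices show which regions of the page drove the match.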

# Using the byaldi library (Python interface for ColPali)
from byaldi import RAGMultiModalModel

# Load ColPali
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

# Index a PDF directory (screenshots each page, generates patch embeddings)
RAG.index(
    input_path="./financial_reports/",
    index_name="reports_index",
    store_collection_with_index=True,  # save original images for answer generation
    overwrite=True,
)

# Retrieve (returns the most relevant pages)
results = RAG.search("Q3 revenue quarter-over-quarter growth", k=3)

for r in results:
    print(f"File: {r['doc_id']}, Page: {r['page_num']}, Score: {r['score']:.3f}")

Generate an answer from the retrieved page image:

import base64
from openai import OpenAI

def answer_with_page_image(question: str, page_image_path: str) -> str:
    with open(page_image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": f"Based on this page, answer: {question}"}
            ]
        }]
    ).choices[0].message.content

The full ColPali flow:

User question → ColPali retrieves most relevant pages → extract page images → send to VLM → generate answer
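
The "send to VLM" step does not have to go through image files on disk. A small sketch that builds one chat message from several base64-encoded page images follows. `build_page_messages` is a hypothetical helper, and the PNG MIME type is an assumption about how the pages were encoded.

```python
def build_page_messages(question: str, pages_b64: list[str]) -> list[dict]:
    """Build a single user message holding every retrieved page image, followed by the question."""
    content = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        for b64 in pages_b64
    ]
    content.append({"type": "text", "text": f"Based on these pages, answer: {question}"})
    return [{"role": "user", "content": content}]
```

With byaldi, and assuming the index was built with `store_collection_with_index=True`, each search result is expected to expose the page image as a base64 string (`r.base64`). In that case `build_page_messages(question, [r.base64 for r in results])` could be passed as `messages=` to the chat completion call shown earlier. Check the byaldi version you use before relying on that attribute.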

Strengths:

Handles charts, formulas, and mixed layouts natively — no OCR required
Page-level understanding preserves visual layout
Reported to outperform text-extraction pipelines on visually dense documents (research papers, financial reports) in the ColPali paper's ViDoRe benchmark

Limitations:

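Heavy model (PaliGemma 3B); retrieval is slower than a single-vector lookup, since every query token is scored against every page patch
In practice needs a GPU (the usual setup is NVIDIA/CUDA); CPU-only deployments are too slow for anything beyond small experiments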
Long index-build time (each page requires a forward pass)

Dedicated Table Handling

Tables are different from images — they have structured semantics and deserve specialized treatment.

Method 1: Preserve Markdown structure

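def table_to_markdown(table: list[list]) -> str:
    if not table or not table[0]:
        return ""

    def cell(value, default: str = "") -> str:
        # pdfplumber returns None for empty cells; newlines and pipes would break a Markdown row
        text = default if value in (None, "") else str(value)
        return text.replace("\n", " ").replace("|", "\\|")

    header = table[0]  # assumes the first row is the header
    md = "| " + " | ".join(cell(h, "-") for h in header) + " |\n"
    md += "| " + " | ".join(":---:" for _ in header) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(cell(c) for c in row) + " |\n"
    return md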

Good LLMs can reason across rows and columns in Markdown format.

Method 2: Summary for retrieval + full table for generation

def index_table(table_md: str, table_metadata: dict) -> None:
    # Use LLM to generate a retrieval-friendly summary
    summary = llm.invoke(
        f"Summarize the key information in this table in one sentence (under 50 words):\n{table_md}"
    )

    # Store summary as the retrieval unit, full table in metadata
    vectorstore.add_texts(
        [summary.content],
        metadatas=[{**table_metadata, "full_table": table_md}]
    )

Retrieve by summary; send the full table Markdown to the LLM for answer generation.
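
The query side of this pattern can be sketched as follows: after retrieval, swap each summary for the full table stored in its metadata before building the prompt. `Hit` is a stand-in for a retrieved document (it mirrors the `page_content` and `metadata` fields of a LangChain `Document`), and `build_context` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Stand-in for a retrieved document with the same fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(hits: list[Hit]) -> str:
    """Replace each table summary with the full Markdown table kept in its metadata."""
    parts = []
    for hit in hits:
        full_table = hit.metadata.get("full_table")
        # Summaries exist only for retrieval; the LLM should see the complete table.
        parts.append(full_table if full_table else hit.page_content)
    return "\n\n".join(parts)
```

Very large tables can exceed what a vector store allows in metadata. In that case, storing a table ID in metadata and fetching the table from a separate document store is a common alternative.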

Method 3: Structured extraction → natural language

For high-value tables (financials, product specs), extract as structured data then convert to natural language:

# Table → JSON
table_json = {
    "columns": ["Quarter", "Revenue ($B)", "QoQ Growth"],
    "rows": [
        {"Quarter": "Q1", "Revenue ($B)": 12.3, "QoQ Growth": "+5.2%"},
        {"Quarter": "Q2", "Revenue ($B)": 14.1, "QoQ Growth": "+14.6%"},
        {"Quarter": "Q3", "Revenue ($B)": 13.8, "QoQ Growth": "-2.1%"},
    ]
}

# JSON → natural language (better for semantic retrieval)
nl_description = (
    "Quarterly revenue data: Q1 $12.3B, Q2 $14.1B (up 14.6% QoQ), "
    "Q3 $13.8B (down 2.1% QoQ)."
)

Natural language is more retrieval-friendly and can be directly quoted in the LLM's answer.
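
The description above is written by hand. For many tables, a template-based conversion is a reasonable starting point before reaching for an LLM. The sketch below assumes the same `{"columns": [...], "rows": [...]}` shape as `table_json` and treats the first column as the row label. `rows_to_sentences` is a hypothetical helper.

```python
def rows_to_sentences(table: dict) -> str:
    """Verbalize a table one sentence per row, using the first column as the row label."""
    key_col, *value_cols = table["columns"]
    sentences = []
    for row in table["rows"]:
        facts = " and ".join(f"{col} is {row[col]}" for col in value_cols)
        sentences.append(f"For {key_col} {row[key_col]}, {facts}.")
    return " ".join(sentences)


table = {
    "columns": ["Quarter", "Revenue ($B)", "QoQ Growth"],
    "rows": [{"Quarter": "Q1", "Revenue ($B)": 12.3, "QoQ Growth": "+5.2%"}],
}
print(rows_to_sentences(table))
# For Quarter Q1, Revenue ($B) is 12.3 and QoQ Growth is +5.2%.
```

The output is stiffer than a hand-written sentence, but it is deterministic and costs nothing per table, which matters when indexing thousands of tables.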

Which Approach to Choose

                        Extract + Textualize            CLIP Multimodal                 ColPali
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Document types          All                             Image-heavy                     Visually dense (reports, academic PDFs)
Infrastructure          Standard text RAG               Requires CLIP                   Requires GPU, heavy model
Chart understanding     Depends on VLM caption quality  Weak (charts ≠ natural photos)  Strong (page-level understanding)
Update cost             Low                             Medium                          High (re-indexing is expensive)
Engineering complexity  Low                             Medium                          High
Cost                    VLM captioning fees             Low                             Model inference cost

Practical recommendations for most scenarios:

Scenario                              Recommended approach
──────────────────────────────────────────────────────────────
Standard enterprise docs (few images)  Text RAG, OCR or ignore images
Product docs (architecture diagrams)   Extract + VLM caption (e.g. GPT-4o)
Financial/research reports (charts)    ColPali
E-commerce image search                CLIP
Quick knowledge base prototype         Extract + textualize (simplest)

A Complete Multimodal RAG Pipeline

Combining the approaches into a unified pipeline:

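from enum import Enum

from langchain_chroma import Chroma  # or: from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document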

class DocElement(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"

class MultimodalRAGPipeline:
    def __init__(self, text_embeddings, clip_embeddings, llm):
        self.text_emb = text_embeddings
        self.clip_emb = clip_embeddings  # not used below: images are indexed through their captions (Approach 1)
        self.llm = llm
        self.vectorstore = Chroma(embedding_function=text_embeddings)

    def index(self, pdf_path: str) -> None:
        elements = extract_all_elements(pdf_path)  # text / images / tables
        docs = []
        for elem in elements:
            if elem.type == DocElement.TEXT:
                docs.append(Document(page_content=elem.content, metadata={"type": "text"}))
            elif elem.type == DocElement.IMAGE:
                caption = self._generate_caption(elem.image_path)
                docs.append(Document(
                    page_content=caption,
                    metadata={"type": "image", "path": elem.image_path}
                ))
            elif elem.type == DocElement.TABLE:
                docs.append(Document(
                    page_content=table_to_markdown(elem.content),
                    metadata={"type": "table"}
                ))
        self.vectorstore.add_documents(docs)

    def _generate_caption(self, image_path: str) -> str:
        return describe_image(image_path)  # the GPT-4o captioning helper from Approach 1

    def query(self, question: str) -> dict:
        results = self.vectorstore.similarity_search(question, k=5)
        context_parts = []
        images_to_show = []
        for r in results:
            if r.metadata["type"] == "image":
                context_parts.append(f"[Image description] {r.page_content}")
                images_to_show.append(r.metadata["path"])
            else:
                context_parts.append(r.page_content)

        context = "\n---\n".join(context_parts)  # separate retrieved chunks with a rule on its own line
        answer = self.llm.invoke(
            f"Answer based on the following:\n\n{context}\n\nQuestion: {question}"
        )
        return {"answer": answer.content, "images": images_to_show}
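
The pipeline assumes that `extract_all_elements` yields objects with `type`, `content`, and `image_path` attributes, but that record type is not shown. One possible shape is sketched below. The `Element` dataclass, its `page` field, and `count_by_type` are hypothetical, and `DocElement` is repeated only so the sketch runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DocElement(Enum):  # repeated from the pipeline above so this sketch is self-contained
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"


@dataclass
class Element:
    """Hypothetical record type returned by extract_all_elements."""
    type: DocElement
    content: object = None             # str for text, list[list] of cells for tables
    image_path: Optional[str] = None   # set only for image elements
    page: Optional[int] = None         # assumed extra field, useful for citations


def count_by_type(elements: list[Element]) -> dict[str, int]:
    """Sanity check to run before indexing: how many elements of each kind were extracted."""
    counts = {kind.value: 0 for kind in DocElement}
    for elem in elements:
        counts[elem.type.value] += 1
    return counts
```

A quick count before indexing catches the common failure where a parser silently returns zero images or tables for a document that clearly contains them.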

Summary

Multimodal RAG is fundamentally about converting non-text information into a retrievable form, then returning the original content to the LLM at answer-generation time. Three approaches:

Extract and textualize: most mature, engineering-simple, but dependent on OCR/VLM quality — suitable for most scenarios
CLIP multimodal embeddings: unified vector space for text and images; good for natural photograph retrieval; limited on professional charts
ColPali: direct visual page processing; best results for chart-heavy documents; requires GPU and higher engineering investment

Tables are often simpler than images: preserve Markdown structure + generate a retrieval summary, and standard text RAG handles them well.

Next (and final) in this series: Code RAG — helping AI understand your codebase, including AST-based splitting, code embedding models, and representing call graphs with knowledge graphs.

DEV Community