丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Multi-Modal RAG: Images, Tables, Documents — Chunking and Retrieval

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Introduction

Real-world documents contain more than text: images, charts, tables, and diagrams carry critical information that text-only RAG systems cannot access. Multi-modal RAG extends retrieval to include visual content, enabling questions like "What does the Q3 revenue chart show?" or "What values are in the configuration table?" This article covers the architectures and techniques for building multi-modal RAG.

Strategies for Multi-Modal RAG

There are three main approaches to handling non-text content:

Strategy 1: Convert everything to text (simplest)
Strategy 2: Embed images alongside text (moderate)
Strategy 3: Multi-modal retrieval with specialized models (most powerful)

Strategy 1: Text Conversion

Convert images and tables to text using vision models or OCR:

from openai import OpenAI
import base64

client = OpenAI()

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail, including all text, data points, and visual elements."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content

def convert_table_to_text(table_data: list[list[str]]) -> str:
    """Convert a parsed table to searchable text."""
    headers = table_data[0]
    rows = table_data[1:]
    text_parts = []
    for row in rows:
        row_desc = ", ".join(f"{headers[i]}: {cell}" for i, cell in enumerate(row))
        text_parts.append(row_desc)
    return "\n".join(text_parts)
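
Once images and tables have a text representation, they can be indexed like any other chunk. The sketch below is one way to wire the two helpers above into a preprocessing step; the element dict shape and its keys ("type", "path", "data", and so on) are illustrative assumptions rather than a fixed schema:

def element_to_chunk(element: dict) -> dict:
    """Turn one parsed document element into an indexable text chunk."""
    if element["type"] == "image":
        text = describe_image(element["path"])         # vision-model caption
    elif element["type"] == "table":
        text = convert_table_to_text(element["data"])  # row-by-row text
    else:
        text = element["content"]                      # plain text passes through
    # Keep a pointer back to the original element so answers can cite it
    return {"text": text, "metadata": {"type": element["type"], "source": element.get("source")}}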

Strategy 2: Multi-Vector Retriever

Store both text representations and visual embeddings:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema.document import Document

# Store text summaries alongside raw elements
vectorstore = Chroma(
    collection_name="multi_modal_docs",
    embedding_function=OpenAIEmbeddings(),
)
store = InMemoryStore()

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key="doc_id",
)

# For each document element (text, image, table):
# 1. Generate a text summary
# 2. Store the summary in the vector store
# 3. Store the original element in the doc store
# 4. Link them with a shared doc_id
doc_id = "doc_001_image_03"
summary = "Revenue chart showing Q1-Q4 2025: Q1=$1.2M, Q2=$1.5M, Q3=$1.8M, Q4=$2.1M"
original = Document(
    page_content="[IMAGE: revenue_chart_2025.png]",
    metadata={"type": "image", "path": "revenue_chart_2025.png", "doc_id": doc_id},
)

retriever.vectorstore.add_documents([Document(page_content=summary, metadata={"doc_id": doc_id})])
retriever.docstore.mset([(doc_id, original)])
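
At query time, the similarity search runs over the text summaries, but the retriever returns the linked originals from the doc store, so the answer step sees the actual image or table reference. A minimal usage sketch (the question string is just an example):

# The query matches the stored summary; the retriever swaps in the original element
results = retriever.get_relevant_documents("What was revenue in Q3 2025?")
for doc in results:
    print(doc.metadata.get("type"), "->", doc.page_content)
# With the single example indexed above, this should surface [IMAGE: revenue_chart_2025.png]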

Strategy 3: Multi-Modal Embeddings

Use embedding models that handle both text and images in a shared space:

from sentence_transformers import SentenceTransformer
import torch
from PIL import Image

class MultiModalEmbedder:
    def __init__(self, model_name="clip-ViT-B-32"):
        self.model = SentenceTransformer(model_name)

    def embed_text(self, text: str) -> list[float]:
        return self.model.encode(text).tolist()

    def embed_image(self, image_path: str) -> list[float]:
        image = Image.open(image_path)
        return self.model.encode(image).tolist()

    def search_by_text(self, query: str, image_embeddings: list, top_k: int = 5):
        query_emb = self.embed_text(query)
        scores = torch.cosine_similarity(
            torch.tensor(query_emb).unsqueeze(0),
            torch.tensor(image_embeddings),
        )
        top_indices = scores.topk(top_k).indices.tolist()
        return top_indices, scores[top_indices].tolist()
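
A short usage sketch for text-to-image search with the embedder (the image file names are placeholders):

embedder = MultiModalEmbedder()

# Pre-compute embeddings for the document's images
image_paths = ["revenue_chart_2025.png", "architecture_diagram.png", "org_chart.png"]
image_embeddings = [embedder.embed_image(p) for p in image_paths]

# A text query retrieves the closest images in the shared CLIP embedding space
indices, scores = embedder.search_by_text("quarterly revenue chart", image_embeddings, top_k=2)
for i, s in zip(indices, scores):
    print(f"{image_paths[i]}: {s:.3f}")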

Chunking Strategies for Multi-Modal Data

Each content type needs a different chunking approach. The sketch below extracts per-page text chunks and image references with PyMuPDF; table extraction (for example via Page.find_tables()) and finer-grained text splitting would follow the same pattern:

class MultiModalChunker:
    def chunk_pdf(self, pdf_path: str) -> list[dict]:
        """Extract and chunk text, images, and tables from a PDF."""
        import fitz  # PyMuPDF

        doc = fitz.open(pdf_path)
        chunks = []
        for page_num, page in enumerate(doc):
            # Each page's text becomes one text chunk (split further downstream if needed)
            text = page.get_text()
            if text.strip():
                chunks.append({"type": "text", "page": page_num, "content": text})
            # Embedded images are recorded by reference (xref) for later captioning/embedding
            for img in page.get_images(full=True):
                chunks.append({"type": "image", "page": page_num, "xref": img[0]})
        return chunks
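
A brief usage sketch, assuming a local report.pdf (placeholder path); extracted chunks can then be routed to the Strategy 1 converters before indexing:

from collections import Counter

chunker = MultiModalChunker()
chunks = chunker.chunk_pdf("report.pdf")  # placeholder path

# Inventory of what was extracted, e.g. Counter({'text': 12, 'image': 5})
print(Counter(chunk["type"] for chunk in chunks))

# Text chunks can be embedded directly; image chunks would be rendered to files
# and captioned (Strategy 1) before indexing
text_chunks = [c for c in chunks if c["type"] == "text"]
image_chunks = [c for c in chunks if c["type"] == "image"]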

Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.
