Multi-Modal RAG: Images, Tables, Documents — Chunking and Retrieval
Introduction
Real-world documents contain more than text: images, charts, tables, and diagrams carry critical information that text-only RAG systems cannot access. Multi-modal RAG extends retrieval to include visual content, enabling questions like "What does the Q3 revenue chart show?" or "What values are in the configuration table?" This article covers the architectures and techniques for building multi-modal RAG.
Strategies for Multi-Modal RAG
There are three main approaches to handling non-text content:
Strategy 1: Convert everything to text (simplest)
Strategy 2: Embed images alongside text (moderate)
Strategy 3: Multi-modal retrieval with specialized models (most powerful)
Strategy 1: Text Conversion
Convert images and tables to text using vision models or OCR:
from openai import OpenAI
import base64

client = OpenAI()

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail, including all text, data points, and visual elements."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content

def convert_table_to_text(table_data: list[list[str]]) -> str:
    """Convert a parsed table to searchable text."""
    headers = table_data[0]
    rows = table_data[1:]
    text_parts = []
    for row in rows:
        row_desc = ", ".join(f"{headers[i]}: {cell}" for i, cell in enumerate(row))
        text_parts.append(row_desc)
    return "\n".join(text_parts)
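To see what the flattening produces, here is a self-contained run of the table helper (the function is repeated so the snippet stands alone; the quarterly figures are illustrative sample data):

```python
def convert_table_to_text(table_data: list[list[str]]) -> str:
    """Convert a parsed table to searchable text (same logic as above)."""
    headers = table_data[0]
    text_parts = []
    for row in table_data[1:]:
        text_parts.append(", ".join(f"{headers[i]}: {cell}" for i, cell in enumerate(row)))
    return "\n".join(text_parts)

# Illustrative sample: a revenue table parsed out of a PDF
table = [
    ["Quarter", "Revenue"],
    ["Q1", "$1.2M"],
    ["Q2", "$1.5M"],
]
flattened = convert_table_to_text(table)
print(flattened)
# Quarter: Q1, Revenue: $1.2M
# Quarter: Q2, Revenue: $1.5M
```

Each row becomes a sentence-like line that pairs headers with values, so a query such as "Q2 revenue" matches on keywords even after the table loses its grid layout.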
Strategy 2: Multi-Vector Retriever
Store both text representations and visual embeddings:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema.document import Document

# Store text summaries alongside raw elements
vectorstore = Chroma(
    collection_name="multi_modal_docs",
    embedding_function=OpenAIEmbeddings(),
)
store = InMemoryStore()
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key="doc_id",
)

# For each document element (text, image, table):
# 1. Generate a text summary
# 2. Store the summary in the vector store
# 3. Store the original element in the doc store
# 4. Link them with a shared doc_id
doc_id = "doc_001_image_03"
summary = "Revenue chart showing Q1-Q4 2025: Q1=$1.2M, Q2=$1.5M, Q3=$1.8M, Q4=$2.1M"
original = Document(
    page_content="[IMAGE: revenue_chart_2025.png]",
    metadata={"type": "image", "path": "revenue_chart_2025.png", "doc_id": doc_id},
)
retriever.vectorstore.add_documents([Document(page_content=summary, metadata={"doc_id": doc_id})])
retriever.docstore.mset([(doc_id, original)])
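The core idea — search over summaries, return originals — works independently of LangChain. Here is a minimal stand-in that uses plain dicts for both stores and a toy word-overlap score in place of embedding similarity (the doc IDs and summaries are illustrative):

```python
# Minimal sketch of the multi-vector pattern: summaries are searched,
# but the linked ORIGINAL element is what gets returned for generation.
summary_index = {}   # doc_id -> text summary (stands in for the vector store)
doc_store = {}       # doc_id -> original element (image ref, raw table, ...)

def add_element(doc_id: str, summary: str, original: dict) -> None:
    summary_index[doc_id] = summary
    doc_store[doc_id] = original

def retrieve(query: str) -> dict:
    # Toy relevance score: word overlap between query and summary.
    # A real system would use embedding similarity here.
    def score(summary: str) -> int:
        return len(set(query.lower().split()) & set(summary.lower().split()))
    best_id = max(summary_index, key=lambda d: score(summary_index[d]))
    return doc_store[best_id]  # the original element, not the summary

add_element(
    "doc_001_image_03",
    "Revenue chart showing Q1-Q4 2025 quarterly figures",
    {"type": "image", "path": "revenue_chart_2025.png"},
)
add_element(
    "doc_001_table_01",
    "Configuration table listing timeout and retry settings",
    {"type": "table", "path": "config_table.csv"},
)

result = retrieve("what does the revenue chart show")
# result -> {"type": "image", "path": "revenue_chart_2025.png"}
```

The separation matters because the summary is optimized for retrieval while the original element (the actual image or table) is what the generation step needs.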
Strategy 3: Multi-Modal Embeddings
Use embedding models that handle both text and images in a shared space:
from sentence_transformers import SentenceTransformer
import torch
from PIL import Image

class MultiModalEmbedder:
    def __init__(self, model_name="clip-ViT-B-32"):
        self.model = SentenceTransformer(model_name)

    def embed_text(self, text: str) -> list[float]:
        return self.model.encode(text).tolist()

    def embed_image(self, image_path: str) -> list[float]:
        image = Image.open(image_path)
        return self.model.encode(image).tolist()

    def search_by_text(self, query: str, image_embeddings: list, top_k: int = 5):
        query_emb = self.embed_text(query)
        scores = torch.cosine_similarity(
            torch.tensor(query_emb).unsqueeze(0),
            torch.tensor(image_embeddings),
        )
        # Guard against asking for more results than there are images
        top_indices = scores.topk(min(top_k, len(scores))).indices.tolist()
        return top_indices, scores[top_indices].tolist()
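The ranking step itself needs nothing model-specific. This dependency-free sketch uses hand-made 3-d vectors as stand-ins for CLIP embeddings (the file names and vector values are invented for illustration), showing how a text query ranks images in a shared space:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for CLIP image embeddings (real ones are 512-d)
image_embeddings = {
    "revenue_chart.png": [0.9, 0.1, 0.0],
    "org_diagram.png":   [0.1, 0.9, 0.0],
    "photo_of_cat.png":  [0.0, 0.1, 0.9],
}

def search_by_text(query_emb: list[float], top_k: int = 2) -> list[str]:
    ranked = sorted(
        image_embeddings.items(),
        key=lambda kv: cosine(query_emb, kv[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Pretend this is the text embedding of "quarterly revenue chart"
query_emb = [0.85, 0.2, 0.05]
top = search_by_text(query_emb)
# top -> ["revenue_chart.png", "org_diagram.png"]
```

Because CLIP places text and images in the same vector space, no caption or OCR pass is needed at query time — the text query is embedded once and compared directly against precomputed image vectors.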
Chunking Strategies for Multi-Modal Data
Each content type needs a different chunking approach:
class MultiModalChunker:
    def chunk_pdf(self, pdf_path: str) -> list[dict]:
        """Extract and chunk text, images, and tables from a PDF."""
        import fitz  # PyMuPDF
        doc = fitz.open(pdf_path)
        chunks = []
        for page_num, page in enumerate(doc):
            # Page text becomes one chunk; split further downstream if needed
            chunks.append({"type": "text", "page": page_num, "content": page.get_text()})
            # Record image references for later captioning or embedding
            for img in page.get_images(full=True):
                chunks.append({"type": "image", "page": page_num, "xref": img[0]})
            # Tables can be extracted via page.find_tables() in newer PyMuPDF
        return chunks
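Once extraction yields typed chunks, each type is routed to its own processing path before indexing. A sketch of that dispatch, with stub handlers standing in for the vision-model and table helpers shown earlier (the sample chunks are invented):

```python
# Route each extracted chunk to the right processing path before indexing.
# These handlers are stubs; in practice the image handler would call a
# vision model and the table handler would flatten rows to text.
def process_text(chunk: dict) -> str:
    return chunk["content"]

def process_image(chunk: dict) -> str:
    return f"[image summary for {chunk['path']}]"

def process_table(chunk: dict) -> str:
    return f"[flattened table with {len(chunk['rows'])} rows]"

HANDLERS = {"text": process_text, "image": process_image, "table": process_table}

def prepare_for_indexing(chunks: list[dict]) -> list[str]:
    return [HANDLERS[c["type"]](c) for c in chunks]

chunks = [
    {"type": "text", "page": 0, "content": "Quarterly results overview."},
    {"type": "image", "page": 0, "path": "revenue_chart.png"},
    {"type": "table", "page": 1, "rows": [["Q1", "$1.2M"], ["Q2", "$1.5M"]]},
]
texts = prepare_for_indexing(chunks)
```

Keeping the dispatch table explicit makes it easy to swap in a different strategy per content type — e.g. sending images through CLIP embeddings instead of caption generation — without touching the extraction code.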