What Text RAG Can't See
Upload an annual report PDF. It contains revenue trend charts, product comparison tables, architecture diagrams. What does traditional RAG do?
- A PDF parser extracts text
- Text is chunked, embedded, stored in the vector store
- User asks: "What was the revenue growth in Q3?"
The problem: the revenue chart is an image. The PDF parser extracts its alt text (usually empty) or filename. The numbers are in the image, not the text. RAG will never find them.
Tables are slightly better, but still problematic: parsers often flatten tables into lines of text, destroying the row/column structure and garbling the semantics.
This is a real business pain point: by rough estimate, 30–50% of the information in real-world documents lives in charts, tables, and other non-plain-text forms.
Three Approaches
Approach 1: Extract and Textualize
The most direct and most mature approach: convert images and tables into text descriptions, then run standard text RAG.
Images: use a Vision Language Model (VLM) to generate descriptions
from openai import OpenAI
import base64

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Describe this image in detail, including all numbers, labels, trends, and key information. If this is a chart, list all data points."}
            ]
        }]
    )
    return response.choices[0].message.content
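The data URL above hardcodes `image/png`, which is only right when every extracted image is a PNG. A minimal sketch, assuming images on disk may also be JPEG or other formats, that guesses the MIME type from the file extension using only the standard library. `image_to_data_url` is a hypothetical helper name introduced here, not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str, default_mime: str = "image/png") -> str:
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back when the extension is unknown
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string could be passed as the `url` value in the `image_url` message part in place of the hand-built f-string.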
Tables: use pdfplumber to preserve structure, convert to Markdown
import pdfplumber

def extract_tables_as_markdown(pdf_path: str) -> list[str]:
    tables_md = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if not table:
                    continue
                header = table[0]
                rows = table[1:]
                md = "| " + " | ".join(str(h or "") for h in header) + " |\n"
                md += "| " + " | ".join("---" for _ in header) + " |\n"
                for row in rows:
                    md += "| " + " | ".join(str(c or "") for c in row) + " |\n"
                tables_md.append(f"[Page {page_num+1} table]\n{md}")
    return tables_md
Integrate into the RAG pipeline:
from langchain_core.documents import Document

# extract_text_chunks and extract_images_from_pdf are helpers not shown in this post
def process_document(pdf_path: str) -> list[Document]:
    docs = []
    # 1. Extract plain text
    text_chunks = extract_text_chunks(pdf_path)
    docs.extend([Document(page_content=t, metadata={"type": "text", "source": pdf_path}) for t in text_chunks])
    # 2. Extract images → VLM descriptions
    images = extract_images_from_pdf(pdf_path)
    for img_path, page_num in images:
        description = describe_image(img_path)
        docs.append(Document(
            page_content=description,
            metadata={"type": "image", "source": pdf_path, "page": page_num, "image_path": img_path}
        ))
    # 3. Extract tables → Markdown
    tables = extract_tables_as_markdown(pdf_path)
    for table_md in tables:
        docs.append(Document(page_content=table_md, metadata={"type": "table", "source": pdf_path}))
    return docs
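`process_document` relies on an `extract_text_chunks` helper that the post does not define. As a rough sketch of the chunking half of such a helper, assuming the page text has already been extracted into one string, fixed-size character windows with overlap might look like this. The name `chunk_text` and the default sizes are illustrative assumptions, not values from the original pipeline.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A production splitter would usually break on sentence or paragraph boundaries instead of raw character offsets; the overlap is there so a fact that straddles a boundary still appears whole in at least one chunk.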
Strengths: Compatible with all existing text RAG infrastructure; no changes to the vector store.
Limitations: VLM captioning adds cost and latency; description quality directly affects retrieval quality; OCR is sensitive to scan quality.
Approach 2: CLIP Multimodal Embeddings
Principle: CLIP (Contrastive Language–Image Pre-training, OpenAI 2021) projects both text and images into the same vector space. The embedding of the phrase "revenue trend chart" will be close to the embedding of an actual revenue trend chart image.
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from numpy import dot
from numpy.linalg import norm

clip_embeddings = OpenCLIPEmbeddings(
    model_name="ViT-H-14",
    checkpoint="laion2b_s32b_b79k"
)
# Embed text
text_embedding = clip_embeddings.embed_query("Q3 revenue trend")
# Embed image (embed_image takes a list of paths and returns one vector per image)
image_embedding = clip_embeddings.embed_image(["path/to/chart.png"])
# Both are in the same vector space, so cosine similarity is meaningful
similarity = dot(text_embedding, image_embedding[0]) / (norm(text_embedding) * norm(image_embedding[0]))
# Related pairs score higher than unrelated ones; absolute values are often
# modest (around 0.3) and vary by checkpoint, so compare scores, don't threshold blindly
print(f"Similarity: {similarity:.3f}")
Build a mixed text+image vector store:
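import uuid

# Images stored with their CLIP embeddings.
# image_paths and image_vectorstore are assumed to exist already. Whether
# add_texts accepts precomputed `embeddings` depends on the vector store class;
# some stores expose a separate method for this (e.g. FAISS.add_embeddings).
for img_path in image_paths:
    img_embedding = clip_embeddings.embed_image([img_path])[0]
    doc_id = str(uuid.uuid4())
    image_vectorstore.add_texts(
        texts=["[IMAGE]"],
        embeddings=[img_embedding],
        metadatas=[{"type": "image", "path": img_path}],
        ids=[doc_id]
    )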
Dual-path retrieval at query time:
def multimodal_search(query: str, k: int = 5):
    # Text retrieval
    text_results = text_vectorstore.similarity_search(query, k=k)
    # Image retrieval (via CLIP's text encoder)
    query_embedding = clip_embeddings.embed_query(query)
    image_results = image_vectorstore.similarity_search_by_vector(query_embedding, k=k)
    return text_results + image_results
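Concatenating the two result lists leaves every text hit ahead of every image hit, and the raw similarity scores cannot be compared directly because they come from different embedding models. One common way to merge such lists is reciprocal rank fusion (RRF), which uses only rank positions. The sketch below is a swapped-in technique, not something the original pipeline does, and it assumes each result can be reduced to a string ID (for example a page or file identifier). `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only positions matter, not raw similarities.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical IDs: page 3 is found by both the text and the image retriever.
text_ids = ["report#p1", "report#p3", "report#p7"]
image_ids = ["report#p5", "report#p3"]
fused = reciprocal_rank_fusion([text_ids, image_ids])
```

An item that both retrievers return rises to the top even if neither ranked it first, which is usually the behavior you want from dual-path retrieval.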
Strengths: Images don't need pre-captioning; retrieval operates on visual content directly.
Limitations: CLIP performs well on natural photographs but poorly on professional charts and graphs — those require understanding numerical relationships, not just visual recognition.
Approach 3: ColPali (The 2024 Breakthrough)
Background: Traditional document RAG follows this pipeline:
PDF → extract text/images → textualize → embed → retrieve
Every step loses information or introduces noise. ColPali (Faysse et al., 2024, from Illuin Technology and academic collaborators, built on Google's PaliGemma) took a different approach:
PDF → screenshot each page → vision language model → patch-level embeddings for each page → retrieve
Process each PDF page directly as an image. Bypass text extraction entirely.
Key components:
- Backbone: PaliGemma 3B (Google's vision language model)
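- Late Interaction (from ColBERT): each page image is split into a 32 × 32 grid of 1,024 patches, and each patch gets its own embedding (roughly 1,030 vectors per page once a few prompt tokens are counted); queries generate token-level embeddings; for each query token, retrieval takes the similarity of its best-matching patch, then sums these maxima into the page score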
- The result: ColPali can pinpoint which part of a page answers a question
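The late-interaction scoring described in the list above can be sketched in a few lines. This is a toy, pure-Python illustration of the MaxSim operator from ColBERT using plain lists of floats; it assumes a non-empty list of patch vectors and is not how ColPali or byaldi implement it internally (they use batched tensor operations).

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]], page_patches: list[list[float]]) -> float:
    """Late-interaction score: for each query token take the similarity of
    its best-matching patch, then sum those maxima over all query tokens."""
    return sum(
        max(dot(q, p) for p in page_patches)
        for q in query_tokens
    )


# Two query tokens, two candidate pages with two patches each (toy 2-D vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # has a good match for both tokens
page_b = [[1.0, 0.0], [1.0, 0.0]]  # only matches the first token
scores = {"page_a": maxsim_score(query, page_a), "page_b": maxsim_score(query, page_b)}
```

Because the maximum is taken per query token, the patch that wins for each token also tells you where on the page the match came from, which is what makes the localization claim above possible.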
# Using the byaldi library (Python interface for ColPali)
from byaldi import RAGMultiModalModel

# Load ColPali
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")
# Index a PDF directory (screenshots each page, generates patch embeddings)
RAG.index(
    input_path="./financial_reports/",
    index_name="reports_index",
    store_collection_with_index=True,  # save original images for answer generation
    overwrite=True,
)
# Retrieve (returns the most relevant pages)
results = RAG.search("Q3 revenue quarter-over-quarter growth", k=3)
for r in results:
    print(f"File: {r['doc_id']}, Page: {r['page_num']}, Score: {r['score']:.3f}")
Generate an answer from the retrieved page image:
import base64
from openai import OpenAI

def answer_with_page_image(question: str, page_image_path: str) -> str:
    with open(page_image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": f"Based on this page, answer: {question}"}
            ]
        }]
    ).choices[0].message.content
The full ColPali flow:
User question → ColPali retrieves most relevant pages → extract page images → send to VLM → generate answer
Strengths:
- Handles charts, formulas, and mixed layouts natively — no OCR required
- Page-level understanding preserves visual layout
- Significantly outperforms traditional methods on visually dense documents (research papers, financial reports)
Limitations:
- Heavy model (PaliGemma 3B); retrieval latency higher than vector lookup
- In practice needs a GPU; CPU-only inference is too slow for most deployments
- Long index-build time (each page requires a forward pass)
Dedicated Table Handling
Tables are different from images — they have structured semantics and deserve specialized treatment.
Method 1: Preserve Markdown structure
def table_to_markdown(table: list[list]) -> str:
    if not table or not table[0]:
        return ""
    header = table[0]
    md = "| " + " | ".join(str(h or "-") for h in header) + " |\n"
    md += "| " + " | ".join(":---:" for _ in header) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(str(c or "") for c in row) + " |\n"
    return md
Good LLMs can reason across rows and columns in Markdown format.
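One practical wrinkle: cells extracted from PDFs often contain line breaks or literal `|` characters, either of which breaks a Markdown row. A small sketch of a cell sanitizer, assuming cells arrive as strings, numbers, or `None`. `clean_cell` and `row_to_markdown` are hypothetical helper names, not part of pdfplumber.

```python
def clean_cell(value: object) -> str:
    """Make one table cell safe to place inside a Markdown table row."""
    if value is None:
        return ""
    text = " ".join(str(value).split())  # collapse newlines and repeated spaces
    return text.replace("|", "\\|")      # escape pipes so they don't split the cell


def row_to_markdown(row: list[object]) -> str:
    """Render one row, sanitizing every cell first."""
    return "| " + " | ".join(clean_cell(c) for c in row) + " |"
```

Checking `value is None` explicitly, instead of `value or ""`, also keeps a numeric `0` from being rendered as an empty cell.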
Method 2: Summary for retrieval + full table for generation
def index_table(table_md: str, table_metadata: dict) -> None:
    # Use LLM to generate a retrieval-friendly summary
    summary = llm.invoke(
        f"Summarize the key information in this table in one sentence (under 50 words):\n{table_md}"
    )
    # Store summary as the retrieval unit, full table in metadata
    vectorstore.add_texts(
        [summary.content],
        metadatas=[{**table_metadata, "full_table": table_md}]
    )
Retrieve by summary; send the full table Markdown to the LLM for answer generation.
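The generation-time half of this method is the swap from summary back to full table. A minimal sketch, assuming each retrieval hit has been reduced to a plain dict with `page_content` and `metadata` keys (LangChain `Document` objects expose the same two fields as attributes). `build_context` is a hypothetical name.

```python
def build_context(hits: list[dict]) -> str:
    """Assemble LLM context from retrieval hits.

    Each hit is assumed to look like
    {"page_content": <summary or text>, "metadata": {...}}.
    Table hits carry the full Markdown under metadata["full_table"].
    """
    parts = []
    for hit in hits:
        metadata = hit.get("metadata", {})
        # Prefer the full table when present; the summary was only for matching.
        parts.append(metadata.get("full_table") or hit["page_content"])
    return "\n\n---\n\n".join(parts)
```

The summary never reaches the LLM in this version, so any detail the summarizer dropped is still available at answer time.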
Method 3: Structured extraction → natural language
For high-value tables (financials, product specs), extract as structured data then convert to natural language:
# Table → JSON
table_json = {
    "columns": ["Quarter", "Revenue ($B)", "QoQ Growth"],
    "rows": [
        {"Quarter": "Q1", "Revenue ($B)": 12.3, "QoQ Growth": "+5.2%"},
        {"Quarter": "Q2", "Revenue ($B)": 14.1, "QoQ Growth": "+14.6%"},
        {"Quarter": "Q3", "Revenue ($B)": 13.8, "QoQ Growth": "-2.1%"},
    ]
}
# JSON → natural language (better for semantic retrieval)
nl_description = (
    "Quarterly revenue data: Q1 $12.3B, Q2 $14.1B (up 14.6% QoQ), "
    "Q3 $13.8B (down 2.1% QoQ)."
)
Natural language is more retrieval-friendly and can be directly quoted in the LLM's answer.
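The natural-language string above is written by hand. A sketch of generating it from the JSON, assuming the exact column names of this example table (`Quarter`, `Revenue ($B)`, `QoQ Growth`) and signed percentage strings. `describe_table` is a hypothetical helper; for arbitrary tables an LLM or a per-schema template would take its place.

```python
def describe_table(table: dict, label: str = "Quarterly revenue data") -> str:
    """Render the example revenue table as one sentence for embedding."""
    parts = []
    for row in table["rows"]:
        growth = row["QoQ Growth"]  # e.g. "+14.6%" or "-2.1%"
        direction = "down" if growth.startswith("-") else "up"
        magnitude = growth.lstrip("+-")
        parts.append(
            f'{row["Quarter"]} ${row["Revenue ($B)"]}B ({direction} {magnitude} QoQ)'
        )
    return f"{label}: " + ", ".join(parts) + "."


example_table = {
    "columns": ["Quarter", "Revenue ($B)", "QoQ Growth"],
    "rows": [
        {"Quarter": "Q2", "Revenue ($B)": 14.1, "QoQ Growth": "+14.6%"},
        {"Quarter": "Q3", "Revenue ($B)": 13.8, "QoQ Growth": "-2.1%"},
    ],
}
print(describe_table(example_table))
# Quarterly revenue data: Q2 $14.1B (up 14.6% QoQ), Q3 $13.8B (down 2.1% QoQ).
```

Generating the sentence from the structured rows keeps the numbers in the retrieval text identical to the source table, which hand-written descriptions cannot guarantee.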
Which Approach to Choose
| | Extract + Textualize | CLIP Multimodal | ColPali |
|---|---|---|---|
| Document types | All | Image-heavy | Visually dense (reports, academic PDFs) |
| Infrastructure | Standard text RAG | Requires CLIP | Requires GPU, heavy model |
| Chart understanding | Depends on VLM caption quality | Weak (charts ≠ natural photos) | Strong (page-level understanding) |
| Update cost | Low | Medium | High (re-indexing is expensive) |
| Engineering complexity | Low | Medium | High |
| Cost | VLM captioning fees | Low | Model inference cost |
Practical recommendations for most scenarios:
| Scenario | Recommended approach |
|---|---|
| Standard enterprise docs (few images) | Text RAG; OCR or ignore images |
| Product docs (architecture diagrams) | Extract + VLM caption (e.g. GPT-4o) |
| Financial/research reports (charts) | ColPali |
| E-commerce image search | CLIP |
| Quick knowledge base prototype | Extract + textualize (simplest) |
A Complete Multimodal RAG Pipeline
Combining the approaches into a unified pipeline:
from enum import Enum

from langchain_chroma import Chroma  # or langchain_community.vectorstores on older installs
from langchain_core.documents import Document

# extract_all_elements is a helper not shown here; describe_image and
# table_to_markdown are the functions defined earlier in this post.

class DocElement(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"

class MultimodalRAGPipeline:
    def __init__(self, text_embeddings, clip_embeddings, llm):
        self.text_emb = text_embeddings
        self.clip_emb = clip_embeddings  # unused in this caption-based version
        self.llm = llm
        self.vectorstore = Chroma(embedding_function=text_embeddings)

    def index(self, pdf_path: str) -> None:
        elements = extract_all_elements(pdf_path)  # text / images / tables
        docs = []
        for elem in elements:
            if elem.type == DocElement.TEXT:
                docs.append(Document(page_content=elem.content, metadata={"type": "text"}))
            elif elem.type == DocElement.IMAGE:
                caption = self._generate_caption(elem.image_path)
                docs.append(Document(
                    page_content=caption,
                    metadata={"type": "image", "path": elem.image_path}
                ))
            elif elem.type == DocElement.TABLE:
                docs.append(Document(
                    page_content=table_to_markdown(elem.content),
                    metadata={"type": "table"}
                ))
        self.vectorstore.add_documents(docs)

    def _generate_caption(self, image_path: str) -> str:
        return describe_image(image_path)  # calls the VLM (gpt-4o in describe_image)

    def query(self, question: str) -> dict:
        results = self.vectorstore.similarity_search(question, k=5)
        context_parts = []
        images_to_show = []
        for r in results:
            if r.metadata["type"] == "image":
                context_parts.append(f"[Image description] {r.page_content}")
                images_to_show.append(r.metadata["path"])
            else:
                context_parts.append(r.page_content)
        # Join outside the f-string so the separator can contain newlines
        context = "\n---\n".join(context_parts)
        answer = self.llm.invoke(
            f"Answer based on the following:\n\n{context}\n\nQuestion: {question}"
        )
        return {"answer": answer.content, "images": images_to_show}
Summary
Multimodal RAG is fundamentally about converting non-text information into a retrievable form, then returning the original content to the LLM at answer-generation time. Three approaches:
- Extract and textualize: most mature, engineering-simple, but dependent on OCR/VLM quality — suitable for most scenarios
- CLIP multimodal embeddings: unified vector space for text and images; good for natural photograph retrieval; limited on professional charts
- ColPali: direct visual page processing; best results for chart-heavy documents; requires GPU and higher engineering investment
Tables are often simpler than images: preserve Markdown structure + generate a retrieval summary, and standard text RAG handles them well.
Next (and final) in this series: Code RAG — helping AI understand your codebase, including AST-based splitting, code embedding models, and representing call graphs with knowledge graphs.