Originally published on NextFuture
What you will ship
By the end of this tutorial you will have a Python script that ingests any PDF containing text, embedded images, data tables, and mathematical equations, then answers natural-language questions against all four modalities at once. RAG-Anything (arXiv 2510.12323), built on top of LightRAG, wraps a multimodal knowledge-graph pipeline: you supply an OpenAI key, a file path, and three callback functions, and it handles MinerU-based PDF parsing, per-modality processors, and knowledge-graph construction automatically. Prerequisites: Python 3.10+, a valid OPENAI_API_KEY, and poppler-utils installed for PDF-to-image rendering. Budget roughly $0.02–$0.06 in OpenAI API calls per 10-page document at gpt-4o-mini + gpt-4o rates.
Step-by-step build
Step 1 — Install the package and system dependency
# Install RAG-Anything with all optional processors
pip install "raganything[all]"
# macOS
brew install poppler
# Debian/Ubuntu
sudo apt install poppler-utils
Step 2 — Create rag_pipeline.py and configure the pipeline
import asyncio
import os
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="mineru",        # alternatives: docling, paddleocr
    parse_method="auto",    # auto | ocr | txt
    enable_image_processing=True,
    enable_table_processing=True,
    enable_equation_processing=True,
)
Step 3 — Wire the LLM, vision, and embedding callbacks
RAG-Anything separates text inference (cheap, gpt-4o-mini) from vision inference (needs gpt-4o for chart accuracy). Passing the wrong model to vision_func is the most common setup mistake — more on this in the gotchas section.
def llm_func(prompt, system_prompt=None, history_messages=None, **kwargs):
    # Text-only completions go to the cheaper model
    return openai_complete_if_cache(
        "gpt-4o-mini", prompt,
        system_prompt=system_prompt,
        history_messages=history_messages or [],
        api_key=OPENAI_KEY, **kwargs,
    )
def vision_func(prompt, system_prompt=None, history_messages=None,
                image_data=None, messages=None, **kwargs):
    # A pre-built multimodal message list takes priority
    if messages:
        return openai_complete_if_cache(
            "gpt-4o", "", messages=messages, api_key=OPENAI_KEY, **kwargs
        )
    if image_data:
        return openai_complete_if_cache(
            "gpt-4o", "",
            messages=[
                {"role": "system", "content": system_prompt or ""},
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }},
                ]},
            ],
            api_key=OPENAI_KEY, **kwargs,
        )
    # No image supplied — fall back to the plain text callback
    return llm_func(prompt, system_prompt=system_prompt,
                    history_messages=history_messages, **kwargs)
# 1536 dimensions matches OpenAI's default embedding model output
embedding_func = EmbeddingFunc(
    embedding_dim=1536,
    max_token_size=8192,
    func=lambda texts: openai_embed(texts, api_key=OPENAI_KEY),
)
Step 4 — Instantiate, process a document, and run a text query
async def main():
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_func,
        vision_model_func=vision_func,
        embedding_func=embedding_func,
    )
    await rag.process_document_complete(
        file_path="./annual_report.pdf",
        output_dir="./output",
        parse_method="auto",
    )
    answer = await rag.aquery(
        "Summarize the key findings shown in the tables and describe the diagrams.",
        mode="hybrid",
    )
    print(answer)

asyncio.run(main())
Step 5 — Query with inline multimodal context
When you already have a specific equation, table cell, or image you want to ask about, pass it directly as multimodal_content — the retriever weights results by modality relevance to that artifact.
# Runs inside main() — aquery_with_multimodal must be awaited in an async context
mm_answer = await rag.aquery_with_multimodal(
    "What does this relevance formula represent in the document?",
    multimodal_content=[{
        "type": "equation",
        "latex": "P(d|q) = \\frac{P(q|d)\\cdot P(d)}{P(q)}",
        "equation_caption": "Document relevance probability",
    }],
    mode="hybrid",
)
print(mm_answer)
Step 6 — Index an entire docs folder in one call
# Runs inside main(), reusing the rag instance from Step 4
await rag.process_folder_complete(
    folder_path="./docs",
    output_dir="./output",
)
result = await rag.aquery(
    "Which quarterly report shows the highest revenue growth rate?",
    mode="global",
)
print(result)
Test it works
After running Step 4, verify the knowledge graph file was created and run a quick sync query:
ls ./rag_storage/graph_chunk_entity_relation.graphml
# should be non-empty — a typical 10-page PDF produces 50–200 entity nodes
result = rag.query("List the main section headings.", mode="naive")
print(result)
# Expected: numbered list of headings from the document
# Empty string means the graph file is missing — re-run process_document_complete
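To check the node count without opening the file by hand, a minimal stdlib sketch works on any GraphML output (the path assumes the working_dir from Step 2):

```python
import xml.etree.ElementTree as ET

def count_graph_nodes(path: str) -> int:
    """Count <node> elements in a GraphML file, ignoring XML namespaces."""
    tree = ET.parse(path)
    return sum(1 for el in tree.getroot().iter()
               if el.tag.split("}")[-1] == "node")

# e.g. count_graph_nodes("./rag_storage/graph_chunk_entity_relation.graphml")
# expect roughly 50-200 for a typical 10-page PDF
```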
Common gotchas
1. MinerU requires poppler — no fallback. MinerU renders PDF pages to images before extracting layout. Without poppler binaries in PATH you get pdf2image.exceptions.PDFInfoNotInstalledError at index time. Fix it at the OS level (brew install poppler / apt install poppler-utils). If you cannot install system packages, set parser="docling" in RAGAnythingConfig — pure Python, no system deps, but it misses some complex figure captions.
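If you do take the docling route, the switch is a single change to the Step 2 config (a sketch; the other flags keep their Step 2 values):

```python
config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="docling",      # pure Python — no poppler binaries needed
    parse_method="auto",
)
```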
2. gpt-4o-mini in the vision callback silently degrades accuracy by 30–40%. gpt-4o-mini accepts multimodal message payloads and returns a response without error, producing hallucinated chart descriptions with no warning. Reserve gpt-4o-mini exclusively for llm_func (text-only graph queries) and keep vision_func on gpt-4o. The extra cost is small because only image and equation nodes ever hit that path.
3. Re-processing the same file doubles your LLM spend. RAG-Anything persists its knowledge graph across runs in working_dir but does not track which files have been indexed. Calling process_document_complete() on a previously ingested PDF re-runs the full parsing pipeline and bills you again. Track processed files by SHA-256 in a local SQLite table and skip before calling the ingestion method.
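That dedup guard needs nothing beyond the standard library — a sketch, where the table and helper names are illustrative rather than part of RAG-Anything:

```python
import hashlib
import sqlite3

def file_sha256(path: str) -> str:
    """Hash a file in 1 MiB chunks so large PDFs don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def already_indexed(db: sqlite3.Connection, digest: str) -> bool:
    db.execute("CREATE TABLE IF NOT EXISTS indexed_files (sha256 TEXT PRIMARY KEY)")
    return db.execute("SELECT 1 FROM indexed_files WHERE sha256 = ?",
                      (digest,)).fetchone() is not None

def mark_indexed(db: sqlite3.Connection, digest: str) -> None:
    db.execute("INSERT OR IGNORE INTO indexed_files (sha256) VALUES (?)", (digest,))
    db.commit()

# Hypothetical wiring around Step 4:
# digest = file_sha256("./annual_report.pdf")
# if not already_indexed(db, digest):
#     await rag.process_document_complete(file_path="./annual_report.pdf", ...)
#     mark_indexed(db, digest)
```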
Ship it this week
RAG-Anything delivers production-grade multimodal retrieval in under 50 lines of Python — no custom parsers, no separate image pipeline, no vector-store configuration beyond the default LightRAG storage. Pair it with AI-generated images (see our gpt-image-2 API tutorial for 2K AI images) to make that visual output fully searchable. Already running autonomous agents? Wire a FastAPI endpoint around rag.query() and call it as a retrieval tool from your OpenAI Agents SDK workflow.