BeanBean

Posted on • Originally published at nextfuture.io.vn

RAG-Anything: multi-modal PDF+image RAG in 20 min (2026)
What you will ship

By the end of this tutorial you will have a Python script that ingests any PDF containing text, embedded images, data tables, and mathematical equations, then answers natural language questions against all four modalities simultaneously. RAG-Anything (built on top of LightRAG, arXiv 2510.12323) wraps a multimodal knowledge-graph pipeline — you supply an OpenAI key, a file path, and three callback functions. It handles MinerU-based PDF parsing, per-modality processors, and knowledge-graph construction automatically. Prerequisites: Python 3.10+, a valid OPENAI_API_KEY, and poppler-utils installed for PDF-to-image rendering. Budget roughly $0.02–$0.06 in OpenAI API calls per 10-page document at gpt-4o-mini + gpt-4o rates.
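The two prerequisites that most often break the build are the API key and the poppler binaries. A stdlib-only preflight check (the `preflight` helper name is mine, not part of RAG-Anything) catches both before you spend anything on parsing:

```python
import os
import shutil

def preflight() -> list:
    """Return a list of missing prerequisites; an empty list means ready."""
    missing = []
    if not os.environ.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY is not set")
    if shutil.which("pdftoppm") is None:  # shipped by poppler / poppler-utils
        missing.append("poppler is not on PATH "
                       "(brew install poppler / apt install poppler-utils)")
    return missing

print(preflight() or "environment ready")
```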

Step-by-step build

Step 1 — Install the package and system dependency

# Install RAG-Anything with all optional processors
pip install "raganything[all]"

# macOS
brew install poppler

# Debian/Ubuntu
sudo apt install poppler-utils

Step 2 — Create rag_pipeline.py and configure the pipeline

import asyncio
import os
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc

OPENAI_KEY = os.environ["OPENAI_API_KEY"]

config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="mineru",          # alternatives: docling, paddleocr
    parse_method="auto",       # auto | ocr | txt
    enable_image_processing=True,
    enable_table_processing=True,
    enable_equation_processing=True,
)

Step 3 — Wire the LLM, vision, and embedding callbacks

RAG-Anything separates text inference (cheap, gpt-4o-mini) from vision inference (needs gpt-4o for chart accuracy). Passing the wrong model to vision_func is the most common setup mistake — more on this in the gotchas section.

def llm_func(prompt, system_prompt=None, history_messages=None, **kwargs):
    # Text-only path: cheap model for graph construction and text queries
    return openai_complete_if_cache(
        "gpt-4o-mini", prompt,
        system_prompt=system_prompt,
        history_messages=history_messages or [],
        api_key=OPENAI_KEY, **kwargs,
    )

def vision_func(prompt, system_prompt=None, history_messages=None,
                image_data=None, messages=None, **kwargs):
    # A pre-built multimodal message list takes priority
    if messages:
        return openai_complete_if_cache(
            "gpt-4o", "", messages=messages, api_key=OPENAI_KEY, **kwargs
        )
    # Raw base64 image: wrap it in an OpenAI vision payload
    if image_data:
        return openai_complete_if_cache(
            "gpt-4o", "",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }},
                ]},
            ],
            api_key=OPENAI_KEY, **kwargs,
        )
    # No image in this call: fall back to the text model
    return llm_func(prompt, system_prompt=system_prompt,
                    history_messages=history_messages, **kwargs)

embedding_func = EmbeddingFunc(
    embedding_dim=1536,
    max_token_size=8192,
    func=lambda texts: openai_embed(texts, api_key=OPENAI_KEY),
)

Step 4 — Instantiate, process a document, and run a text query

async def main():
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_func,
        vision_model_func=vision_func,
        embedding_func=embedding_func,
    )

    await rag.process_document_complete(
        file_path="./annual_report.pdf",
        output_dir="./output",
        parse_method="auto",
    )

    answer = await rag.aquery(
        "Summarize the key findings shown in the tables and describe the diagrams.",
        mode="hybrid",
    )
    print(answer)

asyncio.run(main())
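If you need to call the pipeline from synchronous code (a CLI, a Flask view), a thin blocking wrapper avoids sprinkling asyncio.run everywhere. A minimal sketch — the `ask` helper is my naming, and it assumes the `aquery(question, mode=...)` signature used above:

```python
import asyncio

def ask(rag, question: str, mode: str = "hybrid") -> str:
    """Blocking convenience wrapper around the async rag.aquery API.

    Sketch: assumes `rag` exposes aquery(question, mode=...) as in Step 4.
    Do not call this from inside an already-running event loop.
    """
    return asyncio.run(rag.aquery(question, mode=mode))
```

Usage: `ask(rag, "List the section headings.", mode="naive")`.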

Step 5 — Query with inline multimodal context

When you already have a specific equation, table cell, or image you want to ask about, pass it directly as multimodal_content — the retriever weights results by modality relevance to that artifact.

# still inside main(), after the Step 4 query
mm_answer = await rag.aquery_with_multimodal(
    "What does this relevance formula represent in the document?",
    multimodal_content=[{
        "type": "equation",
        "latex": "P(d|q) = \\frac{P(q|d)\\cdot P(d)}{P(q)}",
        "equation_caption": "Document relevance probability",
    }],
    mode="hybrid",
)
print(mm_answer)
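Tables and images follow the same item shape as the equation above. The field names below match the examples in the RAG-Anything README at the time of writing; treat this as a sketch and check the current docs if the call rejects the payload:

```python
# A table artifact for multimodal_content (same list position as the
# equation example); table_data is CSV-style text
table_item = {
    "type": "table",
    "table_data": "Quarter,Revenue\nQ1,1.2M\nQ2,1.5M\nQ3,1.9M",
    "table_caption": "Quarterly revenue",
}
```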

Step 6 — Index an entire docs folder in one call

# still inside main()
await rag.process_folder_complete(
    folder_path="./docs",
    output_dir="./output",
)
result = await rag.aquery(
    "Which quarterly report shows the highest revenue growth rate?",
    mode="global",
)
print(result)

Test it works

After running Step 4, verify the knowledge graph file was created and run a quick sync query:

ls ./rag_storage/graph_chunk_entity_relation.graphml
# should be non-empty — a typical 10-page PDF produces 50–200 entity nodes
result = rag.query("List the main section headings.", mode="naive")
print(result)
# Expected: numbered list of headings from the document
# Empty string means the graph file is missing — re-run process_document_complete

Common gotchas

1. MinerU requires poppler — no fallback. MinerU renders PDF pages to images before extracting layout. Without poppler binaries in PATH you get pdf2image.exceptions.PDFInfoNotInstalledError at index time. Fix it at the OS level (brew install poppler / apt install poppler-utils). If you cannot install system packages, set parser="docling" in RAGAnythingConfig — pure Python, no system deps, but it misses some complex figure captions.

2. gpt-4o-mini in the vision callback silently degrades accuracy by 30–40%. gpt-4o-mini accepts multimodal message payloads and returns a response without error, producing hallucinated chart descriptions with no warning. Reserve gpt-4o-mini exclusively for llm_func (text-only graph queries) and keep vision_func on gpt-4o. The extra cost is small because only image and equation nodes ever hit that path.
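A one-line guard at startup makes this failure loud instead of silent. A hypothetical helper, not part of the library:

```python
def check_vision_model(model_name: str) -> str:
    """Fail fast if a 'mini' model is about to be routed to the vision path.

    Hypothetical guard: call it once at startup with the model name you
    pass inside vision_func, before any documents are processed.
    """
    if "mini" in model_name:
        raise ValueError(
            f"{model_name} accepts image payloads but degrades chart "
            "descriptions badly; use gpt-4o for vision_func"
        )
    return model_name

VISION_MODEL = check_vision_model("gpt-4o")
```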

3. Re-processing the same file doubles your LLM spend. RAG-Anything persists its knowledge graph across runs in working_dir but does not track which files have been indexed. Calling process_document_complete() on a previously ingested PDF re-runs the full parsing pipeline and bills you again. Track processed files by SHA-256 in a local SQLite table and skip before calling the ingestion method.
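The dedup table is a few lines of stdlib Python. A sketch (helper names are mine): call `already_indexed` before `process_document_complete`, and `mark_indexed` only after it succeeds.

```python
import hashlib
import sqlite3

def file_sha256(path: str) -> str:
    """Hash a file in 1 MiB chunks so large PDFs don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def already_indexed(db: sqlite3.Connection, digest: str) -> bool:
    db.execute(
        "CREATE TABLE IF NOT EXISTS indexed_files (sha256 TEXT PRIMARY KEY)"
    )
    row = db.execute(
        "SELECT 1 FROM indexed_files WHERE sha256 = ?", (digest,)
    ).fetchone()
    return row is not None

def mark_indexed(db: sqlite3.Connection, digest: str) -> None:
    db.execute(
        "INSERT OR IGNORE INTO indexed_files (sha256) VALUES (?)", (digest,)
    )
    db.commit()
```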

Ship it this week

RAG-Anything delivers production-grade multi-modal retrieval in under 50 lines of Python — no custom parsers, no separate image pipeline, no vector-store configuration beyond the default LightRAG storage. Pair it with AI-generated images (see our tutorial on gpt-image-2 API for 2K AI images) to make visual output fully searchable. Already running autonomous agents? Wire a FastAPI endpoint around rag.query() and call it as a retrieval tool from your OpenAI Agents SDK workflow.
