Introduction
"Standard RAG systems treat everything in a document as text — but tables aren't text, equations aren't text, and images are definitely not text."
This is article #113 in the Open Source Project of the Day series. Today's project is RAG-Anything — an all-in-one multimodal RAG framework from the Hong Kong University Data Science Lab (HKUDS), built as an extension of LightRAG.
Take a financial report PDF containing text, data tables, trend charts, and mathematical formulas. Feed it to a standard RAG system and ask "how much did Q3 2023 revenue grow?":
- Tables get parsed into garbled text with row/column relationships lost
- Charts are silently ignored (text-only parsers skip images)
- Equations become LaTeX source strings with no semantic meaning
The quality of answers you get is not much better than throwing the whole document at an LLM.
RAG-Anything's answer: give each modality proper structural representation. Tables become subgraphs of row/column/cell nodes. Images get analyzed by a vision model and become entities with semantic descriptions. Equations retain symbolic nodes with meaning. All modalities fuse into a unified knowledge graph, queried jointly at retrieval time.
What You'll Learn
- Why "convert tables to text" loses critical information
- The dual-graph fusion architecture: how cross-modal and text knowledge graphs are built and merged
- All 5 pipeline stages in detail
- Three query modes: pure text / VLM-enhanced / multimodal
- DocBench and MMLongBench benchmark results
- Comparison with MMGraphRAG and LightRAG
Prerequisites
- Familiarity with RAG basics
- Basic understanding of knowledge graphs (nodes, edges)
- Python experience
Project Background
What Is RAG-Anything?
RAG-Anything is an all-in-one multimodal RAG framework built on LightRAG. It processes text, images, tables, and equations as first-class entities in a knowledge graph, rather than degrading all content types to plain text.
In June 2026, LightRAG natively integrated RAG-Anything into the main project, making it the official multimodal implementation for the LightRAG ecosystem.
Author / Team
- Lab: Hong Kong University Data Science Lab (HKUDS)
- Authors: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang
- arXiv: 2510.12323
- License: MIT
-
PyPI:
raganything
Project Stats
- ⭐ GitHub Stars: 21,900+
- 🍴 Forks: 2,500+
- 📄 License: MIT
The Core Problem: Why "Convert Everything to Text" Falls Short
Tables lose structural information in plain text conversion
Original table:
2021 2022 2023
Revenue $12B $14.5B $16.8B
Profit $1.8B $2.2B $3.1B
Growth 15% 20.8% 42.3%
After text parser conversion:
"Revenue 12B 14.5B 16.8B Profit 1.8B 2.2B 3.1B Growth 15% 20.8% 42.3%"
Row/column relationships are gone. Ask "what was the profit growth rate in 2022?" and the system may link "2.2B" and "42.3%" (both present in the sequential text near 2022 and 2023 data), producing wrong answers.
Charts are invisible to text-only parsers
Most text extraction tools skip images and charts in PDFs entirely. These often contain the document's most important insights — trend charts, comparison figures, process diagrams — and they disappear completely from the RAG system's view.
Equations lose semantic meaning
LaTeX source: $\frac{\partial L}{\partial w} = -\frac{1}{N}\sum_{i=1}^{N} x_i(y_i - \sigma(w^Tx_i))$
This is the logistic regression gradient formula.
Embedding this LaTeX string has no semantic connection to
"gradient descent" or "parameter update."
Architecture: 5-Stage Pipeline
Document (PDF/DOCX/PPTX/Image...)
↓
Stage 1: Document Parsing (MinerU / Docling / PaddleOCR)
Decomposes into text blocks, images, tables, equations, element lists
↓
Stage 2: Content Routing
Dispatches each element to its modality-specific processor
↓
Stage 3: Multimodal Analysis
├── Image processor → VLM generates description + extracts entities
├── Table processor → Row/column/cell structured nodes
├── Equation processor → Symbolic nodes + semantic description
└── Text processor → Standard NER + relation extraction
↓
Stage 4: Knowledge Graph Indexing (dual-graph fusion)
├── Cross-modal knowledge graph (non-text elements as anchor nodes)
├── Text knowledge graph (standard LightRAG graph)
└── Entity name alignment → fused unified graph G=(V,E)
↓
Stage 5: Hybrid Retrieval + Generation
├── Structural navigation (graph traversal + keyword matching)
├── Semantic similarity (vector search, cross-modal embeddings)
└── VLM-enhanced generation (images auto-passed to vision model)
Dual-Graph Fusion Architecture
This is the central design in RAG-Anything.
Cross-Modal Knowledge Graph
Each non-text element (image, table, equation) becomes an anchor node in the graph:
Table representation:
Original table (Revenue/Profit by Year)
↓ builds subgraph
[Table: Revenue-Profit Comparison]
├──[Row: Revenue]──[Cell: $12B]──[Column: 2021]
├──[Row: Revenue]──[Cell: $14.5B]──[Column: 2022]
├──[Row: Profit]──[Cell: $1.8B]──[Column: 2021]
└── ...
Edge types: row-of, column-of, header-applies-to
When queried "profit in 2022," the graph precisely navigates to the intersection of Row:Profit + Column:2022, without guessing among a sequence of numbers.
Image representation:
Image (a sales trend line chart)
↓ VLM analysis
[Image entity: Sales Trend Chart 2021-2023]
├── Description: Shows rising revenue trend from 2021 to 2023
├── [Panel node: Main chart area]──[Axis node: X-axis Years]
├── [Axis node: Y-axis Amount (billions)]
└── [Annotation node: 2023 growth rate 42.3%]
Image content becomes searchable structured nodes, not a black box.
Equation representation:
LaTeX equation
↓ semantic analysis
[Equation entity: Logistic Regression Gradient]
├── Symbol nodes: ∂L, ∂w, σ(·), x_i, y_i
├── Semantic description: logistic regression parameter gradient formula
└── Relationship: belongs_to → [Algorithm: Logistic Regression]
Text Knowledge Graph
Standard LightRAG NER and relation extraction on the document's text content.
Fusion
The two graphs merge via entity name alignment: if the cross-modal graph's image entity "Sales Trend Chart" and the text graph's "Sales Data Analysis" refer to the same concept, they get connected through entity name matching.
Quick Start
Install:
pip install raganything # Basic
pip install 'raganything[all]' # All optional features
Complete initialization:
from raganything import RAGAnything, RAGAnythingConfig
import asyncio
config = RAGAnythingConfig(
working_dir="./rag_storage",
parser="mineru", # mineru / docling / paddleocr
enable_image_processing=True,
enable_table_processing=True,
enable_equation_processing=True,
)
# LLM and embedding functions (OpenAI-compatible API)
async def llm_func(prompt, ...): ...
async def vision_func(prompt, image_data, ...): ... # GPT-4o or equivalent
async def embedding_func(texts): ...
rag = RAGAnything(
config=config,
llm_model_func=llm_func,
vision_model_func=vision_func,
embedding_func=embedding_func,
)
# Process a document
await rag.process_document_complete(
file_path="annual_report.pdf",
output_dir="./output"
)
# Query
result = await rag.aquery(
"What was the Q3 2023 revenue growth rate?",
mode="hybrid"
)
print(result)
Three Query Modes
Pure text mode (fast, uses standard LightRAG pipeline):
result = await rag.aquery("What are the company's main competitive advantages?", mode="local")
VLM-enhanced mode (images auto-passed to vision model when retrieved):
result = await rag.aquery("What does this architecture diagram show?", mode="hybrid")
# When retrieved context contains images, they're automatically
# encoded and passed to vision_model_func
# Answer is grounded in actual image content, not just descriptions
Multimodal query (filter by specific content type):
result = await rag.aquery(
"Analyze the trends in this table",
mode="hybrid",
content_type_filter=["table"]
)
Direct Content Injection
Bypass the built-in parser, inject pre-parsed content:
content_list = [
{"content_type": "text", "content": "Company overview..."},
{"content_type": "table", "content": table_data, "caption": "Financial data"},
{"content_type": "image", "content": image_bytes, "caption": "Product architecture"},
]
await rag.ainsert_content_list(content_list)
Benchmark Results
Evaluated on two long-document QA benchmarks:
DocBench (229 documents, avg. 66 pages):
| System | Accuracy |
|---|---|
| RAG-Anything | 63.4% |
| MMGraphRAG | 61.0% |
| LightRAG | 58.4% |
| GPT-4o-mini (direct input) | 51.2% |
MMLongBench (135 documents, avg. 47.5 pages):
| System | Accuracy |
|---|---|
| RAG-Anything | 42.8% |
| LightRAG | 38.9% |
| MMGraphRAG | 37.7% |
| GPT-4o-mini (direct input) | 33.5% |
Documents over 100 pages show the largest gap: RAG-Anything ~68% vs. MMGraphRAG ~55%, a difference exceeding 13 percentage points. The longer the document, the more structural representation matters — longer plain text makes it harder to locate structural information amid noise.
Key ablation result: removing graph construction entirely (chunk-only retrieval) drops accuracy from 63.4% to 60.0%. Removing the reranker costs only about 1 percentage point. Conclusion: the gains come from the graph architecture itself, not post-retrieval tricks.
Supported Parsers
| Parser | Best for | Notes |
|---|---|---|
| MinerU | PDFs, complex layouts | GPU acceleration, strong OCR |
| Docling | Office docs (DOCX/PPTX/XLSX) | Good structure preservation, HTML support |
| PaddleOCR | Image-heavy PDFs | Strong multilingual OCR |
Office format support (.doc/.ppt/.xls) requires LibreOffice for conversion:
sudo apt-get install libreoffice
Links and Resources
- 🌟 GitHub: HKUDS/RAG-Anything
- 📦 PyPI: raganything
- 📄 arXiv: 2510.12323
- 🔗 LightRAG: HKUDS/LightRAG
Conclusion
RAG-Anything addresses a problem common in real enterprise document processing: contracts, financial reports, academic papers, technical manuals — the core information in these documents often lives in tables and figures, not text paragraphs. Compressing that content into strings and calling it "RAG for documents" is more accurately described as RAG for documents' shadow.
The dual-graph fusion architecture is technically sound: structured nodes for table row/column relationships, VLM processing for images into retrievable semantic entities, symbolic nodes for equations. The ablation results validate that the graph is the mechanism driving improvement, not retrieval post-processing.
LightRAG natively integrating RAG-Anything in June 2026 signals upstream endorsement of the design. For RAG systems that need to handle complex documents with charts, tables, and equations, RAG-Anything is one of the most complete solutions in the current open-source ecosystem.
Explore PrimeSkills — A marketplace for handpicked AI Agents and skills. Each is validated in real enterprise workflows, stripping away hype and keeping only what truly works.
Welcome to my Homepage for more useful insights and interesting products.
Top comments (0)