WonderLab

Posted on Jul 3

Open Source Project of the Day (#113): RAG-Anything — Making Images, Tables, and Equations First-Class Citizens in RAG

#opensource #rag #multimodal #knowledgegraph

Introduction

"Standard RAG systems treat everything in a document as text — but tables aren't text, equations aren't text, and images are definitely not text."

This is article #113 in the Open Source Project of the Day series. Today's project is RAG-Anything — an all-in-one multimodal RAG framework from the Hong Kong University Data Science Lab (HKUDS), built as an extension of LightRAG.

Take a financial report PDF containing text, data tables, trend charts, and mathematical formulas. Feed it to a standard RAG system and ask "how much did Q3 2023 revenue grow?":

Tables get parsed into garbled text with row/column relationships lost
Charts are silently ignored (text-only parsers skip images)
Equations become LaTeX source strings with no semantic meaning

The quality of answers you get is not much better than throwing the whole document at an LLM.

RAG-Anything's answer: give each modality proper structural representation. Tables become subgraphs of row/column/cell nodes. Images get analyzed by a vision model and become entities with semantic descriptions. Equations retain symbolic nodes with meaning. All modalities fuse into a unified knowledge graph, queried jointly at retrieval time.

What You'll Learn

Why "convert tables to text" loses critical information
The dual-graph fusion architecture: how cross-modal and text knowledge graphs are built and merged
All 5 pipeline stages in detail
Three query modes: pure text / VLM-enhanced / multimodal
DocBench and MMLongBench benchmark results
Comparison with MMGraphRAG and LightRAG

Prerequisites

Familiarity with RAG basics
Basic understanding of knowledge graphs (nodes, edges)
Python experience

Project Background

What Is RAG-Anything?

RAG-Anything is an all-in-one multimodal RAG framework built on LightRAG. It processes text, images, tables, and equations as first-class entities in a knowledge graph, rather than degrading all content types to plain text.

In June 2026, LightRAG natively integrated RAG-Anything into the main project, making it the official multimodal implementation for the LightRAG ecosystem.

Author / Team

Lab: Hong Kong University Data Science Lab (HKUDS)
Authors: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang
arXiv: 2510.12323
License: MIT
PyPI: raganything

Project Stats

⭐ GitHub Stars: 21,900+
🍴 Forks: 2,500+
📄 License: MIT

The Core Problem: Why "Convert Everything to Text" Falls Short

Tables lose structural information in plain text conversion

Original table:
         2021      2022      2023
Revenue  $12B      $14.5B    $16.8B
Profit   $1.8B     $2.2B     $3.1B
Growth   15%       20.8%     42.3%

After text parser conversion:
"Revenue 12B 14.5B 16.8B Profit 1.8B 2.2B 3.1B Growth 15% 20.8% 42.3%"

Row/column relationships are gone. Ask "what was the profit growth rate in 2022?" and the system may link "2.2B" and "42.3%" (both present in the sequential text near 2022 and 2023 data), producing wrong answers.

Charts are invisible to text-only parsers

Most text extraction tools skip images and charts in PDFs entirely. These often contain the document's most important insights — trend charts, comparison figures, process diagrams — and they disappear completely from the RAG system's view.

Equations lose semantic meaning

LaTeX source: $\frac{\partial L}{\partial w} = -\frac{1}{N}\sum_{i=1}^{N} x_i(y_i - \sigma(w^Tx_i))$

This is the logistic regression gradient formula.
Embedding this LaTeX string has no semantic connection to
"gradient descent" or "parameter update."

Architecture: 5-Stage Pipeline

Document (PDF/DOCX/PPTX/Image...)
        ↓
Stage 1: Document Parsing (MinerU / Docling / PaddleOCR)
        Decomposes into text blocks, images, tables, equations, element lists
        ↓
Stage 2: Content Routing
        Dispatches each element to its modality-specific processor
        ↓
Stage 3: Multimodal Analysis
        ├── Image processor → VLM generates description + extracts entities
        ├── Table processor → Row/column/cell structured nodes
        ├── Equation processor → Symbolic nodes + semantic description
        └── Text processor → Standard NER + relation extraction
        ↓
Stage 4: Knowledge Graph Indexing (dual-graph fusion)
        ├── Cross-modal knowledge graph (non-text elements as anchor nodes)
        ├── Text knowledge graph (standard LightRAG graph)
        └── Entity name alignment → fused unified graph G=(V,E)
        ↓
Stage 5: Hybrid Retrieval + Generation
        ├── Structural navigation (graph traversal + keyword matching)
        ├── Semantic similarity (vector search, cross-modal embeddings)
        └── VLM-enhanced generation (images auto-passed to vision model)

Dual-Graph Fusion Architecture

This is the central design in RAG-Anything.

Cross-Modal Knowledge Graph

Each non-text element (image, table, equation) becomes an anchor node in the graph:

Table representation:

Original table (Revenue/Profit by Year)
        ↓ builds subgraph
    [Table: Revenue-Profit Comparison]
         ├──[Row: Revenue]──[Cell: $12B]──[Column: 2021]
         ├──[Row: Revenue]──[Cell: $14.5B]──[Column: 2022]
         ├──[Row: Profit]──[Cell: $1.8B]──[Column: 2021]
         └── ...
    Edge types: row-of, column-of, header-applies-to

When queried "profit in 2022," the graph precisely navigates to the intersection of Row:Profit + Column:2022, without guessing among a sequence of numbers.

Image representation:

Image (a sales trend line chart)
        ↓ VLM analysis
    [Image entity: Sales Trend Chart 2021-2023]
        ├── Description: Shows rising revenue trend from 2021 to 2023
        ├── [Panel node: Main chart area]──[Axis node: X-axis Years]
        ├── [Axis node: Y-axis Amount (billions)]
        └── [Annotation node: 2023 growth rate 42.3%]

Image content becomes searchable structured nodes, not a black box.

Equation representation:

LaTeX equation
        ↓ semantic analysis
    [Equation entity: Logistic Regression Gradient]
        ├── Symbol nodes: ∂L, ∂w, σ(·), x_i, y_i
        ├── Semantic description: logistic regression parameter gradient formula
        └── Relationship: belongs_to → [Algorithm: Logistic Regression]

Text Knowledge Graph

Standard LightRAG NER and relation extraction on the document's text content.

Fusion

The two graphs merge via entity name alignment: if the cross-modal graph's image entity "Sales Trend Chart" and the text graph's "Sales Data Analysis" refer to the same concept, they get connected through entity name matching.

Quick Start

Install:

pip install raganything              # Basic
pip install 'raganything[all]'       # All optional features

Complete initialization:

from raganything import RAGAnything, RAGAnythingConfig
import asyncio

config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="mineru",           # mineru / docling / paddleocr
    enable_image_processing=True,
    enable_table_processing=True,
    enable_equation_processing=True,
)

# LLM and embedding functions (OpenAI-compatible API)
async def llm_func(prompt, ...): ...
async def vision_func(prompt, image_data, ...): ...  # GPT-4o or equivalent
async def embedding_func(texts): ...

rag = RAGAnything(
    config=config,
    llm_model_func=llm_func,
    vision_model_func=vision_func,
    embedding_func=embedding_func,
)

# Process a document
await rag.process_document_complete(
    file_path="annual_report.pdf",
    output_dir="./output"
)

# Query
result = await rag.aquery(
    "What was the Q3 2023 revenue growth rate?",
    mode="hybrid"
)
print(result)

Three Query Modes

Pure text mode (fast, uses standard LightRAG pipeline):

result = await rag.aquery("What are the company's main competitive advantages?", mode="local")

VLM-enhanced mode (images auto-passed to vision model when retrieved):

result = await rag.aquery("What does this architecture diagram show?", mode="hybrid")
# When retrieved context contains images, they're automatically
# encoded and passed to vision_model_func
# Answer is grounded in actual image content, not just descriptions

Multimodal query (filter by specific content type):

result = await rag.aquery(
    "Analyze the trends in this table",
    mode="hybrid",
    content_type_filter=["table"]
)

Direct Content Injection

Bypass the built-in parser, inject pre-parsed content:

content_list = [
    {"content_type": "text", "content": "Company overview..."},
    {"content_type": "table", "content": table_data, "caption": "Financial data"},
    {"content_type": "image", "content": image_bytes, "caption": "Product architecture"},
]
await rag.ainsert_content_list(content_list)

Benchmark Results

Evaluated on two long-document QA benchmarks:

DocBench (229 documents, avg. 66 pages):

System	Accuracy
RAG-Anything	63.4%
MMGraphRAG	61.0%
LightRAG	58.4%
GPT-4o-mini (direct input)	51.2%

MMLongBench (135 documents, avg. 47.5 pages):

System	Accuracy
RAG-Anything	42.8%
LightRAG	38.9%
MMGraphRAG	37.7%
GPT-4o-mini (direct input)	33.5%

Documents over 100 pages show the largest gap: RAG-Anything ~68% vs. MMGraphRAG ~55%, a difference exceeding 13 percentage points. The longer the document, the more structural representation matters — longer plain text makes it harder to locate structural information amid noise.

Key ablation result: removing graph construction entirely (chunk-only retrieval) drops accuracy from 63.4% to 60.0%. Removing the reranker costs only about 1 percentage point. Conclusion: the gains come from the graph architecture itself, not post-retrieval tricks.

Supported Parsers

Parser	Best for	Notes
MinerU	PDFs, complex layouts	GPU acceleration, strong OCR
Docling	Office docs (DOCX/PPTX/XLSX)	Good structure preservation, HTML support
PaddleOCR	Image-heavy PDFs	Strong multilingual OCR

Office format support (.doc/.ppt/.xls) requires LibreOffice for conversion:

sudo apt-get install libreoffice

Links and Resources

🌟 GitHub: HKUDS/RAG-Anything
📦 PyPI: raganything
📄 arXiv: 2510.12323
🔗 LightRAG: HKUDS/LightRAG

Conclusion

RAG-Anything addresses a problem common in real enterprise document processing: contracts, financial reports, academic papers, technical manuals — the core information in these documents often lives in tables and figures, not text paragraphs. Compressing that content into strings and calling it "RAG for documents" is more accurately described as RAG for documents' shadow.

The dual-graph fusion architecture is technically sound: structured nodes for table row/column relationships, VLM processing for images into retrievable semantic entities, symbolic nodes for equations. The ablation results validate that the graph is the mechanism driving improvement, not retrieval post-processing.

LightRAG natively integrating RAG-Anything in June 2026 signals upstream endorsement of the design. For RAG systems that need to handle complex documents with charts, tables, and equations, RAG-Anything is one of the most complete solutions in the current open-source ecosystem.

Explore PrimeSkills — A marketplace for handpicked AI Agents and skills. Each is validated in real enterprise workflows, stripping away hype and keeping only what truly works.

Welcome to my Homepage for more useful insights and interesting products.

DEV Community