Adapting the RAG-Anything library to your specific settings!

Image from the RAG-Anything repository
Introduction
Following up on my recent experience with LightRAG, I took a deeper dive into the GitHub repository of the renowned ‘Data Intelligence Lab@HKU.’ That’s where I discovered the immensely appealing ‘RAG-Anything’ project. While reviewing their examples, I was thrilled to spot the potential for integrating Docling! Naturally, I decided to adapt one of their provided samples. Staying true to my standard workflow, I prioritized a fully local, self-contained implementation: bypassing hosted services like OpenAI and adapting the setup to run all models and embeddings through Ollama.
What is RAG-Anything anyways?
🚨 Transparency Statement: Please note that I maintain complete independence from the referenced organizations and tools. This content represents an unbiased sharing of my findings and technical implementation journey, without any promotional intent.
Overview of RAG-Anything (excerpt from the official documentation): Next-Generation Multimodal Intelligence
Modern documents increasingly contain diverse multimodal content — text, images, tables, equations, charts, and multimedia — that traditional text-focused RAG systems cannot effectively process. RAG-Anything addresses this challenge as a comprehensive All-in-One Multimodal Document Processing RAG system built on LightRAG.
As a unified solution, RAG-Anything eliminates the need for multiple specialized tools. It provides seamless processing and querying across all content modalities within a single integrated framework. Unlike conventional RAG approaches that struggle with non-textual elements, our all-in-one system delivers comprehensive multimodal retrieval capabilities.
Users can query documents containing interleaved text, visual diagrams, structured tables, and mathematical formulations through one cohesive interface. This consolidated approach makes RAG-Anything particularly valuable for academic research, technical documentation, financial reports, and enterprise knowledge management where rich, mixed-content documents demand a unified processing framework.

Image from the RAG-Anything repository
🎯 Key Features
- 🔄 End-to-End Multimodal Pipeline — Complete workflow from document ingestion and parsing to intelligent multimodal query answering
- 📄 Universal Document Support — Seamless processing of PDFs, Office documents, images, and diverse file formats
- 🧠 Specialized Content Analysis — Dedicated processors for images, tables, mathematical equations, and heterogeneous content types
- 🔗 Multimodal Knowledge Graph — Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
- ⚡ Adaptive Processing Modes — Flexible MinerU-based parsing or direct multimodal content injection workflows
- 📋 Direct Content List Insertion — Bypass document parsing by directly inserting pre-parsed content lists from external sources (see the sketch right after this list)
- 🎯 Hybrid Intelligent Retrieval — Advanced search capabilities spanning textual and multimodal content with contextual understanding
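The direct content list insertion feature deserves a closer look. The sketch below shows how I understand it from the project's examples: you hand RAG-Anything a list of already-parsed items instead of a raw document. Treat the `insert_content_list` method name and the exact item fields as assumptions to verify against the current repository.

```python
from raganything import RAGAnything

# Hedged sketch of direct content list insertion, based on the project's examples.
# The insert_content_list() signature and the item fields are assumptions to verify.
async def insert_preparsed_content(rag: RAGAnything) -> None:
    content_list = [
        # Plain text segment with its page index
        {"type": "text", "text": "This chapter evaluates retrieval quality...", "page_idx": 0},
        # A table that was already extracted by an external parser
        {
            "type": "table",
            "table_body": "| Parser | Focus |\n|--------|-------|\n| MinerU | PDF/OCR |\n| Docling | Office/HTML |",
            "table_caption": ["Parser comparison"],
            "page_idx": 1,
        },
    ]
    # Insert the pre-parsed items directly, bypassing the parsing stage
    await rag.insert_content_list(
        content_list=content_list,
        file_path="external_source.pdf",  # reference name stored with the content
    )
```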
The following is the architecture of the tool:

Image from the RAG-Anything repository
In short, the tool does the following:
| Stage | Core Function | Key Highlights |
| ----------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **1. Document Parsing** | High-fidelity extraction and decomposition of documents. | Leverages **MinerU** for structural preservation; **Adaptive Content Decomposition** segments text, visuals, tables, and equations; supports **Universal Format Support** (PDFs, Office files, etc.). |
| **2. Content Understanding & Processing** | Autonomous categorization and parallel processing of content. | **Autonomous Content Categorization** routes content to optimized channels; uses a **Concurrent Multi-Pipeline** architecture for efficient text and multimodal processing; extracts and preserves the **Document Hierarchy**. |
| **3. Multimodal Analysis** | Deploys specialized units for heterogeneous data modalities. | Includes a **Visual Content Analyzer** (generating captions and extracting spatial relationships), a **Structured Data Interpreter** (for tables and data trends), and a **Mathematical Expression Parser** (with LaTeX support). |
| **4. Multimodal Knowledge Graph Index** | Transforms parsed content into structured semantic representations. | Performs **Multi-Modal Entity Extraction**; establishes **Cross-Modal Relationship Mapping**; preserves document hierarchy via "belongs_to" chains; applies **Weighted Relationship Scoring** for optimized retrieval. |
| **5. Modality-Aware Retrieval** | Hybrid system for comprehensive and contextually rich content delivery. | **Vector-Graph Fusion** integrates vector similarity search with graph traversal; uses **Modality-Aware Ranking** based on content type relevance; maintains **Relational Coherence** between retrieved elements. |
In essence, RAG-Anything moves beyond simple text-based retrieval by creating a deeply structured Multimodal Knowledge Graph that captures the full context and relationships within complex documents, leading to more accurate and coherent results.
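To make the modality-aware retrieval stage a little more concrete, here is a minimal sketch that runs the same question through several retrieval modes via `aquery`, the call used in the samples further down. The mode names ("naive", "local", "global", "hybrid") are the ones LightRAG exposes; treat them as an assumption to check against your installed version.

```python
# Minimal sketch: comparing retrieval modes through RAG-Anything's aquery().
# Assumes `rag` is an initialized RAGAnything instance (see the full samples below);
# the mode names are inherited from LightRAG and should be verified for your version.
async def compare_retrieval_modes(rag, question: str) -> dict:
    results = {}
    for mode in ("naive", "local", "global", "hybrid"):
        results[mode] = await rag.aquery(question, mode=mode)
    return results
```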
While exploring the provided samples, I was immediately drawn to the configuration below, which specifically highlights the option of using either MinerU or Docling for document processing.
The original sample is provided below:
```python
import asyncio
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc

async def main():
    # Set up API configuration
    api_key = "your-api-key"
    base_url = "your-base-url"  # Optional

    # Create RAGAnything configuration
    config = RAGAnythingConfig(
        working_dir="./rag_storage",
        parser="mineru",  # Parser selection: mineru or docling
        parse_method="auto",  # Parse method: auto, ocr, or txt
        enable_image_processing=True,
        enable_table_processing=True,
        enable_equation_processing=True,
    )

    # Define LLM model function
    def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
        return openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )

    # Define vision model function for image processing
    def vision_model_func(
        prompt, system_prompt=None, history_messages=[], image_data=None, messages=None, **kwargs
    ):
        # If messages format is provided (for multimodal VLM enhanced query), use it directly
        if messages:
            return openai_complete_if_cache(
                "gpt-4o",
                "",
                system_prompt=None,
                history_messages=[],
                messages=messages,
                api_key=api_key,
                base_url=base_url,
                **kwargs,
            )
        # Traditional single image format
        elif image_data:
            return openai_complete_if_cache(
                "gpt-4o",
                "",
                system_prompt=None,
                history_messages=[],
                messages=[
                    {"role": "system", "content": system_prompt}
                    if system_prompt
                    else None,
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{image_data}"
                                },
                            },
                        ],
                    }
                    if image_data
                    else {"role": "user", "content": prompt},
                ],
                api_key=api_key,
                base_url=base_url,
                **kwargs,
            )
        # Pure text format
        else:
            return llm_model_func(prompt, system_prompt, history_messages, **kwargs)

    # Define embedding function
    embedding_func = EmbeddingFunc(
        embedding_dim=3072,
        max_token_size=8192,
        func=lambda texts: openai_embed(
            texts,
            model="text-embedding-3-large",
            api_key=api_key,
            base_url=base_url,
        ),
    )

    # Initialize RAGAnything
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_model_func,
        vision_model_func=vision_model_func,
        embedding_func=embedding_func,
    )

    # Process a document
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output",
        parse_method="auto"
    )

    # Query the processed content
    # Pure text query - for basic knowledge base search
    text_result = await rag.aquery(
        "What are the main findings shown in the figures and tables?",
        mode="hybrid"
    )
    print("Text query result:", text_result)

    # Multimodal query with specific multimodal content
    multimodal_result = await rag.aquery_with_multimodal(
        "Explain this formula and its relevance to the document content",
        multimodal_content=[{
            "type": "equation",
            "latex": "P(d|q) = \\frac{P(q|d) \\cdot P(d)}{P(q)}",
            "equation_caption": "Document relevance probability"
        }],
        mode="hybrid"
    )
    print("Multimodal query result:", multimodal_result)

if __name__ == "__main__":
    asyncio.run(main())
```
The sample above served as the foundation for my work. It defaults to "gpt-4o", but the crucial element for me was the flexibility it offers at the parsing stage: it explicitly names both MinerU and Docling. While I know Docling well, MinerU is completely new to me, which makes it a high priority for future investigation!
📄 RAG-Anything Parser Comparison
RAG-Anything offers flexibility by supporting multiple parsers, with MinerU and Docling being the key options, each optimized for different strengths in document analysis and extraction.
| Feature | MinerU Parser | Docling Parser |
| ------------------------ | ------------------------------------------------------------ | ----------------------------------------------------------- |
| **Primary Format Focus** | PDF, Images, Office Documents, and various other formats. | Optimized for **Office Documents** and **HTML files**. |
| **OCR & Tables** | **Powerful** OCR and advanced table extraction capabilities. | Standard capabilities. |
| **Document Structure** | High-fidelity extraction. | **Better** preservation of the original document structure. |
| **Performance** | Supports **GPU acceleration** for faster processing. | Optimized for the specified formats. |
| **Native Support** | Wide range of formats. | **Native support** for multiple Office formats. |
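In practice, switching between the two parsers is a one-line configuration change. The sketch below reuses the `RAGAnythingConfig` fields from the samples in this post; only the `parser` value differs.

```python
from raganything import RAGAnythingConfig

# Same configuration as in the samples in this post; only `parser` changes.
# "mineru" targets OCR-heavy PDFs and images, "docling" targets Office/HTML documents.
docling_config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="docling",      # or "mineru"
    parse_method="auto",   # auto, ocr, or txt
    enable_image_processing=True,
    enable_table_processing=True,
    enable_equation_processing=True,
)
```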
Implementation Attempt with Ollama and Docling
🪚 A Note on Progress: I’ve titled this section an “Attempt” because, frankly, I’m still battling some dependency and configuration challenges to get this code running smoothly. However, the architecture is sound, and I’m confident I’m just a few tweaks away from a successful local deployment!
- These are the package installations required by the original sample:
```bash
# Basic installation
pip install raganything

# With optional dependencies for extended format support:
pip install 'raganything[all]'          # All optional features
pip install 'raganything[image]'        # Image format conversion (BMP, TIFF, GIF, WebP)
pip install 'raganything[text]'         # Text file processing (TXT, MD)
pip install 'raganything[image,text]'   # Multiple features
```
- Below are the additional packages required to use it with Ollama and Docling:
```bash
pip install ollama
pip install 'docling[all]'
```
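- The Ollama models themselves also need to be present locally. A minimal sketch for pulling them, assuming the ollama Python package installed above and a server on the default port:

```python
import ollama

# Sketch: make sure the two local models used below are available, assuming the
# ollama Python package and the default server at http://localhost:11434.
client = ollama.Client(host="http://localhost:11434")
for model in ("llama3.2-vision", "embeddinggemma"):
    print(f"Ensuring {model} is available locally...")
    client.pull(model)  # downloads the model if it is not already present
```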
- And here is the code I implemented, which still has some version and dependency problems so far 😢. Beyond the Docling-specific parts, the essential changes centered on shifting the entire stack to local, open-source models: integrating Ollama, with llama3.2-vision serving as the Vision-Language Model (VLM) and embeddinggemma generating the embeddings.
```python
import asyncio
import os
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.ollama import ollama_complete_if_cache, ollama_embed
from lightrag.utils import EmbeddingFunc
from functools import partial

OLLAMA_VLM_MODEL = "llama3.2-vision"
OLLAMA_EMBEDDING_MODEL = "embeddinggemma"
OLLAMA_BASE_URL = "http://localhost:11434"
INPUT_FOLDER = "./file_path"
OUTPUT_FOLDER = "./output_dir"

# ---
import base64

def encode_to_base64(file_path):
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Text LLM function, defined at module level so vision_model_func can fall back to it
llm_model_func = partial(
    ollama_complete_if_cache,
    OLLAMA_VLM_MODEL,
    base_url=OLLAMA_BASE_URL,
)

def vision_model_func(
    prompt, system_prompt=None, history_messages=[], image_data=None, messages=None, **kwargs
):
    # If a full messages list is provided (VLM-enhanced query), pass it through directly
    if messages:
        return ollama_complete_if_cache(
            OLLAMA_VLM_MODEL,
            "",
            system_prompt=None,
            history_messages=[],
            messages=messages,
            base_url=OLLAMA_BASE_URL,
            **kwargs,
        )
    # Traditional single-image format
    elif image_data:
        content_list = []
        if prompt:
            content_list.append({"type": "text", "text": prompt})
        content_list.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_data}"
            }
        })
        messages_to_send = [
            {"role": "system", "content": system_prompt} if system_prompt else None,
            {"role": "user", "content": content_list}
        ]
        messages_to_send = [m for m in messages_to_send if m is not None]  # Filter out None
        return ollama_complete_if_cache(
            OLLAMA_VLM_MODEL,
            "",
            system_prompt=None,
            history_messages=[],
            messages=messages_to_send,
            base_url=OLLAMA_BASE_URL,
            **kwargs,
        )
    # Pure text format
    else:
        return llm_model_func(prompt, system_prompt, history_messages, **kwargs)

async def main():
    config = RAGAnythingConfig(
        working_dir="./rag_storage",
        parser="docling",
        parse_method="auto",
        enable_image_processing=True,
        enable_table_processing=True,
        enable_equation_processing=True,
    )

    embedding_func = EmbeddingFunc(
        embedding_dim=768,  # embeddinggemma returns 768-dimensional vectors
        max_token_size=8192,
        func=lambda texts: ollama_embed(
            texts,
            model=OLLAMA_EMBEDDING_MODEL,
            base_url=OLLAMA_BASE_URL,
        ),
    )

    # llm_model_func and vision_model_func are defined at module level above
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_model_func,
        vision_model_func=vision_model_func,  # Re-enabled
        embedding_func=embedding_func,
    )

    if not os.path.isdir(INPUT_FOLDER):
        print(f"Error: Input folder {INPUT_FOLDER} does not exist. Please create it and add documents.")
        return

    print(f"Starting processing of documents in {INPUT_FOLDER} using {OLLAMA_VLM_MODEL}...")
    # process_document_complete expects a single file_path (see the original sample),
    # so loop over the files in the input folder
    for file_name in sorted(os.listdir(INPUT_FOLDER)):
        file_path = os.path.join(INPUT_FOLDER, file_name)
        if not os.path.isfile(file_path):
            continue
        await rag.process_document_complete(
            file_path=file_path,
            output_dir=OUTPUT_FOLDER,
            parse_method="auto"
        )
    print("Document processing complete.")
    print("-" * 30)

    text_query = "What are the main findings shown in the document content?"
    print(f"Running text query (LLM mode): '{text_query}'")
    text_result = await rag.aquery(
        text_query,
        mode="hybrid"
    )
    print("\nText Query Result:")
    print(text_result)
    print("-" * 30)

    multimodal_query = "Explain this formula and its relevance to the document content"
    print(f"Running multimodal query (VLM mode): '{multimodal_query}'")
    multimodal_result = await rag.aquery_with_multimodal(
        multimodal_query,
        multimodal_content=[{
            "type": "equation",
            "latex": "P(d|q) = \\frac{P(q|d) \\cdot P(d)}{P(q)}",
            "equation_caption": "Document relevance probability (Bayes' Theorem example)"
        }],
        mode="hybrid"
    )
    print("\nMultimodal Query Result:")
    print(multimodal_result)

if __name__ == "__main__":
    asyncio.run(main())
```
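One detail that is easy to get wrong is the embedding dimension passed to `EmbeddingFunc`: the original sample uses 3072 for text-embedding-3-large, while embeddinggemma returns 768-dimensional vectors. Below is a minimal sketch for checking the dimension your Ollama model actually produces, assuming the ollama Python package and the default local endpoint.

```python
import ollama

# Sketch: probe the embedding dimension of the local Ollama model before
# hard-coding it into EmbeddingFunc (assumes the default server address).
client = ollama.Client(host="http://localhost:11434")
response = client.embeddings(model="embeddinggemma", prompt="dimension probe")
print(f"embeddinggemma embedding dimension: {len(response['embedding'])}")
```

Whatever number this prints is the value that belongs in `embedding_dim` above.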
💡 Conclusion: The Essence of RAG-Anything
RAG-Anything is fundamentally an All-in-One Multimodal Document Processing RAG system that elevates traditional RAG by mastering heterogeneous content.
Built upon LightRAG, its essence lies in providing a unified, end-to-end framework that eliminates the need for multiple specialized tools. This system is architected around a powerful multi-stage multimodal pipeline that ensures comprehensive understanding and retrieval across all content types — including text, images, tables, and mathematical equations.
Key Takeaways:
- Unified Multimodal Processing: It offers seamless processing and querying across diverse formats (PDFs, Office docs, Images, etc.) within a single integrated solution.
- Structured Knowledge Creation: It moves beyond simple chunks by constructing a Multimodal Knowledge Graph, which automatically extracts entities and maps cross-modal relationships for enhanced understanding.
- Intelligent Retrieval: The system utilizes Hybrid Intelligent Retrieval, combining vector search with graph traversal, and employs Modality-Aware Ranking to deliver contextually coherent and highly relevant answers.
- Flexibility and Extensibility: It offers adaptive parsing (via MinerU or Docling) and supports VLM-Enhanced Queries, ensuring it is adaptable for technical documentation, research, and enterprise knowledge management.
In short, RAG-Anything is designed to handle the complexity of modern, mixed-content documents, providing researchers and developers with a single, powerful tool for Next-Generation Multimodal Intelligence.
⭐ Next Step: Your version of RAG-Anything
Now that you’ve seen the power and architecture of RAG-Anything, it’s time to build your own adaptation: why not integrate llama3.2-vision and embeddinggemma into your own local RAG pipeline and join the journey?
Thanks for reading 🙏
Links
- RAG-Anything: https://github.com/HKUDS/RAG-Anything
- LightRAG: https://github.com/HKUDS/LightRAG
- Data Intelligence Lab@HKU: https://github.com/HKUDS