In this Story, I have a super quick tutorial showing you how to use LightRAG and multimodal RAG to build a powerful agent chatbot for your business or personal use.
Recently, I have been working on agentic AI across everything from multimodal processing to text generation, looking at the latest techniques for building advanced agents.
Modern knowledge agents are no longer faced with simple, plain-text documents every day, but rather with complex information carriers containing rich visual elements, structured data, and multimedia content.
These documents often contain various forms of information, such as text descriptions, chart analyses, data statistics, and formula derivations, which complement each other and jointly form a complete knowledge system.
RAG works well for answering questions with plain text, but it has big problems with images, charts, and tables. Most RAG systems use OCR to convert these into text, but this process loses important details such as layout, colours, and how items are placed.
This leads to poor retrieval accuracy and a weak understanding of visual semantics, structured data, and cross-modal relationships. Plus, processing such documents is inefficient, requiring multiple tools and resulting in complex workflows that hinder practical use.
That’s where RAG-Anything comes in. Billed as an all-in-one RAG system, it was built to be a complete multimodal RAG system that effectively solves the various limitations of traditional RAG in processing complex documents.
The entire system adopts a unified technical framework to truly move multimodal document processing from a laboratory proof of concept to an engineering solution that can be deployed in practice.
So, let me give you a quick demo of a live chatbot to show you what I mean.
Check out the video:
I will ask the chatbot: “Give me a Q3 FY25 Financial Summary.” If you take a look at how the chatbot generates the output, you’ll see that the agent uses RAG-Anything to handle both text and image inputs.
It uses a small model for text, and if the input includes images, it switches to the vision model. It also builds a custom embedding function using the text-embedding-3-large model.
It processes a PDF file and automatically pulls out useful data — like words, tables, or images — and stores everything in a neat structure in a local folder. It then uses hybrid mode, combining both keyword and semantic search, to find the best answer.
In the second example, if you take a look at how the chatbot generates the output, you’ll see that the agent uses the get_llm_model_func to create a clean language model handler. Then, the agent detects if the input includes an image—if so, it switches to the vision model. If there’s no image, it defaults to the faster, text-only model.
The agent then initializes LightRAG to manage files, vector embeddings, and metadata and uses two processors: Image Modal Processor, which analyzes images with captions and footnotes in context, and Table Modal Processor, which reads and interprets structured table data using the language model.
So, by the end of this Story, you will understand what RAG-Anything is, how it works, and how we are going to use LightRAG and multimodal RAG to create a powerful agentic chatbot.
What is RAG-Anything:
RAG-Anything is an open-source multimodal document processing system. It is built on the LightRAG framework and aims to address the limitation of traditional retrieval-augmented generation systems that can only process text.
As an “all-in-one” RAG system, it is the first to achieve unified parsing and semantic understanding of heterogeneous content such as text, tables, charts, and formulas in formats such as PDF, Office documents, and images.
It provides an end-to-end solution from document ingestion to intelligent query through innovative multimodal knowledge graphs, a flexible parsing architecture, and hybrid retrieval mechanisms, significantly improving the ability to process complex documents.
How it works:
RAG-Anything is based on an innovative three-stage technical architecture, which breaks through the technical bottlenecks of traditional RAG systems in multimodal document processing and realizes true end-to-end intelligent processing.
Multimodal document parsing:
It uses an advanced structured extraction engine based on MinerU 2.0 to intelligently parse complex documents, accurately identifying their hierarchical structure, segmenting text, locating images, parsing tables, and recognizing mathematical formulas. Through a standardized intermediate format, it ensures consistent processing across document types while preserving semantic integrity.
With a built-in multimodal engine, the system offers a deep understanding of visual content, intelligent table parsing, and precise interpretation of LaTeX-formatted formulas. It also supports flowcharts, code snippets, and geographic data, all unified under a cross-modal knowledge framework for seamless semantic analysis.
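To make the parsing stage concrete, here is a minimal sketch of how it is invoked through RAG-Anything's process_document_complete() method (shown in full later in this Story). The file path is a placeholder, and the rag instance is assumed to already be configured as in the later code.

from raganything import RAGAnything

# Sketch of the parsing stage on its own; "any_report.pdf" is a placeholder
# and `rag` is assumed to be a fully configured RAGAnything instance.
async def parse_only(rag: RAGAnything, pdf_path: str = "./any_report.pdf"):
    await rag.process_document_complete(
        file_path=pdf_path,       # PDF, Office document, or image
        output_dir="./output",    # where the parsed intermediate output is written
        parse_method="auto",      # let the MinerU-based parser choose a strategy
    )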
Cross-modal knowledge construction:
RAG-Anything models multimodal content as a structured knowledge graph, effectively breaking down traditional information silos. It uses entity-based modelling to abstract diverse content — like text, charts, and formulas — into unified knowledge entities with complete context, source IDs, and type attributes.
Through semantic analysis, it intelligently builds relationships between different content types, forming a rich, multi-level knowledge network. With a dual-storage system combining graph and vector databases, it enables both structured queries and semantic retrieval, powering advanced question-answering capabilities.
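A quick, informal way to see this dual-storage idea is to look inside the working directory after a document has been processed (./rag_storage in the code later in this Story). The exact file names depend on the LightRAG version and configured backends, so treat this as an illustrative check rather than a contract.

from pathlib import Path

# Illustrative check: list the artifacts LightRAG persisted after processing.
# Expect a mix of graph storage and vector/key-value stores; exact names vary
# by LightRAG version and storage backend.
for artifact in sorted(Path("./rag_storage").iterdir()):
    print(artifact.name)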
Retrieval and generation:
RAG-Anything uses a two-level retrieval and question-answering mechanism to deliver precise understanding and rich, multi-dimensional responses to complex queries. By combining fine-grained information extraction with high-level semantic understanding, it enhances both the breadth of retrieval and the depth of generation in multimodal document analysis.
It extracts keywords hierarchically — capturing detailed entities and abstract concepts — and employs a mixed retrieval strategy that includes entity matching, semantic expansion, and vector similarity search.
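In code, this retrieval strategy is exposed through the query mode. The example later in this Story uses mode="hybrid"; LightRAG-based retrieval also offers other modes (for example, purely vector-based or graph-oriented retrieval), and the exact set depends on your LightRAG version. A rough sketch of comparing modes on the same question might look like this; the question string is just a placeholder.

# Sketch: comparing retrieval modes on the same question. Assumes a configured
# RAGAnything instance `rag` that has already ingested a document; the mode
# names come from LightRAG and may vary by version.
async def compare_modes(rag, question: str = "Summarize the key financial highlights"):
    for mode in ("naive", "local", "global", "hybrid"):
        answer = await rag.query_with_multimodal(question, mode=mode)
        print(f"--- {mode} ---\n{answer}\n")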
Let’s Start Coding
Let us now explore, step by step, how to use RAG-Anything and multimodal RAG. First, we will install the library that supports the model with a quick pip install:
!pip install raganything
The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.
RAG-Anything implements an effective multi-stage multimodal pipeline that fundamentally extends traditional RAG architectures to seamlessly handle diverse content modalities through intelligent orchestration and cross-modal understanding.
from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc
from raganything.modalprocessors import ImageModalProcessor, TableModalProcessor
We define an asynchronous main() function to organize the workflow. Inside this function, we prepare the necessary API key and set up the fields for connecting to OpenAI’s services.
Then we create a RAGAnything instance that can handle both text and image-based prompts. For that, we configure a text-based language model using OpenAI’s gpt-4o-mini, and a vision model that switches to gpt-4o if image data is provided.
We also build a custom embedding function that uses the text-embedding-3-large model to transform input text into high-dimensional vectors, which helps with accurate document searches later.
Next, we process a PDF file by calling the process_document_complete() method. It automatically decides how to extract data like text, tables, and images from the file and stores that structured output in a local folder.
Finally, we create a query that asks for a financial summary of Q3 FY25 using the query_with_multimodal() method in hybrid mode, which uses both keyword and semantic search to find the best answer.
import asyncio

from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc


async def main():
    api_key = ""
    base_url = None

    # Initialize RAGAnything
    rag = RAGAnything(
        working_dir="./rag_storage",
        # Text-only model: gpt-4o-mini handles plain text prompts
        llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            **kwargs,
        ),
        # Vision model: switch to gpt-4o whenever image data is present
        vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
            "gpt-4o",
            "",
            system_prompt=None,
            history_messages=[],
            messages=[
                {"role": "system", "content": system_prompt} if system_prompt else None,
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
                ]} if image_data else {"role": "user", "content": prompt}
            ],
            api_key=api_key,
            **kwargs,
        ) if image_data else openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            **kwargs,
        ),
        # Embedding function: text-embedding-3-large produces 3072-dim vectors
        embedding_func=EmbeddingFunc(
            embedding_dim=3072,
            max_token_size=8192,
            func=lambda texts: openai_embed(
                texts,
                model="text-embedding-3-large",
                api_key=api_key,
                base_url=base_url,
            ),
        ),
    )

    # Process a document
    await rag.process_document_complete(
        file_path="./NVDA-F3Q25-Quarterly-Presentation-FINAL.pdf",
        output_dir="./output",
        parse_method="auto"
    )

    # Query the processed content
    result = await rag.query_with_multimodal(
        "give me a Q3 FY25 Financial Summary",
        mode="hybrid"
    )
    print(result)


if __name__ == "__main__":
    # In a notebook you can simply `await main()`; in a script use asyncio.run
    asyncio.run(main())
Direct Multimodal Content Processing
We make a function, get_llm_model_func, that takes an API key and an optional base URL as input. It is designed to return a clean, ready-to-use language model handler.
We build this handler using a lambda function that wraps openai_complete_if_cache, which allows the system to send prompts to the gpt-4o-mini model while also supporting features like system prompts and conversational history. We create this structure so that other parts of the RAG application can plug in this function without needing to know how OpenAI is being called or configured behind the scenes.
def get_llm_model_func(api_key: str, base_url: str = None):
    return (
        lambda prompt,
        system_prompt=None,
        history_messages=[],
        **kwargs: openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )
    )
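As a quick sanity check, the returned handler can be awaited directly before wiring it into anything else. This is just a minimal sketch; the prompt and the API key placeholder are examples.

import asyncio

# Minimal smoke test of the text-only handler; the prompt is just an example.
async def smoke_test():
    llm = get_llm_model_func(api_key="Your_api")
    reply = await llm("In one sentence, what does a RAG system do?")
    print(reply)

asyncio.run(smoke_test())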
Next, we develop the vision handler using a lambda function that adapts to the presence of image data. When an image is included in the query, the system automatically switches to the more powerful gpt-4o model, which is capable of understanding both visual and textual input.
We create a structured message format that combines the user’s text prompt and the base64-encoded image into a single query that gpt-4o can understand.
If no image is provided, the function defaults to using gpt-4o-mini, a lighter and faster model optimized for text-only tasks.
def get_vision_model_func(api_key: str, base_url: str = None):
    return (
        lambda prompt,
        system_prompt=None,
        history_messages=[],
        image_data=None,
        **kwargs: openai_complete_if_cache(
            "gpt-4o",
            "",
            system_prompt=None,
            history_messages=[],
            messages=[
                {"role": "system", "content": system_prompt} if system_prompt else None,
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_data}"
                            },
                        },
                    ],
                }
                if image_data
                else {"role": "user", "content": prompt},
            ],
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )
        if image_data
        else openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )
    )
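To pass an image through this handler, the file first has to be read and base64-encoded. Here is a minimal sketch; chart.jpg is a placeholder path and the API key placeholder is an example.

import asyncio
import base64

# Sketch: sending an image plus a text prompt through the vision handler.
# "chart.jpg" is a placeholder path.
async def describe_image():
    with open("chart.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    vision = get_vision_model_func(api_key="Your_api")
    description = await vision(
        "Describe the key trend shown in this chart.",
        image_data=image_b64,
    )
    print(description)

asyncio.run(describe_image())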
We then use the API key and initialize a LightRAG instance, which manages files, vector storage, and metadata inside a local directory.
We create two model functions, one for handling image-based prompts and another for regular text-based input, wrapping OpenAI’s GPT-4o and GPT-4o-mini models.
We then create an ImageModalProcessor that takes an image path, caption, and footnote and generates a meaningful description, analyzing the image in the context of its document.
Then we create another processor, TableModalProcessor, to handle table data. It reads structured table content, along with its caption and footnote, and interprets it using the LLM.
import asyncio

from lightrag import LightRAG
from raganything.modalprocessors import ImageModalProcessor, TableModalProcessor


async def process_multimodal_content():
    api_key = "Your_api"
    base_url = None

    # Initialize LightRAG
    rag = LightRAG(
        working_dir="./rag_storage",
        # ... your LLM and embedding configurations
    )
    await rag.initialize_storages()

    vision_model_func = get_vision_model_func(api_key, base_url)
    llm_model_func = get_llm_model_func(api_key, base_url)

    # Process an image
    image_processor = ImageModalProcessor(
        lightrag=rag,
        modal_caption_func=vision_model_func
    )
    image_content = {
        "img_path": "image.jpg",
        "img_caption": ["Figure 1: Reconciliation of Non-GAAP to GAAP Financial Measures (contd.)"],
        "img_footnote": ["Reconciliation of Non-GAAP to GAAP Financial Measures (contd.)"]
    }
    description, entity_info = await image_processor.process_multimodal_content(
        modal_content=image_content,
        content_type="image",
        file_path="NVDA-F3Q25-Quarterly-Presentation-FINAL.pdf",
        entity_name="Experimental Results Figure"
    )

    # Process a table
    table_processor = TableModalProcessor(
        lightrag=rag,
        modal_caption_func=llm_model_func
    )
    table_content = {
        "table_body": """
        | Gross Margin | Non-GAAP | Acquisition-Related and Other Costs (A) |
        |--------------|----------|-----------------------------------------|
        | Q3 FY 2024   | 75.0%    | 0.7                                     |
        | Q4 FY 2024   | 76.7%    | 0.5                                     |
        """,
        "table_caption": ["Performance Comparison"],
        "table_footnote": ["Results on test dataset"]
    }
    description, entity_info = await table_processor.process_multimodal_content(
        modal_content=table_content,
        content_type="table",
        file_path="research_paper.pdf",
        entity_name="Performance Results Table"
    )


if __name__ == "__main__":
    # Run the multimodal processing pipeline defined above
    asyncio.run(process_multimodal_content())
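Once the image and table have been processed into LightRAG’s storage, they can be queried like any other ingested content. Here is a minimal sketch using LightRAG’s query API (the available QueryParam options depend on your LightRAG version), with an example question based on the table above.

from lightrag import QueryParam

# Sketch: querying the knowledge base that the modal processors populated.
# Assumes `rag` is the initialized LightRAG instance from the code above.
async def ask(rag, question: str):
    answer = await rag.aquery(question, param=QueryParam(mode="hybrid"))
    print(answer)

# For example:
# await ask(rag, "How did the Non-GAAP gross margin change from Q3 to Q4 FY 2024?")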
Conclusion:
As an innovative multimodal RAG system, RAG-Anything effectively solves the limitations of traditional RAG systems in processing complex documents through an end-to-end processing pipeline, a multimodal content analysis engine, and a hybrid retrieval mechanism based on knowledge graphs.
It supports unified processing of multiple formats such as PDF, Office documents, and images, can accurately parse heterogeneous content such as text, tables, charts, and formulas, and enables intelligent question answering through cross-modal semantic associations.
🧙♂️ I am a generative AI expert! If you want to collaborate on a project, drop an inquiry here or Book a 1-on-1 Consulting Call With Me.
I would highly appreciate it if you:
❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI
- Book an Appointment with me: https://topmate.io/gaodalie_ai
- Support the content (every dollar goes back into the video): https://buymeacoffee.com/gaodalie98d
- Subscribe to the newsletter for free: https://substack.com/@gaodalie