Gao Dalie (Ilyass)
PaddleOCR VL + RAG: Revolutionize Complex Data Extraction (Open-Source)

Not even a month ago, I made a video about MistralOCR that many of you liked. 

After that, a follower reached out with a problem they were having with an OCR Chatbot. I figured this was a common issue, so I decided to make a new video to help them and other developers.

When documents contain complex tables, mathematical formulas, or multi-column layouts, traditional OCR tools often generate messy content that requires manual sorting.

Then, just last week, I was browsing GitHub and came across Baidu's newly open-sourced PaddleOCR-VL-0.9B. 

I'll be honest - when I saw it had only 0.9 billion parameters, my first thought was "Oh, another small model joining the fun?" But out of professional curiosity, I had to ask: could this one actually deliver? What I found completely stunned me.

This isn't just OCR, it's a leap in document understanding

PaddleOCR-VL completely exceeded my expectations. It took first place overall on OmniDocBench v1.5, a widely used document-parsing benchmark, with a score of 92.6. Its reported inference speed is 14.2% higher than MinerU2.5 and 253% higher than dots.ocr.

My immediate impression was that it is remarkably accurate - accurate enough to justify that first-place ranking.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check out the video to see the demo in action.

Today, I'll be putting PaddleOCR-VL to the test on four key challenges: Formula Recognition, Table Recognition, Reading Order, and Handwritten Text.

Let's start with Formula Recognition. I've uploaded an image containing complex mathematical formulas. As you can see, the model handles them exceptionally well - accurately interpreting superscripts, subscripts, and even very long, intricate expressions.

Next up is Table Recognition.

This is a notoriously difficult problem, and there are many types of tables, sometimes with borders and sometimes without, containing numerous numbers that are very easy for models to misinterpret. I used PaddleOCR-VL on several table examples and found its accuracy to be genuinely impressive.

Another major challenge is understanding document Structure and Reading Order. In modern documents, content is not only more complex but also comes in highly varied layouts. Think multi-column designs, mixed text and images, folds, color printing, tilted scans, and handwritten annotations - all of which complicate OCR. The correct reading order isn't always a simple top-to-bottom, left-to-right flow.

The PaddleOCR-VL technical report demonstrates how the model can understand these complex structures, almost like a human. Whether it's an academic paper, a multi-column newspaper, or a technical report, it intelligently analyzes the layout and restores a reading order that matches human intuition.

Finally, PaddleOCR-VL remains extremely stable even with more complex layouts. Take this handwritten note, for example. It combines text, numbers, paragraphs, and images in a layout with left-right and top-bottom columns that typically only a human could decipher.

What Makes PaddleOCR-VL Unique?

PaddleOCR-VL goes beyond simple text recognition: it genuinely "understands" document structure. Whether it's an academic paper, a multi-column newspaper, or a technical report, PaddleOCR-VL intelligently parses the layout and automatically organises the content in the correct order.

At the same time, it accurately extracts complex content information, such as tables, mathematical formulas, handwritten notes, and chart data in documents. It converts them into structured data that can be directly used.
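To make "structured data that can be directly used" concrete, here's a minimal sketch of packaging recognized lines into JSON. The `rec_texts`/`rec_scores` key names mirror what I saw in PaddleOCR's output later in this post; treat the exact schema, and the `to_json` helper itself, as illustrative assumptions rather than the library's contract.

```python
import json

def to_json(rec_texts, rec_scores, source):
    # Pair each recognized line with its confidence score
    # and wrap everything in a single JSON record.
    record = {
        "source": source,
        "lines": [{"text": t, "score": s} for t, s in zip(rec_texts, rec_scores)],
    }
    return json.dumps(record, ensure_ascii=False)

# Hypothetical OCR output for a one-page report image.
doc = to_json(["Revenue", "2024: 1.2M"], [0.98, 0.95], "report.png")
```

From here, the record can be written to disk or fed straight into a downstream pipeline.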

In addition, it supports recognition of 109 languages, covering multilingual scenarios such as Chinese, English, French, Japanese, Russian, Arabic, and Spanish, greatly improving the model's recognition and processing capabilities in multilingual documents.

How PaddleOCR-VL Is Built

PaddleOCR-VL consists of two parts: PP-DocLayoutV2 and PaddleOCR-VL-0.9B.

The core component is PaddleOCR-VL-0.9B, which combines a pre-trained visual encoder, a dynamic-resolution preprocessor, a two-layer MLP projector, and a pre-trained large language model.

The preprocessor supports native dynamic high resolution, and the NaViT-style visual encoder accepts images at their native input resolution. This design reduces hallucinations and improves the performance of PaddleOCR-VL-0.9B. The projector then maps the visual encoder's features into the language model's embedding space.

In an autoregressive language model, the entire sequence is generated by predicting one token at a time. This means that the size of the decoder directly affects the overall inference latency, so smaller models decode faster.
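That relationship can be sketched in a few lines. The per-token times below are made-up illustrative numbers, not measurements of either model:

```python
def decode_latency(n_tokens: int, t_per_token: float) -> float:
    # Autoregressive generation is sequential: one forward pass per token,
    # so total latency scales linearly with output length and per-token cost.
    return n_tokens * t_per_token

# Hypothetical per-token costs (seconds) for a small vs. a large decoder.
small = decode_latency(500, 0.004)  # e.g. a ~1B-parameter decoder
large = decode_latency(500, 0.020)  # e.g. a much larger decoder
```

Because every output token waits on the previous one, shrinking the decoder pays off on every single token, which is why a 0.9B model can feel so much snappier on long documents.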

Let's start coding

Let's now walk through, step by step, how to build the app. First, we install the libraries that support the model with a pip install:

!pip uninstall -y torch paddlepaddle paddlepaddle-gpu
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
!pip install paddleocr paddlepaddle
!pip install langchain langchain-community langchain-openai faiss-cpu sentence-transformers openai python-dotenv

The next step is the usual one: we will import the relevant libraries, whose significance will become evident as we proceed, and perform some basic configuration.

PaddleOCR: converts documents and images into structured, AI-friendly data (like JSON and Markdown) with industry-leading accuracy - powering AI applications.

import torch
from paddleocr import PaddleOCR
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document

So I built this SimpleRAG system that combines PaddleOCR for text extraction with OpenAI for answering queries. Let me walk you through what I developed here.

In the initialisation, I set up the core components - I'm using HuggingFace's BGE embeddings for vector representations and GPT-4o as the chat model with zero temperature for consistent responses. I initialize placeholders for the vectorstore and QA chain that we'll build later.

Now, for the extraction method: first I tried the HuggingFace transformers version of PaddleOCR, which threw a weird error about image tokens not matching. Then installing PaddlePaddle actually broke PyTorch (I had to restart the runtime and reinstall everything in the right order). After that, I kept guessing at the API, because the old methods were deprecated and the new ones had different parameters.

The real breakthrough came when I just printed out what the result object actually looked like - turns out it's just a list with one dictionary inside, and that dictionary has a key called rec_texts which is literally just a list of all the text strings that were found in the image.

So instead of trying to access some complex nested object structure with .boxes.text, I just needed to check whether each result element was a dictionary, grab the rec_texts key, and extend my list with those strings.
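Here's a tiny, self-contained illustration of that shape. The mock object below is mine; only the list-of-dicts layout and the `rec_texts` key come from what I observed in the real output:

```python
# Mocked to match the shape I saw from predict(): a list holding one dict,
# whose 'rec_texts' key is a flat list of the recognized strings.
mock_result = [{"rec_texts": ["Invoice #42", "Total: $19.99"],
                "rec_scores": [0.99, 0.97]}]

text_lines = []
for res in mock_result:
    # Guard against other element types, then flatten the recognized strings.
    if isinstance(res, dict) and "rec_texts" in res:
        text_lines.extend(res["rec_texts"])

text = "\n".join(text_lines)
```

Printing the result object yourself, as I ended up doing, remains the fastest way to confirm the shape your installed version actually returns.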

class SimpleRAG:
    def __init__(self):
        self.embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.vectorstore = None
        self.qa_chain = None
        self.ocr = PaddleOCR(use_textline_orientation=True, lang='en')

    def extract_text_from_images(self, image_paths: list):
        docs = []
        for path in image_paths:
            result = self.ocr.predict(input=path)

            text_lines = []
            for res in result:
                if isinstance(res, dict) and 'rec_texts' in res:
                    text_lines.extend(res['rec_texts'])

            text = "\n".join(text_lines) if text_lines else "No text found"
            docs.append(Document(page_content=text, metadata={'source': path}))

        return docs

In build_index, extract text from all images, split the documents into 1000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, create a FAISS vectorstore with BGE embeddings, and set up a RetrievalQA chain that uses GPT-4o and retrieves the top 3 relevant chunks per query.
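To see what those two numbers do, here's a deliberately naive fixed-window chunker. It is a stand-in for illustration only: RecursiveCharacterTextSplitter additionally prefers natural boundaries (paragraphs, then sentences) before falling back to raw characters.

```python
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Each window starts 'size - overlap' characters after the previous one,
    # so consecutive chunks share 'overlap' characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk("x" * 2500)  # 2500 characters -> 3 overlapping chunks
```

The overlap matters for RAG: a fact that straddles a chunk boundary still appears intact in at least one chunk, so the retriever can find it.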

For a query, I just pass the question to the QA chain, which handles retrieval and generation, returning the answer.

    def build_index(self, image_paths: list):
        docs = self.extract_text_from_images(image_paths)

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        splits = text_splitter.split_documents(docs)

        self.vectorstore = FAISS.from_documents(splits, self.embeddings)
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3})
        )

    def query(self, question: str):
        return self.qa_chain.invoke(question)

# Usage
rag = SimpleRAG()
rag.build_index(["Your pic"])
answer = rag.query("Extract all the tables")
print(answer)

Conclusion

In this era of rapidly advancing AI technology, we're often bombarded with hype about "the most powerful ever" and "disruptive." However, truly valuable breakthroughs often come from innovations that solve specific problems and make technology easier to use.

PaddleOCR-VL may not make mainstream headlines, but for developers who need to process documents every day, it may be the long-awaited solution.

After all, the best technologies are those that are quietly integrated into daily work, making you hardly aware of their existence. PaddleOCR-VL is taking a solid step in this direction.

🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 consulting call with me.

I would highly appreciate it if you:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai

Support the Content (every dollar goes back into the video): https://buymeacoffee.com/gaodalie98d

Subscribe to the Newsletter for free: https://substack.com/@gaodalie

Top comments (1)

roshan sharma

That’s awesome! super clean setup and love how you paired PaddleOCR-VL with RAG for real-world use. The debugging bit with rec_texts was a nice touch too. Curious though, did you try testing it on multilingual docs yet?