Adrien Laugueux

Beyond OCR: Building a Truly Multimodal Local RAG Pipeline

If you've ever tried to build a document chatbot over a collection of scanned reports, technical manuals, or mixed-content PDFs, you've probably run into the same wall: classic RAG pipelines are essentially blind.

They extract text, chunk it, embed it, and retrieve it — but the moment your document contains a scanned table, a wiring diagram, or an annotated chart, that information either gets mangled by OCR or vanishes entirely. The retrieved context is impoverished, and your chatbot's answers reflect that. Ask it about the diagram on page 12 and it will confidently summarise the paragraph next to it, which is arguably worse than saying nothing at all.

There's a better way. Instead of treating documents as bags of text, you can treat them the way a human would: read the page as a whole, visuals included. And for pages that do contain native, selectable text, you don't have to choose between precision and visual understanding — you can have both. Revolutionary, we know.

Why OCR-Based RAG Falls Short

The standard pipeline — OCR → chunking → embedding → vector search → LLM — was designed for text-native documents. When applied to rich, heterogeneous content, it breaks down in predictable ways:

  • A scanned table loses its structure and becomes an unreadable string of values
  • A technical diagram is reduced to a handful of disconnected labels
  • Spatial relationships — captions, callouts, annotations — are destroyed
  • Charts and graphs lose all their meaning once flattened to text

The root problem is that OCR reduces a two-dimensional, semantically rich object (a page) to a one-dimensional stream of characters. You can't recover what was never captured. It's a bit like describing a painting by reading the label on the frame — technically accurate, entirely useless.

The Core Idea: Combine Native Text Extraction and Vision Models

Native text extraction and Vision Language Models (VLMs) are not competing approaches — they are complementary. Each covers what the other misses:

  • Native text (via PyMuPDF) is exact, faithful to the character, and computationally free. It carries no risk of hallucination.
  • VLMs understand structure, visual semantics, and spatial relationships — things that text extraction is blind to.

The best pipeline uses both. On pages that contain native text alongside visual elements, native extraction handles the prose while the VLM focuses exclusively on what it does best: tables, diagrams, charts, and images. On fully scanned pages, the VLM takes over entirely.

The resulting pipeline has three steps:

Step 1 — Convert Pages to Images and Extract Native Text

import fitz  # PyMuPDF
from pdf2image import convert_from_path

doc = fitz.open("document.pdf")
page_images = convert_from_path("document.pdf", dpi=200)

for i, page in enumerate(doc):
    native_text = page.get_text()
    image = page_images[i]
    # pass both to the processing function

Step 2 — Detect Visual Elements

PyMuPDF lets you check whether a page actually contains visuals, so you can avoid unnecessary VLM calls on text-only pages:

def page_has_visuals(page):
    images = page.get_images()
    drawings = page.get_drawings()
    return len(images) > 0 or len(drawings) > 0
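
In practice, page.get_drawings() also picks up decorative vector art: table rules, borders, underlines. These can trigger needless VLM calls on pages that are visually plain. One hedge is to require a minimum drawn area before counting a page as visual. A sketch with an illustrative (untuned) threshold; rects are (x0, y0, x1, y1) in points, as PyMuPDF reports them:

```python
def has_significant_drawings(rects, min_area=5000.0):
    """True if any drawing's bounding box exceeds min_area (in pt^2).

    Filters out thin rules and borders that aren't real diagrams.
    """
    return any((x1 - x0) * (y1 - y0) > min_area for x0, y0, x1, y1 in rects)
```

With PyMuPDF you would feed it `[d["rect"] for d in page.get_drawings()]`, since a Rect unpacks to its four coordinates.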

Step 3 — Build a Combined Page Description

import ollama

def describe_visuals(image_path: str) -> str:
    prompt = """Focus only on non-textual elements on this page:
    - Tables, diagrams, charts, images, graphs
    - Describe their content and what they convey
    - For tables, transcribe their content in Markdown or JSON
    - Ignore plain text paragraphs
    If there are no visual elements, say so briefly."""

    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": prompt, "images": [image_path]}]
    )
    return response["message"]["content"]

def describe_full_page(image_path: str) -> str:
    prompt = """Describe this document page exhaustively:
    - If you see a table: transcribe its full content in a structured way
    - If you see a diagram or chart: describe its elements and relationships
    - If you see text: transcribe it faithfully
    - If you see a graph: describe the data and visible trends
    Be precise and thorough."""

    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": prompt, "images": [image_path]}]
    )
    return response["message"]["content"]

def process_page(page, page_image_path):
    native_text = page.get_text().strip()
    has_text = len(native_text) > 100  # heuristic: ignore pages with only a few stray characters
    has_visuals = page_has_visuals(page)

    if has_text and has_visuals:
        # Best of both worlds: precise text + VLM for visuals
        visual_desc = describe_visuals(page_image_path)
        return f"## Extracted text\n{native_text}\n\n## Visual elements\n{visual_desc}"

    elif has_text:
        # Text-only page: no VLM needed
        return native_text

    else:
        # Fully scanned page: VLM takes over entirely
        return describe_full_page(page_image_path)

This approach gives you maximum precision on text — proper nouns, exact figures, technical references — with no risk of hallucination, while the VLM adds the semantic layer that text extraction can never provide.

The process_page function above implements a simple decision tree: check for native text, check for visuals, and route accordingly.
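
If you want to unit-test that routing without a VLM in the loop, the decision itself can be isolated as a pure function. A sketch mirroring process_page, using the same 100-character heuristic:

```python
def route(native_text: str, has_visuals: bool, min_chars: int = 100) -> str:
    """Return the processing strategy for a page: 'hybrid', 'text', or 'vlm'."""
    has_text = len(native_text.strip()) > min_chars
    if has_text and has_visuals:
        return "hybrid"  # native extraction + VLM on the visuals
    if has_text:
        return "text"    # native extraction only
    return "vlm"         # fully scanned page: VLM describes everything
```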

What This Unlocks

The combined approach improves retrieval in ways that neither method achieves alone:

  • Exact recall: a query for a specific article number or technical specification matches the native text verbatim
  • Semantic recall: a query about "the heat flow diagram" or "the comparison table" matches the VLM's description
  • Structural fidelity: tables are indexed as structured Markdown or JSON, not as a garbled sequence of cell values
  • Compute efficiency: the VLM only runs when there are actual visuals to describe, keeping ingestion time reasonable

Going Further: ColPali

For the most demanding use cases, ColPali takes a fundamentally different approach: it embeds document pages directly as images, without any intermediate text representation. Queries are embedded in the same visual space, and retrieval is based on visual similarity.

from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali-v1.2")
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Both page images and text queries are embedded by the same model;
# retrieval happens in the visual embedding space via late interaction

The benefit is zero information loss — the layout itself is part of the index. ColPali consistently ranks among the best-performing models on document retrieval benchmarks, particularly for visually complex pages. It can also be combined with the hybrid approach above: use ColPali for retrieval, then pass the retrieved page image plus its extracted text to the LLM for generation.

Recommended Stack (Fully Local)

Ollama serves both the vision model and the embeddings; Chroma stores the vectors. Nothing leaves your machine:

from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./db")

# Index each combined page description
vectorstore.add_texts(
    texts=[page_description],
    metadatas=[{"source": filename, "page": page_num}]
)
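
At query time, the retrieved page descriptions need to be stitched into a grounded prompt for the LLM. A minimal sketch — the dict fields match the metadata indexed above, but the prompt layout is an illustrative choice, not a LangChain API:

```python
def build_context_prompt(question: str, docs: list[dict]) -> str:
    """Concatenate retrieved page descriptions, each tagged with its source page."""
    context = "\n\n---\n\n".join(
        f"[{d['source']}, page {d['page']}]\n{d['text']}" for d in docs
    )
    return (
        "Answer the question using only the context below. "
        "Cite the page you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```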

Practical Advice

Chunk at the page level. A page is a natural semantic unit for a VLM. Splitting mid-page breaks the visual context the model needs to produce a coherent description.

Keep the original images. Store the source page image alongside its description. When a page is retrieved, you can pass the image directly to the LLM as additional context — especially useful for complex visuals that are hard to describe fully in text.
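
One way to do this is to record the image path in the same metadata you already index, so any retrieved chunk can point back to its rendered page. A sketch; the image_path field is a convention of this example, not something Chroma requires:

```python
def index_page(vectorstore, description, filename, page_num, image_path):
    """Index a page description while keeping a pointer to its source image."""
    vectorstore.add_texts(
        texts=[description],
        metadatas=[{
            "source": filename,
            "page": page_num,
            "image_path": image_path,  # lets you reload the image at answer time
        }],
    )
```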

Tailor your VLM prompts to document type. A technical schematic, a financial report, and a product datasheet each warrant different prompting strategies. Investing in prompt templates per document category pays off in description quality.

Request structured output for tables. When a page contains tabular data, explicitly ask the VLM to output Markdown or JSON. This preserves structure in a way that plain prose cannot, and makes the indexed content far easier for the LLM to reason over.
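
VLMs don't always honour the requested format, so it's worth validating the output before indexing. A small sketch that accepts JSON when it parses and otherwise falls back to the raw (Markdown or prose) transcription:

```python
import json

def parse_table_output(raw: str):
    """Return parsed JSON if the VLM complied, else the raw transcription."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw  # keep the Markdown/prose version as-is
```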

Conclusion

The classic OCR-based RAG pipeline was never designed for visually rich documents. The solution isn't to replace text extraction with a VLM — it's to use both in concert. Native text gives you precision and reliability; the VLM gives you visual understanding. Together, they produce page descriptions that are richer than either could achieve alone. Think of it as hiring both a speed-reader and an art critic, and making them share a desk.

Combined with a fully local stack, this approach gives you a document chatbot that can reason over tables, diagrams, charts, and mixed content, without any data leaving your infrastructure. The tooling is mature, the models are capable, and the entire pipeline runs on commodity hardware. There's no reason to settle for text-only anymore.
