Argha Sarkar
Standard RAG Is Blind — Building Multimodal RAG in .NET to Fix It

The Scenario

A developer builds a RAG system. A user uploads a 60-page service manual — dense with wiring diagrams, installation schematics, and annotated screenshots — and asks: "How do I replace the filter assembly?"

The answer is entirely in Figure 7.

RAG returns three paragraphs of unrelated text. The image was never ingested. It does not exist to the system.

This is not a bug. It is the expected behaviour of every standard RAG pipeline.


Why Standard RAG Fails on Images

A standard RAG pipeline does one thing: convert text into searchable vectors.

```mermaid
flowchart LR
    A[PDF / DOCX Upload] --> B[Text Extraction]
    B --> C[Chunk]
    C --> D[Embed]
    D --> E[(Vector Store)]
    A -. images discarded .-> X[❌]
```

Images are either skipped entirely or reduced to their alt-text — which is usually empty. The pipeline was not designed to understand visual content. There is no text to extract from a schematic, no words to embed from a photograph, no paragraph to chunk from a technical diagram.

The result: any knowledge that exists only in images is permanently invisible to retrieval. For documents like technical manuals, medical imaging reports, architectural drawings, or slide decks, this is not a minor gap. It is a fundamental failure of coverage.


What Multimodal RAG Needs to Do Differently

Three things must change:

  1. Extract — pull image bytes out of documents alongside text, not instead of text
  2. Describe — pass each image to a vision model and get back a text description that captures what the image means, not just what it looks like
  3. Retrieve and Render — when a retrieval query matches an image description, return both the description as context and the original image to the user

The key insight is that vision models act as a translation layer. They convert visual content into the semantic space that the rest of the RAG pipeline already understands. Chunking, embedding, and vector search require no changes. The pipeline gains a new input channel — it does not need a new architecture.


The Architecture

The multimodal pipeline extends the standard RAG system at two seams: ingestion gains a parallel image track, and retrieval gains an image rendering step.

Ingestion

```mermaid
flowchart TD
    A[PDF / DOCX Upload] --> B[Text Extraction\nexisting]
    A --> C[Image Extraction\nPdfPig · OpenXml]
    B --> D[Chunk & Embed\nexisting]
    C --> E[Vision Model\nGPT-4o]
    E --> F[Image Description\ntext]
    E --> G[Image Bytes\nPostgreSQL]
    F --> H[Embed Description\nas chunk + imageId]
    D --> I[(Qdrant\nVector Store)]
    H --> I
```

The upload triggers two parallel tracks. The text track is unchanged. The image track extracts raw bytes per page or document part, sends each to a vision model, stores the bytes in PostgreSQL, and embeds the returned description as a standard chunk — with one addition: the chunk carries an imageId reference in its metadata.
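The image track above can be sketched as a small orchestrator. This is a minimal sketch, not the repo's actual code — the `ExtractedImage`, `Chunk`, `IImageExtractor`, `IVisionService`, and `IImageStore` shapes here are illustrative contracts and may differ from the real types:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Illustrative contracts -- names are assumptions, not the repo's actual API.
public record ExtractedImage(byte[] Bytes, string MimeType, int Width, int Height);
public record Chunk(string Text, Dictionary<string, string> Metadata);

public interface IImageExtractor { IEnumerable<ExtractedImage> Extract(Stream document); }
public interface IVisionService { Task<string> DescribeAsync(ExtractedImage image); }
public interface IImageStore { Task<Guid> SaveAsync(ExtractedImage image); }

public class ImageIngestionTrack
{
    private readonly IImageExtractor _extractor;
    private readonly IVisionService _vision;
    private readonly IImageStore _store;

    public ImageIngestionTrack(IImageExtractor extractor, IVisionService vision, IImageStore store)
        => (_extractor, _vision, _store) = (extractor, vision, store);

    // Runs alongside the existing text track on the same upload.
    public async Task<List<Chunk>> RunAsync(Stream document)
    {
        var chunks = new List<Chunk>();
        foreach (var image in _extractor.Extract(document))
        {
            var description = await _vision.DescribeAsync(image); // vision call, once at ingest
            var imageId = await _store.SaveAsync(image);          // bytes -> PostgreSQL
            chunks.Add(new Chunk(description, new Dictionary<string, string>
            {
                ["type"] = "image",
                ["imageId"] = imageId.ToString()
            }));
        }
        return chunks; // embedded downstream exactly like text chunks
    }
}
```

Because each description chunk is just text plus metadata, the embedding stage downstream does not need to know it originated from an image.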

Image descriptions live in the same vector space as text chunks. They compete on equal terms during retrieval.

Retrieval

```mermaid
flowchart TD
    A[User Query] --> B[Vector Search\nQdrant]
    B --> C{Chunk Type?}
    C -->|text| D[Text Context]
    C -->|image description| E[Image Description\n+ imageId]
    D --> F[LLM Response]
    E --> F
    E --> G["GET /api/images/{id}\nimage bytes"]
    F --> H[Answer Text]
    G --> H
    H --> I[Chat UI\ntext + inline images]
```

Retrieval requires no changes to the search layer. When a query matches an image-description chunk, the chunk's metadata surfaces the imageId. A dedicated endpoint streams the image bytes from PostgreSQL. The chat UI renders the LLM answer alongside the relevant image — in the same response panel.


Pipeline Stage Breakdown

Extract

Two document types, two libraries, one output contract.

```mermaid
flowchart LR
    PDF --> PdfPig --> ExtractedImage
    DOCX --> OpenXml --> ExtractedImage
```

PDF image extraction uses PdfPig's per-page image enumeration. DOCX extraction enumerates MainDocumentPart.ImageParts via the OpenXml SDK. Both apply a 100×100px minimum dimension threshold — images below this are decorative and skipped — and a 20MB safety cap. The output in both cases is an ExtractedImage record carrying bytes, MIME type, and dimension metadata. Text and image extraction run on the same upload; no second pass is required.
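The dimension and size gates are simple to state in code. A sketch of the filter, with the `ExtractedImage` record redefined here for self-containment (field names are illustrative):

```csharp
using System;

public record ExtractedImage(byte[] Bytes, string MimeType, int Width, int Height);

public static class ImageFilter
{
    // Thresholds from the article: skip decorative images, cap payload size.
    private const int MinDimensionPx = 100;
    private const long MaxBytes = 20L * 1024 * 1024; // 20 MB safety cap

    public static bool ShouldIngest(ExtractedImage image) =>
        image.Width >= MinDimensionPx &&
        image.Height >= MinDimensionPx &&
        image.Bytes.LongLength <= MaxBytes;
}
```

Applying the filter at extraction time keeps icons, bullets, and logos out of the vision-model budget before any API call is made.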

Describe

```mermaid
flowchart LR
    ExtractedImage --> B[IVisionService\nDescribeAsync] --> C[Text Description]
```

Each extracted image is base64-encoded and sent to GPT-4o Vision via IVisionService. The response is a plain-text description of what the image contains and means in context. This is the only pipeline stage that calls an external vision model. Descriptions are generated once at ingest time — not at query time — so retrieval latency is unaffected.
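A vision service along these lines could look as follows. This is a hedged sketch against OpenAI's public chat-completions API, not the repo's implementation — the model name, prompt text, and request shape are assumptions:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public class OpenAiVisionService
{
    private readonly HttpClient _http;

    public OpenAiVisionService(HttpClient http, string apiKey)
    {
        _http = http;
        _http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", apiKey);
    }

    // Image bytes travel inline as a base64 data URL.
    public static string ToDataUrl(byte[] bytes, string mimeType) =>
        $"data:{mimeType};base64,{Convert.ToBase64String(bytes)}";

    public async Task<string> DescribeAsync(byte[] bytes, string mimeType)
    {
        var payload = new
        {
            model = "gpt-4o",
            messages = new object[]
            {
                new
                {
                    role = "user",
                    content = new object[]
                    {
                        new { type = "text", text = "Describe this image and what it means in the context of a technical document." },
                        new { type = "image_url", image_url = new { url = ToDataUrl(bytes, mimeType) } }
                    }
                }
            }
        };

        var response = await _http.PostAsync(
            "https://api.openai.com/v1/chat/completions",
            new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("choices")[0]
                  .GetProperty("message").GetProperty("content").GetString()!;
    }
}
```

Note the prompt asks for meaning in context, not just a caption — "wiring diagram showing the filter assembly release clips" retrieves far better than "a black-and-white technical drawing".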

Store

```mermaid
flowchart LR
    ExtractedImage --> A[IImageStore] --> B[(PostgreSQL\nDocumentImages)]
    B --> C[imageId]
    C --> D[Chunk Metadata]
```

Image bytes are persisted to a DocumentImages table in PostgreSQL via IImageStore. The returned imageId is attached to the description chunk before it enters the embedding pipeline. The bytes never travel to Qdrant — only the description text and the imageId reference flow through the vector store.
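The store contract and the metadata hand-off can be sketched like this — the interface shape and metadata keys are illustrative assumptions, not the repo's exact signatures:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative contract; persistence behind it is the DocumentImages table.
public interface IImageStore
{
    Task<Guid> SaveAsync(byte[] bytes, string mimeType);
    Task<(byte[] Bytes, string MimeType)?> GetAsync(Guid id);
}

public static class ImageChunkMetadata
{
    // The returned imageId rides along with the description chunk into Qdrant;
    // the bytes themselves stay in PostgreSQL.
    public static Dictionary<string, string> For(Guid imageId, int pageNumber) => new()
    {
        ["type"] = "image",
        ["imageId"] = imageId.ToString(),
        ["pageNumber"] = pageNumber.ToString()
    };
}
```

Keeping the vector store free of binary payloads means Qdrant stays small and fast, and the database remains the single source of truth for the original images.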

Retrieve

No change to the vector search layer. When a query matches an image-description chunk, the chunk's metadata carries imageId and pageNumber. The existing search response shape is extended with an optional image reference — source chunks now carry a type field (text or image) alongside the relevant text excerpt.
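The extended source-chunk shape might look like this — field names are illustrative, and the repo's actual DTOs may differ:

```csharp
using System;

// A search hit: either a text chunk or an image-description chunk.
public record SourceChunk(
    string Type,          // "text" or "image"
    string Excerpt,       // chunk text, or the image description
    double Score,
    Guid? ImageId = null, // present only for image-description chunks
    int? PageNumber = null);

public static class SourceChunkExtensions
{
    // The UI uses this to decide whether to fetch and render an inline image.
    public static bool HasRenderableImage(this SourceChunk chunk) =>
        chunk.Type == "image" && chunk.ImageId is not null;
}
```

Making the image reference optional keeps existing text-only consumers of the response working unchanged.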

Render

A GET /api/images/{id} endpoint streams image bytes directly from PostgreSQL. The Blazor chat UI inspects each source chunk's type: text sources render as before, image sources fetch the endpoint and render the image inline. The user receives the LLM answer and the relevant diagram in the same response — no separate step, no external image hosting.
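As an ASP.NET Core minimal API, the endpoint wiring could look roughly like this — a sketch, assuming the illustrative `IImageStore` contract rather than the repo's actual service:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);
// IImageStore registration (PostgreSQL-backed) happens elsewhere in the app.
var app = builder.Build();

// Streams image bytes straight out of the database; no external image hosting.
app.MapGet("/api/images/{id:guid}", async (Guid id, IImageStore store) =>
{
    var image = await store.GetAsync(id);
    return image is null
        ? Results.NotFound()
        : Results.File(image.Value.Bytes, image.Value.MimeType);
});

app.Run();

// Illustrative contract, mirrored from the Store stage above.
public interface IImageStore
{
    Task<(byte[] Bytes, string MimeType)?> GetAsync(Guid id);
}
```

The `{id:guid}` route constraint rejects malformed ids before the handler runs, and returning `Results.File` sets the correct `Content-Type` so the Blazor `<img>` tag can point at the URL directly.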


GitHub

The full source, issue tracker, and phase roadmap are public.

github.com/Argha713/dotnet-rag-api
