Mistral OCR 4 brings self-hosted document AI to RAG pipelines

#ai #mistral #ocr #rag


Mistral has released Mistral OCR 4, a focused document-intelligence model for turning PDFs, scans, forms, tables, equations, and mixed-layout documents into structured output. This matters because many enterprise AI projects still fail at ingestion: if the source document is parsed badly, the RAG app, search index, compliance workflow, or agent built on top of it is already broken.

This is an official model launch, not a benchmark leak. It is especially relevant for teams building document-heavy products because Mistral is offering the model through its API, through Document AI, and as a single-container self-hosted deployment.

What Mistral announced

Mistral says OCR 4 returns more than plain extracted text. The listed capabilities are:

text extraction;
bounding boxes for locating content in the original document;
typed block classification for elements such as titles, tables, equations, and signatures;
inline confidence scores;
multilingual OCR across 170 languages in 10 language groups.

The company says the model is designed as an ingestion component for enterprise search, RAG, and domain-specific retrieval pipelines. It is also integrated with Mistral Search Toolkit, the company's open-source framework for ingestion, retrieval, and evaluation workflows.

Mistral claims OCR 4 averaged a 72% preference rate from independent annotators against the other OCR and document-AI systems it tested, and reports an 85.20 score on OlmOCRBench. As always, treat vendor benchmark claims as a starting point for testing, not a purchasing decision.

Deployment and pricing

For builders, the key point is that OCR 4 is not just a hosted demo. Mistral says it can run in a single container for fully self-hosted deployments, which matters for teams handling regulated documents, private customer data, internal knowledge bases, contracts, medical paperwork, insurance files, invoices, or finance documents.

On Mistral's pricing page, the model is listed as mistral-ocr-latest with:

OCR API: $4 per 1,000 pages;
Batch API: $2 per 1,000 pages;
Document AI: $5 per 1,000 pages.

Per-page pricing gives teams a more predictable cost model than token-only pricing for document extraction workloads.
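
As a rough illustration, the listed per-page prices translate directly into a budget estimate. The sketch below assumes cost scales linearly with page count, with no minimums, volume discounts, or taxes. The function name and tier keys are invented for this example; only the three prices come from the pricing listed above.

```python
from decimal import Decimal

# USD per 1,000 pages, as listed in this article.
# The tier keys are hypothetical labels, not Mistral identifiers.
PRICE_PER_1K_PAGES = {
    "ocr_api": Decimal("4"),
    "batch_api": Decimal("2"),
    "document_ai": Decimal("5"),
}


def estimate_cost(pages: int, tier: str) -> Decimal:
    """Estimate USD cost for `pages` pages, assuming simple linear pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in PRICE_PER_1K_PAGES:
        raise KeyError(f"unknown tier: {tier}")
    return Decimal(pages) / Decimal(1000) * PRICE_PER_1K_PAGES[tier]


if __name__ == "__main__":
    for tier in PRICE_PER_1K_PAGES:
        print(tier, estimate_cost(250_000, tier))
```

Under that linear assumption, a 250,000-page backlog would come to about $1,000 through the OCR API, $500 through the Batch API, and $1,250 through Document AI.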

Why builders should care

If you are building RAG over messy documents, OCR quality is product quality. Better layout extraction and confidence metadata can make a noticeable difference in:

source-grounded citations;
human review queues;
redaction and compliance workflows;
table-heavy enterprise search;
contract and invoice parsing;
support agents that need to quote original documents rather than hallucinate summaries.

The bounding-box support is particularly practical. It lets apps highlight where an answer came from, route low-confidence fields to humans, or preserve document structure instead of flattening everything into a blob of text.
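
A minimal sketch of the review-routing idea follows. The Block shape here is hypothetical and is not Mistral's actual response schema: the field names, the (x0, y0, x1, y1) box layout, the 0-to-1 confidence range, and the 0.85 threshold are all assumptions made for illustration. Check the real API response before adapting it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical OCR block; NOT the actual Mistral response schema."""
    page: int
    block_type: str  # e.g. "title", "table", "equation", "signature"
    text: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    confidence: float  # assumed to lie in [0.0, 1.0]


def route_blocks(
    blocks: Iterable[Block], threshold: float = 0.85
) -> Tuple[List[Block], List[Block]]:
    """Split blocks into (accepted, needs_review) by confidence threshold."""
    accepted: List[Block] = []
    needs_review: List[Block] = []
    for block in blocks:
        if block.confidence >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


if __name__ == "__main__":
    sample = [
        Block(1, "title", "Master Services Agreement", (72.0, 60.0, 540.0, 90.0), 0.99),
        Block(3, "signature", "J. Doe", (90.0, 700.0, 260.0, 740.0), 0.61),
    ]
    accepted, needs_review = route_blocks(sample)
    for block in needs_review:
        # The page and bbox let a review UI highlight the exact region.
        print(f"review page {block.page} at {block.bbox}: {block.text!r}")
```

Because each block keeps its page and bounding box, the same records can drive both the human review queue and the highlighted citations in the app. The right threshold depends on the document type and the cost of an error, so it is worth tuning on real data.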

The self-hosted option is also important. Some companies cannot send documents to a third-party API, even if the model is good. A containerized deployment gives those teams a path to use Mistral's stack without moving sensitive files outside their own environment.

Caveats

OCR 4 is a specialist model, not a new general-purpose frontier model. Teams should test it against their own documents before replacing existing OCR, especially for handwritten forms, low-quality scans, niche languages, unusual tables, and documents where extraction errors have legal or financial consequences.
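
One simple way to run that kind of test is to compare OCR output against a small set of hand-checked transcriptions using character error rate (CER): the edit distance between the two strings divided by the length of the reference. This is a generic OCR evaluation technique, not something Mistral prescribes, and the sample strings below are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length. Lower is better."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "Invoice 1042"   # hand-checked ground truth (made up)
    hypothesis = "Invoice 1O42"  # OCR output with a letter O instead of zero
    print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

CER only measures text accuracy. It says nothing about table structure, reading order, or bounding-box quality, so those need separate checks on documents where layout matters.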

The other open question is packaging. Mistral says self-hosting is available, but teams will still need to check hardware requirements, licensing terms, throughput, observability, and how the container fits their security review.