How Multimodal Document Parsing Works: From LayoutLM to Donut
Most AI systems are great at understanding clean text. But the real world doesn't send you clean text. It sends you PDFs, scanned invoices, handwritten forms, and multi-column research papers. This is the unstructured data problem, and it's one of the hardest open challenges in applied AI.
This article breaks down how modern multimodal models tackle document understanding, specifically LayoutLM and Donut, and why this matters for the next generation of AI agents.
Why Plain NLP Fails on Documents
A standard language model reads text as a flat sequence of tokens. But documents are not flat. A table, an invoice, or a form has spatial structure — where something appears on the page is just as important as what it says.
Consider a receipt. The word "Total" appears near the bottom right. The number next to it is the amount due. A model reading raw text has no idea about this spatial relationship. It just sees "Total" and "47.50" somewhere in a stream of tokens.
This is why layout-awareness matters.
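To make the point concrete, here is a toy sketch of what bounding boxes buy you: with coordinates, even a crude heuristic can pair a label like "Total" with the value printed beside it, something a flat token stream cannot do. The words, boxes, and coordinates below are hypothetical.

```python
import math

# Toy illustration (not a real model): words come as (text, bounding box)
# pairs, where a box is (x0, y0, x1, y1) in page pixels. A simple nearest-
# neighbor search to the right of a label recovers the label-value pairing
# that plain text loses.

def nearest_right(label_box, candidates):
    """Return the candidate word whose left edge is closest to the label's right edge."""
    lx = label_box[2]                          # label's right edge
    ly = (label_box[1] + label_box[3]) / 2     # label's vertical center
    def dist(word):
        _, (x0, y0, x1, y1) = word
        cy = (y0 + y1) / 2
        return math.hypot(x0 - lx, cy - ly)
    return min(candidates, key=dist)

words = [
    ("Subtotal", (60, 660, 140, 680)),
    ("43.00", (520, 660, 580, 680)),
    ("47.50", (520, 700, 580, 720)),
]
label = ("Total", (440, 700, 500, 720))
nearest_right(label[1], words)  # → ("47.50", (520, 700, 580, 720))
```

Real layout-aware models learn far richer versions of this spatial reasoning, but the intuition is the same: position carries meaning.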
LayoutLM: Teaching BERT to Read Layouts
LayoutLM (Microsoft, 2020) was one of the first models to combine text, layout, and visual information for document understanding.
It extends BERT with two additional input embeddings:
2D position embeddings — the (x, y) coordinates of each token's bounding box on the page
Image embeddings — visual features extracted from the document image via CNNs
So instead of just asking "what does this token mean?", LayoutLM asks "what does this token mean, where is it on the page, and what does that region look like visually?"
This allows LayoutLM to achieve strong results on tasks like form understanding, receipt parsing, and information extraction from semi-structured documents.
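A minimal sketch of how these inputs combine, assuming the LayoutLM convention of normalizing bounding boxes to a 0-1000 grid and summing the per-token embeddings element-wise. The embedding vectors here are toy placeholders; real models learn them, and LayoutLM-base uses 768 dimensions rather than 4.

```python
# Hedged sketch: how a LayoutLM-style input embedding is assembled.
# Each token gets text + four bbox-coordinate embeddings (+ an image
# feature vector), all summed into a single vector.

DIM = 4  # toy size; LayoutLM-base uses 768

def normalize_bbox(bbox, page_w, page_h):
    """Scale pixel coordinates to the 0-1000 grid LayoutLM expects."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_w), int(1000 * y0 / page_h),
        int(1000 * x1 / page_w), int(1000 * y1 / page_h),
    )

def embed_token(token_vec, bbox_vecs, image_vec):
    """Element-wise sum of text, bbox-coordinate, and image embeddings."""
    out = list(token_vec)
    for vec in (*bbox_vecs, image_vec):
        out = [a + b for a, b in zip(out, vec)]
    return out

# A word at pixel box (50, 100, 120, 130) on an 800x1000 page:
normalize_bbox((50, 100, 120, 130), 800, 1000)  # → (62, 100, 150, 130)
```

The normalized coordinates index four learned position-embedding tables (one each for x0, y0, x1, y1), so two tokens printed in the same page region end up with similar layout components even if their text differs.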
LayoutLMv3 (2022) further unified the text and image streams using a single transformer, removing the need for separate CNN feature extractors and achieving state-of-the-art on multiple document benchmarks.
Donut: Skipping OCR Entirely
LayoutLM still depends on OCR. You need to extract text and bounding boxes before the model can run. OCR is slow, expensive, and error-prone on noisy scans.
Donut (Document Understanding Transformer, 2022) takes a completely different approach: end-to-end document understanding with no OCR.
Donut uses a simple encoder-decoder architecture:
Encoder: A Swin Transformer that reads the raw document image as a grid of patches. No text extraction needed.
Decoder: A BART-style autoregressive decoder that generates structured output (JSON) directly from the image.
You give Donut a document image and a task prompt — a special start token for a trained extraction task, or a natural-language question for VQA-style tasks — and it outputs structured JSON directly. No OCR pipeline, no bounding box extraction, no preprocessing.
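Under the hood, Donut's decoder emits an XML-like token sequence (e.g. `<s_total>47.50</s_total>`) that is then converted to JSON. The simplified parser below illustrates that last step for flat fields; the real conversion (`DonutProcessor.token2json` in Hugging Face Transformers) also handles nesting and repeated groups.

```python
import re

# Simplified illustration of Donut's output format: the decoder produces
# field tokens like <s_total>47.50</s_total>, which are parsed into JSON.
# This toy parser handles one flat level of fields only.

def tokens_to_json(sequence: str) -> dict:
    """Turn Donut-style field tokens into a flat dict."""
    fields = re.findall(r"<s_(\w+)>(.*?)</s_\w+>", sequence)
    return {name: value.strip() for name, value in fields}

decoded = "<s_invoice_number>INV-0042</s_invoice_number><s_total>47.50</s_total>"
tokens_to_json(decoded)  # → {"invoice_number": "INV-0042", "total": "47.50"}
```

Because the output schema is just a token vocabulary, adapting Donut to a new document type means fine-tuning with new field tokens rather than rebuilding an OCR-plus-extraction pipeline.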
This makes Donut significantly faster and more robust on low-quality scans where OCR typically struggles.
The Core Trade-off
| | LayoutLM | Donut |
|---|---|---|
| OCR required | Yes | No |
| Input | Text + layout + image | Raw image only |
| Output | Token classification | Generated JSON |
| Speed | Slower (OCR bottleneck) | Faster end-to-end |
| Best for | High-quality structured forms | Noisy scans, flexible extraction |
Why This Matters for AI Agents
The next wave of AI agents won't just browse the web. They'll need to read contracts, parse invoices, extract data from reports, and interact with enterprise document workflows. All of that is unstructured data.
Models like LayoutLM and Donut are the foundation of vision-first, layout-aware document AI. Systems that don't just read text but understand documents the way humans do: spatially, visually, and contextually.
As these models become more capable and efficient, the gap between "AI that reads clean text" and "AI that handles the real world" will finally start to close.
Currently exploring multimodal document AI as part of my ML research. Happy to connect on LinkedIn.