From Tesseract’s character pipeline to GPT’s visual tokens to LandingAI’s agentic decomposition. I tested all five on the same McKinsey report. Every one got the text right. None of them got the charts right.
I uploaded two pages of a McKinsey consulting report to five different OCR systems last week. Same pages. Same prompt. A bubble chart with floating labels and a heatmap table where the data is color, not numbers.
Every system got the text paragraphs perfectly. Every one of them failed on the charts.
But they failed differently. And the differences trace directly to how each system works inside. Not at the API level. At the architecture level. The level where “reading” becomes “predicting” and you can’t tell the difference from the output.
This article is about that architecture. What’s actually happening when a machine looks at a page of text, a table, or a chart, and turns pixels into structured data. Not an API comparison. A mechanism explainer. Because if you use OCR daily, you should understand how it works.
Generation 1: The Pipeline That Reads Character by Character
Before transformers, before neural networks dominated OCR, Tesseract was the standard. It still runs inside thousands of production systems. And it works nothing like what you’re using when you call Claude or GPT with a PDF.
Tesseract (version 4+) processes a page through a multi-stage pipeline:
Stage 1: Page Layout Analysis
The engine scans the image for text regions. Dark pixel clusters on a light background. Group them into lines by vertical alignment. Separate words by horizontal gaps. That’s it. At this stage, the system has no idea what any character says. It only knows WHERE text exists.
Stage 2: Line-Level Recognition
This is the interesting part. Each text line goes to an LSTM (Long Short-Term Memory) network. Not one character at a time. The whole line. Picture yourself squinting at someone’s terrible handwriting. You don’t decode each letter in isolation, right? You look at the shape of the whole word. The LSTM does the same thing. It slides across the line and asks “given everything I’ve seen so far, what character comes next?” A vertical bar followed by a curve? Probably “b” or “d.” The letters around it break the tie.
Stage 3: CTC Decoding
The LSTM spits out a messy probability matrix. Every column is a time step, every row is a possible character. CTC (Connectionist Temporal Classification) cleans this up. The network might say “hh-ee-ll-ll-oo” for “hello” because it stutters across time steps. CTC merges the repeated characters, strips out blanks, and gives you the final string.
Page Image
│
├── Layout Analysis → text regions, lines, words
│
├── Per-Line LSTM → probability matrix [timesteps × characters]
│
├── CTC Decoder → final character string
│
└── Post-processing → dictionary lookup, spell correction
The pipeline nature is the key insight. Each stage has ONE job. Layout analysis finds WHERE. The LSTM identifies WHAT. CTC decodes HOW to collapse probabilities into text. When any stage fails, the failure is visible. A layout analysis miss means a text region is skipped entirely, and you get a gap. A recognition error gives you garbled characters with low confidence scores. The system KNOWS it’s uncertain, and it can tell you.
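That measurable uncertainty is something you can build on. A sketch, assuming output shaped like pytesseract's `image_to_data(..., output_type=Output.DICT)` (parallel lists of words and 0–100 confidence scores, with -1 for non-word boxes):

```python
# Sketch: surfacing Tesseract's per-word uncertainty. The dict shape mirrors
# pytesseract.image_to_data output; the sample values below are illustrative.

def flag_low_confidence(ocr_data: dict, threshold: float = 60.0) -> list[tuple[str, float]]:
    """Return (word, confidence) pairs the engine itself is unsure about."""
    flagged = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)
        # conf == -1 marks layout boxes, not words; skip those and empty strings
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged

data = {"text": ["Revenue", "gr3w", "12%", ""], "conf": ["96", "41", "88", "-1"]}
print(flag_low_confidence(data))  # → [('gr3w', 41.0)]
```

The garbled word announces itself. You can route it to a human, a spell-checker, or a re-scan. That option simply doesn't exist when the system won't tell you it guessed.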
This matters because everything that came after works differently. And fails differently.
Generation 2: Deep Learning Detection + Recognition
Solutions like EasyOCR and PaddleOCR split the problem into two neural networks instead of Tesseract’s monolithic pipeline.
Network 1: Text Detection
A CNN (convolutional neural network) like CRAFT scans the image and produces two heatmaps. One says “this pixel is probably part of a character.” The other says “these two characters are probably next to each other.” Threshold those heatmaps and you get bounding boxes around text regions. Now you know where the words are.
Network 2: Text Recognition
Crop each detected region. Resize it to a standard height. Feed it to a CRNN (Convolutional Recurrent Neural Network). The CNN looks at the pixels and extracts visual features. The RNN (bidirectional LSTM) reads those features as a sequence, left and right. CTC decoding at the end, same as Tesseract. Out comes the text.
The improvement over Tesseract? The detection network handles rotated text, curved text, text on complex backgrounds. The recognition network is deeper and more accurate. But it’s still a pipeline. Detection first, then recognition. Two networks. Two failure points. Both visible.
When CRAFT fails to detect a text region, you get no output for that region. When the CRNN misrecognizes a character, the confidence score drops. You can build validation logic around both failure modes because the system’s uncertainty is measurable.
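Here's what that validation logic might look like. The result triples follow EasyOCR's `readtext()` shape of (bounding box, text, confidence); the threshold and expected-region count are assumptions you'd tune per document type:

```python
# Sketch of validation around a detect-then-recognize pipeline. Both failure
# modes are observable: missing detections and low-confidence recognitions.

def validate(results, expected_regions: int, min_conf: float = 0.5):
    """Check a page's OCR results against both known failure modes."""
    issues = []
    if len(results) < expected_regions:                      # detection miss
        issues.append(f"only {len(results)}/{expected_regions} regions detected")
    for bbox, text, conf in results:                         # recognition miss
        if conf < min_conf:
            issues.append(f"low confidence ({conf:.2f}) on {text!r}")
    return issues

results = [([[0, 0], [90, 0], [90, 20], [0, 20]], "Invoice", 0.98),
           ([[0, 30], [90, 30], [90, 50], [0, 50]], "T0tal", 0.37)]
print(validate(results, expected_regions=3))
```

Two networks, two failure points, and a few lines of code catch both. Hold that thought, because the next generation takes both checkpoints away.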
Generation 3: End-to-End Transformers (Chandra, TrOCR)
This is where the architecture shift happens.
Chandra OCR and similar models (TrOCR, Donut, Nougat) abandon the pipeline. Instead of “detect then recognize,” they process the entire page in a single forward pass.
Chandra is built on a fine-tuned vision-language model (from the Qwen-VL family). It’s not a classic encoder-decoder like T5 or BART. It’s a VLM that was specifically trained to generate structured document output. The architecture:
Vision encoder
The page gets chopped into patches. 16x16 pixels each. Each patch is flattened into a vector, projected through a linear layer, and tagged with a positional embedding so the model knows where on the page it came from. Then self-attention: every patch can attend to every other patch. A patch showing the top of a “T” can look at the patch below it to confirm it’s a “T” and not an “I.” The output is a sequence of visual feature vectors, each one aware of the entire page.
Language decoder
Takes those visual features and generates text. Token by token. Left to right. Same autoregressive process as a chatbot generating a response, except the “prompt” is a grid of visual features instead of text tokens.
Page Image (e.g., 1024×1024 pixels)
│
├── Split into patches (e.g., 4,096 patches of 16×16)
│
├── Linear projection → patch embeddings
│
├── + Positional embeddings (where on the page)
│
├── Vision Encoder (self-attention across ALL patches)
│ └── Output: visual feature vectors
│
└── Autoregressive Language Decoder (fine-tuned Qwen-VL)
├── Attends to visual features + previous tokens
└── Generates: structured text (Markdown, HTML, JSON)
No detection step. No bounding boxes first. The model learns WHERE and WHAT simultaneously. A patch containing part of a letter attends to adjacent patches to figure out what it’s looking at. The self-attention mechanism is doing the spatial reasoning that CRAFT and Tesseract had to engineer explicitly.
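The patching arithmetic from the diagram is worth seeing concretely. A sketch using the article's numbers (a 1024×1024 page, 16×16 patches, so 64×64 = 4,096 patches of 256 pixels each), before the linear projection and positional embeddings:

```python
import numpy as np

# Sketch of the vision encoder's first step: cut the page into fixed-size
# patches and flatten each one into a vector. A patch's index encodes its
# position on the page: row, col = divmod(index, width // patch).

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """(H, W) grayscale image → (num_patches, patch*patch) flattened vectors."""
    h, w = image.shape
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)      # group the pixels of each patch
                 .reshape(-1, patch * patch))

page = np.zeros((1024, 1024), dtype=np.float32)
print(patchify(page).shape)  # → (4096, 256)
```

Each of those 4,096 vectors then attends to every other one. That all-to-all attention is the engineered spatial reasoning of CRAFT and Tesseract, relearned from data.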
Chandra specifically uses this approach with what it calls “full-page decoding.” Rather than processing individual text lines, it takes in the entire page and generates structured output (Markdown, HTML, or JSON) that preserves the layout. It also provides spatial grounding: layout blocks are classified into 16+ types (text, section-header, caption, footnote, table, form, list-group, image, figure, diagram, equation-block, code-block, and more), each with bounding box coordinates. It achieved 85.9% on the olmocr benchmark, state-of-the-art for open-source OCR.
The strength: layout preservation
Because the model sees the entire page at once, it can understand that a heading relates to the paragraph below it, that a table cell belongs to a specific row and column, that a footnote marker connects to a footnote at the bottom. The pipeline approaches can’t do this because they process text regions independently.
The weakness: the decoder is autoregressive
It predicts tokens one at a time. And like any autoregressive model, it can hallucinate. If the visual features are ambiguous (a smudged character, a low-resolution scan), the model doesn’t output a low-confidence garbled character like Tesseract would. It outputs a plausible character. Confidently. Silently.
Generation 4: Vision Language Models (Claude, GPT, Gemini)
When you upload a PDF to Claude or GPT and ask it to extract the text, you’re using a vision language model. This is architecturally different from both traditional OCR and end-to-end transformers like Chandra. And the three major providers do it differently from each other.
The general pattern is the same: convert the image into visual tokens, combine them with your text prompt, and let the language model generate a response. But the HOW varies:
GPT-5 is natively multimodal. It was trained from scratch on text and images simultaneously. The visual and language capabilities developed together during training, not bolted on after the fact. It supports configurable resolution: a [_detail_](https://developers.openai.com/cookbook/examples/multimodal/document_and_multimodal_understanding_tips) parameter controls whether the model processes your image at standard or original resolution. For documents with small labels, dense tables, or fine print, switching to detail="original" lets the model process images up to 10 million pixels without compression. This architectural choice means GPT-5 doesn't have a separate "vision module." Images and text live in the same representational space from the start.
Claude takes a different approach. The image is converted to tokens using a pixel-area formula: tokens = (width × height) / 750. Images with a long edge over 1,568 pixels are downscaled before processing. The visual tokens are then processed alongside text tokens by the language model. Claude's vision docs note specific limitations: limited spatial reasoning, approximate (not precise) object counting, and potential hallucinations on low-quality or rotated images. These limitations map directly to the failure modes we saw in the experiment.
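Those two numbers from Claude's docs, the /750 formula and the 1,568-pixel cap, are enough to estimate what an uploaded page costs. A sketch of that arithmetic:

```python
# Sketch of Claude's image-token arithmetic as described in its vision docs:
# tokens ≈ (width × height) / 750, after downscaling any image whose long
# edge exceeds 1,568 px. Estimates only; the API reports actual usage.

def claude_image_tokens(width: int, height: int, max_edge: int = 1568) -> int:
    long_edge = max(width, height)
    if long_edge > max_edge:                  # downscale, preserving aspect ratio
        scale = max_edge / long_edge
        width, height = int(width * scale), int(height * scale)
    return int(width * height / 750)

print(claude_image_tokens(1024, 1024))  # → 1398
```

The practical upshot: a dense A4 scan at 300 DPI gets downscaled before the model ever sees it, which is one reason fine print and small chart labels suffer.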
Gemini uses a sparse Mixture-of-Experts (MoE) transformer architecture trained to be natively multimodal from the ground up. Google’s docs describe it as “built to be multimodal from the ground up.” The MoE design activates only a subset of model parameters per input token, routing each token to specialized “experts” within the network. This means Gemini can scale total model capacity without proportionally increasing compute cost per token. For vision, Gemini also supports bounding box detection with normalized coordinates (0–1000), giving it spatial awareness that the other two providers don’t expose at the API level.
Despite these architectural differences, all three follow the same high-level flow:
Document Image
│
├── Visual Encoding (architecture-specific)
│ ├── GPT-5: native multimodal, configurable resolution up to 10M pixels
│ ├── Claude: pixel-area tokenization (w×h/750), 1568px downscale
│ └── Gemini: sparse MoE, unified multimodal space
│
├── [visual tokens] + [prompt tokens: "extract all text..."]
│
└── Language Model generates text output
└── Not structured OCR. Text completion conditioned on visual tokens.
The critical difference from Chandra (Generation 3) is that none of these models were trained specifically for OCR. GPT-5 and Gemini are natively multimodal, but they’re general-purpose models trained on everything, not document specialists. Claude adds vision to a language-first architecture. All three CAN extract text from images. None were BUILT to do it. Chandra was fine-tuned specifically on document extraction tasks. That’s a different optimization target, and it shows in the results.
Recent research has mapped what’s happening inside this process at a granular level. Earlier work by Baek et al. (2025) identified “OCR heads,” specific attention heads that specialize in text recognition. A paper from February 2026, “Where Vision Becomes Text”, used causal interventions to locate exactly where these OCR heads operate and how the text signal routes through different architectures. These heads are qualitatively different from general retrieval heads. They concentrate in specific layers (L12, L16-L20 in the models tested) and have less sparse activation patterns than other attention heads.
The OCR signal is remarkably low-dimensional: the first principal component captures 72.9% of the variance in how these heads process text. This means the model processes text through a narrow bottleneck, and that bottleneck’s location depends on the architecture. In DeepStack models (like Qwen3-VL), the bottleneck appears at mid-depth (~50% of layers). In single-stage projection models (like Phi-4), it peaks at early layers (6–25%).
Why does this matter for you? Because the model isn’t doing OCR in the way you think. It doesn’t have a dedicated text-reading module. It has attention heads that LEARNED to read text as a byproduct of training on massive datasets of images paired with text descriptions. The “reading” is emergent, not engineered.
And that’s why it fails differently than everything before it.
Why Vision Models Fail Silently
When Tesseract encounters a character it can’t read, the confidence score drops. You get a garbled output or a low-confidence warning. The failure is noisy. You can catch it.
When a vision language model encounters a chart with floating labels, something different happens. The model sees visual tokens for the label “Smartphones” near visual tokens for both the Electronics row and the Other Manufacturing row. It needs to decide which row the label belongs to. It makes this decision the same way it decides anything: by predicting what token would most plausibly come next, given the visual context and the text generated so far.
If the model is generating a list of products under “Electronics” and it has already listed several items, the next-token probability for “Smartphones” might be high simply because smartphones are associated with electronics in the training data. Not because of the spatial position of the label on the page. The model is predicting what SHOULD be there based on world knowledge, not reading what IS there based on pixel positions.
This is why my experiment produced the results it did. But first, one more approach to understand.
The Hybrid: Agentic Document Extraction (LandingAI)
LandingAI’s Agentic Document Extraction isn’t a new generation of OCR. The model underneath, DPT-2, is still a transformer. What’s different is what happens BEFORE the model reads anything.
Every other system on this list looks at the whole page and says “what text is here?” LandingAI looks at the page and says “what KIND of things are on this page?” first.
Stage 1: Document Decomposition
Before reading a single character, DPT-2 classifies every region on the page. That block of text? Paragraph. That grid? Table. That scatter plot? Chart. That squiggle in the corner? Signature. It identifies text, tables, charts, images, headers, footers, captions, logos, barcodes, QR codes, forms, even handwritten margin notes. Each one gets a bounding box with pixel coordinates. The system knows WHAT it’s looking at before it tries to read it. That’s the difference.
Stage 2: Table Structure Prediction
Tables get special treatment. DPT-2 maps the geometry first: where do rows start, where do columns end, are any cells merged? It breaks the table into individual cell regions before extracting text from them. This is why it doesn’t hallucinate rows or shift columns the way vision models do. It understands the grid before reading the content.
Stage 3: Cell-Level Extraction with Visual Grounding
Each cell’s contents are extracted and paired with a bounding box. This is the key architectural difference: every extracted value traces back to exact pixel coordinates on the source page. If DPT-2 says a cell contains “330,” you can verify that by checking the bounding box against the original image. Vision models give you text with no coordinates. LandingAI gives you text with a return address.
Stage 4: Agentic Reasoning
An AI agent coordinates the outputs from different components, resolving cross-element references. If a text block says “see the table above,” the agent links them. Because each region is processed independently, the system can parallelize extraction across components.
Document Image
│
├── Document Decomposition (multiple element types)
│ ├── Text blocks → with bounding boxes
│ ├── Tables → with structure prediction (rows, cols, merged cells)
│ ├── Charts → with chart-type classification
│ ├── Images → with captioning
│ └── Forms, headers, signatures, barcodes...
│
├── Component-Specific Processing
│ ├── Text → text extraction + coordinates
│ ├── Table → cell-level extraction + per-cell bounding boxes
│ ├── Chart → data extraction
│ └── Image → captioning
│
├── Agentic Reasoning (cross-component linking)
│
└── Structured Output with Visual Grounding
└── Every element has: content + page number + bounding box coordinates
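That "return address" is what makes automated verification possible. A sketch of the idea, with hypothetical field names (this is not LandingAI's actual response schema):

```python
# Sketch: grounded extraction means every value carries coordinates, so a
# downstream check can test the claim against the source page. Field names
# and the bbox convention here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class GroundedValue:
    content: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

def bbox_is_plausible(v: GroundedValue, page_w: int, page_h: int) -> bool:
    """Cheap sanity check: the box must be non-empty and inside the page."""
    x0, y0, x1, y1 = v.bbox
    return 0 <= x0 < x1 <= page_w and 0 <= y0 < y1 <= page_h

cell = GroundedValue(content="330", page=22, bbox=(812.0, 401.5, 851.0, 418.0))
print(bbox_is_plausible(cell, page_w=1024, page_h=1326))  # → True
```

From here you can go further: crop the bounding box from the original image and re-OCR it, or show it to a human reviewer. A vision model's unanchored text offers no equivalent hook.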
Now you know how all five approaches work. Let’s see what happens when they meet the same document.
The Experiment: Same Pages, Five Systems, Five Architectures
I took two pages from the McKinsey Global Institute: 2025 in Charts report. Page 12: a bubble scatter chart mapping US-China trade rearrangement ratios across 13 industry sectors, with circles of varying sizes, floating product labels, diamond markers for sector averages, and a continuous X axis from 0 to 1.25+.

Page 12 of McKinsey Global Institute: 2025 in Charts. The bubble chart that broke every OCR system I tested.
Page 22: a heatmap table showing economic empowerment factors across 11 countries, with 9 columns of color-coded cells (light to dark blue), rotated column headers, three income-group sections, and numerical columns for population and empowerment share.

Page 22. The “data” here is color, not numbers. None of the five systems could reliably read it.
I chose these pages because they combine text (left-column prose that any system should handle) with visual data encoding (spatial position, bubble size, color intensity) that forces each system to do more than read characters.
The five systems and how I ran each:
The same prompt was used for all three vision models to keep the comparison fair. LandingAI and Chandra don’t accept prompts, which is itself a finding: they’re opinionated about HOW to extract, not waiting for you to tell them.
Timing and cost:
| System | Wall time | Tokens consumed |
| --- | --- | --- |
| GPT-5.1 | 38.6s | 2,717 tokens |
| Claude Sonnet | 43.4s | 5,126 tokens |
| Claude Opus | 60.3s | 6,385 tokens |
| LandingAI | 19.9s | N/A (proprietary, 6 credits) |
| Chandra | ~15s | N/A (playground, free tier) |
Chandra was fastest (~15s), followed by LandingAI (19.9s). Claude Opus consumed the most tokens because it attempted the most detailed output (including estimated ratio values and a 44-cell color assessment).
Results: Page 12 (Bubble Chart)
On the bubble chart, every system extracted the prose text in the left column perfectly. The chart data diverged:
GPT-5.1 listed products as flat text with no spatial structure. Products ended up assigned to wrong sectors (Video game consoles under “Other manufacturing,” Charcoal barbecues under “Other manufacturing”). No ratio values. No indication of where anything sat on the X axis. The model read every label but couldn’t determine which label belongs to which row.
Claude Sonnet made similar sector misassignments: Plastic footwear under Textiles, Tungsten carbide under Transportation. It extracted the chart title, legend, and axis labels cleanly, but the spatial mapping from product label to sector row was wrong in multiple places. No ratio values.
Claude Opus was the most ambitious. It estimated ratio values for each product: “Logic chips (small, ~0.05),” “Smartphones (large, ~0.5),” “Laptops (large, ~1.0).” It was the only system that tried to read WHERE on the axis each bubble sat. But these values were approximations (the model was guessing position from pixel proximity), and several product-to-sector mappings were still wrong. Cotton T-shirts appeared under “Other manufacturing” instead of Textiles.
LandingAI produced the most structured output. It explicitly annotated the chart as a chart block with axis definitions, legend items, and products grouped by sector. Most sector assignments were correct. But it didn’t attempt ratio values. It described the structure faithfully without reading the quantitative data.
Chandra (with Chart Understanding enabled) extracted every label but with zero spatial structure. Products listed flat, not mapped to sectors, not positioned on the axis. Chart Understanding did not produce a structured table for this bubble chart type.
Results: Page 22 (Heatmap Table)
On the heatmap, the failure was more stark. The “data” is the shade of blue in each cell. A human reads it instantly: Japan’s working-age population cell is darkest (most important factor). Brazil’s job opportunities cell is darkest. India’s food cell is dark.
GPT-5.1 gave up entirely. For every country row, it output: “[Row of 9 colored squares].” It extracted the numbers on the right perfectly (330, 79 for the United States; 1,420, 29 for India). But the entire heatmap, the insight of the visualization, was replaced with a placeholder.
Claude Sonnet tried. It used labels like “light,” “medium,” “[dark],” “[darkest].” It got Japan’s working-age population right as “[darkest].” But it called South Africa’s labor participation “medium” when it’s one of the darkest cells. Inconsistent, with no confidence scores.
Claude Opus tried hardest. It used a five-level scale: “medium,” “medium-high,” “high,” “highest/darkest.” It produced a detailed breakdown for every country and every column, 44 individual color assessments, plus a separate “Chart Detail Notes” section repeating the analysis. Some assessments were right. Some were wrong. All stated with equal confidence.
LandingAI took a different approach. Instead of describing colors, it inferred importance levels and expressed them as ratings: “Medium (3/5),” “High (4/5),” “Highest (5/5).” It produced a full structured markdown table with every cell filled. The most usable output for downstream processing. But also the most interpretive: it wasn’t reading color, it was inferring meaning from visual intensity.
Chandra got the numbers right and the text structure. But the heatmap data was completely absent. No colors, no levels, no mapping of the 9 columns to country rows.
Page 22 Heatmap Extraction Summary

| System | Numbers | Color data | Column mapping |
| --- | --- | --- | --- |
| GPT-5.1 | ✓ | ✗ (gave up) | ✗ |
| Claude Sonnet | ✓ | Partial | ✓ (table) |
| Claude Opus | ✓ | Best attempt | ✓ (table + notes) |
| LandingAI | ✓ | Inferred (3/5-5/5) | ✓ (structured) |
| Chandra | ✓ | ✗ | ✗ |
Nobody got it fully right. Nobody told you they were uncertain.
What the Experiment Proves
The results map directly to the architectures.
Vision language models (GPT, Claude) treated the whole page as one image and predicted text from visual tokens. Text? Easy. Left-to-right, consistent spacing, clear patterns. Charts? A mess. Which label belongs to which bubble? The model doesn’t know. It guesses based on what it’s seen in training data, not based on pixel coordinates. And color-as-data (light blue vs dark blue meaning different things)? These models weren’t trained to interpret shade as a category. So they either skip it or hallucinate.
LandingAI knew it was looking at a chart before it tried to read it. The decomposition step gave it a head start. It classified the region, applied chart-specific extraction, and produced structured output. Not perfect, but structured. That’s the architectural advantage of classifying first, reading second.
Chandra is optimized for text extraction with layout preservation. It’s state-of-the-art for reading text from documents. But when the “data” is color or spatial position rather than characters, it has the same blind spot as the general vision models. Full-page decoding helps with layout, not with visual data encoding.
In both cases, across all five systems: clean, confident, structured output. No uncertainty markers. No confidence scores. No indication that the chart data was guessed rather than read.
How to Choose: A Decision Framework
After running this experiment and digging into the architectures, the decision isn’t “which tool is best.” It’s “which tool matches what’s on your page.”
The worst mistake you can make is trusting one tool for everything. The second worst is trusting any tool on chart data without validation. Every system I tested produced confident, clean output on chart data that was partially or fully wrong. None told me they were uncertain.
If you’re building a document processing pipeline, the architecture should match the document:
- Classify first. Before extraction, classify each page: is it mostly text, a table, a chart, or mixed? Route different page types to different tools.
- Validate chart data. Any data extracted from charts, heatmaps, or visual encodings needs a human check or a secondary verification source. The data exists somewhere else (a spreadsheet, a database) before it becomes a chart. Get the source data when possible.
- Build ground truth assertions. I created JSON files with expected values per page: titles, dates, specific numbers, opening sentences. Each assertion has a match type and a category. Run every OCR output against these assertions automatically.
{
"pdf_filename": "mckinsey-global-institute-2025-in-charts.pdf",
"assertions": [
{
"page": 12,
"expected": "Ease of rearrangement varies across products",
"category": "chart_title",
"match_type": "present"
},
{
"page": 12,
"expected": "Frozen tilapia fillets",
"category": "data_label",
"match_type": "present"
},
{
"page": 22,
"expected": "330",
"category": "table_value",
"match_type": "exact",
"description": "United States population in millions"
}
]
}
If the model says “December 2025” but your ground truth says “December 2024,” you catch it before it hits your vector database or production. If it misses “Frozen tilapia fillets” entirely, you know it dropped a data label. This is how you turn silent failures into noisy ones. This is a simplified example. My full validation workflow runs assertions across an entire data pipeline with automated scoring per provider. That’s a story for another article.
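A minimal runner for assertions like the ones above might look like this. "present" means the string appears anywhere in the OCR output; "exact" means some whitespace-separated token matches it exactly. This is a deliberately simplified sketch of the idea, not the full per-provider scoring pipeline:

```python
# Sketch: run ground-truth assertions against raw OCR output and return the
# ones that fail. Match semantics ("present" / "exact") are this example's
# assumptions; real pipelines may want fuzzy matching and normalization.

def run_assertions(assertions: list[dict], ocr_text: str) -> list[dict]:
    tokens = ocr_text.split()
    failures = []
    for a in assertions:
        if a["match_type"] == "present":
            ok = a["expected"] in ocr_text          # substring anywhere
        elif a["match_type"] == "exact":
            ok = a["expected"] in tokens            # exact token match
        else:
            raise ValueError(f"unknown match_type: {a['match_type']}")
        if not ok:
            failures.append(a)
    return failures

ocr_text = "Ease of rearrangement varies across products ... United States 330 79"
assertions = [
    {"expected": "Ease of rearrangement varies across products",
     "category": "chart_title", "match_type": "present"},
    {"expected": "Frozen tilapia fillets",
     "category": "data_label", "match_type": "present"},
    {"expected": "330", "category": "table_value", "match_type": "exact"},
]
print([a["category"] for a in run_assertions(assertions, ocr_text)])
# → ['data_label']  (the dropped label is now a noisy failure)
```

Wire this into CI for your ingestion pipeline and a silent OCR regression becomes a failing build.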
The model will never tell you it’s unsure. That’s your job.