DEV Community: Alvis Ng

Your agent can't find the tool it needs

Alvis Ng — Fri, 19 Jun 2026 13:30:00 +0000

I reproduced the BM25 mode of Anthropic's Tool Search Tool. At its documented limit of 10,000 tools, a correct tool missed the search's top 5 more than 40% of the time.

Tool overload has an official fix now. It works. You no longer begin every session, and every turn after, with your whole tool set loaded into the model's context. Anthropic's Tool Search Tool keeps the catalogue out of context and hands the model only the handful of tools a task needs, pulled by a search. Enable all the plugins and MCP servers you want; the search sorts it out. The token usage number drops.

There is a catch, though. The model can no longer reach a tool directly. When it needs one, it writes a search query, the search returns the top 3 to 5 matches, and the model can only call those. It can search again with different words, but it can only call a tool that a search has actually returned.

Picking the right tool used to be the model's job, with the whole list in front of it. Now the list sits behind a search, and the model only sees what the search returns. I started noticing the model miss tool calls it used to get right.

However, Anthropic's numbers tell a different story: with the search on, accuracy rose, 79.5% to 88.1% on Opus 4.5 in its MCP tests. That measures a different gate than the one I was concerned about. Theirs is end-to-end accuracy, whether the agent picked the right tool and finished the task, at a catalogue size the post does not give. Mine is the step before it: whether the right tool even reaches the model, as the catalogue grows toward the 10,000 tools the docs allow. Neither number refutes the other, and that earlier step was the one I could not find measured.

So I went through the documentation and rebuilt the search: the same BM25 algorithm (one of the two modes the tool offers, next to a regex one), indexing the four fields the docs list (name, description, argument names, argument descriptions). The docs cap the tool at 10,000 tools, so that is where I set the catalogue, filled with real tools drawn from a public benchmark (ToolRet; I used its 37,292-tool web category, the largest of the benchmark's three). Then I ran 1,100 real queries through it and checked whether the right tool was located correctly.

It made the top 5 57% of the time. Top 1, 36%.

A query counts as a hit if any of the tools ToolRet marks correct for it, about 2.39 on average, reaches the top 5. That makes 57% the generous reading: demand the one tool you meant and recall only falls.

That 57% is a ceiling, and everything the agent does sits under it. At the catalogue size Anthropic supports, the right tool is missing from what the model sees more than 40% of the time, no matter how good the model is. That is your retrieval ceiling, and almost no agent dashboard shows it. That figure is per search. An agent can try again with new words and recover some of it, but every retry runs against the same ceiling.

Adding tools lowers the ceiling

Connect more tools and the ceiling drops. I ran the same search at growing catalogue sizes, filling each with random, unrelated tools. At 5 tools the right one was in the top 5 every time. At 100, 93%. At 1,000, 80%. At 5,000, 64%. At 10,000, the most Anthropic's tool supports, 57%. I also pushed beyond that cap, loading the full 37,000-tool corpus I had, and recall fell to 44%. That decline never reverses; the black curve in the chart falls from the first tool to the last. Every tool you connect, even one the agent never needs, is one more chance for the search to rank it over the tool you wanted. Under 100 tools the effect is small, and you would not bother with tool search at all. It starts to matter in the hundreds and turns serious past a few thousand, which is further into the curve than most people think they are.

The Model Context Protocol made connecting a tool almost free, so catalogues grow. Zapier's MCP server advertises 9,000-plus apps behind one connector, among them Gmail, Slack, and Salesforce. Wire a few hundred of those apps in, each exposing several tools, and your catalogue is already in the low thousands, deep enough for the decline to show.

Overlap does most of the damage

The size of the catalogue erodes recall slowly. It gets steep when tools resemble each other in the exact fields the search indexes. To see overlap on its own, I fixed the catalogue at 1,000 tools, the right tool always in it, and changed only the tools around it. When those tools were random and unrelated, the right tool made the top 5 79% of the time. When they were near-duplicates of it, the kind a real catalogue is full of, that fell to 52%. Same count. 27 points gone, purely because the other tools looked like the one I needed.
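A toy version of that overlap test fits in a few lines. This is a sketch, not the experiment's code: it uses a small standard-library BM25 (the common variant with +1 inside the log, which differs slightly from the IDF in rank_bm25), and the tool strings, the query, and the helper names tokenize, bm25_rank and gold_position are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric runs, so "get_forecast" becomes ["get", "forecast"].
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Hypothetical tools: one gold tool, then two kinds of filler around it.
gold = "get_forecast: return the forecast for a given city"
random_fillers = [
    "send_email: send an email message to a recipient",
    "create_invoice: create an invoice in the billing system",
    "list_files: list files in a storage bucket",
    "resize_image: resize an image to a target width",
]
lookalikes = [
    "weather_forecast_daily: daily weather forecast for a city",
    "weather_forecast_hourly: hourly weather forecast for a city",
    "get_weather: current weather for a city",
    "forecast_weather_v2: weather forecast for a location",
]
query = "weather forecast for paris"


def gold_position(fillers):
    # The gold tool is always index 0 of the catalogue; report where it ranks.
    ranked = bm25_rank(query, [gold] + fillers)
    return ranked.index(0)


random_pos = gold_position(random_fillers)
lookalike_pos = gold_position(lookalikes)
print("rank among random fillers:", random_pos)
print("rank among look-alikes:", lookalike_pos)
```

Same catalogue size in both runs. Among the unrelated fillers the gold tool ranks first; among the look-alikes it drops out of the top 3, because the neighbours match the query's words at least as well as it does.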

Real catalogues are the second kind. In that 37,000-tool set, 39% of tools have a near-twin by surface text alone, and the real number is higher, because matching on text misses the tools that mean the same thing in different words. The corpus carries three separate "get all climate change news" tools, from three different API versions. Every service ships a search, a create, a list. search_drive sits next to search_email sits next to search_web, and the retriever has to tell them apart on a name and one line of description. Anthropic names the same failure in the launch post: wrong tool selection happens most when tools have similar names, its example being notification-send-user against notification-send-channel.

At a thousand tools, making them resemble each other cut the right tool's odds of being found from 79% to 52%. The count never changed.

This is also why searching again does not guarantee the right tool comes back. The look-alikes rank high because they resemble the tool you want, so any query that finds your tool tends to find them too. Reword the search and the same near-duplicates come back, just in a different order. The overlap penalty is a property of the catalogue, and rewording the query does not touch it.

Other studies have measured this from the other side. ToolScope, which merges redundant tools "with overlapping names and descriptions" and adds a context-aware retriever, reported gains of up to 38.6% in tool-selection accuracy across three models and three benchmarks. Another study found that editing only a tool's description, leaving the function untouched, changed how often a model picked it by more than a factor of 10. The choice rides on the wording, at the retrieval step I measured and at the selection step they did.

The fix is in the tool definitions

When the search hands the model the wrong tool, my first instinct is to blame the model and reach for a bigger one. The experiment is what changed my mind. A bigger model still sees only the 3 to 5 tools the search returned, and the search ran with no model in it at all. None of these numbers move because the model got smarter.

The next instinct is to blame the BM25 part and reach for embeddings, on the theory that a semantic search knows "find me the weather" and get_current_conditions are the same request. So I ran the whole thing again with real dense retrievers, a small one and the larger model in the same family (BGE-small and BGE-large, normalised cosine). They helped, and hit the same wall.

Retriever, at the 10,000-tool cap	hit@5	top-1
BM25 (lexical, what the tool ships)	57%	36%
BGE-small (dense)	68%	42%
BGE-large (dense)	72%	47%

Going from lexical to semantic buys about 10 points of hit@5; scaling the embedder from small to large buys 4 more. Neither clears the ceiling. Even BGE-large leaves the right tool out of the top 5 about 28% of the time at the cap, and on the full 37,000-tool corpus its best guess is right only 26% of the time. The overlap cliff survives every retriever: at 1,000 tools, BGE-large still falls from 91% on random fillers to 65% on look-alikes. A better retriever lifts the floor, with diminishing returns, and never moves the work off the catalogue.

The cleanup that helps is dull. Merge the duplicate tools. Put the service in the name, so gmail_search_messages instead of a bare search, and write descriptions in the words a real query would use. Anthropic's own troubleshooting guide for the Tool Search Tool says exactly this: when Claude cannot find a tool, "add common keywords to tool descriptions" and use "consistent namespacing in tool names: prefix by service or resource." The vendor that shipped the search also shipped the list of ways it misses. ToolScope measured that end-to-end gain with its full merge-and-retrieve system, and my own numbers size the cleanup piece from the retrieval side: the 27 points between look-alike and random tools at 1,000 is the overlap penalty, which is what deduping and renaming are fighting to win back. It only goes so far. That penalty comes from overlap. The slow decline from sheer count is a separate problem, and cleanup does not touch it, so past a few thousand tools the next move is to stop exposing tools a given task will never touch.
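One way to find merge candidates is to flag pairs of tools whose names and descriptions share most of their words. The sketch below is an illustration, not how the 39% figure above was computed: the Jaccard measure, the 0.6 threshold, the near_twins helper and the three sample tools are all assumptions.

```python
import re
from itertools import combinations


def surface_tokens(tool):
    # Surface text only: the name and the description, lowercased and split.
    text = f"{tool['name']} {tool['description']}"
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def near_twins(tools, threshold=0.6):
    """Return (name_a, name_b, similarity) for pairs above the threshold."""
    pairs = []
    for a, b in combinations(tools, 2):
        ta, tb = surface_tokens(a), surface_tokens(b)
        union = ta | tb
        if not union:
            continue
        similarity = len(ta & tb) / len(union)  # Jaccard overlap of token sets
        if similarity >= threshold:
            pairs.append((a["name"], b["name"], round(similarity, 2)))
    return pairs


# Hypothetical catalogue entries, loosely modelled on the examples above.
tools = [
    {"name": "get_climate_news", "description": "Get all climate change news"},
    {"name": "get_all_climate_change_news_v2", "description": "Get all climate change news"},
    {"name": "gmail_search_messages", "description": "Search Gmail messages by keyword"},
]

twins = near_twins(tools)
print(twins)
```

On this sample only the two climate-news tools are flagged. The pairwise loop is quadratic, so on a catalogue in the thousands you would likely want to bucket tools by service or shared tokens first.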

The whole method fits on one screen

There is no server and no model in the loop. A tool becomes searchable text from the four fields the docs index, BM25 ranks that text against the query, and I check whether the right tool landed in the top 5. ToolRet stores those fields in two different shapes depending on the source API, so the text builder reads both.

import re
import numpy as np
from rank_bm25 import BM25Okapi

def text(tool): # the four fields the docs index, both schemas
    parts = [tool["name"], tool["description"]]
    props = (tool.get("doc_arguments") or {}).get("properties") or {} # schema A
    for argname, spec in props.items():
        parts += [argname, spec.get("description", "")]
    for key in ("required_parameters", "optional_parameters"): # schema B
        for p in (tool.get(key) or []):
            parts += [p.get("name", ""), p.get("description", "")]
    return re.findall(r"[a-z0-9]+", " ".join(parts).lower())

# catalogue: the query's gold tools plus fillers, up to size total, sampled from ToolRet
index = BM25Okapi([text(t) for t in catalogue])

hits = 0
for q in queries: # 1,100 real ToolRet queries
    scores = index.get_scores(re.findall(r"[a-z0-9]+", q.text.lower()))
    top5 = [catalogue[i] for i in np.argsort(scores)[::-1][:5]]
    hits += any(g in top5 for g in q.gold) # hit = any of the query's ~2.39 gold tools in the top 5

recall = hits / len(queries) # 0.57 at a 10,000-tool catalogue

That is the retriever in full, the two argument schemas included. The size curve runs this same loop at each catalogue size, averaged over a few random fills; the look-alike test swaps the random fillers for the target's nearest neighbors. Loading ToolRet and sampling a catalogue is the ordinary dataset plumbing around it. The corpus is public and the scoring is on this screen, so the 57% is yours to check.
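The catalogue sampling is the one piece not on the screen, so here is a minimal sketch of it. It assumes each tool is a dict with a unique "name" key; ToolRet's real identifiers may differ, and sample_catalogue is a name made up for this example.

```python
import random


def sample_catalogue(gold_tools, pool, size, rng):
    """Return the gold tools plus random fillers from pool, `size` tools in total."""
    gold_names = {t["name"] for t in gold_tools}
    # Fillers must not duplicate a gold tool.
    candidates = [t for t in pool if t["name"] not in gold_names]
    n_fill = max(0, size - len(gold_tools))
    fillers = rng.sample(candidates, min(n_fill, len(candidates)))
    catalogue = list(gold_tools) + fillers
    rng.shuffle(catalogue)  # so the gold tools do not always sit at the front
    return catalogue


# Toy pool standing in for the benchmark corpus.
pool = [{"name": f"tool_{i}"} for i in range(50)]
gold_tools = pool[:2]

# One fill per seed; average recall over the fills to get one point on the curve.
fills = [sample_catalogue(gold_tools, pool, 10, random.Random(seed)) for seed in range(3)]
print([len(c) for c in fills])
```

Passing a seeded random.Random keeps each fill reproducible, which matters if you want the curve to come out the same on a rerun.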

Measure your own retrieval recall

My 57% is on a benchmark corpus, at the catalogue size Anthropic's tool supports. Yours will move with your tools, the descriptions attached to each of them, and your retriever. Stop focusing on how many tools you connected. For a sample of your real tasks, consider tracking whether the correct tool was in what the search returned, before the model chose. The loop below walks every search in a response and gathers all the tools they returned, so the number reflects what your agent actually retrieved across all its searches, however many it ran.

# Retrieval recall: did the search return the right tool,
# before the model got to choose? Run this across your eval set.
def retrieved_tools(response):
    # response shape (Anthropic docs): each tool_search_tool_result block carries
    # .content.tool_references[].tool_name; verify the keys against your SDK version
    names = []
    for block in response.content:
        if block.type == "tool_search_tool_result":
            # each search returns the 3 to 5 tools it ranked most relevant
            names += [ref.tool_name for ref in block.content.tool_references]
    return names

hits = 0
for task in eval_set:                 # each task knows the tool it should use
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=1024,  # max_tokens is required; 1024 is an arbitrary choice
        tools=TOOLS, messages=task.messages)
    hits += task.expected_tool in retrieved_tools(resp)

recall = hits / len(eval_set)         # this is your retrieval ceiling

The response shape is in Anthropic's docs: every search returns a tool_search_tool_result block listing the tools it found.

	Context bloat	Wrong tool called
Symptom	55,000 tokens spent before the first message	the right tool is missing from the results, or lost among look-alikes
Fixed by	deferring tool loading	deduping and renaming the overlapping tools
Number to watch	tokens per request	retrieval recall, then selection accuracy on what comes back

When recall is low, at least for now, the work should be in the catalogue, and a model upgrade will not mend it. When recall is high and the model still picks the wrong one, then you have a selection problem, and a better model earns its cost. Reach for the bigger model first and you spend on tool selection, the step the model already handled, while retrieval, the step that was failing, stays broken.
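To tell the two problems apart, score them separately. The sketch below assumes you have logged, per task, the expected tool, the tools the search returned, and the tool the model called; the record keys and the diagnose helper are hypothetical.

```python
def diagnose(records):
    """Return (retrieval_recall, selection_accuracy) for a list of eval records."""
    if not records:
        return 0.0, 0.0
    # Retrieval: did the expected tool reach the model at all?
    reachable = [r for r in records if r["expected"] in r["retrieved"]]
    recall = len(reachable) / len(records)
    # Selection: of the tasks where it did, how often did the model call it?
    if not reachable:
        return recall, 0.0
    selected = sum(r["called"] == r["expected"] for r in reachable)
    return recall, selected / len(reachable)


# Hypothetical eval log with four tasks.
records = [
    {"expected": "a", "retrieved": ["a", "b"], "called": "a"},
    {"expected": "c", "retrieved": ["c", "d"], "called": "d"},
    {"expected": "e", "retrieved": ["f"], "called": "f"},
    {"expected": "g", "retrieved": [], "called": None},
]

recall, selection = diagnose(records)
print(f"retrieval recall {recall:.2f}, selection accuracy {selection:.2f}")
```

Selection accuracy is computed only over tasks where the right tool was retrieved, so a low number there points at the model and not at the search.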

A few limits. I measured a BM25 reproduction and two embedding retrievers, not Anthropic's exact server, so the absolute numbers may differ from production. The tool ships in two modes, regex and BM25; I tested BM25, the one that takes natural-language queries and so is the likeliest to return a loosely matching tool. The regex mode matches name patterns instead, and I did not run it, including against the overlap test, so if you rely on regex, consider measuring that case yourself. The 57% is measured at the 10,000-tool limit the docs set; the 37,000-tool figures run past that cap and are here only to show where the curve heads. The exact numbers are specific to that corpus, which mixes real APIs with synthetic ones. The benchmark also scores only its labeled tools as correct, so some of what I counted as a miss returned a different tool that would have done the job, which means the real failure rate is lower than the raw number. These are also single-shot retrieval numbers, one search per query; a real agent can search again with different words, so its end-to-end miss rate is lower than the per-search figure here.

I tested two embedding models, BGE-small and BGE-large; the larger one added about 4 more points and hit the same wall, so a heavier retriever is unlikely to clear it. What I am sure of is simpler than the exact numbers: recall declines as the catalogue grows, and it declines even more steeply when the tools resemble each other. It occurred with every retriever I tried. Every number above comes from a deterministic script with no API in the loop, and none of it is private: the corpus is the public ToolRet set, the retrievers are a standard BM25 index and off-the-shelf embedding models, and the metric is whether a labeled tool lands in the top k. You can rebuild the curve from those without my code. Plot your own curve before you reach for a bigger model.

What Happens When AI Reads a Page

Alvis Ng — Tue, 07 Apr 2026 13:30:00 +0000

From Tesseract’s character pipeline to GPT’s visual tokens to LandingAI’s agentic decomposition. I tested all five on the same McKinsey report. Every one got the text right. None of them got the charts right.

I uploaded two pages of a McKinsey consulting report to five different OCR systems last week. Same pages. Same prompt. A bubble chart with floating labels and a heatmap table where the data is color, not numbers.

Every system got the text paragraphs perfectly. Every one of them failed on the charts.

But they failed differently. And the differences trace directly to how each system works inside. Not at the API level. At the architecture level. The level where “reading” becomes “predicting” and you can’t tell the difference from the output.

This article is about that architecture. What’s actually happening when a machine looks at a page of text, a table, or a chart, and turns pixels into structured data. Not an API comparison. A mechanism explainer. Because if you use OCR daily, you should understand how it works.

Generation 1: The Pipeline That Reads Character by Character

Before transformers, before neural networks dominated OCR, Tesseract was the standard. It still runs inside thousands of production systems. And it works nothing like what you’re using when you call Claude or GPT with a PDF.

Tesseract (version 4+) processes a page through a multi-stage pipeline:

Stage 1: Page Layout Analysis

The engine scans the image for text regions. Dark pixel clusters on a light background. Group them into lines by vertical alignment. Separate words by horizontal gaps. That’s it. At this stage, the system has no idea what any character says. It only knows WHERE text exists.

Stage 2: Line-Level Recognition

This is the interesting part. Each text line goes to an LSTM (Long Short-Term Memory) network. Not one character at a time. The whole line. Picture yourself squinting at someone’s terrible handwriting. You don’t decode each letter in isolation, right? You look at the shape of the whole word. The LSTM does the same thing. It slides across the line and asks “given everything I’ve seen so far, what character comes next?” A vertical bar followed by a curve? Probably “b” or “d.” The letters around it break the tie.

Stage 3: CTC Decoding

The LSTM spits out a messy probability matrix. Every column is a time step, every row is a possible character. CTC (Connectionist Temporal Classification) cleans this up. The network might say “hh-ee-ll-ll-oo” for “hello” because it stutters across time steps. CTC merges the repeated characters, strips out blanks, and gives you the final string.
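The collapse rule is short enough to write out. This is a minimal sketch of greedy CTC decoding, not Tesseract's own decoder, which may use beam search and a dictionary. The function names and the tiny alphabet are made up, "-" stands for the blank as in the example above, and the matrix here is laid out one row per time step.

```python
def ctc_collapse(path, blank="-"):
    """Merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        # Keep a symbol only if it differs from the previous step and is not blank.
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def greedy_decode(probs, alphabet, blank="-"):
    """Pick the most likely symbol at each time step, then collapse."""
    path = "".join(alphabet[row.index(max(row))] for row in probs)
    return ctc_collapse(path, blank)


print(ctc_collapse("hh-ee-ll-ll-oo"))

# Five time steps over the alphabet: blank, "a", "b".
alphabet = "-ab"
probs = [
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.10, 0.80],
]
print(greedy_decode(probs, alphabet))
```

The blank is what lets a genuine double letter survive: the two l’s in “hello” stay separate because a blank sits between them, while the stuttered “ll” within each run is merged.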

Page Image  
  │  
  ├── Layout Analysis → text regions, lines, words  
  │  
  ├── Per-Line LSTM → probability matrix [timesteps × characters]  
  │  
  ├── CTC Decoder → final character string  
  │  
  └── Post-processing → dictionary lookup, spell correction

The pipeline nature is the key insight. Each stage has ONE job. Layout analysis finds WHERE. The LSTM identifies WHAT. CTC decodes HOW to collapse probabilities into text. When any stage fails, the failure is visible. A layout analysis miss means a text region is skipped entirely, and you get a gap. A recognition error gives you garbled characters with low confidence scores. The system KNOWS it’s uncertain, and it can tell you.

This matters because everything that came after works differently. And fails differently.

Generation 2: Deep Learning Detection + Recognition

Solutions like EasyOCR and PaddleOCR split the problem into two neural networks instead of Tesseract’s monolithic pipeline.

Network 1: Text Detection

A CNN (convolutional neural network) like CRAFT scans the image and produces two heatmaps. One says “this pixel is probably part of a character.” The other says “these two characters are probably next to each other.” Threshold those heatmaps and you get bounding boxes around text regions. Now you know where the words are.
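The thresholding step can be sketched on a toy grid. This is a simplification under stated assumptions: it thresholds a single score map and groups pixels with a 4-connected flood fill, while real CRAFT post-processing is more involved and also uses the second heatmap. The function name and the sample values are invented.

```python
def boxes_from_heatmap(heat, threshold=0.5):
    """Return (x0, y0, x1, y1) boxes around connected regions above threshold."""
    height, width = len(heat), len(heat[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if heat[y][x] < threshold or seen[y][x]:
                continue
            # Flood fill from this pixel, tracking the extent of the region.
            stack = [(y, x)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cy, cx = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not seen[ny][nx] and heat[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes


# A 3x5 toy heatmap with two separate text blobs.
heat = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.6, 0.7],
    [0.0, 0.0, 0.0, 0.8, 0.9],
]
boxes = boxes_from_heatmap(heat)
print(boxes)
```

Raise the threshold and faint text disappears from the output entirely, which is the “no output for that region” failure in its simplest form.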

Network 2: Text Recognition

Crop each detected region. Resize it to a standard height. Feed it to a CRNN (Convolutional Recurrent Neural Network). The CNN looks at the pixels and extracts visual features. The RNN (bidirectional LSTM) reads those features as a sequence in both directions, left to right and right to left. CTC decoding at the end, same as Tesseract. Out comes the text.

The improvement over Tesseract? The detection network handles rotated text, curved text, text on complex backgrounds. The recognition network is deeper and more accurate. But it’s still a pipeline. Detection first, then recognition. Two networks. Two failure points. Both visible.

When CRAFT fails to detect a text region, you get no output for that region. When the CRNN misrecognizes a character, the confidence score drops. You can build validation logic around both failure modes because the system’s uncertainty is measurable.
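Validation logic of that kind can be as small as a threshold. A minimal sketch, assuming your OCR library hands back (text, confidence) pairs with confidence between 0 and 1; the 0.8 cutoff, the function name and the sample results are placeholders to tune against your own documents.

```python
def split_by_confidence(results, min_conf=0.8):
    """Separate OCR results into accepted values and ones that need review."""
    accepted, needs_review = [], []
    for text, conf in results:
        if conf >= min_conf:
            accepted.append((text, conf))
        else:
            needs_review.append((text, conf))
    return accepted, needs_review


# Hypothetical recognition output for three text regions.
results = [("Invoice", 0.98), ("T0tal", 0.41), ("330", 0.87)]
accepted, needs_review = split_by_confidence(results)
print("accepted:", accepted)
print("needs review:", needs_review)
```

The point is less the threshold than the fact that there is a number to threshold on.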

Generation 3: End-to-End Transformers (Chandra, TrOCR)

This is where the architecture shift happens.

Chandra OCR and similar models (TrOCR, Donut, Nougat) abandon the pipeline. Instead of “detect then recognize,” they process the entire page in a single forward pass.

Chandra is built on a fine-tuned vision-language model (from the Qwen-VL family). It’s not a classic encoder-decoder like T5 or BART. It’s a VLM that was specifically trained to generate structured document output. The architecture:

Vision encoder

The page gets chopped into patches. 16x16 pixels each. Each patch is flattened into a vector, projected through a linear layer, and tagged with a positional embedding so the model knows where on the page it came from. Then self-attention: every patch can attend to every other patch. A patch showing the top of a “T” can look at the patch below it to confirm it’s a “T” and not an “I.” The output is a sequence of visual feature vectors, each one aware of the entire page.
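The arithmetic behind the patch step is worth seeing once. The sketch below only counts patches and positions; it does not run a model. The patch_grid helper is a made-up name, the 1024×1024 page is an example size, and 3 colour channels per pixel is an assumption.

```python
def patch_grid(width, height, patch=16):
    """Return (rows, cols, positions) for a page cut into square patches."""
    cols, rows = width // patch, height // patch
    # One (row, col) index per patch; this is what a positional embedding encodes.
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    return rows, cols, positions


rows, cols, positions = patch_grid(1024, 1024)
values_per_patch = 16 * 16 * 3            # flattened RGB patch before projection
attention_pairs = len(positions) ** 2     # every patch attends to every patch

print(rows, cols, len(positions))
print(values_per_patch, attention_pairs)
```

The quadratic pair count is why resolution is expensive for these models: doubling both page dimensions quadruples the patches and multiplies the attention pairs by sixteen.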

Language decoder

Takes those visual features and generates text. Token by token. Left to right. Same autoregressive process as a chatbot generating a response, except the “prompt” is a grid of visual features instead of text tokens.

Page Image (e.g., 1024×1024 pixels)  
  │  
  ├── Split into patches (e.g., 4,096 patches of 16×16)  
  │  
  ├── Linear projection → patch embeddings  
  │  
  ├── + Positional embeddings (where on the page)  
  │  
  ├── Vision Encoder (self-attention across ALL patches)  
  │   └── Output: visual feature vectors  
  │  
  └── Autoregressive Language Decoder (fine-tuned Qwen-VL)  
      ├── Attends to visual features + previous tokens  
      └── Generates: structured text (Markdown, HTML, JSON)

No detection step. No bounding boxes first. The model learns WHERE and WHAT simultaneously. A patch containing part of a letter attends to adjacent patches to figure out what it’s looking at. The self-attention mechanism is doing the spatial reasoning that CRAFT and Tesseract had to engineer explicitly.

Chandra specifically uses this approach with what it calls “full-page decoding.” Rather than processing individual text lines, it takes in the entire page and generates structured output (Markdown, HTML, or JSON) that preserves the layout. It also provides spatial grounding: layout blocks are classified into 16+ types (text, section-header, caption, footnote, table, form, list-group, image, figure, diagram, equation-block, code-block, and more), each with bounding box coordinates. It achieved 85.9% on the olmocr benchmark, state-of-the-art for open-source OCR.

The strength: layout preservation

Because the model sees the entire page at once, it can understand that a heading relates to the paragraph below it, that a table cell belongs to a specific row and column, that a footnote marker connects to a footnote at the bottom. The pipeline approaches can’t do this because they process text regions independently.

The weakness: the decoder is autoregressive

It predicts tokens one at a time. And like any autoregressive model, it can hallucinate. If the visual features are ambiguous (a smudged character, a low-resolution scan), the model doesn’t output a low-confidence garbled character like Tesseract would. It outputs a plausible character. Confidently. Silently.

Generation 4: Vision Language Models (Claude, GPT, Gemini)

When you upload a PDF to Claude or GPT and ask it to extract the text, you’re using a vision language model. This is architecturally different from both traditional OCR and end-to-end transformers like Chandra. And the three major providers do it differently from each other.

The general pattern is the same, which is to convert the image into visual tokens, combine them with your text prompt, and let the language model generate a response. But the HOW varies:

GPT-5 is natively multimodal. It was trained from scratch on text and images simultaneously. The visual and language capabilities developed together during training, not bolted on after the fact. It supports configurable resolution: a [_detail_](https://developers.openai.com/cookbook/examples/multimodal/document_and_multimodal_understanding_tips) parameter controls whether the model processes your image at standard or original resolution. For documents with small labels, dense tables, or fine print, switching to detail="original" lets the model see individual pixels up to 10 million pixels without compression. This architectural choice means GPT-5 doesn't have a separate "vision module." Images and text live in the same representational space from the start.

Claude takes a different approach. The image is converted to tokens using a pixel-area formula: tokens = (width × height) / 750. Images with a long edge over 1,568 pixels are downscaled before processing. The visual tokens are then processed alongside text tokens by the language model. Claude's vision docs note specific limitations: limited spatial reasoning, approximate (not precise) object counting, and potential hallucinations on low-quality or rotated images. These limitations map directly to the failure modes we saw in the experiment.
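Those two rules are enough for a rough cost estimate. A minimal sketch using only the formula and the 1,568-pixel long edge quoted above; the function name is made up, the rounding is my choice, and other limits in Anthropic's docs may apply that this does not model.

```python
def estimate_claude_image_tokens(width, height, max_edge=1568):
    """Return (width, height, tokens) after the assumed downscale."""
    long_edge = max(width, height)
    if long_edge > max_edge:
        # Shrink both sides by the same factor so the long edge fits.
        scale = max_edge / long_edge
        width, height = round(width * scale), round(height * scale)
    return width, height, round(width * height / 750)


small = estimate_claude_image_tokens(1000, 1000)
large = estimate_claude_image_tokens(3136, 2000)
print(small)
print(large)
```

A 3,136-pixel-wide scan is processed at half its linear resolution, so small labels shrink to half their pixel height before the model ever sees them.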

Gemini uses a sparse Mixture-of-Experts (MoE) transformer architecture trained to be natively multimodal from the ground up. Google’s docs describe it as “built to be multimodal from the ground up.” The MoE design activates only a subset of model parameters per input token, routing each token to specialized “experts” within the network. This means Gemini can scale total model capacity without proportionally increasing compute cost per token. For vision, Gemini also supports bounding box detection with normalized coordinates (0–1000), giving it spatial awareness that the other two providers don’t expose at the API level.
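Normalized coordinates need one conversion before you can draw or crop with them. The sketch below assumes each box arrives as [ymin, xmin, ymax, xmax] on the 0–1000 scale, which is how I read Google's docs; check the order against the response you actually get. The to_pixels helper and the sample box are invented.

```python
def to_pixels(box, width, height):
    """Convert a 0-1000 normalized [ymin, xmin, ymax, xmax] box to pixel (x0, y0, x1, y1)."""
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


# Hypothetical box on a 2000x1000 pixel page image.
pixel_box = to_pixels([100, 250, 500, 750], width=2000, height=1000)
print(pixel_box)
```

Because the scale is relative, the same box stays valid if you resize the page image, as long as you pass the new width and height.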

Despite these architectural differences, all three follow the same high-level flow:

Document Image  
  │  
  ├── Visual Encoding (architecture-specific)  
  │   ├── GPT-5: native multimodal, configurable resolution up to 10M pixels  
  │   ├── Claude: pixel-area tokenization (w×h/750), 1568px downscale  
  │   └── Gemini: sparse MoE, unified multimodal space  
  │  
  ├── [visual tokens] + [prompt tokens: "extract all text..."]  
  │  
  └── Language Model generates text output  
      └── Not structured OCR. Text completion conditioned on visual tokens.

The critical difference from Chandra (Generation 3) is that none of these models were trained specifically for OCR. GPT-5 and Gemini are natively multimodal, but they’re general-purpose models trained on everything, not document specialists. Claude adds vision to a language-first architecture. All three CAN extract text from images. None were BUILT to do it. Chandra was fine-tuned specifically on document extraction tasks. That’s a different optimization target, and it shows in the results.

Recent research has mapped what’s happening inside this process at a granular level. Earlier work by Baek et al. (2025) identified “OCR heads,” specific attention heads that specialize in text recognition. A paper from February 2026, “Where Vision Becomes Text”, used causal interventions to locate exactly where these OCR heads operate and how the text signal routes through different architectures. These heads are qualitatively different from general retrieval heads. They concentrate in specific layers (L12, L16-L20 in the models tested) and have less sparse activation patterns than other attention heads.

The OCR signal is remarkably low-dimensional: the first principal component captures 72.9% of the variance in how these heads process text. This means the model processes text through a narrow bottleneck, and that bottleneck’s location depends on the architecture. In DeepStack models (like Qwen3-VL), the bottleneck appears at mid-depth (~50% of layers). In single-stage projection models (like Phi-4), it peaks at early layers (6–25%).

Why does this matter for you? Because the model isn’t doing OCR in the way you think. It doesn’t have a dedicated text-reading module. It has attention heads that LEARNED to read text as a byproduct of training on massive datasets of images paired with text descriptions. The “reading” is emergent, not engineered.

And that’s why it fails differently than everything before it.

Why Vision Models Fail Silently

When Tesseract encounters a character it can’t read, the confidence score drops. You get a garbled output or a low-confidence warning. The failure is noisy. You can catch it.

When a vision language model encounters a chart with floating labels, something different happens. The model sees visual tokens for the label “Smartphones” near visual tokens for both the Electronics row and the Other Manufacturing row. It needs to decide which row the label belongs to. It makes this decision the same way it decides anything: by predicting what token would most plausibly come next, given the visual context and the text generated so far.

If the model is generating a list of products under “Electronics” and it has already listed several items, the next-token probability for “Smartphones” might be high simply because smartphones are associated with electronics in the training data. Not because of the spatial position of the label on the page. The model is predicting what SHOULD be there based on world knowledge, not reading what IS there based on pixel positions.

This is why my experiment produced the results it did. But first, one more approach to understand.

The Hybrid: Agentic Document Extraction (LandingAI)

LandingAI’s Agentic Document Extraction isn’t a new generation of OCR. The model underneath, DPT-2, is still a transformer. What’s different is what happens BEFORE the model reads anything.

Every other system on this list looks at the whole page and says “what text is here?” LandingAI looks at the page and says “what KIND of things are on this page?” first.

Stage 1: Document Decomposition

Before reading a single character, DPT-2 classifies every region on the page. That block of text? Paragraph. That grid? Table. That scatter plot? Chart. That squiggle in the corner? Signature. It identifies text, tables, charts, images, headers, footers, captions, logos, barcodes, QR codes, forms, even handwritten margin notes. Each one gets a bounding box with pixel coordinates. The system knows WHAT it’s looking at before it tries to read it. That’s the difference.

Stage 2: Table Structure Prediction

Tables get special treatment. DPT-2 maps the geometry first: where do rows start, where do columns end, are any cells merged? It breaks the table into individual cell regions before extracting text from them. This is why it doesn’t hallucinate rows or shift columns the way vision models do. It understands the grid before reading the content.

Stage 3: Cell-Level Extraction with Visual Grounding

Each cell’s contents are extracted and paired with a bounding box. This is the key architectural difference: every extracted value traces back to exact pixel coordinates on the source page. If DPT-2 says a cell contains “330,” you can verify that by checking the bounding box against the original image. Vision models give you text with no coordinates. LandingAI gives you text with a return address.

Stage 4: Agentic Reasoning

An AI agent coordinates the outputs from different components, resolving cross-element references. If a text block says “see the table above,” the agent links them. Because each region is processed independently, the system can parallelize extraction across components.

Document Image  
  │  
  ├── Document Decomposition (multiple element types)  
  │   ├── Text blocks     → with bounding boxes  
  │   ├── Tables           → with structure prediction (rows, cols, merged cells)  
  │   ├── Charts           → with chart-type classification  
  │   ├── Images           → with captioning  
  │   └── Forms, headers, signatures, barcodes...  
  │  
  ├── Component-Specific Processing  
  │   ├── Text → text extraction + coordinates  
  │   ├── Table → cell-level extraction + per-cell bounding boxes  
  │   ├── Chart → data extraction  
  │   └── Image → captioning  
  │  
  ├── Agentic Reasoning (cross-component linking)  
  │  
  └── Structured Output with Visual Grounding  
      └── Every element has: content + page number + bounding box coordinates

Now you know how all five approaches work. Let’s see what happens when they meet the same document.

The Experiment: Same Pages, Five Systems, Five Architectures

I took two pages from the McKinsey Global Institute: 2025 in Charts report. Page 12: a bubble scatter chart mapping US-China trade rearrangement ratios across 13 industry sectors, with circles of varying sizes, floating product labels, diamond markers for sector averages, and a continuous X axis from 0 to 1.25+.

Page 12 of McKinsey Global Institute: 2025 in Charts. The bubble chart that broke every OCR system I tested.

Page 22: a heatmap table showing economic empowerment factors across 11 countries, with 9 columns of color-coded cells (light to dark blue), rotated column headers, three income-group sections, and numerical columns for population and empowerment share.

Page 22. The “data” here is color, not numbers. None of the five systems could reliably read it.

I chose these pages because they combine text (left-column prose that any system should handle) with visual data encoding (spatial position, bubble size, color intensity) that forces each system to do more than read characters.

The five systems and how I ran each:

The same prompt was used for all three vision models to keep the comparison fair. LandingAI and Chandra don’t accept prompts, which is itself a finding: they’re opinionated about HOW to extract, not waiting for you to tell them.

Timing and cost:

System          Wall time   Tokens consumed  
──────────────────────────────────────────────────────────  
GPT-5.1         38.6s       2,717 tokens  
Claude Sonnet   43.4s       5,126 tokens  
Claude Opus     60.3s       6,385 tokens  
LandingAI       19.9s       N/A (proprietary, 6 credits)  
Chandra         ~15s        N/A (playground, free tier)

Chandra was fastest (~15s), followed by LandingAI (19.9s). Claude Opus consumed the most tokens because it attempted the most detailed output (including estimated ratio values and a 44-cell color assessment).

Results: Page 12 (Bubble Chart)

On the bubble chart, every system extracted the prose text in the left column perfectly. The chart data diverged:

GPT-5.1 listed products as flat text with no spatial structure. Products ended up assigned to the wrong sectors (Video game consoles and Charcoal barbecues both landed under “Other manufacturing”). No ratio values. No indication of where anything sat on the X axis. The model read every label but couldn’t determine which label belonged to which row.

Claude Sonnet made similar sector misassignments: Plastic footwear under Textiles, Tungsten carbide under Transportation. It extracted the chart title, legend, and axis labels cleanly, but the spatial mapping from product label to sector row was wrong in multiple places. No ratio values.

Claude Opus was the most ambitious. It estimated ratio values for each product: “Logic chips (small, ~0.05),” “Smartphones (large, ~0.5),” “Laptops (large, ~1.0).” It was the only system that tried to read WHERE on the axis each bubble sat. But these values were approximations (the model was guessing position from pixel proximity), and several product-to-sector mappings were still wrong. Cotton T-shirts appeared under “Other manufacturing” instead of Textiles.

LandingAI produced the most structured output. It explicitly annotated the chart as a chart block with axis definitions, legend items, and products grouped by sector. Most sector assignments were correct. But it didn’t attempt ratio values. It described the structure faithfully without reading the quantitative data.

Chandra (with Chart Understanding enabled) extracted every label but with zero spatial structure. Products listed flat, not mapped to sectors, not positioned on the axis. Chart Understanding did not produce a structured table for this bubble chart type.

Results: Page 22 (Heatmap Table)

On the heatmap, the failure was more stark. The “data” is the shade of blue in each cell. A human reads it instantly: Japan’s working-age population cell is darkest (most important factor). Brazil’s job opportunities cell is darkest. India’s food cell is dark.

GPT-5.1 gave up entirely. For every country row, it output: “[Row of 9 colored squares].” It extracted the numbers on the right perfectly (330, 79 for the United States; 1,420, 29 for India). But the entire heatmap, the insight of the visualization, was replaced with a placeholder.

Claude Sonnet tried. It used labels like “light,” “medium,” “[dark],” “[darkest].” It got Japan’s working-age population right as “[darkest].” But it called South Africa’s labor participation “medium” when it’s one of the darkest cells. Inconsistent, with no confidence scores.

Claude Opus tried hardest. It used a five-level scale, with labels including “medium,” “medium-high,” “high,” and “highest/darkest.” It produced a detailed breakdown for every country and every column, 44 individual color assessments, plus a separate “Chart Detail Notes” section repeating the analysis. Some assessments were right. Some were wrong. All were stated with equal confidence.

LandingAI took a different approach. Instead of describing colors, it inferred importance levels and expressed them as ratings: “Medium (3/5),” “High (4/5),” “Highest (5/5).” It produced a full structured markdown table with every cell filled. The most usable output for downstream processing. But also the most interpretive: it wasn’t reading color, it was inferring meaning from visual intensity.

Chandra got the numbers right and the text structure. But the heatmap data was completely absent. No colors, no levels, no mapping of the 9 columns to country rows.

Page 22 Heatmap Extraction Summary  
═══════════════════════════════════════════════════════════  
System          Numbers  Color data       Column mapping  
──────────────────────────────────────────────────────────  
GPT-5.1           ✓      ✗ (gave up)        ✗  
Claude Sonnet     ✓      Partial            ✓ (table)  
Claude Opus       ✓      Best attempt       ✓ (table + notes)  
LandingAI         ✓      Inferred (3/5-5/5) ✓ (structured)  
Chandra           ✓      ✗                  ✗  
──────────────────────────────────────────────────────────  
Nobody got it fully right. Nobody told you they were uncertain.

What the Experiment Proves

The results map directly to the architectures.

Vision language models (GPT, Claude) treated the whole page as one image and predicted text from visual tokens. Text? Easy. Left-to-right, consistent spacing, clear patterns. Charts? A mess. Which label belongs to which bubble? The model doesn’t know. It guesses based on what it’s seen in training data, not based on pixel coordinates. And color-as-data (light blue vs dark blue meaning different things)? These models weren’t trained to interpret shade as a category. So they either skip it or hallucinate.

LandingAI knew it was looking at a chart before it tried to read it. The decomposition step gave it a head start. It classified the region, applied chart-specific extraction, and produced structured output. Not perfect, but structured. That’s the architectural advantage of classifying first, reading second.

Chandra is optimized for text extraction with layout preservation. It’s state-of-the-art for reading text from documents. But when the “data” is color or spatial position rather than characters, it has the same blind spot as the general vision models. Full-page decoding helps with layout, not with visual data encoding.

On both pages, across all five systems, the output was clean, confident, and structured. No uncertainty markers. No confidence scores. No indication that the chart data was guessed rather than read.

How to Choose: A Decision Framework

After running this experiment and digging into the architectures, the decision isn’t “which tool is best.” It’s “which tool matches what’s on your page.”

The worst mistake you can make is trusting one tool for everything. The second worst is trusting any tool on chart data without validation. Every system I tested produced confident, clean output on chart data that was partially or fully wrong. None told me they were uncertain.
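Matching the tool to what is on the page can be sketched as a small router. Everything named here is an assumption for illustration: the region labels are presumed to come from some layout classifier, the tool names in `ROUTES` are placeholders, and the "more than half" threshold is arbitrary.

```python
from collections import Counter

# Hypothetical tool names; substitute whatever extractors your pipeline uses.
ROUTES = {
    "text": "text_ocr",
    "table": "layout_aware_parser",
    "chart": "layout_aware_parser_plus_human_review",
}
FALLBACK = "layout_aware_parser_plus_human_review"


def classify_page(region_types):
    """Label a page by its dominant region type, or 'mixed' if none dominates."""
    if not region_types:
        return "mixed"
    label, top = Counter(region_types).most_common(1)[0]
    # "Dominant" means more than half of the regions; an arbitrary threshold.
    return label if top * 2 > len(region_types) else "mixed"


def route(region_types):
    """Pick an extractor for a page; mixed or unknown pages take the safest path."""
    return ROUTES.get(classify_page(region_types), FALLBACK)


print(route(["text", "text", "chart"]))  # text_ocr
print(route(["chart", "table"]))         # layout_aware_parser_plus_human_review
```

Sending mixed and unknown pages to the reviewed path is deliberate: given the results above, the cost of an unnecessary human check is lower than the cost of confident, wrong chart data.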

If you’re building a document processing pipeline, the architecture should match the document:

Classify first. Before extraction, classify each page: is it mostly text, a table, a chart, or mixed? Route different page types to different tools.
Validate chart data. Any data extracted from charts, heatmaps, or visual encodings needs a human check or a secondary verification source. The data exists somewhere else (a spreadsheet, a database) before it becomes a chart. Get the source data when possible.
Build ground truth assertions. I created JSON files with expected values per page: titles, dates, specific numbers, opening sentences. Each assertion has a match type and a category. Run every OCR output against these assertions automatically.

{  
  "pdf_filename": "mckinsey-global-institute-2025-in-charts.pdf",  
  "assertions": [  
    {  
      "page": 12,  
      "expected": "Ease of rearrangement varies across products",  
      "category": "chart_title",  
      "match_type": "present"  
    },  
    {  
      "page": 12,  
      "expected": "Frozen tilapia fillets",  
      "category": "data_label",  
      "match_type": "present"  
    },  
    {  
      "page": 22,  
      "expected": "330",  
      "category": "table_value",  
      "match_type": "exact",  
      "description": "United States population in millions"  
    }  
  ]  
}

If the model says “December 2025” but your ground truth says “December 2024,” you catch it before it hits your vector database or production. If it misses “Frozen tilapia fillets” entirely, you know it dropped a data label. This is how you turn silent failures into noisy ones.

The JSON above is a simplified example. My full validation workflow runs assertions across an entire data pipeline with automated scoring per provider. That’s a story for another article.
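A minimal sketch of a runner for assertions in the JSON format shown above. This is not the full workflow mentioned in the text. The matching rules are assumptions: “present” is treated as a case-insensitive substring match and “exact” as a standalone whitespace-delimited token, and the `pages` dictionary is toy extracted text, not real output from any of the five systems.

```python
import json

SPEC = json.loads("""
{
  "assertions": [
    {"page": 12, "expected": "Ease of rearrangement varies across products",
     "category": "chart_title", "match_type": "present"},
    {"page": 12, "expected": "Frozen tilapia fillets",
     "category": "data_label", "match_type": "present"},
    {"page": 22, "expected": "330",
     "category": "table_value", "match_type": "exact"}
  ]
}
""")


def check(assertion, page_text):
    """Return True if one assertion holds for the extracted text of its page."""
    expected = assertion["expected"]
    if assertion["match_type"] == "present":
        # Assumed semantics: case-insensitive substring match.
        return expected.lower() in page_text.lower()
    if assertion["match_type"] == "exact":
        # Assumed semantics: the value appears as a standalone token.
        return expected in page_text.split()
    raise ValueError(f"unknown match_type: {assertion['match_type']}")


def run_assertions(spec, pages):
    """Return the assertions that fail; a page with no text fails everything."""
    return [a for a in spec["assertions"] if not check(a, pages.get(a["page"], ""))]


# Toy extracted text keyed by page number (made up for this example).
pages = {
    12: "Ease of rearrangement varies across products. Logic chips, Smartphones, Laptops",
    22: "United States 330 79",
}
failures = run_assertions(SPEC, pages)
for a in failures:
    print(f"FAIL page {a['page']} [{a['category']}]: {a['expected']!r}")
# FAIL page 12 [data_label]: 'Frozen tilapia fillets'
```

The token-based “exact” rule would miss a value followed by punctuation (for example `330,`), so a real pipeline would likely normalize the text first.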

The model will never tell you it’s unsure. That’s your job.

Nobody Wants to Learn AI

Alvis Ng — Fri, 03 Apr 2026 02:04:09 +0000

The “lifelong learner” identity isn’t aspiration. It’s a subscription you can’t cancel.

Last week, someone in your feed posted about being "thrilled to build AI agents this weekend." Forty-seven likes. Fire emojis. "Love the growth mindset!" You stared at it and felt something you couldn't name. Not jealousy. Not inspiration. Something closer to recognition. The way you recognize a mask because you're wearing the same one.

Nobody is thrilled to learn agent frameworks on a Saturday. They're afraid of what happens if they don't.

You probably have your own version of this. A browser tab you keep meaning to open. A course you bookmarked during a sale and never started. A Slack thread about "AI readiness" that you skimmed and closed. The quiet admission that you don't know enough to stay relevant, and the quieter admission that bookmarking the resource let you feel like you were doing something about it.

You've been writing production code for years. By most reasonable measures, you know what you're doing.

And yet.

That tab sits there. Glowing faintly. A talisman against obsolescence.

The Same Activity, Two Different Meanings

Early in your career, you learned a framework because it was genuinely exciting. A new mental model for building interfaces. You stayed up late not because you were anxious but because you couldn't stop. The learning felt like building. Like adding rooms to a house you were just beginning to inhabit.

Now you open a course tab because your company started an "AI readiness" initiative. There's no mandate, technically. Just a Slack message from leadership about "staying ahead of the curve" and a shared spreadsheet where you can log your upskilling hours. Voluntary, of course. The way salary negotiations are voluntary.

The distinction matters because the industry doesn't acknowledge it. Every conference talk, every LinkedIn post, every corporate learning initiative treats all learning as the same species. Growth. Development. Curiosity. Whether you're a new grad exploring your first language or a career-changer using Coursera to escape a dead-end job or a 15-year veteran trying to prove you're not a dinosaur, it's all filed under the same aspirational banner.

But the body knows the difference. Curiosity feels like leaning forward. Defense feels like bracing.

I want to be clear about something: for some people, these platforms are genuinely life-changing. The self-taught developer who used a Coursera scholarship to move from data entry to engineering, the career-changer who learned Python on free YouTube tutorials. That's real. That matters. This isn't about those people. This is about a system that takes their stories and uses them as marketing for something very different: the perpetual anxiety machine that tells experienced professionals their decade of work might expire next quarter.

The Thing You Bought vs. The Thing You Got

Here's where the economics get clarifying.

The average MOOC completion rate is below 10% for free-track learners, with some studies putting it as low as 3%. More than nine out of ten people who enroll in an online course never finish it. They buy access. They don't buy knowledge. They buy the feeling of progress, the same way a gym membership on January 2nd buys the feeling of fitness.

The corporate side isn't better. A 2014 study by Gartner found that 45% of corporate training qualifies as "scrap learning," content employees complete but never apply on the job. That Kubernetes course you watched on double-speed during lunch breaks? By performance review season, you'll have retained the certificate and almost none of the knowledge.

The upskilling industry doesn't sell knowledge. It sells the feeling of doing something, and even that feeling has a completion rate of 3%.

The Skill That Expires vs. The Skill That Compounds

This is the part nobody talks about, because talking about it would break the business model.

Not all skills depreciate at the same rate. Your knowledge of a specific framework has a half-life of about two to three years. React hooks, Kubernetes configurations, the agent framework API you learned last month: the specifics you picked up this quarter will be partially obsolete by next year and largely irrelevant in three. This isn't a failure of learning. It's the nature of implementation knowledge. It's perishable by design.

Your ability to debug a system you've never seen before has a half-life measured in decades. Architectural judgment, the instinct for where complexity will hurt you later, the capacity to read a codebase and understand not just what it does but why someone built it that way: these are durable skills. They compound with experience instead of depreciating with time.

Think about the SRE on your team who has spent eight years keeping a payments system alive through migrations, acquisitions, and near-catastrophic failovers. Who knows where every brittleness hides. Who can diagnose cascading failures from the shape of a latency graph the way a cardiologist reads an EKG. Now imagine their manager asking what they're doing to "stay current with AI." Their entire career is a durable skill. The system that evaluates them can't see it.

The upskilling industry has no product for that kind of expertise. You can't package it into a six-week course with a certificate. It develops slowly, through years of building and maintaining real systems, through the painful process of watching your own architectural decisions age and learning which ones held. There is no shortcut. There is no 90%-off sale.

So the industry sells you the perishable kind. Over and over. React today, agent frameworks tomorrow, whatever tool emerges next quarter. Each course addresses a skill with a two-year shelf life, which means you'll need another course in two years. And another after that. The business model depends on your knowledge expiring.

Your React knowledge has a half-life of about two years. Your ability to debug a system you've never seen has a half-life of decades. The industry charges you quarterly for the first and has no idea how to price the second.

I should say the obvious thing: sometimes you do need the perishable skill. Sometimes the new framework is the right tool and learning it is the right call. And if you're four years in, still building your foundation, the cruelest part is this: you're being told the perishable stuff is all that matters before you've had time to develop anything durable. The system is pricing your potential at zero while demanding you pay full rate for currency.

The problem isn't learning new things. The problem is an economy that treats perishable and durable skills as interchangeable, prices them identically at hiring time, and then acts surprised when experienced engineers feel like they're running in place. Durable skills make you a terrible customer. Perishable skills make you a subscription.

The Confession I Keep Editing

I should be honest about something, and more honest than I usually am.

I build AI systems. I know what these tools can and can't do. And I still feel it. The pressure to signal that I'm keeping up. The impulse to post about the latest framework on LinkedIn not because I learned something meaningful, but because the keyword matters more than the knowledge. I know it, and I've done it anyway.

I've watched engineers I respect get evaluated on whether they "adopted AI tools" instead of whether they built systems that worked. I've seen performance rubrics that reward course completion and penalize the kind of deep, quiet expertise that keeps production systems alive. The rubric measures compliance, not capability. It rewards the perishable and ignores the durable. And the people filling out those rubrics know it. They tighten their jaws and check the boxes because the alternative is admitting the system they're enforcing is broken.

I know because I've been on both sides of that table. The feeling is the same: you're performing competence instead of practicing it.

The One Question That Tells You Everything

Next time you're about to open a course, a tutorial, a weekend project with a new framework, ask yourself one question:

Will I use this in the next six months, or am I learning it because having it on my profile makes me feel safer?

The first is investment. The second is a minimum payment on a debt that keeps growing.

If the answer is "I'll use it," learn it. Learn it deeply. Build something real with it. That's not anxiety. That's craft.

If the answer is "it makes me feel safer," close the tab. Spend that hour reading the codebase you already maintain. Understand the system you already own. Build the durable skill that no course can teach and no certificate can prove. The thing that makes you the person in the room who says "this will break in six months" and is right.

The upskilling industry can't sell you that. Which is exactly why it's valuable.

Competence Debt

That tab is still open.

Some mornings I open my browser and it's there, between Jira and Slack, glowing with the particular patience of things that know they've already won. I'll click it eventually. Not because the course will teach me something durable. Not because it will make me meaningfully better at the work that actually matters. But because the credential economy demands proof of currency, and currency is what expires.

Somewhere, someone is genuinely excited about AI agents, staying up late because the possibilities feel electric. Not because of a performance review. Not because of a Slack message from leadership. Because the thing itself is interesting. That feeling still exists. Most of us just can't reach it through the anxiety anymore.

Here's what I've started calling it: competence debt. The accumulation of perishable certifications while your durable skills atrophy from neglect. Every hour spent on a course that teaches you the API of the moment is an hour not spent understanding the system you've been maintaining for three years. Every certificate is a minimum payment on a debt that keeps growing. You feel productive. Your profile looks current. And underneath, the skills that would actually make you irreplaceable are quietly compounding interest in the wrong direction.

The industry taught us to call this growth.

The rubric was designed to measure it.

The word we were all looking for is depreciation.

—Viz