Mark Turner

Posted on May 28 • Edited on May 30

I gave my AI agent a 2MB PDF. Here's what happened to my token count.

#ai #llm #agents #mcp

Every token your agent spends on file I/O is wasted reasoning capacity.

I was building a document processing agent — the kind that reads incoming research reports, extracts key findings, and produces executive briefings. Nothing exotic. The kind of workflow thousands of teams are automating right now.

The PDF I was testing with was 2MB. Dense text. A typical industry research report.

When I measured the token cost of processing it inline, the number was 97,354 input tokens, just to get the text into Claude's context. At claude-sonnet-4-6 pricing, that's $0.29 per document. For a pipeline that processes 500 reports a month, that comes to about $146/month before your agent writes a single word of output.
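The arithmetic behind those figures is easy to reproduce. The sketch below assumes an input price of $3.00 per million tokens, which is the rate implied by the article's own numbers rather than a quoted price list; the function name is illustrative.

```python
# Assumption: $3.00 per million input tokens, the rate implied by the
# article's figures (97,354 tokens -> about $0.29).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-token cost in dollars for one request."""
    return tokens / 1_000_000 * price_per_million


per_document = input_cost(97_354)
per_month = per_document * 500  # 500 reports a month, as in the article

print(f"per document: ${per_document:.2f}")
print(f"per month:    ${per_month:.0f}")
```

Swapping in a different token count or price shows how quickly the monthly figure moves with volume.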

That's the problem nobody talks about in the AI agent space. Everyone optimises prompt engineering and output tokens. The silent cost is input: the files, the content, the raw data you're shoving into context before the agent can do anything useful.

How the token count explodes

When you pass a document to an agent inline, one of two things happens:

Option A — Base64 encoding. You read the binary file, encode it, embed it in the prompt. A 2MB PDF in base64 is ~2.7MB of text. At roughly 3.5 characters per token, that's ~770,000 tokens before your agent has read a single word. This is catastrophic. Don't do this.

Option B — Text extraction. You extract the raw text content first (via pdftotext, PyMuPDF, or equivalent), then pass the text to the agent. Better — but a 2MB PDF with dense content still yields ~97,000 tokens of extracted text. You've paid for every word, every header, every footnote.

Either way, the document content dominates your context window, crowds out your system prompt, and you're burning money on file I/O instead of reasoning.
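The Option A estimate can be checked without touching a real file. This sketch assumes "2MB" means 2,000,000 bytes and reuses the article's rough figure of 3.5 characters per token; both helper names are hypothetical, and a real tokenizer will give a different count.

```python
def base64_length(num_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of num_bytes bytes."""
    # Every 3 input bytes become 4 output characters, rounded up to a full group.
    return 4 * ((num_bytes + 2) // 3)


def estimate_tokens(num_chars: int, chars_per_token: float = 3.5) -> int:
    """Rough token estimate; 3.5 characters per token is the article's figure."""
    return round(num_chars / chars_per_token)


pdf_bytes = 2_000_000  # assumption: "2MB" taken as 2,000,000 bytes
encoded_chars = base64_length(pdf_bytes)

print(f"base64 characters: {encoded_chars:,}")
print(f"estimated tokens:  {estimate_tokens(encoded_chars):,}")
```

The result lands in the same range as the ~770,000 quoted above; the small gap comes from rounding the encoded size up to 2.7MB.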

The alternative: specialist services via MCP

Model Context Protocol (MCP) is Anthropic's open standard for connecting AI agents to external tools and services. The key insight is simple: your agent doesn't need to contain the computation — it needs to orchestrate it.

File conversion is a perfect example. Converting a PDF to clean markdown is deterministic, CPU-bound work. It doesn't require LLM reasoning. Running it inside your agent's context window is like using a screwdriver to hammer a nail — you can, sort of, but why would you?

Here's what the same 2MB PDF workflow looks like when the agent delegates to a specialist service:

Agent token cost to process the 2MB PDF:
  ├── convert_from_url call:    ~300 tokens
  ├── get_job_status poll:      ~200 tokens  
  ├── get_download_url call:    ~200 tokens
  └── Total MCP overhead:    ~8,000 tokens

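8,000 tokens vs 97,354 tokens. A 12× reduction. (The three calls itemised above sum to about 700 tokens; the ~8,000 total presumably includes overhead that is not broken out.)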
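A quick tally makes the comparison explicit. The numbers below are the article's own estimates; the only thing added here is the subtraction showing how much of the stated total is not attributed to a specific call.

```python
# Per-call figures are the article's estimates for one conversion job.
listed_calls = {
    "convert_from_url": 300,
    "get_job_status": 200,
    "get_download_url": 200,
}
reported_total = 8_000   # the article's stated total MCP overhead
inline_tokens = 97_354   # the article's measured inline cost

itemised = sum(listed_calls.values())
not_itemised = reported_total - itemised
reduction = inline_tokens / reported_total

print(f"itemised calls: {itemised} tokens")
print(f"not itemised:   {not_itemised} tokens")
print(f"reduction:      {reduction:.1f}x")
```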

The converted markdown (clean, structured, without PDF encoding artifacts) is stored externally. The agent gets back a URL. For many workflows — convert, store, route, email — the agent never needs to read the full content at all. For workflows that do need to reason over the content, the clean markdown is 40-50% smaller than the raw extracted text.

The cost difference in input tokens: about $0.024 vs $0.29, roughly 12× cheaper per document before any conversion fee.

A concrete example

Say your agent handles incoming vendor contracts. The workflow: receive a PDF, convert to a canonical format, extract key dates and obligations, store the result, notify procurement.

Inline approach:

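import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# Read the PDF and base64-encode it for the request body
with open("vendor_contract.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

# By the base64 estimate above, this is ~770,000 tokens just to load the file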
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract key dates, parties, and obligations from this contract."},
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_b64
                }
            }
        ]
    }]
)

Cost: ~$2.31 in input tokens for a 2MB contract, before any output.

MCP approach (with Botverse):

Set up the MCP server once in your Claude Desktop config or agent runtime:

{
  "mcpServers": {
    "botverse": {
      "url": "https://botverse.cloud/mcp?token=YOUR_TOKEN"
    }
  }
}

Then your agent prompt becomes:

You have access to the Botverse MCP tools.

A new vendor contract has arrived at:
https://contracts.yourcompany.com/incoming/vendor_contract_2026.pdf

1. Convert it to markdown using convert_from_url
2. Once converted, retrieve the content and extract:
   - Contract parties (names and roles)
   - Effective date and term
   - Payment obligations with amounts and dates
   - Key termination conditions
3. Format the output as structured JSON

What happens:

Tool call: convert_from_url(url, "md")
  → job_id: "job_abc123", estimated_cost: $0.05

Tool call: get_job_status("job_abc123")  
  → status: "complete"

Tool call: get_download_url("job_abc123")
  → download_url: "https://cdn.botverse.cloud/..."

The agent fetches the clean markdown (~35,000 tokens for the same contract), extracts the structured data, and produces the output. Total input tokens: ~37,000. Total cost including the $0.05 Botverse job: ~$0.16 vs $2.31.

For a pipeline processing 500 contracts a month: $80 vs $1,155.
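The per-document and monthly figures follow from the token counts plus the flat conversion fee. This sketch uses the article's numbers (770,000 and 37,000 input tokens, a $0.05 job fee, 500 documents a month) and assumes the $3.00 per million input-token rate implied elsewhere in the article; the function name is illustrative.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # assumption: rate implied by the article's figures
CONVERSION_FEE = 0.05                  # per-job document conversion fee quoted in the article
DOCUMENTS_PER_MONTH = 500


def document_cost(input_tokens: int, flat_fee: float = 0.0) -> float:
    """Input-token cost plus any flat per-document fee, in dollars."""
    return input_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS + flat_fee


inline = document_cost(770_000)
delegated = document_cost(37_000, flat_fee=CONVERSION_FEE)

for label, cost in (("inline", inline), ("delegated", delegated)):
    monthly = cost * DOCUMENTS_PER_MONTH
    print(f"{label:<10} ${cost:.2f}/document  ${monthly:,.2f}/month")
```

Because the fee is flat, it dominates the delegated cost only when the converted document is small; for long documents the input tokens remain the larger share.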

When the agent doesn't need to read the file at all

The pattern gets even more powerful when the agent's job is transformation, not comprehension.

Consider: "Convert all this week's board meeting notes from DOCX to PDF and upload them to the board portal."

The agent doesn't need to read the content. It needs to:

1. Get a list of DOCX files (from Drive, S3, wherever)
2. For each: call convert_from_url(url, "pdf"), get the output URL
3. Upload each PDF to the board portal

Total agent token cost per document: ~1,500 tokens (three MCP calls). The files can be arbitrarily large — 100MB, 1GB — the token cost doesn't change.
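The transformation-only pattern can be sketched as a loop that handles nothing but URLs. The `convert` and `upload` callables are hypothetical stand-ins for the conversion tool and the board portal upload, and the 1,500 tokens per document is the article's estimate, not a measured value.

```python
from typing import Callable, Iterable

TOKENS_PER_DOCUMENT = 1_500  # the article's estimate for the MCP calls per file


def convert_batch(
    docx_urls: Iterable[str],
    convert: Callable[[str, str], str],
    upload: Callable[[str], None],
) -> int:
    """Convert each DOCX to PDF, upload the result, return the estimated token spend."""
    processed = 0
    for url in docx_urls:
        pdf_url = convert(url, "pdf")  # only a URL comes back, never file bytes
        upload(pdf_url)
        processed += 1
    return processed * TOKENS_PER_DOCUMENT


# Stand-ins (hypothetical) so the sketch runs offline.
uploaded: list[str] = []
tokens = convert_batch(
    ["https://example.invalid/notes-1.docx", "https://example.invalid/notes-2.docx"],
    convert=lambda url, fmt: url.rsplit(".", 1)[0] + "." + fmt,
    upload=uploaded.append,
)
print(tokens, uploaded)
```

The estimate depends only on the number of files, which is the point of the pattern: file size never enters the calculation.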

This is the correct architecture for file-processing agents. The agent reasons. Specialist services compute.

The underlying principle

There's an argument that LLM context windows are getting so large that this doesn't matter anymore. Gemini 2.0 Pro has 2M tokens. Claude has 200K. Just throw everything in.

This argument has two problems.

Cost. Large context windows are expensive. 97,354 input tokens at claude-sonnet-4-6 is $0.29. At claude-opus-4-7 it's over $1.00. If your agent processes high volumes, you're paying a huge premium for compute you don't need.

Performance. LLMs degrade at extracting specific information from large, noisy contexts. The "lost in the middle" problem is well-documented — models struggle to attend to relevant content buried in long inputs. A clean, well-structured markdown document is easier for a model to reason about than raw extracted PDF text with page numbers, headers, and encoding artifacts embedded throughout.

The right mental model: every token your agent spends on file I/O is reasoning capacity it can't spend on your actual problem.

Getting started

Botverse is an MCP-native cloud API for file processing — video transcoding and document conversion, built for AI agents.

Sign up at botverse.cloud — $2.50 minimum top-up, no monthly fees
Document conversion: $0.05/job (md, html, docx, pdf, rst, txt, xlsx)
Video transcoding: $0.25/job base (mp4, webm, ProRes, mp3, gif)
Add the MCP server in Claude Desktop in 30 seconds: search for "Botverse" on Smithery

The benchmark numbers in this article came from processing a 2MB research PDF in both modes using claude-sonnet-4-6. https://botverse.cloud/benchmark

Have a different approach to token-efficient file processing? I'd love to hear it in the comments.

Top comments (1)

Harjot Singh • May 31

The 2MB-PDF-to-exploding-token-count moment is the lesson everyone learns once and never forgets: dumping a whole document into context is the most expensive possible way to use it, and the cost is invisible until the bill (or the context-window error) shows up. The instinct to hand the agent the raw file feels right (give it everything, let it figure it out) but it's backwards, because the agent uses maybe 2% of those tokens to answer any given question and you paid for 100%. The fix is the boring discipline: extract and structure the PDF once (text, tables, sections), then retrieve only the chunks a question actually needs, so the model sees a few hundred relevant tokens instead of the whole megabyte. Parse-once-retrieve-little beats dump-it-all-every-time on both cost and answer quality, because the model also reasons better when it's not drowning in 2MB of mostly-irrelevant text. The PDF is data to be queried, not context to be inlined. That treat-the-doc-as-a-retrievable-source instinct is core to how I think about cost in Moonshift. Did you move to extraction-plus-retrieval after this, or try chunking the raw PDF directly into the prompt?