Claude API PDF & Document Parsing Guide

#pdf #vision #python

Originally published at claudeguide.io/claude-api-pdf-document-parsing

Claude API PDF & Document Parsing Guide

To parse PDFs with the Claude API, encode the file as base64 and pass it as a document content block in your message. Claude reads the entire document natively — no external OCR step required. For a 10-page contract, this takes under three seconds with claude-haiku-4-5. For structured extraction (tables, form fields, key-value pairs), include an explicit JSON schema in your prompt. Claude returns machine-readable output in one API call, skipping the multi-tool OCR pipelines that traditionally add latency, cost, and failure points.

Using Claude's Native PDF Support (Base64 Upload)

Claude accepts PDF files directly via the document content type. Encode the file bytes as base64 and attach it alongside your text prompt.

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def parse_pdf(pdf_path: str, prompt: str) -

Build production document-processing agents: multi-step PDF pipelines, structured extraction chains, Batch API integration, and cost-optimized routing between Haiku/Sonnet/Opus. 100+ copy-paste Python recipes.

[→ Get the Agent SDK Cookbook — $49](https://shoutfirst.gumroad.com/l/ogxhmy?utm_source=claudeguide&utm_medium=article&utm_campaign=claude-api-pdf-document-parsing)

*Instant download. 30-day money-back guarantee.*

---

## Comparing Document Parsing Approaches

| Approach | Setup complexity | Accuracy (scanned docs) | Latency | Cost per page | Best for |
|---|---|---|---|---|---|
| **Claude native (base64)** | Minimal — 1 API call | High (digital PDFs), Medium (scanned) | 1–3s for 10 pages | ~$0.0004 with Haiku | Digital PDFs, fast prototyping |
| **Tesseract + Claude** | Medium — run OCR first, pass text | High (scanned), depends on OCR quality | 5–15s (OCR adds latency) | ~$0.0001 + OCR infra | Scanned docs at scale, offline OCR |
| **Amazon Textract + Claude** | High — AWS setup, IAM, S3 | Very high (tables, forms, signatures) | 10–30s async | ~$0.015 per page (Textract) + Claude | Complex forms, regulated industries |

**Recommendation:** Start with Claude native. Add Textract only when you need its specialized form/signature detection at regulated accuracy levels (healthcare, legal). Tesseract is a cost-effective middle path if you already run on-prem infrastructure.

For semantic search over extracted content, see [Claude API Semantic Search](/claude-api-semantic-search).

---

## Batch Document Processing

For processing hundreds of PDFs, use the Anthropic Batch API to reduce cost by 50% and bypass per-minute rate limits.

python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def build_batch_requests(pdf_paths: list[str], prompt: str) -

The cookbook includes complete document-processing agent blueprints: PDF ingestion pipelines, multi-step extraction chains, error-recovery patterns for malformed documents, and integration examples with PostgreSQL and S3. All recipes use the Anthropic Python SDK with prompt caching enabled by default.

→ Get the Agent SDK Cookbook — $49

Instant download. 30-day money-back guarantee.

Frequently Asked Questions

Does Claude support scanned (image-based) PDFs?

Yes. Claude's vision capabilities apply to scanned PDFs. The model reads each page as an image and extracts text, tables, and layout information. Accuracy is slightly lower than on digital PDFs with embedded text, especially for low-resolution scans below 150 DPI. For high-volume scanned document workloads where accuracy is critical, pre-process with Tesseract or Amazon Textract to produce clean text, then pass the text to Claude for semantic extraction.

What is the maximum PDF size Claude can accept?

The current limit is 32 MB per file and up to 100 pages per document block. Documents exceeding these limits should be split before sending. Use the chunked processing pattern shown above — split into 20-page segments, process in parallel, and aggregate results.

How do I extract data from a PDF form with checkboxes and signatures?

Use claude-sonnet-4-6 with a prompt that explicitly lists every form field including checkboxes and signature lines. Ask Claude to return a JSON object with boolean values for checkboxes (true/false) and a string status for signatures ("signed", "unsigned", or "initials only"). For legally binding signature verification, combine with Amazon Textract's signature detection, which provides a confidence score.

How much does PDF parsing cost with the Claude API?

With claude-haiku-4-5, a 10-page digital PDF typically costs $0.0003–$0.0005 per document in input tokens. With the Batch API (50% discount), you can process 10,000 documents for roughly $2–$5. Sonnet costs about 5x more per token but delivers better accuracy on complex tables and multi-column layouts. Enable prompt caching on the document block if you query the same PDF more than once — cache hits cost 10% of normal input price.

Can I parse Word documents (.docx) or Excel files (.xlsx) with Claude?

Claude's native document type supports PDF and plain text. For Word and Excel files, convert to PDF first (using python-docx + reportlab, or LibreOffice headless), then send the PDF. Alternatively, extract text from .docx using python-docx and pass as a text content block. For spreadsheets, serialize to CSV and include as text — Claude handles CSV table parsing well without needing the binary format.