How to Convert PDF to Text via API (No poppler, No pdfminer, No Local Libraries)
Converting PDFs to text locally means installing poppler-utils, pdfminer, or PyMuPDF — and then handling edge cases: scanned PDFs needing OCR, multi-column layouts, embedded images, password-protected files. It's a rabbit hole.
For most applications — especially RAG pipelines, document processing workflows, and data extraction — a PDF API is the cleaner solution. Send the file, get back structured text.
What to Consider When Choosing a PDF API
- Text extraction vs OCR: Does it handle scanned PDFs (image-based)?
- Structure preservation: Tables, headers, lists — does it maintain them?
- Output format: Plain text, markdown, or JSON with page/section structure?
- File size limits: PDFs can be large; check limits.
- Language support: OCR quality across languages varies.
- Price: Per page or per document?
Comparison Table
| Tool | Price | OCR | Output Format | File Limit | Limitations |
|---|---|---|---|---|---|
| IteraTools | ~$0.005/page (credits) | Yes | Text, markdown | 50MB | Complex tables may lose structure |
| Adobe Extract API | $0.15/page | Yes | JSON (rich) | 100MB | Expensive, complex auth |
| AWS Textract | $0.0015/page | Yes | JSON | 500MB | AWS ecosystem required |
| LlamaParse | $0.003/page | Yes | Markdown | 50MB | LlamaIndex ecosystem |
| Unstructured.io | $0.002/page | Yes | JSON | Varies | Complex output schema |
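The per-page prices in the table make cost comparisons easy to run yourself. A quick sketch, using the list prices above (verify against each provider's current pricing page before committing):

```python
# Rough monthly cost comparison, using the per-page prices from the table above.
PRICES_PER_PAGE = {
    "IteraTools": 0.005,
    "Adobe Extract API": 0.15,
    "AWS Textract": 0.0015,
    "LlamaParse": 0.003,
    "Unstructured.io": 0.002,
}

def estimate_cost(pages: int) -> dict[str, float]:
    """Estimated USD cost to process `pages` pages with each provider."""
    return {tool: round(pages * price, 2) for tool, price in PRICES_PER_PAGE.items()}

# Example: a 10,000-page/month workload, cheapest first
for tool, cost in sorted(estimate_cost(10_000).items(), key=lambda kv: kv[1]):
    print(f"{tool:20s} ${cost:,.2f}")
```

At that volume the spread is wide: AWS Textract comes in at $15 versus $1,500 for Adobe Extract, which is why pricing model belongs on the checklist above.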
IteraTools PDF Extraction — How to Use It
Extract text from a PDF URL:
```bash
curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/annual_report.pdf",
    "output": "markdown"
  }'
```
Upload a local PDF:
```bash
curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf" \
  -F "output=text" \
  -F "ocr=true"
```
Response:
```json
{
  "text": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked a significant...",
  "pages": 42,
  "format": "markdown",
  "has_ocr": false,
  "credits_used": 21
}
```
Specify page range:
```bash
curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long_report.pdf",
    "pages": "1-10",
    "output": "markdown"
  }'
```
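When page ranges come from user input or loop variables, it's worth validating the string before sending the request. A small helper, assuming only the "start-end" format shown in the request above (whether the API accepts anything fancier, like comma-separated ranges, is something to verify in its docs):

```python
import re

def page_range(start: int, end: int) -> str:
    """Build a 'start-end' range string like the "1-10" in the request above."""
    if start < 1 or end < start:
        raise ValueError(f"Invalid page range: {start}-{end}")
    return f"{start}-{end}"

def is_valid_range(value: str) -> bool:
    """Check that a range string matches the 'N-M' shape with N <= M."""
    m = re.fullmatch(r"(\d+)-(\d+)", value)
    return bool(m) and int(m.group(1)) <= int(m.group(2))

# Build the request payload with a validated range
payload = {
    "url": "https://example.com/long_report.pdf",
    "pages": page_range(1, 10),
    "output": "markdown",
}
```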
Complete Python Example
```python
import json
from pathlib import Path

import requests

API_KEY = "your_api_key_here"
BASE_URL = "https://api.iteratools.com/v1"


def pdf_to_text(pdf_path: str | None = None, pdf_url: str | None = None,
                output_format: str = "markdown", ocr: bool = False) -> dict:
    """Extract text from a PDF file or URL."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    if pdf_path:
        with open(pdf_path, "rb") as f:
            response = requests.post(
                f"{BASE_URL}/pdf/extract",
                headers=headers,
                files={"file": (Path(pdf_path).name, f)},
                data={"output": output_format, "ocr": str(ocr).lower()},
            )
    elif pdf_url:
        response = requests.post(
            f"{BASE_URL}/pdf/extract",
            headers=headers,
            json={"url": pdf_url, "output": output_format, "ocr": ocr},
        )
    else:
        raise ValueError("Provide either pdf_path or pdf_url")
    response.raise_for_status()
    return response.json()


def pdf_to_rag_chunks(pdf_path: str, chunk_size: int = 1000) -> list[dict]:
    """Extract a PDF and split it into overlapping chunks for RAG/embeddings."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text")
    text = result["text"]

    # Split into word-based chunks with a 50-word overlap
    chunks = []
    words = text.split()
    chunk_words = max(chunk_size // 6, 51)  # rough word count per chunk (~6 chars/word)
    for i in range(0, len(words), chunk_words - 50):  # guard above keeps the step positive
        chunk = " ".join(words[i:i + chunk_words])
        if chunk.strip():
            chunks.append({
                "content": chunk,
                "index": len(chunks),
                "source": pdf_path,
            })
    return chunks


def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text", ocr=True)
    text = result["text"]
    # Pass to an LLM for structured extraction (e.g. OpenAI);
    # in practice, you'd call your preferred LLM here.
    return {
        "raw_text": text,
        "pages": result["pages"],
        "ready_for_llm": True,
    }


def batch_process_pdfs(pdf_dir: str, output_dir: str) -> list[dict]:
    """Convert all PDFs in a directory to markdown files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []
    pdf_files = list(Path(pdf_dir).glob("*.pdf"))
    print(f"Processing {len(pdf_files)} PDFs...")
    for pdf_file in pdf_files:
        print(f"  Converting {pdf_file.name}...")
        try:
            result = pdf_to_text(pdf_path=str(pdf_file), output_format="markdown")
            # Save as a .md file
            output_file = Path(output_dir) / f"{pdf_file.stem}.md"
            output_file.write_text(result["text"], encoding="utf-8")
            results.append({
                "source": str(pdf_file),
                "output": str(output_file),
                "pages": result["pages"],
                "success": True,
            })
            print(f"  ✓ {result['pages']} pages → {output_file.name}")
        except Exception as e:
            print(f"  ✗ Error: {e}")
            results.append({
                "source": str(pdf_file),
                "success": False,
                "error": str(e),
            })
    return results


if __name__ == "__main__":
    # Simple extraction
    result = pdf_to_text(
        pdf_url="https://arxiv.org/pdf/2303.08774.pdf",
        output_format="markdown",
    )
    print(f"Extracted {result['pages']} pages")
    print(result["text"][:1000])

    # For a RAG pipeline
    chunks = pdf_to_rag_chunks("research_paper.pdf", chunk_size=800)
    print(f"\nCreated {len(chunks)} chunks for embedding")

    # Save chunks for vector DB ingestion
    with open("chunks.json", "w") as f:
        json.dump(chunks, f, indent=2)
```
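The chunking step in pdf_to_rag_chunks doesn't depend on the API at all, so you can tune chunk size and overlap against local text before wiring it into the pipeline. The same word-based logic, isolated (the ~6 chars/word conversion is the same rough heuristic used above):

```python
def chunk_words(text: str, chunk_size_chars: int = 1000, overlap_words: int = 50) -> list[str]:
    """Word-based chunking with overlap, mirroring pdf_to_rag_chunks above."""
    words = text.split()
    per_chunk = max(chunk_size_chars // 6, overlap_words + 1)  # ~6 chars per word
    step = per_chunk - overlap_words
    chunks = []
    for i in range(0, len(words), step):
        piece = " ".join(words[i:i + per_chunk])
        if piece.strip():
            chunks.append(piece)
    return chunks

# 400 distinct words, ~600-char chunks, 20-word overlap
sample = " ".join(f"w{i}" for i in range(400))
pieces = chunk_words(sample, chunk_size_chars=600, overlap_words=20)
print(f"{len(pieces)} chunks; overlap intact:",
      pieces[0].split()[-20:] == pieces[1].split()[:20])
```

Overlap matters for RAG because a fact that straddles a chunk boundary would otherwise be split across two embeddings and retrieved poorly from both.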
Handling Scanned PDFs
Scanned PDFs (images inside a PDF container) require OCR. Set "ocr": true:
```bash
curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/scanned_contract.pdf",
    "ocr": true,
    "language": "en"
  }'
```
OCR quality depends on scan quality. For critical documents (legal, financial), verify results against the original.
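One cheap automated check before trusting OCR output downstream: flag documents whose extracted text is suspiciously thin relative to their page count. This is a heuristic of my own, not an API feature, and the 200-chars-per-page threshold is an arbitrary starting point to tune for your documents:

```python
def looks_suspicious(text: str, pages: int, min_chars_per_page: int = 200) -> bool:
    """Heuristic: OCR output averaging under `min_chars_per_page` characters
    per page often indicates a failed or low-quality scan worth manual review."""
    if pages <= 0:
        return True
    return len(text) / pages < min_chars_per_page

# Example against a hypothetical extraction result
result = {"text": "Page 1 text..." * 50, "pages": 3}
if looks_suspicious(result["text"], result["pages"]):
    print("Low text density - review the scan manually")
```

It won't catch garbled-but-verbose output, but it reliably flags blank or mostly-failed scans before they pollute a RAG index.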
Conclusion
For developers building document processing pipelines, RAG knowledge bases, or invoice extraction workflows, a PDF API eliminates the entire poppler/pdfminer setup and handles edge cases like scanned PDFs automatically.
IteraTools provides PDF extraction as part of a broader toolkit — you can extract, chunk, and immediately store embeddings all within the same API ecosystem, at a fraction of the cost of Adobe Extract or AWS Textract.