Fred Santos

How to Convert PDF to Text via API (No poppler, No pdfminer, No Local Libraries)

Converting PDFs to text locally means installing poppler-utils, pdfminer, or PyMuPDF — and then handling edge cases: scanned PDFs needing OCR, multi-column layouts, embedded images, password-protected files. It's a rabbit hole.

For most applications — especially RAG pipelines, document processing workflows, and data extraction — a PDF API is the cleaner solution. Send the file, get back structured text.

What to Consider When Choosing a PDF API

  • Text extraction vs OCR: Does it handle scanned PDFs (image-based)?
  • Structure preservation: Tables, headers, lists — does it maintain them?
  • Output format: Plain text, markdown, or JSON with page/section structure?
  • File size limits: PDFs can be large; check limits.
  • Language support: OCR quality across languages varies.
  • Price: Per page or per document?

Comparison Table

| Tool | Price | OCR | Output Format | File Limit | Limitations |
|------|-------|-----|---------------|------------|-------------|
| IteraTools | ~$0.005/page (credits) | Yes | Text, markdown | 50MB | Complex tables may lose structure |
| Adobe Extract API | $0.15/page | Yes | JSON (rich) | 100MB | Expensive, complex auth |
| AWS Textract | $0.0015/page | Yes | JSON | 500MB | AWS ecosystem required |
| LlamaParse | $0.003/page | Yes | Markdown | 50MB | LlamaIndex ecosystem |
| Unstructured.io | $0.002/page | Yes | JSON | Varies | Complex output schema |
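A quick back-of-envelope script to compare what a given volume costs, using the per-page list prices from the table above (prices change; treat these as illustrative):

```python
# Per-page list prices from the comparison table above (USD; may change)
PRICES = {
    "IteraTools": 0.005,
    "Adobe Extract API": 0.15,
    "AWS Textract": 0.0015,
    "LlamaParse": 0.003,
    "Unstructured.io": 0.002,
}

def cost_for(pages: int) -> dict:
    """Estimated cost in USD for extracting `pages` pages with each tool."""
    return {tool: round(price * pages, 2) for tool, price in PRICES.items()}

# e.g. a 1,000-page monthly volume, cheapest first
for tool, cost in sorted(cost_for(1000).items(), key=lambda kv: kv[1]):
    print(f"{tool:20s} ${cost:,.2f}")
```

At 1,000 pages/month the spread is large: Adobe lands at $150 while the usage-priced APIs stay in single digits.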

IteraTools PDF Extraction — How to Use It

Extract text from a PDF URL:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/annual_report.pdf",
    "output": "markdown"
  }'

Upload a local PDF:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf" \
  -F "output=text" \
  -F "ocr=true"

Response:

{
  "text": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked a significant...",
  "pages": 42,
  "format": "markdown",
  "has_ocr": false,
  "credits_used": 21
}

Specify page range:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long_report.pdf",
    "pages": "1-10",
    "output": "markdown"
  }'
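In Python the same request is just a JSON body. A small helper (a sketch — the field names simply mirror the curl examples above) makes the payload explicit and keeps optional fields out of the request when unused:

```python
def build_extract_payload(url: str, pages: str = None,
                          output: str = "markdown", ocr: bool = False) -> dict:
    """JSON body for POST /v1/pdf/extract, mirroring the curl examples above."""
    payload = {"url": url, "output": output}
    if pages:
        payload["pages"] = pages  # e.g. "1-10" for the first ten pages
    if ocr:
        payload["ocr"] = True     # only send the flag when OCR is wanted
    return payload

print(build_extract_payload("https://example.com/long_report.pdf", pages="1-10"))
```

Send it with `requests.post(f"{BASE_URL}/pdf/extract", headers=headers, json=payload)`, as in the complete example below.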

Complete Python Example

import requests
from pathlib import Path
import json

API_KEY = "your_api_key_here"
BASE_URL = "https://api.iteratools.com/v1"

def pdf_to_text(pdf_path: str | None = None, pdf_url: str | None = None,
                output_format: str = "markdown", ocr: bool = False) -> dict:
    """Extract text from PDF file or URL."""
    headers = {"Authorization": f"Bearer {API_KEY}"}

    if pdf_path:
        with open(pdf_path, "rb") as f:
            response = requests.post(
                f"{BASE_URL}/pdf/extract",
                headers=headers,
                files={"file": (Path(pdf_path).name, f)},
                data={"output": output_format, "ocr": str(ocr).lower()}
            )
    elif pdf_url:
        response = requests.post(
            f"{BASE_URL}/pdf/extract",
            headers=headers,
            json={"url": pdf_url, "output": output_format, "ocr": ocr}
        )
    else:
        raise ValueError("Provide either pdf_path or pdf_url")

    response.raise_for_status()
    return response.json()

def pdf_to_rag_chunks(pdf_path: str, chunk_size: int = 1000) -> list[dict]:
    """Extract PDF and split into chunks for RAG/embeddings."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text")
    text = result["text"]

    # Split into chunks with a 50-word overlap between consecutive chunks
    chunks = []
    words = text.split()
    chunk_words = chunk_size // 6  # Rough word count (~6 chars per word incl. space)
    step = max(chunk_words - 50, 1)  # Keep the step positive for small chunk sizes

    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_words])
        if chunk.strip():
            chunks.append({
                "content": chunk,
                "index": len(chunks),
                "source": pdf_path
            })

    return chunks

def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text", ocr=True)
    text = result["text"]

    # Pass to LLM for structured extraction (example with OpenAI)
    # In practice, you'd use your preferred LLM here
    return {
        "raw_text": text,
        "pages": result["pages"],
        "ready_for_llm": True
    }

def batch_process_pdfs(pdf_dir: str, output_dir: str) -> list[dict]:
    """Convert all PDFs in a directory to text files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []

    pdf_files = list(Path(pdf_dir).glob("*.pdf"))
    print(f"Processing {len(pdf_files)} PDFs...")

    for pdf_file in pdf_files:
        print(f"  Converting {pdf_file.name}...")
        try:
            result = pdf_to_text(pdf_path=str(pdf_file), output_format="markdown")

            # Save as .md file
            output_file = Path(output_dir) / f"{pdf_file.stem}.md"
            output_file.write_text(result["text"], encoding="utf-8")

            results.append({
                "source": str(pdf_file),
                "output": str(output_file),
                "pages": result["pages"],
                "success": True
            })
            print(f"    ✓ {result['pages']} pages → {output_file.name}")

        except Exception as e:
            print(f"    ✗ Error: {e}")
            results.append({
                "source": str(pdf_file),
                "success": False,
                "error": str(e)
            })

    return results

if __name__ == "__main__":
    # Simple extraction
    result = pdf_to_text(
        pdf_url="https://arxiv.org/pdf/2303.08774.pdf",
        output_format="markdown"
    )
    print(f"Extracted {result['pages']} pages")
    print(result["text"][:1000])

    # For RAG pipeline
    chunks = pdf_to_rag_chunks("research_paper.pdf", chunk_size=800)
    print(f"\nCreated {len(chunks)} chunks for embedding")

    # Save chunks for vector DB ingestion
    with open("chunks.json", "w") as f:
        json.dump(chunks, f, indent=2)
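The chunking in `pdf_to_rag_chunks` is a sliding window over words: each chunk is `chunk_words` long and steps forward by `chunk_words - overlap`. You can verify the overlap math in isolation, without any API call:

```python
def chunk_words(words: list, size: int, overlap: int = 50) -> list:
    """Slide a window of `size` words; consecutive chunks share `overlap` words."""
    step = max(size - overlap, 1)  # the step must stay positive for small sizes
    return [words[i:i + size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(300)]
chunks = chunk_words(words, size=100, overlap=50)
print(len(chunks))   # 6 windows: starts at 0, 50, 100, 150, 200, 250
print(chunks[1][0])  # w50 — chunk 1 begins 50 words into chunk 0
```

The `max(..., 1)` guard matters: with the `chunk_size // 6` word estimate, a `chunk_size` of 300 or less would otherwise make the step zero or negative.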

Handling Scanned PDFs

Scanned PDFs (images inside a PDF container) require OCR. Set "ocr": true:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/scanned_contract.pdf",
    "ocr": true,
    "language": "en"
  }'

OCR quality depends on scan quality. For critical documents (legal, financial), verify results against the original.
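Since OCR typically costs more than plain extraction, a practical pattern is to try without OCR first and fall back only when the result looks empty — scanned PDFs usually return almost no text otherwise. A rough heuristic (the 100-chars-per-page threshold is my assumption; tune it for your documents):

```python
def looks_scanned(text: str, pages: int, min_chars_per_page: int = 100) -> bool:
    """Heuristic: a text layer this sparse usually means an image-only PDF."""
    if pages <= 0:
        return True
    return len(text.strip()) / pages < min_chars_per_page

# e.g. 42 pages but only ~300 characters of text -> almost certainly a scan
print(looks_scanned("x" * 300, pages=42))     # True
print(looks_scanned("x" * 50_000, pages=42))  # False
```

Combined with the earlier `pdf_to_text` helper: extract once with `ocr=False`, and if `looks_scanned(result["text"], result["pages"])` is true, retry the same file with `ocr=True`.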

Conclusion

For developers building document processing pipelines, RAG knowledge bases, or invoice extraction workflows, a PDF API eliminates the entire poppler/pdfminer setup and handles edge cases like scanned PDFs automatically.

IteraTools provides PDF extraction as part of a broader toolkit — you can extract, chunk, and immediately store embeddings all within the same API ecosystem, at a fraction of the cost of Adobe Extract or AWS Textract.

Start extracting PDFs at api.iteratools.com
