Fred Santos

How to Convert PDF to Text via API (No poppler, No pdfminer, No Local Libraries)

Converting PDFs to text locally means installing poppler-utils, pdfminer, or PyMuPDF — and then handling edge cases: scanned PDFs needing OCR, multi-column layouts, embedded images, password-protected files. It's a rabbit hole.

For most applications — especially RAG pipelines, document processing workflows, and data extraction — a PDF API is the cleaner solution. Send the file, get back structured text.

What to Consider When Choosing a PDF API

  • Text extraction vs OCR: Does it handle scanned PDFs (image-based)?
  • Structure preservation: Tables, headers, lists — does it maintain them?
  • Output format: Plain text, markdown, or JSON with page/section structure?
  • File size limits: PDFs can be large; check limits.
  • Language support: OCR quality across languages varies.
  • Price: Per page or per document?

Comparison Table

| Tool | Price | OCR | Output Format | File Limit | Limitations |
|------|-------|-----|---------------|------------|-------------|
| IteraTools | ~$0.005/page (credits) | Yes | Text, markdown | 50MB | Complex tables may lose structure |
| Adobe Extract API | $0.15/page | Yes | JSON (rich) | 100MB | Expensive, complex auth |
| AWS Textract | $0.0015/page | Yes | JSON | 500MB | AWS ecosystem required |
| LlamaParse | $0.003/page | Yes | Markdown | 50MB | LlamaIndex ecosystem |
| Unstructured.io | $0.002/page | Yes | JSON | Varies | Complex output schema |
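A quick back-of-envelope script to compare what a given volume costs, using the per-page list prices from the table above (prices change; treat these as illustrative):

```python
# Per-page list prices from the comparison table above (USD; may change)
PRICES = {
    "IteraTools": 0.005,
    "Adobe Extract API": 0.15,
    "AWS Textract": 0.0015,
    "LlamaParse": 0.003,
    "Unstructured.io": 0.002,
}

def cost_for(pages: int) -> dict:
    """Estimated cost in USD for extracting `pages` pages with each tool."""
    return {tool: round(price * pages, 2) for tool, price in PRICES.items()}

# e.g. a 1,000-page monthly volume, cheapest first
for tool, cost in sorted(cost_for(1000).items(), key=lambda kv: kv[1]):
    print(f"{tool:20s} ${cost:,.2f}")
```

At 1,000 pages/month the spread is large: Adobe lands at $150 while the usage-priced APIs stay in single digits.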

IteraTools PDF Extraction — How to Use It

Extract text from a PDF URL:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/annual_report.pdf",
    "output": "markdown"
  }'

Upload a local PDF:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf" \
  -F "output=text" \
  -F "ocr=true"

Response:

{
  "text": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked a significant...",
  "pages": 42,
  "format": "markdown",
  "has_ocr": false,
  "credits_used": 21
}

Specify page range:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long_report.pdf",
    "pages": "1-10",
    "output": "markdown"
  }'
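In Python the same request is just a JSON body. A small helper (a sketch — the field names simply mirror the curl examples above) makes the payload explicit and keeps optional fields out of the request when unused:

```python
def build_extract_payload(url: str, pages: str = None,
                          output: str = "markdown", ocr: bool = False) -> dict:
    """JSON body for POST /v1/pdf/extract, mirroring the curl examples above."""
    payload = {"url": url, "output": output}
    if pages:
        payload["pages"] = pages  # e.g. "1-10" for the first ten pages
    if ocr:
        payload["ocr"] = True     # only send the flag when OCR is wanted
    return payload

print(build_extract_payload("https://example.com/long_report.pdf", pages="1-10"))
```

Send it with `requests.post(f"{BASE_URL}/pdf/extract", headers=headers, json=payload)`, as in the complete example below.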

Complete Python Example

import requests
from pathlib import Path
import json

API_KEY = "your_api_key_here"
BASE_URL = "https://api.iteratools.com/v1"

def pdf_to_text(pdf_path: str | None = None, pdf_url: str | None = None,
                output_format: str = "markdown", ocr: bool = False) -> dict:
    """Extract text from PDF file or URL."""
    headers = {"Authorization": f"Bearer {API_KEY}"}

    if pdf_path:
        with open(pdf_path, "rb") as f:
            response = requests.post(
                f"{BASE_URL}/pdf/extract",
                headers=headers,
                files={"file": (Path(pdf_path).name, f)},
                data={"output": output_format, "ocr": str(ocr).lower()}
            )
    elif pdf_url:
        response = requests.post(
            f"{BASE_URL}/pdf/extract",
            headers=headers,
            json={"url": pdf_url, "output": output_format, "ocr": ocr}
        )
    else:
        raise ValueError("Provide either pdf_path or pdf_url")

    response.raise_for_status()
    return response.json()

def pdf_to_rag_chunks(pdf_path: str, chunk_size: int = 1000) -> list[dict]:
    """Extract PDF and split into chunks for RAG/embeddings."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text")
    text = result["text"]

    # Split into chunks with a 50-word overlap between consecutive chunks
    chunks = []
    words = text.split()
    chunk_words = chunk_size // 6  # Rough word count (~6 chars per word incl. space)
    step = max(chunk_words - 50, 1)  # Keep the step positive for small chunk sizes

    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_words])
        if chunk.strip():
            chunks.append({
                "content": chunk,
                "index": len(chunks),
                "source": pdf_path
            })

    return chunks

def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = pdf_to_text(pdf_path=pdf_path, output_format="text", ocr=True)
    text = result["text"]

    # Pass to LLM for structured extraction (example with OpenAI)
    # In practice, you'd use your preferred LLM here
    return {
        "raw_text": text,
        "pages": result["pages"],
        "ready_for_llm": True
    }

def batch_process_pdfs(pdf_dir: str, output_dir: str) -> list[dict]:
    """Convert all PDFs in a directory to text files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []

    pdf_files = list(Path(pdf_dir).glob("*.pdf"))
    print(f"Processing {len(pdf_files)} PDFs...")

    for pdf_file in pdf_files:
        print(f"  Converting {pdf_file.name}...")
        try:
            result = pdf_to_text(pdf_path=str(pdf_file), output_format="markdown")

            # Save as .md file
            output_file = Path(output_dir) / f"{pdf_file.stem}.md"
            output_file.write_text(result["text"], encoding="utf-8")

            results.append({
                "source": str(pdf_file),
                "output": str(output_file),
                "pages": result["pages"],
                "success": True
            })
            print(f"    ✓ {result['pages']} pages → {output_file.name}")

        except Exception as e:
            print(f"    ✗ Error: {e}")
            results.append({
                "source": str(pdf_file),
                "success": False,
                "error": str(e)
            })

    return results

if __name__ == "__main__":
    # Simple extraction
    result = pdf_to_text(
        pdf_url="https://arxiv.org/pdf/2303.08774.pdf",
        output_format="markdown"
    )
    print(f"Extracted {result['pages']} pages")
    print(result["text"][:1000])

    # For RAG pipeline
    chunks = pdf_to_rag_chunks("research_paper.pdf", chunk_size=800)
    print(f"\nCreated {len(chunks)} chunks for embedding")

    # Save chunks for vector DB ingestion
    with open("chunks.json", "w") as f:
        json.dump(chunks, f, indent=2)
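The chunking in `pdf_to_rag_chunks` is a sliding window over words: each chunk is `chunk_words` long and steps forward by `chunk_words - overlap`. You can verify the overlap math in isolation, without any API call:

```python
def chunk_words(words: list, size: int, overlap: int = 50) -> list:
    """Slide a window of `size` words; consecutive chunks share `overlap` words."""
    step = max(size - overlap, 1)  # the step must stay positive for small sizes
    return [words[i:i + size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(300)]
chunks = chunk_words(words, size=100, overlap=50)
print(len(chunks))   # 6 windows: starts at 0, 50, 100, 150, 200, 250
print(chunks[1][0])  # w50 — chunk 1 begins 50 words into chunk 0
```

The `max(..., 1)` guard matters: with the `chunk_size // 6` word estimate, a `chunk_size` of 300 or less would otherwise make the step zero or negative.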

Handling Scanned PDFs

Scanned PDFs (images inside a PDF container) require OCR. Set "ocr": true:

curl -X POST https://api.iteratools.com/v1/pdf/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/scanned_contract.pdf",
    "ocr": true,
    "language": "en"
  }'

OCR quality depends on scan quality. For critical documents (legal, financial), verify results against the original.
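Since OCR typically costs more than plain extraction, a practical pattern is to try without OCR first and fall back only when the result looks empty — scanned PDFs usually return almost no text otherwise. A rough heuristic (the 100-chars-per-page threshold is my assumption; tune it for your documents):

```python
def looks_scanned(text: str, pages: int, min_chars_per_page: int = 100) -> bool:
    """Heuristic: a text layer this sparse usually means an image-only PDF."""
    if pages <= 0:
        return True
    return len(text.strip()) / pages < min_chars_per_page

# e.g. 42 pages but only ~300 characters of text -> almost certainly a scan
print(looks_scanned("x" * 300, pages=42))     # True
print(looks_scanned("x" * 50_000, pages=42))  # False
```

Combined with the earlier `pdf_to_text` helper: extract once with `ocr=False`, and if `looks_scanned(result["text"], result["pages"])` is true, retry the same file with `ocr=True`.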

Conclusion

For developers building document processing pipelines, RAG knowledge bases, or invoice extraction workflows, a PDF API eliminates the entire poppler/pdfminer setup and handles edge cases like scanned PDFs automatically.

IteraTools provides PDF extraction as part of a broader toolkit — you can extract, chunk, and immediately store embeddings all within the same API ecosystem, at a fraction of the cost of Adobe Extract or AWS Textract.

Start extracting PDFs at api.iteratools.com
