DEV Community

AI Engine

Posted on • Originally published at ai-engine.net

PDF OCR: Extract Text from Scanned PDFs with an API

Scanned PDFs are everywhere — contracts, invoices, academic papers, government forms — yet the text inside is locked in pixel data. You can't search it, copy it, or pipe it into a database. PDF OCR solves this with a single API call.

Why Not Just Use Image OCR?

You could export each PDF page as a JPEG and send it to an image OCR endpoint, but a dedicated PDF OCR endpoint saves you that work:

  • Multi-page in one request — Upload the full PDF, get text for every page at once
  • Page-range control — Process only pages 3–7 of a 200-page document
  • Structured response — Text organized by page number
  • No conversion overhead — Skip the PDF-to-image pipeline entirely
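
As a rough illustration of page-range control, the sketch below builds the `first_page` and `last_page` form fields that the Quick Start request sends. The helper name `page_range_fields` is hypothetical, and the validation rules (1-based, inclusive page numbers) are assumptions rather than documented API behavior.

```python
from typing import Dict, Optional


def page_range_fields(first_page: Optional[int] = None,
                      last_page: Optional[int] = None) -> Dict[str, int]:
    """Build the optional page-range form fields for an /ocr-pdf request."""
    # Assumption: page numbers are 1-based and the range is inclusive.
    for name, value in (("first_page", first_page), ("last_page", last_page)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    if first_page is not None and last_page is not None and first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    # Leave out unset bounds so the API can apply its own defaults.
    fields = {"first_page": first_page, "last_page": last_page}
    return {key: value for key, value in fields.items() if value is not None}


# Pages 3-7 of a long document
print(page_range_fields(3, 7))  # {'first_page': 3, 'last_page': 7}
```

The returned dictionary can be passed as the `data=` argument of `requests.post`. Validating locally catches a reversed range before it costs an API call.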

Quick Start

import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

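# Upload the scanned PDF and request OCR for pages 1-5 only
with open("scanned.pdf", "rb") as f:
    resp = requests.post(
        "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
        headers=HEADERS,
        files={"pdf_file": f},
        data={"first_page": 1, "last_page": 5},
    )
resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body

# Each entry carries the page number, detected language, and extracted text
for page in resp.json()["body"]["pages"]:
    print(f"--- Page {page['pageNumber']} ({page['detectedLanguage']}) ---")
    print(page["fullText"])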

Response format:

{
  "body": {
    "pages": [
      {
        "pageNumber": 1,
        "fullText": "Extracted text from page 1...",
        "detectedLanguage": "en"
      }
    ]
  }
}
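
Because every page entry carries its own `detectedLanguage`, the response can be regrouped by language before further processing. The sketch below works on a hand-written sample in the shape shown above; the function name `group_pages_by_language` and the `"unknown"` fallback are assumptions for illustration, not part of the API.

```python
from collections import defaultdict
from typing import Dict, List


def group_pages_by_language(response: dict) -> Dict[str, List[int]]:
    """Map each detected language code to the page numbers that use it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for page in response["body"]["pages"]:
        # Assumption: pages with no detected language are filed under "unknown".
        lang = page.get("detectedLanguage") or "unknown"
        groups[lang].append(page["pageNumber"])
    return dict(groups)


# Hand-written sample in the documented response shape (not real API output)
sample = {
    "body": {
        "pages": [
            {"pageNumber": 1, "fullText": "Hello", "detectedLanguage": "en"},
            {"pageNumber": 2, "fullText": "Bonjour", "detectedLanguage": "fr"},
            {"pageNumber": 3, "fullText": "World", "detectedLanguage": "en"},
        ]
    }
}

print(group_pages_by_language(sample))  # {'en': [1, 3], 'fr': [2]}
```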

Process a Full Document

import json

import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

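def extract_pdf_text(pdf_path, first_page=None, last_page=None):
    """Send a PDF to /ocr-pdf and return the list of per-page results."""
    # Only send the range fields that were actually provided
    payload = {}
    if first_page is not None:
        payload["first_page"] = first_page
    if last_page is not None:
        payload["last_page"] = last_page

    with open(pdf_path, "rb") as f:
        resp = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
            headers=HEADERS,
            files={"pdf_file": f},
            data=payload,
        )
    resp.raise_for_status()  # raise on 4xx/5xx before touching the body
    return resp.json()["body"]["pages"]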

pages = extract_pdf_text("annual-report.pdf")

for page in pages:
    preview = page["fullText"][:80].replace("\n", " ")
    print(f"Page {page['pageNumber']} [{page['detectedLanguage']}]: {preview}...")

with open("extracted.json", "w", encoding="utf-8") as out:  # ensure_ascii=False writes raw non-ASCII text
    json.dump(pages, out, indent=2, ensure_ascii=False)
print(f"Done — {len(pages)} pages saved")

Handle Mixed Input (PDFs + Images)

Real-world pipelines often mix scanned PDFs and standalone images. Route to the right endpoint automatically:

import os

import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

def ocr_file(path):
    """Return the extracted text for a PDF or an image file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # PDFs go to /ocr-pdf, which returns one entry per page
        with open(path, "rb") as f:
            resp = requests.post(
                "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
                headers=HEADERS, files={"pdf_file": f},
            )
        resp.raise_for_status()
        pages = resp.json()["body"]["pages"]
        return "\n\n".join(p["fullText"] for p in pages)
    # Anything else is treated as an image and sent to /ocr
    with open(path, "rb") as f:
        resp = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr",
            headers=HEADERS, files={"image": f},
        )
    resp.raise_for_status()
    return resp.json()["body"]["fullText"]

for f in ["invoice.pdf", "receipt.jpg", "contract.pdf"]:
    print(f"=== {f} ===")
    print(ocr_file(f)[:200])

Real-World Use Cases

  • Invoice processing — Extract vendor names, dates, line items, totals → feed into accounting software
  • Contract digitization — Full-text search across thousands of signed agreements
  • Archive scanning — Turn decades of paper records into a searchable digital archive
  • Academic papers — Extract text from older scanned-only papers for citation tools and literature reviews
  • Compliance & audit — Automate data extraction from regulatory documents

Tips for Best Results

  1. Use page ranges for large PDFs — skip irrelevant pages to save time and API credits
  2. Scan at 300 DPI — significantly better accuracy than 72 DPI thumbnails
  3. Check detectedLanguage — route multilingual documents to the right downstream processing
  4. Batch with concurrency — process 5–10 PDFs in parallel instead of sequentially
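
One way to follow the concurrency tip is a small thread pool, since each OCR request spends most of its time waiting on the network. This is a minimal sketch, not a tuned implementation: `run_batch` and `fake_ocr` are hypothetical names, and `fake_ocr` is a stand-in that makes no network call so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Union


def run_batch(paths: Iterable[str],
              worker: Callable[[str], str],
              max_workers: int = 5) -> Dict[str, Union[str, Exception]]:
    """Run worker(path) for every path on a thread pool.

    Returns a dict mapping each path to its result, or to the exception
    it raised, so one bad file does not abort the whole batch.
    """
    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep the failure and carry on
                results[path] = exc
    return results


def fake_ocr(path: str) -> str:
    """Stand-in for a real OCR call (assumption: no network access here)."""
    if not path.endswith(".pdf"):
        raise ValueError(f"not a PDF: {path}")
    return f"text of {path}"


outcome = run_batch(["a.pdf", "b.pdf", "notes.txt"], fake_ocr)
for path in sorted(outcome):
    print(path, "->", repr(outcome[path]))
```

In a real pipeline, `fake_ocr` would be replaced by a function that calls the API, such as `ocr_file` from the mixed-input example. Whether 5 or 10 workers is appropriate depends on your plan's rate limits, which this sketch does not check.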

Pricing

Each /ocr-pdf request counts as one API call regardless of page count. Free tier: 100 requests/month. Pro plan available for production workloads.

The OCR Wizard API is available on RapidAPI with a free tier to get started.

👉 Read the full guide with JavaScript examples and more use cases
