Scanned PDFs are everywhere — contracts, invoices, academic papers, government forms — yet the text inside is locked in pixel data. You can't search it, copy it, or pipe it into a database. PDF OCR solves this with a single API call.
Why Not Just Use Image OCR?
You could export each PDF page as a JPEG and send it to an image OCR endpoint. But a dedicated PDF OCR endpoint has several advantages:
- Multi-page in one request — Upload the full PDF, get text for every page at once
- Page-range control — Process only pages 3–7 of a 200-page document
- Structured response — Text organized by page number
- No conversion overhead — Skip the PDF-to-image pipeline entirely
Quick Start
import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

# Upload the PDF and request OCR for pages 1-5 only
with open("scanned.pdf", "rb") as f:
    resp = requests.post(
        "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
        headers=HEADERS,
        files={"pdf_file": f},
        data={"first_page": 1, "last_page": 5},
    )
resp.raise_for_status()  # fail early on HTTP errors

for page in resp.json()["body"]["pages"]:
    print(f"--- Page {page['pageNumber']} ({page['detectedLanguage']}) ---")
    print(page["fullText"])
Response format:
{
    "body": {
        "pages": [
            {
                "pageNumber": 1,
                "fullText": "Extracted text from page 1...",
                "detectedLanguage": "en"
            }
        ]
    }
}
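Because the response is organized per page, it is easy to reshape for downstream use. The following is a minimal sketch that groups page text by detected language, assuming the response shape shown above. `group_pages_by_language` is a hypothetical helper name, not part of the API, and the "unknown" fallback is an assumption about pages that might lack a language tag.

```python
from collections import defaultdict


def group_pages_by_language(response_json):
    """Group (pageNumber, fullText) pairs by detected language, in page order."""
    groups = defaultdict(list)
    for page in response_json["body"]["pages"]:
        # Assumption: fall back to "unknown" if a page has no language tag.
        lang = page.get("detectedLanguage") or "unknown"
        groups[lang].append((page["pageNumber"], page["fullText"]))
    return dict(groups)


# Hand-written sample in the documented response shape (not real API output).
sample = {
    "body": {
        "pages": [
            {"pageNumber": 1, "fullText": "Hello", "detectedLanguage": "en"},
            {"pageNumber": 2, "fullText": "Bonjour", "detectedLanguage": "fr"},
            {"pageNumber": 3, "fullText": "World", "detectedLanguage": "en"},
        ]
    }
}

print(group_pages_by_language(sample))
```

Keeping the page number next to each text makes it possible to trace any extracted passage back to its source page.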
Process a Full Document
import json

import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

def extract_pdf_text(pdf_path, first_page=None, last_page=None):
    """OCR a PDF and return the list of page dicts from the response."""
    # Only send the page-range fields that were actually provided
    payload = {}
    if first_page is not None:
        payload["first_page"] = first_page
    if last_page is not None:
        payload["last_page"] = last_page
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
            headers=HEADERS,
            files={"pdf_file": f},
            data=payload,
        )
    resp.raise_for_status()
    return resp.json()["body"]["pages"]

pages = extract_pdf_text("annual-report.pdf")
for page in pages:
    preview = page["fullText"][:80].replace("\n", " ")
    print(f"Page {page['pageNumber']} [{page['detectedLanguage']}]: {preview}...")

# Write UTF-8 explicitly, since ensure_ascii=False keeps non-ASCII text as-is
with open("extracted.json", "w", encoding="utf-8") as out:
    json.dump(pages, out, indent=2, ensure_ascii=False)
print(f"Done — {len(pages)} pages saved")
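For very long documents, one option is to request the PDF in page-range chunks rather than all at once. The sketch below only computes the ranges. It assumes `first_page` and `last_page` are 1-based and inclusive, as the examples above suggest, and that the caller already knows the total page count. `page_ranges` is a hypothetical helper, not part of the API.

```python
def page_ranges(total_pages, chunk_size):
    """Yield inclusive (first_page, last_page) tuples covering 1..total_pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for first in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield first, min(first + chunk_size - 1, total_pages)


print(list(page_ranges(200, 50)))
# [(1, 50), (51, 100), (101, 150), (151, 200)]
```

Each tuple could then be passed to `extract_pdf_text(path, first, last)`. Note the trade-off: every chunk is a separate request, so chunking uses more API calls than a single full-document upload.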
Handle Mixed Input (PDFs + Images)
Real-world pipelines often mix scanned PDFs and standalone images. Route to the right endpoint automatically:
import os

import requests

HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

def ocr_file(path):
    """Send PDFs to /ocr-pdf and everything else to /ocr; return the text."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        with open(path, "rb") as f:
            resp = requests.post(
                "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
                headers=HEADERS, files={"pdf_file": f},
            )
        resp.raise_for_status()
        pages = resp.json()["body"]["pages"]
        # Join pages with a blank line between them
        return "\n\n".join(p["fullText"] for p in pages)
    # Anything that is not a PDF is treated as an image
    with open(path, "rb") as f:
        resp = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr",
            headers=HEADERS, files={"image": f},
        )
    resp.raise_for_status()
    return resp.json()["body"]["fullText"]

for path in ["invoice.pdf", "receipt.jpg", "contract.pdf"]:
    print(f"=== {path} ===")
    print(ocr_file(path)[:200])
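File extensions can be missing or wrong, especially for uploads. As an alternative to extension-based routing, the sketch below checks the file signature instead: well-formed PDF files normally begin with the bytes `%PDF-`. `looks_like_pdf` is a hypothetical helper, and the demo files are throwaway stand-ins rather than real documents.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


# Demo: two files with uninformative names, one PDF-like and one JPEG-like.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "scan.dat")
    img_path = os.path.join(tmp, "photo.dat")
    with open(pdf_path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(img_path, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0")  # JPEG files start with FF D8 FF
    pdf_check = looks_like_pdf(pdf_path)
    img_check = looks_like_pdf(img_path)

print(pdf_check, img_check)
```

A check like this could replace the `ext == ".pdf"` test in `ocr_file` when input filenames are not trustworthy.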
Real-World Use Cases
- Invoice processing — Extract vendor names, dates, line items, totals → feed into accounting software
- Contract digitization — Full-text search across thousands of signed agreements
- Archive scanning — Turn decades of paper records into a searchable digital archive
- Academic papers — Extract text from older scanned-only papers for citation tools and literature reviews
- Compliance & audit — Automate data extraction from regulatory documents
Tips for Best Results
- Use page ranges for large PDFs — skip irrelevant pages to save time and API credits
- Scan at 300 DPI — significantly better accuracy than 72 DPI thumbnails
- Check detectedLanguage — route multilingual documents to the right downstream processing
- Batch with concurrency — process 5–10 PDFs in parallel instead of sequentially
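The concurrency tip can be sketched with a thread pool, which suits this workload because each call mostly waits on the network. This is a minimal sketch, not a definitive implementation: `ocr_many` is a hypothetical helper, `fake_ocr` is a stand-in so the example runs without network access, and the default of 5 workers is taken from the lower end of the range in the tip. Whether your plan permits that many parallel requests is an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ocr_many(paths, ocr_fn, max_workers=5):
    """Run ocr_fn over paths in parallel; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_fn, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:
                # Record the failure and keep processing the other files.
                errors[path] = exc
    return results, errors


def fake_ocr(path):
    """Stand-in for a real OCR call (hypothetical, for demonstration only)."""
    if path.endswith(".txt"):
        raise ValueError(f"unsupported file: {path}")
    return f"text from {path}"


results, errors = ocr_many(["a.pdf", "b.jpg", "notes.txt"], fake_ocr)
print(sorted(results))
print(sorted(errors))
```

In a real pipeline, `ocr_file` from the mixed-input example could be passed as `ocr_fn`. Collecting errors per file means one bad scan does not abort the whole batch.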
Pricing
Each /ocr-pdf request counts as one API call regardless of page count. Free tier: 100 requests/month. Pro plan available for production workloads.
The OCR Wizard API is available on RapidAPI with a free tier to get started.