DEV Community

AI Engine

Posted on • Originally published at ai-engine.net

PDF OCR in Python: Extract Text from Scanned PDFs in 5 Lines

You have a folder of scanned PDFs. Invoices, contracts, old reports, bank statements your accountant emailed you. None of them are searchable. PyPDF2 returns empty strings because the text is locked inside pixel data, not actual characters.

The classic Python answer is Tesseract: install the binary, install pytesseract, install pdf2image and Poppler, convert each page to an image, run OCR per page, and stitch the text back together. Forty lines of code, four installs (two of them system binaries), and accuracy that varies with scan quality.
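
For comparison, the local pipeline described above can be sketched roughly as follows. This is a stripped-down illustration, not the full forty-line version: it assumes pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed, and the function name plus the to_images and to_text parameters are hypothetical hooks added here so the flow can be exercised without either dependency.

```python
def ocr_pdf_locally(pdf_path, dpi=300, lang="eng", to_images=None, to_text=None):
    """Rasterize each PDF page, OCR it, and join the per-page text.

    to_images and to_text are hypothetical injection points for this sketch;
    by default they fall back to pdf2image and pytesseract.
    """
    if to_images is None:
        # pdf2image shells out to Poppler, so Poppler must be on the PATH.
        from pdf2image import convert_from_path

        def to_images(path):
            return convert_from_path(path, dpi=dpi)

    if to_text is None:
        # pytesseract shells out to the Tesseract binary.
        import pytesseract

        def to_text(image):
            return pytesseract.image_to_string(image, lang=lang)

    # One OCR pass per page image, stitched together with blank lines.
    return "\n\n".join(to_text(image) for image in to_images(pdf_path))
```

Even in this reduced form, both system binaries have to be present before the function can run with its defaults.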

A cloud OCR API skips all of that. Here are the five lines of Python that replace the entire pipeline.

Want to run it now? Grab a key from the OCR Wizard API and paste it in.

The 5-line solution

import requests

with open("scanned.pdf", "rb") as f:
    r = requests.post(
        "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
        headers={"x-rapidapi-key": "YOUR_API_KEY", "x-rapidapi-host": "ocr-wizard.p.rapidapi.com"},
        files={"pdf_file": f},
        data={"first_page": 1, "last_page": 10},
    )

print("\n\n".join(p["fullText"] for p in r.json()["body"]["pages"]))

No Tesseract install, no pdf2image, no Poppler, no per-page image conversion. Open the PDF, post it, join the pages.
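
The five lines hardcode the key and assume the request succeeds. A minimal sketch of two small helpers for a less throwaway script, assuming the response shape used above (body, pages, fullText): build_headers and join_page_text are hypothetical names, and RAPIDAPI_KEY is an environment variable name chosen for this example, not one the API requires.

```python
import os

API_HOST = "ocr-wizard.p.rapidapi.com"


def build_headers(env=os.environ):
    """Build the RapidAPI headers from an environment mapping."""
    # RAPIDAPI_KEY is a hypothetical variable name picked for this sketch.
    key = env.get("RAPIDAPI_KEY")
    if not key:
        raise RuntimeError("Set RAPIDAPI_KEY before calling the API")
    return {"x-rapidapi-key": key, "x-rapidapi-host": API_HOST}


def join_page_text(payload, separator="\n\n"):
    """Join fullText across pages, tolerating a missing body or pages key."""
    body = payload.get("body") or {}
    pages = body.get("pages") or []
    return separator.join(page.get("fullText", "") for page in pages)
```

With these in place, the call becomes headers=build_headers(), and calling r.raise_for_status() before join_page_text(r.json()) surfaces HTTP errors instead of a KeyError on an error payload.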

One detail worth knowing: the API caps each request at a 10-page range (the difference between first_page and last_page cannot exceed 10). With first_page=1 and last_page=10, the snippet covers any PDF of up to 10 pages in a single call.

Process PDFs longer than 10 pages

Slide a 10-page window across the document:

import requests

HEADERS = {"x-rapidapi-key": "YOUR_API_KEY", "x-rapidapi-host": "ocr-wizard.p.rapidapi.com"}
URL = "https://ocr-wizard.p.rapidapi.com/ocr-pdf"
BATCH = 10

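def ocr_pdf(pdf_path, total_pages):
    """OCR a PDF of any length by requesting it in windows of BATCH pages."""
    all_pages = []
    for start in range(1, total_pages + 1, BATCH):
        end = min(start + BATCH - 1, total_pages)  # the last window may be shorter
        # Reopen the file for each request so every upload starts at byte 0.
        with open(pdf_path, "rb") as f:
            r = requests.post(URL, headers=HEADERS, files={"pdf_file": f},
                              data={"first_page": start, "last_page": end})
        r.raise_for_status()  # stop on an HTTP error instead of parsing an error body
        all_pages.extend(r.json()["body"]["pages"])
    return all_pages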

pages = ocr_pdf("annual-report.pdf", total_pages=120)
print(f"{len(pages)} pages, {sum(len(p['fullText']) for p in pages)} characters")

See the folder-batch and citation variations in the complete guide.
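
ocr_pdf needs total_pages up front. A minimal sketch of reading the count locally, plus the window arithmetic pulled out so it can be checked without calling the API. It assumes the pypdf package is installed (it is not used elsewhere in this post; PyPDF2 3.x exposes the same PdfReader class), and count_pages and page_windows are hypothetical helper names.

```python
def count_pages(pdf_path):
    """Return the number of pages in a PDF (assumes pypdf is installed)."""
    # Imported lazily so the rest of the module works without pypdf.
    from pypdf import PdfReader

    return len(PdfReader(pdf_path).pages)


def page_windows(total_pages, batch=10):
    """Yield inclusive (first_page, last_page) pairs covering 1..total_pages."""
    for start in range(1, total_pages + 1, batch):
        yield start, min(start + batch - 1, total_pages)


print(list(page_windows(25)))  # [(1, 10), (11, 20), (21, 25)]
```

Usage would then look like ocr_pdf(path, total_pages=count_pages(path)), so the page count no longer has to be typed in by hand.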

Track page numbers for citations

The API does not return a pageNumber field, so use the list index plus your first_page offset when you need to cite specific pages:

pages = r.json()["body"]["pages"]  # r is the response from a single request
first_page = 1  # must match the first_page value sent in that request
for offset, p in enumerate(pages):
    print(f"[Page {first_page + offset}] ({p['detectedLanguage']})")
    print(p["fullText"][:200], "...\n")
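
The same offset logic can be wrapped in a small helper so later batches are numbered correctly too. This is a sketch under one assumption: each response returns exactly one entry per requested page, in order. format_citations is a hypothetical name, and the detectedLanguage and fullText keys are the ones used in the snippet above.

```python
def format_citations(pages, first_page=1, preview=200):
    """Return one '[Page N] (lang) text...' line per page dict."""
    lines = []
    # enumerate(start=...) applies the first_page offset directly.
    for number, page in enumerate(pages, start=first_page):
        language = page.get("detectedLanguage", "unknown")
        text = page.get("fullText", "")
        lines.append(f"[Page {number}] ({language}) {text[:preview]}")
    return lines


# A second batch that was requested with first_page=11:
batch = [
    {"detectedLanguage": "en", "fullText": "Invoice total"},
    {"detectedLanguage": "fr", "fullText": "Montant total"},
]
for line in format_citations(batch, first_page=11):
    print(line)
```

When the pages come from ocr_pdf, which appends batches in order starting at page 1, the default first_page=1 already gives absolute page numbers.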

Why skip Tesseract locally

Tesseract is a fine engine and still the go-to for offline, privacy-sensitive workloads. For most other use cases, the cloud approach wins on three concrete points: no system dependencies (no binary, no Poppler), no per-page image conversion (the API accepts the PDF directly), and better accuracy on noisy scans like faxes and skewed pages.

Read the full PDF OCR developer guide on ai-engine.net, or see how to pipe OCR into GPT-4 in the ID Card to JSON tutorial.
