Edgaras

Posted on Mar 22

Using a Self-Hosted PDF OCR API with PaddleOCR

#ocr #api #python #fastapi

The problem

If you need to extract text from PDFs, especially large ones with 100+ pages, and don't want to pay for cloud OCR services or send the documents to LLM APIs, PaddleOCR can handle it locally on your own GPU.

paddleocr-pdf-api is an open-source Docker image that wraps PaddleOCR's vision-language model into a REST API. It runs on your GPU and lets you fetch results page-by-page as they're processed, without waiting for the entire document to finish.

When this is useful

Processing large volumes of PDFs - submit documents via API and process them one by one through a job queue
Sensitive documents that can't leave your network - everything runs locally, no external API calls
Large documents (100+ pages) - results stream page-by-page, so you can start consuming output before the full document is done
Integrating OCR into a pipeline - simple REST API that any language/tool can call
Less common languages - handles languages that many OCR tools struggle with

What's under the hood

Model: PaddleOCR-VL-1.5 (0.9B parameters)
GPU VRAM: ~8.5 GB
Output: Markdown (headings, tables, paragraphs) via JSON
Storage: SQLite + filesystem, persisted via Docker volume
Stack: FastAPI, PaddlePaddle GPU, pypdfium2 - single Python file

Setup

Requirements: Docker with NVIDIA Container Toolkit and a GPU with ~8.5 GB VRAM.

Create a docker-compose.yml:

services:
  paddleocr:
    image: edgaras0x4e/paddleocr-pdf-api:latest
    ports:
      - "8099:8000"
    volumes:
      - ocr-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ocr-data:

Then start the service:

docker compose up -d

Usage

Submit a PDF

curl -X POST http://localhost:8099/ocr -F "file=@document.pdf"

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "queued"
}

Poll progress

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "status": "processing",
  "total_pages": 185,
  "processed_pages": 42
}
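
The two counters in the status response are enough to derive a progress fraction and a rough time estimate. Here is a minimal sketch. The helper names `progress` and `eta_seconds` are hypothetical, and it assumes `total_pages` may be missing or zero while a job is still queued, which the API docs don't confirm:

```python
from typing import Optional


def progress(status: dict) -> float:
    """Return the fraction of pages done, between 0.0 and 1.0."""
    total = status.get("total_pages") or 0
    done = status.get("processed_pages") or 0
    if total <= 0:
        return 0.0  # assumption: page count is not known yet
    return min(done / total, 1.0)


def eta_seconds(status: dict, elapsed_s: float) -> Optional[float]:
    """Estimate remaining seconds from the average time per page so far."""
    total = status.get("total_pages") or 0
    done = status.get("processed_pages") or 0
    if done <= 0 or total <= 0:
        return None  # no finished pages yet, so there is no rate to extrapolate
    return (total - done) * (elapsed_s / done)


# The status payload shown above
status = {
    "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
    "status": "processing",
    "total_pages": 185,
    "processed_pages": 42,
}
print(f"{progress(status):.0%} done")
```

The estimate is linear, so it will drift on documents where some pages are much denser than others.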

Fetch a single page (as soon as it's ready)

No need to wait for the entire document:

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/pages/1

{
  "page_num": 1,
  "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet..."
}
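
To consume pages while the job is still running, you can combine the status and page endpoints in a generator. This is only a sketch: `iter_pages` is a hypothetical helper, and it assumes pages finish in order, so that pages 1 through `processed_pages` are ready to fetch. The two callables are injected so the loop can be tried without a running server:

```python
import time
from typing import Callable, Iterator


def iter_pages(
    get_status: Callable[[], dict],
    get_page: Callable[[int], dict],
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield page results in order as soon as the status reports them done."""
    next_page = 1
    while True:
        status = get_status()
        # Assumption: pages complete in order, so 1..processed_pages are ready.
        ready = status.get("processed_pages") or 0
        while next_page <= ready:
            yield get_page(next_page)
            next_page += 1
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "OCR job failed"))
        if status["status"] == "completed":
            return
        sleep(poll_interval_s)
```

In practice `get_status` would wrap a GET to `/ocr/{job_id}` and `get_page` a GET to `/ocr/{job_id}/pages/{page_num}`, each returning the parsed JSON. The loop only stops on `completed` or `failed`, so a cancelled job would need its own check or a deadline.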

Fetch all pages

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/result

Returns the full document with a pages array containing each page's markdown.
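
If you want a single markdown file rather than per-page chunks, the pages array can be joined back together. A small sketch, assuming each entry carries the `page_num` and `markdown` fields shown for the single-page endpoint; `join_pages` and the sample data are made up for illustration:

```python
def join_pages(result: dict, separator: str = "\n\n") -> str:
    """Concatenate per-page markdown in page order."""
    pages = sorted(result["pages"], key=lambda p: p["page_num"])
    return separator.join(p["markdown"].strip() for p in pages)


# Hypothetical response, deliberately out of order to show the sort
result = {
    "pages": [
        {"page_num": 2, "markdown": "Second page text.\n"},
        {"page_num": 1, "markdown": "## Chapter 1\n\nFirst page text."},
    ]
}
print(join_pages(result))
```

Sorting by `page_num` is defensive. The API may already return pages in order, but the sketch doesn't rely on it.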

List jobs, cancel, delete

# List all jobs
curl http://localhost:8099/jobs

# Cancel a running job
curl -X POST http://localhost:8099/ocr/{job_id}/cancel

# Delete a job and its data
curl -X DELETE http://localhost:8099/ocr/{job_id}

Full API reference

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/ocr` | Upload a PDF |
| `GET` | `/ocr/{job_id}` | Job status and progress |
| `GET` | `/ocr/{job_id}/pages/{page_num}` | Single page result |
| `GET` | `/ocr/{job_id}/result` | All pages |
| `POST` | `/ocr/{job_id}/cancel` | Cancel a job |
| `DELETE` | `/ocr/{job_id}` | Delete job and its data |
| `GET` | `/jobs` | List all jobs |

API key authentication

To restrict access, set the API_KEY environment variable on the paddleocr service in docker-compose.yml:

environment:
  - API_KEY=your-secret-key

All requests then require the X-API-Key header:

curl -H "X-API-Key: your-secret-key" http://localhost:8099/jobs
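
The same header can be attached from Python. Below is a minimal standard-library sketch. `build_request` is a hypothetical helper, and `BASE_URL` assumes the 8099 port mapping from the compose file:

```python
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8099"  # assumption: port mapping from docker-compose.yml


def build_request(
    method: str, path: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build a request for the OCR API, adding X-API-Key only when a key is set."""
    headers = {"X-API-Key": api_key} if api_key else {}
    return urllib.request.Request(BASE_URL + path, method=method, headers=headers)


list_jobs = build_request("GET", "/jobs", api_key="your-secret-key")
delete_job = build_request(
    "DELETE", "/ocr/994e7b398bb44d8ab5eade4d2ef57a15", api_key="your-secret-key"
)
# To actually send one: urllib.request.urlopen(list_jobs, timeout=10)
```

Building the request separately from sending it keeps the key handling in one place. In a real deployment you would read the key from an environment variable rather than hard-coding it.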

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `API_KEY` | (empty) | Optional authentication key |
| `OCR_DPI` | `200` | DPI for PDF page rendering (higher = better quality, slower) |
| `DB_PATH` | `/data/ocr.db` | SQLite database path |
| `UPLOAD_DIR` | `/data/uploads` | Upload storage path |
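
To get a feel for the OCR_DPI trade-off, it helps to work out the image size a page renders to. PDF page dimensions are given in points (1/72 inch), so the pixel size scales with DPI / 72. This is plain arithmetic, not the project's code, and `rendered_size` is a hypothetical helper:

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Return the pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points
for dpi in (150, 200, 300):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Pixel count grows with the square of the DPI, so going from 200 to 300 multiplies the pixels per page by 2.25. That is the reason higher values run slower.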

Example: integrating into a Python script

A minimal example that submits a PDF and waits for results:

import requests
import time

API = "http://localhost:8099"

# Submit the PDF; the context manager closes the file after the upload
with open("scan.pdf", "rb") as pdf:
    resp = requests.post(f"{API}/ocr", files={"file": pdf})
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll until the job completes or fails
while True:
    status = requests.get(f"{API}/ocr/{job_id}").json()
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(status.get("error", "OCR job failed"))
    print(f"{status['processed_pages']}/{status['total_pages']} pages done")
    time.sleep(5)

# Fetch all pages and print each page's markdown
result = requests.get(f"{API}/ocr/{job_id}/result").json()
for page in result["pages"]:
    print(f"--- Page {page['page_num']} ---")
    print(page["markdown"])
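
The polling loop in the script only exits on completed or failed, so a job that ends in some other state (a cancelled one, for instance) would keep it spinning. One way to guard against that is a deadline. The sketch below is a hypothetical `wait_for_job` helper with an injectable clock and sleep so it can be tested without a server; the one-hour default is arbitrary:

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], dict],
    timeout_s: float = 3600.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job completes; raise on failure or once the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "OCR job failed"))
        if clock() >= deadline:
            raise TimeoutError(f"job still {status['status']!r} after {timeout_s} s")
        sleep(interval_s)
```

With requests, `get_status` could be `lambda: requests.get(f"{API}/ocr/{job_id}", timeout=10).json()`.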

How it works

The PDF is uploaded and saved to disk
A background worker picks up queued jobs sequentially
Each page is rendered to an image using pypdfium2
PaddleOCR-VL extracts text and converts it to markdown
HTML artifacts and image placeholders are cleaned from the output
Results are stored in SQLite and available per-page as they complete
Jobs interrupted by a container restart are automatically re-queued
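
The queueing and re-queue behaviour described above can be illustrated with a few lines of SQLite. To be clear, this is not the project's actual code or schema: the `jobs` table, its columns, and the function names are invented for the sketch, and it assumes a single worker, so no locking is needed:

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) jobs table and re-queue interrupted jobs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, filename TEXT, status TEXT)"
    )
    # Jobs still marked 'processing' at startup were cut off by a restart.
    conn.execute("UPDATE jobs SET status = 'queued' WHERE status = 'processing'")
    conn.commit()


def claim_next_job(conn: sqlite3.Connection):
    """Mark the oldest queued job as processing and return its id, or None."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (row[0],))
    conn.commit()
    return row[0]


def finish_job(conn: sqlite3.Connection, job_id: int, ok: bool = True) -> None:
    """Record the final state of a job."""
    conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ?",
        ("completed" if ok else "failed", job_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db(conn)
conn.executemany(
    "INSERT INTO jobs (filename, status) VALUES (?, ?)",
    [("a.pdf", "queued"), ("b.pdf", "queued")],
)
conn.commit()

first = claim_next_job(conn)   # the worker starts on a.pdf
init_db(conn)                  # simulate a container restart mid-job
again = claim_next_job(conn)   # the interrupted job is picked up again
finish_job(conn, again)
second = claim_next_job(conn)  # then the worker moves on to b.pdf
```

Keeping job state in the database rather than in memory is what makes the restart recovery possible: the startup step only has to look for rows that never reached a final state.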

Source: github.com/Edgaras0x4E/paddleocr-pdf-api
Docker image: hub.docker.com/r/edgaras0x4e/paddleocr-pdf-api

DEV Community