Edgaras

Using a Self-Hosted PDF OCR API with PaddleOCR

The problem

If you need to extract text from PDFs - especially large ones with 100+ pages - and don't want to pay for cloud OCR services or send your documents to LLM APIs, PaddleOCR can do the job locally on your own GPU.

paddleocr-pdf-api is an open-source Docker image that wraps PaddleOCR's vision-language model in a REST API. It runs on your GPU and lets you fetch results page by page as they are processed, without waiting for the entire document to finish.

When this is useful

  • Processing large volumes of PDFs - submit documents via the API and a job queue processes them one at a time
  • Sensitive documents that can't leave your network - everything runs locally, no external API calls
  • Large documents (100+ pages) - results stream page-by-page, so you can start consuming output before the full document is done
  • Integrating OCR into a pipeline - simple REST API that any language/tool can call
  • Less common languages - handles languages that many OCR tools struggle with

What's under the hood

  • Model: PaddleOCR-VL-1.5 (0.9B parameters)
  • GPU VRAM: ~8.5 GB
  • Output: Markdown (headings, tables, paragraphs) via JSON
  • Storage: SQLite + filesystem, persisted via Docker volume
  • Stack: FastAPI, PaddlePaddle GPU, pypdfium2 - single Python file

Setup

Requirements: Docker with NVIDIA Container Toolkit and a GPU with ~8.5 GB VRAM.

Create a docker-compose.yml:

services:
  paddleocr:
    image: edgaras0x4e/paddleocr-pdf-api:latest
    ports:
      - "8099:8000"
    volumes:
      - ocr-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ocr-data:

Start the container:

docker compose up -d
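
Right after startup the API may not answer yet. A minimal sketch of a readiness check, assuming (this is not stated in the article) that the container needs some time to load the model and that any HTTP response from /jobs, even a 401 when API_KEY is set, means the server is up. The function names and timeouts are illustrative, and only the Python standard library is used:

```python
import time
import urllib.error
import urllib.request


def api_is_up(url: str) -> bool:
    """Return True if the server answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        # e.g. 401/403 when API_KEY is set: the server is up, just protected
        return True
    except OSError:
        # Connection refused, reset, or timed out: not accepting requests yet
        return False


def wait_until_ready(url="http://localhost:8099/jobs", probe=api_is_up,
                     timeout=300.0, interval=2.0, sleep=time.sleep):
    """Probe `url` every `interval` seconds; return False once `timeout` has passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_ready() before the first upload avoids connection errors in scripts that start the container and submit a PDF in one go.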

Usage

Submit a PDF

curl -X POST http://localhost:8099/ocr -F "file=@document.pdf"

Response:

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "queued"
}

Poll progress

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15

Response:

{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "status": "processing",
  "total_pages": 185,
  "processed_pages": 42
}

Fetch a single page (as soon as it's ready)

No need to wait for the entire document:

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/pages/1

Response:

{
  "page_num": 1,
  "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet..."
}
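
To consume pages while the job is still running, the status and single-page endpoints can be combined. The generator below is a rough sketch, not part of the project: it assumes that processed_pages = N means pages 1 through N are already retrievable, and that any status other than "queued" or "processing" is final. The names stream_pages and get_json are made up for this example, which uses only the standard library:

```python
import json
import time
import urllib.request

API = "http://localhost:8099"  # host port from the compose file


def get_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def stream_pages(job_id, fetch=get_json, interval=2.0, sleep=time.sleep):
    """Yield page results in order as soon as the job reports them as processed."""
    next_page = 1
    while True:
        status = fetch(f"{API}/ocr/{job_id}")
        # Assumption: pages 1..processed_pages are ready to fetch
        ready = status.get("processed_pages") or 0
        while next_page <= ready:
            yield fetch(f"{API}/ocr/{job_id}/pages/{next_page}")
            next_page += 1
        # Assumption: anything other than queued/processing is a final state
        if status["status"] not in ("queued", "processing"):
            return
        sleep(interval)
```

A loop such as `for page in stream_pages(job_id): print(page["markdown"])` then prints each page shortly after it is processed.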

Fetch all pages

curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/result

Returns the full document with a pages array containing each page's markdown.
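
If you want one Markdown file rather than a list of pages, the pages array can be joined on the client side. A small sketch, assuming the result has the shape used elsewhere in this post (a pages list whose items carry page_num and markdown); join_pages is a hypothetical helper, not something the API provides:

```python
def join_pages(result: dict, separator: str = "\n\n") -> str:
    """Concatenate per-page markdown in page order into one document."""
    pages = sorted(result["pages"], key=lambda p: p["page_num"])
    return separator.join(p["markdown"].strip() for p in pages)
```

Sorting by page_num keeps the output stable even if the array is not in page order.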

List jobs, cancel, delete

# List all jobs
curl http://localhost:8099/jobs

# Cancel a running job
curl -X POST http://localhost:8099/ocr/{job_id}/cancel

# Delete a job and its data
curl -X DELETE http://localhost:8099/ocr/{job_id}

Full API reference

Method  Endpoint                            Description
POST    /ocr                                Upload a PDF
GET     /ocr/{job_id}                       Job status and progress
GET     /ocr/{job_id}/pages/{page_num}      Single page result
GET     /ocr/{job_id}/result                All pages
POST    /ocr/{job_id}/cancel                Cancel a job
DELETE  /ocr/{job_id}                       Delete job and its data
GET     /jobs                               List all jobs

API key authentication

To restrict access, set the API_KEY environment variable on the paddleocr service in docker-compose.yml:

environment:
  - API_KEY=your-secret-key

All requests then require the header:

curl -H "X-API-Key: your-secret-key" http://localhost:8099/jobs
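
The same header can be attached from Python. A minimal standard-library sketch; auth_headers, build_request and get_json are illustrative names, and the only thing taken from the API itself is the X-API-Key header:

```python
import json
import urllib.request


def auth_headers(api_key=None) -> dict:
    """Return the X-API-Key header, or no headers when no key is configured."""
    return {"X-API-Key": api_key} if api_key else {}


def build_request(url, api_key=None, method="GET"):
    """Create a request object that carries the key, if one is given."""
    return urllib.request.Request(url, headers=auth_headers(api_key), method=method)


def get_json(url, api_key=None) -> dict:
    """Send an authenticated GET and decode the JSON body."""
    with urllib.request.urlopen(build_request(url, api_key), timeout=30) as resp:
        return json.load(resp)
```

For example, `get_json("http://localhost:8099/jobs", api_key="your-secret-key")` mirrors the curl call above, and `build_request(url, key, method="DELETE")` covers the delete endpoint.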

Configuration

Variable    Default        Description
API_KEY     (empty)        Optional authentication key
OCR_DPI     200            DPI for PDF page rendering (higher = better quality, slower)
DB_PATH     /data/ocr.db   SQLite database path
UPLOAD_DIR  /data/uploads  Upload storage path
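
To get a feel for the OCR_DPI trade-off, here is the arithmetic behind it. PDF page sizes are measured in points (72 per inch), so the rendered image grows linearly with DPI and its pixel count grows with the square. This is only a back-of-the-envelope sketch; how the image rounds sizes internally is not covered here, and the function names are made up:

```python
PDF_POINTS_PER_INCH = 72


def rendered_size(width_pt: float, height_pt: float, dpi: int = 200):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / PDF_POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def pixel_count(width_pt: float, height_pt: float, dpi: int = 200) -> int:
    """Total number of pixels the OCR model has to look at for one page."""
    width_px, height_px = rendered_size(width_pt, height_pt, dpi)
    return width_px * height_px


# US Letter is 612 x 792 points
print(rendered_size(612, 792, 200))  # (1700, 2200)
print(rendered_size(612, 792, 300))  # (2550, 3300)
print(pixel_count(612, 792, 300) / pixel_count(612, 792, 200))  # 2.25
```

Going from 200 to 300 DPI therefore means 2.25 times as many pixels per page, which is why higher values are slower.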

Example: integrating into a Python script

A minimal example that submits a PDF and waits for results:

import time

import requests

API = "http://localhost:8099"

# Submit the PDF; the context manager closes the file after the upload
with open("scan.pdf", "rb") as f:
    resp = requests.post(f"{API}/ocr", files={"file": f})
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll until the job finishes
while True:
    status = requests.get(f"{API}/ocr/{job_id}").json()
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(status.get("error", "OCR job failed"))
    # Page counts may be missing while the job is still queued
    done = status.get("processed_pages", 0)
    total = status.get("total_pages", "?")
    print(f"{done}/{total} pages done")
    time.sleep(5)

# Get results
result = requests.get(f"{API}/ocr/{job_id}/result").json()
for page in result["pages"]:
    print(f"--- Page {page['page_num']} ---")
    print(page["markdown"])

How it works

  1. PDF is uploaded and saved to disk
  2. A background worker picks up queued jobs sequentially
  3. Each page is rendered to an image using pypdfium2
  4. PaddleOCR-VL extracts text and converts it to markdown
  5. HTML artifacts and image placeholders are cleaned from the output
  6. Results are stored in SQLite and available per-page as they complete
  7. Jobs interrupted by a container restart are automatically re-queued
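
Steps 2 and 7 can be illustrated with a tiny SQLite-backed queue. This is a sketch of the general pattern, not the project's actual code: the table layout, the integer IDs (the real API returns hex job IDs) and the function names are all invented for the example.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create a hypothetical jobs table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " filename TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    return db


def enqueue(db, filename: str) -> int:
    """Add a job in the 'queued' state and return its ID."""
    cur = db.execute("INSERT INTO jobs (filename) VALUES (?)", (filename,))
    db.commit()
    return cur.lastrowid


def requeue_interrupted(db) -> int:
    """On startup, put jobs that were mid-processing back into the queue."""
    cur = db.execute("UPDATE jobs SET status = 'queued' WHERE status = 'processing'")
    db.commit()
    return cur.rowcount


def claim_next(db):
    """Mark the oldest queued job as processing and return (id, filename), or None."""
    row = db.execute(
        "SELECT id, filename FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (row[0],))
    db.commit()
    return row


def finish(db, job_id: int) -> None:
    """Mark a job as completed."""
    db.execute("UPDATE jobs SET status = 'completed' WHERE id = ?", (job_id,))
    db.commit()
```

A single worker would call requeue_interrupted once at startup and then loop over claim_next, so a job that was cut off by a restart is simply picked up again. Because the state lives in SQLite on the Docker volume, it survives the restart.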
