## The problem
If you need to extract text from PDFs, especially large ones with 100+ pages, and don't want to pay for cloud OCR services or send your documents to an LLM API, PaddleOCR can handle it locally on your own GPU.

paddleocr-pdf-api is an open-source Docker image that wraps PaddleOCR's vision-language model in a REST API. It runs on your GPU and lets you fetch results page by page as they're processed, without waiting for the entire document to finish.
## When this is useful
- Processing large volumes of PDFs - submit documents via API and process them one by one through a job queue
- Sensitive documents that can't leave your network - everything runs locally, no external API calls
- Large documents (100+ pages) - results stream page-by-page, so you can start consuming output before the full document is done
- Integrating OCR into a pipeline - simple REST API that any language/tool can call
- Less common languages - handles languages that many OCR tools struggle with
## What's under the hood
- Model: PaddleOCR-VL-1.5 (0.9B parameters)
- GPU VRAM: ~8.5 GB
- Output: Markdown (headings, tables, paragraphs) via JSON
- Storage: SQLite + filesystem, persisted via Docker volume
- Stack: FastAPI, PaddlePaddle GPU, pypdfium2 - single Python file
## Setup
Requirements: Docker with NVIDIA Container Toolkit and a GPU with ~8.5 GB VRAM.
Create a `docker-compose.yml`:

```yaml
services:
  paddleocr:
    image: edgaras0x4e/paddleocr-pdf-api:latest
    ports:
      - "8099:8000"
    volumes:
      - ocr-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ocr-data:
```

Then start it:

```shell
docker compose up -d
```
## Usage

### Submit a PDF

```shell
curl -X POST http://localhost:8099/ocr -F "file=@document.pdf"
```

```json
{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "filename": "document.pdf",
  "status": "queued"
}
```
### Poll progress

```shell
curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15
```

```json
{
  "job_id": "994e7b398bb44d8ab5eade4d2ef57a15",
  "status": "processing",
  "total_pages": 185,
  "processed_pages": 42
}
```
### Fetch a single page (as soon as it's ready)

No need to wait for the entire document:

```shell
curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/pages/1
```

```json
{
  "page_num": 1,
  "markdown": "## Chapter 1\n\nLorem ipsum dolor sit amet..."
}
```
### Fetch all pages

```shell
curl http://localhost:8099/ocr/994e7b398bb44d8ab5eade4d2ef57a15/result
```

Returns the full document with a `pages` array containing each page's markdown.
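If you just want one markdown document rather than per-page chunks, the `pages` array can be joined. A minimal helper of my own (not part of the API), assuming each entry carries the `markdown` field shown earlier:

```python
def join_pages(result):
    """Concatenate per-page markdown from the /result payload into a
    single document, with a blank line between pages."""
    return "\n\n".join(page["markdown"] for page in result["pages"])
```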
### List jobs, cancel, delete

```shell
# List all jobs
curl http://localhost:8099/jobs

# Cancel a running job
curl -X POST http://localhost:8099/ocr/{job_id}/cancel

# Delete a job and its data
curl -X DELETE http://localhost:8099/ocr/{job_id}
```
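These endpoints compose into simple housekeeping scripts. The sketch below deletes finished jobs; it assumes the `/jobs` response is a list of objects with `job_id` and `status` fields (the exact payload shape isn't documented here), and `delete` is a callable you supply:

```python
def delete_completed(jobs, delete):
    """Delete every finished job from a parsed /jobs listing.

    jobs:   assumed list of dicts with "job_id" and "status" keys
    delete: callable issuing DELETE /ocr/{job_id}, e.g.
            lambda jid: requests.delete(f"{API}/ocr/{jid}")
    Returns the ids that were deleted.
    """
    deleted = []
    for job in jobs:
        if job["status"] == "completed":
            delete(job["job_id"])
            deleted.append(job["job_id"])
    return deleted
```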
## Full API reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /ocr | Upload a PDF |
| GET | /ocr/{job_id} | Job status and progress |
| GET | /ocr/{job_id}/pages/{page_num} | Single page result |
| GET | /ocr/{job_id}/result | All pages |
| POST | /ocr/{job_id}/cancel | Cancel a job |
| DELETE | /ocr/{job_id} | Delete job and its data |
| GET | /jobs | List all jobs |
## API key authentication

To restrict access, set the `API_KEY` environment variable:

```yaml
environment:
  - API_KEY=your-secret-key
```

All requests then require the header:

```shell
curl -H "X-API-Key: your-secret-key" http://localhost:8099/jobs
```
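In Python client code, the header can be attached once per session instead of on every call. This is plain `requests` usage, with the key value as a placeholder:

```python
import requests

# Attach the key once; every request made through this session
# then sends the X-API-Key header automatically.
session = requests.Session()
session.headers["X-API-Key"] = "your-secret-key"

# e.g. session.get("http://localhost:8099/jobs")
```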
## Configuration

| Variable | Default | Description |
|---|---|---|
| API_KEY | (empty) | Optional authentication key |
| OCR_DPI | 200 | DPI for PDF page rendering (higher = better quality, slower) |
| DB_PATH | /data/ocr.db | SQLite database path |
| UPLOAD_DIR | /data/uploads | Upload storage path |
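For example, to trade speed for quality on high-resolution scans, you could raise `OCR_DPI` in the compose file (300 here is just an illustrative value, not a recommendation from the project):

```yaml
services:
  paddleocr:
    environment:
      - API_KEY=your-secret-key
      - OCR_DPI=300
```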
## Example: integrating into a Python script

A minimal example that submits a PDF and waits for results:

```python
import requests
import time

API = "http://localhost:8099"

# Submit
with open("scan.pdf", "rb") as f:
    resp = requests.post(f"{API}/ocr", files={"file": f})
job_id = resp.json()["job_id"]

# Poll until done
while True:
    status = requests.get(f"{API}/ocr/{job_id}").json()
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise Exception(status["error"])
    print(f"{status.get('processed_pages', 0)}/{status.get('total_pages', '?')} pages done")
    time.sleep(5)

# Get results
result = requests.get(f"{API}/ocr/{job_id}/result").json()
for page in result["pages"]:
    print(f"--- Page {page['page_num']} ---")
    print(page["markdown"])
```
## How it works

- The PDF is uploaded and saved to disk
- A background worker picks up queued jobs sequentially
- Each page is rendered to an image using pypdfium2
- PaddleOCR-VL extracts the text and converts it to markdown
- HTML artifacts and image placeholders are cleaned from the output
- Results are stored in SQLite and available per page as they complete
- Jobs interrupted by a container restart are automatically re-queued
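The queue-and-store flow above can be sketched in a few lines. This is an illustration of the pattern, not the project's actual code; `process_page` and `store` stand in for the OCR model call and the SQLite write:

```python
def run_jobs(queued_jobs, process_page, store):
    """Sequentially drain a job queue, storing each page's result as
    soon as it's produced (mirrors the worker flow described above).

    queued_jobs:  list of dicts with "id" and "pages"
    process_page: page -> markdown (stand-in for the OCR model)
    store:        (job_id, page_num, markdown) -> None (stand-in for SQLite)
    """
    for job in queued_jobs:
        job["status"] = "processing"
        for num, page in enumerate(job["pages"], start=1):
            store(job["id"], num, process_page(page))
            # Progress is updated per page, which is what makes
            # page-by-page fetching possible while a job is running.
            job["processed_pages"] = num
        job["status"] = "completed"
```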
- Source: github.com/Edgaras0x4E/paddleocr-pdf-api
- Docker image: hub.docker.com/r/edgaras0x4e/paddleocr-pdf-api