Alex Spinov

Posted on Mar 28

Paperless-ngx Has a Free API: Self-Hosted Document Management with OCR and Full-Text Search

#selfhosted #opensource #tutorial #productivity

What is Paperless-ngx?

Paperless-ngx is a self-hosted document management system that transforms physical documents into a searchable online archive. It OCRs your documents, extracts text, and lets you search everything via API.

Scanned receipts, invoices, letters — all searchable in seconds.

Quick Start

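mkdir paperless && cd paperless
# The compose file reads its settings from docker-compose.env, so fetch both files
wget https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/docker/compose/docker-compose.sqlite.yml -O docker-compose.yml
wget https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/docker/compose/docker-compose.env
docker compose up -d
# Create the admin account (prompts for a username and password)
docker compose run --rm webserver createsuperuser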

Open http://localhost:8000.

The REST API

export PL_URL="http://localhost:8000/api"
export PL_TOKEN="your-token"
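
If you do not have a token yet, one way to get it is to exchange your username and password at the API's token endpoint. The sketch below is a minimal, stdlib-only illustration; the helper names `build_token_request` and `fetch_token` are made up for this post, and the example credentials are placeholders.

```python
import json
import urllib.request


def build_token_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/token/ with JSON credentials."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/token/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(base_url: str, username: str, password: str) -> str:
    """Send the request and return the token string from the JSON response.

    Assumes the response body looks like {"token": "..."}.
    """
    request = build_token_request(base_url, username, password)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]


# Usage (needs a running instance, so it is left commented out):
# print(fetch_token("http://localhost:8000/api", "admin", "your-password"))
```

The returned value is what goes into PL_TOKEN above. The token can usually also be copied from your profile page in the web UI.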

Upload Documents

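# correspondent and tags take numeric IDs, not names; repeat the tags field once per tag
curl -X POST "$PL_URL/documents/post_document/" \
  -H "Authorization: Token $PL_TOKEN" \
  -F "document=@invoice.pdf" \
  -F "title=March Invoice" \
  -F "correspondent=CORRESPONDENT_ID" \
  -F "tags=2" \
  -F "tags=5"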

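The call returns as soon as the file is queued, with the UUID of the consumption task as the response body. Paperless then OCRs the document, extracts its text, classifies it, and adds it to the search index in the background.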
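
Because processing happens after the upload call has returned, a script that needs the new document's ID has to wait for the task to finish. Here is a rough polling sketch. It assumes the tasks endpoint (`$PL_URL/tasks/?task_id=<uuid>`) returns a JSON list whose entries carry `status`, `result`, and `related_document` fields; `wait_for_task` and the injected `fetch_tasks` callable are hypothetical names, not part of Paperless.

```python
import time
from typing import Callable, Optional


def wait_for_task(
    fetch_tasks: Callable[[str], list],
    task_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
) -> Optional[str]:
    """Poll until the consumption task succeeds, then return the related document ID.

    fetch_tasks(task_id) is expected to GET $PL_URL/tasks/?task_id=<uuid>
    and return the parsed JSON list (an assumption about the response shape).
    """
    for _ in range(max_polls):
        tasks = fetch_tasks(task_id)
        if tasks:
            task = tasks[0]
            if task["status"] == "SUCCESS":
                return task.get("related_document")
            if task["status"] == "FAILURE":
                raise RuntimeError(f"consumption failed: {task.get('result')}")
        # Task not visible yet, or still pending/started: wait and retry.
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Passing the fetch function in keeps the polling logic independent of the HTTP library you use, and makes it easy to try out with canned responses.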

Search Documents

# Full-text search
curl -s "$PL_URL/documents/?query=invoice+march+2026" \
  -H "Authorization: Token $PL_TOKEN" | jq '.results[] | {title, correspondent, created}'

# Filter by tags
curl -s "$PL_URL/documents/?tags__id__in=2,5" \
  -H "Authorization: Token $PL_TOKEN"

# Filter by date range
curl -s "$PL_URL/documents/?created__date__gt=2026-01-01&created__date__lt=2026-04-01" \
  -H "Authorization: Token $PL_TOKEN"

# Filter by correspondent name (case-insensitive substring match)
curl -s "$PL_URL/documents/?correspondent__name__icontains=Acme" \
  -H "Authorization: Token $PL_TOKEN"
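
The same queries can be assembled in a script. The sketch below builds the query string from keyword arguments and walks through paginated results by following each page's `next` link. It assumes the list response has the usual `results` and `next` keys; `build_documents_url` and `iter_documents` are illustrative names, and `fetch_page` is whatever function you use to GET a URL and parse the JSON.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode


def build_documents_url(base_url: str, **filters: str) -> str:
    """Return the documents list URL with the given filters as query parameters.

    Commas are left unescaped so list filters such as tags__id__in=2,5 stay readable.
    """
    query = urlencode(filters, safe=",")
    return f"{base_url.rstrip('/')}/documents/?{query}"


def iter_documents(fetch_page: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield every document across all pages, following the 'next' links."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")


# Example: reproduces the full-text search URL used above.
print(build_documents_url("http://localhost:8000/api", query="invoice march 2026"))
```

Filters can be combined in one call, for example `build_documents_url(base, tags__id__in="2,5", created__date__gt="2026-01-01")`.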

Get Document

# Get metadata
curl -s "$PL_URL/documents/DOC_ID/" \
  -H "Authorization: Token $PL_TOKEN" | jq '{title, content, tags, correspondent, created}'

# Download the original file (without original=true, the archived version is served when one exists)
curl -o document.pdf "$PL_URL/documents/DOC_ID/download/?original=true" \
  -H "Authorization: Token $PL_TOKEN"

# Download thumbnail (typically served as WebP in recent releases)
curl -o thumb.webp "$PL_URL/documents/DOC_ID/thumb/" \
  -H "Authorization: Token $PL_TOKEN"

# Get OCR'd text
curl -s "$PL_URL/documents/DOC_ID/" \
  -H "Authorization: Token $PL_TOKEN" | jq -r '.content'

Tags, Correspondents, Document Types

# Create tag (JSON bodies need an explicit Content-Type; curl -d defaults to form encoding)
curl -X POST "$PL_URL/tags/" \
  -H "Authorization: Token $PL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Tax 2026", "color": "#3498db", "is_inbox_tag": false}'

# Create correspondent (matching_algorithm 1 = any word)
curl -X POST "$PL_URL/correspondents/" \
  -H "Authorization: Token $PL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Acme Corp", "matching_algorithm": 1, "match": "acme"}'

# Create document type
curl -X POST "$PL_URL/document_types/" \
  -H "Authorization: Token $PL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Invoice", "matching_algorithm": 1, "match": "invoice"}'
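
One detail that is easy to miss when scripting these calls: a JSON body has to be sent with a `Content-Type: application/json` header, since curl's `-d` defaults to form encoding. A small Python sketch of the same create calls follows; `build_create_request` is a hypothetical helper, and the token value is a placeholder.

```python
import json
import urllib.request


def build_create_request(base_url: str, token: str, resource: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST for a resource such as 'tags', 'correspondents', or 'document_types'."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{resource}/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (needs a running instance, so the send step is commented out):
request = build_create_request(
    "http://localhost:8000/api",
    "your-token",
    "tags",
    {"name": "Tax 2026", "color": "#3498db", "is_inbox_tag": False},
)
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["id"])
```

The ID in the response is the number to use later in filters such as `tags__id__in` or in the upload form fields.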

Auto-Matching

Paperless can automatically assign tags, correspondents, and types:

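# Set matching rules: tag any document containing "receipt" or "invoice" (1 = any word)
curl -X PATCH "$PL_URL/tags/TAG_ID/" \
  -H "Authorization: Token $PL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"matching_algorithm": 1, "match": "receipt invoice"}'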

Matching algorithms, by the number the API expects: 0 = none, 1 = any word, 2 = all words, 3 = exact phrase, 4 = regular expression, 5 = fuzzy, 6 = auto (ML).
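
To get a feel for how the word-based modes differ, here is a rough local approximation in Python. It is not Paperless's own matching code: the `matches` function is a simplified, case-insensitive stand-in written for this post, and it only covers the any-word, all-words, exact-phrase, and regex modes. The numeric values follow the API's `matching_algorithm` field.

```python
import re
from enum import IntEnum


class MatchingAlgorithm(IntEnum):
    NONE = 0
    ANY = 1
    ALL = 2
    LITERAL = 3
    REGEX = 4
    FUZZY = 5
    AUTO = 6


def matches(algorithm: MatchingAlgorithm, pattern: str, text: str) -> bool:
    """Simplified illustration of the ANY / ALL / LITERAL / REGEX modes."""

    def contains(term: str) -> bool:
        # Whole-word, case-insensitive search for the term.
        return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None

    if algorithm == MatchingAlgorithm.ANY:
        return any(contains(word) for word in pattern.split())
    if algorithm == MatchingAlgorithm.ALL:
        return all(contains(word) for word in pattern.split())
    if algorithm == MatchingAlgorithm.LITERAL:
        return contains(pattern)
    if algorithm == MatchingAlgorithm.REGEX:
        return re.search(pattern, text, re.IGNORECASE) is not None
    raise NotImplementedError(f"{algorithm.name} is not covered by this sketch")


text = "Your invoice for March"
print(matches(MatchingAlgorithm.ANY, "receipt invoice", text))         # any word present
print(matches(MatchingAlgorithm.ALL, "receipt invoice", text))         # every word required
print(matches(MatchingAlgorithm.LITERAL, "receipt OR invoice", text))  # whole phrase required
```

The last call shows why an "OR" inside the match string does not behave like a boolean operator in exact-phrase mode: the whole string has to appear in the document.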

Features

OCR: Tesseract-based text extraction (100+ languages)
Full-text search: Find any word in any document
Auto-classification: ML-based tag/type assignment
Email consumption: Scan email attachments automatically
Mobile scanning: Upload from phone camera
Workflows: Rules for automatic processing

Paperless vs Google Drive

Feature	Paperless-ngx	Google Drive
OCR	Automatic	Manual
Full-text search	Yes (OCR'd)	Yes (native PDFs)
Auto-classify	ML-based	No
Self-hosted	Yes	No
API	Full REST	Yes
Storage	Limited only by your own disk	15 GB free

Need document automation or data extraction tools?

📧 spinov001@gmail.com
🔧 My tools on Apify Store

How do you manage documents? Physical or digital-first?

DEV Community