Your documents deserve better than a folder called Scans_2024_FINAL_v2.
Paperless-ngx has been the gold standard for self-hosted document management for a while now. But the 2026 version of this stack hits different — you can wire it up to a local LLM for automatic classification, smarter tagging, and search that actually understands what's in your documents. No cloud. No API fees. Everything stays on your hardware.
Here's how to set it up from scratch.
What You're Building
A Docker stack with three components:
- Paperless-ngx — document ingestion, OCR, search, and web UI
- Ollama — local LLM inference (we'll use mistral for classification)
- A small Python classifier — bridges Paperless post-consume hooks to Ollama for auto-tagging
The result: drop a PDF into a folder (or email it), and it gets OCR'd, classified by AI, tagged, and made searchable — all within your local network.
Prerequisites
- Docker + Docker Compose
- At least 8 GB RAM (16 GB recommended if running 7B+ models)
- ~10 GB disk for Ollama models
- A machine that stays on (mini PC, NAS, old laptop — anything works)
Step 1: The Docker Compose Stack
Create a docker-compose.yml:
services:
  paperless-broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - redis-data:/data

  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - paperless-broker
    ports:
      - "8000:8000"
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
      - ./scripts:/usr/src/paperless/scripts
    environment:
      PAPERLESS_REDIS: redis://paperless-broker:6379
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_POST_CONSUME_SCRIPT: /usr/src/paperless/scripts/classify.sh
      USERMAP_UID: 1000
      USERMAP_GID: 1000

  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama

volumes:
  redis-data:
  paperless-data:
  paperless-media:
  ollama-data:
Spin it up:
docker compose up -d
Then pull the model you'll use for classification:
docker exec -it ollama ollama pull mistral
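Before wiring anything into Paperless, it helps to see the request shape the classifier will send. Ollama's /api/chat endpoint takes a model name, a message list, and a "format": "json" hint that constrains the model to emit valid JSON. A quick sketch of that payload (nothing is sent anywhere here, so you can run it without the stack up):

```python
import json

# The same request body the classifier script will POST to
# http://localhost:11434/api/chat once the stack is running.
payload = {
    "model": "mistral",
    "messages": [
        {"role": "system", "content": "Return ONLY valid JSON."},
        {"role": "user", "content": "Classify this document: ACME Corp invoice #123"},
    ],
    "stream": False,   # one complete response instead of token-by-token chunks
    "format": "json",  # ask Ollama to constrain output to valid JSON
}

body = json.dumps(payload)
print(body[:60])
```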
Step 2: The AI Classification Script
This is where it gets interesting. Paperless-ngx supports post-consume scripts — code that runs after every document is ingested. We'll use this hook to send the extracted text to Ollama and get back structured tags.
Create scripts/classify.py:
#!/usr/bin/env python3
"""
Post-consume classifier for Paperless-ngx.
Sends document text to Ollama, gets back tags and a document type,
then writes the tags back to the document via the Paperless API.
"""
import json
import os
import sys
import urllib.parse
import urllib.request

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "mistral")
PAPERLESS_URL = os.getenv("PAPERLESS_URL", "http://localhost:8000")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN", "")
DOCUMENT_ID = os.getenv("DOCUMENT_ID")
DOCUMENT_FILE_NAME = os.getenv("DOCUMENT_FILE_NAME", "unknown")

SYSTEM_PROMPT = """You are a document classifier. Given the text of a document,
return a JSON object with:
- "tags": array of 1-4 relevant tags (e.g., "invoice", "medical", "tax", "receipt", "contract", "insurance")
- "doc_type": a single document type (e.g., "Invoice", "Letter", "Receipt", "Report", "Contract")
- "correspondent": who sent this document (company or person name), or null
Return ONLY valid JSON. No explanation."""


def api_request(url: str, data: dict | None = None, method: str = "GET") -> dict:
    """Small helper for authenticated JSON requests to the Paperless API."""
    req = urllib.request.Request(
        url,
        data=json.dumps(data).encode() if data is not None else None,
        headers={
            "Authorization": f"Token {PAPERLESS_TOKEN}",
            "Content-Type": "application/json",
        },
        method=method,
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def classify_document(text: str) -> dict:
    """Send document text to Ollama for classification."""
    payload = json.dumps({
        "model": OLLAMA_MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify this document:\n\n{text[:3000]}"},
        ],
        "stream": False,
        "format": "json",
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        result = json.loads(resp.read())
    return json.loads(result["message"]["content"])


def apply_tags(doc_id: str, classification: dict):
    """Look up (or create) each tag, then attach the tag IDs to the document."""
    tags = classification.get("tags", [])
    doc_type = classification.get("doc_type")
    tag_ids = []
    for tag_name in tags:
        query = urllib.parse.quote(str(tag_name))
        existing = api_request(f"{PAPERLESS_URL}/api/tags/?name__iexact={query}")
        if existing["count"] > 0:
            tag_ids.append(existing["results"][0]["id"])
        else:
            created = api_request(
                f"{PAPERLESS_URL}/api/tags/", {"name": str(tag_name)}, method="POST"
            )
            tag_ids.append(created["id"])
    # PATCH the document so the tags actually show up in the UI;
    # creating tags alone does not associate them with anything.
    if tag_ids:
        api_request(
            f"{PAPERLESS_URL}/api/documents/{doc_id}/",
            {"tags": tag_ids},
            method="PATCH",
        )
    print(f"[AI] Doc {doc_id}: type={doc_type}, tags={tags}")


if __name__ == "__main__":
    if not DOCUMENT_ID:
        print("No DOCUMENT_ID set, skipping classification")
        sys.exit(0)
    doc = api_request(f"{PAPERLESS_URL}/api/documents/{DOCUMENT_ID}/")
    text = doc.get("content", "")
    if len(text) < 50:
        print(f"[AI] Doc {DOCUMENT_ID}: too short, skipping")
        sys.exit(0)
    classification = classify_document(text)
    apply_tags(DOCUMENT_ID, classification)
    print(f"[AI] Classified: {DOCUMENT_FILE_NAME} -> {classification}")
And the shell wrapper scripts/classify.sh:
#!/bin/bash
# Post-consume hook for Paperless-ngx
export OLLAMA_URL="http://ollama:11434"
export PAPERLESS_URL="http://localhost:8000"
export PAPERLESS_TOKEN="${PAPERLESS_API_TOKEN}"
python3 /usr/src/paperless/scripts/classify.py
Make it executable:
chmod +x scripts/classify.sh scripts/classify.py
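Local models don't always return tidy JSON: tags can come back capitalized, duplicated, or occasionally as non-strings. If noisy tags start piling up in Paperless, a small normalization pass before apply_tags helps. A sketch (the function name normalize_classification is my own, not part of Paperless or Ollama):

```python
def normalize_classification(raw: dict) -> dict:
    """Clean up an LLM classification dict: lowercase, dedupe, and cap
    the tag list; coerce missing fields to safe defaults."""
    tags = []
    for tag in raw.get("tags", []):
        if not isinstance(tag, str):
            continue  # drop numbers, nulls, nested junk
        tag = tag.strip().lower()
        if tag and tag not in tags:
            tags.append(tag)
    return {
        "tags": tags[:4],  # the prompt asks for 1-4 tags; enforce it
        "doc_type": raw.get("doc_type") or None,
        "correspondent": raw.get("correspondent") or None,
    }

print(normalize_classification(
    {"tags": ["Invoice", "invoice ", 42, "Utilities"], "doc_type": "Invoice"}
))
# → {'tags': ['invoice', 'utilities'], 'doc_type': 'Invoice', 'correspondent': None}
```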
Step 3: Generate Your API Token
docker compose exec paperless \
python3 manage.py shell -c \
"from rest_framework.authtoken.models import Token; \
from django.contrib.auth.models import User; \
u = User.objects.first(); \
t, _ = Token.objects.get_or_create(user=u); \
print(t.key)"
Add the token to the paperless service's environment in docker-compose.yml, then recreate the container with docker compose up -d:
environment:
  PAPERLESS_API_TOKEN: "your-token-here"
Step 4: Test the Pipeline
Drop a PDF into the consume/ folder:
cp ~/Downloads/some-invoice.pdf ./consume/
Watch the logs:
docker compose logs -f paperless
You should see the document get ingested, OCR'd, then classified:
[AI] Classified: some-invoice.pdf -> {"tags": ["invoice", "utilities"], "doc_type": "Invoice", "correspondent": "Electric Company Inc"}
Making It Smarter
Swap the model. Mistral works fine for classification, but if you have the VRAM, try llama3:8b or phi3 for better accuracy on mixed-language documents.
Add a feedback loop. When you manually correct a tag in Paperless, log it. After enough corrections, you can fine-tune your prompt or switch to a specialized model.
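What such a correction log might capture, sketched as a pure function (correction_record and the JSONL destination are illustrative, not a Paperless feature):

```python
import json

def correction_record(doc_id: int, predicted: list[str], final: list[str]) -> dict:
    """Compare the AI's predicted tags with the tags the user settled on,
    recording what the model missed and what it hallucinated."""
    return {
        "doc_id": doc_id,
        "added": sorted(set(final) - set(predicted)),    # tags the model missed
        "removed": sorted(set(predicted) - set(final)),  # tags the user rejected
    }

# Append one JSON line per corrected document; grep the file later to
# spot recurring mistakes worth folding back into the system prompt.
record = correction_record(42, ["invoice", "tax"], ["invoice", "utilities"])
print(json.dumps(record))
```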
Email ingestion. Paperless-ngx can poll an IMAP mailbox out of the box, but it's configured in the web UI rather than through environment variables: open Settings → Mail, add a mail account (IMAP host, username, password), then create a mail rule telling Paperless which messages and attachments to consume.
Forward receipts and invoices to a dedicated email, and they land in Paperless, classified and tagged, without you lifting a finger.
Why Local Matters
Every time you upload a document to a cloud service, you're trusting someone else with your tax returns, medical records, and contracts. Running this locally means:
- Zero data leaves your network — not even for OCR
- No monthly fees — Ollama is free, Paperless is free
- No rate limits — classify 1,000 documents at 3 AM if you want
- Full control — swap models, change prompts, add custom logic
The hardware cost? A used mini PC with 16 GB RAM runs this stack comfortably. That's a one-time $150-200 investment vs. $10-20/month for cloud document management that still can't auto-tag your stuff.
What's Next
This is a foundation. From here you can add:
- Semantic search with embeddings (pipe document text through nomic-embed-text and store vectors in pgvector)
- Multi-language support by switching OCR languages and using multilingual models
- Mobile scanning with apps that upload directly to your consume folder via WebDAV
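The semantic-search item above boils down to comparing embedding vectors by cosine similarity, which is what pgvector's <=> operator exposes (as cosine distance). A dependency-free sketch with made-up 4-dimensional vectors (real embeddings from nomic-embed-text have several hundred dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 4-d "embeddings" for a query and two stored documents.
query = [0.9, 0.1, 0.0, 0.2]
invoice = [0.8, 0.2, 0.1, 0.3]        # similar direction -> high similarity
vacation_photo = [0.0, 0.9, 0.8, 0.1]  # different direction -> low similarity

print(cosine_similarity(query, invoice) > cosine_similarity(query, vacation_photo))
# → True
```

Ranking every document by similarity to an embedded query is exactly what a `ORDER BY embedding <=> query_vector` does in pgvector, just in-database and indexed.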
The self-hosted document stack in 2026 is genuinely better than most paid alternatives. The AI layer just makes it unfair.
SIGNAL covers practical AI and self-hosting for builders. No hype, no fluff — just things that work.