SIGNAL

Self-Host Paperless-ngx With Local AI — Private Documents, Better Search, Zero Cloud

Your documents deserve better than a folder called Scans_2024_FINAL_v2.

Paperless-ngx has been the gold standard for self-hosted document management for a while now. But the 2026 version of this stack hits different — you can wire it up to a local LLM for automatic classification, smarter tagging, and search that actually understands what's in your documents. No cloud. No API fees. Everything stays on your hardware.

Here's how to set it up from scratch.

What You're Building

A Docker stack with three components:

  • Paperless-ngx — document ingestion, OCR, search, and web UI
  • Ollama — local LLM inference (we'll use mistral for classification)
  • A small Python classifier — bridges Paperless's post-consume hook to Ollama for auto-tagging

The result: drop a PDF into a folder (or email it), and it gets OCR'd, classified by AI, tagged, and made searchable — all within your local network.

Prerequisites

  • Docker + Docker Compose
  • At least 8 GB RAM (16 GB recommended if running 7B+ models)
  • ~10 GB disk for Ollama models
  • A machine that stays on (mini PC, NAS, old laptop — anything works)

Step 1: The Docker Compose Stack

Create a docker-compose.yml:

services:
  paperless-broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - redis-data:/data

  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - paperless-broker
    ports:
      - "8000:8000"
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
      - ./scripts:/usr/src/paperless/scripts
    environment:
      PAPERLESS_REDIS: redis://paperless-broker:6379
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_POST_CONSUME_SCRIPT: /usr/src/paperless/scripts/classify.sh
      USERMAP_UID: 1000
      USERMAP_GID: 1000

  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama

volumes:
  redis-data:
  paperless-data:
  paperless-media:
  ollama-data:

Spin it up:

docker compose up -d

Then pull the model you'll use for classification:

docker compose exec ollama ollama pull mistral
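Before moving on, it's worth confirming the model actually downloaded. Ollama's `/api/tags` endpoint lists the models it has locally; here's a small sketch (the helper names are my own) that checks for one:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # host-side port from the compose file


def model_available(tags_response: dict, name: str) -> bool:
    """True if a model with this base name (any tag, e.g. mistral:latest)
    appears in Ollama's model list."""
    return any(
        m.get("name", "").split(":")[0] == name
        for m in tags_response.get("models", [])
    )


def check_ollama(model: str = "mistral") -> bool:
    """Query /api/tags, which returns {"models": [{"name": ...}, ...]}."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        return model_available(json.loads(resp.read()), model)
```

`check_ollama()` should return `True` once the pull has completed.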

Step 2: The AI Classification Script

This is where it gets interesting. Paperless-ngx supports post-consume scripts — code that runs after every document is ingested. We'll use this hook to send the extracted text to Ollama and get back structured tags.

Create scripts/classify.py:

#!/usr/bin/env python3
"""
Post-consume classifier for Paperless-ngx.
Sends document text to Ollama, gets back tags and document type.
"""

import json
import os
import sys
import urllib.request

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
PAPERLESS_URL = os.getenv("PAPERLESS_URL", "http://localhost:8000")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN", "")

DOCUMENT_ID = os.getenv("DOCUMENT_ID")
DOCUMENT_FILE_NAME = os.getenv("DOCUMENT_FILE_NAME", "unknown")

SYSTEM_PROMPT = """You are a document classifier. Given the text of a document,
return a JSON object with:
- "tags": array of 1-4 relevant tags (e.g., "invoice", "medical", "tax", "receipt", "contract", "insurance")
- "doc_type": a single document type (e.g., "Invoice", "Letter", "Receipt", "Report", "Contract")
- "correspondent": who sent this document (company or person name), or null

Return ONLY valid JSON. No explanation."""


def classify_document(text: str) -> dict:
    """Send document text to Ollama for classification."""
    payload = json.dumps({
        "model": "mistral",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify this document:\n\n{text[:3000]}"}
        ],
        "stream": False,
        "format": "json"
    }).encode()

    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        result = json.loads(resp.read())

    return json.loads(result["message"]["content"])


def get_or_create_tag(tag_name: str) -> int:
    """Return the Paperless tag ID, creating the tag if it doesn't exist."""
    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/tags/?name__iexact={tag_name}",
        headers={"Authorization": f"Token {PAPERLESS_TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        existing = json.loads(resp.read())

    if existing["count"] > 0:
        return existing["results"][0]["id"]

    create_payload = json.dumps({"name": tag_name}).encode()
    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/tags/",
        data=create_payload,
        headers={
            "Authorization": f"Token {PAPERLESS_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["id"]


def apply_tags(doc_id: str, classification: dict):
    """Attach AI-generated tags to the document via the Paperless API."""
    tags = classification.get("tags", [])
    doc_type = classification.get("doc_type")

    # Resolve every tag name to an ID, then patch them onto the document
    tag_ids = [get_or_create_tag(name) for name in tags]

    patch_payload = json.dumps({"tags": tag_ids}).encode()
    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/documents/{doc_id}/",
        data=patch_payload,
        headers={
            "Authorization": f"Token {PAPERLESS_TOKEN}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    urllib.request.urlopen(req)

    print(f"[AI] Doc {doc_id}: type={doc_type}, tags={tags}")


if __name__ == "__main__":
    if not DOCUMENT_ID:
        print("No DOCUMENT_ID set, skipping classification")
        sys.exit(0)

    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/documents/{DOCUMENT_ID}/",
        headers={"Authorization": f"Token {PAPERLESS_TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        doc = json.loads(resp.read())

    text = doc.get("content", "")
    if len(text) < 50:
        print(f"[AI] Doc {DOCUMENT_ID}: too short, skipping")
        sys.exit(0)

    classification = classify_document(text)
    apply_tags(DOCUMENT_ID, classification)
    print(f"[AI] Classified: {DOCUMENT_FILE_NAME} -> {classification}")
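One caveat with `classify_document`: it trusts the model to return clean JSON, and small local models occasionally wrap their answer in markdown fences or return odd types. A defensive parser along these lines (a sketch; `parse_classification` is a hypothetical helper, not part of Paperless or Ollama) can normalize the reply before it reaches `apply_tags`:

```python
import json


def parse_classification(raw: str, max_tags: int = 4) -> dict:
    """Best-effort parse of the model's reply into {tags, doc_type, correspondent}."""
    text = raw.strip()
    # Strip markdown code fences some models add despite instructions
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Unparseable reply -> empty classification rather than a crash
        return {"tags": [], "doc_type": None, "correspondent": None}

    tags = data.get("tags", [])
    if not isinstance(tags, list):
        tags = [str(tags)]
    # Normalize to lowercase strings and cap the count
    tags = [str(t).strip().lower() for t in tags if t][:max_tags]

    return {
        "tags": tags,
        "doc_type": data.get("doc_type"),
        "correspondent": data.get("correspondent"),
    }
```

Swap `json.loads(result["message"]["content"])` for `parse_classification(result["message"]["content"])` and a flaky reply becomes a no-op instead of a traceback.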

And the shell wrapper scripts/classify.sh:

#!/bin/bash
# Post-consume hook for Paperless-ngx
export OLLAMA_URL="http://ollama:11434"
export PAPERLESS_URL="http://localhost:8000"
export PAPERLESS_TOKEN="${PAPERLESS_API_TOKEN}"

# Skip quietly if the API token hasn't been configured yet
if [ -z "${PAPERLESS_TOKEN}" ]; then
  echo "[AI] PAPERLESS_API_TOKEN not set; skipping classification"
  exit 0
fi

python3 /usr/src/paperless/scripts/classify.py

Make it executable:

chmod +x scripts/classify.sh scripts/classify.py

Step 3: Generate Your API Token

docker compose exec paperless \
  python3 manage.py shell -c \
  "from rest_framework.authtoken.models import Token; \
   from django.contrib.auth.models import User; \
   u = User.objects.first(); \
   t, _ = Token.objects.get_or_create(user=u); \
   print(t.key)"

Add the token to the paperless service's environment in docker-compose.yml so the post-consume script can reach the API:

environment:
  PAPERLESS_API_TOKEN: "your-token-here"

Step 4: Test the Pipeline

Drop a PDF into the consume/ folder:

cp ~/Downloads/some-invoice.pdf ./consume/

Watch the logs:

docker compose logs -f paperless

You should see the document get ingested, OCR'd, then classified:

[AI] Classified: some-invoice.pdf -> {"tags": ["invoice", "utilities"], "doc_type": "Invoice", "correspondent": "Electric Company Inc"}

Making It Smarter

Swap the model. Mistral works fine for classification, but if you have the VRAM, try llama3:8b or phi3 for better accuracy on mixed-language documents.
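To make model-swapping a one-line change, you can hoist the request body out of `classify_document` into a helper driven by an environment variable (a sketch; `CLASSIFY_MODEL` is an assumed variable name, not something Paperless or Ollama define):

```python
import os

# Hypothetical env override; the default matches the model pulled earlier
CLASSIFY_MODEL = os.getenv("CLASSIFY_MODEL", "mistral")


def build_chat_payload(text: str, system_prompt: str,
                       model: str = CLASSIFY_MODEL,
                       max_chars: int = 3000) -> dict:
    """Build the Ollama /api/chat request body used by classify_document."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            # Truncate long documents so the prompt fits the context window
            {"role": "user", "content": f"Classify this document:\n\n{text[:max_chars]}"},
        ],
        "stream": False,
        "format": "json",
    }
```

Then switching to `llama3:8b` is just `CLASSIFY_MODEL=llama3:8b` in the compose file, with no code change.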

Add a feedback loop. When you manually correct a tag in Paperless, log it. After enough corrections, you can fine-tune your prompt or switch to a specialized model.
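A minimal version of that feedback loop just diffs the AI's tags against the document's final tags and appends the result to a JSONL log (a sketch; the file path and record fields are made up):

```python
import json


def tag_corrections(ai_tags: list, final_tags: list) -> dict:
    """Compute which tags a human added or removed relative to the AI's guess."""
    ai, final = set(ai_tags), set(final_tags)
    return {
        "added": sorted(final - ai),     # tags the AI missed
        "removed": sorted(ai - final),   # tags the AI got wrong
    }


def log_correction(doc_id: int, ai_tags: list, final_tags: list,
                   path: str = "corrections.jsonl") -> dict:
    """Append one correction record; review the log later to refine the prompt."""
    record = {"doc_id": doc_id, **tag_corrections(ai_tags, final_tags)}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Tags that keep showing up under `"removed"` are good candidates for an explicit "do not use" line in the system prompt.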

Email ingestion. Paperless-ngx supports IMAP consumption out of the box. It's configured in the web UI rather than through environment variables: open Settings → Mail, add a mail account (IMAP server, username, password), then create a mail rule that tells Paperless which folder to watch and how to handle matching messages.

Forward receipts and invoices to a dedicated email, and they land in Paperless, classified and tagged, without you lifting a finger.

Why Local Matters

Every time you upload a document to a cloud service, you're trusting someone else with your tax returns, medical records, and contracts. Running this locally means:

  • Zero data leaves your network — not even for OCR
  • No monthly fees — Ollama is free, Paperless is free
  • No rate limits — classify 1,000 documents at 3 AM if you want
  • Full control — swap models, change prompts, add custom logic

The hardware cost? A used mini PC with 16 GB RAM runs this stack comfortably. That's a one-time $150-200 investment vs. $10-20/month for cloud document management that still can't auto-tag your stuff.

What's Next

This is a foundation. From here you can add:

  • Semantic search with embeddings (pipe document text through nomic-embed-text and store vectors in pgvector)
  • Multi-language support by switching OCR languages and using multilingual models
  • Mobile scanning with apps that upload directly to your consume folder via WebDAV
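The semantic-search bullet boils down to three operations: embed each document once, embed the query, rank by cosine similarity. A rough sketch, assuming Ollama's `/api/embeddings` endpoint with the `nomic-embed-text` model (in practice you'd persist the vectors in pgvector rather than recompute them):

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Get an embedding vector from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Searching is then `sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)` over your stored vectors.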

The self-hosted document stack in 2026 is genuinely better than most paid alternatives. The AI layer just makes it unfair.


SIGNAL covers practical AI and self-hosting for builders. No hype, no fluff — just things that work.
