Keshav Ashiya

Docify: Building a Production RAG System for Knowledge Management

Knowledge workers drown in information. We collect documents at scale—research papers, PDFs, articles, code—but can't retrieve or synthesize what we've gathered. Most solutions force a choice: keep data local and lose AI, or move to cloud and lose privacy. Docify dissolves this false binary through 11 specialized services orchestrated into a complete RAG pipeline.

Architecture: 11 Services, One Pipeline

Input Layer: Parsing & Chunking -
Resource Ingestion handles heterogeneous formats (PDF, DOCX, XLSX, Markdown, TXT). Deduplication Service computes SHA-256 hashes on raw content—preventing re-processing when the same research paper arrives from three sources. Chunking Service uses tiktoken for accurate token counting (512 tokens, 50-token overlap) while respecting paragraph boundaries and preserving section hierarchies.
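
To make the windowing concrete, here is a minimal sketch of token-based chunking with tiktoken (the encoding choice and function name are illustrative; the paragraph-boundary and section-hierarchy handling described above is omitted):

import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
    tokens = enc.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks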

Embedding Layer: Async Vector Generation -
Generating embeddings inline would block API responses, so Docify uses Celery + Redis to decouple them: uploads return immediately while workers process embeddings asynchronously. The Embeddings Service uses all-minilm:22m (384-dim, 22MB), aggressively lightweight compared to 768-dim models, yet sentence-transformers research shows minimal quality loss. Storage in PostgreSQL pgvector with HNSW indexing enables <200ms vector search across 10K documents.
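
A minimal sketch of the async pattern, assuming a Redis broker on its default port and Ollama's /api/embeddings endpoint (queue names, retries, and persistence are simplified):

import requests
from celery import Celery

# Broker URL is illustrative; Docify's actual configuration may differ.
app = Celery("docify", broker="redis://localhost:6379/0")

@app.task
def embed_chunk(chunk_id: str, text: str) -> list[float]:
    """Generate a 384-dim embedding for one chunk via the local Ollama server."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:22m", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    embedding = resp.json()["embedding"]
    # Writing the vector back to the chunks table (pgvector) is omitted here.
    return embedding

The API handler only enqueues embed_chunk.delay(chunk_id, text) and returns, which is why uploads feel instant.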

Search Layer: Hybrid Retrieval -
Semantic search alone fails on exact phrases; keyword search alone fails on synonyms. Hybrid Search combines pgvector cosine distance with BM25 ranking via reciprocal rank fusion—a technique that elegantly merges different ranking philosophies. A chunk ranked #2 by vectors and #5 by keywords scores higher than one ranked #1 by vectors and #100 by keywords.
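
Reciprocal rank fusion is short enough to show in full; this sketch uses the conventional k=60 damping constant (not necessarily Docify's value):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items near the top of any list score highest."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused order of chunks from a vector ranking and a BM25 ranking
fused = reciprocal_rank_fusion([["c12", "c07", "c31"], ["c31", "c12", "c99"]])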

Ranking Layer: Multi-Factor Scoring -
Re-Ranking Service refines results using five factors: base relevance (40%), citation frequency (15%), recency (15%), specificity (15%), and source quality (15%). This produces the 5-10 final chunks sent to the LLM. Notably, it flags conflicting sources: if multiple documents contradict each other, the service signals this upstream.
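
The combination itself is a plain weighted sum; a sketch using the percentages above (factor values are assumed to be normalized to 0-1 upstream):

WEIGHTS = {
    "relevance": 0.40,       # base hybrid-search score
    "citations": 0.15,       # citation frequency
    "recency": 0.15,
    "specificity": 0.15,
    "source_quality": 0.15,
}

def rerank_score(factors: dict[str, float]) -> float:
    """Combine normalized factor scores into one ranking score."""
    return sum(weight * factors.get(name, 0.0) for name, weight in WEIGHTS.items())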

Context Layer: Token Budget Management -
LLMs have finite context windows. Context Assembly respects token budgets: a 2000-token default, split 60% primary sources (top-ranked chunks), 30% supporting context, 10% metadata. It truncates intelligently at sentence boundaries (never mid-sentence gibberish). Most questions need 2-3 high-quality sources, not 100.
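
A sketch of the budgeting idea, assuming the 60/30/10 split and sentence-boundary truncation described above (helper names are hypothetical):

import tiktoken

def fit_to_budget(text: str, max_tokens: int) -> str:
    """Truncate text to max_tokens, backing up to the last complete sentence."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    truncated = enc.decode(tokens[:max_tokens])
    return truncated.rsplit(".", 1)[0] + "."  # drop the trailing sentence fragment

def assemble_context(primary: str, supporting: str, metadata: str, budget: int = 2000) -> str:
    """Split the token budget 60/30/10 across primary, supporting, and metadata text."""
    return "\n\n".join([
        fit_to_budget(primary, int(budget * 0.60)),
        fit_to_budget(supporting, int(budget * 0.30)),
        fit_to_budget(metadata, int(budget * 0.10)),
    ])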

Prompt Layer: Anti-Hallucination Engineering -
Prompts enforce strict rules: "ONLY use provided context. ALWAYS cite sources [Source 1]. If unknown, say the information is not available. When sources conflict, present both sides." Source markers embedded in the context enable citation verification, making it tractable to validate claims post-generation.
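
A sketch of how such a prompt might be assembled; the wording paraphrases the rules above, and the template structure is an assumption:

SYSTEM_RULES = """You are a research assistant answering from a private knowledge base.
- ONLY use the provided context; do not rely on outside knowledge.
- ALWAYS cite sources using their markers, e.g. [Source 1].
- If the answer is not in the context, say the information is not available.
- When sources conflict, present both sides and cite each."""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Interleave numbered source markers with chunk text so citations can be verified later."""
    context = "\n\n".join(f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion: {question}"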

LLM Service: Provider Flexibility -
The LLM Service is provider-agnostic. A local Mistral 7B (4-bit quantized) served by Ollama is the default, with OpenAI/Anthropic support. Hardware auto-detection adjusts the configuration: GPU available? Accelerate. CPU-only? Extend timeouts. Low VRAM? Switch to a smaller model. Streaming is enabled by default for responsive UX.
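
One way to implement that kind of detection; the thresholds, model tags, and reliance on nvidia-smi are illustrative assumptions rather than Docify's actual logic:

import shutil
import subprocess

def detect_llm_config() -> dict:
    """Choose model and timeout based on whether a GPU (and how much VRAM) is present."""
    config = {"model": "mistral:7b-instruct-q4_0", "timeout_s": 300}  # CPU-only defaults
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        if out.returncode == 0 and out.stdout.strip():
            vram_mb = int(out.stdout.strip().splitlines()[0])
            config["timeout_s"] = 60          # GPU inference allows shorter timeouts
            if vram_mb < 6000:                # low VRAM: fall back to a smaller model
                config["model"] = "phi3:mini"
    return config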

Verification Layer: Citation Grounding -
LLMs fabricate sources. Citation Verification runs post-generation: extracts [Source N] references, searches for cited claims in source chunks, flags mismatches. Catches egregious errors like citing sources containing no relevant information. Not foolproof, but reduces hallucination significantly.
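
A sketch of a cheap lexical grounding check; the overlap heuristic and threshold are assumptions, and as noted it is not foolproof:

import re

def verify_citations(answer: str, chunks: list[str], min_overlap: float = 0.3) -> list[dict]:
    """Flag [Source N] citations whose sentence shares little vocabulary with the cited chunk."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for marker in re.findall(r"\[Source (\d+)\]", sentence):
            idx = int(marker) - 1
            if idx >= len(chunks):
                results.append({"source": int(marker), "ok": False, "reason": "no such source"})
                continue
            claim_words = set(re.findall(r"\w+", sentence.lower()))
            chunk_words = set(re.findall(r"\w+", chunks[idx].lower()))
            overlap = len(claim_words & chunk_words) / max(len(claim_words), 1)
            results.append({"source": int(marker), "ok": overlap >= min_overlap})
    return results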

Orchestration: Message Generation Pipeline -
Message Generation Service coordinates all services:

User Query → Query Expansion (3-5 variants) → Hybrid Search (20-30 candidates)
→ Re-Ranking (5-10 selected) → Context Assembly (token budgeting)
→ Prompt Engineering → LLM Call → Citation Verification → Response with metrics

The pipeline returns structured data: message content, source UUIDs, citations, verification results, and per-stage latencies.
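
That response could be modeled roughly like this (field names are assumptions, not the actual schema):

from dataclasses import dataclass, field
from uuid import UUID

@dataclass
class RAGResponse:
    content: str                      # generated answer text
    source_ids: list[UUID]            # chunks/resources used as context
    citations: list[dict]             # [Source N] -> chunk mappings
    verification: list[dict]          # per-citation grounding results
    latencies_ms: dict[str, float] = field(default_factory=dict)  # per-stage timings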

Database Design

Chunks table optimized for vector retrieval:

CREATE TABLE chunks (
  id UUID PRIMARY KEY,
  resource_id UUID REFERENCES resources(id),
  content TEXT,
  embedding vector(384),
  chunk_metadata JSONB
);
CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);

HNSW indexing enables approximate nearest-neighbor search in logarithmic time. For semantic search with millions of vectors, this speedup is essential.
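
This is the kind of query the index accelerates; a sketch using psycopg and pgvector's cosine-distance operator <=> (connection string assumed):

import psycopg

def nearest_chunks(query_embedding: list[float], limit: int = 20) -> list[tuple]:
    """Return the chunks closest to the query vector by cosine distance."""
    with psycopg.connect("postgresql://docify:docify@localhost:5432/docify") as conn:
        return conn.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(query_embedding), str(query_embedding), limit),
        ).fetchall()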

  • Resources table tracks documents with content_hash VARCHAR(64) UNIQUE (SHA-256) and an is_duplicate_of foreign key for deduplication.
  • Conversations & Messages maintain chat history with source tracking, citations as JSONB, and model metadata.
  • Workspaces enable personal/team/hybrid collaboration, with data isolation enforced via workspace_id in all queries.

API & Infrastructure

REST Endpoints (full documentation at /docs; a usage sketch follows the list):

  • POST /api/resources/upload - Upload documents
  • GET /api/resources/{id}/embedding-status - Poll async embedding progress
  • POST /api/conversations/{id}/messages - Triggers RAG pipeline
  • GET /api/conversations/{id}/export - Export as JSON/Markdown
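
A usage sketch with requests; the base URL, payload keys, and response fields are assumptions:

import time
import requests

BASE = "http://localhost:8000"   # assumed backend address
CONVERSATION_ID = "..."          # UUID of an existing conversation

# Upload a document; embeddings are generated asynchronously.
with open("paper.pdf", "rb") as f:
    resource = requests.post(f"{BASE}/api/resources/upload", files={"file": f}).json()

# Poll the async embedding status until it completes.
while requests.get(f"{BASE}/api/resources/{resource['id']}/embedding-status").json().get("status") != "completed":
    time.sleep(2)

# Send a message; this triggers the full RAG pipeline.
answer = requests.post(
    f"{BASE}/api/conversations/{CONVERSATION_ID}/messages",
    json={"content": "What does the paper conclude about retrieval quality?"},
).json()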

Docker Stack (7 services, docker-compose up):

  1. PostgreSQL (pgvector pre-loaded)
  2. Redis (cache + Celery broker)
  3. Ollama (local LLM)
  4. FastAPI backend
  5. Celery worker (async embeddings)
  6. Celery Beat (optional scheduled tasks)
  7. Vite frontend

Health checks ensure dependencies are ready before dependent services start. Models persist in Docker volumes (~2GB total).

Frontend

React 18 + TypeScript. React Query manages server state (caching, invalidation). Zustand for UI state. API client wrappers shield UI from streaming/polling complexity. Tailwind CSS for styling.

Performance & Design Patterns

Key Patterns:

  • Async-First: Embeddings/LLM happen async via Celery; API returns immediately
  • Content Dedup: SHA-256 hashing prevents re-processing identical documents regardless of source
  • Hybrid Search: Reciprocal rank fusion merges semantic + BM25 for robustness
  • Token-Aware Assembly: Respects context windows, prioritizes by relevance, truncates intelligently
  • Multi-Factor Ranking: Combines recency, specificity, source quality, usage history into unified ranking
  • Citation Verification: Validates LLM claims against source chunks post-generation
  • Hardware Adaptation: Auto-detects GPU/CPU/VRAM, adjusts timeouts and models accordingly

