Keshav Ashiya

Docify: Building a Production RAG System for Knowledge Management

Knowledge workers drown in information. We collect documents at scale—research papers, PDFs, articles, code—but can't retrieve or synthesize what we've gathered. Most solutions force a choice: keep data local and lose AI, or move to cloud and lose privacy. Docify dissolves this false binary through 11 specialized services orchestrated into a complete RAG pipeline.

Architecture: 11 Services, One Pipeline

Input Layer: Parsing & Chunking -
Resource Ingestion handles heterogeneous formats (PDF, DOCX, XLSX, Markdown, TXT). Deduplication Service computes SHA-256 hashes on raw content—preventing re-processing when the same research paper arrives from three sources. Chunking Service uses tiktoken for accurate token counting (512 tokens, 50-token overlap) while respecting paragraph boundaries and preserving section hierarchies.
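
To make the windowing concrete, here is a minimal sketch of token-based chunking with tiktoken (the encoding choice and function name are illustrative; the paragraph-boundary and section-hierarchy handling described above is omitted):

import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
    tokens = enc.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks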

Embedding Layer: Async Vector Generation -
Generating embeddings inline would block API responses, so Docify uses Celery + Redis to decouple them: uploads return immediately while workers process embeddings asynchronously. The Embeddings Service uses all-minilm:22m (384-dim, 22MB), aggressively lightweight compared to 768-dim models, yet sentence-transformers research shows minimal quality loss. Storage in PostgreSQL pgvector with HNSW indexing enables <200ms vector search across 10K documents.
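
A minimal sketch of the async pattern, assuming a Redis broker on its default port and Ollama's /api/embeddings endpoint (queue names, retries, and persistence are simplified):

import requests
from celery import Celery

# Broker URL is illustrative; Docify's actual configuration may differ.
app = Celery("docify", broker="redis://localhost:6379/0")

@app.task
def embed_chunk(chunk_id: str, text: str) -> list[float]:
    """Generate a 384-dim embedding for one chunk via the local Ollama server."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:22m", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    embedding = resp.json()["embedding"]
    # Writing the vector back to the chunks table (pgvector) is omitted here.
    return embedding

The API handler only enqueues embed_chunk.delay(chunk_id, text) and returns, which is why uploads feel instant.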

Search Layer: Hybrid Retrieval -
Semantic search alone fails on exact phrases; keyword search alone fails on synonyms. Hybrid Search combines pgvector cosine distance with BM25 ranking via reciprocal rank fusion—a technique that elegantly merges different ranking philosophies. A chunk ranked #2 by vectors and #5 by keywords scores higher than one ranked #1 by vectors and #100 by keywords.
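
Reciprocal rank fusion is short enough to show in full; this sketch uses the conventional k=60 damping constant (not necessarily Docify's value):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items near the top of any list score highest."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused order of chunks from a vector ranking and a BM25 ranking
fused = reciprocal_rank_fusion([["c12", "c07", "c31"], ["c31", "c12", "c99"]])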

Ranking Layer: Multi-Factor Scoring -
Re-Ranking Service refines results using five factors: base relevance (40%), citation frequency (15%), recency (15%), specificity (15%), and source quality (15%). This produces the 5-10 final chunks sent to the LLM. Notably, it flags conflicting sources: if multiple documents contradict each other, the service signals this upstream.
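
The combination itself is a plain weighted sum; a sketch using the percentages above (factor values are assumed to be normalized to 0-1 upstream):

WEIGHTS = {
    "relevance": 0.40,       # base hybrid-search score
    "citations": 0.15,       # citation frequency
    "recency": 0.15,
    "specificity": 0.15,
    "source_quality": 0.15,
}

def rerank_score(factors: dict[str, float]) -> float:
    """Combine normalized factor scores into one ranking score."""
    return sum(weight * factors.get(name, 0.0) for name, weight in WEIGHTS.items())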

Context Layer: Token Budget Management -
LLMs have finite context windows. Context Assembly respects token budgets: a 2000-token default, split 60% primary sources (top-ranked chunks), 30% supporting context, 10% metadata. It truncates intelligently at sentence boundaries (never mid-sentence gibberish). Most questions need 2-3 high-quality sources, not 100.
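
A sketch of the budgeting idea, assuming the 60/30/10 split and sentence-boundary truncation described above (helper names are hypothetical):

import tiktoken

def fit_to_budget(text: str, max_tokens: int) -> str:
    """Truncate text to max_tokens, backing up to the last complete sentence."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    truncated = enc.decode(tokens[:max_tokens])
    return truncated.rsplit(".", 1)[0] + "."  # drop the trailing sentence fragment

def assemble_context(primary: str, supporting: str, metadata: str, budget: int = 2000) -> str:
    """Split the token budget 60/30/10 across primary, supporting, and metadata text."""
    return "\n\n".join([
        fit_to_budget(primary, int(budget * 0.60)),
        fit_to_budget(supporting, int(budget * 0.30)),
        fit_to_budget(metadata, int(budget * 0.10)),
    ])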

Prompt Layer: Anti-Hallucination Engineering -
Prompts enforce strict rules: "ONLY use provided context. ALWAYS cite sources [Source 1]. If unknown, say the information is not available. When sources conflict, present both sides." Source markers embedded in the context enable citation verification, making it tractable to validate claims post-generation.
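
A sketch of how such a prompt might be assembled; the wording paraphrases the rules above, and the template structure is an assumption:

SYSTEM_RULES = """You are a research assistant answering from a private knowledge base.
- ONLY use the provided context; do not rely on outside knowledge.
- ALWAYS cite sources using their markers, e.g. [Source 1].
- If the answer is not in the context, say the information is not available.
- When sources conflict, present both sides and cite each."""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Interleave numbered source markers with chunk text so citations can be verified later."""
    context = "\n\n".join(f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion: {question}"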

LLM Service: Provider Flexibility -
The LLM Service is provider-agnostic. A local Mistral 7B (4-bit quantized) served by Ollama is the default, with OpenAI/Anthropic support. Hardware auto-detection adjusts the configuration: GPU available? Accelerate. CPU-only? Extend timeouts. Low VRAM? Switch to a smaller model. Streaming is enabled by default for responsive UX.
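
One way to implement that kind of detection; the thresholds, model tags, and reliance on nvidia-smi are illustrative assumptions rather than Docify's actual logic:

import shutil
import subprocess

def detect_llm_config() -> dict:
    """Choose model and timeout based on whether a GPU (and how much VRAM) is present."""
    config = {"model": "mistral:7b-instruct-q4_0", "timeout_s": 300}  # CPU-only defaults
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        if out.returncode == 0 and out.stdout.strip():
            vram_mb = int(out.stdout.strip().splitlines()[0])
            config["timeout_s"] = 60          # GPU inference allows shorter timeouts
            if vram_mb < 6000:                # low VRAM: fall back to a smaller model
                config["model"] = "phi3:mini"
    return config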

Verification Layer: Citation Grounding -
LLMs fabricate sources. Citation Verification runs post-generation: extracts [Source N] references, searches for cited claims in source chunks, flags mismatches. Catches egregious errors like citing sources containing no relevant information. Not foolproof, but reduces hallucination significantly.
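
A sketch of a cheap lexical grounding check; the overlap heuristic and threshold are assumptions, and as noted it is not foolproof:

import re

def verify_citations(answer: str, chunks: list[str], min_overlap: float = 0.3) -> list[dict]:
    """Flag [Source N] citations whose sentence shares little vocabulary with the cited chunk."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for marker in re.findall(r"\[Source (\d+)\]", sentence):
            idx = int(marker) - 1
            if idx >= len(chunks):
                results.append({"source": int(marker), "ok": False, "reason": "no such source"})
                continue
            claim_words = set(re.findall(r"\w+", sentence.lower()))
            chunk_words = set(re.findall(r"\w+", chunks[idx].lower()))
            overlap = len(claim_words & chunk_words) / max(len(claim_words), 1)
            results.append({"source": int(marker), "ok": overlap >= min_overlap})
    return results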

Orchestration: Message Generation Pipeline -
Message Generation Service coordinates all services:

User Query → Query Expansion (3-5 variants) → Hybrid Search (20-30 candidates)
→ Re-Ranking (5-10 selected) → Context Assembly (token budgeting)
→ Prompt Engineering → LLM Call → Citation Verification → Response with metrics

The pipeline returns structured data: message content, source UUIDs, citations, verification results, and per-stage latencies.
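
That response could be modeled roughly like this (field names are assumptions, not the actual schema):

from dataclasses import dataclass, field
from uuid import UUID

@dataclass
class RAGResponse:
    content: str                      # generated answer text
    source_ids: list[UUID]            # chunks/resources used as context
    citations: list[dict]             # [Source N] -> chunk mappings
    verification: list[dict]          # per-citation grounding results
    latencies_ms: dict[str, float] = field(default_factory=dict)  # per-stage timings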

Database Design

Chunks table optimized for vector retrieval:

CREATE TABLE chunks (
  id UUID PRIMARY KEY,
  resource_id UUID REFERENCES resources(id),
  content TEXT,
  embedding vector(384),
  chunk_metadata JSONB
);
CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);

HNSW indexing enables approximate nearest-neighbor search in logarithmic time. For semantic search with millions of vectors, this speedup is essential.
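
This is the kind of query the index accelerates; a sketch using psycopg and pgvector's cosine-distance operator <=> (connection string assumed):

import psycopg

def nearest_chunks(query_embedding: list[float], limit: int = 20) -> list[tuple]:
    """Return the chunks closest to the query vector by cosine distance."""
    with psycopg.connect("postgresql://docify:docify@localhost:5432/docify") as conn:
        return conn.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(query_embedding), str(query_embedding), limit),
        ).fetchall()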

  • Resources table tracks documents with content_hash VARCHAR(64) UNIQUE (SHA-256) and an is_duplicate_of foreign key for deduplication.
  • Conversations & Messages maintain chat history with source tracking, citations as JSONB, and model metadata.
  • Workspaces enable personal/team/hybrid collaboration, with data isolation enforced via workspace_id in all queries.

API & Infrastructure

REST Endpoints (full documentation at /docs; a usage sketch follows the list):

  • POST /api/resources/upload - Upload documents
  • GET /api/resources/{id}/embedding-status - Poll async embedding progress
  • POST /api/conversations/{id}/messages - Triggers RAG pipeline
  • GET /api/conversations/{id}/export - Export as JSON/Markdown
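
A usage sketch with requests; the base URL, payload keys, and response fields are assumptions:

import time
import requests

BASE = "http://localhost:8000"   # assumed backend address
CONVERSATION_ID = "..."          # UUID of an existing conversation

# Upload a document; embeddings are generated asynchronously.
with open("paper.pdf", "rb") as f:
    resource = requests.post(f"{BASE}/api/resources/upload", files={"file": f}).json()

# Poll the async embedding status until it completes.
while requests.get(f"{BASE}/api/resources/{resource['id']}/embedding-status").json().get("status") != "completed":
    time.sleep(2)

# Send a message; this triggers the full RAG pipeline.
answer = requests.post(
    f"{BASE}/api/conversations/{CONVERSATION_ID}/messages",
    json={"content": "What does the paper conclude about retrieval quality?"},
).json()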

Docker Stack (7 services, docker-compose up):

  1. PostgreSQL (pgvector pre-loaded)
  2. Redis (cache + Celery broker)
  3. Ollama (local LLM)
  4. FastAPI backend
  5. Celery worker (async embeddings)
  6. Celery Beat (optional scheduled tasks)
  7. Vite frontend

Health checks ensure dependencies are ready before dependent services start. Models persist in Docker volumes (~2GB total).

Frontend

React 18 + TypeScript. React Query manages server state (caching, invalidation). Zustand for UI state. API client wrappers shield UI from streaming/polling complexity. Tailwind CSS for styling.

Performance & Design Patterns

Key Patterns:

  • Async-First: Embeddings/LLM happen async via Celery; API returns immediately
  • Content Dedup: SHA-256 hashing prevents re-processing identical documents regardless of source
  • Hybrid Search: Reciprocal rank fusion merges semantic + BM25 for robustness
  • Token-Aware Assembly: Respects context windows, prioritizes by relevance, truncates intelligently
  • Multi-Factor Ranking: Combines recency, specificity, source quality, usage history into unified ranking
  • Citation Verification: Validates LLM claims against source chunks post-generation
  • Hardware Adaptation: Auto-detects GPU/CPU/VRAM, adjusts timeouts and models accordingly

