Knowledge workers drown in information. We collect documents at scale—research papers, PDFs, articles, code—but can't retrieve or synthesize what we've gathered. Most solutions force a choice: keep data local and lose AI, or move to cloud and lose privacy. Docify dissolves this false binary through 11 specialized services orchestrated into a complete RAG pipeline.
Architecture: 11 Services, One Pipeline
Input Layer: Parsing & Chunking
Resource Ingestion handles heterogeneous formats (PDF, DOCX, XLSX, Markdown, TXT). Deduplication Service computes SHA-256 hashes on raw content—preventing re-processing when the same research paper arrives from three sources. Chunking Service uses tiktoken for accurate token counting (512 tokens, 50-token overlap) while respecting paragraph boundaries and preserving section hierarchies.
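A minimal sketch of the hashing and chunking steps, assuming tiktoken's cl100k_base encoding; the function names are illustrative, and paragraph-boundary and section-hierarchy handling is omitted for brevity:

```python
import hashlib
import tiktoken

def content_hash(raw: bytes) -> str:
    """SHA-256 over raw content, as used by the Deduplication Service."""
    return hashlib.sha256(raw).hexdigest()

def chunk_text(text: str, chunk_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text into ~512-token chunks with a 50-token overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```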
Embedding Layer: Async Vector Generation
Embedding generation is slow enough to block API responses, so Docify uses Celery + Redis to decouple it: uploads return immediately while workers process embeddings asynchronously. The Embeddings Service uses all-minilm:22m (384-dim, 22 MB), aggressively lightweight compared to 768-dim models, though sentence-transformers research shows minimal quality loss. Storage in PostgreSQL with pgvector and HNSW indexing enables <200ms vector search across 10K documents.
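A sketch of the decoupling pattern, assuming a Celery app backed by Redis and Ollama's /api/embeddings endpoint; the broker URL, task name, and persistence helper are illustrative, not Docify's actual code:

```python
import requests
from celery import Celery

app = Celery("docify", broker="redis://localhost:6379/0")  # assumed broker URL

def save_embedding(chunk_id: str, embedding: list[float]) -> None:
    """Placeholder: write the vector into the chunk's pgvector column."""
    ...

@app.task
def embed_chunk(chunk_id: str, text: str) -> None:
    """Generate an embedding for one chunk, off the request path."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",            # Ollama default port
        json={"model": "all-minilm:22m", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    save_embedding(chunk_id, resp.json()["embedding"])       # 384-dim vector
```

The upload handler only needs to call `embed_chunk.delay(chunk_id, text)` and can return to the client immediately.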
Search Layer: Hybrid Retrieval
Semantic search alone fails on exact phrases; keyword search alone fails on synonyms. Hybrid Search combines pgvector cosine distance with BM25 ranking via reciprocal rank fusion—a technique that elegantly merges different ranking philosophies. A chunk ranked #2 by vectors and #5 by keywords scores higher than one ranked #1 by vectors and #100 by keywords.
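A sketch of reciprocal rank fusion over two ranked lists of chunk IDs; the constant k=60 comes from the original RRF paper and is not necessarily what Docify uses:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" is #2 by vectors and #5 by keywords; it outranks "a", which only vectors liked.
vector_hits = ["a", "b", "c"]
keyword_hits = ["d", "e", "f", "g", "b"]
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])  # ["b", "a", "d", ...]
```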
Ranking Layer: Multi-Factor Scoring
Re-Ranking Service refines results using five weighted factors: base relevance (40%), citation frequency (15%), recency (15%), specificity (15%), and source quality (15%). This produces the 5-10 final chunks sent to the LLM. Notably, it also flags conflicting sources: if multiple documents contradict each other, the service signals this upstream.
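A sketch of the weighted combination; only the weights come from the description above, and the per-factor scores are assumed to be normalized to the 0-1 range upstream:

```python
WEIGHTS = {
    "relevance": 0.40,        # base relevance from hybrid search
    "citations": 0.15,        # how often the chunk has been cited before
    "recency": 0.15,
    "specificity": 0.15,
    "source_quality": 0.15,
}

def rerank_score(factors: dict[str, float]) -> float:
    """Combine normalized 0-1 factor scores into one ranking score."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

# top_chunks = sorted(candidates, key=lambda c: rerank_score(c.factors), reverse=True)[:10]
```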
Context Layer: Token Budget Management
LLMs have finite context windows, so Context Assembly respects token budgets: a 2000-token default budget split into 60% primary sources (top-ranked chunks), 30% supporting context, and 10% metadata. Truncation happens at sentence boundaries, never mid-sentence. Most questions need 2-3 high-quality sources, not 100.
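A sketch of the 60/30/10 split, assuming a tiktoken-based counter; the sentence-boundary truncation is simplified to "only keep whole chunks that fit":

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")  # assumed encoding

def fit_to_budget(texts: list[str], budget: int) -> list[str]:
    """Greedily keep whole texts until the token budget would be exceeded."""
    kept, used = [], 0
    for text in texts:
        n = len(ENC.encode(text))
        if used + n > budget:
            break
        kept.append(text)
        used += n
    return kept

def assemble_context(primary: list[str], supporting: list[str], metadata: list[str],
                     total_budget: int = 2000) -> str:
    parts = (
        fit_to_budget(primary, int(total_budget * 0.6)) +      # top-ranked chunks
        fit_to_budget(supporting, int(total_budget * 0.3)) +   # supporting context
        fit_to_budget(metadata, int(total_budget * 0.1))       # titles, dates, etc.
    )
    return "\n\n".join(parts)
```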
Prompt Layer: Anti-Hallucination Engineering
Prompts enforce strict rules: "ONLY use provided context. ALWAYS cite sources [Source 1]. If unknown, say not available. When sources conflict, present both sides." Source markers embedded in the context enable citation verification, making it tractable to validate claims post-generation.
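A sketch of how such a grounded prompt might be assembled; the system-prompt wording paraphrases the rules quoted above rather than reproducing Docify's exact template:

```python
SYSTEM_PROMPT = """You are a research assistant.
Rules:
- ONLY use the provided context to answer.
- ALWAYS cite sources using [Source N] markers.
- If the answer is not in the context, say the information is not available.
- When sources conflict, present both sides."""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Prefix each chunk with a source marker so citations can be verified later."""
    context = "\n\n".join(
        f"[Source {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```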
LLM Service: Provider Flexibility
The architecture is provider-agnostic. A local Mistral 7B (4-bit quantized) served by Ollama is the default, with OpenAI and Anthropic support. Hardware auto-detection adjusts behavior: GPU available? Accelerate. CPU-only? Extend timeouts. Low VRAM? Switch models. Streaming is enabled by default for a responsive UX.
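A sketch of the hardware-adaptation idea, assuming PyTorch is available for GPU detection; the VRAM threshold, timeouts, and model tags are illustrative, not Docify's actual values:

```python
import torch

def pick_llm_config() -> dict:
    """Choose model and timeout based on detected hardware."""
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb < 6:                                        # illustrative threshold
            return {"model": "mistral:7b-instruct-q4_0", "timeout": 120}
        return {"model": "mistral:7b-instruct", "timeout": 60}
    return {"model": "mistral:7b-instruct-q4_0", "timeout": 300}  # CPU-only: longer timeout
```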
Verification Layer: Citation Grounding
LLMs fabricate sources. Citation Verification runs post-generation: it extracts [Source N] references, searches for the cited claims in the source chunks, and flags mismatches. This catches egregious errors such as citing a source that contains no relevant information. It is not foolproof, but it reduces hallucinations significantly.
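A sketch of post-generation citation checking; the word-overlap heuristic is a stand-in for whatever matching Docify actually performs:

```python
import re

def verify_citations(answer: str, sources: dict[int, str]) -> list[int]:
    """Return cited source numbers whose chunks show no overlap with the answer."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    answer_words = set(re.findall(r"\w+", answer.lower()))
    suspicious = []
    for n in cited:
        chunk = sources.get(n, "")
        chunk_words = set(re.findall(r"\w+", chunk.lower()))
        if not chunk or len(answer_words & chunk_words) < 5:   # crude grounding check
            suspicious.append(n)
    return suspicious
```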
Orchestration: Message Generation Pipeline
Message Generation Service coordinates all services:
```
User Query → Query Expansion (3-5 variants) → Hybrid Search (20-30 candidates)
  → Re-Ranking (5-10 selected) → Context Assembly (token budgeting)
  → Prompt Engineering → LLM Call → Citation Verification → Response with metrics
```
Returns structured data: message content, source UUIDs, citations, verification results, pipeline latencies.
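A sketch of that structured response; the field names are inferred from the list above, not taken from Docify's schema:

```python
from dataclasses import dataclass, field

@dataclass
class RAGResponse:
    content: str                                                   # generated answer
    source_ids: list[str] = field(default_factory=list)           # UUIDs of cited resources
    citations: list[dict] = field(default_factory=list)           # [Source N] -> chunk mapping
    verification: dict = field(default_factory=dict)              # citation-check results
    latencies_ms: dict[str, float] = field(default_factory=dict)  # pipeline stage timings
```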
Database Design
Chunks table optimized for vector retrieval:
```sql
CREATE TABLE chunks (
    id UUID PRIMARY KEY,
    resource_id UUID REFERENCES resources,
    content TEXT,
    embedding vector(384),
    chunk_metadata JSONB
);

CREATE INDEX idx_chunks_embedding
    ON chunks USING hnsw (embedding vector_cosine_ops);
```
HNSW indexing enables approximate nearest-neighbor search in logarithmic time. For semantic search with millions of vectors, this speedup is essential.
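A sketch of the vector search this index serves, assuming psycopg 3 and passing the query embedding as a vector literal; the DSN and function name are illustrative, while the table and column names follow the DDL above:

```python
import psycopg

def nearest_chunks(query_embedding: list[float], limit: int = 20) -> list[tuple]:
    """Return the closest chunks by cosine distance (<=> is pgvector's operator)."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect("dbname=docify") as conn:                 # assumed DSN
        return conn.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, limit),
        ).fetchall()
```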
The resources table tracks documents with a content_hash VARCHAR(64) UNIQUE column (SHA-256 hex digest) and an is_duplicate_of foreign key for deduplication.
Conversations and messages tables maintain chat history with source tracking, citations stored as JSONB, and model metadata.
Workspaces enable personal, team, and hybrid collaboration, with data isolation enforced via a workspace_id filter in all queries.
API & Infrastructure
REST Endpoints (full documentation at /docs):
- POST /api/resources/upload - Upload documents
- GET /api/resources/{id}/embedding-status - Poll async embedding progress
- POST /api/conversations/{id}/messages - Triggers the RAG pipeline
- GET /api/conversations/{id}/export - Export as JSON/Markdown
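A sketch of the upload-then-poll flow against these endpoints, using requests; the backend address, payload shape, and response field names are assumptions:

```python
import time
import requests

BASE = "http://localhost:8000"  # assumed backend address

# Upload a document; the response is assumed to include the new resource's id.
with open("paper.pdf", "rb") as f:
    resource = requests.post(f"{BASE}/api/resources/upload", files={"file": f}).json()

# Poll until the async embedding job finishes (status field name assumed).
while True:
    status = requests.get(
        f"{BASE}/api/resources/{resource['id']}/embedding-status"
    ).json()
    if status.get("status") == "completed":
        break
    time.sleep(2)
```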
Docker Stack (7 services, docker-compose up):
- PostgreSQL (pgvector pre-loaded)
- Redis (cache + Celery broker)
- Ollama (local LLM)
- FastAPI backend
- Celery worker (async embeddings)
- Celery Beat (optional scheduled tasks)
- Vite frontend
Health checks ensure dependencies are ready before dependent services start. Models persist in volumes (~2GB total).
Frontend
React 18 + TypeScript. React Query manages server state (caching, invalidation). Zustand for UI state. API client wrappers shield UI from streaming/polling complexity. Tailwind CSS for styling.
Performance & Design Patterns
Key Patterns:
- Async-First: Embeddings/LLM happen async via Celery; API returns immediately
- Content Dedup: SHA-256 hashing prevents re-processing identical documents regardless of source
- Hybrid Search: Reciprocal rank fusion merges semantic + BM25 for robustness
- Token-Aware Assembly: Respects context windows, prioritizes by relevance, truncates intelligently
- Multi-Factor Ranking: Combines recency, specificity, source quality, usage history into unified ranking
- Citation Verification: Validates LLM claims against source chunks post-generation
- Hardware Adaptation: Auto-detects GPU/CPU/VRAM, adjusts timeouts and models accordingly