Anil Nayak

Posted on Jun 1

Building Nexus: An Enterprise-Grade RAG & LLMOps Engine from Scratch

#ai #webdev #mlops #rag

Building a basic Retrieval-Augmented Generation (RAG) prototype is a weekend project. You pip install an orchestration library, load a small text file, and throw raw strings at the OpenAI API.

But taking that prototype into production is an entirely different engineering challenge.

In a real-world enterprise environment, naive LLM integrations quickly break down because of three operational problems:

Unpredictable API token burn
High inference latency
The business risk of silent hallucinations

To solve these specific bottlenecks, I built Nexus Knowledge Engine, a secure, fully containerized, production-ready enterprise RAG and LLMOps platform designed around strict retrieval quality gates, high-performance database indexing, and deep system reliability.

Here is a deep dive into the architecture, design trade-offs, and engineering metrics behind the project.

💻 Enterprise Tech Stack

Core Backend

FastAPI (Python 3.11)
Uvicorn
Asyncpg
Pydantic v2
Pydantic Settings

Vector & Relational Storage

PostgreSQL 16
pgvector extension

Caching Layer

Redis 7 (Alpine)

ML & Embeddings

sentence-transformers
all-MiniLM-L6-v2 (384-dimensional)

Orchestration

LangChain
OpenAI gpt-4o-mini

Infrastructure

Docker
Docker Compose
Multi-stage production builds

CI/CD & Testing

GitHub Actions
Pytest
Pytest-cov
Ruff

Telemetry & Observability

MLflow

🛠️ Deep Dive: Architecture & Engineering Decisions

1. High-Performance Document Ingestion & Storage

When a document is uploaded, processing it naively can freeze a web server’s event loop. Nexus handles ingestion using structured, memory-efficient processing patterns.

Processing Pipeline

Raw document extraction via PyMuPDF
Aggressive text normalization & cleaning
Custom Sliding Window Chunking
- Chunk Size: 1000
- Overlap: 200

This preserves semantic continuity between chunks while improving retrieval quality.
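As a rough sketch of the sliding-window step, the function below splits text into 1000-unit windows that overlap by 200. It assumes the sizes are measured in characters (the post does not say whether they are characters or tokens), and `chunk_text` is a hypothetical name, not the project's actual function.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows (sizes in characters, an assumption)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # with the defaults, each window starts 800 later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, the last 200 characters of one chunk are repeated as the first 200 of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.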

High-Speed Vector Search with HNSW

Instead of a traditional linear scan, which compares the query against every stored vector in O(N) time, Nexus implements a Hierarchical Navigable Small World (HNSW) index directly inside PostgreSQL using pgvector.

HNSW is an approximate nearest-neighbor index: it trades a small amount of recall for search cost that grows roughly as O(log N), allowing millions of embeddings to be queried with sub-second latency.
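For reference, this is roughly what the index and query could look like. The `USING hnsw`, `vector_cosine_ops`, and `<=>` (cosine distance) syntax comes from pgvector; the `chunks` table, its columns, and the `m` / `ef_construction` values (pgvector's documented defaults) are assumptions rather than the project's actual schema.

```python
# Hypothetical schema: chunks(id, content, embedding vector(384)).
CREATE_HNSW_INDEX = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

# <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
# $1 is the query embedding in text form, $2 is the number of rows to return.
TOP_K_QUERY = """
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;
"""


def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

The `ORDER BY embedding <=> ...` expression has to match the operator class of the index (`vector_cosine_ops` here) for the planner to use HNSW instead of a sequential scan.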

Idempotent Upload Design

Uploads are fully idempotent using:

ON CONFLICT (filename) DO UPDATE

This prevents duplicate vector insertion while maintaining clean reference mapping.
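The upsert behavior can be tried out locally with the sketch below. SQLite (3.24 or newer) stands in for PostgreSQL so the example runs without a server; the `ON CONFLICT ... DO UPDATE` clause has the same shape in both. The `documents` table and its `checksum` column are hypothetical.

```python
import sqlite3

# Same clause shape as the PostgreSQL statement; `excluded` refers to the
# row that was proposed for insertion.
UPSERT = """
INSERT INTO documents (filename, checksum)
VALUES (?, ?)
ON CONFLICT (filename) DO UPDATE SET checksum = excluded.checksum
"""


def upload(conn: sqlite3.Connection, filename: str, checksum: str) -> None:
    """Insert a document row, or update it if the filename already exists."""
    conn.execute(UPSERT, (filename, checksum))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (filename TEXT PRIMARY KEY, checksum TEXT)")
upload(conn, "report.pdf", "abc123")
upload(conn, "report.pdf", "def456")  # re-upload: updates, does not duplicate
rows = conn.execute("SELECT filename, checksum FROM documents").fetchall()
```

After both calls the table holds a single row for `report.pdf`. In a design like this, the chunks and vectors belonging to the old version would presumably be replaced in the same transaction, though the post does not describe that step.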

⚡ 2. Cost & Latency Optimization with Two-Level Semantic Caching

LLM inference costs scale dangerously fast in production systems.

To reduce redundant token usage, Nexus implements a dual-layer Redis cache.

Level 1 — Exact Match Cache

Incoming prompts are hashed instantly.

If an identical request already exists, the response is returned directly from Redis, typically in single-digit milliseconds.
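A minimal sketch of the exact-match lookup, assuming SHA-256 as the hash and whitespace collapsing as the only normalization (the post specifies neither). A plain dict stands in for Redis GET/SET, and the key prefix and sample answer are made up for illustration.

```python
import hashlib


def exact_cache_key(prompt: str, prefix: str = "nexus:exact:") -> str:
    """Derive a deterministic cache key from the prompt text."""
    normalized = " ".join(prompt.split())  # collapse whitespace (an assumption)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return prefix + digest


# A plain dict stands in for Redis in this sketch.
cache: dict[str, str] = {}
cache[exact_cache_key("What is our refund policy?")] = "Refunds within 30 days."

hit = cache.get(exact_cache_key("What is  our refund policy? "))
miss = cache.get(exact_cache_key("What is our return policy?"))
```

Only byte-identical prompts (after normalization) produce the same key, which is why a second, similarity-based level is needed for paraphrased questions.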

Level 2 — Semantic Cache

If an exact cache miss occurs:

The incoming query is converted into an embedding
Cosine similarity is computed against historical prompts
If the similarity is ≥ 0.95, the cached answer is reused.
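The semantic level can be sketched as a nearest-prompt search with a cutoff. This version scans a list of `(embedding, answer)` pairs in plain Python; how Nexus actually stores and searches historical prompt embeddings is not described in the post, so the data layout and function names here are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query_vec, entries, threshold=0.95):
    """Return the answer cached for the most similar prompt, or None.

    entries is a list of (embedding, answer) pairs.
    """
    best_score, best_answer = -1.0, None
    for vec, answer in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None
```

The threshold is the main tuning knob: lowering it raises the hit rate but increases the chance of returning an answer to a question that only looks similar.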

Result

Near-zero redundant token costs
Lower inference latency
Reduced OpenAI API burn

🛡️ 3. Retrieval Gate — Eliminating Hallucinations

LLMs generate responses from whatever context they receive — even irrelevant context.

Nexus prevents hallucinations using a strict Retrieval Confidence Gate.

How It Works

If the top retrieved chunks from pgvector fall below a confidence score of 0.5:

The LLM call is blocked immediately
The API returns a secure fallback response
A gate-block event is logged to MLflow telemetry

This keeps the model from answering when retrieval finds no sufficiently relevant context, which is where unsupported answers are most likely to be fabricated.
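A sketch of the gate logic, assuming "confidence" means the cosine similarity of the best-matching chunk and that the gate compares the single highest score against 0.5. The fallback wording, the `GateDecision` type, and the `call_llm` callable are all hypothetical.

```python
from dataclasses import dataclass

FALLBACK = "I could not find this in the uploaded documents."  # hypothetical text


@dataclass
class GateDecision:
    allowed: bool
    top_score: float


def retrieval_gate(similarities: list[float], threshold: float = 0.5) -> GateDecision:
    """Allow the LLM call only if the best chunk similarity reaches the threshold."""
    top = max(similarities, default=0.0)  # no chunks at all counts as a score of 0
    return GateDecision(allowed=top >= threshold, top_score=top)


def answer(question: str, similarities: list[float], call_llm) -> str:
    """Return the LLM answer, or the fallback if the gate blocks the call."""
    decision = retrieval_gate(similarities)
    if not decision.allowed:
        # This is where a gate-block event would be logged to telemetry.
        return FALLBACK
    return call_llm(question)
```

Because the check runs before any tokens are sent, a blocked query costs no LLM call at all.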

🔄 4. Fault Tolerance & Graceful Degradation

Production systems must survive partial failures.

Nexus is engineered to degrade gracefully instead of crashing catastrophically.

Failure Handling Strategies

No OpenAI API Key?

The service automatically falls back to an integrated mock LLM stub intended for demonstrations.

MLflow Offline?

A socket-based health check detects the outage, and telemetry logging is skipped safely
Core API remains operational

Redis Failure?

Treated as a standard cache miss
Retrieval falls back to PostgreSQL vector search
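Two of these strategies are small enough to sketch with the standard library: a TCP probe of the kind a socket-based health check might use, and a cache read that turns any cache failure into a miss. The function names and the `BrokenCache` stand-in are hypothetical; a real implementation would more likely catch the Redis client's own exception types than a bare `Exception`.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Cheap TCP probe: True if something accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host could not be resolved
        return False


def safe_cache_get(cache, key: str):
    """Read from the cache, treating any cache failure as a miss."""
    try:
        return cache.get(key)
    except Exception:
        return None  # caller proceeds to retrieval as if nothing was cached


class BrokenCache:
    """Stand-in for a Redis client whose server is down."""

    def get(self, key: str):
        raise ConnectionError("redis unavailable")
```

With this shape, a call such as `safe_cache_get(BrokenCache(), "k")` returns `None`, so the request path is the same as for an ordinary cache miss.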

🧪 CI/CD & MLOps Quality Assurance

AI infrastructure should follow the same engineering rigor as traditional microservices.

The repository includes a fully automated GitHub Actions pipeline.

CI Pipeline Includes

✅ Ruff Linting

Static analysis & code quality enforcement.

✅ Pytest Test Suite

Covers:

Health endpoints
Upload workflows
Query APIs
Metrics endpoints
Cache behavior
Retrieval gates

✅ Coverage Enforcement

80%+ minimum coverage required

✅ Golden Dataset Evaluation

Before builds pass CI:

A 15-case evaluation matrix runs automatically
Retrieval quality metrics are validated
Prompt or chunking regressions fail the build instantly
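A golden-dataset check of this kind can be sketched as a hit-rate calculation with a minimum threshold. The post does not say which retrieval metrics are validated, so the metric (whether the expected source document was retrieved), the 0.9 threshold, and the sample cases below are all assumptions.

```python
from typing import Callable

# Hypothetical golden cases: a question and the document expected to be retrieved.
GOLDEN_CASES = [
    {"question": "How long is the warranty?", "expected_source": "warranty.pdf"},
    {"question": "Who approves travel expenses?", "expected_source": "travel.pdf"},
]


def hit_rate(cases: list[dict], retrieve: Callable[[str], list[str]]) -> float:
    """Fraction of cases whose expected source appears in the retrieved sources."""
    hits = sum(1 for c in cases if c["expected_source"] in retrieve(c["question"]))
    return hits / len(cases) if cases else 0.0


def check_regression(cases, retrieve, min_hit_rate: float = 0.9) -> None:
    """Raise AssertionError (failing a pytest run) if retrieval quality drops."""
    rate = hit_rate(cases, retrieve)
    assert rate >= min_hit_rate, f"hit rate {rate:.2f} below {min_hit_rate:.2f}"
```

Calling `check_regression` from a pytest test makes a chunking or prompt change that hurts retrieval fail the pipeline like any other broken test.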

📊 Engineering Metrics

Metric	Value
Core Test Coverage	90%
Docker Build Time	~3 minutes
Production Image Size	~400MB
Embedding Model	all-MiniLM-L6-v2
Vector Search Engine	pgvector + HNSW
Cache Layer	Redis 7
Observability	MLflow

📈 Observability & Metrics

Nexus exposes a custom /metrics endpoint providing:

Query volume
Average retrieval confidence
Cache hit ratios
Hallucination gate frequency
System health telemetry

This enables production-level monitoring and operational visibility.
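The aggregation behind such an endpoint might look like the sketch below: in-memory counters updated per query and a snapshot that a handler could return as JSON. The class, field names, and payload keys are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class QueryMetrics:
    """In-memory counters; a real service would likely persist or export these."""
    queries: int = 0
    cache_hits: int = 0
    gate_blocks: int = 0
    confidence_sum: float = 0.0

    def record(self, confidence: float, cache_hit: bool, gate_blocked: bool) -> None:
        """Update the counters for one handled query."""
        self.queries += 1
        self.cache_hits += int(cache_hit)
        self.gate_blocks += int(gate_blocked)
        self.confidence_sum += confidence

    def snapshot(self) -> dict:
        """Payload a /metrics handler could return as JSON."""
        n = self.queries
        return {
            "query_volume": n,
            "avg_retrieval_confidence": self.confidence_sum / n if n else 0.0,
            "cache_hit_ratio": self.cache_hits / n if n else 0.0,
            "gate_block_count": self.gate_blocks,
        }
```

Ratios are computed at read time from raw counts, so the snapshot stays well defined even before the first query arrives.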

💡 Why I Chose pgvector over Dedicated Vector Databases

One of the most important architectural decisions was choosing pgvector instead of standalone vector databases like:

Pinecone
Weaviate
Qdrant

Why?

Dedicated vector databases introduce:

Additional infrastructure complexity
Network hop latency
Vendor lock-in
Separate operational overhead

Using pgvector allows:

Relational metadata
User permissions
Upload records
Embeddings

to exist inside the same ACID-compliant PostgreSQL layer.

This dramatically simplifies production operations while preserving full SQL power.

📂 Project Structure & Scaling

The platform is fully structured for production scaling with:

Multi-stage Docker builds
Modular FastAPI architecture
Async database handling
Service isolation
CI automation
Observability hooks
Production-ready deployment patterns

🔗 Repository

GitHub Repository:
https://github.com/Anilnayak126/nexus-core-rag.git

🎯 Final Thoughts

Most RAG tutorials stop at the prototype stage.

The real engineering challenge begins when you need:

reliability,
observability,
retrieval accuracy,
fault tolerance,
scalability,
and cost control.

Nexus Knowledge Engine was built specifically to solve those production-level problems.

If you found this deep dive useful — or have ideas for improving semantic cache thresholds or retrieval gating strategies — feel free to share your thoughts in the comments.

Happy building 🚀