Designing a Production-Oriented RAG System for Technical Documentation

Large Language Models are incredibly powerful, but they have a major limitation:

They do not inherently know your infrastructure, your internal documentation, your deployment standards, or your engineering workflows.

Generic LLMs can explain concepts like Docker, Terraform, or NGINX at a broad level, but when building real engineering systems, broad knowledge is not enough. Engineering teams need:

  • accurate retrieval,
  • contextual understanding,
  • domain-specific responses,
  • conversational continuity,
  • and reliable citations.

That is where Retrieval-Augmented Generation (RAG) systems become important.

This article explores the architecture and implementation of the RAG pipeline built for VizLab.xyz — an internal AI-powered documentation assistant and developer copilot designed around real engineering workflows.

Instead of functioning as a generic chatbot, the system was designed to:

  • retrieve highly relevant technical documentation,
  • maintain conversational context,
  • reduce hallucinations,
  • provide citation-backed answers,
  • and operate entirely within a controlled documentation ecosystem.

The architecture combines:

  • FastAPI
  • FAISS
  • BM25
  • AWS Bedrock
  • AWS Titan Embeddings
  • Docker
  • Tailscale
  • Caddy Reverse Proxy
  • S3-backed vector persistence

while remaining lightweight enough to self-host on a private infrastructure stack.


Why Traditional LLMs Fail for Engineering Workflows

One of the biggest problems with using general-purpose LLMs in technical environments is hallucination.

A model may:

  • generate outdated commands,
  • invent configuration syntax,
  • confuse versions,
  • or answer from unrelated training data.

For engineering environments, this is dangerous.

If a system generates:

  • incorrect Terraform configurations,
  • invalid NGINX directives,
  • broken IAM policies,
  • or misleading Docker instructions,

the consequences can directly affect infrastructure stability.

The goal of the VizLab RAG system was therefore not to create a “smart chatbot.”

The goal was to create a retrieval-first architecture where:

  1. trusted documentation is indexed,
  2. relevant context is retrieved,
  3. and the LLM is forced to answer primarily from that retrieved context.

This architectural approach significantly improves response reliability.


System Goals

The system was specifically designed for:

  • internal technical documentation retrieval,
  • educational AI workflows,
  • developer assistance,
  • DevOps troubleshooting,
  • infrastructure guidance,
  • and conversational technical support.

The indexed knowledge base includes curated documentation from domains such as:

  • Docker
  • NGINX
  • Terraform
  • GitHub Actions
  • Solidity
  • AWS Policies
  • Infrastructure tooling

Rather than scraping the entire internet, the system intentionally targets a highly curated set of engineering documentation sources.

This dramatically improves:

  • retrieval precision,
  • chunk quality,
  • and contextual relevance.

High-Level Architecture

The architecture is divided into three major pipelines:

  1. Offline Ingestion Pipeline
  2. Retrieval Pipeline
  3. Generation Pipeline

[Figure: complete architecture diagram covering the ingestion, retrieval, and generation pipelines]

Step 1: Documentation Scraping

The system begins by scraping a curated list of documentation URLs using BeautifulSoup.

Instead of indexing random internet pages, the scraper focuses on trusted engineering sources.

Examples include:

  • Docker documentation
  • Terraform documentation
  • NGINX references
  • AWS documentation
  • Solidity references

This dramatically improves knowledge quality.

One important architectural decision was storing raw scraped content immediately into AWS S3 before processing.

This provides:

  • durability,
  • backup recovery,
  • ingestion reproducibility,
  • and debugging visibility.

If chunking or embedding pipelines fail, raw documentation is still preserved.
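As a rough sketch, the scrape-then-persist flow looks something like this (the bucket name and key scheme are illustrative, not the production values):

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")
RAW_BUCKET = "vizlab-raw-docs"  # hypothetical bucket name

def scrape_and_persist(url: str) -> str:
    """Fetch a documentation page and store the raw HTML in S3 before any processing."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    # Persist the untouched HTML first, so a failed chunking or embedding
    # run can always be replayed from the original source
    key = f"raw/{url.replace('https://', '').replace('/', '_')}.html"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=resp.content)

    # Extract visible text for the downstream cleaning stage
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator="\n")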


Step 2: Text Cleaning & Normalization

Raw documentation contains:

  • navigation elements,
  • repeated menus,
  • headers,
  • footers,
  • formatting artifacts,
  • inconsistent whitespace.

Before chunking, the pipeline normalizes and cleans the content.

This stage improves:

  • embedding quality,
  • retrieval relevance,
  • and token efficiency.

Garbage input produces garbage embeddings.

Cleaning matters more than many people realize.
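To make this concrete, a cleaning pass might look like the following sketch (the boilerplate patterns are invented examples; real rules are tuned per documentation source):

import re

# Hypothetical boilerplate phrases that recur across scraped doc pages
NAV_PATTERNS = [
    r"Skip to main content",
    r"Table of contents",
    r"Was this page helpful\?",
]

def clean_text(raw: str) -> str:
    """Strip navigation boilerplate and normalize whitespace before chunking."""
    text = raw
    for pattern in NAV_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)

    # Collapse trailing spaces and runs of blank lines
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()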


Step 3: Chunking Strategy

One of the most important parts of any RAG system is chunking.

The VizLab pipeline uses a recursive text splitting strategy designed to preserve semantic meaning.

Poor chunking destroys retrieval quality.

If chunks are:

  • too small → context becomes fragmented
  • too large → embeddings become noisy
  • overlapping incorrectly → retrieval becomes redundant

The system therefore uses overlapping semantic chunks to preserve continuity across boundaries.

This allows the retrieval system to maintain:

  • contextual coherence,
  • command relationships,
  • and infrastructure explanations.

The indexed system currently maintains roughly:

  • 300–500 chunks (one embedding per chunk),
  • and a FAISS index smaller than 5 MB.

Because the dataset is intentionally curated and focused, retrieval quality remains high without massive storage overhead.
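For reference, a recursive splitter of this kind can be configured in a few lines. This sketch uses LangChain's RecursiveCharacterTextSplitter as a stand-in (the article's splitter may be hand-rolled, and the sizes shown are illustrative):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlap preserves continuity across chunk boundaries; sizes are illustrative
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " "],  # prefer structural breaks over mid-sentence cuts
)

cleaned_document = "..."  # output of the cleaning stage
chunks = splitter.split_text(cleaned_document)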


Step 4: Embedding Generation

After chunking, the documents are converted into vector embeddings using AWS Titan Embeddings through Amazon Bedrock.

Embeddings convert semantic meaning into numerical vector representations.

This enables similarity search based on meaning rather than exact keyword matching.

For example:

A user asking:

“How do I configure reverse proxying?”

can still retrieve chunks mentioning:

  • upstream routing,
  • proxy_pass,
  • load balancing,
  • or NGINX forwarding,

even if the exact wording differs.
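Calling Titan embeddings through the Bedrock runtime looks roughly like this (the model ID and region are assumptions; check the model catalog available to your account):

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Return a Titan embedding vector for one chunk of text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed model ID
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

The resulting vectors can then be added to a FAISS index (a flat L2 index is shown here; the article does not specify the production index type):

import faiss
import numpy as np

chunk_texts = ["...", "..."]  # output of the chunking stage
vectors = np.array([embed(t) for t in chunk_texts], dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact search; cheap at a few hundred vectors
index.add(vectors)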


Why Hybrid Search Was Used

One of the most important architectural decisions was combining:

  • FAISS dense retrieval
  • with BM25 sparse retrieval.

Dense vector search is excellent for:

  • semantic understanding,
  • paraphrased questions,
  • conceptual similarity.

But dense retrieval sometimes struggles with:

  • exact command syntax,
  • specific keywords,
  • infrastructure flags,
  • version identifiers.

BM25 solves this by ranking exact keyword relevance.

Combining both systems creates a hybrid retrieval architecture that balances:

  • semantic similarity,
  • and exact-match retrieval.

Here is the Reciprocal Rank Fusion (RRF) logic that combines both lists:

@staticmethod
def _rrf_merge(
    dense: list[RetrievedChunk],
    sparse: list[RetrievedChunk],
    k: int = 60,
) -> list[RetrievedChunk]:
    """
    Reciprocal Rank Fusion.
    RRF score = Σ 1 / (k + rank_i)
    """
    rrf: dict[str, RetrievedChunk] = {}

    # First pass: accumulate reciprocal-rank contributions from the dense (FAISS) list
    for rank, chunk in enumerate(dense, start=1):
        if chunk.chunk_id not in rrf:
            rrf[chunk.chunk_id] = chunk
        rrf[chunk.chunk_id].rrf_score += 1.0 / (k + rank)
        rrf[chunk.chunk_id].faiss_score = max(
            rrf[chunk.chunk_id].faiss_score, chunk.faiss_score
        )

    # Second pass: accumulate contributions from the sparse (BM25) list
    for rank, chunk in enumerate(sparse, start=1):
        if chunk.chunk_id not in rrf:
            rrf[chunk.chunk_id] = chunk
        rrf[chunk.chunk_id].rrf_score += 1.0 / (k + rank)
        rrf[chunk.chunk_id].bm25_score = max(
            rrf[chunk.chunk_id].bm25_score, chunk.bm25_score
        )

    return sorted(rrf.values(), key=lambda c: c.rrf_score, reverse=True)

This significantly improves engineering query performance.

Especially for:

  • CLI commands,
  • configuration syntax,
  • infrastructure tooling,
  • and troubleshooting workflows.

Retrieval Pipeline

Once ingestion is complete, the system becomes queryable.

The retrieval pipeline is responsible for:

  • understanding user intent,
  • retrieving relevant context,
  • and assembling prompt-ready information.

Query Cache Layer

Before entering retrieval, queries first pass through an LRU TTL cache.

If an identical query already exists, the system bypasses:

  • vector retrieval,
  • embedding generation,
  • and LLM invocation.

Cached responses return in under 5ms.

This dramatically reduces:

  • Bedrock API usage,
  • latency,
  • and infrastructure cost.

Caching becomes especially important in developer environments where:

  • repeated troubleshooting questions,
  • repeated configuration lookups,
  • and repeated deployment issues

occur frequently.
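A minimal version of this layer can be sketched with cachetools (sizes and TTL are illustrative; the real cache keying may differ):

from typing import Optional

from cachetools import TTLCache

# Bounded cache with LRU-style eviction whose entries also expire after 10 minutes
query_cache: TTLCache = TTLCache(maxsize=512, ttl=600)

def cached_answer(query: str) -> Optional[dict]:
    """Return a previously generated response for an identical query, if still fresh."""
    return query_cache.get(query.strip().lower())

def store_answer(query: str, response: dict) -> None:
    query_cache[query.strip().lower()] = response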


Conversational Memory System

The conversational memory implementation uses an in-memory sliding-window architecture.

Each session maintains:

  • recent user messages,
  • assistant replies,
  • timestamps,
  • and conversational ordering.

The memory system stores:

  • 6 user turns
  • and 6 assistant turns

before automatically truncating older context.

This sliding-window approach prevents:

  • prompt explosion,
  • token overflows,
  • and degraded retrieval quality.

The memory subsystem influences two separate stages:

1. Query Rewriting

If a user asks:

“How do I configure that for NGINX?”

the retriever analyzes previous messages to understand what “that” refers to.

This significantly improves retrieval accuracy for follow-up conversations.


2. Prompt Compilation

The conversation history is also injected into the final LLM prompt.

This enables conversational continuity while keeping context size controlled.

The implementation currently uses:

  • an in-memory Python dictionary,
  • mapped by session_id.

This architecture is lightweight and extremely fast for single-instance deployments.
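Under that design (a dict keyed by session_id, holding six turns per side), the layer can be sketched as:

from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    content: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# session_id -> sliding window of the last 6 user + 6 assistant turns
_sessions: dict[str, deque] = {}

def remember(session_id: str, role: str, content: str) -> None:
    window = _sessions.setdefault(session_id, deque(maxlen=12))
    window.append(Turn(role, content))  # the deque silently drops the oldest turn

def history(session_id: str) -> list:
    return list(_sessions.get(session_id, []))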

However, for multi-replica scaling, this would eventually need migration to:

  • Redis,
  • or another distributed memory layer.

Query Rewriting

One major issue with naive RAG systems is weak query formulation.

Users rarely ask perfectly structured questions.

The VizLab system expands queries into multiple search variants before retrieval.

This dramatically improves recall.

For example:

A query like:

“How do I secure my containers?”

may internally generate variants related to:

  • Docker security
  • container isolation
  • runtime permissions
  • capabilities
  • reverse proxy security
  • TLS hardening

The implementation for expanding queries looks like this:

def _rewrite_query(self, query: str, intent: str, domains: list[str]) -> list[str]:
    """Expand the query into search-optimised variants."""
    rewrites = [query]
    q_lower = query.lower()

    if intent == "debug":
        rewrites.append(f"{' '.join(domains)} {q_lower} causes fix solution")
        rewrites.append(f"troubleshoot {q_lower}")
    elif intent == "how_to":
        rewrites.append(f"{q_lower} step by step guide configuration")
        rewrites.append(f"{'  '.join(domains)} {q_lower} example")

    # Cross-domain context injection
    if len(domains) > 1:
        rewrites.append(f"{' '.join(domains)} integration {query}")

    # dict.fromkeys() de-duplicates the variants while preserving order
    return list(dict.fromkeys(rewrites))

This improves retrieval coverage substantially.


Re-Ranking Pipeline

After retrieval, results are re-ranked based on:

  • keyword density,
  • contextual relevance,
  • freshness,
  • and semantic confidence.

Without re-ranking, vector systems often retrieve:

  • partially related chunks,
  • noisy semantic neighbors,
  • or overly broad context.

To solve this, a custom re-ranking step applies domain-specific heuristic boosts:

def _rerank(self, chunks: list[RetrievedChunk], qu: QueryUnderstanding) -> list[RetrievedChunk]:
    """
    Boost chunks by:
      +0.1 per exact keyword match in chunk text
      +0.05 if chunk domain matches query domain
      +0.05 if chunk is Section 0 (intro / overview)
    """
    q_keywords_lower = {k.lower() for k in qu.keywords}
    for chunk in chunks:
        text_lower = chunk.text.lower()

        # Keyword density bonus
        keyword_hits = sum(1 for kw in q_keywords_lower if kw in text_lower)
        chunk.rrf_score += keyword_hits * 0.1

        # Domain match bonus
        if chunk.metadata.get("domain") in qu.domains:
            chunk.rrf_score += 0.05

        # Introductory chunk bonus
        if chunk.metadata.get("chunk_index", 99) == 0:
            chunk.rrf_score += 0.05

    return sorted(chunks, key=lambda c: c.rrf_score, reverse=True)

Re-ranking significantly improves response precision.


Prompt Engineering & Hallucination Control

One of the hardest engineering problems during development was strict citation enforcement.

Amazon Bedrock Titan occasionally attempted to answer from its base model training rather than from the retrieved documentation context.

This is a common RAG failure mode.

The solution required:

  • strict system prompts,
  • structured prompt assembly,
  • guardrails,
  • context enforcement,
  • and fallback validation logic.

The final prompt compiler injects:

  • retrieved context,
  • system instructions,
  • conversation history,
  • user query,
  • and formatting constraints

into a single structured payload.

The model is heavily instructed to answer primarily from retrieved context.

This significantly reduces hallucinations and improves citation reliability.
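In spirit, the prompt compiler does something like the following (the instruction wording here is a paraphrase, not the production prompt):

SYSTEM_INSTRUCTIONS = (
    "Answer ONLY from the numbered context below. "
    "Cite sources as [1], [2], and so on. "
    "If the context does not contain the answer, say so explicitly."
)

def compile_prompt(context_chunks: list, history: list, query: str) -> str:
    """Assemble retrieved context, conversation history, and the query into one payload."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    conversation = "\n".join(history)
    return (
        f"{SYSTEM_INSTRUCTIONS}\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{conversation}\n\n"
        f"Question: {query}"
    )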


Why Streaming Responses Were Avoided

Many AI systems use token streaming.

This architecture intentionally avoids it.

Instead, the system waits for the complete generation before returning a strict JSON payload.

Why?

Because the system prioritizes:

  • citation integrity,
  • structured outputs,
  • deterministic formatting,
  • and response validation.

Streaming complicated reliable citation attachment.

For technical documentation systems, correctness was prioritized over token-by-token rendering speed.
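One way to enforce that contract is a validated response model, sketched here with Pydantic (field names are illustrative):

from pydantic import BaseModel

class Citation(BaseModel):
    source_url: str
    chunk_id: str

class RagResponse(BaseModel):
    answer: str
    citations: list[Citation]
    session_id: str

Because the full generation is parsed and validated against this schema before anything is returned, malformed citations fail loudly instead of streaming halfway to the client.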


Deployment Architecture

The entire RAG service is containerized using Docker.

The backend stack includes:

  • FastAPI
  • Gunicorn
  • AWS Bedrock
  • FAISS
  • Caddy
  • Tailscale

The infrastructure is fully self-hosted.

Importantly:
the backend is never directly exposed to the public internet.

Instead:

  • the service operates inside a private Tailscale network,
  • secured behind a Caddy reverse proxy.

Caddy was specifically chosen because it integrates cleanly with Tailscale and automatically manages TLS certificates inside the private mesh network.

This removed significant operational overhead.


Request Lifecycle Walkthrough

A typical request flows through the following stages:

  1. User submits query from frontend UI
  2. Request enters Tailscale-secured network
  3. Caddy proxies request to FastAPI backend
  4. Query cache checks for existing response
  5. Conversational memory enriches context
  6. Query rewriting expands retrieval scope
  7. FAISS + BM25 perform hybrid search
  8. Retrieved chunks are re-ranked
  9. Prompt compiler assembles final payload
  10. AWS Bedrock Titan generates response
  11. Structured JSON response returned to frontend
  12. Response stored in memory and cache

This entire process currently averages ~8.9 seconds end-to-end.
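Condensed into a single FastAPI handler, the lifecycle reads roughly like this. The helpers rewrite, hybrid_search, rerank, and generate are hypothetical stand-ins for the services described above, and the cache and memory helpers match the earlier sketches:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    session_id: str
    query: str

@app.post("/query")
def handle_query(req: QueryRequest) -> dict:
    # Cache: identical queries short-circuit the entire pipeline
    if (hit := cached_answer(req.query)) is not None:
        return hit

    # Memory + query rewriting enrich and expand the search
    past = history(req.session_id)
    variants = rewrite(req.query, past)  # hypothetical helper

    # Hybrid retrieval (FAISS + BM25 + RRF), then re-ranking
    chunks = rerank(hybrid_search(variants), req.query)  # hypothetical helpers

    # Prompt compilation and generation
    prompt = compile_prompt(
        [c.text for c in chunks],
        [f"{t.role}: {t.content}" for t in past],
        req.query,
    )
    response = generate(prompt)  # hypothetical Bedrock Titan call

    # Persist for future turns and repeated queries
    remember(req.session_id, "user", req.query)
    remember(req.session_id, "assistant", response["answer"])
    store_answer(req.query, response)
    return response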


What Broke During Development

The hardest parts of the project were not the LLM APIs.

The hardest parts were infrastructure and operational consistency.


1. Dockerizing FAISS & Bedrock Dependencies

Packaging the ingestion pipeline inside Docker caused repeated failures during CI/CD.

The challenge was getting:

  • Python dependencies,
  • FAISS native bindings,
  • Bedrock credentials,
  • and ingestion runtime behavior

to work consistently inside the container environment.

This took extensive debugging across:

  • container startup,
  • dependency compatibility,
  • and runtime initialization.

2. Tailscale + Caddy Networking

One of the most painful debugging sessions involved:

  • reverse proxying,
  • CORS,
  • Tailscale networking,
  • and frontend-backend communication.

Ensuring:

  • correct TLS handling,
  • proper headers,
  • secure private networking,
  • and browser compatibility

required multiple iterations.

Networking problems are often far harder than application logic.


3. Prompt Instability

Early versions of the system occasionally ignored retrieved context and answered generically.

This produced:

  • weak citations,
  • hallucinated explanations,
  • inconsistent formatting.

The solution required extensive prompt engineering and retrieval refinement.

This reinforced an important lesson:

In RAG systems, retrieval quality matters more than model size.


Future Improvements

Several improvements are planned for future iterations:

  • distributed vector databases
  • semantic caching
  • Redis-backed distributed memory
  • streaming retrieval pipelines
  • reranking models
  • agentic retrieval workflows
  • observability dashboards
  • tracing and telemetry
  • Kubernetes-native deployment
  • multi-user access control
  • prompt injection detection

The current architecture intentionally prioritizes:

  • correctness,
  • reliability,
  • retrieval quality,
  • and operational simplicity

over premature scaling complexity.


Final Thoughts

Building RAG systems is not simply:

“connecting an LLM to a vector database.”

Production-oriented retrieval systems require:

  • retrieval engineering,
  • chunking strategy,
  • ranking pipelines,
  • conversational memory,
  • prompt control,
  • infrastructure reliability,
  • and operational debugging.

The most valuable lesson from building this architecture was understanding that AI systems are fundamentally systems engineering problems.

Not just machine learning problems.

The quality of:

  • retrieval,
  • infrastructure,
  • networking,
  • caching,
  • and prompt orchestration

ultimately determines whether the system feels reliable in real-world engineering workflows.

And as RAG architectures continue evolving, the engineering surrounding retrieval pipelines may become even more important than the models themselves.
