Large Language Models are incredibly powerful, but they have a major limitation: they do not inherently know your infrastructure, your internal documentation, your deployment standards, or your engineering workflows.
Generic LLMs can explain concepts like Docker, Terraform, or NGINX at a broad level, but when building real engineering systems, broad knowledge is not enough. Engineering teams need:
- accurate retrieval,
- contextual understanding,
- domain-specific responses,
- conversational continuity,
- and reliable citations.
That is where Retrieval-Augmented Generation (RAG) systems become important.
This article explores the architecture and implementation of the RAG pipeline built for VizLab.xyz — an internal AI-powered documentation assistant and developer copilot designed around real engineering workflows.
Instead of functioning as a generic chatbot, the system was designed to:
- retrieve highly relevant technical documentation,
- maintain conversational context,
- reduce hallucinations,
- provide citation-backed answers,
- and operate entirely within a controlled documentation ecosystem.
The architecture combines:
- FastAPI
- FAISS
- BM25
- AWS Bedrock
- AWS Titan Embeddings
- Docker
- Tailscale
- Caddy Reverse Proxy
- S3-backed vector persistence
while remaining lightweight enough to self-host on a private infrastructure stack.
Why Traditional LLMs Fail for Engineering Workflows
One of the biggest problems with using general-purpose LLMs in technical environments is hallucination.
A model may:
- generate outdated commands,
- invent configuration syntax,
- confuse versions,
- or answer from unrelated training data.
For engineering environments, this is dangerous.
If a system generates:
- incorrect Terraform configurations,
- invalid NGINX directives,
- broken IAM policies,
- or misleading Docker instructions,
the consequences can directly affect infrastructure stability.
The goal of the VizLab RAG system was therefore not to create a “smart chatbot.”
The goal was to create a retrieval-first architecture where:
- trusted documentation is indexed,
- relevant context is retrieved,
- and the LLM is forced to answer primarily from that retrieved context.
This architectural approach significantly improves response reliability.
System Goals
The system was specifically designed for:
- internal technical documentation retrieval,
- educational AI workflows,
- developer assistance,
- DevOps troubleshooting,
- infrastructure guidance,
- and conversational technical support.
The indexed knowledge base includes curated documentation from domains such as:
- Docker
- NGINX
- Terraform
- GitHub Actions
- Solidity
- AWS Policies
- Infrastructure tooling
Rather than scraping the entire internet, the system intentionally targets a highly curated set of engineering documentation sources.
This dramatically improves:
- retrieval precision,
- chunk quality,
- and contextual relevance.
High-Level Architecture
The architecture is divided into three major pipelines:
- Offline Ingestion Pipeline
- Retrieval Pipeline
- Generation Pipeline
The complete architecture diagram is shown below.
Step 1: Documentation Scraping
The system begins by scraping a curated list of documentation URLs using BeautifulSoup.
Instead of indexing random internet pages, the scraper focuses on trusted engineering sources.
Examples include:
- Docker documentation
- Terraform documentation
- NGINX references
- AWS documentation
- Solidity references
This dramatically improves knowledge quality.
One important architectural decision was storing raw scraped content immediately into AWS S3 before processing.
This provides:
- durability,
- backup recovery,
- ingestion reproducibility,
- and debugging visibility.
If chunking or embedding pipelines fail, raw documentation is still preserved.
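One way to make that raw storage reproducible is a deterministic key layout. The sketch below is an assumption for illustration, not the project's actual S3 layout: the `raw/<domain>/<date>/<hash>.html` scheme and the `raw_page_key` helper are invented names.

```python
import hashlib
from datetime import date
from urllib.parse import urlparse

def raw_page_key(url: str, snapshot: date) -> str:
    """Build a deterministic S3 key for a raw scraped page.

    Assumed layout: raw/<domain>/<YYYY-MM-DD>/<url-hash>.html
    Re-running ingestion on the same day overwrites cleanly, while
    older snapshots remain available for debugging and replay.
    """
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"raw/{domain}/{snapshot.isoformat()}/{digest}.html"
```

Because the key depends only on the URL and the snapshot date, the same page scraped twice in one run lands at the same object, which keeps the bucket tidy.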
Step 2: Text Cleaning & Normalization
Raw documentation contains:
- navigation elements,
- repeated menus,
- headers,
- footers,
- formatting artifacts,
- inconsistent whitespace.
Before chunking, the pipeline normalizes and cleans the content.
This stage improves:
- embedding quality,
- retrieval relevance,
- and token efficiency.
Garbage input produces garbage embeddings.
Cleaning matters more than many people realize.
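A minimal sketch of this kind of normalization is shown below. The specific heuristics (entity decoding, whitespace collapsing, dropping very short lines as likely menu fragments) are illustrative assumptions, not the exact production rules.

```python
import html
import re

def clean_text(raw: str) -> str:
    """Normalise scraped documentation text before chunking.

    - decode HTML entities (&amp; -> &)
    - collapse runs of horizontal whitespace
    - drop very short lines, which are usually nav/menu artifacts
      (a heuristic threshold, not taken from the original pipeline)
    """
    text = html.unescape(raw)
    lines = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if len(line) >= 3:  # skip stray fragments like "»" or "|"
            lines.append(line)
    return "\n".join(lines)
```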
Step 3: Chunking Strategy
One of the most important parts of any RAG system is chunking.
The VizLab pipeline uses a recursive text splitting strategy designed to preserve semantic meaning.
Poor chunking destroys retrieval quality.
If chunks are:
- too small → context becomes fragmented
- too large → embeddings become noisy
- overlapping incorrectly → retrieval becomes redundant
The system therefore uses overlapping semantic chunks to preserve continuity across boundaries.
This allows the retrieval system to maintain:
- contextual coherence,
- command relationships,
- and infrastructure explanations.
The indexed system currently maintains roughly:
- 300–500 chunks,
- 300–500 embeddings,
- with a FAISS index smaller than 5MB.
Because the dataset is intentionally curated and focused, retrieval quality remains high without massive storage overhead.
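The overlap idea can be sketched in a few lines. The real pipeline uses recursive, separator-aware splitting; this simplified version uses fixed-size character windows purely to show how each chunk repeats the tail of the previous one so content that crosses a boundary stays retrievable.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into overlapping character windows.

    Each chunk starts (chunk_size - overlap) characters after the
    previous one, so adjacent chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```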
Step 4: Embedding Generation
After chunking, the documents are converted into vector embeddings using AWS Titan Embeddings through Amazon Bedrock.
Embeddings convert semantic meaning into numerical vector representations.
This enables similarity search based on meaning rather than exact keyword matching.
For example:
A user asking:
“How do I configure reverse proxying?”
can still retrieve chunks mentioning:
- upstream routing,
- proxy_pass,
- load balancing,
- or NGINX forwarding,
even if the exact wording differs.
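The mechanism behind this is vector similarity. The sketch below uses tiny made-up 3-dimensional vectors as stand-ins for real Titan embeddings (which have far more dimensions); only the cosine-similarity arithmetic is the point.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
query = [0.9, 0.1, 0.0]             # "configure reverse proxying"
proxy_pass_chunk = [0.8, 0.2, 0.1]  # chunk mentioning proxy_pass
solidity_chunk = [0.0, 0.1, 0.9]    # unrelated Solidity chunk
```

Even though the query never says "proxy_pass", its vector points in nearly the same direction as the proxy chunk's vector, so that chunk ranks above the unrelated one.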
Why Hybrid Search Was Used
One of the most important architectural decisions was combining:
- FAISS dense retrieval
- with BM25 sparse retrieval.
Dense vector search is excellent for:
- semantic understanding,
- paraphrased questions,
- conceptual similarity.
But dense retrieval sometimes struggles with:
- exact command syntax,
- specific keywords,
- infrastructure flags,
- version identifiers.
BM25 solves this by ranking exact keyword relevance.
Combining both systems creates a hybrid retrieval architecture that balances:
- semantic similarity,
- and exact-match retrieval.
Here is the Reciprocal Rank Fusion (RRF) logic that combines both lists:
```python
@staticmethod
def _rrf_merge(
    dense: list[RetrievedChunk],
    sparse: list[RetrievedChunk],
    k: int = 60,
) -> list[RetrievedChunk]:
    """
    Reciprocal Rank Fusion.
    RRF score = Σ 1 / (k + rank_i)
    """
    rrf: dict[str, RetrievedChunk] = {}
    for rank, chunk in enumerate(dense, start=1):
        if chunk.chunk_id not in rrf:
            rrf[chunk.chunk_id] = chunk
        rrf[chunk.chunk_id].rrf_score += 1.0 / (k + rank)
        rrf[chunk.chunk_id].faiss_score = max(
            rrf[chunk.chunk_id].faiss_score, chunk.faiss_score
        )
    for rank, chunk in enumerate(sparse, start=1):
        if chunk.chunk_id not in rrf:
            rrf[chunk.chunk_id] = chunk
        rrf[chunk.chunk_id].rrf_score += 1.0 / (k + rank)
        rrf[chunk.chunk_id].bm25_score = max(
            rrf[chunk.chunk_id].bm25_score, chunk.bm25_score
        )
    return sorted(rrf.values(), key=lambda c: c.rrf_score, reverse=True)
```
This significantly improves engineering query performance.
Especially for:
- CLI commands,
- configuration syntax,
- infrastructure tooling,
- and troubleshooting workflows.
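The merge logic above assumes a chunk record with one mutable score per retriever plus the fused RRF score. A plausible shape for that record (an assumption for illustration, since the original definition is not shown) would be:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    """Assumed shape of the chunk record used by the RRF merge:
    per-retriever scores plus the fused RRF score, all mutable."""
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    faiss_score: float = 0.0
    bm25_score: float = 0.0
    rrf_score: float = 0.0
```

A chunk that ranks 1st in the dense list and 3rd in the sparse list would accumulate an RRF score of 1/(60+1) + 1/(60+3), so chunks surfaced by both retrievers naturally rise to the top.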
Retrieval Pipeline
Once ingestion is complete, the system becomes queryable.
The retrieval pipeline is responsible for:
- understanding user intent,
- retrieving relevant context,
- and assembling prompt-ready information.
Query Cache Layer
Before entering retrieval, queries first pass through an LRU TTL cache.
If an identical query already exists, the system bypasses:
- vector retrieval,
- embedding generation,
- and LLM invocation.
Cached responses return in under 5ms.
This dramatically reduces:
- Bedrock API usage,
- latency,
- and infrastructure cost.
Caching becomes especially important in developer environments where:
- repeated troubleshooting questions,
- repeated configuration lookups,
- and repeated deployment issues
occur frequently.
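An LRU cache with per-entry TTL can be built on `collections.OrderedDict` in a few lines. This is a sketch of the pattern, not the production implementation: expired entries are dropped on read, and the least-recently-used entry is evicted when the cache is full.

```python
import time
from collections import OrderedDict

class LRUTTLCache:
    """Minimal LRU cache with per-entry TTL."""

    def __init__(self, max_size: int = 256, ttl_seconds: float = 300.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]          # expired
            return None
        self._store.move_to_end(key)      # mark as recently used
        return value

    def put(self, key: str, value) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        elif len(self._store) >= self.max_size:
            self._store.popitem(last=False)  # evict least-recently-used
        self._store[key] = (time.monotonic(), value)
```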
Conversational Memory System
The conversational memory implementation uses an in-memory sliding-window architecture.
Each session maintains:
- recent user messages,
- assistant replies,
- timestamps,
- and conversational ordering.
The memory system stores:
- 6 user turns
- and 6 assistant turns
before automatically truncating older context.
This sliding-window approach prevents:
- prompt explosion,
- token overflows,
- and degraded retrieval quality.
The memory subsystem influences two separate stages:
1. Query Rewriting
If a user asks:
“How do I configure that for NGINX?”
the retriever analyzes previous messages to understand what “that” refers to.
This significantly improves retrieval accuracy for follow-up conversations.
2. Prompt Compilation
The conversation history is also injected into the final LLM prompt.
This enables conversational continuity while keeping context size controlled.
The implementation currently uses an in-memory Python dictionary keyed by session_id.
This architecture is lightweight and extremely fast for single-instance deployments.
However, for multi-replica scaling, this would eventually need migration to:
- Redis,
- or another distributed memory layer.
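The sliding window described above maps naturally onto `collections.deque` with a `maxlen`: once the window is full, appending a new message silently drops the oldest one. The sketch below assumes the 6+6 turn limit stated earlier; the class and field names are illustrative.

```python
from collections import defaultdict, deque
from time import time

class ConversationMemory:
    """Per-session sliding window: keep the last 12 messages
    (6 user turns + 6 assistant turns) and drop older ones."""

    def __init__(self, max_messages: int = 12):
        self._sessions: dict[str, deque] = defaultdict(
            lambda: deque(maxlen=max_messages)
        )

    def add(self, session_id: str, role: str, content: str) -> None:
        self._sessions[session_id].append(
            {"role": role, "content": content, "ts": time()}
        )

    def history(self, session_id: str) -> list[dict]:
        return list(self._sessions[session_id])
```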
Query Rewriting
One major issue with naive RAG systems is weak query formulation.
Users rarely ask perfectly structured questions.
The VizLab system expands queries into multiple search variants before retrieval.
This dramatically improves recall.
For example:
A query like:
“How do I secure my containers?”
may internally generate variants related to:
- Docker security
- container isolation
- runtime permissions
- capabilities
- reverse proxy security
- TLS hardening
The implementation for expanding queries looks like this:
```python
def _rewrite_query(self, query: str, intent: str, domains: list[str]) -> list[str]:
    """Expand the query into search-optimised variants."""
    rewrites = [query]
    q_lower = query.lower()
    if intent == "debug":
        rewrites.append(f"{' '.join(domains)} {q_lower} causes fix solution")
        rewrites.append(f"troubleshoot {q_lower}")
    elif intent == "how_to":
        rewrites.append(f"{q_lower} step by step guide configuration")
        rewrites.append(f"{' '.join(domains)} {q_lower} example")
    # Cross-domain context injection
    if len(domains) > 1:
        rewrites.append(f"{' '.join(domains)} integration {query}")
    return list(dict.fromkeys(rewrites))  # deduplicate, preserving order
```
This improves retrieval coverage substantially.
Re-Ranking Pipeline
After retrieval, results are re-ranked based on:
- keyword density,
- contextual relevance,
- freshness,
- and semantic confidence.
Without re-ranking, vector systems often retrieve:
- partially related chunks,
- noisy semantic neighbors,
- or overly broad context.
To solve this, a custom re-ranking step applies domain-specific heuristic boosts:
```python
def _rerank(self, chunks: list[RetrievedChunk], qu: QueryUnderstanding) -> list[RetrievedChunk]:
    """
    Boost chunks by:
      +0.1 per exact keyword match in chunk text
      +0.05 if chunk domain matches query domain
      +0.05 if chunk is Section 0 (intro / overview)
    """
    q_keywords_lower = {k.lower() for k in qu.keywords}
    for chunk in chunks:
        text_lower = chunk.text.lower()
        # Keyword density bonus
        keyword_hits = sum(1 for kw in q_keywords_lower if kw in text_lower)
        chunk.rrf_score += keyword_hits * 0.1
        # Domain match bonus
        if chunk.metadata.get("domain") in qu.domains:
            chunk.rrf_score += 0.05
        # Introductory chunk bonus
        if chunk.metadata.get("chunk_index", 99) == 0:
            chunk.rrf_score += 0.05
    return sorted(chunks, key=lambda c: c.rrf_score, reverse=True)
```
Re-ranking significantly improves response precision.
Prompt Engineering & Hallucination Control
One of the hardest engineering problems during development was strict citation enforcement.
Amazon Bedrock Titan occasionally attempted to answer from its base-model training data rather than from the retrieved documentation context.
This is a common RAG failure mode.
The solution required:
- strict system prompts,
- structured prompt assembly,
- guardrails,
- context enforcement,
- and fallback validation logic.
The final prompt compiler injects:
- retrieved context,
- system instructions,
- conversation history,
- user query,
- and formatting constraints
into a single structured payload.
The model is heavily instructed to answer primarily from retrieved context.
This significantly reduces hallucinations and improves citation reliability.
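The assembly pattern can be sketched as follows. This is not the exact production template: the section headings, the `[n]` citation convention, and the function signature are assumptions chosen to illustrate how context, history, and constraints get combined into one payload.

```python
def compile_prompt(
    system_rules: str,
    context_chunks: list[dict],
    history: list[dict],
    query: str,
) -> str:
    """Assemble a single structured prompt: numbered context blocks
    carry source tags so the model can cite [1], [2], ... in its answer."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(context_chunks, start=1)
    )
    turns = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return (
        f"{system_rules}\n\n"
        f"## Retrieved context\n{context}\n\n"
        f"## Conversation so far\n{turns}\n\n"
        f"## Question\n{query}\n\n"
        "Answer using ONLY the retrieved context above. "
        "Cite sources as [n]. If the context is insufficient, say so."
    )
```

Numbering the context blocks is what later makes citation validation possible: any `[n]` in the answer can be checked against the chunks that were actually supplied.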
Why Streaming Responses Were Avoided
Many AI systems use token streaming.
This architecture intentionally avoids it.
Instead, the system waits for the complete generation before returning a strict JSON payload.
Why?
Because the system prioritizes:
- citation integrity,
- structured outputs,
- deterministic formatting,
- and response validation.
Streaming complicated reliable citation attachment.
For technical documentation systems, correctness was prioritized over token-by-token rendering speed.
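One benefit of waiting for the full generation is that citations can be validated before anything reaches the client. The payload shape and helper below are assumptions sketching that idea, not the actual API schema.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RAGResponse:
    """Sketch of a strict, non-streaming response payload."""
    answer: str
    citations: list[str] = field(default_factory=list)
    session_id: str = ""
    latency_ms: int = 0

def validate_and_serialize(resp: RAGResponse, retrieved_sources: set[str]) -> str:
    """Drop any citation that does not point at a source that was
    actually retrieved, then emit the final JSON payload."""
    resp.citations = [c for c in resp.citations if c in retrieved_sources]
    return json.dumps(asdict(resp))
```

With streaming, tokens would already be on the wire before this check could run; buffering the full response makes the validation step trivial.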
Deployment Architecture
The entire RAG service is containerized using Docker.
The backend stack includes:
- FastAPI
- Gunicorn
- AWS Bedrock
- FAISS
- Caddy
- Tailscale
The infrastructure is fully self-hosted.
Importantly:
the backend is never directly exposed to the public internet.
Instead:
- the service operates inside a private Tailscale network,
- secured behind a Caddy reverse proxy.
Caddy was specifically chosen because it integrates cleanly with Tailscale and automatically manages TLS certificates inside the private mesh network.
This removed significant operational overhead.
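For illustration, the proxy layer can be as small as a few lines of Caddyfile. The hostname and port below are placeholders, not the actual VizLab configuration:

```
# Illustrative Caddyfile sketch (hostname and port are placeholders):
rag.example.ts.net {
    reverse_proxy localhost:8000   # FastAPI backend behind Gunicorn
}
```

Inside a Tailscale network, Caddy can obtain certificates for the node's `ts.net` hostname automatically, which is the TLS automation referred to above.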
Request Lifecycle Walkthrough
A typical request flows through the following stages:
- User submits query from frontend UI
- Request enters Tailscale-secured network
- Caddy proxies request to FastAPI backend
- Query cache checks for existing response
- Conversational memory enriches context
- Query rewriting expands retrieval scope
- FAISS + BM25 perform hybrid search
- Retrieved chunks are re-ranked
- Prompt compiler assembles final payload
- AWS Bedrock Titan generates response
- Structured JSON response returned to frontend
- Response stored in memory and cache
This entire process currently averages around 8.9 seconds of end-to-end latency.
What Broke During Development
The hardest parts of the project were not the LLM APIs.
The hardest parts were infrastructure and operational consistency.
1. Dockerizing FAISS & Bedrock Dependencies
Packaging the ingestion pipeline inside Docker caused repeated failures during CI/CD.
The challenge was getting:
- Python dependencies,
- FAISS native bindings,
- Bedrock credentials,
- and ingestion runtime behavior
to work consistently inside the container environment.
This took extensive debugging across:
- container startup,
- dependency compatibility,
- and runtime initialization.
2. Tailscale + Caddy Networking
One of the most painful debugging sessions involved:
- reverse proxying,
- CORS,
- Tailscale networking,
- and frontend-backend communication.
Ensuring:
- correct TLS handling,
- proper headers,
- secure private networking,
- and browser compatibility
required multiple iterations.
Networking problems are often far harder than application logic.
3. Prompt Instability
Early versions of the system occasionally ignored retrieved context and answered generically.
This produced:
- weak citations,
- hallucinated explanations,
- inconsistent formatting.
The solution required extensive prompt engineering and retrieval refinement.
This reinforced an important lesson:
In RAG systems, retrieval quality matters more than model size.
Future Improvements
Several improvements are planned for future iterations:
- distributed vector databases
- semantic caching
- Redis-backed distributed memory
- streaming retrieval pipelines
- reranking models
- agentic retrieval workflows
- observability dashboards
- tracing and telemetry
- Kubernetes-native deployment
- multi-user access control
- prompt injection detection
The current architecture intentionally prioritizes:
- correctness,
- reliability,
- retrieval quality,
- and operational simplicity
over premature scaling complexity.
Final Thoughts
Building RAG systems is not simply:
“connecting an LLM to a vector database.”
Production-oriented retrieval systems require:
- retrieval engineering,
- chunking strategy,
- ranking pipelines,
- conversational memory,
- prompt control,
- infrastructure reliability,
- and operational debugging.
The most valuable lesson from building this architecture was understanding that AI systems are fundamentally systems engineering problems.
Not just machine learning problems.
The quality of:
- retrieval,
- infrastructure,
- networking,
- caching,
- and prompt orchestration
ultimately determines whether the system feels reliable in real-world engineering workflows.
And as RAG architectures continue evolving, the engineering surrounding retrieval pipelines may become even more important than the models themselves.
