Jotty John

RAG Architecture

RAG stands for Retrieval-Augmented Generation — a technique used in AI to give more accurate and up-to-date answers.

  • Instead of only using what it already “knows”,
  • the AI first searches for relevant information,
  • then uses that information to generate a better answer.

You can build a RAG app by combining 3 core pieces:

  1. Store knowledge from your documents in a searchable form
  2. Retrieve relevant chunks for a user question
  3. Ask an LLM to answer using only that retrieved context

Think of it as: search first, generate second. The model gets better at answering questions about your data without any retraining, which keeps things cheap and simple.

What a RAG app needs

A typical RAG pipeline looks like this:

The basic architecture

1) Data ingestion

Load your source data:

  • PDFs
  • Word docs
  • Markdown files
  • HTML/web pages
  • Database records
  • Internal knowledge bases

2) Chunking

Split large documents into smaller passages, for example:

  • 300–800 tokens per chunk
  • some overlap, like 50–100 tokens

Why? Because embedding and retrieval work better on focused pieces of text.
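
As a rough sketch of the idea (counting words instead of tokens just to keep it simple, and using a placeholder file path):

def chunk_text(text, chunk_size=500, overlap=75):
    # Word counts stand in for tokens here; real pipelines usually
    # count tokens with a tokenizer such as tiktoken.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
    return chunks

chunks = chunk_text(open("data/handbook.txt").read())
print(f"{len(chunks)} chunks")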

3) Embeddings

Convert each chunk into a numeric vector using an embedding model.

Examples:

  • OpenAI embeddings
  • sentence-transformers
  • Cohere embeddings
  • Voyage AI embeddings
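
For example, with sentence-transformers (assuming chunks is the list of passages from the chunking step):

from sentence_transformers import SentenceTransformer

# A small, widely used open embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# One 384-dimensional vector per chunk
vectors = model.encode(chunks)
print(vectors.shape)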

4) Vector database

Store the embeddings and original text.

Popular options:

  • Chroma: easiest for local prototypes
  • FAISS: very fast local similarity search
  • Pinecone: managed hosted service
  • Weaviate
  • Qdrant
  • Milvus
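
Here's a rough sketch using Chroma's own client, reusing chunks and vectors from the previous steps (the metadata values are placeholders):

import chromadb

# Persist the index to a local folder
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs")

# Store each chunk together with its embedding and some metadata
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=vectors.tolist(),
    metadatas=[{"source": "handbook.txt"} for _ in chunks],
)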

5) Retriever

When the user asks a question:

  • embed the question
  • search the vector DB for top-k similar chunks
  • optionally rerank them
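
Continuing the sketch above, a basic retrieval step could look like this (the question is just an example):

# Embed the question with the same model used for the chunks
question = "How do I reset the device?"
q_vec = model.encode([question]).tolist()

# Ask Chroma for the 4 most similar chunks
results = collection.query(query_embeddings=q_vec, n_results=4)
top_chunks = results["documents"][0]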

6) Generation

Send the retrieved chunks + question to the LLM with a prompt like:

Answer using only the provided context. If the answer is not in the context, say you don’t know.

That reduces hallucinations a lot.
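
As a rough sketch with the OpenAI Python SDK, reusing question and top_chunks from the retrieval step:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "\n\n".join(top_chunks)
prompt = (
    "Answer using only the provided context. If the answer is not "
    "in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)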

Easiest stack for a first RAG app

If you want the fastest path, I’d suggest:

  • Python
  • FastAPI or Streamlit
  • LangChain or LlamaIndex
  • Chroma for local vector storage
  • OpenAI or another chat model

That’s a very beginner-friendly setup.

Minimal build plan

Option A: super simple prototype

Use:

  • Python
  • Streamlit
  • Chroma
  • OpenAI API

Good for:

  • uploading docs
  • asking questions in a browser UI
  • learning the flow

Option B: production-ish starter

Use:

  • FastAPI backend
  • React/Next.js frontend
  • Qdrant or Pinecone
  • background ingestion jobs
  • auth + logging + evaluation

Good for:

  • team/internal tool
  • customer-facing app
  • scaling beyond hobby level

A simple MVP flow

Step 1: install packages

For a simple Python version, you’d usually need packages like:

  • langchain
  • chromadb
  • openai
  • pypdf
  • tiktoken
  • streamlit

Step 2: load documents

Read your files from a folder like data/.

Step 3: split into chunks

Use a text splitter with overlap.

Step 4: create embeddings

Generate embeddings for each chunk.

Step 5: store them in Chroma

Persist the vector store locally.

Step 6: build a query function

Take a user question, retrieve top matches, and send them to the model.

Step 7: add a UI

A simple chat box is enough for v1.
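
A minimal Streamlit version could look like this, where answer_question() is a placeholder for the query function from step 6:

import streamlit as st

st.title("Ask my documents")

question = st.text_input("Your question")

if question:
    # answer_question() wraps retrieve + generate from step 6
    answer = answer_question(question)
    st.write(answer)

Run it with streamlit run app.py and you have a usable v1.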


Tiny example in Python

Here’s the shape of a minimal RAG script:

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# 1. Load document
loader = PyPDFLoader("docs/manual.pdf")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)
chunks = splitter.split_documents(docs)

# 3. Embed + store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 5. LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# 6. QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

# 7. Ask a question
query = "How do I reset the device?"
answer = qa.run(query)
print(answer)

That’s the basic idea. Real apps add citations, source display, better prompts, and error handling.


What makes a RAG app actually good

A lot of RAG apps work, but only a few work well. The difference usually comes from these:

Better chunking

Bad chunks = bad retrieval.

Tips:

  • keep sections semantically meaningful
  • don’t split tables badly
  • preserve headings/metadata
  • include source info like filename and page number

Metadata filtering

Store metadata such as:

  • file name
  • page number
  • department
  • date
  • access level

Then retrieve with filters, like:

  • only HR docs
  • only 2025 policies
  • only docs user is allowed to see
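
With the Chroma client from earlier, a filtered query might look like this (the department field is a made-up example of metadata you'd attach at ingestion time):

# Only search chunks tagged as HR documents
results = collection.query(
    query_embeddings=q_vec,
    n_results=4,
    where={"department": "HR"},
)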

Hybrid search

Use:

  • vector search + keyword search

This helps when users ask for:

  • exact product codes
  • error IDs
  • names/acronyms
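
One way to sketch this in LangChain is to blend a BM25 keyword retriever with the vector retriever, assuming the chunks and vectorstore objects from the example above (BM25Retriever needs the rank_bm25 package):

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword-based retriever over the same chunks
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4

# Dense retriever from the existing vector store
dense = vectorstore.as_retriever(search_kwargs={"k": 4})

# Blend the two ranked lists; the weights are a tuning knob
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
docs = hybrid.invoke("error E1234 on startup")  # example query with an exact code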

Reranking

After getting top 10 chunks, rerank them with a stronger relevance model and keep top 3–5.

This often improves quality a lot.
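
A common way to do this is with a cross-encoder from sentence-transformers; here's a sketch that rescores candidates from the retriever in the example above:

from sentence_transformers import CrossEncoder

# A small cross-encoder trained for passage relevance scoring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset the device?"
# Candidates from the retriever above (raise k to ~10 for reranking)
candidates = [d.page_content for d in retriever.invoke(query)]

# Score each (query, passage) pair and keep the best 3
scores = reranker.predict([(query, c) for c in candidates])
top = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:3]]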

Grounded prompting

Use prompts that force the model to stay within context and cite sources.

Example:

  • answer only from context
  • quote source passages when possible
  • say “I don’t know” if unsupported

Citations

Show users where the answer came from:

  • file name
  • page number
  • snippet preview

This builds trust and makes debugging much easier.
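
If you retrieve the chunks yourself before generation, the metadata is right there to display; for the PDF example above, PyPDFLoader stores the file path and page number on each document:

# Retrieve once so the same chunks can be shown as citations
docs = retriever.invoke("How do I reset the device?")

for d in docs:
    source = d.metadata.get("source", "unknown")
    page = d.metadata.get("page", "?")
    print(f"[{source}, page {page}] {d.page_content[:120]}...")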


Common mistakes

1) Chunks are too large

If chunks are giant, retrieval gets fuzzy.

2) No overlap

Then important context gets chopped in half like an unfortunate sandwich.

3) Bad PDFs

Some PDFs extract terribly. You may need OCR or better parsing.

4) No evaluation

If you don’t test retrieval quality, you won’t know whether the issue is:

  • retrieval
  • prompt
  • model
  • data quality

5) Letting the model answer without guardrails

Always tell it to use only retrieved context.


How to evaluate your RAG app

You should test 3 things separately:

Retrieval quality

Ask:

  • Did the right chunks come back?
  • Was the right answer present in retrieved docs?

Answer quality

Ask:

  • Is the answer accurate?
  • Is it complete?
  • Does it cite sources?

Latency/cost

Track:

  • embedding cost
  • retrieval speed
  • generation speed
  • token usage

A very practical evaluation set is:

  • 20–50 real user questions
  • expected source documents
  • expected answer points
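
Even a tiny script helps. Here's a sketch of a retrieval hit-rate check, with made-up eval cases standing in for your real questions:

# Each case pairs a question with the file that should be retrieved
eval_set = [
    {"question": "How do I reset the device?", "expected_source": "docs/manual.pdf"},
    {"question": "What is the warranty period?", "expected_source": "docs/manual.pdf"},
]

hits = 0
for case in eval_set:
    docs = retriever.invoke(case["question"])
    if any(d.metadata.get("source") == case["expected_source"] for d in docs):
        hits += 1

print(f"Retrieval hit rate: {hits}/{len(eval_set)}")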

Good first project ideas

Start with one of these:

  • PDF question-answering app
  • Company handbook assistant
  • Support docs chatbot
  • Course notes tutor
  • Codebase documentation assistant

For a first build, I’d recommend:

  1. upload PDFs
  2. ask questions
  3. show citations
  4. keep everything local except the model API

That’s enough to learn the full RAG loop.


If you want a production-ready design

A more serious app usually has:

  • Ingestion service
  • Embedding/index pipeline
  • Vector DB
  • Chat backend
  • Frontend
  • Evaluation dashboard
  • User auth
  • Document permissions
  • Observability/logging

A common production flow:

  1. user uploads document
  2. backend extracts text
  3. text is chunked and embedded
  4. vectors stored with metadata
  5. chat query retrieves relevant chunks
  6. model answers with citations
  7. logs stored for debugging and eval
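
As a sketch, the chat backend piece of that flow can start as a single FastAPI endpoint, where answer_question() is a placeholder for your retrieve-then-generate logic:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest):
    # answer_question() is assumed to return the answer text and its sources
    answer, sources = answer_question(req.question)
    return {"answer": answer, "sources": sources}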

Best beginner recommendation

If you want the shortest path, do this:

  • Python
  • Streamlit
  • LangChain
  • Chroma
  • OpenAI API
  • local folder of PDFs

That lets you learn:

  • ingestion
  • chunking
  • embeddings
  • retrieval
  • prompting
  • UI

Use Cases

RAG is an effective solution for business scenarios that involve large volumes of documentation and complex rules, where users need reliable and authoritative answers. It is particularly well-suited for enhancing LLM-based chatbots by incorporating proprietary or domain-specific knowledge, while significantly reducing the risk of hallucinations.

Top comments (2)

Tessa Mariam

Great article! It clearly explains how RAG can bridge the gap between generic AI responses and real-world business needs. I especially liked the emphasis on using domain-specific data to improve accuracy and reduce hallucinations—it’s something many organizations struggle with when adopting LLMs. The practical perspective on handling large documentation and business rules makes this very relevant for enterprise use cases. Looking forward to seeing more real-world implementation examples!

tantess

Well Explained