Jotty John

RAG Architecture

RAG stands for Retrieval-Augmented Generation — a technique used in AI to give more accurate and up-to-date answers.

  • Instead of only using what it already “knows”,
  • the AI first searches for relevant information,
  • then uses that information to generate a better answer.

You can build a RAG app by combining 3 core pieces:

  1. Store knowledge from your documents in a searchable form
  2. Retrieve relevant chunks for a user question
  3. Ask an LLM to answer using only that retrieved context

Think of it as: search first, generate second. The model gets better at answering questions about your data without any retraining, which keeps things cheap and simple.

What a RAG app needs

A typical RAG pipeline looks like this:

The basic architecture

1) Data ingestion

Load your source data:

  • PDFs
  • Word docs
  • Markdown files
  • HTML/web pages
  • Database records
  • Internal knowledge bases

2) Chunking

Split large documents into smaller passages, for example:

  • 300–800 tokens per chunk
  • some overlap, like 50–100 tokens

Why? Because embedding and retrieval work better on focused pieces of text.
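
As a rough sketch of the idea (counting words instead of tokens just to keep it simple, and using a placeholder file path):

def chunk_text(text, chunk_size=500, overlap=75):
    # Word counts stand in for tokens here; real pipelines usually
    # count tokens with a tokenizer such as tiktoken.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
    return chunks

chunks = chunk_text(open("data/handbook.txt").read())
print(f"{len(chunks)} chunks")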

3) Embeddings

Convert each chunk into a numeric vector using an embedding model.

Examples:

  • OpenAI embeddings
  • sentence-transformers
  • Cohere embeddings
  • Voyage AI embeddings
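
For example, with sentence-transformers (assuming chunks is the list of passages from the chunking step):

from sentence_transformers import SentenceTransformer

# A small, widely used open embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# One 384-dimensional vector per chunk
vectors = model.encode(chunks)
print(vectors.shape)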

4) Vector database

Store the embeddings and original text.

Popular options:

  • Chroma: easiest for local prototypes
  • FAISS: very fast local similarity search
  • Pinecone: managed hosted service
  • Weaviate
  • Qdrant
  • Milvus
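
Here's a rough sketch using Chroma's own client, reusing chunks and vectors from the previous steps (the metadata values are placeholders):

import chromadb

# Persist the index to a local folder
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs")

# Store each chunk together with its embedding and some metadata
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=vectors.tolist(),
    metadatas=[{"source": "handbook.txt"} for _ in chunks],
)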

5) Retriever

When the user asks a question:

  • embed the question
  • search the vector DB for top-k similar chunks
  • optionally rerank them
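
Continuing the sketch above, a basic retrieval step could look like this (the question is just an example):

# Embed the question with the same model used for the chunks
question = "How do I reset the device?"
q_vec = model.encode([question]).tolist()

# Ask Chroma for the 4 most similar chunks
results = collection.query(query_embeddings=q_vec, n_results=4)
top_chunks = results["documents"][0]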

6) Generation

Send the retrieved chunks + question to the LLM with a prompt like:

Answer using only the provided context. If the answer is not in the context, say you don’t know.

That reduces hallucinations a lot.
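
As a rough sketch with the OpenAI Python SDK, reusing question and top_chunks from the retrieval step:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "\n\n".join(top_chunks)
prompt = (
    "Answer using only the provided context. If the answer is not "
    "in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)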

Easiest stack for a first RAG app

If you want the fastest path, I’d suggest:

  • Python
  • FastAPI or Streamlit
  • LangChain or LlamaIndex
  • Chroma for local vector storage
  • OpenAI or another chat model

That’s a very beginner-friendly setup.

Minimal build plan

Option A: super simple prototype

Use:

  • Python
  • Streamlit
  • Chroma
  • OpenAI API

Good for:

  • uploading docs
  • asking questions in a browser UI
  • learning the flow

Option B: production-ish starter

Use:

  • FastAPI backend
  • React/Next.js frontend
  • Qdrant or Pinecone
  • background ingestion jobs
  • auth + logging + evaluation

Good for:

  • team/internal tool
  • customer-facing app
  • scaling beyond hobby level

A simple MVP flow

Step 1: install packages

For a simple Python version, you’d usually need packages like:

  • langchain
  • chromadb
  • openai
  • pypdf
  • tiktoken
  • streamlit

Step 2: load documents

Read your files from a folder like data/.

Step 3: split into chunks

Use a text splitter with overlap.

Step 4: create embeddings

Generate embeddings for each chunk.

Step 5: store them in Chroma

Persist the vector store locally.

Step 6: build a query function

Take a user question, retrieve top matches, and send them to the model.

Step 7: add a UI

A simple chat box is enough for v1.
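
A minimal Streamlit version could look like this, where answer_question() is a placeholder for the query function from step 6:

import streamlit as st

st.title("Ask my documents")

question = st.text_input("Your question")

if question:
    # answer_question() wraps retrieve + generate from step 6
    answer = answer_question(question)
    st.write(answer)

Run it with streamlit run app.py and you have a usable v1.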


Tiny example in Python

Here’s the shape of a minimal RAG script:

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# 1. Load document
loader = PyPDFLoader("docs/manual.pdf")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)
chunks = splitter.split_documents(docs)

# 3. Embed + store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 5. LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# 6. QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

# 7. Ask a question
query = "How do I reset the device?"
answer = qa.run(query)
print(answer)

That’s the basic idea. Real apps add citations, source display, better prompts, and error handling.


What makes a RAG app actually good

A lot of RAG apps work, but only a few work well. The difference usually comes from these:

Better chunking

Bad chunks = bad retrieval.

Tips:

  • keep sections semantically meaningful
  • don’t split tables badly
  • preserve headings/metadata
  • include source info like filename and page number

Metadata filtering

Store metadata such as:

  • file name
  • page number
  • department
  • date
  • access level

Then retrieve with filters, like:

  • only HR docs
  • only 2025 policies
  • only docs user is allowed to see
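
With the Chroma client from earlier, a filtered query might look like this (the department field is a made-up example of metadata you'd attach at ingestion time):

# Only search chunks tagged as HR documents
results = collection.query(
    query_embeddings=q_vec,
    n_results=4,
    where={"department": "HR"},
)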

Hybrid search

Use:

  • vector search + keyword search

This helps when users ask for:

  • exact product codes
  • error IDs
  • names/acronyms
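
One way to sketch this in LangChain is to blend a BM25 keyword retriever with the vector retriever, assuming the chunks and vectorstore objects from the example above (BM25Retriever needs the rank_bm25 package):

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword-based retriever over the same chunks
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4

# Dense retriever from the existing vector store
dense = vectorstore.as_retriever(search_kwargs={"k": 4})

# Blend the two ranked lists; the weights are a tuning knob
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
docs = hybrid.invoke("error E1234 on startup")  # example query with an exact code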

Reranking

After getting top 10 chunks, rerank them with a stronger relevance model and keep top 3–5.

This often improves quality a lot.
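
A common way to do this is with a cross-encoder from sentence-transformers; here's a sketch that rescores candidates from the retriever in the example above:

from sentence_transformers import CrossEncoder

# A small cross-encoder trained for passage relevance scoring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset the device?"
# Candidates from the retriever above (raise k to ~10 for reranking)
candidates = [d.page_content for d in retriever.invoke(query)]

# Score each (query, passage) pair and keep the best 3
scores = reranker.predict([(query, c) for c in candidates])
top = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:3]]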

Grounded prompting

Use prompts that force the model to stay within context and cite sources.

Example:

  • answer only from context
  • quote source passages when possible
  • say “I don’t know” if unsupported

Citations

Show users where the answer came from:

  • file name
  • page number
  • snippet preview

This builds trust and makes debugging much easier.
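
If you retrieve the chunks yourself before generation, the metadata is right there to display; for the PDF example above, PyPDFLoader stores the file path and page number on each document:

# Retrieve once so the same chunks can be shown as citations
docs = retriever.invoke("How do I reset the device?")

for d in docs:
    source = d.metadata.get("source", "unknown")
    page = d.metadata.get("page", "?")
    print(f"[{source}, page {page}] {d.page_content[:120]}...")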


Common mistakes

1) Chunks are too large

If chunks are giant, retrieval gets fuzzy.

2) No overlap

Then important context gets chopped in half like an unfortunate sandwich.

3) Bad PDFs

Some PDFs extract terribly. You may need OCR or better parsing.

4) No evaluation

If you don’t test retrieval quality, you won’t know whether the issue is:

  • retrieval
  • prompt
  • model
  • data quality

5) Letting the model answer without guardrails

Always tell it to use only retrieved context.


How to evaluate your RAG app

You should test 3 things separately:

Retrieval quality

Ask:

  • Did the right chunks come back?
  • Was the right answer present in retrieved docs?

Answer quality

Ask:

  • Is the answer accurate?
  • Is it complete?
  • Does it cite sources?

Latency/cost

Track:

  • embedding cost
  • retrieval speed
  • generation speed
  • token usage

A very practical evaluation set is:

  • 20–50 real user questions
  • expected source documents
  • expected answer points
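
Even a tiny script helps. Here's a sketch of a retrieval hit-rate check, with made-up eval cases standing in for your real questions:

# Each case pairs a question with the file that should be retrieved
eval_set = [
    {"question": "How do I reset the device?", "expected_source": "docs/manual.pdf"},
    {"question": "What is the warranty period?", "expected_source": "docs/manual.pdf"},
]

hits = 0
for case in eval_set:
    docs = retriever.invoke(case["question"])
    if any(d.metadata.get("source") == case["expected_source"] for d in docs):
        hits += 1

print(f"Retrieval hit rate: {hits}/{len(eval_set)}")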

Good first project ideas

Start with one of these:

  • PDF question-answering app
  • Company handbook assistant
  • Support docs chatbot
  • Course notes tutor
  • Codebase documentation assistant

For a first build, I’d recommend:

  1. upload PDFs
  2. ask questions
  3. show citations
  4. keep everything local except the model API

That’s enough to learn the full RAG loop.


If you want a production-ready design

A more serious app usually has:

  • Ingestion service
  • Embedding/index pipeline
  • Vector DB
  • Chat backend
  • Frontend
  • Evaluation dashboard
  • User auth
  • Document permissions
  • Observability/logging

A common production flow:

  1. user uploads document
  2. backend extracts text
  3. text is chunked and embedded
  4. vectors stored with metadata
  5. chat query retrieves relevant chunks
  6. model answers with citations
  7. logs stored for debugging and eval
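
As a sketch, the chat backend piece of that flow can start as a single FastAPI endpoint, where answer_question() is a placeholder for your retrieve-then-generate logic:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest):
    # answer_question() is assumed to return the answer text and its sources
    answer, sources = answer_question(req.question)
    return {"answer": answer, "sources": sources}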

Best beginner recommendation

If you want the shortest path, do this:

  • Python
  • Streamlit
  • LangChain
  • Chroma
  • OpenAI API
  • local folder of PDFs

That lets you learn:

  • ingestion
  • chunking
  • embeddings
  • retrieval
  • prompting
  • UI

Use Cases

RAG is an effective solution for business scenarios that involve large volumes of documentation and complex rules, where users need reliable and authoritative answers. It is particularly well-suited for enhancing LLM-based chatbots by incorporating proprietary or domain-specific knowledge, while significantly reducing the risk of hallucinations.

Top comments (2)

Tessa Mariam

Great article! It clearly explains how RAG can bridge the gap between generic AI responses and real-world business needs. I especially liked the emphasis on using domain-specific data to improve accuracy and reduce hallucinations—it’s something many organizations struggle with when adopting LLMs. The practical perspective on handling large documentation and business rules makes this very relevant for enterprise use cases. Looking forward to seeing more real-world implementation examples!

tantess

Well Explained