DEV Community

Thesius Code

Building RAG Applications with LangChain: A Production-Ready Guide

Fine-tuning is expensive, slow, and usually overkill. Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your data without touching model weights — and you can have a working prototype in an afternoon.

But "working prototype" and "production system" are very different things. The gap between a demo that retrieves something and a pipeline that retrieves the right thing with good latency and manageable cost is where most teams get stuck. This guide bridges that gap.

Architecture Overview

A RAG pipeline has two phases:

Indexing (offline):

  1. Load documents from various sources
  2. Split into semantically meaningful chunks
  3. Generate embedding vectors
  4. Store in a vector database

Retrieval + Generation (runtime):

  1. User asks a question
  2. Embed the question using the same model
  3. Search the vector store for similar chunks
  4. Feed chunks + question to the LLM
  5. Return the generated answer with source attribution
┌─────────┐    ┌──────────┐    ┌────────────┐    ┌──────────┐
│Documents│───>│ Chunking │───>│ Embeddings │───>│VectorDB  │
└─────────┘    └──────────┘    └────────────┘    └──────────┘
                                                       │
                                                       ▼
┌─────────┐    ┌──────────┐    ┌────────────┐    ┌──────────┐
│ Answer  │<───│   LLM    │<───│  Prompt +  │<───│Retriever │
└─────────┘    └──────────┘    │  Context   │    └──────────┘
                               └────────────┘
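The two phases above can be sketched end to end with a toy in-memory store. The bag-of-words "embedding" below is purely illustrative (a stand-in for a real embedding model), but the shape of the pipeline is the same:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Indexing (offline): chunk -> embed -> store
corpus = [
    "The pipeline auto-scales between 2 and 10 workers.",
    "Authentication supports OAuth2 and API keys.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Retrieval (runtime): embed the question with the SAME model, rank by similarity
question = "How many workers does auto-scaling allow?"
q_vec = embed(question)
best_chunk, _ = max(index, key=lambda item: cosine(q_vec, item[1]))
print(best_chunk)  # this chunk would be fed to the LLM as context
```

Everything that follows swaps each toy piece for a production-grade component: real loaders, real embeddings, a persistent vector store, and an LLM.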

Setup and Dependencies

pip install langchain langchain-openai langchain-community \
    chromadb tiktoken unstructured pypdf

import os

# Always use environment variables for credentials — never hardcode keys
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")

Step 1: Document Loading

LangChain supports dozens of document loaders. Here's a flexible loader that handles mixed file types from a directory:

from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
    WebBaseLoader,
)


def load_documents(source_dir: str) -> list:
    """Load documents from a directory of mixed file types."""

    loaders = {
        "*.pdf": PyPDFLoader,
        "*.txt": TextLoader,
        "*.md": UnstructuredMarkdownLoader,
    }

    all_docs = []
    for glob_pattern, loader_cls in loaders.items():
        dir_loader = DirectoryLoader(
            source_dir,
            glob=glob_pattern,
            loader_cls=loader_cls,
            show_progress=True,
            use_multithreading=True,
        )
        docs = dir_loader.load()
        all_docs.extend(docs)
        print(f"Loaded {len(docs)} documents matching {glob_pattern}")

    return all_docs


def load_web_documents(urls: list[str]) -> list:
    """Load documents from web URLs."""
    loader = WebBaseLoader(urls)
    return loader.load()


# Usage
docs = load_documents("./data/knowledge_base/")
print(f"Total documents loaded: {len(docs)}")

Step 2: Text Chunking

Chunking is the single most impactful decision for retrieval quality. The wrong chunk size means the LLM either gets too little context or drowns in noise.

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)


def create_chunks(
    documents: list,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list:
    """Split documents into chunks with metadata preservation."""

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
        is_separator_regex=False,
    )

    chunks = splitter.split_documents(documents)

    # Attach chunk metadata for debugging and filtering
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_id"] = i
        chunk.metadata["chunk_size"] = len(chunk.page_content)

    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    print(
        f"Avg chunk size: "
        f"{sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars"
    )

    return chunks


def create_markdown_chunks(documents: list) -> list:
    """Chunk markdown by headers for better semantic boundaries."""

    headers_to_split_on = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]

    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False,
    )

    # Further split large sections with recursive splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )

    all_chunks = []
    for doc in documents:
        md_chunks = md_splitter.split_text(doc.page_content)
        sub_chunks = text_splitter.split_documents(md_chunks)
        all_chunks.extend(sub_chunks)

    return all_chunks

Chunking Strategy Guidelines

Content Type         Chunk Size (chars)   Overlap   Strategy
Technical docs       800–1200             200       Recursive by paragraphs
Legal documents      1000–1500            300       Recursive with high overlap
Code documentation   500–800              100       Markdown header splitting
FAQ / Q&A            300–500              50        Per question-answer pair
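One way to keep these values actionable is a small lookup so splitter parameters live in one place. The profile names and the fallback choice below are our own convention; tune the numbers against your corpus:

```python
# Splitter parameters from the table above, keyed by content type.
# Profile names and the default fallback are illustrative conventions.
CHUNK_PROFILES = {
    "technical_docs": {"chunk_size": 1000, "chunk_overlap": 200},
    "legal": {"chunk_size": 1200, "chunk_overlap": 300},
    "code_docs": {"chunk_size": 600, "chunk_overlap": 100},
    "faq": {"chunk_size": 400, "chunk_overlap": 50},
}


def chunk_params(content_type: str) -> dict:
    """Return splitter kwargs for a content type, defaulting to technical docs."""
    return CHUNK_PROFILES.get(content_type, CHUNK_PROFILES["technical_docs"])
```

These feed straight into the splitter shown earlier, e.g. `create_chunks(docs, **chunk_params("faq"))`.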

Step 3: Vector Store with ChromaDB

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


def create_vector_store(
    chunks: list,
    persist_directory: str = "./chroma_db",
    collection_name: str = "knowledge_base",
) -> Chroma:
    """Create and persist a ChromaDB vector store."""

    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small",
        # Cost: ~$0.02 per 1M tokens — very affordable for most corpora
    )

    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
        collection_name=collection_name,
    )

    # _collection is private API, but fine for a quick sanity check
    print(f"Vector store created with {vector_store._collection.count()} vectors")
    return vector_store


def load_vector_store(
    persist_directory: str = "./chroma_db",
    collection_name: str = "knowledge_base",
) -> Chroma:
    """Load an existing vector store from disk."""

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    return Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
        collection_name=collection_name,
    )

Step 4: Building the Retriever

Basic similarity search is a starting point, but production systems benefit from diversity-aware and multi-query retrieval strategies:

from langchain.retrievers import (
    ContextualCompressionRetriever,
    MultiQueryRetriever,
)
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI


def create_basic_retriever(vector_store: Chroma, k: int = 4):
    """Simple similarity search retriever."""
    return vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )


def create_mmr_retriever(
    vector_store: Chroma, k: int = 4, fetch_k: int = 20
):
    """MMR retriever — balances relevance with diversity to reduce redundancy."""
    return vector_store.as_retriever(
        search_type="mmr",
        search_kwargs={
            "k": k,
            "fetch_k": fetch_k,
            "lambda_mult": 0.7,  # 0 = max diversity, 1 = max relevance
        },
    )


def create_multi_query_retriever(vector_store: Chroma):
    """Generate multiple query variations for better recall."""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

    return MultiQueryRetriever.from_llm(
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
        llm=llm,
    )


def create_compression_retriever(vector_store: Chroma):
    """Retrieve broadly, then compress — extract only the relevant parts."""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)

    base_retriever = vector_store.as_retriever(
        search_kwargs={"k": 6}  # Fetch more, then compress
    )

    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )

Step 5: The RAG Chain

Now connect everything into a chain that takes a question and returns an answer with source attribution:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs: list) -> str:
    """Format retrieved documents for the prompt."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "Unknown")
        formatted.append(
            f"[Source {i}: {source}]\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)


def create_rag_chain(retriever, model_name: str = "gpt-4o"):
    """Create a complete RAG chain with source attribution."""

    llm = ChatOpenAI(model=model_name, temperature=0.1)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant that answers questions
based on the provided context. Follow these rules:

1. Only answer based on the provided context
2. If the context doesn't contain enough information, say so
3. Cite your sources using [Source N] notation
4. Be concise and direct
5. If you're unsure, express your uncertainty

Context:
{context}"""),
        ("human", "{question}"),
    ])

    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain


# Build and use the chain
vector_store = load_vector_store()
retriever = create_mmr_retriever(vector_store)
rag_chain = create_rag_chain(retriever)

# Ask a question
answer = rag_chain.invoke(
    "How do I configure auto-scaling for the data pipeline?"
)
print(answer)

Step 6: Adding Chat History

For conversational RAG, follow-up questions like "Which one is fastest?" need the context of what "one" refers to. This requires rephrasing the question using chat history before retrieval:

from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain.chains.history_aware_retriever import (
    create_history_aware_retriever,
)
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import (
    create_stuff_documents_chain,
)


def create_conversational_rag(retriever):
    """RAG chain with conversation history support."""

    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    # Rephrase the user's question into a standalone query
    contextualize_prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Given the chat history and latest question, "
         "rephrase the question to be standalone. "
         "Do NOT answer the question."),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])

    history_aware_retriever = create_history_aware_retriever(
        llm, retriever, contextualize_prompt
    )

    # Answer using retrieved context
    answer_prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Answer based on the context below. "
         "If unsure, say you don't know.\n\n{context}"),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])

    question_answer_chain = create_stuff_documents_chain(
        llm, answer_prompt
    )

    return create_retrieval_chain(
        history_aware_retriever, question_answer_chain
    )


# Usage with history
conversational_chain = create_conversational_rag(retriever)
chat_history = []

# First question
result = conversational_chain.invoke({
    "input": "What databases does the platform support?",
    "chat_history": chat_history,
})
print(result["answer"])

# Track history
chat_history.extend([
    HumanMessage(content="What databases does the platform support?"),
    AIMessage(content=result["answer"]),
])

# Follow-up — "Which one" is resolved via history
result = conversational_chain.invoke({
    "input": "Which one has the best performance?",
    "chat_history": chat_history,
})
print(result["answer"])

Step 7: Evaluation

You can't improve what you don't measure. Here's a practical evaluation framework using LLM-as-judge:

import time
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    expected_sources: list[str] | None = None


def evaluate_rag(
    chain,
    retriever,
    eval_cases: list[EvalCase],
) -> dict:
    """Evaluate RAG pipeline on a test set."""

    results = {
        "total": len(eval_cases),
        "retrieval_hits": 0,
        "answer_quality": [],
        "latencies": [],
    }

    llm_judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    for case in eval_cases:
        start = time.time()

        retrieved_docs = retriever.invoke(case.question)
        answer = chain.invoke(case.question)

        latency = time.time() - start
        results["latencies"].append(latency)

        # Check if expected sources were retrieved
        if case.expected_sources:
            retrieved_sources = [
                d.metadata.get("source", "") for d in retrieved_docs
            ]
            hit = any(
                exp in src
                for exp in case.expected_sources
                for src in retrieved_sources
            )
            if hit:
                results["retrieval_hits"] += 1

        # LLM-as-judge scoring
        judge_prompt = (
            f"Rate this answer from 1-5:\n"
            f"Question: {case.question}\n"
            f"Expected: {case.expected_answer}\n"
            f"Actual: {answer}\n\n"
            f"Score (1-5):"
        )

        score_response = llm_judge.invoke(judge_prompt)
        try:
            score = int(score_response.content.strip()[0])
        except (ValueError, IndexError):
            score = 3  # Default to neutral on parse failure
        results["answer_quality"].append(score)

    # Aggregate metrics
    results["avg_quality"] = (
        sum(results["answer_quality"]) / len(results["answer_quality"])
    )
    results["avg_latency"] = (
        sum(results["latencies"]) / len(results["latencies"])
    )
    results["retrieval_accuracy"] = (
        results["retrieval_hits"] / results["total"]
    )

    return results


# Define test cases
eval_cases = [
    EvalCase(
        question="How do I set up auto-scaling?",
        expected_answer="Configure min/max instances in the scaling policy...",
        expected_sources=["auto-scaling-guide.md"],
    ),
    EvalCase(
        question="What authentication methods are supported?",
        expected_answer="OAuth2, API keys, and SAML SSO...",
        expected_sources=["auth-docs.md"],
    ),
]

results = evaluate_rag(rag_chain, retriever, eval_cases)
print(f"Retrieval Accuracy: {results['retrieval_accuracy']:.1%}")
print(f"Answer Quality: {results['avg_quality']:.1f}/5")
print(f"Avg Latency: {results['avg_latency']:.2f}s")

Production Tips

1. Chunk size matters more than you think. Start with 800–1000 characters and experiment. Too small = missing context. Too large = noise drowning out the signal.

2. Use hybrid search. Combine vector similarity with keyword search (BM25) for better results on exact-match terms like error codes or product names.
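LangChain ships BM25Retriever and EnsembleRetriever for exactly this; the core idea is just score fusion, sketched here in plain Python with illustrative weights:

```python
# Minimal sketch of hybrid scoring: a weighted sum of a keyword score and a
# vector similarity score. The 0.4/0.6 split is illustrative, not a recommendation.
def hybrid_score(keyword_score: float, vector_score: float,
                 keyword_weight: float = 0.4) -> float:
    """Weighted fusion of keyword and vector relevance scores."""
    return keyword_weight * keyword_score + (1 - keyword_weight) * vector_score


# An exact-match term like an error code can rescue a chunk that embeddings miss
docs = {
    "err_ref": {"keyword": 1.0, "vector": 0.2},   # contains the error code verbatim
    "overview": {"keyword": 0.0, "vector": 0.6},  # topically similar, no exact match
}
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(docs[d]["keyword"], docs[d]["vector"]),
    reverse=True,
)
print(ranked[0])  # err_ref
```

With pure vector search, `overview` would win; the keyword component flips the ranking when the query contains an exact term.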

3. Cache embeddings aggressively. Track file hashes and only re-index documents that actually changed. Re-embedding an unchanged corpus is wasted money.
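A minimal sketch of the hash-tracking idea, assuming a JSON manifest file stored next to the index (the manifest path and format here are our own):

```python
import hashlib
import json
import pathlib

MANIFEST = pathlib.Path("./index_manifest.json")  # assumed location


def file_hash(path: pathlib.Path) -> str:
    """Content hash of a file; changes if and only if the bytes change."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_reindex(paths: list[pathlib.Path]) -> list[pathlib.Path]:
    """Return only the files whose content changed since the last run."""
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = [p for p in paths if seen.get(str(p)) != file_hash(p)]
    # Record the new hashes so unchanged files are skipped next run
    seen.update({str(p): file_hash(p) for p in changed})
    MANIFEST.write_text(json.dumps(seen))
    return changed
```

Run loading and re-embedding only on what `files_to_reindex` returns; everything else keeps its existing vectors.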

4. Monitor retrieval quality in production. Log every query, the retrieved chunks, and user feedback. This data is gold for iterating on your chunking and retrieval strategy.

5. Set token budgets. Calculate: context_tokens + prompt_tokens + max_output_tokens < model_limit. Exceeding the window silently truncates context and kills answer quality.
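The budget rule can be sketched with a pluggable token counter. Swap in tiktoken's encoder for real counts; the whitespace counter below is only a stand-in:

```python
def fits_budget(context_tokens: int, prompt_tokens: int,
                max_output_tokens: int, model_limit: int = 128_000) -> bool:
    """The pre-flight check from tip 5: total demand must fit the window."""
    return context_tokens + prompt_tokens + max_output_tokens <= model_limit


def trim_context(chunks: list[str], count_tokens, budget: int) -> list[str]:
    """Keep chunks in retrieval order until the token budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept


# Crude whitespace tokenizer as a stand-in for a real one (e.g. tiktoken)
count = lambda s: len(s.split())
print(trim_context(["a b c", "d e", "f g h i"], count, budget=6))  # ['a b c', 'd e']
```

Trimming explicitly, before the call, beats letting the provider truncate silently: you control which chunks survive.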

Summary

Building a RAG pipeline is iterative. Start simple, measure quality, and improve one component at a time:

  1. Load and chunk your documents
  2. Build basic retrieval with similarity search
  3. Wire up a simple prompt chain
  4. Evaluate with representative test cases
  5. Improve chunking, retrieval, and prompts based on measured results

Every component in this guide is modular and swappable — replace ChromaDB with Pinecone, swap OpenAI for a local model, or add a reranker between retrieval and generation. The architecture stays the same.


If you found this guide useful, check out DataStack Pro for production-ready data pipeline templates, AI integration patterns, and engineering toolkits you can deploy today.
