Priyadharshini Selvaraj

RAG and LangChain Basics

Introduction

Large Language Models (LLMs) like GPT-4 and Claude are incredibly powerful, but they have a significant limitation: they can only work with information they were trained on. What if you want your AI application to access your company's internal documents, real-time data, or domain-specific knowledge? This is where Retrieval-Augmented Generation (RAG) comes in, and LangChain makes implementing RAG systems remarkably straightforward.

What is RAG?

RAG is an architectural pattern that combines the generative capabilities of LLMs with external knowledge retrieval. Instead of relying solely on the model's pre-trained knowledge, RAG systems first retrieve relevant information from external sources, then use that context to generate more accurate and up-to-date responses.

Traditional LLM vs RAG Architecture

Traditional LLM:
User Query → LLM → Response (limited to training data)

RAG System:
User Query → Retrieve Relevant Docs → LLM + Context → Enhanced Response

Key Benefits of RAG:

  • Access to current, domain-specific information
  • Reduced hallucinations through grounded responses
  • Cost-effective alternative to fine-tuning
  • Easy to update knowledge base without retraining

RAG System Architecture


The RAG pipeline consists of two main phases:

1. Indexing Phase (Data Preparation)

  • Document Loading: Ingest documents from various sources (PDFs, web pages, databases)
  • Text Splitting: Break documents into manageable chunks
  • Embedding: Convert text chunks into vector representations
  • Storage: Store embeddings in a vector database

2. Retrieval Phase (Query Processing)

  • Query Embedding: Convert user query to vector format
  • Similarity Search: Find the most relevant document chunks (see the sketch after this list)
  • Context Assembly: Combine retrieved chunks with the original query
  • Generation: LLM generates response using augmented context
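
Under the hood, the similarity search step usually reduces to comparing vectors. Here is a minimal, library-agnostic sketch (using NumPy and made-up toy embeddings) of how a query vector is matched against stored chunk vectors with cosine similarity:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors divided by their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" for three stored chunks and one query (illustrative values only)
chunk_vectors = {
    "auth_docs": np.array([0.9, 0.1, 0.0]),
    "deploy_docs": np.array([0.1, 0.8, 0.2]),
    "billing_docs": np.array([0.0, 0.2, 0.9]),
}
query_vector = np.array([0.85, 0.15, 0.05])

# Rank chunks by similarity to the query and keep the top k (here k=2)
scores = {name: cosine_similarity(query_vector, vec) for name, vec in chunk_vectors.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)  # e.g. ['auth_docs', 'deploy_docs']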

Enter LangChain

LangChain is a framework that simplifies building applications with LLMs. It provides abstractions and tools that make implementing RAG systems much easier than building from scratch.

Core LangChain Components for RAG

# Document Loading
from langchain.document_loaders import PyPDFLoader, WebBaseLoader

# Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Embeddings
from langchain.embeddings import OpenAIEmbeddings

# Vector Stores
from langchain.vectorstores import Chroma, Pinecone

# LLMs
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

# Chains
from langchain.chains import RetrievalQA

Building a RAG System with LangChain

Let's walk through creating a simple RAG system that can answer questions about your documentation:

Step 1: Document Ingestion and Processing

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load documents
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings
)
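
If you want the index to survive restarts, Chroma can be persisted to disk instead of being rebuilt on every run. A small sketch (the ./chroma_db directory name is just an example):

# Persist the index to disk so it doesn't have to be rebuilt on every run
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"  # example path
)

# Later, reload the existing index instead of re-embedding everything
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)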

Step 2: Setting up the Retrieval Chain

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Initialize LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 3}  # Retrieve top 3 relevant chunks
    )
)

# Query the system
response = qa_chain.run("How do I implement authentication?")
print(response)
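
For debugging, and for showing citations to users, it helps to return the retrieved chunks alongside the answer. A short sketch using RetrievalQA's return_source_documents option:

# Return the retrieved chunks alongside the generated answer
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

result = qa_chain({"query": "How do I implement authentication?"})
print(result["result"])                 # the generated answer
for doc in result["source_documents"]:  # the chunks that were retrieved
    print(doc.metadata.get("source"))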

RAG Query Flow Visualization

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   User Query    │───▶│  Vector Search   │───▶│  Top K Chunks   │
│"How to deploy?" │    │   (Similarity)   │    │   Retrieved     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│Final Response   │◀───│      LLM         │◀───│  Query + Context│
│  (Grounded)     │    │   Generation     │    │   Combined      │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Advanced RAG Patterns

1. Multi-Query Retrieval

Generate multiple variations of the user query to improve retrieval quality:

from langchain.retrievers import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)
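
To see what the retriever is doing, you can enable logging for the generated query variants and then fetch documents as usual (a minimal usage sketch):

import logging

# Log the alternative queries the LLM generates for each user question
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

docs = retriever.get_relevant_documents("How do I deploy the application?")
print(len(docs))  # unique chunks gathered across all query variants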

2. Contextual Compression

Filter retrieved documents to only include relevant portions:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)
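
The compression retriever can be queried directly or dropped into the same RetrievalQA chain from earlier; a short usage sketch:

# Retrieve and compress: only the passages relevant to the query are kept
compressed_docs = compression_retriever.get_relevant_documents(
    "How do I implement authentication?"
)
for doc in compressed_docs:
    print(doc.page_content[:200])

# It can also replace the plain retriever in the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever
)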

3. Self-Query Retrieval

Let the retriever translate a natural-language question into a structured query with metadata filters:

from langchain.retrievers.self_query.base import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Technical documentation",
    metadata_field_info=metadata_field_info
)
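
The metadata_field_info argument describes the metadata fields the retriever is allowed to filter on. A hedged example with made-up fields (category, last_updated) that you would replace with whatever metadata your documents actually carry:

from langchain.chains.query_constructor.base import AttributeInfo

# Hypothetical metadata schema; adjust to your own documents' metadata
metadata_field_info = [
    AttributeInfo(
        name="category",
        description="The section of the docs, e.g. 'auth', 'deployment', 'billing'",
        type="string",
    ),
    AttributeInfo(
        name="last_updated",
        description="Year the page was last updated",
        type="integer",
    ),
]

Note that self-query retrieval also requires a vector store that supports metadata filtering.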

Best Practices for Production RAG Systems

Data Quality and Chunking Strategy

  • Optimal Chunk Size: 500-1000 characters with 100-200 character overlap
  • Preserve Context: Avoid splitting mid-sentence or breaking logical units
  • Metadata Enrichment: Add source, timestamp, and category information (see the sketch below)
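
For example, metadata can be attached to each chunk at indexing time and used later for filtering or citations. A minimal sketch reusing the splits from Step 1 (the field names are illustrative):

from datetime import date

# Enrich each chunk with metadata before it goes into the vector store
for doc in splits:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "indexed_on": date.today().isoformat(),
        "category": "technical-docs",  # illustrative category label
    })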

Retrieval Optimization

  • Hybrid Search: Combine semantic and keyword search (see the sketch after this list)
  • Reranking: Use cross-encoders to improve retrieval quality
  • Evaluation Metrics: Track retrieval precision, recall, and end-to-end accuracy
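
One way to get hybrid search in LangChain is to combine a keyword retriever (BM25) with the vector retriever in an ensemble. A sketch reusing splits and vectorstore from Step 1, and assuming the rank_bm25 package is installed:

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword-based retriever over the same chunks (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 3

# Blend keyword and semantic results (the weights here are illustrative)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vectorstore.as_retriever(search_kwargs={"k": 3})],
    weights=[0.4, 0.6]
)

docs = hybrid_retriever.get_relevant_documents("How do I deploy the application?")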

Performance Considerations

# Async processing for better performance: chains expose async methods,
# e.g. `await qa_chain.arun("How do I implement authentication?")`

# Caching for repeated queries
from langchain.cache import InMemoryCache
import langchain
langchain.llm_cache = InMemoryCache()

# Streaming responses (yields the chain's output chunks as they arrive)
def stream_response(query):
    for chunk in qa_chain.stream({"query": query}):
        yield chunk

Real-World Applications

RAG systems are being used across various domains:

  • Customer Support: AI chatbots with access to knowledge bases
  • Legal Research: Query legal documents and case law
  • Medical Assistance: Access medical literature and guidelines
  • Code Documentation: Developer tools with codebase awareness
  • Educational Content: Personalized learning with curriculum access

Conclusion

RAG represents a paradigm shift in how we build AI applications. By combining the power of LLMs with external knowledge retrieval, we can create systems that are both intelligent and grounded in factual information. LangChain makes this implementation accessible to developers, providing the tools and abstractions needed to build production-ready RAG systems.

The key to successful RAG implementation lies in understanding your data, optimizing your retrieval strategy, and continuously evaluating system performance. As the field evolves, we're seeing more sophisticated approaches like agentic RAG, multi-modal retrieval, and graph-based knowledge systems.

Start small with a simple RAG system using LangChain, then gradually incorporate advanced patterns as your needs grow. The future of AI applications is not just about better models, but about better ways to ground those models in real-world knowledge.


Ready to build your own RAG system? Check out the LangChain documentation and start experimenting with your own documents today!
