Part 2: Document Loaders — Theory, Usage, and Examples
Welcome back to our LangChain RAG series!
In Part 1, we covered the architecture and theory behind Retrieval-Augmented Generation (RAG). We broke down the full pipeline and its major components.
Today, in Part 2, we’ll take our first deep dive — into the Document Loader component.
A RAG pipeline is only as good as the data you feed into it. That journey starts with document loaders.
What Are Document Loaders?
Document loaders are responsible for reading raw content (from files, databases, URLs, APIs, etc.) and converting it into a format that LangChain can work with — typically a list of Document objects.
Each Document contains:
- page_content — the actual text
- metadata — the file name, source, page number, etc.
Document(
    page_content="Technical support is available 24/7 through chat or phone.",
    metadata={"source": "faq.txt"}
)
Why Do We Need Document Loaders?
LLMs like GPT or Gemini can’t natively read PDFs, CSVs, or websites. You need to:
- Extract text
- Clean it up
- Split it into chunks
- Embed & retrieve it
Without good document ingestion, your RAG model is flying blind.
Supported Content Sources in LangChain
LangChain makes it easy to load and process content from a wide variety of source types. Whether you’re working with PDFs, web pages, or structured data, there’s likely a loader (or two) that fits your use case.
Common Source Types & Loaders
- PDFs
  Use cases: Reports, eBooks, scanned documents
  Loaders: PyPDFLoader, PDFMinerLoader, UnstructuredPDFLoader
- Text / Markdown
  Use cases: Notes, technical documentation, blog posts
  Loaders: TextLoader, UnstructuredMarkdownLoader
- Word Documents
  Use cases: Contracts, resumes, letters
  Loader: Docx2txtLoader
- Web Pages
  Use cases: Articles, blog content, public websites
  Loaders: WebBaseLoader (static), SeleniumURLLoader (JavaScript-heavy)
- Images / OCR
  Use cases: Scanned forms, handwritten notes, image-based PDFs
  Loader: UnstructuredImageLoader
- APIs & Structured Data
  Use cases: JSON files, databases, Google Sheets
  Approach: Use custom loaders or make direct API/database calls to fetch content (see the sketch after this list)
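For the "APIs & Structured Data" row, a custom loader is usually just a small class. Here's a minimal sketch assuming a hypothetical JSON endpoint; BaseLoader and Document are real LangChain classes, while the URL and field names are illustrative:
from typing import Iterator
import requests
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class SupportTicketLoader(BaseLoader):
    """Hypothetical loader that pulls support tickets from a JSON API."""
    def __init__(self, api_url: str):
        self.api_url = api_url

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document per ticket returned by the (assumed) API
        for ticket in requests.get(self.api_url).json():
            yield Document(
                page_content=ticket["body"],  # hypothetical field name
                metadata={"source": self.api_url, "id": ticket["id"]},
            )

loader = SupportTicketLoader("https://example.com/api/tickets")
docs = loader.load()  # BaseLoader derives load() from lazy_load()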
How Document Loaders Work
Here are a few examples of how to load documents:
Example 1: Load a PDF
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("files/ai_report.pdf")
docs = loader.load()  # one Document per PDF page

print(docs[0].page_content)
print(docs[0].metadata)  # includes the source file and page number
Example 2: Load a website
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://openai.com/research")
docs = loader.load()  # fetches the page and parses its HTML (requires beautifulsoup4)
Example 3: Load a folder of .txt files
from pathlib import Path
from langchain_community.document_loaders import TextLoader

# One TextLoader per .txt file in the notes/ folder
loaders = [TextLoader(str(file)) for file in Path("notes").glob("*.txt")]

docs = []
for loader in loaders:
    docs.extend(loader.load())
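If you prefer a one-liner, LangChain's DirectoryLoader wraps this same glob-and-load pattern; a minimal equivalent of the loop above:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("notes", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()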
Example 4: Load .csv files
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="Social_Network_Ads.csv")
docs = loader.load()  # one Document per CSV row

print(len(docs))
print(docs[1])
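Since CSVLoader emits one Document per row, it can help to point the source metadata at an identifying column via source_column. A small sketch; the column name below is an assumption about this dataset:
from langchain_community.document_loaders import CSVLoader

# Assumes Social_Network_Ads.csv has a "User ID" column
loader = CSVLoader(file_path="Social_Network_Ads.csv", source_column="User ID")
docs = loader.load()
print(docs[0].metadata)  # {'source': <User ID value>, 'row': 0}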
Pro Tip: Use Metadata
Good metadata (e.g. page number, source file, date) can be used to:
- Improve retrieval accuracy
- Add filters (e.g. date, topic)
- Show context in results
print(docs[0].metadata)
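A minimal sketch of metadata in action: tag documents at load time with your own fields, then filter on them before retrieval. The field names here are our own convention, not a LangChain API:
from langchain_community.document_loaders import TextLoader

docs = TextLoader("faq.txt").load()
for doc in docs:
    # Custom fields: anything serializable can go into metadata
    doc.metadata["topic"] = "support"
    doc.metadata["year"] = 2024

# Simple pre-retrieval filter on a custom field
support_docs = [d for d in docs if d.metadata.get("topic") == "support"]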
Choosing the Right Document Loader in LangChain
LangChain provides a wide range of document loaders tailored to different content types and use cases. Here’s a quick guide to help you choose the best one:
- PyPDFLoader — Ideal for general PDF files with mostly text content.
- PDFMinerLoader — Best for PDFs where layout and positioning of content matter.
- UnstructuredPDFLoader — Great for scanned PDFs or those with mixed content (images + text).
- WebBaseLoader — Use this for simple, static HTML web pages.
- SeleniumURLLoader — Designed for JavaScript-heavy websites like Medium, LinkedIn, or dynamic dashboards.
- TextLoader — Perfect for plain .txt or .md (Markdown) files.
- Docx2txtLoader — Loads content from Microsoft Word .docx files.
- UnstructuredFileLoader — A versatile loader for scanned images, documents, forms, and content in mixed or unknown formats.
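To make the choice mechanical, you can map file extensions to loader classes. A minimal sketch; the mapping itself is our own convention, not a LangChain API:
from pathlib import Path
from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
)

# Our own extension -> loader registry; extend as needed
LOADERS = {
    ".pdf": PyPDFLoader,
    ".txt": TextLoader,
    ".md": TextLoader,
    ".docx": Docx2txtLoader,
    ".csv": CSVLoader,
}

def load_any(path: str):
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for {suffix}")
    return LOADERS[suffix](path).load()

docs = load_any("files/ai_report.pdf")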
Best Practices for Document Loading
- Clean up raw content (remove headers/footers; see the sketch after this list)
- Store source info for traceability
- Use RecursiveCharacterTextSplitter after loading
- Combine multiple loaders in pipelines
- Avoid unnecessary chunking during the loading stage
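As promised in the first bullet, here's a minimal cleanup sketch. It strips lines that repeat across many pages, a cheap heuristic for running headers and footers; the threshold and approach are our own, not a LangChain utility:
from collections import Counter

def strip_repeated_lines(docs, min_repeats=3):
    """Remove lines that appear in at least min_repeats documents,
    e.g. headers/footers repeated on every PDF page."""
    line_counts = Counter(
        line.strip()
        for doc in docs
        for line in set(doc.page_content.splitlines())
        if line.strip()
    )
    boilerplate = {line for line, n in line_counts.items() if n >= min_repeats}
    for doc in docs:
        doc.page_content = "\n".join(
            line for line in doc.page_content.splitlines()
            if line.strip() not in boilerplate
        )
    return docs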
Putting It All Together (Mini App)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF, then split pages into overlapping chunks
loader = PyPDFLoader("sample.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)

for doc in docs[:3]:
    print(doc.metadata)
    print(doc.page_content[:100])
Coming Up Next
In Part 3, we’ll explore Text Splitters — how to break large documents into chunks that actually work well with vector search and LLM prompts.
📖 Catch up on:
Part 1 — What is RAG & Why It Matters
Have a use case in mind?
Drop it in the comments! We’ll include community examples in upcoming parts.