Part 2: Document Loaders — Theory, Usage, and Examples
Welcome back to our LangChain RAG series!
In Part 1, we covered the architecture and theory behind Retrieval-Augmented Generation (RAG). We broke down the full pipeline and its major components.
Today, in Part 2, we’ll take our first deep dive — into the Document Loader component.
A RAG pipeline is only as good as the data you feed into it. That journey starts with document loaders.
What Are Document Loaders?
Document loaders are responsible for reading raw content (from files, databases, URLs, APIs, etc.) and converting it into a format that LangChain can work with — typically a list of Document objects.
Each Document contains:
- page_content — the actual text
- metadata — the file name, source, page number, etc.
Document(
    page_content="Technical support is available 24/7 through chat or phone.",
    metadata={"source": "faq.txt"}
)
Why Do We Need Document Loaders?
LLMs like GPT or Gemini can’t natively read PDFs, CSVs, or websites. You need to:
- Extract text
- Clean it up
- Split it into chunks
- Embed & retrieve it
Without good document ingestion, your RAG model is flying blind.
Supported Content Sources in LangChain
LangChain makes it easy to load and process content from a wide variety of source types. Whether you’re working with PDFs, web pages, or structured data, there’s likely a loader (or two) that fits your use case.
Common Source Types & Loaders
- PDFs
  Use cases: Reports, eBooks, scanned documents
  Loaders: PyPDFLoader, PDFMinerLoader, UnstructuredPDFLoader
- Text / Markdown
  Use cases: Notes, technical documentation, blog posts
  Loaders: TextLoader, UnstructuredMarkdownLoader
- Word Documents
  Use cases: Contracts, resumes, letters
  Loader: Docx2txtLoader
- Web Pages
  Use cases: Articles, blog content, public websites
  Loaders: WebBaseLoader (static), SeleniumURLLoader (JavaScript-heavy)
- Images / OCR
  Use cases: Scanned forms, handwritten notes, image-based PDFs
  Loader: UnstructuredImageLoader
- APIs & Structured Data
  Use cases: JSON files, databases, Google Sheets
  Approach: Use custom loaders or make direct API/database calls to fetch content (see the sketch after this list)
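For the "APIs & Structured Data" row, a custom loader is usually just a small class. Here's a minimal sketch assuming a hypothetical JSON endpoint; BaseLoader and Document are real LangChain classes, while the URL and field names are illustrative:
from typing import Iterator
import requests
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class SupportTicketLoader(BaseLoader):
    """Hypothetical loader that pulls support tickets from a JSON API."""
    def __init__(self, api_url: str):
        self.api_url = api_url

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document per ticket returned by the (assumed) API
        for ticket in requests.get(self.api_url).json():
            yield Document(
                page_content=ticket["body"],  # hypothetical field name
                metadata={"source": self.api_url, "id": ticket["id"]},
            )

loader = SupportTicketLoader("https://example.com/api/tickets")
docs = loader.load()  # BaseLoader derives load() from lazy_load()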
How Document Loaders Work
Here are a few examples of how to load documents:
Example 1: Load a PDF
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("files/ai_report.pdf")
docs = loader.load()  # one Document per PDF page

print(docs[0].page_content)
print(docs[0].metadata)  # includes the source file and page number
Example 2: Load a website
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://openai.com/research")
docs = loader.load()  # fetches the page and parses its HTML (requires beautifulsoup4)
Example 3: Load a folder of .txt files
from pathlib import Path
from langchain_community.document_loaders import TextLoader

# One TextLoader per .txt file in the notes/ folder
loaders = [TextLoader(str(file)) for file in Path("notes").glob("*.txt")]

docs = []
for loader in loaders:
    docs.extend(loader.load())
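If you prefer a one-liner, LangChain's DirectoryLoader wraps this same glob-and-load pattern; a minimal equivalent of the loop above:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("notes", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()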
Example 4: Load .csv files
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="Social_Network_Ads.csv")
docs = loader.load()  # one Document per CSV row

print(len(docs))
print(docs[1])
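Since CSVLoader emits one Document per row, it can help to point the source metadata at an identifying column via source_column. A small sketch; the column name below is an assumption about this dataset:
from langchain_community.document_loaders import CSVLoader

# Assumes Social_Network_Ads.csv has a "User ID" column
loader = CSVLoader(file_path="Social_Network_Ads.csv", source_column="User ID")
docs = loader.load()
print(docs[0].metadata)  # {'source': <User ID value>, 'row': 0}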
Pro Tip: Use Metadata
Good metadata (e.g. page number, source file, date) can be used to:
- Improve retrieval accuracy
- Add filters (e.g. date, topic)
- Show context in results
print(docs[0].metadata)
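A minimal sketch of metadata in action: tag documents at load time with your own fields, then filter on them before retrieval. The field names here are our own convention, not a LangChain API:
from langchain_community.document_loaders import TextLoader

docs = TextLoader("faq.txt").load()
for doc in docs:
    # Custom fields: anything serializable can go into metadata
    doc.metadata["topic"] = "support"
    doc.metadata["year"] = 2024

# Simple pre-retrieval filter on a custom field
support_docs = [d for d in docs if d.metadata.get("topic") == "support"]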
Choosing the Right Document Loader in LangChain
LangChain provides a wide range of document loaders tailored to different content types and use cases. Here’s a quick guide to help you choose the best one:
- PyPDFLoader — Ideal for general PDF files with mostly text content.
- PDFMinerLoader — Best for PDFs where layout and positioning of content matter.
- UnstructuredPDFLoader — Great for scanned PDFs or those with mixed content (images + text).
- WebBaseLoader — Use this for simple, static HTML web pages.
- SeleniumURLLoader — Designed for JavaScript-heavy websites like Medium, LinkedIn, or dynamic dashboards.
- TextLoader — Perfect for plain .txt or .md (Markdown) files.
- Docx2txtLoader — Loads content from Microsoft Word .docx files.
- UnstructuredFileLoader — A versatile loader for scanned images, documents, forms, and content in mixed or unknown formats.
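To make the choice mechanical, you can map file extensions to loader classes. A minimal sketch; the mapping itself is our own convention, not a LangChain API:
from pathlib import Path
from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
)

# Our own extension -> loader registry; extend as needed
LOADERS = {
    ".pdf": PyPDFLoader,
    ".txt": TextLoader,
    ".md": TextLoader,
    ".docx": Docx2txtLoader,
    ".csv": CSVLoader,
}

def load_any(path: str):
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for {suffix}")
    return LOADERS[suffix](path).load()

docs = load_any("files/ai_report.pdf")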
Best Practices for Document Loading
- Clean up raw content (remove headers/footers; see the sketch after this list)
- Store source info for traceability
- Use RecursiveCharacterTextSplitter after loading
- Combine multiple loaders in pipelines
- Avoid unnecessary chunking during the loading stage
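As promised in the first bullet, here's a minimal cleanup sketch. It strips lines that repeat across many pages, a cheap heuristic for running headers and footers; the threshold and approach are our own, not a LangChain utility:
from collections import Counter

def strip_repeated_lines(docs, min_repeats=3):
    """Remove lines that appear in at least min_repeats documents,
    e.g. headers/footers repeated on every PDF page."""
    line_counts = Counter(
        line.strip()
        for doc in docs
        for line in set(doc.page_content.splitlines())
        if line.strip()
    )
    boilerplate = {line for line, n in line_counts.items() if n >= min_repeats}
    for doc in docs:
        doc.page_content = "\n".join(
            line for line in doc.page_content.splitlines()
            if line.strip() not in boilerplate
        )
    return docs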
Putting It All Together (Mini App)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF, then split pages into overlapping chunks
loader = PyPDFLoader("sample.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)

for doc in docs[:3]:
    print(doc.metadata)
    print(doc.page_content[:100])
Coming Up Next
In Part 3, we’ll explore Text Splitters — how to break large documents into chunks that actually work well with vector search and LLM prompts.
📖 Catch up on:
Part 1 — What is RAG & Why It Matters
Have a use case in mind?
Drop it in the comments! We’ll include community examples in upcoming parts.