Dharmendra Singh

Posted on • Originally published at Medium
Building RAG Applications with LangChain(Part-2)

Part 2: Document Loaders — Theory, Usage, and Examples

Welcome back to our LangChain RAG series!

In Part 1, we covered the architecture and theory behind Retrieval-Augmented Generation (RAG). We broke down the full pipeline and its major components.

Today, in Part 2, we’ll take our first deep dive — into the Document Loader component.

A RAG pipeline is only as good as the data you feed into it. That journey starts with document loaders.

What Are Document Loaders?

Document loaders are responsible for reading raw content (from files, databases, URLs, APIs, etc.) and converting it into a format that LangChain can work with — typically a list of Document objects.

Each Document contains:

  • page_content — the actual text
  • metadata — file name, source, page number, etc.
Document(  
    page_content="Technical support is available 24/7 through chat or phone.",  
    metadata={"source": "faq.txt"}  
)

Why Do We Need Document Loaders?

LLMs like GPT or Gemini can’t natively read PDFs, CSVs, or websites. You need to:

  • Extract text
  • Clean it up
  • Split it into chunks
  • Embed & retrieve it

Without good document ingestion, your RAG model is flying blind.

Supported Content Sources in LangChain

LangChain makes it easy to load and process content from a wide variety of source types. Whether you’re working with PDFs, web pages, or structured data, there’s likely a loader (or two) that fits your use case.

Common Source Types & Loaders

  • PDFs Use cases: Reports, eBooks, scanned documents Loaders: PyPDFLoader, PDFMinerLoader, UnstructuredPDFLoader
  • Text / Markdown Use cases: Notes, technical documentation, blog posts Loaders: TextLoader, UnstructuredMarkdownLoader
  • Word Documents Use cases: Contracts, resumes, letters Loader: Docx2txtLoader
  • Web Pages Use cases: Articles, blog content, public websites Loaders: WebBaseLoader (static), SeleniumURLLoader (JavaScript-heavy)
  • Images / OCR Use cases: Scanned forms, handwritten notes, image-based PDFs Loader: UnstructuredImageLoader
  • APIs & Structured Data Use cases: JSON files, databases, Google Sheets Approach: Use custom loaders or make direct API/database calls to fetch content
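
For API or database sources, you typically fetch rows yourself and wrap them in Document objects. Here's a minimal sketch using a simplified stand-in for LangChain's Document class; the `records` list and the `support_api` source name are hypothetical placeholders for a real API response.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_from_records(records, source):
    """Convert API/database rows into Document objects with traceable metadata."""
    return [
        Document(
            page_content=rec["text"],
            metadata={"source": source, "id": rec["id"]},
        )
        for rec in records
    ]

# Simulated API response (in practice, fetch with requests or a DB driver)
records = [
    {"id": 1, "text": "Refunds are processed within 5 business days."},
    {"id": 2, "text": "Support is available 24/7 via chat."},
]
docs = load_from_records(records, source="support_api")
print(len(docs))          # 2
print(docs[0].metadata)   # {'source': 'support_api', 'id': 1}
```

The same pattern works for any structured source: query it, map each row to `page_content`, and keep identifiers in `metadata` for traceability.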

How Document Loaders Work

Here are a few examples of how to load documents:

Example 1: Load a PDF

from langchain_community.document_loaders import PyPDFLoader  

loader = PyPDFLoader("files/ai_report.pdf")  
docs = loader.load()  

print(docs[0].page_content)  
print(docs[0].metadata)

Example 2: Load a website

from langchain_community.document_loaders import WebBaseLoader  
loader = WebBaseLoader("https://openai.com/research")  
docs = loader.load()

Example 3: Load a folder of .txt files

from langchain_community.document_loaders import TextLoader  
from pathlib import Path  

loaders = [TextLoader(str(file)) for file in Path("notes").glob("*.txt")]  
docs = []  
for loader in loaders:  
    docs.extend(loader.load())

Example 4: Load CSV files

from langchain_community.document_loaders import CSVLoader  

loader = CSVLoader(file_path='Social_Network_Ads.csv')  

docs = loader.load()  

print(len(docs))  
print(docs[1])
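
CSVLoader emits one Document per row, with the row's columns rendered as "column: value" lines. The stand-in below mimics that behavior with plain Python so you can see the shape of the output; the column names and `ads.csv` source are hypothetical, not the real Social_Network_Ads.csv.

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical CSV content standing in for a real file on disk
raw = "Age,Salary\n25,50000\n40,90000\n"

# One Document per row, "column: value" per line -- the same shape CSVLoader produces
docs = [
    Document(
        page_content="\n".join(f"{k}: {v}" for k, v in row.items()),
        metadata={"source": "ads.csv", "row": i},
    )
    for i, row in enumerate(csv.DictReader(io.StringIO(raw)))
]

print(docs[1].page_content)
print(docs[1].metadata)   # {'source': 'ads.csv', 'row': 1}
```

This row-per-Document shape is why `len(docs)` above equals the number of data rows, not 1.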

Pro Tip: Use Metadata

Good metadata (e.g. page number, source file, date) can be used to:

  • Improve retrieval accuracy
  • Add filters (e.g. date, topic)
  • Show context in results

print(docs[0].metadata)
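
As a quick illustration of metadata filtering, here is a minimal sketch using a simplified stand-in for LangChain's Document class; the `year` field and sample contents are made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

docs = [
    Document("Q3 revenue grew 12%.", {"source": "report_2023.pdf", "page": 4, "year": 2023}),
    Document("Q3 revenue grew 8%.", {"source": "report_2022.pdf", "page": 4, "year": 2022}),
]

# Filter retrieved documents by a metadata field before building the prompt
recent = [d for d in docs if d.metadata.get("year") == 2023]
for d in recent:
    print(f"[{d.metadata['source']} p.{d.metadata['page']}] {d.page_content}")
```

Vector stores like Chroma and FAISS expose similar metadata filters at query time, so the fields you attach during loading pay off at retrieval.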

Choosing the Right Document Loader in LangChain

LangChain provides a wide range of document loaders tailored to different content types and use cases. Here’s a quick guide to help you choose the best one:

  • PyPDFLoader — Ideal for general PDF files with mostly text content.
  • PDFMinerLoader — Best for PDFs where layout and positioning of content matter.
  • UnstructuredPDFLoader — Great for scanned PDFs or those with mixed content (images + text).
  • WebBaseLoader — Use this for simple, static HTML web pages.
  • SeleniumURLLoader — Designed for JavaScript-heavy websites like Medium, LinkedIn, or dynamic dashboards.
  • TextLoader — Perfect for plain .txt or .md (Markdown) files.
  • Docx2txtLoader — Loads content from Microsoft Word .docx files.
  • UnstructuredFileLoader — A versatile loader for scanned images, documents, forms, and content in mixed or unknown formats.
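
In a pipeline that ingests mixed file types, it's handy to route each file to a loader by extension. A small sketch of that idea; the mapping and fallback name here are illustrative choices, not a LangChain API.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a suitable loader class name
LOADER_BY_EXT = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".docx": "Docx2txtLoader",
}

def pick_loader(path: str) -> str:
    """Return the loader name for a file, falling back to a generic loader."""
    ext = Path(path).suffix.lower()
    return LOADER_BY_EXT.get(ext, "UnstructuredFileLoader")

print(pick_loader("report.pdf"))   # PyPDFLoader
print(pick_loader("scan.tiff"))    # UnstructuredFileLoader
```

In real code you would map extensions to loader classes rather than names and instantiate them directly.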

Best Practices for Document Loading

  • Clean up raw content (remove headers/footers)
  • Store source info for traceability
  • Use RecursiveCharacterTextSplitter after loading
  • Combine multiple loaders in pipelines
  • Avoid unnecessary chunking during loading stage
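
The first practice above (cleaning raw content) can be as simple as dropping repeated header/footer lines after loading. A minimal sketch; the "Page N" and "CONFIDENTIAL" patterns are hypothetical examples of what a real document might repeat on every page.

```python
import re

def clean_page(text: str) -> str:
    """Strip repeated header/footer lines and collapse excess blank lines."""
    lines = text.splitlines()
    # Drop lines that are only a page number or a repeated banner
    lines = [ln for ln in lines if not re.fullmatch(r"\s*(Page \d+|CONFIDENTIAL)\s*", ln)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

raw = "CONFIDENTIAL\nTechnical support is available 24/7.\nPage 3"
print(clean_page(raw))   # Technical support is available 24/7.
```

Running cleanup before splitting keeps boilerplate out of your chunks, so it never pollutes embeddings or retrieved context.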

Putting It All Together (Mini App)

from langchain_community.document_loaders import PyPDFLoader  
from langchain_text_splitters import RecursiveCharacterTextSplitter  

loader = PyPDFLoader("sample.pdf")  
documents = loader.load()  

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)  
docs = splitter.split_documents(documents)  

for doc in docs[:3]:  
    print(doc.metadata)  
    print(doc.page_content[:100])


Coming Up Next

In Part 3, we’ll explore Text Splitters — how to break large documents into chunks that actually work well with vector search and LLM prompts.

📖 Catch up on:

Part 1 — What is RAG & Why It Matters

Have a use case in mind?

Drop it in the comments! We’ll include community examples in upcoming parts.
