Have you ever wanted to chat with your PDF files using AI? The simplest way is to load all your documents and send them to the LLM with every question. But this approach has a big problem: it's expensive and doesn't scale.
In this tutorial, we'll compare two approaches:
- Document Injection: Load everything (simple but expensive)
- RAG (Retrieval Augmented Generation): Load only what's needed (smart and efficient)
Try it yourself: I've built an interactive web app where you can test both approaches with your own documents and see the token usage difference in real-time: https://f861194c6f5a8196b6.gradio.live/
Upload your documents and compare Document Injection vs. Full RAG mode - watch how token counts change dramatically between the two approaches!
What You'll Learn
- How document injection works and why it's expensive
- What RAG is and how it saves money
- Real token usage comparisons
- How to implement both systems
- When to use each approach
Prerequisites
You need:
- Python 3.8+
- Basic understanding of Python and APIs
- OpenAI API key (or Google AI key)
- Some documents to test with
The Problem
Let's say you have 10 PDF documents with 20 pages each. That's about 100,000 tokens. If you send all this with every question:
- Cost per question: ~$3.00 (using GPT-4)
- 10 questions: $30.00
- 100 questions: $300.00
This doesn't scale. Your costs explode quickly.
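If you want to check that arithmetic yourself, here's a quick sketch - the $30 per million input tokens is an assumed GPT-4-class rate, so substitute your model's current pricing:

# cost_per_question.py - back-of-the-envelope math for sending everything every time
TOKENS_PER_QUESTION = 100_000        # 10 PDFs x 20 pages, roughly 500 tokens per page
PRICE_PER_1M_INPUT_TOKENS = 30.00    # assumed GPT-4-class input price in USD

cost_per_question = TOKENS_PER_QUESTION / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
print(f"Per question: ${cost_per_question:.2f}")                 # $3.00
print(f"After 100 questions: ${cost_per_question * 100:.2f}")    # $300.00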
Approach 1: Document Injection
How It Works
Document Injection loads ALL your documents into the system prompt. Every time you ask a question, the LLM processes everything.
Here's the flow:
User asks question
↓
Load ALL documents (PDFs, text files)
↓
Send everything to LLM
↓
LLM processes ALL tokens
↓
Get answer
The Code
First, we need a function to load documents:
# load_documents.py
from pathlib import Path
from pypdf import PdfReader
def get_documents(folder='docs'):
    """Load all text, markdown, and PDF files from a folder."""
    path = Path(folder)
    contents = []
    for file in path.rglob('*'):
        if file.suffix in {'.txt', '.md', '.pdf'}:
            if file.suffix == '.pdf':
                # Extract text from PDF
                content = ''
                reader = PdfReader(file)
                for page in reader.pages:
                    content += '\n' + page.extract_text()
            else:
                # Read text files
                content = file.read_text(encoding='utf-8')
            contents.append({str(file): content})
    return contents
Now the chat system:
# documents_injection_system.py
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from load_documents import get_documents

def main():
    load_dotenv()  # reads OPENAI_API_KEY from .env
    llm = ChatOpenAI(model='gpt-4o-mini')
    # Load ALL documents into the system prompt
    system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"
    messages = [SystemMessage(content=system_prompt)]
    while True:
        question = input("Enter a question: ")
        if question == 'exit':
            break
        messages.append(HumanMessage(content=question))
        response = llm.invoke(messages)
        messages.append(response)
        print(response.content)

if __name__ == "__main__":
    main()
The key point: get_documents() loads EVERYTHING, and this gets sent with EVERY question.
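You can measure this yourself. Here's an optional sketch using OpenAI's tiktoken tokenizer (not in the tutorial's package list, so run pip install tiktoken first):

# count_injection_tokens.py - how many tokens does the injected system prompt cost?
import tiktoken
from load_documents import get_documents

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by the GPT-4o model family
system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"
print(f"System prompt tokens: {len(encoding.encode(system_prompt))}")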
Token Usage: Document Injection
Test with 3 PDFs (10 pages each) + 2 text files = ~15,000 tokens total
First question: "What is document1.pdf about?"
- Tokens sent: 15,010
- Cost: $0.15
Second question: "Summarize document2.pdf"
- Tokens sent: ~15,500
- Cost: $0.155
After 10 questions: ~$1.60 total
Every question processes ALL 15,000 tokens!
Pros and Cons
Pros:
- Simple to implement
- LLM has access to everything
Cons:
- Expensive - every query processes all documents
- Slow with large documents
- Doesn't scale beyond small document sets
- Hits token limits quickly
Approach 2: RAG System
What is RAG?
RAG (Retrieval Augmented Generation) is smarter. Instead of sending everything, it:
- Converts documents into numbers (embeddings)
- Stores them in a database
- When you ask a question, finds only the relevant documents
- Sends only those to the LLM
How It Works
User asks question
↓
Convert question to embedding
↓
Search database for similar documents
↓
Get top 2 most relevant documents
↓
Send ONLY those to LLM
↓
Get answer
This means instead of processing 100,000 tokens, you might only process 1,000 tokens!
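If the "convert to embeddings and search" step feels abstract, here is a tiny standalone sketch of the retrieval idea. It uses the same sentence-transformers model as the RAG code below; the documents and question are made up for illustration:

# retrieval_idea.py - minimal sketch of embedding-based retrieval (illustration only)
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]
question = "How long do I have to return a product?"

# Embed everything, then rank documents by cosine similarity to the question
doc_embeddings = model.encode(documents)
question_embedding = model.encode(question)
scores = cos_sim(question_embedding, doc_embeddings)[0]

best = int(scores.argmax())
print(f"Most relevant document: {documents[best]}")

A vector store does exactly this, just with an index built to handle many documents.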
The Code
# full_rag_system.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import PyPDFDirectoryLoader, DirectoryLoader, TextLoader
load_dotenv()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
directory = 'docs'
# Initialize LLM
llm = ChatOpenAI(model='gpt-4o-mini')
# Load documents
pdf_loader = PyPDFDirectoryLoader(directory)
pdf_docs = pdf_loader.load()
text_loader = DirectoryLoader(directory, glob='**/*.txt', loader_cls=TextLoader)
text_docs = text_loader.load()
docs = pdf_docs + text_docs
# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(docs)
# Chat loop
while True:
    question = input("Enter a question: ")
    if question == 'exit':
        break
    # KEY: Only retrieve top 2 most relevant documents
    retrieved_docs = vector_store.similarity_search(question, k=2)
    context = "\n".join([doc.page_content for doc in retrieved_docs])
    # System prompt with ONLY relevant context
    system_prompt = f"Answer questions about these documents: {context}"
    messages = [SystemMessage(content=system_prompt)]
    messages.append(HumanMessage(content=question))
    response = llm.invoke(messages)
    print(response.content)
The magic happens at k=2 - we only get the 2 most relevant documents, not all of them!
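If you're curious which chunks the retriever actually picked, a small debugging helper (a hypothetical addition, not part of the script above) makes it visible. LangChain's document loaders record the source file path in each document's metadata:

# Debugging helper: show what similarity_search returned for a question
def show_retrieved(vector_store, question, k=2):
    for doc in vector_store.similarity_search(question, k=k):
        source = doc.metadata.get('source', 'unknown')  # loaders store the file path here
        print(f"Retrieved from {source}: {doc.page_content[:80]}...")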
Token Usage: RAG System
Same test: 3 PDFs + 2 text files = ~15,000 tokens total
First question: "What is document1.pdf about?"
- Tokens sent: 1,010 (only 2 relevant chunks!)
- Cost: $0.01
Second question: "Summarize document2.pdf"
- Tokens sent: ~1,010
- Cost: $0.01
After 10 questions: ~$0.11 total
RAG only sends relevant documents, not everything!
Pros and Cons
Pros:
- 90-95% cost reduction
- Works with thousands of documents
- Fast responses
- Only processes relevant information
Cons:
- More complex setup
- Needs embeddings model
- Initial indexing takes time
Head-to-Head Comparison
Token Usage
| Metric | Document Injection | RAG System | Savings |
|---|---|---|---|
| Tokens per query | ~15,000 | ~1,000 | 93% |
| Cost per query (GPT-4o-mini) | $0.15 | $0.01 | 93% |
| 100 queries | $15.00 | $1.00 | 93% |
| 1,000 queries | $150.00 | $10.00 | 93% |
Scalability
| Documents | Document Injection | RAG System |
|---|---|---|
| 10 files | Works (expensive) | Works (cheap) |
| 100 files | Very expensive | Works great |
| 1,000 files | Exceeds token limits | Works great |
| 10,000 files | Impossible | Works great |
Performance
Document Injection:
- Initial load: Fast
- Query speed: Slow (large context)
- Cost: High
RAG:
- Initial load: Slower (builds embeddings)
- Query speed: Fast (small context)
- Cost: Low
Getting Started
Step 1: Setup
# Create project
mkdir document-chat
cd document-chat
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Create folders
mkdir docs
touch .env
Step 2: Install Packages
pip install langchain langchain-openai langchain-huggingface
pip install langchain-community pypdf python-dotenv
pip install sentence-transformers
Step 3: Configure
Create .env file:
OPENAI_API_KEY=your_key_here
Step 4: Add Documents
Put some PDF or text files in the docs/ folder.
Step 5: Run
# Try Document Injection
python documents_injection_system.py
# Try RAG
python full_rag_system.py
Common Issues
Issue 1: Tokenizer Warning
If you see warnings about TOKENIZERS_PARALLELISM:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Issue 2: Wrong Documents Retrieved
If RAG returns irrelevant documents:
# Get more documents
retrieved_docs = vector_store.similarity_search(question, k=5)
# Or split documents into smaller chunks first, then index the chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(docs)
vector_store.add_documents(split_docs)
Issue 3: Out of Memory
For large document sets:
# Process in batches
batch_size = 100
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    vector_store.add_documents(batch)
When to Use Each
Use Document Injection:
- Very small document set (less than 5 pages total)
- Quick prototyping
- You need ALL context for every answer
Use RAG:
- More than 10 pages of documents
- Cost matters
- Production applications
- Any real-world use case
Simple rule: If you're building anything real, use RAG.
Making RAG Better
Once your basic RAG works, you can improve it:
1. Split Documents Better
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_docs = splitter.split_documents(docs)
2. Save Vector Store
Instead of rebuilding every time:
from langchain_community.vectorstores import Chroma
vector_store = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
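Note that Chroma also needs the chromadb package (pip install chromadb), which isn't in the earlier install list. The payoff is that later runs can reconnect to the persisted collection instead of re-embedding everything. A sketch, assuming docs and embeddings are built the same way as in full_rag_system.py:

# Persist once, reload on later runs (sketch)
from langchain_community.vectorstores import Chroma

vector_store = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

# Only index on the very first run; afterwards the collection already lives in ./chroma_db
if not vector_store.get()['ids']:
    vector_store.add_documents(docs)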
3. Get More Documents
If answers aren't good enough:
# Increase k to get more context
retrieved_docs = vector_store.similarity_search(question, k=5)
Real Cost Example
Scenario: Customer support chatbot with 1,000 product manuals
Document Injection:
- Total: ~500,000 tokens
- Problem: Exceeds token limits!
- Impossible to build
RAG:
- Per question: ~2,000 tokens (only relevant docs)
- 1,000 questions/day: 2M tokens/day
- Cost with GPT-4o-mini: ~$20/day = $600/month
RAG makes impossible systems possible and affordable.
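The arithmetic behind that estimate is simple enough to sanity-check in a few lines. The price below is a placeholder - plug in your provider's current per-token rate:

# monthly_cost.py - rough monthly cost comparison (illustrative placeholder pricing)
PRICE_PER_1M_INPUT_TOKENS = 10.00   # placeholder USD rate; substitute your model's real price
QUESTIONS_PER_DAY = 1_000

def monthly_cost(tokens_per_question):
    daily_tokens = tokens_per_question * QUESTIONS_PER_DAY
    return daily_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS * 30

print(f"RAG (~2,000 tokens/question): ${monthly_cost(2_000):,.2f}/month")
print(f"Injection (~500,000 tokens/question): ${monthly_cost(500_000):,.2f}/month - if it fit in the context window at all")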
Key Takeaways
Document Injection is simple but expensive - fine for tiny documents, breaks quickly
RAG saves 90-95% on costs - by only processing relevant documents
RAG scales - works from 10 to 10,000 documents
RAG is easy to implement - just a few extra lines of code with LangChain
Use RAG for production - unless you have a very specific reason not to
Tutorial by Teemu Virta - teemu.tech
Special thanks to Ardit Sulce
Tags: #python #rag #llm #langchain #ai #openai #tutorial