Teemu Virta

RAG vs Document Injection: Why Your AI Document Chat Needs Smart Retrieval

Have you ever wanted to chat with your PDF files using AI? The simplest way is to load all your documents and send them to the LLM with every question. But this approach has a big problem: it's expensive and doesn't scale.

In this tutorial, we'll compare two approaches:

  • Document Injection: Load everything (simple but expensive)
  • RAG (Retrieval Augmented Generation): Load only what's needed (smart and efficient)

Try it yourself: I've built an interactive web app where you can test both approaches with your own documents and see the token usage difference in real-time: https://f861194c6f5a8196b6.gradio.live/

Upload your documents and compare Document Injection vs. Full RAG mode - watch how token counts change dramatically between the two approaches!


What You'll Learn

  • How document injection works and why it's expensive
  • What RAG is and how it saves money
  • Real token usage comparisons
  • How to implement both systems
  • When to use each approach

Prerequisites

You need:

  • Python 3.8+
  • Basic understanding of Python and APIs
  • OpenAI API key (or Google AI key)
  • Some documents to test with

The Problem

Let's say you have 10 PDF documents with 20 pages each. That's about 100,000 tokens. If you send all this with every question:

  • Cost per question: ~$3.00 (using GPT-4 at roughly $30 per 1M input tokens)
  • 10 questions: $30.00
  • 100 questions: $300.00

This doesn't scale. Your costs explode quickly.
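
To make the arithmetic concrete, here is a quick back-of-the-envelope estimate. The price per million input tokens is an assumption (roughly GPT-4-class pricing); plug in your model's current rate.

# cost_estimate.py - rough input-cost estimate (price per 1M tokens is an assumption)
PRICE_PER_MILLION_INPUT_TOKENS = 30.00  # USD, roughly GPT-4-class; check current pricing


def cost_per_question(tokens_sent: int) -> float:
    """Estimate the input cost of sending this many tokens with one question."""
    return tokens_sent / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


tokens = 100_000  # 10 PDFs x 20 pages each, roughly
print(f"1 question:    ${cost_per_question(tokens):.2f}")        # ~$3.00
print(f"10 questions:  ${cost_per_question(tokens) * 10:.2f}")   # ~$30.00
print(f"100 questions: ${cost_per_question(tokens) * 100:.2f}")  # ~$300.00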


Approach 1: Document Injection

How It Works

Document Injection loads ALL your documents into the system prompt. Every time you ask a question, the LLM processes everything.

Here's the flow:

User asks question
    ↓
Load ALL documents (PDFs, text files)
    ↓
Send everything to LLM
    ↓
LLM processes ALL tokens
    ↓
Get answer

The Code

First, we need a function to load documents:

# load_documents.py
from pathlib import Path
from pypdf import PdfReader


def get_documents(folder='docs'):
    """Load all text, markdown, and PDF files from a folder."""
    path = Path(folder)
    contents = []

    for file in path.rglob('*'):
        if file.suffix in {'.txt', '.md', '.pdf'}:

            if file.suffix == '.pdf':
                # Extract text from PDF
                content = ''
                reader = PdfReader(file)
                for page in reader.pages:
                    content += '\n' + (page.extract_text() or '')  # extract_text() can return None for image-only pages
            else:
                # Read text files
                content = file.read_text(encoding='utf-8')

            contents.append({str(file): content})

    return contents
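
To sanity-check the loader before wiring it into a chat system, you can print what it found. A quick throwaway script, assuming your files live in the docs/ folder:

# check_documents.py - quick sanity check for the loader above
from load_documents import get_documents

docs = get_documents('docs')
print(f"Loaded {len(docs)} files")
for doc in docs:
    for name, content in doc.items():
        print(f"{name}: {len(content)} characters")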

Now the chat system:

# documents_injection_system.py
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from load_documents import get_documents


def main():
    load_dotenv()
    llm = ChatOpenAI(model='gpt-4o-mini')

    # Load ALL documents into the system prompt
    system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"

    messages = [SystemMessage(content=system_prompt)]

    while True:
        question = input("Enter a question: ")
        if question == 'exit':
            break

        messages.append(HumanMessage(content=question))
        response = llm.invoke(messages)
        messages.append(response)
        print(response.content)


if __name__ == "__main__":
    main()

The key point: get_documents() loads EVERYTHING, and this gets sent with EVERY question.
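
If you want to see exactly how large that prompt gets, you can count its tokens with tiktoken (pip install tiktoken). This is a small standalone check, not part of the tutorial scripts; o200k_base is the encoding used by the gpt-4o model family.

# count_tokens.py - measure how many tokens the injected system prompt uses
import tiktoken
from load_documents import get_documents

system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"

encoding = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o-family models
print(f"System prompt size: {len(encoding.encode(system_prompt))} tokens")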

Token Usage: Document Injection

Test with 3 PDFs (10 pages each) + 2 text files = ~15,000 tokens total. (The dollar figures below assume roughly $10 per 1M input tokens; with gpt-4o-mini's much lower pricing the absolute costs are far smaller, but the relative savings are identical.)

First question: "What is document1.pdf about?"

  • Tokens sent: 15,010
  • Cost: $0.15

Second question: "Summarize document2.pdf"

  • Tokens sent: ~15,500
  • Cost: $0.155

After 10 questions: ~$1.60 total

Every question processes ALL 15,000 tokens!

Pros and Cons

Pros:

  • Simple to implement
  • LLM has access to everything

Cons:

  • Expensive - every query processes all documents
  • Slow with large documents
  • Doesn't scale beyond small document sets
  • Hits token limits quickly

Approach 2: RAG System

What is RAG?

RAG (Retrieval Augmented Generation) is smarter. Instead of sending everything, it:

  1. Converts documents into numbers (embeddings)
  2. Stores them in a database
  3. When you ask a question, finds only the relevant documents
  4. Sends only those to the LLM

How It Works

User asks question
    ↓
Convert question to embedding
    ↓
Search database for similar documents
    ↓
Get top 2 most relevant documents
    ↓
Send ONLY those to LLM
    ↓
Get answer

This means instead of processing 100,000 tokens, you might only process 1,000 tokens!

The Code

# full_rag_system.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import PyPDFDirectoryLoader, DirectoryLoader, TextLoader

load_dotenv()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

directory = 'docs'

# Initialize LLM
llm = ChatOpenAI(model='gpt-4o-mini')

# Load documents
pdf_loader = PyPDFDirectoryLoader(directory)
pdf_docs = pdf_loader.load()

text_loader = DirectoryLoader(directory, glob='**/*.txt', loader_cls=TextLoader)
text_docs = text_loader.load()

docs = pdf_docs + text_docs

# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(docs)

# Chat loop
while True: 
    question = input("Enter a question: ")
    if question == 'exit':
        break

    # KEY: Only retrieve top 2 most relevant documents
    retrieved_docs = vector_store.similarity_search(question, k=2)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    # System prompt with ONLY relevant context
    system_prompt = f"Answer questions about these documents: {context}"

    messages = [SystemMessage(content=system_prompt)]
    messages.append(HumanMessage(content=question))

    response = llm.invoke(messages)
    print(response.content)

The magic happens at k=2: we only retrieve the 2 most relevant chunks (with PyPDFDirectoryLoader, each PDF page becomes its own document), not everything!
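
If you want to see which chunks were retrieved and how similar they were, InMemoryVectorStore also exposes similarity_search_with_score. A small diagnostic sketch, reusing the vector_store and question from the script above:

# Inspect the retrieved chunks and their similarity scores
results = vector_store.similarity_search_with_score(question, k=2)
for doc, score in results:
    print(f"score={score:.3f}  source={doc.metadata.get('source', 'unknown')}")
    print(doc.page_content[:200])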

Token Usage: RAG System

Same test: 3 PDFs + 2 text files = ~15,000 tokens total

First question: "What is document1.pdf about?"

  • Tokens sent: 1,010 (only 2 relevant chunks!)
  • Cost: $0.01

Second question: "Summarize document2.pdf"

  • Tokens sent: ~1,010
  • Cost: $0.01

After 10 questions: ~$0.11 total

RAG only sends relevant documents, not everything!

Pros and Cons

Pros:

  • 90-95% cost reduction
  • Works with thousands of documents
  • Fast responses
  • Only processes relevant information

Cons:

  • More complex setup
  • Needs embeddings model
  • Initial indexing takes time

Head-to-Head Comparison

Token Usage

Metric                            | Document Injection | RAG System | Savings
Tokens per query                  | ~15,000            | ~1,000     | 93%
Cost per query (~$10/1M tokens)   | $0.15              | $0.01      | 93%
100 queries                       | $15.00             | $1.00      | 93%
1,000 queries                     | $150.00            | $10.00     | 93%

Scalability

Documents    | Document Injection   | RAG System
10 files     | Works (expensive)    | Works (cheap)
100 files    | Very expensive       | Works great
1,000 files  | Exceeds token limits | Works great
10,000 files | Impossible           | Works great

Performance

Document Injection:

  • Initial load: Fast
  • Query speed: Slow (large context)
  • Cost: High

RAG:

  • Initial load: Slower (builds embeddings)
  • Query speed: Fast (small context)
  • Cost: Low

Getting Started

Step 1: Setup

# Create project
mkdir document-chat
cd document-chat
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create folders
mkdir docs
touch .env

Step 2: Install Packages

pip install langchain langchain-openai langchain-huggingface
pip install langchain-community pypdf python-dotenv
pip install sentence-transformers

Step 3: Configure

Create .env file:

OPENAI_API_KEY=your_key_here

Step 4: Add Documents

Put some PDF or text files in the docs/ folder.

Step 5: Run

# Try Document Injection
python documents_injection_system.py

# Try RAG
python full_rag_system.py

Common Issues

Issue 1: Tokenizer Warning

If you see warnings about TOKENIZERS_PARALLELISM:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Issue 2: Wrong Documents Retrieved

If RAG returns irrelevant documents:

# Get more documents
retrieved_docs = vector_store.similarity_search(question, k=5)

# Or split documents into smaller chunks first
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(docs)
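
After splitting, remember to index the chunks rather than the whole documents. A small continuation, assuming the embeddings object and imports from full_rag_system.py:

# Re-index using the smaller chunks instead of whole documents
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(split_docs)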

Issue 3: Out of Memory

For large document sets:

# Process in batches
batch_size = 100
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    vector_store.add_documents(batch)

When to Use Each

Use Document Injection:

  • Very small document set (less than 5 pages total)
  • Quick prototyping
  • You need ALL context for every answer

Use RAG:

  • More than 10 pages of documents
  • Cost matters
  • Production applications
  • Any real-world use case

Simple rule: If you're building anything real, use RAG.


Making RAG Better

Once your basic RAG works, you can improve it:

1. Split Documents Better

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_docs = splitter.split_documents(docs)
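
Chunk size is a trade-off: smaller chunks make retrieval more precise but carry less surrounding context, and the overlap keeps sentences that straddle a chunk boundary present in both neighboring chunks.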

2. Save Vector Store

Instead of rebuilding every time:

from langchain_community.vectorstores import Chroma

vector_store = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
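
A minimal usage sketch, assuming the embeddings and split_docs from earlier (Chroma also needs its backend installed: pip install chromadb; with recent chromadb versions the data is written to disk automatically):

# First run: index the chunks; they are persisted to ./chroma_db
vector_store.add_documents(split_docs)

# Later runs: the same constructor call reopens the existing collection,
# so you can search without re-indexing
retrieved_docs = vector_store.similarity_search("your question", k=2)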

3. Get More Documents

If answers aren't good enough:

# Increase k to get more context
retrieved_docs = vector_store.similarity_search(question, k=5)

Real Cost Example

Scenario: Customer support chatbot with 1,000 product manuals

Document Injection:

  • Total: ~500,000 tokens
  • Problem: Exceeds token limits!
  • Impossible to build

RAG:

  • Per question: ~2,000 tokens (only relevant docs)
  • 1,000 questions/day: 2M tokens/day
  • Cost at ~$10 per 1M input tokens: ~$20/day = $600/month (with gpt-4o-mini's pricing, only a fraction of that)

RAG makes impossible systems possible and affordable.


Key Takeaways

  1. Document Injection is simple but expensive - fine for tiny documents, breaks quickly

  2. RAG saves 90-95% on costs - by only processing relevant documents

  3. RAG scales - works from 10 to 10,000 documents

  4. RAG is easy to implement - just a few extra lines of code with LangChain

  5. Use RAG for production - unless you have a very specific reason not to


Tutorial by Teemu Virta - teemu.tech

Special thanks to Ardit Sulce
Tags: #python #rag #llm #langchain #ai #openai #tutorial
