Teemu Virta

RAG vs Document Injection: Why Your AI Document Chat Needs Smart Retrieval

Have you ever wanted to chat with your PDF files using AI? The simplest way is to load all your documents and send them to the LLM with every question. But this approach has a big problem: it's expensive and doesn't scale.

In this tutorial, we'll compare two approaches:

  • Document Injection: Load everything (simple but expensive)
  • RAG (Retrieval Augmented Generation): Load only what's needed (smart and efficient)

Try it yourself: I've built an interactive web app where you can test both approaches with your own documents and see the token usage difference in real-time: https://f861194c6f5a8196b6.gradio.live/

Upload your documents and compare Document Injection vs. Full RAG mode - watch how token counts change dramatically between the two approaches!


What You'll Learn

  • How document injection works and why it's expensive
  • What RAG is and how it saves money
  • Real token usage comparisons
  • How to implement both systems
  • When to use each approach

Prerequisites

You need:

  • Python 3.8+
  • Basic understanding of Python and APIs
  • OpenAI API key (or Google AI key)
  • Some documents to test with

The Problem

Let's say you have 10 PDF documents with 20 pages each. That's about 100,000 tokens. If you send all this with every question:

  • Cost per question: ~$3.00 (using GPT-4 at roughly $30 per 1M input tokens)
  • 10 questions: $30.00
  • 100 questions: $300.00

This doesn't scale. Your costs explode quickly.
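
To make the arithmetic concrete, here is a quick back-of-the-envelope estimate. The price per million input tokens is an assumption (roughly GPT-4-class pricing); plug in your model's current rate.

# cost_estimate.py - rough input-cost estimate (price per 1M tokens is an assumption)
PRICE_PER_MILLION_INPUT_TOKENS = 30.00  # USD, roughly GPT-4-class; check current pricing


def cost_per_question(tokens_sent: int) -> float:
    """Estimate the input cost of sending this many tokens with one question."""
    return tokens_sent / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


tokens = 100_000  # 10 PDFs x 20 pages each, roughly
print(f"1 question:    ${cost_per_question(tokens):.2f}")        # ~$3.00
print(f"10 questions:  ${cost_per_question(tokens) * 10:.2f}")   # ~$30.00
print(f"100 questions: ${cost_per_question(tokens) * 100:.2f}")  # ~$300.00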


Approach 1: Document Injection

How It Works

Document Injection loads ALL your documents into the system prompt. Every time you ask a question, the LLM processes everything.

Here's the flow:

User asks question
    ↓
Load ALL documents (PDFs, text files)
    ↓
Send everything to LLM
    ↓
LLM processes ALL tokens
    ↓
Get answer

The Code

First, we need a function to load documents:

# load_documents.py
from pathlib import Path
from pypdf import PdfReader


def get_documents(folder='docs'):
    """Load all text, markdown, and PDF files from a folder."""
    path = Path(folder)
    contents = []

    for file in path.rglob('*'):
        if file.suffix in {'.txt', '.md', '.pdf'}:

            if file.suffix == '.pdf':
                # Extract text from PDF
                content = ''
                reader = PdfReader(file)
                for page in reader.pages:
                    content += '\n' + (page.extract_text() or '')  # extract_text() can return None for image-only pages
            else:
                # Read text files
                content = file.read_text(encoding='utf-8')

            contents.append({str(file): content})

    return contents
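
To sanity-check the loader before wiring it into a chat system, you can print what it found. A quick throwaway script, assuming your files live in the docs/ folder:

# check_documents.py - quick sanity check for the loader above
from load_documents import get_documents

docs = get_documents('docs')
print(f"Loaded {len(docs)} files")
for doc in docs:
    for name, content in doc.items():
        print(f"{name}: {len(content)} characters")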

Now the chat system:

# documents_injection_system.py
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from load_documents import get_documents


def main():
    load_dotenv()
    llm = ChatOpenAI(model='gpt-4o-mini')

    # Load ALL documents into the system prompt
    system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"

    messages = [SystemMessage(content=system_prompt)]

    while True:
        question = input("Enter a question: ")
        if question == 'exit':
            break

        messages.append(HumanMessage(content=question))
        response = llm.invoke(messages)
        messages.append(response)
        print(response.content)


if __name__ == "__main__":
    main()

The key point: get_documents() loads EVERYTHING, and this gets sent with EVERY question.
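
If you want to see exactly how large that prompt gets, you can count its tokens with tiktoken (pip install tiktoken). This is a small standalone check, not part of the tutorial scripts; o200k_base is the encoding used by the gpt-4o model family.

# count_tokens.py - measure how many tokens the injected system prompt uses
import tiktoken
from load_documents import get_documents

system_prompt = f"You are a helpful assistant. Answer questions about these documents: {get_documents()}"

encoding = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o-family models
print(f"System prompt size: {len(encoding.encode(system_prompt))} tokens")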

Token Usage: Document Injection

Test with 3 PDFs (10 pages each) + 2 text files = ~15,000 tokens total. (The dollar figures below assume roughly $10 per 1M input tokens; with gpt-4o-mini's much lower pricing the absolute costs are far smaller, but the relative savings are identical.)

First question: "What is document1.pdf about?"

  • Tokens sent: 15,010
  • Cost: $0.15

Second question: "Summarize document2.pdf"

  • Tokens sent: ~15,500
  • Cost: $0.155

After 10 questions: ~$1.60 total

Every question processes ALL 15,000 tokens!

Pros and Cons

Pros:

  • Simple to implement
  • LLM has access to everything

Cons:

  • Expensive - every query processes all documents
  • Slow with large documents
  • Doesn't scale beyond small document sets
  • Hits token limits quickly

Approach 2: RAG System

What is RAG?

RAG (Retrieval Augmented Generation) is smarter. Instead of sending everything, it:

  1. Converts documents into numbers (embeddings)
  2. Stores them in a database
  3. When you ask a question, finds only the relevant documents
  4. Sends only those to the LLM

How It Works

User asks question
    ↓
Convert question to embedding
    ↓
Search database for similar documents
    ↓
Get top 2 most relevant documents
    ↓
Send ONLY those to LLM
    ↓
Get answer

This means instead of processing 100,000 tokens, you might only process 1,000 tokens!

The Code

# full_rag_system.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import PyPDFDirectoryLoader, DirectoryLoader, TextLoader

load_dotenv()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

directory = 'docs'

# Initialize LLM
llm = ChatOpenAI(model='gpt-4o-mini')

# Load documents
pdf_loader = PyPDFDirectoryLoader(directory)
pdf_docs = pdf_loader.load()

text_loader = DirectoryLoader(directory, glob='**/*.txt', loader_cls=TextLoader)
text_docs = text_loader.load()

docs = pdf_docs + text_docs

# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(docs)

# Chat loop
while True: 
    question = input("Enter a question: ")
    if question == 'exit':
        break

    # KEY: Only retrieve top 2 most relevant documents
    retrieved_docs = vector_store.similarity_search(question, k=2)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    # System prompt with ONLY relevant context
    system_prompt = f"Answer questions about these documents: {context}"

    messages = [SystemMessage(content=system_prompt)]
    messages.append(HumanMessage(content=question))

    response = llm.invoke(messages)
    print(response.content)

The magic happens at k=2: we only retrieve the 2 most relevant chunks (with PyPDFDirectoryLoader, each PDF page becomes its own document), not everything!
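
If you want to see which chunks were retrieved and how similar they were, InMemoryVectorStore also exposes similarity_search_with_score. A small diagnostic sketch, reusing the vector_store and question from the script above:

# Inspect the retrieved chunks and their similarity scores
results = vector_store.similarity_search_with_score(question, k=2)
for doc, score in results:
    print(f"score={score:.3f}  source={doc.metadata.get('source', 'unknown')}")
    print(doc.page_content[:200])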

Token Usage: RAG System

Same test: 3 PDFs + 2 text files = ~15,000 tokens total

First question: "What is document1.pdf about?"

  • Tokens sent: 1,010 (only 2 relevant chunks!)
  • Cost: $0.01

Second question: "Summarize document2.pdf"

  • Tokens sent: ~1,010
  • Cost: $0.01

After 10 questions: ~$0.11 total

RAG only sends relevant documents, not everything!

Pros and Cons

Pros:

  • 90-95% cost reduction
  • Works with thousands of documents
  • Fast responses
  • Only processes relevant information

Cons:

  • More complex setup
  • Needs embeddings model
  • Initial indexing takes time

Head-to-Head Comparison

Token Usage

Metric                            | Document Injection | RAG System | Savings
Tokens per query                  | ~15,000            | ~1,000     | 93%
Cost per query (~$10/1M tokens)   | $0.15              | $0.01      | 93%
100 queries                       | $15.00             | $1.00      | 93%
1,000 queries                     | $150.00            | $10.00     | 93%

Scalability

Documents    | Document Injection   | RAG System
10 files     | Works (expensive)    | Works (cheap)
100 files    | Very expensive       | Works great
1,000 files  | Exceeds token limits | Works great
10,000 files | Impossible           | Works great

Performance

Document Injection:

  • Initial load: Fast
  • Query speed: Slow (large context)
  • Cost: High

RAG:

  • Initial load: Slower (builds embeddings)
  • Query speed: Fast (small context)
  • Cost: Low

Getting Started

Step 1: Setup

# Create project
mkdir document-chat
cd document-chat
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create folders
mkdir docs
touch .env

Step 2: Install Packages

pip install langchain langchain-openai langchain-huggingface
pip install langchain-community pypdf python-dotenv
pip install sentence-transformers

Step 3: Configure

Create .env file:

OPENAI_API_KEY=your_key_here

Step 4: Add Documents

Put some PDF or text files in the docs/ folder.

Step 5: Run

# Try Document Injection
python documents_injection_system.py

# Try RAG
python full_rag_system.py

Common Issues

Issue 1: Tokenizer Warning

If you see warnings about TOKENIZERS_PARALLELISM:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Issue 2: Wrong Documents Retrieved

If RAG returns irrelevant documents:

# Get more documents
retrieved_docs = vector_store.similarity_search(question, k=5)

# Or split documents into smaller chunks first
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(docs)
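
After splitting, remember to index the chunks rather than the whole documents. A small continuation, assuming the embeddings object and imports from full_rag_system.py:

# Re-index using the smaller chunks instead of whole documents
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(split_docs)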

Issue 3: Out of Memory

For large document sets:

# Process in batches
batch_size = 100
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    vector_store.add_documents(batch)

When to Use Each

Use Document Injection:

  • Very small document set (less than 5 pages total)
  • Quick prototyping
  • You need ALL context for every answer

Use RAG:

  • More than 10 pages of documents
  • Cost matters
  • Production applications
  • Any real-world use case

Simple rule: If you're building anything real, use RAG.


Making RAG Better

Once your basic RAG works, you can improve it:

1. Split Documents Better

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_docs = splitter.split_documents(docs)
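
Chunk size is a trade-off: smaller chunks make retrieval more precise but carry less surrounding context, and the overlap keeps sentences that straddle a chunk boundary present in both neighboring chunks.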

2. Save Vector Store

Instead of rebuilding every time:

from langchain_community.vectorstores import Chroma

vector_store = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
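
A minimal usage sketch, assuming the embeddings and split_docs from earlier (Chroma also needs its backend installed: pip install chromadb; with recent chromadb versions the data is written to disk automatically):

# First run: index the chunks; they are persisted to ./chroma_db
vector_store.add_documents(split_docs)

# Later runs: the same constructor call reopens the existing collection,
# so you can search without re-indexing
retrieved_docs = vector_store.similarity_search("your question", k=2)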

3. Get More Documents

If answers aren't good enough:

# Increase k to get more context
retrieved_docs = vector_store.similarity_search(question, k=5)

Real Cost Example

Scenario: Customer support chatbot with 1,000 product manuals

Document Injection:

  • Total: ~500,000 tokens
  • Problem: Exceeds token limits!
  • Impossible to build

RAG:

  • Per question: ~2,000 tokens (only relevant docs)
  • 1,000 questions/day: 2M tokens/day
  • Cost at ~$10 per 1M input tokens: ~$20/day = $600/month (with gpt-4o-mini's pricing, only a fraction of that)

RAG makes impossible systems possible and affordable.


Key Takeaways

  1. Document Injection is simple but expensive - fine for tiny documents, breaks quickly

  2. RAG saves 90-95% on costs - by only processing relevant documents

  3. RAG scales - works from 10 to 10,000 documents

  4. RAG is easy to implement - just a few extra lines of code with LangChain

  5. Use RAG for production - unless you have a very specific reason not to


Tutorial by Teemu Virta - teemu.tech

Special thanks to Ardit Sulce
Tags: #python #rag #llm #langchain #ai #openai #tutorial
