Learn what RAG is and why we choose it over fine-tuning and other alternatives, with detailed comparisons, pros and cons, and current limitations.
Have you ever wanted an AI to write stories in your favorite author's style? Or wished ChatGPT knew about your company's internal documents?
That's exactly what RAG (Retrieval-Augmented Generation) enables.
In this 3-part tutorial series, we'll build a complete AI story generator that learns writing styles from your ebook collection. By the end, you'll understand RAG deeply—not just theoretically, but through hands-on implementation.
What we're building:
- A system that learns writing styles from any ebook collection
- Multi-chapter story generation with consistency
- Support for multiple LLM backends (Ollama, OpenAI, Claude)
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Table of Contents
- The Problem: LLMs Don't Know Your Data
- Methods to Add Custom Knowledge
- Deep Dive: RAG Pros and Cons
- Current Limitations of RAG
- Why We Choose RAG for This Project
- How RAG Actually Works
- Our Architecture
The Problem: LLMs Don't Know Your Data
Large Language Models like GPT-4, Claude, and Llama are trained on massive datasets from the internet. They're incredibly capable, but they have fundamental limitations:
1. Knowledge Cutoff
Models only know information up to their training date.
You: "What happened in the 2024 Olympics?"
GPT-4: "I don't have information about events after April 2024..."
More importantly for us: LLMs don't know about your personal book collection, your company documents, or any private data.
2. Generic Responses
Ask an LLM to write a story, and you'll get competent but generic prose:
Prompt: "Write about a cultivator discovering a cave"
Generic LLM Response:
"The young man walked into the cave. It was dark and mysterious.
He felt a strange energy. Something powerful was hidden here..."
What we want (Xianxia style):
"Chen Wei's spiritual sense trembled as he pushed through the
waterfall. Ancient qi, dense enough to manifest as mist, swirled
within the cave mouth. His dantian resonated with a frequency
he had only read about in the Celestial Archives—the signature
of a Nascent Soul realm cultivator's inheritance..."
3. Hallucinations
Without access to source material, LLMs confidently generate plausible-sounding but incorrect information:
You: "What does Chapter 7 of my company handbook say about vacation policy?"
LLM: "According to your handbook, employees receive 15 days..."
(completely made up - it has no access to your handbook!)
4. No Personalization
Everyone gets the same model. A fantasy author and a technical writer get identical responses to the same prompt. There's no way to customize the model to your specific domain without significant effort.
5. Context Window Limitations
Even if you try to paste your entire book into the prompt:
| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4K tokens | ~3,000 words |
| GPT-4 | 8K-128K tokens | ~6,000-96,000 words |
| Claude 3 | 200K tokens | ~150,000 words |
A typical novel is 80,000-100,000 words. A book collection? Millions of words. You simply can't fit everything in the context window.
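If you're curious how your own manuscript measures up, here's a quick sketch that estimates token counts with the tiktoken library (the file path is a placeholder, and the words-to-tokens ratio varies by text and tokenizer):

import tiktoken

# GPT-4-style tokenizer; other models use different encodings
enc = tiktoken.encoding_for_model("gpt-4")

with open("my_novel.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"{len(text.split()):,} words ≈ {len(tokens):,} tokens")
# An 80,000-word novel is roughly 100K+ tokens - too large for most context windows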
Methods to Add Custom Knowledge to LLMs
There are several approaches to make LLMs work with your custom data. Let's explore each one in detail:
Method 1: Prompt Engineering (In-Context Learning)
How it works: Paste your data directly into the prompt.
prompt = f"""
You are a story writer. Here are some example passages to follow:
Example 1:
{example_passage_1}
Example 2:
{example_passage_2}
Example 3:
{example_passage_3}
Now write a story about: {user_request}
"""
Pros:
- Zero setup - Just copy-paste, no infrastructure needed
- Immediate - Results in seconds, no preprocessing
- Flexible - Change examples anytime
- No training - Works with any off-the-shelf model
Cons:
- Context limits - Can only fit 3-10 examples (can't represent diverse styles)
- High cost - Pay for example tokens every call ($0.01-0.10 per request)
- No intelligence - Must manually choose examples (may pick irrelevant ones)
- Doesn't scale - 100 books = impossible
Best for: Quick prototypes, very small datasets (<10 pages)
Method 2: Fine-Tuning
How it works: Train the model's weights on your specific data.
# Conceptual fine-tuning workflow
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You write in xianxia style"},
            {"role": "user", "content": "Write about a breakthrough"},
            {"role": "assistant", "content": "The qi vortex above Chen Wei's head..."}
        ]
    },
    # ... hundreds or thousands more examples
]

# Train the model (this costs money and time!)
# (In the real OpenAI API, training_file must be the ID of a previously uploaded file.)
fine_tuned_model = openai.fine_tuning.jobs.create(
    training_file="training_data.jsonl",
    model="gpt-3.5-turbo"
)
Pros:
- Persistent knowledge - Style/knowledge "baked into" weights
- Fast inference - No retrieval step needed
- Deep learning - Can learn subtle patterns over time
- Consistent outputs - Same style every time
Cons:
- Expensive - GPT-3.5 fine-tuning costs $50-500+ per training run
- Time-consuming - Hours to days of training, slow iteration
- Expertise required - Need to understand ML concepts (high barrier)
- Catastrophic forgetting - Model may lose general capabilities
- Static knowledge - Can't update without retraining (adding one book = full retrain)
- Data preparation - Need to format data properly (hours of prep work)
- Overfitting risk - Model memorizes instead of learning
Best for: Production systems with stable data, when you need consistent style
Method 3: RAG (Retrieval-Augmented Generation)
How it works: Store data in a searchable database, retrieve relevant pieces at query time.
# RAG workflow
def generate_with_rag(user_query):
    # 1. Search for relevant content
    relevant_passages = vector_db.search(user_query, top_k=5)

    # 2. Build augmented prompt
    prompt = f"""
Reference material:
{relevant_passages}

User request: {user_query}
"""

    # 3. Generate with context
    return llm.generate(prompt)
Pros:
- No training - Use any model off-the-shelf
- Easy updates - Add/remove documents instantly
- Scalable - Handle millions of documents
- Transparent - See exactly what was retrieved
- Cost-effective - Only embed once, query forever
- Model-agnostic - Same DB works with any LLM
- Grounded responses - Output based on real sources
Cons:
- Retrieval quality - Bad retrieval = bad output (need good embeddings)
- Additional latency - Search adds 100-500ms (slower than fine-tuning)
- Infrastructure - Need vector database (more moving parts)
- Chunking challenges - How to split documents affects retrieval quality
- Context assembly - Retrieved chunks may not flow naturally
- Embedding costs - Need to embed all documents (one-time cost)
Best for: Large/dynamic knowledge bases, when data changes frequently
Method 4: Knowledge Graphs
How it works: Structure data as entities and relationships.
[Brandon Sanderson] --wrote--> [Mistborn] --has_magic_system--> [Allomancy]
                                   |
                                   +--has_character--> [Vin]
                                                          |
                                                          +--has_trait--> [Street Urchin]
                                                          +--has_power--> [Mistborn]
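To make this concrete, here's a minimal sketch of the same graph using the networkx library (node and relation names mirror the diagram; the "Mistborn (power)" node is renamed slightly so it doesn't collide with the book node):

import networkx as nx

# Entities are nodes, relationships are labeled edges
G = nx.DiGraph()
G.add_edge("Brandon Sanderson", "Mistborn", relation="wrote")
G.add_edge("Mistborn", "Allomancy", relation="has_magic_system")
G.add_edge("Mistborn", "Vin", relation="has_character")
G.add_edge("Vin", "Street Urchin", relation="has_trait")
G.add_edge("Vin", "Mistborn (power)", relation="has_power")

# Structured query: which characters appear in a given book?
characters = [
    target for _, target, data in G.out_edges("Mistborn", data=True)
    if data["relation"] == "has_character"
]
print(characters)  # ['Vin']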
Pros:
- Explicit relationships - Captures "how things connect"
- Complex queries - "Find all characters who use fire magic"
- Reasoning - Can infer new relationships
- Structured output - Clean, organized data
Cons:
- Complex to build - Must define schema, extract entities (weeks of work)
- Maintenance burden - New data needs manual structuring (ongoing effort)
- Doesn't capture prose - Style/voice can't be graphed (bad for creative writing)
- Domain expertise - Need to understand your data deeply (high barrier)
Best for: Structured data, when relationships matter more than content
Method 5: Hybrid Approaches
Combine methods for best results:
| Hybrid Approach | How It Works | Best For |
|---|---|---|
| RAG + Fine-tuning | Fine-tune for style, RAG for facts | News/research writing |
| RAG + Knowledge Graph | Graph for structure, RAG for content | Complex domains |
| Multi-stage RAG | Retrieve, Rerank, Generate | High-precision needs |
| RAG + Prompt Engineering | RAG retrieves, few-shot guides format | Specific output formats |
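To give a flavor of one hybrid, here's a minimal multi-stage RAG sketch (retrieve, rerank, generate) using a cross-encoder from sentence-transformers; the vector_db and llm objects are assumed to exist, as in the RAG pseudocode above:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def generate_with_reranking(user_query, top_k=20, keep=5):
    # Stage 1: broad semantic retrieval (assumed vector_db, as in Method 3)
    candidates = vector_db.search(user_query, top_k=top_k)

    # Stage 2: rerank candidates by query-passage relevance
    scores = reranker.predict([(user_query, passage) for passage in candidates])
    ranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]

    # Stage 3: generate with only the highest-precision context
    context = "\n\n".join(ranked[:keep])
    return llm.generate(f"Reference material:\n{context}\n\nUser request: {user_query}")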
Comparison Table: All Methods
| Criteria | Prompt Eng. | Fine-Tuning | RAG | Knowledge Graph |
|---|---|---|---|---|
| Setup Time | Minutes | Days | Hours | Weeks |
| Setup Cost | $0 | $50-500 | $0-50 | $100+ |
| Per-Query Cost | High | Low | Medium | Low |
| Technical Skill | Low | High | Medium | High |
| Knowledge Update | Instant | Re-train | Instant | Manual |
| Max Data Size | ~50 pages | Unlimited | Millions of docs | Millions of nodes |
| Retrieval Intelligence | None | N/A | Semantic | Graph traversal |
| Output Consistency | Variable | High | Variable | High |
| Debugging | Easy | Hard | Medium | Medium |
| Style Learning | Limited | Excellent | Good | Poor |
| Fact Accuracy | Low | Medium | High | High |
Deep Dive: RAG Pros and Cons
Since we're using RAG, let's examine its strengths and weaknesses in detail:
RAG Advantages (Detailed)
1. No Training Required
Fine-tuning workflow:
1. Prepare training data (hours)
2. Format into JSONL (hours)
3. Upload and validate (minutes)
4. Train model (hours-days)
5. Test and iterate (days)
Total: Days to weeks
RAG workflow:
1. Parse documents (minutes)
2. Chunk and embed (minutes-hours)
3. Store in vector DB (minutes)
Total: Hours
2. Easy Knowledge Updates
# Adding a new book to RAG
def add_book(filepath):
    text = parse_ebook(filepath)           # 10 seconds
    chunks = chunk_text(text)              # 1 second
    embeddings = embed(chunks)             # 30 seconds
    vector_db.add(chunks, embeddings)      # 5 seconds
    # Done! New book is searchable

# Adding a new book with fine-tuning
def add_book_finetune(filepath):
    # 1. Prepare new training examples (1 hour)
    # 2. Combine with existing training data (10 min)
    # 3. Re-run fine-tuning job ($50-200, 2-8 hours)
    # 4. Test new model (1 hour)
    # 5. Deploy new model (30 min)
    # Total: ~12 hours and $50-200
    ...
3. Scalability
| Data Size | Prompt Engineering | Fine-Tuning | RAG |
|---|---|---|---|
| 10 pages | Works | Works | Works |
| 100 pages | Too big | Works | Works |
| 1,000 pages | Impossible | Expensive | Works |
| 10,000 pages | Impossible | Very expensive | Works |
| 1M pages | Impossible | Impractical | Works |
4. Transparency and Debugging
# With RAG, you can see exactly what the model sees
result = generator.generate("Write about a warrior")
# Debug: What did we retrieve?
print("Retrieved passages:")
for i, passage in enumerate(result.retrieved_context):
    print(f"{i+1}. {passage[:100]}...")
    print(f"   Similarity: {result.scores[i]}")
    print(f"   Source: {result.sources[i]}")
# If output is wrong, you know exactly where to look:
# - Bad retrieval? Improve embeddings or chunking
# - Good retrieval, bad output? Improve prompt
5. Model Agnostic
# Same knowledge base works with ANY model
knowledge_base = VectorDB("./chroma_db")
# Use with Ollama (free, local)
ollama_response = generate(knowledge_base, model="ollama/qwen2.5")
# Use with OpenAI (paid, cloud)
openai_response = generate(knowledge_base, model="gpt-4")
# Use with Claude (paid, cloud)
claude_response = generate(knowledge_base, model="claude-3-sonnet")
# Switch models without rebuilding anything!
6. Cost Effective
| Operation | Fine-Tuning Cost | RAG Cost |
|---|---|---|
| Initial setup | $50-500 | $0-10 |
| Add 1 book | $50-200 (retrain) | ~$0.01 (embed) |
| Add 100 books | $50-200 (retrain) | ~$1 (embed) |
| Query (GPT-4) | ~$0.03/query | ~$0.04/query |
| Query (Ollama) | $0 | $0 |
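The embedding numbers above come from simple arithmetic; here's a back-of-the-envelope sketch you can adapt (the per-token price and words-to-tokens ratio are assumptions - check your provider's current pricing, and note that local embedding models cost $0):

# Rough embedding-cost estimate for a book collection (all constants are assumptions)
words_per_book = 100_000
tokens_per_word = 1.3                  # rough English average
price_per_million_tokens = 0.02        # assumed hosted-embedding price in USD

def embedding_cost(num_books):
    tokens = num_books * words_per_book * tokens_per_word
    return tokens * price_per_million_tokens / 1_000_000

print(f"1 book:    ${embedding_cost(1):.4f}")
print(f"100 books: ${embedding_cost(100):.2f}")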
RAG Disadvantages (Detailed)
1. Retrieval Quality Dependency
The RAG Equation:
Final Output Quality = Retrieval Quality × Generation Quality
If retrieval finds irrelevant passages:
- User asks about "sword fighting"
- System retrieves passages about "cooking swords" (wrong!)
- LLM generates cooking-related nonsense
Retrieval failure modes:
- Semantic gap: query and relevant docs use different words
- Chunking errors: relevant info split across chunks
- Embedding limitations: model doesn't understand domain
2. Additional Latency
Request Timeline Comparison:
Direct LLM (no RAG):
[User Query] → [LLM Generate: 500ms] → [Response]
Total: ~500ms
RAG:
[User Query] → [Embed Query: 50ms] → [Vector Search: 100ms] →
[Fetch Documents: 50ms] → [Build Prompt: 10ms] → [LLM Generate: 600ms] → [Response]
Total: ~810ms (+62% slower)
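To see where the time goes in your own setup, a small sketch like this can time each stage separately (it assumes the model, collection, and llm objects we build later in the series):

import time

def timed(label, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

query = "Write about a warrior discovering a cave"
query_vector = timed("Embed query", model.encode, query)
results = timed("Vector search", collection.query,
                query_embeddings=[query_vector], n_results=5)
prompt = "\n\n".join(results["documents"][0]) + f"\n\nRequest: {query}"
story = timed("LLM generate", llm.generate, prompt)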
3. Chunking Challenges
The Chunking Dilemma:
Too Small (100 chars):
"The warrior drew his" | "sword and faced the" | "dragon with courage"
→ Loses context, meaningless fragments
Too Large (5000 chars):
[Entire chapter about many topics]
→ Dilutes relevance, wastes context, may retrieve wrong parts
Just Right (500-1000 chars):
[Complete paragraph about sword fighting]
→ Self-contained, meaningful, searchable
But even "just right" has problems:
- Important info may span two chunks
- Context from previous paragraphs lost
- Character names may not appear in every chunk
4. Context Assembly Issues
# Retrieved chunks may not flow naturally
retrieved = [
    "...he defeated the demon lord. THE END.",   # End of chapter 5
    "Chapter 1: The young warrior woke...",      # Beginning of book
    "...said Master Liu. 'Your training...'"     # Middle of dialogue
]
# Assembled context is disjointed!
# The LLM must make sense of this jumble
5. Cold Start Problem
New RAG System:
- No documents indexed yet
- User queries return nothing relevant
- Output quality = base LLM (no improvement)
Solution: Must index documents before system is useful
This takes time for large collections
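A simple guard helps: check that something has actually been indexed before relying on retrieval (a minimal sketch against the ChromaDB collection we build later in this article):

def retrieve_or_warn(collection, query_vector, n_results=5):
    # Cold start: an empty collection means retrieval can't help yet
    if collection.count() == 0:
        print("Warning: no documents indexed - output will be plain base-LLM quality")
        return []
    results = collection.query(query_embeddings=[query_vector], n_results=n_results)
    return results["documents"][0]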
Current Limitations of RAG
Understanding limitations helps you build better systems:
Retrieval Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Semantic gap | Different words for same concept | Hybrid search (keyword + semantic) |
| No cross-document reasoning | Can't connect info across books | Knowledge graphs, multi-hop retrieval |
| Recency bias | All chunks treated equally | Add timestamp metadata, boost recent |
| No negation understanding | "not about war" still retrieves war | Better query processing |
| Fixed chunk boundaries | Important info split across chunks | Overlapping chunks, larger windows |
Embedding Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Domain mismatch | General embeddings miss domain terms | Fine-tune embedding model |
| Length limits | Most models cap at 512 tokens | Chunk appropriately |
| Language bias | English-trained models struggle with other languages | Multilingual models |
| No structured data | Can't embed tables well | Special preprocessing |
Generation Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Context window | Can only fit N retrieved chunks | Summarization, selection |
| Lost in the middle | LLMs ignore middle of long contexts | Reorder important info to start/end |
| Hallucination | May still make things up | Fact-checking, citations |
| Style inconsistency | May not maintain style throughout | More style examples, fine-tuning |
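For "lost in the middle", a common trick is to reorder retrieved chunks so the strongest hits sit at the start and end of the context, where LLMs pay the most attention; a small sketch (input assumed sorted best-first):

def reorder_for_long_context(chunks_best_first):
    # Alternate chunks to the front and back, leaving the weakest in the middle
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ranks 1..5 become [1, 3, 5, 4, 2] - best chunks at both ends
print(reorder_for_long_context([1, 2, 3, 4, 5]))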
Our Tutorial's Specific Limitations
Since this is a learning-focused tutorial, we've made simplifying choices:
┌─────────────────────────────────────────────────────────────────┐
│ WHAT THIS TUTORIAL COVERS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ Basic RAG pipeline (ingest → embed → store → retrieve) │
│ ✓ Simple fixed-size chunking with overlap │
│ ✓ Single embedding model (no fine-tuning) │
│ ✓ Basic similarity search (no reranking) │
│ ✓ Single-query retrieval (no query expansion) │
│ ✓ Straightforward prompt templates │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION SYSTEMS WOULD ADD │
├─────────────────────────────────────────────────────────────────┤
│ ○ Hybrid search (BM25 keyword + semantic vectors) │
│ ○ Query expansion ("sword" → "sword, blade, weapon") │
│ ○ Cross-encoder reranking for better precision │
│ ○ Semantic chunking (split on topic boundaries) │
│ ○ Metadata filtering (by author, genre, date) │
│ ○ Caching layer for repeated queries │
│ ○ Evaluation metrics (retrieval recall, generation quality) │
│ ○ A/B testing for prompt variations │
│ ○ Streaming responses for better UX │
│ ○ Rate limiting and cost management │
│ ○ Observability (logging, tracing, metrics) │
└─────────────────────────────────────────────────────────────────┘
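Some of these additions are small. Metadata filtering, for example, is built into ChromaDB's query API; here's a sketch assuming chunks were stored with the source metadata field we add later in this article:

# Only retrieve style samples from one specific book
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    where={"source": "xianxia_story.pdf"}   # metadata filter
)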
Why We Choose RAG for This Project
Given all the above, here's why RAG is the right choice for our story generator:
1. Our Use Case Fits RAG Perfectly
| Requirement | Why RAG Works |
|---|---|
| Learn from book collection | Easy to add books to vector DB |
| Multiple genres/styles | Retrieval finds relevant style samples |
| Users add their own books | No retraining needed |
| Works offline | Ollama + local ChromaDB |
| Educational project | RAG is easier to understand and debug |
2. The Alternatives Don't Fit
| Method | Why Not for This Project |
|---|---|
| Prompt Engineering | Can't fit entire book collection |
| Fine-Tuning | Too expensive, can't easily add books |
| Knowledge Graphs | Style/prose can't be structured as graphs |
3. RAG Handles Our Scale
Typical user's book collection:
- 50-200 ebooks
- 5-20 million words total
- 50,000-200,000 chunks
RAG handles this easily:
- ChromaDB can store millions of vectors
- Search takes <100ms even with 200K chunks
- Adding new books takes seconds
4. Style Learning Works Well with RAG
For creative writing, we don't need exact fact retrieval. We need style examples:
# Query: "Write about a warrior discovering a cave"
# RAG retrieves passages about:
# - Warriors in various situations
# - Cave discoveries
# - Mysterious findings
# These serve as STYLE EXAMPLES, not facts
# The LLM learns "how to write" from them
# Output naturally varies based on what's retrieved
RAG Deep Dive: How It Actually Works
Now let's understand the mechanics:
The RAG Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ╔═══════════════════════════════════════════════════════════╗ │
│ ║ OFFLINE PHASE (One-time setup) ║ │
│ ╠═══════════════════════════════════════════════════════════╣ │
│ ║ ║ │
│ ║ Documents ──▶ Parse ──▶ Chunk ──▶ Embed ──▶ Store ║ │
│ ║ ║ │
│ ║ ┌─────────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ║ │
│ ║ │ Ebooks │─▶│Extract│─▶│ Split │─▶│Vector │─▶│ChromaDB ║ │
│ ║ │PDF/EPUB │ │ Text │ │ 500ch │ │ 384d │ │ │ ║ │
│ ║ └─────────┘ └───────┘ └───────┘ └───────┘ └───────┘ ║ │
│ ║ ║ │
│ ╚═══════════════════════════════════════════════════════════╝ │
│ │
│ ╔═══════════════════════════════════════════════════════════╗ │
│ ║ ONLINE PHASE (Every query) ║ │
│ ╠═══════════════════════════════════════════════════════════╣ │
│ ║ ║ │
│ ║ Query ──▶ Embed ──▶ Search ──▶ Retrieve ──▶ Augment ──▶ Generate ║ │
│ ║ ║ │
│ ║ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ║ │
│ ║ │"Write │─▶│Query │─▶│Cosine │─▶│Top 5 │─▶│Prompt + │ ║ │
│ ║ │about..│ │Vector │ │Search │ │Chunks │ │Context │ ║ │
│ ║ └───────┘ └───────┘ └───────┘ └───────┘ └────┬────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────┐ ║ │
│ ║ │ LLM │ ║ │
│ ║ │Generate │ ║ │
│ ║ └────┬────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────┐ ║ │
│ ║ │ Story │ ║ │
│ ║ │ Output │ ║ │
│ ║ └─────────┘ ║ │
│ ╚═══════════════════════════════════════════════════════════╝ │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 1: Document Parsing
# We support multiple ebook formats
def parse_ebook(filepath):
    # filepath is a pathlib.Path
    ext = filepath.suffix.lower()
    if ext == '.pdf':
        return extract_pdf_text(filepath)    # PyMuPDF
    elif ext == '.epub':
        return extract_epub_text(filepath)   # ebooklib
    elif ext == '.mobi':
        return extract_mobi_text(filepath)   # mobi library
    elif ext == '.txt':
        return filepath.read_text(encoding='utf-8')
    else:
        raise ValueError(f"Unsupported ebook format: {ext}")

# Output: Raw text string
# "Chapter 1\n\nThe young warrior stood at the edge..."
Step 2: Text Chunking
Why we chunk:
Problem: A novel has 80,000 words
- Can't embed entire book (embedding models have limits)
- Can't retrieve entire book (wastes context window)
- Need granularity for relevant retrieval
Solution: Split into chunks
- Each chunk is self-contained
- Small enough to embed
- Large enough to be meaningful
Chunking strategies compared:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Fixed-size | Every 500 characters | Simple, consistent | May cut mid-sentence |
| Sentence | Split on periods | Natural boundaries | Variable sizes |
| Paragraph | Split on newlines | Preserves context | Very variable sizes |
| Semantic | Split on topic change | Best relevance | Complex, slow |
We use fixed-size with overlap:
def chunk_text(text, size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunk = text[start:end]

        # Try to break at sentence boundary
        last_period = chunk.rfind('. ')
        if last_period > size * 0.5:
            chunk = chunk[:last_period + 1]
            end = start + last_period + 1

        chunks.append(chunk)
        start = end - overlap  # Overlap!
    return chunks
Text: [AAAAA][BBBBB][CCCCC][DDDDD][EEEEE]
Chunks with overlap:
1: [AAAAA][BB]
2: [BB][BBBBB][CC]
3: [CC][CCCCC][DD]
4: [DD][DDDDD][EE]
5: [EE][EEEEE]
Overlap ensures we don't lose context at boundaries!
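Here's a quick way to see the overlap in action with the chunk_text function above (toy text and smaller sizes than we use in practice, so the output stays readable):

sample = "The warrior drew his blade. " * 40   # ~1,100 characters of toy text
chunks = chunk_text(sample, size=200, overlap=30)

for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i} ({len(chunk)} chars): ...{chunk[-30:]!r}")
# The last ~30 characters of each chunk reappear at the start of the next one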
Step 3: Embedding
Convert text to vectors that capture meaning:
from sentence_transformers import SentenceTransformer
# Load model (downloads ~100MB first time)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Embed text
text = "The warrior drew his ancient blade"
vector = model.encode(text)
print(vector.shape) # (384,)
print(vector[:5]) # [0.23, -0.45, 0.67, 0.12, -0.89]
Why embeddings work:
# Similar meanings → Similar vectors
v1 = embed("The warrior drew his sword")
v2 = embed("The fighter unsheathed his blade")
v3 = embed("I like pizza")
cosine_similarity(v1, v2) # 0.89 - very similar!
cosine_similarity(v1, v3) # 0.12 - very different!
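The embed and cosine_similarity helpers above are shorthand; here's one way to wire them up with sentence-transformers utilities (your exact similarity scores will differ from the illustrative numbers above):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def embed(text):
    return model.encode(text)

def cosine_similarity(v1, v2):
    # util.cos_sim returns a 1x1 tensor for a single pair of vectors
    return float(util.cos_sim(v1, v2))

v1 = embed("The warrior drew his sword")
v2 = embed("The fighter unsheathed his blade")
v3 = embed("I like pizza")

print(cosine_similarity(v1, v2))   # high - similar meaning
print(cosine_similarity(v1, v3))   # low - unrelated meaning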
Step 4: Vector Storage (ChromaDB)
import chromadb

# Create persistent database
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="story_styles",
    metadata={"description": "Writing style samples"}
)

# Add documents
collection.add(
    ids=["chunk_001", "chunk_002", "chunk_003"],
    documents=[
        "The warrior drew his blade...",
        "Magic sparkled in the air...",
        "The ancient tome revealed..."
    ],
    embeddings=[
        [0.23, -0.45, ...],   # 384 dimensions each
        [0.12, 0.67, ...],
        [-0.34, 0.21, ...]
    ],
    metadatas=[
        {"source": "book1.txt", "chunk_id": 1},
        {"source": "book1.txt", "chunk_id": 2},
        {"source": "book2.txt", "chunk_id": 1}
    ]
)

print(f"Stored {collection.count()} chunks")
Step 5: Retrieval
# User's query
query = "A young cultivator discovers a mysterious cave"

# Embed the query
query_vector = model.encode(query)

# Search for similar chunks
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    include=["documents", "distances", "metadatas"]
)

# Results contain the most relevant passages
for i, doc in enumerate(results['documents'][0]):
    print(f"Result {i+1} (distance: {results['distances'][0][i]:.3f}):")
    print(f"  {doc[:100]}...")
    print(f"  Source: {results['metadatas'][0][i]['source']}")
Step 6: Augmented Generation
# Build the augmented prompt
def build_prompt(query, retrieved_passages):
    context = "\n\n---\n\n".join(retrieved_passages)
    prompt = f"""Here are some example passages showing the writing style to follow:

{context}

---

Now write a NEW story passage in a similar style.

Story idea: {query}

Requirements:
- Match the writing style of the examples above
- Create original content (don't copy)
- Include vivid descriptions and dialogue

Story:
"""
    return prompt

# Generate
prompt = build_prompt(user_query, retrieved_passages)
story = llm.generate(prompt)
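The llm.generate call is a stand-in for whichever backend you choose. As one concrete option, here's a minimal sketch using the ollama Python package against a locally running Ollama server (the model name is just an example - pull whichever model you prefer):

import ollama

def generate_story(prompt, model="qwen2.5"):
    # Requires a local Ollama server with the model pulled, e.g. `ollama pull qwen2.5`
    response = ollama.generate(model=model, prompt=prompt)
    return response["response"]

story = generate_story(build_prompt(user_query, retrieved_passages))
print(story[:500])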
The Architecture for Our Story Generator
┌────────────────────────────────────────────────────────────────────────┐
│ AI RAG STORY GENERATOR │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ YOUR EBOOK COLLECTION │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ data/raw/ │ │
│ │ ├── fantasy_novel.epub │ │
│ │ ├── xianxia_story.pdf │ │
│ │ ├── magic_school.mobi │ │
│ │ └── cultivation_tale.txt │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ PARSE (parse_ebooks.py) │ │
│ │ • Extract text from PDF, EPUB, MOBI, TXT │ │
│ │ • Clean and normalize text │ │
│ │ • Output: data/txt/*.txt │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ BUILD DATABASE (build_style_db.py) │ │
│ │ • Chunk text (500 chars, 50 overlap) │ │
│ │ • Generate embeddings (SentenceTransformer) │ │
│ │ • Store in ChromaDB │ │
│ │ • Output: chroma_db/ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ VECTOR DATABASE (ChromaDB) │ │
│ │ • Stores: text chunks + embeddings + metadata │ │
│ │ • Enables: fast similarity search │ │
│ │ • Persists: survives restarts │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ CLI MODE │ │ WEB UI │ │
│ │ generate_ │ │ app.py │ │
│ │ with_style │ │ (Gradio) │ │
│ │ .py │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └────────────────┬───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GENERATION PIPELINE │ │
│ │ │ │
│ │ User: "Write about a cultivator finding a cave" │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Embed query with SentenceTransformer │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 2. Search ChromaDB for similar passages │ │ │
│ │ │ Returns: 3-5 style samples │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 3. Build augmented prompt │ │ │
│ │ │ [Style examples] + [User request] │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 4. Send to LLM │ │ │
│ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │
│ │ │ │ Ollama │ │ OpenAI │ │ Claude │ │ Gemini │ │ │ │
│ │ │ │ (local) │ │ (API) │ │ (API) │ │ (API) │ │ │ │
│ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 5. Return generated story in learned style │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
What We'll Build in This Series
Part 1 (This Article)
- ✅ Understanding RAG concepts
- ✅ Comparing all alternatives in detail
- ✅ Deep dive into RAG pros and cons
- ✅ Current limitations
- ✅ Why we chose RAG
- ✅ How RAG works step-by-step
- ✅ Architecture overview
Part 2: Building the RAG Pipeline
- Project setup and dependencies
- Parsing ebooks (PDF, EPUB, MOBI) with code
- Text chunking implementation
- Generating embeddings with Sentence Transformers
- Storing in ChromaDB
- Testing retrieval quality
Part 3: Story Generation
- Connecting to LLMs (Ollama, OpenAI, Claude)
- Prompt engineering for style transfer
- Single chapter generation
- Multi-chapter story generation
- Maintaining consistency with summaries
- Web interface with Gradio
Prerequisites
Before Part 2, make sure you have:
- Python 3.10+ installed
- 8GB+ RAM (16GB recommended for larger models)
- Some ebooks to learn from (any genre!)
- (Optional) Ollama for local LLM inference
Clone the repository:
git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator
Summary
| Topic | Key Takeaway |
|---|---|
| The Problem | LLMs don't know your custom data, have knowledge cutoffs, and generate generic content |
| Alternatives | Prompt engineering (simple but limited), Fine-tuning (powerful but expensive), RAG (balanced), Knowledge graphs (structured data) |
| RAG Pros | No training, easy updates, scalable, transparent, cost-effective, model-agnostic |
| RAG Cons | Retrieval quality dependency, added latency, chunking challenges, context assembly issues |
| Limitations | Semantic gaps, embedding limits, no cross-document reasoning (in basic RAG) |
| Why RAG for Us | Fits our use case perfectly: large book collections, easy updates, style learning |
| How RAG Works | Parse, Chunk, Embed, Store, Query, Retrieve, Augment, Generate |
Next Article: Part 2: Building the RAG Pipeline →
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Found this helpful? Follow me for Parts 2 and 3!