Learn what RAG is and why we choose it over fine-tuning and other alternatives, with detailed comparisons, pros and cons, and current limitations.
Have you ever wanted an AI to write stories in your favorite author's style? Or wished ChatGPT knew about your company's internal documents?
That's exactly what RAG (Retrieval-Augmented Generation) enables.
In this 3-part tutorial series, we'll build a complete AI story generator that learns writing styles from your ebook collection. By the end, you'll understand RAG deeply—not just theoretically, but through hands-on implementation.
What we're building:
- A system that learns writing styles from any ebook collection
- Multi-chapter story generation with consistency
- Support for multiple LLM backends (Ollama, OpenAI, Claude)
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Table of Contents
- The Problem: LLMs Don't Know Your Data
- Methods to Add Custom Knowledge
- Deep Dive: RAG Pros and Cons
- Current Limitations of RAG
- Why We Choose RAG for This Project
- How RAG Actually Works
- Our Architecture
The Problem: LLMs Don't Know Your Data
Large Language Models like GPT-4, Claude, and Llama are trained on massive datasets from the internet. They're incredibly capable, but they have fundamental limitations:
1. Knowledge Cutoff
Models only know information up to their training date.
You: "What happened in the 2024 Olympics?"
GPT-4: "I don't have information about events after April 2024..."
More importantly for us: LLMs don't know about your personal book collection, your company documents, or any private data.
2. Generic Responses
Ask an LLM to write a story, and you'll get competent but generic prose:
Prompt: "Write about a cultivator discovering a cave"
Generic LLM Response:
"The young man walked into the cave. It was dark and mysterious.
He felt a strange energy. Something powerful was hidden here..."
What we want (Xianxia style):
"Chen Wei's spiritual sense trembled as he pushed through the
waterfall. Ancient qi, dense enough to manifest as mist, swirled
within the cave mouth. His dantian resonated with a frequency
he had only read about in the Celestial Archives—the signature
of a Nascent Soul realm cultivator's inheritance..."
3. Hallucinations
Without access to source material, LLMs confidently generate plausible-sounding but incorrect information:
You: "What does Chapter 7 of my company handbook say about vacation policy?"
LLM: "According to your handbook, employees receive 15 days..."
(completely made up - it has no access to your handbook!)
4. No Personalization
Everyone gets the same model. A fantasy author and a technical writer get identical responses to the same prompt. There's no way to customize the model to your specific domain without significant effort.
5. Context Window Limitations
Even if you try to paste your entire book into the prompt:
| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4K tokens | ~3,000 words |
| GPT-4 | 8K-128K tokens | ~6,000-96,000 words |
| Claude 3 | 200K tokens | ~150,000 words |
A typical novel is 80,000-100,000 words. A book collection? Millions of words. You simply can't fit everything in the context window.
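If you're curious how your own manuscript measures up, here's a quick sketch that estimates token counts with the tiktoken library (the file path is a placeholder, and the words-to-tokens ratio varies by text and tokenizer):

import tiktoken

# GPT-4-style tokenizer; other models use different encodings
enc = tiktoken.encoding_for_model("gpt-4")

with open("my_novel.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"{len(text.split()):,} words ≈ {len(tokens):,} tokens")
# An 80,000-word novel is roughly 100K+ tokens - too large for most context windows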
Methods to Add Custom Knowledge to LLMs
There are several approaches to make LLMs work with your custom data. Let's explore each one in detail:
Method 1: Prompt Engineering (In-Context Learning)
How it works: Paste your data directly into the prompt.
prompt = f"""
You are a story writer. Here are some example passages to follow:
Example 1:
{example_passage_1}
Example 2:
{example_passage_2}
Example 3:
{example_passage_3}
Now write a story about: {user_request}
"""
Pros:
- Zero setup - Just copy-paste, no infrastructure needed
- Immediate - Results in seconds, no preprocessing
- Flexible - Change examples anytime
- No training - Works with any off-the-shelf model
Cons:
- Context limits - Can only fit 3-10 examples (can't represent diverse styles)
- High cost - Pay for example tokens every call ($0.01-0.10 per request)
- No intelligence - Must manually choose examples (may pick irrelevant ones)
- Doesn't scale - 100 books = impossible
Best for: Quick prototypes, very small datasets (<10 pages)
Method 2: Fine-Tuning
How it works: Train the model's weights on your specific data.
# Conceptual fine-tuning workflow
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You write in xianxia style"},
            {"role": "user", "content": "Write about a breakthrough"},
            {"role": "assistant", "content": "The qi vortex above Chen Wei's head..."}
        ]
    },
    # ... hundreds or thousands more examples
]

# Train the model (this costs money and time!)
# (In the real OpenAI API, training_file must be the ID of a previously uploaded file.)
fine_tuned_model = openai.fine_tuning.jobs.create(
    training_file="training_data.jsonl",
    model="gpt-3.5-turbo"
)
Pros:
- Persistent knowledge - Style/knowledge "baked into" weights
- Fast inference - No retrieval step needed
- Deep learning - Can learn subtle patterns over time
- Consistent outputs - Same style every time
Cons:
- Expensive - GPT-3.5 fine-tuning costs $50-500+ per training run
- Time-consuming - Hours to days of training, slow iteration
- Expertise required - Need to understand ML concepts (high barrier)
- Catastrophic forgetting - Model may lose general capabilities
- Static knowledge - Can't update without retraining (adding one book = full retrain)
- Data preparation - Need to format data properly (hours of prep work)
- Overfitting risk - Model memorizes instead of learning
Best for: Production systems with stable data, when you need consistent style
Method 3: RAG (Retrieval-Augmented Generation)
How it works: Store data in a searchable database, retrieve relevant pieces at query time.
# RAG workflow
def generate_with_rag(user_query):
    # 1. Search for relevant content
    relevant_passages = vector_db.search(user_query, top_k=5)

    # 2. Build augmented prompt
    prompt = f"""
Reference material:
{relevant_passages}

User request: {user_query}
"""

    # 3. Generate with context
    return llm.generate(prompt)
Pros:
- No training - Use any model off-the-shelf
- Easy updates - Add/remove documents instantly
- Scalable - Handle millions of documents
- Transparent - See exactly what was retrieved
- Cost-effective - Only embed once, query forever
- Model-agnostic - Same DB works with any LLM
- Grounded responses - Output based on real sources
Cons:
- Retrieval quality - Bad retrieval = bad output (need good embeddings)
- Additional latency - Search adds 100-500ms (slower than fine-tuning)
- Infrastructure - Need vector database (more moving parts)
- Chunking challenges - How to split documents affects retrieval quality
- Context assembly - Retrieved chunks may not flow naturally
- Embedding costs - Need to embed all documents (one-time cost)
Best for: Large/dynamic knowledge bases, when data changes frequently
Method 4: Knowledge Graphs
How it works: Structure data as entities and relationships.
[Brandon Sanderson] --wrote--> [Mistborn] --has_magic_system--> [Allomancy]
                                   |
                                   +--has_character--> [Vin]
                                                          |
                                                          +--has_trait--> [Street Urchin]
                                                          +--has_power--> [Mistborn]
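To make this concrete, here's a minimal sketch of the same graph using the networkx library (node and relation names mirror the diagram; the "Mistborn (power)" node is renamed slightly so it doesn't collide with the book node):

import networkx as nx

# Entities are nodes, relationships are labeled edges
G = nx.DiGraph()
G.add_edge("Brandon Sanderson", "Mistborn", relation="wrote")
G.add_edge("Mistborn", "Allomancy", relation="has_magic_system")
G.add_edge("Mistborn", "Vin", relation="has_character")
G.add_edge("Vin", "Street Urchin", relation="has_trait")
G.add_edge("Vin", "Mistborn (power)", relation="has_power")

# Structured query: which characters appear in a given book?
characters = [
    target for _, target, data in G.out_edges("Mistborn", data=True)
    if data["relation"] == "has_character"
]
print(characters)  # ['Vin']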
Pros:
- Explicit relationships - Captures "how things connect"
- Complex queries - "Find all characters who use fire magic"
- Reasoning - Can infer new relationships
- Structured output - Clean, organized data
Cons:
- Complex to build - Must define schema, extract entities (weeks of work)
- Maintenance burden - New data needs manual structuring (ongoing effort)
- Doesn't capture prose - Style/voice can't be graphed (bad for creative writing)
- Domain expertise - Need to understand your data deeply (high barrier)
Best for: Structured data, when relationships matter more than content
Method 5: Hybrid Approaches
Combine methods for best results:
| Hybrid Approach | How It Works | Best For |
|---|---|---|
| RAG + Fine-tuning | Fine-tune for style, RAG for facts | News/research writing |
| RAG + Knowledge Graph | Graph for structure, RAG for content | Complex domains |
| Multi-stage RAG | Retrieve, Rerank, Generate | High-precision needs |
| RAG + Prompt Engineering | RAG retrieves, few-shot guides format | Specific output formats |
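To give a flavor of one hybrid, here's a minimal multi-stage RAG sketch (retrieve, rerank, generate) using a cross-encoder from sentence-transformers; the vector_db and llm objects are assumed to exist, as in the RAG pseudocode above:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def generate_with_reranking(user_query, top_k=20, keep=5):
    # Stage 1: broad semantic retrieval (assumed vector_db, as in Method 3)
    candidates = vector_db.search(user_query, top_k=top_k)

    # Stage 2: rerank candidates by query-passage relevance
    scores = reranker.predict([(user_query, passage) for passage in candidates])
    ranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]

    # Stage 3: generate with only the highest-precision context
    context = "\n\n".join(ranked[:keep])
    return llm.generate(f"Reference material:\n{context}\n\nUser request: {user_query}")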
Comparison Table: All Methods
| Criteria | Prompt Eng. | Fine-Tuning | RAG | Knowledge Graph |
|---|---|---|---|---|
| Setup Time | Minutes | Days | Hours | Weeks |
| Setup Cost | $0 | $50-500 | $0-50 | $100+ |
| Per-Query Cost | High | Low | Medium | Low |
| Technical Skill | Low | High | Medium | High |
| Knowledge Update | Instant | Re-train | Instant | Manual |
| Max Data Size | ~50 pages | Unlimited | Millions of docs | Millions of nodes |
| Retrieval Intelligence | None | N/A | Semantic | Graph traversal |
| Output Consistency | Variable | High | Variable | High |
| Debugging | Easy | Hard | Medium | Medium |
| Style Learning | Limited | Excellent | Good | Poor |
| Fact Accuracy | Low | Medium | High | High |
Deep Dive: RAG Pros and Cons
Since we're using RAG, let's examine its strengths and weaknesses in detail:
RAG Advantages (Detailed)
1. No Training Required
Fine-tuning workflow:
1. Prepare training data (hours)
2. Format into JSONL (hours)
3. Upload and validate (minutes)
4. Train model (hours-days)
5. Test and iterate (days)
Total: Days to weeks
RAG workflow:
1. Parse documents (minutes)
2. Chunk and embed (minutes-hours)
3. Store in vector DB (minutes)
Total: Hours
2. Easy Knowledge Updates
# Adding a new book to RAG
def add_book(filepath):
    text = parse_ebook(filepath)           # 10 seconds
    chunks = chunk_text(text)              # 1 second
    embeddings = embed(chunks)             # 30 seconds
    vector_db.add(chunks, embeddings)      # 5 seconds
    # Done! New book is searchable

# Adding a new book with fine-tuning
def add_book_finetune(filepath):
    # 1. Prepare new training examples (1 hour)
    # 2. Combine with existing training data (10 min)
    # 3. Re-run fine-tuning job ($50-200, 2-8 hours)
    # 4. Test new model (1 hour)
    # 5. Deploy new model (30 min)
    # Total: ~12 hours and $50-200
    ...
3. Scalability
| Data Size | Prompt Engineering | Fine-Tuning | RAG |
|---|---|---|---|
| 10 pages | Works | Works | Works |
| 100 pages | Too big | Works | Works |
| 1,000 pages | Impossible | Expensive | Works |
| 10,000 pages | Impossible | Very expensive | Works |
| 1M pages | Impossible | Impractical | Works |
4. Transparency and Debugging
# With RAG, you can see exactly what the model sees
result = generator.generate("Write about a warrior")
# Debug: What did we retrieve?
print("Retrieved passages:")
for i, passage in enumerate(result.retrieved_context):
    print(f"{i+1}. {passage[:100]}...")
    print(f"   Similarity: {result.scores[i]}")
    print(f"   Source: {result.sources[i]}")
# If output is wrong, you know exactly where to look:
# - Bad retrieval? Improve embeddings or chunking
# - Good retrieval, bad output? Improve prompt
5. Model Agnostic
# Same knowledge base works with ANY model
knowledge_base = VectorDB("./chroma_db")
# Use with Ollama (free, local)
ollama_response = generate(knowledge_base, model="ollama/qwen2.5")
# Use with OpenAI (paid, cloud)
openai_response = generate(knowledge_base, model="gpt-4")
# Use with Claude (paid, cloud)
claude_response = generate(knowledge_base, model="claude-3-sonnet")
# Switch models without rebuilding anything!
6. Cost Effective
| Operation | Fine-Tuning Cost | RAG Cost |
|---|---|---|
| Initial setup | $50-500 | $0-10 |
| Add 1 book | $50-200 (retrain) | ~$0.01 (embed) |
| Add 100 books | $50-200 (retrain) | ~$1 (embed) |
| Query (GPT-4) | ~$0.03/query | ~$0.04/query |
| Query (Ollama) | $0 | $0 |
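The embedding numbers above come from simple arithmetic; here's a back-of-the-envelope sketch you can adapt (the per-token price and words-to-tokens ratio are assumptions - check your provider's current pricing, and note that local embedding models cost $0):

# Rough embedding-cost estimate for a book collection (all constants are assumptions)
words_per_book = 100_000
tokens_per_word = 1.3                  # rough English average
price_per_million_tokens = 0.02        # assumed hosted-embedding price in USD

def embedding_cost(num_books):
    tokens = num_books * words_per_book * tokens_per_word
    return tokens * price_per_million_tokens / 1_000_000

print(f"1 book:    ${embedding_cost(1):.4f}")
print(f"100 books: ${embedding_cost(100):.2f}")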
RAG Disadvantages (Detailed)
1. Retrieval Quality Dependency
The RAG Equation:
Final Output Quality = Retrieval Quality × Generation Quality
If retrieval finds irrelevant passages:
- User asks about "sword fighting"
- System retrieves passages about "cooking swords" (wrong!)
- LLM generates cooking-related nonsense
Retrieval failure modes:
- Semantic gap: query and relevant docs use different words
- Chunking errors: relevant info split across chunks
- Embedding limitations: model doesn't understand domain
2. Additional Latency
Request Timeline Comparison:
Direct LLM (no RAG):
[User Query] → [LLM Generate: 500ms] → [Response]
Total: ~500ms
RAG:
[User Query] → [Embed Query: 50ms] → [Vector Search: 100ms] →
[Fetch Documents: 50ms] → [Build Prompt: 10ms] → [LLM Generate: 600ms] → [Response]
Total: ~810ms (+62% slower)
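To see where the time goes in your own setup, a small sketch like this can time each stage separately (it assumes the model, collection, and llm objects we build later in the series):

import time

def timed(label, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

query = "Write about a warrior discovering a cave"
query_vector = timed("Embed query", model.encode, query)
results = timed("Vector search", collection.query,
                query_embeddings=[query_vector], n_results=5)
prompt = "\n\n".join(results["documents"][0]) + f"\n\nRequest: {query}"
story = timed("LLM generate", llm.generate, prompt)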
3. Chunking Challenges
The Chunking Dilemma:
Too Small (100 chars):
"The warrior drew his" | "sword and faced the" | "dragon with courage"
→ Loses context, meaningless fragments
Too Large (5000 chars):
[Entire chapter about many topics]
→ Dilutes relevance, wastes context, may retrieve wrong parts
Just Right (500-1000 chars):
[Complete paragraph about sword fighting]
→ Self-contained, meaningful, searchable
But even "just right" has problems:
- Important info may span two chunks
- Context from previous paragraphs lost
- Character names may not appear in every chunk
4. Context Assembly Issues
# Retrieved chunks may not flow naturally
retrieved = [
    "...he defeated the demon lord. THE END.",   # End of chapter 5
    "Chapter 1: The young warrior woke...",      # Beginning of book
    "...said Master Liu. 'Your training...'"     # Middle of dialogue
]
# Assembled context is disjointed!
# The LLM must make sense of this jumble
5. Cold Start Problem
New RAG System:
- No documents indexed yet
- User queries return nothing relevant
- Output quality = base LLM (no improvement)
Solution: Must index documents before system is useful
This takes time for large collections
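A simple guard helps: check that something has actually been indexed before relying on retrieval (a minimal sketch against the ChromaDB collection we build later in this article):

def retrieve_or_warn(collection, query_vector, n_results=5):
    # Cold start: an empty collection means retrieval can't help yet
    if collection.count() == 0:
        print("Warning: no documents indexed - output will be plain base-LLM quality")
        return []
    results = collection.query(query_embeddings=[query_vector], n_results=n_results)
    return results["documents"][0]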
Current Limitations of RAG
Understanding limitations helps you build better systems:
Retrieval Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Semantic gap | Different words for same concept | Hybrid search (keyword + semantic) |
| No cross-document reasoning | Can't connect info across books | Knowledge graphs, multi-hop retrieval |
| Recency bias | All chunks treated equally | Add timestamp metadata, boost recent |
| No negation understanding | "not about war" still retrieves war | Better query processing |
| Fixed chunk boundaries | Important info split across chunks | Overlapping chunks, larger windows |
Embedding Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Domain mismatch | General embeddings miss domain terms | Fine-tune embedding model |
| Length limits | Most models cap at 512 tokens | Chunk appropriately |
| Language bias | English-trained models struggle with other languages | Multilingual models |
| No structured data | Can't embed tables well | Special preprocessing |
Generation Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Context window | Can only fit N retrieved chunks | Summarization, selection |
| Lost in the middle | LLMs ignore middle of long contexts | Reorder important info to start/end |
| Hallucination | May still make things up | Fact-checking, citations |
| Style inconsistency | May not maintain style throughout | More style examples, fine-tuning |
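For "lost in the middle", a common trick is to reorder retrieved chunks so the strongest hits sit at the start and end of the context, where LLMs pay the most attention; a small sketch (input assumed sorted best-first):

def reorder_for_long_context(chunks_best_first):
    # Alternate chunks to the front and back, leaving the weakest in the middle
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ranks 1..5 become [1, 3, 5, 4, 2] - best chunks at both ends
print(reorder_for_long_context([1, 2, 3, 4, 5]))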
Our Tutorial's Specific Limitations
Since this is a learning-focused tutorial, we've made simplifying choices:
┌─────────────────────────────────────────────────────────────────┐
│ WHAT THIS TUTORIAL COVERS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ Basic RAG pipeline (ingest → embed → store → retrieve) │
│ ✓ Simple fixed-size chunking with overlap │
│ ✓ Single embedding model (no fine-tuning) │
│ ✓ Basic similarity search (no reranking) │
│ ✓ Single-query retrieval (no query expansion) │
│ ✓ Straightforward prompt templates │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION SYSTEMS WOULD ADD │
├─────────────────────────────────────────────────────────────────┤
│ ○ Hybrid search (BM25 keyword + semantic vectors) │
│ ○ Query expansion ("sword" → "sword, blade, weapon") │
│ ○ Cross-encoder reranking for better precision │
│ ○ Semantic chunking (split on topic boundaries) │
│ ○ Metadata filtering (by author, genre, date) │
│ ○ Caching layer for repeated queries │
│ ○ Evaluation metrics (retrieval recall, generation quality) │
│ ○ A/B testing for prompt variations │
│ ○ Streaming responses for better UX │
│ ○ Rate limiting and cost management │
│ ○ Observability (logging, tracing, metrics) │
└─────────────────────────────────────────────────────────────────┘
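Some of these additions are small. Metadata filtering, for example, is built into ChromaDB's query API; here's a sketch assuming chunks were stored with the source metadata field we add later in this article:

# Only retrieve style samples from one specific book
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    where={"source": "xianxia_story.pdf"}   # metadata filter
)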
Why We Choose RAG for This Project
Given all the above, here's why RAG is the right choice for our story generator:
1. Our Use Case Fits RAG Perfectly
| Requirement | Why RAG Works |
|---|---|
| Learn from book collection | Easy to add books to vector DB |
| Multiple genres/styles | Retrieval finds relevant style samples |
| Users add their own books | No retraining needed |
| Works offline | Ollama + local ChromaDB |
| Educational project | RAG is easier to understand and debug |
2. The Alternatives Don't Fit
| Method | Why Not for This Project |
|---|---|
| Prompt Engineering | Can't fit entire book collection |
| Fine-Tuning | Too expensive, can't easily add books |
| Knowledge Graphs | Style/prose can't be structured as graphs |
3. RAG Handles Our Scale
Typical user's book collection:
- 50-200 ebooks
- 5-20 million words total
- 50,000-200,000 chunks
RAG handles this easily:
- ChromaDB can store millions of vectors
- Search takes <100ms even with 200K chunks
- Adding new books takes seconds
4. Style Learning Works Well with RAG
For creative writing, we don't need exact fact retrieval. We need style examples:
# Query: "Write about a warrior discovering a cave"
# RAG retrieves passages about:
# - Warriors in various situations
# - Cave discoveries
# - Mysterious findings
# These serve as STYLE EXAMPLES, not facts
# The LLM learns "how to write" from them
# Output naturally varies based on what's retrieved
RAG Deep Dive: How It Actually Works
Now let's understand the mechanics:
The RAG Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ╔═══════════════════════════════════════════════════════════╗ │
│ ║ OFFLINE PHASE (One-time setup) ║ │
│ ╠═══════════════════════════════════════════════════════════╣ │
│ ║ ║ │
│ ║ Documents ──▶ Parse ──▶ Chunk ──▶ Embed ──▶ Store ║ │
│ ║ ║ │
│ ║ ┌─────────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ║ │
│ ║ │ Ebooks │─▶│Extract│─▶│ Split │─▶│Vector │─▶│ChromaDB ║ │
│ ║ │PDF/EPUB │ │ Text │ │ 500ch │ │ 384d │ │ │ ║ │
│ ║ └─────────┘ └───────┘ └───────┘ └───────┘ └───────┘ ║ │
│ ║ ║ │
│ ╚═══════════════════════════════════════════════════════════╝ │
│ │
│ ╔═══════════════════════════════════════════════════════════╗ │
│ ║ ONLINE PHASE (Every query) ║ │
│ ╠═══════════════════════════════════════════════════════════╣ │
│ ║ ║ │
│ ║ Query ──▶ Embed ──▶ Search ──▶ Retrieve ──▶ Augment ──▶ Generate ║ │
│ ║ ║ │
│ ║ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ║ │
│ ║ │"Write │─▶│Query │─▶│Cosine │─▶│Top 5 │─▶│Prompt + │ ║ │
│ ║ │about..│ │Vector │ │Search │ │Chunks │ │Context │ ║ │
│ ║ └───────┘ └───────┘ └───────┘ └───────┘ └────┬────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────┐ ║ │
│ ║ │ LLM │ ║ │
│ ║ │Generate │ ║ │
│ ║ └────┬────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────┐ ║ │
│ ║ │ Story │ ║ │
│ ║ │ Output │ ║ │
│ ║ └─────────┘ ║ │
│ ╚═══════════════════════════════════════════════════════════╝ │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 1: Document Parsing
# We support multiple ebook formats
def parse_ebook(filepath):
    # filepath is a pathlib.Path
    ext = filepath.suffix.lower()
    if ext == '.pdf':
        return extract_pdf_text(filepath)    # PyMuPDF
    elif ext == '.epub':
        return extract_epub_text(filepath)   # ebooklib
    elif ext == '.mobi':
        return extract_mobi_text(filepath)   # mobi library
    elif ext == '.txt':
        return filepath.read_text(encoding='utf-8')
    else:
        raise ValueError(f"Unsupported ebook format: {ext}")

# Output: Raw text string
# "Chapter 1\n\nThe young warrior stood at the edge..."
Step 2: Text Chunking
Why we chunk:
Problem: A novel has 80,000 words
- Can't embed entire book (embedding models have limits)
- Can't retrieve entire book (wastes context window)
- Need granularity for relevant retrieval
Solution: Split into chunks
- Each chunk is self-contained
- Small enough to embed
- Large enough to be meaningful
Chunking strategies compared:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Fixed-size | Every 500 characters | Simple, consistent | May cut mid-sentence |
| Sentence | Split on periods | Natural boundaries | Variable sizes |
| Paragraph | Split on newlines | Preserves context | Very variable sizes |
| Semantic | Split on topic change | Best relevance | Complex, slow |
We use fixed-size with overlap:
def chunk_text(text, size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunk = text[start:end]

        # Try to break at sentence boundary
        last_period = chunk.rfind('. ')
        if last_period > size * 0.5:
            chunk = chunk[:last_period + 1]
            end = start + last_period + 1

        chunks.append(chunk)
        start = end - overlap  # Overlap!
    return chunks
Text: [AAAAA][BBBBB][CCCCC][DDDDD][EEEEE]
Chunks with overlap:
1: [AAAAA][BB]
2: [BB][BBBBB][CC]
3: [CC][CCCCC][DD]
4: [DD][DDDDD][EE]
5: [EE][EEEEE]
Overlap ensures we don't lose context at boundaries!
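Here's a quick way to see the overlap in action with the chunk_text function above (toy text and smaller sizes than we use in practice, so the output stays readable):

sample = "The warrior drew his blade. " * 40   # ~1,100 characters of toy text
chunks = chunk_text(sample, size=200, overlap=30)

for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i} ({len(chunk)} chars): ...{chunk[-30:]!r}")
# The last ~30 characters of each chunk reappear at the start of the next one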
Step 3: Embedding
Convert text to vectors that capture meaning:
from sentence_transformers import SentenceTransformer
# Load model (downloads ~100MB first time)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Embed text
text = "The warrior drew his ancient blade"
vector = model.encode(text)
print(vector.shape) # (384,)
print(vector[:5]) # [0.23, -0.45, 0.67, 0.12, -0.89]
Why embeddings work:
# Similar meanings → Similar vectors
v1 = embed("The warrior drew his sword")
v2 = embed("The fighter unsheathed his blade")
v3 = embed("I like pizza")
cosine_similarity(v1, v2) # 0.89 - very similar!
cosine_similarity(v1, v3) # 0.12 - very different!
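The embed and cosine_similarity helpers above are shorthand; here's one way to wire them up with sentence-transformers utilities (your exact similarity scores will differ from the illustrative numbers above):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def embed(text):
    return model.encode(text)

def cosine_similarity(v1, v2):
    # util.cos_sim returns a 1x1 tensor for a single pair of vectors
    return float(util.cos_sim(v1, v2))

v1 = embed("The warrior drew his sword")
v2 = embed("The fighter unsheathed his blade")
v3 = embed("I like pizza")

print(cosine_similarity(v1, v2))   # high - similar meaning
print(cosine_similarity(v1, v3))   # low - unrelated meaning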
Step 4: Vector Storage (ChromaDB)
import chromadb

# Create persistent database
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="story_styles",
    metadata={"description": "Writing style samples"}
)

# Add documents
collection.add(
    ids=["chunk_001", "chunk_002", "chunk_003"],
    documents=[
        "The warrior drew his blade...",
        "Magic sparkled in the air...",
        "The ancient tome revealed..."
    ],
    embeddings=[
        [0.23, -0.45, ...],   # 384 dimensions each
        [0.12, 0.67, ...],
        [-0.34, 0.21, ...]
    ],
    metadatas=[
        {"source": "book1.txt", "chunk_id": 1},
        {"source": "book1.txt", "chunk_id": 2},
        {"source": "book2.txt", "chunk_id": 1}
    ]
)

print(f"Stored {collection.count()} chunks")
Step 5: Retrieval
# User's query
query = "A young cultivator discovers a mysterious cave"

# Embed the query
query_vector = model.encode(query)

# Search for similar chunks
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    include=["documents", "distances", "metadatas"]
)

# Results contain the most relevant passages
for i, doc in enumerate(results['documents'][0]):
    print(f"Result {i+1} (distance: {results['distances'][0][i]:.3f}):")
    print(f"  {doc[:100]}...")
    print(f"  Source: {results['metadatas'][0][i]['source']}")
Step 6: Augmented Generation
# Build the augmented prompt
def build_prompt(query, retrieved_passages):
    context = "\n\n---\n\n".join(retrieved_passages)
    prompt = f"""Here are some example passages showing the writing style to follow:

{context}

---

Now write a NEW story passage in a similar style.

Story idea: {query}

Requirements:
- Match the writing style of the examples above
- Create original content (don't copy)
- Include vivid descriptions and dialogue

Story:
"""
    return prompt

# Generate
prompt = build_prompt(user_query, retrieved_passages)
story = llm.generate(prompt)
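The llm.generate call is a stand-in for whichever backend you choose. As one concrete option, here's a minimal sketch using the ollama Python package against a locally running Ollama server (the model name is just an example - pull whichever model you prefer):

import ollama

def generate_story(prompt, model="qwen2.5"):
    # Requires a local Ollama server with the model pulled, e.g. `ollama pull qwen2.5`
    response = ollama.generate(model=model, prompt=prompt)
    return response["response"]

story = generate_story(build_prompt(user_query, retrieved_passages))
print(story[:500])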
The Architecture for Our Story Generator
┌────────────────────────────────────────────────────────────────────────┐
│ AI RAG STORY GENERATOR │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ YOUR EBOOK COLLECTION │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ data/raw/ │ │
│ │ ├── fantasy_novel.epub │ │
│ │ ├── xianxia_story.pdf │ │
│ │ ├── magic_school.mobi │ │
│ │ └── cultivation_tale.txt │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ PARSE (parse_ebooks.py) │ │
│ │ • Extract text from PDF, EPUB, MOBI, TXT │ │
│ │ • Clean and normalize text │ │
│ │ • Output: data/txt/*.txt │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ BUILD DATABASE (build_style_db.py) │ │
│ │ • Chunk text (500 chars, 50 overlap) │ │
│ │ • Generate embeddings (SentenceTransformer) │ │
│ │ • Store in ChromaDB │ │
│ │ • Output: chroma_db/ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ VECTOR DATABASE (ChromaDB) │ │
│ │ • Stores: text chunks + embeddings + metadata │ │
│ │ • Enables: fast similarity search │ │
│ │ • Persists: survives restarts │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ CLI MODE │ │ WEB UI │ │
│ │ generate_ │ │ app.py │ │
│ │ with_style │ │ (Gradio) │ │
│ │ .py │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └────────────────┬───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GENERATION PIPELINE │ │
│ │ │ │
│ │ User: "Write about a cultivator finding a cave" │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Embed query with SentenceTransformer │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 2. Search ChromaDB for similar passages │ │ │
│ │ │ Returns: 3-5 style samples │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 3. Build augmented prompt │ │ │
│ │ │ [Style examples] + [User request] │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 4. Send to LLM │ │ │
│ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │
│ │ │ │ Ollama │ │ OpenAI │ │ Claude │ │ Gemini │ │ │ │
│ │ │ │ (local) │ │ (API) │ │ (API) │ │ (API) │ │ │ │
│ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 5. Return generated story in learned style │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
What We'll Build in This Series
Part 1 (This Article)
- ✅ Understanding RAG concepts
- ✅ Comparing all alternatives in detail
- ✅ Deep dive into RAG pros and cons
- ✅ Current limitations
- ✅ Why we chose RAG
- ✅ How RAG works step-by-step
- ✅ Architecture overview
Part 2: Building the RAG Pipeline
- Project setup and dependencies
- Parsing ebooks (PDF, EPUB, MOBI) with code
- Text chunking implementation
- Generating embeddings with Sentence Transformers
- Storing in ChromaDB
- Testing retrieval quality
Part 3: Story Generation
- Connecting to LLMs (Ollama, OpenAI, Claude)
- Prompt engineering for style transfer
- Single chapter generation
- Multi-chapter story generation
- Maintaining consistency with summaries
- Web interface with Gradio
Prerequisites
Before Part 2, make sure you have:
- Python 3.10+ installed
- 8GB+ RAM (16GB recommended for larger models)
- Some ebooks to learn from (any genre!)
- (Optional) Ollama for local LLM inference
Clone the repository:
git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator
Summary
| Topic | Key Takeaway |
|---|---|
| The Problem | LLMs don't know your custom data, have knowledge cutoffs, and generate generic content |
| Alternatives | Prompt engineering (simple but limited), Fine-tuning (powerful but expensive), RAG (balanced), Knowledge graphs (structured data) |
| RAG Pros | No training, easy updates, scalable, transparent, cost-effective, model-agnostic |
| RAG Cons | Retrieval quality dependency, added latency, chunking challenges, context assembly issues |
| Limitations | Semantic gaps, embedding limits, no cross-document reasoning (in basic RAG) |
| Why RAG for Us | Fits our use case perfectly: large book collections, easy updates, style learning |
| How RAG Works | Parse, Chunk, Embed, Store, Query, Retrieve, Augment, Generate |
Next Article: Part 2: Building the RAG Pipeline →
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Found this helpful? Follow me for Parts 2 and 3!