Nam Tran

Build Your Own AI Story Generator with RAG - Part 1: Understanding RAG

Learn what RAG is and why we chose it over fine-tuning and other alternatives, with detailed comparisons, pros and cons, and current limitations.

Have you ever wanted an AI to write stories in your favorite author's style? Or wished ChatGPT knew about your company's internal documents?

That's exactly what RAG (Retrieval-Augmented Generation) enables.

In this 3-part tutorial series, we'll build a complete AI story generator that learns writing styles from your ebook collection. By the end, you'll understand RAG deeply—not just theoretically, but through hands-on implementation.

What we're building:

  • A system that learns writing styles from any ebook collection
  • Multi-chapter story generation with consistency
  • Support for multiple LLM backends (Ollama, OpenAI, Claude)

Source Code: github.com/namtran/ai-rag-tutorial-story-generator


Table of Contents

  1. The Problem: LLMs Don't Know Your Data
  2. Methods to Add Custom Knowledge
  3. Deep Dive: RAG Pros and Cons
  4. Current Limitations of RAG
  5. Why We Choose RAG for This Project
  6. How RAG Actually Works
  7. Our Architecture

The Problem: LLMs Don't Know Your Data

Large Language Models like GPT-4, Claude, and Llama are trained on massive datasets from the internet. They're incredibly capable, but they have fundamental limitations:

1. Knowledge Cutoff

Models only know information up to their training date.

You: "What happened in the 2024 Olympics?"
GPT-4: "I don't have information about events after April 2024..."

More importantly for us: LLMs don't know about your personal book collection, your company documents, or any private data.

2. Generic Responses

Ask an LLM to write a story, and you'll get competent but generic prose:

Prompt: "Write about a cultivator discovering a cave"

Generic LLM Response:
"The young man walked into the cave. It was dark and mysterious.
He felt a strange energy. Something powerful was hidden here..."

What we want (Xianxia style):
"Chen Wei's spiritual sense trembled as he pushed through the
waterfall. Ancient qi, dense enough to manifest as mist, swirled
within the cave mouth. His dantian resonated with a frequency
he had only read about in the Celestial Archives—the signature
of a Nascent Soul realm cultivator's inheritance..."

3. Hallucinations

Without access to source material, LLMs confidently generate plausible-sounding but incorrect information:

You: "What does Chapter 7 of my company handbook say about vacation policy?"
LLM: "According to your handbook, employees receive 15 days..."
     (completely made up - it has no access to your handbook!)

4. No Personalization

Everyone gets the same model. A fantasy author and a technical writer get identical responses to the same prompt. There's no way to customize the model to your specific domain without significant effort.

5. Context Window Limitations

Even if you try to paste your entire book into the prompt:

| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4K tokens | ~3,000 words |
| GPT-4 | 8K-128K tokens | ~6,000-96,000 words |
| Claude 3 | 200K tokens | ~150,000 words |

A typical novel is 80,000-100,000 words. A book collection? Millions of words. You simply can't fit everything in the context window.
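
You can verify this yourself by counting tokens. Here's a small sketch using the tiktoken library (not part of our pipeline, just an illustration; the file path is hypothetical):

import tiktoken

# Roughly how many tokens does a full novel occupy in a GPT-4 prompt?
encoding = tiktoken.encoding_for_model("gpt-4")

with open("my_novel.txt", encoding="utf-8") as f:   # hypothetical ~90,000-word novel
    novel = f.read()

print(f"{len(encoding.encode(novel)):,} tokens")
# A 90,000-word novel is well over 100,000 tokens - far beyond an 8K context,
# and an entire book collection dwarfs even a 200K context window.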


Methods to Add Custom Knowledge to LLMs

There are several approaches to make LLMs work with your custom data. Let's explore each one in detail:

Method 1: Prompt Engineering (In-Context Learning)

How it works: Paste your data directly into the prompt.

prompt = f"""
You are a story writer. Here are some example passages to follow:

Example 1:
{example_passage_1}

Example 2:
{example_passage_2}

Example 3:
{example_passage_3}

Now write a story about: {user_request}
"""

Pros:

  • Zero setup - Just copy-paste, no infrastructure needed
  • Immediate - Results in seconds, no preprocessing
  • Flexible - Change examples anytime
  • No training - Works with any off-the-shelf model

Cons:

  • Context limits - Can only fit 3-10 examples (can't represent diverse styles)
  • High cost - Pay for example tokens every call ($0.01-0.10 per request)
  • No intelligence - Must manually choose examples (may pick irrelevant ones)
  • Doesn't scale - 100 books = impossible

Best for: Quick prototypes, very small datasets (<10 pages)


Method 2: Fine-Tuning

How it works: Train the model's weights on your specific data.

# Conceptual fine-tuning workflow (OpenAI Python SDK v1+)
from openai import OpenAI

training_data = [
    {
        "messages": [
            {"role": "system", "content": "You write in xianxia style"},
            {"role": "user", "content": "Write about a breakthrough"},
            {"role": "assistant", "content": "The qi vortex above Chen Wei's head..."}
        ]
    },
    # ... hundreds or thousands more examples, exported to training_data.jsonl
]

# Upload the data and train the model (this costs money and time!)
client = OpenAI()
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo"
)

Pros:

  • Persistent knowledge - Style/knowledge "baked into" weights
  • Fast inference - No retrieval step needed
  • Deep pattern learning - Picks up subtle stylistic patterns from many examples
  • Consistent outputs - Same style every time

Cons:

  • Expensive - GPT-3.5 fine-tuning costs $50-500+ per training run
  • Time-consuming - Hours to days of training, slow iteration
  • Expertise required - Need to understand ML concepts (high barrier)
  • Catastrophic forgetting - Model may lose general capabilities
  • Static knowledge - Can't update without retraining (adding one book = full retrain)
  • Data preparation - Need to format data properly (hours of prep work)
  • Overfitting risk - Model memorizes instead of learning

Best for: Production systems with stable data, when you need consistent style


Method 3: RAG (Retrieval-Augmented Generation)

How it works: Store data in a searchable database, retrieve relevant pieces at query time.

# RAG workflow
def generate_with_rag(user_query):
    # 1. Search for relevant content
    relevant_passages = vector_db.search(user_query, top_k=5)

    # 2. Build augmented prompt
    prompt = f"""
    Reference material:
    {relevant_passages}

    User request: {user_query}
    """

    # 3. Generate with context
    return llm.generate(prompt)

Pros:

  • No training - Use any model off-the-shelf
  • Easy updates - Add/remove documents instantly
  • Scalable - Handle millions of documents
  • Transparent - See exactly what was retrieved
  • Cost-effective - Only embed once, query forever
  • Model-agnostic - Same DB works with any LLM
  • Grounded responses - Output based on real sources

Cons:

  • Retrieval quality - Bad retrieval = bad output (need good embeddings)
  • Additional latency - Search adds 100-500ms (slower than fine-tuning)
  • Infrastructure - Need vector database (more moving parts)
  • Chunking challenges - How to split documents affects retrieval quality
  • Context assembly - Retrieved chunks may not flow naturally
  • Embedding costs - Need to embed all documents (one-time cost)

Best for: Large/dynamic knowledge bases, when data changes frequently


Method 4: Knowledge Graphs

How it works: Structure data as entities and relationships.

[Brandon Sanderson] --wrote--> [Mistborn] --has_magic_system--> [Allomancy]
                                    |
                                    +--has_character--> [Vin]
                                                          |
                                                          +--has_trait--> [Street Urchin]
                                                          +--has_power--> [Mistborn]

Pros:

  • Explicit relationships - Captures "how things connect"
  • Complex queries - "Find all characters who use fire magic"
  • Reasoning - Can infer new relationships
  • Structured output - Clean, organized data

Cons:

  • Complex to build - Must define schema, extract entities (weeks of work)
  • Maintenance burden - New data needs manual structuring (ongoing effort)
  • Doesn't capture prose - Style/voice can't be graphed (bad for creative writing)
  • Domain expertise - Need to understand your data deeply (high barrier)

Best for: Structured data, when relationships matter more than content
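
To make the idea concrete, here's a toy version of the graph above using the networkx package (purely illustrative; we won't use knowledge graphs in this project):

import networkx as nx

# Build the small graph from the diagram above
g = nx.DiGraph()
g.add_edge("Brandon Sanderson", "Mistborn (book)", relation="wrote")
g.add_edge("Mistborn (book)", "Allomancy", relation="has_magic_system")
g.add_edge("Mistborn (book)", "Vin", relation="has_character")
g.add_edge("Vin", "Street Urchin", relation="has_trait")
g.add_edge("Vin", "Mistborn (power)", relation="has_power")

# Structured query: which characters have the Mistborn power?
characters = [
    src for src, dst, data in g.edges(data=True)
    if data["relation"] == "has_power" and dst == "Mistborn (power)"
]
print(characters)  # ['Vin']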


Method 5: Hybrid Approaches

Combine methods for best results:

| Hybrid Approach | How It Works | Best For |
|---|---|---|
| RAG + Fine-tuning | Fine-tune for style, RAG for facts | News/research writing |
| RAG + Knowledge Graph | Graph for structure, RAG for content | Complex domains |
| Multi-stage RAG | Retrieve → Rerank → Generate | High-precision needs |
| RAG + Prompt Engineering | RAG retrieves, few-shot guides format | Specific output formats |
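
As one concrete example, the "Multi-stage RAG" pattern from the table can be sketched with a cross-encoder reranker from sentence-transformers. This is only a sketch: vector_search() and llm_generate() stand in for whatever retrieval and generation functions you already have.

from sentence_transformers import CrossEncoder

def multi_stage_rag(query, top_k=20, keep=5):
    # Stage 1: broad semantic retrieval (assumed helper)
    candidates = vector_search(query, top_k=top_k)

    # Stage 2: rerank candidates with a cross-encoder for precision
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    best = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:keep]]

    # Stage 3: generate with only the best chunks as context (assumed helper)
    return llm_generate(query, context=best)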

Comparison Table: All Methods

| Criteria | Prompt Eng. | Fine-Tuning | RAG | Knowledge Graph |
|---|---|---|---|---|
| Setup Time | Minutes | Days | Hours | Weeks |
| Setup Cost | $0 | $50-500 | $0-50 | $100+ |
| Per-Query Cost | High | Low | Medium | Low |
| Technical Skill | Low | High | Medium | High |
| Knowledge Update | Instant | Re-train | Instant | Manual |
| Max Data Size | ~50 pages | Unlimited | Millions of docs | Millions of nodes |
| Retrieval Intelligence | None | N/A | Semantic | Graph traversal |
| Output Consistency | Variable | High | Variable | High |
| Debugging | Easy | Hard | Medium | Medium |
| Style Learning | Limited | Excellent | Good | Poor |
| Fact Accuracy | Low | Medium | High | High |

Deep Dive: RAG Pros and Cons

Since we're using RAG, let's examine its strengths and weaknesses in detail:

RAG Advantages (Detailed)

1. No Training Required

Fine-tuning workflow:
1. Prepare training data (hours)
2. Format into JSONL (hours)
3. Upload and validate (minutes)
4. Train model (hours-days)
5. Test and iterate (days)
Total: Days to weeks

RAG workflow:
1. Parse documents (minutes)
2. Chunk and embed (minutes-hours)
3. Store in vector DB (minutes)
Total: Hours

2. Easy Knowledge Updates

# Adding a new book to RAG
def add_book(filepath):
    text = parse_ebook(filepath)      # 10 seconds
    chunks = chunk_text(text)          # 1 second
    embeddings = embed(chunks)         # 30 seconds
    vector_db.add(chunks, embeddings)  # 5 seconds
    # Done! New book is searchable

# Adding a new book with fine-tuning
def add_book_finetune(filepath):
    # 1. Prepare new training examples (1 hour)
    # 2. Combine with existing training data (10 min)
    # 3. Re-run fine-tuning job ($50-200, 2-8 hours)
    # 4. Test new model (1 hour)
    # 5. Deploy new model (30 min)
    # Total: ~12 hours and $50-200

3. Scalability

| Data Size | Prompt Engineering | Fine-Tuning | RAG |
|---|---|---|---|
| 10 pages | Works | Works | Works |
| 100 pages | Too big | Works | Works |
| 1,000 pages | Impossible | Expensive | Works |
| 10,000 pages | Impossible | Very expensive | Works |
| 1M pages | Impossible | Impractical | Works |

4. Transparency and Debugging

# With RAG, you can see exactly what the model sees
result = generator.generate("Write about a warrior")

# Debug: What did we retrieve?
print("Retrieved passages:")
for i, passage in enumerate(result.retrieved_context):
    print(f"{i+1}. {passage[:100]}...")
    print(f"   Similarity: {result.scores[i]}")
    print(f"   Source: {result.sources[i]}")

# If output is wrong, you know exactly where to look:
# - Bad retrieval? Improve embeddings or chunking
# - Good retrieval, bad output? Improve prompt

5. Model Agnostic

# Same knowledge base works with ANY model
knowledge_base = VectorDB("./chroma_db")

# Use with Ollama (free, local)
ollama_response = generate(knowledge_base, model="ollama/qwen2.5")

# Use with OpenAI (paid, cloud)
openai_response = generate(knowledge_base, model="gpt-4")

# Use with Claude (paid, cloud)
claude_response = generate(knowledge_base, model="claude-3-sonnet")

# Switch models without rebuilding anything!

6. Cost Effective

| Operation | Fine-Tuning Cost | RAG Cost |
|---|---|---|
| Initial setup | $50-500 | $0-10 |
| Add 1 book | $50-200 (retrain) | ~$0.01 (embed) |
| Add 100 books | $50-200 (retrain) | ~$1 (embed) |
| Query (GPT-4) | ~$0.03/query | ~$0.04/query |
| Query (Ollama) | $0 | $0 |

RAG Disadvantages (Detailed)

1. Retrieval Quality Dependency

The RAG Equation:
Final Output Quality = Retrieval Quality × Generation Quality

If retrieval finds irrelevant passages:
- User asks about "sword fighting"
- System retrieves passages about "cooking swords" (wrong!)
- LLM generates cooking-related nonsense

Retrieval failure modes:
- Semantic gap: query and relevant docs use different words
- Chunking errors: relevant info split across chunks
- Embedding limitations: model doesn't understand domain

2. Additional Latency

Request Timeline Comparison:

Direct LLM (no RAG):
[User Query] → [LLM Generate: 500ms] → [Response]
Total: ~500ms

RAG:
[User Query] → [Embed Query: 50ms] → [Vector Search: 100ms] →
[Fetch Documents: 50ms] → [Build Prompt: 10ms] → [LLM Generate: 600ms] → [Response]
Total: ~810ms (+62% slower)

3. Chunking Challenges

The Chunking Dilemma:

Too Small (100 chars):
"The warrior drew his" | "sword and faced the" | "dragon with courage"
→ Loses context, meaningless fragments

Too Large (5000 chars):
[Entire chapter about many topics]
→ Dilutes relevance, wastes context, may retrieve wrong parts

Just Right (500-1000 chars):
[Complete paragraph about sword fighting]
→ Self-contained, meaningful, searchable

But even "just right" has problems:
- Important info may span two chunks
- Context from previous paragraphs lost
- Character names may not appear in every chunk

4. Context Assembly Issues

# Retrieved chunks may not flow naturally
retrieved = [
    "...he defeated the demon lord. THE END.",  # End of chapter 5
    "Chapter 1: The young warrior woke...",      # Beginning of book
    "...said Master Liu. 'Your training...'"     # Middle of dialogue
]

# Assembled context is disjointed!
# The LLM must make sense of this jumble

5. Cold Start Problem

New RAG System:
- No documents indexed yet
- User queries return nothing relevant
- Output quality = base LLM (no improvement)

Solution: Must index documents before system is useful
This takes time for large collections
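
One simple mitigation is to check the collection before generating. A minimal sketch, assuming a ChromaDB collection like the one we build later in this article:

def ensure_indexed(collection, minimum_chunks=1):
    # Guard against the cold-start problem: without indexed documents,
    # retrieval returns nothing and output quality falls back to the base LLM.
    if collection.count() < minimum_chunks:
        raise RuntimeError(
            "No documents indexed yet - run the ingestion step (build_style_db.py) first"
        )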

Current Limitations of RAG

Understanding limitations helps you build better systems:

Retrieval Limitations

| Limitation | Description | Workaround |
|---|---|---|
| Semantic gap | Different words for same concept | Hybrid search (keyword + semantic) |
| No cross-document reasoning | Can't connect info across books | Knowledge graphs, multi-hop retrieval |
| Recency bias | All chunks treated equally | Add timestamp metadata, boost recent |
| No negation understanding | "not about war" still retrieves war | Better query processing |
| Fixed chunk boundaries | Important info split across chunks | Overlapping chunks, larger windows |
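
To make the first workaround concrete, here's a minimal sketch of hybrid scoring that blends BM25 keyword scores with embedding similarity. It assumes the rank_bm25 package and an in-memory list of chunks; none of this is part of the tutorial code:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def hybrid_search(query, chunks, top_k=5, alpha=0.5):
    # Keyword side: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    keyword_scores = bm25.get_scores(query.lower().split())
    max_kw = max(keyword_scores) or 1.0   # normalize BM25 roughly into [0, 1]

    # Semantic side: cosine similarity between query and chunk embeddings
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    sim = util.cos_sim(model.encode(query), model.encode(chunks))[0]

    # Blend the two signals and return the best chunks
    blended = [
        (alpha * (k / max_kw) + (1 - alpha) * float(s), chunk)
        for k, s, chunk in zip(keyword_scores, sim, chunks)
    ]
    return [chunk for _, chunk in sorted(blended, reverse=True)[:top_k]]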

Embedding Limitations

| Limitation | Description | Workaround |
|---|---|---|
| Domain mismatch | General embeddings miss domain terms | Fine-tune embedding model |
| Length limits | Most models cap at 512 tokens | Chunk appropriately |
| Language bias | English-trained models struggle with other languages | Multilingual models |
| No structured data | Can't embed tables well | Special preprocessing |

Generation Limitations

| Limitation | Description | Workaround |
|---|---|---|
| Context window | Can only fit N retrieved chunks | Summarization, selection |
| Lost in the middle | LLMs ignore middle of long contexts | Reorder important info to start/end |
| Hallucination | May still make things up | Fact-checking, citations |
| Style inconsistency | May not maintain style throughout | More style examples, fine-tuning |

Our Tutorial's Specific Limitations

Since this is a learning-focused tutorial, we've made simplifying choices:

┌─────────────────────────────────────────────────────────────────┐
│                 WHAT THIS TUTORIAL COVERS                       │
├─────────────────────────────────────────────────────────────────┤
│  ✓ Basic RAG pipeline (ingest → embed → store → retrieve)       │
│  ✓ Simple fixed-size chunking with overlap                      │
│  ✓ Single embedding model (no fine-tuning)                      │
│  ✓ Basic similarity search (no reranking)                       │
│  ✓ Single-query retrieval (no query expansion)                  │
│  ✓ Straightforward prompt templates                             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│              PRODUCTION SYSTEMS WOULD ADD                       │
├─────────────────────────────────────────────────────────────────┤
│  ○ Hybrid search (BM25 keyword + semantic vectors)              │
│  ○ Query expansion ("sword" → "sword, blade, weapon")           │
│  ○ Cross-encoder reranking for better precision                 │
│  ○ Semantic chunking (split on topic boundaries)                │
│  ○ Metadata filtering (by author, genre, date)                  │
│  ○ Caching layer for repeated queries                           │
│  ○ Evaluation metrics (retrieval recall, generation quality)    │
│  ○ A/B testing for prompt variations                            │
│  ○ Streaming responses for better UX                            │
│  ○ Rate limiting and cost management                            │
│  ○ Observability (logging, tracing, metrics)                    │
└─────────────────────────────────────────────────────────────────┘
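
As a small taste of those production additions, metadata filtering is already supported by ChromaDB at query time. A sketch, assuming a collection and query vector like the ones we build below:

# Restrict retrieval to a single source book using the stored metadata
results = collection.query(
    query_embeddings=[query_vector.tolist()],
    n_results=5,
    where={"source": "book1.txt"}   # only chunks whose metadata matches
)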

Why We Choose RAG for This Project

Given all the above, here's why RAG is the right choice for our story generator:

1. Our Use Case Fits RAG Perfectly

| Requirement | Why RAG Works |
|---|---|
| Learn from book collection | Easy to add books to vector DB |
| Multiple genres/styles | Retrieval finds relevant style samples |
| Users add their own books | No retraining needed |
| Works offline | Ollama + local ChromaDB |
| Educational project | RAG is easier to understand and debug |

2. The Alternatives Don't Fit

| Method | Why Not for This Project |
|---|---|
| Prompt Engineering | Can't fit entire book collection |
| Fine-Tuning | Too expensive, can't easily add books |
| Knowledge Graphs | Style/prose can't be structured as graphs |

3. RAG Handles Our Scale

Typical user's book collection:
- 50-200 ebooks
- 5-20 million words total
- 50,000-200,000 chunks

RAG handles this easily:
- ChromaDB can store millions of vectors
- Search takes <100ms even with 200K chunks
- Adding new books takes seconds

4. Style Learning Works Well with RAG

For creative writing, we don't need exact fact retrieval. We need style examples:

# Query: "Write about a warrior discovering a cave"

# RAG retrieves passages about:
# - Warriors in various situations
# - Cave discoveries
# - Mysterious findings

# These serve as STYLE EXAMPLES, not facts
# The LLM learns "how to write" from them
# Output naturally varies based on what's retrieved

RAG Deep Dive: How It Actually Works

Now let's understand the mechanics:

The RAG Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                        RAG PIPELINE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ╔═══════════════════════════════════════════════════════════╗  │
│  ║  OFFLINE PHASE (One-time setup)                           ║  │
│  ╠═══════════════════════════════════════════════════════════╣  │
│  ║                                                           ║  │
│  ║  Documents ──▶ Parse ──▶ Chunk ──▶ Embed ──▶ Store        ║  │
│  ║                                                           ║  │
│  ║  ┌─────────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐ ║   │
│  ║  │ Ebooks  │─▶│Extract│─▶│ Split │─▶│Vector │─▶│ChromaDB ║   │
│  ║  │PDF/EPUB │  │ Text  │  │ 500ch │  │ 384d  │  │        │ ║  │
│  ║  └─────────┘  └───────┘  └───────┘  └───────┘  └───────┘ ║   │
│  ║                                                           ║  │
│  ╚═══════════════════════════════════════════════════════════╝  │
│                                                                 │
│  ╔═══════════════════════════════════════════════════════════╗  │
│  ║  ONLINE PHASE (Every query)                               ║  │
│  ╠═══════════════════════════════════════════════════════════╣  │
│  ║                                                           ║  │
│  ║  Query ──▶ Embed ──▶ Search ──▶ Retrieve ──▶ Augment ──▶ Generate   ║  │
│  ║                                                           ║  │
│  ║  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌─────────┐ ║   │
│  ║  │"Write │─▶│Query  │─▶│Cosine │─▶│Top 5  │─▶│Prompt + │ ║   │
│  ║  │about..│  │Vector │  │Search │  │Chunks │  │Context  │ ║   │
│  ║  └───────┘  └───────┘  └───────┘  └───────┘  └────┬────┘ ║   │
│  ║                                                    │      ║  │
│  ║                                                    ▼      ║  │
│  ║                                              ┌─────────┐  ║  │
│  ║                                              │   LLM   │  ║  │
│  ║                                              │Generate │  ║  │
│  ║                                              └────┬────┘  ║  │
│  ║                                                   │       ║  │
│  ║                                                   ▼       ║  │
│  ║                                              ┌─────────┐  ║  │
│  ║                                              │  Story  │  ║  │
│  ║                                              │ Output  │  ║  │
│  ║                                              └─────────┘  ║  │
│  ╚═══════════════════════════════════════════════════════════╝  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Step 1: Document Parsing

# We support multiple ebook formats
def parse_ebook(filepath):                   # filepath is a pathlib.Path
    ext = filepath.suffix.lower()

    if ext == '.pdf':
        return extract_pdf_text(filepath)    # PyMuPDF
    elif ext == '.epub':
        return extract_epub_text(filepath)   # ebooklib
    elif ext == '.mobi':
        return extract_mobi_text(filepath)   # mobi library
    elif ext == '.txt':
        return filepath.read_text()
    else:
        raise ValueError(f"Unsupported ebook format: {ext}")

# Output: Raw text string
# "Chapter 1\n\nThe young warrior stood at the edge..."

Step 2: Text Chunking

Why we chunk:

Problem: A novel has 80,000 words
- Can't embed entire book (embedding models have limits)
- Can't retrieve entire book (wastes context window)
- Need granularity for relevant retrieval

Solution: Split into chunks
- Each chunk is self-contained
- Small enough to embed
- Large enough to be meaningful

Chunking strategies compared:

| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Fixed-size | Every 500 characters | Simple, consistent | May cut mid-sentence |
| Sentence | Split on periods | Natural boundaries | Variable sizes |
| Paragraph | Split on newlines | Preserves context | Very variable sizes |
| Semantic | Split on topic change | Best relevance | Complex, slow |

We use fixed-size with overlap:

def chunk_text(text, size=500, overlap=50):
    chunks = []
    start = 0

    while start < len(text):
        end = start + size
        chunk = text[start:end]

        # Try to break at sentence boundary
        last_period = chunk.rfind('. ')
        if last_period > size * 0.5:
            chunk = chunk[:last_period + 1]
            end = start + last_period + 1

        chunks.append(chunk)
        start = end - overlap  # Overlap!

    return chunks
Text: [AAAAA][BBBBB][CCCCC][DDDDD][EEEEE]

Chunks with overlap:
1: [AAAAA][BB]
2:    [BB][BBBBB][CC]
3:          [CC][CCCCC][DD]
4:                [DD][DDDDD][EE]
5:                      [EE][EEEEE]

Overlap ensures we don't lose context at boundaries!
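
A quick sanity check of chunk_text shows the overlap in practice (the file path is just an example):

text = open("data/txt/fantasy_novel.txt", encoding="utf-8").read()  # example file
chunks = chunk_text(text)

print(len(chunks))       # typically a few thousand chunks for a full novel
print(chunks[0][-50:])   # the last 50 characters of chunk 1...
print(chunks[1][:50])    # ...reappear at the start of chunk 2 (the overlap)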

Step 3: Embedding

Convert text to vectors that capture meaning:

from sentence_transformers import SentenceTransformer

# Load the model (downloaded automatically the first time this runs)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Embed text
text = "The warrior drew his ancient blade"
vector = model.encode(text)

print(vector.shape)  # (384,)
print(vector[:5])    # [0.23, -0.45, 0.67, 0.12, -0.89]

Why embeddings work:

from sentence_transformers import util

# Similar meanings → similar vectors (reusing the model loaded above)
v1 = model.encode("The warrior drew his sword")
v2 = model.encode("The fighter unsheathed his blade")
v3 = model.encode("I like pizza")

print(util.cos_sim(v1, v2))  # high, roughly 0.8-0.9 - very similar!
print(util.cos_sim(v1, v3))  # low, roughly 0.1 - very different!

Step 4: Vector Storage (ChromaDB)

import chromadb
from sentence_transformers import SentenceTransformer

# Embedding model (the same one used for the chunks in Step 3)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Create persistent database
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="story_styles",
    metadata={"description": "Writing style samples"}
)

documents = [
    "The warrior drew his blade...",
    "Magic sparkled in the air...",
    "The ancient tome revealed..."
]

# Add documents with their 384-dimensional embeddings and metadata
collection.add(
    ids=["chunk_001", "chunk_002", "chunk_003"],
    documents=documents,
    embeddings=model.encode(documents).tolist(),
    metadatas=[
        {"source": "book1.txt", "chunk_id": 1},
        {"source": "book1.txt", "chunk_id": 2},
        {"source": "book2.txt", "chunk_id": 1}
    ]
)

print(f"Stored {collection.count()} chunks")

Step 5: Retrieval

# User's query
query = "A young cultivator discovers a mysterious cave"

# Embed the query
query_vector = model.encode(query)

# Search for similar chunks
results = collection.query(
    query_embeddings=[query_vector.tolist()],
    n_results=5,
    include=["documents", "distances", "metadatas"]
)

# Results contain the most relevant passages
for i, doc in enumerate(results['documents'][0]):
    print(f"Result {i+1} (distance: {results['distances'][0][i]:.3f}):")
    print(f"  {doc[:100]}...")
    print(f"  Source: {results['metadatas'][0][i]['source']}")

Step 6: Augmented Generation

# Build the augmented prompt
def build_prompt(query, retrieved_passages):
    context = "\n\n---\n\n".join(retrieved_passages)

    prompt = f"""Here are some example passages showing the writing style to follow:

{context}

---

Now write a NEW story passage in a similar style.
Story idea: {query}

Requirements:
- Match the writing style of the examples above
- Create original content (don't copy)
- Include vivid descriptions and dialogue

Story:
"""
    return prompt

# Generate
prompt = build_prompt(user_query, retrieved_passages)
story = llm.generate(prompt)
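
The llm.generate() call above stands in for whichever backend you configure. With the OpenAI client, for example (shown as an illustrative sketch; Ollama and Claude work the same way conceptually), the final step looks like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved_passages = results['documents'][0]   # from the ChromaDB query in Step 5
prompt = build_prompt(query, retrieved_passages)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,   # a bit of creative freedom for fiction
)
story = response.choices[0].message.content
print(story)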

The Architecture for Our Story Generator

┌────────────────────────────────────────────────────────────────────────┐
│                     AI RAG STORY GENERATOR                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│   YOUR EBOOK COLLECTION                                                │
│   ┌──────────────────────────────────────────────────────────────┐    │
│   │  data/raw/                                                    │    │
│   │  ├── fantasy_novel.epub                                       │    │
│   │  ├── xianxia_story.pdf                                        │    │
│   │  ├── magic_school.mobi                                        │    │
│   │  └── cultivation_tale.txt                                     │    │
│   └──────────────────────────────────────────────────────────────┘    │
│                              │                                         │
│                              ▼                                         │
│   ┌──────────────────────────────────────────────────────────────┐    │
│   │  PARSE (parse_ebooks.py)                                      │    │
│   │  • Extract text from PDF, EPUB, MOBI, TXT                     │    │
│   │  • Clean and normalize text                                   │    │
│   │  • Output: data/txt/*.txt                                     │    │
│   └──────────────────────────────────────────────────────────────┘    │
│                              │                                         │
│                              ▼                                         │
│   ┌──────────────────────────────────────────────────────────────┐    │
│   │  BUILD DATABASE (build_style_db.py)                           │    │
│   │  • Chunk text (500 chars, 50 overlap)                         │    │
│   │  • Generate embeddings (SentenceTransformer)                  │    │
│   │  • Store in ChromaDB                                          │    │
│   │  • Output: chroma_db/                                         │    │
│   └──────────────────────────────────────────────────────────────┘    │
│                              │                                         │
│                              ▼                                         │
│   ┌──────────────────────────────────────────────────────────────┐    │
│   │  VECTOR DATABASE (ChromaDB)                                   │    │
│   │  • Stores: text chunks + embeddings + metadata                │    │
│   │  • Enables: fast similarity search                            │    │
│   │  • Persists: survives restarts                                │    │
│   └──────────────────────────────────────────────────────────────┘    │
│                              │                                         │
│         ┌────────────────────┴────────────────────┐                   │
│         │                                         │                    │
│         ▼                                         ▼                    │
│   ┌─────────────┐                          ┌─────────────┐            │
│   │  CLI MODE   │                          │   WEB UI    │            │
│   │ generate_   │                          │   app.py    │            │
│   │ with_style  │                          │  (Gradio)   │            │
│   │    .py      │                          │             │            │
│   └──────┬──────┘                          └──────┬──────┘            │
│          │                                        │                    │
│          └────────────────┬───────────────────────┘                   │
│                           │                                            │
│                           ▼                                            │
│   ┌──────────────────────────────────────────────────────────────┐    │
│   │  GENERATION PIPELINE                                          │    │
│   │                                                               │    │
│   │  User: "Write about a cultivator finding a cave"             │    │
│   │                           │                                   │    │
│   │                           ▼                                   │    │
│   │  ┌─────────────────────────────────────────────────────────┐ │    │
│   │  │ 1. Embed query with SentenceTransformer                 │ │    │
│   │  └─────────────────────────────────────────────────────────┘ │    │
│   │                           │                                   │    │
│   │                           ▼                                   │    │
│   │  ┌─────────────────────────────────────────────────────────┐ │    │
│   │  │ 2. Search ChromaDB for similar passages                 │ │    │
│   │  │    Returns: 3-5 style samples                           │ │    │
│   │  └─────────────────────────────────────────────────────────┘ │    │
│   │                           │                                   │    │
│   │                           ▼                                   │    │
│   │  ┌─────────────────────────────────────────────────────────┐ │    │
│   │  │ 3. Build augmented prompt                               │ │    │
│   │  │    [Style examples] + [User request]                    │ │    │
│   │  └─────────────────────────────────────────────────────────┘ │    │
│   │                           │                                   │    │
│   │                           ▼                                   │    │
│   │  ┌─────────────────────────────────────────────────────────┐ │    │
│   │  │ 4. Send to LLM                                          │ │    │
│   │  │    ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │ │    │
│   │  │    │ Ollama  │ │ OpenAI  │ │ Claude  │ │ Gemini  │     │ │    │
│   │  │    │ (local) │ │  (API)  │ │  (API)  │ │  (API)  │     │ │    │
│   │  │    └─────────┘ └─────────┘ └─────────┘ └─────────┘     │ │    │
│   │  └─────────────────────────────────────────────────────────┘ │    │
│   │                           │                                   │    │
│   │                           ▼                                   │    │
│   │  ┌─────────────────────────────────────────────────────────┐ │    │
│   │  │ 5. Return generated story in learned style              │ │    │
│   │  └─────────────────────────────────────────────────────────┘ │    │
│   └──────────────────────────────────────────────────────────────┘    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

What We'll Build in This Series

Part 1 (This Article)

  • ✅ Understanding RAG concepts
  • ✅ Comparing all alternatives in detail
  • ✅ Deep dive into RAG pros and cons
  • ✅ Current limitations
  • ✅ Why we chose RAG
  • ✅ How RAG works step-by-step
  • ✅ Architecture overview

Part 2: Building the RAG Pipeline

  • Project setup and dependencies
  • Parsing ebooks (PDF, EPUB, MOBI) with code
  • Text chunking implementation
  • Generating embeddings with Sentence Transformers
  • Storing in ChromaDB
  • Testing retrieval quality

Part 3: Story Generation

  • Connecting to LLMs (Ollama, OpenAI, Claude)
  • Prompt engineering for style transfer
  • Single chapter generation
  • Multi-chapter story generation
  • Maintaining consistency with summaries
  • Web interface with Gradio

Prerequisites

Before Part 2, make sure you have:

  • Python 3.10+ installed
  • 8GB+ RAM (16GB recommended for larger models)
  • Some ebooks to learn from (any genre!)
  • (Optional) Ollama for local LLM inference

Clone the repository:

git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator

Summary

| Topic | Key Takeaway |
|---|---|
| The Problem | LLMs don't know your custom data, have knowledge cutoffs, and generate generic content |
| Alternatives | Prompt engineering (simple but limited), fine-tuning (powerful but expensive), RAG (balanced), knowledge graphs (structured data) |
| RAG Pros | No training, easy updates, scalable, transparent, cost-effective, model-agnostic |
| RAG Cons | Retrieval quality dependency, added latency, chunking challenges, context assembly issues |
| Limitations | Semantic gaps, embedding limits, no cross-document reasoning (in basic RAG) |
| Why RAG for Us | Fits our use case perfectly: large book collections, easy updates, style learning |
| How RAG Works | Parse → Chunk → Embed → Store → Query → Retrieve → Augment → Generate |

Next Article: Part 2: Building the RAG Pipeline →

Source Code: github.com/namtran/ai-rag-tutorial-story-generator


Found this helpful? Follow me for Parts 2 and 3!
