Nam Tran

Build Your Own AI Story Generator with RAG - Part 2: Building the RAG Pipeline

In Part 1, we learned what RAG is, compared it to alternatives, and understood its pros, cons, and limitations. Now let's build it.

In this article, we'll create the complete data pipeline:

Ebooks → Parse → Chunk → Embed → Store in ChromaDB

By the end, you'll have a searchable vector database of writing styles ready for story generation.

Source Code: github.com/namtran/ai-rag-tutorial-story-generator


Table of Contents

  1. Project Setup
  2. Step 1: Parsing Ebooks
  3. Step 2: Text Chunking
  4. Step 3: Generating Embeddings
  5. Step 4: Storing in ChromaDB
  6. Step 5: Testing Retrieval
  7. Troubleshooting Common Issues
  8. Performance Optimization Tips

Project Setup

Clone and Install

git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Key Dependencies Explained

# requirements.txt

# Vector Database
chromadb>=0.4.0          # Lightweight, embedded vector DB

# Embeddings
sentence-transformers    # Pre-trained embedding models

# Ebook Parsing
PyMuPDF                  # PDF text extraction (fast, reliable)
ebooklib                 # EPUB parsing
mobi                     # MOBI/PRC parsing
beautifulsoup4           # HTML cleaning for EPUB

# LLM Backends (for Part 3)
requests                 # For Ollama API
openai                   # For OpenAI API

# Web UI (for Part 3)
gradio                   # Simple web interface

Project Structure

ai-rag-tutorial-story-generator/
├── data/
│   ├── raw/              # Your ebooks go here (.pdf, .epub, .mobi, .txt)
│   └── txt/              # Parsed text files (auto-generated)
├── chroma_db/            # Vector database (auto-generated)
├── models/               # Cached embedding models
│
├── config.py             # All configuration in one place
├── parse_ebooks.py       # Step 1: Parse ebooks → text
├── build_style_db.py     # Step 2-4: Chunk → Embed → Store
├── generate_with_style.py # Step 5+: Retrieve → Generate (Part 3)
│
├── run.sh                # Quick commands
└── requirements.txt

Configuration Deep Dive

# config.py - Key settings explained

# ===== DIRECTORIES =====
BASE_DIR = Path(__file__).parent.resolve()
RAW_DIR = BASE_DIR / "data" / "raw"    # Put your ebooks here
TXT_DIR = BASE_DIR / "data" / "txt"    # Parsed text output
CHROMA_DIR = BASE_DIR / "chroma_db"    # Vector database

# ===== EMBEDDING MODEL =====
# We use a multilingual model to support books in any language
# This model outputs 384-dimensional vectors
EMBED_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

# Alternative models:
# "all-MiniLM-L6-v2"           # Faster, English-only, 384d
# "all-mpnet-base-v2"          # Better quality, slower, 768d
# "paraphrase-multilingual-mpnet-base-v2"  # Better multilingual, 768d

# ===== CHUNKING SETTINGS =====
RAG_CONFIG = {
    "chunk_size": 500,      # Characters per chunk
    "chunk_overlap": 50,    # Overlap between chunks
    "min_chunk_length": 100 # Skip chunks smaller than this
}

# Why these values?
# - 500 chars ≈ 100 words ≈ 1-2 paragraphs
# - Large enough for context, small enough for precise retrieval
# - 50 char overlap prevents losing info at boundaries
# - 10% overlap is a good balance (not too much redundancy)

# ===== COLLECTION NAME =====
COLLECTION_NAME = "story_styles"  # Name in ChromaDB

Step 1: Parsing Ebooks

The Challenge

Ebooks come in many formats, each with its own structure:

| Format   | Structure           | Challenge                         |
| -------- | ------------------- | --------------------------------- |
| PDF      | Fixed layout, pages | May have headers/footers, columns |
| EPUB     | HTML/CSS in a ZIP   | Need to extract from HTML         |
| MOBI/PRC | Amazon proprietary  | Needs a special library           |
| TXT      | Plain text          | Encoding issues                   |

Understanding the Parser Code

# parse_ebooks.py - Complete with explanations

import os
import re
import shutil
from pathlib import Path

import fitz  # PyMuPDF - Note the import name!
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import mobi

def parse_pdf(file_path: Path) -> str:
    """
    Extract text from PDF files.

    Uses PyMuPDF (fitz) which is fast and handles most PDFs well.
    Preserves paragraph structure by keeping line breaks.
    """
    doc = fitz.open(file_path)
    text_parts = []

    for page_num, page in enumerate(doc):
        # Extract text with layout preservation
        text = page.get_text("text")

        # Optional: Skip first/last pages (often cover/copyright)
        # if page_num == 0 or page_num == len(doc) - 1:
        #     continue

        text_parts.append(text)

    doc.close()

    # Join with double newline to preserve page breaks
    full_text = "\n\n".join(text_parts)

    # Clean up excessive whitespace
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)

    return full_text


def parse_epub(file_path: Path) -> str:
    """
    Extract text from EPUB files.

    EPUB files are basically ZIP files containing HTML.
    We extract text from each HTML document in reading order.
    """
    book = epub.read_epub(str(file_path))
    text_parts = []

    # Get items in reading order
    for item in book.get_items():
        # Only process document items (not images, CSS, etc.)
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            # Parse HTML content
            soup = BeautifulSoup(item.get_content(), 'html.parser')

            # Remove script and style elements
            for element in soup(['script', 'style', 'nav']):
                element.decompose()

            # Get text
            text = soup.get_text(separator='\n')
            text_parts.append(text)

    full_text = "\n\n".join(text_parts)

    # Clean up
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    full_text = re.sub(r' {2,}', ' ', full_text)

    return full_text


def parse_mobi(file_path: Path) -> str:
    """
    Extract text from MOBI/PRC files (Kindle format).

    The mobi library unpacks the file into a temporary directory;
    we parse the resulting HTML, then remove that directory.
    """
    # mobi.extract returns (temp_directory, path_to_extracted_file)
    temp_path, extracted_file = mobi.extract(str(file_path))

    try:
        # Find an HTML file to parse: either the file the extractor
        # points us to, or one somewhere in the temp directory
        html_file = None
        if extracted_file.endswith(('.html', '.htm')):
            html_file = extracted_file
        else:
            for root, dirs, files in os.walk(temp_path):
                for file in files:
                    if file.endswith('.html'):
                        html_file = os.path.join(root, file)
                        break
                if html_file:
                    break

        if html_file:
            with open(html_file, 'r', encoding='utf-8', errors='ignore') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')
                text = soup.get_text(separator='\n')
        else:
            # Fallback: read whatever was extracted as plain text
            with open(extracted_file, 'r', encoding='utf-8', errors='ignore') as f:
                text = f.read()
    finally:
        # Clean up the temporary extraction directory
        shutil.rmtree(temp_path, ignore_errors=True)

    return text


def parse_txt(file_path: Path) -> str:
    """
    Read plain text files.

    Handle various encodings gracefully.
    """
    encodings = ['utf-8', 'latin-1', 'cp1252', 'ascii']

    for encoding in encodings:
        try:
            return file_path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue

    # Last resort: ignore errors
    return file_path.read_text(encoding='utf-8', errors='ignore')


def parse_ebook(file_path: Path) -> str:
    """
    Parse any supported ebook format.

    Returns cleaned text ready for chunking.
    """
    suffix = file_path.suffix.lower()

    parsers = {
        '.pdf': parse_pdf,
        '.epub': parse_epub,
        '.mobi': parse_mobi,
        '.prc': parse_mobi,  # PRC is same as MOBI
        '.txt': parse_txt,
    }

    if suffix not in parsers:
        raise ValueError(f"Unsupported format: {suffix}")

    return parsers[suffix](file_path)


def clean_text(text: str) -> str:
    """
    Clean and normalize extracted text.

    - Remove excessive whitespace
    - Fix common OCR errors
    - Normalize quotes and dashes
    """

    # Normalize line endings
    text = text.replace('\r\n', '\n').replace('\r', '\n')

    # Remove excessive blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    # Remove excessive spaces
    text = re.sub(r' {2,}', ' ', text)

    # Fix common issues
    text = text.replace('\u201c', '"').replace('\u201d', '"')  # Smart quotes
    text = text.replace('\u2018', "'").replace('\u2019', "'")  # Smart apostrophes
    text = text.replace('\u2014', '--').replace('\u2013', '-')  # Em/en dashes

    # Remove page numbers (common pattern)
    text = re.sub(r'\n\d+\n', '\n', text)

    # Strip leading/trailing whitespace from lines
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)

    return text.strip()

Running the Parser

# Add your ebooks
cp ~/Books/*.epub data/raw/
cp ~/Books/*.pdf data/raw/

# Run parser
python parse_ebooks.py

Expected Output:

============================================================
EBOOK PARSER
============================================================
Source: data/raw/
Output: data/txt/
============================================================

[PARSE] Found 5 ebooks to process

[1/5] fantasy_novel.epub
      Format: EPUB
      Processing... Done!
      Output: fantasy_novel.txt (245,832 characters)

[2/5] cultivation_story.pdf
      Format: PDF (127 pages)
      Processing... Done!
      Output: cultivation_story.txt (523,109 characters)

[3/5] magic_school.mobi
      Format: MOBI
      Processing... Done!
      Output: magic_school.txt (312,445 characters)

...

============================================================
COMPLETE
============================================================
Processed: 5 files
Total text: 1,523,891 characters
Output directory: data/txt/
============================================================

Step 2: Text Chunking

Why Chunking Matters

Chunking is critical for RAG quality. Bad chunking = bad retrieval = bad output.

┌─────────────────────────────────────────────────────────────────┐
│                    CHUNKING IMPACT                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  User Query: "Write about a warrior's first battle"            │
│                                                                 │
│  GOOD CHUNKING:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ "Chen Wei gripped his sword tightly, knuckles white.    │   │
│  │ Before him stood a hundred enemy soldiers. This was it  │   │
│  │ - his first real battle. Master Liu's training echoed   │   │
│  │ in his mind: 'When fear comes, let it pass through you.'│   │
│  │ He raised his blade and charged."                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│  → Complete scene, good context, useful for style learning     │
│                                                                 │
│  BAD CHUNKING:                                                  │
│  ┌──────────────────────┐ ┌──────────────────────────────────┐ │
│  │ "Chen Wei gripped his│ │sword tightly, knuckles white.   │ │
│  │"                     │ │Before him stood a hundred enemy │ │
│  └──────────────────────┘ └──────────────────────────────────┘ │
│  → Split mid-sentence, loses meaning, poor retrieval           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Chunking Strategies Compared

| Strategy   | How It Works             | Pros                | Cons                 | Best For           |
| ---------- | ------------------------ | ------------------- | -------------------- | ------------------ |
| Fixed-size | Every N characters       | Simple, predictable | May cut mid-sentence | General use        |
| Sentence   | Split on periods         | Natural boundaries  | Variable sizes       | Precise retrieval  |
| Paragraph  | Split on newlines        | Preserves ideas     | Very variable        | Long-form content  |
| Semantic   | ML-based topic detection | Best relevance      | Slow, complex        | Production systems |
| Recursive  | Try large, then smaller  | Adaptive            | More complex         | Mixed content      |

Our Implementation: Fixed-Size with Smart Boundaries

# From build_style_db.py

def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
    min_length: int = 100
) -> list[str]:
    """
    Split text into overlapping chunks with smart boundary detection.

    Args:
        text: Full text to chunk
        chunk_size: Target size in characters
        overlap: Characters to overlap between chunks
        min_length: Minimum chunk size (skip smaller)

    Returns:
        List of text chunks
    """
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        # Get initial chunk
        end = start + chunk_size

        # Don't go past the end
        if end >= text_length:
            chunk = text[start:].strip()
            if len(chunk) >= min_length:
                chunks.append(chunk)
            break

        # Extract chunk
        chunk = text[start:end]

        # Find the best break point (sentence boundary)
        # Look for period, exclamation, or question mark followed by space
        best_break = -1

        for punct in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
            pos = chunk.rfind(punct)
            if pos > best_break and pos > chunk_size * 0.5:
                best_break = pos + len(punct)

        # If found a good break point, use it
        if best_break > 0:
            chunk = chunk[:best_break].strip()
            end = start + best_break

        # Also try paragraph break
        para_break = chunk.rfind('\n\n')
        if para_break > chunk_size * 0.7:  # Prefer paragraph if late enough
            chunk = chunk[:para_break].strip()
            end = start + para_break

        # Add chunk if long enough
        if len(chunk) >= min_length:
            chunks.append(chunk)

        # Move start, accounting for overlap
        start = end - overlap if end > overlap else end

    return chunks

Visualizing Overlap

Original text (simplified):
"AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"
 |-------- chunk 1 --------|
              |-------- chunk 2 --------|
                           |-------- chunk 3 --------|

Chunk 1: "AAAAAAAAAA BBBBBBBBBB CC"
Chunk 2: "BB CCCCCCCCCC DDDDDDDDDD"  ← "BB CC" appears in both!
Chunk 3: "DD EEEEEEEEEE"

Why overlap?
- Sentence about "B and C" isn't lost at boundary
- Queries about "C" can match chunks 1 or 2
- Better retrieval for edge cases
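
If you want to see that overlap concretely, here is a tiny runnable version of the diagram, reusing the chunk_text function above with deliberately small sizes (min_length is lowered so the toy chunks are not skipped):

# toy_overlap.py - illustrative only; real settings live in config.py
from build_style_db import chunk_text

text = "AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"

for i, chunk in enumerate(chunk_text(text, chunk_size=25, overlap=5, min_length=1)):
    print(f"Chunk {i}: {chunk!r}")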

Testing Your Chunks

# test_chunking.py - Verify chunk quality

from build_style_db import chunk_text

# Load a sample text
with open("data/txt/sample_book.txt") as f:
    text = f.read()

# Chunk it
chunks = chunk_text(text, chunk_size=500, overlap=50)

# Analyze
print(f"Total chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
print(f"Min chunk size: {min(len(c) for c in chunks)} chars")
print(f"Max chunk size: {max(len(c) for c in chunks)} chars")

# Show a few samples
print("\n--- Sample Chunks ---")
for i in [0, len(chunks)//2, -1]:
    print(f"\nChunk {i}:")
    print(chunks[i][:200] + "...")
    print(f"Length: {len(chunks[i])} chars")

Step 3: Generating Embeddings

What Are Embeddings?

Embeddings convert text into dense vectors that capture semantic meaning:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Text becomes a vector
text = "The warrior drew his sword"
vector = model.encode(text)

print(f"Text: '{text}'")
print(f"Vector shape: {vector.shape}")  # (384,)
print(f"First 5 values: {vector[:5]}")  # [0.23, -0.45, 0.67, ...]

Why Embeddings Work

Semantic similarity is captured in vector space:

"The warrior drew his sword"     →  [0.23, -0.45, 0.67, ...]
"The fighter unsheathed blade"   →  [0.21, -0.43, 0.65, ...]  ← Similar!
"I like to eat pizza"            →  [-0.56, 0.32, -0.11, ...] ← Different!

                    sword/blade
                         ↑
                    [warrior] [fighter]
                         |
    pizza →  [ ]         |
                         |
                    word embedding space
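
You can check this yourself with the cosine similarity helper that ships with sentence-transformers (the exact numbers vary by model, so treat the comments as illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

embeddings = model.encode([
    "The warrior drew his sword",
    "The fighter unsheathed his blade",
    "I like to eat pizza",
])

# Cosine similarity: close to 1.0 = same meaning, near 0 = unrelated
print(util.cos_sim(embeddings[0], embeddings[1]))  # high - similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low - different topic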

Choosing an Embedding Model

| Model                                 | Dimensions | Speed     | Quality | Languages | Size  |
| ------------------------------------- | ---------- | --------- | ------- | --------- | ----- |
| all-MiniLM-L6-v2                      | 384        | Very fast | Good    | English   | 80MB  |
| all-MiniLM-L12-v2                     | 384        | Fast      | Better  | English   | 120MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 384        | Fast      | Good    | 50+       | 420MB |
| all-mpnet-base-v2                     | 768        | Medium    | Best    | English   | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 768        | Medium    | Best    | 50+       | 970MB |

We use paraphrase-multilingual-MiniLM-L12-v2 because:

  • Supports 50+ languages (Chinese, Vietnamese, etc.)
  • Good balance of speed and quality
  • 384 dimensions is efficient for storage
  • Works well for style/semantic similarity

Embedding Implementation

# From build_style_db.py

from sentence_transformers import SentenceTransformer
from tqdm import tqdm

from config import EMBED_MODEL

class EmbeddingGenerator:
    def __init__(self, model_name: str = EMBED_MODEL):
        print(f"[EMBED] Loading model: {model_name}")
        self.model = SentenceTransformer(model_name)
        print(f"[EMBED] Model loaded! Dimension: {self.model.get_sentence_embedding_dimension()}")

    def embed_chunks(self, chunks: list[str], batch_size: int = 32) -> list:
        """
        Generate embeddings for a list of text chunks.

        Uses batching for efficiency on large datasets.
        Shows progress bar for long operations.
        """
        print(f"[EMBED] Generating embeddings for {len(chunks)} chunks...")

        # For small datasets, encode all at once
        if len(chunks) <= batch_size:
            embeddings = self.model.encode(chunks, show_progress_bar=True)
            return embeddings.tolist()

        # For large datasets, batch for memory efficiency
        all_embeddings = []

        for i in tqdm(range(0, len(chunks), batch_size), desc="Embedding"):
            batch = chunks[i:i + batch_size]
            batch_embeddings = self.model.encode(batch)
            all_embeddings.extend(batch_embeddings.tolist())

        return all_embeddings

    def embed_query(self, query: str) -> list:
        """Embed a single query string."""
        return self.model.encode(query).tolist()

Embedding Performance Tips

# Speed comparison for 10,000 chunks:

# CPU (Intel i7)
# - batch_size=32:  ~5 minutes
# - batch_size=64:  ~4 minutes
# - batch_size=128: ~3.5 minutes (may OOM on 8GB RAM)

# GPU (NVIDIA RTX 3080)
# - batch_size=32:  ~30 seconds
# - batch_size=64:  ~20 seconds
# - batch_size=128: ~15 seconds

# Apple Silicon (M1/M2)
# - batch_size=32:  ~2 minutes
# - batch_size=64:  ~1.5 minutes

# Tip: For large collections, run overnight!
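
sentence-transformers normally picks up a GPU automatically when PyTorch can see one, but you can also pin the device explicitly. A minimal sketch (which device is available obviously depends on your hardware):

import torch
from sentence_transformers import SentenceTransformer

# Prefer CUDA, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    device=device,
)
print(f"Encoding on: {device}")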

Step 4: Storing in ChromaDB

Why ChromaDB?

| Feature    | ChromaDB             | Pinecone         | Weaviate    | Milvus      |
| ---------- | -------------------- | ---------------- | ----------- | ----------- |
| Deployment | Embedded             | Cloud            | Self-hosted | Self-hosted |
| Setup      | pip install          | Account required | Docker      | Docker/K8s  |
| Cost       | Free                 | Free tier + paid | Free        | Free        |
| Scale      | ~1M vectors          | Billions         | Billions    | Billions    |
| Best For   | Learning, prototypes | Production       | Production  | Enterprise  |

ChromaDB is perfect for learning because:

  • No server to run
  • Data persists to disk
  • Simple Python API
  • Works offline
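
To see how little setup that means, here is a minimal, self-contained sketch (the demo_db path and collection name are throwaway examples, not part of this project):

import chromadb

# Embedded client: no server process, data persisted under ./demo_db
client = chromadb.PersistentClient(path="demo_db")
collection = client.get_or_create_collection("demo")

# With no embeddings supplied, ChromaDB falls back to its default
# embedding function (it downloads a small model on first use)
collection.add(ids=["doc1"], documents=["hello vector world"])
print(collection.count())  # 1 - and still 1 after a restart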

Creating the Database

# From build_style_db.py

import chromadb
from chromadb.config import Settings

def create_database(db_path: str, collection_name: str):
    """
    Create or connect to a ChromaDB database.

    Args:
        db_path: Directory to store database files
        collection_name: Name for the collection

    Returns:
        ChromaDB collection object
    """
    # Create persistent client (data survives restarts)
    client = chromadb.PersistentClient(
        path=db_path,
        settings=Settings(
            anonymized_telemetry=False  # Disable telemetry
        )
    )

    # Delete existing collection if present (for clean rebuild)
    try:
        client.delete_collection(collection_name)
        print(f"[DB] Deleted existing collection: {collection_name}")
    except ValueError:
        pass  # Collection didn't exist

    # Create new collection
    collection = client.create_collection(
        name=collection_name,
        metadata={
            "description": "Writing style samples for story generation",
            "hnsw:space": "cosine"  # Use cosine similarity
        }
    )

    print(f"[DB] Created collection: {collection_name}")
    return collection

Adding Documents

def add_to_database(
    collection,
    chunks: list[str],
    embeddings: list[list[float]],
    source_file: str
):
    """
    Add chunks and embeddings to ChromaDB.

    Args:
        collection: ChromaDB collection
        chunks: List of text chunks
        embeddings: Corresponding embeddings
        source_file: Name of source file (for metadata)
    """
    # Generate unique IDs
    # Format: source_chunknum (e.g., "fantasy_novel_0042")
    base_name = Path(source_file).stem
    ids = [f"{base_name}_{i:04d}" for i in range(len(chunks))]

    # Create metadata for each chunk
    metadatas = [
        {
            "source": source_file,
            "chunk_index": i,
            "char_count": len(chunk)
        }
        for i, chunk in enumerate(chunks)
    ]

    # Add to collection
    # ChromaDB handles batching internally
    collection.add(
        ids=ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas
    )

    print(f"[DB] Added {len(chunks)} chunks from {source_file}")

The Complete Build Script

# build_style_db.py - Complete pipeline

def build_database():
    """Build the complete vector database from parsed texts."""

    print("=" * 60)
    print("BUILDING STYLE DATABASE")
    print("=" * 60)

    # Initialize components
    embedder = EmbeddingGenerator()
    collection = create_database(str(CHROMA_DIR), COLLECTION_NAME)

    # Track statistics
    total_chunks = 0
    total_chars = 0

    # Process each text file
    txt_files = list(TXT_DIR.glob("*.txt"))
    print(f"\nFound {len(txt_files)} text files to process\n")

    for txt_file in txt_files:
        print(f"[PROCESS] {txt_file.name}")

        # Read text
        text = txt_file.read_text(encoding='utf-8')
        print(f"  Characters: {len(text):,}")

        # Chunk
        chunks = chunk_text(
            text,
            chunk_size=RAG_CONFIG["chunk_size"],
            overlap=RAG_CONFIG["chunk_overlap"],
            min_length=RAG_CONFIG["min_chunk_length"]
        )
        print(f"  Chunks: {len(chunks)}")

        # Embed
        embeddings = embedder.embed_chunks(chunks)

        # Store
        add_to_database(collection, chunks, embeddings, txt_file.name)

        # Update stats
        total_chunks += len(chunks)
        total_chars += len(text)

        print()

    # Final summary
    print("=" * 60)
    print("BUILD COMPLETE")
    print("=" * 60)
    print(f"Total text processed: {total_chars:,} characters")
    print(f"Total chunks created: {total_chunks:,}")
    print(f"Database location: {CHROMA_DIR}")
    print(f"Collection: {COLLECTION_NAME}")
    print("=" * 60)


if __name__ == "__main__":
    build_database()

Running the Build

python build_style_db.py

Expected Output:

============================================================
BUILDING STYLE DATABASE
============================================================
[EMBED] Loading model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
[EMBED] Model loaded! Dimension: 384
[DB] Created collection: story_styles

Found 5 text files to process

[PROCESS] fantasy_novel.txt
  Characters: 245,832
  Chunks: 523
  Embedding: 100%|████████████████████| 17/17 [00:08<00:00]
  [DB] Added 523 chunks from fantasy_novel.txt

[PROCESS] cultivation_story.txt
  Characters: 523,109
  Chunks: 1,112
  Embedding: 100%|████████████████████| 35/35 [00:17<00:00]
  [DB] Added 1,112 chunks from cultivation_story.txt

...

============================================================
BUILD COMPLETE
============================================================
Total text processed: 1,523,891 characters
Total chunks created: 3,247
Database location: chroma_db/
Collection: story_styles
============================================================

Step 5: Testing Retrieval

Basic Retrieval Test

# test_retrieval.py

import chromadb
from sentence_transformers import SentenceTransformer
from config import CHROMA_DIR, EMBED_MODEL, COLLECTION_NAME

# Connect to database
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)

# Load embedding model
embedder = SentenceTransformer(EMBED_MODEL)

# Test queries
test_queries = [
    "A young warrior discovers a magical sword",
    "The cultivation technique for immortality",
    "A magic school hidden from ordinary people",
    "A dark lord threatens the kingdom",
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print('='*60)

    # Embed query
    query_embedding = embedder.encode(query).tolist()

    # Search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
        include=["documents", "distances", "metadatas"]
    )

    # Display results
    for i, (doc, dist, meta) in enumerate(zip(
        results['documents'][0],
        results['distances'][0],
        results['metadatas'][0]
    )):
        print(f"\n--- Result {i+1} (distance: {dist:.4f}) ---")
        print(f"Source: {meta['source']}")
        print(f"Preview: {doc[:200]}...")

Understanding Distance Scores

ChromaDB returns distance, not similarity. Lower = more similar.

Distance interpretation (cosine):
0.0 - 0.3  : Very relevant (almost identical meaning)
0.3 - 0.5  : Relevant (similar topic)
0.5 - 0.7  : Somewhat relevant (related)
0.7 - 1.0  : Not very relevant
1.0+       : Unrelated
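
Because we created the collection with "hnsw:space": "cosine", distance is just 1 minus cosine similarity, so converting back is a one-liner (a convenience helper, not something the repo defines):

def distance_to_similarity(distance: float) -> float:
    """ChromaDB cosine distance = 1 - cosine similarity."""
    return 1.0 - distance

print(distance_to_similarity(0.2341))  # ~0.77 -> very relevant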

What Good Retrieval Looks Like

Query: "A young warrior discovers a magical sword"

--- Result 1 (distance: 0.2341) ---
Source: xianxia_novel.txt
Preview: "Chen Wei's fingers closed around the hilt, and ancient
power surged through his meridians. The sword had chosen him.
After ten thousand years, the Heavenly Demon Blade had found
a new master..."

--- Result 2 (distance: 0.2876) ---
Source: fantasy_epic.txt
Preview: "The blade sang as it left the stone, a sound that had
not been heard in seven generations. Young Thomas stared at
his own hands in disbelief. He had done what kings and warriors
could not..."

--- Result 3 (distance: 0.3102) ---
Source: cultivation_story.txt
Preview: "Master Liu held out the rusted sword. 'This weapon chose
your ancestor,' he said. 'Now it stirs again. Take it, if you
dare face the trials that come with such power...'"

All three results are about discovering magical swords!


Troubleshooting Common Issues

Issue 1: "Collection not found" Error

# Error: ValueError: Collection story_styles does not exist.

# Solution: Build the database first!
python build_style_db.py

# Or with reset flag:
python build_style_db.py --reset

Issue 2: Poor Retrieval Quality

Symptom: Retrieved passages don't match query

Causes and solutions:
1. Too few source documents
   → Add more ebooks to data/raw/

2. Chunks too small
   → Increase chunk_size in config.py

3. Wrong embedding model for language
   → Use multilingual model for non-English

4. Query too vague
   → Make queries more specific

Issue 3: Out of Memory During Embedding

Symptom: MemoryError or process killed

Solutions:
1. Reduce batch_size in embed_chunks()
2. Process fewer books at once
3. Use a smaller embedding model
4. Add more RAM (16GB+ recommended)

Issue 4: Slow Embedding Speed

Symptom: Takes hours to embed

Solutions:
1. Use GPU if available:
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2. Use smaller model:
   EMBED_MODEL = "all-MiniLM-L6-v2"  # 2x faster

3. Increase batch_size if you have enough RAM

Performance Optimization Tips

1. Optimal Chunk Sizes by Use Case

| Use Case       | Chunk Size (chars) | Overlap | Why               |
| -------------- | ------------------ | ------- | ----------------- |
| Short stories  | 300-400            | 30      | Tighter focus     |
| Novels         | 500-600            | 50      | Balance           |
| Technical docs | 400-500            | 50      | Preserve sections |
| Poetry         | 200-300            | 20      | Keep stanzas      |

2. When to Rebuild vs. Add

# ADD new documents (fast):
# When: Adding a few new books
# How: Run build_style_db.py with --add flag (if implemented)
#      Or manually add to the existing collection (see the sketch below)

# REBUILD entire database:
# When: Changed chunk_size, changed embedding model, major changes
# How: Delete chroma_db/ and run build_style_db.py fresh
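
Here is a sketch of the "add to existing collection" path, reusing functions from earlier in this article (new_novel.epub is a placeholder, and note that the IDs generated by add_to_database are per-source-file, so re-adding the same file would cause ID collisions):

# add_book.py - sketch: append one book without a full rebuild
from pathlib import Path

import chromadb

from config import CHROMA_DIR, COLLECTION_NAME
from parse_ebooks import parse_ebook, clean_text
from build_style_db import EmbeddingGenerator, chunk_text, add_to_database

client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)  # reuse, don't recreate

book = Path("data/raw/new_novel.epub")
text = clean_text(parse_ebook(book))
chunks = chunk_text(text)
embeddings = EmbeddingGenerator().embed_chunks(chunks)
add_to_database(collection, chunks, embeddings, book.name)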

3. Database Size Estimates

Rule of thumb:
- 1 ebook ≈ 500 chunks
- 500 chunks × 384 dimensions × 4 bytes = ~750 KB embeddings
- Plus text storage ≈ 500 KB
- Total per book ≈ 1.5 MB

For 100 books: ~150 MB database
For 1000 books: ~1.5 GB database
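
The same arithmetic as a quick helper (the per-book averages and the index overhead factor are rough assumptions matching the rule of thumb above):

def estimate_db_size_mb(
    num_books: int,
    chunks_per_book: int = 500,
    dims: int = 384,
    text_kb_per_book: int = 500,
    index_overhead: float = 1.2,  # assumed HNSW/index overhead
) -> float:
    embed_kb = chunks_per_book * dims * 4 / 1024  # float32 vectors, ~750 KB
    return num_books * (embed_kb + text_kb_per_book) * index_overhead / 1024

print(f"{estimate_db_size_mb(100):,.0f} MB")   # ~150 MB
print(f"{estimate_db_size_mb(1000):,.0f} MB")  # ~1,500 MB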

Summary

In this article, we built:

| Component           | Purpose                                | Key Files         |
| ------------------- | -------------------------------------- | ----------------- |
| Ebook Parser        | Extract text from PDF, EPUB, MOBI, TXT | parse_ebooks.py   |
| Text Chunker        | Split text into overlapping chunks     | build_style_db.py |
| Embedding Generator | Convert text to vectors                | build_style_db.py |
| Vector Database     | Store and search embeddings            | chroma_db/        |

Our RAG data pipeline is complete. In Part 3, we'll connect this to LLMs and generate stories that match our learned writing styles.


Quick Reference

# Parse ebooks
./run.sh parse
# or
python parse_ebooks.py

# Build vector database
./run.sh build
# or
python build_style_db.py

# Check status
./run.sh status

# Test retrieval
python -c "
from test_retrieval import test_query
test_query('warrior discovers sword')
"

Next Article: Part 3: Story Generation with RAG →

Previous: Part 1: Understanding RAG

Source Code: github.com/namtran/ai-rag-tutorial-story-generator

