In Part 1, we learned what RAG is, compared it to alternatives, and understood its pros, cons, and limitations. Now let's build it.
In this article, we'll create the complete data pipeline:
Ebooks → Parse → Chunk → Embed → Store in ChromaDB
By the end, you'll have a searchable vector database of writing styles ready for story generation.
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Table of Contents
- Project Setup
- Step 1: Parsing Ebooks
- Step 2: Text Chunking
- Step 3: Generating Embeddings
- Step 4: Storing in ChromaDB
- Step 5: Testing Retrieval
- Troubleshooting Common Issues
- Performance Optimization Tips
Project Setup
Clone and Install
git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Key Dependencies Explained
# requirements.txt
# Vector Database
chromadb>=0.4.0 # Lightweight, embedded vector DB
# Embeddings
sentence-transformers # Pre-trained embedding models
# Ebook Parsing
PyMuPDF # PDF text extraction (fast, reliable)
ebooklib # EPUB parsing
mobi # MOBI/PRC parsing
beautifulsoup4 # HTML cleaning for EPUB
# LLM Backends (for Part 3)
requests # For Ollama API
openai # For OpenAI API
# Web UI (for Part 3)
gradio # Simple web interface
Project Structure
ai-rag-tutorial-story-generator/
├── data/
│ ├── raw/ # Your ebooks go here (.pdf, .epub, .mobi, .txt)
│ └── txt/ # Parsed text files (auto-generated)
├── chroma_db/ # Vector database (auto-generated)
├── models/ # Cached embedding models
│
├── config.py # All configuration in one place
├── parse_ebooks.py # Step 1: Parse ebooks → text
├── build_style_db.py # Step 2-4: Chunk → Embed → Store
├── generate_with_style.py # Step 5+: Retrieve → Generate (Part 3)
│
├── run.sh # Quick commands
└── requirements.txt
Configuration Deep Dive
# config.py - Key settings explained
from pathlib import Path

# ===== DIRECTORIES =====
BASE_DIR = Path(__file__).parent.resolve()
RAW_DIR = BASE_DIR / "data" / "raw" # Put your ebooks here
TXT_DIR = BASE_DIR / "data" / "txt" # Parsed text output
CHROMA_DIR = BASE_DIR / "chroma_db" # Vector database
# ===== EMBEDDING MODEL =====
# We use a multilingual model to support books in any language
# This model outputs 384-dimensional vectors
EMBED_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# Alternative models:
# "all-MiniLM-L6-v2" # Faster, English-only, 384d
# "all-mpnet-base-v2" # Better quality, slower, 768d
# "paraphrase-multilingual-mpnet-base-v2" # Better multilingual, 768d
# ===== CHUNKING SETTINGS =====
RAG_CONFIG = {
"chunk_size": 500, # Characters per chunk
"chunk_overlap": 50, # Overlap between chunks
"min_chunk_length": 100 # Skip chunks smaller than this
}
# Why these values?
# - 500 chars ≈ 100 words ≈ 1-2 paragraphs
# - Large enough for context, small enough for precise retrieval
# - 50 char overlap prevents losing info at boundaries
# - 10% overlap is a good balance (not too much redundancy)
# ===== COLLECTION NAME =====
COLLECTION_NAME = "story_styles" # Name in ChromaDB
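The scripts create these folders as needed, but if you want config.py itself to guarantee they exist, a small optional addition works (this snippet is a suggestion, not part of the repo's config.py):

# Optional: create data directories at import time (mkdir is idempotent)
for d in (RAW_DIR, TXT_DIR, CHROMA_DIR):
    d.mkdir(parents=True, exist_ok=True)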
Step 1: Parsing Ebooks
The Challenge
Ebooks come in many formats, each with its own structure:
| Format | Structure | Challenge |
|---|---|---|
| PDF | Fixed layout, pages | May have headers/footers, columns |
| EPUB | HTML/CSS in a ZIP | Need to extract from HTML |
| MOBI/PRC | Amazon proprietary | Need special library |
| TXT | Plain text | Encoding issues |
Understanding the Parser Code
# parse_ebooks.py - Complete with explanations
from pathlib import Path
import re

import fitz  # PyMuPDF - Note the import name!
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import mobi
def parse_pdf(file_path: Path) -> str:
"""
Extract text from PDF files.
Uses PyMuPDF (fitz) which is fast and handles most PDFs well.
Preserves paragraph structure by keeping line breaks.
"""
doc = fitz.open(file_path)
text_parts = []
for page_num, page in enumerate(doc):
# Extract text with layout preservation
text = page.get_text("text")
# Optional: Skip first/last pages (often cover/copyright)
# if page_num == 0 or page_num == len(doc) - 1:
# continue
text_parts.append(text)
doc.close()
# Join with double newline to preserve page breaks
full_text = "\n\n".join(text_parts)
# Clean up excessive whitespace
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
return full_text
def parse_epub(file_path: Path) -> str:
"""
Extract text from EPUB files.
EPUB files are basically ZIP files containing HTML.
We extract text from each HTML document in reading order.
"""
book = epub.read_epub(str(file_path))
text_parts = []
    # Iterate over all items (note: for strict reading order, follow book.spine)
for item in book.get_items():
# Only process document items (not images, CSS, etc.)
if item.get_type() == ebooklib.ITEM_DOCUMENT:
# Parse HTML content
soup = BeautifulSoup(item.get_content(), 'html.parser')
# Remove script and style elements
for element in soup(['script', 'style', 'nav']):
element.decompose()
# Get text
text = soup.get_text(separator='\n')
text_parts.append(text)
full_text = "\n\n".join(text_parts)
# Clean up
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
full_text = re.sub(r' {2,}', ' ', full_text)
return full_text
def parse_mobi(file_path: Path) -> str:
    """
    Extract text from MOBI/PRC files (Kindle format).
    These files are more complex - the mobi library unpacks them
    into a temp directory, and we parse the resulting HTML.
    """
    import os
    import shutil

    # mobi.extract unpacks the book and returns
    # (temp_dir, path_to_extracted_file)
    temp_path, extracted_file = mobi.extract(str(file_path))
    try:
        # Find the HTML file among the extracted contents
        html_file = None
        for root, dirs, files in os.walk(temp_path):
            for file in files:
                if file.endswith(('.html', '.htm')):
                    html_file = os.path.join(root, file)
                    break
            if html_file:
                break
        if html_file:
            with open(html_file, 'r', encoding='utf-8', errors='ignore') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')
            text = soup.get_text(separator='\n')
        else:
            # Fallback: read the extracted file as plain text
            with open(extracted_file, 'r', encoding='utf-8', errors='ignore') as f:
                text = f.read()
    finally:
        # Remove the temp directory the library created
        shutil.rmtree(temp_path, ignore_errors=True)
    return text
def parse_txt(file_path: Path) -> str:
"""
Read plain text files.
Handle various encodings gracefully.
"""
    encodings = ['utf-8', 'cp1252', 'latin-1']  # latin-1 never fails, so try it last
for encoding in encodings:
try:
return file_path.read_text(encoding=encoding)
except UnicodeDecodeError:
continue
# Last resort: ignore errors
return file_path.read_text(encoding='utf-8', errors='ignore')
def parse_ebook(file_path: Path) -> str:
"""
Parse any supported ebook format.
Returns cleaned text ready for chunking.
"""
suffix = file_path.suffix.lower()
parsers = {
'.pdf': parse_pdf,
'.epub': parse_epub,
'.mobi': parse_mobi,
'.prc': parse_mobi, # PRC is same as MOBI
'.txt': parse_txt,
}
if suffix not in parsers:
raise ValueError(f"Unsupported format: {suffix}")
return parsers[suffix](file_path)
def clean_text(text: str) -> str:
"""
Clean and normalize extracted text.
- Remove excessive whitespace
- Fix common OCR errors
- Normalize quotes and dashes
"""
# Normalize line endings
text = text.replace('\r\n', '\n').replace('\r', '\n')
# Remove excessive blank lines
text = re.sub(r'\n{3,}', '\n\n', text)
# Remove excessive spaces
text = re.sub(r' {2,}', ' ', text)
# Fix common issues
    text = text.replace('\u201c', '"').replace('\u201d', '"')  # Smart quotes
    text = text.replace('\u2018', "'").replace('\u2019', "'")  # Smart apostrophes
text = text.replace('—', '--').replace('–', '-') # Dashes
# Remove page numbers (common pattern)
text = re.sub(r'\n\d+\n', '\n', text)
# Strip leading/trailing whitespace from lines
lines = [line.strip() for line in text.split('\n')]
text = '\n'.join(lines)
return text.strip()
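A quick way to sanity-check clean_text is to run it on a deliberately messy string (the sample text here is made up for illustration):

# Illustrative check, assuming clean_text from above
messy = 'Chapter One\r\n\r\n\r\n\u201cHello,\u201d  she said \u2014 quietly.\n42\nThe  end.'
print(clean_text(messy))
# Chapter One
#
# "Hello," she said -- quietly.
# The end.
# (smart quotes straightened, dash normalized, page number 42 dropped)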
Running the Parser
# Add your ebooks
cp ~/Books/*.epub data/raw/
cp ~/Books/*.pdf data/raw/
# Run parser
python parse_ebooks.py
Expected Output:
============================================================
EBOOK PARSER
============================================================
Source: data/raw/
Output: data/txt/
============================================================
[PARSE] Found 5 ebooks to process
[1/5] fantasy_novel.epub
Format: EPUB
Processing... Done!
Output: fantasy_novel.txt (245,832 characters)
[2/5] cultivation_story.pdf
Format: PDF (127 pages)
Processing... Done!
Output: cultivation_story.txt (523,109 characters)
[3/5] magic_school.mobi
Format: MOBI
Processing... Done!
Output: magic_school.txt (312,445 characters)
...
============================================================
COMPLETE
============================================================
Processed: 5 files
Total text: 1,523,891 characters
Output directory: data/txt/
============================================================
Step 2: Text Chunking
Why Chunking Matters
Chunking is critical for RAG quality. Bad chunking = bad retrieval = bad output.
┌─────────────────────────────────────────────────────────────────┐
│ CHUNKING IMPACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "Write about a warrior's first battle" │
│ │
│ GOOD CHUNKING: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ "Chen Wei gripped his sword tightly, knuckles white. │ │
│ │ Before him stood a hundred enemy soldiers. This was it │ │
│ │ - his first real battle. Master Liu's training echoed │ │
│ │ in his mind: 'When fear comes, let it pass through you.'│ │
│ │ He raised his blade and charged." │ │
│ └─────────────────────────────────────────────────────────┘ │
│ → Complete scene, good context, useful for style learning │
│ │
│ BAD CHUNKING: │
│ ┌──────────────────────┐ ┌──────────────────────────────────┐ │
│ │ "Chen Wei gripped his│ │sword tightly, knuckles white. │ │
│ │" │ │Before him stood a hundred enemy │ │
│ └──────────────────────┘ └──────────────────────────────────┘ │
│ → Split mid-sentence, loses meaning, poor retrieval │
│ │
└─────────────────────────────────────────────────────────────────┘
Chunking Strategies Compared
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Every N characters | Simple, predictable | May cut mid-sentence | General use |
| Sentence | Split on periods | Natural boundaries | Variable sizes | Precise retrieval |
| Paragraph | Split on newlines | Preserves ideas | Very variable | Long-form content |
| Semantic | ML-based topic detection | Best relevance | Slow, complex | Production systems |
| Recursive | Try large, then smaller | Adaptive | More complex | Mixed content |
Our Implementation: Fixed-Size with Smart Boundaries
# From build_style_db.py
def chunk_text(
text: str,
chunk_size: int = 500,
overlap: int = 50,
min_length: int = 100
) -> list[str]:
"""
Split text into overlapping chunks with smart boundary detection.
Args:
text: Full text to chunk
chunk_size: Target size in characters
overlap: Characters to overlap between chunks
min_length: Minimum chunk size (skip smaller)
Returns:
List of text chunks
"""
chunks = []
start = 0
text_length = len(text)
while start < text_length:
# Get initial chunk
end = start + chunk_size
# Don't go past the end
if end >= text_length:
chunk = text[start:].strip()
if len(chunk) >= min_length:
chunks.append(chunk)
break
# Extract chunk
chunk = text[start:end]
# Find the best break point (sentence boundary)
# Look for period, exclamation, or question mark followed by space
best_break = -1
for punct in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
pos = chunk.rfind(punct)
if pos > best_break and pos > chunk_size * 0.5:
best_break = pos + len(punct)
# If found a good break point, use it
if best_break > 0:
chunk = chunk[:best_break].strip()
end = start + best_break
# Also try paragraph break
para_break = chunk.rfind('\n\n')
if para_break > chunk_size * 0.7: # Prefer paragraph if late enough
chunk = chunk[:para_break].strip()
end = start + para_break
# Add chunk if long enough
if len(chunk) >= min_length:
chunks.append(chunk)
# Move start, accounting for overlap
start = end - overlap if end > overlap else end
return chunks
Visualizing Overlap
Original text (simplified):
"AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"
 |-------- chunk 1 --------|
                      |-------- chunk 2 --------|
                                          |---- chunk 3 ----|
Chunk 1: "AAAAAAAAAA BBBBBBBBBB CC"
Chunk 2: "BB CCCCCCCCCC DDDDDDDDDD" ← "BB CC" appears in both!
Chunk 3: "DD EEEEEEEEEE"
Why overlap?
- Sentence about "B and C" isn't lost at boundary
- Queries about "C" can match chunks 1 or 2
- Better retrieval for edge cases
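You can watch the overlap happen by running chunk_text on the toy string (min_length drops to 0 so the short demo chunks aren't filtered out; the exact cut points depend on the implementation above, so they differ slightly from the simplified picture):

from build_style_db import chunk_text

toy = "AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"
for c in chunk_text(toy, chunk_size=25, overlap=5, min_length=0):
    print(repr(c))
# 'AAAAAAAAAA BBBBBBBBBB CCC'
# 'B CCCCCCCCCC DDDDDDDDDD E'   ← shares text with both neighbors
# 'DDD EEEEEEEEEE'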
Testing Your Chunks
# test_chunking.py - Verify chunk quality
from build_style_db import chunk_text
# Load a sample text
with open("data/txt/sample_book.txt", encoding="utf-8") as f:
text = f.read()
# Chunk it
chunks = chunk_text(text, chunk_size=500, overlap=50)
# Analyze
print(f"Total chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
print(f"Min chunk size: {min(len(c) for c in chunks)} chars")
print(f"Max chunk size: {max(len(c) for c in chunks)} chars")
# Show a few samples
print("\n--- Sample Chunks ---")
for i in [0, len(chunks)//2, -1]:
print(f"\nChunk {i}:")
print(chunks[i][:200] + "...")
print(f"Length: {len(chunks[i])} chars")
Step 3: Generating Embeddings
What Are Embeddings?
Embeddings convert text into dense vectors that capture semantic meaning:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Text becomes a vector
text = "The warrior drew his sword"
vector = model.encode(text)
print(f"Text: '{text}'")
print(f"Vector shape: {vector.shape}") # (384,)
print(f"First 5 values: {vector[:5]}") # [0.23, -0.45, 0.67, ...]
Why Embeddings Work
Semantic similarity is captured in vector space:
"The warrior drew his sword" → [0.23, -0.45, 0.67, ...]
"The fighter unsheathed blade" → [0.21, -0.43, 0.65, ...] ← Similar!
"I like to eat pizza" → [-0.56, 0.32, -0.11, ...] ← Different!
  ↑
  |   [warrior]  [fighter]   ← near each other (sword/blade meanings)
  |
  |
  |                [pizza]   ← far away
  └───────────────────────→
      word embedding space
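You can verify the similarity claim directly; util.cos_sim ships with sentence-transformers, and the exact scores will vary by model version:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
vecs = model.encode([
    "The warrior drew his sword",
    "The fighter unsheathed his blade",
    "I like to eat pizza",
])
print(util.cos_sim(vecs[0], vecs[1]).item())  # high: same meaning
print(util.cos_sim(vecs[0], vecs[2]).item())  # low: unrelated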
Choosing an Embedding Model
| Model | Dimensions | Speed | Quality | Languages | Size |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | English | 80MB |
| all-MiniLM-L12-v2 | 384 | Fast | Better | English | 120MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | Fast | Good | 50+ | 420MB |
| all-mpnet-base-v2 | 768 | Medium | Best | English | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 768 | Medium | Best | 50+ | 970MB |
We use paraphrase-multilingual-MiniLM-L12-v2 because:
- Supports 50+ languages (Chinese, Vietnamese, etc.)
- Good balance of speed and quality
- 384 dimensions is efficient for storage
- Works well for style/semantic similarity
Embedding Implementation
# From build_style_db.py
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

from config import EMBED_MODEL
class EmbeddingGenerator:
def __init__(self, model_name: str = EMBED_MODEL):
print(f"[EMBED] Loading model: {model_name}")
self.model = SentenceTransformer(model_name)
print(f"[EMBED] Model loaded! Dimension: {self.model.get_sentence_embedding_dimension()}")
def embed_chunks(self, chunks: list[str], batch_size: int = 32) -> list:
"""
Generate embeddings for a list of text chunks.
Uses batching for efficiency on large datasets.
Shows progress bar for long operations.
"""
print(f"[EMBED] Generating embeddings for {len(chunks)} chunks...")
# For small datasets, encode all at once
if len(chunks) <= batch_size:
embeddings = self.model.encode(chunks, show_progress_bar=True)
return embeddings.tolist()
# For large datasets, batch for memory efficiency
all_embeddings = []
for i in tqdm(range(0, len(chunks), batch_size), desc="Embedding"):
batch = chunks[i:i + batch_size]
batch_embeddings = self.model.encode(batch)
all_embeddings.extend(batch_embeddings.tolist())
return all_embeddings
def embed_query(self, query: str) -> list:
"""Embed a single query string."""
return self.model.encode(query).tolist()
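Usage is two lines (the sample chunks here are invented):

embedder = EmbeddingGenerator()
vectors = embedder.embed_chunks([
    "Chen Wei gripped his sword tightly.",
    "Master Liu's training echoed in his mind.",
])
print(len(vectors), len(vectors[0]))  # 2 chunks, 384 dimensions each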
Embedding Performance Tips
# Speed comparison for 10,000 chunks:
# CPU (Intel i7)
# - batch_size=32: ~5 minutes
# - batch_size=64: ~4 minutes
# - batch_size=128: ~3.5 minutes (may OOM on 8GB RAM)
# GPU (NVIDIA RTX 3080)
# - batch_size=32: ~30 seconds
# - batch_size=64: ~20 seconds
# - batch_size=128: ~15 seconds
# Apple Silicon (M1/M2)
# - batch_size=32: ~2 minutes
# - batch_size=64: ~1.5 minutes
# Tip: For large collections, run overnight!
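sentence-transformers picks a device automatically, but if you want to pin it explicitly, here's a minimal sketch (assumes torch is installed, which it is as a sentence-transformers dependency):

import torch
from sentence_transformers import SentenceTransformer
from config import EMBED_MODEL

if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon
else:
    device = "cpu"

model = SentenceTransformer(EMBED_MODEL, device=device)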
Step 4: Storing in ChromaDB
Why ChromaDB?
| Feature | ChromaDB | Pinecone | Weaviate | Milvus |
|---|---|---|---|---|
| Deployment | Embedded | Cloud | Self-hosted | Self-hosted |
| Setup | pip install | Account required | Docker | Docker/K8s |
| Cost | Free | Free tier + paid | Free | Free |
| Scale | ~1M vectors | Billions | Billions | Billions |
| Best For | Learning, prototypes | Production | Production | Enterprise |
ChromaDB is perfect for learning because:
- No server to run
- Data persists to disk
- Simple Python API
- Works offline
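The whole lifecycle really is a few lines; here's a throwaway round-trip (not the project's code, just a demo with a dummy 384-dimensional vector):

import chromadb

client = chromadb.PersistentClient(path="demo_db")
col = client.get_or_create_collection("demo")
col.add(ids=["doc1"], documents=["hello world"], embeddings=[[0.1] * 384])
print(col.count())  # 1, and still 1 after a restart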
Creating the Database
# From build_style_db.py
import chromadb
from chromadb.config import Settings
def create_database(db_path: str, collection_name: str):
"""
Create or connect to a ChromaDB database.
Args:
db_path: Directory to store database files
collection_name: Name for the collection
Returns:
ChromaDB collection object
"""
# Create persistent client (data survives restarts)
client = chromadb.PersistentClient(
path=db_path,
settings=Settings(
anonymized_telemetry=False # Disable telemetry
)
)
# Delete existing collection if present (for clean rebuild)
try:
client.delete_collection(collection_name)
print(f"[DB] Deleted existing collection: {collection_name}")
except ValueError:
pass # Collection didn't exist
# Create new collection
collection = client.create_collection(
name=collection_name,
metadata={
"description": "Writing style samples for story generation",
"hnsw:space": "cosine" # Use cosine similarity
}
)
print(f"[DB] Created collection: {collection_name}")
return collection
Adding Documents
def add_to_database(
collection,
chunks: list[str],
embeddings: list[list[float]],
source_file: str
):
"""
Add chunks and embeddings to ChromaDB.
Args:
collection: ChromaDB collection
chunks: List of text chunks
embeddings: Corresponding embeddings
source_file: Name of source file (for metadata)
"""
# Generate unique IDs
# Format: source_chunknum (e.g., "fantasy_novel_0042")
base_name = Path(source_file).stem
ids = [f"{base_name}_{i:04d}" for i in range(len(chunks))]
# Create metadata for each chunk
metadatas = [
{
"source": source_file,
"chunk_index": i,
"char_count": len(chunk)
}
for i, chunk in enumerate(chunks)
]
# Add to collection
    # (a single add() call is fine at this scale; see the batching note after this function)
collection.add(
ids=ids,
documents=chunks,
embeddings=embeddings,
metadatas=metadatas
)
print(f"[DB] Added {len(chunks)} chunks from {source_file}")
The Complete Build Script
# build_style_db.py - Complete pipeline
from config import CHROMA_DIR, TXT_DIR, COLLECTION_NAME, RAG_CONFIG
def build_database():
"""Build the complete vector database from parsed texts."""
print("=" * 60)
print("BUILDING STYLE DATABASE")
print("=" * 60)
# Initialize components
embedder = EmbeddingGenerator()
collection = create_database(str(CHROMA_DIR), COLLECTION_NAME)
# Track statistics
total_chunks = 0
total_chars = 0
# Process each text file
txt_files = list(TXT_DIR.glob("*.txt"))
print(f"\nFound {len(txt_files)} text files to process\n")
for txt_file in txt_files:
print(f"[PROCESS] {txt_file.name}")
# Read text
text = txt_file.read_text(encoding='utf-8')
print(f" Characters: {len(text):,}")
# Chunk
chunks = chunk_text(
text,
chunk_size=RAG_CONFIG["chunk_size"],
overlap=RAG_CONFIG["chunk_overlap"],
min_length=RAG_CONFIG["min_chunk_length"]
)
print(f" Chunks: {len(chunks)}")
# Embed
embeddings = embedder.embed_chunks(chunks)
# Store
add_to_database(collection, chunks, embeddings, txt_file.name)
# Update stats
total_chunks += len(chunks)
total_chars += len(text)
print()
# Final summary
print("=" * 60)
print("BUILD COMPLETE")
print("=" * 60)
print(f"Total text processed: {total_chars:,} characters")
print(f"Total chunks created: {total_chunks:,}")
print(f"Database location: {CHROMA_DIR}")
print(f"Collection: {COLLECTION_NAME}")
print("=" * 60)
if __name__ == "__main__":
build_database()
Running the Build
python build_style_db.py
Expected Output:
============================================================
BUILDING STYLE DATABASE
============================================================
[EMBED] Loading model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
[EMBED] Model loaded! Dimension: 384
[DB] Created collection: story_styles
Found 5 text files to process
[PROCESS] fantasy_novel.txt
Characters: 245,832
Chunks: 523
Embedding: 100%|████████████████████| 17/17 [00:08<00:00]
[DB] Added 523 chunks from fantasy_novel.txt
[PROCESS] cultivation_story.txt
Characters: 523,109
Chunks: 1,112
Embedding: 100%|████████████████████| 35/35 [00:17<00:00]
[DB] Added 1,112 chunks from cultivation_story.txt
...
============================================================
BUILD COMPLETE
============================================================
Total text processed: 1,523,891 characters
Total chunks created: 3,247
Database location: chroma_db/
Collection: story_styles
============================================================
Step 5: Testing Retrieval
Basic Retrieval Test
# test_retrieval.py
import chromadb
from sentence_transformers import SentenceTransformer
from config import CHROMA_DIR, EMBED_MODEL, COLLECTION_NAME
# Connect to database
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)
# Load embedding model
embedder = SentenceTransformer(EMBED_MODEL)
# Test queries
test_queries = [
"A young warrior discovers a magical sword",
"The cultivation technique for immortality",
"A magic school hidden from ordinary people",
"A dark lord threatens the kingdom",
]
for query in test_queries:
print(f"\n{'='*60}")
print(f"Query: {query}")
print('='*60)
# Embed query
query_embedding = embedder.encode(query).tolist()
# Search
results = collection.query(
query_embeddings=[query_embedding],
n_results=3,
include=["documents", "distances", "metadatas"]
)
# Display results
for i, (doc, dist, meta) in enumerate(zip(
results['documents'][0],
results['distances'][0],
results['metadatas'][0]
)):
print(f"\n--- Result {i+1} (distance: {dist:.4f}) ---")
print(f"Source: {meta['source']}")
print(f"Preview: {doc[:200]}...")
Understanding Distance Scores
ChromaDB returns distance, not similarity. Lower = more similar.
Distance interpretation (cosine):
0.0 - 0.3 : Very relevant (almost identical meaning)
0.3 - 0.5 : Relevant (similar topic)
0.5 - 0.7 : Somewhat relevant (related)
0.7 - 1.0 : Not very relevant
1.0+ : Unrelated
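Because the collection was created with hnsw:space set to cosine, distance is just 1 minus cosine similarity, so converting is one line:

def to_similarity(distance: float) -> float:
    """Convert ChromaDB cosine distance to cosine similarity."""
    return 1.0 - distance

print(to_similarity(0.2341))  # 0.7659 → very relevant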
What Good Retrieval Looks Like
Query: "A young warrior discovers a magical sword"
--- Result 1 (distance: 0.2341) ---
Source: xianxia_novel.txt
Preview: "Chen Wei's fingers closed around the hilt, and ancient
power surged through his meridians. The sword had chosen him.
After ten thousand years, the Heavenly Demon Blade had found
a new master..."
--- Result 2 (distance: 0.2876) ---
Source: fantasy_epic.txt
Preview: "The blade sang as it left the stone, a sound that had
not been heard in seven generations. Young Thomas stared at
his own hands in disbelief. He had done what kings and warriors
could not..."
--- Result 3 (distance: 0.3102) ---
Source: cultivation_story.txt
Preview: "Master Liu held out the rusted sword. 'This weapon chose
your ancestor,' he said. 'Now it stirs again. Take it, if you
dare face the trials that come with such power...'"
All three results are about discovering magical swords!
Troubleshooting Common Issues
Issue 1: "Collection not found" Error
# Error: ValueError: Collection story_styles does not exist.
# Solution: Build the database first!
python build_style_db.py
# Or with reset flag:
python build_style_db.py --reset
Issue 2: Poor Retrieval Quality
Symptom: Retrieved passages don't match query
Causes and solutions:
1. Too few source documents
→ Add more ebooks to data/raw/
2. Chunks too small
→ Increase chunk_size in config.py
3. Wrong embedding model for language
→ Use multilingual model for non-English
4. Query too vague
→ Make queries more specific
Issue 3: Out of Memory During Embedding
Symptom: MemoryError or process killed
Solutions:
1. Reduce batch_size in embed_chunks()
2. Process fewer books at once
3. Use a smaller embedding model
4. Add more RAM (16GB+ recommended)
Issue 4: Slow Embedding Speed
Symptom: Takes hours to embed
Solutions:
1. Use GPU if available:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
2. Use smaller model:
EMBED_MODEL = "all-MiniLM-L6-v2" # 2x faster
3. Increase batch_size if you have enough RAM
Performance Optimization Tips
1. Optimal Chunk Sizes by Use Case
| Use Case | Chunk Size | Overlap | Why |
|---|---|---|---|
| Short stories | 300-400 | 30 | Tighter focus |
| Novels | 500-600 | 50 | Balance |
| Technical docs | 400-500 | 50 | Preserve sections |
| Poetry | 200-300 | 20 | Keep stanzas |
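Applying a row from this table is just a config edit, e.g. for short stories (the min_chunk_length value is a scaled-down version of the default 100, roughly 20% of chunk_size):

# config.py tweak for short stories (values from the table above)
RAG_CONFIG = {
    "chunk_size": 350,
    "chunk_overlap": 30,
    "min_chunk_length": 70,  # assumed: ~20% of chunk_size, like the default
}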
2. When to Rebuild vs. Add
# ADD new documents (fast):
# When: Adding a few new books
# How: Run build_style_db.py with --add flag (if implemented)
# Or manually add to existing collection
# REBUILD entire database:
# When: Changed chunk_size, changed embedding model, major changes
# How: Delete chroma_db/ and run build_style_db.py fresh
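The manual "add" path looks like this (a sketch that reuses the helpers above; get_collection connects without wiping existing data):

import chromadb
from config import CHROMA_DIR, COLLECTION_NAME

client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)  # existing chunks kept

# Then chunk and embed the new book exactly as in build_style_db.py:
# add_to_database(collection, chunks, embeddings, "new_book.txt")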
3. Database Size Estimates
Rule of thumb:
- 1 ebook ≈ 500 chunks
- 500 chunks × 384 dimensions × 4 bytes = ~750 KB embeddings
- Plus text storage ≈ 500 KB
- Total per book ≈ 1.5 MB
For 100 books: ~150 MB database
For 1000 books: ~1.5 GB database
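The embedding half of that arithmetic is exact (float32 means 4 bytes per dimension); the text half is rougher, since it includes IDs, metadata, and index overhead:

chunks_per_book = 500
dims, bytes_per_float = 384, 4
embed_kb = chunks_per_book * dims * bytes_per_float / 1024
print(f"{embed_kb:.0f} KB of embeddings per book")  # 750 KB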
Summary
In this article, we built:
| Component | Purpose | Key Files |
|---|---|---|
| Ebook Parser | Extract text from PDF, EPUB, MOBI, TXT | parse_ebooks.py |
| Text Chunker | Split into overlapping chunks | build_style_db.py |
| Embedding Generator | Convert text to vectors | build_style_db.py |
| Vector Database | Store and search embeddings | chroma_db/ |
Our RAG data pipeline is complete. In Part 3, we'll connect this to LLMs and generate stories that match our learned writing styles.
Quick Reference
# Parse ebooks
./run.sh parse
# or
python parse_ebooks.py
# Build vector database
./run.sh build
# or
python build_style_db.py
# Check status
./run.sh status
# Test retrieval
python -c "
from test_retrieval import test_query
test_query('warrior discovers sword')
"
Next Article: Part 3: Story Generation with RAG →
Previous: Part 1: Understanding RAG
Source Code: github.com/namtran/ai-rag-tutorial-story-generator