In Part 1, we learned what RAG is, compared it to alternatives, and understood its pros, cons, and limitations. Now let's build it.
In this article, we'll create the complete data pipeline:
Ebooks → Parse → Chunk → Embed → Store in ChromaDB
By the end, you'll have a searchable vector database of writing styles ready for story generation.
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
Table of Contents
- Project Setup
- Step 1: Parsing Ebooks
- Step 2: Text Chunking
- Step 3: Generating Embeddings
- Step 4: Storing in ChromaDB
- Step 5: Testing Retrieval
- Troubleshooting Common Issues
- Performance Optimization Tips
Project Setup
Clone and Install
git clone https://github.com/namtran/ai-rag-tutorial-story-generator.git
cd ai-rag-tutorial-story-generator
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Key Dependencies Explained
# requirements.txt
# Vector Database
chromadb>=0.4.0 # Lightweight, embedded vector DB
# Embeddings
sentence-transformers # Pre-trained embedding models
# Ebook Parsing
PyMuPDF # PDF text extraction (fast, reliable)
ebooklib # EPUB parsing
mobi # MOBI/PRC parsing
beautifulsoup4 # HTML cleaning for EPUB
# LLM Backends (for Part 3)
requests # For Ollama API
openai # For OpenAI API
# Web UI (for Part 3)
gradio # Simple web interface
Project Structure
ai-rag-tutorial-story-generator/
├── data/
│ ├── raw/ # Your ebooks go here (.pdf, .epub, .mobi, .txt)
│ └── txt/ # Parsed text files (auto-generated)
├── chroma_db/ # Vector database (auto-generated)
├── models/ # Cached embedding models
│
├── config.py # All configuration in one place
├── parse_ebooks.py # Step 1: Parse ebooks → text
├── build_style_db.py # Step 2-4: Chunk → Embed → Store
├── generate_with_style.py # Step 5+: Retrieve → Generate (Part 3)
│
├── run.sh # Quick commands
└── requirements.txt
Configuration Deep Dive
# config.py - Key settings explained
from pathlib import Path

# ===== DIRECTORIES =====
BASE_DIR = Path(__file__).parent.resolve()
RAW_DIR = BASE_DIR / "data" / "raw" # Put your ebooks here
TXT_DIR = BASE_DIR / "data" / "txt" # Parsed text output
CHROMA_DIR = BASE_DIR / "chroma_db" # Vector database
# ===== EMBEDDING MODEL =====
# We use a multilingual model to support books in any language
# This model outputs 384-dimensional vectors
EMBED_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# Alternative models:
# "all-MiniLM-L6-v2" # Faster, English-only, 384d
# "all-mpnet-base-v2" # Better quality, slower, 768d
# "paraphrase-multilingual-mpnet-base-v2" # Better multilingual, 768d
# ===== CHUNKING SETTINGS =====
RAG_CONFIG = {
"chunk_size": 500, # Characters per chunk
"chunk_overlap": 50, # Overlap between chunks
"min_chunk_length": 100 # Skip chunks smaller than this
}
# Why these values?
# - 500 chars ≈ 100 words ≈ 1-2 paragraphs
# - Large enough for context, small enough for precise retrieval
# - 50 char overlap prevents losing info at boundaries
# - 10% overlap is a good balance (not too much redundancy)
# ===== COLLECTION NAME =====
COLLECTION_NAME = "story_styles" # Name in ChromaDB
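The scripts create these folders as needed, but if you want config.py itself to guarantee they exist, a small optional addition works (this snippet is a suggestion, not part of the repo's config.py):

# Optional: create data directories at import time (mkdir is idempotent)
for d in (RAW_DIR, TXT_DIR, CHROMA_DIR):
    d.mkdir(parents=True, exist_ok=True)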
Step 1: Parsing Ebooks
The Challenge
Ebooks come in many formats, each with its own structure:
| Format | Structure | Challenge |
|---|---|---|
| PDF | Fixed layout, pages | May have headers/footers, columns |
| EPUB | HTML/CSS in a ZIP | Need to extract from HTML |
| MOBI/PRC | Amazon proprietary | Need special library |
| TXT | Plain text | Encoding issues |
Understanding the Parser Code
# parse_ebooks.py - Complete with explanations
from pathlib import Path
import re

import fitz  # PyMuPDF - Note the import name!
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import mobi
def parse_pdf(file_path: Path) -> str:
"""
Extract text from PDF files.
Uses PyMuPDF (fitz) which is fast and handles most PDFs well.
Preserves paragraph structure by keeping line breaks.
"""
doc = fitz.open(file_path)
text_parts = []
for page_num, page in enumerate(doc):
# Extract text with layout preservation
text = page.get_text("text")
# Optional: Skip first/last pages (often cover/copyright)
# if page_num == 0 or page_num == len(doc) - 1:
# continue
text_parts.append(text)
doc.close()
# Join with double newline to preserve page breaks
full_text = "\n\n".join(text_parts)
# Clean up excessive whitespace
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
return full_text
def parse_epub(file_path: Path) -> str:
"""
Extract text from EPUB files.
EPUB files are basically ZIP files containing HTML.
We extract text from each HTML document in reading order.
"""
book = epub.read_epub(str(file_path))
text_parts = []
    # Iterate over all items (note: for strict reading order, follow book.spine)
for item in book.get_items():
# Only process document items (not images, CSS, etc.)
if item.get_type() == ebooklib.ITEM_DOCUMENT:
# Parse HTML content
soup = BeautifulSoup(item.get_content(), 'html.parser')
# Remove script and style elements
for element in soup(['script', 'style', 'nav']):
element.decompose()
# Get text
text = soup.get_text(separator='\n')
text_parts.append(text)
full_text = "\n\n".join(text_parts)
# Clean up
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
full_text = re.sub(r' {2,}', ' ', full_text)
return full_text
def parse_mobi(file_path: Path) -> str:
    """
    Extract text from MOBI/PRC files (Kindle format).
    These files are more complex - the mobi library unpacks them
    into a temp directory, and we parse the resulting HTML.
    """
    import os
    import shutil

    # mobi.extract unpacks the book and returns
    # (temp_dir, path_to_extracted_file)
    temp_path, extracted_file = mobi.extract(str(file_path))
    try:
        # Find the HTML file among the extracted contents
        html_file = None
        for root, dirs, files in os.walk(temp_path):
            for file in files:
                if file.endswith(('.html', '.htm')):
                    html_file = os.path.join(root, file)
                    break
            if html_file:
                break
        if html_file:
            with open(html_file, 'r', encoding='utf-8', errors='ignore') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')
            text = soup.get_text(separator='\n')
        else:
            # Fallback: read the extracted file as plain text
            with open(extracted_file, 'r', encoding='utf-8', errors='ignore') as f:
                text = f.read()
    finally:
        # Remove the temp directory the library created
        shutil.rmtree(temp_path, ignore_errors=True)
    return text
def parse_txt(file_path: Path) -> str:
"""
Read plain text files.
Handle various encodings gracefully.
"""
    encodings = ['utf-8', 'cp1252', 'latin-1']  # latin-1 never fails, so try it last
for encoding in encodings:
try:
return file_path.read_text(encoding=encoding)
except UnicodeDecodeError:
continue
# Last resort: ignore errors
return file_path.read_text(encoding='utf-8', errors='ignore')
def parse_ebook(file_path: Path) -> str:
"""
Parse any supported ebook format.
Returns cleaned text ready for chunking.
"""
suffix = file_path.suffix.lower()
parsers = {
'.pdf': parse_pdf,
'.epub': parse_epub,
'.mobi': parse_mobi,
'.prc': parse_mobi, # PRC is same as MOBI
'.txt': parse_txt,
}
if suffix not in parsers:
raise ValueError(f"Unsupported format: {suffix}")
return parsers[suffix](file_path)
def clean_text(text: str) -> str:
"""
Clean and normalize extracted text.
- Remove excessive whitespace
- Fix common OCR errors
- Normalize quotes and dashes
"""
# Normalize line endings
text = text.replace('\r\n', '\n').replace('\r', '\n')
# Remove excessive blank lines
text = re.sub(r'\n{3,}', '\n\n', text)
# Remove excessive spaces
text = re.sub(r' {2,}', ' ', text)
# Fix common issues
    text = text.replace('\u201c', '"').replace('\u201d', '"')  # Smart quotes
    text = text.replace('\u2018', "'").replace('\u2019', "'")  # Smart apostrophes
text = text.replace('—', '--').replace('–', '-') # Dashes
# Remove page numbers (common pattern)
text = re.sub(r'\n\d+\n', '\n', text)
# Strip leading/trailing whitespace from lines
lines = [line.strip() for line in text.split('\n')]
text = '\n'.join(lines)
return text.strip()
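A quick way to sanity-check clean_text is to run it on a deliberately messy string (the sample text here is made up for illustration):

# Illustrative check, assuming clean_text from above
messy = 'Chapter One\r\n\r\n\r\n\u201cHello,\u201d  she said \u2014 quietly.\n42\nThe  end.'
print(clean_text(messy))
# Chapter One
#
# "Hello," she said -- quietly.
# The end.
# (smart quotes straightened, dash normalized, page number 42 dropped)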
Running the Parser
# Add your ebooks
cp ~/Books/*.epub data/raw/
cp ~/Books/*.pdf data/raw/
# Run parser
python parse_ebooks.py
Expected Output:
============================================================
EBOOK PARSER
============================================================
Source: data/raw/
Output: data/txt/
============================================================
[PARSE] Found 5 ebooks to process
[1/5] fantasy_novel.epub
Format: EPUB
Processing... Done!
Output: fantasy_novel.txt (245,832 characters)
[2/5] cultivation_story.pdf
Format: PDF (127 pages)
Processing... Done!
Output: cultivation_story.txt (523,109 characters)
[3/5] magic_school.mobi
Format: MOBI
Processing... Done!
Output: magic_school.txt (312,445 characters)
...
============================================================
COMPLETE
============================================================
Processed: 5 files
Total text: 1,523,891 characters
Output directory: data/txt/
============================================================
Step 2: Text Chunking
Why Chunking Matters
Chunking is critical for RAG quality. Bad chunking = bad retrieval = bad output.
┌─────────────────────────────────────────────────────────────────┐
│ CHUNKING IMPACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "Write about a warrior's first battle" │
│ │
│ GOOD CHUNKING: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ "Chen Wei gripped his sword tightly, knuckles white. │ │
│ │ Before him stood a hundred enemy soldiers. This was it │ │
│ │ - his first real battle. Master Liu's training echoed │ │
│ │ in his mind: 'When fear comes, let it pass through you.'│ │
│ │ He raised his blade and charged." │ │
│ └─────────────────────────────────────────────────────────┘ │
│ → Complete scene, good context, useful for style learning │
│ │
│ BAD CHUNKING: │
│ ┌──────────────────────┐ ┌──────────────────────────────────┐ │
│ │ "Chen Wei gripped his│ │sword tightly, knuckles white. │ │
│ │" │ │Before him stood a hundred enemy │ │
│ └──────────────────────┘ └──────────────────────────────────┘ │
│ → Split mid-sentence, loses meaning, poor retrieval │
│ │
└─────────────────────────────────────────────────────────────────┘
Chunking Strategies Compared
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Every N characters | Simple, predictable | May cut mid-sentence | General use |
| Sentence | Split on periods | Natural boundaries | Variable sizes | Precise retrieval |
| Paragraph | Split on newlines | Preserves ideas | Very variable | Long-form content |
| Semantic | ML-based topic detection | Best relevance | Slow, complex | Production systems |
| Recursive | Try large, then smaller | Adaptive | More complex | Mixed content |
Our Implementation: Fixed-Size with Smart Boundaries
# From build_style_db.py
def chunk_text(
text: str,
chunk_size: int = 500,
overlap: int = 50,
min_length: int = 100
) -> list[str]:
"""
Split text into overlapping chunks with smart boundary detection.
Args:
text: Full text to chunk
chunk_size: Target size in characters
overlap: Characters to overlap between chunks
min_length: Minimum chunk size (skip smaller)
Returns:
List of text chunks
"""
chunks = []
start = 0
text_length = len(text)
while start < text_length:
# Get initial chunk
end = start + chunk_size
# Don't go past the end
if end >= text_length:
chunk = text[start:].strip()
if len(chunk) >= min_length:
chunks.append(chunk)
break
# Extract chunk
chunk = text[start:end]
# Find the best break point (sentence boundary)
# Look for period, exclamation, or question mark followed by space
best_break = -1
for punct in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
pos = chunk.rfind(punct)
if pos > best_break and pos > chunk_size * 0.5:
best_break = pos + len(punct)
# If found a good break point, use it
if best_break > 0:
chunk = chunk[:best_break].strip()
end = start + best_break
# Also try paragraph break
para_break = chunk.rfind('\n\n')
if para_break > chunk_size * 0.7: # Prefer paragraph if late enough
chunk = chunk[:para_break].strip()
end = start + para_break
# Add chunk if long enough
if len(chunk) >= min_length:
chunks.append(chunk)
# Move start, accounting for overlap
start = end - overlap if end > overlap else end
return chunks
Visualizing Overlap
Original text (simplified):
"AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"
 |-------- chunk 1 --------|
                      |-------- chunk 2 --------|
                                          |---- chunk 3 ----|
Chunk 1: "AAAAAAAAAA BBBBBBBBBB CC"
Chunk 2: "BB CCCCCCCCCC DDDDDDDDDD" ← "BB CC" appears in both!
Chunk 3: "DD EEEEEEEEEE"
Why overlap?
- Sentence about "B and C" isn't lost at boundary
- Queries about "C" can match chunks 1 or 2
- Better retrieval for edge cases
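You can watch the overlap happen by running chunk_text on the toy string (min_length drops to 0 so the short demo chunks aren't filtered out; the exact cut points depend on the implementation above, so they differ slightly from the simplified picture):

from build_style_db import chunk_text

toy = "AAAAAAAAAA BBBBBBBBBB CCCCCCCCCC DDDDDDDDDD EEEEEEEEEE"
for c in chunk_text(toy, chunk_size=25, overlap=5, min_length=0):
    print(repr(c))
# 'AAAAAAAAAA BBBBBBBBBB CCC'
# 'B CCCCCCCCCC DDDDDDDDDD E'   ← shares text with both neighbors
# 'DDD EEEEEEEEEE'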
Testing Your Chunks
# test_chunking.py - Verify chunk quality
from build_style_db import chunk_text
# Load a sample text
with open("data/txt/sample_book.txt", encoding="utf-8") as f:
text = f.read()
# Chunk it
chunks = chunk_text(text, chunk_size=500, overlap=50)
# Analyze
print(f"Total chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
print(f"Min chunk size: {min(len(c) for c in chunks)} chars")
print(f"Max chunk size: {max(len(c) for c in chunks)} chars")
# Show a few samples
print("\n--- Sample Chunks ---")
for i in [0, len(chunks)//2, -1]:
print(f"\nChunk {i}:")
print(chunks[i][:200] + "...")
print(f"Length: {len(chunks[i])} chars")
Step 3: Generating Embeddings
What Are Embeddings?
Embeddings convert text into dense vectors that capture semantic meaning:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Text becomes a vector
text = "The warrior drew his sword"
vector = model.encode(text)
print(f"Text: '{text}'")
print(f"Vector shape: {vector.shape}") # (384,)
print(f"First 5 values: {vector[:5]}") # [0.23, -0.45, 0.67, ...]
Why Embeddings Work
Semantic similarity is captured in vector space:
"The warrior drew his sword" → [0.23, -0.45, 0.67, ...]
"The fighter unsheathed blade" → [0.21, -0.43, 0.65, ...] ← Similar!
"I like to eat pizza" → [-0.56, 0.32, -0.11, ...] ← Different!
  ↑
  |   [warrior]  [fighter]   ← near each other (sword/blade meanings)
  |
  |
  |                [pizza]   ← far away
  └───────────────────────→
      word embedding space
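You can verify the similarity claim directly; util.cos_sim ships with sentence-transformers, and the exact scores will vary by model version:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
vecs = model.encode([
    "The warrior drew his sword",
    "The fighter unsheathed his blade",
    "I like to eat pizza",
])
print(util.cos_sim(vecs[0], vecs[1]).item())  # high: same meaning
print(util.cos_sim(vecs[0], vecs[2]).item())  # low: unrelated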
Choosing an Embedding Model
| Model | Dimensions | Speed | Quality | Languages | Size |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | English | 80MB |
| all-MiniLM-L12-v2 | 384 | Fast | Better | English | 120MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | Fast | Good | 50+ | 420MB |
| all-mpnet-base-v2 | 768 | Medium | Best | English | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 768 | Medium | Best | 50+ | 970MB |
We use paraphrase-multilingual-MiniLM-L12-v2 because:
- Supports 50+ languages (Chinese, Vietnamese, etc.)
- Good balance of speed and quality
- 384 dimensions is efficient for storage
- Works well for style/semantic similarity
Embedding Implementation
# From build_style_db.py
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

from config import EMBED_MODEL
class EmbeddingGenerator:
def __init__(self, model_name: str = EMBED_MODEL):
print(f"[EMBED] Loading model: {model_name}")
self.model = SentenceTransformer(model_name)
print(f"[EMBED] Model loaded! Dimension: {self.model.get_sentence_embedding_dimension()}")
def embed_chunks(self, chunks: list[str], batch_size: int = 32) -> list:
"""
Generate embeddings for a list of text chunks.
Uses batching for efficiency on large datasets.
Shows progress bar for long operations.
"""
print(f"[EMBED] Generating embeddings for {len(chunks)} chunks...")
# For small datasets, encode all at once
if len(chunks) <= batch_size:
embeddings = self.model.encode(chunks, show_progress_bar=True)
return embeddings.tolist()
# For large datasets, batch for memory efficiency
all_embeddings = []
for i in tqdm(range(0, len(chunks), batch_size), desc="Embedding"):
batch = chunks[i:i + batch_size]
batch_embeddings = self.model.encode(batch)
all_embeddings.extend(batch_embeddings.tolist())
return all_embeddings
def embed_query(self, query: str) -> list:
"""Embed a single query string."""
return self.model.encode(query).tolist()
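Usage is two lines (the sample chunks here are invented):

embedder = EmbeddingGenerator()
vectors = embedder.embed_chunks([
    "Chen Wei gripped his sword tightly.",
    "Master Liu's training echoed in his mind.",
])
print(len(vectors), len(vectors[0]))  # 2 chunks, 384 dimensions each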
Embedding Performance Tips
# Speed comparison for 10,000 chunks:
# CPU (Intel i7)
# - batch_size=32: ~5 minutes
# - batch_size=64: ~4 minutes
# - batch_size=128: ~3.5 minutes (may OOM on 8GB RAM)
# GPU (NVIDIA RTX 3080)
# - batch_size=32: ~30 seconds
# - batch_size=64: ~20 seconds
# - batch_size=128: ~15 seconds
# Apple Silicon (M1/M2)
# - batch_size=32: ~2 minutes
# - batch_size=64: ~1.5 minutes
# Tip: For large collections, run overnight!
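sentence-transformers picks a device automatically, but if you want to pin it explicitly, here's a minimal sketch (assumes torch is installed, which it is as a sentence-transformers dependency):

import torch
from sentence_transformers import SentenceTransformer
from config import EMBED_MODEL

if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon
else:
    device = "cpu"

model = SentenceTransformer(EMBED_MODEL, device=device)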
Step 4: Storing in ChromaDB
Why ChromaDB?
| Feature | ChromaDB | Pinecone | Weaviate | Milvus |
|---|---|---|---|---|
| Deployment | Embedded | Cloud | Self-hosted | Self-hosted |
| Setup | pip install | Account required | Docker | Docker/K8s |
| Cost | Free | Free tier + paid | Free | Free |
| Scale | ~1M vectors | Billions | Billions | Billions |
| Best For | Learning, prototypes | Production | Production | Enterprise |
ChromaDB is perfect for learning because:
- No server to run
- Data persists to disk
- Simple Python API
- Works offline
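The whole lifecycle really is a few lines; here's a throwaway round-trip (not the project's code, just a demo with a dummy 384-dimensional vector):

import chromadb

client = chromadb.PersistentClient(path="demo_db")
col = client.get_or_create_collection("demo")
col.add(ids=["doc1"], documents=["hello world"], embeddings=[[0.1] * 384])
print(col.count())  # 1, and still 1 after a restart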
Creating the Database
# From build_style_db.py
import chromadb
from chromadb.config import Settings
def create_database(db_path: str, collection_name: str):
"""
Create or connect to a ChromaDB database.
Args:
db_path: Directory to store database files
collection_name: Name for the collection
Returns:
ChromaDB collection object
"""
# Create persistent client (data survives restarts)
client = chromadb.PersistentClient(
path=db_path,
settings=Settings(
anonymized_telemetry=False # Disable telemetry
)
)
# Delete existing collection if present (for clean rebuild)
try:
client.delete_collection(collection_name)
print(f"[DB] Deleted existing collection: {collection_name}")
except ValueError:
pass # Collection didn't exist
# Create new collection
collection = client.create_collection(
name=collection_name,
metadata={
"description": "Writing style samples for story generation",
"hnsw:space": "cosine" # Use cosine similarity
}
)
print(f"[DB] Created collection: {collection_name}")
return collection
Adding Documents
def add_to_database(
collection,
chunks: list[str],
embeddings: list[list[float]],
source_file: str
):
"""
Add chunks and embeddings to ChromaDB.
Args:
collection: ChromaDB collection
chunks: List of text chunks
embeddings: Corresponding embeddings
source_file: Name of source file (for metadata)
"""
# Generate unique IDs
# Format: source_chunknum (e.g., "fantasy_novel_0042")
base_name = Path(source_file).stem
ids = [f"{base_name}_{i:04d}" for i in range(len(chunks))]
# Create metadata for each chunk
metadatas = [
{
"source": source_file,
"chunk_index": i,
"char_count": len(chunk)
}
for i, chunk in enumerate(chunks)
]
# Add to collection
    # (a single add() call is fine at this scale; see the batching note after this function)
collection.add(
ids=ids,
documents=chunks,
embeddings=embeddings,
metadatas=metadatas
)
print(f"[DB] Added {len(chunks)} chunks from {source_file}")
The Complete Build Script
# build_style_db.py - Complete pipeline
from config import CHROMA_DIR, TXT_DIR, COLLECTION_NAME, RAG_CONFIG
def build_database():
"""Build the complete vector database from parsed texts."""
print("=" * 60)
print("BUILDING STYLE DATABASE")
print("=" * 60)
# Initialize components
embedder = EmbeddingGenerator()
collection = create_database(str(CHROMA_DIR), COLLECTION_NAME)
# Track statistics
total_chunks = 0
total_chars = 0
# Process each text file
txt_files = list(TXT_DIR.glob("*.txt"))
print(f"\nFound {len(txt_files)} text files to process\n")
for txt_file in txt_files:
print(f"[PROCESS] {txt_file.name}")
# Read text
text = txt_file.read_text(encoding='utf-8')
print(f" Characters: {len(text):,}")
# Chunk
chunks = chunk_text(
text,
chunk_size=RAG_CONFIG["chunk_size"],
overlap=RAG_CONFIG["chunk_overlap"],
min_length=RAG_CONFIG["min_chunk_length"]
)
print(f" Chunks: {len(chunks)}")
# Embed
embeddings = embedder.embed_chunks(chunks)
# Store
add_to_database(collection, chunks, embeddings, txt_file.name)
# Update stats
total_chunks += len(chunks)
total_chars += len(text)
print()
# Final summary
print("=" * 60)
print("BUILD COMPLETE")
print("=" * 60)
print(f"Total text processed: {total_chars:,} characters")
print(f"Total chunks created: {total_chunks:,}")
print(f"Database location: {CHROMA_DIR}")
print(f"Collection: {COLLECTION_NAME}")
print("=" * 60)
if __name__ == "__main__":
build_database()
Running the Build
python build_style_db.py
Expected Output:
============================================================
BUILDING STYLE DATABASE
============================================================
[EMBED] Loading model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
[EMBED] Model loaded! Dimension: 384
[DB] Created collection: story_styles
Found 5 text files to process
[PROCESS] fantasy_novel.txt
Characters: 245,832
Chunks: 523
Embedding: 100%|████████████████████| 17/17 [00:08<00:00]
[DB] Added 523 chunks from fantasy_novel.txt
[PROCESS] cultivation_story.txt
Characters: 523,109
Chunks: 1,112
Embedding: 100%|████████████████████| 35/35 [00:17<00:00]
[DB] Added 1,112 chunks from cultivation_story.txt
...
============================================================
BUILD COMPLETE
============================================================
Total text processed: 1,523,891 characters
Total chunks created: 3,247
Database location: chroma_db/
Collection: story_styles
============================================================
Step 5: Testing Retrieval
Basic Retrieval Test
# test_retrieval.py
import chromadb
from sentence_transformers import SentenceTransformer
from config import CHROMA_DIR, EMBED_MODEL, COLLECTION_NAME
# Connect to database
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)
# Load embedding model
embedder = SentenceTransformer(EMBED_MODEL)
# Test queries
test_queries = [
"A young warrior discovers a magical sword",
"The cultivation technique for immortality",
"A magic school hidden from ordinary people",
"A dark lord threatens the kingdom",
]
for query in test_queries:
print(f"\n{'='*60}")
print(f"Query: {query}")
print('='*60)
# Embed query
query_embedding = embedder.encode(query).tolist()
# Search
results = collection.query(
query_embeddings=[query_embedding],
n_results=3,
include=["documents", "distances", "metadatas"]
)
# Display results
for i, (doc, dist, meta) in enumerate(zip(
results['documents'][0],
results['distances'][0],
results['metadatas'][0]
)):
print(f"\n--- Result {i+1} (distance: {dist:.4f}) ---")
print(f"Source: {meta['source']}")
print(f"Preview: {doc[:200]}...")
Understanding Distance Scores
ChromaDB returns distance, not similarity. Lower = more similar.
Distance interpretation (cosine):
0.0 - 0.3 : Very relevant (almost identical meaning)
0.3 - 0.5 : Relevant (similar topic)
0.5 - 0.7 : Somewhat relevant (related)
0.7 - 1.0 : Not very relevant
1.0+ : Unrelated
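Because the collection was created with hnsw:space set to cosine, distance is just 1 minus cosine similarity, so converting is one line:

def to_similarity(distance: float) -> float:
    """Convert ChromaDB cosine distance to cosine similarity."""
    return 1.0 - distance

print(to_similarity(0.2341))  # 0.7659 → very relevant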
What Good Retrieval Looks Like
Query: "A young warrior discovers a magical sword"
--- Result 1 (distance: 0.2341) ---
Source: xianxia_novel.txt
Preview: "Chen Wei's fingers closed around the hilt, and ancient
power surged through his meridians. The sword had chosen him.
After ten thousand years, the Heavenly Demon Blade had found
a new master..."
--- Result 2 (distance: 0.2876) ---
Source: fantasy_epic.txt
Preview: "The blade sang as it left the stone, a sound that had
not been heard in seven generations. Young Thomas stared at
his own hands in disbelief. He had done what kings and warriors
could not..."
--- Result 3 (distance: 0.3102) ---
Source: cultivation_story.txt
Preview: "Master Liu held out the rusted sword. 'This weapon chose
your ancestor,' he said. 'Now it stirs again. Take it, if you
dare face the trials that come with such power...'"
All three results are about discovering magical swords!
Troubleshooting Common Issues
Issue 1: "Collection not found" Error
# Error: ValueError: Collection story_styles does not exist.
# Solution: Build the database first!
python build_style_db.py
# Or with reset flag:
python build_style_db.py --reset
Issue 2: Poor Retrieval Quality
Symptom: Retrieved passages don't match query
Causes and solutions:
1. Too few source documents
→ Add more ebooks to data/raw/
2. Chunks too small
→ Increase chunk_size in config.py
3. Wrong embedding model for language
→ Use multilingual model for non-English
4. Query too vague
→ Make queries more specific
Issue 3: Out of Memory During Embedding
Symptom: MemoryError or process killed
Solutions:
1. Reduce batch_size in embed_chunks()
2. Process fewer books at once
3. Use a smaller embedding model
4. Add more RAM (16GB+ recommended)
Issue 4: Slow Embedding Speed
Symptom: Takes hours to embed
Solutions:
1. Use GPU if available:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
2. Use smaller model:
EMBED_MODEL = "all-MiniLM-L6-v2" # 2x faster
3. Increase batch_size if you have enough RAM
Performance Optimization Tips
1. Optimal Chunk Sizes by Use Case
| Use Case | Chunk Size | Overlap | Why |
|---|---|---|---|
| Short stories | 300-400 | 30 | Tighter focus |
| Novels | 500-600 | 50 | Balance |
| Technical docs | 400-500 | 50 | Preserve sections |
| Poetry | 200-300 | 20 | Keep stanzas |
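Applying a row from this table is just a config edit, e.g. for short stories (the min_chunk_length value is a scaled-down version of the default 100, roughly 20% of chunk_size):

# config.py tweak for short stories (values from the table above)
RAG_CONFIG = {
    "chunk_size": 350,
    "chunk_overlap": 30,
    "min_chunk_length": 70,  # assumed: ~20% of chunk_size, like the default
}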
2. When to Rebuild vs. Add
# ADD new documents (fast):
# When: Adding a few new books
# How: Run build_style_db.py with --add flag (if implemented)
# Or manually add to existing collection
# REBUILD entire database:
# When: Changed chunk_size, changed embedding model, major changes
# How: Delete chroma_db/ and run build_style_db.py fresh
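The manual "add" path looks like this (a sketch that reuses the helpers above; get_collection connects without wiping existing data):

import chromadb
from config import CHROMA_DIR, COLLECTION_NAME

client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection(COLLECTION_NAME)  # existing chunks kept

# Then chunk and embed the new book exactly as in build_style_db.py:
# add_to_database(collection, chunks, embeddings, "new_book.txt")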
3. Database Size Estimates
Rule of thumb:
- 1 ebook ≈ 500 chunks
- 500 chunks × 384 dimensions × 4 bytes = ~750 KB embeddings
- Plus text storage ≈ 500 KB
- Total per book ≈ 1.5 MB
For 100 books: ~150 MB database
For 1000 books: ~1.5 GB database
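The embedding half of that arithmetic is exact (float32 means 4 bytes per dimension); the text half is rougher, since it includes IDs, metadata, and index overhead:

chunks_per_book = 500
dims, bytes_per_float = 384, 4
embed_kb = chunks_per_book * dims * bytes_per_float / 1024
print(f"{embed_kb:.0f} KB of embeddings per book")  # 750 KB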
Summary
In this article, we built:
| Component | Purpose | Key Files |
|---|---|---|
| Ebook Parser | Extract text from PDF, EPUB, MOBI, TXT | parse_ebooks.py |
| Text Chunker | Split into overlapping chunks | build_style_db.py |
| Embedding Generator | Convert text to vectors | build_style_db.py |
| Vector Database | Store and search embeddings | chroma_db/ |
Our RAG data pipeline is complete. In Part 3, we'll connect this to LLMs and generate stories that match our learned writing styles.
Quick Reference
# Parse ebooks
./run.sh parse
# or
python parse_ebooks.py
# Build vector database
./run.sh build
# or
python build_style_db.py
# Check status
./run.sh status
# Test retrieval
python -c "
from test_retrieval import test_query
test_query('warrior discovers sword')
"
Next Article: Part 3: Story Generation with RAG →
Previous: Part 1: Understanding RAG
Source Code: github.com/namtran/ai-rag-tutorial-story-generator