We've built our RAG pipeline (Part 1, Part 2). Now let's use it to generate stories.
In this final article, we'll:
- Connect to LLMs (local and cloud)
- Build augmented prompts
- Generate multi-chapter stories
- Maintain consistency across chapters
Source Code: github.com/namtran/ai-rag-tutorial-story-generator
The Generation Flow
┌─────────────────────────────────────────────────────────────┐
│ STORY GENERATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ User: "Write about a young cultivator finding a cave" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ 1. EMBED QUERY │ │
│ │ Convert prompt → vector │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ 2. RETRIEVE │ │
│ │ Find similar passages in ChromaDB │ │
│ │ Returns: 3-5 style samples │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ 3. AUGMENT PROMPT │ │
│ │ "Here are style examples: │ │
│ │ [retrieved passages] │ │
│ │ Now write: [user prompt]" │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ 4. GENERATE │ │
│ │ Send to LLM (Ollama/OpenAI) │ │
│ │ Generate story with learned style │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ OUTPUT: │ │
│ │ "Chen Wei pushed aside the waterfall, │ │
│ │ revealing a cave mouth wreathed in │ │
│ │ ancient qi. His cultivation base │ │
│ │ trembled as Heaven's Will..." │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Step 1: The Style Retriever
First, let's build a class to retrieve relevant passages:
# generate_with_style.py
from sentence_transformers import SentenceTransformer
import chromadb
from config import CHROMA_DIR, EMBED_MODEL, COLLECTION_NAME
class StyleRetriever:
"""Retrieve writing styles from ChromaDB"""
def __init__(self):
self.embedder = SentenceTransformer(EMBED_MODEL)
self.client = chromadb.PersistentClient(path=str(CHROMA_DIR))
self.collection = self.client.get_collection(COLLECTION_NAME)
print(f"[RAG] Connected: {self.collection.count()} chunks")
def retrieve(self, query: str, n_results: int = 3) -> list[str]:
"""Find passages with similar writing style"""
# Embed the query
query_embedding = self.embedder.encode([query])
# Search ChromaDB
results = self.collection.query(
query_embeddings=query_embedding.tolist(),
n_results=n_results
)
return results['documents'][0]
Usage:
retriever = StyleRetriever()
passages = retriever.retrieve("A young warrior discovers a magical sword")
for p in passages:
print(p[:200] + "...")
Step 2: LLM Backends
We support multiple LLM backends. Let's implement two: Ollama (local) and OpenAI (cloud).
Ollama Generator (Local)
class OllamaGenerator:
"""Generate text using local Ollama models"""
def __init__(self, model_name: str = "qwen2.5:7b"):
import requests
self.model_name = model_name
self.base_url = "http://localhost:11434"
# Verify connection
response = requests.get(f"{self.base_url}/api/tags")
if response.status_code != 200:
raise ConnectionError("Ollama not running. Start with: ollama serve")
print(f"[Ollama] Model: {model_name}")
def generate(self, prompt: str, max_tokens: int = 1000) -> str:
import requests
response = requests.post(
f"{self.base_url}/api/generate",
json={
"model": self.model_name,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.85,
"top_p": 0.92,
"num_predict": max_tokens
}
},
timeout=300
)
return response.json()["response"]
OpenAI Generator (Cloud)
class OpenAIGenerator:
"""Generate text using OpenAI API"""
def __init__(self, model: str = "gpt-4"):
from openai import OpenAI
import os
self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
self.model = model
print(f"[OpenAI] Model: {model}")
def generate(self, prompt: str, max_tokens: int = 1000) -> str:
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=0.85
)
return response.choices[0].message.content
Step 3: The Augmented Prompt
This is where the RAG magic happens: we inject the retrieved passages into the prompt as style examples:
STYLE_PROMPT_TEMPLATE = """Here are some example passages showing the writing style to follow:
{context}
---
Now write a NEW story passage in a similar style.
Story idea: {user_request}
Story:
"""
Why This Works
The LLM receives:
- Style examples - Shows how to write (vocabulary, pacing, tone)
- Clear instruction - Write something NEW, not copy
- User's idea - The creative direction
The model mimics the style while generating original content.
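To see exactly what the model receives, you can assemble and print the augmented prompt yourself. This is just a quick sanity check built from the classes above; the example query is arbitrary:
retriever = StyleRetriever()
request = "A young cultivator discovers a mysterious cave behind a waterfall"
samples = retriever.retrieve(request, n_results=3)
prompt = STYLE_PROMPT_TEMPLATE.format(
    context="\n\n---\n\n".join(samples),
    user_request=request
)
print(prompt[:500])  # the style examples come first, then the instruction and story idea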
Step 4: The Story Generator
Putting it all together:
class StoryGenerator:
"""Generate stories using RAG + LLM"""
def __init__(self, backend: str = "ollama", model: str = None):
# Initialize retriever
self.retriever = StyleRetriever()
# Initialize generator
if backend == "ollama":
self.generator = OllamaGenerator(model or "qwen2.5:7b")
elif backend == "openai":
self.generator = OpenAIGenerator(model or "gpt-4")
else:
raise ValueError(f"Unknown backend: {backend}")
def generate(self, user_request: str, n_style_samples: int = 3) -> str:
"""Generate a story with learned style"""
# Step 1: Retrieve style samples
print("[RAG] Retrieving style samples...")
style_samples = self.retriever.retrieve(
user_request,
n_results=n_style_samples
)
# Step 2: Build augmented prompt
context = "\n\n---\n\n".join(style_samples)
prompt = STYLE_PROMPT_TEMPLATE.format(
context=context,
user_request=user_request
)
# Step 3: Generate
print("[LLM] Generating story...")
story = self.generator.generate(prompt)
return story
Usage
generator = StoryGenerator(backend="ollama", model="qwen2.5:7b")
story = generator.generate(
"A young cultivator discovers a mysterious cave behind a waterfall"
)
print(story)
Output:
Chen Wei had wandered these mountains for three days, following the
whispers of his jade pendant. The ancient artifact had belonged to
his master, and now it pulsed with an urgency he couldn't ignore.
The waterfall appeared without warning—a curtain of silver crashing
into a pool of impossible clarity. But it wasn't the water that made
his cultivation base tremble. It was what lay behind it.
"Impossible," he breathed.
The cave mouth gaped like the maw of a sleeping dragon, and from within
emanated a pressure that spoke of ages long forgotten. Qi so dense it
was almost visible swirled at the entrance, forming patterns that hurt
to look upon.
His pendant grew warm against his chest. A confirmation. A warning.
Chen Wei stepped through the waterfall.
What he found inside would change the course of his cultivation forever...
Multi-Chapter Generation
For longer stories, we need to maintain consistency across chapters.
The Challenge
Chapter 1: "Chen Wei has blue eyes"
Chapter 5: "Chen Wei's brown eyes sparkled" ← Inconsistency!
The Solution: Summaries
After each chapter, we generate a summary. This summary is included in the prompt for subsequent chapters.
┌────────────────────────────────────────────────────────────────┐
│ MULTI-CHAPTER GENERATION │
├────────────────────────────────────────────────────────────────┤
│ │
│ 1. Generate Story Outline │
│ ┌────────────────────────────────────────┐ │
│ │ Chapter 1: The Discovery │ │
│ │ Chapter 2: The Ancient Inheritance │ │
│ │ Chapter 3: First Breakthrough │ │
│ │ ... │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. Generate Chapter 1 │
│ ┌────────────────────────────────────────┐ │
│ │ Context: [Style samples from RAG] │ │
│ │ Outline: Chapter 1 summary │ │
│ │ → Generate full chapter │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 3. Summarize Chapter 1 │
│ "Chen Wei discovered a cave containing an │
│ ancient cultivator's inheritance..." │
│ │ │
│ ▼ │
│ 4. Generate Chapter 2 │
│ ┌────────────────────────────────────────┐ │
│ │ Context: [Style samples from RAG] │ │
│ │ Previous: [Chapter 1 summary] │ ← Key! │
│ │ Outline: Chapter 2 summary │ │
│ │ → Generate full chapter │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 5. Repeat for all chapters... │
│ │
└────────────────────────────────────────────────────────────────┘
Implementation
# generate_long_story.py (simplified)
class LongStoryGenerator:
def __init__(self, backend="ollama"):
self.base_generator = StoryGenerator(backend=backend)
self.chapter_summaries = []
def generate_outline(self, premise: str, num_chapters: int = 10) -> list:
"""Generate a story outline"""
prompt = f"""Create a {num_chapters}-chapter story outline for:
{premise}
For each chapter provide:
- Title
- Summary (2-3 sentences)
- Key events
"""
outline_text = self.base_generator.generator.generate(prompt)
return self._parse_outline(outline_text)
def generate_chapter(self, chapter_num: int, chapter_outline: dict) -> str:
"""Generate a single chapter with context"""
# Build previous summary
previous = "\n".join([
f"Chapter {i+1}: {s}"
for i, s in enumerate(self.chapter_summaries)
])
prompt = f"""
Previous chapters summary:
{previous if previous else "This is the beginning of the story."}
---
Write Chapter {chapter_num}: {chapter_outline['title']}
Chapter outline: {chapter_outline['summary']}
Write 2500-3500 words. Include dialogue, descriptions, and character thoughts.
"""
# Get style samples based on chapter content
style_samples = self.base_generator.retriever.retrieve(
chapter_outline['summary'],
n_results=5
)
full_prompt = f"""Reference writing style:
{chr(10).join(style_samples)}
---
{prompt}
"""
return self.base_generator.generator.generate(
full_prompt,
max_tokens=4000
)
def summarize_chapter(self, chapter_content: str) -> str:
"""Create a summary for context in next chapters"""
prompt = f"""Summarize this chapter in 100-150 words:
{chapter_content}
Focus on key events and character changes.
"""
return self.base_generator.generator.generate(prompt, max_tokens=200)
def generate_full_story(self, premise: str, num_chapters: int = 10):
"""Generate a complete multi-chapter story"""
# Step 1: Generate outline
print("Generating outline...")
outline = self.generate_outline(premise, num_chapters)
# Step 2: Generate each chapter
chapters = []
for i, chapter_outline in enumerate(outline):
print(f"Generating Chapter {i+1}/{num_chapters}...")
# Generate chapter
chapter = self.generate_chapter(i+1, chapter_outline)
chapters.append(chapter)
# Summarize for next chapter's context
summary = self.summarize_chapter(chapter)
self.chapter_summaries.append(summary)
# Save progress
self._save_chapter(i+1, chapter)
return chapters
Usage
# Interactive mode
python generate_long_story.py --interactive
# Direct generation
python generate_long_story.py \
--premise "A young cultivator discovers an ancient inheritance" \
--chapters 10 \
--genre "Xianxia"
# Resume interrupted story
python generate_long_story.py --resume story_20240101_120000
Configuration Options
Generation Settings
# config.py
GENERATION_CONFIG = {
"max_new_tokens": 1000, # Short stories
"temperature": 0.85, # Creativity level
"top_p": 0.92, # Sampling diversity
"repetition_penalty": 1.15 # Reduce repetition
}
CHAPTER_GENERATION_CONFIG = {
"max_new_tokens": 4000, # Full chapters (~3000 words)
"temperature": 0.85,
"repetition_penalty": 1.18 # Higher for long text
}
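Note that these keys don't map one-to-one onto every backend's parameter names. As a small illustrative sketch (this helper is not part of the repository), here is how the config could be translated into the options payload Ollama expects, where the equivalents are num_predict and repeat_penalty:
def to_ollama_options(cfg: dict) -> dict:
    # Rename our config keys to the option names Ollama's API uses
    return {
        "num_predict": cfg["max_new_tokens"],
        "temperature": cfg["temperature"],
        "top_p": cfg.get("top_p", 0.9),              # CHAPTER_GENERATION_CONFIG omits top_p
        "repeat_penalty": cfg["repetition_penalty"],
    }

# e.g. inside OllamaGenerator.generate:
#   "options": to_ollama_options(GENERATION_CONFIG)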
Model Recommendations
| Model | Best For | Notes |
|---|---|---|
| qwen2.5:7b | Multilingual stories | Best for Chinese/English |
| llama3.1:8b | English stories | Fast, good quality |
| gemma2:9b | Balanced | Good all-around |
| gpt-4 | Highest quality | Cloud, costs money |
| claude-3-sonnet | Creative writing | Excellent prose |
Quick Commands
# Generate short story (CLI)
./run.sh generate
# Generate full chapter
./run.sh chapter
# Multi-chapter story (interactive)
./run.sh story
# List all generated stories
./run.sh stories
Example Output
Here's a sample from a Xianxia story generated by the system:
Chapter 1: The Sealed Cave
The waterfall roared like a caged beast, but Chen Wei barely heard it. His attention was fixed on the jade pendant hanging from his neck—the last gift from his dying master.
"Beyond the Crying Dragon Falls," Master Liu had whispered with his final breath, "lies the inheritance I could never claim. Perhaps you, with your crippled spiritual roots, will succeed where I failed."
Chen Wei had thought the old man delirious. But now, standing before the hundred-meter cascade, he felt it. A resonance. The pendant pulsed with warmth, responding to something hidden behind the wall of water.
He stepped through.
The cave beyond defied mortal understanding. Luminescent moss clung to walls carved with formations so complex they made his eyes water. At the center, upon a throne of crystallized qi, sat a skeleton in meditation pose.
"You have come," a voice echoed in his mind. "I have waited nine thousand years for one with spiritual roots damaged enough to contain my inheritance. Normal cultivators would explode from the power. But you... you are broken in exactly the right way."
Chen Wei's crippled dantian, the shame that had haunted him for eighteen years, suddenly felt less like a curse and more like a key.
"Who are you?" he asked the skeleton.
"I am what remains of the Heavenly Demon Emperor. And you, boy, are about to become something the cultivation world has not seen in ten thousand years."
The skeleton's empty eye sockets began to glow...
Deployment Options
Our tutorial runs everything locally on your machine. Let's explore how this works and how you can extend it to a server.
Current Architecture: Local-First
┌─────────────────────────────────────────────────────────────┐
│ YOUR LOCAL MACHINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Ollama │ │ ChromaDB │ │ Python │ │
│ │ (LLM API) │◀───▶│ (Vector DB) │◀──▶│ Scripts │ │
│ │ │ │ │ │ │ │
│ │ Port 11434 │ │ ./chroma_db │ │ Flask App │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ GPU Inference Port 5000 │
│ (if available) │
│ │
└─────────────────────────────────────────────────────────────┘
Why Local-First?
- Privacy: Your books and stories never leave your machine
- Free: No API costs for generation
- Offline: Works without internet connection
- Learning: You understand every component
Running Locally with Ollama
Ollama makes it easy to run LLMs locally:
# Install Ollama (macOS)
brew install ollama
# Start the server
ollama serve
# Pull a model
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
# Check available models
ollama list
Hardware Requirements:
| Model Size | RAM Needed | GPU VRAM | Speed |
|---|---|---|---|
| 3B params | 8GB | 4GB | Fast |
| 7B params | 16GB | 8GB | Good |
| 14B params | 32GB | 16GB | Slower |
| 70B params | 64GB+ | 40GB+ | Slow |
For story generation, 7B models like qwen2.5:7b offer the best balance of quality and speed.
Local ChromaDB
ChromaDB runs as an embedded database by default:
# Embedded mode (default) - no server needed
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
# Data stored in ./chroma_db/ directory
# ~100MB for 10,000 chunks
This is perfect for local development and small-to-medium datasets.
Extending to Server Deployment
Want to deploy for multiple users or remote access? Here's how:
Option 1: Docker Compose (Simplest)
# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
chromadb:
image: chromadb/chroma
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
app:
build: .
ports:
- "5000:5000"
environment:
- OLLAMA_HOST=http://ollama:11434
- CHROMA_HOST=http://chromadb:8000
depends_on:
- ollama
- chromadb
volumes:
ollama_data:
chroma_data:
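The compose file passes OLLAMA_HOST and CHROMA_HOST to the app, so the tutorial code should read those instead of its hard-coded localhost defaults. A minimal sketch of that switch (the helper name is mine; chromadb.HttpClient is the client for a Chroma server):
import os
import chromadb

def make_chroma_client():
    chroma_host = os.getenv("CHROMA_HOST")  # e.g. "http://chromadb:8000"
    if chroma_host:
        hostname = chroma_host.split("://")[-1].split(":")[0]
        port = int(chroma_host.rsplit(":", 1)[-1])
        return chromadb.HttpClient(host=hostname, port=port)
    # Fall back to the embedded database used throughout the tutorial
    return chromadb.PersistentClient(path="./chroma_db")

OLLAMA_BASE_URL = os.getenv("OLLAMA_HOST", "http://localhost:11434")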
Option 2: Cloud API Backend
Switch from local Ollama to cloud APIs:
# config.py
# Option A: Use Ollama (local)
LLM_BACKEND = "ollama"
OLLAMA_MODEL = "qwen2.5:7b"
# Option B: Use OpenAI
LLM_BACKEND = "openai"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = "gpt-4"
# Option C: Use Anthropic Claude
LLM_BACKEND = "anthropic"
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
ANTHROPIC_MODEL = "claude-3-sonnet-20240229"
# Option D: Use Google Gemini
LLM_BACKEND = "gemini"
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GEMINI_MODEL = "gemini-pro"
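We only implemented the Ollama and OpenAI generators above; the other backends can follow the same generate(prompt, max_tokens) interface. A sketch of what an Anthropic backend might look like, assuming the official anthropic package:
class AnthropicGenerator:
    """Generate text using the Anthropic API (same interface as the other backends)"""
    def __init__(self, model: str = "claude-3-sonnet-20240229"):
        from anthropic import Anthropic
        import os
        self.client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.model = model
        print(f"[Anthropic] Model: {model}")

    def generate(self, prompt: str, max_tokens: int = 1000) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=0.85,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text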
Option 3: Managed Vector Database
For production, consider managed vector databases:
# Pinecone (managed)
import pinecone
pinecone.init(api_key="YOUR_KEY", environment="us-east-1")
index = pinecone.Index("story-styles")
# Weaviate (self-hosted or cloud)
import weaviate
client = weaviate.Client(url="https://your-cluster.weaviate.network")
# Qdrant (self-hosted or cloud)
from qdrant_client import QdrantClient
client = QdrantClient(url="https://your-qdrant-instance.com")
Architecture Comparison
┌────────────────────────────────────────────────────────────────┐
│ LOCAL (Tutorial) │
├────────────────────────────────────────────────────────────────┤
│ User → Python App → ChromaDB (file) → Ollama (local) │
│ │
│ Pros: Free, private, offline │
│ Cons: Limited to your hardware │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ SERVER (Docker) │
├────────────────────────────────────────────────────────────────┤
│ Users → Flask App → ChromaDB Server → Ollama (GPU server) │
│ │
│ Pros: Multiple users, better GPU │
│ Cons: Server costs, network latency │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ CLOUD (Production) │
├────────────────────────────────────────────────────────────────┤
│ Users → Web App → Pinecone (managed) → OpenAI/Claude API │
│ │
│ Pros: Scalable, no maintenance, best models │
│ Cons: API costs, data leaves your control │
└────────────────────────────────────────────────────────────────┘
Summary
In this series, we built a complete RAG-powered story generator:
Part 1: Understanding RAG
- What RAG is and why it matters
- Architecture overview
- Key components (embeddings, vector DB, retrieval)
- Comparison with alternatives (fine-tuning, prompt engineering)
Part 2: Building the Pipeline
- Parsing ebooks (PDF, EPUB, MOBI)
- Text chunking strategies
- Generating embeddings
- Storing in ChromaDB
Part 3: Story Generation
- Connecting to LLMs (Ollama, OpenAI)
- Building augmented prompts
- Multi-chapter generation with summaries
- Deployment options (local → server → cloud)
What's Next? Advanced RAG Features
Now that you understand the basics, here are the next features to learn and implement:
1. Hybrid Search (BM25 + Semantic)
Problem: Semantic search sometimes misses exact keyword matches.
Solution: Combine keyword search (BM25) with vector search:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
from config import EMBED_MODEL

class HybridRetriever:
    # Sketch: scores are computed over the full chunk list, which is fine for tutorial-sized corpora
    def __init__(self, chunks, embeddings):
        # BM25 index over the raw chunk text for keyword matching
        self.chunks = chunks
        self.bm25 = BM25Okapi([c.split() for c in chunks])
        # Pre-computed chunk embeddings (same order as chunks) for semantic scoring
        self.embeddings = np.array(embeddings)
        self.embedder = SentenceTransformer(EMBED_MODEL)

    def search(self, query, n_results=5, alpha=0.5):
        # Keyword scores for every chunk
        bm25_scores = np.array(self.bm25.get_scores(query.split()))
        # Semantic scores: cosine similarity between the query and each chunk embedding
        q = self.embedder.encode([query])[0]
        vector_scores = (self.embeddings @ q) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(q) + 1e-9
        )
        # Normalize both to [0, 1], then combine with a weighted average
        def norm(x):
            return (x - x.min()) / (x.max() - x.min() + 1e-9)
        final_scores = alpha * norm(bm25_scores) + (1 - alpha) * norm(vector_scores)
        top = np.argsort(final_scores)[::-1][:n_results]
        return [self.chunks[i] for i in top]
When to use: When users search for specific character names, locations, or technical terms.
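If your chunks and embeddings already live in ChromaDB (as built in Part 2), one possible way to feed them into the hybrid retriever is to pull both back out with collection.get; the query and alpha value below are just examples:
retriever = StyleRetriever()
data = retriever.collection.get(include=["documents", "embeddings"])
hybrid = HybridRetriever(data["documents"], data["embeddings"])
passages = hybrid.search("Crying Dragon Falls inheritance", alpha=0.6)  # weight keywords slightly higher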
2. Query Expansion
Problem: User query may not match document vocabulary.
Solution: Expand query with synonyms or LLM-generated variations:
def expand_query(self, query: str) -> list[str]:
    """Generate query variations with the LLM"""
    prompt = f"""Generate 3 alternative phrasings for this search:
"{query}"
List only the alternatives, one per line."""
    variations = self.llm.generate(prompt)
    # Keep the original query plus any non-empty variations
    return [query] + [v.strip() for v in variations.split('\n') if v.strip()]
3. Cross-Encoder Reranking
Problem: Bi-encoder embeddings miss nuanced relevance.
Solution: Use a cross-encoder to rerank top results:
from sentence_transformers import CrossEncoder
class RerankedRetriever:
def __init__(self):
self.retriever = StyleRetriever()
self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def retrieve(self, query: str, n_results: int = 5):
# First pass: get top 20 candidates
candidates = self.retriever.retrieve(query, n_results=20)
# Second pass: rerank with cross-encoder
pairs = [[query, doc] for doc in candidates]
scores = self.reranker.predict(pairs)
# Return top N after reranking
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return [doc for doc, score in ranked[:n_results]]
Why it works: a cross-encoder scores the query and document together, so it captures their relationship more precisely than independent bi-encoder embeddings.
4. Semantic Chunking
Problem: Fixed-size chunks cut sentences and paragraphs awkwardly.
Solution: Chunk by semantic boundaries:
def semantic_chunk(text: str, max_size: int = 1000):
"""Split at paragraph/scene boundaries"""
# Split by paragraph
paragraphs = text.split('\n\n')
chunks = []
current_chunk = ""
for para in paragraphs:
# Check if adding this paragraph exceeds limit
if len(current_chunk) + len(para) > max_size:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = para
else:
current_chunk += "\n\n" + para
if current_chunk:
chunks.append(current_chunk.strip())
return chunks
Advanced: Use an LLM to identify natural break points (scene changes, topic shifts).
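A rough sketch of that idea, reusing one of the generator classes above (the prompt wording and index parsing are illustrative assumptions, not tested repository code):
def llm_semantic_chunk(text: str, generator) -> list[str]:
    # Ask the model where scenes/topics change, then split on those paragraphs
    paragraphs = text.split('\n\n')
    numbered = "\n".join(f"{i}: {p[:120]}" for i, p in enumerate(paragraphs))
    prompt = (
        "Below are numbered paragraph openings from a story.\n"
        "List the indices where a NEW scene or topic begins, comma-separated.\n\n"
        f"{numbered}\n\nIndices:"
    )
    reply = generator.generate(prompt, max_tokens=100)
    breaks = sorted({int(t) for t in reply.replace(",", " ").split() if t.isdigit()})
    breaks = [b for b in breaks if 0 < b < len(paragraphs)]
    chunks, start = [], 0
    for b in breaks + [len(paragraphs)]:
        if b > start:
            chunks.append("\n\n".join(paragraphs[start:b]).strip())
            start = b
    return chunks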
5. Metadata Filtering
Problem: All chunks are treated equally regardless of source.
Solution: Add and filter by metadata:
# When building the database
collection.add(
documents=[chunk],
embeddings=[embedding],
metadatas=[{
"source_file": "cultivation_novel_1.txt",
"author": "Unknown",
"genre": "xianxia",
"chapter": 5,
"word_count": len(chunk.split())
}],
ids=[chunk_id]
)
# When querying (ChromaDB expects $and when combining multiple filter conditions)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"$and": [
        {"genre": "xianxia"},
        {"word_count": {"$gt": 200}}
    ]}
)
6. Caching for Performance
Problem: Re-embedding the same queries wastes compute.
Solution: Cache embeddings and results:
import hashlib
class CachedRetriever:
def __init__(self):
self.retriever = StyleRetriever()
self._cache = {}
def retrieve(self, query: str, n_results: int = 5):
# Create cache key
cache_key = hashlib.md5(f"{query}:{n_results}".encode()).hexdigest()
if cache_key in self._cache:
return self._cache[cache_key]
results = self.retriever.retrieve(query, n_results)
self._cache[cache_key] = results
return results
7. Evaluation Metrics
Problem: How do you know if retrieval is actually working?
Solution: Implement evaluation metrics:
def evaluate_retrieval(test_queries: list, ground_truth: dict):
    """
    Measure retrieval quality.
    Args:
        test_queries: list of test queries
        ground_truth: {query: list of relevant passages (the chunk texts)}
    """
    retriever = StyleRetriever()
    metrics = {
        "precision@5": [],
        "recall@5": [],
        "mrr": []  # Mean Reciprocal Rank
    }
    for query in test_queries:
        # StyleRetriever.retrieve returns the passage texts themselves
        results = retriever.retrieve(query, n_results=5)
        relevant = set(ground_truth[query])
        # Precision@5 and Recall@5
        hits = len(set(results) & relevant)
        metrics["precision@5"].append(hits / 5)
        metrics["recall@5"].append(hits / len(relevant))
        # MRR: reciprocal rank of the first relevant result
        for i, doc in enumerate(results):
            if doc in relevant:
                metrics["mrr"].append(1 / (i + 1))
                break
        else:
            metrics["mrr"].append(0)
    return {k: sum(v) / len(v) for k, v in metrics.items()}
8. Document Hierarchy
Problem: Losing context about where a chunk came from.
Solution: Store hierarchical context:
Book → Chapter → Section → Paragraph → Chunk
# Store parent context with each chunk
metadata = {
"book_title": "Cultivation Journey",
"chapter_number": 5,
"chapter_title": "The Hidden Inheritance",
"section": "discovery",
"parent_chunk_id": "chunk_004", # Previous chunk
"child_chunk_ids": ["chunk_006", "chunk_007"]
}
When generating, you can include parent context for better coherence.
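One way to use the hierarchy at generation time is to fetch a retrieved chunk's parent by id and prepend it. A minimal sketch, assuming the metadata above is stored with each chunk (list-valued fields like child_chunk_ids would need to be flattened to strings, since ChromaDB metadata values are scalars):
def with_parent_context(collection, chunk_id: str, metadata: dict) -> str:
    # Fetch the chunk and its parent in one call, then stitch them in reading order
    wanted = [metadata["parent_chunk_id"], chunk_id]
    found = collection.get(ids=wanted, include=["documents"])
    by_id = dict(zip(found["ids"], found["documents"]))
    return "\n\n".join(by_id[i] for i in wanted if i in by_id)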
Learning Path
Here's a suggested order to learn these features:
1. Metadata Filtering (Easy)
└── Add author/genre filters to your queries
2. Caching (Easy)
└── Speed up repeated queries
3. Semantic Chunking (Medium)
└── Better chunk quality = better retrieval
4. Hybrid Search (Medium)
└── Combine the best of keyword + semantic
5. Cross-Encoder Reranking (Medium)
└── Significantly improve relevance
6. Query Expansion (Medium)
└── Handle query-document vocabulary mismatch
7. Evaluation Metrics (Advanced)
└── Measure and improve systematically
8. Document Hierarchy (Advanced)
└── Handle complex document structures
Production Considerations
If you're taking this to production, also consider:
| Concern | Solution |
|---|---|
| Scale | Distributed vector DB (Pinecone, Weaviate) |
| Latency | Pre-compute embeddings, cache aggressively |
| Cost | Smaller models, batched requests |
| Quality | Evaluation pipeline, A/B testing |
| Security | Input sanitization, output filtering |
| Monitoring | Log queries, track retrieval quality |
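For the monitoring row, a lightweight option is to wrap the retriever and log every query, its latency, and a preview of what came back. This is a hedged sketch, not production code:
import json, time, logging

logging.basicConfig(filename="retrieval.log", level=logging.INFO)

class LoggedRetriever:
    def __init__(self, retriever):
        self.retriever = retriever

    def retrieve(self, query: str, n_results: int = 3):
        start = time.time()
        results = self.retriever.retrieve(query, n_results)
        logging.info(json.dumps({
            "query": query,
            "n_results": n_results,
            "latency_ms": round((time.time() - start) * 1000, 1),
            "previews": [r[:80] for r in results],  # first 80 chars of each hit
        }))
        return results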
Resources
- Source Code: github.com/namtran/ai-rag-tutorial-story-generator
- ChromaDB Docs: docs.trychroma.com
- Sentence Transformers: sbert.net
- Ollama: ollama.ai
Previous Articles:
- Part 1: Understanding RAG
- Part 2: Building the Pipeline
Thanks for following this series!