Build Your Own AI Story Generator with RAG - Part 3: Generating Stories

We've built our RAG pipeline (Part 1, Part 2). Now let's use it to generate stories.

In this final article, we'll:

  • Connect to LLMs (local and cloud)
  • Build augmented prompts
  • Generate multi-chapter stories
  • Maintain consistency across chapters

Source Code: github.com/namtran/ai-rag-tutorial-story-generator


The Generation Flow

┌─────────────────────────────────────────────────────────────┐
│                   STORY GENERATION                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User: "Write about a young cultivator finding a cave"     │
│                         │                                   │
│                         ▼                                   │
│  ┌─────────────────────────────────────────┐               │
│  │  1. EMBED QUERY                         │               │
│  │     Convert prompt → vector             │               │
│  └─────────────────────────────────────────┘               │
│                         │                                   │
│                         ▼                                   │
│  ┌─────────────────────────────────────────┐               │
│  │  2. RETRIEVE                            │               │
│  │     Find similar passages in ChromaDB   │               │
│  │     Returns: 3-5 style samples          │               │
│  └─────────────────────────────────────────┘               │
│                         │                                   │
│                         ▼                                   │
│  ┌─────────────────────────────────────────┐               │
│  │  3. AUGMENT PROMPT                      │               │
│  │     "Here are style examples:           │               │
│  │      [retrieved passages]               │               │
│  │      Now write: [user prompt]"          │               │
│  └─────────────────────────────────────────┘               │
│                         │                                   │
│                         ▼                                   │
│  ┌─────────────────────────────────────────┐               │
│  │  4. GENERATE                            │               │
│  │     Send to LLM (Ollama/OpenAI)         │               │
│  │     Generate story with learned style   │               │
│  └─────────────────────────────────────────┘               │
│                         │                                   │
│                         ▼                                   │
│  ┌─────────────────────────────────────────┐               │
│  │  OUTPUT:                                │               │
│  │  "Chen Wei pushed aside the waterfall,  │               │
│  │   revealing a cave mouth wreathed in    │               │
│  │   ancient qi. His cultivation base      │               │
│  │   trembled as Heaven's Will..."         │               │
│  └─────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Step 1: The Style Retriever

First, let's build a class to retrieve relevant passages:

# generate_with_style.py

from sentence_transformers import SentenceTransformer
import chromadb
from config import CHROMA_DIR, EMBED_MODEL, COLLECTION_NAME

class StyleRetriever:
    """Retrieve writing styles from ChromaDB"""

    def __init__(self):
        self.embedder = SentenceTransformer(EMBED_MODEL)
        self.client = chromadb.PersistentClient(path=str(CHROMA_DIR))
        self.collection = self.client.get_collection(COLLECTION_NAME)

        print(f"[RAG] Connected: {self.collection.count()} chunks")

    def retrieve(self, query: str, n_results: int = 3) -> list[str]:
        """Find passages with similar writing style"""
        # Embed the query
        query_embedding = self.embedder.encode([query])

        # Search ChromaDB
        results = self.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=n_results
        )

        return results['documents'][0]

Usage:

retriever = StyleRetriever()
passages = retriever.retrieve("A young warrior discovers a magical sword")

for p in passages:
    print(p[:200] + "...")

Step 2: LLM Backends

We support multiple LLM backends. Let's implement two: Ollama (local) and OpenAI (cloud).

Ollama Generator (Local)

class OllamaGenerator:
    """Generate text using local Ollama models"""

    def __init__(self, model_name: str = "qwen2.5:7b"):
        import requests

        self.model_name = model_name
        self.base_url = "http://localhost:11434"

        # Verify connection
        response = requests.get(f"{self.base_url}/api/tags")
        if response.status_code != 200:
            raise ConnectionError("Ollama not running. Start with: ollama serve")

        print(f"[Ollama] Model: {model_name}")

    def generate(self, prompt: str, max_tokens: int = 1000) -> str:
        import requests

        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model_name,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.85,
                    "top_p": 0.92,
                    "num_predict": max_tokens
                }
            },
            timeout=300
        )

        return response.json()["response"]

OpenAI Generator (Cloud)

class OpenAIGenerator:
    """Generate text using OpenAI API"""

    def __init__(self, model: str = "gpt-4"):
        from openai import OpenAI
        import os

        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = model
        print(f"[OpenAI] Model: {model}")

    def generate(self, prompt: str, max_tokens: int = 1000) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.85
        )

        return response.choices[0].message.content

Step 3: The Augmented Prompt

This is where the RAG magic happens: we inject the retrieved passages into the prompt as style examples:

STYLE_PROMPT_TEMPLATE = """Here are some example passages showing the writing style to follow:

{context}

---

Now write a NEW story passage in a similar style.
Story idea: {user_request}

Story:
"""

Why This Works

The LLM receives:

  1. Style examples - Shows how to write (vocabulary, pacing, tone)
  2. Clear instruction - Write something NEW, not a copy
  3. User's idea - The creative direction

The model mimics the style while generating original content.
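
Before wiring this into a class in Step 4, here is what the assembled prompt looks like in practice, reusing STYLE_PROMPT_TEMPLATE and StyleRetriever from above (the request string is just an example):

retriever = StyleRetriever()
user_request = "A young cultivator discovers a mysterious cave behind a waterfall"

# 1. Retrieve style examples for this request
style_samples = retriever.retrieve(user_request, n_results=3)

# 2. Inject them, plus the user's idea, into the template
prompt = STYLE_PROMPT_TEMPLATE.format(
    context="\n\n---\n\n".join(style_samples),
    user_request=user_request
)

print(prompt[:500])  # passages in your corpus's style, followed by the instruction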


Step 4: The Story Generator

Putting it all together:

class StoryGenerator:
    """Generate stories using RAG + LLM"""

    def __init__(self, backend: str = "ollama", model: str = None):
        # Initialize retriever
        self.retriever = StyleRetriever()

        # Initialize generator
        if backend == "ollama":
            self.generator = OllamaGenerator(model or "qwen2.5:7b")
        elif backend == "openai":
            self.generator = OpenAIGenerator(model or "gpt-4")
        else:
            raise ValueError(f"Unknown backend: {backend}")

    def generate(self, user_request: str, n_style_samples: int = 3) -> str:
        """Generate a story with learned style"""

        # Step 1: Retrieve style samples
        print("[RAG] Retrieving style samples...")
        style_samples = self.retriever.retrieve(
            user_request,
            n_results=n_style_samples
        )

        # Step 2: Build augmented prompt
        context = "\n\n---\n\n".join(style_samples)
        prompt = STYLE_PROMPT_TEMPLATE.format(
            context=context,
            user_request=user_request
        )

        # Step 3: Generate
        print("[LLM] Generating story...")
        story = self.generator.generate(prompt)

        return story

Usage

generator = StoryGenerator(backend="ollama", model="qwen2.5:7b")

story = generator.generate(
    "A young cultivator discovers a mysterious cave behind a waterfall"
)

print(story)

Output:

Chen Wei had wandered these mountains for three days, following the
whispers of his jade pendant. The ancient artifact had belonged to
his master, and now it pulsed with an urgency he couldn't ignore.

The waterfall appeared without warning—a curtain of silver crashing
into a pool of impossible clarity. But it wasn't the water that made
his cultivation base tremble. It was what lay behind it.

"Impossible," he breathed.

The cave mouth gaped like the maw of a sleeping dragon, and from within
emanated a pressure that spoke of ages long forgotten. Qi so dense it
was almost visible swirled at the entrance, forming patterns that hurt
to look upon.

His pendant grew warm against his chest. A confirmation. A warning.

Chen Wei stepped through the waterfall.

What he found inside would change the course of his cultivation forever...

Multi-Chapter Generation

For longer stories, we need to maintain consistency across chapters.

The Challenge

Chapter 1: "Chen Wei has blue eyes"
Chapter 5: "Chen Wei's brown eyes sparkled"  ← Inconsistency!

The Solution: Summaries

After each chapter, we generate a summary. This summary is included in the prompt for subsequent chapters.

┌────────────────────────────────────────────────────────────────┐
│              MULTI-CHAPTER GENERATION                          │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. Generate Story Outline                                     │
│     ┌────────────────────────────────────────┐                │
│     │ Chapter 1: The Discovery               │                │
│     │ Chapter 2: The Ancient Inheritance     │                │
│     │ Chapter 3: First Breakthrough          │                │
│     │ ...                                    │                │
│     └────────────────────────────────────────┘                │
│                         │                                      │
│                         ▼                                      │
│  2. Generate Chapter 1                                         │
│     ┌────────────────────────────────────────┐                │
│     │ Context: [Style samples from RAG]      │                │
│     │ Outline: Chapter 1 summary             │                │
│     │ → Generate full chapter                │                │
│     └────────────────────────────────────────┘                │
│                         │                                      │
│                         ▼                                      │
│  3. Summarize Chapter 1                                        │
│     "Chen Wei discovered a cave containing an                  │
│      ancient cultivator's inheritance..."                      │
│                         │                                      │
│                         ▼                                      │
│  4. Generate Chapter 2                                         │
│     ┌────────────────────────────────────────┐                │
│     │ Context: [Style samples from RAG]      │                │
│     │ Previous: [Chapter 1 summary]          │  ← Key!        │
│     │ Outline: Chapter 2 summary             │                │
│     │ → Generate full chapter                │                │
│     └────────────────────────────────────────┘                │
│                         │                                      │
│                         ▼                                      │
│  5. Repeat for all chapters...                                 │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Implementation

# generate_long_story.py (simplified)

class LongStoryGenerator:
    def __init__(self, backend="ollama"):
        self.base_generator = StoryGenerator(backend=backend)
        self.chapter_summaries = []

    def generate_outline(self, premise: str, num_chapters: int = 10) -> list:
        """Generate a story outline"""
        prompt = f"""Create a {num_chapters}-chapter story outline for:
{premise}

For each chapter provide:
- Title
- Summary (2-3 sentences)
- Key events
"""
        outline_text = self.base_generator.generator.generate(prompt)
        # _parse_outline (and _save_chapter used below) are small helpers omitted for brevity
        return self._parse_outline(outline_text)

    def generate_chapter(self, chapter_num: int, chapter_outline: dict) -> str:
        """Generate a single chapter with context"""

        # Build previous summary
        previous = "\n".join([
            f"Chapter {i+1}: {s}"
            for i, s in enumerate(self.chapter_summaries)
        ])

        prompt = f"""
Previous chapters summary:
{previous if previous else "This is the beginning of the story."}

---

Write Chapter {chapter_num}: {chapter_outline['title']}

Chapter outline: {chapter_outline['summary']}

Write 2500-3500 words. Include dialogue, descriptions, and character thoughts.
"""
        # Get style samples based on chapter content
        style_samples = self.base_generator.retriever.retrieve(
            chapter_outline['summary'],
            n_results=5
        )

        full_prompt = f"""Reference writing style:
{chr(10).join(style_samples)}

---

{prompt}
"""
        return self.base_generator.generator.generate(
            full_prompt,
            max_tokens=4000
        )

    def summarize_chapter(self, chapter_content: str) -> str:
        """Create a summary for context in next chapters"""
        prompt = f"""Summarize this chapter in 100-150 words:

{chapter_content}

Focus on key events and character changes.
"""
        return self.base_generator.generator.generate(prompt, max_tokens=200)

    def generate_full_story(self, premise: str, num_chapters: int = 10):
        """Generate a complete multi-chapter story"""

        # Step 1: Generate outline
        print("Generating outline...")
        outline = self.generate_outline(premise, num_chapters)

        # Step 2: Generate each chapter
        chapters = []
        for i, chapter_outline in enumerate(outline):
            print(f"Generating Chapter {i+1}/{num_chapters}...")

            # Generate chapter
            chapter = self.generate_chapter(i+1, chapter_outline)
            chapters.append(chapter)

            # Summarize for next chapter's context
            summary = self.summarize_chapter(chapter)
            self.chapter_summaries.append(summary)

            # Save progress
            self._save_chapter(i+1, chapter)

        return chapters

Usage

# Interactive mode
python generate_long_story.py --interactive

# Direct generation
python generate_long_story.py \
  --premise "A young cultivator discovers an ancient inheritance" \
  --chapters 10 \
  --genre "Xianxia"

# Resume interrupted story
python generate_long_story.py --resume story_20240101_120000

Configuration Options

Generation Settings

# config.py

GENERATION_CONFIG = {
    "max_new_tokens": 1000,     # Short stories
    "temperature": 0.85,        # Creativity level
    "top_p": 0.92,              # Sampling diversity
    "repetition_penalty": 1.15  # Reduce repetition
}

CHAPTER_GENERATION_CONFIG = {
    "max_new_tokens": 4000,     # Full chapters (~3000 words)
    "temperature": 0.85,
    "repetition_penalty": 1.18  # Higher for long text
}

Model Recommendations

Model            Best For              Notes
qwen2.5:7b       Multilingual stories  Best for Chinese/English
llama3.1:8b      English stories       Fast, good quality
gemma2:9b        Balanced              Good all-around
gpt-4            Highest quality       Cloud, costs money
claude-3-sonnet  Creative writing      Excellent prose

Quick Commands

# Generate short story (CLI)
./run.sh generate

# Generate full chapter
./run.sh chapter

# Multi-chapter story (interactive)
./run.sh story

# List all generated stories
./run.sh stories

Example Output

Here's a sample from a Xianxia story generated by the system:

Chapter 1: The Sealed Cave

The waterfall roared like a caged beast, but Chen Wei barely heard it. His attention was fixed on the jade pendant hanging from his neck—the last gift from his dying master.

"Beyond the Crying Dragon Falls," Master Liu had whispered with his final breath, "lies the inheritance I could never claim. Perhaps you, with your crippled spiritual roots, will succeed where I failed."

Chen Wei had thought the old man delirious. But now, standing before the hundred-meter cascade, he felt it. A resonance. The pendant pulsed with warmth, responding to something hidden behind the wall of water.

He stepped through.

The cave beyond defied mortal understanding. Luminescent moss clung to walls carved with formations so complex they made his eyes water. At the center, upon a throne of crystallized qi, sat a skeleton in meditation pose.

"You have come," a voice echoed in his mind. "I have waited nine thousand years for one with spiritual roots damaged enough to contain my inheritance. Normal cultivators would explode from the power. But you... you are broken in exactly the right way."

Chen Wei's crippled dantian, the shame that had haunted him for eighteen years, suddenly felt less like a curse and more like a key.

"Who are you?" he asked the skeleton.

"I am what remains of the Heavenly Demon Emperor. And you, boy, are about to become something the cultivation world has not seen in ten thousand years."

The skeleton's empty eye sockets began to glow...


Deployment Options

Our tutorial runs everything locally on your machine. Let's explore how this works and how you can extend it to a server.

Current Architecture: Local-First

┌─────────────────────────────────────────────────────────────┐
│                    YOUR LOCAL MACHINE                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐     ┌──────────────┐    ┌──────────────┐ │
│  │   Ollama     │     │   ChromaDB   │    │   Python     │ │
│  │  (LLM API)   │◀───▶│  (Vector DB) │◀──▶│   Scripts    │ │
│  │              │     │              │    │              │ │
│  │  Port 11434  │     │  ./chroma_db │    │  Flask App   │ │
│  └──────────────┘     └──────────────┘    └──────────────┘ │
│         ▲                                        ▲          │
│         │                                        │          │
│    GPU Inference                           Port 5000        │
│    (if available)                                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why Local-First?

  1. Privacy: Your books and stories never leave your machine
  2. Free: No API costs for generation
  3. Offline: Works without internet connection
  4. Learning: You understand every component

Running Locally with Ollama

Ollama makes running LLMs locally remarkably easy:

# Install Ollama (macOS)
brew install ollama

# Start the server
ollama serve

# Pull a model
ollama pull qwen2.5:7b
ollama pull llama3.1:8b

# Check available models
ollama list

Hardware Requirements:

Model Size   RAM Needed   GPU VRAM   Speed
3B params    8GB          4GB        Fast
7B params    16GB         8GB        Good
14B params   32GB         16GB       Slower
70B params   64GB+        40GB+      Slow

For story generation, 7B models like qwen2.5:7b offer the best balance of quality and speed.

Local ChromaDB

ChromaDB runs as an embedded database by default:

# Embedded mode (default) - no server needed
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")

# Data stored in ./chroma_db/ directory
# ~100MB for 10,000 chunks

This is perfect for local development and small-to-medium datasets.


Extending to Server Deployment

Want to deploy for multiple users or remote access? Here's how:

Option 1: Docker Compose (Simplest)

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  chromadb:
    image: chromadb/chroma
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma

  app:
    build: .
    ports:
      - "5000:5000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - CHROMA_HOST=http://chromadb:8000
    depends_on:
      - ollama
      - chromadb

volumes:
  ollama_data:
  chroma_data:
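
With this setup, the app container no longer embeds ChromaDB; it talks to the chromadb and ollama services over the network. Here is a minimal sketch of the connection code, assuming you adapt the Part 2 retriever and OllamaGenerator to read their endpoints from the environment variables defined in the compose file:

import os
import chromadb
from urllib.parse import urlparse
from config import COLLECTION_NAME

# Values injected via the environment: section of docker-compose.yml above
ollama_host = os.environ.get("OLLAMA_HOST", "http://ollama:11434")
chroma = urlparse(os.environ.get("CHROMA_HOST", "http://chromadb:8000"))

# ChromaDB in client/server mode: HttpClient instead of PersistentClient(path=...)
client = chromadb.HttpClient(host=chroma.hostname, port=chroma.port)
collection = client.get_collection(COLLECTION_NAME)

# For Ollama, point the generator at the service instead of localhost,
# e.g. by making the URL a constructor argument: OllamaGenerator(base_url=ollama_host)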

Option 2: Cloud API Backend

Switch from local Ollama to cloud APIs:

# config.py

# Option A: Use Ollama (local)
LLM_BACKEND = "ollama"
OLLAMA_MODEL = "qwen2.5:7b"

# Option B: Use OpenAI
LLM_BACKEND = "openai"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = "gpt-4"

# Option C: Use Anthropic Claude
LLM_BACKEND = "anthropic"
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
ANTHROPIC_MODEL = "claude-3-sonnet-20240229"

# Option D: Use Google Gemini
LLM_BACKEND = "gemini"
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GEMINI_MODEL = "gemini-pro"
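
We implemented Ollama and OpenAI backends earlier; the other options plug into the same generate(prompt, max_tokens) interface. As a sketch, here is what an Anthropic backend could look like (class name and defaults are mine; it assumes the official anthropic Python package):

class AnthropicGenerator:
    """Generate text using the Anthropic Claude API"""

    def __init__(self, model: str = "claude-3-sonnet-20240229"):
        import os
        from anthropic import Anthropic

        self.client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.model = model
        print(f"[Anthropic] Model: {model}")

    def generate(self, prompt: str, max_tokens: int = 1000) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=0.85,
            messages=[{"role": "user", "content": prompt}]
        )
        # Claude returns a list of content blocks; the first holds the text
        return response.content[0].text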

Option 3: Managed Vector Database

For production, consider managed vector databases:

# Pinecone (managed)
import pinecone

pinecone.init(api_key="YOUR_KEY", environment="us-east-1")
index = pinecone.Index("story-styles")

# Weaviate (self-hosted or cloud)
import weaviate

client = weaviate.Client(url="https://your-cluster.weaviate.network")

# Qdrant (self-hosted or cloud)
from qdrant_client import QdrantClient

client = QdrantClient(url="https://your-qdrant-instance.com")

Architecture Comparison

┌────────────────────────────────────────────────────────────────┐
│                    LOCAL (Tutorial)                             │
├────────────────────────────────────────────────────────────────┤
│  User → Python App → ChromaDB (file) → Ollama (local)          │
│                                                                 │
│  Pros: Free, private, offline                                   │
│  Cons: Limited to your hardware                                 │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│                    SERVER (Docker)                              │
├────────────────────────────────────────────────────────────────┤
│  Users → Flask App → ChromaDB Server → Ollama (GPU server)     │
│                                                                 │
│  Pros: Multiple users, better GPU                               │
│  Cons: Server costs, network latency                            │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│                    CLOUD (Production)                           │
├────────────────────────────────────────────────────────────────┤
│  Users → Web App → Pinecone (managed) → OpenAI/Claude API      │
│                                                                 │
│  Pros: Scalable, no maintenance, best models                    │
│  Cons: API costs, data leaves your control                      │
└────────────────────────────────────────────────────────────────┘

Summary

In this series, we built a complete RAG-powered story generator:

Part 1: Understanding RAG

  • What RAG is and why it matters
  • Architecture overview
  • Key components (embeddings, vector DB, retrieval)
  • Comparison with alternatives (fine-tuning, prompt engineering)

Part 2: Building the Pipeline

  • Parsing ebooks (PDF, EPUB, MOBI)
  • Text chunking strategies
  • Generating embeddings
  • Storing in ChromaDB

Part 3: Story Generation

  • Connecting to LLMs (Ollama, OpenAI)
  • Building augmented prompts
  • Multi-chapter generation with summaries
  • Deployment options (local → server → cloud)

What's Next? Advanced RAG Features

Now that you understand the basics, here are the next features to learn and implement:

1. Hybrid Search (BM25 + Semantic)

Problem: Semantic search sometimes misses exact keyword matches.

Solution: Combine keyword search (BM25) with vector search:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from config import EMBED_MODEL

class HybridRetriever:
    """Combine BM25 keyword scores with dense-vector similarity.
    Simplified in-memory version: embeddings are held in a NumPy array
    instead of being queried through ChromaDB."""

    def __init__(self, chunks: list[str], embed_model: str = EMBED_MODEL):
        self.chunks = chunks

        # BM25 for keyword matching
        tokenized = [c.split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

        # Dense embeddings for semantic matching
        self.embedder = SentenceTransformer(embed_model)
        self.embeddings = self.embedder.encode(chunks, normalize_embeddings=True)

    def search(self, query: str, n_results: int = 5, alpha: float = 0.5) -> list[str]:
        # Keyword scores, normalized to [0, 1] so they can be mixed with cosine scores
        bm25_scores = np.array(self.bm25.get_scores(query.split()))
        if bm25_scores.max() > 0:
            bm25_scores = bm25_scores / bm25_scores.max()

        # Semantic scores: cosine similarity against the query embedding
        query_emb = self.embedder.encode([query], normalize_embeddings=True)[0]
        vector_scores = self.embeddings @ query_emb

        # Combine with a weighted average and return the top-N chunks
        final_scores = alpha * bm25_scores + (1 - alpha) * vector_scores
        top_indices = np.argsort(final_scores)[::-1][:n_results]
        return [self.chunks[i] for i in top_indices]

When to use: When users search for specific character names, locations, or technical terms.

2. Query Expansion

Problem: User query may not match document vocabulary.

Solution: Expand query with synonyms or LLM-generated variations:

def expand_query(self, query: str) -> list[str]:
    """Generate query variations (assumes self.llm is one of the generator classes above)"""
    prompt = f"""Generate 3 alternative phrasings for this search:
    "{query}"

    List only the alternatives, one per line."""

    variations = self.llm.generate(prompt)
    return [query] + [v.strip() for v in variations.split('\n') if v.strip()]

3. Cross-Encoder Reranking

Problem: Bi-encoder embeddings miss nuanced relevance.

Solution: Use a cross-encoder to rerank top results:

from sentence_transformers import CrossEncoder

class RerankedRetriever:
    def __init__(self):
        self.retriever = StyleRetriever()
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def retrieve(self, query: str, n_results: int = 5):
        # First pass: get top 20 candidates
        candidates = self.retriever.retrieve(query, n_results=20)

        # Second pass: rerank with cross-encoder
        pairs = [[query, doc] for doc in candidates]
        scores = self.reranker.predict(pairs)

        # Return top N after reranking
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, score in ranked[:n_results]]

Why it works: Cross-encoders see query and document together, understanding their relationship better.

4. Semantic Chunking

Problem: Fixed-size chunks cut sentences and paragraphs awkwardly.

Solution: Chunk by semantic boundaries:

def semantic_chunk(text: str, max_size: int = 1000):
    """Split at paragraph/scene boundaries"""
    # Split by paragraph
    paragraphs = text.split('\n\n')

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        # Check if adding this paragraph exceeds limit
        if len(current_chunk) + len(para) > max_size:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para
        else:
            current_chunk += "\n\n" + para

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

Advanced: Use an LLM to identify natural break points (scene changes, topic shifts).
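
A sketch of that approach, assuming you reuse one of the generator classes from Step 2 (the marker string and prompt wording are mine, and this naive version asks the LLM to echo the whole text, so it is only practical for shorter passages):

def llm_semantic_chunk(text: str, generator, max_size: int = 1000) -> list[str]:
    """Ask an LLM to mark scene/topic breaks, then split on those markers"""
    prompt = f"""Copy the following text exactly, but insert the marker <<<BREAK>>>
on its own line wherever a scene changes or the topic clearly shifts:

{text}
"""
    marked = generator.generate(prompt, max_tokens=len(text.split()) * 2)

    chunks = []
    for segment in marked.split("<<<BREAK>>>"):
        segment = segment.strip()
        if not segment:
            continue
        # Fall back to size-based splitting for segments that are still too long
        if len(segment) > max_size:
            chunks.extend(semantic_chunk(segment, max_size))
        else:
            chunks.append(segment)
    return chunks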

5. Metadata Filtering

Problem: All chunks are treated equally regardless of source.

Solution: Add and filter by metadata:

# When building the database
collection.add(
    documents=[chunk],
    embeddings=[embedding],
    metadatas=[{
        "source_file": "cultivation_novel_1.txt",
        "author": "Unknown",
        "genre": "xianxia",
        "chapter": 5,
        "word_count": len(chunk.split())
    }],
    ids=[chunk_id]
)

# When querying
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    # ChromaDB requires $and when combining multiple filter conditions
    where={"$and": [
        {"genre": "xianxia"},
        {"word_count": {"$gt": 200}}
    ]}
)

6. Caching for Performance

Problem: Re-embedding the same queries wastes compute.

Solution: Cache embeddings and results:

import hashlib

class CachedRetriever:
    def __init__(self):
        self.retriever = StyleRetriever()
        self._cache = {}

    def retrieve(self, query: str, n_results: int = 5):
        # Create cache key
        cache_key = hashlib.md5(f"{query}:{n_results}".encode()).hexdigest()

        if cache_key in self._cache:
            return self._cache[cache_key]

        results = self.retriever.retrieve(query, n_results)
        self._cache[cache_key] = results
        return results

7. Evaluation Metrics

Problem: How do you know if retrieval is actually working?

Solution: Implement evaluation metrics:

def evaluate_retrieval(test_queries: list, ground_truth: dict):
    """
    Measure retrieval quality

    Args:
        test_queries: List of test queries
        ground_truth: {query: [relevant_doc_ids]}
    """
    retriever = StyleRetriever()

    metrics = {
        "precision@5": [],
        "recall@5": [],
        "mrr": []  # Mean Reciprocal Rank
    }

    for query in test_queries:
        results = retriever.retrieve(query, n_results=5)
        relevant = ground_truth[query]

        # Calculate precision@5
        # Assumes each retrieved result carries its ChromaDB id
        # (StyleRetriever.retrieve() above returns plain documents, so adapt it for evaluation)
        retrieved_ids = [r['id'] for r in results]
        hits = len(set(retrieved_ids) & set(relevant))
        metrics["precision@5"].append(hits / 5)
        metrics["recall@5"].append(hits / len(relevant))

        # Calculate MRR
        for i, rid in enumerate(retrieved_ids):
            if rid in relevant:
                metrics["mrr"].append(1 / (i + 1))
                break
        else:
            metrics["mrr"].append(0)

    return {k: sum(v)/len(v) for k, v in metrics.items()}

8. Document Hierarchy

Problem: Losing context about where a chunk came from.

Solution: Store hierarchical context:

Book → Chapter → Section → Paragraph → Chunk
# Store parent context with each chunk
metadata = {
    "book_title": "Cultivation Journey",
    "chapter_number": 5,
    "chapter_title": "The Hidden Inheritance",
    "section": "discovery",
    "parent_chunk_id": "chunk_004",  # Previous chunk
    "child_chunk_ids": ["chunk_006", "chunk_007"]
}

When generating, you can include parent context for better coherence.
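
As a sketch, here is one way to pull that parent context back in at query time, assuming the metadata above was stored during indexing (it uses ChromaDB's get() lookup by id):

def retrieve_with_parents(collection, query_embedding, n_results=3):
    """Fetch matching chunks plus the chunk that precedes each one"""
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas"]
    )

    passages = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        parent_id = meta.get("parent_chunk_id")
        if parent_id:
            # Look up the preceding chunk by id and prepend it for extra context
            parent = collection.get(ids=[parent_id])
            if parent["documents"]:
                doc = parent["documents"][0] + "\n\n" + doc
        passages.append(doc)
    return passages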

Learning Path

Here's a suggested order to learn these features:

1. Metadata Filtering (Easy)
   └── Add author/genre filters to your queries

2. Caching (Easy)
   └── Speed up repeated queries

3. Semantic Chunking (Medium)
   └── Better chunk quality = better retrieval

4. Hybrid Search (Medium)
   └── Combine the best of keyword + semantic

5. Cross-Encoder Reranking (Medium)
   └── Significantly improve relevance

6. Query Expansion (Medium)
   └── Handle query-document vocabulary mismatch

7. Evaluation Metrics (Advanced)
   └── Measure and improve systematically

8. Document Hierarchy (Advanced)
   └── Handle complex document structures

Production Considerations

If you're taking this to production, also consider:

Concern      Solution
Scale        Distributed vector DB (Pinecone, Weaviate)
Latency      Pre-compute embeddings, cache aggressively
Cost         Smaller models, batched requests
Quality      Evaluation pipeline, A/B testing
Security     Input sanitization, output filtering
Monitoring   Log queries, track retrieval quality
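
For the latency and cost rows in particular, most of the win comes from embedding the corpus once, in batches, at index time, so that only the short user query is embedded per request. A small sketch using the same embedding model as the rest of the tutorial (the chunk list here is illustrative):

from sentence_transformers import SentenceTransformer
from config import EMBED_MODEL

embedder = SentenceTransformer(EMBED_MODEL)

# Illustrative corpus; in practice, the chunked text from Part 2
chunks = ["chunk one ...", "chunk two ..."]

# Embed everything once, in batches, and store the vectors alongside the chunks
corpus_embeddings = embedder.encode(
    chunks,
    batch_size=64,               # batch work on the GPU/CPU
    show_progress_bar=True,
    normalize_embeddings=True    # lets you rank with a plain dot product later
)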

Resources

Previous Articles:

  • Part 1: Understanding RAG
  • Part 2: Building the Pipeline

Thanks for following this series!
