RAG Pipeline Framework
Retrieval-Augmented Generation is the most practical way to give LLMs access to your private data — but getting it right is harder than the tutorials suggest. This framework provides the complete pipeline: document ingestion, intelligent chunking, embedding generation, vector store integration, retrieval with reranking, and answer generation with source citations. It also ships with evaluation tools to measure whether your RAG system actually returns correct answers.
Key Features
- Document Ingestion — Load PDFs, Markdown, HTML, Word docs, and plain text with automatic format detection and metadata extraction
- Chunking Strategies — Fixed-size, semantic, recursive, and document-structure-aware chunking with configurable overlap
- Embedding Generation — Support for OpenAI, Cohere, and local embedding models with automatic batching and rate limiting
- Vector Store Integration — Pluggable backends for ChromaDB, Pinecone, Weaviate, pgvector, and FAISS with unified query interface
- Hybrid Search — Combine dense vector search with sparse keyword search (BM25) for higher recall
- Reranking — Cross-encoder reranking to boost precision after initial retrieval
- Citation Generation — Automatically include source references in generated answers with page numbers and document titles
- RAG Evaluation — Measure retrieval precision, answer faithfulness, and end-to-end accuracy
Quick Start
from rag_pipeline import RAGPipeline, ChunkingStrategy, VectorStore
# 1. Configure the pipeline
pipeline = RAGPipeline(
    chunking=ChunkingStrategy(
        method="recursive",
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],
    ),
    embedding_model="text-embedding-3-small",
    vector_store=VectorStore(
        backend="chromadb",
        collection="company_docs",
        persist_dir="./vector_db",
    ),
    llm_model="gpt-4o",
)
# 2. Ingest documents
pipeline.ingest("documents/", glob="**/*.{pdf,md,txt}")
print(f"Ingested {pipeline.document_count} documents, {pipeline.chunk_count} chunks")
# 3. Query with RAG
result = pipeline.query("What is our refund policy for enterprise customers?")
print(result.answer)
print(f"\nSources ({len(result.sources)}):")
for src in result.sources:
    print(f" - {src.document}: p.{src.page} (relevance: {src.score:.2f})")
Architecture
Documents (PDF, MD, HTML, TXT)
│
▼
┌─────────────────┐
│ Document Loader│──── Parse + extract text + metadata
└────────┬────────┘
▼
┌─────────────────┐
│ Chunker │──── Split into retrieval-sized chunks
└────────┬────────┘
▼
┌─────────────────┐
│ Embedder │──── Generate vector embeddings
└────────┬────────┘
▼
┌─────────────────┐
│ Vector Store │──── Store embeddings + metadata
└────────┬────────┘
│
Query Time:
│
User Query ──▶ Embed ──▶ Vector Search ──▶ Rerank ──▶ LLM + Context ──▶ Answer
│ │
Top K chunks Source citations
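The query-time path in the diagram boils down to: embed the query, score it against stored chunk embeddings, and hand the top hits to the LLM. The toy sketch below illustrates just the retrieval step — `cosine` and the in-memory `index` stand in for the real embedder and vector store, and the text/vector values are made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, index, top_k=3):
    # index: list of (chunk_text, embedding) pairs
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return scored[:top_k]

index = [
    ("Refunds for enterprise plans...", [0.9, 0.1, 0.0]),
    ("Office holiday schedule...",      [0.0, 0.2, 0.9]),
    ("Refund processing takes 5 days",  [0.8, 0.3, 0.1]),
]
hits = retrieve([1.0, 0.2, 0.0], index, top_k=2)
```

In the real pipeline, the retrieved chunks (plus their metadata) are then packed into the LLM prompt, and the chunk metadata becomes the source citations.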
Usage Examples
Chunking Strategy Comparison
from rag_pipeline.chunking import FixedSizeChunker, SemanticChunker, RecursiveChunker
# Fixed-size: Predictable chunk sizes, may break mid-sentence
fixed = FixedSizeChunker(chunk_size=500, overlap=50)
# Semantic: Chunks at natural topic boundaries using embeddings
semantic = SemanticChunker(
    embedding_model="text-embedding-3-small",
    breakpoint_threshold=0.3,  # Lower = more chunks
    min_chunk_size=100,
    max_chunk_size=1000,
)

# Recursive: Tries multiple separators in order of preference
recursive = RecursiveChunker(
    chunk_size=500,
    overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # Paragraphs → lines → sentences → words
)
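The recursive strategy is worth seeing spelled out: split on the coarsest separator first, greedily pack the pieces up to the size limit, and only fall back to finer separators when a single piece is still too big. This is a simplified sketch of the idea, not the actual RecursiveChunker — overlap and token-based counting are omitted:

```python
def recursive_chunk(text, chunk_size, separators):
    # Text already fits, or no coarser separators left to try
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # Greedily pack parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # A single part is still too big: recurse with finer separators
            chunks.extend(recursive_chunk(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

recursive_chunk("aa bb cc dd", chunk_size=5, separators=[" "])
# → ['aa bb', 'cc dd']
```

The payoff of this ordering is that paragraph and sentence boundaries are preserved whenever possible, and mid-word breaks only happen as a last resort.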
Hybrid Search with Reranking
from rag_pipeline.retrieval import HybridRetriever, CrossEncoderReranker
retriever = HybridRetriever(
    dense_weight=0.7,   # 70% vector similarity
    sparse_weight=0.3,  # 30% BM25 keyword match
    top_k=20,           # Retrieve 20 candidates
    reranker=CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n=5,  # Return top 5 after reranking
    ),
)
results = retriever.search("enterprise refund policy", vector_store=store)
for r in results:
print(f"[{r.score:.3f}] {r.document} — {r.text[:100]}...")
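Dense and sparse scores live on incompatible scales (cosine similarity vs. raw BM25), so they have to be normalized before the weighted sum. One common approach is min-max normalization plus a weighted combination — a sketch of that idea, not necessarily how HybridRetriever fuses internally (reciprocal rank fusion is another popular choice):

```python
def fuse(dense, sparse, dense_weight=0.7, sparse_weight=0.3):
    # dense, sparse: dicts of doc_id -> raw score on incompatible scales
    def minmax(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    dn, sn = minmax(dense), minmax(sparse)
    fused = {
        doc: dense_weight * dn.get(doc, 0.0) + sparse_weight * sn.get(doc, 0.0)
        for doc in set(dn) | set(sn)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse(dense={"a": 0.91, "b": 0.55}, sparse={"b": 12.0, "c": 3.0})
```

Note that a document found by only one retriever (like "c" here) still ranks, just with a zero contribution from the other channel — that is exactly how hybrid search lifts recall.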
RAG Evaluation
from rag_pipeline.evaluation import RAGEvaluator
evaluator = RAGEvaluator(
    test_dataset="eval_data/qa_pairs.jsonl",  # Questions with ground truth answers
    metrics=["retrieval_precision", "answer_faithfulness", "answer_correctness"],
    judge_model="gpt-4o-mini",
)
scores = evaluator.evaluate(pipeline)
print(f"Retrieval Precision@5: {scores.retrieval_precision:.2%}")
print(f"Answer Faithfulness: {scores.answer_faithfulness:.2%}")
print(f"Answer Correctness: {scores.answer_correctness:.2%}")
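Of the three metrics, retrieval precision@k is the only one that needs no LLM judge — it is a direct computation over labeled data. A minimal sketch, assuming the evaluator compares retrieved chunk IDs against a labeled set of relevant IDs per question:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved chunks that are actually relevant
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant_ids) / len(top)

score = precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c1", "c3", "c9"})
# → 0.4
```

Faithfulness and correctness, by contrast, are judged by the judge_model, which is why those scores can drift between judge versions while precision@k stays stable.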
Metadata Filtering
# Only search within specific document types or date ranges
result = pipeline.query(
    "What changed in the Q3 update?",
    filters={
        "document_type": "changelog",
        "date": {"$gte": "2025-07-01"},
        "department": "engineering",
    },
    top_k=5,
)
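Operator-style filters like the one above reduce to a predicate evaluated against each chunk's metadata before (or during) vector search. A minimal sketch of that predicate, assuming ISO-format date strings so plain string comparison orders dates correctly:

```python
def matches(metadata, filters):
    for key, cond in filters.items():
        value = metadata.get(key)
        if isinstance(cond, dict):
            # Operator conditions, e.g. {"$gte": "2025-07-01"}
            for op, bound in cond.items():
                if value is None:
                    return False
                if op == "$gte" and value < bound:
                    return False
                if op == "$lte" and value > bound:
                    return False
        elif value != cond:
            # Plain equality condition
            return False
    return True

meta = {"document_type": "changelog", "date": "2025-08-15", "department": "engineering"}
matches(meta, {"document_type": "changelog", "date": {"$gte": "2025-07-01"}})
# → True
```

Backends differ in where this filtering runs — some apply it pre-search inside the index, others post-filter results — which can affect how large a top_k you need to request.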
Configuration
# rag_config.yaml
ingestion:
  supported_formats: ["pdf", "md", "txt", "html", "docx"]
  metadata_extraction: true
  skip_existing: true       # Don't re-ingest unchanged docs
  watch_directory: false    # Auto-ingest new files

chunking:
  method: "recursive"       # fixed | semantic | recursive
  chunk_size: 500           # Tokens
  chunk_overlap: 50
  separators: ["\n\n", "\n", ". ", " "]
  include_metadata: true    # Attach source doc metadata to chunks

embedding:
  model: "text-embedding-3-small"
  batch_size: 100
  rate_limit_rpm: 3000
  dimensions: 1536

vector_store:
  backend: "chromadb"       # chromadb | pinecone | weaviate | pgvector | faiss
  collection: "company_docs"
  persist_dir: "./vector_db"

retrieval:
  method: "hybrid"          # dense | sparse | hybrid
  dense_weight: 0.7
  sparse_weight: 0.3
  top_k: 20
  reranker:
    enabled: true
    model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    top_n: 5

generation:
  model: "gpt-4o"
  temperature: 0.1
  max_tokens: 1000
  system_prompt: |
    Answer the question based ONLY on the provided context.
    If the context doesn't contain the answer, say "I don't have
    enough information to answer that." Always cite your sources.
  include_citations: true

evaluation:
  test_dataset: "eval_data/qa_pairs.jsonl"
  metrics: ["retrieval_precision", "answer_faithfulness", "answer_correctness"]
  judge_model: "gpt-4o-mini"
  sample_size: 100
Best Practices
- Chunk size matters more than you think — Too small (< 200 tokens) loses context. Too large (> 1000 tokens) dilutes relevance. Start at 500 and tune with eval.
- Always use overlap — 10-20% overlap prevents information loss at chunk boundaries.
- Hybrid search beats dense-only — Adding BM25 catches keyword-exact matches that embedding models miss (proper nouns, error codes, IDs).
- Reranking is worth the latency — A cross-encoder reranker on your top-20 results dramatically improves the top-5 quality.
- Include metadata in chunks — Prepend document title, section headers, and dates to each chunk. The LLM needs this context.
- Evaluate on real questions — Build a test set of 100+ real user questions with ground truth answers. Run it after every pipeline change.
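The overlap rule is easiest to see at the token level: with a stride of size - overlap, the last tokens of each chunk reappear at the start of the next, so a sentence straddling a boundary survives in at least one chunk. A toy sketch (integers stand in for tokens; the library's chunkers count real tokens):

```python
def fixed_chunks(tokens, size=500, overlap=50):
    # Stride by (size - overlap) so consecutive chunks share `overlap` tokens
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

fixed_chunks(list(range(10)), size=4, overlap=1)
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Here overlap=1 on size=4 is the same 25%-ish ratio you would avoid in practice; the 10-20% guidance above corresponds to overlap of 50-100 on 500-token chunks.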
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Retrieved chunks are irrelevant to the query | Chunk size too large or embedding model weak | Reduce chunk size to 300-500; try text-embedding-3-large for better retrieval |
| Answers hallucinate despite relevant context | System prompt doesn't enforce grounding | Add "ONLY use the provided context" constraint and enable faithfulness evaluation |
| Ingestion is slow on large document sets | Embedding API rate limits | Increase batch_size, use a local embedding model, or set skip_existing: true |
| Duplicate chunks from overlapping documents | Same content in multiple source files | Enable deduplication with dedup_threshold: 0.95 in chunking config |
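The dedup_threshold fix in the last row amounts to a greedy similarity filter: keep a chunk only if its embedding is below the threshold against everything already kept. A sketch under that assumption (the embeddings here are toy two-dimensional values; real ones have hundreds of dimensions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def dedup(embedded_chunks, threshold=0.95):
    # embedded_chunks: list of (text, embedding); keep a chunk only if it is
    # not near-identical to any chunk already kept
    kept = []
    for text, vec in embedded_chunks:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

unique = dedup([
    ("Refund policy v1", [1.0, 0.0]),
    ("Refund policy v1 (copy)", [0.999, 0.01]),  # near-duplicate, dropped
    ("Holiday schedule", [0.0, 1.0]),
])
# → ['Refund policy v1', 'Holiday schedule']
```

This pairwise scan is O(n²) and only suitable as an illustration; at scale you would dedupe via the vector index itself.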
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete RAG Pipeline Framework with all files, templates, and documentation for $59.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.