Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

RAG Pipeline Framework

Retrieval-Augmented Generation is the most practical way to give LLMs access to your private data — but getting it right is harder than the tutorials suggest. This framework provides the complete pipeline: document ingestion, intelligent chunking, embedding generation, vector store integration, retrieval with reranking, and answer generation with source citations. Plus evaluation tools to measure whether your RAG system actually returns correct answers.

Key Features

  • Document Ingestion — Load PDFs, Markdown, HTML, Word docs, and plain text with automatic format detection and metadata extraction
  • Chunking Strategies — Fixed-size, semantic, recursive, and document-structure-aware chunking with configurable overlap
  • Embedding Generation — Support for OpenAI, Cohere, and local embedding models with automatic batching and rate limiting
  • Vector Store Integration — Pluggable backends for ChromaDB, Pinecone, Weaviate, pgvector, and FAISS with unified query interface
  • Hybrid Search — Combine dense vector search with sparse keyword search (BM25) for higher recall
  • Reranking — Cross-encoder reranking to boost precision after initial retrieval
  • Citation Generation — Automatically include source references in generated answers with page numbers and document titles
  • RAG Evaluation — Measure retrieval precision, answer faithfulness, and end-to-end accuracy

Quick Start

from rag_pipeline import RAGPipeline, ChunkingStrategy, VectorStore

# 1. Configure the pipeline
pipeline = RAGPipeline(
    chunking=ChunkingStrategy(
        method="recursive",
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],
    ),
    embedding_model="text-embedding-3-small",
    vector_store=VectorStore(
        backend="chromadb",
        collection="company_docs",
        persist_dir="./vector_db",
    ),
    llm_model="gpt-4o",
)

# 2. Ingest documents
pipeline.ingest("documents/", glob="**/*.{pdf,md,txt}")
print(f"Ingested {pipeline.document_count} documents, {pipeline.chunk_count} chunks")

# 3. Query with RAG
result = pipeline.query("What is our refund policy for enterprise customers?")
print(result.answer)
print(f"\nSources ({len(result.sources)}):")
for src in result.sources:
    print(f"  - {src.document}: p.{src.page} (relevance: {src.score:.2f})")

Architecture

Documents (PDF, MD, HTML, TXT)
         │
         ▼
┌─────────────────┐
│  Document Loader│──── Parse + extract text + metadata
└────────┬────────┘
         ▼
┌─────────────────┐
│   Chunker       │──── Split into retrieval-sized chunks
└────────┬────────┘
         ▼
┌─────────────────┐
│  Embedder       │──── Generate vector embeddings
└────────┬────────┘
         ▼
┌─────────────────┐
│  Vector Store   │──── Store embeddings + metadata
└────────┬────────┘
         │
    Query Time:
         │
User Query ──▶ Embed ──▶ Vector Search ──▶ Rerank ──▶ LLM + Context ──▶ Answer
                              │                              │
                         Top K chunks                  Source citations
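The query-time path in the diagram can be sketched in a few lines of plain Python. This is an illustration of the flow, not the library's internals; `embed_fn`, `store`, `reranker`, and `llm` are stand-in objects I'm assuming for the example.

```python
# Illustrative sketch of the query-time flow: embed -> search -> rerank -> generate.
# All collaborators (embed_fn, store, reranker, llm) are assumed stand-ins.
def answer_query(query, embed_fn, store, reranker, llm, top_k=20, top_n=5):
    query_vec = embed_fn(query)                          # Embed the query
    candidates = store.search(query_vec, k=top_k)        # Vector search, top-k candidates
    ranked = reranker.rerank(query, candidates)[:top_n]  # Cross-encoder rerank, keep top-n
    context = "\n\n".join(c.text for c in ranked)        # Assemble retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt), ranked                           # Answer plus source chunks
```

The chunks returned alongside the answer are what the citation step draws on.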

Usage Examples

Chunking Strategy Comparison

from rag_pipeline.chunking import FixedSizeChunker, SemanticChunker, RecursiveChunker

# Fixed-size: Predictable chunk sizes, may break mid-sentence
fixed = FixedSizeChunker(chunk_size=500, overlap=50)

# Semantic: Chunks at natural topic boundaries using embeddings
semantic = SemanticChunker(
    embedding_model="text-embedding-3-small",
    breakpoint_threshold=0.3,   # Lower = more chunks
    min_chunk_size=100,
    max_chunk_size=1000,
)

# Recursive: Tries multiple separators in order of preference
recursive = RecursiveChunker(
    chunk_size=500,
    overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # Paragraphs → lines → sentences → words
)
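To make the fixed-size strategy concrete, here is a minimal token-window chunker with overlap, written from scratch for illustration (it splits on whitespace rather than true tokens, and is not the library's implementation):

```python
# Minimal fixed-size chunking with overlap, illustrative only.
# Splits on whitespace; a real tokenizer would count model tokens instead.
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    tokens = text.split()
    step = chunk_size - overlap            # Each window starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break                          # Last window reached the end of the text
    return chunks
```

With `chunk_size=500, overlap=50`, consecutive chunks share their last/first 50 tokens, which is what prevents a sentence straddling a boundary from being lost.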

Hybrid Search with Reranking

from rag_pipeline.retrieval import HybridRetriever, CrossEncoderReranker

retriever = HybridRetriever(
    dense_weight=0.7,           # 70% vector similarity
    sparse_weight=0.3,          # 30% BM25 keyword match
    top_k=20,                   # Retrieve 20 candidates
    reranker=CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n=5,                # Return top 5 after reranking
    ),
)

results = retriever.search("enterprise refund policy", vector_store=store)
for r in results:
    print(f"[{r.score:.3f}] {r.document}: {r.text[:100]}...")
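The `dense_weight`/`sparse_weight` split amounts to a weighted fusion of the two ranked lists. A sketch of one common approach, assuming both score sets have been min-max normalized per query (the library may instead use something like reciprocal rank fusion):

```python
# Weighted score fusion for hybrid search, assuming scores in [0, 1].
# Chunks missing from one list contribute 0 from that side.
def fuse_scores(dense, sparse, dense_weight=0.7, sparse_weight=0.3):
    """dense/sparse: dicts mapping chunk_id -> normalized score."""
    ids = set(dense) | set(sparse)
    fused = {
        cid: dense_weight * dense.get(cid, 0.0) + sparse_weight * sparse.get(cid, 0.0)
        for cid in ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A chunk that scores well on both signals outranks one that excels on only one, which is exactly why hybrid search catches keyword-exact matches dense search misses.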

RAG Evaluation

from rag_pipeline.evaluation import RAGEvaluator

evaluator = RAGEvaluator(
    test_dataset="eval_data/qa_pairs.jsonl",  # Questions with ground truth answers
    metrics=["retrieval_precision", "answer_faithfulness", "answer_correctness"],
    judge_model="gpt-4o-mini",
)

scores = evaluator.evaluate(pipeline)
print(f"Retrieval Precision@5: {scores.retrieval_precision:.2%}")
print(f"Answer Faithfulness:   {scores.answer_faithfulness:.2%}")
print(f"Answer Correctness:    {scores.answer_correctness:.2%}")
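Of the three metrics, retrieval precision@k is the simplest to compute yourself: the fraction of the top-k retrieved chunks that appear in the ground-truth relevant set. A hedged sketch (the evaluator's exact definition may differ):

```python
# Precision@k: what fraction of the top-k retrieved chunks are relevant?
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    top = retrieved_ids[:k]
    if not top:
        return 0.0                      # Nothing retrieved counts as zero precision
    relevant = set(relevant_ids)
    hits = sum(1 for cid in top if cid in relevant)
    return hits / len(top)
```

Faithfulness and correctness, by contrast, need an LLM judge, which is why the evaluator takes a `judge_model`.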

Metadata Filtering

# Only search within specific document types or date ranges
result = pipeline.query(
    "What changed in the Q3 update?",
    filters={
        "document_type": "changelog",
        "date": {"$gte": "2025-07-01"},
        "department": "engineering",
    },
    top_k=5,
)
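The filter syntax pairs exact-match conditions with operators like `$gte`. To show how such a filter is evaluated against chunk metadata, here is a toy matcher supporting just those two forms (the real operator set depends on the vector store backend):

```python
# Toy metadata filter matcher: exact match for plain values,
# plus a {"$gte": ...} operator. Illustrative, not the backend's logic.
def matches(meta, filters):
    for key, cond in filters.items():
        if isinstance(cond, dict):
            if "$gte" in cond and meta.get(key, "") < cond["$gte"]:
                return False            # Below the lower bound
        elif meta.get(key) != cond:
            return False                # Exact-match condition failed
    return True
```

Note that `$gte` on ISO-8601 date strings works via plain string comparison, which is why the date format matters.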

Configuration

# rag_config.yaml
ingestion:
  supported_formats: ["pdf", "md", "txt", "html", "docx"]
  metadata_extraction: true
  skip_existing: true            # Don't re-ingest unchanged docs
  watch_directory: false         # Auto-ingest new files

chunking:
  method: "recursive"            # fixed | semantic | recursive
  chunk_size: 500                # Tokens
  chunk_overlap: 50
  separators: ["\n\n", "\n", ". ", " "]
  include_metadata: true         # Attach source doc metadata to chunks

embedding:
  model: "text-embedding-3-small"
  batch_size: 100
  rate_limit_rpm: 3000
  dimensions: 1536

vector_store:
  backend: "chromadb"            # chromadb | pinecone | weaviate | pgvector | faiss
  collection: "company_docs"
  persist_dir: "./vector_db"

retrieval:
  method: "hybrid"               # dense | sparse | hybrid
  dense_weight: 0.7
  sparse_weight: 0.3
  top_k: 20
  reranker:
    enabled: true
    model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    top_n: 5

generation:
  model: "gpt-4o"
  temperature: 0.1
  max_tokens: 1000
  system_prompt: |
    Answer the question based ONLY on the provided context.
    If the context doesn't contain the answer, say "I don't have
    enough information to answer that." Always cite your sources.
  include_citations: true

evaluation:
  test_dataset: "eval_data/qa_pairs.jsonl"
  metrics: ["retrieval_precision", "answer_faithfulness", "answer_correctness"]
  judge_model: "gpt-4o-mini"
  sample_size: 100

Best Practices

  1. Chunk size matters more than you think — Too small (< 200 tokens) loses context. Too large (> 1000 tokens) dilutes relevance. Start at 500 and tune with eval.
  2. Always use overlap — 10-20% overlap prevents information loss at chunk boundaries.
  3. Hybrid search beats dense-only — Adding BM25 catches keyword-exact matches that embedding models miss (proper nouns, error codes, IDs).
  4. Reranking is worth the latency — A cross-encoder reranker on your top-20 results dramatically improves the top-5 quality.
  5. Include metadata in chunks — Prepend document title, section headers, and dates to each chunk. The LLM needs this context.
  6. Evaluate on real questions — Build a test set of 100+ real user questions with ground truth answers. Run it after every pipeline change.
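Practice 5 is easy to implement as a preprocessing step. One way to do it, with made-up metadata field names for illustration:

```python
# Prepend source metadata to a chunk before embedding, so the chunk
# carries its own context. Field names here are assumptions.
def contextualize_chunk(chunk_text, meta):
    header = f"[{meta['title']} | {meta['section']} | {meta['date']}]"
    return f"{header}\n{chunk_text}"
```

Because the header is part of the embedded text, a query like "refund policy" can match a chunk whose body never repeats the word "policy" but whose section header does.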

Troubleshooting

Problem: Retrieved chunks are irrelevant to the query.
Cause: Chunk size too large, or a weak embedding model.
Fix: Reduce chunk size to 300-500 tokens; try text-embedding-3-large for better retrieval.

Problem: Answers hallucinate despite relevant context.
Cause: System prompt doesn't enforce grounding.
Fix: Add an "ONLY use the provided context" constraint and enable faithfulness evaluation.

Problem: Ingestion is slow on large document sets.
Cause: Embedding API rate limits.
Fix: Increase batch_size, use a local embedding model, or set skip_existing: true.

Problem: Duplicate chunks from overlapping documents.
Cause: Same content in multiple source files.
Fix: Enable deduplication with dedup_threshold: 0.95 in the chunking config.
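The similarity-threshold deduplication mentioned in the last row can be sketched as a greedy filter over chunk embeddings: keep a chunk only if it is below the threshold against everything already kept. This is a pure-Python illustration of the idea, not the library's implementation (which would use vectorized math and an index).

```python
# Greedy near-duplicate filtering by cosine similarity.
# O(n^2) and pure Python; fine for illustration, slow at scale.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup_chunks(chunks, embeddings, threshold=0.95):
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        # Keep the chunk only if it isn't a near-duplicate of any kept chunk
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept
```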

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete RAG Pipeline Framework with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

