RAG Pipeline Framework
Retrieval-Augmented Generation is the most practical way to give LLMs access to your private data — but getting it right is harder than the tutorials suggest. This framework provides the complete pipeline: document ingestion, intelligent chunking, embedding generation, vector store integration, retrieval with reranking, and answer generation with source citations. It also ships with evaluation tools to measure whether your RAG system actually returns correct answers.
Key Features
- Document Ingestion — Load PDFs, Markdown, HTML, Word docs, and plain text with automatic format detection and metadata extraction
- Chunking Strategies — Fixed-size, semantic, recursive, and document-structure-aware chunking with configurable overlap
- Embedding Generation — Support for OpenAI, Cohere, and local embedding models with automatic batching and rate limiting
- Vector Store Integration — Pluggable backends for ChromaDB, Pinecone, Weaviate, pgvector, and FAISS with unified query interface
- Hybrid Search — Combine dense vector search with sparse keyword search (BM25) for higher recall
- Reranking — Cross-encoder reranking to boost precision after initial retrieval
- Citation Generation — Automatically include source references in generated answers with page numbers and document titles
- RAG Evaluation — Measure retrieval precision, answer faithfulness, and end-to-end accuracy
Quick Start
from rag_pipeline import RAGPipeline, ChunkingStrategy, VectorStore
# 1. Configure the pipeline
pipeline = RAGPipeline(
    chunking=ChunkingStrategy(
        method="recursive",
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],
    ),
    embedding_model="text-embedding-3-small",
    vector_store=VectorStore(
        backend="chromadb",
        collection="company_docs",
        persist_dir="./vector_db",
    ),
    llm_model="gpt-4o",
)
# 2. Ingest documents
pipeline.ingest("documents/", glob="**/*.{pdf,md,txt}")
print(f"Ingested {pipeline.document_count} documents, {pipeline.chunk_count} chunks")
# 3. Query with RAG
result = pipeline.query("What is our refund policy for enterprise customers?")
print(result.answer)
print(f"\nSources ({len(result.sources)}):")
for src in result.sources:
    print(f" - {src.document}: p.{src.page} (relevance: {src.score:.2f})")
Architecture
Documents (PDF, MD, HTML, TXT)
│
▼
┌─────────────────┐
│ Document Loader│──── Parse + extract text + metadata
└────────┬────────┘
▼
┌─────────────────┐
│ Chunker │──── Split into retrieval-sized chunks
└────────┬────────┘
▼
┌─────────────────┐
│ Embedder │──── Generate vector embeddings
└────────┬────────┘
▼
┌─────────────────┐
│ Vector Store │──── Store embeddings + metadata
└────────┬────────┘
│
Query Time:
│
User Query ──▶ Embed ──▶ Vector Search ──▶ Rerank ──▶ LLM + Context ──▶ Answer
│ │
Top K chunks Source citations
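The query-time path in the diagram boils down to: embed the query, score it against stored chunk embeddings, and hand the top hits to the LLM. The toy sketch below illustrates just the retrieval step — `cosine` and the in-memory `index` stand in for the real embedder and vector store, and the text/vector values are made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, index, top_k=3):
    # index: list of (chunk_text, embedding) pairs
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return scored[:top_k]

index = [
    ("Refunds for enterprise plans...", [0.9, 0.1, 0.0]),
    ("Office holiday schedule...",      [0.0, 0.2, 0.9]),
    ("Refund processing takes 5 days",  [0.8, 0.3, 0.1]),
]
hits = retrieve([1.0, 0.2, 0.0], index, top_k=2)
```

In the real pipeline, the retrieved chunks (plus their metadata) are then packed into the LLM prompt, and the chunk metadata becomes the source citations.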
Usage Examples
Chunking Strategy Comparison
from rag_pipeline.chunking import FixedSizeChunker, SemanticChunker, RecursiveChunker
# Fixed-size: Predictable chunk sizes, may break mid-sentence
fixed = FixedSizeChunker(chunk_size=500, overlap=50)
# Semantic: Chunks at natural topic boundaries using embeddings
semantic = SemanticChunker(
    embedding_model="text-embedding-3-small",
    breakpoint_threshold=0.3,  # Lower = more chunks
    min_chunk_size=100,
    max_chunk_size=1000,
)

# Recursive: Tries multiple separators in order of preference
recursive = RecursiveChunker(
    chunk_size=500,
    overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # Paragraphs → lines → sentences → words
)
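The recursive strategy is worth seeing spelled out: split on the coarsest separator first, greedily pack the pieces up to the size limit, and only fall back to finer separators when a single piece is still too big. This is a simplified sketch of the idea, not the actual RecursiveChunker — overlap and token-based counting are omitted:

```python
def recursive_chunk(text, chunk_size, separators):
    # Text already fits, or no coarser separators left to try
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # Greedily pack parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # A single part is still too big: recurse with finer separators
            chunks.extend(recursive_chunk(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

recursive_chunk("aa bb cc dd", chunk_size=5, separators=[" "])
# → ['aa bb', 'cc dd']
```

The payoff of this ordering is that paragraph and sentence boundaries are preserved whenever possible, and mid-word breaks only happen as a last resort.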
Hybrid Search with Reranking
from rag_pipeline.retrieval import HybridRetriever, CrossEncoderReranker
retriever = HybridRetriever(
    dense_weight=0.7,   # 70% vector similarity
    sparse_weight=0.3,  # 30% BM25 keyword match
    top_k=20,           # Retrieve 20 candidates
    reranker=CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n=5,  # Return top 5 after reranking
    ),
)
results = retriever.search("enterprise refund policy", vector_store=store)
for r in results:
print(f"[{r.score:.3f}] {r.document} — {r.text[:100]}...")
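Dense and sparse scores live on incompatible scales (cosine similarity vs. raw BM25), so they have to be normalized before the weighted sum. One common approach is min-max normalization plus a weighted combination — a sketch of that idea, not necessarily how HybridRetriever fuses internally (reciprocal rank fusion is another popular choice):

```python
def fuse(dense, sparse, dense_weight=0.7, sparse_weight=0.3):
    # dense, sparse: dicts of doc_id -> raw score on incompatible scales
    def minmax(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    dn, sn = minmax(dense), minmax(sparse)
    fused = {
        doc: dense_weight * dn.get(doc, 0.0) + sparse_weight * sn.get(doc, 0.0)
        for doc in set(dn) | set(sn)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse(dense={"a": 0.91, "b": 0.55}, sparse={"b": 12.0, "c": 3.0})
```

Note that a document found by only one retriever (like "c" here) still ranks, just with a zero contribution from the other channel — that is exactly how hybrid search lifts recall.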
RAG Evaluation
from rag_pipeline.evaluation import RAGEvaluator
evaluator = RAGEvaluator(
    test_dataset="eval_data/qa_pairs.jsonl",  # Questions with ground truth answers
    metrics=["retrieval_precision", "answer_faithfulness", "answer_correctness"],
    judge_model="gpt-4o-mini",
)
scores = evaluator.evaluate(pipeline)
print(f"Retrieval Precision@5: {scores.retrieval_precision:.2%}")
print(f"Answer Faithfulness: {scores.answer_faithfulness:.2%}")
print(f"Answer Correctness: {scores.answer_correctness:.2%}")
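Of the three metrics, retrieval precision@k is the only one that needs no LLM judge — it is a direct computation over labeled data. A minimal sketch, assuming the evaluator compares retrieved chunk IDs against a labeled set of relevant IDs per question:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved chunks that are actually relevant
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant_ids) / len(top)

score = precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c1", "c3", "c9"})
# → 0.4
```

Faithfulness and correctness, by contrast, are judged by the judge_model, which is why those scores can drift between judge versions while precision@k stays stable.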
Metadata Filtering
# Only search within specific document types or date ranges
result = pipeline.query(
    "What changed in the Q3 update?",
    filters={
        "document_type": "changelog",
        "date": {"$gte": "2025-07-01"},
        "department": "engineering",
    },
    top_k=5,
)
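Operator-style filters like the one above reduce to a predicate evaluated against each chunk's metadata before (or during) vector search. A minimal sketch of that predicate, assuming ISO-format date strings so plain string comparison orders dates correctly:

```python
def matches(metadata, filters):
    for key, cond in filters.items():
        value = metadata.get(key)
        if isinstance(cond, dict):
            # Operator conditions, e.g. {"$gte": "2025-07-01"}
            for op, bound in cond.items():
                if value is None:
                    return False
                if op == "$gte" and value < bound:
                    return False
                if op == "$lte" and value > bound:
                    return False
        elif value != cond:
            # Plain equality condition
            return False
    return True

meta = {"document_type": "changelog", "date": "2025-08-15", "department": "engineering"}
matches(meta, {"document_type": "changelog", "date": {"$gte": "2025-07-01"}})
# → True
```

Backends differ in where this filtering runs — some apply it pre-search inside the index, others post-filter results — which can affect how large a top_k you need to request.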
Configuration
# rag_config.yaml
ingestion:
  supported_formats: ["pdf", "md", "txt", "html", "docx"]
  metadata_extraction: true
  skip_existing: true       # Don't re-ingest unchanged docs
  watch_directory: false    # Auto-ingest new files

chunking:
  method: "recursive"       # fixed | semantic | recursive
  chunk_size: 500           # Tokens
  chunk_overlap: 50
  separators: ["\n\n", "\n", ". ", " "]
  include_metadata: true    # Attach source doc metadata to chunks

embedding:
  model: "text-embedding-3-small"
  batch_size: 100
  rate_limit_rpm: 3000
  dimensions: 1536

vector_store:
  backend: "chromadb"       # chromadb | pinecone | weaviate | pgvector | faiss
  collection: "company_docs"
  persist_dir: "./vector_db"

retrieval:
  method: "hybrid"          # dense | sparse | hybrid
  dense_weight: 0.7
  sparse_weight: 0.3
  top_k: 20
  reranker:
    enabled: true
    model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    top_n: 5

generation:
  model: "gpt-4o"
  temperature: 0.1
  max_tokens: 1000
  system_prompt: |
    Answer the question based ONLY on the provided context.
    If the context doesn't contain the answer, say "I don't have
    enough information to answer that." Always cite your sources.
  include_citations: true

evaluation:
  test_dataset: "eval_data/qa_pairs.jsonl"
  metrics: ["retrieval_precision", "answer_faithfulness", "answer_correctness"]
  judge_model: "gpt-4o-mini"
  sample_size: 100
Best Practices
- Chunk size matters more than you think — Too small (< 200 tokens) loses context. Too large (> 1000 tokens) dilutes relevance. Start at 500 and tune with eval.
- Always use overlap — 10-20% overlap prevents information loss at chunk boundaries.
- Hybrid search beats dense-only — Adding BM25 catches keyword-exact matches that embedding models miss (proper nouns, error codes, IDs).
- Reranking is worth the latency — A cross-encoder reranker on your top-20 results dramatically improves the top-5 quality.
- Include metadata in chunks — Prepend document title, section headers, and dates to each chunk. The LLM needs this context.
- Evaluate on real questions — Build a test set of 100+ real user questions with ground truth answers. Run it after every pipeline change.
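The overlap rule is easiest to see at the token level: with a stride of size - overlap, the last tokens of each chunk reappear at the start of the next, so a sentence straddling a boundary survives in at least one chunk. A toy sketch (integers stand in for tokens; the library's chunkers count real tokens):

```python
def fixed_chunks(tokens, size=500, overlap=50):
    # Stride by (size - overlap) so consecutive chunks share `overlap` tokens
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

fixed_chunks(list(range(10)), size=4, overlap=1)
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Here overlap=1 on size=4 is the same 25%-ish ratio you would avoid in practice; the 10-20% guidance above corresponds to overlap of 50-100 on 500-token chunks.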
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Retrieved chunks are irrelevant to the query | Chunk size too large or embedding model weak | Reduce chunk size to 300-500; try text-embedding-3-large for better retrieval |
| Answers hallucinate despite relevant context | System prompt doesn't enforce grounding | Add "ONLY use the provided context" constraint and enable faithfulness evaluation |
| Ingestion is slow on large document sets | Embedding API rate limits | Increase batch_size, use a local embedding model, or set skip_existing: true |
| Duplicate chunks from overlapping documents | Same content in multiple source files | Enable deduplication with dedup_threshold: 0.95 in chunking config |
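The dedup_threshold fix in the last row amounts to a greedy similarity filter: keep a chunk only if its embedding is below the threshold against everything already kept. A sketch under that assumption (the embeddings here are toy two-dimensional values; real ones have hundreds of dimensions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def dedup(embedded_chunks, threshold=0.95):
    # embedded_chunks: list of (text, embedding); keep a chunk only if it is
    # not near-identical to any chunk already kept
    kept = []
    for text, vec in embedded_chunks:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

unique = dedup([
    ("Refund policy v1", [1.0, 0.0]),
    ("Refund policy v1 (copy)", [0.999, 0.01]),  # near-duplicate, dropped
    ("Holiday schedule", [0.0, 1.0]),
])
# → ['Refund policy v1', 'Holiday schedule']
```

This pairwise scan is O(n²) and only suitable as an illustration; at scale you would dedupe via the vector index itself.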
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete RAG Pipeline Framework with all files, templates, and documentation for $59.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.