Building a RAG Pipeline From Scratch With LangChain + Pinecone + Claude: A Real Implementation
Most RAG tutorials use a 10-page PDF about Shakespeare and call it a day. You get a working demo in 20 minutes, deploy nothing, and learn the one thing that least resembles production: that RAG is easy.
It isn't. The demo is easy. Production RAG — where your retrieval actually returns the right chunks, your answers are grounded in the source, and the system doesn't hallucinate when it can't find an answer — takes deliberate engineering at every stage of the pipeline.
This is a real implementation guide. We'll build a RAG pipeline using LangChain, Pinecone, and Claude that could actually serve a client product. Every decision explained, every gotcha documented.
What you'll have at the end: A working RAG system that ingests a document corpus, chunks it intelligently, embeds it into Pinecone, retrieves with hybrid search, generates grounded answers with Claude, and evaluates itself.
Prerequisites
- Python 3.10+
- Pinecone account (free tier works for development)
- Anthropic API key
- OpenAI API key (for embeddings — we'll explain why we use OpenAI for embeddings and Anthropic for generation)
- ~2 hours
```shell
pip install langchain langchain-anthropic langchain-openai langchain-pinecone \
    pinecone-client pinecone-text python-dotenv pypdf tiktoken
```
Step 1: Document Ingestion and Chunking Strategy
Chunking is where most RAG implementations fail silently. The chunk size question — "should I use 512 tokens or 1,000?" — is the wrong question. The right question is: what is the minimum self-contained unit of meaning in my documents?
For a product FAQ document, that's a single Q&A pair. For a policy document, it's a section. For a knowledge base article, it's a paragraph. Fixed-size token chunking destroys these natural boundaries.
We use a two-pass chunking strategy:
Pass 1: Structural splitting — split at document boundaries (headers, sections) first
Pass 2: Size enforcement — only apply token limits within those structural chunks
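To make the two passes concrete before we bring in LangChain, here is a dependency-free sketch of the idea for markdown-style sources (the function names and header regex are illustrative, not a library API):

```python
import re
from typing import List

def split_structural(text: str) -> List[str]:
    """Pass 1: split at markdown-style headers so sections stay intact."""
    parts = re.split(r'(?m)^(?=#{1,3} )', text)
    return [p.strip() for p in parts if p.strip()]

def enforce_size(section: str, max_chars: int = 1600) -> List[str]:
    """Pass 2: apply size limits only within a structural section,
    preferring paragraph boundaries over hard cuts. (In this sketch, a
    single paragraph longer than max_chars passes through uncut.)"""
    if len(section) <= max_chars:
        return [section]
    chunks, current = [], ""
    for para in section.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def two_pass_chunk(text: str, max_chars: int = 1600) -> List[str]:
    return [c for s in split_structural(text) for c in enforce_size(s, max_chars)]
```

The production implementation below delegates pass 2 to LangChain's splitter, but the principle is the same: structure first, size second.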
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.schema import Document
import re
from typing import List


class SemanticChunker:
    """Chunks documents at semantic boundaries, not arbitrary token counts."""

    def __init__(self, max_chunk_tokens: int = 400, overlap_tokens: int = 50):
        # 400 tokens is our default — not 512.
        # Here's why: at 512 tokens, chunks often end mid-sentence. At 400,
        # there's buffer to complete the thought within the token limit.
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=max_chunk_tokens * 4,  # ~4 chars per token estimate
            chunk_overlap=overlap_tokens * 4,
            separators=["\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""],
            length_function=len
        )
        self.max_chunk_tokens = max_chunk_tokens

    def chunk_document(self, file_path: str, doc_metadata: dict) -> List[Document]:
        loader = PyPDFLoader(file_path)
        pages = loader.load()

        # Clean up common PDF extraction artifacts
        for page in pages:
            page.page_content = self._clean_text(page.page_content)
            page.metadata.update(doc_metadata)

        # Split into chunks
        chunks = self.splitter.split_documents(pages)

        # Add chunk index for debugging retrieval issues
        for i, chunk in enumerate(chunks):
            chunk.metadata['chunk_index'] = i
            chunk.metadata['chunk_total'] = len(chunks)

        return chunks

    def _clean_text(self, text: str) -> str:
        # Remove page headers/footers (common in policy docs)
        text = re.sub(r'Page \d+ of \d+', '', text)
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove lone single characters (OCR artifacts). Note: this also
        # strips legitimate one-letter words like "a" and "I"; narrow the
        # pattern if your corpus needs them preserved.
        text = re.sub(r'(?<![\w])\w(?![\w])', '', text)
        return text


# Usage
chunker = SemanticChunker(max_chunk_tokens=400, overlap_tokens=50)
chunks = chunker.chunk_document("knowledge_base.pdf", {"source": "knowledge_base", "version": "2026-03"})
print(f"Generated {len(chunks)} chunks from document")
```
Why 400 tokens and not 512? In our production implementations, 512-token chunks frequently end mid-sentence when the content has long paragraphs. The 400-token limit with 50-token overlap ensures context continuity without cutting thoughts short. Adjust this per your document structure — technical documentation often benefits from 300-token chunks; narrative content from 500.
Step 2: Embedding Model Selection
We use OpenAI text-embedding-3-small for embeddings, even in Claude-based systems. Why not Anthropic embeddings? Anthropic doesn't offer an embedding API. For production English-language applications, text-embedding-3-small provides excellent quality at low cost (~$0.02 per million tokens).
For multilingual use cases (Hindi, Arabic — relevant for our India/GCC client base), we switch to Cohere's embed-multilingual-v3.0.
Critical rule: never mix embedding models. Your query at retrieval time must use the same model as the documents at ingestion time. Mixing models produces semantically inconsistent similarity scores and silent retrieval failures.
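One cheap safeguard against this, sketched below as our own convention rather than an SDK feature: record the expected model and dimension in one place and fail fast on mismatch. A dimension mismatch fails loudly at query time anyway; a same-dimension model swap fails silently, which is why the model name is worth pinning alongside the dimension.

```python
# Expected embedding config for this index; keep it in one place so
# ingestion and query code can't drift apart. (Our convention, not an API.)
EXPECTED_EMBEDDING = {"model": "text-embedding-3-small", "dimension": 1536}

def check_embedding_config(index_dimension: int) -> None:
    """Fail fast if the index dimension doesn't match the embedding model."""
    expected = EXPECTED_EMBEDDING["dimension"]
    if index_dimension != expected:
        raise ValueError(
            f"Index dimension {index_dimension} != {expected} expected for "
            f"{EXPECTED_EMBEDDING['model']}; the index was likely built with "
            "a different embedding model."
        )
```

Call it with the dimension reported by `index.describe_index_stats()` before serving queries.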
```python
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize embedding model
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Initialize Pinecone
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
INDEX_NAME = "rag-knowledge-base"

# Create index if it doesn't exist
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,  # text-embedding-3-small dimension
        metric="dotproduct",  # required for the hybrid (sparse-dense) queries in Step 4
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    print(f"Created Pinecone index: {INDEX_NAME}")

index = pc.Index(INDEX_NAME)
```
Step 3: Ingestion with Metadata Filtering
Metadata in Pinecone is how you scope queries. If your knowledge base has multiple document types — product FAQs, return policies, shipping info — you can filter at query time to only retrieve from the relevant subset.
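Pinecone filters use a MongoDB-style operator syntax (`$eq`, `$in`, `$gte`, `$and`, and friends). A small builder for the common cases — the helper itself is our own convenience, only the operator syntax is Pinecone's:

```python
def make_filter(doc_types=None, min_version=None):
    """Build a Pinecone metadata filter dict for the common cases."""
    clauses = []
    if doc_types:
        # Match any of several document types
        clauses.append({"doc_type": {"$in": list(doc_types)}})
    if min_version:
        # String comparison works here because versions are zero-padded dates
        clauses.append({"version": {"$gte": min_version}})
    if not clauses:
        return None  # no filter: search the whole index
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

Pass the result as the `filter` argument at query time to scope retrieval to the relevant subset.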
```python
from langchain_pinecone import PineconeVectorStore
from tqdm import tqdm

def ingest_documents(chunks: List[Document], batch_size: int = 100) -> PineconeVectorStore:
    """Ingest document chunks into Pinecone with progress tracking."""
    print(f"Ingesting {len(chunks)} chunks into Pinecone...")

    # Ensure all metadata values are Pinecone-compatible types
    # (strings, numbers, booleans — no lists of complex objects)
    for chunk in chunks:
        chunk.metadata = {
            k: str(v) if not isinstance(v, (str, int, float, bool)) else v
            for k, v in chunk.metadata.items()
        }

    # Bind to the existing index, then upsert in batches to avoid
    # embedding-API rate limits
    vectorstore = PineconeVectorStore(
        index_name=INDEX_NAME,
        embedding=embedding_model,
        pinecone_api_key=os.getenv("PINECONE_API_KEY")
    )
    for i in tqdm(range(0, len(chunks), batch_size), desc="Ingesting"):
        vectorstore.add_documents(chunks[i:i + batch_size])

    print(f"Ingestion complete. Index stats: {index.describe_index_stats()}")
    return vectorstore

vectorstore = ingest_documents(chunks)
```
Step 4: Hybrid Search Retrieval
This is the step that separates production RAG from tutorial RAG. Dense vector search alone has a known weakness: it matches semantic meaning but can miss exact keyword matches. If a user asks "what is the policy for order cancellation within 2 hours" and your document says "2-hour cancellation window," pure semantic search may not rank that chunk highest.
Hybrid search combines dense vectors (semantic) with sparse BM25 (keyword). The alpha parameter controls the blend.
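Concretely, the blend is a convex combination applied client-side: the dense vector is scaled by alpha and the sparse values by (1 - alpha) before the query is sent. A minimal sketch of that scaling (the helper name is ours):

```python
def hybrid_scale(dense: list, sparse: dict, alpha: float):
    """Convex combination: alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_sparse = {
        "indices": sparse["indices"],  # token indices are unchanged
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse
```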
```python
from pinecone_text.sparse import BM25Encoder

class HybridRetriever:
    def __init__(self, vectorstore, index, embedding_model, bm25_path: str = None):
        self.vectorstore = vectorstore
        self.index = index
        self.embedding_model = embedding_model

        # Load or initialize BM25
        if bm25_path and os.path.exists(bm25_path):
            self.bm25 = BM25Encoder().load(bm25_path)
        else:
            self.bm25 = BM25Encoder.default()  # default params until fit_bm25() runs

    def fit_bm25(self, corpus: List[str], save_path: str = "bm25_params.json"):
        """Fit BM25 on your document corpus. Do this once during ingestion."""
        self.bm25.fit(corpus)
        self.bm25.dump(save_path)
        print(f"BM25 fitted on {len(corpus)} documents, saved to {save_path}")

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        alpha: float = 0.5,
        metadata_filter: dict = None
    ) -> List[dict]:
        """
        Hybrid search: alpha=1.0 is pure dense, alpha=0.0 is pure sparse.
        We start at 0.5 and tune based on query type.
        """
        # Dense query vector
        dense_vector = self.embedding_model.embed_query(query)
        # Sparse query vector
        sparse_vector = self.bm25.encode_queries(query)

        # Pinecone's query API has no alpha parameter; the blend is applied
        # client-side by scaling both vectors before the query (this is why
        # the index must use the dotproduct metric)
        scaled_dense = [v * alpha for v in dense_vector]
        scaled_sparse = {
            "indices": sparse_vector["indices"],
            "values": [v * (1 - alpha) for v in sparse_vector["values"]]
        }

        query_params = {
            "vector": scaled_dense,
            "sparse_vector": scaled_sparse,
            "top_k": top_k,
            "include_metadata": True
        }
        if metadata_filter:
            query_params["filter"] = metadata_filter

        results = self.index.query(**query_params)

        return [
            {
                "text": match.metadata.get("text", ""),
                "score": match.score,
                "metadata": match.metadata,
                "id": match.id
            }
            for match in results.matches
        ]

# Fit BM25 on corpus text (do this once)
corpus_texts = [chunk.page_content for chunk in chunks]
retriever = HybridRetriever(vectorstore, index, embedding_model)
retriever.fit_bm25(corpus_texts, save_path="bm25_params.json")
```
Step 5: The Generation Prompt — Minimising Hallucination
The generation prompt is where most developers underinvest. The default "here is context, answer the question" pattern works for demos. For production, you need explicit grounding instructions and a defined behaviour when the answer isn't in the retrieved context.
```python
from langchain_anthropic import ChatAnthropic
from langchain.schema import HumanMessage, SystemMessage

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=1024,
    temperature=0.1  # Low temperature for factual retrieval tasks
)

SYSTEM_PROMPT = """You are a helpful assistant that answers questions based strictly on the provided context.

RULES:
1. ONLY answer based on the context provided. Do not use your general knowledge.
2. If the context does not contain the answer, respond: "I don't have information about that in the knowledge base. Please contact support for this query."
3. If you're partially confident, state what the context says and flag what's uncertain.
4. Always cite which part of the context supports your answer (e.g., "According to the shipping policy section...").
5. Be concise. Answer in 2-4 sentences unless the question requires more detail.

Never fabricate information, dates, prices, or policies."""

def generate_answer(
    query: str,
    retrieved_chunks: List[dict],
    max_context_chunks: int = 4
) -> dict:
    """Generate a grounded answer using retrieved context."""

    # Limit context to top N chunks to avoid dilution.
    # More chunks ≠ better answers. 3-5 focused chunks outperform 10 scattered ones.
    top_chunks = retrieved_chunks[:max_context_chunks]

    # Format context with source attribution
    context_blocks = []
    for i, chunk in enumerate(top_chunks, 1):
        source = chunk['metadata'].get('source', 'Unknown')
        context_blocks.append(f"[Context {i} — Source: {source}]\n{chunk['text']}")
    context_str = "\n\n".join(context_blocks)

    messages = [
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=f"CONTEXT:\n{context_str}\n\nQUESTION: {query}")
    ]

    response = llm.invoke(messages)

    return {
        "answer": response.content,
        "sources": [c['metadata'] for c in top_chunks],
        "retrieval_scores": [c['score'] for c in top_chunks]
    }
```
Step 6: Evaluation — How Do You Know If Your RAG Is Working?
This is the step 80% of RAG builders skip entirely. A RAG system without evaluation is a black box. You can't improve what you can't measure.
Three metrics we track on every client RAG project:
1. Retrieval Recall@k — Does the relevant document appear in the top k results?
2. Answer Faithfulness — Is the answer supported by the retrieved context? (Detects hallucination)
3. Answer Relevance — Does the answer actually address the question?
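Faithfulness and relevance lean on an LLM judge below; Recall@k only needs labels. For each test question, record which chunk ID(s) count as relevant, then check the top-k retrieved IDs. A dependency-free sketch (function and field names are our own):

```python
def recall_at_k(test_cases: list, retrieved_ids_per_case: list, k: int = 5) -> float:
    """Fraction of test cases where at least one relevant chunk ID
    appears in the top-k retrieved IDs."""
    if not test_cases:
        return 0.0
    hits = 0
    for case, retrieved_ids in zip(test_cases, retrieved_ids_per_case):
        relevant = set(case["relevant_chunk_ids"])
        if relevant & set(retrieved_ids[:k]):  # any overlap counts as a hit
            hits += 1
    return hits / len(test_cases)
```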
```python
from anthropic import Anthropic
import json

client = Anthropic()

def evaluate_faithfulness(question: str, answer: str, context: str) -> dict:
    """
    Ask Claude to judge whether the answer is supported by the context.
    This is the LLM-as-judge pattern — imperfect but scalable.
    """
    eval_prompt = f"""You are evaluating whether an AI answer is faithful to the provided context.

CONTEXT:
{context}

QUESTION: {question}

ANSWER: {answer}

Evaluate on a scale of 1-5:
- 5: Fully supported by context, no unsupported claims
- 3: Mostly supported, minor unsupported details
- 1: Contains claims not in context (hallucination)

Return ONLY a JSON object: {{"score": <1-5>, "reason": "<one sentence>", "hallucinated_claims": ["<claim>"]}}"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{"role": "user", "content": eval_prompt}]
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"score": None, "error": "parse_failed", "raw": response.content[0].text}

def run_evaluation_suite(test_cases: List[dict], retriever: HybridRetriever) -> dict:
    """Run evaluation on a test set. Build this before shipping to production."""
    results = []
    for test in test_cases:
        retrieved = retriever.retrieve(test['question'], top_k=5)
        answer_result = generate_answer(test['question'], retrieved)
        context_str = "\n".join([c['text'] for c in retrieved[:4]])

        faithfulness = evaluate_faithfulness(
            test['question'],
            answer_result['answer'],
            context_str
        )

        results.append({
            "question": test['question'],
            "expected": test.get('expected_answer'),
            "actual": answer_result['answer'],
            "top_retrieval_score": retrieved[0]['score'] if retrieved else 0,
            "faithfulness_score": faithfulness.get('score'),
            "hallucinated_claims": faithfulness.get('hallucinated_claims', [])
        })

    # Average only over cases where the judge returned a parseable score
    valid_scores = [r['faithfulness_score'] for r in results if r['faithfulness_score'] is not None]
    avg_faithfulness = sum(valid_scores) / len(valid_scores) if valid_scores else 0.0
    avg_retrieval = sum(r['top_retrieval_score'] for r in results) / len(results)

    return {
        "total_tests": len(results),
        "avg_faithfulness": round(avg_faithfulness, 2),
        "avg_retrieval_score": round(avg_retrieval, 3),
        "cases": results
    }
```
The One Mistake That Causes 80% of RAG Failures
After building RAG pipelines across multiple client projects, the failure that appears most often isn't chunking, embedding choice, or prompt design. It's this: developers blame the LLM when the retrieval is broken.
The symptoms look like the model is hallucinating or not following instructions. The actual problem is that the wrong chunks are being retrieved — the LLM is doing its best with bad context and producing a bad answer. You can spend weeks tuning your generation prompt while the retrieval is returning irrelevant chunks and nothing will improve.
Before blaming generation, always check retrieval first:
- Run your test queries and print the retrieved chunks
- Ask: are these chunks actually relevant to the question?
- If no: fix chunking, improve metadata filtering, tune alpha
- If yes, but answers are still wrong: now look at the generation prompt
This separation of concerns — retrieval quality as an independent metric from generation quality — is the mindset shift that makes RAG systems actually work.
Full Pipeline: Putting It Together
```python
class RAGPipeline:
    def __init__(self, index_name: str, alpha: float = 0.5):
        self.chunker = SemanticChunker()
        self.retriever = None  # Initialized after ingestion
        self.index_name = index_name
        self.alpha = alpha

    def ingest(self, file_paths: List[str], doc_metadata_list: List[dict]):
        all_chunks = []
        for path, metadata in zip(file_paths, doc_metadata_list):
            chunks = self.chunker.chunk_document(path, metadata)
            all_chunks.extend(chunks)

        vectorstore = ingest_documents(all_chunks)
        self.retriever = HybridRetriever(vectorstore, index, embedding_model)
        self.retriever.fit_bm25([c.page_content for c in all_chunks])
        print(f"Pipeline ready. {len(all_chunks)} chunks indexed.")

    def query(self, question: str, metadata_filter: dict = None) -> dict:
        if not self.retriever:
            raise ValueError("Pipeline not initialized. Call ingest() first.")

        retrieved = self.retriever.retrieve(
            question, top_k=5, alpha=self.alpha,
            metadata_filter=metadata_filter
        )
        return generate_answer(question, retrieved)

# Usage
pipeline = RAGPipeline(index_name="rag-knowledge-base")
pipeline.ingest(
    file_paths=["help_center.pdf", "return_policy.pdf", "shipping_guide.pdf"],
    doc_metadata_list=[
        {"doc_type": "help_center"},
        {"doc_type": "return_policy"},
        {"doc_type": "shipping"}
    ]
)

result = pipeline.query(
    "What is the return window for damaged items?",
    metadata_filter={"doc_type": "return_policy"}
)
print(result['answer'])
```
What This Costs in Production
For a knowledge base of ~500 pages serving 1,000 queries/day:
- Pinecone serverless: ~$5-15/month
- OpenAI embeddings (ingestion, one-time): ~$0.50 for 500 pages
- Claude Sonnet API (generation, 1,000 queries/day): ~$15-30/month
- Total: ~$20-45/month for a production RAG system
This is a core deliverable in our AI automation services. We've built RAG pipelines as part of support automation, internal knowledge management, and product recommendation systems. The architecture above is battle-tested across production deployments — not a tutorial construct.
If you're evaluating whether RAG is the right architecture for your project, see how we approach AI app design or read the architectural comparison between RAG, fine-tuning, and context stuffing.
Frequently Asked Questions
What's the difference between this and just using a ChatPDF-style tool?
ChatPDF and similar tools are black boxes — you can't control chunking, retrieval logic, filtering, or evaluation. A custom pipeline gives you full control over every decision: chunk size, embedding model, retrieval alpha, metadata filtering, grounding instructions, and output format. For a client product, that control is not optional.
Can I use this with a local LLM instead of Claude?
Yes. Replace ChatAnthropic with ChatOllama or any LangChain-compatible LLM. For the evaluator in Step 6, you need a capable model — local 7B models often produce unreliable faithfulness scores. We recommend keeping Claude for evaluation even if you switch the generation model.
Why use LangChain at all? Could I build this without it?
You can. LangChain adds abstraction overhead. For a simple pipeline, raw Anthropic + Pinecone SDK is cleaner. LangChain earns its place when you need LCEL chains, callbacks for logging, or multiple retrieval strategies in one pipeline. Use it if you need its features; skip it for simpler implementations.
How do I handle documents that update frequently?
Don't re-ingest the entire corpus. Use Pinecone's delete + upsert with a stable document ID scheme. When a document updates, delete its chunks by ID filter and re-ingest. Tag every chunk with doc_version in metadata so you can audit which version answered which query.
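A sketch of such an ID scheme; the `#` separator and helper names are our own convention, and you should verify the prefix-listing call against the current Pinecone SDK for serverless indexes:

```python
def chunk_id(doc_id: str, version: str, chunk_index: int) -> str:
    """Stable, self-describing vector ID: doc, version, position."""
    return f"{doc_id}#{version}#{chunk_index}"

def doc_prefix(doc_id: str) -> str:
    """Prefix shared by every chunk of a document, for listing/deleting."""
    return f"{doc_id}#"

# At update time, roughly (serverless indexes support listing IDs by
# prefix; pod-based indexes can delete by metadata filter instead):
#   for ids in index.list(prefix=doc_prefix("return_policy")):
#       index.delete(ids=ids)
#   ...then re-ingest the new chunks under fresh version IDs.
```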
What chunk size should I use for my documents?
Test it. Generate 5-10 representative test queries, run retrieval at chunk sizes of 200, 400, 600 tokens, and measure recall@5 for each. The chunk size that returns the relevant document in the top 5 most often is the right size for your corpus. There is no universal answer — anyone who says otherwise hasn't built production RAG.
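The sweep itself is a few lines once retrieval is parameterised over chunk size. A sketch with our own function names, where `build_and_retrieve` wraps re-chunking, re-ingesting, and querying for a given size:

```python
def sweep_chunk_sizes(sizes, build_and_retrieve, test_cases, k: int = 5):
    """build_and_retrieve(size, question) -> list of retrieved chunk IDs.
    Returns {size: recall@k} so you can pick the best-performing size."""
    results = {}
    for size in sizes:
        hits = 0
        for case in test_cases:
            retrieved = build_and_retrieve(size, case["question"])[:k]
            if set(case["relevant_chunk_ids"]) & set(retrieved):
                hits += 1
        results[size] = hits / len(test_cases) if test_cases else 0.0
    return results
```

Re-ingesting the corpus per size costs a few cents in embeddings and saves weeks of guessing.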
How do I prevent the RAG from making up information when the answer isn't in the knowledge base?
The system prompt in Step 5 handles this: the model is instructed to respond with a defined fallback rather than generating from its general knowledge. Test this explicitly by asking questions you know aren't in the corpus. If the model answers them confidently, tighten the grounding instruction or reduce the temperature.
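That explicit test can be automated. A sketch, where the marker string matches the fallback wording in our Step 5 system prompt; adjust it to your own prompt:

```python
# Lowercased fragment of the mandated fallback response from Step 5
FALLBACK_MARKER = "don't have information about that in the knowledge base"

def check_fallbacks(answer_fn, out_of_corpus_questions) -> list:
    """Returns the questions the system answered instead of refusing.
    An empty list means grounding held on every negative test."""
    failures = []
    for q in out_of_corpus_questions:
        if FALLBACK_MARKER not in answer_fn(q).lower():
            failures.append(q)
    return failures
```

Run it on every prompt change; grounding regressions are otherwise invisible until a user hits one.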
Rishabh Sethia is Founder & CEO of Innovatrix Infotech. Former SSE / Head of Engineering. DPIIT Recognized Startup. Shopify Partner. AWS Partner.