Ankit Sharma

Posted on Jun 30

Build Production RAG: LangChain & Pinecone Tutorial

#rag #langchain #pinecone #ai

Have you ever built an amazing AI application with a Large Language Model (LLM), only to find it confidently making up facts or struggling with information outside its training data? It's a common frustration. LLMs are powerful, but they often lack real-time, specific knowledge, leading to "hallucinations." This is where Retrieval Augmented Generation (RAG) steps in, transforming your LLM into a knowledgeable expert by giving it access to external, up-to-date information.

But moving from a simple RAG demo to a system that can handle real-world traffic, maintain accuracy, and stay cost-effective is a whole different ball game. This tutorial will guide you through building a production-ready RAG system using LangChain for orchestration and Pinecone as your high-performance vector database. You'll learn how to create a system that's not just smart, but also reliable and ready for prime time.

Introduction to Production-Ready RAG Systems

Retrieval Augmented Generation (RAG) is a technique that enhances the capabilities of Large Language Models (LLMs) by giving them access to external, up-to-date information. Instead of relying solely on what they learned during training, RAG systems first retrieve relevant documents or data from a knowledge base and then augment the LLM's prompt with this context. This helps the LLM generate more accurate, relevant, and factual responses, significantly reducing the problem of "hallucinations" where LLMs invent information.

When we talk about "production" RAG, we're thinking beyond a simple script. A production system needs to be scalable, meaning it can handle many users and large amounts of data without slowing down. It must be reliable, consistently delivering correct answers and gracefully handling errors. Accuracy is paramount, ensuring the retrieved information is truly relevant and the generated answers are correct. Finally, it needs to be cost-effective, optimizing resource usage for both computation and storage.

LangChain acts as your orchestration layer, providing a structured way to build complex LLM applications. It helps you connect different components like data loaders, text splitters, embedding models, and LLMs into a cohesive workflow. Pinecone, on the other hand, is a specialized vector database. It's designed to store and quickly search through vast amounts of high-dimensional vectors, which are numerical representations of text. This makes Pinecone an excellent choice for the retrieval part of your RAG system, especially when dealing with large knowledge bases in a production setting.

Architecting Your RAG System: Components and Flow

[IMAGE: A clear architectural diagram illustrating the flow of a RAG system with LangChain, Pinecone, and an LLM.]

Understanding the core components and how they interact is key to building any RAG system. Here's a breakdown of what you'll be working with:

Data Source: This is where your knowledge lives. It could be documents, web pages, databases, or any custom text you want your LLM to reference.
Embedding Model: This component converts your text data into numerical vectors, called "embeddings." These embeddings capture the semantic meaning of the text, allowing similar pieces of text to have similar vector representations.
Vector Database (Pinecone): This specialized database stores your text embeddings along with their original text and any associated metadata. Its primary job is to perform fast and efficient similarity searches, finding the most relevant text chunks based on a query's embedding.
Large Language Model (LLM): This is the brain that generates the final answer. It takes the user's query and the retrieved context to formulate a coherent response.
LangChain: This framework ties everything together. It helps you manage the entire RAG workflow, from loading data to orchestrating the retrieval and generation steps.

The RAG workflow typically follows these steps:

Ingestion: Your raw data is loaded, split into smaller, manageable chunks, and then converted into embeddings using an embedding model. These embeddings are then stored in Pinecone.
Retrieval: When a user asks a question, that question is also converted into an embedding. Pinecone then searches its database to find the top-k (e.g., top 3 or 5) most similar text chunks to the query.
Augmentation: The retrieved text chunks are added to the user's original question, forming an enriched prompt.
Generation: This augmented prompt is sent to the LLM, which then generates a factual and contextually relevant answer.
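
To make the four steps concrete before bringing in any external services, here is a minimal, dependency-free sketch of the same flow. It is only an illustration under loud assumptions: the `embed` function is a toy word-count stand-in for a real embedding model, the in-memory `index` list stands in for Pinecone, and the generation step is left out because it needs an LLM call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Ingestion: embed each chunk and keep (chunk, vector) pairs in memory.
chunks = [
    "LangChain is a framework for building LLM applications.",
    "Pinecone is a vector database for similarity search.",
    "The Eiffel Tower is a landmark in Paris, France.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the k most similar chunks."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment(query: str, context_chunks: list[str]) -> str:
    """Augmentation: put the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


question = "What is Pinecone?"
prompt = augment(question, retrieve(question, k=1))
print(prompt)  # Generation would send this prompt to an LLM.
```

The rest of the tutorial replaces each toy piece with a real component: OpenAI embeddings for `embed`, Pinecone for `index`, and a LangChain chain for the glue.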

Pinecone is a common choice for production vector storage because it is a managed service built for low-latency similarity search over large vector indexes. Both properties matter for real-time RAG applications, where retrieval sits on the critical path of every request.

graph TD
    A[User Query] --> B(Embed Query)
    B --> C{Pinecone Vector Database}
    C -- Similarity Search --> D[Retrieved Context Chunks]
    D --> E(Augment Prompt with Context)
    E --> F[Large Language Model (LLM)]
    F --> G[Generated Answer]
    H[Data Source] --> I(Load & Split Data)
    I --> J(Embed Data Chunks)
    J --> C

Data Ingestion: Preparing and Embedding Your Knowledge Base

[IMAGE: An image depicting documents being processed and transformed into vector embeddings.]

The first step in building your RAG system is to prepare your knowledge base. This involves loading your data, breaking it into smaller pieces, and converting those pieces into numerical representations called embeddings.

Setting Up Your Environment

Before we dive into the code, make sure you have the necessary libraries installed and your API keys configured.

pip install langchain langchain-openai langchain-community langchain-pinecone pinecone-client tiktoken

You'll need API keys for OpenAI (for embeddings and the LLM) and Pinecone. Set them as environment variables:

export OPENAI_API_KEY="your_openai_api_key"
export PINECONE_API_KEY="your_pinecone_api_key"
export PINECONE_ENVIRONMENT="your_pinecone_environment" # legacy pod-based indexes only, e.g., "gcp-starter"

Loading and Splitting Data

For this tutorial, we'll use a simple list of strings as our data source. In a real application, you might load from files, web pages, or databases using LangChain's various document loaders. Text splitting is crucial because embedding models and LLMs have token limits, and smaller, focused chunks lead to more precise retrieval. We'll use RecursiveCharacterTextSplitter, which tries a sequence of separators (paragraph breaks first, then line breaks, then spaces) so that chunks end at natural boundaries wherever possible.

import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# 1. Prepare your data
# In a real scenario, you'd load from files, URLs, etc.
# For simplicity, we'll use a list of strings.
raw_documents = [
    "The quick brown fox jumps over the lazy dog. This is a classic sentence.",
    "Artificial intelligence (AI) is rapidly transforming industries worldwide.",
    "LangChain is a framework designed to simplify the creation of applications using large language models.",
    "Pinecone is a vector database that makes it easy to build high-performance vector search applications.",
    "RAG systems combine retrieval and generation to improve LLM accuracy.",
    "Production RAG systems require scalability, reliability, and cost-effectiveness.",
    "The capital of France is Paris, a beautiful city known for its art and culture.",
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Machine learning is a subset of AI that focuses on algorithms learning from data.",
    "Deep learning is a specialized field within machine learning, often using neural networks."
]

# 2. Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
documents = text_splitter.create_documents(raw_documents)

print(f"Split {len(raw_documents)} raw documents into {len(documents)} chunks.")
# Example of a chunk:
# print(documents[0].page_content)
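
Note that every sample string above is far shorter than 500 characters, so each one ends up as a single chunk. To see what chunk_size and chunk_overlap actually do on longer text, here is a simplified sketch of a fixed-size sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators); the function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the last character of the previous one. With real documents, that overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.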

Initializing Pinecone and Upserting Data

Now, we'll initialize our embedding model (OpenAIEmbeddings) and Pinecone. We'll then convert our text chunks into embeddings and upload them to a Pinecone index. An "embedding" is a numerical list that represents the meaning of text. "Upserting" means inserting new data or updating existing data in the database.

# 3. Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# 4. Initialize Pinecone
api_key = os.environ.get("PINECONE_API_KEY")
# PINECONE_ENVIRONMENT applies only to legacy pod-based indexes (PodSpec);
# the serverless index created below does not use it.
# For Pinecone Serverless, you might read the cloud and region from the environment:
# cloud = os.environ.get("PINECONE_CLOUD") # e.g., "aws"
# region = os.environ.get("PINECONE_REGION") # e.g., "us-east-1"

if not api_key:
    raise ValueError("PINECONE_API_KEY must be set.")

pc = Pinecone(api_key=api_key)

index_name = "rag-tutorial-index"

import time  # used to wait for a newly created index

# Check if index exists, if not, create it
if index_name not in pc.list_indexes().names():
    print(f"Creating index '{index_name}'...")
    pc.create_index(
        name=index_name,
        dimension=1536, # Dimension for text-embedding-ada-002
        metric="cosine", # Similarity metric
        spec=ServerlessSpec(cloud='aws', region='us-east-1') # Or PodSpec for older setups
    )
    # Index creation is asynchronous; wait until it reports ready before upserting
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")

# 5. Upsert embeddings to Pinecone
# This step uses LangChain's PineconeVectorStore to simplify the upsert process.
# It will create embeddings for each document and store them in Pinecone.
vectorstore = PineconeVectorStore.from_documents(
    documents,
    index_name=index_name,
    embedding=embeddings
)

print(f"Successfully upserted {len(documents)} documents to Pinecone index '{index_name}'.")

This code snippet sets up your Pinecone index and populates it with your knowledge base. Each chunk of text is now a searchable vector, ready for retrieval.
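
One thing to watch in production: "upsert" only updates an existing record when the ID matches. My understanding is that LangChain vector stores generate random IDs when you don't pass any, so re-running the ingestion script would add duplicates instead of overwriting. A common workaround is to derive the ID from the content. The sketch below shows the idea with a plain dictionary standing in for the index; `chunk_id` and `fake_index` are hypothetical names, and you would need to check how your installed version accepts custom IDs (often an `ids` argument).

```python
import hashlib


def chunk_id(text: str, source: str = "") -> str:
    """Derive a stable ID from the chunk's source and content."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; still effectively unique here


# A dict stands in for the vector index: same ID means overwrite, new ID means insert.
fake_index: dict[str, str] = {}


def upsert(chunks: list[str], source: str = "") -> None:
    """Insert new chunks or overwrite existing ones, keyed by their stable ID."""
    for chunk in chunks:
        fake_index[chunk_id(chunk, source)] = chunk


docs = ["RAG combines retrieval and generation.", "Pinecone stores vectors."]
upsert(docs, source="tutorial")
upsert(docs, source="tutorial")  # running ingestion twice does not duplicate anything
print(len(fake_index))
```

Because the ID depends only on the source and text, ingestion becomes idempotent: you can re-run it after a crash without cleaning up first.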

Retrieval: Finding Relevant Context with LangChain and Pinecone

[IMAGE: A magnifying glass icon over a database, symbolizing efficient information retrieval.]

With your data ingested, the next step is to retrieve relevant information when a user asks a question. LangChain provides a clean interface to interact with Pinecone for this purpose.

Setting up the Pinecone Vector Store as a Retriever

LangChain's PineconeVectorStore can be easily converted into a retriever. A "retriever" is a component that takes a user query and returns relevant documents.

# Assuming 'vectorstore' was initialized in the previous step
# If running this section independently, re-initialize:
# from langchain_pinecone import PineconeVectorStore
# from langchain_openai import OpenAIEmbeddings
# import os
# from pinecone import Pinecone, ServerlessSpec
#
# api_key = os.environ.get("PINECONE_API_KEY")
# environment = os.environ.get("PINECONE_ENVIRONMENT")
# pc = Pinecone(api_key=api_key)
# index_name = "rag-tutorial-index"
# embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Convert the vector store into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Retrieve top 3 most relevant documents

print("Pinecone vector store configured as a retriever.")

# Test the retriever
query = "What is LangChain used for?"
retrieved_docs = retriever.invoke(query)

print(f"\nQuery: '{query}'")
print(f"Retrieved {len(retrieved_docs)} documents:")
for i, doc in enumerate(retrieved_docs):
    print(f"--- Document {i+1} ---")
    print(doc.page_content)
    # print(f"Metadata: {doc.metadata}") # If you added metadata during ingestion

When you call retriever.invoke(query), LangChain takes your query, embeds it using the same embedding model, sends that embedding to Pinecone, and Pinecone returns the k most similar document chunks. These chunks are then passed back as Document objects.

Leveraging Metadata Filtering

Pinecone allows you to store metadata alongside your vectors. This is incredibly powerful for more precise retrieval. For example, you could store the source of a document, its creation date, or its topic. Then, you can filter your search results based on this metadata.

Let's imagine we added a source metadata field during ingestion.

# Example of how you might add metadata during ingestion (not run here, just for illustration)
# from langchain_core.documents import Document
# documents_with_metadata = [
#     Document(page_content="The quick brown fox jumps over the lazy dog.", metadata={"source": "classic_sentences"}),
#     Document(page_content="Artificial intelligence (AI) is rapidly transforming industries worldwide.", metadata={"source": "ai_news"}),
# ]
# vectorstore_with_metadata = PineconeVectorStore.from_documents(
#     documents_with_metadata,
#     index_name=index_name,
#     embedding=embeddings
# )

# To demonstrate metadata filtering, let's assume some documents have a 'source' metadata field.
# For our current simple example, we don't have diverse metadata, but here's how you'd use it:
# retriever_with_filter = vectorstore.as_retriever(
#     search_kwargs={
#         "k": 3,
#         "filter": {"source": "ai_news"} # Only retrieve documents where source is 'ai_news'
#     }
# )

# query_filtered = "What is AI?"
# retrieved_filtered_docs = retriever_with_filter.invoke(query_filtered)
# print(f"\nQuery with filter: '{query_filtered}' (source='ai_news')")
# for i, doc in enumerate(retrieved_filtered_docs):
#     print(f"--- Filtered Document {i+1} ---")
#     print(doc.page_content)
#     print(f"Metadata: {doc.metadata}")

Metadata filtering is a crucial feature for production systems, allowing you to narrow down searches and ensure the LLM receives context from specific, relevant subsets of your knowledge base.

Generation: Augmenting LLM Prompts for Accurate Answers

[IMAGE: A thought bubble with text, showing how retrieved context enhances the LLM's response.]

Once you have the relevant context, the next step is to combine it with the user's query and send it to an LLM to generate an answer. This is where the "augmentation" part of RAG truly shines.

Crafting Effective Prompt Templates

A prompt template defines the structure of the input you send to the LLM. For RAG, it's essential to clearly separate the user's question from the retrieved context. This helps the LLM understand its role: to answer the question based on the provided context.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 1. Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0) # temperature=0 for more deterministic answers

# 2. Define a prompt template
# The 'context' variable will be populated by the retrieved documents.
# The 'question' variable will be the user's query.
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI assistant. Answer the user's question ONLY based on the provided context. If the answer is not in the context, state that you don't know."),
        ("human", "Context: {context}\n\nQuestion: {question}"),
    ]
)

print("Prompt template created.")

# Example of how the prompt would look (without actually calling the LLM yet)
sample_context = "LangChain is a framework for developing applications powered by language models. It enables chaining together different components to build more complex use cases."
sample_question = "What is LangChain?"

formatted_prompt = prompt_template.format(context=sample_context, question=sample_question)
print("\n--- Example Formatted Prompt ---")
print(formatted_prompt)

The system message sets the tone and instructions for the LLM, guiding its behavior. The human message then provides the actual content, clearly labeling the context and the question.

Combining Query and Context for Augmented Generation

The core idea is to take the documents retrieved by Pinecone and insert their content directly into the prompt template. LangChain makes this process straightforward when building a chain.

A critical consideration for production systems is handling prompt length and token limits. LLMs have a maximum number of tokens they can process in a single request. If your retrieved context is too long, you might need strategies like summarizing the context, selecting only the most relevant sentences, or using an LLM with a larger context window. For now, we'll assume our chunks are small enough.
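
One simple strategy is to keep adding retrieved chunks, in relevance order, until a token budget is used up. Here is a rough sketch of that idea. The `approx_tokens` heuristic (about four characters per token) is an assumption that only loosely holds for English text; for exact counts you would swap in a real tokenizer such as tiktoken.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in their given (relevance) order until the budget would be exceeded."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]  # each is roughly 10 tokens
kept = fit_to_budget(ranked_chunks, max_tokens=25)
print(len(kept))
```

Remember to leave room in the budget for the system message, the question itself, and the answer the model needs to generate.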

Building the End-to-End RAG Chain with LangChain

[IMAGE: A flowchart showing the sequential steps of the LangChain RAG chain from query to answer.]

Now it's time to bring all the pieces together into a single, cohesive RAG chain using LangChain Expression Language (LCEL). LCEL allows you to compose complex chains from simple components in a readable and efficient way.

graph TD
    A[User Query] --> B{Retriever}
    B -- Retrieved Docs --> C[Format Docs for Prompt]
    C --> D{Prompt Template}
    D -- Formatted Prompt --> E[LLM]
    E -- LLM Response --> F[Output Parser]
    F --> G[Final Answer]

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Assuming 'retriever', 'prompt_template', and 'llm' are initialized from previous steps.
# If running this section independently, ensure they are initialized:
# from langchain_pinecone import PineconeVectorStore
# from langchain_openai import OpenAIEmbeddings, ChatOpenAI
# from langchain_core.prompts import ChatPromptTemplate
# import os
# from pinecone import Pinecone, ServerlessSpec
#
# api_key = os.environ.get("PINECONE_API_KEY")
# environment = os.environ.get("PINECONE_ENVIRONMENT")
# pc = Pinecone(api_key=api_key)
# index_name = "rag-tutorial-index"
# embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)
# retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# prompt_template = ChatPromptTemplate.from_messages(
#     [
#         ("system", "You are an AI assistant. Answer the user's question ONLY based on the provided context. If the answer is not in the context, state that you don't know."),
#         ("human", "Context: {context}\n\nQuestion: {question}"),
#     ]
# )

# Define a function to format the retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

print("RAG chain built successfully.")

# Invoke the complete RAG chain to answer a user query
user_query_1 = "What is Pinecone?"
print(f"\n--- Answering query: '{user_query_1}' ---")
response_1 = rag_chain.invoke(user_query_1)
print(response_1)

user_query_2 = "Tell me about the capital of France."
print(f"\n--- Answering query: '{user_query_2}' ---")
response_2 = rag_chain.invoke(user_query_2)
print(response_2)

user_query_3 = "What is the tallest mountain in the world?"
print(f"\n--- Answering query: '{user_query_3}' ---")
response_3 = rag_chain.invoke(user_query_3)
print(response_3)

In this chain:

{"context": retriever | format_docs, "question": RunnablePassthrough()}: This is a dictionary that prepares the inputs for the prompt.
- "context": The user's query first goes to the retriever, which fetches documents. These documents are then piped (|) to format_docs to turn them into a single string.
- "question": The original user query is passed through directly using RunnablePassthrough().
| prompt_template: The prepared context and question are fed into our prompt_template.
| llm: The formatted prompt is sent to the llm for generation.
| StrOutputParser(): The LLM's output is parsed into a simple string.

This chain is now a complete, runnable RAG system. You can invoke it with any user query, and it will handle the retrieval, augmentation, and generation steps automatically. Testing with various queries, including those outside your knowledge base, helps you refine your prompt and retrieval strategy.
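
If the pipe syntax feels like magic, it may help to see the idea in miniature. The sketch below is a toy re-creation of the pattern, not LangChain's actual implementation: `Step`, `parallel`, and the fake retriever and LLM are all made up for illustration. In real LCEL, the dictionary at the start of the chain is converted into a parallel step for you.

```python
class Step:
    """Wraps a function so steps can be composed with the | operator."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # (a | b).invoke(x) runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches: Step) -> Step:
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Stand-ins for the real components (all hypothetical).
fake_retriever = Step(lambda query: ["Pinecone is a vector database."])
format_docs_step = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
prompt_step = Step(lambda d: f"Context: {d['context']}\n\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: {"content": "ECHO: " + prompt})
parser = Step(lambda message: message["content"])

toy_chain = (
    parallel(context=fake_retriever | format_docs_step, question=passthrough)
    | prompt_step
    | fake_llm
    | parser
)

print(toy_chain.invoke("What is Pinecone?"))
```

The same query string flows into both branches of the parallel step, which is why the real chain can be invoked with a plain string even though the prompt needs two variables.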

Productionizing and Deploying Your RAG Application

[IMAGE: An icon representing cloud deployment or a server rack, symbolizing a production environment.]

Building the RAG chain is a significant step, but making it production-ready involves several more considerations.

Deployment Strategies

How you deploy your RAG system depends on your existing infrastructure and traffic needs. Common approaches include:

Web Frameworks (Flask/FastAPI): You can wrap your LangChain RAG chain in a REST API using frameworks like Flask or FastAPI. This allows other applications to interact with your RAG system via HTTP requests.
Docker: Containerizing your application with Docker ensures consistency across different environments and simplifies deployment. You can package your Python code, dependencies, and environment variables into a single image.
Cloud Platforms: Deploying on cloud platforms like AWS (ECS, Lambda), Google Cloud (Cloud Run, App Engine), or Azure (App Service, Azure Functions) offers scalability, managed services, and integration with other cloud tools. Serverless options are great for cost-effectiveness with fluctuating traffic.
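
Whichever option you pick, the core of the service is the same: accept a question over HTTP, validate it, call the chain, and return JSON. Here is a minimal sketch using only Python's standard library so it runs anywhere. In practice you would more likely use Flask or FastAPI as mentioned above. The `/ask` route, the `answer_question` stub, and the `serve` helper are assumptions for illustration; the stub is where `rag_chain.invoke(question)` would go.

```python
import json
from wsgiref.simple_server import make_server


def answer_question(question: str) -> str:
    """Placeholder for rag_chain.invoke(question)."""
    return f"(stub answer for: {question})"


def handle(method: str, path: str, raw_body: bytes) -> tuple[str, dict]:
    """Pure request logic: returns an HTTP status line and a JSON-serializable payload."""
    if method != "POST" or path != "/ask":
        return "404 Not Found", {"error": "use POST /ask"}
    try:
        question = json.loads(raw_body)["question"].strip()
    except (ValueError, KeyError, TypeError, AttributeError):
        return "400 Bad Request", {"error": "expected a JSON body with a 'question' string"}
    if not question:
        return "400 Bad Request", {"error": "'question' must not be empty"}
    return "200 OK", {"answer": answer_question(question)}


def app(environ, start_response):
    """WSGI entry point: reads the request, delegates to handle(), writes JSON back."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw_body = environ["wsgi.input"].read(length)
    status, payload = handle(environ["REQUEST_METHOD"], environ.get("PATH_INFO", ""), raw_body)
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Call serve() to run a local development server (not for production traffic)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Keeping the request logic in a plain function like `handle` makes it easy to unit test without starting a server, and the same function can sit behind any framework or serverless handler.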

Monitoring Performance, Latency, and Accuracy

Once deployed, continuous monitoring is essential.

Performance: Track metrics like queries per second (QPS) and resource utilization (CPU, memory).
Latency: Measure the time it takes for your system to respond to a query. High latency can degrade user experience.
Accuracy: This is trickier for RAG. You might implement human feedback loops, A/B testing different retrieval strategies, or use evaluation datasets to periodically assess the quality of answers. LangChain also offers tools for evaluation.
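
For the accuracy part, a small evaluation set can take you a long way. One simple metric is the hit rate at k: for what fraction of test questions does at least one of the top-k retrieved chunks contain the expected fact? The sketch below assumes a hypothetical evaluation set format (`question` plus `expected_substring`) and uses a fake retriever so it runs on its own; with the real system you would pass a function that calls your retriever and returns the page contents.

```python
def hit_rate_at_k(eval_set: list[dict], retrieve, k: int = 3) -> float:
    """Fraction of questions whose top-k retrieved texts contain the expected substring."""
    if not eval_set:
        return 0.0
    hits = 0
    for example in eval_set:
        top_k = retrieve(example["question"])[:k]
        expected = example["expected_substring"].lower()
        if any(expected in text.lower() for text in top_k):
            hits += 1
    return hits / len(eval_set)


# Fake retriever for illustration: a lookup table from question to ranked texts.
canned_results = {
    "What is the capital of France?": ["The capital of France is Paris.", "The Eiffel Tower is in Paris."],
    "What is deep learning?": ["LangChain is a framework.", "Pinecone is a vector database."],
}


def fake_retrieve(question: str) -> list[str]:
    return canned_results.get(question, [])


eval_set = [
    {"question": "What is the capital of France?", "expected_substring": "Paris"},
    {"question": "What is deep learning?", "expected_substring": "neural networks"},
]

print(hit_rate_at_k(eval_set, fake_retrieve, k=2))
```

Tracking this number over time tells you whether a change to chunk size, k, or the embedding model helped or hurt retrieval, independently of the LLM.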

Implementing Logging and Error Handling

Robust logging is crucial for debugging and understanding how your system behaves in production. Log key events, such as incoming queries, retrieved documents, LLM responses, and any errors that occur. Implement comprehensive error handling to prevent your application from crashing and to provide meaningful feedback to users or administrators.
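
As a starting point, you can wrap the chain call in a small helper that logs each attempt, retries with exponential backoff, and falls back to a safe message instead of crashing. This is a sketch under assumptions: `answer_with_retries` and `FALLBACK_ANSWER` are made-up names, and catching every `Exception` is a simplification. In a real service you would retry only on errors you know are transient, such as timeouts or rate limits.

```python
import logging
import time

logger = logging.getLogger("rag")

FALLBACK_ANSWER = "Sorry, I couldn't answer that right now. Please try again later."


def answer_with_retries(chain_invoke, query: str, max_attempts: int = 3,
                        base_delay: float = 0.5, sleep=time.sleep) -> str:
    """Call chain_invoke(query), retrying with exponential backoff and logging each attempt."""
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            answer = chain_invoke(query)
        except Exception:
            # logger.exception records the full traceback for later debugging.
            logger.exception("attempt %d/%d failed for query=%r", attempt, max_attempts, query)
            if attempt == max_attempts:
                return FALLBACK_ANSWER
            sleep(base_delay * 2 ** (attempt - 1))  # waits 0.5s, 1s, 2s, ...
        else:
            elapsed = time.perf_counter() - started
            logger.info("query=%r answered in %.3fs on attempt %d", query, elapsed, attempt)
            return answer
    return FALLBACK_ANSWER  # only reached when max_attempts < 1
```

You would call it as `answer_with_retries(rag_chain.invoke, user_query)`. The `sleep` parameter exists so tests can skip the real waiting.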

Updating and Maintaining Your Knowledge Base

Your knowledge base isn't static. Information changes, new documents are added, and old ones become obsolete.

Scheduled Updates: Set up automated processes to periodically re-ingest data, update embeddings, and refresh your Pinecone index.
Incremental Updates: For very large knowledge bases, consider incremental updates where only new or changed documents are processed, rather than re-indexing everything.
Version Control: If your data sources are versioned, ensure your RAG system can handle different versions of documents.
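
Incremental updates usually come down to knowing what changed since the last run. One way to sketch this is to keep a manifest that maps each document ID to a hash of its content, then compare it with the current state of your sources. The names `fingerprint` and `plan_update` are hypothetical, and where you store the manifest (a file, a database table) is up to you.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(previous_manifest: dict[str, str], current_docs: dict[str, str]):
    """Compare the last manifest with the current documents.

    Returns (ids to upsert, ids to delete, new manifest to save for the next run).
    """
    new_manifest = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    to_upsert = sorted(
        doc_id for doc_id, digest in new_manifest.items()
        if previous_manifest.get(doc_id) != digest  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in previous_manifest if doc_id not in new_manifest)
    return to_upsert, to_delete, new_manifest


previous = {"a": fingerprint("old text"), "b": fingerprint("unchanged"), "c": fingerprint("removed")}
current = {"a": "new text", "b": "unchanged", "d": "brand new"}

to_upsert, to_delete, manifest = plan_update(previous, current)
print(to_upsert, to_delete)
```

Only the documents in `to_upsert` need to be re-split and re-embedded, which keeps embedding costs proportional to what actually changed. The IDs in `to_delete` should be removed from the index so stale content stops showing up in answers.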

Scaling Your Pinecone Index and LangChain Application

Pinecone Scaling: Pinecone is designed for scale. As your data grows, you can adjust your index's capacity (e.g., by adding more pods or using serverless which scales automatically) to maintain performance.
LangChain Application Scaling: If you've deployed your application as a web service, you can scale it horizontally by running multiple instances behind a load balancer. For serverless functions, scaling is often handled automatically by the cloud provider.

Productionizing a RAG system is an ongoing process of deployment, monitoring, and refinement. By considering these aspects early, you can build a system that not only works but thrives in a real-world environment.

Key Takeaways

RAG is essential for factual LLMs: It reduces hallucinations by providing external, up-to-date context.
LangChain orchestrates, Pinecone stores: LangChain simplifies building the RAG workflow, while Pinecone provides a high-performance vector database for efficient retrieval.
Data preparation is critical: Effective text splitting and embedding are foundational for accurate retrieval.
LCEL enables powerful chains: LangChain Expression Language allows you to build complex, readable, and efficient RAG pipelines.
Production means more than just code: Consider deployment, monitoring, logging, and maintenance for a truly robust system.
Metadata filtering enhances retrieval: Use metadata in Pinecone to achieve more precise and targeted searches.
Prompt engineering guides the LLM: Craft clear prompt templates to ensure the LLM uses the provided context effectively.

What challenges are you currently facing when trying to move your AI prototypes into production?
