Peter Chambers for GPUYard

Originally published at gpuyard.com

How to Build a Production-Ready Private RAG Pipeline with vLLM, LangChain, and Dedicated GPUs

Deploying a Retrieval-Augmented Generation (RAG) pipeline is the standard approach for allowing LLMs to securely interact with proprietary data. However, relying on public APIs introduces latency and data sovereignty risks.

Self-hosting the inference stack keeps proprietary data entirely within your own infrastructure. This guide demonstrates how to architect a high-performance, fully private RAG pipeline using vLLM, LangChain, and Qdrant.


🛠️ Prerequisites

  • OS: Ubuntu 22.04 LTS
  • GPU: Minimum 24GB VRAM (NVIDIA RTX 3090/4090). 70B+ models require A100/H100 clusters.
  • Drivers: NVIDIA Drivers (v535+) & CUDA 12.1+
  • Environment: Python 3.10+, Docker & Docker Compose

🚀 Step 1: Prepare the GPU Environment

Verify that your GPU is visible and set up a virtual environment:

nvidia-smi

python3 -m venv rag_env
source rag_env/bin/activate
pip install vllm langchain langchain-openai langchain-community sentence-transformers qdrant-client pypdf
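
Optionally, run a quick sanity check that PyTorch (pulled in as a vLLM dependency) can see the card before going further. A minimal snippet:

# Sanity check: confirm the CUDA device is visible from Python
import torch

print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 4090"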

🤖 Step 2: Deploy vLLM API Server

vLLM is an optimized inference engine that uses PagedAttention to maximize throughput. We will serve Meta-Llama-3-8B-Instruct (the Llama 3 weights are gated on Hugging Face, so accept the license and authenticate with huggingface-cli login first):

python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key private-rag-key \
    --max-model-len 4096 \
    --port 8000
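
Once the server is up, you can smoke-test the OpenAI-compatible endpoint. Here is a minimal sketch using the openai client (installed alongside langchain-openai); the prompt is just an example:

# Smoke test against the local vLLM OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="private-rag-key")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50,
)
print(response.choices[0].message.content)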

Pro Tip: Running LLMs on virtualized cloud instances often introduces hypervisor overhead. For maximum tokens-per-second (TPS), deploying directly on bare-metal dedicated GPU servers is recommended.

📦 Step 3: Initialize Qdrant (Vector Database)

Spin up a local Qdrant instance via Docker:

docker run -d -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
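
You can confirm the instance is reachable with the qdrant-client package installed earlier, for example:

# Confirm the local Qdrant instance is reachable
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())   # empty on a fresh instance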

📄 Step 4: Ingest Documents and Build the Vector Index

Load your documents, split them into chunks, embed them locally on the GPU, and index them in Qdrant:

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and split it into overlapping chunks for retrieval
loader = PyPDFLoader("enterprise_policy.pdf")
documents = loader.load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(documents)

# Local Embeddings (Running on CUDA)
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={'device': 'cuda'}
)

qdrant = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="enterprise_knowledge",
)
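
To confirm the ingestion worked, you can count the indexed points; a quick check against the collection created above:

# Verify that vectors were written to the collection
from qdrant_client import QdrantClient

print(QdrantClient(url="http://localhost:6333").count("enterprise_knowledge"))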

🔗 Step 5: Build the Retrieval Loop

Connect LangChain to your local vLLM API and Qdrant to execute queries:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="private-rag-key",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct"
)
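
From here, assemble the retrieval loop itself. Below is a minimal sketch using LangChain's LCEL composition; it assumes the qdrant store from Step 4 and the llm defined above are in scope, and the prompt wording and example question are purely illustrative:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Expose the Qdrant index as a retriever returning the top 4 chunks
retriever = qdrant.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What does the policy say about data retention?"))

Every component in this loop (retrieval, embedding, and generation) runs on your own hardware, so neither documents nor queries ever leave the server.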

💡 Conclusion & Full Source Code

Building a private RAG pipeline ensures your data never leaves your own infrastructure while retaining the high-throughput serving that vLLM's PagedAttention provides.

For the complete Python scripts, detailed troubleshooting of CUDA OOM errors, and scaling strategies, check out the original post:

👉 Read the Full Tutorial on GPUYard
