Build a RAG Pipeline with DeepSeek V4 Pro and Python in 15 Minutes

What is RAG?

RAG (Retrieval-Augmented Generation) pairs an LLM with your own data. Instead of relying solely on what the model learned during training, a RAG system retrieves the documents most relevant to a query and passes them to the model as context, so the answer is grounded in your data.

Architecture

User Query → Embedding → Vector Search → Context + Query → LLM → Response
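
To make the flow above concrete, here is a minimal sketch of the same stages as one function with pluggable parts. Everything in it is hypothetical and for illustration only: `rag_answer`, the toy word-count "embedding", and the echoing "LLM" are stand-ins, not the components used in the steps below.

```python
from typing import Callable, List

Vector = List[float]


def rag_answer(
    query: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[str]],
    generate: Callable[[str, str], str],
    top_k: int = 3,
) -> str:
    """Run one query through the stages shown in the diagram."""
    query_vector = embed(query)              # User Query -> Embedding
    documents = search(query_vector, top_k)  # Embedding -> Vector Search
    context = "\n".join(documents)           # Context + Query
    return generate(context, query)          # LLM -> Response


# Toy stand-ins (hypothetical): a real pipeline plugs in an embedding
# model, a vector store, and an LLM call here.
DOCS = ["apples are red", "the sky is blue"]
VOCAB = ["apples", "red", "sky", "blue"]


def toy_embed(text: str) -> Vector:
    # Count a few hand-picked words; real embedders return dense vectors.
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


def toy_search(vector: Vector, k: int) -> List[str]:
    # Rank documents by dot product with the query vector.
    def score(doc: str) -> float:
        return sum(a * b for a, b in zip(vector, toy_embed(doc)))

    return sorted(DOCS, key=score, reverse=True)[:k]


def toy_generate(context: str, query: str) -> str:
    # A real pipeline would call the LLM; this only echoes its inputs.
    return f"Q: {query}\nContext: {context}"


print(rag_answer("what colour is the sky", toy_embed, toy_search, toy_generate, top_k=1))
```

Keeping each stage behind a plain callable makes it easy to swap the embedder, the vector store, or the model independently.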

Step 1: Install Dependencies

pip install openai chromadb sentence-transformers

Step 2: Set Up DeepSeek Client

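import os

from openai import OpenAI

# The OpenAI client is pointed at the provider's endpoint via base_url.
# DEEPSEEK_API_KEY is a hypothetical variable name; avoid hardcoding real keys.
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY", "sk-your-key"),
    base_url="https://api.token-china.cc/v1"
)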

Step 3: Create Vector Store

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

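# Create an in-memory vector store (contents are lost when the process exits)
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error when this code is re-run in the same session
collection = chroma_client.get_or_create_collection("documents")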

# Add documents
documents = [
    "DeepSeek V4 Pro costs $2 per million tokens.",
    "GLM 5.1 is Zhipu AI's latest model.",
    "Token China provides unified API access to Chinese AI models."
]

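# Embed all documents in one batch and convert the array to plain Python lists
embeddings = embedder.encode(documents).tolist()
# upsert (rather than add) keeps re-runs from clashing on existing ids
collection.upsert(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)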
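
For intuition about what the vector search step does, the sketch below ranks stored vectors by cosine similarity in plain Python. It is a conceptual illustration only: `cosine_similarity` and `top_k` are hypothetical helpers, and Chroma uses its own index and configurable distance metric rather than this brute-force loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, similarity in top_k([1.0, 0.0], stored, k=2):
    print(index, round(similarity, 3))
```

A brute-force scan like this is fine for a handful of vectors; a vector store exists so the same lookup stays fast as the collection grows.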

Step 4: Query and Generate

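def query_rag(question: str, n_results: int = 3) -> str:
    """Retrieve the most relevant documents and answer using them as context."""
    # Embed the question with the same model used for the documents
    query_embedding = embedder.encode([question]).tolist()

    # Search for the closest documents in the vector store
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )

    # results['documents'] holds one list of matches per query; we sent one query
    context = "\n".join(results['documents'][0])

    # Generate a response grounded in the retrieved context
    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content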

# Test it
answer = query_rag("How much does DeepSeek V4 Pro cost?")
print(answer)
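
One possible refinement of the prompt-building step is sketched below. It numbers the retrieved documents so the model can refer to them and handles the case where retrieval returns nothing. `build_messages` and the prompt wording are assumptions for illustration, not part of the original pipeline.

```python
from typing import Dict, List


def build_messages(question: str, documents: List[str]) -> List[Dict[str, str]]:
    """Build chat messages with numbered context (hypothetical helper)."""
    if documents:
        # Number each document so the model can point at its source.
        context = "\n".join(
            f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
        )
    else:
        context = "(no relevant documents found)"

    system_prompt = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "How much does DeepSeek V4 Pro cost?",
    ["DeepSeek V4 Pro costs $2 per million tokens."],
)
print(messages[0]["content"])
```

The returned list has the same shape as the `messages` argument used in `query_rag`, so it could be passed to the chat completion call in its place.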

Production Tips

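Chunking: Split documents into 500-1000 token chunks, or smaller if your embedding model has a shorter input limit (all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default)
Overlap: Use 10-20% overlap between chunks so text that straddles a boundary appears intact in at least one chunk
Embeddings: Use a dedicated embedding model (not the LLM)
Caching: Cache embeddings so unchanged documents are not re-embedded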
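
The chunking, overlap, and caching tips above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `chunk_words` and `EmbeddingCache` are hypothetical helpers, whitespace-separated words stand in for tokens (a real tokenizer will count differently), and the cache lives in memory only.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into word-based chunks that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


class EmbeddingCache:
    """Memoize embeddings by a hash of the text (in-memory only)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only call the (potentially slow) embedder on a cache miss.
            self._store[key] = self._embed_fn(text)
        return self._store[key]


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```

With the embedder from Step 3, the cache could be wrapped as `EmbeddingCache(lambda t: embedder.encode(t).tolist())`. Persisting the store to disk is left out to keep the sketch short.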

Why DeepSeek for RAG?

Cost: $2 per 1M tokens vs. $15 per 1M tokens for GPT-5
Context: 128K window fits most documents
Quality: Comparable to GPT-5 for RAG tasks
Speed: Faster inference for real-time applications
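
To see what the cost and context figures above mean for a single request, here is a rough back-of-the-envelope sketch. It takes the prices and the 128K window from this article as given and assumes one flat per-token rate and a window of exactly 128,000 tokens; both helper functions are hypothetical.

```python
def estimate_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of processing `tokens` at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def fits_context(
    prompt_tokens: int,
    reserved_for_answer: int,
    context_window: int = 128_000,  # assumed size of the "128K" window
) -> bool:
    """Check that the prompt plus the reserved answer budget fits the window."""
    return prompt_tokens + reserved_for_answer <= context_window


# Example: three 1000-token chunks plus about 200 tokens of question and instructions.
prompt_tokens = 3 * 1000 + 200

print(f"At $2 per 1M tokens:  ${estimate_cost_usd(prompt_tokens, 2.0):.4f}")
print(f"At $15 per 1M tokens: ${estimate_cost_usd(prompt_tokens, 15.0):.4f}")
print("Fits in window:", fits_context(prompt_tokens, reserved_for_answer=1000))
```

For this 3,200-token prompt the estimate is about $0.0064 at the lower price and about $0.048 at the higher one, so the gap only becomes significant at high request volume.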

Try building your own RAG pipeline with Token China's 100K free tokens!

DEV Community