What is RAG?
RAG (Retrieval-Augmented Generation) combines an LLM with your own data. Instead of relying solely on what the model learned during training, a RAG pipeline retrieves the documents most relevant to a query and passes them to the model as context, so the response is grounded in that material.
Architecture
User Query → Embedding → Vector Search → Context + Query → LLM → Response
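The flow above can be sketched end to end with a toy example that needs no external services. This is only an illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `retrieve` is a brute-force stand-in for a vector database, and the final LLM call is left out, so the sketch stops at the assembled prompt. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9$]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Brute-force 'vector search': rank every document by similarity."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Context + Query: the text that would be sent to the LLM."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "DeepSeek V4 Pro costs $2 per million tokens.",
    "GLM 5.1 is Zhipu AI's latest model.",
    "Token China provides unified API access to Chinese AI models.",
]
question = "How much does DeepSeek V4 Pro cost?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, docs, k=1)
print(prompt)
```

A real pipeline swaps the word counts for dense vectors and the linear scan for an index, but the order of operations stays the same.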
Step 1: Install Dependencies
pip install openai chromadb sentence-transformers
Step 2: Set Up DeepSeek Client
from openai import OpenAI
client = OpenAI(
    api_key="sk-your-key",
    base_url="https://api.token-china.cc/v1"
)
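Hard-coding the key is fine for a quick test, but outside a tutorial you would probably read it from the environment instead. Here is a minimal sketch of that; the variable name `TOKEN_CHINA_API_KEY` and the helper `load_api_key` are made up for this example and are not something the provider defines.

```python
import os

def load_api_key(env_var: str = "TOKEN_CHINA_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The default variable name is hypothetical; use whatever name you export.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key

# Usage with the client from this step (not executed here):
# client = OpenAI(api_key=load_api_key(), base_url="https://api.token-china.cc/v1")
```

Failing fast with a clear message beats a confusing authentication error on the first request.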
Step 3: Create Vector Store
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Create vector store
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("documents")
# Add documents
documents = [
    "DeepSeek V4 Pro costs $2 per million tokens.",
    "GLM 5.1 is Zhipu AI's latest model.",
    "Token China provides unified API access to Chinese AI models."
]
embeddings = embedder.encode(documents).tolist()
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
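Positional IDs like `doc_0` work for a fixed list, but they shift whenever the list is reordered or extended. One possible alternative, sketched here as an assumption rather than anything Chroma requires, is to derive each ID from the document text so the same content always maps to the same ID. `stable_id` and `dedupe` are hypothetical helper names.

```python
import hashlib

def stable_id(text: str) -> str:
    """Derive a deterministic ID from the document content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"

def dedupe(documents: list[str]) -> tuple[list[str], list[str]]:
    """Drop exact duplicates and return parallel lists of (ids, documents)."""
    unique: dict[str, str] = {}
    for doc in documents:
        unique.setdefault(stable_id(doc), doc)  # first occurrence wins
    return list(unique.keys()), list(unique.values())

docs = [
    "DeepSeek V4 Pro costs $2 per million tokens.",
    "GLM 5.1 is Zhipu AI's latest model.",
    "DeepSeek V4 Pro costs $2 per million tokens.",  # exact duplicate
]
ids, unique_docs = dedupe(docs)
print(len(ids), "unique documents")
```

The resulting `ids` and `unique_docs` lists could then be passed to `collection.add` in place of the positional IDs.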
Step 4: Query and Generate
def query_rag(question: str) -> str:
    # Embed the question
    query_embedding = embedder.encode([question]).tolist()
    # Search for relevant documents
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=3
    )
    # Build context
    context = "\n".join(results['documents'][0])
    # Generate response
    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Test it
answer = query_rag("How much does DeepSeek V4 Pro cost?")
print(answer)
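With `n_results=3`, every query gets three documents back whether or not they are relevant. One way to tighten the context step is to drop weak matches before building the prompt. The sketch below assumes the result dict carries a `distances` list parallel to `documents` (which Chroma returns by default, as far as I know) and that lower means closer. The helper name `build_context`, the cutoff of 1.0, and the sample distances are all made up for illustration.

```python
def build_context(results: dict, max_distance: float = 1.0) -> str:
    """Join retrieved documents into a numbered context, skipping weak matches.

    `results` is expected to look like a Chroma query result for ONE query:
    {"documents": [[...]], "distances": [[...]]}. If distances are missing,
    every document is kept.
    """
    docs = results["documents"][0]
    distances = (results.get("distances") or [[None] * len(docs)])[0]
    kept = [
        doc
        for doc, dist in zip(docs, distances)
        if dist is None or dist <= max_distance
    ]
    return "\n".join(f"[{i}] {doc}" for i, doc in enumerate(kept, start=1))

# Hand-written stand-in for a query result; the distances are invented.
fake_results = {
    "documents": [[
        "DeepSeek V4 Pro costs $2 per million tokens.",
        "GLM 5.1 is Zhipu AI's latest model.",
    ]],
    "distances": [[0.42, 1.57]],
}
context = build_context(fake_results, max_distance=1.0)
print(context)
```

The right cutoff depends on the embedding model and the distance metric, so treat the threshold as something to tune on your own data.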
Production Tips
- Chunking: Split documents into 500-1000 token chunks
- Overlap: Use 10-20% overlap between chunks
- Embeddings: Use a dedicated embedding model (not the LLM)
- Caching: Cache embeddings to avoid re-computing
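The chunking, overlap, and caching tips can be sketched as follows. This is a rough illustration under stated assumptions: chunk sizes are counted in whitespace-separated words rather than real tokens (a tokenizer would be more accurate), and `chunk_words` and `EmbeddingCache` are hypothetical names, not part of any library used above.

```python
import hashlib
from typing import Callable

def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

class EmbeddingCache:
    """In-memory cache keyed by a hash of the text, so repeats are not re-embedded."""

    def __init__(self, embed_fn: Callable[[str], list]):
        self._embed_fn = embed_fn
        self._store: dict[str, list] = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text: str) -> list:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]

# Demo with a dummy embedding function; with the tutorial's model you might pass
# lambda t: embedder.encode(t).tolist() instead.
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
cache = EmbeddingCache(lambda t: [float(len(t))])
vectors = [cache.get(c) for c in chunks + chunks]  # second pass hits the cache
print(chunks, cache.misses)
```

An overlap of 10-20% means `overlap` is roughly 0.1 to 0.2 times `chunk_size`. For anything long-lived, the in-memory dict would give way to a store on disk.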
Why DeepSeek for RAG?
- Cost: $2 per 1M tokens vs $15 per 1M tokens for GPT-5
- Context: 128K window fits most documents
- Quality: Comparable to GPT-5 for RAG tasks
- Speed: Faster inference for real-time applications
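To put the cost point in concrete terms, here is a small calculation using only the two per-million prices quoted above. The monthly token volume is a made-up workload, and the sketch does not separate input from output pricing because the list above does not either.

```python
def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million

monthly_tokens = 50_000_000  # hypothetical workload, not a figure from this post

deepseek_cost = cost_usd(monthly_tokens, 2.0)   # $2 per 1M tokens (quoted above)
gpt5_cost = cost_usd(monthly_tokens, 15.0)      # $15 per 1M tokens (quoted above)

print(f"DeepSeek: ${deepseek_cost:.2f}  GPT-5: ${gpt5_cost:.2f}")
```

RAG prompts include the retrieved context on every call, so token volume grows with `n_results` and chunk size; check current pricing before budgeting.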
Try building your own RAG pipeline with Token China's 100K free tokens!