Building a RAG System With Chinese AI Models
Retrieval-Augmented Generation (RAG) is the architecture powering most production AI chatbots in 2026. But here's what tutorials won't tell you: your choice of LLM backend massively affects both cost and quality - especially when you're dealing with multilingual or Chinese-language documents.
In this tutorial, we'll build a production-grade RAG pipeline using open-source Chinese AI models accessed through a unified API. The entire system costs roughly 95% less than running the same pipeline with GPT-4o.
Why Chinese Models for RAG?
Three concrete reasons:
- Cost efficiency: Models like DeepSeek V4 and GLM-5 charge $0.14-$0.28 per million tokens, compared to $5+ for GPT-4o. When your RAG pipeline processes thousands of queries daily, this compounds fast.
- Multilingual performance: Chinese models handle code-switching between English and Chinese (and other Asian languages) far better than Western models optimized primarily for English.
- Open-weight transparency: Most Chinese models publish their weights, meaning you can self-host for maximum privacy - or use an API gateway for convenience.
For this tutorial, we'll use aiwave.live, which provides a single OpenAI-compatible endpoint for 50+ Chinese AI models including DeepSeek, GLM, Qwen, and more. This means zero SDK changes if you're already using the OpenAI client library.
Architecture Overview
Documents → Chunking → Embeddings → Vector Store
                                         ↓
User Query → Embedding → Similarity Search → Top-K Chunks
                                         ↓
Context + Query → LLM → Response
We'll implement each stage with real, runnable Python code.
Step 1: Install Dependencies
pip install openai faiss-cpu numpy tiktoken
We use faiss-cpu for the vector store (no GPU needed for prototyping), and the openai library because - as mentioned - the API at aiwave.live is fully OpenAI-compatible.
Step 2: Set Up the API Client
from openai import OpenAI
client = OpenAI(
    api_key="your-api-key",
    base_url="https://aiwave.live/v1"
)
That's it. No new SDK, no proprietary client. If you've used the OpenAI Python library before, you already know the API.
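A hard-coded key is fine for a quick test, but a common alternative is to read it from the environment. The following is a minimal sketch; the variable name `AIWAVE_API_KEY` and the helper `load_client_settings` are hypothetical names chosen for illustration, not something the gateway prescribes.

```python
import os

def load_client_settings(env=os.environ):
    """Return keyword arguments for OpenAI(...), taking the key from the environment.

    AIWAVE_API_KEY is a hypothetical variable name used only in this sketch.
    """
    api_key = env.get("AIWAVE_API_KEY")
    if not api_key:
        raise RuntimeError("Set AIWAVE_API_KEY before starting the app")
    return {"api_key": api_key, "base_url": "https://aiwave.live/v1"}

# Usage (assumes the openai package from Step 1 is installed):
# client = OpenAI(**load_client_settings())
```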
Step 3: Document Chunking
RAG quality depends heavily on how you split your documents. Here's a simple baseline: fixed-size token windows with overlap, so text near a boundary appears in both neighboring chunks:
import tiktoken

def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping chunks respecting token limits."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    # cl100k_base is an OpenAI tokenizer; counts are approximate for other models
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        if end >= len(tokens):
            break  # last window reached; skip a trailing chunk made only of overlap
        start += max_tokens - overlap
    return chunks

# Example usage
document = """
Large language models have transformed how we interact with software...
[Your long document text here]
"""
chunks = chunk_text(document)
print(f"Created {len(chunks)} chunks")
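The 64-token overlap reduces the chance that context is lost at chunk boundaries. Note that slicing by token can split a multi-byte character (common with Chinese text), in which case the decoded chunk starts or ends with a replacement character. For production, consider semantic chunking with a model like text-embedding-3-small.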
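One lightweight step toward boundary-aware chunking is to pack whole sentences instead of raw token windows. The sketch below is an illustration, not the tutorial's method: `chunk_by_sentence` and its `max_chars` parameter are hypothetical, character count stands in for token count, and the regex only knows a handful of Latin and Chinese sentence-ending marks.

```python
import re

# Zero-width split points: after . ! ? when whitespace follows,
# or after 。！？ when no further CJK end mark follows.
_BOUNDARY = re.compile(r"(?<=[.!?])(?=\s)|(?<=[。！？])(?![。！？])")

def chunk_by_sentence(text, max_chars=800):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    pieces = [p for p in _BOUNDARY.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = piece
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(chunk_by_sentence("One. Two. 三。四。", max_chars=10))
# ['One. Two.', '三。四。']
```

Because the split points are zero-width, the original spacing inside each chunk is preserved, which matters for Chinese text where sentences are not separated by spaces.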
Step 4: Generate Embeddings
We'll use Qwen's embedding model, which supports both Chinese and English text natively:
import numpy as np

def create_embeddings(chunks, model="qwen-text-embedding-v3"):
    """Generate embeddings for all document chunks."""
    embeddings = []
    for chunk in chunks:
        # One request per chunk: simple, but slow for large documents
        response = client.embeddings.create(
            model=model,
            input=chunk
        )
        embeddings.append(response.data[0].embedding)
    return np.array(embeddings, dtype="float32")  # FAISS expects float32

# chunks comes from Step 3
chunk_embeddings = create_embeddings(chunks)
print(f"Embedding shape: {chunk_embeddings.shape}")
Why Qwen embeddings? They're optimized for Chinese + English mixed corpora and cost roughly $0.02 per million tokens. Embedding prices change often, so compare against the current OpenAI price list rather than assuming a fixed multiple.
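The loop above sends one request per chunk. The OpenAI embeddings API also accepts a list for `input`, so a batched variant might look like the sketch below. It assumes the gateway mirrors that behavior; `embed_in_batches` is a hypothetical helper and `batch_size=16` is an arbitrary starting point, not a documented limit.

```python
def embed_in_batches(client, texts, model="qwen-text-embedding-v3", batch_size=16):
    """Embed texts with one request per batch instead of one per chunk.

    Assumes the endpoint accepts a list for `input`, as the OpenAI API does.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so the vectors line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors

# Usage with the client from Step 2:
# chunk_embeddings = np.array(embed_in_batches(client, chunks), dtype="float32")
```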
Step 5: Build the Vector Store
import faiss
import numpy as np

def build_vector_store(embeddings):
    """Build a FAISS index for fast similarity search."""
    # FAISS needs contiguous float32 data; copying also leaves the caller's array untouched
    vectors = np.array(embeddings, dtype="float32", order="C")
    dimension = vectors.shape[1]
    # Inner product equals cosine similarity once the vectors are L2-normalized
    index = faiss.IndexFlatIP(dimension)
    faiss.normalize_L2(vectors)  # normalizes in place
    index.add(vectors)
    return index

index = build_vector_store(chunk_embeddings.astype('float32'))
print(f"Indexed {index.ntotal} chunks")
A flat FAISS index does exact, brute-force search, which is fast enough for datasets up to ~1M chunks on a single machine. For larger corpora, consider an approximate index such as FAISS HNSW or a managed solution like Pinecone.
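The index uses inner product, yet the results are cosine similarities. That works because the inner product of two L2-normalized vectors is their cosine. A small pure-Python check (the helper names are illustrative only, and zero vectors are not handled):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(round(cosine(a, b), 6))                           # 0.96
print(round(dot(l2_normalize(a), l2_normalize(b)), 6))  # 0.96
```

This is why both the document vectors and the query vector need the same normalization before they reach the index.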
Step 6: Retrieve Relevant Chunks
def retrieve(query, index, chunks, top_k=3):
    """Find the most relevant chunks for a query."""
    # Generate query embedding (same model as the document embeddings)
    response = client.embeddings.create(
        model="qwen-text-embedding-v3",
        input=query
    )
    query_emb = np.array([response.data[0].embedding], dtype='float32')
    faiss.normalize_L2(query_emb)
    # Search
    scores, indices = index.search(query_emb, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            continue  # FAISS pads with -1 when the index holds fewer than top_k vectors
        results.append({
            'chunk': chunks[idx],
            'score': float(score)
        })
    return results
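The scores returned by retrieve() are cosine similarities, so a simple floor can drop weak matches before they reach the LLM. A minimal sketch, using the 0.5 threshold from the production tips as a starting point; `filter_by_score` is a hypothetical helper and the sample hits are made up for illustration.

```python
def filter_by_score(results, min_score=0.5):
    """Keep only results whose similarity reaches min_score."""
    return [r for r in results if r["score"] >= min_score]

# Made-up results in the same shape retrieve() returns
hits = [
    {"chunk": "relevant passage", "score": 0.82},
    {"chunk": "weak match", "score": 0.31},
]
strong = filter_by_score(hits)
if not strong:
    print("No relevant results found.")
else:
    print(f"{len(strong)} chunk(s) passed the threshold")
```

The right threshold depends on the embedding model and the corpus, so it is worth tuning against a handful of known in-domain and out-of-domain queries.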
Step 7: Generate the Final Answer
Now we connect the retrieval pipeline to an LLM. We'll use DeepSeek V4 for generation:
def generate_answer(query, retrieved_chunks):
    """Generate a grounded answer using retrieved context."""
    context = "\n\n".join([r['chunk'] for r in retrieved_chunks])
    prompt = f"""Based on the following context, answer the question accurately.
If the context doesn't contain relevant information, say so.
Context:
{context}
Question: {query}
Answer:"""
    response = client.chat.completions.create(
        model="deepseek-v4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Answer based only on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=512
    )
    return response.choices[0].message.content
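If the primary model is unavailable, the same call can be retried against a second model. The sketch below shows one way to do that on the client side; `complete_with_fallback` is a hypothetical helper, and the `"glm-5"` model ID is an assumption (check the gateway's model list for the real identifier).

```python
def complete_with_fallback(client, messages, models=("deepseek-v4", "glm-5"), **kwargs):
    """Try each model in order and return the first successful reply.

    The "glm-5" ID is assumed for illustration; use the gateway's actual model names.
    """
    last_error = None
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model, messages=messages, **kwargs
            )
            return response.choices[0].message.content
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            last_error = exc
    raise RuntimeError("All models failed") from last_error

# Usage inside generate_answer():
# return complete_with_fallback(client, messages, temperature=0.3, max_tokens=512)
```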
Step 8: Put It All Together
def rag_pipeline(query, document):
    """Complete RAG pipeline: chunk → embed → retrieve → generate."""
    # 1. Chunk the document
    chunks = chunk_text(document)
    # 2. Create embeddings
    embeddings = create_embeddings(chunks)
    # 3. Build vector store
    index = build_vector_store(embeddings.astype('float32'))
    # 4. Retrieve relevant chunks
    retrieved = retrieve(query, index, chunks)
    # 5. Generate answer
    answer = generate_answer(query, retrieved)
    return answer, retrieved

# Run it (document is the text defined in Step 3)
answer, sources = rag_pipeline(
    query="What are the key benefits of using open-source models?",
    document=document
)
print(f"Answer: {answer}")
print(f"\nBased on {len(sources)} sources")
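Note that rag_pipeline() re-chunks and re-embeds the document on every call, which is fine for a demo but pays the embedding cost per query. One way to separate indexing from querying is sketched below. `RagSession` is a hypothetical wrapper, and the tutorial's functions are passed in as arguments so the class itself has no dependencies.

```python
class RagSession:
    """Index a document once, then answer many queries against it."""

    def __init__(self, document, chunk_fn, embed_fn, index_fn, retrieve_fn, answer_fn):
        self.chunks = chunk_fn(document)
        self.index = index_fn(embed_fn(self.chunks))  # embedding cost is paid once
        self._retrieve = retrieve_fn
        self._answer = answer_fn

    def ask(self, query):
        """Retrieve against the stored index and generate an answer."""
        retrieved = self._retrieve(query, self.index, self.chunks)
        return self._answer(query, retrieved), retrieved

# Usage with the functions from Steps 3-7:
# session = RagSession(document, chunk_text, create_embeddings,
#                      build_vector_store, retrieve, generate_answer)
# answer, sources = session.ask("What are the key benefits of using open-source models?")
```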
Performance and Cost Analysis
Here's how this RAG pipeline compares across model backends:
| Component | Chinese Model (via aiwave.live) | OpenAI Equivalent | Cost Savings |
|---|---|---|---|
| Embeddings | Qwen v3 ($0.02/M) | text-embedding-3-small ($0.02/M) | ~0% |
| Generation | DeepSeek V4 ($0.14/M) | GPT-4o ($5/M) | 97% |
| Total (1K queries/day) | ~$2/month | ~$80/month | 97% |
For a startup processing 1,000 RAG queries per day, switching to Chinese models through a unified gateway saves about $78 a month, or roughly $940 a year - with comparable quality on most tasks.
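The monthly totals can be reproduced with simple arithmetic. The sketch below assumes about 500 tokens per query, which is roughly what the table's totals imply, and a single flat price per token; both are simplifying assumptions, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(queries_per_day, tokens_per_query, price_per_million, days=30):
    """Estimated monthly spend in dollars at a flat per-token price."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens / 1_000_000 * price_per_million

deepseek = monthly_cost(1_000, 500, 0.14)
gpt4o = monthly_cost(1_000, 500, 5.00)
print(f"DeepSeek V4: ${deepseek:.2f}/month")       # $2.10/month
print(f"GPT-4o:      ${gpt4o:.2f}/month")          # $75.00/month
print(f"Savings:     {1 - deepseek / gpt4o:.0%}")  # 97%
```

With three 512-token chunks in every prompt, real usage is likely closer to 1,500 or more tokens per query, so both totals scale up while the percentage saved stays the same.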
Production Tips
- Cache embeddings: Store document embeddings in Redis or a database. Recomputing them on every request wastes money.
- Use streaming: For chat-like interfaces, stream responses with stream=True to improve perceived latency.
- Implement fallback: If the primary model is unavailable, fall back to a secondary model. An API gateway like aiwave.live handles this automatically.
- Monitor relevance scores: If your top chunk's similarity score drops below 0.5, it usually means the query is out-of-domain. Surface a "no relevant results" message instead of hallucinating.
- Batch embeddings: The API supports batch embedding requests. Use them to cut latency by 3-5x.
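As an illustration of the caching tip, the sketch below keys each embedding by model name plus a hash of the chunk text. `EmbeddingCache` is a hypothetical class, and the in-memory dict stands in for Redis or a database table.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by model name plus a hash of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: (text, model) -> vector
        self._store = {}           # swap for Redis or a database in production

    def get(self, text, model):
        """Return the cached vector, computing it only on the first request."""
        key = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text, model)
        return self._store[key]
```

Including the model name in the key keeps vectors from different embedding models from being mixed if the backend is swapped later.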
Conclusion
Building a RAG system with Chinese AI models gives you production-quality retrieval at a fraction of the cost. The OpenAI-compatible API means you can swap your backend with minimal changes to application code - in most cases just the base URL and the model names.
The full pipeline we built handles document ingestion, semantic search, and grounded generation in under 100 lines of Python. Scale it by adding a proper vector database, caching layer, and multi-model fallback strategy.
If you found this helpful, check out aiwave.live for a unified API to 50+ Chinese AI models - all OpenAI-compatible, starting at $5 for 1.5M tokens.