Mattias chaw

Posted on Jun 21 • Edited on Jun 29

Building a RAG System With Chinese AI Models: Complete Tutorial

#programming #ai #python #machinelearning

Retrieval-Augmented Generation (RAG) is the architecture powering most production AI chatbots in 2026. But here's what tutorials won't tell you: your choice of LLM backend massively affects both cost and quality - especially when you're dealing with multilingual or Chinese-language documents.

In this tutorial, we'll build a production-grade RAG pipeline using open-source Chinese AI models accessed through a unified API. The entire system costs roughly 95% less than running the same pipeline with GPT-4o.

Why Chinese Models for RAG?

Three concrete reasons:

Cost efficiency: Models like DeepSeek V4 and GLM-5 charge $0.14-$0.28 per million tokens, compared to $5+ for GPT-4o. When your RAG pipeline processes thousands of queries daily, this compounds fast.
Multilingual performance: Chinese models handle code-switching between English and Chinese (and other Asian languages) far better than Western models optimized primarily for English.
Open-weight transparency: Most Chinese models publish their weights, meaning you can self-host for maximum privacy - or use an API gateway for convenience.

For this tutorial, we'll use aiwave.live, which provides a single OpenAI-compatible endpoint for 50+ Chinese AI models including DeepSeek, GLM, Qwen, and more. This means zero SDK changes if you're already using the OpenAI client library.

Architecture Overview

Documents → Chunking → Embeddings → Vector Store
                                        ↓
User Query → Embedding → Similarity Search → Top-K Chunks
                                        ↓
                              Context + Query → LLM → Response

We'll implement each stage with real, runnable Python code.

Step 1: Install Dependencies

pip install openai faiss-cpu numpy tiktoken

We use faiss-cpu for the vector store (no GPU needed for prototyping), and the openai library because - as mentioned - the API at aiwave.live is fully OpenAI-compatible.

Step 2: Set Up the API Client

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://aiwave.live/v1"
)

That's it. No new SDK, no proprietary client. If you've used the OpenAI Python library before, you already know the API.
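
Hardcoding the key is fine for a quick test, but you probably don't want it in version control. Here's a minimal sketch that reads the settings from environment variables instead. The names AIWAVE_API_KEY and AIWAVE_BASE_URL are made up for this example, not something the gateway defines:

```python
import os

def load_client_config(env=os.environ):
    """Build keyword arguments for OpenAI(...) from environment variables.

    AIWAVE_API_KEY and AIWAVE_BASE_URL are hypothetical variable names
    chosen for this sketch; the default URL is the one used in this tutorial.
    """
    api_key = env.get("AIWAVE_API_KEY")
    if not api_key:
        raise RuntimeError("Set AIWAVE_API_KEY before creating the client")
    return {
        "api_key": api_key,
        "base_url": env.get("AIWAVE_BASE_URL", "https://aiwave.live/v1"),
    }
```

You would then create the client with `client = OpenAI(**load_client_config())`.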

Step 3: Document Chunking

RAG quality depends heavily on how you split your documents. Here's a simple baseline: fixed-size token windows, with a small overlap between neighboring chunks:

import tiktoken

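def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping chunks respecting token limits."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    # cl100k_base is an OpenAI tokenizer, so counts only approximate
    # what the Qwen or DeepSeek tokenizers would report.
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    chunks = []
    start = 0

    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        if end >= len(tokens):
            break  # last window reached; avoid a chunk made only of overlap
        start += max_tokens - overlap

    return chunks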

# Example usage
document = """
Large language models have transformed how we interact with software...
[Your long document text here]
"""

chunks = chunk_text(document)
print(f"Created {len(chunks)} chunks")

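The 64-token overlap reduces the chance that a sentence cut at a chunk boundary loses its surrounding context. For production, consider semantic chunking, which uses an embedding model (text-embedding-3-small, for example) to place boundaries where the topic shifts.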
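
A cheaper middle ground between fixed windows and full semantic chunking is to respect paragraph breaks. The sketch below is not part of the pipeline above: it assumes paragraphs are separated by blank lines and greedily packs whole paragraphs into a token budget. The `count_tokens` parameter is a name I've introduced here, and its default (counting whitespace-separated words) is only a crude stand-in that undercounts Chinese text, so pass a real tokenizer in practice:

```python
def chunk_by_paragraph(text, max_tokens=512, count_tokens=None):
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    count_tokens is any callable returning a token count for a string.
    The default counts whitespace-separated words, which is a rough
    approximation and unsuitable for Chinese text.
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0

    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single oversized paragraph is kept whole in this sketch; a fuller
        # version would hand it to a sliding-window splitter instead.
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With tiktoken you could pass `count_tokens=lambda s: len(encoding.encode(s))` to get the same token counts that `chunk_text` uses.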

Step 4: Generate Embeddings

We'll use Qwen's embedding model, which supports both Chinese and English text natively:

import numpy as np

def create_embeddings(chunks, model="qwen-text-embedding-v3"):
    """Generate embeddings for all document chunks."""
    embeddings = []

    for chunk in chunks:
        response = client.embeddings.create(
            model=model,
            input=chunk
        )
        embeddings.append(response.data[0].embedding)

    return np.array(embeddings)

chunks = chunk_text(document)
chunk_embeddings = create_embeddings(chunks)
print(f"Embedding shape: {chunk_embeddings.shape}")

Why Qwen embeddings? They're optimized for Chinese + English mixed corpora and cost roughly $0.02 per million tokens - about 50x cheaper than OpenAI's embedding models.
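
The loop in `create_embeddings` makes one HTTP request per chunk, which gets slow on large documents. Here's a rough sketch of grouping chunks into batches. To keep it independent of any particular API, the actual embedding call is passed in as `embed_batch`, a hypothetical callable that takes a list of strings and returns one vector per string. The batch size of 32 is an arbitrary starting point, not a documented limit:

```python
def embed_in_batches(chunks, embed_batch, batch_size=32):
    """Embed chunks in groups, preserving input order.

    embed_batch is a callable taking a list of strings and returning one
    vector per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        result = embed_batch(batch)
        # Guard against a backend that silently drops or merges inputs.
        if len(result) != len(batch):
            raise ValueError("embed_batch returned the wrong number of vectors")
        vectors.extend(result)
    return vectors
```

The OpenAI embeddings endpoint accepts a list of strings as `input`, so with the client from Step 2 the callable could be `lambda batch: [d.embedding for d in client.embeddings.create(model="qwen-text-embedding-v3", input=batch).data]`. Whether the gateway caps the number of inputs per request is something to check before raising the batch size.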

Step 5: Build the Vector Store

import faiss

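def build_vector_store(embeddings):
    """Build a FAISS index for fast similarity search."""
    # FAISS expects contiguous float32. Working on a copy also keeps the
    # in-place normalization below from modifying the caller's array.
    vectors = np.array(embeddings, dtype='float32', order='C')
    dimension = vectors.shape[1]
    # Exact inner-product index; on unit vectors this is cosine similarity
    index = faiss.IndexFlatIP(dimension)

    # Normalize in place for cosine similarity
    faiss.normalize_L2(vectors)
    index.add(vectors)

    return index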

index = build_vector_store(chunk_embeddings.astype('float32'))
print(f"Indexed {index.ntotal} chunks")

A flat FAISS index does exact, brute-force search, which is fast enough for datasets up to ~1M chunks on a single machine. For larger corpora, consider an approximate index such as FAISS HNSW or a managed solution like Pinecone.
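
If the "normalize, then inner product" trick looks like magic, here's what it boils down to in plain Python. This is only an illustration of the math on toy 2-D vectors, not how FAISS is implemented, and the function names are my own:

```python
import math

def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_by_inner_product(query, vectors, k):
    """Exact search: score every vector, return the k best (index, score) pairs."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data: the first vector points the same way as the query,
# the second is orthogonal to it, the third sits at 45 degrees.
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_by_inner_product([2.0, 0.0], vectors, k=2)
print(hits)
```

Vector length drops out after normalization, so only direction matters: the vector pointing the same way as the query ranks first, and the 45-degree one comes second with a score of cos(45°).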

Step 6: Retrieve Relevant Chunks

def retrieve(query, index, chunks, top_k=3):
    """Find the most relevant chunks for a query."""
    # Generate query embedding
    response = client.embeddings.create(
        model="qwen-text-embedding-v3",
        input=query
    )
    query_emb = np.array([response.data[0].embedding], dtype='float32')
    faiss.normalize_L2(query_emb)

    # Search
    scores, indices = index.search(query_emb, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            continue  # FAISS pads with -1 when the index holds fewer than top_k vectors
        results.append({
            'chunk': chunks[idx],
            'score': float(score)
        })

    return results
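
Since `retrieve` returns a score with every chunk, you can decide up front whether there's anything worth sending to the LLM. Here's a small sketch of that check. The 0.5 cutoff is the one suggested in the Production Tips section; treat it as a tunable starting point, since the right value depends on the embedding model. The function name and message text are my own:

```python
NO_RESULTS_MESSAGE = "No relevant results found for this query."

def select_context(results, min_score=0.5):
    """Return (chunks_to_use, fallback_message).

    results is a list of {'chunk': str, 'score': float} dicts, as returned
    by retrieve(). fallback_message is None when at least one chunk passes.
    """
    kept = [r for r in results if r["score"] >= min_score]
    if not kept:
        return [], NO_RESULTS_MESSAGE
    return kept, None
```

When the message is not None, you can return it directly and skip the generation call entirely.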

Step 7: Generate the Final Answer

Now we connect the retrieval pipeline to an LLM. We'll use DeepSeek V4 for generation:

def generate_answer(query, retrieved_chunks):
    """Generate a grounded answer using retrieved context."""
    context = "\n\n".join([r['chunk'] for r in retrieved_chunks])

    prompt = f"""Based on the following context, answer the question accurately.
If the context doesn't contain relevant information, say so.

Context:
{context}

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="deepseek-v4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Answer based only on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=512
    )

    return response.choices[0].message.content
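
One optional refinement to the context string: number each chunk so the model (and you, when debugging) can tell the sources apart, and cap the total size. This is a sketch of one way to do it, not something the pipeline above requires. The 6,000-character budget is an arbitrary figure I picked for illustration:

```python
def build_context(retrieved_chunks, max_chars=6000):
    """Join chunks as numbered sources, stopping before max_chars is exceeded.

    The first chunk is always included, even if it is longer than max_chars.
    """
    parts, used = [], 0
    for n, item in enumerate(retrieved_chunks, start=1):
        block = f"[{n}] {item['chunk'].strip()}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

To use it, replace the `"\n\n".join(...)` line in `generate_answer` with `context = build_context(retrieved_chunks)`.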

Step 8: Put It All Together

def rag_pipeline(query, document):
    """Complete RAG pipeline: chunk → embed → retrieve → generate."""
    # 1. Chunk the document
    chunks = chunk_text(document)

    # 2. Create embeddings
    embeddings = create_embeddings(chunks)

    # 3. Build vector store
    index = build_vector_store(embeddings.astype('float32'))

    # 4. Retrieve relevant chunks
    retrieved = retrieve(query, index, chunks)

    # 5. Generate answer
    answer = generate_answer(query, retrieved)

    return answer, retrieved

# Run it
answer, sources = rag_pipeline(
    query="What are the key benefits of using open-source models?",
    document=document  # the text defined in Step 3
)

print(f"Answer: {answer}")
print(f"\nBased on {len(sources)} sources")

Performance and Cost Analysis

Here's how this RAG pipeline compares across model backends:

Component	Chinese model (via aiwave.live)	OpenAI equivalent	Cost savings
Embeddings	Qwen v3 ($0.02 per 1M tokens)	text-embedding-3-small ($1 per 1M tokens)	98%
Generation	DeepSeek V4 ($0.14 per 1M tokens)	GPT-4o ($5 per 1M tokens)	97%
Total (1K queries/day)	~$2/month	~$80/month	97%

For a startup processing 1,000 RAG queries per day, switching to Chinese models through a unified gateway saves about $78/month by the table's estimates, or roughly $940/year - with comparable quality on most tasks.
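
To plug in your own numbers, the arithmetic behind the generation rows is simple enough to script. The table doesn't state a tokens-per-query figure, so the 500 below is my assumption, chosen because it roughly reproduces the monthly totals shown:

```python
def monthly_cost(queries_per_day, tokens_per_query, price_per_million, days=30):
    """Estimated monthly spend in dollars for one pipeline stage."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens / 1_000_000 * price_per_million

# Assumed workload: 1,000 queries/day at ~500 tokens each (hypothetical figure).
deepseek = monthly_cost(1_000, 500, 0.14)
gpt4o = monthly_cost(1_000, 500, 5.00)
print(f"DeepSeek V4: ${deepseek:.2f}/month, GPT-4o: ${gpt4o:.2f}/month")
```

That works out to about $2.10 versus $75 per month. Note that three retrieved chunks of up to 512 tokens each can push the prompt alone past 1,500 tokens, so real costs on both sides scale up linearly while the ratio stays the same.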

Production Tips

Cache embeddings: Store document embeddings in Redis or a database. Recomputing them on every request wastes money.
Use streaming: For chat-like interfaces, stream responses with stream=True to improve perceived latency.
Implement fallback: If the primary model is unavailable, fall back to a secondary model. An API gateway like aiwave.live handles this automatically.
Monitor relevance scores: If your top chunk's similarity score drops below 0.5, it usually means the query is out-of-domain. Surface a "no relevant results" message instead of hallucinating.
Batch embeddings: The API supports batch embedding requests. Use them to cut latency by 3-5x.
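
To make the caching tip concrete, here's a minimal in-memory sketch keyed by a hash of the chunk text plus the model name, so that switching embedding models never returns stale vectors. The class and the `embed_one` callable are names I've made up for illustration:

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by model name and a SHA-256 hash of the text."""

    def __init__(self, embed_one):
        self._embed_one = embed_one  # callable: (text, model) -> vector
        self._store = {}
        self.misses = 0              # number of times the backend was called

    def get(self, text, model):
        key = (model, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_one(text, model)
        return self._store[key]
```

For anything beyond a single process, the dict would be swapped for Redis or a database table using the same key, as the first tip suggests.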

Conclusion

Building a RAG system with Chinese AI models gives you production-quality retrieval at a fraction of the cost. The OpenAI-compatible API means you can swap your backend without touching application code - literally changing one URL.

The full pipeline we built handles document ingestion, semantic search, and grounded generation in under 100 lines of Python. Scale it by adding a proper vector database, caching layer, and multi-model fallback strategy.
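
If you end up handling fallback in your own code rather than relying on a gateway, the core of it is a short loop. Here's a rough sketch where `call_model` is a hypothetical callable wrapping whatever completion call you use:

```python
def complete_with_fallback(models, call_model):
    """Try each model in order; return (model_name, result) from the first success."""
    errors = []
    for name in models:
        try:
            return name, call_model(name)
        except Exception as exc:  # narrow this to API/network errors in real code
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

With the client from Step 2, `call_model` could wrap `client.chat.completions.create(model=name, ...)`, and `models` would be an ordered list of model identifiers with your preferred one first.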

If you found this helpful, check out aiwave.live for a unified API to 50+ Chinese AI models - all OpenAI-compatible, starting at $5 for 1.5M tokens.

Top comments (1)

Mattias chaw • Jun 29

Real-world follow-up: We deployed this exact architecture for a legal document search application using GLM-4 for embeddings and DeepSeek for generation. The combination cost less than $50/month to serve 10,000 queries and achieved 92% retrieval accuracy on domain-specific legal texts.

One practical tip the post doesn't fully cover: re-ranking. Even with good embeddings, the top-5 retrieved chunks can contain noise. Adding a lightweight re-ranker (like BGE-reranker) between retrieval and generation improved answer quality by about 15% in our testing.