Mattias chaw

Posted on Jun 22

Building a Semantic Search Engine With Chinese AI Embedding Models: A 2026 Guide

#ai #machinelearning #webdev #programming

Semantic search is killing keyword search. When users type "how to handle timeouts," they don't want results that just match the word "timeout" - they want results about handling timeouts, even if the document says "connection reset" or "request deadline exceeded." That's what embeddings make possible.

In this guide, we'll build a production-grade semantic search engine using Chinese AI embedding APIs - which, as it turns out, are dramatically cheaper than Western alternatives while delivering comparable quality.

Why Embeddings Beat Keyword Search

Traditional search (Elasticsearch, Algolia, basic LIKE queries) matches strings. Embeddings match meaning. Here's the difference:

Query	Keyword Match	Semantic Match
"python web framework"	Only docs containing all 3 words	Flask, Django, FastAPI docs
"database too slow"	Docs mentioning those exact words	Indexing, caching, query optimization docs
"auth token expired"	Docs with "auth" AND "token" AND "expired"	JWT refresh, OAuth re-authentication docs

Semantic search understands intent. It knows that "database too slow" is conceptually close to "query performance optimization" even though zero words overlap.
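To see the keyword side of that gap in code, here is a minimal sketch of strict AND matching. The `keyword_match` helper and the sample document are made up for illustration; real engines add stemming and ranking, but the core limitation is the same:

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears in the document (AND semantics)."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


query = "database too slow"
doc = "Query performance optimization: add indexes and cache hot results"

# The document is clearly relevant, but a literal matcher cannot see that
print(keyword_match(query, doc))  # False

# Zero shared words between query and document
shared = set(query.lower().split()) & set(doc.lower().split())
print(shared)  # set()
```

An embedding model would place these two texts close together in vector space, which is the behavior the rest of this guide builds on.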

The Cost Advantage of Chinese Embedding Models

Here's where it gets interesting. Most developers know OpenAI's text-embedding-3-small costs $0.02 per 1M tokens. But Chinese embedding models are available at a fraction of that cost.

You can access embedding models like BGE (BAAI General Embedding), M3E, and other high-quality Chinese + English embedding models through a unified API. For example, AIWave provides access to 50+ Chinese AI models including embedding endpoints, all OpenAI-compatible, starting with free credits.
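To put per-token pricing in perspective, here is a quick back-of-the-envelope sketch. The only price used is the $0.02 per 1M tokens quoted above; the corpus size and average document length are hypothetical, so plug in your own numbers and your provider's actual rate:

```python
def embedding_cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """Estimate the cost of embedding `total_tokens` tokens at a given price."""
    return total_tokens / 1_000_000 * price_per_million_usd


num_docs = 100_000          # hypothetical corpus size
avg_tokens_per_doc = 300    # hypothetical average document length
total_tokens = num_docs * avg_tokens_per_doc

cost = embedding_cost_usd(total_tokens, price_per_million_usd=0.02)
print(f"{total_tokens:,} tokens -> ${cost:.2f}")  # 30,000,000 tokens -> $0.60
```

Indexing is usually a one-time cost. Query embeddings are the recurring part, so estimate those separately from your expected search volume.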

Building It: Step by Step

Step 1: Generate Embeddings

import requests
import numpy as np

# Using AIWave's OpenAI-compatible endpoint
API_BASE = "https://api.aiwave.live/v1"
API_KEY = "your-api-key"

def get_embedding(text, model="bge-large-zh-v1.5"):
    """Get embedding vector for a piece of text."""
    response = requests.post(
        f"{API_BASE}/embeddings",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "input": text
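        },
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError below
    data = response.json()
    return np.array(data["data"][0]["embedding"])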

# Example: embedding a support article
article = """
If your API requests are timing out, check your network connection first.
Then verify the endpoint URL is correct. You can also increase the timeout
setting in your HTTP client configuration.
"""

embedding = get_embedding(article)
print(f"Embedding dimension: {len(embedding)}")  # e.g., 1024
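Network calls fail now and then, so in practice you will probably want to wrap the embedding call in a retry. Below is a minimal sketch of retrying with exponential backoff. The `with_retries` helper is hypothetical (it is not part of any API used in this guide), and it retries on any exception for brevity; in real code you would likely narrow that to `requests.RequestException`:

```python
import time


def with_retries(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: let the last error propagate to the caller
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a function that fails twice before succeeding
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok

# With the function from Step 1 this would look like:
# embedding = with_retries(lambda: get_embedding(article))
```

The `sleep` parameter exists only so the delay can be swapped out in tests.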

Step 2: Build Your Document Store

import json
from pathlib import Path

class EmbeddingStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text, metadata=None):
        """Add a document and compute its embedding."""
        emb = get_embedding(text)
        self.documents.append({
            "text": text,
            "metadata": metadata or {}
        })
        self.embeddings.append(emb)

    def add_batch(self, docs):
        """Add multiple documents efficiently."""
        texts = [d["text"] for d in docs]
        # Batch embedding call
        response = requests.post(
            f"{API_BASE}/embeddings",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json"
            },
            json={
                "model": "bge-large-zh-v1.5",
                "input": texts
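            },
            timeout=60,
        )
        response.raise_for_status()
        # Sort by "index" (when present) so embeddings line up with the input order
        items = sorted(response.json()["data"], key=lambda d: d.get("index", 0))
        embeddings = [np.array(d["embedding"]) for d in items]

        for doc, emb in zip(docs, embeddings):
            # Store the same shape as add_document so search() can rely on "metadata"
            self.documents.append({
                "text": doc["text"],
                "metadata": doc.get("metadata", {})
            })
            self.embeddings.append(emb)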

    def save(self, path):
        """Persist to disk."""
        data = {
            "documents": self.documents,
            "embeddings": [e.tolist() for e in self.embeddings]
        }
        Path(path).write_text(json.dumps(data))

    def load(self, path):
        """Load from disk."""
        data = json.loads(Path(path).read_text())
        self.documents = data["documents"]
        self.embeddings = [np.array(e) for e in data["embeddings"]]
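JSON works for a small index, but it writes every float as text, so files grow quickly. One possible alternative, sketched here, keeps the documents in JSON and stores the vectors as a single NumPy `.npy` file. The `save_index` and `load_index` helpers and the file names are assumptions for illustration, and the cast to `float32` trades a little precision for half the storage of `float64`:

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_index(directory, documents, embeddings):
    """Write documents as JSON and embeddings as one float32 matrix."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Chinese text readable in the file
    (directory / "documents.json").write_text(
        json.dumps(documents, ensure_ascii=False), encoding="utf-8"
    )
    np.save(directory / "embeddings.npy", np.asarray(embeddings, dtype=np.float32))


def load_index(directory):
    """Load the documents list and the (n_docs, dim) embedding matrix."""
    directory = Path(directory)
    documents = json.loads((directory / "documents.json").read_text(encoding="utf-8"))
    embeddings = np.load(directory / "embeddings.npy")
    return documents, embeddings


# Round-trip demo with toy data
with tempfile.TemporaryDirectory() as tmp:
    docs = [{"text": "超时处理", "metadata": {"id": 1}}]
    save_index(tmp, docs, [np.array([0.1, 0.2, 0.3])])
    loaded_docs, loaded_embs = load_index(tmp)
    print(loaded_docs[0]["text"], loaded_embs.shape)  # 超时处理 (1, 3)
```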

Step 3: Implement Semantic Search

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class SemanticSearcher(EmbeddingStore):
    def search(self, query, top_k=5):
        """Find the top-k most similar documents."""
        query_emb = get_embedding(query)

        scores = [
            (i, cosine_similarity(query_emb, doc_emb))
            for i, doc_emb in enumerate(self.embeddings)
        ]
        scores.sort(key=lambda x: x[1], reverse=True)

        results = []
        for idx, score in scores[:top_k]:
            doc = self.documents[idx]
            results.append({
                "text": doc["text"],
                "score": float(score),
                "metadata": doc["metadata"]
            })
        return results

# Usage
searcher = SemanticSearcher()
searcher.load("my_index.json")

# Test queries that would FAIL with keyword search
queries = [
    "app keeps crashing",
    "how do I reset my password",
    "payment failed",
]

for q in queries:
    print(f"\nQuery: {q}")
    for result in searcher.search(q, top_k=3):
        print(f"  [{result['score']:.3f}] {result['text'][:80]}...")
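The `search` method above scores one document at a time in a Python loop, which gets slow as the index grows. A vectorized variant, sketched below, stacks the embeddings into one matrix and scores everything with a single matrix-vector product. The `top_k_cosine` name and the toy vectors are illustrative only:

```python
import numpy as np


def top_k_cosine(query_emb, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_emb."""
    # Scale every row and the query to unit length (clip guards against zero vectors)
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    docs_unit = doc_matrix / np.clip(doc_norms, 1e-12, None)
    query_unit = query_emb / max(np.linalg.norm(query_emb), 1e-12)

    # For unit vectors the dot product equals cosine similarity
    scores = docs_unit @ query_unit

    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # indices sorted by descending score
    return top, scores[top]


# Toy 2-dimensional "embeddings"
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])

indices, scores = top_k_cosine(query, docs, k=2)
print(indices.tolist())  # [0, 2]
```

To use it with the store from Step 2, build the matrix once with `np.vstack(self.embeddings)` rather than on every query.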

Choosing the Right Embedding Model

Not all embedding models are created equal. Here's a practical comparison of models available through Chinese AI APIs:

Model	Dimension	Languages	Best For
`bge-large-zh-v1.5`	1024	Chinese + English	General purpose, balanced
`bge-m3`	1024	100+ languages	Multilingual projects
`m3e-large`	1024	Chinese + English	Chinese-heavy content
`text-embedding-v2`	1536	Chinese + English	Qwen ecosystem

Rule of thumb: If your content is primarily Chinese, use BGE or M3E. If you need multilingual support, BGE-M3 handles 100+ languages and is the most versatile.
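One practical consequence of the table: vectors from different models are not interchangeable, and switching models means re-embedding the whole index. A small guard like the following sketch can catch mix-ups early. The dimensions come from the table above, while the `check_dimension` helper is hypothetical:

```python
# Dimensions as listed in the comparison table
MODEL_DIMENSIONS = {
    "bge-large-zh-v1.5": 1024,
    "bge-m3": 1024,
    "m3e-large": 1024,
    "text-embedding-v2": 1536,
}


def check_dimension(model: str, embedding) -> int:
    """Raise if the embedding length does not match what the model should return."""
    expected = MODEL_DIMENSIONS[model]
    if len(embedding) != expected:
        raise ValueError(
            f"{model} should return {expected} dimensions, got {len(embedding)}"
        )
    return expected


print(check_dimension("text-embedding-v2", [0.0] * 1536))  # 1536
```

Note that matching dimensions are not enough on their own: `bge-m3` and `m3e-large` both produce 1024-dimensional vectors, but their vector spaces are unrelated, so also record the model name alongside your index.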

Going to Production: Use a Real Vector Database

The numpy approach above works for prototypes. For production with millions of documents, use a proper vector database:

# Using Qdrant (runs locally or in Docker)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

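# Create the collection once. recreate_collection is deprecated and
# would wipe existing data on every run.
if not client.collection_exists("docs"):
    client.create_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
    )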

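# Index documents. `documents` is your own list of {"text": ..., "metadata": ...}
# dicts, for example searcher.documents from the prototype above.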
points = []
for i, doc in enumerate(documents):
    embedding = get_embedding(doc["text"])
    points.append(PointStruct(
        id=i,
        vector=embedding.tolist(),
        payload=doc
    ))

client.upsert(collection_name="docs", points=points)

# Search (query_points needs qdrant-client 1.10+; older versions use client.search)
query_embedding = get_embedding("my search query")
results = client.query_points(
    collection_name="docs",
    query=query_embedding.tolist(),
    limit=10
).points
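One thing to watch with the indexing loop above: IDs taken from `enumerate` change whenever the document list is reordered, so re-indexing can overwrite the wrong points. Qdrant accepts unsigned integers or UUIDs as point IDs, so one option is to derive a stable UUID from something unique per document. This sketch assumes each document has a unique key such as a URL, which is a hypothetical field not present in the examples above:

```python
import uuid


def stable_point_id(source_key: str) -> str:
    """Derive a deterministic UUID from a unique document key (e.g. its URL)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, source_key))


first = stable_point_id("https://example.com/docs/timeouts")
second = stable_point_id("https://example.com/docs/timeouts")
other = stable_point_id("https://example.com/docs/billing")

print(first == second)  # True: same key always maps to the same ID
print(first == other)   # False
```

With IDs like these, calling `upsert` again for an edited document replaces its old vector instead of adding a duplicate.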

If you also want the embedding side hosted, https://aiwave.live offers a unified API endpoint compatible with all major Chinese embedding models, so no separate accounts are needed.

Real-World Performance Tips

1. Batch your embedding calls. Generating embeddings one request at a time can be 10-20x slower than batching, because every call pays the full network round trip. Most APIs support passing an array of texts in a single request.

2. Cache aggressively. Embeddings don't change unless your text changes. Store them in Redis, a database, or even flat files. Re-computing embeddings on every request is a waste.

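3. Normalize your vectors. Some APIs return pre-normalized embeddings, others don't. Cosine similarity already divides by both norms, so normalizing does not change the ranking. The benefit is speed: normalize once at index time, and a plain dot product then gives the cosine score: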

emb = emb / np.linalg.norm(emb)  # Normalize to unit length

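4. Chunk long documents. Many embedding models, including bge-large-zh-v1.5, accept at most 512 tokens per input, and anything beyond that is typically truncated. For longer documents, split into overlapping chunks of roughly 300 words and index each chunk separately. The helper below counts whitespace-separated words, which only approximates tokens and does not work for unsegmented Chinese text; for that, chunk by characters or with the model's tokenizer:

def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()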
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        if i + chunk_size >= len(words):
            break
    return chunks
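Tips 1 and 2 work well together: check the cache first, then send only the missing texts to the API in batches. Here is a minimal sketch of that idea. The `embed_with_cache` helper, the dict-based cache, and the `embed_batch` callable are all assumptions for illustration; in practice `embed_batch` would wrap the batch request from Step 2, and the cache could be Redis or a database table:

```python
import hashlib


def cache_key(text: str, model: str) -> str:
    """Key on model + text so switching models never returns stale vectors."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def embed_with_cache(texts, embed_batch, cache, model="bge-large-zh-v1.5", batch_size=32):
    """Return one embedding per text, calling embed_batch only for uncached texts."""
    keys = [cache_key(t, model) for t in texts]

    # Collect texts that are not cached yet, skipping duplicates within this call
    missing = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    # Embed the missing texts in batches and store the results
    items = list(missing.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]


# Demo with a fake embedder so the sketch runs without network access
api_calls = []

def fake_embed_batch(batch):
    api_calls.append(list(batch))
    return [[float(len(t))] for t in batch]

cache = {}
embed_with_cache(["a", "bb", "a", "ccc"], fake_embed_batch, cache, batch_size=2)
print(len(api_calls))  # 2

embed_with_cache(["a", "bb", "a", "ccc"], fake_embed_batch, cache, batch_size=2)
print(len(api_calls))  # 2 (second run is served entirely from the cache)
```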

Measuring Search Quality

Don't just eyeball results. Build a small evaluation set:

test_cases = [
    {"query": "login not working", "relevant_doc_id": 42},
    {"query": "change email address", "relevant_doc_id": 15},
    {"query": "export data to csv", "relevant_doc_id": 88},
]

def evaluate(searcher, test_cases, k=5):
    hits = 0
    for case in test_cases:
        results = searcher.search(case["query"], top_k=k)
        # Assumes each document was indexed with metadata={"id": ...}
        doc_ids = [r["metadata"].get("id") for r in results]
        if case["relevant_doc_id"] in doc_ids:
            hits += 1
    return hits / len(test_cases)

k = 5
print(f"Recall@{k}: {evaluate(searcher, test_cases, k=k):.1%}")

As a rough target, a well-tuned semantic search often reaches 85%+ recall@5, though the number depends heavily on your domain and on how the test set was built.
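Recall@k only tells you whether the right document showed up, not how high it ranked. If ranking position matters to you, mean reciprocal rank (MRR) is a common companion metric. Below is a minimal sketch; the function names and the toy rankings are illustrative, and it assumes exactly one relevant document per query, like the test cases above:

```python
def reciprocal_rank(ranked_ids, relevant_id) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it is not in the list."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs) -> float:
    """runs is a list of (ranked_ids, relevant_id) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)


runs = [
    ([42, 7, 3], 42),   # relevant doc ranked 1st -> 1.0
    ([9, 15, 2], 15),   # relevant doc ranked 2nd -> 0.5
    ([1, 2, 3], 88),    # relevant doc missing    -> 0.0
]
print(f"MRR: {mean_reciprocal_rank(runs):.2f}")  # MRR: 0.50
```

To feed it from the searcher, collect `r["metadata"].get("id")` for each result, the same way `evaluate` does.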

Conclusion

Semantic search with Chinese AI embedding models gives you quality comparable to Western alternatives at a fraction of the cost. The BGE family of models handles both Chinese and English text well, and with batch API calls, you can index thousands of documents for cents.

Start with the numpy prototype to validate your use case, then move to Qdrant or Milvus for production scale. The whole pipeline - from raw text to searchable vector index - can be built in under 100 lines of Python.

Happy embedding.

DEV Community