Semantic search is steadily displacing pure keyword search. When users type "how to handle timeouts," they don't want results that just match the word "timeout" - they want results about handling timeouts, even if the document says "connection reset" or "request deadline exceeded." That's what embeddings make possible.
In this guide, we'll build a production-grade semantic search engine using Chinese AI embedding APIs - which, as it turns out, are dramatically cheaper than Western alternatives while delivering comparable quality.
Why Embeddings Beat Keyword Search
Traditional keyword search (the default text matching in Elasticsearch or Algolia, basic LIKE queries) matches strings. Embeddings match meaning. Here's the difference:
| Query | Keyword Match | Semantic Match |
|---|---|---|
| "python web framework" | Only docs containing all 3 words | Flask, Django, FastAPI docs |
| "database too slow" | Docs mentioning those exact words | Indexing, caching, query optimization docs |
| "auth token expired" | Docs with "auth" AND "token" AND "expired" | JWT refresh, OAuth re-authentication docs |
Semantic search understands intent. It knows that "database too slow" is conceptually close to "query performance optimization" even though zero words overlap.
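To make the "zero words overlap" point concrete, here is a minimal sketch of what a strict keyword matcher sees for that pair of phrases. The `keyword_overlap` helper and the sample document string are made up for illustration; real keyword engines add stemming and ranking on top of this.

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the lowercase words shared by the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())


query = "database too slow"
# Hypothetical document title, chosen to mirror the example in the text
doc = "Query performance optimization with indexes and caching"

shared = keyword_overlap(query, doc)
print(shared)  # an empty set: no shared words, so a strict keyword match finds nothing
```

An embedding model is what lets these two strings end up close together despite sharing no terms.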
The Cost Advantage of Chinese Embedding Models
Here's where it gets interesting. Most developers know OpenAI's text-embedding-3-small costs $0.02 per 1M tokens. But Chinese embedding models are available at a fraction of that cost.
You can access embedding models like BGE (BAAI General Embedding), M3E, and other high-quality Chinese + English embedding models through a unified API. For example, AIWave provides access to 50+ Chinese AI models including embedding endpoints, all OpenAI-compatible, starting with free credits.
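To put per-token pricing in perspective, the arithmetic is simple enough to sketch. The corpus size and average document length below are hypothetical; the only price taken from this article is the $0.02 per 1M tokens quoted for text-embedding-3-small, so plug in your own provider's rate to compare.

```python
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       price_per_million_tokens: float) -> float:
    """Estimate the one-time cost of embedding a corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 100,000 documents of roughly 500 tokens each
cost = embedding_cost_usd(100_000, 500, price_per_million_tokens=0.02)
print(f"Estimated cost: ${cost:.2f}")
```

Because embeddings are computed once per document and then stored, this is mostly an indexing cost. Query-time cost is one short embedding per search.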
Building It: Step by Step
Step 1: Generate Embeddings
```python
import requests
import numpy as np

# Using AIWave's OpenAI-compatible endpoint
API_BASE = "https://api.aiwave.live/v1"
API_KEY = "your-api-key"

def get_embedding(text, model="bge-large-zh-v1.5"):
    """Get embedding vector for a piece of text."""
    response = requests.post(
        f"{API_BASE}/embeddings",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "input": text
        },
        timeout=30,  # fail fast instead of hanging on a stalled connection
    )
    response.raise_for_status()  # surface HTTP errors (bad key, rate limit) early
    data = response.json()
    return np.array(data["data"][0]["embedding"])

# Example: embedding a support article
article = """
If your API requests are timing out, check your network connection first.
Then verify the endpoint URL is correct. You can also increase the timeout
setting in your HTTP client configuration.
"""
embedding = get_embedding(article)
print(f"Embedding dimension: {len(embedding)}")  # e.g., 1024
```
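Network calls to any embedding API can fail transiently, so it may be worth wrapping them in a retry loop. The sketch below is one way to do that with exponential backoff. `call_with_retries` and its parameters are illustrative names, not part of any API mentioned here, and the `sleep` argument exists only so the helper can be tested without waiting.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=0.5,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With the function from Step 1, usage could look like `call_with_retries(lambda: get_embedding(article), retry_on=(requests.RequestException,))`, which retries only on network and HTTP errors rather than on every exception.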
Step 2: Build Your Document Store
```python
import json
from pathlib import Path

class EmbeddingStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text, metadata=None):
        """Add a document and compute its embedding."""
        emb = get_embedding(text)
        self.documents.append({
            "text": text,
            "metadata": metadata or {}
        })
        self.embeddings.append(emb)

    def add_batch(self, docs):
        """Add multiple documents with a single embedding request."""
        texts = [d["text"] for d in docs]
        # Batch embedding call
        response = requests.post(
            f"{API_BASE}/embeddings",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json"
            },
            json={
                "model": "bge-large-zh-v1.5",
                "input": texts
            },
            timeout=60,
        )
        response.raise_for_status()
        embeddings = [np.array(d["embedding"]) for d in response.json()["data"]]
        for doc, emb in zip(docs, embeddings):
            # Store the same shape as add_document so search() can rely on "metadata"
            self.documents.append({
                "text": doc["text"],
                "metadata": doc.get("metadata", {})
            })
            self.embeddings.append(emb)

    def save(self, path):
        """Persist to disk."""
        data = {
            "documents": self.documents,
            "embeddings": [e.tolist() for e in self.embeddings]
        }
        Path(path).write_text(json.dumps(data))

    def load(self, path):
        """Load from disk."""
        data = json.loads(Path(path).read_text())
        self.documents = data["documents"]
        self.embeddings = [np.array(e) for e in data["embeddings"]]
```
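Storing vectors as JSON lists of floats works, but the file grows quickly. As an alternative sketch, the vectors can go into a binary `.npy` file next to a small JSON file for the documents. The function names and the two file names below are assumptions for illustration, and casting to float32 trades a little precision for half the size of float64.

```python
import json
from pathlib import Path

import numpy as np


def save_index(directory, documents, embeddings):
    """Write documents as JSON and embeddings as a single float32 .npy matrix."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "documents.json").write_text(
        json.dumps(documents, ensure_ascii=False), encoding="utf-8"
    )
    np.save(directory / "embeddings.npy", np.asarray(embeddings, dtype=np.float32))


def load_index(directory):
    """Return (documents, embeddings) as written by save_index."""
    directory = Path(directory)
    documents = json.loads((directory / "documents.json").read_text(encoding="utf-8"))
    embeddings = np.load(directory / "embeddings.npy")
    return documents, embeddings
```

The loaded `embeddings` value is a 2-D array with one row per document, which is also a convenient shape for vectorized similarity search.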
Step 3: Implement Semantic Search
```python
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class SemanticSearcher(EmbeddingStore):
    def search(self, query, top_k=5):
        """Find the top-k most similar documents."""
        query_emb = get_embedding(query)
        scores = [
            (i, cosine_similarity(query_emb, doc_emb))
            for i, doc_emb in enumerate(self.embeddings)
        ]
        scores.sort(key=lambda x: x[1], reverse=True)
        results = []
        for idx, score in scores[:top_k]:
            doc = self.documents[idx]
            results.append({
                "text": doc["text"],
                "score": float(score),
                "metadata": doc["metadata"]
            })
        return results

# Usage
searcher = SemanticSearcher()
searcher.load("my_index.json")

# Test queries that would FAIL with keyword search
queries = [
    "app keeps crashing",
    "how do I reset my password",
    "payment failed",
]
for q in queries:
    print(f"\nQuery: {q}")
    for result in searcher.search(q, top_k=3):
        print(f"  [{result['score']:.3f}] {result['text'][:80]}...")
```
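The per-document Python loop is fine for a few thousand vectors. For larger in-memory indexes, the same ranking can be computed with one matrix-vector product once every vector is scaled to unit length. The following is a sketch of that idea on toy 2-D vectors; `normalize_rows` and `top_k_cosine` are illustrative names, not part of the classes above.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return [(row_index, cosine_score), ...] for the k best-matching rows."""
    docs = normalize_rows(np.asarray(doc_matrix, dtype=float))
    q = np.asarray(query_vec, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = docs @ q                 # dot product of unit vectors == cosine similarity
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]


# Toy data: three 2-D "documents" and one query vector
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = top_k_cosine([2.0, 0.1], toy_docs, k=2)
print(hits)
```

In a real index you would normalize the document matrix once at load time rather than on every query.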
Choosing the Right Embedding Model
Not all embedding models are created equal. Here's a practical comparison of models available through Chinese AI APIs:
| Model | Dimension | Languages | Best For |
|---|---|---|---|
| bge-large-zh-v1.5 | 1024 | Chinese + English | General purpose, balanced |
| bge-m3 | 1024 | 100+ languages | Multilingual projects |
| m3e-large | 1024 | Chinese + English | Chinese-heavy content |
| text-embedding-v2 | 1536 | Chinese + English | Qwen ecosystem |
Rule of thumb: If your content is primarily Chinese, use BGE or M3E. If you need multilingual support, BGE-M3 handles 100+ languages and is the most versatile.
Going to Production: Use a Real Vector Database
The numpy approach above works for prototypes. For production with millions of documents, use a proper vector database:
```python
# Using Qdrant (runs locally or in Docker)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# Create collection (recreate_collection drops any existing collection with this name).
# Note: recent qdrant-client releases mark recreate_collection and search as deprecated
# in favor of create_collection and query_points; check the version you have installed.
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# Index documents: the {"text", "metadata"} dicts loaded in Step 3
documents = searcher.documents
points = []
for i, doc in enumerate(documents):
    embedding = get_embedding(doc["text"])
    points.append(PointStruct(
        id=i,
        vector=embedding.tolist(),
        payload=doc
    ))
client.upsert(collection_name="docs", points=points)

# Search
query_embedding = get_embedding("my search query")
results = client.search(
    collection_name="docs",
    query_vector=query_embedding.tolist(),
    limit=10
)
```
For a hosted alternative, you can also check https://aiwave.live which offers a unified API endpoint compatible with all major Chinese embedding models - no separate accounts needed.
Real-World Performance Tips
1. Batch your embedding calls. Generating embeddings one request at a time can be 10-20x slower than batching, because each request pays its own network round trip. Most APIs support passing an array of texts in a single request.
2. Cache aggressively. Embeddings don't change unless your text changes. Store them in Redis, a database, or even flat files. Re-computing embeddings on every request is a waste.
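3. Normalize your vectors. Some APIs return pre-normalized embeddings, others don't. The cosine_similarity function above divides by both norms, so it is correct either way, but if you normalize once at index time you can replace it with a plain dot product, and you avoid surprises when a vector store is configured for dot-product distance:
```python
emb = emb / np.linalg.norm(emb)  # Normalize to unit length
```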
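4. Chunk long documents. Many embedding models, including bge-large-zh-v1.5, accept at most 512 tokens per input. For longer documents, split into overlapping chunks of ~300 tokens and index each chunk separately. The helper below counts whitespace-separated words as a rough stand-in for tokens: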
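```python
def chunk_text(text, chunk_size=300, overlap=50):
    # Splits on whitespace, so sizes are in words, not model tokens.
    # Text without spaces between words (e.g. Chinese) needs a different splitter.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        if i + chunk_size >= len(words):
            break
    return chunks
```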
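Tips 1 and 2 combine naturally: look each text up in a cache first, then send only the misses to the API in batches. The sketch below shows that pattern with an in-memory dict keyed by a SHA-256 hash of the text. `CachedBatchEmbedder` is a hypothetical helper, the `embed_batch` callable stands in for whatever batch request you use, and the fake embedder exists only so the example runs offline. A production version would likely swap the dict for Redis or a database, as suggested above.

```python
import hashlib


class CachedBatchEmbedder:
    """Embed texts in batches, skipping any text that was embedded before."""

    def __init__(self, embed_batch, batch_size=32):
        self.embed_batch = embed_batch  # callable: list[str] -> list of vectors
        self.batch_size = batch_size
        self.cache = {}                 # sha256(text) -> vector

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Collect each uncached text once, preserving first-seen order
        missing, seen = [], set()
        for text in texts:
            key = self._key(text)
            if key not in self.cache and key not in seen:
                seen.add(key)
                missing.append(text)
        # Send the misses in fixed-size batches and store the results
        for start in range(0, len(missing), self.batch_size):
            batch = missing[start:start + self.batch_size]
            for text, vector in zip(batch, self.embed_batch(batch)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(text)] for text in texts]


# Offline stand-in for a real batch embedding request (records what it was asked for)
api_calls = []

def fake_embed_batch(batch):
    api_calls.append(list(batch))
    return [[float(len(text))] for text in batch]


embedder = CachedBatchEmbedder(fake_embed_batch, batch_size=2)
first = embedder.embed(["a", "bb", "a", "ccc"])   # "a" is embedded only once
second = embedder.embed(["bb", "dddd"])           # "bb" is served from the cache
print(api_calls)
```

Hashing the text rather than using it directly as the key keeps cache keys short and uniform, which matters once the cache lives in an external store.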
Measuring Search Quality
Don't just eyeball results. Build a small evaluation set:
```python
test_cases = [
    {"query": "login not working", "relevant_doc_id": 42},
    {"query": "change email address", "relevant_doc_id": 15},
    {"query": "export data to csv", "relevant_doc_id": 88},
]

def evaluate(searcher, test_cases, k=5):
    """Fraction of queries whose relevant document appears in the top k.

    Assumes each indexed document carries its ID in metadata["id"].
    """
    hits = 0
    for case in test_cases:
        results = searcher.search(case["query"], top_k=k)
        doc_ids = [r["metadata"].get("id") for r in results]
        if case["relevant_doc_id"] in doc_ids:
            hits += 1
    return hits / len(test_cases)

k = 5
print(f"Recall@{k}: {evaluate(searcher, test_cases, k=k):.1%}")
```
As a rough target, a well-tuned semantic search should reach 85%+ recall@5, though the achievable number depends on your domain and on how the evaluation set was built.
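Recall@k only says whether the right document made the top k, not how high it ranked. Mean reciprocal rank (MRR) is a common complement that rewards placing the relevant document near the top. The sketch below computes it from plain lists of ranked document IDs; the function name and the toy rankings are illustrative, reusing the document IDs from the test cases above.

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_ids):
    """Average of 1/rank of the relevant document (0 when it is not retrieved)."""
    if not relevant_ids:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(relevant_ids)


# Toy rankings: relevant doc ranked 1st, ranked 2nd, and missing entirely
ranked = [[42, 7, 3], [9, 15, 2], [1, 2, 3]]
relevant = [42, 15, 88]
mrr = mean_reciprocal_rank(ranked, relevant)
print(f"MRR: {mrr:.3f}")  # (1 + 1/2 + 0) / 3 = 0.500
```

Tracking both numbers over time makes it easier to tell whether a change (new model, different chunk size) improves retrieval or just reshuffles the top results.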
Conclusion
Semantic search with Chinese AI embedding models gives you quality comparable to OpenAI's embeddings at a fraction of the cost. The BGE family of models handles both Chinese and English text well, and with batch API calls, you can index thousands of documents for cents.
Start with the numpy prototype to validate your use case, then move to Qdrant or Milvus for production scale. The whole pipeline - from raw text to searchable vector index - can be built in under 100 lines of Python.
Happy embedding.