Why Does Switching Embedding Models Make Such a Huge Difference?
In the first four articles, we built the RAG pipeline, tuned parameters, and mastered chunking strategies. But there's one question we haven't dived into:
After your documents are chunked, how do they become vectors?
This process is called Embedding. It transforms human-readable text into machine-computable vectors. The choice of Embedding model directly determines:
- Whether "apple" and "iPhone" are recognized as related
- Whether "database connection pool exhausted" and "Too many connections" match
- Whether Chinese idioms, technical jargon, and abbreviations are properly understood
This article explains how Embedding works, compares mainstream models, and runs a head-to-head retrieval comparison between OpenAI and BGE using real Chinese documents.
What Is Embedding?
One-Sentence Explanation
Embedding is a function that takes a piece of text and outputs a fixed-length numerical vector (e.g., 1024 dimensions). Semantically similar texts produce vectors that are close together in space.
Why Can Vectors Represent Meaning?
Imagine placing all words in a multi-dimensional space:
- "King" and "Queen" are close together
- "Apple (fruit)" and "Banana" are close together
- "Apple (company)" and "Google" are close together
- "Apple (fruit)" and "Apple (company)" are far apart
Embedding models learn these "semantic distances" through pre-training on massive text corpora. When you ask "How do I restart my iPhone?", the model knows "iPhone" relates to "Apple" (the company), not "apple" (the fruit).
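You can check these distances yourself by embedding a few phrases and comparing cosine similarities. A minimal sketch, assuming sentence-transformers is installed locally (the exact scores will vary by model; the phrases are illustrative):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = model.encode([
    "How do I restart my iPhone?",
    "Apple device troubleshooting",
    "apple pie recipe",
])
print(cos_sim(vecs[0], vecs[1]))  # expect a higher score: both about Apple devices
print(cos_sim(vecs[0], vecs[2]))  # expect a lower score: fruit context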
Its Role in RAG
User Query → Embedding Model → Query Vector
                                      ↘
                                        Vector Similarity → Top-K Retrieval
                                      ↗
Document Chunk → Embedding Model → Document Vector (precomputed)
Embedding is the semantic bridge of RAG. Without it, retrieval is limited to keyword matching (like Ctrl+F). With it, you get semantic matching that understands synonyms, paraphrases, and context.
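Under the hood, "Vector Similarity → Top-K Retrieval" is just a nearest-neighbor search over the precomputed document vectors. A minimal numpy sketch with hypothetical, randomly generated vectors (a real vector store does the same math, plus indexing for speed):
import numpy as np

# Hypothetical data: 100 precomputed document vectors and 1 query vector (1024-dim)
rng = np.random.default_rng(0)
doc_vecs = rng.random((100, 1024))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # L2-normalize rows
query_vec = rng.random(1024)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec             # cosine similarity via dot product
top_k = np.argsort(scores)[::-1][:3]      # indices of the 3 closest chunks
print(top_k, scores[top_k])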
Mainstream Embedding Model Comparison
Model Overview
| Model | Vendor | Dimensions | Language Strength | Deployment | Characteristics |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Multilingual | API | Cheap, fast, good for general use |
| text-embedding-3-large | OpenAI | 3072 | Multilingual | API | High accuracy, expensive, complex semantics |
| BAAI/bge-large-zh-v1.5 | BAAI | 1024 | Chinese | API/Local | Top Chinese performance, open-source, free |
| BAAI/bge-m3 | BAAI | 1024 | Multilingual | API/Local | 100+ languages, lightweight |
| embed-multilingual-v3.0 | Cohere | 1024 | Multilingual | API | Good for long texts |
| E5-mistral-7b-instruct | Microsoft | 4096 | Multilingual | Local | Instruction-based, strong but heavy |
Key Metric: The MTEB Leaderboard
MTEB (Massive Text Embedding Benchmark) is the "college entrance exam" of Embedding models. It tests models on 50+ datasets across various tasks.
How to Read the MTEB Leaderboard:
- Visit the MTEB Leaderboard
- Focus on Retrieval Average — most relevant to RAG
- Check Model Size — larger models are slower but usually more accurate
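If you'd rather reproduce a score than trust the leaderboard, the mteb Python package can run individual benchmark tasks against any local model. A minimal sketch, assuming mteb and sentence-transformers are installed (the task name is an arbitrary example, and the mteb API has shifted across versions):
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
tasks = mteb.get_tasks(tasks=["STSBenchmark"])  # pick any retrieval/STS task
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")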
Key Findings from the Leaderboard:
- English: OpenAI text-embedding-3-large dominates, but text-embedding-3-small offers exceptional value
- Chinese: BGE series (especially bge-large-zh-v1.5) often outperforms OpenAI, and it's open-source and free
- Multilingual: bge-m3 and Cohere embed-multilingual-v3.0 stand out
💡 Rule of Thumb: English → OpenAI, Chinese → BGE, Multilingual → bge-m3, Long Text → Cohere.
Practical: OpenAI vs BGE Retrieval Showdown on Chinese Documents
Experimental Design
We use the same Chinese technical document from Article 4 (the microservices architecture guide), generate embeddings with both OpenAI and BGE, and test retrieval quality on the same set of queries.
Code: Switching Embedding Models with One Change
LangChain's OpenAIEmbeddings class is compatible with any OpenAI-format Embedding API (including SiliconFlow, Zhipu, Ollama, etc.), so switching models only requires changing a few configuration lines:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# --- Official OpenAI ---
openai_embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
)

# --- BGE (via SiliconFlow) ---
bge_embed = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key="sk-...",  # SiliconFlow API Key
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,  # SiliconFlow batch size limit: 32
)

# --- Use in RAG Pipeline ---
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=bge_embed,  # Change only this line to switch models
)
Evaluation Query Set
We designed 5 queries covering different difficulty levels:
| Query | Expected Content | Difficulty |
|---|---|---|
| Q1: "What are the principles of microservice decomposition?" | Section 1.1: DDD | Easy |
| Q2: "What's the difference between REST and gRPC?" | Section 2.1: REST vs gRPC | Easy |
| Q3: "How to solve distributed transactions?" | Section 3.2: Saga Pattern | Medium |
| Q4: "How to roll back a failed order?" | Saga compensation operations | Hard (requires reasoning) |
| Q5: "How to monitor microservices?" | Section 4: Observability | Easy |
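The comparison itself is easy to script: build one vector store per model, run each query, and record where the expected chunk lands. A minimal harness sketch, reusing chunks and the two embedding objects defined above; the keyword labels are simplified stand-ins for the expected sections:
from langchain_community.vectorstores import Chroma

queries = {
    "What are the principles of microservice decomposition?": "DDD",
    "How to roll back a failed order?": "compensation",
}

for name, embed in [("openai", openai_embed), ("bge", bge_embed)]:
    store = Chroma.from_documents(chunks, embedding=embed, collection_name=f"eval_{name}")
    for query, keyword in queries.items():
        docs = store.similarity_search(query, k=5)
        # Rank (1-based) of the first retrieved chunk containing the expected keyword
        rank = next((i + 1 for i, d in enumerate(docs)
                     if keyword.lower() in d.page_content.lower()), None)
        print(f"{name:6} | {query[:35]:35} -> rank {rank}")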
Results Comparison
| Query | OpenAI text-embedding-3-small | BGE-large-zh-v1.5 | Analysis |
|---|---|---|---|
| Q1 Decomposition principles | ✅ #1 hit | ✅ #1 hit | Tie |
| Q2 REST vs gRPC | ✅ #1 hit | ✅ #1 hit | Tie |
| Q3 Distributed transactions | ✅ #1 hit | ✅ #1 hit | Tie |
| Q4 Order rollback | ⚠️ #3 hit | ✅ #1 hit | BGE wins — better semantic link between "rollback" and "compensation" |
| Q5 Monitoring | ✅ #1 hit | ✅ #1 hit | Tie |
Conclusion:
- For simple queries (direct keyword matches), both models perform similarly
- For difficult queries (semantic reasoning required), BGE's Chinese advantage is clear, especially on synonyms and paraphrases
Cost Comparison
| Model | Price (per million tokens) | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Extremely cheap |
| OpenAI text-embedding-3-large | $0.13 | Expensive but strong |
| BGE-large-zh-v1.5 (SiliconFlow) | ¥0.007 (~$0.001) | Cheapest |
If you have a GPU, BGE can also be deployed locally for free (details below).
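To make the table concrete, a back-of-the-envelope calculation for a hypothetical 10-million-token corpus, using the per-million-token prices above:
tokens = 10_000_000  # hypothetical corpus size
print(f"text-embedding-3-small: ${0.02 * tokens / 1e6:.2f}")   # $0.20
print(f"text-embedding-3-large: ${0.13 * tokens / 1e6:.2f}")   # $1.30
print(f"BGE via SiliconFlow:   ~${0.001 * tokens / 1e6:.2f}")  # ~$0.01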
Local Deployment vs API Calls: How to Choose?
API Calls: Pros and Cons
Pros:
- Zero ops, one line of code
- Model versions auto-update
- Pay-per-use, no idle costs
Cons:
- Data leaves your domain (compliance risk for sensitive docs)
- Network latency and rate limits
- Costs accumulate with high-frequency usage
Local Deployment: Pros and Cons
Pros:
- Data never leaves your premises, eliminating third-party exposure
- No rate limits, ideal for high-frequency batch processing
- More economical over time (one-time GPU investment)
Cons:
- Requires GPU (BGE-large needs 4GB+ VRAM)
- Operational complexity (model downloads, version management, serving)
- Slow initial loading (model size: hundreds of MB to several GB)
Decision Tree
Is your data sensitive?
├─ Yes → Local Deployment (BGE or GTE)
└─ No → Is call volume high?
   ├─ Yes → Local Deployment (saves money long-term)
   └─ No → API Calls (simpler)

Primarily Chinese? → BGE (SiliconFlow/Local)
Primarily English? → OpenAI text-embedding-3-small
Special Considerations for Chinese Embedding
1. Tokenization Differences
English-centric Embedding models use tokenizers trained mostly on space-delimited text, but Chinese has no word boundaries. A model not optimized for Chinese might segment "南京市长江大桥" as "南京 / 市长 / 江大桥" (Nanjing / Mayor / Jiang Daqiao) instead of "南京市 / 长江大桥" (Nanjing City / Yangtze River Bridge).
BGE's Advantage: Specifically trained on Chinese corpora, with tokenization and semantic understanding optimized for Chinese.
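You can observe the tokenization difference directly. A quick sketch, assuming tiktoken and transformers are installed (exact splits vary by tokenizer version):
import tiktoken
from transformers import AutoTokenizer

text = "南京市长江大桥"

# OpenAI embedding models use the cl100k_base byte-level BPE encoding
openai_tokens = tiktoken.get_encoding("cl100k_base").encode(text)
print(len(openai_tokens))

# BGE uses a Chinese-aware WordPiece tokenizer
bge_tokens = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5").tokenize(text)
print(bge_tokens)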
2. Idioms and Colloquialisms
| Query | Expected Match | English Model | BGE |
|---|---|---|---|
| "杀鸡取卵" (Kill the goose) | Short-sighted behavior | ❌ Often mismatches | ✅ Correct match |
| "亡羊补牢" (Mend the fold) | Remedy after the fact | ❌ Often mismatches | ✅ Correct match |
3. Domain Terminology
Technical documents contain extensive jargon (e.g., "Saga pattern", "Two-phase commit", "Eventual consistency"). BGE, trained on Chinese technical community data, typically understands these terms better than general English models.
Code Walkthrough: Model Switching Wrapper
To make model switching easy in your project, create a factory function:
import os
from langchain_openai import OpenAIEmbeddings
def build_embeddings(provider: str = "bge"):
"""
Factory function: returns the appropriate Embedding model based on config.
provider: "openai" | "bge" | "local"
"""
if provider == "openai":
return OpenAIEmbeddings(
model="text-embedding-3-small",
api_key=os.getenv("OPENAI_API_KEY"),
)
elif provider == "bge":
return OpenAIEmbeddings(
model="BAAI/bge-large-zh-v1.5",
api_key=os.getenv("SILICONFLOW_API_KEY"),
base_url="https://api.siliconflow.cn/v1",
chunk_size=32,
)
elif provider == "local":
# Requires: pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
return HuggingFaceEmbeddings(
model_name="BAAI/bge-large-zh-v1.5",
model_kwargs={"device": "cuda"}, # or "cpu"
encode_kwargs={"normalize_embeddings": True},
)
else:
raise ValueError(f"Unknown provider: {provider}")
# Usage: one line to switch
embeddings = build_embeddings("bge") # Change this line to switch
Local BGE Deployment (Optional)
If you have a GPU, local deployment is simple:
pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="BAAI/bge-large-zh-v1.5",
model_kwargs={"device": "cuda"},
encode_kwargs={"normalize_embeddings": True},
)
# Test
result = embeddings.embed_query("Testing Chinese Embedding")
print(f"Vector dimensions: {len(result)}") # 1024
The first run auto-downloads the model (~1.2GB), then caches locally.
Summary and Quick Reference
Core Takeaways
- Embedding is the semantic bridge of RAG — choosing the wrong model directly hurts retrieval accuracy
- English → OpenAI, Chinese → BGE — validated by both MTEB rankings and real-world tests
- Simple queries show little difference, complex semantic queries show large gaps — BGE excels at synonyms, idioms, and terminology
- Switching models takes one line of code — LangChain's abstraction makes model swapping cost-free
Embedding Model Quick Selection Guide
| Scenario | Recommended Model | Deployment | Reasoning |
|---|---|---|---|
| Chinese technical docs | BGE-large-zh-v1.5 | API/Local | Top Chinese performance |
| English general docs | text-embedding-3-small | API | Best value |
| English high-accuracy | text-embedding-3-large | API | Best quality but expensive |
| Multilingual mixed | bge-m3 | API/Local | 100+ language support |
| Data must stay on-premise | BGE-large-zh-v1.5 | Local | 4GB VRAM sufficient |
| Long text (>8K) | Cohere embed-multilingual | API | Optimized for long texts |
References
- MTEB Leaderboard — Authoritative Embedding model rankings
- BGE Official GitHub — BGE series models and documentation
- SiliconFlow Embedding API
- Cohere Embed Documentation