RAG Series (5): Embedding Models — The Core of Semantic Understanding

Why Does Switching Embedding Models Make Such a Huge Difference?

In the first four articles, we built the RAG pipeline, tuned parameters, and mastered chunking strategies. But there's one question we haven't dived into:

After your documents are chunked, how do they become vectors?

This process is called Embedding. It transforms human-readable text into machine-computable vectors. The choice of Embedding model directly determines:

  • Whether "apple" and "iPhone" are recognized as related
  • Whether "database connection pool exhausted" and "Too many connections" match
  • Whether Chinese idioms, technical jargon, and abbreviations are properly understood

This article explains how Embedding works, compares mainstream models, and runs a head-to-head retrieval comparison between OpenAI and BGE using real Chinese documents.


What Is Embedding?

One-Sentence Explanation

Embedding is a function that takes a piece of text and outputs a fixed-length numerical vector (e.g., 1024 dimensions). Semantically similar texts produce vectors that are close together in space.

Why Can Vectors Represent Meaning?

Imagine placing all words in a multi-dimensional space:

  • "King" and "Queen" are close together
  • "Apple (fruit)" and "Banana" are close together
  • "Apple (company)" and "Google" are close together
  • "Apple (fruit)" and "Apple (company)" are far apart

Embedding models learn these "semantic distances" through pre-training on massive text corpora. When you ask "How do I restart my iPhone?", the model knows "iPhone" relates to "Apple" (the company), not "apple" (the fruit).
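
As an illustration of "closeness": similarity between two embedding vectors is usually measured with cosine similarity. Here is a toy sketch with hand-made 3-dimensional vectors (real models output hundreds to thousands of dimensions; the numbers below are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-d "embeddings" for illustration only
iphone        = np.array([0.9, 0.1, 0.3])
apple_company = np.array([0.8, 0.2, 0.4])
banana        = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(iphone, apple_company))  # ~0.98 — semantically related
print(cosine_similarity(iphone, banana))         # ~0.27 — unrelated
```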

Its Role in RAG

```
User Query     → Embedding Model → Query Vector
                                        ↘
                                         Vector Similarity → Top-K Retrieval
                                        ↗
Document Chunk → Embedding Model → Document Vector (precomputed)
```

Embedding is the semantic bridge of RAG. Without it, retrieval is limited to keyword matching (like Ctrl+F). With it, you get semantic matching that understands synonyms, paraphrases, and context.
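
At retrieval time this reduces to: embed the query once, compare it against the precomputed document vectors, and take the K closest. A minimal NumPy sketch, assuming L2-normalized vectors so the dot product equals cosine similarity (`doc_vectors` and `query_vector` here are random placeholders, not real embeddings):

```python
import numpy as np

# Placeholder: (num_chunks, dim) matrix of precomputed, L2-normalized chunk vectors
doc_vectors = np.random.randn(1000, 1024)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# Placeholder: L2-normalized query vector from the same embedding model
query_vector = np.random.randn(1024)
query_vector /= np.linalg.norm(query_vector)

# With normalized vectors, dot product == cosine similarity
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 most similar chunks
print(top_k, scores[top_k])
```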


Mainstream Embedding Model Comparison

Model Overview

| Model | Vendor | Dimensions | Language Strength | Deployment | Characteristics |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Multilingual | API | Cheap, fast, good for general use |
| text-embedding-3-large | OpenAI | 3072 | Multilingual | API | High accuracy, expensive, complex semantics |
| BAAI/bge-large-zh-v1.5 | BAAI | 1024 | Chinese | API/Local | Top Chinese performance, open-source, free |
| BAAI/bge-m3 | BAAI | 1024 | Multilingual | API/Local | 100+ languages, lightweight |
| embed-multilingual-v3.0 | Cohere | 1024 | Multilingual | API | Good for long texts |
| E5-mistral-7b-instruct | Microsoft | 4096 | Multilingual | Local | Instruction-based, strong but heavy |

Key Metric: The MTEB Leaderboard

MTEB (Massive Text Embedding Benchmark) is the "college entrance exam" of Embedding models. It tests models on 50+ datasets across various tasks.

How to Read the MTEB Leaderboard:

  1. Visit the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard)
  2. Focus on Retrieval Average — most relevant to RAG
  3. Check Model Size — larger models are slower but usually more accurate

Key Findings from the Leaderboard:

  • English: OpenAI text-embedding-3-large dominates, but text-embedding-3-small offers exceptional value
  • Chinese: BGE series (especially bge-large-zh-v1.5) often outperforms OpenAI, and it's open-source and free
  • Multilingual: bge-m3 and Cohere embed-multilingual-v3.0 stand out

💡 Rule of Thumb: English → OpenAI, Chinese → BGE, Multilingual → bge-m3, Long Text → Cohere.


Practical: OpenAI vs BGE Retrieval Showdown on Chinese Documents

Experimental Design

We use the same Chinese technical document from Article 4 (the microservices architecture guide), generate embeddings with both OpenAI and BGE, and test retrieval quality on the same set of queries.

Code: Switching Embedding Models with One Change

LangChain's OpenAIEmbeddings class works with any OpenAI-compatible Embedding API (including SiliconFlow, Zhipu, Ollama, etc.), so switching models only requires changing a few configuration lines:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# --- Official OpenAI ---
openai_embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
)

# --- BGE (via SiliconFlow) ---
bge_embed = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key="sk-...",  # SiliconFlow API Key
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,     # SiliconFlow batch size limit: 32
)

# --- Use in RAG Pipeline ---
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=bge_embed,  # Change only this line to switch models
)
```
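
One caveat when switching: the two models output different dimensionality (1536 vs 1024), so vectors from one cannot be compared against a collection built with the other; re-embed the whole corpus after a switch. A quick sanity check, reusing the objects above:

```python
# The two models are not interchangeable at query time: a collection built
# with one model must also be queried with that model's vectors.
query = "微服务如何拆分?"
print(len(openai_embed.embed_query(query)))  # 1536 (text-embedding-3-small)
print(len(bge_embed.embed_query(query)))     # 1024 (bge-large-zh-v1.5)
```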

Evaluation Query Set

We designed 5 queries covering different difficulty levels:

| Query | Expected Content | Difficulty |
|---|---|---|
| Q1: "What are the principles of microservice decomposition?" | Section 1.1: DDD | Easy |
| Q2: "What's the difference between REST and gRPC?" | Section 2.1: REST vs gRPC | Easy |
| Q3: "How to solve distributed transactions?" | Section 3.2: Saga Pattern | Medium |
| Q4: "How to roll back a failed order?" | Saga compensation operations | Hard (requires reasoning) |
| Q5: "How to monitor microservices?" | Section 4: Observability | Easy |
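
To make the comparison reproducible, you can script it: build one vector store per model, run every query against both, and record the rank at which the expected chunk appears. A minimal sketch, reusing `chunks`, `openai_embed`, and `bge_embed` from above; the Chinese queries and `expected_keyword` values are illustrative stand-ins for the table entries:

```python
from langchain_community.vectorstores import Chroma

# Query → a keyword that identifies the expected chunk (illustrative values)
queries = {
    "微服务拆分有哪些原则?": "领域驱动设计",  # Q1
    "订单失败了怎么回滚?": "补偿",            # Q4
}

def hit_rank(store, query: str, expected_keyword: str, k: int = 5):
    """Return the 1-based rank of the first retrieved chunk containing the keyword, or None."""
    docs = store.similarity_search(query, k=k)
    for rank, doc in enumerate(docs, start=1):
        if expected_keyword in doc.page_content:
            return rank
    return None

for name, embed in [("openai", openai_embed), ("bge", bge_embed)]:
    store = Chroma.from_documents(documents=chunks, embedding=embed,
                                  collection_name=f"eval_{name}")
    for query, keyword in queries.items():
        print(name, query, "→ rank", hit_rank(store, query, keyword))
```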

Results Comparison

| Query | OpenAI text-embedding-3-small | BGE-large-zh-v1.5 | Analysis |
|---|---|---|---|
| Q1: Decomposition principles | ✅ #1 hit | ✅ #1 hit | Tie |
| Q2: REST vs gRPC | ✅ #1 hit | ✅ #1 hit | Tie |
| Q3: Distributed transactions | ✅ #1 hit | ✅ #1 hit | Tie |
| Q4: Order rollback | ⚠️ #3 hit | ✅ #1 hit | BGE wins: better semantic link between "rollback" and "compensation" |
| Q5: Monitoring | ✅ #1 hit | ✅ #1 hit | Tie |

Conclusion:

  • For simple queries (direct keyword matches), both models perform similarly
  • For difficult queries (semantic reasoning required), BGE's Chinese advantage is clear, especially on synonyms and paraphrases

Cost Comparison

| Model | Price (per million tokens) | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Extremely cheap |
| OpenAI text-embedding-3-large | $0.13 | Expensive but strong |
| BGE-large-zh-v1.5 (SiliconFlow) | ¥0.007 (~$0.001) | Cheapest |
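
A quick back-of-the-envelope from the table: embedding a 10-million-token corpus costs about $0.20 with text-embedding-3-small, about $1.30 with text-embedding-3-large, and roughly ¥0.07 (about $0.01) with BGE via SiliconFlow.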

If you have a GPU, BGE can also be deployed locally for free (details below).


Local Deployment vs API Calls: How to Choose?

API Calls: Pros and Cons

Pros:

  • Zero ops, one line of code
  • Model versions auto-update
  • Pay-per-use, no idle costs

Cons:

  • Data leaves your domain (compliance risk for sensitive docs)
  • Network latency and rate limits
  • Costs accumulate with high-frequency usage

Local Deployment: Pros and Cons

Pros:

  • Data never leaves your premises, eliminating the compliance risk of sending documents to a third party
  • No rate limits, ideal for high-frequency batch processing
  • More economical over time (one-time GPU investment)

Cons:

  • Requires GPU (BGE-large needs 4GB+ VRAM)
  • Operational complexity (model downloads, version management, serving)
  • Slow initial loading (model size: hundreds of MB to several GB)

Decision Tree

```
Is your data sensitive?
  ├─ Yes → Local Deployment (BGE or GTE)
  └─ No → Is call volume high?
            ├─ Yes → Local Deployment (saves money long-term)
            └─ No → API Calls (simpler)
                      ├─ Primarily Chinese? → BGE (SiliconFlow/Local)
                      └─ Primarily English? → OpenAI text-embedding-3-small
```

Special Considerations for Chinese Embedding

1. Tokenization Differences

English-centric Embedding models use subword tokenizers trained mostly on English text, while Chinese has no word boundaries at all. A model that isn't optimized for Chinese may segment "南京市长江大桥" as "南京市长 / 江大桥" ("Nanjing Mayor / Jiang Daqiao", as if it were a person's name) instead of the intended "南京市 / 长江大桥" ("Nanjing City / Yangtze River Bridge").

BGE's Advantage: Specifically trained on Chinese corpora, with tokenization and semantic understanding optimized for Chinese.
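
You can observe the tokenization gap directly with OpenAI's `tiktoken` package; `cl100k_base` is the encoding used by OpenAI's embedding models (exact token splits vary by library version):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI's embedding models
tokens = enc.encode("南京市长江大桥")
print(len(tokens))  # the 6-character phrase becomes multiple subword tokens
# Decoding token-by-token shows the splits; partial UTF-8 bytes appear as "�"
print([enc.decode([t]) for t in tokens])
```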

2. Idioms and Colloquialisms

| Query | Expected Match | English Model | BGE |
|---|---|---|---|
| "杀鸡取卵" (kill the goose that lays the golden eggs) | Short-sighted behavior | ❌ Often mismatches | ✅ Correct match |
| "亡羊补牢" (mend the fold after the sheep are lost) | Remedy after the fact | ❌ Often mismatches | ✅ Correct match |

3. Domain Terminology

Technical documents contain extensive jargon (e.g., "Saga pattern", "Two-phase commit", "Eventual consistency"). BGE, trained on Chinese technical community data, typically understands these terms better than general English models.


Code Walkthrough: Model Switching Wrapper

To make model switching easy in your project, create a factory function:

```python
import os
from langchain_openai import OpenAIEmbeddings


def build_embeddings(provider: str = "bge"):
    """
    Factory function: returns the appropriate Embedding model based on config.
    provider: "openai" | "bge" | "local"
    """
    if provider == "openai":
        return OpenAIEmbeddings(
            model="text-embedding-3-small",
            api_key=os.getenv("OPENAI_API_KEY"),
        )
    elif provider == "bge":
        return OpenAIEmbeddings(
            model="BAAI/bge-large-zh-v1.5",
            api_key=os.getenv("SILICONFLOW_API_KEY"),
            base_url="https://api.siliconflow.cn/v1",
            chunk_size=32,
        )
    elif provider == "local":
        # Requires: pip install sentence-transformers
        from langchain_community.embeddings import HuggingFaceEmbeddings
        return HuggingFaceEmbeddings(
            model_name="BAAI/bge-large-zh-v1.5",
            model_kwargs={"device": "cuda"},  # or "cpu"
            encode_kwargs={"normalize_embeddings": True},
        )
    else:
        raise ValueError(f"Unknown provider: {provider}")


# Usage: one line to switch
embeddings = build_embeddings("bge")  # Change this line to switch
```

Local BGE Deployment (Optional)

If you have a GPU, local deployment is simple:

```bash
pip install sentence-transformers
```
```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Test
result = embeddings.embed_query("Testing Chinese Embedding")
print(f"Vector dimensions: {len(result)}")  # 1024
```

The first run auto-downloads the model (~1.2GB), then caches locally.


Summary and Quick Reference

Core Takeaways

  1. Embedding is the semantic bridge of RAG — choosing the wrong model directly hurts retrieval accuracy
  2. English → OpenAI, Chinese → BGE — validated by both MTEB rankings and real-world tests
  3. Simple queries show little difference, complex semantic queries show large gaps — BGE excels at synonyms, idioms, and terminology
  4. Switching models takes one line of code — LangChain's abstraction makes model swapping cost-free

Embedding Model Quick Selection Guide

| Scenario | Recommended Model | Deployment | Reasoning |
|---|---|---|---|
| Chinese technical docs | BGE-large-zh-v1.5 | API/Local | Top Chinese performance |
| English general docs | text-embedding-3-small | API | Best value |
| English high-accuracy | text-embedding-3-large | API | Best quality but expensive |
| Multilingual mixed | bge-m3 | API/Local | 100+ language support |
| Data must stay on-premise | BGE-large-zh-v1.5 | Local | 4GB VRAM sufficient |
| Long text (>8K tokens) | Cohere embed-multilingual-v3.0 | API | Optimized for long texts |
