RAG Series (5): Embedding Models — The Core of Semantic Understanding

Why Does Switching Embedding Models Make Such a Huge Difference?

In the first four articles, we built the RAG pipeline, tuned parameters, and mastered chunking strategies. But there's one question we haven't dived into:

After your documents are chunked, how do they become vectors?

This process is called Embedding. It transforms human-readable text into machine-computable vectors. The choice of Embedding model directly determines:

  • Whether "apple" and "iPhone" are recognized as related
  • Whether "database connection pool exhausted" and "Too many connections" match
  • Whether Chinese idioms, technical jargon, and abbreviations are properly understood

This article explains how Embedding works, compares mainstream models, and runs a head-to-head retrieval comparison between OpenAI and BGE using real Chinese documents.


What Is Embedding?

One-Sentence Explanation

Embedding is a function that takes a piece of text and outputs a fixed-length numerical vector (e.g., 1024 dimensions). Semantically similar texts produce vectors that are close together in space.

Why Can Vectors Represent Meaning?

Imagine placing all words in a multi-dimensional space:

  • "King" and "Queen" are close together
  • "Apple (fruit)" and "Banana" are close together
  • "Apple (company)" and "Google" are close together
  • "Apple (fruit)" and "Apple (company)" are far apart

Embedding models learn these "semantic distances" through pre-training on massive text corpora. When you ask "How do I restart my iPhone?", the model knows "iPhone" relates to "Apple" (the company), not "apple" (the fruit).
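
As an illustration of "closeness": similarity between two embedding vectors is usually measured with cosine similarity. Here is a toy sketch with hand-made 3-dimensional vectors (real models output hundreds to thousands of dimensions; the numbers below are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-d "embeddings" for illustration only
iphone        = np.array([0.9, 0.1, 0.3])
apple_company = np.array([0.8, 0.2, 0.4])
banana        = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(iphone, apple_company))  # ~0.98 — semantically related
print(cosine_similarity(iphone, banana))         # ~0.27 — unrelated
```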

Its Role in RAG

```
User Query     → Embedding Model → Query Vector
                                        ↘
                                         Vector Similarity → Top-K Retrieval
                                        ↗
Document Chunk → Embedding Model → Document Vector (precomputed)
```

Embedding is the semantic bridge of RAG. Without it, retrieval is limited to keyword matching (like Ctrl+F). With it, you get semantic matching that understands synonyms, paraphrases, and context.
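
At retrieval time this reduces to: embed the query once, compare it against the precomputed document vectors, and take the K closest. A minimal NumPy sketch, assuming L2-normalized vectors so the dot product equals cosine similarity (`doc_vectors` and `query_vector` here are random placeholders, not real embeddings):

```python
import numpy as np

# Placeholder: (num_chunks, dim) matrix of precomputed, L2-normalized chunk vectors
doc_vectors = np.random.randn(1000, 1024)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# Placeholder: L2-normalized query vector from the same embedding model
query_vector = np.random.randn(1024)
query_vector /= np.linalg.norm(query_vector)

# With normalized vectors, dot product == cosine similarity
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 most similar chunks
print(top_k, scores[top_k])
```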


Mainstream Embedding Model Comparison

Model Overview

| Model | Vendor | Dimensions | Language Strength | Deployment | Characteristics |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Multilingual | API | Cheap, fast, good for general use |
| text-embedding-3-large | OpenAI | 3072 | Multilingual | API | High accuracy, expensive, complex semantics |
| BAAI/bge-large-zh-v1.5 | BAAI | 1024 | Chinese | API/Local | Top Chinese performance, open-source, free |
| BAAI/bge-m3 | BAAI | 1024 | Multilingual | API/Local | 100+ languages, lightweight |
| embed-multilingual-v3.0 | Cohere | 1024 | Multilingual | API | Good for long texts |
| E5-mistral-7b-instruct | Microsoft | 4096 | Multilingual | Local | Instruction-based, strong but heavy |

Key Metric: The MTEB Leaderboard

MTEB (Massive Text Embedding Benchmark) is the "college entrance exam" of Embedding models. It tests models on 50+ datasets across various tasks.

How to Read the MTEB Leaderboard:

  1. Visit the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard)
  2. Focus on Retrieval Average — most relevant to RAG
  3. Check Model Size — larger models are slower but usually more accurate

Key Findings from the Leaderboard:

  • English: OpenAI text-embedding-3-large dominates, but text-embedding-3-small offers exceptional value
  • Chinese: BGE series (especially bge-large-zh-v1.5) often outperforms OpenAI, and it's open-source and free
  • Multilingual: bge-m3 and Cohere embed-multilingual-v3.0 stand out

💡 Rule of Thumb: English → OpenAI, Chinese → BGE, Multilingual → bge-m3, Long Text → Cohere.


Practical: OpenAI vs BGE Retrieval Showdown on Chinese Documents

Experimental Design

We use the same Chinese technical document from Article 4 (the microservices architecture guide), generate embeddings with both OpenAI and BGE, and test retrieval quality on the same set of queries.

Code: Switching Embedding Models with One Change

LangChain's OpenAIEmbeddings class works with any OpenAI-compatible Embedding API (including SiliconFlow, Zhipu, Ollama, etc.), so switching models only requires changing a few configuration lines:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# --- Official OpenAI ---
openai_embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
)

# --- BGE (via SiliconFlow) ---
bge_embed = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key="sk-...",  # SiliconFlow API Key
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,     # SiliconFlow batch size limit: 32
)

# --- Use in RAG Pipeline ---
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=bge_embed,  # Change only this line to switch models
)
```
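
One caveat when switching: the two models output different dimensionality (1536 vs 1024), so vectors from one cannot be compared against a collection built with the other; re-embed the whole corpus after a switch. A quick sanity check, reusing the objects above:

```python
# The two models are not interchangeable at query time: a collection built
# with one model must also be queried with that model's vectors.
query = "微服务如何拆分?"
print(len(openai_embed.embed_query(query)))  # 1536 (text-embedding-3-small)
print(len(bge_embed.embed_query(query)))     # 1024 (bge-large-zh-v1.5)
```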

Evaluation Query Set

We designed 5 queries covering different difficulty levels:

| Query | Expected Content | Difficulty |
|---|---|---|
| Q1: "What are the principles of microservice decomposition?" | Section 1.1: DDD | Easy |
| Q2: "What's the difference between REST and gRPC?" | Section 2.1: REST vs gRPC | Easy |
| Q3: "How to solve distributed transactions?" | Section 3.2: Saga Pattern | Medium |
| Q4: "How to roll back a failed order?" | Saga compensation operations | Hard (requires reasoning) |
| Q5: "How to monitor microservices?" | Section 4: Observability | Easy |
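
To make the comparison reproducible, you can script it: build one vector store per model, run every query against both, and record the rank at which the expected chunk appears. A minimal sketch, reusing `chunks`, `openai_embed`, and `bge_embed` from above; the Chinese queries and `expected_keyword` values are illustrative stand-ins for the table entries:

```python
from langchain_community.vectorstores import Chroma

# Query → a keyword that identifies the expected chunk (illustrative values)
queries = {
    "微服务拆分有哪些原则?": "领域驱动设计",  # Q1
    "订单失败了怎么回滚?": "补偿",            # Q4
}

def hit_rank(store, query: str, expected_keyword: str, k: int = 5):
    """Return the 1-based rank of the first retrieved chunk containing the keyword, or None."""
    docs = store.similarity_search(query, k=k)
    for rank, doc in enumerate(docs, start=1):
        if expected_keyword in doc.page_content:
            return rank
    return None

for name, embed in [("openai", openai_embed), ("bge", bge_embed)]:
    store = Chroma.from_documents(documents=chunks, embedding=embed,
                                  collection_name=f"eval_{name}")
    for query, keyword in queries.items():
        print(name, query, "→ rank", hit_rank(store, query, keyword))
```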

Results Comparison

| Query | OpenAI text-embedding-3-small | BGE-large-zh-v1.5 | Analysis |
|---|---|---|---|
| Q1: Decomposition principles | ✅ #1 hit | ✅ #1 hit | Tie |
| Q2: REST vs gRPC | ✅ #1 hit | ✅ #1 hit | Tie |
| Q3: Distributed transactions | ✅ #1 hit | ✅ #1 hit | Tie |
| Q4: Order rollback | ⚠️ #3 hit | ✅ #1 hit | BGE wins: better semantic link between "rollback" and "compensation" |
| Q5: Monitoring | ✅ #1 hit | ✅ #1 hit | Tie |

Conclusion:

  • For simple queries (direct keyword matches), both models perform similarly
  • For difficult queries (semantic reasoning required), BGE's Chinese advantage is clear, especially on synonyms and paraphrases

Cost Comparison

| Model | Price (per million tokens) | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Extremely cheap |
| OpenAI text-embedding-3-large | $0.13 | Expensive but strong |
| BGE-large-zh-v1.5 (SiliconFlow) | ¥0.007 (~$0.001) | Cheapest |
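
A quick back-of-the-envelope from the table: embedding a 10-million-token corpus costs about $0.20 with text-embedding-3-small, about $1.30 with text-embedding-3-large, and roughly ¥0.07 (about $0.01) with BGE via SiliconFlow.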

If you have a GPU, BGE can also be deployed locally for free (details below).


Local Deployment vs API Calls: How to Choose?

API Calls: Pros and Cons

Pros:

  • Zero ops, one line of code
  • Model versions auto-update
  • Pay-per-use, no idle costs

Cons:

  • Data leaves your domain (compliance risk for sensitive docs)
  • Network latency and rate limits
  • Costs accumulate with high-frequency usage

Local Deployment: Pros and Cons

Pros:

  • Data never leaves your premises, eliminating the compliance risk of sending documents to a third party
  • No rate limits, ideal for high-frequency batch processing
  • More economical over time (one-time GPU investment)

Cons:

  • Requires GPU (BGE-large needs 4GB+ VRAM)
  • Operational complexity (model downloads, version management, serving)
  • Slow initial loading (model size: hundreds of MB to several GB)

Decision Tree

```
Is your data sensitive?
  ├─ Yes → Local Deployment (BGE or GTE)
  └─ No → Is call volume high?
            ├─ Yes → Local Deployment (saves money long-term)
            └─ No → API Calls (simpler)
                      ├─ Primarily Chinese? → BGE (SiliconFlow/Local)
                      └─ Primarily English? → OpenAI text-embedding-3-small
```

Special Considerations for Chinese Embedding

1. Tokenization Differences

English-centric Embedding models use subword tokenizers trained mostly on English text, while Chinese has no word boundaries at all. A model that isn't optimized for Chinese may segment "南京市长江大桥" as "南京市长 / 江大桥" ("Nanjing Mayor / Jiang Daqiao", as if it were a person's name) instead of the intended "南京市 / 长江大桥" ("Nanjing City / Yangtze River Bridge").

BGE's Advantage: Specifically trained on Chinese corpora, with tokenization and semantic understanding optimized for Chinese.
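
You can observe the tokenization gap directly with OpenAI's `tiktoken` package; `cl100k_base` is the encoding used by OpenAI's embedding models (exact token splits vary by library version):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI's embedding models
tokens = enc.encode("南京市长江大桥")
print(len(tokens))  # the 6-character phrase becomes multiple subword tokens
# Decoding token-by-token shows the splits; partial UTF-8 bytes appear as "�"
print([enc.decode([t]) for t in tokens])
```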

2. Idioms and Colloquialisms

| Query | Expected Match | English Model | BGE |
|---|---|---|---|
| "杀鸡取卵" (kill the goose that lays the golden eggs) | Short-sighted behavior | ❌ Often mismatches | ✅ Correct match |
| "亡羊补牢" (mend the fold after the sheep are lost) | Remedy after the fact | ❌ Often mismatches | ✅ Correct match |

3. Domain Terminology

Technical documents contain extensive jargon (e.g., "Saga pattern", "Two-phase commit", "Eventual consistency"). BGE, trained on Chinese technical community data, typically understands these terms better than general English models.


Code Walkthrough: Model Switching Wrapper

To make model switching easy in your project, create a factory function:

```python
import os
from langchain_openai import OpenAIEmbeddings


def build_embeddings(provider: str = "bge"):
    """
    Factory function: returns the appropriate Embedding model based on config.
    provider: "openai" | "bge" | "local"
    """
    if provider == "openai":
        return OpenAIEmbeddings(
            model="text-embedding-3-small",
            api_key=os.getenv("OPENAI_API_KEY"),
        )
    elif provider == "bge":
        return OpenAIEmbeddings(
            model="BAAI/bge-large-zh-v1.5",
            api_key=os.getenv("SILICONFLOW_API_KEY"),
            base_url="https://api.siliconflow.cn/v1",
            chunk_size=32,
        )
    elif provider == "local":
        # Requires: pip install sentence-transformers
        from langchain_community.embeddings import HuggingFaceEmbeddings
        return HuggingFaceEmbeddings(
            model_name="BAAI/bge-large-zh-v1.5",
            model_kwargs={"device": "cuda"},  # or "cpu"
            encode_kwargs={"normalize_embeddings": True},
        )
    else:
        raise ValueError(f"Unknown provider: {provider}")


# Usage: one line to switch
embeddings = build_embeddings("bge")  # Change this line to switch
```

Local BGE Deployment (Optional)

If you have a GPU, local deployment is simple:

```bash
pip install sentence-transformers
```
```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Test
result = embeddings.embed_query("Testing Chinese Embedding")
print(f"Vector dimensions: {len(result)}")  # 1024
```

The first run auto-downloads the model (~1.2GB), then caches locally.


Summary and Quick Reference

Core Takeaways

  1. Embedding is the semantic bridge of RAG — choosing the wrong model directly hurts retrieval accuracy
  2. English → OpenAI, Chinese → BGE — validated by both MTEB rankings and real-world tests
  3. Simple queries show little difference, complex semantic queries show large gaps — BGE excels at synonyms, idioms, and terminology
  4. Switching models takes one line of code — LangChain's abstraction makes model swapping cost-free

Embedding Model Quick Selection Guide

| Scenario | Recommended Model | Deployment | Reasoning |
|---|---|---|---|
| Chinese technical docs | BGE-large-zh-v1.5 | API/Local | Top Chinese performance |
| English general docs | text-embedding-3-small | API | Best value |
| English high-accuracy | text-embedding-3-large | API | Best quality but expensive |
| Multilingual mixed | bge-m3 | API/Local | 100+ language support |
| Data must stay on-premise | BGE-large-zh-v1.5 | Local | 4GB VRAM sufficient |
| Long text (>8K tokens) | Cohere embed-multilingual-v3.0 | API | Optimized for long texts |
