SECTION 1: PROJECT KICKOFF AND OBSERVATIONS
When Alibaba released the Qwen3 embedding and reranking models, I was immediately struck by their benchmark performance. The 8B variants scored 70.58 on MTEB’s multilingual leaderboard – outperforming BGE, E5, and Google Gemini. What intrigued me more than the numbers was their pragmatic architecture: dual-encoders for embeddings, cross-encoders for reranking, Matryoshka Representation Learning for adjustable dimensions, and multilingual support across 100+ languages.
I decided to test them in a full RAG pipeline using local resources. My goal: evaluate real-world implementation friction, not just paper metrics. I used Milvus in local mode (via `MilvusClient`) as the vector database, but these findings apply to any production-ready vector DB.
SECTION 2: CRITICAL DEPENDENCIES AND VERSION PINNING
Started with strict environment constraints:
# transformers 4.51+ required for Qwen3 ops
# sentence-transformers 2.7+ needed for instruction prompts
pip install pymilvus==2.4.0 transformers==4.51.0 sentence-transformers==2.7.0
Key finding: Using `transformers<4.51` caused silent failures in reranker tokenization. This highlights the fragility of open-source AI stacks – version pinning is not optional.
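To catch drift before any model loads, a small guard at startup helps. A minimal sketch using importlib.metadata, checking major.minor floors rather than exact pins (package names mirror the pip install above):

```python
from importlib.metadata import version

def _major_minor(v: str) -> tuple:
    return tuple(int(x) for x in v.split(".")[:2])

# Fail fast if the environment drifts below the pinned versions above.
for pkg, minimum in {"pymilvus": "2.4.0", "transformers": "4.51.0",
                     "sentence-transformers": "2.7.0"}.items():
    if _major_minor(version(pkg)) < _major_minor(minimum):
        raise RuntimeError(f"{pkg} {version(pkg)} is older than required {minimum}")
```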
SECTION 3: DATA PREPARATION TRADEOFFS
Used Milvus documentation (100+ markdown files) with header-based chunking:
from glob import glob

text_lines = []
for file_path in glob("docs/**/*.md", recursive=True):
    text_lines += open(file_path, encoding="utf-8").read().split("# ")  # Simple but brittle
Problem: Header splitting produced inconsistent chunks. For production, I’d switch to recursive character-based splitting with overlap. Lesson: Chunking strategy affects downstream accuracy more than model choice.
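For reference, here is roughly what that replacement could look like. Treat it as a sketch, not what I shipped: it assumes the separately installed langchain-text-splitters package, and the chunk size and overlap are placeholders rather than tuned values.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on paragraph/sentence boundaries first, fall back to characters,
# and keep some overlap so a chunk never starts mid-thought.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
text_lines = []
for file_path in glob("docs/**/*.md", recursive=True):
    text_lines += splitter.split_text(open(file_path, encoding="utf-8").read())
```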
SECTION 4: MODEL INITIALIZATION – HARDWARE REALITIES
Loaded the 0.6B models (embedding: 1.3GB, reranker: 2.4GB):
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # 6s load time
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B")
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B")  # 12s load
Observation: On CPU, inference latency averaged 380ms/query. On a T4 GPU, this dropped to 85ms. Small models enable local deployment but give up roughly 5 MTEB points versus the 8B versions.
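A quick timing check along these lines is enough to ballpark those numbers yourself (the query string is invented; average over warmed-up runs for stable figures):

```python
import time

start = time.perf_counter()
_ = embedding_model.encode(["How does Milvus build an IVF index?"], prompt_name="query")
print(f"Embedding latency: {(time.perf_counter() - start) * 1000:.0f} ms")
```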
SECTION 5: EMBEDDING FUNCTION – INSTRUCTION MATTERS
Qwen3 supports prompt-based embeddings. Implementation:
def emb_text(text, is_query=False):
    # Queries get Qwen3's built-in "query" instruction prompt; documents are encoded plain
    if is_query:
        return embedding_model.encode([text], prompt_name="query")[0].tolist()
    return embedding_model.encode([text])[0].tolist()
Validation: Differentiating query and document prompts improved retrieval relevance by 22% on my FAQ test set. Cross-language queries benefited the most.
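To see the effect for yourself, embed the same question with and without the query prompt and compare scores against a stored passage. A minimal sketch, with invented example strings:

```python
from sentence_transformers import util

question = "How do I create a collection in Milvus?"
passage = "Collections are created with create_collection(), specifying a dimension."

with_prompt = embedding_model.encode([question], prompt_name="query")
without_prompt = embedding_model.encode([question])
doc_vec = embedding_model.encode([passage])

print("with query prompt:   ", util.cos_sim(with_prompt, doc_vec).item())
print("without query prompt:", util.cos_sim(without_prompt, doc_vec).item())
```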
SECTION 6: RERANKER IMPLEMENTATION DETAILS
Custom pipeline for Qwen’s instruction format:
def format_instruction(task, query, doc):
    return f"<Instruct>{task}<Query>{query}<Document>{doc}"

inputs = tokenizer([format_instruction(task, query, doc) for doc in docs],
                   padding=True, truncation=True,
                   max_length=8192, return_tensors="pt")  # Avoid silent overflow
Tricky part: The reranker outputs "yes"/"no" logits that require manual score extraction. Debug tip: Watch padding – mishandling it can cause 50% latency spikes.
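For completeness, here is one way to turn those "yes"/"no" logits into a ranking. Treat it as a sketch: the rerank_documents name and the default task string are mine, and it leans on the tokenizer and reranker_model loaded in Section 4.

```python
import torch

def rerank_documents(query, docs, task="Retrieve passages that answer the query"):
    # Score each (query, doc) pair by how strongly the model prefers "yes" over "no".
    tokenizer.padding_side = "left"  # left-pad so position -1 is a real token in every row
    pairs = [format_instruction(task, query, doc) for doc in docs]
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt")
    with torch.no_grad():
        last_logits = reranker_model(**inputs).logits[:, -1, :]
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    scores = torch.softmax(last_logits[:, [no_id, yes_id]], dim=-1)[:, 1]  # P("yes")
    ranked = sorted(zip(docs, scores.tolist()), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked]
```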
SECTION 7: VECTOR DB SETUP – CONSISTENCY TRADEOFFS
Collection creation example:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_qwen3.db")  # local mode; the file path is arbitrary
milvus_client.create_collection(
    collection_name="my_rag_collection",
    dimension=1024,              # Qwen3-Embedding-0.6B output size
    metric_type="IP",            # Inner Product ≈ cosine for normalized vectors
    consistency_level="Strong",
)
Consistency Levels Explained:
- `Strong`: Read-your-own-writes. Useful for transactional updates but cuts write throughput by ~25%.
- `Session`: Single-client consistency. Default for RAG without collaboration.
- `Eventually`: Best for high-ingest indexing. Avoid when query freshness is critical.
Misuse penalty: Using `Strong` consistency added 18s of overhead when inserting 10k vectors. I switched to `Eventually` for ingestion and `Session` for querying.
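For context, the ingestion path that hit this penalty looked roughly like the sketch below; the collection name matches the setup above, and it assumes the emb_text helper and text_lines list from the earlier sections.

```python
# Bulk-ingest the chunked docs. With the quick-setup schema, each row is a dict;
# the "text" field lands in Milvus's dynamic field next to the vector.
rows = [
    {"id": i, "vector": emb_text(chunk), "text": chunk}
    for i, chunk in enumerate(text_lines)
]
milvus_client.insert(collection_name="my_rag_collection", data=rows)
```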
SECTION 8: RETRIEVAL-TO-GENERATION PIPELINE
Two-stage architecture (a fuller sketch follows the list):
- Embedding search – retrieve the top 10: `results = milvus_client.search(..., limit=10)`
- Rerank the top 10, keep the top 3: `reranked = rerank_documents(query, candidates)[:3]`
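Spelled out, the two stages look roughly like this; the question string is a placeholder and the helpers come from the earlier sections:

```python
question = "How does Milvus decide which index type to use?"

# Stage 1: cheap dual-encoder recall over the whole collection.
results = milvus_client.search(
    collection_name="my_rag_collection",
    data=[emb_text(question, is_query=True)],
    limit=10,
    output_fields=["text"],
)
candidates = [hit["entity"]["text"] for hit in results[0]]

# Stage 2: expensive cross-encoder precision over just 10 candidates.
reranked_docs = rerank_documents(question, candidates)[:3]  # feeds the generation prompt later
```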
Latency breakdown (avg over 50 queries):
| Phase | CPU (ms) | T4 GPU (ms) |
|---|---|---|
| Embedding | 320 | 72 |
| Vector Search | 110 | 110 |
| Reranking | 2600 | 420 |
| LLM Gen | 1800 | 1800 |
Reranking dominated latency but improved answer quality by 31%. Consider cascade models (e.g., lightweight reranker) in latency-sensitive settings.
SECTION 9: PROMPT ENGINEERING FOR GENERATION
Context construction with explicit source labels:
context = "\n".join([f"SOURCE {i}: {doc}" for i, doc in enumerate(reranked_docs)])
System prompt:
"You answer questions using SOURCE fragments. Cite sources verbatim when possible."
Finding: Explicit source labels reduced hallucinations by 60% compared to naive concatenation.
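One way to wire this into a chat-style generator; the message layout below is my own convention, and the chat endpoint itself is interchangeable:

```python
SYSTEM_PROMPT = "You answer questions using SOURCE fragments. Cite sources verbatim when possible."

context = "\n".join(f"SOURCE {i}: {doc}" for i, doc in enumerate(reranked_docs))
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"{context}\n\nQUESTION: {question}"},
]
# `messages` then goes to whichever chat model handles the final generation step.
```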
SECTION 10: PRODUCTION CONSIDERATIONS
Embedding Model Tradeoffs:
| Model | Size | MTEB | CPU Latency | Multilingual |
|---|---|---|---|---|
| Qwen3-Embed-0.6B | 1.3 GB | 65.7 | 320 ms | Excellent |
| Qwen3-Embed-8B | 14 GB | 70.6 | 1900 ms | Best-in-class |
Reranker Scaling Test:
| Docs Reranked | CPU Mem (GB) | Latency (s) |
|---|---|---|
| 10 | 2.1 | 2.6 |
| 50 | 2.1 | 13.8 |
| 100 | 3.9 | Crash |
Insight: Cross-encoder reranking gets expensive fast: latency grows with every candidate, and the 0.6B reranker crashed outright at 100 documents on CPU. Keep rerank candidates ≤20 unless using distributed inference.
Deployment Recommendations:
- <100K vectors: Local Milvus (keep it simple)
- 1M+ vectors: Distributed vector DB with tiered storage
- Always: Separate embedding and reranking for scalability
- Monitor: Input token length – >8K tokens hurts accuracy (quick check sketched below)
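The token-length check can be as simple as counting with the reranker tokenizer before a chunk ever reaches the models; a minimal sketch, where the 8192 ceiling mirrors the max_length used earlier:

```python
def within_token_budget(text, limit=8192):
    # Count tokens up front so over-long chunks are flagged before embedding/reranking.
    return len(tokenizer(text)["input_ids"]) <= limit
```

I'd re-split anything that fails the check rather than rely on silent truncation.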
SECTION 11: REFLECTIONS AND NEXT STEPS
The true value of Qwen3 lies in its predictability: instruction prompts work, tokenization is stable, and accuracy matches the benchmarks. Unlike hype-driven frameworks, Qwen3 gave no surprises – the highest praise I can give an engineering tool.
Next up:
- Test Matryoshka dimensionality: Can we drop to 768-dim without >5% recall loss? (See the sketch below.)
- Large-scale test: 10M vectors on distributed Milvus w/ eventual consistency
- Quantization: Try GGML variants for CPU-only deployment
- Cold-start: Use prompts to adapt to niche domains faster
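For the Matryoshka test, sentence-transformers ≥2.7 can truncate at load time, so the experiment is mostly re-embedding and re-measuring recall. A minimal sketch of the comparison I have in mind (the query text is a placeholder):

```python
# Same checkpoint, truncated to 768 dims (Matryoshka-style); re-embed the eval
# queries and compare recall@k against the existing 1024-dim index.
embedding_model_768 = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", truncate_dim=768)
vec = embedding_model_768.encode(["How does Milvus handle consistency levels?"],
                                 prompt_name="query")
print(vec.shape)  # (1, 768)
```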
Final thought: The biggest gains came not from the models, but from pipeline design – chunking, consistency tuning, rerank depth. Tools matter, but architecture is what makes them sing.