DEV Community: yang yaru

Offline Evaluation in AI Applications

yang yaru — Mon, 25 May 2026 08:00:33 +0000

As AI applications become increasingly complex, building the application itself is no longer the hardest part.

The real challenge is ensuring that the system consistently produces reliable, accurate, and controllable results.

This is where Offline Evaluation becomes critical.

Offline Evaluation is one of the most important components in modern AI engineering, especially in systems involving:

RAG (Retrieval-Augmented Generation)
AI Agents
Tool Calling
Workflow Orchestration
Multi-step Reasoning
Enterprise Knowledge Bases

In this article, we will explore what Offline Evaluation is, why it matters, and how it is commonly implemented in industrial AI systems.

What Is Offline Evaluation?

Offline Evaluation refers to evaluating an AI application using a predefined dataset without exposing the evaluation to real users in production.

In simple words:

We use historical or manually created test cases to measure whether the AI system performs well before deployment.

Instead of relying on subjective feelings like:

“This prompt feels better.”
“The answers look smarter.”
“The retrieval seems improved.”

Offline Evaluation provides measurable evidence.

Why Offline Evaluation Matters

Traditional software engineering has:

Unit Tests
Integration Tests
Regression Tests

AI applications also need regression testing.

Because in AI systems, changing one component can unexpectedly affect another:

Updating prompts
Changing retrieval strategies
Switching models
Modifying chunk sizes
Adjusting rerankers
Adding tools

Even a small modification may reduce answer quality or increase hallucinations.

Offline Evaluation helps detect these problems early.

Typical Offline Evaluation Workflow

A common workflow looks like this:

Evaluation Dataset
        ↓
Run AI Pipeline
        ↓
Generate Answers / Retrieval Results
        ↓
Apply Evaluation Metrics
        ↓
Generate Evaluation Report

For example:

Question:
Does this tour package include lunch?

Ground Truth:
Lunch is not included.

Knowledge Source:
ServiceNotes.Meals

Then we can compare multiple versions:

Version A:
Top-5 retrieval + old prompt

Version B:
Top-8 retrieval + reranker + new prompt

The evaluation system determines which version performs better.

What Can Be Evaluated?

1. Answer Quality

This is the most common evaluation target.

Typical questions include:

Is the answer correct?
Does it match user intent?
Is the answer complete?
Is the reasoning logical?
Does the model hallucinate?
Is the response helpful?

Common metrics:

Metric	Description
Accuracy	Whether the answer is correct
Completeness	Whether important information is missing
Helpfulness	Whether the answer is useful
Hallucination Rate	Frequency of fabricated information
Format Compliance	Whether output follows required format

2. RAG Retrieval Quality

In RAG systems, retrieval quality is extremely important.

Because if retrieval fails, generation quality will also fail.

Typical evaluation questions:

Did the system retrieve the correct chunks?
Was the correct document included in Top-K results?
Did reranking improve ordering?
Were important documents missed?

Common metrics:

Metric	Description
Recall@K	Whether correct documents appear in Top-K
Precision@K	Ratio of relevant results in Top-K
MRR	Whether correct results rank near the top
Context Relevance	Whether retrieved context is useful
Faithfulness	Whether the answer stays grounded in retrieved context

3. Tool Calling Evaluation

For AI Agents, tool calling quality becomes another critical dimension.

The system must evaluate:

Did the agent choose the correct tool?
Were parameters correct?
Did the workflow complete successfully?
Did the agent recover from failures?

Example metrics:

Metric	Description
Tool Selection Accuracy	Whether the correct tool was chosen
Parameter Accuracy	Whether arguments were valid
Task Success Rate	Whether the final task succeeded
Step Efficiency	Whether unnecessary steps were used
Recovery Ability	Whether the system handled failures gracefully

Three Common Evaluation Approaches

1. Rule-Based Evaluation

This is the simplest method.

Useful for checking:

JSON format validity
Required fields
Keyword existence
Output structure

Example:

{
  "suggestion": "...",
  "aiReview": "..."
}

The evaluator checks:

Is it valid JSON?
Are required fields present?
Are value types correct?

Advantages:

Cheap
Fast
Stable

Disadvantages:

Cannot judge semantic quality

2. Human Evaluation

Humans manually score outputs.

Example rubric:

Category	Score
Accuracy	1-5
Completeness	1-5
Clarity	1-5
Hallucination	Yes/No

Advantages:

Most reliable

Disadvantages:

Expensive
Slow
Difficult to scale

3. LLM-as-a-Judge

This is increasingly common in modern AI systems.

A separate LLM evaluates another model’s output.

Inputs may include:

User question
Ground truth
Retrieved context
Generated answer
Evaluation rubric

The judge model outputs structured scores:

{
  "faithfulness": 5,
  "answer_relevance": 4,
  "hallucination": false,
  "reason": "The answer is grounded in the provided context."
}

Advantages:

Automated
Scalable
Useful for continuous experimentation

Disadvantages:

Judge models may also be inconsistent
Rubric design becomes very important

A Real RAG Example

Imagine your knowledge base only contains information for a one-day tour.

The user asks:

What will we do on the second day?

A bad AI system might hallucinate:

On the second day, visitors will explore the mountains...

Even though the knowledge base never mentioned a second day.

A properly evaluated RAG system should instead answer:

The knowledge base does not contain information about a second-day itinerary.

Offline Evaluation can specifically test:

Whether hallucination occurred
Whether the model stayed grounded
Whether the retrieval contained supporting evidence

This is extremely important in enterprise AI systems.

Example Architecture for Offline Evaluation

A common industrial design looks like this:

                ┌──────────────────┐
                │ Evaluation Cases │
                └────────┬─────────┘
                         ↓
                ┌──────────────────┐
                │ AI Application   │
                │ (RAG / Agent)    │
                └────────┬─────────┘
                         ↓
                ┌──────────────────┐
                │ Evaluation Layer │
                │ Rules / LLMJudge │
                └────────┬─────────┘
                         ↓
                ┌──────────────────┐
                │ Evaluation Report│
                └──────────────────┘

Typical Database Design

A practical schema might include:

eval_dataset
- id
- name
- description

eval_case
- id
- dataset_id
- question
- expected_answer
- expected_chunk_ids

eval_run
- id
- model_name
- prompt_version
- retriever_config

eval_result
- id
- run_id
- case_id
- actual_answer
- scores
- judge_reason

This allows engineers to compare experiments across:

Prompt versions
Models
Retrieval strategies
Chunking methods
Rerankers

Offline Evaluation Is a Core Part of AI Engineering

Many beginners think AI engineering is mostly about:

Prompt writing
Calling APIs
Connecting models

But in real-world systems, evaluation is one of the hardest and most important parts.

Because eventually, every AI team faces the same question:

“How do we know the system is actually improving?”

Offline Evaluation provides the answer.

It transforms AI development from:

“I think it became better.”

into:

“We have measurable evidence that it improved.”

And that is what separates demos from production-grade AI systems.

Final Thoughts

As AI applications continue evolving toward:

Long-term memory
Autonomous agents
Complex workflows
Enterprise reasoning systems

evaluation will become even more important.

The future of AI engineering is not only about making models more powerful.

It is also about making systems:

measurable
reliable
controllable
testable

Offline Evaluation is one of the foundations that makes this possible.

Understanding MCP: The Communication Layer Between AI Agents and Tools

yang yaru — Wed, 20 May 2026 03:24:45 +0000

The rise of AI Agents has changed the way we think about software systems.

Modern AI applications are no longer just chatbots. They are gradually becoming intelligent systems capable of reasoning, planning, and interacting with the external world.

However, an important question emerges:

How can an AI actually interact with tools, APIs, databases, or enterprise systems?

This is where MCP (Model Context Protocol) enters the picture.

What Is MCP?

At its core, MCP is a standardized protocol that allows AI agents to communicate with tools.

You can think of MCP as:

USB-C for AI Tools

or:

HTTP for AI-to-Tool communication

MCP does not make the AI smarter.

Instead, it standardizes how AI systems discover tools, invoke them, and receive results.

The Core Problem MCP Solves

Before MCP, every AI platform had its own integration method.

For example:

OpenAI Function Calling
Claude Tool Use
Custom LangChain integrations
Proprietary enterprise plugins

Every platform required separate adapters.

This created an ecosystem problem:

Models × Tools = Integration Explosion

If you had:

Multiple LLM providers
Multiple enterprise systems
Multiple APIs

You often needed to build integrations repeatedly.

MCP attempts to solve this by defining a common communication standard.

MCP Is NOT the Agent

One common misunderstanding is:

MCP = Agent

This is incorrect.

MCP is not responsible for:

reasoning
planning
memory
workflow orchestration
multi-agent collaboration

Instead, MCP only focuses on:

AI ↔ Tool Communication

The Modern AI Agent Architecture

A typical industrial AI agent system looks like this:

User
 ↓
LLM (Reasoning Layer)
 ↓
Agent Runtime (Orchestration Layer)
 ↓
MCP (Tool Communication Layer)
 ↓
Tools / APIs / External Systems

Each layer has different responsibilities.

What the LLM Actually Does

The LLM itself never executes code.

This is a critical concept.

When a user says:

"Check the weather in Beijing."

The LLM may generate something like:

{
  "tool": "get_weather",
  "arguments": {
    "city": "Beijing"
  }
}

This is not execution.

It is merely:

structured intent prediction

The actual execution is handled by the Agent Runtime.

The Role of the Agent Runtime

The runtime is the real execution engine.

It is responsible for:

parsing tool calls
calling external APIs
handling retries
permission control
state management
workflow orchestration
logging and monitoring

For example:

if(toolName.equals("get_weather")) {
    weatherService.query(city);
}

The runtime executes the real business logic.

Where MCP Fits Into the Flow

MCP operates between the runtime and the tools.

Example flow:

LLM generates Tool Call
 ↓
Agent Runtime parses result
 ↓
MCP Client communicates with MCP Server
 ↓
MCP Server invokes Tool
 ↓
Tool Result returned
 ↓
LLM generates final response

This means MCP is essentially:

a standardized tool transport layer

What MCP Actually Standardizes

MCP mainly standardizes four things.

1. Tool Discovery

Agents can dynamically ask:

"What tools are available?"

2. Tool Schema

Tools expose metadata like:

{
  "name": "search_order",
  "description": "Search order information",
  "inputSchema": {}
}

This helps AI understand:

what the tool does
when to use it
what parameters it needs

3. Tool Invocation

MCP standardizes how tools are called.

For example:

{
  "tool": "search_order",
  "arguments": {
    "orderId": "1001"
  }
}

4. Result Return

Results are returned in a standardized structure that different AI systems can understand.

Tool Schema vs Real API Schema

An important insight is that:

LLM Tool Schema
≠ MCP Schema
≠ Real Backend API Schema

They serve different purposes.

LLM Tool Schema

Optimized for semantic understanding.

Example:

{
  "name": "get_current_weather",
  "description": "Use when the user asks about weather conditions."
}

MCP Schema

Optimized for protocol communication and interoperability.

Backend API Schema

Optimized for real execution logic.

Example:

GET /weather/v3/current?location=101010100

The runtime often maps between these layers.

MCP as an Abstraction Layer

One of the most powerful ideas behind MCP is:

Tool Virtualization

From the agent's perspective, it no longer matters whether the underlying tool is:

a Java function
a Python script
a database query
a REST API
a shell command

Everything becomes a unified capability.

MCP Is Similar to JDBC

As a backend engineer, I find this analogy particularly useful.

JDBC allows Java to interact with different databases through a unified interface:

MySQL
PostgreSQL
Oracle

Similarly, MCP allows AI agents to interact with different tools through a unified protocol.

In this sense:

MCP is like JDBC for AI tools.

Why MCP Matters

The future of AI applications is shifting from:

Chatbot
→ RAG
→ Workflow
→ Tool Calling
→ Agent Systems
→ Multi-Agent Systems

As AI systems become more capable, tool ecosystems become increasingly important.

MCP is significant because it provides:

standardized infrastructure for AI-to-tool interaction

This may become one of the foundational layers of future AI operating systems.

Final Thoughts

MCP does not replace LLMs.

It does not replace workflows.

It does not replace Agent runtimes.

Instead, it provides something equally important:

a universal communication layer between AI and external capabilities

In many ways, MCP represents the transition from:

AI that can talk

to:

AI that can operate systems.

From Journaling to Evidence-Based Reflection: Building InnerTrace, a Self-Growth AI Agent

yang yaru — Wed, 06 May 2026 07:41:05 +0000

Not just another chatbot, but a reflection assistant that reasons over your own history.

For a long time, I had one frustration with journaling: we write a lot, but very little turns into structured self-understanding.

Entries are scattered across notes, hard to retrieve, and even harder to connect over time. So when we ask, “What’s been draining me lately?”, we usually answer from memory bias, not evidence.

That is why I built InnerTrace — an AI-powered self-growth system based on LLMs, long-term memory, and reflection loops.

The Problem I Wanted to Solve

Most journaling tools are good at capturing thoughts, but weak at helping users reason over them.

Common gaps:

Plenty of records, low reuse
Emotional and behavioral patterns are hard to detect
Reflection outcomes are often intuitive, not evidence-grounded

InnerTrace is designed to bridge that gap:

Raw daily input -> structured understanding -> evidence-based reflection -> actionable suggestions

What InnerTrace Actually Does

At a high level, the system works like this:

You write freely (no rigid template)
LLMs extract structured signals (emotion, events, topics, stress/energy)
The system stores multi-layer memory (raw logs + structured insights + periodic summaries)
The agent retrieves relevant history by topic/time window
It returns reasoning results with evidence and suggested actions

So instead of giving generic motivational output, InnerTrace tries to answer with traceable context.

Core Design Principles

1) No hallucinated psychology

InnerTrace reflects only from user-provided history, not assumptions.

2) Evidence-first insight

Key conclusions should be tied to concrete records and time windows.

3) Gentle guidance, not judgment

The goal is practical direction, not moral evaluation.

Technical Stack (Current Version)

Backend

Java 17
Spring Boot
MyBatis-Plus

Storage

MySQL for structured data
Redis for cache
Vector memory via pgvector / Milvus is in progress

AI Layer

LLM API for analysis and reasoning
Embedding model for semantic memory expansion

Infrastructure

Async task processing

The service architecture is modularized into:

Journal Module
Analysis Module
Memory Module
Reflection Module
Agent Module

This makes it easier to evolve toward multi-agent reasoning and deeper personalization later.

How It Differs from Typical AI Journal Apps

I see three practical differences:

Long-term memory over single-turn chat
Evidence aggregation over generic comfort text
Actionable micro-suggestions over abstract advice

If a user asks, “Why have I felt tired recently?”, the goal is not just a fluent answer. The goal is to connect repeated contexts, time ranges, and patterns — then provide a reasoned response backed by historical evidence.

Current Progress & Roadmap

Completed:

Journal input and storage
Structured analysis pipeline

In progress / next:

Vector memory retrieval
Weekly/monthly reflection reports
Multi-agent reasoning
Personalization tuning
Rich visualization dashboard

What I Learned While Building

Two lessons stood out:

“LLM can speak” != “system can reason well”

Product quality depends heavily on memory modeling, retrieval strategy, and evidence organization.
Growth-oriented AI needs temporal consistency

Without cross-time context, even great single responses become short-lived reassurance.

Screenshots

Quick Start

Backend (8080):

cd InnerTrace
mvn clean package
java -jar target/*.jar
# or
mvn spring-boot:run

Frontend (3000):

cd innertrace_frontend
npm install
npm run dev

Then open http://localhost:3000.

Before running, configure your database and LLM API key in application.yml.

Repositories

Backend: https://github.com/yaruyng/InnerTrace

- Frontend: https://github.com/yaruyng/InnerTrace_frontend

If you’re also exploring AI for long-term self-growth, I’d love to exchange ideas.

A question I’m actively thinking about: what should a reflection AI optimize for first — accuracy, empathy, or actionability?

Understanding the Rerank Stage in Industrial RAG Pipelines

yang yaru — Fri, 13 Mar 2026 09:03:30 +0000

Retrieval-Augmented Generation (RAG) systems and modern search engines rely on multiple stages to retrieve the most relevant information for a user query. One critical component in these pipelines is Rerank, a stage designed to improve the precision of retrieved results.

In industrial systems, retrieval methods such as vector search or keyword search prioritize speed and recall, which means they often return many candidates that are only partially relevant. The reranking stage solves this problem by reordering these candidates using a stronger but more computationally expensive model.

This article explains how reranking typically works in production systems.

1. Query Processing

The pipeline begins when a user submits a query. Before retrieval starts, the system may perform several preprocessing steps to better understand the user's intent.

Common steps include:

Query normalization – removing noise, punctuation, or unnecessary tokens
Query rewriting – expanding or clarifying the query to improve recall
Intent detection – identifying the user’s goal or domain

Example query:

"How to deploy Dify with Docker?"

These preprocessing steps help the retrieval system generate better candidate results.

2. Multi-Recall (Candidate Generation)

Next, the system retrieves a large set of candidate documents using fast retrieval strategies. This stage focuses on maximizing recall, meaning it tries to retrieve as many potentially relevant documents as possible.

Common retrieval approaches include:

Vector Search (embedding similarity)
Keyword Search such as BM25
Metadata filtering
Hybrid retrieval, which combines multiple methods

Example:

Vector Search → Top 50 documents
Keyword Search → Top 30 documents

After merging, the system may have around 80 candidate documents.

3. Candidate Merge and Deduplication

Because results come from multiple retrieval channels, they must be merged into a single candidate set.

Typical operations include:

Removing duplicate documents
Normalizing scores from different retrieval methods
Combining results into a unified list

After merging and deduplication, the candidate set may shrink to around 60 documents.

4. Pre-Filtering (Optional)

To reduce computational cost before reranking, many production systems apply lightweight filtering.

Examples include:

Removing extremely short or low-quality documents
Filtering by similarity thresholds
Applying metadata constraints

After filtering, the system may keep around 40 candidates for reranking.

5. Reranking Stage (Core Step)

The rerank stage evaluates how well each candidate document matches the query and assigns a new relevance score.

Unlike embedding similarity—which compares vectors independently—rerank models typically use cross-encoders. These models process the query and document together, allowing them to capture deeper semantic relationships.

Example scoring:

Query:
"How to deploy Dify with Docker?"

Candidate scores:

Document A → 0.92
Document B → 0.88
Document C → 0.40
Document D → 0.31

The candidates are then reordered according to these scores.

6. Top-K Selection

After reranking, the system selects the Top-K documents with the highest relevance scores.

Typically:

Top 5–10 documents are selected.

These documents may be:

Sent to a Large Language Model (LLM) as context in a RAG system
Returned directly to the user in a search interface

7. Post-Processing (Optional)

Some production systems apply additional processing after reranking.

Examples include:

Diversity control to avoid returning highly similar documents
Chunk merging to combine adjacent text segments
Context window optimization to fit within LLM token limits

These steps help improve the final response quality.

Typical Industrial Architecture

User Query
    ↓
Query Rewrite / Intent Detection
    ↓
Multi-Recall
(Vector + Keyword + Metadata)
    ↓
Candidate Merge
    ↓
Pre-Filter
    ↓
Rerank (Cross-Encoder Model)
    ↓
Top-K Selection
    ↓
LLM Generation / Final Results

Why Reranking Is Necessary

Fast retrieval techniques such as vector search or BM25 are optimized for speed, not deep semantic understanding. As a result, their ranking quality is limited.

Reranking addresses this limitation by applying a more sophisticated model to a smaller set of candidates.

Method	Speed	Accuracy
Vector Search	Fast	Medium
BM25 Keyword Search	Fast	Medium
Rerank Model	Slower	High

Because reranking is computationally expensive, industrial systems follow a common architecture:

Fast Recall + Slow Precision

First, retrieve many candidates quickly. Then use a stronger model to produce a more accurate ranking.

Conclusion

Reranking plays a crucial role in modern search and RAG pipelines. By re-evaluating retrieved candidates with more powerful models, it significantly improves the relevance of final results while keeping system latency manageable.

Understanding this stage is essential for anyone building production-grade AI applications, especially systems that combine retrieval with large language models.

Query Rewrite in RAG Systems: Why It Matters and How It Works

yang yaru — Mon, 09 Mar 2026 08:20:48 +0000

In Retrieval-Augmented Generation (RAG) systems, many developers focus heavily on embeddings and vector databases. However, in real-world production systems, one of the most critical components is often overlooked:

Query Rewrite.

Query rewriting significantly improves retrieval quality and can dramatically impact the overall performance of a RAG pipeline.

This article explains:

What Query Rewrite is
Why it is necessary
How it is implemented in production systems
Common engineering patterns for query optimization

1. What Is Query Rewrite?

Query Rewrite refers to the process of transforming a user's original query into one or more optimized queries that are better suited for retrieval.

Users typically ask questions in natural language, but retrieval systems perform best when queries are:

clear
explicit
keyword-rich
structured

Therefore, a rewriting step is often introduced before retrieval.

Basic pipeline:

User Query
     ↓
Query Rewrite
     ↓
Optimized Retrieval Query
     ↓
Vector / Keyword Search

The rewritten queries help the retrieval system locate more relevant documents.

2. Why Query Rewrite Is Necessary

User queries often suffer from several issues that reduce retrieval quality.

2.1 Missing Context

Users frequently omit important context.

Example:

User query:
What is it?

The system may need to expand it to something like:

What is the architecture of LangGraph?

Without context, retrieval becomes ineffective.

2.2 Conversational Language

Users naturally ask questions in informal language.

Example:

How does AI connect to databases?

A retrieval-friendly query might be:

How to connect an LLM to a database

2.3 Very Short Queries

Example:

LangGraph

A better query for retrieval could be:

LangGraph framework architecture and use cases

2.4 Poor Retrieval Keywords

Example:

Why do AI models make things up?

A rewritten query might be:

LLM hallucination causes

This makes it easier to match relevant documents.

3. Query Rewrite in the RAG Pipeline

A typical RAG system pipeline looks like this:

User Query
     ↓
Query Rewrite
     ↓
Intent Analysis
     ↓
Multi Retrieval
(vector / keyword / metadata)
     ↓
Hybrid Merge
     ↓
Top-K
     ↓
Score Threshold
     ↓
Rerank
     ↓
LLM

Query rewriting is the first step in optimizing retrieval quality.

4. A Practical Query Rewrite Prompt

In many production systems, a small language model is used to generate optimized queries.

Example prompt:

You are a search query optimizer.

Rewrite the user's question to improve retrieval quality.

Rules:
1. Preserve the original meaning.
2. Remove conversational language.
3. Add missing keywords if necessary.
4. Generate 3 different search queries.

User Question:
{query}

Return JSON format:
{
 "intent": "...",
 "queries": ["...", "...", "..."]
}

This prompt produces structured retrieval queries.

5. Example

User input:

What is the difference between LangGraph and AutoGPT?

Rewritten output:

{
 "intent": "compare two AI agent frameworks",
 "queries": [
   "LangGraph vs AutoGPT architecture comparison",
   "differences between LangGraph and AutoGPT agent framework",
   "LangGraph workflow design vs AutoGPT autonomous agent"
 ]
}

Each generated query can then be sent to the retrieval system independently.

6. Common Query Rewrite Patterns

Production systems typically implement query rewriting in several ways.

6.1 Multi-Query Retrieval

The system generates multiple queries from a single user question.

Example:

Query 1 → vector search
Query 2 → vector search
Query 3 → vector search

The results are then merged and ranked.

Frameworks such as LangChain implement this strategy with components like MultiQueryRetriever.

Advantages:

Higher recall
Better document coverage

Disadvantages:

Increased retrieval cost

6.2 Query Decomposition

Complex questions are split into smaller sub-questions.

Example:

User query:

Why is LangGraph more stable than AutoGPT?

Decomposed queries:

LangGraph architecture
AutoGPT architecture
AutoGPT stability issues

Each query retrieves documents independently.

This method is particularly effective for complex reasoning tasks.

6.3 Query Routing

Some systems determine the intent of the query and route it to different retrieval mechanisms.

Example:

Query
 ↓
Intent Detection
 ↓
Router

Example routing table:

Intent	Retrieval Method
Technical explanation	Vector search
API documentation	Keyword search
Database query	SQL

7. The Full Query Optimization Pipeline

In advanced RAG systems, query processing often includes multiple steps:

User Query
   ↓
Query Rewrite
   ↓
Intent Detection
   ↓
Query Expansion
   ↓
Multi Retrieval
(vector + keyword)
   ↓
Hybrid Merge
   ↓
Top-K
   ↓
Rerank
   ↓
LLM

In practice, most RAG optimizations focus on three core areas:

Query Quality
Retrieval Strategy
Reranking

8. A Less Known Optimization Trick

Some systems do not stop at generating multiple queries.

Instead, they perform an additional step:

Rewrite
↓
Generate 5 queries
↓
Select the best 3 queries
↓
Run retrieval

This approach is sometimes called self-query optimization.

It improves retrieval quality while controlling cost.

9. Why Query Rewrite Matters More for Large Knowledge Bases

When a knowledge base is small:

~1000 documents

A simple query may still retrieve relevant information.

But in large systems:

~1,000,000 documents

Query quality becomes critical.

Poor queries lead to:

Low recall
↓
Missing documents
↓
Incorrect or incomplete LLM responses

10. Frameworks Supporting Query Rewrite

Several RAG frameworks provide built-in query transformation tools:

LangChain
LlamaIndex
Haystack

These frameworks include features such as:

Query transformation
Multi-query retrieval
Sub-question decomposition
Query routing

All of these techniques fall under the broader concept of query optimization.

Conclusion

While embeddings and vector databases are essential components of RAG systems, query quality often determines retrieval performance.

A well-designed Query Rewrite layer can:

improve recall
increase retrieval relevance
reduce hallucinations
enhance overall system reliability

In many production RAG pipelines, optimizing the query itself is one of the most effective ways to improve results.

Retrieval Strategy Design: Vector, Keyword, and Hybrid Search

yang yaru — Sat, 28 Feb 2026 06:08:56 +0000

This article explains how to design a modern retrieval strategy for AI systems, especially Retrieval-Augmented Generation (RAG). The focus is not only on definitions, but on engineering trade-offs, system architecture, and practical defaults.

The target audience is backend engineers who can already use embeddings, but want to design reliable and controllable search systems.

1. Where Retrieval Strategy Fits in the System

A typical modern retrieval pipeline looks like this:

User Query
  ↓
Query Rewrite / Intent Analysis
  ↓
Multi-Channel Retrieval
  (Vector / Keyword / Metadata)
  ↓
Hybrid Merge
  ↓
Top-K Limiting
  ↓
Score Threshold Filtering
  ↓
(Optional) Reranking
  ↓
LLM Generation

Concepts like vector search, hybrid search, Top-K, and threshold filtering are not isolated features. They work together inside the recall and filtering stages of this pipeline.

2. Vector Search: The Semantic Recall Layer

2.1 What Vector Search Solves

Vector search addresses the problem of semantic mismatch:

The user and the document use different words
The meaning is similar, but lexical overlap is low

Example:

Query: How to reduce dopamine addiction
Document: Attention control and dopamine regulation

Keyword search fails here, but embeddings succeed.

2.2 Core Parameters Engineers Must Understand

Similarity Metric

The most common similarity metrics are:

Cosine Similarity (industry default)
Dot Product
L2 Distance

Most embedding models are trained assuming cosine similarity, so databases typically follow that convention.

Index Type (Performance-Critical)

Index Type	Use Case
Flat	Small datasets, maximum accuracy
HNSW	General-purpose, production default
IVF	Very large-scale datasets

For most knowledge-base and RAG systems, HNSW is the best trade-off.

2.3 The Fundamental Weakness of Vector Search

Vector search is strong at recall, but weak at precision:

It retrieves related content
It may retrieve irrelevant but semantically nearby content

This is why vector search must be combined with:

Top-K limits
Score thresholds
Reranking

3. Keyword Search (BM25): The Precision Layer

Keyword search is not obsolete. Its role is deterministic precision.

It excels at:

Code and stack traces
API names
Error messages
Proper nouns
Numbers and IDs

In many technical queries, keyword search outperforms embeddings.

Another key benefit is controllability: keyword matching acts as a deterministic filter that reduces hallucinations.

4. Hybrid Search: The Industry Standard

Hybrid search combines the strengths of both approaches:

Vector search for semantic recall
Keyword search for lexical precision

This is no longer optional in production systems.

4.1 Parallel Hybrid (Most Common)

Vector Search Top-K = 20
Keyword Search Top-K = 20
↓
Merge Results
↓
Rerank

Advantages:

Simple to implement
Stable behavior
Widely used in production

4.2 Score Fusion Hybrid

A weighted scoring approach:

Final Score = α × Vector Score + β × BM25 Score

This method is suitable for search-engine-like systems that require strong global ranking.

5. Top-K: A Recall Boundary, Not a Quality Guarantee

A common misconception is:

Higher Top-K means better results

In reality:

Top-K defines the maximum recall scope
Large Top-K increases noise
Token usage and latency increase rapidly

Practical Defaults

Scenario	Recommended Top-K
FAQ	3–5
Technical Docs	5–10
Code Search	10–20

For most RAG systems:

Vector Top-K: 8–10
Keyword Top-K: 8–10

6. Score Threshold Filtering: The Missing Safeguard

Top-K always returns results — even when nothing is relevant.

Threshold filtering solves this:

Only keep results where score > threshold

Without thresholds, systems produce classic failures:

Query: Apple phone
Result: Apple fruit

Threshold Guidelines (Cosine Similarity)

Similarity	Interpretation
> 0.85	Strongly relevant
0.75–0.85	Acceptable
< 0.70	Noise

Many production systems use:

threshold ≈ 0.78

7. A Practical, Production-Ready Retrieval Strategy

A robust default pipeline:

1. Optional Query Rewrite
2. Vector Search (Top-K = 10)
3. Keyword Search (Top-K = 10)
4. Merge Results
5. Filter: score > 0.78
6. Rerank Top 5
7. Send Top 3 to LLM

This structure balances recall, precision, cost, and stability.

8. What Engineers Should Actually Focus On

8.1 Recall vs Precision Trade-off

Vector Search → Recall
Keyword Search → Precision
Reranker → Final Quality

Understanding this triangle is more important than tuning any single parameter.

8.2 Chunk Design Matters More Than Algorithms

Poor chunking breaks all retrieval strategies:

Chunks too long → embedding dilution
Chunks too short → context fragmentation

Good retrieval starts with good chunk boundaries.

8.3 Top-K Is Not the Final Output Size

Typical production flow:

Retrieve 20
Filter to 12
Rerank to 5
LLM consumes 3

Conclusion

Modern retrieval systems are not built on vector search alone.

Hybrid retrieval + threshold filtering + reranking is the real foundation of stable, production-grade RAG systems.

If you design retrieval with a system mindset instead of a single-algorithm mindset, quality improves dramatically.

Designing a Scalable Knowledge Base for Large Language Models

yang yaru — Wed, 11 Feb 2026 06:47:41 +0000

A Practical Engineering Guide to Cleaning, Semantic Chunking, Metadata, and Batch Embeddings

Large Language Model (LLM) knowledge bases are often misunderstood as simply “vectorizing documents.” In reality, a production-grade knowledge system is a retrieval infrastructure that must be traceable, incremental, and measurable.

This article walks through a practical engineering pipeline covering:

Data cleaning and normalization
Semantic chunking strategies
Metadata schema design
Batch embedding architecture
Retrieval and evaluation considerations

The focus is not theory, but implementation decisions that work in real systems.

1. System Architecture Overview

Before implementation, define the boundaries of your pipeline. A robust LLM knowledge base usually consists of the following stages:

Ingest → Normalize → Chunk → Enrich → Embed → Index → Retrieve → Monitor

Core Responsibilities

Ingest: PDFs, web pages, Markdown, databases, or internal docs
Normalize: Convert raw content into structured blocks
Chunk: Create retrieval-ready units
Enrich: Attach metadata and context
Embed: Generate vectors with version control
Index: Build hybrid search indexes
Serve: Retrieval + reranking + citation
Monitor: Evaluate retrieval quality continuously

A knowledge base is closer to a search engine than a simple storage system.

2. Data Cleaning and Normalization

The goal is not to “clean aggressively,” but to preserve structural signals.

Required Processing

Convert all content to UTF-8
Normalize whitespace and line breaks
Remove duplicated navigation/footer content
Detect headings (H1/H2/H3 or numeric sections)
Preserve structural blocks:
- Paragraphs
- Lists
- Tables
- Code blocks

Avoid flattening everything into plain text. Structure improves both retrieval accuracy and traceability.

Common Noise Sources

Web navigation bars and cookie banners
PDF headers and repeated page numbers
Hyphenated line breaks in scanned PDFs
Template content repeated across pages

Tables should ideally be converted into Markdown or key: value rows so that LLMs can interpret them correctly.

3. Semantic Chunking Strategy

Chunking is the most important factor affecting retrieval performance.

Chunking Goals

A good chunk should be:

Self-contained: understandable without large context
Traceable: linked back to its original location
Searchable: not too long or too fragmented

Recommended Hierarchical Approach

Structure-aware splitting (Preferred)

Split by document headings first
Merge paragraphs inside each section

Recursive splitting

Paragraph → Line → Sentence → Token boundary

Semantic boundary detection (Advanced)

Use topic shifts or embeddings to find natural breaks

Chunk Size and Overlap

Typical engineering defaults:

FAQ or policies: 200–450 tokens, overlap 30–80
Technical docs: 300–700 tokens, overlap 50–120
Long reports or research: 400–900 tokens, overlap 80–150

Overlap prevents losing context when answers span boundaries.

Parent–Child Chunk Design

A highly effective production pattern:

Child chunks: smaller pieces used for vector retrieval
Parent chunks: larger contextual sections passed to the LLM

Workflow:

Retrieve child chunks
Expand to parent chunks
Send parents to the model for generation

This significantly improves answer coherence.

4. Metadata Schema Design

Metadata is not optional. It enables filtering, access control, versioning, and debugging.

Minimum Viable Metadata

Each chunk should include:

doc_id
chunk_id
title
section_path
source_uri
page_start / page_end
created_at / updated_at
language
hash (content checksum)
tenant/project
acl (access control)

Enhanced Metadata (Recommended)

doc_version
effective_date
tags
entities (product names, systems, people)
content_type (faq, guide, spec, code)
parent_id
quality_flags

These fields enable advanced filtering and evaluation later.

Stable Chunk ID Strategy

Chunk IDs must remain stable across re-processing.

Example:

chunk_id = sha1(doc_id + doc_version + section_path + chunk_index + text_hash_prefix)

Only changed content should produce new IDs.

5. Batch Embedding Architecture

Embedding pipelines must be idempotent, incremental, and observable.

Suggested Data Model

documents

doc_id, version, uri, title, checksum

chunks

chunk_id, doc_id, text, metadata_json, hash

embeddings

chunk_id, model_name, dim, vector, text_hash

embedding_jobs

job_id, status, created_at

embedding_job_items

job_id, chunk_id, retry_count, error

Key Engineering Practices

Only embed chunks whose hash changed
Process in batches (32–256 chunks or token-limited)
Control concurrency to avoid rate limits
Implement exponential retry
Monitor throughput and failure rates

Supporting Multiple Models

Embedding records must include:

model_name
model_version
vector_dimension
normalized_flag

Allow multiple embeddings per chunk for gradual migration between models.

6. Retrieval Design: Hybrid Search and Reranking

Vector search alone is rarely sufficient.

Recommended Retrieval Pipeline

Hybrid retrieval:

Vector similarity
BM25 keyword search
1. Metadata filtering:
tenant/project
ACL
document type
1. Reranking:
Lightweight reranker or LLM scoring
1. Source citation:
Return source_uri + section_path + page

Hybrid search dramatically improves precision for exact terms and technical names.

7. Chunk Quality Monitoring

Many production issues are caused by poor chunks rather than model failures.

Common anti-patterns:

Chunks shorter than 50 tokens
Chunks longer than 1200 tokens
Repeated template content
Missing title context
Duplicate sections occupying top results

Add a simple rule engine that tags chunks with quality_flags.

8. End-to-End Processing Pipeline

A practical implementation roadmap:

Ingest documents and generate doc_id
Extract structured blocks
Remove noise and duplicates
Build parent chunks from sections
Generate child chunks with overlap
Attach metadata and hashes
Upsert into chunks table
Create embedding jobs for new/changed chunks
Batch embedding with workers
Build vector and keyword indexes
Run evaluation queries (golden dataset)

Final Thoughts

Designing an LLM knowledge base is less about models and more about information architecture.

The biggest improvements usually come from:

Better chunk structure
Strong metadata design
Incremental embedding pipelines
Hybrid retrieval strategies

If you treat your knowledge base like a search system rather than a document dump, both retrieval accuracy and generation quality improve significantly.

How to Choose the Right Model for Your AI Application(A Practical Engineering Guide)

yang yaru — Fri, 30 Jan 2026 07:10:13 +0000

Choosing an AI model is not about finding the strongest model.

It is about finding the most suitable model for your specific scenario.

Many developers waste money, suffer from slow responses, or over-engineer their systems simply because they start with the wrong assumption:

Bigger model = better product.

In reality:

There is no best model.

Only the best model for your use case.

This article provides a practical, engineering-oriented framework for selecting models in real AI applications.

1. The Four Core Dimensions of Model Selection

Every model choice is a trade-off between:

Capability (reasoning, language quality)
Latency (response speed)
Cost (per-token price)
Controllability (structured output reliability)

You cannot maximize all four simultaneously.

Good model selection means choosing the right balance.

2. First: Classify Your Application

Before choosing a model, identify which category your feature belongs to.

A. Generative Tasks

Article writing
Copywriting
Story generation

B. Q&A Tasks

Customer support
Knowledge base Q&A
FAQ bots

C. Structured Output Tasks

JSON generation
Tables
Fixed schemas

D. Strong Reasoning Tasks

Multi-step logical reasoning
Complex code analysis
Data reasoning

E. Embedding Tasks

Vectorization
Semantic search
Similarity matching

3. Capability Requirements by Category

Task Type	Needs Top-Tier Model?
Text generation	No
Customer service	No
Structured JSON	No
Strong reasoning	Yes
Code generation	Medium–High
Embedding	No (use embedding model)

Most applications do not need frontier models.

4. Industry-Proven Three-Tier Model Strategy

Mature systems rarely use a single model.

Instead:

Tier 1 — Cheap Model

Handles ~70% of traffic

Tier 2 — Mid-Level Model

Handles moderately complex requests

Tier 3 — Strong Model

Used only for hard cases

Architecture:

User Request
    |
Rule / Router
    |
Simple → Cheap Model
Medium → Mid Model
Complex → Strong Model

This drastically reduces cost while maintaining quality.

5. Embeddings Are a Separate Track

Never use chat models to create vectors.

Use a dedicated embedding model.

Pipeline:

Text → Embedding Model → Vector → Vector DB

Advantages:

Much cheaper
Better semantic consistency
Faster

6. Practical Model Selection Workflow

Step 1 — Define I/O Contract

Example:

Input: ...
Output: ...

Step 2 — Start with Mid-Level Model

Get the system working first.

Step 3 — Measure

Output quality
Latency
Cost per request

Step 4 — Upgrade Only If Needed

Only upgrade when:

Frequent hallucinations
Logical breakdown
Prompt already optimized

7. A Useful Rule of Thumb

If prompt engineering can fix it,

do NOT switch models.

Most failures are caused by:

Weak prompts
Missing constraints
Unclear output formats

Not weak models.

8. Temperature Guidelines

Scenario	Temperature
Article writing	0.7
Stable output	0.2
JSON generation	0.1
Creative writing	0.8

For article generation:

0.6 – 0.7

9. Three Common Beginner Mistakes

Mistake 1

Using the most expensive model by default

Mistake 2

No caching

Same prompt → regenerate → waste money

Mistake 3

No retry mechanism

Should have:

Fail → Retry once → Log

10. Backend-Oriented Architecture

Controller
|
Service
|
Prompt Builder
|
Model Router
 |       |
Cheap  Strong
|
AI Provider

11. When Should You Upgrade to Stronger Models?

Only when:

content frequently go off-topic
Logical structure collapses
Prompts already well designed

12. Final Takeaway

Start with mid-level models
Use embedding models for vectors
Let prompts and parameters do most of the work
Add model tiers later

Good architecture beats expensive models.

*If you found this useful, feel free to adapt it into your own system design or produce

How to Write a Developer-Level Prompt: A Practical Guide

yang yaru — Fri, 30 Jan 2026 06:46:38 +0000

Large Language Models (LLMs) do not work well with vague instructions.

If you want consistent, controllable, and production-grade behavior, you must move beyond simple “user prompts” and start designing Developer-level prompts.

This article explains:

What a Developer Prompt is
How it differs from other prompt types
A practical structure you can reuse
Real-world examples

1. Prompt Layers: System, Developer, and User

Modern LLM applications usually operate with three layers of instructions:

Layer	Purpose
System Prompt	Defines global behavior of the model
Developer Prompt	Defines product-level rules
User Prompt	Defines per-request task

Think of it like this:

System Prompt → Constitution
Developer Prompt → Job description
User Prompt → Daily task

Your focus as a builder is primarily the Developer Prompt.

2. What Is a Developer Prompt?

A Developer Prompt is a persistent instruction set that defines:

Who the model is
What its main responsibility is
What rules it must follow
What it is allowed and not allowed to do
How it must format output

It is not about what the user asks.

It is about how the system behaves.

A Developer Prompt turns a general AI into a specialized product component.

3. Why Developer Prompts Matter

Without a Developer Prompt:

The model improvises
Output style changes
Hallucinations increase
Formatting becomes inconsistent

With a Developer Prompt:

Behavior becomes stable
Boundaries are enforced
Outputs are predictable
Product quality improves

This is the difference between experimentation and engineering.

4. Standard Structure of a Developer Prompt

A strong Developer Prompt usually contains five sections:

Role
Goal
Knowledge Scope
Behavior Rules
Output Format

Generic Template


text
You are a {role}.

Your primary goal is to {goal}.

You must follow these rules:
1. ...
2. ...
3. ...

You can only use the following knowledge sources:
- ...

If information is missing, respond with:
"I don't know based on the provided information."

Output format:
- ...

Retrieval Technique Series-6.A Discourse on Design in High-Performance Retrieval Systems

yang yaru — Mon, 28 Jul 2025 03:59:07 +0000

In an era defined by data, the ability to retrieve information quickly and accurately is no longer a luxury—it's a fundamental requirement. From the search engines that power our curiosity to the e-commerce platforms that recommend our next purchase, high-performance retrieval systems are the invisible engines of our digital world. But what does it take to build a system that can sift through petabytes of data in milliseconds?

The answer lies in a set of core architectural philosophies. These are not just technical tricks but foundational principles that ensure scalability, speed, and stability. Let's explore four of the most critical design ideas that underpin modern, high-performance retrieval systems.

Principle 1: Decoupling the Index from the Data

At its core, a retrieval system works much like a library. To find a book, you don't scan every shelf; you first consult the card catalog—the index. This analogy highlights our first principle: the separation of the index from the actual data.

The Index: This is a lightweight, highly optimized data structure that maps search terms to the locations of the documents that contain them. Its primary job is to enable fast lookups.
The Data: This is the full, original content of the documents themselves (e.g., web pages, product descriptions, user profiles).
By decoupling these two, we gain immense benefits. The index, being much smaller than the raw data, can often be stored in faster media like SSDs or even loaded entirely into RAM. This allows the system to perform the initial "where is it?" query at lightning speed. Once the relevant document locations are identified, the system can then fetch the full data from slower, more cost-effective storage like HDDs or cloud object storage. This separation also allows for independent scaling—we can add more resources to our indexing service without altering our primary data storage, and vice-versa.

Principle 2: Minimizing Disk I/O

The single greatest bottleneck in most data-intensive systems is disk input/output (I/O). Accessing data from a spinning disk or even an SSD is orders of magnitude slower than accessing it from memory (RAM). Therefore, a relentless focus on minimizing disk I/O is paramount for performance.

Several techniques are employed to achieve this:

Aggressive Caching: Frequently accessed index blocks and popular documents are kept in memory caches. The system checks the cache first, avoiding a trip to the disk entirely if the data is present.
Data Compression: Compressing data reduces its size on disk, meaning fewer bytes need to be read for any given query. This trades a small amount of CPU time (for decompression) for a significant reduction in I/O latency.
Sequential Access Patterns: Random disk access is notoriously slow. Modern systems often use data structures like Log-Structured Merge-Trees (LSM-Trees) that convert random writes into sequential appends, which are much faster. Similarly, structuring data to be read sequentially can dramatically improve throughput.

Principle 3: Implementing Read-Write Separation

The workload of a retrieval system is typically asymmetrical. There are often far more read operations (users searching) than write operations (new data being indexed). The requirements for these two operations are also different: reads must be ultra-fast, while writes need to be durable and consistent.

This leads to the principle of Read-Write Separation, also known as Command Query Responsibility Segregation (CQRS). In this model:

The Write Path handles data ingestion, updates, and indexing. It can be optimized for throughput and data integrity.
The Read Path handles user queries. It operates on a separate, read-optimized copy of the data.
This separation allows us to scale each path independently. If search traffic spikes, we can simply spin up more read replicas without impacting the indexing process. This architecture also improves system resilience; a failure or slowdown in the write system won't bring down the search functionality for users.

Principle 4: Adopting a Layered or Tiered Architecture

It's not feasible to apply complex, computationally expensive logic to every single document in a massive dataset for every query. To solve this, high-performance systems adopt a multi-stage, layered processing approach, creating a "funnel" that progressively refines the results.

A typical search funnel might look like this:

Recall Layer (or Matching): This first layer quickly scans the index to retrieve a large set of potentially relevant candidates—perhaps thousands or tens of thousands of documents. The goal here is high recall (don't miss anything important) and speed. The scoring logic is simple and fast.
Ranking Layer (or Scoring): The candidate set from the first layer is passed to a second, more sophisticated layer. Here, more complex ranking models, often powered by machine learning, are used to score and order the documents more accurately. Since this is only applied to a smaller subset of documents, the computational cost is manageable.
Re-ranking Layer (or Blending): A final layer may take the top-ranked results and apply further business logic, personalization, or diversity rules to produce the final list presented to the user.
This tiered approach ensures that the most expensive computations are reserved for only the most promising candidates, striking a perfect balance between accuracy and performance.

Conclusion

Building a high-performance retrieval system is a masterclass in managing trade-offs. By embracing these four fundamental design philosophies—separating index and data, minimizing disk I/O, segregating reads and writes, and processing in layers—engineers can construct systems that are not only incredibly fast but also scalable, resilient, and efficient. These principles, working in concert, form the bedrock of the seamless information access we rely on every day.

Retrieval Technique Series-5.How Large-Scale Search Systems Accelerate Retrieval with Distributed Technology

yang yaru — Wed, 11 Jun 2025 06:14:37 +0000

In the era of big data, search systems must handle massive volumes of information and user queries efficiently. Traditional single-server architectures quickly become bottlenecks as data and traffic grow. To address this, large-scale search systems leverage distributed technology and index sharding to accelerate retrieval and ensure scalability. In this post, we’ll explore how these techniques work, their advantages, and the challenges they bring.

What are the advantages of distributed technology?

Distributed technology is the decomposition of large tasks into multiple subtasks, using multiple servers to jointly undertake tasks, which greatly improves the overall system's service capabilities compared to single machine systems.

What does a simple distributed structure look like?

A complete distributed system will have complex service management mechanisms, including service registration, service discovery, load balancing, traffic control, remote calling, and redundant backup. Here, let's first set aside the implementation details of distributed systems and return to their essence, which is to start with "letting multiple servers share the task" and see how a simple distributed retrieval system works. Firstly, we need a server that receives requests, but it does not perform specific query tasks. It is only responsible for task distribution, which we call a distribution server. Multiple index servers are the ones that actually perform the retrieval task, each of which stores a complete inverted index and is capable of completing the retrieval task. When the distribution server receives a request, it will send the current query request to a relatively idle index server for querying based on the load balancing mechanism. The specific retrieval work is independently completed by the indexing server and the results are returned.

                +----------------------+
                |   Dispatcher Server  |        The dispatcher server receives the request and forwards it to a specific index server, 
                +----------------------+          according to the load balancing mechanism.
                        |
             -------------------------------------------------------
             |                      |                              |
+---------------------+  +---------------------+   ...   +---------------------+
|  Full Index Data    |  |  Full Index Data    |         |  Full Index Data    |
|  Index Server 1     |  |  Index Server 2     |         |  Index Server n     |
+---------------------+  +---------------------+         +---------------------+

The index server processes the request and returns the search result.

What is Index Sharding?

Index sharding is the process of splitting a large search index into smaller, manageable pieces (shards) that can be distributed across multiple servers. Each shard contains a subset of the data, allowing the system to process queries in parallel and balance the load.

How Index Sharding Works

There are two main strategies for sharding:

1. Document-Based Sharding

Each shard contains a subset of documents.

+---------------------------------------------------------------+
|                        All Documents                          |
|  +-------------------+      ...      +-------------------+    |
|  | Document Set 1    |               | Document Set n    |    |
|  +-------------------+               +-------------------+    |
+---------------------------------------------------------------+
           |                                      |
           v                                      v
   Generate Index Shard 1                 Generate Index Shard n
           |                                      |
           v                                      v
+--------------------------------+   +--------------------------------+
| word 1 ->doc 1->doc 2...doc 19 |   |word 1 ->doc 23->doc 25...doc 41|
| word 2 ->doc 3->doc 4...doc 14 |   |word 2 ->doc 12->doc 16...doc 30|
|   ...                          |   |    ...                         |
| word n ->doc 1->doc 3...doc 15 |   |word n ->doc 21->doc 24...doc 35|
+--------------------------------+   +--------------------------------+

Note: The posting list in a single shard is incomplete.

When a query arrives:

                        +----------------------+
                        |  Dispatcher Server   |
                        +----------------------+
                                 |
        -------------------------------------------------
        |                       |                      |
+-------------------+  +-------------------+   ...   +-------------------+
|  Index Shard 1    |  |  Index Shard 2    |         |  Index Shard 3    |
|  Index Server 1   |  |  Index Server 2   |         |  Index Server n   |
+-------------------+  +-------------------+         +-------------------+

1. The dispatcher server receives the query request and sends the request to all index servers with different index shards.

2. Each index server searches its own loaded index shard and returns the search results to the dispatcher server.

3. The dispatcher server merges all returned results and returns the final result.

Advantages:

Accelerates search efficiency.
Evenly distributes queries and balances server load.
Makes it easier to update the index by adding or modifying documents.
Management Challenges:
Requires careful balancing of query loads across shards.

2. Keyword-Based Sharding

Each shard is responsible for a subset of keywords.

+-------------------+
|   All Documents   |
+-------------------+
          |
          v
   Generate Complete Inverted Index
          |
          v

 Incomplete |
 dictionary | Complete posting list in shard
 in shard   |                 
+--------------------------------------------+
| word 1  -----> doc 1 -> doc 2 ... doc 41   |
| word 2  -----> doc 3 -> doc 4 ... doc 30   |
|   ...                                      |
| word 20 -----> doc 1 -> doc 3 ... doc 35   |
+--------------------------------------------+
+--------------------------------------------+
| word 41 -----> doc 1 -> doc 5 ... doc 41   |
| word 42 -----> doc 2 -> doc 6 ... doc 30   |
|   ...                                      |
| word n  -----> doc 3 -> doc 4 ... doc 35   |
+--------------------------------------------+

When a query arrives:

Query: key1 + key2

                        +----------------------+
                        |  Dispatcher Server   |
                        +----------------------+
                          /               \
                         /                 \
        +---------------------+   +---------------------+   ...   +---------------------+
        |  Full Index Data    |   |  Full Index Data    |         |  Full Index Data    |
        |  Index Server 1     |   |  Index Server 2     |         |  Index Server n     |
        +---------------------+   +---------------------+         +---------------------+

 (Return posting list for key1)   (Return posting list for key2)

1. The dispatcher server receives the request and, based on the query, dispatches it to one or more index servers.
2. Index servers process the request and return the complete search results.
3. The dispatcher server merges the results and returns the final result.

Advantages:

Reduces duplicate computation.
Management Challenges:
If queries contain many keywords not evenly distributed, performance may drop.
High-frequency keywords can overload specific shards.

Conclusion

Index sharding and distributed technology are essential for building scalable, high-performance search systems. By splitting the index and distributing the workload, these systems can handle massive data volumes and high query rates efficiently. However, careful planning and management are required to avoid bottlenecks and ensure balanced performance.

Retrieval Technique Series-4.How Search Engines Generate Indexes for Trillions of Websites?

yang yaru — Fri, 06 Jun 2025 03:21:41 +0000

How to Generate Inverted Indexes Larger Than Memory Capacity

Review of In-Memory Inverted Index Generation

For small document collections that fit in memory, we generate hash-based inverted indexes as follows:

Assign unique, sequential IDs to each document
Scan each document sequentially, generating tuples
Store these tuples in an inverted table with the keyword as the key (position information can be omitted if unnecessary)
Repeat until all documents are processed

This creates an in-memory inverted index.

[Document Analysis Process]

+------------+                +---------------------------------------------+
| word 1     |                | Keywords | DocID  | Position | Posting List |
| word 2     |    Analyze     +----------+--------+----------+--------------+
| word 1     | ------------>  | Word 1   | Doc 2  | [1,3]    |              |
| doc 2      |                | Word 2   | Doc 2  | [2]      |              |
+------------+                +---------------------------------------------+
                                    |
                                    |
                                    v
If key exists in posting list, directly insert node:
word 1 -------> doc 1 -------> doc 2

If key doesn't exist in index, insert key and create posting list:
word 2 -------> doc 2

Handling Large-Scale Document Collections

For large-scale document collections, we need a different approach. Can we split them into smaller collections to build inverted indexes in memory? How do we combine these smaller indexes into a complete large-scale inverted index stored on disk?

Industrial-grade inverted indexes are more complex than what we've studied. For example, if a document contains "Geek Time" , not only might these four characters be added to the dictionary as keywords, but "Geek" , "Time" , and "Geek Time" might also be added. This results in a very large dictionary, potentially containing millions or tens of millions of terms.

Since words have different lengths and storage requirements, we assign a number to each word in the dictionary and store the corresponding string. In the posting list, we record not just document IDs but also position information, frequency, and other details. Each node in the posting list is a complex structure identified by document ID.

Dictionary                          Posting List
+--------------+      +-----------------------------------------------+
| word ID 1    | ---> | [doc 1, pos, tf,...] -> [doc 2, pos, tf,...]   -> ... -> [doc 19, pos, tf,...] |
| string       |      +-----------------------------------------------+
+--------------+
| word ID 2    |      +-----------------------------------------------+
| string       | ---> | [doc 19, pos, tf,...] -> [doc 21, pos, tf,...]  -> ... -> [doc 38, pos, tf,...] |
+--------------+      +-----------------------------------------------+
       :
       :
| word ID n    |      +-----------------------------------------------+
| string       | ---> | [doc 7, pos, tf,...] -> [doc 11, pos, tf,...]  -> ... -> [doc 43, pos, tf,...] |
+--------------+      +-----------------------------------------------+

Generating Industrial-Grade Inverted Indexes

Here's how we generate a large-scale inverted index:

Divide large document collections into multiple smaller sets
Generate in-memory inverted indexes for each small document set
Store these in-memory indexes on disk as temporary inverted files:
- Sort the document lists by keyword string size
- Write each keyword and its corresponding document list as a record to the temporary file
- Records in the temporary file are ordered, and we don't need to store keyword IDs (as they're only locally unique)

Index Table (In Memory)                                                                                  Temporary Files (Disk)

+----------------+                                                                                    +---------------------------+
| word ID 1      |                                                                                    | string 1 | posting list 1 |
| string         |---> doc 1 --> doc 2 --> ... --> doc 19                                             +---------------------------+
+----------------+                                                                                    | string 2 | posting list 2 |
                                                                                                      +---------------------------+
+----------------+                                                                                    | string 3 | posting list 3 |
| word ID 2      |                                                  Write to temporary files          +---------------------------+
| string         |---> doc 19 --> doc 20 --> ... --> doc 34         by key value order                |            ...            |
+----------------+                                                ----------------->                  +---------------------------+
                                                                                                      | string n | posting list n |
        ...                                                                                           +---------------------------+                                            
+----------------+                                             
| word ID n      |                                             
| string         |---> doc 1 --> doc 20 --> ... --> doc 53     
+----------------+

Process each batch of small document collections to generate corresponding temporary files
Merge multiple temporary files using multi-way merge:
- Extract the current record's keyword from each temporary file
- For identical keywords, read out and merge the corresponding posting lists
- If the posting list fits in memory, merge it there and write the result to the final inverted file
- If the posting list is too large, process it in segments
- Assign a globally unique ID to each keyword after merging

Temporary File 1               Temporary File 2           Temporary File 3
+-----------------------+  +-----------------------+  +-----------------------+
| string 1|posting list |  | string 1|posting list |  | string 3|posting list |
+-----------------------+  +-----------------------+  +-----------------------+
| string 2|posting list |  | string 3|posting list |  | string 4|posting list |
+-----------------------+  +-----------------------+  +-----------------------+
| string 3|posting list |  | string 4|posting list |  | string 5|posting list |
+-----------------------+  +-----------------------+  +-----------------------+
|          ...          |  |           ...         |  |           ...         |
+-----------------------+  +-----------------------+  +-----------------------+
| string n|posting list |  | string i|posting list |  | string j|posting list |
+-----------------------+  +-----------------------+  +-----------------------+
        |                         |                         |
        +-------------------------+-------------------------+
                                  |
                                  v
                     +-------------------------------------+
                     | word ID 1 | string 1 | posting list |
                     +-------------------------------------+
                     | word ID 2 | string 2 | posting list |
                     +-------------------------------------+
                     | word ID 3 | string 3 | posting list |
                     +-------------------------------------+
                     |                  ...                |
                     +-------------------------------------+
                     | word ID n | string n | posting list |
                     +-------------------------------------+
                            Complete Sorted File

This approach is similar to the Map-Reduce distributed computing paradigm, making it easy to implement on multiple machines to significantly improve efficiency.

Using Disk-Based Inverted Files for Retrieval

When using large-scale inverted files for retrieval, the core principle is to load as much data as possible into memory since memory retrieval is much faster than disk access.

An inverted index consists of two parts: the dictionary (key collection) and the document lists. In many applications, the dictionary is small enough to load into memory using a hash table.

Hash Table (In Memory)                 |        Inverted Index File (Disk)
                                       |
+----------------+                     |        +----------------------------+
| word ID 1      |                     |        | word ID 1 | posting list 1 |
| string         |---> pos ------------|------> +----------------------------+
+----------------+                     |        | word ID 2 | posting list 2 |
                                       |        +----------------------------+
+----------------+                     |        | word ID 3 | posting list 3 |
| word ID 2      |                     |        +----------------------------+
| string         |---> pos ------------|------> |              ...           |
+----------------+                     |        +----------------------------+
        :                              |        | word ID n | posting list n |
        :                              |        +----------------------------+
+----------------+                     |
| word ID n      |                     |
| string         |---> pos ------------|------->
+----------------+                     |
                                       |

When a query occurs:

Search the in-memory hash table to find the corresponding key
Read the posting list associated with that key from disk into memory for processing

If the dictionary is too large for memory, we can use a B+ tree to search it, treating it as an ordered sequence of keys.

The retrieval process can be summarized in two steps:

Use a B+ tree or similar technology to query the keyword in the dictionary
Read out the posting list for that keyword and process it in memory

B-tree/B+ tree (In Memory)      Dictionary File (Disk)           Inverted Index File (Disk)

       •                    +----------------------------+     +----------------------------+
      / \                   | word ID 1 | string 1 | pos |---->| word ID 1 | posting list 1 |
     •   •--------------->  +----------------------------+     +----------------------------+
    /                       | word ID 2 | string 2 | pos |---->| word ID 2 | posting list 2 |
   •                        +----------------------------+     +----------------------------+
  / \                       | word ID 3 | string 3 | pos |---->| word ID 3 | posting list 3 |
 •   •------------------->  +----------------------------+     +----------------------------+
                            |              ...           |     |             ...            |
                            +----------------------------+     +----------------------------+
                            | word ID n | string n | pos |---->| word ID n | posting list n |
                            +----------------------------+     +----------------------------+

Handling Very Large Posting Lists

For extremely popular keywords that might appear in hundreds of millions of pages, the posting lists may be too large to load into memory. In such cases:

Create B+ tree-like indexes for oversized posting lists
Load only useful data blocks into memory to reduce disk access
For shorter posting lists, load them directly into memory
Use caching techniques like LRU to keep frequently used posting lists in memory

Key Takeaways

We can generate trillion-level inverted indexes using multi-file merging and implement retrieval through dictionary and document list queries.
Two fundamental design principles:
- Load as much data as possible into memory (index compression is crucial here)
- Break large data collections into smaller sets (the core idea of distributed systems)