
Julien L for WiScale

One query is never enough: why top RAG systems search three times

LangChain has MultiQueryRetriever. LlamaIndex has SubQuestionQueryEngine. Every serious RAG framework decomposes user questions into multiple search queries before hitting the vector database.

Why? Because a single embedding compresses your entire question into one point in vector space. And one point can only land in one neighborhood.

Take this question: "How do I fix a slow database connection in my Flask app?"

Three concepts, three clusters in embedding space:

  1. Database connections - pooling, timeouts, driver configuration
  2. Flask-specific patterns - SQLAlchemy setup, app factory patterns, teardown handling
  3. Performance diagnostics - profiling, query logging, bottleneck identification

Embed the full question, and the resulting vector lands in the "Flask + database" neighborhood. The performance diagnostics cluster is invisible. You get back five results about Flask and database setup, zero about profiling or bottleneck identification.

This is not about relationships between entities (that is a graph problem). This is about semantic coverage: one embedding captures one perspective, but most real questions have two or three.

The fix is simple: search three times, fuse the results. And you can do it in one API call.

Proof: one query vs. three queries

We will build a 12-document technical corpus and compare single-query retrieval against multi-query decomposition.

pip install velesdb sentence-transformers
import velesdb
from sentence_transformers import SentenceTransformer

db = velesdb.Database("./multi_query_demo")
model = SentenceTransformer("all-MiniLM-L6-v2")

collection = db.get_or_create_collection("tech_docs", dimension=384)

Here is our documentation corpus - 12 articles across different domains:

docs = [
    {"id": 1, "title": "Database connection pooling with SQLAlchemy",
     "content": "Connection pooling reuses existing database connections instead of creating new ones. Configure pool_size and max_overflow in SQLAlchemy create_engine to control the pool.",
     "domain": "database"},

    {"id": 2, "title": "Flask app factory pattern for database setup",
     "content": "Use the Flask app factory pattern to initialize SQLAlchemy. Call db.init_app(app) inside create_app() to avoid circular imports and enable testing with different configs.",
     "domain": "flask"},

    {"id": 3, "title": "Profiling slow SQL queries with EXPLAIN ANALYZE",
     "content": "Run EXPLAIN ANALYZE before your query to see the execution plan. Look for sequential scans on large tables, missing indexes, and high row estimates.",
     "domain": "performance"},

    {"id": 4, "title": "Configuring connection timeouts in PostgreSQL",
     "content": "Set statement_timeout to kill long queries. Set idle_in_transaction_session_timeout to reclaim idle connections. Both prevent resource exhaustion under load.",
     "domain": "database"},

    {"id": 5, "title": "Flask-SQLAlchemy session management pitfalls",
     "content": "Always call db.session.remove() at the end of requests. Use scoped_session to avoid sharing sessions across threads. Forgetting teardown causes connection leaks.",
     "domain": "flask"},

    {"id": 6, "title": "Identifying bottlenecks with Python cProfile",
     "content": "Wrap your endpoint with cProfile to find where time is spent. Sort by cumulative time. Database calls often dominate - look for many small queries (N+1 problem).",
     "domain": "performance"},

    {"id": 7, "title": "Async database drivers: asyncpg vs psycopg3",
     "content": "asyncpg delivers 2-5x throughput over synchronous psycopg2 for concurrent workloads. psycopg3 supports both sync and async modes for gradual migration.",
     "domain": "database"},

    {"id": 8, "title": "Flask request lifecycle and database teardown",
     "content": "Flask fires teardown_appcontext after each request. Register a handler to close database sessions. Without it, connections pile up until the pool is exhausted.",
     "domain": "flask"},

    {"id": 9, "title": "Load testing database connections with Locust",
     "content": "Use Locust to simulate concurrent users hitting your API. Monitor connection pool saturation and response times. A sudden latency spike usually means pool exhaustion.",
     "domain": "performance"},

    {"id": 10, "title": "JWT authentication middleware for Flask",
     "content": "Implement JWT validation as a Flask before_request hook. Decode the token, verify the signature, and attach the user to flask.g for downstream handlers.",
     "domain": "security"},

    {"id": 11, "title": "Deploying Flask apps with Gunicorn and Nginx",
     "content": "Run Gunicorn with multiple workers behind Nginx. Each worker gets its own connection pool. Set worker count to (2 * CPU cores) + 1 for CPU-bound apps.",
     "domain": "deployment"},

    {"id": 12, "title": "Database migration strategies with Alembic",
     "content": "Use Alembic for schema migrations. Always test migrations on a staging database first. Use --sql mode to review generated SQL before applying to production.",
     "domain": "database"},
]

points = []
for doc in docs:
    embedding = model.encode(doc["content"]).tolist()
    points.append({
        "id": doc["id"],
        "vector": embedding,
        "payload": {
            "title": doc["title"],
            "content": doc["content"],
            "domain": doc["domain"],
        }
    })
collection.upsert(points)

Now, the experiment. Same user question, three different search angles:

user_question = "How do I fix a slow database connection in my Flask app?"

# Single query - the naive approach
single_vec = model.encode(user_question).tolist()
single_results = collection.search(vector=single_vec, top_k=5)

print("--- Single query results ---")
for r in single_results:
    p = r["payload"]
    print(f"  [{r['score']:.3f}] {p['title']} ({p['domain']})")

Run this yourself. You will see 3 Flask articles and 2 database articles. Performance diagnostics like "Profiling slow SQL queries" and "Identifying bottlenecks with cProfile" are completely absent from the top 5.

Now, let's decompose the question into three search intents:

# Three angles on the same question
q1 = model.encode("database connection pool configuration timeout").tolist()
q2 = model.encode("Flask SQLAlchemy session management setup").tolist()
q3 = model.encode("profiling slow queries performance bottleneck").tolist()

print("\n--- Query 1: database connections ---")
for r in collection.search(vector=q1, top_k=3):
    print(f"  [{r['score']:.3f}] {r['payload']['title']}")

print("\n--- Query 2: Flask patterns ---")
for r in collection.search(vector=q2, top_k=3):
    print(f"  [{r['score']:.3f}] {r['payload']['title']}")

print("\n--- Query 3: performance diagnostics ---")
for r in collection.search(vector=q3, top_k=3):
    print(f"  [{r['score']:.3f}] {r['payload']['title']}")

Here is what actually happens:

--- Query 1: database connections ---
  [0.659] Database connection pooling with SQLAlchemy
  [0.587] Flask request lifecycle and database teardown
  [0.575] Configuring connection timeouts in PostgreSQL

--- Query 2: Flask patterns ---
  [0.601] Flask app factory pattern for database setup
  [0.511] Database connection pooling with SQLAlchemy
  [0.499] Flask request lifecycle and database teardown

--- Query 3: performance diagnostics ---
  [0.655] Profiling slow SQL queries with EXPLAIN ANALYZE
  [0.590] Identifying bottlenecks with Python cProfile
  [0.459] Configuring connection timeouts in PostgreSQL

Query 3 surfaces "Profiling slow SQL queries" and "Identifying bottlenecks with cProfile" at the top - two articles that the single query missed entirely. Across all three queries, we find 6 unique documents instead of 5, covering domains the single query could not reach.

The problem: you now have 9 results with duplicates, no unified ranking, and custom code to merge them. Or you use fusion.

Multi-query fusion: one API call, merged results

VelesDB has multi-query fusion built in. You pass multiple vectors, pick a fusion strategy, and get a single ranked result set:

results = collection.multi_query_search(
    vectors=[q1, q2, q3],
    top_k=5,
    fusion=velesdb.FusionStrategy.rrf()
)

print("\n--- Fused results (RRF) ---")
for r in results:
    p = r.get("payload", r.get("bindings", {}))
    print(f"  {p['title']} ({p['domain']})")

Reciprocal Rank Fusion (RRF) converts positions into scores: a document ranked #1 in any query list earns a large contribution, one ranked #5 earns less. Documents that appear in multiple query lists get boosted. The math is simple: score = sum(1 / (k + rank)) over each query where the document appears, where k is a smoothing constant (commonly 60).
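The formula can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not VelesDB's internal code - the engine's k value and tie-breaking may differ, so its fused order can deviate from this toy's:

```python
# Minimal Reciprocal Rank Fusion over per-query ranked ID lists.
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_k=5):
    """Fuse ranked ID lists: score = sum(1 / (k + rank)) per appearance."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Per-query top-3 doc IDs from the three searches above
q1_ids = [1, 8, 4]   # database connections
q2_ids = [2, 1, 8]   # Flask patterns
q3_ids = [3, 6, 4]   # performance diagnostics

# Docs 1, 8 and 4 each appear in two lists, so they top the fused ranking
print(rrf_fuse([q1_ids, q2_ids, q3_ids]))
```

Note how a document never ranked #1 anywhere can still win overall, simply by showing up in several lists.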

Here is the actual output:

--- Fused results (RRF) ---
  Flask request lifecycle and database teardown (flask)
  Database connection pooling with SQLAlchemy (database)
  Configuring connection timeouts in PostgreSQL (database)
  Flask-SQLAlchemy session management pitfalls (flask)
  Load testing database connections with Locust (performance)

Three domains represented: flask, database, and performance. "Load testing database connections with Locust" never made the single-query top five, but fusion pulled it up because it ranks consistently across the three angles. One API call, no glue code.

Fusion strategies: pick the right one

RRF is the safe default, but VelesDB supports five strategies. Four of them work with any number of query vectors:

# These 4 strategies accept N vectors
strategies = {
    "rrf": velesdb.FusionStrategy.rrf(),
    "average": velesdb.FusionStrategy.average(),
    "maximum": velesdb.FusionStrategy.maximum(),
    "weighted": velesdb.FusionStrategy.weighted(0.6, 0.3, 0.1),
}

for name, strategy in strategies.items():
    results = collection.multi_query_search(
        vectors=[q1, q2, q3],
        top_k=3,
        fusion=strategy,
    )
    titles = [r.get("payload", r.get("bindings", {}))["title"][:50] for r in results]
    print(f"  {name:16s} => {titles}")

Each strategy produces a different ranking:

rrf              => ["Flask request lifecycle and database teardown",
                     "Database connection pooling with SQLAlchemy",
                     "Configuring connection timeouts in PostgreSQL"]

average          => ["Database connection pooling with SQLAlchemy",
                     "Flask request lifecycle and database teardown",
                     "Configuring connection timeouts in PostgreSQL"]

maximum          => ["Flask app factory pattern for database setup",
                     "Database connection pooling with SQLAlchemy",
                     "Flask request lifecycle and database teardown"]

weighted         => ["Database connection pooling with SQLAlchemy",
                     "Flask request lifecycle and database teardown",
                     "Flask app factory pattern for database setup"]

The fifth strategy, relative_score, is designed for exactly 2 branches (e.g., dense + sparse retrieval). Use it when you have two complementary search signals:

# relative_score: exactly 2 vectors (dense + sparse style)
results = collection.multi_query_search(
    vectors=[q1, q2],
    top_k=3,
    fusion=velesdb.FusionStrategy.relative_score(0.7, 0.3),
)

When to use each:

RRF - Your default. Position-based, ignores raw scores, so it stays robust when the queries produce scores on different scales. Best for: most RAG pipelines.

Average - Averages the similarity scores across queries. Favors documents that are moderately relevant to everything. Best for: broad recall when all intents matter equally.

Maximum - Takes the highest score from any query. Favors documents that are extremely relevant to at least one intent. Best for: when you want specialist documents, not generalists. Notice how it pulled "Flask app factory pattern" to #1 because it scored highest on query 2.

Relative score - Normalizes scores per branch, then blends with custom weights. Designed for exactly 2 branches (dense + sparse). Best for: hybrid vector + BM25 fusion where you control the balance.

Weighted - Assigns explicit weights to each query vector. Works with N vectors. Best for: when you know the user cares more about one angle (e.g., 60% database, 30% Flask, 10% performance).
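To make the difference between the score-based strategies concrete, here is a toy computation. The doc names and similarity scores below are made up for illustration - they are not the corpus values above:

```python
# Toy computation showing how score-based fusion strategies diverge.
per_query_scores = {
    # doc -> similarity scores from each query that retrieved it
    "pooling": [0.659, 0.511],  # moderately relevant to two queries
    "factory": [0.701],         # very relevant to one query only
    "explain": [0.655],         # very relevant to one query only
}

N_QUERIES = 3

def average_fuse(scores):
    # Queries that missed a doc contribute 0, so generalists win
    return {d: sum(s) / N_QUERIES for d, s in scores.items()}

def maximum_fuse(scores):
    # The best single-query score wins, so specialists can top the list
    return {d: max(s) for d, s in scores.items()}

avg = average_fuse(per_query_scores)
mx = maximum_fuse(per_query_scores)
print(max(avg, key=avg.get))  # "pooling" - seen by two queries
print(max(mx, key=mx.get))    # "factory" - one very strong match
```

Same inputs, different winners: average rewards the document two queries agree on, maximum rewards the single strongest match.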

Quality modes: squeeze more precision from each query

Before fusion even happens, you can control how thoroughly each individual search explores the HNSW index:

query_vec = model.encode(user_question).tolist()

for mode in ["fast", "balanced", "accurate"]:
    results = collection.search_with_quality(query_vec, mode, top_k=5)
    titles = [r["payload"]["title"][:50] for r in results]
    print(f"  {mode:10s} => {titles}")
  • fast - Explores fewer candidates. Sub-millisecond. Use for autocomplete, typeahead.
  • balanced - Good default. Explores more paths in the HNSW graph.
  • accurate - Maximum recall. Explores extensively. Use for final answers.
  • perfect - Brute force. Guarantees the mathematically correct top-k. Slow on large collections.
  • autotune - Lets the engine pick based on collection size and query complexity.
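As a reference point, what perfect mode guarantees is the answer an exhaustive scan would produce. Here is a pure-Python sketch of that brute-force computation, assuming cosine similarity (illustrative only, not VelesDB's implementation):

```python
# Brute-force exact top-k: score every vector, sort, take the best k.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k=2):
    # No index, no skipped neighborhoods: every document is scored
    scored = [(cosine(query, v), doc_id) for doc_id, v in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(brute_force_top_k([1.0, 0.1], vectors))  # ["a", "b"]
```

The approximate modes (fast, balanced, accurate) trade away some of this guarantee for speed by visiting only part of the HNSW graph.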

On our 12-document corpus, all three modes return the same results - the HNSW index is small enough that even fast explores everything. The difference shows up on larger collections (10K+ documents) where fast may skip graph neighborhoods that accurate would visit. This is where quality modes matter: when your collection grows, you trade latency for recall.

The combination is powerful: decompose a question into sub-queries, search each with accurate quality, fuse with RRF. You get coverage and precision that single-query retrieval cannot match.

Putting it all together: a production-ready retrieval function

import velesdb
from sentence_transformers import SentenceTransformer

def multi_angle_search(
    collection,
    model: SentenceTransformer,
    question: str,
    angles: list[str],
    top_k: int = 5,
    fusion: str = "rrf",
) -> list[dict]:
    """Search from multiple angles and fuse results.

    Args:
        collection: VelesDB collection
        model: SentenceTransformer encoder
        question: Original user question
        angles: List of reformulated search queries
        top_k: Number of results to return
        fusion: Fusion strategy name

    Returns:
        Fused, deduplicated, ranked results
    """
    strategies = {
        "rrf": velesdb.FusionStrategy.rrf(),
        "average": velesdb.FusionStrategy.average(),
        "maximum": velesdb.FusionStrategy.maximum(),
    }

    vectors = [model.encode(angle).tolist() for angle in angles]

    return collection.multi_query_search(
        vectors=vectors,
        top_k=top_k,
        fusion=strategies.get(fusion, strategies["rrf"]),
    )

Usage:

results = multi_angle_search(
    collection,
    model,
    question="How do I fix a slow database connection in my Flask app?",
    angles=[
        "database connection pool configuration timeout",
        "Flask SQLAlchemy session management setup",
        "profiling slow queries performance bottleneck",
    ],
    top_k=5,
)

for r in results:
    p = r.get("payload", r.get("bindings", {}))
    print(f"  {p['title']}")

In a production RAG pipeline, the "angles" list comes from an LLM. Ask GPT or Claude to decompose the user question into 2-4 search queries, embed each one, and fuse. This is exactly what LangChain's MultiQueryRetriever and LlamaIndex's SubQuestionQueryEngine do - except here the fusion happens inside the database engine, not in Python glue code.
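That decomposition step can be sketched as follows. Here `call_llm` is a hypothetical placeholder for whatever client you use (OpenAI, Anthropic, a local model), and the prompt wording is just one reasonable choice - the only contract assumed is "prompt in, JSON string out":

```python
# Sketch of LLM-driven query decomposition with a pluggable LLM client.
import json

DECOMPOSE_PROMPT = """Decompose the user question into 2-4 short search
queries, each targeting a distinct concept. Reply with a JSON array of
strings only.

Question: {question}"""

def generate_angles(question, call_llm):
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    angles = json.loads(raw)
    # Guard against malformed replies before spending embedding calls
    assert isinstance(angles, list) and 2 <= len(angles) <= 4
    return [str(a) for a in angles]

# Stubbed example - a real pipeline would wire in an actual LLM client
def fake_llm(prompt):
    return ('["database connection pool timeout", '
            '"Flask SQLAlchemy session setup", '
            '"profiling slow queries"]')

angles = generate_angles(
    "How do I fix a slow database connection in my Flask app?", fake_llm)
print(angles)
```

The resulting angles feed straight into multi_angle_search above; the stub makes the shape of the contract testable without any API key.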

When to use multi-query vs. single query

Not every question needs three searches:

Single query works fine when:

  • The question is specific and unambiguous ("What is the default pool_size in SQLAlchemy?")
  • Your corpus is small and homogeneous
  • Latency matters more than recall (autocomplete, typeahead)

Multi-query wins when:

  • The question spans multiple domains or concepts
  • Users ask vague questions that map to several topics
  • Completeness matters (legal, medical, compliance use cases)
  • You are building agentic RAG where the LLM generates sub-queries anyway

Full working example

The complete, tested script is available on GitHub:
examples/tutorials/multi_query_fusion_demo.py


VelesDB is a source-available (Elastic License 2.0) local-first database combining vector, graph, and columnar storage in a single ~6MB Rust binary. No Docker. No API keys.
