WonderLab

RAG Series (21): Performance Optimization — Faster and Cheaper

The Cost Structure of RAG

What happens in a single RAG request:

1. embed(question)          → 1 Embedding API call
2. vectorstore.search()     → vector store retrieval (local, fast)
3. llm.generate(context)    → 1 LLM API call

At minimum 2 API calls per request. At scale, these compound quickly:

  • Latency: LLM calls typically 1–10 seconds; Embedding calls 0.1–0.5 seconds
  • Cost: token-based billing means identical questions pay the same price every time
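To see how these numbers add up per request, here is a back-of-the-envelope sketch. The latencies are hypothetical values picked from the ranges above, and `expected_latency_ms` is an illustrative helper, not part of any library. It models a cache that skips only the LLM step.

```python
# Rough per-request latency model for the three-step chain above.
# All constants are hypothetical values chosen from the ranges quoted in the text.
EMBED_MS = 300     # assumed embedding call latency (0.1-0.5 s range)
SEARCH_MS = 10     # assumed local vector search latency
LLM_MS = 5000      # assumed LLM call latency (1-10 s range)


def expected_latency_ms(llm_hit_rate: float, cache_hit_ms: float = 1.0) -> float:
    """Average latency when a fraction of requests skips the LLM call."""
    if not 0.0 <= llm_hit_rate <= 1.0:
        raise ValueError("llm_hit_rate must be between 0 and 1")
    llm_step = llm_hit_rate * cache_hit_ms + (1.0 - llm_hit_rate) * LLM_MS
    return EMBED_MS + SEARCH_MS + llm_step


for rate in (0.0, 0.3, 0.6):
    print(f"LLM cache hit rate {rate:.0%}: {expected_latency_ms(rate):.0f} ms on average")
```

Under these assumed numbers the LLM step dominates, which is why most of the optimizations target it first.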

The four optimizations each target a different point in this chain:

Optimization            Where            What it saves
LLM response cache      LLM call         Skips the LLM entirely; sub-millisecond response
Embedding cache         Embedding call   No re-embedding for identical text
Semantic Cache          LLM call         Reuses answers for similar questions
Async batch Embedding   Embedding call   N serial round-trips → 1 batched call

Optimization 1: LLM Response Cache

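Principle: The cache key is the prompt plus the model configuration (model name, temperature, and other parameters). The first call stores the result; later calls with an identical key return it directly, with no network request at all. Generation itself is not guaranteed to be deterministic, so a cache hit replays the stored answer rather than sampling a new one.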

LangChain exposes this as a global switch:

from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

set_llm_cache(InMemoryCache())   # one line, affects all LLM calls

For persistence across restarts, swap in SQLite:

from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".llm_cache.db"))

Results

3 questions, each asked twice:

Q: What are the four core metrics in RAGAS?
  Cache miss:  1743ms   Cache hit:   0.7ms   Speedup: 2441×

Q: What are the common vector database options?
  Cache miss:  3675ms   Cache hit:   0.9ms   Speedup: 4126×

Q: What is Rerank?
  Cache miss:  9753ms   Cache hit:   0.9ms   Speedup: 10993×

Average: miss=5057ms  hit=0.8ms  speedup=6068×

Hit latency is 0.8ms — that's dictionary lookup time, not network latency. On a cache hit, zero network requests are made.

6000× sounds exaggerated, but this is what "in-memory dict vs. network API call" actually looks like.

Good fit for: FAQ-style Q&A, report generation (user clicks "regenerate" repeatedly), popular questions asked by many users.

Limitation: Exact prompt match only. A rephrased question is a cache miss.
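The exact-match limitation is easy to see in a toy version of the idea. The sketch below is not LangChain's implementation; `ExactMatchCache`, `CountingLLM`, and the model name `demo-model` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Toy exact-match cache keyed on model settings plus the full prompt."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    @staticmethod
    def _key(prompt: str, model: str, temperature: float) -> str:
        raw = f"{model}|{temperature}|{prompt}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, model: str, temperature: float,
                    llm: Callable[[str], str]) -> str:
        key = self._key(prompt, model, temperature)
        if key not in self._data:
            self._data[key] = llm(prompt)   # miss: pay for the call once
        return self._data[key]


class CountingLLM:
    """Stand-in for a real model; counts how often it is actually called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer to: {prompt}"


llm = CountingLLM()
cache = ExactMatchCache()
cache.get_or_call("What is Rerank?", "demo-model", 0.0, llm)   # miss
cache.get_or_call("What is Rerank?", "demo-model", 0.0, llm)   # hit
cache.get_or_call("what is rerank", "demo-model", 0.0, llm)    # miss: text differs
print(llm.calls)
```

Any change to the prompt text, even in case or punctuation, produces a different key and therefore a new model call.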


Optimization 2: Embedding Cache

Principle: The embedding vector for a given text is deterministic (same model + same text = same vector). CacheBackedEmbeddings wraps a base embeddings object with a ByteStore layer — embed once, serialize and store, read from cache thereafter.

from langchain_classic.embeddings import CacheBackedEmbeddings
from langchain_classic.storage import InMemoryByteStore, LocalFileStore

# In-memory (lost on restart)
store = InMemoryByteStore()

# File-based (persistent across restarts)
# store = LocalFileStore("./embedding_cache/")

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=base_embeddings,
    document_embedding_cache=store,
    namespace=EMB_MODEL,   # isolates cache by model name
)

# API identical to regular embeddings
vectorstore = Chroma.from_documents(docs, embedding=cached_embeddings)

namespace=EMB_MODEL matters: if you switch embedding models, the old cached vectors have a different dimension and distribution. Namespacing by model name prevents the new model from reading stale vectors.
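The effect of namespacing can be pictured with a small sketch. This is not how `CacheBackedEmbeddings` builds its keys internally; `cache_key` and the model names `model-a` and `model-b` are hypothetical, chosen only to show why a model switch cannot collide with old entries.

```python
import hashlib


def cache_key(namespace: str, text: str) -> str:
    """Combine a namespace (here: the model name) with a hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


text = "RAGAS defines four core metrics."
key_a = cache_key("model-a", text)
key_b = cache_key("model-b", text)

print(key_a == cache_key("model-a", text))   # same model, same text: same entry
print(key_a == key_b)                        # different model: separate entry
```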

Results

8 texts, three passes:

First index (8 texts, all new):
  285ms   1 API call   8 texts sent

Repeat index (8 texts, all cached):
  5.7ms   0 API calls  0 texts sent

Knowledge base update (6 unchanged + 2 new):
  63.5ms  1 API call   2 texts sent

The third pass is the point: on a knowledge base update, the 6 unchanged documents are served from cache and only the 2 new documents are sent to the API. This pairs naturally with the Indexing API from the previous article: content hash tracking identifies which documents need re-indexing, and the Embedding cache ensures identical content is never re-embedded.

Good fit for: knowledge bases with a large stable core and occasional updates. The larger the corpus and the less often it changes, the bigger the benefit.
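The three passes above can be reproduced with a toy wrapper that forwards only unseen texts to the backend. This is a simplified sketch, not the library's code; `MissOnlyEmbedder` and `fake_backend` are hypothetical stand-ins, and the one-dimensional vectors are placeholders.

```python
from typing import Callable, Dict, List

Vector = List[float]


class MissOnlyEmbedder:
    """Toy cache wrapper: only texts not seen before reach the backend."""

    def __init__(self, backend: Callable[[List[str]], List[Vector]]) -> None:
        self._backend = backend
        self._cache: Dict[str, Vector] = {}
        self.api_calls = 0
        self.texts_sent = 0

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        # Deduplicate while keeping order, then keep only the cache misses.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            self.api_calls += 1                  # one batched call for all misses
            self.texts_sent += len(missing)
            for text, vec in zip(missing, self._backend(missing)):
                self._cache[text] = vec
        return [self._cache[t] for t in texts]


def fake_backend(texts: List[str]) -> List[Vector]:
    # Stand-in for a real embedding API: one placeholder number per text.
    return [[float(len(t))] for t in texts]


embedder = MissOnlyEmbedder(fake_backend)
docs = [f"doc {i}" for i in range(8)]

embedder.embed_documents(docs)                            # first index: 8 texts sent
embedder.embed_documents(docs)                            # repeat index: nothing sent
embedder.embed_documents(docs[:6] + ["new a", "new b"])   # update: 2 texts sent
print(embedder.api_calls, embedder.texts_sent)            # 2 10
```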


Optimization 3: Semantic Cache

Principle: LLM response cache requires an exact prompt match. Semantic Cache goes further: store historical (question, answer) pairs as vectors; when a new question arrives, run a nearest-neighbor search; if a sufficiently similar historical question is found, return its answer directly — skipping both retrieval and LLM.

"What metrics does the RAGAS framework have?"  → miss → LLM generates → stored
"Describe the four core RAGAS metrics"         → vector search → finds above
                                               → similarity ≥ threshold → return cached answer

Implementation:

import uuid
from typing import Optional

class SemanticCache:
    def __init__(self, embeddings, threshold: float = 0.85):
        self._store   = Chroma(collection_name="semantic_cache", ...)
        self._answers = {}          # cache_id → answer
        self._threshold = threshold

    def get(self, question: str) -> Optional[str]:
        results = self._store.similarity_search_with_relevance_scores(question, k=1)
        if results:
            doc, score = results[0]
            if score >= self._threshold:
                return self._answers[doc.metadata["cache_id"]]
        return None

    def set(self, question: str, answer: str) -> None:
        cache_id = str(uuid.uuid4())
        self._store.add_texts([question], metadatas=[{"cache_id": cache_id}])
        self._answers[cache_id] = answer

Results: Threshold Calibration Is the Hard Part

Threshold: 0.85

RAGAS group:
  Original:  "What metrics does RAGAS have?"              → miss (3782ms)
  Paraphrase: "Describe the four core RAGAS metrics"      → miss (3298ms) ← expected HIT
  Different:  "How should I choose a vector database?"    → miss (2509ms) ← correct miss

Rerank group:
  Original:  "What role does Rerank play in RAG?"         → miss (11602ms)
  Paraphrase: "Why do RAG systems need re-ranking?"       → miss (3834ms) ← expected HIT
  Different:  "What is hybrid retrieval?"                 → miss (12578ms) ← correct miss

Total hit rate: 0/6

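Neither paraphrase hit the cache. Of the six queries, only those two were expected to hit, since the originals populate the cache and the unrelated questions should miss, so the meaningful result is 0 of 2. This is not a code bug: a threshold of 0.85 is too high for these paraphrase pairs.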

Why: bge-large-zh-v1.5 cosine similarity between these pairs likely falls in the 0.80–0.84 range, just below the threshold. Semantic similarity ≠ high cosine similarity. The mapping depends on the embedding model's representation space and training data.
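One thing worth checking when reading scores is which scale they are on. For unit-length vectors, squared Euclidean distance equals 2 − 2·cos, so a store that derives its relevance score from a distance will report different numbers than a direct cosine calculation. The sketch below checks that identity numerically with made-up vectors. Whether it matters here depends on how the collection is configured, which the snippet above does not show.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, for illustration only.
a = normalize([1.0, 2.0, 3.0])
b = normalize([1.0, 2.5, 2.0])

cos_ab = cosine(a, b)
dist_ab = squared_l2(a, b)
print(f"cosine={cos_ab:.4f}  squared_l2={dist_ab:.4f}  2-2cos={2 - 2 * cos_ab:.4f}")
```

The practical consequence is that a threshold should be calibrated on the same score the cache actually compares against.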

The correct approach: calibrate before setting a threshold. Measure the similarity distribution on your actual question samples:

# Calibration: measure similarity on known similar pairs and known-different pairs
from numpy import dot
from numpy.linalg import norm

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similar_pairs = [
    ("What RAGAS metrics are there?", "List the RAGAS evaluation metrics"),
    ("How to choose a vector DB?", "Which vector database should I use?"),
]
dissimilar_pairs = [
    ("What RAGAS metrics are there?", "How to choose a vector DB?"),
]

for q1, q2 in similar_pairs:
    v1 = embeddings.embed_query(q1)
    v2 = embeddings.embed_query(q2)
    print(f"Similar:    {cosine(v1, v2):.3f}  {q1[:30]} / {q2[:30]}")

for q1, q2 in dissimilar_pairs:
    v1 = embeddings.embed_query(q1)
    v2 = embeddings.embed_query(q2)
    print(f"Dissimilar: {cosine(v1, v2):.3f}  {q1[:30]} / {q2[:30]}")

# Set threshold between the two distributions

Find a threshold that separates the two distributions. For Chinese Q&A with bge models, 0.80–0.85 is a common starting range — but you must validate on your own data before deploying.
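Once the two groups of scores are measured, picking a starting threshold can be as simple as taking the midpoint of the gap between them. The sketch below assumes the groups do not overlap; `pick_threshold` is a hypothetical helper, and the scores are invented for illustration rather than measured in this article.

```python
from typing import List


def pick_threshold(similar: List[float], dissimilar: List[float]) -> float:
    """Midpoint between the lowest 'similar' score and the highest 'dissimilar' one."""
    if not similar or not dissimilar:
        raise ValueError("need at least one score in each group")
    low_similar = min(similar)
    high_dissimilar = max(dissimilar)
    if low_similar <= high_dissimilar:
        raise ValueError("distributions overlap; no clean threshold exists")
    return (low_similar + high_dissimilar) / 2


# Hypothetical scores, for illustration only.
similar_scores = [0.83, 0.81, 0.88]
dissimilar_scores = [0.55, 0.62, 0.48]

threshold = pick_threshold(similar_scores, dissimilar_scores)
print(f"candidate threshold: {threshold:.3f}")
```

If the helper raises because the groups overlap, no single threshold separates them, and any choice trades missed hits against wrong answers served from cache.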

The real value of Semantic Cache lies in high-volume FAQ systems where users ask the same questions in many different ways (customer service bots, documentation assistants), where it can remove a large share of LLM calls. That value depends entirely on threshold calibration, so it is not a drop-in default.


Optimization 4: Async Batch Embedding

Principle: Embedding N texts sequentially = N network round-trips. Embedding N texts in a single batch call = 1 network round-trip, processed in parallel server-side.

import asyncio
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(...)

# Sequential (slow): one API call per text
sequential = [embeddings.embed_query(text) for text in texts]

# Async batch (fast): one API call for all texts (the gain comes from batching; sync embed_documents batches too)
async def embed_batch(texts):
    return await embeddings.aembed_documents(texts)

batch = asyncio.run(embed_batch(texts))

Results

12 texts:

Sequential (one by one):    830ms
Async batch (one call):     289ms
Speedup:                    2.87×

Same vectors, 11 fewer network round-trips. Vector agreement > 0.9999 cosine similarity.
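The round-trip arithmetic can be checked without a network using a stand-in client. `FakeEmbeddingAPI` below is hypothetical and its vectors are placeholders; it only counts how many requests each strategy would issue.

```python
from typing import List

Vector = List[float]


class FakeEmbeddingAPI:
    """Stand-in client that counts requests instead of calling a real service."""

    def __init__(self) -> None:
        self.round_trips = 0

    @staticmethod
    def _vector(text: str) -> Vector:
        # Placeholder "embedding" derived from the text, deterministic per input.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def embed_query(self, text: str) -> Vector:
        self.round_trips += 1                     # one request per text
        return self._vector(text)

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        self.round_trips += 1                     # one request for the whole list
        return [self._vector(t) for t in texts]


texts = [f"passage number {i}" for i in range(12)]

sequential_api = FakeEmbeddingAPI()
sequential = [sequential_api.embed_query(t) for t in texts]

batch_api = FakeEmbeddingAPI()
batched = batch_api.embed_documents(texts)

print(sequential_api.round_trips, batch_api.round_trips, sequential == batched)
```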

Where to apply in the RAG pipeline:

# Batch indexing at build time
async def index_documents_async(docs: list[Document]):
    texts   = [d.page_content for d in docs]
    vectors = await embeddings.aembed_documents(texts)
    # bulk write to vector store
    ...

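# Concurrent user queries in the service layer
async def handle_batch_queries(questions: list[str]):
    # One batched call embeds every question (assumes the model embeds
    # queries and documents the same way, as OpenAI embedding models do)
    vectors = await embeddings.aembed_documents(questions)
    # Search with the precomputed vectors so no question is embedded twice
    results = await asyncio.gather(*[
        vectorstore.asimilarity_search_by_vector(v) for v in vectors
    ])
    return results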

The more documents, the bigger the gain. Batch documents in chunks of 50–100 during index builds; expect 3–5× speedup over sequential, depending on network latency.
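Chunked batching during an index build might look like the sketch below. It assumes an async embedding function that takes a list of texts. `embed_in_chunks`, `fake_embed`, and the defaults of 64 texts per chunk and 4 concurrent requests are hypothetical choices, not values from this article's benchmark.

```python
import asyncio
from typing import Awaitable, Callable, List

Vector = List[float]
EmbedFn = Callable[[List[str]], Awaitable[List[Vector]]]


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def embed_in_chunks(texts: List[str], embed_fn: EmbedFn,
                          chunk_size: int = 64, max_concurrency: int = 4) -> List[Vector]:
    """Embed texts chunk by chunk, with a cap on in-flight requests."""
    if chunk_size < 1 or max_concurrency < 1:
        raise ValueError("chunk_size and max_concurrency must be positive")
    limiter = asyncio.Semaphore(max_concurrency)

    async def run(batch: List[str]) -> List[Vector]:
        async with limiter:
            return await embed_fn(batch)

    # gather() returns results in submission order, so vectors stay aligned with texts.
    per_batch = await asyncio.gather(*(run(b) for b in chunked(texts, chunk_size)))
    return [vec for batch in per_batch for vec in batch]


async def fake_embed(batch: List[str]) -> List[Vector]:
    # Stand-in for a real API call: simulated latency, placeholder vectors.
    await asyncio.sleep(0.01)
    return [[float(len(t))] for t in batch]


texts = [f"chunk {i}" for i in range(150)]
vectors = asyncio.run(embed_in_chunks(texts, fake_embed))
print(len(vectors))
```

The semaphore is there because unbounded concurrency tends to run into provider rate limits; the right cap depends on the account's quota.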


Combining All Four Optimizations

# 1. LLM cache (global, always on)
set_llm_cache(SQLiteCache(".llm_cache.db"))

# 2. Embedding cache (wrap the base embeddings)
store = LocalFileStore("./embedding_cache/")
embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=base_embeddings,
    document_embedding_cache=store,
    namespace=EMB_MODEL,
)

# 3. Semantic Cache (check before full pipeline)
semantic_cache = SemanticCache(embeddings, threshold=YOUR_CALIBRATED_THRESHOLD)

def query(question: str) -> str:
    cached = semantic_cache.get(question)
    if cached is not None:   # an empty-string answer still counts as a hit
        return cached
    docs   = retriever.invoke(question)
    answer = llm.invoke(...)
    semantic_cache.set(question, answer)
    return answer

# 4. Async for bulk operations
vectors = asyncio.run(embeddings.aembed_documents(texts))

All four are orthogonal and stackable. Highest-ROI combination: LLM cache + Embedding cache — near-zero implementation cost, should be on by default. Semantic Cache requires calibration but delivers large savings once tuned. Async batch is specifically valuable at index-build time and under high concurrency.


Summary

=====================================================================
  Optimization Results Summary
=====================================================================

  Optimization             Before          After         Savings
  ─────────────────────────────────────────────────────────────
  LLM response cache       5057ms          0.8ms         99.98%  ✓ strongly recommended
  Embedding cache (rebuild) 285ms          5.7ms         98%     ✓ strongly recommended
  Embedding cache (update)  8 texts sent   2 texts sent  75%     ✓ strongly recommended
  Semantic Cache (t=0.85)   functional     needs calibr. —       ⚠ calibrate first
  Async batch Embedding     830ms          289ms         65%     ✓ recommended at scale
=====================================================================

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/21-rag-performance

Key file:

  • rag_performance.py — all four benchmarks with report generation

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/21-rag-performance
cp .env.example .env
pip install -r requirements.txt
python rag_performance.py

Summary

This article implemented and measured four RAG performance optimizations:

  1. LLM response cache: cheapest and highest impact — one line of code, repeated questions go from 5057ms to 0.8ms (6000× speedup)
  2. Embedding cache: identical text never re-embedded; knowledge base updates only embed changed content (8 texts sent → 2 texts sent)
  3. Semantic Cache: conceptually correct, but threshold 0.85 produced 0/6 hits in this experiment — threshold calibration is non-optional; measure similarity distribution on real data before setting any value
  4. Async batch Embedding: 2.87× speedup for 12 texts; benefit grows with document count

The first three optimizations attack the same root problem: repeated computation is waste. The same work shouldn't cost twice. The fourth attacks a different problem: serial waiting is unnecessary. Work that can be parallelized shouldn't be queued.

Different problems, same goal: making RAG viable in production.

