The Cost Structure of RAG
What happens in a single RAG request:
1. embed(question) → 1 Embedding API call
2. vectorstore.search() → vector store retrieval (local, fast)
3. llm.generate(context) → 1 LLM API call
At minimum 2 API calls per request. At scale, these compound quickly:
- Latency: LLM calls typically 1–10 seconds; Embedding calls 0.1–0.5 seconds
- Cost: token-based billing means identical questions pay the same price every time
The four optimizations each target a different point in this chain:
| Optimization | Where | What it saves |
|---|---|---|
| LLM response cache | LLM call | Skip the LLM call entirely; sub-millisecond response on a hit |
| Embedding cache | Embedding call | No re-embedding for identical text |
| Semantic Cache | Retrieval + LLM call | Reuse answers for similar questions |
| Async batch Embedding | Embedding call | N serial round-trips → 1 concurrent call |
Optimization 1: LLM Response Cache
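Principle: Treat the (prompt, model, temperature) combination as the identity of an LLM call. Cache the result on the first call; return it directly on subsequent identical calls, with no network request at all. The cache returns the stored answer verbatim, so a repeated prompt gets the same response even when temperature is above 0.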
LangChain exposes this as a global switch:
from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache()) # one line, affects all LLM calls
For persistence across restarts, swap in SQLite:
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".llm_cache.db"))
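To make the mechanism concrete, here is a minimal sketch of an exact-match response cache using only the standard library. It is not LangChain's implementation: `ExactMatchCache`, `cache_key`, and `fake_llm` are hypothetical names, and hashing the prompt, model, and temperature with SHA-256 is an assumption chosen for illustration.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(prompt: str, model: str, temperature: float) -> str:
    """Derive a stable key from everything that identifies the call."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """Hypothetical in-memory cache: one dict lookup per call."""

    def __init__(self, llm: Callable[[str], str], model: str, temperature: float = 0.0):
        self._llm = llm
        self._model = model
        self._temperature = temperature
        self._store: Dict[str, str] = {}
        self.llm_calls = 0  # how many times the underlying LLM was invoked

    def invoke(self, prompt: str) -> str:
        key = cache_key(prompt, self._model, self._temperature)
        if key in self._store:
            return self._store[key]  # hit: the LLM is not called
        self.llm_calls += 1
        answer = self._llm(prompt)
        self._store[key] = answer
        return answer


def fake_llm(prompt: str) -> str:
    """Stand-in for a network call to a real model."""
    return f"answer to: {prompt}"


cached_llm = ExactMatchCache(fake_llm, model="demo-model")
first = cached_llm.invoke("What is Rerank?")
second = cached_llm.invoke("What is Rerank?")       # identical prompt: hit
third = cached_llm.invoke("What does Rerank do?")   # rephrased prompt: miss
```

The rephrased prompt hashes to a different key, so it reaches the model again. Any change to the prompt text, model name, or temperature has the same effect.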
Results
3 questions, each asked twice:
Q: What are the four core metrics in RAGAS?
Cache miss: 1743ms Cache hit: 0.7ms Speedup: 2441×
Q: What are the common vector database options?
Cache miss: 3675ms Cache hit: 0.9ms Speedup: 4126×
Q: What is Rerank?
Cache miss: 9753ms Cache hit: 0.9ms Speedup: 10993×
Average: miss=5057ms hit=0.8ms speedup=6068×
Hit latency is 0.8ms — that's dictionary lookup time, not network latency. On a cache hit, zero network requests are made.
6000× sounds exaggerated, but this is what "in-memory dict vs. network API call" actually looks like.
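How much of that speedup shows up in average latency depends on the hit rate. The sketch below is a rough model, assuming every request is either a full miss or a full hit and using the averages measured above (5057ms and 0.8ms). The hit rates in the loop are hypothetical, not measured.

```python
def expected_latency_ms(hit_rate: float, miss_ms: float = 5057.0, hit_ms: float = 0.8) -> float:
    """Mean latency when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


for rate in (0.0, 0.3, 0.6, 0.9):
    print(f"hit rate {rate:.0%}: {expected_latency_ms(rate):.1f} ms on average")
```

Under this model the misses dominate the average: even at a 90% hit rate the mean is about 506ms, roughly a 10× improvement rather than 6000×. The per-request speedup is real, but the aggregate gain is bounded by how often questions actually repeat.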
Good fit for: FAQ-style Q&A, report generation (user clicks "regenerate" repeatedly), popular questions asked by many users.
Limitation: Exact prompt match only. A rephrased question is a cache miss.
Optimization 2: Embedding Cache
Principle: The embedding vector for a given text is deterministic (same model + same text = same vector). CacheBackedEmbeddings wraps a base embeddings object with a ByteStore layer — embed once, serialize and store, read from cache thereafter.
from langchain_classic.embeddings import CacheBackedEmbeddings
from langchain_classic.storage import InMemoryByteStore, LocalFileStore
# In-memory (lost on restart)
store = InMemoryByteStore()
# File-based (persistent across restarts)
# store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=base_embeddings,
    document_embedding_cache=store,
    namespace=EMB_MODEL,  # isolates cache by model name
)
# API identical to regular embeddings
vectorstore = Chroma.from_documents(docs, embedding=cached_embeddings)
namespace=EMB_MODEL matters: if you switch embedding models, the old cached vectors have a different dimension and distribution. Namespacing by model name prevents the new model from reading stale vectors.
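The idea behind the namespace can be sketched without LangChain. The class below is a hypothetical stand-in, not `CacheBackedEmbeddings` itself: it assumes the cache key is the namespace plus a SHA-256 hash of the text, and `fake_embed_batch` replaces the real embedding API with toy vectors.

```python
import hashlib
from typing import Callable, Dict, List, Optional

Vector = List[float]


class NamespacedEmbeddingCache:
    """Hypothetical cache: key = namespace + hash(text), value = vector."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], List[Vector]],
        namespace: str,
        store: Optional[Dict[str, Vector]] = None,
    ):
        self._embed_batch = embed_batch
        self._namespace = namespace
        self._store = store if store is not None else {}
        self.api_calls = 0
        self.texts_sent = 0

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._namespace}:{digest}"

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        # dict.fromkeys de-duplicates while preserving order
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self._store]
        if missing:
            # Only uncached texts go to the underlying model, in one batch.
            self.api_calls += 1
            self.texts_sent += len(missing)
            for text, vector in zip(missing, self._embed_batch(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]


def fake_embed_batch(texts: List[str]) -> List[Vector]:
    """Stand-in for an embedding API; returns toy 2-d vectors."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


shared_store: Dict[str, Vector] = {}
model_a = NamespacedEmbeddingCache(fake_embed_batch, "model-a", shared_store)
docs = [f"document {i}" for i in range(8)]
model_a.embed_documents(docs)   # all 8 texts are new
model_a.embed_documents(docs)   # all 8 texts are cached

# Same store, different namespace: nothing cached under "model-a" is visible.
model_b = NamespacedEmbeddingCache(fake_embed_batch, "model-b", shared_store)
model_b.embed_documents(docs)
```

Because the namespace is part of every key, two models can share one store without either reading the other's vectors.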
Results
8 texts, three passes:
First index (8 texts, all new):
285ms 1 API call 8 texts sent
Repeat index (8 texts, all cached):
5.7ms 0 API calls 0 texts sent
Knowledge base update (6 unchanged + 2 new):
63.5ms 1 API call 2 texts sent
The third row is the point: on a knowledge base update, the 6 unchanged documents are served from cache. Only the 2 new documents trigger an API call. This pairs naturally with the Indexing API from the previous article — content hash tracking identifies which documents need re-indexing; Embedding cache ensures identical content is never re-embedded.
Good fit for: knowledge bases with a large stable core and occasional updates. The larger the corpus and the smaller the share of it that changes per update, the bigger the benefit.
Optimization 3: Semantic Cache
Principle: LLM response cache requires an exact prompt match. Semantic Cache goes further: store historical (question, answer) pairs as vectors; when a new question arrives, run a nearest-neighbor search; if a sufficiently similar historical question is found, return its answer directly — skipping both retrieval and LLM.
"What metrics does the RAGAS framework have?" → miss → LLM generates → stored
"Describe the four core RAGAS metrics" → vector search → finds above
→ similarity ≥ threshold → return cached answer
Implementation:
import uuid
from typing import Optional

class SemanticCache:
    def __init__(self, embeddings, threshold: float = 0.85):
        self._store = Chroma(collection_name="semantic_cache", ...)
        self._answers = {}  # cache_id → answer
        self._threshold = threshold

    def get(self, question: str) -> Optional[str]:
        # Nearest historical question and its relevance score
        results = self._store.similarity_search_with_relevance_scores(question, k=1)
        if results:
            doc, score = results[0]
            if score >= self._threshold:
                return self._answers[doc.metadata["cache_id"]]
        return None

    def set(self, question: str, answer: str) -> None:
        cache_id = str(uuid.uuid4())
        self._store.add_texts([question], metadatas=[{"cache_id": cache_id}])
        self._answers[cache_id] = answer
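The same lookup logic can be exercised without a vector store or an embedding API. The sketch below is a toy, not the class above: `toy_embed` uses word counts in place of a real embedding model, and a linear scan replaces Chroma's nearest-neighbor search. It only illustrates how the threshold decides between hit and miss.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    def __init__(self, threshold: float):
        self._entries: List[Tuple[Counter, str]] = []  # (question vector, answer)
        self._threshold = threshold

    def get(self, question: str) -> Optional[str]:
        query = toy_embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:  # linear scan; a vector store replaces this
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def set(self, question: str, answer: str) -> None:
        self._entries.append((toy_embed(question), answer))


stored_q = "what metrics does ragas have"
paraphrase = "what metrics does ragas provide"   # shares 4 of 5 words: cosine 0.8
unrelated = "how should i choose a vector database"

loose = ToySemanticCache(threshold=0.70)
loose.set(stored_q, "cached answer")
strict = ToySemanticCache(threshold=0.85)
strict.set(stored_q, "cached answer")

loose_hit = loose.get(paraphrase)     # 0.8 >= 0.70: returns the cached answer
strict_hit = strict.get(paraphrase)   # 0.8 <  0.85: returns None
unrelated_hit = loose.get(unrelated)  # no shared words: returns None
```

The paraphrase scores the same in both caches; only the threshold differs, and that alone flips the outcome from hit to miss.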
Results: Threshold Calibration Is the Hard Part
Threshold: 0.85
RAGAS group:
Original: "What metrics does RAGAS have?" → miss (3782ms)
Paraphrase: "Describe the four core RAGAS metrics" → miss (3298ms) ← expected HIT
Different: "How should I choose a vector database?" → miss (2509ms) ← correct miss
Rerank group:
Original: "What role does Rerank play in RAG?" → miss (11602ms)
Paraphrase: "Why do RAG systems need re-ranking?" → miss (3834ms) ← expected HIT
Different: "What is hybrid retrieval?" → miss (12578ms) ← correct miss
Total hit rate: 0/6
The paraphrases didn't hit the cache. This is not a code bug — threshold 0.85 is too high for these paraphrase pairs.
Why: bge-large-zh-v1.5 cosine similarity between these pairs likely falls in the 0.80–0.84 range, just below the threshold. Semantic similarity ≠ high cosine similarity. The mapping depends on the embedding model's representation space and training data.
The correct approach: calibrate before setting a threshold. Measure the similarity distribution on your actual question samples:
# Calibration: measure similarity on known similar pairs and known-different pairs
from numpy import dot
from numpy.linalg import norm

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similar_pairs = [
    ("What RAGAS metrics are there?", "List the RAGAS evaluation metrics"),
    ("How to choose a vector DB?", "Which vector database should I use?"),
]
dissimilar_pairs = [
    ("What RAGAS metrics are there?", "How to choose a vector DB?"),
]

for q1, q2 in similar_pairs:
    v1 = embeddings.embed_query(q1)
    v2 = embeddings.embed_query(q2)
    print(f"Similar: {cosine(v1, v2):.3f} {q1[:30]} / {q2[:30]}")

for q1, q2 in dissimilar_pairs:
    v1 = embeddings.embed_query(q1)
    v2 = embeddings.embed_query(q2)
    print(f"Dissimilar: {cosine(v1, v2):.3f} {q1[:30]} / {q2[:30]}")

# Set threshold between the two distributions
Find a threshold that separates the two distributions. For Chinese Q&A with bge models, 0.80–0.85 is a common starting range — but you must validate on your own data before deploying.
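Once the two sets of scores are collected, choosing the threshold can be as simple as taking the midpoint of the gap between them. The helper below is one possible rule, not a recommendation from the article; `pick_threshold` is a hypothetical name, and the scores are made up for illustration rather than measured.

```python
from typing import List, Optional


def pick_threshold(similar_scores: List[float], dissimilar_scores: List[float]) -> Optional[float]:
    """Return the midpoint of the gap between the two groups, or None if they overlap."""
    lowest_similar = min(similar_scores)
    highest_dissimilar = max(dissimilar_scores)
    if lowest_similar <= highest_dissimilar:
        return None  # no single threshold separates the groups
    return (lowest_similar + highest_dissimilar) / 2


# Hypothetical scores for illustration only, not measurements from this article.
similar = [0.83, 0.81, 0.88]
dissimilar = [0.52, 0.61, 0.47]
threshold = pick_threshold(similar, dissimilar)

# Overlapping groups: the model cannot tell these pairs apart by score alone.
overlapping = pick_threshold([0.70, 0.90], [0.75])
```

A midpoint treats false hits and false misses as equally costly. Serving a wrong cached answer is often worse than paying for one extra LLM call, so a value closer to the lowest similar score may be the safer choice. If the function returns `None`, no threshold will work for that embedding model on that data.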
The real value of Semantic Cache: high-volume FAQ systems where users ask the same questions in many different ways (customer service bots, documentation assistants). In those systems it can remove a large share of LLM calls. But the value depends entirely on threshold calibration; it is not a drop-in default.
Optimization 4: Async Batch Embedding
Principle: Embedding N texts sequentially = N network round-trips. Embedding N texts in a single batch call = 1 network round-trip, processed in parallel server-side.
import asyncio
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(...)
# Sequential (slow): one API call per text
sequential = [embeddings.embed_query(text) for text in texts]
# Async batch (fast): one API call for all texts
async def embed_batch(texts):
    return await embeddings.aembed_documents(texts)
batch = asyncio.run(embed_batch(texts))
Results
12 texts:
Sequential (one by one): 830ms
Async batch (one call): 289ms
Speedup: 2.87×
Same vectors, 11 fewer network round-trips. Vector agreement > 0.9999 cosine similarity.
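The round-trip arithmetic can be reproduced offline. This sketch assumes a hypothetical `FakeEmbeddingAPI` that sleeps for a fixed delay per request in place of a real network call, and it counts round-trips rather than measuring wall-clock time.

```python
import asyncio
from typing import List


class FakeEmbeddingAPI:
    """Stand-in for a remote embedding service with a fixed round-trip delay."""

    def __init__(self, round_trip_s: float = 0.01):
        self.round_trip_s = round_trip_s
        self.round_trips = 0

    async def embed(self, texts: List[str]) -> List[List[float]]:
        self.round_trips += 1
        await asyncio.sleep(self.round_trip_s)  # simulated network latency
        return [[float(len(t))] for t in texts]


async def embed_one_by_one(api: FakeEmbeddingAPI, texts: List[str]) -> List[List[float]]:
    vectors: List[List[float]] = []
    for text in texts:  # each iteration waits for a full round-trip
        vectors.extend(await api.embed([text]))
    return vectors


async def embed_in_one_batch(api: FakeEmbeddingAPI, texts: List[str]) -> List[List[float]]:
    return await api.embed(texts)  # a single round-trip for all texts


texts = [f"text {i}" for i in range(12)]
sequential_api, batch_api = FakeEmbeddingAPI(), FakeEmbeddingAPI()
sequential_vectors = asyncio.run(embed_one_by_one(sequential_api, texts))
batch_vectors = asyncio.run(embed_in_one_batch(batch_api, texts))

print(f"sequential: {sequential_api.round_trips} round-trips")
print(f"batch:      {batch_api.round_trips} round-trip")
```

In this model the saving comes from batching, not from `async` as such: a synchronous call that sends all texts in one request makes the same single round-trip. The async form matters when the caller has other work to overlap with the wait.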
Where to apply in the RAG pipeline:
# Batch indexing at build time
async def index_documents_async(docs: list[Document]):
    texts = [d.page_content for d in docs]
    vectors = await embeddings.aembed_documents(texts)
    # bulk write to vector store
    ...

# Concurrent user queries in the service layer
async def handle_batch_queries(questions: list[str]):
    # A vector-store retriever embeds each question itself, so no separate
    # embedding step is needed; gather runs the retrievals concurrently.
    results = await asyncio.gather(*[
        retriever.ainvoke(q) for q in questions
    ])
    return results
The more documents, the bigger the gain. Batch documents in chunks of 50–100 during index builds. The 12-text run above measured 2.87×; with larger batches, a 3–5× speedup over sequential is a reasonable expectation, depending on network latency.
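One way to apply the chunking advice is sketched below. `embed_in_chunks` and `chunked` are hypothetical helpers, `fake_embed_batch` stands in for a call such as `embeddings.aembed_documents`, and the defaults for chunk size and concurrency are assumptions to tune against your provider's rate limits.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List

Vector = List[float]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def embed_in_chunks(
    embed_batch: Callable[[List[str]], Awaitable[List[Vector]]],
    texts: List[str],
    chunk_size: int = 64,
    max_concurrency: int = 4,
) -> List[Vector]:
    """Embed `texts` chunk by chunk, with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def embed_chunk(chunk: List[str]) -> List[Vector]:
        async with semaphore:
            return await embed_batch(chunk)

    # gather preserves input order, so the vectors line up with `texts`
    per_chunk = await asyncio.gather(*(embed_chunk(c) for c in chunked(texts, chunk_size)))
    return [vector for vectors in per_chunk for vector in vectors]


async def fake_embed_batch(texts: List[str]) -> List[Vector]:
    """Stand-in for a real async embedding call."""
    await asyncio.sleep(0.001)
    return [[float(len(t))] for t in texts]


texts = [f"doc {i}" for i in range(150)]
vectors = asyncio.run(embed_in_chunks(fake_embed_batch, texts))
```

The semaphore caps how many requests run at once, which keeps a large index build from hitting rate limits while still overlapping the network waits.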
Combining All Four Optimizations
# 1. LLM cache (global, always on)
set_llm_cache(SQLiteCache(".llm_cache.db"))

# 2. Embedding cache (wrap the base embeddings)
store = LocalFileStore("./embedding_cache/")
embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=base_embeddings,
    document_embedding_cache=store,
    namespace=EMB_MODEL,
)

# 3. Semantic Cache (check before full pipeline)
semantic_cache = SemanticCache(embeddings, threshold=YOUR_CALIBRATED_THRESHOLD)

def query(question: str) -> str:
    cached = semantic_cache.get(question)
    if cached is not None:  # an empty string is still a valid cached answer
        return cached
    docs = retriever.invoke(question)
    answer = llm.invoke(...)
    semantic_cache.set(question, answer)
    return answer

# 4. Async for bulk operations
vectors = asyncio.run(embeddings.aembed_documents(texts))
All four are orthogonal and stackable. Highest-ROI combination: LLM cache + Embedding cache — near-zero implementation cost, should be on by default. Semantic Cache requires calibration but delivers large savings once tuned. Async batch is specifically valuable at index-build time and under high concurrency.
Results Summary
| Optimization | Before | After | Savings | Verdict |
|---|---|---|---|---|
| LLM response cache | 5057ms | 0.8ms | 99.98% | ✓ strongly recommended |
| Embedding cache (rebuild) | 285ms | 5.7ms | 98% | ✓ strongly recommended |
| Embedding cache (update) | 8 texts sent | 2 texts sent | 75% | ✓ strongly recommended |
| Semantic Cache (t=0.85) | functional | needs calibration | n/a | ⚠ calibrate first |
| Async batch Embedding | 830ms | 289ms | 65% | ✓ recommended at scale |
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/21-rag-performance
Key file:
- rag_performance.py: all four benchmarks with report generation
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/21-rag-performance
cp .env.example .env
pip install -r requirements.txt
python rag_performance.py
Summary
This article implemented and measured four RAG performance optimizations:
- LLM response cache: cheapest and highest impact — one line of code, repeated questions go from 5057ms to 0.8ms (6000× speedup)
- Embedding cache: identical text never re-embedded; knowledge base updates only embed changed content (8 texts sent → 2 texts sent)
- Semantic Cache: conceptually correct, but threshold 0.85 produced 0/6 hits in this experiment — threshold calibration is non-optional; measure similarity distribution on real data before setting any value
- Async batch Embedding: 2.87× speedup for 12 texts; benefit grows with document count
The first three optimizations attack the same root problem: repeated computation is waste. The same work shouldn't cost twice. The fourth attacks a different problem: serial waiting is unnecessary. Work that can be parallelized shouldn't be queued.
Different problems, same goal: making RAG viable in production.