Every RAG query costs money: embedding + LLM tokens. I implemented three optimization layers that reduce real cost by 30-45%: FAQ matching (curated answers at zero cost), semantic cache (pgvector with similarity >= 0.95), and auto-FAQ generation from frequent unanswered queries. All with production code and a fallback that keeps the service running when the LLM budget runs out.
The Problem: Every Query Costs Real Money
In previous articles I built a RAG pipeline with hybrid search, reranking and streaming. It works well. But in production, the numbers add up fast:
Typical query: embedding + search + LLM (~800 tokens) ≈ $0.0003
1,000 queries/day × 30 days = 30,000 queries/month
30,000 × $0.0003 = ~$9/month per tenant
50 tenants = ~$450/month in LLM alone
The key insight: 35-40% of queries are repeated or very similar. Each one is money burned.
Architecture of the 3 Savings Layers
```
User query
     │
     ▼
┌─────────────┐
│  FAQ Match  │──→ If score ≥ 0.85: curated answer (LLM cost = $0)
└─────┬───────┘
      │ No match
      ▼
┌─────────────────┐
│ Semantic Cache  │──→ If similarity ≥ 0.95: cached response ($0)
└─────┬───────────┘
      │ Cache miss
      ▼
┌─────────────────┐
│  Budget Check   │──→ If budget exhausted: FAQ-only fallback
└─────┬───────────┘
      │ Budget OK
      ▼
Full RAG pipeline → fire-and-forget: store in cache
```
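The routing the diagram describes can be sketched as a single async dispatcher. This is a minimal, self-contained sketch — the stubs at the bottom stand in for the real `match_faq`, `lookup_cache`, `is_llm_budget_exhausted` and RAG pipeline shown later in the article:

```python
import asyncio

async def answer_query(query: str) -> dict:
    """Route a query through the three savings layers before paying for the LLM."""
    if (faq := await match_faq(query)) is not None:
        return {"source": "faq", **faq}        # curated answer, $0
    if (hit := await lookup_cache(query)) is not None:
        return {"source": "cache", **hit}      # semantically equivalent, $0
    if await is_llm_budget_exhausted():
        return {"source": "upgrade_required"}  # degraded, but not down
    answer = await run_rag_pipeline(query)     # the only path that costs money
    # Fire-and-forget: don't block the response on the cache write
    asyncio.get_running_loop().create_task(store_cache(query, answer))
    return {"source": "llm", **answer}

# Toy stubs so the sketch runs standalone
async def match_faq(q): return {"answer": "From FAQ"} if "password" in q else None
async def lookup_cache(q): return None
async def is_llm_budget_exhausted(): return False
async def run_rag_pipeline(q): return {"answer": "From LLM"}
async def store_cache(q, a): pass

print(asyncio.run(answer_query("how do I reset my password"))["source"])  # → faq
```

The ordering matters: the cheapest layers run first, and the budget check sits *after* FAQ matching so free answers survive an exhausted budget.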
Layer 1: FAQ Matching — The Most Profitable Investment
```python
async def match_faq(
    db: AsyncSession, query: str, kb_ids: list[UUID], threshold: float = 0.85,
) -> dict | None:
    query_embedding = embedding_service.embed_query(query)
    kb_ids_str = ",".join(f"'{str(kb_id)}'" for kb_id in kb_ids)
    stmt = text(f"""
        SELECT id, question, answer, attachments,
               1 - (embedding <=> CAST(:embedding AS vector)) AS score
        FROM faqs
        WHERE knowledge_base_id IN ({kb_ids_str})
          AND is_active = true AND embedding IS NOT NULL
        ORDER BY embedding <=> CAST(:embedding AS vector)
        LIMIT 1
    """)
    result = await db.execute(stmt, {"embedding": str(query_embedding)})
    row = result.mappings().first()
    if not row or float(row["score"]) < threshold:
        return None
    # Increment hit_count atomically
    await db.execute(
        update(FAQ).where(FAQ.id == row["id"]).values(hit_count=FAQ.hit_count + 1)
    )
    await db.commit()
    return {"faq_id": str(row["id"]), "question": row["question"],
            "answer": row["answer"], "score": float(row["score"])}
```
The model uses an HNSW index on the embedding so the search takes ~15ms:
```python
class FAQ(Base):
    __tablename__ = "faqs"

    embedding = mapped_column(Vector(384), nullable=True)
    hit_count = mapped_column(Integer, default=0, server_default="0")

    __table_args__ = (
        Index("ix_faqs_embedding", "embedding",
              postgresql_using="hnsw",
              postgresql_ops={"embedding": "vector_cosine_ops"},
              postgresql_with={"m": 16, "ef_construction": 64}),
    )
```
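For reference, the SQL that index definition translates to would look roughly like this — the parameters map one-to-one onto the `postgresql_with` options above:

```sql
-- Approximate DDL produced by the model definition above
CREATE INDEX ix_faqs_embedding ON faqs
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```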
Threshold Tuning: From 0.75 to 0.85
Started at 0.75. Too low — got false positives where "How do I configure the widget?" matched "What is a widget?". Raised to 0.85 and false positives disappeared:
| Query | Stored FAQ | Score | Match? |
|---|---|---|---|
| "how do I reset my password" | "How do I change my password?" | 0.91 | Yes |
| "do you accept visa?" | "What payment methods do you accept?" | 0.87 | Yes |
| "how do I configure the widget" | "What is a widget?" | 0.72 | No |
When there's a match, the full pipeline doesn't execute:
```python
if faq_match:
    yield {"event": "faq_match", "data": json.dumps(faq_match)}
    # FAQ matches are free — don't increment billing counters
    fire_and_forget_log_usage(
        query_type="faq", response_tokens=0, provider=None, ...
    )
```
`response_tokens=0` and `query_type="faq"` are what the savings metrics later in the article are built on.
Layer 2: Semantic Cache — pgvector as Intelligent Cache
The Concept
If someone asks "How do I reset my password?" and 2 hours ago another asked "How can I change my password?", the answer is the same. Semantic cache detects that equivalence by comparing embeddings, not exact strings.
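pgvector's `<=>` operator returns cosine distance, so `1 - (a <=> b)` is cosine similarity — the number both the cache and FAQ matching threshold against. A minimal pure-Python illustration of the metric:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Same quantity as pgvector's 1 - (a <=> b), for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Same direction → 1.0 (length doesn't matter); orthogonal → 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Because the metric ignores vector length, two paraphrases of the same question land close together even if one is much wordier than the other.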
The model stores the query embedding, response, sources, and a hash of the RAG configuration:
```python
class ResponseCache(Base):
    __tablename__ = "response_cache"

    query_embedding = mapped_column(Vector(384), nullable=False)
    query_text = mapped_column(String(500), nullable=False)
    response_text = mapped_column(Text, nullable=False)
    sources = mapped_column(JSONB, nullable=False, default=list)
    confidence = mapped_column(Float, nullable=False)
    knowledge_base_ids = mapped_column(ARRAY(UUID(as_uuid=True)), nullable=False)
    rag_config_hash = mapped_column(String(64), nullable=False)
    hit_count = mapped_column(Integer, default=0, server_default="0")
    expires_at = mapped_column(DateTime(timezone=True), nullable=False)
```
Config Hash: Scoping by Configuration
If the admin changes RAG parameters, cached answers are no longer valid:
```python
import hashlib
import json

def compute_config_hash(retrieval_config, language: str | None = None) -> str:
    key_fields = {
        "candidate_k": retrieval_config.candidate_k,
        "rerank_top_n": retrieval_config.rerank_top_n,
        "top_k": retrieval_config.top_k,
        "lambda_param": retrieval_config.lambda_param,
        "max_per_doc": retrieval_config.max_per_doc,
        "bm25_weight": retrieval_config.bm25_weight,
        "language": language or "es",
    }
    return hashlib.sha256(json.dumps(key_fields, sort_keys=True).encode()).hexdigest()
```
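A standalone toy version of the same idea (plain dicts instead of the real `retrieval_config` object) shows the two properties the hash needs: stability under key order — that's what `sort_keys=True` buys — and sensitivity to any parameter change:

```python
import hashlib
import json

def config_hash(fields: dict) -> str:
    """Deterministic fingerprint of a config dict."""
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()

a = {"top_k": 5, "bm25_weight": 0.4, "language": "es"}
b = {"language": "es", "bm25_weight": 0.4, "top_k": 5}  # same config, different order
c = {"top_k": 8, "bm25_weight": 0.4, "language": "es"}  # admin changed top_k

assert config_hash(a) == config_hash(b)  # key order doesn't matter
assert config_hash(a) != config_hash(c)  # any change orphans old cache entries
```

Entries written under the old hash simply stop matching — no explicit invalidation needed when the admin tweaks retrieval parameters.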
Lookup: Similarity >= 0.95
```python
CACHE_SIMILARITY_THRESHOLD = 0.95

async def lookup_cache(db, query_embedding, kb_ids, rag_config_hash, tenant_id):
    kb_array = "ARRAY[" + ",".join(f"'{kb_id}'::uuid" for kb_id in kb_ids) + "]"
    stmt = text(f"""
        SELECT id, response_text, sources, confidence,
               1 - (query_embedding <=> CAST(:embedding AS vector)) AS similarity
        FROM response_cache
        WHERE tenant_id = :tid AND expires_at > now()
          AND rag_config_hash = :config_hash
          AND knowledge_base_ids @> {kb_array}
        ORDER BY query_embedding <=> CAST(:embedding AS vector)
        LIMIT 1
    """)
    result = await db.execute(stmt, {
        "embedding": str(query_embedding),
        "tid": str(tenant_id),
        "config_hash": rag_config_hash,
    })
    row = result.mappings().first()
    if not row or float(row["similarity"]) < CACHE_SIMILARITY_THRESHOLD:
        return None
    await db.execute(
        text("UPDATE response_cache SET hit_count = hit_count + 1 WHERE id = :cid"),
        {"cid": str(row["id"])},
    )
    return {"response_text": row["response_text"], "sources": row["sources"],
            "confidence": float(row["confidence"])}
```
Why 0.95? At 0.90 I had incorrect cache hits:
| Query A | Query B | Similarity | Same answer? |
|---|---|---|---|
| "How do I export to CSV" | "How do I download data as CSV" | 0.97 | Yes |
| "How do I configure the webhook" | "How do I test the webhook" | 0.92 | No |
| "How do I add users" | "How do I delete users" | 0.91 | No |
Fire-and-Forget Storage
You can't block the SSE stream while writing to the cache:
```python
def fire_and_forget_store_cache(tenant_id, query_embedding, query_text,
                                kb_ids, rag_config_hash, response_text,
                                sources, confidence, ttl_hours=168):
    async def _store():
        try:
            async with async_session() as db:
                await store_cache(db, tenant_id, query_embedding, ...)
        except Exception as e:
            logger.error("Failed to store cache: %s", e)

    try:
        loop = asyncio.get_running_loop()
        loop.create_task(_store())
    except RuntimeError:
        pass  # No event loop (tests) — skip
```
And the condition for caching:
```python
if rag_config.retrieval.cache_enabled and kb_ids and not low_confidence:
    fire_and_forget_store_cache(...)
```
`not low_confidence`: we don't cache bad answers.
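The pattern is easy to verify with a toy in-memory store — this sketch uses hypothetical names and no database, but shows the key property: the response "returns" immediately and the write lands on a later event-loop tick:

```python
import asyncio

CACHE: dict[str, str] = {}

async def slow_store(query: str, response: str) -> None:
    await asyncio.sleep(0.01)  # simulates the INSERT latency
    CACHE[query] = response

def fire_and_forget(query: str, response: str) -> None:
    try:
        asyncio.get_running_loop().create_task(slow_store(query, response))
    except RuntimeError:
        pass  # no event loop (tests) — skip silently

async def handle_request() -> str:
    fire_and_forget("q1", "answer")
    return "answer"  # returns before the store finishes

async def main() -> None:
    result = await handle_request()
    assert result == "answer" and "q1" not in CACHE  # not stored yet
    await asyncio.sleep(0.05)                        # let the background task run
    assert CACHE["q1"] == "answer"                   # stored after the response

asyncio.run(main())
```

The trade-off is that a crashed write is silently lost — acceptable for a cache, where the worst case is paying for the same answer once more.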
Proactive Invalidation: The Cache That Doesn't Lie
```python
async def invalidate_kb_cache(db: AsyncSession, kb_id: UUID) -> int:
    result = await db.execute(
        text("DELETE FROM response_cache WHERE CAST(:kb_id AS UUID) = ANY(knowledge_base_ids)"),
        {"kb_id": str(kb_id)},
    )
    await db.commit()
    return result.rowcount
```
Called automatically on every operation that changes KB content:
```python
# In faq_service.py — when creating, updating or importing FAQs
async def create_faq(db, kb_id, data):
    faq = FAQ(knowledge_base_id=kb_id, ...)
    db.add(faq)
    await db.commit()
    await invalidate_kb_cache(db, kb_id)  # Cache dies
    return faq
```
Same happens when uploading or reprocessing documents. Simple rule: if a KB's content changed, that KB's cache dies.
Auto-FAQ: Turn Frequent Questions into Real FAQs
The system records queries with low confidence. When one appears multiple times, the admin can auto-generate a FAQ:
```python
async def generate_faq_suggestion(db, unanswered_query_id, kb_id, tenant_id):
    uq = await db.get(UnansweredQuery, unanswered_query_id)
    query_embedding = embedding_service.embed_query(uq.query_text)

    # Dedup: skip if a pending suggestion or a similar FAQ (>= 0.85) already exists
    result = await db.execute(text("""
        SELECT 1 - (embedding <=> CAST(:embedding AS vector)) AS score
        FROM faqs WHERE knowledge_base_id = :kb_id AND is_active = true
        ORDER BY embedding <=> CAST(:embedding AS vector) LIMIT 1
    """), ...)
    faq_row = result.mappings().first()
    if faq_row and float(faq_row["score"]) >= 0.85:
        return None  # Similar FAQ already exists

    # Search KB for context, then call LLM
    sources = await search_chunks(db, uq.query_text, [kb_id], ...)
    result = await generate_playground_response(
        query=uq.query_text, sources=sources[:5],
        temperature=0.2, max_tokens=512,
        system_prompt="Answer ONLY with information from the context. "
                      "2-4 sentences. If insufficient info: NO_ANSWER",
    )
    if not result.get("response") or result["response"].strip() == "NO_ANSWER":
        return None

    return SuggestedFAQ(
        tenant_id=tenant_id, knowledge_base_id=kb_id,
        question=uq.query_text, generated_answer=result["response"].strip(),
        embedding=query_embedding, status="pending",
    )
```
Batch generation prioritizes by frequency:
```python
async def batch_generate_suggestions(db, tenant_id, kb_id, limit=5):
    stmt = text("""
        SELECT uq.id FROM unanswered_queries uq
        WHERE uq.tenant_id = :tid AND uq.resolved = false
          AND NOT EXISTS (
              SELECT 1 FROM suggested_faqs sf
              WHERE sf.source_query_id = uq.id AND sf.status = 'pending'
          )
        ORDER BY uq.occurrence_count DESC LIMIT :lim
    """)
    # Generate suggestions for each...
```
On approval, the real FAQ is created, cache is invalidated, and the query is marked as resolved. Closed loop.
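The approval step itself is just glue. Here's an illustrative in-memory version of the three actions — the names are hypothetical, the real versions live in the services shown above:

```python
def approve_suggestion(suggestion: dict, faqs: list, cache: dict, unanswered: dict) -> dict:
    """Promote a pending suggestion to a real FAQ: create, invalidate, resolve."""
    faq = {"question": suggestion["question"],
           "answer": suggestion["generated_answer"],
           "is_active": True}
    faqs.append(faq)                                              # 1. real FAQ created
    cache.clear()                                                 # 2. KB cache invalidated
    unanswered[suggestion["source_query_id"]]["resolved"] = True  # 3. query resolved
    suggestion["status"] = "approved"
    return faq

faqs, cache = [], {"stale": "entry"}
unanswered = {"uq-1": {"resolved": False}}
sugg = {"source_query_id": "uq-1", "question": "How do I export to CSV?",
        "generated_answer": "Use Settings → Export.", "status": "pending"}

approve_suggestion(sugg, faqs, cache, unanswered)
assert cache == {} and unanswered["uq-1"]["resolved"]
```

The cache clear is the easy step to forget — without it, the old "I don't know" answer can keep being served from cache even after the FAQ exists.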
FAQ-Only Fallback: Degraded Service Without Cutoff
Free plan: 50 AI queries/month. When exhausted, FAQs keep working:
```python
async def is_llm_budget_exhausted(db, tenant_id) -> bool:
    """Soft check — enables FAQ-only fallback, does NOT raise."""
    plan, tenant = await _get_plan_and_tenant(db, tenant_id)
    if not plan:
        return False
    usage = await get_monthly_usage(db, tenant_id)
    if plan.max_messages_month != -1 and usage["messages_month"] >= plan.max_messages_month:
        return True
    return False
```
In the main flow:
```python
faq_only_mode = await is_llm_budget_exhausted(db, tenant_id)

# FAQ matching always works
if faq_match:
    return faq_response(faq_match)  # Free

# No match + no budget → upgrade card
if faq_only_mode:
    yield {"event": "upgrade_required", "data": json.dumps({
        "message": "You've reached your AI query limit. "
                   "FAQs remain available at no cost."
    })}
```
Cost Metrics
```python
async def _get_cost_metrics(db, tenant_id, cutoff):
    # Tokens by provider
    # ...configurable prices from settings...
    groq_price = await settings_service.get(db, "cost.groq.price_per_1k_tokens", 0.00027)

    # FAQ savings estimation
    avg_llm_tokens = total_tokens / max(llm_query_count, 1)
    estimated_faq_tokens_saved = int(faq_count * avg_llm_tokens)
    estimated_faq_cost_saved = estimated_faq_tokens_saved / 1000 * groq_price

    return {
        "by_provider": by_provider,
        "total_estimated_cost": round(total_cost, 4),
        "faq_savings": {
            "queries_without_llm": faq_count,
            "estimated_tokens_saved": estimated_faq_tokens_saved,
            "estimated_cost_saved": round(estimated_faq_cost_saved, 4),
        },
        "avg_cost_per_conversation": round(total_cost / conv_count, 6),
    }
```
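The savings estimate is deliberately simple: each FAQ hit is assumed to have cost what the average LLM query costs. The arithmetic, extracted as a standalone function (the price is the illustrative default from the snippet above):

```python
def faq_savings(faq_count: int, llm_query_count: int, total_tokens: int,
                price_per_1k: float = 0.00027) -> tuple[int, float]:
    """Estimate tokens and dollars saved by FAQ hits that skipped the LLM."""
    avg_llm_tokens = total_tokens / max(llm_query_count, 1)
    tokens_saved = int(faq_count * avg_llm_tokens)
    return tokens_saved, round(tokens_saved / 1000 * price_per_1k, 4)

# 6,000 LLM queries at ~800 tokens each; 2,000 more answered by FAQ instead
tokens, cost = faq_savings(faq_count=2000, llm_query_count=6000, total_tokens=4_800_000)
print(tokens, cost)  # → 1600000 0.432
```

It's an estimate, not an invoice — but it's directionally right, and it's the number that justifies the time spent curating FAQs.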
Cache hit rate as the 7th stat card in the dashboard:
```python
async def get_cache_stats(db, tenant_id):
    rate_result = await db.execute(text("""
        SELECT
            COUNT(*) FILTER (WHERE query_type = 'cached') AS cached_queries,
            COUNT(*) FILTER (WHERE query_type IN ('chat','widget','cached')) AS total_queries
        FROM usage_logs WHERE tenant_id = :tid
    """), {"tid": str(tenant_id)})
    row = rate_result.mappings().first()
    cached, total = row["cached_queries"], max(row["total_queries"], 1)
    return {"hit_rate": round(cached / total, 3), ...}
```
The Full Flow in the Chat Endpoint
To see how the 3 layers fit together, here's the real order in chat.py:
```python
# 1. Budget check (soft — doesn't raise)
faq_only_mode = await is_llm_budget_exhausted(db, tenant_id)

# 2. FAQ match (always works, even in faq_only_mode)
faq_match = await match_faq(db, data.content, kb_ids, threshold=0.85)
if faq_match:
    return EventSourceResponse(faq_event_generator())  # $0

# 3. Budget exhausted + no FAQ match → upgrade card
if faq_only_mode:
    yield {"event": "upgrade_required", ...}
    return

# 4. Semantic cache lookup
query_embedding = embedding_service.embed_query(effective_query)
config_hash = compute_config_hash(rag_config.retrieval, detected_lang)
cache_hit = await lookup_cache(db, query_embedding, kb_ids, config_hash, tenant_id)
if cache_hit:
    return EventSourceResponse(cache_event_generator())  # $0

# 5. Full RAG pipeline (vector + BM25 + rerank + LLM)
# ... streaming response ...

# 6. Fire-and-forget: store in cache for next time
if not low_confidence:
    fire_and_forget_store_cache(...)
```
Real Numbers
| Metric | Value |
|---|---|
| FAQ threshold | 0.85 |
| Cache threshold | 0.95 |
| Default cache TTL | 7 days |
| FAQ match latency | ~15ms |
| Cache lookup latency | ~20ms |
| Full pipeline latency | ~2-4s |
| % queries resolved by FAQ | 15-25% |
| % queries resolved by cache | 10-20% |
| % queries reaching the LLM | 55-75% |
| Estimated total savings | 30-45% |
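Those percentages compose directly into the savings figure. A quick sanity check using the mid-range hit rates from the table and the ~$0.0003 per-query cost from the intro (the local embedding cost of FAQ/cache lookups is treated as negligible):

```python
FULL_QUERY_COST = 0.0003  # embedding + search + LLM, from the intro

def monthly_cost(queries: int, faq_rate: float, cache_rate: float) -> float:
    """FAQ and cache hits cost ~$0; only the remainder reaches the LLM."""
    llm_queries = queries * (1 - faq_rate - cache_rate)
    return llm_queries * FULL_QUERY_COST

baseline = monthly_cost(30_000, 0.0, 0.0)     # no savings layers
optimized = monthly_cost(30_000, 0.20, 0.15)  # mid-range hit rates from the table
savings = 1 - optimized / baseline

print(f"${baseline:.2f} -> ${optimized:.2f} ({savings:.0%} saved)")
```

A 20% FAQ rate plus a 15% cache rate yields 35% savings — squarely inside the 30-45% range above, with the low and high ends of the table's hit rates bracketing it.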
Lessons Learned
1. FAQs Are the Best Investment
Not sexy. It's a table with questions and answers. But every FAQ is a perfect answer served in 15ms at zero cost, forever.
2. Cache Threshold Must Be Very High
At 0.90 I had false positives that served incorrect answers. 0.95 is the safe minimum.
3. Fire-and-Forget Is Mandatory for Cache Storage
First attempt was synchronous. If the INSERT takes long, the last streaming token is delayed. Fire-and-forget eliminates that latency.
4. Proactive Invalidation Is Worth It
It's tempting to let cache expire on its own (TTL). But the admin who uploads a document expects updated answers immediately.
5. Auto-FAQ Closes the Loop
Without auto-FAQ, knowledge gaps accumulate. With auto-FAQ, in 2 clicks the frequent question goes from gap to active FAQ.
6. FAQ-Only Fallback > Cutting Off Service
Free-tier users still ask questions that match FAQs. Reduces churn and provides real incentive to upgrade.
7. LLM Prices Must Be Configurable
Hardcoding $0.00027/1K tokens is a trap. Prices change. Storing them in a settings table allows adjustments without a deploy.
Conclusion
Optimizing RAG costs is a problem of avoiding unnecessary work. The three layers (FAQ, cache, auto-FAQ) attack the same principle: if the answer already exists, don't pay to generate it again.
The interesting part: they also improve the experience. FAQ match in 15ms vs 2-4s for the full pipeline. Cache hit in 20ms with the same quality. And pgvector was already in the stack — no need for Redis or Elasticsearch.
This is the fifth article in the series. If it was useful, a like helps it reach more people. Questions about semantic cache or FAQ matching? Drop a comment.