Martin Palopoli
Semantic Cache and FAQ Matching: How I Cut LLM Costs by 40% in My RAG Engine

Every RAG query costs money: embedding + LLM tokens. I implemented three optimization layers that reduce real cost by 30-45%: FAQ matching (curated answers at zero cost), semantic cache (pgvector with similarity >= 0.95), and auto-FAQ generation from frequent unanswered queries. All with production code and a fallback that keeps the service running when the LLM budget runs out.


The Problem: Every Query Costs Real Money

In previous articles I built a RAG pipeline with hybrid search, reranking, and streaming. It works well. But in production the costs add up:

Typical query:  embedding + search + LLM (~800 tokens) ≈ $0.0003
1,000 queries/day × 30 days = 30,000 queries/month
30,000 × $0.0003 = ~$9/month per tenant
50 tenants = ~$450/month in LLM alone

The key insight: 35-40% of queries are repeated or very similar. Each one is money burned.
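The arithmetic above can be sanity-checked in a few lines (the per-query price and volumes are the article's estimates, not measurements):

```python
# Back-of-the-envelope version of the cost math above.
COST_PER_QUERY = 0.0003   # embedding + search + ~800 LLM tokens
QUERIES_PER_DAY = 1_000
TENANTS = 50

monthly_queries = QUERIES_PER_DAY * 30              # 30,000
cost_per_tenant = monthly_queries * COST_PER_QUERY  # ~$9
total_cost = cost_per_tenant * TENANTS              # ~$450

# If 35-40% of queries are repeats, that slice of spend is recoverable:
low, high = total_cost * 0.35, total_cost * 0.40
print(f"${total_cost:.0f}/month total, ${low:.0f}-${high:.0f} recoverable")
```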


Architecture of the 3 Savings Layers

User query
     │
     ▼
┌─────────────┐
│  FAQ Match  │──→ If score ≥ 0.85: curated answer (LLM cost = $0)
└─────┬───────┘
      │ No match
      ▼
┌─────────────────┐
│ Semantic Cache  │──→ If similarity ≥ 0.95: cached response ($0)
└─────┬───────────┘
      │ Cache miss
      ▼
┌─────────────────┐
│ Budget Check    │──→ If budget exhausted: FAQ-only fallback
└─────┬───────────┘
      │ Budget OK
      ▼
  Full RAG pipeline → fire-and-forget: store in cache

Layer 1: FAQ Matching — The Most Profitable Investment

async def match_faq(
    db: AsyncSession, query: str, kb_ids: list[UUID], threshold: float = 0.85,
) -> dict | None:
    query_embedding = embedding_service.embed_query(query)
    # kb_ids are server-validated UUIDs, so interpolating them is safe here
    kb_ids_str = ",".join(f"'{kb_id}'" for kb_id in kb_ids)

    stmt = text(f"""
        SELECT id, question, answer, attachments,
               1 - (embedding <=> CAST(:embedding AS vector)) as score
        FROM faqs
        WHERE knowledge_base_id IN ({kb_ids_str})
          AND is_active = true AND embedding IS NOT NULL
        ORDER BY embedding <=> CAST(:embedding AS vector)
        LIMIT 1
    """)
    result = await db.execute(stmt, {"embedding": str(query_embedding)})
    row = result.mappings().first()

    if not row or float(row["score"]) < threshold:
        return None

    # Increment hit_count atomically
    await db.execute(
        update(FAQ).where(FAQ.id == row["id"]).values(hit_count=FAQ.hit_count + 1)
    )
    await db.commit()
    return {"faq_id": str(row["id"]), "question": row["question"],
            "answer": row["answer"], "score": float(row["score"])}

The model uses an HNSW index on the embedding so the search takes ~15ms:

class FAQ(Base):
    __tablename__ = "faqs"
    embedding = mapped_column(Vector(384), nullable=True)
    hit_count = mapped_column(Integer, default=0, server_default="0")

    __table_args__ = (
        Index("ix_faqs_embedding", "embedding",
              postgresql_using="hnsw",
              postgresql_ops={"embedding": "vector_cosine_ops"},
              postgresql_with={"m": 16, "ef_construction": 64}),
    )

Threshold Tuning: From 0.75 to 0.85

Started at 0.75. Too low — got false positives where "How do I configure the widget?" matched "What is a widget?". Raised to 0.85 and false positives disappeared:

| Query | Stored FAQ | Score | Match? |
|---|---|---|---|
| "how do I reset my password" | "How do I change my password?" | 0.91 | Yes |
| "do you accept visa?" | "What payment methods do you accept?" | 0.87 | Yes |
| "how do I configure the widget" | "What is a widget?" | 0.72 | No |

When there's a match, the full pipeline doesn't execute:

if faq_match:
    yield {"event": "faq_match", "data": json.dumps(faq_match)}
    # FAQ matches are free — don't increment billing counters
    fire_and_forget_log_usage(
        query_type="faq", response_tokens=0, provider=None, ...
    )

Logging response_tokens=0 with query_type="faq" is essential for the savings metrics later on.


Layer 2: Semantic Cache — pgvector as Intelligent Cache

The Concept

If someone asks "How do I reset my password?" and two hours earlier another user asked "How can I change my password?", the answer is the same. A semantic cache detects that equivalence by comparing embeddings, not exact strings.
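The idea can be shown with plain cosine similarity, the metric underlying pgvector's <=> operator (which returns the distance, i.e. 1 - similarity). The toy 3-dim vectors below are illustrative stand-ins for real 384-dim embeddings:

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two paraphrases point in nearly the same direction; an unrelated query doesn't.
reset_pw   = [0.90, 0.10, 0.05]
change_pw  = [0.88, 0.12, 0.07]
export_csv = [0.10, 0.90, 0.20]

print(cosine_similarity(reset_pw, change_pw))   # close to 1.0 → cache hit
print(cosine_similarity(reset_pw, export_csv))  # much lower  → cache miss
```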

The model stores the query embedding, response, sources, and a hash of the RAG configuration:

class ResponseCache(Base):
    __tablename__ = "response_cache"

    query_embedding = mapped_column(Vector(384), nullable=False)
    query_text = mapped_column(String(500), nullable=False)
    response_text = mapped_column(Text, nullable=False)
    sources = mapped_column(JSONB, nullable=False, default=list)
    confidence = mapped_column(Float, nullable=False)
    knowledge_base_ids = mapped_column(ARRAY(UUID(as_uuid=True)), nullable=False)
    rag_config_hash = mapped_column(String(64), nullable=False)
    hit_count = mapped_column(Integer, default=0, server_default="0")
    expires_at = mapped_column(DateTime(timezone=True), nullable=False)

Config Hash: Scoping by Configuration

If the admin changes RAG parameters, cached answers are no longer valid:

def compute_config_hash(retrieval_config, language: str | None = None) -> str:
    key_fields = {
        "candidate_k": retrieval_config.candidate_k,
        "rerank_top_n": retrieval_config.rerank_top_n,
        "top_k": retrieval_config.top_k,
        "lambda_param": retrieval_config.lambda_param,
        "max_per_doc": retrieval_config.max_per_doc,
        "bm25_weight": retrieval_config.bm25_weight,
        "language": language or "es",
    }
    return hashlib.sha256(json.dumps(key_fields, sort_keys=True).encode()).hexdigest()
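A standalone sketch of the scoping behavior, using a plain dict as a stand-in for the retrieval_config object: the moment any parameter changes, the hash changes, and every old cache row silently stops matching.

```python
import hashlib
import json

def compute_config_hash(config: dict, language: str = "es") -> str:
    # Simplified dict-based version of the function above, for illustration.
    key_fields = {**config, "language": language}
    return hashlib.sha256(
        json.dumps(key_fields, sort_keys=True).encode()
    ).hexdigest()

base = {"candidate_k": 50, "rerank_top_n": 10, "top_k": 5,
        "lambda_param": 0.5, "max_per_doc": 3, "bm25_weight": 0.3}

h1 = compute_config_hash(base)
h2 = compute_config_hash({**base, "top_k": 8})  # admin bumps top_k

print(h1 == h2)  # False — old cache entries no longer match
```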

Lookup: Similarity >= 0.95

CACHE_SIMILARITY_THRESHOLD = 0.95

async def lookup_cache(db, query_embedding, kb_ids, rag_config_hash, tenant_id):
    # Postgres array literal for the @> (contains) check; kb_ids are UUIDs
    kb_array = "ARRAY[" + ",".join(f"'{kb_id}'::uuid" for kb_id in kb_ids) + "]"
    stmt = text(f"""
        SELECT id, response_text, sources, confidence,
               1 - (query_embedding <=> CAST(:embedding AS vector)) AS similarity
        FROM response_cache
        WHERE tenant_id = :tid AND expires_at > now()
          AND rag_config_hash = :config_hash
          AND knowledge_base_ids @> {kb_array}
        ORDER BY query_embedding <=> CAST(:embedding AS vector)
        LIMIT 1
    """)
    result = await db.execute(stmt, {
        "embedding": str(query_embedding),
        "tid": str(tenant_id),
        "config_hash": rag_config_hash,
    })
    row = result.mappings().first()
    if not row or float(row["similarity"]) < CACHE_SIMILARITY_THRESHOLD:
        return None

    await db.execute(
        text("UPDATE response_cache SET hit_count = hit_count + 1 WHERE id = :cid"),
        {"cid": str(row["id"])},
    )
    return {"response_text": row["response_text"], "sources": row["sources"],
            "confidence": float(row["confidence"])}

Why 0.95? At 0.90 I had incorrect cache hits:

| Query A | Query B | Similarity | Same answer? |
|---|---|---|---|
| "How do I export to CSV" | "How do I download data as CSV" | 0.97 | Yes |
| "How do I configure the webhook" | "How do I test the webhook" | 0.92 | No |
| "How do I add users" | "How do I delete users" | 0.91 | No |

Fire-and-Forget Storage

You can't block the SSE stream while saving to cache:

def fire_and_forget_store_cache(tenant_id, query_embedding, query_text,
                                 kb_ids, rag_config_hash, response_text,
                                 sources, confidence, ttl_hours=168):
    async def _store():
        try:
            async with async_session() as db:
                await store_cache(db, tenant_id, query_embedding, ...)
        except Exception as e:
            logger.error("Failed to store cache: %s", e)

    try:
        loop = asyncio.get_running_loop()
        loop.create_task(_store())
    except RuntimeError:
        pass  # No event loop (tests) — skip

And the condition for caching:

if rag_config.retrieval.cache_enabled and kb_ids and not low_confidence:
    fire_and_forget_store_cache(...)

not low_confidence: we don't cache bad answers.
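A hypothetical sketch of the row that store_cache ends up writing (field names follow the ResponseCache model above; ttl_hours=168 is the 7-day default, and the cache_row helper is mine, not the article's):

```python
from datetime import datetime, timedelta, timezone

def cache_row(query_text: str, response_text: str, confidence: float,
              ttl_hours: int = 168) -> dict:
    # Illustrative stand-in for what store_cache persists.
    now = datetime.now(timezone.utc)
    return {
        "query_text": query_text[:500],  # column is String(500)
        "response_text": response_text,
        "confidence": confidence,
        "hit_count": 0,
        "expires_at": now + timedelta(hours=ttl_hours),  # 7-day TTL
    }

row = cache_row("how do I reset my password", "Go to Settings → Security.", 0.9)
print(row["expires_at"])  # now + 7 days
```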


Proactive Invalidation: The Cache That Doesn't Lie

async def invalidate_kb_cache(db: AsyncSession, kb_id: UUID) -> int:
    result = await db.execute(
        text("DELETE FROM response_cache WHERE CAST(:kb_id AS UUID) = ANY(knowledge_base_ids)"),
        {"kb_id": str(kb_id)},
    )
    await db.commit()
    return result.rowcount

Called automatically on every operation that changes KB content:

# In faq_service.py — when creating, updating or importing FAQs
async def create_faq(db, kb_id, data):
    faq = FAQ(knowledge_base_id=kb_id, ...)
    db.add(faq)
    await db.commit()
    await invalidate_kb_cache(db, kb_id)  # Cache dies
    return faq

Same happens when uploading or reprocessing documents. Simple rule: if a KB's content changed, that KB's cache dies.


Auto-FAQ: Turn Frequent Questions into Real FAQs

The system records queries with low confidence. When one appears multiple times, the admin can auto-generate a FAQ:

async def generate_faq_suggestion(db, unanswered_query_id, kb_id, tenant_id):
    uq = await db.get(UnansweredQuery, unanswered_query_id)

    query_embedding = embedding_service.embed_query(uq.query_text)

    # Dedup: skip if a pending suggestion or a similar FAQ (>= 0.85) exists
    similar_faq = await db.execute(text("""
        SELECT 1 - (embedding <=> CAST(:embedding AS vector)) AS score
        FROM faqs WHERE knowledge_base_id = :kb_id AND is_active = true
        ORDER BY embedding <=> CAST(:embedding AS vector) LIMIT 1
    """), {"embedding": str(query_embedding), "kb_id": str(kb_id)})
    faq_row = similar_faq.mappings().first()
    if faq_row and float(faq_row["score"]) >= 0.85:
        return None  # Similar FAQ already exists

    # Search KB for context, then call LLM
    sources = await search_chunks(db, uq.query_text, [kb_id], ...)
    result = await generate_playground_response(
        query=uq.query_text, sources=sources[:5],
        temperature=0.2, max_tokens=512,
        system_prompt="Answer ONLY with information from the context. "
                      "2-4 sentences. If insufficient info: NO_ANSWER"
    )
    if not result.get("response") or result["response"].strip() == "NO_ANSWER":
        return None

    return SuggestedFAQ(
        tenant_id=tenant_id, knowledge_base_id=kb_id,
        question=uq.query_text, generated_answer=result["response"].strip(),
        embedding=query_embedding, status="pending",
    )

Batch generation prioritizes by frequency:

async def batch_generate_suggestions(db, tenant_id, kb_id, limit=5):
    stmt = text("""
        SELECT uq.id FROM unanswered_queries uq
        WHERE uq.tenant_id = :tid AND uq.resolved = false
          AND NOT EXISTS (
              SELECT 1 FROM suggested_faqs sf
              WHERE sf.source_query_id = uq.id AND sf.status = 'pending'
          )
        ORDER BY uq.occurrence_count DESC LIMIT :lim
    """)
    # Generate suggestions for each...

On approval, the real FAQ is created, cache is invalidated, and the query is marked as resolved. Closed loop.


FAQ-Only Fallback: Degraded Service Without Cutoff

Free plan: 50 AI queries/month. When exhausted, FAQs keep working:

async def is_llm_budget_exhausted(db, tenant_id) -> bool:
    """Soft check — enables FAQ-only fallback, does NOT raise."""
    plan, tenant = await _get_plan_and_tenant(db, tenant_id)
    if not plan:
        return False
    usage = await get_monthly_usage(db, tenant_id)
    if plan.max_messages_month != -1 and usage["messages_month"] >= plan.max_messages_month:
        return True
    return False
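The -1 sentinel on a plan means "unlimited". The rule in isolation, as a minimal sketch:

```python
# Standalone version of the check above: -1 means the plan has no cap.
def budget_exhausted(max_messages_month: int, used: int) -> bool:
    return max_messages_month != -1 and used >= max_messages_month

print(budget_exhausted(50, 49))       # False — one query left
print(budget_exhausted(50, 50))       # True  — FAQ-only mode kicks in
print(budget_exhausted(-1, 10_000))   # False — unlimited plan
```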

In the main flow:

faq_only_mode = await is_llm_budget_exhausted(db, tenant_id)

# FAQ matching always works
if faq_match:
    return faq_response(faq_match)  # Free

# No match + no budget → upgrade card
if faq_only_mode:
    yield {"event": "upgrade_required", "data": json.dumps({
        "message": "You've reached your AI query limit. "
                   "FAQs remain available at no cost."
    })}

Cost Metrics

async def _get_cost_metrics(db, tenant_id, cutoff):
    # Tokens by provider
    # ...configurable prices from settings...
    groq_price = await settings_service.get(db, "cost.groq.price_per_1k_tokens", 0.00027)

    # FAQ savings estimation
    avg_llm_tokens = total_tokens / max(llm_query_count, 1)
    estimated_faq_tokens_saved = int(faq_count * avg_llm_tokens)
    estimated_faq_cost_saved = estimated_faq_tokens_saved / 1000 * groq_price

    return {
        "by_provider": by_provider,
        "total_estimated_cost": round(total_cost, 4),
        "faq_savings": {
            "queries_without_llm": faq_count,
            "estimated_tokens_saved": estimated_faq_tokens_saved,
            "estimated_cost_saved": round(estimated_faq_cost_saved, 4),
        },
        "avg_cost_per_conversation": round(total_cost / conv_count, 6),
    }

Cache hit rate as the 7th stat card in the dashboard:

async def get_cache_stats(db, tenant_id):
    rate_result = await db.execute(text("""
        SELECT
            COUNT(*) FILTER (WHERE query_type = 'cached') AS cached_queries,
            COUNT(*) FILTER (WHERE query_type IN ('chat','widget','cached')) AS total_queries
        FROM usage_logs WHERE tenant_id = :tid
    """), {"tid": str(tenant_id)})
    row = rate_result.mappings().first()
    cached, total = row["cached_queries"], max(row["total_queries"], 1)
    return {"hit_rate": round(cached / total, 3), ...}

The Full Flow in the Chat Endpoint

To see how the 3 layers fit together, here's the real order in chat.py:

# 1. Budget check (soft — doesn't raise)
faq_only_mode = await is_llm_budget_exhausted(db, tenant_id)

# 2. FAQ match (always works, even in faq_only_mode)
faq_match = await match_faq(db, data.content, kb_ids, threshold=0.85)
if faq_match:
    return EventSourceResponse(faq_event_generator())  # $0

# 3. Budget exhausted + no FAQ match → upgrade card
if faq_only_mode:
    yield {"event": "upgrade_required", ...}
    return

# 4. Semantic cache lookup
query_embedding = embedding_service.embed_query(effective_query)
config_hash = compute_config_hash(rag_config.retrieval, detected_lang)
cache_hit = await lookup_cache(db, query_embedding, kb_ids, config_hash, tenant_id)
if cache_hit:
    return EventSourceResponse(cache_event_generator())  # $0

# 5. Full RAG pipeline (vector + BM25 + rerank + LLM)
# ... streaming response ...

# 6. Fire-and-forget: store in cache for next time
if not low_confidence:
    fire_and_forget_store_cache(...)

Real Numbers

| Metric | Value |
|---|---|
| FAQ threshold | 0.85 |
| Cache threshold | 0.95 |
| Default cache TTL | 7 days |
| FAQ match latency | ~15ms |
| Cache lookup latency | ~20ms |
| Full pipeline latency | ~2-4s |
| % queries resolved by FAQ | 15-25% |
| % queries resolved by cache | 10-20% |
| % queries reaching the LLM | 55-75% |
| Estimated total savings | 30-45% |
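The last few rows are consistent with each other; a quick check using the midpoints of each range (illustrative, not measured):

```python
# Midpoints of the ranges in the table above.
faq_rate, cache_rate = 0.20, 0.15
llm_rate = 1 - faq_rate - cache_rate     # 0.65, inside the 55-75% range

# Every FAQ or cache hit skips the full pipeline, so the blended saving
# is simply the sum of the two hit rates:
savings = faq_rate + cache_rate
print(f"{llm_rate:.0%} reach the LLM, ~{savings:.0%} of LLM spend avoided")
```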

Lessons Learned

1. FAQs Are the Best Investment

Not sexy. It's a table with questions and answers. But every FAQ is a perfect answer served in 15ms at zero cost, forever.

2. Cache Threshold Must Be Very High

At 0.90 I had false positives that served incorrect answers. 0.95 is the safe minimum.

3. Fire-and-Forget Is Mandatory for Cache Storage

First attempt was synchronous. If the INSERT takes long, the last streaming token is delayed. Fire-and-forget eliminates that latency.
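A self-contained demo of why create_task fixes this (the 200 ms sleep stands in for a slow INSERT; numbers are illustrative):

```python
import asyncio
import time

async def slow_insert() -> None:
    # Stand-in for the cache INSERT; 200 ms would delay the last SSE token.
    await asyncio.sleep(0.2)

async def respond() -> float:
    start = time.perf_counter()
    asyncio.create_task(slow_insert())       # scheduled, not awaited
    blocked = time.perf_counter() - start    # time the response path waited
    await asyncio.sleep(0.3)                 # keep the loop alive for the task
    return blocked

blocked = asyncio.run(respond())
print(f"response path blocked for {blocked * 1000:.2f} ms")  # ≈0, not 200
```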

4. Proactive Invalidation Is Worth It

It's tempting to let cache expire on its own (TTL). But the admin who uploads a document expects updated answers immediately.

5. Auto-FAQ Closes the Loop

Without auto-FAQ, knowledge gaps accumulate. With auto-FAQ, in 2 clicks the frequent question goes from gap to active FAQ.

6. FAQ-Only Fallback > Cutting Off Service

Free-tier users still ask questions that match FAQs. Reduces churn and provides real incentive to upgrade.

7. LLM Prices Must Be Configurable

Hardcoding $0.00027/1K tokens is a trap. Prices change. Storing them in a settings table allows adjustments without a deploy.


Conclusion

Optimizing RAG costs is a problem of avoiding unnecessary work. The three layers (FAQ, cache, auto-FAQ) attack the same principle: if the answer already exists, don't pay to generate it again.

The interesting part: they also improve the experience. FAQ match in 15ms vs 2-4s for the full pipeline. Cache hit in 20ms with the same quality. And pgvector was already in the stack — no need for Redis or Elasticsearch.


This is the fifth article in the series. If it was useful, a like helps it reach more people. Questions about semantic cache or FAQ matching? Drop a comment.
