DEV Community: Derrick Pedranti

Improving Determinism with LLMs: Prompting, Model Selection, Context, and Tools

Derrick Pedranti — Sat, 02 May 2026 04:48:06 +0000

Large language models are incredibly powerful, but they are not automatically deterministic.

Ask the same question twice and you may get slightly different answers. Ask for facts without enough context and the model may fill in gaps. Ask it to perform complex matching or calculations directly in natural language and you may get an answer that sounds confident but is not reliable enough for production use.

That does not mean LLMs are unreliable by default. It means we need to design around how they work.

When building AI-powered applications, improving determinism usually comes down to four practical methods:

Prompt engineering
Choosing the right model
Providing the right context, including RAG
Using tools for deterministic work

The goal is not to make the LLM magically perfect. The goal is to reduce ambiguity, improve accuracy, and prevent the model from inventing answers when it does not have enough information.

1. Prompt Engineering

Prompt engineering is one of the simplest ways to improve LLM reliability. A vague prompt gives the model too much freedom. A specific prompt gives it boundaries.

Instead of asking:

Compare these records and tell me which ones match.

You can improve the prompt by giving the model a clear process:

Compare the records step by step.
First, normalize company names.
Second, compare addresses.
Third, compare phone numbers.
Fourth, assign a confidence score.
If there is not enough evidence to determine a match, return `unknown`.

Good prompt engineering often includes:

Step-by-step instructions
Specific examples
Example outputs
Clear formatting requirements
Constraints on what sources the model should use
Permission for the model to say “I don’t know”

That last point is important.

LLMs are often optimized to be helpful, which can sometimes make them answer even when they should not.

Giving the model permission to say it does not know can reduce hallucinations.

For example:

If the answer cannot be determined from the provided context, respond with:
"I don't know based on the provided information."
Do not guess.
Do not use outside knowledge.

This kind of instruction helps the model stay inside the boundaries of the task. Prompting alone will not guarantee perfect results, but it is usually the first layer of control.
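The "permission to say I don't know" idea can also be enforced on the way out. The sketch below assumes the prompt asks for exactly one of three labels; `ALLOWED_LABELS` and `normalize_label` are hypothetical names for illustration, not part of any library:

```python
ALLOWED_LABELS = {"match", "no_match", "unknown"}


def normalize_label(raw_reply: str) -> str:
    """Map a raw model reply onto an allowed label, defaulting to 'unknown'."""
    # Trim whitespace and common wrapping punctuation, then canonicalize.
    label = raw_reply.strip().strip('."\'`').lower().replace(" ", "_")
    # Anything outside the allowed set is treated as "unknown" rather than trusted.
    return label if label in ALLOWED_LABELS else "unknown"


print(normalize_label(" Match. "))
print(normalize_label("Probably the same company"))
```

Treating unexpected replies as `unknown` keeps a chatty or off-format answer from being mistaken for a confident result.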

2. Choosing the Right Model

Not all LLMs are equally good at every task.

Some models are stronger at reasoning. Some are better at coding. Some are optimized for speed and cost. Some are designed for image generation, document understanding, or multimodal workflows.

For example, a model like Claude Opus 4.7 is commonly used for complex reasoning and coding-heavy tasks. A model like Nano Banana Pro is designed for high-quality image generation and editing, including use cases where accurate text rendering inside images matters.

The key point is simple:

Pick the model based on the task, not just the brand name.

If your task is code generation, evaluate models against coding benchmarks and real coding examples from your own project. If your task is medical document summarization, legal review, financial extraction, or data matching, evaluate models against examples from that subject matter. If your task is image generation, use a model designed for image generation.

Model settings matter too.

Temperature is one of the most important settings for determinism. Lower temperature generally makes responses more predictable and focused, while higher temperature increases creativity and variation.

For accuracy-focused tasks like structured extraction, classification, JSON output, or data processing, I usually prefer a low temperature (often close to 0), though even a temperature of 0 narrows variation rather than guaranteeing identical output on every run. Conversely, for creative writing, brainstorming, or marketing copy, a higher temperature may be more appropriate.
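To see why temperature affects predictability, it helps to look at the arithmetic. The sketch below is a simplified illustration, assuming the common formulation where token scores (logits) are divided by the temperature before a softmax; the logit values are made up:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)

print([round(p, 3) for p in low])   # mass concentrates on the top token
print([round(p, 3) for p in high])  # mass spreads across candidates
```

At a low temperature almost all probability lands on the highest-scoring token, so sampling rarely picks anything else. A temperature of exactly 0 cannot be divided by; it is typically handled as a special case that always picks the top token.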

Another useful pattern is intelligent model routing.

Instead of sending every prompt to the same model, you can route tasks based on intent:

If the user asks for code generation, use the coding model.
If the user asks for image generation, use the image model.
If the user asks for summarization, use the fast summarization model.
If the user asks for complex reasoning, use the reasoning model.

This routing can be rule-based, or you can use an LLM to classify the task and select the best model. The more specialized the task, the more important model selection becomes.
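A rule-based router can be very small. The following is a minimal sketch, assuming keyword matching is good enough for a first pass; the model names and keyword lists are placeholders, not real model identifiers:

```python
# Placeholder model names; substitute whatever models you actually evaluate.
ROUTES = {
    "code": "coding-model",
    "image": "image-model",
    "summarize": "fast-summarization-model",
    "reason": "reasoning-model",
}

# Hypothetical keyword rules, checked in insertion order.
KEYWORDS = {
    "code": ("function", "bug", "refactor", "stack trace"),
    "image": ("draw", "image", "logo", "illustration"),
    "summarize": ("summarize", "tl;dr", "shorten"),
}


def classify_intent(prompt: str) -> str:
    """Return the first intent whose keywords appear in the prompt."""
    text = prompt.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "reason"  # default when no rule fires


def route(prompt: str) -> str:
    return ROUTES[classify_intent(prompt)]


print(route("Fix the bug in this function"))
print(route("Why did revenue drop last quarter?"))
```

Because the rules are plain code, the same prompt always routes to the same model. An LLM-based classifier can replace `classify_intent` later without changing `route`.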

3. Providing the Right Context (RAG)

Context is one of the biggest factors in improving LLM accuracy.

An LLM without context may answer based on general knowledge. That can be useful, but it is risky when you need answers grounded in specific documents, company policies, user data, contracts, codebases, or domain-specific content.

Context gives the model boundaries.

For example:

Answer only using the provided context.
If the context does not contain the answer, say you do not know.

This is where RAG becomes useful.

RAG stands for Retrieval-Augmented Generation. In a RAG system, your documents are usually chunked, embedded, and stored in a vector database. When a user asks a question, the system performs a semantic search to find relevant content and passes that content to the LLM as context.

Instead of asking the model to rely only on what it already knows, you are giving it the source material it should use.

A simplified RAG flow looks like this:

User asks a question
        ↓
Search relevant documents
        ↓
Retrieve the best matching chunks
        ↓
Pass those chunks to the LLM
        ↓
Generate an answer grounded in the retrieved context

This improves determinism because the model is no longer operating in an open-ended way. It has a defined source of truth.
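The flow above can be sketched end to end without any external services. This toy version assumes a bag-of-words similarity in place of real embeddings and a vector database, and the policy chunks are invented for illustration:

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the best top_k."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks so the answer can cite them."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Use only the provided context. Cite the source sections used.\n"
        "If the answer is not present in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "The refund window is 30 days after purchase.",
    "Support is open Monday through Friday.",
    "Shipping takes 5 to 7 business days.",
]
question = "What is the refund window after purchase?"
best = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, best)
print(prompt)
```

The prompt that reaches the model contains only the retrieved chunk and the boundary instructions, which is what keeps the answer grounded and checkable.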

RAG is especially useful for:

Internal documentation
Policy questions
Knowledge bases
Technical documentation
Customer support
Contract review
Medical or legal document review
Codebase Q&A
Research assistants

However, RAG does not automatically solve everything. You still need good chunking, good retrieval, good metadata, and good prompting. If the wrong context is retrieved, the model may still produce the wrong answer.

A strong RAG prompt is built on strict boundaries. While a production prompt would be much more detailed, a simplified example of the core instructions looks like this:

Use only the provided context.
Cite the source sections used.
Do not answer from general knowledge.
If the answer is not present in the context, say so.

This helps reduce hallucinations and makes the answer easier to verify.

4. Using Tools for Deterministic Work

Tools are one of the best ways to improve reliability.

There are many tasks that an LLM should not perform directly if you need consistent, production-quality results.

For example:

Complex calculations
Fuzzy matching across large datasets
Sorting and filtering
Database queries
API lookups
File parsing
Data validation
Date calculations
Business rule execution

An LLM can reason about these tasks, but it should not always be the thing performing them.

If you need to compare thousands of records, do not rely on the LLM to manually inspect all of them in a prompt. Instead, create a tool.

For example, a fuzzy matching tool could be written in Python and exposed to the LLM:

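from difflib import SequenceMatcher


def calculate_similarity(source, target):
    """
    Illustrative scorer (an assumption, not a prescribed method):
    compares a "name" field using the standard library's SequenceMatcher.
    """
    a = source["name"].strip().lower()
    b = target["name"].strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match_records(source_records, target_records, threshold=0.85):
    """
    Deterministically compare two datasets and return likely matches.
    """
    matches = []

    # Compare every source record against every target record.
    for source in source_records:
        for target in target_records:
            score = calculate_similarity(source, target)

            # Keep only pairs at or above the similarity threshold.
            if score >= threshold:
                matches.append({
                    "source_id": source["id"],
                    "target_id": target["id"],
                    "score": score
                })

    return matches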

The LLM can decide when to use the tool, explain the results, and help the user interpret the output. But the matching itself happens in code, which is much more reliable.

The same applies to calculations. If you need accurate math, use a calculator tool or a Python function. If you need data from a database, use a query tool. If you need to check real-time information, use an API.

The pattern is:

Use the LLM for reasoning, language, orchestration, and explanation.
Use tools for deterministic execution.

This is especially important in agentic workflows.

The more autonomy you give an AI agent, the more important tool boundaries become. Tools should be scoped, validated, logged, and restricted. A tool should do one thing clearly and safely.
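One way to make "scoped, validated, logged" concrete is a small dispatcher that sits between the model and the tools. The sketch below assumes the model returns a tool name and a dictionary of arguments; `TOOLS`, `call_tool`, and `days_between` are hypothetical names for illustration:

```python
import logging
from datetime import date

logger = logging.getLogger("tools")


def days_between(start: str, end: str) -> int:
    """Deterministic date arithmetic on ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Registry: tool name -> (function, expected argument types).
TOOLS = {"days_between": (days_between, {"start": str, "end": str})}


def call_tool(name: str, arguments: dict):
    """Run a registered tool after validating its name and arguments."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    func, schema = TOOLS[name]
    if set(arguments) != set(schema):
        raise ValueError(f"{name} expects arguments {sorted(schema)}")
    for key, expected in schema.items():
        if not isinstance(arguments[key], expected):
            raise TypeError(f"{name}.{key} must be {expected.__name__}")
    logger.info("tool=%s args=%s", name, arguments)
    return func(**arguments)


print(call_tool("days_between", {"start": "2026-01-01", "end": "2026-03-01"}))
```

The model only chooses which tool to call and with what arguments. Anything outside the registry or the expected argument shape is rejected before code runs.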

Tools make LLM systems more reliable because they move critical operations out of natural language and into deterministic code.

One important clarification: a tool does not guarantee correct results just because it is a tool. It guarantees that the same code runs consistently, assuming the implementation and inputs are correct. That is still a major improvement over asking an LLM to improvise calculations or matching logic in plain text.

Bringing It All Together

Improving determinism with LLMs is not about one magic trick. It is a layered approach.

Prompt engineering gives the model clear instructions.
Model selection ensures you are using the right model for the task.
Context and RAG ground the model in relevant source material.
Tools move critical logic into deterministic code.

Together, these methods can dramatically improve the reliability of LLM-powered applications.

A practical architecture might look like this:

User Prompt
   ↓
Prompt Classification
   ↓
Model Routing
   ↓
Retrieve Context with RAG
   ↓
LLM Reasoning
   ↓
Tool Calls for Deterministic Work
   ↓
Validated Response

This kind of design gives you the best of both worlds. You get the flexibility and reasoning ability of an LLM, but you also get the reliability of structured prompts, grounded context, model specialization, and deterministic tools.

That is where LLM applications become much more production-ready.

Final Thoughts

LLMs are powerful, but they need guardrails. If you want better accuracy, fewer hallucinations, and more repeatable results, start by asking four questions:

Is my prompt specific enough?
Am I using the right model for this task?
Have I provided the right context?
Should this task be handled by a tool instead of the LLM?

The more often you answer those questions intentionally, the more deterministic your AI system becomes. LLMs are not just chatbots anymore. They are reasoning engines, orchestrators, and interfaces to tools.

But for production systems, the best results come when we stop expecting the model to do everything by itself and instead design systems that combine LLM intelligence with deterministic software engineering.

Stop Overloading Your CLAUDE.md — Simplicity Wins (and Saves Tokens)

Derrick Pedranti — Sun, 12 Apr 2026 21:15:56 +0000

If your CLAUDE.md, .cursorrules, or agent.md file is longer than a few hundred lines, you are probably making your AI assistant worse, not better.

Every time you start a new chat session, you pay a hidden cost for massive context files—in tokens, performance, and overall accuracy. Many developers tend to over-engineer their context files, stuffing them with endless rules and massive context blocks. Ironically, this usually leads to worse results.

The Shift Most Developers Haven't Fully Realized

Modern Large Language Models (LLMs) are exceptionally capable right out of the box. You no longer need to explain how React works or what REST APIs are, or re-teach basic software architecture. The models already possess this knowledge.

What matters now isn't providing more instructions, but managing the context you provide much more effectively. This emerging practice is known as context engineering—the art of optimizing exactly what goes into the model's context window to produce the best possible results without overwhelming it.

The Hidden Cost of Large Context Files

Every time you start a new coding session or prompt your AI assistant, your context files (CLAUDE.md, .cursorrules, agent.md, system instructions) are all loaded into the context window.

That content immediately converts into tokens, and those tokens have a tangible cost:

Cost: You pay for every token processed, whether through direct API usage or hidden compute limits.
Attention: LLMs have finite attention spans. Essential project rules get diluted by boilerplate instructions.
Performance Risk: The larger the context, the slower the response times, and the higher the chance the model hallucinates or ignores specific constraints.

LLMs operate within a finite context window, meaning everything you include competes for attention. When you dump a massive configuration file into every single session, you run the risk of degrading the model's reasoning capabilities over time.
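A rough back-of-the-envelope estimate makes the overhead visible. The sketch below assumes about four characters per token, which is a common rule of thumb for English text rather than an exact count, and uses a made-up price and session count:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will differ."""
    return round(len(text) / chars_per_token)


def session_overhead(context_text: str, sessions_per_day: int, price_per_million_tokens: float):
    """Return (tokens per session, estimated daily cost) for a context file."""
    tokens = estimate_tokens(context_text)
    daily_cost = tokens * sessions_per_day * price_per_million_tokens / 1_000_000
    return tokens, daily_cost


context_file = "x" * 8000  # stand-in for an 8,000-character context file
tokens, cost = session_overhead(context_file, sessions_per_day=50, price_per_million_tokens=3.0)
print(tokens, round(cost, 2))
```

This counts the file once per session. If the file is re-sent with every request in a session, the real figure is higher.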

The Big Mistake: "More Context = Better Results"

It feels logical to assume that giving an AI more background information will yield a better answer. In practice, prompting often shows a "less-is-more" effect: removing non-essential content tends to improve the accuracy and relevance of the model's output.

When a context window is bloated:

The model gets distracted: It might over-index on a minor, irrelevant rule you included "just in case."
Important instructions get buried: The "needle in a haystack" problem means your critical constraints are lost in a sea of generic best practices.
Signal-to-noise ratio drops: Meaningful project context is drowned out by unnecessary explanations, leading to generic or confused outputs.

What You Should Do Instead

1. Keep Context Files Minimal

Most developers and teams do not need an enormous configuration file. Your system prompts should be lean and highly specific.

Only include:

Project-specific rules: Naming conventions, specific directory structures, or custom architectural patterns unique to your repository.
Constraints the model wouldn't infer: Hard requirements like "Never use external libraries for data fetching" or "Strictly adhere to local timezones."
Truly required defaults: Formatting preferences or language-specific compiler flags.

Everything else? Remove it.

2. Stop Treating Agents Like They're Dumb

There is no need to include generic instructions like "Write clean code" or "Use best practices." Modern models are aligned to do this by default. Telling an advanced LLM to write good code is like telling a senior engineer not to forget to breathe—it wastes space and adds no value.

3. Use Skills Instead of Static Context

This is where you can drastically improve your workflow. Agent skills allow for progressive disclosure of context:

Instead of loading a massive document of instructions upfront, only the skill's name and a brief description are loaded initially (consuming perhaps 100 tokens).
The full, detailed instructions and context are only loaded dynamically when the agent decides it needs to use that specific skill.

By utilizing skills, you ensure lower token usage per request, significantly better focus for the model, and a much more scalable system as your project grows.
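Progressive disclosure can be sketched as a registry that exposes only names and descriptions until a skill is requested. This is an illustration of the idea, not how any particular agent implements skills; `Skill`, `SkillRegistry`, and the example skill are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str  # full instructions, only loaded on demand


class SkillRegistry:
    def __init__(self, skills):
        self._skills = {skill.name: skill for skill in skills}

    def index_prompt(self) -> str:
        """Short listing that is always present in the context."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self._skills.values())

    def load(self, name: str) -> str:
        """Full instructions, fetched only when the agent picks the skill."""
        return self._skills[name].body


registry = SkillRegistry([
    Skill(
        name="release-notes",
        description="Draft release notes from merged changes.",
        body="Step 1: Collect merged changes.\nStep 2: Group them by area.\nStep 3: Write one line per change.",
    ),
])

print(registry.index_prompt())
print(len(registry.index_prompt()), "chars upfront vs", len(registry.load("release-notes")), "on demand")
```

Only the one-line index is paid for on every request. The longer body enters the context when the skill is actually used.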

4. Keep Skills Small Too

Even dynamic skills can suffer from bloat if you aren't careful. When building out agent capabilities:

Only include what the agent wouldn't already know: Do not paste the entirety of a public API's documentation if the model was likely trained on it.
Keep instructions concise and actionable: Focus on input/output expectations and specific steps.
Avoid "documentation-style" writing: Be direct. Once a skill activates, its entire payload enters the context window, so every word should earn its keep.

The Real Insight: Context Is a Budget

It helps to think of your context window like system RAM. In software development, you wouldn't load unnecessary libraries into memory, keep unused data structures active, or duplicate logic everywhere.

You should treat your AI's context with the same level of discipline. Manage it like a strict budget where every token must justify its inclusion.

Why This Matters More Than Ever

We are entering an era of AI development where:

Foundational models are becoming commoditized.
Baseline capabilities across different providers are largely similar.
The true differentiation lies in how you orchestrate and utilize them.

The engineering teams and individual developers who will excel are those who keep their systems lean, rigorously optimize their context, and build modular, reusable workflows—not the ones writing the most exhaustive, monolithic prompts.

Introducing: simplify-markdown

One problem I consistently encountered while refining these workflows is that AI-generated markdown tends to get bloated incredibly fast. It becomes too verbose, contains redundant sections, includes unnecessary explanations, and relies on token-heavy structures.

To solve this, I built a specialized skill: simplify-markdown.

This tool is designed to systematically reduce token usage, clean up unwieldy context files, and simplify agent or skill markdown files so that only the signal remains.

Where to Find It

GitHub Repository: ai-agent-toolkit
Skill Source: simplify-markdown/SKILL.md

When to Use It

Consider integrating simplify-markdown into your workflow when:

Your context files (.cursorrules, CLAUDE.md, etc.) are growing too large to manage easily.
Your dynamic skills feel bloated and are slowing down execution.
Your prompt architecture is becoming difficult to reason about.
You want to immediately improve response performance and lower your token expenditure.

Final Thought

The future of AI-assisted development isn't about writing more instructions. It is about writing fewer, but better, instructions.

By focusing on smaller context windows, cleaner automated workflows, and smarter loading mechanisms like skills, you empower the AI rather than suffocate it. The models are already highly capable; your job as an engineer is simply to provide the right environment and stay out of their way.

Inspiration & Sources

Some of the core ideas and inspiration for this post came from the following resources, which I highly recommend checking out:

The Startup Ideas Podcast
Ras Mic on YouTube

Semantic Caching for LLMs: Faster Responses, Lower Costs

Derrick Pedranti — Sun, 29 Mar 2026 20:24:12 +0000

If you're building AI applications with LLMs, you've probably noticed a pattern:

The same (or very similar) questions keep coming in
Each one triggers a full LLM call
Latency adds up, and token costs quietly grow in the background

What makes this especially frustrating is that many of these requests aren't truly unique. They're slightly reworded versions of things you've already answered.

For example:

"What is the capital of France?"
"What's France's capital?"
"Can you tell me the capital city of France?"

From an LLM's perspective, these are three separate requests. From a user's perspective, they're the same question. Without caching, you pay for each one.

Semantic caching solves this. Instead of treating every request as new, your system recognizes when a query is similar enough to a previous one and reuses the existing response.

When query overlap is high, this single optimization can cut LLM calls substantially (on the order of 30–70%), drop latency for cache hits from seconds to milliseconds, and significantly lower your token costs. It's one of the highest-leverage improvements you can make early in your architecture.

How It Works

Traditional caching relies on exact string matches. Change a single character and the cache misses.
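A quick illustration of that limitation, assuming a simple dictionary keyed by a hash of the exact query text (`exact_key` is a hypothetical helper):

```python
import hashlib


def exact_key(query: str) -> str:
    """Key derived from the exact query string."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


exact_cache = {exact_key("What is the capital of France?"): "Paris"}

hit = exact_cache.get(exact_key("What is the capital of France?"))
miss = exact_cache.get(exact_key("What's France's capital?"))

print(hit)   # same text, so the lookup succeeds
print(miss)  # reworded text produces a different key
```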

Semantic caching takes a different approach: instead of comparing raw text, it compares meaning using embeddings.

Here's the flow:

User Query
    ↓
Generate embedding
    ↓
Search cache for similar embeddings
    ↓
Match found? → Return cached response
    ↓
No match? → Call LLM → Store result in cache → Return response

The key insight: you avoid calling the LLM unless you have to. Before every request, you ask "Have I already answered something similar enough?" If yes, you skip the most expensive part of your system entirely.

Under the hood, this works by converting queries into vectors and measuring how close they are in vector space. If the distance is below a threshold, the system considers them a match.
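It is worth knowing how the two common distance measures relate. For unit-length vectors, squared L2 distance equals 2 minus twice the cosine similarity, and FAISS's IndexFlatL2 reports squared L2 distances. The sketch below checks that identity on two made-up vectors in plain Python:

```python
import math


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    # Dot product equals cosine similarity when both vectors are unit length.
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([1.0, 2.0, 3.0])   # stand-in for one query embedding
b = normalize([1.0, 2.0, 2.5])   # stand-in for a similar query embedding

d2 = squared_l2(a, b)
cos = cosine_similarity(a, b)
print(round(d2, 5), round(2 - 2 * cos, 5))

threshold = 0.25  # the same kind of cutoff used for matching
is_match = d2 <= threshold
print(is_match)
```

Under this relation, a squared L2 threshold of 0.25 on normalized embeddings corresponds to a cosine distance of 0.125.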

Implementation

Let's build a working semantic cache. We'll use FAISS for vector search and sentence-transformers for embeddings, which keeps everything local (note that sentence-transformers pulls in PyTorch, so the install is not small).

Install dependencies

pip install faiss-cpu sentence-transformers numpy

Note on Dependencies

Depending on your environment (especially Python 3.12 on macOS), you may need to pin a few dependencies due to PyTorch compatibility.

For example:

pip install "sentence-transformers<4" "transformers<5" "numpy<2"

The versions in this article are intentionally left unpinned to keep things simple, but if you run into installation issues, try the pinned versions above.

Define a cache interface

from abc import ABC, abstractmethod

class ResponseCache(ABC):
    @abstractmethod
    def get(self, query: str, context: dict) -> str | None:
        """Look up a cached response. Returns None on miss."""
        ...

    @abstractmethod
    def put(self, query: str, context: dict, response: str) -> None:
        """Store a response for future reuse."""
        ...

Defining an interface keeps your application decoupled from the caching backend. You might start with FAISS locally, then move to Redis or Qdrant in production. Your LLM logic shouldn't need to change.

Implement a no-op cache

class NoCache(ResponseCache):
    def get(self, query, context):
        return None

    def put(self, query, context, response):
        pass

This gives you a safe default for environments where caching isn't available, and a clean baseline for benchmarking.

Implement the semantic cache

import json
import time
import hashlib
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer


class SemanticCache(ResponseCache):
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        distance_threshold: float = 0.25,
        ttl_seconds: int = 3600,
    ):
        self.encoder = SentenceTransformer(model_name)
        self.dimension = self.encoder.get_sentence_embedding_dimension()
        self.distance_threshold = distance_threshold
        self.ttl_seconds = ttl_seconds

        # FAISS index for fast similarity search (returns squared L2 distances)
        self.index = faiss.IndexFlatL2(self.dimension)

        # Parallel store: maps index position → cached entry
        self.entries: list[dict] = []

    def _context_key(self, context: dict) -> str:
        """Create a deterministic key from context so we only match
        responses generated under the same conditions."""
        stable = json.dumps(context, sort_keys=True)
        return hashlib.sha256(stable.encode()).hexdigest()

    def _embed(self, text: str) -> np.ndarray:
        # Unit-length vectors keep the distance threshold comparable across
        # embedding models.
        vector = self.encoder.encode([text], normalize_embeddings=True)
        return np.array(vector, dtype="float32")

    def get(self, query: str, context: dict) -> str | None:
        if self.index.ntotal == 0:
            return None

        query_vector = self._embed(query)
        distances, indices = self.index.search(query_vector, k=5)

        ctx_key = self._context_key(context)
        now = time.time()

        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue
            if dist > self.distance_threshold:
                continue

            entry = self.entries[idx]

            # Context must match (model, temperature, user, etc.)
            if entry["context_key"] != ctx_key:
                continue

            # Respect TTL
            if now - entry["created_at"] > self.ttl_seconds:
                continue

            return entry["response"]

        return None

    def put(self, query: str, context: dict, response: str) -> None:
        query_vector = self._embed(query)
        self.index.add(query_vector)
        self.entries.append({
            "query": query,
            "response": response,
            "context_key": self._context_key(context),
            "created_at": time.time(),
        })

A few things to note:

all-MiniLM-L6-v2 is a lightweight embedding model (~80MB) that's fast enough for real-time use. For higher accuracy on domain-specific queries, consider all-mpnet-base-v2 or a fine-tuned model.
FAISS IndexFlatL2 does exact nearest-neighbor search using L2 (Euclidean) distance; the distances it returns are squared, so the threshold is compared against squared L2 values. For millions of entries, switch to IndexIVFFlat or IndexHNSWFlat for approximate search.
The context key ensures we never return a cached response generated under different conditions (different model, temperature, user, etc.).

Tie it all together

def handle_request(query: str, context: dict, cache: ResponseCache, llm) -> str:
    # 1. Check cache (compare against None so an empty cached string still counts as a hit)
    cached = cache.get(query, context)
    if cached is not None:
        print("[cache hit]")
        return cached

    # 2. Cache miss — call the LLM
    print("[cache miss → calling LLM]")
    response = llm.generate(query)

    # 3. Store for future reuse
    cache.put(query, context, response)

    return response

That's it. Three steps: check, call, store.

Try it out

# Swap in your actual LLM client here
class MockLLM:
    def generate(self, query):
        return "The capital of France is Paris."

cache = SemanticCache(distance_threshold=0.25, ttl_seconds=3600)
llm = MockLLM()
context = {"model": "claude-sonnet-4-20250514", "temperature": 0.0, "user": "user_123"}

# First call — cache miss, calls the LLM
r1 = handle_request("What is the capital of France?", context, cache, llm)

# Second call — semantically similar, cache hit
r2 = handle_request("What's France's capital city?", context, cache, llm)

# Different context — cache miss even though query is similar
other_context = {"model": "gpt-4o", "temperature": 0.7, "user": "user_123"}
r3 = handle_request("Tell me the capital of France", other_context, cache, llm)

Tuning the Distance Threshold

The distance threshold is the most important tuning parameter in your system. It controls the tradeoff between precision (returning only correct matches) and recall (catching more cache hits).

Lower values → stricter matching, fewer false positives, lower hit rate
Higher values → more matches, higher hit rate, risk of returning wrong responses

The right value depends on your embedding model and distance metric:

Metric	Typical Range	Notes
L2 (squared Euclidean)	0.15 – 0.40	Used in our FAISS example above; IndexFlatL2 reports squared distances
Cosine distance	0.05 – 0.15	1 - cosine_similarity; common in Redis, Qdrant

Start around 0.25 for L2 or 0.10 for cosine, then adjust based on real traffic.

How to calibrate

Don't guess. Log your near-misses and spot-check them:

# During development, log borderline matches for review
if dist <= self.distance_threshold * 1.5:
    print(f"Near match: dist={dist:.4f} | query='{query}' | cached='{entry['query']}'")

Review these logs periodically. If you see incorrect matches slipping through, tighten the threshold. If you see obvious matches being missed, loosen it.
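Once you have reviewed a batch of logged pairs, you can sweep candidate thresholds over them. The sketch below assumes each reviewed pair is recorded as a distance plus a human judgment of whether the two queries really meant the same thing; the sample data is invented:

```python
def precision_and_hit_rate(labeled_pairs, threshold):
    """labeled_pairs: list of (distance, is_same_intent) from reviewed logs."""
    accepted = [same for dist, same in labeled_pairs if dist <= threshold]
    if not accepted:
        return None, 0.0
    precision = sum(accepted) / len(accepted)       # share of hits that were correct
    hit_rate = len(accepted) / len(labeled_pairs)   # share of pairs served from cache
    return precision, hit_rate


# Hypothetical reviewed pairs: (distance, did a human agree they match?)
reviewed = [
    (0.05, True), (0.12, True), (0.22, True),
    (0.28, False), (0.35, True), (0.40, False),
]

for threshold in (0.15, 0.25, 0.40):
    print(threshold, precision_and_hit_rate(reviewed, threshold))
```

Pick the loosest threshold whose precision you can live with. On this sample, 0.25 keeps every accepted match correct while serving half the pairs from cache.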

Context Filters: Correctness and Security

Semantic similarity alone isn't enough. Two queries can be nearly identical in meaning but require different responses based on context.

Correctness concerns:

Different models produce different outputs
Temperature affects randomness — a cached temperature=0 response shouldn't serve a temperature=1 request
System prompts or attached documents change the answer

Security concerns:

In multi-tenant systems, responses that contain user-specific data (account details, personalized recommendations, user-scoped RAG results) must include the user identifier in the context key. Without it, User A could receive User B's cached response. Treat this as a security boundary, not just a correctness optimization.

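import hashlib

context = {
    "model": "claude-sonnet-4-20250514",
    "temperature": 0.0,
    "user": user_id,
    # Use a stable digest: Python's built-in hash() is salted per process for strings
    "system_prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest(),
}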

A note on cache hit rate: Including the user in the context key means every user builds separate cache entries, even for identical answers. For applications where responses don't depend on who's asking — general knowledge, shared documentation, public FAQs — consider omitting the user from the context key so all users share the same cache entries. This can dramatically improve your hit rate. The right approach depends on your application; the important thing is to make the decision intentionally rather than including or excluding the user by default across the board.

Cache Invalidation

TTL handles the simple case: responses expire after a set period. But in practice, you'll also need to think about:

Knowledge updates. If the underlying data your LLM references changes (e.g., you update your RAG corpus), cached responses built on the old data become stale. Consider including a version identifier in your context key so that corpus updates automatically invalidate old entries.
System prompt changes. If you modify your system prompt, cached responses from the previous version may no longer be appropriate. Hashing the system prompt into your context key (as shown above) handles this automatically.
Selective invalidation. Sometimes you need to invalidate specific entries rather than waiting for TTL. Adding a purge(context) method to your cache gives you this escape hatch.

def purge(self, context: dict) -> int:
    """Expire all live entries matching a given context. Returns count expired."""
    ctx_key = self._context_key(context)
    removed = 0
    for entry in self.entries:
        if entry["context_key"] == ctx_key and entry["created_at"] != 0:
            # Force expiry; get() skips the entry via its TTL check.
            # The vector itself stays in the FAISS index.
            entry["created_at"] = 0
            removed += 1
    return removed

For production systems with frequent knowledge updates, you'll likely want a more sophisticated approach — e.g., tagging cache entries with a corpus version and bulk-invalidating when the version changes.
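The user-scoping and versioning decisions can live in one place. The sketch below assumes a helper that builds the context dictionary, includes the user only when responses are user-specific, and carries a corpus version so that a corpus update changes the key; `build_context` and `context_key` are hypothetical names:

```python
import hashlib
import json


def build_context(model, temperature, system_prompt, corpus_version, user_id=None):
    """Assemble the fields that must match for a cached response to be reused."""
    context = {
        "model": model,
        "temperature": temperature,
        "system_prompt_hash": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "corpus_version": corpus_version,
    }
    if user_id is not None:  # only scope per user when answers depend on the user
        context["user"] = user_id
    return context


def context_key(context: dict) -> str:
    stable = json.dumps(context, sort_keys=True)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


shared_a = context_key(build_context("model-x", 0.0, "You are helpful.", "v1"))
shared_b = context_key(build_context("model-x", 0.0, "You are helpful.", "v1"))
scoped = context_key(build_context("model-x", 0.0, "You are helpful.", "v1", user_id="user_123"))
updated = context_key(build_context("model-x", 0.0, "You are helpful.", "v2"))

print(shared_a == shared_b, shared_a == scoped, shared_a == updated)
```

Entries written under the old corpus version simply stop matching after an update and age out through the TTL.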

Common Pitfalls

Caching everything. Some responses should never be cached: real-time data (stock prices, weather), sensitive PII-containing responses, or anything where staleness causes harm. Maintain an explicit skip list.

No TTL. Without expiration, your cache will silently return outdated responses. Always set a TTL, even if it's generous (e.g., 24 hours).

Ignoring context. If you cache without filtering on model, temperature, and user, you will eventually serve wrong or leaked responses. This is the most dangerous pitfall because it often doesn't surface in testing — only in production with real multi-user traffic.

Poor serialization. If you're caching structured LLM responses (tool calls, JSON, streaming chunks), make sure your serialization round-trips correctly. A bug here can produce responses that look right but are subtly broken.

Embedding model mismatch. If you change your embedding model, your existing cache becomes invalid — the vector spaces are incompatible. Either clear the cache on model change or version your cache keys.
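The explicit skip list mentioned under "Caching everything" can start as a handful of patterns checked before any cache lookup or write. This is a minimal sketch; the patterns are examples only and would need tuning for a real application:

```python
import re

# Example patterns for queries that should bypass the cache entirely.
SKIP_PATTERNS = [
    re.compile(r"\b(stock price|share price|weather|right now|today)\b", re.IGNORECASE),
    re.compile(r"\b(my account|my balance|my order)\b", re.IGNORECASE),
]


def should_cache(query: str) -> bool:
    """Return False for queries that match any skip pattern."""
    return not any(pattern.search(query) for pattern in SKIP_PATTERNS)


print(should_cache("What is the capital of France?"))
print(should_cache("What's the weather in Paris?"))
```

Calling `should_cache` before both `get` and `put` keeps time-sensitive or personal queries out of the cache in either direction.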

When to Use It (and When Not To)

Good fit:

FAQ-style or search-style applications with high query overlap
Customer support bots where similar questions recur frequently
RAG systems where the same retrievals happen repeatedly
Internal tools with a bounded set of common queries

Poor fit:

Highly personalized responses that differ per user even for the same query
Real-time data applications where freshness is critical
Creative applications where variety is the point (e.g., brainstorming tools)

Semantic caching works best when there's meaningful reuse across queries. If every request is genuinely unique, the cache overhead adds cost without benefit.

Going to Production

The FAISS implementation above is great for prototyping and single-process applications. When you're ready to scale, here's what changes:

Vector store. Move to a dedicated vector database: Redis with vector search, Qdrant, Pinecone, or Weaviate. These give you persistence, replication, and filtering built in.
Embedding service. Consider moving embedding generation to an API (OpenAI, Cohere, or a self-hosted model) so your application server stays lightweight.
Monitoring. Track your cache hit rate, average distance of matches, and latency savings. These metrics tell you if your threshold is calibrated correctly and how much value the cache is providing.
Warm-up. Pre-populate the cache with common queries from your logs to maximize hit rate from day one.
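For the monitoring item, a small in-process counter is enough to get started before wiring up a metrics system. The sketch below assumes you call it from the cache hit and miss paths; `CacheStats` is a hypothetical helper:

```python
from dataclasses import dataclass, field


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    hit_distances: list = field(default_factory=list)

    def record_hit(self, distance: float) -> None:
        self.hits += 1
        self.hit_distances.append(distance)

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def average_hit_distance(self) -> float:
        if not self.hit_distances:
            return 0.0
        return sum(self.hit_distances) / len(self.hit_distances)


stats = CacheStats()
stats.record_hit(0.10)
stats.record_hit(0.20)
stats.record_miss()
print(round(stats.hit_rate, 3), round(stats.average_hit_distance, 3))
```

An average hit distance that creeps toward the threshold suggests many borderline matches and is a cue to re-check calibration.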

Wrapping Up

Semantic caching doesn't require retraining models or changing your core LLM logic. It adds a decision layer in front of your existing system: Have I already answered something similar enough? If yes, skip the LLM call.

That single decision can make your system significantly faster, cheaper, and more scalable. If you're running LLMs in production, it's worth building in early — the ROI only grows as your traffic does.

The full working code from this article is available as a single file you can drop into your project and start experimenting with immediately.