<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guy Kobrinsky</title>
    <description>The latest articles on DEV Community by Guy Kobrinsky (@guyko).</description>
    <link>https://dev.to/guyko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930977%2F096da486-11f5-498c-9d30-60d90b84f64e.jpg</url>
      <title>DEV Community: Guy Kobrinsky</title>
      <link>https://dev.to/guyko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guyko"/>
    <language>en</language>
    <item>
      <title>You WON'T Get Realtime LLM Cost From Your Public Cloud</title>
      <dc:creator>Guy Kobrinsky</dc:creator>
      <pubDate>Thu, 14 May 2026 15:22:53 +0000</pubDate>
      <link>https://dev.to/guyko/you-wont-get-realtime-llm-cost-from-your-public-cloud-3h9e</link>
      <guid>https://dev.to/guyko/you-wont-get-realtime-llm-cost-from-your-public-cloud-3h9e</guid>
      <description>&lt;p&gt;As an engineering manager who has spent years grappling with infrastructure costs across all public cloud environments, I've seen firsthand how quickly expenses can spiral without proper visibility. When it comes to Generative AI, specifically LLMs, there's a common misconception that standard public cloud cost monitoring will give you the real-time insights you need. Let me be direct: &lt;strong&gt;you won't get realtime LLM cost from your public cloud provider.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't an indictment of cloud providers; it's a fundamental mismatch between how LLM usage is billed and how traditional cloud services are aggregated for cost reporting. I've designed and managed systems where every penny counts, and the hourly, or even daily, batched reports from your AWS, Azure, or GCP console are simply too late for effective LLM cost management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Public Cloud Cost Reporting Falls Short for LLMs
&lt;/h3&gt;

&lt;p&gt;Public cloud providers are excellent at giving you an hourly or daily aggregate of your compute, storage, and network usage. You'll see line items for your EC2 instances, S3 buckets, or serverless function invocations. This works well for resources with relatively predictable billing cycles or larger, less granular units of consumption.&lt;/p&gt;

&lt;p&gt;LLMs, however, operate on a per-token basis. Consider models like OpenAI's GPT-4 Turbo, where input tokens might cost $10 per 1M and output tokens $30 per 1M; their newer GPT-4o is cheaper at $2.50/$10, but complex use cases still default to the pricier models. Or Anthropic's Claude 3 Opus, with even higher rates of $15/1M input, $75/1M output. Every character, every word, every prompt, and every response directly translates into a micro-transaction. A single complex query or an extended conversation can quickly rack up hundreds or thousands of tokens.&lt;/p&gt;
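
&lt;p&gt;To make those unit economics concrete, here is a rough, hypothetical back-of-the-envelope calculation. It assumes roughly four characters per token, which is only an approximation (real tokenizers vary by model and language), and uses the GPT-4 Turbo list prices quoted above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough cost estimate for a single GPT-4 Turbo call.
// The 4-characters-per-token heuristic is an approximation only;
// a real tokenizer (e.g. tiktoken) gives model-specific counts.
const INPUT_PRICE_PER_1M = 10;  // USD per 1M input tokens
const OUTPUT_PRICE_PER_1M = 30; // USD per 1M output tokens

function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function estimateCostUSD(promptText, responseText) {
  const inputTokens = estimateTokens(promptText);
  const outputTokens = estimateTokens(responseText);
  return (
    (inputTokens / 1_000_000) * INPUT_PRICE_PER_1M +
    (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_1M
  );
}

// A 2,000-character prompt with a 4,000-character answer is roughly
// 500 input and 1,000 output tokens: about $0.035 per call. At 100,000
// calls a day, that is roughly $3,500/day from a single feature.
console.log(estimateCostUSD("x".repeat(2000), "y".repeat(4000)));

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;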

&lt;p&gt;Your public cloud provider aggregates these individual token costs into an hourly total. This means if an anomaly in your application causes a spike in LLM calls, or an unoptimized prompt is suddenly getting used thousands of times, you won't see the financial impact until hours have passed, or, at worst, until the next morning. By then, hundreds or even thousands of dollars might have been spent unnecessarily. That delay is precisely why traditional alerts based on cloud billing data are often too late.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Granularity Gap: Tokens vs. Traditional Resources
&lt;/h3&gt;

&lt;p&gt;Think about the difference. If a rogue Lambda function starts executing too often, you might notice an increase in &lt;code&gt;invocations&lt;/code&gt; and &lt;code&gt;duration&lt;/code&gt; metrics quickly. But with LLMs, it's not just the &lt;em&gt;number&lt;/em&gt; of calls; it's the &lt;em&gt;content&lt;/em&gt; of each call. A slight change in prompt engineering, perhaps adding a few more examples or constraints, can easily double or triple the token count for a single interaction. And that's often invisible to generic API monitoring.&lt;/p&gt;

&lt;p&gt;As someone who's focused on FinOps and cloud economics, I know that granular data is the bedrock of effective cost control. With traditional infrastructure, you might monitor CPU utilization or data transfer. For LLMs, you need to monitor token consumption, both input and output, per-user, per-feature, or even per-prompt template, and you need to do it in near real-time.&lt;/p&gt;
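
&lt;p&gt;A minimal sketch of what that attribution can look like, assuming you wrap your own calls and read the usage object that the OpenAI chat completions API already returns. The &lt;code&gt;emitMetric&lt;/code&gt; function is a stand-in for whatever metrics pipeline you use, not a real library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Stand-in for your metrics backend (StatsD, Prometheus, CloudWatch, ...).
function emitMetric(name, value, tags) {
  console.log(name, value, tags);
}

// Wrap every chat completion so token usage is attributed per user and feature.
async function trackedChatCompletion(params, attribution) {
  const response = await openai.chat.completions.create(params);
  const usage = response.usage; // { prompt_tokens, completion_tokens, total_tokens }
  emitMetric("llm.input_tokens", usage.prompt_tokens, attribution);
  emitMetric("llm.output_tokens", usage.completion_tokens, attribution);
  return response;
}

// Every call is tagged, so a spike shows up per feature within minutes.
await trackedChatCompletion(
  { model: "gpt-4o", messages: [{ role: "user", content: "Summarize this ticket" }] },
  { userId: "user-123", feature: "ticket-summarizer", promptTemplate: "summarize-v2" }
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;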

&lt;p&gt;This isn't a problem unique to any single public cloud; it's inherent to the billing model for these advanced AI services. The cloud provides the underlying infrastructure to &lt;em&gt;access&lt;/em&gt; these models, but the LLM API providers (OpenAI, Anthropic, Google AI) are the ones charging per token. Your cloud bill reflects the &lt;em&gt;sum&lt;/em&gt; of these charges, not the &lt;em&gt;details&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The True Cost of LLMs Goes Beyond Tokens
&lt;/h3&gt;

&lt;p&gt;Effective LLM cost management also involves understanding more than just the raw token count. You have other factors at play:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency Impact:&lt;/strong&gt; High latency from repeated, unoptimized calls can degrade user experience and might lead to users abandoning your application. While not a direct billing cost, it's a significant business cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed Requests:&lt;/strong&gt; Are you paying for requests that error out or time out? If your retry logic isn't smart, you could be doubling or tripling costs on every failed attempt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering Iterations:&lt;/strong&gt; Developers iterating on prompts often don't have a clear view of the cost implications of each change. They're focused on model quality, not token efficiency, and their playground experiments can accrue substantial costs without a dashboard to reflect it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Lock-in:&lt;/strong&gt; Relying heavily on one provider without understanding usage patterns can limit your negotiation power or ability to switch providers if costs escalate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built SemanticGuard because I saw this critical gap. My experience leading large-scale FinOps initiatives taught me that you can't optimize what you can't see. We needed a layer that sat between our applications and the LLM APIs, capable of understanding the &lt;em&gt;semantic content&lt;/em&gt; of requests and reporting costs with the precision required for these new models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing Granular LLM Cost Tracking
&lt;/h3&gt;

&lt;p&gt;To get a handle on LLM cost management, you need a system that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intercept Requests:&lt;/strong&gt; It needs to sit in the request path, before the call hits the LLM provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count Tokens Accurately:&lt;/strong&gt; It must understand the tokenization rules for different models and providers to give accurate pre-flight and post-flight token counts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribute Costs:&lt;/strong&gt; You need to tag requests by user, application feature, prompt ID, or whatever granularity makes sense for your business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report in Real-time:&lt;/strong&gt; Costs should be visible on a minute-by-minute or even second-by-second basis, with dashboards and anomaly detection that can trigger immediate alerts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of detailed tracking also opens the door to intelligent optimization strategies, like semantic caching. If you can identify duplicate or semantically similar requests, you can serve them from a cache, reducing API calls to the LLM provider by 40-70%. This not only saves money but also drastically reduces latency, often to under 50ms for cached responses.&lt;/p&gt;

&lt;p&gt;For example, integrating a solution to track and optimize these calls might look something like this in your code. It's a simple change at the &lt;code&gt;fetch&lt;/code&gt; layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;withSemanticGuard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@semanticguard/ai-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;your-openai-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;withSemanticGuard&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// intercepts and optimizes all LLM calls&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one-line change to the client's &lt;code&gt;fetch&lt;/code&gt; option allows a dedicated gateway to inspect, optimize, and report on every LLM interaction, giving you the real-time insights your public cloud can't.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Do Next: Actionable Steps for LLM Cost Management
&lt;/h3&gt;

&lt;p&gt;Don't wait for your next cloud bill to be surprised by your LLM spend. Here are concrete steps you can take today to get better control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Your LLM Usage:&lt;/strong&gt; Identify every application and service that makes calls to LLM APIs. Document which models they use and for what purpose. This gives you a baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimate Current Token Costs:&lt;/strong&gt; Use a tool or write a script to roughly estimate the token counts for your most common prompts and responses. This helps you understand the unit economics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement a Centralized Gateway or Proxy:&lt;/strong&gt; Route all your LLM API traffic through a single point. This is crucial for gaining the visibility needed for proper &lt;strong&gt;LLM cost management&lt;/strong&gt;, caching, and future optimizations. It also helps abstract away provider-specific SDKs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with Shadow Mode Monitoring:&lt;/strong&gt; Before committing to any optimization, deploy your chosen gateway or proxy in a 'shadow mode.' This allows you to measure potential savings and identify cost anomalies without affecting production traffic. You can calculate your baseline and then project potential savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Up Real-time Alerts for Token Spikes:&lt;/strong&gt; Configure alerts that trigger immediately when token usage (input or output) for specific applications or models exceeds predefined thresholds. Don't rely solely on daily cloud billing alerts; they are too slow for LLMs. A minimal sketch of this follows the list.&lt;/li&gt;
&lt;/ul&gt;
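
&lt;p&gt;As an illustration of that last point, here is a deliberately simple in-memory spike detector. The per-minute threshold is an arbitrary example value, and a production setup would live in your metrics or alerting system rather than in application memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per-app token counters in one-minute buckets with a hard threshold.
// Illustrative only: state is in-process and the threshold is an example.
const THRESHOLD_TOKENS_PER_MINUTE = 50_000;
const buckets = new Map(); // keyed by "app:minute", value is a token count

function recordTokens(app, tokens, alertFn) {
  const minute = Math.floor(Date.now() / 60_000);
  const key = app + ":" + minute;
  const total = (buckets.get(key) || 0) + tokens;
  buckets.set(key, total);
  if (total &amp;gt;= THRESHOLD_TOKENS_PER_MINUTE) {
    alertFn(app, total); // page someone, post to Slack, open an incident
  }
}

// Call this wherever you already log usage for each LLM response.
recordTokens("ticket-summarizer", 1200, function (app, total) {
  console.warn("Token spike for " + app + ": " + total + " tokens this minute");
});

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The specific code matters less than the behavior: the alert fires within the same minute as the spike, not on tomorrow's bill.&lt;/p&gt;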

</description>
      <category>finops</category>
      <category>llm</category>
      <category>ai</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why I Built SemanticGuard</title>
      <dc:creator>Guy Kobrinsky</dc:creator>
      <pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/guyko/why-i-built-semanticguard-5dd2</link>
      <guid>https://dev.to/guyko/why-i-built-semanticguard-5dd2</guid>
      <description>&lt;h1&gt;
  
  
  Why I Built SemanticGuard
&lt;/h1&gt;

&lt;p&gt;My career has been defined by a persistent drive to build efficient, scalable systems and to manage their operational costs. From transforming localized processes into web-scale platforms at Meta to spearheading FinOps strategies as VP Cloud Platform at Teads and Outbrain, I've spent years immersed in the practical realities of infrastructure economics. I've seen firsthand how easily technology, despite its immense power, can become a drain if not managed shrewdly.&lt;/p&gt;

&lt;p&gt;Then came Generative AI. The promise was clear: transformative applications, incredible productivity gains. But it wasn't long before a familiar challenge emerged, one that mirrored the early days of cloud adoption but amplified: unpredictable, rapidly escalating costs. Developers, product managers, and CTOs I spoke with were all grappling with the same issue: how to &lt;em&gt;reduce LLM API cost&lt;/em&gt; without sacrificing the very quality that made these models so compelling.&lt;/p&gt;

&lt;p&gt;This wasn't just a theoretical problem for me; it was a daily reality for teams trying to ship AI-powered features. We were building remarkable things, but every API call felt like it had a ticking meter attached. I knew there had to be a better way to harness the power of LLMs responsibly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unseen Cost of Every LLM API Call
&lt;/h2&gt;

&lt;p&gt;Many of us started our LLM journeys by simply calling the OpenAI, Anthropic, and Google Gemini APIs directly. The initial costs might seem manageable for a proof-of-concept. But as applications scale, the token counts skyrocket. A single complex agent chain or an LLM-powered internal tool can quickly run up a bill. Consider that GPT-4, for instance, costs around $30 per 1 million input tokens and $60 per 1 million output tokens. For sophisticated applications making hundreds or thousands of calls daily, these figures quickly turn into significant operational expenses.&lt;/p&gt;

&lt;p&gt;What often gets overlooked is the nature of these calls. How many are genuinely unique? How many are slightly rephrased versions of a previous query? Without intelligent deduplication, each variation becomes a new, expensive API call. This isn't just about reducing redundant calls; it's about optimizing for &lt;em&gt;semantic similarity&lt;/em&gt;. A user asking "What's the capital of France?" and then "Tell me the capital of France" should ideally hit the same answer from a cache, but most simple caching mechanisms would treat them as distinct requests. This is where traditional key-value caching falls short; it lacks the necessary understanding of meaning to truly &lt;em&gt;reduce LLM API cost&lt;/em&gt; effectively.&lt;/p&gt;
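
&lt;p&gt;To illustrate the gap, here is a deliberately simplified sketch of an embedding-based lookup. This is not SemanticGuard's implementation, just the general idea: embed each prompt, compare it against cached prompts by cosine similarity, and only call the model when nothing is close enough. The 0.95 threshold and the in-memory array are example choices, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// In-memory cache of { embedding, answer } entries. A real system would use
// a vector database and a carefully tuned threshold; 0.95 here is arbitrary.
const cache = [];
const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach(function (value, i) {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function embed(text) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

async function answerWithSemanticCache(prompt) {
  const vector = await embed(prompt);
  // "What's the capital of France?" and "Tell me the capital of France"
  // produce nearby vectors, so the second call can reuse the first answer.
  for (const entry of cache) {
    if (cosineSimilarity(vector, entry.embedding) &amp;gt;= SIMILARITY_THRESHOLD) {
      return entry.answer; // cache hit: no completion call, no output tokens billed
    }
  }
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  });
  const answer = completion.choices[0].message.content;
  cache.push({ embedding: vector, answer: answer });
  return answer;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the embedding call itself is billed, but at a far lower rate than a fresh completion, and at scale the linear scan would be replaced by a proper vector search.&lt;/p&gt;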

&lt;p&gt;My experience in distributed systems taught me that optimization needs layers. Just as we wouldn't fetch the same database query repeatedly if the data hadn't changed, we shouldn't be asking the same &lt;em&gt;semantic&lt;/em&gt; question to an LLM over and over. The challenge was how to build that semantic layer without introducing complexity or compromising accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Journey to Intelligent Caching: The FinOps Perspective
&lt;/h2&gt;

&lt;p&gt;At companies like Outbrain and Meta, a core part of my role involved optimizing large-scale cloud infrastructure. This wasn't just about buying cheaper instances; it was about smart architecture, efficient resource utilization, and granular visibility into spending. When I looked at LLM usage, I saw the same patterns of inefficiency that I had battled with traditional cloud resources.&lt;/p&gt;

&lt;p&gt;The idea for SemanticGuard didn't come out of thin air; it was born from these FinOps principles applied to a new domain. I recognized that to genuinely &lt;em&gt;reduce LLM API cost&lt;/em&gt;, we needed a solution that was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware:&lt;/strong&gt; It needed to understand the &lt;em&gt;intent&lt;/em&gt; behind a query, not just its exact string.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance-driven:&lt;/strong&gt; Cache hits needed to be lightning fast, under 50ms, to avoid degrading user experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-friendly:&lt;/strong&gt; Integration had to be trivial, not a multi-week engineering project. Developers are already stretched; adding more infrastructure burden wasn't the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trustworthy:&lt;/strong&gt; It had to guarantee zero false positives, meaning a cached response would always be as accurate as a fresh LLM call. Compromising quality was not an option.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point, zero false positives, was non-negotiable. If a caching layer started returning incorrect answers, its value would be immediately negated. Achieving this required deep dives into embedding models, similarity metrics, and robust cache invalidation strategies. It was a complex engineering problem, but one that I believed was solvable with the right approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engineering Dilemma: Build vs. Buy for LLM Caching
&lt;/h2&gt;

&lt;p&gt;Many engineering teams initially consider building their own LLM caching solution. I understand this impulse; I've led teams that built everything from scratch. But the nuances of effective LLM caching are substantial. It's not just a &lt;code&gt;dict&lt;/code&gt; lookup.&lt;/p&gt;

&lt;p&gt;You need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose and manage embedding models:&lt;/strong&gt; These are critical for converting text into semantic vectors. There's an ongoing cost and maintenance for these models alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement vector search:&lt;/strong&gt; You're not comparing strings; you're comparing high-dimensional vectors. This requires specialized databases and algorithms to perform similarity searches efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage cache invalidation:&lt;/strong&gt; When should a cached response expire? How do you handle updates to underlying knowledge bases?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensure performance at scale:&lt;/strong&gt; Low latency is key. Cache hits are only beneficial if they're faster than a new API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guarantee accuracy:&lt;/strong&gt; This means careful tuning of similarity thresholds and robust testing to prevent "false hits" that return irrelevant or incorrect information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle varying LLM providers:&lt;/strong&gt; Different models, different APIs, different response formats add layers of complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before you know it, you're dedicating significant engineering resources to what should be an optimization layer, pulling focus from core product development. My vision for SemanticGuard was to abstract away this complexity, offering a one-line integration that delivers tangible results from day one.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;withSemanticGuard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@semanticguard/ai-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;your-openai-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;withSemanticGuard&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the code snippet above is illustrative, it highlights the core principle: the developer experience should remain familiar, while the underlying intelligence drastically optimizes resource use. This simple integration pattern was central to how I envisioned SemanticGuard, a powerful optimization without requiring a complete rewrite of your LLM interaction logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Philosophy: Prove it Before You Commit
&lt;/h2&gt;

&lt;p&gt;One of the biggest hurdles in adopting new infrastructure is proving its value before making a full commitment. I've been in countless meetings where I had to justify significant cloud spend or infrastructure changes. That's why I insisted on a "Shadow Mode" for SemanticGuard. This feature allows teams to route their LLM traffic through our gateway, observe the potential savings, and see exactly how much they could &lt;em&gt;reduce LLM API cost&lt;/em&gt; – all before enabling caching or making any financial commitment. It reflects my engineering ethos: measure, validate, then optimize.&lt;/p&gt;

&lt;p&gt;This isn't just about cost; it's about confidence. Confidence that your solution will perform, that your data is secure (running in your own infrastructure), and that you retain control. It's about providing a tool that acts as a reliable partner, allowing developers to focus on building features, not battling spiraling operational expenses.&lt;/p&gt;

&lt;p&gt;I built SemanticGuard because I believe in empowering developers to build amazing AI applications without being constrained by unpredictable costs or complex infrastructure. It's the culmination of years of experience in FinOps, cloud architecture, and distributed systems, applied to solve one of the most pressing problems in modern AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Steps You Can Take Today to Manage LLM Spend
&lt;/h2&gt;

&lt;p&gt;Even if you're not ready to implement an intelligent caching solution, there are immediate actions you can take to gain control over your LLM API costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Your Usage Granularly:&lt;/strong&gt; Implement logging for every LLM API call, capturing input tokens, output tokens, model used, and response times. This data is gold for identifying expensive patterns and redundant queries. Tools like Prometheus, Grafana, or simple custom dashboards can provide this visibility. You can't optimize what you don't measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand Model Pricing:&lt;/strong&gt; Get intimate with the pricing models of the LLMs you use. GPT-3.5 Turbo is significantly cheaper than GPT-4, and sometimes, a simpler model is sufficient for less complex tasks. Experiment with different models for different use cases to find the optimal cost-performance balance. This often leads to immediate, substantial savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Your Prompts:&lt;/strong&gt; Shorter, more precise prompts use fewer input tokens. Also, explore techniques like few-shot learning or fine-tuning (if appropriate and cost-effective) to reduce the amount of context you need to send with each query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Basic Deduplication (if possible):&lt;/strong&gt; For truly identical, verbatim requests, even a simple key-value cache can offer some relief. While limited in its effectiveness for semantic variations, it's a low-hanging fruit for obvious redundancies; a minimal sketch follows this list. Just be mindful of cache invalidation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educate Your Team:&lt;/strong&gt; Ensure everyone using LLMs understands the cost implications. Foster a culture of cost-awareness, encouraging developers to think critically about whether an LLM call is truly necessary or if a simpler, cheaper method (like a database lookup or regex) could achieve the same result.&lt;/li&gt;
&lt;/ul&gt;
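
&lt;p&gt;For the deduplication step above, a naive key-value version can be as small as the sketch below. The five-minute TTL is an arbitrary example, and this approach still misses the semantic variations discussed earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Naive exact-match deduplication: only verbatim-identical prompts hit the
// cache, and entries expire after a fixed TTL as a crude invalidation policy.
// Semantic variations still miss, which is exactly the gap discussed above.
const TTL_MS = 5 * 60 * 1000; // five minutes, an arbitrary example value
const exactCache = new Map(); // maps prompt string to { answer, storedAt }

function getCached(prompt) {
  const entry = exactCache.get(prompt);
  if (!entry) {
    return null;
  }
  if (Date.now() - entry.storedAt &amp;gt; TTL_MS) {
    exactCache.delete(prompt); // stale: drop it rather than risk serving old data
    return null;
  }
  return entry.answer;
}

function putCached(prompt, answer) {
  exactCache.set(prompt, { answer: answer, storedAt: Date.now() });
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;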

&lt;p&gt;By taking these steps, you'll be well on your way to a more controlled and sustainable LLM strategy.&lt;/p&gt;

</description>
      <category>finops</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
