<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anil Prasad</title>
    <description>The latest articles on DEV Community by Anil Prasad (@anilatambharii).</description>
    <link>https://dev.to/anilatambharii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843681%2Fe0b19f3a-123f-4286-b970-10682e211b29.jpeg</url>
      <title>DEV Community: Anil Prasad</title>
      <link>https://dev.to/anilatambharii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anilatambharii"/>
    <language>en</language>
    <item>
      <title>I Built an AI Governance Proxy in 72 Hours. Here Is Exactly How.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Wed, 01 Jul 2026 14:01:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/title-i-built-an-ai-governance-proxy-in-72-hours-here-is-exactly-how-16pk</link>
      <guid>https://dev.to/anilatambharii/title-i-built-an-ai-governance-proxy-in-72-hours-here-is-exactly-how-16pk</guid>
      <description>&lt;p&gt;Liquid syntax error: 'raw' tag was never closed&lt;/p&gt;
</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>security</category>
    </item>
    <item>
      <title>I built a zero-dependency PII scanner for AI prompts in 270 lines of Python</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 26 Jun 2026 12:45:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/i-built-a-zero-dependency-pii-scanner-for-ai-prompts-in-270-lines-of-python-2fml</link>
      <guid>https://dev.to/anilatambharii/i-built-a-zero-dependency-pii-scanner-for-ai-prompts-in-270-lines-of-python-2fml</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — AgentMesh 0.3.2 ships a PII/PHI/PCI scanner that runs on every AI prompt before it reaches the model. 17 entity types. Under 2ms. No external API. No cloud service. Pure Python regex with Luhn validation and overlap deduplication. Three enforcement modes: mask, redact, block. &lt;code&gt;pip install agentmesh-proxy&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Your AI agents and tools are sending raw sensitive data to the LLM vendor.&lt;/p&gt;

&lt;p&gt;Medical record numbers in clinical AI prompts. Credit card numbers in finance team workflows. AWS access keys in developer debug pastes. Social security numbers in HR automation.&lt;/p&gt;

&lt;p&gt;The people doing this are not making bad decisions. They are using the tools available to them. The problem is that there is no layer between the prompt and the model that catches sensitive data first.&lt;/p&gt;

&lt;p&gt;I built that layer into AgentMesh. Here is how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it catches
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8qk5nk2lfyp6287ykpqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8qk5nk2lfyp6287ykpqa.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;17 entity types across four categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PII&lt;/strong&gt; — personal identity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSN: &lt;code&gt;567-89-0123&lt;/code&gt; → &lt;code&gt;[SSN]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Date of birth: &lt;code&gt;07/22/1985&lt;/code&gt; → &lt;code&gt;[DOB]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Email: &lt;code&gt;sarah.johnson@gmail.com&lt;/code&gt; → &lt;code&gt;[EMAIL]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Phone: &lt;code&gt;(415) 867-5309&lt;/code&gt; → &lt;code&gt;[PHONE_US]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Passport: &lt;code&gt;Passport no: US123456789&lt;/code&gt; → &lt;code&gt;[PASSPORT]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PCI&lt;/strong&gt; — payment card and financial data&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visa: &lt;code&gt;4532 1234 5678 9012&lt;/code&gt; → &lt;code&gt;[PCI_CARD]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Amex: &lt;code&gt;3714 496353 98431&lt;/code&gt; → &lt;code&gt;[PCI_CARD]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mastercard: &lt;code&gt;5500 0055 0000 0004&lt;/code&gt; → &lt;code&gt;[PCI_CARD]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CVV: &lt;code&gt;CVV 394&lt;/code&gt; → &lt;code&gt;[PCI_CVV]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Routing: &lt;code&gt;Routing: 021000021&lt;/code&gt; → &lt;code&gt;[PCI_ROUTING]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Account: &lt;code&gt;Account: 000123456789&lt;/code&gt; → &lt;code&gt;[PCI_ACCOUNT]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PHI&lt;/strong&gt; — HIPAA-protected medical data&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medical record: &lt;code&gt;MRN: P-987654&lt;/code&gt; → &lt;code&gt;[PHI_MRN]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ICD-10 diagnosis: &lt;code&gt;E11.9&lt;/code&gt; → &lt;code&gt;[PHI_ICD10]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Medication dosage: &lt;code&gt;10mg lisinopril&lt;/code&gt; → &lt;code&gt;[PHI_DOSAGE]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Provider ID: &lt;code&gt;NPI: 1234567890&lt;/code&gt; → &lt;code&gt;[PHI_NPI]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CII&lt;/strong&gt; — cloud credentials and infrastructure&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS key: &lt;code&gt;AKIAIOSFODNN7EXAMPLE&lt;/code&gt; → &lt;code&gt;[CII_AWS_KEY]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JWT token: &lt;code&gt;eyJhbGci...&lt;/code&gt; → &lt;code&gt;[CII_JWT]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentmesh-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentmesh.security.pii_scanner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScanMode&lt;/span&gt;

&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MASK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Patient MRN: P-987654, email: sarah@example.com, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;card: 4532 1234 5678 9012, key: AKIAIOSFODNN7EXAMPLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Patient MRN: [PHI_MRN], email: [EMAIL],
# card: [PCI_CARD], key: [CII_AWS_KEY]
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finding_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ['CII_AWS_KEY', 'EMAIL', 'PCI_CARD', 'PHI_MRN']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scanning a list of &lt;code&gt;{role, content}&lt;/code&gt; messages (OpenAI format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN 123-45-6789, card 4532 1234 5678 9012&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;cleaned_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# SSN [SSN], card [PCI_CARD]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Three enforcement modes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentmesh.security.pii_scanner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PIIDetectedError&lt;/span&gt;

&lt;span class="c1"&gt;# MASK: replace with labeled placeholder — model still gets a useful prompt
&lt;/span&gt;&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MASK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN 123-45-6789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "SSN [SSN]"
&lt;/span&gt;
&lt;span class="c1"&gt;# REDACT: replace with *** — when even the label is too much context
&lt;/span&gt;&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REDACT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN 123-45-6789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "SSN ***"
&lt;/span&gt;
&lt;span class="c1"&gt;# BLOCK: raise PIIDetectedError — zero tolerance, reject the request
&lt;/span&gt;&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN 123-45-6789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PIIDetectedError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [Finding(entity_type='SSN', ...)]
&lt;/span&gt;    &lt;span class="c1"&gt;# Return HTTP 400 to the caller
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Engineering decisions worth explaining
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why regex over an NLP model?
&lt;/h3&gt;

&lt;p&gt;Speed. The scan runs in under 2ms. An NLP-based entity recognizer adds 50ms to 200ms per call and requires a model download. For a proxy that sits in the path of every LLM call, 2ms is acceptable and 200ms is not.&lt;/p&gt;

&lt;p&gt;The tradeoff is recall. Regex will miss creative obfuscation. For governance purposes — where the goal is catching accidental leakage, not adversarial attacks — regex is the right tool.&lt;/p&gt;
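
&lt;p&gt;To make the regex approach concrete, here is a minimal sketch of a pattern table with mask-style replacement. The three patterns and the &lt;code&gt;mask&lt;/code&gt; helper are illustrative assumptions, not AgentMesh's actual regexes, and the timing it prints depends on your machine.&lt;/p&gt;

```python
import re
import time

# Hypothetical patterns for illustration; the real scanner's regexes may differ.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CII_AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def mask(text):
    # Replace every match with a labeled placeholder such as [SSN].
    for label, pattern in PATTERNS.items():
        text = pattern.sub("[" + label + "]", text)
    return text


prompt = "SSN 567-89-0123, email sarah.johnson@gmail.com, key AKIAIOSFODNN7EXAMPLE"

started = time.perf_counter()
masked = mask(prompt)
elapsed_ms = (time.perf_counter() - started) * 1000

print(masked)  # SSN [SSN], email [EMAIL], key [CII_AWS_KEY]
print(f"{elapsed_ms:.3f} ms")
```

&lt;p&gt;A table like this also shows where the recall tradeoff comes from: an SSN spelled out in words matches none of the patterns and passes straight through.&lt;/p&gt;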

&lt;h3&gt;
  
  
  The credit card validation decision
&lt;/h3&gt;

&lt;p&gt;Standard implementations run Luhn validation on card numbers and only mask numbers that pass. We run in &lt;code&gt;strict_pci=True&lt;/code&gt; mode by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In PIIScanner.__init__:
# strict_pci=True (default): mask any card-shaped number (13-19 digits)
# even if it fails the Luhn check.
# Rationale: governance proxies should over-mask rather than under-mask.
# A false positive costs one masked token.
# A false negative sends a real card number to the vendor.
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strict_pci&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strict_pci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you prefer Luhn validation only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MASK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict_pci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
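
&lt;p&gt;For reference, the Luhn check itself is short. The sketch below is a standalone illustration, not the library's own code: &lt;code&gt;luhn_valid&lt;/code&gt; and &lt;code&gt;should_mask&lt;/code&gt; are hypothetical helpers showing how a &lt;code&gt;strict_pci&lt;/code&gt; flag could choose between masking every card-shaped number and masking only Luhn-valid ones.&lt;/p&gt;

```python
def luhn_valid(candidate):
    # Keep digits only, so "4111 1111 1111 1111" and "4111-1111-..." both work.
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            # Double every second digit from the right and add its digit sum.
            total += sum(divmod(digit * 2, 10))
        else:
            total += digit
    return total % 10 == 0


def should_mask(candidate, strict_pci=True):
    # Card-shaped means 13 to 19 digits, matching the range quoted above.
    digit_count = sum(ch.isdigit() for ch in candidate)
    if digit_count not in range(13, 20):
        return False
    return strict_pci or luhn_valid(candidate)


print(luhn_valid("4111 1111 1111 1111"))                     # True
print(luhn_valid("4111 1111 1111 1112"))                     # False
print(should_mask("5500 0055 0000 0004"))                    # True
print(should_mask("5500 0055 0000 0004", strict_pci=False))  # False
```

&lt;p&gt;The last two lines show the difference in practice: a card-shaped number that fails Luhn is still masked under the strict default and passes through when &lt;code&gt;strict_pci&lt;/code&gt; is off.&lt;/p&gt;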



&lt;h3&gt;
  
  
  The overlap deduplication problem
&lt;/h3&gt;

&lt;p&gt;This one took a few iterations to get right.&lt;/p&gt;

&lt;p&gt;Consider a prompt containing &lt;code&gt;MRN: A1234567&lt;/code&gt;. The &lt;code&gt;PHI_MRN&lt;/code&gt; pattern matches the whole span. The &lt;code&gt;PASSPORT&lt;/code&gt; pattern (before it required a &lt;code&gt;passport:&lt;/code&gt; prefix) would also match the &lt;code&gt;A1234567&lt;/code&gt; part.&lt;/p&gt;

&lt;p&gt;If you apply replacements in reverse order by start position — which is the standard approach to keep earlier offsets valid — and the inner match gets processed first, it replaces 8 characters with 10 (&lt;code&gt;[PASSPORT]&lt;/code&gt;). The outer match then tries to cut at the original end offset, which now points into the middle of &lt;code&gt;[PASSPORT]&lt;/code&gt;, producing &lt;code&gt;[PHI_MRN]T]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_dedup_overlapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Sort by start position, then by length descending (outermost first).
&lt;/span&gt;    &lt;span class="c1"&gt;# Walk forward and drop any finding whose start is inside the
&lt;/span&gt;    &lt;span class="c1"&gt;# previous kept finding's range.
&lt;/span&gt;    &lt;span class="n"&gt;sorted_f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;last_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sorted_f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;last_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;last_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the outermost match. Drop everything whose start position falls inside it. Apply replacements in reverse order on the deduplicated list. No artifacts.&lt;/p&gt;
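
&lt;p&gt;The failure and the fix can be reproduced in a few lines. This is a self-contained sketch: the &lt;code&gt;Finding&lt;/code&gt; dataclass and the &lt;code&gt;apply_mask&lt;/code&gt; helper are hypothetical stand-ins, and &lt;code&gt;dedup_overlapping&lt;/code&gt; expresses the same rule as the function above using a range check.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical stand-in for the scanner's finding record.
    entity_type: str
    start: int
    end: int  # exclusive


def dedup_overlapping(findings):
    # Outermost first: sort by start, then longest span first.
    ordered = sorted(findings, key=lambda f: (f.start, f.start - f.end))
    kept = []
    for f in ordered:
        inside_previous = bool(kept) and f.start in range(kept[-1].start, kept[-1].end)
        if not inside_previous:
            kept.append(f)
    return kept


def apply_mask(text, findings):
    # Replace from the rightmost span backwards so earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + "[" + f.entity_type + "]" + text[f.end:]
    return text


prompt = "MRN: A1234567"
findings = [Finding("PHI_MRN", 0, 13), Finding("PASSPORT", 5, 13)]

print(apply_mask(prompt, findings))                     # [PHI_MRN]T]
print(apply_mask(prompt, dedup_overlapping(findings)))  # [PHI_MRN]
```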




&lt;h2&gt;
  
  
  Wiring it into the proxy
&lt;/h2&gt;

&lt;p&gt;If you are running AgentMesh as a proxy rather than calling the scanner directly, activate it in config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agentmesh.yaml&lt;/span&gt;
&lt;span class="na"&gt;pii_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mask&lt;/span&gt;           &lt;span class="c1"&gt;# mask | redact | block&lt;/span&gt;
&lt;span class="na"&gt;block_injections&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# prompt injection detection (14 rules)&lt;/span&gt;
&lt;span class="na"&gt;anomaly_detection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# runaway loop + burn rate monitoring&lt;/span&gt;
&lt;span class="na"&gt;slack_webhook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# optional: alert destination&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentmesh serve &lt;span class="nt"&gt;--config&lt;/span&gt; agentmesh.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point your agents at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/v1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call going through the proxy now gets scanned. The response includes headers reporting the number of PII findings, the cache status, and the cost of the call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X-AgentMesh-PII-Findings: 4
X-AgentMesh-Cache: miss
X-AgentMesh-Cost-USD: 0.000420
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
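
&lt;p&gt;On the client side, these headers can be folded into a small summary for logging. A minimal sketch, assuming the header names shown above and that their values parse as an integer, a string, and a float; &lt;code&gt;parse_governance_headers&lt;/code&gt; is a hypothetical helper, not part of AgentMesh.&lt;/p&gt;

```python
def parse_governance_headers(headers):
    # Header names come from the example above; the value types are assumptions.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return {
        "pii_findings": int(lowered.get("x-agentmesh-pii-findings", "0")),
        "cache": lowered.get("x-agentmesh-cache", "unknown"),
        "cost_usd": float(lowered.get("x-agentmesh-cost-usd", "0")),
    }


example = {
    "X-AgentMesh-PII-Findings": "4",
    "X-AgentMesh-Cache": "miss",
    "X-AgentMesh-Cost-USD": "0.000420",
}

summary = parse_governance_headers(example)
print(summary)
```

&lt;p&gt;Lower-casing the names first matters because HTTP header names are case-insensitive and clients differ in how they report them.&lt;/p&gt;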



&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkccei9fmhn12lif7ylki.png" alt=" " width="800" height="450"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chrome extension
&lt;/h2&gt;

&lt;p&gt;A server-side proxy cannot intercept prompts typed directly into the ChatGPT or Claude.ai browser tab. For that there is a Chrome extension — same scanner, running locally in the browser process before the request leaves the tab.&lt;/p&gt;

&lt;p&gt;Google approved it last weekend.&lt;/p&gt;

&lt;p&gt;Install from the Chrome Web Store (link in the repo readme) or build from source. Works with ChatGPT, Claude.ai, Gemini, Perplexity, and Cursor. No server required for standalone use.&lt;/p&gt;




&lt;h2&gt;
  
  
  HIPAA in production
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fapjaw1jkpz87l755upa6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fapjaw1jkpz87l755upa6.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your team uses AI in a clinical setting, the PHI scanner is the piece that matters most. ICD-10-CM codes are only three to seven characters long, yet each one identifies a specific diagnosis. Combined with a medical record number and a provider NPI, they are enough to reconstruct a patient record from a prompt.&lt;/p&gt;

&lt;p&gt;AgentMesh also generates HIPAA readiness reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentmesh.compliance.pdf_report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ComplianceReporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Framework&lt;/span&gt;

&lt;span class="n"&gt;reporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ComplianceReporter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Framework&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HIPAA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# or
&lt;/span&gt;&lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Framework&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HIPAA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hipaa_report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs a structured report listing which controls are active, which are not, and what gaps remain. Useful for security reviews before a compliance audit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agentmesh-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentmesh.security.pii_scanner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScanMode&lt;/span&gt;

&lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScanMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MASK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finding_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scanner is in &lt;code&gt;agentmesh/security/pii_scanner.py&lt;/code&gt;. About 270 lines. No external dependencies beyond the Python standard library.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/anilatambharii/agentmesh" rel="noopener noreferrer"&gt;https://github.com/anilatambharii/agentmesh&lt;/a&gt;&lt;br&gt;
PyPI: &lt;code&gt;agentmesh-proxy&lt;/code&gt;&lt;br&gt;
Docker: &lt;code&gt;docker pull anilsprasad/agentmesh:latest&lt;/code&gt;&lt;br&gt;
Apache 2.0.&lt;/p&gt;

&lt;p&gt;What entity types would you add? What patterns are you seeing in your team's prompts that are not covered here?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Find me: &lt;a href="https://anilsprasad.com" rel="noopener noreferrer"&gt;anilsprasad.com&lt;/a&gt; · X &lt;a href="https://x.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/anilsprasad/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>security</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I put one proxy in front of every AI tool my team uses: 85% cache hits, 75% lower cost</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Sun, 14 Jun 2026 22:30:49 +0000</pubDate>
      <link>https://dev.to/anilatambharii/i-put-one-proxy-in-front-of-every-ai-tool-my-team-uses-85-cache-hits-75-lower-cost-262g</link>
      <guid>https://dev.to/anilatambharii/i-put-one-proxy-in-front-of-every-ai-tool-my-team-uses-85-cache-hits-75-lower-cost-262g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Your team's AI tools (Claude Code, Copilot, ChatGPT, Gemini, your own agents) each call the LLM API independently — no shared cache, no shared budget, no audit trail. AgentMesh is an open-source proxy that sits in front of all of them and runs every call through a three-layer cache, per-team quotas, cheapest-model routing, and a tamper-evident audit log. You point your tools at it with two env vars. On a reproducible benchmark (no API keys): 85% cache hits, 75% lower cost. Apache 2.0. → pip install agentmesh-proxy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem, in one sentence&lt;/strong&gt;&lt;br&gt;
Every AI tool on your team talks to the model on its own.&lt;br&gt;
Claude Code has its own connection. Copilot has its own. The ChatGPT tab in someone's browser has its own. Your LangGraph service has its own. None of them share a cache, a budget, or an audit log — so the same prompt gets paid for over and over, a runaway loop in one service is invisible to the others, and nobody can answer "what did we send to third-party APIs last quarter?"&lt;br&gt;
This isn't a discipline problem. It's a missing layer. So I built it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;60-second quickstart&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install agentmesh-proxy sentence-transformers
agentmesh serve --port 8080 --demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point any tool at it — no code changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Claude Code, or any Anthropic SDK tool
export ANTHROPIC_BASE_URL=http://localhost:8080

# Copilot / Cursor / any OpenAI SDK tool
export OPENAI_BASE_URL=http://localhost:8080/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every response comes back with governance headers so you can see what happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X-AgentMesh-Cache:     hit          # exact | semantic | miss
X-AgentMesh-Tokens:    0            # 0 on a cache hit
X-AgentMesh-Cost-USD:  0.000000     # $0 on a cache hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole integration. The agent code never knows the proxy is there.&lt;/p&gt;
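
&lt;p&gt;If you want to act on those headers in client code, a minimal sketch could look like this. It assumes the three header names shown above and that the cache header carries one of hit, exact, semantic, or miss. The function name is hypothetical and not part of AgentMesh.&lt;/p&gt;

```python
from typing import Mapping


def summarize_governance(headers: Mapping[str, str]) -> dict:
    """Collect the AgentMesh governance headers from a response's headers."""
    # HTTP header names are case-insensitive, so lower-case them before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    cache_status = lowered.get("x-agentmesh-cache", "unknown")
    return {
        "cache": cache_status,
        # Assumed value set, taken from the header comment in the article.
        "served_from_cache": cache_status in ("hit", "exact", "semantic"),
        "tokens": int(lowered.get("x-agentmesh-tokens", "0")),
        "cost_usd": float(lowered.get("x-agentmesh-cost-usd", "0")),
    }


example = {
    "X-AgentMesh-Cache": "semantic",
    "X-AgentMesh-Tokens": "0",
    "X-AgentMesh-Cost-USD": "0.000000",
}
print(summarize_governance(example))
```

&lt;p&gt;Any HTTP client that exposes response headers as a mapping can feed this function.&lt;/p&gt;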

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;br&gt;
Every call from the proxy or the SDK runs the same ordered pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bba10h1mkjt4cdw5pf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bba10h1mkjt4cdw5pf1.png" alt=" " width="800" height="1147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting part is the cache, because it does something most "LLM caches" don't.&lt;/p&gt;

&lt;p&gt;Exact-match caching almost never hits in real life, because people rephrase: they paste "You are a senior architect." in front of the question, switch between "optimise" and "optimize", or wrap things in markdown. So before anything is hashed or embedded, AgentMesh normalizes the prompt, stripping the noise that doesn't change meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh.optimizer.normalizer import normalize_prompt

normalize_prompt("You are a senior architect. **Please** review this design...")
# -&amp;gt; "review this design ..."   (persona prefix, markdown, filler removed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
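
&lt;p&gt;To make the idea concrete, here is a rough sketch of what such a normalizer could do. The rules, regexes, and the name normalize_sketch are my illustration only; the real agentmesh.optimizer.normalizer may work differently.&lt;/p&gt;

```python
import re

# Hypothetical rules for illustration; the real AgentMesh normalizer may differ.
PERSONA_PREFIX = re.compile(r"^\s*you are an? [^.]*\.\s*", re.IGNORECASE)
MARKDOWN_MARKS = re.compile(r"[*`]+")
FILLER_WORDS = re.compile(r"\b(please|kindly)\b", re.IGNORECASE)
SPELLING_VARIANTS = {"optimise": "optimize"}


def normalize_sketch(prompt: str) -> str:
    text = PERSONA_PREFIX.sub("", prompt)   # drop a leading "You are a ..." sentence
    text = MARKDOWN_MARKS.sub("", text)     # strip emphasis and inline-code marks
    text = FILLER_WORDS.sub("", text)       # remove politeness filler
    text = text.lower()
    for variant, canonical in SPELLING_VARIANTS.items():
        text = text.replace(variant, canonical)
    return " ".join(text.split())           # collapse runs of whitespace


print(normalize_sketch("You are a senior architect. **Please** optimise this design"))
```

&lt;p&gt;Two prompts that differ only in persona, markdown, filler, or spelling variant end up as the same string, which is what lets the exact-match layer hit.&lt;/p&gt;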



&lt;p&gt;The normalized prompt is then embedded and compared against cached prompts by cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh import SemanticCache

cache = SemanticCache(similarity_threshold=0.70)   # tune per workload
cache.put("Review this microservices design for scaling issues", response)

# Different wording, same intent -&amp;gt; still a hit
hit = cache.get("Analyze this distributed system design")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
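
&lt;p&gt;The lookup logic itself is small. The sketch below shows a threshold-based cosine lookup with a toy bag-of-words vector standing in for a real embedding model, so it only matches prompts that share words. TinySemanticCache and toy_embed are hypothetical names, not AgentMesh classes.&lt;/p&gt;

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySemanticCache:
    def __init__(self, similarity_threshold: float = 0.70):
        self.similarity_threshold = similarity_threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def get(self, prompt: str):
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Serve the cached answer only when it clears the threshold.
        if best_score >= self.similarity_threshold:
            return best_response
        return None


cache = TinySemanticCache(similarity_threshold=0.70)
cache.put("review this microservices design for scaling issues", "cached review")
print(cache.get("review this microservices design for scaling problems"))
print(cache.get("write a haiku about autumn"))
```

&lt;p&gt;With word counts, a paraphrase that shares few words would miss. Matching "different wording, same intent" needs a sentence embedding model in place of toy_embed, and the threshold then has to be tuned against that model.&lt;/p&gt;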



&lt;p&gt;Normalize, then embed: that is the whole trick. It's the difference between a cache that almost never hits and one that hits ~85% of the time on the benchmark below.&lt;/p&gt;

&lt;p&gt;And because every call already flows through one interceptor, a tamper-evident audit log is almost free — each entry is hash-chained (SHA-256) and signed with Ed25519:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh import AuditTrail
trail = AuditTrail()
# ... calls happen ...
assert trail.verify()   # walks the chain, checks every prev_hash + signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
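
&lt;p&gt;For intuition, here is a stripped-down sketch of the hash-chaining half using only the standard library. The Ed25519 signature is left out because it needs a third-party library, and the entry layout and function names are assumptions, not the AuditTrail internals.&lt;/p&gt;

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })


def verify_chain(chain: list) -> bool:
    expected_prev = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != expected_prev:
            return False  # link to the previous entry is broken
        if entry["hash"] != entry_hash(entry["prev_hash"], entry["payload"]):
            return False  # payload was changed after it was logged
        expected_prev = entry["hash"]
    return True


log = []
append_entry(log, {"team": "platform", "model": "small-8b", "cache": "miss"})
append_entry(log, {"team": "platform", "model": "small-8b", "cache": "semantic"})
print(verify_chain(log))              # True
log[0]["payload"]["cache"] = "exact"  # tamper with an earlier entry
print(verify_chain(log))              # False
```

&lt;p&gt;A bare hash chain only detects edits if the attacker cannot recompute every later hash, which is the gap the signatures close.&lt;/p&gt;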



&lt;p&gt;&lt;strong&gt;The benchmark (run it yourself, no API keys)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I didn't want to ship a number you can't check, so the benchmark runs in demo mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python examples/benchmark.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl222ppobr1m0axwnvlib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl222ppobr1m0axwnvlib.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total requests          20
Exact cache hits         2  (10%)
Semantic cache hits     15  (75%)
Total misses             3  (15%)

Cost WITHOUT AgentMesh  $0.0030
Cost WITH AgentMesh     $0.0008
Savings                 $0.0023  (75%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;20 requests, 5 topics, 4 phrasings each. 85% never reached the model; the 3 misses are the cold-start first call per topic — exactly what you'd expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's also a Chrome extension&lt;/strong&gt;&lt;br&gt;
A proxy can't see a prompt typed straight into the ChatGPT or Gemini tab. So there's an extension: declarativeNetRequest reroutes api.anthropic.com / api.openai.com to localhost:8080, and content scripts show a governance overlay before the prompt is sent. Stats persist across service-worker restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's deliberately not built yet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'd rather ship a small, verifiable core than a wide surface of half-features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cache is &lt;strong&gt;in-memory, single-process&lt;/strong&gt;: great for a local proxy, not yet a fleet. &lt;strong&gt;Redis is next&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;No native VS Code panel (env vars + the Chrome extension for now).&lt;/li&gt;
&lt;li&gt;No SAML/SSO identity propagation; quotas key on a team header.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are research problems; they're scope. PRs welcome, especially the Redis backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it / contribute&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install agentmesh-proxy sentence-transformers
python examples/benchmark.py     # 85% cache hits, 75% lower cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repo (star it)&lt;/strong&gt;: &lt;a href="https://github.com/anilatambharii/agentmesh" rel="noopener noreferrer"&gt;https://github.com/anilatambharii/agentmesh&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: agentmesh-proxy · &lt;strong&gt;Docker&lt;/strong&gt;: anilsprasad/agentmesh · also on Hugging Face&lt;br&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you run AI tools across a team and your bill is outgrowing your usage, clone it, run the benchmark, and tell me where it breaks. What would you build on top of this?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I took a production RAG pipeline from 61% to 97% accuracy (6 stages, full code)</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 12 Jun 2026 12:30:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-i-took-a-production-rag-pipeline-from-61-to-97-accuracy-6-stages-full-code-37mg</link>
      <guid>https://dev.to/anilatambharii/how-i-took-a-production-rag-pipeline-from-61-to-97-accuracy-6-stages-full-code-37mg</guid>
      <description>&lt;p&gt;Six months in production on a healthcare RAG system. Four rewrites. Here is the exact pipeline, every stage, and the code. The reference implementation is open source and linked at the bottom.&lt;/p&gt;

&lt;p&gt;TL;DR&lt;br&gt;
A weekend tutorial got our retrieval system to 61% accuracy. Six months of production work got it to 97%, under 2 seconds at P99, at $0.08 per query. The gains came from six stages added in order of return, not from a better model. Here is each one with code you can drop into your own pipeline.&lt;br&gt;
If you only have five minutes, here is the whole thing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query rewriting turns vague questions into searchable ones. +11 points. Almost free.&lt;/li&gt;
&lt;li&gt;Hybrid retrieval runs dense + BM25 and fuses them. +9 points.&lt;/li&gt;
&lt;li&gt;Cross-encoder reranking rescores the top candidates properly. +8 points.&lt;/li&gt;
&lt;li&gt;Context compression strips irrelevant sentences before generation. +5 points.&lt;/li&gt;
&lt;li&gt;Citation guard blocks any claim that is not grounded in a source.&lt;/li&gt;
&lt;li&gt;Answer validation routes multi-hop questions to a human instead of guessing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, measure where you actually fail&lt;br&gt;
Before writing any code, we instrumented the pipeline and traced every wrong answer to its cause. The result changed our entire roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rij20r9ldjjrfibfbh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rij20r9ldjjrfibfbh2.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;64% of failures were retrieval. 23% were chunking. Only 13% were the generator hallucinating from good context. We had spent two months tuning prompts, which was 13% of the problem. Lesson one: measure before you optimize, because your intuition about where RAG breaks is almost always wrong.&lt;/p&gt;
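
&lt;p&gt;The tally itself needs nothing fancy. A minimal sketch, assuming each wrong answer has already been hand-labelled with a root cause; the label list here is synthetic and simply mirrors the percentages above.&lt;/p&gt;

```python
from collections import Counter

# Synthetic labels for illustration: one root-cause tag per wrong answer.
failures = ["retrieval"] * 64 + ["chunking"] * 23 + ["generation"] * 13


def failure_breakdown(labels):
    """Return (cause, share) pairs sorted from most to least common."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cause, count / total) for cause, count in counts.most_common()]


for cause, share in failure_breakdown(failures):
    print(f"{cause:12s} {share:.0%}")
```

&lt;p&gt;The labelling is the expensive part. Once it exists, rerunning this after every pipeline change shows whether the failure mix actually moved.&lt;/p&gt;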

&lt;p&gt;Stage 1: query rewriting&lt;/p&gt;

&lt;p&gt;The user's raw message is rarely a good search query. "What did it say about the dosage?" has no good match in any index, because the meaning is in the previous turns. A small 8B model rewrites it into a standalone query first.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REWRITE_SYSTEM = """You rewrite a user's latest message into a single,
standalone search query. Resolve all pronouns and references using the
conversation. Keep it specific. Output only the rewritten query."""

def rewrite_query(history: list[dict], latest: str, llm) -&amp;gt; str:
    convo = "\n".join(f"{m['role']}: {m['content']}" for m in history[-4:])
    prompt = f"{convo}\nuser: {latest}\n\nStandalone search query:"
    out = llm.complete(
        system=REWRITE_SYSTEM, prompt=prompt,
        model="small-8b", max_tokens=64, temperature=0.0,
    ).strip()
    return out or latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cost: about $0.0001 per query. Gain: +11 points, from 61% to 72%. This is the highest-return change in the entire pipeline and the one most people skip.&lt;/p&gt;

&lt;p&gt;Stage 2: hybrid retrieval&lt;/p&gt;

&lt;p&gt;Embedding similarity is great at meaning and weak at exact terms. Two passages can be close in vector space and mean opposite things. Keyword search has the opposite failure mode. So run both and fuse with reciprocal rank fusion, which needs no weight tuning.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi

def hybrid_search(query, dense_index, bm25: BM25Okapi, corpus, k=20):
    # dense_index is assumed to return doc_ids that are positions in corpus,
    # matching the positional ids BM25 yields below.
    dense_hits = dense_index.search(query, k=k)            # [(doc_id, score)]
    bm25_scores = bm25.get_scores(query.split())
    bm25_hits = sorted(enumerate(bm25_scores),
                       key=lambda x: x[1], reverse=True)[:k]

    # Reciprocal rank fusion: each list adds 1 / (C + rank), with rank starting at 1.
    fused, C = {}, 60
    for rank, (doc_id, _) in enumerate(dense_hits, start=1):
        fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)
    for rank, (doc_id, _) in enumerate(bm25_hits, start=1):
        fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)

    ranked = sorted(fused.items(), key=lambda x: x[1], reverse=True)
    return [corpus[doc_id] for doc_id, _ in ranked[:k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Gain: +9 points, from 72% to 81%. Dense and sparse retrieval are not competitors. Use both.&lt;/p&gt;

&lt;p&gt;Stage 3: cross-encoder reranking&lt;/p&gt;

&lt;p&gt;The stage 2 retrievers are fast because they score the query and each document independently. A cross-encoder reads them together, which is slower and much more accurate. So you run it only on the top candidates the cheap retrievers already found.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Gain: +8 points, from 81% to 89%. The classic retrieve-then-rerank pattern, and it earns its cost because you only rerank a handful of candidates.&lt;/p&gt;

&lt;p&gt;Stage 4: context compression&lt;/p&gt;

&lt;p&gt;A retrieved passage can be the right document and still carry sentences that have nothing to do with the question. Each irrelevant sentence is a chance for the model to anchor on the wrong thing. So score sentences against the query and drop the ones that do not earn their place.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def compress_context(query, passages, relevance_model, threshold=0.5):
    kept = []
    for p in passages:
        sentences = split_sentences(p.text)
        scored = relevance_model.score(query, sentences)   # 0..1 per sentence
        relevant = [s for s, sc in zip(sentences, scored) if sc &amp;gt;= threshold]
        if relevant:
            kept.append(p.with_text(" ".join(relevant)))
    return kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Gain: +5 points, from 89% to 94%. Bonus: it cuts your generation token bill, because you stop paying to send the model context it should ignore.&lt;/p&gt;

&lt;p&gt;The pipeline so far&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funixpyylnqy9jr8vu50p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funixpyylnqy9jr8vu50p.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stages 1 through 4 took us from 61% to 94%. The last two stages do not chase points. They make the system honest, which in a regulated domain matters more.&lt;/p&gt;

&lt;p&gt;Stage 5: citation guard&lt;/p&gt;

&lt;p&gt;Before an answer ships, every claim in it has to trace back to a retrieved source. If a sentence has no supporting passage, it does not go out.&lt;/p&gt;

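&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def citation_guard(answer_claims, sources, entailment_model, min_support=0.7):
    for claim in answer_claims:
        # default=0.0 treats "no sources retrieved" as zero support
        # instead of raising ValueError on an empty sequence.
        support = max(
            (entailment_model.entails(s.text, claim) for s in sources),
            default=0.0,
        )
        if support &amp;lt; min_support:
            return False, claim    # ungrounded claim, block it
    return True, None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;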

&lt;p&gt;Stage 6: answer validation&lt;/p&gt;

&lt;p&gt;Some questions need three or more documents synthesized together. That is where RAG quietly fails by writing a fluent, wrong answer. Detect those and route them to a human.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_answer(query, answer, sources, confidence):
    if confidence &amp;lt; 0.6:
        return route_to_human(query, reason="low confidence")
    if requires_multi_hop(query) and len(sources) &amp;lt; 2:
        return route_to_human(query, reason="insufficient evidence")
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Together stages 5 and 6 took the production number from 94% to 97%. The real output is not the three points. It is the 3% the system now refuses to answer automatically. Serving an uncertain answer is not honesty. It is a liability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbd4ia92a8d6cdj5x0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbd4ia92a8d6cdj5x0d.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the climb, stage by stage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ne64utocqp1haoehe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ne64utocqp1haoehe3.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyg9bf4x6rjhlko6pgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyg9bf4x6rjhlko6pgi.png" alt=" " width="797" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to adopt this&lt;br&gt;
You do not need a six-month rebuild. Add stages in order of return and measure after each one, so you know which change earned which points.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query rewriting first. A day of work, nearly free to run.&lt;/li&gt;
&lt;li&gt;Hybrid retrieval next, because most teams run embeddings only.&lt;/li&gt;
&lt;li&gt;Reranking third.&lt;/li&gt;
&lt;li&gt;Compression fourth.&lt;/li&gt;
&lt;li&gt;Build the guards last, once accuracy is where you want it.&lt;/li&gt;
&lt;/ol&gt;
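
&lt;p&gt;Measuring after each stage only works if every variant is scored on the same fixed evaluation set. A minimal harness could look like the sketch below. The exact-match grading, the function names, and the toy data are assumptions for illustration, not the benchmark harness from the repo.&lt;/p&gt;

```python
from typing import Callable, Iterable, Tuple


def accuracy(answer_fn: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cases where the pipeline's answer equals the expected one."""
    cases = list(cases)
    if not cases:
        return 0.0
    correct = sum(1 for question, expected in cases if answer_fn(question) == expected)
    return correct / len(cases)


def compare_variants(variants: dict, cases) -> dict:
    """Score every pipeline variant on the same fixed evaluation set."""
    cases = list(cases)
    return {name: accuracy(fn, cases) for name, fn in variants.items()}


# Toy stand-ins so the sketch runs; swap in real pipeline callables.
eval_set = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
baseline = {"q1": "a1", "q2": "a2"}
with_rewrite = {"q1": "a1", "q2": "a2", "q3": "a3"}

scores = compare_variants(
    {
        "baseline": lambda q: baseline.get(q, ""),
        "plus query rewriting": lambda q: with_rewrite.get(q, ""),
    },
    eval_set,
)
print(scores)
```

&lt;p&gt;Real answers rarely match a reference string exactly, so in practice the equality check would be replaced by whatever grading you trust, while the set of cases stays frozen between runs.&lt;/p&gt;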

&lt;p&gt;Run it yourself&lt;/p&gt;

&lt;p&gt;The full reference implementation is open source, including every stage above, the benchmark harness that produced these numbers, and a 250-case adversarial test suite that caught the failures we did not anticipate. Clone it and run it today.&lt;/p&gt;

&lt;p&gt;github.com/anilatambharii&lt;/p&gt;

&lt;p&gt;I write up the production AI work in more depth, with the narrative and the failures, on my newsletter first. If the deep version is useful to you, that is where it lives: anilsprasad.substack.com&lt;/p&gt;

&lt;p&gt;If you are running RAG in production, I would like to know one thing in the comments: what does your error breakdown look like? Retrieval, chunking, or generation? I read all of them.&lt;/p&gt;

&lt;p&gt;#HumanWritten #ExpertiseFromField&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>anilprasad</category>
    </item>
    <item>
      <title>How We Cut AI Infrastructure Costs by 94% Without Sacrificing Quality (And How You Can Too)</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Mon, 01 Jun 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-we-cut-ai-infrastructure-costs-by-94-without-sacrificing-quality-and-how-you-can-too-5fim</link>
      <guid>https://dev.to/anilatambharii/how-we-cut-ai-infrastructure-costs-by-94-without-sacrificing-quality-and-how-you-can-too-5fim</guid>
      <description>&lt;p&gt;A production engineer's guide to building efficient AI systems at scale - complete with code, architecture, and real metrics&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Series: Production AI Infrastructure&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📧 Originally published on &lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;my Substack newsletter&lt;/a&gt;&lt;/strong&gt; where I share weekly deep-dives on production AI infrastructure. Subscribe for early access to future articles!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Three months ago, our AI infrastructure bill was &lt;strong&gt;$47,000 per month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Last month? &lt;strong&gt;$2,800&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Same quality. Same performance. Same user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;94% cost reduction. $530,000 saved annually.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a case study about "theoretical optimization." This is a field guide from production systems processing &lt;strong&gt;2.3 million events per second&lt;/strong&gt;, serving millions of users, and running 24/7 without downtime.&lt;/p&gt;

&lt;p&gt;The efficiency revolution in AI is here. Small models are closing the gap with frontier models faster than anyone predicted. &lt;strong&gt;The race to bigger is over. The race to efficient just started.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's everything we learned building production AI infrastructure at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;PART 1: The Cost Crisis Nobody Talks About&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI infrastructure costs are spiraling out of control, and most companies don't realize it until it's too late.&lt;/p&gt;

&lt;p&gt;The pattern is predictable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1-3&lt;/strong&gt;: Prototype with GPT-4 or Claude. Costs are manageable ($500-2,000/month). Everyone's happy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4-6&lt;/strong&gt;: Scale to production. Usage increases 10x. Costs jump to $15K-30K/month. Finance starts asking questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-9&lt;/strong&gt;: Growth continues. Costs hit $40K-60K/month. Emergency meetings. "Can we optimize this?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 10+&lt;/strong&gt;: Either massive optimization effort or AI features get cut. The dream dies or the budget explodes.&lt;/p&gt;

&lt;p&gt;We've seen this pattern across dozens of companies. The problem isn't the technology—it's the architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why AI Costs Spiral&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Three core issues:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. The "Bigger Model = Better" Myth&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The default assumption: Use the biggest, most capable model for everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 for summarization? Sure.&lt;/li&gt;
&lt;li&gt;Claude 3.5 for classification? Why not.&lt;/li&gt;
&lt;li&gt;Llama 2 70B for simple Q&amp;amp;A? Absolutely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's the reality: &lt;strong&gt;Most AI workloads don't need frontier model capability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Industry analysis shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;10% of AI workloads&lt;/strong&gt; require maximum capability (complex reasoning, multi-step analysis)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30-40%&lt;/strong&gt; can run on medium models (7B-70B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50-60%&lt;/strong&gt; can run on small models (3B-8B parameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Yet 80% of companies use frontier models for 80% of workloads.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's like using a Lamborghini for your daily commute. Expensive. Unnecessary. Wasteful.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Zero Caching Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Every request hits the model. Even identical requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What's the weather today?" → Model inference → $0.002
"What's the weather today?" (5 minutes later) → Model inference → $0.002
"What's the weather today?" (user refresh) → Model inference → $0.002
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same question. Same answer. Triple the cost.&lt;/p&gt;

&lt;p&gt;With caching: $0.002 for the first request, $0.0001 for subsequent requests (20x cheaper).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without caching, you're burning 70-90% of your budget on duplicate work.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. No Routing Logic&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Every request goes to the same model, regardless of complexity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple query: "What time is it?" → 70B model inference&lt;/li&gt;
&lt;li&gt;Complex query: "Analyze quarterly revenue by region and predict Q3 trends" → 70B model inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simple query could run on a 3B model at 1/20th the cost and 10x faster.&lt;/p&gt;

&lt;p&gt;But without routing logic, both queries cost the same. &lt;strong&gt;You're overpaying for 60-80% of requests.&lt;/strong&gt;&lt;/p&gt;
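
&lt;p&gt;Routing logic does not have to be sophisticated to start paying off. Here is a deliberately crude sketch: the model names, the relative costs (1 vs 20, echoing the 1/20th figure above), and the keyword heuristic are all assumptions for illustration. A production router would more likely use a trained classifier.&lt;/p&gt;

```python
# Hypothetical model names and relative costs, chosen only for illustration.
MODELS = {
    "small": {"name": "small-3b", "relative_cost": 1},
    "large": {"name": "large-70b", "relative_cost": 20},
}

# Crude signals that a query needs multi-step reasoning.
COMPLEX_HINTS = ("analyze", "predict", "compare", "explain why", "trend")


def route(query: str) -> dict:
    """Pick the cheapest model that a simple complexity check allows."""
    text = query.lower()
    looks_complex = len(text.split()) > 12 or any(h in text for h in COMPLEX_HINTS)
    return MODELS["large"] if looks_complex else MODELS["small"]


print(route("What time is it?")["name"])
print(route("Analyze quarterly revenue by region and predict Q3 trends")["name"])
```

&lt;p&gt;The useful property is the default: requests go to the small model unless something argues for the large one, instead of the other way around.&lt;/p&gt;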

&lt;h3&gt;
  
  
  &lt;strong&gt;The Real Production Cost Breakdown&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's what a typical $47,000/month LLM infrastructure actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model Inference:        $32,000 (68%)
Infrastructure:         $8,000 (17%)
Data Processing:        $4,000 (8%)
Monitoring/Logging:     $2,000 (4%)
Networking:             $1,000 (2%)
---
Total:                  $47,000/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The opportunity: 90%+ of model inference costs are optimizable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not through vague "best practices." Through specific, proven architectural changes.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4traruafa6ymam6ibwgt.png" alt=" " width="800" height="400"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;PART 2: The 4-Layer Optimization Stack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We rebuilt our AI infrastructure from the ground up with one principle: &lt;strong&gt;Make efficiency the default, not an afterthought.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The result: A 4-layer optimization stack that reduced costs by 94% while maintaining—and in some cases improving—quality and performance.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiopzja9uocvvaxbwvddg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiopzja9uocvvaxbwvddg.png" alt=" " width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 1: Semantic Caching (70% Cost Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Problem&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Users ask the same questions different ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I reset my password?"&lt;/li&gt;
&lt;li&gt;"I forgot my password, help"&lt;/li&gt;
&lt;li&gt;"Password reset instructions"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three queries. Same intent. Same answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without semantic caching: 3x model calls&lt;/li&gt;
&lt;li&gt;With semantic caching: 1x model call, 2x cache hits&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How Semantic Caching Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of exact-match caching (traditional Redis), we cache by &lt;em&gt;semantic similarity&lt;/em&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embed the query&lt;/strong&gt; using a small embedding model (all-MiniLM-L6-v2, 22M parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search vector DB&lt;/strong&gt; for similar queries (cosine similarity &amp;gt;0.95)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return cached response&lt;/strong&gt; if match found&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate + cache&lt;/strong&gt; if no match&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Stack&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt;: all-MiniLM-L6-v2 (inference: &amp;lt;10ms, cost: negligible)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector DB&lt;/strong&gt;: Qdrant (self-hosted), Pinecone (managed), or FAISS (an in-process index library rather than a database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity threshold&lt;/strong&gt;: 0.95 (adjustable based on use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Results in Production&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache hit rate: 99.2%
Average cache latency: 8ms
Average cache miss latency: 340ms
Cost per cache hit: $0.00001
Cost per cache miss: $0.002

Monthly queries: 45M
Cache hits: 44.6M (99.2%)
Cache misses: 360K (0.8%)

Semantic cache cost: $446
Without cache cost: $90,000

Savings: $89,554/month (99.5% reduction on this layer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
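
&lt;p&gt;The arithmetic behind these figures can be checked directly. The sketch below recomputes the monthly cost from the per-request prices and hit rate listed above, and also counts the cache misses, which still reach the model. Variable names are illustrative.&lt;/p&gt;

```python
monthly_queries = 45_000_000
hit_rate = 0.992
cost_per_hit = 0.00001   # USD per cache hit, from the figures above
cost_per_miss = 0.002    # USD per full model call, from the figures above

hits = round(monthly_queries * hit_rate)
misses = monthly_queries - hits

hit_cost = hits * cost_per_hit
miss_cost = misses * cost_per_miss
baseline_cost = monthly_queries * cost_per_miss   # every query reaches the model

print(f"hits={hits:,} misses={misses:,}")
print(f"hit cost=${hit_cost:,.2f} miss cost=${miss_cost:,.2f}")
print(f"baseline=${baseline_cost:,.2f}")
print(f"savings=${baseline_cost - hit_cost - miss_cost:,.2f}")
```

&lt;p&gt;With misses included, the cached total is the hit cost plus the miss cost, so the net saving lands slightly below the $89,554 shown above, which subtracts only the cache-hit cost from the baseline.&lt;/p&gt;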



&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation (High-Level)&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Semantic cache check
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;similar_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similar_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similar_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Fast cache hit
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Expensive generation
&lt;/span&gt;    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
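&lt;p&gt;To make the threshold check concrete, here is a minimal, dependency-free sketch of the same lookup. It assumes cosine similarity as the distance metric and uses a plain list in place of the vector database; &lt;code&gt;InMemorySemanticCache&lt;/code&gt; and &lt;code&gt;cosine_similarity&lt;/code&gt; are hypothetical names for illustration, not part of our production code.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Toy stand-in for a vector DB plus response store (hypothetical helper)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        """Remember a response under the embedding of the query that produced it."""
        self.entries.append((embedding, response))
```

&lt;p&gt;A linear scan like this is only reasonable for a handful of entries. A real vector database replaces the loop with an approximate nearest-neighbor index, but the threshold logic stays the same.&lt;/p&gt;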



&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Semantic caching works because users are less creative than we think. In our production traffic, more than 99% of queries are variations of questions we've already answered.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 2: Redis Caching (Additional 15% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Semantic caching alone handles 99% of cache hits, but every lookup pays for an embedding and a vector search. Redis adds a faster layer in front of it for frequently repeated &lt;em&gt;exact&lt;/em&gt; queries.&lt;/p&gt;

&lt;p&gt;Why both?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic cache&lt;/strong&gt;: Slower (8-15ms), handles similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis cache&lt;/strong&gt;: Faster (1-3ms), handles exact matches&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Check Redis first (exact match, 1-3ms)&lt;/li&gt;
&lt;li&gt;If miss → Check semantic cache (similarity match, 8-15ms)&lt;/li&gt;
&lt;li&gt;If miss → Generate response (model inference, 200-400ms)&lt;/li&gt;
&lt;/ol&gt;
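&lt;p&gt;The three steps above can be sketched as a single function. This is an illustration under stated assumptions: a plain dict stands in for Redis, and &lt;code&gt;semantic_lookup&lt;/code&gt; and &lt;code&gt;generate&lt;/code&gt; are hypothetical callables you would wire to your vector DB and model endpoint.&lt;/p&gt;

```python
def tiered_lookup(query, exact_cache, semantic_lookup, generate):
    """Try the cheapest tier first and report which tier answered."""
    # Tier 1: exact match (a dict here, Redis in production)
    if query in exact_cache:
        return exact_cache[query], "redis_hit"

    # Tier 2: similarity match; the callable returns None on a miss
    cached = semantic_lookup(query)
    if cached is not None:
        return cached, "semantic_hit"

    # Tier 3: model inference, then remember the answer for verbatim repeats
    response = generate(query)
    exact_cache[query] = response
    return response, "cache_miss"
```

&lt;p&gt;Returning the tier label alongside the response makes it straightforward to count hits per tier later.&lt;/p&gt;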

&lt;h4&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Redis hit rate on semantic misses: 95%
Average latency: 2ms
Cost per hit: $0.00001

Additional savings: $6,800/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Combined Layer 1 + 2 Performance&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total cache hit rate: 99.7%
Average response time: 12ms (cached) vs 340ms (uncached)
Total caching cost: $7,246/month
Without caching cost: $90,000/month

Savings so far: $82,754/month (92% reduction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
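&lt;p&gt;The savings figure is simple arithmetic on the two monthly totals above. A small helper (the name &lt;code&gt;savings&lt;/code&gt; is ours, for illustration) makes it easy to rerun with your own numbers:&lt;/p&gt;

```python
def savings(baseline_cost, optimized_cost):
    """Return (absolute savings, percentage reduction) for a monthly bill."""
    saved = baseline_cost - optimized_cost
    percent = 100 * saved / baseline_cost
    return saved, percent


saved, percent = savings(90_000, 7_246)
print(f"${saved:,}/month saved ({percent:.0f}% reduction)")
# prints: $82,754/month saved (92% reduction)
```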






&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 3: Model Routing (Additional 12% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not all queries are created equal.&lt;/p&gt;

&lt;p&gt;"What's 2+2?" shouldn't cost the same as "Analyze these 10,000 financial transactions and flag anomalies."&lt;/p&gt;

&lt;p&gt;But without routing logic, they do.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Solution: Complexity-based routing&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify query complexity&lt;/strong&gt; (using a small 1B classifier model, &amp;lt;5ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route to appropriate model&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Simple → 8B model (fast, cheap)&lt;/li&gt;
&lt;li&gt;Medium → 70B model (balanced)&lt;/li&gt;
&lt;li&gt;Complex → 405B model (maximum capability)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Complexity Classification&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Fast classifier model (1B parameters, &amp;lt;5ms inference)
&lt;/span&gt;    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;detect_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# factual, analytical, creative
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context_required&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;needs_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multi_step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;is_multi_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 8B model
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 70B model
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 405B model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Production Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query distribution:
- Simple (8B): 62% of queries
- Medium (70B): 28% of queries
- Complex (405B): 10% of queries

Cost comparison:
- 8B model: $0.0001/query
- 70B model: $0.001/query
- 405B model: $0.01/query

Average cost per query (with routing): $0.0008
Average cost per query (70B for all): $0.001

Savings: 20% reduction in model costs
Monthly impact: $5,600 saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Quality Impact: Zero degradation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We A/B tested 10,000 queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8B model accuracy on simple queries: &lt;strong&gt;97.2%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;70B model accuracy on same queries: &lt;strong&gt;97.4%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;User-perceived difference: &lt;strong&gt;none&lt;/strong&gt; (the 0.2-point accuracy gap was not statistically significant)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Users can't tell the difference between 8B and 70B on simple queries. Don't overpay for capability you don't need.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 4: Efficient Models (Additional 15% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final layer: Replace expensive models with efficient alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shift&lt;/strong&gt;: Llama 2 70B → Llama 3.1 8B&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why This Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Llama 3.1 8B (released 2024) matches Llama 2 70B (2023) performance on most tasks.&lt;/p&gt;

&lt;p&gt;But it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1/9th the parameters&lt;/li&gt;
&lt;li&gt;2.8x faster inference (P99 latency)&lt;/li&gt;
&lt;li&gt;16x cheaper per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F754lct3a0y3zvmse0ibz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F754lct3a0y3zvmse0ibz.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Benchmark Comparison (Production Data)&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Llama 2 70B:
- Parameters: 70B
- Inference latency (P99): 340ms
- Cost per 1M tokens: $0.65
- Accuracy (MMLU): 69.7%

Llama 3.1 8B:
- Parameters: 8B
- Inference latency (P99): 120ms
- Cost per 1M tokens: $0.04
- Accuracy (MMLU): 69.4%

Quality difference: 0.3 percentage points (negligible)
Speed improvement: 2.8x faster
Cost improvement: 16x cheaper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
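&lt;p&gt;The summary lines follow directly from the figures listed for each model. A quick check, using only the numbers in the comparison above (the dictionary keys are our own labels):&lt;/p&gt;

```python
llama2_70b = {"p99_ms": 340, "usd_per_1m_tokens": 0.65, "mmlu": 69.7}
llama31_8b = {"p99_ms": 120, "usd_per_1m_tokens": 0.04, "mmlu": 69.4}

# Ratios of the larger model's figure to the smaller model's figure
speedup = llama2_70b["p99_ms"] / llama31_8b["p99_ms"]
cost_ratio = llama2_70b["usd_per_1m_tokens"] / llama31_8b["usd_per_1m_tokens"]
# MMLU gap in percentage points
mmlu_gap = llama2_70b["mmlu"] - llama31_8b["mmlu"]

print(f"{speedup:.1f}x faster, {cost_ratio:.0f}x cheaper, {mmlu_gap:.1f} points lower on MMLU")
```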



&lt;h4&gt;
  
  
  &lt;strong&gt;Migration Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We didn't switch overnight. We tested:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1-2&lt;/strong&gt;: Shadow mode (8B runs alongside 70B, results logged but not served)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3-4&lt;/strong&gt;: A/B test (50% traffic to 8B, 50% to 70B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 5-6&lt;/strong&gt;: 90% to 8B, 10% to 70B (monitor quality)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 7+&lt;/strong&gt;: 100% to 8B, 70B for exceptions only&lt;/li&gt;
&lt;/ol&gt;
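&lt;p&gt;One way to implement the percentage splits in this schedule is deterministic bucketing, so a given user stays on the same model as the rollout widens. The sketch below is an assumption about how such a split could be built, not a description of our exact traffic router; &lt;code&gt;assign_model&lt;/code&gt; and the &lt;code&gt;user_id&lt;/code&gt; key are hypothetical.&lt;/p&gt;

```python
import hashlib


def assign_model(user_id, percent_to_8b):
    """Deterministically bucket a user into the 8B or 70B arm."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    # Buckets below the rollout percentage go to the new model
    return "8b" if percent_to_8b > bucket else "70b"
```

&lt;p&gt;Because the bucket depends only on the user ID, raising the percentage from 50 to 90 moves additional users to the 8B model without flipping anyone back to 70B.&lt;/p&gt;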

&lt;h4&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Quality degradation: 0.2% (within acceptable range)
User complaints: 0 (nobody noticed)
Speed improvement: 2.8x (users noticed this positively)
Cost reduction: 94% (from all 4 layers combined)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;The Complete Stack in Production&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request Flow:
1. Check Redis (exact match) → 95% hit rate, 2ms
2. If miss → Check semantic cache → 99% hit rate, 12ms
3. If miss → Classify complexity → 5ms
4. Route to model:
   - 62% → Llama 3.1 8B
   - 28% → Llama 3.1 70B
   - 10% → Llama 3.1 405B
5. Cache response
6. Return to user

Total average latency: 15ms (cached) vs 125ms (uncached)
Total cost per query: $0.00008 (vs $0.001 before optimization)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;PART 3: Complete Implementation Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture Overview&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our production stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → API Gateway → Request Router
                              ↓
                  [Redis Cache Layer]
                              ↓
              [Semantic Cache (Vector DB)]
                              ↓
                  [Complexity Classifier]
                              ↓
            ┌─────────┬─────────┬─────────┐
            ↓         ↓         ↓         ↓
         8B Model  70B Model  405B Model  (Fallback)
            ↓         ↓         ↓         ↓
                  Response Aggregator
                              ↓
                      User Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Technology Stack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Caching Layer&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis: ElastiCache (AWS) or Redis Cloud&lt;/li&gt;
&lt;li&gt;Vector DB: Qdrant (self-hosted) or Pinecone (managed)&lt;/li&gt;
&lt;li&gt;Embedding model: all-MiniLM-L6-v2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model Serving&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference: vLLM (optimized serving)&lt;/li&gt;
&lt;li&gt;Infrastructure: NVIDIA A10G GPUs (cost-efficient)&lt;/li&gt;
&lt;li&gt;Orchestration: Kubernetes + KServe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Pipeline&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event streaming: Apache Kafka&lt;/li&gt;
&lt;li&gt;Processing: Apache Flink&lt;/li&gt;
&lt;li&gt;Metrics: Prometheus + Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APM: Datadog or New Relic&lt;/li&gt;
&lt;li&gt;Logging: CloudWatch or Elasticsearch&lt;/li&gt;
&lt;li&gt;Alerting: PagerDuty&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deployment Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Infrastructure (Week 1-2)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up Redis cluster (ElastiCache or self-hosted)&lt;/li&gt;
&lt;li&gt;Deploy vector database (Qdrant recommended for self-hosting)&lt;/li&gt;
&lt;li&gt;Configure embedding model endpoint&lt;/li&gt;
&lt;li&gt;Set up model serving infrastructure (vLLM + GPU instances)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Caching Implementation (Week 3-4)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implement Redis caching layer&lt;/li&gt;
&lt;li&gt;Deploy semantic caching with vector DB&lt;/li&gt;
&lt;li&gt;Test cache hit rates and latency&lt;/li&gt;
&lt;li&gt;Optimize similarity thresholds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Routing Logic (Week 5-6)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train complexity classifier (or use rule-based initially)&lt;/li&gt;
&lt;li&gt;Implement routing logic&lt;/li&gt;
&lt;li&gt;Deploy multiple model endpoints (8B, 70B, 405B)&lt;/li&gt;
&lt;li&gt;A/B test routing accuracy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Migration (Week 7-8)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shadow mode testing (new stack runs alongside old)&lt;/li&gt;
&lt;li&gt;Gradual traffic migration (10% → 50% → 90% → 100%)&lt;/li&gt;
&lt;li&gt;Monitor quality and cost metrics&lt;/li&gt;
&lt;li&gt;Rollback capability ready at all times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 5: Optimization (Ongoing)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fine-tune cache similarity thresholds&lt;/li&gt;
&lt;li&gt;Optimize model routing logic&lt;/li&gt;
&lt;li&gt;Monitor and reduce cache misses&lt;/li&gt;
&lt;li&gt;Continuous cost tracking and optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code Examples&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Semantic Cache Implementation&lt;/strong&gt; (Python):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize components
&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6333&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_with_semantic_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Check Redis (exact match)
&lt;/span&gt;    &lt;span class="n"&gt;redis_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;redis_hit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Generate embedding
&lt;/span&gt;    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Search vector DB for similar queries
&lt;/span&gt;    &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;similarity_threshold&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Return cached if similar query found
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cached_query_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
        &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached_query_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;semantic_hit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 5: Generate new response (cache miss)
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_llm_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 6: Cache response
&lt;/span&gt;    &lt;span class="n"&gt;query_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cache_miss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model Routing Implementation&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Classify complexity
&lt;/span&gt;    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Route based on complexity
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;8b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;70b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;405b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;

    &lt;span class="c1"&gt;# Call appropriate model
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Rule-based classification (can be replaced with ML model)
&lt;/span&gt;    &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple heuristics
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;requires_reasoning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_multi_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
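
&lt;p&gt;The classifier above relies on two helpers, &lt;code&gt;requires_reasoning&lt;/code&gt; and &lt;code&gt;is_multi_step&lt;/code&gt;. If they are not already defined in your codebase, a minimal keyword-based sketch might look like the following. The hint lists are assumptions chosen for illustration, not values from our production system, and should be tuned against real traffic.&lt;/p&gt;

```python
import re

# Hypothetical keyword lists; tune them against your own query logs.
REASONING_HINTS = ("why", "explain", "compare", "analyze", "prove", "derive")
MULTI_STEP_HINTS = ("then", "after that", "step by step", "first", "finally")


def _contains_any(query, hints):
    """Return True if any hint appears as a whole word or phrase in the query."""
    text = query.lower()
    return any(re.search(r"\b" + re.escape(hint) + r"\b", text) for hint in hints)


def requires_reasoning(query):
    """Heuristic: the query asks for an explanation, comparison, or derivation."""
    return _contains_any(query, REASONING_HINTS)


def is_multi_step(query):
    """Heuristic: the query chains several instructions together."""
    return _contains_any(query, MULTI_STEP_HINTS)
```

&lt;p&gt;Keyword matching is cheap enough to run on every request, but it will misclassify some queries, which is one reason to track routing accuracy.&lt;/p&gt;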



&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring and Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Track&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cache Performance&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis hit rate (target: &amp;gt;95%)&lt;/li&gt;
&lt;li&gt;Semantic hit rate (target: &amp;gt;99%)&lt;/li&gt;
&lt;li&gt;Average cache latency (target: &amp;lt;15ms)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Performance&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50, P95, P99 latency by model&lt;/li&gt;
&lt;li&gt;Throughput (queries/second)&lt;/li&gt;
&lt;li&gt;Error rate (target: &amp;lt;0.1%)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per query (overall and by model)&lt;/li&gt;
&lt;li&gt;Daily/monthly spend tracking&lt;/li&gt;
&lt;li&gt;Cost attribution by endpoint/user&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quality Metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response accuracy (A/B testing)&lt;/li&gt;
&lt;li&gt;User satisfaction (thumbs up/down)&lt;/li&gt;
&lt;li&gt;Escalation rate (queries requiring human review)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
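
&lt;p&gt;As a rough illustration of how the cache and latency metrics above can be derived from raw counters and samples, here is a small sketch using only the Python standard library. The function names are hypothetical; in practice these numbers usually come from your metrics backend rather than application code.&lt;/p&gt;

```python
import statistics


def hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def latency_percentiles(samples_ms):
    """Return P50, P95 and P99 from raw latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    samples = list(range(1, 101))  # synthetic latencies: 1 ms .. 100 ms
    print(hit_rate(hits=950, misses=50))
    print(latency_percentiles(samples))
```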

&lt;p&gt;&lt;strong&gt;Dashboard Setup&lt;/strong&gt; (Grafana):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Panel 1: Cache Hit Rates (last 24h)
- Redis: 95.2%
- Semantic: 99.1%
- Overall: 99.7%

Panel 2: Cost Trends (last 30 days)
- Total spend: $2,800
- Trend: -94% vs Month 1
- Projection: $2,850 next month

Panel 3: Model Distribution
- 8B: 62% of queries
- 70B: 28% of queries
- 405B: 10% of queries

Panel 4: Latency P99
- Cached: 12ms
- Uncached: 125ms
- Overall: 18ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;PART 4: Results &amp;amp; ROI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Month-by-Month Cost Reduction&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Month 1 (Baseline):
- Infrastructure cost: $47,000
- Queries served: 42M
- Cost per query: $0.00112

Month 2 (Redis caching deployed):
- Infrastructure cost: $38,000
- Queries served: 44M
- Cost per query: $0.00086
- Reduction: 19%

Month 3 (Semantic caching deployed):
- Infrastructure cost: $12,000
- Queries served: 45M
- Cost per query: $0.00027
- Reduction: 74% (from baseline)

Month 4 (Model routing deployed):
- Infrastructure cost: $6,500
- Queries served: 46M
- Cost per query: $0.00014
- Reduction: 86% (from baseline)

Month 5 (Efficient models deployed):
- Infrastructure cost: $2,800
- Queries served: 47M
- Cost per query: $0.00006
- Reduction: 94% (from baseline)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
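
&lt;p&gt;The per-query costs and reduction percentages above follow directly from the monthly cost and query volume. The short sketch below recomputes them. Note that the reduction is measured on total infrastructure cost against Month 1, not on cost per query. The helper names are illustrative.&lt;/p&gt;

```python
# (label, monthly infrastructure cost in USD, queries served), from the table above
MONTHS = [
    ("Month 1", 47_000, 42_000_000),
    ("Month 2", 38_000, 44_000_000),
    ("Month 3", 12_000, 45_000_000),
    ("Month 4", 6_500, 46_000_000),
    ("Month 5", 2_800, 47_000_000),
]


def cost_per_query(cost, queries):
    """Average infrastructure cost of a single query, in USD."""
    return cost / queries


def reduction_vs_baseline(cost, baseline_cost):
    """Fractional drop in total monthly cost relative to the baseline month."""
    return 1 - cost / baseline_cost


baseline_cost = MONTHS[0][1]
for label, cost, queries in MONTHS:
    print(
        f"{label}: ${cost_per_query(cost, queries):.5f}/query, "
        f"{reduction_vs_baseline(cost, baseline_cost):.0%} below baseline"
    )
```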



&lt;h3&gt;
  
  
  &lt;strong&gt;Performance Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before Optimization&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P50 latency: 280ms
P95 latency: 420ms
P99 latency: 650ms
Throughput: 1,200 queries/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After Optimization&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P50 latency: 8ms (97% faster)
P95 latency: 15ms (96% faster)
P99 latency: 125ms (81% faster)
Throughput: 8,500 queries/sec (7x improvement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
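
&lt;p&gt;For reference, the "faster" percentages are computed as the relative drop in latency, and the throughput figure as a simple ratio. A minimal sketch, with hypothetical function names:&lt;/p&gt;

```python
def percent_faster(before_ms, after_ms):
    """Relative latency reduction, as a percentage of the original latency."""
    return (before_ms - after_ms) / before_ms * 100


def throughput_gain(before_qps, after_qps):
    """How many times more queries per second the system now sustains."""
    return after_qps / before_qps


print(round(percent_faster(280, 8)))           # P50
print(round(percent_faster(420, 15)))          # P95
print(round(percent_faster(650, 125)))         # P99
print(round(throughput_gain(1_200, 8_500), 1))
```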



&lt;p&gt;&lt;strong&gt;User Experience Impact&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page load times: -60% (faster responses)&lt;/li&gt;
&lt;li&gt;User complaints: 0 (nobody noticed quality change)&lt;/li&gt;
&lt;li&gt;User satisfaction: +12% (noticed speed improvement)&lt;/li&gt;
&lt;li&gt;Feature usage: +28% (faster = more engagement)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ROI Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Investment&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Engineering time: 8 weeks × 2 engineers = 16 engineer-weeks
Infrastructure setup: $5,000 (one-time)
Testing and monitoring tools: $2,000 (one-time)

Total investment: ~$80,000-$100,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Savings&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly savings: $44,200 ($47K - $2.8K)
Annual savings: $530,400
3-year savings: $1,591,200

ROI (Year 1): 530% ($530K saved / $100K invested; 430% net of the investment)
Payback period: 2.3 months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
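
&lt;p&gt;The savings figures can be reproduced with a few lines of arithmetic. The sketch below uses the upper end of the investment estimate ($100,000) and reports both the gross return (savings divided by investment, the 530% quoted above) and the net ROI after subtracting the investment. The function name and dictionary keys are illustrative.&lt;/p&gt;

```python
def roi_summary(monthly_before, monthly_after, investment):
    """Summarize savings, first-year return and payback period (all in USD)."""
    monthly_savings = monthly_before - monthly_after
    annual_savings = monthly_savings * 12
    return {
        "monthly_savings": monthly_savings,
        "annual_savings": annual_savings,
        "three_year_savings": annual_savings * 3,
        # Gross return: first-year savings relative to the investment.
        "gross_return_pct": annual_savings / investment * 100,
        # Net ROI: first-year savings minus the investment, relative to it.
        "net_roi_pct": (annual_savings - investment) / investment * 100,
        "payback_months": investment / monthly_savings,
    }


summary = roi_summary(monthly_before=47_000, monthly_after=2_800, investment=100_000)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```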



&lt;h3&gt;
  
  
  &lt;strong&gt;Lessons Learned&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Worked&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;✅ &lt;strong&gt;Gradual migration&lt;/strong&gt; - Shadow mode → A/B test → full rollout prevented disasters&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Monitoring first&lt;/strong&gt; - Set up dashboards before making changes, not after&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Conservative thresholds&lt;/strong&gt; - Started with 0.98 similarity, lowered to 0.95 after confidence built&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Rollback plan&lt;/strong&gt; - Having old infrastructure ready for instant rollback was crucial&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Quality gates&lt;/strong&gt; - Automated quality checks caught issues before users did&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work Initially&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;❌ &lt;strong&gt;Too aggressive cache invalidation&lt;/strong&gt; - Our first attempt expired entries after 1 hour, which was too frequent. We changed the TTL to 24 hours.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Wrong similarity threshold&lt;/strong&gt; - Started at 0.90, got too many false positives. Raised to 0.95.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Inadequate monitoring&lt;/strong&gt; - Missed cache memory issues initially. Added memory alerts.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;No cost attribution&lt;/strong&gt; - Couldn't tell which endpoints were expensive. Added detailed tracking.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Common Pitfalls to Avoid&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #1: Caching everything&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't cache time-sensitive queries (stock prices, weather)&lt;/li&gt;
&lt;li&gt;Don't cache user-specific data without proper key isolation&lt;/li&gt;
&lt;li&gt;Don't cache low-frequency queries (waste of memory)&lt;/li&gt;
&lt;/ul&gt;
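
&lt;p&gt;One way to encode these three rules is a small admission policy plus a scoped cache key. This is a sketch under assumptions: the hint list, the frequency cutoff, and the key layout are all hypothetical and would need to be adapted to your domain.&lt;/p&gt;

```python
import hashlib

# Hypothetical markers for time-sensitive queries; extend for your domain.
TIME_SENSITIVE_HINTS = ("stock price", "weather", "right now", "today")
# Assumed cutoff: only cache a query once it has been seen this many times.
MIN_REQUESTS_BEFORE_CACHING = 3


def should_cache(query, times_seen):
    """Skip time-sensitive queries and queries that are rarely repeated."""
    text = query.lower()
    if any(hint in text for hint in TIME_SENSITIVE_HINTS):
        return False
    return times_seen >= MIN_REQUESTS_BEFORE_CACHING


def cache_key(query, user_id=None):
    """Namespace user-specific entries so one user never reads another's data."""
    scope = f"user:{user_id}" if user_id is not None else "shared"
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{scope}:{digest}"
```

&lt;p&gt;Putting the user ID into the key prefix keeps personal answers isolated, while queries without a user scope still share a single entry.&lt;/p&gt;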

&lt;p&gt;&lt;strong&gt;Pitfall #2: Wrong model routing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't route based on query length alone (misleading)&lt;/li&gt;
&lt;li&gt;Don't use overly complex routing logic (adds latency)&lt;/li&gt;
&lt;li&gt;Don't forget to measure routing accuracy&lt;/li&gt;
&lt;/ul&gt;
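
&lt;p&gt;Measuring routing accuracy can be as simple as comparing the router's chosen tier with a hand-labeled tier on a sample of queries. The sketch below assumes the three tiers used by the classifier (simple, medium, complex) and separates under-routing, where a hard query went to a smaller model, from over-routing, where an easy query went to a larger one. The function and key names are illustrative.&lt;/p&gt;

```python
TIER_ORDER = {"simple": 0, "medium": 1, "complex": 2}


def routing_report(pairs):
    """Summarize routing quality from (predicted_tier, labeled_tier) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        return {"accuracy": 0.0, "under_routed": 0.0, "over_routed": 0.0}
    correct = sum(1 for predicted, labeled in pairs if predicted == labeled)
    # Under-routed: the labeled tier is harder than the tier the router chose.
    under = sum(1 for p, t in pairs if TIER_ORDER[t] > TIER_ORDER[p])
    # Over-routed: the router chose a harder (more expensive) tier than needed.
    over = sum(1 for p, t in pairs if TIER_ORDER[p] > TIER_ORDER[t])
    return {
        "accuracy": correct / total,
        "under_routed": under / total,
        "over_routed": over / total,
    }


sample = [
    ("simple", "simple"),
    ("simple", "complex"),   # under-routed: quality risk
    ("complex", "medium"),   # over-routed: cost risk
    ("medium", "medium"),
]
print(routing_report(sample))
```

&lt;p&gt;Under-routing hurts quality and over-routing hurts cost, so it helps to track the two rates separately rather than a single accuracy number.&lt;/p&gt;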

&lt;p&gt;&lt;strong&gt;Pitfall #3: Premature optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't optimize before measuring (know your bottlenecks)&lt;/li&gt;
&lt;li&gt;Don't sacrifice quality for cost (users &amp;gt; dollars)&lt;/li&gt;
&lt;li&gt;Don't optimize in isolation (system-level thinking required)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #4: Ignoring monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't deploy without observability (you're flying blind)&lt;/li&gt;
&lt;li&gt;Don't skip A/B testing (assumptions fail in production)&lt;/li&gt;
&lt;li&gt;Don't ignore long-tail latency (P99 matters more than average)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;PART 5: What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The AI efficiency revolution is just beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2026-2028 Predictions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026&lt;/strong&gt; (now):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8B models match 70B performance ✅ (happening)&lt;/li&gt;
&lt;li&gt;Semantic caching becomes standard practice&lt;/li&gt;
&lt;li&gt;Model routing adopted by 30% of AI-first companies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2027&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3B models match today's 70B performance&lt;/li&gt;
&lt;li&gt;On-device AI becomes viable for 50%+ of use cases&lt;/li&gt;
&lt;li&gt;Edge deployment standard for latency-critical apps&lt;/li&gt;
&lt;li&gt;First $1B+ open source AI infrastructure company&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2028&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer devices run GPT-4-equivalent models natively&lt;/li&gt;
&lt;li&gt;Cloud inference costs drop 95% from 2024 levels&lt;/li&gt;
&lt;li&gt;AI infrastructure consolidates around 3-5 major platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Emerging Technologies to Watch&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mixture of Experts (MoE)&lt;/strong&gt; - Activate only a subset of parameters for each token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative Decoding&lt;/strong&gt; - Draft tokens with a small model and verify them with a large model for faster generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantized Models&lt;/strong&gt; - 4-bit and even 2-bit inference with minimal quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Space Models&lt;/strong&gt; - Alternative to transformers, potentially more efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neuromorphic Computing&lt;/strong&gt; - Hardware optimized for neural networks&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to Stay Ahead&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For Technical Leaders&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start measuring cost per query today&lt;/li&gt;
&lt;li&gt;Implement caching this quarter&lt;/li&gt;
&lt;li&gt;Experiment with model routing next quarter&lt;/li&gt;
&lt;li&gt;Migrate to efficient models within 6 months&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For Organizations&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat AI infrastructure as a platform investment, not a project&lt;/li&gt;
&lt;li&gt;Hire engineers who've built AI at scale (not just trained models)&lt;/li&gt;
&lt;li&gt;Open source your learnings (builds credibility, attracts talent)&lt;/li&gt;
&lt;li&gt;Focus on efficiency from day one (retrofitting is 10x harder)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For the Industry&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Standardize on efficiency benchmarks (cost per query, not just accuracy)&lt;/li&gt;
&lt;li&gt;Share production learnings openly (we all benefit)&lt;/li&gt;
&lt;li&gt;Pressure model providers for more efficient options&lt;/li&gt;
&lt;li&gt;Invest in infrastructure, not just models&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cutting AI costs by 94% wasn't magic. It was architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 4-layer stack&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Semantic caching (70% reduction)&lt;/li&gt;
&lt;li&gt;Redis caching (15% additional)&lt;/li&gt;
&lt;li&gt;Model routing (12% additional)&lt;/li&gt;
&lt;li&gt;Efficient models (15% additional)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The results&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47,000 → $2,800/month&lt;/li&gt;
&lt;li&gt;340ms → 125ms latency&lt;/li&gt;
&lt;li&gt;0% quality degradation&lt;/li&gt;
&lt;li&gt;530% ROI in year 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt;: AI infrastructure optimization isn't about compromising quality. It's about building intelligently from the start.&lt;/p&gt;

&lt;p&gt;The companies that master AI efficiency will win the next decade. The companies that don't will burn cash until they can't compete.&lt;/p&gt;

&lt;p&gt;Which side do you want to be on?&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Enjoyed this deep-dive?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you found this article valuable, here's how to stay connected and go deeper:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📧 Subscribe to my Substack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Get weekly deep-dives on production AI infrastructure, case studies, and implementation guides delivered to your inbox.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;Subscribe here&lt;/a&gt;&lt;/strong&gt; (Early access to all articles!)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;💻 Explore the Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All the optimization techniques discussed here are open source:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/anilatambharii" rel="noopener noreferrer"&gt;github.com/anilatambharii&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Cost Optimization frameworks&lt;/li&gt;
&lt;li&gt;Production RAG implementations&lt;/li&gt;
&lt;li&gt;AI Safety testing frameworks&lt;/li&gt;
&lt;li&gt;Distributed training utilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;💼 Connect &amp;amp; Follow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn&lt;/strong&gt;: Daily AI infrastructure insights → &lt;a href="https://linkedin.com/in/anilsprasad" rel="noopener noreferrer"&gt;linkedin.com/in/anilsprasad&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X/Twitter&lt;/strong&gt;: Real-time production AI observations → &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambharii Labs&lt;/strong&gt;: We build production AI infrastructure → &lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ambharii.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🏢 Need Help Implementing This?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your team is struggling with AI infrastructure costs or wants to build efficient systems from day one:&lt;/p&gt;

&lt;p&gt;📨 &lt;strong&gt;Email&lt;/strong&gt;: &lt;a href="mailto:contact@ambharii.com"&gt;contact@ambharii.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🌐 &lt;strong&gt;Website&lt;/strong&gt;: ambharii.com&lt;/p&gt;

&lt;p&gt;We offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture review &amp;amp; optimization consulting&lt;/li&gt;
&lt;li&gt;Build services for production AI infrastructure&lt;/li&gt;
&lt;li&gt;Training for engineering teams&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;About the Author&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anil Prasad&lt;/strong&gt; is Head of Engineering at Ambharii Labs, where he builds production AI infrastructure processing 2.3M events/second. He was named one of the "100 Most Influential AI Leaders in USA 2024."&lt;/p&gt;

&lt;p&gt;He previously led engineering teams at Fortune 500 companies, recovering $47M in revenue through real-time data systems. He is passionate about making production AI infrastructure accessible through open source and knowledge sharing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Products&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ARIA RCM&lt;/strong&gt;: AI-native revenue cycle management for healthcare&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenomiziQ&lt;/strong&gt;: Precision medicine platform (WEF candidate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic AI Platform&lt;/strong&gt;: Multi-agent orchestration infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #ai #machinelearning #production #llm #optimization #costoptimization #infrastructure #devops #engineering #opensource&lt;/p&gt;








&lt;p&gt;&lt;strong&gt;Published&lt;/strong&gt;: June 10, 2026&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Reading Time&lt;/strong&gt;: 16-18 minutes&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Originally published on&lt;/strong&gt;: &lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Building Production-Ready Open Source AI Infrastructure: A Technical Guide</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Tue, 19 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/building-production-ready-open-source-ai-infrastructure-a-technical-guide-14cl</link>
      <guid>https://dev.to/anilatambharii/building-production-ready-open-source-ai-infrastructure-a-technical-guide-14cl</guid>
      <description>&lt;h1&gt;
  
  
  Building Production-Ready Open Source AI Infrastructure: A Technical Guide
&lt;/h1&gt;

&lt;p&gt;Over the past year, we've built and open sourced six production-grade AI infrastructure projects. These aren't toy projects or proofs of concept. They are systems handling millions of requests daily in production environments.&lt;/p&gt;

&lt;p&gt;Here's what we learned building open source AI infrastructure that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Six Projects
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;llm-cost-optimization&lt;/strong&gt;: 3-layer caching plus intelligent routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ai-safety-framework&lt;/strong&gt;: 5-layer defense with 250 red team test cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;production-rag&lt;/strong&gt;: 6-stage pipeline with re-ranking and evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;distributed-training&lt;/strong&gt;: PyTorch DDP with NCCL tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;roi-first-ai&lt;/strong&gt;: Business metric selection and deployment templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agentic-ai&lt;/strong&gt;: Multi-agent orchestration framework&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All repositories are at &lt;code&gt;github.com/anilatambharii&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Open Source Our Production Code
&lt;/h2&gt;

&lt;p&gt;Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, the AI infrastructure landscape is fragmented. Every team rebuilds the same patterns from scratch. LLM caching. RAG pipelines. Cost optimization. Agent orchestration. We've already solved these problems. Sharing the solutions helps the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, open source code is battle tested. When thousands of developers review, use, and contribute to your code, it gets better fast. Private code stays brittle. Public code gets hardened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, hiring advantage. The best engineers want to work on code that matters. Open source contributions demonstrate technical credibility better than any interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Principle: Composition Over Configuration
&lt;/h2&gt;

&lt;p&gt;Each project is a focused library, not a framework. You compose them together rather than configuring one monolithic system.&lt;/p&gt;

&lt;p&gt;Bad approach: One repo with 47 configuration options trying to do everything.&lt;/p&gt;

&lt;p&gt;Good approach: Six repos, each solving one problem well. Use what you need. Ignore what you don't.&lt;/p&gt;

&lt;p&gt;Example using &lt;code&gt;llm-cost-optimization&lt;/code&gt; and &lt;code&gt;production-rag&lt;/code&gt; together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_cost_optimization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachingLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModelRouter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HybridRetriever&lt;/span&gt;

&lt;span class="c1"&gt;# Set up caching for LLM calls
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CachingLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;semantic_cache_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set up model routing based on query complexity
&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;complexity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set up RAG pipeline with hybrid retrieval
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keyword_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RAGPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_router&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use them together
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What were Q2 financial results?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each component is independent. Each can be used standalone. Together they form a complete system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Deep Dive: LLM Cost Optimization
&lt;/h2&gt;

&lt;p&gt;This project reduced our LLM costs from $47K to $2.8K per month, a 94% reduction at the same quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three-Layer Caching
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Exact match cache&lt;/strong&gt; catches identical queries. The Redis key is the SHA-256 hash of the prompt. A cache hit returns the stored response immediately: no LLM call, no inference cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExactMatchCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit rate: 23% of queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt; catches similar queries. Embed the prompt, find its nearest neighbor in the vector DB, and return the cached response if the similarity exceeds the threshold (0.95).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;cached_response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit rate: the semantic cache catches 31% of the queries that miss the exact-match cache.&lt;/p&gt;
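
&lt;p&gt;A minimal in-memory sketch of the same lookup, assuming cosine similarity as the score and a linear scan in place of the vector DB. &lt;code&gt;InMemorySemanticCache&lt;/code&gt;, the injected &lt;code&gt;embed&lt;/code&gt; function, and the 0.95 default threshold are illustrative assumptions, not the project's actual code.&lt;/p&gt;

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Hypothetical stand-in for the vector DB: a linear scan over stored embeddings."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed          # any text-to-vector function (assumed)
        self.threshold = threshold  # illustrative default, tune on real traffic
        self.entries: List[Tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve the nearest neighbour if it clears the threshold.
        return best_response if best_score > self.threshold else None

    def set(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))
```

&lt;p&gt;The threshold is the knob that matters: set it too low and the cache returns answers to questions nobody asked, set it too high and the hit rate collapses toward exact match.&lt;/p&gt;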

&lt;p&gt;&lt;strong&gt;Prefix cache&lt;/strong&gt; reuses computation for prompts that share a common prefix. The system prompt is usually identical across requests, and so are the few-shot examples. Only the user query changes.&lt;/p&gt;

&lt;p&gt;Anthropic's prompt caching API handles this for you once you mark the static parts of the prompt as cacheable with &lt;code&gt;cache_control&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined hit rate: 73% of queries are served from cache. The remaining 27% hit the LLM. Caching alone cuts cost by 73%.&lt;/p&gt;
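
&lt;p&gt;One way to chain the layers so that each request tries the cheapest lookup first. This is a sketch under stated assumptions: &lt;code&gt;TieredCache&lt;/code&gt;, &lt;code&gt;exact_key&lt;/code&gt;, and the injected &lt;code&gt;semantic_get&lt;/code&gt;, &lt;code&gt;semantic_set&lt;/code&gt;, and &lt;code&gt;call_llm&lt;/code&gt; callables are hypothetical names. Prefix caching happens on the provider side, so it does not appear as a tier here.&lt;/p&gt;

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple


def exact_key(prompt: str) -> str:
    """Normalize case and whitespace, then hash, so trivially different prompts share a key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TieredCache:
    """Exact match first, semantic match second, LLM call last."""

    def __init__(
        self,
        semantic_get: Callable[[str], Optional[str]],
        semantic_set: Callable[[str, str], None],
        call_llm: Callable[[str], str],
    ):
        self.exact: Dict[str, str] = {}
        self.semantic_get = semantic_get
        self.semantic_set = semantic_set
        self.call_llm = call_llm

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (response, tier) where tier is 'exact', 'semantic', or 'llm'."""
        key = exact_key(prompt)
        if key in self.exact:
            return self.exact[key], "exact"

        cached = self.semantic_get(prompt)
        if cached is not None:
            self.exact[key] = cached  # promote so the next identical prompt skips embedding
            return cached, "semantic"

        response = self.call_llm(prompt)
        self.exact[key] = response
        self.semantic_set(prompt, response)
        return response, "llm"
```

&lt;p&gt;Returning the tier alongside the response makes it straightforward to count hits per layer.&lt;/p&gt;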

&lt;h3&gt;
  
  
  Intelligent Model Routing
&lt;/h3&gt;

&lt;p&gt;Not every query needs GPT-4 or Claude Opus. Simple queries work fine on Haiku. Moderately complex queries go to Sonnet, and only the hardest ones need Opus.&lt;/p&gt;

&lt;p&gt;Routing strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $0.25 per 1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3 per 1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $15 per 1M tokens
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Features: length, question marks, technical terms, etc.
&lt;/span&gt;        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
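        &lt;span class="c1"&gt;# Assumes a scikit-learn-style classifier and a single-row feature matrix:
&lt;/span&gt;        &lt;span class="c1"&gt;# predict_proba returns one row per sample, so take row 0, class 1 ("complex").
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;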
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We trained a simple classifier on 10K labeled examples. "What's the capital of France?" → Haiku. "Analyze this 50-page contract for liability clauses" → Opus.&lt;/p&gt;

&lt;p&gt;Result: 89% of queries route to Haiku. 9% to Sonnet. 2% to Opus. Average cost per query drops 88%.&lt;/p&gt;
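
&lt;p&gt;The effect of the routing mix on price can be checked with a few lines of arithmetic. The sketch below uses the per-1M-token prices from the code comments above and the reported routing shares, and it assumes every query consumes the same number of tokens. The baseline and token mix behind the 88% figure are not stated, so this will not reproduce that number exactly. &lt;code&gt;blended_price&lt;/code&gt; and &lt;code&gt;savings_vs&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
from typing import Dict

# Per-1M-token prices as quoted in the router code comments.
PRICE_PER_MTOK: Dict[str, float] = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
# Routing mix reported in the text.
ROUTING_SHARE: Dict[str, float] = {"haiku": 0.89, "sonnet": 0.09, "opus": 0.02}


def blended_price(prices: Dict[str, float], shares: Dict[str, float]) -> float:
    """Average price per 1M tokens, weighting each model by its share of queries."""
    return sum(prices[model] * share for model, share in shares.items())


def savings_vs(baseline_model: str, prices: Dict[str, float], shares: Dict[str, float]) -> float:
    """Fractional saving compared with sending every query to baseline_model."""
    return 1.0 - blended_price(prices, shares) / prices[baseline_model]


if __name__ == "__main__":
    print(f"blended: ${blended_price(PRICE_PER_MTOK, ROUTING_SHARE):.4f} per 1M tokens")
    for baseline in ("sonnet", "opus"):
        saving = savings_vs(baseline, PRICE_PER_MTOK, ROUTING_SHARE)
        print(f"saving vs all-{baseline}: {saving:.1%}")
```

&lt;p&gt;Under these assumptions the blended price is about $0.79 per 1M tokens, which is roughly 74% cheaper than sending everything to Sonnet and roughly 95% cheaper than sending everything to Opus.&lt;/p&gt;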

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fews0p24mflyd2k60kg2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fews0p24mflyd2k60kg2t.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Notes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cache invalidation&lt;/strong&gt; is the hard part. We invalidate on a TTL (1 hour by default) and on explicit updates: when source data changes, we flush the related cache entries.&lt;/p&gt;
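
&lt;p&gt;A minimal sketch of both invalidation paths, assuming each cache entry records which source documents it was derived from. &lt;code&gt;TTLCache&lt;/code&gt;, &lt;code&gt;invalidate_source&lt;/code&gt;, and the injectable &lt;code&gt;clock&lt;/code&gt; are hypothetical names used for illustration.&lt;/p&gt;

```python
import time
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


class TTLCache:
    """Entries expire after ttl_seconds and can be flushed early by source id."""

    def __init__(self, ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds  # 1 hour default, as in the text
        self.clock = clock              # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str, Set[str]]] = {}

    def set(self, key: str, value: str, sources: Iterable[str] = ()) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._entries[key] = (expires_at, value, set(sources))

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value, _ = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazy expiry on read
            return None
        return value

    def invalidate_source(self, source_id: str) -> int:
        """Drop every entry derived from source_id; return how many were removed."""
        stale = [k for k, (_, _, sources) in self._entries.items() if source_id in sources]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

&lt;p&gt;Tagging entries with their sources is what makes the explicit flush cheap. Without it, a data change forces either a full flush or waiting out the TTL.&lt;/p&gt;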

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; tracks hit rates, latency, and cost per query. A dashboard shows cache performance in real time. Alerts fire when the hit rate drops below a configured threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradual rollout&lt;/strong&gt; started with 1% of traffic. We measured cache hit rate and accuracy, then ramped to 10%, 50%, and 100% over 3 weeks.&lt;/p&gt;
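
&lt;p&gt;A percentage rollout like this is often implemented with deterministic hash bucketing, so the same request or user stays in the same cohort as the percentage grows. The sketch below assumes that approach; &lt;code&gt;in_rollout&lt;/code&gt; and the use of a request id as the bucketing key are illustrative choices, not a description of the actual rollout mechanism.&lt;/p&gt;

```python
import hashlib


def in_rollout(request_id: str, percent: float) -> bool:
    """Return True if request_id falls inside the first `percent` of 10,000 hash buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return percent * 100 > bucket


if __name__ == "__main__":
    for stage in (1, 10, 50, 100):
        enabled = sum(in_rollout(f"req-{i}", stage) for i in range(10_000))
        print(f"{stage}% stage: {enabled} of 10000 requests use the cache")
```

&lt;p&gt;Because the bucket depends only on the id, every id enabled at 1% is still enabled at 10%, which keeps before-and-after comparisons clean during the ramp.&lt;/p&gt;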

&lt;h2&gt;
  
  
  Project Deep Dive: Production RAG
&lt;/h2&gt;

&lt;p&gt;We increased RAG accuracy from 52% to 89% by fixing retrieval, not the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6-Stage Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Query Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't send raw user queries to the vector DB. Expand them with synonyms, extract metadata, and generate a context-aware embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QueryProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract metadata
&lt;/span&gt;        &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_department&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_doc_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Expand with synonyms
&lt;/span&gt;        &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate embedding
&lt;/span&gt;        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: Vector Database Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a cosine similarity threshold of 0.85 and retrieve the top 50 candidates (not 5, not 10). We use Pinecone with metadata filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
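
&lt;p&gt;The query above assumes every metadata field was extracted and does not show the 0.85 threshold being applied. A hedged sketch of both steps follows. &lt;code&gt;build_filter&lt;/code&gt; and &lt;code&gt;apply_threshold&lt;/code&gt; are hypothetical helpers, matches are represented as plain dicts with a &lt;code&gt;score&lt;/code&gt; key, and dates are assumed to be stored as numeric timestamps so that range operators can compare them.&lt;/p&gt;

```python
from typing import Any, Dict, List, Optional, Tuple


def build_filter(
    department: Optional[str],
    date_range: Optional[Tuple[float, float]],
) -> Dict[str, Any]:
    """Build a metadata filter, skipping any field the query processor could not extract."""
    flt: Dict[str, Any] = {}
    if department is not None:
        flt["department"] = {"$eq": department}
    if date_range is not None:
        start, end = date_range
        flt["date"] = {"$gte": start, "$lte": end}  # bound both ends of the range
    return flt


def apply_threshold(matches: List[Dict[str, Any]], min_score: float = 0.85) -> List[Dict[str, Any]]:
    """Keep only matches whose similarity score reaches min_score, preserving order."""
    return [match for match in matches if match["score"] >= min_score]
```

&lt;p&gt;Skipping missing fields matters: filtering on a department of &lt;code&gt;None&lt;/code&gt; would either raise an error or silently exclude every document.&lt;/p&gt;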



&lt;p&gt;&lt;strong&gt;Stage 3: Hybrid Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine semantic search (70% weight) with BM25 keyword search (30% weight).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HybridRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Vector search
&lt;/span&gt;        &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Keyword search
&lt;/span&gt;        &lt;span class="n"&gt;keyword_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Combine with weights
&lt;/span&gt;        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;keyword_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vector_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;keyword_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
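
&lt;p&gt;The merge step is not shown above. One common way to do it is weighted score fusion: cosine scores and BM25 scores live on different scales, so each list is min-max normalized before the 0.7/0.3 weights are applied. The sketch below assumes that technique; &lt;code&gt;merge_scores&lt;/code&gt; and &lt;code&gt;min_max_normalize&lt;/code&gt; are hypothetical names, and results are simplified to document id to score mappings.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range 0..1 so different retrievers become comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - low) / (high - low) for doc_id, score in scores.items()}


def merge_scores(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from one list scores 0 there."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    # Highest score first; ties broken by id so the ordering is deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

&lt;p&gt;Reciprocal rank fusion is a reasonable alternative when raw scores are too noisy to normalize, since it only uses each document's rank.&lt;/p&gt;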



&lt;p&gt;&lt;strong&gt;Stage 4: Re-ranking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This single stage improved accuracy by 23%. Use a cross-encoder to score each candidate against the actual query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
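&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Reranker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;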
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-MiniLM-L-12-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Score each doc against query
&lt;/span&gt;        &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Sort by score
&lt;/span&gt;        &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Top 50 candidates from hybrid search → Re-rank → Best 5 to LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 5: Context Assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Smart chunking with overlap: 512-token chunks with a 50-token overlap. Include surrounding context and add metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assemble_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Source &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Department: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

---
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
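
&lt;p&gt;The assembly code above formats chunks that already exist. The chunking itself can be sketched as a sliding window over a token list, with the 512 and 50 values taken from the text. &lt;code&gt;chunk_tokens&lt;/code&gt; is a hypothetical helper, and the tokens are assumed to come from whichever tokenizer matches the embedding model.&lt;/p&gt;

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of chunk_size where consecutive windows share `overlap` tokens."""
    if not (chunk_size > overlap >= 0):
        raise ValueError("need chunk_size > overlap >= 0")
    if not tokens:
        return []

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window reached the end of the document
        start += step
    return chunks
```

&lt;p&gt;The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of indexing roughly 10% more tokens.&lt;/p&gt;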



&lt;p&gt;&lt;strong&gt;Stage 6: LLM Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Force grounded responses. The system prompt enforces citation, and the user prompt carries the assembled context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Use ONLY the provided context to answer questions. 

If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain enough information, say &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information to answer that question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

Always cite your sources using the Source number.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assembled_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
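
&lt;p&gt;A prompt can ask for citations but cannot guarantee them, so a cheap post-check helps. The sketch below is a heuristic under stated assumptions: &lt;code&gt;is_grounded&lt;/code&gt; and &lt;code&gt;cited_sources&lt;/code&gt; are hypothetical helpers, and the check assumes the model cites in the "Source N" form used by the assembled context. It verifies only that citations exist and point at real sources, not that the cited text supports the claim.&lt;/p&gt;

```python
import re
from typing import Set

# The refusal sentence the system prompt asks the model to use.
REFUSAL = "I don't have enough information to answer that question."


def cited_sources(answer: str) -> Set[int]:
    """Collect every source number mentioned as 'Source N' in the answer."""
    return {int(number) for number in re.findall(r"Source\s+(\d+)", answer)}


def is_grounded(answer: str, num_sources: int) -> bool:
    """Accept a refusal, or an answer whose citations all fall within 1..num_sources."""
    if REFUSAL in answer:
        return True
    cited = cited_sources(answer)
    return bool(cited) and all(num_sources >= number >= 1 for number in cited)
```

&lt;p&gt;Answers that fail the check can be retried or logged for review instead of being shown to the user.&lt;/p&gt;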



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;Before: 52% answer accuracy. 3.8s latency. 31% hallucination rate.&lt;/p&gt;

&lt;p&gt;After: 89% accuracy (+71% relative). 1.2s latency (faster!). 4% hallucination rate (-87% relative).&lt;/p&gt;

&lt;p&gt;The insight: Don't optimize the LLM. Optimize the retrieval. GPT-4 with bad context = bad answers. Haiku with perfect context = great answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3943lz1ypcgvehf5fjax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3943lz1ypcgvehf5fjax.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Projects Production Ready
&lt;/h2&gt;

&lt;p&gt;Every project includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive tests&lt;/strong&gt;: Unit tests for every function. Integration tests for pipelines. End-to-end tests for workflows. 90%+ coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: README with quick start. Detailed API docs. Architecture diagrams. Example notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks&lt;/strong&gt;: Performance metrics. Accuracy measurements. Cost comparisons. Real numbers, not claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus metrics. Logging. Error tracking. Observability built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: Docker containers. Kubernetes manifests. Terraform modules. Production-ready deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to Open Source AI
&lt;/h2&gt;

&lt;p&gt;Our projects welcome contributions. Here's how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a project that interests you&lt;/li&gt;
&lt;li&gt;Read the CONTRIBUTING.md&lt;/li&gt;
&lt;li&gt;Check the issues for "good first issue" labels&lt;/li&gt;
&lt;li&gt;Submit a PR with tests and documentation&lt;/li&gt;
&lt;li&gt;Respond to review feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We review all PRs within 48 hours. The quality bar is high, but we help contributors meet it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89brvmwv8kb18ggp8kvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89brvmwv8kb18ggp8kvf.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Open source AI infrastructure should be production-ready, not a proof of concept. These six projects represent thousands of hours of real-world testing and optimization.&lt;/p&gt;

&lt;p&gt;Use them. Contribute to them. Build on them.&lt;/p&gt;

&lt;p&gt;The code is at &lt;code&gt;github.com/anilatambharii&lt;/code&gt;. Documentation is comprehensive. Examples are plentiful. Issues are welcome.&lt;/p&gt;

&lt;p&gt;Let's build better AI infrastructure together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anil Prasad is Head of Engineering at Ambharii Labs, recognized as one of "100 Most Influential AI Leaders in USA 2024." He builds production-scale AI and data systems for enterprise organizations. Connect on LinkedIn at linkedin.com/in/anilsprasad or visit ambharii.com.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/llm-cost-optimization" rel="noopener noreferrer"&gt;LLM Cost Optimization Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/production-rag" rel="noopener noreferrer"&gt;Production RAG Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/ai-safety-framework" rel="noopener noreferrer"&gt;AI Safety Framework Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How We Cut LLM API Costs by 94%: A 3-Layer Caching Strategy</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Thu, 14 May 2026 13:59:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-we-cut-llm-api-costs-by-94-a-3-layer-caching-strategy-145l</link>
      <guid>https://dev.to/anilatambharii/how-we-cut-llm-api-costs-by-94-a-3-layer-caching-strategy-145l</guid>
      <description>&lt;p&gt;Last month, our LLM API bills hit &lt;strong&gt;$47,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This month: &lt;strong&gt;$2,800&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Same product. Same user experience. Same performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;94% cost reduction&lt;/strong&gt; without sacrificing quality.&lt;/p&gt;

&lt;p&gt;Here's the architecture that made it possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;CFO's message: "Fix this or we shut down the AI features."&lt;/p&gt;

&lt;p&gt;We had 90 days.&lt;/p&gt;

&lt;p&gt;Most teams would panic and start cutting features. We treated it as an architecture problem, not a budget problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: 3-Layer Caching + Intelligent Routing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/monday-cost-optimization.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/monday-cost-optimization.png" alt="Cost Optimization Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Prompt Caching (68% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Every request pays for the same tokens repeatedly.&lt;/p&gt;

&lt;p&gt;Standard system prompts, documentation, static context—all charged every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Claude's native prompt caching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Mark cacheable content with cache_control
&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI assistant for our healthcare platform...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Cache this
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current user context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Don't cache (changes per user)
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
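&lt;p&gt;To see whether the cache is actually being hit, the response's &lt;code&gt;usage&lt;/code&gt; object reports &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;cache_creation_input_tokens&lt;/code&gt;, and &lt;code&gt;cache_read_input_tokens&lt;/code&gt;. The sketch below is a hypothetical helper (not part of our codebase) that turns those fields into a cached-token share. It uses a stand-in object with made-up numbers instead of a live API call.&lt;/p&gt;

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens that were served from the prompt cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for `message.usage`; the numbers are made up for illustration.
example_usage = SimpleNamespace(
    input_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=5000,
)
print(f"{cached_fraction(example_usage):.1%} of prompt tokens came from cache")
```

&lt;p&gt;In a real service you would pass &lt;code&gt;message.usage&lt;/code&gt; from each response and aggregate the result in your metrics.&lt;/p&gt;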



&lt;p&gt;&lt;strong&gt;Economics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens:&lt;/strong&gt; $3.00 / 1M tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cached input tokens:&lt;/strong&gt; $0.30 / 1M tokens (10x cheaper!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache write:&lt;/strong&gt; $3.75 / 1M tokens (paid whenever the cache is written; by default an entry expires after 5 minutes without a hit)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First request (cache write):&lt;/p&gt;

&lt;p&gt;5,000 token system prompt&lt;br&gt;
Cost: $0.01875 (5K tokens × $3.75/1M)&lt;/p&gt;

&lt;p&gt;Next 100 requests (cache hit):&lt;/p&gt;

&lt;p&gt;Same 5,000 token system prompt&lt;br&gt;
Cost: $0.15 (5K tokens × $0.30/1M × 100)&lt;/p&gt;

&lt;p&gt;Total: $0.16875 for 101 requests&lt;br&gt;
Without caching: $1.515 (5K × $3/1M × 101)&lt;br&gt;
Savings: 88.9%&lt;/p&gt;
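&lt;p&gt;The arithmetic for this example is easy to re-run with your own numbers. The sketch below uses the per-million prices listed under Economics and assumes every one of the 100 follow-up requests arrives while the cache entry is still alive. The function and variable names are illustrative.&lt;/p&gt;

```python
def prompt_cost(tokens: int, price_per_million: float, requests: int = 1) -> float:
    """Dollar cost of sending `tokens` prompt tokens `requests` times."""
    return tokens * price_per_million / 1_000_000 * requests


PROMPT_TOKENS = 5_000
HITS = 100  # requests after the first, all assumed to hit the cache

cache_write = prompt_cost(PROMPT_TOKENS, 3.75)        # first request writes the cache
cache_reads = prompt_cost(PROMPT_TOKENS, 0.30, HITS)  # later requests read from it
with_caching = cache_write + cache_reads
without_caching = prompt_cost(PROMPT_TOKENS, 3.00, HITS + 1)
savings = 1 - with_caching / without_caching

print(f"with caching:    ${with_caching:.5f}")
print(f"without caching: ${without_caching:.5f}")
print(f"savings:         {savings:.1%}")
```

&lt;p&gt;Lower &lt;code&gt;HITS&lt;/code&gt; to see how quickly the savings shrink when traffic is too sparse to keep the cache warm.&lt;/p&gt;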

&lt;p&gt;&lt;strong&gt;Our hit rate:&lt;/strong&gt; 68%&lt;/p&gt;


&lt;h2&gt;
  
  
  Layer 2: Semantic Caching (15% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Exact-match caching doesn't catch similar queries.&lt;/p&gt;

&lt;p&gt;"How do I reset my password?" vs "Password reset help?" are semantically identical but literally different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Semantic similarity matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# {embedding: (query, response, timestamp)}
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if semantically similar query exists in cache&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cached_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
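            &lt;span class="c1"&gt;# Cosine similarity: divide by both norms so the score does not depend on vector length
&lt;/span&gt;            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;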

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache HIT: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ≈ &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (similarity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store query-response pair with embedding&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First query
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reset my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reset my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Similar query (a hit only if its similarity clears the 0.95 threshold)
&lt;/span&gt;&lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Password reset help?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# On a hit, returns the cached response with no LLM call; otherwise None
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Additional 15% cache hit rate&lt;/strong&gt; on top of prompt caching.&lt;/p&gt;
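&lt;p&gt;One design choice worth spelling out: when several cached queries clear the threshold, returning the most similar one is usually safer than returning the first one found. Here is a minimal, dependency-free sketch of that lookup. It uses toy 3-dimensional vectors in place of real sentence embeddings, and &lt;code&gt;best_match&lt;/code&gt; is a hypothetical helper, not a method of the class above.&lt;/p&gt;

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, entries, threshold=0.95):
    """Return the response of the most similar cached entry, or None.

    `entries` is a list of (vector, response) pairs.
    """
    if not entries:
        return None
    score, response = max(
        ((cosine(query_vec, vec), resp) for vec, resp in entries),
        key=lambda pair: pair[0],
    )
    return response if score >= threshold else None


# Toy vectors standing in for real sentence embeddings.
entries = [
    ([1.0, 0.0, 0.0], "billing answer"),
    ([0.0, 1.0, 0.1], "password reset answer"),
]
print(best_match([0.0, 0.9, 0.1], entries))  # close to the second entry
print(best_match([0.5, 0.5, 0.5], entries))  # not close enough to either
```

&lt;p&gt;A linear scan like this is fine for a few thousand entries. Beyond that, a vector index is the usual next step.&lt;/p&gt;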




&lt;h2&gt;
  
  
  Layer 3: Result Caching (10% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Identical queries hit the LLM multiple times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Cache complete responses with smart TTL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create deterministic cache key&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;cache_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get cached response if exists&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cache response with TTL

        TTL strategy:
        - Stable content: 24 hours (86400s)
        - Dynamic content: 1 hour (3600s)
        - Real-time data: 5 minutes (300s)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Invalidate cache on data updates&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Check cache first
&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# Cache hit!
&lt;/span&gt;
&lt;span class="c1"&gt;# Cache miss - call LLM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cache the result
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Invalidate on data update
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:123:*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Clear all caches for user 123
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
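&lt;p&gt;One caveat on invalidation: a pattern such as &lt;code&gt;user:123:*&lt;/code&gt; can only match keys that carry that prefix, and a bare SHA-256 digest does not. A sketch of one way to handle this, assuming the user ID is available when the key is built: put a readable namespace in front of the hash. A plain dict and &lt;code&gt;fnmatch&lt;/code&gt; stand in for Redis and its glob matching here, and &lt;code&gt;namespaced_key&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import fnmatch
import hashlib
import json


def namespaced_key(user_id: str, query: str, context: dict) -> str:
    """Readable prefix plus a hash, so keys can be matched by pattern."""
    payload = json.dumps({"query": query, "context": context}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return f"user:{user_id}:{digest}"


def invalidate(store: dict, pattern: str) -> int:
    """Delete every key matching a glob pattern; return how many were removed."""
    doomed = [key for key in store if fnmatch.fnmatchcase(key, pattern)]
    for key in doomed:
        del store[key]
    return len(doomed)


# A plain dict stands in for Redis in this sketch.
store = {
    namespaced_key("123", "What is my plan?", {"tier": "pro"}): "answer A",
    namespaced_key("123", "When does it renew?", {"tier": "pro"}): "answer B",
    namespaced_key("456", "What is my plan?", {"tier": "free"}): "answer C",
}
removed = invalidate(store, "user:123:*")
print(removed, "entries removed,", len(store), "left")
```

&lt;p&gt;The same key format works with &lt;code&gt;scan_iter&lt;/code&gt; against a real Redis instance, since the prefix is part of the stored key.&lt;/p&gt;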



&lt;p&gt;&lt;strong&gt;Final 10% cache hit rate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combined: 73% cache hit rate&lt;/strong&gt; (68% + 15% + 10% with some overlap)&lt;/p&gt;
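&lt;p&gt;As a rough sanity check on how three per-layer rates combine, here is a minimal sketch. It assumes the layers hit independently, which real traffic will not satisfy exactly, and the function name is hypothetical.&lt;/p&gt;

```python
def combined_hit_rate(layer_rates):
    """Probability that at least one layer hits, assuming independent layers."""
    miss_all = 1.0
    for rate in layer_rates:
        # A request misses overall only if it misses every layer.
        miss_all *= 1.0 - rate
    return 1.0 - miss_all


rate = combined_hit_rate([0.68, 0.15, 0.10])
print(f"{rate:.1%}")
```

&lt;p&gt;Under the independence assumption the estimate is about 75.5%, in the same range as the measured 73%.&lt;/p&gt;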




&lt;h2&gt;
  
  
  Intelligent Model Routing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Caching alone isn't enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;67% of our queries work perfectly with Haiku. That's a 60x price difference vs Opus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;HAIKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $0.25/1M input
&lt;/span&gt;    &lt;span class="n"&gt;SONNET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3/1M input
&lt;/span&gt;    &lt;span class="n"&gt;OPUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# $15/1M input
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Route based on complexity

    Indicators for Haiku (simple):
    - Short queries (&amp;lt;50 tokens)
    - FAQ-style questions
    - Retrieval tasks

    Indicators for Sonnet (analysis):
    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    - Multi-step reasoning
    - Longer context (&amp;gt;2K tokens)

    Indicators for Opus (complex):
    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    - Creative tasks
    - Critical business decisions
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple queries → Haiku
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HAIKU&lt;/span&gt;

    &lt;span class="c1"&gt;# Analysis tasks → Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;explain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

    &lt;span class="c1"&gt;# Complex reasoning → Opus
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPUS&lt;/span&gt;

    &lt;span class="c1"&gt;# Default to Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
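&lt;p&gt;One caveat with substring checks such as &lt;code&gt;word in query.lower()&lt;/code&gt;: they also fire inside longer words, so "design" matches "designation". A minimal sketch of whole-word matching follows. The helper names are hypothetical, the keyword lists come from the docstring above, and checking the Opus keywords first is an assumption rather than the routing order shown in the article.&lt;/p&gt;

```python
import re

# Keyword lists taken from the docstring above; the helper names are hypothetical.
OPUS_WORDS = ("design", "architect", "strategy")
SONNET_WORDS = ("analyze", "compare", "evaluate")


def _pattern(words):
    # \b anchors each keyword at word boundaries, so "design" does not match "designation".
    return re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)


OPUS_RE = _pattern(OPUS_WORDS)
SONNET_RE = _pattern(SONNET_WORDS)


def keyword_tier(query: str) -> str:
    """Return 'opus', 'sonnet', or 'none' based on whole-word keyword matches."""
    if OPUS_RE.search(query):
        return "opus"
    if SONNET_RE.search(query):
        return "sonnet"
    return "none"
```

&lt;p&gt;Whole-word matching also skips inflected forms such as "designing", so any stems you care about would need to be listed explicitly.&lt;/p&gt;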



&lt;p&gt;&lt;strong&gt;Our distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;67% Haiku&lt;/strong&gt; ($0.25/1M)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28% Sonnet&lt;/strong&gt; ($3/1M)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5% Opus&lt;/strong&gt; ($15/1M)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Complete System
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLMClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# Layer 1
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 2
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# Layer 3
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 3: Check result cache
&lt;/span&gt;        &lt;span class="n"&gt;cached_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_result&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 2: Check semantic cache
&lt;/span&gt;        &lt;span class="n"&gt;semantic_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic_result&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 1: Prompt caching + model routing happens in LLM call
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt caching
&lt;/span&gt;            &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache the result
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLMClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my account balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47K/month API costs&lt;/li&gt;
&lt;li&gt;P95 latency: 2.1s&lt;/li&gt;
&lt;li&gt;No optimization strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$2.8K/month (-94%)&lt;/li&gt;
&lt;li&gt;P95 latency: 340ms (84% lower)&lt;/li&gt;
&lt;li&gt;73% cache hit rate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Infrastructure &amp;gt; Model Selection
&lt;/h3&gt;

&lt;p&gt;Opus with naive setup:     $47K/month&lt;br&gt;
Haiku-first with optimization:   $2.8K/month&lt;/p&gt;

&lt;p&gt;A well-architected system that sends most traffic to Haiku costs roughly 1/17th as much as a naive Opus setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cache Hit Rate Math
&lt;/h3&gt;

&lt;p&gt;Without caching: 100% of requests hit the LLM&lt;br&gt;
With a 73% cache hit rate: 27% of requests hit the LLM&lt;br&gt;
Cost reduction: roughly 73% from caching alone&lt;br&gt;
Additional savings: 67% of the remaining 27% go to cheap Haiku&lt;br&gt;
Total: 94% cost reduction&lt;/p&gt;
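&lt;p&gt;The arithmetic above can be approximated with a small back-of-the-envelope model. This is a sketch under stated assumptions, not the accounting behind the measured numbers: it treats every cache hit as free, uses input-token prices only, assumes every request has the same token count, and takes an all-Opus baseline. The function name is hypothetical.&lt;/p&gt;

```python
# Input prices per 1M tokens, as listed in the routing section.
PRICES = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
# Share of LLM-bound requests routed to each tier.
ROUTING = {"haiku": 0.67, "sonnet": 0.28, "opus": 0.05}


def estimated_cost_fraction(hit_rate, routing, prices, baseline="opus"):
    """Cost relative to sending every request to the baseline model.

    Assumes cache hits cost nothing and every request uses the same number
    of input tokens; output-token pricing is ignored.
    """
    blended = sum(share * prices[tier] for tier, share in routing.items())
    return (1.0 - hit_rate) * blended / prices[baseline]


fraction = estimated_cost_fraction(0.73, ROUTING, PRICES)
print(f"Estimated reduction: {1.0 - fraction:.1%}")
```

&lt;p&gt;This simplified model lands near 97%, a little above the measured 94%. A gap in that direction is expected, since the model ignores output tokens and treats prompt-cache hits as free rather than discounted.&lt;/p&gt;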

&lt;h3&gt;
  
  
  3. Speed as Side Effect
&lt;/h3&gt;

&lt;p&gt;Caching doesn't just save money. It's faster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit: 50ms (Redis lookup)&lt;/li&gt;
&lt;li&gt;LLM call: 2,100ms (P95)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;42x faster&lt;/strong&gt; for cached requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Enable prompt caching (10x savings on repeated context)&lt;/li&gt;
&lt;li&gt;[ ] Add semantic similarity cache (15% additional hits)&lt;/li&gt;
&lt;li&gt;[ ] Implement result caching with smart TTL&lt;/li&gt;
&lt;li&gt;[ ] Route queries to appropriate model tier&lt;/li&gt;
&lt;li&gt;[ ] Monitor cache hit rates and adjust thresholds&lt;/li&gt;
&lt;li&gt;[ ] Set up cache invalidation on data updates&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Monitoring Dashboard
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache_metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;semantic_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_per_1k_requests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;340&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track these weekly. Optimize based on data, not assumptions.&lt;/p&gt;
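&lt;p&gt;In a live system the hard-coded values above would come from counters. A minimal in-memory sketch of that bookkeeping follows; the class, method, and outcome names are hypothetical, and a production setup would more likely export these counts to a metrics backend.&lt;/p&gt;

```python
from collections import Counter


class CacheMetrics:
    """Counts cache outcomes and model usage; all names here are hypothetical."""

    def __init__(self):
        self.outcomes = Counter()  # e.g. "result_hit", "semantic_hit", "miss"
        self.models = Counter()    # e.g. "haiku", "sonnet", "opus"

    def record(self, outcome, model=None):
        # Call once per request; pass the model only when the LLM was called.
        self.outcomes[outcome] += 1
        if model is not None:
            self.models[model] += 1

    def snapshot(self):
        total = sum(self.outcomes.values())
        llm_calls = sum(self.models.values())
        hits = total - self.outcomes["miss"]
        return {
            "combined_hit_rate": hits / total if total else 0.0,
            "model_distribution": {
                name: count / llm_calls for name, count in self.models.items()
            },
        }
```

&lt;p&gt;Calling &lt;code&gt;record("miss", model="haiku")&lt;/code&gt; on a cache miss and &lt;code&gt;record("result_hit")&lt;/code&gt; on a hit keeps both the hit rate and the model distribution derived from the same request stream.&lt;/p&gt;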




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're open-sourcing our cost optimization framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete caching implementation&lt;/li&gt;
&lt;li&gt;Model routing logic&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;li&gt;Cost calculation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt; or &lt;a href="https://github.com/ambharii" rel="noopener noreferrer"&gt;Ambharii Labs&lt;/a&gt; for the release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;What's your LLM API bill?&lt;/p&gt;

&lt;p&gt;Drop it in the comments and I'll tell you which optimization would have the highest ROI for your use case.&lt;/p&gt;

&lt;p&gt;Common wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt caching: 10x savings on repeated context&lt;/li&gt;
&lt;li&gt;Model routing: 60x price difference (Haiku vs Opus)&lt;/li&gt;
&lt;li&gt;Semantic caching: 15% additional hits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make LLMs affordable for everyone. 💰&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #performance #optimization #tutorial&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building Production RAG: From 52% to 89% Accuracy with a 6-Stage Pipeline</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Tue, 12 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/building-production-rag-from-52-to-89-accuracy-with-a-6-stage-pipeline-33ff</link>
      <guid>https://dev.to/anilatambharii/building-production-rag-from-52-to-89-accuracy-with-a-6-stage-pipeline-33ff</guid>
      <description>&lt;p&gt;Two hard problems in production AI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: RAG systems giving wrong answers 48% of the time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: LLM API bills hitting $47K/month&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We solved both. Here's how.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: RAG Accuracy (52% → 89%)
&lt;/h2&gt;

&lt;p&gt;Our RAG system was confidently wrong. Users asked "What were Q2 healthcare results?" and got Q1 data, footnotes, and chapter titles with zero content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High similarity scores. Completely useless context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM wasn't the problem. Retrieval was broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/wednesday-rag-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/wednesday-rag-architecture.png" alt="RAG Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6-Stage Pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stage 1: Query Processing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; "Show me Q2 results" carries almost no semantic signal: no year, no department, no metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Query expansion + metadata extraction&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# dates, entities
&lt;/span&gt;    &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_with_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Transformation:&lt;/strong&gt;&lt;br&gt;
Input:  "Show me Q2 results"&lt;br&gt;
Output: "quarterly financial results Q2 2024 revenue profit earnings second quarter"&lt;/p&gt;
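&lt;p&gt;One piece of metadata extraction, turning a quarter mention into a concrete date range, might look like the sketch below. It is an illustration only: &lt;code&gt;extract_quarter_range&lt;/code&gt; and its &lt;code&gt;default_year&lt;/code&gt; parameter are hypothetical and not the &lt;code&gt;extract_metadata&lt;/code&gt; used above, and it assumes calendar quarters rather than a fiscal year.&lt;/p&gt;

```python
import calendar
import re
from datetime import date

QUARTER_RE = re.compile(r"\bQ([1-4])\b", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(20\d{2})\b")


def extract_quarter_range(raw_query: str, default_year: int):
    """Return (start, end) dates for a quarter mentioned in the query, or None."""
    quarter_match = QUARTER_RE.search(raw_query)
    if quarter_match is None:
        return None
    quarter = int(quarter_match.group(1))
    # Fall back to the caller-supplied year when the query does not name one.
    year_match = YEAR_RE.search(raw_query)
    year = int(year_match.group(1)) if year_match else default_year
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)
```

&lt;p&gt;For "Show me Q2 results" with a default year of 2024, this yields April 1 through June 30, 2024, which can then drive both the query expansion and a metadata filter.&lt;/p&gt;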
&lt;h4&gt;
  
  
  Stage 2: Vector Database Search
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# not 10, not 20
&lt;/span&gt;    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthcare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key:&lt;/strong&gt; Keep only matches with a cosine similarity of at least 0.85. Anything lower retrieves noise.&lt;/p&gt;
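&lt;p&gt;The query call above does not enforce that threshold on its own, so it has to be applied to the returned matches. A minimal sketch, assuming the matches have already been reduced to (chunk id, score) pairs; the function name and the sample data are made up for illustration.&lt;/p&gt;

```python
def filter_by_score(matches, threshold=0.85):
    """Keep (chunk_id, score) pairs whose similarity is at or above the threshold."""
    return [(chunk_id, score) for chunk_id, score in matches if score >= threshold]


# Illustrative scores only, not real retrieval output.
matches = [("q2-summary", 0.91), ("q1-summary", 0.86), ("footnote-12", 0.62)]
print(filter_by_score(matches))
```

&lt;p&gt;With the sample data, the footnote chunk at 0.62 is dropped and the two higher-scoring chunks are kept.&lt;/p&gt;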
&lt;h4&gt;
  
  
  Stage 3: Hybrid Search (Semantic + Keyword)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Semantic (70%) + BM25 keyword (30%)
&lt;/span&gt;    &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;keyword_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
                 &lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Patent queries like "US-2847291" need exact match, not semantic.&lt;/p&gt;
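&lt;p&gt;One caveat with a weighted sum: BM25 scores are unbounded while cosine scores top out at 1, so raw BM25 values can swamp the 70/30 weights. A hedged sketch that min-max normalizes each result set before fusing them. &lt;code&gt;min_max&lt;/code&gt; and &lt;code&gt;fuse&lt;/code&gt; are hypothetical helpers, both inputs are assumed to be &lt;code&gt;{chunk_id: score}&lt;/code&gt; dicts, and the scores are invented:&lt;/p&gt;

```python
def min_max(scores):
    """Rescale a {chunk_id: score} dict to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def fuse(vector_scores, bm25_scores, w_vec=0.7, w_kw=0.3):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    v, k = min_max(vector_scores), min_max(bm25_scores)
    ids = set(v) | set(k)
    fused = {cid: w_vec * v.get(cid, 0.0) + w_kw * k.get(cid, 0.0) for cid in ids}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Invented scores: cosine values sit close together, BM25 values do not.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.78}
bm25_scores = {"c": 14.2, "d": 3.1}
ranked = fuse(vector_scores, bm25_scores)
print([chunk_id for chunk_id, _ in ranked])
```

&lt;p&gt;Without the normalization step, chunk &lt;code&gt;c&lt;/code&gt; would win on its raw BM25 score alone, regardless of the weights.&lt;/p&gt;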
&lt;h4&gt;
  
  
  Stage 4: Re-ranking (23% Accuracy Boost)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;

&lt;span class="n"&gt;reranker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-MiniLM-L-6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Strategy:&lt;/strong&gt; Fast bi-encoder for 50 candidates → slow cross-encoder for final 5.&lt;/p&gt;
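&lt;p&gt;The retrieve-then-rerank shape in isolation, as a rough sketch. The two scorers below are simple word-overlap stand-ins, not real encoders, so the example runs without downloading a model. &lt;code&gt;two_stage&lt;/code&gt;, &lt;code&gt;word_overlap&lt;/code&gt; and &lt;code&gt;overlap_ratio&lt;/code&gt; are hypothetical names:&lt;/p&gt;

```python
def two_stage(query, corpus, cheap_score, precise_score, n_candidates=50, top_k=5):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:n_candidates]
    return sorted(shortlist, key=lambda doc: precise_score(query, doc), reverse=True)[:top_k]


# Stand-in scorers (NOT real encoders), so the sketch runs without model downloads.
def word_overlap(query, doc):
    return len(set(query.lower().split()).intersection(doc.lower().split()))


precise_calls = []


def overlap_ratio(query, doc):
    precise_calls.append(doc)  # track how often the "slow" scorer runs
    return word_overlap(query, doc) / len(doc.split())


corpus = [
    "revenue grew in the third quarter",
    "quarterly revenue",
    "office party photos",
    "revenue",
]
top = two_stage("quarterly revenue", corpus, word_overlap, overlap_ratio, n_candidates=3, top_k=2)
print(top, len(precise_calls))
```

&lt;p&gt;The expensive scorer only ever sees the shortlist, which is where the latency saving comes from.&lt;/p&gt;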
&lt;h4&gt;
  
  
  Stage 5: Context Assembly
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;detokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;section&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;extract_section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why overlap:&lt;/strong&gt; "Revenue increased 23% vs previous quarter" → needs surrounding context.&lt;/p&gt;
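&lt;p&gt;To see what the sliding window does, here is a small sketch that computes only the token offsets for a document of a given length. &lt;code&gt;chunk_spans&lt;/code&gt; is a hypothetical helper using the same stride (&lt;code&gt;size - overlap&lt;/code&gt;) as the loop above:&lt;/p&gt;

```python
def chunk_spans(n_tokens, size=512, overlap=50):
    """Return (start, end) token offsets for overlapping windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    stride = size - overlap  # each window starts this many tokens after the last
    return [(start, min(start + size, n_tokens)) for start in range(0, n_tokens, stride)]


spans = chunk_spans(1000)
print(spans)  # [(0, 512), (462, 974), (924, 1000)]
```

&lt;p&gt;Each window repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.&lt;/p&gt;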
&lt;h4&gt;
  
  
  Stage 6: LLM Generation
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;document&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;source&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/source&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;content&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/content&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/document&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use ONLY the provided context.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Instructions:
1. Answer using ONLY provided context
2. Cite sources
3. Say &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if insufficient

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  RAG Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;52% accuracy&lt;/li&gt;
&lt;li&gt;31% hallucination rate&lt;/li&gt;
&lt;li&gt;3.8s latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;89% accuracy&lt;/strong&gt; (+71%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4% hallucination rate&lt;/strong&gt; (-87%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.2s latency&lt;/strong&gt; (-68%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; GPT-4 with naive retrieval = 54% accuracy. Haiku with 6-stage pipeline = 87% accuracy.&lt;/p&gt;

&lt;p&gt;Optimize retrieval, not the LLM.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 2: Cost Reduction ($47K → $2.8K)
&lt;/h2&gt;

&lt;p&gt;Same product. Same UX. 94% cost reduction.&lt;/p&gt;

&lt;p&gt;The secret: 3-layer caching + intelligent routing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 1: Prompt Caching (68% hit rate)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Every request pays for the same system prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI assistant...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# 10x cheaper!
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Economics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: $3.00/1M tokens&lt;/li&gt;
&lt;li&gt;Cached: $0.30/1M tokens (10x cheaper)&lt;/li&gt;
&lt;/ul&gt;

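&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
5K token system prompt × 100 requests (500K prompt tokens):&lt;br&gt;
Without caching: $1.50&lt;br&gt;
With caching:    about $0.15 (every read at the cached rate, ignoring the one-time cache write)&lt;br&gt;
Roughly 90% savings on the system prompt&lt;/p&gt;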
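&lt;p&gt;The arithmetic can be reproduced in a few lines. This sketch uses only the two per-million prices listed above and, as a simplifying assumption, ignores the one-time cost of writing the cache entry. &lt;code&gt;prompt_cost&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
def prompt_cost(prompt_tokens, requests, price_per_million):
    """Dollar cost of sending the same prompt prefix `requests` times."""
    return prompt_tokens * requests * price_per_million / 1_000_000


uncached = prompt_cost(5_000, 100, 3.00)  # every request pays the normal rate
cached = prompt_cost(5_000, 100, 0.30)    # every request pays the cached rate
savings = 1 - cached / uncached
print(uncached, cached, round(savings, 3))
```

&lt;p&gt;The saving applies only to the cached prefix. The user message and the generated output are still billed at the normal rates.&lt;/p&gt;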
&lt;h3&gt;
  
  
  Layer 2: Semantic Caching (15% hit rate)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Catches:&lt;/strong&gt; "How do I reset password?" ≈ "Password reset help?"&lt;/p&gt;
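&lt;p&gt;A detail worth checking in your own setup: a plain dot product only equals cosine similarity when both vectors have unit length, so a 0.95 threshold is meaningful only for normalized embeddings. A pure-Python illustration with toy vectors (&lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;dot&lt;/code&gt; are hypothetical helpers):&lt;/p&gt;

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw = dot(a, b)                            # grows with vector length
cosine = dot(normalize(a), normalize(b))   # depends on direction only
print(raw, round(cosine, 9))
```

&lt;p&gt;With sentence-transformers, passing &lt;code&gt;normalize_embeddings=True&lt;/code&gt; to &lt;code&gt;encode&lt;/code&gt; returns unit-length vectors, which makes the dot-product comparison safe regardless of the model.&lt;/p&gt;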
&lt;h3&gt;
  
  
  Layer 3: Result Caching (10% hit rate)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;TTL strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable content: 24 hours&lt;/li&gt;
&lt;li&gt;Dynamic content: 1 hour&lt;/li&gt;
&lt;li&gt;Real-time: 5 minutes&lt;/li&gt;
&lt;/ul&gt;
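&lt;p&gt;One way to wire those tiers into the &lt;code&gt;ttl&lt;/code&gt; argument, as a sketch. The tier names and the &lt;code&gt;ttl_for&lt;/code&gt; helper are assumptions for illustration. Unknown content types fall back to the 1-hour tier, matching the default &lt;code&gt;ttl=3600&lt;/code&gt; used above:&lt;/p&gt;

```python
# TTLs in seconds for each content tier (tier names are hypothetical labels).
TTL_SECONDS = {
    "stable": 24 * 60 * 60,   # 24 hours
    "dynamic": 60 * 60,       # 1 hour
    "realtime": 5 * 60,       # 5 minutes
}


def ttl_for(content_type):
    """Look up the TTL for a tier, defaulting to the 1-hour 'dynamic' tier."""
    return TTL_SECONDS.get(content_type, TTL_SECONDS["dynamic"])


print(ttl_for("stable"), ttl_for("realtime"), ttl_for("unknown"))
```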
&lt;h3&gt;
  
  
  Intelligent Model Routing
&lt;/h3&gt;

&lt;p&gt;About 67% of queries can be handled by Haiku ($0.25/1M input tokens), which is 60x cheaper than Opus ($15/1M).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;HAIKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $0.25/1M
&lt;/span&gt;    &lt;span class="n"&gt;SONNET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3/1M
&lt;/span&gt;    &lt;span class="n"&gt;OPUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# $15/1M
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple → Haiku
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HAIKU&lt;/span&gt;

    &lt;span class="c1"&gt;# Analysis → Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

    &lt;span class="c1"&gt;# Complex → Opus
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPUS&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;67% Haiku&lt;/li&gt;
&lt;li&gt;28% Sonnet&lt;/li&gt;
&lt;li&gt;5% Opus&lt;/li&gt;
&lt;/ul&gt;
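
&lt;p&gt;The 60x figure is the per-token gap between Haiku and Opus. The saving across all traffic depends on the mix. A rough sketch of that arithmetic follows, assuming every query uses a similar number of tokens and ignoring the split between input and output pricing (both are simplifications, not measurements from this system). With the prices and shares quoted above it works out to roughly $1.76 per million tokens, about 8.5x cheaper than sending everything to Opus.&lt;/p&gt;

```python
# Per-million-token prices and traffic shares quoted in this section.
PRICE_PER_MILLION = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
TRAFFIC_SHARE = {"haiku": 0.67, "sonnet": 0.28, "opus": 0.05}


def blended_price(prices: dict, shares: dict) -> float:
    """Traffic-weighted average price per million tokens."""
    return sum(prices[name] * shares[name] for name in shares)


blended = blended_price(PRICE_PER_MILLION, TRAFFIC_SHARE)
savings_vs_opus = PRICE_PER_MILLION["opus"] / blended
print(f"blended: ${blended:.2f}/1M, {savings_vs_opus:.1f}x cheaper than all-Opus")
```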

&lt;h3&gt;
  
  
  The Complete System
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 3: Result cache
&lt;/span&gt;        &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 2: Semantic cache
&lt;/span&gt;        &lt;span class="n"&gt;semantic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 1: Prompt cache + routing
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
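            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# required by the Messages API; 1024 is an assumed limit&lt;/span&gt;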
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache the response text (response.content is a list of content blocks)
&lt;/span&gt;        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47K/month&lt;/li&gt;
&lt;li&gt;P95 latency: 2.1s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$2.8K/month (-94%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;P95 latency: 340ms (-84%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;73% combined cache hit rate&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Implement query processing (expand + extract metadata)&lt;/li&gt;
&lt;li&gt;[ ] Set up vector DB with metadata filtering&lt;/li&gt;
&lt;li&gt;[ ] Add hybrid search (semantic + keyword)&lt;/li&gt;
&lt;li&gt;[ ] Deploy cross-encoder re-ranking&lt;/li&gt;
&lt;li&gt;[ ] Build chunking with 50-token overlap&lt;/li&gt;
&lt;li&gt;[ ] Enforce grounded prompts (answer only from retrieved context)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Enable prompt caching (10x savings)&lt;/li&gt;
&lt;li&gt;[ ] Add semantic similarity cache&lt;/li&gt;
&lt;li&gt;[ ] Implement result cache with smart TTL&lt;/li&gt;
&lt;li&gt;[ ] Route to appropriate model tier&lt;/li&gt;
&lt;li&gt;[ ] Monitor cache hit rates weekly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval &amp;gt; LLM:&lt;/strong&gt; Haiku + perfect context beats GPT-4 + bad context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-ranking = 23% boost:&lt;/strong&gt; Single highest-ROI optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching = 73% hit rate:&lt;/strong&gt; Most requests never touch the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model routing = up to 60x cheaper per token:&lt;/strong&gt; Haiku handles 67% of queries&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What We're Open-Sourcing
&lt;/h2&gt;

&lt;p&gt;Next month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6-stage RAG pipeline (code + docs)&lt;/li&gt;
&lt;li&gt;Cost optimization framework&lt;/li&gt;
&lt;li&gt;Re-ranking models&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;li&gt;Evaluation datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt; or &lt;a href="https://github.com/ambharii" rel="noopener noreferrer"&gt;Ambharii Labs&lt;/a&gt; for release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For RAG:&lt;/strong&gt; What's your accuracy? Drop it in comments.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;For Costs:&lt;/strong&gt; What's your monthly LLM bill? I'll tell you which optimization has highest ROI.&lt;/p&gt;

&lt;p&gt;Common wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt caching: 10x savings&lt;/li&gt;
&lt;li&gt;Re-ranking: 23% accuracy boost&lt;/li&gt;
&lt;li&gt;Model routing: 60x price difference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make production AI work. 🚀&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #machinelearning #python #tutorial&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The web is now weaponized against your AI agents</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 08 May 2026 17:15:44 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-web-is-now-weaponized-against-your-ai-agents-50ol</link>
      <guid>https://dev.to/anilatambharii/the-web-is-now-weaponized-against-your-ai-agents-50ol</guid>
      <description>&lt;p&gt;Google dropped a security bomb last week.&lt;/p&gt;

&lt;p&gt;Their threat intelligence team scanned 2-3 billion web pages per month looking for indirect prompt injection attacks targeting enterprise AI agents. They found a &lt;strong&gt;32% increase&lt;/strong&gt; in malicious attempts between November 2025 and February 2026.&lt;/p&gt;

&lt;p&gt;The open web is now an attack surface for production AI.&lt;/p&gt;

&lt;p&gt;This is not speculation. This is documented evidence of active attacks deployed at scale. Hidden instructions embedded in public HTML. Invisible to humans. Visible to AI agents. Real payloads designed to hijack enterprise systems the moment an agent scrapes the page.&lt;/p&gt;

&lt;p&gt;If you have AI agents reading the open web on behalf of your organization, your security model just became obsolete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monday: Hidden instructions at scale
&lt;/h2&gt;

&lt;p&gt;Google researchers documented the attack patterns deployed across billions of public web pages. The techniques are simple and effective:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero font size text&lt;/strong&gt;: Instructions rendered in font-size: 0. Invisible to humans, fully visible to AI parsing HTML&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opacity manipulation&lt;/strong&gt;: Commands hidden using CSS opacity: 0. Text exists but appears transparent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-screen positioning&lt;/strong&gt;: Instructions placed outside viewport using negative coordinates&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript dynamic execution&lt;/strong&gt;: Payloads injected after page load via client-side JS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL fragment injection&lt;/strong&gt;: Commands embedded after the # symbol in URLs&lt;/p&gt;

&lt;p&gt;These are not sophisticated zero-days requiring nation-state capabilities. These are techniques any web developer knows. The barrier to entry is near zero.&lt;/p&gt;
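
&lt;p&gt;To make the first three techniques concrete, here is a minimal detection sketch using only Python's standard library. It looks at inline &lt;code&gt;style&lt;/code&gt; attributes and nothing else, so it is an illustration of the idea rather than a defense: the regex patterns, the &lt;code&gt;HiddenTextFinder&lt;/code&gt; class, and the &lt;code&gt;find_hidden_text&lt;/code&gt; helper are all assumptions, and text hidden via external stylesheets or injected by JavaScript would pass straight through.&lt;/p&gt;

```python
import re
from html.parser import HTMLParser

# Inline-style patterns for three of the hiding techniques listed above.
# Illustrative only: real pages can also hide text via stylesheets or scripts.
HIDDEN_STYLE = re.compile(
    r"font-size\s*:\s*0(?![\d.])"                 # zero font size
    r"|opacity\s*:\s*0(?![\d.])"                  # fully transparent
    r"|(?:left|top|text-indent)\s*:\s*-\d{3,}",   # pushed far off-screen
    re.IGNORECASE,
)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HiddenTextFinder(HTMLParser):
    """Collects text that sits inside elements with a hiding inline style."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside a hidden element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        if self.depth or HIDDEN_STYLE.search(style):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html: str) -> list:
    """Return the text fragments found inside visually hidden elements."""
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.hidden_text
```

&lt;p&gt;Anything this returns is text a human visitor would not have seen, which is a reasonable signal to quarantine the page before an agent reads it.&lt;/p&gt;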

&lt;p&gt;Real payloads found in the wild:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully specified PayPal transaction instructions&lt;/li&gt;
&lt;li&gt;Stripe donation redirects with persuasion amplifier keywords&lt;/li&gt;
&lt;li&gt;Data exfiltration commands targeting enterprise agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is production infrastructure under active attack.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://security.googleblog.com/2026/04/ai-threats-in-wild-current-state-of.html" rel="noopener noreferrer"&gt;Google Threat Intelligence, April 23, 2026&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuesday: The exploit window collapsed
&lt;/h2&gt;

&lt;p&gt;Black Hat Asia 2026 data from RunSybil: attack window compressed from &lt;strong&gt;5 months (2023) to 10 hours (2026)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why? Frontier LLMs now do offensive security work autonomously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2023 workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Security researcher finds vulnerability&lt;/li&gt;
&lt;li&gt;Documents it technically&lt;/li&gt;
&lt;li&gt;Writes POC exploit code&lt;/li&gt;
&lt;li&gt;Tests against targets&lt;/li&gt;
&lt;li&gt;Iterates based on results&lt;/li&gt;
&lt;li&gt;Publishes working exploit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline: months&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe bug to LLM&lt;/li&gt;
&lt;li&gt;Model generates exploit code&lt;/li&gt;
&lt;li&gt;Test in real-time&lt;/li&gt;
&lt;li&gt;Iterate with AI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline: hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, 57% of organizations have AI agents in production right now. Most were architected before this research dropped. The threat model changed faster than the deployment cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wednesday: The sanitizer model pattern
&lt;/h2&gt;

&lt;p&gt;Two models. One reads the web. The other does the work.&lt;/p&gt;

&lt;p&gt;This is the architecture that actually defends against indirect prompt injection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Deploy a small isolated model with zero system permissions. It reads untrusted web content, filters instructions, validates structure. If it gets compromised by a prompt injection, it lacks the permissions to cause damage.&lt;/p&gt;

&lt;p&gt;The production agent never touches raw web input directly. It only processes data that passed through the sanitizer layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key principle&lt;/strong&gt;: Trust boundary between models, not just at network edge.&lt;/p&gt;

&lt;p&gt;The sanitizer has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No write access&lt;/li&gt;
&lt;li&gt;❌ No email permissions&lt;/li&gt;
&lt;li&gt;❌ No payment capabilities&lt;/li&gt;
&lt;li&gt;❌ No database credentials&lt;/li&gt;
&lt;li&gt;✅ Can read and filter only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If compromised by prompt injection, worst case is tainted text reaching production layer where business logic validation applies.&lt;/p&gt;
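
&lt;p&gt;The trust boundary can be sketched in a few lines. This is a simplified illustration, not the ARGUS implementation: the regex filter stands in for the isolated sanitizer model so the example runs on its own, and &lt;code&gt;SanitizedText&lt;/code&gt;, &lt;code&gt;sanitize&lt;/code&gt;, and &lt;code&gt;production_agent&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import re
from dataclasses import dataclass

# Stand-in for the isolated sanitizer model: a real deployment would call a small
# model here. This regex filter is only a placeholder so the sketch runs.
INSTRUCTION_LIKE = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|system prompt|you are now",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class SanitizedText:
    """Marker type: the only input the production agent accepts."""
    text: str
    dropped: int  # number of lines removed by the sanitizer


def sanitize(raw_web_text: str) -> SanitizedText:
    """Holds no tools or credentials: it can only read and filter."""
    lines = [line.strip() for line in raw_web_text.splitlines() if line.strip()]
    kept = [line for line in lines if not INSTRUCTION_LIKE.search(line)]
    return SanitizedText(text="\n".join(kept), dropped=len(lines) - len(kept))


def production_agent(doc: SanitizedText) -> str:
    """Holds the real permissions, so it refuses anything that skipped the sanitizer."""
    if not isinstance(doc, SanitizedText):
        raise TypeError("production agent only accepts sanitizer output")
    return f"summarizing {len(doc.text.split())} words of vetted content"
```

&lt;p&gt;The type check is the point of the sketch: raw web text cannot reach the privileged side by accident, only through the sanitizer.&lt;/p&gt;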

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;This is not theoretical. I've implemented this in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/anilatambharii/argus" rel="noopener noreferrer"&gt;ARGUS&lt;/a&gt;&lt;/strong&gt;: Dual model verification by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://api.genomixiq.com" rel="noopener noreferrer"&gt;GenomixIQ&lt;/a&gt;&lt;/strong&gt;: Clinical genomics data ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ARIA RCM&lt;/a&gt;&lt;/strong&gt;: Healthcare revenue cycle workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All production systems in regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thursday: Agent firewalls are the next layer
&lt;/h2&gt;

&lt;p&gt;Agent firewalls enforce security policies traditional infrastructure can't.&lt;/p&gt;

&lt;h3&gt;
  
  
  What they block
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instruction injection&lt;/strong&gt;: Override commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential exfiltration&lt;/strong&gt;: Data to external endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privilege escalation&lt;/strong&gt;: Unauthorized tool calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision manipulation&lt;/strong&gt;: Logic chain redirects&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Five-layer architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Input validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown sanitization&lt;/li&gt;
&lt;li&gt;Suspicious URL redaction&lt;/li&gt;
&lt;li&gt;Pattern matching for attack signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Instruction detection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML models trained on override attempts&lt;/li&gt;
&lt;li&gt;Recognizes semantic patterns (role reversals, system prompt refs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Permission checks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compartmentalized tool authorization&lt;/li&gt;
&lt;li&gt;Research agents: read only&lt;/li&gt;
&lt;li&gt;Write agents: database access, no email&lt;/li&gt;
&lt;li&gt;Email agents: no payment processing&lt;/li&gt;
&lt;/ul&gt;
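
&lt;p&gt;A deny-by-default allowlist is one simple way to express this compartmentalization. The sketch below mirrors the three roles above; the tool names and the &lt;code&gt;authorize&lt;/code&gt; and &lt;code&gt;call_tool&lt;/code&gt; helpers are hypothetical.&lt;/p&gt;

```python
# Role-to-tool allowlists mirroring the three roles above. Tool names are hypothetical.
ROLE_TOOLS = {
    "research": frozenset({"web_read"}),
    "write": frozenset({"db_read", "db_write"}),
    "email": frozenset({"email_send"}),
}


class ToolNotPermitted(Exception):
    """Raised when a role tries to call a tool outside its allowlist."""


def authorize(role: str, tool: str) -> None:
    """Deny by default: unknown roles and unlisted tools are both rejected."""
    allowed = ROLE_TOOLS.get(role, frozenset())
    if tool not in allowed:
        raise ToolNotPermitted(f"role {role!r} may not call {tool!r}")


def call_tool(role: str, tool: str, handler, *args):
    """Run a tool handler only after the role check passes."""
    authorize(role, tool)  # the check runs before the tool, not inside it
    return handler(*args)
```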

&lt;p&gt;&lt;strong&gt;Layer 4: Decision logging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full audit trails with context&lt;/li&gt;
&lt;li&gt;Source data tracking&lt;/li&gt;
&lt;li&gt;Reasoning chain capture&lt;/li&gt;
&lt;li&gt;Forensic reconstruction capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 5: Human confirmation gates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial transactions require approval&lt;/li&gt;
&lt;li&gt;Data deletion needs review&lt;/li&gt;
&lt;li&gt;Credential changes trigger verification&lt;/li&gt;
&lt;/ul&gt;
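
&lt;p&gt;Layers 4 and 5 fit together naturally: every action is logged with its sources, and high-stakes actions wait for a human. A minimal sketch follows, assuming the caller supplies an &lt;code&gt;approve&lt;/code&gt; callback (for example a ticketing or chat-approval hook). The action names and the &lt;code&gt;execute_action&lt;/code&gt; function are illustrative, and an in-memory list stands in for a real append-only audit store.&lt;/p&gt;

```python
import json
import time

# Action names are hypothetical; they mirror the three gated categories above.
HIGH_STAKES = {"financial_transaction", "data_deletion", "credential_change"}


def execute_action(action: str, params: dict, source_urls: list, approve, audit_log: list) -> str:
    """Run an action, pausing for human approval when it is high stakes.

    approve is a callable (action, params) -> bool supplied by the caller.
    """
    needs_human = action in HIGH_STAKES
    approved = approve(action, params) if needs_human else True
    # Layer 4: record what was asked, where the data came from, and the outcome.
    audit_log.append(json.dumps({
        "ts": time.time(),
        "action": action,
        "params": params,
        "sources": source_urls,
        "needed_human": needs_human,
        "approved": approved,
    }))
    return "executed" if approved else "blocked"
```

&lt;p&gt;Logging happens whether or not the action is approved, so a blocked attempt leaves the same forensic trail as a completed one.&lt;/p&gt;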

&lt;h3&gt;
  
  
  Zero trust for agents
&lt;/h3&gt;

&lt;p&gt;Never trust input. Assume web content hostile. Verify every action. Log decision lineage. Compartmentalize tools. Human in loop for high stakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Friday: Five questions before deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does your sanitizer have zero system permissions?
&lt;/h3&gt;

&lt;p&gt;If your sanitizer can write to databases or send emails, it's not a sanitizer. It's a production agent reading untrusted input. When compromised, attackers gain those capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are tool permissions compartmentalized by role?
&lt;/h3&gt;

&lt;p&gt;Monolithic access = single compromised agent exposes entire system. Implement RBAC for agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you reconstruct every decision from logs?
&lt;/h3&gt;

&lt;p&gt;If compliance asks why an agent made a recommendation 6 months ago, can you trace to exact data sources and reasoning steps?&lt;/p&gt;

&lt;h3&gt;
  
  
  Does human confirmation trigger for financial actions?
&lt;/h3&gt;

&lt;p&gt;Agents processing payments without approval = automated embezzlement risk. Confirmation gates are not optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Have you tested injection attacks?
&lt;/h3&gt;

&lt;p&gt;No red team testing = you don't know if defenses work. Run adversarial testing continuously.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The 86-89% that fail&lt;/strong&gt; discover these requirements 6 weeks before go-live when compliance asks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 11-14% that succeed&lt;/strong&gt; build them on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for your systems
&lt;/h2&gt;

&lt;p&gt;Security architecture requirements:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dual model verification&lt;/strong&gt; - Sanitizer + production agent separation&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Compartmentalized permissions&lt;/strong&gt; - Role-based tool access&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Decision lineage tracking&lt;/strong&gt; - Full audit trails&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Human confirmation gates&lt;/strong&gt; - Required for high-stakes actions&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Continuous injection testing&lt;/strong&gt; - Red team + automated&lt;/p&gt;

&lt;p&gt;Not optional enhancements. Production requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ambharii.com/tools" rel="noopener noreferrer"&gt;AI Aether&lt;/a&gt;&lt;/strong&gt;: Free agent security readiness assessment (30 min, 30 questions)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/anilatambharii/argus" rel="noopener noreferrer"&gt;ARGUS&lt;/a&gt;&lt;/strong&gt;: Dual model verification, available on PyPI/GitHub&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://api.genomixiq.com" rel="noopener noreferrer"&gt;GenomixIQ&lt;/a&gt;&lt;/strong&gt;: Clinical genomics with FHIR R4 interoperability&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ARIA RCM&lt;/a&gt;&lt;/strong&gt;: Healthcare revenue cycle with HIPAA compliance&lt;/p&gt;

&lt;p&gt;All production-grade. No pilots. No POCs. Systems that ship and scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Years of production AI taught one lesson
&lt;/h2&gt;

&lt;p&gt;The teams that succeed build governance before deployment, not after compliance review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCMTech&lt;/strong&gt;: $340M measurable improvements, 89 days integration, zero clinical data loss&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GeneticsTech&lt;/strong&gt;: 99.97% uptime during 50TB migration, FHIR R4 compliance throughout&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EnergyTech&lt;/strong&gt;: 23→81% AI adoption among 20-year veteran operators&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HealthTech&lt;/strong&gt;: Petabyte-scale platforms, every decision traceable&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Anil Prasad&lt;/strong&gt; is Founder of Ambharii Technologies and Head of Engineering &amp;amp; Product at EnergyTech.&lt;/p&gt;

&lt;p&gt;28 years building production AI in regulated environments across Fortune 100 companies. Currently building agent security infrastructure for enterprise AI: dual-model verification, compartmentalized permissions, and audit trail architecture for autonomous systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/anilsprasad" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt; | &lt;a href="https://github.com/anilatambharii" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next week&lt;/strong&gt;: Production deployment patterns, compliance architecture, audit trail infrastructure.&lt;/p&gt;

&lt;p&gt;#AgentSecurity #EnterpriseAI #HumanWritten #ExpertiseFromField&lt;/p&gt;

</description>
      <category>productionai</category>
      <category>llmops</category>
      <category>agentsecurity</category>
      <category>aigovernance</category>
    </item>
    <item>
      <title>Claude Code Has 6 Ways to Authenticate. I Built a Cross-Platform Installer Because of It</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Wed, 06 May 2026 16:48:06 +0000</pubDate>
      <link>https://dev.to/anilatambharii/claude-code-has-6-ways-to-authenticate-i-built-a-cross-platform-installer-because-of-it-171k</link>
      <guid>https://dev.to/anilatambharii/claude-code-has-6-ways-to-authenticate-i-built-a-cross-platform-installer-because-of-it-171k</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Claude Code supports 6 different authentication methods with a strict priority order. Get the order wrong and your Pro subscription silently gets overridden by an API key, costing you real money.&lt;/p&gt;

&lt;p&gt;I built claude-auth-setup — a cross-platform installer (Bash + Batch + PowerShell) that handles the whole thing correctly. MIT licensed, ~17KB of bash, zero runtime dependencies.&lt;/p&gt;

&lt;p&gt;This post walks through the design decisions, the cross-platform tax, and the testing approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The Claude Code auth resolution order, highest to lowest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cloud provider creds (Bedrock / Vertex AI / Foundry)&lt;/li&gt;
&lt;li&gt;ANTHROPIC_AUTH_TOKEN&lt;/li&gt;
&lt;li&gt;ANTHROPIC_API_KEY     ← the silent footgun&lt;/li&gt;
&lt;li&gt;apiKeyHelper script&lt;/li&gt;
&lt;li&gt;CLAUDE_CODE_OAUTH_TOKEN&lt;/li&gt;
&lt;li&gt;Subscription OAuth    ← what most users actually want&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89556629th2b26r55nml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89556629th2b26r55nml.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a Pro/Max subscriber and you ever set ANTHROPIC_API_KEY to test something, the API key keeps winning until you explicitly unset it. No error. No warning. Just per-token charges added on top of your $20/month subscription.&lt;/p&gt;
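
&lt;p&gt;The priority order can be expressed as a short resolver, which makes the shadowing easy to see. This is a sketch of the order as listed in this post, not Claude Code's actual implementation: cloud-provider detection and the apiKeyHelper script are reduced to boolean flags, and &lt;code&gt;resolve_auth&lt;/code&gt; and &lt;code&gt;shadowing_warning&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
# Priority order as listed in this post, highest first. The cloud-provider check
# is reduced to a boolean flag here; how Claude Code detects it is not modeled.
ENV_PRIORITY = ["ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY"]


def resolve_auth(env: dict, cloud_creds: bool = False, api_key_helper: bool = False) -> str:
    """Return the name of the auth method that would win under the listed order."""
    if cloud_creds:
        return "cloud provider credentials"
    for name in ENV_PRIORITY:
        if env.get(name):
            return name
    if api_key_helper:
        return "apiKeyHelper"
    if env.get("CLAUDE_CODE_OAUTH_TOKEN"):
        return "CLAUDE_CODE_OAUTH_TOKEN"
    return "subscription OAuth"


def shadowing_warning(env: dict) -> str:
    """Flag the footgun: an API key set alongside an intended subscription login."""
    if resolve_auth(env) == "ANTHROPIC_API_KEY":
        return "ANTHROPIC_API_KEY is set and will override subscription OAuth"
    return ""
```

&lt;p&gt;Calling &lt;code&gt;shadowing_warning(dict(os.environ))&lt;/code&gt; before logging in is the kind of check the installer automates.&lt;/p&gt;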

&lt;p&gt;The single most common Claude Code support thread is some variation of:&lt;/p&gt;

&lt;p&gt;"My Anthropic Console bill went from $0 to $47 last month and I don't know why."&lt;/p&gt;

&lt;p&gt;The "why" is almost always a stale ANTHROPIC_API_KEY from a tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Script Instead of Better Docs
&lt;/h2&gt;

&lt;p&gt;Documentation tells you the rules. A setup script enforces them. A doc that says "remove ANTHROPIC_API_KEY before logging in" gets skimmed. A script that detects the conflict, explains why it's a problem, asks for permission to back up your shell config, and then unsets it: that one ships the right outcome.&lt;/p&gt;

&lt;p&gt;The installer does five things in order:&lt;/p&gt;

&lt;p&gt;Verify install — checks for claude, offers npm i -g @anthropic-ai/claude-code if missing&lt;br&gt;
Ask one question — "Do you have a Claude subscription?" Branches from this&lt;br&gt;
Detect conflicts — finds existing env vars, explains what they'd do, asks before changing&lt;br&gt;
Validate — sk-ant- prefix check, length check, env var persistence check&lt;br&gt;
Back up before mutating — every shell config edit gets a timestamped backup with a printed rollback command&lt;br&gt;
The Cross-Platform Tax&lt;br&gt;
Bash (macOS / Linux): the easy one&lt;br&gt;
detect_shell_config() {&lt;br&gt;
  case "$SHELL" in&lt;br&gt;
    &lt;em&gt;/zsh)  echo "$HOME/.zshrc" ;;&lt;br&gt;
    */bash) [[ "$OSTYPE" == "darwin"&lt;/em&gt; ]] &amp;amp;&amp;amp; echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" ;;&lt;br&gt;
    *)      echo "$HOME/.profile" ;;&lt;br&gt;
  esac&lt;br&gt;
}&lt;br&gt;
Append the export, source the file, done.&lt;/p&gt;
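
&lt;p&gt;The Validate step from the installer list is described above but not shown, so here is a minimal Python sketch of what a prefix and length check could look like. Only the sk-ant- prefix comes from this post; the function name validate_key, the constant MIN_KEY_LENGTH, and its value of 40 are assumptions for illustration, not the installer's actual code and not an official key format.&lt;/p&gt;

```python
# Hypothetical sketch: only the "sk-ant-" prefix is taken from the post.
KEY_PREFIX = "sk-ant-"
MIN_KEY_LENGTH = 40  # assumed threshold, chosen only for illustration


def validate_key(key):
    """Return a list of problems; an empty list means the key looks plausible."""
    problems = []
    if key != key.strip():
        problems.append("key has leading or trailing whitespace")
    if not key.startswith(KEY_PREFIX):
        problems.append("key does not start with " + KEY_PREFIX)
    # Number of characters missing to reach the assumed minimum length.
    missing = max(MIN_KEY_LENGTH - len(key), 0)
    if missing:
        problems.append("key is " + str(missing) + " characters too short")
    return problems


if __name__ == "__main__":
    for candidate in ["sk-ant-" + "x" * 40, "abc"]:
        print(repr(candidate[:10]), validate_key(candidate))
```

&lt;p&gt;Returning a list of problems rather than a single boolean lets the script print every reason at once, which fits the "explain before changing anything" approach described earlier.&lt;/p&gt;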

&lt;p&gt;&lt;strong&gt;Batch (Windows cmd): the hard one&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Windows persists user env vars in the registry under HKEY_CURRENT_USER\Environment. The supported tool is setx, which has two gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1024-character limit on the value (anything longer gets truncated)&lt;/li&gt;
&lt;li&gt;Doesn't update the current session, only processes started after the setx call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So users would run the script, run claude, see the same error, and assume the script broke. The fix is to set the variable in both places:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;setx ANTHROPIC_API_KEY "%KEY%"
set ANTHROPIC_API_KEY=%KEY%
echo NOTE: open a new Command Prompt window to verify persistence
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;PowerShell: the third path&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PowerShell has $PROFILE, but you can't assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The profile file exists&lt;/li&gt;
&lt;li&gt;Execution policy allows it to load&lt;/li&gt;
&lt;li&gt;The user knows what $PROFILE is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The script gracefully degrades: profile edit → registry write → manual command shown to the user.&lt;/p&gt;

&lt;p&gt;What "Production-Grade" Means at 17KB&lt;br&gt;
I reduced it to four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Idempotency: running twice is safe. The second run detects the configured state and exits cleanly, with no duplicate exports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inspectability: before any mutation, print exactly what's about to happen and wait for y/n:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;About to add this line to /Users/anil/.zshrc:
  export ANTHROPIC_API_KEY="sk-ant-..."
Continue? [y/N]
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reversibility: every backup is timestamped (~/.zshrc.backup_20250506_143022). Rollback is one cp command, printed to the screen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testability: a test suite that validates the installer without mutating user state. It sandboxes backup/restore in /tmp, regex-checks key validation, and verifies cross-platform parity. Runs in &amp;lt;2s.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
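
&lt;p&gt;Idempotency and reversibility are easier to see in code than in prose. The following is a minimal Python sketch, not the installer itself (which is Bash and Batch): the function name add_export_line is hypothetical, and only the .backup_YYYYMMDD_HHMMSS naming and the printed cp rollback follow what the post describes.&lt;/p&gt;

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def add_export_line(config_path, line):
    """Append line to the config once, backing it up first.

    Returns the backup path, or None when nothing had to change.
    """
    config = Path(config_path)
    existing = config.read_text() if config.exists() else ""
    if line in existing.splitlines():
        return None  # already configured: a second run is a no-op
    backup = None
    if config.exists():
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup = config.with_name(config.name + ".backup_" + stamp)
        shutil.copy2(config, backup)
        print("Rollback: cp " + str(backup) + " " + str(config))
    # Make sure the new line starts on its own line.
    separator = "" if (not existing or existing.endswith("\n")) else "\n"
    config.write_text(existing + separator + line + "\n")
    return backup


if __name__ == "__main__":
    # Sandbox the demo in a temp directory so no real shell config is touched.
    sandbox = Path(tempfile.mkdtemp())
    rc_file = sandbox / ".zshrc"
    rc_file.write_text("alias ll='ls -l'\n")
    print(add_export_line(rc_file, 'export EXAMPLE_VAR="value"'))
    print(add_export_line(rc_file, 'export EXAMPLE_VAR="value"'))  # None
```

&lt;p&gt;Running the demo against a temp directory mirrors the testability point: the same function can be exercised without mutating user state.&lt;/p&gt;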

&lt;p&gt;Test suite output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlq6drczx0ddqfh8k475.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlq6drczx0ddqfh8k475.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(The 1 failure is a regex bug in the test, not the installer.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Single Best Thing I Did
&lt;/h2&gt;

&lt;p&gt;Replaced a flat 6-option menu with one yes/no question:&lt;/p&gt;

&lt;p&gt;"Do you have a Claude subscription? [y/N]"&lt;/p&gt;

&lt;p&gt;Everything else branches from there. Conversion (people completing the script vs abandoning it mid-flow) went from "hard to measure but bad" to "essentially everyone finishes."&lt;/p&gt;

&lt;p&gt;If you remember nothing else from this post: users don't know which auth method applies to them. They know whether they pay a subscription. Branch on that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Too clever about shell detection. The first version parsed $SHELL, then $0, then ps -p $$ -o comm=. Over-engineered. Three lines instead of thirty was the fix.&lt;/li&gt;
&lt;li&gt;PowerShell-first testing. I wrote the canonical tests in PowerShell and hit version and encoding compatibility issues across Windows machines. Now the canonical suite is Bash; PowerShell is a convenience port.&lt;/li&gt;
&lt;li&gt;Underestimated docs. The repo has a README, quick-start, contributing guide, configuration examples, project overview, deployment doc, and build summary. That sounds like a lot for 17KB of script, until you realize the script is the easy part. The diagnosis ("why am I getting charged?") lives in the docs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;# Unix
git clone https://github.com/your-repo/claude-auth-setup.git
cd claude-auth-setup
chmod +x setup-claude-auth.sh
./setup-claude-auth.sh

# Windows
.\setup-claude-auth.bat
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;MIT licensed. Issues, PRs, and bug reports all welcome. The best one I got so far was:&lt;/p&gt;

&lt;p&gt;"It worked. Why didn't this exist already?"&lt;/p&gt;

&lt;p&gt;I don't know either.&lt;/p&gt;

&lt;p&gt;Repo: github.com/your-repo/claude-auth-setup&lt;/p&gt;

&lt;p&gt;Follow me here on dev.to for more posts about the unglamorous parts of shipping production tools.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
    </item>
    <item>
      <title>The week the agent capability inflection arrived. And what to do about the 86% that still fail.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Sat, 02 May 2026 17:48:10 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-week-the-agent-capability-inflection-arrived-and-what-to-do-about-the-86-that-still-fail-25b9</link>
      <guid>https://dev.to/anilatambharii/the-week-the-agent-capability-inflection-arrived-and-what-to-do-about-the-86-that-still-fail-25b9</guid>
      <description>&lt;p&gt;&lt;strong&gt;By Anil Prasad&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Head of Engineering and Product, Duke Energy CASPAR · Founder, Ambharii Labs&lt;/p&gt;

&lt;h2&gt;
  
  
  Three signals. One pattern.
&lt;/h2&gt;

&lt;p&gt;Stanford released the 2026 AI Index this week. AI agents jumped from 12% to 66% success on real computer tasks in one year. That is a 5.5x capability multiplier in twelve months.&lt;/p&gt;

&lt;p&gt;In the same week, industry research confirmed that 86 to 89% of enterprise AI agent pilots fail to reach production at scale. Apoorva Mehta launched Abundance, a hedge fund with $100M in seed funding designed to have AI agents run the entire fund. JPMorgan reported their LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles for portfolio managers.&lt;/p&gt;

&lt;p&gt;These stories are not contradictory. They describe the same reality from different angles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The capability inflection has happened. The deployment infrastructure investment lags 18 months behind. That gap is the business opportunity of 2026.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quick numbers before we dig in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88b678yydjkrs20srikn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88b678yydjkrs20srikn.png" alt=" " width="760" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monday: Stanford 12 to 66. Here is what most coverage will miss.
&lt;/h2&gt;

&lt;p&gt;Stanford published the 2026 AI Index this week. The 66% number on real computer tasks will be quoted in every AI keynote for the next twelve months.&lt;/p&gt;

&lt;p&gt;The number is real. The capability inflection has happened.&lt;br&gt;
What everyone is going to miss: 66% on benchmark tasks does not equal 66% in your production environment.&lt;/p&gt;

&lt;p&gt;Benchmarks measure: can the agent complete this task in ideal conditions with clean inputs and a defined success criterion?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production measures: can the agent complete this task at 2 AM on Sunday when the upstream data feed is degraded, the API is throttled, and the human reviewer is asleep?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those are different questions. The benchmark answers one. The other one decides whether your AI program ships or fails.&lt;/p&gt;

&lt;p&gt;The capability bottleneck is gone. The readiness bottleneck just became the only bottleneck that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuesday: 86 to 89% of pilots fail. The four reasons. All fixable.
&lt;/h2&gt;

&lt;p&gt;Industry research published this month confirmed what 28 years in production AI has taught me. Agent pilots fail in predictable ways. The fixes are known. Almost nobody is applying them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 1: Governance breakdowns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pilot worked. The team wants to scale. The compliance team has not seen the system yet. Six weeks of compliance review later, the pilot has lost momentum, the team has shifted to other priorities, and the agent is sitting in staging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Compliance starts at week zero, not week sixteen. If your AI program treats compliance as a release gate, you have already lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 2: Evaluation infrastructure gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pilot demonstrated 84% accuracy on a curated test set. In production, the team cannot tell whether the agent is performing better or worse than baseline because they never built the evaluation framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Build the evaluation infrastructure before the agent. This is what G-ARVIS exists to do. Nine dimensions built from production failure, not academic theory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 3: Integration complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Integration and governance consume up to 60% of AI agent project budgets. Most teams plan for the model and underinvest in everything around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Plan a 60% integration budget from day one. If the team budgeted 80% for the model and 20% for integration, the project is going to overrun before it ships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 4: Accountability gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the agent is wrong, nobody knows whose problem it is. The system fails in the gap between teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Assign one accountable human per agent before deployment. The work belongs to a name, not a function.&lt;/p&gt;

&lt;p&gt;The 86 to 89% failure rate is not happening because the technology does not work. It is happening because organizations are deploying capability without the foundation to support it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wednesday: A2A and MCP crossed 150 production deployments. The architecture conversation just shifted.
&lt;/h2&gt;

&lt;p&gt;Three months ago the question was: which orchestration framework should we use?&lt;/p&gt;

&lt;p&gt;Today the question is: do our agents speak the right protocols?&lt;br&gt;
Two protocols are emerging as the foundation of multi-agent systems in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; handles vertical connectivity. Agent to tool. Agent to data source. Agent to API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A (Agent to Agent)&lt;/strong&gt; handles horizontal connectivity. Direct peer to peer delegation between agents.&lt;/p&gt;

&lt;p&gt;Together they replace the brittle custom integration code that has been the failure mode of multi-agent systems for the past three years.&lt;br&gt;
This is the Kubernetes moment for agentic AI.&lt;/p&gt;

&lt;p&gt;The pattern looks exactly like what happened to microservices ten years ago. Custom service discovery, custom load balancing, custom health checks. Then Kubernetes standardized all of it. The organizations that built on the standardized layer were able to scale. The ones that built proprietary versions had to rewrite their infrastructure.&lt;/p&gt;

&lt;p&gt;Vendor lock-in just changed shape too. Three years ago you locked in by choosing a model. Eighteen months ago you locked in by choosing an orchestration framework. In 2026, the lock-in is at the protocol layer. Organizations that build on standardized protocols can swap models, frameworks, even vendors with bounded engineering effort.&lt;/p&gt;

&lt;p&gt;ARGUS now supports both A2A and MCP natively. Every tool call through MCP gets logged with full audit trail. Every agent to agent message through A2A gets traced with sender, recipient, timestamp, and payload hash.&lt;/p&gt;
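
&lt;p&gt;To make the trace fields concrete, here is a minimal Python sketch of one agent to agent audit record carrying sender, recipient, timestamp, and payload hash. This is not the ARGUS API, which is not shown in this post: the function names payload_hash and trace_message, the choice of SHA-256, and the canonical JSON encoding are all assumptions for illustration.&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone


def payload_hash(payload):
    """SHA-256 over a canonical JSON encoding, so equal payloads hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def trace_message(sender, recipient, payload, now=None):
    """Build one audit record for an agent-to-agent message (hypothetical schema)."""
    timestamp = now or datetime.now(timezone.utc)
    return {
        "sender": sender,
        "recipient": recipient,
        "timestamp": timestamp.isoformat(),
        "payload_hash": payload_hash(payload),
    }


if __name__ == "__main__":
    record = trace_message(
        "planner-agent",    # hypothetical agent names
        "retrieval-agent",
        {"task": "summarize", "doc_id": 7},
    )
    print(json.dumps(record, indent=2))
```

&lt;p&gt;Hashing a canonical encoding rather than the raw dict means two messages with the same content but different key order produce the same hash, which matters when the trace has to be compared against replayed data later.&lt;/p&gt;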

&lt;h2&gt;
  
  
  Thursday: Financial AI just had its inflection point.
&lt;/h2&gt;

&lt;p&gt;Apoorva Mehta launched Abundance, a hedge fund designed to have AI agents run the entire fund with $100M in seed funding. JPMorgan's LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial services AI just crossed a threshold most other industries have not faced yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When AI agents are managing money, every decision is not just one inference. It is a chain of reasoning across multiple agents that has to be reconstructable when the SEC asks.&lt;/p&gt;

&lt;p&gt;For an agent to participate in a regulated financial workflow, every decision must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reconstructable months after the fact&lt;/li&gt;
&lt;li&gt;Attributable to specific data sources at specific timestamps&lt;/li&gt;
&lt;li&gt;Explainable in language the regulator can evaluate&lt;/li&gt;
&lt;li&gt;Reviewable by a human with override authority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your agent infrastructure does not support all four, the agent cannot ship into a regulated financial environment.&lt;/p&gt;

&lt;p&gt;This is exactly the gap ARGUS is built to close. Every agent decision logged with input hash, output hash, model version, and tool calls. Full reasoning trace across multi-agent workflows. Time stamped audit log that can be replayed against the original data state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Friday synthesis: The Ambry Genetics migration story.
&lt;/h2&gt;

&lt;p&gt;We migrated a clinical genomics AI platform from MySQL to Vitess at Ambry Genetics. 99.97% uptime. Zero clinical data loss. An 8-month migration during which the AI was making real recommendations for real patients.&lt;/p&gt;

&lt;p&gt;The migration could have happened faster. We chose to optimize for safety, not speed.&lt;/p&gt;

&lt;p&gt;What that taught me about AI in regulated environments: the model is the least constrained part of the system. Infrastructure, data governance, compliance requirements, and clinical validation processes are the actual engineering challenges.&lt;/p&gt;

&lt;p&gt;Every healthcare AI implementation I have seen fail has failed at infrastructure or governance. Not at model accuracy.&lt;/p&gt;

&lt;p&gt;If you are deploying AI in healthcare, energy, or financial services, your constraint set looks more like that migration than like a benchmark optimization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ambharii Labs platform suite
&lt;/h2&gt;

&lt;p&gt;This week marks three weeks since GenomixIQ and ARIA RCM launched. Health system inquiries on FHIR R4 interoperability are validating the architectural decisions made years before launch.&lt;/p&gt;

&lt;p&gt;AI Aether (ambharii.com/tools)&lt;br&gt;
Free enterprise AI readiness assessment. 8 dimensions on the G-ARVIS framework. Board ready roadmap. 30 minutes.&lt;/p&gt;

&lt;p&gt;ARGUS (github.com/anilatambharii/argus)&lt;br&gt;
Autonomous LLM correction and agent monitoring. Now native to A2A and MCP protocols. Open source. PyPI: pip install argus-ai&lt;/p&gt;

&lt;p&gt;GenomixIQ (genomixiq.com)&lt;br&gt;
12-agent molecular mesh for genomic variant interpretation. FHIR R4 from day one. Variant Intelligence Score. Population stratified evaluation.&lt;/p&gt;

&lt;p&gt;ARIA RCM (&lt;a href="mailto:anil@ambharii.com"&gt;anil@ambharii.com&lt;/a&gt;)&lt;br&gt;
11-agent healthcare revenue cycle platform. Three viable acquisition paths: Oracle Health, Microsoft Nuance, NVIDIA Healthcare.&lt;/p&gt;

&lt;p&gt;One shared architecture. G-ARVIS observability across all four. ARGUS self correction built into every agent. Production grade from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The week in one sentence
&lt;/h2&gt;

&lt;p&gt;The agents work at scale. Most organizations are not yet ready to deploy them safely. That gap is the business opportunity of 2026.&lt;/p&gt;

&lt;p&gt;If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions worth sitting with this weekend are the same five I ask in every program kickoff.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does failure look like and who does it hurt?&lt;/li&gt;
&lt;li&gt;Who is accountable when the agent is wrong?&lt;/li&gt;
&lt;li&gt;How does the agent know what it does not know?&lt;/li&gt;
&lt;li&gt;What is the kill switch and who can pull it?&lt;/li&gt;
&lt;li&gt;What does the audit trail look like nine months from now?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your team can answer all five with specifics, you are positioned for the 11 to 14% that will succeed.&lt;/p&gt;

&lt;p&gt;If they cannot, the foundation work is ahead of any deployment work.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Anil Prasad is Head of Engineering and Product at Duke Energy and Founder of Ambharii Labs. He serves as an AI Factory Builder at BCG and co-founded the CDAIO Circle Tri-State Chapter. He has 28 years of production AI experience across Fortune 100 companies including R1 RCM, Ambry Genetics, UnitedHealth Group, Medtronic, and Accenture. He was recognized as one of the Top 100 Most Influential AI Leaders USA 2024 and holds degrees from Stanford and BITS Pilani.&lt;/p&gt;

&lt;p&gt;ambharii.com | linkedin.com/in/anilsprasad | @anilsprasad on X | anilsprasad.substack.com&lt;/p&gt;

&lt;p&gt;Subscribe to Field Notes: Production AI for weekly insights from 28 years building AI in regulated environments. No benchmarks. No hype. Real deployments, real failure modes, and the infrastructure decisions that distinguish production AI from demo AI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>aiops</category>
    </item>
    <item>
      <title>The week AI capability outpaced readiness. Again. Here is what it means in production.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 24 Apr 2026 12:47:57 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-week-ai-capability-outpaced-readiness-again-here-is-what-it-means-in-production-47kh</link>
      <guid>https://dev.to/anilatambharii/the-week-ai-capability-outpaced-readiness-again-here-is-what-it-means-in-production-47kh</guid>
      <description>&lt;h2&gt;
  
  
  Three events. One pattern.
&lt;/h2&gt;

&lt;p&gt;Three significant things happened in AI this week. Claude Opus 4.7 launched. The EU AI Act moved into full enforcement. And a new arXiv paper, EviSearch, validated what I have been building around for six years: domain-specific multi-agent architectures outperform general ones in clinical settings.&lt;/p&gt;

&lt;p&gt;Each story is real. Each story matters. And each story points to the same pattern I have watched repeat across 28 years of production AI in healthcare, energy, and financial services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability accelerates faster than readiness. Every time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zgvhs9tgiydp08la2ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zgvhs9tgiydp08la2ux.png" alt=" " width="664" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monday · April 20&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Opus 4.7: the benchmark is impressive. Here is the real question.
&lt;/h2&gt;

&lt;p&gt;SWE-bench Pro reached 64.3%, up 10.9 points in a single version. SWE-bench Verified hit 87.6%. CursorBench reached 70%. Tool error rates dropped by two thirds. Self-verification built in at the model level. These are genuinely significant improvements.&lt;/p&gt;

&lt;p&gt;But the question I am not seeing asked in any of the coverage: does your organization have the evaluation infrastructure to know whether this model is actually better for your specific use case?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwur31w3r8hco9nm5httc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwur31w3r8hco9nm5httc.png" alt=" " width="642" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The organizations that move confidently after a major model launch are not the ones with the most advanced AI. They are the ones with evaluation infrastructure that can answer four questions within 72 hours of a new model release.&lt;/p&gt;

&lt;p&gt;Is this model better on our specific domain tasks? Is output variance within our acceptable range? What happens to cost-per-correct-output? Can our governance layer onboard this model without a compliance review starting from zero?&lt;/p&gt;

&lt;p&gt;If you cannot answer all four within 72 hours, you are not evaluating the model. You are waiting for someone else to tell you whether to use it. That is a readiness infrastructure problem, not a model problem.&lt;/p&gt;

&lt;p&gt;The self-verification feature is genuinely novel. Two thirds fewer tool errors means a system that needs much less constant human oversight. For multi-agent workflows running thousands of tool calls per day, that is the difference between a system that runs reliably overnight and one that requires a human on call. ARGUS operates the same self-correction principle at the system layer across the entire agent workflow, not just within a single inference.&lt;/p&gt;

&lt;p&gt;Tuesday · April 21&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;EU AI Act: the audit trail is the most common gap. Here is how to close it.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The EU AI Act entered full enforcement in 2026. Fines up to 7% of global annual turnover. High-risk categories include healthcare AI, critical infrastructure, employment, and education technology. Those are the exact sectors I have spent 28 years building production AI for.&lt;/p&gt;

&lt;p&gt;The five mandatory requirements for high-risk AI systems are: a risk management system maintained throughout the entire lifecycle, complete technical documentation, human oversight and intervention mechanisms, demonstrable accuracy and robustness, and a full audit trail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeiy3yfgsel22w8ofyr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeiy3yfgsel22w8ofyr1.png" alt=" " width="660" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At an energy enterprise, I rebuilt the entire logging layer before deploying a single agent in a live operational context. A grid operations manager asked a question I was not prepared for: "If this system makes a recommendation that causes an outage, and FERC comes knocking, can you show them exactly what the model saw, what it decided, and why?"&lt;/p&gt;

&lt;p&gt;We could not answer that. We rebuilt. That decision delayed the launch by six weeks and saved us months of regulatory exposure eighteen months later.&lt;/p&gt;

&lt;p&gt;ARGUS generates the full audit trail by default. Every inference logged with input hash, output hash, timestamp, and model version. Every tool call traced with actor identity and permission scope. Every human override recorded with reason and outcome. Not as a reporting feature. As the foundational observability layer. github.com/anilatambharii/argus or pip install argus-ai&lt;/p&gt;

&lt;p&gt;Wednesday · April 22&lt;/p&gt;

&lt;h2&gt;
  
  
  EviSearch and the domain-specific agent case: specificity is the moat.
&lt;/h2&gt;

&lt;p&gt;A paper published this week on arXiv described EviSearch, a multi-agent system that automates the creation of clinical evidence tables from medical literature using a specialized architecture. The finding was exactly what I have seen in every clinical AI program I have run: domain-specific agent architectures outperform general-purpose ones in technical domains, typically by 15 to 25 percentage points on domain-relevant evaluation tasks.&lt;/p&gt;

&lt;p&gt;Why the gap exists: A general-purpose agent reasons about yo&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkry4vjx5wvv14lfbiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkry4vjx5wvv14lfbiv.png" alt=" " width="642" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why GenomixIQ uses 12 specialized agents rather than one large general agent. The literature agent understands how to evaluate evidence in population genetics. The ACMG criteria agent knows all 28 classification criteria and the interaction rules between them. The conflict resolution agent knows which database takes precedence when population databases disagree. None of that is prompt engineering. All of it is architectural encoding of domain expertise.&lt;/p&gt;

&lt;p&gt;The EviSearch paper also documented that multi-agent systems for clinical evidence work show inter-run variability below 5%, compared to 15 to 30% for human reviewers on complex evidence tables. Consistency in clinical decision support is not a nice-to-have. It is the compliance requirement.&lt;/p&gt;

&lt;p&gt;Thursday · April 23&lt;/p&gt;

&lt;h2&gt;
  
  
  G-ARVIS: the nine dimensions most AI teams are not measuring.
&lt;/h2&gt;

&lt;p&gt;I built the G-ARVIS framework from production failure across 28 years in regulated environments. Nine dimensions. Not from academic theory. From watching accurate models fail catastrophically because nobody was measuring the right things.&lt;/p&gt;

&lt;p&gt;The six core dimensions: Groundedness (anchored to verifiable facts), Accuracy (correct output consistently), Reliability (stable at scale across thousands of runs), Variance (output stability on the same prompt across runs), Inference Cost (cost per correct output, not cost per token), Safety (domain-specific harm profile for this domain, this use case, this failure mode).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtwrdyfu0ick7yb18cet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtwrdyfu0ick7yb18cet.png" alt=" " width="651" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three agentic metrics I added specifically for multi-agent production systems: Action Sequence Fidelity (percentage of multi-step workflows completing without human intervention), Error Recovery Rate (when an agent fails, how often does the system recover without escalation), and Cost Per Correct Sequence (total inference cost divided by the number of complete sequences producing a validated correct output).&lt;/p&gt;
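
&lt;p&gt;The three agentic metrics are simple ratios, and a small Python sketch shows how they could be computed from per-workflow run records. The record fields (completed_unaided, failures, recovered, cost, correct) and the function name agentic_metrics are assumptions made for this example; they are not the G-ARVIS or AI Aether schema, and the sample numbers are made up.&lt;/p&gt;

```python
def agentic_metrics(runs):
    """Compute the three agentic metrics from a list of run records.

    Each run is a dict with hypothetical keys:
      completed_unaided: True if the workflow finished without human intervention
      failures:          number of agent failures during the run
      recovered:         how many of those failures recovered without escalation
      cost:              total inference cost of the run
      correct:           True if the run produced a validated correct output
    """
    total = len(runs)
    unaided = sum(1 for r in runs if r["completed_unaided"])
    failures = sum(r["failures"] for r in runs)
    recovered = sum(r["recovered"] for r in runs)
    correct = sum(1 for r in runs if r["correct"])
    total_cost = sum(r["cost"] for r in runs)
    return {
        # Ratios are None when the denominator is zero, rather than guessing.
        "action_sequence_fidelity": unaided / total if total else None,
        "error_recovery_rate": recovered / failures if failures else None,
        "cost_per_correct_sequence": total_cost / correct if correct else None,
    }


if __name__ == "__main__":
    sample_runs = [
        {"completed_unaided": True, "failures": 0, "recovered": 0, "cost": 0.5, "correct": True},
        {"completed_unaided": True, "failures": 1, "recovered": 1, "cost": 0.5, "correct": True},
        {"completed_unaided": True, "failures": 0, "recovered": 0, "cost": 1.0, "correct": True},
        {"completed_unaided": False, "failures": 1, "recovered": 0, "cost": 1.0, "correct": False},
    ]
    print(agentic_metrics(sample_runs))
```

&lt;p&gt;Note that Cost Per Correct Sequence divides the cost of all runs, including failed ones, by the number of correct runs, so failed sequences make every correct one more expensive.&lt;/p&gt;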

&lt;p&gt;All nine are assessed in AI Aether. 73% of organizations score below 12 out of 30 on data architecture alone. The foundation problem has not changed in 28 years. Only the model on top of it has. ambharii.com/tools&lt;/p&gt;

&lt;p&gt;The Ambharii Labs Platform&lt;/p&gt;

&lt;h2&gt;
  
  
  Four platforms. One shared architecture.
&lt;/h2&gt;

&lt;p&gt;This week marks two weeks since GenomixIQ and ARIA RCM launched, with ARGUS SDK updates shipping and AI Aether continuing to show the same pattern: 73% of organizations score below 12/30 on data architecture. The foundation problem precedes every other problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jcyw8sj9vct5g3a9k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jcyw8sj9vct5g3a9k6.png" alt=" " width="668" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Week in One Sentence&lt;/p&gt;

&lt;h2&gt;
  
  
  AI shipped faster than most organizations can absorb it. The gap between capability and readiness is the business opportunity of 2026.
&lt;/h2&gt;

&lt;p&gt;If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions I am asking every week are the same questions you should be asking: What does your AI actually do at 2 AM? Who sees the audit trail? What happens when the model is wrong in a way it has never been wrong before?&lt;/p&gt;

&lt;p&gt;The answers to those questions are what distinguish production AI from demo AI. That distinction is what 28 years in this field teaches you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2qhtgin8lg5an5hojce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2qhtgin8lg5an5hojce.png" alt=" " width="713" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>argus</category>
      <category>genomixiq</category>
      <category>ariarcm</category>
    </item>
  </channel>
</rss>
