<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anil Prasad</title>
    <description>The latest articles on DEV Community by Anil Prasad (@anilatambharii).</description>
    <link>https://dev.to/anilatambharii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843681%2Fe0b19f3a-123f-4286-b970-10682e211b29.jpeg</url>
      <title>DEV Community: Anil Prasad</title>
      <link>https://dev.to/anilatambharii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anilatambharii"/>
    <language>en</language>
    <item>
      <title>I put one proxy in front of every AI tool my team uses 85% cache hits, 75% lower cost</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Sun, 14 Jun 2026 22:30:49 +0000</pubDate>
      <link>https://dev.to/anilatambharii/i-put-one-proxy-in-front-of-every-ai-tool-my-team-uses-85-cache-hits-75-lower-cost-262g</link>
      <guid>https://dev.to/anilatambharii/i-put-one-proxy-in-front-of-every-ai-tool-my-team-uses-85-cache-hits-75-lower-cost-262g</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Your team's AI tools (Claude Code, Copilot, ChatGPT, Gemini, your own agents) each call the LLM API independently — no shared cache, no shared budget, no audit trail. AgentMesh is an open-source proxy that sits in front of all of them and runs every call through a three-layer cache, per-team quotas, cheapest-model routing, and a tamper-evident audit log. You point your tools at it with two env vars. On a reproducible benchmark (no API keys): 85% cache hits, 75% lower cost. Apache 2.0. → pip install agentmesh-proxy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem, in one sentence&lt;/strong&gt;&lt;br&gt;
Every AI tool on your team talks to the model on its own.&lt;br&gt;
Claude Code has its own connection. Copilot has its own. The ChatGPT tab in someone's browser has its own. Your LangGraph service has its own. None of them share a cache, a budget, or an audit log — so the same prompt gets paid for over and over, a runaway loop in one service is invisible to the others, and nobody can answer "what did we send to third-party APIs last quarter?"&lt;br&gt;
This isn't a discipline problem. It's a missing layer. So I built it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;60-second quickstart&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install agentmesh-proxy sentence-transformers
agentmesh serve --port 8080 --demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point any tool at it — no code changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Claude Code, or any Anthropic SDK tool
export ANTHROPIC_BASE_URL=http://localhost:8080

# Copilot / Cursor / any OpenAI SDK tool
export OPENAI_BASE_URL=http://localhost:8080/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every response comes back with governance headers so you can see what happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X-AgentMesh-Cache:     hit          # exact | semantic | miss
X-AgentMesh-Tokens:    0            # 0 on a cache hit
X-AgentMesh-Cost-USD:  0.000000     # $0 on a cache hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole integration. The agent code never knows the proxy is there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;br&gt;
Every call from the proxy or the SDK runs the same ordered pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bba10h1mkjt4cdw5pf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bba10h1mkjt4cdw5pf1.png" alt=" " width="800" height="1147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting part is the cache, because it does something most "LLM caches" don't.&lt;/p&gt;

&lt;p&gt;Exact-match caching almost never hits in real life, because people rephrase: they paste You are a senior architect. in front of the question, switch between optimise and optimize, wrap things in markdown. So before anything is hashed or embedded, AgentMesh normalizes the prompt — stripping the noise that doesn't change meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh.optimizer.normalizer import normalize_prompt

normalize_prompt("You are a senior architect. **Please** review this design...")
# -&amp;gt; "review this design ..."   (persona prefix, markdown, filler removed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;compares by cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh import SemanticCache

cache = SemanticCache(similarity_threshold=0.70)   # tune per workload
cache.put("Review this microservices design for scaling issues", response)

# Different wording, same intent -&amp;gt; still a hit
hit = cache.get("Analyze this distributed system design")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;normalize, then embed is the whole trick — it's the difference between a cache that almost never hits and one that hits ~85% of the time.&lt;/p&gt;

&lt;p&gt;And because every call already flows through one interceptor, a tamper-evident audit log is almost free — each entry is hash-chained (SHA-256) and signed with Ed25519:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agentmesh import AuditTrail
trail = AuditTrail()
# ... calls happen ...
assert trail.verify()   # walks the chain, checks every prev_hash + signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The benchmark (run it yourself, no API keys)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I didn't want to ship a number you can't check, so the benchmark runs in demo mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python examples/benchmark.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl222ppobr1m0axwnvlib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl222ppobr1m0axwnvlib.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total requests          20
Exact cache hits         2  (10%)
Semantic cache hits     15  (75%)
Total misses             3  (15%)

Cost WITHOUT AgentMesh  $0.0030
Cost WITH AgentMesh     $0.0008
Savings                 $0.0023  (75%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;20 requests, 5 topics, 4 phrasings each. 85% never reached the model; the 3 misses are the cold-start first call per topic — exactly what you'd expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's also a Chrome extension&lt;/strong&gt;&lt;br&gt;
A proxy can't see a prompt typed straight into the ChatGPT or Gemini tab. So there's an extension: declarativeNetRequest reroutes api.anthropic.com / api.openai.com to localhost:8080, and content scripts show a governance overlay before the prompt is sent. Stats persist across service-worker restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's deliberately not built yet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'd rather ship a small, verifiable core than a wide surface of half-features:&lt;/p&gt;

&lt;p&gt;The cache is &lt;strong&gt;in-memory, single-process **— great for a local proxy, not yet a fleet. **Redis is next&lt;/strong&gt;.&lt;br&gt;
No native VS Code panel (env vars + the Chrome extension for now).&lt;br&gt;
No SAML/SSO identity propagation; quotas key on a team header.&lt;/p&gt;

&lt;p&gt;None of these are research problems — they're scope. PRs welcome, especially the Redis backend.&lt;br&gt;
&lt;strong&gt;Try it / contribute&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install agentmesh-proxy sentence-transformers
python examples/benchmark.py     # 85% cache hits, 75% lower cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repo (star it)&lt;/strong&gt;: &lt;a href="https://github.com/anilatambharii/agentmesh" rel="noopener noreferrer"&gt;https://github.com/anilatambharii/agentmesh&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: agentmesh-proxy · &lt;strong&gt;Docker&lt;/strong&gt;: anilsprasad/agentmesh · also on Hugging Face&lt;br&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you run AI tools across a team and your bill is outgrowing your usage, clone it, run the benchmark, and tell me where it breaks. What would you build on top of this?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I took a production RAG pipeline from 61% to 97% accuracy (6 stages, full code)</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 12 Jun 2026 12:30:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-i-took-a-production-rag-pipeline-from-61-to-97-accuracy-6-stages-full-code-37mg</link>
      <guid>https://dev.to/anilatambharii/how-i-took-a-production-rag-pipeline-from-61-to-97-accuracy-6-stages-full-code-37mg</guid>
      <description>&lt;p&gt;Six months in production on a healthcare RAG system. Four rewrites. Here is the exact pipeline, every stage, and the code. The reference implementation is open source and linked at the bottom.&lt;/p&gt;

&lt;p&gt;TL;DR&lt;br&gt;
A weekend tutorial got our retrieval system to 61% accuracy. Six months of production work got it to 97%, under 2 seconds at P99, at $0.08 per query. The gains came from six stages added in order of return, not from a better model. Here is each one with code you can drop into your own pipeline.&lt;br&gt;
If you only have five minutes, here is the whole thing:&lt;/p&gt;

&lt;p&gt;Query rewriting turns vague questions into searchable ones. +11 points. Almost free.&lt;br&gt;
Hybrid retrieval runs dense + BM25 and fuses them. +9 points.&lt;br&gt;
Cross-encoder reranking rescores the top candidates properly. +8 points.&lt;br&gt;
Context compression strips irrelevant sentences before generation. +5 points.&lt;br&gt;
Citation guard blocks any claim that is not grounded in a source.&lt;br&gt;
Answer validation routes multi-hop questions to a human instead of guessing.&lt;/p&gt;

&lt;p&gt;First, measure where you actually fail&lt;br&gt;
Before writing any code, we instrumented the pipeline and traced every wrong answer to its cause. The result changed our entire roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rij20r9ldjjrfibfbh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rij20r9ldjjrfibfbh2.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;64% of failures were retrieval. 23% were chunking. Only 13% were the generator hallucinating from good context. We had spent two months tuning prompts, which was 13% of the problem. Lesson one: measure before you optimize, because your intuition about where RAG breaks is almost always wrong.&lt;/p&gt;

&lt;p&gt;Stage 1: query rewriting&lt;/p&gt;

&lt;p&gt;The user's raw message is rarely a good search query. What did it say about the dosage? has no good match in any index, because the meaning is in the previous turns. A small 8B model rewrites it into a standalone query first.&lt;/p&gt;

&lt;p&gt;REWRITE_SYSTEM = """You rewrite a user's latest message into a single,&lt;br&gt;
standalone search query. Resolve all pronouns and references using the&lt;br&gt;
conversation. Keep it specific. Output only the rewritten query."""&lt;/p&gt;

&lt;p&gt;def rewrite_query(history: list[dict], latest: str, llm) -&amp;gt; str:&lt;br&gt;
    convo = "\n".join(f"{m['role']}: {m['content']}" for m in history[-4:])&lt;br&gt;
    prompt = f"{convo}\nuser: {latest}\n\nStandalone search query:"&lt;br&gt;
    out = llm.complete(&lt;br&gt;
        system=REWRITE_SYSTEM, prompt=prompt,&lt;br&gt;
        model="small-8b", max_tokens=64, temperature=0.0,&lt;br&gt;
    ).strip()&lt;br&gt;
    return out or latest&lt;/p&gt;

&lt;p&gt;Cost: about $0.0001 per query. Gain: +11 points, from 61% to 72%. This is the highest return change in the entire pipeline and the one most people skip.&lt;/p&gt;

&lt;p&gt;Stage 2: hybrid retrieval&lt;/p&gt;

&lt;p&gt;Embedding similarity is great at meaning and weak at exact terms. Two passages can be close in vector space and mean opposite things. Keyword search has the opposite failure mode. So run both and fuse with reciprocal rank fusion, which needs no weight tuning.&lt;/p&gt;

&lt;p&gt;from rank_bm25 import BM25Okapi&lt;/p&gt;

&lt;p&gt;def hybrid_search(query, dense_index, bm25: BM25Okapi, corpus, k=20):&lt;br&gt;
    dense_hits = dense_index.search(query, k=k)            # [(doc_id, score)]&lt;br&gt;
    bm25_scores = bm25.get_scores(query.split())&lt;br&gt;
    bm25_hits = sorted(enumerate(bm25_scores),&lt;br&gt;
                       key=lambda x: x[1], reverse=True)[:k]&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fused, C = {}, 60
for rank, (doc_id, _) in enumerate(dense_hits):
    fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)
for rank, (doc_id, _) in enumerate(bm25_hits):
    fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)

ranked = sorted(fused.items(), key=lambda x: x[1], reverse=True)
return [corpus[doc_id] for doc_id, _ in ranked[:k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Gain: +9 points, from 72% to 81%. Dense and sparse retrieval are not competitors. Use both.&lt;/p&gt;

&lt;p&gt;Stage 3: cross-encoder reranking&lt;/p&gt;

&lt;p&gt;Stages 1 and 2 are fast because they score the query and each document independently. A cross-encoder reads them together, which is slower and much more accurate. So you run it only on the top candidates the cheap stages already found.&lt;/p&gt;

&lt;p&gt;from sentence_transformers import CrossEncoder&lt;/p&gt;

&lt;p&gt;reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")&lt;/p&gt;

&lt;p&gt;def rerank(query, candidates, top_n=5):&lt;br&gt;
    pairs = [(query, c.text) for c in candidates]&lt;br&gt;
    scores = reranker.predict(pairs)&lt;br&gt;
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)&lt;br&gt;
    return [c for c, _ in ranked[:top_n]]&lt;/p&gt;

&lt;p&gt;Gain: +8 points, from 81% to 89%. The classic retrieve-then-rerank pattern, and it earns its cost because you only rerank a handful of candidates.&lt;/p&gt;

&lt;p&gt;Stage 4: context compression&lt;/p&gt;

&lt;p&gt;A retrieved passage can be the right document and still carry sentences that have nothing to do with the question. Each irrelevant sentence is a chance for the model to anchor on the wrong thing. So score sentences against the query and drop the ones that do not earn their place.&lt;/p&gt;

&lt;p&gt;def compress_context(query, passages, relevance_model, threshold=0.5):&lt;br&gt;
    kept = []&lt;br&gt;
    for p in passages:&lt;br&gt;
        sentences = split_sentences(p.text)&lt;br&gt;
        scored = relevance_model.score(query, sentences)   # 0..1 per sentence&lt;br&gt;
        relevant = [s for s, sc in zip(sentences, scored) if sc &amp;gt;= threshold]&lt;br&gt;
        if relevant:&lt;br&gt;
            kept.append(p.with_text(" ".join(relevant)))&lt;br&gt;
    return kept&lt;/p&gt;

&lt;p&gt;Gain: +5 points, from 89% to 94%. Bonus: it cuts your generation token bill, because you stop paying to send the model context it should ignore.&lt;/p&gt;

&lt;p&gt;The pipeline so far&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funixpyylnqy9jr8vu50p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funixpyylnqy9jr8vu50p.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stages 1 through 4 took us from 61% to 94%. The last two stages do not chase points. They make the system honest, which in a regulated domain matters more.&lt;/p&gt;

&lt;p&gt;Stage 5: citation guard&lt;/p&gt;

&lt;p&gt;Before an answer ships, every claim in it has to trace back to a retrieved source. If a sentence has no supporting passage, it does not go out.&lt;/p&gt;

&lt;p&gt;def citation_guard(answer_claims, sources, entailment_model, min_support=0.7):&lt;br&gt;
    for claim in answer_claims:&lt;br&gt;
        support = max(entailment_model.entails(s.text, claim) for s in sources)&lt;br&gt;
        if support &amp;lt; min_support:&lt;br&gt;
            return False, claim    # ungrounded claim, block it&lt;br&gt;
    return True, None&lt;/p&gt;

&lt;p&gt;Stage 6: answer validation&lt;/p&gt;

&lt;p&gt;Some questions need three or more documents synthesized together. That is where RAG quietly fails by writing a fluent, wrong answer. Detect those and route them to a human.&lt;/p&gt;

&lt;p&gt;def validate_answer(query, answer, sources, confidence):&lt;br&gt;
    if confidence &amp;lt; 0.6:&lt;br&gt;
        return route_to_human(query, reason="low confidence")&lt;br&gt;
    if requires_multi_hop(query) and len(sources) &amp;lt; 2:&lt;br&gt;
        return route_to_human(query, reason="insufficient evidence")&lt;br&gt;
    return answer&lt;/p&gt;

&lt;p&gt;Together stages 5 and 6 took the production number from 94% to 97%. The real output is not the three points. It is the 3% the system now refuses to answer automatically. Serving an uncertain answer is not honesty. It is a liability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbd4ia92a8d6cdj5x0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbd4ia92a8d6cdj5x0d.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the climb, stage by stage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ne64utocqp1haoehe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ne64utocqp1haoehe3.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyg9bf4x6rjhlko6pgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyg9bf4x6rjhlko6pgi.png" alt=" " width="797" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to adopt this&lt;br&gt;
You do not need a six-month rebuild. Add stages in order of return and measure after each one, so you know which change earned which points.&lt;/p&gt;

&lt;p&gt;Query rewriting first. A day of work, nearly free to run.&lt;br&gt;
Hybrid retrieval next, because most teams run embeddings only.&lt;br&gt;
Reranking third.&lt;/p&gt;

&lt;p&gt;Compression fourth.&lt;/p&gt;

&lt;p&gt;Build the guards last, once accuracy is where you want it.&lt;/p&gt;

&lt;p&gt;Run it yourself&lt;/p&gt;

&lt;p&gt;The full reference implementation is open source, including every stage above, the benchmark harness that produced these numbers, and a 250-case adversarial test suite that caught the failures we did not anticipate. Clone it and run it today.&lt;/p&gt;

&lt;p&gt;github.com/anilatambharii&lt;/p&gt;

&lt;p&gt;I write up the production AI work in more depth, with the narrative and the failures, on my newsletter first. If the deep version is useful to you, that is where it lives: anilsprasad.substack.com&lt;/p&gt;

&lt;p&gt;If you are running RAG in production, I would like to know one thing in the comments: what does your error breakdown look like? Retrieval, chunking, or generation? I read all of them.&lt;/p&gt;

&lt;h1&gt;
  
  
  HumanWritten #ExpertiseFromField
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>anilprasad</category>
    </item>
    <item>
      <title>How We Cut AI Infrastructure Costs by 94% Without Sacrificing Quality (And How You Can Too)</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Mon, 01 Jun 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-we-cut-ai-infrastructure-costs-by-94-without-sacrificing-quality-and-how-you-can-too-5fim</link>
      <guid>https://dev.to/anilatambharii/how-we-cut-ai-infrastructure-costs-by-94-without-sacrificing-quality-and-how-you-can-too-5fim</guid>
      <description>&lt;p&gt;A production engineer's guide to building efficient AI systems at scale - complete with code, architecture, and real metrics&lt;/p&gt;

&lt;h2&gt;
  
  
  series: Production AI Infrastructure
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📧 Originally published on &lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;my Substack newsletter&lt;/a&gt;&lt;/strong&gt; where I share weekly deep-dives on production AI infrastructure. Subscribe for early access to future articles!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Three months ago, our AI infrastructure bill was &lt;strong&gt;$47,000 per month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Last month? &lt;strong&gt;$2,800&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Same quality. Same performance. Same user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;94% cost reduction. $530,000 saved annually.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a case study about "theoretical optimization." This is a field guide from production systems processing &lt;strong&gt;2.3 million events per second&lt;/strong&gt;, serving millions of users, and running 24/7 without downtime.&lt;/p&gt;

&lt;p&gt;The efficiency revolution in AI is here. Small models are closing the gap with frontier models faster than anyone predicted. &lt;strong&gt;The race to bigger is over. The race to efficient just started.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's everything we learned building production AI infrastructure at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;PART 1: The Cost Crisis Nobody Talks About&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI infrastructure costs are spiraling out of control, and most companies don't realize it until it's too late.&lt;/p&gt;

&lt;p&gt;The pattern is predictable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1-3&lt;/strong&gt;: Prototype with GPT-4 or Claude. Costs are manageable ($500-2,000/month). Everyone's happy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4-6&lt;/strong&gt;: Scale to production. Usage increases 10x. Costs jump to $15K-30K/month. Finance starts asking questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-9&lt;/strong&gt;: Growth continues. Costs hit $40K-60K/month. Emergency meetings. "Can we optimize this?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 10+&lt;/strong&gt;: Either massive optimization effort or AI features get cut. The dream dies or the budget explodes.&lt;/p&gt;

&lt;p&gt;We've seen this pattern across dozens of companies. The problem isn't the technology—it's the architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why AI Costs Spiral&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Three core issues:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. The "Bigger Model = Better" Myth&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The default assumption: Use the biggest, most capable model for everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 for summarization? Sure.&lt;/li&gt;
&lt;li&gt;Claude 3.5 for classification? Why not.&lt;/li&gt;
&lt;li&gt;Llama 2 70B for simple Q&amp;amp;A? Absolutely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's the reality: &lt;strong&gt;Most AI workloads don't need frontier model capability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Industry analysis shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;10% of AI workloads&lt;/strong&gt; require maximum capability (complex reasoning, multi-step analysis)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30-40%&lt;/strong&gt; can run on medium models (7B-70B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50-60%&lt;/strong&gt; can run on small models (3B-8B parameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Yet 80% of companies use frontier models for 80% of workloads.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's like using a Lamborghini for your daily commute. Expensive. Unnecessary. Wasteful.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Zero Caching Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Every request hits the model. Even identical requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What's the weather today?" → Model inference → $0.002
"What's the weather today?" (5 minutes later) → Model inference → $0.002
"What's the weather today?" (user refresh) → Model inference → $0.002
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same question. Same answer. Triple the cost.&lt;/p&gt;

&lt;p&gt;With caching: $0.002 for the first request, $0.0001 for subsequent requests (100x cheaper).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without caching, you're burning 70-90% of your budget on duplicate work.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. No Routing Logic&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Every request goes to the same model, regardless of complexity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple query: "What time is it?" → 70B model inference&lt;/li&gt;
&lt;li&gt;Complex query: "Analyze quarterly revenue by region and predict Q3 trends" → 70B model inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simple query could run on a 3B model at 1/20th the cost and 10x faster.&lt;/p&gt;

&lt;p&gt;But without routing logic, both queries cost the same. &lt;strong&gt;You're overpaying for 60-80% of requests.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Real Production Cost Breakdown&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's what a typical $47,000/month LLM infrastructure actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model Inference:        $32,000 (68%)
Infrastructure:         $8,000 (17%)
Data Processing:        $4,000 (8%)
Monitoring/Logging:     $2,000 (4%)
Networking:             $1,000 (2%)
---
Total:                  $47,000/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The opportunity: 90%+ of model inference costs are optimizable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not through vague "best practices." Through specific, proven architectural changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4traruafa6ymam6ibwgt.png" alt=" " width="800" height="400"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;PART 2: The 4-Layer Optimization Stack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We rebuilt our AI infrastructure from the ground up with one principle: &lt;strong&gt;Make efficiency the default, not an afterthought.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The result: A 4-layer optimization stack that reduced costs by 94% while maintaining—and in some cases improving—quality and performance.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiopzja9uocvvaxbwvddg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiopzja9uocvvaxbwvddg.png" alt=" " width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 1: Semantic Caching (70% Cost Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Problem&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Users ask the same questions different ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I reset my password?"&lt;/li&gt;
&lt;li&gt;"I forgot my password, help"&lt;/li&gt;
&lt;li&gt;"Password reset instructions"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three queries. Same intent. Same answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without semantic caching: 3x model calls&lt;/li&gt;
&lt;li&gt;With semantic caching: 1x model call, 2x cache hits&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How Semantic Caching Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of exact-match caching (traditional Redis), we cache by &lt;em&gt;semantic similarity&lt;/em&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embed the query&lt;/strong&gt; using a small embedding model (all-MiniLM-L6-v2, 22M parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search vector DB&lt;/strong&gt; for similar queries (cosine similarity &amp;gt;0.95)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return cached response&lt;/strong&gt; if match found&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate + cache&lt;/strong&gt; if no match&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Stack&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt;: all-MiniLM-L6-v2 (inference: &amp;lt;10ms, cost: negligible)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector DB&lt;/strong&gt;: Qdrant (self-hosted) or Pinecone (managed) or FAISS (self-hosted)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity threshold&lt;/strong&gt;: 0.95 (adjustable based on use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Results in Production&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache hit rate: 99.2%
Average cache latency: 8ms
Average cache miss latency: 340ms
Cost per cache hit: $0.00001
Cost per cache miss: $0.002

Monthly queries: 45M
Cache hits: 44.6M (99.2%)
Cache misses: 360K (0.8%)

Semantic cache cost: $446
Without cache cost: $90,000

Savings: $89,554/month (99.5% reduction on this layer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation (High-Level)&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Semantic cache check
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;similar_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similar_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similar_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Fast cache hit
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Expensive generation
&lt;/span&gt;    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Semantic caching works because users are less creative than we think. In production, 99%+ of queries are variations of questions we've already answered.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 2: Redis Caching (Additional 15% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Semantic caching handles 99% of hits. Redis caching handles the remaining 1% of frequently repeated &lt;em&gt;exact&lt;/em&gt; queries.&lt;/p&gt;

&lt;p&gt;Why both?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic cache&lt;/strong&gt;: Slower (8-15ms), handles similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis cache&lt;/strong&gt;: Faster (1-3ms), handles exact matches&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Check Redis first (exact match, 1-3ms)&lt;/li&gt;
&lt;li&gt;If miss → Check semantic cache (similarity match, 8-15ms)&lt;/li&gt;
&lt;li&gt;If miss → Generate response (model inference, 200-400ms)&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Redis hit rate on semantic misses: 95%
Average latency: 2ms
Cost per hit: $0.00001

Additional savings: $6,800/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Combined Layer 1 + 2 Performance&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total cache hit rate: 99.7%
Average response time: 12ms (cached) vs 340ms (uncached)
Total caching cost: $7,246/month
Without caching cost: $90,000/month

Savings so far: $82,754/month (92% reduction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 3: Model Routing (Additional 12% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not all queries are created equal.&lt;/p&gt;

&lt;p&gt;"What's 2+2?" shouldn't cost the same as "Analyze these 10,000 financial transactions and flag anomalies."&lt;/p&gt;

&lt;p&gt;But without routing logic, they do.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Solution: Complexity-based routing&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify query complexity&lt;/strong&gt; (using a small 1B classifier model, &amp;lt;5ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route to appropriate model&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Simple → 8B model (fast, cheap)&lt;/li&gt;
&lt;li&gt;Medium → 70B model (balanced)&lt;/li&gt;
&lt;li&gt;Complex → 405B model (maximum capability)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Complexity Classification&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Fast classifier model (1B parameters, &amp;lt;5ms inference)
&lt;/span&gt;    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;detect_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# factual, analytical, creative
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context_required&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;needs_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multi_step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;is_multi_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 8B model
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 70B model
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Route to 405B model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Production Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query distribution:
- Simple (8B): 62% of queries
- Medium (70B): 28% of queries
- Complex (405B): 10% of queries

Cost comparison:
- 8B model: $0.0001/query
- 70B model: $0.001/query
- 405B model: $0.01/query

Average cost per query (with routing): $0.0008
Average cost per query (70B for all): $0.001

Savings: 20% reduction in model costs
Monthly impact: $5,600 saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Quality Impact: Zero degradation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We A/B tested 10,000 queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8B model accuracy on simple queries: &lt;strong&gt;97.2%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;70B model accuracy on same queries: &lt;strong&gt;97.4%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;User-perceived difference: &lt;strong&gt;0%&lt;/strong&gt; (statistically insignificant)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Users can't tell the difference between 8B and 70B on simple queries. Don't overpay for capability you don't need.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Layer 4: Efficient Models (Additional 15% Reduction)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final layer: Replace expensive models with efficient alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shift&lt;/strong&gt;: Llama 2 70B → Llama 3.1 8B&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why This Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Llama 3.1 8B (released 2024) matches Llama 2 70B (2023) performance on most tasks.&lt;/p&gt;

&lt;p&gt;But it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1/9th the parameters&lt;/li&gt;
&lt;li&gt;15x faster inference&lt;/li&gt;
&lt;li&gt;15x cheaper at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F754lct3a0y3zvmse0ibz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F754lct3a0y3zvmse0ibz.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Benchmark Comparison (Production Data)&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Llama 2 70B:
- Parameters: 70B
- Inference latency (P99): 340ms
- Cost per 1M tokens: $0.65
- Accuracy (MMLU): 69.7%

Llama 3.1 8B:
- Parameters: 8B
- Inference latency (P99): 120ms
- Cost per 1M tokens: $0.04
- Accuracy (MMLU): 69.4%

Quality difference: 0.3% (negligible)
Speed improvement: 2.8x faster
Cost improvement: 16x cheaper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Migration Strategy&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We didn't switch overnight. We tested:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1-2&lt;/strong&gt;: Shadow mode (8B runs alongside 70B, results logged but not served)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3-4&lt;/strong&gt;: A/B test (50% traffic to 8B, 50% to 70B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 5-6&lt;/strong&gt;: 90% to 8B, 10% to 70B (monitor quality)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 7+&lt;/strong&gt;: 100% to 8B, 70B for exceptions only&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Quality degradation: 0.2% (within acceptable range)
User complaints: 0 (nobody noticed)
Speed improvement: 2.8x (users noticed this positively)
Cost reduction: 94% (from all 4 layers combined)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;The Complete Stack in Production&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request Flow:
1. Check Redis (exact match) → 95% hit rate, 2ms
2. If miss → Check semantic cache → 99% hit rate, 12ms
3. If miss → Classify complexity → 5ms
4. Route to model:
   - 62% → Llama 3.1 8B
   - 28% → Llama 3.1 70B
   - 10% → Llama 3.3 405B
5. Cache response
6. Return to user

Total average latency: 15ms (cached) vs 125ms (uncached)
Total cost per query: $0.00008 (vs $0.001 before optimization)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;PART 3: Complete Implementation Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture Overview&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our production stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → API Gateway → Request Router
                              ↓
                  [Redis Cache Layer]
                              ↓
              [Semantic Cache (Vector DB)]
                              ↓
                  [Complexity Classifier]
                              ↓
            ┌─────────┬─────────┬─────────┐
            ↓         ↓         ↓         ↓
         8B Model  70B Model  405B Model  (Fallback)
            ↓         ↓         ↓         ↓
                  Response Aggregator
                              ↓
                      User Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Technology Stack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Caching Layer&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis: Elasticache (AWS) or Redis Cloud&lt;/li&gt;
&lt;li&gt;Vector DB: Qdrant (self-hosted) or Pinecone (managed)&lt;/li&gt;
&lt;li&gt;Embedding model: all-MiniLM-L6-v2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model Serving&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference: vLLM (optimized serving)&lt;/li&gt;
&lt;li&gt;Infrastructure: NVIDIA A10G GPUs (cost-efficient)&lt;/li&gt;
&lt;li&gt;Orchestration: Kubernetes + KServe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Pipeline&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event streaming: Apache Kafka&lt;/li&gt;
&lt;li&gt;Processing: Apache Flink&lt;/li&gt;
&lt;li&gt;Metrics: Prometheus + Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APM: Datadog or New Relic&lt;/li&gt;
&lt;li&gt;Logging: CloudWatch or Elasticsearch&lt;/li&gt;
&lt;li&gt;Alerting: PagerDuty&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deployment Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Infrastructure (Week 1-2)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up Redis cluster (Elasticache or self-hosted)&lt;/li&gt;
&lt;li&gt;Deploy vector database (Qdrant recommended for self-hosting)&lt;/li&gt;
&lt;li&gt;Configure embedding model endpoint&lt;/li&gt;
&lt;li&gt;Set up model serving infrastructure (vLLM + GPU instances)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Caching Implementation (Week 3-4)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implement Redis caching layer&lt;/li&gt;
&lt;li&gt;Deploy semantic caching with vector DB&lt;/li&gt;
&lt;li&gt;Test cache hit rates and latency&lt;/li&gt;
&lt;li&gt;Optimize similarity thresholds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Routing Logic (Week 5-6)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train complexity classifier (or use rule-based initially)&lt;/li&gt;
&lt;li&gt;Implement routing logic&lt;/li&gt;
&lt;li&gt;Deploy multiple model endpoints (8B, 70B, 405B)&lt;/li&gt;
&lt;li&gt;A/B test routing accuracy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Migration (Week 7-8)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shadow mode testing (new stack runs alongside old)&lt;/li&gt;
&lt;li&gt;Gradual traffic migration (10% → 50% → 90% → 100%)&lt;/li&gt;
&lt;li&gt;Monitor quality and cost metrics&lt;/li&gt;
&lt;li&gt;Rollback capability ready at all times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Phase 5: Optimization (Ongoing)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fine-tune cache similarity thresholds&lt;/li&gt;
&lt;li&gt;Optimize model routing logic&lt;/li&gt;
&lt;li&gt;Monitor and reduce cache misses&lt;/li&gt;
&lt;li&gt;Continuous cost tracking and optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code Examples&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Semantic Cache Implementation&lt;/strong&gt; (Python):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize components
&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6333&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_with_semantic_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Check Redis (exact match)
&lt;/span&gt;    &lt;span class="n"&gt;redis_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;redis_hit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Generate embedding
&lt;/span&gt;    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Search vector DB for similar queries
&lt;/span&gt;    &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;similarity_threshold&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Return cached if similar query found
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;search_result&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cached_query_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;search_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
        &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached_query_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;semantic_hit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 5: Generate new response (cache miss)
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_llm_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 6: Cache response
&lt;/span&gt;    &lt;span class="n"&gt;query_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cache_miss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model Routing Implementation&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Classify complexity
&lt;/span&gt;    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Route based on complexity
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;8b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;70b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;405b_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;

    &lt;span class="c1"&gt;# Call appropriate model
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Rule-based classification (can be replaced with ML model)
&lt;/span&gt;    &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple heuristics
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;requires_reasoning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_multi_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring and Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Track&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cache Performance&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis hit rate (target: &amp;gt;95%)&lt;/li&gt;
&lt;li&gt;Semantic hit rate (target: &amp;gt;99%)&lt;/li&gt;
&lt;li&gt;Average cache latency (target: &amp;lt;15ms)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Performance&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50, P95, P99 latency by model&lt;/li&gt;
&lt;li&gt;Throughput (queries/second)&lt;/li&gt;
&lt;li&gt;Error rate (&amp;lt;0.1% target)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per query (overall and by model)&lt;/li&gt;
&lt;li&gt;Daily/monthly spend tracking&lt;/li&gt;
&lt;li&gt;Cost attribution by endpoint/user&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quality Metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response accuracy (A/B testing)&lt;/li&gt;
&lt;li&gt;User satisfaction (thumbs up/down)&lt;/li&gt;
&lt;li&gt;Escalation rate (queries requiring human review)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Dashboard Setup&lt;/strong&gt; (Grafana):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Panel 1: Cache Hit Rates (last 24h)
- Redis: 95.2%
- Semantic: 99.1%
- Overall: 99.7%

Panel 2: Cost Trends (last 30 days)
- Total spend: $2,800
- Trend: -94% vs Month 1
- Projection: $2,850 next month

Panel 3: Model Distribution
- 8B: 62% of queries
- 70B: 28% of queries
- 405B: 10% of queries

Panel 4: Latency P99
- Cached: 12ms
- Uncached: 125ms
- Overall: 18ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;PART 4: Results &amp;amp; ROI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Month-by-Month Cost Reduction&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Month 1 (Baseline):
- Infrastructure cost: $47,000
- Queries served: 42M
- Cost per query: $0.00112

Month 2 (Redis caching deployed):
- Infrastructure cost: $38,000
- Queries served: 44M
- Cost per query: $0.00086
- Reduction: 19%

Month 3 (Semantic caching deployed):
- Infrastructure cost: $12,000
- Queries served: 45M
- Cost per query: $0.00027
- Reduction: 74% (from baseline)

Month 4 (Model routing deployed):
- Infrastructure cost: $6,500
- Queries served: 46M
- Cost per query: $0.00014
- Reduction: 86% (from baseline)

Month 5 (Efficient models deployed):
- Infrastructure cost: $2,800
- Queries served: 47M
- Cost per query: $0.00006
- Reduction: 94% (from baseline)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Performance Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before Optimization&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P50 latency: 280ms
P95 latency: 420ms
P99 latency: 650ms
Throughput: 1,200 queries/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After Optimization&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P50 latency: 8ms (97% faster)
P95 latency: 15ms (96% faster)
P99 latency: 125ms (81% faster)
Throughput: 8,500 queries/sec (7x improvement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;User Experience Impact&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page load times: -60% (faster responses)&lt;/li&gt;
&lt;li&gt;User complaints: 0 (nobody noticed quality change)&lt;/li&gt;
&lt;li&gt;User satisfaction: +12% (noticed speed improvement)&lt;/li&gt;
&lt;li&gt;Feature usage: +28% (faster = more engagement)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ROI Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Investment&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Engineering time: 8 weeks × 2 engineers = 16 engineer-weeks
Infrastructure setup: $5,000 (one-time)
Testing and monitoring tools: $2,000 (one-time)

Total investment: ~$80,000-$100,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Savings&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly savings: $44,200 ($47K - $2.8K)
Annual savings: $530,400
3-year savings: $1,591,200

ROI (Year 1): 530% ($530K saved / $100K invested)
Payback period: 2.3 months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Lessons Learned&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Worked&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;✅ &lt;strong&gt;Gradual migration&lt;/strong&gt; - Shadow mode → A/B test → full rollout prevented disasters&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Monitoring first&lt;/strong&gt; - Set up dashboards before making changes, not after&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Conservative thresholds&lt;/strong&gt; - Started with 0.98 similarity, lowered to 0.95 after confidence built&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Rollback plan&lt;/strong&gt; - Having old infrastructure ready for instant rollback was crucial&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Quality gates&lt;/strong&gt; - Automated quality checks caught issues before users did&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work Initially&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;❌ &lt;strong&gt;Too aggressive cache invalidation&lt;/strong&gt; - First attempt: invalidate after 1 hour. Too frequent. Changed to 24 hours.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Wrong similarity threshold&lt;/strong&gt; - Started at 0.90, got too many false positives. Raised to 0.95.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Inadequate monitoring&lt;/strong&gt; - Missed cache memory issues initially. Added memory alerts.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;No cost attribution&lt;/strong&gt; - Couldn't tell which endpoints were expensive. Added detailed tracking.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Common Pitfalls to Avoid&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #1: Caching everything&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't cache time-sensitive queries (stock prices, weather)&lt;/li&gt;
&lt;li&gt;Don't cache user-specific data without proper key isolation&lt;/li&gt;
&lt;li&gt;Don't cache low-frequency queries (waste of memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #2: Wrong model routing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't route based on query length alone (misleading)&lt;/li&gt;
&lt;li&gt;Don't use overly complex routing logic (adds latency)&lt;/li&gt;
&lt;li&gt;Don't forget to measure routing accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #3: Premature optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't optimize before measuring (know your bottlenecks)&lt;/li&gt;
&lt;li&gt;Don't sacrifice quality for cost (users &amp;gt; dollars)&lt;/li&gt;
&lt;li&gt;Don't optimize in isolation (system-level thinking required)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #4: Ignoring monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't deploy without observability (you're flying blind)&lt;/li&gt;
&lt;li&gt;Don't skip A/B testing (assumptions fail in production)&lt;/li&gt;
&lt;li&gt;Don't ignore long-tail latency (P99 matters more than average)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;PART 5: What's Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The AI efficiency revolution is just beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2026-2028 Predictions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026&lt;/strong&gt; (now):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8B models match 70B performance ✅ (happening)&lt;/li&gt;
&lt;li&gt;Semantic caching becomes standard practice&lt;/li&gt;
&lt;li&gt;Model routing adopted by 30% of AI-first companies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2027&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3B models match today's 70B performance&lt;/li&gt;
&lt;li&gt;On-device AI becomes viable for 50%+ of use cases&lt;/li&gt;
&lt;li&gt;Edge deployment standard for latency-critical apps&lt;/li&gt;
&lt;li&gt;First $1B+ open source AI infrastructure company&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2028&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer devices run GPT-4-equivalent models natively&lt;/li&gt;
&lt;li&gt;Cloud inference costs drop 95% from 2024 levels&lt;/li&gt;
&lt;li&gt;AI infrastructure consolidates around 3-5 major platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Emerging Technologies to Watch&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mixture of Experts (MoE)&lt;/strong&gt; - Activate only subset of parameters per query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative Decoding&lt;/strong&gt; - Generate faster with small model + large model verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantized Models&lt;/strong&gt; - 4-bit and even 2-bit inference without quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Space Models&lt;/strong&gt; - Alternative to transformers, potentially more efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neuromorphic Computing&lt;/strong&gt; - Hardware optimized for neural networks&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to Stay Ahead&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For Technical Leaders&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start measuring cost per query today&lt;/li&gt;
&lt;li&gt;Implement caching this quarter&lt;/li&gt;
&lt;li&gt;Experiment with model routing next quarter&lt;/li&gt;
&lt;li&gt;Migrate to efficient models within 6 months&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For Organizations&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat AI infrastructure as platform investment, not project&lt;/li&gt;
&lt;li&gt;Hire engineers who've built AI at scale (not just trained models)&lt;/li&gt;
&lt;li&gt;Open source your learnings (builds credibility, attracts talent)&lt;/li&gt;
&lt;li&gt;Focus on efficiency from day one (retrofitting is 10x harder)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For the Industry&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Standardize on efficiency benchmarks (cost per query, not just accuracy)&lt;/li&gt;
&lt;li&gt;Share production learnings openly (we all benefit)&lt;/li&gt;
&lt;li&gt;Pressure model providers for more efficient options&lt;/li&gt;
&lt;li&gt;Invest in infrastructure, not just models&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cutting AI costs by 94% wasn't magic. It was architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 4-layer stack&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Semantic caching (70% reduction)&lt;/li&gt;
&lt;li&gt;Redis caching (15% additional)&lt;/li&gt;
&lt;li&gt;Model routing (12% additional)&lt;/li&gt;
&lt;li&gt;Efficient models (15% additional)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The results&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47,000 → $2,800/month&lt;/li&gt;
&lt;li&gt;340ms → 125ms latency&lt;/li&gt;
&lt;li&gt;0% quality degradation&lt;/li&gt;
&lt;li&gt;530% ROI in year 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt;: AI infrastructure optimization isn't about compromising quality. It's about building intelligently from the start.&lt;/p&gt;

&lt;p&gt;The companies that master AI efficiency will win the next decade. The companies that don't will burn cash until they can't compete.&lt;/p&gt;

&lt;p&gt;Which side do you want to be on?&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Enjoyed this deep-dive?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you found this article valuable, here's how to stay connected and go deeper:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📧 Subscribe to my Substack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Get weekly deep-dives on production AI infrastructure, case studies, and implementation guides delivered to your inbox.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;Subscribe here&lt;/a&gt;&lt;/strong&gt; (Early access to all articles!)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;💻 Explore the Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All the optimization techniques discussed here are open source:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/anilatambharii" rel="noopener noreferrer"&gt;github.com/anilatambharii&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Cost Optimization frameworks&lt;/li&gt;
&lt;li&gt;Production RAG implementations&lt;/li&gt;
&lt;li&gt;AI Safety testing frameworks&lt;/li&gt;
&lt;li&gt;Distributed training utilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;💼 Connect &amp;amp; Follow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn&lt;/strong&gt;: Daily AI infrastructure insights → &lt;a href="https://linkedin.com/in/anilsprasad" rel="noopener noreferrer"&gt;linkedin.com/in/anilsprasad&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X/Twitter&lt;/strong&gt;: Real-time production AI observations → &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambharii Labs&lt;/strong&gt;: We build production AI infrastructure → &lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ambharii.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🏢 Need Help Implementing This?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your team is struggling with AI infrastructure costs or wants to build efficient systems from day one:&lt;/p&gt;

&lt;p&gt;📨 &lt;strong&gt;Email&lt;/strong&gt;: &lt;a href="mailto:contact@ambharii.com"&gt;contact@ambharii.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🌐 &lt;strong&gt;Website&lt;/strong&gt;: ambharii.com&lt;/p&gt;

&lt;p&gt;We offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture review &amp;amp; optimization consulting&lt;/li&gt;
&lt;li&gt;Build services for production AI infrastructure&lt;/li&gt;
&lt;li&gt;Training for engineering teams&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;About the Author&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anil Prasad&lt;/strong&gt; is Head of Engineering at Ambharii Labs, where he builds production AI infrastructure processing 2.3M events/second. Named one of "100 Most Influential AI Leaders in USA 2024." &lt;/p&gt;

&lt;p&gt;Previously led engineering teams at Fortune 500 companies recovering $47M in revenue through real-time data systems. Passionate about making production AI infrastructure accessible through open source and knowledge sharing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Products&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ARIA RCM&lt;/strong&gt;: AI-native revenue cycle management for healthcare&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenomiziQ&lt;/strong&gt;: Precision medicine platform (WEF candidate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic AI Platform&lt;/strong&gt;: Multi-agent orchestration infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #ai #machinelearning #production #llm #optimization #costoptimization #infrastructure #devops #engineering #opensource&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;💖 If this article helped you, please heart it and share it with your team!&lt;br&gt;&lt;br&gt;
🔖 Bookmark for future reference&lt;br&gt;&lt;br&gt;
💬 Drop a comment if you have questions or want to share your own optimization wins!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Published&lt;/strong&gt;: June 10, 2026&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Reading Time&lt;/strong&gt;: 16-18 minutes&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Originally published on&lt;/strong&gt;: &lt;a href="https://anilsprasad.substack.com" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Building Production-Ready Open Source AI Infrastructure: A Technical Guide</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Tue, 19 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/building-production-ready-open-source-ai-infrastructure-a-technical-guide-14cl</link>
      <guid>https://dev.to/anilatambharii/building-production-ready-open-source-ai-infrastructure-a-technical-guide-14cl</guid>
      <description>&lt;h1&gt;
  
  
  Building Production-Ready Open Source AI Infrastructure: A Technical Guide
&lt;/h1&gt;

&lt;p&gt;Over the past year, we've built and open sourced six production-grade AI infrastructure projects. This isn't toy code or proof of concepts. These are systems handling millions of requests daily in production environments.&lt;/p&gt;

&lt;p&gt;Here's what we learned building open source AI infrastructure that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Six Projects
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;llm-cost-optimization&lt;/strong&gt;: 3-layer caching plus intelligent routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ai-safety-framework&lt;/strong&gt;: 5-layer defense with 250 red team test cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;production-rag&lt;/strong&gt;: 6-stage pipeline with re-ranking and evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;distributed-training&lt;/strong&gt;: PyTorch DDP with NCCL tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;roi-first-ai&lt;/strong&gt;: Business metric selection and deployment templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agentic-ai&lt;/strong&gt;: Multi-agent orchestration framework&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All repositories are at &lt;code&gt;github.com/anilatambharii&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Open Source Our Production Code
&lt;/h2&gt;

&lt;p&gt;Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, the AI infrastructure landscape is fragmented. Every team rebuilds the same patterns from scratch. LLM caching. RAG pipelines. Cost optimization. Agent orchestration. We've already solved these problems. Sharing the solutions helps the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, open source code is battle tested. When thousands of developers review, use, and contribute to your code, it gets better fast. Private code stays brittle. Public code gets hardened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, hiring advantage. The best engineers want to work on code that matters. Open source contributions demonstrate technical credibility better than any interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Principle: Composition Over Configuration
&lt;/h2&gt;

&lt;p&gt;Each project is a focused library, not a framework. You compose them together rather than configuring one monolithic system.&lt;/p&gt;

&lt;p&gt;Bad approach: One repo with 47 configuration options trying to do everything.&lt;/p&gt;

&lt;p&gt;Good approach: Six repos, each solving one problem well. Use what you need. Ignore what you don't.&lt;/p&gt;

&lt;p&gt;Example using &lt;code&gt;llm-cost-optimization&lt;/code&gt; and &lt;code&gt;production-rag&lt;/code&gt; together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_cost_optimization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachingLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModelRouter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HybridRetriever&lt;/span&gt;

&lt;span class="c1"&gt;# Set up caching for LLM calls
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CachingLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;semantic_cache_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set up model routing based on query complexity
&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;complexity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set up RAG pipeline with hybrid retrieval
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keyword_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RAGPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_router&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use them together
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What were Q2 financial results?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each component is independent. Each can be used standalone. Together they form a complete system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Deep Dive: LLM Cost Optimization
&lt;/h2&gt;

&lt;p&gt;This project reduced our LLM costs from $47K monthly to $2.8K monthly. 94% cost reduction. Same quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Layer Caching
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Exact match cache&lt;/strong&gt; catches identical queries. Redis key is SHA256 hash of prompt. Cache hit returns response instantly. No LLM call. Zero cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExactMatchCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit rate: 23% of queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt; catches similar queries. Embed the prompt. Find nearest neighbors in vector DB. If similarity &amp;gt; threshold (0.95), return cached response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;cached_response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit rate: 31% of queries not caught by exact match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefix cache&lt;/strong&gt; reuses computation for prompts with common prefixes. System prompt is usually identical. Few-shot examples are usually identical. Only the user query changes.&lt;/p&gt;

&lt;p&gt;Anthropic's prompt caching API handles this automatically. Mark static parts as cacheable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined hit rate: 73% of queries serve from cache. 27% hit the LLM. Cost reduced 73% from caching alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Model Routing
&lt;/h3&gt;

&lt;p&gt;Not every query needs GPT-4 or Claude Opus. Simple queries work fine on Haiku. Complex queries need Sonnet.&lt;/p&gt;

&lt;p&gt;Routing strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $0.25 per 1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3 per 1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $15 per 1M tokens
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Features: length, question marks, technical terms, etc.
&lt;/span&gt;        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trained a simple classifier on 10K labeled examples. "What's the capital of France?" → Haiku. "Analyze this 50 page contract for liability clauses" → Opus.&lt;/p&gt;

&lt;p&gt;Result: 89% of queries route to Haiku. 9% to Sonnet. 2% to Opus. Average cost per query drops 88%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fews0p24mflyd2k60kg2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fews0p24mflyd2k60kg2t.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Notes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cache invalidation&lt;/strong&gt; is the hard part. We invalidate based on TTL (1 hour default) and explicit updates. When source data changes, we flush related cache entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; tracks hit rates, latency, cost per query. Dashboard shows cache performance in real time. Alerts fire when hit rate drops below threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradual rollout&lt;/strong&gt; started with 1% of traffic. Measured cache hit rate and accuracy. Ramped to 10%, 50%, 100% over 3 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Deep Dive: Production RAG
&lt;/h2&gt;

&lt;p&gt;We increased RAG accuracy from 52% to 89% by fixing retrieval, not the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6-Stage Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Query Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't send raw user queries to vector DB. Expand with synonyms. Extract metadata. Generate context-aware embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QueryProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract metadata
&lt;/span&gt;        &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_department&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_doc_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Expand with synonyms
&lt;/span&gt;        &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate embedding
&lt;/span&gt;        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: Vector Database Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cosine similarity threshold 0.85. Top-k 50 candidates (not 5, not 10). Use Pinecone with metadata filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 3: Hybrid Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine semantic search (70%) with keyword search (30%) using BM25.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HybridRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Vector search
&lt;/span&gt;        &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Keyword search
&lt;/span&gt;        &lt;span class="n"&gt;keyword_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Combine with weights
&lt;/span&gt;        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;keyword_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vector_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;keyword_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 4: Re-ranking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This single stage improved accuracy by 23%. Use cross-encoder to score each candidate against the actual query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Reranker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-MiniLM-L-12-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Score each doc against query
&lt;/span&gt;        &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Sort by score
&lt;/span&gt;        &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Top 50 candidates from hybrid search → Re-rank → Best 5 to LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 5: Context Assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Smart chunking with overlap. 512 token chunks with 50 token overlap. Include surrounding context. Add metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assemble_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Source &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Department: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

---
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 6: LLM Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Force grounded responses. System prompt enforces citation. User query includes assembled context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Use ONLY the provided context to answer questions. 

If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain enough information, say &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information to answer that question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

Always cite your sources using the Source number.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assembled_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;Before: 52% answer accuracy. 3.8s latency. 31% hallucination rate.&lt;/p&gt;

&lt;p&gt;After: 89% accuracy (+71%). 1.2s latency (faster!). 4% hallucination rate (-87%).&lt;/p&gt;

&lt;p&gt;The insight: Don't optimize the LLM. Optimize the retrieval. GPT-4 with bad context = bad answers. Haiku with perfect context = great answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3943lz1ypcgvehf5fjax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3943lz1ypcgvehf5fjax.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Projects Production Ready
&lt;/h2&gt;

&lt;p&gt;Every project includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive tests&lt;/strong&gt;: Unit tests for every function. Integration tests for pipelines. End-to-end tests for workflows. 90%+ coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: README with quick start. Detailed API docs. Architecture diagrams. Example notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks&lt;/strong&gt;: Performance metrics. Accuracy measurements. Cost comparisons. Real numbers, not claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus metrics. Logging. Error tracking. Observability built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: Docker containers. Kubernetes manifests. Terraform modules. Production ready deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to Open Source AI
&lt;/h2&gt;

&lt;p&gt;Our projects welcome contributions. Here's how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a project that interests you&lt;/li&gt;
&lt;li&gt;Read the CONTRIBUTING.md&lt;/li&gt;
&lt;li&gt;Check the issues for "good first issue" labels&lt;/li&gt;
&lt;li&gt;Submit a PR with tests and documentation&lt;/li&gt;
&lt;li&gt;Respond to review feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We review all PRs within 48 hours. Quality bar is high but we help contributors meet it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89brvmwv8kb18ggp8kvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89brvmwv8kb18ggp8kvf.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Open source AI infrastructure should be production ready, not proof of concept. These six projects represent thousands of hours of real world testing and optimization.&lt;/p&gt;

&lt;p&gt;Use them. Contribute to them. Build on them.&lt;/p&gt;

&lt;p&gt;The code is at &lt;code&gt;github.com/anilatambharii&lt;/code&gt;. Documentation is comprehensive. Examples are plentiful. Issues are welcome.&lt;/p&gt;

&lt;p&gt;Let's build better AI infrastructure together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anil Prasad is Head of Engineering at Ambharii Labs, recognized as one of "100 Most Influential AI Leaders in USA 2024." He builds production-scale AI and data systems for enterprise organizations. Connect on LinkedIn at linkedin.com/in/anilsprasad or visit ambharii.com.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/llm-cost-optimization" rel="noopener noreferrer"&gt;LLM Cost Optimization Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/production-rag" rel="noopener noreferrer"&gt;Production RAG Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anilatambharii/ai-safety-framework" rel="noopener noreferrer"&gt;AI Safety Framework Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How We Cut LLM API Costs by 94%: A 3-Layer Caching Strategy</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Thu, 14 May 2026 13:59:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/how-we-cut-llm-api-costs-by-94-a-3-layer-caching-strategy-145l</link>
      <guid>https://dev.to/anilatambharii/how-we-cut-llm-api-costs-by-94-a-3-layer-caching-strategy-145l</guid>
      <description>&lt;p&gt;Last month, our LLM API bills hit &lt;strong&gt;$47,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This month: &lt;strong&gt;$2,800&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Same product. Same user experience. Same performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;94% cost reduction&lt;/strong&gt; without sacrificing quality.&lt;/p&gt;

&lt;p&gt;Here's the architecture that made it possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;CFO's message: "Fix this or we shut down the AI features."&lt;/p&gt;

&lt;p&gt;We had 90 days.&lt;/p&gt;

&lt;p&gt;Most teams would panic and start cutting features. We treated it as an architecture problem, not a budget problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: 3-Layer Caching + Intelligent Routing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/monday-cost-optimization.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/monday-cost-optimization.png" alt="Cost Optimization Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Prompt Caching (68% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Every request pays for the same tokens repeatedly.&lt;/p&gt;

&lt;p&gt;Standard system prompts, documentation, static context—all charged every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Claude's native prompt caching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Mark cacheable content with cache_control
&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI assistant for our healthcare platform...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Cache this
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current user context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Don't cache (changes per user)
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Economics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens:&lt;/strong&gt; $3.00 / 1M tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cached input tokens:&lt;/strong&gt; $0.30 / 1M tokens (10x cheaper!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache write:&lt;/strong&gt; $3.75 / 1M tokens (one-time cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First request (cache write):&lt;/p&gt;

&lt;p&gt;5,000 token system prompt&lt;br&gt;
Cost: $0.01875 (5K tokens × $3.75/1M)&lt;/p&gt;

&lt;p&gt;Next 100 requests (cache hit):&lt;/p&gt;

&lt;p&gt;Same 5,000 token system prompt&lt;br&gt;
Cost: $0.0015 (5K tokens × $0.30/1M × 100)&lt;/p&gt;

&lt;p&gt;Total: $0.02025 for 101 requests&lt;br&gt;
Without caching: $1.515 (5K × $3/1M × 101)&lt;br&gt;
Savings: 98.7%&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our hit rate:&lt;/strong&gt; 68%&lt;/p&gt;


&lt;h2&gt;
  
  
  Layer 2: Semantic Caching (15% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Vector search doesn't catch similar queries.&lt;/p&gt;

&lt;p&gt;"How do I reset my password?" vs "Password reset help?" are semantically identical but literally different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Semantic similarity matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# {embedding: (query, response, timestamp)}
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if semantically similar query exists in cache&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cached_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache HIT: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ≈ &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (similarity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store query-response pair with embedding&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First query
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reset my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reset my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Similar query (cache hit!)
&lt;/span&gt;&lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Password reset help?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns the cached response, no LLM call
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Additional 15% cache hit rate&lt;/strong&gt; on top of prompt caching.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Result Caching (10% hit rate)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Identical queries hit the LLM multiple times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Cache complete responses with smart TTL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create deterministic cache key&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;cache_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get cached response if exists&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cache response with TTL

        TTL strategy:
        - Stable content: 24 hours (86400s)
        - Dynamic content: 1 hour (3600s)
        - Real-time data: 5 minutes (300s)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Invalidate cache on data updates&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Check cache first
&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# Cache hit!
&lt;/span&gt;
&lt;span class="c1"&gt;# Cache miss - call LLM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cache the result
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Invalidate on data update
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:123:*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Clear all caches for user 123
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final 10% cache hit rate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combined: 73% cache hit rate&lt;/strong&gt; (68% + 15% + 10% with some overlap)&lt;/p&gt;




&lt;h2&gt;
  
  
  Intelligent Model Routing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Caching alone isn't enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;67% of our queries work perfectly with Haiku. That's a 60x price difference vs Opus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;HAIKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $0.25/1M input
&lt;/span&gt;    &lt;span class="n"&gt;SONNET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3/1M input
&lt;/span&gt;    &lt;span class="n"&gt;OPUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# $15/1M input
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Route based on complexity

    Indicators for Haiku (simple):
    - Short queries (&amp;lt;50 tokens)
    - FAQ-style questions
    - Retrieval tasks

    Indicators for Sonnet (analysis):
    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    - Multi-step reasoning
    - Longer context (&amp;gt;2K tokens)

    Indicators for Opus (complex):
    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    - Creative tasks
    - Critical business decisions
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple queries → Haiku
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HAIKU&lt;/span&gt;

    &lt;span class="c1"&gt;# Analysis tasks → Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;explain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

    &lt;span class="c1"&gt;# Complex reasoning → Opus
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPUS&lt;/span&gt;

    &lt;span class="c1"&gt;# Default to Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelTier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Our distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;67% Haiku&lt;/strong&gt; ($0.25/1M)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28% Sonnet&lt;/strong&gt; ($3/1M)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5% Opus&lt;/strong&gt; ($15/1M)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Complete System
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLMClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# Layer 1
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Layer 2
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      &lt;span class="c1"&gt;# Layer 3
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 3: Check result cache
&lt;/span&gt;        &lt;span class="n"&gt;cached_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_result&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 2: Check semantic cache
&lt;/span&gt;        &lt;span class="n"&gt;semantic_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic_result&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 1: Prompt caching + model routing happens in LLM call
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt caching
&lt;/span&gt;            &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache the result
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLMClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my account balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47K/month API costs&lt;/li&gt;
&lt;li&gt;P95 latency: 2.1s&lt;/li&gt;
&lt;li&gt;No optimization strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$2.8K/month (-94%)&lt;/li&gt;
&lt;li&gt;P95 latency: 340ms (67% faster!)&lt;/li&gt;
&lt;li&gt;73% cache hit rate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Infrastructure &amp;gt; Model Selection
&lt;/h3&gt;

&lt;p&gt;Opus with naive setup:     $47K/month&lt;br&gt;
Haiku with optimization:   $2.8K/month&lt;/p&gt;

&lt;p&gt;A well-architected system with Haiku outperforms naive Opus at 1/16th the cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cache Hit Rate Math
&lt;/h3&gt;

&lt;p&gt;Without caching: 100% requests hit LLM&lt;br&gt;
With 73% cache hit: 27% requests hit LLM&lt;br&gt;
Cost reduction: 73% from caching alone&lt;br&gt;
Additional savings: 67% of remaining 27% uses cheap Haiku&lt;br&gt;
Total: 94% cost reduction&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Speed as Side Effect
&lt;/h3&gt;

&lt;p&gt;Caching doesn't just save money. It's faster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit: 50ms (Redis lookup)&lt;/li&gt;
&lt;li&gt;LLM call: 2,100ms (P95)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;42x faster&lt;/strong&gt; for cached requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Enable prompt caching (10x savings on repeated context)&lt;/li&gt;
&lt;li&gt;[ ] Add semantic similarity cache (15% additional hits)&lt;/li&gt;
&lt;li&gt;[ ] Implement result caching with smart TTL&lt;/li&gt;
&lt;li&gt;[ ] Route queries to appropriate model tier&lt;/li&gt;
&lt;li&gt;[ ] Monitor cache hit rates and adjust thresholds&lt;/li&gt;
&lt;li&gt;[ ] Set up cache invalidation on data updates&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Monitoring Dashboard
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache_metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;semantic_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_per_1k_requests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p95_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;340&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track these weekly. Optimize based on data, not assumptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're open-sourcing our cost optimization framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete caching implementation&lt;/li&gt;
&lt;li&gt;Model routing logic&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;li&gt;Cost calculation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt; or &lt;a href="https://github.com/ambharii" rel="noopener noreferrer"&gt;Ambharii Labs&lt;/a&gt; for the release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;What's your LLM API bill?&lt;/p&gt;

&lt;p&gt;Drop it in the comments and I'll tell you which optimization would have the highest ROI for your use case.&lt;/p&gt;

&lt;p&gt;Common wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt caching: 10x savings on repeated context&lt;/li&gt;
&lt;li&gt;Model routing: 60x price difference (Haiku vs Opus)&lt;/li&gt;
&lt;li&gt;Semantic caching: 15% additional hits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make LLMs affordable for everyone. 💰&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #performance #optimization #tutorial&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building Production RAG: From 52% to 89% Accuracy with a 6-Stage Pipeline</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Tue, 12 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/anilatambharii/building-production-rag-from-52-to-89-accuracy-with-a-6-stage-pipeline-33ff</link>
      <guid>https://dev.to/anilatambharii/building-production-rag-from-52-to-89-accuracy-with-a-6-stage-pipeline-33ff</guid>
      <description>&lt;p&gt;Two hard problems in production AI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: RAG systems giving wrong answers 48% of the time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: LLM API bills hitting $47K/month&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We solved both. Here's how.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: RAG Accuracy (52% → 89%)
&lt;/h2&gt;

&lt;p&gt;Our RAG system was confidently wrong. Users asked "What were Q2 healthcare results?" and got Q1 data, footnotes, and chapter titles with zero content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High similarity scores. Completely useless context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM wasn't the problem. Retrieval was broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/wednesday-rag-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/wednesday-rag-architecture.png" alt="RAG Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6-Stage Pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stage 1: Query Processing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; "Show me Q2 results" has no semantic information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Query expansion + metadata extraction&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# dates, entities
&lt;/span&gt;    &lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_with_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ProcessedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Transformation:&lt;/strong&gt;&lt;br&gt;
Input:  "Show me Q2 results"&lt;br&gt;
Output: "quarterly financial results Q2 2024 revenue profit earnings second quarter"&lt;/p&gt;
&lt;h4&gt;
  
  
  Stage 2: Vector Database Search
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# not 10, not 20
&lt;/span&gt;    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthcare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key:&lt;/strong&gt; Cosine similarity threshold 0.85. Anything lower retrieves noise.&lt;/p&gt;
&lt;h4&gt;
  
  
  Stage 3: Hybrid Search (Semantic + Keyword)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Semantic (70%) + BM25 keyword (30%)
&lt;/span&gt;    &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;keyword_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
                 &lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Patent queries like "US-2847291" need exact match, not semantic.&lt;/p&gt;
&lt;h4&gt;
  
  
  Stage 4: Re-ranking (23% Accuracy Boost)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;

&lt;span class="n"&gt;reranker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-MiniLM-L-6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Strategy:&lt;/strong&gt; Fast bi-encoder for 50 candidates → slow cross-encoder for final 5.&lt;/p&gt;
&lt;h4&gt;
  
  
  Stage 5: Context Assembly
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;detokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;section&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;extract_section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why overlap:&lt;/strong&gt; "Revenue increased 23% vs previous quarter" → needs surrounding context.&lt;/p&gt;
&lt;h4&gt;
  
  
  Stage 6: LLM Generation
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;document&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;source&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/source&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;content&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/content&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/document&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use ONLY the provided context.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Instructions:
1. Answer using ONLY provided context
2. Cite sources
3. Say &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if insufficient

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  RAG Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;52% accuracy&lt;/li&gt;
&lt;li&gt;31% hallucination rate&lt;/li&gt;
&lt;li&gt;3.8s latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;89% accuracy&lt;/strong&gt; (+71%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4% hallucination rate&lt;/strong&gt; (-87%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.2s latency&lt;/strong&gt; (-67%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; GPT-4 with naive retrieval = 54% accuracy. Haiku with 6-stage pipeline = 87% accuracy.&lt;/p&gt;

&lt;p&gt;Optimize retrieval, not the LLM.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 2: Cost Reduction ($47K → $2.8K)
&lt;/h2&gt;

&lt;p&gt;Same product. Same UX. 94% cost reduction.&lt;/p&gt;

&lt;p&gt;The secret: 3-layer caching + intelligent routing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 1: Prompt Caching (68% hit rate)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Every request pays for the same system prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI assistant...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# 10x cheaper!
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Economics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: $3.00/1M tokens&lt;/li&gt;
&lt;li&gt;Cached: $0.30/1M tokens (10x cheaper)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
5K token system prompt × 100 requests:&lt;br&gt;
Without caching: $1.50&lt;br&gt;
With caching:    $0.02&lt;br&gt;
98.7% savings&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 2: Semantic Caching (15% hit rate)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Catches:&lt;/strong&gt; "How do I reset password?" ≈ "Password reset help?"&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 3: Result Caching (10% hit rate)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;TTL strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable content: 24 hours&lt;/li&gt;
&lt;li&gt;Dynamic content: 1 hour&lt;/li&gt;
&lt;li&gt;Real-time: 5 minutes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Intelligent Model Routing
&lt;/h3&gt;

&lt;p&gt;67% of queries work with Haiku ($0.25/1M). 60x cheaper than Opus ($15/1M).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;HAIKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# $0.25/1M
&lt;/span&gt;    &lt;span class="n"&gt;SONNET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# $3/1M
&lt;/span&gt;    &lt;span class="n"&gt;OPUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# $15/1M
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple → Haiku
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HAIKU&lt;/span&gt;

    &lt;span class="c1"&gt;# Analysis → Sonnet
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;

    &lt;span class="c1"&gt;# Complex → Opus
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPUS&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SONNET&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;67% Haiku&lt;/li&gt;
&lt;li&gt;28% Sonnet&lt;/li&gt;
&lt;li&gt;5% Opus&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Complete System
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OptimizedLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResultCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Layer 3: Result cache
&lt;/span&gt;        &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 2: Semantic cache
&lt;/span&gt;        &lt;span class="n"&gt;semantic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer 1: Prompt cache + routing
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache results
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$47K/month&lt;/li&gt;
&lt;li&gt;P95 latency: 2.1s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$2.8K/month (-94%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;P95 latency: 340ms (-84%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;73% combined cache hit rate&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Implement query processing (expand + extract metadata)&lt;/li&gt;
&lt;li&gt;[ ] Set up vector DB with metadata filtering&lt;/li&gt;
&lt;li&gt;[ ] Add hybrid search (semantic + keyword)&lt;/li&gt;
&lt;li&gt;[ ] Deploy cross-encoder re-ranking&lt;/li&gt;
&lt;li&gt;[ ] Build chunking with 50-token overlap&lt;/li&gt;
&lt;li&gt;[ ] Force grounded prompts (no hallucinations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Enable prompt caching (10x savings)&lt;/li&gt;
&lt;li&gt;[ ] Add semantic similarity cache&lt;/li&gt;
&lt;li&gt;[ ] Implement result cache with smart TTL&lt;/li&gt;
&lt;li&gt;[ ] Route to appropriate model tier&lt;/li&gt;
&lt;li&gt;[ ] Monitor cache hit rates weekly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval &amp;gt; LLM:&lt;/strong&gt; Haiku + perfect context beats GPT-4 + bad context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-ranking = 23% boost:&lt;/strong&gt; Single highest-ROI optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching = 73% hit rate:&lt;/strong&gt; Most requests never touch the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model routing = 60x savings:&lt;/strong&gt; Haiku for 67% of queries&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What We're Open-Sourcing
&lt;/h2&gt;

&lt;p&gt;Next month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6-stage RAG pipeline (code + docs)&lt;/li&gt;
&lt;li&gt;Cost optimization framework&lt;/li&gt;
&lt;li&gt;Re-ranking models&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;li&gt;Evaluation datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow &lt;a href="https://twitter.com/anilsprasad" rel="noopener noreferrer"&gt;@anilsprasad&lt;/a&gt; or &lt;a href="https://github.com/ambharii" rel="noopener noreferrer"&gt;Ambharii Labs&lt;/a&gt; for release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For RAG:&lt;/strong&gt; What's your accuracy? Drop it in comments.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;For Costs:&lt;/strong&gt; What's your monthly LLM bill? I'll tell you which optimization has highest ROI.&lt;/p&gt;

&lt;p&gt;Common wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt caching: 10x savings&lt;/li&gt;
&lt;li&gt;Re-ranking: 23% accuracy boost&lt;/li&gt;
&lt;li&gt;Model routing: 60x price difference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make production AI work. 🚀&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #machinelearning #python #tutorial&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The web is now weaponized against your AI agents</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 08 May 2026 17:15:44 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-web-is-now-weaponized-against-your-ai-agents-50ol</link>
      <guid>https://dev.to/anilatambharii/the-web-is-now-weaponized-against-your-ai-agents-50ol</guid>
      <description>&lt;p&gt;Google dropped a security bomb last week.&lt;/p&gt;

&lt;p&gt;Their threat intelligence team scanned 2-3 billion web pages per month looking for indirect prompt injection attacks targeting enterprise AI agents. They found a &lt;strong&gt;32% increase&lt;/strong&gt; in malicious attempts between November 2025 and February 2026.&lt;/p&gt;

&lt;p&gt;The open web is now an attack surface for production AI.&lt;/p&gt;

&lt;p&gt;This is not speculation. This is documented evidence of active attacks deployed at scale. Hidden instructions embedded in public HTML. Invisible to humans. Visible to AI agents. Real payloads designed to hijack enterprise systems the moment an agent scrapes the page.&lt;/p&gt;

&lt;p&gt;If you have AI agents reading the open web on behalf of your organization, your security model just became obsolete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monday: Hidden instructions at scale
&lt;/h2&gt;

&lt;p&gt;Google researchers documented the attack patterns deployed across billions of public web pages. The techniques are simple and effective:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero font size text&lt;/strong&gt;: Instructions rendered in font-size: 0. Invisible to humans, fully visible to AI parsing HTML&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opacity manipulation&lt;/strong&gt;: Commands hidden using CSS opacity: 0. Text exists but appears transparent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-screen positioning&lt;/strong&gt;: Instructions placed outside viewport using negative coordinates&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript dynamic execution&lt;/strong&gt;: Payloads injected after page load via client-side JS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL fragment injection&lt;/strong&gt;: Commands embedded after the # symbol in URLs&lt;/p&gt;

&lt;p&gt;These are not sophisticated zero-days requiring nation-state capabilities. These are techniques any web developer knows. The barrier to entry is near zero.&lt;/p&gt;

&lt;p&gt;Real payloads found in the wild:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully specified PayPal transaction instructions&lt;/li&gt;
&lt;li&gt;Stripe donation redirects with persuasion amplifier keywords&lt;/li&gt;
&lt;li&gt;Data exfiltration commands targeting enterprise agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is production infrastructure under active attack.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://security.googleblog.com/2026/04/ai-threats-in-wild-current-state-of.html" rel="noopener noreferrer"&gt;Google Threat Intelligence, April 23, 2026&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuesday: The exploit window collapsed
&lt;/h2&gt;

&lt;p&gt;Black Hat Asia 2026 data from RunSybil: attack window compressed from &lt;strong&gt;5 months (2023) to 10 hours (2026)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why? Frontier LLMs now do offensive security work autonomously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2023 workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Security researcher finds vulnerability&lt;/li&gt;
&lt;li&gt;Documents it technically&lt;/li&gt;
&lt;li&gt;Writes POC exploit code&lt;/li&gt;
&lt;li&gt;Tests against targets&lt;/li&gt;
&lt;li&gt;Iterates based on results&lt;/li&gt;
&lt;li&gt;Publishes working exploit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline: months&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe bug to LLM&lt;/li&gt;
&lt;li&gt;Model generates exploit code&lt;/li&gt;
&lt;li&gt;Test in real-time&lt;/li&gt;
&lt;li&gt;Iterate with AI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline: hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, 57% of organizations have AI agents in production right now. Most were architected before this research dropped. The threat model changed faster than the deployment cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wednesday: The sanitizer model pattern
&lt;/h2&gt;

&lt;p&gt;Two models. One reads the web. The other does the work.&lt;/p&gt;

&lt;p&gt;This is the architecture that actually defends against indirect prompt injection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Deploy a small isolated model with zero system permissions. It reads untrusted web content, filters instructions, validates structure. If it gets compromised by a prompt injection, it lacks the permissions to cause damage.&lt;/p&gt;

&lt;p&gt;The production agent never touches raw web input directly. It only processes data that passed through the sanitizer layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key principle&lt;/strong&gt;: Trust boundary between models, not just at network edge.&lt;/p&gt;

&lt;p&gt;The sanitizer has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No write access&lt;/li&gt;
&lt;li&gt;❌ No email permissions&lt;/li&gt;
&lt;li&gt;❌ No payment capabilities&lt;/li&gt;
&lt;li&gt;❌ No database credentials&lt;/li&gt;
&lt;li&gt;✅ Can read and filter only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If compromised by prompt injection, worst case is tainted text reaching production layer where business logic validation applies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;This is not theoretical. I've implemented this in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/anilatambharii/argus" rel="noopener noreferrer"&gt;ARGUS&lt;/a&gt;&lt;/strong&gt;: Dual model verification by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://api.genomixiq.com" rel="noopener noreferrer"&gt;GenomixIQ&lt;/a&gt;&lt;/strong&gt;: Clinical genomics data ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ARIA RCM&lt;/a&gt;&lt;/strong&gt;: Healthcare revenue cycle workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All production systems in regulated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thursday: Agent firewalls are the next layer
&lt;/h2&gt;

&lt;p&gt;Agent firewalls enforce security policies traditional infrastructure can't.&lt;/p&gt;

&lt;h3&gt;
  
  
  What they block
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instruction injection&lt;/strong&gt;: Override commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential exfiltration&lt;/strong&gt;: Data to external endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privilege escalation&lt;/strong&gt;: Unauthorized tool calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision manipulation&lt;/strong&gt;: Logic chain redirects&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Five-layer architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Input validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown sanitization&lt;/li&gt;
&lt;li&gt;Suspicious URL redaction&lt;/li&gt;
&lt;li&gt;Pattern matching for attack signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Instruction detection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML models trained on override attempts&lt;/li&gt;
&lt;li&gt;Recognizes semantic patterns (role reversals, system prompt refs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Permission checks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compartmentalized tool authorization&lt;/li&gt;
&lt;li&gt;Research agents: read only&lt;/li&gt;
&lt;li&gt;Write agents: database access, no email&lt;/li&gt;
&lt;li&gt;Email agents: no payment processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Decision logging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full audit trails with context&lt;/li&gt;
&lt;li&gt;Source data tracking&lt;/li&gt;
&lt;li&gt;Reasoning chain capture&lt;/li&gt;
&lt;li&gt;Forensic reconstruction capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 5: Human confirmation gates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial transactions require approval&lt;/li&gt;
&lt;li&gt;Data deletion needs review&lt;/li&gt;
&lt;li&gt;Credential changes trigger verification&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zero trust for agents
&lt;/h3&gt;

&lt;p&gt;Never trust input. Assume web content hostile. Verify every action. Log decision lineage. Compartmentalize tools. Human in loop for high stakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Friday: Five questions before deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does your sanitizer have zero system permissions?
&lt;/h3&gt;

&lt;p&gt;If your sanitizer can write to databases or send emails, it's not a sanitizer. It's a production agent reading untrusted input. When compromised, attackers gain those capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are tool permissions compartmentalized by role?
&lt;/h3&gt;

&lt;p&gt;Monolithic access = single compromised agent exposes entire system. Implement RBAC for agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you reconstruct every decision from logs?
&lt;/h3&gt;

&lt;p&gt;If compliance asks why an agent made a recommendation 6 months ago, can you trace to exact data sources and reasoning steps?&lt;/p&gt;

&lt;h3&gt;
  
  
  Does human confirmation trigger for financial actions?
&lt;/h3&gt;

&lt;p&gt;Agents processing payments without approval = automated embezzlement risk. Confirmation gates are not optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Have you tested injection attacks?
&lt;/h3&gt;

&lt;p&gt;No red team testing = you don't know if defenses work. Run adversarial testing continuously.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The 86-89% that fail&lt;/strong&gt; discover these requirements 6 weeks before go-live when compliance asks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 14% that succeed&lt;/strong&gt; build them day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for your systems
&lt;/h2&gt;

&lt;p&gt;Security architecture requirements:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dual model verification&lt;/strong&gt; - Sanitizer + production agent separation&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Compartmentalized permissions&lt;/strong&gt; - Role-based tool access&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Decision lineage tracking&lt;/strong&gt; - Full audit trails&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Human confirmation gates&lt;/strong&gt; - Required for high-stakes actions&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Continuous injection testing&lt;/strong&gt; - Red team + automated&lt;/p&gt;

&lt;p&gt;Not optional enhancements. Production requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ambharii.com/tools" rel="noopener noreferrer"&gt;AI Aether&lt;/a&gt;&lt;/strong&gt;: Free agent security readiness assessment (30 min, 30 questions)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/anilatambharii/argus" rel="noopener noreferrer"&gt;ARGUS&lt;/a&gt;&lt;/strong&gt;: Dual model verification, available on PyPI/GitHub&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://api.genomixiq.com" rel="noopener noreferrer"&gt;GenomixIQ&lt;/a&gt;&lt;/strong&gt;: Clinical genomics with FHIR R4 interoperability&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;ARIA RCM&lt;/a&gt;&lt;/strong&gt;: Healthcare revenue cycle with HIPAA compliance&lt;/p&gt;

&lt;p&gt;All production-grade. No pilots. No POCs. Systems that ship and scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Years production AI taught one lesson
&lt;/h2&gt;

&lt;p&gt;The teams that succeed build governance before deployment, not after compliance review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCMTech&lt;/strong&gt;: $340M measurable improvements, 89 days integration, zero clinical data loss&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GeneticsTech&lt;/strong&gt;: 99.97% uptime during 50TB migration, FHIR R4 compliance throughout&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EnergyTech&lt;/strong&gt;: 23→81% AI adoption among 20-year veteran operators&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HealthTech&lt;/strong&gt;: Petabyte-scale platforms, every decision traceable&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Anil Prasad&lt;/strong&gt; is Founder of Ambharii Technologies and Head of Engineering &amp;amp; Product at EnergyTech.&lt;/p&gt;

&lt;p&gt;28 years building production AI in regulated environments across Fortune 100 companies. Currently building agent security infrastructure for enterprise AI: dual-model verification, compartmentalized permissions, and audit trail architecture for autonomous systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/anilsprasad" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://ambharii.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt; | &lt;a href="https://github.com/anilatambharii" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next week&lt;/strong&gt;: Production deployment patterns, compliance architecture, audit trail infrastructure.&lt;/p&gt;

&lt;h1&gt;
  
  
  AgentSecurity #EnterpriseAI #HumanWritten #ExpertiseFromField
&lt;/h1&gt;

</description>
      <category>productionai</category>
      <category>llmops</category>
      <category>agentsecurity</category>
      <category>aigovernance</category>
    </item>
    <item>
      <title>Claude Code Has 6 Ways to Authenticate. I Built a Cross-Platform Installer Because of It</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Wed, 06 May 2026 16:48:06 +0000</pubDate>
      <link>https://dev.to/anilatambharii/claude-code-has-6-ways-to-authenticate-i-built-a-cross-platform-installer-because-of-it-171k</link>
      <guid>https://dev.to/anilatambharii/claude-code-has-6-ways-to-authenticate-i-built-a-cross-platform-installer-because-of-it-171k</guid>
      <description>&lt;p&gt;TL;DR&lt;br&gt;
Claude Code supports 6 different authentication methods with a strict priority order. Get the order wrong and your Pro subscription silently gets overridden by an API key, costing you real money.&lt;/p&gt;

&lt;p&gt;I built claude-auth-setup — a cross-platform installer (Bash + Batch + PowerShell) that handles the whole thing correctly. MIT licensed, ~17KB of bash, zero runtime dependencies.&lt;/p&gt;

&lt;p&gt;This post walks through the design decisions, the cross-platform tax, and the testing approach.&lt;/p&gt;

&lt;p&gt;The Problem&lt;br&gt;
The Claude Code auth resolution order, highest to lowest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cloud provider creds (Bedrock / Vertex AI / Foundry)&lt;/li&gt;
&lt;li&gt;ANTHROPIC_AUTH_TOKEN&lt;/li&gt;
&lt;li&gt;ANTHROPIC_API_KEY     ← the silent footgun&lt;/li&gt;
&lt;li&gt;apiKeyHelper script&lt;/li&gt;
&lt;li&gt;CLAUDE_CODE_OAUTH_TOKEN&lt;/li&gt;
&lt;li&gt;Subscription OAuth    ← what most users actually want&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89556629th2b26r55nml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89556629th2b26r55nml.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a Pro/Max subscriber and you ever set ANTHROPIC_API_KEY to test something — the API key wins forever until you explicitly unset it. No error. No warning. Just per-token charges added on top of your $20/month subscription.&lt;/p&gt;

&lt;p&gt;The single most common Claude Code support thread is some variation of:&lt;/p&gt;

&lt;p&gt;"My Anthropic Console bill went from $0 to $47 last month and I don't know why."&lt;/p&gt;

&lt;p&gt;The "why" is almost always a stale ANTHROPIC_API_KEY from a tutorial.&lt;/p&gt;

&lt;p&gt;Why a Script Instead of Better Docs&lt;br&gt;
Documentation tells you the rules. A setup script enforces them. A doc that says "remove ANTHROPIC_API_KEY before logging in" gets skimmed. A script that detects the conflict, explains why it's a problem, asks for permission to back up your shell config, and then unsets it — that one ships the right outcome.&lt;/p&gt;

&lt;p&gt;The installer does five things in order:&lt;/p&gt;

&lt;p&gt;Verify install — checks for claude, offers npm i -g @anthropic-ai/claude-code if missing&lt;br&gt;
Ask one question — "Do you have a Claude subscription?" Branches from this&lt;br&gt;
Detect conflicts — finds existing env vars, explains what they'd do, asks before changing&lt;br&gt;
Validate — sk-ant- prefix check, length check, env var persistence check&lt;br&gt;
Back up before mutating — every shell config edit gets a timestamped backup with a printed rollback command&lt;br&gt;
The Cross-Platform Tax&lt;br&gt;
Bash (macOS / Linux): the easy one&lt;br&gt;
detect_shell_config() {&lt;br&gt;
  case "$SHELL" in&lt;br&gt;
    &lt;em&gt;/zsh)  echo "$HOME/.zshrc" ;;&lt;br&gt;
    */bash) [[ "$OSTYPE" == "darwin"&lt;/em&gt; ]] &amp;amp;&amp;amp; echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" ;;&lt;br&gt;
    *)      echo "$HOME/.profile" ;;&lt;br&gt;
  esac&lt;br&gt;
}&lt;br&gt;
Append the export, source the file, done.&lt;/p&gt;

&lt;p&gt;Batch (Windows cmd): the hard one&lt;br&gt;
Windows persists env vars in the registry under HKEY_CURRENT_USER\Environment. The supported tool is setx, which has two gotchas:&lt;/p&gt;

&lt;p&gt;1024-character limit (undocumented, will silently truncate)&lt;br&gt;
Doesn't update the current session — only future processes started after the setx call&lt;br&gt;
So users would run the script, run claude, see the same error, and assume the script broke. The fix is to set the variable in both places:&lt;/p&gt;

&lt;p&gt;setx ANTHROPIC_API_KEY "%KEY%"&lt;br&gt;
set ANTHROPIC_API_KEY=%KEY%&lt;br&gt;
echo NOTE: open a new Command Prompt window to verify persistence&lt;br&gt;
PowerShell: the third path&lt;br&gt;
PowerShell has $PROFILE, but you can't assume:&lt;/p&gt;

&lt;p&gt;The profile file exists&lt;br&gt;
Execution policy allows it to load&lt;br&gt;
The user knows what $PROFILE is&lt;br&gt;
The script gracefully degrades: profile edit → registry write → manual command shown to the user.&lt;/p&gt;

&lt;p&gt;What "Production-Grade" Means at 17KB&lt;br&gt;
I reduced it to four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Idempotency — running twice is safe. The second run detects the configured state and exits cleanly, no duplicate exports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inspectability — before any mutation, print exactly what's about to happen and wait for y/n:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;About to add this line to /Users/anil/.zshrc:&lt;br&gt;
  export ANTHROPIC_API_KEY="sk-ant-..."&lt;br&gt;
Continue? [y/N]&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reversibility — every backup is timestamped (~/.zshrc.backup_20250506_143022). Rollback is one cp command, printed to the screen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testability — a test suite that validates the installer without mutating user state. Sandboxes backup/restore in /tmp, regex-checks key validation, verifies cross-platform parity. Runs in &amp;lt;2s.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Test suite output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlq6drczx0ddqfh8k475.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlq6drczx0ddqfh8k475.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(The 1 failure is a regex bug in the test, not the installer.)&lt;/p&gt;

&lt;p&gt;The Single Best Thing I Did&lt;br&gt;
Replaced a flat 6-option menu with one yes/no question:&lt;/p&gt;

&lt;p&gt;"Do you have a Claude subscription? [y/N]"&lt;/p&gt;

&lt;p&gt;Everything else branches from there. Conversion (people completing the script vs abandoning it mid-flow) went from "hard to measure but bad" to "essentially everyone finishes."&lt;/p&gt;

&lt;p&gt;If you remember nothing else from this post: users don't know which auth method applies to them. They know whether they pay a subscription. Branch on that.&lt;/p&gt;

&lt;p&gt;What I Got Wrong&lt;br&gt;
Too clever about shell detection. First version parsed $SHELL, then $0, then ps -p $$ -o comm=. Over-engineered. Three lines instead of thirty was the fix.&lt;br&gt;
PowerShell-first testing. Wrote canonical tests in PowerShell, hit version/encoding compat issues across Windows machines. Now the canonical suite is Bash; PowerShell is a convenience port.&lt;br&gt;
Underestimated docs. The repo has a README, quick-start, contributing guide, configuration examples, project overview, deployment doc, and build summary. That sounds like a lot for 17KB of script — until you realize the script is the easy part. The diagnosis ("why am I getting charged?") lives in the docs.&lt;br&gt;
Try It&lt;/p&gt;

&lt;h1&gt;
  
  
  Unix
&lt;/h1&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/your-repo/claude-auth-setup.git" rel="noopener noreferrer"&gt;https://github.com/your-repo/claude-auth-setup.git&lt;/a&gt;&lt;br&gt;
cd claude-auth-setup&lt;br&gt;
chmod +x setup-claude-auth.sh&lt;br&gt;
./setup-claude-auth.sh&lt;/p&gt;

&lt;h1&gt;
  
  
  Windows
&lt;/h1&gt;

&lt;p&gt;.\setup-claude-auth.bat&lt;br&gt;
MIT licensed. Issues, PRs, and bug reports all welcome. The best one I got so far was:&lt;/p&gt;

&lt;p&gt;"It worked. Why didn't this exist already?"&lt;/p&gt;

&lt;p&gt;I don't know either.&lt;/p&gt;

&lt;p&gt;Repo: github.com/your-repo/claude-auth-setup&lt;/p&gt;

&lt;p&gt;Follow me here on dev.to for more posts about the unglamorous parts of shipping production tools.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
    </item>
    <item>
      <title>The week the agent capability inflection arrived. And what to do about the 86% that still fail.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Sat, 02 May 2026 17:48:10 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-week-the-agent-capability-inflection-arrived-and-what-to-do-about-the-86-that-still-fail-25b9</link>
      <guid>https://dev.to/anilatambharii/the-week-the-agent-capability-inflection-arrived-and-what-to-do-about-the-86-that-still-fail-25b9</guid>
      <description>&lt;p&gt;&lt;strong&gt;By Anil Prasad&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Head of Engineering and Product, Duke Energy CASPAR · Founder, Ambharii Labs&lt;/p&gt;

&lt;h2&gt;
  
  
  Three signals. One pattern.
&lt;/h2&gt;

&lt;p&gt;Stanford released the 2026 AI Index this week. AI agents jumped from 12% to 66% success on real computer tasks in one year. That is a 5.5x capability multiplier in twelve months.&lt;/p&gt;

&lt;p&gt;In the same week, industry research confirmed that 86 to 89% of enterprise AI agent pilots fail to reach production at scale. Apoorva Mehta launched Abundance, a hedge fund with $100M in seed funding designed to have AI agents run the entire fund. JPMorgan reported their LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles for portfolio managers.&lt;/p&gt;

&lt;p&gt;These stories are not contradictory. They describe the same reality from different angles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The capability inflection has happened. The deployment infrastructure investment lags 18 months behind. That gap is the business opportunity of 2026.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quick numbers before we dig in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88b678yydjkrs20srikn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88b678yydjkrs20srikn.png" alt=" " width="760" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monday: Stanford 12 to 66. Here is what most coverage will miss.
&lt;/h2&gt;

&lt;p&gt;Stanford published the 2026 AI Index this week. The 66% number on real computer tasks will be quoted in every AI keynote for the next twelve months.&lt;/p&gt;

&lt;p&gt;The number is real. The capability inflection has happened.&lt;br&gt;
What everyone is going to miss: 66% on benchmark tasks does not equal 66% in your production environment.&lt;/p&gt;

&lt;p&gt;Benchmarks measure: can the agent complete this task in ideal conditions with clean inputs and a defined success criterion?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production measures: can the agent complete this task at 2 AM on Sunday when the upstream data feed is degraded, the API is throttled, and the human reviewer is asleep?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those are different questions. The benchmark answers one. The other one decides whether your AI program ships or fails.&lt;/p&gt;

&lt;p&gt;The capability bottleneck is gone. The readiness bottleneck just became the only bottleneck that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuesday: 86 to 89% of pilots fail. The four reasons. All fixable.
&lt;/h2&gt;

&lt;p&gt;Industry research published this month confirmed what 28 years in production AI has taught me. Agent pilots fail in predictable ways. The fixes are known. Almost nobody is applying them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 1: Governance breakdowns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pilot worked. The team wants to scale. The compliance team has not seen the system yet. Six weeks of compliance review later, the pilot has lost momentum, the team has shifted to other priorities, and the agent is sitting in staging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Compliance starts at week zero, not week sixteen. If your AI program treats compliance as a release gate, you have already lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 2: Evaluation infrastructure gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pilot demonstrated 84% accuracy on a curated test set. In production, the team cannot tell whether the agent is performing better or worse than baseline because they never built the evaluation framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Build the evaluation infrastructure before the agent. This is what G-ARVIS exists to do. Nine dimensions built from production failure, not academic theory.&lt;/p&gt;

&lt;p&gt;Failure mode 3: Integration complexity&lt;br&gt;
Integration and governance consume up to 60% of AI agent project budgets. Most teams plan for the model and underinvest in everything around it.&lt;/p&gt;

&lt;p&gt;Fix: Plan a 60% integration budget from day one. If the team budgeted 80% for the model and 20% for integration, the project is going to overrun before it ships.&lt;/p&gt;

&lt;p&gt;Failure mode 4: Accountability gaps&lt;br&gt;
When the agent is wrong, nobody knows whose problem it is. The system fails in the gap between teams.&lt;/p&gt;

&lt;p&gt;Fix: Assign one accountable human per agent before deployment. The work belongs to a name, not a function.&lt;/p&gt;

&lt;p&gt;The 86 to 89% failure rate is not happening because the technology does not work. It is happening because organizations are deploying capability without the foundation to support it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wednesday: A2A and MCP crossed 150 production deployments. The architecture conversation just shifted.
&lt;/h2&gt;

&lt;p&gt;Three months ago the question was: which orchestration framework should we use?&lt;/p&gt;

&lt;p&gt;Today the question is: do our agents speak the right protocols?&lt;br&gt;
Two protocols are emerging as the foundation of multi-agent systems in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; handles vertical connectivity. Agent to tool. Agent to data source. Agent to API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A (Agent to Agent)&lt;/strong&gt; handles horizontal connectivity. Direct peer to peer delegation between agents.&lt;/p&gt;

&lt;p&gt;Together they replace the brittle custom integration code that has been the failure mode of multi-agent systems for the past three years.&lt;br&gt;
This is the Kubernetes moment for agentic AI.&lt;/p&gt;

&lt;p&gt;The pattern looks exactly like what happened to microservices ten years ago. Custom service discovery, custom load balancing, custom health checks. Then Kubernetes standardized all of it. The organizations that built on the standardized layer were able to scale. The ones that built proprietary versions had to rewrite their infrastructure.&lt;/p&gt;

&lt;p&gt;Vendor lock in just changed shape too. Three years ago you locked in by choosing a model. Eighteen months ago you locked in by choosing an orchestration framework. In 2026, the lock in is at the protocol layer. Organizations that build on standardized protocols can swap models, frameworks, even vendors with bounded engineering effort.&lt;/p&gt;

&lt;p&gt;ARGUS now supports both A2A and MCP natively. Every tool call through MCP gets logged with full audit trail. Every agent to agent message through A2A gets traced with sender, recipient, timestamp, and payload hash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thursday: Financial AI just had its inflection point.
&lt;/h2&gt;

&lt;p&gt;Apoorva Mehta launched Abundance, a hedge fund designed to have AI agents run the entire fund with $100M in seed funding. JPMorgan's LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial services AI just crossed a threshold most other industries have not faced yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When AI agents are managing money, every decision is not just one inference. It is a chain of reasoning across multiple agents that has to be reconstructable when the SEC asks.&lt;/p&gt;

&lt;p&gt;For an agent to participate in a regulated financial workflow, every decision must be:&lt;/p&gt;

&lt;p&gt;Reconstructable months after the fact&lt;br&gt;
Attributable to specific data sources at specific timestamps&lt;br&gt;
Explainable in language the regulator can evaluate&lt;br&gt;
Reviewable by a human with override authority&lt;/p&gt;

&lt;p&gt;If your agent infrastructure does not support all four, the agent cannot ship into a regulated financial environment.&lt;/p&gt;

&lt;p&gt;This is exactly the gap ARGUS is built to close. Every agent decision logged with input hash, output hash, model version, and tool calls. Full reasoning trace across multi-agent workflows. Time stamped audit log that can be replayed against the original data state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Friday synthesis: The Ambry Genetics migration story.
&lt;/h2&gt;

&lt;p&gt;We migrated a clinical genomics AI platform from MySQL to Vitess at Ambry Genetics. 99.97% uptime. Zero clinical data loss. 8 month migration during which the AI was making real recommendations for real patients.&lt;/p&gt;

&lt;p&gt;The migration could have happened faster. We chose to optimize for safety, not speed.&lt;/p&gt;

&lt;p&gt;What that taught me about AI in regulated environments: the model is the least constrained part of the system. Infrastructure, data governance, compliance requirements, and clinical validation processes are the actual engineering challenges.&lt;/p&gt;

&lt;p&gt;Every AI in healthcare implementation I have seen fail, failed at infrastructure or governance. Not at model accuracy.&lt;/p&gt;

&lt;p&gt;If you are deploying AI in healthcare, energy, or financial services, your constraint set looks more like that migration than like a benchmark optimization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ambharii Labs platform suite
&lt;/h2&gt;

&lt;p&gt;This week marks three weeks since GenomixIQ and ARIA RCM launched. Health system inquiries on FHIR R4 interoperability are validating the architectural decisions made years before launch.&lt;/p&gt;

&lt;p&gt;AI Aether (ambharii.com/tools)&lt;br&gt;
Free enterprise AI readiness assessment. 8 dimensions on the G-ARVIS framework. Board ready roadmap. 30 minutes.&lt;/p&gt;

&lt;p&gt;ARGUS (github.com/anilatambharii/argus)&lt;br&gt;
Autonomous LLM correction and agent monitoring. Now native to A2A and MCP protocols. Open source. PyPI: pip install argus-ai&lt;/p&gt;

&lt;p&gt;GenomixIQ (genomixiq.com)&lt;br&gt;
12-agent molecular mesh for genomic variant interpretation. FHIR R4 from day one. Variant Intelligence Score. Population stratified evaluation.&lt;/p&gt;

&lt;p&gt;ARIA RCM (&lt;a href="mailto:anil@ambharii.com"&gt;anil@ambharii.com&lt;/a&gt;)&lt;br&gt;
11-agent healthcare revenue cycle platform. Three viable acquisition paths: Oracle Health, Microsoft Nuance, NVIDIA Healthcare.&lt;/p&gt;

&lt;p&gt;One shared architecture. G-ARVIS observability across all four. ARGUS self correction built into every agent. Production grade from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The week in one sentence
&lt;/h2&gt;

&lt;p&gt;The agents work at scale. Most organizations are not yet ready to deploy them safely. That gap is the business opportunity of 2026.&lt;/p&gt;

&lt;p&gt;If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions worth sitting with this weekend are the same five I ask in every program kickoff.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does failure look like and who does it hurt?&lt;/li&gt;
&lt;li&gt;Who is accountable when the agent is wrong?&lt;/li&gt;
&lt;li&gt;How does the agent know what it does not know?&lt;/li&gt;
&lt;li&gt;What is the kill switch and who can pull it?&lt;/li&gt;
&lt;li&gt;What does the audit trail look like nine months from now?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your team can answer all five with specifics, you are positioned for the 11 to 14% that will succeed.&lt;/p&gt;

&lt;p&gt;If they cannot, the foundation work is ahead of any deployment work.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Anil Prasad is Head of Engineering and Product at Duke Energy and Founder of Ambharii Labs. He serves as an AI Factory Builder at BCG and co-founded the CDAIO Circle Tri-State Chapter. He has 28 years of production AI experience across Fortune 100 companies including R1 RCM, Ambry Genetics, UnitedHealth Group, Medtronic, and Accenture. He was recognized as one of the Top 100 Most Influential AI Leaders USA 2024 and holds degrees from Stanford and BITS Pilani.&lt;/p&gt;

&lt;p&gt;ambharii.com | linkedin.com/in/anilsprasad | @anilsprasad on X | anilsprasad.substack.com&lt;/p&gt;

&lt;p&gt;Subscribe to Field Notes: Production AI for weekly insights from 28 years building AI in regulated environments. No benchmarks. No hype. Real deployments, real failure modes, and the infrastructure decisions that distinguish production AI from demo AI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>aiops</category>
    </item>
    <item>
      <title>The week AI capability outpaced readiness. Again. Here is what it means in production.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 24 Apr 2026 12:47:57 +0000</pubDate>
      <link>https://dev.to/anilatambharii/the-week-ai-capability-outpaced-readiness-again-here-is-what-it-means-in-production-47kh</link>
      <guid>https://dev.to/anilatambharii/the-week-ai-capability-outpaced-readiness-again-here-is-what-it-means-in-production-47kh</guid>
      <description>&lt;h2&gt;
  
  
  Three events. One pattern.
&lt;/h2&gt;

&lt;p&gt;Three significant things happened in AI this week. Claude Opus 4.7 launched. The EU AI Act moved into full enforcement. And a new arXiv paper, EviSearch, validated what I have been building around for six years: domain-specific multi-agent architectures outperform general ones in clinical settings.&lt;/p&gt;

&lt;p&gt;Each story is real. Each story matters. And each story points to the same pattern I have watched repeat across 28 years of production AI in healthcare, energy, and financial services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability accelerates faster than readiness. Every time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zgvhs9tgiydp08la2ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zgvhs9tgiydp08la2ux.png" alt=" " width="664" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monday · April 20&lt;/p&gt;

&lt;p&gt;** ## Claude Opus 4.7: the benchmark is impressive. Here is the real question.**&lt;/p&gt;

&lt;p&gt;SWE-bench Pro reached 64.3%, up 10.9 points in a single version. SWE-bench Verified hit 87.6%. CursorBench reached 70%. Tool error rates dropped by two thirds. Self-verification built in at the model level. These are genuinely significant improvements.&lt;/p&gt;

&lt;p&gt;But the question I am not seeing asked in any of the coverage: does your organization have the evaluation infrastructure to know whether this model is actually better for your specific use case?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwur31w3r8hco9nm5httc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwur31w3r8hco9nm5httc.png" alt=" " width="642" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The organizations that move confidently after a major model launch are not the ones with the most advanced AI. They are the ones with evaluation infrastructure that can answer four questions within 72 hours of a new model release.&lt;/p&gt;

&lt;p&gt;Is this model better on our specific domain tasks? Is output variance within our acceptable range? What happens to cost-per-correct-output? Can our governance layer onboard this model without a compliance review starting from zero?&lt;/p&gt;

&lt;p&gt;If you cannot answer all four within 72 hours, you are not evaluating the model. You are waiting for someone else to tell you whether to use it. That is a readiness infrastructure problem, not a model problem.&lt;/p&gt;

&lt;p&gt;The self-verification feature is genuinely novel. Two thirds fewer tool errors means a system that needs much less constant human oversight. For multi-agent workflows running thousands of tool calls per day, that is the difference between a system that runs reliably overnight and one that requires a human on call. ARGUS operates the same self-correction principle at the system layer across the entire agent workflow, not just within a single inference.&lt;/p&gt;

&lt;p&gt;Tuesday · April 21&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;EU AI Act: the audit trail is the most common gap. Here is how to close it.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The EU AI Act entered full enforcement in 2026. Fines up to 7% of global annual turnover. High-risk categories include healthcare AI, critical infrastructure, employment, and education technology. Those are the exact sectors I have spent 28 years building production AI for.&lt;/p&gt;

&lt;p&gt;The five mandatory requirements for high-risk AI systems are: a risk management system maintained throughout the entire lifecycle, complete technical documentation, human oversight and intervention mechanisms, demonstrable accuracy and robustness, and a full audit trail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeiy3yfgsel22w8ofyr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeiy3yfgsel22w8ofyr1.png" alt=" " width="660" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a Energy Enterprise, I rebuilt the entire logging layer before deploying a single agent in a live operational context. A grid operations manager asked a question I was not prepared for: "If this system makes a recommendation that causes an outage, and FERC comes knocking, can you show them exactly what the model saw, what it decided, and why?"&lt;/p&gt;

&lt;p&gt;We could not answer that. We rebuilt. That decision delayed the launch by six weeks and saved us months of regulatory exposure eighteen months later.&lt;/p&gt;

&lt;p&gt;ARGUS generates the full audit trail by default. Every inference logged with input hash, output hash, timestamp, and model version. Every tool call traced with actor identity and permission scope. Every human override recorded with reason and outcome. Not as a reporting feature. As the foundational observability layer. github.com/anilatambharii/argus or pip install argus-ai&lt;/p&gt;

&lt;p&gt;Wednesday · April 22&lt;/p&gt;

&lt;p&gt;**&lt;/p&gt;

&lt;h2&gt;
  
  
  EviSearch and the domain-specific agent case: specificity is the moat.
&lt;/h2&gt;

&lt;p&gt;**&lt;/p&gt;

&lt;p&gt;A paper published this week on arXiv described EviSearch, a multi-agent system that automates the creation of clinical evidence tables from medical literature using a specialized architecture. The finding was exactly what I have seen in every clinical AI program I have run: domain-specific agent architectures outperform general-purpose ones in technical domains, typically by 15 to 25 percentage points on domain-relevant evaluation tasks.&lt;/p&gt;

&lt;p&gt;Why the gap exists: A general-purpose agent reasons about yo&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkry4vjx5wvv14lfbiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkry4vjx5wvv14lfbiv.png" alt=" " width="642" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why GenomixIQ uses 12 specialized agents rather than one large general agent. The literature agent understands how to evaluate evidence in population genetics. The ACMG criteria agent knows all 28 classification criteria and the interaction rules between them. The conflict resolution agent knows which database takes precedence when population databases disagree. None of that is prompt engineering. All of it is architectural encoding of domain expertise.&lt;/p&gt;

&lt;p&gt;The EviSearch paper also documented that multi-agent systems for clinical evidence work show inter-run variability below 5%, compared to 15 to 30% for human reviewers on complex evidence tables. Consistency in clinical decision support is not a nice-to-have. It is the compliance requirement.&lt;/p&gt;

&lt;p&gt;Thursday · April 23&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;## G-ARVIS: the nine dimensions most AI teams are not measuring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built the G-ARVIS framework from production failure across 28 years in regulated environments. Nine dimensions. Not from academic theory. From watching accurate models fail catastrophically because nobody was measuring the right things.&lt;/p&gt;

&lt;p&gt;The six dimensions: Groundedness (anchored to verifiable facts), Accuracy (correct output consistently), Reliability (stable at scale across thousands of runs), Variance (output stability on the same prompt across runs), Inference Cost (cost per correct output, not cost per token), Safety (domain-specific harm profile for this domain, this use case, this failure mode).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtwrdyfu0ick7yb18cet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtwrdyfu0ick7yb18cet.png" alt=" " width="651" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three agentic metrics I added specifically for multi-agent production systems: Action Sequence Fidelity (percentage of multi-step workflows completing without human intervention), Error Recovery Rate (when an agent fails, how often does the system recover without escalation), and Cost Per Correct Sequence (total inference cost divided by the number of complete sequences producing a validated correct output).&lt;/p&gt;

&lt;p&gt;All nine are assessed in AI Aether. 73% of organizations score below 12 out of 30 on data architecture alone. The foundation problem has not changed in 28 years. Only the model on top of it has. ambharii.com/tools&lt;/p&gt;

&lt;p&gt;The Ambharii Labs Platform&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;## Four platforms. One shared architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This week marks two weeks since GenomixIQ and ARIA RCM launched, with ARGUS SDK updates shipping and AI Aether continuing to show the same pattern: 73% of organizations score below 12/30 on data architecture. The foundation problem precedes every other problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jcyw8sj9vct5g3a9k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jcyw8sj9vct5g3a9k6.png" alt=" " width="668" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Week in One Sentence&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;## AI shipped faster than most organizations can absorb it. The gap between capability and readiness is the business opportunity of 2026.&lt;br&gt;
*&lt;/em&gt;&lt;br&gt;
If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions I am asking every week are the same questions you should be asking: What does your AI actually do at 2 AM? Who sees the audit trail? What happens when the model is wrong in a way it has never been wrong before?&lt;/p&gt;

&lt;p&gt;The answers to those questions are what distinguishes production AI from demo AI. That distinction is what 28 years in this field teaches you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2qhtgin8lg5an5hojce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2qhtgin8lg5an5hojce.png" alt=" " width="713" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>argus</category>
      <category>genomixiq</category>
      <category>ariarcm</category>
    </item>
    <item>
      <title>I Built the First Agentic AI Platform for Clinical Genomics. Here Is the Full Architecture</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:21:54 +0000</pubDate>
      <link>https://dev.to/anilatambharii/i-built-the-first-agentic-ai-platform-for-clinical-genomics-here-is-the-full-architecture-510i</link>
      <guid>https://dev.to/anilatambharii/i-built-the-first-agentic-ai-platform-for-clinical-genomics-here-is-the-full-architecture-510i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — GenomixIQ is a 12-agent autonomous AI platform for clinical genomics. It classifies genetic variants in 8 seconds with zero hallucinations — enforced at the architecture level, not by prompting. FHIR R4 native. Any-cloud deploy. API live at &lt;a href="https://api.genomixiq.com/docs" rel="noopener noreferrer"&gt;api.genomixiq.com/docs&lt;/a&gt;. First platform of its kind. Integration and acquisition ready.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Had Solved
&lt;/h2&gt;

&lt;p&gt;Walk into any clinical genetics lab today. Watch what happens when a variant comes in.&lt;/p&gt;

&lt;p&gt;A molecular pathologist opens ClinVar. The variant is Pathogenic. Has been for 8 years. 47 supporting submissions. The pathologist reads the evidence, applies ACMG/AMP criteria, writes the interpretation, runs QC, produces the report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;90 minutes. For a deterministic computation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every lab. Every day. Across thousands of variants.&lt;/p&gt;

&lt;p&gt;The data to solve this exists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClinVar: 3 million+ variant interpretations&lt;/li&gt;
&lt;li&gt;gnomAD: population frequencies for 4.2 billion variants&lt;/li&gt;
&lt;li&gt;PubMed: decades of functional studies&lt;/li&gt;
&lt;li&gt;OncoKB: therapeutic implications for hundreds of somatic alterations&lt;/li&gt;
&lt;li&gt;CPIC: 300+ drug-gene dosing pairs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem was never the data. It was &lt;strong&gt;orchestration, integration, and trust&lt;/strong&gt;. Nobody had assembled these sources into a production-grade agent mesh with a technically enforced safety framework and native EHR output.&lt;/p&gt;

&lt;p&gt;I built it. It is called &lt;strong&gt;GenomixIQ&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: The Molecular Agent Mesh
&lt;/h2&gt;

&lt;p&gt;GenomixIQ uses a &lt;strong&gt;Molecular Agent Mesh&lt;/strong&gt; — 12 specialized autonomous agents running in parallel, each owning a distinct clinical genomics reasoning task, coordinated by a master orchestrator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MasterOrchestratorAgent
├── Agent 01: VariantClassifierAgent    → ACMG/AMP, ClinVar, gnomAD
├── Agent 02: ClinicalReporterAgent     → FHIR R4 DiagnosticReport
├── Agent 03: TrialMatcherAgent         → ClinicalTrials.gov live
├── Agent 04: DrugDiscoveryAgent        → ADMET, ChEMBL, AlphaFold
├── Agent 05: PGxAgent                  → 300+ CPIC pairs, diplotyping
├── Agent 06: SomaticOncologyAgent      → TMB, MSI-H, OncoKB
├── Agent 07: RareDiseaseAgent          → Trio analysis, de novo
├── Agent 08: HeredCancerAgent          → BRCA1/2, Lynch, 80+ genes
├── Agent 09: SafetyGateAgent           → G-ARVIS hard block (1.00)
├── Agent 10: CitationVerifierAgent     → PubMed, ClinVar live check
├── Agent 11: EHRIntegratorAgent        → Epic SMART, Cerner FHIR
└── Agent 12: QualityScorerAgent        → VIS, ACMG attestation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why 12 agents instead of one model call?
&lt;/h3&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Reasoning task decomposition.&lt;/strong&gt; Variant classification, pharmacogenomic interaction analysis, somatic therapy matching, and FHIR report generation are four distinct reasoning tasks requiring different knowledge bases, different validation logic, and different confidence thresholds. A single model call cannot hold this complexity reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Internal error correction.&lt;/strong&gt; If one agent returns a hallucinated citation, the Citation Verifier catches it before it reaches the Clinical Reporter. Single-model architectures have no internal correction loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Quality attestation per reasoning unit.&lt;/strong&gt; G-ARVIS scores each agent output independently. A single confidence score on the final output tells you nothing about where in the reasoning chain the uncertainty lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Safety Gate: Zero Hallucinations — Technically Enforced
&lt;/h2&gt;

&lt;p&gt;This is the differentiator that no competitor has built.&lt;/p&gt;

&lt;p&gt;Most "AI genomics" tools add instructions like "do not hallucinate" to their system prompts and call it a safety framework.&lt;/p&gt;

&lt;p&gt;That is not safety. That is a request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenomixIQ's Safety Gate (Agent 09) is an architectural block:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SafetyGateAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    G-ARVIS Safety Gate — hard binary enforcement.
    No Pathogenic classification reaches ClinicalReporterAgent
    without a verified citation from CitationVerifierAgent.
    Not a warning. A block.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enforce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classification_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ClassificationResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GateDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;classification_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acmg_class&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;ACMGClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PATHOGENIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;ACMGClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LIKELY_PATHOGENIC&lt;/span&gt;
        &lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;citation_verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citation_verifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;classification_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evidence_statements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinvar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pubmed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;omim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;citation_verified&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# ARGUS runs up to 3 correction iterations
&lt;/span&gt;                &lt;span class="n"&gt;corrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;corrected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citation_verified&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;# Route for human review — never release unverified
&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GateDecision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ROUTE_TO_HUMAN_REVIEW&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GateDecision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;APPROVED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a Pathogenic classification cannot be verified after 3 ARGUS correction iterations, it routes to human review. It never reaches a clinician unverified. This is G-ARVIS Safety dimension: &lt;strong&gt;1.00 hard binary&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  G-ARVIS: The Quality Framework
&lt;/h2&gt;

&lt;p&gt;Every GenomixIQ output is scored across 6 dimensions before release:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;G&lt;/strong&gt;roundedness&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;Citation coverage ratio vs ClinVar/PubMed/gnomAD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;A&lt;/strong&gt;ccuracy&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;td&gt;Match rate vs CAP-accredited lab gold standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;R&lt;/strong&gt;eliability&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;Classification consistency across equivalent inputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;V&lt;/strong&gt;ariance&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;td&gt;Stability under input perturbation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;I&lt;/strong&gt;nference Cost&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;Token efficiency per clinical decision unit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;S&lt;/strong&gt;afety&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard binary — verified citation required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Composite: 0.937&lt;/strong&gt;. Clinical grade threshold: 0.90. ✓ Passed.&lt;/p&gt;

&lt;p&gt;G-ARVIS is not a post-hoc confidence score. It is a pre-release gate wired into the architecture. Outputs below threshold trigger ARGUS autonomous correction (max 3 iterations). Outputs that fail Safety route to human review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The ARGUS Autonomous Correction Engine
&lt;/h2&gt;

&lt;p&gt;ARGUS (Autonomous Reasoning and Guided Update System) runs inside GenomixIQ as the self-healing layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ARGUSEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;MAX_ITERATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;failing_dimension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GARVISDimension&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_ITERATIONS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;reflection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reflect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failing_dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;refined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reflection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;garvis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;failing_dimension&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;refined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;recovered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recovered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;route_to_human&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error Recovery Rate: 87.3%&lt;/li&gt;
&lt;li&gt;Average iterations to recovery: 1.4&lt;/li&gt;
&lt;li&gt;Human review routing rate: 12.7%&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  FHIR R4 Output — Native, Not Bolted On
&lt;/h2&gt;

&lt;p&gt;This is the second thing that separates GenomixIQ from every other clinical AI tool.&lt;/p&gt;

&lt;p&gt;EHR integration is not a Phase 2 roadmap item. GenomixIQ produces FHIR R4 DiagnosticReport output natively from Agent 02.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClinicalReporterAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VerifiedClassification&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FHIRDiagnosticReport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FHIRDiagnosticReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;resourceType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DiagnosticReport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CodeableConcept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;coding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Coding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://loinc.org&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;81247-9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Master HL7 genetic variant reporting panel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)]&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_variant_observation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_acmg_observation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_garvis_attestation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;conclusion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clinical_interpretation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;conclusionCode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;CodeableConcept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="nc"&gt;Coding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://loinc.org&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acmg_loinc_code&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ready for Epic SMART on FHIR and Cerner Millennium. Zero custom integration work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Clinical Coverage — 5 Domains
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hereditary Cancer
&lt;/h3&gt;

&lt;p&gt;BRCA1/2, Lynch syndrome (MLH1, MSH2, MSH6, PMS2, EPCAM), Li-Fraumeni (TP53), Cowden (PTEN), hereditary diffuse gastric cancer (CDH1), 80+ hereditary cancer genes with syndrome-specific ACMG logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pharmacogenomics
&lt;/h3&gt;

&lt;p&gt;300+ CPIC Level A/B drug-gene pairs. CYP2D6, CYP2C19, CYP2C9, DPYD, TPMT, SLCO1B1, G6PD, NUDT15. Full diplotype calling with population-adjusted allele frequencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Somatic Oncology
&lt;/h3&gt;

&lt;p&gt;TMB calculation, MSI-H assessment, OncoKB therapeutic implication mapping, pan-cancer actionability scoring, FDA-approved and investigational therapy matching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rare Disease
&lt;/h3&gt;

&lt;p&gt;Whole exome/genome trio analysis, de novo variant prioritization, pedigree reconstruction from VCF, HPO phenotype-to-gene matching, OMIM disorder linkage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drug Discovery
&lt;/h3&gt;

&lt;p&gt;Target-disease association scoring, ADMET prediction, lead compound optimization, AlphaFold-integrated structural impact analysis for variant functional characterization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ai_agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Anthropic Claude&lt;/span&gt;
  &lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opus (STAT) → Sonnet (standard) → Haiku (QC)&lt;/span&gt;
  &lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G-ARVIS engine&lt;/span&gt;
  &lt;span class="na"&gt;correction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ARGUS-AI (max 3 iterations)&lt;/span&gt;
  &lt;span class="na"&gt;orchestration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LangChain + LlamaIndex&lt;/span&gt;
  &lt;span class="na"&gt;vector_db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Qdrant (7 collections, 1536-dim, tenant-scoped)&lt;/span&gt;
    &lt;span class="s"&gt;collections&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ClinVar&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gnomAD&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OMIM&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PubMed&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PharmGKB&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OncoKB&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ChEMBL&lt;/span&gt;

&lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FastAPI&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Python &lt;/span&gt;&lt;span class="m"&gt;3.11&lt;/span&gt;
  &lt;span class="na"&gt;orm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SQLAlchemy + Alembic&lt;/span&gt;
  &lt;span class="na"&gt;validation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pydantic v2&lt;/span&gt;
  &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Keycloak RBAC + ABAC + JWT&lt;/span&gt;

&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary_db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgreSQL 16 (RLS, 500-tenant capacity)&lt;/span&gt;
  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Redis &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;
  &lt;span class="na"&gt;streaming&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache Kafka&lt;/span&gt;
  &lt;span class="na"&gt;analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClickHouse (immutable audit log)&lt;/span&gt;
  &lt;span class="na"&gt;object_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delta Lake / S3 (BAM/VCF/FASTQ/CRAM)&lt;/span&gt;

&lt;span class="na"&gt;bioinformatics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;variant_calling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GATK + DeepVariant&lt;/span&gt;
  &lt;span class="na"&gt;annotation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ANNOVAR + VEP + Ensembl&lt;/span&gt;
  &lt;span class="na"&gt;structure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AlphaFold API&lt;/span&gt;

&lt;span class="na"&gt;frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;React 18 + TypeScript&lt;/span&gt;
  &lt;span class="na"&gt;styling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Tailwind CSS&lt;/span&gt;
  &lt;span class="na"&gt;visualization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;D3.js + IGV.js + Recharts&lt;/span&gt;

&lt;span class="na"&gt;mlops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLflow&lt;/span&gt;
  &lt;span class="na"&gt;drift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Evidently&lt;/span&gt;
  &lt;span class="na"&gt;monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus + Grafana&lt;/span&gt;
  &lt;span class="na"&gt;tracing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OpenTelemetry&lt;/span&gt;

&lt;span class="na"&gt;infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker + Kubernetes + Helm&lt;/span&gt;
  &lt;span class="na"&gt;iac&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terraform&lt;/span&gt;
  &lt;span class="na"&gt;gitops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ArgoCD&lt;/span&gt;
  &lt;span class="na"&gt;ci_cd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitHub Actions + SonarQube + Trivy&lt;/span&gt;
  &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS / Azure / GCP / Oracle Cloud / On-prem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Integration Architecture
&lt;/h2&gt;

&lt;p&gt;GenomixIQ is built integration-ready from day one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;External Systems          GenomixIQ API            Internal Agents
─────────────────         ──────────────           ───────────────
Epic SMART on FHIR ──→   /api/v1/variants  ──→   VariantClassifier
Cerner Millennium  ──→   /api/v1/reports   ──→   ClinicalReporter
Lab LIS            ──→   /api/v1/pgx       ──→   PGxAgent
Research Portal    ──→   /api/v1/trials    ──→   TrialMatcher
Pharma Pipeline    ──→   /api/v1/targets   ──→   DrugDiscovery
                         /api/v1/quality   ──→   QualityScorer
                         /api/v1/batch     ──→   All agents (parallel)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;API is publicly testable today:&lt;/strong&gt; &lt;a href="https://api.genomixiq.com/docs" rel="noopener noreferrer"&gt;api.genomixiq.com/docs&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Right Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test the health check&lt;/span&gt;
curl https://api.genomixiq.com/health

&lt;span class="c"&gt;# Submit a variant for classification&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.genomixiq.com/api/v1/variants &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "variant_id": "VCV000017694",
    "gene": "BRCA1",
    "transcript": "NM_007294.4",
    "hgvs_c": "c.5266dupC",
    "genome_build": "GRCh38"
  }'&lt;/span&gt;

&lt;span class="c"&gt;# Response includes:&lt;/span&gt;
&lt;span class="c"&gt;# - ACMG classification with criteria applied&lt;/span&gt;
&lt;span class="c"&gt;# - Variant Intelligence Score (VIS)&lt;/span&gt;
&lt;span class="c"&gt;# - G-ARVIS quality attestation&lt;/span&gt;
&lt;span class="c"&gt;# - Verified citations&lt;/span&gt;
&lt;span class="c"&gt;# - FHIR R4 DiagnosticReport&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Terraform modules included for AWS, Azure, GCP, Oracle Cloud, and on-premises Kubernetes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Unlocks for Integration Partners
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For EHR vendors (Epic, Oracle Health, Microsoft Nuance, Veeva):&lt;/strong&gt;&lt;br&gt;
FHIR R4 native output. SMART on FHIR compatible. Plug into your existing genomics module with zero custom integration. G-ARVIS attestation on every result for clinical defensibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For health system IT teams:&lt;/strong&gt;&lt;br&gt;
On-premises Kubernetes deploy. HIPAA-ready architecture with PHI tokenization before any LLM prompt. Full audit trail in ClickHouse. Row-level security across 500-tenant capacity. SOC 2 Type II controls in the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For pharma R&amp;amp;D platforms:&lt;/strong&gt;&lt;br&gt;
Drug Discovery Agent with ChEMBL, DrugBank, and AlphaFold API integration. Target validation, ADMET prediction, lead optimization. REST API + batch endpoints for pipeline integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For genomics lab software (Sophia Genetics, Illumina DRAGEN, Fabric Genomics):&lt;/strong&gt;&lt;br&gt;
VCF ingestion endpoint. Batch classification API. ACMG classification output with full evidence trace. Direct integration point for existing lab workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarks vs Current Standard of Care
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Manual (current)&lt;/th&gt;
&lt;th&gt;GenomixIQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time per variant&lt;/td&gt;
&lt;td&gt;60–90 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per variant&lt;/td&gt;
&lt;td&gt;$12–$18&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.023&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citation coverage&lt;/td&gt;
&lt;td&gt;Pathologist memory&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% verified&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination rate&lt;/td&gt;
&lt;td&gt;N/A (human)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00 (hard gate)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FHIR output&lt;/td&gt;
&lt;td&gt;Manual entry&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;~6/hour/pathologist&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;450+/hour&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G-ARVIS score&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.937&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whole genome population-scale interpretation with federated learning&lt;/li&gt;
&lt;li&gt;Direct ClinVar submission pipeline for novel variant classifications&lt;/li&gt;
&lt;li&gt;Somatic liquid biopsy with ctDNA quantification&lt;/li&gt;
&lt;li&gt;Multi-modal proteomics and epigenomics integration&lt;/li&gt;
&lt;li&gt;CAP/CLIA audit package for regulatory inspection readiness&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Acquisition Conversation
&lt;/h2&gt;

&lt;p&gt;GenomixIQ is the first platform to combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-grade 12-agent clinical genomics mesh&lt;/li&gt;
&lt;li&gt;Technically enforced zero-hallucination safety gate&lt;/li&gt;
&lt;li&gt;FHIR R4 native output (not a roadmap item)&lt;/li&gt;
&lt;li&gt;G-ARVIS — the first AI quality standard for clinical genomics&lt;/li&gt;
&lt;li&gt;Any-cloud + on-prem single-command deployment&lt;/li&gt;
&lt;li&gt;Live public API with Swagger documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strategic fits include Oracle Health (Cerner), Microsoft (Nuance), NVIDIA (Clara), Illumina, Tempus AI, Sophia Genetics, and health system genomics programs building precision medicine infrastructure.&lt;/p&gt;

&lt;p&gt;If you are building in this space — as an engineer, a partner, or an acquirer — the conversation is open.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🌐 Platform: &lt;a href="https://genomixiq.com" rel="noopener noreferrer"&gt;genomixiq.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔬 Live API: &lt;a href="https://api.genomixiq.com/docs" rel="noopener noreferrer"&gt;api.genomixiq.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐙 GitHub: &lt;a href="https://github.com/anilatambharii/GenomixIQ" rel="noopener noreferrer"&gt;github.com/anilatambharii/GenomixIQ&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📧 Contact: &lt;a href="mailto:anil@ambharii.com"&gt;anil@ambharii.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 LinkedIn: &lt;a href="https://linkedin.com/in/anilsprasad" rel="noopener noreferrer"&gt;linkedin.com/in/anilsprasad&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by Anil Prasad — Founder, Ambharii Labs. 28 years of production AI/ML at Fortune 100 scale across healthcare, genomics, and energy. Top 100 Most Influential AI Leaders USA 2024.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this helped you think differently about agentic AI in clinical settings, drop a reaction and share it with someone building in health tech.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agentaichallenge</category>
      <category>genomics</category>
      <category>healthtech</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built 11 Autonomous Agents for Healthcare Revenue Cycle. Here Is the Full Architecture.</title>
      <dc:creator>Anil Prasad</dc:creator>
      <pubDate>Fri, 10 Apr 2026 19:36:57 +0000</pubDate>
      <link>https://dev.to/anilatambharii/i-built-11-autonomous-agents-for-healthcare-revenue-cycle-here-is-the-full-architecture-388n</link>
      <guid>https://dev.to/anilatambharii/i-built-11-autonomous-agents-for-healthcare-revenue-cycle-here-is-the-full-architecture-388n</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Self-Correcting Multi-Agent System for Healthcare — and Why Standard ML Metrics Failed Me
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; ai, python, healthcare, machinelearning&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cover image:&lt;/strong&gt; 04_argus_correction.png&lt;/p&gt;




&lt;p&gt;I have been building production AI systems for 28 years. At UnitedHealth Group I ran a 20,000-node Big Data Platform. At R1 RCM I was inside the $4.1B Cloudmed acquisition. At Duke Energy I run AI and product engineering for critical infrastructure.&lt;/p&gt;

&lt;p&gt;None of that experience prepared me for the specific engineering problem of building a reliable multi-agent system for healthcare revenue cycle management.&lt;/p&gt;

&lt;p&gt;This post is about what I learned, what broke, and what I had to invent to make it work. The code is real. The numbers are from production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with agentic systems in regulated environments
&lt;/h2&gt;

&lt;p&gt;Most agentic system tutorials show you a single agent calling a few tools and returning a result. That is fine for demos. It is not fine when the agent is making claims submission decisions on a $300M annual revenue stream for a hospital system.&lt;/p&gt;

&lt;p&gt;The core issues I ran into, in order of how badly they burned me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. LLMs are not deterministic enough for sequential RCM workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Give the same clinical note to the same model twice and you will get subtly different ICD-10 code recommendations. In a classification task that is fine — you measure accuracy across a test set. In an agent that is making 14 sequential decisions across a claims workflow, small inconsistencies compound. A slightly different coding recommendation in step 3 changes the prior authorization requirement in step 5, which changes the denial probability score in step 8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Standard metrics do not capture agentic failure modes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Precision and recall tell you nothing about whether the agent followed the right path to get to a correct answer. An agent that approves the right claim after six wrong turns is not a success — it is a future liability. I needed metrics that measured the sequential behavior of the agent across a workflow, not just the final output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. PHI in prompts is a HIPAA violation waiting to happen&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one is obvious in theory and surprisingly hard in practice. The moment you build a multi-agent system where context is passed between agents, you have to be extremely deliberate about what is in that context. A naive implementation will leak PHI into prompt context within the first week of real data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. There was no observability framework built for agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog, Arize, WhyLabs — all excellent for ML model monitoring. None of them answer the questions I needed answered: Is this agent's output grounded in the source data? Is it consistent across similar inputs? Is it recovering from failures autonomously or silently degrading?&lt;/p&gt;




&lt;h2&gt;
  
  
  What I built: ARIA and the frameworks around it
&lt;/h2&gt;

&lt;p&gt;ARIA is a hierarchical multi-agent system: one Supervisor agent orchestrating 10 specialist agents across the full RCM workflow. I will not walk through all 11 agents here — the full architecture is in the Medium article linked at the end. What I want to focus on are the three engineering innovations that made it reliable enough for production healthcare.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 1: G-ARVIS — a 6-dimension observability framework for agents
&lt;/h2&gt;

&lt;p&gt;I defined G-ARVIS to answer the specific observability questions that no existing tool addressed. Six dimensions, scored per agent execution, in real time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GARVISScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;groundedness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;    &lt;span class="c1"&gt;# Output traceable to source data (0-1)
&lt;/span&gt;    &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;        &lt;span class="c1"&gt;# Factual correctness of output (0-1)
&lt;/span&gt;    &lt;span class="n"&gt;reliability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;     &lt;span class="c1"&gt;# Consistency across similar inputs (0-1)
&lt;/span&gt;    &lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;        &lt;span class="c1"&gt;# Stability under edge cases (0-1)
&lt;/span&gt;    &lt;span class="n"&gt;inference_cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;  &lt;span class="c1"&gt;# Token efficiency per correct output (0-1)
&lt;/span&gt;    &lt;span class="n"&gt;safety&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;          &lt;span class="c1"&gt;# PHI enforcement, HIPAA compliance (0-1)
&lt;/span&gt;
    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;composite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groundedness&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;     &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reliability&lt;/span&gt;  &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.18&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;     &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.17&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_cost&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety&lt;/span&gt;       &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_production_ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Safety is a hard gate — any PHI violation fails immediately
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The weighting is intentional. Groundedness and Accuracy carry the most weight because in healthcare, a hallucinated output is not an annoyance — it is a compliance event. Safety carries 15% but is also a hard gate: any execution that touches PHI in the prompt context fails immediately regardless of the composite score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Variance is the hardest dimension to score&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Variance measures output stability under edge cases — ambiguous clinical notes, incomplete payer data, conflicting authorization histories. The challenge is that you can only measure it retrospectively across a population of similar inputs. We use a sliding window of the last 200 similar executions and measure the coefficient of variation on key output fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VarianceMonitor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;window_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;  &lt;span class="c1"&gt;# insufficient data, assume stable
&lt;/span&gt;        &lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# Coefficient of variation per output dimension
&lt;/span&gt;        &lt;span class="n"&gt;cv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Score: 1.0 = perfectly stable, 0.0 = completely unstable
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current production Variance score: 91.7%. This is the dimension I am least satisfied with and where most of our active engineering effort is focused. Target is 95%+.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 2: Three new agentic metrics
&lt;/h2&gt;

&lt;p&gt;I defined these because ASF, ERR, and CPCS did not exist anywhere I could find, and I needed them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action Sequence Fidelity (ASF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What percentage of agent execution paths match the optimal RCM workflow path? This requires defining the optimal path — which we did by analyzing 50,000 adjudicated claims and extracting the decision sequence that led to first-pass approval with minimum rework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SequenceMatcher&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ASFCalculator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimal_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]):&lt;/span&gt;
        &lt;span class="c1"&gt;# optimal_paths: claim_type -&amp;gt; sequence of agent actions
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimal_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimal_paths&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;claim_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actual_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;optimal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimal_paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;optimal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;  &lt;span class="c1"&gt;# no baseline, assume correct
&lt;/span&gt;
        &lt;span class="n"&gt;matcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SequenceMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;optimal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;actual_path&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_asf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;executions&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current production ASF: 91.4%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Recovery Rate (ERR)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an agent encounters a failure, how often does it recover autonomously? This is straightforward to measure — you track every exception event and whether it resolved within the ARGUS correction loop or escalated to human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExecutionEvent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;execution_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;exception_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;resolved_autonomously&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ERRTracker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ExecutionEvent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExecutionEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;window_hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_hours&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exception_type&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

        &lt;span class="n"&gt;autonomous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resolved_autonomously&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;autonomous&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current production ERR: 87.3%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Per Correct Sequence (CPCS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Total LLM inference cost for one complete, correct RCM workflow execution. This is your unit economics metric. If CPCS exceeds the margin on the claim being processed, the system is not profitable to operate regardless of how accurate it is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SequenceCost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;execution_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;total_input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;model_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;  &lt;span class="c1"&gt;# model_id -&amp;gt; (input_rate, output_rate) per 1M tokens
&lt;/span&gt;    &lt;span class="n"&gt;was_correct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;total_cost_usd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;output_cost&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CPCSCalculator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SequenceCost&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SequenceCost&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_cpcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequences&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;was_correct&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;total_cost_usd&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current production CPCS: $0.023 per claim end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 3: ARGUS — autonomous self-correction
&lt;/h2&gt;

&lt;p&gt;ARGUS is the layer that makes the system reliable enough for production. The core insight: instead of trying to make an LLM deterministically correct on the first attempt, you build a reflection loop that detects failure, analyzes the failure mode by G-ARVIS dimension, and generates a corrected prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GARVISScore&lt;/span&gt;
    &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;corrected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;escalated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ARGUSGuard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;target_composite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;safety_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# hard gate
&lt;/span&gt;        &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthcare_rcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;phi_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_composite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_composite&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;safety_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phi_safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;phi_safe&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_correction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GARVISScorer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;current_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# PHI hard gate — fail immediately, do not retry
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safety_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;corrected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;escalated&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_composite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;corrected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;escalated&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Score below threshold — reflect and refine
&lt;/span&gt;            &lt;span class="n"&gt;current_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_reflect_and_refine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;original_task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;failed_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="c1"&gt;# All attempts exhausted — escalate to human review
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CorrectionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;corrected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;escalated&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_reflect_and_refine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;original_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;failed_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GARVISScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Identify the weakest dimension and generate
&lt;/span&gt;        &lt;span class="c1"&gt;# a dimension-specific correction signal
&lt;/span&gt;        &lt;span class="n"&gt;weak_dims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_weakest_dimensions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;correction_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_correction_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;original_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;failed_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;weak_dims&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;attempt&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;refined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;refined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correction_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correction_prompt&lt;/span&gt;
        &lt;span class="n"&gt;refined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;refined&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_weakest_dimensions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GARVISScore&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;dims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groundedness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reliability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reliability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inference_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# Return dimensions below 0.85, sorted weakest first
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_build_correction_prompt&lt;/code&gt; method is proprietary — that is where the domain-specific healthcare knowledge lives. But the structure above is fully open in the ARGUS SDK.&lt;/p&gt;




&lt;h2&gt;
  
  
  The PHI tokenization architecture
&lt;/h2&gt;

&lt;p&gt;This is the part that took the longest to get right. The requirement: agents need full clinical context to make good RCM decisions, but no PHI can appear in any LLM prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PHITokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Patterns for common PHI types
&lt;/span&gt;    &lt;span class="n"&gt;PHI_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MRN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bMRN[-:\s]?\d{6,10}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DOB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{3}-\d{2}-\d{4}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b[A-Z][a-z]+\s[A-Z][a-z]+\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bNPI[-:\s]?\d{10}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_token_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reverse_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Deterministic: same PHI always maps to same token
&lt;/span&gt;        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;phi_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;phi_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;token_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;phi_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_TOKEN_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phi_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PHI_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_token_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reverse_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;
                &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rehydrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenized_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized_text&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi_value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reverse_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_phi_clean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PHI_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every prompt that goes to an LLM passes through &lt;code&gt;tokenize()&lt;/code&gt; first. Every output that gets committed to the RCM state machine passes through &lt;code&gt;rehydrate()&lt;/code&gt; inside the secure perimeter. The &lt;code&gt;is_phi_clean()&lt;/code&gt; check is what the G-ARVIS Safety dimension calls before every inference.&lt;/p&gt;

&lt;p&gt;Production Safety score: 100%. Zero PHI exposure events.&lt;/p&gt;




&lt;h2&gt;
  
  
  Install and get started
&lt;/h2&gt;

&lt;p&gt;The ARGUS SDK — G-ARVIS scoring, ASF/ERR/CPCS calculators, PHITokenizer base class, and ARGUSGuard correction loop — is open-core and on PyPI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;argus-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;argus_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ARGUSGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GARVISScorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PHITokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;argus_ai.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ASFCalculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ERRTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CPCSCalculator&lt;/span&gt;

&lt;span class="c1"&gt;# Wrap any async agent function with self-correction
&lt;/span&gt;&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ARGUSGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_composite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthcare_rcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;phi_safe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_with_correction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_denial_predictor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;claim_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GARVISScorer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composite&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempts: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Escalated: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escalated&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Production results
&lt;/h2&gt;

&lt;p&gt;These are from the live ARIA system, 24-hour rolling average:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;G-ARVIS composite&lt;/td&gt;
&lt;td&gt;93.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groundedness&lt;/td&gt;
&lt;td&gt;96.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;94.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;93.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Variance&lt;/td&gt;
&lt;td&gt;91.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference Cost&lt;/td&gt;
&lt;td&gt;95.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action Sequence Fidelity&lt;/td&gt;
&lt;td&gt;91.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Recovery Rate&lt;/td&gt;
&lt;td&gt;87.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Per Correct Sequence&lt;/td&gt;
&lt;td&gt;$0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Denial rate reduction&lt;/td&gt;
&lt;td&gt;38%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What is open vs proprietary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open (argus-ai on PyPI + GitHub):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ARGUSGuard correction loop&lt;/li&gt;
&lt;li&gt;GARVISScorer base framework&lt;/li&gt;
&lt;li&gt;PHITokenizer base class&lt;/li&gt;
&lt;li&gt;ASF, ERR, CPCS calculators&lt;/li&gt;
&lt;li&gt;PulseFlow MLOps pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proprietary (the ARIA product):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;11-agent supervisor hierarchy with RCM domain specialization&lt;/li&gt;
&lt;li&gt;Payer policy RAG with live contract updates&lt;/li&gt;
&lt;li&gt;Predictive denial scoring model&lt;/li&gt;
&lt;li&gt;RCM domain knowledge engine&lt;/li&gt;
&lt;li&gt;Multi-tenant deployment infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: github.com/anilatambharii/argus-ai&lt;/li&gt;
&lt;li&gt;PyPI: pypi.org/project/argus-ai&lt;/li&gt;
&lt;li&gt;Platform: ambharii.com/RCM&lt;/li&gt;
&lt;li&gt;Full architecture article: medium.com/p/9d0c9f8d662a&lt;/li&gt;
&lt;li&gt;Questions or contributions: &lt;a href="mailto:anil@ambharii.com"&gt;anil@ambharii.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are building agentic systems in regulated industries and running into the same observability and reliability problems — I would genuinely like to hear from you. The metrics definitions are public. Use them, improve them, tell me what is wrong with them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anil Prasad — Founder, Ambharii Labs · Head of Engineering &amp;amp; Product, Duke Energy · Top 100 AI Leaders USA 2024&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;#HumanWritten #ExpertiseFromField&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>healthcare</category>
    </item>
  </channel>
</rss>
