DEV Community: Anil Prasad

Building Production-Ready Open Source AI Infrastructure: A Technical Guide

Anil Prasad — Tue, 19 May 2026 13:00:00 +0000

Over the past year, we've built and open sourced six production-grade AI infrastructure projects. This isn't toy code or a set of proofs of concept. These are systems handling millions of requests daily in production environments.

Here's what we learned building open source AI infrastructure that actually works.

The Six Projects

llm-cost-optimization: 3-layer caching plus intelligent routing
ai-safety-framework: 5-layer defense with 250 red team test cases
production-rag: 6-stage pipeline with re-ranking and evaluation
distributed-training: PyTorch DDP with NCCL tuning
roi-first-ai: Business metric selection and deployment templates
agentic-ai: Multi-agent orchestration framework

All repositories are at github.com/anilatambharii.

Why Open Source Our Production Code

Three reasons.

First, the AI infrastructure landscape is fragmented. Every team rebuilds the same patterns from scratch. LLM caching. RAG pipelines. Cost optimization. Agent orchestration. We've already solved these problems. Sharing the solutions helps the community.

Second, open source code is battle-tested. When thousands of developers review, use, and contribute to your code, it gets better fast. Private code stays brittle. Public code gets hardened.

Third, hiring advantage. The best engineers want to work on code that matters. Open source contributions demonstrate technical credibility better than any interview.

Architecture Principle: Composition Over Configuration

Each project is a focused library, not a framework. You compose them together rather than configuring one monolithic system.

Bad approach: One repo with 47 configuration options trying to do everything.

Good approach: Six repos, each solving one problem well. Use what you need. Ignore what you don't.

Example using llm-cost-optimization and production-rag together:

from llm_cost_optimization import CachingLayer, ModelRouter
from production_rag import RAGPipeline, HybridRetriever

# Set up caching for LLM calls
cache = CachingLayer(
    semantic_cache_threshold=0.95,
    redis_url="redis://localhost:6379"
)

# Set up model routing based on query complexity
router = ModelRouter(
    models={
        "simple": "claude-haiku-4-5",
        "complex": "claude-sonnet-4-6"
    },
    complexity_threshold=0.7
)

# Set up RAG pipeline with hybrid retrieval
retriever = HybridRetriever(
    vector_weight=0.7,
    keyword_weight=0.3
)

rag = RAGPipeline(
    retriever=retriever,
    llm_cache=cache,
    llm_router=router
)

# Use them together
result = rag.query("What were Q2 financial results?")

Each component is independent. Each can be used standalone. Together they form a complete system.

Project Deep Dive: LLM Cost Optimization

This project reduced our LLM costs from $47K monthly to $2.8K monthly. 94% cost reduction. Same quality.

Three Layer Caching

Exact match cache catches identical queries. Redis key is SHA256 hash of prompt. Cache hit returns response instantly. No LLM call. Zero cost.

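import hashlib
from typing import Optional

class ExactMatchCache:
    def __init__(self, redis_client):
        # Create the client with decode_responses=True so get() returns str, not bytes
        self.redis = redis_client

    def get(self, prompt: str) -> Optional[str]:
        # Identical prompts hash to the same key; any difference in text is a miss
        key = hashlib.sha256(prompt.encode()).hexdigest()
        return self.redis.get(f"exact:{key}")

    def set(self, prompt: str, response: str, ttl: int = 3600):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        # setex stores the value and expires it after ttl seconds
        self.redis.setex(f"exact:{key}", ttl, response)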

Hit rate: 23% of queries.

Semantic cache catches similar queries. Embed the prompt. Find nearest neighbors in vector DB. If similarity > threshold (0.95), return cached response.

class SemanticCache:
    def __init__(self, embedding_model, vector_db, threshold=0.95):
        self.embed = embedding_model
        self.db = vector_db
        self.threshold = threshold

    def get(self, prompt: str) -> Optional[str]:
        embedding = self.embed(prompt)
        results = self.db.search(embedding, k=1)

        if results and results[0].score > self.threshold:
            return results[0].cached_response
        return None

    def set(self, prompt: str, response: str):
        embedding = self.embed(prompt)
        self.db.insert(embedding, cached_response=response)

Hit rate: 31% of queries not caught by exact match.
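To make the threshold check concrete, here is a minimal in-memory sketch of the same lookup using plain Python lists instead of a vector DB. The `cosine` helper, the `TinySemanticCache` class, and the toy 3-dimensional vectors are illustrative assumptions, not part of the library.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    """Hypothetical stand-in for a vector DB: linear scan over stored embeddings."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, embedding: List[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for stored, response in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Only return the nearest neighbor if it clears the threshold
        return best_response if best_score > self.threshold else None

    def set(self, embedding: List[float], response: str) -> None:
        self.entries.append((embedding, response))


cache = TinySemanticCache(threshold=0.95)
cache.set([1.0, 0.0, 0.0], "cached answer")
near = cache.get([0.99, 0.05, 0.0])  # almost the same direction as the stored vector
far = cache.get([0.5, 0.5, 0.7])     # points somewhere else entirely
print(near, far)
```

A high threshold trades hit rate for safety: lowering it returns more cached answers, including some for questions that only look similar.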

Prefix cache reuses computation for prompts with common prefixes. System prompt is usually identical. Few-shot examples are usually identical. Only the user query changes.

Anthropic's prompt caching API handles this automatically. Mark static parts as cacheable.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)

Combined hit rate: 73% of queries serve from cache. 27% hit the LLM. Cost reduced 73% from caching alone.
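The exact-match and semantic hit rates quoted earlier compose multiplicatively, because the semantic rate is measured only on queries the exact-match layer missed. A quick check of that arithmetic follows. It is a sketch that leaves prefix caching out, on the assumption that prefix caching discounts calls that still reach the model rather than skipping them, so it does not reproduce the 73% figure.

```python
exact_hit = 0.23             # share of all queries served by the exact-match cache
semantic_hit_on_miss = 0.31  # share of exact-match misses served by the semantic cache

# Queries answered without any LLM call by the two response caches
response_cache_hit = exact_hit + (1 - exact_hit) * semantic_hit_on_miss
reaches_llm = 1 - response_cache_hit

print(f"served by the two response caches: {response_cache_hit:.1%}")
print(f"still reaching the LLM: {reaches_llm:.1%}")
```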

Intelligent Model Routing

Not every query needs GPT-4 or Claude Opus. Simple queries work fine on Haiku. Complex queries need Sonnet.

Routing strategy:

class ModelRouter:
    def route(self, query: str) -> str:
        complexity = self.calculate_complexity(query)

        if complexity < 0.3:
            return "claude-haiku-4-5"  # $0.25 per 1M tokens
        elif complexity < 0.7:
            return "claude-sonnet-4-6"  # $3 per 1M tokens
        else:
            return "claude-opus-4-6"    # $15 per 1M tokens

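    def calculate_complexity(self, query: str) -> float:
        # Features: length, question marks, technical terms, etc.
        features = self.extract_features(query)
        # A scikit-learn-style predict_proba takes a 2D batch and returns one row
        # per sample; take the positive-class probability of the single row
        return self.classifier.predict_proba([features])[0][1]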

Trained a simple classifier on 10K labeled examples. "What's the capital of France?" → Haiku. "Analyze this 50 page contract for liability clauses" → Opus.

Result: 89% of queries route to Haiku. 9% to Sonnet. 2% to Opus. Average cost per query drops 88%.
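The effect of that distribution on price can be estimated with a weighted average. This is a rough sketch that assumes every query uses the same number of tokens regardless of tier and uses the per-1M-token prices from the router comments. The baseline behind the 88% figure is not stated, so the script reports savings against two candidate baselines.

```python
# $ per 1M tokens, taken from the comments in ModelRouter.route
prices = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
# Share of queries routed to each tier
share = {"haiku": 0.89, "sonnet": 0.09, "opus": 0.02}

blended = sum(prices[m] * share[m] for m in prices)


def saving_vs(baseline_model: str) -> float:
    """Fractional saving compared with sending every query to one model."""
    return 1 - blended / prices[baseline_model]


print(f"blended price: ${blended:.4f} per 1M tokens")
print(f"saving vs all-Sonnet: {saving_vs('sonnet'):.1%}")
print(f"saving vs all-Opus: {saving_vs('opus'):.1%}")
```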

Implementation Notes

Cache invalidation is the hard part. We invalidate based on TTL (1 hour default) and explicit updates. When source data changes, we flush related cache entries.
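One way to combine TTL expiry with explicit flushes is to tag each cache entry with the source documents it was derived from. The sketch below is an in-memory illustration of that idea. The `TaggedTTLCache` class and the `doc:` tag names are hypothetical, and the text does not say how the projects track related entries.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


class TaggedTTLCache:
    """Hypothetical cache: entries expire after a TTL or when a tag is flushed."""

    def __init__(self, default_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.default_ttl = default_ttl
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self._by_tag: Dict[str, Set[str]] = defaultdict(set)

    def set(self, key: str, value: str, tags: Iterable[str] = (),
            ttl: Optional[float] = None) -> None:
        expires_at = self.clock() + (self.default_ttl if ttl is None else ttl)
        self._store[key] = (expires_at, value)
        for tag in tags:
            self._by_tag[tag].add(key)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:  # TTL expiry
            del self._store[key]
            return None
        return value

    def invalidate_tag(self, tag: str) -> int:
        """Explicit flush: drop every entry derived from the given source."""
        removed = 0
        for key in self._by_tag.pop(tag, set()):
            if self._store.pop(key, None) is not None:
                removed += 1
        return removed


cache = TaggedTTLCache()
cache.set("q1", "answer built from the finance report", tags=["doc:finance-q2"])
cache.set("q2", "answer built from the handbook", tags=["doc:handbook"])
removed = cache.invalidate_tag("doc:finance-q2")  # the finance report changed
print(removed, cache.get("q1"), cache.get("q2"))
```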

Monitoring tracks hit rates, latency, cost per query. Dashboard shows cache performance in real time. Alerts fire when hit rate drops below threshold.

Gradual rollout started with 1% of traffic. Measured cache hit rate and accuracy. Ramped to 10%, 50%, 100% over 3 weeks.
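A percentage ramp like this is often implemented with deterministic hashing, so a given user stays in the same group as the percentage grows. The text does not say how traffic was split, so the `in_rollout` function, the salt, and the user IDs below are assumptions made for illustration.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "llm-cache-v1") -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    enabled = sum(in_rollout(u, percent) for u in users)
    print(f"{percent}% target -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the salt and the user ID, everyone enabled at 1% is still enabled at 10%, which keeps before-and-after comparisons clean during the ramp.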

Project Deep Dive: Production RAG

We increased RAG accuracy from 52% to 89% by fixing retrieval, not the LLM.

The 6-Stage Pipeline

Stage 1: Query Processing

Don't send raw user queries to vector DB. Expand with synonyms. Extract metadata. Generate context-aware embedding.

class QueryProcessor:
    def process(self, query: str) -> ProcessedQuery:
        # Extract metadata
        metadata = {
            "date_range": self.extract_date_range(query),
            "department": self.extract_department(query),
            "doc_type": self.extract_doc_type(query)
        }

        # Expand with synonyms
        expanded = self.expand_synonyms(query)

        # Generate embedding
        embedding = self.embed_model(expanded)

        return ProcessedQuery(
            original=query,
            expanded=expanded,
            embedding=embedding,
            metadata=metadata
        )

Stage 2: Vector Database Search

Cosine similarity threshold 0.85. Top-k 50 candidates (not 5, not 10). Use Pinecone with metadata filtering.

results = index.query(
    vector=processed_query.embedding,
    top_k=50,
    filter={
        "department": processed_query.metadata["department"],
        "date": {"$gte": processed_query.metadata["date_range"][0]}
    }
)

Stage 3: Hybrid Search

Combine semantic search (70%) with keyword search (30%) using BM25.

class HybridRetriever:
    def retrieve(self, query: ProcessedQuery) -> List[Document]:
        # Vector search
        vector_results = self.vector_search(query, k=50)

        # Keyword search
        keyword_results = self.bm25_search(query.expanded, k=50)

        # Combine with weights
        combined = self.merge_results(
            vector_results, 
            keyword_results,
            vector_weight=0.7,
            keyword_weight=0.3
        )

        return combined[:50]

Stage 4: Re-ranking

This single stage improved accuracy by 23%. Use cross-encoder to score each candidate against the actual query.

from typing import List
from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: List[Document]) -> List[Document]:
        # Score each doc against query
        pairs = [(query, doc.text) for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score
        ranked = sorted(
            zip(documents, scores),
            key=lambda x: x[1],
            reverse=True
        )

        return [doc for doc, score in ranked[:5]]

Top 50 candidates from hybrid search → Re-rank → Best 5 to LLM.

Stage 5: Context Assembly

Smart chunking with overlap. 512 token chunks with 50 token overlap. Include surrounding context. Add metadata.

def assemble_context(ranked_docs: List[Document]) -> str:
    context_parts = []

    for i, doc in enumerate(ranked_docs):
        context_parts.append(f"""
Source {i+1}: {doc.metadata['title']}
Date: {doc.metadata['date']}
Department: {doc.metadata['department']}

{doc.text}

---
        """)

    return "\n".join(context_parts)

Stage 6: LLM Generation

Force grounded responses. System prompt enforces citation. User query includes assembled context.

system_prompt = """You are a helpful assistant. Use ONLY the provided context to answer questions. 

If the context doesn't contain enough information, say "I don't have enough information to answer that question."

Always cite your sources using the Source number."""

user_prompt = f"""Context:
{assembled_context}

Question: {original_query}

Answer:"""

Results

Before: 52% answer accuracy. 3.8s latency. 31% hallucination rate.

After: 89% accuracy (+71%). 1.2s latency (faster!). 4% hallucination rate (-87%).

The insight: Don't optimize the LLM. Optimize the retrieval. GPT-4 with bad context = bad answers. Haiku with perfect context = great answers.

Making Projects Production Ready

Every project includes:

Comprehensive tests: Unit tests for every function. Integration tests for pipelines. End-to-end tests for workflows. 90%+ coverage.

Documentation: README with quick start. Detailed API docs. Architecture diagrams. Example notebooks.

Benchmarks: Performance metrics. Accuracy measurements. Cost comparisons. Real numbers, not claims.

Monitoring: Prometheus metrics. Logging. Error tracking. Observability built in.

Deployment: Docker containers. Kubernetes manifests. Terraform modules. Production ready deployment.

Contributing to Open Source AI

Our projects welcome contributions. Here's how to get started:

Pick a project that interests you
Read the CONTRIBUTING.md
Check the issues for "good first issue" labels
Submit a PR with tests and documentation
Respond to review feedback

We review all PRs within 48 hours. Quality bar is high but we help contributors meet it.

Conclusion

Open source AI infrastructure should be production ready, not proof of concept. These six projects represent thousands of hours of real world testing and optimization.

Use them. Contribute to them. Build on them.

The code is at github.com/anilatambharii. Documentation is comprehensive. Examples are plentiful. Issues are welcome.

Let's build better AI infrastructure together.

About the Author

Anil Prasad is Head of Engineering at Ambharii Labs, recognized as one of "100 Most Influential AI Leaders in USA 2024." He builds production-scale AI and data systems for enterprise organizations. Connect on LinkedIn at linkedin.com/in/anilsprasad or visit ambharii.com.

Related Reading

How We Cut LLM API Costs by 94%: A 3-Layer Caching Strategy

Anil Prasad — Thu, 14 May 2026 13:59:00 +0000

Last month, our LLM API bills hit $47,000.

This month: $2,800.

Same product. Same user experience. Same performance.

94% cost reduction without sacrificing quality.

Here's the architecture that made it possible.

The Wake-Up Call

CFO's message: "Fix this or we shut down the AI features."

We had 90 days.

Most teams would panic and start cutting features. We treated it as an architecture problem, not a budget problem.

The Solution: 3-Layer Caching + Intelligent Routing

Layer 1: Prompt Caching (68% hit rate)

Problem: Every request pays for the same tokens repeatedly.

Standard system prompts, documentation, static context—all charged every time.

Solution: Claude's native prompt caching.

import anthropic

client = anthropic.Anthropic(api_key="your-key")

# Mark cacheable content with cache_control
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI assistant for our healthcare platform...",
            "cache_control": {"type": "ephemeral"}  # Cache this
        },
        {
            "type": "text", 
            "text": f"Current user context: {user_context}"  # Don't cache (changes per user)
        }
    ],
    messages=[{"role": "user", "content": query}]
)

Economics:

Input tokens: $3.00 / 1M tokens
Cached input tokens: $0.30 / 1M tokens (10x cheaper!)
Cache write: $3.75 / 1M tokens (one-time cost)

Example:

First request (cache write):

5,000 token system prompt
Cost: $0.01875 (5K tokens × $3.75/1M)

Next 100 requests (cache hit):

Same 5,000 token system prompt
Cost: $0.15 (5K tokens × $0.30/1M × 100)

Total: $0.16875 for 101 requests
Without caching: $1.515 (5K × $3/1M × 101)
Savings: 88.9%

Our hit rate: 68%
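The arithmetic in this example can be reproduced from the three prices listed under Economics. The sketch below assumes all 100 follow-up requests arrive while the cache entry is still live, and it counts only the 5,000-token system prompt, not the user message or the output tokens.

```python
PRICE_INPUT = 3.00        # $ per 1M uncached input tokens
PRICE_CACHE_READ = 0.30   # $ per 1M cached input tokens
PRICE_CACHE_WRITE = 3.75  # $ per 1M tokens written to the cache


def prompt_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


system_tokens = 5_000
cache_hits = 100

write_cost = prompt_cost(system_tokens, PRICE_CACHE_WRITE)             # first request
read_cost = cache_hits * prompt_cost(system_tokens, PRICE_CACHE_READ)  # next 100
with_cache = write_cost + read_cost
without_cache = (cache_hits + 1) * prompt_cost(system_tokens, PRICE_INPUT)
savings = 1 - with_cache / without_cache

print(f"with caching:    ${with_cache:.5f}")
print(f"without caching: ${without_cache:.5f}")
print(f"savings:         {savings:.1%}")
```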

Layer 2: Semantic Caching (15% hit rate)

Problem: Exact-match lookups don't catch similar queries.

"How do I reset my password?" vs "Password reset help?" are semantically identical but literally different.

Solution: Semantic similarity matching.

from sentence_transformers import SentenceTransformer
import numpy as np
import time

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}  # {embedding: (query, response, timestamp)}
        self.threshold = similarity_threshold

    def _embed(self, text: str):
        # Unit-length vectors make the dot product below equal cosine similarity
        return self.model.encode(text, normalize_embeddings=True)

    def get(self, query: str):
        """Check if semantically similar query exists in cache"""
        query_embedding = self._embed(query)

        # Linear scan over all entries: fine for a demo, use a vector index at scale
        for cached_embedding, (cached_query, response, timestamp) in self.cache.items():
            similarity = np.dot(query_embedding, cached_embedding)

            if similarity >= self.threshold:
                print(f"Cache HIT: '{query}' ≈ '{cached_query}' (similarity: {similarity:.3f})")
                return response

        return None

    def set(self, query: str, response: str):
        """Store query-response pair with embedding"""
        embedding = self._embed(query)
        self.cache[tuple(embedding)] = (query, response, time.time())

# Usage
cache = SemanticCache(similarity_threshold=0.95)

# First query
response = llm.complete("How do I reset my password?")
cache.set("How do I reset my password?", response)

# Similar query (cache hit!)
cached_response = cache.get("Password reset help?")
# Returns the cached response, no LLM call

Additional 15% cache hit rate on top of prompt caching.

Layer 3: Result Caching (10% hit rate)

Problem: Identical queries hit the LLM multiple times.

Solution: Cache complete responses with smart TTL.

import redis
import hashlib
import json

class ResultCache:
    def __init__(self):
        self.redis = redis.Redis(host='localhost', port=6379, db=0)

    def get_cache_key(self, query: str, context: dict) -> str:
        """Create deterministic cache key"""
        cache_input = json.dumps({
            'query': query,
            'context': context
        }, sort_keys=True)
        return hashlib.sha256(cache_input.encode()).hexdigest()

    def get(self, query: str, context: dict):
        """Get cached response if exists"""
        key = self.get_cache_key(query, context)
        cached = self.redis.get(key)
        return json.loads(cached) if cached else None

    def set(self, query: str, context: dict, response: str, ttl: int = 3600):
        """Cache response with TTL

        TTL strategy:
        - Stable content: 24 hours (86400s)
        - Dynamic content: 1 hour (3600s)
        - Real-time data: 5 minutes (300s)
        """
        key = self.get_cache_key(query, context)
        self.redis.setex(
            key,
            ttl,
            json.dumps(response)
        )

    def invalidate(self, pattern: str):
        """Invalidate cache on data updates"""
        for key in self.redis.scan_iter(pattern):
            self.redis.delete(key)

# Usage
cache = ResultCache()

# Check cache first
cached = cache.get(query, context)
if cached:
    return cached  # Cache hit!

# Cache miss - call LLM
response = llm.complete(query, context)

# Cache the result
cache.set(query, context, response, ttl=3600)

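# Invalidate on data update
# Note: scan_iter matches key names, and the keys above are bare SHA-256 digests,
# so this pattern only matches if keys are stored with a "user:<id>:" prefix
cache.invalidate("user:123:*")  # Clear all caches for user 123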

Final 10% cache hit rate.

Combined: 73% cache hit rate (68% + 15% + 10% with some overlap)

Intelligent Model Routing

Caching alone isn't enough.

67% of our queries work perfectly with Haiku. That's a 60x price difference vs Opus.

from enum import Enum

class ModelTier(Enum):
    HAIKU = "claude-3-haiku-20240307"    # $0.25/1M input
    SONNET = "claude-sonnet-4-20250514"  # $3/1M input
    OPUS = "claude-opus-4-20250514"      # $15/1M input

def route_to_model(query: str, context: str) -> ModelTier:
    """
    Route based on complexity

    Indicators for Haiku (simple):
    - Short queries (<50 tokens)
    - FAQ-style questions
    - Retrieval tasks

    Indicators for Sonnet (analysis):
    - "analyze", "compare", "evaluate"
    - Multi-step reasoning
    - Longer context (>2K tokens)

    Indicators for Opus (complex):
    - "design", "architect", "strategy"
    - Creative tasks
    - Critical business decisions
    """
    tokens = len(query.split())

    # Simple queries → Haiku
    if tokens < 50 and not any(word in query.lower() for word in ['analyze', 'compare', 'design']):
        return ModelTier.HAIKU

    # Analysis tasks → Sonnet
    if any(word in query.lower() for word in ['analyze', 'compare', 'evaluate', 'explain']):
        return ModelTier.SONNET

    # Complex reasoning → Opus
    if any(word in query.lower() for word in ['design', 'architect', 'strategy', 'create']):
        return ModelTier.OPUS

    # Default to Sonnet
    return ModelTier.SONNET

# Usage
model = route_to_model(user_query, context)
response = llm.complete(user_query, model=model.value)

Our distribution:

67% Haiku ($0.25/1M)
28% Sonnet ($3/1M)
5% Opus ($15/1M)

The Complete System

class OptimizedLLMClient:
    def __init__(self):
        # Layer 1 (prompt caching) is requested per call via cache_control below,
        # so it needs no client-side cache object
        self.semantic_cache = SemanticCache()  # Layer 2
        self.result_cache = ResultCache()      # Layer 3
        self.client = anthropic.Anthropic()

    def complete(self, query: str, context: dict):
        # Layer 3: Check result cache
        cached_result = self.result_cache.get(query, context)
        if cached_result:
            return cached_result

        # Layer 2: Check semantic cache
        semantic_result = self.semantic_cache.get(query)
        if semantic_result:
            return semantic_result

        # Layer 1: Prompt caching + model routing happens in LLM call
        model = route_to_model(query, context)

        response = self.client.messages.create(
            model=model.value,
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": context.get('system_prompt'),
                "cache_control": {"type": "ephemeral"}  # Prompt caching
            }],
            messages=[{"role": "user", "content": query}]
        )

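        # response.content is a list of content blocks, not a string;
        # take the text of the first block so it can be serialized and cached
        answer = response.content[0].text

        # Cache the result
        self.result_cache.set(query, context, answer, ttl=3600)
        self.semantic_cache.set(query, answer)

        return answer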

# Usage
llm = OptimizedLLMClient()
answer = llm.complete("What's my account balance?", context)

The Results

Before:

$47K/month API costs
P95 latency: 2.1s
No optimization strategy

After:

$2.8K/month (-94%)
P95 latency: 340ms (84% lower!)
73% cache hit rate

Key Insights

1. Infrastructure > Model Selection

Opus with naive setup: $47K/month
Haiku with optimization: $2.8K/month

A well-architected system with Haiku outperforms naive Opus at roughly 1/17th the cost.

2. Cache Hit Rate Math

Without caching: 100% requests hit LLM
With 73% cache hit: 27% requests hit LLM
Cost reduction: 73% from caching alone
Additional savings: 67% of remaining 27% uses cheap Haiku
Total: 94% cost reduction
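The two effects multiply: caching removes a share of the calls, and routing lowers the average price of the calls that remain. The sketch below is a back-of-the-envelope model using the hit rate, model distribution, and input prices quoted in this article. It ignores output tokens, cache-write costs, and differences in query size, so it brackets the measured 94% rather than reproducing it.

```python
hit_rate = 0.73                                           # requests served from cache
share = {"haiku": 0.67, "sonnet": 0.28, "opus": 0.05}     # routing of the remaining calls
price = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}    # $ per 1M input tokens


def relative_cost(baseline_model: str) -> float:
    """Cost after caching and routing, as a fraction of an uncached single-model baseline."""
    blended = sum(share[m] * price[m] for m in share)
    return (1 - hit_rate) * blended / price[baseline_model]


for baseline in ("sonnet", "opus"):
    reduction = 1 - relative_cost(baseline)
    print(f"reduction vs all-{baseline} without caching: {reduction:.1%}")
```

Under these assumptions the estimate lands at about 84% against an all-Sonnet baseline and about 97% against an all-Opus baseline.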

3. Speed as Side Effect

Caching doesn't just save money. It's faster:

Cache hit: 50ms (Redis lookup)
LLM call: 2,100ms (P95)

42x faster for cached requests.

Implementation Checklist

[ ] Enable prompt caching (10x savings on repeated context)
[ ] Add semantic similarity cache (15% additional hits)
[ ] Implement result caching with smart TTL
[ ] Route queries to appropriate model tier
[ ] Monitor cache hit rates and adjust thresholds
[ ] Set up cache invalidation on data updates

Monitoring Dashboard

def get_cache_metrics():
    return {
        'prompt_cache_hit_rate': 0.68,
        'semantic_cache_hit_rate': 0.15,
        'result_cache_hit_rate': 0.10,
        'combined_hit_rate': 0.73,
        'model_distribution': {
            'haiku': 0.67,
            'sonnet': 0.28,
            'opus': 0.05
        },
        'cost_per_1k_requests': 2.80,
        'p95_latency_ms': 340
    }

Track these weekly. Optimize based on data, not assumptions.
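The function above returns fixed numbers. In a live system the same fields would be derived from counters. Here is a minimal sketch of that, where the `CacheMetrics` class and the outcome labels are hypothetical names. Prompt-cache hits are left out because they are reported per call in the API response's usage data rather than decided client-side.

```python
from collections import Counter
from typing import Optional


class CacheMetrics:
    """Hypothetical counter-based tracker for cache outcomes and model usage."""

    def __init__(self) -> None:
        self.outcomes: Counter = Counter()  # "result_hit", "semantic_hit", "llm_call"
        self.models: Counter = Counter()    # model tier used on each LLM call

    def record(self, outcome: str, model: Optional[str] = None) -> None:
        self.outcomes[outcome] += 1
        if outcome == "llm_call" and model is not None:
            self.models[model] += 1

    def snapshot(self) -> dict:
        total = sum(self.outcomes.values())
        if total == 0:
            return {"requests": 0}
        llm_calls = self.outcomes["llm_call"]
        return {
            "requests": total,
            "result_cache_hit_rate": self.outcomes["result_hit"] / total,
            "semantic_cache_hit_rate": self.outcomes["semantic_hit"] / total,
            "combined_hit_rate": (total - llm_calls) / total,
            "model_distribution": {m: n / llm_calls for m, n in self.models.items()},
        }


metrics = CacheMetrics()
for outcome, model in [("result_hit", None), ("semantic_hit", None),
                       ("llm_call", "haiku"), ("llm_call", "sonnet")]:
    metrics.record(outcome, model)
snapshot = metrics.snapshot()
print(snapshot)
```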

What's Next

We're open-sourcing our cost optimization framework:

Complete caching implementation
Model routing logic
Monitoring dashboards
Cost calculation tools

Follow @anilsprasad or Ambharii Labs for the release.

Your Turn

What's your LLM API bill?

Drop it in the comments and I'll tell you which optimization would have the highest ROI for your use case.

Common wins:

Prompt caching: 10x savings on repeated context
Model routing: 60x price difference (Haiku vs Opus)
Semantic caching: 15% additional hits

Let's make LLMs affordable for everyone. 💰

Tags: #ai #performance #optimization #tutorial

Building Production RAG: From 52% to 89% Accuracy with a 6-Stage Pipeline

Anil Prasad — Tue, 12 May 2026 13:00:00 +0000

Two hard problems in production AI:

Accuracy: RAG systems giving wrong answers 48% of the time
Cost: LLM API bills hitting $47K/month

We solved both. Here's how.

Part 1: RAG Accuracy (52% → 89%)

Our RAG system was confidently wrong. Users asked "What were Q2 healthcare results?" and got Q1 data, footnotes, and chapter titles with zero content.

High similarity scores. Completely useless context.

The LLM wasn't the problem. Retrieval was broken.

The 6-Stage Pipeline

Stage 1: Query Processing

Problem: "Show me Q2 results" has no semantic information.

Solution: Query expansion + metadata extraction

def process_query(raw_query: str) -> ProcessedQuery:
    metadata = extract_metadata(raw_query)  # dates, entities
    expanded = expand_query(raw_query, metadata)
    embedding = embed_with_context(expanded, metadata)
    return ProcessedQuery(expanded, metadata, embedding)

Transformation:
Input: "Show me Q2 results"
Output: "quarterly financial results Q2 2024 revenue profit earnings second quarter"

Stage 2: Vector Database Search

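from pinecone import Pinecone

# Current Pinecone clients create the index handle from a client object
pc = Pinecone(api_key="your-key")
index = pc.Index("knowledge-base")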

results = index.query(
    vector=query_embedding,
    top_k=50,  # wide candidate set; Stage 4 re-ranks it down to the final 5
    filter={
        "date_range": {"$gte": "2024-04-01"},
        "department": "healthcare"
    }
)

Key: Cosine similarity threshold 0.85. Anything lower retrieves noise.

Stage 3: Hybrid Search (Semantic + Keyword)

def hybrid_search(query: str, top_k=50):
    # Semantic (70%) + BM25 keyword (30%)
    vector_results = vector_search(query, top_k)
    bm25_results = keyword_search(query, top_k)

    combined = []
    for chunk_id in set(vector_results) | set(bm25_results):
        score = (vector_results.get(chunk_id, 0) * 0.7 + 
                 bm25_results.get(chunk_id, 0) * 0.3)
        combined.append((chunk_id, score))

    return sorted(combined, key=lambda x: x[1], reverse=True)[:top_k]

Why: Patent queries like "US-2847291" need exact match, not semantic.
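One practical detail when weighting the two score sets: cosine similarities are bounded, while BM25 scores are not, so mixing raw values lets the keyword side dominate regardless of the 70/30 weights. The sketch below rescales each side to [0, 1] before combining. Min-max scaling is one common choice and is an assumption here, not something the pipeline is stated to use. The chunk IDs and scores are made up.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def merge(vector_scores: Dict[str, float], bm25_scores: Dict[str, float],
          vector_weight: float = 0.7, keyword_weight: float = 0.3,
          top_k: int = 50) -> List[Tuple[str, float]]:
    v, b = min_max(vector_scores), min_max(bm25_scores)
    combined = {
        chunk_id: vector_weight * v.get(chunk_id, 0.0) + keyword_weight * b.get(chunk_id, 0.0)
        for chunk_id in set(v) | set(b)
    }
    # Sort by score descending, then by ID so ties are deterministic
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


vector_scores = {"a": 0.91, "b": 0.88, "c": 0.80}  # cosine similarities
bm25_scores = {"b": 12.4, "d": 9.1, "c": 2.0}      # unbounded BM25 scores
ranked = merge(vector_scores, bm25_scores)
print([chunk_id for chunk_id, _ in ranked])
```

Chunk "b" ranks first because it scores well on both sides, which is the behavior the hybrid stage is after.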

Stage 4: Re-ranking (23% Accuracy Boost)

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, chunks: List[str], top_k=5):
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

Strategy: Fast bi-encoder for 50 candidates → slow cross-encoder for final 5.

Stage 5: Context Assembly

def create_chunks(doc: Document, size=512, overlap=50):
    # doc is a document object exposing .text, .title and .date
    chunks = []
    tokens = tokenize(doc.text)

    for i in range(0, len(tokens), size - overlap):
        chunk = tokens[i:i + size]
        chunks.append(Chunk(
            text=detokenize(chunk),
            metadata={
                'source': doc.title,
                'date': doc.date,
                'section': extract_section(chunk)
            }
        ))
    return chunks

Why overlap: "Revenue increased 23% vs previous quarter" → needs surrounding context.
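To see where the chunk boundaries fall, the sketch below computes only the token index windows for a given document length. The `chunk_windows` helper is hypothetical. It differs from a plain `range` loop in one respect: it stops once a window reaches the end of the document, so it never emits a trailing chunk that contains nothing but overlap.

```python
from typing import List, Tuple


def chunk_windows(n_tokens: int, size: int = 512, overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) token indices for overlapping chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:  # the document is fully covered
            break
        start += step
    return windows


print(chunk_windows(1200))  # each window starts 462 tokens after the previous one
```

With the default sizes, consecutive windows share 50 tokens, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.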

Stage 6: LLM Generation

def generate_answer(query: str, chunks: List[Chunk]):
    context = "\n\n".join([
        f"<document>\n<source>{c.metadata['source']}</source>\n"
        f"<content>{c.text}</content>\n</document>"
        for c in chunks
    ])

    prompt = f"""Use ONLY the provided context.

Context:
{context}

Query: {query}

Instructions:
1. Answer using ONLY provided context
2. Cite sources
3. Say "I don't know" if insufficient

Answer:"""

    return llm.complete(prompt)

RAG Results

Before:

52% accuracy
31% hallucination rate
3.8s latency

After:

89% accuracy (+71%)
4% hallucination rate (-87%)
1.2s latency (-68%)

Key Insight: GPT-4 with naive retrieval = 54% accuracy. Haiku with 6-stage pipeline = 87% accuracy.

Optimize retrieval, not the LLM.

Part 2: Cost Reduction ($47K → $2.8K)

Same product. Same UX. 94% cost reduction.

The secret: 3-layer caching + intelligent routing.

Layer 1: Prompt Caching (68% hit rate)

Problem: Every request pays for the same system prompt.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI assistant...",
            "cache_control": {"type": "ephemeral"}  # 10x cheaper!
        }
    ],
    messages=[{"role": "user", "content": query}]
)

Economics:

Normal: $3.00/1M tokens
Cached: $0.30/1M tokens (10x cheaper)

Example:
5K token system prompt × 100 requests:
Without caching: $1.50
With caching: $0.17 (one cache write at $3.75/1M, then 99 cache reads)
88.9% savings

Layer 2: Semantic Caching (15% hit rate)

from sentence_transformers import SentenceTransformer
import numpy as np
import time

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = threshold

    def get(self, query: str):
        query_emb = self.model.encode(query)

        for cached_emb, (cached_q, response, _) in self.cache.items():
            similarity = np.dot(query_emb, cached_emb)
            if similarity >= self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        embedding = self.model.encode(query)
        self.cache[tuple(embedding)] = (query, response, time.time())

Catches: "How do I reset password?" ≈ "Password reset help?"

Layer 3: Result Caching (10% hit rate)

import redis
import hashlib
import json

class ResultCache:
    def __init__(self):
        self.redis = redis.Redis()

    def get(self, query: str, context: dict):
        key = hashlib.sha256(
            json.dumps({'query': query, 'context': context}).encode()
        ).hexdigest()
        return self.redis.get(key)

    def set(self, query: str, context: dict, response: str, ttl=3600):
        key = hashlib.sha256(
            json.dumps({'query': query, 'context': context}).encode()
        ).hexdigest()
        self.redis.setex(key, ttl, response)

TTL strategy:

Stable content: 24 hours
Dynamic content: 1 hour
Real-time: 5 minutes

Intelligent Model Routing

67% of queries work with Haiku ($0.25/1M). 60x cheaper than Opus ($15/1M).

from enum import Enum

class Model(Enum):
    HAIKU = "claude-3-haiku-20240307"    # $0.25/1M
    SONNET = "claude-sonnet-4-20250514"  # $3/1M
    OPUS = "claude-opus-4-20250514"      # $15/1M

def route(query: str) -> Model:
    tokens = len(query.split())

    # Simple → Haiku
    if tokens < 50:
        return Model.HAIKU

    # Analysis → Sonnet
    if any(w in query.lower() for w in ['analyze', 'compare']):
        return Model.SONNET

    # Complex → Opus
    if any(w in query.lower() for w in ['design', 'architect']):
        return Model.OPUS

    return Model.SONNET

Distribution:

67% Haiku
28% Sonnet
5% Opus

The Complete System

class OptimizedLLM:
    def __init__(self):
        self.semantic_cache = SemanticCache()
        self.result_cache = ResultCache()
        self.client = anthropic.Anthropic()

    def complete(self, query: str, context: dict):
        # Layer 3: Result cache
        cached = self.result_cache.get(query, context)
        if cached:
            return cached

        # Layer 2: Semantic cache
        semantic = self.semantic_cache.get(query)
        if semantic:
            return semantic

        # Layer 1: Prompt cache + routing
        model = route(query)

        response = self.client.messages.create(
            model=model.value,
            max_tokens=1024,  # required by the Messages API
            system=[{
                "type": "text",
                "text": context['system_prompt'],
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[{"role": "user", "content": query}]
        )

        # response.content is a list of content blocks; cache the text of the first one
        answer = response.content[0].text

        # Cache results
        self.result_cache.set(query, context, answer)
        self.semantic_cache.set(query, answer)

        return answer

Cost Results

Before:

$47K/month
P95 latency: 2.1s

After:

$2.8K/month (-94%)
P95 latency: 340ms (-84%)
73% combined cache hit rate

Implementation Checklist

RAG:

[ ] Implement query processing (expand + extract metadata)
[ ] Set up vector DB with metadata filtering
[ ] Add hybrid search (semantic + keyword)
[ ] Deploy cross-encoder re-ranking
[ ] Build chunking with 50-token overlap
[ ] Force grounded prompts (no hallucinations)

Cost:

[ ] Enable prompt caching (10x savings)
[ ] Add semantic similarity cache
[ ] Implement result cache with smart TTL
[ ] Route to appropriate model tier
[ ] Monitor cache hit rates weekly

Key Insights

Retrieval > LLM: Haiku + perfect context beats GPT-4 + bad context
Re-ranking = 23% boost: Single highest-ROI optimization
Caching = 73% hit rate: Most requests never touch the LLM
Model routing = 60x savings: Haiku for 67% of queries

What We're Open-Sourcing

Next month:

6-stage RAG pipeline (code + docs)
Cost optimization framework
Re-ranking models
Monitoring dashboards
Evaluation datasets

Follow @anilsprasad or Ambharii Labs for release.

Your Turn

For RAG: What's your accuracy? Drop it in comments.

For Costs: What's your monthly LLM bill? I'll tell you which optimization has highest ROI.

Common wins:

Prompt caching: 10x savings
Re-ranking: 23% accuracy boost
Model routing: 60x price difference

Let's make production AI work. 🚀

Tags: #ai #machinelearning #python #tutorial

The web is now weaponized against your AI agents

Anil Prasad — Fri, 08 May 2026 17:15:44 +0000

Google dropped a security bomb last week.

Their threat intelligence team scanned 2-3 billion web pages per month looking for indirect prompt injection attacks targeting enterprise AI agents. They found a 32% increase in malicious attempts between November 2025 and February 2026.

The open web is now an attack surface for production AI.

This is not speculation. This is documented evidence of active attacks deployed at scale. Hidden instructions embedded in public HTML. Invisible to humans. Visible to AI agents. Real payloads designed to hijack enterprise systems the moment an agent scrapes the page.

If you have AI agents reading the open web on behalf of your organization, your security model just became obsolete.

Monday: Hidden instructions at scale

Google researchers documented the attack patterns deployed across billions of public web pages. The techniques are simple and effective:

Zero font size text: Instructions rendered in font-size: 0. Invisible to humans, fully visible to AI parsing HTML

Opacity manipulation: Commands hidden using CSS opacity: 0. Text exists but appears transparent

Off-screen positioning: Instructions placed outside viewport using negative coordinates

JavaScript dynamic execution: Payloads injected after page load via client-side JS

URL fragment injection: Commands embedded after the # symbol in URLs
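
The first three techniques can be spotted with nothing more than the standard library. The sketch below flags text nested inside elements whose inline `style` hides it. It is an illustration under narrow assumptions: it only inspects inline styles, so stylesheet rules, JavaScript-injected payloads, and URL fragments need separate handling, and the pattern names are my own.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HIDDEN_STYLE_PATTERNS = {
    "zero-font-size": re.compile(r"font-size\s*:\s*0(?![\d.])"),
    "zero-opacity": re.compile(r"opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)"),
    "off-screen": re.compile(r"(?:left|top|text-indent)\s*:\s*-\d{3,}"),
}
# Elements that never have a closing tag and so must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._stack: List[Tuple[str, Optional[str]]] = []  # (tag, hiding reason)
        self.findings: List[Tuple[str, str]] = []          # (reason, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").lower()
        reason = next((name for name, pattern in HIDDEN_STYLE_PATTERNS.items()
                       if pattern.search(style)), None)
        self._stack.append((tag, reason))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for _, reason in self._stack:
            if reason:  # any hidden ancestor hides this text
                self.findings.append((reason, text))
                break


def find_hidden_text(html: str) -> List[Tuple[str, str]]:
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

A detector like this is a cheap first filter, not a defense on its own; the architecture later in this post assumes some payloads will get through.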

These are not sophisticated zero-days requiring nation-state capabilities. These are techniques any web developer knows. The barrier to entry is near zero.

Real payloads found in the wild:

Fully specified PayPal transaction instructions
Stripe donation redirects with persuasion amplifier keywords
Data exfiltration commands targeting enterprise agents

This is production infrastructure under active attack.

Source: Google Threat Intelligence, April 23, 2026

Tuesday: The exploit window collapsed

Black Hat Asia 2026 data from RunSybil: attack window compressed from 5 months (2023) to 10 hours (2026).

Why? Frontier LLMs now do offensive security work autonomously.

2023 workflow:

Security researcher finds vulnerability
Documents it technically
Writes POC exploit code
Tests against targets
Iterates based on results
Publishes working exploit

Timeline: months

2026 workflow:

Describe bug to LLM
Model generates exploit code
Test in real-time
Iterate with AI

Timeline: hours

Meanwhile, 57% of organizations have AI agents in production right now. Most were architected before this research dropped. The threat model changed faster than the deployment cycle.

Wednesday: The sanitizer model pattern

Two models. One reads the web. The other does the work.

This is the architecture that actually defends against indirect prompt injection.

Architecture

Deploy a small isolated model with zero system permissions. It reads untrusted web content, filters instructions, validates structure. If it gets compromised by a prompt injection, it lacks the permissions to cause damage.

The production agent never touches raw web input directly. It only processes data that passed through the sanitizer layer.

Key principle: Trust boundary between models, not just at network edge.

The sanitizer has:

❌ No write access
❌ No email permissions
❌ No payment capabilities
❌ No database credentials
✅ Can read and filter only

If the sanitizer is compromised by prompt injection, the worst case is tainted text reaching the production layer, where business logic validation still applies.
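
A rough sketch of the trust boundary in code. A regex stands in for the small sanitizer model, and every name here (`SanitizedText`, `sanitize`, `production_agent`, the pattern list) is hypothetical; the point is only that the production side accepts the sanitizer's output type and nothing else.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Stand-in for the sanitizer model: a few instruction-like phrases.
INSTRUCTION_PATTERN = re.compile(
    r"\b(ignore (all |any )?(previous|prior) instructions|you are now|"
    r"system prompt|send (money|payment)|transfer funds)\b",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class SanitizedText:
    sentences: Tuple[str, ...]
    dropped: int  # how many sentences the sanitizer removed


def sanitize(raw_text: str) -> SanitizedText:
    """Runs with no tools and no credentials: it can only read and filter."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw_text) if s.strip()]
    kept = tuple(s for s in sentences if not INSTRUCTION_PATTERN.search(s))
    return SanitizedText(sentences=kept, dropped=len(sentences) - len(kept))


def production_agent(doc: SanitizedText) -> str:
    """The privileged side: refuses anything that skipped the sanitizer."""
    if not isinstance(doc, SanitizedText):
        raise TypeError("production agent only accepts sanitizer output")
    return " ".join(doc.sentences)
```

In a real deployment the two functions would run in separate processes with separate credentials; the type check only makes the boundary visible in code.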

Implementation

This is not theoretical. I've implemented this in:

ARGUS: Dual model verification by default
GenomixIQ: Clinical genomics data ingestion
ARIA RCM: Healthcare revenue cycle workflows

All production systems in regulated environments.

Thursday: Agent firewalls are the next layer

Agent firewalls enforce security policies traditional infrastructure can't.

What they block

Instruction injection: Override commands
Credential exfiltration: Data to external endpoints
Privilege escalation: Unauthorized tool calls
Decision manipulation: Logic chain redirects

Five-layer architecture

Layer 1: Input validation

Markdown sanitization
Suspicious URL redaction
Pattern matching for attack signatures

Layer 2: Instruction detection

ML models trained on override attempts
Recognizes semantic patterns (role reversals, system prompt refs)

Layer 3: Permission checks

Compartmentalized tool authorization
Research agents: read only
Write agents: database access, no email
Email agents: no payment processing

Layer 4: Decision logging

Full audit trails with context
Source data tracking
Reasoning chain capture
Forensic reconstruction capability

Layer 5: Human confirmation gates

Financial transactions require approval
Data deletion needs review
Credential changes trigger verification
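
Layers 3 and 5 are the simplest to express in code. Below is a minimal sketch, assuming a static role-to-tool map and a callback that represents the human approver; the role names, tool names, and `authorize` function are illustrative and not taken from any particular firewall product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

# Layer 3: compartmentalized tool authorization per agent role.
ROLE_TOOLS: Dict[str, FrozenSet[str]] = {
    "research": frozenset({"web_read"}),
    "write": frozenset({"web_read", "db_write"}),
    "email": frozenset({"send_email"}),
    "payments": frozenset({"make_payment"}),
}
# Layer 5: actions that always need a human decision.
HIGH_STAKES = frozenset({"make_payment", "delete_data", "change_credentials"})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(
    role: str,
    tool: str,
    human_approves: Callable[[str, str], bool] = lambda role, tool: False,
) -> Decision:
    """Check role permissions first, then the human confirmation gate."""
    if tool not in ROLE_TOOLS.get(role, frozenset()):
        return Decision(False, f"role '{role}' is not permitted to call '{tool}'")
    if tool in HIGH_STAKES and not human_approves(role, tool):
        return Decision(False, f"'{tool}' requires human approval")
    return Decision(True, "allowed")
```

The default approver denies everything, so a missing approval hook fails closed rather than open.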

Zero trust for agents

Never trust input. Assume web content is hostile. Verify every action. Log decision lineage. Compartmentalize tools. Keep a human in the loop for high-stakes actions.

Friday: Five questions before deployment

Does your sanitizer have zero system permissions?

If your sanitizer can write to databases or send emails, it's not a sanitizer. It's a production agent reading untrusted input. When compromised, attackers gain those capabilities.

Are tool permissions compartmentalized by role?

Monolithic access = single compromised agent exposes entire system. Implement RBAC for agents.

Can you reconstruct every decision from logs?

If compliance asks why an agent made a recommendation 6 months ago, can you trace to exact data sources and reasoning steps?

Does human confirmation trigger for financial actions?

Agents processing payments without approval = automated embezzlement risk. Confirmation gates are not optional.

Have you tested injection attacks?

No red team testing = you don't know if defenses work. Run adversarial testing continuously.

The 86-89% that fail discover these requirements 6 weeks before go-live when compliance asks.

The 11-14% that succeed build them on day one.

What this means for your systems

Security architecture requirements:

✅ Dual model verification - Sanitizer + production agent separation

✅ Compartmentalized permissions - Role-based tool access

✅ Decision lineage tracking - Full audit trails

✅ Human confirmation gates - Required for high-stakes actions

✅ Continuous injection testing - Red team + automated

Not optional enhancements. Production requirements.

Resources

AI Aether: Free agent security readiness assessment (30 min, 30 questions)

ARGUS: Dual model verification, available on PyPI/GitHub

GenomixIQ: Clinical genomics with FHIR R4 interoperability

ARIA RCM: Healthcare revenue cycle with HIPAA compliance

All production-grade. No pilots. No POCs. Systems that ship and scale.

Years of production AI taught one lesson

The teams that succeed build governance before deployment, not after compliance review.

RCMTech: $340M measurable improvements, 89 days integration, zero clinical data loss

GeneticsTech: 99.97% uptime during 50TB migration, FHIR R4 compliance throughout

EnergyTech: 23→81% AI adoption among 20-year veteran operators

HealthTech: Petabyte-scale platforms, every decision traceable

Anil Prasad is Founder of Ambharii Technologies and Head of Engineering & Product at EnergyTech.

28 years building production AI in regulated environments across Fortune 100 companies. Currently building agent security infrastructure for enterprise AI: dual-model verification, compartmentalized permissions, and audit trail architecture for autonomous systems.

Connect: LinkedIn | Website | GitHub

Next week: Production deployment patterns, compliance architecture, audit trail infrastructure.

#AgentSecurity #EnterpriseAI #HumanWritten #ExpertiseFromField

Claude Code Has 6 Ways to Authenticate. I Built a Cross-Platform Installer Because of It

Anil Prasad — Wed, 06 May 2026 16:48:06 +0000

TL;DR
Claude Code supports 6 different authentication methods with a strict priority order. Get the order wrong and your Pro subscription silently gets overridden by an API key, costing you real money.

I built claude-auth-setup — a cross-platform installer (Bash + Batch + PowerShell) that handles the whole thing correctly. MIT licensed, ~17KB of bash, zero runtime dependencies.

This post walks through the design decisions, the cross-platform tax, and the testing approach.

The Problem
The Claude Code auth resolution order, highest to lowest:

Cloud provider creds (Bedrock / Vertex AI / Foundry)
ANTHROPIC_AUTH_TOKEN
ANTHROPIC_API_KEY ← the silent footgun
apiKeyHelper script
CLAUDE_CODE_OAUTH_TOKEN
Subscription OAuth ← what most users actually want

If you're a Pro/Max subscriber and you ever set ANTHROPIC_API_KEY to test something — the API key wins forever until you explicitly unset it. No error. No warning. Just per-token charges added on top of your $20/month subscription.
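
The priority list above can be modeled in a few lines, which is handy for checking what a given environment would resolve to. This is a sketch of the order as listed in this post, not Claude Code's actual implementation; the cloud flag variable names and the `api_key_helper` parameter are assumptions.

```python
from typing import Mapping, Optional

# Assumed names for the cloud provider switches; verify against the official docs.
CLOUD_FLAGS = ("CLAUDE_CODE_USE_BEDROCK", "CLAUDE_CODE_USE_VERTEX", "CLAUDE_CODE_USE_FOUNDRY")


def resolve_auth(env: Mapping[str, str], api_key_helper: Optional[str] = None) -> str:
    """Return the auth method that wins, following the order listed above."""
    if any(env.get(flag) for flag in CLOUD_FLAGS):
        return "cloud-provider"
    if env.get("ANTHROPIC_AUTH_TOKEN"):
        return "ANTHROPIC_AUTH_TOKEN"
    if env.get("ANTHROPIC_API_KEY"):
        return "ANTHROPIC_API_KEY"
    if api_key_helper:
        return "apiKeyHelper"
    if env.get("CLAUDE_CODE_OAUTH_TOKEN"):
        return "CLAUDE_CODE_OAUTH_TOKEN"
    return "subscription-oauth"


def shadows_subscription(env: Mapping[str, str], api_key_helper: Optional[str] = None) -> bool:
    """True when something higher in the list would override subscription login."""
    return resolve_auth(env, api_key_helper) != "subscription-oauth"
```

Calling `shadows_subscription(os.environ)` before login is roughly the conflict check the installer performs.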

The single most common Claude Code support thread is some variation of:

"My Anthropic Console bill went from $0 to $47 last month and I don't know why."

The "why" is almost always a stale ANTHROPIC_API_KEY from a tutorial.

Why a Script Instead of Better Docs
Documentation tells you the rules. A setup script enforces them. A doc that says "remove ANTHROPIC_API_KEY before logging in" gets skimmed. A script that detects the conflict, explains why it's a problem, asks for permission to back up your shell config, and then unsets it — that one ships the right outcome.

The installer does five things in order:

Verify install — checks for claude, offers npm i -g @anthropic-ai/claude-code if missing
Ask one question — "Do you have a Claude subscription?" Branches from this
Detect conflicts — finds existing env vars, explains what they'd do, asks before changing
Validate — sk-ant- prefix check, length check, env var persistence check
Back up before mutating — every shell config edit gets a timestamped backup with a printed rollback command
The Cross-Platform Tax
Bash (macOS / Linux): the easy one
detect_shell_config() {
  case "$SHELL" in
    */zsh) echo "$HOME/.zshrc" ;;
    # OSTYPE on macOS is "darwin" plus a version suffix, so match the prefix
    */bash) [[ "$OSTYPE" == darwin* ]] && echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" ;;
    *) echo "$HOME/.profile" ;;
  esac
}
Append the export, source the file, done.

Batch (Windows cmd): the hard one
Windows persists env vars in the registry under HKEY_CURRENT_USER\Environment. The supported tool is setx, which has two gotchas:

1024-character limit (longer values get truncated)
Doesn't update the current session — only future processes started after the setx call
So users would run the script, run claude, see the same error, and assume the script broke. The fix is to set the variable in both places:

setx ANTHROPIC_API_KEY "%KEY%"
set ANTHROPIC_API_KEY=%KEY%
echo NOTE: open a new Command Prompt window to verify persistence
PowerShell: the third path
PowerShell has $PROFILE, but you can't assume:

The profile file exists
Execution policy allows it to load
The user knows what $PROFILE is
The script gracefully degrades: profile edit → registry write → manual command shown to the user.

What "Production-Grade" Means at 17KB
I reduced it to four things:

Idempotency — running twice is safe. The second run detects the configured state and exits cleanly, no duplicate exports.
Inspectability — before any mutation, print exactly what's about to happen and wait for y/n:

About to add this line to /Users/anil/.zshrc:
export ANTHROPIC_API_KEY="sk-ant-..."
Continue? [y/N]

Reversibility — every backup is timestamped (~/.zshrc.backup_20250506_143022). Rollback is one cp command, printed to the screen.
Testability — a test suite that validates the installer without mutating user state. Sandboxes backup/restore in /tmp, regex-checks key validation, verifies cross-platform parity. Runs in <2s.
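
Idempotency and reversibility fit in about twenty lines. The installer itself is Bash, so treat this Python version as an illustration of the pattern only; `backup_file` and `ensure_line` are hypothetical names, and the example line below is a placeholder rather than a real key.

```python
import shutil
import time
from pathlib import Path
from typing import Optional, Tuple


def backup_file(path: Path) -> Path:
    """Copy path to <name>.backup_YYYYMMDD_HHMMSS next to the original."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = path.with_name(f"{path.name}.backup_{stamp}")
    shutil.copy2(path, backup)
    return backup


def ensure_line(path: Path, line: str) -> Tuple[bool, Optional[Path]]:
    """Append line once. Returns (changed, backup_path)."""
    existing = path.read_text() if path.exists() else ""
    if line in existing.splitlines():
        return False, None  # already configured: a second run is a no-op
    backup = backup_file(path) if path.exists() else None
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + line + "\n")
    return True, backup
```

Usage would look like `changed, backup = ensure_line(Path.home() / ".zshrc", 'export EXAMPLE_VAR="value"')`, followed by printing `cp <backup> <original>` as the rollback command when `changed` is true.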

Test suite output:

(The 1 failure is a regex bug in the test, not the installer.)

The Single Best Thing I Did
Replaced a flat 6-option menu with one yes/no question:

"Do you have a Claude subscription? [y/N]"

Everything else branches from there. Conversion (people completing the script vs abandoning it mid-flow) went from "hard to measure but bad" to "essentially everyone finishes."

If you remember nothing else from this post: users don't know which auth method applies to them. They know whether they pay a subscription. Branch on that.

What I Got Wrong
Too clever about shell detection. First version parsed $SHELL, then $0, then ps -p $$ -o comm=. Over-engineered. Three lines instead of thirty was the fix.
PowerShell-first testing. Wrote canonical tests in PowerShell, hit version/encoding compat issues across Windows machines. Now the canonical suite is Bash; PowerShell is a convenience port.
Underestimated docs. The repo has a README, quick-start, contributing guide, configuration examples, project overview, deployment doc, and build summary. That sounds like a lot for 17KB of script — until you realize the script is the easy part. The diagnosis ("why am I getting charged?") lives in the docs.
Try It

Unix

git clone https://github.com/your-repo/claude-auth-setup.git
cd claude-auth-setup
chmod +x setup-claude-auth.sh
./setup-claude-auth.sh

Windows

.\setup-claude-auth.bat
MIT licensed. Issues, PRs, and bug reports all welcome. The best one I got so far was:

"It worked. Why didn't this exist already?"

I don't know either.

Repo: github.com/your-repo/claude-auth-setup

Follow me here on dev.to for more posts about the unglamorous parts of shipping production tools.

The week the agent capability inflection arrived. And what to do about the 86% that still fail.

Anil Prasad — Sat, 02 May 2026 17:48:10 +0000

By Anil Prasad

Head of Engineering and Product, Duke Energy CASPAR · Founder, Ambharii Labs

Three signals. One pattern.

Stanford released the 2026 AI Index this week. AI agents jumped from 12% to 66% success on real computer tasks in one year. That is a 5.5x capability multiplier in twelve months.

In the same week, industry research confirmed that 86 to 89% of enterprise AI agent pilots fail to reach production at scale. Apoorva Mehta launched Abundance, a hedge fund with $100M in seed funding designed to have AI agents run the entire fund. JPMorgan reported their LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles for portfolio managers.

These stories are not contradictory. They describe the same reality from different angles.

The capability inflection has happened. The deployment infrastructure investment lags 18 months behind. That gap is the business opportunity of 2026.

Monday: Stanford 12 to 66. Here is what most coverage will miss.

Stanford published the 2026 AI Index this week. The 66% number on real computer tasks will be quoted in every AI keynote for the next twelve months.

The number is real. The capability inflection has happened.

What everyone is going to miss: 66% on benchmark tasks does not equal 66% in your production environment.

Benchmarks measure: can the agent complete this task in ideal conditions with clean inputs and a defined success criterion?

Production measures: can the agent complete this task at 2 AM on Sunday when the upstream data feed is degraded, the API is throttled, and the human reviewer is asleep?

Those are different questions. The benchmark answers one. The other one decides whether your AI program ships or fails.

The capability bottleneck is gone. The readiness bottleneck just became the only bottleneck that matters.

Tuesday: 86 to 89% of pilots fail. The four reasons. All fixable.

Industry research published this month confirmed what 28 years in production AI has taught me. Agent pilots fail in predictable ways. The fixes are known. Almost nobody is applying them.

Failure mode 1: Governance breakdowns

The pilot worked. The team wants to scale. The compliance team has not seen the system yet. Six weeks of compliance review later, the pilot has lost momentum, the team has shifted to other priorities, and the agent is sitting in staging.

Fix: Compliance starts at week zero, not week sixteen. If your AI program treats compliance as a release gate, you have already lost.

Failure mode 2: Evaluation infrastructure gaps

The pilot demonstrated 84% accuracy on a curated test set. In production, the team cannot tell whether the agent is performing better or worse than baseline because they never built the evaluation framework.

Fix: Build the evaluation infrastructure before the agent. This is what G-ARVIS exists to do. Nine dimensions built from production failure, not academic theory.

Failure mode 3: Integration complexity
Integration and governance consume up to 60% of AI agent project budgets. Most teams plan for the model and underinvest in everything around it.

Fix: Plan a 60% integration budget from day one. If the team budgeted 80% for the model and 20% for integration, the project is going to overrun before it ships.

Failure mode 4: Accountability gaps
When the agent is wrong, nobody knows whose problem it is. The system fails in the gap between teams.

Fix: Assign one accountable human per agent before deployment. The work belongs to a name, not a function.

The 86 to 89% failure rate is not happening because the technology does not work. It is happening because organizations are deploying capability without the foundation to support it.

Wednesday: A2A and MCP crossed 150 production deployments. The architecture conversation just shifted.

Three months ago the question was: which orchestration framework should we use?

Today the question is: do our agents speak the right protocols?

Two protocols are emerging as the foundation of multi-agent systems in 2026.

MCP (Model Context Protocol) handles vertical connectivity. Agent to tool. Agent to data source. Agent to API.

A2A (Agent to Agent) handles horizontal connectivity. Direct peer to peer delegation between agents.

Together they replace the brittle custom integration code that has been the failure mode of multi-agent systems for the past three years.

This is the Kubernetes moment for agentic AI.

The pattern looks exactly like what happened to microservices ten years ago. Custom service discovery, custom load balancing, custom health checks. Then Kubernetes standardized all of it. The organizations that built on the standardized layer were able to scale. The ones that built proprietary versions had to rewrite their infrastructure.

Vendor lock in just changed shape too. Three years ago you locked in by choosing a model. Eighteen months ago you locked in by choosing an orchestration framework. In 2026, the lock in is at the protocol layer. Organizations that build on standardized protocols can swap models, frameworks, even vendors with bounded engineering effort.

ARGUS now supports both A2A and MCP natively. Every tool call through MCP gets logged with full audit trail. Every agent to agent message through A2A gets traced with sender, recipient, timestamp, and payload hash.

Thursday: Financial AI just had its inflection point.

Apoorva Mehta launched Abundance, a hedge fund designed to have AI agents run the entire fund with $100M in seed funding. JPMorgan's LLM Suite is automating 360,000 manual hours annually with 83% faster research cycles.

Financial services AI just crossed a threshold most other industries have not faced yet.

When AI agents are managing money, every decision is not just one inference. It is a chain of reasoning across multiple agents that has to be reconstructable when the SEC asks.

For an agent to participate in a regulated financial workflow, every decision must be:

Reconstructable months after the fact
Attributable to specific data sources at specific timestamps
Explainable in language the regulator can evaluate
Reviewable by a human with override authority

If your agent infrastructure does not support all four, the agent cannot ship into a regulated financial environment.

This is exactly the gap ARGUS is built to close. Every agent decision logged with input hash, output hash, model version, and tool calls. Full reasoning trace across multi-agent workflows. Time stamped audit log that can be replayed against the original data state.
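
For readers who want to see what such a record can look like, here is a generic sketch built on the fields named above. It does not use or reproduce the ARGUS API; `DecisionRecord`, `record_decision`, and `verify_replay` are hypothetical names, and storing hashes rather than raw text is an assumption made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence, Tuple


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DecisionRecord:
    agent_id: str
    model_version: str
    input_hash: str
    output_hash: str
    tool_calls: Tuple[str, ...]
    timestamp: str  # ISO 8601, UTC


def record_decision(agent_id: str, model_version: str, prompt: str, output: str,
                    tool_calls: Sequence[str] = (),
                    timestamp: Optional[str] = None) -> DecisionRecord:
    return DecisionRecord(
        agent_id=agent_id,
        model_version=model_version,
        input_hash=sha256_hex(prompt),
        output_hash=sha256_hex(output),
        tool_calls=tuple(tool_calls),
        timestamp=timestamp or datetime.now(timezone.utc).isoformat(),
    )


def verify_replay(record: DecisionRecord, prompt: str, output: str) -> bool:
    """Check that a replayed prompt/output pair matches what was logged."""
    return record.input_hash == sha256_hex(prompt) and record.output_hash == sha256_hex(output)


def to_json_line(record: DecisionRecord) -> str:
    """One append-only JSON line per decision."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashes let you prove a stored prompt and output are the ones the agent saw without putting sensitive text in the log itself; the raw data still has to be retained somewhere for the replay to be possible.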

Friday synthesis: The Ambry Genetics migration story.

We migrated a clinical genomics AI platform from MySQL to Vitess at Ambry Genetics. 99.97% uptime. Zero clinical data loss. 8 month migration during which the AI was making real recommendations for real patients.

The migration could have happened faster. We chose to optimize for safety, not speed.

What that taught me about AI in regulated environments: the model is the least constrained part of the system. Infrastructure, data governance, compliance requirements, and clinical validation processes are the actual engineering challenges.

Every healthcare AI implementation I have seen fail has failed at infrastructure or governance, not at model accuracy.

If you are deploying AI in healthcare, energy, or financial services, your constraint set looks more like that migration than like a benchmark optimization problem.

The Ambharii Labs platform suite

This week marks three weeks since GenomixIQ and ARIA RCM launched. Health system inquiries on FHIR R4 interoperability are validating the architectural decisions made years before launch.

AI Aether (ambharii.com/tools)
Free enterprise AI readiness assessment. 9 dimensions on the G-ARVIS framework. Board ready roadmap. 30 minutes.

ARGUS (github.com/anilatambharii/argus)
Autonomous LLM correction and agent monitoring. Now native to A2A and MCP protocols. Open source. PyPI: pip install argus-ai

GenomixIQ (genomixiq.com)
12-agent molecular mesh for genomic variant interpretation. FHIR R4 from day one. Variant Intelligence Score. Population stratified evaluation.

ARIA RCM (anil@ambharii.com)
11-agent healthcare revenue cycle platform. Three viable acquisition paths: Oracle Health, Microsoft Nuance, NVIDIA Healthcare.

One shared architecture. G-ARVIS observability across all four. ARGUS self correction built into every agent. Production grade from day one.

The week in one sentence

The agents work at scale. Most organizations are not yet ready to deploy them safely. That gap is the business opportunity of 2026.

If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions worth sitting with this weekend are the same five I ask in every program kickoff.

What does failure look like and who does it hurt?
Who is accountable when the agent is wrong?
How does the agent know what it does not know?
What is the kill switch and who can pull it?
What does the audit trail look like nine months from now?

If your team can answer all five with specifics, you are positioned for the 11 to 14% that will succeed.

If they cannot, the foundation work is ahead of any deployment work.

About the author

Anil Prasad is Head of Engineering and Product at Duke Energy and Founder of Ambharii Labs. He serves as an AI Factory Builder at BCG and co-founded the CDAIO Circle Tri-State Chapter. He has 28 years of production AI experience across Fortune 100 companies including R1 RCM, Ambry Genetics, UnitedHealth Group, Medtronic, and Accenture. He was recognized as one of the Top 100 Most Influential AI Leaders USA 2024 and holds degrees from Stanford and BITS Pilani.

ambharii.com | linkedin.com/in/anilsprasad | @anilsprasad on X | anilsprasad.substack.com

Subscribe to Field Notes: Production AI for weekly insights from 28 years building AI in regulated environments. No benchmarks. No hype. Real deployments, real failure modes, and the infrastructure decisions that distinguish production AI from demo AI.

The week AI capability outpaced readiness. Again. Here is what it means in production.

Anil Prasad — Fri, 24 Apr 2026 12:47:57 +0000

Three events. One pattern.

Three significant things happened in AI this week. Claude Opus 4.7 launched. The EU AI Act moved into full enforcement. And a new arXiv paper, EviSearch, validated what I have been building around for six years: domain-specific multi-agent architectures outperform general ones in clinical settings.

Each story is real. Each story matters. And each story points to the same pattern I have watched repeat across 28 years of production AI in healthcare, energy, and financial services.

Capability accelerates faster than readiness. Every time.

Monday · April 20

Claude Opus 4.7: the benchmark is impressive. Here is the real question.

SWE-bench Pro reached 64.3%, up 10.9 points in a single version. SWE-bench Verified hit 87.6%. CursorBench reached 70%. Tool error rates dropped by two thirds. Self-verification built in at the model level. These are genuinely significant improvements.

But the question I am not seeing asked in any of the coverage: does your organization have the evaluation infrastructure to know whether this model is actually better for your specific use case?

The organizations that move confidently after a major model launch are not the ones with the most advanced AI. They are the ones with evaluation infrastructure that can answer four questions within 72 hours of a new model release.

Is this model better on our specific domain tasks? Is output variance within our acceptable range? What happens to cost-per-correct-output? Can our governance layer onboard this model without a compliance review starting from zero?

If you cannot answer all four within 72 hours, you are not evaluating the model. You are waiting for someone else to tell you whether to use it. That is a readiness infrastructure problem, not a model problem.

The self-verification feature is genuinely novel. Two thirds fewer tool errors means a system that needs much less constant human oversight. For multi-agent workflows running thousands of tool calls per day, that is the difference between a system that runs reliably overnight and one that requires a human on call. ARGUS operates the same self-correction principle at the system layer across the entire agent workflow, not just within a single inference.

Tuesday · April 21

EU AI Act: the audit trail is the most common gap. Here is how to close it.

The EU AI Act entered full enforcement in 2026. Fines up to 7% of global annual turnover. High-risk categories include healthcare AI, critical infrastructure, employment, and education technology. Those are the exact sectors I have spent 28 years building production AI for.

The five mandatory requirements for high-risk AI systems are: a risk management system maintained throughout the entire lifecycle, complete technical documentation, human oversight and intervention mechanisms, demonstrable accuracy and robustness, and a full audit trail.

At an energy enterprise, I rebuilt the entire logging layer before deploying a single agent in a live operational context. A grid operations manager asked a question I was not prepared for: "If this system makes a recommendation that causes an outage, and FERC comes knocking, can you show them exactly what the model saw, what it decided, and why?"

We could not answer that. We rebuilt. That decision delayed the launch by six weeks and saved us months of regulatory exposure eighteen months later.

ARGUS generates the full audit trail by default. Every inference logged with input hash, output hash, timestamp, and model version. Every tool call traced with actor identity and permission scope. Every human override recorded with reason and outcome. Not as a reporting feature. As the foundational observability layer. github.com/anilatambharii/argus or pip install argus-ai

Wednesday · April 22

EviSearch and the domain-specific agent case: specificity is the moat.

A paper published this week on arXiv described EviSearch, a multi-agent system that automates the creation of clinical evidence tables from medical literature using a specialized architecture. The finding was exactly what I have seen in every clinical AI program I have run: domain-specific agent architectures outperform general-purpose ones in technical domains, typically by 15 to 25 percentage points on domain-relevant evaluation tasks.

Why the gap exists: A general-purpose agent reasons about yo

This is why GenomixIQ uses 12 specialized agents rather than one large general agent. The literature agent understands how to evaluate evidence in population genetics. The ACMG criteria agent knows all 28 classification criteria and the interaction rules between them. The conflict resolution agent knows which database takes precedence when population databases disagree. None of that is prompt engineering. All of it is architectural encoding of domain expertise.

The EviSearch paper also documented that multi-agent systems for clinical evidence work show inter-run variability below 5%, compared to 15 to 30% for human reviewers on complex evidence tables. Consistency in clinical decision support is not a nice-to-have. It is the compliance requirement.

Thursday · April 23

G-ARVIS: the nine dimensions most AI teams are not measuring.

I built the G-ARVIS framework from production failure across 28 years in regulated environments. Nine dimensions. Not from academic theory. From watching accurate models fail catastrophically because nobody was measuring the right things.

The six core dimensions: Groundedness (anchored to verifiable facts), Accuracy (correct output consistently), Reliability (stable at scale across thousands of runs), Variance (output stability on the same prompt across runs), Inference Cost (cost per correct output, not cost per token), Safety (harm profile specific to this domain, this use case, this failure mode).

Three agentic metrics I added specifically for multi-agent production systems: Action Sequence Fidelity (percentage of multi-step workflows completing without human intervention), Error Recovery Rate (when an agent fails, how often does the system recover without escalation), and Cost Per Correct Sequence (total inference cost divided by the number of complete sequences producing a validated correct output).
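
The three agentic metrics are plain ratios, so they are easy to compute from run logs. A sketch follows, assuming each workflow run is summarized by the hypothetical `WorkflowRun` fields below; the field names are mine, the definitions follow the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class WorkflowRun:
    completed: bool           # the multi-step workflow reached its end state
    human_intervened: bool    # a person had to step in at any point
    agent_failures: int       # individual agent failures during the run
    recovered_failures: int   # failures the system recovered from without escalation
    output_validated: bool    # final output was checked and found correct
    cost_usd: float           # total inference cost for the run


def agentic_metrics(runs: Iterable[WorkflowRun]) -> Dict[str, Optional[float]]:
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    autonomous = sum(1 for r in runs if r.completed and not r.human_intervened)
    failures = sum(r.agent_failures for r in runs)
    recovered = sum(r.recovered_failures for r in runs)
    correct = sum(1 for r in runs if r.completed and r.output_validated)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "action_sequence_fidelity": autonomous / len(runs),
        "error_recovery_rate": recovered / failures if failures else None,
        "cost_per_correct_sequence": total_cost / correct if correct else None,
    }
```

Note that cost per correct sequence divides the cost of all runs, including failed ones, by the number of correct runs, so failed runs make every success more expensive.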

All nine are assessed in AI Aether. 73% of organizations score below 12 out of 30 on data architecture alone. The foundation problem has not changed in 28 years. Only the model on top of it has. ambharii.com/tools

The Ambharii Labs Platform

Four platforms. One shared architecture.

This week marks two weeks since GenomixIQ and ARIA RCM launched, with ARGUS SDK updates shipping and AI Aether continuing to show the same pattern: 73% of organizations score below 12/30 on data architecture. The foundation problem precedes every other problem.

The Week in One Sentence

AI shipped faster than most organizations can absorb it. The gap between capability and readiness is the business opportunity of 2026.

If you are building AI in healthcare, energy, finance, or any domain where being wrong has real consequences, the questions I am asking every week are the same questions you should be asking: What does your AI actually do at 2 AM? Who sees the audit trail? What happens when the model is wrong in a way it has never been wrong before?

The answers to those questions are what distinguishes production AI from demo AI. That distinction is what 28 years in this field teaches you.

I Built the First Agentic AI Platform for Clinical Genomics. Here Is the Full Architecture

Anil Prasad — Sat, 18 Apr 2026 11:21:54 +0000

TL;DR — GenomixIQ is a 12-agent autonomous AI platform for clinical genomics. It classifies genetic variants in 8 seconds with zero hallucinations — enforced at the architecture level, not by prompting. FHIR R4 native. Any-cloud deploy. API live at api.genomixiq.com/docs. First platform of its kind. Integration and acquisition ready.

The Problem Nobody Had Solved

Walk into any clinical genetics lab today. Watch what happens when a variant comes in.

A molecular pathologist opens ClinVar. The variant is Pathogenic. Has been for 8 years. 47 supporting submissions. The pathologist reads the evidence, applies ACMG/AMP criteria, writes the interpretation, runs QC, produces the report.

90 minutes. For a deterministic computation.

Every lab. Every day. Across thousands of variants.

The data to solve this exists:

ClinVar: 3 million+ variant interpretations
gnomAD: population frequencies for 4.2 billion variants
PubMed: decades of functional studies
OncoKB: therapeutic implications for hundreds of somatic alterations
CPIC: 300+ drug-gene dosing pairs

The problem was never the data. It was orchestration, integration, and trust. Nobody had assembled these sources into a production-grade agent mesh with a technically enforced safety framework and native EHR output.

I built it. It is called GenomixIQ.

Architecture: The Molecular Agent Mesh

GenomixIQ uses a Molecular Agent Mesh — 12 specialized autonomous agents running in parallel, each owning a distinct clinical genomics reasoning task, coordinated by a master orchestrator.

MasterOrchestratorAgent
├── Agent 01: VariantClassifierAgent    → ACMG/AMP, ClinVar, gnomAD
├── Agent 02: ClinicalReporterAgent     → FHIR R4 DiagnosticReport
├── Agent 03: TrialMatcherAgent         → ClinicalTrials.gov live
├── Agent 04: DrugDiscoveryAgent        → ADMET, ChEMBL, AlphaFold
├── Agent 05: PGxAgent                  → 300+ CPIC pairs, diplotyping
├── Agent 06: SomaticOncologyAgent      → TMB, MSI-H, OncoKB
├── Agent 07: RareDiseaseAgent          → Trio analysis, de novo
├── Agent 08: HeredCancerAgent          → BRCA1/2, Lynch, 80+ genes
├── Agent 09: SafetyGateAgent           → G-ARVIS hard block (1.00)
├── Agent 10: CitationVerifierAgent     → PubMed, ClinVar live check
├── Agent 11: EHRIntegratorAgent        → Epic SMART, Cerner FHIR
└── Agent 12: QualityScorerAgent        → VIS, ACMG attestation

Why 12 agents instead of one model call?

Three reasons:

1. Reasoning task decomposition. Variant classification, pharmacogenomic interaction analysis, somatic therapy matching, and FHIR report generation are four distinct reasoning tasks requiring different knowledge bases, different validation logic, and different confidence thresholds. A single model call cannot hold this complexity reliably.

2. Internal error correction. If one agent returns a hallucinated citation, the Citation Verifier catches it before it reaches the Clinical Reporter. Single-model architectures have no internal correction loop.

3. Quality attestation per reasoning unit. G-ARVIS scores each agent output independently. A single confidence score on the final output tells you nothing about where in the reasoning chain the uncertainty lives.
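The parallel fan-out behind these three reasons can be sketched with asyncio.gather. This is a minimal illustration under stated assumptions, not GenomixIQ's orchestrator: the agent names come from the mesh diagram, while `AgentResult`, `run_agent`, `orchestrate`, and the fixed score are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    output: str
    score: float  # hypothetical per-agent quality score in [0, 1]


async def run_agent(name: str, variant: str) -> AgentResult:
    # Stand-in for a real agent call (LLM plus knowledge-base lookups).
    await asyncio.sleep(0)
    return AgentResult(agent=name, output=f"{name} processed {variant}", score=0.9)


async def orchestrate(variant: str, agents: list[str]) -> dict[str, AgentResult]:
    # Fan out: every agent works on the same variant concurrently.
    results = await asyncio.gather(*(run_agent(a, variant) for a in agents))
    # Key results by agent so each output can be scored and audited on its own.
    return {r.agent: r for r in results}


results = asyncio.run(
    orchestrate(
        "BRCA1 c.5266dupC",
        ["VariantClassifierAgent", "PGxAgent", "TrialMatcherAgent"],
    )
)
```

Keeping one result per agent is what makes per-unit attestation possible: a low score can be traced to the agent that produced it rather than to the final report as a whole.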

The Safety Gate: Zero Hallucinations — Technically Enforced

This is the differentiator that no competitor has built.

Most "AI genomics" tools add instructions like "do not hallucinate" to their system prompts and call it a safety framework.

That is not safety. That is a request.

GenomixIQ's Safety Gate (Agent 09) is an architectural block:

class SafetyGateAgent:
    """
    G-ARVIS Safety Gate — hard binary enforcement.
    No Pathogenic classification reaches ClinicalReporterAgent
    without a verified citation from CitationVerifierAgent.
    Not a warning. A block.
    """

    async def enforce(self, classification_result: ClassificationResult) -> GateDecision:
        if classification_result.acmg_class in [
            ACMGClass.PATHOGENIC, 
            ACMGClass.LIKELY_PATHOGENIC
        ]:
            citation_verified = await self.citation_verifier.verify(
                evidence=classification_result.evidence_statements,
                sources=["clinvar", "pubmed", "omim"]
            )

            if not citation_verified:
                # ARGUS runs up to 3 correction iterations
                corrected = await self.argus.correct(classification_result)
                if not corrected.citation_verified:
                    # Route for human review — never release unverified
                    return GateDecision.ROUTE_TO_HUMAN_REVIEW

        return GateDecision.APPROVED

If a Pathogenic or Likely Pathogenic classification still cannot be verified after up to 3 ARGUS correction iterations, it routes to human review. It never reaches a clinician unverified. This is the G-ARVIS Safety dimension: a hard binary that must score 1.00.

G-ARVIS: The Quality Framework

Every GenomixIQ output is scored across 6 dimensions before release:

Dimension	Score	What It Measures
Groundedness	0.93	Citation coverage ratio vs ClinVar/PubMed/gnomAD
Accuracy	0.90	Match rate vs CAP-accredited lab gold standard
Reliability	0.92	Classification consistency across equivalent inputs
Variance	0.86	Stability under input perturbation
Inference Cost	0.89	Token efficiency per clinical decision unit
Safety	1.00	Hard binary — verified citation required

Composite: 0.937. Clinical grade threshold: 0.90. ✓ Passed.

G-ARVIS is not a post-hoc confidence score. It is a pre-release gate wired into the architecture. Outputs below threshold trigger ARGUS autonomous correction (max 3 iterations). Outputs that fail Safety route to human review.
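The release rule in the paragraph above can be written as a small decision function. This is a sketch assuming only what the text states (a 0.90 clinical-grade threshold and a binary Safety score); `Decision` and `release_decision` are hypothetical names, not part of the G-ARVIS engine.

```python
from enum import Enum


class Decision(Enum):
    RELEASE = "release"
    ARGUS_CORRECTION = "argus_correction"
    HUMAN_REVIEW = "human_review"


def release_decision(composite: float, safety: float, threshold: float = 0.90) -> Decision:
    # Safety is binary: anything short of 1.00 goes to a human reviewer.
    if safety < 1.0:
        return Decision.HUMAN_REVIEW
    # Below the clinical-grade threshold: hand off to the correction loop.
    if composite < threshold:
        return Decision.ARGUS_CORRECTION
    return Decision.RELEASE
```

Checking Safety first means a high composite can never mask a safety failure.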

The ARGUS Autonomous Correction Engine

ARGUS (Autonomous Reasoning and Guided Update System) runs inside GenomixIQ as the self-healing layer:

class ARGUSEngine:
    MAX_ITERATIONS = 3

    async def correct(
        self, 
        output: AgentOutput, 
        failing_dimension: GARVISDimension
    ) -> CorrectionResult:

        for iteration in range(self.MAX_ITERATIONS):
            reflection = await self.reflect(output, failing_dimension)
            refined = await self.refine(output, reflection)
            score = await self.garvis.score(refined)

            if score[failing_dimension] >= score.threshold:
                return CorrectionResult(
                    output=refined,
                    iterations=iteration + 1,
                    recovered=True
                )

        return CorrectionResult(recovered=False, route_to_human=True)

Production metrics:

Error Recovery Rate: 87.3%
Average iterations to recovery: 1.4
Human review routing rate: 12.7%

FHIR R4 Output — Native, Not Bolted On

This is the second thing that separates GenomixIQ from every other clinical AI tool.

EHR integration is not a Phase 2 roadmap item. GenomixIQ produces FHIR R4 DiagnosticReport output natively from Agent 02.

class ClinicalReporterAgent:

    async def generate(
        self, 
        classification: VerifiedClassification
    ) -> FHIRDiagnosticReport:

        return FHIRDiagnosticReport(
            resourceType="DiagnosticReport",
            status="final",
            code=CodeableConcept(
                coding=[Coding(
                    system="http://loinc.org",
                    code="81247-9",
                    display="Master HL7 genetic variant reporting panel"
                )]
            ),
            result=[
                self._build_variant_observation(classification),
                self._build_acmg_observation(classification),
                self._build_garvis_attestation(classification),
            ],
            conclusion=classification.clinical_interpretation,
            conclusionCode=[
                CodeableConcept(coding=[
                    Coding(
                        system="http://loinc.org",
                        code=classification.acmg_loinc_code
                    )
                ])
            ]
        )

Ready for Epic SMART on FHIR and Cerner Millennium. Zero custom integration work.
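For readers who have not worked with FHIR, the wire format is plain JSON. The sketch below builds a minimal R4 DiagnosticReport as a dictionary and checks only that the required top-level elements are present. It is not GenomixIQ output: the conclusion text is a placeholder and `missing_required` is a hypothetical helper, not a full FHIR validator.

```python
import json

# Minimal FHIR R4 DiagnosticReport as plain JSON. In R4, `status` and `code`
# are the required elements of this resource.
report = {
    "resourceType": "DiagnosticReport",
    "status": "final",
    "code": {
        "coding": [
            {
                "system": "http://loinc.org",
                "code": "81247-9",
                "display": "Master HL7 genetic variant reporting panel",
            }
        ]
    },
    "conclusion": "Placeholder interpretation text",
}


def missing_required(resource: dict) -> list[str]:
    # Presence check only; a real integration would validate against the
    # full FHIR profile.
    return [key for key in ("resourceType", "status", "code") if key not in resource]


payload = json.dumps(report, indent=2)
```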

Clinical Coverage — 5 Domains

Hereditary Cancer

BRCA1/2, Lynch syndrome (MLH1, MSH2, MSH6, PMS2, EPCAM), Li-Fraumeni (TP53), Cowden (PTEN), hereditary diffuse gastric cancer (CDH1), 80+ hereditary cancer genes with syndrome-specific ACMG logic.

Pharmacogenomics

300+ CPIC Level A/B drug-gene pairs. CYP2D6, CYP2C19, CYP2C9, DPYD, TPMT, SLCO1B1, G6PD, NUDT15. Full diplotype calling with population-adjusted allele frequencies.

Somatic Oncology

TMB calculation, MSI-H assessment, OncoKB therapeutic implication mapping, pan-cancer actionability scoring, FDA-approved and investigational therapy matching.

Rare Disease

Whole exome/genome trio analysis, de novo variant prioritization, pedigree reconstruction from VCF, HPO phenotype-to-gene matching, OMIM disorder linkage.

Drug Discovery

Target-disease association scoring, ADMET prediction, lead compound optimization, AlphaFold-integrated structural impact analysis for variant functional characterization.

Tech Stack

ai_agents:
  llm: Anthropic Claude
  routing: Opus (STAT) → Sonnet (standard) → Haiku (QC)
  quality: G-ARVIS engine
  correction: ARGUS-AI (max 3 iterations)
  orchestration: LangChain + LlamaIndex
  vector_db:
    engine: Qdrant (7 collections, 1536-dim, tenant-scoped)
    collections:
      - ClinVar
      - gnomAD
      - OMIM
      - PubMed
      - PharmGKB
      - OncoKB
      - ChEMBL

backend:
  framework: FastAPI
  language: Python 3.11
  orm: SQLAlchemy + Alembic
  validation: Pydantic v2
  auth: Keycloak RBAC + ABAC + JWT

data:
  primary_db: PostgreSQL 16 (RLS, 500-tenant capacity)
  cache: Redis 7
  streaming: Apache Kafka
  analytics: ClickHouse (immutable audit log)
  object_store: Delta Lake / S3 (BAM/VCF/FASTQ/CRAM)

bioinformatics:
  variant_calling: GATK + DeepVariant
  annotation: ANNOVAR + VEP + Ensembl
  structure: AlphaFold API

frontend:
  framework: React 18 + TypeScript
  styling: Tailwind CSS
  visualization: D3.js + IGV.js + Recharts

mlops:
  registry: MLflow
  drift: Evidently
  monitoring: Prometheus + Grafana
  tracing: OpenTelemetry

infrastructure:
  containers: Docker + Kubernetes + Helm
  iac: Terraform
  gitops: ArgoCD
  ci_cd: GitHub Actions + SonarQube + Trivy
  deployment: AWS / Azure / GCP / Oracle Cloud / On-prem
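The routing line in the stack (Opus for STAT, Sonnet for standard, Haiku for QC) amounts to a priority-to-tier lookup. A minimal sketch, assuming three priority labels; the tier names here are labels rather than real model IDs, and the fallback to the standard tier is an assumption, not documented behavior.

```python
# Priority label -> model tier. Labels mirror the stack listing above.
ROUTING = {"stat": "opus", "standard": "sonnet", "qc": "haiku"}


def pick_model(priority: str) -> str:
    # Unknown priorities fall back to the standard tier (an assumption).
    return ROUTING.get(priority.lower(), ROUTING["standard"])
```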

Integration Architecture

GenomixIQ is built integration-ready from day one:

External Systems          GenomixIQ API            Internal Agents
─────────────────         ──────────────           ───────────────
Epic SMART on FHIR ──→   /api/v1/variants  ──→   VariantClassifier
Cerner Millennium  ──→   /api/v1/reports   ──→   ClinicalReporter
Lab LIS            ──→   /api/v1/pgx       ──→   PGxAgent
Research Portal    ──→   /api/v1/trials    ──→   TrialMatcher
Pharma Pipeline    ──→   /api/v1/targets   ──→   DrugDiscovery
                         /api/v1/quality   ──→   QualityScorer
                         /api/v1/batch     ──→   All agents (parallel)

API is publicly testable today: api.genomixiq.com/docs

Try It Right Now

# Test the health check
curl https://api.genomixiq.com/health

# Submit a variant for classification
curl -X POST https://api.genomixiq.com/api/v1/variants \
  -H "Content-Type: application/json" \
  -d '{
    "variant_id": "VCV000017694",
    "gene": "BRCA1",
    "transcript": "NM_007294.4",
    "hgvs_c": "c.5266dupC",
    "genome_build": "GRCh38"
  }'

# Response includes:
# - ACMG classification with criteria applied
# - Variant Intelligence Score (VIS)
# - G-ARVIS quality attestation
# - Verified citations
# - FHIR R4 DiagnosticReport
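The same variant submission can be built from Python with the standard library alone. This sketch only constructs the request; the endpoint and payload fields are copied from the curl example, and `build_variant_request` is a hypothetical helper. The response schema is not shown because the source does not document it.

```python
import json
import urllib.request

API_BASE = "https://api.genomixiq.com"  # host taken from the curl example


def build_variant_request(variant: dict) -> urllib.request.Request:
    # Builds the POST request without sending it.
    return urllib.request.Request(
        url=f"{API_BASE}/api/v1/variants",
        data=json.dumps(variant).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_variant_request(
    {
        "variant_id": "VCV000017694",
        "gene": "BRCA1",
        "transcript": "NM_007294.4",
        "hgvs_c": "c.5266dupC",
        "genome_build": "GRCh38",
    }
)
# To send it: urllib.request.urlopen(req, timeout=30)
```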

Terraform modules included for AWS, Azure, GCP, Oracle Cloud, and on-premises Kubernetes.

What This Unlocks for Integration Partners

For EHR vendors (Epic, Oracle Health, Microsoft Nuance, Veeva):
FHIR R4 native output. SMART on FHIR compatible. Plug into your existing genomics module with zero custom integration. G-ARVIS attestation on every result for clinical defensibility.

For health system IT teams:
On-premises Kubernetes deploy. HIPAA-ready architecture with PHI tokenization before any LLM prompt. Full audit trail in ClickHouse. Row-level security across 500-tenant capacity. SOC 2 Type II controls in the pipeline.

For pharma R&D platforms:
Drug Discovery Agent with ChEMBL, DrugBank, and AlphaFold API integration. Target validation, ADMET prediction, lead optimization. REST API + batch endpoints for pipeline integration.

For genomics lab software (Sophia Genetics, Illumina DRAGEN, Fabric Genomics):
VCF ingestion endpoint. Batch classification API. ACMG classification output with full evidence trace. Direct integration point for existing lab workflows.

Benchmarks vs Current Standard of Care

Metric	Manual (current)	GenomixIQ
Time per variant	60–90 min	8 seconds
Cost per variant	$12–$18	$0.023
Citation coverage	Pathologist memory	100% verified
Hallucination rate	N/A (human)	0.00 (hard gate)
FHIR output	Manual entry	Native
Throughput	~6/hour/pathologist	450+/hour
G-ARVIS score	N/A	0.937
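The throughput and ratio figures follow from the per-variant numbers in the table. The arithmetic below is a quick check, assuming a single pipeline that processes variants one after another; it uses only values stated in the table.

```python
SECONDS_PER_VARIANT = 8

# One variant every 8 seconds, processed sequentially.
throughput_per_hour = 3600 / SECONDS_PER_VARIANT

# Speedup against the 60-90 minute manual range.
speedup_low = (60 * 60) / SECONDS_PER_VARIANT
speedup_high = (90 * 60) / SECONDS_PER_VARIANT

# Cost ratio against the $12-$18 manual range.
cost_ratio_low = 12 / 0.023
cost_ratio_high = 18 / 0.023

print(f"{throughput_per_hour:.0f} variants/hour")
print(f"{speedup_low:.0f}x to {speedup_high:.0f}x faster")
print(f"{cost_ratio_low:.0f}x to {cost_ratio_high:.0f}x cheaper")
```

The 450 per hour result matches the table's throughput row, which suggests that figure describes one sequential pipeline rather than parallel batch capacity.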

What Comes Next

Whole genome population-scale interpretation with federated learning
Direct ClinVar submission pipeline for novel variant classifications
Somatic liquid biopsy with ctDNA quantification
Multi-modal proteomics and epigenomics integration
CAP/CLIA audit package for regulatory inspection readiness

The Acquisition Conversation

GenomixIQ is the first platform to combine:

Production-grade 12-agent clinical genomics mesh
Technically enforced zero-hallucination safety gate
FHIR R4 native output (not a roadmap item)
G-ARVIS — the first AI quality standard for clinical genomics
Any-cloud + on-prem single-command deployment
Live public API with Swagger documentation

Strategic fits include Oracle Health (Cerner), Microsoft (Nuance), NVIDIA (Clara), Illumina, Tempus AI, Sophia Genetics, and health system genomics programs building precision medicine infrastructure.

If you are building in this space — as an engineer, a partner, or an acquirer — the conversation is open.

I Built 11 Autonomous Agents for Healthcare Revenue Cycle. Here Is the Full Architecture.

Anil Prasad — Fri, 10 Apr 2026 19:36:57 +0000

How I Built a Self-Correcting Multi-Agent System for Healthcare — and Why Standard ML Metrics Failed Me

Tags: ai, python, healthcare, machinelearning


I have been building production AI systems for 28 years. At UnitedHealth Group I ran a 20,000-node Big Data Platform. At R1 RCM I was inside the $4.1B Cloudmed acquisition. At Duke Energy I run AI and product engineering for critical infrastructure.

None of that experience prepared me for the specific engineering problem of building a reliable multi-agent system for healthcare revenue cycle management.

This post is about what I learned, what broke, and what I had to invent to make it work. The code is real. The numbers are from production.

The problem with agentic systems in regulated environments

Most agentic system tutorials show you a single agent calling a few tools and returning a result. That is fine for demos. It is not fine when the agent is making claims submission decisions on a $300M annual revenue stream for a hospital system.

The core issues I ran into, in order of how badly they burned me:

1. LLMs are not deterministic enough for sequential RCM workflows

Give the same clinical note to the same model twice and you will get subtly different ICD-10 code recommendations. In a classification task that is fine — you measure accuracy across a test set. In an agent that is making 14 sequential decisions across a claims workflow, small inconsistencies compound. A slightly different coding recommendation in step 3 changes the prior authorization requirement in step 5, which changes the denial probability score in step 8.
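One way to quantify this is to run the same input several times and measure how often the model agrees with itself. A minimal sketch: `agreement_rate` is a hypothetical helper, the ICD-10 outputs are made up for illustration, and the 14-step figure assumes each step is independent and equally consistent, which real workflows will only approximate.

```python
from collections import Counter


def agreement_rate(outputs: list[str]) -> float:
    # Share of runs that returned the most common answer.
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Five runs of the same clinical note (illustrative values only).
runs = ["E11.9", "E11.9", "E11.65", "E11.9", "E11.9"]
rate = agreement_rate(runs)

# Chance that all 14 sequential decisions land on their modal answer,
# assuming independent steps with the same agreement rate.
chain_consistency = rate ** 14
```

Even a per-step agreement rate that looks tolerable in isolation leaves very little chance that a 14-step workflow follows the same path twice.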

2. Standard metrics do not capture agentic failure modes

Precision and recall tell you nothing about whether the agent followed the right path to get to a correct answer. An agent that approves the right claim after six wrong turns is not a success — it is a future liability. I needed metrics that measured the sequential behavior of the agent across a workflow, not just the final output.

3. PHI in prompts is a HIPAA violation waiting to happen

This one is obvious in theory and surprisingly hard in practice. The moment you build a multi-agent system where context is passed between agents, you have to be extremely deliberate about what is in that context. A naive implementation will leak PHI into prompt context within the first week of real data.

4. There was no observability framework built for agents

Datadog, Arize, WhyLabs — all excellent for ML model monitoring. None of them answer the questions I needed answered: Is this agent's output grounded in the source data? Is it consistent across similar inputs? Is it recovering from failures autonomously or silently degrading?

What I built: ARIA and the frameworks around it

ARIA is a hierarchical multi-agent system: one Supervisor agent orchestrating 10 specialist agents across the full RCM workflow. I will not walk through all 11 agents here — the full architecture is in the Medium article linked at the end. What I want to focus on are the three engineering innovations that made it reliable enough for production healthcare.

Innovation 1: G-ARVIS — a 6-dimension observability framework for agents

I defined G-ARVIS to answer the specific observability questions that no existing tool addressed. Six dimensions, scored per agent execution, in real time.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GARVISScore:
    groundedness: float    # Output traceable to source data (0-1)
    accuracy: float        # Factual correctness of output (0-1)
    reliability: float     # Consistency across similar inputs (0-1)
    variance: float        # Stability under edge cases (0-1)
    inference_cost: float  # Token efficiency per correct output (0-1)
    safety: float          # PHI enforcement, HIPAA compliance (0-1)

    @property
    def composite(self) -> float:
        return (
            self.groundedness * 0.20 +
            self.accuracy     * 0.20 +
            self.reliability  * 0.18 +
            self.variance     * 0.17 +
            self.inference_cost * 0.10 +
            self.safety       * 0.15
        )

    @property
    def is_production_ready(self) -> bool:
        # Safety is a hard gate — any PHI violation fails immediately
        if self.safety < 1.0:
            return False
        return self.composite >= 0.85

The weighting is intentional. Groundedness and Accuracy carry the most weight because in healthcare, a hallucinated output is not an annoyance — it is a compliance event. Safety carries 15% but is also a hard gate: any execution that touches PHI in the prompt context fails immediately regardless of the composite score.

Why Variance is the hardest dimension to score

Variance measures output stability under edge cases — ambiguous clinical notes, incomplete payer data, conflicting authorization histories. The challenge is that you can only measure it retrospectively across a population of similar inputs. We use a sliding window of the last 200 similar executions and measure the coefficient of variation on key output fields.

import numpy as np
from collections import deque

class VarianceMonitor:
    def __init__(self, window_size: int = 200):
        self.window = deque(maxlen=window_size)

    def record(self, output_vector: list[float]):
        self.window.append(output_vector)

    def score(self) -> float:
        if len(self.window) < 10:
            return 1.0  # insufficient data, assume stable
        arr = np.array(list(self.window))
        # Coefficient of variation per output dimension; abs() keeps the
        # ratio non-negative when a dimension's mean is below zero
        cv = np.std(arr, axis=0) / (np.abs(np.mean(arr, axis=0)) + 1e-8)
        # Score: 1.0 = perfectly stable, 0.0 = completely unstable
        return float(np.clip(1.0 - np.mean(cv), 0.0, 1.0))

Current production Variance score: 91.7%. This is the dimension I am least satisfied with and where most of our active engineering effort is focused. Target is 95%+.
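To see how the coefficient-of-variation score behaves, here is a dependency-free sketch that applies the same idea to two small windows. `stability_score` is a hypothetical standalone function, not the SDK's VarianceMonitor, and the sample vectors are invented.

```python
from statistics import mean, pstdev


def stability_score(window: list[list[float]]) -> float:
    # Transpose so each column holds one output dimension across the window.
    columns = list(zip(*window))
    # Population standard deviation over the absolute mean, per dimension.
    cvs = [pstdev(col) / (abs(mean(col)) + 1e-8) for col in columns]
    # 1.0 = perfectly stable, 0.0 = completely unstable.
    return max(0.0, min(1.0, 1.0 - mean(cvs)))


stable = [[0.90, 0.50], [0.91, 0.50], [0.89, 0.50]]
noisy = [[0.90, 0.10], [0.10, 0.90], [0.50, 0.50]]

stable_score = stability_score(stable)
noisy_score = stability_score(noisy)
```

The stable window scores close to 1.0 while the noisy one drops sharply, which is the separation the sliding-window monitor relies on.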

Innovation 2: Three new agentic metrics

I defined these because ASF, ERR, and CPCS did not exist anywhere I could find, and I needed them.

Action Sequence Fidelity (ASF)

What percentage of agent execution paths match the optimal RCM workflow path? This requires defining the optimal path — which we did by analyzing 50,000 adjudicated claims and extracting the decision sequence that led to first-pass approval with minimum rework.

from difflib import SequenceMatcher

class ASFCalculator:
    def __init__(self, optimal_paths: dict[str, list[str]]):
        # optimal_paths: claim_type -> sequence of agent actions
        self.optimal_paths = optimal_paths

    def calculate(
        self,
        claim_type: str,
        actual_path: list[str]
    ) -> float:
        optimal = self.optimal_paths.get(claim_type, [])
        if not optimal:
            return 1.0  # no baseline, assume correct

        matcher = SequenceMatcher(
            None,
            optimal,
            actual_path
        )
        return matcher.ratio()

    def batch_asf(self, executions: list[dict]) -> float:
        scores = [
            self.calculate(e["claim_type"], e["path"])
            for e in executions
        ]
        return sum(scores) / len(scores) if scores else 0.0

Current production ASF: 91.4%.
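A small worked example shows what the ratio means in practice. The action names below are hypothetical; the scoring call is the same `difflib.SequenceMatcher` used by the calculator above.

```python
from difflib import SequenceMatcher

optimal = ["verify_eligibility", "assign_codes", "check_prior_auth", "submit_claim"]
actual = ["verify_eligibility", "assign_codes", "submit_claim"]  # skipped prior auth

asf = SequenceMatcher(None, optimal, actual).ratio()
# ratio() = 2 * matched_items / (len(optimal) + len(actual)) = 2 * 3 / 7
print(f"ASF: {asf:.3f}")
```

Skipping one of four steps still scores roughly 0.86, so a threshold on ASF should be chosen with the typical path length in mind: shorter paths are penalized more heavily per missing step.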

Error Recovery Rate (ERR)

When an agent encounters a failure, how often does it recover autonomously? This is straightforward to measure — you track every exception event and whether it resolved within the ARGUS correction loop or escalated to human review.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ExecutionEvent:
    execution_id: str
    agent_id: str
    timestamp: datetime
    exception_type: Optional[str]
    resolved_autonomously: Optional[bool]
    attempts: int

class ERRTracker:
    def __init__(self):
        self.events: list[ExecutionEvent] = []

    def record(self, event: ExecutionEvent):
        self.events.append(event)

    def calculate_err(
        self,
        window_hours: int = 24
    ) -> float:
        cutoff = datetime.now().timestamp() - (window_hours * 3600)
        recent = [
            e for e in self.events
            if e.timestamp.timestamp() > cutoff
            and e.exception_type is not None
        ]
        if not recent:
            return 1.0

        autonomous = sum(
            1 for e in recent
            if e.resolved_autonomously is True
        )
        return autonomous / len(recent)

Current production ERR: 87.3%.

Cost Per Correct Sequence (CPCS)

Total LLM inference cost for one complete, correct RCM workflow execution. This is your unit economics metric. If CPCS exceeds the margin on the claim being processed, the system is not profitable to operate regardless of how accurate it is.

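from dataclasses import dataclass

@dataclass
class SequenceCost:
    execution_id: str
    # model_id -> (input_tokens, output_tokens) consumed on that model
    tokens_by_model: dict[str, tuple[int, int]]
    # model_id -> (input_rate, output_rate) in USD per 1M tokens
    model_rates: dict[str, tuple[float, float]]
    was_correct: bool
    attempts: int

    def total_cost_usd(self) -> float:
        # Price each model's tokens at that model's own rates so that
        # multi-model workflows are not overcounted
        total = 0.0
        for model_id, (input_tokens, output_tokens) in self.tokens_by_model.items():
            input_rate, output_rate = self.model_rates[model_id]
            total += (input_tokens / 1_000_000) * input_rate
            total += (output_tokens / 1_000_000) * output_rate
        return total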

class CPCSCalculator:
    def __init__(self):
        self.sequences: list[SequenceCost] = []

    def record(self, seq: SequenceCost):
        self.sequences.append(seq)

    def calculate_cpcs(self) -> float:
        correct = [s for s in self.sequences if s.was_correct]
        if not correct:
            return float('inf')
        total_cost = sum(s.total_cost_usd() for s in correct)
        return total_cost / len(correct)

Current production CPCS: $0.023 per claim end-to-end.
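A worked example makes the unit economics concrete and surfaces one definitional choice: whether spend on failed sequences is charged to the correct ones. The rates and token counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical USD rates per 1M tokens, for illustration only.
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00


def sequence_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_RATE + (output_tokens / 1_000_000) * OUTPUT_RATE


# (input_tokens, output_tokens, was_correct)
runs = [(4_000, 800, True), (4_200, 900, True), (5_000, 1_200, False)]

costs = [sequence_cost(i, o) for i, o, _ in runs]
n_correct = sum(1 for *_, ok in runs if ok)

# Variant A (what calculate_cpcs above computes): only correct runs' cost counts.
cpcs_correct_only = sum(c for c, (_, _, ok) in zip(costs, runs) if ok) / n_correct

# Variant B: all spend, including failed runs, divided by correct runs.
cpcs_all_spend = sum(costs) / n_correct
```

Variant B is always at least as large as Variant A and is the more conservative number for margin analysis, since failed sequences still cost money.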

Innovation 3: ARGUS — autonomous self-correction

ARGUS is the layer that makes the system reliable enough for production. The core insight: instead of trying to make an LLM deterministically correct on the first attempt, you build a reflection loop that detects failure, analyzes the failure mode by G-ARVIS dimension, and generates a corrected prompt.

from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class CorrectionResult:
    output: Any
    score: GARVISScore
    attempts: int
    corrected: bool
    escalated: bool

class ARGUSGuard:
    def __init__(
        self,
        max_attempts: int = 3,
        target_composite: float = 0.85,
        safety_threshold: float = 1.0,  # hard gate
        domain: str = "healthcare_rcm",
        phi_safe: bool = True
    ):
        self.max_attempts = max_attempts
        self.target_composite = target_composite
        self.safety_threshold = safety_threshold
        self.domain = domain
        self.phi_safe = phi_safe

    async def execute_with_correction(
        self,
        agent_fn: Callable[..., Awaitable[Any]],
        task: dict,
        scorer: "GARVISScorer"
    ) -> CorrectionResult:

        attempt = 0
        current_task = task.copy()

        while attempt < self.max_attempts:
            output = await agent_fn(current_task)
            score = await scorer.score(output, self.domain)

            # PHI hard gate — fail immediately, do not retry
            if score.safety < self.safety_threshold:
                return CorrectionResult(
                    output=None,
                    score=score,
                    attempts=attempt + 1,
                    corrected=False,
                    escalated=True
                )

            if score.composite >= self.target_composite:
                return CorrectionResult(
                    output=output,
                    score=score,
                    attempts=attempt + 1,
                    corrected=attempt > 0,
                    escalated=False
                )

            # Score below threshold — reflect and refine
            current_task = self._reflect_and_refine(
                original_task=task,
                failed_output=output,
                score=score,
                attempt=attempt
            )
            attempt += 1

        # All attempts exhausted — escalate to human review
        return CorrectionResult(
            output=output,
            score=score,
            attempts=attempt,
            corrected=False,
            escalated=True
        )

    def _reflect_and_refine(
        self,
        original_task: dict,
        failed_output: Any,
        score: GARVISScore,
        attempt: int
    ) -> dict:
        # Identify the weakest dimension and generate
        # a dimension-specific correction signal
        weak_dims = self._weakest_dimensions(score)
        correction_prompt = self._build_correction_prompt(
            original_task,
            failed_output,
            weak_dims,
            attempt
        )
        refined = original_task.copy()
        refined["correction_context"] = correction_prompt
        refined["attempt"] = attempt + 1
        return refined

    def _weakest_dimensions(
        self,
        score: GARVISScore
    ) -> list[str]:
        dims = {
            "groundedness": score.groundedness,
            "accuracy": score.accuracy,
            "reliability": score.reliability,
            "variance": score.variance,
            "inference_cost": score.inference_cost,
        }
        # Return dimensions below 0.85, sorted weakest first
        return sorted(
            [k for k, v in dims.items() if v < 0.85],
            key=lambda k: dims[k]
        )

The _build_correction_prompt method is proprietary — that is where the domain-specific healthcare knowledge lives. But the structure above is fully open in the ARGUS SDK.
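To watch the loop run end to end without the SDK, here is a toy version with a stub agent and a stub scorer. Everything in it is a hypothetical stand-in: `toy_agent` improves only when it receives correction context, and `toy_score` returns fixed numbers in place of a real G-ARVIS composite.

```python
import asyncio


async def toy_agent(task: dict) -> str:
    # Pretends to improve once it receives correction context.
    return "grounded answer" if "correction_context" in task else "vague answer"


def toy_score(output: str) -> float:
    # Stand-in for a composite score; real scoring is far richer.
    return 0.92 if output.startswith("grounded") else 0.60


async def run_with_correction(task: dict, target: float = 0.85, max_attempts: int = 3) -> dict:
    current = dict(task)
    output = None
    for attempt in range(1, max_attempts + 1):
        output = await toy_agent(current)
        score = toy_score(output)
        if score >= target:
            return {"output": output, "attempts": attempt, "escalated": False}
        # Reflect: tell the next attempt what was weak about this one.
        current = {**task, "correction_context": f"previous score {score:.2f} below {target}"}
    # All attempts exhausted: escalate to human review.
    return {"output": output, "attempts": max_attempts, "escalated": True}


result = asyncio.run(run_with_correction({"claim_id": "demo-001"}))
print(result)
```

The first attempt scores below target, the second succeeds with correction context, and the loop returns after two attempts without escalating.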

The PHI tokenization architecture

This is the part that took the longest to get right. The requirement: agents need full clinical context to make good RCM decisions, but no PHI can appear in any LLM prompt.

import hashlib
import hmac
import re
from typing import Any

class PHITokenizer:
    # Patterns for common PHI types
    PHI_PATTERNS = {
        "MRN":   r"\bMRN[-:\s]?\d{6,10}\b",
        "DOB":   r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
        "NAME":  r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b",
        "NPI":   r"\bNPI[-:\s]?\d{10}\b",
    }

    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key
        self._token_map: dict[str, str] = {}
        self._reverse_map: dict[str, str] = {}

    def _generate_token(self, phi_value: str, phi_type: str) -> str:
        # Deterministic: same PHI always maps to same token
        raw = f"{phi_type}:{phi_value}"
        token_bytes = hmac.new(
            self.secret_key,
            raw.encode(),
            hashlib.sha256
        ).hexdigest()[:16]
        return f"[{phi_type}_TOKEN_{token_bytes.upper()}]"

    def tokenize(self, text: str) -> str:
        tokenized = text
        for phi_type, pattern in self.PHI_PATTERNS.items():
            matches = re.findall(pattern, tokenized)
            for match in matches:
                token = self._generate_token(match, phi_type)
                self._token_map[match] = token
                self._reverse_map[token] = match
                tokenized = tokenized.replace(match, token)
        return tokenized

    def rehydrate(self, tokenized_text: str) -> str:
        result = tokenized_text
        for token, phi_value in self._reverse_map.items():
            result = result.replace(token, phi_value)
        return result

    def is_phi_clean(self, text: str) -> bool:
        for pattern in self.PHI_PATTERNS.values():
            if re.search(pattern, text):
                return False
        return True

Every prompt that goes to an LLM passes through tokenize() first. Every output that gets committed to the RCM state machine passes through rehydrate() inside the secure perimeter. The is_phi_clean() check is what the G-ARVIS Safety dimension calls before every inference.

Production Safety score: 100%. Zero PHI exposure events.
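The tokenize and rehydrate round trip can be checked in isolation. This sketch handles a single PHI type (SSN) and uses a throwaway key; both are simplifications, and the function names are hypothetical rather than the SDK's API. It substitutes via `re.sub` with a callback instead of repeated `str.replace`, which is a design choice of this sketch.

```python
import hashlib
import hmac
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SECRET_KEY = b"demo-key-do-not-use"  # assumption: real keys come from a secrets manager


def token_for(value: str) -> str:
    # Deterministic: the same SSN always maps to the same token.
    digest = hmac.new(SECRET_KEY, f"SSN:{value}".encode(), hashlib.sha256).hexdigest()[:16]
    return f"[SSN_TOKEN_{digest.upper()}]"


def tokenize(text: str, reverse_map: dict[str, str]) -> str:
    def _swap(match: re.Match) -> str:
        token = token_for(match.group(0))
        reverse_map[token] = match.group(0)
        return token

    return SSN_PATTERN.sub(_swap, text)


def rehydrate(text: str, reverse_map: dict[str, str]) -> str:
    for token, value in reverse_map.items():
        text = text.replace(token, value)
    return text


reverse_map: dict[str, str] = {}
note = "Patient SSN 123-45-6789 on file; duplicate entry 123-45-6789."
safe = tokenize(note, reverse_map)
restored = rehydrate(safe, reverse_map)
```

Because tokens are deterministic, both occurrences of the SSN map to one token, so agents can still tell that the two mentions refer to the same value without seeing it.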

Install and get started

The ARGUS SDK — G-ARVIS scoring, ASF/ERR/CPCS calculators, PHITokenizer base class, and ARGUSGuard correction loop — is open-core and on PyPI.

pip install argus-ai

import asyncio

from argus_ai import ARGUSGuard, GARVISScorer, PHITokenizer
from argus_ai.metrics import ASFCalculator, ERRTracker, CPCSCalculator

# Wrap any async agent function with self-correction
guard = ARGUSGuard(
    max_attempts=3,
    target_composite=0.85,
    domain="healthcare_rcm",
    phi_safe=True
)

async def main():
    # my_denial_predictor (your async agent function) and claim_task
    # (its input dict) are placeholders for your own code
    result = await guard.execute_with_correction(
        agent_fn=my_denial_predictor,
        task=claim_task,
        scorer=GARVISScorer()
    )

    print(f"Score: {result.score.composite:.1%}")
    print(f"Attempts: {result.attempts}")
    print(f"Escalated: {result.escalated}")

asyncio.run(main())

Production results

These are from the live ARIA system, 24-hour rolling average:

Metric	Value
G-ARVIS composite	93.9%
Groundedness	96.2%
Accuracy	94.8%
Reliability	93.1%
Variance	91.7%
Inference Cost	95.3%
Safety	100%
Action Sequence Fidelity	91.4%
Error Recovery Rate	87.3%
Cost Per Correct Sequence	$0.023
Denial rate reduction	38%

What is open vs proprietary

Open (argus-ai on PyPI + GitHub):

ARGUSGuard correction loop
GARVISScorer base framework
PHITokenizer base class
ASF, ERR, CPCS calculators
PulseFlow MLOps pipeline

Proprietary (the ARIA product):

11-agent supervisor hierarchy with RCM domain specialization
Payer policy RAG with live contract updates
Predictive denial scoring model
RCM domain knowledge engine
Multi-tenant deployment infrastructure

Links

GitHub: github.com/anilatambharii/argus-ai
PyPI: pypi.org/project/argus-ai
Platform: ambharii.com/RCM
Full architecture article: medium.com/p/9d0c9f8d662a
Questions or contributions: anil@ambharii.com

If you are building agentic systems in regulated industries and running into the same observability and reliability problems — I would genuinely like to hear from you. The metrics definitions are public. Use them, improve them, tell me what is wrong with them.

Anil Prasad — Founder, Ambharii Labs · Head of Engineering & Product, Duke Energy · Top 100 AI Leaders USA 2024

#HumanWritten #ExpertiseFromField

Building Multi-Agent Systems That Don't Collapse in Production

Anil Prasad — Wed, 08 Apr 2026 18:10:33 +0000

Building Multi-Agent Systems That Don't Collapse in Production
Multi-agent AI deployments grew 327% in four months across 20,000 organizations (Databricks, 2025). Most of those deployments will fail in production. Not because the models are bad. Because the composition is broken.

This post covers three failure modes I've seen repeatedly in regulated production environments, and the engineering patterns that fix them — with real code using ARGUS, the open-source agentic observability framework I built and maintain.

The math that kills multi-agent systems first

Before architecture, do this calculation:

import math

def end_to_end_reliability(agent_reliability: float, num_agents: int) -> float:
    return math.pow(agent_reliability, num_agents)

# What most teams are actually deploying
print(end_to_end_reliability(0.85, 5))  # → 0.4437
print(end_to_end_reliability(0.90, 5))  # → 0.5905
print(end_to_end_reliability(0.97, 5))  # → 0.8587

# The target you need before orchestrating
print(end_to_end_reliability(0.99, 5))  # → 0.9510

The rule: get each single agent to 97%+ before you chain them. At 85% per agent, a five-agent chain fails more often than it succeeds, and even at 97% it still fails about one run in seven.
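The same formula can be inverted to ask what per-agent reliability a chain needs for a given end-to-end target. A minimal sketch, assuming independent failures and equal reliability per agent; the function name `required_agent_reliability` is illustrative and not part of ARGUS:

```python
def required_agent_reliability(target_end_to_end: float, num_agents: int) -> float:
    """Per-agent reliability r such that r ** num_agents equals the target."""
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return target_end_to_end ** (1.0 / num_agents)


# A 95% end-to-end target across 5 agents needs roughly 99% per agent.
print(round(required_agent_reliability(0.95, 5), 4))
# Per-agent reliability at which a 5-agent chain succeeds exactly half the time.
print(round(required_agent_reliability(0.50, 5), 4))
```

Running the inversion before choosing the number of hops makes the cost of each extra agent explicit.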

Failure mode 1: Cascade failures
(See the cascade failure trace diagram above)
Agent A produces a marginally wrong output. Agent B treats it as correct input. Agent C produces a confidently wrong conclusion. No single agent failed — the composition did.
In standard per-agent logging, this is invisible. The per-agent logs all show status: success. Only the final output reveals the failure — after it has already been acted upon.
The fix: inter-agent validation with sampled contracts
import random

from argus_ai import AgentTracer, ValidationContract

tracer = AgentTracer(workflow_id="rcm-prior-auth-v2")

class ContractViolation(Exception):
    """Raised when a validated hop violates its contract (defined here for completeness)."""

class ValidatedAgent:
    def __init__(self, agent_fn, contract: ValidationContract, sample_rate=0.15):
        self.agent = agent_fn
        self.contract = contract
        self.sample_rate = sample_rate

    def run(self, input_payload: dict, hop_id: str) -> dict:
        output = self.agent(input_payload)

        # Sample 15% of hops for deep validation
        # 100% validation on high-stakes decision points
        should_validate = (
            random.random() < self.sample_rate
            or input_payload.get("high_stakes", False)
        )

        if should_validate:
            violations = self.contract.check(output)
            tracer.record_hop(
                hop_id=hop_id,
                input=input_payload,
                output=output,
                violations=violations,
                validated=True
            )
            if violations:
                raise ContractViolation(f"hop {hop_id}: {violations}")
        else:
            tracer.record_hop(hop_id=hop_id, input=input_payload,
                              output=output, validated=False)

        return output

Key design decisions here:

15% sample rate on standard hops — cheap enough to run always, catches systematic errors fast
100% validation on high-stakes hops (financial commits, clinical decisions, compliance writes)
Every hop is recorded regardless of whether it was validated — the audit trail is unconditional
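To make the idea of a contract concrete, here is a minimal sketch of an object exposing the `check(output)` method the wrapper calls. `SimpleContract` is a hypothetical stand-in written for this example and is not the ARGUS `ValidationContract` API; the key names and the confidence threshold are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleContract:
    """Hypothetical stand-in for a validation contract (not the ARGUS API)."""
    required_keys: list[str] = field(default_factory=list)
    min_confidence: float = 0.0

    def check(self, output: dict) -> list[str]:
        """Return human-readable violations; an empty list means the output passes."""
        violations = []
        for key in self.required_keys:
            if key not in output:
                violations.append(f"missing key: {key}")
        confidence = output.get("confidence")
        if confidence is None or confidence < self.min_confidence:
            violations.append(f"confidence below {self.min_confidence}")
        return violations


contract = SimpleContract(required_keys=["claim_id", "decision"], min_confidence=0.8)
print(contract.check({"claim_id": "C-1", "decision": "approve", "confidence": 0.93}))
print(contract.check({"claim_id": "C-2", "confidence": 0.41}))
```

Returning a list of violations rather than a boolean keeps the audit trail informative: the tracer can record exactly which rule a hop broke.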

Failure mode 2: Context drift
Each agent has a finite context window. As tasks pass between agents, the original intent degrades. By agent 5, the goal may have been silently reinterpreted twice.
This is especially dangerous in regulated domains. If the original intent is a compliance requirement, a drift of even 5% of the specification can create a violation.
The fix: shared state with strict write contracts
from argus_ai import SharedStateStore, StateContract
from pydantic import BaseModel
from typing import Any
import hashlib

class WorkflowIntent(BaseModel):
    """The original goal. Immutable after creation."""
    goal_id: str
    original_prompt: str
    compliance_constraints: list[str]
    created_at: str
    checksum: str  # sha256 of original_prompt + constraints

class AgentWriteContract(BaseModel):
    """What each agent is allowed to write."""
    agent_id: str
    allowed_write_keys: list[str]
    forbidden_write_keys: list[str] = ["original_intent", "goal_id"]

store = SharedStateStore(backend="redis")

def write_with_contract(
    agent_id: str,
    key: str,
    value: Any,
    contract: AgentWriteContract
) -> None:
    # Protected keys are rejected first, then anything the agent did not declare
    if key in contract.forbidden_write_keys:
        raise PermissionError(
            f"Agent {agent_id} attempted to overwrite protected key: {key}"
        )
    if key not in contract.allowed_write_keys:
        raise PermissionError(
            f"Agent {agent_id} attempted to write undeclared key: {key}"
        )
    store.set(key, value, written_by=agent_id)
The original_intent is write-once. No agent can overwrite the goal. Each agent reads from the store at the start of its hop — it always has access to the original specification, not just what the previous agent passed.
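The `checksum` field is what lets any agent confirm that the intent it read is the one the supervisor wrote. A minimal sketch of computing and verifying it, using only the standard library; the canonical JSON encoding, the `FrozenIntent` dataclass, and the helper names are assumptions made for this example, not the ARGUS implementation:

```python
import hashlib
import json
from dataclasses import dataclass


def intent_checksum(original_prompt: str, compliance_constraints: list[str]) -> str:
    """sha256 over a canonical JSON encoding of the prompt and constraints."""
    payload = json.dumps(
        {"prompt": original_prompt, "constraints": compliance_constraints},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)  # frozen: attribute assignment raises after creation
class FrozenIntent:
    goal_id: str
    original_prompt: str
    compliance_constraints: tuple[str, ...]
    checksum: str


def verify_intent(intent: FrozenIntent) -> bool:
    """True if the stored checksum still matches the stored content."""
    expected = intent_checksum(intent.original_prompt, list(intent.compliance_constraints))
    return expected == intent.checksum


prompt = "Determine prior-auth eligibility for the claim"
constraints = ("no PHI in prompts", "payer policy v12")
intent = FrozenIntent(
    goal_id="goal-001",
    original_prompt=prompt,
    compliance_constraints=constraints,
    checksum=intent_checksum(prompt, list(constraints)),
)
print(verify_intent(intent))
```

An agent can call `verify_intent` at the start of its hop and refuse to proceed if the check fails.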

Failure mode 3: Accountability gaps
When the multi-agent workflow fails, which agent do you debug?
Without an end-to-end trace, this question is unanswerable. You have logs from five agents, all showing local success, and a broken final output. That is a crime scene with no chain of custody.
The fix: end-to-end workflow tracing with G-ARVIS scoring
import hashlib

from argus_ai import WorkflowTracer, GARVISScorer

# Initialize once per workflow run
tracer = WorkflowTracer(
    workflow_id="prior-auth-batch-20260408",
    g_arvis_dimensions=["groundedness", "accuracy", "reliability",
                        "variance", "inference_cost", "safety"]
)

# Each agent wraps its execution
with tracer.hop("parser", metadata={"model": "claude-sonnet-4-6"}) as hop:
    result = parser_agent.run(document)
    hop.record(
        input_tokens=result.input_tokens,
        output_tokens=result.output_tokens,
        confidence=result.confidence,
        output_hash=hashlib.sha256(
            str(result.output).encode()
        ).hexdigest()
    )

# After the workflow completes, the full trace is available
report = tracer.finalize()

print(report.end_to_end_success_rate)  # 0.943
print(report.weakest_hop)              # "validator", 84.2% pass rate
print(report.g_arvis_scores)           # per-dimension scores
print(report.cascade_risk_score)       # probability of undetected cascade
The cascade_risk_score is the key metric. It measures the probability that a marginal error in an early hop could propagate undetected to a confident wrong output. If this exceeds 0.15, you have a systemic observability problem regardless of individual agent quality.

Putting it together: the minimal production-ready multi-agent loop
from argus_ai import (
    AgentTracer, SharedStateStore,
    WorkflowTracer, ValidationContract
)

class SupervisorAgent:
    def __init__(self, specialists: dict, tracer: WorkflowTracer):
        self.specialists = specialists
        self.tracer = tracer
        self.store = SharedStateStore()

    def run(self, goal: str, constraints: list[str]) -> dict:
        # Write intent once; it is immutable afterwards
        intent = self.store.write_intent(goal, constraints)

        # Decompose
        subtasks = self.decompose(goal)

        results = {}
        for task_id, task in subtasks.items():
            agent = self.specialists[task.agent_type]
            contract = ValidationContract.for_task(task_id)

            with self.tracer.hop(task_id) as hop:
                # Agent always reads original intent from store
                context = {
                    "task": task,
                    "original_intent": self.store.get_intent(intent.goal_id),
                    "prior_results": results  # only pass, never overwrite
                }
                output = agent.run_with_validation(context, contract)
                results[task_id] = output
                hop.record(output)

        return self.synthesize(results, intent)

Three things this loop enforces that most implementations skip:

Every agent reads the original intent — not just what the previous agent passed
Every hop is traced unconditionally — validation is sampled, tracing is not
The supervisor synthesizes from all hop results — not just the last agent's output

Install and try it
pip install argus-ai

# Minimal smoke test
from argus_ai import AgentTracer

tracer = AgentTracer(workflow_id="test-001")

with tracer.hop("my-first-agent") as hop:
    output = {"result": "hello", "confidence": 0.94}
    hop.record(output=output, confidence=0.94)

print(tracer.finalize().summary())
Full docs and examples at github.com/anilatambharii/argus-ai.
The G-ARVIS scoring engine and SDK are fully open-source. The autonomous correction agents (self-healing workflows) are in the Pro tier.

Check your agentic readiness before you deploy
The AI Aether Platform runs a G-ARVIS-based readiness assessment across 8 dimensions — observability maturity, governance posture, agentic infrastructure, and more. Takes 10 minutes. Gives you a baseline before you commit architecture decisions that cost months to reverse.
CDAIO Circle members: use code CDAIO2026 for Pro access.

I write about production AI engineering from regulated-industry deployments (healthcare, energy, financial services). Follow for more patterns from the field.

#AgenticAI #MLOps #Python #ProductionAI #HumanWritten #ExpertiseFromField

How I Built a Self-Correcting Multi-Agent System for Healthcare - and Why Standard ML Metrics Failed Me

Anil Prasad — Tue, 07 Apr 2026 21:48:54 +0000

How I Built a Self-Correcting Multi-Agent System for Healthcare — and Why Standard ML Metrics Failed Me

Tags: ai, python, healthcare, machinelearning


None of that experience prepared me for the specific engineering problem of building a reliable multi-agent system for healthcare revenue cycle management.

This post is about what I learned, what broke, and what I had to invent to make it work. The code is real. The numbers are from production.

The problem with agentic systems in regulated environments

The core issues I ran into, in order of how badly they burned me:

1. LLMs are not deterministic enough for sequential RCM workflows

2. Standard metrics do not capture agentic failure modes

3. PHI in prompts is a HIPAA violation waiting to happen

4. There was no observability framework built for agents

What I built: ARIA and the frameworks around it

Innovation 1: G-ARVIS — a 6-dimension observability framework for agents

I defined G-ARVIS to answer the specific observability questions that no existing tool addressed. Six dimensions, scored per agent execution, in real time.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GARVISScore:
    groundedness: float    # Output traceable to source data (0-1)
    accuracy: float        # Factual correctness of output (0-1)
    reliability: float     # Consistency across similar inputs (0-1)
    variance: float        # Stability under edge cases (0-1)
    inference_cost: float  # Token efficiency per correct output (0-1)
    safety: float          # PHI enforcement, HIPAA compliance (0-1)

    @property
    def composite(self) -> float:
        return (
            self.groundedness * 0.20 +
            self.accuracy     * 0.20 +
            self.reliability  * 0.18 +
            self.variance     * 0.17 +
            self.inference_cost * 0.10 +
            self.safety       * 0.15
        )

    @property
    def is_production_ready(self) -> bool:
        # Safety is a hard gate — any PHI violation fails immediately
        if self.safety < 1.0:
            return False
        return self.composite >= 0.85
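
To see how the hard safety gate interacts with the weighted composite, here is a compact, self-contained restatement of the class above with two illustrative scores. The weights are copied from the listing; the `Score` name and the input numbers are made up for the example and are not production data.

```python
from dataclasses import dataclass

# Weights copied from the listing above; they sum to 1.0.
WEIGHTS = {
    "groundedness": 0.20, "accuracy": 0.20, "reliability": 0.18,
    "variance": 0.17, "inference_cost": 0.10, "safety": 0.15,
}


@dataclass
class Score:
    groundedness: float
    accuracy: float
    reliability: float
    variance: float
    inference_cost: float
    safety: float

    @property
    def composite(self) -> float:
        return sum(getattr(self, name) * w for name, w in WEIGHTS.items())

    @property
    def is_production_ready(self) -> bool:
        # Safety below 1.0 fails regardless of the composite.
        return self.safety >= 1.0 and self.composite >= 0.85


clean = Score(0.9, 0.9, 0.9, 0.9, 0.9, safety=1.0)
leaky = Score(0.99, 0.99, 0.99, 0.99, 0.99, safety=0.99)

print(f"{clean.composite:.3f}", clean.is_production_ready)
print(f"{leaky.composite:.3f}", leaky.is_production_ready)
```

The second score has the higher composite yet is rejected, which is the point of treating safety as a gate instead of only a weighted term.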

Why Variance is the hardest dimension to score

import numpy as np
from collections import deque

class VarianceMonitor:
    def __init__(self, window_size: int = 200):
        self.window = deque(maxlen=window_size)

    def record(self, output_vector: list[float]):
        self.window.append(output_vector)

    def score(self) -> float:
        if len(self.window) < 10:
            return 1.0  # insufficient data, assume stable
        arr = np.array(list(self.window))
        # Coefficient of variation per output dimension; abs() keeps it
        # non-negative when a dimension's mean is negative
        cv = np.std(arr, axis=0) / (np.abs(np.mean(arr, axis=0)) + 1e-8)
        # Score: 1.0 = perfectly stable, 0.0 = completely unstable
        return float(np.clip(1.0 - np.mean(cv), 0.0, 1.0))

Current production Variance score: 91.7%. This is the dimension I am least satisfied with and where most of our active engineering effort is focused. Target is 95%+.
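
A worked example helps calibrate what the score means. The sketch below applies the same formula (one minus the coefficient of variation, clipped to the unit interval) to a single output dimension using only the standard library. The function name and the sample values are illustrative, and taking the absolute value of the mean is an assumption added here to keep the ratio non-negative.

```python
from statistics import mean, pstdev


def stability_score(values: list[float]) -> float:
    """1 - coefficient of variation, clipped to [0, 1], for one output dimension."""
    # pstdev is the population standard deviation, matching np.std's default.
    cv = pstdev(values) / (abs(mean(values)) + 1e-8)
    return min(1.0, max(0.0, 1.0 - cv))


print(round(stability_score([10.0, 10.0, 10.0, 10.0]), 3))  # identical outputs
print(round(stability_score([8.0, 12.0, 8.0, 12.0]), 3))    # mean 10, std 2
print(round(stability_score([1.0, 30.0, 1.0, 30.0]), 3))    # spread close to the mean
```

Because the score is relative to the mean, a dimension whose values sit near zero will look unstable even for small absolute fluctuations, which is one reason this dimension is hard to score well.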

Innovation 2: Three new agentic metrics

I defined these because ASF, ERR, and CPCS did not exist anywhere I could find, and I needed them.

Action Sequence Fidelity (ASF)

from difflib import SequenceMatcher

class ASFCalculator:
    def __init__(self, optimal_paths: dict[str, list[str]]):
        # optimal_paths: claim_type -> sequence of agent actions
        self.optimal_paths = optimal_paths

    def calculate(
        self,
        claim_type: str,
        actual_path: list[str]
    ) -> float:
        optimal = self.optimal_paths.get(claim_type, [])
        if not optimal:
            return 1.0  # no baseline, assume correct

        matcher = SequenceMatcher(
            None,
            optimal,
            actual_path
        )
        return matcher.ratio()

    def batch_asf(self, executions: list[dict]) -> float:
        scores = [
            self.calculate(e["claim_type"], e["path"])
            for e in executions
        ]
        return sum(scores) / len(scores) if scores else 0.0

Current production ASF: 91.4%.
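
For intuition on what a given ASF value means, here is a small worked example using `difflib.SequenceMatcher` directly. The action names are hypothetical; `ratio()` is twice the number of matched elements divided by the combined length of both sequences.

```python
from difflib import SequenceMatcher

# Hypothetical action names for one claim type.
optimal = ["verify_eligibility", "code_claim", "scrub_claim", "submit"]
skipped = ["verify_eligibility", "code_claim", "submit"]  # one step skipped


def asf(optimal_path: list[str], actual_path: list[str]) -> float:
    # ratio() = 2 * matched_elements / (len(optimal_path) + len(actual_path))
    return SequenceMatcher(None, optimal_path, actual_path).ratio()


print(asf(optimal, optimal))            # identical paths
print(round(asf(optimal, skipped), 3))  # 3 matched elements out of 4 + 3
```

Skipping a single step in a four-step path already pulls the score down to roughly 0.86, so a fleet-wide ASF above 0.9 implies most executions follow the optimal path exactly.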

Error Recovery Rate (ERR)

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ExecutionEvent:
    execution_id: str
    agent_id: str
    timestamp: datetime
    exception_type: Optional[str]
    resolved_autonomously: Optional[bool]
    attempts: int

class ERRTracker:
    def __init__(self):
        self.events: list[ExecutionEvent] = []

    def record(self, event: ExecutionEvent):
        self.events.append(event)

    def calculate_err(
        self,
        window_hours: int = 24
    ) -> float:
        cutoff = datetime.now().timestamp() - (window_hours * 3600)
        recent = [
            e for e in self.events
            if e.timestamp.timestamp() > cutoff
            and e.exception_type is not None
        ]
        if not recent:
            return 1.0

        autonomous = sum(
            1 for e in recent
            if e.resolved_autonomously is True
        )
        return autonomous / len(recent)

Current production ERR: 87.3%.

Cost Per Correct Sequence (CPCS)

@dataclass
class SequenceCost:
    execution_id: str
    total_input_tokens: int
    total_output_tokens: int
    model_rates: dict  # model_id -> (input_rate, output_rate) per 1M tokens
    was_correct: bool
    attempts: int

    def total_cost_usd(self) -> float:
        input_cost = sum(
            (self.total_input_tokens / 1_000_000) * rate[0]
            for rate in self.model_rates.values()
        )
        output_cost = sum(
            (self.total_output_tokens / 1_000_000) * rate[1]
            for rate in self.model_rates.values()
        )
        return input_cost + output_cost

class CPCSCalculator:
    def __init__(self):
        self.sequences: list[SequenceCost] = []

    def record(self, seq: SequenceCost):
        self.sequences.append(seq)

    def calculate_cpcs(self) -> float:
        correct = [s for s in self.sequences if s.was_correct]
        if not correct:
            return float('inf')
        total_cost = sum(s.total_cost_usd() for s in correct)
        return total_cost / len(correct)

Current production CPCS: $0.023 per claim end-to-end.
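
Note that `total_cost_usd` in the listing above multiplies the sequence's total token counts by every entry in `model_rates`, so a sequence that touches two models is charged at both rates for all of its tokens. If token counts are tracked per model, the cost can be attributed exactly. A minimal sketch under that assumption; the `ModelUsage` type, the model names, and the rates are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ModelUsage:
    input_tokens: int
    output_tokens: int
    input_rate: float   # USD per 1M input tokens (illustrative)
    output_rate: float  # USD per 1M output tokens (illustrative)

    def cost_usd(self) -> float:
        return (
            self.input_tokens / 1_000_000 * self.input_rate
            + self.output_tokens / 1_000_000 * self.output_rate
        )


def sequence_cost_usd(usage_by_model: dict[str, ModelUsage]) -> float:
    """Sum each model's own tokens at that model's own rates."""
    return sum(u.cost_usd() for u in usage_by_model.values())


usage = {
    "small-model": ModelUsage(2_000, 500, input_rate=1.0, output_rate=4.0),
    "large-model": ModelUsage(1_000, 200, input_rate=3.0, output_rate=15.0),
}
print(f"${sequence_cost_usd(usage):.4f}")
```

With a single model in `model_rates` the two approaches agree; they diverge only when a sequence is routed across several models.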

Innovation 3: ARGUS — autonomous self-correction

import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Awaitable

@dataclass
class CorrectionResult:
    output: Any
    score: GARVISScore
    attempts: int
    corrected: bool
    escalated: bool

class ARGUSGuard:
    def __init__(
        self,
        max_attempts: int = 3,
        target_composite: float = 0.85,
        safety_threshold: float = 1.0,  # hard gate
        domain: str = "healthcare_rcm",
        phi_safe: bool = True
    ):
        self.max_attempts = max_attempts
        self.target_composite = target_composite
        self.safety_threshold = safety_threshold
        self.domain = domain
        self.phi_safe = phi_safe

    async def execute_with_correction(
        self,
        agent_fn: Callable[..., Awaitable[Any]],
        task: dict,
        scorer: "GARVISScorer"
    ) -> CorrectionResult:

        attempt = 0
        current_task = task.copy()

        while attempt < self.max_attempts:
            output = await agent_fn(current_task)
            score = await scorer.score(output, self.domain)

            # PHI hard gate — fail immediately, do not retry
            if score.safety < self.safety_threshold:
                return CorrectionResult(
                    output=None,
                    score=score,
                    attempts=attempt + 1,
                    corrected=False,
                    escalated=True
                )

            if score.composite >= self.target_composite:
                return CorrectionResult(
                    output=output,
                    score=score,
                    attempts=attempt + 1,
                    corrected=attempt > 0,
                    escalated=False
                )

            # Score below threshold — reflect and refine
            current_task = self._reflect_and_refine(
                original_task=task,
                failed_output=output,
                score=score,
                attempt=attempt
            )
            attempt += 1

        # All attempts exhausted — escalate to human review
        return CorrectionResult(
            output=output,
            score=score,
            attempts=attempt,
            corrected=False,
            escalated=True
        )

    def _reflect_and_refine(
        self,
        original_task: dict,
        failed_output: Any,
        score: GARVISScore,
        attempt: int
    ) -> dict:
        # Identify the weakest dimension and generate
        # a dimension-specific correction signal
        weak_dims = self._weakest_dimensions(score)
        correction_prompt = self._build_correction_prompt(
            original_task,
            failed_output,
            weak_dims,
            attempt
        )
        refined = original_task.copy()
        refined["correction_context"] = correction_prompt
        refined["attempt"] = attempt + 1
        return refined

    def _weakest_dimensions(
        self,
        score: GARVISScore
    ) -> list[str]:
        dims = {
            "groundedness": score.groundedness,
            "accuracy": score.accuracy,
            "reliability": score.reliability,
            "variance": score.variance,
            "inference_cost": score.inference_cost,
        }
        # Return dimensions below 0.85, sorted weakest first
        return sorted(
            [k for k, v in dims.items() if v < 0.85],
            key=lambda k: dims[k]
        )

The _build_correction_prompt method is proprietary — that is where the domain-specific healthcare knowledge lives. But the structure above is fully open in the ARGUS SDK.
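
The control flow of the loop can be exercised without the SDK. The following is a minimal, self-contained sketch with a stub agent and a stub scorer; both stubs, the single-number score, and the function name `run_with_correction` are assumptions for illustration and do not reproduce the ARGUS classes.

```python
import asyncio


async def stub_agent(task: dict) -> dict:
    # Hypothetical agent: does better once it receives correction context.
    quality = 0.9 if "correction_context" in task else 0.7
    return {"answer": "deny-risk: low", "quality": quality}


async def stub_scorer(output: dict) -> float:
    # Hypothetical scorer returning a single composite value in [0, 1].
    return output["quality"]


async def run_with_correction(task: dict, max_attempts: int = 3, target: float = 0.85) -> dict:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    current = dict(task)
    for attempt in range(1, max_attempts + 1):
        output = await stub_agent(current)
        score = await stub_scorer(output)
        if score >= target:
            return {"output": output, "attempts": attempt, "escalated": False}
        # Below target: retry the original task with feedback attached.
        current = dict(task, correction_context=f"score {score:.2f} below {target}")
    # All attempts exhausted: hand off to human review.
    return {"output": output, "attempts": max_attempts, "escalated": True}


result = asyncio.run(run_with_correction({"claim_id": "C-1"}))
print(result["attempts"], result["escalated"])
```

Each retry is rebuilt from the original task plus feedback, mirroring `_reflect_and_refine` above, so correction context does not accumulate across attempts.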

The PHI tokenization architecture

This is the part that took the longest to get right. The requirement: agents need full clinical context to make good RCM decisions, but no PHI can appear in any LLM prompt.

import hashlib
import hmac
import re
from typing import Any

class PHITokenizer:
    # Patterns for common PHI types
    PHI_PATTERNS = {
        "MRN":   r"\bMRN[-:\s]?\d{6,10}\b",
        "DOB":   r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
        "NAME":  r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b",
        "NPI":   r"\bNPI[-:\s]?\d{10}\b",
    }

    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key
        self._token_map: dict[str, str] = {}
        self._reverse_map: dict[str, str] = {}

    def _generate_token(self, phi_value: str, phi_type: str) -> str:
        # Deterministic: same PHI always maps to same token
        raw = f"{phi_type}:{phi_value}"
        token_bytes = hmac.new(
            self.secret_key,
            raw.encode(),
            hashlib.sha256
        ).hexdigest()[:16]
        return f"[{phi_type}_TOKEN_{token_bytes.upper()}]"

    def tokenize(self, text: str) -> str:
        tokenized = text
        for phi_type, pattern in self.PHI_PATTERNS.items():
            matches = re.findall(pattern, tokenized)
            for match in matches:
                token = self._generate_token(match, phi_type)
                self._token_map[match] = token
                self._reverse_map[token] = match
                tokenized = tokenized.replace(match, token)
        return tokenized

    def rehydrate(self, tokenized_text: str) -> str:
        result = tokenized_text
        for token, phi_value in self._reverse_map.items():
            result = result.replace(token, phi_value)
        return result

    def is_phi_clean(self, text: str) -> bool:
        for pattern in self.PHI_PATTERNS.values():
            if re.search(pattern, text):
                return False
        return True

Production Safety score: 100%. Zero PHI exposure events.
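
Two properties of the HMAC-based tokens are worth checking directly: the same value always maps to the same token under one key, and the token cannot be reproduced without that key. A minimal sketch of the token function on its own; the keys shown are placeholders and the function name `phi_token` is illustrative:

```python
import hashlib
import hmac


def phi_token(secret_key: bytes, phi_type: str, phi_value: str) -> str:
    """Keyed, deterministic token: same key and value always yield the same token."""
    digest = hmac.new(secret_key, f"{phi_type}:{phi_value}".encode(), hashlib.sha256)
    return f"[{phi_type}_TOKEN_{digest.hexdigest()[:16].upper()}]"


key_a = b"example-key-a"  # placeholder only; load real keys from a secrets manager
key_b = b"example-key-b"

t1 = phi_token(key_a, "SSN", "123-45-6789")
t2 = phi_token(key_a, "SSN", "123-45-6789")
t3 = phi_token(key_b, "SSN", "123-45-6789")

print(t1 == t2)             # same key, same value
print(t1 == t3)             # different key
print("123-45-6789" in t1)  # the raw value never appears in the token
```

Determinism is what lets agents recognise that two references point to the same patient without ever seeing the underlying value.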

Install and get started

The ARGUS SDK — G-ARVIS scoring, ASF/ERR/CPCS calculators, PHITokenizer base class, and ARGUSGuard correction loop — is open-core and on PyPI.

pip install argus-ai

from argus_ai import ARGUSGuard, GARVISScorer, PHITokenizer
from argus_ai.metrics import ASFCalculator, ERRTracker, CPCSCalculator

# Wrap any async agent function with self-correction
guard = ARGUSGuard(
    max_attempts=3,
    target_composite=0.85,
    domain="healthcare_rcm",
    phi_safe=True
)

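import asyncio

async def main():
    # my_denial_predictor and claim_task are your own agent function and task dict
    result = await guard.execute_with_correction(
        agent_fn=my_denial_predictor,
        task=claim_task,
        scorer=GARVISScorer()
    )

    print(f"Score: {result.score.composite:.1%}")
    print(f"Attempts: {result.attempts}")
    print(f"Escalated: {result.escalated}")

asyncio.run(main())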

Production results

These are from the live ARIA system, 24-hour rolling average:

Metric	Value
G-ARVIS composite	93.9%
Groundedness	96.2%
Accuracy	94.8%
Reliability	93.1%
Variance	91.7%
Inference Cost	95.3%
Safety	100%
Action Sequence Fidelity	91.4%
Error Recovery Rate	87.3%
Cost Per Correct Sequence	$0.023
Denial rate reduction	38%

What is open vs proprietary

Open (argus-ai on PyPI + GitHub):

ARGUSGuard correction loop
GARVISScorer base framework
PHITokenizer base class
ASF, ERR, CPCS calculators
PulseFlow MLOps pipeline

Proprietary (the ARIA product):

11-agent supervisor hierarchy with RCM domain specialization
Payer policy RAG with live contract updates
Predictive denial scoring model
RCM domain knowledge engine
Multi-tenant deployment infrastructure

78% of PyTorch Models Never Reach Production. I Built the Fix.

Anil Prasad — Sat, 04 Apr 2026 18:26:23 +0000

78% of PyTorch Models Never Reach Production. I Built the Fix.

After 28 years shipping AI at scale, I got tired of watching good models die on the way to production.

By Anil S. Prasad — Founder, Ambharii Labs | Head of Engineering & Product, Duke Energy | Top 100 Most Influential AI Leaders USA 2024

There is a number that has followed me across every organization I have worked in. UnitedHealth Group, Medtronic, Ambry Genetics, R1 RCM, Duke Energy. The number is 78.

Seventy-eight percent of PyTorch models built in research never make it to production.

This is not a data science problem. The data scientists are talented. The models are good. The math works.

The problem is everything around the model. The audit trail that regulators demand. The compliance framework that legal requires. The drift detection that ops needs at 3am. The fairness analysis that the board is now asking about. The explanation that a clinician, an underwriter, or a grid operator needs before they trust the output.

None of that is in PyTorch. And nobody was building it.

So I did.

Introducing TorchForge

TorchForge is an open source enterprise governance wrapper for PyTorch. You take any model you have already built and wrap it in four lines of code. What comes back is the same model, same weights, same architecture, with a full production governance layer running underneath it.

Two-point-five percent overhead. That is all it costs.

from torchforge import ForgeModel, ForgeConfig

config = ForgeConfig(
    model_name="credit_risk_v2",
    version="1.0.0",
    enable_governance=True,
    compliance_framework="NIST_RMF_1.0"
)

model = ForgeModel(your_pytorch_model, config)
output = model(x)

# Audit trail: live.
# Drift detection: live.
# Compliance reporting: live.
# That's it.

No refactoring. No retraining. No new infrastructure team required.

What You Get Out of the Box

I want to be precise here because this is where most governance tools either overpromise or underdeliver.

NIST AI RMF 1.0 compliance tracking. Every inference is logged against the four core functions of the NIST AI Risk Management Framework: Govern, Map, Measure, and Manage. The report generates automatically. When a regulator asks for your AI risk documentation, you export it in one command.

Real-time drift detection with automatic alerts. TorchForge monitors input distribution and output distribution on every inference pass. When drift exceeds configurable thresholds, it fires alerts to Slack, PagerDuty, or any webhook you point it at. No separate monitoring pipeline. No Evidently setup. No manual dashboards.
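
As a rough illustration of threshold-based drift detection, the sketch below flags a batch whose mean has moved too many baseline standard deviations. This is a deliberately simple per-feature check written for this example; it is not TorchForge's detector, and the function names and the threshold of 3.0 are assumptions.

```python
from statistics import mean, pstdev


def mean_shift_score(baseline: list[float], batch: list[float]) -> float:
    """How many baseline standard deviations the batch mean has moved."""
    spread = pstdev(baseline) or 1e-12  # avoid division by zero on constant baselines
    return abs(mean(batch) - mean(baseline)) / spread


def has_drifted(baseline: list[float], batch: list[float], threshold: float = 3.0) -> bool:
    return mean_shift_score(baseline, batch) > threshold


baseline = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
print(has_drifted(baseline, [1.0, 1.02, 0.98]))  # similar distribution
print(has_drifted(baseline, [1.6, 1.7, 1.65]))   # shifted distribution
```

A mean-shift check misses changes in spread or shape, so production detectors usually pair it with a distribution-level statistic.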

Bias and fairness analysis on every prediction. Demographic parity, equalized odds, individual fairness metrics run as part of the inference pass. Not as a post-hoc audit you remember to do quarterly. On every prediction. Because bias does not wait for your audit schedule.
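
Demographic parity, the first metric named above, compares the rate of positive predictions across groups. A minimal sketch of the calculation for binary predictions; the function name and the sample data are illustrative and this is not TorchForge's implementation:

```python
from collections import defaultdict


def demographic_parity_difference(predictions: list[int], groups: list[str]) -> float:
    """Largest gap in positive-prediction rate between any two groups."""
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, groups))
```

A value of 0 means every group receives positive predictions at the same rate; the acceptable gap is a policy decision, not a property of the metric.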

Full audit trail from training to deployment. Every model version, every config change, every inference batch is logged with timestamp, input hash, output, confidence scores, and the governance metadata. Immutable. Queryable. Exportable.
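
One common way to make a log tamper-evident is to chain record hashes, so that each record's hash covers the previous one. The sketch below shows that technique with the standard library only; it is an assumption about how such a trail could be built, not a description of TorchForge's storage, and the record fields are illustrative.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash used as the predecessor of the first record


def _digest(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def append_record(log: list[dict], payload: dict) -> dict:
    """Append a record whose hash covers the payload and the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash or record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


log: list[dict] = []
append_record(log, {"model": "credit_risk_v2", "input_hash": "ab12", "output": 0.81})
append_record(log, {"model": "credit_risk_v2", "input_hash": "cd34", "output": 0.12})
print(verify_chain(log))

log[0]["payload"]["output"] = 0.99  # tamper with an earlier record
print(verify_chain(log))
```

Hash chaining detects edits after the fact; preventing them still requires write-once storage or an external anchor for the latest hash.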

One-click deployment to five clouds. AWS, Azure, GCP, Kubernetes, Oracle Cloud. The deployment module generates the Terraform, the Helm chart, and the GitHub Actions pipeline. Your ops team gets a clean artifact, not a Jupyter notebook printed to PDF.

A/B testing with gradual rollout. Define your champion and challenger models. Set a traffic split. TorchForge handles the routing, collects the performance metrics, and helps you decide when to promote. No feature flags library required.
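
A traffic split is usually made deterministic so that the same request or user always sees the same model. A minimal sketch of hash-based assignment; the function name, the request ID format, and the 10% share are assumptions for illustration and not TorchForge's router:

```python
import hashlib


def assign_variant(request_id: str, challenger_share: float) -> str:
    """Deterministically route a request: the same ID always gets the same variant."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


ids = [f"req-{i}" for i in range(10_000)]
share = sum(assign_variant(i, 0.10) == "challenger" for i in ids) / len(ids)
print(f"challenger share: {share:.3f}")
print(assign_variant("req-42", 0.10) == assign_variant("req-42", 0.10))
```

Raising `challenger_share` gradually only moves requests from champion to challenger, never the reverse, which keeps a rollout monotonic.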

Why I Built This Now

Three things converged in 2025 that made this the right moment.

First, the regulatory pressure on AI is no longer hypothetical. The EU AI Act is in force. US federal agencies have issued AI governance guidance. State-level bills are passing. Every organization I talk to is scrambling to answer the same question: how do we prove our AI is trustworthy? TorchForge makes that question answerable.

Second, PyTorch won. It is the dominant research framework and increasingly the production framework. ONNX and TorchScript made serving easier. But nobody solved governance at the framework layer. Everyone solved it at the infrastructure layer, which means it is always bolted on, never built in.

Third, I kept meeting talented ML engineers who had the same story. They built something that worked. Leadership approved it. Then it went to compliance, and it sat there for six months because nobody could answer the audit questions. TorchForge is the answer you hand compliance the day the model is ready.

The Performance Story

I know what you are thinking. Governance overhead sounds expensive.

Here is what we measured in production-equivalent workloads:

Operation	TorchForge	Pure PyTorch	Overhead
Forward pass	12.3ms	12.0ms	2.5%
Training step	45.2ms	44.8ms	0.9%
Inference batch	8.7ms	8.5ms	2.3%

Two-point-five percent on a forward pass, or 0.3 ms per call. On a GPU cluster running ten thousand inferences per second, that is about 3 extra seconds of total compute per second across the cluster. In exchange for full NIST compliance, continuous drift detection, bias monitoring, and a complete audit trail.

That is not a trade-off. That is a deal.

Who This Is For

If you are a solo researcher building hobby projects, TorchForge is overkill. Use it anyway if you want to learn the patterns, but it is not built for you.

TorchForge is built for three audiences.

ML engineers at regulated companies. Healthcare, financial services, energy, insurance. If your model touches a human life, a financial decision, or critical infrastructure, you need this. The compliance cost of not having it is orders of magnitude higher than the 2.5% overhead.

ML platform teams at growth-stage companies. You are scaling from one model to fifty. You need standardization. TorchForge is the standard. Every team wraps their model the same way, and you get a unified governance view across the entire portfolio.

AI consultants and system integrators. When you deliver a PyTorch model to a client and it comes with TorchForge, you are delivering a production-ready artifact, not a prototype. That changes the conversation about what you charge and what the client owns.

The Open Core Model

TorchForge is MIT-licensed. The core is free and always will be. I believe governance tooling should be open because the alternative is that only large companies with large procurement budgets can ship trustworthy AI. That is a bad outcome for the field.

The enterprise platform, the autonomous correction agents, the multi-tenant dashboard, the SLA-backed support, the private deployment options, those are available through Ambharii Labs. If you need them, you know where to find me.

But the open core does everything I described above. The compliance tracking, drift detection, bias analysis, audit trail, deployment tooling, A/B testing framework. All of it. No license key required.

Try It in Three Minutes

pip install torchforge

import torch
import torch.nn as nn
from torchforge import ForgeModel, ForgeConfig

# Your existing model, unchanged
class YourModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )
    def forward(self, x):
        return self.net(x)

# Wrap it
config = ForgeConfig(
    model_name="my_first_governed_model",
    version="1.0.0",
    enable_governance=True,
    compliance_framework="NIST_RMF_1.0"
)

model = ForgeModel(YourModel(), config)

# Run inference — governance is automatic
x = torch.randn(4, 128)
output = model(x)

# Export your compliance report
model.export_compliance_report("./compliance_report.json")

That is three minutes from install to your first NIST-compliant inference.

The live demo runs on Hugging Face Spaces at no cost to you. Go to huggingface.co/spaces/AmbhariiLabs/torchforge-demo and run a governed inference in your browser before you install anything.

What Comes Next

The roadmap for TorchForge is driven by what I see breaking in production, not by what sounds impressive in a conference talk.

Q2 2026: Federated learning support with differential privacy guarantees. For healthcare and financial services teams who cannot centralize training data.

Q3 2026: LLM governance extension. The same wrapper pattern applied to fine-tuned language models. Hallucination rate tracking, toxicity monitoring, prompt injection detection.

Q4 2026: Cross-framework support. The governance layer decoupled from PyTorch so it can wrap TensorFlow, JAX, and ONNX models with the same four-line interface.

Everything will stay open core. That is not a marketing promise. It is the design constraint I set before writing the first line of code.

Links

GitHub: github.com/anilatambharii/torchforge

PyPI: pip install torchforge

Live demo: huggingface.co/spaces/AmbhariiLabs/torchforge-demo

Enterprise: ambharii.com

Connect: linkedin.com/in/anilsprasad

Built by Anil S. Prasad — Founder, Ambharii Labs. 28 years of production AI across UnitedHealth Group, Medtronic, Duke Energy, Ambry Genetics, and R1 RCM. Co-Founder of the CDAIO Circle Tri-State Chapter. Stanford and BITS Pilani.

#HumanWritten #ExpertiseFromField #PyTorch #MLOps #EnterpriseAI #AIGovernance #OpenSource #NIST #ProductionAI

Cross-posting note: This article is published on Medium (@anilAmbharii). Canonical version lives at medium.com. If you found this on DEV.to, Substack, or LinkedIn, follow me there for more field notes from 28 years of production AI.