<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dishant Sethi</title>
    <description>The latest articles on DEV Community by Dishant Sethi (@dishant_sethi).</description>
    <link>https://dev.to/dishant_sethi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3794376%2F1e679e88-5cea-4819-bc93-271539b3945a.png</url>
      <title>DEV Community: Dishant Sethi</title>
      <link>https://dev.to/dishant_sethi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dishant_sethi"/>
    <language>en</language>
    <item>
      <title>Why Your RAG Pipeline Is Failing in Production (And How to Fix It)</title>
      <dc:creator>Dishant Sethi</dc:creator>
      <pubDate>Wed, 27 May 2026 16:19:11 +0000</pubDate>
      <link>https://dev.to/dishant_sethi/why-your-rag-pipeline-is-failing-in-production-and-how-to-fix-it-4318</link>
      <guid>https://dev.to/dishant_sethi/why-your-rag-pipeline-is-failing-in-production-and-how-to-fix-it-4318</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://www.prodinit.com/blog/rag-pipeline-debugging-production" rel="noopener noreferrer"&gt;prodinit.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80% of RAG failures trace back to the ingestion layer, not the LLM — fix chunking and indexing before tuning your prompts&lt;/li&gt;
&lt;li&gt;Chunk size alone can swing retrieval precision by 20–40%; there is no universal right answer, and the correct value depends on your document type and query pattern&lt;/li&gt;
&lt;li&gt;Adding a cross-encoder reranker on top of vector search typically lifts answer correctness by 15–25% with minimal latency cost&lt;/li&gt;
&lt;li&gt;Stale indexes are invisible in standard monitoring: a document updated 3 months ago may still be answering queries from its old content&lt;/li&gt;
&lt;li&gt;Teams without an eval loop discover regressions 4–8× slower than teams with automated retrieval quality checks running on every deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A RAG pipeline looks straightforward on paper: retrieve relevant chunks, stuff them into a prompt, get an answer. Teams wire it up in a weekend, the demo works, and they ship it. Then, weeks later, users start complaining that the system returns outdated information, misses obvious answers, or confidently cites the wrong document.&lt;/p&gt;

&lt;p&gt;RAG pipeline debugging starts at the retrieval layer, not the LLM. The five failure modes that break production RAG systems — bad chunking, missing reranking, stale indexes, no hybrid retrieval, and no eval loop — are all fixable at the data and infrastructure layer. None require changing your model or rewriting your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why RAG Fails Silently in Production
&lt;/h2&gt;

&lt;p&gt;The LLM itself is almost always fine. The retrieval layer is what's broken — and most observability tooling points at the model, not the retriever. You can spend days tweaking system prompts and temperature settings while the root cause sits in how you chunked your documents three months ago. Production RAG failure leaves no stack trace.&lt;/p&gt;

&lt;p&gt;There is no exception, no 500 error, no latency spike. The system continues to return answers. They are just wrong, or incomplete, or stale. Without an explicit eval loop tied to retrieval quality, you will not know until a user tells you.&lt;/p&gt;

&lt;p&gt;This guide covers the five failure modes Prodinit encounters most often when auditing RAG systems in production, with diagnosis steps and fixes for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Mode 1: Bad Chunking Strategy
&lt;/h2&gt;

&lt;p&gt;Fixed-size character splitting destroys retrieval quality for anything beyond plain prose. A 512-token chunk of a legal contract may split a clause mid-sentence; a 512-token chunk of code may span four unrelated functions. Neither produces embeddings specific enough to surface the right document for a precise query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it breaks
&lt;/h3&gt;

&lt;p&gt;Chunking is the most consequential decision in a RAG pipeline and the one teams spend the least time on. The default in most frameworks is a fixed-size character or token split with a small overlap. This works in demos. In production, it destroys retrieval quality for anything that isn't plain prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with fixed-size chunking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 512-token chunk of a legal contract may split a clause mid-sentence, leaving neither chunk with enough context to be retrieved correctly&lt;/li&gt;
&lt;li&gt;A 512-token chunk of code may contain four unrelated functions, causing the entire chunk to match queries loosely but none of them precisely&lt;/li&gt;
&lt;li&gt;Tables, structured data, and numbered lists lose their semantics when split by character count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your chunks are semantically incoherent, your embeddings are noisy. Noisy embeddings produce low-confidence nearest-neighbor results. The retriever returns tangentially related chunks, the LLM hallucinates to fill the gap, and the answer looks plausible but wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosis
&lt;/h3&gt;

&lt;p&gt;Check what your chunks actually look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
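&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random


def audit_chunks(chunks: list[str], sample_size: int = 20) -&amp;gt; dict:
    """Summarise chunk sizes and estimate how many chunks end mid-sentence."""
    if not chunks:
        raise ValueError("chunks must not be empty")

    # Whitespace word counts are a rough proxy for model tokens
    lengths = [len(c.split()) for c in chunks]
    sample = random.sample(chunks, min(sample_size, len(chunks)))
    # A chunk counts as truncated if it does not end on sentence or block punctuation
    truncated = sum(
        1 for c in sample
        if not c.strip().endswith((".", "?", "!", "```", "}"))
    )
    return {
        "avg_tokens": sum(lengths) / len(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        # Percentage of the sample, so it can be compared against the 30% red flag
        "truncated_sentences": round(100 * truncated / len(sample), 1),
        "sample": sample[:3],
    }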


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Red flags: &lt;code&gt;truncated_sentences&lt;/code&gt; above 30%, average tokens below 100 or above 600, or chunks that end mid-code-block.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Switch to semantic chunking. For prose documents, split on sentence boundaries and merge until a semantic similarity threshold is crossed. For structured content, use document-aware splitters that respect headings, tables, and code blocks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Structure-aware splitter: tries paragraph, line, then sentence boundaries in order.
# from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens (requires tiktoken);
# with the default length_function=len they would be counted in characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
    chunk_size=600,          # tokens
    chunk_overlap=60,        # ~10% overlap for context continuity
)

# For code: use language-aware splitters that prefer class and function boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,          # characters here, since length_function defaults to len
    chunk_overlap=80,
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
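
&lt;p&gt;The sentence-merge approach for prose can be sketched without any framework. This is a minimal illustration rather than a production splitter: &lt;code&gt;embed&lt;/code&gt; is a toy bag-of-words stand-in for a real embedding model, and the function names and the 0.15 threshold are assumptions chosen for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; swap in a real embedding model in practice."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.15) -> list[str]:
    """Merge consecutive sentences while they stay similar to the running chunk."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text) if s.strip()]
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(" ".join(chunks[-1])), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)   # same topic: extend the current chunk
        else:
            chunks.append([sentence])     # topic shift: start a new chunk
    return [" ".join(c) for c in chunks]


text = (
    "Refunds are issued within 30 days. Refunds require a receipt. "
    "The API returns JSON. The API uses tokens."
)
result = semantic_chunks(text)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a real embedding model the threshold needs re-tuning, since dense-vector similarities sit in a different range than word-overlap scores.&lt;/p&gt;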

&lt;p&gt;There is no universal correct chunk size. Run retrieval precision benchmarks at 256, 512, and 1024 tokens against a sample of real queries. Pick the size that maximises the percentage of queries where the correct answer appears in the top-3 retrieved chunks.&lt;/p&gt;
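
&lt;p&gt;A minimal sketch of that benchmark, assuming you can rebuild a retriever for a given chunk size. &lt;code&gt;build_retriever&lt;/code&gt;, &lt;code&gt;hit_rate_at_k&lt;/code&gt;, and &lt;code&gt;pick_chunk_size&lt;/code&gt; are hypothetical names, and a hit is approximated here as the expected answer snippet appearing verbatim in one of the top-k chunks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked chunk texts


def hit_rate_at_k(retrieve: Retriever,
                  labelled_queries: list[tuple[str, str]],
                  k: int = 3) -> float:
    """Fraction of queries whose expected snippet appears in the top-k chunks."""
    hits = 0
    for query, expected_snippet in labelled_queries:
        top_chunks = retrieve(query)[:k]
        if any(expected_snippet in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(labelled_queries)


def pick_chunk_size(build_retriever: Callable[[int], Retriever],
                    labelled_queries: list[tuple[str, str]],
                    sizes: tuple[int, ...] = (256, 512, 1024)) -> tuple[int, dict[int, float]]:
    """Score every candidate size and return the best one plus all scores."""
    scores = {size: hit_rate_at_k(build_retriever(size), labelled_queries) for size in sizes}
    best = max(scores, key=scores.get)
    return best, scores


def fake_builder(size: int) -> Retriever:
    """Stand-in for re-chunking and re-indexing at the given size."""
    def retrieve(query: str) -> list[str]:
        return ["answer text"] if size == 512 else ["unrelated"]
    return retrieve


best, scores = pick_chunk_size(fake_builder, [("q", "answer")])
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;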

&lt;h2&gt;
  
  
  Failure Mode 2: Missing Reranking
&lt;/h2&gt;

&lt;p&gt;Vector similarity search retrieves the right candidates but ranks them poorly. The chunk with the highest cosine similarity is not always the most useful chunk for the specific query — it is the closest in embedding space, not the most relevant to the question. Without a cross-encoder reranker, you are systematically passing the wrong context to your LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it breaks
&lt;/h3&gt;

&lt;p&gt;Vector similarity search is excellent at candidate retrieval. It is poor at ranking. Cosine similarity between two high-dimensional embeddings captures semantic proximity, not answer relevance for a specific query. The top result by cosine distance is not always the most useful chunk for the question at hand.&lt;/p&gt;

&lt;p&gt;Teams that skip reranking are essentially treating their retrieval problem as solved after the first-stage ANN search. In practice, the chunk that best answers the query is often ranked 3rd or 5th by embedding similarity — close enough to retrieve, not close enough to surface first.&lt;/p&gt;

&lt;p&gt;If your system passes the top-1 or top-2 chunks to the LLM without reranking and truncates the rest, you are systematically dropping the best answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosis
&lt;/h3&gt;

&lt;p&gt;Run a relevance audit on your retrieval results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def audit_retrieval_rank(query: str, retrieved_chunks: list[str],
                          ground_truth_chunk: str) -&amp;gt; dict:
    if ground_truth_chunk not in retrieved_chunks:
        # First-stage retrieval missed the chunk, so reranking cannot recover it
        return {"query": query, "vector_rank": None,
                "reranked_rank": None, "improved": False}

    scores = reranker.predict(
        [(query, chunk) for chunk in retrieved_chunks]
    )
    # Positions in retrieved_chunks, ordered by descending cross-encoder score
    order = sorted(
        range(len(retrieved_chunks)),
        key=lambda i: scores[i],
        reverse=True
    )

    truth_idx = retrieved_chunks.index(ground_truth_chunk)
    vector_rank = truth_idx + 1
    reranked_rank = order.index(truth_idx) + 1

    return {
        "query": query,
        "vector_rank": vector_rank,
        "reranked_rank": reranked_rank,
        "improved": reranked_rank &amp;lt; vector_rank,
    }


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If reranked rank is better than vector rank on more than 30% of your test queries, you have a reranking gap that is actively hurting answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Add a cross-encoder reranker as a second-pass filter. Retrieve &lt;code&gt;k=20&lt;/code&gt; candidates from your vector store, rerank them, and pass the top-3 to your LLM. The cross-encoder sees the full query and each chunk together, which lets it score relevance directly rather than proximity in embedding space.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from sentence_transformers import CrossEncoder
from typing import List

class RerankedRetriever:
    def __init__(self, vector_store, reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(reranker_model)

    def retrieve(self, query: str, top_k: int = 3, candidate_k: int = 20) -&amp;gt; List[str]:
        # First-stage: broad vector retrieval
        candidates = self.vector_store.similarity_search(query, k=candidate_k)

        # Second-stage: cross-encoder reranking
        pairs = [(query, doc.page_content) for doc in candidates]
        scores = self.reranker.predict(pairs)

        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )
        return [doc.page_content for doc, _ in ranked[:top_k]]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cross-encoder reranking adds 50–200ms of latency for a 20-candidate set. For most production RAG workloads, that is an acceptable trade for a 15–25% improvement in answer correctness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Mode 3: Stale Index
&lt;/h2&gt;

&lt;p&gt;Your embedding index is a snapshot of your documents at indexing time. When a policy is updated, a product spec is revised, or a pricing page changes, the index does not update automatically — queries continue retrieving content from weeks or months ago, with no error signal to indicate the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it breaks
&lt;/h3&gt;

&lt;p&gt;Stale index is insidious because it is invisible in standard observability. Query latency is normal. Embedding lookups return results. The system appears healthy. Users are just silently receiving outdated information.&lt;/p&gt;

&lt;p&gt;The problem compounds with time. A document indexed 6 months ago and updated 3 times since is a liability, not an asset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosis
&lt;/h3&gt;

&lt;p&gt;Implement index freshness tracking:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class IndexedDocument:
    doc_id: str
    content_hash: str
    indexed_at: datetime          # timezone-aware UTC
    source_updated_at: datetime   # timezone-aware UTC

def audit_index_freshness(indexed_docs: list[IndexedDocument],
                           max_age_days: int = 30) -&amp;gt; dict:
    now = datetime.now(timezone.utc)
    stale = []

    for doc in indexed_docs:
        if doc.source_updated_at &amp;gt; doc.indexed_at:
            # Source changed after indexing: the index is serving old content
            gap = doc.source_updated_at - doc.indexed_at
            stale.append({
                "id": doc.doc_id,
                "reason": "source_updated_after_index",
                "gap_hours": int(gap.total_seconds() // 3600),
            })
            continue  # report each document once

        age = (now - doc.indexed_at).days
        if age &amp;gt; max_age_days:
            stale.append({"id": doc.doc_id, "reason": "max_age_exceeded", "age_days": age})

    total = len(indexed_docs)
    return {
        "total_documents": total,
        "stale_count": len(stale),
        "stale_pct": round(len(stale) / total * 100, 1) if total else 0.0,
        "stale_docs": stale[:10],
    }


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Implement incremental re-indexing on document change, not on a fixed schedule. Track content hashes. When a source document's hash changes, queue it for re-embedding immediately.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import hashlib
from datetime import datetime, timezone
from typing import Callable

class IncrementalIndexer:
    def __init__(self, vector_store, embedder, chunker: Callable[[str], list[str]]):
        # delete(filter=...) and add_embeddings(...) are placeholders;
        # the exact method names depend on your vector store client
        self.vector_store = vector_store
        self.embedder = embedder
        self.chunker = chunker  # e.g. splitter.split_text
        self.index_registry: dict[str, str] = {}  # doc_id -&amp;gt; content_hash

    def _content_hash(self, content: str) -&amp;gt; str:
        return hashlib.sha256(content.encode()).hexdigest()

    def upsert_document(self, doc_id: str, content: str, metadata: dict):
        new_hash = self._content_hash(content)

        if self.index_registry.get(doc_id) == new_hash:
            return  # Content unchanged, skip re-indexing

        # Remove the old chunks first so stale content cannot be retrieved
        self.vector_store.delete(filter={"doc_id": doc_id})

        chunks = self.chunker(content)
        embeddings = self.embedder.embed_documents(chunks)
        indexed_at = datetime.now(timezone.utc).isoformat()

        self.vector_store.add_embeddings(
            texts=chunks,
            embeddings=embeddings,
            metadatas=[{**metadata, "doc_id": doc_id, "indexed_at": indexed_at}
                       for _ in chunks],
        )

        self.index_registry[doc_id] = new_hash


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Wire this to your content management system's webhook or change-data-capture stream. Every document update should trigger an upsert within minutes, not the next scheduled batch run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Mode 4: No Hybrid Retrieval (BM25 + Vector)
&lt;/h2&gt;

&lt;p&gt;Pure vector search fails on exact-match queries. When a user searches for a specific error code, API endpoint, or product identifier, vector similarity often surfaces semantically related content that never contains the exact string. BM25 handles rare-term and exact-match queries precisely — hybrid retrieval combines both and consistently outperforms either approach alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it breaks
&lt;/h3&gt;

&lt;p&gt;Pure vector search excels at semantic similarity. It is poor at exact matching. When a user queries for a specific product code, a person's name, an API endpoint, or an error message, vector search often surfaces semantically related but lexically different results. The chunk containing the exact string &lt;code&gt;ERR_QUOTA_EXCEEDED&lt;/code&gt; may score lower than a chunk about "error handling" that never mentions the specific code.&lt;/p&gt;

&lt;p&gt;BM25 (the algorithm behind classic keyword search) handles exact and rare-term matching extremely well. It rewards documents that contain the query terms, with inverse document frequency weighting meaning that rare, specific terms get boosted. What BM25 misses is paraphrase, synonym, and conceptual matching — exactly what vector search handles.&lt;/p&gt;

&lt;p&gt;Teams that use only vector search leave a meaningful precision gap for queries with specific identifiers. Teams that use only BM25 miss semantic intent. Hybrid retrieval combines both, and on standard retrieval benchmarks it consistently outperforms either approach alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosis
&lt;/h3&gt;

&lt;p&gt;Run a query set that mixes semantic queries ("how does the refund policy work?") and exact-match queries ("what is the timeout value for API_GATEWAY_CONNECT?"). Compare top-3 precision for vector-only versus BM25-only versus hybrid across both query types. If vector-only precision on exact-match queries is more than 15 percentage points lower than on semantic queries, you have a pure-vector blind spot.&lt;/p&gt;
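
&lt;p&gt;The comparison can be reduced to a few lines once each test query has a type label and a measured top-3 precision. The sketch below assumes a list of &lt;code&gt;(query_type, precision)&lt;/code&gt; pairs from a vector-only run; the function names and the two labels are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
from collections import defaultdict


def precision_by_type(per_query: list[tuple[str, float]]) -> dict[str, float]:
    """Average top-3 precision for each query type."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for query_type, precision in per_query:
        buckets[query_type].append(precision)
    return {t: sum(values) / len(values) for t, values in buckets.items()}


def has_vector_blind_spot(per_query: list[tuple[str, float]],
                          max_gap_points: float = 15.0) -> bool:
    """True when exact-match precision trails semantic precision by more than the gap."""
    by_type = precision_by_type(per_query)
    gap_points = (by_type["semantic"] - by_type["exact"]) * 100
    return gap_points > max_gap_points


vector_only_run = [
    ("semantic", 1.0),
    ("semantic", 2 / 3),
    ("exact", 1 / 3),
    ("exact", 2 / 3),
]
blind_spot = has_vector_blind_spot(vector_only_run)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;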

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Implement reciprocal rank fusion (RRF) to merge vector and BM25 rankings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List

class HybridRetriever:
    def __init__(self, vector_store, documents: List[str],
                 rrf_k: int = 60, alpha: float = 0.5):
        # Assumes each chunk in the vector store carries metadata["id"] == f"doc_{i}",
        # where i is the chunk's position in `documents`, so both rankings share IDs
        self.vector_store = vector_store
        self.documents = documents
        self.alpha = alpha         # 0 = BM25 only, 1 = vector only
        self.rrf_k = rrf_k

        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def _rrf_score(self, rank: int) -&amp;gt; float:
        return 1.0 / (self.rrf_k + rank)   # rank is 1-based

    def retrieve(self, query: str, top_k: int = 5) -&amp;gt; List[str]:
        vector_results = self.vector_store.similarity_search(query, k=top_k * 4)

        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_ranked = np.argsort(bm25_scores)[::-1][:top_k * 4]

        rrf_scores: dict[str, float] = {}

        for rank, doc in enumerate(vector_results, start=1):
            doc_id = doc.metadata["id"]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + self.alpha * self._rrf_score(rank)

        for rank, idx in enumerate(bm25_ranked, start=1):
            doc_id = f"doc_{idx}"
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1 - self.alpha) * self._rrf_score(rank)

        # Returns fused document IDs, best first
        sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
        return sorted_ids[:top_k]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Start with &lt;code&gt;alpha=0.5&lt;/code&gt; (equal weight) and tune based on your query distribution. If your users ask mostly exact-product or identifier queries, shift toward &lt;code&gt;alpha=0.3&lt;/code&gt; to weight BM25 more heavily.&lt;/p&gt;
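
&lt;p&gt;To see how the weight changes the fused order, here is a small standalone sketch of weighted RRF over two ranked ID lists. The function name and the sample IDs are made up for illustration; ranks are 1-based and &lt;code&gt;rrf_k=60&lt;/code&gt; matches the default used above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
def weighted_rrf(vector_ids: list[str], bm25_ids: list[str],
                 alpha: float = 0.5, rrf_k: int = 60) -> list[str]:
    """Fuse two rankings; alpha weights the vector list, (1 - alpha) the BM25 list."""
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank)
    for rank, doc_id in enumerate(bm25_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["a", "b", "c"]   # best semantic match first
bm25_ranking = ["c", "a", "d"]     # best keyword match first

balanced = weighted_rrf(vector_ranking, bm25_ranking, alpha=0.5)
bm25_heavy = weighted_rrf(vector_ranking, bm25_ranking, alpha=0.3)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this toy case the balanced weighting puts &lt;code&gt;a&lt;/code&gt; first, while &lt;code&gt;alpha=0.3&lt;/code&gt; promotes &lt;code&gt;c&lt;/code&gt;, the top BM25 hit.&lt;/p&gt;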

&lt;h2&gt;
  
  
  Failure Mode 5: No Eval Loop
&lt;/h2&gt;

&lt;p&gt;Without an automated eval loop, every regression in your RAG pipeline is invisible until a user complaint surfaces it. Teams without retrieval quality checks running on every deployment discover degradation 4–8× slower than teams that do — and by then, the root cause is typically tangled across multiple changes and hard to isolate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it breaks
&lt;/h3&gt;

&lt;p&gt;You cannot improve what you do not measure. RAG systems degrade over time as documents are updated, query patterns shift, and underlying model versions change. Without an automated eval loop tied to retrieval quality metrics, every one of these changes is invisible until a user complaint surfaces it.&lt;/p&gt;

&lt;p&gt;The eval loop is not optional. It is the mechanism that keeps your RAG pipeline honest over its operational lifetime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosis
&lt;/h3&gt;

&lt;p&gt;Check whether your deployment pipeline currently runs any of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval precision@k (what fraction of ground-truth relevant chunks appear in the top-k retrieved?)&lt;/li&gt;
&lt;li&gt;Answer faithfulness (does the generated answer stay within the retrieved context, or does it hallucinate beyond it?)&lt;/li&gt;
&lt;li&gt;Answer relevance (does the generated answer actually address the query?)&lt;/li&gt;
&lt;li&gt;Context recall (does the retrieved set contain all the information needed to answer correctly?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If none of these are tracked per deployment, you are operating blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Build a retrieval eval suite using a golden query set and run it in CI on every deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalCase:
    query: str
    expected_doc_ids: List[str]
    expected_answer_contains: Optional[str] = None

def precision_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -&amp;gt; float:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def run_retrieval_eval(retriever, eval_cases: List[EvalCase], k: int = 3) -&amp;gt; dict:
    results = []

    for case in eval_cases:
        # Assumes the retriever returns dicts with an "id" key; adapt if yours returns IDs or Documents
        retrieved = retriever.retrieve(case.query, top_k=k)
        retrieved_ids = [r["id"] for r in retrieved]

        precision = precision_at_k(retrieved_ids, case.expected_doc_ids, k)
        recall = sum(
            1 for doc_id in case.expected_doc_ids if doc_id in retrieved_ids
        ) / len(case.expected_doc_ids)

        results.append({
            "query": case.query,
            f"precision@{k}": precision,
            "recall": recall,
        })

    avg_precision = sum(r[f"precision@{k}"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)

    return {
        f"avg_precision@{k}": round(avg_precision, 3),
        "avg_recall": round(avg_recall, 3),
        "per_query": results,
    }

def ci_gate(current_metrics: dict, baseline_metrics: dict,
             relative_threshold: float = 0.05, k: int = 3) -&amp;gt; bool:
    key = f"avg_precision@{k}"   # must match the k used in run_retrieval_eval
    baseline_p = baseline_metrics[key]
    current_p = current_metrics[key]
    if baseline_p == 0:
        return True  # nothing to regress from; avoids division by zero
    regression = (baseline_p - current_p) / baseline_p

    if regression &amp;gt; relative_threshold:
        print(f"FAIL: precision@{k} regressed {regression:.1%} (baseline={baseline_p:.3f}, current={current_p:.3f})")
        return False
    return True


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this eval suite against a golden set of 50–200 query/relevant-document pairs on every deploy. Gate the deployment if precision@3 drops more than 5% relative to the last passing run.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Pipeline Debugging Checklist
&lt;/h2&gt;

&lt;p&gt;Run this before spending time on prompt engineering or model tuning. These five failure modes are sequential — chunking problems corrupt every downstream step, so work top to bottom. If any item below fails, fix it before moving to the next row.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Tool / Signal&lt;/th&gt;
&lt;th&gt;Pass Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk quality&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;audit_chunks()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;truncated_sentences&lt;/code&gt; &amp;lt; 30%, avg tokens 100–600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk strategy&lt;/td&gt;
&lt;td&gt;Manual inspection&lt;/td&gt;
&lt;td&gt;Chunks are semantically coherent units&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker present&lt;/td&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;Cross-encoder reranker on first-stage candidates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker improves rank&lt;/td&gt;
&lt;td&gt;&lt;code&gt;audit_retrieval_rank()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ground-truth rank improves in &amp;gt; 30% of queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index freshness&lt;/td&gt;
&lt;td&gt;Hash comparison&lt;/td&gt;
&lt;td&gt;No document indexed &amp;gt; 30 days without change check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDC / webhook&lt;/td&gt;
&lt;td&gt;Infrastructure review&lt;/td&gt;
&lt;td&gt;Document updates trigger re-index within minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid retrieval&lt;/td&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;BM25 + vector fusion implemented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid alpha tuned&lt;/td&gt;
&lt;td&gt;Precision comparison&lt;/td&gt;
&lt;td&gt;Hybrid P@3 ≥ max(vector-only, BM25-only) P@3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval suite exists&lt;/td&gt;
&lt;td&gt;CI pipeline&lt;/td&gt;
&lt;td&gt;Retrieval eval runs on every deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression gate&lt;/td&gt;
&lt;td&gt;CI config&lt;/td&gt;
&lt;td&gt;Deploy blocked if precision drops &amp;gt; 5% relative&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building Production Voice AI Agents: Latency, Architecture, and What Nobody Tells You</title>
      <dc:creator>Dishant Sethi</dc:creator>
      <pubDate>Wed, 27 May 2026 16:10:44 +0000</pubDate>
      <link>https://dev.to/dishant_sethi/building-production-voice-ai-agents-latency-architecture-and-what-nobody-tells-you-3jhj</link>
      <guid>https://dev.to/dishant_sethi/building-production-voice-ai-agents-latency-architecture-and-what-nobody-tells-you-3jhj</guid>
      <description>&lt;p&gt;Originally published on &lt;a href="https://www.prodinit.com/blog/production-voice-ai-agents-latency-architecture" rel="noopener noreferrer"&gt;prodinit.com&lt;/a&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-300ms end-to-end latency is the human-conversation threshold for voice AI.&lt;/li&gt;
&lt;li&gt;The latency budget breaks into four layers: STT (80–120ms), LLM first-token (150–250ms), TTS first-chunk (60–100ms), and network transport (20–60ms). Missing the target in any one layer pushes the total over 500ms.&lt;/li&gt;
&lt;li&gt;WebRTC with ICE Trickle is the correct transport for browser and mobile clients. SIP is the right choice for PSTN integration and legacy telephony.&lt;/li&gt;
&lt;li&gt;LiveKit SFU reduces media server complexity by forwarding encoded streams rather than decoding and re-mixing them, and its hosted tier removes the need to operate a media server fleet entirely.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Voice AI Fails in Production
&lt;/h2&gt;

&lt;p&gt;Voice AI demos look deceptively easy. A GPT-4o API call, a TTS response, a microphone input — connected together in 200 lines of Python, the thing works. Then you put it in front of real users and it fails.&lt;/p&gt;

&lt;p&gt;The failure is almost never the model. It is the architecture.&lt;/p&gt;

&lt;p&gt;In production at 2000+ calls per day — the scale Prodinit operates for a healthcare scheduling platform — three classes of failure dominate: latency spikes that destroy conversational flow, audio glitches from unmanaged WebRTC sessions, and compliance gaps where customer PII surfaces in LLM provider logs. None of these appear in a notebook demo. All of them have architecture solutions.&lt;/p&gt;

&lt;p&gt;This guide walks through the complete production stack: what latency target you are actually trying to hit, how the budget breaks across each layer, the transport architecture that achieves it, and the security and observability instrumentation that keeps it running without surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Latency Is Acceptable for Voice AI?
&lt;/h2&gt;

&lt;p&gt;Sub-300ms end-to-end latency is the human-conversation threshold. Conversational linguistics research places the average human response gap at 200ms; gaps up to 500ms are within the natural range. Beyond 500ms, listeners register the pause. Beyond 1,500ms, they start to speak again — or hang up.&lt;/p&gt;

&lt;p&gt;The practical production target is &lt;strong&gt;under 800ms at p95&lt;/strong&gt;, with a p50 below 400ms. This is not a soft target — these numbers correlate directly with call completion rates and CSAT scores.&lt;/p&gt;

&lt;p&gt;End-to-end latency in a voice AI agent is the sum of five contributors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio capture and VAD (voice activity detection) — 10–30ms&lt;/li&gt;
&lt;li&gt;STT transcription — 80–120ms with streaming&lt;/li&gt;
&lt;li&gt;LLM first-token latency — 150–250ms with low-latency models&lt;/li&gt;
&lt;li&gt;TTS first-audio-chunk — 60–100ms with streaming&lt;/li&gt;
&lt;li&gt;Network transport and jitter buffer — 20–60ms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total target: 320–560ms.&lt;/strong&gt; That is achievable. The mistakes that push it over 1,000ms are predictable and avoidable.&lt;/p&gt;
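
&lt;p&gt;The total is simply the sum of the per-layer ranges listed above. A short sketch of that arithmetic, plus a helper that flags which layers exceed their upper bound in a measured call; the dictionary keys and function names are illustrative, and the sample measurements are invented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
# (low, high) latency budget per layer, in milliseconds, from the list above
BUDGET_MS: dict[str, tuple[int, int]] = {
    "vad": (10, 30),
    "stt": (80, 120),
    "llm_first_token": (150, 250),
    "tts_first_chunk": (60, 100),
    "network": (20, 60),
}


def total_budget(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every layer's range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms: dict[str, float],
                budget: dict[str, tuple[int, int]]) -> list[str]:
    """Layers whose measured latency exceeds the top of their budgeted range."""
    return [layer for layer, ms in measured_ms.items() if ms > budget[layer][1]]


total = total_budget(BUDGET_MS)

# Hypothetical measurements for one call: batch STT blows the budget on its own
measured = {"vad": 25, "stt": 700, "llm_first_token": 180,
            "tts_first_chunk": 90, "network": 40}
offenders = over_budget(measured, BUDGET_MS)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;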

&lt;h2&gt;
  
  
  Latency Budget by Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Voice Activity Detection (10–30ms)
&lt;/h3&gt;

&lt;p&gt;VAD decides when the user has stopped speaking and the pipeline should fire. A misconfigured VAD is the single easiest way to add 500ms of latency without touching any model. Most implementations default to a trailing silence window of 500–800ms — that pause sits entirely in the user experience before a single API call fires.&lt;/p&gt;

&lt;p&gt;In production, configure VAD with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Silence threshold: 300ms for call center contexts, 200ms for high-tempo applications&lt;/li&gt;
&lt;li&gt;Endpointing: fire on silence, not on a fixed timer&lt;/li&gt;
&lt;li&gt;Echo cancellation: required whenever the agent speaks; browser &lt;code&gt;getUserMedia&lt;/code&gt; handles this with &lt;code&gt;echoCancellation: true&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deepgram's streaming STT includes built-in VAD endpointing via &lt;code&gt;endpointing=300&lt;/code&gt; — use this rather than a separate VAD layer, as it eliminates an additional round-trip.&lt;/p&gt;

&lt;h3&gt;
  
  
  STT: Streaming Transcription (80–120ms)
&lt;/h3&gt;

&lt;p&gt;Batch transcription — send audio, wait for full transcript — adds 600–1,200ms before your LLM call even starts. This alone makes sub-300ms unreachable. The solution is streaming STT with interim results.&lt;/p&gt;

&lt;p&gt;Deepgram Nova-2 delivers streaming transcription with a first-word latency around 80ms over WebSocket. You do not wait for the complete transcript; you begin processing on &lt;code&gt;is_final: true&lt;/code&gt; utterances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User audio → WebSocket → Deepgram Nova-2 (streaming)
                              ↓
                    interim results (ignored)
                              ↓
                    is_final: true → LLM pipeline fires
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Critical configuration: &lt;code&gt;punctuate=true&lt;/code&gt;, &lt;code&gt;smart_format=true&lt;/code&gt;, and &lt;code&gt;endpointing=300&lt;/code&gt;. Without &lt;code&gt;endpointing&lt;/code&gt; set explicitly, Deepgram falls back to its server-side silence-detection default, which may not match your VAD window.&lt;/p&gt;
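&lt;p&gt;As a rough illustration, the sketch below builds the streaming URL with those query parameters and filters incoming messages down to final results. It assumes Deepgram's &lt;code&gt;/v1/listen&lt;/code&gt; WebSocket endpoint and a result payload shaped as &lt;code&gt;channel.alternatives[0].transcript&lt;/code&gt;; verify both against the current API reference. The function names are hypothetical, and the WebSocket client itself is left out.&lt;/p&gt;

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed streaming endpoint; check the provider docs before relying on it.
DEEPGRAM_WS = "wss://api.deepgram.com/v1/listen"


def build_listen_url(endpointing_ms: int = 300) -> str:
    """Build the streaming URL with the configuration discussed above."""
    params = {
        "model": "nova-2",
        "punctuate": "true",
        "smart_format": "true",
        "interim_results": "true",
        "endpointing": str(endpointing_ms),
    }
    return f"{DEEPGRAM_WS}?{urlencode(params)}"


def final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript for non-empty final results, otherwise None."""
    message = json.loads(raw_message)
    if not message.get("is_final"):
        return None  # interim result or non-transcript message: ignore
    alternatives = message.get("channel", {}).get("alternatives", [])
    text = alternatives[0].get("transcript", "") if alternatives else ""
    return text or None  # empty finals (pure silence) should not fire the LLM


if __name__ == "__main__":
    print(build_listen_url())
    sample = {"is_final": True, "channel": {"alternatives": [{"transcript": "Book a table."}]}}
    print(final_transcript(json.dumps(sample)))  # Book a table.
```

&lt;p&gt;Dropping empty finals matters in practice: a silent stretch can produce a final result with no text, and sending that to the LLM wastes a turn.&lt;/p&gt;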

&lt;h3&gt;
  
  
  LLM Reasoning (150–250ms)
&lt;/h3&gt;

&lt;p&gt;LLM first-token latency is the hardest constraint to optimize. GPT-4 in streaming mode cannot reliably hit sub-200ms first-token in typical network conditions. The model choices that achieve 150–250ms in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o-mini&lt;/strong&gt; — ~150ms first-token median; suitable for most voice turn completions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt; — ~200–300ms first-token; higher quality for complex reasoning turns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt; — ~120–180ms first-token; strong instruction-following, well-suited for structured voice turns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq-hosted Llama&lt;/strong&gt; — sub-100ms first-token via custom hardware; lower model quality ceiling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stream the response. Pass tokens to TTS as they arrive — do not buffer the full LLM output before starting TTS synthesis. The overlap between LLM generation and TTS synthesis recovers 100–200ms of total latency.&lt;/p&gt;
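&lt;p&gt;One common way to get that overlap is to flush the token stream to TTS at sentence boundaries instead of waiting for the full completion. The sketch below shows the idea with plain strings; &lt;code&gt;sentence_chunks&lt;/code&gt; and the &lt;code&gt;min_chars&lt;/code&gt; threshold are illustrative choices, not a library API.&lt;/p&gt;

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS.

    A chunk is emitted as soon as the buffer ends with sentence punctuation
    and is at least `min_chars` long, so TTS can start on the first sentence
    while the LLM is still generating the rest.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text.endswith(SENTENCE_END) and len(text) >= min_chars:
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", ".", " What", " time", " works", "?"]
    for chunk in sentence_chunks(stream):
        print(chunk)
```

&lt;p&gt;The minimum length keeps very short fragments such as "Hi." from being synthesized on their own, which tends to sound choppy.&lt;/p&gt;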

&lt;p&gt;Prompt engineering for voice: system prompts should be shorter than for text chatbots. Strip all markdown formatting instructions — the output goes to TTS and formatted text degrades audio. Keep total context under 2,000 tokens where possible; token count has a near-linear relationship with first-token latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  TTS: Streaming Synthesis (60–100ms)
&lt;/h3&gt;

&lt;p&gt;ElevenLabs streaming delivers first-audio-chunk in 60–100ms on their Flash tier versus 200–400ms on standard. The difference is significant enough that choosing the wrong tier consumes your entire latency budget on TTS alone.&lt;/p&gt;

&lt;p&gt;Use streaming TTS: do not wait for the complete audio file before playback. The client should begin playing as soon as the first audio chunk arrives. For browser clients, the Web Audio API handles chunked playback natively; for telephony, use RTP packetization.&lt;/p&gt;

&lt;p&gt;The TTS configuration that matters for latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model: &lt;code&gt;eleven_flash_v2_5&lt;/code&gt; for minimum latency&lt;/li&gt;
&lt;li&gt;Streaming: set &lt;code&gt;stream=true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output format: &lt;code&gt;pcm_16000&lt;/code&gt; for telephony, &lt;code&gt;mp3_44100_128&lt;/code&gt; for browser&lt;/li&gt;
&lt;li&gt;Streaming latency optimization: &lt;code&gt;optimize_streaming_latency=4&lt;/code&gt; (aggressive mode)&lt;/li&gt;
&lt;/ul&gt;
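&lt;p&gt;The settings above can be collected into a single request description. This is a sketch only: it assumes the ElevenLabs REST streaming route &lt;code&gt;/v1/text-to-speech/{voice_id}/stream&lt;/code&gt;, where streaming is selected by the path, and it builds the request without sending it. The helper name and its return shape are hypothetical; confirm parameter names against the current API reference.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode


def build_tts_request(text: str, voice_id: str, api_key: str, telephony: bool = False) -> dict:
    """Describe a streaming TTS request using the latency settings listed above."""
    query = urlencode({
        # pcm_16000 for telephony, mp3_44100_128 for browser playback
        "output_format": "pcm_16000" if telephony else "mp3_44100_128",
        "optimize_streaming_latency": 4,
    })
    return {
        "method": "POST",
        # Assumed route; the voice_id is supplied by the caller.
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream?{query}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({"text": text, "model_id": "eleven_flash_v2_5"}),
    }


if __name__ == "__main__":
    request = build_tts_request("Your appointment is confirmed.", "VOICE_ID", "API_KEY", telephony=True)
    print(request["url"])
```

&lt;p&gt;Whatever HTTP client you use, read the response body incrementally and forward each chunk to playback as it arrives.&lt;/p&gt;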

&lt;h3&gt;
  
  
  Network Transport (20–60ms)
&lt;/h3&gt;

&lt;p&gt;With a well-configured WebRTC connection, transport adds 20–40ms round-trip. With a WebSocket-only approach through a distant cloud region, transport alone can add 200ms in the tail. This is where the transport choice has the most impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Stack Architecture
&lt;/h2&gt;

&lt;p&gt;The production architecture for a sub-300ms voice AI agent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrutuacfnh1vtma5kmvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrutuacfnh1vtma5kmvk.png" alt="Full Stack Arch" width="799" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent worker sits between the media plane and the model APIs. It receives raw audio frames from LiveKit, streams them to Deepgram, fires the LLM on final utterances, and pushes TTS audio frames back into the LiveKit room. The client never calls model APIs directly — this is essential for PII control and rate-limit management.&lt;/p&gt;

&lt;h2&gt;
  
  
  ICE Trickle and the LiveKit SFU Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why ICE Trickle Matters
&lt;/h3&gt;

&lt;p&gt;WebRTC connection establishment uses Interactive Connectivity Establishment (ICE) to find a network path between peers. In the naive implementation — wait for all ICE candidates before signaling — setup latency adds 500–2,000ms to every call start. This is invisible in demos and very visible in production.&lt;/p&gt;

&lt;p&gt;ICE Trickle solves this: candidates are sent to the remote peer as they are gathered, and connectivity checks begin immediately. Call setup time drops to 100–400ms in most network conditions.&lt;/p&gt;

&lt;p&gt;LiveKit implements ICE Trickle automatically. What you need to deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STUN servers&lt;/strong&gt; — used for reflexive candidate discovery; &lt;code&gt;stun.l.google.com:19302&lt;/code&gt; works for most cases; deploy your own for HIPAA environments to keep traffic off third-party infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TURN servers&lt;/strong&gt; — required for clients behind symmetric NAT, common in enterprise networks; LiveKit's hosted tier includes TURN, or deploy coturn yourself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signaling&lt;/strong&gt; — LiveKit's built-in signaling server handles offer/answer exchange; no separate WebSocket signaling server required&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LiveKit SFU Pattern
&lt;/h3&gt;

&lt;p&gt;A Selective Forwarding Unit receives encoded media streams and forwards them to participants without decoding and re-encoding. For voice AI, this matters because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent worker receives RTP packets from the SFU rather than raw WebRTC — simpler to handle in server-side Python or Node.js code&lt;/li&gt;
&lt;li&gt;Multiple agents or observers can subscribe to the same audio stream without additional encoding cost&lt;/li&gt;
&lt;li&gt;The SFU handles DTLS/SRTP complexity; the agent sees plain RTP internally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LiveKit room model maps cleanly to a voice call session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rtc&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track_subscribed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;rtc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrackKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KIND_AUDIO&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;audio_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rtc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AudioStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;process_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rtc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AudioStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rtc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Room&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LiveKit's agent framework handles room lifecycle, track subscription, and RTP framing. Application code focuses on pipeline logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebRTC vs SIP: Which Transport to Use
&lt;/h2&gt;

&lt;p&gt;This is the question that trips up most teams evaluating voice AI infrastructure. They are not competing choices — they solve different integration problems.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtr3c1pq48zh0oddy4ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtr3c1pq48zh0oddy4ng.png" alt="Which Transport to Use" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use WebRTC when&lt;/strong&gt; you control the client — a web app, mobile app, or embedded SDK. It gives you wideband Opus audio (meaningfully better STT accuracy), lower setup latency, and direct control over the media path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use SIP when&lt;/strong&gt; the caller is on a real phone number — inbound calls to a support line, outbound dialer campaigns, or integration with an existing contact center (Genesys, Five9, Twilio PSTN). Twilio's Media Streams provides a WebSocket bridge from PSTN to your agent worker, which avoids running a full SIP stack yourself.&lt;/p&gt;

&lt;p&gt;The G.711 codec limitation of PSTN calls has an underappreciated consequence: STT accuracy on 8kHz narrowband audio is meaningfully lower than on 16kHz+ wideband. For healthcare or fintech agents where transcription accuracy directly affects outcomes, browser/mobile WebRTC with Opus gives a material accuracy advantage over telephone calls.&lt;/p&gt;

&lt;p&gt;A production voice AI architecture typically uses both: WebRTC for app callers and a SIP trunk or Twilio Media Streams for inbound phone calls, with the same agent worker behind both paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: What to Instrument
&lt;/h2&gt;

&lt;p&gt;Voice AI pipelines fail silently. A WebRTC ICE failure looks like a dropped call. A Deepgram WebSocket disconnect looks like the agent not hearing the user. A TTS timeout manifests as silence on the line. Without structured observability, every incident is a multi-hour debugging session across three services.&lt;/p&gt;

&lt;p&gt;Instrument the following at minimum:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-call latency histogram&lt;/strong&gt; — record wall-clock time from VAD endpoint event to first TTS audio chunk, broken down by component: &lt;code&gt;stt_latency_ms&lt;/code&gt;, &lt;code&gt;llm_first_token_ms&lt;/code&gt;, &lt;code&gt;tts_first_chunk_ms&lt;/code&gt;. Alert on p95 &amp;gt; 800ms for any single component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-call transcription confidence&lt;/strong&gt; — Deepgram returns a &lt;code&gt;confidence&lt;/code&gt; score per utterance. Log confidence distributions; a degradation in median confidence correlates with audio quality issues, codec mismatches, or background noise problems before callers start complaining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebRTC ICE connection state&lt;/strong&gt; — log ICE state transitions (checking → connected → disconnected → failed). Track &lt;code&gt;failed&lt;/code&gt; rates by client region. Elevated failure rates in a specific geography usually indicate TURN server coverage gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STT WebSocket reconnections&lt;/strong&gt; — Deepgram WebSocket connections drop under load or network events. Count reconnections per call. A call with 3+ reconnections will have visible transcription gaps; flag and review these separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM error rates&lt;/strong&gt; — log 4xx/5xx rates from your LLM provider independently from total call failure. A 429 spike during peak hours needs a different response (add capacity, queue calls) than a 500 (inspect payloads, contact provider).&lt;/p&gt;

&lt;p&gt;Use structured logging with a &lt;code&gt;call_id&lt;/code&gt; field on every log event. Voice AI incidents always span Deepgram, your agent worker, and your SFU. Without a consistent &lt;code&gt;call_id&lt;/code&gt;, joining those log lines across services is impossible.&lt;/p&gt;
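&lt;p&gt;A minimal sketch of both ideas, using only the standard library: one JSON log line per event carrying the &lt;code&gt;call_id&lt;/code&gt;, and a nearest-rank p95 check per component. The function names and the log schema beyond &lt;code&gt;call_id&lt;/code&gt; and the metric names above are assumptions; in production you would normally hand this to your metrics and logging stack instead.&lt;/p&gt;

```python
import json
import time


def log_event(call_id: str, event: str, **fields) -> str:
    """Emit one JSON log line; every line carries the call_id for cross-service joins."""
    record = {"ts": time.time(), "call_id": call_id, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def components_over_threshold(latencies: dict[str, list[float]],
                              threshold_ms: float = 800) -> list[str]:
    """Return the components whose p95 latency exceeds the alert threshold."""
    return [name for name, values in latencies.items() if p95(values) > threshold_ms]


if __name__ == "__main__":
    log_event("call-123", "llm_first_token", llm_first_token_ms=182)
    window = {
        "stt_latency_ms": [90.0] * 20,
        "tts_first_chunk_ms": [100.0] * 18 + [900.0] * 2,
    }
    print(components_over_threshold(window))  # ['tts_first_chunk_ms']
```

&lt;p&gt;Breaking the alert down per component is what turns "the call felt slow" into "TTS first chunk regressed", which is the difference between a five-minute and a multi-hour investigation.&lt;/p&gt;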

</description>
      <category>livekit</category>
      <category>agents</category>
      <category>ai</category>
      <category>webrtc</category>
    </item>
    <item>
      <title>How to Deploy on Air-Gapped AWS EKS for Regulated Financial Services</title>
      <dc:creator>Dishant Sethi</dc:creator>
      <pubDate>Wed, 27 May 2026 16:05:29 +0000</pubDate>
      <link>https://dev.to/dishant_sethi/how-to-deploy-on-air-gapped-aws-eks-for-regulated-financial-services-211g</link>
      <guid>https://dev.to/dishant_sethi/how-to-deploy-on-air-gapped-aws-eks-for-regulated-financial-services-211g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://www.prodinit.com/blog/air-gapped-eks-deployment-fintech" rel="noopener noreferrer"&gt;prodinit.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Financial services data breaches cost an average of &lt;strong&gt;$6.08 million per incident&lt;/strong&gt;, 22% above the global average across all industries (&lt;a href="https://www.ibm.com/think/insights/cost-of-a-data-breach-2024-financial-industry" rel="noopener noreferrer"&gt;IBM Cost of a Data Breach 2024&lt;/a&gt;). For regulated institutions, the answer isn't just better firewalls. It's network architecture that eliminates the attack surface at the infrastructure level.&lt;/p&gt;

&lt;p&gt;Air-gapped AWS EKS deployments — where private subnets have zero internet egress and all traffic routes through VPC endpoints — are becoming the standard for regulated financial services workloads. This guide walks through the full architecture, from VPC design to CI/CD pipeline, based on a real deployment we executed for a fintech platform at Prodinit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Air-gapped EKS requires VPC interface and gateway endpoints for every AWS service your workloads touch — there's no fallback to the public internet&lt;/li&gt;
&lt;li&gt;Your CI/CD pipeline must be redesigned from scratch: images push to private ECR via VPC endpoint, and deployment runs through Systems Manager or a bastion inside the VPC&lt;/li&gt;
&lt;li&gt;Kubernetes External Secrets Operator + AWS Secrets Manager is the cleanest pattern for pod-level secret injection without exposing credentials in manifests&lt;/li&gt;
&lt;li&gt;Every data store (RDS, DynamoDB, ElastiCache) must live in private subnets, accessed via security group rules — no public endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Does "Air-Gapped" Mean in AWS Context?
&lt;/h2&gt;

&lt;p&gt;An air-gapped VPC means your private subnets have no route to the internet: the private route tables contain no NAT Gateway route and no internet gateway route. All communication between your workloads and AWS services (S3, ECR, Secrets Manager, CloudWatch, Bedrock) must route through &lt;strong&gt;VPC endpoints&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AWS supports two endpoint types (&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/gateway-endpoints.html" rel="noopener noreferrer"&gt;AWS VPC Endpoints documentation&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gateway endpoints&lt;/strong&gt; — for S3 and DynamoDB only; free, added as route table entries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface endpoints&lt;/strong&gt; — for all other AWS services via AWS PrivateLink; billed per hour per AZ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a typical EKS deployment, you'll need interface endpoints for: ECR API, ECR Docker, S3 (or gateway), Secrets Manager, Systems Manager, CloudWatch Logs, STS, ELB, Bedrock (if using AI services), and Transcribe (if using speech-to-text).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters for Regulated Workloads
&lt;/h3&gt;

&lt;p&gt;Network isolation is a hard requirement under FFIEC guidelines and SEC cybersecurity rules for financial institutions. An air-gapped VPC enforces this at the infrastructure layer — there's no misconfigured security group that can accidentally allow outbound internet access, because the route simply doesn't exist.&lt;/p&gt;

&lt;blockquote&gt;


&lt;p&gt;On the client deployment, we discovered mid-project that several Helm charts we'd planned to use for in-cluster controllers (AWS Load Balancer Controller, cluster-autoscaler) attempt to pull their own images from public registries at install time. We had to mirror every controller image into private ECR before the cluster could bootstrap. This is a class of problem that only surfaces when you actually try to deploy, not in planning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How Do You Design a Zero-Egress VPC for AWS EKS?
&lt;/h2&gt;

&lt;p&gt;Regulated financial services environments under FFIEC and SEC cybersecurity guidelines require network isolation enforced at the infrastructure level — not as a policy overlay but as a structural property of the network. A multi-AZ VPC with private subnets carrying no internet route, combined with VPC interface endpoints for every AWS service, eliminates the outbound internet path entirely rather than restricting it.&lt;/p&gt;

&lt;p&gt;Start with a multi-AZ VPC with distinct public and private subnet tiers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subnet Design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC: 10.0.0.0/16
├── Public Subnets (one per AZ)
│   ├── 10.0.1.0/24 (us-east-1a)
│   ├── 10.0.2.0/24 (us-east-1b)
│   └── NAT Gateways (for public-subnet resources only)
└── Private Subnets (one per AZ)
    ├── 10.0.10.0/24 (us-east-1a)
    └── 10.0.11.0/24 (us-east-1b)
        — No internet route in route table
        — VPC endpoints attached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Private subnet route tables should contain exactly two entries: the local VPC CIDR, and gateway endpoint routes for S3/DynamoDB. Nothing else.&lt;/p&gt;
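&lt;p&gt;Both properties are cheap to check before provisioning. The sketch below uses Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module to confirm the subnets from the layout above sit inside the VPC CIDR without overlapping, and flags any default route in a route table. The subnet labels and helper names are illustrative; the route dictionaries assume the &lt;code&gt;DestinationCidrBlock&lt;/code&gt; field that EC2 route listings return.&lt;/p&gt;

```python
import ipaddress
from itertools import combinations

VPC_CIDR = "10.0.0.0/16"
SUBNETS = {
    "public-us-east-1a": "10.0.1.0/24",
    "public-us-east-1b": "10.0.2.0/24",
    "private-us-east-1a": "10.0.10.0/24",
    "private-us-east-1b": "10.0.11.0/24",
}


def validate_layout(vpc_cidr: str, subnets: dict[str, str]) -> list[str]:
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = [f"{name} is outside {vpc}" for name, net in nets.items()
                if not net.subnet_of(vpc)]
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


def internet_routes(routes: list[dict]) -> list[dict]:
    """Return default routes; a private route table should have none."""
    return [route for route in routes if route.get("DestinationCidrBlock") == "0.0.0.0/0"]


if __name__ == "__main__":
    print(validate_layout(VPC_CIDR, SUBNETS))  # []
    private_routes = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]
    print(internet_routes(private_routes))     # []
```

&lt;p&gt;A check like &lt;code&gt;internet_routes&lt;/code&gt; is worth running continuously, not just at build time, since a single added default route silently ends the air gap.&lt;/p&gt;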

&lt;h3&gt;
  
  
  VPC Endpoints to Provision
&lt;/h3&gt;

&lt;p&gt;Create interface endpoints in your private subnets for each required AWS service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ECR — required for image pulls&lt;/span&gt;
aws ec2 create-vpc-endpoint &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.ecr.api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Interface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-ids&lt;/span&gt; subnet-xxx subnet-yyy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-endpoints

aws ec2 create-vpc-endpoint &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.ecr.dkr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Interface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-ids&lt;/span&gt; subnet-xxx subnet-yyy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a dedicated security group for endpoints that allows HTTPS (443) inbound from your VPC CIDR. Don't open it wider than needed.&lt;/p&gt;
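&lt;p&gt;With ten or so services, repeating the CLI call by hand gets error-prone. One option is to generate the endpoint definitions from a list, as in this sketch. The service list mirrors the one given earlier but is an assumption you should adapt to your workloads, the IDs are placeholders, and the dictionary keys follow the parameter names of the EC2 &lt;code&gt;CreateVpcEndpoint&lt;/code&gt; API so each entry could be passed to an SDK call.&lt;/p&gt;

```python
# Service suffixes for com.amazonaws.REGION.SERVICE; adjust to what your workloads touch.
INTERFACE_SERVICES = [
    "ecr.api", "ecr.dkr", "secretsmanager", "ssm",
    "logs", "sts", "elasticloadbalancing",
]
GATEWAY_SERVICES = ["s3", "dynamodb"]


def endpoint_plan(region: str, vpc_id: str, subnet_ids: list[str],
                  security_group_id: str, route_table_ids: list[str]) -> list[dict]:
    """Build one endpoint definition per required service."""
    plan = []
    for service in INTERFACE_SERVICES:
        plan.append({
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": "Interface",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": [security_group_id],
            "PrivateDnsEnabled": True,  # lets SDKs resolve the normal service hostname
        })
    for service in GATEWAY_SERVICES:
        plan.append({
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": "Gateway",
            "RouteTableIds": list(route_table_ids),  # gateway endpoints attach to route tables
        })
    return plan


if __name__ == "__main__":
    plan = endpoint_plan("us-east-1", "vpc-xxx", ["subnet-xxx", "subnet-yyy"],
                         "sg-endpoints", ["rtb-private"])
    for entry in plan:
        print(entry["VpcEndpointType"], entry["ServiceName"])
```

&lt;p&gt;Keeping the list in code also gives you a single place to review when a new AWS dependency is added to a workload.&lt;/p&gt;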

&lt;h3&gt;
  
  
  IAM Roles
&lt;/h3&gt;

&lt;p&gt;Set up three distinct IAM role categories before touching EKS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Developer access role&lt;/strong&gt; — scoped to read operations, no production deploy permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD role&lt;/strong&gt; — ECR push, EKS &lt;code&gt;kubectl apply&lt;/code&gt;, Secrets Manager read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node instance role&lt;/strong&gt; — ECR pull, CloudWatch logging, S3 read for application buckets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use IAM Roles for Service Accounts (IRSA) for in-cluster components. This ties Kubernetes service accounts to IAM roles without storing credentials anywhere.&lt;/p&gt;
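&lt;p&gt;The link between a service account and a role lives in the role's trust policy. The sketch below generates one for a single service account; the account ID, OIDC provider ID, namespace, and service account name are all placeholders, and the helper name is hypothetical.&lt;/p&gt;

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy letting exactly one Kubernetes service account assume the role.

    `oidc_provider` is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Pin the role to one namespace and service account.
                    f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="123456789012",                                      # placeholder
        oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",    # placeholder
        namespace="kube-system",
        service_account="cluster-autoscaler",
    )
    print(json.dumps(policy, indent=2))
```

&lt;p&gt;Note that IRSA calls STS from inside the pod, which is one reason the STS interface endpoint is on the required list in an air-gapped VPC.&lt;/p&gt;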

&lt;h2&gt;
  
  
  How Do You Run an EKS Cluster with No Public Internet Path?
&lt;/h2&gt;

&lt;p&gt;As of 2024, 80% of organizations run Kubernetes in production (&lt;a href="https://www.cncf.io/reports/cncf-annual-survey-2024/" rel="noopener noreferrer"&gt;CNCF Annual Survey 2024&lt;/a&gt;). The hard part isn't Kubernetes — it's running it without any public internet path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Creation
&lt;/h3&gt;

&lt;p&gt;When creating the EKS cluster, set the API server endpoint access to &lt;strong&gt;private only&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl create cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; fintech-prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-private-subnets&lt;/span&gt; subnet-xxx,subnet-yyy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-private-networking&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint-private-access&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint-public-access&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;--endpoint-public-access false&lt;/code&gt;, &lt;code&gt;kubectl&lt;/code&gt; only works from inside the VPC. This is intentional. Access the cluster via a bastion host or AWS Systems Manager Session Manager.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Groups
&lt;/h3&gt;

&lt;p&gt;Place node groups in private subnets with autoscaling enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# nodegroup.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fintech-prod&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m6i.xlarge"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m6i.2xlarge"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;privateNetworking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;iam&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;withAddonPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;autoScaler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;cloudWatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bootstrapping In-Cluster Controllers
&lt;/h3&gt;

&lt;p&gt;This is where most air-gapped deployments stall. cluster-autoscaler and the AWS Load Balancer Controller both try to pull images from public registries during &lt;code&gt;helm install&lt;/code&gt;. You must mirror them to ECR first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull, retag, and push to private ECR&lt;/span&gt;
docker pull registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
docker tag registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0 &lt;span class="se"&gt;\&lt;/span&gt;
  123456789.dkr.ecr.us-east-1.amazonaws.com/cluster-autoscaler:v1.29.0
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/cluster-autoscaler:v1.29.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Override the image in Helm values to point to your private ECR before installing.&lt;/p&gt;
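&lt;p&gt;When there are many controller images to mirror, it helps to derive the private reference and the three Docker commands from the public reference. This is a small sketch of that mapping; it keeps only the final path segment as the ECR repository name, which matches the example above but is a naming choice, and the account ID is a placeholder.&lt;/p&gt;

```python
def mirror_target(source_image: str, account_id: str, region: str) -> str:
    """Map a public image reference to a private ECR reference, keeping name and tag."""
    name_and_tag = source_image.rsplit("/", 1)[-1]  # e.g. "cluster-autoscaler:v1.29.0"
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{name_and_tag}"


def mirror_commands(source_image: str, account_id: str, region: str) -> list[str]:
    """Return the pull, tag, and push commands for one image."""
    target = mirror_target(source_image, account_id, region)
    return [
        f"docker pull {source_image}",
        f"docker tag {source_image} {target}",
        f"docker push {target}",
    ]


if __name__ == "__main__":
    images = ["registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0"]
    for image in images:
        for command in mirror_commands(image, "123456789012", "us-east-1"):
            print(command)
```

&lt;p&gt;Run the generated commands from a machine that has internet access, and make sure the ECR repository exists first, since a push to a missing repository fails.&lt;/p&gt;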

&lt;h2&gt;
  
  
  How Do You Build a CI/CD Pipeline Without Internet Access?
&lt;/h2&gt;

&lt;p&gt;Standard GitHub Actions hosted runners and most CI/CD platforms assume outbound internet access for image pulls and API calls — assumptions that silently break the moment you remove internet egress. The working architecture requires a self-hosted runner deployed inside the VPC, all image pushes to private ECR via VPC endpoint, and all cluster deployments executed through AWS Systems Manager Session Manager with zero inbound ports.&lt;/p&gt;

&lt;p&gt;GitHub Actions' hosted runners can't reach a private EKS API endpoint, and CodePipeline agents can't pull from Docker Hub. You need a fundamentally different pipeline architecture.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer pushes code
        ↓
GitHub Actions (or CodePipeline)
        ↓
Build image in CI environment (with internet access)
        ↓
Push image to Private ECR via VPC endpoint
        ↓
Trigger deployment (CodePipeline or self-hosted runner in VPC)
        ↓
kubectl/Helm apply via Systems Manager Session Manager
        ↓
EKS pulls image from private ECR (no internet needed)
        ↓
ALB routes traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Self-Hosted Runner in VPC
&lt;/h3&gt;

&lt;p&gt;If using GitHub Actions, deploy a self-hosted runner inside the VPC. It can reach the private EKS API endpoint and ECR via VPC endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;  &lt;span class="c1"&gt;# runner inside VPC&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Check out the repo so the ./helm/my-app chart path exists on the runner&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# 123456789012 is a placeholder 12-digit account ID&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/cicd-deploy-role&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to ECR&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws ecr get-login-password | docker login \&lt;/span&gt;
            &lt;span class="s"&gt;--username AWS \&lt;/span&gt;
            &lt;span class="s"&gt;--password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to EKS&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws eks update-kubeconfig --name fintech-prod&lt;/span&gt;
          &lt;span class="s"&gt;helm upgrade --install my-app ./helm/my-app \&lt;/span&gt;
            &lt;span class="s"&gt;--set image.tag=${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  First Deployment Validation
&lt;/h3&gt;

&lt;p&gt;Don't call the pipeline "working" until you've traced the full path end to end: image push → ECR → EKS pod pull → running pod → ALB health check → live traffic. Each hop can fail independently in an air-gapped setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Should Data Stores Be Configured in an Air-Gapped VPC?
&lt;/h2&gt;

&lt;p&gt;Every data store in a regulated EKS deployment (RDS PostgreSQL, DynamoDB, and ElastiCache Redis) must be reachable only over private network paths, with no public endpoint exposure. For RDS, set &lt;code&gt;publicly_accessible = false&lt;/code&gt; explicitly at the resource level rather than relying on security group rules alone. Security groups can be modified later; with the flag off, the instance is never assigned a public IP address and its endpoint resolves only to a private address, so the exposure stays closed regardless of any future policy drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  RDS PostgreSQL with pgvector
&lt;/h3&gt;

&lt;p&gt;For AI-augmented fintech applications, pgvector enables vector similarity search inside Postgres — useful for semantic search over transaction data, document embeddings, or fraud pattern matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;identifier&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"fintech-postgres"&lt;/span&gt;
  &lt;span class="nx"&gt;engine&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt;
  &lt;span class="nx"&gt;engine_version&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"16.1"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.r6g.large"&lt;/span&gt;
  &lt;span class="nx"&gt;multi_az&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;db_subnet_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_db_subnet_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;publicly_accessible&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;storage_encrypted&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Custom parameter group for tuning; pgvector itself is enabled with CREATE EXTENSION, not a parameter&lt;/span&gt;
  &lt;span class="nx"&gt;parameter_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_db_parameter_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_pgvector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



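&lt;p&gt;Install the extension after provisioning. Build the IVFFlat index once the table already holds representative data, because the index clusters the rows that exist at build time:&lt;br&gt;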
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ElastiCache Redis and DynamoDB
&lt;/h3&gt;

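&lt;p&gt;Both are reached only over private paths. ElastiCache Redis requires a subnet group scoped to private subnets. DynamoDB is a regional service rather than a VPC resource, so traffic reaches it through a gateway endpoint attached to the private route tables. Gateway endpoints carry no additional charge, and no interface endpoint is needed.&lt;/p&gt;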

&lt;h2&gt;
  
  
  How Do You Manage Secrets and Security Without Exposing Credentials?
&lt;/h2&gt;

&lt;p&gt;Kubernetes Secret objects are base64-encoded, not encrypted by default: anyone with RBAC read access to Secrets can decode them with a single command. In regulated environments, the open-source External Secrets Operator addresses this by syncing credentials from AWS Secrets Manager into Kubernetes Secrets on a refresh interval, so Secrets Manager remains the source of truth. Credentials never appear in manifest files, Git history, or container image layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS WAF on the ALB
&lt;/h3&gt;

&lt;p&gt;Attach a Web ACL to your Application Load Balancer with at minimum:&lt;/p&gt;

&lt;ul&gt;
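&lt;li&gt;
&lt;strong&gt;Core Rule Set (CRS)&lt;/strong&gt;: covers commonly exploited web vulnerabilities, including many of the OWASP Top 10 categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Known Bad Inputs&lt;/strong&gt;: blocks request patterns associated with exploitation or vulnerability discovery, such as common injection payloads&lt;/li&gt;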
&lt;/ul&gt;

&lt;p&gt;The ALB sits in public subnets (it receives external traffic), but the security group only allows 443 inbound. Backend EKS nodes only allow traffic from the ALB security group.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Secrets Without Kubernetes Secrets
&lt;/h3&gt;

&lt;p&gt;Storing secrets as Kubernetes Secret objects is fine for development, but they're base64-encoded, not encrypted, and cluster admins can read them. In a regulated environment, use &lt;strong&gt;External Secrets Operator&lt;/strong&gt; to pull from AWS Secrets Manager instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ExternalSecret — pulls from Secrets Manager into a Kubernetes Secret&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials&lt;/span&gt;
    &lt;span class="na"&gt;creationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fintech/prod/db&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods mount the resulting Kubernetes Secret normally — but the actual credential lives in Secrets Manager, has rotation enabled, and never touches a manifest file.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS and HTTPS Enforcement
&lt;/h3&gt;

&lt;p&gt;Provision an ACM certificate for your domain and configure the ALB to redirect HTTP to HTTPS. Set HTTPS-only enforcement at the ALB listener level — don't rely on application code to enforce it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do You Run Amazon Bedrock and Transcribe in an Air-Gapped Cluster?
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock and Amazon Transcribe both support VPC interface endpoints, meaning all LLM inference and speech-to-text requests from private EKS workloads never leave the AWS network. For regulated industries, this keeps AI processing within the same network boundary as the rest of the application — data residency compliance is maintained without routing inference traffic through the public internet.&lt;/p&gt;

&lt;p&gt;The command below creates the interface endpoint for the Bedrock runtime. Transcribe follows the same pattern under its own service name (&lt;code&gt;com.amazonaws.us-east-1.transcribe&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bedrock VPC endpoint&lt;/span&gt;
aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.bedrock-runtime &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Interface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-ids&lt;/span&gt; subnet-xxx subnet-yyy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IAM policies for Bedrock should be scoped to specific model ARNs — don't grant &lt;code&gt;bedrock:*&lt;/code&gt; broadly.&lt;/p&gt;
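&lt;p&gt;As a rough illustration of that scoping, the sketch below builds a policy document that allows invocation of one named foundation model only. The region, the model ID, the statement ID, and the helper name &lt;code&gt;bedrock_invoke_policy&lt;/code&gt; are assumptions chosen for the example, not values from this deployment.&lt;/p&gt;

```python
import json

# Example values only; substitute the region and model IDs you have approved.
REGION = "us-east-1"
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"


def bedrock_invoke_policy(region: str, model_ids: list) -> dict:
    """Build an IAM policy that allows invoking only the listed foundation models."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                # Foundation-model ARNs leave the account field empty.
                "Resource": [
                    f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
                    for model_id in model_ids
                ],
            }
        ],
    }


policy = bedrock_invoke_policy(REGION, [MODEL_ID])
print(json.dumps(policy, indent=2))
```

&lt;p&gt;Listing model ARNs explicitly means that enabling a new model in the account does not silently widen what the workload can call.&lt;/p&gt;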

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Evaluate LLM Outputs: Building Evals That Actually Catch Regressions</title>
      <dc:creator>Dishant Sethi</dc:creator>
      <pubDate>Wed, 27 May 2026 15:50:09 +0000</pubDate>
      <link>https://dev.to/dishant_sethi/how-to-evaluate-llm-outputs-building-evals-that-actually-catch-regressions-511k</link>
      <guid>https://dev.to/dishant_sethi/how-to-evaluate-llm-outputs-building-evals-that-actually-catch-regressions-511k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://www.prodinit.com/blog/llm-evals" rel="noopener noreferrer"&gt;prodinit.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most LLM eval setups fail for three structural reasons: evaluating on metrics that don't reflect production failure modes, using golden datasets that have silently rotted, and running evals on a separate schedule from deployments&lt;/li&gt;
&lt;li&gt;The four-layer eval stack — unit, reference, rubric, and behavioral — catches different regression types; shipping without all four leaves blind spots&lt;/li&gt;
&lt;li&gt;GPT-4 as judge agrees with human experts 85% of the time on general tasks (&lt;a href="https://eugeneyan.com/writing/llm-evaluators/" rel="noopener noreferrer"&gt;Zheng et al., NeurIPS 2023&lt;/a&gt;), but that agreement drops to 60–68% in expert domains — calibrate before you trust it&lt;/li&gt;
&lt;li&gt;A February 2025 Amazon study found INT4 quantization caused a 39.46% accuracy drop on Llama-3.3 70B — silent regressions from "safe" model changes are real and statistically detectable (&lt;a href="https://arxiv.org/html/2602.10144" rel="noopener noreferrer"&gt;Kübler et al., arXiv 2025&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Block deployments on rubric regressions ≥2% relative to the last passing run; warn on everything else&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Most LLM Eval Setups Miss Regressions
&lt;/h2&gt;

&lt;p&gt;42% of companies abandoned the majority of their AI initiatives in 2025, up from 17% in 2024 (&lt;a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes" rel="noopener noreferrer"&gt;S&amp;amp;P Global Market Intelligence, 2025&lt;/a&gt;). The default explanation is ROI. The technical explanation, in most cases, is that the system shipped fine and then quietly got worse — and nobody caught it until a customer did.&lt;/p&gt;

&lt;p&gt;Three structural failure modes explain most missed regressions in production LLM systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 1: Proxy metrics that don't predict production failure.&lt;/strong&gt; Teams instrument BLEU score, exact match, or perplexity because those are easy to compute. A customer-facing summarisation model can maintain a BLEU score of 0.74 while its summaries become subtly contradictory after a retrieval change. BLEU measures token overlap; it doesn't measure factual consistency. The metric passed. The feature regressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 2: Golden datasets that have silently rotted.&lt;/strong&gt; A golden dataset built during initial evaluation captures the distribution of inputs that existed at that moment. Six months later, real traffic has drifted: new document formats, new query patterns, edge cases the original set never covered. Evaluating against a stale golden set produces a green score against a test that no longer represents the problem you're actually solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 3: Evals that don't run at deployment time.&lt;/strong&gt; Evaluation suites that run weekly, on a separate schedule from code deploys, detect regressions after they've been live for days. The culprit PR has already been merged and three others have been built on top of it. What you needed was a gate, not a report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four-Layer Eval Stack
&lt;/h2&gt;

&lt;p&gt;The single strongest change you can make to your eval setup is adding layers. Each layer catches different failure modes; each is cheap to run for what it surfaces. Shipping any one layer in isolation leaves a class of regression invisible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Unit Evals
&lt;/h3&gt;

&lt;p&gt;Unit evals test individual capabilities in isolation: does the model correctly extract a date from a structured input? Does it refuse an off-topic request? Does it stay within a 200-word limit when instructed to? These are deterministic — the answer is either correct or it isn't.&lt;/p&gt;

&lt;p&gt;Once the model output is captured, the checks themselves run in milliseconds, require no LLM calls for scoring, and give you a precise signal when a model update breaks a capability it previously had. They are the first gate in the pipeline: cheap to fail, cheap to fix.&lt;/p&gt;
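&lt;p&gt;A minimal sketch of two such checks, date extraction and the word limit, might look like the following. The function names, the ISO date format, and the sample outputs are assumptions made for illustration; real checks would run against captured model responses.&lt;/p&gt;

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extracts_date(output: str, expected: date) -> bool:
    """Pass only if the output contains the expected ISO date and no other date."""
    return ISO_DATE.findall(output) == [expected.isoformat()]


def within_word_limit(output: str, limit: int = 200) -> bool:
    """Pass if the output has at most `limit` whitespace-separated words."""
    return limit >= len(output.split())


def failed_checks(checks: dict) -> list:
    """Return the names of the checks whose result is False."""
    return [name for name, passed in checks.items() if not passed]


# Hypothetical model outputs captured for two unit cases.
date_output = "The invoice was issued on 2025-03-14."
summary_output = "Revenue grew while costs held steady."

results = {
    "date_extraction": extracts_date(date_output, date(2025, 3, 14)),
    "word_limit": within_word_limit(summary_output, limit=200),
}
print(failed_checks(results))  # an empty list means every check passed
```

&lt;p&gt;Because each check is a pure function of the output string, a failure points at one capability rather than at the model as a whole.&lt;/p&gt;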

&lt;h3&gt;
  
  
  Layer 2: Reference Evals
&lt;/h3&gt;

&lt;p&gt;Reference evals compare model output against a gold-standard answer using a similarity metric. They're appropriate when outputs have a correct or near-correct form: code generation, factual Q&amp;amp;A with a known answer, structured extraction against a schema.&lt;/p&gt;

&lt;p&gt;The weakness: reference evals degrade with output diversity. A model that answers correctly but in different words than the reference will score low. Use them where correctness has a tight definition. Avoid them for open-ended generation where paraphrase is acceptable.&lt;/p&gt;
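&lt;p&gt;To make the paraphrase weakness concrete, here is a small sketch of two common reference metrics: normalised exact match and token-level F1. The normalisation rules and the example strings are assumptions for illustration, not the metric definitions used by any particular framework.&lt;/p&gt;

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_normalized(prediction: str, reference: str) -> bool:
    """True when the two strings are identical after normalisation."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    ref_counts = Counter(ref_tokens)
    # Count shared tokens, capped by how often each appears in the reference.
    overlap = sum(
        min(count, ref_counts[token])
        for token, count in Counter(pred_tokens).items()
    )
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "The refund window is 30 days."
prediction = "Refunds are accepted within 30 days."  # correct, but reworded

print(exact_match_normalized(prediction, reference))  # False
print(round(token_f1(prediction, reference), 2))      # 0.33
```

&lt;p&gt;The reworded answer is correct, yet it fails exact match and shares only two of six tokens with the reference, which is why these metrics suit tightly defined outputs better than open-ended generation.&lt;/p&gt;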

&lt;h3&gt;
  
  
  Layer 3: Rubric Evals (LLM-as-Judge)
&lt;/h3&gt;

&lt;p&gt;Rubric evals ask a separate LLM to score the output against a defined rubric. This is the only practical approach for evaluating coherence, helpfulness, or factual consistency at scale, because human annotation doesn't scale to continuous deployment. &lt;a href="https://crfm.stanford.edu/helm/" rel="noopener noreferrer"&gt;Stanford's HELM benchmark&lt;/a&gt; takes a comparable multi-dimensional view at research scale, reporting seven evaluation metrics across 42 scenarios.&lt;/p&gt;

&lt;p&gt;Rubric evals are powerful but require calibration. See the LLM-as-Judge section below for the documented failure modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Behavioral Evals
&lt;/h3&gt;

&lt;p&gt;Behavioral evals test system-level properties that don't reduce to a single output score: does the system stay in character across a 10-turn conversation? Does it escalate correctly when the user indicates distress? Does retrieval-augmented generation cite only sources it actually retrieved?&lt;/p&gt;

&lt;p&gt;These require end-to-end test harnesses or carefully instrumented integration tests. They're more expensive to run but catch a class of regression that the other three layers cannot: failures that only manifest across interactions or under specific system conditions. They also run slower — which matters for your CI blocking policy, covered below.&lt;/p&gt;
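&lt;p&gt;The citation property mentioned above can be checked deterministically once the harness records which documents were retrieved. The sketch below assumes a hypothetical inline citation format of the form &lt;code&gt;[doc-12]&lt;/code&gt; and a set of retrieved document IDs; both are illustrative choices.&lt;/p&gt;

```python
import re

# Assumed citation format for this sketch: [doc-12], [doc-a7], and so on.
CITATION = re.compile(r"\[(doc-[\w-]+)\]")


def ungrounded_citations(answer: str, retrieved_ids: set) -> set:
    """Return cited document IDs that were not part of the retrieved set."""
    cited = set(CITATION.findall(answer))
    return cited - retrieved_ids


# Hypothetical end-to-end trace: what was retrieved and what the model wrote.
retrieved = {"doc-12", "doc-40"}
answer = "Refunds take five days [doc-12]. Fees are waived for members [doc-99]."

print(ungrounded_citations(answer, retrieved))  # {'doc-99'}
```

&lt;p&gt;An empty result means every citation points at something the system actually retrieved; anything else is a behavioral failure regardless of how fluent the answer reads.&lt;/p&gt;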

&lt;h2&gt;
  
  
  Golden Datasets: How They Rot and How to Refresh Them
&lt;/h2&gt;

&lt;p&gt;A golden dataset is the most valuable artifact your evaluation pipeline owns, and it has an expiry date nobody writes down.&lt;/p&gt;

&lt;p&gt;Datasets rot in three ways. &lt;strong&gt;Input drift&lt;/strong&gt;: real user queries evolve — new terminology, new intents, new edge cases — and your golden set stops representing them. &lt;strong&gt;Label rot&lt;/strong&gt;: the correct answer changes. A customer service bot's golden dataset might contain ideal answers that reference a product feature that no longer exists. &lt;strong&gt;Coverage gaps&lt;/strong&gt;: your initial dataset captured the happy path. Production traffic eventually surfaces the long tail that was never represented.&lt;/p&gt;

&lt;p&gt;The practical fix is a two-track refresh strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track 1: Scheduled review.&lt;/strong&gt; Every 90 days, pull a stratified sample of real production inputs — at minimum, 200 examples per major intent cluster — and manually verify that the golden labels are still correct. Flag rows where the ideal answer has changed. Retire rows from deprecated flows. &lt;a href="https://www.statsig.com/perspectives/golden-datasets-evaluation-standards" rel="noopener noreferrer"&gt;Statsig's research on golden dataset maintenance&lt;/a&gt; recommends marking rows stale after 90 days unless re-verified; persistent drift is a signal the dataset no longer reflects reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track 2: Failure-driven refresh.&lt;/strong&gt; When a customer-reported regression reaches you, trace it back to the eval suite. If the failing case wasn't in the golden set, add it — annotated with why it failed and what the correct output should have been. A regression that reaches production is, at minimum, a contribution to the golden dataset. Don't waste the signal.&lt;/p&gt;
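&lt;p&gt;Both tracks can be supported by a small amount of bookkeeping on each golden row. The sketch below is one possible shape, assuming a hypothetical &lt;code&gt;GoldenRow&lt;/code&gt; record with a &lt;code&gt;last_verified&lt;/code&gt; date; the field names and sample rows are illustrative.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)  # review window from Track 1


@dataclass
class GoldenRow:
    input_text: str
    expected: str
    last_verified: date
    note: str = ""


def stale_rows(rows: list, today: date) -> list:
    """Track 1: rows not re-verified within the last 90 days."""
    return [row for row in rows if today - row.last_verified > STALE_AFTER]


def add_failure_case(rows: list, input_text: str, expected: str, why: str, today: date) -> None:
    """Track 2: record a production regression as a new, annotated golden row."""
    rows.append(GoldenRow(input_text, expected, last_verified=today, note=why))


today = date(2025, 6, 1)
rows = [
    GoldenRow("How do I reset my password?", "Use the reset link.", date(2025, 1, 1)),
    GoldenRow("What is the refund window?", "30 days.", date(2025, 5, 1)),
]

add_failure_case(
    rows,
    "Can I get a refund on a gift card?",
    "Gift cards are non-refundable.",
    why="Customer report: model offered a refund",
    today=today,
)
print([row.input_text for row in stale_rows(rows, today)])
```

&lt;p&gt;Only the first row is flagged here, since it was last verified 151 days before the review date; the freshly added failure case starts its own 90-day clock.&lt;/p&gt;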

&lt;p&gt;One diagnostic worth running: if your eval suite consistently scores above 90% but your support tickets are increasing, the dataset has drifted past the real problem space. That 90% is measuring something — it's just no longer measuring the right thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-Judge: When It Works, When It Lies
&lt;/h2&gt;

&lt;p&gt;LLM-as-judge is a necessary tool for evaluating open-ended outputs at scale. It's also unreliable in specific, documented ways. Use it without understanding those ways, and your rubric evals will give you false confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works.&lt;/strong&gt; GPT-4 as judge achieves 85% agreement with human expert evaluators on general-task benchmarks (MT-Bench), and 83–87% agreement on Chatbot Arena evaluations (&lt;a href="https://eugeneyan.com/writing/llm-evaluators/" rel="noopener noreferrer"&gt;Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023, via Eugene Yan&lt;/a&gt;). For general-purpose, non-expert tasks, LLM-as-judge is a defensible substitute for human annotation if you validate the judge against your specific rubric before deploying it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What lies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verbosity bias.&lt;/strong&gt; Both GPT-3.5 and Claude (v1) preferred longer responses over shorter ones more than 90% of the time, independent of correctness (&lt;a href="https://eugeneyan.com/writing/llm-evaluators/" rel="noopener noreferrer"&gt;Zheng et al., NeurIPS 2023&lt;/a&gt;). If your outputs tend to be long and verbose, a verbosity-biased judge will score them well even when they're wrong. Mitigate by normalising output length in your rubric prompt or running paired length-controlled evaluations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-preference bias.&lt;/strong&gt; GPT-4 as judge gave a 10% win-rate advantage to GPT-4-generated outputs; Claude v1 showed a 25% self-preference bias (&lt;a href="https://eugeneyan.com/writing/llm-evaluators/" rel="noopener noreferrer"&gt;Zheng et al., NeurIPS 2023&lt;/a&gt;). If your production model and your judge share a model family, expect inflated scores. Use a different model family for the judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expert domain degradation.&lt;/strong&gt; Agreement between LLM judges and human domain experts drops to 60–68% in fields like dietetics and mental health (&lt;a href="https://dl.acm.org/doi/10.1145/3708359.3712091" rel="noopener noreferrer"&gt;ACL/EMNLP 2024, via ACM DL&lt;/a&gt;). If you're evaluating a healthcare, legal, or highly specialized technical application, LLM-as-judge is not a substitute for domain expert annotation on the rubric dimensions that matter most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration process.&lt;/strong&gt; Before deploying a rubric eval in CI: (1) define explicit scoring criteria with labelled examples for each score level; (2) run the judge on 50–100 human-annotated examples and measure agreement; (3) if agreement is below 75% on your specific rubric, revise the rubric or change the judge model. &lt;a href="https://arxiv.org/abs/2411.15594" rel="noopener noreferrer"&gt;The 2024 survey on LLM-as-a-Judge&lt;/a&gt; provides a comprehensive bias taxonomy useful as a calibration checklist. Treat LLM-as-judge as a probabilistic instrument you've validated — not a ground truth.&lt;/p&gt;
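&lt;p&gt;Step (2) of that process reduces to a simple comparison between human and judge labels. The sketch below uses exact-match agreement on a 1 to 5 scale, which is a deliberately strict assumption; the labels are invented, and a real calibration run would use the 50 to 100 annotated examples described above.&lt;/p&gt;

```python
AGREEMENT_FLOOR = 0.75  # threshold from step (3) of the calibration process


def agreement_rate(human: list, judge: list) -> float:
    """Fraction of examples where the judge's score equals the human score."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for h, j in zip(human, judge) if h == j)
    return matches / len(human)


def disagreements(example_ids: list, human: list, judge: list) -> list:
    """Examples to inspect when revising the rubric: (id, human, judge)."""
    return [
        (ex, h, j)
        for ex, h, j in zip(example_ids, human, judge)
        if h != j
    ]


# Hypothetical labels for eight annotated examples.
ids = [f"ex-{i}" for i in range(8)]
human = [5, 4, 2, 5, 1, 3, 4, 2]
judge = [5, 4, 2, 4, 1, 3, 4, 2]

rate = agreement_rate(human, judge)
print(f"agreement: {rate:.1%}, judge usable: {rate >= AGREEMENT_FLOOR}")
print(disagreements(ids, human, judge))
```

&lt;p&gt;The disagreement list is often more useful than the headline rate, because it shows which rubric criteria the judge and the annotators read differently.&lt;/p&gt;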

&lt;h2&gt;
  
  
  Wiring Evals into CI: What to Block On, What to Warn On
&lt;/h2&gt;

&lt;p&gt;Running evals in CI without a blocking policy produces reports, not gates. The purpose of CI eval integration is to make a shipping decision: does this diff change behavior in a way that crosses a regression threshold?&lt;/p&gt;

&lt;p&gt;The integration pattern that works in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval_pipeline.py — framework-agnostic eval runner
# Runs on every PR against main; blocks merge if BLOCK conditions fail
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 1: Unit evals — run all, block on any failure
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_unit_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 2: Reference evals — block if accuracy drops below floor
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_reference_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_match_normalized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 3: Rubric evals — block on relative regression vs baseline
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_rubric_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;judge_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# different family from production model
&lt;/span&gt;        &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RUBRIC_CONFIG&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 4: Behavioral evals — warn only; too slow to block on every PR
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_behavioral_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;evaluate_thresholds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;THRESHOLDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block_on_any_failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block_if_below&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block_if_regression_vs_baseline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# 2% relative
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warn_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to block on.&lt;/strong&gt; Any unit eval failure. Reference accuracy falling below your defined floor. Rubric score dropping more than 2% relative to the last passing run on &lt;code&gt;main&lt;/code&gt;. These signals have a high signal-to-noise ratio: when they fire, they reliably indicate a regression rather than measurement variance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to warn on.&lt;/strong&gt; Behavioral eval regressions (too slow and too variable to block every PR), single-dimension rubric drops that don't cross the aggregate threshold, and latency increases above your SLO. Warnings go into the PR review, not the merge gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The baseline problem.&lt;/strong&gt; The rubric threshold is relative, so it needs a reference point. Store eval results in a persistent store (a JSON file in the repo works; a purpose-built eval tracking system works better) and compare each run to the last green run on &lt;code&gt;main&lt;/code&gt;. For rubric scores, don't compare to a fixed absolute value. Compare to a rolling baseline that advances with intentional quality improvements.&lt;/p&gt;
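&lt;p&gt;The block and warn rules can be sketched as a single decision function. This is one possible reading of the policy, not the article's implementation: the extra &lt;code&gt;baseline&lt;/code&gt; argument and the shape of the &lt;code&gt;results&lt;/code&gt; dictionaries (a &lt;code&gt;failures&lt;/code&gt; count for unit evals, a &lt;code&gt;score&lt;/code&gt; for the other layers) are assumptions made for the example.&lt;/p&gt;

```python
def evaluate_thresholds(results: dict, thresholds: dict, baseline: dict) -> dict:
    """Split eval results into merge-blocking failures and non-blocking warnings."""
    blocks, warnings = [], []

    # Layer 1: any unit failure blocks.
    if thresholds["unit"]["block_on_any_failure"] and results["unit"]["failures"] > 0:
        blocks.append(f"unit: {results['unit']['failures']} failing check(s)")

    # Layer 2: absolute floor on reference accuracy.
    floor = thresholds["reference"]["block_if_below"]
    if floor > results["reference"]["score"]:
        blocks.append(f"reference: {results['reference']['score']:.2f} is below floor {floor:.2f}")

    # Layer 3: relative drop against the last green run on main.
    base = baseline["rubric"]["score"]
    drop = (base - results["rubric"]["score"]) / base if base > 0 else 0.0
    if drop > thresholds["rubric"]["block_if_regression_vs_baseline"]:
        blocks.append(f"rubric: {drop:.1%} relative drop vs baseline")
    elif drop > 0:
        warnings.append(f"rubric: {drop:.1%} relative drop, under the blocking threshold")

    # Layer 4: behavioral regressions only warn.
    if thresholds["behavioral"]["warn_only"] and baseline["behavioral"]["score"] > results["behavioral"]["score"]:
        warnings.append("behavioral: score fell below baseline")

    return {"blocked": bool(blocks), "blocks": blocks, "warnings": warnings}


THRESHOLDS = {
    "unit": {"block_on_any_failure": True},
    "reference": {"block_if_below": 0.92},
    "rubric": {"block_if_regression_vs_baseline": 0.02},
    "behavioral": {"warn_only": True},
}

# Hypothetical scores: the last green run on main, then the current PR.
baseline = {"rubric": {"score": 0.90}, "behavioral": {"score": 0.95}}
results = {
    "unit": {"failures": 0},
    "reference": {"score": 0.94},
    "rubric": {"score": 0.87},
    "behavioral": {"score": 0.95},
}

decision = evaluate_thresholds(results, THRESHOLDS, baseline)
print(decision)
```

&lt;p&gt;In this run the rubric score falls from 0.90 to 0.87, a relative drop of about 3.3%, so the merge is blocked even though the unit and reference layers pass.&lt;/p&gt;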

&lt;p&gt;Our &lt;a href="https://www.prodinit.com/services/ai-infrastructure-llmops" rel="noopener noreferrer"&gt;AI Infrastructure &amp;amp; LLMOps service&lt;/a&gt; wires eval pipelines directly into deployment workflows so that model updates, retrieval changes, and prompt edits all pass through the same gate before reaching production.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Regression That Slipped Through (and the Eval That Would Have Caught It)
&lt;/h2&gt;

&lt;p&gt;A retrieval-augmented clinical documentation system was producing accurate outputs in testing. Production ROUGE-L scores were stable at 0.81. An infrastructure team updated the vector database and reindexed the embeddings corpus. No model weights changed. The migration was flagged as non-breaking.&lt;/p&gt;

&lt;p&gt;Two weeks later: escalating complaints from clinical staff. Summaries were citing facts from adjacent patient records in a multi-tenant environment. The retrieval had started returning higher-cosine-similarity results from nearby tenant partitions due to an index partitioning bug introduced in the new release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the eval suite had:&lt;/strong&gt; ROUGE-L score on golden summaries (Layer 2).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it didn't have:&lt;/strong&gt; a cross-tenant citation check (Layer 4 behavioral), or a factual grounding check verifying that every claimed fact appeared in the retrieved source documents (Layer 3 rubric).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The eval that would have caught it:&lt;/strong&gt; A rubric eval scoring "all factual claims in the output are supported by at least one retrieved source document" — rated by an LLM judge with access to both the output and the retrieved context. This would have flagged outputs immediately: claims were present in the generation, but the supporting documents in context were from different records.&lt;/p&gt;
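
&lt;p&gt;One way such a grounding check could be structured is sketched here. The judge is passed in as a callable so the scoring logic stays testable; &lt;code&gt;keyword_judge&lt;/code&gt; is a toy substring matcher standing in for a real LLM judge, and the &lt;code&gt;record_id&lt;/code&gt; field and function names are hypothetical. The sketch also assumes claims have already been extracted from the output, which in practice is a separate step.&lt;/p&gt;

```python
from typing import Callable

# judge(claim, doc_text) -> True if the document supports the claim.
# In a real suite this would wrap an LLM call; here it is an injected callable.
Judge = Callable[[str, str], bool]


def ungrounded_claims(claims: list, retrieved: list, expected_record_id: str,
                      judge: Judge) -> list:
    """Return claims with no supporting document from the expected record."""
    failures = []
    for claim in claims:
        supported = any(
            doc["record_id"] == expected_record_id and judge(claim, doc["text"])
            for doc in retrieved
        )
        if not supported:
            failures.append(claim)
    return failures


def keyword_judge(claim: str, doc_text: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in doc_text.lower()


if __name__ == "__main__":
    # Made-up context: one document from the right record, one from another.
    context = [
        {"record_id": "patient-A", "text": "Patient reports mild headache."},
        {"record_id": "patient-B", "text": "Prescribed 10mg lisinopril daily."},
    ]
    claims = ["mild headache", "10mg lisinopril"]
    print(ungrounded_claims(claims, context, "patient-A", keyword_judge))
```

&lt;p&gt;Requiring the supporting document to come from the expected record is what makes this check sensitive to the incident described: a claim backed only by another patient's document is reported as ungrounded.&lt;/p&gt;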

&lt;p&gt;A behavioral eval running 20 end-to-end test cases with known tenant isolation requirements would have caught the regression in the first CI run after the index migration.&lt;/p&gt;
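
&lt;p&gt;A tenant isolation eval of this kind can be quite small. The sketch below assumes each retrieved document carries a &lt;code&gt;tenant_id&lt;/code&gt; in its metadata; the in-memory corpus, both retriever functions, and &lt;code&gt;isolation_violations&lt;/code&gt; are hypothetical stand-ins for a real vector store client and test harness.&lt;/p&gt;

```python
from typing import Callable

# Tiny in-memory corpus standing in for a multi-tenant vector index (made-up data).
CORPUS = [
    {"id": "a1", "tenant_id": "clinic-a", "text": "discharge summary for ward 3"},
    {"id": "b1", "tenant_id": "clinic-b", "text": "discharge summary for ward 7"},
]


def scoped_retrieve(tenant_id: str, query: str) -> list:
    """Correct behavior: only return documents owned by the requesting tenant."""
    return [d for d in CORPUS if d["tenant_id"] == tenant_id and query in d["text"]]


def leaky_retrieve(tenant_id: str, query: str) -> list:
    """Simulated partitioning bug: the tenant filter is silently dropped."""
    return [d for d in CORPUS if query in d["text"]]


def isolation_violations(retrieve: Callable[[str, str], list], cases: list) -> list:
    """Return (tenant_id, query, doc_id) for every cross-tenant document returned."""
    violations = []
    for tenant_id, query in cases:
        for doc in retrieve(tenant_id, query):
            if doc["tenant_id"] != tenant_id:
                violations.append((tenant_id, query, doc["id"]))
    return violations


if __name__ == "__main__":
    cases = [("clinic-a", "discharge summary"), ("clinic-b", "discharge summary")]
    print(isolation_violations(scoped_retrieve, cases))  # no violations expected
    print(isolation_violations(leaky_retrieve, cases))   # one violation per tenant
```

&lt;p&gt;Because the check inspects retrieval metadata rather than generated text, it needs no LLM judge, which is why it is cheap enough to run after every index change.&lt;/p&gt;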

&lt;p&gt;Neither eval existed because both required knowing what to test before the failure occurred. The lesson isn't that you should anticipate every specific bug. It's that behavioral evals should cover the properties your system must hold regardless of what changes: tenant isolation, citation grounding, and output fidelity are invariants, not features. They belong in the eval suite from day one, not after the first production incident.&lt;/p&gt;

&lt;p&gt;A related pattern appears in model optimization: the &lt;a href="https://arxiv.org/html/2602.10144" rel="noopener noreferrer"&gt;Amazon arXiv study (February 2026)&lt;/a&gt; found that INT4 quantization, routinely treated as a cost-reduction step with negligible quality impact, caused a 1.73% accuracy drop on Llama-3.1 8B and a 39.46% drop on Llama-3.3 70B. The study also showed the McNemar statistical test can detect accuracy degradations as small as 0.3%, meaning you don't need large regressions to justify measurement. You just need to be measuring.&lt;/p&gt;
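
&lt;p&gt;For readers who want to try the McNemar test on their own eval results, here is a standard-library sketch of the exact (binomial) form. It assumes you have per-example correctness for the same test set under both the baseline and the candidate model. The function names are illustrative, and the cited study may have used a different variant of the test.&lt;/p&gt;

```python
from math import comb


def discordant_counts(baseline_correct: list, candidate_correct: list) -> tuple:
    """Count examples where exactly one of the two models is correct."""
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for base, cand in pairs if base and not cand)  # regressions
    c = sum(1 for base, cand in pairs if cand and not base)  # improvements
    return b, c


def mcnemar_exact(baseline_correct: list, candidate_correct: list) -> float:
    """Two-sided exact McNemar p-value for paired correctness labels.

    Under the null hypothesis, each discordant example is equally likely to
    be a regression or an improvement, so the smaller count follows a
    Binomial(b + c, 0.5) distribution.
    """
    b, c = discordant_counts(baseline_correct, candidate_correct)
    n = b + c
    if n == 0:
        return 1.0  # the models agree on every example
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

&lt;p&gt;The test only looks at examples where the two models disagree, which is why it can pick up small accuracy shifts on a paired test set that an unpaired comparison of aggregate scores would treat as noise.&lt;/p&gt;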

</description>
      <category>evals</category>
      <category>ai</category>
      <category>llmops</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
