<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Danilec</title>
    <description>The latest articles on DEV Community by Anna Danilec (@anna_danilec).</description>
    <link>https://dev.to/anna_danilec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3919694%2Fb82546bc-c4cd-41e0-81c0-9246ed3bfa76.jpg</url>
      <title>DEV Community: Anna Danilec</title>
      <link>https://dev.to/anna_danilec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anna_danilec"/>
    <language>en</language>
    <item>
      <title>RAG Evaluation with RAGAS: Measuring Faithfulness, Context Precision, and Recall in Production</title>
      <dc:creator>Anna Danilec</dc:creator>
      <pubDate>Mon, 18 May 2026 10:03:19 +0000</pubDate>
      <link>https://dev.to/anna_danilec/rag-evaluation-with-ragas-measuring-faithfulness-context-precision-and-recall-in-production-31hj</link>
      <guid>https://dev.to/anna_danilec/rag-evaluation-with-ragas-measuring-faithfulness-context-precision-and-recall-in-production-31hj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;p&gt;RAGAS gives you four core metrics that split RAG failures into retrieval vs. generation problems&lt;/p&gt;

&lt;p&gt;Faithfulness catches hallucinations; Context Recall catches retrieval gaps&lt;/p&gt;

&lt;p&gt;Most metrics require no human-labeled data&lt;/p&gt;

&lt;p&gt;Treat RAGAS like unit tests, run it in CI every time you change your pipeline&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You've shipped a RAG-based product. Your engineers say it "seems to work well." Your users occasionally complain it gives wrong answers. You have no idea which part is broken, the retrieval, the generation, or both.&lt;/p&gt;

&lt;p&gt;This is the state of most RAG deployments today. And it's a problem you can solve with a proper evaluation framework.&lt;/p&gt;

&lt;p&gt;Let's talk about RAGAS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With "It Seems Fine"
&lt;/h2&gt;

&lt;p&gt;Building a RAG system is the easy part. Dozens of tutorials get you from zero to a working demo in an afternoon. But production RAG is a different beast. You're dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Retrieval failures - the system pulls irrelevant chunks from your vector store&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hallucinations - the LLM generates facts not present in the retrieved documents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Incomplete coverage - the retrieval misses key information needed to answer the question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Irrelevant answers - the response doesn't actually address what the user asked&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional NLP metrics like BLEU and ROUGE won't catch any of these. They measure surface-level text similarity to a reference answer, useful for machine translation, not for knowledge-grounded generation. They completely ignore whether the LLM is actually using the retrieved context or just making things up.&lt;/p&gt;

&lt;p&gt;You need metrics designed specifically for the RAG pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAGAS?
&lt;/h2&gt;

&lt;p&gt;RAGAS (Retrieval Augmented Generation Assessment) is an open-source Python framework for evaluating RAG systems. It was introduced by Shahul Es, Jithin James, and collaborators in a paper published in late 2023 and presented at EACL 2024.&lt;/p&gt;

&lt;p&gt;The key design decision that makes RAGAS practical: most of its metrics require no human-labeled ground truth. It uses LLMs as judges, the same type of model you're evaluating is used to evaluate the evaluation. Yes, this is a meta-game, but it works surprisingly well in practice.&lt;/p&gt;

&lt;p&gt;RAGAS processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It has 4,000+ GitHub stars and is backed by a Y Combinator company. It's become the de facto standard for RAG evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Axis Mental Model
&lt;/h2&gt;

&lt;p&gt;Before diving into specific metrics, understand the RAG pipeline as having two distinct components, each with its own failure modes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffte50zanrei7sll38ys5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffte50zanrei7sll38ys5.png" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retriever failures&lt;/strong&gt;: Wrong chunks, missing chunks, poorly ranked chunks&lt;br&gt;
&lt;strong&gt;Generator failures&lt;/strong&gt;: Hallucination, ignoring context, irrelevant response&lt;/p&gt;

&lt;p&gt;RAGAS gives you metrics for both axes. If you only measure end-to-end output quality, you can't tell which half is broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Four Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Faithfulness: Does the answer stay true to the retrieved context?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it catches:&lt;/strong&gt; Hallucinations from the generator&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
RAGAS extracts individual statements from the generated answer, then asks an LLM judge whether each statement can be logically inferred from the retrieved context. The score is the fraction of statements that can be supported.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Faithfulness = Supported Statements / Total Statements in Answer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score range:&lt;/strong&gt; 0 to 1 (higher is better)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Question: "What is our refund policy?"&lt;br&gt;
Context: "Refunds are available within 30 days of purchase."&lt;br&gt;
Answer: "Refunds are available within 30 days. We also offer exchanges for 60 days."&lt;/p&gt;

&lt;p&gt;The second sentence isn't in the context → Faithfulness &amp;lt; 1&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What low Faithfulness scores tell you, and how to fix it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tighten your system prompt:&lt;/strong&gt; The most immediate lever. Add explicit grounding instructions:&lt;/p&gt;

&lt;p&gt;"Answer only using the information provided in the context below. If the context does not contain enough information to answer, say so explicitly."&lt;/p&gt;

&lt;p&gt;"Do not use any prior knowledge. Every claim in your answer must be traceable to the context."&lt;/p&gt;

&lt;p&gt;Negative framing helps too: "Do not speculate. Do not add information not present in the provided documents."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lower the temperature:&lt;/strong&gt; High temperature = more creative, more likely to drift from the context. For factual RAG tasks, set temperature to 0 or close to it. There's no good reason to have randomness in a document Q&amp;amp;A system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Switch or downgrade your model:&lt;/strong&gt; Counter-intuitively, more capable models sometimes hallucinate more confidently. A model like GPT-4o has seen so much training data that it may "helpfully" fill gaps from its parametric memory rather than admitting the context is insufficient. Sometimes a smaller, instruction-tuned model with a strict prompt outperforms a frontier model on Faithfulness specifically.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Answer Relevancy: Does the answer actually address the question?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it catches:&lt;/strong&gt; Verbose, off-topic, or evasive answers from the generator&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
An LLM generates several hypothetical questions that the given answer would be the answer to. Then it computes the cosine similarity between those generated questions and the original question. High similarity means the answer is directly addressing what was asked.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Answer Relevancy = avg(cosine_similarity(generated_questions, original_question))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score range:&lt;/strong&gt; 0 to 1&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of a low-relevancy response:&lt;/strong&gt;&lt;br&gt;
Question: "When does the system auto-scale?"&lt;br&gt;
Answer: "Our platform uses Kubernetes for container orchestration. It supports multiple cloud providers and can be deployed on-premises." ← technically related but doesn't answer the question&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What low Answer Relevancy scores tell you, and how to fix it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Your prompt template isn't directing the LLM toward the question:&lt;/strong&gt; The most common cause. If your template looks like "Use the context below to help the user", the LLM has too much freedom to respond however feels natural, which often means answering a slightly different, easier version of the question. Fix it by anchoring the response explicitly to the input:&lt;/p&gt;

&lt;p&gt;"Answer the following question directly and concisely: {question}"&lt;/p&gt;

&lt;p&gt;"Your response must directly address what was asked. Do not provide background information unless it is necessary to answer the question."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Your retriever is pulling tangentially related chunks:&lt;/strong&gt; This one is subtle, the LLM isn't hallucinating, it's faithfully summarizing context that happens to be adjacent to the topic but doesn't answer the specific question asked. The answer sounds reasonable, passes a Faithfulness check, but misses the point entirely. &lt;/p&gt;

&lt;p&gt;Cross-reference with Context Precision: if that score is also low, the retriever is the culprit. The fix is better retrieval, a reranker, stricter similarity thresholds, or query rewriting before retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The LLM is being evasive or overly hedged:&lt;/strong&gt; Some models, especially when given ambiguous context, default to safe, non-committal answers: "This is a complex topic with many perspectives…" These score very low on Answer Relevancy because a hypothetical question reverse-engineered from that answer looks nothing like the original query. &lt;/p&gt;

&lt;p&gt;The fix is prompt-level: instruct the model to commit to an answer and flag uncertainty explicitly rather than hiding behind vagueness, "If you cannot find a direct answer in the context, say: I don't have enough information to answer this. Do not speculate."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context Precision: Are the retrieved chunks actually useful? Are the best ones ranked first?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it catches:&lt;/strong&gt; Noisy retrieval, retrieving a lot of documents but ranking the relevant ones poorly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
For each retrieved chunk, an LLM judge decides whether that chunk is useful for answering the question. Context Precision then uses Average Precision, a ranking-aware metric that penalizes systems that bury the relevant chunks at the bottom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is important:&lt;/strong&gt; two systems could retrieve the same relevant chunks but if one puts them at positions 1 and 2 and another at positions 8 and 9, the LLM may not use them effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What low Context Precision scores tell you, and how to fix it:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Add a reranker as a second retrieval stage:&lt;/strong&gt; Your embedding model does a decent job finding broadly relevant chunks, but cosine similarity in vector space is a blunt instrument, it measures general topic overlap, not “does this chunk actually help answer this specific question.”&lt;/p&gt;

&lt;p&gt;A cross-encoder reranker (Cohere Rerank, BGE Reranker, Jina Reranker) reads the query and each chunk together and produces a much more accurate relevance score. The typical pattern is: retrieve top-20 with your vector store, rerank, pass top-5 to the LLM. This often moves Context Precision more than any other single change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fix your chunking strategy:&lt;/strong&gt; Poorly sized chunks are a hidden precision killer. Chunks that are too large contain the relevant sentence plus a lot of surrounding noise, the chunk scores as retrieved but most of its content is irrelevant, dragging precision down.&lt;/p&gt;

&lt;p&gt;Chunks that are too small lose surrounding context and get ranked inconsistently. The fix isn’t always obvious because the right chunk size is domain-dependent: dense technical documentation needs smaller chunks than narrative prose.&lt;/p&gt;

&lt;p&gt;Test with a few different sizes (256, 512, 1024 tokens) and run Context Precision against each. Also consider sentence-window retrieval or parent-child chunking, retrieve small chunks for precision, but pass their larger parent context to the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rewrite the query before retrieval:&lt;/strong&gt; User queries are often poorly formed for vector search. They’re conversational, ambiguous, or assume context from earlier in the conversation. The embedding model then retrieves chunks that match the surface phrasing of the query rather than its intent.&lt;/p&gt;

&lt;p&gt;Query rewriting with an LLM before hitting the vector store (sometimes called HyDE, Hypothetical Document Embeddings, or simply query expansion) can dramatically improve what gets ranked at the top. A simple prompt like “Rewrite this question as a declarative statement that would appear in a technical document” often moves the needle more than swapping embedding models.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Context Recall: Did the retriever find everything needed to answer the question?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it catches:&lt;/strong&gt; Retrieval gaps, the right information exists in your knowledge base but wasn’t retrieved&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
This is the one metric that typically needs a ground truth reference answer. RAGAS decomposes the reference answer into individual statements, then checks which statements can be attributed to the retrieved context.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Context Recall = Statements attributable to context / Total statements in reference answer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What low Context Recall scores tell you, and how to fix it:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Increase your top-K and experiment with retrieval depth:&lt;/strong&gt; The simplest fix first. If you’re retrieving top-3 or top-5 chunks, relevant information that exists in your knowledge base simply isn’t making it into the context window. Try top-10 or top-20 and re-measure. &lt;/p&gt;

&lt;p&gt;The tradeoff is more noise (which hurts Context Precision), so watch both metrics together, you’re looking for the sweet spot where recall improves without precision collapsing. A reranker helps here because it lets you retrieve broadly and then filter aggressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fix your chunking before fixing your retrieval:&lt;/strong&gt; Low Context Recall is often misdiagnosed as a retrieval problem when it’s actually a chunking problem. If a single answer requires information spread across a document, an introduction, a table in the middle, and a caveat at the end, but your chunks split those pieces apart and only one gets retrieved, recall will suffer regardless of how good your embeddings are.&lt;/p&gt;

&lt;p&gt;Consider parent-child chunking: index small chunks for precise matching, but when a small chunk is retrieved, pass its larger parent document to the LLM. This way you get retrieval precision without losing surrounding context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Switch to hybrid search:&lt;/strong&gt; Pure vector search fails on specific, precise queries, exact product names, version numbers, acronyms, proper nouns. The embedding model generalizes these into semantic space where they lose their distinctiveness.&lt;/p&gt;

&lt;p&gt;BM25 (keyword search) handles them perfectly. Hybrid search combines both signals, dense retrieval for semantic understanding, sparse retrieval for exact matching, and consistently improves recall across diverse query types without significantly hurting precision. Most modern vector stores (Elasticsearch, Weaviate, Qdrant) support hybrid search natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Metrics Map to Your Architecture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Measures&lt;/th&gt;
&lt;th&gt;Failure Points to Investigate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval quality &amp;amp; ranking&lt;/td&gt;
&lt;td&gt;Embedding model, reranker, chunk size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval coverage&lt;/td&gt;
&lt;td&gt;top-K setting, chunking, indexing strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generator groundedness&lt;/td&gt;
&lt;td&gt;System prompt, temperature, model choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer Relevancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generator focus&lt;/td&gt;
&lt;td&gt;Prompt template, retrieval quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A useful diagnostic pattern: if Faithfulness is fine but Answer Relevancy is low, your LLM is staying honest but the retrieved context isn’t helping it answer the actual question. That’s a retrieval problem dressed up as a generation problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the Core Four
&lt;/h2&gt;

&lt;p&gt;RAGAS has expanded significantly since its original release. For production systems, you should also look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noise Sensitivity – how much does your answer quality degrade when irrelevant chunks are retrieved alongside relevant ones? Critical for adversarial or domain-drift scenarios.&lt;/li&gt;
&lt;li&gt;Context Entities Recall – checks whether specific entities (names, numbers, dates) from the ground truth appear in the retrieved context. Useful for fact-dense domains like legal or finance.&lt;/li&gt;
&lt;li&gt;Factual Correctness – a reference-based metric that checks whether the answer is factually correct, not just grounded in context. This requires ground truth but gives you absolute accuracy, not just relative faithfulness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams building agentic RAG pipelines, RAGAS also covers Tool Call Accuracy, Agent Goal Accuracy, and Topic Adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evaluation Dataset Problem (And How RAGAS Solves It)
&lt;/h2&gt;

&lt;p&gt;Here’s the real bottleneck: to run these metrics at scale, you need test questions. Building hundreds of representative questions by hand is expensive and slow.&lt;/p&gt;

&lt;p&gt;RAGAS includes a synthetic test data generation module. It ingests your source documents, builds a knowledge graph, and generates diverse question types automatically, including multi-hop questions that require reasoning across multiple documents.&lt;/p&gt;

&lt;p&gt;This lets you create a meaningful evaluation dataset in hours rather than weeks. It’s not perfect, you’ll still want human review for high-stakes domains, but it dramatically lowers the barrier to having a real eval suite before your next deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a RAGAS Workflow Looks Like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Your RAG system output
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our data retention policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our data is retained for 90 days...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data is retained for 90 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data is retained for 90 days per GDPR requirements.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {'faithfulness': 0.91, 'answer_relevancy': 0.87, 
#  'context_precision': 0.76, 'context_recall': 0.82}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real value isn’t a single run. It’s running RAGAS as part of your CI/CD pipeline every time you change your prompt template, swap your embedding model, or update your knowledge base. Treat it like unit tests for your AI system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Guidance
&lt;/h2&gt;

&lt;p&gt;Start with Faithfulness and Context Recall. These two are the highest signal metrics for most production systems. Faithfulness catches the most dangerous failure mode (hallucination), and Context Recall tells you if your retrieval architecture is fundamentally sound.&lt;/p&gt;

&lt;p&gt;Don’t optimize a single metric. You can game Context Precision by returning fewer, more targeted chunks, but this hurts Context Recall. You need to watch all four together.&lt;/p&gt;

&lt;p&gt;Use RAGAS scores to run A/B experiments. Want to know if switching from text-embedding-ada-002 to text-embedding-3-large improves your system? Run RAGAS before and after. Now you have data instead of intuition.&lt;/p&gt;

&lt;p&gt;Integrate with observability tools. RAGAS works natively with LangSmith and Langfuse. This means you can trace individual requests that score poorly and inspect exactly what was retrieved and how the LLM used it.&lt;/p&gt;

&lt;p&gt;The LLM-as-judge limitation. Be aware that RAGAS uses LLMs internally for most metrics. This means your evaluation has its own failure modes, LLM judges can be inconsistent, sensitive to prompt phrasing, and exhibit position bias. Use a strong, reliable model (GPT-4o, Claude) for your judge. For critical systems, validate RAGAS scores against a sample of human annotations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Most teams ship RAG systems and evaluate them with vibes. RAGAS gives you a structured, automated way to know exactly where your pipeline is failing, retrieval or generation, and gives you the feedback loop to fix it systematically.&lt;/p&gt;

&lt;p&gt;This is the difference between iterating on your AI system and guessing about it.&lt;/p&gt;

&lt;p&gt;The framework is open-source, takes an afternoon to integrate, and has become the standard for a reason. If you’re running RAG in production without evaluation metrics, that’s the technical debt your team should be paying down next.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>How we built an MCP Guardrail to enforce tech policy in real-time</title>
      <dc:creator>Anna Danilec</dc:creator>
      <pubDate>Tue, 12 May 2026 06:30:59 +0000</pubDate>
      <link>https://dev.to/anna_danilec/how-we-built-an-mcp-guardrail-to-enforce-tech-policy-in-real-time-542b</link>
      <guid>https://dev.to/anna_danilec/how-we-built-an-mcp-guardrail-to-enforce-tech-policy-in-real-time-542b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpepmtm9tl81dqt3ayzvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpepmtm9tl81dqt3ayzvk.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 2026, most mid-sized and large organizations are aggressively adopting AI coding assistants such as Cursor, Claude Desktop and Windsurf. Developers are now generating a significant portion of code using LLMs.&lt;/p&gt;

&lt;p&gt;However, this acceleration brings serious risks. According to the GitGuardian State of Secrets Sprawl 2026 report, in 2025 alone 28.65 million new hardcoded secrets were detected on public GitHub, a 34% year-over-year increase. Secrets related to AI services grew by 81%. Commits assisted by tools like Claude Code show nearly double the credential leak rate (3.2% vs 1.5% for manually written code).&lt;/p&gt;

&lt;p&gt;Additionally, Veracode’s 2025 research reveals that 45% of AI-generated code contains OWASP Top 10 vulnerabilities. In some languages (e.g. Java), this figure reaches as high as 70%.&lt;/p&gt;

&lt;p&gt;The core challenge for CTOs and architects is clear: we provide developers with extremely powerful generative tools, but we fail to supply them with the company’s current context. Language models have no knowledge of internal Tech Radars, approved library lists, security policies, Architecture Decision Records (ADRs), or forbidden patterns. As a result, LLMs often suggest solutions that are outdated, misaligned with company architecture, or dangerous.&lt;/p&gt;

&lt;p&gt;Organizations now face a difficult dilemma: either limit AI adoption (which is unrealistic in a competitive market) or lose control over their technology stack, security, and compliance.&lt;/p&gt;

&lt;p&gt;Architect’s Guardrail solves this problem. It is a lightweight, local MCP Server (Model Context Protocol) that acts like “an architect sitting on the developer’s shoulder.”&lt;/p&gt;

&lt;p&gt;When a developer asks the AI to generate code, the server delivers the company’s current policy to the LLM in real time, approved technologies, required patterns, security rules, and restrictions. This enables the AI to either generate policy-compliant code or clearly explain why a specific technology or approach is not allowed and suggest an alternative.&lt;/p&gt;

&lt;p&gt;True AI Governance in 2026 is not about blocking language models.&lt;/p&gt;

&lt;p&gt;It is about delivering up-to-date company knowledge exactly at the moment decisions are made.&lt;/p&gt;

&lt;p&gt;The project is fully open-source:&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/annadanilec/architect-guardrail" rel="noopener noreferrer"&gt;https://github.com/annadanilec/architect-guardrail&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Why Traditional Approaches Fail
&lt;/h3&gt;

&lt;p&gt;Traditional DevSecOps controls, such as pre-commit hooks, SAST (Static Application Security Testing), SCA (Software Composition Analysis), and dependency scanning, act too late in the AI-assisted code generation process.&lt;/p&gt;

&lt;p&gt;These tools analyze code only after it has been created, usually at the commit, pull request, or CI/CD build stage. In the world of agentic AI, where a model can generate dozens or hundreds of lines of code in seconds, a post-factum reaction is insufficient. The problem occurs the moment the developer accepts the LLM’s suggestion.&lt;/p&gt;

&lt;p&gt;The root cause is the lack of organizational context in language models. LLMs do not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The company’s internal Tech Radar&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Approved / unapproved libraries and their versions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Architecture Decision Records (ADR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security policies (e.g., how to handle secrets, authentication, logging)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Forbidden architectural patterns (e.g., direct database calls in backend services)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, models rely on public training data, which is often outdated or misaligned with internal standards.&lt;/p&gt;

&lt;p&gt;Key risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Credential and secret leaks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use of unapproved, risky libraries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inconsistency with company architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lack of auditability of AI decisions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Classic DevSecOps&lt;/th&gt;
&lt;th&gt;Agentic AI 2026&lt;/th&gt;
&lt;th&gt;Consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Problem detection moment&lt;/td&gt;
&lt;td&gt;After commit / PR&lt;/td&gt;
&lt;td&gt;At code generation time&lt;/td&gt;
&lt;td&gt;Too late&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Company policy awareness&lt;/td&gt;
&lt;td&gt;None (only technical rules)&lt;/td&gt;
&lt;td&gt;No organizational context&lt;/td&gt;
&lt;td&gt;High risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback speed&lt;/td&gt;
&lt;td&gt;Minutes–hours&lt;/td&gt;
&lt;td&gt;Seconds (without guardrails)&lt;/td&gt;
&lt;td&gt;Errors multiply quickly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI decision auditability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Very low without dedicated tooling&lt;/td&gt;
&lt;td&gt;Compliance difficulties&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the era of classic DevSecOps, control was reactive and code-focused. In the era of Agentic AI, control must be proactive and contextual, delivering company policy exactly when the model is making a decision.&lt;/p&gt;

&lt;p&gt;Traditional tools simply cannot keep up with the speed and nature of generative AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Architect’s Guardrail (MCP Server)
&lt;/h3&gt;

&lt;p&gt;Architect’s Guardrail is a lightweight, local server compliant with the MCP (Model Context Protocol). It serves as a context layer between the developer and the language model (Claude, Cursor, Windsurf, etc.).&lt;/p&gt;

&lt;p&gt;What is MCP and why is it ideal for a guardrail?&lt;/p&gt;

&lt;p&gt;MCP is an open protocol introduced by Anthropic in 2025. It allows language models to securely and structurally retrieve context from external sources (tools, resources, and prompts) in real time.&lt;/p&gt;

&lt;p&gt;Unlike classic tool calling (e.g., LangChain or OpenAI function calling), MCP was designed for &lt;strong&gt;local IDE integration&lt;/strong&gt; and secure context delivery. It supports two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;stdio (most common for local use)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HTTP (for enterprise deployments)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why MCP is perfect for a guardrail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Works before code generation, delivers context while the model is still thinking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Natively supported by Claude Desktop, Cursor, and other modern AI tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supports both tools and resources (e.g., the entire company policy as context).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lightweight, runs locally, and does not require sending code outside the organization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides full auditability, every request to the server can be logged.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architecture&lt;/p&gt;

&lt;p&gt;The architecture is deliberately simple so it can be implemented in just a few hours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MCP Server – core (Python + FastMCP or TypeScript)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Policy sources: policy.json, Tech Radar (JSON or external integration), approved libraries, security rules, ADRs, forbidden patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration: Claude Desktop / Cursor / VS Code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optional: Git repository with policy (auto-refresh on startup)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A basic version can fit in a single ~150–200 line server.py file.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Architect’s Guardrail Works Step by Step
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. The developer writes a prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Cursor or Claude Desktop, the developer enters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Create a secure payment processing endpoint in FastAPI with Stripe integration.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. The IDE sends a request to the MCP Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before generating a response, the AI tool sends the current context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the prompt,&lt;/li&gt;
&lt;li&gt;the current file,&lt;/li&gt;
&lt;li&gt;and the workspace context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to the local MCP Server using either the stdio or HTTP protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The MCP Server provides policy context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The server reads the current organizational policy and returns structured context, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"approved_frameworks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"fastapi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"django"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nestjs"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"preferred_http_client"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"httpx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forbidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"urllib3"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"secrets_handling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Must use Vault / AWS Secrets Manager. No hardcoded keys."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payment_requirements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"rate limiting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"idempotency key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"PCI-compliant logging"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. The LLM receives full organizational context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model receives not only the user request, but also the company’s current engineering standards and security policies.&lt;/p&gt;

&lt;p&gt;As a result, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generates policy-compliant code,&lt;/li&gt;
&lt;li&gt;uses httpx,&lt;/li&gt;
&lt;li&gt;implements proper secret handling,&lt;/li&gt;
&lt;li&gt;adds rate limiting,&lt;/li&gt;
&lt;li&gt;and follows required architectural patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model attempts to use something forbidden, it explains why it is not allowed and proposes an approved alternative.&lt;/p&gt;

&lt;p&gt;Example AI response with Guardrail enabled&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“According to company policy, I cannot use the requests library. Instead, I will use httpx with retry logic, as required by our Tech Radar (ADR-187). Stripe secrets will be retrieved from Vault…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h5&gt;
  
  
  Architecture Flow
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjziv4wuzzd0lxj91t72t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjziv4wuzzd0lxj91t72t.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to this approach, the guardrail operates proactively rather than reactively.&lt;/p&gt;

&lt;p&gt;The policy is always up to date because the server can continuously read the latest version directly from Git or a centralized repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation: Building Architect’s Guardrail in ~2 Hours
&lt;/h3&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Qymwb2rwTW0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Open-source repository:&lt;br&gt;
&lt;a href="https://github.com/annadanilec/architect-guardrail" rel="noopener noreferrer"&gt;https://github.com/annadanilec/architect-guardrail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The basic version can be built in 1.5–2 hours if you have intermediate Python knowledge. &lt;/p&gt;
&lt;h4&gt;
  
  
  Project Structure
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;architect-guardrail/
├── src/
│   ├── server.py
│   ├── config.py
│   ├── policy/
│   └── tools/
├── policy/
│   ├── policy.json
│   ├── tech-radar.json
│   └── adrs/
├── mcp.json
├── pyproject.toml
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Dependency Installation (pyproject.toml)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"architect-guardrail"&lt;/span&gt;
&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="py"&gt;"mcp&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"pydantic&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="s"&gt;"pyyaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"httpx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"python-dotenv"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Policy Configuration and Loading (config.py)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;approved_libraries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;forbidden_libraries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;secrets_handling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;preferred_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;adrs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_policy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy/policy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Main MCP Server (server.py)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_policy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.tools.policy_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_approved_libraries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_against_policy&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Architect-Guardrail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Resources – automatically available to the LLM
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy://tech-radar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tech_radar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_policy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy://security-rules&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_security_rules&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_policy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Security rules:
    - Secrets: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets_handling&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    - Authentication: OAuth2 + JWT + short-lived tokens
    - Logging: structured logging + PII masking&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Tools
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_approved_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns a list of approved libraries for a given language and category.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_policy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;libs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approved_libraries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# category filtering
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; libraries in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; libraries: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_against_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_snippet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Validates code against the current company policy.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_policy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;code_snippet&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;httpx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;code_snippet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use httpx instead of requests (according to ADR-142)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from httpx import AsyncClient + retry logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;code_snippet&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hardcoded secrets are forbidden&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use Vault / AWS Secrets Manager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Server Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs exclusively locally (transport="stdio")&lt;/li&gt;
&lt;li&gt;No public port exposure by default&lt;/li&gt;
&lt;li&gt;Optional lightweight token authentication for HTTP mode (enterprise deployment)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example Use Cases
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Example 1: AI Attempts to Use requests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer Prompt&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a function to call an external API.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;AI Response with Guardrail Enabled&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“According to our policy (Tech Radar + ADR-142), we do not use the requests library. Instead, we use httpx with automatic retry and circuit breaker support. Here is the updated implementation…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Hardcoded secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Response&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I cannot generate code containing a hardcoded API key. According to the security policy (policy://security-rules), all secrets must be retrieved from Vault. Here is a version using hvac / boto3 Secrets Manager…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Example 3: Deprecated Framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Response&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The proposed usage of Flask 2.x is not compliant with our Tech Radar. The currently approved framework is FastAPI 0.115+. See the migration path in ADR-203.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Advanced Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tech Radar Integration – automatically fetch policies from databases, Notion, internal APIs, or Git repositories every X minutes&lt;/li&gt;
&lt;li&gt;Policy Versioning – policy/v1.2.json + Git history&lt;/li&gt;
&lt;li&gt;Audit trail – log prompts, returned context, AI decisions, and policy validations&lt;/li&gt;
&lt;li&gt;Human in the loop – use tools such as request_approval() for critical changes like introducing a new library&lt;/li&gt;
&lt;li&gt;Multi-policy support – different policies for backend, frontend, mobile, data science, and platform engineering teams&lt;/li&gt;
&lt;li&gt;CI/CD integration – validate pull requests using MCP logs, policy validation results, and AI interaction history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Limitation: Context Availability vs Policy Enforcement
&lt;/h3&gt;

&lt;p&gt;Current MCP integrations in tools such as Cursor significantly improve organizational context delivery, but an important distinction must be understood:&lt;/p&gt;

&lt;p&gt;MCP makes policies available to the model, but does not automatically guarantee that the model will always consult them before generating code.&lt;/p&gt;

&lt;p&gt;In practice, this means that simply attaching an MCP Server is not yet equivalent to deterministic policy enforcement. During testing of Architect’s Guardrail in Cursor, policy compliance was significantly more reliable when the AI was explicitly instructed to always consult MCP resources before generating production code.&lt;/p&gt;

&lt;p&gt;This reveals one of the key challenges of enterprise AI governance in 2026:&lt;/p&gt;

&lt;p&gt;Context availability is not the same as policy enforcement.&lt;/p&gt;

&lt;p&gt;To improve reliability, several additional layers can be introduced.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommended Enforcement Layers
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Persistent Workspace Rules (.cursorrules)&lt;/strong&gt; – A lightweight but effective approach is adding project-level instructions such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Always consult Architect's Guardrail MCP resources before generating code.

Never use external libraries before validating them against company policy.

Always validate:
- secrets handling
- authentication
- logging
- architectural patterns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This significantly increases the probability that Cursor consistently consults MCP resources during generation.&lt;/p&gt;

&lt;p&gt;Mandatory Policy Pre-Check – A stronger approach is introducing a middleware layer that automatically injects policy context before generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt
    ↓
Policy Validation Layer
    ↓
MCP Policy Retrieval
    ↓
LLM Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transforms governance from probabilistic to deterministic.&lt;/p&gt;

&lt;p&gt;Output Validation – Even after generation, code can be validated against organizational policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generated Code
      ↓
Policy Validator
      ↓
Approve / Reject / Rewrite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a true enterprise-grade governance pipeline.&lt;/p&gt;

&lt;p&gt;Architect’s Guardrail should therefore be viewed not as a complete enforcement system by itself, but as foundational governance infrastructure for AI-assisted development environments.&lt;/p&gt;

&lt;p&gt;MCP provides the missing organizational context layer. Deterministic enforcement still requires orchestration, workflow design, and validation mechanisms on top of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Results and Impact
&lt;/h3&gt;

&lt;p&gt;Implementing Architect’s Guardrail delivers measurable benefits for both the organization and individual teams.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benefits for CTOs, Architects, and Security Teams:
&lt;/h4&gt;

&lt;p&gt;Real-time policy enforcement&lt;br&gt;
Significant risk reduction&lt;br&gt;
Full auditability of AI decisions&lt;br&gt;
Reduced manual review workload&lt;br&gt;
Easier maintenance of Tech Radar consistency&lt;/p&gt;

&lt;h4&gt;
  
  
  Benefits for Developers:
&lt;/h4&gt;

&lt;p&gt;Faster development&lt;br&gt;
Less friction&lt;br&gt;
Educational effect (AI explains decisions)&lt;br&gt;
Higher confidence that code will pass review&lt;/p&gt;

&lt;p&gt;Key Metrics to Track:&lt;br&gt;
| Metric | Before | After | Expected Improvement |&lt;/p&gt;

&lt;p&gt;|---|---|---|---|&lt;/p&gt;

&lt;p&gt;| Compliance rate | 60–70% | 90–96% | +30–40% |&lt;/p&gt;

&lt;p&gt;| Risky AI suggestions per week | 20–35 | 3–6 | -80–85% |&lt;/p&gt;

&lt;p&gt;| Hardcoded secrets in PRs | 6–12 | 0–2 | -85–90% |&lt;/p&gt;

&lt;p&gt;| Time from prompt to approved code | 40–55 min | 25–35 min | -30–40% |&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison with Other Approaches
&lt;/h3&gt;

&lt;p&gt;There are several ways to enforce policies in AI environments. Below is a comparison of the most common approaches alongside the MCP-based Architect’s Guardrail solution.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;th&gt;Weaknesses&lt;/th&gt;
&lt;th&gt;Real-time Effectiveness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Engineering&lt;/td&gt;
&lt;td&gt;Very easy to start&lt;/td&gt;
&lt;td&gt;Brittle, context limits, hard to maintain&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain / LangGraph&lt;/td&gt;
&lt;td&gt;Flexible for complex agents&lt;/td&gt;
&lt;td&gt;Heavy, high overhead, poor IDE integration&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Central Proxy (Guardrails AI, NeMo)&lt;/td&gt;
&lt;td&gt;Strong central control&lt;/td&gt;
&lt;td&gt;Latency, single point of failure, complex&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local MCP Guardrail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native IDE integration, low latency, local&lt;/td&gt;
&lt;td&gt;Requires local installation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Very High&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Architect’s Guardrail stands out due to its simplicity, locality, and excellent developer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Roadmap &amp;amp; Recommendations
&lt;/h3&gt;

&lt;p&gt;Stage 1: Basic Guardrail (2–4 hours)&lt;/p&gt;

&lt;p&gt;Stage 2: Full Tech Radar + ADR integration (1–2 days)&lt;/p&gt;

&lt;p&gt;Stage 3: Multi-team / multi-policy support (3–5 days)&lt;/p&gt;

&lt;p&gt;Stage 4: Agent-driven policy updates (advanced)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with Stage 1 in a pilot team.&lt;br&gt;
Treat policy as code, store it in Git.&lt;br&gt;
Combine rollout with team training on effective AI usage.&lt;br&gt;
Aim for organization-wide standardization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The best way to achieve AI Governance in 2026 is not by blocking models, but by giving them the right company context at the right moment.&lt;/p&gt;

&lt;p&gt;Architect’s Guardrail proves that you can reconcile two seemingly contradictory goals: dramatically accelerating development with AI while maintaining strong control over quality, security, and architectural consistency.&lt;/p&gt;

&lt;p&gt;MCP is emerging as a highly effective and elegant standard. It is lightweight, local, natively integrated with the best developer tools, and built with security in mind.&lt;/p&gt;

&lt;p&gt;Recommendation: If your organization is already using Claude, Cursor, or similar tools, build your own Guardrail. The basic version takes just one evening.&lt;/p&gt;

&lt;p&gt;In the coming months, MCP has a strong chance of becoming the de facto standard for enterprise AI governance, just as GitHub Actions became the standard for CI/CD and OpenTelemetry for observability.&lt;/p&gt;

&lt;p&gt;It’s time to stop treating AI as an uncontrolled guest in the team.&lt;/p&gt;

&lt;p&gt;It’s time to treat it as a highly talented team member that requires clear, precise guidance.&lt;/p&gt;

&lt;p&gt;And it all starts with context.&lt;/p&gt;

&lt;p&gt;Open-source project:&lt;br&gt;
&lt;a href="https://github.com/annadanilec/architect-guardrail" rel="noopener noreferrer"&gt;https://github.com/annadanilec/architect-guardrail&lt;/a&gt;&lt;br&gt;
If you are experimenting with AI governance, MCP, or architecture-aware coding assistants, feel free to contribute or fork the project.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>python</category>
      <category>security</category>
    </item>
    <item>
      <title>Deterministic Guardrails for Non-Deterministic Agents</title>
      <dc:creator>Anna Danilec</dc:creator>
      <pubDate>Fri, 08 May 2026 11:18:53 +0000</pubDate>
      <link>https://dev.to/anna_danilec/deterministic-guardrails-for-non-deterministic-agents-127b</link>
      <guid>https://dev.to/anna_danilec/deterministic-guardrails-for-non-deterministic-agents-127b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjizpq6zysf2cz0zrk02z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjizpq6zysf2cz0zrk02z.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By design, Large Language Models (LLMs) are non-deterministic. Even with an identical prompt, they can return different answers, trigger the wrong API, leak sensitive personal data, or initiate a costly chain of requests that evaporates a monthly cloud budget in seconds. For engineers managing production systems, this isn't an abstract risk — it's the nightmare scenario that keeps them up at night.&lt;/p&gt;

&lt;p&gt;The solution does not lie in hoping for better models. Instead, it lies in a &lt;strong&gt;deterministic guardrail layer&lt;/strong&gt; that governs the agent, regardless of the model's output. This article explores four strategic pillars of such an architecture, all built using FastAPI.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Governance Layer&lt;/td&gt;
&lt;td&gt;FastAPI middleware&lt;/td&gt;
&lt;td&gt;Compliance, PII Masking, &amp;amp; Cost Control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource Sandboxing&lt;/td&gt;
&lt;td&gt;Dependency Injection&lt;/td&gt;
&lt;td&gt;Minimizing the "Blast Radius"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability &amp;amp; Traceability&lt;/td&gt;
&lt;td&gt;OpenTelemetry + OTLP&lt;/td&gt;
&lt;td&gt;Visualizing Chain of Thought (CoT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async ROI&lt;/td&gt;
&lt;td&gt;asyncio event loop&lt;/td&gt;
&lt;td&gt;Infrastructure Cost Optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Pillar 1: Governance Layer – FastAPI Middleware
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is middleware and why is it a perfect place for guardrails?
&lt;/h3&gt;

&lt;p&gt;In web architecture, middleware is logic executed during the lifecycle of every HTTP request — both before it reaches the intended handler and after the response is generated but before it reaches the client. Think of it as an airport security checkpoint: every passenger (request) must pass through it without exception.&lt;/p&gt;

&lt;p&gt;In the context of AI agents, FastAPI middleware intercepts the agent's output before it reaches the end user. This provides a single, centralized layer to enforce compliance and safety policies, regardless of how many specialized agents are operating within the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The golden rule:&lt;/strong&gt; guardrails are NOT part of the agent. The agent is a black box that can behave unpredictably. Guardrails must exist as an external control layer, operating independently of the agent's internal logic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Implementation: Compliance Middleware
&lt;/h3&gt;

&lt;p&gt;The middleware below intercepts every agent response and performs three critical operations: compliance validation, PII masking, and cost tracking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# middleware/guardrails.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.middleware.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseHTTPMiddleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnalyzerEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_anonymizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnonymizerEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_anonymizer.entities&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OperatorConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── Prometheus Metrics ──────────────────────────────────────────
# The histogram tracks the distribution of response times (not just the average)
&lt;/span&gt;&lt;span class="n"&gt;REQUEST_DURATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_request_duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent response time in ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# The Counter increases monotonically — ideal for anomaly detection alerts
&lt;/span&gt;&lt;span class="n"&gt;POLICY_VIOLATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_policy_violations_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of blocked responses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PII_DETECTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_pii_detections_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of detected PII entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── Prohibited Phrases (Prompt Injection / Compliance) ──────────
&lt;/span&gt;&lt;span class="n"&gt;FORBIDDEN_PHRASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore previous instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore all instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you are now&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# classic prompt injection attack
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disregard your&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentGuardrailsMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseHTTPMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize ONCE in the constructor — NLP models are resource-heavy (~200ms);
&lt;/span&gt;        &lt;span class="c1"&gt;# we want to avoid loading them per-request
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;languages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;languages&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnalyzerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;anonymizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnonymizerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 1: Call proper handler (agent) ───────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# call_next passes the request deeper into the middleware stack
&lt;/span&gt;        &lt;span class="c1"&gt;# until it reaches the intended endpoint. The agent performs all its logic:
&lt;/span&gt;        &lt;span class="c1"&gt;# querying the LLM, executing tools, and building the response.
&lt;/span&gt;        &lt;span class="c1"&gt;# We wait, and only THEN do we receive the response object.
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 2: Collect streaming body ───────────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# Starlette streams responses chunk-by-chunk — there is
&lt;/span&gt;        &lt;span class="c1"&gt;# no single 'response.text'. We must assemble the body manually.
&lt;/span&gt;        &lt;span class="c1"&gt;# Note: this buffers the entire response in RAM.
&lt;/span&gt;        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body_iterator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 3: Compliance check ─────────────────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# Check if the agent returned any prohibited content.
&lt;/span&gt;        &lt;span class="c1"&gt;# We block the response here, before anything reaches the client.
&lt;/span&gt;        &lt;span class="n"&gt;violation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_check_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;POLICY_VIOLATIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Policy violation [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Response blocked&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;violation&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 4: PII masking with Presidio ────────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# Presidio uses NLP models (spaCy) instead of pure RegEx.
&lt;/span&gt;        &lt;span class="c1"&gt;# It understands context: '48123456789' as a phone number
&lt;/span&gt;        &lt;span class="c1"&gt;# in 'call me at...' but not when it's an Order ID.
&lt;/span&gt;        &lt;span class="n"&gt;masked_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_mask_pii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 5: Metrics ──────────────────────────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# duration_ms is sent to the Prometheus Histogram, allowing us
&lt;/span&gt;        &lt;span class="c1"&gt;# to calculate p50/p95/p99 latency and trigger alerts on threshold violations.
&lt;/span&gt;        &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
        &lt;span class="n"&gt;REQUEST_DURATION&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ── STEP 6: Return response ──────────────────────────────
&lt;/span&gt;        &lt;span class="c1"&gt;# Crucial: replicate the original status_code and headers.
&lt;/span&gt;        &lt;span class="c1"&gt;# Without this you lose Content-Type and bypass original 403/404 codes.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;masked_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# keep original code
&lt;/span&gt;            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# keep original headers
&lt;/span&gt;            &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_check_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns the violation description or None if the text is compliant.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FORBIDDEN_PHRASES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt_injection: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;phrase&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_mask_pii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Presidio operates in two stages:
&lt;/span&gt;        &lt;span class="c1"&gt;# 1. analyzer.analyze() — detects entities and returns their positions
&lt;/span&gt;        &lt;span class="c1"&gt;# 2. anonymizer.anonymize() — replaces them according to the defined strategy
&lt;/span&gt;        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PERSON&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMAIL_ADDRESS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PHONE_NUMBER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IBAN_CODE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREDIT_CARD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IP_ADDRESS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;score_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# eliminates most false positives
&lt;/span&gt;            &lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="c1"&gt;# Log the entity TYPE only — never the value itself (that would be a PII leak!)
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;PII_DETECTIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PII: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;anonymized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;anonymizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anonymize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;analyzer_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;operators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PERSON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PERSON]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMAIL_ADDRESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[EMAIL]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PHONE_NUMBER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PHONE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IBAN_CODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[IBAN]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREDIT_CARD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CARD]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IP_ADDRESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nc"&gt;OperatorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[IP]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;anonymized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happens step by step?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;call_next(request)&lt;/code&gt;&lt;/strong&gt; — The middleware passes the request down the stack until it reaches the intended endpoint (the agent). The agent performs its logic: querying the LLM, calling tools, and constructing the response. We wait for the final output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;response.body_iterator&lt;/code&gt;&lt;/strong&gt; — Starlette streams responses chunk-by-chunk, so there is no single &lt;code&gt;response.text&lt;/code&gt; field available. We must manually reassemble the body in a loop. Warning: this buffers the entire response in RAM; for extremely large payloads, a size limit should be implemented.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;_check_policy(text)&lt;/code&gt;&lt;/strong&gt; — We validate the agent's output for prohibited content, including prompt injection attempts, off-limits topics, or profanity. If a violation is detected, we block the request immediately — before it ever reaches the client — and return a &lt;code&gt;403 Forbidden&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;_mask_pii(text)&lt;/code&gt;&lt;/strong&gt; — Microsoft Presidio in a two-stage process: first, &lt;code&gt;analyzer.analyze()&lt;/code&gt; detects entities and their positions by understanding linguistic context; then, &lt;code&gt;anonymizer.anonymize()&lt;/code&gt; replaces them with placeholders. Crucial: we log only the entity &lt;strong&gt;TYPE&lt;/strong&gt;, never the actual value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;REQUEST_DURATION.observe()&lt;/code&gt;&lt;/strong&gt; — Response time is recorded in a Prometheus Histogram. A Histogram (rather than a simple Counter) allows us to calculate p50/p95/p99 latency and build percentile-based alerts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;Response(status_code=…, headers=…)&lt;/code&gt;&lt;/strong&gt; — We return the final response while preserving the original status code and headers. Without this, you lose metadata like &lt;code&gt;Content-Type&lt;/code&gt; or &lt;code&gt;Cache-Control&lt;/code&gt;, and the client might receive a &lt;code&gt;200 OK&lt;/code&gt; even if the underlying logic returned a &lt;code&gt;404&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why is this better than internal agent guardrails?&lt;/strong&gt; Because middleware operates across ALL endpoints simultaneously. Whether you have one agent or fifty, policies are defined once and enforced centrally.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Registering the middleware
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;middleware.guardrails&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentGuardrailsMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Middleware stack — order matters!
# Last in = first executed (LIFO)
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentGuardrailsMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CostTrackingMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- this executes first
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Middleware is executed in reverse order of registration (LIFO). Authentication should always be registered last to ensure it triggers first — there is no point processing PII masking for an unauthorized request.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 2: Resource Sandboxing – Dependency Injection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem: what if the agent goes in the wrong direction?
&lt;/h3&gt;

&lt;p&gt;Imagine an AI agent with access to your production database. The LLM hallucinates, generates a destructive SQL query, and within seconds your data is gone. This isn't science fiction — it's a reality already faced by some early adopters of AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius&lt;/strong&gt; is an SRE term defining the maximum potential damage caused by a single component failure. For AI agents, we minimize this through Resource Sandboxing: the agent is granted only the minimum necessary permissions, operates in read-only mode, and remains strictly isolated from the rest of the infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependency Injection in FastAPI
&lt;/h3&gt;

&lt;p&gt;FastAPI's built-in DI system lets us inject dependencies from the outside rather than creating them within the class. This gives us total control over exactly which resources the agent can access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# dependencies/sandboxed_db.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Depends&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.orm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;

&lt;span class="c1"&gt;# Separate connection pool — read-only
# DB account with SELECT-only privileges; no INSERT/UPDATE/DELETE
&lt;/span&gt;&lt;span class="n"&gt;READONLY_DB_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgresql://agent_ro:pass@db-replica/prod&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;readonly_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;READONLY_DB_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Small pool — prevents the agent from exhausting connections
&lt;/span&gt;    &lt;span class="n"&gt;max_overflow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Maximum of 7 total concurrent connections
&lt;/span&gt;    &lt;span class="n"&gt;pool_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Timeout — ensures the agent doesn't wait indefinitely
&lt;/span&gt;    &lt;span class="n"&gt;connect_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c default_transaction_read_only=on&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SandboxedDatabase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Database with enforced read-only access at the connection level.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_query_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# Rate limit per request
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_query_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PermissionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Query limit exceeded for this agent session&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Block dangerous operations at the application level
&lt;/span&gt;        &lt;span class="n"&gt;forbidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DELETE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TRUNCATE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ALTER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GRANT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;forbidden&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PermissionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Forbidden SQL keyword: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_query_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_sandboxed_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SandboxedDatabase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readonly_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;SandboxedDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/agent/query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SandboxedDatabase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_sandboxed_db&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Injection
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The agent has access ONLY to db.execute_query()
&lt;/span&gt;    &lt;span class="c1"&gt;# No access to the engine, session, or other databases
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layered security – Defense in Depth
&lt;/h3&gt;

&lt;p&gt;Four independent layers of protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 – DB permissions:&lt;/strong&gt; &lt;code&gt;agent_ro&lt;/code&gt; has only &lt;code&gt;SELECT&lt;/code&gt; at the PostgreSQL level. Even if the agent constructs a &lt;code&gt;DELETE&lt;/code&gt;, the database will reject it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 – Connection options:&lt;/strong&gt; &lt;code&gt;default_transaction_read_only=on&lt;/code&gt; enforced at the connection level. An additional lock on the SQL session itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 – Application validation:&lt;/strong&gt; We scan for forbidden keywords in Python code. Any violation is logged and an alert is triggered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 4 – Pool limits:&lt;/strong&gt; The agent is restricted to 7 concurrent connections and 10 queries per request. This prevents it from overwhelming the infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dependency Injection is more than a design pattern — it is a security mechanism.&lt;/strong&gt; When an agent is a function that accepts a &lt;code&gt;SandboxedDatabase&lt;/code&gt; as a parameter, it is physically impossible for it to access an unrestricted database. &lt;strong&gt;The interface IS the guardrail.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Tool Registry
&lt;/h3&gt;

&lt;p&gt;We apply the same pattern to the agent's tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentToolkit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_search_products&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# OK
&lt;/span&gt;            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get_order_status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# OK
&lt;/span&gt;            &lt;span class="c1"&gt;# 'send_email': ...    — NOT registered = unavailable
&lt;/span&gt;            &lt;span class="c1"&gt;# 'execute_code': ... — NOT registered = unavailable
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PermissionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tool not available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_available&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pillar 3: Observability &amp;amp; Traceability – OpenTelemetry
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why traditional APM doesn't work for AI agents
&lt;/h3&gt;

&lt;p&gt;Tools like Datadog or New Relic were designed for synchronous, short-lived HTTP requests: request arrives, processed in 50ms, response sent. Core metrics focus on latency, error rate, and throughput.&lt;/p&gt;

&lt;p&gt;AI agents fundamentally break this model. A single request can last 30–120 seconds, during which the agent executes a &lt;strong&gt;Chain of Thought (CoT)&lt;/strong&gt; — a sequence of reasoning steps, each involving an LLM call and potential tool executions. Traditional APM sees one long, opaque request with zero visibility into what happened inside.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;OpenTelemetry (OTel)&lt;/strong&gt; is an open-source standard for distributed tracing, metrics, and logs. It provides a vendor-agnostic framework independent of the backend (Jaeger, Tempo, Datadog, Honeycomb). Instrument your code once, export to any destination.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Instrumenting FastAPI with OpenTelemetry
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# observability/tracing.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPIInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.httpx&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPXClientInstrumentor&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_telemetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent-service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service.version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.1.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deployment.environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://otel-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Auto-instrumentation — all FastAPI requests are automatically traced
&lt;/span&gt;    &lt;span class="n"&gt;FastAPIInstrumentor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# HTTP client instrumentation — LLM API calls are also tracked
&lt;/span&gt;    &lt;span class="nc"&gt;HTTPXClientInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tracing the Chain of Thought – manual spans
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agent/reasoning.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent.reasoning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent.execute&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;agent_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;agent_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent.query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;agent_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent.model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_STEPS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cot.step_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;step_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;step_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cot.step_number&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm.call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;llm_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;llm_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm.prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;llm_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm.completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;llm_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm.cost_usd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool.call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool.name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool.success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_final&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What you see in Grafana Tempo / Jaeger
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent.execute                          [0ms ────────────────────── 45,230ms]
  cot.step_0                           [12ms ─────── 8,400ms]
    llm.call                           [15ms ─── 7,800ms]   tokens: 412/623
    tool.call: search_products         [8,410ms ─ 8,890ms]  success: true
  cot.step_1                           [8,900ms ─────── 28,100ms]
    llm.call                           [8,900ms ─ 27,600ms] tokens: 1024/892
    tool.call: get_order_status        [27,610ms ─ 28,080ms]
  cot.step_2 (final)                   [28,100ms ─── 45,200ms]
    llm.call                           [28,100ms ─ 45,100ms] tokens: 2048/312
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can immediately see that &lt;code&gt;cot.step_1&lt;/code&gt; took 19 seconds — specifically the &lt;code&gt;llm.call&lt;/code&gt; with 1024 input tokens. This provides a clear signal for optimization: prompt caching, switching models for that step, or context compression.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Without OTel:&lt;/strong&gt; &lt;em&gt;"request took 45 seconds."&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;With OTel:&lt;/strong&gt; &lt;em&gt;"step_1.llm.call took 18.7s with 1024 input tokens — probable cause: context window overhead."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Pillar 4: Async ROI – The Event Loop Economy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem with synchronous code
&lt;/h3&gt;

&lt;p&gt;Imagine a call center. In a synchronous model, one employee handles one customer at a time. While the customer spends 30 seconds looking up their order number, the employee sits idle. To handle 100 customers simultaneously, you need 100 employees.&lt;/p&gt;

&lt;p&gt;In server terms: &lt;strong&gt;one thread per request&lt;/strong&gt;. While a thread waits for an LLM response (typically 2–30 seconds), it consumes CPU and memory doing nothing. Scaling to 1,000 concurrent agents requires 1,000 threads. A server with 32 vCPUs can only handle ~50–100 threads efficiently before degrading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asyncio: cooperative multitasking
&lt;/h3&gt;

&lt;p&gt;Python's &lt;code&gt;asyncio&lt;/code&gt; implements cooperative multitasking. One thread, one event loop, many coroutines. When a coroutine waits for I/O (network, disk, LLM API), it &lt;strong&gt;yields control&lt;/strong&gt; back to the event loop, which immediately starts processing another coroutine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Synchronous — blocks the thread
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_request_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LLM_API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{...})&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- BLOCKS 8 seconds
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;              &lt;span class="c1"&gt;# Thread sits idle
&lt;/span&gt;
&lt;span class="c1"&gt;# Asynchronous — releases the thread while waiting
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_request_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LLM_API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{...})&lt;/span&gt;
        &lt;span class="c1"&gt;# await = 'wait for the result, but release the event loop'
&lt;/span&gt;        &lt;span class="c1"&gt;# During the 8-second wait, the event loop processes other requests
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/agent/query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle_request_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concrete math: Synchronous vs Async
&lt;/h3&gt;

&lt;p&gt;Based on a typical agent workload (8s average latency per LLM call):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Sync (Flask/Django)&lt;/th&gt;
&lt;th&gt;Async (FastAPI)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Concurrent requests / vCPU&lt;/td&gt;
&lt;td&gt;~15–25&lt;/td&gt;
&lt;td&gt;~500–2000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM per 1,000 requests&lt;/td&gt;
&lt;td&gt;~8–16 GB (threads)&lt;/td&gt;
&lt;td&gt;~0.5–1 GB (coroutines)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU utilization during LLM wait&lt;/td&gt;
&lt;td&gt;~2–5% (blocking)&lt;/td&gt;
&lt;td&gt;~70–90% (other requests)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instances required (1,000 req/s)&lt;/td&gt;
&lt;td&gt;~40–60&lt;/td&gt;
&lt;td&gt;~4–8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated monthly cost (AWS)&lt;/td&gt;
&lt;td&gt;~$4,800–7,200&lt;/td&gt;
&lt;td&gt;~$480–960&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;A 5–10× cost reduction is a mathematical certainty: async allows a single thread to handle hundreds of concurrent LLM wait states. For high I/O latency workloads — exactly what AI agents are — this is the most significant cost optimization you can implement without a complete architectural overhaul.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pitfalls of asynchronous code
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU-bound tasks block the event loop.&lt;/strong&gt; Heavy computation (parsing large JSON, data transformations) freezes everything. Use &lt;code&gt;asyncio.to_thread()&lt;/code&gt; or &lt;code&gt;ProcessPoolExecutor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synchronous libraries block.&lt;/strong&gt; Using &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;psycopg2&lt;/code&gt; from async code blocks the event loop. Use async counterparts: &lt;code&gt;httpx&lt;/code&gt;, &lt;code&gt;asyncpg&lt;/code&gt;, &lt;code&gt;aiofiles&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging is harder.&lt;/strong&gt; Coroutine stack traces are less readable. Use &lt;code&gt;asyncio.current_task()&lt;/code&gt; and structured logging.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG — sync operation blocks the event loop
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bad_agent_step&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;heavy_json_parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;large_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Blocks everyone!
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

&lt;span class="c1"&gt;# CORRECT — delegate CPU-bound work to a thread pool
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;good_agent_step&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heavy_json_parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;large_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary: Production-Grade Architecture
&lt;/h2&gt;

&lt;p&gt;Four pillars create a cohesive control layer around the non-deterministic agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Middleware&lt;/strong&gt; — enforces policies regardless of what the agent generates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Injection&lt;/strong&gt; — isolates the agent from infrastructure, minimizes blast radius&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; — provides visibility inside long, multi-step Chain of Thought processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async FastAPI&lt;/strong&gt; — maximizes hardware utilization, reduces costs 5–10×&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common denominator: the agent is a black box — and that is perfectly fine. You don't need to control how the LLM thinks. You control the &lt;strong&gt;system boundaries&lt;/strong&gt;: what the agent can read, what it can send, how long it can run, and what the client is allowed to see. This is engineering, not magic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deterministic guardrails for non-deterministic agents.&lt;/strong&gt; Don't try to "fix" non-determinism — surround it with deterministic infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.invra.co/en/deterministic-guardrails-for-non-deterministic-agents/" rel="noopener noreferrer"&gt;invra.co&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>security</category>
    </item>
  </channel>
</rss>
