DEV Community: Anna Danilec

RAG Evaluation with RAGAS: Measuring Faithfulness, Context Precision, and Recall in Production

Anna Danilec — Mon, 18 May 2026 10:03:19 +0000

Key takeaways:

RAGAS gives you four core metrics that split RAG failures into retrieval vs. generation problems

Faithfulness catches hallucinations; Context Recall catches retrieval gaps

Most metrics require no human-labeled data

Treat RAGAS like unit tests: run it in CI every time you change your pipeline

You've shipped a RAG-based product. Your engineers say it "seems to work well." Your users occasionally complain it gives wrong answers. You have no idea which part is broken: the retrieval, the generation, or both.

This is the state of most RAG deployments today. And it's a problem you can solve with a proper evaluation framework.

Let's talk about RAGAS.

The Problem With "It Seems Fine"

Building a RAG system is the easy part. Dozens of tutorials get you from zero to a working demo in an afternoon. But production RAG is a different beast. You're dealing with:

Retrieval failures - the system pulls irrelevant chunks from your vector store
Hallucinations - the LLM generates facts not present in the retrieved documents
Incomplete coverage - the retrieval misses key information needed to answer the question
Irrelevant answers - the response doesn't actually address what the user asked

Traditional NLP metrics like BLEU and ROUGE won't catch any of these. They measure surface-level text similarity to a reference answer, useful for machine translation, not for knowledge-grounded generation. They completely ignore whether the LLM is actually using the retrieved context or just making things up.

You need metrics designed specifically for the RAG pipeline.

What is RAGAS?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source Python framework for evaluating RAG systems. It was introduced by Shahul Es, Jithin James, and collaborators in a paper published in late 2023 and presented at EACL 2024.

The key design decision that makes RAGAS practical: most of its metrics require no human-labeled ground truth. It uses LLMs as judges: the same type of model that generates your answers is used to grade them. Yes, this is a meta-game, but it works surprisingly well in practice.

RAGAS processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It has 4,000+ GitHub stars and is backed by a Y Combinator company. It's become the de facto standard for RAG evaluation.

The Two-Axis Mental Model

Before diving into specific metrics, understand the RAG pipeline as having two distinct components, each with its own failure modes:

Retriever failures: Wrong chunks, missing chunks, poorly ranked chunks
Generator failures: Hallucination, ignoring context, irrelevant response

RAGAS gives you metrics for both axes. If you only measure end-to-end output quality, you can't tell which half is broken.

The Core Four Metrics

1. Faithfulness: Does the answer stay true to the retrieved context?

What it catches: Hallucinations from the generator

How it works:
RAGAS extracts individual statements from the generated answer, then asks an LLM judge whether each statement can be logically inferred from the retrieved context. The score is the fraction of statements that can be supported.

Faithfulness = Supported Statements / Total Statements in Answer

Score range: 0 to 1 (higher is better)

Example:
Question: "What is our refund policy?"
Context: "Refunds are available within 30 days of purchase."
Answer: "Refunds are available within 30 days. We also offer exchanges for 60 days."

The second sentence isn't supported by the context, so only one of the two statements counts → Faithfulness = 1/2 = 0.5
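To make the ratio concrete, here is a minimal sketch of the scoring step only. The verdicts are written by hand in place of an LLM judge, and the function name is an assumption for illustration, not part of the RAGAS API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements the judge marked as supported by the context."""
    if not verdicts:
        raise ValueError("the answer produced no statements to judge")
    return sum(verdicts) / len(verdicts)


# Hand-written verdicts for the refund example (a real run would ask an LLM judge).
refund_verdicts = [
    True,   # "Refunds are available within 30 days." -> supported by the context
    False,  # "We also offer exchanges for 60 days."  -> not in the context
]
refund_score = faithfulness_score(refund_verdicts)
print(refund_score)  # 0.5
```

The hard part in practice is the statement extraction and the judging, not this arithmetic, which is why the quality of the judge model matters so much.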

What low Faithfulness scores tell you, and how to fix it:

1. Tighten your system prompt: The most immediate lever. Add explicit grounding instructions:

"Answer only using the information provided in the context below. If the context does not contain enough information to answer, say so explicitly."

"Do not use any prior knowledge. Every claim in your answer must be traceable to the context."

Negative framing helps too: "Do not speculate. Do not add information not present in the provided documents."

2. Lower the temperature: High temperature = more creative, more likely to drift from the context. For factual RAG tasks, set temperature to 0 or close to it. There's no good reason to have randomness in a document Q&A system.

3. Switch or downgrade your model: Counter-intuitively, more capable models sometimes hallucinate more confidently. A model like GPT-4o has seen so much training data that it may "helpfully" fill gaps from its parametric memory rather than admitting the context is insufficient. Sometimes a smaller, instruction-tuned model with a strict prompt outperforms a frontier model on Faithfulness specifically.

2. Answer Relevancy: Does the answer actually address the question?

What it catches: Verbose, off-topic, or evasive answers from the generator

How it works:
An LLM generates several hypothetical questions that the given answer would be the answer to. Then it computes the cosine similarity between those generated questions and the original question. High similarity means the answer is directly addressing what was asked.

Answer Relevancy = avg(cosine_similarity(generated_questions, original_question))

Score range: 0 to 1
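The averaging step can be sketched in a few lines. The three-dimensional vectors below are toy stand-ins for real embeddings, and the function names are assumptions for illustration rather than RAGAS internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec: list[float],
                           generated_question_vecs: list[list[float]]) -> float:
    """Mean similarity between the original question and the reverse-engineered ones."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0, 0.0]                       # toy embedding of the user's question
on_topic = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]    # questions generated from a direct answer
off_topic = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]   # questions generated from a tangential answer

on_topic_score = answer_relevancy_score(original, on_topic)
off_topic_score = answer_relevancy_score(original, off_topic)
print(round(on_topic_score, 2))   # 0.9
print(round(off_topic_score, 2))  # 0.3
```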

Example of a low-relevancy response:
Question: "When does the system auto-scale?"
Answer: "Our platform uses Kubernetes for container orchestration. It supports multiple cloud providers and can be deployed on-premises." ← technically related but doesn't answer the question

What low Answer Relevancy scores tell you, and how to fix it:

1. Your prompt template isn't directing the LLM toward the question: The most common cause. If your template looks like "Use the context below to help the user", the LLM has too much freedom to respond however feels natural, which often means answering a slightly different, easier version of the question. Fix it by anchoring the response explicitly to the input:

"Answer the following question directly and concisely: {question}"

"Your response must directly address what was asked. Do not provide background information unless it is necessary to answer the question."

2. Your retriever is pulling tangentially related chunks: This one is subtle. The LLM isn't hallucinating; it's faithfully summarizing context that happens to be adjacent to the topic but doesn't answer the specific question asked. The answer sounds reasonable, passes a Faithfulness check, but misses the point entirely.

Cross-reference with Context Precision: if that score is also low, the retriever is the culprit. The fix is better retrieval: a reranker, stricter similarity thresholds, or query rewriting before retrieval.

3. The LLM is being evasive or overly hedged: Some models, especially when given ambiguous context, default to safe, non-committal answers: "This is a complex topic with many perspectives…" These score very low on Answer Relevancy because a hypothetical question reverse-engineered from that answer looks nothing like the original query.

The fix is prompt-level: instruct the model to commit to an answer and flag uncertainty explicitly rather than hiding behind vagueness. For example: "If you cannot find a direct answer in the context, say: I don't have enough information to answer this. Do not speculate."

3. Context Precision: Are the retrieved chunks actually useful? Are the best ones ranked first?

What it catches: Noisy retrieval (retrieving a lot of documents but ranking the relevant ones poorly)

How it works:
For each retrieved chunk, an LLM judge decides whether that chunk is useful for answering the question. Context Precision then uses Average Precision, a ranking-aware metric that penalizes systems that bury the relevant chunks at the bottom.

This is important: two systems could retrieve the same relevant chunks but if one puts them at positions 1 and 2 and another at positions 8 and 9, the LLM may not use them effectively.
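A short sketch shows how much ranking matters. It implements Average Precision as commonly defined over per-chunk relevance verdicts; the function name is an assumption, and the RAGAS implementation may differ in details.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average Precision over a ranked list of 'is this chunk useful?' verdicts."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks out of ten retrieved, ranked differently.
top_ranked = [True, True] + [False] * 8      # relevant chunks at positions 1 and 2
buried = [False] * 7 + [True, True, False]   # relevant chunks at positions 8 and 9

print(round(context_precision(top_ranked), 3))  # 1.0
print(round(context_precision(buried), 3))      # 0.174
```

Both systems found the same chunks, yet the second scores roughly a sixth of the first, purely because of where the relevant chunks sit.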

What low Context Precision scores tell you, and how to fix it:

1. Add a reranker as a second retrieval stage: Your embedding model does a decent job finding broadly relevant chunks, but cosine similarity in vector space is a blunt instrument; it measures general topic overlap, not “does this chunk actually help answer this specific question.”

A cross-encoder reranker (Cohere Rerank, BGE Reranker, Jina Reranker) reads the query and each chunk together and produces a much more accurate relevance score. The typical pattern is: retrieve top-20 with your vector store, rerank, pass top-5 to the LLM. This often moves Context Precision more than any other single change.

2. Fix your chunking strategy: Poorly sized chunks are a hidden precision killer. Chunks that are too large contain the relevant sentence plus a lot of surrounding noise: the chunk counts as retrieved, but most of its content is irrelevant, dragging precision down.

Chunks that are too small lose surrounding context and get ranked inconsistently. The fix isn’t always obvious because the right chunk size is domain-dependent: dense technical documentation needs smaller chunks than narrative prose.

Test with a few different sizes (256, 512, 1024 tokens) and run Context Precision against each. Also consider sentence-window retrieval or parent-child chunking: retrieve small chunks for precision, but pass their larger parent context to the LLM.

3. Rewrite the query before retrieval: User queries are often poorly formed for vector search. They’re conversational, ambiguous, or assume context from earlier in the conversation. The embedding model then retrieves chunks that match the surface phrasing of the query rather than its intent.

Query rewriting with an LLM before hitting the vector store can dramatically improve what gets ranked at the top. Related techniques include query expansion and HyDE (Hypothetical Document Embeddings), which embeds an LLM-generated hypothetical answer instead of the raw query. A simple prompt like “Rewrite this question as a declarative statement that would appear in a technical document” often moves the needle more than swapping embedding models.
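The retrieve-broadly-then-rerank pattern from the first fix can be sketched as follows. The `token_overlap` scorer is a deliberately crude stand-in so the example runs without a model; in a real pipeline you would pass a cross-encoder's scoring function instead. All names here are assumptions for illustration.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-order first-stage candidates by a (query, chunk) relevance score, keep the best."""
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def token_overlap(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Pretend these came back from the vector store, in its original order.
first_stage = [
    "Our office is closed on public holidays.",
    "Annual plans have a refund window of 30 days.",
    "Refund requests are handled by the billing team.",
]
top_chunks = rerank("refund window for annual plans", first_stage, token_overlap, top_n=2)
print(top_chunks[0])  # Annual plans have a refund window of 30 days.
```

Keeping the scorer as a parameter makes it easy to swap rerankers and compare Context Precision for each.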

4. Context Recall: Did the retriever find everything needed to answer the question?

What it catches: Retrieval gaps (the right information exists in your knowledge base but wasn’t retrieved)

How it works:
This is the one metric that typically needs a ground truth reference answer. RAGAS decomposes the reference answer into individual statements, then checks which statements can be attributed to the retrieved context.

Context Recall = Statements attributable to context / Total statements in reference answer

What low Context Recall scores tell you, and how to fix it:

1. Increase your top-K and experiment with retrieval depth: The simplest fix first. If you’re retrieving top-3 or top-5 chunks, relevant information that exists in your knowledge base may simply not be making it into the context window. Try top-10 or top-20 and re-measure.

The tradeoff is more noise (which hurts Context Precision), so watch both metrics together; you’re looking for the sweet spot where recall improves without precision collapsing. A reranker helps here because it lets you retrieve broadly and then filter aggressively.

2. Fix your chunking before fixing your retrieval: Low Context Recall is often misdiagnosed as a retrieval problem when it’s actually a chunking problem. If a single answer requires information spread across a document (an introduction, a table in the middle, and a caveat at the end) but your chunks split those pieces apart and only one gets retrieved, recall will suffer regardless of how good your embeddings are.

Consider parent-child chunking: index small chunks for precise matching, but when a small chunk is retrieved, pass its larger parent document to the LLM. This way you get retrieval precision without losing surrounding context.

3. Switch to hybrid search: Pure vector search often fails on specific, precise queries: exact product names, version numbers, acronyms, proper nouns. The embedding model generalizes these into semantic space where they lose their distinctiveness.

BM25 (keyword search) handles them well. Hybrid search combines both signals (dense retrieval for semantic understanding, sparse retrieval for exact matching) and typically improves recall across diverse query types without significantly hurting precision. Most modern vector stores (Elasticsearch, Weaviate, Qdrant) support hybrid search natively.
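One common way to merge a dense ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF). The sketch below is an assumption about how you might combine the two lists yourself; the vector stores named above each have their own fusion options, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query that mentions an exact version number.
dense = ["doc_overview", "doc_v2_changelog", "doc_pricing"]    # vector search order
sparse = ["doc_v2_changelog", "doc_api_keys", "doc_overview"]  # BM25 order

fused = reciprocal_rank_fusion([dense, sparse])
print(fused[0])  # doc_v2_changelog
```

Documents that both retrievers agree on rise to the top, while a document only one retriever found still stays in the candidate set, which is what helps recall.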

How the Metrics Map to Your Architecture

Metric	Measures	Failure Points to Investigate
Context Precision	Retrieval quality & ranking	Embedding model, reranker, chunk size
Context Recall	Retrieval coverage	top-K setting, chunking, indexing strategy
Faithfulness	Generator groundedness	System prompt, temperature, model choice
Answer Relevancy	Generator focus	Prompt template, retrieval quality

A useful diagnostic pattern: if Faithfulness is fine but Answer Relevancy is low, your LLM is staying honest but the retrieved context isn’t helping it answer the actual question. That’s a retrieval problem dressed up as a generation problem.
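The table and the diagnostic pattern can be turned into a small triage helper. The 0.7 threshold and the wording of the findings are assumptions for illustration; pick thresholds from your own baseline runs.

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map low metric scores to the pipeline component worth investigating first."""
    low = {name for name, value in scores.items() if value < threshold}
    findings = []
    if "context_recall" in low:
        findings.append("retriever: missing information (top-K, chunking, indexing)")
    if "context_precision" in low:
        findings.append("retriever: noisy or badly ranked chunks (reranker, chunk size)")
    if "faithfulness" in low:
        findings.append("generator: unsupported claims (system prompt, temperature)")
    if "answer_relevancy" in low:
        if "faithfulness" in low:
            findings.append("generator: off-topic answer (prompt template)")
        else:
            # Grounded but unhelpful: the context probably did not contain the answer.
            findings.append("likely retrieval: grounded answer that misses the question")
    return findings


findings = diagnose({
    "faithfulness": 0.93,
    "answer_relevancy": 0.55,
    "context_precision": 0.80,
    "context_recall": 0.85,
})
print(findings)  # ['likely retrieval: grounded answer that misses the question']
```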

Beyond the Core Four

RAGAS has expanded significantly since its original release. For production systems, you should also look at:

Noise Sensitivity – how much does your answer quality degrade when irrelevant chunks are retrieved alongside relevant ones? Critical for adversarial or domain-drift scenarios.
Context Entities Recall – checks whether specific entities (names, numbers, dates) from the ground truth appear in the retrieved context. Useful for fact-dense domains like legal or finance.
Factual Correctness – a reference-based metric that checks whether the answer is factually correct, not just grounded in context. This requires ground truth but gives you absolute accuracy, not just relative faithfulness.

For teams building agentic RAG pipelines, RAGAS also covers Tool Call Accuracy, Agent Goal Accuracy, and Topic Adherence.

The Evaluation Dataset Problem (And How RAGAS Solves It)

Here’s the real bottleneck: to run these metrics at scale, you need test questions. Building hundreds of representative questions by hand is expensive and slow.

RAGAS includes a synthetic test data generation module. It ingests your source documents, builds a knowledge graph, and generates diverse question types automatically, including multi-hop questions that require reasoning across multiple documents.

This lets you create a meaningful evaluation dataset in hours rather than weeks. It’s not perfect (you’ll still want human review for high-stakes domains), but it dramatically lowers the barrier to having a real eval suite before your next deployment.

What a RAGAS Workflow Looks Like

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Your RAG system output, one entry per test question.
# A single row is shown; append further rows to each list.
# Column names are the ones this example uses; newer RAGAS releases may expect different names.
data = {
    "question": ["What is our data retention policy?"],
    "contexts": [["Our data is retained for 90 days..."]],
    "answer": ["Data is retained for 90 days."],
    "ground_truth": ["Data is retained for 90 days per GDPR requirements."],
}

dataset = Dataset.from_dict(data)

# evaluate() calls an LLM judge, so model credentials must be configured first
results = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])

print(results)
# Illustrative output:
# {'faithfulness': 0.91, 'answer_relevancy': 0.87,
#  'context_precision': 0.76, 'context_recall': 0.82}

The real value isn’t a single run. It’s running RAGAS as part of your CI/CD pipeline every time you change your prompt template, swap your embedding model, or update your knowledge base. Treat it like unit tests for your AI system.
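A CI gate can be as simple as comparing the scores to minimums and failing the build on any regression. This is a minimal sketch: the threshold values are hypothetical, and the scores are the illustrative numbers from the sample output above.

```python
def check_thresholds(scores: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its required minimum."""
    failures = []
    for metric, minimum in minimums.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


# Illustrative scores; in CI these would come from the evaluation run.
scores = {"faithfulness": 0.91, "answer_relevancy": 0.87,
          "context_precision": 0.76, "context_recall": 0.82}
minimums = {"faithfulness": 0.85, "context_precision": 0.80}  # hypothetical thresholds

failures = check_thresholds(scores, minimums)
print(failures)  # ['context_precision: 0.76 < 0.80']
```

In a pipeline step you would exit with a non-zero status when `failures` is non-empty, so a prompt or embedding change that degrades retrieval blocks the merge.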

Practical Guidance

Start with Faithfulness and Context Recall. These two are the highest signal metrics for most production systems. Faithfulness catches the most dangerous failure mode (hallucination), and Context Recall tells you if your retrieval architecture is fundamentally sound.

Don’t optimize a single metric. You can game Context Precision by returning fewer, more targeted chunks, but this hurts Context Recall. You need to watch all four together.
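The tension between the two retrieval metrics is easy to see with plain precision and recall at different top-K values. This sketch uses hand-labeled relevance flags rather than RAGAS's LLM-judged metrics, and the function name is an assumption.

```python
def precision_recall_at_k(ranked_relevance: list[bool],
                          total_relevant: int, k: int) -> tuple[float, float]:
    """Precision and recall over the first k retrieved chunks."""
    top = ranked_relevance[:k]
    hits = sum(top)
    precision = hits / len(top) if top else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Ten retrieved chunks in rank order; three relevant chunks exist in the knowledge base.
ranked = [True, False, True, False, False, False, True, False, False, False]

for k in (3, 5, 10):
    p, r = precision_recall_at_k(ranked, total_relevant=3, k=k)
    print(f"top-{k}: precision={p:.2f} recall={r:.2f}")
# top-3: precision=0.67 recall=0.67
# top-5: precision=0.40 recall=0.67
# top-10: precision=0.30 recall=1.00
```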

Use RAGAS scores to run A/B experiments. Want to know if switching from text-embedding-ada-002 to text-embedding-3-large improves your system? Run RAGAS before and after. Now you have data instead of intuition.

Integrate with observability tools. RAGAS works natively with LangSmith and Langfuse. This means you can trace individual requests that score poorly and inspect exactly what was retrieved and how the LLM used it.

The LLM-as-judge limitation. Be aware that RAGAS uses LLMs internally for most metrics. This means your evaluation has its own failure modes: LLM judges can be inconsistent, sensitive to prompt phrasing, and prone to position bias. Use a strong, reliable model (GPT-4o, Claude) for your judge. For critical systems, validate RAGAS scores against a sample of human annotations.
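One simple way to validate the judge is to measure how often its pass/fail verdict matches a human label on the same samples. The 0.5 cut-off and the sample values are assumptions for illustration.

```python
def agreement_rate(judge_scores: list[float], human_labels: list[bool],
                   threshold: float = 0.5) -> float:
    """Share of samples where the thresholded judge score matches the human label."""
    if len(judge_scores) != len(human_labels) or not human_labels:
        raise ValueError("need the same non-zero number of scores and labels")
    judged = [score >= threshold for score in judge_scores]
    matches = sum(j == h for j, h in zip(judged, human_labels))
    return matches / len(human_labels)


# Hypothetical Faithfulness scores and human "is this answer grounded?" labels.
judge = [0.9, 0.4, 0.8, 0.2, 0.7]
human = [True, False, True, False, False]

rate = agreement_rate(judge, human)
print(rate)  # 0.8
```

If agreement is low on your domain, treat the automated scores as a trend indicator rather than an absolute measure.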

Summary

Most teams ship RAG systems and evaluate them with vibes. RAGAS gives you a structured, automated way to know exactly where your pipeline is failing (retrieval or generation) and gives you the feedback loop to fix it systematically.

This is the difference between iterating on your AI system and guessing about it.

The framework is open-source, takes an afternoon to integrate, and has become the standard for a reason. If you’re running RAG in production without evaluation metrics, that’s the technical debt your team should be paying down next.

How we built an MCP Guardrail to enforce tech policy in real-time

Anna Danilec — Tue, 12 May 2026 06:30:59 +0000

In 2026, most mid-sized and large organizations are aggressively adopting AI coding assistants such as Cursor, Claude Desktop and Windsurf. Developers are now generating a significant portion of code using LLMs.

However, this acceleration brings serious risks. According to the GitGuardian State of Secrets Sprawl 2026 report, in 2025 alone 28.65 million new hardcoded secrets were detected on public GitHub, a 34% year-over-year increase. Secrets related to AI services grew by 81%. Commits assisted by tools like Claude Code show nearly double the credential leak rate (3.2% vs 1.5% for manually written code).

Additionally, Veracode’s 2025 research reveals that 45% of AI-generated code contains OWASP Top 10 vulnerabilities. In some languages (e.g. Java), this figure reaches as high as 70%.

The core challenge for CTOs and architects is clear: we provide developers with extremely powerful generative tools, but we fail to supply them with the company’s current context. Language models have no knowledge of internal Tech Radars, approved library lists, security policies, Architecture Decision Records (ADRs), or forbidden patterns. As a result, LLMs often suggest solutions that are outdated, misaligned with company architecture, or dangerous.

Organizations now face a difficult dilemma: either limit AI adoption (which is unrealistic in a competitive market) or lose control over their technology stack, security, and compliance.

Architect’s Guardrail solves this problem. It is a lightweight, local MCP Server (Model Context Protocol) that acts like “an architect sitting on the developer’s shoulder.”

When a developer asks the AI to generate code, the server delivers the company’s current policy to the LLM in real time: approved technologies, required patterns, security rules, and restrictions. This enables the AI to either generate policy-compliant code or clearly explain why a specific technology or approach is not allowed and suggest an alternative.

True AI Governance in 2026 is not about blocking language models.

It is about delivering up-to-date company knowledge exactly at the moment decisions are made.

The project is fully open-source:

GitHub: https://github.com/annadanilec/architect-guardrail

The Problem: Why Traditional Approaches Fail

Traditional DevSecOps controls, such as pre-commit hooks, SAST (Static Application Security Testing), SCA (Software Composition Analysis), and dependency scanning, act too late in the AI-assisted code generation process.

These tools analyze code only after it has been created, usually at the commit, pull request, or CI/CD build stage. In the world of agentic AI, where a model can generate dozens or hundreds of lines of code in seconds, a post-factum reaction is insufficient. The problem occurs the moment the developer accepts the LLM’s suggestion.

The root cause is the lack of organizational context in language models. LLMs do not know:

The company’s internal Tech Radar
Approved / unapproved libraries and their versions
Architecture Decision Records (ADR)
Security policies (e.g., how to handle secrets, authentication, logging)
Forbidden architectural patterns (e.g., direct database calls in backend services)

As a result, models rely on public training data, which is often outdated or misaligned with internal standards.

Key risks:

Credential and secret leaks
Use of unapproved, risky libraries
Inconsistency with company architecture
Lack of auditability of AI decisions

Aspect	Classic DevSecOps	Agentic AI 2026	Consequence
Problem detection moment	After commit / PR	At code generation time	Too late
Company policy awareness	None (only technical rules)	No organizational context	High risk
Feedback speed	Minutes–hours	Seconds (without guardrails)	Errors multiply quickly
AI decision auditability	Limited	Very low without dedicated tooling	Compliance difficulties

In the era of classic DevSecOps, control was reactive and code-focused. In the era of Agentic AI, control must be proactive and contextual, delivering company policy exactly when the model is making a decision.

Traditional tools simply cannot keep up with the speed and nature of generative AI.

Solution: Architect’s Guardrail (MCP Server)

Architect’s Guardrail is a lightweight, local server compliant with the MCP (Model Context Protocol). It serves as a context layer between the developer and the language model (Claude, Cursor, Windsurf, etc.).

What is MCP and why is it ideal for a guardrail?

MCP is an open protocol introduced by Anthropic in November 2024. It allows language models to retrieve context from external sources (tools, resources, and prompts) in real time, in a secure and structured way.

Unlike classic tool calling (e.g., LangChain or OpenAI function calling), MCP was designed for local IDE integration and secure context delivery. It supports two modes:

stdio (most common for local use)
HTTP (for enterprise deployments)

Why MCP is perfect for a guardrail:

Works before code generation: delivers context while the model is still thinking.
Natively supported by Claude Desktop, Cursor, and other modern AI tools.
Supports both tools and resources (e.g., the entire company policy as context).
Lightweight, runs locally, and does not require sending code outside the organization.
Provides full auditability: every request to the server can be logged.

Architecture

The architecture is deliberately simple so it can be implemented in just a few hours:

MCP Server – core (Python + FastMCP or TypeScript)
Policy sources: policy.json, Tech Radar (JSON or external integration), approved libraries, security rules, ADRs, forbidden patterns
Integration: Claude Desktop / Cursor / VS Code
Optional: Git repository with policy (auto-refresh on startup)

A basic version can fit in a single ~150–200 line server.py file.

How Architect’s Guardrail Works Step by Step

1. The developer writes a prompt

In Cursor or Claude Desktop, the developer enters:

“Create a secure payment processing endpoint in FastAPI with Stripe integration.”

2. The IDE sends a request to the MCP Server

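Before generating a response, the AI tool, acting as an MCP client, queries the local MCP Server over the stdio or HTTP transport. It can read the policy resources the server exposes or call its tools, passing arguments derived from:

the prompt,
the current file,
and the workspace context.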

3. The MCP Server provides policy context

The server reads the current organizational policy and returns structured context, for example:

{
  "approved_frameworks": ["fastapi", "django", "nestjs"],
  "preferred_http_client": "httpx",
  "forbidden": ["requests", "urllib3"],
  "secrets_handling": "Must use Vault / AWS Secrets Manager. No hardcoded keys.",
  "payment_requirements": [
    "rate limiting",
    "idempotency key",
    "PCI-compliant logging"
  ]
}

4. The LLM receives full organizational context

The model receives not only the user request, but also the company’s current engineering standards and security policies.

As a result, it:

generates policy-compliant code,
uses httpx,
implements proper secret handling,
adds rate limiting,
and follows required architectural patterns.

If the model attempts to use something forbidden, it explains why it is not allowed and proposes an approved alternative.

Example AI response with Guardrail enabled

“According to company policy, I cannot use the requests library. Instead, I will use httpx with retry logic, as required by our Tech Radar (ADR-142). Stripe secrets will be retrieved from Vault…”

Architecture Flow

Thanks to this approach, the guardrail operates proactively rather than reactively.

The policy is always up to date because the server can continuously read the latest version directly from Git or a centralized repository.

Implementation: Building Architect’s Guardrail in ~2 Hours

Open-source repository:
https://github.com/annadanilec/architect-guardrail

The basic version can be built in 1.5–2 hours if you have intermediate Python knowledge.

Project Structure

architect-guardrail/
├── src/
│   ├── server.py
│   ├── config.py
│   ├── policy/
│   └── tools/
├── policy/
│   ├── policy.json
│   ├── tech-radar.json
│   └── adrs/
├── mcp.json
├── pyproject.toml
└── README.md

Dependency Installation (pyproject.toml)

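[project]
name = "architect-guardrail"
version = "0.1.0"  # placeholder version; [project] needs one to build
requires-python = ">=3.10"
dependencies = [
    "mcp>=1.2.0",  # FastMCP ships with the official SDK from 1.2.0
    "pydantic>=2.0",
    "pyyaml",
    "httpx",
    "python-dotenv"
]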

Policy Configuration and Loading (config.py)

from pydantic import BaseModel
import json
from pathlib import Path

class Policy(BaseModel):
    approved_libraries: dict[str, list[str]]
    forbidden_libraries: list[str]
    secrets_handling: str
    preferred_patterns: dict
    adrs: dict

def load_policy() -> Policy:
    path = Path("policy/policy.json")
    with open(path) as f:
        data = json.load(f)
    return Policy(**data)

Main MCP Server (server.py)

from mcp.server.fastmcp import FastMCP
from .config import load_policy
# get_approved_libraries and validate_against_policy are defined below in this file,
# so they are not imported from .tools.policy_tools (the definitions would shadow the import)

mcp = FastMCP("Architect-Guardrail")

# Resources – automatically available to the LLM
@mcp.resource("policy://tech-radar")
def get_tech_radar() -> str:
    policy = load_policy()
    return policy.model_dump_json(indent=2)

@mcp.resource("policy://security-rules")
def get_security_rules() -> str:
    policy = load_policy()
    return f"""Security rules:
    - Secrets: {policy.secrets_handling}
    - Authentication: OAuth2 + JWT + short-lived tokens
    - Logging: structured logging + PII masking"""

# Tools
@mcp.tool()
def get_approved_libraries(language: str = "python", category: str | None = None) -> str:
    """Returns a list of approved libraries for a given language and category."""
    policy = load_policy()
    libs = policy.approved_libraries.get(language, [])
    if category:
        # category filtering
        return f"Approved {language} libraries in {category}: {libs}"
    return f"Approved {language} libraries: {libs}"

@mcp.tool()
def validate_against_policy(code_snippet: str, intent: str) -> dict:
    """Validates code against the current company policy."""
    policy = load_policy()
    issues = []

    if "requests" in code_snippet and "httpx" not in code_snippet:
        issues.append({
            "severity": "high",
            "message": "Use httpx instead of requests (according to ADR-142)",
            "suggestion": "from httpx import AsyncClient + retry logic"
        })

    if any(secret in code_snippet for secret in ["api_key", "password", "secret"]):
        issues.append({
            "severity": "critical",
            "message": "Hardcoded secrets are forbidden",
            "suggestion": "Use Vault / AWS Secrets Manager"
        })

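    return {"valid": len(issues) == 0, "issues": issues, "intent": intent}

if __name__ == "__main__":
    # Serve over stdio so the IDE can launch the server as a local subprocess
    mcp.run(transport="stdio")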
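The substring checks in validate_against_policy are easy to trip: a variable called requests_made or a lookup like environ["DB_PASSWORD"] would both be flagged. As a possible refinement (not the project's actual implementation), the standard-library ast module can inspect real imports and literal assignments instead. The function name and the hint list below are assumptions.

```python
import ast

SECRET_HINTS = ("api_key", "password", "secret", "token")  # assumed naming hints


def find_policy_issues(code: str, forbidden: set[str]) -> list[str]:
    """Flag forbidden imports and string literals assigned to secret-looking names."""
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            modules = []
        issues += [f"forbidden import: {m}" for m in modules if m in forbidden]

        # Only a literal string on the right-hand side counts as hardcoded.
        if (isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name) and any(
                        hint in target.id.lower() for hint in SECRET_HINTS):
                    issues.append(f"hardcoded secret: {target.id}")
    return issues


sample = '''
import requests
from os import environ
API_KEY = "sk-live-123"
db_password = environ["DB_PASSWORD"]
'''
sample_issues = find_policy_issues(sample, forbidden={"requests", "urllib3"})
print(sample_issues)  # ['forbidden import: requests', 'hardcoded secret: API_KEY']
```

This still only covers Python and will raise SyntaxError on incomplete snippets, so a production validator would need a fallback for partial code.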

Server Security

Runs exclusively locally (transport="stdio")
No public port exposure by default
Optional lightweight token authentication for HTTP mode (enterprise deployment)

Example Use Cases

Example 1: AI Attempts to Use requests

Developer Prompt

“Write a function to call an external API.”

AI Response with Guardrail Enabled

“According to our policy (Tech Radar + ADR-142), we do not use the requests library. Instead, we use httpx with automatic retry and circuit breaker support. Here is the updated implementation…”

Example 2: Hardcoded secret

AI Response

“I cannot generate code containing a hardcoded API key. According to the security policy (policy://security-rules), all secrets must be retrieved from Vault. Here is a version using hvac / boto3 Secrets Manager…”

Example 3: Deprecated Framework

AI Response

“The proposed usage of Flask 2.x is not compliant with our Tech Radar. The currently approved framework is FastAPI 0.115+. See the migration path in ADR-203.”

Advanced Capabilities

Tech Radar Integration – automatically fetch policies from databases, Notion, internal APIs, or Git repositories every X minutes
Policy Versioning – policy/v1.2.json + Git history
Audit trail – log prompts, returned context, AI decisions, and policy validations
Human in the loop – use tools such as request_approval() for critical changes like introducing a new library
Multi-policy support – different policies for backend, frontend, mobile, data science, and platform engineering teams
CI/CD integration – validate pull requests using MCP logs, policy validation results, and AI interaction history
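The audit trail can start as an append-only JSON Lines file with one record per tool call. This is a minimal sketch; the field names and helper functions are assumptions, not part of the published project.

```python
import json
from datetime import datetime, timezone


def audit_record(tool: str, arguments: dict, result: dict) -> str:
    """Serialize one MCP tool call as a single JSON line with a UTC timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "arguments": arguments,
        "result": result,
    }
    return json.dumps(record, sort_keys=True)


def append_audit_log(path: str, line: str) -> None:
    """Append one record to the log file (created on first write)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


line = audit_record(
    "validate_against_policy",
    {"intent": "call external API"},
    {"valid": False, "issues": ["Use httpx instead of requests"]},
)
print(line)
```

Calling `append_audit_log("audit.jsonl", line)` from each tool before it returns gives you a file that is easy to grep or ship to a log pipeline. Be careful about logging full code snippets, since they may contain the very secrets you are trying to catch.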

Practical Limitation: Context Availability vs Policy Enforcement

Current MCP integrations in tools such as Cursor significantly improve organizational context delivery, but an important distinction must be understood:

MCP makes policies available to the model, but does not automatically guarantee that the model will always consult them before generating code.

In practice, this means that simply attaching an MCP Server is not yet equivalent to deterministic policy enforcement. During testing of Architect’s Guardrail in Cursor, policy compliance was significantly more reliable when the AI was explicitly instructed to always consult MCP resources before generating production code.

This reveals one of the key challenges of enterprise AI governance in 2026:

Context availability is not the same as policy enforcement.

To improve reliability, several additional layers can be introduced.

Recommended Enforcement Layers

Persistent Workspace Rules (.cursorrules) – A lightweight but effective approach is adding project-level instructions such as:

Always consult Architect's Guardrail MCP resources before generating code.

Never use external libraries before validating them against company policy.

Always validate:
- secrets handling
- authentication
- logging
- architectural patterns

This significantly increases the probability that Cursor consistently consults MCP resources during generation.

Mandatory Policy Pre-Check – A stronger approach is introducing a middleware layer that automatically injects policy context before generation:

User Prompt
    ↓
Policy Validation Layer
    ↓
MCP Policy Retrieval
    ↓
LLM Generation

This makes policy delivery deterministic: the model always sees the policy, even though its compliance with that policy is still not guaranteed.
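The pre-check layer can be sketched as a function that always prepends the current policy to the user's request before it reaches the model. The prompt wording and function name are assumptions; in a real setup the policy dict would come from the MCP server rather than being inlined.

```python
import json


def build_guarded_prompt(user_prompt: str, policy: dict) -> str:
    """Prepend the organizational policy so every generation request carries it."""
    policy_block = json.dumps(policy, indent=2, sort_keys=True)
    return (
        "Company policy (must be followed):\n"
        f"{policy_block}\n\n"
        "If the request conflicts with the policy, explain why and "
        "propose an approved alternative.\n\n"
        f"User request:\n{user_prompt}"
    )


policy = {
    "preferred_http_client": "httpx",
    "forbidden": ["requests", "urllib3"],
    "secrets_handling": "Must use Vault / AWS Secrets Manager. No hardcoded keys.",
}
guarded = build_guarded_prompt("Write a function to call an external API.", policy)
print(guarded)
```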

Output Validation – Even after generation, code can be validated against organizational policy:

Generated Code
      ↓
Policy Validator
      ↓
Approve / Reject / Rewrite

This creates a true enterprise-grade governance pipeline.
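The Approve / Reject / Rewrite step can be driven by the severities a validator reports. This sketch assumes issue dicts shaped like the ones validate_against_policy returns; the mapping from severity to decision is an assumption you would tune to your own policy.

```python
def review_decision(issues: list[dict]) -> str:
    """Turn validator findings into one of: 'approve', 'rewrite', 'reject'."""
    severities = {issue["severity"] for issue in issues}
    if "critical" in severities:
        return "reject"    # e.g. hardcoded secrets: never let the code through
    if severities:
        return "rewrite"   # fixable findings: send back to the model with suggestions
    return "approve"


decision = review_decision([
    {"severity": "high", "message": "Use httpx instead of requests (according to ADR-142)"},
])
print(decision)  # rewrite
```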

Architect’s Guardrail should therefore be viewed not as a complete enforcement system by itself, but as foundational governance infrastructure for AI-assisted development environments.

MCP provides the missing organizational context layer. Deterministic enforcement still requires orchestration, workflow design, and validation mechanisms on top of it.

Business Results and Impact

Implementing Architect’s Guardrail delivers measurable benefits for both the organization and individual teams.

Benefits for CTOs, Architects, and Security Teams:

Real-time policy enforcement
Significant risk reduction
Full auditability of AI decisions
Reduced manual review workload
Easier maintenance of Tech Radar consistency

Benefits for Developers:

Faster development
Less friction
Educational effect (AI explains decisions)
Higher confidence that code will pass review

| Metric | Before | After | Change |
|---|---|---|---|
| Compliance rate | 60–70% | 90–96% | +30–40% |
| Risky AI suggestions per week | 20–35 | 3–6 | -80–85% |
| Hardcoded secrets in PRs | 6–12 | 0–2 | -85–90% |
| Time from prompt to approved code | 40–55 min | 25–35 min | -30–40% |

Comparison with Other Approaches

There are several ways to enforce policies in AI environments. Below is a comparison of the most common approaches alongside the MCP-based Architect’s Guardrail solution.

Approach	Strengths	Weaknesses	Real-time Effectiveness
Prompt Engineering	Very easy to start	Brittle, context limits, hard to maintain	Low
LangChain / LangGraph	Flexible for complex agents	Heavy, high overhead, poor IDE integration	Medium
Central Proxy (Guardrails AI, NeMo)	Strong central control	Latency, single point of failure, complex	High
Local MCP Guardrail	Native IDE integration, low latency, local	Requires local installation	Very High

Architect’s Guardrail stands out due to its simplicity, locality, and excellent developer experience.

Development Roadmap & Recommendations

Stage 1: Basic Guardrail (2–4 hours)

Stage 2: Full Tech Radar + ADR integration (1–2 days)

Stage 3: Multi-team / multi-policy support (3–5 days)

Stage 4: Agent-driven policy updates (advanced)

Recommendations:

Start with Stage 1 in a pilot team.
Treat policy as code, store it in Git.
Combine rollout with team training on effective AI usage.
Aim for organization-wide standardization.

Conclusion

The best way to achieve AI Governance in 2026 is not by blocking models, but by giving them the right company context at the right moment.

Architect’s Guardrail proves that you can reconcile two seemingly contradictory goals: dramatically accelerating development with AI while maintaining strong control over quality, security, and architectural consistency.

MCP is emerging as a highly effective and elegant standard. It is lightweight, local, natively integrated with the best developer tools, and built with security in mind.

Recommendation: If your organization is already using Claude, Cursor, or similar tools, build your own Guardrail. The basic version takes just one evening.

In the coming months, MCP has a strong chance of becoming the de facto standard for enterprise AI governance, just as GitHub Actions became the standard for CI/CD and OpenTelemetry for observability.

It’s time to stop treating AI as an uncontrolled guest in the team.

It’s time to treat it as a highly talented team member that requires clear, precise guidance.

And it all starts with context.

Open-source project:
https://github.com/annadanilec/architect-guardrail
If you are experimenting with AI governance, MCP, or architecture-aware coding assistants, feel free to contribute or fork the project.

Deterministic Guardrails for Non-Deterministic Agents

Anna Danilec — Fri, 08 May 2026 11:18:53 +0000

By design, Large Language Models (LLMs) are non-deterministic. Even with an identical prompt, they can return different answers, trigger the wrong API, leak sensitive personal data, or start a costly chain of requests that burns through a monthly cloud budget. For engineers managing production systems, this is not an abstract risk; it is the scenario that keeps them up at night.

The solution does not lie in hoping for better models. Instead, it lies in a deterministic guardrail layer that governs the agent, regardless of the model's output. This article explores four strategic pillars of such an architecture, all built using FastAPI.

Pillar	Technology	Purpose
Governance Layer	FastAPI middleware	Compliance, PII Masking, & Cost Control
Resource Sandboxing	Dependency Injection	Minimizing the "Blast Radius"
Observability & Traceability	OpenTelemetry + OTLP	Visualizing Chain of Thought (CoT)
Async ROI	asyncio event loop	Infrastructure Cost Optimization

Pillar 1: Governance Layer – FastAPI Middleware

What is middleware and why is it a perfect place for guardrails?

In web architecture, middleware is logic executed during the lifecycle of every HTTP request — both before it reaches the intended handler and after the response is generated but before it reaches the client. Think of it as an airport security checkpoint: every passenger (request) must pass through it without exception.

In the context of AI agents, FastAPI middleware intercepts the agent's output before it reaches the end user. This provides a single, centralized layer to enforce compliance and safety policies, regardless of how many specialized agents are operating within the system.

The golden rule: guardrails are NOT part of the agent. The agent is a black box that can behave unpredictably. Guardrails must exist as an external control layer, operating independently of the agent's internal logic.

Implementation: Compliance Middleware

The middleware below intercepts every agent response and performs three operations: compliance validation, PII masking, and latency metrics. Cost tracking is handled by a separate middleware in the stack.

# middleware/guardrails.py
import time
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import Response, JSONResponse
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from prometheus_client import Histogram, Counter
import logging

logger = logging.getLogger(__name__)

# ── Prometheus Metrics ──────────────────────────────────────────
# The histogram tracks the distribution of response times (not just the average)
REQUEST_DURATION = Histogram(
    "agent_request_duration_ms",
    "Agent response time in ms",
    buckets=[100, 500, 1000, 5000, 15000, 30000, 60000]
)
# The Counter increases monotonically — ideal for anomaly detection alerts
POLICY_VIOLATIONS = Counter(
    "agent_policy_violations_total", "Number of blocked responses",
    labelnames=["reason"]
)
PII_DETECTIONS = Counter(
    "agent_pii_detections_total", "Number of detected PII entities",
    labelnames=["entity_type"]
)

# ── Prohibited Phrases (Prompt Injection / Compliance) ──────────
FORBIDDEN_PHRASES = [
    "ignore previous instructions",
    "ignore all instructions",
    "you are now",    # classic prompt injection attack
    "disregard your",
]


class AgentGuardrailsMiddleware(BaseHTTPMiddleware):

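    def __init__(self, app, languages: list[str] | None = None):
        super().__init__(app)
        # Initialize ONCE in the constructor: loading the NLP models is slow,
        # so we must not do it per request.
        # Note: a default AnalyzerEngine() supports English only. Analyzing 'pl'
        # requires an NLP engine configured with a Polish spaCy model and a
        # matching supported_languages list.
        self.languages = languages or ['en', 'pl']
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()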

    async def dispatch(self, request: Request, call_next):
        start_time = time.monotonic()

        # ── STEP 1: Call proper handler (agent) ───────────────────
        # call_next passes the request deeper into the middleware stack
        # until it reaches the intended endpoint. The agent performs all its logic:
        # querying the LLM, executing tools, and building the response.
        # We wait, and only THEN do we receive the response object.
        response = await call_next(request)

        # ── STEP 2: Collect streaming body ───────────────────────
        # Starlette streams responses chunk-by-chunk — there is
        # no single 'response.text'. We must assemble the body manually.
        # Note: this buffers the entire response in RAM.
        body = b''
        async for chunk in response.body_iterator:
            body += chunk
        text = body.decode('utf-8')

        # ── STEP 3: Compliance check ─────────────────────────────
        # Check if the agent returned any prohibited content.
        # We block the response here, before anything reaches the client.
        violation = self._check_policy(text)
        if violation:
            POLICY_VIOLATIONS.labels(reason=violation).inc()
            logger.warning(f'Policy violation [{violation}]')
            return JSONResponse(
                status_code=403,
                content={'error': 'Response blocked', 'reason': violation}
            )

        # ── STEP 4: PII masking with Presidio ────────────────────
        # Presidio uses NLP models (spaCy) instead of pure RegEx.
        # It understands context: '48123456789' as a phone number
        # in 'call me at...' but not when it's an Order ID.
        masked_text = self._mask_pii(text)

        # ── STEP 5: Metrics ──────────────────────────────────────
        # duration_ms is sent to the Prometheus Histogram, allowing us
        # to calculate p50/p95/p99 latency and trigger alerts on threshold violations.
        duration_ms = (time.monotonic() - start_time) * 1000
        REQUEST_DURATION.observe(duration_ms)
        logger.info(f'Agent: {duration_ms:.0f}ms, {len(text)} chars')

        # ── STEP 6: Return response ──────────────────────────────
        # Crucial: replicate the original status_code and headers.
        # Without this you lose Content-Type and bypass original 403/404 codes.
        # Drop Content-Length, though: masking changes the body size, and
        # Response recomputes the header when it is absent.
        headers = dict(response.headers)
        headers.pop('content-length', None)
        return Response(
            content=masked_text.encode('utf-8'),
            status_code=response.status_code,   # keep original code
            headers=headers,                     # keep original headers
            media_type=response.media_type,
        )

    def _check_policy(self, text: str) -> str | None:
        """Returns the violation description or None if the text is compliant."""
        lower = text.lower()
        for phrase in FORBIDDEN_PHRASES:
            if phrase in lower:
                return f'prompt_injection: "{phrase}"'
        return None

    def _mask_pii(self, text: str) -> str:
        # Presidio operates in two stages:
        # 1. analyzer.analyze() — detects entities and returns their positions
        # 2. anonymizer.anonymize() — replaces them according to the defined strategy
        results = []
        for lang in self.languages:
            results.extend(self.analyzer.analyze(
                text=text, language=lang,
                entities=['PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER',
                          'IBAN_CODE', 'CREDIT_CARD', 'IP_ADDRESS'],
                score_threshold=0.6,  # eliminates most false positives
            ))
        if not results:
            return text

        # Log the entity TYPE only — never the value itself (that would be a PII leak!)
        for r in results:
            PII_DETECTIONS.labels(entity_type=r.entity_type).inc()
            logger.info(f'PII: {r.entity_type} score={r.score:.2f}')

        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "PERSON":        OperatorConfig("replace", {"new_value": "[PERSON]"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
                "PHONE_NUMBER":  OperatorConfig("replace", {"new_value": "[PHONE]"}),
                "IBAN_CODE":     OperatorConfig("replace", {"new_value": "[IBAN]"}),
                "CREDIT_CARD":   OperatorConfig("replace", {"new_value": "[CARD]"}),
                "IP_ADDRESS":    OperatorConfig("replace", {"new_value": "[IP]"}),
            }
        )
        return anonymized.text

What happens step by step?

call_next(request) - The middleware passes the request down the stack until it reaches the intended endpoint (the agent). The agent performs its logic: querying the LLM, calling tools, and constructing the response. We wait for the final output.
response.body_iterator - Starlette streams responses chunk-by-chunk, so there is no single response.text field available. We must manually reassemble the body in a loop. Warning: this buffers the entire response in RAM; for extremely large payloads, a size limit should be implemented.
_check_policy(text) - We validate the agent's output for prohibited content. The example checks a short list of prompt-injection phrases; the same hook can host checks for off-limits topics or profanity. If a violation is detected, we block the response before it ever reaches the client and return a 403 Forbidden.
_mask_pii(text) - Microsoft Presidio works in two stages: first, analyzer.analyze() detects entities and their positions using linguistic context; then, anonymizer.anonymize() replaces them with placeholders. Crucial: we log only the entity TYPE, never the actual value.
REQUEST_DURATION.observe() - Response time is recorded in a Prometheus Histogram. A Histogram (rather than a simple Counter) allows us to calculate p50/p95/p99 latency and build percentile-based alerts.
Response(status_code=…, headers=…) - We return the final response while preserving the original status code and headers. Without this, you lose metadata like Content-Type or Cache-Control, and the client might receive a 200 OK even if the underlying logic returned a 404. The one header that must not be copied is Content-Length, because masking changes the body size.

Why is this better than internal agent guardrails? Because middleware operates across ALL endpoints simultaneously. Whether you have one agent or fifty, policies are defined once and enforced centrally.

Registering the middleware

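# main.py
from fastapi import FastAPI
from middleware.guardrails import AgentGuardrailsMiddleware
# CostTrackingMiddleware and AuthMiddleware are not shown in this article;
# import them from wherever they live in your project.

app = FastAPI()

# Middleware stack: order matters!
# The last middleware added is the outermost, so it sees the request first (LIFO)
app.add_middleware(AgentGuardrailsMiddleware)
app.add_middleware(CostTrackingMiddleware)
app.add_middleware(AuthMiddleware)  # <-- this executes first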

Middleware is executed in reverse order of registration (LIFO). Authentication should always be registered last to ensure it triggers first — there is no point processing PII masking for an unauthorized request.
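The ordering rule is easier to see in a plain-Python model of the middleware "onion". This is not Starlette's implementation, only a sketch of the same wrapping idea; `make_middleware` and `build_stack` are hypothetical helpers.

```python
from typing import Callable

Handler = Callable[[str], str]


def make_middleware(name: str, log: list[str]) -> Callable[[Handler], Handler]:
    def wrap(inner: Handler) -> Handler:
        def handler(request: str) -> str:
            log.append(f"{name}: request")
            response = inner(request)
            log.append(f"{name}: response")
            return response
        return handler
    return wrap


def build_stack(endpoint: Handler, registered: list[str], log: list[str]) -> Handler:
    # Each newly registered middleware wraps everything registered before it,
    # so the last one added ends up outermost.
    app = endpoint
    for name in registered:
        app = make_middleware(name, log)(app)
    return app


log: list[str] = []
app = build_stack(
    lambda req: f"handled {req}",
    ["Guardrails", "CostTracking", "Auth"],  # same order as the add_middleware calls
    log,
)
result = app("GET /agent")
```

After the call, `log` starts with `Auth: request` and ends with `Auth: response`: the middleware registered last sees the request first and the response last.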

Pillar 2: Resource Sandboxing – Dependency Injection

The problem: what if the agent goes in the wrong direction?

Imagine an AI agent with access to your production database. The LLM hallucinates, generates a destructive SQL query, and within seconds your data is gone. This isn't science fiction — it's a reality already faced by some early adopters of AI agents.

Blast radius is an SRE term defining the maximum potential damage caused by a single component failure. For AI agents, we minimize this through Resource Sandboxing: the agent is granted only the minimum necessary permissions, operates in read-only mode, and remains strictly isolated from the rest of the infrastructure.

Dependency Injection in FastAPI

FastAPI's built-in DI system lets us inject dependencies from the outside rather than creating them within the class. This gives us total control over exactly which resources the agent can access.

# dependencies/sandboxed_db.py
import re
from collections.abc import Iterator

from fastapi import Depends
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

# Separate connection pool, read-only
# DB account with SELECT-only privileges; no INSERT/UPDATE/DELETE
# (in production, load the credentials from a secret store, not source code)
READONLY_DB_URL = 'postgresql://agent_ro:pass@db-replica/prod'

readonly_engine = create_engine(
    READONLY_DB_URL,
    pool_size=5,           # Small pool: prevents the agent from exhausting connections
    max_overflow=2,        # Maximum of 7 total concurrent connections
    pool_timeout=10,       # Timeout: ensures the agent doesn't wait indefinitely
    connect_args={'options': '-c default_transaction_read_only=on'}
)

# Whole-word match, so a column such as deleted_at does not trigger DELETE
FORBIDDEN_SQL = re.compile(
    r'\b(INSERT|UPDATE|DELETE|DROP|TRUNCATE|ALTER|GRANT)\b', re.IGNORECASE
)

class SandboxedDatabase:
    """Database with enforced read-only access at the connection level."""

    def __init__(self, session: Session):
        self._session = session
        self._query_count = 0
        self._max_queries = 10  # Rate limit per request

    def execute_query(self, sql: str, params: dict | None = None):
        if self._query_count >= self._max_queries:
            raise PermissionError('Query limit exceeded for this agent session')

        # Block dangerous operations at the application level
        match = FORBIDDEN_SQL.search(sql)
        if match:
            raise PermissionError(f'Forbidden SQL keyword: {match.group(1).upper()}')

        self._query_count += 1
        return self._session.execute(text(sql), params or {})


def get_sandboxed_db() -> Iterator[SandboxedDatabase]:
    # Generator dependency: FastAPI closes the session after the response
    with Session(readonly_engine) as session:
        yield SandboxedDatabase(session)


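# `app` and the AgentRequest model are defined elsewhere in the service.
# Declared with plain `def`: the SQLAlchemy session is synchronous, so FastAPI
# runs this handler in its threadpool instead of blocking the event loop.
@app.post('/agent/query')
def agent_query(
    request: AgentRequest,
    db: SandboxedDatabase = Depends(get_sandboxed_db),  # Injection
):
    # The agent has access ONLY to db.execute_query()
    # No access to the engine, session, or other databases
    result = db.execute_query(request.sql)
    # Convert rows to plain dicts so the response is JSON-serializable
    return {'data': [dict(row) for row in result.mappings()]}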

Layered security – Defense in Depth

Four independent layers of protection:

Layer 1 – DB permissions: agent_ro has only SELECT at the PostgreSQL level. Even if the agent constructs a DELETE, the database will reject it.
Layer 2 – Connection options: default_transaction_read_only=on enforced at the connection level. An additional lock on the SQL session itself.
Layer 3 – Application validation: We scan for forbidden keywords in Python code. Any violation raises a PermissionError before the query reaches the database, and that error is the natural place to log and alert.
Layer 4 – Pool limits: The agent is restricted to 7 concurrent connections and 10 queries per request. This prevents it from overwhelming the infrastructure.

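Dependency Injection is more than a design pattern; here it also works as a security boundary. When an agent is a function that accepts a SandboxedDatabase as a parameter, its only sanctioned path to the data is that restricted interface. Python does not enforce private attributes, so this is a convention rather than a hard barrier, which is why the database-level permissions in Layers 1 and 2 still matter. The interface IS the guardrail, backed by the layers beneath it.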
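The application-level check can also be written as an allowlist instead of a blocklist. The sketch below is deliberately conservative and uses a hypothetical helper name, `is_read_only_select`: it accepts a single SELECT statement and rejects everything else, including CTEs and any query whose string literals happen to contain a forbidden word.

```python
import re

_FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|TRUNCATE|ALTER|GRANT)\b", re.IGNORECASE
)


def is_read_only_select(sql: str) -> bool:
    """Allow a single SELECT statement; reject everything else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # More than one statement in the same string.
        return False
    if not re.match(r"(?i)SELECT\b", statement):
        return False
    # Whole-word matching keeps columns like deleted_at from being flagged.
    return _FORBIDDEN.search(statement) is None
```

A check like this complements the database permissions; it does not replace them, because keyword scanning cannot understand SQL the way the database does.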

Tool Registry

We apply the same pattern to the agent's tools:

class AgentToolkit:
    def __init__(self, allowed_tools: list[str]):
        self._tools = {
            'search_products': self._search_products,   # OK
            'get_order_status': self._get_order_status, # OK
            # 'send_email': ...    NOT registered = unavailable
            # 'execute_code': ...  NOT registered = unavailable
        }
        self._available = {k: v for k, v in self._tools.items()
                          if k in allowed_tools}

    def call(self, tool_name: str, **kwargs):
        if tool_name not in self._available:
            raise PermissionError(f'Tool not available: {tool_name}')
        return self._available[tool_name](**kwargs)

    # Tool implementations are not shown in the article; placeholders only
    def _search_products(self, **kwargs):
        raise NotImplementedError

    def _get_order_status(self, **kwargs):
        raise NotImplementedError

Pillar 3: Observability & Traceability – OpenTelemetry

Why traditional APM doesn't work for AI agents

Tools like Datadog or New Relic were designed for synchronous, short-lived HTTP requests: request arrives, processed in 50ms, response sent. Core metrics focus on latency, error rate, and throughput.

AI agents fundamentally break this model. A single request can last 30–120 seconds, during which the agent executes a Chain of Thought (CoT) — a sequence of reasoning steps, each involving an LLM call and potential tool executions. Traditional APM sees one long, opaque request with zero visibility into what happened inside.

OpenTelemetry (OTel) is an open-source standard for distributed tracing, metrics, and logs. It provides a vendor-agnostic framework independent of the backend (Jaeger, Tempo, Datadog, Honeycomb). Instrument your code once, export to any destination.

Instrumenting FastAPI with OpenTelemetry

# observability/tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

def setup_telemetry(app):
    provider = TracerProvider(
        resource=Resource.create({
            'service.name': 'agent-service',
            'service.version': '2.1.0',
            'deployment.environment': 'production',
        })
    )

    exporter = OTLPSpanExporter(endpoint='http://otel-collector:4317')
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrumentation — all FastAPI requests are automatically traced
    FastAPIInstrumentor.instrument_app(app)

    # HTTP client instrumentation — LLM API calls are also tracked
    HTTPXClientInstrumentor().instrument()

Tracing the Chain of Thought – manual spans

# agent/reasoning.py
from opentelemetry import trace

tracer = trace.get_tracer('agent.reasoning')

MAX_STEPS = 5  # illustrative cap on reasoning steps
# call_llm, execute_tool and calculate_cost are application helpers not shown here

async def execute_agent(user_query: str) -> str:
    prompt = user_query
    with tracer.start_as_current_span('agent.execute') as agent_span:
        # Caution: raw queries may contain PII; mask them before tracing if needed
        agent_span.set_attribute('agent.query', user_query)
        agent_span.set_attribute('agent.model', 'claude-3-5-sonnet')

        for step in range(MAX_STEPS):
            with tracer.start_as_current_span(f'cot.step_{step}') as step_span:
                step_span.set_attribute('cot.step_number', step)

                with tracer.start_as_current_span('llm.call') as llm_span:
                    response = await call_llm(prompt)
                    # Record the token counts reported by the provider; len(prompt)
                    # would count characters. Assumes usage exposes prompt_tokens.
                    llm_span.set_attribute('llm.prompt_tokens',
                                          response.usage.prompt_tokens)
                    llm_span.set_attribute('llm.completion_tokens',
                                          response.usage.completion_tokens)
                    llm_span.set_attribute('llm.cost_usd',
                                          calculate_cost(response.usage))

                if response.tool_call:
                    with tracer.start_as_current_span('tool.call') as tool_span:
                        tool_span.set_attribute('tool.name', response.tool_call.name)
                        result = await execute_tool(response.tool_call)
                        tool_span.set_attribute('tool.success', True)
                        # Feed the tool output into the next step (format is illustrative)
                        prompt = f'{prompt}\nTool result: {result}'

                if response.is_final:
                    break

        return response.content

What you see in Grafana Tempo / Jaeger

agent.execute                          [0ms ────────────────────── 45,230ms]
  cot.step_0                           [12ms ─────── 8,895ms]
    llm.call                           [15ms ─── 7,800ms]   tokens: 412/623
    tool.call: search_products         [8,410ms ─ 8,890ms]  success: true
  cot.step_1                           [8,900ms ─────── 28,100ms]
    llm.call                           [8,900ms ─ 27,600ms] tokens: 1024/892
    tool.call: get_order_status        [27,610ms ─ 28,080ms]
  cot.step_2 (final)                   [28,100ms ─── 45,200ms]
    llm.call                           [28,100ms ─ 45,100ms] tokens: 2048/312

You can immediately see that cot.step_1 took 19 seconds — specifically the llm.call with 1024 input tokens. This provides a clear signal for optimization: prompt caching, switching models for that step, or context compression.

Without OTel: "request took 45 seconds."
With OTel: "step_1.llm.call took 18.7s with 1024 input tokens — probable cause: context window overhead."
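The arithmetic behind that reading is simple enough to script. The snippet below copies the `llm.call` start and end times from the trace listing above and computes each call's duration and the share of the request spent waiting on the model; in practice the same numbers would come from the tracing backend's query API.

```python
# (name, start_ms, end_ms) copied from the llm.call spans in the trace listing
llm_calls = [
    ("cot.step_0/llm.call", 15, 7_800),
    ("cot.step_1/llm.call", 8_900, 27_600),
    ("cot.step_2/llm.call", 28_100, 45_100),
]
TOTAL_MS = 45_230  # duration of the root agent.execute span

durations = {name: end - start for name, start, end in llm_calls}
slowest = max(durations, key=durations.get)
llm_share = sum(durations.values()) / TOTAL_MS

print(durations)
print(slowest, f"{llm_share:.0%} of the request spent inside LLM calls")
```

The slowest call is the one in step_1 at 18,700 ms, and the three LLM calls together account for about 96% of the 45-second request, so tool execution is not where the time goes.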

Pillar 4: Async ROI – The Event Loop Economy

The problem with synchronous code

Imagine a call center. In a synchronous model, one employee handles one customer at a time. While the customer spends 30 seconds looking up their order number, the employee sits idle. To handle 100 customers simultaneously, you need 100 employees.

In server terms: one thread per request. While a thread waits for an LLM response (typically 2–30 seconds), it uses almost no CPU but still holds its stack memory and a worker slot. Scaling to 1,000 concurrent agents requires 1,000 threads, and the memory and context-switching overhead of that many threads degrades a server long before its CPUs are busy.

Asyncio: cooperative multitasking

Python's asyncio implements cooperative multitasking. One thread, one event loop, many coroutines. When a coroutine waits for I/O (network, disk, LLM API), it yields control back to the event loop, which immediately starts processing another coroutine.
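A small self-contained experiment shows the effect. Here `asyncio.sleep` stands in for the network wait on an LLM API, and the function names and the 0.1-second latency are arbitrary choices for the demo.

```python
import asyncio
import time


async def fake_llm_call(request_id: int, latency_s: float) -> str:
    # asyncio.sleep yields to the event loop, like a real network wait would.
    await asyncio.sleep(latency_s)
    return f"response-{request_id}"


async def serve_concurrently(
    n_requests: int, latency_s: float
) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_llm_call(i, latency_s) for i in range(n_requests))
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(serve_concurrently(100, 0.1))
print(f"{len(results)} requests in {elapsed:.2f}s on one thread")
```

Handled one after another, 100 calls of 0.1 seconds each would take about 10 seconds. On the event loop they overlap, so the total stays close to the latency of a single call.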

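import httpx
import requests

# LLM_API (the endpoint URL), app and AgentRequest are defined elsewhere;
# the request payload is elided as {...}.

# Synchronous: blocks the thread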
def handle_request_sync(user_query: str) -> str:
    response = requests.post(LLM_API, json={...})  # <-- BLOCKS 8 seconds
    return response.json()['content']              # Thread sits idle

# Asynchronous — releases the thread while waiting
async def handle_request_async(user_query: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(LLM_API, json={...})
        # await = 'wait for the result, but release the event loop'
        # During the 8-second wait, the event loop processes other requests
    return response.json()['content']

@app.post('/agent/query')
async def agent_endpoint(request: AgentRequest):
    result = await handle_request_async(request.query)
    return {'response': result}

Concrete math: Synchronous vs Async

Based on a typical agent workload (8s average latency per LLM call):

Metric	Sync (Flask/Django)	Async (FastAPI)
Concurrent requests / vCPU	~15–25	~500–2000
RAM per 1,000 requests	~8–16 GB (threads)	~0.5–1 GB (coroutines)
CPU utilization during LLM wait	~2–5% (blocking)	~70–90% (other requests)
Instances required (1,000 req/s)	~40–60	~4–8
Estimated monthly cost (AWS)	~$4,800–7,200	~$480–960

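For this workload profile, a 5–10× cost reduction is a reasonable expectation rather than a guarantee: async allows a single thread to hold hundreds of concurrent LLM wait states, but the exact figure depends on instance types, payload sizes, and how much CPU-bound work each request does. For high I/O latency workloads, which is exactly what AI agents are, this is usually among the most significant cost optimizations available without a complete architectural overhaul.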
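The sizing logic behind such comparisons follows from Little's law: average concurrency equals arrival rate times latency. The sketch below uses illustrative inputs (8 vCPUs per instance is an assumption, 20 and 500 concurrent requests per vCPU are taken from within the table's ranges) and is not meant to reproduce the table's instance counts exactly.

```python
import math


def in_flight_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s


def instances_needed(in_flight: float, per_vcpu: int, vcpus: int) -> int:
    """Instances required if each vCPU sustains per_vcpu concurrent requests."""
    return math.ceil(in_flight / (per_vcpu * vcpus))


concurrency = in_flight_requests(1_000, 8.0)
sync_instances = instances_needed(concurrency, per_vcpu=20, vcpus=8)
async_instances = instances_needed(concurrency, per_vcpu=500, vcpus=8)

print(concurrency, sync_instances, async_instances)
```

With these inputs, 1,000 requests per second at 8 seconds each means 8,000 requests in flight, which works out to 50 sync instances versus 2 async ones. Real deployments would add headroom for redundancy and CPU-bound steps, which narrows the gap.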

Pitfalls of asynchronous code

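CPU-bound tasks block the event loop. Heavy computation (parsing large JSON, data transformations) freezes everything. asyncio.to_thread() keeps the loop responsive, although the GIL still serializes pure-Python work; for sustained heavy computation use a ProcessPoolExecutor.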
Synchronous libraries block. Using requests or psycopg2 from async code blocks the event loop. Use async counterparts: httpx, asyncpg, aiofiles.
Debugging is harder. Coroutine stack traces are less readable. Use asyncio.current_task() and structured logging.

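import asyncio

# heavy_json_parse is a placeholder for any CPU-heavy function

# WRONG: sync operation blocks the event loop
async def bad_agent_step(large_response: str):
    data = heavy_json_parse(large_response)  # Blocks everyone!
    return data

# CORRECT: move the blocking call off the event loop thread
# (for heavy pure-Python computation, a ProcessPoolExecutor also sidesteps the GIL)
async def good_agent_step(large_response: str):
    data = await asyncio.to_thread(heavy_json_parse, large_response)
    return data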

Summary: Production-Grade Architecture

Four pillars create a cohesive control layer around the non-deterministic agent:

Middleware - enforces policies regardless of what the agent generates
Dependency Injection - isolates the agent from infrastructure, minimizes blast radius
OpenTelemetry - provides visibility inside long, multi-step Chain of Thought processes
Async FastAPI - maximizes hardware utilization, with an estimated 5–10× cost reduction for I/O-bound agent workloads

The common denominator: the agent is a black box — and that is perfectly fine. You don't need to control how the LLM thinks. You control the system boundaries: what the agent can read, what it can send, how long it can run, and what the client is allowed to see. This is engineering, not magic.
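Of the boundaries listed above, "how long it can run" is the one without an example so far. A minimal sketch using `asyncio.wait_for` follows; `run_with_deadline`, `AgentTimeout`, and `slow_agent` are hypothetical names, and `slow_agent` only simulates an agent run.

```python
import asyncio


class AgentTimeout(Exception):
    """Raised when the agent exceeds its wall-clock budget."""


async def run_with_deadline(agent_coro, max_seconds: float):
    try:
        # wait_for cancels the wrapped coroutine once the deadline passes.
        return await asyncio.wait_for(agent_coro, timeout=max_seconds)
    except asyncio.TimeoutError:
        raise AgentTimeout(f"agent exceeded {max_seconds}s") from None


async def slow_agent(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for LLM calls and tool use
    return "done"


fast_result = asyncio.run(run_with_deadline(slow_agent(0.01), max_seconds=1.0))
```

Because the deadline is enforced outside the agent, it holds regardless of how many reasoning steps the model decides to take.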

Deterministic guardrails for non-deterministic agents. Don't try to "fix" non-determinism — surround it with deterministic infrastructure.

Originally published at invra.co