<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Trương Minh Sơn</title>
    <description>The latest articles on DEV Community by Trương Minh Sơn (@104221795).</description>
    <link>https://dev.to/104221795</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3927445%2Fa3753140-8801-42b4-bf76-3dfdd14b9adf.jpeg</url>
      <title>DEV Community: Trương Minh Sơn</title>
      <link>https://dev.to/104221795</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/104221795"/>
    <language>en</language>
    <item>
      <title>Building a Full Evaluation and Guardrail System for a RAG App</title>
      <dc:creator>Trương Minh Sơn</dc:creator>
      <pubDate>Tue, 12 May 2026 14:57:39 +0000</pubDate>
      <link>https://dev.to/104221795/-building-a-full-evaluation-and-guardrail-system-for-a-rag-app-2n44</link>
      <guid>https://dev.to/104221795/-building-a-full-evaluation-and-guardrail-system-for-a-rag-app-2n44</guid>
      <description>&lt;h1&gt;
  
  
  Building a Full Evaluation and Guardrail System for a RAG App
&lt;/h1&gt;


&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In Lab 24, I built a full evaluation and guardrail layer around a retrieval-augmented generation system. The goal was not just to make a RAG demo work, but to make it measurable, safer, and easier to operate. The final system connects to my Day 18 corpus, generates an evaluation test set, runs RAGAS-style scoring, performs LLM-as-judge calibration, applies input and output guardrails, runs adversarial tests, benchmarks latency, and documents production SLOs in a blueprint.&lt;/p&gt;

&lt;p&gt;The system is intentionally reproducible. When API keys are unavailable, it uses deterministic fallback logic so every script still runs locally on Windows. Live Gemini judging, Groq output guarding, and Presidio NER are supported as opt-in extensions, but the default grading path remains stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 18 Corpus Integration
&lt;/h2&gt;

&lt;p&gt;The evaluation set is grounded in the Day 18 RAG corpus. The corpus includes two source PDFs: a BCTC tax document and Nghị định 13/2023/NĐ-CP on personal data protection. Together the source PDFs contain 41 pages, from which Lab 24 derives 52 text evidence chunks for evaluation. This gives the evaluation enough coverage to test simple factual questions, reasoning questions, and multi-context questions.&lt;/p&gt;

&lt;p&gt;The Day 18 dense pipeline can require Qdrant and local model downloads, so I added a lightweight adapter for Lab 24. The adapter uses the Day 18 test set, PDF evidence, and prior RAGAS report contexts to provide local retrieval behavior without requiring external services during grading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase A: RAGAS Evaluation
&lt;/h2&gt;

&lt;p&gt;The generated test set contains 52 questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;26 simple questions&lt;/li&gt;
&lt;li&gt;13 reasoning questions&lt;/li&gt;
&lt;li&gt;13 multi-context questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current RAGAS-style aggregate scores are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;0.955&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Relevancy&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;0.787&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Recall&lt;/td&gt;
&lt;td&gt;0.908&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The evaluation gate fails if any minimum threshold is missed. This makes the evaluation usable in CI/CD rather than just as an offline report.&lt;/p&gt;
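
&lt;p&gt;A minimal sketch of such a gate, with illustrative threshold floors rather than the project's real SLOs:&lt;/p&gt;

```python
# Sketch of the evaluation gate used in CI. The floor values below are
# illustrative assumptions, not the project's actual thresholds.
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.70,
    "context_recall": 0.80,
}

def gate(scores):
    """Return the list of failed metrics; an empty list means the gate passes."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        score = scores.get(metric, 0.0)  # a missing metric counts as a failure
        if not score >= floor:
            failures.append(f"{metric}={score:.3f} misses floor {floor}")
    return failures

# In CI the script would exit non-zero whenever gate(...) is non-empty.
current = {"faithfulness": 0.955, "answer_relevancy": 0.933,
           "context_precision": 0.787, "context_recall": 0.908}
assert gate(current) == []
```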

&lt;h2&gt;
  
  
  Phase B: LLM-as-Judge and Calibration
&lt;/h2&gt;

&lt;p&gt;The judge system includes pairwise comparison, absolute scoring, and human calibration labels. Pairwise judging runs each comparison twice with swapped answer order to reduce position bias. Absolute scoring evaluates accuracy, relevance, conciseness, and helpfulness.&lt;/p&gt;

&lt;p&gt;I also added a cross-judge bonus protocol. It uses three judge profiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;accuracy_first&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;concise_first&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;completeness_first&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final winner is selected by majority aggregation. This makes judge behavior easier to inspect because disagreement between judge profiles becomes visible instead of hidden inside one score.&lt;/p&gt;
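
&lt;p&gt;The swap-and-vote protocol can be sketched like this, with &lt;code&gt;judge_fn&lt;/code&gt; standing in for the real judge call (assumed to return &lt;code&gt;"A"&lt;/code&gt;, &lt;code&gt;"B"&lt;/code&gt;, or &lt;code&gt;"tie"&lt;/code&gt; about the first/second answer):&lt;/p&gt;

```python
# Sketch of position-debiased pairwise judging with majority aggregation
# across the three judge profiles. `judge_fn(profile, a, b)` is a stand-in
# for the real judge call.
from collections import Counter

PROFILES = ["accuracy_first", "concise_first", "completeness_first"]

def debiased_verdict(judge_fn, profile, answer_a, answer_b):
    """Run the comparison twice with swapped order; keep only stable wins."""
    first = judge_fn(profile, answer_a, answer_b)
    second = judge_fn(profile, answer_b, answer_a)  # positions swapped
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    # If the verdict flips with the order, treat it as position bias.
    return first if first == swapped else "tie"

def majority_winner(judge_fn, answer_a, answer_b):
    """Aggregate the three profile verdicts; require at least 2 of 3 votes."""
    votes = Counter(debiased_verdict(judge_fn, p, answer_a, answer_b)
                    for p in PROFILES)
    winner, count = votes.most_common(1)[0]
    return winner if count >= 2 else "tie"
```

&lt;p&gt;Because each profile's verdict is recorded before aggregation, disagreement between profiles stays inspectable instead of being averaged away.&lt;/p&gt;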

&lt;h2&gt;
  
  
  Phase C: Guardrails
&lt;/h2&gt;

&lt;p&gt;The guardrail stack has four layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L1 input guard: PII redaction, topic validation, injection detection&lt;/li&gt;
&lt;li&gt;L2 RAG/LLM pipeline&lt;/li&gt;
&lt;li&gt;L3 output guard&lt;/li&gt;
&lt;li&gt;L4 async audit logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The input guard catches Vietnamese and English PII patterns such as emails, phone numbers, CCCD numbers, bank-account-like numbers, and names after phrases like “tôi là” (“I am”) or “my name is.” The output guard checks for harmful instructions, private data leakage, prompt leakage, unsafe high-stakes certainty, violent/hateful content, and jailbreak compliance.&lt;/p&gt;
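
&lt;p&gt;A simplified sketch of the L1 redaction step; the patterns shown here are illustrative, and the real guard uses a larger rule set:&lt;/p&gt;

```python
# Sketch of the L1 input-guard PII redaction. The regexes below are
# simplified illustrations of the Vietnamese/English patterns described
# above, not the production rule set.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    # Vietnamese mobile numbers: leading 0 or +84, then nine digits.
    "PHONE_VN": re.compile(r"(?:\+84|0)\d{9}\b"),
    # CCCD (citizen ID) numbers are twelve digits.
    "CCCD": re.compile(r"\b\d{12}\b"),
    # Names introduced by "tôi là" (Vietnamese) or "my name is" (English).
    "NAME": re.compile(r"(?:tôi là|my name is)\s+(\w+(?:\s\w+){0,3})",
                       re.IGNORECASE),
}

def redact(text):
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        if label == "NAME":
            # Keep the introducing phrase, redact only the captured name.
            text = pattern.sub(
                lambda m: m.group(0).replace(m.group(1), "[NAME]"), text)
        else:
            text = pattern.sub(f"[{label}]", text)
    return text
```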

&lt;p&gt;The latest guardrail results are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII recall: 88%&lt;/li&gt;
&lt;li&gt;Adversarial detection: 100%&lt;/li&gt;
&lt;li&gt;Output guard detection: 100%&lt;/li&gt;
&lt;li&gt;Output guard false positive rate: 0%&lt;/li&gt;
&lt;li&gt;L1 P95 latency: below 1ms in local fallback mode&lt;/li&gt;
&lt;li&gt;L3 P95 latency: below 1ms in local fallback mode&lt;/li&gt;
&lt;/ul&gt;
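
&lt;p&gt;The P95 latency of a guard layer can be measured locally with a loop like this (a sketch; &lt;code&gt;guard_fn&lt;/code&gt; is a placeholder for the real L1 or L3 check):&lt;/p&gt;

```python
# Sketch of the local latency benchmark for a guard layer.
import time
import statistics

def p95_latency_ms(guard_fn, inputs, runs=200):
    """Time repeated guard calls and report the 95th-percentile in ms."""
    samples = []
    for i in range(runs):
        text = inputs[i % len(inputs)]  # cycle through the test inputs
        start = time.perf_counter()
        guard_fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles with n=20 yields 5% steps; index 18 is the 95th percentile.
    return statistics.quantiles(samples, n=20)[18]
```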

&lt;h2&gt;
  
  
  Phase D: Production Blueprint
&lt;/h2&gt;

&lt;p&gt;The blueprint defines production SLOs, alert thresholds, incident playbooks, architecture, and cost estimates. It includes playbooks for faithfulness drops, latency spikes, guardrail detection drops, and false-positive spikes. The cost model estimates about $330/month for 100k monthly queries, with optimizations such as sampling, caching, smaller judge models, and async logging.&lt;/p&gt;
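
&lt;p&gt;As a quick sanity check on the cost figure, the per-query budget works out to about a third of a cent (only the ~$330/month total comes from the blueprint; no component breakdown is assumed here):&lt;/p&gt;

```python
# Back-of-envelope check of the blueprint's cost estimate.
MONTHLY_QUERIES = 100_000
MONTHLY_COST_USD = 330.0
cost_per_query = MONTHLY_COST_USD / MONTHLY_QUERIES
print(f"${cost_per_query:.4f} per query")  # prints "$0.0033 per query"
```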

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The biggest lesson is that RAG quality is not one metric. Faithfulness, relevancy, precision, and recall tell different stories. A system can retrieve relevant-looking chunks but still miss critical evidence. It can answer fluently but not faithfully. A CI gate helps because it turns these quality checks into a release requirement.&lt;/p&gt;

&lt;p&gt;The second lesson is that guardrails should be layered. Input validation alone is not enough because unsafe content can appear in retrieved documents or generated outputs. Output validation alone is not enough because malicious prompts can waste resources or leak into downstream components. Defense in depth is more practical.&lt;/p&gt;

&lt;p&gt;The third lesson is that reproducibility matters. In a classroom or CI environment, external APIs can fail or produce variable outputs. Deterministic fallbacks make the project easier to grade and debug, while opt-in live providers keep a path open for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Improvements
&lt;/h2&gt;

&lt;p&gt;The next step would be to connect the full Day 18 dense retrieval pipeline in production mode with Qdrant and model-backed generation. I would also replace starter human labels with real reviewer labels, add more Vietnamese adversarial examples, and publish dashboards from CI artifacts automatically.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
