<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akhona Eland</title>
    <description>The latest articles on DEV Community by Akhona Eland (@akhona_eland_072dac9e0c2c).</description>
    <link>https://dev.to/akhona_eland_072dac9e0c2c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857770%2F3a45453f-4618-4d5d-b8ec-16606097be8b.png</url>
      <title>DEV Community: Akhona Eland</title>
      <link>https://dev.to/akhona_eland_072dac9e0c2c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akhona_eland_072dac9e0c2c"/>
    <language>en</language>
    <item>
      <title>I Fine-Tuned a Compliance Judge and Beat the Stock Model by +29.6pp F1</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Wed, 22 Apr 2026 11:45:42 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/i-fine-tuned-a-compliance-judge-and-beat-the-stock-model-by-296pp-f1-4cgb</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/i-fine-tuned-a-compliance-judge-and-beat-the-stock-model-by-296pp-f1-4cgb</guid>
      <description>&lt;h1&gt;
  
  
  I Fine-Tuned a Compliance Judge and Beat the Stock Model by +29.6pp F1
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; if your LLM-powered product touches personal information in South Africa, POPIA sits over it. The regulator doesn't ask "is your model good?" — they ask "can you demonstrate the output was validated against the clause, and can you show me the validation?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable answer most teams give today:&lt;/strong&gt; "we call GPT-4 as a judge with a prompt that mentions POPIA." That's not a defence. It's non-deterministic, sends personal information cross-border, and produces no receipt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built instead:&lt;/strong&gt; a local NLI cross-encoder fine-tuned on 7 POPIA clauses, released under Apache 2.0, shipped as a quantized ONNX model, scored and gated on every CI run.&lt;/p&gt;

&lt;p&gt;The result, on a pinned 150-pair holdout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Stock &lt;code&gt;cross-encoder/nli-MiniLM2-L6-H768&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Fine-tuned &lt;code&gt;nli-popia-v1&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Macro F1&lt;/td&gt;
&lt;td&gt;0.517&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.813&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;0.707&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst clause&lt;/td&gt;
&lt;td&gt;0.400 (general processing / data subject rights)&lt;/td&gt;
&lt;td&gt;0.727 (cross-border transfers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best per-clause lift&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;+0.493&lt;/strong&gt; (general processing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regressions&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;+29.6 percentage points macro F1, every clause improved, nothing got worse.&lt;/strong&gt; The INT8 ONNX artifact is ~79MB per CPU variant on disk, runs at ~15ms per inference on CPU, and makes zero API calls.&lt;/p&gt;

&lt;p&gt;Here's how it went.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why NLI, not a prompt-based judge
&lt;/h2&gt;

&lt;p&gt;Natural Language Inference is an old, narrow, boring task: given a premise and a hypothesis, return the probability the premise entails the hypothesis. Cross-encoders have been doing this deterministically for a decade.&lt;/p&gt;

&lt;p&gt;If you reframe "does this text satisfy POPIA's consent clause?" as an NLI problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; the LLM's output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis:&lt;/strong&gt; "The text collects personal information only after obtaining explicit, informed, opt-in consent."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…you get a deterministic score in 0.0–1.0, in one tiny ONNX model, without shipping customer data to a third-party API.&lt;/p&gt;
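&lt;p&gt;Concretely, a 3-way NLI cross-encoder returns one logit per label for the pair; softmax and read off the entailment entry. A minimal sketch, assuming the common [contradiction, entailment, neutral] label order — always verify against your model's &lt;code&gt;id2label&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

def entailment_score(logits, entailment_index=1):
    """Map raw 3-way NLI logits to a 0-1 entailment probability via softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    return float(probs[entailment_index])

# Hypothetical logits for an (LLM output, clause hypothesis) pair
score = entailment_score([-1.2, 2.3, 0.4])  # entailment logit dominates
```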

&lt;p&gt;The catch: stock NLI models are trained on SNLI/MNLI. They're great at "a dog is playing in the park / an animal is outside" and terrible at "This message confirms your purchase; we'll process your data per our privacy policy / The text obtains explicit opt-in consent before collecting personal information."&lt;/p&gt;

&lt;p&gt;Stock macro F1 on POPIA clauses: &lt;strong&gt;0.517.&lt;/strong&gt; Two of the seven clauses — general processing and data subject rights — came in at &lt;strong&gt;0.400 F1&lt;/strong&gt;. Coin-flip territory.&lt;/p&gt;

&lt;p&gt;So I fine-tuned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The data: 180 hand-authored pairs, no scraping
&lt;/h2&gt;

&lt;p&gt;This is the part nobody wants to hear: I wrote the training data by hand.&lt;/p&gt;

&lt;p&gt;Seven clauses — consent, minimality, security safeguards, breach notification, cross-border transfers, general processing, data subject rights — × a handful of positive examples (text that satisfies the clause) + a handful of negatives (text that violates it) + paraphrases. About 180 pairs.&lt;/p&gt;

&lt;p&gt;Why hand-authored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scraped legal text is the wrong distribution.&lt;/strong&gt; My users aren't writing statutes; they're writing support replies, KYC confirmations, breach emails. I needed LLM-shaped text, not Act-shaped text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic generation would poison the eval.&lt;/strong&gt; If GPT-4 writes my training data and GPT-4 writes the outputs being validated in production, I'm measuring GPT-4's self-consistency, not POPIA compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;180 pairs is enough for 7-clause cross-encoder fine-tuning.&lt;/strong&gt; The base model already speaks English; I'm teaching it a narrow decision boundary, not a new language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 150-pair holdout was hand-authored separately, pinned by hash, and never leaks into training. If the hash of the eval file changes, the release gate fails.&lt;/p&gt;
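&lt;p&gt;The pinning itself is nothing exotic: hash the file's bytes, compare to a constant checked into the repo. A sketch — the digest shown is a hypothetical placeholder, not the repo's actual pin:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

# Hypothetical pinned digest of the holdout file, committed next to the gate
PINNED_SHA256 = "0f1e2d3c"  # truncated placeholder for illustration

def eval_file_digest(path):
    """SHA-256 over the eval file's exact bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def assert_eval_pinned(path, pinned=PINNED_SHA256):
    digest = eval_file_digest(path)
    if digest != pinned:
        raise RuntimeError(f"holdout changed: {digest} does not match pin {pinned}")
```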




&lt;h2&gt;
  
  
  The fine-tune: 5 epochs, ~6 minutes on CPU
&lt;/h2&gt;

&lt;p&gt;The whole training recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[train]"&lt;/span&gt;
python scripts/train_popia.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it's unremarkable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base: &lt;code&gt;cross-encoder/nli-MiniLM2-L6-H768&lt;/code&gt; (~22M params, tiny)&lt;/li&gt;
&lt;li&gt;5 epochs, batch 16, lr 2e-5, warmup 10%, weight decay 0.01&lt;/li&gt;
&lt;li&gt;Cross-entropy loss, early stopping on &lt;code&gt;eval_loss&lt;/code&gt; against a 10% dev split&lt;/li&gt;
&lt;li&gt;CPU training on 180 rows: &lt;strong&gt;~6 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ONNX export with four CPU-variant INT8 quantizations (AVX2 / AVX512 / AVX512-VNNI / ARM64), auto-selected at load time based on CPU detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each quantized variant is ~79MB; consumers only download the one their CPU needs. Inference is zero-PyTorch — &lt;code&gt;onnxruntime&lt;/code&gt; + &lt;code&gt;tokenizers&lt;/code&gt;, nothing else.&lt;/p&gt;
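&lt;p&gt;The variant auto-selection is just a capability ladder over CPU feature flags. A hedged sketch of the idea — the real loader's function names and detection method differ; this is illustrative only:&lt;/p&gt;

```python
import platform

def pick_int8_variant(cpu_flags, machine=None):
    """Choose the most capable INT8 ONNX variant for this CPU (illustrative)."""
    machine = (machine or platform.machine()).lower()
    if machine in ("arm64", "aarch64"):
        return "arm64"
    # Prefer the most specialized x86 instruction set available
    for flag, variant in (("avx512_vnni", "avx512-vnni"),
                          ("avx512f", "avx512"),
                          ("avx2", "avx2")):
        if flag in cpu_flags:
            return variant
    return "avx2"  # baseline fallback
```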




&lt;h2&gt;
  
  
  The release gate: CI fails if the next fine-tune regresses
&lt;/h2&gt;

&lt;p&gt;This is the part I think more ML projects should steal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/popia-eval.yml (abridged)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run release gate&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;python -m semantix.cli eval popia --json | tee report.json&lt;/span&gt;
    &lt;span class="s"&gt;python -c "import json; r=json.load(open('report.json'));&lt;/span&gt;
               &lt;span class="s"&gt;import sys; sys.exit(0 if r['release_gate_passed'] else 1)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gate logic is boring and strict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;release_gate_passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finetune_macro_f1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;stock_macro_f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;no_per_clause_regression_vs_stock&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any future fine-tune that drops below a +10pp macro-F1 lift, OR regresses a single clause vs stock, fails CI and blocks the release. The model artifact has the same quality gate as the code that loads it.&lt;/p&gt;
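&lt;p&gt;Spelled out as a function over per-clause F1 dicts — a sketch of the same logic, not the library's actual implementation:&lt;/p&gt;

```python
def release_gate(stock_per_clause, tuned_per_clause, min_lift=0.10):
    """True iff the macro-F1 lift is at least min_lift AND no clause regresses."""
    stock_macro = sum(stock_per_clause.values()) / len(stock_per_clause)
    tuned_macro = sum(tuned_per_clause.values()) / len(tuned_per_clause)
    no_regression = all(
        tuned_per_clause[c] >= stock_per_clause[c] for c in stock_per_clause
    )
    return (tuned_macro - stock_macro) >= min_lift and no_regression
```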




&lt;h2&gt;
  
  
  Using it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[popia]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.presets.popia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;POPIA_CONSENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;POPIA_SECURITY&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POPIA_CONSENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose_signup_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Your LLM call here. If the output doesn't satisfy POPIA_CONSENT,
&lt;/span&gt;    &lt;span class="c1"&gt;# the decorator retries with structured feedback. If it still fails,
&lt;/span&gt;    &lt;span class="c1"&gt;# it raises — with a Semantic Certificate in the audit trail.
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On first import, the quantized ONNX model downloads once from HuggingFace and caches locally. No HF token required — the model is public.&lt;/p&gt;

&lt;p&gt;Seven presets ship with the library, one per clause. Each has a pre-tuned threshold based on the per-clause F1 on the holdout. You can override any threshold; the defaults are the F1-optimal operating points.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I will and won't claim
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I will claim:&lt;/strong&gt; on a pinned 150-pair hand-authored POPIA holdout, the fine-tune beats the stock MiniLM2 NLI cross-encoder by +29.6pp macro F1, every clause improves, no regressions. That result is reproducible — the eval set is hashed, the CI gate enforces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I won't claim:&lt;/strong&gt; this model replaces a POPIA specialist, a DPIA, or the Information Regulator's guidance. It's a deterministic, local, auditable primitive you can wire into your validation pipeline. It tells you whether a specific output is consistent with a specific POPIA clause at a specific threshold. That's a narrower claim than "POPIA-compliant" and it's the only claim I can actually defend with a holdout F1 number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I especially won't claim:&lt;/strong&gt; 180 pairs is enough training data for every production use case. If your domain has dialect, local legal phrasing, or adversarial customers trying to slip past the guard, you should fine-tune on &lt;em&gt;your&lt;/em&gt; failures. The repo includes the training recipe for exactly that reason.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reusable part
&lt;/h2&gt;

&lt;p&gt;The thing I'm most interested in is that the entire recipe — hand-authored seeds + paraphrases + cross-encoder fine-tune + ONNX export + release gate — is regulation-agnostic. Swap POPIA for GDPR and you get &lt;code&gt;nli-gdpr-v1&lt;/code&gt;. Swap for HIPAA and you get &lt;code&gt;nli-hipaa-v1&lt;/code&gt;. Swap for EU AI Act clause libraries and you get a judge per article.&lt;/p&gt;

&lt;p&gt;v0.2.0 already ships a &lt;strong&gt;GDPR sibling-model scaffold&lt;/strong&gt; — same &lt;code&gt;Judge&lt;/code&gt; interface, 7 EU-clause presets, expansion seeds, training script, and a documented runtime fallback to POPIA weights until the GDPR artifact trains. It is deliberately a scaffold: same API surface, same CI gate pattern, no weights pretending to exist. That is the contract. The second regulator costs less than the first.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nli-popia-v1&lt;/code&gt; is the first trained artifact. It's 0.813 macro F1 and it's live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/0.2.0/" rel="noopener noreferrer"&gt;https://pypi.org/project/semantix-ai/0.2.0/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub release:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.2.0" rel="noopener noreferrer"&gt;https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.2.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model card:&lt;/strong&gt; &lt;a href="https://huggingface.co/labrat-aiko/nli-popia-v1" rel="noopener noreferrer"&gt;https://huggingface.co/labrat-aiko/nli-popia-v1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  If you're building on this
&lt;/h2&gt;

&lt;p&gt;The failure modes to watch for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Threshold tuning matters more than you'd think.&lt;/strong&gt; The per-clause F1-optimal thresholds in the presets are tuned on my holdout, not yours. If your domain's distribution is different, re-tune.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False negatives on ambiguous consent language.&lt;/strong&gt; "By continuing, you agree to…" is legally grey, and the model reflects that. Tighten the threshold if you want the library to err on the side of rejecting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This is a classifier, not a reasoner.&lt;/strong&gt; It doesn't explain &lt;em&gt;why&lt;/em&gt; a clause failed. Pair it with &lt;code&gt;semantix.judges.ForensicJudge&lt;/code&gt; (ships with &lt;code&gt;[turbo]&lt;/code&gt;) if you need a mask-perturbation saliency breach report.&lt;/li&gt;
&lt;/ol&gt;
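&lt;p&gt;Re-tuning a threshold (point 1) is a small sweep over your own labeled pairs — a self-contained sketch:&lt;/p&gt;

```python
def f1_at(scores, labels, threshold):
    """F1 of the positive class when scores at or above threshold count as pass."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, grid=None):
    """Pick the F1-optimal operating point on a held-out set."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: f1_at(scores, labels, t))
```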

&lt;p&gt;If you ship something interesting with it, or fine-tune a sibling (GDPR, HIPAA, UK DPA), I'd love to see it. Issues and PRs welcome on the repo.&lt;/p&gt;




&lt;h3&gt;
  
  
  Discuss this on LinkedIn
&lt;/h3&gt;

&lt;p&gt;I'm posting the short-form announcement over on LinkedIn — replies, questions, and "this would break on my domain because…" threads all land there: &lt;strong&gt;[link to the LinkedIn post in the first comment below]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Or open an issue on the repo if you'd rather keep it with the code.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;semantix-ai is an MIT-licensed semantic type system for AI outputs. &lt;code&gt;v0.2.0&lt;/code&gt; is the first release with compliance-specific fine-tunes and ships both the trained POPIA artifact and a GDPR sibling-model scaffold. The POPIA model weights are Apache 2.0. Everything here was built by one person; numbers are reproducible, judgement calls are mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>A 70ms Local NLI Judge Hits 0.596 Pearson r With Groq Llama 3.3 70B on DSPy Reward Scoring</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:35:30 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/a-70ms-local-nli-judge-hits-0596-pearson-r-with-groq-llama-33-70b-on-dspy-reward-scoring-1d76</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/a-70ms-local-nli-judge-hits-0596-pearson-r-with-groq-llama-33-70b-on-dspy-reward-scoring-1d76</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;semantic_reward&lt;/code&gt; is a drop-in DSPy reward function powered by a local quantized NLI cross-encoder&lt;/strong&gt; — no API call, no key, deterministic, ~70ms per evaluation on CPU.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;50 paired customer-support examples&lt;/strong&gt;, semantix reaches &lt;strong&gt;Pearson r = 0.596&lt;/strong&gt; with Groq Llama 3.3 70B, and &lt;strong&gt;Cohen's kappa 0.633 at threshold 0.3&lt;/strong&gt; (substantial agreement), at &lt;strong&gt;~11× lower latency and ~$0.13 less per 1k calls&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Full reproducibility: code, dataset, raw CSVs at &lt;a href="https://github.com/labrat-akhona/semantix-ai/tree/master/benchmarks" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai/benchmarks&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why another reward function?
&lt;/h2&gt;

&lt;p&gt;DSPy's &lt;code&gt;BestOfN&lt;/code&gt; and &lt;code&gt;Refine&lt;/code&gt; lean on a &lt;code&gt;reward_fn&lt;/code&gt; that scores each candidate from 0 to 1. In practice most users wire up another LLM call — cheap per-request but adds 300–1000 ms and a few cents per optimization run. If you're iterating, that adds up fast.&lt;/p&gt;

&lt;p&gt;semantix-ai ships a ~79 MB INT8 quantized NLI cross-encoder (one of four CPU-specific variants, auto-selected based on your hardware) that scores "does text X entail intent Y?" in ~70ms on CPU. Plugging it into DSPy takes one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.dspy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;semantic_reward&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Grounded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The answer must be grounded in the provided context.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;qa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context, question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;refined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BestOfN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;semantic_reward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Grounded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest scope
&lt;/h2&gt;

&lt;p&gt;I originally set out to benchmark four judges across two tasks with an optimization experiment. Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;customer_support_qa, semantix vs Groq Llama 3.3 70B: 50/50 paired scores, clean.&lt;/strong&gt; That's this post.&lt;/li&gt;
&lt;li&gt;⚠️ Gemini 2.5 Flash: 15/50 hit the free-tier 20-requests-per-day-per-model cap mid-run.&lt;/li&gt;
&lt;li&gt;⚠️ Gemini 2.5 Pro: 25/25 hit the same cap.&lt;/li&gt;
&lt;li&gt;⚠️ HotpotQA task and &lt;code&gt;BestOfN&lt;/code&gt; optimization experiment deferred — without Gemini as the final judge I couldn't close the loop, and I'd rather ship one clean pair than a multi-task table with holes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The raw CSV is committed with error columns intact. Everything you're about to see is reproducible from the 50 rows both judges agreed to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; 50 customer-support response candidates paired with one of ~10 intents ("The response must be polite and professional", "The response must stay on topic", "The agent must decline without being rude", etc.). Seeded generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;semantix&lt;/strong&gt;: &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; from v0.2.0. Auto-detected CPU variant, INT8 ONNX, &lt;code&gt;onnxruntime&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq&lt;/strong&gt;: &lt;code&gt;groq-llama-3.3-70b-versatile&lt;/code&gt;, free-tier API, temperature 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring protocol&lt;/strong&gt;: Both judges return a continuous 0–1 score. &lt;code&gt;passed&lt;/code&gt; is derived at threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Agreement results (paired n = 50)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pearson r (continuous scores)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.596&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.3&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.4&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.5&lt;/td&gt;
&lt;td&gt;0.487&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.7&lt;/td&gt;
&lt;td&gt;0.421&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary agreement @ 0.5&lt;/td&gt;
&lt;td&gt;76% (38/50)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary agreement @ 0.3&lt;/td&gt;
&lt;td&gt;84% (42/50)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pearson r = 0.596 is a &lt;strong&gt;moderate positive correlation&lt;/strong&gt; between the two judges on raw scores. The binary pass/fail story is more interesting: at the semantix-default threshold 0.5 the two agree on 76% of calls (moderate kappa of 0.487). Drop the threshold to 0.3 and they agree on 84% of calls at &lt;strong&gt;substantial kappa 0.633&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The actionable knob: &lt;strong&gt;if you want semantix to track Groq Llama 3.3 70B's polite-response classification, run it with threshold 0.3–0.4.&lt;/strong&gt; The default 0.5 is tuned against strict NLI datasets; for pragmatic customer-support scoring, a slightly looser threshold is closer to what a 70B LLM-judge would mark as "polite enough".&lt;/p&gt;
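&lt;p&gt;The agreement statistics are standard, so recomputing them from the raw CSV is cheap. A self-contained sketch of both formulas:&lt;/p&gt;

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation of two continuous score arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def cohens_kappa(a, b):
    """Chance-corrected agreement of two boolean pass/fail vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    po = float(np.mean(a == b))                                       # observed agreement
    pe = float(np.mean(a) * np.mean(b) + np.mean(~a) * np.mean(~b))   # chance agreement
    return (po - pe) / (1 - pe)
```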

&lt;h2&gt;
  
  
  Latency and cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;semantix&lt;/th&gt;
&lt;th&gt;groq-llama-3.3-70b&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean latency&lt;/td&gt;
&lt;td&gt;70 ms&lt;/td&gt;
&lt;td&gt;799 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;64 ms&lt;/td&gt;
&lt;td&gt;777 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95&lt;/td&gt;
&lt;td&gt;121 ms&lt;/td&gt;
&lt;td&gt;992 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paid cost / 1k calls&lt;/td&gt;
&lt;td&gt;$0.0000&lt;/td&gt;
&lt;td&gt;$0.1312&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~11× lower latency.&lt;/strong&gt; On a paid Groq plan, 1M calls per day would cost ~$131/day in Groq API fees alone; semantix adds $0 and never leaves your machine. For a DSPy optimization loop calling the reward function hundreds of times per trial, the difference compounds into hours saved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use semantix as your &lt;code&gt;reward_fn&lt;/code&gt; in &lt;code&gt;BestOfN&lt;/code&gt; and &lt;code&gt;Refine&lt;/code&gt;&lt;/strong&gt; when per-call latency of an LLM-as-judge would dominate your optimization loop. At substantial kappa with Groq on polite classification, it's a reasonable signal with two orders-of-magnitude better cost structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune the threshold&lt;/strong&gt; against your own held-out examples. The default 0.5 is too strict for conversational-tone tasks; 0.3–0.4 tracks a 70B LLM-judge more faithfully on this task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't use it as a reasoner.&lt;/strong&gt; It's a narrow entailment classifier. If your task needs "why is this wrong?", pair it with &lt;code&gt;ForensicJudge&lt;/code&gt; (mask-perturbation saliency) or keep the LLM for final scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A footnote on the bug that almost killed this post
&lt;/h2&gt;

&lt;p&gt;The original benchmark run on 2026-04-21 showed &lt;strong&gt;Pearson r = -0.594&lt;/strong&gt; — a strongly &lt;em&gt;negative&lt;/em&gt; correlation. I almost shipped that as "semantix disagrees with Groq, caveat emptor". Digging in, I found a label-ordering bug in &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; (shipped in v0.1.5, fixed in v0.2.0): the code was reading &lt;code&gt;probs[2]&lt;/code&gt; (neutral) as the entailment score instead of &lt;code&gt;probs[1]&lt;/code&gt;. Fixing the bug and re-running the 50 cached texts against v0.2.0 flipped the correlation sign and shifted the kappa from near-zero to substantial.&lt;/p&gt;

&lt;p&gt;The raw CSV preserves both runs' scores through git history if anyone wants to see the before/after. I'm noting this here because (a) it's a useful cautionary tale about trusting your benchmark when the numbers look too surprising, and (b) it's the exact kind of thing a release gate (like v0.2.0's POPIA macro-F1 gate) is supposed to catch, which it now does.&lt;/p&gt;
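&lt;p&gt;The cheap regression test that would have caught it: never hard-code the entailment index — derive it from the model config's label mapping and assert it at load time. A sketch; the mapping shown is the conventional one for this model family, so verify it against your own checkpoint's &lt;code&gt;config.json&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical id2label, as found in a Hugging Face model config.json
id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}

# Derive the index instead of assuming probs[1] (or, as the bug did, probs[2])
ENTAILMENT_INDEX = next(i for i, name in id2label.items() if name == "entailment")
assert ENTAILMENT_INDEX == 1
```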

&lt;h2&gt;
  
  
  Reproducing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/labrat-akhona/semantix-ai
&lt;span class="nb"&gt;cd &lt;/span&gt;semantix-ai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[turbo]"&lt;/span&gt;  &lt;span class="c"&gt;# zero-PyTorch install&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; benchmarks/requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add GROQ_API_KEY&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmarks.dspy.customer_support.run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results land in &lt;code&gt;benchmarks/dspy/customer_support/results/&lt;/code&gt; (&lt;code&gt;raw.csv&lt;/code&gt;, &lt;code&gt;summary.md&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Same minimal-first methodology will be applied to &lt;a href="https://github.com/dottxt-ai/outlines" rel="noopener noreferrer"&gt;outlines&lt;/a&gt;, &lt;a href="https://github.com/PrefectHQ/marvin" rel="noopener noreferrer"&gt;marvin&lt;/a&gt;, and &lt;a href="https://github.com/run-llama/llama_index" rel="noopener noreferrer"&gt;llama_index&lt;/a&gt; — one paired comparison, no holes, real numbers. A PR at stanfordnlp/dspy referencing this work is open: &lt;a href="https://github.com/stanfordnlp/dspy/pull/9653" rel="noopener noreferrer"&gt;stanfordnlp/dspy#9653&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;semantix-ai is MIT-licensed. PyPI: &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;. v0.2.0 also ships a POPIA-compliance fine-tune reaching 0.813 macro-F1 on a pinned holdout.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dspy</category>
      <category>llm</category>
      <category>python</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Build LLM Guardrails in 3 Lines of Python (No API Key, No Cloud)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 13 Apr 2026 09:39:35 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/build-llm-guardrails-in-3-lines-of-python-no-api-key-no-cloud-5amf</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/build-llm-guardrails-in-3-lines-of-python-no-api-key-no-cloud-5amf</guid>
      <description>&lt;h1&gt;
  
  
  Build LLM Guardrails in 3 Lines of Python (No API Key, No Cloud)
&lt;/h1&gt;

&lt;p&gt;Your LLM just told a customer their rash "looks like it could be melanoma." Your chatbot leaked a user's email address in a support response. Your RAG pipeline went off-topic and started explaining how to pick locks.&lt;/p&gt;

&lt;p&gt;These aren't hypotheticals. They're Tuesday.&lt;/p&gt;

&lt;p&gt;You need guardrails. Here's what that currently looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regex.&lt;/strong&gt; You write &lt;code&gt;r"(?i)(you should take|I recommend taking)"&lt;/code&gt; to catch medical advice. The model rephrases to "it might help to consider" and your filter is useless. You add more patterns. The model finds more phrasings. You are now maintaining a regex zoo that catches false positives and misses actual violations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-judge.&lt;/strong&gt; Call GPT-4 to review every output. That's 500ms–2s per check, $0.01–0.03 per call, and a hard dependency on an external API. Your guardrail is now slower than the thing it's guarding. Also, you need an API key in production, your costs scale with traffic, and when OpenAI has a bad day your guardrails go down.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud guardrail services.&lt;/strong&gt; AWS Bedrock Guardrails, Azure Content Safety, etc. Vendor lock-in, network latency, usage-based pricing, and your data leaves your infrastructure. Good luck explaining that to your compliance team.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are good. What you actually want is: check whether the output &lt;em&gt;means&lt;/em&gt; something bad, locally, in milliseconds, for free.&lt;/p&gt;




&lt;h2&gt;
  
  
  3 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NoPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text does not contain personal information such as names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NoMedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text does not provide medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;NoPII&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;NoMedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every call to &lt;code&gt;my_chatbot&lt;/code&gt; now runs through a local NLI model that checks whether the output violates your policies. ~15ms on CPU. No API key. No network call. No tokens burned.&lt;/p&gt;

&lt;p&gt;If the output leaks PII or gives medical advice, it raises &lt;code&gt;SemanticIntentError&lt;/code&gt; with the score, the violated intent, and a reason. The bad output never reaches your user.&lt;/p&gt;
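<![CDATA[
&lt;p&gt;In practice you will usually catch that exception and serve a safe fallback rather than let it bubble up to the user. A minimal sketch of the pattern; the &lt;code&gt;SemanticIntentError&lt;/code&gt; class below is a self-contained stand-in so the example runs on its own, not the library's actual import:&lt;/p&gt;

```python
# Stand-in for the library's exception class, so this sketch is
# self-contained. In real code you would import it from semantix.
class SemanticIntentError(Exception):
    def __init__(self, intent, score, reason):
        super().__init__(f"{intent} failed (score={score}): {reason}")
        self.intent, self.score, self.reason = intent, score, reason

def my_chatbot(message: str) -> str:
    # Pretend the validated function rejected its own output.
    raise SemanticIntentError("Not[PIILeakage]", 0.91,
                              "contains an email address")

def answer(message: str) -> str:
    try:
        return my_chatbot(message)
    except SemanticIntentError as err:
        # Log err.intent / err.score for auditing, serve a safe fallback.
        return "Sorry, I can't share that. Let me connect you with support."

print(answer("tell me about user 42"))
```
]]>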




&lt;h2&gt;
  
  
  How the negation pattern works
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;~&lt;/code&gt; operator is the key. An &lt;code&gt;Intent&lt;/code&gt; describes what something &lt;em&gt;is&lt;/em&gt;. &lt;code&gt;~Intent&lt;/code&gt; checks that the output is &lt;em&gt;not&lt;/em&gt; that thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToxicLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains insults, profanity, threats, or aggressive language.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LegalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides specific legal counsel or interprets laws for the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s situation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these describes a &lt;em&gt;bad thing&lt;/em&gt;. Negate them and you have guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt;
&lt;span class="n"&gt;Compliant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;
&lt;span class="n"&gt;Private&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;
&lt;span class="n"&gt;NotALawyer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;LegalAdvice&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;~MedicalAdvice&lt;/code&gt; creates a &lt;code&gt;Not[MedicalAdvice]&lt;/code&gt; intent. The NLI model checks whether the output entails the original description. If it does, the negated check fails. If it doesn't, the output is clean.&lt;/p&gt;

&lt;p&gt;This works because NLI models understand &lt;em&gt;meaning&lt;/em&gt;, not patterns. "You should take ibuprofen" and "Consider an anti-inflammatory" both entail medical advice. A regex catches neither unless you enumerate both phrasings explicitly. The NLI model catches both because they mean the same thing.&lt;/p&gt;
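<![CDATA[
&lt;p&gt;You can see the regex failure mode directly: the pattern from the intro catches the literal phrasing but misses the paraphrase, even though both mean the same thing.&lt;/p&gt;

```python
import re

# The regex from the intro: catches explicit recommendation phrasings only.
pattern = re.compile(r"(?i)(you should take|I recommend taking)")

literal = "You should take ibuprofen for that."
paraphrase = "Consider an anti-inflammatory for that."

print(bool(pattern.search(literal)))     # True  (caught)
print(bool(pattern.search(paraphrase)))  # False (missed, same meaning)
```
]]>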




&lt;h2&gt;
  
  
  Composing policies
&lt;/h2&gt;

&lt;p&gt;Real compliance isn't one rule. It's a policy: multiple constraints that must all hold, or of which at least one must hold. semantix gives you &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  All constraints must pass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customer_support&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;amp;&lt;/code&gt; operator creates an &lt;code&gt;AllOf&lt;/code&gt; composite. Every negated intent is checked. If any one fails, the output is rejected. This is your production safety policy expressed in a single Python type annotation.&lt;/p&gt;

&lt;h3&gt;
  
  
  At least one constraint must pass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Apology&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains a sincere apology for the inconvenience.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text redirects the user to the appropriate support channel.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_complaint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Apology&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Redirect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;|&lt;/code&gt; operator creates an &lt;code&gt;AnyOf&lt;/code&gt; composite. The output passes if it satisfies at least one intent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mix positive and negative
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Helpful&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides a clear, actionable answer to the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Helpful&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output must be helpful AND must not be toxic AND must not leak PII. Positive and negative constraints compose freely.&lt;/p&gt;
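<![CDATA[
&lt;p&gt;The operator API is plain Python operator overloading. Here is an illustrative sketch of how &lt;code&gt;~&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, and &lt;code&gt;|&lt;/code&gt; can build a policy tree; the toy predicates stand in for the NLI model, and this is not the library's actual source:&lt;/p&gt;

```python
# Illustrative reimplementation of the composition operators.
# Intent here takes a toy predicate instead of running an NLI model.
class Check:
    def __invert__(self):   return Not(self)
    def __and__(self, o):   return AllOf(self, o)
    def __or__(self, o):    return AnyOf(self, o)

class Intent(Check):
    def __init__(self, fn): self.fn = fn
    def passes(self, text): return self.fn(text)

class Not(Check):
    def __init__(self, inner): self.inner = inner
    def passes(self, text): return not self.inner.passes(text)

class AllOf(Check):
    def __init__(self, *parts): self.parts = parts
    def passes(self, text): return all(p.passes(text) for p in self.parts)

class AnyOf(Check):
    def __init__(self, *parts): self.parts = parts
    def passes(self, text): return any(p.passes(text) for p in self.parts)

Toxic = Intent(lambda t: "idiot" in t.lower())
PII = Intent(lambda t: "@" in t)

policy = ~Toxic & ~PII
print(policy.passes("Happy to help with your order."))  # True
print(policy.passes("Email john.doe@example.com."))     # False
```
]]>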




&lt;h2&gt;
  
  
  Self-healing retries
&lt;/h2&gt;

&lt;p&gt;Guardrails that just block are a blunt instrument. Sometimes you want the LLM to try again with feedback about what went wrong. Add &lt;code&gt;retries&lt;/code&gt; and a &lt;code&gt;semantix_feedback&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Helpful&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer this customer question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first call, &lt;code&gt;semantix_feedback&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. If the output fails validation, the decorator automatically injects a structured Markdown feedback block explaining what went wrong — the violated intent, the score, the rejected output. The LLM gets a second chance to fix it.&lt;/p&gt;

&lt;p&gt;This turns a guardrail from a wall into a feedback loop. The model learns from its mistake in-context and self-corrects. In practice, most violations are fixed on the first retry.&lt;/p&gt;

&lt;p&gt;The feedback looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Semantix Self-Healing Feedback&lt;/span&gt;

Attempt &lt;span class="gs"&gt;**1**&lt;/span&gt; failed validation.

&lt;span class="gu"&gt;### What went wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Intent:**&lt;/span&gt; &lt;span class="sb"&gt;`Not[PIILeakage]`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.9142 (threshold not met)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Judge reason:**&lt;/span&gt; Text contains what appears to be an email address

&lt;span class="gu"&gt;### What is required&lt;/span&gt;
The text must NOT satisfy the following:

The text contains personal information like names, emails, phone numbers, or addresses.

&lt;span class="gu"&gt;### Your previous output (rejected)&lt;/span&gt;
Sure, I can help! John's email is john.doe@example.com...

Please generate a new response that satisfies the requirement above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
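<![CDATA[
&lt;p&gt;The retry loop itself is a small pattern. A sketch of what the decorator does, with a toy string check standing in for the NLI validator and a toy LLM that corrects itself when given feedback:&lt;/p&gt;

```python
def validate(text):
    """Toy validator standing in for the NLI check."""
    if "@" in text:
        return False, "Text contains what appears to be an email address"
    return True, ""

def with_retries(generate, retries=2):
    feedback = None
    for attempt in range(retries + 1):
        output = generate(feedback)
        ok, reason = validate(output)
        if ok:
            return output
        # Inject structured feedback for the next attempt.
        feedback = (f"Attempt {attempt + 1} failed validation: {reason}. "
                    "Generate a new response without that content.")
    raise RuntimeError("all retries exhausted")

# Toy LLM: leaks an email first, then corrects itself given feedback.
def toy_llm(feedback):
    if feedback is None:
        return "Sure! John's email is john.doe@example.com"
    return "Sure! I can pass your message along to John."

print(with_retries(toy_llm))
```
]]>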






&lt;h2&gt;
  
  
  Testing guardrails in CI
&lt;/h2&gt;

&lt;p&gt;Guardrails in production are half the story. You also need to test that they work before you deploy. Two tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  pytest-semantix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_pii_in_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tell me about user 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_medical_advice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my head hurts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each test runs in ~15ms locally. No API key in CI secrets. No flaky network calls. Your guardrail tests run as fast as your unit tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Action
&lt;/h3&gt;

&lt;p&gt;Add semantic checks to your CI pipeline with the &lt;a href="https://github.com/labrat-akhona/semantic-test-action" rel="noopener noreferrer"&gt;semantic-test-action&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;labrat-akhona/semantic-test-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tests/&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;
    &lt;span class="na"&gt;report-format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs your &lt;code&gt;pytest-semantix&lt;/code&gt; tests in CI and produces a report. Failed guardrail tests block the PR. Your compliance policy is enforced before code reaches main.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's actually happening under the hood
&lt;/h2&gt;

&lt;p&gt;When you write &lt;code&gt;~MedicalAdvice&lt;/code&gt; and the decorator validates an output, here's the sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The decorator calls your function and captures the raw string output.&lt;/li&gt;
&lt;li&gt;It extracts the intent description from the class docstring.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;Not[X]&lt;/code&gt;, it checks whether the output entails &lt;code&gt;X&lt;/code&gt;. If the entailment score is &lt;em&gt;above&lt;/em&gt; the threshold, the negated check &lt;em&gt;fails&lt;/em&gt; — the output matches the bad thing.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;AllOf&lt;/code&gt;, it checks every component. All must pass.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;AnyOf&lt;/code&gt;, it checks components until one passes.&lt;/li&gt;
&lt;li&gt;The NLI model runs locally via ONNX Runtime (quantized INT8). No GPU required. ~15ms per check on CPU.&lt;/li&gt;
&lt;li&gt;If validation fails and retries remain, feedback is injected and the function is called again.&lt;/li&gt;
&lt;li&gt;If all retries are exhausted, &lt;code&gt;SemanticIntentError&lt;/code&gt; is raised with full diagnostics.&lt;/li&gt;
&lt;/ol&gt;
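<![CDATA[
&lt;p&gt;Steps 3 to 5 reduce to a few lines. A sketch with a stand-in scoring function in place of the ONNX model; the threshold value is illustrative:&lt;/p&gt;

```python
THRESHOLD = 0.8  # illustrative, not the library's actual default

def entailment_score(output: str, description: str) -> float:
    """Stand-in for the NLI model: P(output entails description)."""
    # Toy heuristic for illustration only.
    return 0.95 if "ibuprofen" in output.lower() else 0.05

def check(intent_description, output, negated=False):
    score = entailment_score(output, intent_description)
    entailed = score >= THRESHOLD
    # For Not[X], the check FAILS when the output entails X.
    return (not entailed) if negated else entailed

advice = "The text provides medical diagnoses or treatment recommendations."
print(check(advice, "You should take ibuprofen.", negated=True))      # False: violation
print(check(advice, "Please contact your pharmacist.", negated=True)) # True: clean
```
]]>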

&lt;p&gt;The model is downloaded once (~100MB) and cached locally. After that, everything is offline. Your guardrails work on an airplane.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to use this vs. other approaches
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use semantix guardrails when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need low-latency checks (&amp;lt; 20ms) in the hot path&lt;/li&gt;
&lt;li&gt;You can't send data to external APIs (compliance, air-gapped, privacy)&lt;/li&gt;
&lt;li&gt;You want deterministic, reproducible guardrail behavior&lt;/li&gt;
&lt;li&gt;You need guardrails in CI/CD, not just production&lt;/li&gt;
&lt;li&gt;You want zero marginal cost per check&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use an LLM-as-judge when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need nuanced, context-heavy evaluation that NLI can't capture&lt;/li&gt;
&lt;li&gt;Latency and cost don't matter&lt;/li&gt;
&lt;li&gt;You're doing one-off evaluations, not real-time guardrailing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use regex/keyword filters when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a known, fixed list of exact strings to block (e.g., specific slurs, specific SSN formats)&lt;/li&gt;
&lt;li&gt;You don't need semantic understanding, just pattern matching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, these stack. Use semantix for the fast semantic layer, regex for known-exact patterns, and LLM-as-judge for the hard cases that need deep reasoning. semantix handles the 90% of cases that regex can't catch and an LLM judge is too slow for.&lt;/p&gt;
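<![CDATA[
&lt;p&gt;That stacking is just sequential checks ordered by cost. A sketch of the layered layout, where &lt;code&gt;semantic_check&lt;/code&gt; and &lt;code&gt;llm_judge&lt;/code&gt; are placeholders for the semantix call and a judge-model call:&lt;/p&gt;

```python
import re

# Layer 1: known-exact pattern (cheapest). US SSN format as an example.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def semantic_check(text):
    # Placeholder for the fast local semantix layer.
    return "take ibuprofen" not in text.lower()

def llm_judge(text):
    # Placeholder for the slow, expensive judge; only reached on escalation.
    return True

def guard(text, escalate=False):
    if SSN.search(text):
        return False             # cheap exact layer first
    if not semantic_check(text):
        return False             # fast semantic layer next
    if escalate and not llm_judge(text):
        return False             # judge only for flagged hard cases
    return True

print(guard("My SSN is 123-45-6789"))       # False
print(guard("You should take ibuprofen."))  # False
print(guard("Your order ships tomorrow."))  # True
```
]]>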




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.10+. No API key. No GPU. Works on Linux, macOS, Windows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://labrat-akhona.github.io/semantix-ai/" rel="noopener noreferrer"&gt;labrat-akhona.github.io/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pytest-semantix:&lt;/strong&gt; &lt;a href="https://pypi.org/project/pytest-semantix/" rel="noopener noreferrer"&gt;pypi.org/project/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>Test Your LLM Outputs in pytest (15ms, No API Key)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 13 Apr 2026 07:57:25 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/test-your-llm-outputs-in-pytest-15ms-no-api-key-1mmj</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/test-your-llm-outputs-in-pytest-15ms-no-api-key-1mmj</guid>
      <description>&lt;p&gt;You've got an LLM-powered feature in production. You want to test it. Here are your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;String matching.&lt;/strong&gt; Works until the model rephrases "I'd be happy to help" as "Sure, let me assist you." Now your test is red and nothing is actually wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regex.&lt;/strong&gt; You write a pattern. It passes today, breaks tomorrow when the model adds a comma. You write a more permissive pattern. Now it passes on garbage too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-judge.&lt;/strong&gt; Call GPT-4 to evaluate the output. Your test suite now takes 4 minutes, costs money, and fails when OpenAI has a bad day. Your CI pipeline needs an API key in secrets. Your team stops running the tests locally.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are good. What you actually want is to test whether your output &lt;em&gt;means&lt;/em&gt; the right thing — without any of that overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  pytest-semantix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot_is_polite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a real pytest test. It runs locally on CPU in ~15ms. No API key. No network calls. No tokens burned.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pytest-semantix&lt;/code&gt; is a pytest plugin that wraps &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;semantix-ai&lt;/a&gt;'s semantic assertion engine as a native fixture. Under the hood, it uses a local NLI (Natural Language Inference) model to check whether your LLM output entails the given intent. You describe what you mean in plain English. The model checks entailment. Done.&lt;/p&gt;

&lt;p&gt;On failure, you get a score, the intent, and a reason — not just a raw traceback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: Semantic check failed (score=0.12)
  Intent:  polite and professional
  Output:  "You're an idiot for asking that."
  Reason:  Text contains aggressive language
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
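<![CDATA[
&lt;p&gt;Conceptually, the fixture is a thin wrapper around an entailment score and a threshold. A sketch with a toy scorer standing in for the local NLI model; the error format mirrors the failure output above:&lt;/p&gt;

```python
THRESHOLD = 0.8  # illustrative, not the plugin's actual default

def entailment_score(text, intent):
    # Toy stand-in for the local NLI model.
    return 0.12 if "idiot" in text.lower() else 0.93

def assert_semantic(text, intent):
    score = entailment_score(text, intent)
    if score < THRESHOLD:
        raise AssertionError(
            f"Semantic check failed (score={score:.2f})\n"
            f"  Intent:  {intent}\n"
            f"  Output:  {text!r}"
        )

assert_semantic("Happy to help!", "polite and professional")  # passes
try:
    assert_semantic("You're an idiot.", "polite and professional")
except AssertionError as e:
    print(str(e).splitlines()[0])  # Semantic check failed (score=0.12)
```
]]>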






&lt;h2&gt;
  
  
  Markers
&lt;/h2&gt;

&lt;p&gt;If you want to attach an intent to the test itself rather than the assertion call, use the &lt;code&gt;@pytest.mark.semantic&lt;/code&gt; marker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# intent comes from the marker
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when you have a single intent per test and want to see it at a glance in the decorator rather than buried in the function body.&lt;/p&gt;




&lt;h2&gt;
  
  
  Terminal Reports
&lt;/h2&gt;

&lt;p&gt;Pass &lt;code&gt;--semantic-report&lt;/code&gt; and you get a color-coded summary after the test session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pytest &lt;span class="nt"&gt;--semantic-report&lt;/span&gt;
&lt;span class="go"&gt;
======================== semantic assertion report =========================
  Total: 5  |  Passed: 4  |  Failed: 1

  [PASS] tests/test_bot.py::test_polite  [12ms]
  [PASS] tests/test_bot.py::test_helpful  [14ms]
  [FAIL] tests/test_bot.py::test_no_pii  (score=0.67)  Contains email address  [11ms]
  [PASS] tests/test_bot.py::test_on_topic  [13ms]
  [PASS] tests/test_bot.py::test_concise  [15ms]

============================================================================
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green for pass, red for fail. Each line shows the test, the score on failure, the reason, and the wall time. No need to scroll through pytest output hunting for which semantic check broke.&lt;/p&gt;




&lt;h2&gt;
  
  
  JSON Reports for CI
&lt;/h2&gt;

&lt;p&gt;For CI integration, export results to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--semantic-report-json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;semantic-results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"nodeid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tests/test_bot.py::test_polite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polite and professional"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;12.3&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"nodeid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tests/test_bot.py::test_no_pii"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text does not contain personal information"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Contains email address"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feed this into your CI dashboard, your Slack alerts, your artifact storage — whatever your pipeline already does with JSON test results.&lt;/p&gt;
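&lt;p&gt;As one way to consume that file, here is a small gate script you could run as a CI step after &lt;code&gt;pytest --semantic-report-json=semantic-results.json&lt;/code&gt;. The field names follow the report shown above; the script itself (&lt;code&gt;check_report&lt;/code&gt;) is an illustrative sketch, not part of the plugin:&lt;/p&gt;

```python
# ci_gate.py -- fail the build when any semantic assertion failed.
# check_report() is an illustrative helper; the field names match the
# JSON report format shown above.
import json
import sys

def check_report(path):
    with open(path) as f:
        report = json.load(f)

    summary = report["summary"]
    print(f"semantic checks: {summary['passed']}/{summary['total']} passed")

    for result in report["results"]:
        if not result["passed"]:
            # Surface the failing intent and the judge's reason in the CI log.
            print(f"FAILED {result['nodeid']}: '{result['intent']}' "
                  f"(score={result['score']}) {result['reason']}")

    return 0 if summary["failed"] == 0 else 1

# As a CI step: sys.exit(check_report("semantic-results.json"))
```

&lt;p&gt;A non-zero exit code fails the pipeline, and the per-test lines land in the CI log next to the rest of your pytest output.&lt;/p&gt;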




&lt;h2&gt;
  
  
  Negation for Compliance Testing
&lt;/h2&gt;

&lt;p&gt;Some of the most important LLM tests aren't about what the output &lt;em&gt;should&lt;/em&gt; say. They're about what it &lt;em&gt;shouldn't&lt;/em&gt; say.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, or phone numbers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_medical_advice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my head hurts what should I take&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_pii_leakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tell me about user 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;~&lt;/code&gt; operator negates the intent. The test passes only when the output does &lt;em&gt;not&lt;/em&gt; match. This is how you test guardrails: toxicity, off-topic drift, unauthorized disclosures, regulatory compliance. Define the bad thing as an intent, negate it, assert against your output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composing with Existing pytest
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pytest-semantix&lt;/code&gt; is a normal pytest plugin. It doesn't replace anything in your test suite — it adds a fixture. Everything you already use works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parametrize
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt,intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explain a refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clear and informative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;say goodbye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;friendly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot_intents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Combine with other fixtures
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MyChatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_fixtures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;friendly greeting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mix semantic and regular assertions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_structured_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate a JSON summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# regular assertion: valid JSON
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;   &lt;span class="c1"&gt;# regular assertion: has the key
&lt;/span&gt;    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concise and accurate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# semantic: means the right thing
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Global threshold
&lt;/h3&gt;

&lt;p&gt;If your team wants a stricter baseline across all tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--semantic-threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Individual tests can still override:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accurate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Actually Happening
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;assert_semantic(output, intent)&lt;/code&gt;, the plugin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolves the intent from the argument or the marker, raising an error if neither is present&lt;/li&gt;
&lt;li&gt;Passes the output and intent to a local NLI model via &lt;code&gt;semantix-ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The model returns a score and verdict&lt;/li&gt;
&lt;li&gt;The plugin records the result (nodeid, intent, score, duration) for reporting&lt;/li&gt;
&lt;li&gt;On failure, it raises &lt;code&gt;AssertionError&lt;/code&gt; with score + reason&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No network call. No subprocess. No container. The NLI model loads once per session and runs inference in-process. That's why it's ~15ms per assertion.&lt;/p&gt;
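&lt;p&gt;The flow can be sketched in a few lines. This is a conceptual model, not the plugin's actual source: the scorer is stubbed out, and the default threshold of 0.75 is an assumption for illustration.&lt;/p&gt;

```python
# Conceptual sketch of the steps above -- NOT the plugin's actual source.
# The NLI scorer is stubbed; in the real plugin a local model is loaded
# once per session via semantix-ai. The 0.75 default threshold is an
# assumption for illustration.
import time

RESULTS = []  # step 4: what gets recorded for reporting

def make_assert_semantic(score_fn, default_threshold=0.75):
    def assert_semantic(output, intent, threshold=None):
        limit = threshold if threshold is not None else default_threshold
        start = time.perf_counter()
        score, reason = score_fn(output, intent)   # steps 2-3: NLI verdict
        duration_ms = (time.perf_counter() - start) * 1000
        passed = score >= limit
        RESULTS.append({"intent": intent, "passed": passed,
                        "score": score, "duration_ms": duration_ms})
        if not passed:                             # step 5: fail loudly
            raise AssertionError(
                f"semantic assertion failed: '{intent}' "
                f"(score={score:.2f}) {reason}")
    return assert_semantic

# Stub scorer: entailment is "high" when the intent word appears verbatim.
def fake_scorer(output, intent):
    return (0.9, "") if intent in output else (0.3, "intent not expressed")

assert_semantic = make_assert_semantic(fake_scorer)
assert_semantic("a friendly greeting", "friendly")  # passes silently
```

&lt;p&gt;Because everything stays in-process, the only per-assertion cost is one forward pass of the NLI model, which is where the ~15ms figure comes from.&lt;/p&gt;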




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.10+ and pytest 7+. Pulls in &lt;code&gt;semantix-ai&lt;/code&gt; automatically.&lt;/p&gt;

&lt;p&gt;Then just use the &lt;code&gt;assert_semantic&lt;/code&gt; fixture in your tests. No configuration, no &lt;code&gt;conftest.py&lt;/code&gt; boilerplate, no setup step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/pytest-semantix/" rel="noopener noreferrer"&gt;pypi.org/project/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/pytest-semantix" rel="noopener noreferrer"&gt;github.com/labrat-akhona/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;semantix-ai (the engine):&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>testing</category>
      <category>llm</category>
      <category>pytest</category>
    </item>
    <item>
      <title>How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:58:35 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/how-to-fine-tune-gpt-4o-mini-on-your-own-guardrail-failures-50-lines-of-python-3l4n</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/how-to-fine-tune-gpt-4o-mini-on-your-own-guardrail-failures-50-lines-of-python-3l4n</guid>
      <description>&lt;h1&gt;
  
  
  How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)
&lt;/h1&gt;

&lt;p&gt;Every time your LLM gets corrected by a guardrail, a training example is born and immediately thrown away. This tutorial shows you how to catch those examples and use them to make your model better — automatically, with no manual labeling.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a working pipeline that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validates LLM outputs against natural language requirements&lt;/li&gt;
&lt;li&gt;Retries failures with structured feedback&lt;/li&gt;
&lt;li&gt;Captures every (rejected → corrected) pair to disk&lt;/li&gt;
&lt;li&gt;Exports those pairs in OpenAI fine-tuning format&lt;/li&gt;
&lt;li&gt;Uploads to OpenAI for fine-tuning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total code: ~50 lines. Total manual labeling: zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[all]"&lt;/span&gt; openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need an OpenAI API key for the LLM calls and fine-tuning upload. The validation itself runs locally — no API cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Define What "Correct" Means
&lt;/h2&gt;

&lt;p&gt;Semantix uses Intent classes. The docstring is the requirement. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without
    being rude, dismissive, or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConstructiveFeedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must provide encouraging, constructive feedback
    that acknowledges effort and suggests specific improvements.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't prompts. They're contracts. The validator checks every output against them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Wire Up Validation + Collection
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invitation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what happens when you call &lt;code&gt;decline_invite("the company retreat")&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPT-4o-mini generates a response&lt;/li&gt;
&lt;li&gt;Semantix validates it against the docstring using a local NLI model (~15ms)&lt;/li&gt;
&lt;li&gt;If it fails: structured feedback is injected via &lt;code&gt;semantix_feedback&lt;/code&gt; and the function retries&lt;/li&gt;
&lt;li&gt;If the retry passes: the (rejected, accepted) pair is appended to &lt;code&gt;training_data.jsonl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If it passes on the first try: nothing is collected (no correction happened)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;semantix_feedback&lt;/code&gt; parameter is optional. Declare it and the decorator fills it automatically on retries. Don't declare it and retries still work — the model just doesn't get the structured hint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Generate Traffic
&lt;/h2&gt;

&lt;p&gt;In production, this happens organically. For this tutorial, simulate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a birthday party for someone you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t like&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a mandatory corporate retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a wedding where you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re the best man&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a networking event at a bar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a charity gala you can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t afford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a baby shower for a coworker you barely know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a holiday dinner with your in-laws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a surprise party that isn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t a surprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, check what was captured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Correction pairs collected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_pairs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Intents: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pair represents a case where the model got it wrong, got feedback, and got it right. These are the hardest examples — exactly the ones worth training on.&lt;/p&gt;
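
&lt;p&gt;And because the file is plain JSONL, slicing it is trivial. If you want only the sharpest corrections, a few lines do it (illustrative — &lt;code&gt;min_gap&lt;/code&gt; is my own knob, not a semantix parameter; the field names follow the collector's record format):&lt;br&gt;
&lt;/p&gt;

```python
import json

def load_hard_pairs(path, min_gap=0.5):
    """Keep correction pairs where the judge score jumped sharply.

    A big rejected-to-accepted gap marks the clearest corrections.
    min_gap is an illustrative knob, not a semantix parameter.
    """
    pairs = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["accepted_score"] - rec["rejected_score"] >= min_gap:
                pairs.append(rec)
    return pairs
```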




&lt;h2&gt;
  
  
  Step 4: Export to Fine-Tuning Format
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each correction pair becomes a chat completion training example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You must satisfy the following requirement:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;The text must politely decline an invitation without being rude, dismissive, or aggressive."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate a response that satisfies the above requirement."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I won't be able to attend..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the &lt;em&gt;accepted&lt;/em&gt; output is used as the training target. The rejected output served its purpose — it triggered the correction.&lt;/p&gt;
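
&lt;p&gt;The mapping itself is mechanical. A sketch of what an exporter like this plausibly does (the input fields follow the collector's record format; the prompt wording matches the example above but is otherwise an assumption, not the library's guaranteed output):&lt;br&gt;
&lt;/p&gt;

```python
def to_chat_example(record):
    """Map one correction-pair record to an OpenAI chat fine-tuning example.

    Only the accepted output becomes the assistant target; the rejected
    output is deliberately dropped from the training example.
    """
    return {
        "messages": [
            {"role": "system",
             "content": "You must satisfy the following requirement:\n\n"
                        + record["intent_description"]},
            {"role": "user",
             "content": "Generate a response that satisfies the above requirement."},
            {"role": "assistant", "content": record["accepted_output"]},
        ]
    }
```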




&lt;h2&gt;
  
  
  Step 5: Upload and Fine-Tune
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the file
&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fine-tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start fine-tuning
&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-2024-07-18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fine-tuning job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
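
&lt;p&gt;The create call returns immediately, so you'll usually poll for completion. A small helper (a sketch — pass in your &lt;code&gt;OpenAI()&lt;/code&gt; client; &lt;code&gt;retrieve&lt;/code&gt; and the status values are the real OpenAI API, the interval is arbitrary):&lt;br&gt;
&lt;/p&gt;

```python
import time

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll an OpenAI fine-tuning job until it reaches a terminal state.

    Returns the final job object; job.fine_tuned_model holds the
    ft:... model ID once status is "succeeded".
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)
```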



&lt;p&gt;Wait for the job to complete (usually 10-30 minutes for small datasets). Then swap your model ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: gpt-4o-mini
# After:  ft:gpt-4o-mini-2024-07-18:your-org::job-id
&lt;/span&gt;
&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ft:gpt-4o-mini-2024-07-18:your-org::job-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- fine-tuned
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fine-tuned model runs through semantix again. It fails less often, and when it does fail, the new correction pairs are captured too. Fine-tune again, and it fails even less.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flywheel
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1: gpt-4o-mini          → 15% failure rate → 200 correction pairs
Week 2: fine-tuned-v1        →  5% failure rate →  70 correction pairs  
Week 3: fine-tuned-v2        →  2% failure rate →  25 correction pairs
Week 4: fine-tuned-v3        →  &amp;lt;1% failure rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These numbers are illustrative, but the pattern is real: each round of fine-tuning reduces the failure rate, which reduces the number of corrections, which means each subsequent training set is smaller but harder — exactly what you want.&lt;/p&gt;

&lt;p&gt;No human labeled a single example. The guardrail did the labeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Without an API Key
&lt;/h2&gt;

&lt;p&gt;Don't have an OpenAI key? Run the full loop locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/labrat-akhona/semantix-ai.git
&lt;span class="nb"&gt;cd &lt;/span&gt;semantix-ai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python examples/flywheel_demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo uses a simple keyword judge instead of NLI, but the pipeline is identical: validate, fail, correct, capture, export.&lt;/p&gt;
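
&lt;p&gt;A keyword judge can be as small as this (illustrative only — the demo's actual word lists may differ):&lt;br&gt;
&lt;/p&gt;

```python
def keyword_judge(text, required=("thank",), forbidden=("hate", "gouge")):
    """Toy judge: 1.0 if every required word appears and no forbidden
    word does, else 0.0. The word lists here are illustrative."""
    lowered = text.lower()
    if any(bad in lowered for bad in forbidden):
        return 0.0
    if all(word in lowered for word in required):
        return 1.0
    return 0.0
```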




&lt;h2&gt;
  
  
  What's Actually Happening Under the Hood
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;@validate_intent&lt;/code&gt; decorator does four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Calls your function&lt;/strong&gt; and gets the raw string output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluates&lt;/strong&gt; the string against the Intent's docstring using an NLI model (locally, ~15ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On failure&lt;/strong&gt;: builds a structured Markdown feedback report, injects it via &lt;code&gt;semantix_feedback&lt;/code&gt;, retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On success after failure&lt;/strong&gt;: calls &lt;code&gt;collector.record()&lt;/code&gt; with the rejected output, accepted output, scores, and feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The NLI model (cross-encoder/nli-MiniLM2-L6-H768) computes an entailment probability — how likely is it that the output satisfies the requirement? If the probability is below the threshold (default 0.5), validation fails.&lt;/p&gt;
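
&lt;p&gt;The decision step itself is tiny: softmax the per-class logits, read off the entailment probability, compare to the threshold. A sketch (the model call is elided; NLI cross-encoders emit three logits, but the class order is model-specific, so treat &lt;code&gt;entailment_index&lt;/code&gt; as an assumption):&lt;br&gt;
&lt;/p&gt;

```python
import math

def entailment_passes(logits, threshold=0.5, entailment_index=1):
    """Softmax three NLI class logits and threshold the entailment
    probability. The class order varies by model; index 1 is assumed."""
    exps = [math.exp(x) for x in logits]
    p_entail = exps[entailment_index] / sum(exps)
    return p_entail, p_entail >= threshold
```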

&lt;p&gt;No LLM is used for validation. No API calls. No tokens burned on checking.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;This pattern works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your LLM has a specific behavioral requirement&lt;/strong&gt; (tone, style, compliance, safety)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're already retrying failures&lt;/strong&gt; (so correction pairs exist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want domain-specific fine-tuning&lt;/strong&gt; without paying for human annotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your failure rate is high enough&lt;/strong&gt; to generate meaningful training data (&amp;gt;5%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works less well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your requirements are purely structural (use Pydantic)&lt;/li&gt;
&lt;li&gt;Your model never fails (you don't need a guardrail)&lt;/li&gt;
&lt;li&gt;Your outputs are too short or uniform to benefit from fine-tuning&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Full Script
&lt;/h2&gt;

&lt;p&gt;Here's the complete pipeline in one file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the requirement
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without
    being rude, dismissive, or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Set up collection
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Wrap your LLM call
&lt;/span&gt;&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invitation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Generate traffic
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a wedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a gala&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Export and fine-tune
&lt;/span&gt;&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Collected &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_pairs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; training pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ready for: openai api fine_tuning.jobs.create -t finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your guardrail is now your training pipeline.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;semantix-ai&lt;/strong&gt; — &lt;code&gt;pip install 'semantix-ai[all]'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a"&gt;Previous article: Your AI Guardrail Is a Dead End&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 166 tests. Zero labeling. Your failures are now your curriculum.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Your AI Guardrail Is a Dead End. Ours Is a Feedback Loop.</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:54:14 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a</guid>
      <description>&lt;h1&gt;
  
  
  Your AI Guardrail Is a Dead End. Ours Is a Feedback Loop.
&lt;/h1&gt;

&lt;p&gt;Every AI guardrail on the market does the same thing: check the output, pass or fail, move on. The failure data — the &lt;em&gt;most valuable signal your system produces&lt;/em&gt; — gets thrown away.&lt;/p&gt;

&lt;p&gt;Think about that. Every time your LLM generates something wrong, gets corrected, and produces something right, you're witnessing a training example being created and destroyed in the same breath. Thousands of correction pairs, generated organically from your actual production traffic, evaporating into logs nobody reads.&lt;/p&gt;

&lt;p&gt;Semantix v0.1.7 stops the evaporation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Insight Nobody Acted On
&lt;/h2&gt;

&lt;p&gt;Here's what happens inside a self-healing validation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your LLM generates an output&lt;/li&gt;
&lt;li&gt;A judge evaluates it against the business intent&lt;/li&gt;
&lt;li&gt;It fails — score 0.23, reason: "too aggressive"&lt;/li&gt;
&lt;li&gt;The system feeds structured feedback back to the LLM&lt;/li&gt;
&lt;li&gt;The LLM generates a corrected output&lt;/li&gt;
&lt;li&gt;It passes — score 0.94&lt;/li&gt;
&lt;/ol&gt;
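
&lt;p&gt;The loop above can be sketched in a few lines (&lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;judge&lt;/code&gt; stand in for your LLM call and the judge; the 0.5 threshold is illustrative):&lt;br&gt;
&lt;/p&gt;

```python
def self_heal(generate, judge, requirement, retries=2, threshold=0.5):
    """Generate, judge, and retry with feedback until the output passes.

    Returns (output, history); history holds each rejected attempt
    with its score -- exactly the raw material fine-tuning needs.
    """
    feedback = None
    history = []
    for _ in range(retries + 1):
        output = generate(feedback)
        score = judge(requirement, output)
        if score >= threshold:
            return output, history
        history.append({"output": output, "score": score})
        feedback = f"Attempt failed (score {score:.2f}). Requirement: {requirement}"
    raise ValueError("validation failed after all retries")
```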

&lt;p&gt;Steps 3-6 just produced a &lt;strong&gt;perfect fine-tuning example&lt;/strong&gt;: a rejected output, a reason for rejection, and an accepted correction. This is exactly the data format that RLHF, DPO, and supervised fine-tuning consume.&lt;/p&gt;
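
&lt;p&gt;For preference-style training the mapping is just as direct. A sketch of a DPO-style conversion (the input fields follow the collector record shown below; the prompt/chosen/rejected triple is the common convention used by trainers like TRL's &lt;code&gt;DPOTrainer&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

```python
def to_dpo_example(record):
    """Map a correction pair to a prompt/chosen/rejected preference triple."""
    return {
        "prompt": record["intent_description"],
        "chosen": record["accepted_output"],
        "rejected": record["rejected_output"],
    }
```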

&lt;p&gt;Every guardrail system with retry logic produces this data. None of them capture it.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Training Collector
&lt;/h2&gt;

&lt;p&gt;Semantix v0.1.7 introduces the &lt;code&gt;TrainingCollector&lt;/code&gt; — an opt-in component that captures correction pairs during self-healing retries and writes them to an append-only JSONL file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without being rude.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every time a retry succeeds after a failure, the collector appends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProfessionalDecline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must politely decline an invitation without being rude."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'd rather gouge my eyes out than attend your event."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Too aggressive, contains violent imagery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accepted_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I'm unable to attend."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accepted_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feedback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Semantix Self-Healing Feedback&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Attempt 1 failed..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-10T12:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No infrastructure. No database. No configuration. One file, growing one line at a time, containing the exact data you need to make your model smarter.&lt;/p&gt;
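&lt;p&gt;The append-only pattern is trivial to reproduce yourself. A minimal sketch — field names are assumptions mirroring the record shown above, not the library's internals:&lt;/p&gt;

```python
import json
from pathlib import Path

def append_record(path: str, record: dict) -> None:
    """Append one JSON object per line: no database, no schema migration, no locking."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_record("training_data.jsonl", {
    "intent": "ProfessionalDecline",
    "rejected_output": "I'd rather gouge my eyes out than attend your event.",
    "accepted_output": "Thank you for the invitation, but I'm unable to attend.",
    "attempts": 2,
})

# Each line parses independently, so a crashed run never corrupts earlier records.
records = [json.loads(line)
           for line in Path("training_data.jsonl").read_text().splitlines()]
```

&lt;p&gt;Because every line is a complete JSON document, you can tail the file, grep it, or stream it into an exporter without ever loading the whole thing.&lt;/p&gt;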




&lt;h2&gt;
  
  
  From Guardrail to Flywheel
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting.&lt;/p&gt;

&lt;p&gt;The collector exports directly to OpenAI fine-tuning format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each correction pair becomes a chat completion training example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You must satisfy the following requirement:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;The text must politely decline an invitation without being rude."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate a response that satisfies the above requirement."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I'm unable to attend."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
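&lt;p&gt;The transformation is mechanical. Here's a sketch of what an exporter like this plausibly does per record — the field names are assumed from the collector record shown earlier, not taken from the library's source:&lt;/p&gt;

```python
import json

def to_chat_example(record: dict) -> dict:
    """Turn one correction pair into an OpenAI chat fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You must satisfy the following requirement:\n\n"
                        + record["intent_description"]},
            {"role": "user",
             "content": "Generate a response that satisfies the above requirement."},
            # Train only on the accepted output -- the rejection is context, not target.
            {"role": "assistant",
             "content": record["accepted_output"]},
        ]
    }

record = {
    "intent_description": "The text must politely decline an invitation without being rude.",
    "accepted_output": "Thank you for the invitation, but I'm unable to attend.",
}
example = to_chat_example(record)
print(json.dumps(example)[:80])
```

&lt;p&gt;One JSON object per line of the output file, and you have a valid chat-format fine-tuning dataset.&lt;/p&gt;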



&lt;p&gt;Upload the file and start a job with &lt;code&gt;openai api fine_tuning.jobs.create&lt;/code&gt;. Wait. Deploy the fine-tuned model. Watch your failure rate drop.&lt;/p&gt;

&lt;p&gt;Then the fine-tuned model runs through semantix again. It fails less. But when it does fail, those new correction pairs are captured too. The model gets fine-tuned again. Fails even less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the flywheel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validate → Fail → Correct → Capture → Fine-tune → Validate (fewer failures)
    ↑                                                          |
    └──────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every other guardrail is a wall. Semantix is a ramp.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v0.1.7: Framework Integrations
&lt;/h2&gt;

&lt;p&gt;We shipped native adapters for three of the most widely used structured output frameworks. Semantix now drops into your existing stack with one line:&lt;/p&gt;

&lt;h3&gt;
  
  
  Instructor
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.instructor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticStr&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SemanticStr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must be polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pydantic AI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;semantix_validator&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;semantix_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Polite&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  LangChain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticValidator&lt;/span&gt;

&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;SemanticValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Polite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each adapter translates a semantix verdict into the framework's native retry mechanism. Instructor gets &lt;code&gt;ValueError&lt;/code&gt;, Pydantic AI gets &lt;code&gt;ModelRetry&lt;/code&gt;, LangChain gets &lt;code&gt;OutputParserException&lt;/code&gt;. Your framework handles retries. Semantix handles meaning.&lt;/p&gt;
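&lt;p&gt;The adapter pattern itself is tiny. A hypothetical Instructor-style adapter — the &lt;code&gt;Verdict&lt;/code&gt; shape and the stand-in judge below are assumptions for illustration, not the shipped code:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    score: float
    reason: str = ""

def instructor_check(value: str, judge) -> str:
    """Pydantic-style validator: raise ValueError so the framework retries with the reason."""
    verdict = judge(value)
    if not verdict.passed:
        # Instructor feeds the ValueError message back to the model on the retry turn
        raise ValueError(f"Semantic check failed (score={verdict.score:.2f}): {verdict.reason}")
    return value

# Stand-in judge: real code would call the NLI judge's evaluate() instead
judge = lambda text: Verdict(
    passed="stupid" not in text.lower(),
    score=0.2 if "stupid" in text.lower() else 0.9,
    reason="hostile wording" if "stupid" in text.lower() else "",
)

ok = instructor_check("Thank you, but I cannot attend.", judge)
print(ok)
```

&lt;p&gt;Swap &lt;code&gt;ValueError&lt;/code&gt; for &lt;code&gt;ModelRetry&lt;/code&gt; or &lt;code&gt;OutputParserException&lt;/code&gt; and you have the other two adapters — which is why each one fits in roughly 70 lines.&lt;/p&gt;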




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total test coverage&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;166 tests&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New integration adapters&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3&lt;/strong&gt; (Instructor, Pydantic AI, LangChain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training data formats&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (OpenAI JSONL, Generic JSONL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New dependencies&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (training collector is pure Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code per adapter&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~70&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;There are two kinds of AI infrastructure. The kind that checks your work and the kind that makes you better at it.&lt;/p&gt;

&lt;p&gt;Every guardrail, every validator, every content filter in production today is the first kind. They're necessary. They're valuable. And they're a dead end — a static gate that never learns from what it catches.&lt;/p&gt;

&lt;p&gt;The training collector turns semantix into the second kind. Your guardrail becomes your training pipeline. Your failures become your curriculum. Your production traffic becomes your fine-tuning dataset.&lt;/p&gt;

&lt;p&gt;The model that runs through semantix for a month isn't the same model that started. It's better. Measurably, provably better. And it got there without a single human labeling a single example.&lt;/p&gt;

&lt;p&gt;That's not a guardrail. That's a flywheel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[all]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="c1"&gt;# Start collecting training data in two lines
&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_llm_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MyIntent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/0.1.7/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai/0.1.7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the repo. Install the package. Start the flywheel.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 166 tests. Zero new dependencies. Your failures are now your curriculum.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Escaping Pilot Purgatory: How Semantix-ai v0.1.5 Built the Immutable Trust Layer for AI Agents</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 06 Apr 2026 15:24:36 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/escaping-pilot-purgatory-how-semantix-ai-v015-built-the-immutable-trust-layer-for-ai-agents-a81</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/escaping-pilot-purgatory-how-semantix-ai-v015-built-the-immutable-trust-layer-for-ai-agents-a81</guid>
      <description>&lt;p&gt;Here's a statistic that should terrify every AI team lead: &lt;strong&gt;90% of enterprise AI agents never leave the pilot phase.&lt;/strong&gt; They demo beautifully. They impress stakeholders. And then they rot in staging forever, blocked not by technical limitations but by a single, devastating question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Can you prove it won't do something catastrophic in production?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer, for almost every AI system shipping today, is no.&lt;/p&gt;

&lt;p&gt;This is the story of how we built the infrastructure to change that answer to yes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Semantic Gap
&lt;/h2&gt;

&lt;p&gt;There's a term we've been using internally that I think deserves wider adoption: &lt;strong&gt;The Semantic Gap&lt;/strong&gt;. It's the space between what an AI agent &lt;em&gt;produces&lt;/em&gt; and what a business &lt;em&gt;intended&lt;/em&gt;. Every guardrail you've seen — JSON schema validation, regex filters, content moderation APIs — operates below this gap. They check &lt;em&gt;shape&lt;/em&gt;. They check &lt;em&gt;toxicity&lt;/em&gt;. They never check &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Ali Muwwakkil, who has spent years working at the intersection of AI and enterprise deployment, put it precisely: &lt;strong&gt;alignment with business processes is the true bottleneck.&lt;/strong&gt; Not model capability. Not inference speed. Not even hallucination rates. The bottleneck is that no one can prove an AI agent's output aligns with the business intent that triggered it.&lt;/p&gt;

&lt;p&gt;This is why agents die in pilot purgatory. Legal can't sign off. Compliance can't audit. Operations can't trust. And without trust, there is no production deployment.&lt;/p&gt;

&lt;p&gt;Semantix v0.1.5 was built to close The Semantic Gap — not with bigger models or better prompts, but with &lt;strong&gt;deterministic infrastructure&lt;/strong&gt; that makes AI outputs auditable, attributable, and governed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Pillars of the Trust Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pillar 1: The Silent Guard (Quantized NLI)
&lt;/h3&gt;

&lt;p&gt;The first problem with existing semantic validation is speed. If your guardrail adds 500ms to every API call, it's dead on arrival. Production systems need sub-50ms overhead or they'll route around you.&lt;/p&gt;

&lt;p&gt;We solved this with &lt;strong&gt;INT8 ONNX quantization&lt;/strong&gt;. The &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; runs NLI (Natural Language Inference) cross-encoder inference in pure ONNX Runtime — no PyTorch, no TensorFlow, no CUDA drivers. The entire dependency footprint is &lt;strong&gt;~25MB&lt;/strong&gt; compared to &lt;strong&gt;~500MB+&lt;/strong&gt; for a PyTorch-based equivalent.&lt;/p&gt;

&lt;p&gt;The numbers from our verified turbo demo:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.9ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~25MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model format&lt;/td&gt;
&lt;td&gt;INT8 quantized ONNX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware required&lt;/td&gt;
&lt;td&gt;Any CPU (auto-detects AVX-512/AVX2/ARM64)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
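&lt;p&gt;The latency figure is easy to sanity-check on your own hardware. A minimal harness — the workload below is a stand-in; swap in the real &lt;code&gt;judge.evaluate(...)&lt;/code&gt; call to reproduce the table:&lt;/p&gt;

```python
import time

def bench(fn, warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):   # let caches and lazy-init paths settle first
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Stand-in workload; replace with: lambda: judge.evaluate(output=..., intent_description=...)
latency_ms = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median: {latency_ms:.2f}ms")
```

&lt;p&gt;Median rather than mean, because a single cold-cache outlier shouldn't define your guardrail budget.&lt;/p&gt;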



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.quantized_nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizedNLIJudge&lt;/span&gt;

&lt;span class="n"&gt;judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuantizedNLIJudge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Auto-selects best ONNX variant for your CPU
&lt;/span&gt;
&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thank you for the invitation. Unfortunately, I cannot attend.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 0.3118
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; does something subtle that took us several production-debugging sessions to get right: it &lt;strong&gt;dynamically introspects the ONNX graph's expected inputs&lt;/strong&gt; via &lt;code&gt;session.get_inputs()&lt;/code&gt;. Some ONNX exports expect &lt;code&gt;token_type_ids&lt;/code&gt;, others don't. Rather than hardcoding assumptions, the judge adapts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_inputs&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="n"&gt;feeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Only include token_type_ids if the model expects it
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;feeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type_ids&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also discovered — the hard way — that the ONNX export label order (&lt;code&gt;{0: contradiction, 1: neutral, 2: entailment}&lt;/code&gt;) differs from the PyTorch model's order (&lt;code&gt;{0: contradiction, 1: entailment, 2: neutral}&lt;/code&gt;). Entailment and neutral are swapped. Getting this wrong means your "safety pass" is actually reading the neutral probability. We've fixed it, tested it, and documented it so no one else burns a debugging session on this.&lt;/p&gt;
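&lt;p&gt;A defensive way to avoid that class of bug, sketched below: make the index-to-label map explicit and read probabilities by label &lt;em&gt;name&lt;/em&gt;, never by hardcoded position. The label order here matches the ONNX export described above; the helper is illustrative, not the library's code:&lt;/p&gt;

```python
import math

# Label order of the ONNX export. Note: PyTorch checkpoints may instead use
# {0: contradiction, 1: entailment, 2: neutral} -- entailment and neutral swapped.
ONNX_LABELS = {0: "contradiction", 1: "neutral", 2: "entailment"}

def softmax(logits):
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entailment_score(logits, labels=ONNX_LABELS) -> float:
    """Look up the entailment probability by name, so a reordered export fails loudly."""
    probs = softmax(logits)
    by_name = {labels[i]: p for i, p in enumerate(probs)}
    return by_name["entailment"]

score = entailment_score([0.1, 0.2, 2.5])
print(round(score, 4))
```

&lt;p&gt;With the wrong positional assumption, the same call would silently return the neutral probability — exactly the failure mode described above.&lt;/p&gt;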

&lt;p&gt;The Silent Guard's job is simple: pass clean text instantly, flag violations in under 25ms. Zero friction on the happy path.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 2: The Detective (Forensic Saliency)
&lt;/h3&gt;

&lt;p&gt;Knowing that text failed an intent check is useful. Knowing &lt;em&gt;which specific words caused the failure&lt;/em&gt; is transformative.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ForensicJudge&lt;/code&gt; implements what we internally call &lt;strong&gt;"Option A" Forensics&lt;/strong&gt; — mask-perturbation saliency that &lt;strong&gt;only triggers on failure&lt;/strong&gt;. When text passes, the &lt;code&gt;ForensicJudge&lt;/code&gt; returns the base verdict untouched with zero overhead. When text fails, it activates the investigation.&lt;/p&gt;

&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tokenize the output text (whitespace split — we're identifying suspect &lt;em&gt;words&lt;/em&gt;, not subwords)&lt;/li&gt;
&lt;li&gt;For each token, replace it with &lt;code&gt;[MASK]&lt;/code&gt; and re-run the base judge&lt;/li&gt;
&lt;li&gt;Measure the &lt;strong&gt;contradiction score drop&lt;/strong&gt; — how much less contradictory the text becomes without that token&lt;/li&gt;
&lt;li&gt;Rank by drop magnitude. The top-K tokens are the "breach tokens"
&lt;/li&gt;
&lt;/ol&gt;
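&lt;p&gt;The four steps above fit in a few lines. A toy sketch with a stand-in contradiction scorer — the real base judge is the NLI cross-encoder, and the word list here exists only to make the example self-contained:&lt;/p&gt;

```python
def contradiction(text: str) -> float:
    """Stand-in scorer: the real implementation runs the NLI model."""
    bad = {"gouge", "stupid"}
    return min(1.0, sum(w.strip(".,!?") in bad for w in text.lower().split()) * 0.4)

def breach_tokens(text: str, top_k: int = 3):
    """Mask each whitespace token, re-score, rank by contradiction drop."""
    base = contradiction(text)
    tokens = text.split()
    drops = []
    for i, tok in enumerate(tokens):
        masked = " ".join(tokens[:i] + ["[MASK]"] + tokens[i + 1:])
        drops.append((base - contradiction(masked), tok))  # how much calmer without tok
    drops.sort(reverse=True)
    return [tok for drop, tok in drops[:top_k] if drop > 0]

suspects = breach_tokens(
    "I would rather gouge my eyes out than attend your stupid event."
)
print(suspects)  # ['stupid', 'gouge']
```

&lt;p&gt;The cost is one extra judge call per token, which is why this only runs on failure — the happy path never pays for it.&lt;/p&gt;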

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.forensic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForensicJudge&lt;/span&gt;

&lt;span class="n"&gt;detective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForensicJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Are you serious? I would rather gouge my eyes out than attend your stupid event.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Breach Report&lt;/span&gt;

&lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.2482
&lt;span class="gs"&gt;**Base judge reason:**&lt;/span&gt; No reason provided by base judge

&lt;span class="gu"&gt;### Token Attribution&lt;/span&gt;
&lt;span class="gs"&gt;**gouge**&lt;/span&gt; (0.16), &lt;span class="gs"&gt;**stupid**&lt;/span&gt; (0.13), &lt;span class="gs"&gt;**your**&lt;/span&gt; (0.10)

&lt;span class="gu"&gt;### Summary&lt;/span&gt;
Intent failed. High contradiction detected. Suspect Tokens: [gouge, stupid, your]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Detective caught it: &lt;code&gt;gouge&lt;/code&gt;, &lt;code&gt;stupid&lt;/code&gt;, and &lt;code&gt;your&lt;/code&gt; are the three words most responsible for the intent violation. Remove any of them and the contradiction score drops measurably.&lt;/p&gt;

&lt;p&gt;This matters for two reasons. First, &lt;strong&gt;debugging&lt;/strong&gt;: when an AI agent fails in production, the team doesn't have to read the full output and guess what went wrong. The Breach Report points directly at the offending tokens. Second, &lt;strong&gt;self-healing&lt;/strong&gt;: the structured report can be fed back to the agent as corrective context. The agent knows &lt;em&gt;what&lt;/em&gt; to fix, not just &lt;em&gt;that&lt;/em&gt; it failed.&lt;/p&gt;

&lt;p&gt;Imagine this in a legal review pipeline. The agent drafts a partnership agreement. The ForensicJudge flags it as non-compliant with the intent "must be free of hidden liability clauses." The Breach Report identifies &lt;code&gt;indemnify&lt;/code&gt;, &lt;code&gt;forfeit&lt;/code&gt;, and &lt;code&gt;waive&lt;/code&gt; as the breach tokens. The agent rewrites, removing those clauses. The second draft passes. No human had to read either draft.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 3: The Black Box (AuditEngine)
&lt;/h3&gt;

&lt;p&gt;Speed and attribution solve the engineering problem. But enterprise deployment has a governance problem too: &lt;strong&gt;you need a record.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AuditEngine&lt;/code&gt; is a thread-safe singleton that captures every validation event as a &lt;strong&gt;JSON-LD Semantic Certificate&lt;/strong&gt; — a self-describing, standards-based record of what was validated, when, and whether it passed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.audit.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditEngine&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thank you, but I cannot attend.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each certificate contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://schema.semantix.ai/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SemanticCertificate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"urn:semantix:cert:29365ece-68f9-4a13-a89b-ccbbed34bf53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-06T14:55:41.726348+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must politely decline an invitation."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"99c3814a6c40a84f7274b5c8..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"previous_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GENESIS"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note what's &lt;em&gt;not&lt;/em&gt; in the certificate: the raw output text. Instead, there's a SHA-256 hash of it. This means your audit trail is &lt;strong&gt;compliance-safe&lt;/strong&gt; — you can prove what was validated without storing potentially sensitive content in the audit log.&lt;/p&gt;
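&lt;p&gt;The mechanism is plain SHA-256 hashing. A small sketch of how an auditor would later confirm a certificate without the log ever containing the text:&lt;/p&gt;

```python
import hashlib

output = "Thank you, but I cannot attend."

# Store only the digest in the audit log, never the raw text.
output_hash = hashlib.sha256(output.encode("utf-8")).hexdigest()
assert len(output_hash) == 64        # hex-encoded SHA-256

# Later, an auditor who holds the original output re-derives the digest
# and confirms it matches the certificate.
assert hashlib.sha256(output.encode("utf-8")).hexdigest() == output_hash
```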

&lt;p&gt;The critical design choice is the &lt;code&gt;previous_hash&lt;/code&gt; field. Every certificate contains the SHA-256 hash of the &lt;em&gt;entire previous certificate&lt;/em&gt;. This creates an &lt;strong&gt;immutable hash chain&lt;/strong&gt; rooted at &lt;code&gt;GENESIS&lt;/code&gt;. Tamper with any entry and every subsequent hash breaks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# True — chain is intact
&lt;/span&gt;
&lt;span class="c1"&gt;# Tamper with an entry
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# False — tampering detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same fundamental principle behind blockchain integrity, applied to AI governance without the overhead of consensus protocols. One hash chain. One source of truth. Verifiable by anyone with the audit file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audit.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Write to disk as JSONL
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
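&lt;p&gt;The chain mechanics fit in a few lines. This is an illustrative reimplementation, not the &lt;code&gt;AuditEngine&lt;/code&gt; internals:&lt;/p&gt;

```python
import hashlib
import json

def _digest(entry):
    # Hash the canonical JSON form of the entire previous entry.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(chain, record):
    # Each record points at the digest of its predecessor, rooted at GENESIS.
    record["previous_hash"] = _digest(chain[-1]) if chain else "GENESIS"
    chain.append(record)

def verify_chain(chain):
    for i, entry in enumerate(chain):
        expected = _digest(chain[i - 1]) if i else "GENESIS"
        if entry["previous_hash"] != expected:
            return False
    return True

chain = []
append(chain, {"score": 0.3118, "passed": True})
append(chain, {"score": 0.87, "passed": True})
assert verify_chain(chain)       # intact

chain[0]["score"] = 0.99         # tamper with the first entry
assert not verify_chain(chain)   # detected: the next link no longer matches
```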






&lt;h2&gt;
  
  
  The Full Stack in Action
&lt;/h2&gt;

&lt;p&gt;Here's what production deployment looks like with all three pillars working together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.audit.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.quantized_nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizedNLIJudge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.forensic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForensicJudge&lt;/span&gt;

&lt;span class="c1"&gt;# Build the trust stack
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;base_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuantizedNLIJudge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# 23.9ms inference
&lt;/span&gt;&lt;span class="n"&gt;detective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForensicJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_judge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# Attribution on failure
&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without being
    rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Your LLM call here
&lt;/span&gt;
    &lt;span class="c1"&gt;# Record every validation in the audit trail
&lt;/span&gt;    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Score populated by judge
&lt;/span&gt;        &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@validate_intent&lt;/code&gt; decorator handles the validation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The function runs and returns a string&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ForensicJudge&lt;/code&gt; evaluates it against the intent&lt;/li&gt;
&lt;li&gt;If it passes: the Silent Guard clears it in ~24ms, zero forensic overhead&lt;/li&gt;
&lt;li&gt;If it fails: the Detective runs saliency, identifies breach tokens, generates a Breach Report&lt;/li&gt;
&lt;li&gt;The decorator retries with self-healing feedback injected into the next call&lt;/li&gt;
&lt;li&gt;The AuditEngine records every attempt as a hash-chained certificate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After all retries, you have a complete, tamper-evident record of every validation attempt — what was tried, what failed, why it failed, and what ultimately passed.&lt;/p&gt;
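&lt;p&gt;That loop can be sketched as a minimal decorator. The judge and LLM below are stubs, not the Semantix internals:&lt;/p&gt;

```python
import functools

def validate_intent(judge, retries=2):
    """Sketch of a validation-loop decorator (illustrative, not the real Semantix code)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            feedback = None
            for _ in range(retries + 1):
                output = fn(*args, feedback=feedback, **kwargs)  # run the function
                passed, report = judge(output)                   # evaluate against intent
                if passed:
                    return output                                # clears: done
                feedback = report                                # fails: retry with feedback
            raise ValueError("validation failed after all retries")
        return wrapper
    return decorator

def toy_judge(text):
    ok = "sorry" not in text.lower()
    return ok, None if ok else "avoid apologetic hedging; decline plainly"

@validate_intent(toy_judge, retries=2)
def decline(feedback=None):
    # Stand-in for the LLM call: the injected feedback steers the second attempt.
    return "Thank you, but I cannot attend." if feedback else "Sorry, so sorry, maybe not."

assert decline() == "Thank you, but I cannot attend."
```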




&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;We are living through a specific moment in the AI industry. The capability curve is flattening — GPT-4, Claude, Gemini, and Llama are all "good enough" for most business tasks. The differentiation is shifting from &lt;em&gt;what AI can do&lt;/em&gt; to &lt;em&gt;whether you can trust what AI did&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In 2026, &lt;strong&gt;liability is the biggest cost of AI&lt;/strong&gt;. Not compute. Not API bills. Liability. When an AI agent sends a contract with a hidden indemnification clause, when it generates a medical summary that omits a critical drug interaction, when it writes a customer email that accidentally constitutes a binding offer — the cost isn't a bad Yelp review. It's a lawsuit.&lt;/p&gt;

&lt;p&gt;Every company deploying AI agents needs three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — Validation that doesn't bottleneck the pipeline (The Silent Guard: 23.9ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribution&lt;/strong&gt; — When something goes wrong, know exactly what and why (The Detective: breach tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance&lt;/strong&gt; — An immutable record that proves governance was applied (The Black Box: hash-chained certificates)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Semantix v0.1.5 delivers all three in a single &lt;code&gt;pip install&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The End of Vibe-Coding
&lt;/h2&gt;

&lt;p&gt;There's a practice in the AI industry that we need to name and retire: &lt;strong&gt;vibe-coding&lt;/strong&gt;. It's the practice of deploying AI agents with no semantic validation — shipping outputs because they "look right" to a human reviewer, with no deterministic verification that the output matches the intent.&lt;/p&gt;

&lt;p&gt;Vibe-coding works in demos. It works in hackathons. It does not work when your agent is generating legal documents, medical summaries, financial reports, or customer communications at scale.&lt;/p&gt;

&lt;p&gt;Semantix exists to replace vibes with verification. To replace "it looks right" with "it mathematically entails the business intent." To replace trust-by-default with trust-by-proof.&lt;/p&gt;

&lt;p&gt;We aren't building a library. We're setting a standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Recommended: INT8 ONNX (fast, lightweight)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[turbo]'&lt;/span&gt;

&lt;span class="c"&gt;# Full stack with all judge backends&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[all]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;v0.1.5 Release:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.1.5" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai/releases/tag/v0.1.5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the repo. Try the turbo install. Run &lt;code&gt;tools/trust_demo.py&lt;/code&gt; and watch the Breach Report identify exactly which words betrayed the intent.&lt;/p&gt;

&lt;p&gt;And if you're tired of AI agents dying in pilot purgatory — join us. The trust layer is here.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 126 tests. Sub-25ms inference. Zero vibes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Any AI Agent Can Now Vibe Check LLM Outputs — No Code Required</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:54:25 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/any-ai-agent-can-now-vibe-check-llm-outputs-no-code-required-19ei</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/any-ai-agent-can-now-vibe-check-llm-outputs-no-code-required-19ei</guid>
      <description>&lt;h1&gt;
  
  
  Any AI Agent Can Now "Vibe Check" LLM Outputs — No Code Required
&lt;/h1&gt;

&lt;p&gt;Your AI agent just generated a customer email. It's grammatically perfect. The JSON is valid. But it accidentally threatened to cancel the customer's account instead of apologizing.&lt;/p&gt;

&lt;p&gt;No guardrail caught it because no guardrail was checking &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix v0.1.4&lt;/a&gt;, &lt;strong&gt;any MCP-capable agent&lt;/strong&gt; — Claude Desktop, Claude Code, Cursor, or your own — can validate text against semantic intents as a tool call. Zero code changes. Zero API keys. Runs locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Agents Don't Verify Their Own Output
&lt;/h2&gt;

&lt;p&gt;LLM agents are getting more autonomous. They write emails, generate reports, draft code reviews, and respond to customers. But they operate on a trust-based system: generate output, ship it, hope for the best.&lt;/p&gt;

&lt;p&gt;What if the agent could &lt;strong&gt;verify its own output&lt;/strong&gt; before sending it? Not structurally — semantically. "Does this text actually do what I intended?"&lt;/p&gt;

&lt;p&gt;That's what the Semantix MCP server enables.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.1.4: The Universal Standard Release
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP Server: &lt;code&gt;verify_text_intent&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Semantix now ships a built-in &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; server that exposes a single, powerful tool: &lt;code&gt;verify_text_intent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Any MCP-capable agent can call it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We sincerely apologize for the inconvenience and have credited your account."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must be a sincere customer apology that offers a concrete resolution."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it fails, the agent gets a structured correction suggestion — enabling &lt;strong&gt;cross-agent self-healing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"correction_suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Semantix Verification Failed&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### What went wrong&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;- **Score:** 0.1800 (threshold 0.5 not met)&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### What is required&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;The text must be a sincere customer apology...&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### Rejected output&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;```

&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Your account has been flagged for termination.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;

```&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Please generate a new response that satisfies the requirement above."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent reads the correction, regenerates, and tries again. Self-healing across any agent framework — no SDK integration needed.&lt;/p&gt;
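&lt;p&gt;From the agent's side, that loop is only a few lines. Both functions below are stand-ins (the real tool call goes over MCP, and the real scoring uses the NLI judge):&lt;/p&gt;

```python
import json

def verify_text_intent(text):
    """Stand-in for the MCP tool call."""
    ok = "apolog" in text.lower()
    return json.dumps({
        "passed": ok,
        "score": 0.91 if ok else 0.18,
        "correction_suggestion": None if ok else "The text must be a sincere apology.",
    })

def generate(hint=None):
    """Stand-in for the agent's LLM call; the hint steers regeneration."""
    if hint:
        return "We sincerely apologize and have credited your account."
    return "Your account has been flagged for termination."

draft = generate()
result = json.loads(verify_text_intent(draft))
while not result["passed"]:
    # Read the correction, regenerate, try again.
    draft = generate(hint=result["correction_suggestion"])
    result = json.loads(verify_text_intent(draft))

assert result["passed"]
```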

&lt;h3&gt;
  
  
  Setup: 3 Lines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[mcp,nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"semantix-verify"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"semantix/mcp/server.py"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/semantix-ai"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Claude Desktop (or any MCP client) can now call &lt;code&gt;verify_text_intent&lt;/code&gt; before responding.&lt;/p&gt;




&lt;h2&gt;
  
  
  NLI Accuracy Fixes
&lt;/h2&gt;

&lt;p&gt;v0.1.4 also ships critical fixes to the NLI judge that dramatically improve scoring accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entailment index fix&lt;/strong&gt; — The model's label order is &lt;code&gt;{0: contradiction, 1: entailment, 2: neutral}&lt;/code&gt;. We were accidentally reading the &lt;em&gt;neutral&lt;/em&gt; logit instead of &lt;em&gt;entailment&lt;/em&gt;. Fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Softmax calibration&lt;/strong&gt; — Raw logits are now converted to true 0-1 probability scores via &lt;code&gt;apply_softmax=True&lt;/code&gt;. Before this, scores were unbounded and hard to threshold meaningfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Progressive tense hypothesis&lt;/strong&gt; — NLI cross-encoders score dramatically better when the hypothesis is framed as ongoing action. "The text must politely decline an invitation" becomes "Someone is politely declining an invitation." This single change pushed scores from ~0.3 to 0.88+ for well-written declines.&lt;/p&gt;
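&lt;p&gt;The first two fixes amount to normalizing the logits and reading the correct index. A pure-Python illustration (the logit values are made up):&lt;/p&gt;

```python
import math

# Mock logits from an NLI cross-encoder, in the label order described above:
# {0: contradiction, 1: entailment, 2: neutral}.
logits = [-2.1, 3.4, 0.2]

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

ENTAILMENT = 1                       # the bug was reading index 2 (neutral)
probs = softmax(logits)
score = probs[ENTAILMENT]

assert abs(sum(probs) - 1.0) < 1e-9  # a true 0-1 probability, easy to threshold
assert score > probs[2]              # entailment, not neutral, drives the score
```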




&lt;h2&gt;
  
  
  Why MCP?
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is becoming the universal standard for agent-tool communication. By shipping Semantix as an MCP tool rather than a library-only solution, we get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Universal compatibility&lt;/strong&gt; — Works with Claude Desktop, Claude Code, Cursor, and any future MCP client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero integration code&lt;/strong&gt; — Agents call it as a tool, not as a library import&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language agnostic&lt;/strong&gt; — Your agent doesn't need to be written in Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing bridge&lt;/strong&gt; — The &lt;code&gt;correction_suggestion&lt;/code&gt; field gives any agent enough context to retry intelligently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what "validate meaning, not shape" looks like at the agent layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Agent (any MCP client)
     |
     v
  MCP tool call: verify_text_intent
     |
     v
Semantix MCP Server (FastMCP)
     |
     v
NLIJudge (lazy-loaded singleton)
     |
     v
Cross-encoder: "Does this text entail the intent?"
     |
     +-- score &amp;gt;= threshold --&amp;gt; {"passed": true, "score": 0.91}
     |
     +-- score &amp;lt; threshold  --&amp;gt; {"passed": false, "correction_suggestion": "..."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The NLI model loads lazily on the first tool call — server startup is instant. The judge runs locally on CPU with no API keys.&lt;/p&gt;
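&lt;p&gt;A lazy-loading singleton of that shape might look like this (the names are illustrative, not the server's actual code):&lt;/p&gt;

```python
import threading

class HeavyJudge:
    """Stand-in for the NLI judge; imagine __init__ loading a large model."""
    def __init__(self):
        self.loaded = True

_judge = None
_lock = threading.Lock()

def get_judge():
    """Create the judge on first call only, so server startup stays instant."""
    global _judge
    if _judge is None:              # fast path once the judge exists
        with _lock:                 # serialize the first, slow initialization
            if _judge is None:      # double-checked locking
                _judge = HeavyJudge()
    return _judge

assert get_judge() is get_judge()   # every tool call shares one instance
```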




&lt;h2&gt;
  
  
  20 Automated Tests, Zero Model Loading
&lt;/h2&gt;

&lt;p&gt;The MCP test suite covers tool registration, response schema, correction suggestions, and dependency error handling — all without loading the actual NLI model. We mock the judge so tests run in milliseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantix.mcp.server._get_judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_failing_response_includes_correction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mock_get&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mock_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;verify_text_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correction_suggestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server also handles missing dependencies gracefully — if &lt;code&gt;sentence-transformers&lt;/code&gt; isn't installed, it returns an error JSON instead of crashing.&lt;/p&gt;
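&lt;p&gt;That failure mode can be sketched like so; the function name and error schema here are illustrative, not the server's actual response format:&lt;/p&gt;

```python
import importlib
import json

def verify_text_intent_safe(text, intent, dep="sentence_transformers"):
    """Sketch of dependency-safe tool behavior."""
    try:
        importlib.import_module(dep)      # the optional heavy dependency
    except ImportError as exc:
        # Return structured JSON the agent can surface, instead of crashing.
        return json.dumps({
            "error": "missing_dependency",
            "detail": str(exc),
            "hint": "pip install 'semantix-ai[mcp,nli]'",
        })
    # Normal scoring path would run here; return a placeholder result.
    return json.dumps({"score": 0.0, "passed": False})

result = json.loads(verify_text_intent_safe("hi", "intent", dep="not_a_real_module_xyz"))
assert result["error"] == "missing_dependency"
```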




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[mcp,nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test it locally&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from semantix.mcp.server import verify_text_intent
print(verify_text_intent(
    'I appreciate the invitation but unfortunately I will not be able to attend.',
    'The text must politely decline an invitation'
))
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run as MCP server&lt;/span&gt;
mcp run semantix/mcp/server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Semantix is a semantic type system for AI outputs. v0.1.3 added self-healing retries. v0.1.4 makes it universal via MCP. The roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More judge backends&lt;/strong&gt; — Anthropic, Cohere, local LLMs via Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic integration&lt;/strong&gt; — Semantic fields inside Pydantic models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming validation&lt;/strong&gt; — Real-time intent checking during generation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install "semantix-ai[mcp,nli]"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful. Open an issue if it isn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Sun, 05 Apr 2026 10:40:11 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-how-i-fixed-ai-reliability-38ac</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-how-i-fixed-ai-reliability-38ac</guid>
      <description>&lt;h1&gt;
  
  
  Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability
&lt;/h1&gt;

&lt;p&gt;You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.&lt;/p&gt;

&lt;p&gt;Then your model returns a "polite decline" that says &lt;em&gt;"I'd rather gouge my eyes out."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It passes your type checks. It fails the vibe check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the Semantic Gap&lt;/strong&gt; — the space between &lt;em&gt;structural correctness&lt;/em&gt; and &lt;em&gt;actual meaning&lt;/em&gt;. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Semantic Gap: Shape vs. Meaning
&lt;/h2&gt;

&lt;p&gt;Here's what most validation looks like today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you the &lt;em&gt;shape&lt;/em&gt; is right. It tells you nothing about whether the &lt;em&gt;meaning&lt;/em&gt; is right. Your model can return &lt;code&gt;{"message": "Go away.", "tone": "polite"}&lt;/code&gt; and Pydantic will happily accept it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantix flips the script.&lt;/strong&gt; Instead of validating structure, you validate intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation
    without being rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The docstring &lt;em&gt;is&lt;/em&gt; the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.1.3: The Self-Healing Update
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Informed Self-Healing
&lt;/h3&gt;

&lt;p&gt;The biggest feature in v0.1.3 is &lt;strong&gt;informed retries&lt;/strong&gt;. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM &lt;em&gt;exactly what went wrong&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Declare a &lt;code&gt;semantix_feedback&lt;/code&gt; parameter in your function, and the decorator injects a structured Markdown report on each retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NLIJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;NLIJudge&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invite: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first call, &lt;code&gt;semantix_feedback&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. If validation fails, the next call receives something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Semantix Self-Healing Feedback&lt;/span&gt;

Attempt &lt;span class="gs"&gt;**1**&lt;/span&gt; failed validation.

&lt;span class="gu"&gt;### What went wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Intent:**&lt;/span&gt; &lt;span class="sb"&gt;`ProfessionalDecline`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.3210 (threshold not met)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Judge reason:**&lt;/span&gt; too vague

&lt;span class="gu"&gt;### What is required&lt;/span&gt;
The text must politely decline an invitation without being rude or aggressive.

&lt;span class="gu"&gt;### Your previous output (rejected)&lt;/span&gt;
Go away.

Please generate a new response that satisfies the requirement above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM gets the score, the reason, the requirement, and its own rejected output. It can &lt;em&gt;learn from the failure&lt;/em&gt; in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  NLI as the Default Judge
&lt;/h3&gt;

&lt;p&gt;We moved from &lt;code&gt;LLMJudge&lt;/code&gt; to &lt;code&gt;NLIJudge&lt;/code&gt; as the default. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No API key required&lt;/strong&gt; — runs fully locally using a cross-encoder model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entailment &amp;gt; cosine similarity&lt;/strong&gt; — NLI asks "does A entail B?", which is fundamentally the right question for intent validation. Cosine similarity asks "are A and B &lt;em&gt;about&lt;/em&gt; the same thing?", which is a weaker signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast enough&lt;/strong&gt; — the default &lt;code&gt;nli-MiniLM2-L6-H768&lt;/code&gt; model is ~85MB and runs in milliseconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can still use any judge you want — &lt;code&gt;LLMJudge&lt;/code&gt;, &lt;code&gt;EmbeddingJudge&lt;/code&gt;, or your own custom &lt;code&gt;Judge&lt;/code&gt; subclass.&lt;/p&gt;
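&lt;p&gt;To give a feel for what a custom judge can look like, here is a toy keyword-based one. The &lt;code&gt;evaluate&lt;/code&gt; signature and &lt;code&gt;JudgeResult&lt;/code&gt; shape are assumptions for illustration; check the repo for the real &lt;code&gt;Judge&lt;/code&gt; interface:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    passed: bool
    score: float
    reason: str

class KeywordJudge:
    """Toy judge: scores by how many required keywords appear in the output.
    Hypothetical interface -- not semantix's actual Judge base class."""

    def __init__(self, keywords: list[str], threshold: float = 0.5):
        self.keywords = [k.lower() for k in keywords]
        self.threshold = threshold

    def evaluate(self, output: str, intent_description: str) -> JudgeResult:
        hits = sum(1 for k in self.keywords if k in output.lower())
        score = hits / len(self.keywords) if self.keywords else 0.0
        return JudgeResult(
            passed=score >= self.threshold,
            score=score,
            reason=f"{hits}/{len(self.keywords)} required keywords found",
        )
```

&lt;p&gt;Anything that returns a pass/fail, a score, and a reason can slot into the same retry loop.&lt;/p&gt;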

&lt;h3&gt;
  
  
  Granular Scoring
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LLMJudge&lt;/code&gt; no longer returns a binary Yes/No. It now returns a &lt;strong&gt;0.0-1.0 confidence score&lt;/strong&gt; and a &lt;strong&gt;text reason&lt;/strong&gt;, giving the self-healing system richer feedback to work with.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Proof: Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Talk is cheap. Here are the real numbers from &lt;code&gt;tools/benchmark.py&lt;/code&gt;, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;No Healing&lt;/th&gt;
&lt;th&gt;Self-Healing&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Professional Tone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.3%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;+43.3pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical Explanation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;36.7%&lt;/td&gt;
&lt;td&gt;96.7%&lt;/td&gt;
&lt;td&gt;+60.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actionable Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.3%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;+43.3pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+48.9pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Self-healing more than &lt;strong&gt;triples&lt;/strong&gt; the overall success rate, from 21.1% to 70.0%. For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.&lt;/p&gt;

&lt;p&gt;These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the &lt;em&gt;relative improvement&lt;/em&gt; from self-healing holds.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Function
     |
     v
@validate_intent
     |
     v
Call function -&amp;gt; Get raw string
     |
     v
Judge.evaluate(output, intent_description, threshold)
     |
     +-- PASS --&amp;gt; return Intent(output)
     |
     +-- FAIL --&amp;gt; SemanticIntentError
                    |
                    v
              retries left?
                    |
                    +-- YES --&amp;gt; inject semantix_feedback -&amp;gt; retry
                    |
                    +-- NO  --&amp;gt; raise error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decorator resolves the &lt;code&gt;Intent&lt;/code&gt; subclass from your return type annotation, calls the judge, and manages the retry loop. The &lt;code&gt;semantix_feedback&lt;/code&gt; injection is zero-boilerplate — just add the parameter and it works.&lt;/p&gt;
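&lt;p&gt;The resolution step can be sketched with stdlib tooling. This is a simplification under assumptions (the &lt;code&gt;Intent&lt;/code&gt; stand-in and &lt;code&gt;resolve_intent&lt;/code&gt; helper are illustrative, not semantix internals):&lt;/p&gt;

```python
import typing

class Intent:
    """Minimal stand-in for semantix's Intent base class."""

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation."""

def resolve_intent(func) -> type:
    """Pull the Intent subclass out of a function's return annotation,
    the way a decorator like @validate_intent could."""
    ret = typing.get_type_hints(func).get("return")
    if not (isinstance(ret, type) and issubclass(ret, Intent)):
        raise TypeError("return annotation must be an Intent subclass")
    return ret

def decline_invite(event: str) -> ProfessionalDecline: ...

# The resolved class's docstring is the contract the judge validates against.
intent_cls = resolve_intent(decline_invite)
```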




&lt;h2&gt;
  
  
  Get Started in 30 Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PositiveSentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must express a clearly positive, optimistic,
    or encouraging sentiment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encourage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PositiveSentiment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write an encouraging message for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_your_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your LLM output is now semantically typed and self-healing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install semantix-ai&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>agenticai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your LLM Passes Type Checks but Fails the Vibe Check — Here's How to Fix It</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:08:34 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-heres-how-to-fix-it-1dkm</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-heres-how-to-fix-it-1dkm</guid>
      <description>&lt;p&gt;You ask your LLM to write a polite decline to a meeting invite. It returns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I appreciate the invitation, but I would rather set myself on fire than attend your team-building retreat."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You run it through your Pydantic model. It passes. It's a string. The right length. Valid UTF-8. Technically a "response."&lt;/p&gt;

&lt;p&gt;But it's not a polite decline. It's a career-ending email.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the gap nobody's filling.&lt;/strong&gt; We have type systems for data structures — &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, Pydantic models. We validate &lt;em&gt;shape&lt;/em&gt; obsessively. But we have nothing for &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Semantix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix&lt;/a&gt; is a semantic type system for LLM outputs. Instead of checking "is this a string?", it checks "does this string actually say what it's supposed to say?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation 
    without being rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the company retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ✓ Validated — the output actually IS a polite decline
# ✗ Raises SemanticIntentError if the LLM went off the rails
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines of setup. One decorator. Your LLM output is now semantically typed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The core idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You define an Intent&lt;/strong&gt; — a class whose docstring describes the semantic contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You decorate your LLM function&lt;/strong&gt; — the return type hint tells Semantix what to validate against.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Judge evaluates the output&lt;/strong&gt; — comparing what the LLM said against what it was supposed to mean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Judge is the interesting part. Semantix ships with three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EmbeddingJudge&lt;/strong&gt; — compares sentence embeddings using cosine similarity. Fast, runs locally, no API key. Good for clear-cut intents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConciseSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LLMJudge&lt;/strong&gt; — asks GPT-4o-mini "does this text satisfy this requirement? Yes or No." More accurate, needs an API key, costs fractions of a cent per call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NLIJudge&lt;/strong&gt; — uses a cross-encoder NLI model to check if the output &lt;em&gt;entails&lt;/em&gt; the intent. Best of both worlds: accurate like an LLM judge, local like an embedding judge.&lt;/p&gt;

&lt;p&gt;You pick the speed/accuracy tradeoff that fits your use case. And you can swap judges without changing any other code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feature That Made Me Build This
&lt;/h2&gt;

&lt;p&gt;Here's what pushed me over the edge. I was building an AI agent for a client that needed to generate customer-facing responses. The responses had to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Professional in tone&lt;/li&gt;
&lt;li&gt;Factually grounded in the company's data&lt;/li&gt;
&lt;li&gt;Free of any promises or commitments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pydantic could check that the response was a non-empty string under 500 characters. Great. But the LLM kept slipping in phrases like "I guarantee this will be resolved" — structurally valid, semantically dangerous.&lt;/p&gt;

&lt;p&gt;So I built Semantix. And the feature I'm most proud of is &lt;strong&gt;smart retries&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_last_failure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SafeCustomerResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failure&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_last_failure&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Your previous attempt scored &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove any promises or guarantees.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Respond to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;get_last_failure()&lt;/code&gt; gives your LLM function access to the &lt;em&gt;reason&lt;/em&gt; the previous attempt failed. So each retry isn't just "try again" — it's "try again, but here's what went wrong." The LLM gets smarter with each attempt.&lt;/p&gt;
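&lt;p&gt;One plausible way such a hook can be implemented (not necessarily how Semantix does it) is a context-local failure store that the retry loop writes to before each attempt:&lt;/p&gt;

```python
import contextvars
from dataclasses import dataclass
from typing import Optional

@dataclass
class Failure:
    score: float
    reason: str

# Context-local slot; empty until a validation attempt fails.
_last_failure = contextvars.ContextVar("last_failure", default=None)

def get_last_failure() -> Optional[Failure]:
    """Return the failure recorded for the current retry, if any."""
    return _last_failure.get()

def record_failure(score: float, reason: str) -> None:
    # The decorator would call this before re-invoking your function.
    _last_failure.set(Failure(score, reason))
```

&lt;p&gt;A &lt;code&gt;ContextVar&lt;/code&gt; keeps concurrent requests from seeing each other's failures, which a module-level global would not.&lt;/p&gt;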

&lt;h2&gt;
  
  
  Composable Intents
&lt;/h2&gt;

&lt;p&gt;Real-world requirements are rarely one-dimensional. Semantix lets you combine intents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AllOf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnyOf&lt;/span&gt;

&lt;span class="c1"&gt;# Must satisfy ALL — polite AND positive
&lt;/span&gt;&lt;span class="n"&gt;SafeResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ProfessionalTone&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;NoPromises&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;FactuallyGrounded&lt;/span&gt;

&lt;span class="c1"&gt;# Must satisfy AT LEAST ONE — either formal or casual decline
&lt;/span&gt;&lt;span class="n"&gt;FlexibleDecline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnyOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FormalDecline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CasualDecline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SafeResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; operators work on Intent classes directly. Under the hood, &lt;code&gt;AllOf&lt;/code&gt; concatenates the docstrings with "AND" and uses the minimum threshold. &lt;code&gt;AnyOf&lt;/code&gt; uses "OR" and the maximum threshold.&lt;/p&gt;
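&lt;p&gt;Stripped of the operator sugar, that composition rule reduces to a few lines. Plain dicts stand in for Intent classes here; the helper names are illustrative, not the library's internals:&lt;/p&gt;

```python
def make_intent(doc: str, threshold: float) -> dict:
    # A dict standing in for an Intent class (doc = docstring contract).
    return {"doc": doc, "threshold": threshold}

def all_of(*intents: dict) -> dict:
    """AND-combination: docstrings joined with 'AND', minimum threshold."""
    return {
        "doc": " AND ".join(i["doc"] for i in intents),
        "threshold": min(i["threshold"] for i in intents),
    }

def any_of(*intents: dict) -> dict:
    """OR-combination: docstrings joined with 'OR', maximum threshold."""
    return {
        "doc": " OR ".join(i["doc"] for i in intents),
        "threshold": max(i["threshold"] for i in intents),
    }
```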

&lt;h2&gt;
  
  
  Streaming Support
&lt;/h2&gt;

&lt;p&gt;If you're streaming LLM responses (and you probably should be), Semantix validates once the full stream is assembled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamCollector&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_judge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;llm_stream&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stream to user in real-time
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# validate the complete output
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your users see the response streaming in. Behind the scenes, Semantix is collecting chunks. The moment the stream ends, it validates. If it fails, you catch the error and handle it — retry, fall back to a template, or flag for human review.&lt;/p&gt;
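&lt;p&gt;That retry-then-fall-back flow is library-agnostic. Here is one way to sketch it, with a hypothetical &lt;code&gt;ValidationError&lt;/code&gt; standing in for whatever exception your validator raises:&lt;/p&gt;

```python
# Generic retry/fallback pattern for a failed semantic validation.
# ValidationError is a hypothetical stand-in; substitute the exception
# your validation library actually raises.
class ValidationError(Exception):
    pass


FALLBACK = "Sorry, I can't help with that request."


def respond_safely(generate, validate, retries=1):
    for _ in range(retries + 1):
        draft = generate()
        try:
            validate(draft)   # raises ValidationError on a semantic miss
            return draft
        except ValidationError:
            continue          # regenerate and try again
    return FALLBACK           # out of retries: fall back to a vetted template


# Deterministic demo: the first draft fails validation, the second passes.
attempts = iter(["I guarantee a full refund!", "I can look into that for you."])


def generate():
    return next(attempts)


def validate(text):
    if "guarantee" in text:
        raise ValidationError("sounds like a promise")


result = respond_safely(generate, validate)  # "I can look into that for you."
```

&lt;p&gt;In production you'd likely log the failed draft before retrying, so the rejection itself becomes an audit artifact.&lt;/p&gt;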

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;I built Semantix because the existing tools solve a different problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Semantix&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Guardrails AI&lt;/th&gt;
&lt;th&gt;NeMo Guardrails&lt;/th&gt;
&lt;th&gt;Instructor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Validates meaning&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Schema-focused&lt;/td&gt;
&lt;td&gt;✅ Dialogue rails&lt;/td&gt;
&lt;td&gt;❌ Schema-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero required deps&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works with any LLM&lt;/td&gt;
&lt;td&gt;✅ Any function&lt;/td&gt;
&lt;td&gt;⚠️ Wrappers&lt;/td&gt;
&lt;td&gt;⚠️ Config files&lt;/td&gt;
&lt;td&gt;⚠️ Patched clients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pluggable backends&lt;/td&gt;
&lt;td&gt;✅ 3 built-in + custom&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines to validate&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;~20+&lt;/td&gt;
&lt;td&gt;~30+&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Semantix isn't a replacement for Pydantic or Guardrails. It's the &lt;strong&gt;layer above&lt;/strong&gt; them. After you know the shape is right, verify the meaning is right too.&lt;/p&gt;
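&lt;p&gt;A minimal sketch of that layering, with a plain dataclass standing in for your Pydantic model and a stub standing in for a semantic judge (all names here are illustrative):&lt;/p&gt;

```python
# Hypothetical two-layer check: layer 1 validates shape (a dataclass
# standing in for Pydantic), layer 2 validates meaning (a stub judge).
# Names and checks are illustrative only.
from dataclasses import dataclass


@dataclass
class SupportReply:
    text: str
    ticket_id: int


def parse_reply(raw: dict) -> SupportReply:
    # Layer 1: shape. Missing fields or wrong types fail here.
    return SupportReply(text=str(raw["text"]), ticket_id=int(raw["ticket_id"]))


def meaning_ok(reply: SupportReply) -> bool:
    # Layer 2: meaning. A real judge scores the text against an intent;
    # this stub just blocks obvious promise language.
    return "guarantee" not in reply.text.lower()


reply = parse_reply({"text": "We will take a look at your order.", "ticket_id": 41})
print(meaning_ok(reply))  # True
```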

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai

&lt;span class="c"&gt;# With embedding judge (fast, local)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[embeddings]"&lt;/span&gt;

&lt;span class="c"&gt;# With OpenAI judge (accurate)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[openai]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the repo: &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's MIT licensed, Python 3.10+, and the core has zero dependencies. I'd love feedback — open an issue or drop a comment below.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Akhona, an automation engineer based in South Africa. I build AI-powered tools and integrations. You can find me on &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
