<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akhona Eland</title>
    <description>The latest articles on DEV Community by Akhona Eland (@akhona_eland_072dac9e0c2c).</description>
    <link>https://dev.to/akhona_eland_072dac9e0c2c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857770%2F3a45453f-4618-4d5d-b8ec-16606097be8b.png</url>
      <title>DEV Community: Akhona Eland</title>
      <link>https://dev.to/akhona_eland_072dac9e0c2c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akhona_eland_072dac9e0c2c"/>
    <language>en</language>
    <item>
      <title>I Fine-Tuned a Compliance Judge and Beat the Stock Model by +29.6pp F1</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Wed, 22 Apr 2026 11:45:42 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/i-fine-tuned-a-compliance-judge-and-beat-the-stock-model-by-296pp-f1-4cgb</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/i-fine-tuned-a-compliance-judge-and-beat-the-stock-model-by-296pp-f1-4cgb</guid>
      <description>&lt;h1&gt;
  
  
  I Fine-Tuned a Compliance Judge and Beat the Stock Model by +29.6pp F1
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; if your LLM-powered product touches personal information in South Africa, POPIA sits over it. The regulator doesn't ask "is your model good?" — they ask "can you demonstrate the output was validated against the clause, and can you show me the validation?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable answer most teams give today:&lt;/strong&gt; "we call GPT-4 as a judge with a prompt that mentions POPIA." That's not a defence. It's non-deterministic, sends personal information cross-border, and produces no receipt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built instead:&lt;/strong&gt; a local NLI cross-encoder fine-tuned on 7 POPIA clauses, released under Apache 2.0, shipped as a quantized ONNX model, scored and gated on every CI run.&lt;/p&gt;

&lt;p&gt;The result, on a pinned 150-pair holdout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Stock &lt;code&gt;cross-encoder/nli-MiniLM2-L6-H768&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Fine-tuned &lt;code&gt;nli-popia-v1&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Macro F1&lt;/td&gt;
&lt;td&gt;0.517&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.813&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;0.707&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst clause&lt;/td&gt;
&lt;td&gt;0.400 (general processing / data subject rights)&lt;/td&gt;
&lt;td&gt;0.727 (cross-border transfers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best per-clause lift&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;+0.493&lt;/strong&gt; (general processing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regressions&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;+29.6 percentage points macro F1, every clause improved, nothing got worse.&lt;/strong&gt; The INT8 ONNX artifact is ~79MB per CPU variant on disk, runs at ~15ms per inference on CPU, and makes zero API calls.&lt;/p&gt;

&lt;p&gt;Here's how it went.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why NLI, not a prompt-based judge
&lt;/h2&gt;

&lt;p&gt;Natural Language Inference is an old, narrow, boring task: given a premise and a hypothesis, return the probability the premise entails the hypothesis. Cross-encoders have been doing this deterministically for a decade.&lt;/p&gt;

&lt;p&gt;If you reframe "does this text satisfy POPIA's consent clause?" as an NLI problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; the LLM's output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis:&lt;/strong&gt; "The text collects personal information only after obtaining explicit, informed, opt-in consent."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…you get a deterministic score in 0.0–1.0, in one tiny ONNX model, without shipping customer data to a third-party API.&lt;/p&gt;
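&lt;p&gt;Concretely, a 3-way NLI cross-encoder returns one logit per label for the pair; softmax and read off the entailment entry. A minimal sketch, assuming the common [contradiction, entailment, neutral] label order — always verify against your model's &lt;code&gt;id2label&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

def entailment_score(logits, entailment_index=1):
    """Map raw 3-way NLI logits to a 0-1 entailment probability via softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    return float(probs[entailment_index])

# Hypothetical logits for an (LLM output, clause hypothesis) pair
score = entailment_score([-1.2, 2.3, 0.4])  # entailment logit dominates
```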

&lt;p&gt;The catch: stock NLI models are trained on SNLI/MNLI. They're great at "a dog is playing in the park / an animal is outside" and terrible at "This message confirms your purchase; we'll process your data per our privacy policy / The text obtains explicit opt-in consent before collecting personal information."&lt;/p&gt;

&lt;p&gt;Stock macro F1 on POPIA clauses: &lt;strong&gt;0.517.&lt;/strong&gt; Two of the seven clauses — general processing and data subject rights — came in at &lt;strong&gt;0.400 F1&lt;/strong&gt;. Coin-flip territory.&lt;/p&gt;

&lt;p&gt;So I fine-tuned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The data: 180 hand-authored pairs, no scraping
&lt;/h2&gt;

&lt;p&gt;This is the part nobody wants to hear: I wrote the training data by hand.&lt;/p&gt;

&lt;p&gt;Seven clauses — consent, minimality, security safeguards, breach notification, cross-border transfers, general processing, data subject rights — × a handful of positive examples (text that satisfies the clause) + a handful of negatives (text that violates it) + paraphrases. About 180 pairs.&lt;/p&gt;

&lt;p&gt;Why hand-authored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scraped legal text is the wrong distribution.&lt;/strong&gt; My users aren't writing statutes; they're writing support replies, KYC confirmations, breach emails. I needed LLM-shaped text, not Act-shaped text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic generation would poison the eval.&lt;/strong&gt; If GPT-4 writes my training data and GPT-4 writes the outputs being validated in production, I'm measuring GPT-4's self-consistency, not POPIA compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;180 pairs is enough for 7-clause cross-encoder fine-tuning.&lt;/strong&gt; The base model already speaks English; I'm teaching it a narrow decision boundary, not a new language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 150-pair holdout was hand-authored separately, pinned by hash, and never leaks into training. If the hash of the eval file changes, the release gate fails.&lt;/p&gt;
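&lt;p&gt;The pinning itself is nothing exotic: hash the file's bytes, compare to a constant checked into the repo. A sketch — the digest shown is a hypothetical placeholder, not the repo's actual pin:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

# Hypothetical pinned digest of the holdout file, committed next to the gate
PINNED_SHA256 = "0f1e2d3c"  # truncated placeholder for illustration

def eval_file_digest(path):
    """SHA-256 over the eval file's exact bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def assert_eval_pinned(path, pinned=PINNED_SHA256):
    digest = eval_file_digest(path)
    if digest != pinned:
        raise RuntimeError(f"holdout changed: {digest} does not match pin {pinned}")
```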




&lt;h2&gt;
  
  
  The fine-tune: 5 epochs, ~6 minutes on CPU
&lt;/h2&gt;

&lt;p&gt;The whole training recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[train]"&lt;/span&gt;
python scripts/train_popia.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it's unremarkable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base: &lt;code&gt;cross-encoder/nli-MiniLM2-L6-H768&lt;/code&gt; (~22M params, tiny)&lt;/li&gt;
&lt;li&gt;5 epochs, batch 16, lr 2e-5, warmup 10%, weight decay 0.01&lt;/li&gt;
&lt;li&gt;Cross-entropy loss, early stopping on &lt;code&gt;eval_loss&lt;/code&gt; against a 10% dev split&lt;/li&gt;
&lt;li&gt;CPU training on 180 rows: &lt;strong&gt;~6 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ONNX export with four CPU-variant INT8 quantizations (AVX2 / AVX512 / AVX512-VNNI / ARM64), auto-selected at load time based on CPU detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each quantized variant is ~79MB; consumers only download the one their CPU needs. Inference is zero-PyTorch — &lt;code&gt;onnxruntime&lt;/code&gt; + &lt;code&gt;tokenizers&lt;/code&gt;, nothing else.&lt;/p&gt;
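&lt;p&gt;The variant auto-selection is just a capability ladder over CPU feature flags. A hedged sketch of the idea — the real loader's function names and detection method differ; this is illustrative only:&lt;/p&gt;

```python
import platform

def pick_int8_variant(cpu_flags, machine=None):
    """Choose the most capable INT8 ONNX variant for this CPU (illustrative)."""
    machine = (machine or platform.machine()).lower()
    if machine in ("arm64", "aarch64"):
        return "arm64"
    # Prefer the most specialized x86 instruction set available
    for flag, variant in (("avx512_vnni", "avx512-vnni"),
                          ("avx512f", "avx512"),
                          ("avx2", "avx2")):
        if flag in cpu_flags:
            return variant
    return "avx2"  # baseline fallback
```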




&lt;h2&gt;
  
  
  The release gate: CI fails if the next fine-tune regresses
&lt;/h2&gt;

&lt;p&gt;This is the part I think more ML projects should steal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/popia-eval.yml (abridged)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run release gate&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;python -m semantix.cli eval popia --json | tee report.json&lt;/span&gt;
    &lt;span class="s"&gt;python -c "import json; r=json.load(open('report.json'));&lt;/span&gt;
               &lt;span class="s"&gt;import sys; sys.exit(0 if r['release_gate_passed'] else 1)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gate logic is boring and strict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;release_gate_passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finetune_macro_f1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;stock_macro_f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;no_per_clause_regression_vs_stock&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any future fine-tune that drops below a +10pp macro-F1 lift, OR regresses a single clause vs stock, fails CI and blocks the release. The model artifact has the same quality gate as the code that loads it.&lt;/p&gt;
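&lt;p&gt;Spelled out as a function over per-clause F1 dicts — a sketch of the same logic, not the library's actual implementation:&lt;/p&gt;

```python
def release_gate(stock_per_clause, tuned_per_clause, min_lift=0.10):
    """True iff the macro-F1 lift is at least min_lift AND no clause regresses."""
    stock_macro = sum(stock_per_clause.values()) / len(stock_per_clause)
    tuned_macro = sum(tuned_per_clause.values()) / len(tuned_per_clause)
    no_regression = all(
        tuned_per_clause[c] >= stock_per_clause[c] for c in stock_per_clause
    )
    return (tuned_macro - stock_macro) >= min_lift and no_regression
```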




&lt;h2&gt;
  
  
  Using it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[popia]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.presets.popia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;POPIA_CONSENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;POPIA_SECURITY&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POPIA_CONSENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose_signup_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Your LLM call here. If the output doesn't satisfy POPIA_CONSENT,
&lt;/span&gt;    &lt;span class="c1"&gt;# the decorator retries with structured feedback. If it still fails,
&lt;/span&gt;    &lt;span class="c1"&gt;# it raises — with a Semantic Certificate in the audit trail.
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On first import, the quantized ONNX model downloads once from HuggingFace and caches locally. No HF token required — the model is public.&lt;/p&gt;

&lt;p&gt;Seven presets ship with the library, one per clause. Each has a pre-tuned threshold based on the per-clause F1 on the holdout. You can override any threshold; the defaults are the F1-optimal operating points.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I will and won't claim
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I will claim:&lt;/strong&gt; on a pinned 150-pair hand-authored POPIA holdout, the fine-tune beats the stock MiniLM2 NLI cross-encoder by +29.6pp macro F1, every clause improves, no regressions. That result is reproducible — the eval set is hashed, the CI gate enforces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I won't claim:&lt;/strong&gt; this model replaces a POPIA specialist, a DPIA, or the Information Regulator's guidance. It's a deterministic, local, auditable primitive you can wire into your validation pipeline. It tells you whether a specific output is consistent with a specific POPIA clause at a specific threshold. That's a narrower claim than "POPIA-compliant" and it's the only claim I can actually defend with a holdout F1 number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I especially won't claim:&lt;/strong&gt; 180 pairs is enough training data for every production use case. If your domain has dialect, local legal phrasing, or adversarial customers trying to slip past the guard, you should fine-tune on &lt;em&gt;your&lt;/em&gt; failures. The repo includes the training recipe for exactly that reason.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reusable part
&lt;/h2&gt;

&lt;p&gt;The thing I'm most interested in is that the entire recipe — hand-authored seeds + paraphrases + cross-encoder fine-tune + ONNX export + release gate — is regulation-agnostic. Swap POPIA for GDPR and you get &lt;code&gt;nli-gdpr-v1&lt;/code&gt;. Swap for HIPAA and you get &lt;code&gt;nli-hipaa-v1&lt;/code&gt;. Swap for EU AI Act clause libraries and you get a judge per article.&lt;/p&gt;

&lt;p&gt;v0.2.0 already ships a &lt;strong&gt;GDPR sibling-model scaffold&lt;/strong&gt; — same &lt;code&gt;Judge&lt;/code&gt; interface, 7 EU-clause presets, expansion seeds, training script, and a documented runtime fallback to POPIA weights until the GDPR artifact trains. It is deliberately a scaffold: same API surface, same CI gate pattern, no weights pretending to exist. That is the contract. The second regulator costs less than the first.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nli-popia-v1&lt;/code&gt; is the first trained artifact. It's 0.813 macro F1 and it's live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/0.2.0/" rel="noopener noreferrer"&gt;https://pypi.org/project/semantix-ai/0.2.0/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub release:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.2.0" rel="noopener noreferrer"&gt;https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.2.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model card:&lt;/strong&gt; &lt;a href="https://huggingface.co/labrat-aiko/nli-popia-v1" rel="noopener noreferrer"&gt;https://huggingface.co/labrat-aiko/nli-popia-v1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  If you're building on this
&lt;/h2&gt;

&lt;p&gt;The failure modes to watch for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Threshold tuning matters more than you'd think.&lt;/strong&gt; The per-clause F1-optimal thresholds in the presets are tuned on my holdout, not yours. If your domain's distribution is different, re-tune.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False negatives on ambiguous consent language.&lt;/strong&gt; "By continuing, you agree to…" is legally grey, and the model reflects that. Tighten the threshold if you want the library to err on the side of rejecting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This is a classifier, not a reasoner.&lt;/strong&gt; It doesn't explain &lt;em&gt;why&lt;/em&gt; a clause failed. Pair it with &lt;code&gt;semantix.judges.ForensicJudge&lt;/code&gt; (ships with &lt;code&gt;[turbo]&lt;/code&gt;) if you need a mask-perturbation saliency breach report.&lt;/li&gt;
&lt;/ol&gt;
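&lt;p&gt;Re-tuning a threshold (point 1) is a small sweep over your own labeled pairs — a self-contained sketch:&lt;/p&gt;

```python
def f1_at(scores, labels, threshold):
    """F1 of the positive class when scores at or above threshold count as pass."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, grid=None):
    """Pick the F1-optimal operating point on a held-out set."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: f1_at(scores, labels, t))
```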

&lt;p&gt;If you ship something interesting with it, or fine-tune a sibling (GDPR, HIPAA, UK DPA), I'd love to see it. Issues and PRs welcome on the repo.&lt;/p&gt;




&lt;h3&gt;
  
  
  Discuss this on LinkedIn
&lt;/h3&gt;

&lt;p&gt;I'm posting the short-form announcement over on LinkedIn — replies, questions, and "this would break on my domain because…" threads all land there: &lt;strong&gt;[link to the LinkedIn post in the first comment below]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Or open an issue on the repo if you'd rather keep it with the code.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;semantix-ai is an MIT-licensed semantic type system for AI outputs. &lt;code&gt;v0.2.0&lt;/code&gt; is the first release with compliance-specific fine-tunes and ships both the trained POPIA artifact and a GDPR sibling-model scaffold. The POPIA model weights are Apache 2.0. Everything here was built by one person; numbers are reproducible, judgement calls are mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>A 70ms Local NLI Judge Hits 0.596 Pearson r With Groq Llama 3.3 70B on DSPy Reward Scoring</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:35:30 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/a-70ms-local-nli-judge-hits-0596-pearson-r-with-groq-llama-33-70b-on-dspy-reward-scoring-1d76</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/a-70ms-local-nli-judge-hits-0596-pearson-r-with-groq-llama-33-70b-on-dspy-reward-scoring-1d76</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;semantic_reward&lt;/code&gt; is a drop-in DSPy reward function powered by a local quantized NLI cross-encoder&lt;/strong&gt; — no API call, no key, deterministic, ~70ms per evaluation on CPU.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;50 paired customer-support examples&lt;/strong&gt;, semantix reaches &lt;strong&gt;Pearson r = 0.596&lt;/strong&gt; with Groq Llama 3.3 70B, and &lt;strong&gt;Cohen's kappa 0.633 at threshold 0.3&lt;/strong&gt; (substantial agreement), at &lt;strong&gt;~11× lower latency and ~$0.13 less per 1k calls&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Full reproducibility: code, dataset, raw CSVs at &lt;a href="https://github.com/labrat-akhona/semantix-ai/tree/master/benchmarks" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai/benchmarks&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why another reward function?
&lt;/h2&gt;

&lt;p&gt;DSPy's &lt;code&gt;BestOfN&lt;/code&gt; and &lt;code&gt;Refine&lt;/code&gt; lean on a &lt;code&gt;reward_fn&lt;/code&gt; that scores each candidate from 0 to 1. In practice most users wire up another LLM call — cheap per-request but adds 300–1000 ms and a few cents per optimization run. If you're iterating, that adds up fast.&lt;/p&gt;

&lt;p&gt;semantix-ai ships a ~79 MB INT8 quantized NLI cross-encoder (one of four CPU-specific variants, auto-selected based on your hardware) that scores "does text X entail intent Y?" in ~70ms on CPU. Plugging it into DSPy takes one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.dspy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;semantic_reward&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Grounded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The answer must be grounded in the provided context.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;qa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context, question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;refined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BestOfN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;semantic_reward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Grounded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest scope
&lt;/h2&gt;

&lt;p&gt;I originally set out to benchmark four judges across two tasks with an optimization experiment. Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;customer_support_qa, semantix vs Groq Llama 3.3 70B: 50/50 paired scores, clean.&lt;/strong&gt; That's this post.&lt;/li&gt;
&lt;li&gt;⚠️ Gemini 2.5 Flash: 15/50 hit the free-tier 20-requests-per-day-per-model cap mid-run.&lt;/li&gt;
&lt;li&gt;⚠️ Gemini 2.5 Pro: 25/25 hit the same cap.&lt;/li&gt;
&lt;li&gt;⚠️ HotpotQA task and &lt;code&gt;BestOfN&lt;/code&gt; optimization experiment deferred — without Gemini as the final judge I couldn't close the loop, and I'd rather ship one clean pair than a multi-task table with holes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The raw CSV is committed with error columns intact. Everything you're about to see is reproducible from the 50 rows both judges agreed to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; 50 customer-support response candidates paired with one of ~10 intents ("The response must be polite and professional", "The response must stay on topic", "The agent must decline without being rude", etc.). Seeded generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;semantix&lt;/strong&gt;: &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; from v0.2.0. Auto-detected CPU variant, INT8 ONNX, &lt;code&gt;onnxruntime&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq&lt;/strong&gt;: &lt;code&gt;groq-llama-3.3-70b-versatile&lt;/code&gt;, free-tier API, temperature 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring protocol&lt;/strong&gt;: Both judges return a continuous 0–1 score. &lt;code&gt;passed&lt;/code&gt; is derived at threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Agreement results (paired n = 50)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pearson r (continuous scores)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.596&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.3&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.4&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.5&lt;/td&gt;
&lt;td&gt;0.487&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's kappa @ 0.7&lt;/td&gt;
&lt;td&gt;0.421&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary agreement @ 0.5&lt;/td&gt;
&lt;td&gt;76% (38/50)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary agreement @ 0.3&lt;/td&gt;
&lt;td&gt;84% (42/50)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pearson r = 0.596 is a &lt;strong&gt;moderate positive correlation&lt;/strong&gt; between the two judges on raw scores. The binary pass/fail story is more interesting: at the semantix-default threshold 0.5 the two agree on 76% of calls (moderate kappa of 0.487). Drop the threshold to 0.3 and they agree on 84% of calls at &lt;strong&gt;substantial kappa 0.633&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The actionable knob: &lt;strong&gt;if you want semantix to track Groq Llama 3.3 70B's polite-response classification, run it with threshold 0.3–0.4.&lt;/strong&gt; The default 0.5 is tuned against strict NLI datasets; for pragmatic customer-support scoring, a slightly looser threshold is closer to what a 70B LLM-judge would mark as "polite enough".&lt;/p&gt;
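&lt;p&gt;The agreement statistics are standard, so recomputing them from the raw CSV is cheap. A self-contained sketch of both formulas:&lt;/p&gt;

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation of two continuous score arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def cohens_kappa(a, b):
    """Chance-corrected agreement of two boolean pass/fail vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    po = float(np.mean(a == b))                                       # observed agreement
    pe = float(np.mean(a) * np.mean(b) + np.mean(~a) * np.mean(~b))   # chance agreement
    return (po - pe) / (1 - pe)
```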

&lt;h2&gt;
  
  
  Latency and cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;semantix&lt;/th&gt;
&lt;th&gt;groq-llama-3.3-70b&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean latency&lt;/td&gt;
&lt;td&gt;70 ms&lt;/td&gt;
&lt;td&gt;799 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;64 ms&lt;/td&gt;
&lt;td&gt;777 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95&lt;/td&gt;
&lt;td&gt;121 ms&lt;/td&gt;
&lt;td&gt;992 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paid cost / 1k calls&lt;/td&gt;
&lt;td&gt;$0.0000&lt;/td&gt;
&lt;td&gt;$0.1312&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~11× lower latency.&lt;/strong&gt; On a paid Groq plan, 1M calls per day would cost ~$131/day in Groq API fees alone; semantix adds $0 and never leaves your machine. For a DSPy optimization loop calling the reward function hundreds of times per trial, the difference compounds into hours saved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use semantix as your &lt;code&gt;reward_fn&lt;/code&gt; in &lt;code&gt;BestOfN&lt;/code&gt; and &lt;code&gt;Refine&lt;/code&gt;&lt;/strong&gt; when per-call latency of an LLM-as-judge would dominate your optimization loop. At substantial kappa with Groq on polite classification, it's a reasonable signal with two orders-of-magnitude better cost structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune the threshold&lt;/strong&gt; against your own held-out examples. The default 0.5 is too strict for conversational-tone tasks; 0.3–0.4 tracks a 70B LLM-judge more faithfully on this task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't use it as a reasoner.&lt;/strong&gt; It's a narrow entailment classifier. If your task needs "why is this wrong?", pair it with &lt;code&gt;ForensicJudge&lt;/code&gt; (mask-perturbation saliency) or keep the LLM for final scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A footnote on the bug that almost killed this post
&lt;/h2&gt;

&lt;p&gt;The original benchmark run on 2026-04-21 showed &lt;strong&gt;Pearson r = -0.594&lt;/strong&gt; — a strongly &lt;em&gt;negative&lt;/em&gt; correlation. I almost shipped that as "semantix disagrees with Groq, caveat emptor". Digging in, I found a label-ordering bug in &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; (shipped in v0.1.5, fixed in v0.2.0): the code was reading &lt;code&gt;probs[2]&lt;/code&gt; (neutral) as the entailment score instead of &lt;code&gt;probs[1]&lt;/code&gt;. Fixing the bug and re-running the 50 cached texts against v0.2.0 flipped the correlation sign and shifted the kappa from near-zero to substantial.&lt;/p&gt;

&lt;p&gt;The raw CSV preserves both runs' scores through git history if anyone wants to see the before/after. I'm noting this here because (a) it's a useful cautionary tale about trusting your benchmark when the numbers look too surprising, and (b) it's the exact kind of thing a release gate (like v0.2.0's POPIA macro-F1 gate) is supposed to catch, which it now does.&lt;/p&gt;
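&lt;p&gt;The cheap regression test that would have caught it: never hard-code the entailment index — derive it from the model config's label mapping and assert it at load time. A sketch; the mapping shown is the conventional one for this model family, so verify it against your own checkpoint's &lt;code&gt;config.json&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical id2label, as found in a Hugging Face model config.json
id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}

# Derive the index instead of assuming probs[1] (or, as the bug did, probs[2])
ENTAILMENT_INDEX = next(i for i, name in id2label.items() if name == "entailment")
assert ENTAILMENT_INDEX == 1
```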

&lt;h2&gt;
  
  
  Reproducing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/labrat-akhona/semantix-ai
&lt;span class="nb"&gt;cd &lt;/span&gt;semantix-ai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[turbo]"&lt;/span&gt;  &lt;span class="c"&gt;# zero-PyTorch install&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; benchmarks/requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add GROQ_API_KEY&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmarks.dspy.customer_support.run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results land in &lt;code&gt;benchmarks/dspy/customer_support/results/&lt;/code&gt; (&lt;code&gt;raw.csv&lt;/code&gt;, &lt;code&gt;summary.md&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Same minimal-first methodology will be applied to &lt;a href="https://github.com/dottxt-ai/outlines" rel="noopener noreferrer"&gt;outlines&lt;/a&gt;, &lt;a href="https://github.com/PrefectHQ/marvin" rel="noopener noreferrer"&gt;marvin&lt;/a&gt;, and &lt;a href="https://github.com/run-llama/llama_index" rel="noopener noreferrer"&gt;llama_index&lt;/a&gt; — one paired comparison, no holes, real numbers. A PR at stanfordnlp/dspy referencing this work is open: &lt;a href="https://github.com/stanfordnlp/dspy/pull/9653" rel="noopener noreferrer"&gt;stanfordnlp/dspy#9653&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;semantix-ai is MIT-licensed. PyPI: &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;. v0.2.0 also ships a POPIA-compliance fine-tune reaching 0.813 macro-F1 on a pinned holdout.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dspy</category>
      <category>llm</category>
      <category>python</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Build LLM Guardrails in 3 Lines of Python (No API Key, No Cloud)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 13 Apr 2026 09:39:35 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/build-llm-guardrails-in-3-lines-of-python-no-api-key-no-cloud-5amf</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/build-llm-guardrails-in-3-lines-of-python-no-api-key-no-cloud-5amf</guid>
      <description>&lt;h1&gt;
  
  
  Build LLM Guardrails in 3 Lines of Python (No API Key, No Cloud)
&lt;/h1&gt;

&lt;p&gt;Your LLM just told a customer their rash "looks like it could be melanoma." Your chatbot leaked a user's email address in a support response. Your RAG pipeline went off-topic and started explaining how to pick locks.&lt;/p&gt;

&lt;p&gt;These aren't hypotheticals. They're Tuesday.&lt;/p&gt;

&lt;p&gt;You need guardrails. Here's what that currently looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regex.&lt;/strong&gt; You write &lt;code&gt;r"(?i)(you should take|I recommend taking)"&lt;/code&gt; to catch medical advice. The model rephrases to "it might help to consider" and your filter is useless. You add more patterns. The model finds more phrasings. You are now maintaining a regex zoo that catches false positives and misses actual violations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-judge.&lt;/strong&gt; Call GPT-4 to review every output. That's 500ms–2s per check, $0.01–0.03 per call, and a hard dependency on an external API. Your guardrail is now slower than the thing it's guarding. Also, you need an API key in production, your costs scale with traffic, and when OpenAI has a bad day your guardrails go down.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud guardrail services.&lt;/strong&gt; AWS Bedrock Guardrails, Azure Content Safety, etc. Vendor lock-in, network latency, usage-based pricing, and your data leaves your infrastructure. Good luck explaining that to your compliance team.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are good. What you actually want is: check whether the output &lt;em&gt;means&lt;/em&gt; something bad, locally, in milliseconds, for free.&lt;/p&gt;




&lt;h2&gt;
  
  
  3 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NoPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text does not contain personal information such as names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NoMedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text does not provide medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;NoPII&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;NoMedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every call to &lt;code&gt;my_chatbot&lt;/code&gt; now runs through a local NLI model that checks whether the output violates your policies. ~15ms on CPU. No API key. No network call. No tokens burned.&lt;/p&gt;

&lt;p&gt;If the output leaks PII or gives medical advice, it raises &lt;code&gt;SemanticIntentError&lt;/code&gt; with the score, the violated intent, and a reason. The bad output never reaches your user.&lt;/p&gt;
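<![CDATA[
&lt;p&gt;In practice you will usually catch that exception and serve a safe fallback rather than let it bubble up to the user. A minimal sketch of the pattern; the &lt;code&gt;SemanticIntentError&lt;/code&gt; class below is a self-contained stand-in so the example runs on its own, not the library's actual import:&lt;/p&gt;

```python
# Stand-in for the library's exception class, so this sketch is
# self-contained. In real code you would import it from semantix.
class SemanticIntentError(Exception):
    def __init__(self, intent, score, reason):
        super().__init__(f"{intent} failed (score={score}): {reason}")
        self.intent, self.score, self.reason = intent, score, reason

def my_chatbot(message: str) -> str:
    # Pretend the validated function rejected its own output.
    raise SemanticIntentError("Not[PIILeakage]", 0.91,
                              "contains an email address")

def answer(message: str) -> str:
    try:
        return my_chatbot(message)
    except SemanticIntentError as err:
        # Log err.intent / err.score for auditing, serve a safe fallback.
        return "Sorry, I can't share that. Let me connect you with support."

print(answer("tell me about user 42"))
```
]]>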




&lt;h2&gt;
  
  
  How the negation pattern works
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;~&lt;/code&gt; operator is the key. An &lt;code&gt;Intent&lt;/code&gt; describes what something &lt;em&gt;is&lt;/em&gt;. &lt;code&gt;~Intent&lt;/code&gt; checks that the output is &lt;em&gt;not&lt;/em&gt; that thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToxicLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains insults, profanity, threats, or aggressive language.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LegalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides specific legal counsel or interprets laws for the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s situation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these describes a &lt;em&gt;bad thing&lt;/em&gt;. Negate them and you have guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt;
&lt;span class="n"&gt;Compliant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;
&lt;span class="n"&gt;Private&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;
&lt;span class="n"&gt;NotALawyer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;LegalAdvice&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;~MedicalAdvice&lt;/code&gt; creates a &lt;code&gt;Not[MedicalAdvice]&lt;/code&gt; intent. The NLI model checks whether the output entails the original description. If it does, the negated check fails. If it doesn't, the output is clean.&lt;/p&gt;

&lt;p&gt;This works because NLI models understand &lt;em&gt;meaning&lt;/em&gt;, not patterns. "You should take ibuprofen" and "Consider an anti-inflammatory" both entail medical advice. A regex catches neither unless you enumerate both phrasings explicitly. The NLI model catches both because they mean the same thing.&lt;/p&gt;
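<![CDATA[
&lt;p&gt;You can see the regex failure mode directly: the pattern from the intro catches the literal phrasing but misses the paraphrase, even though both mean the same thing.&lt;/p&gt;

```python
import re

# The regex from the intro: catches explicit recommendation phrasings only.
pattern = re.compile(r"(?i)(you should take|I recommend taking)")

literal = "You should take ibuprofen for that."
paraphrase = "Consider an anti-inflammatory for that."

print(bool(pattern.search(literal)))     # True  (caught)
print(bool(pattern.search(paraphrase)))  # False (missed, same meaning)
```
]]>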




&lt;h2&gt;
  
  
  Composing policies
&lt;/h2&gt;

&lt;p&gt;Real compliance isn't one rule. It's a policy: multiple constraints that must all hold, or of which at least one must hold. semantix gives you &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  All constraints must pass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customer_support&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;amp;&lt;/code&gt; operator creates an &lt;code&gt;AllOf&lt;/code&gt; composite. Every negated intent is checked. If any one fails, the output is rejected. This is your production safety policy expressed in a single Python type annotation.&lt;/p&gt;

&lt;h3&gt;
  
  
  At least one constraint must pass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Apology&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains a sincere apology for the inconvenience.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text redirects the user to the appropriate support channel.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_complaint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Apology&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Redirect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;|&lt;/code&gt; operator creates an &lt;code&gt;AnyOf&lt;/code&gt; composite. The output passes if it satisfies at least one intent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mix positive and negative
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Helpful&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides a clear, actionable answer to the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Helpful&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output must be helpful AND must not be toxic AND must not leak PII. Positive and negative constraints compose freely.&lt;/p&gt;
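<![CDATA[
&lt;p&gt;The operator API is plain Python operator overloading. Here is an illustrative sketch of how &lt;code&gt;~&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, and &lt;code&gt;|&lt;/code&gt; can build a policy tree; the toy predicates stand in for the NLI model, and this is not the library's actual source:&lt;/p&gt;

```python
# Illustrative reimplementation of the composition operators.
# Intent here takes a toy predicate instead of running an NLI model.
class Check:
    def __invert__(self):   return Not(self)
    def __and__(self, o):   return AllOf(self, o)
    def __or__(self, o):    return AnyOf(self, o)

class Intent(Check):
    def __init__(self, fn): self.fn = fn
    def passes(self, text): return self.fn(text)

class Not(Check):
    def __init__(self, inner): self.inner = inner
    def passes(self, text): return not self.inner.passes(text)

class AllOf(Check):
    def __init__(self, *parts): self.parts = parts
    def passes(self, text): return all(p.passes(text) for p in self.parts)

class AnyOf(Check):
    def __init__(self, *parts): self.parts = parts
    def passes(self, text): return any(p.passes(text) for p in self.parts)

Toxic = Intent(lambda t: "idiot" in t.lower())
PII = Intent(lambda t: "@" in t)

policy = ~Toxic & ~PII
print(policy.passes("Happy to help with your order."))  # True
print(policy.passes("Email john.doe@example.com."))     # False
```
]]>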




&lt;h2&gt;
  
  
  Self-healing retries
&lt;/h2&gt;

&lt;p&gt;Guardrails that just block are a blunt instrument. Sometimes you want the LLM to try again with feedback about what went wrong. Add &lt;code&gt;retries&lt;/code&gt; and a &lt;code&gt;semantix_feedback&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Helpful&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ToxicLanguage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer this customer question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first call, &lt;code&gt;semantix_feedback&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. If the output fails validation, the decorator automatically injects a structured Markdown feedback block explaining what went wrong — the violated intent, the score, the rejected output. The LLM gets a second chance to fix it.&lt;/p&gt;

&lt;p&gt;This turns a guardrail from a wall into a feedback loop. The model learns from its mistake in-context and self-corrects. In practice, most violations are fixed on the first retry.&lt;/p&gt;

&lt;p&gt;The feedback looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Semantix Self-Healing Feedback&lt;/span&gt;

Attempt &lt;span class="gs"&gt;**1**&lt;/span&gt; failed validation.

&lt;span class="gu"&gt;### What went wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Intent:**&lt;/span&gt; &lt;span class="sb"&gt;`Not[PIILeakage]`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.9142 (threshold not met)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Judge reason:**&lt;/span&gt; Text contains what appears to be an email address

&lt;span class="gu"&gt;### What is required&lt;/span&gt;
The text must NOT satisfy the following:

The text contains personal information like names, emails, phone numbers, or addresses.

&lt;span class="gu"&gt;### Your previous output (rejected)&lt;/span&gt;
Sure, I can help! John's email is john.doe@example.com...

Please generate a new response that satisfies the requirement above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
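<![CDATA[
&lt;p&gt;The retry loop itself is a small pattern. A sketch of what the decorator does, with a toy string check standing in for the NLI validator and a toy LLM that corrects itself when given feedback:&lt;/p&gt;

```python
def validate(text):
    """Toy validator standing in for the NLI check."""
    if "@" in text:
        return False, "Text contains what appears to be an email address"
    return True, ""

def with_retries(generate, retries=2):
    feedback = None
    for attempt in range(retries + 1):
        output = generate(feedback)
        ok, reason = validate(output)
        if ok:
            return output
        # Inject structured feedback for the next attempt.
        feedback = (f"Attempt {attempt + 1} failed validation: {reason}. "
                    "Generate a new response without that content.")
    raise RuntimeError("all retries exhausted")

# Toy LLM: leaks an email first, then corrects itself given feedback.
def toy_llm(feedback):
    if feedback is None:
        return "Sure! John's email is john.doe@example.com"
    return "Sure! I can pass your message along to John."

print(with_retries(toy_llm))
```
]]>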






&lt;h2&gt;
  
  
  Testing guardrails in CI
&lt;/h2&gt;

&lt;p&gt;Guardrails in production are half the story. You also need to test that they work before you deploy. Two tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  pytest-semantix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, phone numbers, or addresses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_pii_in_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tell me about user 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_medical_advice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my head hurts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each test runs in ~15ms locally. No API key in CI secrets. No flaky network calls. Your guardrail tests run as fast as your unit tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Action
&lt;/h3&gt;

&lt;p&gt;Add semantic checks to your CI pipeline with the &lt;a href="https://github.com/labrat-akhona/semantic-test-action" rel="noopener noreferrer"&gt;semantic-test-action&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;labrat-akhona/semantic-test-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tests/&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.8&lt;/span&gt;
    &lt;span class="na"&gt;report-format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs your &lt;code&gt;pytest-semantix&lt;/code&gt; tests in CI and produces a report. Failed guardrail tests block the PR. Your compliance policy is enforced before code reaches main.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's actually happening under the hood
&lt;/h2&gt;

&lt;p&gt;When you write &lt;code&gt;~MedicalAdvice&lt;/code&gt; and the decorator validates an output, here's the sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The decorator calls your function and captures the raw string output.&lt;/li&gt;
&lt;li&gt;It extracts the intent description from the class docstring.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;Not[X]&lt;/code&gt;, it checks whether the output entails &lt;code&gt;X&lt;/code&gt;. If the entailment score is &lt;em&gt;above&lt;/em&gt; the threshold, the negated check &lt;em&gt;fails&lt;/em&gt; — the output matches the bad thing.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;AllOf&lt;/code&gt;, it checks every component. All must pass.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;AnyOf&lt;/code&gt;, it checks components until one passes.&lt;/li&gt;
&lt;li&gt;The NLI model runs locally via ONNX Runtime (quantized INT8). No GPU required. ~15ms per check on CPU.&lt;/li&gt;
&lt;li&gt;If validation fails and retries remain, feedback is injected and the function is called again.&lt;/li&gt;
&lt;li&gt;If all retries are exhausted, &lt;code&gt;SemanticIntentError&lt;/code&gt; is raised with full diagnostics.&lt;/li&gt;
&lt;/ol&gt;
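<![CDATA[
&lt;p&gt;Steps 3 to 5 reduce to a few lines. A sketch with a stand-in scoring function in place of the ONNX model; the threshold value is illustrative:&lt;/p&gt;

```python
THRESHOLD = 0.8  # illustrative, not the library's actual default

def entailment_score(output: str, description: str) -> float:
    """Stand-in for the NLI model: P(output entails description)."""
    # Toy heuristic for illustration only.
    return 0.95 if "ibuprofen" in output.lower() else 0.05

def check(intent_description, output, negated=False):
    score = entailment_score(output, intent_description)
    entailed = score >= THRESHOLD
    # For Not[X], the check FAILS when the output entails X.
    return (not entailed) if negated else entailed

advice = "The text provides medical diagnoses or treatment recommendations."
print(check(advice, "You should take ibuprofen.", negated=True))      # False: violation
print(check(advice, "Please contact your pharmacist.", negated=True)) # True: clean
```
]]>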

&lt;p&gt;The model is downloaded once (~100MB) and cached locally. After that, everything is offline. Your guardrails work on an airplane.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to use this vs. other approaches
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use semantix guardrails when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need low-latency checks (&amp;lt; 20ms) in the hot path&lt;/li&gt;
&lt;li&gt;You can't send data to external APIs (compliance, air-gapped, privacy)&lt;/li&gt;
&lt;li&gt;You want deterministic, reproducible guardrail behavior&lt;/li&gt;
&lt;li&gt;You need guardrails in CI/CD, not just production&lt;/li&gt;
&lt;li&gt;You want zero marginal cost per check&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use an LLM-as-judge when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need nuanced, context-heavy evaluation that NLI can't capture&lt;/li&gt;
&lt;li&gt;Latency and cost don't matter&lt;/li&gt;
&lt;li&gt;You're doing one-off evaluations, not real-time guardrailing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use regex/keyword filters when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a known, fixed list of exact strings to block (e.g., specific slurs, specific SSN formats)&lt;/li&gt;
&lt;li&gt;You don't need semantic understanding, just pattern matching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, these stack. Use semantix for the fast semantic layer, regex for known-exact patterns, and LLM-as-judge for the hard cases that need deep reasoning. semantix handles the 90% of cases that regex can't catch and an LLM judge is too slow for.&lt;/p&gt;
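<![CDATA[
&lt;p&gt;That stacking is just sequential checks ordered by cost. A sketch of the layered layout, where &lt;code&gt;semantic_check&lt;/code&gt; and &lt;code&gt;llm_judge&lt;/code&gt; are placeholders for the semantix call and a judge-model call:&lt;/p&gt;

```python
import re

# Layer 1: known-exact pattern (cheapest). US SSN format as an example.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def semantic_check(text):
    # Placeholder for the fast local semantix layer.
    return "take ibuprofen" not in text.lower()

def llm_judge(text):
    # Placeholder for the slow, expensive judge; only reached on escalation.
    return True

def guard(text, escalate=False):
    if SSN.search(text):
        return False             # cheap exact layer first
    if not semantic_check(text):
        return False             # fast semantic layer next
    if escalate and not llm_judge(text):
        return False             # judge only for flagged hard cases
    return True

print(guard("My SSN is 123-45-6789"))       # False
print(guard("You should take ibuprofen."))  # False
print(guard("Your order ships tomorrow."))  # True
```
]]>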




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.10+. No API key. No GPU. Works on Linux, macOS, Windows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://labrat-akhona.github.io/semantix-ai/" rel="noopener noreferrer"&gt;labrat-akhona.github.io/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pytest-semantix:&lt;/strong&gt; &lt;a href="https://pypi.org/project/pytest-semantix/" rel="noopener noreferrer"&gt;pypi.org/project/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>Test Your LLM Outputs in pytest (15ms, No API Key)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 13 Apr 2026 07:57:25 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/test-your-llm-outputs-in-pytest-15ms-no-api-key-1mmj</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/test-your-llm-outputs-in-pytest-15ms-no-api-key-1mmj</guid>
      <description>&lt;p&gt;You've got an LLM-powered feature in production. You want to test it. Here are your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;String matching.&lt;/strong&gt; Works until the model rephrases "I'd be happy to help" as "Sure, let me assist you." Now your test is red and nothing is actually wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regex.&lt;/strong&gt; You write a pattern. It passes today, breaks tomorrow when the model adds a comma. You write a more permissive pattern. Now it passes on garbage too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-judge.&lt;/strong&gt; Call GPT-4 to evaluate the output. Your test suite now takes 4 minutes, costs money, and fails when OpenAI has a bad day. Your CI pipeline needs an API key in secrets. Your team stops running the tests locally.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are good. What you actually want is to test whether your output &lt;em&gt;means&lt;/em&gt; the right thing — without any of that overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  pytest-semantix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot_is_polite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a real pytest test. It runs locally on CPU in ~15ms. No API key. No network calls. No tokens burned.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pytest-semantix&lt;/code&gt; is a pytest plugin that wraps &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;semantix-ai&lt;/a&gt;'s semantic assertion engine as a native fixture. Under the hood, it uses a local NLI (Natural Language Inference) model to check whether your LLM output entails the given intent. You describe what you mean in plain English. The model checks entailment. Done.&lt;/p&gt;

&lt;p&gt;On failure, you get a score, the intent, and a reason — not just a raw traceback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: Semantic check failed (score=0.12)
  Intent:  polite and professional
  Output:  "You're an idiot for asking that."
  Reason:  Text contains aggressive language
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
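<![CDATA[
&lt;p&gt;Conceptually, the fixture is a thin wrapper around an entailment score and a threshold. A sketch with a toy scorer standing in for the local NLI model; the error format mirrors the failure output above:&lt;/p&gt;

```python
THRESHOLD = 0.8  # illustrative, not the plugin's actual default

def entailment_score(text, intent):
    # Toy stand-in for the local NLI model.
    return 0.12 if "idiot" in text.lower() else 0.93

def assert_semantic(text, intent):
    score = entailment_score(text, intent)
    if score < THRESHOLD:
        raise AssertionError(
            f"Semantic check failed (score={score:.2f})\n"
            f"  Intent:  {intent}\n"
            f"  Output:  {text!r}"
        )

assert_semantic("Happy to help!", "polite and professional")  # passes
try:
    assert_semantic("You're an idiot.", "polite and professional")
except AssertionError as e:
    print(str(e).splitlines()[0])  # Semantic check failed (score=0.12)
```
]]>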






&lt;h2&gt;
  
  
  Markers
&lt;/h2&gt;

&lt;p&gt;If you want to attach an intent to the test itself rather than the assertion call, use the &lt;code&gt;@pytest.mark.semantic&lt;/code&gt; marker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# intent comes from the marker
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when you have a single intent per test and want to see it at a glance in the decorator rather than buried in the function body.&lt;/p&gt;




&lt;h2&gt;
  
  
  Terminal Reports
&lt;/h2&gt;

&lt;p&gt;Pass &lt;code&gt;--semantic-report&lt;/code&gt; and you get a color-coded summary after the test session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pytest &lt;span class="nt"&gt;--semantic-report&lt;/span&gt;
&lt;span class="go"&gt;
======================== semantic assertion report =========================
  Total: 5  |  Passed: 4  |  Failed: 1

  [PASS] tests/test_bot.py::test_polite  [12ms]
  [PASS] tests/test_bot.py::test_helpful  [14ms]
  [FAIL] tests/test_bot.py::test_no_pii  (score=0.67)  Contains email address  [11ms]
  [PASS] tests/test_bot.py::test_on_topic  [13ms]
  [PASS] tests/test_bot.py::test_concise  [15ms]

============================================================================
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green for pass, red for fail. Each line shows the test, the score on failure, the reason, and the wall time. No need to scroll through pytest output hunting for which semantic check broke.&lt;/p&gt;




&lt;h2&gt;
  
  
  JSON Reports for CI
&lt;/h2&gt;

&lt;p&gt;For CI integration, export results to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--semantic-report-json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;semantic-results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"nodeid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tests/test_bot.py::test_polite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polite and professional"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;12.3&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"nodeid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tests/test_bot.py::test_no_pii"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text does not contain personal information"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Contains email address"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feed this into your CI dashboard, your Slack alerts, your artifact storage — whatever your pipeline already does with JSON test results.&lt;/p&gt;
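&lt;p&gt;As one way to consume that file, here is a small gate script you could run as a CI step after &lt;code&gt;pytest --semantic-report-json=semantic-results.json&lt;/code&gt;. The field names follow the report shown above; the script itself (&lt;code&gt;check_report&lt;/code&gt;) is an illustrative sketch, not part of the plugin:&lt;/p&gt;

```python
# ci_gate.py -- fail the build when any semantic assertion failed.
# check_report() is an illustrative helper; the field names match the
# JSON report format shown above.
import json
import sys

def check_report(path):
    with open(path) as f:
        report = json.load(f)

    summary = report["summary"]
    print(f"semantic checks: {summary['passed']}/{summary['total']} passed")

    for result in report["results"]:
        if not result["passed"]:
            # Surface the failing intent and the judge's reason in the CI log.
            print(f"FAILED {result['nodeid']}: '{result['intent']}' "
                  f"(score={result['score']}) {result['reason']}")

    return 0 if summary["failed"] == 0 else 1

# As a CI step: sys.exit(check_report("semantic-results.json"))
```

&lt;p&gt;A non-zero exit code fails the pipeline, and the per-test lines land in the CI log next to the rest of your pytest output.&lt;/p&gt;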




&lt;h2&gt;
  
  
  Negation for Compliance Testing
&lt;/h2&gt;

&lt;p&gt;Some of the most important LLM tests aren't about what the output &lt;em&gt;should&lt;/em&gt; say. They're about what it &lt;em&gt;shouldn't&lt;/em&gt; say.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text provides medical diagnoses or treatment recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text contains personal information like names, emails, or phone numbers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_medical_advice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my head hurts what should I take&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;MedicalAdvice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_no_pii_leakage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tell me about user 42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PIILeakage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;~&lt;/code&gt; operator negates the intent. The test passes only when the output does &lt;em&gt;not&lt;/em&gt; match. This is how you test guardrails: toxicity, off-topic drift, unauthorized disclosures, regulatory compliance. Define the bad thing as an intent, negate it, assert against your output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composing with Existing pytest
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pytest-semantix&lt;/code&gt; is a normal pytest plugin. It doesn't replace anything in your test suite — it adds a fixture. Everything you already use works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parametrize
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt,intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle angry customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explain a refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clear and informative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;say goodbye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;friendly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot_intents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Combine with other fixtures
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MyChatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_fixtures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;friendly greeting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mix semantic and regular assertions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_structured_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate a JSON summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# regular assertion: valid JSON
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;   &lt;span class="c1"&gt;# regular assertion: has the key
&lt;/span&gt;    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concise and accurate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# semantic: means the right thing
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Global threshold
&lt;/h3&gt;

&lt;p&gt;If your team wants a stricter baseline across all tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--semantic-threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Individual tests can still override:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;assert_semantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accurate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Actually Happening
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;assert_semantic(output, intent)&lt;/code&gt;, the plugin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolves the intent from the argument or the marker, raising an error if neither is present&lt;/li&gt;
&lt;li&gt;Passes the output and intent to a local NLI model via &lt;code&gt;semantix-ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The model returns a score and verdict&lt;/li&gt;
&lt;li&gt;The plugin records the result (nodeid, intent, score, duration) for reporting&lt;/li&gt;
&lt;li&gt;On failure, it raises &lt;code&gt;AssertionError&lt;/code&gt; with score + reason&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No network call. No subprocess. No container. The NLI model loads once per session and runs inference in-process. That's why it's ~15ms per assertion.&lt;/p&gt;
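&lt;p&gt;The flow can be sketched in a few lines. This is a conceptual model, not the plugin's actual source: the scorer is stubbed out, and the default threshold of 0.75 is an assumption for illustration.&lt;/p&gt;

```python
# Conceptual sketch of the steps above -- NOT the plugin's actual source.
# The NLI scorer is stubbed; in the real plugin a local model is loaded
# once per session via semantix-ai. The 0.75 default threshold is an
# assumption for illustration.
import time

RESULTS = []  # step 4: what gets recorded for reporting

def make_assert_semantic(score_fn, default_threshold=0.75):
    def assert_semantic(output, intent, threshold=None):
        limit = threshold if threshold is not None else default_threshold
        start = time.perf_counter()
        score, reason = score_fn(output, intent)   # steps 2-3: NLI verdict
        duration_ms = (time.perf_counter() - start) * 1000
        passed = score >= limit
        RESULTS.append({"intent": intent, "passed": passed,
                        "score": score, "duration_ms": duration_ms})
        if not passed:                             # step 5: fail loudly
            raise AssertionError(
                f"semantic assertion failed: '{intent}' "
                f"(score={score:.2f}) {reason}")
    return assert_semantic

# Stub scorer: entailment is "high" when the intent word appears verbatim.
def fake_scorer(output, intent):
    return (0.9, "") if intent in output else (0.3, "intent not expressed")

assert_semantic = make_assert_semantic(fake_scorer)
assert_semantic("a friendly greeting", "friendly")  # passes silently
```

&lt;p&gt;Because everything stays in-process, the only per-assertion cost is one forward pass of the NLI model, which is where the ~15ms figure comes from.&lt;/p&gt;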




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-semantix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.10+ and pytest 7+. Pulls in &lt;code&gt;semantix-ai&lt;/code&gt; automatically.&lt;/p&gt;

&lt;p&gt;Then just use the &lt;code&gt;assert_semantic&lt;/code&gt; fixture in your tests. No configuration, no &lt;code&gt;conftest.py&lt;/code&gt; boilerplate, no setup step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/pytest-semantix/" rel="noopener noreferrer"&gt;pypi.org/project/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/pytest-semantix" rel="noopener noreferrer"&gt;github.com/labrat-akhona/pytest-semantix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;semantix-ai (the engine):&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>testing</category>
      <category>llm</category>
      <category>pytest</category>
    </item>
    <item>
      <title>How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:58:35 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/how-to-fine-tune-gpt-4o-mini-on-your-own-guardrail-failures-50-lines-of-python-3l4n</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/how-to-fine-tune-gpt-4o-mini-on-your-own-guardrail-failures-50-lines-of-python-3l4n</guid>
      <description>&lt;h1&gt;
  
  
  How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)
&lt;/h1&gt;

&lt;p&gt;Every time your LLM gets corrected by a guardrail, a training example is born and immediately thrown away. This tutorial shows you how to catch those examples and use them to make your model better — automatically, with no manual labeling.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a working pipeline that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validates LLM outputs against natural language requirements&lt;/li&gt;
&lt;li&gt;Retries failures with structured feedback&lt;/li&gt;
&lt;li&gt;Captures every (rejected → corrected) pair to disk&lt;/li&gt;
&lt;li&gt;Exports those pairs in OpenAI fine-tuning format&lt;/li&gt;
&lt;li&gt;Uploads to OpenAI for fine-tuning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total code: ~50 lines. Total manual labeling: zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[all]"&lt;/span&gt; openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need an OpenAI API key for the LLM calls and fine-tuning upload. The validation itself runs locally — no API cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Define What "Correct" Means
&lt;/h2&gt;

&lt;p&gt;Semantix uses Intent classes. The docstring is the requirement. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without
    being rude, dismissive, or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConstructiveFeedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must provide encouraging, constructive feedback
    that acknowledges effort and suggests specific improvements.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't prompts. They're contracts. The validator checks every output against them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Wire Up Validation + Collection
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invitation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what happens when you call &lt;code&gt;decline_invite("the company retreat")&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPT-4o-mini generates a response&lt;/li&gt;
&lt;li&gt;Semantix validates it against the docstring using a local NLI model (~15ms)&lt;/li&gt;
&lt;li&gt;If it fails: structured feedback is injected via &lt;code&gt;semantix_feedback&lt;/code&gt; and the function retries&lt;/li&gt;
&lt;li&gt;If the retry passes: the (rejected, accepted) pair is appended to &lt;code&gt;training_data.jsonl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If it passes on the first try: nothing is collected (no correction happened)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;semantix_feedback&lt;/code&gt; parameter is optional. Declare it and the decorator fills it automatically on retries. Don't declare it and retries still work — the model just doesn't get the structured hint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Generate Traffic
&lt;/h2&gt;

&lt;p&gt;In production, this happens organically. For this tutorial, simulate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a birthday party for someone you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t like&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a mandatory corporate retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a wedding where you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re the best man&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a networking event at a bar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a charity gala you can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t afford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a baby shower for a coworker you barely know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a holiday dinner with your in-laws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a surprise party that isn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t a surprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, check what was captured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Correction pairs collected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_pairs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Intents: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pair represents a case where the model got it wrong, got feedback, and got it right. These are the hardest examples — exactly the ones worth training on.&lt;/p&gt;
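
&lt;p&gt;And because the file is plain JSONL, slicing it is trivial. If you want only the sharpest corrections, a few lines do it (illustrative — &lt;code&gt;min_gap&lt;/code&gt; is my own knob, not a semantix parameter; the field names follow the collector's record format):&lt;br&gt;
&lt;/p&gt;

```python
import json

def load_hard_pairs(path, min_gap=0.5):
    """Keep correction pairs where the judge score jumped sharply.

    A big rejected-to-accepted gap marks the clearest corrections.
    min_gap is an illustrative knob, not a semantix parameter.
    """
    pairs = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["accepted_score"] - rec["rejected_score"] >= min_gap:
                pairs.append(rec)
    return pairs
```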




&lt;h2&gt;
  
  
  Step 4: Export to Fine-Tuning Format
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each correction pair becomes a chat completion training example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You must satisfy the following requirement:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;The text must politely decline an invitation without being rude, dismissive, or aggressive."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate a response that satisfies the above requirement."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I won't be able to attend..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the &lt;em&gt;accepted&lt;/em&gt; output is used as the training target. The rejected output served its purpose — it triggered the correction.&lt;/p&gt;
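
&lt;p&gt;The mapping itself is mechanical. A sketch of what an exporter like this plausibly does (the input fields follow the collector's record format; the prompt wording matches the example above but is otherwise an assumption, not the library's guaranteed output):&lt;br&gt;
&lt;/p&gt;

```python
def to_chat_example(record):
    """Map one correction-pair record to an OpenAI chat fine-tuning example.

    Only the accepted output becomes the assistant target; the rejected
    output is deliberately dropped from the training example.
    """
    return {
        "messages": [
            {"role": "system",
             "content": "You must satisfy the following requirement:\n\n"
                        + record["intent_description"]},
            {"role": "user",
             "content": "Generate a response that satisfies the above requirement."},
            {"role": "assistant", "content": record["accepted_output"]},
        ]
    }
```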




&lt;h2&gt;
  
  
  Step 5: Upload and Fine-Tune
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the file
&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fine-tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start fine-tuning
&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fine_tuning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-2024-07-18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fine-tuning job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
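
&lt;p&gt;The create call returns immediately, so you'll usually poll for completion. A small helper (a sketch — pass in your &lt;code&gt;OpenAI()&lt;/code&gt; client; &lt;code&gt;retrieve&lt;/code&gt; and the status values are the real OpenAI API, the interval is arbitrary):&lt;br&gt;
&lt;/p&gt;

```python
import time

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll an OpenAI fine-tuning job until it reaches a terminal state.

    Returns the final job object; job.fine_tuned_model holds the
    ft:... model ID once status is "succeeded".
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)
```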



&lt;p&gt;Wait for the job to complete (usually 10-30 minutes for small datasets). Then swap your model ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: gpt-4o-mini
# After:  ft:gpt-4o-mini-2024-07-18:your-org::job-id
&lt;/span&gt;
&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ft:gpt-4o-mini-2024-07-18:your-org::job-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- fine-tuned
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fine-tuned model runs through semantix again. It fails less often, and when it does fail, the new correction pairs are captured too. Fine-tune again, and it fails even less.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flywheel
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1: gpt-4o-mini          → 15% failure rate → 200 correction pairs
Week 2: fine-tuned-v1        →  5% failure rate →  70 correction pairs  
Week 3: fine-tuned-v2        →  2% failure rate →  25 correction pairs
Week 4: fine-tuned-v3        →  &amp;lt;1% failure rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These numbers are illustrative, but the pattern is real: each round of fine-tuning reduces the failure rate, which reduces the number of corrections, which means each subsequent training set is smaller but harder — exactly what you want.&lt;/p&gt;

&lt;p&gt;No human labeled a single example. The guardrail did the labeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Without an API Key
&lt;/h2&gt;

&lt;p&gt;Don't have an OpenAI key? Run the full loop locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/labrat-akhona/semantix-ai.git
&lt;span class="nb"&gt;cd &lt;/span&gt;semantix-ai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python examples/flywheel_demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo uses a simple keyword judge instead of NLI, but the pipeline is identical: validate, fail, correct, capture, export.&lt;/p&gt;
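
&lt;p&gt;A keyword judge can be as small as this (illustrative only — the demo's actual word lists may differ):&lt;br&gt;
&lt;/p&gt;

```python
def keyword_judge(text, required=("thank",), forbidden=("hate", "gouge")):
    """Toy judge: 1.0 if every required word appears and no forbidden
    word does, else 0.0. The word lists here are illustrative."""
    lowered = text.lower()
    if any(bad in lowered for bad in forbidden):
        return 0.0
    if all(word in lowered for word in required):
        return 1.0
    return 0.0
```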




&lt;h2&gt;
  
  
  What's Actually Happening Under the Hood
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;@validate_intent&lt;/code&gt; decorator does four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Calls your function&lt;/strong&gt; and gets the raw string output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluates&lt;/strong&gt; the string against the Intent's docstring using an NLI model (locally, ~15ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On failure&lt;/strong&gt;: builds a structured Markdown feedback report, injects it via &lt;code&gt;semantix_feedback&lt;/code&gt;, retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On success after failure&lt;/strong&gt;: calls &lt;code&gt;collector.record()&lt;/code&gt; with the rejected output, accepted output, scores, and feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The NLI model (cross-encoder/nli-MiniLM2-L6-H768) computes an entailment probability — how likely is it that the output satisfies the requirement? If the probability is below the threshold (default 0.5), validation fails.&lt;/p&gt;
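
&lt;p&gt;The decision step itself is tiny: softmax the per-class logits, read off the entailment probability, compare to the threshold. A sketch (the model call is elided; NLI cross-encoders emit three logits, but the class order is model-specific, so treat &lt;code&gt;entailment_index&lt;/code&gt; as an assumption):&lt;br&gt;
&lt;/p&gt;

```python
import math

def entailment_passes(logits, threshold=0.5, entailment_index=1):
    """Softmax three NLI class logits and threshold the entailment
    probability. The class order varies by model; index 1 is assumed."""
    exps = [math.exp(x) for x in logits]
    p_entail = exps[entailment_index] / sum(exps)
    return p_entail, p_entail >= threshold
```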

&lt;p&gt;No LLM is used for validation. No API calls. No tokens burned on checking.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;This pattern works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your LLM has a specific behavioral requirement&lt;/strong&gt; (tone, style, compliance, safety)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're already retrying failures&lt;/strong&gt; (so correction pairs exist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want domain-specific fine-tuning&lt;/strong&gt; without paying for human annotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your failure rate is high enough&lt;/strong&gt; to generate meaningful training data (&amp;gt;5%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works less well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your requirements are purely structural (use Pydantic)&lt;/li&gt;
&lt;li&gt;Your model never fails (you don't need a guardrail)&lt;/li&gt;
&lt;li&gt;Your outputs are too short or uniform to benefit from fine-tuning&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Full Script
&lt;/h2&gt;

&lt;p&gt;Here's the complete pipeline in one file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the requirement
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without
    being rude, dismissive, or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Set up collection
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Wrap your LLM call
&lt;/span&gt;&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invitation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Generate traffic
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a wedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a gala&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Export and fine-tune
&lt;/span&gt;&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Collected &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_pairs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; training pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ready for: openai api fine_tuning.jobs.create -t finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your guardrail is now your training pipeline.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;semantix-ai&lt;/strong&gt; — &lt;code&gt;pip install 'semantix-ai[all]'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a"&gt;Previous article: Your AI Guardrail Is a Dead End&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 166 tests. Zero labeling. Your failures are now your curriculum.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Your AI Guardrail Is a Dead End. Ours Is a Feedback Loop.</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:54:14 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-ai-guardrail-is-a-dead-end-ours-is-a-feedback-loop-4n6a</guid>
      <description>&lt;h1&gt;
  
  
  Your AI Guardrail Is a Dead End. Ours Is a Feedback Loop.
&lt;/h1&gt;

&lt;p&gt;Every AI guardrail on the market does the same thing: check the output, pass or fail, move on. The failure data — the &lt;em&gt;most valuable signal your system produces&lt;/em&gt; — gets thrown away.&lt;/p&gt;

&lt;p&gt;Think about that. Every time your LLM generates something wrong, gets corrected, and produces something right, you're witnessing a training example being created and destroyed in the same breath. Thousands of correction pairs, generated organically from your actual production traffic, evaporating into logs nobody reads.&lt;/p&gt;

&lt;p&gt;Semantix v0.1.7 stops the evaporation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Insight Nobody Acted On
&lt;/h2&gt;

&lt;p&gt;Here's what happens inside a self-healing validation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your LLM generates an output&lt;/li&gt;
&lt;li&gt;A judge evaluates it against the business intent&lt;/li&gt;
&lt;li&gt;It fails — score 0.23, reason: "too aggressive"&lt;/li&gt;
&lt;li&gt;The system feeds structured feedback back to the LLM&lt;/li&gt;
&lt;li&gt;The LLM generates a corrected output&lt;/li&gt;
&lt;li&gt;It passes — score 0.94&lt;/li&gt;
&lt;/ol&gt;
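
&lt;p&gt;The loop above can be sketched in a few lines (&lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;judge&lt;/code&gt; stand in for your LLM call and the judge; the 0.5 threshold is illustrative):&lt;br&gt;
&lt;/p&gt;

```python
def self_heal(generate, judge, requirement, retries=2, threshold=0.5):
    """Generate, judge, and retry with feedback until the output passes.

    Returns (output, history); history holds each rejected attempt
    with its score -- exactly the raw material fine-tuning needs.
    """
    feedback = None
    history = []
    for _ in range(retries + 1):
        output = generate(feedback)
        score = judge(requirement, output)
        if score >= threshold:
            return output, history
        history.append({"output": output, "score": score})
        feedback = f"Attempt failed (score {score:.2f}). Requirement: {requirement}"
    raise ValueError("validation failed after all retries")
```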

&lt;p&gt;Steps 3-6 just produced a &lt;strong&gt;perfect fine-tuning example&lt;/strong&gt;: a rejected output, a reason for rejection, and an accepted correction. This is exactly the data format that RLHF, DPO, and supervised fine-tuning consume.&lt;/p&gt;
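
&lt;p&gt;For preference-style training the mapping is just as direct. A sketch of a DPO-style conversion (the input fields follow the collector record shown below; the prompt/chosen/rejected triple is the common convention used by trainers like TRL's &lt;code&gt;DPOTrainer&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

```python
def to_dpo_example(record):
    """Map a correction pair to a prompt/chosen/rejected preference triple."""
    return {
        "prompt": record["intent_description"],
        "chosen": record["accepted_output"],
        "rejected": record["rejected_output"],
    }
```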

&lt;p&gt;Every guardrail system with retry logic produces this data. None of them capture it.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Training Collector
&lt;/h2&gt;

&lt;p&gt;Semantix v0.1.7 introduces the &lt;code&gt;TrainingCollector&lt;/code&gt; — an opt-in component that captures correction pairs during self-healing retries and writes them to an append-only JSONL file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without being rude.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every time a retry succeeds after a failure, the collector appends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProfessionalDecline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must politely decline an invitation without being rude."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'd rather gouge my eyes out than attend your event."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rejected_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Too aggressive, contains violent imagery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accepted_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I'm unable to attend."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accepted_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feedback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Semantix Self-Healing Feedback&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Attempt 1 failed..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-10T12:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No infrastructure. No database. No configuration. One file, growing one line at a time, containing the exact data you need to make your model smarter.&lt;/p&gt;
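&lt;p&gt;The append-only pattern is trivial to reproduce yourself. A minimal sketch — field names are assumptions mirroring the record shown above, not the library's internals:&lt;/p&gt;

```python
import json
from pathlib import Path

def append_record(path: str, record: dict) -> None:
    """Append one JSON object per line: no database, no schema migration, no locking."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_record("training_data.jsonl", {
    "intent": "ProfessionalDecline",
    "rejected_output": "I'd rather gouge my eyes out than attend your event.",
    "accepted_output": "Thank you for the invitation, but I'm unable to attend.",
    "attempts": 2,
})

# Each line parses independently, so a crashed run never corrupts earlier records.
records = [json.loads(line)
           for line in Path("training_data.jsonl").read_text().splitlines()]
```

&lt;p&gt;Because every line is a complete JSON document, you can tail the file, grep it, or stream it into an exporter without ever loading the whole thing.&lt;/p&gt;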




&lt;h2&gt;
  
  
  From Guardrail to Flywheel
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting.&lt;/p&gt;

&lt;p&gt;The collector exports directly to OpenAI fine-tuning format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_openai&lt;/span&gt;

&lt;span class="nf"&gt;export_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finetune.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each correction pair becomes a chat completion training example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You must satisfy the following requirement:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;The text must politely decline an invitation without being rude."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate a response that satisfies the above requirement."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for the invitation, but I'm unable to attend."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
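&lt;p&gt;The transformation is mechanical. Here's a sketch of what an exporter like this plausibly does per record — the field names are assumed from the collector record shown earlier, not taken from the library's source:&lt;/p&gt;

```python
import json

def to_chat_example(record: dict) -> dict:
    """Turn one correction pair into an OpenAI chat fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You must satisfy the following requirement:\n\n"
                        + record["intent_description"]},
            {"role": "user",
             "content": "Generate a response that satisfies the above requirement."},
            # Train only on the accepted output -- the rejection is context, not target.
            {"role": "assistant",
             "content": record["accepted_output"]},
        ]
    }

record = {
    "intent_description": "The text must politely decline an invitation without being rude.",
    "accepted_output": "Thank you for the invitation, but I'm unable to attend.",
}
example = to_chat_example(record)
print(json.dumps(example)[:80])
```

&lt;p&gt;One JSON object per line of the output file, and you have a valid chat-format fine-tuning dataset.&lt;/p&gt;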



&lt;p&gt;Upload the file and start a job with &lt;code&gt;openai api fine_tuning.jobs.create&lt;/code&gt;. Wait. Deploy the fine-tuned model. Watch your failure rate drop.&lt;/p&gt;

&lt;p&gt;Then the fine-tuned model runs through semantix again. It fails less. But when it does fail, those new correction pairs are captured too. The model gets fine-tuned again. Fails even less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the flywheel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validate → Fail → Correct → Capture → Fine-tune → Validate (fewer failures)
    ↑                                                          |
    └──────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every other guardrail is a wall. Semantix is a ramp.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v0.1.7: Framework Integrations
&lt;/h2&gt;

&lt;p&gt;We shipped native adapters for three of the most widely used structured output frameworks. Semantix now drops into your existing stack with one line:&lt;/p&gt;

&lt;h3&gt;
  
  
  Instructor
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.instructor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticStr&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SemanticStr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must be polite and professional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pydantic AI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;semantix_validator&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;semantix_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Polite&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  LangChain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticValidator&lt;/span&gt;

&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;SemanticValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Polite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each adapter translates a semantix verdict into the framework's native retry mechanism. Instructor gets &lt;code&gt;ValueError&lt;/code&gt;, Pydantic AI gets &lt;code&gt;ModelRetry&lt;/code&gt;, LangChain gets &lt;code&gt;OutputParserException&lt;/code&gt;. Your framework handles retries. Semantix handles meaning.&lt;/p&gt;
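&lt;p&gt;The adapter pattern itself is tiny. A hypothetical Instructor-style adapter — the &lt;code&gt;Verdict&lt;/code&gt; shape and the stand-in judge below are assumptions for illustration, not the shipped code:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    score: float
    reason: str = ""

def instructor_check(value: str, judge) -> str:
    """Pydantic-style validator: raise ValueError so the framework retries with the reason."""
    verdict = judge(value)
    if not verdict.passed:
        # Instructor feeds the ValueError message back to the model on the retry turn
        raise ValueError(f"Semantic check failed (score={verdict.score:.2f}): {verdict.reason}")
    return value

# Stand-in judge: real code would call the NLI judge's evaluate() instead
judge = lambda text: Verdict(
    passed="stupid" not in text.lower(),
    score=0.2 if "stupid" in text.lower() else 0.9,
    reason="hostile wording" if "stupid" in text.lower() else "",
)

ok = instructor_check("Thank you, but I cannot attend.", judge)
print(ok)
```

&lt;p&gt;Swap &lt;code&gt;ValueError&lt;/code&gt; for &lt;code&gt;ModelRetry&lt;/code&gt; or &lt;code&gt;OutputParserException&lt;/code&gt; and you have the other two adapters — which is why each one fits in roughly 70 lines.&lt;/p&gt;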




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total test coverage&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;166 tests&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New integration adapters&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3&lt;/strong&gt; (Instructor, Pydantic AI, LangChain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training data formats&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (OpenAI JSONL, Generic JSONL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New dependencies&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (training collector is pure Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code per adapter&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~70&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;There are two kinds of AI infrastructure. The kind that checks your work and the kind that makes you better at it.&lt;/p&gt;

&lt;p&gt;Every guardrail, every validator, every content filter in production today is the first kind. They're necessary. They're valuable. And they're a dead end — a static gate that never learns from what it catches.&lt;/p&gt;

&lt;p&gt;The training collector turns semantix into the second kind. Your guardrail becomes your training pipeline. Your failures become your curriculum. Your production traffic becomes your fine-tuning dataset.&lt;/p&gt;

&lt;p&gt;The model that runs through semantix for a month isn't the same model that started. It's better. Measurably, provably better. And it got there without a single human labeling a single example.&lt;/p&gt;

&lt;p&gt;That's not a guardrail. That's a flywheel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[all]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingCollector&lt;/span&gt;

&lt;span class="c1"&gt;# Start collecting training data in two lines
&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_training_data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_llm_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MyIntent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/0.1.7/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai/0.1.7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the repo. Install the package. Start the flywheel.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 166 tests. Zero new dependencies. Your failures are now your curriculum.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Escaping Pilot Purgatory: How Semantix-ai v0.1.5 Built the Immutable Trust Layer for AI Agents</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 06 Apr 2026 15:24:36 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/escaping-pilot-purgatory-how-semantix-ai-v015-built-the-immutable-trust-layer-for-ai-agents-a81</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/escaping-pilot-purgatory-how-semantix-ai-v015-built-the-immutable-trust-layer-for-ai-agents-a81</guid>
      <description>&lt;p&gt;Here's a statistic that should terrify every AI team lead: &lt;strong&gt;90% of enterprise AI agents never leave the pilot phase.&lt;/strong&gt; They demo beautifully. They impress stakeholders. And then they rot in staging forever, blocked not by technical limitations but by a single, devastating question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Can you prove it won't do something catastrophic in production?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer, for almost every AI system shipping today, is no.&lt;/p&gt;

&lt;p&gt;This is the story of how we built the infrastructure to change that answer to yes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Semantic Gap
&lt;/h2&gt;

&lt;p&gt;There's a term we've been using internally that I think deserves wider adoption: &lt;strong&gt;The Semantic Gap&lt;/strong&gt;. It's the space between what an AI agent &lt;em&gt;produces&lt;/em&gt; and what a business &lt;em&gt;intended&lt;/em&gt;. Every guardrail you've seen — JSON schema validation, regex filters, content moderation APIs — operates below this gap. They check &lt;em&gt;shape&lt;/em&gt;. They check &lt;em&gt;toxicity&lt;/em&gt;. They never check &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Ali Muwwakkil, who has spent years working at the intersection of AI and enterprise deployment, put it precisely: &lt;strong&gt;alignment with business processes is the true bottleneck.&lt;/strong&gt; Not model capability. Not inference speed. Not even hallucination rates. The bottleneck is that no one can prove an AI agent's output aligns with the business intent that triggered it.&lt;/p&gt;

&lt;p&gt;This is why agents die in pilot purgatory. Legal can't sign off. Compliance can't audit. Operations can't trust. And without trust, there is no production deployment.&lt;/p&gt;

&lt;p&gt;Semantix v0.1.5 was built to close The Semantic Gap — not with bigger models or better prompts, but with &lt;strong&gt;deterministic infrastructure&lt;/strong&gt; that makes AI outputs auditable, attributable, and governed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Pillars of the Trust Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pillar 1: The Silent Guard (Quantized NLI)
&lt;/h3&gt;

&lt;p&gt;The first problem with existing semantic validation is speed. If your guardrail adds 500ms to every API call, it's dead on arrival. Production systems need sub-50ms overhead or they'll route around you.&lt;/p&gt;

&lt;p&gt;We solved this with &lt;strong&gt;INT8 ONNX quantization&lt;/strong&gt;. The &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; runs NLI (Natural Language Inference) cross-encoder inference in pure ONNX Runtime — no PyTorch, no TensorFlow, no CUDA drivers. The entire dependency footprint is &lt;strong&gt;~25MB&lt;/strong&gt; compared to &lt;strong&gt;~500MB+&lt;/strong&gt; for a PyTorch-based equivalent.&lt;/p&gt;

&lt;p&gt;The numbers from our verified turbo demo:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.9ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~25MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model format&lt;/td&gt;
&lt;td&gt;INT8 quantized ONNX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware required&lt;/td&gt;
&lt;td&gt;Any CPU (auto-detects AVX-512/AVX2/ARM64)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
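&lt;p&gt;The latency figure is easy to sanity-check on your own hardware. A minimal harness — the workload below is a stand-in; swap in the real &lt;code&gt;judge.evaluate(...)&lt;/code&gt; call to reproduce the table:&lt;/p&gt;

```python
import time

def bench(fn, warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):   # let caches and lazy-init paths settle first
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Stand-in workload; replace with: lambda: judge.evaluate(output=..., intent_description=...)
latency_ms = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median: {latency_ms:.2f}ms")
```

&lt;p&gt;Median rather than mean, because a single cold-cache outlier shouldn't define your guardrail budget.&lt;/p&gt;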



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.quantized_nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizedNLIJudge&lt;/span&gt;

&lt;span class="n"&gt;judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuantizedNLIJudge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Auto-selects best ONNX variant for your CPU
&lt;/span&gt;
&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thank you for the invitation. Unfortunately, I cannot attend.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 0.3118
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;QuantizedNLIJudge&lt;/code&gt; does something subtle that took us several production-debugging sessions to get right: it &lt;strong&gt;dynamically introspects the ONNX graph's expected inputs&lt;/strong&gt; via &lt;code&gt;session.get_inputs()&lt;/code&gt;. Some ONNX exports expect &lt;code&gt;token_type_ids&lt;/code&gt;, others don't. Rather than hardcoding assumptions, the judge adapts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_inputs&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="n"&gt;feeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Only include token_type_ids if the model expects it
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;feeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type_ids&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also discovered — the hard way — that the ONNX export label order (&lt;code&gt;{0: contradiction, 1: neutral, 2: entailment}&lt;/code&gt;) differs from the PyTorch model's order (&lt;code&gt;{0: contradiction, 1: entailment, 2: neutral}&lt;/code&gt;). Entailment and neutral are swapped. Getting this wrong means your "safety pass" is actually reading the neutral probability. We've fixed it, tested it, and documented it so no one else burns a debugging session on this.&lt;/p&gt;
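&lt;p&gt;A defensive way to avoid that class of bug, sketched below: make the index-to-label map explicit and read probabilities by label &lt;em&gt;name&lt;/em&gt;, never by hardcoded position. The label order here matches the ONNX export described above; the helper is illustrative, not the library's code:&lt;/p&gt;

```python
import math

# Label order of the ONNX export. Note: PyTorch checkpoints may instead use
# {0: contradiction, 1: entailment, 2: neutral} -- entailment and neutral swapped.
ONNX_LABELS = {0: "contradiction", 1: "neutral", 2: "entailment"}

def softmax(logits):
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entailment_score(logits, labels=ONNX_LABELS) -> float:
    """Look up the entailment probability by name, so a reordered export fails loudly."""
    probs = softmax(logits)
    by_name = {labels[i]: p for i, p in enumerate(probs)}
    return by_name["entailment"]

score = entailment_score([0.1, 0.2, 2.5])
print(round(score, 4))
```

&lt;p&gt;With the wrong positional assumption, the same call would silently return the neutral probability — exactly the failure mode described above.&lt;/p&gt;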

&lt;p&gt;The Silent Guard's job is simple: pass clean text instantly, flag violations in under 25ms. Zero friction on the happy path.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 2: The Detective (Forensic Saliency)
&lt;/h3&gt;

&lt;p&gt;Knowing that text failed an intent check is useful. Knowing &lt;em&gt;which specific words caused the failure&lt;/em&gt; is transformative.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ForensicJudge&lt;/code&gt; implements what we internally call &lt;strong&gt;"Option A" Forensics&lt;/strong&gt; — mask-perturbation saliency that &lt;strong&gt;only triggers on failure&lt;/strong&gt;. When text passes, the &lt;code&gt;ForensicJudge&lt;/code&gt; returns the base verdict untouched with zero overhead. When text fails, it activates the investigation.&lt;/p&gt;

&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tokenize the output text (whitespace split — we're identifying suspect &lt;em&gt;words&lt;/em&gt;, not subwords)&lt;/li&gt;
&lt;li&gt;For each token, replace it with &lt;code&gt;[MASK]&lt;/code&gt; and re-run the base judge&lt;/li&gt;
&lt;li&gt;Measure the &lt;strong&gt;contradiction score drop&lt;/strong&gt; — how much less contradictory the text becomes without that token&lt;/li&gt;
&lt;li&gt;Rank by drop magnitude. The top-K tokens are the "breach tokens"
&lt;/li&gt;
&lt;/ol&gt;
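&lt;p&gt;The four steps above fit in a few lines. A toy sketch with a stand-in contradiction scorer — the real base judge is the NLI cross-encoder, and the word list here exists only to make the example self-contained:&lt;/p&gt;

```python
def contradiction(text: str) -> float:
    """Stand-in scorer: the real implementation runs the NLI model."""
    bad = {"gouge", "stupid"}
    return min(1.0, sum(w.strip(".,!?") in bad for w in text.lower().split()) * 0.4)

def breach_tokens(text: str, top_k: int = 3):
    """Mask each whitespace token, re-score, rank by contradiction drop."""
    base = contradiction(text)
    tokens = text.split()
    drops = []
    for i, tok in enumerate(tokens):
        masked = " ".join(tokens[:i] + ["[MASK]"] + tokens[i + 1:])
        drops.append((base - contradiction(masked), tok))  # how much calmer without tok
    drops.sort(reverse=True)
    return [tok for drop, tok in drops[:top_k] if drop > 0]

suspects = breach_tokens(
    "I would rather gouge my eyes out than attend your stupid event."
)
print(suspects)  # ['stupid', 'gouge']
```

&lt;p&gt;The cost is one extra judge call per token, which is why this only runs on failure — the happy path never pays for it.&lt;/p&gt;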

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.forensic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForensicJudge&lt;/span&gt;

&lt;span class="n"&gt;detective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForensicJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Are you serious? I would rather gouge my eyes out than attend your stupid event.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Breach Report&lt;/span&gt;

&lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.2482
&lt;span class="gs"&gt;**Base judge reason:**&lt;/span&gt; No reason provided by base judge

&lt;span class="gu"&gt;### Token Attribution&lt;/span&gt;
&lt;span class="gs"&gt;**gouge**&lt;/span&gt; (0.16), &lt;span class="gs"&gt;**stupid**&lt;/span&gt; (0.13), &lt;span class="gs"&gt;**your**&lt;/span&gt; (0.10)

&lt;span class="gu"&gt;### Summary&lt;/span&gt;
Intent failed. High contradiction detected. Suspect Tokens: [gouge, stupid, your]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Detective caught it: &lt;code&gt;gouge&lt;/code&gt;, &lt;code&gt;stupid&lt;/code&gt;, and &lt;code&gt;your&lt;/code&gt; are the three words most responsible for the intent violation. Remove any of them and the contradiction score drops measurably.&lt;/p&gt;

&lt;p&gt;This matters for two reasons. First, &lt;strong&gt;debugging&lt;/strong&gt;: when an AI agent fails in production, the team doesn't have to read the full output and guess what went wrong. The Breach Report points directly at the offending tokens. Second, &lt;strong&gt;self-healing&lt;/strong&gt;: the structured report can be fed back to the agent as corrective context. The agent knows &lt;em&gt;what&lt;/em&gt; to fix, not just &lt;em&gt;that&lt;/em&gt; it failed.&lt;/p&gt;

&lt;p&gt;Imagine this in a legal review pipeline. The agent drafts a partnership agreement. The ForensicJudge flags it as non-compliant with the intent "must be free of hidden liability clauses." The Breach Report identifies &lt;code&gt;indemnify&lt;/code&gt;, &lt;code&gt;forfeit&lt;/code&gt;, and &lt;code&gt;waive&lt;/code&gt; as the breach tokens. The agent rewrites, removing those clauses. The second draft passes. No human had to read either draft.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 3: The Black Box (AuditEngine)
&lt;/h3&gt;

&lt;p&gt;Speed and attribution solve the engineering problem. But enterprise deployment has a governance problem too: &lt;strong&gt;you need a record.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AuditEngine&lt;/code&gt; is a thread-safe singleton that captures every validation event as a &lt;strong&gt;JSON-LD Semantic Certificate&lt;/strong&gt; — a self-describing, standards-based record of what was validated, when, and whether it passed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.audit.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditEngine&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thank you, but I cannot attend.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each certificate contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://schema.semantix.ai/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SemanticCertificate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"urn:semantix:cert:29365ece-68f9-4a13-a89b-ccbbed34bf53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-06T14:55:41.726348+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must politely decline an invitation."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"99c3814a6c40a84f7274b5c8..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"previous_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GENESIS"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note what's &lt;em&gt;not&lt;/em&gt; in the certificate: the raw output text. Instead, there's a SHA-256 hash of it. This means your audit trail is &lt;strong&gt;compliance-safe&lt;/strong&gt; — you can prove what was validated without storing potentially sensitive content in the audit log.&lt;/p&gt;
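&lt;p&gt;The mechanism is plain SHA-256 hashing. A small sketch of how an auditor would later confirm a certificate without the log ever containing the text:&lt;/p&gt;

```python
import hashlib

output = "Thank you, but I cannot attend."

# Store only the digest in the audit log, never the raw text.
output_hash = hashlib.sha256(output.encode("utf-8")).hexdigest()
assert len(output_hash) == 64        # hex-encoded SHA-256

# Later, an auditor who holds the original output re-derives the digest
# and confirms it matches the certificate.
assert hashlib.sha256(output.encode("utf-8")).hexdigest() == output_hash
```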

&lt;p&gt;The critical design choice is the &lt;code&gt;previous_hash&lt;/code&gt; field. Every certificate contains the SHA-256 hash of the &lt;em&gt;entire previous certificate&lt;/em&gt;. This creates an &lt;strong&gt;immutable hash chain&lt;/strong&gt; rooted at &lt;code&gt;GENESIS&lt;/code&gt;. Tamper with any entry and every subsequent hash breaks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# True — chain is intact
&lt;/span&gt;
&lt;span class="c1"&gt;# Tamper with an entry
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# False — tampering detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same fundamental principle behind blockchain integrity, applied to AI governance without the overhead of consensus protocols. One hash chain. One source of truth. Verifiable by anyone with the audit file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audit.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Write to disk as JSONL
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
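&lt;p&gt;The chain mechanics fit in a few lines. This is an illustrative reimplementation, not the &lt;code&gt;AuditEngine&lt;/code&gt; internals:&lt;/p&gt;

```python
import hashlib
import json

def _digest(entry):
    # Hash the canonical JSON form of the entire previous entry.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(chain, record):
    # Each record points at the digest of its predecessor, rooted at GENESIS.
    record["previous_hash"] = _digest(chain[-1]) if chain else "GENESIS"
    chain.append(record)

def verify_chain(chain):
    for i, entry in enumerate(chain):
        expected = _digest(chain[i - 1]) if i else "GENESIS"
        if entry["previous_hash"] != expected:
            return False
    return True

chain = []
append(chain, {"score": 0.3118, "passed": True})
append(chain, {"score": 0.87, "passed": True})
assert verify_chain(chain)       # intact

chain[0]["score"] = 0.99         # tamper with the first entry
assert not verify_chain(chain)   # detected: the next link no longer matches
```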






&lt;h2&gt;
  
  
  The Full Stack in Action
&lt;/h2&gt;

&lt;p&gt;Here's what production deployment looks like with all three pillars working together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.audit.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.quantized_nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantizedNLIJudge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.forensic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForensicJudge&lt;/span&gt;

&lt;span class="c1"&gt;# Build the trust stack
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;base_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QuantizedNLIJudge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# 23.9ms inference
&lt;/span&gt;&lt;span class="n"&gt;detective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForensicJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_judge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# Attribution on failure
&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation without being
    rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Your LLM call here
&lt;/span&gt;
    &lt;span class="c1"&gt;# Record every validation in the audit trail
&lt;/span&gt;    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Score populated by judge
&lt;/span&gt;        &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@validate_intent&lt;/code&gt; decorator handles the validation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The function runs and returns a string&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ForensicJudge&lt;/code&gt; evaluates it against the intent&lt;/li&gt;
&lt;li&gt;If it passes: the Silent Guard clears it in ~24ms, zero forensic overhead&lt;/li&gt;
&lt;li&gt;If it fails: the Detective runs saliency, identifies breach tokens, generates a Breach Report&lt;/li&gt;
&lt;li&gt;The decorator retries with self-healing feedback injected into the next call&lt;/li&gt;
&lt;li&gt;The AuditEngine records every attempt as a hash-chained certificate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After all retries, you have a complete, tamper-evident record of every validation attempt — what was tried, what failed, why it failed, and what ultimately passed.&lt;/p&gt;
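&lt;p&gt;That loop can be sketched as a minimal decorator. The judge and LLM below are stubs, not the Semantix internals:&lt;/p&gt;

```python
import functools

def validate_intent(judge, retries=2):
    """Sketch of a validation-loop decorator (illustrative, not the real Semantix code)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            feedback = None
            for _ in range(retries + 1):
                output = fn(*args, feedback=feedback, **kwargs)  # run the function
                passed, report = judge(output)                   # evaluate against intent
                if passed:
                    return output                                # clears: done
                feedback = report                                # fails: retry with feedback
            raise ValueError("validation failed after all retries")
        return wrapper
    return decorator

def toy_judge(text):
    ok = "sorry" not in text.lower()
    return ok, None if ok else "avoid apologetic hedging; decline plainly"

@validate_intent(toy_judge, retries=2)
def decline(feedback=None):
    # Stand-in for the LLM call: the injected feedback steers the second attempt.
    return "Thank you, but I cannot attend." if feedback else "Sorry, so sorry, maybe not."

assert decline() == "Thank you, but I cannot attend."
```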




&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;We are living through a specific moment in the AI industry. The capability curve is flattening — GPT-4, Claude, Gemini, and Llama are all "good enough" for most business tasks. The differentiation is shifting from &lt;em&gt;what AI can do&lt;/em&gt; to &lt;em&gt;whether you can trust what AI did&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In 2026, &lt;strong&gt;liability is the biggest cost of AI&lt;/strong&gt;. Not compute. Not API bills. Liability. When an AI agent sends a contract with a hidden indemnification clause, when it generates a medical summary that omits a critical drug interaction, when it writes a customer email that accidentally constitutes a binding offer — the cost isn't a bad Yelp review. It's a lawsuit.&lt;/p&gt;

&lt;p&gt;Every company deploying AI agents needs three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — Validation that doesn't bottleneck the pipeline (The Silent Guard: 23.9ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribution&lt;/strong&gt; — When something goes wrong, know exactly what and why (The Detective: breach tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance&lt;/strong&gt; — An immutable record that proves governance was applied (The Black Box: hash-chained certificates)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Semantix v0.1.5 delivers all three in a single &lt;code&gt;pip install&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The End of Vibe-Coding
&lt;/h2&gt;

&lt;p&gt;There's a practice in the AI industry that we need to name and retire: &lt;strong&gt;vibe-coding&lt;/strong&gt;. It's the practice of deploying AI agents with no semantic validation — shipping outputs because they "look right" to a human reviewer, with no deterministic verification that the output matches the intent.&lt;/p&gt;

&lt;p&gt;Vibe-coding works in demos. It works in hackathons. It does not work when your agent is generating legal documents, medical summaries, financial reports, or customer communications at scale.&lt;/p&gt;

&lt;p&gt;Semantix exists to replace vibes with verification. To replace "it looks right" with "it mathematically entails the business intent." To replace trust-by-default with trust-by-proof.&lt;/p&gt;

&lt;p&gt;We aren't building a library. We're setting a standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Recommended: INT8 ONNX (fast, lightweight)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[turbo]'&lt;/span&gt;

&lt;span class="c"&gt;# Full stack with all judge backends&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'semantix-ai[all]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;v0.1.5 Release:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai/releases/tag/v0.1.5" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai/releases/tag/v0.1.5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the repo. Try the turbo install. Run &lt;code&gt;tools/trust_demo.py&lt;/code&gt; and watch the Breach Report identify exactly which words betrayed the intent.&lt;/p&gt;

&lt;p&gt;And if you're tired of AI agents dying in pilot purgatory — join us. The trust layer is here.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa. 126 tests. Sub-25ms inference. Zero vibes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Any AI Agent Can Now Vibe Check LLM Outputs — No Code Required</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:54:25 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/any-ai-agent-can-now-vibe-check-llm-outputs-no-code-required-19ei</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/any-ai-agent-can-now-vibe-check-llm-outputs-no-code-required-19ei</guid>
      <description>&lt;h1&gt;
  
  
  Any AI Agent Can Now "Vibe Check" LLM Outputs — No Code Required
&lt;/h1&gt;

&lt;p&gt;Your AI agent just generated a customer email. It's grammatically perfect. The JSON is valid. But it accidentally threatened to cancel the customer's account instead of apologizing.&lt;/p&gt;

&lt;p&gt;No guardrail caught it because no guardrail was checking &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix v0.1.4&lt;/a&gt;, &lt;strong&gt;any MCP-capable agent&lt;/strong&gt; — Claude Desktop, Claude Code, Cursor, or your own — can validate text against semantic intents as a tool call. Zero code changes. Zero API keys. Runs locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Agents Don't Verify Their Own Output
&lt;/h2&gt;

&lt;p&gt;LLM agents are getting more autonomous. They write emails, generate reports, draft code reviews, and respond to customers. But they operate on a trust-based system: generate output, ship it, hope for the best.&lt;/p&gt;

&lt;p&gt;What if the agent could &lt;strong&gt;verify its own output&lt;/strong&gt; before sending it? Not structurally — semantically. "Does this text actually do what I intended?"&lt;/p&gt;

&lt;p&gt;That's what the Semantix MCP server enables.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.1.4: The Universal Standard Release
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP Server: &lt;code&gt;verify_text_intent&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Semantix now ships a built-in &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; server that exposes a single, powerful tool: &lt;code&gt;verify_text_intent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Any MCP-capable agent can call it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We sincerely apologize for the inconvenience and have credited your account."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text must be a sincere customer apology that offers a concrete resolution."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it fails, the agent gets a structured correction suggestion — enabling &lt;strong&gt;cross-agent self-healing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"correction_suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Semantix Verification Failed&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### What went wrong&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;- **Score:** 0.1800 (threshold 0.5 not met)&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### What is required&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;The text must be a sincere customer apology...&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;### Rejected output&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;```

&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Your account has been flagged for termination.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;

```&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Please generate a new response that satisfies the requirement above."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent reads the correction, regenerates, and tries again. Self-healing across any agent framework — no SDK integration needed.&lt;/p&gt;
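&lt;p&gt;From the agent's side, that loop is only a few lines. Both functions below are stand-ins (the real tool call goes over MCP, and the real scoring uses the NLI judge):&lt;/p&gt;

```python
import json

def verify_text_intent(text):
    """Stand-in for the MCP tool call."""
    ok = "apolog" in text.lower()
    return json.dumps({
        "passed": ok,
        "score": 0.91 if ok else 0.18,
        "correction_suggestion": None if ok else "The text must be a sincere apology.",
    })

def generate(hint=None):
    """Stand-in for the agent's LLM call; the hint steers regeneration."""
    if hint:
        return "We sincerely apologize and have credited your account."
    return "Your account has been flagged for termination."

draft = generate()
result = json.loads(verify_text_intent(draft))
while not result["passed"]:
    # Read the correction, regenerate, try again.
    draft = generate(hint=result["correction_suggestion"])
    result = json.loads(verify_text_intent(draft))

assert result["passed"]
```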

&lt;h3&gt;
  
  
  Setup: 3 Lines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[mcp,nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"semantix-verify"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"semantix/mcp/server.py"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/semantix-ai"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Claude Desktop (or any MCP client) can now call &lt;code&gt;verify_text_intent&lt;/code&gt; before responding.&lt;/p&gt;




&lt;h2&gt;
  
  
  NLI Accuracy Fixes
&lt;/h2&gt;

&lt;p&gt;v0.1.4 also ships critical fixes to the NLI judge that dramatically improve scoring accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entailment index fix&lt;/strong&gt; — The model's label order is &lt;code&gt;{0: contradiction, 1: entailment, 2: neutral}&lt;/code&gt;. We were accidentally reading the &lt;em&gt;neutral&lt;/em&gt; logit instead of &lt;em&gt;entailment&lt;/em&gt;. Fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Softmax calibration&lt;/strong&gt; — Raw logits are now converted to true 0-1 probability scores via &lt;code&gt;apply_softmax=True&lt;/code&gt;. Before this, scores were unbounded and hard to threshold meaningfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Progressive tense hypothesis&lt;/strong&gt; — NLI cross-encoders score dramatically better when the hypothesis is framed as ongoing action. "The text must politely decline an invitation" becomes "Someone is politely declining an invitation." This single change pushed scores from ~0.3 to 0.88+ for well-written declines.&lt;/p&gt;
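&lt;p&gt;The first two fixes amount to normalizing the logits and reading the correct index. A pure-Python illustration (the logit values are made up):&lt;/p&gt;

```python
import math

# Mock logits from an NLI cross-encoder, in the label order described above:
# {0: contradiction, 1: entailment, 2: neutral}.
logits = [-2.1, 3.4, 0.2]

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

ENTAILMENT = 1                       # the bug was reading index 2 (neutral)
probs = softmax(logits)
score = probs[ENTAILMENT]

assert abs(sum(probs) - 1.0) < 1e-9  # a true 0-1 probability, easy to threshold
assert score > probs[2]              # entailment, not neutral, drives the score
```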




&lt;h2&gt;
  
  
  Why MCP?
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is becoming the universal standard for agent-tool communication. By shipping Semantix as an MCP tool rather than a library-only solution, we get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Universal compatibility&lt;/strong&gt; — Works with Claude Desktop, Claude Code, Cursor, and any future MCP client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero integration code&lt;/strong&gt; — Agents call it as a tool, not as a library import&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language agnostic&lt;/strong&gt; — Your agent doesn't need to be written in Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing bridge&lt;/strong&gt; — The &lt;code&gt;correction_suggestion&lt;/code&gt; field gives any agent enough context to retry intelligently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what "validate meaning, not shape" looks like at the agent layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Agent (any MCP client)
     |
     v
  MCP tool call: verify_text_intent
     |
     v
Semantix MCP Server (FastMCP)
     |
     v
NLIJudge (lazy-loaded singleton)
     |
     v
Cross-encoder: "Does this text entail the intent?"
     |
     +-- score &amp;gt;= threshold --&amp;gt; {"passed": true, "score": 0.91}
     |
     +-- score &amp;lt; threshold  --&amp;gt; {"passed": false, "correction_suggestion": "..."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The NLI model loads lazily on the first tool call — server startup is instant. The judge runs locally on CPU with no API keys.&lt;/p&gt;
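&lt;p&gt;A lazy-loading singleton of that shape might look like this (the names are illustrative, not the server's actual code):&lt;/p&gt;

```python
import threading

class HeavyJudge:
    """Stand-in for the NLI judge; imagine __init__ loading a large model."""
    def __init__(self):
        self.loaded = True

_judge = None
_lock = threading.Lock()

def get_judge():
    """Create the judge on first call only, so server startup stays instant."""
    global _judge
    if _judge is None:              # fast path once the judge exists
        with _lock:                 # serialize the first, slow initialization
            if _judge is None:      # double-checked locking
                _judge = HeavyJudge()
    return _judge

assert get_judge() is get_judge()   # every tool call shares one instance
```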




&lt;h2&gt;
  
  
  20 Automated Tests, Zero Model Loading
&lt;/h2&gt;

&lt;p&gt;The MCP test suite covers tool registration, response schema, correction suggestions, and dependency error handling — all without loading the actual NLI model. We mock the judge so tests run in milliseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantix.mcp.server._get_judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_failing_response_includes_correction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mock_get&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mock_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;verify_text_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correction_suggestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server also handles missing dependencies gracefully — if &lt;code&gt;sentence-transformers&lt;/code&gt; isn't installed, it returns an error JSON instead of crashing.&lt;/p&gt;
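&lt;p&gt;That failure mode can be sketched like so; the function name and error schema here are illustrative, not the server's actual response format:&lt;/p&gt;

```python
import importlib
import json

def verify_text_intent_safe(text, intent, dep="sentence_transformers"):
    """Sketch of dependency-safe tool behavior."""
    try:
        importlib.import_module(dep)      # the optional heavy dependency
    except ImportError as exc:
        # Return structured JSON the agent can surface, instead of crashing.
        return json.dumps({
            "error": "missing_dependency",
            "detail": str(exc),
            "hint": "pip install 'semantix-ai[mcp,nli]'",
        })
    # Normal scoring path would run here; return a placeholder result.
    return json.dumps({"score": 0.0, "passed": False})

result = json.loads(verify_text_intent_safe("hi", "intent", dep="not_a_real_module_xyz"))
assert result["error"] == "missing_dependency"
```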




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[mcp,nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test it locally&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from semantix.mcp.server import verify_text_intent
print(verify_text_intent(
    'I appreciate the invitation but unfortunately I will not be able to attend.',
    'The text must politely decline an invitation'
))
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run as MCP server&lt;/span&gt;
mcp run semantix/mcp/server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Semantix is a semantic type system for AI outputs. v0.1.3 added self-healing retries. v0.1.4 makes it universal via MCP. The roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More judge backends&lt;/strong&gt; — Anthropic, Cohere, local LLMs via Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic integration&lt;/strong&gt; — Semantic fields inside Pydantic models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming validation&lt;/strong&gt; — Real-time intent checking during generation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install "semantix-ai[mcp,nli]"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful. Open an issue if it isn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Sun, 05 Apr 2026 10:40:11 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-how-i-fixed-ai-reliability-38ac</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-how-i-fixed-ai-reliability-38ac</guid>
      <description>&lt;h1&gt;
  
  
  Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability
&lt;/h1&gt;

&lt;p&gt;You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.&lt;/p&gt;

&lt;p&gt;Then your model returns a "polite decline" that says &lt;em&gt;"I'd rather gouge my eyes out."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It passes your type checks. It fails the vibe check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the Semantic Gap&lt;/strong&gt; — the space between &lt;em&gt;structural correctness&lt;/em&gt; and &lt;em&gt;actual meaning&lt;/em&gt;. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Semantic Gap: Shape vs. Meaning
&lt;/h2&gt;

&lt;p&gt;Here's what most validation looks like today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;tone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you the &lt;em&gt;shape&lt;/em&gt; is right. It tells you nothing about whether the &lt;em&gt;meaning&lt;/em&gt; is right. Your model can return &lt;code&gt;{"message": "Go away.", "tone": "polite"}&lt;/code&gt; and Pydantic will happily accept it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantix flips the script.&lt;/strong&gt; Instead of validating structure, you validate intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation
    without being rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The docstring &lt;em&gt;is&lt;/em&gt; the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.1.3: The Self-Healing Update
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Informed Self-Healing
&lt;/h3&gt;

&lt;p&gt;The biggest feature in v0.1.3 is &lt;strong&gt;informed retries&lt;/strong&gt;. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM &lt;em&gt;exactly what went wrong&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Declare a &lt;code&gt;semantix_feedback&lt;/code&gt; parameter in your function, and the decorator injects a structured Markdown report on each retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix.judges.nli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NLIJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;NLIJudge&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decline this invite: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first call, &lt;code&gt;semantix_feedback&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. If validation fails, the next call receives something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Semantix Self-Healing Feedback&lt;/span&gt;

Attempt &lt;span class="gs"&gt;**1**&lt;/span&gt; failed validation.

&lt;span class="gu"&gt;### What went wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Intent:**&lt;/span&gt; &lt;span class="sb"&gt;`ProfessionalDecline`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Score:**&lt;/span&gt; 0.3210 (threshold not met)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Judge reason:**&lt;/span&gt; too vague

&lt;span class="gu"&gt;### What is required&lt;/span&gt;
The text must politely decline an invitation without being rude or aggressive.

&lt;span class="gu"&gt;### Your previous output (rejected)&lt;/span&gt;
Go away.

Please generate a new response that satisfies the requirement above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM gets the score, the reason, the requirement, and its own rejected output. It can &lt;em&gt;learn from the failure&lt;/em&gt; in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  NLI as the Default Judge
&lt;/h3&gt;

&lt;p&gt;We moved from &lt;code&gt;LLMJudge&lt;/code&gt; to &lt;code&gt;NLIJudge&lt;/code&gt; as the default. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No API key required&lt;/strong&gt; — runs fully locally using a cross-encoder model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entailment &amp;gt; cosine similarity&lt;/strong&gt; — NLI asks "does A entail B?", which is fundamentally the right question for intent validation. Cosine similarity asks "are A and B &lt;em&gt;about&lt;/em&gt; the same thing?", which is a weaker signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast enough&lt;/strong&gt; — the default &lt;code&gt;nli-MiniLM2-L6-H768&lt;/code&gt; model is ~85MB and runs in milliseconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can still use any judge you want — &lt;code&gt;LLMJudge&lt;/code&gt;, &lt;code&gt;EmbeddingJudge&lt;/code&gt;, or your own custom &lt;code&gt;Judge&lt;/code&gt; subclass.&lt;/p&gt;
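&lt;p&gt;To give a feel for what a custom judge can look like, here is a toy keyword-based one. The &lt;code&gt;evaluate&lt;/code&gt; signature and &lt;code&gt;JudgeResult&lt;/code&gt; shape are assumptions for illustration; check the repo for the real &lt;code&gt;Judge&lt;/code&gt; interface:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    passed: bool
    score: float
    reason: str

class KeywordJudge:
    """Toy judge: scores by how many required keywords appear in the output.
    Hypothetical interface -- not semantix's actual Judge base class."""

    def __init__(self, keywords: list[str], threshold: float = 0.5):
        self.keywords = [k.lower() for k in keywords]
        self.threshold = threshold

    def evaluate(self, output: str, intent_description: str) -> JudgeResult:
        hits = sum(1 for k in self.keywords if k in output.lower())
        score = hits / len(self.keywords) if self.keywords else 0.0
        return JudgeResult(
            passed=score >= self.threshold,
            score=score,
            reason=f"{hits}/{len(self.keywords)} required keywords found",
        )
```

&lt;p&gt;Anything that returns a pass/fail, a score, and a reason can slot into the same retry loop.&lt;/p&gt;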

&lt;h3&gt;
  
  
  Granular Scoring
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LLMJudge&lt;/code&gt; no longer returns a binary Yes/No. It now returns a &lt;strong&gt;0.0-1.0 confidence score&lt;/strong&gt; and a &lt;strong&gt;text reason&lt;/strong&gt;, giving the self-healing system richer feedback to work with.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Proof: Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Talk is cheap. Here are the real numbers from &lt;code&gt;tools/benchmark.py&lt;/code&gt;, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;No Healing&lt;/th&gt;
&lt;th&gt;Self-Healing&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Professional Tone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.3%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;+43.3pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical Explanation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;36.7%&lt;/td&gt;
&lt;td&gt;96.7%&lt;/td&gt;
&lt;td&gt;+60.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actionable Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.3%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;+43.3pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+48.9pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Self-healing more than &lt;strong&gt;triples&lt;/strong&gt; the overall success rate, from 21.1% to 70.0%. For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.&lt;/p&gt;

&lt;p&gt;These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the &lt;em&gt;relative improvement&lt;/em&gt; from self-healing holds.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Function
     |
     v
@validate_intent
     |
     v
Call function -&amp;gt; Get raw string
     |
     v
Judge.evaluate(output, intent_description, threshold)
     |
     +-- PASS --&amp;gt; return Intent(output)
     |
     +-- FAIL --&amp;gt; SemanticIntentError
                    |
                    v
              retries left?
                    |
                    +-- YES --&amp;gt; inject semantix_feedback -&amp;gt; retry
                    |
                    +-- NO  --&amp;gt; raise error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decorator resolves the &lt;code&gt;Intent&lt;/code&gt; subclass from your return type annotation, calls the judge, and manages the retry loop. The &lt;code&gt;semantix_feedback&lt;/code&gt; injection is zero-boilerplate — just add the parameter and it works.&lt;/p&gt;
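&lt;p&gt;The resolution step can be sketched with stdlib tooling. This is a simplification under assumptions (the &lt;code&gt;Intent&lt;/code&gt; stand-in and &lt;code&gt;resolve_intent&lt;/code&gt; helper are illustrative, not semantix internals):&lt;/p&gt;

```python
import typing

class Intent:
    """Minimal stand-in for semantix's Intent base class."""

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation."""

def resolve_intent(func) -> type:
    """Pull the Intent subclass out of a function's return annotation,
    the way a decorator like @validate_intent could."""
    ret = typing.get_type_hints(func).get("return")
    if not (isinstance(ret, type) and issubclass(ret, Intent)):
        raise TypeError("return annotation must be an Intent subclass")
    return ret

def decline_invite(event: str) -> ProfessionalDecline: ...

# The resolved class's docstring is the contract the judge validates against.
intent_cls = resolve_intent(decline_invite)
```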




&lt;h2&gt;
  
  
  Get Started in 30 Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[nli]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PositiveSentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must express a clearly positive, optimistic,
    or encouraging sentiment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encourage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PositiveSentiment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write an encouraging message for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;semantix_feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_your_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your LLM output is now semantically typed and self-healing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/semantix-ai/" rel="noopener noreferrer"&gt;pypi.org/project/semantix-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install semantix-ai&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;Akhona Eland&lt;/a&gt; in South Africa.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>agenticai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your LLM Passes Type Checks but Fails the Vibe Check — Here's How to Fix It</title>
      <dc:creator>Akhona Eland</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:08:34 +0000</pubDate>
      <link>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-heres-how-to-fix-it-1dkm</link>
      <guid>https://dev.to/akhona_eland_072dac9e0c2c/your-llm-passes-type-checks-but-fails-the-vibe-check-heres-how-to-fix-it-1dkm</guid>
      <description>&lt;p&gt;You ask your LLM to write a polite decline to a meeting invite. It returns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I appreciate the invitation, but I would rather set myself on fire than attend your team-building retreat."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You run it through your Pydantic model. It passes. It's a string. The right length. Valid UTF-8. Technically a "response."&lt;/p&gt;

&lt;p&gt;But it's not a polite decline. It's a career-ending email.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the gap nobody's filling.&lt;/strong&gt; We have type systems for data structures — &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, Pydantic models. We validate &lt;em&gt;shape&lt;/em&gt; obsessively. But we have nothing for &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Semantix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;Semantix&lt;/a&gt; is a semantic type system for LLM outputs. Instead of checking "is this a string?", it checks "does this string actually say what it's supposed to say?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The text must politely decline an invitation 
    without being rude or aggressive.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_my_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decline_invite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the company retreat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ✓ Validated — the output actually IS a polite decline
# ✗ Raises SemanticIntentError if the LLM went off the rails
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines of setup. One decorator. Your LLM output is now semantically typed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The core idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You define an Intent&lt;/strong&gt; — a class whose docstring describes the semantic contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You decorate your LLM function&lt;/strong&gt; — the return type hint tells Semantix what to validate against.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Judge evaluates the output&lt;/strong&gt; — comparing what the LLM said against what it was supposed to mean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Judge is the interesting part. Semantix ships with three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EmbeddingJudge&lt;/strong&gt; — compares sentence embeddings using cosine similarity. Fast, runs locally, no API key. Good for clear-cut intents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConciseSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LLMJudge&lt;/strong&gt; — asks GPT-4o-mini "does this text satisfy this requirement? Yes or No." More accurate, needs an API key, costs fractions of a cent per call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NLIJudge&lt;/strong&gt; — uses a cross-encoder NLI model to check if the output &lt;em&gt;entails&lt;/em&gt; the intent. Best of both worlds: accurate like an LLM judge, local like an embedding judge.&lt;/p&gt;

&lt;p&gt;You pick the speed/accuracy tradeoff that fits your use case. And you can swap judges without changing any other code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feature That Made Me Build This
&lt;/h2&gt;

&lt;p&gt;Here's what pushed me over the edge. I was building an AI agent for a client that needed to generate customer-facing responses. The responses had to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Professional in tone&lt;/li&gt;
&lt;li&gt;Factually grounded in the company's data&lt;/li&gt;
&lt;li&gt;Free of any promises or commitments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pydantic could check that the response was a non-empty string under 500 characters. Great. But the LLM kept slipping in phrases like "I guarantee this will be resolved" — structurally valid, semantically dangerous.&lt;/p&gt;

&lt;p&gt;So I built Semantix. And the feature I'm most proud of is &lt;strong&gt;smart retries&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_last_failure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingJudge&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SafeCustomerResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failure&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_last_failure&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Your previous attempt scored &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove any promises or guarantees.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Respond to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;get_last_failure()&lt;/code&gt; gives your LLM function access to the &lt;em&gt;reason&lt;/em&gt; the previous attempt failed. So each retry isn't just "try again" — it's "try again, but here's what went wrong." The LLM gets smarter with each attempt.&lt;/p&gt;
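&lt;p&gt;One plausible way such a hook can be implemented (not necessarily how Semantix does it) is a context-local failure store that the retry loop writes to before each attempt:&lt;/p&gt;

```python
import contextvars
from dataclasses import dataclass
from typing import Optional

@dataclass
class Failure:
    score: float
    reason: str

# Context-local slot; empty until a validation attempt fails.
_last_failure = contextvars.ContextVar("last_failure", default=None)

def get_last_failure() -> Optional[Failure]:
    """Return the failure recorded for the current retry, if any."""
    return _last_failure.get()

def record_failure(score: float, reason: str) -> None:
    # The decorator would call this before re-invoking your function.
    _last_failure.set(Failure(score, reason))
```

&lt;p&gt;A &lt;code&gt;ContextVar&lt;/code&gt; keeps concurrent requests from seeing each other's failures, which a module-level global would not.&lt;/p&gt;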

&lt;h2&gt;
  
  
  Composable Intents
&lt;/h2&gt;

&lt;p&gt;Real-world requirements are rarely one-dimensional. Semantix lets you combine intents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AllOf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnyOf&lt;/span&gt;

&lt;span class="c1"&gt;# Must satisfy ALL — polite AND positive
&lt;/span&gt;&lt;span class="n"&gt;SafeResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ProfessionalTone&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;NoPromises&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;FactuallyGrounded&lt;/span&gt;

&lt;span class="c1"&gt;# Must satisfy AT LEAST ONE — either formal or casual decline
&lt;/span&gt;&lt;span class="n"&gt;FlexibleDecline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnyOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FormalDecline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CasualDecline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@validate_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingJudge&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SafeResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; operators work on Intent classes directly. Under the hood, &lt;code&gt;AllOf&lt;/code&gt; concatenates the docstrings with "AND" and uses the minimum threshold. &lt;code&gt;AnyOf&lt;/code&gt; uses "OR" and the maximum threshold.&lt;/p&gt;
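&lt;p&gt;Stripped of the operator sugar, that composition rule reduces to a few lines. Plain dicts stand in for Intent classes here; the helper names are illustrative, not the library's internals:&lt;/p&gt;

```python
def make_intent(doc: str, threshold: float) -> dict:
    # A dict standing in for an Intent class (doc = docstring contract).
    return {"doc": doc, "threshold": threshold}

def all_of(*intents: dict) -> dict:
    """AND-combination: docstrings joined with 'AND', minimum threshold."""
    return {
        "doc": " AND ".join(i["doc"] for i in intents),
        "threshold": min(i["threshold"] for i in intents),
    }

def any_of(*intents: dict) -> dict:
    """OR-combination: docstrings joined with 'OR', maximum threshold."""
    return {
        "doc": " OR ".join(i["doc"] for i in intents),
        "threshold": max(i["threshold"] for i in intents),
    }
```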

&lt;h2&gt;
  
  
  Streaming Support
&lt;/h2&gt;

&lt;p&gt;If you're streaming LLM responses (and you probably should be), Semantix validates once the full stream is assembled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;semantix&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamCollector&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ProfessionalDecline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_judge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;llm_stream&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stream to user in real-time
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# validate the complete output
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your users see the response streaming in. Behind the scenes, Semantix is collecting chunks. The moment the stream ends, it validates. If it fails, you catch the error and handle it — retry, fall back to a template, or flag for human review.&lt;/p&gt;
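&lt;p&gt;That retry-then-fall-back flow is library-agnostic. Here is one way to sketch it, with a hypothetical &lt;code&gt;ValidationError&lt;/code&gt; standing in for whatever exception your validator raises:&lt;/p&gt;

```python
# Generic retry/fallback pattern for a failed semantic validation.
# ValidationError is a hypothetical stand-in; substitute the exception
# your validation library actually raises.
class ValidationError(Exception):
    pass


FALLBACK = "Sorry, I can't help with that request."


def respond_safely(generate, validate, retries=1):
    for _ in range(retries + 1):
        draft = generate()
        try:
            validate(draft)   # raises ValidationError on a semantic miss
            return draft
        except ValidationError:
            continue          # regenerate and try again
    return FALLBACK           # out of retries: fall back to a vetted template


# Deterministic demo: the first draft fails validation, the second passes.
attempts = iter(["I guarantee a full refund!", "I can look into that for you."])


def generate():
    return next(attempts)


def validate(text):
    if "guarantee" in text:
        raise ValidationError("sounds like a promise")


result = respond_safely(generate, validate)  # "I can look into that for you."
```

&lt;p&gt;In production you'd likely log the failed draft before retrying, so the rejection itself becomes an audit artifact.&lt;/p&gt;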

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;I built Semantix because the existing tools solve a different problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Semantix&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Guardrails AI&lt;/th&gt;
&lt;th&gt;NeMo Guardrails&lt;/th&gt;
&lt;th&gt;Instructor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Validates meaning&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Schema-focused&lt;/td&gt;
&lt;td&gt;✅ Dialogue rails&lt;/td&gt;
&lt;td&gt;❌ Schema-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero required deps&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works with any LLM&lt;/td&gt;
&lt;td&gt;✅ Any function&lt;/td&gt;
&lt;td&gt;⚠️ Wrappers&lt;/td&gt;
&lt;td&gt;⚠️ Config files&lt;/td&gt;
&lt;td&gt;⚠️ Patched clients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pluggable backends&lt;/td&gt;
&lt;td&gt;✅ 3 built-in + custom&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines to validate&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;~20+&lt;/td&gt;
&lt;td&gt;~30+&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Semantix isn't a replacement for Pydantic or Guardrails. It's the &lt;strong&gt;layer above&lt;/strong&gt; them. After you know the shape is right, verify the meaning is right too.&lt;/p&gt;
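&lt;p&gt;A minimal sketch of that layering, with a plain dataclass standing in for your Pydantic model and a stub standing in for a semantic judge (all names here are illustrative):&lt;/p&gt;

```python
# Hypothetical two-layer check: layer 1 validates shape (a dataclass
# standing in for Pydantic), layer 2 validates meaning (a stub judge).
# Names and checks are illustrative only.
from dataclasses import dataclass


@dataclass
class SupportReply:
    text: str
    ticket_id: int


def parse_reply(raw: dict) -> SupportReply:
    # Layer 1: shape. Missing fields or wrong types fail here.
    return SupportReply(text=str(raw["text"]), ticket_id=int(raw["ticket_id"]))


def meaning_ok(reply: SupportReply) -> bool:
    # Layer 2: meaning. A real judge scores the text against an intent;
    # this stub just blocks obvious promise language.
    return "guarantee" not in reply.text.lower()


reply = parse_reply({"text": "We will take a look at your order.", "ticket_id": 41})
print(meaning_ok(reply))  # True
```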

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;semantix-ai

&lt;span class="c"&gt;# With embedding judge (fast, local)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[embeddings]"&lt;/span&gt;

&lt;span class="c"&gt;# With OpenAI judge (accurate)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"semantix-ai[openai]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the repo: &lt;a href="https://github.com/labrat-akhona/semantix-ai" rel="noopener noreferrer"&gt;github.com/labrat-akhona/semantix-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's MIT licensed, Python 3.10+, and the core has zero dependencies. I'd love feedback — open an issue or drop a comment below.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Akhona, an automation engineer based in South Africa. I build AI-powered tools and integrations. You can find me on &lt;a href="https://github.com/labrat-akhona" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
