Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability
You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.
Then your model returns a "polite decline" that says "I'd rather gouge my eyes out."
It passes your type checks. It fails the vibe check.
This is the Semantic Gap — the space between structural correctness and actual meaning. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built Semantix.
## The Semantic Gap: Shape vs. Meaning
Here's what most validation looks like today:
```python
from typing import Literal
from pydantic import BaseModel

class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]
```
This tells you the shape is right. It tells you nothing about whether the meaning is right. Your model can return {"message": "Go away.", "tone": "polite"} and Pydantic will happily accept it.
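To make the gap concrete, here is a stdlib-only stand-in for that kind of structural check (no Pydantic required). The `Response` class and `ALLOWED_TONES` set here are illustrative, not part of any library:

```python
from dataclasses import dataclass

# Illustrative stand-in for a structural validator like Pydantic:
# it checks types and allowed values, but never reads the meaning.
ALLOWED_TONES = {"polite", "neutral", "firm"}

@dataclass
class Response:
    message: str
    tone: str

    def __post_init__(self):
        if not isinstance(self.message, str):
            raise TypeError("message must be a string")
        if self.tone not in ALLOWED_TONES:
            raise ValueError(f"tone must be one of {ALLOWED_TONES}")

# Structurally valid, semantically broken: a rude message tagged "polite".
resp = Response(message="Go away.", tone="polite")
print(resp.tone)  # "polite" -- the validator has no objection
```

Every check passes, and the one thing that matters (the message is rude) goes unnoticed.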
Semantix flips the script. Instead of validating structure, you validate intent:
```python
from semantix import Intent, validate_intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation
    without being rude or aggressive."""

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return call_my_llm(event)
```
The docstring is the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?
## What's New in v0.1.3: The Self-Healing Update
### Informed Self-Healing
The biggest feature in v0.1.3 is informed retries. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM exactly what went wrong.
Declare a semantix_feedback parameter in your function, and the decorator injects a structured Markdown report on each retry:
```python
from typing import Optional

from semantix import validate_intent
from semantix.judges.nli import NLIJudge

@validate_intent(judge=NLIJudge(), retries=2)
def decline(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    prompt = f"Decline this invite: {event}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_llm(prompt)
```
On the first call, semantix_feedback is None. If validation fails, the next call receives something like:
```markdown
## Semantix Self-Healing Feedback

Attempt **1** failed validation.

### What went wrong
- **Intent:** `ProfessionalDecline`
- **Score:** 0.3210 (threshold not met)
- **Judge reason:** too vague

### What is required
The text must politely decline an invitation without being rude or aggressive.

### Your previous output (rejected)
Go away.

Please generate a new response that satisfies the requirement above.
```
The LLM gets the score, the reason, the requirement, and its own rejected output. It can learn from the failure in real time.
### NLI as the Default Judge
We moved from LLMJudge to NLIJudge as the default. Why?
- No API key required — runs fully locally using a cross-encoder model
- Entailment > Cosine similarity — NLI asks "does A entail B?" which is fundamentally the right question for intent validation. Cosine similarity asks "are A and B about the same thing?" which is a weaker signal
- Fast enough — the default `nli-MiniLM2-L6-H768` model is ~85MB and runs in milliseconds
You can still use any judge you want — LLMJudge, EmbeddingJudge, or your own custom Judge subclass.
### Granular Scoring
LLMJudge no longer returns a binary Yes/No. It now returns a 0.0-1.0 confidence score and a text reason, giving the self-healing system richer feedback to work with.
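To show what a score-and-reason judge interface can look like, here is a hypothetical, stdlib-only sketch. Both `Verdict` and `KeywordJudge` are invented for this example and are not part of semantix-ai; the real `Judge.evaluate` signature may differ, though the post shows it receiving the output, the intent description, and a threshold:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float  # 0.0-1.0 confidence, mirroring v0.1.3's granular scoring
    reason: str   # text reason, fed back into the self-healing report

class KeywordJudge:
    """Toy judge: scores politeness by scanning for rude phrases.
    Invented for illustration -- not a real semantix-ai judge."""

    RUDE = ("go away", "gouge my eyes out", "shut up")

    def evaluate(self, output: str, intent_description: str,
                 threshold: float = 0.5) -> Verdict:
        hits = sum(phrase in output.lower() for phrase in self.RUDE)
        score = max(0.0, 1.0 - 0.5 * hits)
        reason = "rude phrasing detected" if hits else "no rude phrasing found"
        return Verdict(score=score, reason=reason)

judge = KeywordJudge()
print(judge.evaluate("Go away.", "politely decline").score)                 # prints 0.5
print(judge.evaluate("Sadly, I can't make it.", "politely decline").score)  # prints 1.0
```

A real judge would replace the keyword scan with an NLI entailment check or an LLM call, but the contract is the same: text in, score and reason out.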
## The Proof: Benchmark Results
Talk is cheap. Here are the real numbers from tools/benchmark.py, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):
| Scenario | No Healing | Self-Healing | Improvement |
|---|---|---|---|
| Professional Tone | 13.3% | 56.7% | +43.3% |
| Technical Explanation | 36.7% | 96.7% | +60.0% |
| Actionable Summary | 13.3% | 56.7% | +43.3% |
| Overall | 21.1% | 70.0% | +48.9% |
Self-healing more than triples the overall success rate (from 21.1% to 70.0%). For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.
These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the relative improvement from self-healing holds.
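As a quick sanity check on the overall row, the percentage-point gain and the relative multiplier follow directly from the table:

```python
# Overall success rates from the benchmark table above, in percent
no_healing = 21.1
self_healing = 70.0

gain = self_healing - no_healing   # percentage-point improvement
ratio = self_healing / no_healing  # relative multiplier
print(f"+{gain:.1f} points, {ratio:.2f}x")  # prints "+48.9 points, 3.32x"
```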
## How It Works Under the Hood
```
Your Function
      |
      v
@validate_intent
      |
      v
Call function -> Get raw string
      |
      v
Judge.evaluate(output, intent_description, threshold)
      |
      +-- PASS --> return Intent(output)
      |
      +-- FAIL --> SemanticIntentError
                        |
                        v
                  retries left?
                        |
                        +-- YES --> inject semantix_feedback -> retry
                        |
                        +-- NO --> raise error
```
The decorator resolves the Intent subclass from your return type annotation, calls the judge, and manages the retry loop. The semantix_feedback injection is zero-boilerplate — just add the parameter and it works.
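The loop itself is easy to reproduce in plain Python. This is a simplified sketch of the pattern, not Semantix's actual implementation: `validate_intent_sketch`, `toy_judge`, and the toy `decline` function are invented here, and the real decorator resolves the requirement from the Intent return annotation rather than taking it as an argument:

```python
import functools

class SemanticIntentError(Exception):
    pass

def validate_intent_sketch(judge, requirement, threshold=0.5, retries=2):
    """Simplified stand-in for @validate_intent (illustration only)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            feedback = None
            last = None
            for attempt in range(retries + 1):
                output = fn(*args, semantix_feedback=feedback, **kwargs)
                score, reason = judge(output, requirement)
                if score >= threshold:
                    return output
                last = (output, score, reason)
                # Build the structured report the next attempt will see.
                feedback = (
                    "## Semantix Self-Healing Feedback\n"
                    f"Attempt **{attempt + 1}** failed validation.\n"
                    f"- **Score:** {score:.4f} (threshold not met)\n"
                    f"- **Judge reason:** {reason}\n"
                    f"### What is required\n{requirement}\n"
                    f"### Your previous output (rejected)\n{output}\n"
                )
            raise SemanticIntentError(f"failed after {retries + 1} attempts: {last}")
        return wrapper
    return decorator

# Toy judge and "LLM": the first answer is rude; the retry sees the
# feedback report and produces a polite decline instead.
def toy_judge(output, requirement):
    return (0.2, "rude") if "go away" in output.lower() else (0.9, "ok")

@validate_intent_sketch(toy_judge, "Politely decline the invitation.")
def decline(event, semantix_feedback=None):
    if semantix_feedback is None:
        return "Go away."
    return f"Thanks for the invite to {event}, but I can't make it."

print(decline("the offsite"))  # succeeds on the second attempt
```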
## Get Started in 30 Seconds
```bash
pip install "semantix-ai[nli]"
```
```python
from semantix import Intent, validate_intent

class PositiveSentiment(Intent):
    """The text must express a clearly positive, optimistic,
    or encouraging sentiment."""

@validate_intent(retries=2)
def encourage(name: str, semantix_feedback=None) -> PositiveSentiment:
    prompt = f"Write an encouraging message for {name}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_your_llm(prompt)
```
That's it. Your LLM output is now semantically typed and self-healing.
## Links
- GitHub: github.com/labrat-akhona/semantix-ai
- PyPI: pypi.org/project/semantix-ai
- Install: `pip install semantix-ai`
Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.
Built by Akhona Eland in South Africa.