Akhona Eland
Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability

You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.

Then your model returns a "polite decline" that says "I'd rather gouge my eyes out."

It passes your type checks. It fails the vibe check.

This is the Semantic Gap — the space between structural correctness and actual meaning. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built Semantix.


The Semantic Gap: Shape vs. Meaning

Here's what most validation looks like today:

```python
from typing import Literal

from pydantic import BaseModel

class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]
```

This tells you the shape is right. It tells you nothing about whether the meaning is right. Your model can return {"message": "Go away.", "tone": "polite"} and Pydantic will happily accept it.
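To make the gap concrete, here is a minimal sketch (assuming Pydantic v2) showing that structurally valid but semantically wrong output passes validation untouched:

```python
from typing import Literal

from pydantic import BaseModel

class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]

# Structurally perfect, semantically broken: Pydantic accepts it.
resp = Response.model_validate({"message": "Go away.", "tone": "polite"})
print(resp.tone)  # "polite", even though the message is anything but
```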

Semantix flips the script. Instead of validating structure, you validate intent:

```python
from semantix import Intent, validate_intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation
    without being rude or aggressive."""

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return call_my_llm(event)
```

The docstring is the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?


What's New in v0.1.3: The Self-Healing Update

Informed Self-Healing

The biggest feature in v0.1.3 is informed retries. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM exactly what went wrong.

Declare a semantix_feedback parameter in your function, and the decorator injects a structured Markdown report on each retry:

```python
from typing import Optional
from semantix import validate_intent
from semantix.judges.nli import NLIJudge

@validate_intent(judge=NLIJudge(), retries=2)
def decline(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    prompt = f"Decline this invite: {event}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_llm(prompt)
```

On the first call, semantix_feedback is None. If validation fails, the next call receives something like:

```markdown
## Semantix Self-Healing Feedback

Attempt **1** failed validation.

### What went wrong
- **Intent:** `ProfessionalDecline`
- **Score:** 0.3210 (threshold not met)
- **Judge reason:** too vague

### What is required
The text must politely decline an invitation without being rude or aggressive.

### Your previous output (rejected)
Go away.

Please generate a new response that satisfies the requirement above.
```

The LLM gets the score, the reason, the requirement, and its own rejected output. It can learn from the failure in real time.
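For illustration, here is roughly how such a report could be assembled; the `build_feedback` helper below is hypothetical, not Semantix's actual internals:

```python
def build_feedback(attempt: int, intent: str, score: float,
                   reason: str, requirement: str, output: str) -> str:
    """Format a Markdown self-healing report (hypothetical helper)."""
    return (
        f"## Semantix Self-Healing Feedback\n\n"
        f"Attempt **{attempt}** failed validation.\n\n"
        f"### What went wrong\n"
        f"- **Intent:** `{intent}`\n"
        f"- **Score:** {score:.4f} (threshold not met)\n"
        f"- **Judge reason:** {reason}\n\n"
        f"### What is required\n{requirement}\n\n"
        f"### Your previous output (rejected)\n{output}\n\n"
        f"Please generate a new response that satisfies the requirement above."
    )

report = build_feedback(
    1, "ProfessionalDecline", 0.321, "too vague",
    "The text must politely decline an invitation "
    "without being rude or aggressive.",
    "Go away.",
)
print(report)
```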

NLI as the Default Judge

We moved from LLMJudge to NLIJudge as the default. Why?

- **No API key required** - runs fully locally using a cross-encoder model
- **Entailment beats cosine similarity** - NLI asks "does A entail B?", which is the right question for intent validation; cosine similarity only asks "are A and B about the same thing?", a weaker signal
- **Fast enough** - the default nli-MiniLM2-L6-H768 model is ~85 MB and runs in milliseconds

You can still use any judge you want — LLMJudge, EmbeddingJudge, or your own custom Judge subclass.
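To get a feel for the judge interface, here is a toy standalone judge with an `evaluate(output, intent_description, threshold)` method. A real judge would subclass Semantix's `Judge` base class; the keyword-overlap heuristic here is illustrative only:

```python
class KeywordJudge:
    """Toy judge, not a real Semantix Judge subclass.

    Scores the output against the intent description with a crude
    word-overlap heuristic and compares the score to a threshold.
    """

    def evaluate(self, output: str, intent_description: str,
                 threshold: float) -> bool:
        intent_words = set(intent_description.lower().split())
        output_words = set(output.lower().split())
        # Fraction of intent words echoed in the output.
        score = len(intent_words & output_words) / max(len(intent_words), 1)
        return score >= threshold

judge = KeywordJudge()
print(judge.evaluate("I must politely decline the invitation",
                     "politely decline an invitation", 0.3))  # True
print(judge.evaluate("Go away.",
                     "politely decline an invitation", 0.3))  # False
```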

Granular Scoring

LLMJudge no longer returns a binary Yes/No. It now returns a 0.0-1.0 confidence score and a text reason, giving the self-healing system richer feedback to work with.
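Conceptually, a granular verdict pairs a score with a reason. The `JudgeVerdict` dataclass below is an illustrative shape, not the library's real return type:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Illustrative shape of a granular judge result (not Semantix's actual type)."""
    score: float   # 0.0-1.0 confidence that the output satisfies the intent
    reason: str    # free-text explanation, usable as retry feedback

    def passed(self, threshold: float = 0.7) -> bool:
        return self.score >= threshold

verdict = JudgeVerdict(score=0.321, reason="too vague")
print(verdict.passed())  # False: below the 0.7 threshold
```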


The Proof: Benchmark Results

Talk is cheap. Here are the real numbers from tools/benchmark.py, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):

| Scenario | No Healing | Self-Healing | Improvement |
| --- | --- | --- | --- |
| Professional Tone | 13.3% | 56.7% | +43.3% |
| Technical Explanation | 36.7% | 96.7% | +60.0% |
| Actionable Summary | 13.3% | 56.7% | +43.3% |
| **Overall** | **21.1%** | **70.0%** | **+48.9%** |

Self-healing more than triples the overall success rate, from 21.1% to 70.0%. For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.

These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the relative improvement from self-healing holds.
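The headline figures are easy to verify from the table:

```python
# Overall benchmark figures from the table above (percent success).
no_healing = 21.1
self_healing = 70.0

print(round(self_healing / no_healing, 1))   # 3.3x relative improvement
print(round(self_healing - no_healing, 1))   # +48.9 points absolute
```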


How It Works Under the Hood

```text
Your Function
     |
     v
@validate_intent
     |
     v
Call function -> Get raw string
     |
     v
Judge.evaluate(output, intent_description, threshold)
     |
     +-- PASS --> return Intent(output)
     |
     +-- FAIL --> SemanticIntentError
                    |
                    v
              retries left?
                    |
                    +-- YES --> inject semantix_feedback -> retry
                    |
                    +-- NO  --> raise error
```

The decorator resolves the Intent subclass from your return type annotation, calls the judge, and manages the retry loop. The semantix_feedback injection is zero-boilerplate — just add the parameter and it works.
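As a rough sketch of the annotation-resolution step, using standard `typing` machinery (Semantix's actual implementation may differ, and the `Intent` stand-in below is illustrative):

```python
import typing
from functools import wraps

class Intent:
    """Stand-in for Semantix's Intent base class (illustrative only)."""

class PositiveSentiment(Intent):
    """The text must express a clearly positive sentiment."""

def validate_intent(func):
    # Resolve the intent class from the function's return annotation.
    hints = typing.get_type_hints(func)
    intent_cls = hints["return"]          # e.g. PositiveSentiment
    requirement = intent_cls.__doc__      # the docstring is the contract

    @wraps(func)
    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        # A real implementation would hand `output` and `requirement`
        # to a judge here and manage the retry loop on failure.
        return output

    wrapper.requirement = requirement     # exposed here for illustration
    return wrapper

@validate_intent
def encourage(name: str) -> PositiveSentiment:
    return f"Keep going, {name}!"

print(encourage("Ada"))         # Keep going, Ada!
print(encourage.requirement)    # the PositiveSentiment docstring
```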


Get Started in 30 Seconds

```shell
pip install "semantix-ai[nli]"
```
```python
from semantix import Intent, validate_intent

class PositiveSentiment(Intent):
    """The text must express a clearly positive, optimistic,
    or encouraging sentiment."""

@validate_intent(retries=2)
def encourage(name: str, semantix_feedback=None) -> PositiveSentiment:
    prompt = f"Write an encouraging message for {name}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_your_llm(prompt)
```

That's it. Your LLM output is now semantically typed and self-healing.


Links

Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.


Built by Akhona Eland in South Africa.
