Akhona Eland
Your LLM Passes Type Checks but Fails the Vibe Check — Here's How to Fix It

You ask your LLM to write a polite decline to a meeting invite. It returns:

"I appreciate the invitation, but I would rather set myself on fire than attend your team-building retreat."

You run it through your Pydantic model. It passes. It's a string. The right length. Valid UTF-8. Technically a "response."

But it's not a polite decline. It's a career-ending email.

This is the gap nobody's filling. We have type systems for data structures — int, str, Pydantic models. We validate shape obsessively. But we have nothing for meaning.

Until now.

Introducing Semantix

Semantix is a semantic type system for LLM outputs. Instead of checking "is this a string?", it checks "does this string actually say what it's supposed to say?"

```python
from semantix import Intent, validate_intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation
    without being rude or aggressive."""

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return call_my_llm(event)

result = decline_invite("the company retreat")
# ✓ Validated — the output actually IS a polite decline
# ✗ Raises SemanticIntentError if the LLM went off the rails
```

Three lines of setup. One decorator. Your LLM output is now semantically typed.

How It Works

The core idea is simple:

  1. You define an Intent — a class whose docstring describes the semantic contract.
  2. You decorate your LLM function — the return type hint tells Semantix what to validate against.
  3. A Judge evaluates the output — comparing what the LLM said against what it was supposed to mean.
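Conceptually, the decorator is a small loop around your function. Here is a minimal sketch of the idea, not Semantix's actual implementation: the `judge.score()` method, the `threshold` default, and the error message are assumptions for illustration.

```python
import inspect

class SemanticIntentError(Exception):
    pass

class Intent:
    """Base class; each subclass's docstring is its semantic contract."""

def validate_intent(func, judge=None, threshold=0.7):
    # Read the Intent from the function's return annotation, call the
    # wrapped function, then have the judge score output vs. contract.
    intent = inspect.signature(func).return_annotation

    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        score = judge.score(output, intent.__doc__)
        if score < threshold:
            raise SemanticIntentError(
                f"output scored {score:.2f} against {intent.__name__}"
            )
        return output

    return wrapper
```

The key move is that the return type annotation carries the contract, so call sites stay ordinary Python.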

The Judge is the interesting part. Semantix ships with three:

EmbeddingJudge — compares sentence embeddings using cosine similarity. Fast, runs locally, no API key. Good for clear-cut intents.

```python
from semantix import validate_intent, EmbeddingJudge

@validate_intent(judge=EmbeddingJudge())
def summarize(text: str) -> ConciseSummary:
    return call_llm(text)
```

LLMJudge — asks GPT-4o-mini "does this text satisfy this requirement? Yes or No." More accurate, needs an API key, costs fractions of a cent per call.
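The yes/no prompt pattern is easy to sketch. Here the model call is abstracted behind an `ask` callable (a hypothetical stand-in for an OpenAI chat completion call), since the library's actual prompt isn't shown in this post:

```python
def llm_judge(output: str, intent_doc: str, ask) -> bool:
    # `ask(prompt)` sends the prompt to a chat model such as GPT-4o-mini
    # and returns its text reply; injected here so the logic is testable.
    prompt = (
        "Does the following text satisfy this requirement?\n"
        f"Requirement: {intent_doc}\n"
        f"Text: {output}\n"
        "Answer with exactly one word: Yes or No."
    )
    return ask(prompt).strip().lower().startswith("yes")
```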

NLIJudge — uses a cross-encoder NLI model to check if the output entails the intent. Best of both worlds: accurate like an LLM judge, local like an embedding judge.
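Entailment checking follows the same shape. A sketch with the model's `predict` injected; in practice that would be something like `sentence_transformers.CrossEncoder("cross-encoder/nli-deberta-v3-base").predict`, and the `[contradiction, entailment, neutral]` label order is an assumption about that model family:

```python
def nli_judge(output: str, intent_doc: str, predict) -> bool:
    # `predict` takes (premise, hypothesis) pairs and returns one row of
    # logits per pair; label order assumed [contradiction, entailment, neutral].
    logits = predict([(output, intent_doc)])[0]
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best == 1  # index 1 = entailment
```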

You pick the speed/accuracy tradeoff that fits your use case. And you can swap judges without changing any other code.

The Feature That Made Me Build This

Here's what pushed me over the edge. I was building an AI agent for a client that needed to generate customer-facing responses. The responses had to be:

  • Professional in tone
  • Factually grounded in the company's data
  • Free of any promises or commitments

Pydantic could check that the response was a non-empty string under 500 characters. Great. But the LLM kept slipping in phrases like "I guarantee this will be resolved" — structurally valid, semantically dangerous.

So I built Semantix. And the feature I'm most proud of is smart retries:

```python
from semantix import validate_intent, get_last_failure, EmbeddingJudge

@validate_intent(judge=EmbeddingJudge(), retries=3)
def respond(query: str) -> SafeCustomerResponse:
    hint = ""
    if failure := get_last_failure():
        hint = (
            f"\n\nYour previous attempt scored {failure.score:.2f}. "
            "Remove any promises or guarantees."
        )
    return call_llm(f"Respond to: {query}{hint}")
```

get_last_failure() gives your LLM function access to the reason the previous attempt failed. So each retry isn't just "try again" — it's "try again, but here's what went wrong." The LLM gets smarter with each attempt.
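Internally, a retry loop of that shape could look like the sketch below. The `Failure` record and the module-level storage are assumptions for illustration; Semantix's real mechanics may differ.

```python
class Failure:
    def __init__(self, score, reason=""):
        self.score = score
        self.reason = reason

_last_failure = None

def get_last_failure():
    return _last_failure

def run_with_retries(generate, score, threshold=0.7, retries=3):
    # generate() produces a candidate (and may consult get_last_failure());
    # score() judges it. Each failed attempt leaves feedback for the next.
    global _last_failure
    _last_failure = None
    for _ in range(retries):
        candidate = generate()
        s = score(candidate)
        if s >= threshold:
            _last_failure = None
            return candidate
        _last_failure = Failure(s)
    raise RuntimeError("all retries failed semantic validation")
```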

Composable Intents

Real-world requirements are rarely one-dimensional. Semantix lets you combine intents:

```python
from semantix import AllOf, AnyOf

# Must satisfy ALL three: professional, no promises, factually grounded
# (the & operator is shorthand for AllOf)
SafeResponse = ProfessionalTone & NoPromises & FactuallyGrounded

# Must satisfy AT LEAST ONE: either a formal or a casual decline
FlexibleDecline = AnyOf(FormalDecline, CasualDecline)

@validate_intent(judge=EmbeddingJudge())
def respond(msg: str) -> SafeResponse:
    return call_llm(msg)
```

The & and | operators work on Intent classes directly. Under the hood, AllOf concatenates the docstrings with "AND" and uses the minimum threshold. AnyOf uses "OR" and the maximum threshold.
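That composition rule is simple to picture in code. A simplified sketch, assuming each intent carries a `threshold` attribute alongside its docstring:

```python
class AllOf:
    # Combined intent: docstrings joined with AND; the strictest
    # (minimum) threshold wins, since every member must hold.
    def __init__(self, *intents):
        self.__doc__ = " AND ".join(i.__doc__ for i in intents)
        self.threshold = min(i.threshold for i in intents)

class AnyOf:
    # Docstrings joined with OR; the most lenient (maximum) threshold,
    # since satisfying any one member is enough.
    def __init__(self, *intents):
        self.__doc__ = " OR ".join(i.__doc__ for i in intents)
        self.threshold = max(i.threshold for i in intents)
```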

Streaming Support

If you're streaming LLM responses (and you probably should be), Semantix validates once the full stream is assembled:

```python
from semantix import StreamCollector

collector = StreamCollector(ProfessionalDecline, judge=my_judge)
for chunk in collector.wrap(llm_stream()):
    print(chunk, end="")  # stream to user in real time

result = collector.result()  # validate the complete output
```

Your users see the response streaming in. Behind the scenes, Semantix is collecting chunks. The moment the stream ends, it validates. If it fails, you catch the error and handle it — retry, fall back to a template, or flag for human review.
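The collector pattern itself is compact. A sketch with the judge reduced to a plain callable returning a score (the real class's internals aren't shown in this post):

```python
class StreamCollector:
    # Buffers chunks while passing them through, so the UI can stream
    # in real time and validation can run on the assembled text after.
    def __init__(self, intent_doc, judge, threshold=0.7):
        self.intent_doc = intent_doc
        self.judge = judge
        self.threshold = threshold
        self._chunks = []

    def wrap(self, stream):
        for chunk in stream:
            self._chunks.append(chunk)
            yield chunk

    def result(self):
        text = "".join(self._chunks)
        if self.judge(text, self.intent_doc) < self.threshold:
            raise ValueError("assembled output failed semantic validation")
        return text
```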

How It Compares

I built Semantix because the existing tools solve a different problem:

| | Semantix | Guardrails AI | NeMo Guardrails | Instructor |
| --- | --- | --- | --- | --- |
| Validates meaning | ✅ | ❌ Schema-focused | ✅ Dialogue rails | ❌ Schema-focused |
| Zero required deps | ✅ | | | |
| Works with any LLM | ✅ Any function | ⚠️ Wrappers | ⚠️ Config files | ⚠️ Patched clients |
| Pluggable backends | ✅ 3 built-in + custom | | | |
| Lines to validate | ~5 | ~20+ | ~30+ | ~10 |

Semantix isn't a replacement for Pydantic or Guardrails. It's the layer above them. After you know the shape is right, verify the meaning is right too.

Try It

```shell
pip install semantix-ai

# With embedding judge (fast, local)
pip install "semantix-ai[embeddings]"

# With OpenAI judge (accurate)
pip install "semantix-ai[openai]"
```

Check out the repo: github.com/labrat-akhona/semantix-ai

It's MIT licensed, Python 3.10+, and the core has zero dependencies. I'd love feedback — open an issue or drop a comment below.


I'm Akhona, an automation engineer based in South Africa. I build AI-powered tools and integrations. You can find me on GitHub.
