You've got an LLM-powered feature in production. You want to test it. Here are your options:
**String matching.** Works until the model rephrases "I'd be happy to help" as "Sure, let me assist you." Now your test is red and nothing is actually wrong.

**Regex.** You write a pattern. It passes today, breaks tomorrow when the model adds a comma. You write a more permissive pattern. Now it passes on garbage too.

**LLM-as-judge.** Call GPT-4 to evaluate the output. Your test suite now takes 4 minutes, costs money, and fails when OpenAI has a bad day. Your CI pipeline needs an API key in secrets. Your team stops running the tests locally.
None of these are good. What you actually want is to test whether your output means the right thing — without any of that overhead.
## pytest-semantix
```bash
pip install pytest-semantix
```
```python
def test_chatbot_is_polite(assert_semantic):
    response = my_chatbot("handle angry customer")
    assert_semantic(response, "polite and professional")
```
That's a real pytest test. It runs locally on CPU in ~15ms. No API key. No network calls. No tokens burned.
pytest-semantix is a pytest plugin that wraps semantix-ai's semantic assertion engine as a native fixture. Under the hood, it uses a local NLI (Natural Language Inference) model to check whether your LLM output entails the given intent. You describe what you mean in plain English. The model checks entailment. Done.
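To make the mechanics concrete, here is a toy sketch of that flow. The keyword-overlap scorer below is a deliberately crude stand-in for the real NLI model, so the example runs without any model weights; every name here is illustrative, not the plugin's internals.

```python
def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: fraction of hypothesis words found in the premise.
    A real NLI model does far more than word overlap -- this only mimics the shape
    of the API (two texts in, a score out)."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    if not hypothesis_words:
        return 0.0
    return len(premise_words & hypothesis_words) / len(hypothesis_words)

def semantic_assert(output: str, intent: str, threshold: float = 0.7) -> None:
    """Score the output against the intent and raise on a low score,
    mirroring the failure message format shown above."""
    score = toy_entailment_score(output, intent)
    if score < threshold:
        raise AssertionError(
            f"Semantic check failed (score={score:.2f})\n"
            f"  Intent: {intent}\n"
            f"  Output: {output!r}"
        )

# Passes: every intent word appears in the output.
semantic_assert("a polite and professional reply", "polite and professional")
```

The real plugin swaps the toy scorer for a proper entailment model, but the assert-on-low-score shape is the same.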
On failure, you get a score, the intent, and a reason — not just a raw traceback:
```
AssertionError: Semantic check failed (score=0.12)
  Intent: polite and professional
  Output: "You're an idiot for asking that."
  Reason: Text contains aggressive language
```
## Markers
If you want to attach an intent to the test itself rather than to the assertion call, use the `@pytest.mark.semantic` marker:
```python
import pytest

@pytest.mark.semantic("polite and professional")
def test_with_marker(assert_semantic):
    response = my_chatbot("handle angry customer")
    assert_semantic(response)  # intent comes from the marker
```
This is useful when you have a single intent per test and want to see it at a glance in the decorator rather than buried in the function body.
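The resolution rule is a simple precedence: an intent passed to the call wins, otherwise the marker's intent is used, otherwise it's an error. Here is that logic factored into a standalone function (hypothetical internals; in a real fixture the marker intent would be read via pytest's `request.node.get_closest_marker("semantic")` API):

```python
def resolve_intent(call_intent=None, marker_intent=None):
    """Precedence: explicit argument > @pytest.mark.semantic marker > error.

    In a real pytest fixture, marker_intent would come from the marker API:
        marker = request.node.get_closest_marker("semantic")
        marker_intent = marker.args[0] if marker and marker.args else None
    """
    if call_intent is not None:
        return call_intent
    if marker_intent is not None:
        return marker_intent
    raise ValueError(
        "No intent given: pass one to assert_semantic() "
        "or decorate the test with @pytest.mark.semantic(...)"
    )
```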
## Terminal Reports
Pass `--semantic-report` and you get a color-coded summary after the test session:
```
$ pytest --semantic-report
======================== semantic assertion report =========================
Total: 5 | Passed: 4 | Failed: 1
[PASS] tests/test_bot.py::test_polite [12ms]
[PASS] tests/test_bot.py::test_helpful [14ms]
[FAIL] tests/test_bot.py::test_no_pii (score=0.67) Contains email address [11ms]
[PASS] tests/test_bot.py::test_on_topic [13ms]
[PASS] tests/test_bot.py::test_concise [15ms]
============================================================================
```
Green for pass, red for fail. Each line shows the test, the score on failure, the reason, and the wall time. No need to scroll through pytest output hunting for which semantic check broke.
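For a sense of how those lines come together, a formatter along these lines (illustrative only, not the plugin's actual code) reproduces the report format above:

```python
def format_report_line(nodeid, passed, duration_ms, score=None, reason=""):
    """Render one report line; failing lines carry the score and reason."""
    status = "PASS" if passed else "FAIL"
    detail = "" if passed else f" (score={score:.2f}) {reason}"
    return f"[{status}] {nodeid}{detail} [{duration_ms:.0f}ms]"

# Matches the failing line in the sample report:
print(format_report_line("tests/test_bot.py::test_no_pii", False, 11,
                         0.67, "Contains email address"))
```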
## JSON Reports for CI
For CI integration, export results to JSON:
```bash
pytest --semantic-report-json=semantic-results.json
```
The output:
```json
{
  "summary": { "total": 5, "passed": 4, "failed": 1 },
  "results": [
    {
      "nodeid": "tests/test_bot.py::test_polite",
      "intent": "polite and professional",
      "passed": true,
      "score": null,
      "reason": "",
      "duration_ms": 12.3
    },
    {
      "nodeid": "tests/test_bot.py::test_no_pii",
      "intent": "The text does not contain personal information",
      "passed": false,
      "score": 0.67,
      "reason": "Contains email address",
      "duration_ms": 11.1
    }
  ]
}
```
Feed this into your CI dashboard, your Slack alerts, your artifact storage — whatever your pipeline already does with JSON test results.
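As one way to consume it, a small gate script (hypothetical, not shipped with the plugin) could read the report and fail the build when any semantic check failed. The schema and default file name follow the example above:

```python
import json
import sys

def gate(path="semantic-results.json"):
    """Return a shell exit code: 0 if all semantic checks passed, 1 otherwise."""
    with open(path) as f:
        report = json.load(f)
    failed = [r for r in report["results"] if not r["passed"]]
    for r in failed:
        print(f"FAILED {r['nodeid']}: {r['reason']} (score={r['score']})")
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "semantic-results.json"))
```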
## Negation for Compliance Testing
Some of the most important LLM tests aren't about what the output should say. They're about what it shouldn't say.
```python
from semantix import Intent

class MedicalAdvice(Intent):
    """The text provides medical diagnoses or treatment recommendations."""

class PIILeakage(Intent):
    """The text contains personal information like names, emails, or phone numbers."""

def test_no_medical_advice(assert_semantic):
    response = my_chatbot("my head hurts what should I take")
    assert_semantic(response, ~MedicalAdvice)

def test_no_pii_leakage(assert_semantic):
    response = my_chatbot("tell me about user 42")
    assert_semantic(response, ~PIILeakage)
```
The `~` operator negates the intent. The test passes only when the output does not match. This is how you test guardrails: toxicity, off-topic drift, unauthorized disclosures, regulatory compliance. Define the bad thing as an intent, negate it, assert against your output.
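One way such `~` behavior can be wired up in Python is a metaclass `__invert__` that wraps the intent class with a negation flag, which the checker then uses to flip its verdict. A hypothetical sketch, not the semantix source:

```python
class IntentMeta(type):
    def __invert__(cls):
        # ~SomeIntent produces a negated view of the intent class.
        return NegatedIntent(cls)

class Intent(metaclass=IntentMeta):
    """Base class: subclasses describe the (bad) behavior in their docstring."""

class NegatedIntent:
    def __init__(self, intent_cls):
        self.description = (intent_cls.__doc__ or "").strip()
        self.negated = True

def verdict(score, intent, threshold=0.7):
    """Flip the pass/fail decision when the intent is negated."""
    entailed = score >= threshold
    if getattr(intent, "negated", False):
        return not entailed
    return entailed

class PIILeakage(Intent):
    """The text contains personal information like names, emails, or phone numbers."""
```

With this shape, a high entailment score against a negated intent fails the check, which is exactly the guardrail semantics described above.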
## Composing with Existing pytest
pytest-semantix is a normal pytest plugin. It doesn't replace anything in your test suite — it adds a fixture. Everything you already use works.
### Parametrize
```python
import pytest

@pytest.mark.parametrize("prompt,intent", [
    ("handle angry customer", "polite and professional"),
    ("explain a refund policy", "clear and informative"),
    ("say goodbye", "friendly"),
])
def test_chatbot_intents(assert_semantic, prompt, intent):
    response = my_chatbot(prompt)
    assert_semantic(response, intent)
```
### Combine with other fixtures
```python
@pytest.fixture
def chatbot():
    return MyChatbot(model="gpt-4o-mini", temperature=0.2)

def test_with_fixtures(chatbot, assert_semantic):
    response = chatbot.respond("hello")
    assert_semantic(response, "friendly greeting")
```
### Mix semantic and regular assertions
```python
import json

def test_structured_response(assert_semantic):
    result = my_llm("generate a JSON summary")
    data = json.loads(result)   # regular assertion: valid JSON
    assert "summary" in data    # regular assertion: has the key
    assert_semantic(data["summary"], "concise and accurate")  # semantic: means the right thing
```
## Global threshold
If your team wants a stricter baseline across all tests:
```bash
pytest --semantic-threshold=0.85
```
Individual tests can still override:
```python
def test_strict(assert_semantic):
    response = my_chatbot("summarize the refund policy")
    assert_semantic(response, "accurate", threshold=0.95)
```
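The precedence, sketched as a plain function (the 0.7 fallback is an assumption for illustration; check the plugin's docs for the real default):

```python
def effective_threshold(call_threshold=None, cli_threshold=None, default=0.7):
    """Per-call threshold > --semantic-threshold CLI value > plugin default."""
    if call_threshold is not None:
        return call_threshold
    if cli_threshold is not None:
        return cli_threshold
    return default
```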
## What's Actually Happening
When you call `assert_semantic(output, intent)`, the plugin:

- Resolves the intent (from the argument, the marker, or raises an error)
- Passes the output and intent to a local NLI model via `semantix-ai`
- The model returns a score and verdict
- The plugin records the result (nodeid, intent, score, duration) for reporting
- On failure, it raises `AssertionError` with the score and reason
No network call. No subprocess. No container. The NLI model loads once per session and runs inference in-process. That's why it's ~15ms per assertion.
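That load-once behavior is the standard cache-the-singleton pattern. A minimal sketch with `functools.lru_cache`, with a stand-in object in place of real model weights so it runs anywhere:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """The expensive load happens on the first call only; later calls reuse it."""
    # Stand-in for loading real NLI model weights from disk.
    return object()

# Every assertion reuses the same in-process model instance.
first = get_model()
second = get_model()
```

After the first call pays the load cost, each subsequent assertion is just an in-process forward pass, which is where the per-assertion milliseconds come from.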
## Install
```bash
pip install pytest-semantix
```
Requires Python 3.10+ and pytest 7+. Pulls in `semantix-ai` automatically.
Then just use the `assert_semantic` fixture in your tests. No configuration, no `conftest.py` boilerplate, no setup step.
- PyPI: pypi.org/project/pytest-semantix
- GitHub: github.com/labrat-akhona/pytest-semantix
- semantix-ai (the engine): pypi.org/project/semantix-ai