Mikhail Golikov

Keep the LLM out of your chatbot tests

It is a warm evening. You are home from work, you crack the window open, and the room fills with that low gold light you get for about twenty minutes a day. You have changed into the soft clothes. There is a cold drink sweating in your hand. You drop onto the sofa, pick up the remote, and hit play on the show you have been saving all week.

It does not play. Your account "needs attention." And the only thing now standing between you and that perfect evening is a long, heartfelt conversation with a support bot.

So that bot had better not lose the plot on turn three. Somebody has to build it, and somebody has to test it, and the moment you test a multi-turn conversation you hit a fork in the road.

TL;DR

  • An LLM that grades your chatbot's answers is a second non-deterministic system. Now two of them have to agree before CI goes green, and they will not always agree.
  • Most of what you actually want to assert about a conversation is boring and checkable: did the bot ask for the city after it got the name, did it keep the slot, did it return the right reply on turn three.
  • pytest-conversational gives you a small Conversation object, a bot adapter you write, and pytest fixtures that keep turn order and per-turn state. No model on the test side.
  • It is alpha. Single-turn and multi-turn assertions work today; a YAML scenario format and async adapters are on the roadmap, not in the box yet.

Two ways people test chatbots, and why both hurt

When you sit down to test a conversational interface, you usually land in one of two camps.

The first camp is a pile of requests.post calls. You fire a message at the bot, read the JSON, and write an assertion against it. This works for one turn. By turn four you are hand-rolling the history payload, threading state through local variables, and copy-pasting the same setup into every test. When something breaks, the failure message tells you a dict did not equal another dict, and you go spelunking to find out which turn went wrong.

The second camp is a full conversational-testing framework that owns your whole setup. It handles multi-turn flows, but it pins you to one platform and one way of describing a bot. The day your bot moves behind a different transport, or you want to test a plain Python function instead of an HTTP service, you are fighting the framework.

There is a third option that keeps coming up, and it is the one I want to push back on: have a language model read the bot's reply and decide whether it was good. It looks appealing because it handles paraphrase. The bot said "sure thing, what city?" instead of "got it, what city?" and the grader shrugs and passes it.

The problem is what you just built. Your test suite now contains two non-deterministic systems, and CI only goes green when both of them agree. The model under test drifts when you change a prompt. The grader model drifts when the provider ships a new version. A test that passed on Monday fails on Thursday and nothing in your code changed. You also pay latency and tokens on every run, and you have given up the one thing a test should give you: a clear, repeatable yes or no.
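To put a rough number on that, here is a back-of-the-envelope sketch. The figures are invented for illustration: it assumes 200 graded assertions in the suite, a grader that wrongly fails a correct reply 1% of the time, and failures that are independent of each other.

```python
# Illustrative numbers only: the 1% flake rate and the 200 checks are assumptions.
def green_probability(per_check_pass_rate: float, checks: int) -> float:
    """Chance that every check passes, assuming checks fail independently."""
    return per_check_pass_rate ** checks


suite_green = green_probability(0.99, 200)
print(f"{suite_green:.3f}")  # 0.134
```

Under those assumptions a perfectly healthy bot gets a green build roughly one run in seven. Real graders will not fail independently, so treat this as intuition, not a measurement.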

A colleague who leads engineering on a chatbot platform put it well when we were going back and forth on this online. Keep the test side deterministic. Let the bot be as clever as it likes; the thing checking the bot should be dumb and predictable.

The pivot: make the test side boring on purpose

Here is the shift. Most of what you care about in a conversation is not "was the answer phrased nicely." It is structural:

  • After the user gives a name, does the bot ask for the next slot?
  • Does the bot remember the name three turns later?
  • On an out-of-scope message, does it fall back instead of inventing an answer?
  • Does the reply contain the order number, match a known pattern, or land in a known set of options?

All of that is checkable without a model. You need something that keeps turn order, holds per-conversation state, and prints a readable transcript when an assertion fails. That is the whole job pytest-conversational is trying to do, and nothing more.
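As a sketch of how plain those checks can be, here are three of them written with nothing but the standard library. The fallback text, the ORD-123456 order format, and the function names are all hypothetical; swap in whatever your bot actually says.

```python
import re
from typing import Optional

# Hypothetical values for illustration: use your bot's real fallback copy and ID format.
FALLBACK_REPLIES = {"sorry, i can't help with that."}
ORDER_NUMBER = re.compile(r"\bORD-\d{6}\b")


def is_fallback(reply: str) -> bool:
    """True if the reply is one of the known fallback messages."""
    return reply.strip().lower() in FALLBACK_REPLIES


def order_number_in(reply: str) -> Optional[str]:
    """Return the order number mentioned in the reply, or None."""
    match = ORDER_NUMBER.search(reply)
    return match.group(0) if match else None


def asks_for(reply: str, slot: str) -> bool:
    """True if the reply is a question that names the given slot."""
    return reply.rstrip().endswith("?") and slot.lower() in reply.lower()
```

Each function gives the same answer for the same reply on every run, which is the property the grader model cannot offer.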

You bring the bot. The plugin wires it into pytest.

What it looks like

Install it:

pip install pytest-conversational

It needs Python 3.10 or later.

A bot is any callable that takes the user text and the conversation, and returns a string. The simplest test:

def my_bot(text, convo):
    if "hello" in text.lower():
        return "hi"
    return "sorry, did not get that"


def test_greeting(conversation_factory):
    convo = conversation_factory(bot=my_bot)
    convo.say("hello there")
    assert convo.last.bot == "hi"

conversation_factory is a fixture. You hand it a bot, you get a fresh Conversation back, isolated from every other test.

Multi-turn state, without globals

This is where the single-turn camp suffers. A bot adapter can read convo.state and convo.turns, so slot filling stays inside the conversation instead of leaking into your test body:

def slot_filling_bot(text, convo):
    slots = convo.state.setdefault("slots", {})
    if "name" not in slots:
        slots["name"] = text
        return "got it, what city?"
    if "city" not in slots:
        slots["city"] = text
        return f"hello {slots['name']} from {slots['city']}"
    return "done"


def test_two_slot_flow(conversation_factory):
    convo = conversation_factory(bot=slot_filling_bot)
    convo.say("Mikhail")
    convo.say("Hove")
    assert convo.state["slots"] == {"name": "Mikhail", "city": "Hove"}
    assert convo.last.bot == "hello Mikhail from Hove"

The test reads top to bottom like the conversation it describes. When it fails, you call convo.transcript() and see every turn, not a diff of two dictionaries.

Figure: multi-turn failure transcript. A multi-turn test catching a bot that drops the name on the final turn. The failure message shows exactly what the bot said versus what the test expected.
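If you want a mental model of what sits behind say, last, state, and transcript, here is a toy stand-in. It is not the plugin's source. ToyConversation, Turn, and forgetful_bot are hypothetical names, the transcript layout is my own, and only the attribute names mirror the ones used in this post.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    user: str
    bot: str


@dataclass
class ToyConversation:
    """Toy model of a conversation object; NOT the plugin's implementation."""

    bot: Callable[[str, "ToyConversation"], str]
    turns: list = field(default_factory=list)
    state: dict = field(default_factory=dict)

    def say(self, text: str) -> str:
        # The bot sees the conversation so far; then the new turn is recorded.
        reply = self.bot(text, self)
        self.turns.append(Turn(user=text, bot=reply))
        return reply

    @property
    def last(self) -> Turn:
        return self.turns[-1]

    def transcript(self) -> str:
        lines = []
        for number, turn in enumerate(self.turns, start=1):
            lines.append(f"[{number}] user: {turn.user}")
            lines.append(f"[{number}] bot:  {turn.bot}")
        return "\n".join(lines)


def forgetful_bot(text, convo):
    # Deliberately buggy: it never stores the name, so the greeting drops it.
    if not convo.turns:
        return "got it, what city?"
    return f"hello from {text}"


convo = ToyConversation(bot=forgetful_bot)
convo.say("Mikhail")
convo.say("Hove")
print(convo.transcript())
```

Printing every turn in order is what makes the missing name obvious at a glance: turn two says "hello from Hove" and the name given in turn one is nowhere in it.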

Matchers for the fuzzy bits

Exact string equality is fine until it is not. Some replies legitimately vary, and you want to assert a shape rather than a literal. The expect module covers the common cases and puts the actual reply in the failure message, so pytest output shows what the bot said versus what you wanted:

from pytest_conversational import expect

def greeter_bot(text, convo):
    # Fixed reply, so every matcher below has something to match.
    return "hello there"


def test_replies(conversation_factory):
    convo = conversation_factory(bot=greeter_bot)
    convo.say("hi")

    expect.contains(convo.last.bot, "hello")
    expect.regex(convo.last.bot, r"^hello\s+\w+")
    expect.one_of(convo.last.bot, ["hello there", "hi there", "hey"])

contains is a case-insensitive substring check by default. regex runs re.search and hands back the match object so you can inspect captured groups. one_of checks the reply against a list of alternatives, with an exact mode and a substring mode. None of this needs a model to decide anything. The rules are yours and they do not drift.
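To show how little machinery that takes, here is a standard-library sketch of matchers with the behaviour described above. These are stand-ins, not the plugin's code: the keyword names case_sensitive and substring are my assumptions, and the real expect functions may spell their options differently.

```python
import re
from typing import Iterable


def contains(reply: str, needle: str, case_sensitive: bool = False) -> None:
    """Substring check, case-insensitive unless asked otherwise."""
    haystack, target = reply, needle
    if not case_sensitive:
        haystack, target = reply.lower(), needle.lower()
    assert target in haystack, f"expected {needle!r} in bot reply, got {reply!r}"


def regex(reply: str, pattern: str) -> re.Match:
    """Run re.search and return the match so callers can read captured groups."""
    match = re.search(pattern, reply)
    assert match is not None, f"expected bot reply to match {pattern!r}, got {reply!r}"
    return match


def one_of(reply: str, options: Iterable[str], substring: bool = False) -> None:
    """Exact membership by default; substring mode accepts any option inside the reply."""
    options = list(options)
    if substring:
        ok = any(option in reply for option in options)
    else:
        ok = reply in options
    assert ok, f"expected one of {options!r}, got {reply!r}"


match = regex("your order ORD-123456 has shipped", r"ORD-(\d+)")
print(match.group(1))  # 123456
```

Every failure message carries the actual reply, so the pytest output answers "what did the bot say?" without a rerun.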

Testing a bot that lives behind HTTP

If your bot is a deployed service, you do not have to write a transport by hand. There is a bundled webhook adapter:

pip install "pytest-conversational[http]"

Then point the adapter at your endpoint:

from pytest_conversational import Conversation
from pytest_conversational.adapters import http_webhook


def test_remote_bot():
    bot = http_webhook("https://my-bot.example.com/webhook", timeout=3.0)
    convo = Conversation(bot=bot)
    convo.say("hello")
    assert "hi" in convo.last.bot.lower()

The default contract is a POST of {"user": text, "history": [[u, b], ...]} and a 200 with {"reply": "..."}. If your endpoint speaks a different shape, pass request_builder and response_parser callbacks.
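Here is the default contract written out as two plain functions, next to a variant for an endpoint that speaks a different shape. This post does not spell out which arguments the plugin passes to request_builder and response_parser, so the signatures below are assumptions; check the README before wiring them in. The "message", "context", and "output" field names belong to a made-up endpoint.

```python
import json


def default_request(text: str, history: list) -> dict:
    """Body of the default POST: the new user text plus prior [user, bot] pairs."""
    return {"user": text, "history": [[user, bot] for user, bot in history]}


def default_reply(body: dict) -> str:
    """Pull the bot's text out of the default 200 response."""
    return body["reply"]


# Hypothetical endpoint with its own field names. Assumed callback signatures.
def custom_request(text: str, history: list) -> dict:
    context = [{"user": user, "bot": bot} for user, bot in history]
    return {"message": text, "context": context}


def custom_reply(body: dict) -> str:
    return body["output"]["text"]


history = [("hello", "hi")]
print(json.dumps(default_request("Mikhail", history)))
# {"user": "Mikhail", "history": [["hello", "hi"]]}
```

Keeping both directions as pure functions means you can unit-test the mapping without a network call.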

One thing worth slowing down on. The webhook URL goes straight to httpx as written. If a test ever feeds it a URL that came from fixture data, a config file, or anything you did not type yourself, the adapter will dutifully hit it. That includes 127.0.0.1, the cloud metadata address 169.254.169.254, and anything on 10.x.x.x inside your network. A test reaching the metadata endpoint is not a hypothetical; it is how a fair number of SSRF incidents start. Pin the URL to a literal in the test, or run it through your own allowlist before it gets near the adapter.
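A minimal allowlist check might look like the sketch below. The function name and the single allowed host are hypothetical, and this is one reasonable shape for the check, not something the plugin ships.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allowlist: the only host tests are permitted to call.
ALLOWED_HOSTS = frozenset({"my-bot.example.com"})


def is_internal_ip(host: str) -> bool:
    """True for literal loopback, link-local, or private addresses."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so it is a hostname
    return address.is_loopback or address.is_link_local or address.is_private


def checked_webhook_url(url: str, allowed_hosts=ALLOWED_HOSTS) -> str:
    """Return the URL unchanged if it is safe to call, else raise ValueError."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https":
        raise ValueError(f"refusing non-https webhook URL: {url!r}")
    if is_internal_ip(host):
        raise ValueError(f"refusing internal address: {url!r}")
    if host not in allowed_hosts:
        raise ValueError(f"host {host!r} is not on the allowlist")
    return url
```

In a test you would wrap the untrusted value before it reaches the adapter, for example http_webhook(checked_webhook_url(url_from_config), timeout=3.0). This only inspects the URL as written; it does not cover redirects or an allowed hostname that resolves to an internal address.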

Where this fits, and where it does not

Determinism on the test side is the right call for structure, state, routing, and fallbacks. That is the bulk of what breaks in a conversational product, and it is exactly the part a model-as-judge handles worst.

It is not the right tool for judging open-ended generation quality. If your actual question is "is this a good, helpful, on-brand paragraph," a rule cannot answer that, and you should not pretend it can. Use human review or an evaluation harness for that, and keep it out of the gate that decides whether your build is broken. The two jobs are different. Conflating them is how you end up with a flaky CI that nobody trusts.

Honest status: the plugin is alpha, with a v1.0 release targeted for June 2026. Single-turn and multi-turn assertions, the matchers, and the HTTP adapter all work now. A scenario format you can load from YAML or plain-text fixtures is on the roadmap, and so is async adapter support for coroutine-based bots. Those are not shipped, and I would rather say so than have you pip install and go looking for them.

Try it

pip install pytest-conversational

Point it at a bot you already have, write one multi-turn test, and see whether the failure transcript tells you more than your current setup does. The code, issues, and roadmap are on GitHub: https://github.com/golikovichev/pytest-conversational

If you do try it, the assertions you wish existed are the most useful issues to open. That is what shapes the scenario format before v1.0.

