Building software with large language models (LLMs) introduces a testing problem that traditional approaches cannot solve. When a function can return different yet equally valid outputs on each invocation, how do you know if it's working correctly?
The standard answer in traditional software is simple: write tests that assert exact outputs. But LLMs are probabilistic systems. Ask them the same question twice and you might get two different phrasings, both correct. Ask them an ambiguous question and you might get a plausible-sounding answer that's completely fabricated (hallucination).
This creates tension. On one hand, LLMs enable applications that would be impossible to build with deterministic code. On the other hand, they behave in ways that make them difficult to test, debug, and deploy with confidence.
This article explores a practical solution: rules-based evaluations integrated into continuous integration pipelines. We will demonstrate this through an AI-powered quiz generator, but the patterns apply to any LLM system where outputs must stay grounded in a specific context.
Why Testing LLM Applications Is Fundamentally Different
Traditional software testing relies on predictability: input X, expect output Y, and assert equality. The test either passes or fails.
LLM applications operate differently. The same prompt can produce multiple valid responses. A request for a quiz about science might correctly return questions about telescopes, physics, or Leonardo da Vinci's inventions depending on what the model selects from the available context.
This probabilistic behavior isn't a bug. But it introduces risks that don't exist in deterministic systems:
- Hallucination: The model generates information that sounds plausible but isn't grounded in the provided context. Ask for a quiz about Rome when your dataset only covers Paris, and the model might confidently generate Roman history questions using its pre-training knowledge.
- Bias and toxicity: Without constraints, LLMs can produce outputs that reflect societal biases present in their training data or generate harmful content.
- Context drift: Even when the model initially behaves correctly, changes to prompts, data, or model versions can introduce regressions that traditional tests won't catch.
These failure modes mean you can't simply write `assert output == expected_output`. You need evaluation strategies that account for variability while still enforcing correctness.
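For a sense of what that looks like in practice, compare an exact-match assertion with a property-based one. This is a minimal sketch; `generate_quiz` is a hypothetical stand-in for whatever function wraps your LLM call:

```python
# Brittle: fails whenever the model rephrases an otherwise correct answer.
# assert generate_quiz("science") == "Question 1: What did Marie Curie discover? ..."

# More robust: assert properties that every valid output must satisfy.
def test_science_quiz_properties():
    output = generate_quiz("science")   # hypothetical LLM call
    assert "Question" in output         # structural expectation
    # content stays within the allowed scope
    assert any(term in output.lower() for term in ["curie", "physics", "telescope"])
```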
What Are Evals?
"Evals" are systematic methods for assessing model behavior. While academic benchmarks like MMLU measure general intelligence, they tell you nothing about whether your specific application will hallucinate. A model that scores well on a reading comprehension benchmark might still hallucinate answers when integrated into your customer support bot.
Application-specific evals fill this gap. Instead of testing general capabilities, you test the specific behaviors your application requires:
- Does the model only use facts from the provided knowledge base?
- Does it refuse requests that fall outside its intended scope?
- Does it follow the output format your application expects?
These questions can only be answered by tests you write yourself, tailored to your application's constraints and requirements.
When to Use Automated Evals
Once you understand what evals are, the next question becomes when and how deeply to apply them. Automated evaluations evolve alongside your application, shifting from rapid feedback loops to deep quality assessments (see the sketch after this list):
- Development: Rules-based evals provide near-instant feedback on every commit, catching obvious breaks and formatting issues for pennies per run.
- Staging: Model-graded evals use "Judge LLMs" to assess nuance and helpfulness on release branches, trading higher costs for deeper insight.
- Production: Continuous evals serve as regression detectors, ensuring that prompt tweaks or model updates don't silently degrade performance over time.
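One way to encode this split in a test suite is with pytest markers, so that cheap rules-based checks run on every commit while model-graded checks run only on release branches. A sketch, assuming the markers `rules` and `graded` are registered in your pytest configuration (the names are illustrative, not from the original project):

```python
import pytest

@pytest.mark.rules
def test_quiz_mentions_science_terms():
    ...  # fast keyword/format assertions, run on every commit

@pytest.mark.graded
def test_quiz_helpfulness_judged_by_llm():
    ...  # judge-LLM call, run only on release branches

# CI for commits:   pytest -m rules
# CI for releases:  pytest -m "rules or graded"
```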
Core Evaluation Dimensions for LLM Apps
Regardless of your specific application, most LLM evaluations focus on four dimensions:
- Context adherence: Does the output align with the provided context?
- Context relevance: Is the retrieved context relevant to the user's query?
- Correctness: Does the output align with ground truth or expected behavior?
- Bias and toxicity: Does the output contain harmful or offensive content?
Each dimension requires different evaluation techniques. Context adherence might be tested with keyword matching. Bias detection might require a specialized classifier. The implementation depends on your specific constraints.
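To illustrate how the technique changes with the dimension, here is a deliberately crude bias/toxicity gate built on a keyword blocklist. It's a sketch only; the blocklist, terms, and function name are placeholders, and a production system would use a dedicated classifier or moderation endpoint instead:

```python
# Placeholder blocklist standing in for a real toxicity classifier.
BLOCKED_TERMS = {"blocked_term_1", "blocked_term_2"}

def eval_no_blocked_terms(answer: str) -> None:
    lowered = answer.lower()
    flagged = [term for term in BLOCKED_TERMS if term in lowered]
    assert not flagged, f"Output contains blocked terms: {flagged}"
```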
Case Study: Building a Quiz Generator
To make these concepts concrete, consider an AI-powered quiz generator. The application accepts a category request (science, geography, or art) and returns three quiz questions derived strictly from a predefined dataset.
The critical constraint
The model must never generate questions about subjects outside this dataset. If a user requests a quiz about Rome and the dataset contains no Roman history, the correct behavior is refusal, not hallucinated content pulled from pre-training. This constraint makes the application testable: we know exactly which inputs should succeed and which should fail, and we can write assertions that verify the model respects these boundaries.
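The dataset itself can be as simple as a block of text that gets embedded in the prompt. A hypothetical quiz bank along these lines (the real project's data may differ):

```python
# Hypothetical quiz bank: each subject lists the only facts the model may use.
quiz_bank = """
1. Subject: Leonardo da Vinci
   Categories: Art, Science
   Facts: painted the Mona Lisa; sketched flying machines centuries before powered flight.

2. Subject: Marie Curie
   Categories: Science
   Facts: pioneered research on radioactivity; won Nobel Prizes in two different sciences.

3. Subject: Paris
   Categories: Geography, Art
   Facts: home to the Louvre; the Eiffel Tower was built for the 1889 World's Fair.
"""
```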
Implementation: Prompt Design
Most LLM behavior is controlled by the prompt. A well-designed prompt establishes rules, provides context, and defines failure modes:
```python
system_message = f"""
Follow these steps to generate a customized quiz for the user.
Step 1: Identify the category from: Geography, Science, or Art
Step 2: Select up to two subjects from the quiz bank that match the category
Step 3: Generate 3 questions using only the facts provided
Additional rules:
- Only use explicit matches for the category
- If no information is available, respond: "I'm sorry I do not have information about that"
"""
```
This structure breaks generation into discrete steps, lists the valid categories explicitly, and defines a refusal behavior that prevents hallucination.
The LLM Pipeline
LangChain provides composable primitives for building the application:
```python
# Imports assume the langchain-core / langchain-openai package split.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def assistant_chain(
        system_message=system_message,
        llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
        output_parser=StrOutputParser()):
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("human", "{question}"),
    ])
    return chat_prompt | llm | output_parser
```
This pipeline serves double duty: it powers the production application and provides a stable interface for automated evaluations.
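For example, invoking the chain directly during development (assuming OPENAI_API_KEY is set in your environment) looks like this:

```python
# Ad-hoc invocation of the same chain the tests exercise.
assistant = assistant_chain()
answer = assistant.invoke({"question": "Generate a quiz about science."})
print(answer)  # three questions drawn from the quiz bank, or the refusal message
```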
Writing Rules-Based Evaluations
Rules-based evals check for specific patterns in outputs. Here are three essential patterns:
1. Keyword Matching: Content Validation
Verifies the output contains expected domain terms:
```python
def eval_expected_words(system_message, question, expected_words):
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    assert any(word in answer.lower() for word in expected_words), \
        f"Expected words {expected_words} not found in output"

# Test case
def test_science_quiz():
    question = "Generate a quiz about science."
    expected_subjects = ["davinci", "telescope", "physics", "curie"]
    eval_expected_words(system_message, question, expected_subjects)
```
This assertion is deliberately loose. It passes if *any* expected word appears, accommodating the model's freedom to select different subjects from the valid set. You're not testing for exact phrasing; you're verifying that the model chose appropriate content.
2. Refusal Testing: Hallucination Prevention
Verifies the model declines out-of-scope requests:
```python
def evaluate_refusal(system_message, question, decline_response):
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    assert decline_response.lower() in answer.lower(), \
        f"Expected refusal with '{decline_response}', got: {answer}"

# Test case
def test_refusal_rome():
    question = "Help me create a quiz about Rome"
    decline_response = "I'm sorry"
    evaluate_refusal(system_message, question, decline_response)
```
When asked about Rome, a topic outside the dataset, the model should respond with the specified refusal phrase. If it generates a quiz instead, the test fails, signaling that the prompt's guardrails aren't working.
3. Format Validation: Structural Correctness
Verifies the output follows expected structure using regex patterns:
```python
import re

def eval_quiz_format(system_message, question, expected_questions=3):
    """Verify the quiz follows the delimiter-based format."""
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    # Pattern to match "Question 1:####", "Question 2:####", etc.
    pattern = r'Question\s+\d+:####'
    matches = re.findall(pattern, answer)
    assert len(matches) == expected_questions, \
        f"Expected {expected_questions} questions in format 'Question N:####', " \
        f"found {len(matches)}"

# Test case
def test_science_quiz_format():
    question = "Generate a quiz about science."
    eval_quiz_format(system_message, question, expected_questions=3)
```
LLMs can "drift" and start returning different formats. If your quiz generator is supposed to use plain-text delimiters and suddenly returns markdown or JSON, downstream parsing breaks. The next check enforces that the output is plain text rather than markdown-wrapped or JSON-formatted:
```python
def eval_output_type(system_message, question, should_be_plain_text=True):
    """Verify output format matches expectations"""
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    if should_be_plain_text:
        # Should NOT be wrapped in markdown code blocks
        assert not answer.strip().startswith('```'), \
            "Expected plain text, got markdown-wrapped output"
        # Should NOT start with JSON/dict characters
        assert not answer.strip().startswith('{'), \
            "Expected plain text, got JSON-like output"

# Test case
def test_output_is_plain_text():
    question = "Generate a quiz about art."
    eval_output_type(system_message, question, should_be_plain_text=True)
```
This catches a common LLM failure mode: returning content wrapped in markdown code fences or unexpectedly switching to JSON format.
Format tests catch issues that keyword tests miss:
- Format drift: Model updates or prompt changes cause output structure to shift
- Integration failures: If your frontend expects JSON and the LLM returns markdown, your app breaks
- Faster debugging: When a test fails, you immediately know whether it's a content issue or a format issue.
Integrating Evals into Continuous Integration
Manual testing doesn't scale. As teams grow and development accelerates, the discipline required to run tests before every commit erodes. This is even more critical for LLM applications because:
- Prompt changes are easy to make but hard to validate without testing
- Model behavior can shift with version updates
- Context changes affect outputs in non-obvious ways
By integrating evals into CI, every change is automatically validated before it reaches production.
The GitHub Actions Workflow
Every push and pull request triggers automated evaluation:
```yaml
name: Run Evaluations

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest test.py -v
```
The OpenAI API key is stored as a repository secret, keeping credentials secure while allowing tests to make real API calls. When you push code, GitHub Actions automatically runs the evaluation suite, and any test failure blocks the merge, preventing broken behavior from reaching production.
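The requirements file installed by the workflow would pin at least the LLM and test dependencies. A plausible minimal version (an assumption; check the repository for the exact pins):

```text
# requirements.txt (illustrative)
langchain
langchain-openai
openai
pytest
```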
Limitations of Rules-Based Evaluations
Rules-based evaluations have clear boundaries. They can verify patterns and keywords but cannot assess:
- Factual accuracy beyond keyword presence: The test verifies "telescope" appears but not whether the telescope facts are correct
- Output quality: A response might contain correct keywords but be poorly phrased or confusing
- Subtle hallucinations: The model might include expected keywords while inserting fabricated details between them
The Path Forward
For production systems, extend this foundation with:
- Model-graded evaluations: Use a second LLM to assess subjective qualities (helpfulness, clarity, tone); see the sketch below
- Bias and toxicity detection: Integrate specialized classifiers
- A/B testing: Compare prompt variations with data-driven metrics
- Production observability: Monitor outputs continuously; some failure modes only emerge at scale
The strategy: Run rules-based evals on every commit, model-graded evals on release branches, and continuous monitoring in production.
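As a taste of the model-graded step, a judge LLM can reuse the same LangChain primitives: a second chain grades the first chain's output against a rubric. A minimal sketch; the rubric wording, helper name, and judge model are illustrative choices, not taken from the original project:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def eval_with_judge(quiz_output: str) -> None:
    """Ask a second LLM to grade a generated quiz against a simple rubric."""
    judge_prompt = ChatPromptTemplate.from_messages([
        ("system",
         "You are grading a quiz. Respond with exactly 'Y' if it contains three "
         "clearly phrased questions grounded in the provided facts, otherwise 'N'."),
        ("human", "{quiz}"),
    ])
    judge = judge_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()
    verdict = judge.invoke({"quiz": quiz_output})
    assert verdict.strip().upper().startswith("Y"), f"Judge rejected the quiz: {verdict}"
```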
Conclusion
LLM applications don't have to be opaque or untestable. By combining explicit prompt constraints, rules-based evaluations, and automated execution, you transform probabilistic models into reliable system components. The workflow is straightforward and repeatable:
- Define constraints in prompts
- Write tests that verify those constraints (keyword, format, refusal)
- Run evaluations automatically on every change
- Use rules-based evals for speed, model-graded evals for depth
The result is software you can iterate on quickly, deploy confidently, and maintain sustainably, even when it's powered by probabilistic models. See the full implementation here: https://github.com/iamkalio/llmops-eval-ci
