Tarek Oraby

End-to-end testing of Gen AI Apps

The shift to building applications on top of Large Language Models (LLMs) fundamentally breaks traditional software testing. Methodologies designed for predictable, deterministic systems are ill-equipped to handle the non-deterministic nature of LLM-based apps, creating a critical gap in quality assurance. This blog post dives into the specific challenges of end-to-end testing for these apps and introduces an open-source framework, SigmaEval, designed to provide a statistically rigorous solution.

The Challenges of Testing Gen AI Apps

Testing Gen AI applications is fundamentally different from testing traditional software. The core difficulty stems from two interconnected problems that create a cascade of complexity.

  • The Infinite Input Space: The range of possible user inputs is effectively infinite. You cannot write enough static test cases to cover every scenario or every subtle variation in how a user might phrase a question. This problem is compounded in conversational AI, where multi-turn chats branch out fast. Even small wording changes in an early turn can dramatically shift how later turns unfold, making the state space of a conversation explode.
  • Non-Deterministic & "Fuzzy" Outputs: Unlike a traditional function where 2 + 2 is always 4, a Gen AI model can produce a wide variety of responses to the same prompt. Outputs aren't consistent, so a single clean run doesn’t mean much. On top of this, quality itself is fuzzy. There’s rarely a single right answer, and you’re often juggling competing priorities like helpfulness, safety, and tone all at once. Even human judges don’t always agree on what constitutes a "good" response.

These two fundamental challenges are amplified by the complex systems we build around the models. The integration of tools, memory, and RAG adds more moving parts, each a potential point of failure. External APIs or vector indexes drift over time, and even the underlying models can change quietly in the background. When you add operational concerns like costs, rate limits, and flaky integrations, it’s easy to think you’ve tested enough when you really haven’t.

This reality means we must move beyond simple pass/fail checks and adopt a more robust, statistical approach to evaluation. The process becomes less like traditional software testing and more like a clinical trial. The goal isn't to guarantee a specific outcome for every individual interaction, but to ensure the application is effective for a significant portion of users, within a defined risk tolerance.
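To make that concrete, consider how little a handful of clean runs actually proves. The sketch below uses generic statistics (a one-sided Clopper-Pearson bound), not SigmaEval's internal method, to compute a lower confidence bound on the true success rate after observing only successes:

from scipy.stats import beta

def success_rate_lower_bound(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson lower bound on the true success rate."""
    if successes == 0:
        return 0.0
    return beta.ppf(1 - confidence, successes, trials - successes + 1)

# Even a perfect record is weak evidence at small sample sizes:
print(success_rate_lower_bound(5, 5))    # ~0.55: "at least 55% of interactions succeed"
print(success_rate_lower_bound(20, 20))  # ~0.86
print(success_rate_lower_bound(50, 50))  # ~0.94

Five flawless runs only justify claiming, with 95% confidence, that a bit more than half of real interactions will succeed; tightening that claim takes more samples, and that accounting is exactly what a statistical evaluation framework automates.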

Introducing SigmaEval: A Statistical Approach to Gen AI Evaluation

SigmaEval is an open-source Python framework designed specifically for the statistical evaluation of Gen AI apps, agents, and bots. It helps you move from "it seems to work" to making statistically rigorous statements about your AI's quality.

With SigmaEval, you can set and enforce objective quality bars, making statements like:

"We are confident that at least 90% of user issues coming into our customer support chatbot will be resolved with a quality score of 8/10 or higher."

SigmaEval addresses the challenges of Gen AI testing through a novel approach that combines AI-driven user simulation, an AI judge, and inferential statistics.

How SigmaEval Works

SigmaEval uses two AI agents to automate the evaluation process: an AI User Simulator and an AI Judge.

  1. Define "Good": You start by defining a test scenario in plain language, describing the user's goal and the successful outcome you expect. This becomes your objective quality bar.
  2. Simulate and Collect Data: The AI User Simulator acts as a test user, interacting with your application based on your scenario. It runs these interactions multiple times to collect a robust dataset.
  3. Judge and Analyze: The AI Judge scores each interaction against your definition of success. SigmaEval then applies statistical methods to these scores to determine if your quality bar has been met with a specified level of confidence.

This process transforms subjective assessments into quantitative, data-driven conclusions.

Getting Started with SigmaEval

Here’s a simple example of how to use SigmaEval to test a customer support chatbot's refund process:

from sigmaeval import SigmaEval, ScenarioTest, assertions
import asyncio
from typing import List, Dict, Any

# 1. Define a realistic, multi-turn ScenarioTest
scenario = (
    ScenarioTest("Customer Support Refund Process")
    .given(
        "A user is interacting with a customer support chatbot for an "
        "e-commerce store called 'GadgetWorld'."
    )
    .when("The user wants to get a refund for a faulty product they received.")
    .expect_behavior(
        "The chatbot should first empathize with the user's situation, then ask for "
        "the order number, and finally explain the next steps for the refund process. "
        "It should not resolve the issue immediately but guide the user effectively.",
        # We want to be confident that at least 90% of interactions will score an 8/10 or higher.
        criteria=assertions.scores.proportion_gte(min_score=8, proportion=0.9)
    )
    .max_turns(10) # Allow for a longer, more complex back-and-forth conversation
)

# 2. Implement the app_handler to connect SigmaEval to your application
async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    # This handler is the bridge between SigmaEval and your application.
    # It's responsible for taking the conversation history from the AI User Simulator
    # and getting a response from your app, enabling multi-turn conversation testing.

    # `messages` contains the full conversation history, e.g.:
    # [
    #   {'role': 'user', 'content': 'My new headphones are broken.'},
    #   {'role': 'assistant', 'content': 'I am sorry to hear that. What is your order number?'},
    #   {'role': 'user', 'content': "It's GADGET-12345."}
    # ]

    # In a real test, you would make an API call to your application's endpoint here.
    # For example:
    # response = await http_client.post(
    #     "https://your-chatbot-api.com/chat",
    #     json={"messages": messages, "session_id": state}
    # )
    # return response.json()["content"]

    # This part must be implemented to return the actual response from your app.
    raise NotImplementedError("Connect this handler to your application.")


# 3. Initialize SigmaEval and run the evaluation
async def main():
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=20,  # The number of times to run the test
        significance_level=0.05  # Corresponds to a 95% confidence level
    )
    result = await sigma_eval.evaluate(scenario, app_handler)

    # Assert that the test passed for integration with testing frameworks like pytest
    assert result.passed

if __name__ == "__main__":
    # To run this, first connect app_handler to your application,
    # then uncomment the line below.
    # asyncio.run(main())
    pass

When this code is executed (after implementing the app_handler), SigmaEval automates the entire end-to-end testing process:

  1. Rubric Generation: The AI Judge first interprets the .expect_behavior() description to generate a detailed 1-10 scoring rubric that defines success for this specific scenario.
  2. User Simulation: The AI User Simulator then runs 20 (sample_size) independent, multi-turn conversations with your application via the app_handler.
  3. AI-Powered Judging: Each of the 20 conversation transcripts is scored against the rubric by the AI Judge.
  4. Statistical Conclusion: SigmaEval performs a hypothesis test on the collected scores.

The final output is a result object. Its result.passed attribute will be True only if the test can conclude, with 95% confidence, that your chatbot meets the quality bar: at least 90% of interactions will score an 8 or higher. The result object also contains a wealth of data for inspection, including the full conversation logs, individual scores, the generated rubric, and the statistical analysis summary.
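To smoke-test this flow end to end before wiring up a production backend, the app_handler can simply forward the simulated conversation to an LLM. Here is a minimal sketch; it assumes litellm is installed (any async client would do) and uses a hypothetical system prompt standing in for a real chatbot:

from typing import Any, Dict, List

from litellm import acompletion

# Hypothetical stand-in for a real chatbot backend.
SYSTEM_PROMPT = (
    "You are the customer support assistant for 'GadgetWorld'. "
    "Empathize with the user, ask for the order number, and explain the refund steps."
)

async def app_handler(messages: List[Dict[str, str]], state: Any) -> str:
    # Prepend the system prompt and let the model answer the latest user turn.
    response = await acompletion(
        model="gemini/gemini-2.5-flash",  # any litellm-supported model string works
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + messages,
    )
    return response.choices[0].message.content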

Core Concepts

SigmaEval uses a fluent builder API with a Given-When-Then pattern:

  • .given(): Establishes the context for the AI User Simulator.
  • .when(): Describes the goal or action the user will take.
  • .expect_behavior() or .expect_metric(): Specifies the expected outcomes, with statistical criteria.

Key Features

  • Statistical Assertions: Go beyond pass/fail with assertions like proportion_gte (is the performance good enough, most of the time?) and median_gte (is the typical user experience good?).
  • User Simulation with Various Writing Styles: The AI User Simulator can adopt different writing styles to ensure your app is robust to real-world user communication.
  • Built-in Metrics: Measure quantitative aspects like response latency and turn count out of the box.
  • Compatibility with Testing Libraries: Integrates seamlessly with pytest and unittest.
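For example, the refund scenario drops straight into a unittest test case. This is a sketch that assumes scenario and app_handler from the earlier example are defined in (or importable into) the same module:

import unittest

from sigmaeval import SigmaEval

class TestRefundScenario(unittest.IsolatedAsyncioTestCase):
    async def test_refund_quality_bar(self):
        # `scenario` and `app_handler` are the objects from the example above.
        sigma_eval = SigmaEval(
            judge_model="gemini/gemini-2.5-flash",
            sample_size=20,
            significance_level=0.05,
        )
        result = await sigma_eval.evaluate(scenario, app_handler)
        self.assertTrue(result.passed)

if __name__ == "__main__":
    unittest.main()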

Practical Applications

SigmaEval is designed to integrate into a standard engineering workflow, providing a structured approach to Gen AI quality assurance. Its primary applications are:

  • CI/CD Integration: The framework's programmatic interface and clear pass/fail criteria (based on statistical confidence) allow it to be used as a quality gate in CI/CD pipelines. This prevents regressions in application behavior that might otherwise go unnoticed.
  • Objective A/B Testing: When comparing two versions of a prompt, an agent, or an underlying model, SigmaEval can provide statistically significant results on which version performs better against a defined quality bar.
  • Regression Testing: By running a suite of ScenarioTests, you can create a regression suite to ensure that changes to one part of your system don't negatively impact the holistic user experience.
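A regression suite can then be expressed as a parametrized pytest module and run as a CI quality gate. The sketch below assumes the ScenarioTest objects and the app_handler live in hypothetical project modules (the import paths are placeholders, not part of SigmaEval):

import asyncio

import pytest
from sigmaeval import SigmaEval

from my_chatbot.scenarios import ALL_SCENARIOS  # hypothetical: a list of ScenarioTest objects
from my_chatbot.testing import app_handler      # hypothetical: the handler that calls your app

@pytest.mark.parametrize("scenario", ALL_SCENARIOS)
def test_scenario_meets_quality_bar(scenario):
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        sample_size=20,
        significance_level=0.05,
    )
    result = asyncio.run(sigma_eval.evaluate(scenario, app_handler))
    assert result.passed  # a failing quality bar fails the build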

Resources

SigmaEval is an open-source project. The repository contains the full source code and documentation: https://github.com/Itura-AI/SigmaEval.
