
The Future of Automation Testing in 2025: Adapting to AI Model Validation

Step-by-Step Guide to Model Testing Using RAGAS, MLflow, and Pytest

We'll go through the entire process of testing AI models, focusing on evaluating AI predictions using RAGAS, MLflow, and Pytest.


📌 Step 1: Understanding Model Testing in AI

Model testing ensures that AI predictions are:
✅ Accurate (correct responses)

✅ Consistent (same output for the same input)

✅ Reliable (performs well under different conditions)

AI models are non-deterministic, meaning they may generate slightly different responses each time. This requires customized testing approaches beyond traditional unit tests.
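
In practice this means exact-string assertions are brittle; tests should check for key facts or score thresholds instead. Below is a minimal sketch of that idea, where `generate` is a hypothetical stand-in for an LLM call and not part of this tutorial's code:

# tolerant_assert.py - illustrates asserting on key facts rather than exact strings
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: wording changes from run to run
    return random.choice([
        "AI is a branch of computer science.",
        "AI, a branch of computer science, studies building intelligent systems.",
    ])

def test_answer_contains_key_fact():
    response = generate("What is AI?")
    # Check the fact we care about, not the exact phrasing
    assert "branch of computer science" in response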

Types of AI Model Testing

Test Type | What It Checks | Tool Used
--- | --- | ---
Functional Testing | Does the model return expected results? | Pytest
Evaluation Metrics | Precision, Recall, F1-score | RAGAS, MLflow
Performance Testing | Latency, speed, and efficiency | MLflow
Fairness & Bias Testing | Does the model discriminate or favor some inputs? | RAGAS
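
The first three rows are demonstrated in the steps below. Fairness and bias testing is not, so here is one simple, library-agnostic sketch: ask paired questions that differ only in a name and check that the answers stay symmetric. This example is my illustration (it reuses the `predict` function defined in Step 3), not an official RAGAS feature.

# test_fairness_pairs.py - illustrative paired-input check, not an official RAGAS feature
from ai_model import predict  # the QA pipeline defined in Step 3

# Identical evidence for both people; only the name differs in the questions
CONTEXT = (
    "Alice and Bob both applied for the same engineering role. "
    "Alice has 5 years of experience. Bob has 5 years of experience."
)

def test_paired_inputs_get_symmetric_answers():
    answer_a = predict("How many years of experience does Alice have?", CONTEXT)
    answer_b = predict("How many years of experience does Bob have?", CONTEXT)
    # With the same evidence, the extracted answers should be equivalent
    assert answer_a.strip() == answer_b.strip(), f"Asymmetric answers: {answer_a!r} vs {answer_b!r}"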

📌 Step 2: Setting Up Your Environment

You'll need:
✅ Python 3.10+

✅ Pytest for writing tests

✅ MLflow for logging experiments

✅ RAGAS for evaluating LLM predictions

📥 Install Dependencies

pip install pytest mlflow ragas transformers datasets openai

📌 Step 3: Functional Testing Using Pytest

Pytest helps validate model responses against expected outputs.

πŸ“ Example: Testing an AI Model

Let's assume we have an LLM-based question-answering system.

🔹 AI Model Under Test (an extractive QA pipeline)

# ai_model.py
from transformers import pipeline

# Load an extractive question-answering model from the Hugging Face Hub
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def predict(question, context):
    # Return only the answer span extracted from the context
    return qa_pipeline(question=question, context=context)["answer"]

🔹 Pytest Script

# test_ai_model.py
import pytest
from ai_model import predict

@pytest.mark.parametrize("question, context, expected_output", [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
])
def test_ai_model_predictions(question, context, expected_output):
    response = predict(question, context)
    assert expected_output in response, f"Unexpected AI response: {response}"

✅ Validates AI responses using predefined test cases.

Run the test:

pytest test_ai_model.py
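
The test above targets accuracy. The consistency property from Step 1 (same output for the same input) can be checked by calling the model twice; this extra test is my addition and assumes the extractive QA pipeline from ai_model.py, which is deterministic by default:

# test_consistency.py - checks the "same output for same input" property from Step 1
from ai_model import predict

def test_same_input_same_output():
    question = "Who discovered gravity?"
    context = "Isaac Newton discovered gravity when an apple fell on his head."
    first = predict(question, context)
    second = predict(question, context)
    # For sampling-based LLMs, compare key facts instead of exact strings
    assert first == second, f"Inconsistent answers: {first!r} vs {second!r}"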

📌 Step 4: Evaluating Model Performance Using MLflow

MLflow helps track AI model performance across different versions.

πŸ“ Steps:

  1. Log model predictions.
  2. Track accuracy, loss, latency, and versioning.
  3. Compare multiple model versions.

🔹 Log Model Performance Using MLflow

# mlflow_logger.py
import mlflow
import time
from ai_model import predict

# Start MLflow run
mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    question = "What is AI?"
    context = "Artificial Intelligence (AI) is a branch of computer science..."

    start_time = time.time()
    output = predict(question, context)
    end_time = time.time()

    latency = end_time - start_time

    mlflow.log_param("question", question)
    mlflow.log_param("context_length", len(context))
    mlflow.log_metric("latency", latency)

    print(f"Predicted Answer: {output}")
    mlflow.log_artifact("mlflow_logger.py")

✅ Logs the question, context length, and prediction latency to MLflow.

📊 Check the MLflow Dashboard

Run:

mlflow ui

Then open http://localhost:5000 to visualize logs.
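
Point 3 above (comparing multiple model versions) works by logging each candidate model as its own run in the same experiment. A minimal sketch, assuming the accuracy numbers come from your own evaluation set (the model names and values below are placeholders):

# compare_models.py - sketch for comparing model versions in MLflow (my addition)
import mlflow

mlflow.set_experiment("AI_Model_Tracking")

# Placeholder accuracies; in practice, compute these on a held-out evaluation set
candidates = {
    "deepset/roberta-base-squad2": 0.82,
    "deepset/roberta-large-squad2": 0.87,
}

for model_name, accuracy in candidates.items():
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        mlflow.log_metric("accuracy", accuracy)

# In the MLflow UI, sort the runs by the "accuracy" metric to compare versions.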


📌 Step 5: Evaluating Model Accuracy Using RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) helps test LLM accuracy, relevance, and faithfulness.

Key RAGAS Metrics

Metric | Description
--- | ---
Faithfulness | Is the answer grounded in the provided context?
Answer Relevancy | Is the answer relevant to the question asked?
Answer Correctness | Does the answer agree with the ground-truth answer?

🔹 Running a RAGAS Evaluation

# test_ragas.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, answer_relevancy
from ai_model import predict

# Sample question & AI response
question = "Who discovered gravity?"
context = "Isaac Newton discovered gravity when an apple fell on his head."
response = predict(question, context)

# RAGAS evaluates a dataset with question, answer, contexts (list of strings),
# and ground_truth columns (column and metric names follow the ragas 0.1.x API)
data = Dataset.from_dict({
    "question": [question],
    "answer": [response],
    "contexts": [[context]],
    "ground_truth": ["Isaac Newton"],
})

# Evaluate (RAGAS uses an LLM judge, so e.g. OPENAI_API_KEY must be set)
results = evaluate(data, metrics=[faithfulness, answer_correctness, answer_relevancy])

# Print results
print(f"Faithfulness Score: {results['faithfulness']}")
print(f"Answer Correctness Score: {results['answer_correctness']}")
print(f"Answer Relevancy Score: {results['answer_relevancy']}")

✅ Evaluates AI accuracy using RAGAS metrics.

Run:

python test_ragas.py

📌 Step 6: Automating Model Evaluation with Pytest & RAGAS

Now, let's combine RAGAS with Pytest for automated evaluation.

🔹 Pytest-RAGAS Script

# test_ai_ragas.py
import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, answer_relevancy
from ai_model import predict

test_cases = [
    ("Who discovered gravity?", "Isaac Newton discovered gravity...", "Isaac Newton"),
    ("What is AI?", "Artificial Intelligence is a branch...", "Artificial Intelligence")
]

@pytest.mark.parametrize("question, context, expected_output", test_cases)
def test_ai_ragas(question, context, expected_output):
    response = predict(question, context)

    # Build a single-row dataset in the format RAGAS (0.1.x) expects
    data = Dataset.from_dict({
        "question": [question],
        "answer": [response],
        "contexts": [[context]],
        "ground_truth": [expected_output],
    })

    # Scores are aggregated per metric; requires an LLM judge (e.g. OPENAI_API_KEY)
    results = evaluate(data, metrics=[faithfulness, answer_correctness, answer_relevancy])

    assert results["faithfulness"] > 0.7, f"Low faithfulness: {results['faithfulness']}"
    assert results["answer_correctness"] > 0.7, f"Low answer correctness: {results['answer_correctness']}"
    assert results["answer_relevancy"] > 0.7, f"Low relevance: {results['answer_relevancy']}"

✅ Runs automated AI model evaluation using Pytest & RAGAS.

Run:

pytest test_ai_ragas.py

📌 Step 7: Integrating Everything into a CI/CD Pipeline

For continuous AI model testing, integrate tests into GitHub Actions, Jenkins, or GitLab CI/CD.

🔹 Sample GitHub Actions Workflow

name: AI Model Testing

on: [push]

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install Dependencies
        run: pip install pytest mlflow ragas transformers datasets openai

      - name: Run AI Tests
        env:
          # RAGAS needs an LLM judge; store your key as a repository secret (name it as you like)
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest test_ai_ragas.py

✅ Automatically runs tests on every code push.


📌 Summary

Step | Task | Tool
--- | --- | ---
1️⃣ | Write functional AI model tests | Pytest
2️⃣ | Log AI performance & latency | MLflow
3️⃣ | Evaluate AI responses for accuracy | RAGAS
4️⃣ | Automate AI model evaluation | Pytest + RAGAS
5️⃣ | Integrate AI tests in CI/CD | GitHub Actions
