Step-by-Step Guide to Model Testing Using RAGAS, MLflow, and Pytest
We'll go through the entire process of testing AI models, focusing on evaluating AI predictions using RAGAS, MLflow, and Pytest.
Step 1: Understanding Model Testing in AI
Model testing ensures that AI predictions are:
- Accurate (correct responses)
- Consistent (the same output for the same input)
- Reliable (performs well under different conditions)
AI models are non-deterministic: they may generate slightly different responses to the same input on each run. Testing them therefore calls for approaches that go beyond traditional exact-match unit tests, as sketched below.
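For example, a test can assert that the key fact appears somewhere in the response instead of demanding an exact string match. A minimal sketch, using a hypothetical generate_answer function to stand in for whatever model call you are testing:

# tolerant_test.py (illustrative) - a tolerant check for non-deterministic output
def generate_answer(question: str) -> str:
    # Placeholder for a real model call; the wording may vary between runs
    return "Sir Isaac Newton formulated the law of universal gravitation."

def test_contains_key_fact():
    answer = generate_answer("Who discovered gravity?")
    # Assert on the fact we care about, not the exact phrasing
    assert "Newton" in answer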
Types of AI Model Testing
Test Type | What It Checks | Tool Used |
---|---|---|
Functional Testing | Does the model return expected results? | Pytest |
Evaluation Metrics | Precision, Recall, F1-score | RAGAS, MLflow |
Performance Testing | Latency, speed, and efficiency | MLflow |
Fairness & Bias Testing | Does the model discriminate or favor some inputs? | RAGAS |
Step 2: Setting Up Your Environment
You'll need:
- Python 3.10+
- Pytest for writing tests
- MLflow for logging experiments
- RAGAS for evaluating LLM predictions
Install Dependencies
pip install pytest mlflow ragas transformers datasets openai
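Note that RAGAS's built-in metrics call an LLM judge (OpenAI models by default, which is why openai appears in the dependency list), so an OPENAI_API_KEY needs to be available before the evaluation steps later in this guide. An optional sanity check you could add (the file name is illustrative):

# check_setup.py (illustrative) - fail fast if the RAGAS judge model has no credentials
import os

if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("Set OPENAI_API_KEY before running the RAGAS evaluation steps.")
print("Environment looks ready for RAGAS evaluations.")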
Step 3: Functional Testing Using Pytest
Pytest helps validate model responses against expected outputs.
Example: Testing an AI Model
Let's assume we have an LLM-based question-answering system.
The AI Model (a Hugging Face QA pipeline standing in for a full LLM)
# ai_model.py
from transformers import pipeline

# Extractive question-answering model from Hugging Face
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def predict(question, context):
    # Return only the answer string from the pipeline's output dict
    return qa_pipeline(question=question, context=context)["answer"]
Pytest Script
# test_ai_model.py
import pytest
from ai_model import predict

@pytest.mark.parametrize("question, context, expected_output", [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
])
def test_ai_model_predictions(question, context, expected_output):
    response = predict(question, context)
    assert expected_output in response, f"Unexpected AI response: {response}"
Validates AI responses against predefined test cases.
Run the test:
pytest test_ai_model.py
Step 4: Evaluating Model Performance Using MLflow
MLflow helps track AI model performance across different versions.
Steps:
- Log model predictions.
- Track accuracy, loss, latency, and versioning.
- Compare multiple model versions (see the comparison sketch after the dashboard section below).
Log Model Performance Using MLflow
# mlflow_logger.py
import mlflow
import time

from ai_model import predict

# Start an MLflow run under a named experiment
mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    question = "What is AI?"
    context = "Artificial Intelligence (AI) is a branch of computer science..."

    # Measure prediction latency
    start_time = time.time()
    output = predict(question, context)
    latency = time.time() - start_time

    # Log inputs and metrics for this run
    mlflow.log_param("question", question)
    mlflow.log_param("context_length", len(context))
    mlflow.log_metric("latency", latency)

    print(f"Predicted Answer: {output}")
    mlflow.log_artifact("mlflow_logger.py")  # keep the script alongside the run
Logs the AI prediction and its latency to MLflow.
Check the MLflow Dashboard
Run:
mlflow ui
Then open http://localhost:5000 to visualize logs.
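The dashboard is handy for the third step above (comparing model versions), but you can also pull the logged runs programmatically. A minimal sketch, assuming MLflow 2.x and the AI_Model_Tracking experiment created earlier:

# compare_runs.py (illustrative)
import mlflow

# Fetch all runs logged under the experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["AI_Model_Tracking"])

# Sort by latency so the fastest runs appear first
print(runs[["run_id", "params.question", "metrics.latency"]].sort_values("metrics.latency"))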
Step 5: Evaluating Model Accuracy Using RAGAS
RAGAS (Retrieval-Augmented Generation Assessment) helps test LLM accuracy, relevance, and faithfulness.
Key RAGAS Metrics
Metric | Description |
---|---|
Faithfulness | Is the answer grounded in the provided context? |
Answer Relevancy | Is the answer actually related to the question asked? |
Answer Correctness | Does the answer agree with the ground-truth answer? |
Running a RAGAS Evaluation
# test_ragas.py
# Note: RAGAS metrics use an LLM judge (OpenAI by default), so OPENAI_API_KEY must be set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, answer_relevancy

from ai_model import predict

# Sample question & AI response
question = "Who discovered gravity?"
context = "Isaac Newton discovered gravity when an apple fell on his head."
response = predict(question, context)

# RAGAS expects question, answer, contexts, and ground_truth columns
eval_data = Dataset.from_dict({
    "question": [question],
    "answer": [response],
    "contexts": [[context]],  # list of retrieved context strings per sample
    "ground_truth": ["Isaac Newton"],
})

# Evaluate
results = evaluate(eval_data, metrics=[faithfulness, answer_correctness, answer_relevancy])

# Print aggregate scores
print(f"Faithfulness Score: {results['faithfulness']}")
print(f"Answer Correctness Score: {results['answer_correctness']}")
print(f"Answer Relevancy Score: {results['answer_relevancy']}")
Evaluates AI accuracy using RAGAS metrics.
Run:
python test_ragas.py
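You can also push the RAGAS scores into MLflow so they appear on the same dashboard as the latency metrics from Step 4. A minimal sketch, assuming you pass in a plain dict mapping metric names to aggregate scores (names are illustrative):

# log_ragas_to_mlflow.py (illustrative)
import mlflow

def log_ragas_scores(scores: dict) -> None:
    # `scores` maps metric names (e.g. "faithfulness") to aggregate values
    mlflow.set_experiment("AI_Model_Tracking")
    with mlflow.start_run(run_name="ragas_eval"):
        for metric_name, score in scores.items():
            mlflow.log_metric(metric_name, float(score))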
Step 6: Automating Model Evaluation with Pytest & RAGAS
Now, let's combine RAGAS with Pytest for automated evaluation.
Pytest + RAGAS Script
# test_ai_ragas.py
import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, answer_relevancy

from ai_model import predict

test_cases = [
    ("Who discovered gravity?", "Isaac Newton discovered gravity...", "Isaac Newton"),
    ("What is AI?", "Artificial Intelligence is a branch...", "Artificial Intelligence"),
]

@pytest.mark.parametrize("question, context, expected_output", test_cases)
def test_ai_ragas(question, context, expected_output):
    response = predict(question, context)

    # Build a single-row evaluation dataset in the format RAGAS expects
    eval_data = Dataset.from_dict({
        "question": [question],
        "answer": [response],
        "contexts": [[context]],
        "ground_truth": [expected_output],
    })
    scores = evaluate(eval_data, metrics=[faithfulness, answer_correctness, answer_relevancy])

    # Fail the test if any metric drops below the agreed threshold
    assert scores["faithfulness"] > 0.7, f"Low faithfulness: {scores['faithfulness']}"
    assert scores["answer_correctness"] > 0.7, f"Low answer correctness: {scores['answer_correctness']}"
    assert scores["answer_relevancy"] > 0.7, f"Low answer relevancy: {scores['answer_relevancy']}"
Runs automated AI model evaluation using Pytest and RAGAS.
Run:
pytest test_ai_ragas.py
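Because these tests call an external judge LLM, it helps to skip them automatically when no API key is configured (for example, on a machine without credentials). A minimal sketch using a conftest.py that is not part of the original scripts:

# conftest.py (illustrative) - skip RAGAS-based tests when no judge LLM is configured
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.getenv("OPENAI_API_KEY"):
        return  # key present: run everything
    skip_marker = pytest.mark.skip(reason="OPENAI_API_KEY not set; skipping RAGAS evaluations")
    for item in items:
        if "ragas" in item.nodeid:  # matches test_ai_ragas.py / test_ragas.py
            item.add_marker(skip_marker)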
Step 7: Integrating Everything in a CI/CD Pipeline
For continuous AI model testing, integrate tests into GitHub Actions, Jenkins, or GitLab CI/CD.
Sample GitHub Actions Workflow
name: AI Model Testing

on: [push]

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install Dependencies
        run: pip install pytest mlflow ragas transformers datasets openai
      - name: Run AI Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest test_ai_ragas.py
Automatically runs the AI tests on every push. Store OPENAI_API_KEY as a repository secret so the RAGAS metrics can reach their judge model.
Summary
Step | Task | Tool |
---|---|---|
1 | Write functional AI model tests | Pytest |
2 | Log AI performance & latency | MLflow |
3 | Evaluate AI responses for accuracy | RAGAS |
4 | Automate AI model evaluation | Pytest + RAGAS |
5 | Integrate AI tests in CI/CD | GitHub Actions |