# Model Validation Framework
Automated model testing, data drift detection, and validation gates for your ML pipeline. Catch bad models before they reach production with statistical tests, performance benchmarks, and fairness checks.
## Key Features
- Data drift detection — PSI, KS-test, and chi-squared tests to detect feature distribution shifts
- Model performance gates — configurable accuracy, F1, and latency thresholds that block bad deployments
- Schema validation — enforce input/output column types, ranges, and null constraints
- Bias and fairness testing — demographic parity, equalized odds, and disparate impact metrics
- Regression testing — compare candidate models against the current production baseline
- Automated reports — HTML validation reports with pass/fail summaries and detailed metrics
- CI/CD integration — run validation as a GitHub Actions step that gates model promotion
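The schema-validation feature above can be sketched as a standalone check. This is a hypothetical helper for illustration only (the framework's real implementation lives in `templates/validation/validators/`); the `schema` dict shape is an assumption, not the framework's config format:

```python
import pandas as pd

def validate_schema(df: pd.DataFrame, schema: dict) -> list[str]:
    """Check column dtypes, allowed value ranges, and null constraints.

    schema: {col: {"dtype": str, "min": x, "max": y, "nullable": bool}}
    Returns a list of human-readable violations (empty = schema passes).
    """
    errors = []
    for col, rules in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: missing column")
            continue
        if rules.get("dtype") and str(df[col].dtype) != rules["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype} != {rules['dtype']}")
        if not rules.get("nullable", True) and df[col].isna().any():
            errors.append(f"{col}: contains nulls")
        series = df[col].dropna()
        if "min" in rules and (series < rules["min"]).any():
            errors.append(f"{col}: values below {rules['min']}")
        if "max" in rules and (series > rules["max"]).any():
            errors.append(f"{col}: values above {rules['max']}")
    return errors
```

Returning a list of violations (rather than raising on the first one) lets a report show every schema problem at once.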
## Quick Start
```bash
# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Run the full validation suite
python -m validation.runner \
  --model artifacts/model.pkl \
  --test-data data/test.csv \
  --reference-data data/train.csv \
  --config config.yaml
```
"""Validate a model before deployment."""
from validation import ModelValidator, ValidationConfig
import joblib
import pandas as pd
config = ValidationConfig(accuracy_threshold=0.85, f1_threshold=0.80,
max_psi=0.25, max_latency_ms=100)
model = joblib.load("artifacts/model.pkl")
test_df = pd.read_csv("data/test.csv")
reference_df = pd.read_csv("data/train.csv")
validator = ModelValidator(config)
report = validator.validate(model=model, test_data=test_df, reference_data=reference_df, target_column="target")
print(f"Validation: {'PASSED' if report.passed else 'FAILED'}")
for check in report.checks:
status = "PASS" if check.passed else "FAIL"
print(f" [{status}] {check.name}: {check.value:.4f} (threshold: {check.threshold})")
## Architecture

```
model-validation-framework/
├── config.example.yaml           # Validation thresholds
├── templates/
│   ├── validation/
│   │   ├── runner.py             # CLI validation runner
│   │   ├── validators/           # performance, drift, schema, fairness, latency
│   │   ├── report.py             # HTML report generation
│   │   └── config.py             # ValidationConfig dataclass
│   └── ci/
│       └── validation_gate.yaml  # GitHub Actions workflow
├── docs/
│   └── overview.md
└── examples/
    ├── basic_validation.py
    └── drift_monitoring.py
```
## Usage Examples

### Data Drift Detection
"""Detect feature drift between training and production data."""
import numpy as np
from scipy.stats import ks_2samp
import pandas as pd
def detect_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05) -> dict:
"""KS test for distribution drift. p < threshold = drift detected."""
statistic, p_value = ks_2samp(reference, current)
return {"statistic": statistic, "p_value": p_value, "drift_detected": p_value < threshold}
train_df = pd.read_csv("data/train.csv")
prod_df = pd.read_csv("data/production_sample.csv")
for col in ["age", "income", "credit_score", "account_age_days"]:
result = detect_drift(train_df[col].values, prod_df[col].values)
flag = "DRIFT" if result["drift_detected"] else "OK"
print(f"{col}: KS={result['statistic']:.4f}, p={result['p_value']:.4f} [{flag}]")
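The example above uses the KS test; the config's default `psi` method computes the Population Stability Index instead. A minimal sketch of the standard PSI calculation follows — the function is illustrative only, and the framework's own implementation may choose bin edges differently:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI = sum((cur% - ref%) * ln(cur% / ref%)) over bins of the reference."""
    # Bin edges come from the reference distribution
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Floor proportions at a small epsilon to avoid division by zero / log(0)
    eps = 1e-6
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    cur_pct = np.maximum(cur_counts / cur_counts.sum(), eps)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

A common rule of thumb, matching the config's `psi_threshold: 0.25`, treats PSI below 0.1 as stable, 0.1–0.25 as moderate shift, and above 0.25 as significant drift.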
### Fairness Validation
"""Check model fairness across demographic groups (80% rule)."""
import numpy as np
def check_demographic_parity(y_pred: np.ndarray, sensitive_attr: np.ndarray, threshold: float = 0.8) -> dict:
groups = np.unique(sensitive_attr)
rates = {str(g): y_pred[sensitive_attr == g].mean() for g in groups}
ratio = min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 0
return {"group_rates": rates, "disparate_impact_ratio": ratio, "passed": ratio >= threshold}
result = check_demographic_parity(predictions, test_df["gender"].values)
print(f"Disparate impact: {result['disparate_impact_ratio']:.3f} — {'PASS' if result['passed'] else 'FAIL'}")
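The features list also mentions equalized odds. For brevity this sketch compares only true-positive rates across groups (full equalized odds also compares false-positive rates); it is a hypothetical helper, not the framework's API:

```python
import numpy as np

def equalized_odds_gap(y_true: np.ndarray, y_pred: np.ndarray,
                       sensitive_attr: np.ndarray) -> float:
    """Largest gap in true-positive rate (TPR) between any two groups."""
    tprs = []
    for g in np.unique(sensitive_attr):
        positives = (sensitive_attr == g) & (y_true == 1)
        if positives.sum() == 0:
            continue  # no ground-truth positives in this group; skip it
        tprs.append(y_pred[positives].mean())
    return float(max(tprs) - min(tprs)) if tprs else 0.0
```

A gap near 0 means the model finds true positives at similar rates in every group; a validation gate would compare the gap against a configured maximum.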
### CI/CD Validation Gate

Add this to `.github/workflows/model_validation.yaml` to gate model promotion:
```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: |
          python -m validation.runner \
            --model models/candidate.pkl \
            --test-data data/test.csv \
            --reference-data data/train.csv \
            --config config.yaml
```
## Configuration

```yaml
# config.example.yaml
performance:
  accuracy_threshold: 0.85          # Minimum accuracy to pass
  f1_threshold: 0.80                # Minimum weighted F1
  compare_to_baseline: true         # Compare vs current production model
drift:
  method: "psi"                     # psi | ks_test | chi_squared
  psi_threshold: 0.25               # PSI > 0.25 = significant drift
  features_to_monitor: "all"        # all | list of column names
fairness:
  enabled: true
  sensitive_attributes: ["gender", "age_group"]
  disparate_impact_threshold: 0.8   # 80% rule
latency:
  max_p99_ms: 100                   # 99th percentile threshold
  n_benchmark: 1000                 # Number of timed predictions
```
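The latency gate times repeated single-row predictions and compares the 99th percentile against `max_p99_ms`. A rough sketch of how such a benchmark might work — a hypothetical helper, with the `n_warmup` parameter mentioned in Troubleshooting:

```python
import time
import numpy as np

def benchmark_p99_latency_ms(predict_fn, X: np.ndarray,
                             n_benchmark: int = 1000, n_warmup: int = 50) -> float:
    """Time single-row predictions; return the 99th-percentile latency in ms."""
    rows = [X[i % len(X)].reshape(1, -1) for i in range(n_warmup + n_benchmark)]
    for row in rows[:n_warmup]:
        predict_fn(row)  # warm caches / lazy initialization before timing
    timings_ms = []
    for row in rows[n_warmup:]:
        start = time.perf_counter()
        predict_fn(row)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(timings_ms, 99))
```

Single-row timing matters because production traffic rarely arrives in large batches, and per-row latency can be far worse than the amortized batch figure.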
## Best Practices
- Run validation on every model change — automate with CI/CD so no model skips the gate
- Keep a reference dataset frozen — use training data distribution as your drift baseline
- Set thresholds based on business impact — 1% accuracy drop matters more in fraud than recommendations
- Include fairness checks early — harder to fix bias after deployment than during development
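The baseline-comparison idea above (see `compare_to_baseline` in the config) might look like this in spirit; a hypothetical sketch, not the framework's API, with metric names and `max_drop` as assumed parameters:

```python
def passes_regression_gate(candidate_metrics: dict, baseline_metrics: dict,
                           max_drop: float = 0.01) -> tuple[bool, dict]:
    """Fail if any metric drops more than max_drop below the production baseline.

    Returns (passed, failures) where failures maps metric name
    to (baseline_value, candidate_value) for each violated metric.
    """
    failures = {
        name: (baseline_metrics[name], value)
        for name, value in candidate_metrics.items()
        if name in baseline_metrics and value < baseline_metrics[name] - max_drop
    }
    return len(failures) == 0, failures
```

Returning the offending metrics alongside the boolean makes the eventual HTML report and CI logs far more actionable than a bare pass/fail.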
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| All drift checks fail | Reference data is stale or from the wrong distribution | Regenerate reference data from the latest training set |
| Latency test passes locally, fails in CI | CI runners have fewer resources | Set separate thresholds for CI vs. production, or use `n_warmup` |
| Fairness check always fails | Imbalanced classes in sensitive attributes | Check sample sizes per group; use stratified sampling |
| Schema validation rejects valid data | Column types changed after preprocessing | Update schema config to match your current preprocessing pipeline |
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete Model Validation Framework with all files, templates, and documentation for $39, or grab the entire ML Starter Kit bundle (10 products) for $149 and save 30%.