# Model Validation Framework
Automated model testing, data drift detection, and validation gates for your ML pipeline. Catch bad models before they reach production with statistical tests, performance benchmarks, and fairness checks.
## Key Features
- Data drift detection — PSI, KS-test, and chi-squared tests to detect feature distribution shifts
- Model performance gates — configurable accuracy, F1, and latency thresholds that block bad deployments
- Schema validation — enforce input/output column types, ranges, and null constraints
- Bias and fairness testing — demographic parity, equalized odds, and disparate impact metrics
- Regression testing — compare candidate models against the current production baseline
- Automated reports — HTML validation reports with pass/fail summaries and detailed metrics
- CI/CD integration — run validation as a GitHub Actions step that gates model promotion
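The schema-validation feature above can be sketched as a standalone check. This is a hypothetical helper for illustration only (the framework's real implementation lives in `templates/validation/validators/`); the `schema` dict shape is an assumption, not the framework's config format:

```python
import pandas as pd

def validate_schema(df: pd.DataFrame, schema: dict) -> list[str]:
    """Check column dtypes, allowed value ranges, and null constraints.

    schema: {col: {"dtype": str, "min": x, "max": y, "nullable": bool}}
    Returns a list of human-readable violations (empty = schema passes).
    """
    errors = []
    for col, rules in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: missing column")
            continue
        if rules.get("dtype") and str(df[col].dtype) != rules["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype} != {rules['dtype']}")
        if not rules.get("nullable", True) and df[col].isna().any():
            errors.append(f"{col}: contains nulls")
        series = df[col].dropna()
        if "min" in rules and (series < rules["min"]).any():
            errors.append(f"{col}: values below {rules['min']}")
        if "max" in rules and (series > rules["max"]).any():
            errors.append(f"{col}: values above {rules['max']}")
    return errors
```

Returning a list of violations (rather than raising on the first one) lets a report show every schema problem at once.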
## Quick Start
```bash
# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Run the full validation suite
python -m validation.runner \
  --model artifacts/model.pkl \
  --test-data data/test.csv \
  --reference-data data/train.csv \
  --config config.yaml
```
"""Validate a model before deployment."""
from validation import ModelValidator, ValidationConfig
import joblib
import pandas as pd
config = ValidationConfig(accuracy_threshold=0.85, f1_threshold=0.80,
max_psi=0.25, max_latency_ms=100)
model = joblib.load("artifacts/model.pkl")
test_df = pd.read_csv("data/test.csv")
reference_df = pd.read_csv("data/train.csv")
validator = ModelValidator(config)
report = validator.validate(model=model, test_data=test_df, reference_data=reference_df, target_column="target")
print(f"Validation: {'PASSED' if report.passed else 'FAILED'}")
for check in report.checks:
status = "PASS" if check.passed else "FAIL"
print(f" [{status}] {check.name}: {check.value:.4f} (threshold: {check.threshold})")
## Architecture

```
model-validation-framework/
├── config.example.yaml           # Validation thresholds
├── templates/
│   ├── validation/
│   │   ├── runner.py             # CLI validation runner
│   │   ├── validators/           # performance, drift, schema, fairness, latency
│   │   ├── report.py             # HTML report generation
│   │   └── config.py             # ValidationConfig dataclass
│   └── ci/
│       └── validation_gate.yaml  # GitHub Actions workflow
├── docs/
│   └── overview.md
└── examples/
    ├── basic_validation.py
    └── drift_monitoring.py
```
## Usage Examples

### Data Drift Detection
"""Detect feature drift between training and production data."""
import numpy as np
from scipy.stats import ks_2samp
import pandas as pd
def detect_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05) -> dict:
"""KS test for distribution drift. p < threshold = drift detected."""
statistic, p_value = ks_2samp(reference, current)
return {"statistic": statistic, "p_value": p_value, "drift_detected": p_value < threshold}
train_df = pd.read_csv("data/train.csv")
prod_df = pd.read_csv("data/production_sample.csv")
for col in ["age", "income", "credit_score", "account_age_days"]:
result = detect_drift(train_df[col].values, prod_df[col].values)
flag = "DRIFT" if result["drift_detected"] else "OK"
print(f"{col}: KS={result['statistic']:.4f}, p={result['p_value']:.4f} [{flag}]")
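The example above uses the KS test; the config's default `psi` method computes the Population Stability Index instead. A minimal sketch of the standard PSI calculation follows — the function is illustrative only, and the framework's own implementation may choose bin edges differently:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI = sum((cur% - ref%) * ln(cur% / ref%)) over bins of the reference."""
    # Bin edges come from the reference distribution
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Floor proportions at a small epsilon to avoid division by zero / log(0)
    eps = 1e-6
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    cur_pct = np.maximum(cur_counts / cur_counts.sum(), eps)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

A common rule of thumb, matching the config's `psi_threshold: 0.25`, treats PSI below 0.1 as stable, 0.1–0.25 as moderate shift, and above 0.25 as significant drift.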
### Fairness Validation
"""Check model fairness across demographic groups (80% rule)."""
import numpy as np
def check_demographic_parity(y_pred: np.ndarray, sensitive_attr: np.ndarray, threshold: float = 0.8) -> dict:
groups = np.unique(sensitive_attr)
rates = {str(g): y_pred[sensitive_attr == g].mean() for g in groups}
ratio = min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 0
return {"group_rates": rates, "disparate_impact_ratio": ratio, "passed": ratio >= threshold}
result = check_demographic_parity(predictions, test_df["gender"].values)
print(f"Disparate impact: {result['disparate_impact_ratio']:.3f} — {'PASS' if result['passed'] else 'FAIL'}")
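The features list also mentions equalized odds. For brevity this sketch compares only true-positive rates across groups (full equalized odds also compares false-positive rates); it is a hypothetical helper, not the framework's API:

```python
import numpy as np

def equalized_odds_gap(y_true: np.ndarray, y_pred: np.ndarray,
                       sensitive_attr: np.ndarray) -> float:
    """Largest gap in true-positive rate (TPR) between any two groups."""
    tprs = []
    for g in np.unique(sensitive_attr):
        positives = (sensitive_attr == g) & (y_true == 1)
        if positives.sum() == 0:
            continue  # no ground-truth positives in this group; skip it
        tprs.append(y_pred[positives].mean())
    return float(max(tprs) - min(tprs)) if tprs else 0.0
```

A gap near 0 means the model finds true positives at similar rates in every group; a validation gate would compare the gap against a configured maximum.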
### CI/CD Validation Gate

Add this to `.github/workflows/model_validation.yaml` to gate model promotion:
```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: |
          python -m validation.runner \
            --model models/candidate.pkl \
            --test-data data/test.csv \
            --reference-data data/train.csv \
            --config config.yaml
```
## Configuration

```yaml
# config.example.yaml
performance:
  accuracy_threshold: 0.85          # Minimum accuracy to pass
  f1_threshold: 0.80                # Minimum weighted F1
  compare_to_baseline: true         # Compare vs current production model
drift:
  method: "psi"                     # psi | ks_test | chi_squared
  psi_threshold: 0.25               # PSI > 0.25 = significant drift
  features_to_monitor: "all"        # all | list of column names
fairness:
  enabled: true
  sensitive_attributes: ["gender", "age_group"]
  disparate_impact_threshold: 0.8   # 80% rule
latency:
  max_p99_ms: 100                   # 99th percentile threshold
  n_benchmark: 1000                 # Number of timed predictions
```
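The latency gate times repeated single-row predictions and compares the 99th percentile against `max_p99_ms`. A rough sketch of how such a benchmark might work — a hypothetical helper, with the `n_warmup` parameter mentioned in Troubleshooting:

```python
import time
import numpy as np

def benchmark_p99_latency_ms(predict_fn, X: np.ndarray,
                             n_benchmark: int = 1000, n_warmup: int = 50) -> float:
    """Time single-row predictions; return the 99th-percentile latency in ms."""
    rows = [X[i % len(X)].reshape(1, -1) for i in range(n_warmup + n_benchmark)]
    for row in rows[:n_warmup]:
        predict_fn(row)  # warm caches / lazy initialization before timing
    timings_ms = []
    for row in rows[n_warmup:]:
        start = time.perf_counter()
        predict_fn(row)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(timings_ms, 99))
```

Single-row timing matters because production traffic rarely arrives in large batches, and per-row latency can be far worse than the amortized batch figure.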
## Best Practices
- Run validation on every model change — automate with CI/CD so no model skips the gate
- Keep a reference dataset frozen — use training data distribution as your drift baseline
- Set thresholds based on business impact — 1% accuracy drop matters more in fraud than recommendations
- Include fairness checks early — harder to fix bias after deployment than during development
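The baseline-comparison idea above (see `compare_to_baseline` in the config) might look like this in spirit; a hypothetical sketch, not the framework's API, with metric names and `max_drop` as assumed parameters:

```python
def passes_regression_gate(candidate_metrics: dict, baseline_metrics: dict,
                           max_drop: float = 0.01) -> tuple[bool, dict]:
    """Fail if any metric drops more than max_drop below the production baseline.

    Returns (passed, failures) where failures maps metric name
    to (baseline_value, candidate_value) for each violated metric.
    """
    failures = {
        name: (baseline_metrics[name], value)
        for name, value in candidate_metrics.items()
        if name in baseline_metrics and value < baseline_metrics[name] - max_drop
    }
    return len(failures) == 0, failures
```

Returning the offending metrics alongside the boolean makes the eventual HTML report and CI logs far more actionable than a bare pass/fail.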
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| All drift checks fail | Reference data is stale or from the wrong distribution | Regenerate reference data from the latest training set |
| Latency test passes locally, fails in CI | CI runners have fewer resources | Set separate thresholds for CI vs. production, or use `n_warmup` |
| Fairness check always fails | Imbalanced classes in sensitive attributes | Check sample sizes per group; use stratified sampling |
| Schema validation rejects valid data | Column types changed after preprocessing | Update schema config to match your current preprocessing pipeline |
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete Model Validation Framework with all files, templates, and documentation for $39, or grab the entire ML Starter Kit bundle (10 products) for $149 and save 30%.