Thesius Code

Originally published at datanest-stores.pages.dev

Model Validation Framework

Automated model testing, data drift detection, and validation gates for your ML pipeline. Catch bad models before they reach production with statistical tests, performance benchmarks, and fairness checks.

Key Features

  • Data drift detection — PSI, KS-test, and chi-squared tests to detect feature distribution shifts
  • Model performance gates — configurable accuracy, F1, and latency thresholds that block bad deployments
  • Schema validation — enforce input/output column types, ranges, and null constraints
  • Bias and fairness testing — demographic parity, equalized odds, and disparate impact metrics
  • Regression testing — compare candidate models against the current production baseline
  • Automated reports — HTML validation reports with pass/fail summaries and detailed metrics
  • CI/CD integration — run validation as a GitHub Actions step that gates model promotion

Quick Start

# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Run the full validation suite
python -m validation.runner \
  --model artifacts/model.pkl \
  --test-data data/test.csv \
  --reference-data data/train.csv \
  --config config.yaml
Or run the same checks programmatically:
"""Validate a model before deployment."""
from validation import ModelValidator, ValidationConfig
import joblib
import pandas as pd

config = ValidationConfig(accuracy_threshold=0.85, f1_threshold=0.80,
                          max_psi=0.25, max_latency_ms=100)

model = joblib.load("artifacts/model.pkl")
test_df = pd.read_csv("data/test.csv")
reference_df = pd.read_csv("data/train.csv")

validator = ModelValidator(config)
report = validator.validate(model=model, test_data=test_df, reference_data=reference_df, target_column="target")

print(f"Validation: {'PASSED' if report.passed else 'FAILED'}")
for check in report.checks:
    status = "PASS" if check.passed else "FAIL"
    print(f"  [{status}] {check.name}: {check.value:.4f} (threshold: {check.threshold})")

Architecture

model-validation-framework/
├── config.example.yaml             # Validation thresholds
├── templates/
│   ├── validation/
│   │   ├── runner.py               # CLI validation runner
│   │   ├── validators/             # performance, drift, schema, fairness, latency
│   │   ├── report.py               # HTML report generation
│   │   └── config.py               # ValidationConfig dataclass
│   └── ci/
│       └── validation_gate.yaml    # GitHub Actions workflow
├── docs/
│   └── overview.md
└── examples/
    ├── basic_validation.py
    └── drift_monitoring.py

Usage Examples

Data Drift Detection

"""Detect feature drift between training and production data."""
import numpy as np
from scipy.stats import ks_2samp
import pandas as pd

def detect_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05) -> dict:
    """KS test for distribution drift. p < threshold = drift detected."""
    statistic, p_value = ks_2samp(reference, current)
    return {"statistic": statistic, "p_value": p_value, "drift_detected": p_value < threshold}

train_df = pd.read_csv("data/train.csv")
prod_df = pd.read_csv("data/production_sample.csv")

for col in ["age", "income", "credit_score", "account_age_days"]:
    result = detect_drift(train_df[col].values, prod_df[col].values)
    flag = "DRIFT" if result["drift_detected"] else "OK"
    print(f"{col}: KS={result['statistic']:.4f}, p={result['p_value']:.4f} [{flag}]")
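
The feature list and config also name PSI as a drift method (it is the default in the example config). Below is a minimal sketch of a PSI check, reusing train_df and prod_df from the snippet above; the 10-bin binning and the clipping epsilon are illustrative assumptions, not the framework's exact implementation.

"""Sketch of PSI-based drift detection (illustrative, not the framework's implementation)."""
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Derive bin edges from the reference distribution so both samples share the same bins
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) for empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# train_df / prod_df come from the KS example above
psi = population_stability_index(train_df["income"].values, prod_df["income"].values)
print(f"income: PSI={psi:.4f} [{'DRIFT' if psi > 0.25 else 'OK'}]")  # 0.25 matches the config's psi_threshold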

Fairness Validation

"""Check model fairness across demographic groups (80% rule)."""
import numpy as np

def check_demographic_parity(y_pred: np.ndarray, sensitive_attr: np.ndarray, threshold: float = 0.8) -> dict:
    groups = np.unique(sensitive_attr)
    rates = {str(g): y_pred[sensitive_attr == g].mean() for g in groups}
    ratio = min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 0
    return {"group_rates": rates, "disparate_impact_ratio": ratio, "passed": ratio >= threshold}

# Predictions from the model loaded in the Quick Start example (assumes "target" is the label column)
predictions = model.predict(test_df.drop(columns=["target"]))

result = check_demographic_parity(predictions, test_df["gender"].values)
print(f"Disparate impact: {result['disparate_impact_ratio']:.3f} [{'PASS' if result['passed'] else 'FAIL'}]")
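
Schema Validation

The schema validator enforces column types, ranges, and null constraints. The framework's actual API is not shown in this post, so the following is only a minimal sketch of what such checks look like; the SCHEMA dict, the column rules, and the validate_schema helper are hypothetical.

"""Sketch of schema-style checks (illustrative, not the framework's API)."""
import pandas as pd

# Hypothetical schema: expected dtype, nullability, and value range per column
SCHEMA = {
    "age":          {"dtype": "int64",   "nullable": False, "min": 18,  "max": 100},
    "income":       {"dtype": "float64", "nullable": False, "min": 0.0, "max": None},
    "credit_score": {"dtype": "int64",   "nullable": False, "min": 300, "max": 850},
}

def validate_schema(df: pd.DataFrame, schema: dict) -> list[str]:
    errors = []
    for col, rules in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: missing column")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            errors.append(f"{col}: expected {rules['dtype']}, got {df[col].dtype}")
        if not rules["nullable"] and df[col].isna().any():
            errors.append(f"{col}: contains nulls")
        if rules["min"] is not None and (df[col].dropna() < rules["min"]).any():
            errors.append(f"{col}: values below {rules['min']}")
        if rules["max"] is not None and (df[col].dropna() > rules["max"]).any():
            errors.append(f"{col}: values above {rules['max']}")
    return errors

issues = validate_schema(pd.read_csv("data/test.csv"), SCHEMA)
print("Schema OK" if not issues else "\n".join(issues))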

CI/CD Validation Gate

Add this to .github/workflows/model_validation.yaml to gate model promotion:

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt   # dependency file path is an assumption; adjust to your project
      - run: |
          python -m validation.runner \
            --model models/candidate.pkl \
            --test-data data/test.csv \
            --reference-data data/train.csv \
            --config config.yaml
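
Regression Testing Against the Baseline

With compare_to_baseline enabled (see the config below), the candidate is also checked against the current production model. Here is a minimal sketch of that comparison, reusing test_df from the Quick Start; the baseline path, the weighted-F1 choice, and the tolerance are assumptions.

"""Sketch of a baseline regression gate (illustrative; paths and tolerance are assumptions)."""
import joblib
from sklearn.metrics import f1_score

candidate = joblib.load("artifacts/model.pkl")
baseline = joblib.load("artifacts/production_model.pkl")  # hypothetical path to the current prod model

X, y = test_df.drop(columns=["target"]), test_df["target"]
cand_f1 = f1_score(y, candidate.predict(X), average="weighted")
base_f1 = f1_score(y, baseline.predict(X), average="weighted")

tolerance = 0.005  # small margin so run-to-run noise doesn't block promotion
passed = cand_f1 >= base_f1 - tolerance
print(f"candidate F1={cand_f1:.4f} vs baseline F1={base_f1:.4f} [{'PASS' if passed else 'FAIL'}]")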

Configuration

# config.example.yaml
performance:
  accuracy_threshold: 0.85           # Minimum accuracy to pass
  f1_threshold: 0.80                 # Minimum weighted F1
  compare_to_baseline: true          # Compare vs current production model

drift:
  method: "psi"                      # psi | ks_test | chi_squared
  psi_threshold: 0.25               # PSI > 0.25 = significant drift
  features_to_monitor: "all"        # all | list of column names

fairness:
  enabled: true
  sensitive_attributes: ["gender", "age_group"]
  disparate_impact_threshold: 0.8   # 80% rule

latency:
  max_p99_ms: 100                    # 99th percentile threshold
  n_benchmark: 1000                  # Number of timed predictions
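
The latency block above gates on the 99th percentile over n_benchmark timed predictions. A rough sketch of how that measurement could be taken, reusing model and test_df from the Quick Start; the warmup count and single-row batching are assumptions, not the framework's exact behavior.

"""Sketch of a p99 latency benchmark matching the config above (illustrative)."""
import time
import numpy as np

def benchmark_p99_ms(model, X, n_benchmark: int = 1000, n_warmup: int = 50) -> float:
    rows = X.sample(n=n_benchmark, replace=True, random_state=0)
    for _ in range(n_warmup):            # warm up caches before timing
        model.predict(rows.iloc[[0]])
    timings_ms = []
    for i in range(n_benchmark):         # time single-row predictions
        start = time.perf_counter()
        model.predict(rows.iloc[[i]])
        timings_ms.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings_ms, 99))

p99 = benchmark_p99_ms(model, test_df.drop(columns=["target"]))
print(f"p99 latency: {p99:.2f} ms [{'PASS' if p99 <= 100 else 'FAIL'}]")  # 100 ms matches max_p99_ms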

Best Practices

  1. Run validation on every model change — automate with CI/CD so no model skips the gate
  2. Keep a reference dataset frozen — use the training data distribution as your drift baseline
  3. Set thresholds based on business impact — a 1% accuracy drop matters more in fraud detection than in recommendations
  4. Include fairness checks early — bias is harder to fix after deployment than during development

Troubleshooting

  • All drift checks fail. Cause: the reference data is stale or drawn from the wrong distribution. Fix: regenerate the reference data from the latest training set.
  • Latency test passes locally but fails in CI. Cause: CI runners have fewer resources. Fix: set separate thresholds for CI vs. production, or use n_warmup.
  • Fairness check always fails. Cause: imbalanced classes within the sensitive attributes. Fix: check sample sizes per group and use stratified sampling.
  • Schema validation rejects valid data. Cause: column types changed after preprocessing. Fix: update the schema config to match your current preprocessing pipeline.

This is 1 of 10 resources in the ML Starter Kit. Get the complete Model Validation Framework with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

