Haji Rufai

Posted on May 25

Building a Data Drift Detection Framework in Python with Statistical Rigor

#machinelearning #python #datascience #devops

Data drift — the silent killer of ML models and data pipelines. Your model worked perfectly in production for months, then gradually its predictions started degrading. The culprit? The data changed under your feet.

I built DataDrift, a Python framework that detects schema changes, distribution shifts, and data quality degradation using rigorous statistical methods. Here's how and why.

The Problem

In production ML and data systems, data drift is a leading cause of model failures. Common scenarios:

Feature distribution shifts — Customer behavior changes seasonally
Schema breaks — Upstream team renames a column
Data quality degradation — A pipeline starts producing more nulls
New categories appear — A new payment method gets added

Without monitoring, these issues silently degrade your system. DataDrift catches them before they cause damage.

Architecture

┌─────────────────────────────────────────────┐
│              User Interface Layer             │
│   CLI | Python SDK | HTML Reports             │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│             Detection Engine                  │
│                                               │
│  Schema Drift   │ Distribution  │ Statistics  │
│  • New/removed  │ • KS test     │ • Mean shift│
│  • Type changed │ • Chi² test   │ • Null rate │
│  • Nullable     │ • PSI         │ • Quantiles │
│                 │ • Wasserstein │ • Cardinality│
│                 │ • Jensen-     │             │
│                 │   Shannon     │             │
│                                               │
│  ┌───────────────────────────────────────┐   │
│  │        Data Quality Checks            │   │
│  │  Missing values, duplicates, ranges,  │   │
│  │  correlation drift, constant columns  │   │
│  └───────────────────────────────────────┘   │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  HTML Report │ JSON Report │ CLI Summary     │
└─────────────────────────────────────────────┘

Statistical Methods — The Core Engine

Population Stability Index (PSI)

PSI is the industry-standard metric for drift detection, widely used in banking and fintech:

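import numpy as np

def _compute_psi_numerical(ref_values, curr_values, n_bins=10):
    # Small constant keeps empty bins from producing log(0) or division by zero
    epsilon = 1e-4
    # Equal-width bins derived from the reference range
    bin_edges = np.linspace(ref_values.min(), ref_values.max(), n_bins + 1)
    # Widen the outer edges so current values outside the reference range are still counted
    bin_edges[0] = min(bin_edges[0], curr_values.min()) - 0.001
    bin_edges[-1] = max(bin_edges[-1], curr_values.max()) + 0.001

    ref_counts, _ = np.histogram(ref_values, bins=bin_edges)
    curr_counts, _ = np.histogram(curr_values, bins=bin_edges)

    # Per-bin proportions, smoothed by epsilon
    ref_pct = ref_counts / len(ref_values) + epsilon
    curr_pct = curr_counts / len(curr_values) + epsilon

    # PSI = sum over bins of (current - reference) * ln(current / reference)
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return float(psi)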

Interpretation:
| PSI Value | Meaning |
|-----------|---------|
| < 0.1 | Stable — no action needed |
| 0.1 – 0.2 | Moderate drift — monitor |
| ≥ 0.2 | Significant drift — investigate |
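
To make the thresholds concrete, here is a small worked example on two hand-picked, already-binned distributions. It is a sketch, not DataDrift's own code: the helper names `psi_from_proportions` and `classify_psi` are hypothetical, the formula matches the snippet above (without the epsilon smoothing), and the band labels come from the table.

```python
import numpy as np

def psi_from_proportions(ref_pct, curr_pct):
    """PSI over already-binned proportions (each array should sum to 1)."""
    ref_pct = np.asarray(ref_pct, dtype=float)
    curr_pct = np.asarray(curr_pct, dtype=float)
    return float(np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct)))

def classify_psi(psi):
    """Map a PSI value to the bands in the interpretation table."""
    if psi < 0.1:
        return "stable"
    if psi < 0.2:
        return "moderate"
    return "significant"

# A 50/50 split drifting to 60/40, then to 80/20
small = psi_from_proportions([0.5, 0.5], [0.6, 0.4])
large = psi_from_proportions([0.5, 0.5], [0.8, 0.2])

print(f"{small:.4f} {classify_psi(small)}")
print(f"{large:.4f} {classify_psi(large)}")
```

The 60/40 case works out to 0.1 × ln(1.5), about 0.04, which lands in the stable band. The 80/20 case is 0.3 × ln(4), about 0.42, well past the 0.2 threshold.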

Kolmogorov-Smirnov Test

The KS test compares the empirical CDFs of two samples:

from scipy import stats

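# ref_clean / curr_clean: the column's values, presumably with nulls already dropped
ks_stat, ks_p = stats.ks_2samp(ref_clean, curr_clean)
# p < 0.05 → reject the hypothesis that both samples come from the same distribution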
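
For a version that runs end to end, the sketch below applies the same call to synthetic data. The arrays, the seed, and the injected NaNs are all illustrative assumptions. The NaN filtering is there because `ks_2samp` does not drop missing values on its own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for one numeric column; the current sample has a shifted mean.
ref = rng.normal(loc=0.0, scale=1.0, size=2000)
curr = rng.normal(loc=0.5, scale=1.0, size=2000)
curr[:50] = np.nan  # mimic missing values in the current batch

# Remove NaNs before testing.
ref_clean = ref[~np.isnan(ref)]
curr_clean = curr[~np.isnan(curr)]

ks_stat, ks_p = stats.ks_2samp(ref_clean, curr_clean)
drifted = ks_p < 0.05
print(f"KS statistic={ks_stat:.3f}, p={ks_p:.2e}, drifted={drifted}")
```

With large samples the KS test flags even tiny shifts as significant, so it helps to read the KS statistic (the maximum gap between the two CDFs) alongside the p-value.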

Chi-Squared Test (for Categories)

For categorical columns, we use a contingency table:

# ref_counts and curr_counts: per-category counts, aligned to the same category order
observed = np.array([ref_counts, curr_counts])
chi2, p, _, _ = stats.chi2_contingency(observed)
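
The two count vectors only form a valid contingency table if they cover the same categories in the same order, which matters when a new category shows up. The sketch below shows one way to align them with pandas. The sample values are made up, and the alignment step is an assumption about how this could be done, not necessarily how DataDrift does it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data: "crypto" appears only in the current sample.
ref = pd.Series(["card"] * 60 + ["paypal"] * 40)
curr = pd.Series(["card"] * 40 + ["paypal"] * 35 + ["crypto"] * 25)

# Align both count vectors on the union of categories; absent ones count as 0.
categories = sorted(set(ref.unique()) | set(curr.unique()))
ref_counts = ref.value_counts().reindex(categories, fill_value=0).to_numpy()
curr_counts = curr.value_counts().reindex(categories, fill_value=0).to_numpy()

observed = np.array([ref_counts, curr_counts])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(categories)
print(observed.tolist())
print(f"chi2={chi2:.2f}, p={p:.2e}, dof={dof}")
```

A category that is missing from both samples would produce a zero expected frequency, which `chi2_contingency` rejects with an error. Building the category list from the union of observed values avoids that case.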

Wasserstein Distance

Also called "Earth Mover's Distance" — measures the minimum "work" to transform one distribution into another:

wass = stats.wasserstein_distance(ref_clean, curr_clean)
# Normalize by reference std for comparability
ref_std = np.std(ref_clean)
wass_normalized = wass / ref_std if ref_std > 0 else wass
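
The point of the normalization is that the raw distance is expressed in the column's own units, so it changes whenever the units do. A small sketch on synthetic data illustrates this. The wrapper name `normalized_wasserstein` is hypothetical and simply packages the lines above.

```python
import numpy as np
from scipy import stats

def normalized_wasserstein(ref_clean, curr_clean):
    """Wasserstein distance divided by the reference standard deviation."""
    wass = stats.wasserstein_distance(ref_clean, curr_clean)
    ref_std = np.std(ref_clean)
    return wass / ref_std if ref_std > 0 else wass

rng = np.random.default_rng(0)
ref = rng.normal(100.0, 15.0, size=1000)
curr = ref + 7.5  # every value shifted by the same amount

# Same data expressed in different units (for example dollars vs. cents)
raw = stats.wasserstein_distance(ref, curr)
raw_scaled = stats.wasserstein_distance(ref * 100, curr * 100)
norm = normalized_wasserstein(ref, curr)
norm_scaled = normalized_wasserstein(ref * 100, curr * 100)

print(f"raw: {raw:.2f} vs {raw_scaled:.2f}")
print(f"normalized: {norm:.3f} vs {norm_scaled:.3f}")
```

The raw distance grows by the unit factor of 100, while the normalized value stays the same, which is what lets one threshold apply across columns with different scales.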

Usage — Three Ways

1. Python SDK

from datadrift import DriftDetector
import pandas as pd

ref_df = pd.read_csv("reference_data.csv")
curr_df = pd.read_csv("current_data.csv")

detector = DriftDetector(
    psi_threshold=0.2,
    p_value_threshold=0.05,
)

report = detector.compare(ref_df, curr_df)

print(f"Score: {report.overall_score}/100")
print(f"Drifted: {report.drifted_columns}")

report.to_html("drift_report.html")
report.to_json("drift_report.json")

2. CLI

# Quick comparison
datadrift compare reference.csv current.csv --summary-only

# Generate HTML report
datadrift compare ref.csv curr.csv --report html -o report.html

Exit codes make CI/CD integration trivial:

0 — No drift
1 — Moderate/high drift
2 — Critical drift

3. CI/CD Pipeline

- name: Data Drift Check
  run: |
    datadrift compare data/baseline.csv data/latest.csv \
      --report json -o drift.json
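
In GitHub Actions, a plain `run` step fails on any non-zero exit code, so moderate drift (exit code 1) would stop the pipeline just like critical drift. If you would rather warn on 1 and fail only on 2, a thin wrapper can translate the codes. This is a sketch: the `decide` and `run_gate` helpers and the pass/warn/fail policy are hypothetical and not part of DataDrift. Only the exit-code meanings come from the list above.

```python
import subprocess
import sys

# Exit-code meanings as listed above; the action names are an assumed policy.
ACTIONS = {0: "pass", 1: "warn", 2: "fail"}

def decide(returncode):
    """Translate a datadrift exit code into a pipeline action."""
    # Anything unexpected (for example a crash) is treated as a failure.
    return ACTIONS.get(returncode, "fail")

def run_gate(cmd):
    """Run the comparison command and return (action, exit code)."""
    result = subprocess.run(cmd, check=False)
    return decide(result.returncode), result.returncode

# Stand-in command that exits with 2, so the sketch runs without datadrift installed.
action, code = run_gate([sys.executable, "-c", "import sys; sys.exit(2)"])
print(f"exit code {code}: {action}")
```

In a real pipeline, the stand-in command would be replaced by the `datadrift compare ...` argument list from the step above, and the script would exit non-zero only when the action is "fail".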

HTML Report — The Flagship Feature

The HTML report is a single self-contained file with:

Overall drift score (0-100) with severity ring
Schema comparison table with color-coded changes
Interactive Plotly charts — distribution overlays per column
Collapsible stats tables — before/after for every statistic
Quality issues — sortable by severity

All CSS (Tailwind), JS (Plotly), and data are embedded — no external dependencies. Share it via email, S3, or Jira.

Data Quality Checks

Beyond distribution drift, DataDrift catches quality issues:

# Detects:
# - Null rate increase (>5% change flagged)
# - Range violations (values outside reference range)
# - New/removed categories
# - Constant columns (was diverse, now single value)
# - Correlation drift between column pairs
# - Duplicate rate changes
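
As an illustration of the first check, here is a minimal sketch of a null-rate comparison in pandas. It assumes the 5% threshold means an absolute increase of five percentage points. The function name `null_rate_increases` is hypothetical, and the actual DataDrift implementation may differ.

```python
import pandas as pd

def null_rate_increases(ref_df, curr_df, threshold=0.05):
    """Return columns whose null rate rose by more than `threshold` (absolute)."""
    # Only columns present in both frames can be compared.
    shared = [c for c in ref_df.columns if c in curr_df.columns]
    ref_rate = ref_df[shared].isna().mean()
    curr_rate = curr_df[shared].isna().mean()
    delta = curr_rate - ref_rate
    flagged = delta[delta > threshold]
    return {col: (float(ref_rate[col]), float(curr_rate[col])) for col in flagged.index}

# Tiny made-up frames: discount_pct gains a null, order_amount does not.
ref_df = pd.DataFrame({
    "discount_pct": [5.0, 10.0, 0.0, 15.0, 20.0],
    "order_amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})
curr_df = pd.DataFrame({
    "discount_pct": [5.0, None, 0.0, 15.0, 20.0],
    "order_amount": [12.0, 18.0, 33.0, 41.0, 55.0],
})

issues = null_rate_increases(ref_df, curr_df)
print(issues)
```

Only `discount_pct` is flagged here, since its null rate goes from 0% to 20% while `order_amount` stays fully populated.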

Testing

45 tests covering every component:

$ python -m pytest tests/ -v
tests/test_schema.py          7 passed
tests/test_distributions.py  10 passed
tests/test_statistics.py      7 passed
tests/test_quality.py         7 passed
tests/test_detector.py        9 passed
tests/test_cli.py             5 passed
======================== 45 passed in 2.10s ====================

Sample Output

Running against the included e-commerce sales demo data:

🔴 Overall Drift Score: 84.7/100  [CRITICAL]

📋 Schema Changes
  city         — removed  🔴 critical
  customer_age — added    ℹ️  info
  discount_pct — nullable 🟡 medium

📊 Distribution Drift — 6/11 columns drifted
  delivery_days    PSI=0.6355  🔴 critical
  rating           PSI=0.3897  🔴 critical
  product_category PSI=0.3356  🔴 critical
  payment_method   PSI=0.1776  🟡 medium

⚠️ Quality Issues
  discount_pct — null rate: 0% → 19.24% 🟠 high
  order_amount — range expanded         🟡 medium
  payment_method — new: {crypto}        🔵 low
  product_category — new: {AI_Tools}    🔵 low

Tech Stack

| Component | Technology |
|-----------|------------|
| Statistics | scipy, numpy |
| Data | pandas |
| Visualization | Plotly (interactive charts) |
| Reports | Jinja2 (HTML templates) |
| CLI | Click + Rich |
| Testing | pytest (45 tests) |
| CI/CD | GitHub Actions |

Try It

git clone https://github.com/hajirufai/datadrift.git
cd datadrift
pip install -r requirements.txt
python sample_data/generate_samples.py
python -m datadrift.cli compare sample_data/reference_sales.csv sample_data/current_sales.csv --report html -o demo_report.html

Open demo_report.html in your browser — interactive charts and all.

DataDrift is open source under the MIT license. Check it out on GitHub and star it if you find it useful!


Top comments (1)

Maya Andersson • May 25

Solid framing on statistical rigor. Two methodology notes from running drift detection on production LLM traces. First, KS-test on feature marginals misses semantic drift entirely. Two distributions can be marginally identical and semantically very different. We added MMD with a learned kernel for the semantic signal. Second, the multiple-testing correction matters more than people expect. If you are running a per-feature KS test on 50 features daily, with alpha 0.05 you get about 2.5 false alarms per day before Bonferroni. After Bonferroni you get under 0.1. The framework looks clean. Did you build in any correction step, and how do you decide when a drift signal is actionable rather than alert fatigue?