Haji Rufai

Building a Data Drift Detection Framework in Python with Statistical Rigor

Data drift — the silent killer of ML models and data pipelines. Your model worked perfectly in production for months, then gradually its predictions started degrading. The culprit? The data changed under your feet.

I built DataDrift, a Python framework that detects schema changes, distribution shifts, and data quality degradation using rigorous statistical methods. Here's how and why.

The Problem

In production ML and data systems, data drift is one of the most frequently cited causes of model failures. Common scenarios:

  • Feature distribution shifts — Customer behavior changes seasonally
  • Schema breaks — Upstream team renames a column
  • Data quality degradation — A pipeline starts producing more nulls
  • New categories appear — A new payment method gets added

Without monitoring, these issues silently degrade your system. DataDrift catches them before they cause damage.

Architecture

┌─────────────────────────────────────────────┐
│              User Interface Layer             │
│   CLI | Python SDK | HTML Reports             │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│             Detection Engine                  │
│                                               │
│  Schema Drift   │ Distribution  │ Statistics  │
│  • New/removed  │ • KS test     │ • Mean shift│
│  • Type changed │ • Chi² test   │ • Null rate │
│  • Nullable     │ • PSI         │ • Quantiles │
│                 │ • Wasserstein │ • Cardinality│
│                 │ • Jensen-     │             │
│                 │   Shannon     │             │
│                                               │
│  ┌───────────────────────────────────────┐   │
│  │        Data Quality Checks            │   │
│  │  Missing values, duplicates, ranges,  │   │
│  │  correlation drift, constant columns  │   │
│  └───────────────────────────────────────┘   │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  HTML Report │ JSON Report │ CLI Summary     │
└─────────────────────────────────────────────┘
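
The schema drift box in the diagram (new/removed columns, type changes, nullability) can be approximated with plain pandas. This is a minimal sketch, not DataDrift's actual implementation; `schema_diff` and its result keys are hypothetical names chosen for illustration, and "became nullable" is assumed to mean "had no nulls in the reference, has some now".

```python
import pandas as pd


def schema_diff(ref_df: pd.DataFrame, curr_df: pd.DataFrame) -> dict:
    """Compare column names, dtypes, and nullability of two DataFrames."""
    ref_cols, curr_cols = set(ref_df.columns), set(curr_df.columns)
    shared = ref_cols & curr_cols
    return {
        "added": sorted(curr_cols - ref_cols),
        "removed": sorted(ref_cols - curr_cols),
        "type_changed": sorted(
            col for col in shared if ref_df[col].dtype != curr_df[col].dtype
        ),
        # Assumption: nullable means nulls appear where the reference had none
        "became_nullable": sorted(
            col
            for col in shared
            if not ref_df[col].isna().any() and curr_df[col].isna().any()
        ),
    }


# Tiny made-up frames that mimic the kinds of change described in this post
ref = pd.DataFrame({"city": ["Oslo", "Lagos"], "discount_pct": [5.0, 10.0]})
curr = pd.DataFrame({"customer_age": [31, 45], "discount_pct": [5.0, None]})

diff = schema_diff(ref, curr)
print(diff)
```

A real detector would also attach a severity to each change; that mapping is left out here because it depends on project policy.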

Statistical Methods — The Core Engine

Population Stability Index (PSI)

PSI is the industry-standard metric for drift detection, widely used in banking and fintech:

import numpy as np

def _compute_psi_numerical(ref_values, curr_values, n_bins=10):
    epsilon = 1e-4  # keeps log() finite when a bin is empty
    # Equal-width bins over the reference range
    bin_edges = np.linspace(ref_values.min(), ref_values.max(), n_bins + 1)
    # Widen the outer bins so no current value falls outside the histogram
    bin_edges[0] = min(bin_edges[0], curr_values.min()) - 0.001
    bin_edges[-1] = max(bin_edges[-1], curr_values.max()) + 0.001

    ref_counts, _ = np.histogram(ref_values, bins=bin_edges)
    curr_counts, _ = np.histogram(curr_values, bins=bin_edges)

    # Bin proportions, smoothed by epsilon
    ref_pct = ref_counts / len(ref_values) + epsilon
    curr_pct = curr_counts / len(curr_values) + epsilon

    # PSI = sum over bins of (curr - ref) * ln(curr / ref)
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return float(psi)

Interpretation:
| PSI Value | Meaning |
|-----------|---------|
| < 0.1 | Stable — no action needed |
| 0.1 – 0.2 | Moderate drift — monitor |
| ≥ 0.2 | Significant drift — investigate |
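
To make the thresholds in the table concrete, here is a small sketch that recomputes PSI with the same equal-width binning as the function above and maps the result to the three bands. `psi` and `classify_psi` are illustrative helpers, not part of DataDrift's public API, and the synthetic normal samples are an assumption made for the demo.

```python
import numpy as np


def psi(ref_values, curr_values, n_bins=10, epsilon=1e-4):
    """PSI with equal-width bins taken from the reference sample."""
    edges = np.linspace(ref_values.min(), ref_values.max(), n_bins + 1)
    edges[0] = min(edges[0], curr_values.min()) - 0.001
    edges[-1] = max(edges[-1], curr_values.max()) + 0.001
    ref_pct = np.histogram(ref_values, bins=edges)[0] / len(ref_values) + epsilon
    curr_pct = np.histogram(curr_values, bins=edges)[0] / len(curr_values) + epsilon
    return float(np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct)))


def classify_psi(value):
    """Map a PSI value to the bands from the interpretation table."""
    if value < 0.1:
        return "stable"
    if value < 0.2:
        return "moderate"
    return "significant"


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation

for name, sample in [("same", same), ("shifted", shifted)]:
    value = psi(reference, sample)
    print(f"{name}: PSI={value:.4f} -> {classify_psi(value)}")
```

Note that the band boundaries are conventions rather than statistical significance levels, so they are worth revisiting per dataset.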

Kolmogorov-Smirnov Test

The KS test compares the empirical CDFs of two samples:

from scipy import stats

# ref_clean / curr_clean: the column's values with NaNs dropped
ks_stat, ks_p = stats.ks_2samp(ref_clean, curr_clean)
# ks_p < 0.05 → reject the hypothesis that both samples share one distribution
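
For a runnable version of the snippet above, the sketch below builds two synthetic columns, drops missing values, and runs the two-sample KS test. How DataDrift actually prepares `ref_clean` and `curr_clean` is not shown in this post, so the `dropna()` step and the sample data are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
ref = pd.Series(rng.normal(0.0, 1.0, 2000))
curr = pd.Series(rng.normal(0.5, 1.0, 2000))  # mean shifted by 0.5
curr.iloc[:50] = np.nan                        # simulate missing values

# Assumed cleaning step: the KS test needs NaN-free numeric arrays
ref_clean = ref.dropna().to_numpy()
curr_clean = curr.dropna().to_numpy()

result = stats.ks_2samp(ref_clean, curr_clean)
drifted = result.pvalue < 0.05

print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}, drifted={drifted}")
```

With thousands of rows the test becomes sensitive to very small shifts, so it can help to read the KS statistic (an effect size between 0 and 1) alongside the p-value.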

Chi-Squared Test (for Categories)

For categorical columns, we use a contingency table:

# ref_counts / curr_counts: per-category counts, aligned to the same category order
observed = np.array([ref_counts, curr_counts])
chi2, p, _, _ = stats.chi2_contingency(observed)
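
The contingency table only makes sense if both count vectors cover the same categories in the same order, including categories that exist in just one dataset. A minimal sketch of that alignment step, using made-up payment data (the variable names mirror the snippet above, but this is not necessarily how DataDrift builds them):

```python
import numpy as np
import pandas as pd
from scipy import stats

ref = pd.Series(["card"] * 600 + ["paypal"] * 400)
curr = pd.Series(["card"] * 450 + ["paypal"] * 400 + ["crypto"] * 150)

# Union of categories, so a category missing on one side gets a count of 0
categories = sorted(set(ref.unique()) | set(curr.unique()))
ref_counts = ref.value_counts().reindex(categories, fill_value=0).to_numpy()
curr_counts = curr.value_counts().reindex(categories, fill_value=0).to_numpy()

observed = np.array([ref_counts, curr_counts])  # shape: (2, n_categories)
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(categories)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}")
```

The chi-squared approximation is usually considered unreliable when expected counts are very small, so rare categories may need to be pooled before testing.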

Wasserstein Distance

Also called "Earth Mover's Distance" — measures the minimum "work" to transform one distribution into another:

wass = stats.wasserstein_distance(ref_clean, curr_clean)
# Normalize by reference std for comparability
ref_std = np.std(ref_clean)
wass_normalized = wass / ref_std if ref_std > 0 else wass
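
Why normalize? The raw distance is expressed in the column's own units, so the same drift looks 100 times larger if a column switches from metres to centimetres. The sketch below illustrates this with synthetic data; `normalized_wasserstein` is a hypothetical helper that wraps the two lines above.

```python
import numpy as np
from scipy import stats


def normalized_wasserstein(ref, curr):
    """Wasserstein distance divided by the reference std (raw distance if std is 0)."""
    wass = stats.wasserstein_distance(ref, curr)
    ref_std = np.std(ref)
    return wass / ref_std if ref_std > 0 else wass


rng = np.random.default_rng(7)
ref_clean = rng.normal(10.0, 2.0, 1000)
curr_clean = ref_clean + 1.0  # pure location shift of one unit

raw = stats.wasserstein_distance(ref_clean, curr_clean)
norm = normalized_wasserstein(ref_clean, curr_clean)

# Same data expressed in a unit 100x smaller (e.g. metres -> centimetres)
raw_cm = stats.wasserstein_distance(ref_clean * 100, curr_clean * 100)
norm_cm = normalized_wasserstein(ref_clean * 100, curr_clean * 100)

print(f"raw:        {raw:.3f} vs {raw_cm:.3f}")    # scales with the unit
print(f"normalized: {norm:.3f} vs {norm_cm:.3f}")  # unit-independent
```

For a pure shift the raw distance equals the size of the shift, which makes the normalized value readable as "shift in reference standard deviations".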

Usage — Three Ways

1. Python SDK

from datadrift import DriftDetector
import pandas as pd

ref_df = pd.read_csv("reference_data.csv")
curr_df = pd.read_csv("current_data.csv")

detector = DriftDetector(
    psi_threshold=0.2,
    p_value_threshold=0.05,
)

report = detector.compare(ref_df, curr_df)

print(f"Score: {report.overall_score}/100")
print(f"Drifted: {report.drifted_columns}")

report.to_html("drift_report.html")
report.to_json("drift_report.json")

2. CLI

# Quick comparison
datadrift compare reference.csv current.csv --summary-only

# Generate HTML report
datadrift compare ref.csv curr.csv --report html -o report.html

Exit codes make CI/CD integration trivial:

  • 0 — No drift
  • 1 — Moderate/high drift
  • 2 — Critical drift

3. CI/CD Pipeline

- name: Data Drift Check
  run: |
    datadrift compare data/baseline.csv data/latest.csv \
      --report json -o drift.json
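
A CI step fails on any non-zero exit code by default, which would block the build even on moderate drift. If you would rather fail only on critical drift, a thin wrapper can interpret the exit codes listed above. This is a sketch under that assumption: `should_fail_build` and `run_gate` are hypothetical helpers, and only the `datadrift compare ... --summary-only` invocation comes from the CLI examples in this post.

```python
import subprocess


def should_fail_build(exit_code: int, fail_on_moderate: bool = False) -> bool:
    """Decide whether a drift-check exit code should fail the pipeline."""
    if exit_code == 0:      # no drift
        return False
    if exit_code == 1:      # moderate/high drift
        return fail_on_moderate
    return True             # 2 = critical drift; anything else is treated as an error


def run_gate(reference_csv: str, current_csv: str, fail_on_moderate: bool = False) -> int:
    """Run the CLI and return the exit code the CI step should report."""
    completed = subprocess.run(
        ["datadrift", "compare", reference_csv, current_csv, "--summary-only"],
        check=False,  # we inspect the return code ourselves
    )
    return 1 if should_fail_build(completed.returncode, fail_on_moderate) else 0
```

In a pipeline you could call `sys.exit(run_gate("data/baseline.csv", "data/latest.csv"))` from a small script instead of invoking the CLI directly.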

HTML Report — The Flagship Feature

The HTML report is a single self-contained file with:

  • Overall drift score (0-100) with severity ring
  • Schema comparison table with color-coded changes
  • Interactive Plotly charts — distribution overlays per column
  • Collapsible stats tables — before/after for every statistic
  • Quality issues — sortable by severity

All CSS (Tailwind), JS (Plotly), and data are embedded — no external dependencies. Share it via email, S3, or Jira.

Data Quality Checks

Beyond distribution drift, DataDrift catches quality issues:

# Detects:
# - Null rate increase (>5% change flagged)
# - Range violations (values outside reference range)
# - New/removed categories
# - Constant columns (was diverse, now single value)
# - Correlation drift between column pairs
# - Duplicate rate changes
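
As an illustration of the first check in that list, here is a minimal pandas sketch of a null-rate comparison. It assumes ">5% change" means an absolute increase of more than 5 percentage points; `null_rate_increases` is a hypothetical helper, not DataDrift's own function.

```python
import pandas as pd


def null_rate_increases(ref_df, curr_df, threshold=0.05):
    """Return {column: (ref_rate, curr_rate)} where the null rate rose by more than threshold."""
    shared = [col for col in ref_df.columns if col in curr_df.columns]
    ref_rates = ref_df[shared].isna().mean()    # fraction of nulls per column
    curr_rates = curr_df[shared].isna().mean()
    delta = curr_rates - ref_rates
    return {
        col: (float(ref_rates[col]), float(curr_rates[col]))
        for col in delta[delta > threshold].index
    }


# Made-up data: discount_pct goes from 0% to 20% nulls, order_amount stays complete
ref_df = pd.DataFrame({"discount_pct": [5.0] * 100, "order_amount": [20.0] * 100})
curr_df = pd.DataFrame(
    {"discount_pct": [5.0] * 80 + [None] * 20, "order_amount": [20.0] * 100}
)

flagged = null_rate_increases(ref_df, curr_df)
print(flagged)
```

The other checks in the list follow the same pattern: compute a per-column summary on both datasets, then flag the columns whose difference crosses a threshold.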

Testing

45 tests covering every component:

$ python -m pytest tests/ -v
tests/test_schema.py          7 passed
tests/test_distributions.py  10 passed
tests/test_statistics.py      7 passed
tests/test_quality.py         7 passed
tests/test_detector.py        9 passed
tests/test_cli.py             5 passed
======================== 45 passed in 2.10s ====================

Sample Output

Running against the included e-commerce sales demo data:

🔴 Overall Drift Score: 84.7/100  [CRITICAL]

📋 Schema Changes
  city         — removed  🔴 critical
  customer_age — added    ℹ️  info
  discount_pct — nullable 🟡 medium

📊 Distribution Drift — 6/11 columns drifted
  delivery_days    PSI=0.6355  🔴 critical
  rating           PSI=0.3897  🔴 critical
  product_category PSI=0.3356  🔴 critical
  payment_method   PSI=0.1776  🟡 medium

⚠️ Quality Issues
  discount_pct — null rate: 0% → 19.24% 🟠 high
  order_amount — range expanded         🟡 medium
  payment_method — new: {crypto}        🔵 low
  product_category — new: {AI_Tools}    🔵 low

Tech Stack

| Component | Technology |
|-----------|------------|
| Statistics | scipy, numpy |
| Data | pandas |
| Visualization | Plotly (interactive charts) |
| Reports | Jinja2 (HTML templates) |
| CLI | Click + Rich |
| Testing | pytest (45 tests) |
| CI/CD | GitHub Actions |

Try It

git clone https://github.com/hajirufai/datadrift.git
cd datadrift
pip install -r requirements.txt
python sample_data/generate_samples.py
python -m datadrift.cli compare sample_data/reference_sales.csv sample_data/current_sales.csv --report html -o demo_report.html

Open demo_report.html in your browser — interactive charts and all.


DataDrift is open source under the MIT license. Check it out on GitHub and star it if you find it useful!

#python #datascience #mlops #dataengineering

Top comments (1)

Maya Andersson

Solid framing on statistical rigor. Two methodology notes from running drift detection on production LLM traces. First, KS-test on feature marginals misses semantic drift entirely. Two distributions can be marginally identical and semantically very different. We added MMD with a learned kernel for the semantic signal. Second, the multiple-testing correction matters more than people expect. If you are running a per-feature KS test on 50 features daily, with alpha 0.05 you get about 2.5 false alarms per day before Bonferroni. After Bonferroni you get under 0.1. The framework looks clean. Did you build in any correction step, and how do you decide when a drift signal is actionable rather than alert fatigue?
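
The false-alarm arithmetic in this comment can be checked in a few lines. The sketch assumes all 50 features are truly stable and that the per-feature tests are independent; the variable names and the example p-values are illustrative only.

```python
n_features = 50
alpha = 0.05

# Without correction: each stable feature has an alpha chance of a false alarm
expected_false_alarms = n_features * alpha

# Bonferroni: test each feature at alpha / n_features instead
bonferroni_alpha = alpha / n_features
expected_after_correction = n_features * bonferroni_alpha

# Chance of at least one false alarm per daily run (independence assumed)
fwer_uncorrected = 1 - (1 - alpha) ** n_features
fwer_corrected = 1 - (1 - bonferroni_alpha) ** n_features

print(f"expected false alarms/day: {expected_false_alarms:.2f} -> {expected_after_correction:.2f}")
print(f"P(at least one false alarm): {fwer_uncorrected:.3f} -> {fwer_corrected:.3f}")

# Applying the corrected threshold to per-feature p-values (made-up numbers)
p_values = {"delivery_days": 0.0002, "rating": 0.03, "order_amount": 0.4}
drifted = [name for name, p in p_values.items() if p < bonferroni_alpha]
print(drifted)
```

Bonferroni is conservative when features are correlated, so it trades some sensitivity for fewer alerts.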