Data drift is a silent killer of ML models and data pipelines. Your model worked well in production for months, then its predictions gradually started degrading. The culprit? The data changed under your feet.
I built DataDrift, a Python framework that detects schema changes, distribution shifts, and data quality degradation using rigorous statistical methods. Here's how and why.
The Problem
In production ML/data systems, data drift is one of the most common causes of model failure (by a rough, unsourced estimate, ~90% of cases). Common scenarios:
- Feature distribution shifts — Customer behavior changes seasonally
- Schema breaks — Upstream team renames a column
- Data quality degradation — A pipeline starts producing more nulls
- New categories appear — A new payment method gets added
Without monitoring, these issues silently degrade your system. DataDrift catches them before they cause damage.
Architecture
┌───────────────────────────────────────────────────┐
│               User Interface Layer                │
│          CLI | Python SDK | HTML Reports          │
└─────────────────────────┬─────────────────────────┘
                          │
┌─────────────────────────▼─────────────────────────┐
│                 Detection Engine                  │
│                                                   │
│ Schema Drift   │ Distribution     │ Statistics    │
│ • New/removed  │ • KS test        │ • Mean shift  │
│ • Type changed │ • Chi² test      │ • Null rate   │
│ • Nullable     │ • PSI            │ • Quantiles   │
│                │ • Wasserstein    │ • Cardinality │
│                │ • Jensen-Shannon │               │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │             Data Quality Checks             │  │
│  │     Missing values, duplicates, ranges,     │  │
│  │     correlation drift, constant columns     │  │
│  └─────────────────────────────────────────────┘  │
└─────────────────────────┬─────────────────────────┘
                          │
┌─────────────────────────▼─────────────────────────┐
│      HTML Report │ JSON Report │ CLI Summary      │
└───────────────────────────────────────────────────┘
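To make the Schema Drift column of the diagram concrete, here is a minimal sketch of how added, removed, type-changed and newly nullable columns could be found with plain pandas. The function name `schema_changes` and the tuple output are assumptions made for illustration; this is not DataDrift's internal code.

```python
import pandas as pd

def schema_changes(ref_df: pd.DataFrame, cur_df: pd.DataFrame):
    """Return (column, change) pairs. Hypothetical helper, not the DataDrift API."""
    changes = []
    # Columns present only in the reference frame were removed.
    for col in ref_df.columns.difference(cur_df.columns):
        changes.append((col, "removed"))
    # Columns present only in the current frame were added.
    for col in cur_df.columns.difference(ref_df.columns):
        changes.append((col, "added"))
    # Shared columns: compare dtypes first, then check for newly appearing nulls.
    for col in ref_df.columns.intersection(cur_df.columns):
        if ref_df[col].dtype != cur_df[col].dtype:
            changes.append((col, "type_changed"))
        elif not ref_df[col].isna().any() and cur_df[col].isna().any():
            changes.append((col, "nullable"))
    return changes

ref_df = pd.DataFrame({"city": ["Lagos", "Abuja"], "discount_pct": [0.1, 0.2]})
cur_df = pd.DataFrame({"customer_age": [30, 41], "discount_pct": [0.1, None]})
changes = schema_changes(ref_df, cur_df)
print(changes)
```

Note that an integer column that gains nulls is usually upcast to float by pandas, so in this sketch it would be reported as a type change rather than as nullable.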
Statistical Methods — The Core Engine
Population Stability Index (PSI)
PSI is a de facto standard drift metric in banking and fintech, where it is widely used to monitor credit-scoring models:
import numpy as np

def _compute_psi_numerical(ref_values, curr_values, n_bins=10):
    epsilon = 1e-4  # keeps empty bins from producing log(0) or division by zero
    # Equal-width bins over the reference range, widened to cover current values
    bin_edges = np.linspace(ref_values.min(), ref_values.max(), n_bins + 1)
    bin_edges[0] = min(bin_edges[0], curr_values.min()) - 0.001
    bin_edges[-1] = max(bin_edges[-1], curr_values.max()) + 0.001
    ref_counts, _ = np.histogram(ref_values, bins=bin_edges)
    curr_counts, _ = np.histogram(curr_values, bins=bin_edges)
    # Per-bin proportions, smoothed by epsilon
    ref_pct = ref_counts / len(ref_values) + epsilon
    curr_pct = curr_counts / len(curr_values) + epsilon
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return float(psi)
Interpretation:
| PSI Value | Meaning |
|-----------|---------|
| < 0.1 | Stable — no action needed |
| 0.1 – 0.2 | Moderate drift — monitor |
| ≥ 0.2 | Significant drift — investigate |
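To get a feel for these thresholds, here is a small self-contained sketch that scores two synthetic samples against a reference. The `psi` helper restates the same equal-width binning as the function above so the block runs on its own; `psi_label` and the synthetic normal data are assumptions added for illustration.

```python
import numpy as np

def psi(ref, cur, n_bins=10, eps=1e-4):
    """Equal-width PSI, mirroring the binning of the function shown earlier."""
    edges = np.linspace(ref.min(), ref.max(), n_bins + 1)
    edges[0] = min(edges[0], cur.min()) - 1e-3
    edges[-1] = max(edges[-1], cur.max()) + 1e-3
    ref_pct = np.histogram(ref, bins=edges)[0] / len(ref) + eps
    cur_pct = np.histogram(cur, bins=edges)[0] / len(cur) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def psi_label(value):
    """Map a PSI value to the bands of the interpretation table (hypothetical helper)."""
    if value < 0.1:
        return "stable"
    if value < 0.2:
        return "moderate"
    return "significant"

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(f"same distribution: PSI={psi_same:.4f} ({psi_label(psi_same)})")
print(f"shifted by 1 std:  PSI={psi_shifted:.4f} ({psi_label(psi_shifted)})")
```

The 0.1 and 0.2 cut-offs are rules of thumb, so it is worth checking how they behave on your own sample sizes and bin counts before alerting on them.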
Kolmogorov-Smirnov Test
The KS test compares the empirical CDFs of two samples:
from scipy import stats
ks_stat, ks_p = stats.ks_2samp(ref_clean, curr_clean)
# p < 0.05 → distributions are significantly different
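The snippet above assumes `ref_clean` and `curr_clean` already exist. A minimal sketch of the full step, assuming that "clean" means dropping missing values, might look like this. The wrapper `ks_drift` and its `alpha` parameter are illustrative names, not DataDrift's API.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_drift(ref: pd.Series, cur: pd.Series, alpha: float = 0.05):
    """Two-sample KS test on the non-null values of two numeric columns."""
    ref_clean = ref.dropna().to_numpy()
    cur_clean = cur.dropna().to_numpy()
    result = stats.ks_2samp(ref_clean, cur_clean)
    # statistic = largest gap between the two empirical CDFs, in [0, 1]
    return float(result.statistic), float(result.pvalue), bool(result.pvalue < alpha)

rng = np.random.default_rng(1)
ref = pd.Series(rng.normal(0.0, 1.0, 2000))
ref.iloc[:50] = np.nan                        # simulate missing values
cur = pd.Series(rng.normal(0.5, 1.0, 2000))   # mean shifted by half a std

ks_stat, ks_p, drifted = ks_drift(ref, cur)
print(f"KS statistic={ks_stat:.3f}, p={ks_p:.2e}, drifted={drifted}")
```

With large samples the p-value becomes very sensitive, so even practically irrelevant differences can fall below 0.05. Reading the KS statistic alongside the p-value helps here.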
Chi-Squared Test (for Categories)
For categorical columns, we use a contingency table:
observed = np.array([ref_counts, curr_counts])
chi2, p, _, _ = stats.chi2_contingency(observed)
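For the contingency table to be valid, both rows must count the same categories in the same order, including categories that appear in only one dataset. Here is a hedged sketch of one way to build it with pandas; `chi2_drift` is a hypothetical helper, and the payment-method data is made up for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

def chi2_drift(ref: pd.Series, cur: pd.Series):
    """Chi-squared test on category counts aligned over the union of categories."""
    ref_counts = ref.value_counts()
    cur_counts = cur.value_counts()
    # Union of categories, so a category missing on one side is counted as 0.
    categories = sorted(set(ref_counts.index) | set(cur_counts.index))
    observed = np.array([
        ref_counts.reindex(categories, fill_value=0).to_numpy(),
        cur_counts.reindex(categories, fill_value=0).to_numpy(),
    ])
    chi2, p, dof, _ = stats.chi2_contingency(observed)
    return float(chi2), float(p), int(dof)

ref = pd.Series(["card"] * 600 + ["paypal"] * 400)
cur = pd.Series(["card"] * 500 + ["paypal"] * 350 + ["crypto"] * 150)

chi2, p, dof = chi2_drift(ref, cur)
print(f"chi2={chi2:.1f}, p={p:.2e}, dof={dof}")
```

Because every category in the union occurs at least once, no column of the table sums to zero, which `chi2_contingency` would otherwise reject.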
Wasserstein Distance
Also called "Earth Mover's Distance" — measures the minimum "work" to transform one distribution into another:
wass = stats.wasserstein_distance(ref_clean, curr_clean)
# Normalize by reference std for comparability
wass_normalized = wass / np.std(ref_clean) if np.std(ref_clean) > 0 else wass
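The reason for normalizing is that the raw distance carries the units of the column. The sketch below, which uses synthetic data and a hypothetical helper `normalized_wasserstein`, applies the same relative shift to a column on two different scales: the raw distances differ by the scale factor, while the normalized values agree.

```python
import numpy as np
from scipy import stats

def normalized_wasserstein(ref, cur):
    """Wasserstein distance divided by the reference std (raw distance if std is 0)."""
    ref = np.asarray(ref, dtype=float)
    cur = np.asarray(cur, dtype=float)
    dist = stats.wasserstein_distance(ref, cur)
    scale = np.std(ref)
    return dist / scale if scale > 0 else dist

rng = np.random.default_rng(42)
base = rng.normal(0.0, 1.0, 5000)

# Same shift expressed on two scales, e.g. a unit-scale score and the same score times 100.
raw_small = stats.wasserstein_distance(base, base + 0.5)
raw_large = stats.wasserstein_distance(base * 100, (base + 0.5) * 100)
norm_small = normalized_wasserstein(base, base + 0.5)
norm_large = normalized_wasserstein(base * 100, (base + 0.5) * 100)

print(f"raw:        {raw_small:.3f} vs {raw_large:.3f}")
print(f"normalized: {norm_small:.3f} vs {norm_large:.3f}")
```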
Usage — Three Ways
1. Python SDK
from datadrift import DriftDetector
import pandas as pd
ref_df = pd.read_csv("reference_data.csv")
curr_df = pd.read_csv("current_data.csv")
detector = DriftDetector(
    psi_threshold=0.2,
    p_value_threshold=0.05,
)
report = detector.compare(ref_df, curr_df)
print(f"Score: {report.overall_score}/100")
print(f"Drifted: {report.drifted_columns}")
report.to_html("drift_report.html")
report.to_json("drift_report.json")
2. CLI
# Quick comparison
datadrift compare reference.csv current.csv --summary-only
# Generate HTML report
datadrift compare ref.csv curr.csv --report html -o report.html
Exit codes make CI/CD integration trivial:
- 0: No drift
- 1: Moderate/high drift
- 2: Critical drift
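If you call the CLI from a Python orchestration script rather than directly from a shell step, the exit codes can be interpreted along these lines. This is a sketch under assumptions: `run_drift_check` and its `fail_on` parameter are hypothetical, and the demo uses a stand-in command that exits with code 1 so the block runs without DataDrift installed.

```python
import subprocess
import sys

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "no drift", 1: "moderate/high drift", 2: "critical drift"}

def run_drift_check(command, fail_on=2):
    """Run a command and decide whether the pipeline should fail.

    `fail_on` is a hypothetical knob: the lowest exit code treated as a failure.
    Any exit code outside 0-2 is treated as a failure too.
    """
    completed = subprocess.run(command, check=False)
    code = completed.returncode
    meaning = EXIT_MEANINGS.get(code, "unexpected exit code")
    should_fail = code >= fail_on or code not in EXIT_MEANINGS
    return code, meaning, should_fail

# Stand-in for: ["datadrift", "compare", "reference.csv", "current.csv", "--summary-only"]
stand_in = [sys.executable, "-c", "import sys; sys.exit(1)"]
code, meaning, should_fail = run_drift_check(stand_in)
print(f"exit code {code}: {meaning}, fail pipeline: {should_fail}")
```

Setting `fail_on=1` would make the pipeline stop on moderate drift as well, which may suit stricter environments.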
3. CI/CD Pipeline
- name: Data Drift Check
  run: |
    datadrift compare data/baseline.csv data/latest.csv \
      --report json -o drift.json
HTML Report — The Flagship Feature
The HTML report is a single self-contained file with:
- Overall drift score (0-100) with severity ring
- Schema comparison table with color-coded changes
- Interactive Plotly charts — distribution overlays per column
- Collapsible stats tables — before/after for every statistic
- Quality issues — sortable by severity
All CSS (Tailwind), JS (Plotly), and data are embedded — no external dependencies. Share it via email, S3, or Jira.
Data Quality Checks
Beyond distribution drift, DataDrift catches quality issues:
# Detects:
# - Null rate increase (>5% change flagged)
# - Range violations (values outside reference range)
# - New/removed categories
# - Constant columns (was diverse, now single value)
# - Correlation drift between column pairs
# - Duplicate rate changes
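Two of these checks are easy to sketch with pandas. The version below assumes the ">5%" rule means an increase of more than 5 percentage points in null rate, which is one possible reading. The function names and the toy data are illustrative, not DataDrift's implementation.

```python
import pandas as pd

def null_rate_increases(ref_df, cur_df, threshold=0.05):
    """Columns whose null rate rose by more than `threshold` (absolute, assumed)."""
    shared = [c for c in ref_df.columns if c in cur_df.columns]
    delta = cur_df[shared].isna().mean() - ref_df[shared].isna().mean()
    return delta[delta > threshold].to_dict()

def new_categories(ref: pd.Series, cur: pd.Series):
    """Category values seen in the current data but not in the reference."""
    return set(cur.dropna().unique()) - set(ref.dropna().unique())

ref_df = pd.DataFrame({
    "discount_pct": [5.0, 10.0, 0.0, 15.0],
    "payment_method": ["card", "paypal", "card", "card"],
})
cur_df = pd.DataFrame({
    "discount_pct": [5.0, None, 0.0, None],
    "payment_method": ["card", "crypto", "paypal", "card"],
})

flagged = null_rate_increases(ref_df, cur_df)
added = new_categories(ref_df["payment_method"], cur_df["payment_method"])
print(flagged)
print(added)
```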
Testing
45 tests covering every component:
$ python -m pytest tests/ -v
tests/test_schema.py 7 passed
tests/test_distributions.py 10 passed
tests/test_statistics.py 7 passed
tests/test_quality.py 7 passed
tests/test_detector.py 9 passed
tests/test_cli.py 5 passed
======================== 45 passed in 2.10s ====================
Sample Output
Running against the included e-commerce sales demo data:
🔴 Overall Drift Score: 84.7/100 [CRITICAL]
📋 Schema Changes
city — removed 🔴 critical
customer_age — added ℹ️ info
discount_pct — nullable 🟡 medium
📊 Distribution Drift — 6/11 columns drifted
delivery_days PSI=0.6355 🔴 critical
rating PSI=0.3897 🔴 critical
product_category PSI=0.3356 🔴 critical
payment_method PSI=0.1776 🟡 medium
⚠️ Quality Issues
discount_pct — null rate: 0% → 19.24% 🟠 high
order_amount — range expanded 🟡 medium
payment_method — new: {crypto} 🔵 low
product_category — new: {AI_Tools} 🔵 low
Tech Stack
| Component | Technology |
|---|---|
| Statistics | scipy, numpy |
| Data | pandas |
| Visualization | Plotly (interactive charts) |
| Reports | Jinja2 (HTML templates) |
| CLI | Click + Rich |
| Testing | pytest (45 tests) |
| CI/CD | GitHub Actions |
Try It
git clone https://github.com/hajirufai/datadrift.git
cd datadrift
pip install -r requirements.txt
python sample_data/generate_samples.py
python -m datadrift.cli compare sample_data/reference_sales.csv sample_data/current_sales.csv --report html -o demo_report.html
Open demo_report.html in your browser — interactive charts and all.
DataDrift is open source under the MIT license. Check it out on GitHub and star it if you find it useful!
Top comments (1)
Solid framing on statistical rigor. Two methodology notes from running drift detection on production LLM traces. First, KS-test on feature marginals misses semantic drift entirely. Two distributions can be marginally identical and semantically very different. We added MMD with a learned kernel for the semantic signal. Second, the multiple-testing correction matters more than people expect. If you are running a per-feature KS test on 50 features daily, with alpha 0.05 you get about 2.5 false alarms per day before Bonferroni. After Bonferroni you get under 0.1. The framework looks clean. Did you build in any correction step, and how do you decide when a drift signal is actionable rather than alert fatigue?
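The false-alarm arithmetic in this comment can be checked with a short sketch. It assumes 50 independent per-feature tests with no real drift, and it is not a statement about what DataDrift does internally. `bonferroni_flags` is a hypothetical helper, and the p-values are made up.

```python
import numpy as np

def bonferroni_flags(p_values, alpha=0.05):
    """Flag tests whose p-value is below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

n_features, alpha = 50, 0.05

# Expected false alarms per run when no feature has actually drifted.
expected_uncorrected = n_features * alpha               # 50 * 0.05
expected_corrected = n_features * (alpha / n_features)  # 50 * 0.001

# Made-up p-values: one strong signal, one borderline, the rest unremarkable.
p_values = np.full(n_features, 0.5)
p_values[0] = 0.0004
p_values[1] = 0.03

naive_count = int((p_values < alpha).sum())
corrected_count = int(bonferroni_flags(p_values, alpha).sum())
print(f"expected false alarms: {expected_uncorrected:.2f} uncorrected, "
      f"{expected_corrected:.2f} after Bonferroni")
print(f"flagged: {naive_count} uncorrected, {corrected_count} after Bonferroni")
```

Bonferroni is conservative, so the borderline feature at p=0.03 is no longer flagged. Whether that trade-off is acceptable depends on how costly a missed drift is compared with a false alert.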