Kaviarasan Mani
Stop Bad Data From Breaking Your Pipelines — A Python Data Quality Framework

Data breaks silently.
A null column passes through ETL.
A schema change slips into production.
An ML model trains on corrupted data.

Everything runs.
Nothing crashes.
But your metrics are wrong.

I ran into this problem repeatedly while working with Pandas and Spark pipelines.

So I built something to fix it.


🔍 The Problem

Most data pipelines:

  • Assume data is clean
  • Rely on manual checks
  • Validate schemas but not values
  • Detect problems too late

And while there are great data validation frameworks out there, I often needed something:

  • Lightweight
  • Easy to integrate
  • CI-friendly
  • Pandas + PySpark compatible
  • With built-in scoring

That’s why I built ValidateX.


💡 What Is ValidateX?

ValidateX is an open-source data quality validation framework for Python.

It supports:

  • 🐼 Pandas
  • ⚡ PySpark
  • CLI workflows
  • HTML report generation
  • Weighted data quality scoring (0–100)
  • CI/CD integration

GitHub:
https://github.com/kaviarasanmani/ValidateX

Docs:
https://validatex.readthedocs.io/en/latest/

PyPI:
https://pypi.org/project/validatex/


⚙️ Example: Validating a Pandas Dataset

Here’s a simple example:

import pandas as pd
from validatex import Validator

df = pd.DataFrame({
    "age": [25, 30, 17, None],
    "email": ["a@test.com", "b@test.com", "invalid", "c@test.com"]
})

validator = Validator(df)

validator.expect_column_not_null("age")
validator.expect_column_values_between("age", min_value=18, max_value=65)
validator.expect_column_values_to_match_regex(
    "email",
    r"^[^@]+@[^@]+\.[^@]+$"
)

result = validator.validate()

print(f"Data Quality Score: {result.score}")
validator.generate_report("report.html")
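The email check above relies on a simple regex. If you want to sanity-check what that pattern accepts before wiring it into a validator, you can try it with Python's built-in `re` module. This snippet is purely illustrative and doesn't depend on ValidateX:

```python
import re

# Same pattern as in the example above:
# one or more non-@ chars, an @, then a domain containing a dot.
EMAIL_RE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")

emails = ["a@test.com", "b@test.com", "invalid", "c@test.com"]

for email in emails:
    status = "ok" if EMAIL_RE.fullmatch(email) else "FAIL"
    print(f"{email:12s} {status}")
```

Running this shows that `"invalid"` is the only value that fails, which matches what the validator flags.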

In just a few lines, you:

  • Define expectations
  • Validate your dataset
  • Get a 0–100 quality score
  • Generate a clean HTML report

📊 Why a Data Quality Score Matters

Most validation tools give you pass/fail checks.

ValidateX calculates a weighted data quality score, allowing you to:

  • Track data health over time
  • Define minimum quality thresholds
  • Fail CI builds automatically
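ValidateX's exact weighting isn't shown in this post, but the idea behind a weighted score is simple: each check contributes its pass rate, scaled by a weight reflecting how much that check matters. Here's a minimal illustrative sketch of that idea (not ValidateX's actual algorithm; the function name and weights are my own):

```python
def weighted_quality_score(check_results: dict[str, tuple[float, float]]) -> float:
    """Combine per-check pass rates into a single 0-100 score.

    check_results maps check name -> (pass_rate, weight), where
    pass_rate is the fraction of rows that passed (0.0-1.0).
    """
    total_weight = sum(weight for _, weight in check_results.values())
    if total_weight == 0:
        return 100.0  # no checks defined -> nothing failed
    weighted = sum(rate * weight for rate, weight in check_results.values())
    return round(100.0 * weighted / total_weight, 1)


# Pass rates taken from the 4-row example dataset above.
score = weighted_quality_score({
    "age_not_null":  (0.75, 2.0),  # 3 of 4 rows non-null, weight 2
    "age_in_range":  (0.50, 1.0),  # 2 of 4 ages fall between 18 and 65
    "email_matches": (0.75, 2.0),  # 3 of 4 rows match the regex
})
print(score)  # -> 70.0
```

Weighting lets a critical check (say, a non-null primary key) drag the score down harder than a cosmetic one.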

Example CLI usage:

validatex validate data.csv --min-score 90

If quality drops below 90, the build fails.

That means bad data gets caught before it reaches production.
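The same gate works from inside a Python script: compare the score to a threshold and exit non-zero so CI marks the step as failed. A hedged sketch of that pattern (the threshold logic here is generic, not ValidateX-specific):

```python
import sys


def enforce_min_score(score: float, min_score: float = 90.0) -> int:
    """Return a process exit code: 0 if the score meets the bar, 1 otherwise."""
    if score < min_score:
        print(f"Data quality {score:.1f} below threshold {min_score:.1f}",
              file=sys.stderr)
        return 1
    print(f"Data quality {score:.1f} OK")
    return 0


# In a real pipeline, `score` would come from validator.validate().score;
# pass the result to sys.exit() so CI sees the failure.
exit_code = enforce_min_score(92.5)  # -> 0, CI step passes
```

CI systems treat any non-zero exit code as a failed step, which is all the integration you need.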


🚦 CI/CD Integration Example (GitHub Actions)

You can integrate validation directly into CI:

- name: Validate Data
  run: validatex validate data.csv --min-score 90

Now your pipeline enforces data standards automatically.


🧪 Supported Validation Types

ValidateX supports:

  • Column-level expectations
  • Table-level checks
  • Cross-column validation
  • Regex pattern checks
  • Range checks
  • Null validation
  • Custom expectation extensions
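The extension API itself isn't shown in this post, so treat the following as a hypothetical sketch of what a custom expectation boils down to in most frameworks: a callable that inspects a column of values and reports a pass rate. Every name here is illustrative, not ValidateX's real API:

```python
from typing import Callable, Optional, Sequence

# Hypothetical shape of a custom expectation: a callable that takes a
# column of values and returns the fraction that satisfy the rule.
Expectation = Callable[[Sequence[Optional[float]]], float]


def expect_positive(values: Sequence[Optional[float]]) -> float:
    """Pass rate for 'value is present and strictly positive'."""
    if not values:
        return 1.0  # vacuously true on an empty column
    passed = sum(1 for v in values if v is not None and v > 0)
    return passed / len(values)


ages = [25, 30, -3, None]
print(expect_positive(ages))  # 2 of 4 values pass -> 0.5
```

A pass rate rather than a bare pass/fail is what allows a check to feed into a weighted overall score.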

It works across Pandas and Spark environments, making it useful for both small scripts and large data pipelines.


🎯 Who Is This For?

  • Data engineers building ETL pipelines
  • ML engineers validating training datasets
  • Analytics teams enforcing schema rules
  • Startups that want lightweight data quality enforcement

If you've ever thought:

“We should probably validate this dataset…”

This tool was built for that exact moment.


🚀 What’s Next

I’m actively improving ValidateX with:

  • More built-in expectations
  • Better scoring customization
  • Profiling enhancements
  • Possible drift detection features

It’s MIT licensed and fully open source.

If you're interested, I’d love feedback on:

  • API design
  • Performance
  • Missing features
  • Real-world edge cases

GitHub:
https://github.com/kaviarasanmani/ValidateX


💬 Final Thoughts

Data validation shouldn’t be complicated.

It shouldn’t require a full ecosystem setup.

And it shouldn’t be optional.

ValidateX is my attempt to make practical, production-ready data validation simple for Python developers.

If you try it out, I’d love to hear your thoughts.

