**Data breaks silently.**
A null column passes through ETL.
A schema change slips into production.
An ML model trains on corrupted data.
Everything runs.
Nothing crashes.
But your metrics are wrong.
I ran into this problem repeatedly while working with Pandas and Spark pipelines.
So I built something to fix it.
🔍 The Problem
Most data pipelines:
- Assume data is clean
- Rely on manual checks
- Validate schemas but not values
- Detect problems too late
And while there are great data validation frameworks out there, I often needed something:
- Lightweight
- Easy to integrate
- CI-friendly
- Pandas + PySpark compatible
- With built-in scoring
That’s why I built ValidateX.
💡 What Is ValidateX?
ValidateX is an open-source data quality validation framework for Python.
It supports:
- 🐼 Pandas
- ⚡ PySpark
- CLI workflows
- HTML report generation
- Weighted data quality scoring (0–100)
- CI/CD integration
- GitHub: https://github.com/kaviarasanmani/ValidateX
- Docs: https://validatex.readthedocs.io/en/latest/
- PyPI: https://pypi.org/project/validatex/
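To try it out, installation is the usual pip route (the package name matches the PyPI page above):

```bash
pip install validatex
```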
⚙️ Example: Validating a Pandas Dataset
Here’s a simple example:
```python
import pandas as pd
from validatex import Validator

df = pd.DataFrame({
    "age": [25, 30, 17, None],
    "email": ["a@test.com", "b@test.com", "invalid", "c@test.com"]
})

validator = Validator(df)

validator.expect_column_not_null("age")
validator.expect_column_values_between("age", min_value=18, max_value=65)
validator.expect_column_values_to_match_regex(
    "email",
    r"^[^@]+@[^@]+\.[^@]+$"
)

result = validator.validate()
print(f"Data Quality Score: {result.score}")

validator.generate_report("report.html")
```
In just a few lines, you:
- Define expectations
- Validate your dataset
- Get a 0–100 quality score
- Generate a clean HTML report
📊 Why a Data Quality Score Matters
Most validation tools give you pass/fail checks.
ValidateX calculates a weighted data quality score, allowing you to:
- Track data health over time
- Define minimum quality thresholds
- Fail CI builds automatically
Example CLI usage:
```bash
validatex validate data.csv --min-score 90
```
If the quality score drops below 90, the build fails.
Bad data gets stopped before it reaches production.
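You can also enforce the same threshold programmatically. Here is a minimal sketch that reuses only the Validator API from the example above; the threshold value and the `sys.exit` handling are my own glue code, not part of ValidateX:

```python
import sys

import pandas as pd
from validatex import Validator

MIN_SCORE = 90  # same threshold as the CLI example

# Load the dataset and attach the expectations you care about.
df = pd.read_csv("data.csv")
validator = Validator(df)
validator.expect_column_not_null("age")

result = validator.validate()
print(f"Data Quality Score: {result.score}")

# Exit non-zero so an orchestrator or CI step treats low quality as a failure.
if result.score < MIN_SCORE:
    sys.exit(1)
```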
🚦 CI/CD Integration Example (GitHub Actions)
You can integrate validation directly into CI:
```yaml
- name: Validate Data
  run: validatex validate data.csv --min-score 90
```
Now your pipeline enforces data standards automatically.
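For context, a complete workflow file could look something like this. Only the last step involves ValidateX; the checkout, Python setup, and install steps are standard GitHub Actions boilerplate, and the Python version is just an illustrative choice:

```yaml
name: data-quality

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install ValidateX
        run: pip install validatex

      - name: Validate Data
        run: validatex validate data.csv --min-score 90
```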
🧪 Supported Validation Types
ValidateX supports:
- Column-level expectations
- Table-level checks
- Cross-column validation
- Regex pattern checks
- Range checks
- Null validation
- Custom expectation extensions
It works across Pandas and Spark environments, making it useful for both small scripts and large data pipelines.
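For the Spark side, here is a minimal sketch of the same idea. It assumes the Validator entry point and expectation methods mirror the Pandas example when given a Spark DataFrame, which is how the compatibility claim above reads in practice; check the docs for the exact usage:

```python
from pyspark.sql import SparkSession
from validatex import Validator

spark = SparkSession.builder.appName("validatex-demo").getOrCreate()

# Read the same dataset as a Spark DataFrame.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Assumes the Validator accepts a Spark DataFrame with the same expectation
# methods as in the Pandas example; see the docs for the exact entry point.
validator = Validator(df)
validator.expect_column_not_null("age")
validator.expect_column_values_between("age", min_value=18, max_value=65)

result = validator.validate()
print(f"Data Quality Score: {result.score}")
```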
🎯 Who Is This For?
- Data engineers building ETL pipelines
- ML engineers validating training datasets
- Analytics teams enforcing schema rules
- Startups that want lightweight data quality enforcement
If you've ever thought:
“We should probably validate this dataset…”
This tool was built for that exact moment.
🚀 What’s Next
I’m actively improving ValidateX with:
- More built-in expectations
- Better scoring customization
- Profiling enhancements
- Possible drift detection features
It’s MIT licensed and fully open source.
If you're interested, I’d love feedback on:
- API design
- Performance
- Missing features
- Real-world edge cases
GitHub: https://github.com/kaviarasanmani/ValidateX
💬 Final Thoughts
Data validation shouldn’t be complicated.
It shouldn’t require a full ecosystem setup.
And it shouldn’t be optional.
ValidateX is my attempt to make practical, production-ready data validation simple for Python developers.
If you try it out, I’d love to hear your thoughts.