Kaviarasan Mani
Stop Bad Data From Breaking Your Pipelines — A Python Data Quality Framework

Data breaks silently.
A null column passes through ETL.
A schema change slips into production.
An ML model trains on corrupted data.

Everything runs.
Nothing crashes.
But your metrics are wrong.

I ran into this problem repeatedly while working with Pandas and Spark pipelines.

So I built something to fix it.


🔍 The Problem

Most data pipelines:

  • Assume data is clean
  • Rely on manual checks
  • Validate schemas but not values
  • Detect problems too late

And while there are great data validation frameworks out there, I often needed something:

  • Lightweight
  • Easy to integrate
  • CI-friendly
  • Pandas + PySpark compatible
  • With built-in scoring

That’s why I built ValidateX.


💡 What Is ValidateX?

ValidateX is an open-source data quality validation framework for Python.

It supports:

  • 🐼 Pandas
  • ⚡ PySpark
  • CLI workflows
  • HTML report generation
  • Weighted data quality scoring (0–100)
  • CI/CD integration

GitHub:
https://github.com/kaviarasanmani/ValidateX

Docs:
https://validatex.readthedocs.io/en/latest/

PyPI:
https://pypi.org/project/validatex/


⚙️ Example: Validating a Pandas Dataset

Here’s a simple example:

import pandas as pd
from validatex import Validator

df = pd.DataFrame({
    "age": [25, 30, 17, None],
    "email": ["a@test.com", "b@test.com", "invalid", "c@test.com"]
})

validator = Validator(df)

validator.expect_column_not_null("age")
validator.expect_column_values_between("age", min_value=18, max_value=65)
validator.expect_column_values_to_match_regex(
    "email",
    r"^[^@]+@[^@]+\.[^@]+$"
)

result = validator.validate()

print(f"Data Quality Score: {result.score}")
validator.generate_report("report.html")
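The email check above relies on a simple regex. If you want to sanity-check what that pattern accepts before wiring it into a validator, you can try it with Python's built-in `re` module. This snippet is purely illustrative and doesn't depend on ValidateX:

```python
import re

# Same pattern as in the example above:
# one or more non-@ chars, an @, then a domain containing a dot.
EMAIL_RE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")

emails = ["a@test.com", "b@test.com", "invalid", "c@test.com"]

for email in emails:
    status = "ok" if EMAIL_RE.fullmatch(email) else "FAIL"
    print(f"{email:12s} {status}")
```

Running this shows that `"invalid"` is the only value that fails, which matches what the validator flags.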

In just a few lines, you:

  • Define expectations
  • Validate your dataset
  • Get a 0–100 quality score
  • Generate a clean HTML report

📊 Why a Data Quality Score Matters

Most validation tools give you pass/fail checks.

ValidateX calculates a weighted data quality score, allowing you to:

  • Track data health over time
  • Define minimum quality thresholds
  • Fail CI builds automatically
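ValidateX's exact weighting isn't shown in this post, but the idea behind a weighted score is simple: each check contributes its pass rate, scaled by a weight reflecting how much that check matters. Here's a minimal illustrative sketch of that idea (not ValidateX's actual algorithm; the function name and weights are my own):

```python
def weighted_quality_score(check_results: dict[str, tuple[float, float]]) -> float:
    """Combine per-check pass rates into a single 0-100 score.

    check_results maps check name -> (pass_rate, weight), where
    pass_rate is the fraction of rows that passed (0.0-1.0).
    """
    total_weight = sum(weight for _, weight in check_results.values())
    if total_weight == 0:
        return 100.0  # no checks defined -> nothing failed
    weighted = sum(rate * weight for rate, weight in check_results.values())
    return round(100.0 * weighted / total_weight, 1)


# Pass rates taken from the 4-row example dataset above.
score = weighted_quality_score({
    "age_not_null":  (0.75, 2.0),  # 3 of 4 rows non-null, weight 2
    "age_in_range":  (0.50, 1.0),  # 2 of 4 ages fall between 18 and 65
    "email_matches": (0.75, 2.0),  # 3 of 4 rows match the regex
})
print(score)  # -> 70.0
```

Weighting lets a critical check (say, a non-null primary key) drag the score down harder than a cosmetic one.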

Example CLI usage:

validatex validate data.csv --min-score 90

If quality drops below 90, the build fails.

That means bad data gets caught before it reaches production.
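The same gate works from inside a Python script: compare the score to a threshold and exit non-zero so CI marks the step as failed. A hedged sketch of that pattern (the threshold logic here is generic, not ValidateX-specific):

```python
import sys


def enforce_min_score(score: float, min_score: float = 90.0) -> int:
    """Return a process exit code: 0 if the score meets the bar, 1 otherwise."""
    if score < min_score:
        print(f"Data quality {score:.1f} below threshold {min_score:.1f}",
              file=sys.stderr)
        return 1
    print(f"Data quality {score:.1f} OK")
    return 0


# In a real pipeline, `score` would come from validator.validate().score;
# pass the result to sys.exit() so CI sees the failure.
exit_code = enforce_min_score(92.5)  # -> 0, CI step passes
```

CI systems treat any non-zero exit code as a failed step, which is all the integration you need.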


🚦 CI/CD Integration Example (GitHub Actions)

You can integrate validation directly into CI:

- name: Validate Data
  run: validatex validate data.csv --min-score 90

Now your pipeline enforces data standards automatically.


🧪 Supported Validation Types

ValidateX supports:

  • Column-level expectations
  • Table-level checks
  • Cross-column validation
  • Regex pattern checks
  • Range checks
  • Null validation
  • Custom expectation extensions
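The extension API itself isn't shown in this post, so treat the following as a hypothetical sketch of what a custom expectation boils down to in most frameworks: a callable that inspects a column of values and reports a pass rate. Every name here is illustrative, not ValidateX's real API:

```python
from typing import Callable, Optional, Sequence

# Hypothetical shape of a custom expectation: a callable that takes a
# column of values and returns the fraction that satisfy the rule.
Expectation = Callable[[Sequence[Optional[float]]], float]


def expect_positive(values: Sequence[Optional[float]]) -> float:
    """Pass rate for 'value is present and strictly positive'."""
    if not values:
        return 1.0  # vacuously true on an empty column
    passed = sum(1 for v in values if v is not None and v > 0)
    return passed / len(values)


ages = [25, 30, -3, None]
print(expect_positive(ages))  # 2 of 4 values pass -> 0.5
```

A pass rate rather than a bare pass/fail is what allows a check to feed into a weighted overall score.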

It works across Pandas and Spark environments, making it useful for both small scripts and large data pipelines.


🎯 Who Is This For?

  • Data engineers building ETL pipelines
  • ML engineers validating training datasets
  • Analytics teams enforcing schema rules
  • Startups that want lightweight data quality enforcement

If you've ever thought:

“We should probably validate this dataset…”

This tool was built for that exact moment.


🚀 What’s Next

I’m actively improving ValidateX with:

  • More built-in expectations
  • Better scoring customization
  • Profiling enhancements
  • Possible drift detection features

It’s MIT licensed and fully open source.

If you're interested, I’d love feedback on:

  • API design
  • Performance
  • Missing features
  • Real-world edge cases

GitHub:
https://github.com/kaviarasanmani/ValidateX


💬 Final Thoughts

Data validation shouldn’t be complicated.

It shouldn’t require a full ecosystem setup.

And it shouldn’t be optional.

ValidateX is my attempt to make practical, production-ready data validation simple for Python developers.

If you try it out, I’d love to hear your thoughts.

