<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaviarasan Mani</title>
    <description>The latest articles on DEV Community by Kaviarasan Mani (@kaviarasanmani).</description>
    <link>https://dev.to/kaviarasanmani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1572952%2F03c335bd-2dac-457b-82f0-d856cbd54a3e.jpeg</url>
      <title>DEV Community: Kaviarasan Mani</title>
      <link>https://dev.to/kaviarasanmani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaviarasanmani"/>
    <language>en</language>
    <item>
      <title>Stop Bad Data From Breaking Your Pipelines — A Python Data Quality Framework</title>
      <dc:creator>Kaviarasan Mani</dc:creator>
      <pubDate>Sat, 21 Feb 2026 20:54:10 +0000</pubDate>
      <link>https://dev.to/kaviarasanmani/stop-bad-data-from-breaking-your-pipelines-a-python-data-quality-framework-2l6p</link>
      <guid>https://dev.to/kaviarasanmani/stop-bad-data-from-breaking-your-pipelines-a-python-data-quality-framework-2l6p</guid>
      <description>&lt;p&gt;Data breaks silently.&lt;br&gt;
A null column passes through ETL.&lt;br&gt;
A schema change slips into production.&lt;br&gt;
An ML model trains on corrupted data.&lt;/p&gt;

&lt;p&gt;Everything runs.&lt;br&gt;
Nothing crashes.&lt;br&gt;
But your metrics are wrong.&lt;/p&gt;

&lt;p&gt;I ran into this problem repeatedly while working with Pandas and Spark pipelines.&lt;/p&gt;

&lt;p&gt;So I built something to fix it.&lt;/p&gt;


&lt;h1&gt;
  
  
  🔍 The Problem
&lt;/h1&gt;

&lt;p&gt;Most data pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume data is clean&lt;/li&gt;
&lt;li&gt;Rely on manual checks&lt;/li&gt;
&lt;li&gt;Validate schemas but not values&lt;/li&gt;
&lt;li&gt;Detect problems too late&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And while there are great data validation frameworks out there, I often needed something:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight&lt;/li&gt;
&lt;li&gt;Easy to integrate&lt;/li&gt;
&lt;li&gt;CI-friendly&lt;/li&gt;
&lt;li&gt;Pandas + PySpark compatible&lt;/li&gt;
&lt;li&gt;With built-in scoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why I built &lt;strong&gt;ValidateX&lt;/strong&gt;.&lt;/p&gt;


&lt;h1&gt;
  
  
  💡 What Is ValidateX?
&lt;/h1&gt;

&lt;p&gt;ValidateX is an open-source data quality validation framework for Python.&lt;/p&gt;

&lt;p&gt;It supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐼 Pandas&lt;/li&gt;
&lt;li&gt;⚡ PySpark&lt;/li&gt;
&lt;li&gt;CLI workflows&lt;/li&gt;
&lt;li&gt;HTML report generation&lt;/li&gt;
&lt;li&gt;Weighted data quality scoring (0–100)&lt;/li&gt;
&lt;li&gt;CI/CD integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/kaviarasanmani/ValidateX" rel="noopener noreferrer"&gt;https://github.com/kaviarasanmani/ValidateX&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docs:&lt;br&gt;
&lt;a href="https://validatex.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;https://validatex.readthedocs.io/en/latest/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PyPI:&lt;br&gt;
&lt;a href="https://pypi.org/project/validatex/" rel="noopener noreferrer"&gt;https://pypi.org/project/validatex/&lt;/a&gt;&lt;/p&gt;


&lt;h1&gt;
  
  
  ⚙️ Example: Validating a Pandas Dataset
&lt;/h1&gt;

&lt;p&gt;Here’s a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;validatex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Validator&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a@test.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b@test.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invalid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c@test.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_not_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_match_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^[^@]+@[^@]+\.[^@]+$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Quality Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In just a few lines, you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define expectations&lt;/li&gt;
&lt;li&gt;Validate your dataset&lt;/li&gt;
&lt;li&gt;Get a 0–100 quality score&lt;/li&gt;
&lt;li&gt;Generate a clean HTML report&lt;/li&gt;
&lt;/ul&gt;
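
&lt;p&gt;To make the three expectations concrete, here is a minimal sketch of the same checks in plain pandas. It does not use ValidateX and says nothing about how ValidateX works internally. One assumption to flag: the range check here skips nulls and leaves them to the null check.&lt;/p&gt;

```python
import pandas as pd

# Same sample data as the example above.
df = pd.DataFrame({
    "age": [25, 30, 17, None],
    "email": ["a@test.com", "b@test.com", "invalid", "c@test.com"],
})

# Null check: True where "age" is present.
age_present = df["age"].notna()

# Range check: True where age is 18 to 65 inclusive.
# Assumption: nulls are skipped here and reported by the null check only.
age_in_range = df["age"].between(18, 65) | df["age"].isna()

# Regex check: True where the whole string matches the pattern.
email_ok = df["email"].str.fullmatch(r"^[^@]+@[^@]+\.[^@]+$")


def failing_rows(passed):
    """Return the index labels of rows where a check did not pass."""
    return df.loc[~passed].index.tolist()


failures = {
    "age not null": failing_rows(age_present),
    "age between 18 and 65": failing_rows(age_in_range),
    "email matches regex": failing_rows(email_ok),
}
print(failures)
```

&lt;p&gt;The dictionary maps each check to the rows that fail it, which is the kind of detail a report needs on top of a single score.&lt;/p&gt;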




&lt;h1&gt;
  
  
  📊 Why a Data Quality Score Matters
&lt;/h1&gt;

&lt;p&gt;Many validation tools report only a pass/fail result for each check.&lt;/p&gt;

&lt;p&gt;ValidateX calculates a weighted &lt;strong&gt;data quality score&lt;/strong&gt;, allowing you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track data health over time&lt;/li&gt;
&lt;li&gt;Define minimum quality thresholds&lt;/li&gt;
&lt;li&gt;Fail CI builds automatically&lt;/li&gt;
&lt;/ul&gt;
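
&lt;p&gt;One simple way to think about a weighted score is as a weighted average of per-check pass rates, scaled to 0 to 100. The sketch below is only an illustration of that idea. The check names, weights, and pass rates are hypothetical, and the formula ValidateX actually uses may differ.&lt;/p&gt;

```python
# Hypothetical results: (check name, weight, fraction of rows passing).
checks = [
    ("age not null", 3.0, 0.75),
    ("age between 18 and 65", 2.0, 1.0),
    ("email matches regex", 1.0, 0.5),
]


def weighted_score(results):
    """Return the weighted mean pass rate, scaled to the range 0 to 100."""
    total_weight = sum(weight for _, weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one check needs a non-zero weight")
    weighted = sum(weight * pass_rate for _, weight, pass_rate in results)
    return 100.0 * weighted / total_weight


score = weighted_score(checks)
print(f"Data Quality Score: {score:.1f}")
```

&lt;p&gt;Giving a check a larger weight makes its failures pull the score down more, so the number reflects which rules matter most to you.&lt;/p&gt;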

&lt;p&gt;Example CLI usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;validatex validate data.csv &lt;span class="nt"&gt;--min-score&lt;/span&gt; 90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If quality drops below 90, the build fails.&lt;/p&gt;

&lt;p&gt;That stops data that fails your checks from reaching production.&lt;/p&gt;
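
&lt;p&gt;A threshold like this usually works through the process exit code: CI systems treat a non-zero exit code as a failed step. Here is a minimal sketch of that gate. The function name and the scores are hypothetical, and this is not the ValidateX source.&lt;/p&gt;

```python
def exit_code(score, min_score):
    """Return 0 when the score meets the threshold, 1 otherwise."""
    return 0 if score >= min_score else 1


# Hypothetical scores; a real run would take the score from validation.
for score in (94.0, 87.5):
    code = exit_code(score, min_score=90.0)
    print(f"score={score} exit code={code}")

# A command-line wrapper would finish with sys.exit(code), so a score
# below the threshold ends the process with a non-zero exit code.
```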




&lt;h1&gt;
  
  
  🚦 CI/CD Integration Example (GitHub Actions)
&lt;/h1&gt;

&lt;p&gt;You can integrate validation directly into CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate Data&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;validatex validate data.csv --min-score &lt;/span&gt;&lt;span class="m"&gt;90&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your pipeline enforces data standards automatically.&lt;/p&gt;




&lt;h1&gt;
  
  
  🧪 Supported Validation Types
&lt;/h1&gt;

&lt;p&gt;ValidateX supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Column-level expectations&lt;/li&gt;
&lt;li&gt;Table-level checks&lt;/li&gt;
&lt;li&gt;Cross-column validation&lt;/li&gt;
&lt;li&gt;Regex pattern checks&lt;/li&gt;
&lt;li&gt;Range checks&lt;/li&gt;
&lt;li&gt;Null validation&lt;/li&gt;
&lt;li&gt;Custom expectation extensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works across Pandas and Spark environments, making it useful for both small scripts and large data pipelines.&lt;/p&gt;
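
&lt;p&gt;To show what table-level and cross-column checks mean in practice, here is a small sketch in plain pandas. The orders table and its column names are made up for illustration, and the code does not use the ValidateX API.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical orders table, used only for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "ordered_at": pd.to_datetime(["2026-01-05", "2026-01-07", "2026-01-09"]),
    "shipped_at": pd.to_datetime(["2026-01-06", "2026-01-06", "2026-01-10"]),
})

# Table-level checks: expected columns are present and the table has rows.
expected_columns = {"order_id", "ordered_at", "shipped_at"}
has_columns = expected_columns.issubset(orders.columns)
has_rows = not orders.empty

# Cross-column check: an order should not ship before it was placed.
ships_after_order = orders["shipped_at"].ge(orders["ordered_at"])
bad_order_ids = orders.loc[~ships_after_order, "order_id"].tolist()

print(has_columns, has_rows, bad_order_ids)
```

&lt;p&gt;Order 2 is flagged because its ship date is earlier than its order date, a problem that no single-column check would catch.&lt;/p&gt;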




&lt;h1&gt;
  
  
  🎯 Who Is This For?
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Data engineers building ETL pipelines&lt;/li&gt;
&lt;li&gt;ML engineers validating training datasets&lt;/li&gt;
&lt;li&gt;Analytics teams enforcing schema rules&lt;/li&gt;
&lt;li&gt;Startups that want lightweight data quality enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We should probably validate this dataset…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This tool was built for that exact moment.&lt;/p&gt;




&lt;h1&gt;
  
  
  🚀 What’s Next
&lt;/h1&gt;

&lt;p&gt;I’m actively improving ValidateX with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More built-in expectations&lt;/li&gt;
&lt;li&gt;Better scoring customization&lt;/li&gt;
&lt;li&gt;Profiling enhancements&lt;/li&gt;
&lt;li&gt;Possible drift detection features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s MIT licensed and fully open source.&lt;/p&gt;

&lt;p&gt;If you're interested, I’d love feedback on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API design&lt;/li&gt;
&lt;li&gt;Performance&lt;/li&gt;
&lt;li&gt;Missing features&lt;/li&gt;
&lt;li&gt;Real-world edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/kaviarasanmani/ValidateX" rel="noopener noreferrer"&gt;https://github.com/kaviarasanmani/ValidateX&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  💬 Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Data validation shouldn’t be complicated.&lt;/p&gt;

&lt;p&gt;It shouldn’t require a full ecosystem setup.&lt;/p&gt;

&lt;p&gt;And it shouldn’t be optional.&lt;/p&gt;

&lt;p&gt;ValidateX is my attempt to make practical, production-ready data validation simple for Python developers.&lt;/p&gt;

&lt;p&gt;If you try it out, I’d love to hear your thoughts.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dataquality</category>
      <category>etl</category>
    </item>
  </channel>
</rss>
