Digvijay Waghela

Posted on Jul 3

# Turning Data Quality Checks into a Measurable Trust Score

#database #dataengineering #datascience #pandas

Most data quality problems do not announce themselves loudly.

They do not always break a pipeline.
They do not always throw an exception.
They do not always fail a job.

Instead, they quietly move downstream.

A null value appears in a column that should never be null.
A customer ID is duplicated.
A date column contains malformed values.
A metric dashboard refreshes successfully, but the number is wrong.
A machine learning feature table gets built, but one key field has silently degraded.

By the time someone notices, the issue is no longer just a data problem. It has become a business trust problem.
That is the part of data quality that is often underestimated.
In many analytics and data engineering environments, the first person to detect a data issue is not the pipeline owner. It is often a business stakeholder, analyst, finance partner, product manager, or executive asking:

“Why does this number look different today?”

At that point, the technical failure has already become a credibility issue.

The Gap Between “Pipeline Passed” and “Data Is Trustworthy”

A data pipeline can run successfully and still produce bad data.

That happens because pipeline success usually means:

The job completed.
The SQL executed.
The file landed.
The table refreshed.
The dashboard updated.

But none of that automatically means:

The primary key is unique.
Required columns are populated.
Values are within expected ranges.
Dates are valid.
Categories match accepted values.
Row-level duplication did not increase.
The dataset is safe enough to use.

This is why data quality needs to be treated as a first-class engineering concern, not as an afterthought. The problem is that many teams either do too little or try to jump directly into heavyweight solutions.

On one side, teams rely on manual spot checks, SQL queries, or dashboard review. On the other side, there are mature data quality frameworks that are very powerful but may require upfront configuration, project structure, expectation suites, integrations, or operational setup.

Those tools are valuable, especially for mature production platforms. But sometimes the need is much simpler:

“I have a pandas DataFrame or CSV. Can I quickly understand whether this data looks safe enough to use?”

That lightweight question deserves a lightweight answer.

A Practical Data Quality Flow for Pandas DataFrames

When working with pandas data, I like to think about data quality in three simple layers:

Raw DataFrame / CSV
        |
        v
1. Profile the data
   - What columns exist?
   - How many nulls?
   - What are the basic stats?
   - What does the data shape look like?
        |
        v
2. Validate expected rules
   - Required fields
   - Unique identifiers
   - Accepted value ranges
   - Valid formats
   - No duplicate rows
        |
        v
3. Produce a quality result
   - Pass/fail status
   - 0–100 score
   - Failing checks
   - Exportable report
   - Optional CI/CD failure

This flow is useful because it separates three different questions:

What does the data look like?
Does it meet the rules I care about?
Can I turn the result into something actionable?

That last part is important.

A long validation output is useful for debugging, but a simple score is useful for dashboards, CI pipelines, and automated quality gates.

Example: Customer Data Quality Check

Imagine a simple customer dataset:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": [
        "alex@example.com",
        "bad-email",
        "sam@example.com",
        None
    ],
    "country": ["US", "CA", "XX", "US"]
})

At first glance, this data might look usable. It has customer IDs, ages, emails, and countries.

But there are several issues:

id has a duplicate value.
age has an unrealistic value: 200.
age also has a missing value.
email has an invalid format.
email has a missing value.
country has a value outside the allowed list.

These are exactly the kinds of issues that may not crash a notebook or pipeline, but can create downstream problems.
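
As a rough illustration, each of the issues listed above can be surfaced with a few lines of plain pandas. This is only a sketch: the 0 to 120 age range, the email pattern, and the allowed country list mirror the rules used later in this article, and the names `EMAIL_PATTERN`, `ALLOWED_COUNTRIES`, and `issues` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": ["alex@example.com", "bad-email", "sam@example.com", None],
    "country": ["US", "CA", "XX", "US"],
})

# Assumed rules for illustration: ages between 0 and 120, a very basic
# email pattern, and an allowed country list of US, CA, and MX.
EMAIL_PATTERN = r"^[^@]+@[^@]+\.[^@]+$"
ALLOWED_COUNTRIES = {"US", "CA", "MX"}

# Drop nulls before format and range checks so that missing values are
# counted once, as "missing", rather than also as "invalid".
ages = df["age"].dropna()
emails = df["email"].dropna()

issues = {
    "duplicate_ids": int(df["id"].duplicated().sum()),
    "missing_age": int(df["age"].isna().sum()),
    "age_out_of_range": int((~ages.between(0, 120)).sum()),
    "missing_email": int(df["email"].isna().sum()),
    "invalid_email": int((~emails.str.match(EMAIL_PATTERN)).sum()),
    "unknown_country": int((~df["country"].isin(ALLOWED_COUNTRIES)).sum()),
}

for name, count in issues.items():
    print(f"{name}: {count}")
```

None of these lines raises an exception, which is the point: the DataFrame loads cleanly and the problems only show up when you go looking for them.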

A Lightweight Way to Check This

For quick pandas-based validation, I have been using a small Python utility called dqscore.

Install it with:

pip install dqscore

The idea is simple: profile a DataFrame, define expectations, and get a quality score and a report without setting up a heavy framework.

import pandas as pd
import dqscore as dq

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": [
        "alex@example.com",
        "bad-email",
        "sam@example.com",
        None
    ],
    "country": ["US", "CA", "XX", "US"]
})

Step 1: Profile the DataFrame

Before writing rules, it is useful to understand the data shape.

profile = dq.profile(df)

print(profile.to_markdown())
profile.to_html("profile.html")

This gives a quick profile of the DataFrame, including per-column information that can help identify obvious issues before deeper validation.

This is useful when reviewing a new CSV, an API extract, a vendor file, or an intermediate dataset in a notebook.
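
To make the idea of a per-column profile concrete, here is a minimal pandas-only sketch. It is not how dqscore builds its profile; `quick_profile` is a hypothetical helper that only shows the kind of information a profile usually contains.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, null percentage, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),  # nunique ignores nulls by default
    })


df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": ["alex@example.com", "bad-email", "sam@example.com", None],
    "country": ["US", "CA", "XX", "US"],
})

summary = quick_profile(df)
print(summary)

# Basic statistics for the numeric columns; the max of 200 in "age" stands out.
print(df.describe())
```

Even this small table is enough to spot that `id` has fewer distinct values than rows and that `age` and `email` each have a quarter of their values missing.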

Step 2: Define the Data Quality Rules

Next, we define expectations.

schema = dq.Schema("customers")

schema.column("id").not_null().unique()
schema.column("age").not_null().is_numeric().in_range(0, 120)
schema.column("email").not_null().matches(r"^[^@]+@[^@]+\.[^@]+$")
schema.column("country").in_set(["US", "CA", "MX"])

schema.no_duplicate_rows()

This reads almost like a checklist:

id should not be null.
id should be unique.
age should be populated, numeric, and within the range 0 to 120.
email should be populated and match a basic email pattern.
country should be one of the accepted values.
The dataset should not contain duplicate rows.
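
For readers who want to see what such a checklist amounts to underneath, the same rules can be written as named predicates over a DataFrame in plain pandas. This is a sketch of the general technique, not dqscore's implementation. The `rules` dictionary and the choice to ignore nulls in the range and format checks (so that missing values are reported only by the not-null rules) are assumptions made for this example.

```python
from typing import Callable

import pandas as pd

Rule = Callable[[pd.DataFrame], bool]

EMAIL_PATTERN = r"^[^@]+@[^@]+\.[^@]+$"

# Each rule takes the whole DataFrame and returns True when it passes.
rules: dict[str, Rule] = {
    "id_not_null": lambda d: d["id"].notna().all(),
    "id_unique": lambda d: d["id"].is_unique,
    "age_not_null": lambda d: d["age"].notna().all(),
    "age_is_numeric": lambda d: pd.api.types.is_numeric_dtype(d["age"]),
    "age_in_range": lambda d: d["age"].dropna().between(0, 120).all(),
    "email_not_null": lambda d: d["email"].notna().all(),
    "email_format": lambda d: d["email"].dropna().str.match(EMAIL_PATTERN).all(),
    "country_in_set": lambda d: d["country"].isin(["US", "CA", "MX"]).all(),
    "no_duplicate_rows": lambda d: not d.duplicated().any(),
}

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": ["alex@example.com", "bad-email", "sam@example.com", None],
    "country": ["US", "CA", "XX", "US"],
})

results = {name: bool(rule(df)) for name, rule in rules.items()}
failed = [name for name, ok in results.items() if not ok]

print("Failed rules:", failed)
```

Keeping rules as data (a name plus a predicate) is what makes it straightforward to count them, report on them, and later attach a score to them.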

Step 3: Validate and Get a Score

result = schema.validate(df)

print(result.summary())
print("Quality score:", result.score)

result.to_html("dq_report.html")

The useful part is that the result is not just a pass/fail object. It gives a quality score from 0 to 100 and a report that can be exported.

That means the output can be used in multiple ways:

Printed in a notebook
Saved as an HTML report
Exported as JSON
Added to a CI pipeline
Used as a lightweight quality gate
Tracked over time as a data quality metric
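
Tracking a score over time can be as simple as appending one record per run to a file. The sketch below uses only the standard library and a JSON Lines file. The helpers `record_score` and `load_history`, the file name, and the score values are all placeholders for this example; in practice the score would come from the validation result.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_score(history_path: Path, dataset: str, score: float) -> dict:
    """Append one score record, with a UTC timestamp, to a JSON Lines file."""
    entry = {
        "dataset": dataset,
        "score": score,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def load_history(history_path: Path) -> list[dict]:
    """Read every non-empty line of the history file back into a list of records."""
    with history_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / "dq_history.jsonl"
    record_score(history, "customers", 72.0)  # placeholder score
    record_score(history, "customers", 85.0)  # placeholder score
    scores = [entry["score"] for entry in load_history(history)]

print(scores)
```

An append-only file like this is easy to load into pandas later to plot the trend or to alert when the score drops between runs.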

Why a Score Helps

Pass/fail validation is helpful, but it can be too binary.

In real-world data systems, not every failure has the same urgency.

For example:

A missing optional field may be low severity.
A duplicate primary key may be high severity.
A malformed email may be medium severity.
A broken date field in a financial report may be critical.

A score gives teams a simple way to communicate quality without forcing every stakeholder to read validation logs.

Instead of saying:

“Three checks failed and two columns have invalid values.”

You can say:

“This dataset scored 72 out of 100 and failed uniqueness, email format, and valid country checks.”

That is much easier to put into a dashboard, alert, or review process.
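
One way to fold severity into a single number is a weighted pass rate. The sketch below is a hypothetical scheme for illustration only and makes no claim about how dqscore computes its score. The `CheckResult` class, the weights, and the check names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    weight: float  # hypothetical severity weight; higher means more important


def weighted_score(results: list[CheckResult]) -> float:
    """Share of total weight held by passing checks, scaled to 0-100."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 100.0  # no checks defined, so nothing has failed
    earned = sum(r.weight for r in results if r.passed)
    return round(100 * earned / total, 1)


results = [
    CheckResult("id_unique", passed=False, weight=3.0),              # high severity
    CheckResult("age_in_range", passed=False, weight=2.0),           # medium severity
    CheckResult("email_format", passed=False, weight=2.0),           # medium severity
    CheckResult("optional_field_present", passed=True, weight=1.0),  # low severity
    CheckResult("no_duplicate_rows", passed=True, weight=2.0),
]

score = weighted_score(results)
failed = [r.name for r in results if not r.passed]

print(f"This dataset scored {score} out of 100 and failed: {', '.join(failed)}")
```

With weights like these, a failed uniqueness check pulls the score down more than a missing optional field, which matches the intuition that not every failure has the same urgency.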

Zero-Config Scan for Fast Checks

Sometimes you do not want to define a full schema.

You just want a quick read on a dataset.

result = dq.auto_scan(df)

print(result.summary())

This is useful when:

Reviewing a new CSV
Exploring a vendor file
Checking a notebook DataFrame
Validating an extract before sharing it
Adding a quick quality step before deeper transformation

The point is not to replace full production-grade data quality platforms. The point is to reduce the barrier to checking data early.

Command Line Usage

The same idea can also be used from the command line.

dqscore profile customers.csv --html profile.html

For a quick scan:

dqscore scan customers.csv --json report.json

You can also set a null threshold:

dqscore scan customers.csv --max-null-pct 5

This makes it useful for lightweight CI/CD or pre-commit style checks.

For example, a team could add a simple step before accepting a reference data file:

dqscore scan customers.csv --json dq_report.json

If the scan fails, the process can stop before bad data moves further downstream.
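
To show what a null threshold gate means in practice, here is a pandas-only sketch. It does not reproduce the dqscore CLI. `null_gate` is a hypothetical helper, and treating the threshold as a per-column maximum null percentage is an assumption made for this example.

```python
import pandas as pd


def null_gate(df: pd.DataFrame, max_null_pct: float) -> dict[str, float]:
    """Return the columns whose null percentage exceeds max_null_pct."""
    null_pct = df.isna().mean() * 100
    over = null_pct[null_pct > max_null_pct]
    return {col: round(float(pct), 1) for col, pct in over.items()}


df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [29, 35, 200, None],
    "email": ["alex@example.com", "bad-email", "sam@example.com", None],
    "country": ["US", "CA", "XX", "US"],
})

violations = null_gate(df, max_null_pct=5)

# A non-zero exit code is the usual way to make a CI step fail.
exit_code = 1 if violations else 0

print("Columns over the null threshold:", violations)
print("Exit code:", exit_code)

# In a real CI script, end with: sys.exit(exit_code)
```

Returning the offending columns rather than a bare boolean keeps the failure message useful, since the person reading the CI log can see which columns crossed the threshold.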

Where This Fits

This kind of lightweight approach fits well in the early and middle stages of the data workflow.

CSV / API Extract / DataFrame
        |
        v
Quick Profile
        |
        v
Schema Validation
        |
        v
Quality Score + Report
        |
        v
Notebook / CI / Dashboard / Pipeline Gate

It is especially useful for:

Data analysts working with CSV files
Data engineers validating intermediate datasets
Analytics engineers checking extracts before dbt/Snowflake loads
ML practitioners validating feature inputs
Small teams that want quality checks without operational overhead
Educators teaching practical data validation concepts

This Is Not About Replacing Bigger Tools

There are excellent data quality and validation tools in the Python and data engineering ecosystem.

Great Expectations is powerful for expectation-driven validation.
Pandera is strong for schema validation in pandas workflows.
ydata-profiling is useful for exploratory profiling and rich data reports.
dbt tests are excellent inside analytics engineering workflows.

The lightweight approach is not a replacement for those.

It is for the gap where you want something simple, fast, and scoreable:

One dependency: pandas
No heavy project setup
No config files required
DataFrame profiling
Fluent schema validation
0–100 quality score
Markdown, JSON, and HTML reports
CLI support

That small surface area is useful when the main barrier is not a lack of tools, but the friction of adopting them.

A Simple Pattern I Like

For many pandas workflows, a practical pattern looks like this:

import pandas as pd
import dqscore as dq

df = pd.read_csv("customers.csv")

schema = dq.Schema("customers")
schema.column("id").not_null().unique()
schema.column("age").not_null().is_numeric().in_range(0, 120)
schema.column("email").matches(r"^[^@]+@[^@]+\.[^@]+$")
schema.column("country").in_set(["US", "CA", "MX"])
schema.no_duplicate_rows()

result = schema.validate(df)

print(result.summary())
print("Quality score:", result.score)

result.to_html("dq_report.html")

# Stop with a non-zero exit status so a calling script or CI job sees the failure.
if not result.passed:
    raise SystemExit("Data quality checks failed")

This gives a clean engineering pattern:

Load data.
Define expectations.
Validate.
Generate a report.
Fail early if needed.

That is much better than discovering the issue later in a dashboard review.

Final Thought

Data quality does not always need to start with a large platform implementation.

Sometimes it starts with a simple habit:

Before trusting a dataset, profile it, validate it, score it, and make the result visible.

That habit alone can prevent many downstream issues.

Bad data is expensive not because it exists, but because it is usually discovered too late.

A lightweight quality check near the beginning of the workflow can save hours of debugging, reduce stakeholder confusion, and improve trust in analytics and machine learning outputs.

The goal is simple:

Catch boring-but-costly data problems before they become business problems.

Package: dqscore
Install:

pip install dqscore

GitHub:

https://github.com/dgvj-work/dqscore

PyPI:

https://pypi.org/project/dqscore/

If you work with pandas DataFrames, CSVs, analytics extracts, or quick validation workflows, try it and share feedback. The most useful improvements usually come from real datasets and real edge cases.

DEV Community