Originally published on Medium: https://medium.com/@kalluripradeep99/data-quality-at-scale-why-your-pipeline-needs-more-than-green-checkmarks-f3af3dbff8a4
Data Quality at Scale: Why Your Pipeline Needs More Than Green Checkmarks
I once watched a company make a major strategic decision based on a dashboard that had been showing incorrect data for three weeks. The scary part? Nobody knew. The data pipeline ran successfully every day. All green checkmarks in Airflow. Zero alerts. Everything looked fine.
Except the data was wrong.
After years of building data platforms, I've learned something important: moving data is the easy part. Making sure it's correct is what keeps you up at night.
In this article, I'll share what I've learned about data quality at scale. Not the theory you read in textbooks, but the practical stuff that actually matters when your CEO is looking at a dashboard you built.
The $2 Million Dashboard
Let me tell you about that incident I mentioned. A source system quietly changed how they tracked customer IDs. They sent us an email about it (that got lost in someone's inbox). Our pipeline kept running perfectly. Schema matched. No null values. Everything technically valid.
But we were now double-counting about 15% of customers.
For three weeks, our growth metrics looked amazing. Leadership loved it. They approved a massive marketing spend based on those numbers. Then someone in finance noticed the discrepancy during a reconciliation. We had to go back and explain that our "amazing growth" was actually a data bug.
That was expensive. Not just the money, but the trust. It took months to rebuild confidence in our data platform.
Why Traditional Testing Isn't Enough
If you're coming from software engineering, you might think, "Just write unit tests!" I thought that too. Didn't work.
Here's the thing: with code, you control the inputs. You write tests for expected scenarios. Code is deterministic.
Data is different. You don't control the source systems. They change without telling you. Business rules evolve. Schema drift happens. And here's the worst part - data can be technically valid but business invalid.
Some real examples I've seen:
The number that wasn't a number: Transaction amounts came through as "1,234.56" instead of 1234.56. The schema said "string field," so it passed validation. Try summing those in a SQL query. You get $0.
The date that wasn't wrong: A source system started sending dates in DD/MM/YYYY format instead of MM/DD/YYYY. Every date from the 1st to the 12th of the month worked fine. Then on the 13th, everything broke. Took us two weeks to figure out why.
The midnight ghost records: Mobile app transactions synced when users had WiFi. Some took 48 hours to arrive. Our daily reports were always incomplete, but we had no way to know which days were "final."
I learned the hard way that you need to test six things:
Schema (structure is right)
Values (numbers make sense)
Volume (right amount of data)
Freshness (data is recent enough)
Distribution (patterns look normal)
Relationships (foreign keys work)
Most teams only test the first one.
What Data Quality Actually Means
I organize quality checks into four categories. Each one catches different types of problems.
Completeness: Is Everything There?
This seems obvious, but it's where most issues start. You expect 100,000 rows. You get 60,000. Is that a problem or just a slow day?
I check:
Row counts against historical averages (alert if >20% different)
Null rates in critical fields (customer_id should never be null)
All expected dates/partitions are present
Foreign keys exist (every transaction has a valid customer)
One time we lost an entire day of data because the source system had a disk space issue. They dumped empty files to our S3 bucket. Our pipeline happily processed zero rows. Everything succeeded. We only found out when someone asked why yesterday's revenue was $0.
Now I check row counts. If today is 50% lower than the average of the last 7 days, I get paged.
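Here's a minimal sketch of that check in Python. It assumes you already log daily row counts somewhere you can query; the function name and numbers are just illustrative:

```python
# A minimal sketch of the row-count check: compare today against the trailing
# 7-day average and flag a drop beyond the threshold.
import statistics

def row_count_ok(today_count: int, last_7_days: list[int], max_drop: float = 0.5) -> bool:
    """Return True if today's count is within the allowed drop vs. the 7-day average."""
    baseline = statistics.mean(last_7_days)
    if baseline == 0:
        return today_count == 0  # nothing to compare against
    drop = (baseline - today_count) / baseline
    return drop < max_drop  # e.g. page someone if today is >50% below the average

# Example: ~48k rows today against a ~100k/day baseline -> fails the check
if not row_count_ok(48_000, [95_000, 102_000, 98_000, 101_000, 97_000, 99_000, 103_000]):
    print("Row count dropped more than 50% vs. the 7-day average - paging on-call")
```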
Accuracy: Is the Data Correct?
This is harder because "correct" depends on business context. A $1 million transaction might be valid for some businesses, fraud for others.
I focus on:
Range checks (transaction amounts between $0 and $100,000)
Format validation (emails look like emails, dates are dates)
Business rules (refund amount can't exceed original purchase)
Reconciliation with source systems (row counts and totals match)
The trick is working with business users to define what "correct" means. Don't guess. Ask.
Consistency: Does It All Make Sense Together?
Data doesn't exist in isolation. Tables relate to each other. Metrics calculated different ways should match.
I check for:
Orphaned records (transactions without a customer)
Duplicate primary keys (should be impossible but happens)
Cross-table consistency (revenue calculated two ways gives same answer)
Time-series anomalies (revenue doesn't drop 90% overnight unless something big happened)
We once had a bug where a retry mechanism created duplicate records. Started at 0.1%. Grew to 15% over three months. Aggregations were inflated. We were reporting 15% higher revenue than we actually had. Found it during a financial audit. Not fun.
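A few lines of pandas would have caught that duplicate creep early. Here's a rough sketch of the duplicate-key and orphaned-record checks, assuming the transactions and customers tables are already loaded as DataFrames (the column names are illustrative):

```python
# A sketch of two consistency checks: duplicate primary keys and orphaned records.
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t2", "t3"],  # t2 is duplicated (e.g. by a retry)
    "customer_id":    ["c1", "c2", "c2", "c9"],  # c9 has no matching customer
    "amount":         [10.0, 25.0, 25.0, 40.0],
})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})

# Duplicate primary keys - "should be impossible but happens"
dup_rate = transactions["transaction_id"].duplicated().mean()

# Orphaned records - transactions without a customer
orphans = transactions[~transactions["customer_id"].isin(customers["customer_id"])]

print(f"Duplicate key rate: {dup_rate:.1%}, orphaned rows: {len(orphans)}")
if dup_rate > 0 or len(orphans) > 0:
    raise ValueError("Consistency check failed")
```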
Freshness: Is It Recent Enough?
Stale data is useless data. But "fresh enough" depends on the use case. Real-time fraud detection needs data from the last minute. Monthly reports can tolerate day-old data.
I monitor:
Maximum timestamp in each table
Time since last successful pipeline run
SLA breaches (data should be <2 hours old for dashboards)
Set clear SLAs. Measure against them. Alert when you miss them.
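Here's roughly what a freshness check looks like, assuming you can fetch the maximum event timestamp for a table (via MAX(updated_at), information_schema, or a metadata store, depending on your warehouse):

```python
# A minimal freshness check against a 2-hour SLA.
from datetime import datetime, timedelta, timezone

def fresh_enough(max_timestamp: datetime, sla: timedelta = timedelta(hours=2)) -> bool:
    """Return True if the newest record is within the SLA window."""
    age = datetime.now(timezone.utc) - max_timestamp
    return age <= sla

# Pretend the newest row is 3 hours old
latest = datetime.now(timezone.utc) - timedelta(hours=3)
if not fresh_enough(latest):
    print("Freshness SLA breached: dashboard data is older than 2 hours")
```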
How I Actually Implement This
Theory is nice. Let me show you what I actually do.
Great Expectations for Processing Layer
This tool changed how I think about data quality. Instead of writing custom validation code, you define expectations. Then run them automatically.
Here's a real example from a transaction pipeline:
```python
import great_expectations as ge

# Load your data
df = ge.read_csv("s3://curated-zone/transactions.csv")

# Critical checks - pipeline stops if these fail
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_unique("transaction_id")
df.expect_column_values_to_not_be_null("customer_id")

# Range checks
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)

# Valid values only
df.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])
df.expect_column_values_to_be_in_set("status", ["pending", "completed", "failed"])

# Format validation
df.expect_column_values_to_match_regex(
    "email",
    r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
)

# Business rule: refunds can't exceed original amount
df.expect_column_pair_values_A_to_be_greater_than_B(
    column_A="original_amount",
    column_B="refund_amount",
    or_equal=True  # a full refund (refund == original) is still valid
)

# Row count check (based on historical average)
df.expect_table_row_count_to_be_between(
    min_value=50000,
    max_value=200000
)

# Run all checks
results = df.validate()
if not results.success:
    failed = [exp for exp in results.results if not exp.success]
    print(f"Quality check failed! {len(failed)} issues found")
    # Send alert, stop pipeline, whatever makes sense
    raise ValueError("Data quality check failed")
```
I run this in my Airflow pipeline right after reading from the raw zone. If validation fails, the pipeline stops. Bad data never reaches production.
The key is starting simple. Five checks on day one. Add more as you learn what can go wrong. I now have 50+ checks on critical tables. Built up over time based on actual incidents.
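For context, here's a rough sketch of where that validation step sits in a DAG, using Airflow's TaskFlow API. The task names, bucket path, and the validate_transactions() helper are made up for illustration, not the exact pipeline from this article:

```python
# A sketch of the validation gate in an Airflow 2.x DAG: extract -> validate -> load.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def transactions_pipeline():

    @task
    def extract_from_raw_zone() -> str:
        # ...copy the day's files out of the raw zone...
        return "s3://curated-zone/transactions.csv"

    @task
    def validate(path: str) -> str:
        # Run the Great Expectations checks shown above; raise on failure so
        # downstream tasks never see bad data.
        # results = validate_transactions(path)
        # if not results.success:
        #     raise ValueError("Data quality check failed")
        return path

    @task
    def load_to_warehouse(path: str) -> None:
        ...  # only runs if validation passed

    load_to_warehouse(validate(extract_from_raw_zone()))

transactions_pipeline()
```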
dbt Tests for Analytics Layer
While Great Expectations handles the processing layer, I use dbt for the warehouse. Tests live right next to the models. Easy for analysts to write and maintain.
```yaml
# models/schema.yml
version: 2

models:
  - name: fct_daily_revenue
    description: "Daily revenue by product"
    columns:
      - name: date
        tests:
          - not_null
          - unique
      - name: product_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_products')
              field: product_id
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000000
```

And custom tests for business logic:

```sql
-- tests/revenue_reconciliation.sql
-- Revenue in warehouse should match source system
with warehouse as (
    select sum(revenue) as total
    from {{ ref('fct_daily_revenue') }}
    where date = current_date - 1
),
source as (
    select total_revenue as total
    from {{ ref('source_summary') }}
    where date = current_date - 1
)
select *
from warehouse w
cross join source s
where abs(w.total - s.total) / s.total > 0.01  -- Fail if >1% difference
```
Run these after every model build. If tests fail, you know immediately.
Monitoring and Alerts
Quality checks are useless if nobody looks at them. You need alerts that actually get attention.
I use three severity levels:
Critical (page someone):
Pipeline completely failed
Zero rows loaded
SLA breach by >4 hours
High (Slack with @channel):
Quality checks failed
Volume drop >50%
Freshness breach by >2 hours
Medium (Slack notification):
Warning-level checks failed
Volume drop 20-50%
Minor anomalies
Don't alert on everything. Alert fatigue is real. I learned this by setting alerts too aggressively and then ignoring them. Start conservative. Tune based on false positives.
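If it helps, here's a minimal sketch of routing checks to those three levels. The send_page() and send_slack() helpers are placeholders for whatever PagerDuty and Slack integrations you already use:

```python
# A sketch of severity-based alert routing: critical pages, high pings the
# channel, medium is a plain Slack message.
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # page someone
    HIGH = "high"          # Slack with @channel
    MEDIUM = "medium"      # plain Slack notification

def send_page(text: str) -> None:
    print("PAGE:", text)       # placeholder for a PagerDuty call

def send_slack(text: str) -> None:
    print("SLACK:", text)      # placeholder for a Slack webhook

def route_alert(check_name: str, severity: Severity, message: str) -> None:
    if severity is Severity.CRITICAL:
        send_page(f"[{check_name}] {message}")
    elif severity is Severity.HIGH:
        send_slack(f"<!channel> [{check_name}] {message}")
    else:
        send_slack(f"[{check_name}] {message}")

route_alert("row_count", Severity.CRITICAL, "Zero rows loaded for 2024-05-01")
```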
Building a Quality Culture
Here's what I've learned about getting teams to care about quality:
Show the impact. Don't say "we need more tests." Say "last month's incorrect dashboard cost us a $2M budgeting mistake. These tests prevent that."
Make it visible. We have a dashboard showing data quality scores for every table. Updates daily. Everyone can see it. When scores drop, people notice.
Make it easy. Pre-built test templates. Clear documentation. If adding quality checks is hard, people won't do it.
Celebrate wins. "Zero quality incidents this month!" matters. Recognize teams that maintain high quality scores.
Share incidents. When things break (and they will), do a blameless post-mortem. What happened? What did we learn? How do we prevent it? Share these widely. Learn from mistakes.
The Quality Checklist
Before any new pipeline goes to production, I make sure:
Schema validation exists
Critical fields have null checks
Value ranges are validated
Row count checks are in place
Freshness monitoring configured
Alerts set up and tested
Team knows how to respond to alerts
Runbook exists for common failures
Takes 30 minutes to set up. Saves hours when things break.
Real Incidents I've Seen
Let me share three more incidents and what I learned from each.
The silent schema change: A source system added a new status code without telling us. Our pipeline treated it as invalid and dropped those records. 10% of data quietly disappeared. We found out when a business user asked why certain transactions weren't showing up.
Lesson: Monitor unexpected values. If a new status code appears, alert on it. Don't silently drop data.
The weekend bug: Our pipeline ran fine Monday through Friday. Every weekend it failed. Why? Because weekend volume was 80% lower. Our row count check had a fixed threshold, not a relative one. Every Sunday morning, someone got paged.
Lesson: Make thresholds context-aware. Weekend expectations ≠ weekday expectations.
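One way to do that is to baseline each day against the same weekday in previous weeks instead of a flat average. A sketch, with a hard-coded history dict standing in for real pipeline metadata:

```python
# A sketch of a context-aware baseline: expected rows = average of the same
# weekday over the previous few weeks, so weekends get their own threshold.
from datetime import date, timedelta
from statistics import mean

def expected_rows(today: date, history: dict[date, int], weeks: int = 4) -> float:
    """Baseline = average of the same weekday over the previous N weeks."""
    prior_same_weekday = [today - timedelta(weeks=w) for w in range(1, weeks + 1)]
    return mean(history[d] for d in prior_same_weekday if d in history)

history = {
    date(2024, 4, 21): 20_000,   # Sunday
    date(2024, 4, 14): 21_000,   # Sunday
    date(2024, 4, 22): 98_000,   # Monday
    date(2024, 4, 15): 103_000,  # Monday
}
print(expected_rows(date(2024, 4, 28), history))  # ~20,500  (a Sunday baseline)
print(expected_rows(date(2024, 4, 29), history))  # ~100,500 (a Monday baseline)
```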
The currency confusion: Transaction amounts suddenly doubled. Took us a day to figure out why. A source system changed from sending amounts in dollars to cents. $100.00 became 10000. Technically valid (still a number), but wrong.
Lesson: Compare against historical distributions. If average transaction amount suddenly changes by 100x, something's wrong.
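A simple version of that check: compare today's average amount against a trailing baseline and flag order-of-magnitude shifts. The helper and numbers below are illustrative:

```python
# A sketch of a distribution-shift check that catches unit changes like
# dollars -> cents (a ~100x jump in the average transaction amount).
from statistics import mean

def amount_scale_shifted(today_amounts: list[float], baseline_avg: float, max_ratio: float = 10.0) -> bool:
    """Return True if the average amount moved by more than max_ratio in either direction."""
    today_avg = mean(today_amounts)
    ratio = today_avg / baseline_avg if baseline_avg else float("inf")
    return ratio > max_ratio or ratio < 1 / max_ratio

# Historical average of ~$85; a dollars-to-cents switch pushes today's average to ~8,500
if amount_scale_shifted([8_400, 9_100, 7_950], baseline_avg=85.0):
    print("Average transaction amount shifted ~100x - check for a unit change at the source")
```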
What Actually Matters
After years of doing this, here's what I've learned:
Start simple. Five good checks beat 50 mediocre ones. Focus on what actually breaks in your pipelines.
Monitor trends, not just values. A gradual increase in null rates is harder to catch than a sudden spike. Watch the trends.
Test what you can't see. Schema and row counts are easy. Business logic is hard. Both matter.
Make quality everyone's job. Data engineers build the checks. Analysts write tests for their models. Business users define what "correct" means. Shared responsibility works.
Learn from failures. Every incident is a chance to add a test that prevents it from happening again. Build your quality suite from real problems.
Alert strategically. Too many alerts and people ignore them. Too few and you miss real issues. Tune constantly.
The goal isn't perfection. It's trust. When someone looks at a dashboard, they should trust the numbers. When leadership makes a decision based on data, it should be the right decision.
That's what data quality is really about.
Getting Started
If you're building a data platform or trying to improve an existing one:
Pick your most critical table
Add five basic checks (not null, unique, value ranges, row count, freshness)
Set up alerts
Wait for something to break
Add a check that would have caught it
Repeat
Don't try to build perfect quality checks on day one. Build them incrementally based on what actually goes wrong.
And when something does break (it will), treat it as a learning opportunity. What check would have caught this? Add it. Move on.
Data quality isn't a project with an end date. It's ongoing vigilance. The teams that do it well make it part of their culture, not just a checklist.
Want to discuss data quality or pipeline architecture? Connect with me on LinkedIn or check out my portfolio. Always happy to talk about building reliable data systems.
And if you're working on data quality tools or have war stories to share, I'd love to hear them!
Thanks for reading! If this was helpful, follow for more articles on data engineering, building reliable systems, and lessons learned from production incidents.