It's 3 AM. Your dashboards are blank. Your CI logs show: Schema validation: PASSED.
Somewhere, a data engineer is screaming into their keyboard.
This is the moment where schema validation reveals its dirty secret: it catches syntax, not reality. And the gap between what passes validation and what actually works? That gap is where production breaks.
Let's talk about why your pipeline failed, and why your validation tools didn't catch it.
The False Comfort of "Validation Passed"
Schema validation does one job really well: it checks if your data file is parseable.
// This passes every schema validator alive
{
  "user_id": "12345",
  "email": "test@example.com",
  "created_date": "2025-12-26"
}
Looks good, right? The JSON is valid. The CSV has the right number of columns. The XML tags are closed properly.
But here's what schema validation doesn't care about:
- Whether `user_id` should actually be a number (you're storing it as a string)
- Whether `created_date` is really a date, or just a string that looks like one
- Whether the file has only headers and no data rows
- Whether a column you're counting on actually exists
- Whether `email` values are suddenly changing from `user@domain.com` to `"N/A"` or `null`
Your validator checks the shape. It doesn't check if the shape makes sense.
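Here's that false comfort in a few lines of Python, using the `jsonschema` package and a deliberately typical, permissive schema (the schema itself is an assumption for illustration):

```python
# pip install jsonschema
from jsonschema import validate

record = {
    "user_id": "12345",            # downstream expects an integer; this is a string
    "email": "test@example.com",
    "created_date": "2025-12-26",  # just a string that happens to look like a date
}

# A typical permissive schema: it only pins down the shape of the payload.
schema = {
    "type": "object",
    "required": ["user_id", "email", "created_date"],
    "properties": {
        "user_id": {"type": "string"},
        "email": {"type": "string"},
        "created_date": {"type": "string"},
    },
}

validate(instance=record, schema=schema)  # raises nothing
print("Schema validation: PASSED")        # true, and still no guarantee the data is usable
```

You can tighten the schema with integer types and regex patterns, but it only ever enforces what someone remembered to encode.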
Real-World: The Column Rename That Cost 6 Hours
Here's a scenario that happens in production more often than you'd think:
Your vendor sends you a CSV every day. Your pipeline imports it into a database. Downstream dashboards depend on it. For months, everything works.
Then one morning, a column name changes.
Maybe it was customer_name. Now it's full_name. Maybe order_date became date_order. Your validation passes. The file parses. The schema check says "all good."
But your transformation code? It's looking for customer_name. It doesn't find it. Your pipeline either fails hard or silently drops that column, and your dashboard now shows incomplete data for an entire day.
One engineer on Reddit described exactly this:
"In an enterprise environment you are usually not in control of the data sources. Column renames manifest as missing columns in your expected schema and a new column at the same time. The pipeline cannot resolve this issue and will fail."
Schema validation saw:
- Valid file structure ✓
- The expected number of columns present ✓
- No parse errors ✓
What it missed:
- The column you're depending on is gone ✗
- A new, unexpected column appeared ✗
This is exactly the gap a dedicated pre-ingestion check is built to close, and there's one tool that handles it well:
TRY IT: DatumInt
The Silent Killers: Issues That Pass Validation Every Time
1. Headers-Only Files (The Truncation Trap)
Your vendor sends a CSV with only column headers and zero data rows. Maybe the system crashed mid-export. Maybe someone hit "export template" by accident.
Your validation checks: "Does this parse?" Yes, it does. Headers are valid. Columns are correct.
But when you load this into your data warehouse with a truncate-before-copy strategy? You just deleted all your data and replaced it with nothing.
One engineer mentioned this exact scenario:
"A vendor sent a file with headers only from a truncate pre-copy script that passes schema validation."
They solved it by adding a file-size check. Headers only? Usually less than a few hundred bytes. Data present? Much larger. Simple, but catches a production failure that validation missed.
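A minimal version of that guard looks like this (the 500-byte threshold and file name are illustrative, not gospel):

```python
import csv
import os

MIN_BYTES = 500  # headers-only exports tend to be only a few hundred bytes

def looks_truncated(path: str) -> bool:
    """Flag files that parse fine but carry no data rows."""
    if os.path.getsize(path) < MIN_BYTES:
        return True
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)                 # skip the header row
        return next(reader, None) is None  # no first data row means a truncated export

if looks_truncated("vendor_feed.csv"):
    raise RuntimeError("Refusing to load: file has headers but no data rows")
```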
2. Type Mismatches That Slip Through
Your schema says age should be a number. But the file has:
age
25
30
"unknown"
35
Most validators will parse this as a string column (the "safe" choice). Your downstream system expects an integer. Now you're either:
- Getting type-conversion errors downstream
- Silently casting "unknown" to 0 or NULL
- Breaking your aggregations (can't average strings)
The file is perfectly valid. The data isn't.
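A few lines of Python are enough to surface this before ingestion. A sketch, assuming a CSV with an `age` column (file and column names are made up for the example):

```python
import csv

def non_numeric_rows(path: str, column: str) -> list[tuple[int, str]]:
    """Return (data_row_number, value) pairs where the value can't be read as a number."""
    bad = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):  # data rows numbered from 1
            value = (row.get(column) or "").strip()
            try:
                float(value)
            except ValueError:
                bad.append((i, value))
    return bad

for row_num, value in non_numeric_rows("users.csv", "age"):
    print(f"Row {row_num}: {value!r} is not a number")  # e.g. Row 3: 'unknown' is not a number
```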
3. Date Format Chaos
Your schema expects ISO 8601 dates. But the vendor's system switched regions and is now sending:
12/25/2025
26-12-2025
2025.12.26
All valid date representations. All different parsers. All failing your ETL in different ways.
Schema validation: "It looks like a string. It's a valid string. Ship it."
Your pipeline: "What the hell is 26-12-2025?"
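If you want to catch this at the door, parse strictly and reject anything that isn't the format you expect. A sketch, assuming ISO 8601 is the contract:

```python
from datetime import date

def bad_dates(values: list[str]) -> list[str]:
    """Return values that are not strict ISO 8601 (YYYY-MM-DD) dates."""
    rejected = []
    for v in values:
        try:
            date.fromisoformat(v)
        except ValueError:
            rejected.append(v)
    return rejected

print(bad_dates(["2025-12-26", "12/25/2025", "26-12-2025", "2025.12.26"]))
# ['12/25/2025', '26-12-2025', '2025.12.26']
```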
4. The Null Tsunami
A column suddenly fills with NULL values. Or worse, it fills with string placeholders:
email
user1@example.com
user2@example.com
"N/A"
"unknown"
null
Your schema says "emails are present." Technically true. But 40% of your records now have garbage data. Your downstream analytics will churn out garbage metrics, and no one will know why.
This is the silent killer because:
- No validation error
- No parsing failure
- Just bad data that corrupts everything downstream
One data engineer described the pain clearly:
"High null or duplicate record ratios silently corrupt downstream dashboards and analytics without obvious error signals."
Why Do These Issues Slip Through?
Schema validation is deterministic and intentionally narrow. It's like a bouncer checking your ID at the club:
- "Is this real ID?" → Yes
- "Does it look tampered with?" → No
- "Are you actually the person in the photo?" → Not their job
Validation checks:
- Syntax correctness ✓
- Expected column presence ✓
- Basic type structure ✓
Validation does NOT check:
- Whether columns are actually used downstream
- Whether values make sense for your business logic
- Whether unexpected changes happened
- Whether file size suggests truncation
- Whether data quality degraded
This isn't a flaw in validation. It's by design. A generic validator can't know your business rules, your context, or your downstream dependencies.
But you can catch the common ones before they blow up your pipeline.
The Overkill Approach: Enterprise Data Validation
If you're running a massive data operation, there are heavy-duty tools:
- Great Expectations (Python, comprehensive, mature)
- dbt-expectations (if you use dbt, highly recommended)
- dlt (data load tool, handles schema evolution)
- Airbyte (SaaS, out-of-the-box validation)
These are powerful. They let you define expectations:
- "This column should never be NULL"
- "Percentages should be 0-100"
- "user_id should be unique"
- "Dates should be within reasonable bounds"
- "These categorical fields should only have these values"
But they also require:
- Setup time (30 min to days)
- Ongoing maintenance (as your schema changes)
- Infrastructure (especially dbt)
- Team coordination (who writes the expectations?)
For a solo engineer? A small team? A vendor integration you're doing once? That's often overkill.
The Middle Ground: Lightweight Pre-Ingestion Checks
There's a sweet spot between "nothing" and "enterprise platform": quick, deterministic checks right before you ingest.
Think of it as a health check before you let data into your system:
Check 1: Schema Diff
Expected columns: [user_id, email, created_date]
Actual columns: [user_id, email, creation_date] ← Different name!
Status: MISMATCH
Catches column renames, missing columns, surprise new columns. Takes seconds.
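Hand-rolled, this check is just a set comparison (the expected columns and file name are illustrative):

```python
import csv

EXPECTED = {"user_id", "email", "created_date"}

with open("vendor_feed.csv", newline="") as f:
    actual = set(next(csv.reader(f)))  # header row only

missing = EXPECTED - actual      # columns you depend on that vanished
unexpected = actual - EXPECTED   # surprise new columns (often the other half of a rename)

if missing or unexpected:
    print(f"SCHEMA MISMATCH - missing: {sorted(missing)}, unexpected: {sorted(unexpected)}")
```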
Check 2: File Size / Row Count
File size: 342 bytes (headers only?)
Row count: 0
Status: WARNING - File has headers but no data
Catches truncations, empty exports, failed syncs.
Check 3: Type and Value Validation
Column: age
Expected: numeric
Actual values: 25, 30, "unknown", 35
Status: TYPE MISMATCH in row 3
Value "unknown" is not a number
Catches type mismatches and garbage values.
Check 4: Null and Outlier Detection
Column: email
Nulls: 12% (expected <1%)
Outliers: "N/A" appears 47 times
Status: WARNING - Abnormal null ratio
Catches sudden data quality drops.
Check 5: Logical Consistency
start_date: 2025-01-01
end_date: 2024-12-31
Status: ERROR - Start date is after end date
Catches logical contradictions that parsing won't catch.
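Scripted by hand, this one is a per-row comparison. A sketch, assuming ISO dates and a hypothetical subscriptions.csv:

```python
import csv
from datetime import date

def inconsistent_rows(path: str) -> list[int]:
    """Data-row numbers where start_date is after end_date."""
    bad = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            try:
                if date.fromisoformat(row["start_date"]) > date.fromisoformat(row["end_date"]):
                    bad.append(i)
            except (KeyError, ValueError):
                bad.append(i)  # missing or unparseable dates are their own problem
    return bad

print(inconsistent_rows("subscriptions.csv"))
```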
The beauty? You don't have to hand-roll or maintain any of those scripts. You can run all of these checks in seconds, in a browser, with zero infrastructure. No setup. No Python environment. No waiting.
And critically: you get a human-readable explanation of what's wrong, not just a pass/fail.
In practice, this is what DatumInt does with its Detective D analysis engine. You upload a JSON, CSV, YAML, or XML file. Detective D runs these exact checks deterministically and shows you:
- Which rows are problematic
- Why they're problematic
- What kind of data quality issue it is
You can try it in under a minute: just upload a file and see what Detective D catches. No login required. No setup. It's built specifically for the moment when you're asking "Is this file safe to ingest?" before your pipeline touches it.
TRY IT: DatumInt
What This Approach Does NOT Solve
Let's be honest about the limits:
This Won't Catch...
Business logic failures: "The total revenue is negative." Maybe that's intentional (refunds). A check can flag it, but it can't know if it's right.
Cross-table inconsistencies: "User IDs in this file don't match our existing user database." You need database context.
Semantic drift: "We changed what 'active_user' means, and now our metrics are wrong." Data looks fine. The definition changed.
Deep anomalies: "This month's sales are 3x normal, but the numbers look valid." Maybe there's a new campaign. Maybe there's a bug. You need analysis, not validation.
This is the realm of monitoring, alerting, and investigation - not validation.
That's also why Detective D doesn't attempt ML-based anomaly detection or automatic fixing. It would be guessing. Instead, it focuses on what it can be confident about: structural issues, schema mismatches, type problems, and basic logical contradictions.
This IS Good For...
Early catches: Stopping broken data at the door before it corrupts dashboards.
Debugging speed: When something breaks, this tells you which file broke and why, in seconds.
Peace of mind: You know when you're sending clean data downstream.
Vendor reliability: Quickly spotting when a vendor changed formats without telling you.
The Real World: Why This Matters Right Now
Here's the painful truth: Most teams don't have time to set up Great Expectations or dbt-expectations for every data source.
A solo engineer? Forget it. An early-stage startup with 10 data sources and one person managing them? Not happening.
But they do have 5 minutes before they ingest a new file. They do want to know why their pipeline failed before they spend 2 hours debugging.
One data engineer described their actual workflow:
"I often run into structural or data quality issues that I need to gracefully handle... I store all raw data for reprocessing purposes. After corrections, I automatically reprocess raw data for failures."
Translation: They catch errors, fix them at the source, then re-ingest. Fast validation would save them hours.
This is where a lightweight, browser-based tool fits. Not instead of enterprise tooling for big operations. Alongside it. Or instead of it, if you're not at that scale yet.
Your Next Move
When you receive a data file and need to ask:
- "Why did this break?"
- "Which rows are problematic?"
- "Is this safe to ingest?"
You have options:
- Set up enterprise tooling (if you have the time and scale)
- Do it manually (if you like debugging at 3 AM)
- Use a lightweight check (if you want answers in seconds)
If it's option 3, you can upload your file to DatumInt right now. Detective D will scan it, flag issues, explain what went wrong, and give you a clear picture of whether it's safe to ingest. No infrastructure. No setup. Just answers.
The shape of your data is valid. The reality of your data? That's where the problems hide.
TRY IT: DatumInt
One Last Thing
Next time someone tells you "schema validation passed," ask them one follow-up question:
"But did you check what actually changed?"
That's the question that saves pipelines.
Have you been burned by data that passed validation but broke your pipeline? The scenario matters. Comment below or reach out - I'm collecting real stories because validation tools should be built on real failures, not guesses.


