Rutika Khaire


Behind the Scenes of Data Ingestion: How Small Issues Cause Big Headaches

Data ingestion is often treated as a solved problem until it breaks. What looks like a simple pipeline moving data from source to destination can quietly introduce inconsistencies, missing records, or silent failures that ripple across analytics and reporting systems.

In this blog, we’ll go behind the scenes of a real-world data ingestion architecture, explore common issues that arise, uncover their root causes, and share best practices to build more resilient pipelines.


Why Data Ingestion Matters

Data ingestion is the foundation of every data platform. When ingestion goes wrong:

  • Reports show incorrect numbers
  • Business decisions are based on stale or incomplete data
  • Engineers spend hours firefighting instead of building features

The most dangerous problems aren't always the obvious failures; they're the subtle ones that go unnoticed for weeks.


Our Ingestion Architecture at a Glance

  • Azure Data Factory (ADF) for orchestration
  • SQL Server Views as the source layer

  • Medallion Architecture
    Bronze: Raw ingested data
    Silver: Cleaned and transformed data
    Gold: Business-ready datasets

  • Watermark-based incremental loads to process only changed data

This architecture is scalable and efficient, but only when it is implemented carefully.
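
To make the watermark bullet concrete, here is a minimal T-SQL sketch of how an incremental load can track its high-water mark. The control table, view, and Bronze table names (etl.WatermarkControl, dbo.vw_Customer, bronze.Customer) are illustrative assumptions, not the actual pipeline objects.

-- Hypothetical control table holding the last successful watermark per source table
CREATE TABLE etl.WatermarkControl (
    SourceTableName  SYSNAME       NOT NULL PRIMARY KEY,
    LastWatermark    DATETIME2(3)  NOT NULL
);

-- 1. Read the watermark for the table being loaded
DECLARE @LastWatermark DATETIME2(3) =
    (SELECT LastWatermark FROM etl.WatermarkControl WHERE SourceTableName = 'dbo.Customer');

-- 2. Extract only the rows changed since the last successful run into the Bronze layer
INSERT INTO bronze.Customer
SELECT *
FROM dbo.vw_Customer
WHERE LastModifiedDate > @LastWatermark;

-- 3. Advance the watermark to the latest timestamp actually loaded
UPDATE etl.WatermarkControl
SET LastWatermark = (SELECT MAX(LastModifiedDate) FROM bronze.Customer)
WHERE SourceTableName = 'dbo.Customer';

The important design choice is that the watermark is advanced only after the load succeeds, so a failed run simply reprocesses the same window on retry.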


Issue #1: Missing Deletes – “Why Is This Customer Still Active?”

Real-World Example

A customer account is deleted in the CRM system due to GDPR requirements.
However, the sales dashboard still shows the customer as active weeks later.

Impact

  • Compliance risk
  • Incorrect KPIs
  • Loss of trust from business users

What Actually Happened

Source System        Target Table
Customer Deleted  →  No Delete Captured


Fix (Realistic Approach)

Source (CDC Enabled)
├── Insert
├── Update
└── Delete ──► Target Table

  • Enable CDC or soft-delete flags
  • Add delete-handling logic during MERGE (see the sketch below)
  • Periodically reconcile record counts
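
To make the MERGE step concrete, here is a minimal T-SQL sketch that propagates deletes from a CDC or soft-delete-aware staging table. The table and column names (bronze.Customer_changes, silver.Customer, IsDeleted) are hypothetical.

-- Upsert changed rows and honour deletes coming from the source
MERGE silver.Customer AS tgt
USING bronze.Customer_changes AS src
    ON tgt.CustomerId = src.CustomerId
-- Propagate deletes instead of silently ignoring them
WHEN MATCHED AND src.IsDeleted = 1 THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.Status = src.Status,
               tgt.LastModifiedDate = src.LastModifiedDate
WHEN NOT MATCHED BY TARGET AND src.IsDeleted = 0 THEN
    INSERT (CustomerId, Name, Status, LastModifiedDate)
    VALUES (src.CustomerId, src.Name, src.Status, src.LastModifiedDate);

If the source cannot emit a delete signal at all, a periodic comparison against a full snapshot (for example with WHEN NOT MATCHED BY SOURCE THEN DELETE) is the usual fallback, which is why the reconciliation step above matters.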

Issue #2: Race Conditions – Incremental Loads Miss Late-Arriving Updates

Real-World Example

Incremental loads often rely on a watermark column such as LastModifiedDate.

Example:

  • Last successful watermark = 10:00 AM
  • A record is updated at 9:55 AM
  • That update arrives late due to upstream delays

Because the timestamp is older than the watermark, the incremental query skips it entirely.

Why This Is Dangerous

  • No pipeline failure
  • No alert
  • Data is permanently missed unless a full reload happens

This is a silent data loss scenario.

Common Root Causes

  • Eventual consistency in source systems
  • Batch updates applied late
  • Reliance on timestamps without buffering

How to fix it

  • Use a lookback window (e.g., watermark − 5 or 10 minutes), as in the query sketch below
  • Prefer CDC or version-based sequencing
  • Reprocess recent partitions regularly
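
As a concrete illustration of the lookback window, here is a minimal T-SQL sketch of the incremental extract. The 10-minute buffer, the watermark value, and the dbo.vw_Orders view name are illustrative.

-- Watermark from the last successful run (normally read from a control table)
DECLARE @LastWatermark   DATETIME2(3) = '2024-01-15 10:00:00';
DECLARE @LookbackMinutes INT          = 10;

-- Pull everything modified after (watermark - lookback) so late-arriving updates are caught
SELECT *
FROM dbo.vw_Orders
WHERE LastModifiedDate > DATEADD(MINUTE, -@LookbackMinutes, @LastWatermark);

-- Rows inside the lookback window may be picked up twice,
-- so the downstream MERGE must be idempotent (upsert on the business key).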

Issue #3: Over-Aggressive Filtering – “Where Did My Data Go?”

Real-World Example

A filter is added to exclude test users:

WHERE username NOT LIKE '%test%'

Suddenly, legitimate users like contest_winner disappear from reports.

Hidden Damage

  • No pipeline failure
  • No alert
  • Business notices weeks later

Better Filtering Strategy

Incoming Data
├── Valid Users ──► Continue
└── Test Users ──► Logged + Reviewed

How to fix it

  • Use exact match lists (e.g., TestUserList)
  • Log filtered records for later review (see the sketch below)
  • Validate filters with production samples
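
A minimal T-SQL sketch of this strategy, assuming hypothetical ref.TestUserList and audit.FilteredRecords tables:

-- 1. Log the rows that are about to be filtered out so they can be reviewed later
INSERT INTO audit.FilteredRecords (UserId, Username, FilterReason, FilteredAt)
SELECT s.UserId, s.Username, 'Listed in ref.TestUserList', SYSUTCDATETIME()
FROM bronze.Users AS s
JOIN ref.TestUserList AS t ON t.Username = s.Username;

-- 2. Load only users that are NOT on the exact-match exclusion list
--    (no wildcard patterns, so a real user like contest_winner is safe)
SELECT s.*
FROM bronze.Users AS s
WHERE NOT EXISTS (
    SELECT 1 FROM ref.TestUserList AS t WHERE t.Username = s.Username
);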

Best Practices

Resilient Ingestion Design

Detect Change ──► Validate Data ──► Apply Idempotent Load ──► Monitor + Reconcile

Operational Guardrails

  • Row-count reconciliation (see the check sketched below)
  • Data freshness checks
  • Alerting on anomalies, not just failures
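
As one example of a guardrail, here is a minimal T-SQL sketch of a row-count reconciliation check. The object names are illustrative, and in practice this would run as a validation step after the load, with the error surfaced to the orchestrator's alerting.

-- Compare source and target counts after the load completes
DECLARE @SourceCount BIGINT = (SELECT COUNT_BIG(*) FROM dbo.vw_Customer);   -- source view
DECLARE @TargetCount BIGINT = (SELECT COUNT_BIG(*) FROM silver.Customer);   -- ingested table

IF @SourceCount <> @TargetCount
BEGIN
    -- Fail loudly so the pipeline run is flagged instead of passing silently
    DECLARE @msg NVARCHAR(400) =
        CONCAT('Row-count mismatch: source=', @SourceCount, ', target=', @TargetCount);
    THROW 50001, @msg, 1;
END;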

Final Thoughts

Most data ingestion failures don’t crash pipelines—they quietly corrupt trust. Missed deletes, late-arriving updates, and poorly managed watermarks often go unnoticed until business users start asking uncomfortable questions.

The takeaway is simple: build ingestion pipelines for real-world behavior, not ideal scenarios. Expect delays, partial failures, and messy data. Validate early, reconcile often, and treat control logic like watermarks as first-class citizens.

In data engineering, it’s rarely the big failures that hurt the most—it’s the small ones you didn’t see coming.
