Rutika Khaire


Behind the Scenes of Data Ingestion: How Small Issues Cause Big Headaches

Data ingestion is often treated as a solved problem until it breaks. What looks like a simple pipeline moving data from source to destination can quietly introduce inconsistencies, missing records, or silent failures that ripple across analytics and reporting systems.

In this blog, we’ll go behind the scenes of a real-world data ingestion architecture, explore common issues that arise, uncover their root causes, and share best practices to build more resilient pipelines.


Why Data Ingestion Matters

Data ingestion is the foundation of every data platform. When ingestion goes wrong:

  • Reports show incorrect numbers
  • Business decisions are based on stale or incomplete data
  • Engineers spend hours firefighting instead of building features

The most dangerous problems aren't always the obvious failures; they're the subtle ones that go unnoticed for weeks.


Our Ingestion Architecture at a Glance

  • Azure Data Factory (ADF) for orchestration
  • SQL Server Views as the source layer

  • Medallion Architecture
    Bronze: Raw ingested data
    Silver: Cleaned and transformed data
    Gold: Business-ready datasets

  • Watermark-based incremental loads to process only changed data

This architecture is scalable and efficient, but only when it is implemented carefully.
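
To make the watermark bullet concrete, here is a minimal T-SQL sketch of how an incremental load can track its high-water mark. The control table, view, and Bronze table names (etl.WatermarkControl, dbo.vw_Customer, bronze.Customer) are illustrative assumptions, not the actual pipeline objects.

-- Hypothetical control table holding the last successful watermark per source table
CREATE TABLE etl.WatermarkControl (
    SourceTableName  SYSNAME       NOT NULL PRIMARY KEY,
    LastWatermark    DATETIME2(3)  NOT NULL
);

-- 1. Read the watermark for the table being loaded
DECLARE @LastWatermark DATETIME2(3) =
    (SELECT LastWatermark FROM etl.WatermarkControl WHERE SourceTableName = 'dbo.Customer');

-- 2. Extract only the rows changed since the last successful run into the Bronze layer
INSERT INTO bronze.Customer
SELECT *
FROM dbo.vw_Customer
WHERE LastModifiedDate > @LastWatermark;

-- 3. Advance the watermark to the latest timestamp actually loaded
UPDATE etl.WatermarkControl
SET LastWatermark = (SELECT MAX(LastModifiedDate) FROM bronze.Customer)
WHERE SourceTableName = 'dbo.Customer';

The important design choice is that the watermark is advanced only after the load succeeds, so a failed run simply reprocesses the same window on retry.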


Issue #1: Missing Deletes – “Why Is This Customer Still Active?”

Real-World Example

A customer account is deleted in the CRM system due to GDPR requirements.
However, the sales dashboard still shows the customer as active weeks later.

Impact

  • Compliance risk
  • Incorrect KPIs
  • Loss of trust from business users

What Actually Happened

Source System        Target Table
Customer Deleted  →  No Delete Captured


Fix (Realistic Approach)

Source (CDC Enabled)
├── Insert
├── Update
└── Delete ──► Target Table

  • Enable CDC or soft-delete flags
  • Add delete-handling logic during MERGE (see the sketch below)
  • Periodically reconcile record counts
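
To make the MERGE step concrete, here is a minimal T-SQL sketch that propagates deletes from a CDC or soft-delete-aware staging table. The table and column names (bronze.Customer_changes, silver.Customer, IsDeleted) are hypothetical.

-- Upsert changed rows and honour deletes coming from the source
MERGE silver.Customer AS tgt
USING bronze.Customer_changes AS src
    ON tgt.CustomerId = src.CustomerId
-- Propagate deletes instead of silently ignoring them
WHEN MATCHED AND src.IsDeleted = 1 THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.Status = src.Status,
               tgt.LastModifiedDate = src.LastModifiedDate
WHEN NOT MATCHED BY TARGET AND src.IsDeleted = 0 THEN
    INSERT (CustomerId, Name, Status, LastModifiedDate)
    VALUES (src.CustomerId, src.Name, src.Status, src.LastModifiedDate);

If the source cannot emit a delete signal at all, a periodic comparison against a full snapshot (for example with WHEN NOT MATCHED BY SOURCE THEN DELETE) is the usual fallback, which is why the reconciliation step above matters.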

Issue #2: Race Conditions – Incremental Loads Miss Late-Arriving Updates

Real-World Example

Incremental loads often rely on a watermark column such as LastModifiedDate.

Example:

  • Last successful watermark = 10:00 AM
  • A record is updated at 9:55 AM
  • That update arrives late due to upstream delays

Because the timestamp is older than the watermark, the incremental query skips it entirely.

Why This Is Dangerous

  • No pipeline failure
  • No alert
  • Data is permanently missed unless a full reload happens

This is a silent data loss scenario.

Common Root Causes

  • Eventual consistency in source systems
  • Batch updates applied late
  • Reliance on timestamps without buffering

How to fix it

  • Use a lookback window (e.g., watermark − 5 or 10 minutes), as in the query sketch below
  • Prefer CDC or version-based sequencing
  • Reprocess recent partitions regularly
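
As a concrete illustration of the lookback window, here is a minimal T-SQL sketch of the incremental extract. The 10-minute buffer, the watermark value, and the dbo.vw_Orders view name are illustrative.

-- Watermark from the last successful run (normally read from a control table)
DECLARE @LastWatermark   DATETIME2(3) = '2024-01-15 10:00:00';
DECLARE @LookbackMinutes INT          = 10;

-- Pull everything modified after (watermark - lookback) so late-arriving updates are caught
SELECT *
FROM dbo.vw_Orders
WHERE LastModifiedDate > DATEADD(MINUTE, -@LookbackMinutes, @LastWatermark);

-- Rows inside the lookback window may be picked up twice,
-- so the downstream MERGE must be idempotent (upsert on the business key).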

Issue #3: Over-Aggressive Filtering – “Where Did My Data Go?”

Real-World Example

A filter is added to exclude test users:

WHERE username NOT LIKE '%test%'

Suddenly, legitimate users like contest_winner disappear from reports.

Hidden Damage

  • No pipeline failure
  • No alert
  • Business notices weeks later

Better Filtering Strategy

Incoming Data
├── Valid Users ──► Continue
└── Test Users ──► Logged + Reviewed

How to fix it

  • Use exact match lists (e.g., TestUserList)
  • Log filtered records for later review (see the sketch below)
  • Validate filters with production samples
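
A minimal T-SQL sketch of this strategy, assuming hypothetical ref.TestUserList and audit.FilteredRecords tables:

-- 1. Log the rows that are about to be filtered out so they can be reviewed later
INSERT INTO audit.FilteredRecords (UserId, Username, FilterReason, FilteredAt)
SELECT s.UserId, s.Username, 'Listed in ref.TestUserList', SYSUTCDATETIME()
FROM bronze.Users AS s
JOIN ref.TestUserList AS t ON t.Username = s.Username;

-- 2. Load only users that are NOT on the exact-match exclusion list
--    (no wildcard patterns, so a real user like contest_winner is safe)
SELECT s.*
FROM bronze.Users AS s
WHERE NOT EXISTS (
    SELECT 1 FROM ref.TestUserList AS t WHERE t.Username = s.Username
);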

Best Practices

Resilient Ingestion Design

Detect Change ──► Validate Data ──► Apply Idempotent Load ──► Monitor + Reconcile

Operational Guardrails

  • Row-count reconciliation (see the check sketched below)
  • Data freshness checks
  • Alerting on anomalies, not just failures
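
As one example of a guardrail, here is a minimal T-SQL sketch of a row-count reconciliation check. The object names are illustrative, and in practice this would run as a validation step after the load, with the error surfaced to the orchestrator's alerting.

-- Compare source and target counts after the load completes
DECLARE @SourceCount BIGINT = (SELECT COUNT_BIG(*) FROM dbo.vw_Customer);   -- source view
DECLARE @TargetCount BIGINT = (SELECT COUNT_BIG(*) FROM silver.Customer);   -- ingested table

IF @SourceCount <> @TargetCount
BEGIN
    -- Fail loudly so the pipeline run is flagged instead of passing silently
    DECLARE @msg NVARCHAR(400) =
        CONCAT('Row-count mismatch: source=', @SourceCount, ', target=', @TargetCount);
    THROW 50001, @msg, 1;
END;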

Final Thoughts

Most data ingestion failures don’t crash pipelines—they quietly corrupt trust. Missed deletes, late-arriving updates, and poorly managed watermarks often go unnoticed until business users start asking uncomfortable questions.

The takeaway is simple: build ingestion pipelines for real-world behavior, not ideal scenarios. Expect delays, partial failures, and messy data. Validate early, reconcile often, and treat control logic like watermarks as first-class citizens.

In data engineering, it’s rarely the big failures that hurt the most—it’s the small ones you didn’t see coming.
