Data ingestion is often treated as a solved problem, right up until it breaks. What looks like a simple pipeline moving data from source to destination can quietly introduce inconsistencies, missing records, or silent failures that ripple across analytics and reporting systems.
In this blog, we’ll go behind the scenes of a real-world data ingestion architecture, explore common issues that arise, uncover their root causes, and share best practices to build more resilient pipelines.
Why Data Ingestion Matters
Data ingestion is the foundation of every data platform. When ingestion goes wrong:
- Reports show incorrect numbers
- Business decisions are based on stale or incomplete data
- Engineers spend hours firefighting instead of building features
The most dangerous problems aren’t always the obvious failures; they’re the subtle ones that go unnoticed for weeks.
Our Ingestion Architecture at a Glance
- Azure Data Factory (ADF) for orchestration
- SQL Server Views as the source layer
- Medallion Architecture:
  - Bronze: Raw ingested data
  - Silver: Cleaned and transformed data
  - Gold: Business-ready datasets
- Watermark-based incremental loads to process only changed data
This architecture is scalable and efficient, but only when implemented carefully.
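As a rough sketch of the watermark pattern (the etl.WatermarkControl table, dbo.vw_Customers view, and LastModifiedDate column are illustrative names, not our actual schema), an incremental pull boils down to three steps:

```sql
-- Illustrative sketch of a watermark-based incremental load (object names are hypothetical).
DECLARE @LastWatermark DATETIME2;

-- 1. Read the last successful watermark for this source.
SELECT @LastWatermark = LastWatermarkValue
FROM etl.WatermarkControl
WHERE TableName = 'dbo.vw_Customers';

-- 2. Pull only the rows that changed since that watermark.
SELECT *
FROM dbo.vw_Customers
WHERE LastModifiedDate > @LastWatermark;

-- 3. After the copy succeeds, advance the watermark to the newest change seen.
UPDATE etl.WatermarkControl
SET LastWatermarkValue = (SELECT MAX(LastModifiedDate) FROM dbo.vw_Customers)
WHERE TableName = 'dbo.vw_Customers';
```

In ADF this usually maps to a Lookup activity, a Copy activity, and a stored procedure (or script) activity that updates the control table.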
Issue #1: Missing Deletes – “Why Is This Customer Still Active?”
Real-World Example
A customer account is deleted in the CRM system due to GDPR requirements.
However, the sales dashboard still shows the customer as active weeks later.
Impact
- Compliance risk
- Incorrect KPIs
- Loss of trust from business users
What Actually Happened
Source System                    Target Table
Customer deleted  ──────────►    No delete captured
Fix (Realistic Approach)
Source (CDC Enabled)
│
├── Insert
├── Update
└── Delete ──► Target Table
- Enable CDC or soft-delete flags
- Add delete-handling logic during MERGE (see the sketch below)
- Periodically reconcile record counts
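A minimal sketch of delete-aware loading, assuming a CDC-fed staging table (stg.Customer, silver.Customer, and the IsDeleted flag are hypothetical names):

```sql
-- Illustrative sketch: one MERGE that propagates inserts, updates, and deletes.
-- stg.Customer holds the latest changes from the source; silver.Customer is the target.
MERGE silver.Customer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND src.IsDeleted = 1 THEN
    DELETE                               -- or UPDATE SET tgt.IsActive = 0 for a soft delete
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.Email = src.Email,
               tgt.LastModifiedDate = src.LastModifiedDate
WHEN NOT MATCHED BY TARGET AND src.IsDeleted = 0 THEN
    INSERT (CustomerId, Name, Email, LastModifiedDate)
    VALUES (src.CustomerId, src.Name, src.Email, src.LastModifiedDate);
```

A periodic full-key comparison between source and target still helps catch anything CDC misses.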
Issue #2: Race Conditions – Incremental Loads Miss Late-Arriving Updates

Real-World Example
Incremental loads often rely on a watermark column such as LastModifiedDate.
Example:
- Last successful watermark = 10:00 AM
- A record is updated at 9:55 AM
- Due to upstream delays, that update only becomes visible in the source after the 10:00 AM load has already run
Because its timestamp (9:55 AM) is older than the stored watermark, the next incremental query skips it entirely.
Why This Is Dangerous
- No pipeline failure
- No alert
- Data is permanently missed unless a full reload happens
This is a silent data loss scenario.
Common Root Causes
- Eventual consistency in source systems
- Batch updates applied late
- Reliance on timestamps without buffering
How to fix it
- Use a lookback window (e.g., watermark − 5 or 10 minutes), as sketched below
- Prefer CDC or version-based sequencing
- Reprocess recent partitions regularly
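Here is a minimal sketch of the lookback-window variant, reusing the hypothetical control table from earlier; the load itself must be idempotent (e.g., the MERGE above) so re-reading the buffer doesn’t create duplicates:

```sql
-- Illustrative sketch: re-read a 10-minute buffer behind the stored watermark
-- so late-arriving rows stamped before the watermark are still picked up.
DECLARE @LastWatermark DATETIME2;

SELECT @LastWatermark = LastWatermarkValue
FROM etl.WatermarkControl
WHERE TableName = 'dbo.vw_Customers';

SELECT *
FROM dbo.vw_Customers
WHERE LastModifiedDate > DATEADD(MINUTE, -10, @LastWatermark);
```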
Issue #3: Over-Aggressive Filtering – “Where Did My Data Go?”
A filter is added to exclude test users:
WHERE username NOT LIKE '%test%'
Suddenly, legitimate users like contest_winner disappear from reports.
Hidden Damage
- No pipeline failure
- No alert
- Business notices weeks later
Better Filtering Strategy
Incoming Data
│
├── Valid Users ──► Continue
└── Test Users ──► Logged + Reviewed
How to fix it
- Use exact match lists (e.g., TestUserList), as shown below
- Log filtered records
- Validate filters with production samples
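A sketch of that strategy in SQL (the ref.TestUserList and audit.FilteredUsers objects are hypothetical): excluded rows are logged for review, and only exact matches against the curated list are dropped, so a username like contest_winner is unaffected.

```sql
-- Illustrative sketch: log what gets filtered, then filter by exact match only.
INSERT INTO audit.FilteredUsers (UserName, FilteredAt)
SELECT u.UserName, SYSUTCDATETIME()
FROM stg.Users AS u
JOIN ref.TestUserList AS t
    ON u.UserName = t.UserName;

-- Pass through everything that is not on the curated test-user list.
SELECT u.*
FROM stg.Users AS u
WHERE NOT EXISTS (
    SELECT 1
    FROM ref.TestUserList AS t
    WHERE t.UserName = u.UserName
);
```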
Best Practices
Resilient Ingestion Design
Detect Change
│
Validate Data
│
Apply Idempotent Load
│
Monitor + Reconcile
Operational Guardrails
- Row-count reconciliation (see the sketch below)
- Data freshness checks
- Alerting on anomalies—not just failures
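For example, a post-load row-count reconciliation (object names illustrative) can compare source and Bronze counts and raise an error that the orchestrator surfaces as an alert:

```sql
-- Illustrative sketch: fail loudly when source and target row counts diverge.
DECLARE @SourceCount INT = (SELECT COUNT(*) FROM dbo.vw_Customers);
DECLARE @TargetCount INT = (SELECT COUNT(*) FROM bronze.Customer);

IF @SourceCount <> @TargetCount
    RAISERROR ('Row-count mismatch: source=%d, target=%d', 16, 1, @SourceCount, @TargetCount);
```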
Final Thoughts
Most data ingestion failures don’t crash pipelines—they quietly corrupt trust. Missed deletes, late-arriving updates, and poorly managed watermarks often go unnoticed until business users start asking uncomfortable questions.
The takeaway is simple: build ingestion pipelines for real-world behavior, not ideal scenarios. Expect delays, partial failures, and messy data. Validate early, reconcile often, and treat control logic like watermarks as first-class citizens.
In data engineering, it’s rarely the big failures that hurt the most—it’s the small ones you didn’t see coming.



