This article was originally published on the layline.io blog.
Schema drift and upstream breaking changes are the number one cause of silent data failures — but most pipeline content focuses on infrastructure, not source-system behavior.
The field that changed type on a Tuesday
A team I know runs payment reconciliation for a mid-size e-commerce company. Their pipeline pulls transaction data from a third-party payment processor, transforms it, and loads it into their data warehouse. It's been running without incident for two and a half years.
On a Tuesday afternoon in November, the payment processor quietly updated their API. One field — transaction_amount — changed from a string (because some legacy systems represent money as "47.50") to a native float (47.50). No versioning. No deprecation notice. No email. The documentation was updated sometime over the following week.
The pipeline didn't crash. It kept running. It kept processing transactions. It kept reporting success.
What it stopped doing was casting correctly. The downstream transformation assumed string input and applied a regex to strip currency symbols before converting. With a float coming in, the regex matched nothing, the conversion produced null, and every transaction for the next six hours had an amount of zero.
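In Python terms, the failure looked something like the sketch below. The team's actual code ran in a different stack, and the field name transaction_amount is the only detail taken from the incident; the regex and the coalescing are reconstructed for illustration.

```python
import re

AMOUNT_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)")

def parse_amount(raw):
    """Strip currency symbols from a string amount, then convert to float."""
    if isinstance(raw, str):
        match = AMOUNT_RE.search(raw)
        return float(match.group(1)) if match else None
    return None  # a float slips past the string branch and comes back as None

# Downstream, None was coalesced to zero instead of raising:
record = {"transaction_amount": 47.50}  # the post-change payload
amount = parse_amount(record["transaction_amount"]) or 0.0
print(amount)  # 0.0 -- a "processed" transaction worth nothing
```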
Six hours of zero-dollar transactions, all showing as processed. Nobody noticed until the daily reconciliation report came out the next morning and the numbers looked like a rounding error had swallowed the business.
I'm telling this story because it illustrates something that most pipeline architecture writing misses: the scariest failures don't come from your infrastructure. They come from systems you don't control.
Three types of upstream change
Not all upstream changes are equal. I've watched teams get burned by each of them, and they require different defenses.
Additive changes are the ones vendors announce as "backward compatible." New fields appear in the response. Existing fields stay the same. In theory, your pipeline should be fine — you're not using the new fields. In practice, additive changes break pipelines when they hit implicit size assumptions (a JSON response now exceeds a buffer limit), when wildcard schema captures start picking up fields you didn't expect, or when that new field is named something that collides with a field you already have in your destination table.
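The collision case is worth seeing concretely. A toy sketch, with field names invented for illustration:

```python
def to_row(record: dict) -> dict:
    # Wildcard capture: start from our own defaults, then merge in every
    # upstream field so that new vendor fields "just work".
    return {"status": "reconciled", **record}

# Before the additive change:
print(to_row({"transaction_id": "t-1"}))
# {'status': 'reconciled', 'transaction_id': 't-1'}

# After the vendor adds an innocuous new field that happens to share a name:
print(to_row({"transaction_id": "t-1", "status": "authorized"}))
# {'status': 'authorized', 'transaction_id': 't-1'}  <- your column, silently overwritten
```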
Breaking changes are the honest ones, at least. The field is renamed. The type changes. An endpoint is deprecated. These should be announced — and usually are, for reputable vendors. But "announced" doesn't mean "acted on." The announcement sits in an email digest that nobody reads because the team that receives it isn't the team that owns the pipeline, and by the time the deprecation date arrives, the original engineer has moved to a different company.
Silent changes are the payment processor situation. The kind nobody tells you about because, from the vendor's perspective, nothing changed. The semantics are the same. The data is the same. Just the type changed. Or the encoding. Or the null handling behavior. Silent changes are the ones that turn into six-hour data corruption events before anyone notices.
The proportion of each type varies by vendor maturity. Established financial APIs are mostly breaking changes with long deprecation windows. SaaS products with fast release cycles are mostly silent and additive. Partner-provided data feeds — the unglamorous, critical kind that run B2B integrations — are genuinely unpredictable.
Why most pipelines fail at the wrong layer
Here's the thing about schema validation: almost every modern pipeline tool supports it. You can define schemas. You can validate at ingestion. You can reject malformed records.
Most teams don't do it, for understandable reasons.
In the early days of a pipeline, the schema changes constantly. The source system is still in development. Strict validation would fail the pipeline every time a field gets added or renamed during normal iteration. So validation gets turned off, or loosened to "best effort," and by the time the pipeline reaches production, nobody remembers to tighten it back up.
There's also a philosophical split in how teams think about schema enforcement. Strict schema validation feels defensive. It feels like you're building a wall that will break the pipeline every time the source system breathes. Permissive handling feels pragmatic. Handle what you can, pass through what you can't, let the destination figure it out.
The problem with permissive handling is that it shifts the failure surface downstream and makes it invisible. Your pipeline doesn't fail. Your downstream analytics or application silently processes bad data. And by the time you notice — days later, when a report looks wrong, or a user reports a discrepancy — the corrupted records have been commingled with legitimate ones, compounded by downstream transformations, and possibly acted on.
Schema validation at the pipeline layer isn't about being strict for its own sake. It's about making failures loud and early rather than quiet and late.
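A minimal sketch of what "loud and early" can look like at ingestion. The field list, the quarantine stub, and the happy-path handoff are all assumptions for illustration:

```python
EXPECTED = {"transaction_id": str, "transaction_amount": str, "currency": str}

def violations(record: dict) -> list[str]:
    """Return every way the record deviates from the expected schema."""
    missing = [f"missing field: {f}" for f in EXPECTED if f not in record]
    wrong_type = [
        f"{f}: expected {t.__name__}, got {type(record[f]).__name__}"
        for f, t in EXPECTED.items()
        if f in record and not isinstance(record[f], t)
    ]
    return missing + wrong_type

def quarantine(record: dict, errors: list[str]) -> None:
    # Stub: the real version writes to a dead-letter sink and pages someone.
    print(f"QUARANTINED: {errors}")

def ingest(record: dict) -> None:
    errors = violations(record)
    if errors:
        quarantine(record, errors)  # loud and early, at the boundary
        return
    # ... happy path: hand the record to the transformation layer ...

ingest({"transaction_id": "t-1", "transaction_amount": 47.50, "currency": "USD"})
# QUARANTINED: ['transaction_amount: expected str, got float']
```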
The three classes of defense
After watching enough of these incidents, I've found that teams that handle upstream changes gracefully do three things consistently.
Shape validation, not just types. Type validation catches the payment processor situation. But shape validation catches the subtler cases: a required field becoming optional (and therefore sometimes absent), an array that used to always have one element now sometimes having zero, an object that used to be flat now nesting one level deeper.
The distinction matters because type errors produce loud failures. Shape mismatches produce quiet ones. A field that's present 99.9% of the time and absent 0.1% of the time will produce a null-handling bug that takes weeks to surface because it only triggers on rare transaction types, or specific geographic regions, or edge-case payment methods.
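One way to express shape constraints, not just types, is JSON Schema. A sketch using the Python jsonschema package, with a schema invented for illustration:

```python
from jsonschema import Draft7Validator  # pip install jsonschema

# Shape constraints: required fields, minimum array length, and nesting
# are part of the contract, not just scalar types.
SCHEMA = {
    "type": "object",
    "required": ["transaction_id", "line_items"],
    "properties": {
        "transaction_id": {"type": "string"},
        "line_items": {
            "type": "array",
            "minItems": 1,  # catches "array that now sometimes has zero elements"
            "items": {
                "type": "object",
                "required": ["amount"],
                "properties": {"amount": {"type": "string"}},
            },
        },
    },
}

validator = Draft7Validator(SCHEMA)
record = {"transaction_id": "t-1", "line_items": []}
for error in validator.iter_errors(record):
    print(error.message)  # e.g. "[] is too short" -- a shape error, not a type error
```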
Schema drift monitoring, not just job status. Job status tells you whether the pipeline ran. Schema drift monitoring tells you whether what the pipeline processed today is the same shape as what it processed yesterday.
This doesn't require a sophisticated observability platform. The simplest version is a daily check that hashes the inferred schema of a sample of records from each source and alerts if the hash changes. It's crude but effective. Most schema drift events are detectable by this method within 24 hours.
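The crude version is small enough to sketch in full. Where yesterday's hash is stored is an assumption (a file, a table, anywhere durable):

```python
import hashlib
import json

def inferred_schema(records):
    """Infer field -> sorted type names from a sample of records."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(t) for f, t in sorted(schema.items())}

def schema_hash(records) -> str:
    canonical = json.dumps(inferred_schema(records), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Daily job: compare today's hash against yesterday's stored value.
yesterday = schema_hash([{"transaction_amount": "47.50"}])
today = schema_hash([{"transaction_amount": 47.50}])
if today != yesterday:
    print("ALERT: inferred schema changed")  # fires within a day of the drift
```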
More sophisticated versions track field-level statistics: null rates by field, cardinality by field, type distribution by field. When the null rate for transaction_amount goes from 0.0% to 0.1%, something changed upstream. Maybe it's intentional. Maybe it's a bug. Either way, you want to know before it becomes a problem.
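The field-level version is not much bigger. The samples and the alert threshold here are assumptions; a real check would compare against a rolling baseline:

```python
def null_rates(records):
    """Per-field null rate across a sample of records."""
    counts, nulls = {}, {}
    for rec in records:
        for field, value in rec.items():
            counts[field] = counts.get(field, 0) + 1
            if value is None:
                nulls[field] = nulls.get(field, 0) + 1
    return {f: nulls.get(f, 0) / counts[f] for f in counts}

yesterday_sample = [{"transaction_amount": "47.50"}] * 1000
today_sample = [{"transaction_amount": "47.50"}] * 999 + [{"transaction_amount": None}]

baseline = null_rates(yesterday_sample)
for field, rate in null_rates(today_sample).items():
    if rate > baseline.get(field, 0.0):  # 0.0% -> 0.1%: something changed upstream
        print(f"ALERT: null rate for {field} rose from {baseline.get(field, 0.0):.1%} to {rate:.1%}")
```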
Separating ingestion from processing. This is the architectural pattern that buys the most time when upstream changes happen. If your pipeline ingests raw data into a landing zone before processing it, you have the option to replay against historical raw data after fixing a schema issue. If ingestion and processing are coupled, you lose that option.
The raw landing zone doesn't have to be expensive or complex. For many use cases, an append-only object store (S3, GCS, Azure Blob) with partitioned raw JSON is sufficient. The transformation layer reads from the landing zone, not directly from the source. When something goes wrong upstream, you fix the transformation and replay. The data is still there.
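The ingestion side of that pattern can be a few lines. A sketch assuming S3 via boto3; the bucket name and partition layout are illustrative:

```python
import json
from datetime import datetime, timezone

import boto3  # pip install boto3

s3 = boto3.client("s3")

def land_raw(record: dict, source: str = "payment-processor") -> None:
    """Append the untouched payload to a partitioned landing zone.

    The transformation layer reads from here, never from the source
    directly, so a broken transform can be fixed and replayed against
    the raw history.
    """
    now = datetime.now(timezone.utc)
    key = (
        f"raw/{source}/dt={now:%Y-%m-%d}/hour={now:%H}/"
        f"{now:%Y%m%dT%H%M%S%f}.json"
    )
    s3.put_object(
        Bucket="my-landing-zone",  # hypothetical bucket
        Key=key,
        Body=json.dumps(record).encode("utf-8"),
    )
```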
Contract testing at the pipeline layer: is it worth it?
You'll hear about consumer-driven contract testing as the "correct" solution to this problem. The idea is that your pipeline publishes a contract — these are the fields I depend on, these are the types I expect, this is what I consider a breaking change — and the source system is expected to validate against that contract before deploying changes.
This works well when you control both sides of the integration. If you're integrating internal microservices, or working with a vendor who takes integration stability seriously, contract testing is genuinely valuable. Tools like Pact make it tractable.
For the majority of integrations I see in practice — third-party SaaS, partner APIs, data feeds from systems you have no pull over — contract testing is a nice theory. You cannot compel a payment processor to run your Pact tests before they deploy. You cannot negotiate contract publication rights with a vendor whose legal team has never heard of consumer-driven contracts.
The more practical frame is: what can you do on your side of the boundary to detect changes and recover from them quickly?
Which brings me back to schema monitoring, landing zones, and pipeline-level validation. Not glamorous. Not the technically interesting solution. But the one that actually works across the full range of upstream scenarios you'll encounter.
The question to ask at every integration kickoff
I've started asking one question at every integration design review: What's the process when this changes without warning?
Not if. When.
It sounds pessimistic. The partner integration team sometimes takes it personally. But the question forces a conversation that almost always surfaces assumptions nobody had made explicit: the assumption that the source system's team will communicate breaking changes, the assumption that someone on the integration team will read the changelog, the assumption that the pipeline can tolerate X days of incorrect data before someone notices.
Those assumptions are usually wrong. Making them explicit gives you a chance to design around them.
The answer to "what happens when this changes without warning" should involve at minimum: where the alert fires, who receives it, how quickly the team can identify which field changed, and how quickly they can replay affected data from the raw landing zone. If the answer is "we'd have to investigate and probably call the vendor," the pipeline isn't ready for production.
Where layline.io fits in this
Schema evolution is one of the things we think about a lot in how layline.io handles data processing. When you're dealing with both batch and streaming pipelines — and the reality is that most teams run both indefinitely — the upstream change problem compounds. A schema change in a streaming source hits you in real time. The same change in a batch source might not surface for 24 hours.
layline.io's processing model supports schema evolution through explicit version routing: when a new schema version appears, you can apply separate logic and validation to it, or route those records to a dedicated flow for handling altogether, rather than letting them contaminate your main processing path.
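The routing idea itself is easy to sketch generically. This is an illustration of the pattern in plain Python, not layline.io's API, and the version-detection check is an assumption:

```python
def process_v1(record): ...  # stub: legacy transformation (string amounts)
def process_v2(record): ...  # stub: updated transformation (float amounts)
def quarantine(record): ...  # stub: isolate unknown shapes for inspection

def route(record: dict):
    """Route each record by detected schema version instead of letting
    unknown shapes flow into the main path."""
    amount = record.get("transaction_amount")
    if isinstance(amount, str):
        return process_v1(record)   # old schema version
    if isinstance(amount, (int, float)):
        return process_v2(record)   # new schema version
    return quarantine(record)       # unknown shape: don't guess
```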
It's not magic. You still have to design your integration with the assumption that upstream things will change. But it means that when they do change, the failure surface is smaller and the recovery path is faster.
The teams that handle upstream changes gracefully aren't the ones with the most sophisticated infrastructure. They're the ones that stopped assuming the source system would never surprise them.

