How "Good Data" Slowly Becomes Bad

Image credit: Ralphs_Fotos via Pixabay
You inherit a dataset. The documentation looks solid. The schemas are clean. Previous analyses produced reliable results.
Six months later, something's off. Dashboards show impossible values. Models that worked perfectly now predict nonsense. Reports contradict each other.
Nothing dramatic happened. No one made obvious mistakes. Yet data that was once trustworthy has become unreliable.
This is data decay, and it happens to every organization. Understanding how good data goes bad is the first step to preventing it.
The Illusion of Stable Data
Data feels permanent. It sits in databases, unchanged until someone modifies it. The numbers from last quarter remain the numbers from last quarter.
But data exists in context. The systems generating it change. The business processes it represents evolve. The definitions people use drift.
The data itself might be static. Everything around it is not.
A customer segment that made sense three years ago, say "high-value enterprise accounts", might now include companies that don't fit the original definition. The code that selects it hasn't changed. The underlying concept has shifted.
Definition Drift
The most insidious form of data decay is definition drift. It happens so gradually that no one notices.
Consider a field called active_user. When it was created, the definition was clear: someone who logged in within the last 30 days.
Then marketing wanted to count users differently for a campaign. They added a variation. Engineering created a separate field with slightly different logic. Someone wrote a report using the wrong one.
Five years later, there are four fields that might mean "active user." Each has slightly different logic. New employees don't know which is authoritative.
The data hasn't changed. Its meaning has become ambiguous.
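One way to slow this kind of drift is to keep a single, authoritative definition in code and build every variant on top of it. Here is a minimal sketch in Python; the 30-day window, field names, and helper functions are illustrative assumptions, not a prescription for any particular system.

```python
from datetime import datetime, timedelta, timezone

# Assumed authoritative rule: "active" means a login within the last 30 days.
ACTIVE_WINDOW_DAYS = 30

def is_active_user(last_login_at: datetime, as_of: datetime | None = None) -> bool:
    """The one place the "active user" rule lives.

    Reports, dashboards, and campaign logic should call this (or an equivalent
    shared SQL view) instead of re-implementing the rule with slight variations.
    Timestamps are expected to be timezone-aware (UTC).
    """
    as_of = as_of or datetime.now(timezone.utc)
    return (as_of - last_login_at) <= timedelta(days=ACTIVE_WINDOW_DAYS)

def is_active_for_campaign(last_login_at: datetime, opted_in: bool) -> bool:
    """A marketing variant layered on the core definition, not written beside it."""
    return opted_in and is_active_user(last_login_at)
```

When the business genuinely needs a new notion of "active," the change lands in one reviewed place instead of spawning a fourth near-duplicate field.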
Schema Creep
Databases evolve to meet new requirements. Fields get added, rarely removed. Tables multiply.
Over time, the schema accumulates cruft. There are columns that nobody uses but nobody deletes. Tables created for one-time analyses linger permanently.
The problem isn't just messiness. It's that important fields get buried among irrelevant ones. Someone looking for the right data can't distinguish active fields from abandoned experiments.
Worse, schema changes often lack documentation. A new column appears, and six months later, no one remembers why it exists or how it should be interpreted.
Source System Changes
Your data comes from somewhere. When that somewhere changes, your data changes too—whether you realize it or not.
A CRM upgrade that changes how addresses are formatted. An API provider that deprecates a field and replaces it with something subtly different. A business unit that switches software platforms.
These changes often happen without notification to downstream data teams. The first sign of trouble is when something breaks.
By then, you might have months of corrupted data. Identifying exactly when the problem started requires detective work.
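A cheap defense is to validate what arrives at ingestion time, so a format change upstream fails loudly on day one instead of quietly corrupting months of history. A rough sketch, assuming a CRM feed with the field names and US postal-code format shown here (both are illustrative):

```python
import re

# Assumed expectations for records arriving from an upstream CRM feed.
REQUIRED_FIELDS = {"customer_id", "country", "postal_code"}
US_POSTAL_CODE = re.compile(r"^\d{5}(-\d{4})?$")

def validate_crm_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks as expected."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("country") == "US":
        postal = str(record.get("postal_code", ""))
        if not US_POSTAL_CODE.match(postal):
            problems.append(f"unexpected postal_code format: {postal!r}")
    return problems

# Quarantine or reject suspicious records instead of silently loading them.
problems = validate_crm_record({"customer_id": "42", "country": "US", "postal_code": "9021"})
if problems:
    raise ValueError(f"Possible upstream change: {problems}")
```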
Business Process Evolution
Data doesn't exist in isolation. It represents business activities that evolve over time.
A manufacturing company tracks defect_rate. The formula has been consistent for a decade.
But inspection processes have changed. New quality checkpoints catch issues earlier. Some defects are now fixed before reaching the measurement point.
The historical trend shows improvement. But you're not comparing apples to apples. The metric hasn't changed, but what it measures has.
This happens constantly. Sales processes change. Customer onboarding flows evolve. Product features get added and removed.
Each change subtly shifts what your data represents.
Temporal Traps
Some data is only valid for a specific time. When time passes, correct data becomes misleading.
Consider exchange rates. A transaction logged in foreign currency needs conversion. If you use today's rate for a transaction from six months ago, your analysis is wrong.
Addresses are similar. People move. Companies relocate. The address on file was correct when captured but isn't anymore.
Any data that represents a point-in-time reality becomes less reliable as time passes. The longer the gap, the greater the decay.
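For point-in-time data like exchange rates, the fix is to store values with their effective dates and always look up the value that was valid at the time of the event. A minimal sketch, with illustrative EUR/USD rates standing in for a real rates table:

```python
from bisect import bisect_right
from datetime import date

# Illustrative EUR->USD rates, each effective from the given date onward.
# In practice these come from a rates table, not a hard-coded list.
EUR_USD_RATES = [
    (date(2024, 1, 1), 1.10),
    (date(2024, 7, 1), 1.08),
]

def rate_as_of(rates: list[tuple[date, float]], on: date) -> float:
    """Return the rate in effect on a given date (rates must be sorted by date)."""
    effective_dates = [d for d, _ in rates]
    idx = bisect_right(effective_dates, on) - 1
    if idx < 0:
        raise ValueError(f"No rate available on or before {on}")
    return rates[idx][1]

# Convert with the rate that applied when the transaction happened,
# not whatever today's rate happens to be.
txn_date, amount_eur = date(2024, 3, 15), 250.0
amount_usd = amount_eur * rate_as_of(EUR_USD_RATES, txn_date)  # uses the 1.10 rate
```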
The Pipeline Decay Problem
Most data passes through transformation pipelines before reaching analysts. These pipelines rot from within.
A script written three years ago still runs daily. But the libraries it depends on haven't been updated. The APIs it calls have changed behavior. Edge cases that didn't exist in 2022 now cause silent failures.
The output looks fine—most of the time. The bugs only manifest in specific conditions. Months might pass before anyone notices.
Pipeline decay is especially dangerous because the problems are hidden. Unlike missing data or obvious errors, the corrupted records look normal.
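Two habits blunt pipeline decay: pin dependency versions so library upgrades are deliberate, and make each step assert its own output so failures are loud rather than silent. A small sketch of the second habit; the thresholds and required keys are placeholders to tune per pipeline:

```python
def check_step_output(rows: list[dict], min_rows: int, required_keys: set[str]) -> None:
    """Crash the job (and trigger its normal failure alerting) on suspicious output."""
    if len(rows) < min_rows:
        raise RuntimeError(f"Got {len(rows)} rows, expected at least {min_rows}")
    for i, row in enumerate(rows):
        missing = required_keys - row.keys()
        if missing:
            raise RuntimeError(f"Row {i} is missing keys: {sorted(missing)}")

# Run at the end of each daily step, before anything is written downstream.
check_step_output(
    rows=[{"order_id": 1, "total": 9.99}],
    min_rows=1,
    required_keys={"order_id", "total"},
)
```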
Documentation Decay
Documentation decays faster than code.
When a system is built, someone usually documents it. But documentation requires active maintenance. When the system changes and the docs don't, they become actively harmful.
Outdated documentation is worse than no documentation. It gives false confidence: analysts trust what they read and make decisions based on incorrect assumptions.
The irony is that the more thorough the original documentation, the more dangerous its decay. People trust comprehensive docs more than sparse ones.
Institutional Knowledge Loss
Data quality depends on the people who understand it. When they leave, their knowledge goes with them.
Every organization has tribal knowledge—undocumented quirks and context that explain why data behaves certain ways. This particular field needs special handling for customers created before 2019. That column has nulls that actually mean zero. The December 2021 data has known issues from a migration.
When the person who knows these things moves on, the knowledge evaporates. New team members make reasonable assumptions that happen to be wrong.
Organizations don't feel this loss immediately. The damage accumulates silently over months and years.
The Measurement Problem
You can't manage what you don't measure. Data quality is notoriously difficult to measure.
Most organizations have no systematic tracking of data quality. They discover problems reactively—when a dashboard breaks, when a report contradicts reality, when an important decision turns out to be based on bad data.
By then, determining the scope and origin of the problem is difficult. How long has this been wrong? Which decisions were affected? Where else might similar issues exist?
Without ongoing measurement, decay goes undetected until it causes visible harm.
Prevention Strategies
Data decay is inevitable, but it can be slowed.
Data contracts. Formalize expectations about data structure and quality between producers and consumers. When changes are needed, force explicit communication.
Automated quality checks. Build tests that run continuously. Validate value ranges, null rates, cardinality, and distributions. Alert when patterns deviate from expectations; a minimal sketch follows this list.
Version everything. Schemas, transformations, definitions—all should be versioned and tracked. When something breaks, you can identify what changed.
Sunset aggressively. Remove unused fields, tables, and pipelines. The less you have, the less can decay.
Document the why, not just the what. Explain the business context behind data decisions. Future maintainers need to understand intent, not just structure.
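To make the "automated quality checks" item concrete, here is a bare-bones profile-and-compare sketch in plain Python. The metrics, thresholds, and alerting hook are illustrative; teams typically adopt a dedicated testing framework once the checks multiply.

```python
import statistics

def profile_column(values: list[float | None]) -> dict:
    """Compute simple quality metrics for one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values) if values else 1.0,
        "distinct": len(set(present)),
        "mean": statistics.fmean(present) if present else None,
    }

def deviations(metrics: dict, expected: dict) -> list[str]:
    """Compare today's metrics against expected bounds; return any violations."""
    alerts = []
    if metrics["null_rate"] > expected["max_null_rate"]:
        alerts.append(f"null rate {metrics['null_rate']:.1%} > {expected['max_null_rate']:.1%}")
    lo, hi = expected["mean_range"]
    if metrics["mean"] is None or not (lo <= metrics["mean"] <= hi):
        alerts.append(f"mean {metrics['mean']} outside expected range ({lo}, {hi})")
    return alerts

# Daily run: anything returned here should page someone or post to a team channel.
metrics = profile_column([9.99, 12.50, None, 11.00])
alerts = deviations(metrics, {"max_null_rate": 0.05, "mean_range": (5.0, 50.0)})
```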
Recovery Is Harder Than Prevention
Once data quality has degraded significantly, recovery is painful.
You need to identify where problems started—often months in the past. You need to understand what correct data should look like without having correct data to compare against. You need to backfill or correct historical records while maintaining audit trails.
Sometimes, recovery isn't possible. The correct values were never captured. The source systems that would have the truth have themselves changed.
Prevention is orders of magnitude cheaper than recovery.
Accepting Imperfection
Perfect data doesn't exist. Every dataset has quality issues; the question is one of degree.
The goal isn't eliminating decay but managing it. Keep the most important data within acceptable quality bounds. Know where the problems are and communicate them to consumers.
Transparent imperfection beats hidden rot. When you tell stakeholders that a particular dataset has known issues, they can adjust their decisions accordingly.
Pretending data is cleaner than it is leads to overconfidence and eventually failure.
Frequently Asked Questions
How do I know if my data quality is degrading?
Implement automated quality monitoring that tracks key metrics over time. Look for trends in null rates, value distributions, cardinality, and validation failures. Compare fresh data against historical patterns.
What's the biggest cause of data quality issues?
Usually changes that weren't communicated—source system updates, business process evolution, or upstream pipeline modifications that downstream teams didn't know about.
How often should I audit data quality?
Automated checks should run continuously with every data update. More comprehensive manual audits should happen quarterly for critical datasets.
Who's responsible for data quality—producers or consumers?
Both, but they have different responsibilities. Producers should ensure data meets documented specifications. Consumers should validate data meets their needs and report issues.
Should I fix historical data or just fix it going forward?
It depends on the cost-benefit tradeoff. Some analyses require clean historical data. Others can work with forward-only fixes. Consider how the data is actually used.
How do I prevent definition drift?
Maintain a central data dictionary with authoritative definitions. Require changes to go through a formal approval process. Regularly audit how terms are actually used versus how they're defined.
What should I document about my data?
Business context and intent, not just technical specifications. Why does this field exist? What decisions does it support? What caveats should analysts know?
How do I handle data quality when the source is external?
Implement validation at ingestion. Create data contracts with external providers when possible. Build your pipelines to expect quality issues and handle them gracefully.
When is data too bad to use?
When the quality issues are large relative to the signal you're trying to detect, or when you can't characterize the issues well enough to adjust for them.
Is this problem getting better with modern data tools?
Modern tools help with monitoring and testing, but the fundamental issues persist. More data and more complexity often mean more opportunities for decay.
Conclusion
Good data doesn't stay good automatically. It requires active maintenance, monitoring, and care.
The organizations that treat data quality as a one-time problem inevitably discover their data has decayed. The ones that treat it as an ongoing process maintain trustworthy data over time.
Understanding how data goes bad is the first step toward keeping it good.
Hashtags
#DataQuality #DataManagement #DataEngineering #Analytics #DataAnalysis #DataGovernance #DataOps #DataDriven #DataAnalyst #DataScience
This article was refined with the help of AI tools to improve clarity and readability.