Data analysts spend most of their time cleaning data, not generating insights. According to a Forbes study, 60% of their time is dedicated to cleaning data, and more recent testimonials suggest that share has only grown since the study was published.
Why does this matter so much? Because insights are only as accurate as their underlying data. Poor data quality can reduce accuracy by leaving fewer usable data points, or skew results through outliers.
This is why poor data quality costs US organizations an average of $12.9 million each per year, according to a Gartner study.
Why does bad data quality cost so much?
The costs of bad data can be broken down as follows:
- Storage costs: ideally, organizations pay only to store data they can actually use. Paying to store unusable data is a waste of resources.
- Cleaning costs: the time analysts spend reviewing a dataset, filtering out undesired data, adapting some data points, and producing a usable dataset.
- Fixing costs: this includes work like documentation, back-and-forth between teams to increase the quality of the data in the future, and other processes to that effect.
- Opportunity costs: data value decreases over time. Leaders and managers need to make timely decisions based on data. If too much time is lost cleaning up a bad dataset, they will have to make uninformed decisions or abandon a project entirely.
At this point, some may have noticed that all of these costs are incurred after the data has been received.
Thus, the solution seems rather obvious: ensuring the quality of data before it is received. However, upstream validation also comes at a cost.
The costs of upstream validation
Upstream validation ensures higher-quality data without incurring the costs described in the previous section.
As any software engineer knows, shift-left testing – moving testing earlier in the lifecycle – is a proven way to reduce the cost of detecting defects, and the same logic applies to data.
Implementing upstream validation comes with two costs:
- Infrastructure costs: standing up validation servers, scaling them with traffic, and similar engineering work.
- Resource diversion: mainly from dedicating a team of engineers to data quality instead of product improvements.
While not perfect, this is already a significant improvement over simply fixing bad data.
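The idea can be made concrete with a minimal sketch: validate each record at the point of ingestion and reject bad ones with explicit reasons, so that invalid data never reaches analysts. The field names and rules below are illustrative assumptions, not part of any real pipeline.

```python
# Upstream validation sketch: check records as they arrive instead of
# cleaning them downstream. Fields (user_id, amount, currency) and the
# rules enforced here are hypothetical examples.

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    user_id = record.get("user_id")
    if not isinstance(user_id, int) or user_id <= 0:
        errors.append("user_id must be a positive integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append("currency must be one of USD, EUR, GBP")
    return errors

def ingest(records: list[dict]):
    """Split incoming records into accepted rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records can be sent back to the producing team with their error lists, which turns the vague "fix your data" conversation into a precise one.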
Building a new solution
Bad data quality is the gap between the intention behind a data model and its implementation.
When analysts design a data model, their intention is clear, but there is no guarantee the implementation will follow. This is how bad data emerges.
That’s why Filasys was built to be a self-serve data analytics platform that treats data models as enforceable contracts, not just documentation.
This closes the gap between modeling and implementation: analysts specify validation rules while modeling the data, and those rules are enforced automatically when the data is received.
You can take a look at our tutorial to see how to create enforceable data models with Filasys.