Cover Photo by Claudio Schwarz on Unsplash
Before diving in, let us discuss what data quality entails. IBM puts it well: "Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness and fitness for purpose, and it is critical to all data governance initiatives within an organisation." Data quality is not only about the data being pristine, but also about it being fit for its intended use, meaning that data quality is context-specific. The domain in which the data is collected and used is as important as the other checks. In fact, in many situations, it provides the foundation for defining checks for accuracy, validity, consistency, timeliness and uniqueness. Furthermore, data quality can build or destroy trust within a team.
This is a reflective piece that encapsulates my experience setting up a roadmap for maintaining high-quality data. When asked previously about data quality enforcement and implementation, I always responded with "we can use this tool or that tech to achieve it". In practice, I was faced with a rude awakening about how limited that response was.
To provide better context, consider a scenario where a group of skilled data analysts and scientists were asked to calculate the same metric for a product over a given time window, whether a month, a week, or a day. Everyone came back with a different number for that metric. These inconsistencies erode trust, and the value of any data process is predicated on trustworthiness. It matters because if people get different numbers for the same metric, the questions become: "Is the data being collected good? Are we introducing errors during processing?"
This was the point at which I realised that data quality is not just about the tools. It is about key elements that must be present for the effective delivery of data products. I categorise them into three main elements: process, people, and technology, all working together in harmony. I will expand on each in the following sections of this article.
People
As long as the data is going to be read and interpreted by more than one person, the people element must be considered. While this may seem like a downstream activity, it is essential to address it as early as possible in any workflow or request, whether automated, routine, or ad hoc.
To make this practical, the quality of the data starts with understanding the request being made. There must be a free flow of knowledge among all stakeholders in the delivery of any analysis. This means that when a senior executive asks, "What is the day 3 retention for product A?", instead of rushing to write fancy SQL and Python scripts, it is worth responding with clarifying questions such as:
- Do you mean classic retention or rolling retention?
- For a global product, do you want regional retention that may show regional patterns?
- Should the retention be in UTC 24-hour cycles, and so on?
What this does is either give you clarity on what exactly needs to be calculated or give leeway for assumptions to be made. Overall, the quality of the resulting data and its interpretation depends on people communicating effectively. More so, in a team where analysts and engineers work collaboratively, clear definitions must be documented and made available to produce good downstream data.
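To make the first clarifying question concrete, here is a minimal sketch of the difference between classic and rolling day-N retention. The sample activity data and function names are purely illustrative, not any product's actual definitions:

```python
from datetime import date, timedelta

# Hypothetical activity log: user -> (install date, set of days active).
activity = {
    "u1": (date(2024, 1, 1), {date(2024, 1, 1), date(2024, 1, 4)}),   # returned on day 3
    "u2": (date(2024, 1, 1), {date(2024, 1, 1), date(2024, 1, 10)}),  # returned on day 9 only
    "u3": (date(2024, 1, 1), {date(2024, 1, 1)}),                     # never returned
}

def classic_day_n_retention(activity, n):
    """Share of the cohort active exactly on day n after install."""
    retained = sum(
        1 for install, days in activity.values()
        if install + timedelta(days=n) in days
    )
    return retained / len(activity)

def rolling_day_n_retention(activity, n):
    """Share of the cohort active on day n OR any day after it."""
    retained = sum(
        1 for install, days in activity.values()
        if any(d >= install + timedelta(days=n) for d in days)
    )
    return retained / len(activity)

print(classic_day_n_retention(activity, 3))  # only u1 counts: 1 of 3 users
print(rolling_day_n_retention(activity, 3))  # u1 and u2 count: 2 of 3 users
```

The same cohort yields two different "day 3 retention" numbers depending on which definition is used, which is exactly how a room full of capable analysts ends up with conflicting results.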
Technology
The tooling list is ever-increasing these days. With tools like Great Expectations, Amazon Deequ, Soda, and platform-embedded options like dbt tests and AWS Glue Data Quality, data quality checks are a solved problem technologically. The only questions worth asking concern cost, the competencies of the team, and the best fit with the existing tech stack; essentially the same questions that arise during any tool assessment. These tools do a good job of letting you define what valid data should look like, and they provide consistent ways to create, store and report the results of those checks.
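To illustrate what these tools provide, here is a hand-rolled sketch of the same idea: named checks, a consistent result record per check, and a single pass/fail summary. All names and the sample rows are illustrative, not the API of any of the tools mentioned above:

```python
# Each check returns a uniform result record, mimicking how data
# quality tools report outcomes in a consistent, storable shape.

def check_not_null(rows, column):
    failures = [r for r in rows if r.get(column) is None]
    return {"check": f"{column} is not null",
            "passed": not failures,
            "failed_rows": len(failures)}

def check_unique(rows, column):
    values = [r[column] for r in rows if r.get(column) is not None]
    return {"check": f"{column} is unique",
            "passed": len(values) == len(set(values)),
            "failed_rows": len(values) - len(set(values))}

def run_checks(rows, checks):
    """Run every check and report whether the dataset passed overall."""
    results = [check(rows) for check in checks]
    return results, all(r["passed"] for r in results)

rows = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": None},   # null country
    {"user_id": 2, "country": "GH"},   # duplicate user_id
]

results, ok = run_checks(rows, [
    lambda r: check_not_null(r, "country"),
    lambda r: check_unique(r, "user_id"),
])
print(ok)  # False: both checks fail on this sample
```

In practice you would not write these by hand; the off-the-shelf tools add scheduling, storage of historical results, and alerting on top of this basic loop.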
Additionally, it is now commonplace to treat data processing and analysis efforts with the same rigour as software development. Writing maintainable, readable, and modifiable modular code becomes a requirement for collaboration and longevity, rather than a luxury, and using a version control system like Git becomes non-negotiable in achieving this.
Process
We have seen how people working together in harmony play a vital role in aligning on expectations. I consider 'process' to be the wrapper around people and technology. Good processes foster seamless interaction among people and with tools to achieve defined goals. For instance, a group of data engineers and data analysts might define a process using the Write-Audit-Publish (WAP) pattern, with data quality and validation tests at the audit layer, so that no data product is published without passing all defined tests. Alternatively, large datasets might run preliminary checks first to leverage fail-fast mechanisms.
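The WAP pattern above can be sketched in a few lines. This is a minimal illustration of the control flow only; the lists standing in for staging and published storage, and all function names, are assumptions for the sake of the example:

```python
# Write-Audit-Publish: data lands in staging, audits run against it,
# and only data that passes every audit is promoted to published.

def write(staging, rows):
    """Write step: land raw rows in the staging area."""
    staging.extend(rows)

def audit(rows, checks):
    """Audit step: every check must pass for the batch to proceed."""
    return all(check(rows) for check in checks)

def publish(staging, published, checks):
    """Publish step: promote staged rows only if the audit passes."""
    if not audit(staging, checks):
        raise ValueError("Audit failed; data stays in staging")
    published.extend(staging)
    staging.clear()

staging, published = [], []
write(staging, [{"user_id": 1}, {"user_id": 2}])

no_null_ids = lambda rows: all(r["user_id"] is not None for r in rows)
publish(staging, published, [no_null_ids])
print(len(published))  # both rows promoted; staging is now empty
```

In a real pipeline the staging area would be a branch, snapshot, or temporary table rather than an in-memory list, but the guarantee is the same: nothing reaches consumers without passing the audit layer.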
Building effective processes is not always straightforward. Too many steps in a process make achieving a goal too tedious. Alternatively, processes that are too simple and lenient may lack the robustness to define the safe boundaries and guidelines required for consistent and sustainable results. A good process strikes that balance, and it may take several iterations to build that process.
Conclusion
It is tempting to argue that data quality can be achieved with tooling alone, but without the right processes, tools become useless, and without the right people and a commitment to uphold the structure, processes are easily circumvented. Unless technology, process and people all work in harmony, the result is a fragile data quality framework for any organisation or project. Furthermore, the ideas discussed here may be seen not as data quality but as an aspect of data governance, and they align strongly with the principles behind data contracts. Whatever it is called in an organisation, be it data governance, data contracts, or a data quality framework, its effective application will rely on these elements and more.