How to build and maintain high quality data without raising billions
Airbnb has always been a data-driven company.
Back in 2015, they were laying the foundation to democratize data science at Airbnb. Since then, they have grown to more than 6,000 employees and raised more than $6B in venture funding.
Staying data-driven through this massive change has required big investments in data quality, as outlined in their recent Data Quality at Airbnb series: Part 1 — Rebuilding at Scale and Part 2 — A New Gold Standard.
Companies aspiring to be as data-driven and successful as Airbnb will also need to prioritize data quality.
It does not matter how much data is collected, how fast it can be queried, how insightful analyses are or how intelligent a machine learning model is. If the underlying data is unreliable and of poor quality, then everything that depends upon it will also be suspect.
Fortunately, companies no longer need to reinvent the wheel or make massive investments to improve and maintain high quality data. New startups, such as ours at Anomalo, are building the technology needed to monitor, triage and root cause data quality issues efficiently at scale.
In their first post, Part 1 — Rebuilding at Scale, Airbnb set a number of goals for themselves.
All of these investments are critical: projects like dbt are making it easier to build high-quality data pipelines, and many open source data discovery tools are under development.
But in this article I want to focus on two of their goals in particular: ensuring important data meets SLAs and is trustworthy and routinely validated.
In their latest post, Part 2 — A New Gold Standard, Airbnb outlines the automated data quality checks they run to validate important datasets. To summarize, Airbnb has four key requirements:
1. Row count time series analysis: Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?
2. Time series anomaly detection for key business metrics: Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?
3. Standard run-time data quality checks: Are basic data constraints satisfied (no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.)?
4. Perfect metric consistency checks: Do columns and metrics satisfy critical relationships that can be expressed as simple equations?
For the rest of this post, we will use public datasets in BigQuery to illustrate how each of these requirements is supported in Anomalo.
Row count time series analysis
Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?
The first question to answer for any production dataset is “has it been updated?” In particular, are there any records from the most recent date and is the row count within an expected range?
If not, then something probably broke upstream in the collection, loading or processing of the table. This must be fixed before anyone can trust the latest data.
In Anomalo, every table is automatically configured to monitor if the most recent period of data is complete.
For example, for the NY 311 service request data, here is the time series of row counts by day, and the predicted interval used to determine if the row counts are complete:
For November 23rd, we expected at least 5,223 rows, and in fact there were 7,056.
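Under the hood, a completeness check like this boils down to a daily row count query. Here is a minimal BigQuery sketch of the underlying series (the table and its created_date column come from Google's public datasets; the 30-day window is an arbitrary choice):

```sql
-- Daily row counts for the NY 311 service request data over the
-- last 30 days. Anomalo compares each day's count against a
-- predicted range; here we only compute the raw series.
SELECT
  DATE(created_date) AS day,
  COUNT(*) AS row_count
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE created_date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day;
```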
You can also pick any column and check that the row counts haven’t plummeted for any important segments in that column. For example, we can evaluate the row counts by borough in the NY 311 data:
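Conceptually, this is the same daily row count query with the segment column added to the grouping; a sketch, again assuming the public NY 311 schema:

```sql
-- Daily row counts broken out by borough. A sharp drop in a
-- single segment can be hidden in the overall total.
SELECT
  DATE(created_date) AS day,
  borough,
  COUNT(*) AS row_count
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE created_date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day, borough
ORDER BY day, borough;
```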
Anomalo customers use this feature to ensure their data is complete for a wide variety of segments, ranging from geography (country, state) to event or API call type to device platform (iOS vs. Android).
Finally, we can also tell if the data was significantly delayed. This can indicate that an upstream processing stage is taking longer than normal, and may eventually cause data to be incomplete when users query for it.
Here is how long it took for each of the last 8 days of data to arrive for the New York 311 dataset:
On average, it takes around 25 hours for the New York 311 data to arrive in BigQuery, and you can easily set a threshold for when you would like to be notified about delayed data:
For more on incomplete data, why it happens so often, and how tools like Anomalo can be used to monitor for it, check out our post on When Data Disappears.
Time series anomaly detection for key business metrics
Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?
Once we know that data is complete in a timely manner, the next step is to ensure that any metrics we compute from the data are within expected ranges.
In Anomalo, you can easily configure new metrics. For example, we can monitor the mean score for posts in the Hacker News dataset in BigQuery:
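The series being monitored is just a daily aggregate; a minimal sketch, assuming the public hacker_news.full table with its score, type and timestamp columns:

```sql
-- Mean score of Hacker News stories by day. Anomalo fits a
-- predicted interval around a series like this one.
SELECT
  DATE(timestamp) AS day,
  AVG(score) AS mean_score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
  AND score IS NOT NULL
GROUP BY day
ORDER BY day;
```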
If the score ever moves sharply outside of the predicted interval, a notification is sent in Slack. This saves a lot of time spent watching metrics for suspicious changes.
Under the hood, Anomalo builds a sophisticated time series model that decomposes the metric into overall trend (blue), holiday spike (orange), season-of-year (red) and day-of-week (teal) components:
It looks like Hacker News scores have been trending up over time, are sensitive to holidays (Thanksgiving and Christmas look good, others bad), and are much higher on weekends. Hopefully someone posts this article on Hacker News on a Sunday near Christmas 🎄.
In Anomalo, metrics can be defined using a variety of pre-defined aggregates:
Or you can define a custom metric using any SQL aggregate statement available in your warehouse. For example, in the SF Fire Department service calls dataset in BigQuery, we can measure how many minutes on average it takes for the fire department to reach the scene of a call, and be alerted whenever it takes longer than 15 minutes:
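As a sketch of the kind of SQL behind such a custom metric (the san_francisco.sffd_service_calls table and its received_timestamp and on_scene_timestamp columns are our assumptions about the public dataset's schema):

```sql
-- Average minutes from a call being received to the fire crew
-- arriving on scene, by day. This series can then be monitored
-- against a fixed threshold such as 15 minutes.
SELECT
  DATE(received_timestamp) AS day,
  AVG(TIMESTAMP_DIFF(on_scene_timestamp, received_timestamp, MINUTE))
    AS avg_minutes_to_scene
FROM `bigquery-public-data.san_francisco.sffd_service_calls`
WHERE on_scene_timestamp IS NOT NULL
GROUP BY day
ORDER BY day;
```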
It looks like the average time for the SF Fire Department to respond to calls increased dramatically on October 19th:
To learn more about testing your data with time series and unsupervised learning models, check out our post on dynamic data testing.
Standard run-time data quality checks
Are basic data constraints satisfied: no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.?
Ensuring that metrics are within expected ranges is important, but ultimately only tests the surface of your data. Even if all of your metrics are within expected tolerances, there could be cracks appearing in your data foundation.
That is where rule-based data quality checks come in. These checks typically test that a condition is always (or never) satisfied.
For example, in Anomalo we can set up a number of foundational data quality checks on the Hacker News dataset in BigQuery, such as testing that id is always unique:
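In SQL terms, a uniqueness rule is equivalent to asking whether any value appears more than once; a minimal sketch against the public hacker_news.full table:

```sql
-- The uniqueness check fails if this query returns any rows.
SELECT
  id,
  COUNT(*) AS occurrences
FROM `bigquery-public-data.hacker_news.full`
GROUP BY id
HAVING COUNT(*) > 1;
```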
We can then see at a glance which rules have passed, and which have failed:
For failing checks, such as “timestamp is never NULL”, we can see a summary of the issue:
A sample of bad and good records we can scan through:
And a statistical analysis identifying the segments of data where the issue is most prominent:
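Under the hood, a never-NULL rule like the failing timestamp check reduces to a simple aggregate; a minimal sketch, again against the public hacker_news.full table:

```sql
-- The rule fails whenever null_timestamps is greater than zero.
SELECT
  COUNTIF(timestamp IS NULL) AS null_timestamps,
  COUNT(*) AS total_rows,
  SAFE_DIVIDE(COUNTIF(timestamp IS NULL), COUNT(*)) AS null_rate
FROM `bigquery-public-data.hacker_news.full`;
```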
Depending upon the nature of the check, the summary visualization changes to provide the most useful context. For a title uniqueness check, for example, it appears there are many stories with duplicate titles:
Knowing not just that the data is broken, but exactly how, where and when the issue occurs is critical to quickly triaging, root causing and fixing it.
Perfect metric consistency checks
Do columns and metrics satisfy critical relationships that can be expressed as simple equations?
Real-world data is complicated. While key metric and fixed rule-based checks may capture 95% of common data quality use cases, the last 5% often require complex logic specific to an organization and dataset.
For example, in the New York Police Department dataset of motor vehicle collisions on BigQuery, there are multiple fields tracking how many people were injured or killed in an accident.
We can validate that the fatality counts are consistent with the following:
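Namely, the total number of persons killed should equal the sum of the pedestrian, cyclist and motorist counts. A minimal sketch of this rule as a SQL validation (table and column names follow our understanding of the public NYPD collisions dataset):

```sql
-- Records where the total killed does not match the sum of the
-- pedestrian, cyclist and motorist counts are inconsistent.
SELECT COUNT(*) AS inconsistent_rows
FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
WHERE number_of_persons_killed !=
      number_of_pedestrians_killed
      + number_of_cyclist_killed
      + number_of_motorist_killed;
```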
When run, we find that 0.003% of records (45 of 1,734,004 as of 2020-11-27) have inconsistent fatality counts.
When such a validation rule fails, we can also see sample bad and good records:
In this case, it appears that there are records where motorists were killed and yet are not counted in the total. Again, we also show the segments of the data where the issue occurs most often (limited here to records with at least one non-zero fatality column). The issue is most likely to occur when the contributing factor column for vehicle number 3 is unspecified. This intelligence can be a meaningful hint for identifying where and how the data quality issue arose.
Today, it is easier than ever to become a data-driven organization.
The volume and diversity of data keeps growing. Warehouses like Snowflake and BigQuery make storing and querying it all easy. From BI and visualization through to machine learning, there are a plethora of tools that we can leverage to generate insights and create value.
But most companies have only just begun the journey to ensure the data they depend upon can be trusted. As demonstrated by Airbnb, investing in data quality is critical to staying data-driven as data organizations mature.
At Anomalo, we have built a platform that will allow any company to achieve and sustain the same vision of high quality data. If you are interested in starting a trial, head to our site to learn more or request a demo.