Data Drift: Understanding and Detecting Changes in Data Distribution

What is Data Drift?

Data drift refers to a change in distribution between the data used to train a model and the data being sent to the deployed model. One of the important approaches in machine learning modeling is probabilistic modeling.

From a probabilistic machine learning perspective, we can assume that the features in a dataset are drawn from a hypothetical distribution.

However, in real-world modeling, it becomes evident that data does not remain constant over time. It is influenced by various factors such as seasonality, missing values, technical issues, and fluctuations over time. This means that a dataset collected for machine learning modeling may not look the same at all times.

Regular monitoring of model performance allows us to catch instances of data drift. It is crucial to monitor the change in data distribution between the training data and the live data from time to time.

In most cases, the occurrence of data drift shows that our trained model is becoming outdated and should be retrained or updated with the newest data. Here, "live data" refers to the data that is being sent to the deployed model.
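
To make this concrete, here is a minimal sketch of such a check on a single numerical feature, using a two-sample Kolmogorov-Smirnov test from SciPy (one of the techniques listed below). The feature values are simulated and the 0.05 threshold is an arbitrary choice for illustration, not a recommendation from the article.

```python
import numpy as np
from scipy import stats

# Simulated example: one numerical feature from the training set
# and the "same" feature arriving at the deployed model (live data).
rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training distribution
live_feature = rng.normal(loc=0.3, scale=1.0, size=1_000)   # slightly shifted live data

# Two-sample KS test: a small p-value suggests the two samples were
# not drawn from the same distribution, i.e. possible data drift.
result = stats.ks_2samp(train_feature, live_feature)

if result.pvalue < 0.05:  # threshold chosen only for illustration
    print(f"Possible drift: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
else:
    print("No significant drift detected for this feature")
```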

Top 5 Data Drift Techniques

Because I needed to evaluate a deployed model, I had to monitor its results on unseen data. But it was a real quest to understand how to measure the model performance. It was also not clear how I could measure the behavior of the data!

EvidentlyAI is one of the websites whose articles I check regularly. In this article, it introduces the concept of data drift and the top 5 techniques to detect it on the features of a large dataset. It also provides a simple example.

These techniques are:

  • Kolmogorov-Smirnov (KS) test, which is more suitable for numerical features. It is a non-parametric test. When we use it, we want to accept or reject the hypothesis that two datasets are drawn from the same distribution.
  • Population Stability Index (PSI) is used to measure the shift between two different datasets. It is suitable for both numerical and categorical features. The larger this metric, the more different the distributions of the two datasets.
  • Kullback-Leibler (KL) divergence is a metric that measures the difference between two distributions. It can be applied to numerical and categorical features. Its range is from 0 to infinity; the smaller the KL value, the more similar the two distributions.
  • Jensen-Shannon divergence is defined based on the KL divergence. The difference is that it lies between 0 and 1.
  • Wasserstein distance is a measure for monitoring drift in numerical data; intuitively, it is the minimum "cost" of moving one distribution onto the other. The article also provides a practical example which I could apply to my own data to understand these techniques well. (A rough sketch of these metrics in code is shown after this list.)
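
To see roughly how these metrics behave, here is a sketch of my own (not from the Evidently article) that computes PSI, KL divergence, Jensen-Shannon divergence, and Wasserstein distance for one numerical feature with NumPy and SciPy. The bin count, epsilon values, and simulated data are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

def psi(expected, actual, bins=10):
    """Population Stability Index, using bins fitted on the expected (training) sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # training feature
live = rng.normal(0.4, 1.2, 2_000)    # live feature with some drift

# Shared bins so the two histograms describe comparable discrete distributions.
edges = np.histogram_bin_edges(train, bins=20)
p = np.histogram(train, bins=edges)[0].astype(float)
q = np.histogram(live, bins=edges)[0].astype(float)
p = np.clip(p, 1e-12, None)
q = np.clip(q, 1e-12, None)
p, q = p / p.sum(), q / q.sum()

print("PSI:                 ", psi(train, live))
print("KL divergence:       ", stats.entropy(q, p))               # KL(live || train), 0..inf
print("Jensen-Shannon div.: ", jensenshannon(p, q, base=2) ** 2)  # divergence, 0..1
print("Wasserstein distance:", stats.wasserstein_distance(train, live))
```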

More Resources:

As I work with the Azure Machine Learning platform, I am very interested in unlocking its features.
First of all, I found a mini course about data drift which you can easily get through to understand the main concepts in this field.

Then, I really suggest having a look at this article, which clearly describes data and model drift. It also shows how to apply drift detection using the Azure Machine Learning data drift capabilities.
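
As a rough idea of what that looks like, the sketch below uses the preview azureml-datadrift package's DataDriftDetector to compare a baseline (training) dataset against a target dataset. I am writing this from memory, so treat the parameter names as assumptions and check the linked article and the current SDK docs; the workspace config, dataset names, compute name, and feature names are placeholders.

```python
from datetime import datetime, timedelta

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector  # preview package: azureml-datadrift

ws = Workspace.from_config()

# Placeholder dataset names; the target dataset should be a tabular
# dataset with a timestamp column so results can be computed per day.
baseline = Dataset.get_by_name(ws, "training-data")
target = Dataset.get_by_name(ws, "scoring-data")

monitor = DataDriftDetector.create_from_datasets(
    ws,
    name="my-drift-monitor",                  # placeholder monitor name
    baseline_dataset=baseline,
    target_dataset=target,
    compute_target="cpu-cluster",             # placeholder compute name
    frequency="Day",
    feature_list=["feature_1", "feature_2"],  # placeholder feature names
    drift_threshold=0.3,
)

# Analyse the last 30 days of target data, then turn on the schedule.
monitor.backfill(datetime.today() - timedelta(days=30), datetime.today())
monitor.enable_schedule()
```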

Finally, I found a Git repository that tries to monitor data drift using Azure ML and integrate it with a Power BI dashboard.

I am interested in learning more about this topic. If you know other useful resources, please leave a note about them :)
