The intelligence of an ML system lies in its ability to adapt to new situations. One of the biggest mistakes data scientists make is assuming their models will keep working after deployment. But what about the data, which keeps changing?
The dynamic nature of data forces us to question our machine learning models regularly, sometimes daily in the case of the environmental data we serve through our weather API at Ambee. Here’s why: the climate changes for a multitude of reasons, be it human activity, natural calamities, or geopolitical developments.
In the ML world, this is precisely what the term concept drift describes: the concepts underlying the data change over time. In this post, we will look in depth at ensemble training techniques for tackling concept drift.
Before we dig in, let’s understand the difference between online learning, incremental training, and ensemble learning.
Online learning updates the model on every incoming data point as it arrives in production, without storing any of it. Incremental learning, in contrast, learns from batches of data collected at different time intervals, and can balance the model's historical knowledge against novel data patterns: the model is updated on new data while its existing knowledge stays intact.
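As a rough illustration, here is a minimal sketch of both update styles using scikit-learn's `partial_fit` on synthetic data (the data generation below is purely illustrative, not Ambee's pipeline):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
model = SGDRegressor(random_state=0)

# Online learning: update on every incoming point as it arrives; nothing is stored.
for _ in range(1000):
    x = rng.normal(size=(1, 3))   # one incoming observation
    y = x @ true_w                # its target value
    model.partial_fit(x, y)

# Incremental learning: update on a batch collected over some time interval,
# keeping the previously learned weights intact.
X_batch = rng.normal(size=(200, 3))
y_batch = X_batch @ true_w
model.partial_fit(X_batch, y_batch)
```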
In ensemble learning, multiple base learners are trained and their predictions are combined. The fundamental principle of dynamic ensemble learning is to divide a large data stream into small data chunks and train a different model on each chunk independently. The biggest difference from incremental learning is that ensemble learning can simply discard outdated training data (and the models built on it), whereas incremental learning cannot.
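A minimal sketch of that principle, assuming scikit-learn style base learners and an illustrative chunk size, could look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_chunk_ensemble(X, y, chunk_size=500):
    """Fit one base learner per data chunk; old learners can later be dropped."""
    models = []
    for start in range(0, len(X), chunk_size):
        models.append(
            DecisionTreeRegressor().fit(X[start:start + chunk_size],
                                        y[start:start + chunk_size])
        )
    return models

def ensemble_predict(models, X, weights=None):
    """Combine the base learners' predictions (optionally weighted)."""
    preds = np.stack([m.predict(X) for m in models])
    return np.average(preds, axis=0, weights=weights)
```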
Why is ensemble training the way forward under concept drift?
Your goal is to improve the model and the data quality so that they represent the current situation. There are some questions we should ask ourselves first.
How do we know which data points to add to our existing training data?
How do we know if the retrained model is better?
The simplest way to decide which real-time data to add to our existing training data is to look at its skewness: pick production data falling in the ranges where our training set lacks samples. After collecting the real-time data from production, we filter it down to the specific range of values we want to add to our training data. The next step is to evaluate the model's predictions on real-time production data and also add the points where the residuals (errors) were high.
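A hedged sketch of that selection step, with a hypothetical feature range and error threshold, might look like this:

```python
import numpy as np

def select_points_to_add(model, X_prod, y_prod, feature_idx, low, high,
                         error_threshold=5.0):
    """Pick production points in an under-represented feature range or with high residuals.
    The range (low, high) and error_threshold are illustrative assumptions."""
    residuals = np.abs(model.predict(X_prod) - y_prod)
    in_sparse_range = (X_prod[:, feature_idx] >= low) & (X_prod[:, feature_idx] <= high)
    high_error = residuals > error_threshold
    mask = in_sparse_range | high_error
    return X_prod[mask], y_prod[mask]
```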
In effect, we are doing representative subsample selection to narrow down which data points to add. We now know which kind of data we need to add, but not when to add it. There are a few common triggers:
- Drift-based retraining: when the training data distribution differs from the incoming production data. This can indicate that your model is outdated or that you're operating in a dynamic environment (see the sketch after this list).
- Interval-based retraining: weekly, monthly, or yearly. Retraining on a fixed schedule only makes sense if the interval aligns with your business use case; picking an arbitrary period adds complexity and might even leave you with a worse model than the one you started with.
- On-demand retraining: when you simply know that retraining is required.
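For the first trigger, one simple way to compare the training and production distributions (a sketch, assuming numeric features and a two-sample Kolmogorov-Smirnov test with an illustrative 0.05 threshold) is:

```python
from scipy.stats import ks_2samp

def drift_detected(X_train, X_prod, p_threshold=0.05):
    """Flag drift if any feature's production distribution differs from training."""
    for j in range(X_train.shape[1]):
        result = ks_2samp(X_train[:, j], X_prod[:, j])
        if result.pvalue < p_threshold:
            return True   # this feature's distribution appears to have shifted
    return False
```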
The interesting part of this story is how the ensemble method handles data drift: the old model doesn't necessarily have to be retrained at all. Instead, we can keep multiple models trained on different data samples.
For the time-series data we work on at Ambee, environmental samples like weather, pollen, and air quality depend heavily on time. One idea is to train separate models on week-old, month-old, and year-old data. These three models become part of the ensemble pipeline, and their predictions are weighted to produce the final output.
Which model should get the higher weight: the week-old model, the month-old model, or the year-old one?
This is open to trial and error, but since environmental data is highly dynamic, it makes sense to give the year-old model the lowest weight.
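As a sketch, the weighted combination of the three time-window models could look like this; the weights below are illustrative starting points, not tuned values:

```python
import numpy as np

def weighted_forecast(week_model, month_model, year_model, X,
                      weights=(0.5, 0.35, 0.15)):
    preds = np.stack([
        week_model.predict(X),    # most recent patterns, highest weight
        month_model.predict(X),
        year_model.predict(X),    # oldest patterns, lowest weight
    ])
    return np.average(preds, axis=0, weights=weights)
```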
This chunk-based ensemble approach has a few advantages:
- Each data chunk is relatively small, so the cost of training a base model on it is low.
- We store a trained model instead of all the instances in a data chunk, which costs far less memory.
- It can adapt to various kinds of concept drift via different weighting policies. Dynamic ensemble learning models can therefore cope with ever-growing amounts of data and concept drift in a time-series setting.
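One possible weighting policy (just one illustrative choice among many) is to weight each base model inversely to its error on the most recent data chunk:

```python
import numpy as np

def recency_weights(models, X_recent, y_recent, eps=1e-6):
    """Weight each model by the inverse of its squared error on recent data."""
    errors = np.array([np.mean((m.predict(X_recent) - y_recent) ** 2) for m in models])
    inv = 1.0 / (errors + eps)
    return inv / inv.sum()   # normalised so the weights sum to 1
```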
The main open questions in ensemble learning are:
- Which base models should we choose?
- How do we set the size of a data chunk?
- How do we assign weights to the different models?
- And finally, how do we discard previous data?
Yippee! You have now taken care of adding new data to your system. The next big question:
Should you deploy as soon as you retrain?
Even though you have retrained your model and the performance metrics look great, there’s still a real risk that the updated model performs worse in production than the previous one.
It’s good practice to keep serving the old model for a specific window, or until it has served a certain number of requests, while the new model receives the same data and produces its own predictions. This way, you can compare both models’ predictions and see which one performs better.
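A minimal sketch of this shadow setup, assuming scikit-learn style models and an in-memory log purely for illustration:

```python
shadow_log = []

def serve(request_features, old_model, new_model):
    """Serve the old model's prediction while logging the new model's for comparison."""
    live_prediction = old_model.predict([request_features])[0]
    shadow_prediction = new_model.predict([request_features])[0]
    shadow_log.append({
        "features": request_features,
        "live": live_prediction,
        "shadow": shadow_prediction,
    })
    return live_prediction  # only the old model's output reaches users
```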
Concept and data drift, if not handled properly, can degrade a system until it is no better than random guessing. Weighting policies, instance selection, and model diversity are the main techniques employed in the ML literature. With the rapid growth of the information industry, we face an information explosion: more and more models will be trained, and real-time processing will become a challenge. The next problem to tackle, therefore, is how to effectively manage a large number of models. Stay tuned for part 2!